Table 1.
Summary of vendor information showing the number of clinics, physicians, and patients per vendor.
Table 2.
Summary statistics for each sociodemographic characteristic from the patient sample used in our study.
The data are presented separately for male and female patients.
Table 3.
Evaluation metrics for the logistic regression model (LR), random forest classifier (RF), and rule-based algorithm (RB) for each sociodemographic characteristic.
The best-performing algorithm is indicated in bold for each metric.
Fig 1.
Proportion of patients with data available for each sociodemographic characteristic, averaged across all clinics, for (a) the reference standard and (b) the full database.
Each bar represents the point estimate for a characteristic, with error bars denoting the 95% confidence interval. An asterisk on a bar denotes that data availability for that characteristic differs significantly from that of the other characteristics (P < 0.05).
Table 4.
Descriptive statistics and missingness rate for each sociodemographic characteristic within the clinical text in the EMR.
For education status, separate categories distinguish patients who are currently studying from those who have completed their degree.
Fig 2.
Average documentation rate (%) for each sociodemographic characteristic by vendor for (a) the reference standard and (b) the full UTOPIAN database.
Error bars are based on the standard error of each characteristic’s documentation rate. An asterisk on a characteristic denotes that its documentation rates differ significantly across vendors (P < 0.05).
Fig 3.
Summary of completeness of sociodemographic characteristics by clinic for (a) the reference standard and (b) the full database.
Table 5.
Median and first and third quartiles of completeness rates for sociodemographic characteristics across clinics for the reference standard and the full database.
Fig 4.
The effect of various physician and clinic variables on completeness rates for sociodemographic characteristics.
The comparison of foreign versus Canadian medical graduates is abbreviated as FMG. Variables with a statistically significant effect on completeness rates are shown as red diamonds. Error bars represent the 95% confidence interval. The dotted line at 1 indicates no association between a variable and completeness rates. The p-value indicates the statistical significance of each result. The reference category for each variable is given in brackets.