Fig 1.
A simplified visualization of a patient’s healthcare timeline.
The time between the specialist visit and the procedure is known because administrative records capture both accurately. We show our methodology in two steps: 1) using machine learning to predict the target specialty of referral notes, and 2) measuring the wait time between the referral and the consultation visit for the paired specialty.
Table 1.
The number of notes per specialist type.
Notes with fewer than 6 tokens were not counted, as manual checking showed they were not informative. The specialty “internal” was also removed because the types of medical consultation performed by an internal-medicine specialist overlap with many more specific specialties, such as cardiology and gastroenterology. Where present, the names within parentheses show how the specialty is labeled in later figures.
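The short-note filter described above can be sketched as follows; this is a minimal illustration that assumes token counts are obtained by whitespace splitting (the paper does not specify its tokenizer), and the example notes are hypothetical:

```python
def filter_short_notes(notes, min_tokens=6):
    """Drop referral notes with fewer than `min_tokens` whitespace tokens."""
    return [n for n in notes if len(n.split()) >= min_tokens]

notes = [
    "see cardiology",  # 2 tokens: dropped as uninformative
    "patient reports chest pain radiating to left arm on exertion",  # 10 tokens: kept
]
print(filter_short_notes(notes))  # only the second note survives
```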
Table 2.
All possible options for each step of the machine learning pipeline.
All possible combinations of these options were trained for each target specialty (for a total of 80 models per specialty).
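Training all combinations of pipeline options amounts to a grid whose option counts multiply to 80. The sketch below shows the enumeration pattern with hypothetical option names (the actual options are those in Table 2); any grid whose counts multiply to 80 yields the stated 80 models per specialty:

```python
from itertools import product

# Hypothetical pipeline options for illustration; see Table 2 for the real ones.
options = {
    "vectorizer": ["bow", "tfidf"],                       # 2
    "ngram_range": [(1, 1), (1, 2)],                      # 2
    "feature_selection": ["none", "chi2"],                # 2
    "classifier": ["logreg", "svm", "nb", "rf", "gbm"],   # 5
    "class_weight": ["none", "balanced"],                 # 2
}

# Cartesian product over all option lists -> one dict per pipeline configuration.
grid = [dict(zip(options, combo)) for combo in product(*options.values())]
print(len(grid))  # 2 * 2 * 2 * 5 * 2 = 80 combinations
```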
Table 3.
Precision, recall, and F1 score for each optimization target for each specialty found in Table 1.
We do not evaluate any models for the “aesthesia” class, as it had fewer examples than folds. Specialties in bold pass our performance threshold.
Table 4.
Precision, recall, specificity, negative predictive value (NPV), and Brier score loss for the models which we used for prediction.
The standard deviations for each measurement are presented between parentheses.
Fig 2.
The F1 score as a function of the number of training examples.
In this figure we plot the F1 score of the best model for each specialty (the last column of Table 3) against the number of training examples (Table 1).
Fig 3.
The average accuracy (across all specialties) of the highest-performing classifiers (when optimized for F1 score) for different lengths of the referral notes (measured in number of tokens).
We observe that notes with few tokens have poor accuracy. On the other hand, notes with many tokens are likely to have a noisy signal (containing more keywords belonging to other specialties), which also results in poor performance.
Fig 4.
The estimated median and 90th percentile wait time from family physician referral to specialist visit for nine specialties.
The error bars represent the 95% confidence interval (arrived at by bootstrapping).
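The bootstrapped confidence intervals can be sketched with a standard percentile bootstrap; this is a minimal stdlib-only illustration with hypothetical wait times, not the paper's actual estimation code:

```python
import random

def bootstrap_ci(values, stat=lambda xs: sorted(xs)[len(xs) // 2],
                 n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a statistic (default: median).

    Resamples `values` with replacement `n_boot` times, computes the statistic
    on each resample, and returns the (alpha/2, 1 - alpha/2) percentiles.
    """
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(values) for _ in values]) for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

wait_days = [12, 30, 45, 18, 60, 25, 90, 33, 41, 27]  # hypothetical wait times
lo, hi = bootstrap_ci(wait_days)  # 95% CI for the median wait
print(lo, hi)
```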
Fig 5.
The estimated median and 75th percentile wait time from family physician referral to specialist visit for four specialties.
The error bars represent the 95% confidence interval (obtained by bootstrapping). The 2008 numbers are from our previous work and relied on manual coding.
Fig 6.
The effect of different confidence thresholds (x-axis) on classification metrics (left y-axis) and estimated median wait times (right y-axis) for the nine selected specialties.
The dashed green line represents the “baseline” wait time: the wait time estimated using only our gold labels (i.e., patients who had exactly one referral and one specialist consultation in 2015).
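The thresholding trade-off behind this figure can be sketched as follows: raising the confidence threshold keeps fewer, presumably more reliable, predictions, which changes both coverage and the resulting median wait estimate. The (probability, wait-days) pairs below are hypothetical:

```python
# Hypothetical (predicted probability, wait in days) pairs for one specialty.
preds = [(0.95, 40), (0.80, 35), (0.65, 55), (0.55, 70), (0.90, 30), (0.70, 45)]

def median(xs):
    """Median of a non-empty list."""
    xs = sorted(xs)
    mid = len(xs) // 2
    return xs[mid] if len(xs) % 2 else (xs[mid - 1] + xs[mid]) / 2

# Sweep the confidence threshold: higher thresholds keep fewer predictions,
# trading coverage for (presumed) precision and shifting the median estimate.
for threshold in (0.5, 0.7, 0.9):
    kept = [wait for prob, wait in preds if prob >= threshold]
    print(threshold, len(kept), median(kept))
```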