Cohort profile: St. Michael’s Hospital Tuberculosis Database (SMH-TB), a retrospective cohort of electronic health record data and variables extracted using natural language processing

doi:10.1371/journal.pone.0247872

Fig 1.

Data sources for SMH-TB database.

Fig 2.

Patient-level and encounter-level data in SMH-TB.

Table 1.

Variables available in SMH-TB from both structured and unstructured sources.

Fig 3.

Example of a component of a ruleset for extracting a variable (active TB diagnosis) from unstructured text in clinical dictations (using CHARTextract).

Fig 4.

QuickLabel interface for manual variable abstraction.

(A) Value labels are shown for two example variables: the Tuberculin Skin Test (TST) and the Interferon Gamma Release Assay (IGRA). (B) A screenshot of a representative data extraction using the QuickLabel tool. The sentences containing the variables of interest are highlighted in yellow.

Table 2.

Derivation of the value labels for diabetes mellitus.

Table 3.

Demographics of the patients included in the SMH-TB database, 2011–2018.

Table 4.

Summary of performance metrics on test set for variables extracted from unstructured dictations. Patients included in test set: N = 200.

Table 5.

Binomial proportion estimate and 95% Confidence Interval (CI) using standard binary regression and MC-SIMEX model for binary variables created from extracted variables.

Total patients with at least 1 dictation: N = 3237.

Table 6.

Association between demographic characteristics and receipt of LTBI treatment.

Total patients who were diagnosed with LTBI, N = 1473.
