Mining heterogeneous clinical notes by multi-modal latent topic model

Latent knowledge can be extracted from the electronic notes that are recorded during patient encounters with the health system. Using these clinical notes to decipher a patient’s underlying comorbidities, symptom burdens, and treatment courses is an ongoing challenge. Latent topic models, an efficient class of Bayesian methods, can be used to model each patient’s clinical notes as “documents” and the words in the notes as “tokens”. However, standard latent topic models assume that all of the notes follow the same topic distribution, regardless of the type of note or the domain expertise of the author (such as doctors or nurses). We propose a novel application of latent topic modeling, using a multi-note topic model (MNTM) to jointly infer distinct topic distributions for notes of different types. We applied our model to clinical notes from the MIMIC-III dataset to infer distinct topic distributions over the physician and nursing note types. Based on manual assessments made by clinicians, we observed a significant improvement in topic interpretability using MNTM over the baseline single-note topic models that ignore the note types. Moreover, our MNTM model led to a significantly higher prediction accuracy for prolonged mechanical ventilation and mortality using only the first 48 hours of patient data. By correlating the patients’ topic mixtures with hospital mortality and prolonged mechanical ventilation, we identified several diagnostic topics that are associated with poor outcomes. Because of its elegant and intuitive formulation, we envision a broad application of our approach in mining multi-modality text-based healthcare information that goes beyond clinical notes. Code available at https://github.com/li-lab-mcgill/heterogeneous_ehr.


Introduction
Multitudes of clinical notes are generated within the electronic health records (EHR) for each encounter between a patient and healthcare providers. These notes are written by clinical experts with specialized domain knowledge and include a plethora of rich information not otherwise captured within the EHR's laboratory, imaging, billing, and administrative documentation. Importantly, there exist overlapping sub-domains of medical knowledge that depend on the particular expertise of the author. Because of this distinct domain knowledge, different note types often involve different clinical vocabularies. In particular, clinical notes authored by physicians may differ considerably in vocabulary and content from notes authored by registered nurses. While Latent Dirichlet Allocation (LDA) [1] is a popular approach to extract meaningful topics from documents, it assumes that all of the documents follow the same topic distributions. We hypothesize that by modeling each note type with its own discrete distribution in a multi-modal latent topic model, we can improve the interpretability of the latent topics learned from the notes and generate a more accurate risk stratification of patients.
To this end, we propose a multi-note topic model (MNTM) that jointly infers distinct latent topic distributions corresponding to each distinct note type. As a proof-of-concept, we use the clinical notes from the Medical Information Mart for Intensive Care III (MIMIC-III) data [2] for 17,000 patients in the intensive care unit (ICU). Our goal is to develop an early prediction model of the risk of prolonged mechanical ventilation (PMV) and in-hospital mortality among ICU patients based solely on clinical notes data accrued during the first 48 hours of their ICU admission. Early prediction was selected as the unit of analysis because of its high clinical relevance. PMV and in-hospital mortality were selected because they are the conventional outcomes for early prognostication in the critical care literature [3].

Related methods
Our method of latent topic modeling is distinct from several previous methods [1,4-6]. While previous investigators have employed latent topic models for mining clinical notes, to the best of our knowledge, none of these methods models distinct note types separately. Chen et al. (2015) applied LDA directly to the EHR data without considering multi-modality [4]. Pivovarov et al. (2015) described a multi-modal LDA that infers topics by data type, where clinical notes are one of the four data types (billing codes, laboratory tests, clinical notes, and prescriptions) [5], but it does not distinguish between note types. This model only works with a fixed set of data types. Li et al. (2020) described a multi-modal topic model called MixEHR that jointly infers distinct topic distributions for each data type while imputing laboratory test results that are not missing at random [7]. While MixEHR can generalize to any arbitrary data type, it has not been applied to the current problem of multi-note-type modeling. Therefore, we consider our current approach a novel application of the multi-modal topic model.

Multi-modal latent topic model
We propose a multi-modal latent topic model (Fig 1). Suppose there are $K$ latent disease topics. Each topic $k \in \{1, \ldots, K\}$ under note type $t \in \{1, \ldots, T\}$ represents a distribution over the vocabulary, which is a vector of unknown word frequencies $\phi^{(t)}_k = [\phi^{(t)}_{wk}]_{W^{(t)}}$ for $W^{(t)}$ distinct words in the vocabulary. We assume that the topic-specific word frequency $\phi^{(t)}_k$ follows a Dirichlet distribution with unknown hyperparameters $\beta_{wt}$. For each patient $j \in \{1, \ldots, D\}$, the disease mixture membership $\theta_j$ is generated from the $K$-dimensional Dirichlet distribution $\mathrm{Dir}(\alpha)$ with unknown asymmetric hyperparameters $\alpha_k$. To generate a note token $i$ for patient $j$, a latent topic $z^{(t)}_{ij}$ under note type $t$ is first drawn from a categorical distribution with rate $\theta_j$. Then a clinical feature $x^{(t)}_{ij}$ is drawn from a categorical distribution with rate equal to $\phi^{(t)}_{z^{(t)}_{ij}}$. Formally, we first generate global variables for the $K$ topics:

PLOS ONE
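$$\phi^{(t)}_k \sim \mathrm{Dir}(\beta_t),$$

where $t$ indexes the note types (e.g., $t \in \{\text{physician note}, \text{nursing note}\}$). We then generate local variables for the patient topic mixture:

$$\theta_j \sim \mathrm{Dir}(\alpha).$$

Given the topic mixture, we sample a topic for each token in note type $t$ of each patient's note:

$$z^{(t)}_{ij} \sim \mathrm{Cat}(\theta_j).$$

We then sample a word for token $i$ from the topic distribution under topic $z^{(t)}_{ij}$:

$$x^{(t)}_{ij} \sim \mathrm{Cat}\big(\phi^{(t)}_{z^{(t)}_{ij}}\big).$$

Notably, the topic mixture $\theta_j$ is shared across note types and can therefore facilitate "borrowing" information between different note types when learning the topic distributions $\phi^{(t)}$.

To learn the model, we implemented a collapsed variational Bayesian algorithm [8]. Briefly, we first integrate out the Dirichlet variables because they are conjugate to the multinomial distribution of the tokens, which makes the resulting inference much more efficient. We then approximate the expectations by first deriving the conditional distribution for the topic assignments $z^{(t)}_{ij}$ and then approximating their sufficient statistics by the variational parameters:

$$\gamma^{(t)}_{ijk} \propto \Big(\alpha_k + \tilde{n}^{-(i,j)}_{\cdot jk}\Big)\, \frac{\beta_{x^{(t)}_{ij} t} + \tilde{n}^{(t),-(i,j)}_{x^{(t)}_{ij} \cdot k}}{\sum_{w} \Big(\beta_{wt} + \tilde{n}^{(t),-(i,j)}_{w \cdot k}\Big)} \qquad (1)$$

where the notation $n^{-(i,j)}$ indicates the exclusion of token $i$ in patient $j$'s clinical note, and the sufficient statistics are

$$\tilde{n}^{-(i,j)}_{\cdot jk} = \sum_{(t',i') \neq (t,i)} \gamma^{(t')}_{i'jk} \qquad (2)$$

$$\tilde{n}^{(t),-(i,j)}_{w \cdot k} = \sum_{(i',j') \neq (i,j)} \big[x^{(t)}_{i'j'} = w\big]\, \gamma^{(t)}_{i'j'k} \qquad (3)$$

The learning algorithm therefore follows a variational Bayes expectation-maximization algorithm: the E-step infers the $\gamma^{(t)}_{ijk}$'s with Eq (1); the M-step updates the sufficient statistics $\tilde{n}_{\cdot jk}$ and $\tilde{n}^{(t)}_{w \cdot k}$ with Eqs (2) and (3), respectively. The EM updates are guaranteed to maximize the evidence lower bound (ELBO) of the model under the mean-field variational distribution for independent topic assignments (i.e., $q(z) = \prod_{t,i,j} q(z^{(t)}_{ij} \mid \gamma^{(t)}_{ij})$) [8]. Upon convergence of the ELBO, we infer the respective variational expectations of the patient topic mixture and the topic distributions:

$$\hat{\theta}_{jk} = \frac{\alpha_k + \tilde{n}_{\cdot jk}}{\sum_{k'} \big(\alpha_{k'} + \tilde{n}_{\cdot jk'}\big)}, \qquad \hat{\phi}^{(t)}_{wk} = \frac{\beta_{wt} + \tilde{n}^{(t)}_{w \cdot k}}{\sum_{w'} \big(\beta_{w't} + \tilde{n}^{(t)}_{w' \cdot k}\big)}.$$

Furthermore, we update the hyperparameters by maximizing the marginal likelihood under the variational expectations via empirical Bayes fixed-point updates [9,10]. In these updates, $\Psi(\cdot)$ is the digamma function, $W^{(t)}$ is the vocabulary size under clinical note type $t$, and the Gamma parameters are set to fixed values, mainly for numerical stability: $a_\alpha = 1$, $b_\alpha = 0$, $a_\beta = 1$, $b_\beta = 100$.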
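The inference procedure described above can be illustrated with a minimal, pure-Python sketch. It is not the authors' released implementation: the function name `fit_mntm`, the toy corpus, and the fixed symmetric scalars `alpha` and `beta` are assumptions made for brevity (the paper uses asymmetric hyperparameters that are updated by empirical Bayes), and the counts are updated token by token rather than in separate E- and M-steps. The key point it shows is that the patient-topic counts are shared across note types, while the word-topic counts are kept per note type.

```python
import random

def fit_mntm(docs, vocab_sizes, K, alpha=0.1, beta=0.01, n_iter=30, seed=0):
    """Toy zero-order collapsed variational updates for a multi-note topic model.

    docs[j][t] is the list of word ids of patient j in note type t.
    Returns (theta, phi) with theta[j][k] and phi[t][k][w].
    """
    rng = random.Random(seed)
    T, D = len(vocab_sizes), len(docs)
    n_jk = [[0.0] * K for _ in range(D)]                         # patient-topic, shared across types
    n_wk = [[[0.0] * K for _ in range(W)] for W in vocab_sizes]  # per-type word-topic
    n_k = [[0.0] * K for _ in range(T)]                          # per-type topic totals

    def add(j, t, w, g, sign):
        """Add (sign=+1) or remove (sign=-1) one token's responsibilities."""
        for k in range(K):
            n_jk[j][k] += sign * g[k]
            n_wk[t][w][k] += sign * g[k]
            n_k[t][k] += sign * g[k]

    # Random initial responsibilities gamma[j][t][i], each a length-K vector.
    gamma = []
    for j, doc in enumerate(docs):
        gamma.append([])
        for t in range(T):
            gamma[j].append([])
            for w in doc[t]:
                p = [rng.random() + 1e-3 for _ in range(K)]
                g = [v / sum(p) for v in p]
                gamma[j][t].append(g)
                add(j, t, w, g, +1.0)

    for _ in range(n_iter):
        for j, doc in enumerate(docs):
            for t in range(T):
                for i, w in enumerate(doc[t]):
                    add(j, t, w, gamma[j][t][i], -1.0)  # exclude the current token
                    g = [(alpha + n_jk[j][k]) * (beta + n_wk[t][w][k])
                         / (vocab_sizes[t] * beta + n_k[t][k]) for k in range(K)]
                    s = sum(g)
                    g = [v / s for v in g]
                    gamma[j][t][i] = g
                    add(j, t, w, g, +1.0)               # put it back with the new value

    theta = [[(alpha + n_jk[j][k]) / (K * alpha + sum(n_jk[j])) for k in range(K)]
             for j in range(D)]
    phi = [[[(beta + n_wk[t][w][k]) / (vocab_sizes[t] * beta + n_k[t][k])
             for w in range(vocab_sizes[t])] for k in range(K)] for t in range(T)]
    return theta, phi

# Four toy "patients", each with a physician note (4-word vocabulary)
# and a nursing note (3-word vocabulary).
docs = [
    [[0, 0, 1], [0, 1]],
    [[0, 1, 1], [0, 0]],
    [[2, 3, 3], [2, 2]],
    [[2, 2, 3], [1, 2]],
]
theta, phi = fit_mntm(docs, vocab_sizes=[4, 3], K=2)
```

Each row of `theta` is one patient's topic mixture, and `phi[t][k]` is the word distribution of topic `k` under note type `t`; both are normalized by construction.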

MIMIC-III note processing
From the entire cohort (all patients admitted to the ICU), we selected a subset, which we call the day-2 cohort. This subset includes the notes of patients who were mechanically ventilated for at least two consecutive days. We used the entire cohort, excluding the day-2 cohort, to train our unsupervised topic model and then used this trained topic model to infer the topic mixtures of notes in the day-2 cohort, which were used for mechanical ventilation prediction. For both cohorts, we performed a standard text preprocessing procedure: converting letters to lower case and removing punctuation, white space, stop words provided by the Natural Language Toolkit (NLTK) library (https://www.nltk.org/), and words that appeared in fewer than 5 notes or in more than 15% of notes. After the preprocessing, each note had around 300 words on average. The vocabulary for physicians' notes contained 8948 words and the vocabulary for nursing notes contained 8076 words. In our study, the notes of an admission, rather than of a patient, were grouped together as one document and were therefore assumed to have one topic composition. While notes written in different admissions might focus on different topics, it is reasonable to assume that notes within a single admission have mostly the same topics, including notes written by different professionals.
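The preprocessing steps listed above can be sketched as follows. This is an illustrative, standard-library-only version: the helper names `tokenize` and `filter_by_document_frequency` are hypothetical, and the small `STOP_WORDS` set is a placeholder standing in for the NLTK English stop-word list used in the study.

```python
import string
from collections import Counter

# Placeholder stop-word list (assumption); the study uses the NLTK list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "was", "with", "on", "in"}

# Map every punctuation character to a space so that words stay separated.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def tokenize(note):
    """Lower-case, strip punctuation, split on white space, drop stop words."""
    words = note.lower().translate(_PUNCT_TABLE).split()
    return [w for w in words if w not in STOP_WORDS]

def filter_by_document_frequency(tokenized_notes, min_notes=5, max_fraction=0.15):
    """Drop words seen in fewer than min_notes notes or in more than max_fraction of notes."""
    doc_freq = Counter()
    for tokens in tokenized_notes:
        doc_freq.update(set(tokens))  # count each word once per note
    n_notes = len(tokenized_notes)
    keep = {w for w, c in doc_freq.items()
            if c >= min_notes and c <= max_fraction * n_notes}
    return [[w for w in tokens if w in keep] for tokens in tokenized_notes]

example = tokenize("Pt. intubated; sedated on propofol.")
```

The defaults `min_notes=5` and `max_fraction=0.15` mirror the thresholds stated in the text; on a toy corpus they would need to be relaxed, since almost every word is rare in absolute terms.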
For the single-note-type model, we processed the notes in two different ways: (1) the same word from different note types was assigned the same word ID and its frequency was the total over all types of notes (referred to as "single-note-type (same word)"); (2) the same word from different note types was assigned different word IDs and its frequencies were computed separately (referred to as "single-note-type (diff. word)"). For example, the word 'heartbeat' may occur in both a physician's note and a nursing note but is represented separately (as 'physician-heartbeat' and 'nurse-heartbeat'). For the proposed multi-note model, we differentiated such words by assigning different note types to them. We evaluated our model's predictive performance by 5-fold cross-validation. Prolonged mechanical ventilation was defined using a cutoff of 7 days of ventilation because this time period represents a major clinical decision branch in a patient's care [11][12][13].
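The three input representations can be contrasted with a short sketch. The function names and the toy admission below are assumptions for illustration only; in particular, giving the multi-note model one vocabulary per note type is one plausible reading of "assigning different note types" to the words.

```python
def encode_same_word(notes):
    """Single-note-type (same word): one shared vocabulary, note type ignored."""
    vocab, ids = {}, []
    for _note_type, tokens in notes:
        for w in tokens:
            ids.append(vocab.setdefault(w, len(vocab)))
    return vocab, ids

def encode_diff_word(notes):
    """Single-note-type (diff. word): the note type is prefixed to every word."""
    vocab, ids = {}, []
    for note_type, tokens in notes:
        for w in tokens:
            ids.append(vocab.setdefault(f"{note_type}-{w}", len(vocab)))
    return vocab, ids

def encode_multi_note(notes):
    """Multi-note: word ids are kept in a separate stream for each note type."""
    vocabs, ids = {}, {}
    for note_type, tokens in notes:
        vocab = vocabs.setdefault(note_type, {})
        stream = ids.setdefault(note_type, [])
        for w in tokens:
            stream.append(vocab.setdefault(w, len(vocab)))
    return vocabs, ids

# One toy admission with a physician note and a nursing note.
admission = [("physician", ["heartbeat", "sepsis"]), ("nurse", ["heartbeat", "pillow"])]
same_vocab, same_ids = encode_same_word(admission)
diff_vocab, diff_ids = encode_diff_word(admission)
multi_vocabs, multi_ids = encode_multi_note(admission)
```

In the first two encodings the topic model sees a single token stream per admission; only the third keeps the streams apart, so that one shared topic mixture can be paired with a separate word distribution for each note type.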

Qualitative evaluation
We performed a qualitative evaluation of the topic cohesiveness. Topic cohesiveness was defined a priori as "relatedness of each term within the topic to a central disease process or health state". Cohesiveness was measured by a blinded physician using a 5-point scale. A second blinded physician with content expertise in critical care medicine reviewed the word clouds of each model in aggregate and provided a determination of the relative cohesiveness of the two models.

Multi-note model improves PMV and mortality prediction
In each validation fold, we trained both single-note models (single-note-type (same words) and single-note-type (diff. words)) and the multi-note model on the training set, and then trained a logistic regression model on the same training set to predict the binary outcome of PMV. We experimented with 10, 30, 50 and 100 topics, measured the perplexity on held-out documents, and used the best number of topics, 50, for each of the 3 topic models going forward.
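As a reminder of how held-out perplexity is computed for a topic model, the sketch below evaluates the per-token likelihood under a given topic mixture and set of topic distributions. The function name and toy inputs are assumptions; how the topic mixtures of held-out documents were obtained in the study is not described here, so `theta` is simply taken as given.

```python
import math

def perplexity(docs, theta, phi):
    """Perplexity of word-id documents under a topic model.

    docs[j] is a list of word ids, theta[j][k] is the topic mixture of
    document j, and phi[k][w] is the probability of word w under topic k.
    """
    log_lik, n_tokens = 0.0, 0
    for j, doc in enumerate(docs):
        for w in doc:
            p = sum(theta[j][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)

# Sanity check: two topics that are both uniform over a 4-word vocabulary.
uniform_phi = [[0.25] * 4, [0.25] * 4]
value = perplexity([[0, 1, 2]], [[0.5, 0.5]], uniform_phi)
```

Lower perplexity means the model assigns higher probability to the held-out tokens; a model that is uniform over W words has perplexity W, which is the case in the sanity check.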
We then predicted the PMV binary outcome on the validation set (Fig 2). We observed consistent improvement in terms of the area under the receiver operating characteristic (ROC) curve (AUROC: 66.8% for multi-note-type, 66.0% for single-note-type (diff. words), 60.7% for single-note-type (same words)) and the area under the precision-recall curve (AUPRC: 40.8% for multi-note-type, 39.2% for single-note-type (diff. words), 33.9% for single-note-type (same words)). In particular, the multi-note model achieved an AUROC of 0.668 with a standard deviation (std) of 0.008. Hence, the 95% confidence interval (CI) was $[0.668 - 1.96 \times 0.008/\sqrt{10},\ 0.668 + 1.96 \times 0.008/\sqrt{10}] = [0.6630, 0.6730]$. The best single-note-type model (diff. words) achieved on average 0.660 ± 0.008 std (i.e., [0.6550, 0.6650] 95% CI). Therefore, the AUROC of the multi-note model was higher than that of the best single-note model, but the difference was not statistically significant at 95% CI. However, AUROC tends to be insensitive to unbalanced data. We therefore turned to AUPRC. In terms of AUPRC, the multi-note model achieved on average 0.408 ± 0.007 (std), while the best single-note model achieved on average 0.392 ± 0.008 (std), and the 95% confidence intervals in terms of AUPRC were [0.404, 0.412] and [0.387, 0.397], respectively. This showed that the AUPRC of the multi-note model was significantly higher than the AUPRC of the best single-note model at 95% CI.
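The interval arithmetic above is easy to reproduce. The sketch below uses the normal-approximation formula from the text with n = 10, the value that appears under the square root; the helper names `ci95` and `overlaps` are introduced here for illustration.

```python
import math

def ci95(mean, std, n=10):
    """Normal-approximation 95% CI: mean +/- 1.96 * std / sqrt(n)."""
    half_width = 1.96 * std / math.sqrt(n)
    return (mean - half_width, mean + half_width)

def overlaps(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

auroc_multi = ci95(0.668, 0.008)    # about (0.6630, 0.6730)
auroc_single = ci95(0.660, 0.008)   # about (0.6550, 0.6650)
auprc_multi = ci95(0.408, 0.007)    # about (0.4037, 0.4123)
auprc_single = ci95(0.392, 0.008)   # about (0.3870, 0.3970)

auroc_overlap = overlaps(auroc_multi, auroc_single)
auprc_overlap = overlaps(auprc_multi, auprc_single)
```

The AUROC intervals overlap while the AUPRC intervals do not, which is the basis for the significance statements in the paragraph above.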
To further illustrate the benefits of modeling multiple note types, we applied our approach to mortality prediction. Here we used the nursing and physician notes from the first 48 hours to predict in-hospital mortality. As in the PMV application, we trained a 50-topic model for each approach and used the topic mixture memberships as input to a logistic regression classifier for predicting mortality. We performed 5-fold CV to evaluate each method. Specifically, each fold held out 1560 admissions for evaluation, and the remaining 4 folds (6233 admissions in total) were used to train each topic model. We found that the multi-note model performed slightly better than the single-note models, as measured by AUROC and AUPRC (S2 Fig in S1 File). On mortality prediction, the multi-note model achieved an average AUROC of 0.861 ± 0.004 (std), with a 95% CI of [0.859, 0.863]. The best single-note (same words) model achieved on average 0.845 ± 0.004 (std), with a 95% CI of [0.843, 0.847]. In terms of AUPRC, the multi-note model achieved on average 0.419 ± 0.011 (std), with a 95% CI of [0.412, 0.426], while the single-note model achieved on average 0.404 ± 0.008 (std), with a 95% CI of [0.399, 0.409]. These results indicate that both the AUPRC and the AUROC of the multi-note model are significantly higher than those of the best single-note model at 95% CI.
By construction, the single-note-type (diff. words) model operates over a vocabulary that is roughly twice as large as that of the single-note-type (same words) model, because the same word coming from the two note types is treated as two different words. The multi-note-type model, on the other hand, operates on the same vocabulary as the single-note-type (same words) model but counts the same word coming from different note types separately. Therefore, to compare the models more fairly by controlling for the effective "vocabulary size" (the unique words seen by the models), we focused our subsequent analysis on the comparison between the multi-note-type model and the single-note-type (diff. words) model. For ease of reference, we refer to the single-note-type (diff. words) model simply as single-note. We focus our analysis on PMV henceforth, as it is less explored than mortality.

Fig 2. ROC and precision-recall curve for binary PMV prediction.
We trained the two single-note topic models and the multi-note topic model on the first 48 hours of the clinical notes for each patient. We then trained a separate logistic regression classifier that took the patient-note topic mixture as input and predicted whether the patient would stay on MV for more than 7 days. The trained topic models and logistic classifiers were then applied to the test patients to predict PMV. The prediction accuracy was evaluated by ROC and precision-recall curves. The figure inset shows the AUROC and AUPRC values for each model, with the standard deviations across 10 random splits in parentheses. https://doi.org/10.1371/journal.pone.0249622.g002

Evaluating the topic interpretability of single-note and multi-note topic models
To evaluate the interpretability of the single-note versus the multi-note topic model, we generated a word cloud representing each of the 50 topics in both models (i.e., 100 word clouds). Each topic's word cloud comprised the top 100 words within the topic, based on the inferred word probabilities under that topic (S1 Fig in S1 File and Fig 3).

For the single-note model, the most common topic themes were "mixed" topics, followed by topics pertaining to cardiology, gastroenterology, neurology and respiratory issues. The most common topic themes for the multi-note model were those pertaining to cardiology, gastroenterology, respiratory issues and neurology. The topics generated by the multi-note model were significantly more cohesive than the topics generated by the single-note model. In the multi-note model, most word clouds comprised words, phrases, or abbreviations that tracked closely with that topic's theme. By comparison, the topics extracted by the single-note model contained more noisy, unrelated words. For example, the single-note model generated a topic themed "hematological" in which 'pillow' was the most common word, and a topic themed "stroke" in which 'adenoca' was a common word. In addition, we sought an unbiased quantitative evaluation of the topic interpretability. We asked a physician to manually review the general medical cohesiveness of each word cloud in the single-note and multi-note models and to rate it from 1 (poor; irrelevant) to 5 (excellent; sticks to one common disease topic).
Quantitatively, the average interpretability score was 3.46 (± 1.15 standard deviation (std)) for the single-note model and 4.22 (± 1.15 std) for the multi-note model (Fig 4). We conducted a two-sided t-test in R between the physician ratings of the multi-note topic model and those of the single-note topic model (i.e., the standard LDA model) and obtained a p-value of 0.001298. This indicated that the difference between the two models in terms of the physician's ratings is statistically significant. This trend was further confirmed by the content expert reviewer. The disease theme and cohesiveness score of each topic are listed in S1 and S2 Tables in S1 File.
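The size of this effect can be checked from the reported summary statistics alone. The sketch below computes an unpaired (Welch) t statistic and its degrees of freedom, under two assumptions that are not stated explicitly in the text: each model contributes 50 rated topics, and the two sets of ratings are treated as independent samples. The function name `welch_t` is illustrative.

```python
import math

def welch_t(mean1, std1, n1, mean2, std2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = std1 ** 2 / n1, std2 ** 2 / n2
    t_stat = (mean1 - mean2) / math.sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, dof

# Multi-note versus single-note ratings, 50 topics each (assumed).
t_stat, dof = welch_t(4.22, 1.15, 50, 3.46, 1.15, 50)
```

Under these assumptions the statistic is about 3.30 with about 98 degrees of freedom, which corresponds to a two-sided p-value on the order of 0.001 and is in line with the reported value. Obtaining the exact p-value requires the Student t distribution function, which is omitted here to keep the sketch dependency-free.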

Correlating topics with mechanical ventilation duration
To gain further insights from the 50 learned topics, we inferred the 50-topic patient mixture memberships using the trained topic model. We then correlated the patient 50-topic mixture with the patient's total mechanical ventilation (MV) duration, using only those patients' notes that were recorded within 48 hours of their ICU admission (Fig 5 top panels). We chose Pearson's correlation coefficient because it is a normalized metric whose magnitude reflects the strength of linear correlation, in the range of -1 to 1, and because it is invariant to the scale of the variables. We also tried Spearman's and Kendall's correlation coefficients and observed similar results. We visualized the top 3 most positively correlated topics and the top 3 most negatively correlated topics for the single-note model and the multi-note model (Fig 5 bottom panels). The multi-note model clearly revealed more meaningful topics related to MV duration. For example, the topic most positively correlated with MV duration in the multi-note model was associated with septic shock, followed by a topic associated with pneumonia. In contrast, the most correlated topic in the single-note model was associated with 'javascript system error' along with some discrete and irrelevant terms and concepts. The most negatively correlated topic for MV duration was chronic obstructive pulmonary disease (COPD) with acute exacerbation (AE) in the multi-note model and liver transplant, with some sparse and unrelated terms, in the single-note model.
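The correlation analysis can be sketched in a few lines. The toy topic mixtures and durations below are made up for illustration, and the helper names `pearson` and `rank_topics` are assumptions; the study uses 50 topics and reports the top 3 topics in each direction.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank_topics(theta, duration, top=3):
    """Return the `top` most positively and most negatively correlated topics."""
    n_topics = len(theta[0])
    scored = [(pearson([row[k] for row in theta], duration), k) for k in range(n_topics)]
    scored.sort(reverse=True)
    return scored[:top], scored[-top:][::-1]

# Toy data (assumption): 4 patients, 3 topics, MV duration in days.
theta = [[0.8, 0.1, 0.1], [0.6, 0.2, 0.2], [0.2, 0.3, 0.5], [0.1, 0.2, 0.7]]
duration = [10, 8, 3, 2]
most_positive, most_negative = rank_topics(theta, duration, top=1)
```

Each returned entry is a `(correlation, topic index)` pair, so the top words of the selected topics can then be inspected to judge whether the association is clinically meaningful.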

Discussion
Different types of medical specialists, such as physicians and nurses, hold distinct domains of medical knowledge. These differences are reflected in the language and terms that populate clinical notes. Existing latent topic modeling methods treat notes authored by different types of medical specialists as the same by assuming that all notes follow a homogeneous topic distribution. To the best of our knowledge, we are the first group to propose a model that analyzes notes separately depending on the author type. Our simple and elegant multi-modal topic model showed the advantage of inferring distinct distributions of latent topics for physician and nursing notes. We demonstrated that the proposed multi-note model extracts more meaningful topics and improves the interpretability of the knowledge learned from the notes compared with the single-note model. We also showed that our model confers a slightly, but statistically significantly, more accurate prediction of the duration of MV, a highly clinically relevant question among medical specialists caring for patients in critical condition.
In future work, we will explore supervised topic models [14] to learn both the topics and the predictions simultaneously. There are also more flexible neural network language models, such as ClinicalBERT, that can learn more abstract terms [15,16]. We will compare our simpler topic model with ClinicalBERT. Moreover, we will explore a powerful combination of a recurrent neural network and a topic model (TopicRNN) [17], which learns the global context with the topic model and the local context with the RNN. Applications of an analogous idea to predicting the readmission of ICU patients using billing codes have also shown some promising results [18]. Lastly, our method is not limited to the healthcare domain. For example, we can model documents written in different languages or book reviews by literary scholars from different domains. Together, we envision that our current model can succeed in many application domains where knowledge is manifested as free-form natural-language text drawing on diverse domain knowledge.