Subtypes in patients with opioid misuse: A prognostic enrichment strategy using electronic health record data in hospitalized patients

Background Approaches are needed to better delineate the continuum of opioid misuse that occurs in hospitalized patients. A prognostic enrichment strategy with latent class analysis (LCA) may facilitate treatment strategies in subtypes of opioid misuse. We aim to identify subtypes of patients with opioid misuse and examine the distinctions between the subtypes by examining patient characteristics, topic models from clinical notes, and clinical outcomes.
Methods This was an observational study of inpatient hospitalizations at a tertiary care center between 2007 and 2017. Patients with opioid misuse were identified using an operational definition applied to all inpatient encounters. LCA with eight class-defining variables from the electronic health record (EHR) was applied to identify subtypes in the cohort of patients with opioid misuse. Comparisons between subtypes were made using the following approaches: (1) descriptive statistics on patient characteristics and healthcare utilization using EHR data and census-level data; (2) topic models with natural language processing (NLP) from clinical notes; (3) association with hospital outcomes.
Findings The analysis cohort comprised 6,224 patient encounters with opioid misuse (2.7% of all hospitalizations), with a data corpus of 422,147 clinical notes. LCA identified four subtypes with differing patient characteristics, topics from the clinical notes, and hospital outcomes. Class 1 was characterized by high hospital utilization with known opioid-related conditions (36.5%); Class 2 included patients with illicit use, low socioeconomic status, and psychoses (12.8%); Class 3 contained patients with alcohol use disorders with complications (39.2%); and Class 4 consisted of those with low hospital utilization and incidental opioid misuse (11.5%). The following hospital outcomes were the highest for each subtype when compared against the other subtypes: readmission for Class 1 (13.9% vs. 10.5%, p<0.01); discharge against medical advice for Class 2 (12.3% vs. 5.3%, p<0.01); and in-hospital death for Classes 3 and 4 (3.2% vs. 1.9%, p<0.01).
Conclusions A 4-class latent model was the most parsimonious model that defined clinically interpretable and relevant subtypes for opioid misuse. Distinct subtypes were delineated after examining multiple domains of EHR data and applying methods in artificial intelligence. The approach with LCA and readily available class-defining substance use variables from the EHR may be applied as a prognostic enrichment strategy for targeted interventions.


Introduction
The principles of personalized medicine to find the appropriate treatment based on a patient's individualized determinants of health and clinical needs are a priority for improving clinical outcomes [1]. The ability to identify characteristics in patients more likely to have a clinical outcome (prognostic enrichment) is needed in conditions with a wide spectrum of clinical manifestations. In this regard, identification and treatment of opioid misuse is not a "one-size-fits-all" approach. Opioid misuse occurs along a continuum ranging from individuals who occasionally use opioids for non-medical purposes to individuals with severe opioid use disorders. The spectrum of opioid misuse impacts patients with co-occurring mental health conditions, coexisting alcohol misuse and polysubstance use, complex pain conditions, and inequities in social determinants of health [2][3][4][5]. These characteristics also influence clinical outcomes, so a tailored approach is needed to identify appropriate interventions given varying barriers to treatment for different types of misuse identified.
A data-driven approach to developing subtypes of opioid misuse using electronic health record (EHR) data has not been published in previous work. A major target group in clinical studies is patients with chronic pain and/or long-term prescription opioid use, but these targeted cohorts fail to address other types of opioid misuse behaviors that may be common in hospitalized patients [6]. Community health settings and treatment programs have used latent class analysis (LCA) from health surveys to better delineate subtypes of individuals with opioid use [7][8][9][10]. Heterogeneity in polysubstance use, illicit use, socioeconomic status, and mental illness were common subtype characteristics across study settings. Application of LCA to EHR data may reveal important distinct subtypes in our patient cohort with clinically meaningful traits and demonstrate differing risks for negative health outcomes [11][12][13]. Identifying latent subtypes presents opportunities to better align the intensity of an intervention and follow-up services for patients. The application of data-driven approaches, including unsupervised learning for understanding the underlying structure of data, is an important element of a learning healthcare system so that prognostic enrichment strategies are feasible from an ever-expanding quantity of EHR data.
The aim of this study is to identify subtypes of opioid misuse using readily available structured EHR data (e.g. labs, diagnoses). Additionally, the clinical notes are the largest domain of the EHR that frequently contain unstructured data (free text) about social and behavioral determinants of health that cannot be comprehensively examined manually; therefore, topic modelling was applied to summarize the corpus of text. The aim is to identify distinct subtypes of patients with opioid misuse, and provide validity with topic modelling and associations with health outcomes. We hypothesize LCA will identify distinct subtypes in our patient cohort with clinically meaningful traits and demonstrate different risks for negative health outcomes.

Study setting and opioid misuse definition
This study utilized data from the EHR of an urban tertiary academic center between January 1, 2007 and September 30, 2017. An operational definition for opioid misuse was developed following the National Survey on Drug Use and Health criteria for opioid misuse with criterion input from a board-certified addiction specialist (ESA) and psychiatrist (NSK). The analysis cohort included consecutive adult (≥18 years of age) emergency department and inpatient encounters meeting criteria for opioid misuse during the study period. The criteria for opioid misuse were any of the following: (1) positive urine drug screen for an opiate with polysubstance use with any of the following: an illicit drug (phencyclidine or cocaine), a benzodiazepine that is not on the patient's medication administration record, or an amphetamine that is not on the patient's medication administration record; (2) positive urine drug screen for an opiate but without a prescription for an opioid on the patient's admission administration record; or (3) International Classification of Diseases (ICD)-9 and -10 codes for opioid-related hospitalizations, adopted from the Healthcare Cost and Utilization Project (HCUP) [14]. Urine drug screens were eligible only if no opioid or benzodiazepine drug was dispensed by the hospital pharmacy before the urine drug screen was ordered. ICD codes reflect final billing diagnostic codes used for claims with payers. The codes include a variety of opioid-related events and opioid misuse codes and are detailed in S1 Appendix. Many of the ICD codes do not allow for heroin-related cases to be explicitly identified. In addition, the codes do not distinguish between illegal use of prescription drugs and their use as prescribed.
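The three criteria above can be read as a rule-based classifier. The following is a minimal sketch of that logic, not the study's implementation: the `Encounter` record, its field names, and the drug-class labels are hypothetical, and the actual ICD code list is the one referenced in S1 Appendix.

```python
from dataclasses import dataclass, field
from typing import Set

ILLICIT = {"phencyclidine", "cocaine"}
# Classes that count toward polysubstance use only when they are absent
# from the medication administration record (MAR).
NON_PRESCRIBED = {"benzodiazepine", "amphetamine"}


@dataclass
class Encounter:
    """Hypothetical encounter record; all field names are illustrative."""
    uds_positive: Set[str] = field(default_factory=set)  # positive urine drug screen classes
    mar_classes: Set[str] = field(default_factory=set)   # drug classes on the MAR
    opioid_or_benzo_dispensed_before_uds: bool = False
    has_hcup_opioid_icd: bool = False


def meets_opioid_misuse_definition(e: Encounter) -> bool:
    # Criterion 3: HCUP opioid-related ICD-9/10 billing code.
    if e.has_hcup_opioid_icd:
        return True
    # A urine drug screen is eligible only if no opioid or benzodiazepine
    # was dispensed by the hospital pharmacy before it was ordered.
    if e.opioid_or_benzo_dispensed_before_uds or "opiate" not in e.uds_positive:
        return False
    # Criterion 1: opiate plus an illicit drug, or plus a benzodiazepine or
    # amphetamine that is not on the MAR.
    polysubstance = bool(e.uds_positive & ILLICIT) or any(
        d in e.uds_positive and d not in e.mar_classes for d in NON_PRESCRIBED
    )
    # Criterion 2: opiate without an opioid prescription on record.
    no_opioid_prescription = "opioid" not in e.mar_classes
    return polysubstance or no_opioid_prescription
```

For example, an encounter with a positive opiate and cocaine screen qualifies even when an opioid is on the record, whereas a positive opiate screen alone with a recorded opioid prescription does not.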
To validate the operational definition, a random sample of hospital encounters was extracted from the EHR during the study period for chart review. A sample of 1,000 patient encounters including age-sex matched controls was reviewed. The annotations included an oversampling of patients who met case criteria and non-cases who had ICD codes for chronic pain, naloxone administration, or a physician order for a urine drug screen. An annotator (KS) who is an MD, MPH candidate received substance use training through Loyola's Institute for Transformative Interprofessional Education and completed Screening, Brief Intervention, and Referral to Treatment (SBIRT) through online training. Additional training was provided to screen for likelihood of opioid misuse on a Likert scale (1-5), and the annotator met an inter-rater reliability of a Cohen's kappa coefficient greater than 0.80 with a critical care physician and addiction specialist (MA, ESA) before independent review was performed.
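As an illustration of the inter-rater reliability threshold, the sketch below computes an unweighted Cohen's kappa from two raters' labels. Whether the study used a weighted or unweighted kappa on the Likert ratings is not stated, so the unweighted form here is an assumption, and the function name is illustrative.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A value above 0.80 on a shared training set would correspond to the threshold the annotator had to meet before independent review.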
The operational definition had a sensitivity of 88.6% (95% CI 85.2%-91.9%) and specificity of 78.5% (95% CI 75.4%-81.7%). Chart review identified many false positives that occurred in outside hospital transfers that administered an opioid during care; therefore, hospital transfers were excluded from this analysis. Cases of overdose could not be reliably discriminated using billing codes or naloxone administration, as both produced many false positives.
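Sensitivity and specificity with 95% confidence intervals can be derived from the chart-review confusion matrix. The sketch below uses a normal-approximation (Wald) interval; the interval method used in the study is not stated, so that choice is an assumption, and the function names are illustrative.

```python
import math


def proportion_with_wald_ci(successes, total, z=1.96):
    """Return (estimate, lower, upper) using a normal-approximation interval."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return {
        "sensitivity": proportion_with_wald_ci(tp, tp + fn),
        "specificity": proportion_with_wald_ci(tn, tn + fp),
    }
```

Here `tp`, `fn`, `tn`, and `fp` are the counts of chart-reviewed encounters where the operational definition agreed or disagreed with the reference annotation.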
Multiple encounters by the same patient were included as independent observations during analysis. As patients' severity and subtype of misuse may change over time, our primary unit of analysis is the patient encounter in order to provide actionable insight into the subtype of misuse at hospitalization which could inform timely custom interventions. To address the potentially high correlation of intra-patient encounters, sensitivity analysis was performed to remove multiple encounters by analyzing the most recent inpatient encounter by each patient in our cohort.
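The patient-level sensitivity analysis reduces the cohort to one encounter per patient. A minimal sketch of that selection step follows; the dictionary keys `patient_id` and `admit_date` are hypothetical names, not fields from the study's data warehouse.

```python
from datetime import date


def most_recent_encounter_per_patient(encounters):
    """Keep only the latest encounter (by admit_date) for each patient_id."""
    latest = {}
    for enc in encounters:
        pid = enc["patient_id"]
        if pid not in latest or enc["admit_date"] > latest[pid]["admit_date"]:
            latest[pid] = enc
    return list(latest.values())


# Usage with made-up records:
cohort = [
    {"patient_id": "A", "admit_date": date(2012, 3, 1)},
    {"patient_id": "A", "admit_date": date(2015, 7, 9)},
    {"patient_id": "B", "admit_date": date(2010, 1, 5)},
]
patient_level = most_recent_encounter_per_patient(cohort)
```

The LCA model would then be refit on `patient_level` so that each patient contributes a single observation.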

Identifying subtypes with latent class analysis (LCA)
Latent class analysis (LCA) is a statistical technique that uses mixture modelling to identify mutually exclusive and qualitatively different subgroups from multivariate categorical data [15,16]. LCA takes observed data as inputs to define a number of unobservable, distinct subtypes or classes from the population of interest. Model fit statistics are utilized to identify the appropriate number of latent classes, and the distributions of the observed class-defining indicator variables, called item response probabilities, are used to characterize the classes. Posterior probabilities for encounters indicate the likelihood of membership into each of the latent classes. The following were class-defining variables in the LCA model: (1) urine drug screen results; (2) ICD codes for opioid-related hospitalizations; (3) ICD codes for chronic pain; (4) age; (5) ICD codes for alcohol use disorders; (6) ICD codes for psychoses; (7) ICD codes for depression; and (8) ICD codes for liver disease. The eight class-defining variables for the LCA model were chosen a priori based on existing evidence for identifying cases of misuse and risk factors for misuse [17][18][19][20][21]. LCA models were considered for one to eight classes for these class-defining variables. The optimal number of latent classes was selected using fit statistics including the Bayesian information criterion (BIC), adjusted Bayesian information criterion (aBIC), consistent Akaike information criterion (cAIC), class prevalence, class separation, and model interpretability [22]. Each patient encounter was assigned a class according to the highest latent class posterior probability.
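To make the roles of item response probabilities, posterior probabilities, and fit statistics concrete, the sketch below computes class membership posteriors for one encounter from an already-fitted model and evaluates the standard information criteria. It is illustrative only: the study fit its models with poLCA in R, and the function and argument names here are assumptions.

```python
import math


def posterior_class_probabilities(response, class_prevalence, item_probs):
    """Posterior P(class | response) under a fitted latent class model.

    response: list of category indices, one per class-defining indicator.
    class_prevalence[c]: estimated share of class c.
    item_probs[c][j][k]: P(indicator j takes category k | class c).
    """
    joint = []
    for c, prevalence in enumerate(class_prevalence):
        likelihood = prevalence
        # Indicators are assumed conditionally independent given the class.
        for j, k in enumerate(response):
            likelihood *= item_probs[c][j][k]
        joint.append(likelihood)
    total = sum(joint)
    return [value / total for value in joint]


def modal_class(posterior):
    """Index of the class with the highest posterior probability."""
    return max(range(len(posterior)), key=posterior.__getitem__)


def information_criteria(log_likelihood, n_params, n_obs):
    """BIC, sample-size adjusted BIC, and consistent AIC (lower is better)."""
    deviance = -2.0 * log_likelihood
    return {
        "BIC": deviance + n_params * math.log(n_obs),
        "aBIC": deviance + n_params * math.log((n_obs + 2) / 24),
        "cAIC": deviance + n_params * (math.log(n_obs) + 1),
    }
```

Each candidate model (one to eight classes) would yield one set of criteria, and each encounter is assigned to `modal_class(posterior)` under the chosen model.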
The face validity and clinical utility are examined by comparisons between latent classes using the following approaches: (1) descriptive statistics on patient characteristics and health utilization (structured EHR and census-level data) for each subtype; (2) topic models from natural language processing (clinical notes) and their probability assignment to each subtype; (3) association of subtypes with clinical outcomes (described below).

Structured EHR and census-level data
Individual patient measures from the EHR included the following: (1) demographics and insurance status; (2) Elixhauser mortality score; (3) ICD codes for chronic pain and other disease categories developed by the Agency for Healthcare Research and Quality [23,24]; (4) hospital utilization patterns; (5) admission service (medicine, surgery, trauma); and (6) naloxone administration (only within first three hours of first recorded vital sign). Census tract measures were used as a proxy for individual level socioeconomic status (SES). An application program interface was built to match the housing addresses to corresponding geocodes for all patients in our health system's clinical data warehouse and provide data at the census-tract level, which is equivalent to a neighborhood established by the Bureau of Census for analyzing populations. The data were collected from the 2015 American Community Survey [25] and linked to corresponding geocodes at the patient level. The census-tract measures reported for this study were the following: (1) education level (more than high school vs. high school/less than high school); (2) employment status (employed vs. unemployed); (3) median household earnings; (4) homeowner status (any homeownership vs. none); and (5) poverty level. The census-tract variable for poverty level was shown to represent an important indicator of census-level SES that correlates well with other SES measures [26,27]; therefore, we categorized patients into high-poverty census tracts (20.0+ percent of households below federal poverty level) vs. low- (≤9.9 percent of households below federal poverty level) or middle-poverty census tracts (10.0-19.9 percent of households below federal poverty level) [28].
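The poverty categorization can be expressed as a small lookup. This sketch assumes the census-tract percentage is reported to one decimal place, so that the published cut-points (up to 9.9, 10.0 to 19.9, and 20.0 or more) leave no gaps; the function name is illustrative.

```python
def poverty_category(pct_households_below_poverty: float) -> str:
    """Map a census-tract poverty percentage to low / middle / high."""
    if not 0.0 <= pct_households_below_poverty <= 100.0:
        raise ValueError("percentage must be between 0 and 100")
    if pct_households_below_poverty >= 20.0:
        return "high"
    if pct_households_below_poverty >= 10.0:
        return "middle"
    return "low"
```

In the comparisons described above, "high" tracts are contrasted with "low" and "middle" tracts combined.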
For identifying conditions of chronic pain [23], ICD codes for chronic primary pain, psychogenic pain, chronic postsurgical and posttraumatic pain, and chronic neuropathic pain were included. Additional codes were included for chronic secondary musculoskeletal pain and chronic secondary visceral pain. Codes were excluded for acute pain, chronic cancer-related pain, and chronic secondary headache or orofacial pain. The final list of codes is provided in S2 Appendix.

Unstructured EHR data (clinical notes): Natural language processing and topic modelling
Pre-processing of all clinical notes was performed in the Apache clinical Text Analysis Knowledge Extraction System (cTAKES) [29]. cTAKES is a widely used software library for clinical Natural Language Processing (NLP) [30,31]. The pre-processing steps include text tokenization (splitting into words), sentence segmentation (splitting into sentences), part-of-speech tagging, and clinical concept lookup. The clinical concept lookup identified spans of clinically relevant text, such as 'chronic pain', filtering out non-essential/non-clinical vocabulary. cTAKES maps each concept to a common standardized medical vocabulary in the Unified Medical Language System identified as a Concept Unique Identifier (CUI). This method is an approach to provide structure to otherwise unstructured text data, and the CUIs served as inputs for topic modelling.
Latent Dirichlet Allocation (LDA) is an unsupervised text mining approach used for topic modeling [32]. The LDA model was trained on the full cohort of patients with opioid misuse and produced topics expressed as a probability distribution over CUIs. LDA discovers latent topic structure by finding the mixture of CUIs that is associated with each topic and determining the probability of each topic for the individual patient encounter. We provide topic modelling for two reasons: (1) to demonstrate if the notes reflect what the administrative data and ICD codes describe for the subtypes, a representation of face validity; and (2) unstructured data from notes and reports comprise nearly 80% of the EHR data [33,34] and is a future direction in analyzing patient cohorts for clinical decision support. The optimal number of latent topics was identified by examining a range of topics with model fit statistics for topic coherence [35]. In LDA model development, 250 passes on the clinical notes were performed to learn the topics. To improve the efficiency of topic modelling, we restricted concepts, CUIs, to those observed in less than 70% and more than 10% of the EHR notes to eliminate concepts that were too commonly reported or too rare to be informative. To evaluate the contribution of a topic to an LCA-derived class, we average the probability of that topic across all the encounters in that class.
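Two steps in this paragraph lend themselves to a short illustration: restricting CUIs by document frequency and averaging topic probabilities within an LCA class. The sketch below is written in plain Python rather than with the Gensim calls the study used, since the source does not describe how the filter was implemented; function names and input shapes are assumptions.

```python
from collections import Counter


def filter_cuis_by_document_frequency(notes, min_frac=0.10, max_frac=0.70):
    """Keep CUIs seen in more than min_frac and fewer than max_frac of notes.

    notes: list of notes, each a list of CUI strings.
    """
    n_notes = len(notes)
    # Count each CUI once per note (document frequency, not term frequency).
    doc_freq = Counter(cui for note in notes for cui in set(note))
    keep = {
        cui for cui, df in doc_freq.items()
        if min_frac < df / n_notes < max_frac
    }
    return [[cui for cui in note if cui in keep] for note in notes]


def mean_topic_probability_by_class(topic_probs, classes):
    """Average each topic's probability over the encounters in each class.

    topic_probs: per-encounter lists of topic probabilities.
    classes: per-encounter LCA class labels, aligned with topic_probs.
    """
    sums, counts = {}, Counter()
    for probs, cls in zip(topic_probs, classes):
        if cls not in sums:
            sums[cls] = [0.0] * len(probs)
        for t, p in enumerate(probs):
            sums[cls][t] += p
        counts[cls] += 1
    return {cls: [s / counts[cls] for s in total] for cls, total in sums.items()}
```

Gensim's `Dictionary.filter_extremes` offers a comparable filter, although its lower bound `no_below` is an absolute document count rather than a fraction.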

Clinical outcomes: 30-day unplanned hospital readmission and discharge dispositions
All-cause unplanned readmission was identified using the Center for Medicare and Medicaid Services (CMS) rules for index (eligible) admission and unplanned 30-day readmission [36]. Pre-specified billing codes for planned readmission were used from CMS rules and include obstetrical delivery, scheduled procedures, maintenance chemotherapy, and rehabilitation. To analyze data on diagnoses and procedures that met qualifying criteria for readmission, the Clinical Classification Software from the Agency for Healthcare Research and Quality was used to crosswalk with diagnoses from billing codes in the EHR [37]. Readmissions during the 30-day period that follow a planned readmission are not counted in the outcome. In the case of multiple readmissions during a 30-day period, we measured only one outcome. Readmissions on the same day were also not counted in the outcome. Ultimately, index admissions include any inpatient hospitalization during the study period and are excluded for the reasons described above. Additional outcomes examined include discharge status from the hospital (in-hospital death, psychiatric admission, against medical advice (AMA), and home).
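The counting rules for the readmission outcome can be sketched for a single index admission as follows. This is a deliberately simplified reading of the rules described above, not the CMS algorithm: eligibility of the index admission and the billing-code logic that flags a readmission as planned are assumed to be resolved upstream, and the function name and input shape are hypothetical.

```python
from datetime import date


def has_unplanned_30day_readmission(index_discharge, later_admissions):
    """Binary outcome for one index admission.

    index_discharge: discharge date of the index admission.
    later_admissions: list of (admit_date, is_planned) tuples for the
    same patient's subsequent admissions.
    """
    for admit_date, is_planned in sorted(later_admissions):
        days_after = (admit_date - index_discharge).days
        if days_after <= 0:
            continue  # same-day readmissions are not counted
        if days_after > 30:
            break     # outside the 30-day window
        # The first readmission in the window decides the outcome: an
        # unplanned one counts once, and anything following a planned
        # readmission is not counted.
        return not is_planned
    return False
```

Because the function returns a single boolean, multiple readmissions within the window contribute only one outcome, as described in the text.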
Analysis was performed using Python Version 3.6.5 (Python Software Foundation) and RStudio Version 1.1.463 (RStudio Team, Boston, MA). Latent class analysis was performed with the poLCA package in R (http://dlinzer.github.com/poLCA) and followed the analysis plan by Zhang et al. [22]. Our open-source code to perform LCA may be viewed in S3 Appendix as well as at: https://bitbucket.org/afsharjoycelab/opioid-misuse-lca/. The GenSim package was used in Python to infer topic model structure [38]. The Institutional Review Board of Loyola University Chicago approved this study.

Opioid misuse cohort
The health system had 228,884 inpatient encounters during the study period and 6,224 (2.7%) met inclusion criteria for opioid misuse. In topic modelling, the final data corpus in the 6,224 patient encounters was comprised of 25,801 unique CUIs across 422,147 clinical notes. Twenty topics were identified to have the best model fit from the cohort (S4 Appendix). The top ten CUIs for each topic and a summarized topic theme for each topic are listed in Table 1. The topics spanned themes from chronic medical conditions to behavioral conditions and healthcare services.

Identification of four subtypes of opioid misuse
Fit statistics and model interpretability suggest a 4-class model was optimal. Improvements to model fit begin to diminish around 5 classes (Fig 1). Although lower model fit statistics could be achieved with more classes, the 4-class model provides a meaningful and distinct clinical representation with better distribution of class prevalence and less complexity (Table 2). Details for the 5-class model are shown in S5 Appendix. In the 4-class model, the average latent class probabilities were 0.72 (sd 0.10) for Class 1, 0.87 (sd 0.12) for Class 2, 0.88 (sd 0.11) for Class 3, and 0.95 (sd 0.12) for Class 4, indicating acceptable class separation. In addition, the 4-class model had appropriate face validity from topic modelling (Table 3). Polysubstance use as a topic had the highest probability for all classes.
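The average latent class probabilities reported here summarize how confidently encounters are assigned to their modal class. A minimal sketch of that summary follows, assuming per-encounter posterior vectors are available from the fitted model; whether the study reported a sample or population standard deviation is not stated, and the sample form is used here as an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev


def average_posterior_by_assigned_class(posteriors):
    """Mean and sample SD of the winning posterior within each modal class.

    posteriors: list of per-encounter posterior probability vectors.
    Returns {class_index: (mean, sd)}; sd is 0.0 for a single encounter.
    """
    by_class = defaultdict(list)
    for post in posteriors:
        assigned = max(range(len(post)), key=post.__getitem__)
        by_class[assigned].append(post[assigned])
    return {
        cls: (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for cls, values in by_class.items()
    }
```

Values close to 1 for every class indicate well-separated classes; lower averages, such as the 0.72 reported for Class 1, indicate more overlap with neighboring classes.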
In sensitivity analysis examining patient-level data using the most recent hospital encounter, model fit statistics and interpretability continued to show a 4-class model to be optimal (S6 Appendix). Good class separation was found with average latent class probabilities of 0.92 (sd 0.12) for Class 1, 0.91 (sd 0.12) for Class 2, 0.77 (sd 0.11) for Class 3, and 0.94 (sd 0.14) for Class 4. The patient-level 4-class model represented nearly identical distributions of patient characteristics across demographics, comorbidities, substance use, and SES (S6 Appendix).

Clinical distinctions between the four subtypes
The 6,224 encounters were categorized within one of the four latent classes. Class 1 represents encounters that carried a higher probability for topics on pain procedures and medical conditions associated with chronic pain (pancreatitis and metastatic cancer) (Table 3). In Table 2, approximately half were female, and this class carried the greatest proportion with Medicare insurance among the classes. Nearly one-third had a diagnosis category for chronic pain, and Class 1 had the fewest with urine drug screen testing and the lowest rates of positive tests. All encounters in this class were for opioid-related hospitalizations, with a greater proportion of prior 1-year inpatient and outpatient encounters compared to other classes. Class 1 comprised 36.5% of the cohort and received the label, "High hospital utilization with known opioid-related conditions".
Class 2 are patient encounters with topics that had the highest probability for polysubstance use and mental health conditions and the greatest proportions with Elixhauser ICD codes for psychoses and drug use (Table 3). In Table 2, these patient encounters were mainly for patients 36-55 years of age and the majority were non-Hispanic black with Medicaid and uninsured status. All patient encounters in this class had positive cocaine urine drug screens. This class had the highest proportion of patients living in low socioeconomic census tracts and with the lowest median household income of all classes. Class 2 comprised 12.8% of the cohort and received the label, "Illicit use, low SES, and psychoses".
Class 3 are patient encounters with topics that had the highest probability for alcohol use disorders and liver disease (Table 3). In Table 2, these encounters were for older and largely non-Hispanic white male patients compared to other classes, and represented the highest risk for mortality by Elixhauser score. The proportion with ICD codes for alcohol use disorders and liver disease were greatest in this class. Class 3 comprised 39.2% of the cohort and received the label, "Alcohol use disorders with complications".
Class 4 patient encounters contained topics that had the highest probability for trauma and neurological diseases (seizures) (Table 3). In Table 2, these patients had more encounters in trauma centers with the highest proportion of urine drug testing and positive cases for opioids and benzodiazepines. Over one-third had chronic pain, similar to Class 1 but with more naloxone administration. Class 4 comprised 11.5% of the cohort and received the label, "Low hospital utilization and incidental opioid misuse".
Table 1 notes: The mentions are the concept unique identifiers (CUIs) from the free text of all the clinical documents from the EHR. ‡Overall topic themes were finalized after consensus agreement for face validity between a clinical informatics and critical care specialist, psychiatrist, and addiction specialist. https://doi.org/10.1371/journal.pone.0219717.t001

Clinical outcomes between the four classes/subtypes
The class labelled as "High hospital utilization with known opioid-related conditions" (Class 1) had the greatest proportion with 30-day unplanned hospital readmission at 13.9% (Table 4). This was followed by Class 3 with the label "Alcohol use disorders with complications". Class 2, labelled as "Illicit use, low SES, and psychoses", had the greatest proportions for being discharged to inpatient psychiatry services and leaving against medical advice. Class 4, labelled as "Low hospital utilization and incidental opioid misuse", had the greatest proportion with naloxone administration in the hospital and in-hospital death but the lowest proportion with readmission.

Discussion
We identified a four-class model of clinically interpretable and relevant subtypes for opioid misuse, with good class separation and face validity based on documentation in the notes, structured data, and clinical outcomes. The following distinctions were made for each class: (1) high hospital utilization with opioid-related hospitalizations; (2) illicit use, low SES, and psychoses; (3) alcohol use disorders (AUD) with complications; (4) low hospital utilization and incidental opioid misuse. We demonstrate major differences in comorbidities, utilization patterns, polysubstance use, and SES across subtypes that may help health systems to better understand the needs of their patient population and to identify appropriate treatment options and pathways. Using an LCA approach with EHR data may better inform health systems like ours that serve diverse communities and give a level of detail that is greater than what is reported in regional or state epidemiology and surveillance data.
The subtype with the greatest proportion of patient encounters was "Alcohol use disorders with complications". Prevalence of drug misuse is high among hospitalized patients with an AUD, but this group typically receives lower rates of treatment for addiction following hospitalization [39]. Very sparse data are available on effective treatment methods in patients with both AUD and opioid use disorder (OUD) and more research is needed. One effectiveness trial of injectable extended release naltrexone for patients with OUD found that individuals with alcohol use to intoxication in the 30 days prior to initiating treatment were more likely to relapse to opioid use in comparison to those without alcohol intoxication in the prior 30 days, indicating patients with both OUD and AUD may have worse addiction treatment outcomes [40]. In comparison to other subtypes, the patients in this subtype are older, non-Hispanic white and have higher rates of liver disease, possibly contributing to their higher risk for death. The polysubstance use and associated organ injury represent a subtype of patients that may need more intensive addiction treatment, including higher levels of care such as residential treatment.
Table abbreviations: AMA = left against medical advice; Other = long- and short-term care facilities, jail, police custody.
The "High hospital utilization with known opioid-related conditions" subtype was found to have a diverse group of patients with chronic comorbidities and pain conditions, and all had an opioid-related hospitalization code. Another study identified a similar subgroup using LCA among individuals filling opioid prescriptions [8]. Many of these patients had high utilization including the greatest proportion with unplanned readmissions, and nearly one-third had Medicare insurance. Multiple themes related to chronic pain conditions were present in the notes and one-third also had a billing diagnosis for chronic pain. This suggests these patients may require a treatment approach involving a comprehensive pain management team that incorporates a variety of non-pharmacologic and non-opioid alternatives, which may in turn reduce their need for acute care [6].
In contrast, the "Low hospital utilization and incidental opioid misuse" subtype was found to visit the trauma center more frequently and have a greater proportion of patients with positive urine drug screens for nonmedical opioid and benzodiazepine use. Naloxone was administered more frequently and this subtype had a higher proportion of in-hospital deaths. This subtype has also been previously identified using LCA for polysubstance use among trauma patients [41]. Despite positive urine drug screens, it is not possible to determine whether these patients actually have an OUD that would necessitate treatment, or whether they were occasional opioid users with subsequent trauma, injury or accidental poisoning. Screening, brief intervention, and referral to treatment (SBIRT) is one approach that could be used to identify which patients may benefit from motivational interviewing for unhealthy but infrequent use, versus others being identified with an OUD and needing initiation and/or referral to treatment [42]. Patients identified with OUD in the ED or hospital setting could be offered opioid agonist treatment, such as buprenorphine or methadone prior to discharge with linkage to community treatment. While all subtypes should be offered education on overdose prevention and naloxone prescription [43], this subtype in particular should be prioritized, as they are less likely to re-visit the health system given their low utilization pattern.
Lastly, the "Illicit use, low SES, and psychoses" subtype has the largest proportion of patient encounters with Medicaid or no insurance, suggesting patients in this subtype likely have less access to healthcare. One quarter of patients in this subtype had been diagnosed with a mental health condition in our health system, and this subtype had the highest proportion of patients leaving against medical advice. Buprenorphine initiation during the hospital encounter has been shown to reduce hospitalizations in a similar cohort of patients with heroin use [44]. The high rates of uninsured status, low SES metrics from the census-tract variables (% in poverty, unemployed, education, and median household earnings), and high rates of mental health conditions imply behavioral and social determinants of health are important considerations in this subtype [45]. These patients likely need access to intensive and comprehensive treatment programs that can offer both treatment for mental illness and for OUD (sometimes referred to as "Mental Illness and Substance Abuse" or MISA programs). Additionally, resources such as housing first models that serve individuals experiencing chronic homelessness and living with mental illness and substance use disorders may improve health-related outcomes in this subtype of patients [46,47].
Prior work examining LCA in opioid misuse was focused on self-report data in specific cohorts of individuals. One study examining approximately 200 military veterans [7] used the Overdose Risk Behavioral Scale to identify five subtypes that also revealed a "regular" opioid user category similar to our high-utilizer subtype and separate from their subtypes of occasional users and illicit users, similar to our distinctions in EHR data. In an outpatient community pharmacy study, self-report from approximately 330 surveys showed three classes with labels of mental health, poor health, and hazardous alcohol use [8]. This also matched our labelling of a low SES with poor mental health subtype and a co-substance alcohol subtype. In the largest studies of between 19,000 and 26,000 patients evaluated for substance use treatment programs, LCA also highlighted distinctions between polysubstance use with heroin and cocaine, similar to our illicit use subtype and separate from a prescription drug use subtype [9,10]. With similarities to self-report surveys in subtypes and demographics within subtypes, EHR data may serve as another reliable source to identify clinically distinct subtypes of opioid misuse for targeted interventions.
The heterogeneity in characteristics and outcomes across subtypes presents opportunities to better align intervention types and level of care/intensity of services for each subtype. Other approaches that address heterogeneity in treatment effect include adaptive treatment designs [48] as well as adaptive sampling in community health surveys [49]. Factorial experimental designs have also been used in behavioral health to better characterize individuals [50]. One example is the Sequential Multiple Assignment Randomized Trial (SMART). SMART is an approach that has been more successful than conventional treatment designs for substance use [51,52]. The results from SMART highlight the heterogeneity in treatment effect and the need for better patient identification and allocation for interventions in substance use. Herein we propose a method that caters to a health system's patient population to identify subtypes across a cohort of individuals using LCA to augment adaptive treatment interventions like SMART [53]. This approach may be promising to identify and better address the many barriers in treating opioid misuse.
Methods in machine learning and NLP are important tools to handle the large volume and variability of EHR data. Topic modelling has been used successfully in the EHR to detect themes and relevant concepts in patient care to inform clinical decision support [54,55]. In psychiatry, similar methods have been used to predict readmissions, suicides, and accidental death with substantial improvements in model performance with LDA and other NLP methods [11,12]. However, unlike prior studies, our study first converted all the raw text into standardized medical vocabularies with CUIs to provide a common structural framework that accounts for lexical variations and semantic ambiguities. This may serve as a more interoperable approach between health systems interested in employing these methods.
Several limitations are present in this single-center study. Although we attempt to account for biases introduced in our health system with methods in NLP, topic modeling across sites may not be consistent. In addition, our operational definition for opioid misuse may introduce misclassification bias with a urine drug screen that did not capture synthetic and semi-synthetic opioids. The subtypes identified here require external validation to demonstrate if the class-defining characteristics remain consistent across multiple health systems. Bias may have been introduced by patients with multiple encounters to our health system as well as by not capturing encounters at other health systems. We attempted to reduce this bias by performing sensitivity analysis of patient-level data that continued to show a 4-class model was optimal and represented the same clinical subtypes.

Conclusion
Unsupervised statistical approaches using all domains of the EHR may be leveraged to better identify subtypes of opioid misuse in the patients served by a health system. This is a comprehensive approach to better delineate clinically meaningful subtypes so targeted treatment strategies may be employed for the patients served by the individual health system.

Supporting information
Table. Operational definition algorithm including ICD 9/10 codes. (DOCX)
Table. ICD 9/10 codes for chronic pain. (DOCX)
Table. EHR-specific R code for LCA analysis.