Machine learning with routine electronic medical record data to identify people at high risk of disengagement from HIV care in Tanzania

Machine learning methods for health care delivery optimization have the potential to improve retention in HIV care, a critical target of global efforts to end the epidemic. However, these methods have not been widely applied to medical record data in low- and middle-income countries. We used an ensemble decision tree approach to predict risk of disengagement from HIV care (missing an appointment by ≥28 days) in Tanzania. Our approach used routine electronic medical records (EMR) from the time of antiretroviral therapy (ART) initiation through 24 months of follow-up for 178 adults (63% female). We compared prediction accuracy when using EMR-based predictors alone and in combination with sociodemographic survey data collected by a research study. Models that included only EMR-based indicators and incorporated changes across past clinical visits achieved a mean accuracy of 75.2% for predicting risk of disengagement in the next 6 months, with a mean sensitivity of 54.7% for targeting the 30% highest-risk individuals. Additionally including survey-based predictors only modestly improved model performance. The most important variables for prediction were time-varying EMR indicators including changes in treatment status, body weight, and WHO clinical stage. Machine learning methods applied to existing EMR data in resource-constrained settings can predict individuals’ future risk of disengagement from HIV care, potentially enabling better targeting and efficiency of interventions to promote retention in care.


Introduction
Forty years into the HIV epidemic, disengagement from care and poor antiretroviral therapy (ART) adherence remain central challenges that undermine global efforts for epidemic control. Only 65% of people living with HIV (PLHIV) in eastern and southern Africa have viral suppression, a partial reflection of persistent attrition from care and suboptimal adherence to ART [1]. Especially in sub-Saharan Africa, lifelong retention in HIV care is continually threatened by myriad barriers such as stigma, food insecurity, negative clinic experiences, anticipated or actual side effects, misinformation, "treatment fatigue," and poverty [2]. Consequently, retention is a dynamic process, as PLHIV may default and re-engage in care numerous times over a lifetime, such that 33% of PLHIV starting ART in sub-Saharan Africa between 2009 and 2014 were not alive and/or on ART after five years [3].
Fortunately, there is a burgeoning literature on effective strategies to bolster engagement with HIV care. This includes, for example: enhanced adherence counseling in response to high viremia, an approach common across sub-Saharan Africa [4]; reminder text messages, an inexpensive and highly scalable approach [5]; and short-term financial incentives, an approach that can support habit formation and nudge PLHIV towards HIV care [6,7]. Nevertheless, effect sizes of such interventions are often modest, especially after intervention periods are complete and effectiveness gradually wanes [8,9]. Furthermore, even in well-conducted trials which find that an intervention has significant benefits, a sizable proportion of participants in the comparison group achieve the desired outcomes (e.g., viral suppression) without the intervention. Thus, scale-up of behavioral programs to improve adherence and retention may be difficult to justify, especially in resource-constrained environments.
An alternative approach is to better target interventions to the subset of the population that is most in need, rather than 'one size fits all' approaches that can be costly at scale. A practical challenge of better targeting is that until recently, customized behavioral interventions would have been cumbersome to seamlessly integrate into busy healthcare settings. However, the growth of electronic medical record (EMR) data in sub-Saharan Africa, digital platforms for real-time data collection and analysis, and the application of machine learning to HIV outcomes have increased the possibility for 'precision public health' to efficiently guide timely and appropriate HIV care [10,11]. Compared to traditional group-based or regression approaches, machine learning can better process complex time-varying information such as EMR data, identify non-linear or rare trends, and make accurate predictions about risk. For example, a machine learning-based hospital alert system for early sepsis detection improved patient outcomes compared to the previous scoring system manually tabulated by nurses every 12 hours [12]. Similarly, machine learning-based decision support tools could identify individuals who may benefit from ART adherence counseling, tailored SMS messages, or economic support. Such an approach could better target scarce resources to individuals most in need while minimizing cost.
Despite these potential benefits, applications of machine learning to HIV outcomes remain limited in low- and middle-income country (LMIC) settings that shoulder the greatest burden of HIV [1,11]. A recent review of machine learning in the field of HIV prevention identified seven applications, five of which were in the United States, one in Denmark, and one in Eastern Africa [13]. The latter analysis used sociodemographic data collected by a population-based study in rural Kenya and Uganda to predict risk of HIV acquisition and identify potential pre-exposure prophylaxis (PrEP) candidates [14]. However, implementation of this approach may be limited in real-world clinical settings, where detailed sociodemographic data are not as readily available as they are in research studies.
To establish proof of concept, we applied a machine learning development-and-validation approach to routine EMR data from HIV care and treatment centers in Tanzania to identify those who are in care but at risk of disengagement, with the ultimate goal of better aligning proactive, supportive interventions to those most in need. We additionally explored whether models that also incorporated survey data from a research study could enhance the accuracy of predictions over EMR data alone.

Ethics statement
The Tanzania National Institute for Medical Research and the Committee for Protection of Human Subjects at the University of California, Berkeley provided ethics approval for this study. Written informed consent to participate in the study was obtained from all participants at the time of enrollment.

Data sources and study participants
Risk of disengagement from HIV care was modeled using data about the same group of individuals from two sources: (1) EMR data from the Tanzania national HIV care and treatment center ("CTC3") database and (2) survey data from a randomized trial of financial incentives ("Afya II"), which was conducted at 4 health facilities in Shinyanga region, Tanzania. We restricted this analysis to participants in the trial's control arm in order to model disengagement under the current standard of care, without additional intervention. Trial methods are described elsewhere [7]; briefly, HIV-positive individuals aged 18 years or older who had initiated ART within the past 30 days were recruited during routine clinic visits in 2018. Participants provided informed consent to participate in the study and to share medical records. Sociodemographic surveys were conducted by research assistants at the time of enrollment and 6 months later. For the current analysis, participant medical records from the time of ART initiation through 24 months of follow-up were abstracted from the EMR database on April 8, 2021. This included visits to any HIV care facility in the country that was using the standard EMR (the vast majority of facilities), assuming that patients retained the same unique identification number when transferring between facilities per government protocols (but not if they restarted ART at a new facility without disclosing previous care).

Measures
Outcomes. Our objective was to predict future disengagement from HIV care. We measured disengagement from care using EMR clinic attendance data, defined as ever missing an appointment by 28 or more days (a standard PEPFAR monitoring and evaluation indicator [15]) within specified intervals. This interim outcome, commonly experienced by the study population, was selected for prediction because delayed and missed visits are the first steps toward eventual loss to follow-up, which early intervention would ideally avert. Specifically, our future outcomes of interest were disengagement from care during 6-month intervals (6-12 months, 12-18 months, and 18-24 months), following the standard virologic monitoring schedule in Tanzania. Separate models were developed for each 6-month interval.
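As a concrete illustration, the outcome definition above can be operationalized from appointment records as follows. This is a minimal Python sketch under simplifying assumptions: visits are represented as hypothetical (scheduled, attended) date pairs, which is not the CTC3 schema.

```python
from datetime import date

DAYS_LATE = 28  # standard PEPFAR missed-appointment threshold [15]

def disengaged_in_interval(appointments, start, end, abstraction_date):
    """Return True if any appointment scheduled in [start, end) was
    missed by >= DAYS_LATE days.

    appointments: iterable of (scheduled_date, attended_date_or_None).
    abstraction_date: date the EMR was abstracted, used to judge
        appointments that were never attended.
    """
    for scheduled, attended in appointments:
        if not (start <= scheduled < end):
            continue  # outside the 6-month interval of interest
        if attended is None:
            # Never attended: counts as missed once 28 days have elapsed.
            if (abstraction_date - scheduled).days >= DAYS_LATE:
                return True
        elif (attended - scheduled).days >= DAYS_LATE:
            return True
    return False
```

For example, a visit attended 29 days after its scheduled date would flag the interval as disengaged, while a visit attended 10 days late would not.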
Predictors. Predictors included in models were hypothesized to be correlated with disengagement. EMR-based characteristics included age, sex, and marital status (measured at ART initiation); measures that were regularly collected at clinic visits (typically monthly), including weight (kg), WHO clinical stage (1-4), family planning use, and antiretroviral drug (ARV) status (start, continue, substitution, or stop); and HIV viral load (copies/mL, measured every 6 months). Pregnancy status and tuberculosis treatment were also considered but were omitted due to data sparsity. Time-varying EMR variables included linear change in weight and WHO stage, and linear and quadratic changes in ARV status (S1 Text). Pre-specified predictors of interest from the research study's baseline and 6-month surveys [7,16] included language and education (measured at baseline only), occupation, employment, head of household, household size, household socioeconomic status, cost of transportation to the facility, food insecurity [17], mental health [18], self-rated overall health, hopefulness about future health, and functional status (as defined in Table 1).
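For instance, the linear change in a repeated measure such as weight can be summarized as a least-squares slope across visits. A minimal sketch, assuming visit times are expressed in months since ART initiation (the handling of missing values here is illustrative, not necessarily the study's exact procedure):

```python
import numpy as np

def linear_change(months, values):
    """Least-squares slope of a repeated clinical measure (e.g., weight
    in kg) over visit times in months since ART initiation.

    Missing observations (NaN) are dropped; if fewer than two remain,
    the derived feature is itself left missing (so a surrogate split
    can handle it downstream)."""
    t = np.asarray(months, dtype=float)
    v = np.asarray(values, dtype=float)
    ok = ~np.isnan(v)
    if ok.sum() < 2:
        return float("nan")
    # polyfit with degree 1 returns [slope, intercept]; keep the slope
    return float(np.polyfit(t[ok], v[ok], 1)[0])
```

A quadratic trend (as described for ARV status) would use degree 2 in `np.polyfit` and retain the leading coefficient as well.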

Statistical analysis
Model development and validation. We developed supervised machine learning models with ensemble decision trees to predict risk of disengagement from HIV care. Aggregated decision tree learning is well suited to this analysis, which involves a small sample in which prediction accuracy can be substantially influenced by missing data (common in EMR data). We implemented decision tree modeling in R with the "rpart" package, which handles missingness by finding a surrogate split at each tree node wherever a value is missing (i.e., relying on other, non-missing variables with similar predictive power).
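To illustrate the surrogate-split idea schematically (this is a simplified Python sketch, not rpart's implementation): during training, a surrogate is a backup split on another variable, chosen to maximize agreement with the primary split; at prediction time, observations missing the primary variable are routed by the surrogate instead. The cut-point values below are arbitrary.

```python
import numpy as np

def surrogate_agreement(x_primary, x_candidate, primary_cut, candidate_cut):
    """Fraction of training rows (with both variables observed) that a
    candidate surrogate split routes the same way as the primary split.
    The best-agreeing candidate is kept as the surrogate at that node."""
    ok = ~np.isnan(x_primary) & ~np.isnan(x_candidate)
    return float(np.mean((x_primary[ok] < primary_cut)
                         == (x_candidate[ok] < candidate_cut)))

def route(xp, xs, primary_cut, surrogate_cut, default="left"):
    """Route one observation at a node: primary split if observed,
    otherwise the surrogate split, otherwise the majority branch."""
    if not np.isnan(xp):
        return "left" if xp < primary_cut else "right"
    if not np.isnan(xs):
        return "left" if xs < surrogate_cut else "right"
    return default
```

In this way, an individual with a missing weight measurement, say, could still be routed down the tree using a correlated, observed variable such as WHO clinical stage.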
We used a one-interval-ahead approach to build and validate future outcome models based on past predictors. For each 6-month disengagement risk interval (6-12 months, 12-18 months, and 18-24 months), we modeled observed outcomes using predictor data collected prior to the start of the interval. For example, for 6-12-month disengagement, we developed models using predictors from 0-6 months. This strategy approximated a practical application in which past visit records could be used to train the model, identify the individuals at highest risk for future disengagement, target interventions accordingly, and then continually test and improve model performance after collecting observations for the next interval. Cross-validation was performed using an 80/20 train/test split approach, whereby models were trained on data for 80% of participants and validated on data for the remaining 20% of participants, using random data splitting to ensure fairness in the algorithm [19]. The optimal tuning parameters (number of splits, number of surrogate splits, complexity of a tree, etc.) in each decision tree were selected to minimize the mean squared error evaluated by five-fold cross-validation. Lastly, to reduce the variability in a single decision tree, we adopted bagging and generated 1,000 bootstrap samples to further aggregate the prediction results across multiple trees.
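The development pipeline described above (80/20 train/test split, tuning by five-fold cross-validation, bagging over bootstrap samples) can be sketched as follows. This is a Python/scikit-learn analog under stated assumptions: the study itself used R's rpart, which additionally provides surrogate splits for missing data, and scikit-learn's cost-complexity pruning parameter `ccp_alpha` stands in for rpart's complexity parameter.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

def bagged_risk_model(X, y, n_boot=1000, seed=0):
    """Train a bagged decision tree model for disengagement risk.

    80% of participants are used for training and 20% held out for
    validation; tree complexity is tuned by five-fold CV (Brier score
    as the analog of mean squared error); 1,000 bootstrap resamples
    are aggregated by default to stabilize predictions."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    tune = GridSearchCV(
        DecisionTreeClassifier(random_state=seed),
        {"ccp_alpha": [0.0, 0.001, 0.01, 0.05]},
        cv=5, scoring="neg_brier_score")
    tune.fit(X_tr, y_tr)
    bag = BaggingClassifier(tune.best_estimator_, n_estimators=n_boot,
                            bootstrap=True, random_state=seed)
    bag.fit(X_tr, y_tr)
    risk = bag.predict_proba(X_te)[:, 1]  # predicted 6-month risk score
    return bag, y_te, risk
```

In the one-interval-ahead setup, `X` would hold predictors from the interval preceding the prediction window (e.g., 0-6 month data) and `y` the observed 6-12 month disengagement outcomes.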
For each prediction interval, we compared models using only the most current EMR data at the start of the interval (e.g., most recent weight, WHO clinical stage), as is typical of previous point-of-care risk assessment approaches, to models incorporating time-varying trends since ART initiation. Additionally, we compared models using only routinely collected EMR data to those which added survey data collected via the research study. Specifically, for each 6-month risk prediction interval (6-12 months, 12-18 months, and 18-24 months), we considered three models, using: (1) current EMR data from the last visit before the prediction interval (e.g., the 5-month visit for the 6-12 month prediction period); (2) current and time-varying EMR data from all previous visits (e.g., 0-12 month data for 12-18 month prediction); and (3) current and time-varying EMR data plus the most recent survey data (baseline or 6 months). We also conducted sensitivity analyses using only three predictors per model (targeting a shrinkage of ≤10% [20]) and compared results to assess for potential overfitting. Additionally, we conducted a sensitivity analysis using an alternative approach in which models were trained using observed outcomes and predictors from the same interval (e.g., 0-6 month disengagement modeled using 0-6 month predictors) and validated on observed outcomes from the next interval (e.g., 6-12 months) for the same individuals. This approach would allow model development before future outcomes are observed, offering practical benefit for real-time application in health facilities, although it sometimes involved temporal misalignment between predictors and outcomes during model building (i.e., predictor information was collected after disengagement occurred during the model development period).
Model evaluation. We tested the performance of each model on the corresponding 6-month validation cohorts. We evaluated overall prediction accuracy (proportion with correctly predicted positive or negative disengagement status out of the total population) and area under the receiver operating characteristic curve (AUC). Additionally, we assessed the efficiency of each model for targeting the "highest risk" individuals in the context of limited resources, where not all individuals can receive an intervention [14,21]. For set risk score thresholds corresponding to proportions of the population flagged as "high risk" (ranging from 10-50% targeted, to simulate different intervention scenarios), we evaluated each model's sensitivity (the proportion correctly categorized as "high risk," i.e., who truly went on to disengage from care, out of the total population who disengaged from care in the prediction period) and positive predictive value (PPV, the proportion correctly categorized as "high risk" out of the total population categorized as high risk). Lastly, we computed the importance of each variable based on the reduction in predictive accuracy when removing the predictor of interest from each model.
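The targeting metrics above can be computed directly from predicted risk scores; a minimal sketch:

```python
import numpy as np

def topk_sensitivity_ppv(risk, disengaged, frac=0.30):
    """Sensitivity and PPV when the top `frac` of individuals by
    predicted risk score are flagged as "high risk".

    risk: predicted risk scores for the validation cohort.
    disengaged: observed disengagement outcomes (truthy/falsy).
    """
    risk = np.asarray(risk, dtype=float)
    y = np.asarray(disengaged, dtype=bool)
    k = max(1, int(round(frac * risk.size)))
    flagged = np.zeros(risk.size, dtype=bool)
    flagged[np.argsort(-risk)[:k]] = True   # the k highest-risk individuals
    true_pos = (flagged & y).sum()
    # sensitivity: flagged among all who truly disengaged
    # PPV: truly disengaged among all who were flagged
    return true_pos / y.sum(), true_pos / k
```

Sweeping `frac` from 0.10 to 0.50 reproduces the range of intervention-capacity scenarios considered in the analysis.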

Participants
Of 184 individuals enrolled in the control group of the trial, 1 who was missing all EMR records and 5 who had a death recorded in the EMR during the first 6 months were excluded from this analysis, yielding a sample of 178 participants with a total of 2,698 clinical visits over 24 months of follow-up (mean visits per participant = 17.7, SD: 6.3). Five participants with deaths recorded after 6 months (6-12 months: n = 3; 12-18 months: n = 2) were excluded from models for subsequent prediction periods. In addition, participants who dropped out of care at least 6 months before later prediction periods were excluded from those models (12-18 months: n = 12; 18-24 months: n = 17). All participants completed the research study survey upon enrollment, and 163 participants (91.6% of 178) completed the survey again at approximately 6 months. The majority of participants were female (63.5%), with a mean age of 36 years at 6 months (Table 1).
Similar trends were observed for models of 12- to 18-month and 18- to 24-month risk of disengagement. For all prediction time periods, gains in accuracy resulted from models including time-varying EMR data and, to a lesser extent, sociodemographic survey data (Table 2). Results were similar in sensitivity analyses using only the three most important predictors in each model (S1 Table) and those using the past-outcome development and future-outcome validation approach (S2 Table).

Prediction efficiency
When setting a threshold to identify the top 10% of individuals at highest risk of disengagement from care in the next 6 months (based on the predicted risk score generated by each model), the mean sensitivity over each time period was 25% using current EMR data only; 30% using time-varying EMR data; and 32% using time-varying EMR data along with survey indicators (Fig 1 and S3 Table). In other words, of all the individuals who truly went on to disengage from care, about 3 in 10 would be classified as "high risk" with each model (and hypothetically prioritized for intervention) if positive predictions of disengagement were limited to 10% (e.g., under a scenario where resources restrict the intervention capacity to 10% of the total population). The corresponding mean PPV (proportion of those 10% classified as high-risk who truly disengaged from care) was 68% using current EMR data only; 68% using time-varying EMR data; and 69% using time-varying EMR and survey data (S4 Table). When increasing the risk score threshold to identify the 30% highest-risk individuals, all models achieved a mean sensitivity of over 50% while also maintaining a PPV above 50% (S3 and S4 Tables).

Variable importance
The strongest predictors from any model were time-varying EMR variables, including changes in ARV status, weight, and WHO clinical stage, along with age (Fig 2 and S5 Table). Food insecurity followed as an important variable in models including sociodemographic survey data.

Discussion
Machine learning methods applied to secondary EMR data predicted future risk of disengagement from HIV care in this 2-year proof of concept study of adults who had recently initiated ART in Tanzania. Models performed especially well when incorporating changes in clinical status over time that were captured in the EMR. Additionally including survey indicators collected as part of a research study, such as food insecurity and mental health, only modestly improved accuracy; moreover, these data are not readily available in most clinical settings. To our knowledge, this is the first application of machine learning methods to predict retention in HIV care using routine EMR data in a low- or middle-income country.

Electronic medical records are now standard within HIV care and treatment centers in many LMIC settings, including Tanzania. These readily available data present an untapped opportunity to apply predictive analytics for care optimization. Despite containing fewer variables than the integrated electronic health record (EHR) systems and linked datasets of high-income countries, and despite possibly high levels of missingness, our results demonstrate the potential utility of these EMR data to predict future risk of disengagement from HIV care. We implemented a novel application of machine learning to make the most of the information in the limited EMR data, using decision trees (which can appropriately handle missingness) and incorporating time-varying information into the prediction models (which was strongly predictive of disengagement status). Together, these strategies resulted in a model that could potentially be used as an early warning system for at-risk individuals in HIV care. Our decision tree model using time-varying EMR data achieved a mean PPV of 68% for identifying the top 10% highest-risk individuals as needing intervention, even without including sociodemographic indicators collected by the research study.
In comparison, a machine learning analysis of complex EHR and linked geospatial data to predict dropping out of HIV care (a rarer outcome) in the United States achieved a mean PPV of 35% for identifying the 10% highest-risk individuals [22]. In Switzerland, using EHR data including electronically monitored ART adherence to predict virologic outcomes achieved a PPV of 85% [23]. Our analysis suggests that results comparable to those obtained with detailed data from high-income countries can also be achieved with current EMR data from LMICs.
As electronic record systems for HIV care continue to develop, consideration should be given to the benefits and drawbacks of adding new data collection fields. For example, EHR-based screening tools to assess social and behavioral domains are increasingly used in the United States as part of an effort to provide patient-centered care [24-26]. In our analysis, model performance improved somewhat when including similar survey-based indicators collected by the research study, especially food insecurity (a well-documented barrier to retention in HIV care [27]). However, the strongest predictors in these models remained EMR-based measures, including time-varying ARV status, WHO clinical stage, and weight. Changes in these EMR variables, or the lack thereof, might capture continuity of care or changes in health status that are associated with disengagement from care. Our results suggest that these routinely collected EMR variables are among the strongest predictors of future disengagement from HIV care.
While collecting additional health and social indicators in the EMR could marginally improve risk algorithms and support individualized care, there are critical limitations in the context of already overburdened and resource-constrained settings. This extra data collection would require technological infrastructure, provider time and training to ask sensitive questions, private space within busy clinics, means and protocols for responding to surfaced needs, and data confidentiality protections. Standardized assessment of select social needs may yet be warranted for other reasons, as in the case of mental health and food insecurity, where strong arguments exist for integrated approaches to address these factors within HIV care [28,29]. However, our results show that existing EMR data in HIV care and treatment centers can be used immediately to predict future retention in care and thereby target interventions to those who could most benefit.
A limitation of this study was the relatively small sample size, although two years of follow-up data per participant strengthened our ability to draw inferences. Given the small sample size, relatively large 6-month intervals were used for predicting disengagement from care. In future work, we plan to focus on smaller intervals that would be more actionable in a clinical setting (e.g., to predict whether a patient will come to their next appointment given all information to date). We emphasize that this is a prediction exercise, rather than a causal analysis, whereby we intentionally use all of the available data to describe patterns and then extrapolate predictions into the future. In the applied clinical setting we envision, models would be developed using all of the available data from past visits and history of disengagement, and then used to predict future disengagement.
A critical strength of this study was the use of the national EMR database to capture visits from participants who transferred to different clinics after study enrollment. Still, some attended visits may have gone unobserved if transferred participants were assigned a new unique identification number (a "silent transfer"), or if visits occurred at rural clinics with unstable network connections, where data syncing with the national database happens less frequently. In addition, some attended visits may not have been captured in the EMR because of incomplete or inaccurate data entry. However, our past experience comparing abstracted paper-based clinical records (CTC2 card) to electronic records at the four study clinics has found the EMR data to be of acceptably high quality.
In conclusion, we found that machine learning methods applied to routine EMR data in Tanzania predicted future risk of disengagement from HIV care. Incorporating time-varying EMR information enhanced the prediction accuracy compared to only using point-in-time EMR data. The addition of survey data collected through a research study modestly improved the accuracy. This approach, using EMR data alone or in combination with other data sources, could potentially be used to improve the efficiency of HIV prevention and care programs by targeting supportive interventions to individuals who could most benefit.