Health system measurement: Harnessing machine learning to advance global health

Background
Further improvements in population health in low- and middle-income countries demand high-quality care to address an increasingly complex burden of disease. Health facility surveys provide an important but costly source of information on readiness to provide care. To improve the efficiency of health system measurement, we applied unsupervised machine learning methods to assess the performance of the service readiness index (SRI) defined by the World Health Organization and compared it to empirically derived indices.

Methods
We drew data from nationally representative Service Provision Assessment surveys conducted in 10 countries between 2007 and 2015. We extracted 649 items in domains such as infrastructure, medication, and management to calculate an index using all available information and classified facilities into quintiles. We compared three approaches against the full item set: the SRI, a new index based on sequential backward selection, and an enriched SRI that added empirically selected items to the SRI. We evaluated index performance with a cross-validated kappa statistic comparing classification using the candidate index against the 649-item index.

Results
9,238 facilities were assessed. The 49-item SRI performed poorly against the index using all 649 items, with a kappa value of 0.35. New empirically derived indices with 50 and 100 items captured much more information, with cross-validated kappa statistics of 0.71 and 0.80, respectively. Items varied across the indices and in sensitivity analyses. A 100-item enriched SRI reliably captured the information from the full index: 83% of the facilities were classified into correct quintiles of service readiness based on the full index.

Conclusion
A facility readiness measure developed by global health experts performed poorly in capturing the totality of readiness information collected during facility surveys. Using a machine learning approach with sequential selection and cross-validation to identify the most informative items dramatically improved performance. Such approaches can make assessment of health facility readiness more efficient. Further improvements in measurement will require identification of external criteria, such as patient outcomes, to guide and validate measure development.


Introduction
The current era in global health is marked by pursuit of the Sustainable Development Goals (SDGs) for 2015-2030. In contrast to the Millennium Development Goals, the SDGs include an explicit commitment to universal health coverage and a recognition of the increasingly varied global disease burden, including chronic conditions such as diabetes that require continuous and coordinated health services. [1] High-quality health systems will be necessary to deliver this care and achieve the ambitious health-related SDGs. [2,3] Whether health systems in low- and middle-income countries (LMIC) have the capacity to provide quality care is the subject of increasing scrutiny. [3][4][5][6][7] While national health information systems remain under development in many LMIC, [8] periodic health facility surveys can provide valuable information on health system capacity. [9,10] These assessments can cover hundreds of individual items, from medications to diagnostic tests. The World Health Organization (WHO) has defined the items required to demonstrate basic readiness to provide health services and defined measures to reduce the information from health facility assessments into facility- or service-level indices. [9] For instance, the general service readiness index (SRI) includes 50 items in the domains of basic amenities (infrastructure), infection control, equipment, diagnostics, and medication; it is intended to capture the essential foundation needed to provide basic health services.
Although research using health facility assessments is increasing, [11][12][13][14][15] there is little evidence that information from facility surveys is used to inform national policy on service allocation or health system strengthening. [8] The impact of investments in conducting such surveys (from $100,000 for a small survey to much more for nationally representative system assessments in populous countries [9]) is undermined by the limited use of the resulting data. While the SRI is the most commonly used summary measure for such facility assessments, it uses a fraction of the total information collected. Health policy makers may be reluctant to make use of information from health facility assessments without an indication of the value of such measures in distinguishing better and worse equipped facilities or in representing the overall capacity of a facility. In high-income countries, machine learning approaches have been applied to routine health information data to extract insight from large datasets. [16,17] These approaches employ predictive algorithms that learn from the data without overfitting, with the goal of reducing large, unwieldy data to useful and usable summaries. [18] Application of these methods is limited in low-resource contexts to date, despite their potential utility in deriving insight from data.
The objective of this study is to develop summary measures of health facility capacity using machine learning approaches of sequential selection and cross-validation in order to enhance efficiency of and insights provided by existing health facility assessments. We assess the performance of the SRI in capturing full facility readiness and test new measures to summarize readiness with fewer items.

Ethical approval
The original survey implementers obtained ethical approvals for data collection; the Harvard University Human Research Protection Program deemed this analysis exempt from human subjects review.

Study sample
We identified the Service Provision Assessment (SPA) surveys as the most detailed nationally representative health system assessments and included all assessments conducted between 2006 and 2015 (pre-2006 assessments focused on either HIV or maternal and child health alone). SPA surveys were conducted in ten countries in this decade: Bangladesh, Haiti, Kenya, Malawi, Namibia, Nepal, Rwanda, Senegal, Tanzania, and Uganda, with repeat surveys in Tanzania (2006 and 2015) and three annual surveys in Senegal (2013, 2014, 2015). In most countries, the SPA draws a representative sample of both public and private health facilities, with stratified sampling in urban and rural locations and oversampling of hospitals. Haiti, Malawi, Namibia, and Rwanda conducted a census or near census of all health facilities; Bangladesh did not sample small private facilities. Facilities assessed in both the 2013 and 2015 waves of the Senegal SPA were dropped from the 2013 data to eliminate duplicate observations. We also excluded health huts, extension facilities in Senegal that were assessed using an abbreviated survey.
Each survey entailed a facility audit consisting of interviews with facility and service managers and direct verification of the resources available for care, including management, staff, supplies, equipment, medication, and diagnostics. Areas assessed include facility-wide resources and services such as central pharmacy and laboratory as well as specific clinical services such as HIV, delivery care, and child health; common items such as infection prevention measures are repeated across multiple services. The assessment tool was modified in 2012 to include basic readiness for non-communicable disease care and minor surgery in addition to its prior focus on maternal health, child health, and infectious diseases.

Facility readiness indices
We defined two summary measures for each facility: SRI based on the 2013 definition from WHO [19] and an index based on all available items of facility readiness in each survey. All items are binary, with 1 indicating the item was observed present (and functional as applicable) and 0 indicating the item was not present or could not be assessed, e.g. due to the relevant service not being offered. Due to the evolution of the survey and country-to-country variation, between 37 and 49 of the 50 items in the SRI definition could be extracted for each country. Items fell into 5 domains: infrastructure (7 items), equipment (6 items), infection prevention (9 items), diagnostic capacity (8 items), and medication (20 items). Items were averaged within domain; domain scores were averaged to provide the final index, ranging from 0 to 1. Following the logic of the SRI, we extracted all possible items in these five sub-domains as well as an additional domain of human resources and management in order to capture all inputs to care assessed in the SPA, a total of 649 items. Items were averaged within domain, and the overall index was calculated as the average readiness across these six domains (0 to 1).
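The two-level averaging described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `readiness_index`, the item names, and the toy domain map are hypothetical, and the sketch assumes every listed item has already been coded 0 or 1.

```python
def readiness_index(items, domains):
    """Average binary items within each domain, then average the domain scores.

    items   -- dict mapping item name to 0 or 1 (hypothetical names)
    domains -- dict mapping domain name to the list of item names it contains
    """
    domain_scores = []
    for names in domains.values():
        values = [items[name] for name in names]
        domain_scores.append(sum(values) / len(values))
    # Each domain carries equal weight regardless of how many items it holds.
    return sum(domain_scores) / len(domain_scores)


# Toy facility with two domains (illustrative only, not SPA variable names).
domains = {
    "infrastructure": ["power", "improved_water"],
    "medication": ["amoxicillin", "ors", "paracetamol", "zinc"],
}
facility = {"power": 1, "improved_water": 0,
            "amoxicillin": 1, "ors": 1, "paracetamol": 1, "zinc": 0}
index = readiness_index(facility, domains)  # (0.5 + 0.75) / 2
```

Because domains are weighted equally, an item in a small domain moves the index more than an item in a large one; here the domain-averaged index differs from the simple mean of the six items (4/6).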

Analysis
We report descriptive statistics using the survey sampling weights. To identify better and worse performing facilities in terms of overall readiness, we classified facilities into quintiles using the full 649-item index. We then classified facilities into quintiles using the original SRI and compared the two classifications using the kappa statistic, a measure of inter-rater reliability.
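As a rough sketch of this comparison, the snippet below assigns quintiles by rank and computes a kappa statistic between two classifications. It rests on assumptions the text does not specify: an unweighted Cohen's kappa, ties broken by position in the list, and no survey weights. The helper names `quintiles` and `cohen_kappa` are illustrative.

```python
def quintiles(scores, k=5):
    """Assign each score to a rank-based group 0..k-1 (ties broken by position)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [0] * len(scores)
    for rank, i in enumerate(order):
        groups[i] = rank * k // len(scores)
    return groups


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa between two equal-length label lists."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the marginal share of each label in a and b.
    p_exp = sum((a.count(g) / n) * (b.count(g) / n) for g in set(a) | set(b))
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)


# Toy scores for five facilities under two hypothetical indices.
full_index = [0.10, 0.50, 0.30, 0.90, 0.70]
short_index = [0.20, 0.40, 0.60, 0.80, 0.70]
kappa = cohen_kappa(quintiles(full_index), quintiles(short_index))
```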
In this analysis, no external information was available to serve as a source of validation for facility readiness; we undertook unsupervised machine learning using the information from within the dataset, namely the 649-item index, as the reference criterion. The development of new readiness indices using this machine learning approach involved two steps: selection and evaluation. We first implemented sequential backward and sequential forward selection of individual items. [20] Sequential backward selection entailed discarding one item of the 649, recalculating the index and reclassifying facilities into quintiles, and calculating a kappa statistic to assess performance against the original measure. This procedure was applied for each of the 649 items, with the item whose exclusion resulted in the least loss of reliability (highest kappa statistic) dropped and the procedure repeated on the remaining items. Sequential forward selection is similar but started with an empty set and tested all possible single-item measures before retaining the best performing item based on the kappa statistic and repeating. We compared backward and forward selection based on the magnitude of the kappa statistic for a given number of items.
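The backward procedure can be sketched on simulated data as below. This is an illustration under simplifying assumptions, not the study's implementation: the index is a plain mean of the retained items (the study averages within domains first), the kappa is unweighted, quintile ties are broken by position, and all function names and the random toy data are hypothetical.

```python
import random


def quintiles(scores, k=5):
    """Rank-based groups 0..k-1, ties broken by position."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [0] * len(scores)
    for rank, i in enumerate(order):
        groups[i] = rank * k // len(scores)
    return groups


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa between two label lists."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_exp = sum((a.count(g) / n) * (b.count(g) / n) for g in set(a) | set(b))
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)


def index(rows, cols):
    """Simplified index: mean of the selected binary items for each facility."""
    return [sum(row[c] for c in cols) / len(cols) for row in rows]


def backward_selection(rows, n_keep):
    """Drop one item at a time, always the one whose removal keeps kappa highest."""
    kept = list(range(len(rows[0])))
    reference = quintiles(index(rows, kept))  # quintiles of the full index
    dropped = []
    while len(kept) > n_keep:
        best_col, best_kappa = None, float("-inf")
        for col in kept:
            trial = [c for c in kept if c != col]
            kappa = cohen_kappa(reference, quintiles(index(rows, trial)))
            if kappa > best_kappa:
                best_col, best_kappa = col, kappa
        kept.remove(best_col)
        dropped.append(best_col)
    return kept, dropped


# Simulated data: 60 facilities, 8 binary items (not SPA data).
rng = random.Random(0)
rows = [[rng.randint(0, 1) for _ in range(8)] for _ in range(60)]
kept, dropped = backward_selection(rows, n_keep=3)
```

The order of `dropped` records the elimination sequence, so a shorter index of any size can be read off by stopping the sequence earlier. Forward selection would mirror this loop, starting from an empty set and adding the item that raises kappa most.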
To evaluate the performance of an index of a given number of items, M, we used a 10-fold cross-validation procedure. The data were randomly partitioned into 10 roughly equal-sized parts. Nine parts were taken as training data and used to choose the M items using selection as described above. To obtain the cross-validated kappa statistic for each M-item score, we chose the best M items in the training data and calculated the new index based on those items in the tenth part. We repeated item selection in each training set and calculated the resulting index in the validation fold until all facilities had an M-item index determined by the other 9 folds. Indices were based on the same number of items for each fold, but the specific items may have differed between folds, at least in part, as selected by the training data. We then classified facilities into quintiles by their M-item index and computed a kappa statistic to assess performance of the M-item index against the original 649-item measure. This statistic provided an estimate of expected performance of an index with a given number of items. The procedure was conducted for all possible numbers of items, from 1 to 648. Following cross-validation, we chose the items for each M-item index using the full data set and selecting items 1 to M by order of selection.
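A compact sketch of this cross-validation loop follows. It carries the same caveats as any toy version: the index is a plain mean of items rather than a domain average, kappa is unweighted, folds are formed by a seeded shuffle, and the names `cross_validated_kappa` and `backward_selection` are illustrative rather than taken from the study.

```python
import random


def quintiles(scores, k=5):
    """Rank-based groups 0..k-1, ties broken by position."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [0] * len(scores)
    for rank, i in enumerate(order):
        groups[i] = rank * k // len(scores)
    return groups


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa between two label lists."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_exp = sum((a.count(g) / n) * (b.count(g) / n) for g in set(a) | set(b))
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)


def index(rows, cols):
    """Simplified index: mean of the selected binary items."""
    return [sum(row[c] for c in cols) / len(cols) for row in rows]


def backward_selection(rows, m):
    """Return the m items retained by sequential backward selection."""
    kept = list(range(len(rows[0])))
    reference = quintiles(index(rows, kept))
    while len(kept) > m:
        best = max(kept, key=lambda col: cohen_kappa(
            reference, quintiles(index(rows, [c for c in kept if c != col]))))
        kept.remove(best)
    return kept


def cross_validated_kappa(rows, m, n_folds=10, seed=0):
    """Select m items on the training folds, score the held-out fold, repeat."""
    ids = list(range(len(rows)))
    random.Random(seed).shuffle(ids)
    folds = [ids[f::n_folds] for f in range(n_folds)]
    held_out = [0.0] * len(rows)
    for fold in folds:
        in_fold = set(fold)
        train = [rows[i] for i in ids if i not in in_fold]
        cols = backward_selection(train, m)  # items may differ between folds
        for i in fold:
            held_out[i] = sum(rows[i][c] for c in cols) / m
    full = quintiles(index(rows, list(range(len(rows[0])))))
    return cohen_kappa(full, quintiles(held_out))


# Simulated data: 50 facilities, 6 binary items (not SPA data).
rng = random.Random(1)
rows = [[rng.randint(0, 1) for _ in range(6)] for _ in range(50)]
kappa_m3 = cross_validated_kappa(rows, m=3)
```

Each facility's M-item score comes from items chosen without that facility's own data, which is what makes the resulting kappa an estimate of out-of-sample performance.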
We first developed new indices with no pre-specified items and plotted performance using the cross-validated kappa statistic for indices of 648 to 1 item. As a second approach, we developed an enriched SRI with empirically selected items added to the existing SRI and assessed the performance of this enriched metric from 648 items to 50 items. As sensitivity analyses, we repeated the analysis within subsets of facilities (hospitals and non-hospitals) and by tertiles and deciles instead of quintiles.
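One way to read the enriched approach is as backward selection in which the SRI items are protected from removal. The text does not spell out the mechanics, so the sketch below is an assumption: `enriched_selection`, the `protected` argument, the plain-mean index, and the simulated data are all hypothetical.

```python
import random


def quintiles(scores, k=5):
    """Rank-based groups 0..k-1, ties broken by position."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [0] * len(scores)
    for rank, i in enumerate(order):
        groups[i] = rank * k // len(scores)
    return groups


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa between two label lists."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_exp = sum((a.count(g) / n) * (b.count(g) / n) for g in set(a) | set(b))
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)


def index(rows, cols):
    """Simplified index: mean of the selected binary items."""
    return [sum(row[c] for c in cols) / len(cols) for row in rows]


def enriched_selection(rows, protected, n_keep):
    """Backward selection that never removes items in `protected`."""
    kept = list(range(len(rows[0])))
    reference = quintiles(index(rows, kept))
    # The index can never shrink below the protected (predefined) item set.
    while len(kept) > max(n_keep, len(protected)):
        candidates = [c for c in kept if c not in protected]
        best = max(candidates, key=lambda col: cohen_kappa(
            reference, quintiles(index(rows, [c for c in kept if c != col]))))
        kept.remove(best)
    return kept


# Simulated data: 40 facilities, 8 items; items 0 and 1 stand in for SRI items.
rng = random.Random(2)
rows = [[rng.randint(0, 1) for _ in range(8)] for _ in range(40)]
kept = enriched_selection(rows, protected={0, 1}, n_keep=4)
```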
We selected empirically defined indices of 50 and 100 items (equivalent to or twice as long as the original SRI, respectively) and an enriched SRI index of 100 items as candidate shorter measures. We classified these indices into quintiles and compared this classification to quintiles of readiness based on the full 649-item index using percent agreement and a kappa statistic calculated on the full sample.
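The comparison of quintile classifications can be summarized from a cross-tabulation, as in this small sketch. The function names and the ten toy facilities are hypothetical; the two summaries shown are the share of facilities on the diagonal (percent agreement) and the share misclassified by more than one quintile.

```python
def agreement_table(reference, candidate, k=5):
    """Cross-tabulate reference quintile (row) against candidate quintile (column)."""
    table = [[0] * k for _ in range(k)]
    for r, c in zip(reference, candidate):
        table[r][c] += 1
    return table


def summarize(table):
    """Return (share in the same quintile, share off by more than one quintile)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    exact = sum(table[i][i] for i in range(k))
    far = sum(table[i][j] for i in range(k) for j in range(k) if abs(i - j) > 1)
    return exact / n, far / n


# Toy quintile labels (0 = lowest, 4 = highest) for ten facilities.
reference = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
candidate = [0, 1, 2, 3, 4, 1, 1, 2, 4, 2]
percent_agreement, share_far = summarize(agreement_table(reference, candidate))
```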

Results
A total of 9,976 facilities were selected for assessment from master facility lists; 9,690 assessments were successfully conducted (97.1% response). We excluded 452 assessments in Senegal from the analysis (191 that were repeated surveys and 261 health huts) for an analytic sample of 9,238 health facilities in ten countries (Table 1). The average SRI ranged from 0.42 in Bangladesh and Uganda to 0.70 in Namibia. Readiness based upon all 649 items was consistently less than the SRI: average readiness exceeded 0.50 in Namibia alone and fell below 0.40 in most countries.
The kappa statistic for SRI and the full index was 0.35, indicating these indices agreed on facility classification 35% of the time beyond chance alone. This kappa value suggests minimal agreement. [21] This lack of agreement is further illustrated in the first panel of Table 2: SRI as defined by the WHO classifies facilities divergently from the full index, with only 4,445 facilities (48%) classified in the same quintiles. While no facilities in the best group for one index were in the worst for the other, 275 facilities in the top two quintiles of all facilities based on the SRI were in the bottom two quintiles using the full index.
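The reported kappa can be reproduced approximately from the percent agreement. The calculation below assumes an unweighted Cohen's kappa and that each index places exactly 20% of facilities in each quintile, so that chance agreement is $p_e = 5 \times 0.2^2 = 0.2$; neither assumption is stated in the text.

$$
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_o = \frac{4445}{9238} \approx 0.481, \qquad \kappa \approx \frac{0.481 - 0.2}{1 - 0.2} \approx 0.35
$$

Under these assumptions, agreement in 48% of facilities corresponds to recovering about 35% of the gap between chance agreement and perfect agreement.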
Sequential backward selection required more computing time (4.6 vs. 2.3 hours) but outperformed sequential forward selection in all analyses based on the cross-validated kappa statistic at any given number of items (S1 and S2 Figs); we thus present results from sequential backward selection only. Indices selected using sequential backward selection with no prespecified items performed very well against the full index, particularly when large numbers of items were retained: cross-validated kappa exceeded 0.88 for at least 200 items and declined to 0.80 for 100 items and 0.71 for 50 items (Fig 1). The performance of the 50- and 100-item indices, measures that could provide considerable efficiency by cutting the facility assessment to under 20% of its current length, is detailed in Table 2 Panels B and C. These empirical indices outperform the original SRI compared to quintiles based on the full index, with 80% of facilities (7,412) classified correctly by the 50-item empirical index and 87% of facilities (8,051) classified correctly by the 100-item empirical index. With 100 items, no facility is misclassified by more than one quintile.
The content of the 100-item empirical index is shown in S1 Table; it included only 8 SRI items. The selected items reflect the breadth of the facility assessment rather than a coherent picture of facility readiness; for example, 11 of 16 amenities items pertain to client privacy, while four different diagnostic items address the availability of rapid HIV tests in distinct areas of the facility. Sensitivity analyses limiting the sample to hospitals or non-hospitals or assessing the performance relative to the full index using tertiles or deciles returned comparable results in terms of improving on the original SRI, but with substantial differences in the list of items selected (results not shown). The second approach attempted to enrich the SRI. The enriched SRI performed comparably to the empirical indices with large numbers of items, with reliability declining more sharply below 150 items (Fig 1). The cross-validated kappa statistic for a 100-item enriched SRI was 0.75 (compared to 0.80 for the 100-item empirical index). As shown in Table 2 Panel D, 83.0% of the facilities (7,663) were correctly classified into quintiles by the 100-item enriched SRI, with no facilities misclassified by more than one quintile. Below 100 items, reliability declined substantially to 0.37 for 50 items (the original SRI plus 1 item). A comparison of the 100-item enriched SRI and original SRI shows the substantial improvements contributed by the additional items: 49% of facilities disagree by at least one quintile, including 8% by more, with a kappa statistic of 0.39 (S2 Table). Table 3 lists the items retained in the 100-item enriched SRI, sorted by domain and, for those not included in the original SRI, their selection order. These 100 items are relatively evenly distributed across the six domains, ranging from 13 equipment items (6 in the original SRI) to 24 medications (19 in original SRI). Selected items include both facility-wide attributes such as a daily update of medication availability and service-specific items such as privacy in the family planning exam room or syringes in sick child rooms. Thirty-five of 100 items are present in both the enriched SRI and purely empirical indices.

Discussion
This study of over 9,000 health facilities in ten countries is the first effort to apply machine learning to derive insight and improve efficiency of health facility survey data in LMICs. The results demonstrate that the SRI as defined by the WHO captures only a portion of the information contained in detailed facility assessments and may result in highly divergent classification of facilities as poorly or well prepared to provide high-quality care. Purely empirical indices captured much of the information of the full survey with many fewer items, although the items selected varied across sensitivity analyses. Enriching the SRI with additional items provided a blended approach between normative guidelines and empirical assessment. A 100-item index incorporating the SRI items proved reliable in capturing the full information contained in the facility surveys. This work demonstrates that the unsupervised machine learning approach applied (sequential backward selection with cross-validation for evaluation [22]) provides a feasible approach to extract shorter measures from the data collected during health facility assessments as a step towards enhancing the use of these surveys. Further insights into health system performance will require better data for linking health facility inputs to patient perspectives and health outcomes.
Existing research on health facility readiness focuses primarily on describing overall readiness, [11,12] identifying gaps in particular services, [23] and linking readiness to outcomes such as health service utilization. [24] These studies have identified deficiencies in SRI in multiple countries, from low-income nations like Malawi and Haiti to less poor countries like Kenya and Namibia, [11,12] as well as low correlation between readiness and processes of care. [25] The findings of this work suggest that overall service readiness as measured by all input items in the SPA surveys is even lower in most health facilities than indices such as the SRI that focus on basic elements of readiness. This result adds to the growing recognition of deficiencies undermining the quality of facility infrastructure available in health systems in low- and middle-income countries. [3] In addition, the low concordance between rankings based on the SRI and those based on all available survey information demonstrates that the SRI is not a good proxy for readiness based on the full survey.
Can the SRI be improved? Using all readiness items in the SPA surveys as a guide, we developed considerably shorter indices that classified most facilities into quintiles correctly as compared to the full measure, with a cross-validated kappa statistic of 0.80 for the 100-item index. More efficient measurement is possible without losing much insight on facility readiness. However, the instability of this measure in terms of the items selected across sensitivity analyses and its lack of coherence suggest it may not be a compelling option for policy makers. This shortcoming may reflect the survey content as a whole: the breadth of items (including repeated assessment of common items such as privacy and infection control measures) and the lack of predefined summary measures fit for purpose mean that the reference point itself does not provide consistent insight on full facility readiness. A blended approach combining the predefined SRI items with items added through empirical assessment provided shorter indices balancing normative coherence with concordance with the classification based on full information. The percent agreement for quintiles using the 100-item enriched SRI compared to the 649-item index was 83%, with a cross-validated kappa of 0.75, suggesting moderate inter-rater reliability. [21] The methods applied here can be refined to suit the needs of individual countries or analysts in terms of extracting insight from health facility data. The content of the enriched index highlights the range of items included in the SPA surveys, including fixed infrastructure such as private rooms, major assets such as ambulances, and stocked items such as medication, supplies, and infection control measures that may fluctuate between available and out of stock on a regular basis.
One limitation of measuring SRI in periodic surveys, no matter how well designed, is that information on physical stock is quickly out of date. Routine health information systems may be better placed to assess items that go out of date quickly such as medication stock, while periodic surveys may be best positioned to capture more costly but valuable measures such as facility function or performance.
Prior applications of machine learning analysis in health include questions such as predicting future disease or mortality using electronic health records [26,27] or population based surveys. [17] While more limited, applications in health services research include efforts to improve risk adjustment for health insurance plan payments. [16] These studies have demonstrated improvements in synthesizing large and complex data into summary measures or predictions, using tools such as cross-validation. The current study confirms that such methods can be applied to derive insight from global health data as well.
One important limitation of the work is the lack of an external criterion, such as mortality, treatment success, or retention in care, to guide empirical selection. A supervised learning approach anchors the selection to a meaningful outcome and can identify reduced sets of variables that predict this health outcome, and presumably related outcomes, as well as or better than the full set. [16] In the unsupervised analysis employed here, we are able to identify efficiencies in capturing the full set of items but not improve beyond what can be accomplished by this full set. Moving towards more efficient data collection would ideally draw on internal or external outcome data to validate the indices. Linking health system and population data in order to attribute population outcomes to the health system is a difficult undertaking in low-resource settings at the moment. [28] Without such external information, however, efforts to streamline data collection and enhance its utility are constrained. Increased coordination in data collection and country-led synthesis may be necessary to obtain linked health system quality and patient outcome data to enable more complex analysis of health system capacity and performance.
Other study limitations include inconsistencies in SPA surveys over time and between countries that prevented comparison of identical measures across countries. Survey questions were skipped if a service was not offered in a given facility; we set all such items to zero on the basis that the resources were not demonstrably present. Finally, the SPA surveys are cross-sectional and do not capture change over time or fluctuations in readiness; although these differences should not affect the main findings of this analysis, they limit the generalizability of the descriptive results to current health system readiness.
The findings of this analysis suggest that collecting an extensive number of items in each facility assessment is an inefficient use of resources and one that should be reconsidered as global and national stakeholders turn greater focus to health system capacity and performance. Moving forward, health system measurement should: 1) predefine the purpose of the data, including the form and purpose of the intended summary measures for synthesis and use of results, 2) optimize efficiency by blending expert opinion and empirical methods for the selection of items, and 3) include external items such as patient outcomes for validation. Better insight and informed action for health system strengthening are achievable and will prove to be important elements in improving the quality of care provided worldwide.