Claims-based algorithms for common chronic conditions were efficiently constructed using machine learning methods

Identification of medical conditions using claims data is generally conducted with algorithms based on subject-matter knowledge. However, these claims-based algorithms (CBAs) are highly dependent on the researcher's level of knowledge and are not necessarily optimized for the target conditions. We investigated whether machine learning methods can supplement researchers' knowledge of target conditions in building CBAs. We conducted a retrospective cohort study using a claims database combined with annual health check-up results of employees' health insurance programs for fiscal years 2016-17 in Japan (study population: hypertension, N = 631,289; diabetes, N = 152,368; dyslipidemia, N = 614,434). We constructed CBAs with logistic regression, k-nearest neighbor, support vector machine, penalized logistic regression, tree-based models, and neural networks for identifying patients with three common chronic conditions: hypertension, diabetes, and dyslipidemia. We then compared their association measures using a completely held-out test set (25% of the study population). Among the test cohorts of 157,822, 38,092, and 153,608 enrollees for hypertension, diabetes, and dyslipidemia, 25.4%, 8.4%, and 38.7%, respectively, had a diagnosis of the corresponding condition. The areas under the receiver operating characteristic curve (AUCs) of the logistic regression with/without subject-matter knowledge about the target condition were .923/.921 for hypertension, .957/.938 for diabetes, and .739/.747 for dyslipidemia. The logistic lasso, logistic elastic-net, and tree-based methods yielded AUCs comparable to those of the logistic regression with subject-matter knowledge: .923-.931 for hypertension; .958-.966 for diabetes; .747-.773 for dyslipidemia. We conclude that machine learning methods can attain AUCs comparable to the conventional knowledge-based approach in building CBAs.


Introduction
A growing body of studies using medical and pharmacy claims data has been conducted in various fields of health research [1][2][3][4][5][6][7]. Among them, a notable amount of research has used claims data to assess medical conditions [1,2,6]. Despite their large volume of information and highly standardized format, however, claims data are frequently criticized for potential imprecision in the identification of medical conditions, mainly because they are primarily generated for reimbursement purposes [8][9][10][11][12].
To address these concerns, many studies have proposed a claims-based algorithm (CBA) for identifying patients with a target condition and computed association measures to assess the usability of the algorithm [9,10,. Previous studies have followed a knowledge-based, condition-specific CBA construction procedure: researchers selected input variables and decided how to incorporate them into the CBA based on their experience or existing clinical knowledge of the target condition. Although this approach is widely used and intuitively plausible, it is highly dependent on the level of knowledge about the target conditions, which makes it hard to obtain appropriate and reproducible CBAs. This has posed challenges to the use of administrative data in the transition from the ICD-9 to the ICD-10 coding scheme in the United States [42,43].
Moreover, since previous CBA studies come predominantly from North American countries, research using diagnoses derived from North American claims data can largely be backed by a corresponding CBA study. In contrast, despite the rapid increase of research using diagnoses derived from claims data in other countries, e.g., Japan and Taiwan, CBAs have not been established for most medical conditions thus far [44]. Notably, the lack of a validated CBA not only degrades the quality of research but also makes it extremely difficult for the research to be accepted by high-impact journals [45]. For this reason, researchers using claims data in these countries face an urgent need to establish CBAs for various medical conditions.
To this end, some researchers have applied conventional regression methods to develop CBAs that are less dependent on such knowledge [9,14,17,19,24,35,37,40]. However, a selection of input variables is still required before fitting a regression model to obtain a satisfactory CBA, as conventional regression methods often predict poorly when the number of input variables is large relative to the sample size [46]. Moreover, if researchers expect nonlinear or interactive effects of the input variables, they have to specify those terms a priori in the functional form of the regression model.
Machine learning methods are promising technologies for overcoming these problems, and some researchers have attempted to use them in the context of CBAs [18,25,29,30,39]. However, those studies selected the input variables according to their target condition; to apply their procedures to other conditions, one must start over from the variable selection. Additionally, which machine learning methods are better suited than others for developing CBAs has not been established.
In this study, using a large database of employees' health insurance programs, we developed CBAs with selected machine learning methods for identifying patients with three common chronic conditions: hypertension, diabetes, and dyslipidemia. We then compared their association measures using a hold-out test set.

Institutional settings
The Japanese government provides a universal health insurance program for all registered inhabitants. In addition, each employer is obliged by law to provide annual health check-ups to its employees. Medical and pharmacy claims data combined with annual health check-up results of employees' health insurance programs were obtained in an anonymized format from JMDC Inc. [47]. Further details on the institutional settings have been described previously [33].
Claims data contain enrollee information, including gender, month and year of birth, diagnostic codes, medical institutions, pharmacies, and medical treatments provided. Diagnostic and medication codes are classified by the 2003 version of the International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10) [48] and the 2016 version of the World Health Organization Anatomical Therapeutic Chemical classification (WHO-ATC) [49], respectively. Enrollees' age was defined as their age in March 2018. Annual health check-up results include the results of the physical examination and blood test, whether fasting blood samples were collected, and the answers to a health-related questionnaire including questions on medication usage. The study protocol was approved by the research ethics committee of the University of Tokyo (approval number: KE18-44). The ethics committee waived informed consent because this is a retrospective study using data that were fully anonymized before we accessed them. JMDC Inc. applies strict policies to protect the privacy of enrollees and medical providers, and all private information that could identify enrollees or medical providers was removed beforehand [47].

Study population
The study population for each condition of hypertension, diabetes, and dyslipidemia was defined as beneficiaries (1) who were enrolled in the claims database from April 1, 2016, to March 31, 2018, and whose health check-ups were sequentially conducted for fiscal year (FY) 2016 and FY2017 (N = 1,040,351), (2) with complete data on the self-reported use of blood pressure- and lipid-lowering drugs and hypoglycemic drugs for FY2016 and FY2017 (N = 944,717), (3) who in FY2017 visited a clinic/hospital that mainly specializes in internal medicine (N = 631,731), and (4) with complete data on the examination results required for the gold standard of each condition, described later, for FY2016 and FY2017 (hypertension, N = 631,289; diabetes, N = 152,368; dyslipidemia, N = 614,434) (Fig 1).
In similar studies to date, chart review has often been the source of the gold standard, with the population used to calculate association measures restricted to those who visited primary care facilities [15,16,19]. To make the present study comparable to past research, we restricted the study population to those who, at least once in the fiscal year, had visited a clinic/hospital that mainly specializes in internal medicine, which serves the function of primary care in Japan.

Gold standard and claims-based algorithm
We constructed a gold standard to diagnose each condition from the health check-up results of FY2016 and FY2017 as previously described (Table 1) [33]. We used FY2017 claims data as the source of the CBA and compared it with the diagnosis derived from the gold standard. The scheme of using one-year claims data corresponding to the latter health check-up year for developing CBAs is the same as that of the previous study [33].
To construct CBAs, we first set up a dataset containing input variables that can be chosen without subject-matter knowledge of the target conditions, namely, age, gender, and the number of observations of each ICD-10/WHO-ATC code consisting of a letter followed by two digits (main dataset). We counted the observations of an ICD-10/WHO-ATC code on claims as one occurrence when the information was accrued within the same month. We excluded ICD-10 codes for suspected cases and counted ICD-10 codes regardless of whether they were listed as primary diagnoses. We then applied the following popular machine learning methods to the dataset: (1) k-nearest neighbor (kNN), (2) support vector machine (SVM), (3) penalized logistic regression, (4) tree-based models, and (5) neural networks.
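The counting rule above (one occurrence per two-digit code per enrollee per month) can be sketched in Python; this is an illustration only, as the study was implemented in R, and the flat claims layout assumed here is hypothetical.

```python
from collections import defaultdict

def build_feature_counts(claims):
    """Build the 'main dataset' counts: for each enrollee, the number of
    distinct months in which each two-digit ICD-10/WHO-ATC code appears.
    `claims` is a list of (enrollee_id, code, month) tuples; codes for
    suspected diagnoses are assumed to have been filtered out beforehand."""
    seen = set()                 # (enrollee, code, month) triples count once
    counts = defaultdict(int)
    for enrollee, code, month in claims:
        key = (enrollee, code[:3], month)   # letter followed by two digits
        if key not in seen:
            seen.add(key)
            counts[(enrollee, code[:3])] += 1
    return dict(counts)

claims = [
    ("A", "I109", "2017-04"),  # hypertension code, April
    ("A", "I101", "2017-04"),  # same two-digit code, same month -> one occurrence
    ("A", "I109", "2017-05"),  # same code, new month -> second occurrence
    ("B", "E785", "2017-06"),  # dyslipidemia code
]
features = build_feature_counts(claims)
# features[("A", "I10")] == 2, features[("B", "E78")] == 1
```

Each (enrollee, two-digit code) count then becomes one input variable alongside age and gender.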
Additionally, as benchmarks, we developed two sets of conventional CBAs. Firstly, we emulated two manually constructed CBAs proposed in the previous study [33]. Patients meeting the following selection rule were classified as "test-positive" for condition X (hypertension, diabetes, or dyslipidemia): (1) the diagnostic code corresponding to condition X is found in the claims at least once (diagnostic code-based CBA); and (2) the medication code corresponding to condition X is found in the claims at least once (medication code-based CBA).
Secondly, we applied a logistic regression model to the main dataset and to an alternative dataset whose input variables were selected according to each condition. The logistic regression model with the alternative dataset corresponds to a typical procedure among the conventional knowledge-based methods of building CBAs. The selected input variables were age, gender, and the number of observations of each ICD-10/WHO-ATC code corresponding to the target condition. The ICD-10 codes corresponding to hypertension, diabetes, and dyslipidemia were defined as I10-I15, E10-E14, and E78, respectively. The WHO-ATC codes corresponding to hypertension, diabetes, and dyslipidemia were defined as C08 and/or C09, A10, and C10, respectively.

(Table 1, excerpted: diagnose as hypertension if systolic blood pressure ≥ 140 mmHg and/or diastolic blood pressure ≥ 90 mmHg in both FY2016 and FY2017, or if blood pressure-lowering drug use was self-reported in at least one of FY2016 and FY2017; diagnose as diabetes if HbA1c ≥ 6.5% in at least one of the two years and FBG ≥ 126 mg/dL in at least one of FY2016 and FY2017, if FBG ≥ 126 mg/dL in both FY2016 and FY2017, or if hypoglycemic drug use was self-reported in at least one of FY2016 and FY2017.)
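The condition-specific variable selection behind the alternative dataset can be sketched as follows; this Python illustration assumes the main dataset's columns are named by their two-digit codes (the study itself used R).

```python
# Code ranges defined in the text: ICD-10 I10-I15 / WHO-ATC C08, C09 for
# hypertension; E10-E14 / A10 for diabetes; E78 / C10 for dyslipidemia.
CONDITION_CODES = {
    "hypertension": {"icd10": [f"I{n}" for n in range(10, 16)], "atc": ["C08", "C09"]},
    "diabetes":     {"icd10": [f"E{n}" for n in range(10, 15)], "atc": ["A10"]},
    "dyslipidemia": {"icd10": ["E78"],                          "atc": ["C10"]},
}

def select_columns(all_columns, condition):
    """Keep age, gender, and only the code columns matching the target condition."""
    codes = set(CONDITION_CODES[condition]["icd10"] + CONDITION_CODES[condition]["atc"])
    return [c for c in all_columns if c in ("age", "gender") or c in codes]

cols = ["age", "gender", "I10", "I25", "E11", "E78", "C09", "C10"]
select_columns(cols, "hypertension")  # ['age', 'gender', 'I10', 'C09']
```

The main dataset, by contrast, simply keeps every column.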

Association measures
We quantified the performance of the CBAs using the following association measures: sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), the receiver operating characteristic (ROC) curve, and the area under the ROC curve (AUC). For the calculation of the association measures, true-positive and test-positive cases were defined as the enrollees assessed as having the disease by the gold standard and those identified as having the disease by the CBA, respectively.
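In terms of confusion-matrix counts, the four scalar measures above reduce to simple ratios, as in this Python sketch (illustrative only; the study computed them with the epiR package in R).

```python
def association_measures(gold, test):
    """Sensitivity, specificity, PPV, and NPV from paired binary labels.
    `gold` holds the gold-standard diagnoses, `test` the CBA classifications."""
    tp = sum(g and t for g, t in zip(gold, test))            # true positives
    fn = sum(g and not t for g, t in zip(gold, test))        # false negatives
    fp = sum((not g) and t for g, t in zip(gold, test))      # false positives
    tn = sum((not g) and (not t) for g, t in zip(gold, test))  # true negatives
    return {
        "sensitivity": tp / (tp + fn),   # fraction of diseased caught by the CBA
        "specificity": tn / (tn + fp),   # fraction of non-diseased correctly excluded
        "ppv": tp / (tp + fp),           # fraction of test-positives truly diseased
        "npv": tn / (tn + fn),           # fraction of test-negatives truly disease-free
    }

gold = [1, 1, 1, 0, 0, 0, 0, 0]
test = [1, 1, 0, 1, 0, 0, 0, 0]
m = association_measures(gold, test)
# sensitivity 2/3, specificity 4/5, PPV 2/3, NPV 4/5
```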

Statistical analysis
We randomly divided the dataset into two sets: a training set (75%), which was used to estimate parameters and tune hyperparameters, and a test set (25%), which was used to assess the association measures of the CBAs. The sensitivity, specificity, PPV, and NPV were estimated for the diagnostic code- and medication code-based CBAs, and their 95% confidence intervals (CIs) were calculated using exact binomial confidence limits [50]. We calculated these association measures and 95% CIs using the epiR package [51]. We estimated a prediction function that outputs a score for the propensity of having a disease given a set of input variables using the selected methods. The outcome variable is a binary indicator of having a disease as assessed by the gold standard. For each selected machine learning method, we chose several types of prediction procedures that are commonly applied. The Euclidean distance with raw or standardized (i.e., rescaled to have mean zero and variance one) input variables was adopted as the distance metric for the kNN [52,53]. A linear basis function with a hinge or squared hinge loss was adopted in the SVM [54]. For the penalized logistic regression, logistic regressions with the L2-penalty (logistic ridge) [55], the L1-penalty (logistic lasso) [56], and the elastic-net penalty (logistic elastic-net) [57] were applied. Two types of tree-based models were applied: random forest [58] and the importance sampled learning ensemble (ISLE) [59]. A single-hidden-layer neural network was applied with different numbers of hidden units: 5, 10, and 20 [60].
If a model involved a hyperparameter to be tuned, the training set was used for the tuning. The expected value of the AUC was estimated through tenfold cross-validation on the training set. If the computational burden of tenfold cross-validation was prohibitive, we instead used a validation set to estimate the expected value of the AUC; a third of the training set was chosen at random to construct the validation set. Whether tenfold cross-validation or a validation set was used for each model is described below. The hyperparameter was then chosen as the value that maximized the AUC. After the hyperparameter was determined, the training set was used again to estimate the parameters of the prediction function. When no hyperparameter tuning was required, the training set was used to estimate the parameters of the prediction function from the beginning. The details of parameter estimation and hyperparameter tuning for each method are described below.
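The tuning procedure above reduces to a generic skeleton: for each candidate hyperparameter value, estimate the expected AUC by cross-validation on the training set, keep the maximizer, and refit on the full training set. A Python sketch, where the `fit` and `auc` callbacks stand in for the real estimation and evaluation routines (which were implemented in R in the study):

```python
import random

def tune(train, grid, fit, auc, n_folds=10, seed=0):
    """Choose the hyperparameter value with the highest mean cross-validated
    AUC, then refit on the full training set."""
    idx = list(range(len(train)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]  # disjoint fold index sets

    def cv_auc(h):
        total = 0.0
        for k in range(n_folds):
            held = set(folds[k])
            model = fit([train[i] for i in idx if i not in held], h)
            total += auc(model, [train[i] for i in folds[k]])
        return total / n_folds

    best = max(grid, key=cv_auc)
    return best, fit(train, best)

# Toy check: with dummy fit/auc callbacks whose score peaks at h = 3,
# the tuner should select 3 from the grid.
best_h, _ = tune(list(range(20)), [1, 2, 3, 4],
                 fit=lambda rows, h: h,
                 auc=lambda model, rows: -abs(model - 3))
```

Replacing the tenfold loop with a single held-out third of the training set gives the validation-set variant used when cross-validation was too costly.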
Logistic regression. The outcome variable was regressed on the input variables to generate a prediction function. The analysis of the logistic regression was implemented by the mnlogit package [61].
k-nearest neighbor. The number of the nearest neighbors to be counted, k, was optimized using the validation set. The predicted class probabilities computed from (1) the frequency of the class of the k-nearest neighbors (vote) [52] and (2) the inverse distance weighted frequency of the class of the k-nearest neighbors (IDW) [53] composed the prediction function. The analysis of the kNN was implemented by the fastknn package [62].
Support vector machine. The cost parameter was optimized using the validation set. Decision values (i.e., the distance of a point from the hyperplane) made up the prediction function. The analysis of the SVM was implemented by the LiblineaR package [63].
Penalized logistic regression. The regularization coefficient and elastic-net mixing parameter were determined by cross-validation. The analysis of the penalized logistic regression model was implemented by the glmnet package [64].
Tree-based model. The minimum node size was set to 10 for each tree, and 200 trees were bagged in the random forest. The number of variables selected for each split was tuned using the validation set. The probability forest was used to generate the prediction function [65]. The analysis of the random forest was implemented by the ranger package [66]. There are five hyperparameters in the importance sampled learning ensemble (ISLE): a hyperparameter for the tree size, the subsampling ratio for each tree, the learning rate, the number of trees to be bagged, and the regularization coefficient for the post-processing. We adopted the depth of the tree as the hyperparameter for the tree size and fixed it at six [60]. As combinations of the subsampling ratio for each tree and the learning rate, we selected (1, 0.05), (0.5, 0.1), and (0.1, 0.1). Since the basis-function generating process of the ISLE is identical to that of the gradient boosting machine (GBM) if the subsampling ratio is one and to that of the stochastic gradient boosting machine (SGBM) otherwise, we set the learning rate according to Friedman's recommendations for the GBM and SGBM [67,68]. The remaining two hyperparameters, the number of trees to be bagged and the regularization coefficient, were determined by cross-validation. In particular, for a given value of the regularization coefficient, the basis-function generating process was stopped if the cross-validation AUC did not improve for three basis-function generating rounds. The value with the maximum cross-validation AUC was then chosen as the regularization coefficient for the prediction function. The L1-penalty was adopted in the post-processing, following the recommendation of Friedman and Popescu (2003) [59]. The analysis of the ISLE was implemented by the xgboost package [69].
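The early-stopping rule used for the ISLE (halt the basis-function generating process once the cross-validation AUC fails to improve for three rounds) can be sketched generically in Python; `add_round` and `cv_auc` are placeholders for the real boosting update and evaluation, which the study implemented via xgboost in R.

```python
def boost_with_early_stopping(max_rounds, add_round, cv_auc, patience=3):
    """Grow the ensemble one basis function at a time and stop once the
    cross-validated AUC has not improved for `patience` consecutive rounds."""
    best_auc = float("-inf")
    best_round = 0
    since_improved = 0
    for r in range(1, max_rounds + 1):
        add_round(r)           # grow the ensemble by one basis function
        score = cv_auc(r)      # cross-validated AUC after r rounds
        if score > best_auc:
            best_auc, best_round, since_improved = score, r, 0
        else:
            since_improved += 1
            if since_improved >= patience:
                break          # no improvement for `patience` rounds
    return best_round, best_auc

# Toy AUC trajectory: improves through round 3, then stalls for 3 rounds.
aucs = [0.60, 0.70, 0.75, 0.74, 0.74, 0.74, 0.90]
best_round, best_auc = boost_with_early_stopping(
    len(aucs), lambda r: None, lambda r: aucs[r - 1])
# stops after round 6; the best AUC (0.75) was reached at round 3
```

Note the trade-off this rule embodies: the late jump to 0.90 at round 7 is never seen, which is the price of bounding computation.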
Neural network. All hidden units were fully connected with the nodes in the input and output layers. Weight decay was employed to regularize the parameters, and its regularization coefficient was tuned using the validation set. The analysis of the neural network was implemented by the nnet package [70].
Given an estimated prediction function from each model, an ROC curve was drawn from the scores and the matched observed outcome values as the threshold for considering a patient positive was moved over the range of all possible scores. The AUC was calculated from the resulting ROC curve, and DeLong's method was used to determine the 95% CI for the AUC [71].
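The threshold sweep just described, the AUC, and the Youden-index point used for the representative sensitivity/specificity pair can be sketched in Python (illustrative only; the study used the pROC package in R). The AUC is computed here via its rank interpretation: the probability that a random positive outscores a random negative, with ties counting one half.

```python
def roc_points(scores, labels):
    """Sweep every distinct score as a threshold; return
    (sensitivity, specificity, threshold) triples."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        pts.append((tp / pos, 1 - fp / neg, t))
    return pts

def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative -- equivalent to the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
area = auc(scores, labels)  # 8/9: one positive (0.3) ranks below one negative (0.7)
# Representative point: maximize the Youden index J = sensitivity + specificity - 1.
best = max(roc_points(scores, labels), key=lambda p: p[0] + p[1] - 1)
```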
In the end, a representative point of sensitivity and specificity on the ROC curve was chosen based on the Youden index [72,73]. The PPV and NPV were calculated according to the representative point. Moreover, the 95% CIs for the sensitivity, specificity, PPV, and NPV were calculated with 200 bootstrap resamples and the averaging methods described previously [74]. We drew the ROC curves and calculated the association measures and their 95% CIs using the pROC package [75]. All statistical analyses were conducted using R version 3.6.1 [76]. R code is available at https://github.com/harakonan/research-public/tree/master/cba.

Results
Table 3 displays the cumulative counts and distribution of the proportion of enrollees whose claims contain each ICD-10/WHO-ATC code at least once in the study population (N = 631,731). The numbers of ICD-10 and WHO-ATC codes that appeared in the dataset for the study population were 1333 and 92, respectively. Nearly 90% of the ICD-10 codes that appeared in the dataset were observed for less than 1% of enrollees, and more than half of the WHO-ATC codes were observed for less than 5% of enrollees. (Table 3 notes: for each two-digit ICD-10/WHO-ATC code, the proportion of enrollees whose claims contain the code at least once was computed for the study population; cumulative counts and distributions were tabulated separately for ICD-10 and WHO-ATC codes, and the count (percentile) column gives the number (fraction) of codes for which this proportion falls below the value in the proportion column.)
Table 4 reports the association measures and their 95% CIs for the diagnostic code- and medication code-based CBAs. The sensitivity, specificity, PPV, and NPV closely followed the values computed previously [33]. The diagnostic code-based CBAs had higher sensitivity and NPV but lower specificity and PPV than the medication code-based CBAs. For hypertension, all association measures were acceptably high, while for diabetes the diagnostic code-based CBA fell short of a satisfactory level of PPV. For dyslipidemia, the sensitivity of both CBAs was considerably lower than that for hypertension and diabetes. (Table 4 notes: patients meeting the following selection rules were classified as "test-positive" for each condition: (1) the diagnostic code corresponding to the condition appears in the claims at least once (diagnostic code-based CBA); and (2) the medication code corresponding to the condition appears in the claims at least once (medication code-based CBA). The 95% CIs for all estimates of sensitivity, specificity, PPV, and NPV were calculated using exact binomial confidence limits.)
Table 5 shows the association measures and their 95% CIs for the CBAs derived from the machine learning methods for hypertension (Table 5A), diabetes (Table 5B), and dyslipidemia (Table 5C). ROC curves are shown in S1 File.
The AUC of the logistic regression with subject-matter knowledge about the target condition, i.e., the logistic regression with the alternative dataset, was .923 for hypertension, .957 for diabetes, and .739 for dyslipidemia. The representative sensitivity, specificity, PPV, and NPV of this method were comparable to those of the convex combination of the diagnostic code- and medication code-based CBAs: hypertension, sensitivity 78.0%, specificity 96.1%, PPV 87.0%, and NPV 92.8%; diabetes, 86.9%, 95.6%, 64.3%, and 98.8%; dyslipidemia, 42.6%, 91.8%, 76.6%, and 71.6%. Without subject-matter knowledge about the target condition, i.e., with the logistic regression on the main dataset, the AUC for hypertension stayed similar at .921, that for diabetes decreased to .938, and that for dyslipidemia increased to .747.
The logistic lasso, logistic elastic-net, and tree-based methods yielded AUCs that were comparable to or higher than those of the logistic regression with subject-matter knowledge: .923-.931 for hypertension; .958-.966 for diabetes; .747-.773 for dyslipidemia. The model that achieved the highest AUC for all three conditions, the ISLE with a subsampling ratio of 1 and a learning rate of 0.05, yielded the following association measures at the representative coordinate on the ROC curve: hypertension, sensitivity 80.5%, specificity 95.8%, PPV 86.8%, and NPV 93.5%; diabetes, 89.8%, 94.7%, 60.5%, and 99.0%; dyslipidemia, 51.2%, 88.9%, 74.5%, and 74.2%.

Discussion
Using health check-up results as the source of the gold standard, we demonstrated the association measures of CBAs derived from machine learning methods without condition-specific variable selection for identifying patients with three common chronic conditions: hypertension, diabetes, and dyslipidemia. This is the first study to comprehensively investigate the benefits of machine learning methods in building CBAs.
Among the logistic regression and penalized logistic regressions, the logistic lasso and logistic elastic-net achieved the highest AUCs, followed by the logistic regression and the logistic ridge. They are all linear-in-parameters models with the same loss function, the log-loss, but different penalty functions: no penalty for the logistic regression; an L2-penalty for the logistic ridge; an L1-penalty for the logistic lasso; and an elastic-net penalty for the logistic elastic-net.
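The qualitative difference between these penalties can be seen in their one-dimensional shrinkage operators: the L1-penalty's soft-thresholding sets small coefficients exactly to zero (variable selection), while the L2-penalty merely rescales every coefficient toward zero. A schematic Python illustration (not the actual glmnet computation):

```python
def l1_prox(beta, lam):
    """Soft-thresholding: the L1-penalty's coefficient update zeroes out
    coefficients smaller than lam, which is why the lasso selects variables."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0

def l2_shrink(beta, lam):
    """The L2-penalty only rescales coefficients toward zero; none are
    eliminated, so the ridge keeps every input variable in the model."""
    return beta / (1.0 + lam)

small, large = 0.3, 2.0
l1_prox(small, 0.5), l1_prox(large, 0.5)      # (0.0, 1.5) -> small coefficient dropped
l2_shrink(small, 0.5), l2_shrink(large, 0.5)  # (0.2, ~1.33) -> both retained
```

With more than a thousand code columns of which most are rarely observed, dropping uninformative coefficients outright is exactly the behavior a sparse, high-dimensional setting rewards.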
The methods using the L1-penalty are better suited to sparse, high-dimensional situations than those using no penalty or the L2-penalty because they select the effective input variables. These results are backed by theoretical results supporting the superiority of estimation methods that use the L1-penalty in sparse, high-dimensional settings [77][78][79]. Although the prediction performance of the lasso is expected to be improved by the elastic-net when there is a group of variables with very high pairwise correlations [46], and the diagnostic and medication codes corresponding to a target disease are usually highly correlated, the elastic-net did not improve the AUC over the lasso.
The tree-based models and the neural network automatically select the input variables that are crucial for discrimination and flexibly incorporate nonlinearities and interactions among them. The tree-based models largely attained AUCs superior to those of the other models and were at least as good as the benchmark cases. Among the tree-based models, the ISLE performed better than the random forest. Past Monte Carlo simulation studies have shown the superiority of the ISLE over a random forest that uses lasso post-processing in the aggregation process, and of the latter over the usual random forest [59,60]. Therefore, two components of the ISLE contribute to its superior performance over the random forest: the learning term in the basis-function generating process and the lasso post-processing. Differences in the hyperparameters within the ISLE had little effect on the results.
In contrast to the tree-based models, the AUC of the neural network was not that high, though comparable to that of the logistic regression. The performance of the neural network was much lower in a preliminary investigation that used a smaller sample size. The number of parameters in the neural network is nearly 7,500, 15,000, and 30,000 for 5, 10, and 20 hidden units, respectively. Although the use of weight decay should alleviate overfitting to some extent, the sample size may still be insufficient for the neural network to demonstrate its true predictive power. As the use of multiple hidden layers with constraints such as local connectivity and weight sharing, which allow for more complex connectivity with fewer parameters, dramatically improved the performance of neural networks in image recognition [80,81], it may also improve performance in the current setting. Increasing the sample size and devising more complex connectivity suited to the situation are fruitful directions for future research.
The AUC of the kNN with raw input variables was as good as that of the logistic regression, but that of the kNN with standardized input variables was lower. As the difference between the two implies, designing the distance metric for the kNN is difficult. If the input variables are standardized, the model is forced to attach less importance to the input variables with high standard deviations, such as age and gender, than it otherwise would. Although the kNN once defined the state of the art in image recognition through the invention of the tangent distance [82], no such versatile distance measure yet exists for CBAs or studies using administrative data. It may be possible to improve the performance of the kNN by applying an unsupervised learning method that extracts essential components of the input variables, for example, principal component analysis [83], before measuring distances. Although we do not probe further in this study, this is one direction for future research.
The AUC of the SVM was higher than that of the logistic ridge. Both are linear-in-parameters models with the same L2-penalty but different loss functions: the logistic ridge uses the log-loss, while the SVM uses the (squared) hinge loss. The hinge losses give zero penalty to points that are correctly classified and outside the margin, whereas the log-loss gives a continuously decreasing penalty as correctly classified points move farther from the boundary of the margin. This feature of the hinge losses makes the SVM more robust to outliers than methods using the log-loss. Since most enrollees were far from the margin or were outliers (i.e., most of them could easily be labeled as diseased or non-diseased by the CBA), the SVM achieved higher performance by better discriminating between enrollees with and without the target disease near the boundary.
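The contrast between the two losses can be made concrete by evaluating each as a function of the margin y·f(x), as in this small Python illustration:

```python
import math

def log_loss(margin):
    """Log-loss as a function of the margin y * f(x): strictly positive
    everywhere, decreasing as a correctly classified point moves away
    from the decision boundary."""
    return math.log(1.0 + math.exp(-margin))

def hinge_loss(margin):
    """Hinge loss: exactly zero once a point is correctly classified and
    outside the margin (margin >= 1), so easy, far-away points contribute
    nothing to the SVM's objective."""
    return max(0.0, 1.0 - margin)

hinge_loss(3.0)  # 0.0 -- an easy point contributes nothing
log_loss(3.0)    # still positive, however far from the boundary
```

An enrollee far on the correct side of the hyperplane thus has no influence on the fitted SVM, while every enrollee continues to pull on the logistic ridge's coefficients.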
Comparing the results from the two datasets prepared for the logistic regression, the AUC declined for diabetes without condition-specific variable selection. The inconsistent direction of this change across target conditions illustrates the trade-off between the accuracy and variance of the prediction function. When the number of input variables becomes large relative to the sample size, there is a potential accuracy gain from using richer information but also a possible variance increase due to the inflation of the variance of the parameter estimates. For diabetes, the main dataset did not provide enough accuracy gain to offset the variance inflation, as the sample size was relatively small and the factors leading to a diagnosis of the target condition were successfully captured by the condition-specific variable selection (i.e., a high AUC was achieved with the alternative dataset). Conversely, for dyslipidemia, the factors leading to a diagnosis appear not to be sufficiently covered by the condition-specific variable selection, so the accuracy gain outweighed the variance increase.
There are various potential ways of refining the AUCs obtained in this study, drawing on concepts from machine learning. Although the objective of this study is not to seek high AUC or prediction accuracy but to outline the prospects for an efficient CBA construction procedure, we briefly introduce concepts that are expected to become important in the future pursuit of CBA accuracy. The first is the use of more complicated and sophisticated learning models that have flourished in the field of machine learning, such as deep learning models [84]. The second is pre-processing techniques that transform datasets ex-ante so that the power of learning machines can be utilized more efficiently; there are two main approaches: methods that deal with imbalanced datasets [85] and those that perform feature selection [86]. The last is error analysis in the performance-analysis and debugging step of model building [87]. How these methods can be successfully applied to CBAs or, more broadly, to claims data is a worthwhile subject for future research.
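As one concrete example of handling an imbalanced dataset without resampling, a cost-sensitive variant of the logistic regression can reweight the log-loss inversely to class frequency. The sketch below uses scikit-learn's `class_weight="balanced"` option on invented data with roughly 8% positives, loosely echoing the diabetes prevalence in the test cohort; every feature and effect size is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 5000, 20
X = rng.normal(size=(n, p))
# Rare outcome: only observations with a large latent score become positive.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 2.1).astype(int)

# class_weight="balanced" reweights the loss inversely to class frequency,
# one common remedy for imbalance that leaves the dataset itself untouched.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])
print(f"positives: {y.mean():.1%}, in-sample AUC: {auc:.3f}")
```

Resampling methods such as SMOTE [85] pursue the same goal by transforming the dataset itself rather than the loss.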
We note that, admittedly, the most demanding and time-consuming task when conducting CBA research will usually be the construction of gold standards. For instance, most previous studies reviewed medical charts to construct the gold standard [13, 15, 16, 18, 19, 21-23, 25, 26, 28, 36, 39, 41]. Nevertheless, we believe that our proposed method may lower the bar for CBA research and be useful for the following three reasons. First, the performance measures calculated using appropriate machine learning methods can potentially serve as a reference point, even when creating CBAs manually or exploring new CBA construction procedures.
Second, in some cases, it may be possible to sidestep the burden of chart review by using regularly collected data, such as the annual health screening results used in this study. Electronic medical records and disease registries are possible candidates along this line. The growing number of phenotype algorithms [88][89][90] may well function as gold standards for CBA research when electronic medical records are available, and cancer registries can be used to conduct comprehensive CBA research for various cancers. In fact, some CBA research uses health screening results [27,33], blood test results from electronic medical records [22,31], and disease registries [9,14,17,24,29,40]. In such cases, researchers can construct gold standards from regularly collected data without a serious burden and may be able to apply our proposed method to construct CBAs for a broad set of target conditions once an initial set of input variables is selected.
Finally, as we highlighted in the Introduction, there is demand to build CBAs efficiently for a wide variety of diseases. For instance, CBAs need to be renewed when the coding scheme changes [42,43], and a number of countries still suffer from a lack of CBAs [44]. The lack of validated CBAs degrades research quality, and such research is likely to fail to attract the attention of high-impact journals [45].
The generalizability of the association measures computed here is limited in two dimensions: the study population covered only regular employees, and only three conditions (hypertension, diabetes, and dyslipidemia) were examined. Additionally, the input variables selected without subject-matter knowledge of the target conditions in this study may be inadequate for other situations and conditions. Additional enrollee characteristics, ICD-10/WHO-ATC codes with three or more digits, and procedure codes may need to be included to attain satisfactory CBAs. Information on primary diagnoses and suspected cases may also be helpful. However, given that comorbidities were our focus, we do not expect that incorporating these types of information would appreciably affect the accuracy of the methods in this study. Lastly, the most suitable learning method may depend on the target condition. We hope that similar studies will be conducted in situations beyond those investigated here to deepen the understanding needed for efficient CBA research.
In sum, the penalized logistic regressions other than the ridge, along with the tree-based models, which are leading machine learning methods, achieved AUCs comparable to those of the logistic regression with knowledge-based, condition-specific variable selection. Moreover, the AUC level was satisfactory for hypertension and diabetes. Appropriate machine learning methods can substitute for researchers' knowledge of target conditions to construct CBAs efficiently.
Supporting information S1 File. Receiver operating characteristic curves for claims-based algorithms derived from machine learning methods (A, Hypertension; B, Diabetes; C, Dyslipidemia). Abbreviations: AUC, area under the receiver operating characteristic curve; IDW, inverse distance weighting; ISLE, importance sampled learning ensemble; kNN, k-nearest neighbor; Std., standardized; SVM, support vector machine; RF, random forest. Notes: Age, gender, and all International Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10)/World Health Organization-Anatomical Therapeutic Chemical (WHO-ATC) codes consisting of a letter followed by two digits were used as input variables for all models except the logistic regression using the alternative dataset. The main logistic regression fitted a logistic regression model to the appropriately trimmed dataset. The Euclidean distance with raw or standardized (i.e., rescaled to have mean zero and variance one) input variables was adopted as the distance metric for the k-nearest neighbor (kNN) method. The number of nearest neighbors, k, was optimized using the validation set. The predicted class probabilities computed from (1) the frequency of the class among the k nearest neighbors (vote) and (2) the inverse-distance-weighted frequency of the class among the k nearest neighbors (IDW) composed the prediction function. A linear basis function with a hinge or squared hinge loss was adopted in the support vector machine (SVM). The cost parameter was optimized using the validation set. Decision values (i.e., the distance of a point from the hyperplane) made up the prediction function. Among the penalized regressions, logistic regression with the L2-penalty (logistic ridge), the L1-penalty (logistic lasso), and the elastic-net penalty (logistic elastic-net) were applied. The regularization coefficient and elastic-net mixing parameter were determined by cross-validation.
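As a rough sketch of what cross-validated tuning of a penalized model might look like with a common library (scikit-learn here; the study does not state which software it used), the snippet below fits a logistic lasso whose regularization strength is chosen by cross-validation. The synthetic binary indicator matrix is purely illustrative of an ICD-10/WHO-ATC code matrix; all sizes and effects are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(1)
n, p = 1000, 30
# Binary indicators standing in for two-digit ICD-10/WHO-ATC codes.
X = rng.binomial(1, 0.2, size=(n, p)).astype(float)
logit = -2.0 + 2.5 * X[:, 0] + 1.5 * X[:, 1]  # two informative codes
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Logistic lasso: L1 penalty, with the regularization strength selected by
# 5-fold cross-validation over a grid of 10 candidate values.
lasso = LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, cv=5,
                             max_iter=5000).fit(X, y)
n_selected = int(np.sum(lasso.coef_ != 0))
print(f"non-zero coefficients: {n_selected} of {p}")
```

The L1 penalty zeroes out uninformative codes, which is why the lasso and elastic-net can stand in for manual condition-specific variable selection.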
Two types of tree-based models were applied: random forest and importance sampled learning ensemble (ISLE). In the random forest, the minimum node size was set to 10 for each tree, and 200 trees were bagged. The number of variables selected for each split was tuned using the validation set. For the ISLEs, we fixed the tree depth at six and selected (1, 0.05), (0.5, 0.1), and (0.1, 0.1) as the combinations of the subsampling ratio for each tree and the learning rate.
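The random forest setup described above might be sketched as follows (scikit-learn shown only as one possible implementation; its `min_samples_leaf` is used as a rough proxy for the minimum node size, and the candidate values for the number of variables per split, as well as the synthetic data, are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n, p = 2000, 30
X = rng.binomial(1, 0.3, size=(n, p)).astype(float)
logit = -1.5 + 2.0 * X[:, 0] + 1.0 * X[:, 1] * X[:, 2]  # includes an interaction
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

best_auc, best_mtry = 0.0, None
for mtry in (3, 5, 10):  # candidate numbers of variables per split
    # 200 bagged trees with minimum node size 10, as in the description above.
    rf = RandomForestClassifier(n_estimators=200, min_samples_leaf=10,
                                max_features=mtry, random_state=0).fit(X_tr, y_tr)
    auc = roc_auc_score(y_va, rf.predict_proba(X_va)[:, 1])
    if auc > best_auc:
        best_auc, best_mtry = auc, mtry
print(f"best mtry={best_mtry}, validation AUC={best_auc:.3f}")
```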
The number of trees and the regularization coefficient were determined by cross-validation. The L1-penalty was adopted in the post-processing. A single-hidden-layer neural network was applied with different numbers of hidden units: 5, 10, and 20. All hidden units were fully connected to the nodes in the input and output layers. Weight decay was employed for the regularization of the parameters, and its regularization coefficient was tuned using the validation set. (DOCX)
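A minimal sketch of the neural-network setup, again with scikit-learn as one possible stand-in: `MLPClassifier`'s `alpha` plays the role of the weight-decay coefficient, the hidden-unit grid follows the description above, and the candidate decay values and synthetic data are invented for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n, p = 1000, 20
X = rng.normal(size=(n, p))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

best = (0.0, None, None)  # (AUC, hidden units, weight decay)
for units in (5, 10, 20):              # hidden-unit sizes from the description
    for alpha in (1e-4, 1e-2, 1.0):    # candidate weight-decay strengths
        # One fully connected hidden layer; alpha is the L2 weight decay.
        nn = MLPClassifier(hidden_layer_sizes=(units,), alpha=alpha,
                           max_iter=500, random_state=0).fit(X_tr, y_tr)
        auc = roc_auc_score(y_va, nn.predict_proba(X_va)[:, 1])
        if auc > best[0]:
            best = (auc, units, alpha)
print(f"best: units={best[1]}, alpha={best[2]}, validation AUC={best[0]:.3f}")
```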