Predictors of tooth loss: A machine learning approach

Introduction: Little is understood about the socioeconomic predictors of tooth loss, a condition that can negatively impact individuals' quality of life. The goal of this study is to develop machine-learning algorithms to predict complete and incremental tooth loss among adults and to compare the predictive performance of these models.
Methods: We used data from the National Health and Nutrition Examination Survey from 2011 to 2014. We developed multiple machine-learning algorithms and assessed their predictive performance by examining the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, and positive and negative predictive values.
Results: The extreme gradient boosting trees presented the highest performance in predicting edentulism (AUC = 88.7%; 95% CI: 87.1, 90.2), the absence of a functional dentition (AUC = 88.3%; 95% CI: 87.3, 89.3), and missing any tooth (AUC = 83.2%; 95% CI: 82.0, 84.4). Although, as expected, age and routine dental care emerged as strong predictors of tooth loss, the machine-learning approach identified additional predictors, including socioeconomic conditions. Indeed, models incorporating socioeconomic characteristics predicted tooth loss better than those relying on clinical dental indicators alone.
Conclusions: Future application of machine-learning algorithms, with longitudinal cohorts, to identify individuals at risk for tooth loss could help clinicians prioritize interventions directed toward the prevention of tooth loss.

Introduction
Tooth loss is considered the "end state" of dental disease [1] and can adversely affect individuals' general health, quality of life, and well-being [2,3]. Although its prevalence has declined over the past decade, the aging population means that the risk of tooth loss is expected to rise [4]. Moreover, low-income and marginalized populations still experience a disproportionate share of the burden [5,6].
Tooth loss can generally be prevented if dental disease is diagnosed and treated at an early stage. Evidence from longitudinal studies suggests that routine dental attenders lose fewer teeth [7]. However, ongoing barriers to access to dental care, including its high cost, limit utilization of dental services, particularly among low-income and minority populations [8]. Adult dental coverage is not an essential health benefit in most public health insurance programs in the United States. Thus, even when able to access dental services, a large proportion of low-income adults have poor oral health due to a lack of routine care, and extraction becomes the most affordable and expedient dental treatment. Identification of individuals at high risk of tooth loss could therefore (a) aid clinicians in implementing early prevention, and (b) inform policies to ensure access to dental care and improve the oral health of vulnerable populations.
While prior research indicates that dental caries remains (by far) the greatest contributor to tooth loss [9], followed by periodontal disease [10], the role of socioeconomic conditions and other health characteristics is less clear. This is primarily because most prior analyses were based on descriptive studies with a limited number of variables [11,12]. Machine-learning algorithms use information on a large number of characteristics to identify variables that predict an outcome. The approach relies on pattern recognition: an algorithm is trained on "training data" to identify complex patterns and is then used to predict outcomes in separate "test data". Such algorithms are therefore better able to model the non-linear and high-dimensional relationships that characterize most health data [13,14]. Machine-learning methods have recently been applied in medicine to support clinical decisions, such as predicting survival in cancer patients or in intensive care units [14,15]. However, little is known about developing machine-learning algorithms for the prediction of oral health outcomes [13]. Our objective is to build on that evidence by developing and testing multiple machine-learning algorithms to predict complete and incremental tooth loss among adults using socioeconomic and medical condition predictors, and to compare the predictive performance of those models.

Study population and data sources
We analyzed data from the National Health and Nutrition Examination Survey (NHANES), conducted by the National Center for Health Statistics [16]. NHANES uses stratified multistage probability samples of the civilian non-institutionalized population of the US. NHANES surveys contain sociodemographic data and information on medical conditions, in addition to a detailed dental examination. We restricted our sample to adults aged 18 and older. We used data from the 2011 to 2012 NHANES cycle (n = 5,864) as the training set to develop the predictive models for each outcome, and we then used the 2013 to 2014 cycle (n = 6,113) to test our models' performance in new, unseen data (Fig 1).

Study variables
Our outcome variables were: (1) edentulism, which is the complete loss of all natural teeth; (2) the presence or absence of a functional dentition, which is defined as having at least 20 teeth [17]; and (3) having one or more missing teeth. All outcomes were dichotomized (yes, no). In our primary analyses, we included a total of 28 socioeconomic characteristics, oral health behavior, and chronic medical conditions as predictors for the machine-learning algorithms. Those variables included age, gender, race, nativity, employment, education, marital status, family size, home ownership, number of rooms, food expenditures (at home and away from home), ratio of family income to poverty level, health insurance, body mass index (BMI), routine dental care, and self-reported diagnoses of asthma, diabetes, arthritis, stroke, heart attack, coronary heart disease, heart failure, angina, high cholesterol, hypertension, gout, and cancer.
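The three dichotomous outcomes can be illustrated with a minimal sketch that derives them from a count of natural teeth present. The function name, the dictionary keys, and the full-dentition count of 28 (third molars excluded) are assumptions made for illustration; only the threshold of 20 teeth for a functional dentition comes from the text.

```python
from typing import Dict

FUNCTIONAL_MIN_TEETH = 20  # threshold for a functional dentition, as stated in the text
FULL_DENTITION = 28        # assumption: third molars excluded; not stated in the text


def tooth_loss_outcomes(teeth_present: int,
                        full_dentition: int = FULL_DENTITION) -> Dict[str, bool]:
    """Return the three yes/no outcomes for one participant (hypothetical helper)."""
    if not 0 <= teeth_present <= full_dentition:
        raise ValueError("teeth_present must lie between 0 and full_dentition")
    return {
        # (1) complete loss of all natural teeth
        "edentulism": teeth_present == 0,
        # (2) absence of a functional dentition (fewer than 20 teeth)
        "lacks_functional_dentition": teeth_present < FUNCTIONAL_MIN_TEETH,
        # (3) one or more missing teeth
        "any_missing_tooth": teeth_present < full_dentition,
    }


for count in (0, 19, 20, 28):
    print(count, tooth_loss_outcomes(count))
```

Under these assumptions the outcomes are nested: every edentulous participant also lacks a functional dentition and has at least one missing tooth, which is why prevalence rises from the first outcome to the third.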
In secondary analyses, we ran a model that included only routine clinical variables that clinicians might rely upon to predict future tooth loss in patients, namely the number of decayed teeth and periodontal disease, in addition to age, gender, and race. A detailed description of the predictor variables is shown in S1 Table.

Statistical analysis
We tested five popular machine-learning algorithms to predict each outcome. These algorithms were logistic regression, random forest (ensemble of multiple decision trees with bootstrap aggregating), light gradient boosting machine and extreme gradient boosting trees (both based on sequential models of decision trees), and artificial neural networks (algorithms inspired by neural structures and trained with back propagation).
We performed one-hot encoding for every categorical variable and standardized continuous variables to avoid oversized effects due to differences in scale. To limit overfitting, we tuned hyperparameters on the training set using 10-fold cross-validation with Bayesian optimization (hyperopt), separately for each outcome. For edentulism, the training set contained few positive cases (reflecting low prevalence). In imbalanced datasets, machine-learning algorithms tend to bias decisions towards the majority class, and a common solution is to undersample the majority class. We therefore resampled the edentulism training set with one-sided selection, an undersampling method that removes majority-class examples that are noisy or distant from the decision border.
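The two preprocessing steps, one-hot encoding and standardization, can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the study's code: the function names, the variable names `age` and `education`, and the toy records are hypothetical, and the choice to learn category levels, means, and standard deviations from the training rows only is an assumption that the text does not spell out.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def fit_preprocessor(rows: Sequence[dict],
                     categorical: Sequence[str],
                     continuous: Sequence[str]) -> Dict[str, dict]:
    """Learn category levels and mean/SD from the training rows only."""
    levels = {c: sorted({r[c] for r in rows}) for c in categorical}
    stats = {}
    for c in continuous:
        values = [r[c] for r in rows]
        sd = pstdev(values)
        stats[c] = (mean(values), sd if sd > 0 else 1.0)  # guard against zero variance
    return {"levels": levels, "stats": stats}


def transform(rows: Sequence[dict], prep: Dict[str, dict]) -> List[List[float]]:
    """Standardize continuous columns, then append one indicator per category level."""
    out = []
    for r in rows:
        vec = [(r[c] - mu) / sd for c, (mu, sd) in prep["stats"].items()]
        for c, lv in prep["levels"].items():
            # A level unseen in training becomes an all-zero indicator block.
            vec.extend(1.0 if r[c] == v else 0.0 for v in lv)
        out.append(vec)
    return out


# Toy records (not NHANES data).
train = [
    {"age": 30, "education": "high school"},
    {"age": 50, "education": "college"},
    {"age": 70, "education": "high school"},
]
test = [{"age": 50, "education": "other"}]

prep = fit_preprocessor(train, categorical=["education"], continuous=["age"])
X_train = transform(train, prep)
X_test = transform(test, prep)
print(X_train)
print(X_test)
```

Fitting the preprocessor on the training cycle and reusing it unchanged on the test cycle keeps information from the test data out of model development.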
After selecting the combination of hyperparameters with the highest area under the receiver operating characteristic curve (AUC) for each model, the parameters of the final algorithms were defined with the entire 2011-12 NHANES cycle (training set) and their predictive performance tested on the 2013-14 cycle of NHANES (test set). All of the results presented here are from the test set.
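The tune, refit, and test sequence described above can be sketched with scikit-learn. Several things here are assumptions made for illustration: the data are synthetic stand-ins generated by `make_classification` (not NHANES), logistic regression stands in for the five algorithms, and an exhaustive grid search replaces the Bayesian optimization (hyperopt) used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-ins for the two survey cycles (assumption: not real data).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_train, y_train = X[:300], y[:300]  # plays the role of the 2011-12 cycle
X_test, y_test = X[300:], y[300:]    # plays the role of the 2013-14 cycle

# 10-fold cross-validation on the training set, scored by AUC.
# Grid search is used here in place of Bayesian optimization.
search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    refit=True,  # refit the best setting on the entire training set
)
search.fit(X_train, y_train)

# Evaluate the refitted model once on the held-out cycle.
test_scores = search.best_estimator_.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_scores)
print(search.best_params_, round(search.best_score_, 3), round(test_auc, 3))
```

With `refit=True`, the selected hyperparameters are used to fit a final model on all training rows, so the test set is touched only once, for the final performance estimate.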
To assess the predictive performance of the algorithms, we calculated the AUC, accuracy (ACC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the harmonic mean of sensitivity and specificity for each predictive model. We used a 50% threshold for reporting sensitivity, specificity, F1, PPV, and NPV. However, in a sensitivity analysis we also tested two other thresholds, 25% and 75% (S2 Table).
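As a worked illustration of these measures, the following sketch computes the AUC from pairwise rankings and the threshold-dependent measures from the confusion matrix. The function names and the toy labels and scores are hypothetical; the harmonic mean is computed from sensitivity and specificity, following the description in the text.

```python
from typing import Dict, Sequence


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_metrics(y_true: Sequence[int], scores: Sequence[float],
                      threshold: float = 0.5) -> Dict[str, float]:
    """Confusion-matrix measures after classifying score >= threshold as positive."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 0)

    def ratio(num: float, den: float) -> float:
        return num / den if den else float("nan")

    sens, spec = ratio(tp, tp + fn), ratio(tn, tn + fp)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": sens,
        "specificity": spec,
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "harmonic_mean": ratio(2 * sens * spec, sens + spec),
    }


# Toy example: 3 positives and 5 negatives (not study data).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

print(auc(y_true, scores))                      # 12 of 15 pairs ranked correctly: 0.8
for t in (0.25, 0.5, 0.75):
    print(t, threshold_metrics(y_true, scores, threshold=t))
```

The AUC does not depend on the threshold, whereas lowering the threshold from 50% to 25% in this toy example raises sensitivity from 2/3 to 1 and lowers specificity from 4/5 to 3/5, which is the trade-off the sensitivity analysis examines.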
Furthermore, we computed Shapley values for each predictive model to determine the importance of each variable in predicting our study outcomes. Shapley values are an additive feature importance measure that represents the contribution of each feature to pushing the model output away from its base value [18].
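The additive property of Shapley values can be illustrated with a brute-force sketch that enumerates every feature coalition for a tiny model. This is not the procedure used in the study: the toy scoring function, its coefficients, the feature meanings, and the background rows are all hypothetical, and exact enumeration is only feasible for a handful of features, so practical tools rely on faster algorithms.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(model: Callable[[Sequence[float]], float],
                   x: Sequence[float],
                   background: Sequence[Sequence[float]]) -> List[float]:
    """Exact Shapley values for one instance x by enumerating all coalitions.

    value(S) is the mean model output when the features in S take the values
    of x and the remaining features take the values of each background row.
    """
    n = len(x)

    def value(subset: frozenset) -> float:
        total = 0.0
        for row in background:
            mixed = [x[i] if i in subset else row[i] for i in range(n)]
            total += model(mixed)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                s = frozenset(combo)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(v: Sequence[float]) -> float:
    """Hypothetical score: age enters additively; the other two features interact."""
    age, no_routine_care, low_income = v
    return 0.01 * age + 0.2 * no_routine_care + 0.1 * no_routine_care * low_income


background = [[40, 0, 0], [60, 1, 1]]  # hypothetical reference rows
x = [70, 1, 1]                         # instance to explain

phi = shapley_values(toy_model, x, background)
base_value = sum(toy_model(row) for row in background) / len(background)
print(phi, base_value, toy_model(x))
```

The values sum to the difference between the model output for `x` and the base value (here 1.0 minus 0.65), which is the sense in which each feature "pushes" the output away from the base value.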
We used Python (scikit-learn library) [19] and STATA 15.1 software for our analyses [20]. NHANES surveys are approved by the National Center for Health Statistics (NCHS) Research Ethics Review Board (ERB) [21]. This study used deidentified data and was determined to be "not-human subjects research" by the institutional review board of the Harvard Faculty of Medicine.

Results
The study included a total of 11,977 adults. Of these, 736 (5.3%) were edentulous, 2,663 (18.5%) lacked a functional dentition, and 6,919 (58.3%) were missing at least one tooth. About half of the sample were women (51.8%), and the majority had more than a high school education (63.0%) and were non-Hispanic white (65.7%). The distribution of demographic and health characteristics was relatively similar across the three outcomes (Table 1).
The performance of the machine-learning algorithms on the test data for each study outcome, for the primary analyses (without dental clinical variables), is summarized in Table 2. For edentulism, all machine-learning models demonstrated high performance, with high AUC (>86.5%). The ACC ranged between 82.2% and 84.3%, indicating good accuracy. The sensitivity ranged between 71.9% and 78.5%, while specificity was high (>82.5%) for all models. In predicting the lack of a functional dentition, all models had high performance, with high AUC and ACC (>87.0% and >81.0%, respectively). The specificity was greater than 84.0% and the sensitivity ranged between 48.4% and 74.1%. For predicting one or more missing teeth, the AUC was greater than 81.0% and the ACC greater than 73.0% for all models. The sensitivity for all models was high (>85.0%), and the specificity ranged between 29.6% and 59.7%. The performance of the machine-learning algorithms on the training data for each study outcome is presented in S3 Table.
We compared the receiver operating characteristic curves from all machine-learning algorithms by outcome (Fig 2). Generally, all models performed similarly, demonstrating high AUC (>81.5%). Considering all performance parameters (Table 2), the extreme gradient boosting trees had the highest performance. The most important predictive variables, from the best performing classifiers for each outcome, were similar for all outcomes (Fig 3). Age, education, routine dental care, employment, ratio of family income to poverty level, race, and home ownership were strong predictors of tooth loss. Although less important, medical conditions such as arthritis, diabetes, high cholesterol, hypertension, and cardiovascular diseases were also among the predictors. The variable rankings were robust across the different machine-learning models and outcomes, with age, education, and race emerging as the top predictors for most models (S1-S3 Figs).
Results from our secondary analyses, which used only routine clinical variables to predict tooth loss, are presented in Table 3. For edentulism, logistic regression had the highest predictive performance (AUC = 84.6%; 95% CI: 83.0, 86.1). The extreme gradient boosting trees demonstrated the highest performance in predicting the absence of a functional dentition (AUC = 80.4%; 95% CI: 78.9, 81.7) and one or more missing teeth (AUC = 79.8%; 95% CI: 78.2, 81.2). In each case, the algorithms using only clinical variables (number of decayed teeth, periodontal disease) performed worse than the algorithms that excluded those variables but incorporated socioeconomic factors.

PLOS ONE

Discussion
To the best of our knowledge, this is the first use of machine-learning algorithms to predict complete and incremental tooth loss based on socioeconomic and medical health characteristics [22]. In this study, we used national data to develop and test the performance of five machine-learning algorithms and to identify predictors of complete and incremental tooth loss. We assessed the predictive performances of our models by examining several parameters, including area under the receiver operating characteristic curve, accuracy, sensitivity, specificity, positive and negative predictive values. Overall, all machine-learning models demonstrated high predictive performance with high discrimination, achieving AUC greater than 82.0%. We found that the extreme gradient boosting trees model had the highest performance in predicting edentulism, the absence of a functional dentition, and missing any tooth.
Tooth loss is an important oral health indicator. Depending on its severity, it can significantly impair the ability to chew, speak, and socialize, as well as overall general health [2,3,23]. While previous research has identified the determinants of tooth loss, most of this literature examined only edentulism and did not examine incremental tooth loss, which is a far more prevalent oral state [11]. Moreover, those studies were mostly based on cross-sectional data or examined a few variables using classical statistical modeling rather than predictive modeling [11,12,24]. We found that machine-learning models performed better than conventional statistical methods (logistic regression) for predicting edentulism and the absence of a functional dentition. Similarly, Krois et al. recently developed and evaluated the performance of multiple predictive models for tooth loss in patients with periodontal disease. Their study demonstrated the utility of applying a machine-learning framework for predicting tooth loss, mainly from periodontal tooth-level predictors [13].
In this study, we assessed the utility of machine-learning algorithms for predicting complete and incremental tooth loss, and we demonstrated a high predictive performance of those models. We did not include clinical dental variables in the primary analyses since they are generally highly correlated with tooth loss [9,13,25,26]. Indicators such as decayed teeth and poor periodontal condition have been documented as strong determinants of tooth loss; when we included them, our prediction models were nearly perfect. Instead, we used a comprehensive list of socioeconomic characteristics, self-reported dental care, and medical condition variables. Our approach aimed to develop predictive models using variables that do not require dental examination so that non-dental clinicians could readily identify this high-risk population. However, in our secondary analyses, we assessed the predictive performance of the machine-learning algorithms using a limited number of dental clinical predictors and demographic variables. Our findings suggest that machine-learning models using socioeconomic characteristics, self-reported dental care, and medical condition variables performed better at predicting tooth loss than models relying on clinical dental indicators alone. Knowing the patient's education level, employment status, and income is just as relevant for predicting tooth loss as assessing their clinical dental status. Our findings echo the advice of Bernardino Ramazzini (1633-1714), widely considered to be the father of occupational medicine, who admonished clinicians to always ask about their patients' occupation when taking down their medical histories [27].
Our findings are consistent with those of previous studies that identified age and socioeconomic conditions as risk factors for tooth loss [11,12,28]. Aging populations have accumulated oral and non-communicable health conditions, and so remain susceptible to ongoing tooth loss. We also found education to be another strong predictor of tooth loss. Education is a marker of socioeconomic position and a key determinant of life chances, opportunities, beliefs, and values; it therefore plays an important role in enabling access to (and affordability of) dental services [29,30]. Routine dental care also emerged as a strong predictor of tooth loss. This finding provides support for the association between regular preventive dental visits and better oral health [7]. Our findings also provide insights into the role of pre-existing medical conditions as determinants of tooth loss. We found that medical conditions (such as arthritis, diabetes, high cholesterol, hypertension, and cardiovascular diseases) are among the predictors of tooth loss. Clinicians could use this information to screen patients at high risk for tooth loss and coordinate their referral and dental care.
Even though the association between socioeconomic status (SES) and tooth loss has been documented previously, we believe our study makes a novel contribution by providing a direct comparison with the predictive performance of widely accepted clinical indicators. Our approach builds on prior knowledge by quantitatively demonstrating, via machine-learning algorithms, that a set of socioeconomic variables performs better than clinical dental indicators in predicting tooth loss. Although we cannot establish causality (which would require longitudinal data), our study draws attention to the potential utility of incorporating SES among the set of variables that clinicians ought to consider in their practice. Future studies need to evaluate the application of these algorithms in clinical settings and explore their use in the identification of populations at risk for other dental outcomes.
Previous studies in other areas of clinical practice have pointed out the potential utility of incorporating socioeconomic information in prediction algorithms. For example, the Framingham Risk calculator, one of the most widely used algorithms to predict future risk of cardiovascular disease (CVD), currently does not incorporate socioeconomic variables. Studies have suggested that this omission results in the systematic under-treatment of low-SES patients with hyperlipidemia, because the Framingham Risk score under-estimates the risk of CVD in low-SES patients, and clinicians' treatment decisions (such as when to start statin therapy) are often based on stratifying patients using the same algorithms [31,32]. Although we cannot cite a comparable example in dentistry, our study has potential implications for any future attempts to develop prediction algorithms to guide decision making.
A very limited number of studies have utilized machine-learning approaches, mostly developing a single algorithm model or using clinical dental variables, for predicting oral health outcomes [13,33,34]. Our study builds on prior findings and demonstrates the feasibility of using this approach in predicting tooth loss. Future studies could utilize machine-learning for predicting populations at risk for other dental outcomes.
We used cross-sectional data in this study. Although we used separate cycles from NHANES data to develop and to test the predictive models, the principal threat to causal inference in cross-sectional data is reverse causation. Studies have suggested that poor dentition is, indeed, a predictor of low SES, e.g., because individuals with "poor oral health" are less likely to be selected at job interviews [35,36]. However, this type of reverse causation is less likely for education, since missing teeth are less likely to affect educational attainment; the evidence is stronger for job hiring, promotion, and income. Future studies with longitudinal cohorts and a larger time separation between training and test datasets are needed to test the external validity of our models and to ensure that their performance is robust over time. In addition, we excluded variables that had 20% or more missing data from the primary analyses, which may have limited the number of variables we were able to test, such as other oral health behaviors (brushing and flossing) and lifestyle factors (smoking, drinking, and exercise). Nonetheless, we were able to develop machine-learning algorithms with high predictive performance for all outcomes. Additionally, we conducted a sensitivity analysis using multivariate imputation by chained equations (MICE) to impute missing data so that all individuals were included in every model, even if they had missing values, and the results were largely similar to our main analysis (S5 Table). Finally, although the performance of models predicting edentulism was high (AUC >86%), the prevalence of edentulism was low in our sample (thus, a "rare event"), and this may have affected the performance of those models.

Conclusion
In this analysis we developed and tested the performance of five machine-learning algorithms for predicting complete and incremental tooth loss. Our findings support the application of machine-learning algorithms to predict tooth loss using socioeconomic and medical health characteristics. However, future studies will need to validate our models using longitudinal data to aid health policy as well as clinicians in identifying individuals at high risk of tooth loss so that early interventions can be directed at those most at risk. In addition, the application of machine-learning methods can be used to identify predictors of other dental conditions.