Non-invasive assessment of NAFLD as systemic disease—A machine learning perspective

Background & aims: Current non-invasive scores for the assessment of severity of non-alcoholic fatty liver disease (NAFLD) and identification of patients with non-alcoholic steatohepatitis (NASH) have insufficient performance to be included in clinical routine. In the current study, we developed a novel machine learning approach to overcome the caveats of existing approaches.
Methods: Non-invasive parameters were selected by an ensemble feature selection (EFS) from a retrospectively collected training cohort of 164 obese individuals (age: 43.5±10.3y; BMI: 54.1±10.1kg/m²) to develop a model able to predict the histologically assessed NAFLD activity score (NAS). The model was evaluated in an independent validation cohort (122 patients, age: 45.2±11.75y, BMI: 50.8±8.61kg/m²).
Results: EFS identified age, γGT, HbA1c, adiponectin, and M30 as being highly associated with NAFLD. The model reached a Spearman correlation coefficient with the NAS of 0.46 in the training cohort and was able to differentiate between NAFL (NAS≤4) and NASH (NAS>4) with an AUC of 0.73. In the independent validation cohort, an AUC of 0.7 was achieved for this separation. We further analyzed the potential of the new model for disease monitoring in an obese cohort of 38 patients under lifestyle intervention for one year. While all patients lost weight under intervention, increasing scores were observed in 15 patients. Increasing scores were associated with significantly lower absolute weight loss and with lower reduction of waist circumference and basal metabolic rate.
Conclusions: A newly developed model (http://CHek.heiderlab.de) can predict presence or absence of NASH with reasonable performance. The new score could be used to detect NASH and monitor disease progression or therapy response to weight loss interventions.


Ethics statement
The study protocol conformed to the ethical guidelines of the 1975 Declaration of Helsinki and was approved by the Institutional Review Board (IRB; Ethik-Kommission der Medizinischen Fakultät der Universität Duisburg-Essen; Germany; 15-6356-BO). Due to the retrospective nature of the training study, the IRB waived the need for written informed consent. All procedures adhered to the Declaration of Helsinki and the requirements of the IRB.
The protocol of the prospectively recruited validation study cohort was approved by the local Ethics Committee and the Review Board of the University Hospital Würzburg (Ethik-Kommission der Medizinischen Fakultät der Universität Würzburg; AZ96/12). Written informed consent was obtained from all participants.
All individuals participating in the (non-invasive) lifestyle intervention study were prospectively recruited and gave written informed consent to the study protocol.
All authors had access to the study data and reviewed and approved the final manuscript.

Study design and sample acquisition
The training cohort (University Hospital Essen) consisted of 164 obese patients with histologically confirmed NAFLD (NAS 1-8; Table 1) who underwent bariatric surgery. Patients were eligible for the retrospective study when liver histology was available and a NAS of at least 1 was diagnosed by a pathologist. Patients received dietary and exercise counselling for 6 months prior to surgery; no calorie restriction was imposed. A blood sample was collected for assessment of serum-derived factors on the day of surgery (prior to surgery), and liver tissue was sampled during bariatric surgery. Detailed demographic and clinical information of this cohort is given in Table 1. All data shown were recorded on the day of surgery. The validation cohort was recruited prospectively at the University Hospital Würzburg (Division of Hepatology) and included 122 patients. The majority of these patients underwent bariatric surgery (n = 105). Detailed demographic and clinical information of this cohort is given in Table 1. Patients were eligible when all parameters used in the newly generated score were complete and histological assessment of the NAS was available.
Pathologists assessed the NAS prior to development of the new score and were thus blinded to its results. While the NAS had to be unblinded as the reference during training, calculation of the new score in the validation cohort was performed blinded to the NAS.
To assess a possible use of the new score for monitoring of treatment, data from a cohort recruited for lifestyle intervention at the University of Munich (Department of Medicine II) were applied [28]. The whole study cohort consisted of 152 patients. For the present analysis 38 patients with complete datasets to generate the new score at start and end of the study were selected.
The validation dataset included age, HbA1c, GGT, M30, adiponectin, sex, BMI, and NAS. Statistical data analyses were performed with R (http://www.r-project.org/) and Prism Version 5 (Graphpad Inc., La Jolla, CA, USA). All data are presented as mean ± standard error of the mean (SEM) unless specified otherwise. Correlation analysis was performed using Spearman's rank correlation coefficient.
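The two summary statistics named above can be illustrated with a short sketch. The authors ran their analyses in R and Prism; the Python below is only an illustrative re-implementation using the standard library, and the function names (`average_ranks`, `spearman_rho`, `mean_sem`) are hypothetical. Spearman's coefficient is computed as the Pearson correlation of tie-averaged ranks, and the SEM as the sample standard deviation divided by the square root of n.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_sem(values):
    """Return (mean, standard error of the mean) using the n-1 variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, sqrt(variance / n)
```

Because only ranks enter the calculation, any monotonic relation between a score and the NAS yields a coefficient of 1 or -1, regardless of its shape.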

Importance analysis and predictive modeling
Importance analysis of the biomarkers was carried out with EFS with default settings [25] using the web interface (http://efs.heiderlab.de). EFS aggregates eight different feature selection methods and provides a quantitative ranking of the features [29]. EFS has been shown to compensate for the biases of single feature selection methods, overcoming the instability and unreliability of biomarker discovery approaches.
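The general idea of aggregating several feature selection methods into one ranking can be sketched as follows. This is not the EFS implementation: the eight methods and their exact normalization are not described in this section, so the sketch simply assumes that each method yields one non-negative importance per feature, rescales each method to [0, 1], and averages across methods. The function names and the toy feature names in the usage note are hypothetical.

```python
def normalize(importances):
    """Scale one method's raw importances so that its top feature scores 1."""
    top = max(importances.values())
    if top <= 0:
        return {name: 0.0 for name in importances}
    # Negative importances are clipped to zero before scaling.
    return {name: max(value, 0.0) / top for name, value in importances.items()}


def ensemble_rank(per_method):
    """Average normalized importances over methods.

    per_method maps a method name to a {feature: importance} dict.
    Returns (feature, aggregated importance) pairs, best first.
    """
    features = sorted({f for imp in per_method.values() for f in imp})
    totals = {f: 0.0 for f in features}
    for importances in per_method.values():
        scaled = normalize(importances)
        for f in features:
            # A feature missing from one method contributes zero for it.
            totals[f] += scaled.get(f, 0.0) / len(per_method)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

For example, `ensemble_rank({"m1": {"a": 2.0, "b": 1.0}, "m2": {"a": 0.8, "b": 1.0}})` ranks `a` ahead of `b`, because a feature has to score well across methods rather than in a single one. This is the property that makes an ensemble less sensitive to the bias of any individual method.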
The biomarkers with the highest ranks identified by EFS were used for subsequent model development by logistic regression. The stats package of R with standard settings was used. For evaluation of the model performance, a 10-fold cross-validation scheme was used, and the Receiver Operating Characteristic (ROC) curve and the corresponding Area under the Curve (AUC) were calculated with pROC [30]. The 95% CI was computed with 2000 stratified bootstrap replicates.
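The evaluation steps described here (fold assignment, AUC, and a stratified bootstrap confidence interval) can be sketched in plain Python. The authors used R with pROC; the code below is an independent illustration and does not reproduce pROC's internals. The function names, the fixed seeds, and the simple percentile rule for the interval are assumptions made for this sketch.

```python
import random


def kfold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def auc(scores, labels):
    """AUC as the chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


def stratified_bootstrap_ci(scores, labels, replicates=2000, level=0.95, seed=0):
    """Percentile CI for the AUC; cases and controls are resampled separately."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    stats = []
    for _ in range(replicates):
        # Stratification: each replicate keeps the original class sizes.
        boot_pos = rng.choices(pos, k=len(pos))
        boot_neg = rng.choices(neg, k=len(neg))
        stats.append(auc(boot_pos + boot_neg, [1] * len(pos) + [0] * len(neg)))
    stats.sort()
    alpha = (1.0 - level) / 2.0
    lower = stats[int(alpha * (replicates - 1))]
    upper = stats[int((1.0 - alpha) * (replicates - 1))]
    return lower, upper
```

Resampling within each class keeps the NAFL-to-NASH ratio constant across replicates, which matters when one class is small.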

Proposed non-invasive scores do not correlate with histological assessment of NAFLD severity
To assess the performance of existing non-invasive scores, we calculated correlations of APRI, BARD, Gholam's score, NAFLD fibrosis score (NFS), and Palekar's score with NAS and with fibrosis, respectively. Of note, APRI and NFS were developed to assess severity of fibrosis/cirrhosis, but have previously been tested for assessment of NASH by other groups. Only the APRI (r = 0.3269, p < 0.001) and Gholam's score (r = 0.3670, p < 0.0001) were significantly and positively correlated with NAS (S1A and S1B Fig).

A new score with unbiased parameter selection reaches high accuracy to predict NASH in the training cohort
To improve non-invasive classification of NAFL versus NASH, all available markers were evaluated by EFS. The markers age, HbA1c, GGT, adiponectin, and M30 were used as input for a logistic regression model. Logistic regression models are easy to interpret, have been widely applied and have very high acceptance among physicians in clinical practice [31]. The model was evaluated using a 10-fold cross-validation. The model is publicly available at http://CHek.heiderlab.de.
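How a logistic regression model turns the five markers into a score can be sketched as below. The fitted coefficients and intercept are not reported in this section, so the caller must supply them; the function names and the lowercase marker keys are hypothetical, and units and scaling of the inputs are left open. The sketch assumes the score is the linear predictor (log-odds), which a logistic function maps to a probability of NASH.

```python
from math import exp

# Marker names follow the text; the dictionary keys are an assumption.
MARKERS = ("age", "ggt", "hba1c", "adiponectin", "m30")


def linear_score(patient, coefficients, intercept):
    """Linear predictor: intercept plus the weighted sum of the five markers."""
    return intercept + sum(coefficients[m] * patient[m] for m in MARKERS)


def nash_probability(score):
    """Logistic link: map a log-odds score to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-score))
```

On this scale a score of 0 corresponds to a probability of 0.5, negative scores to lower and positive scores to higher probabilities, which is what makes a continuous readout possible instead of a fixed NAFL/NASH label.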
The predicted scores are significantly correlated with NAS (r = 0.4473; p < 0.0001; Fig 1A), and the model reached an AUC of 0.7339 (95% CI: 0.6567-0.8111; p < 0.0001; Fig 1B). The AUC of the model was significantly higher than that of all other tested scores. While no correlation was found between the new score and fibrosis stage, the AUC of a ROC curve to separate moderate from advanced fibrosis reached 0.8117 (95% CI: 0.6964 to 0.9271; p = 0.03; advanced fibrosis n = 4).

The new score achieves high accuracy in an independent validation cohort
In the validation cohort (n = 122, see Table 1), the new score separated NAFL from NASH with an AUC of 0.7.

The new score can be applied to monitor therapy response in patients with metabolic alterations
Because the new model was able to separate NAFL and NASH in an independent cohort with good accuracy, we explored whether it would be applicable for monitoring response to weight loss therapy. To this end, a cohort of patients participating in a lifestyle intervention program to treat morbid obesity for one year, based on the current guideline of the German Association of Obesity [32], was analyzed (Table 2) [28]. No liver biopsies were available in this cohort. Complete datasets for calculating the new score at start and end of the study were available for 38 patients. The treatment regimen led to weight loss in all patients and an overall improvement of metabolic parameters and health indices in the majority of patients. In parallel, a drop of the new score was observed in 60% of the patients (26 with reduced score, 17 with increased score). Mean reduction of the new score was 1.53 points. This would correspond to a reduction in NAS by 1 point for seven patients, including one case of hypothetical resolution from NASH (NAS = 5) to NAFL (NAS = 4).
The new score indicates efficacy of weight loss therapy to restore metabolic health. All treated patients exhibited weight loss and at least mild improvements of the clinical situation. Since the new score was reduced in only 60% of the patients, correlation analyses were performed to identify markers which might explain this discrepancy. Significant correlations of the new score at start (T0) and end (T52) of the study are given in S1 and S2 Tables. Interestingly, the new score is correlated with almost all metabolic and liver-related health indicators, including body composition and basal metabolic rate, at both time points. We then grouped patients by either a decrease of -3.8 [(-10.32) to (-0.05)] points or an increase of 2.5 (0.13 to 9.7) points of the score during the treatment duration. Basic parameters of these subgroups did not differ at T0 (Table 3) but suggested a slightly better liver-related health condition for individuals with an increase of the new score. At T0 the new score was significantly higher in patients with stronger reduction of the new score. Patients with decreased score exhibited significantly higher absolute weight loss (Fig 3A) and stronger reduction of waist circumference (Fig 3B) at the end of the study duration. While reduction of corrected body fat proportion (NutriPlus Version 5.4.1, Data input, Poecking, Germany) was not significantly different (Fig 3C), the change of basal metabolic rate was higher in individuals with reduced score (Fig 3D). In addition, serum albumin and the gastrointestinal hormone ghrelin were significantly higher in patients with reduced score at both time points (T0 and T52; Fig 4A and 4B). Conversely, serum palmitic acid (C16:0; Fig 4C) was significantly lower in patients with reduced score at T0 and T52. Fatty liver index and NAFLD fibrosis score did not differ between patients with decrease or increase of the new score (not shown).
When applied as a monitoring tool, the new score can indicate efficacy of weight loss treatment for restoration of metabolic health.
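The grouping used in this monitoring analysis, splitting patients by whether the score fell or rose between T0 and T52, can be sketched as follows. The data structure (a mapping from patient id to a pair of scores), the function names, and the separate "unchanged" group are assumptions made for illustration; no study data are reproduced here.

```python
def group_by_score_change(scores):
    """Split patients by the sign of the score change from T0 to T52.

    scores maps a patient id to a (score_T0, score_T52) tuple.
    Returns a dict of lists of (patient id, change) pairs.
    """
    groups = {"decreased": [], "increased": [], "unchanged": []}
    for patient_id, (t0, t52) in scores.items():
        change = t52 - t0  # negative values mean the score dropped
        if change < 0:
            groups["decreased"].append((patient_id, change))
        elif change > 0:
            groups["increased"].append((patient_id, change))
        else:
            groups["unchanged"].append((patient_id, change))
    return groups


def fraction_decreased(groups):
    """Share of all patients whose score dropped during the intervention."""
    total = sum(len(members) for members in groups.values())
    return len(groups["decreased"]) / total
```

Keeping the signed change alongside each patient id allows the two groups to be compared afterwards on weight loss, waist circumference, or basal metabolic rate.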

Discussion
The non-invasive assessment and monitoring of NAFLD is a complex problem and has been reviewed thoroughly elsewhere [23,33,34]. In brief, histological evaluation remains the gold standard, but is limited by sampling error and by inter- as well as intra-observer variance in assessment [35]. Moreover, the liver biopsy procedures needed for histological assessment are time-consuming, require trained experts, and pose a risk for the patient [36,37]. Magnetic resonance (MR) imaging or MR elastography might be more exact and less dangerous than biopsy, but are also costly and time-consuming, making screening or monitoring inefficient [38]. Other non-invasive methods, such as ultrasound or transient elastography, lack sensitivity and accuracy. No known single non-invasive parameter achieves sufficient sensitivity and specificity to discern NAFL from NASH, let alone reflect nuanced NAS-based grading [39]. The current consensus seems to be that more exact and easily accessible biomarker panels are warranted for research and disease monitoring [23,34]. Because of this diagnostic dilemma, many scores and methods for non-invasive assessment of NAFLD have been proposed in various settings. In the present study, we evaluated previously published scoring systems in a specific cohort to identify the ideal one for our purposes. Unfortunately, few of the scores correlate with NAS or fibrosis score in our cohort. The best performing score, Gholam's score, includes presence of diabetes as a dichotomous variable, prohibiting disease or therapy monitoring, in particular for short time frames. Separation of NAFL versus NASH is only possible with the APRI score, initially developed to predict advanced fibrosis in NASH, with a suboptimal AUC of 0.65. Another disadvantage of many scores is a dichotomous separation of disease entities: non-NASH vs. NASH, no/mild fibrosis vs. advanced fibrosis. This is sufficient in clinical settings to decide on further diagnostics or therapy for an individual patient. However, it does not reflect the underlying biology, where a more nuanced spectrum exists. Non-invasive scores to predict severity of NAFLD or fibrosis have limited performance in independent settings, as shown in the current study.
The unsatisfactory performance of existing scores in our setting prompted us to find an optimized score from biomarkers available in the present cohort in an unsupervised manner. An ensemble biomarker discovery approach, which included several machine learning algorithms, identified age, γGT, HbA1c, M30, and adiponectin as ideal available parameters. The logistic regression model generated from these parameters has a strong correlation with NAS in the training and in the independent validation cohort. In addition, it demonstrated reasonable performance to discriminate between NAFL and NASH in both cohorts. Notably, individual components of the new score are not limited to classic liver-related markers. The selected markers reflect an overall metabolic health state:
• age as surrogate for the duration of metabolic derangements or obesity;
• γGT as classic liver injury marker;
• HbA1c as surrogate marker for impaired glucose metabolism;
• M30 as marker for epithelial/hepatocyte apoptosis;
• adiponectin as marker for adipose tissue function.
We believe that this combination of factors from adipose tissue, glucose metabolism, and liver health to detect NASH represents processes influencing severity of NAFLD by altered hepatic metabolism in obesity. Of note, these markers have been selected by an unbiased biomarker discovery method and seem to reflect an organism-wide metabolic situation, as shown in the cohort under weight loss therapy. One advantage of the new score is a continuous distribution of values. We currently refrain from giving a fixed cut-off to exclude or definitely confirm NASH, as this may depend on the tested cohort. However, the score reflects a biological spectrum from low to high liver injury due to NAFLD.
In a cohort with treatment for obesity, the new score seems to allow assessment of the metabolic status of a patient. Higher scores indicate an impaired situation regarding metabolic syndrome and associated diseases, while negative values suggest a low risk of metabolic alterations. For an individual patient, reduction of the score would demonstrate an improved situation and therapeutic benefit beyond mere weight loss. In parallel to weight loss and improvement of many health indicators, the new score dropped in 60% of the tested patients. Upon analysis of the remaining 40%, relatively low absolute weight loss, lower reduction of waist circumference, and only mildly impaired metabolic rate were associated with increased score values. This suggests that the new score reflects overall metabolic health and could indicate objective efficiency of weight loss programs for metabolic health. This particular finding is of course limited by the lack of liver biopsies to confirm the situation in the liver itself. Moreover, the baseline score differed significantly between the groups, suggesting that part of the study cohort had only mild NAFLD-related liver injury, which could not improve during the weight loss treatment. These and other possible confounders for monitoring with the new score need to be evaluated in appropriate study settings.
There are some limitations of the presented score. The training cohort for score generation was extremely obese, which might influence distribution of values for most of the included parameters. This includes a very low proportion of advanced fibrosis (2.4%), a common observation in extremely obese NAFLD patients [19,35]. Advanced fibrosis is the central determining factor for hepatic outcomes, including those in NAFLD, and for further clinical and therapeutic management [23,40,41]. Thus, non-invasive detection of advanced fibrosis would be more important from a purely hepatological point of view. As the new score was able to separate no or mild fibrosis from advanced fibrosis in this cohort, despite the low proportion of advanced fibrosis, it would be interesting to test the score in cohorts with higher prevalence. A further limitation is that categorization of NAFL vs. NASH was performed by NAS [24], since this is still widely applied in clinical practice. Separation of NAFL vs. NASH by the SAF [11] would result in a slightly different score; nevertheless, performance of the new score in the training cohort when SAF was applied was still reasonable. Another limitation is the population background, which is mostly Caucasian/European. Data from other cohorts will be required to test whether the newly generated score can reach similar performance in other populations. To this end, this new score is freely available and open to use for testing in any cohort. Integration of new data into the score will hopefully refine it and further improve performance for specific cohorts and populations. One addition could be ultrasonographic or elastographic measurements, when available in a sufficiently large cohort. Finally, the use of adiponectin as a parameter could be seen as another limitation, as it is not routinely measured, not even in settings of metabolic alterations, diabetes, or suspected NAFLD.
It has been consistently shown that reduced adiponectin is closely associated with the status of the adipose tissue and the liver in obesity and metabolic syndrome [9,10,42-46]. Adiponectin also seems to indicate severity of glucose intolerance or insulin resistance and could possibly replace the BMI in screening approaches for diabetes [9,47,48]. In the present study, an unsupervised selection of parameters again resulted in adiponectin as one important factor indicating severity of liver injury in NAFLD. Thus, we consider it timely to integrate adiponectin quantification into clinical routine for metabolic syndrome, insulin resistance, and associated diseases.
In summary, we present a new score for non-invasive assessment of the severity of NAFLD. Advantages of this score are a continuous distribution allowing disease assessment apart from a dichotomous classification as NAFL or NASH. Additional parameters, e.g., transient elastography or controlled attenuation parameter, could be added, given sufficiently large reference datasets. The score could possibly be used to monitor disease progression or resolution over time. We invite the scientific community and in particular all hepatologists examining patients with suspected NAFLD to test and apply the new score to assess patient health at http://CHek.heiderlab.de.

In a cohort of 164 obese individuals with NAFLD, AUCs were calculated to assess classification into NAFL or NASH by known non-invasive scores. The APRI score reached an AUC of 0.65 (A) and Gholam's score an AUC of 0.7 (B), both significantly better than random guessing. This was not the case for the BARD (C), the NAFLD fibrosis score (D), or Palekar's score (E). (JPG)

S3 Fig. Current non-invasive scores for assessment of NAFLD severity do not correlate with fibrosis stage. In a cohort of 164 obese individuals with NAFLD, known non-invasive scores were tested for correlation with the fibrosis stage. None of the tested scores, APRI (A), Gholam's score (B), BARD (C), NAFLD fibrosis score (D), or Palekar's score (E), was correlated with the histological fibrosis stage. ROC calculations did not show separation between no or mild fibrosis (grades 0-2) and advanced fibrosis (grades 3-4) that would have been better than random guessing (not shown). This lack of performance might be due to the very low number of individuals with advanced fibrosis (grade 3: n = 3; grade 4: n = 1). (JPG)

S4 Fig. The new score significantly correlates with SAF score. Using NAS-based classification of NAFLD to generate the new score might be seen as a limitation. Thus the score was correlated to the SAF in the training (A) and validation cohort (C), resulting in significant robust correlations. In addition, the score was tested to separate NAFL from NASH according to the Bedossa algorithm (at least 1 point each in steatosis, ballooning and lobular inflammation to diagnose NASH). The new score achieved reasonable performance in the training cohort (B) but insufficient performance in the validation cohort (D), though still significant versus random guessing. (JPG)

S1 Table.