Application of machine learning in the diagnosis of gastric cancer based on noninvasive characteristics

Background The diagnosis of gastric cancer mainly relies on endoscopy, which is invasive and costly. Aims To construct a predictive model for the diagnosis of gastric cancer with high accuracy based on noninvasive characteristics. Methods A retrospective study of 709 patients at Zhejiang Provincial People's Hospital was conducted. Variables of age, gender, blood cell count, liver function, kidney function, blood lipids, tumor markers and pathological results were analyzed. We used a gradient boosting decision tree (GBDT), a type of machine learning method, to construct a predictive model for the diagnosis of gastric cancer and evaluated the accuracy of the model. Results Of the 709 patients, 398 were diagnosed with gastric cancer; 311 were healthy people or were diagnosed with benign gastric disease. Multivariate analysis showed that gender, age, neutrophil lymphocyte ratio, hemoglobin, albumin, carcinoembryonic antigen (CEA), carbohydrate antigen 125 (CA125) and carbohydrate antigen 199 (CA199) were independent characteristics associated with gastric cancer. We constructed a predictive model using GBDT, and the area under the receiver operating characteristic curve (AUC) of the model was 91%. For the test dataset, sensitivity was 87.0% and specificity 84.1% at the optimal threshold value of 0.56. The overall accuracy was 83.0%. Positive and negative predictive values were 83.0% and 87.8%, respectively. Conclusion We constructed a predictive model to diagnose gastric cancer with high sensitivity and specificity. The model is noninvasive and may reduce medical costs.


Introduction
Gastric cancer is a common malignancy with high incidence and mortality rates. It ranks third among cancers worldwide in mortality, accounting for 8.2% of cancer deaths [1]. It has also been identified as one of the leading causes of cancer death in China [2]. Indeed, gastric cancer remains a serious health issue in China: there are approximately 400 000 new cases of gastric cancer in China every year, with a crude incidence rate of 42 per 100 000 in males and 20 per 100 000 in females. As the incidence and mortality rates of stomach cancer in urban areas are higher than those in rural areas [3,4], prevention and control strategies should be implemented while considering regional differences.
At present, the diagnosis of gastric cancer mainly relies on endoscopy and surgery. However, some patients, especially in rural areas, refuse to undergo endoscopy or surgery due to the invasive nature and costliness of the procedure [5]. Unfortunately, there are no noninvasive characteristics defined for detecting gastric cancer with high sensitivity and specificity. Early diagnosis and early treatment are crucial for improving the survival rate and reducing the mortality of gastric cancer. Therefore, it is of great significance to explore noninvasive characteristics or models by which to diagnose gastric cancer.
Machine learning has been applied to detect many types of cancer, such as colorectal cancer [6] and breast cancer [7], with high accuracy. In the field of gastric cancer, machine learning is mainly applied for analyzing endoscopic images, which are obtained through invasive procedures [8,9]. In contrast, detection of routine blood, biochemical, and tumor markers is noninvasive and inexpensive. Therefore, we constructed a predictive model using machine learning to diagnose gastric cancer based on these noninvasive characteristics.

Materials and methods
The study was reviewed and approved by the institutional review board (IRB) of the Zhejiang Provincial People's Hospital (220QT111). The need to obtain written informed consent for participants' clinical records to be used was also waived by the IRB for this retrospective study. As was approved by the IRB, we used patient identification numbers to collect and analyze clinical records. Names and other personal information were anonymized and de-identified prior to analysis to protect patient privacy.

Study subjects
We reviewed the medical records of 960 people who were diagnosed with gastric cancer or benign gastric disease, or who were healthy, from December 2018 to August 2019 at Zhejiang Provincial People's Hospital. We used medical record numbers to identify individual subjects. The inclusion criteria were as follows: i) age >18 years, ii) histologically confirmed gastric cancer, benign gastric disease or healthy status, iii) complete relevant data, iv) no other cancer. The exclusion criteria were as follows: i) incomplete relevant data, ii) double cancers, and iii) recurrent gastric cancer. Of these, 32 patients with another cancer, 166 with recurrent gastric cancer and 53 with insufficient data were excluded. Finally, 709 patients were enrolled in our study to develop a predictive model. The hypothesis of this study is that the predictive model can effectively distinguish gastric cancer patients from non-gastric cancer controls. Preliminary data from our early observation of 80 subjects (40 in each group) indicated that the sensitivity of the predictive model was 0.80 and the specificity was 0.75. We used PASS software (PASS, version 11.0) to estimate the sample size. At least 23 patients in each group were required, with a two-tailed test of α = 0.05, 1 − β (the power) = 0.90 and a ratio between groups of 1:1 (prevalence rate 0.5). Allowing for a 10% drop-out rate, at least 26 patients in each group were required. Thus, the sample size of this study met the requirement. We conducted this study from August 2019 to March 2020.

Data preparation
We identified outliers, treated them as missing values and imputed them using k-means clustering. A univariate analysis was performed to evaluate the relationship between the noninvasive characteristics and the diagnosis of gastric cancer. Measurement data were compared using the t-test if they followed a normal distribution, or the Mann-Whitney U test if they did not. Enumeration data were compared using the chi-square test. Characteristics that were significant in the univariate analysis were then entered into a multivariate analysis, performed by logistic regression, to screen independent characteristics for diagnosing gastric cancer. A P-value of less than 0.05 was considered significant. Data were analyzed with SPSS software (SPSS, version 26.0, United States).
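The test-selection logic of the univariate analysis can be illustrated with a short Python sketch. The study itself used SPSS, so SciPy is swapped in here purely for illustration. The Shapiro-Wilk normality check, the helper names `compare_continuous` and `compare_categorical`, and all numbers are assumptions (synthetic values, not study data).

```python
import numpy as np
from scipy import stats


def compare_continuous(group_a, group_b, alpha=0.05):
    """Use the t-test if both groups look normal, else the Mann-Whitney U test.

    Normality is checked with Shapiro-Wilk, which is an assumption of this
    sketch; the source does not say how normality was assessed.
    """
    looks_normal = (stats.shapiro(group_a).pvalue > alpha
                    and stats.shapiro(group_b).pvalue > alpha)
    if looks_normal:
        return "t-test", stats.ttest_ind(group_a, group_b).pvalue
    result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
    return "Mann-Whitney U", result.pvalue


def compare_categorical(contingency_table):
    """Chi-square test on a table of counts (rows: groups, columns: categories)."""
    return stats.chi2_contingency(contingency_table)[1]


# Synthetic, hypothetical measurements for two groups of 100 subjects each.
rng = np.random.default_rng(0)
hb_cancer = rng.normal(110, 15, 100)       # roughly normal marker
hb_control = rng.normal(135, 12, 100)
cea_cancer = rng.lognormal(1.5, 1.0, 100)  # strongly skewed marker
cea_control = rng.lognormal(0.7, 0.6, 100)
gender_table = np.array([[60, 40], [45, 55]])  # hypothetical counts

hb_test, hb_p = compare_continuous(hb_cancer, hb_control)
cea_test, cea_p = compare_continuous(cea_cancer, cea_control)
gender_p = compare_categorical(gender_table)
print(hb_test, hb_p)
print(cea_test, cea_p)
print("chi-square", gender_p)
```

Characteristics with P < 0.05 in this step would then be passed to the multivariate logistic regression.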
Independent characteristics related to gastric cancer were selected to construct a dataset. The dataset was randomly divided into a training dataset (n = 496) and a test dataset (n = 213) with a proportion of 7:3. The procedure of our study is shown in Fig 1.
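A 7:3 random split of 709 samples can be sketched with scikit-learn as follows. The feature matrix is a random placeholder, and the random seed and the unstratified split are assumptions, since the source does not describe how the split was drawn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the study's dimensions: 709 subjects, 8 independent
# characteristics, 398 gastric cancer (1) and 311 non-gastric cancer (0).
rng = np.random.default_rng(0)
X = rng.normal(size=(709, 8))
y = np.array([1] * 398 + [0] * 311)

# test_size=0.3 rounds the test set up to 213 samples, leaving 496 for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(X_train.shape[0], X_test.shape[0])
```

Passing `stratify=y` would keep the class proportions equal in both subsets; whether the study did so is not stated.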

Training algorithm
A gradient boosting decision tree (GBDT) [10] was used to construct the model with the Python scikit-learn (sklearn) package. GBDT has been widely used in machine learning. It requires a total of M iterations. Each iteration generates a weak learner, and each learner is trained on the residual of the previous learner. By using the gradient descent method, each iteration moves in the negative gradient direction of the loss function so that the loss decreases and the model becomes increasingly accurate. The architecture of the GBDT is shown in Fig 2.
Fig 1. The procedure of data preparation and model construction. A univariate analysis was performed to evaluate the relationship between the characteristics and diagnosis of gastric cancer. Significant characteristics were screened in the univariate analysis. Then, a multivariate analysis was performed to screen independent characteristics for diagnosing gastric cancer. Independent characteristics related to gastric cancer were selected to construct a dataset. The dataset was randomly divided into a training dataset to construct the model and a test dataset to test the model.
The principle of GBDT is as follows. Samples in the training dataset were model inputs, and the predicted results were model outputs. Samples in the training dataset were expressed as $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$. The parameter x denoted the independent characteristics for diagnosing gastric cancer, and the parameter y was the label (non-gastric cancer or gastric cancer). Sample $(x_i, y_i)$ ranks ith in the training dataset. The total number of iterations was M.
1. First, we initialized the learner in the gradient boosting decision tree.
$$f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$$
where $f_0(x)$ is the initialized learner, a tree with only one root node, $\gamma$ is a constant value that minimizes the loss function, $N$ is the number of training samples in the training dataset, $i$ is a positive integer and $1 \le i \le N$. $L$ is the loss function and defined as
$$L(y, f(x)) = \log\left(1 + \exp(-y f(x))\right)$$
where $y \in \{-1, +1\}$ and $f(x)$ is the output value of the training sample $(x, y)$.
2. Then, to update the learner in the GBDT, we calculated the negative gradient of the loss function for the ith training sample in iteration m.
$$r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)} = \frac{y_i}{1 + \exp\left(y_i f_{m-1}(x_i)\right)}$$
where $r_{im}$ is the negative gradient of the loss function (the pseudo-residual) for the ith training sample in iteration m, and $f(x_i)$ is the output value of the learner for the training sample $(x_i, y_i)$, from which the predicted occurrence probability for the ith training sample is obtained. The $r_{im}$ values are used in the next iteration to fit a new regression tree $f_m(x)$. The corresponding leaf node regions of $f_m(x)$ are $R_{jm}$, where $j$ indexes the leaf nodes, $J$ is the number of leaf nodes, $j$ is a positive integer and $1 \le j \le J$.
3. Then, we calculated the best fit value for each leaf region.
$$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\left(y_i, f_{m-1}(x_i) + \gamma\right)$$
where $\gamma_{jm}$ is the best fit value for the jth leaf node in iteration m and $f_{m-1}(x_i)$ is the output value of the learner for the training sample $(x_i, y_i)$ in iteration m-1.
4. Then, we updated the strong learner.
$$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm})$$
where $I$ is the indicator function, equal to 1 if $x \in R_{jm}$ and 0 otherwise.
5. Finally, we obtained the final strong learner.
$$f(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm})$$
where $f_M(x)$ is the learner obtained in iteration M.
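The steps above can be sketched from scratch in Python. This is an illustrative toy under stated assumptions, not the scikit-learn implementation used in the study: it assumes the logistic loss log(1 + exp(-y f(x))) with labels in {-1, +1}, approximates each leaf value with a single Newton step instead of an exact minimization, and adds a shrinkage factor `learning_rate`. The names `fit_gbdt` and `decision_function` are hypothetical, and the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor


def mean_logistic_loss(y, f):
    """Mean of log(1 + exp(-y * f)) for labels y in {-1, +1}."""
    return float(np.mean(np.logaddexp(0.0, -y * f)))


def fit_gbdt(X, y, n_iter=30, max_depth=3, learning_rate=0.1):
    """Fit a toy GBDT; returns (f0, learning_rate, [(tree, leaf_values), ...])."""
    # Step 1: constant initial learner. For the logistic loss the minimizer
    # is the log-odds of the positive class.
    f0 = float(np.log(np.sum(y == 1) / np.sum(y == -1)))
    f = np.full(len(y), f0)
    stages = []
    for _ in range(n_iter):
        # Step 2: negative gradient of the loss at the current output.
        r = y / (1.0 + np.exp(y * f))
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, r)
        leaf_of_sample = tree.apply(X)
        # Step 3: one value per leaf region. A single Newton step is used
        # here as an approximation of the exact arg-min (assumption).
        leaf_values = {}
        for leaf in np.unique(leaf_of_sample):
            r_leaf = r[leaf_of_sample == leaf]
            hessian = np.abs(r_leaf) * (1.0 - np.abs(r_leaf))
            leaf_values[leaf] = r_leaf.sum() / (hessian.sum() + 1e-12)
        # Step 4: add the new tree to the strong learner (with shrinkage).
        f = f + learning_rate * np.array([leaf_values[k] for k in leaf_of_sample])
        stages.append((tree, leaf_values))
    return f0, learning_rate, stages


def decision_function(model, X):
    """Evaluate the final strong learner f_M(x) on new samples."""
    f0, learning_rate, stages = model
    f = np.full(X.shape[0], f0)
    for tree, leaf_values in stages:
        f = f + learning_rate * np.array([leaf_values[k] for k in tree.apply(X)])
    return f


# Synthetic data standing in for the eight independent characteristics.
X, y01 = make_classification(n_samples=300, n_features=8, random_state=0)
y = 2 * y01 - 1  # map {0, 1} to {-1, +1}
model = fit_gbdt(X, y)
loss_before = mean_logistic_loss(y, np.full(len(y), model[0]))
loss_after = mean_logistic_loss(y, decision_function(model, X))
train_accuracy = float(np.mean(np.sign(decision_function(model, X)) == y))
print(loss_before, loss_after, train_accuracy)
```

Setting `learning_rate=1.0` recovers the update in step 4 without shrinkage; smaller values trade more iterations for a more gradual fit.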
Tuning parameters of GBDT. The parameters of GBDT can be divided into boosting parameters and tree-specific parameters. Boosting parameters consisted of the learning rate and the number of trees. Tree-specific parameters consisted of min_samples_leaf and max_depth. We used k-fold cross-validation to tune the parameters. In our study, k = 5. The training dataset was split into 5 folds, each of which was used once as the validation dataset. In the first iteration, the first fold was used to validate the model and the rest were used to train the model. In the second iteration, the second fold was used as the validation dataset while the rest served as the training dataset. This process was repeated until each of the 5 folds had been used as the validation dataset.
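The 5-fold procedure can be written out explicitly as a loop. The sketch below uses synthetic data with the size of the training dataset (n = 496). Shuffling before splitting, the random seeds and the default classifier settings are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the training dataset (496 samples, 8 characteristics).
X_train, y_train = make_classification(n_samples=496, n_features=8, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes, fold_aucs = [], []
for fit_idx, val_idx in kfold.split(X_train):
    # Train on four folds, validate on the held-out fold.
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_train[fit_idx], y_train[fit_idx])
    val_prob = model.predict_proba(X_train[val_idx])[:, 1]
    fold_sizes.append(len(val_idx))
    fold_aucs.append(roc_auc_score(y_train[val_idx], val_prob))

mean_auc = float(np.mean(fold_aucs))
print(fold_sizes, mean_auc)
```

The mean of the five validation AUCs is the score assigned to one parameter setting.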
First, the tree-specific parameters were kept at their default values (min_samples_leaf = 1 and max_depth = 3), and we performed a grid search to select the optimum learning rate and number of trees (Fig 3). The initial value of the learning rate was 0.02 and the step was 0.02. The initial value of the number of trees was 10 and the step was 10. For each parameter combination on the grid, we calculated the area under the receiver operating characteristic curve (AUC) using 5-fold cross-validation. After searching the entire grid, we selected the parameters with the highest AUC. The optimal learning rate and number of trees were 0.12 and 70, which achieved the highest AUC of 0.86.
Second, the optimum learning rate and number of trees of 0.12 and 70 were used for determining the tree-specific parameters. We set the initial values of the tree-specific parameters to min_samples_leaf = 1 and max_depth = 3. Then we tuned these parameters step by step according to the results of 5-fold cross-validation to obtain a more robust model. We ran 30 combinations, and the ideal values were min_samples_leaf = 10 and max_depth = 5.
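The two-stage tuning can be sketched with scikit-learn's `GridSearchCV`. The grids below are deliberately reduced so that the example runs quickly. They are assumptions and not the grids searched in the study, and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=300, n_features=8, random_state=0)

# Stage 1: tune the boosting parameters with the tree parameters at defaults.
boosting_grid = {"learning_rate": [0.04, 0.12, 0.20], "n_estimators": [30, 70]}
stage1 = GridSearchCV(
    GradientBoostingClassifier(min_samples_leaf=1, max_depth=3, random_state=0),
    boosting_grid,
    scoring="roc_auc",
    cv=5,
)
stage1.fit(X_train, y_train)

# Stage 2: fix the best boosting parameters and tune the tree parameters.
tree_grid = {"min_samples_leaf": [1, 10], "max_depth": [3, 5]}
stage2 = GridSearchCV(
    GradientBoostingClassifier(random_state=0, **stage1.best_params_),
    tree_grid,
    scoring="roc_auc",
    cv=5,
)
stage2.fit(X_train, y_train)

print(stage1.best_params_, stage1.best_score_)
print(stage2.best_params_, stage2.best_score_)
```

Tuning in two stages keeps the number of fitted models small compared with a single four-dimensional grid, at the cost of possibly missing interactions between the two parameter groups.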
The performance of the model is shown in Table 4. The optimal threshold corresponded to the point on the receiver operating characteristic curve closest to the top-left corner of the plot, which maximized the Youden index (sensitivity + specificity − 1). In our study, the optimal threshold was 0.56. At this threshold, sensitivity was 87.0% and specificity 84.1%. The overall accuracy of the model was 83.0%. Positive and negative predictive values were 83.0% and 87.8%, respectively.
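Selecting a threshold by the Youden index and deriving the reported metrics from the confusion matrix can be sketched as follows. The labels and scores are synthetic, and the helper name `youden_threshold` is hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve


def youden_threshold(y_true, y_score):
    """Return the threshold maximizing sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    best = int(np.argmax(tpr - fpr))
    return thresholds[best], tpr[best], 1.0 - fpr[best]


# Synthetic predicted probabilities for 200 hypothetical test subjects.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
logit = 1.5 * (2 * y_true - 1) + rng.normal(0.0, 1.5, 200)
y_score = 1.0 / (1.0 + np.exp(-logit))

threshold, sens_at_best, spec_at_best = youden_threshold(y_true, y_score)

# Classify as positive when the score reaches the threshold, then read the
# metrics off the confusion matrix.
y_pred = (y_score >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(threshold, sensitivity, specificity, ppv, npv, accuracy)
```

Because accuracy is a prevalence-weighted average of sensitivity and specificity, it always lies between the two.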

Discussion
Early diagnosis and early treatment are crucial for improving the survival rate and reducing mortality [11]. Currently, the diagnosis of gastric cancer mainly depends on endoscopy. However, endoscopy is invasive, and some patients experience discomfort during the procedure. In some rural areas, patients cannot afford the expense, which results in high mortality [5]. Therefore, it is of great significance to explore noninvasive characteristics or models to diagnose gastric cancer, which could reduce medical costs and improve patient satisfaction.
A few researchers have used genetics, proteomics and molecular biology to diagnose gastric cancer. For example, Zhou B et al. found that the proteins identified by plasma proteomics could help distinguish early gastric cancer (EGC) from healthy controls [12]. Wu J et al. stated that circulating microRNA-21 is a potential diagnostic biomarker in gastric cancer [13]. Watanabe Y et al. noted that several genes, such as adam23 and mint25, are methylated with higher frequency and have therefore been analyzed as possible biomarkers [14]. Unfortunately, because of their invasiveness, complexity and high cost, these approaches are not available in many hospitals and may impose a financial burden. Tumor markers such as CEA, CA199, CA125, and CA724 are widely used clinically for the detection of gastric cancer. The reported sensitivity of CEA, CA199, CA125 or CA724 alone is approximately 40%, and combined detection of these characteristics can increase the sensitivity to 65% [15][16][17]. Sahin AG et al. found that the NLR is also associated with gastric cancer and that patients with gastric cancer have a significantly higher NLR [18]. Wu Y et al. showed that the combined use of the neutrophil-lymphocyte ratio, platelet-lymphocyte ratio and carcinoembryonic antigen could aid in the diagnosis of gastric cancer [19]. However, the sensitivity and accuracy of these available noninvasive characteristics are not satisfactory. The recent development of machine learning offers the potential to diagnose gastric cancer more accurately. Liu MM et al. applied data mining methods to predict gastric cancer, and the accuracy was 77% [20]. Su Y et al. diagnosed gastric cancer using decision tree classification of mass spectral data with an accuracy of 86.4% [21]. Some researchers have used neural networks on endoscopic images to diagnose gastric cancer with high sensitivity [8,22].
Based on these studies, we combined the noninvasive characteristics with the help of machine learning to diagnose gastric cancer. In this study, we found that gender, age, NLR, Hb, Alb, CEA, CA125 and CA199 were independent characteristics for diagnosing gastric cancer. Gastric cancer was more common in older people and males. Hb and Alb levels in patients with gastric cancer were significantly decreased. Patients with gastric cancer had significantly higher NLR, CEA, CA125 and CA199 than patients without gastric cancer. These findings were consistent with previous reports [15][16][17][18,23].
Fig 4. According to GBDT, a predictive model was constructed to diagnose gastric cancer. We plotted a receiver operating characteristic curve of the probability of non-gastric cancer (negative) and gastric cancer (positive) classifications for the test dataset to assess the robustness of our model. The AUC was 91%. https://doi.org/10.1371/journal.pone.0244869.g004
GBDT is widely used in machine learning. GBDT methods obtain good predictions when dealing with numerous factors and complicated relations among factors. GBDT makes good use of weak classifiers by cascading them and fully considers the weight of each classifier [10]. GBDT often works well with categorical and numerical values and is applicable to our study. We generated a GBDT model with high accuracy in distinguishing patients with gastric cancer from those without gastric cancer based on noninvasive characteristics. We tuned the parameters by k-fold cross-validation to deal with overfitting. Min_samples_leaf and max_depth were tuned to control overfitting. Higher values of min_samples_leaf and lower values of max_depth prevented the model from learning relations that might be highly specific to the particular sample selected for a tree. To our knowledge, this was the first report of the use of GBDT to diagnose gastric cancer based on noninvasive characteristics. In addition, these characteristics are widely used clinically and inexpensive. Patients can be initially screened by the model, and the high-risk patients identified can then be confirmed by endoscopy and pathological biopsy. Furthermore, the model obtained a high prediction performance. The model correctly classified 83.0% of patients in the test dataset, with a positive predictive value of 83.0% and a negative predictive value of 87.8%.
This study had several limitations. First, the sample size was small, and all data were obtained from a single center, which is still far from sufficient to develop a reliable model. Further studies with many more cases and data from other centers are urgently required. Second, we only used GBDT to diagnose gastric cancer. Because each weak learner depends on the previous one, it was difficult to train the model in parallel. In future studies, other related methods, such as neural networks and random forests, could also be used to construct the model.
In conclusion, we constructed a GBDT model to diagnose gastric cancer with high sensitivity and accuracy. The model is noninvasive and could reduce medical costs, and it could be applied for auxiliary diagnosis of gastric cancer.
Supporting information S1 Checklist. STROBE statement-checklist of items that should be included in reports of observational studies. (DOCX) S1 Data. (PDF)