Association of specific gene mutations derived from machine learning with survival in lung adenocarcinoma

Lung cancer is the second most common cancer in the United States and the leading cause of mortality in cancer patients. Biomarkers predicting survival of patients with lung cancer have a profound effect on patient prognosis and treatment. However, predictive biomarkers for survival and their relevance for lung cancer are not yet well known. The objective of this study was to perform machine learning with data from The Cancer Genome Atlas of patients with lung adenocarcinoma (LUAD) to find survival-specific gene mutations that could be used as survival-predicting biomarkers. To identify survival-specific mutations according to various clinical factors, four feature selection methods (information gain, chi-squared test, minimum redundancy maximum relevance, and correlation) were used. Extracted survival-specific mutations of LUAD were applied individually or as a group for Kaplan-Meier survival analysis. Mutations in MMRN2 and GMPPA were significantly associated with patient mortality while those in ZNF560 and SETX were associated with patient survival. Mutations in DNAJC2 and MMRN2 showed significant negative association with overall survival while mutations in ZNF560 showed significant positive association with overall survival. Mutations in MMRN2 showed significant negative association with disease-free survival while mutations in DRD3 and ZNF560 showed positive association with disease-free survival. Mutations in DRD3, SETX, and ZNF560 showed significant positive association with survival in patients with LUAD while the opposite was true for mutations in DNAJC2, GMPPA, and MMRN2. These gene mutations were also found in other cohorts of LUAD, lung squamous cell carcinoma, and small cell lung cancer. In the LUAD subset of the Pan-Lung Cancer cohort, mutations in GMPPA, DNAJC2, and MMRN2 showed significant negative associations with patient survival while mutations in DRD3 and SETX showed significant positive association with survival.
In this study, machine learning was conducted to obtain information necessary to discover specific gene mutations associated with the survival of patients with LUAD. Mutations in the above six genes could predict survival rate and disease-free survival rate in patients with LUAD. Thus, they are important biomarker candidates for prognosis.


Introduction
Lung cancer is the leading cause of death in patients with cancer. In the United States, it is the second most common cancer in both men and women, following prostate cancer in men and breast cancer in women [1][2][3]. In the 1990s, stomach and lung cancers were the leading causes of death among cancer patients in Korea, with stomach cancer accounting for 25% of cases and lung cancer accounting for 17% of cases. In 2016, 17,963 people died from lung cancer in Korea, accounting for 23% of all cancer-related deaths. Although lung cancer has the highest mortality rate, few biomarkers for predicting overall survival or disease-free survival have been reported. Accurately predicting survival rate in patients with cancer has a significant impact on their prognosis and treatment [4][5][6].
Machine learning methods have been used in a variety of ways in cancer research [7][8][9][10]. Applying machine learning methods to breast cancer samples has made it possible to identify breast cancer patients by their genetic mutations [7]. One prostate cancer study combined machine learning methods with National Institute for Health and Care Excellence features to examine the association between the prognosis of prostate cancer and genetic mutation profile [8]. In addition, previous studies have applied machine learning using healthy eating index scores to predict the interaction between colorectal cancer and overweight status [9]. In another study, modeling was used to demonstrate the benefit of the exact binomial test for analyzing genome-wide somatic gene mutations through performance comparisons among different machine learning models [10].
In this study, machine learning methods with data from The Cancer Genome Atlas (TCGA) of patients with lung adenocarcinoma (LUAD) were utilized to discover gene mutations associated with patient survival. Results suggested that mutations in six genes, DRD3, SETX, ZNF560, DNAJC2, GMPPA, and MMRN2, were significantly associated with survival and overall survival time of patients with LUAD. These gene mutations could be used as survival-predicting biomarkers. Machine learning can be a useful tool to discover important biomarkers for predicting prognosis and survival in patients with lung cancer.

Ethics statement
All patient data were acquired from previously published studies where written informed consent was obtained [1][11][12][13][14][15]. The TCGA-LUAD cohort, Pan-Lung Cancer cohort, Lung Squamous Cell Carcinoma TCGA cohort, Small Cell Lung Cancer cohort, Lung Adenocarcinoma Broad cohort, and Lung Adenocarcinoma MSKCC cohort in their methods stated that "Specimens were obtained from patients, with appropriate consent from institutional review boards" [1], "All specimens were obtained from patients with appropriate consent and with approval from the relevant institutional review boards" [11], "All specimens were obtained from patients with appropriate consent from the relevant Institutional Review Board" [12], "Human tumour samples were obtained from patients under IRB-approved protocols following written informed consent" [13], "Informed consent (Institutional Review Board) was obtained for each sample using protocols approved by the Broad Institute of Harvard and MIT and each originating tissue source site" [14], and "All patients had consented to Institutional Review Board-approved protocols permitting tissue collection and sequencing" [15], respectively.
and mutation data matrix. In the original data, clinical information of 471 patients and mutation status of about 40,000 genes were recorded. After preprocessing, label setting and identification were given. The 471 patients in the data set were divided into a survival group of 303 living patients (64.3%) and a non-survival group of 168 dead patients (35.7%). Data sets of the Pan-Lung Cancer cohort (n = 954), Lung Squamous Cell Carcinoma TCGA cohort (n = 498), Small Cell Lung Cancer cohort (n = 101), Lung Adenocarcinoma Broad cohort (n = 135), and Lung Adenocarcinoma MSKCC cohort (n = 34) were also used to determine specific gene mutation frequencies. The first three data sets were used to analyze the association of specific gene mutations with survival.
All data used within this study were obtained from open access data sets. They have passed the criteria for unrestricted publication with the following statement listed at https://cancergenome.nih.gov/publications/publicationguidelines "No restrictions; all data available without limitations".

Machine learning
RapidMiner (Boston, MA, USA) was the software used for machine learning. For feature selection, information gain, chi-squared test, minimum redundancy maximum relevance, and correlation algorithms were used. Classification algorithms included naive Bayes, k-nearest neighbors, support vector machine, and decision trees [16]. This study concentrated on the yield and selection of gene mutations using dependent algorithms rather than on the improvement of algorithms (S1 Fig). Accuracy, precision, recall, classification error, and correlation are shown in S1 Table.
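As an illustration of the first feature selection method named above, the following minimal sketch ranks genes by information gain on a binary mutation matrix. It is not the RapidMiner workflow used in the study. The function names, gene names (GENE_A to GENE_C), and toy data are hypothetical and chosen only to show the calculation, assuming mutation status is coded as 1 (mutated) or 0 (not mutated) and the label as 1 (deceased) or 0 (living).

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(feature, labels):
    """Reduction in label entropy after splitting patients on a discrete feature."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_genes(mutations, labels):
    """Return (gene, information gain) pairs sorted from highest to lowest gain."""
    scores = {g: information_gain(col, labels) for g, col in mutations.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data, not taken from the study.
labels = [1, 1, 1, 0, 0, 0, 0, 0]  # 1 = deceased, 0 = living
mutations = {
    "GENE_A": [1, 1, 1, 0, 0, 0, 0, 0],  # mutated only in deceased patients
    "GENE_B": [1, 0, 0, 1, 0, 0, 0, 0],  # weakly related to the label
    "GENE_C": [0, 0, 0, 0, 0, 0, 0, 0],  # never mutated, carries no information
}
ranking = rank_genes(mutations, labels)
for gene, gain in ranking:
    print(f"{gene}: {gain:.4f}")
```

A gene whose mutation status separates the two groups perfectly receives a gain equal to the entropy of the label, while a gene that is never mutated receives a gain of zero. In practice the top-ranked genes would then be passed to a classifier.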

Data analysis
To assess the specificity of gene mutations, Fisher's exact test and Kaplan-Meier analysis were applied. Frequencies of gene mutations were compared using Fisher's exact test. Overall survival and disease-free survival were calculated using Kaplan-Meier analysis based on clinical information of mortality, survival, and observation time for patients. cBioPortal software was used to evaluate gene mutation status within the TCGA-LUAD cohort [17,18]. A p value of less than 0.05 was considered statistically significant.
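The frequency comparison described above can be sketched with a two-sided Fisher's exact test on a 2x2 table of mutation status by survival group. The implementation below is a minimal standard-library sketch, not the software used in the study. The group sizes (168 deceased, 303 living) come from the data set description, whereas the mutation counts (8 and 2) are hypothetical and serve only as an example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p value for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g. deceased, living); columns are mutated, not mutated.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Group sizes from the data set description; mutation counts are hypothetical.
deceased_total, living_total = 168, 303
deceased_mut, living_mut = 8, 2
p_value = fisher_exact_two_sided(
    deceased_mut, deceased_total - deceased_mut,
    living_mut, living_total - living_mut,
)
print(f"p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```

The exact test is a natural choice here because mutations in any single gene are rare, so the expected cell counts are too small for a chi-squared approximation to be reliable.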

Results
In this study, to discover gene mutations predicting survival, data of LUAD patients obtained from TCGA were processed and classified using machine learning methods. Mutations in 19 genes were then selected and analyzed by frequencies, overall survival, and disease-free survival. Results suggested that specific gene mutations were associated with patient survival. Among mutations in 19 genes, mutations in GMPPA and MMRN2 were significantly associated with patient mortality while mutations in ZNF560 and SETX were significantly associated with patient survival (Fig 1 and Table 1). Mutations in DNAJC2 and MMRN2 showed significant negative association with overall survival while mutations in ZNF560 showed significant positive association with overall survival. The median survival time in patients with LUAD was about 49 months. However, the median survival time in patients with mutations in MMRN2 was about 11 months. Mutations in MMRN2 were significantly and negatively associated with disease-free survival while mutations in DRD3 and ZNF560 were significantly and positively associated with disease-free survival. The median disease-free survival time in patients with LUAD was about 36 months while that in patients with mutations in MMRN2 was about 5 months. Mutations in DNAJC2, GMPPA, MMRN2, DRD3, SETX, and ZNF560 were associated with survival in patients with LUAD. As shown in Table 2, patients with LUAD lacking mutations in DNAJC2 or MMRN2 had a median survival time of about 48 months while those with mutations in DNAJC2 or MMRN2 all died within 20 months (S2 Fig). About 42% of patients with LUAD lacking a mutation in MMRN2 relapsed with a median disease-free time of about 33 months while those with mutations in MMRN2 all relapsed within 10 months.
Therefore, mutations in DNAJC2 and/or MMRN2 are considered to be predictors of survival or relapse of patients with LUAD since the probability of death or recurrence due to LUAD might be higher in the presence of mutations in DNAJC2 or MMRN2. In contrast, patients with LUAD without a mutation in ZNF560 had shorter survival than other patients, with a median survival time of about 45 months, while those with mutations in ZNF560 all survived. Additionally, about 42% of patients with LUAD without a mutation in ZNF560 or DRD3 relapsed with a median disease-free time of about 30 months. However, those with mutations in ZNF560 or DRD3 relapsed at a rate of about 11% or 0%, respectively. Because the probability of death or relapse due to LUAD might be lower when ZNF560 or DRD3 was mutated, mutations in ZNF560 or DRD3 were considered to be predictors of survival or relapse in patients with LUAD.
To evaluate the association between mutations in multiple genes and survival, mutations in 19 genes and those in genes associated with survival were analyzed (Fig 2 and S2 Table). Mutations in these 19 genes were not associated with overall survival or disease-free survival. Mutations in DNAJC2 or MMRN2 were negatively associated with survival. They significantly decreased the median survival time to 9.95 months and the median disease-free survival time to 4.57 months. However, mutations in DRD3 or ZNF560 were positively associated with survival. They significantly increased both survival time and disease-free survival time. Furthermore, mutations in DNAJC2, GMPPA, or MMRN2 significantly decreased the median survival time to 11.27 months and the median disease-free survival time to 6.87 months (Fig 3). Mutations in DRD3, SETX, or ZNF560 significantly increased both survival time and disease-free survival time. Patients with LUAD who had mutations in DNAJC2, GMPPA, or MMRN2 exhibited significantly shorter survival and earlier recurrence than those without mutations. However, patients with mutations in DRD3, SETX, or ZNF560 exhibited longer survival and later recurrence than those without these mutations.
Table 1. Comparative analysis of mutation frequency and survival with mutations in 19 genes selected by feature selection methods.
The frequency of mutations in the six genes associated with survival was further analyzed in other lung cancer types such as lung squamous cell carcinoma, small cell lung cancer, and another two data sets of LUAD (Table 3 and S3 Fig). Mutations in these six genes were found not only in another two LUAD data sets (n = 21 and n = 4), but also in lung squamous cell carcinoma (n = 28), small cell lung cancer (n = 12), and Pan-lung cancer (n = 146).
Associations of mutations in six genes with survival were analyzed using three data sets of lung cancer cohorts with survival information (Table 4). Mutations in GMPPA, DNAJC2, and MMRN2 were significantly associated with patient mortality in lung adenocarcinoma of Pan-Lung Cancer cohort. Mutations in DNAJC2 and MMRN2 were significantly associated with shortened overall survival (median survival of 10 and 21.9 months, respectively) (S4 Fig). Mutations in DRD3 and SETX were significantly associated with patient survival, and mutations in SETX extended overall survival. These gene mutations were not associated with mortality or overall survival in other types of lung cancer. These results suggested that mutations in DNAJC2, GMPPA, MMRN2, DRD3, and SETX could be significantly associated with survival in patients with LUAD. They might be considered as biomarkers for predicting survival or recurrence in patients with LUAD.
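The median survival times reported in this section are the kind of quantity read from a Kaplan-Meier curve. The following minimal sketch shows how such an estimate and its median can be computed, assuming the usual product-limit definition and taking the median as the first time at which estimated survival falls to 0.5 or below. It is not the cBioPortal implementation, and the follow-up times and event flags are hypothetical toy values.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time per patient (e.g. in months)
    events : 1 if the event (death or relapse) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0
        removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def median_survival(curve):
    """First time at which estimated survival is 0.5 or below, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical toy cohort, not taken from the study.
times = [5, 8, 8, 12, 20, 30, 42, 60]
events = [1, 1, 0, 1, 1, 0, 0, 0]
curve = kaplan_meier(times, events)
print(curve)
print("median:", median_survival(curve))

# A group in which no event was observed has no estimable median.
print("median, all censored:", median_survival(kaplan_meier([10, 20], [0, 0])))
```

When every patient in a group is censored, as in the last call, the curve never drops and the median is not reached. This corresponds to the situation of a mutation group in which all patients survived.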

Discussion
Lung cancer is the second most common cancer and has a high mortality rate. The discovery of biomarkers that can predict overall survival of lung cancer patients is essential for treatment of patients. Identification of survival-specific gene mutations is important not only for understanding genetic disparities associated with survival, but also for predicting the survival of LUAD patients. These gene mutations can be significant biomarkers for LUAD. In this study, the TCGA LUAD data set was used to derive gene mutations by machine learning. Patients with LUAD were divided into surviving and non-surviving groups and machine learning was performed with four feature selection methods to identify gene mutations associated with survival of patients with LUAD from mutations in about 40,000 genes [16][19][20][21][22][23][24]. The most frequently observed mutations determined by machine learning were in the SETX and ZNF560 genes. Mutational incidence of SETX, ZNF560, GMPPA, and MMRN2 was significant. Mutations in MMRN2 and DNAJC2 were significantly and negatively associated with patient survival while those in ZNF560 and DRD3 were positively associated with patient survival. Mutations in genes determined by machine learning seem to influence survival in LUAD.
Because of the relatively small number of mutations in the LUAD cohort, mutations in the six genes were examined in other lung cancer data sets, namely the Lung Adenocarcinoma (Broad and MSKCC) cohorts, the Lung Squamous Cell Carcinoma TCGA cohort, the Pan-Lung Cancer cohort, and the Small Cell Lung Cancer cohort, to analyze mutation frequencies and associations with survival. The average frequency of mutations in the six genes ranged from 0.81% to 3.30%, and their associations with survival were similar between the LUAD cohort and the Pan-Lung Cancer cohort. The Pan-Lung Cancer data set was composed of LUAD and lung squamous cell carcinoma. In the LUAD subset of the Pan-Lung Cancer cohort, mutations in DNAJC2, GMPPA, MMRN2, DRD3, and SETX were significantly associated with survival status, and those in DNAJC2, MMRN2, and SETX were significantly associated with overall survival. This result supports the idea that mutations in these six genes can predict the survival of patients with LUAD and overall survival time. They could be considered as biomarkers of LUAD.
Mutations in MMRN2 and DNAJC2 were observed to be important negative predictors of survival and prognosis. MMRN2 encodes multimerin 2, an elastin microfibril interface-located (EMILIN)-like protein and extracellular matrix glycoprotein [25]. MMRN2 acts as a modified growth factor β antagonist. It can interfere with the VEGF-A/VEGFR2 pathway in endothelial cells [25]. Recent studies have demonstrated that CLEC14A-MMRN2 binding has potential for future anti-angiogenic therapy because it plays a role in inhibiting angiogenesis during tumor growth [26]. The DNAJC2 gene encodes a phosphorylated protein with a J domain and a Myb DNA-binding domain. Its protein is observed in both the nucleus and the cytoplasm [27]. DNAJC2 protein can form a heterodimeric complex with the ribosome to act as a molecular protector for the initial polypeptide chain when exiting the ribosome [27]. DNAJC2 protein has been identified as a leukemia-associated antigen. Its expression is increased in those with leukemic seizures [28]. In addition, chromosomal abnormalities involving the DNAJC2 gene are associated with primary head and neck squamous cell tumors [29]. These studies have revealed molecular mechanisms by which MMRN2 and DNAJC2 either cause or exacerbate cancers. However, further studies are needed to determine the role of MMRN2 and DNAJC2 in LUAD.
Mutations in ZNF560 and SETX were observed to be important positive predictors of survival and prognosis. The ZNF560 gene has been reported in colorectal cancer studies [30]. Left-sided colon cancer (LSCC) and right-sided colon cancer (RSCC) differ in their genetic susceptibilities to neoplastic transformation. Patients with LSCC had lower mortality and a better overall 5-year survival rate than patients with RSCC [30]. ZNF560 was down-regulated in LSCC compared to that in RSCC. It may be useful for predicting a positive prognosis. SETX is an RNA/DNA helicase that splices RNA, regulates gene expression, terminates transcription, and stabilizes telomeres and the genome [31]. Mutations in SETX are linked to neurodegenerative disorders, ataxia oculomotor apraxia type 2, and amyotrophic lateral sclerosis type 4 [31]. Although the role of SETX in cancer is not known, its expression level is relatively lower than that of other genes. Since there is little research on the role of ZNF560 and SETX in cancers, more research is needed to understand their roles in LUAD.
Since this study focused on the yield and selection of gene mutation rather than deducing an efficient algorithm through machine learning, a dependent algorithm was used. In this case, the weighted results of the independent algorithm could not be obtained. Utilizing 100% trained data is closer to probability statistics than machine learning. Of 19 gene mutations, six gene mutations were significantly associated with survival in LUAD, showing a relatively high rate (about 32%). Further study is needed to determine the differences between using dependent and independent algorithms in machine learning methods for analyzing medical information of solid tumors.
Among the various feature selection methods available, it is important to apply the one that best classifies human cancer genetic mutations according to specific factors. Previously reported feature selection methods in medical studies have used Weka, which can implement information gain, correlation, and ranker algorithms, and ensemble learning methods [32][33][34]. However, in order to classify gene mutations using a dependent algorithm, selection methods for prediction of economic demand as well as the above feature selection methods were applied to feature selection in this study. Of the algorithm combinations used, the combination with the highest classification prediction rate was information gain with naive Bayes. It can be adopted to analyze RNA sequence or other medical information in LUAD.
In summary, machine learning was conducted to obtain information necessary to select mutations in genes associated with survival of patients with LUAD. We identified specific mutational markers associated with survival of patients with LUAD. Mutations in DNAJC2, GMPPA, and MMRN2 can be used as biomarkers of negative prognosis for patients' overall survival and disease-free survival while mutations in DRD3, SETX, and ZNF560 can be used as biomarkers of positive prognosis. This study also suggested a predictive classification model of LUAD based on mutation expression.