Identifying candidate diagnostic markers for early stage of non-small cell lung cancer

We performed a series of bioinformatics analysis on a set of important gene expression data with 76 samples in early stage of non-small cell lung cancer, including 40 adenocarcinoma samples, 16 squamous cell carcinoma samples and 20 normal samples. In order to identify the specific markers for diagnosis, we compared the two subtypes with the normal samples respectively to determine the gene expression characteristics. Through the multi-dimensional scaling classification, we found that the samples were clustered well according to the disease cases. Based on the classification results and using empirical Bayes moderation and treat method, 486 important genes associated with the disease were identified. We constructed gene functions and gene pathways to verify our result and explain the pathogenicity factor and process. We generated a protein-protein interaction network based on the mutual interaction between the selected genes and found that the top thirteen hub genes were highly associated with lung cancer or some other cancers including five newly found genes through our method. The results of this study indicated that contrast on the gene expression between different subtypes and normal samples provides important information for the detection of non-small cell lung cancer and helps exploration of the disease pathogenesis.


Introduction
Lung cancer is the most common malignant tumors, which poses a major threat to public health. In 2018, it was predicted that 1,735,350 new cancer cases and 609,640 cancer deaths will occur in the United States, including 13.49% lung cancer cases and 25.27% lung cancer deaths [1]. Lung cancer is the cancer with the highest mortality. There are two major subtypes of lung cancer, small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC, including two major histopathological subtypes, adenocarcinoma (AC) and squamous cell carcinoma (SCC), accounts for 80% of all lung cancer cases [2]. At present, the most effective treatment for NSCLC is surgical treatment in the early stage and radiotherapy and chemotherapy in the middle and late stage. About 75% of the patients are diagnosed in the middle and late stage. Regardless of the treatment options, the overall survival rate is still very poor [3]. PLOS  In the recent years, researchers have paid more attention to the mechanism of the occurrence and growth of NSCLC, which has brought new breakthroughs to the diagnosis and treatment of this disease. However, because of the high cost of treatment and the presence of drug resistance, effective treatment is only applicable to a narrow population.
With the development of information technology, using gene expression data resources to solve medical problems has become a general trend. Data mining technology is helpful to extract potential and valuable information related to diseases, so as to effectively prevent and control the diseases. Therefore, gene expression profile analysis has been widely used to identify new potential biomarkers of cancer [4,5], among which tumor-associated genetic alterations have played essential roles in the tumorigenesis and progression of cancer [6].
In this study, we focus on a particular set of gene expression data associated with early stage of NSCLC. We are interested in this data set of 76 samples because the data set contains detailed information about AC, SCC and normal samples. This information, as our study will show, is critical for the extraction of candidate diagnostic markers for NSCLC. We will use the affy package to read raw data, the edgeR package [7] to filter and normalize the data and the limma package [8] to assess differential expressed genes (DEGs) and perform exploration analysis of the results. Using a multi-dimensional scaling analysis, we will observe the significantly different gene expressions between different NSCLC subtypes and health cases. Applying the linear models in limma package and empirical Bayes moderation in Bioconductor, we will discover more host genes associated with NSCLC. To verify these genes from the underlying biology mechanism, we will use the database for annotation, visualization and integrated discovery (DAVID) [9] to perform the gene ontology (GO) functional analysis [10] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses [11]. Further, the protein-protein interaction (PPI) network will be constructed by search tool for the retrieval of interacting genes/proteins (STRING) database [12], and the Cytoscape software [13] will be used to analyze the PPIs to screen the hub genes.

Microarray data
In this study, the data was obtained through the National Center for Biotechnology

Data pre-processing and clustering analysis of samples
For differential expression and correlation analysis, gene expression is seldom considered at the original counting level. Rather, it is common to convert the original data into a scale that is suitable for the library size. Here raw counts were transformed onto reads per kilobase of transcript per million (RPKM) values firstly. In the process of sample preparation and sequencing, there is no biological significance such as sample batches which will affect the expression of a single sample. Therefore, we need to standardize the data of each sample to ensure the similarity of data distribution. Here, normalization by the method of trimmed mean of M-value (TMM) was applied [14].
In the previous papers using GSE33532 datasets [15][16][17][18], the authors combined two different subtypes or stages of NSCLC into one single type directly and compared this type with normal samples. However, sample classification is an essential step in bioinformatic analysis. It is important to see whether genes are expressed at different level between different classifications. Therefore, in this study, we focused more on the information of sample classification. we used the plotMDS function in limma package to draw a multi-dimensional scaling (MDS) plot which showed the similarities and differences between different samples in an unsupervised way. And then we did the comparison based on the classification results. In our dataset, cancer subtypes and stages are two possible classification criterions and were therefore tested.

Differential expression analysis
We followed the workflow in Bioconductor to find DEGs [19]. Firstly, we built a design matrix for pairwise comparisons based on classification information by makeContrasts function. Secondly, based on the limma linear fitting, the empirical Bayes moderation was carried out to infer the results of linear models [20]. P value < 0.05 was set by default to screen DEGs. The number of up-and down-regulated DEGs can be summarized. However, the empirical Bayes moderation is only successful in testing whether the differential expression differentiate from zero, which cannot guarantee that the differences found are large enough to have biological significance. Here, in order to get more meaningful conclusions, we used treat method, a t-test related to the minimum fold change, to screen DEGs. And the differential expression obtained is greater than a given threshold [21]. This method can also improve the existing false discovery rate and identify more biologically significant DEGs. Finally, the DEGs in multiple comparisons were extracted as the most important genes.

GO functional and KEGG pathway analysis
In this study, we used DAVID (https://david.ncifcrf.gov/), a comprehensive set of functional annotation tool, to analyze GO function and KEGG pathway analysis of DEGs. It uses statistical methods to select the most prominent annotations in a large number of biological annotations, and the related information of their involvement in biological processes (BP), molecular function (MF), cell component (CC) and signal pathway can be found where p < 0.05 was considered to indicate a statistically significant difference.

Integration of PPI network and identification of hub genes
We used STRING (https://string-db.org/), a database that collects and integrates known protein-protein interactions, to explore protein-protein interactions and construct PPI network. Through the plug-in network analysis in Cytoscape, the degree between nodes was calculated and the genes with the largest degree were selected to represent the hub genes which play important roles in the whole PPI network.

Clustering analysis
Based on the two different classification criterions of the samples we found that, samples were clustered well within cancer subtypes over dimension 1 and 2, but the classification using the grouping defined by cancer stage was not good. The clustering result based on cancer subtypes was shown in Fig 1. The first dimension of the MDSplot explained the proportion of maximum changes in data. It showed that the transcription differences between AC versus N and SCC versus N were the greatest in the first dimension, which inspired us to compare the two cancer subtypes with the normal samples respectively to get more DEGs. Data sets of samples with poor clustering results may show little or no evidence of differential expression in downstream analysis. Therefore, we ignored the classification based on cancer stages.

Differential expressed genes
Based on the empirical Bayes moderation, 13,629 DEGs were found including upregulated and downregulated for AC versus N, 14,271 DEGs were found for SCC versus N and 8,095 DEGs were found for AC versus SCC ( Table 2, left).
In order to obtain more biologically significant conclusions, DEGs were screened according to treat method. The number of DEGs reduced to a total of 641 DEGs for AC versus N, 1,085 DEGs for SCC versus N and 178 DEGs for AC versus SCC when testing requires genes to have a Fold Change that is significantly greater than 1.2 ( Table 2, right). Comparisons between AC versus N and SCC versus N resulted in a larger number of DEGs, which verified our conjecture from the MDS plot (Fig 1).
Through integration of the DEGs in different contrasts, 486 DEGs including 116 upregulated and 370 downregulated DEGs in both AC versus N and SCC versus N were extracted by treat method (Fig 2), which were taken as the most significant genes associated with NSCLC.

GO functional and KEGG pathway analysis
DAVID performed BP, MF and CC function analysis (Table 3) and KEGG pathway analysis (Table 4) on 116 upregulated DEGs and 370 downregulated DEGs, respectively.
As a result, it was shown that upregulated DEGs belonged to the component of cytoplasm, nucleus, nucleoplasm and other organelles, they had the molecular functions such as ATP binding, microtubule motor activity, ATP-dependent microtubule motor activity, plus-enddirected, participating in microtubule-based movement, positive regulation of cytokinesis, chromosome segregation and other biological processes (Table 3A). They were mainly involved in cell cycle, oocyte meiosis, progesterone-mediated oocyte maturation, Fanconi anemia pathway and other signaling pathways (Table 4A). While, downregulated DEGs belonged to the component of integral membrane, plasma membrane, extracellular exosome and other organelles, they had the molecular functions such as calcium ion binding, Ras guanyl-nucleotide exchange factor activity, heparin binding, participating in positive regulation of GTPase activity, angiogenesis, cell adhesion and other biological processes (Table 3B). They were mainly involved in adrenergic signaling in cardiomyocytes, neuroactive ligand-Receptor interaction, cGMP-PKG signaling pathway, vascular smooth muscle contraction and other signaling pathways (Table 4B).

Integration of PPI network and identification of hub genes
After introducing all DEGs into STRING database, we constructed a PPI network which incorporated 436 nodes and 1,193 edges. We performed the subset of PPI network (Fig 3) for the DEGs with a combined score > 0.7 to determine the hub genes. As shown in Fig 3, blue nodes represented downregulated DEGs and red nodes represented upregulated DEGs. According to the degree of each gene, the top thirteen hub genes with the highest degrees were selected (Table 5), including BUBI mitotic checkpoint serine/threonine kinase (BUB1), Cyclin B1 (CCNB1), Mitotic arrest deficient 2 like 1 (MAD2L1), DNA topoisomerase 2-alpha (TOP2A), Kinesin family member 11 (KIF11), Cell division cycle 20 (CDC20), BUBI mitotic checkpoint serine/threonine kinase B (BUB1B), PDZ binding kinase(PBK), Abnormal spindle microtubule assembly(ASPM), Non-SMC condensin I complex subunit G(NCAPG), Centromere protein F(CENPF), TTK protein kinase(TTK) and Aurora kinase B(AURKB). Table 5 also showed that all the top thirteen hub genes were upregulated DEGs.

Discussion
NSCLC has been a serious threat to the public health worldwide. It is important to identify genes which express differentially between subtypes and normal cases, predict their  underlying functions, pathways and construct PPI network for the diagnosis and treatment of NSCLC.
In the present study, based on the expression profiles of GSE33532, which is associated with the early stage of NSCLC, selection of DEGs and bioinformatics analysis were performed. In our process of data analysis, we focused more on the classification of samples which was an essential step of bioinformatics analysis and found that samples were clustered well within cancer subtypes. Considering the difference between the two subtypes of NSCLC, we compared both cancer subtypes with normal samples, respectively. And then we took intersection when selecting DEGs for subsequent analysis. We eventually found 116 upregulated DEGs and 370 downregulated DEGs. To obtain further analysis of these DEGs, we performed GO functional analysis and KEGG pathway analysis.
We found that the upregulated DEGs mainly participated in four pathways, the top three are cell cycle, oocyte meiosis nad progesterone-mediated oocyte maturation, which were consistent with previous results using the same dataset. Besides, we also found a new pathway through data analysis, the Fanconi anemia complex I that functions to activate FANCD2 and FANCI by mono-ubiquitinating the protein following response to DNA damage. The Fanconi anemia pathway is a major mechanism of homologous recombination DNA repair. DNArepair deficiencies have been considered of interest in lung cancer prevention, given the persistence of damage produced by cigarette smoke in this setting, as well as in treatment, given potential increased efficacy of DNA-damaging drugs [22][23][24].
The downregulated DEGs mainly participated in five pathways, including adrenergic signaling in cardiomyocytes, neuroactive ligand-Receptor interaction, cAMP signaling pathway, cGMP-PKG signaling pathway and vascular smooth muscle contraction. All these pathways were not mentioned in the previous studies but played important roles in lung cancer or other disease. Beta-adrenergic signaling has been found to regulate multiple cellular processes that Candidate diagnostic markers for non-small cell lung cancer contribute to the initiation and progression of cancer, including inflammation, angiogenesis, apoptosis/anoikis, cell motility and trafficking, activation of tumor-associated viruses, DNA damage repair, cellular immune response and epithelial-mesenchymal transition [25]. The increase in cAMP levels activates target molecules, such as cAMP-dependent protein kinase (protein kinase A, PKA), exchange protein directly activated by cAMP (Epac) and cyclic nucleotide-gated ion channels [26]. These target effector molecules regulate various cellular responses, including metabolism, gene expression, proliferation and apoptosis. Various alterations to key molecules of the cAMP signaling pathway have been observed in lung cancer, and phosphodiesterase inhibitors have been shown to synergize with cisplatin to induce apoptosis in a broad panel of human lung cancer cell lines [27]. Down-regulation of cGMP/PKG-mediated signaling pathways often occurs during tumorigenesis and cell transformation, the activation of the cGMP-dependent enzyme protein kinase G (PKG) can play an important role in inhibiting cell proliferation and inducing apoptosis [28]. Vascular smooth muscle (VSM) is a major component of the tunica media of blood vessels, and an important regulator of vascular function. VSM contraction plays an important role in the regulation of peripheral vascular resistance and blood pressure, and vascular dysfunction, excessive vasoconstriction, and vasospasm could lead to major cardiovascular disorders such as hypertension and coronary artery disease [29]. Through the PPI network, we selected thirteen hub genes. Most of these hub genes were reported by previous studies to participate in the corresponding functions during the infection of NSCLC [30][31][32][33][34][35][36][37][38][39][40][41][42][43]. We also found five new hub genes that were not reported in previous references using dataset GSE33532, including TOP2A, PBK, ASPM, NCAPG and TTK. Four of them had great impacts on lung cancer based on experimental results, which was summarized as follows. The overexpression of TOP2A in NSCLC tissues is related to lymph node metastasis, which can promote cell proliferation and invasion [33]. PBK, also known as TOPK, is a potential therapeutic target in lung cancer that promotes cell migration by modulating a PI3K/ PTEN/AKT-dependent signaling pathway. High PBK expression, either alone or in combination with a low level of PTEN, may serve as a prognostic marker for lung cancer [37]. Suberoylanilide hydroxamic acid significantly enhanced the tumor initiating capacity and the expression of malignant genes such as ASPM in the remaining living ALDH cells, which can suppress the growth of tumor xenografts and decreases the lung cancer stem cell population in vivo [38]. The non-SMC condensin I complex subunit G (NCAPG) that organizes the coiling topology of individual chromatids, represents an overexpressed antigen in various types of cancer, and also contributes to restructuring chromatin into rod-shaped mitotic chromosomes and ensuring the segregation of sister chromatid during cell division [39]. The expression of TTK in lung cancer tissues is significantly different from that in smokers and non-smokers, which is consistent with the important role of TTK in smoking-induced lung cancer. TTK is a candidate target gene for chemical prevention and treatment of lung cancer in smokers [42].
Further, six of the selected hub genes including BUB1, CCNB1, MAD2L1, CDC20, BUB1B and TTK were found to participated in the same cell cycle pathway. It was also shown by the previous study that these hub genes served as a regulatory protein at multiple checkpoints in the cell cycle pathway. Cell cycle pathway is the key pathway of lung cancer and regulatory proteins located in cell cycle signaling pathway play an important role in the mechanism of lung cancer [44][45][46].
In conclusion, the present study provides a broader analysis of DEGs for NSCLC which contributes to exploration NSCLC pathogenesis and may serve as potential biomarkers for future research on early NSCLC detection. However, current research is theoretical analysis based on data, prospective clinical studies remains to be an important next step of investigation.
Supporting information S1