Improved cancer biomarkers identification using network-constrained infinite latent feature selection

Identifying biomarkers that are associated with different types of cancer is an important goal in the field of bioinformatics. Different researcher groups have analyzed the expression profiles of many genes and found some certain genetic patterns that can promote the improvement of targeted therapies, but the significance of some genes is still ambiguous. More reliable and effective biomarkers identification methods are then needed to detect candidate cancer-related genes. In this paper, we proposed a novel method that combines the infinite latent feature selection (ILFS) method with the functional interaction (FIs) network to rank the biomarkers. We applied the proposed method to the expression data of five cancer types. The experiments indicated that our network-constrained ILFS (NCILFS) provides an improved prediction of the diagnosis of the samples and locates many more known oncogenes than the original ILFS and some other existing methods. We also performed functional enrichment analysis by inspecting the over-represented gene ontology (GO) biological process (BP) terms and applying the gene set enrichment analysis (GSEA) method on selected biomarkers for each feature selection method. The enrichments analysis reports show that our network-constraint ILFS can produce more biologically significant gene sets than other methods. The results suggest that network-constrained ILFS can identify cancer-related genes with a higher discriminative power and biological significance.


Introduction
The recent development of high-throughput gene expression profiling provided an opportunity for researchers to better understand the molecular characteristics of the cancer disease, leading to advances in its diagnosis and treatment.Accurate identification of the cancer diagnostic biomarkers is very critical for the provision of appropriate therapies and drug development.Some gene mutations, such as BRCA1, BRCA2, VHL, PBMR1 and others were identified to be correlated with an increased tumor aggressiveness in cancer [1][2][3][4].A few targeted therapies have been designed providing more options for treating patients [5].However, the global incidence and mortality of cancer is still high, 1,806,590 new cancer cases and 606,520 cancer deaths are projected to occur in the United States in 2020 [6].It is then hoped that ongoing and planned research will develop more reliable and effective feature selection methods to identify more predictors of the tumors' sensitivity to therapy.
In the field of genomics, disease signatures identification has long been a research topic in which they might revolutionize the way diseases are treated clinically.However, identifying disease associated genes in gene expression data is challenging due to the high dimensional features with low sample size.A lot of studies were published that handled this issue and were employed in the biological analysis [7][8][9][10][11][12].From a statistical perspective, it is hard to filter the true factors in high dimensional data [13].Published material showed that selected features are susceptible to the perturbation of the high dimensional training data.One limitation of these popular methods is that they are merely designed based on statistical or arithmetic points; they don't utilize any biological information.Over the past few years, more biological knowledge and pathway information became available on the Internet, especially that related to cancer.Some of the biological pathways information can be downloaded from online databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [14], Reactome [15,16] and others.These pathways are often presented as graphs where the vertices represent genes or gene products and the edges indicate some regulatory relationship between the genes.Such prior biological information is a very useful supplement to those graph-based feature selection methods.Some popular graph-based methods usually combine l_1 penalty with graph regularization procedure to simultaneously obtain sparsity and smoothness for the linear model analysis [9,17], while other typical methods are designed based on neural networks and deep learning frameworks [18].However, linear correlation does not often appear in genomic data, thus graph regularization-based models for linear analyses are barely suitable for this task.On the other hand, deep learning frameworks and neural network methods have the limitation that a great number of samples is needed in order to obtain a reliable model while small sample size is a general feature in the field of genomics.In recent years, some achievements have been made on biomarkers identification by integrating biological network into graph algorithms [19][20][21][22].Such methods can produce more robust gene sets across datasets from different cancer types.But it was found that they may find too many hub genes.Furthermore, the significance of genes in such network-based methods is usually evaluated by their neighbors or genes in the same sub-network.Which means that only a limited number of gene subsets with limited cardinality would be tested.
In this article, we propose a novel method by introducing a graph filter procedure on ILFS.ILFS is a graph model-based filter method that was proposed by Giorgio [23].The motivation behind applying the ILFS method in our work lies in its logic in ranking features, in which the significance of a feature is evaluated by considering all possible subsets of features of any cardinality.Integrating the FIs network with ILFS can utilize the interaction information between genes.In our study, we applied this method to analyze gene expression data for five cancer types: breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), kidney renal clear cell carcinoma (KIRC), liver hepatocellular carcinoma (LIHC), and prostate adenocarcinoma (PRAD).The proposed method showed improved prediction performance and a much higher selection ratio for known oncogenes than some popular existing methods, including LASSO [7], mRMR [24], VIP score using PLS-DA [25], ReliefF [26] and the original ILFS.We performed functional enrichment analysis on selected markers and found that the number of over-represented GO BP terms obtained from the network-constrained ILFS is much higher than those obtained from other methods.We also performed GSEA on selected biomarkers, the analysis showed that the network-constrained ILFS generates a more biologically significant gene set that is related to the cancer disease than other methods.

Data preprocessing
The data for our research is from The Cancer Genome Atlas (TCGA) platform.The data category we chose is "transcriptome profiling" and the data type is "Gene Expression Quantification".The RNA-Seq expression data of different cancer types is publicly available from (TCGA, https://portal.gdc.cancer.gov/).First, we downloaded the HT-SEQ FPKM (Fragments per Kilobase of transcript per Million mapped reads) values of the type primary solid tumor and solid tissue normal of the BRCA, COAD, KIRC, LIHC, and PRAD cancer types.According to TCGA documentation, the FPKM calculation (1) normalizes read count by dividing it by the gene length and the total number of reads mapped to protein-coding genes.Then we converted the FPKM values to TPM (Transcripts Per Kilobase Million) values as it was shown that TPM values may be more comparable across samples [27].The conversion follows formula (2).For each sample in our data, 19,754 genes are measured for later analysis.
N i : Number of reads mapped to the gene i M: Number of reads mapped to all protein-coding genes L i : Length of the gene in base pairs; calculated as the sum of all exons in a gene i Note: The read count is multiplied by a scalar (10 9 ) during normalization to account for the kilobase and 'million mapped reads' units.
The datasets include paired and non-paired samples and we divided each dataset into two parts: part one and part two.Part one includes approximately 70% of the paired samples which are used for feature selection.Part two consists of the remaining paired samples (nearly 30% of the paired samples) and all of the non-paired samples, which are used for classifier training and model validation by using a k-fold cross validation process.The "paired" samples mean that the case and control are from different tissues of the same patient.For such patients, the gene expression data of primary solid tumor and normal tissue are available.For other patients, only gene expression data of primary solid tumor are provided, we call them "nonpaired" samples.All of "non-paired" samples belong to one group (Tumor).To avoid the effects of genetic differences, we do feature selection only on "paired" samples.More detailed information of the samples and parts is listed in Table 1.

Method
We implemented our approach through the following steps: (1) select a subset of differential genes by applying a paired t-test process on the expression data in Part one, (2) select another set of candidates gene according to the number of their connections in the FIs network, (3) combine the above-mentioned two sets and reconstruct the FIs network with candidate genes in the collection, (4) integrate the reconstructed FIs network with ILFS to score the genes.The flow diagram of our method is plotted in S1 Fig.

Reconstructing the FIs network
Since having a few genes with a very low expression level is statistically meaningless, and performing computations on the whole gene level may be unnecessary, we performed a paired ttest process on the expression data of Part one samples.The cut-off values in the paired t-test process were set as FDR<0.05 and |log 2 FC|>1, so that the top N genes that showed great discriminative power were filtered for further feature selection.
We also downloaded the FIs network from the Reactome database, which includes the known pathways in human biology.These pathways are expressed as pairs of genes (a i , b i ) and the regulatory relation between them which can be regarded as directed edges in a graph.Since our method is graph-based, we also chose the M genes with more than 100 edge connections in the FIs network.We have tested several edge thresholds and found that 100 is applicable in our study.In the genetic network, the more edges a gene has, the more central role it has within the network.We united the sets of N genes and M genes into a set T such that T = {N genes}[{M genes}.We reconstructed the FIs network G F as follows: for each edge (a i , b i ) in the FIs network, if a i 2T and b i 2T, then the edge and its direction are included in G F .We expressed this directed graph G F as an adjacent matrix A F , according to the direction of the edge (a i , b i ) that could be forward, backward or bidirectional, the values of the matrix A F were assigned using formula (3),

Network-constrained infinite latent feature selection
ILFS is one kind of filter methods which rank the features depending on the intrinsic properties of the data and are not sensitive to the predictive model type.ILFS ranks the features through three steps: data preprocessing, graph weighting and ranking.In the steps of data preprocessing and ranking, our network-constrained ILFS does the same thing as the original ILFS.The second step of graph weighting includes the introduction of the reconstructed FIs.Initially, the raw feature space X is represented as a set of feature distributions X ¼ fx 1 ⃑ ; . . .; x n ⃑ g, where each m×1 vector x i ⃑ is the i th feature (gene) with regard to m samples.In the data preprocessing step, a discriminative quantization process is applied on the raw feature distributions x i ⃑ through which the raw feature vector x i ⃑ is mapped to a countable nominal smaller set of intervals and represented as a descriptor f i , where f i is a t×1 vector (t�m); thus, each feature will be represented by a new low-dimensional pattern.Following this formulation, a strong new representation of the training data X in the form of F = {f 1 ,. ..,f n } where each feature f i is described using a vocabulary of few tokens is obtained.
In the graph-weighting process, an undirected fully connected graph G is built whose nodes correspond to each feature f i and whose edges are weighted by f i ⇝f j , which represents the probability that the features x i and x j are relevant.Using a learning framework that is based on a variation of the probabilistic latent semantic analysis (PLSA) technique, the weights were computed by modeling the probability of each co-occurrence in f i , f j .After this process, we obtained the weighted graph G, which can be expressed as an adjacent matrix A p .We have intuitively found that ILFS constructs a fully connected graph in this step that assumes all features to be connected with each other, while in fact the true correlation structure of the gene expression data is much sparer than this.This enlightened us that incorporating a prior knowledge of genetic pathways may produce more biologically reasonable results, and we proposed to add an extra process after the graph weighting step.We employed the reconstructed graph matrix A F as a filter to achieve a sparser connection graph.We implemented this process by calculating the following formula: where the symbol � denotes a Hadamard (element-wise) product.The Hadamard product is a binary operation that takes two matrices of the same dimensions, and produces another matrix where each element A ij is the product of (A p ) ij and (A F ) ij .The value of A ij will be zero after this operation if there is no regulatory relationship between the genes i and j.As a result, only actual pathways are retained after this process.In ranking step, the importance score of a feature is defined as a function of the importance of its neighbors.Let γ = {υ 0 = i, v 1 , v 2 ,. ..,v l−1 , v l = j} denote a path of length l between node i and j, namely, feature x ⃑ i and x ⃑ j , through other nodes v 1 , v 2 ,. ..,v l−1 .Suppose that the length l is lower than the total number of nodes in the graph.Therefore, a path is a subset of available features(nodes).The join probability that γ is a good subset of features can be estimated as Let a set P l i;j as considering all the paths of length l between i and j.The energy of all the paths of length l can be summed as follows, Following the standard matrix algebra, it can be written as: Considering all the possible subsets of features of any cardinality means considering all the possible paths of any length in the graph.As a result, extending the path length to infinite implies calculating the geometric series of matrix A.
Tending the path length to infinite brings divergence.So that a consistence r for regularization is assigned for the computation.
From an algebraic point of view, Č can be efficiently computed by using the convergence property of the geometric power series of a matrix.Therefore, the value of Č can be calculated as follows: Matrix Č encodes all the information about the goodness of the set of features.The final scores for each node can then be obtained by marginalizing the quantity s ˇ(i) = [Če] i , where e denotes a 1D array of ones.Ranking the s ˇ(i) scores in a descending order, we can get the most discriminative features at the top of the sorted list.

Accuracy of predictors
In order to measure the prediction performance of our proposed method, the training data reduced to selected genes was used to train the classifiers which were subsequently employed to be tested on the rest of samples.We performed the model training and validation process on the samples in part two.As can be seen in Table 1, the amount of the available normal samples is much less than the tumor samples.As most standard classifiers assume a relatively balanced class distribution, training with imbalanced data will lead to illusive classification performance.Therefore, we adopted a technique called ADASYN sampling approach for imbalanced training [28].The basic idea of ADASYN is to adaptively generate more synthetic data for the minority class samples according to their distributions.The characteristic of this method is that it can shift the decision boundary to focus on those samples that are hard to be learned.By applying ADASYN, the ratio of normal to tumor samples was adjusted to be close to 1.To avoid over-fitting, the generation of training and testing data was separately executed in the classification and validation process.We tested support vector machine (SVM) classifier on the expression data restricted to the top 100 genes.We implemented a 5-fold CV process on the generative datasets for 100 times.To explore the prediction performance, we employed the mean AUC value over 100 times CV as the measurement for the selected features.To better understand the data processing flow, we plotted it in Fig 1.

Prediction of selected biomarkers
To evaluate the prediction performance, several popular feature selection methods, including mRMR, LASSO, PLS-DA and ReliefF and the original ILFS were compared with the proposed method.We provided the top 100 biomarkers obtained by the above-mentioned feature selection methods for five cancer types in the S2 Table .Appropriate selection of the tuning parameter in penalized likelihood methods is very essential for high dimensional data analysis; thus, we executed an additional procedure for LASSO to select the optimal tuning parameter [29].We tested the accuracy of the top 100 genes obtained by each feature selection method, combined with an SVM classifier to train the prediction models.We repeated a 5-fold cross-validation process 100 times and computed the average AUC value.We used the following parameters for SVM (linear kernel, C = 1); we have also tested a few parameters and found that no significant better results were reached.The mean AUCs over 100 times classifications limited to top 100 genes for five cancer types were plotted in Fig 2 .Generally, we observed that the network-constrained ILFS shows better prediction performance, except in the case of KIRC.For KIRC, mRMR and ReliefF had the greatest predictive power.Moreover, networkconstrained ILFS showed much better performance for LIHC than other cancers.To explore the distribution of AUCs, we also plotted the accuracies obtained by each feature selection method on the PRAD data set as shown in Fig 3 .Another measurement to assess these feature selection methods is the F1 score.The F1 score is a weighted average of precision and recall, which can be calculated by formula (11).The results are listed in S1 Table.

Known oncogenes including test
The efforts of many scientists resulted in revealing some genetic mutations that might be involved in cancer development.The IntOGen-mutations platform summarizes the somatic mutations, genes and pathways that are involved in tumor genesis [30].We collected the known oncogenes of BRCA, COAD, KIRC, LIHC, and PRAD from this platform and counted the number of known oncogenes in the top 100 genes obtained by these feature selection methods and then calculated the p-values using hyper geometric tests.The number of oncogenes in selected 100-genes is listed in S3 Table .Fig 4 shows the significance values of the selected oncogenes ratio for each method on five cancer types using hyper geometric tests.It is obvious that the proposed method outperforms other methods in this task.The selected oncogenes by the network-constrained ILFS are listed in Table 2.The results show that our network-constrained ILFS has a great chance for BRCA, COAD, KIRC and PRAD to mine the true factors in high dimensional gene expression data.For LIHC, no oncogenes in top 100 have been detected from any feature selection method.

Biological interpretability
Finding gene groups that show predictive power is no longer a very hard job.However, mining biomarkers that provide insights into the biological mechanisms remains a challenge.To assess the interpretability significance of selected biomarkers, we adopted two ways: GO functional enrichment analysis and GSEA [31].The top 100 biomarkers gene list was analyzed using the tool DAVID [32] to produce GO BP terms.We computed the number of GO BP terms that are overrepresented at 5% FDR.Fig 5 shows the number of enriched GO BP terms for five cancer types.The detailed information of the GO functional enrichment analysis is listed in the S4-S8 Tables.In general, the number of overrepresented GO terms indicates how easily selected biomarkers can extract a biological insight.Apparently, the network-constrained ILFS provides a more functionally significant gene set.We also applied a gene set enrichment analysis to the top 100 biomarkers obtained from each feature selection method.The reference gene sets that were used in the GSEA process were downloaded from the Molecular Signatures Database [33].We chose the C4 and C6 collections as the reference gene sets which includes computational gene sets that were defined by mining large collections of cancer-oriented microarray data and oncogenic signature gene sets that were directly from microarray gene expression data from cancer gene perturbations.The normalized enrichment score (NES) is the primary statistic for examining gene set enrichment results, which reflects the degree to which a gene set in the C4 and C6 collections is overrepresented at the top or the bottom of the selected biomarkers ranked list.FDR is the estimated probability of a gene set with a given NES, and the nominal p-value estimates the statistical significance of the enrichment score for a single gene set.In general, an FDR cutoff of 25%, |NES|>1 or a nominal p value cutoff of 5% are appropriate.The GSEA analysis report about PRAD is shown in Table 3. Obviously, we observed that the number of significantly enriched gene sets obtained by the network-constrained ILFS is much larger than the number of those obtained by other methods.Typically, the larger the number of significantly enriched gene sets is, the more likely interesting hypothesis will generate.The results indicate that the network-constrained ILFS could produce more biological interesting gene set than other methods.The detailed GSEA analysis report for five cancer types is provided in S9 Table .From GO BP enrichment analysis results, we found that the selected genes by our method for five cancer types are significantly involved in GO:0000398~mRNA splicing, via spliceosome, GO:0010467~gene expression, GO:0008543~fibroblast growth factor receptor signaling pathway, and GO:0006370~7-methylguanosine mRNA capping.The core spliceosome machinery has been demonstrated to be overexpressed in multiple cancers and affect autophagy and cell proliferation, becoming a potential therapeutic target for malignant solid tumors treating [34,35].The fibroblast growth factor receptor (FGFR) pathway is increasingly proved to play a role in the pathogenesis of different tumor types, such as urothelial, breast, endometrial, squamous cell lung cancer and hepatocellular carcinoma [36][37][38].These facts confirm our method in biological interpretations.
In GSEA analysis, the most significantly enriched gene sets from our method include MORF_SOD1 (Neighborhood of superoxide dismutase 1 in the MORF expression compendium), MORF_CSNK2B (Neighborhood of casein kinase 2, beta polypeptide in the MORF expression compendium) for prostate cancer, GCM_CSNK2B (Neighborhood of casein kinase 2, beta polypeptide in the GCM expression compendium) and MORF_EIF3S2 (Neighborhood of eukaryotic translation initiation factor 3, subunit 2 beta, 36kDa in the MORF expression compendium) for breast cancer.Casein kinase 2 (CK2) is a ubiquitous serine/threonine protein kinase.A previous study has demonstrated that CK2 is to be overexpressed in a number of human cancers, including prostate and breast cancer [39,40].SOD1, plays an important role in maintaining the normal life activities of cells, which has been reported associated with tumorigenesis [41,42].Eukaryotic initiation factor 3 (EIF3) is involved in the initiation process of protein translation and overexpression of its subunit eukaryotic translation initiation factor 3 (EIF3I) has been observed in breast carcinoma [43].The results of GSEA also prove that the proposed method can identify genes with biological significance.

Discussion
In the field of genomics, it is very common to have a high dimensional data with low sample size; thus, feature selection plays a very critical role in scientific discoveries.Most existing feature selection methods rank the features only from the statistical perspective.Such methods tend to filter out those genes that show the best discriminative power in model training.However, a lot of those genes are meaningless when it comes to the biological process and interpretability.This can be perceived from our experiments result.It is easy to find in S3 Table that the proposed method can produce highly overlapping signatures over all cancer types, while classical methods fail to identify common gene sets across the same cancer types.For future work, it is more promising to explore such similar signatures than those no overlapping signatures.LASSO, mRMR, ILFS, VIP score using PLS-DA and ReliefF show no significant worse performance than the network-constrained ILFS regarding the prediction accuracy, but the signatures obtained by them share little overlap not only with each other but also with known oncogenes.This demonstrates that different gene groups can lead to same predictive accuracies, but methods with great power in model training are not necessarily good at selecting true features.It implies that maybe no biological insight should be expected from the analysis of those genes using such methods.To avoid selecting too many differential but biologically meaningless genes, we propose that adding some biological prior information may improve the reliability and feasibility of statistical methods.For this purpose, we employed the FIs network to modify the ILFS graph-weighting process.In addition, we followed a special way in the initial gene screening step.We picked two kinds of genes at first.One kind is differential on expression data obtained by a paired-t test process.Another is central in the graph which is measured by its connections.This setting is very important because the basic idea of ILFS is to consider all possible subsets, which can be regarded as walking down all possible paths in a graph, while the central genes are key nodes to connect those paths.As a result, the selected biomarkers showed both great prediction power and remarkable biological significance.

Conclusions
In this study, we proposed a novel feature selection method which combined the biological network with the statistical method of ILFS.We applied this method to identify biomarkers in the gene expression data of BRCA, COAD, KIRC, LIHC, and PRAD.First, we compared it with the methods of ILFS, mRMR, LASSO, VIP score using PLS-DA and ReliefF on estimation precision and selection ratio of known oncogenes.Then, we performed functional enrichment and gene set enrichment analysis on selected features and perceived that the selected features are meaningful from a biological perspective.The results indicate that the network-constrained ILFS is helpful in cancer biomarkers identification.