Sparse Bayesian classification and feature selection for biological expression data with high correlations

Classification models built on biological expression data are increasingly used to predict distinct disease subtypes. Selected features that separate sample groups can be the candidates of biomarkers, helping us to discover biological functions/pathways. However, three challenges are associated with building a robust classification and feature selection model: 1) the number of significant biomarkers is much smaller than that of measured features for which the search will be exhaustive; 2) current biological expression data are big in both sample size and feature size which will worsen the scalability of any search algorithms; and 3) expression profiles of certain features are typically highly correlated which may prevent to distinguish the predominant features. Unfortunately, most of the existing algorithms are partially addressing part of these challenges but not as a whole. In this paper, we propose a unified framework to address the above challenges. The classification and feature selection problem is first formulated as a nonconvex optimisation problem. Then the problem is relaxed and solved iteratively by a sequence of convex optimisation procedures which can be distributed computed and therefore allows the efficient implementation on advanced infrastructures. To illustrate the competence of our method over others, we first analyse a randomly generated simulation dataset under various conditions. We then analyse a real gene expression dataset on embryonal tumour. Further downstream analysis, such as functional annotation and pathway analysis, are performed on the selected features which elucidate several biological findings.


Introduction
Biological system is being comprehensively profiled by various expression data through highthroughput technologies, such as gene expression data (measured by the microarray or next generation sequencing technology), protein expression expression (measured by the mass spectrometry-based flow cytometer) and medical imaging (measured by functional magnetic resonance imaging or computerised tomography scan) [1,2]. Computational and statistical methods for discovering functional roles of features from expression data are required to have PLOS  feature selection process. GCS information is obtained from gene expression data indicating whether a gene is sensitive to sample classes. To evaluate candidate gene subsets selected from BPSO, extreme learning machine (ELM) is used for classification model construction. Unlike this method, SBL, similar to other embedded feature selection methods, integrates the feature selection step into the predictive model construction.
There are some examples of embedded feature selection methods which achieve the feature selection by imposing regularisation on existing classification methods, such as regularised SVM [28,29] and sparse logistic regression [30,31]. These methods need to tune the regularization parameters via the cross validation process. The work in [32] develops a Bayesian approach based on a probit regression model with a generalised singular g-prior distribution for regression coefficients. The hyperparameters need to be predefined in the model and the selected feature set is quite small. With limited number of selected features, it is not easy to discover biological functions. The main reason that only a quite small feature set is selected is correlated features are not selected coherently. As we know that in a biological process, multiple molecules are working together, resulting in correlated feature expression levels. SBL, imposed with sparsity constraints, cannot simultaneously select correlated features. Instead, one out of a set of correlated features is usually selected in the predictive model. In the literature, several approaches have been proposed for classification and feature selection and some of them are actually based on SBL [33][34][35]. Not surprisingly, the feature set generated by the Bayesian selection method in [35] is quite small that only 12 out of 7128 genes in an example gene expression dataset are selected. The feature set returned from these methods cannot be easily used to discover predictive pathways or biological functions.
Motivated by the fact that existing classification methods either fail to identify a list of predictive features or easily discard correlated features, we propose a computational method derived from SBL to simultaneously build a classifier and select predictive features which are highly correlated. Our classification model is constructed through an iterative convex optimisation procedure instead of a one-step closed form calculation. Moreover, the optimisationcentric formulation of our method can be easily paralleled [36]. The cost function is cast using hierarchical Bayesian model, where the parameters' prior distribution is parameterised by the hyperparameter. Its main goal is to infer the posterior distribution of parameter via Bayes' rule. Rather than using the EM procedure to update parameters and hyperparameters, our method infer both parameters and hyperparameters via the convex optimisation procedure [24,37]. The paper is organised as follows, we first detailed the method and then test against simulated and real datasets. In the simulation study, we compare the performance of our method with other methods in the aspects of classification and feature selection abilities. In the real dataset analysis, we apply our method to construct a classification model to predict different types of tumours. The selected features in the predictive model are fed into downstream analyses for biological functions and pathways discovery. The results show that our method can perform classification and feature selection at the same time, while the selected features can give insight into new functional modules.

Methods
The proposed classification method constructs a mathematical model through an iterative optimisation procedure. The resulting model can simultaneously perform classification and return a relevant feature set. In this section, we first follow the sparse Bayesian approach to define a single target optimisation function to obtain both parameters and hyperparameters. Then we infer the equations to iteratively solve the optimisation problem via the smooth-concave procedure. The standard EM updating procedure is replaced by the optimisation procedure.

Optimisation problem definition
Suppose we get a set of input vectors fx n g N n¼1 along with corresponding targets fy n g N n¼1 . We wish to learn the underlying functional mapping which is defined by a parameterised function f ðx; βÞ ¼ P M i¼1 b i 0 i ðxÞ, where the output is the linear weighted sum of M basis functions and β = [β 1 , β 2 , . . ., β M ] > contains the parameter. Let F be the N × M design matrix with . Then we can express the mapping function as f(x; β) = Fβ. Usually, the prior distribution of the weights is assumed to follow a zero-mean isotropic Gaussian: where Here, we focus on investigating the case that the target variable is binary. The likelihood function Pðyjβ; xÞ is expressed in the form of the logistic regression model: According to the Bayes' rule, the posterior distribution over weights Pðβjy; γÞ is proportional to Pðyjβ; xÞPðβjγÞ. Maximisation of posterior is equivalent to finding the maximum over β of log Pðyjβ; xÞPðβjγÞ ¼ À Thus, the weights can be found through The gradient and Hessian matrices at arbitrary point of β Ã are defined as and where and The hyperparameter γ is updated by maximising the marginal likelihood, which is equivalent to argmin γ À Pðyjγ; xÞ ¼ argmin According to Taylor expansion, we can get the following approximation at mode β Ã : Therefore, the logarithm of negative marginal likelihood is Thus γ can be estimated by From Eqs (5) and (13), we can jointly estimate β and γ through a common objective function: where |H(β Ã )| is the Hessian matrix calculated at mode β Ã , which is assumed to be obtained through the minimisation step of β in the iterative optimisation process. The reason we label Ã in |H(β Ã )| is to emphasize that the term |H| is not involved when updating β according to Eq (5).

Iterative optimisation algorithm
The objective function defined in Eq (14) can be formulated as a convex-concave procedure (CCCP) for updating β and γ, which is where the data fitting term u(β, γ) is a smooth function in the form of and the regularisation term v(γ) is a concave function [24,38]: By expressing the objective function in the convex-concave form [39], we can evoke standard iterative optimisation procedure to get its solution at the k + 1 th iteration as follows: and If we define β k+1 can be obtained as the following expression in the form of reweighted ℓ 2 regularisation: Let us use the following notation: According to the matrix derivative rule, we can derive the expression of the ith element of α k as: According to Eq (19), optimal g kþ1 i is obtained by minimising Since the optimal γ can be obtained as: The pseudo code is summarised in Algorithm 1. Algorithm 1 Reweighted ℓ 2 type algorithm 1: Initialise the unknown hyperparameter γ 1 as a unit vector; 2: Initialise 8: if a stopping criterion is satisfied then 9: Break. 10: end if 11: end for The cost function defined in Eq (27) of Algorithm 1 is convex, making it possible to apply standard solvers to obtain a global optimal solution. For example, we can consider iterative solvers, such as the standard gradient method, the Newton method and its variants. We should note that in the 5 th line of Algorithm 1, the inverse of Hessian matrix needs to be calculated. When the number of features is quite large, which is usually the case in the gene expression data, it is quite computational expensive to do the Hessian matrix inversion. Therefore, we apply a Quasi-Newton method, limited memory Broyden Fletcher Goldfarb Shanno (L-BFGS) algorithm, to directly generate the inverse of Hessian matrix in the iterative solver without performing matrix inversion [40].

Results
In this section, a set of simulated datasets are first used to show the ability of selecting relevant features as well as constructing predictive models. Our method is compared with the other two representative methods: ℓ 1 regularised logistic regression implemented by alternating direction method of multipliers (ADMM) [41] (code is downloaded from https://web. stanford.edu/*boyd/papers/admm/logreg-l1/) and SBL (code is downloaded from http:// www.relevancevector.com). The results will highlight the performance of our method on datasets with high correlations. To demonstrate the applicability of our method on real datasets, we choose a publicly available gene expression dataset for illustration. The identified features (genes) are fed into downstream analyses. We find that the detected functional terms agree with the findings from literature. The analyses of real data show that our method can generate a list of predictive genes that are both used for classifier construction and biological functionality discovery.

Simulated data analysis
In the simulation, we first generate a dataset with the sample and feature size of 500 and 50. We assume there are only 8 non-zero elements in β. To check the ability of finding correlated features, we split the features with non-zero weights into two sets, S 1 and S 2 , each of which contains 4 true features. Then, we initialise a design matrix F = x with 500 samples and 50 features from the normal distribution. Let ϕ i (x) be the ith column of F. We set ϕ S 2 m ðxÞ ¼ ϕ S 1 m ðxÞ þ N ð0; 0:1Þ, where S 1 m and S 2 m denote the mth element in each feature set. In this way, we get data generated from correlated features. The target variable vector fy n g 500 n¼1 is generated by the linear model using β and F with additive independent identical distributed Gaussian noises, where the standard deviation of noise varies ranging from {0, 0.1, 0.5, 1}.
We first work on the whole dataset of 500 samples with no additive noise to investigate the ability of selecting the true features. The estimated weights from our method is compared with the true weights, the estimates from the ℓ 1 regularised logistic regression method (RLR) [41] and the sparse Bayesian learning method. We scale the estimated results to make the first real nonzero feature have the same value. The results are shown in Fig 1. It shows that only our method can successfully detect all the true features with magnitudes of the corresponding weights significantly larger than the others. RLR and SBL easily ignore some features that one out of a correlated pair can be detected. This observation is quite encouraging, showing that our method can be potentially used as a good tool to select relevant features. Most classification methods are not guaranteed to have this characteristic.
We then investigate the performance of our method with different levels of noise. The performance is evaluated by the 10-fold cross validation. In the cross validation process, different training dataset is used for feature selection and predictive model construction in each fold. Therefore, selected feature sets from all folds may vary due to the variation of training datasets.
To select a feature set which is stable with small fluctuations of the input dataset and also has good predictive accuracy, we use the method from [42]. In the kth fold of cross validation, the whole dataset D is split into two subsets: CV training dataset D k and CV testing dataset D \k . Our method can work as a feature selection method on the training dataset D k to rank and select the top q features, labelled as V q,k . After features have been selected, our method then constructs a predictive model for classification using V q,k . The prediction results at this CV fold are recorded for later evaluation. To get the complete prediction results, we repeat the above steps for all folds of CV. The method presented in [42] returns an optimal feature setṼ q with an associated performance scoreP q under each value of q. The scoreP q is calculated according to the 6 th strategy proposed in [43] to assess the prediction accuracy and stability of features (the details of calculating this score can be found in [43]). By checking the maximum value ofP q , we can determine the optimal value of q and the corresponding optimal feature setṼ q . The detected optimal set can then be used to construct a predictive model for future prediction.
In Fig 2, it shows the change of scoresP q under different settings of feature size q and noise level. From Fig 2, we can see that the scores of our method under different noise levels peak when the value of q is close to the number of real nonzero features. In contrast to our method, the optimal value of q for the RLR method and the SBL method are around 4, which is different from the real number of nonzero features. From Fig 2, we can expect that our method works better than the RLR method with respect to feature selection. Table 1 Table 1. We can see that under different noise levels, the accuracy achieved by different methods are all maintained at high levels. However, the false negative rates from other methods are much higher than the rates from our methods. This is because, the other methods cannot detect correlated features that some real features are ignored.
The above observations are only for datasets with fixed sample and feature sizes (500 and 50 respectively). To test generality, we generate a set of datasets as follows: the sample and feature sizes are chosen from {50, 100, 500}; the ratio of non-zero features (sparsity) is either 0.1 or 0.2; non-zero features can be fully independent or 50% of them are highly correlated. For each combination of settings, 20 randomly sampled datasets are generated. We apply our method and also the other two methods on these datasets. The FN, FP and accuracy values are derived from the average values of 20 randomly sampled datasets under each specific setting. The results for the datasets with the sparsity of 10% are recorded in Table 2. The results for the datasets with the sparsity of 20% can be found in S1 Table. We can see that for the datasets with independent features, the performance of our method is similar with the other two methods. However, for the datasets with correlated features, our method works much better than the other two methods with lower FN and FP rates and similar accuracy values. Especially, the FN rates are much smaller, indicating that the real non-zero features are more likely to be detected by our method. This characteristic of our method is quite important, since biological expression data contains a lot of correlated features. We can also observe a general trend that: with a fixed sample size, the performance of all three methods is reduced when the feature size increases; with a fixed feature size, the performance of all three methods can be improved with more samples. When the proportion of non-zero feature increases to 0.2 (see S1 Table), the performance of all methods are deteriorated. This is because, all these methods are based on the assumption that the feature space is sparse. For biological expression data, the ratio of predictive features or biomarkers is quite low, which is much smaller than 0.2 or even 0.1.

Embryonal tumour gene expression data analysis
We use a public available gene expression dataset of the central nervous system embryonal tumours from the study in [8]. All relevant data are available from the figshare repository at  Sparse Bayesian classification and feature selection for biological data the following URL: https://doi.org/10.6084/m9.figshare.5678806.v1. We selected 10 CNS medulloblastomas (MD) samples and 10 non-neuronal origin malignant gliomas (Mglio) samples to show the performance of our method in classifying two tumour types. The samples were hybridised on Affymetrix HuGeneFL GeneChip arrays. We first preprocessed the raw data using GCRMA with empirical Bayes estimate [44]. Then we filtered out probe sets which are either not annotated or have little variability across samples. Probes for 5669 genes were remained after preprocessing. Our method can find differences between two tumour types at molecular level. We construct a classifier using the selected 20 samples with the accuracy of tumour type prediction approaching to 100% in the 10-fold cross validation procedure. The beauty of our method is that it does not only have strong predictive power, but also selects relevant features that could be candidates of disease biomarkers. The following parts of this section focus on investigating the performance of features selection. In the classification model of the whole dataset, 98 features have non-zero weights, which can be regarded as molecular features distinguishing tumours. By looking at Fig 3, we can see that many of these features are highly correlated, telling that our method does not discard features from correlated ones. We also apply SBL to Sparse Bayesian classification and feature selection for biological data construct a classification model. Although it can also return high predictive accuracy, it only selects two features with non-zero weights in the model. As RLR cannot work well for large datasets, we do not present its results for comparison.
The weights of features estimated from our method are compared with significant levels of features from the traditional statistics tests, which check features one by one to see whether the distributions of each feature in different groups are significantly different. Only the features whose p-values are smaller than the significant level (e.g., 0.05) are selected. Although there are many p value correction methods, a hard cut-off value is still needed. The number of selected features depends on the value of the significant level. Although it is conventional to set the significant level to be 0.05 or 0.01, we can hardly say any features whose p-value is slightly larger than this value do not have discriminant power. In this experiment, a t-test for each gene is conducted to find significant changes in expression levels between the MD and Mglio samples. We show the top 20 genes with the smallest p values in Fig 4.  Fig 4 compares the weights resulting from our method and SBL with fold changes and p values. Fig 4A) shows the weights of features in our classification model, where the circle points indicates the top 20 features from the t test. We can see that most features which are statistically significant have non-zero weights. In contrast to our method, Fig 4B) shows the results from SBL, telling that the top 20 features from the t test all have zero weights. From Fig 4A) and 4B) we can see our method outperforms SBL in the aspect of feature selection. Fig 4C) shows the fold change of the top 20 statistically significant features. Comparing Fig 4A) with 4C) we can see that the signs of weights agree with the signs of fold changes. It demonstrates that our method can reflect whether a gene is up-regulated or down-regulated for MD compared to Mglio. Fig 4D) shows the exp(−log(Á)) transformed p values for the top 20 statistically significant genes. It should be noted that although we compare our results with p value results, we cannot simply regard p value results as the ground truth. The disagreements between our method and t test can be resulted from the case that t test gives wrong results. This is because, in the statistic tests, features are investigated separately. It is often the case that some individual features are not discriminant but have strong predictive power when they join together. If we compare our results with the ground truth, we may find a better comparison results than those shown in Fig 4. As the ground truth is not at hand, we use p value results for comparison.
After comparing our results with other methods, we then investigate the biological functions of the 98 selected genes from our method. In this experiment, we determine a list of 53 up-regulated genes for MD compared to Mglio. Genes from this list are analysed for functional category enrichment using the Functional Annotation Clustering tool on the Database for Annotation, Visualisation and Integrated Discovery [45]. Metastasis-associated genes are classified according to their annotated role in molecular function, biological process, and cellular component from Gene Ontology (GO). Category enrichment is tested against all human genes, where p values are adjusted using the Benjamini-Hochberg multiple testing correction method [46]. In S2 Table, we show the discovered GO annotation clusters, where the detected GO terms are consistent with the findings in [8]. We can further select the GO terms of interest and build a sub-ontology that includes the ancestors of the terms. Fig 5 shows the ontology built using the top 10 significant GO terms. From this experiment, we can see that our method can work as a classifier and also a feature selection method whose output (selected features) can be fed into downstream analyses such as gene set enrichment and pathway analysis.

Discussion
With the development of high-throughput technology, biological process can be quantitatively measured. Differential feature detection and classification model construction are two main analyses in the biological expression study. In this paper, we propose a method to perform these two analyses at the same time: the model can predict sample groups; and the features used in the model with non-zero weights can be regarded as potential biomarkers. Currently, there are many machine learning methods for classification model construction. Most of them cannot directly return a list of predictive features with non-zero weights in the model. For example, the linear SVM may use all features for model construction, where they all have nonzero weights.
As SBL imposes sparsity to the model, a lot of features are forced to be pruned in the classification model. Thus, we use SBL as the basis of our method. Different from SBL which follows an EM style to infer parameters and hyperparameters iteratively, we formulate the inference problem in the framework of optimisation: the target function in the optimisation process is originated from SBL; the iterative updating procedure follows the idea of convex-concave optimisation. Compared with SVM-based methods, our method has the following distinguishing features: 1) Our method is parameter free that hyperparameters are directly learned from datasets, while SVM-based methods need to set parameters through the cross validation process. 2) Our method imposes sparse constraints to the classification model. By choosing linear kernel, we can obtain a small set of features with non-zero weights used in model. The feature selection and classification steps have been integrated into one step. 3) Our method can detect correlated features which are important for downstream analysis, such as functional pathway analysis. Our method is also different from other optimisation based methods. Let us take the BPSO method as an example for discussion. The main differences between our method and BPSO are: 1) The BPSO method is inherently a global optimisation method. Our method although is an optimisation method, it constructs the model from the Bayesian point of view, where prior knowledge can be explicitly included in the model. 2) Our method is parameter free that hyperparameters are learned from the data. BPSO needs to set parameters in advance or obtained them via the cross validation process. 3) Looking at the results from BPSO, we can see that in each run of BPSO, only a small subset of genes is selected (e.g., 10 genes). 4) Our method carries out classification and feature selection in one step, where BPSO is mainly used for feature selection requiring other classification method such as ELM for classification model construction.
The simulation results show that our method can effectively select features with high classification accuracy. In contrast to other methods, correlated features can be successfully detected. A real gene expression data from the embryonal brain tumour study is then used to demonstrate the applicability of our method. In the results, we first show that the selected features are correlated by looking at the heatmap of correlation matrix. Then we compare the weights estimated from our method with p values from statistic test and fold changes. We find that our method can successfully identify up-regulated and down-regulated genes with positive and negative weights, respectively. Moreover, we find that most features which are statistically significant have non-zero weights in our model. The gene list generated by our method can be used to do functional analysis. We show the detected gene ontology terms, which are consistent with the findings in previous study. In conclusion, the classification and feature selection method proposed in this paper can effectively handle highly correlated biological expression dataset, in order to predict distinct disease subtypes and select candidates of biomarkers simultaneously.
Supporting information S1 Table. The