Abstract
Analysis of gene expression data is an attractive topic in bioinformatics, and a typical application is to classify and predict individuals' diseases or tumors by treating gene expression values as predictors. A primary challenge of this study comes from ultrahigh dimensionality, which implies that (i) many predictors in the dataset might be non-informative, and (ii) pairwise dependence structures may exist among the high-dimensional predictors, yielding a network structure. While many supervised learning methods have been developed, their prediction performance is expected to suffer if the impacts of ultrahigh dimensionality are not carefully addressed. In this paper, we propose a new statistical learning algorithm for multi-class classification with ultrahigh-dimensional gene expressions. In the proposed algorithm, we employ a model-free feature screening method to retain informative gene expression values from the ultrahigh-dimensional data, and then construct predictive models that accommodate the network structures of the selected gene expressions. Different from existing supervised learning methods that build predictive models on the entire dataset, our approach identifies informative predictors and dependence structures among gene expressions. Through analysis of a real dataset, we find that the proposed algorithm gives precise classification as well as accurate prediction, and outperforms some commonly used supervised learning methods.
Citation: Chen L-P (2022) Classification and prediction for multi-cancer data with ultrahigh-dimensional gene expressions. PLoS ONE 17(9): e0274440. https://doi.org/10.1371/journal.pone.0274440
Editor: Enrique Hernandez-Lemus, Instituto Nacional de Medicina Genomica, MEXICO
Received: September 27, 2021; Accepted: August 28, 2022; Published: September 15, 2022
Copyright: © 2022 Li-Pang Chen. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting information files.
Funding: This research is supported by the Ministry of Science and Technology with grant ID 110-2118-M-004-006-MY2. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Analysis of gene expression data is an important topic in bioinformatics, and a large body of research and relevant developments has emerged in recent years. One important branch of gene expression data analysis takes gene expression values as predictors to classify tumors into possible cancers and predict their labels. A motivating example in this paper is the GCM dataset, which contains 16,063 gene expression values and 14 human cancers among 198 tumor samples. The goal of this study is to take gene expression values as predictors and use them to classify tumor samples into their corresponding cancers. A key feature of this dataset is its ultrahigh-dimensional predictors, in the sense that the dimension of the predictors (the number of gene expression values) is much greater than the sample size (the number of tumor samples). This feature further induces challenges, including (a) pairwise interactions among gene expressions and (b) the existence of non-informative gene expressions, that affect the performance of classification and the accuracy of prediction.
To address classification and prediction in biomedical research, many supervised learning methods have been developed and widely applied in machine learning frameworks. Ignoring the pairwise interactions and non-informative predictors induced by ultrahigh dimensionality, [1] proposed the integration of several heterogeneous cancer series and performed multi-class classification. [2] studied the multicategory support vector machine (SVM) for the classification of multiple cancer types. [3] presented a comprehensive discussion of SVM methods. [4] applied SVM ensembles to breast cancer prediction. [5] discussed linear discriminant analysis (LDA) and its application to microarrays. [6] discussed multi-class analysis by generalized sparse linear discriminant analysis. Detailed and fundamental discussions of those methods can be found in [7, 8], and they were reviewed by [9] as well. In recent years, deep learning approaches, such as convolutional neural networks (e.g., [10]) and natural language processing (e.g., [11]), have been developed to deal with multi-class classification. More applications can be found in monographs such as [12–14].
To characterize pairwise interactions among gene expressions, which usually refers to the network dependence among gene expressions, we employ graphical models, which are powerful tools for describing the dependence structure of variables. A general introduction to graphical models can be found in [7] (Chapter 17). In the past literature, graphical models have been used to deal with classification problems. For example, [15] proposed a network-based support vector machine for binary classification of microarray samples. [16] discussed the identification of rheumatoid arthritis-related genes using a network-based support vector machine. [17] proposed network linear discriminant analysis. [18] proposed the nearest neighbor network. Most existing methods focus on binary responses and require the predictors to follow a normal distribution because they rely on the precision matrix. Furthermore, it is intuitive that the network structures of variables in different classes may not be exactly equal to each other. To address this issue, [19, 20] explored SVM and logistic regression, respectively, with heterogeneous network structures accommodated. More recently, [21, 22] developed multiclass discriminant analysis with network structures accommodated. From the Bayesian perspective, several methods have also been investigated with the network structure incorporated, including [23, 24].
To address non-informative gene expression values in ultrahigh-dimensional data, variable selection and dimension reduction are perhaps the most commonly used strategies in the past literature. For example, [25] applied unsupervised feature extraction, such as principal component analysis, tensor decomposition, and kernel tensor decomposition, to select potentially important genes. [26] adopted the sure independence screening (SIS) method to do feature screening for gene expressions and combined the Nottingham Prognostic Index with gene expressions into a hybrid signature. In combination with supervised learning, [27] proposed a penalized method for SVM, and [28, 29] explored variable selection based on LDA. Those methods mainly handle the setting where the dimension is smaller than the sample size; however, it is unknown whether they are able to deal with the case where the dimension of the predictors is much higher than the sample size.
From the two challenges and the developments described above, we note that most existing methods deal with either network structure or variable selection, but not both. This motivates us to propose a strategy that simultaneously retains important predictors and constructs the network structure of the predictors when doing classification. Our strategy is outlined in Fig 1. Roughly speaking,
- (i). to deal with ultrahigh-dimensional predictors where the dimension of predictors is extremely greater than the sample size, we adopt feature screening techniques to retain predictors that are informative to the response;
- (ii). to detect network structures of predictors, we employ exponential family graphical models to detect network structure of the selected predictors under the whole dataset or different classes;
- (iii). to use the results in (i) and (ii) to develop network-based classification models that examine class separation and make predictions for tumor samples.
There are several contributions in the proposed method. First, unlike existing methods that may specify a model when doing feature screening, our feature screening procedure is model-free and does not require specifying a model formulation. Second, although there exist methods handling network structures in classification, they assume a common network structure for the predictors of all subjects without taking into account possible heterogeneity across classes. Instead, the proposed method is able to construct predictive models with possibly class-dependent network structures of predictors taken into account. Finally, the proposed method is able to handle multi-class labels with network structures in the predictors accommodated, which differs from existing methods that either handle multi-class classification without using the information of network structures, or accommodate network structures only for binary classification.
The remainder of this paper is organized as follows. In Section 2, we introduce the motivating real dataset and its data structure, and define the relevant mathematical notation. In Section 3, we give a detailed presentation of each step in Fig 1. In Section 4, we implement the proposed method to analyze a real dataset and compare it with its competitors. A general discussion is presented in Section 5.
2 Data structure with multi-class responses
In this section, we first introduce a motivated dataset outlined in Section 1. After that, we define mathematical notation to describe the data structure with multi-class responses.
2.1 Description of motivated dataset
The data presented in the following are the GCM dataset collected by [30]. This dataset contains 16,063 gene expression values and 198 tumor samples, including 144 training samples (denoted $\mathcal{D}_{\text{train}}$) and 54 testing samples (denoted $\mathcal{D}_{\text{test}}$). In addition, 14 common human cancers, including Breast (BR), Prostate (PR), Lung (LU), Colorectal (CO), Lymphoma (LY), Bladder (BL), Melanoma (ML), Uterus (UT), Leukemia (LE), Renal (RE), Pancreas (PA), Ovary (OV), Mesothelioma (ME), and CNS cancers, are included in the dataset. The sample sizes for each cancer are summarized in Table 1. Our main goal is to classify tumor samples into different categories of cancer according to the gene expression values of the samples, which are treated as predictors.
The first row ($\mathcal{D}_{\text{train}}$) contains the sample sizes of the training data for each cancer label; the second row ($\mathcal{D}_{\text{test}}$) contains the sample sizes of the testing data for each cancer label; the last row ("Total") contains the sample sizes of the whole data for each cancer label.
Even though this dataset requires no pre-processing because the observations are complete without missing values, and some of its features have been well analyzed by [30], the dataset can still be further investigated in two respects. First, we note the issue of high dimensionality of the data, which usually implies the existence of irrelevant variables; that is, not every gene expression is associated with the response. Therefore, to ensure the accuracy of prediction, it is necessary to exclude irrelevant variables, and it is crucial to select gene expressions that are informative with respect to the response. Second, as discussed in [31, 32], complex dependence structures may exist among high-dimensional gene expressions. Therefore, to increase the accuracy of prediction, it is necessary to incorporate the network structure of gene expressions into the classification procedure.
2.2 Notation
In this subsection, we define mathematical notation to describe the data in order to develop the method.
Suppose the data of n subjects come from I classes, where I is a fixed integer greater than 2 and the classes are nominal. Let ni be the class size in class i with i = 1, ⋯, I, and hence $n_1 + \cdots + n_I = n$. Let Y denote the n-dimensional vector of responses with the jth component being Yj = i, which reflects the class membership that the jth subject is in the ith class for i = 1, ⋯, I and j = 1, ⋯, n.
Let p > 1 denote the dimension of predictors for each subject. Define X = [Xj,l] as the n × p matrix of predictors for j = 1, ⋯, n and l = 1, ⋯, p, where the component Xj,l represents the lth predictor for the jth subject. Furthermore, let Xj• = (Xj,1, ⋯, Xj,p)⊤ denote the p-dimensional predictor vector for the jth subject in the jth row of X, and let X•k = (X1,k, ⋯, Xn,k)⊤ represent the n-dimensional vector of the kth predictor in the kth column of X. In this paper, we consider a setting in which the dimension of the predictors p is ultrahigh relative to the sample size n, i.e., p = exp{O(n^r)} for some constant r > 0 (e.g., [33]).
Without loss of generality, the {Xj•, Yj} are treated as independent and identically distributed (i.i.d.) for j = 1, ⋯, n. We let lower case letters represent realized values for the corresponding random variables.
The objective of the study is to build models to predict the class label for a new subject with predictor observation $x_{\text{new}}$.
3 Proposed method
In this section, we present the detailed estimation procedure for each step shown in Fig 1.
3.1 Feature screening via rank-based correlation coefficient
Let $\mathcal{A} = \{k : X_{\bullet k} \text{ is associated with } Y,\ k = 1, \cdots, p\}$ denote the true active set which contains all relevant predictors for the response Y, with $q \triangleq |\mathcal{A}|$ and q < n, and let $\mathcal{A}^c$ be the complement of $\mathcal{A}$, which contains all irrelevant predictors for the response Y. Basically, the goal of Step 1 in Fig 1 is to estimate the active set $\mathcal{A}$. When $\mathcal{A}$ is determined, the associated vector of predictors $X_{j,\mathcal{A}} = (X_{j,k} : k \in \mathcal{A})^\top$ contains the important information with respect to the response, and its dimension is smaller than the sample size n. Thus, $X_{j,\mathcal{A}}$ can be adopted in the subsequent analysis.
The remaining concern is how to obtain the estimated active set. Following the spirit of [33], we employ the technique of feature screening, whose idea is to take the correlation between the response and each predictor as a signal and retain the important predictors with large signal values. We propose to take the rank-based correlation coefficient as the signal. Specifically, for the kth predictor X•k, the rank-based correlation coefficient between X•k and Y is given by (e.g., [34, 35])
$$\omega_k = \frac{\int \mathrm{Var}\left\{ E\left( \mathbb{1}\{Y \ge t\} \,\middle|\, X_{\bullet k} \right) \right\} d\mu(t)}{\int \mathrm{Var}\left( \mathbb{1}\{Y \ge t\} \right) d\mu(t)}, \qquad (1)$$
where $\mathbb{1}\{\cdot\}$ denotes the indicator function and μ(⋅) is the law of Y. It can be shown that ωk lies in the interval [0, 1], and a higher value of ωk indicates a stronger correlation between Y and X•k. Therefore, (1) can be regarded as an analogue of classical coefficients such as Pearson's correlation.
To implement this idea, we estimate (1) using the sample data. For j = 1, ⋯, n, let Y(j) denote the response rearranged according to the ordering of the kth predictor X•k; that is, we sort the pairs as (X(1),k, Y(1)), ⋯, (X(n),k, Y(n)) with X(1),k ≤ X(2),k ≤ ⋯ ≤ X(n),k and X(j),k being the jth sorted value of X•k. The corresponding estimator of ωk is given by [34]:
$$\widehat{\omega}_k = 1 - \frac{n \sum_{j=1}^{n-1} \left| r_{j+1} - r_j \right|}{2 \sum_{j=1}^{n} l_j \left( n - l_j \right)}, \qquad (2)$$
where, for j = 1, ⋯, n, $r_j = \#\{m : Y_{(m)} \le Y_{(j)}\}$, $l_j = \#\{m : Y_{(m)} \ge Y_{(j)}\}$, and $\#\mathcal{S}$ represents the number of elements in a set $\mathcal{S}$. In applications, one can use the R package XICOR to compute (2).
Therefore, the estimated active set based on (2) is given by
$$\widehat{\mathcal{A}} = \left\{ k : \widehat{\omega}_k \ge c\, n^{-\kappa},\ k = 1, \cdots, p \right\}, \qquad (3)$$
where c and κ ∈ (0, 1/2) are prespecified threshold values. In applications, one can specify c and κ such that the variables with the $[n/\log n]$ largest values of $\widehat{\omega}_k$ are retained, where [⋅] represents the ceiling function (e.g., [33, 35, 36]).
Different from conventional feature screening methods (e.g., [33]), the main advantage of (3) is that the screening is model-free because it does not impose a model formulation; thus, (3) is able to detect predictors that have a nonlinear relationship with the response Y. Theoretically, by derivations similar to those of [35], the sure screening property of (3) can be justified. That is, $P(\mathcal{A} \subseteq \widehat{\mathcal{A}}) \to 1$ as n → ∞, which ensures that the estimated active set contains the truly informative predictors that are associated with the response with probability approaching one. Moreover, while there are several methods to deal with feature screening, as examined by [35], (2) generally outperforms other existing approaches and is able to handle oscillatory trajectories between the response and the predictors.
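To make Step 1 concrete, the following is a minimal R sketch of the screening procedure. It assumes hypothetical objects `x_train` (an n × p training matrix of gene expressions) and `y_train` (a length-n vector of class labels), uses the XICOR package mentioned above to compute (2), and applies the $[n/\log n]$ cut-off described for (3).

```r
# Minimal sketch of the feature screening step (Section 3.1), assuming
# x_train (n x p matrix of gene expressions) and y_train (length-n class labels)
# are already loaded; both names are illustrative placeholders.
library(XICOR)

n <- nrow(x_train)
p <- ncol(x_train)

# Signal (2): rank-based correlation between each predictor and the response.
omega_hat <- sapply(seq_len(p), function(k) xicor(x_train[, k], y_train))

# Retain the predictors with the [n / log n] largest signals, as in (3).
d <- ceiling(n / log(n))
A_hat <- order(omega_hat, decreasing = TRUE)[seq_len(d)]

# Reduced predictor matrix used in the subsequent steps.
x_selected <- x_train[, A_hat, drop = FALSE]
```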
When the active set is determined, we let $X_{j,\widehat{\mathcal{A}}} = (X_{j,k} : k \in \widehat{\mathcal{A}})^\top$ denote the vector containing all the active predictors for the jth subject, and denote $x_{j,\widehat{\mathcal{A}}}$ as the realized value of $X_{j,\widehat{\mathcal{A}}}$.
3.2 The expressions of graphical structure
Since the estimated active set $\widehat{\mathcal{A}}$ has been identified, we now explore the network structure of the selected gene expressions in $X_{j,\widehat{\mathcal{A}}}$ for Step 2 in Fig 1. Graphical models are a commonly used strategy to achieve this goal.
The graph is expressed as G = (V, E), where V is the set of the vertices and E ⊂ V × V is the set of the edges. In our case, V is taken to be the set of selected predictors with $V = \widehat{\mathcal{A}}$, and E is regarded as the pairwise dependence between any two selected predictors. In graphical model frameworks, we start by formulating the distribution function of the selected predictors. In this article, we consider exponential family graphical models because they generalize the commonly used models. The formulation is given by
$$f\left(x_{j,\widehat{\mathcal{A}}};\, \beta, \Theta\right) = \exp\left\{ \sum_{r \in V} \beta_r B(x_{j,r}) + \sum_{s \ne t} \theta_{st}\, B(x_{j,s})\, B(x_{j,t}) + \sum_{r \in V} C(x_{j,r}) - A(\beta, \Theta) \right\}, \qquad (4)$$
where $\beta = (\beta_r : r \in V)^\top$ is the $\widehat{q}$-dimensional parameter vector with $\widehat{q} \triangleq |\widehat{\mathcal{A}}|$, Θ = [θst] is a $\widehat{q} \times \widehat{q}$ symmetric matrix, B(⋅) and C(⋅) are given functions that reflect the distribution of $x_{j,\widehat{\mathcal{A}}}$ (e.g., [20, 37]), and the function A(β, Θ) is the normalizing constant which ensures that (4) integrates to 1.
Without loss of generality, we take B(Xj,r) to be the linear function B(Xj,r) = Xj,r for r ∈ V. In addition, in graphical model theory, the main interest is the estimation of θst because of its interpretation that Xj,s and Xj,t are conditionally dependent if θst ≠ 0. Therefore, to focus on presenting the estimation of θst, we drop the main effect term and consider the following graphical model
$$f\left(x_{j,\widehat{\mathcal{A}}};\, \Theta\right) = \exp\left\{ \sum_{s \ne t} \theta_{st}\, x_{j,s}\, x_{j,t} + \sum_{r \in V} C(x_{j,r}) - A(\Theta) \right\}, \qquad (5)$$
where the function A(Θ) is the normalizing constant which makes (5) integrate to 1.
To estimate Θ, one of the well-known methods is conditional inference [38]. Without loss of generality, we consider the vertex s and define the neighbourhood set
$$\mathcal{N}(s) = \left\{ t \in V \setminus \{s\} : \theta_{st} \ne 0 \right\}, \qquad (6)$$
which collects the vertices that are dependent on the vertex s. To estimate the neighbourhood set of s, it suffices to study the inference of Xj,s | Xj,V∖{s}, where $X_{j, V\setminus\{s\}} = (X_{j,t} : t \in V \setminus \{s\})^\top$. Let $\theta_s = (\theta_{st} : t \in V \setminus \{s\})^\top$ denote the $(\widehat{q}-1)$-dimensional vector of parameters that is associated with Xj,V∖{s}. By some algebra, we have
$$f\left(x_{j,s} \,\middle|\, x_{j, V\setminus\{s\}};\, \theta_s\right) = \exp\left\{ x_{j,s} \sum_{t \in V\setminus\{s\}} \theta_{st}\, x_{j,t} + C(x_{j,s}) - D\left( \sum_{t \in V\setminus\{s\}} \theta_{st}\, x_{j,t} \right) \right\}, \qquad (7)$$
where D(⋅) is a normalization constant ensuring that the integral of (7) is equal to 1. Then the estimator of θs, denoted $\widehat{\theta}_s$, is given by
$$\widehat{\theta}_s = \underset{\theta_s}{\operatorname{argmin}} \left\{ \frac{1}{n} \sum_{j=1}^{n} \ell\left(\theta_s;\, x_{j,\widehat{\mathcal{A}}}\right) + \lambda \left\| \theta_s \right\|_1 \right\}, \qquad (8)$$
where $\ell(\theta_s; x_{j,\widehat{\mathcal{A}}}) = -\log f(x_{j,s} \,|\, x_{j,V\setminus\{s\}}; \theta_s)$ is the negative log-likelihood contribution based on (7), ‖⋅‖1 is the L1-norm, and λ is the tuning parameter.
In penalization problems for selecting variables, choosing the tuning parameter is also a crucial issue. In this paper, we employ the BIC approach (e.g., [39]) to select the tuning parameter λ. To emphasize the dependence on the tuning parameter, we let $\widehat{\theta}_s(\lambda)$ denote the estimator obtained from (8) under a given λ. Define
$$\mathrm{BIC}(\lambda) = 2 \sum_{j=1}^{n} \ell\left(\widehat{\theta}_s(\lambda);\, x_{j,\widehat{\mathcal{A}}}\right) + \log(n) \times df_\lambda, \qquad (9)$$
where $df_\lambda$ represents the number of non-zero elements in $\widehat{\theta}_s(\lambda)$ for a given λ. The optimal tuning parameter, denoted $\widehat{\lambda}$, is determined by minimizing (9) within a suitable range of λ. As a result, the estimator of θs is determined by $\widehat{\theta}_s = \widehat{\theta}_s(\widehat{\lambda})$.
Finally, the estimated neighbourhood set is given by
$$\widehat{\mathcal{N}}(s) = \left\{ t \in V \setminus \{s\} : \widehat{\theta}_{st} \ne 0 \right\}. \qquad (10)$$
Note that θst is equal to θts since Θ is a symmetric matrix. However, the estimators $\widehat{\theta}_{st}$ and $\widehat{\theta}_{ts}$ are not necessarily equal. To overcome this problem, we apply the AND rule [38]: the final estimators of $\widehat{\theta}_{st}$ and $\widehat{\theta}_{ts}$ are taken as their maximum if both $\widehat{\theta}_{st}$ and $\widehat{\theta}_{ts}$ are nonzero, and $\widehat{\theta}_{st}$ and $\widehat{\theta}_{ts}$ are set to zero if either of them is zero. Moreover, the estimated set of edges is given by
$$\widehat{E} = \left\{ (s,t) : t \in \widehat{\mathcal{N}}(s) \text{ and } s \in \widehat{\mathcal{N}}(t) \right\}. \qquad (11)$$
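As an illustration of the neighbourhood-selection step, the sketch below considers the Gaussian special case of (7), in which each conditional model reduces to a lasso linear regression of one selected gene expression on the others, and uses the glmnet package with a BIC-type criterion in the spirit of (9) and the AND rule leading to (11). The object `x_selected` is the hypothetical screened predictor matrix from the previous sketch, assumed to be standardized.

```r
# Sketch of neighbourhood selection (Section 3.2) under the Gaussian special case
# of (7), i.e., a lasso regression of each node on the remaining nodes; x_selected
# is an illustrative n x q_hat matrix of screened, standardized gene expressions.
library(glmnet)

q_hat <- ncol(x_selected)
n <- nrow(x_selected)
nbhd <- matrix(FALSE, q_hat, q_hat)

for (s in seq_len(q_hat)) {
  fit <- glmnet(x_selected[, -s, drop = FALSE], x_selected[, s], family = "gaussian")
  # BIC-type criterion in the spirit of (9): deviance plus log(n) times
  # the number of nonzero coefficients along the lambda path.
  bic <- deviance(fit) + log(n) * fit$df
  theta_s <- as.numeric(coef(fit, s = fit$lambda[which.min(bic)]))[-1]  # drop intercept
  nbhd[s, -s] <- theta_s != 0
}

# AND rule: keep an edge only if both directed estimates are nonzero, as in (11).
E_hat <- nbhd & t(nbhd)
edge_list <- which(E_hat & upper.tri(E_hat), arr.ind = TRUE)  # estimated edges
```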
After deriving the estimated set of edges, a crucial question is the relationship between $\widehat{E}$ and E. To answer this question, we present the following theorem, which gives an important result for the estimated graph.
Theorem 3.1 (Network Recovery)
Suppose E is the true set of edges, and let $\widehat{E}$ be the estimated set of edges given by (11). Under some regularity conditions in [38], we have that, as n → ∞,
$$P\left( \widehat{E} = E \right) \to 1. \qquad (12)$$
This result and the regularity conditions are similar to Section 2.2 in [40] and Theorem 5 (b) in [37]. Theorem 3.1 tells us that, under mild conditions, the estimated network structure recovers the true network structure.
3.3 Multinomial logistic regression with homogeneous network structure in predictors
After obtaining the estimated network structure based on the informative predictors, we wish to use it to examine the classification of different cancers, as demonstrated in Step 3 of Fig 1. Therefore, to incorporate the network structures of the predictors into a prediction model, we present two methods, which can be readily implemented in R by fitting logistic regression models (e.g., via the glm function).
In the first method, called the multinomial logistic regression with homogeneous network structure in predictors (MLR-HomoNet), we consider the case where the subjects in different classes share a common network structure in the predictors. To build a prediction model, we make use of the development of the logistic model with multiclass responses ([41], Section 6.1; [42], Section 7.1).
We first identify the pairwise dependence of the predictors using the measurements of all the subjects without distinguishing their class labels. Let $\widehat{\theta}_{st}$ be the estimate of θst obtained from (8) by using all the predictor measurements $\{x_{j,\widehat{\mathcal{A}}} : j = 1, \cdots, n\}$, and let $\widehat{E}$ denote the resulting estimated set of edges.
Next, for i = 1, ⋯, I and j = 1, ⋯, n, we let $\pi_i(x_{j,\widehat{\mathcal{A}}}) = P(Y_j = i \,|\, x_{j,\widehat{\mathcal{A}}})$ be the conditional probability of Yj = i given $x_{j,\widehat{\mathcal{A}}}$. Consider the parametric multinomial logistic model
$$\log\left\{ \frac{\pi_i\left(x_{j,\widehat{\mathcal{A}}}\right)}{\pi_I\left(x_{j,\widehat{\mathcal{A}}}\right)} \right\} = \alpha_{i0} + \sum_{k \in \widehat{\mathcal{A}}} \alpha_{ik}\, x_{j,k} + \sum_{(s,t) \in \widehat{E}} \alpha_{ist}\, x_{j,s}\, x_{j,t} \qquad (13)$$
for i = 1, 2, ⋯, I − 1, where $\alpha$ is the vector of parameters with the vectors $(\alpha_{ik} : k \in \widehat{\mathcal{A}})^\top$ and $(\alpha_{ist} : (s,t) \in \widehat{E})^\top$ reflecting the main-effect and network-based parameters for class i, and the constraint $\sum_{i=1}^{I} \pi_i(x_{j,\widehat{\mathcal{A}}}) = 1$ is imposed for every j = 1, ⋯, n.
For each subject j = 1, ⋯, n, we let $Z_{ji} = 1$ if subject j is in class i and $Z_{ji} = 0$ otherwise, and hence $\sum_{i=1}^{I} Z_{ji} = 1$ for every j. Let $z_{ji}$ denote a realized value of $Z_{ji}$. For i = 1, ⋯, I and j = 1, ⋯, n, the log-likelihood function is given by ([42], p.273)
$$\ell(\alpha) = \sum_{j=1}^{n} \sum_{i=1}^{I} z_{ji} \log \pi_i\left(x_{j,\widehat{\mathcal{A}}}\right). \qquad (14)$$
The estimator of α, denoted $\widehat{\alpha}$, can be derived by maximizing (14). In applications, since $\widehat{\alpha}$ has no closed form, we usually apply the Newton–Raphson algorithm to (14) to obtain the resulting estimator. Therefore, for a realization $x_{\widehat{\mathcal{A}}}$ of the selected predictor vector $X_{\widehat{\mathcal{A}}}$, writing $\widehat{\eta}_i(x_{\widehat{\mathcal{A}}})$ for the right-hand side of (13) evaluated at $\widehat{\alpha}$, $\pi_i(x_{\widehat{\mathcal{A}}})$ is estimated as
$$\widehat{\pi}_i\left(x_{\widehat{\mathcal{A}}}\right) = \frac{\exp\left\{ \widehat{\eta}_i\left(x_{\widehat{\mathcal{A}}}\right) \right\}}{1 + \sum_{l=1}^{I-1} \exp\left\{ \widehat{\eta}_l\left(x_{\widehat{\mathcal{A}}}\right) \right\}} \qquad (15)$$
for i = 1, ⋯, I − 1, and $\pi_I(x_{\widehat{\mathcal{A}}})$ is estimated as
$$\widehat{\pi}_I\left(x_{\widehat{\mathcal{A}}}\right) = \frac{1}{1 + \sum_{l=1}^{I-1} \exp\left\{ \widehat{\eta}_l\left(x_{\widehat{\mathcal{A}}}\right) \right\}}. \qquad (16)$$
Finally, to predict the class label for a new subject with the selected $\widehat{q}$-dimensional predictor instance $x_{\text{new},\widehat{\mathcal{A}}}$, we first calculate the right-hand sides of (15) and (16), and let $\widehat{\pi}_1(x_{\text{new},\widehat{\mathcal{A}}}), \cdots, \widehat{\pi}_I(x_{\text{new},\widehat{\mathcal{A}}})$ denote the corresponding values. Let i* denote the index which corresponds to the largest value, i.e., $i^* = \underset{i \in \{1,\cdots,I\}}{\operatorname{argmax}}\ \widehat{\pi}_i(x_{\text{new},\widehat{\mathcal{A}}})$. Then the class label for this new subject is predicted as i*.
Finally, we summarize the key steps of Sections 3.1–3.3 in Algorithm 1.
Algorithm 1: MLR-HomoNet
Under the training data $\mathcal{D}_{\text{train}}$:
Step 1: Determine informative predictors
Apply (2) to do feature screening and retain $[n/\log n]$ predictors among the p-dimensional predictors. The set of selected predictors $\widehat{\mathcal{A}}$ is given by (3).
Step 2: Determine the network structure of the predictors
Based on the selected predictors in $\widehat{\mathcal{A}}$, use (8) to determine the pairwise dependence structure and obtain (11). The resulting network structure is formed by $\widehat{E}$.
Step 3: Construct the predictive model
Given an initial value α(0), perform the following Newton–Raphson algorithm;
for step t with t = 0, 1, 2, ⋯, T, say T = 1000 do
 Step 3.1: calculate the score function evaluated at the tth iterated value: $S(\alpha^{(t)}) = \dfrac{\partial \ell(\alpha)}{\partial \alpha}\Big|_{\alpha = \alpha^{(t)}}$ with ℓ(α) given by (14);
 Step 3.2: calculate the Hessian matrix evaluated at the tth iterated value: $H(\alpha^{(t)}) = \dfrac{\partial^2 \ell(\alpha)}{\partial \alpha\, \partial \alpha^\top}\Big|_{\alpha = \alpha^{(t)}}$;
 Step 3.3: update α(t+1) ← α(t) − {H(α(t))}−1 S(α(t));
end
Let $\widehat{\alpha}$ denote the resulting estimator, and combine $\widehat{\alpha}$ with (15) and (16) to determine the resulting predictive model $\widehat{\pi}_i(\cdot)$ for i = 1, ⋯, I.
Under the testing data $\mathcal{D}_{\text{test}}$:
Step 4: Prediction
For a new predictor $x_{\text{new},\widehat{\mathcal{A}}}$ in $\mathcal{D}_{\text{test}}$, use $\widehat{\pi}_i(\cdot)$ with i = 1, ⋯, I to compute the corresponding probabilities $\widehat{\pi}_i(x_{\text{new},\widehat{\mathcal{A}}})$. The predicted class i* is then determined by $i^* = \underset{i}{\operatorname{argmax}}\ \widehat{\pi}_i(x_{\text{new},\widehat{\mathcal{A}}})$.
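As an illustration of Steps 3–4 of Algorithm 1, the sketch below fits the multinomial model with the multinom function from the nnet package, which maximizes the same likelihood as (14), instead of hand-coding the Newton–Raphson updates. The objects `x_selected`, `A_hat`, `E_hat`, `y_train`, and `x_test` are hypothetical placeholders carried over from the previous sketches, with `x_test` being the full testing predictor matrix.

```r
# Sketch of Steps 3-4 of Algorithm 1 (MLR-HomoNet), assuming the hypothetical
# objects x_selected, A_hat, E_hat, y_train, and x_test from the earlier sketches.
library(nnet)

edge_list <- which(E_hat & upper.tri(E_hat), arr.ind = TRUE)

# Build main effects plus one interaction column per estimated edge (s, t), as in (13).
build_features <- function(x) {
  feats <- data.frame(x)
  for (e in seq_len(nrow(edge_list))) {
    feats[[paste0("edge_", e)]] <- x[, edge_list[e, 1]] * x[, edge_list[e, 2]]
  }
  feats
}

train_df <- build_features(x_selected)
train_df$y <- factor(y_train)

# multinom maximizes the multinomial log-likelihood (14).
fit <- multinom(y ~ ., data = train_df, maxit = 1000, MaxNWts = 10000, trace = FALSE)

# Step 4: predict each testing sample by the largest fitted probability.
x_test_sel <- x_test[, A_hat, drop = FALSE]   # same screened columns as x_selected
test_df <- build_features(x_test_sel)
prob <- predict(fit, newdata = test_df, type = "probs")
y_pred <- colnames(prob)[apply(prob, 1, which.max)]
```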
3.4 Logistic regression with heterogeneous network structures in predictors
We now present an alternative to the method described in Section 3.3. Instead of pooling all the subjects' predictor measurements to characterize the predictor network structure, this method, called logistic regression with heterogeneous network structures in predictors (LR-HeteNet), stratifies the predictor information by class when characterizing the predictor network structures. The implementation is summarized in Algorithm 2.
Algorithm 2: LR-HeteNet
Under the training data $\mathcal{D}_{\text{train}}$:
for i = 1, 2, ⋯, I do
 Step 0: Let Yi denote the n-dimensional surrogate response vector formulated by (17).
 Step 1: Class-dependent active set
 Apply (18) to do feature screening and retain $[n_i/\log n_i]$ predictors among the p-dimensional predictors. The set of selected predictors for class i, $\widehat{\mathcal{A}}_i$, is given by (19).
 Step 2: Class-dependent predictor network
 Based on the selected predictors in $\widehat{\mathcal{A}}_i$, use (8) to determine the pairwise dependence structure and obtain (11). Denote $\widehat{E}_i$ as the resulting network structure.
 Step 3: Class-dependent predictive model
 Given an initial value $\gamma_i^{(0)}$, perform the Newton–Raphson algorithm;
 for step t with t = 0, 1, 2, ⋯, T, say T = 1000 do
  Step 3.1: calculate the score function evaluated at the tth iterated value: $S(\gamma_i^{(t)}) = \dfrac{\partial \ell_i(\gamma_i)}{\partial \gamma_i}\Big|_{\gamma_i = \gamma_i^{(t)}}$, where $\ell_i(\gamma_i)$ is (21) with the probabilities in (20) evaluated at $\gamma_i^{(t)}$;
  Step 3.2: calculate the Hessian matrix evaluated at the tth iterated value: $H(\gamma_i^{(t)}) = \dfrac{\partial^2 \ell_i(\gamma_i)}{\partial \gamma_i\, \partial \gamma_i^\top}\Big|_{\gamma_i = \gamma_i^{(t)}}$;
  Step 3.3: update $\gamma_i^{(t+1)} \leftarrow \gamma_i^{(t)} - \{H(\gamma_i^{(t)})\}^{-1} S(\gamma_i^{(t)})$;
 end
 Let $\widehat{\gamma}_i$ denote the resulting estimator, and combine $\widehat{\gamma}_i$ with (22) to determine the resulting predictive model $\widehat{\pi}_i(\cdot)$.
end
Under the testing data $\mathcal{D}_{\text{test}}$:
Step 4: Prediction
For a new predictor $x_{\text{new}}$ in $\mathcal{D}_{\text{test}}$, we use $\widehat{\pi}_i(\cdot)$ with i = 1, ⋯, I to compute the corresponding probabilities $\widehat{\pi}_i(x_{\text{new},\widehat{\mathcal{A}}_i})$. The predicted class i* is then determined by (23).
To be more specific, under the training data $\mathcal{D}_{\text{train}}$, we first introduce a binary surrogate response variable $Y_{ij}$ for every i = 1, ⋯, I and j = 1, ⋯, n. Let
$$Y_{ij} = \begin{cases} 1, & \text{if } Y_j = i, \\ 0, & \text{otherwise}, \end{cases} \qquad (17)$$
and let $Y_i = (Y_{i1}, \cdots, Y_{in})^\top$ be the n-dimensional vector whose elements corresponding to the subjects in class i are equal to one and whose other elements are zero, for i = 1, ⋯, I.
After that, we adopt a strategy similar to Algorithm 1 to construct a predictive model for class i. Specifically, in Step 1 of Algorithm 2, let $\mathcal{A}_i$ denote the true active set of class i, which contains all relevant predictors for the response Yi, with $|\mathcal{A}_i| < n$. Following (2), the signal between X•k and Yi is denoted $\omega_{ik}$, and it can be estimated by
$$\widehat{\omega}_{ik} = 1 - \frac{n \sum_{j=1}^{n-1} \left| r_{i,j+1} - r_{i,j} \right|}{2 \sum_{j=1}^{n} l_{i,j} \left( n - l_{i,j} \right)}, \qquad (18)$$
where, for j = 1, ⋯, n, $r_{i,j} = \#\{m : Y_{i,(m)} \le Y_{i,(j)}\}$ and $l_{i,j} = \#\{m : Y_{i,(m)} \ge Y_{i,(j)}\}$, with $Y_{i,(j)}$ being the surrogate response rearranged according to the ordering of the kth predictor X•k. Therefore, $\mathcal{A}_i$ can be estimated as
$$\widehat{\mathcal{A}}_i = \left\{ k : \widehat{\omega}_{ik} \ge c_i\, n^{-\kappa_i},\ k = 1, \cdots, p \right\}, \qquad (19)$$
where ci and κi ∈ (0, 1/2) are prespecified threshold values. Let $X_{j,\widehat{\mathcal{A}}_i} = (X_{j,k} : k \in \widehat{\mathcal{A}}_i)^\top$ denote the vector of all the active predictors that depend on Yi for the jth subject. Moreover, since Yi is defined as a binary response, derivations similar to those in [35] show that (18) is valid for measuring the dependence between categorical and continuous variables, and the point-biserial correlation coefficient is a special case of (18).
In Step 2 of Algorithm 2, let $V_i = \widehat{\mathcal{A}}_i$ denote the vertex set containing the predictors that are dependent on class i for i = 1, ⋯, I. We apply the procedure described in Section 3.2 to determine the network structure of the predictors in class i. Let $\widehat{E}_i$ denote the estimated set of edges for class i, where the corresponding $\widehat{\theta}_{st}$ is the estimate of θst derived from (8) based on the predictor measurements in class i.
After that, Step 3 of Algorithm 2 aims to fit a logistic regression model using the surrogate response vector Yi with the estimated predictor network structure incorporated, for i = 1, ⋯, I. Specifically, for the jth component of Yi, say $Y_{ij}$, define $\pi_i(x_{j,\widehat{\mathcal{A}}_i}) = P(Y_{ij} = 1 \,|\, x_{j,\widehat{\mathcal{A}}_i})$ and consider the parametric logistic regression model
$$\log\left\{ \frac{\pi_i\left(x_{j,\widehat{\mathcal{A}}_i}\right)}{1 - \pi_i\left(x_{j,\widehat{\mathcal{A}}_i}\right)} \right\} = \gamma_{i0} + \sum_{k \in \widehat{\mathcal{A}}_i} \gamma_{ik}\, x_{j,k} + \sum_{(s,t) \in \widehat{E}_i} \gamma_{ist}\, x_{j,s}\, x_{j,t}, \qquad (20)$$
where j = 1, ⋯, n and $\gamma_i = (\gamma_{i0}, \gamma_{ik}, \gamma_{ist} : k \in \widehat{\mathcal{A}}_i,\ (s,t) \in \widehat{E}_i)^\top$ is the vector of parameters associated with class i. In the spirit of the maximum likelihood estimation (MLE) method (e.g., [42]), the log-likelihood function of (20) is given by
$$\ell_i(\gamma_i) = \sum_{j=1}^{n} \left[ y_{ij} \log \pi_i\left(x_{j,\widehat{\mathcal{A}}_i}\right) + \left(1 - y_{ij}\right) \log\left\{ 1 - \pi_i\left(x_{j,\widehat{\mathcal{A}}_i}\right) \right\} \right], \qquad (21)$$
and the estimator of γi, denoted $\widehat{\gamma}_i$, is obtained by maximizing (21). In applications, we implement the Newton–Raphson algorithm to obtain $\widehat{\gamma}_i$; the detailed procedure is summarized in Algorithm 2. Consequently, for a realization $x_{\widehat{\mathcal{A}}_i}$ of the selected predictor vector $X_{\widehat{\mathcal{A}}_i}$, based on (20), $\pi_i(x_{\widehat{\mathcal{A}}_i})$ can be estimated by
$$\widehat{\pi}_i\left(x_{\widehat{\mathcal{A}}_i}\right) = \frac{\exp\left\{ \widehat{\eta}_i\left(x_{\widehat{\mathcal{A}}_i}\right) \right\}}{1 + \exp\left\{ \widehat{\eta}_i\left(x_{\widehat{\mathcal{A}}_i}\right) \right\}} \qquad (22)$$
for i = 1, ⋯, I, where $\widehat{\eta}_i(x_{\widehat{\mathcal{A}}_i})$ denotes the right-hand side of (20) evaluated at $\widehat{\gamma}_i$.
Finally, when the predictive models based on the training data have been obtained, we examine the prediction for the testing data $\mathcal{D}_{\text{test}}$ in Step 4 of Algorithm 2. Let $x_{\text{new}}$ denote the predictor vector for a new subject and let $x_{\text{new},\widehat{\mathcal{A}}_i}$ denote its subvector of predictors selected for class i. We calculate (22) with $x_{\widehat{\mathcal{A}}_i}$ replaced by $x_{\text{new},\widehat{\mathcal{A}}_i}$ for i = 1, ⋯, I, and let $\widehat{\pi}_1(x_{\text{new},\widehat{\mathcal{A}}_1}), \cdots, \widehat{\pi}_I(x_{\text{new},\widehat{\mathcal{A}}_I})$ denote the corresponding values. Let i* denote the index which corresponds to the largest value, i.e.,
$$i^* = \underset{i \in \{1, \cdots, I\}}{\operatorname{argmax}}\ \widehat{\pi}_i\left(x_{\text{new},\widehat{\mathcal{A}}_i}\right). \qquad (23)$$
Then the class label for this new subject is predicted as i*.
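A minimal R sketch of the LR-HeteNet strategy is given below, assuming hypothetical objects `x_train`, `y_train`, and `x_test` as before. For brevity it keeps only the main-effect terms of (20); the class-dependent interaction terms from $\widehat{E}_i$ can be added exactly as in the Algorithm 1 sketch, and the class-dependent cut-off $[n_i/\log n_i]$ mirrors Section 4.1.

```r
# Sketch of Algorithm 2 (LR-HeteNet), assuming the illustrative objects x_train,
# y_train (class labels), and x_test (full testing predictor matrix).
library(XICOR)

classes <- sort(unique(y_train))
prob_test <- matrix(NA, nrow(x_test), length(classes))

for (i in seq_along(classes)) {
  y_i <- as.numeric(y_train == classes[i])                   # surrogate response (17)
  omega_i <- sapply(seq_len(ncol(x_train)), function(k) xicor(x_train[, k], y_i))
  n_i <- sum(y_i)
  d_i <- ceiling(n_i / log(n_i))                             # class-dependent cut-off
  A_i <- order(omega_i, decreasing = TRUE)[seq_len(d_i)]     # estimated active set (19)

  train_i <- data.frame(x_train[, A_i, drop = FALSE], y = y_i)
  # Main-effect logistic fit maximizing (21); network terms from E_i are omitted here.
  fit_i <- glm(y ~ ., family = binomial, data = train_i)

  test_i <- data.frame(x_test[, A_i, drop = FALSE])
  prob_test[, i] <- predict(fit_i, newdata = test_i, type = "response")  # (22)
}

# Step 4: predicted class is the one with the largest fitted probability, as in (23).
y_pred <- classes[apply(prob_test, 1, which.max)]
```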
Remark 3.1 The main difference between the MLR-HomoNet and LR-HeteNet methods is that MLR-HomoNet applies the feature screening approach to retain informative predictors by pooling all subjects, while LR-HeteNet applies the feature screening approach separately to the surrogate response of each class. Consequently, the estimated active sets (19) depend on the class and differ from each other, and thus the resulting network structures determined by Step 2 of Algorithm 2 differ across classes. Therefore, we conclude that the MLR-HomoNet method adopts only different levels of gene expression values to classify tumor samples, while the LR-HeteNet method uses not only gene expression values but also class-dependent network structures to do the classification.
4 Results
In this section, we aim to implement Algorithms 1 and 2 in Section 3 to the GCM dataset introduced in Section 2.1.
4.1 Detection of informative gene expressions via feature screening
In the GCM dataset, there are I = 14 classes. The dimension of predictors is p = 16,063 and the sample size is n = 198, where the size of the training set is 144 and the size of the testing set is 54. Following the steps in Fig 1, we first implement the proposed method in Section 3 to fit models based on the training set, and then assess the prediction performance by examining the testing set.
Since the dimension of predictors is extremely larger than the sample size, i.e., p ≫ n, to determine the informative predictors, we adopt the screening signal (2) to retain informative gene expressions. The first strategy, in Algorithm 1, is to apply (2) to evaluate the signal between X•k and Y ∈ {1, ⋯, 14} and determine the estimated active set (3); the second, in Algorithm 2, is to calculate the signal between X•k and Yi for i = 1, ⋯, 14 and then obtain the estimated class-dependent active sets (19). As suggested in [33, 35, 36], under the training set, we retain $[n/\log n]$ gene expression values for the MLR-HomoNet method and retain $[n_i/\log n_i]$ gene expression values with i = 1, ⋯, 14 for the LR-HeteNet method, where ni is the sample size of class i summarized in Table 1.
4.2 Network-based classification models
After the feature screening step, we next apply the estimation procedure in Section 3.2 to determine the network structure of the selected gene expressions in the training set. Fig 2 displays the network structure with all samples accommodated, and the network structures of the selected gene expressions for the different cancers are displayed in Fig 3. In Fig 2, we can see that the selected gene expressions have complex dependence structures. For example, the gene expressions with IDs 10111, 9548, and 9446 are connected with several other gene expressions, while the three gene expressions 10884, 15854, and 10208 have no connections with others. On the other hand, as shown in Fig 3, different classes have different selected gene expressions and associated network structures, which verifies the discussion in Remark 3.1. That is, since different kinds of cancer differ in their corresponding gene expressions, we can infer which cancer each tumor sample comes from according to the specific network structures of gene expressions produced by our analysis.
To adopt the determined network structures to examine the classification, we apply the network structures and the training set to the classification models proposed in Sections 3.3 and 3.4, respectively. To see the fitness of the two models, we first apply the fitted models to the training data and examine the classification. The 14 × 14 confusion matrices based on the MLR-HomoNet and LR-HeteNet methods are shown in Tables 2 and 3, respectively, where columns are labels from the training data $\mathcal{D}_{\text{train}}$, rows are labels of fitted values, diagonal entries reflect the number of correct classifications, and off-diagonal entries are the numbers of misclassifications by the fitted values. In general, both methods show satisfactory model fitness, as the accuracy of classification is high. Moreover, we observe that the LR-HeteNet method seems to slightly outperform the MLR-HomoNet method, since the latter produces slightly more misclassifications on BR, PR, CO, and UT than the former. This result makes sense because the LR-HeteNet method is based on class-dependent network structures that can directly reflect the corresponding cancers. For a clearer visualization, we further display two heatmaps in Fig 4, which are obtained from Tables 2 and 3 with each row divided by the class-dependent sample size in the training data. We observe that the diagonal entries have dark color, which indicates that the proportion of true classification is high and that Algorithms 1 and 2 give well-fitted models.
The left panel is obtained by Algorithm 1, and the right panel by Algorithm 2. Z represents the proportion of (mis)classification.
4.3 Prediction
When the predictive models have been constructed, we assess the performance of the proposed method by examining the prediction for the testing data. We apply the two proposed methods to the predictors in the testing data and then predict the classification. After that, we summarize the responses in the testing data and the predicted classes into the 14 × 14 confusion matrices in Tables 4 and 5, respectively, where columns are labels from the testing samples $\mathcal{D}_{\text{test}}$, rows are labels of predicted values, diagonal entries reflect the number of correct classifications, and off-diagonal entries are the numbers of misclassifications by the predicted values. Moreover, we also display two heatmaps in Fig 5, which are obtained from Tables 4 and 5 with each row divided by the class-dependent sample size in the testing data. From the confusion matrices and heatmaps, we can see that the two proposed methods have satisfactory performance in prediction because most of the predicted classes are the same as the class labels in the testing data, except for a few misclassifications.
The left panel is obtained by Algorithm 1, and the right panel by Algorithm 2. Z represents the proportion of (mis)classification.
To assess the performance of classification and prediction numerically, we evaluate some commonly used criteria: micro averaged metrics, macro averaged metrics, and the adjusted Rand index. For a subject j in the testing data with j = 1, ⋯, 54, let $\widehat{y}_{\text{new},j}$ denote the predicted class label determined by the prediction models and let $y_{\text{new},j}$ denote the class label in the testing data. For class i = 1, ⋯, I, we respectively calculate the number of true positives (TP), the number of false positives (FP), and the number of false negatives (FN) as
$$\mathrm{TP}_i = \sum_{j=1}^{54} \mathbb{1}\left\{ \widehat{y}_{\text{new},j} = i \text{ and } y_{\text{new},j} = i \right\}, \qquad (24)$$
$$\mathrm{FP}_i = \sum_{j=1}^{54} \mathbb{1}\left\{ \widehat{y}_{\text{new},j} = i \text{ and } y_{\text{new},j} \ne i \right\}, \qquad (25)$$
and
$$\mathrm{FN}_i = \sum_{j=1}^{54} \mathbb{1}\left\{ \widehat{y}_{\text{new},j} \ne i \text{ and } y_{\text{new},j} = i \right\}. \qquad (26)$$
For the micro averaged metrics, precision and recall are, respectively, defined in terms of (24), (25), and (26):
$$\mathrm{PRE}_{\text{micro}} = \frac{\sum_{i=1}^{I} \mathrm{TP}_i}{\sum_{i=1}^{I} \left( \mathrm{TP}_i + \mathrm{FP}_i \right)} \qquad (27)$$
and
$$\mathrm{REC}_{\text{micro}} = \frac{\sum_{i=1}^{I} \mathrm{TP}_i}{\sum_{i=1}^{I} \left( \mathrm{TP}_i + \mathrm{FN}_i \right)}. \qquad (28)$$
Then the Micro-F-score is defined as
$$\text{Micro-F-score} = \frac{2 \times \mathrm{PRE}_{\text{micro}} \times \mathrm{REC}_{\text{micro}}}{\mathrm{PRE}_{\text{micro}} + \mathrm{REC}_{\text{micro}}}. \qquad (29)$$
On the other hand, for the macro averaged metrics, for i = 1, ⋯, I, let $\mathrm{PRE}_i = \mathrm{TP}_i / (\mathrm{TP}_i + \mathrm{FP}_i)$ denote the precision for class i, and let $\mathrm{REC}_i = \mathrm{TP}_i / (\mathrm{TP}_i + \mathrm{FN}_i)$ denote the recall for class i. Then the overall precision and recall are, respectively, given by
$$\mathrm{PRE}_{\text{macro}} = \frac{1}{I} \sum_{i=1}^{I} \mathrm{PRE}_i \qquad (30)$$
and
$$\mathrm{REC}_{\text{macro}} = \frac{1}{I} \sum_{i=1}^{I} \mathrm{REC}_i, \qquad (31)$$
and the Macro-F-score is defined as
$$\text{Macro-F-score} = \frac{2 \times \mathrm{PRE}_{\text{macro}} \times \mathrm{REC}_{\text{macro}}}{\mathrm{PRE}_{\text{macro}} + \mathrm{REC}_{\text{macro}}}. \qquad (32)$$
According to these definitions, when all subjects are correctly classified, FP and FN are equal to zero, so that PRE and REC are equal to one; if all subjects are falsely classified, then TP is equal to zero, and thus PRE and REC are equal to zero. Therefore, the values of PRE and REC lie between zero and one. Moreover, the F-score falls in [0, 1] as well by treating 0/0 as zero. In principle, higher values of PRE, REC, and F-score, based on both micro and macro averaging, reflect better performance of the methods ([20–22]).
In addition to the criteria above, another commonly used criterion is the adjusted Rand index (ARI). For i, l = 1, ⋯, I, let $n_{il} = \sum_{j=1}^{54} \mathbb{1}\{ y_{\text{new},j} = i \text{ and } \widehat{y}_{\text{new},j} = l \}$. Moreover, define $a_i = \sum_{l=1}^{I} n_{il}$ for i = 1, ⋯, I and $b_l = \sum_{i=1}^{I} n_{il}$ for l = 1, ⋯, I. Then the ARI is defined as (e.g., [43])
$$\mathrm{ARI} = \frac{\sum_{i,l} \binom{n_{il}}{2} - \left\{ \sum_{i} \binom{a_i}{2} \sum_{l} \binom{b_l}{2} \right\} \Big/ \binom{m}{2}}{\frac{1}{2}\left\{ \sum_{i} \binom{a_i}{2} + \sum_{l} \binom{b_l}{2} \right\} - \left\{ \sum_{i} \binom{a_i}{2} \sum_{l} \binom{b_l}{2} \right\} \Big/ \binom{m}{2}}, \qquad (33)$$
where m = 54 is the size of the testing data. As mentioned in [43], the ARI is bounded above by one, and a higher value of ARI indicates more accurate classification.
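For completeness, a short R sketch of the criteria (24)–(33) is given below, assuming hypothetical vectors `y_true` and `y_pred` of observed and predicted class labels for the testing samples.

```r
# Sketch of the evaluation criteria (24)-(33), assuming illustrative vectors
# y_true and y_pred of observed and predicted class labels for the testing data.
classes <- sort(unique(c(y_true, y_pred)))
tp <- fp <- fn <- numeric(length(classes))
for (i in seq_along(classes)) {
  tp[i] <- sum(y_pred == classes[i] & y_true == classes[i])   # (24)
  fp[i] <- sum(y_pred == classes[i] & y_true != classes[i])   # (25)
  fn[i] <- sum(y_pred != classes[i] & y_true == classes[i])   # (26)
}

# Micro averaged metrics (27)-(29).
pre_micro <- sum(tp) / sum(tp + fp)
rec_micro <- sum(tp) / sum(tp + fn)
f_micro   <- 2 * pre_micro * rec_micro / (pre_micro + rec_micro)

# Macro averaged metrics (30)-(32); 0/0 is treated as zero.
safe_div  <- function(a, b) ifelse(b == 0, 0, a / b)
pre_macro <- mean(safe_div(tp, tp + fp))
rec_macro <- mean(safe_div(tp, tp + fn))
f_macro   <- 2 * pre_macro * rec_macro / (pre_macro + rec_macro)

# Adjusted Rand index (33), computed from the cross-tabulation of labels.
tab <- table(y_true, y_pred)
a <- rowSums(tab); b <- colSums(tab); m <- sum(tab)
idx      <- sum(choose(tab, 2))
expected <- sum(choose(a, 2)) * sum(choose(b, 2)) / choose(m, 2)
maximum  <- (sum(choose(a, 2)) + sum(choose(b, 2))) / 2
ari      <- (idx - expected) / (maximum - expected)
```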
We primarily adopt (27), (28), (29), (30), (31), (32), and (33) to assess the performance of the two proposed methods. In addition, to compare with the proposed methods, we also examine several well-established supervised learning methods, including logistic regression models without incorporating network structure [42], the support vector machine (SVM) that was examined by [30], K-nearest neighbors (KNN), linear discriminant analysis (LDA), Bayes, artificial neural network (ANN), XGBoost, random forest (RF), bagging, and long short-term memory (LSTM) methods. The implementation and the corresponding R packages are summarized in Table 6.
The prediction results of the proposed and competing methods are summarized in Table 7. In general, we can observe that the two proposed methods have larger values of PRE, REC, F-score, and ARI than the other existing methods. Comparing the existing methods among themselves, we can see that advanced machine learning or deep learning methods (e.g., ANN, RF, bagging) outperform conventional ones, such as LDA or SVM, but are less satisfactory than the proposed methods because of slightly larger misclassification. This verifies that incorporating network structures improves the accuracy of classification and prediction. In addition, another reason is that, unlike existing methods that possibly incur overfitting because they fit models with all gene expression values directly, the two proposed methods retain only the gene expression values and network structures that are related to the response, yielding parsimonious models. In this way, noise and impacts induced by irrelevant gene expression values can be eliminated. Comparing the two proposed methods, we can see that the LR-HeteNet method outperforms the MLR-HomoNet method, with larger values of the criteria. The main reason is that the MLR-HomoNet model in Section 3.3 directly deals with multi-label classification by using a common network structure to classify tumors into the corresponding cancers. To simultaneously reflect information across all classes, the network structure displayed in Fig 2 is expected to require more gene expression values and more complex interactions. On the other hand, the LR-HeteNet method in Section 3.4 identifies predictors and a unique network structure to reflect a specific cancer, suggesting that types of cancer can be uniquely represented by different network structures of gene expression values. As shown in Fig 3, one can directly adopt a given network structure to classify tumors into their cancers with high prediction accuracy. In summary, with the noise induced by irrelevant predictors removed and informative network structures of predictors accommodated, the accuracy of classification and prediction improves significantly.
5 Discussion
In this paper, we present a network-based classification method to predict the classification of tumor samples in an ultrahigh-dimensional setting, i.e., with multitudinous gene expressions as predictors. In the proposed method, we first adopt a model-free feature screening technique to retain informative gene expressions from the ultrahigh-dimensional data. After that, we identify the network structures of the detected gene expressions based on different cancers, and the network structure recovery property allows us to fit nominal logistic regression models based on the network structure and examine classification and prediction. Compared with other existing methods, the proposed method gives more precise prediction results.
There are several possible extensions of the current work. For example, RNA sequences, regarded as count data, are also frequently explored in bioinformatics. The proposed method can be naturally extended to deal with RNA sequence data by treating them as predictors, because the screening signal (2) is free of the distribution of the random variables, and the identification of the network structure in Section 3.2 is based on exponential family graphical models. For the implementation of classification models, it is also interesting to explore other machine learning methods, such as SVM, LDA, or KNN, and other deep learning approaches that are popular in data science.
Moreover, research gaps still exist, and more explorations can be done by extending the proposed method. For example, as discussed in [32], measurement error in predictors is ubiquitous in data analysis; in particular, mismeasurement is inevitable in gene expression data (e.g., [52]). Ignoring measurement error effects is expected to increase the possibility of false classification and lead to wrong conclusions. Therefore, it is important to develop a new error-eliminating strategy to deal with measurement error based on the current method. Finally, as R packages associated with some of the existing methods have been developed, a corresponding R package for the method proposed here is anticipated.
Acknowledgments
The author would like to thank Lingyu Cai for technical support with the programming code, helpful language editing, grammar revision, and proofreading. The author thanks the editorial team for providing constructive and suggestive comments to improve the presentation of the manuscript.
References
- 1. Gálvez J. M., Castillo D., Herrera L. J., San Román B., Valenzuela O., Ortuno F. M., et al. (2018). Multiclass classification for skin cancer profiling based on the integration of heterogeneous gene expression series. PLoS ONE, 13(5), e0196836. pmid:29750795
- 2. Lee Y. and Lee C.-K. (2003). Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics. 19, 1132–1139. pmid:12801874
- 3. Cristianini N. and Shawe-Taylor J. (2000). An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge.
- 4. Huang M. W., Chen C. W., Lin W. C., Ke S. W., and Tsai C. F. (2017). SVM and SVM ensembles in breast cancer prediction. PLoS ONE, 12(1), e0161501. pmid:28060807
- 5. Guo Y., Hastie T., Tibshirani R. (2007). Regularized linear discriminant analysis and its application in microarrays. Biostatistics, 8, 86–100. pmid:16603682
- 6. Safo S. E. and Ahn J. (2016). General sparse multi-class linear discriminant analysis. Computational Statistics and Data Analysis, 99, 81–90.
- 7. Hastie T., Tibshirani R., and Friedman J. (2008). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
- 8. James G., Witten D., Hastie T., and Tibshirani R. (2017). An Introduction to Statistical Learning: with Applications in R. Springer, New York.
- 9. Chen L.-P. (2019). Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Statistical Papers, 60, 1793–1795.
- 10. Heenaye-Mamode Khan M., Boodoo-Jahangeer N., Dullull W., Nathire S., Gao X., Sinha G. R., et al. (2021). Multi-class classification of breast cancer abnormalities using Deep Convolutional Neural Network (CNN). PLoS ONE, 16(8), e0256500. pmid:34437623
- 11. Pandey A. and Roy S. S. (2022). Protein sequence classification using convolutional neural network and natural language processing. Handbook of Machine Learning Applications for Genomics, edited by S. S. Roy and Y. H. Taguchi, 133–144.
- 12. Roy S. S., Samui P., Deo R., and Ntalampiras S. (Eds.). (2018). Big Data in Engineering Applications (Vol. 44). Springer, Berlin/Heidelberg, Germany.
- 13. Roy S. S. and Taguchi Y. H. (2022). Handbook of Machine Learning Applications for Genomics. Springer Nature, Singapore.
- 14. Samui P., Roy S. S., and Balas V. E. (Eds.). (2017). Handbook of Neural Computation. Academic Press.
- 15. Zhu S. X. Y. and Pan W. (2009). Network-based support vector machine for classification of microarray samples. BMC Bioinformatics, 10, 1–11. pmid:19208121
- 16. Zi X., Liu Y., Gao P. (2016). Mutual information network-based support vector machine for identification of rheumatoid arthritis-related genes. International Journal of Clinical and Experimental Medicine, 9, 11764–11771.
- 17. Cai W., Guan G., Pan R., Zhu X., and Wang H. (2018). Network linear discriminant analysis. Computational Statistics and Data Analysis, 117, 32–44.
- 18. Huttenhower C., Flamholz A.I., Landis J.N. et al. (2007). Nearest neighbor networks: clustering expression data based on gene neighborhoods. BMC Bioinformatics, 8, 250, 1–13. pmid:17626636
- 19. He W., Yi G. Y., and Chen L.-P. (2019). Support vector machine with graphical network structures in features. Proceedings, Machine Learning and Data Mining in Pattern Recognition, 15th International Conference on Machine Learning and Data Mining, MLDM 2019, vol. II, New York, NY, USA, ibai-publishing, 557–570.
- 20. Chen L.-P., Yi G. Y., Zhang Q., and He W. (2019). Multiclass analysis and prediction with network structured covariates. Journal of Statistical Distributions and Applications, 6:6.
- 21. Chen L.-P. (2022a). Network-based discriminant analysis for multiclassification. Journal of Classification. To appear.
- 22. Chen L.-P. (2022b). Nonparametric discriminant analysis with network structures in predictor. Journal of Statistical Computation and Simulation. To appear.
- 23. Baladandayuthapani V., Talluri R., Ji Y., Coombes K. R., Lu Y., Hennessy B. T., et al. (2014). Bayesian sparse graphical models for classification with application to protein expression data. The Annals of Applied Statistics, 8, 1443–1468.
- 24. Peterson C. B., Stingo F. C., and Vannucci M. (2015). Joint Bayesian variable and graph selection for regression models with network-structured predictors. Statistics in Medicine, 35, 1017–1031. pmid:26514925
- 25. Roy S. S. and Taguchi Y. H. (2021). Identification of genes associated with altered gene expression and m6A profiles during hypoxia using tensor decomposition based unsupervised feature extraction. Scientific Reports, 11(1), 1–18. pmid:33903618
- 26. Tschodu D., Ulm B., Bendrat K., Lippoldt J., Gottheil P., Käs J. A., et al. (2022). Comparative analysis of molecular signatures reveals a hybrid approach in breast cancer: combining the Nottingham Prognostic Index with gene expressions into a hybrid signature. PloS ONE, 17(2), e0261035. pmid:35143511
- 27. Zhang X., Wu Y., Wang L., and Li R. (2016). Variable selection for support vector machines in moderately high dimensions. Journal of the Royal Statistical Society, Series B, 78, 53–76. pmid:26778916
- 28. Maugis C., Celeux G., and Martin-Magniette M.-L. (2011). Variable selection in model-based discriminant analysis. Journal of Multivariate Analysis, 102, 1374–1387.
- 29. Wang C., Cao L., and Miao B. (2013). Optimal feature selection for sparse linear discriminant analysis and its applications in gene expression data. Computational Statistics and Data Analysis, 66, 140–149.
- 30. Ramaswamy S., Tamayo P., Rifkin R. et al. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences of the United States, 98, 15149–15154. pmid:11742071
- 31. Lukashin A. V., Lukashev M. E., and Fuchs R. (2003). Topology of gene expression networks as revealed by data mining and modeling. Bioinformatics, 19, 1909–1916. pmid:14555623
- 32. Chen L.-P. (2018). Multiclassification to gene expression data with some complex features. Biostatistics and Biometrics Open Access Journal, 9, 555751.
- 33. Fan J. and Lv J. (2008). Sure independence screening for ultra high dimensional feature space. Journal of the Royal Statistical Society, Series B, 70, 849–911.
- 34. Chatterjee S. (2021). A new coefficient of correlation. Journal of the American Statistical Association, 116, 2009–2022.
- 35. Chen L.-P. (2020). A note of feature screening via rank-based coefficient of correlation. arXiv:2008.04456.
- 36. Chen L.-P. (2021). Feature screening based on distance correlation for ultrahigh-dimensional censored data with covariates measurement error. Computational Statistics, 36, 857–884.
- 37. Yang E., Ravikumar P., Allen G. I., and Liu Z. (2015). Graphical models via univariate exponential family distribution. Journal of Machine Learning Research, 16, 3813–3847. pmid:27570498
- 38. Meinshausen N. and Bühlmann P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34, 1436–1462.
- 39. Schwarz G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
- 40. Ravikumar P., Wainwright M. J., and Lafferty J. (2010). High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38, 1287–1319.
- 41. Agresti A. (2007). An Introduction to Categorical Data Analysis. Wiley, New York.
- 42. Agresti A. (2012). Categorical Data Analysis. Wiley, New York.
- 43. Hubert L. and Arabie P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.
- 44. Meyer D., Dimitriadou E., Hornik K., Weingessel A., et al. (2022). e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-11. https://CRAN.R-project.org/package=e1071
- 45. Torgo L. (2022). DMwR: Functions and data for "Data Mining with R". R package version 0.4.1. https://CRAN.R-project.org/package=DMwR
- 46. Ripley B., Venables B., Bates D. M., Hornik K., et al. (2022). MASS: Support functions and datasets for Venables and Ripley's MASS. R package version 7.3-57. https://CRAN.R-project.org/package=MASS
- 47. Fritsch S., Guenther F., Wright M. N., Suling M., and Mueller S. M. (2019). neuralnet: Training of neural networks. R package version 1.44.2. https://CRAN.R-project.org/package=neuralnet
- 48. Chen T., He T., Benesty M., Khotilovich V., et al. (2022). xgboost: Extreme gradient boosting. R package version 1.6.0.1. https://CRAN.R-project.org/package=xgboost
- 49. Breiman L., Cutler A., Liaw A., and Wiener M. (2022). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.7-1. https://CRAN.R-project.org/package=randomForest
- 50. Peters A., Hothorn T., Ripley B. D., Therneau T., and Atkinson B. (2022). ipred: Improved predictors. R package version 0.9-13. https://CRAN.R-project.org/package=ipred
- 51. Quast B. and Fichou D. (2022). rnn: Recurrent Neural Network. R package version 1.5.0. https://CRAN.R-project.org/package=rnn
- 52. Chen L.-P. and Yi G. Y. (2021). Analysis of Noisy Survival Data with Graphical Proportional Hazards Measurement Error Models. Biometrics, 77, 956–969. pmid:32687216