Abstract
Analysis of gene expression data is an attractive topic in bioinformatics, and a typical application is to classify and predict individuals' diseases or tumors by treating gene expression values as predictors. A primary challenge of this study comes from ultrahigh dimensionality, which implies that (i) many predictors in the dataset might be non-informative, and (ii) pairwise dependence structures may exist among the high-dimensional predictors, yielding a network structure. While many supervised learning methods have been developed, their prediction performance is expected to suffer if the impacts of ultrahigh dimensionality are not carefully addressed. In this paper, we propose a new statistical learning algorithm for multi-class classification with ultrahigh-dimensional gene expressions. In the proposed algorithm, we employ a model-free feature screening method to retain informative gene expression values from the ultrahigh-dimensional data, and then construct predictive models that accommodate the network structures of the selected gene expressions. Different from existing supervised learning methods that build predictive models on the entire dataset, our approach identifies informative predictors and dependence structures among gene expressions. Through analysis of a real dataset, we find that the proposed algorithm gives precise classification as well as accurate prediction, and outperforms some commonly used supervised learning methods.
Citation: Chen L-P (2022) Classification and prediction for multi-cancer data with ultrahigh-dimensional gene expressions. PLoS ONE 17(9): e0274440. https://doi.org/10.1371/journal.pone.0274440
Editor: Enrique Hernandez-Lemus, Instituto Nacional de Medicina Genomica, MEXICO
Received: September 27, 2021; Accepted: August 28, 2022; Published: September 15, 2022
Copyright: © 2022 Li-Pang Chen. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting information files.
Funding: This research is supported by the Ministry of Science and Technology with grant ID 110-2118-M-004-006-MY2. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Analysis of gene expression data is an important topic in bioinformatics, and a large body of research and relevant developments has emerged in recent years. One important branch of gene expression data analysis takes gene expression values as predictors to classify tumors into possible cancers and predict their labels. A motivating example in this paper is the GCM dataset, which contains 16,063 gene expression values and 14 human cancers among 198 tumor samples. The goal of this study is to take gene expression values as predictors and use them to classify tumor samples into their corresponding cancers. A key feature of this dataset is its ultrahigh-dimensional predictors, in the sense that the dimension of the predictors (the number of gene expression values) is much greater than the sample size (the number of tumor samples). This feature further induces challenges, including (a) pairwise interactions among gene expressions and (b) the existence of non-informative gene expressions, that affect the performance of classification and the accuracy of prediction.
To address classification and prediction in biomedical research, many supervised learning methods have been developed and widely applied in machine learning frameworks. Ignoring the pairwise interactions and non-informative predictors induced by ultrahigh dimensionality, [1] proposed the integration of several heterogeneous cancer series and performed multi-class classification. [2] studied the multicategory support vector machine (SVM) for the classification of multiple cancer types. [3] presented a comprehensive discussion of SVM methods. [4] applied SVM ensembles to breast cancer prediction. [5] discussed linear discriminant analysis (LDA) and its application to microarrays. [6] discussed multi-class analysis by generalized sparse linear discriminant analysis. Detailed and fundamental discussions of those methods can be found in [7, 8], and they were reviewed by [9] as well. In recent years, deep learning approaches, such as convolutional neural networks (e.g., [10]) and natural language processing (e.g., [11]), have been developed to deal with multi-class classification. More applications can be found in monographs such as [12–14].
To characterize pairwise interactions among gene expressions, which usually refers to the network dependence among gene expressions, we employ graphical models, which are powerful tools for describing the dependence structure of variables. A general introduction to graphical models can be found in [7] (Chapter 17). In the past literature, graphical models have been used to deal with classification problems. For example, [15] proposed a network-based support vector machine for binary classification of microarray samples. [16] discussed the identification of rheumatoid arthritis-related genes using a network-based support vector machine. [17] proposed network linear discriminant analysis. [18] proposed the nearest neighbor network. Most existing methods focus on binary responses and require the predictors to follow a normal distribution because they rely on the precision matrix. Furthermore, it is intuitive that the network structures of variables in different classes may not be exactly equal to each other. To address this issue, [19, 20] explored SVM and logistic regression, respectively, with heterogeneous network structures accommodated. More recently, [21, 22] developed multiclass discriminant analysis with network structures accommodated. From the Bayesian perspective, several methods have also been investigated with the network structure incorporated, including [23, 24].
To address non-informative gene expression values in ultrahigh-dimensional data, variable selection and dimension reduction are perhaps the most commonly used strategies in the past literature. For example, [25] applied unsupervised feature extraction, such as principal component analysis, tensor decomposition, and kernel tensor decomposition, to select potentially important genes. [26] adopted the sure independence screening (SIS) method to do feature screening for gene expressions and combined the Nottingham Prognostic Index with gene expressions into a hybrid signature. In combination with supervised learning, [27] proposed a penalized method for SVM, and [28, 29] explored variable selection based on LDA. Those methods mainly handle the setting where the dimension is smaller than the sample size; however, it is unknown whether they are able to deal with the case where the dimension of the predictors is much higher than the sample size.
From the two challenges and the developments described above, we note that most existing methods deal with either network structure or variable selection, but not both. This motivates us to propose a strategy that simultaneously retains important predictors and constructs the network structure of the predictors when doing classification. Our strategy is outlined in Fig 1. Roughly speaking,
- (i). to deal with ultrahigh-dimensional predictors where the dimension of predictors is extremely greater than the sample size, we adopt feature screening techniques to retain predictors that are informative to the response;
- (ii). to detect network structures of predictors, we employ exponential family graphical models to detect network structure of the selected predictors under the whole dataset or different classes;
- (iii). to use the results in (i) and (ii) to develop network-based classification models that examine class separation and make predictions for tumor samples.
There are several contributions in the proposed method. First, unlike existing methods that may specify a model when doing feature screening, our feature screening procedure is model-free and does not require specifying a model formulation. Second, although there exist methods handling network structures in classification, they assume a common network structure for the predictors of all subjects without taking into account possible heterogeneity across classes. Instead, the proposed method is able to construct predictive models with possibly class-dependent network structures of predictors taken into account. Finally, the proposed method is able to handle multi-class labels with network structures in the predictors accommodated, which differs from existing methods that either handle multi-class classification without using the information of network structures, or accommodate network structures only for binary classification.
The remainder of this paper is organized as follows. In Section 2, we introduce the motivating real dataset and its data structure, and define the relevant mathematical notation. In Section 3, we give a detailed presentation of each step in Fig 1. In Section 4, we implement the proposed method to analyze a real dataset and compare it with its competitors. A general discussion is presented in Section 5.
2 Data structure with multi-class responses
In this section, we first introduce a motivated dataset outlined in Section 1. After that, we define mathematical notation to describe the data structure with multi-class responses.
2.1 Description of motivated dataset
The data presented in the following are the GCM dataset collected by [30]. This dataset contains 16,063 gene expression values and 198 tumor samples, including 144 training samples (denoted $\mathcal{D}_{\text{train}}$) and 54 testing samples (denoted $\mathcal{D}_{\text{test}}$). In addition, 14 common human cancers, including Breast (BR), Prostate (PR), Lung (LU), Colorectal (CO), Lymphoma (LY), Bladder (BL), Melanoma (ML), Uterus (UT), Leukemia (LE), Renal (RE), Pancreas (PA), Ovary (OV), Mesothelioma (ME), and CNS cancers, are included in the dataset. The sample sizes for each cancer are summarized in Table 1. Our main goal is to classify tumor samples into different categories of cancer according to the gene expression values of the samples, which are treated as predictors.
The first row ($\mathcal{D}_{\text{train}}$) contains the sample sizes of the training data for each cancer label; the second row ($\mathcal{D}_{\text{test}}$) contains the sample sizes of the testing data for each cancer label; the last row ("Total") contains the sample sizes of the whole data for each cancer label.
Even though this dataset requires no pre-processing because the observations are complete without missing values, and some of its features have been well analyzed by [30], the dataset can still be further investigated in two respects. First, we note the issue of high dimensionality of the data, which usually implies the existence of irrelevant variables; that is, not every gene expression is associated with the response. Therefore, to ensure the accuracy of prediction, it is necessary to exclude irrelevant variables, and it is crucial to select gene expressions that are informative with respect to the response. Second, as discussed in [31, 32], complex dependence structures may exist among high-dimensional gene expressions. Therefore, to increase the accuracy of prediction, it is necessary to incorporate the network structure of gene expressions into the classification procedure.
2.2 Notation
In this subsection, we define mathematical notation to describe the data in order to develop the method.
Suppose the data of n subjects come from I classes, where I is a fixed integer greater than 2 and the classes are nominal. Let ni be the class size in class i with i = 1, ⋯, I, and hence $n_1 + \cdots + n_I = n$. Let Y denote the n-dimensional vector of responses with the jth component being Yj = i, which reflects the class membership that the jth subject is in the ith class for i = 1, ⋯, I and j = 1, ⋯, n.
Let p > 1 denote the dimension of predictors for each subject. Define X = [Xj,l] as the n × p matrix of predictors for j = 1, ⋯, n and l = 1, ⋯, p, where the component Xj,l represents the lth predictor for the jth subject. Furthermore, let Xj• = (Xj,1, ⋯, Xj,p)⊤ denote the p-dimensional predictor vector for the jth subject in the jth row of X, and let X•k = (X1,k, ⋯, Xn,k)⊤ represent the n-dimensional vector of the kth predictor in the kth column of X. In this paper, we consider a setting in which the dimension of the predictors p is ultrahigh relative to the sample size n, i.e., p = exp{O(n^r)} for some constant r > 0 (e.g., [33]).
Without loss of generality, the {Xj•, Yj} are treated as independent and identically distributed (i.i.d.) for j = 1, ⋯, n. We let lower case letters represent realized values for the corresponding random variables.
The objective of the study is to build models to predict the class label for a new subject with predictor observation $x_{\text{new}}$.
3 Proposed method
In this section, we present the detailed estimation procedure for each step shown in Fig 1.
3.1 Feature screening via rank-based correlation coefficient
Let $\mathcal{A} = \{k : X_{\bullet k} \text{ is associated with } Y,\ k = 1, \cdots, p\}$ denote the true active set which contains all relevant predictors for the response Y, with $q \triangleq |\mathcal{A}|$ and q < n, and let $\mathcal{A}^c$ be the complement of $\mathcal{A}$, which contains all irrelevant predictors for the response Y. Basically, the goal of Step 1 in Fig 1 is to estimate the active set $\mathcal{A}$. When $\mathcal{A}$ is determined, the associated vector of predictors $X_{j,\mathcal{A}} = (X_{j,k} : k \in \mathcal{A})^\top$ contains the important information with respect to the response, and its dimension is smaller than the sample size n. Thus, $X_{j,\mathcal{A}}$ can be adopted in the subsequent analysis.
The remaining concern is how to obtain the estimated active set. Following the spirit of [33], we employ the technique of feature screening, whose idea is to take the correlation between the response and each predictor as a signal and retain the important predictors with large signal values. We propose to take the rank-based correlation coefficient as the signal. Specifically, for the kth predictor X•k, the rank-based correlation coefficient between X•k and Y is given by (e.g., [34, 35])
$$\omega_k = \frac{\int \mathrm{Var}\left\{ E\left( \mathbb{1}\{Y \ge t\} \,\middle|\, X_{\bullet k} \right) \right\} d\mu(t)}{\int \mathrm{Var}\left( \mathbb{1}\{Y \ge t\} \right) d\mu(t)}, \qquad (1)$$
where $\mathbb{1}\{\cdot\}$ denotes the indicator function and μ(⋅) is the law of Y. It can be shown that ωk lies in the interval [0, 1], and a higher value of ωk indicates a stronger correlation between Y and X•k. Therefore, (1) can be regarded as an analogue of classical coefficients such as Pearson's correlation.
To implement this idea, we estimate (1) using the sample data. For j = 1, ⋯, n, let Y(j) denote the response rearranged according to the ordering of the kth predictor X•k; that is, we sort the pairs as (X(1),k, Y(1)), ⋯, (X(n),k, Y(n)) with X(1),k ≤ X(2),k ≤ ⋯ ≤ X(n),k and X(j),k being the jth sorted value of X•k. The corresponding estimator of ωk is given by [34]:
$$\widehat{\omega}_k = 1 - \frac{n \sum_{j=1}^{n-1} \left| r_{j+1} - r_j \right|}{2 \sum_{j=1}^{n} l_j \left( n - l_j \right)}, \qquad (2)$$
where, for j = 1, ⋯, n, $r_j = \#\{m : Y_{(m)} \le Y_{(j)}\}$, $l_j = \#\{m : Y_{(m)} \ge Y_{(j)}\}$, and $\#\mathcal{S}$ represents the number of elements in a set $\mathcal{S}$. In applications, one can use the R package XICOR to compute (2).
Therefore, the estimated active set based on (2) is given by
$$\widehat{\mathcal{A}} = \left\{ k : \widehat{\omega}_k \ge c\, n^{-\kappa},\ k = 1, \cdots, p \right\}, \qquad (3)$$
where c and κ ∈ (0, 1/2) are prespecified threshold values. In applications, one can specify c and κ such that the variables with the $[n/\log n]$ largest values of $\widehat{\omega}_k$ are retained, where [⋅] represents the ceiling function (e.g., [33, 35, 36]).
Different from conventional feature screening methods (e.g., [33]), the main advantage of (3) is that the screening is model-free because it does not impose a model formulation; thus, (3) is able to detect predictors that have a nonlinear relationship with the response Y. Theoretically, by derivations similar to those of [35], the sure screening property of (3) can be justified. That is, $P(\mathcal{A} \subseteq \widehat{\mathcal{A}}) \to 1$ as n → ∞, which ensures that the estimated active set contains the truly informative predictors that are associated with the response with probability approaching one. Moreover, while there are several methods to deal with feature screening, as examined by [35], (2) generally outperforms other existing approaches and is able to handle oscillatory trajectories between the response and the predictors.
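To make Step 1 concrete, the following is a minimal R sketch of the screening procedure. It assumes hypothetical objects `x_train` (an n × p training matrix of gene expressions) and `y_train` (a length-n vector of class labels), uses the XICOR package mentioned above to compute (2), and applies the $[n/\log n]$ cut-off described for (3).

```r
# Minimal sketch of the feature screening step (Section 3.1), assuming
# x_train (n x p matrix of gene expressions) and y_train (length-n class labels)
# are already loaded; both names are illustrative placeholders.
library(XICOR)

n <- nrow(x_train)
p <- ncol(x_train)

# Signal (2): rank-based correlation between each predictor and the response.
omega_hat <- sapply(seq_len(p), function(k) xicor(x_train[, k], y_train))

# Retain the predictors with the [n / log n] largest signals, as in (3).
d <- ceiling(n / log(n))
A_hat <- order(omega_hat, decreasing = TRUE)[seq_len(d)]

# Reduced predictor matrix used in the subsequent steps.
x_selected <- x_train[, A_hat, drop = FALSE]
```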
When the active set is determined, we let $X_{j,\widehat{\mathcal{A}}} = (X_{j,k} : k \in \widehat{\mathcal{A}})^\top$ denote the vector containing all the active predictors for the jth subject, and denote $x_{j,\widehat{\mathcal{A}}}$ as the realized value of $X_{j,\widehat{\mathcal{A}}}$.
3.2 The expressions of graphical structure
Since the estimated active set $\widehat{\mathcal{A}}$ has been identified, we now explore the network structure of the selected gene expressions in $X_{j,\widehat{\mathcal{A}}}$ for Step 2 in Fig 1. Graphical models are a commonly used strategy to achieve this goal.
The graph is expressed as G = (V, E), where V is the set of the vertices and E ⊂ V × V is the set of the edges. In our case, V is taken to be the set of selected predictors with $V = \widehat{\mathcal{A}}$, and E is regarded as the pairwise dependence between any two selected predictors. In graphical model frameworks, we start by formulating the distribution function of the selected predictors. In this article, we consider exponential family graphical models because they generalize the commonly used models. The formulation is given by
$$f\left(x_{j,\widehat{\mathcal{A}}};\, \beta, \Theta\right) = \exp\left\{ \sum_{r \in V} \beta_r B(x_{j,r}) + \sum_{s \ne t} \theta_{st}\, B(x_{j,s})\, B(x_{j,t}) + \sum_{r \in V} C(x_{j,r}) - A(\beta, \Theta) \right\}, \qquad (4)$$
where $\beta = (\beta_r : r \in V)^\top$ is the $\widehat{q}$-dimensional parameter vector with $\widehat{q} \triangleq |\widehat{\mathcal{A}}|$, Θ = [θst] is a $\widehat{q} \times \widehat{q}$ symmetric matrix, B(⋅) and C(⋅) are given functions that reflect the distribution of $x_{j,\widehat{\mathcal{A}}}$ (e.g., [20, 37]), and the function A(β, Θ) is the normalizing constant which ensures that (4) integrates to 1.
Without loss of generality, we take B(Xj,r) to be the linear function B(Xj,r) = Xj,r for r ∈ V. In addition, in graphical model theory, the main interest is the estimation of θst because of its interpretation that Xj,s and Xj,t are conditionally dependent if θst ≠ 0. Therefore, to focus on presenting the estimation of θst, we drop the main effect term and consider the following graphical model
$$f\left(x_{j,\widehat{\mathcal{A}}};\, \Theta\right) = \exp\left\{ \sum_{s \ne t} \theta_{st}\, x_{j,s}\, x_{j,t} + \sum_{r \in V} C(x_{j,r}) - A(\Theta) \right\}, \qquad (5)$$
where the function A(Θ) is the normalizing constant which makes (5) integrate to 1.
To estimate Θ, one of the well-known methods is conditional inference [38]. Without loss of generality, we consider the vertex s and define the neighbourhood set
$$\mathcal{N}(s) = \left\{ t \in V \setminus \{s\} : \theta_{st} \ne 0 \right\}, \qquad (6)$$
which collects the vertices that are dependent on the vertex s. To estimate the neighbourhood set of s, it suffices to study the inference of Xj,s | Xj,V∖{s}, where $X_{j, V\setminus\{s\}} = (X_{j,t} : t \in V \setminus \{s\})^\top$. Let $\theta_s = (\theta_{st} : t \in V \setminus \{s\})^\top$ denote the $(\widehat{q}-1)$-dimensional vector of parameters that is associated with Xj,V∖{s}. By some algebra, we have
$$f\left(x_{j,s} \,\middle|\, x_{j, V\setminus\{s\}};\, \theta_s\right) = \exp\left\{ x_{j,s} \sum_{t \in V\setminus\{s\}} \theta_{st}\, x_{j,t} + C(x_{j,s}) - D\left( \sum_{t \in V\setminus\{s\}} \theta_{st}\, x_{j,t} \right) \right\}, \qquad (7)$$
where D(⋅) is a normalization constant ensuring that the integral of (7) is equal to 1. Then the estimator of θs, denoted $\widehat{\theta}_s$, is given by
$$\widehat{\theta}_s = \underset{\theta_s}{\operatorname{argmin}} \left\{ \frac{1}{n} \sum_{j=1}^{n} \ell\left(\theta_s;\, x_{j,\widehat{\mathcal{A}}}\right) + \lambda \left\| \theta_s \right\|_1 \right\}, \qquad (8)$$
where $\ell(\theta_s; x_{j,\widehat{\mathcal{A}}}) = -\log f(x_{j,s} \,|\, x_{j,V\setminus\{s\}}; \theta_s)$ is the negative log-likelihood contribution based on (7), ‖⋅‖1 is the L1-norm, and λ is the tuning parameter.
In penalization problems for selecting variables, choosing the tuning parameter is also a crucial issue. In this paper, we employ the BIC approach (e.g., [39]) to select the tuning parameter λ. To emphasize the dependence on the tuning parameter, we let $\widehat{\theta}_s(\lambda)$ denote the estimator obtained from (8) under a given λ. Define
$$\mathrm{BIC}(\lambda) = 2 \sum_{j=1}^{n} \ell\left(\widehat{\theta}_s(\lambda);\, x_{j,\widehat{\mathcal{A}}}\right) + \log(n) \times df_\lambda, \qquad (9)$$
where $df_\lambda$ represents the number of non-zero elements in $\widehat{\theta}_s(\lambda)$ for a given λ. The optimal tuning parameter, denoted $\widehat{\lambda}$, is determined by minimizing (9) within a suitable range of λ. As a result, the estimator of θs is determined by $\widehat{\theta}_s = \widehat{\theta}_s(\widehat{\lambda})$.
Finally, the estimated neighbourhood set is given by
$$\widehat{\mathcal{N}}(s) = \left\{ t \in V \setminus \{s\} : \widehat{\theta}_{st} \ne 0 \right\}. \qquad (10)$$
Note that θst is equal to θts since Θ is a symmetric matrix. However, the estimators $\widehat{\theta}_{st}$ and $\widehat{\theta}_{ts}$ are not necessarily equal. To overcome this problem, we apply the AND rule [38]: the final estimators of $\widehat{\theta}_{st}$ and $\widehat{\theta}_{ts}$ are taken as their maximum if both $\widehat{\theta}_{st}$ and $\widehat{\theta}_{ts}$ are nonzero, and $\widehat{\theta}_{st}$ and $\widehat{\theta}_{ts}$ are set to zero if either of them is zero. Moreover, the estimated set of edges is given by
$$\widehat{E} = \left\{ (s,t) : t \in \widehat{\mathcal{N}}(s) \text{ and } s \in \widehat{\mathcal{N}}(t) \right\}. \qquad (11)$$
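As an illustration of the neighbourhood-selection step, the sketch below considers the Gaussian special case of (7), in which each conditional model reduces to a lasso linear regression of one selected gene expression on the others, and uses the glmnet package with a BIC-type criterion in the spirit of (9) and the AND rule leading to (11). The object `x_selected` is the hypothetical screened predictor matrix from the previous sketch, assumed to be standardized.

```r
# Sketch of neighbourhood selection (Section 3.2) under the Gaussian special case
# of (7), i.e., a lasso regression of each node on the remaining nodes; x_selected
# is an illustrative n x q_hat matrix of screened, standardized gene expressions.
library(glmnet)

q_hat <- ncol(x_selected)
n <- nrow(x_selected)
nbhd <- matrix(FALSE, q_hat, q_hat)

for (s in seq_len(q_hat)) {
  fit <- glmnet(x_selected[, -s, drop = FALSE], x_selected[, s], family = "gaussian")
  # BIC-type criterion in the spirit of (9): deviance plus log(n) times
  # the number of nonzero coefficients along the lambda path.
  bic <- deviance(fit) + log(n) * fit$df
  theta_s <- as.numeric(coef(fit, s = fit$lambda[which.min(bic)]))[-1]  # drop intercept
  nbhd[s, -s] <- theta_s != 0
}

# AND rule: keep an edge only if both directed estimates are nonzero, as in (11).
E_hat <- nbhd & t(nbhd)
edge_list <- which(E_hat & upper.tri(E_hat), arr.ind = TRUE)  # estimated edges
```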
After deriving the estimated set of edges, a crucial question is the relationship between $\widehat{E}$ and E. To answer this question, we present the following theorem, which gives an important result for the estimated graph.
Theorem 3.1 (Network Recovery)
Suppose E is the true set of edges, and let $\widehat{E}$ be the estimated set of edges given by (11). Under some regularity conditions in [38], we have that, as n → ∞,
$$P\left( \widehat{E} = E \right) \to 1. \qquad (12)$$
This result and the regularity conditions are similar to Section 2.2 in [40] and Theorem 5 (b) in [37]. Theorem 3.1 tells us that, under mild conditions, the estimated network structure recovers the true network structure.
3.3 Multinomial logistic regression with homogeneous network structure in predictors
After obtaining the estimated network structure based on the informative predictors, we wish to use it to examine the classification of different cancers, as demonstrated in Step 3 of Fig 1. Therefore, to incorporate the network structures of the predictors into a prediction model, we present two methods, which can be readily implemented in R by fitting logistic regression models (e.g., via the glm function).
In the first method, called the multinomial logistic regression with homogeneous network structure in predictors (MLR-HomoNet), we consider the case where the subjects in different classes share a common network structure in the predictors. To build a prediction model, we make use of the development of the logistic model with multiclass responses ([41], Section 6.1; [42], Section 7.1).
We first identify the pairwise dependence of the predictors using the measurements of all the subjects without distinguishing their class labels. Let $\widehat{\theta}_{st}$ be the estimate of θst obtained from (8) by using all the predictor measurements $\{x_{j,\widehat{\mathcal{A}}} : j = 1, \cdots, n\}$, and let $\widehat{E}$ denote the resulting estimated set of edges.
Next, for i = 1, ⋯, I and j = 1, ⋯, n, we let $\pi_i(x_{j,\widehat{\mathcal{A}}}) = P(Y_j = i \,|\, x_{j,\widehat{\mathcal{A}}})$ be the conditional probability of Yj = i given $x_{j,\widehat{\mathcal{A}}}$. Consider the parametric multinomial logistic model
$$\log\left\{ \frac{\pi_i\left(x_{j,\widehat{\mathcal{A}}}\right)}{\pi_I\left(x_{j,\widehat{\mathcal{A}}}\right)} \right\} = \alpha_{i0} + \sum_{k \in \widehat{\mathcal{A}}} \alpha_{ik}\, x_{j,k} + \sum_{(s,t) \in \widehat{E}} \alpha_{ist}\, x_{j,s}\, x_{j,t} \qquad (13)$$
for i = 1, 2, ⋯, I − 1, where $\alpha$ is the vector of parameters with the vectors $(\alpha_{ik} : k \in \widehat{\mathcal{A}})^\top$ and $(\alpha_{ist} : (s,t) \in \widehat{E})^\top$ reflecting the main-effect and network-based parameters for class i, and the constraint $\sum_{i=1}^{I} \pi_i(x_{j,\widehat{\mathcal{A}}}) = 1$ is imposed for every j = 1, ⋯, n.
For each subject j = 1, ⋯, n, we let $Z_{ji} = 1$ if subject j is in class i and $Z_{ji} = 0$ otherwise, and hence $\sum_{i=1}^{I} Z_{ji} = 1$ for every j. Let $z_{ji}$ denote a realized value of $Z_{ji}$. For i = 1, ⋯, I and j = 1, ⋯, n, the log-likelihood function is given by ([42], p.273)
$$\ell(\alpha) = \sum_{j=1}^{n} \sum_{i=1}^{I} z_{ji} \log \pi_i\left(x_{j,\widehat{\mathcal{A}}}\right). \qquad (14)$$
The estimator of α, denoted $\widehat{\alpha}$, can be derived by maximizing (14). In applications, since $\widehat{\alpha}$ has no closed form, we usually apply the Newton–Raphson algorithm to (14) to obtain the resulting estimator. Therefore, for a realization $x_{\widehat{\mathcal{A}}}$ of the selected predictor vector $X_{\widehat{\mathcal{A}}}$, writing $\widehat{\eta}_i(x_{\widehat{\mathcal{A}}})$ for the right-hand side of (13) evaluated at $\widehat{\alpha}$, $\pi_i(x_{\widehat{\mathcal{A}}})$ is estimated as
$$\widehat{\pi}_i\left(x_{\widehat{\mathcal{A}}}\right) = \frac{\exp\left\{ \widehat{\eta}_i\left(x_{\widehat{\mathcal{A}}}\right) \right\}}{1 + \sum_{l=1}^{I-1} \exp\left\{ \widehat{\eta}_l\left(x_{\widehat{\mathcal{A}}}\right) \right\}} \qquad (15)$$
for i = 1, ⋯, I − 1, and $\pi_I(x_{\widehat{\mathcal{A}}})$ is estimated as
$$\widehat{\pi}_I\left(x_{\widehat{\mathcal{A}}}\right) = \frac{1}{1 + \sum_{l=1}^{I-1} \exp\left\{ \widehat{\eta}_l\left(x_{\widehat{\mathcal{A}}}\right) \right\}}. \qquad (16)$$
Finally, to predict the class label for a new subject with the selected $\widehat{q}$-dimensional predictor instance $x_{\text{new},\widehat{\mathcal{A}}}$, we first calculate the right-hand sides of (15) and (16), and let $\widehat{\pi}_1(x_{\text{new},\widehat{\mathcal{A}}}), \cdots, \widehat{\pi}_I(x_{\text{new},\widehat{\mathcal{A}}})$ denote the corresponding values. Let i* denote the index which corresponds to the largest value, i.e., $i^* = \underset{i \in \{1,\cdots,I\}}{\operatorname{argmax}}\ \widehat{\pi}_i(x_{\text{new},\widehat{\mathcal{A}}})$. Then the class label for this new subject is predicted as i*.
Finally, we summarize the key steps of Sections 3.1–3.3 in Algorithm 1.
Algorithm 1: MLR-HomoNet
Under the training data $\mathcal{D}_{\text{train}}$:
Step 1: Determine informative predictors
Apply (2) to do feature screening and retain $[n/\log n]$ predictors among the p-dimensional predictors. The set of selected predictors $\widehat{\mathcal{A}}$ is given by (3).
Step 2: Determine the network structure of the predictors
Based on the selected predictors in $\widehat{\mathcal{A}}$, use (8) to determine the pairwise dependence structure and obtain (11). The resulting network structure is formed by $\widehat{E}$.
Step 3: Construct the predictive model
Given an initial value α(0), perform the following Newton–Raphson algorithm;
for step t with t = 0, 1, 2, ⋯, T, say T = 1000 do
 Step 3.1: calculate the score function evaluated at the tth iterated value: $S(\alpha^{(t)}) = \dfrac{\partial \ell(\alpha)}{\partial \alpha}\Big|_{\alpha = \alpha^{(t)}}$ with ℓ(α) given by (14);
 Step 3.2: calculate the Hessian matrix evaluated at the tth iterated value: $H(\alpha^{(t)}) = \dfrac{\partial^2 \ell(\alpha)}{\partial \alpha\, \partial \alpha^\top}\Big|_{\alpha = \alpha^{(t)}}$;
 Step 3.3: update α(t+1) ← α(t) − {H(α(t))}−1 S(α(t));
end
Let $\widehat{\alpha}$ denote the resulting estimator, and combine $\widehat{\alpha}$ with (15) and (16) to determine the resulting predictive model $\widehat{\pi}_i(\cdot)$ for i = 1, ⋯, I.
Under the testing data $\mathcal{D}_{\text{test}}$:
Step 4: Prediction
For a new predictor $x_{\text{new},\widehat{\mathcal{A}}}$ in $\mathcal{D}_{\text{test}}$, use $\widehat{\pi}_i(\cdot)$ with i = 1, ⋯, I to compute the corresponding probabilities $\widehat{\pi}_i(x_{\text{new},\widehat{\mathcal{A}}})$. The predicted class i* is then determined by $i^* = \underset{i}{\operatorname{argmax}}\ \widehat{\pi}_i(x_{\text{new},\widehat{\mathcal{A}}})$.
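As an illustration of Steps 3–4 of Algorithm 1, the sketch below fits the multinomial model with the multinom function from the nnet package, which maximizes the same likelihood as (14), instead of hand-coding the Newton–Raphson updates. The objects `x_selected`, `A_hat`, `E_hat`, `y_train`, and `x_test` are hypothetical placeholders carried over from the previous sketches, with `x_test` being the full testing predictor matrix.

```r
# Sketch of Steps 3-4 of Algorithm 1 (MLR-HomoNet), assuming the hypothetical
# objects x_selected, A_hat, E_hat, y_train, and x_test from the earlier sketches.
library(nnet)

edge_list <- which(E_hat & upper.tri(E_hat), arr.ind = TRUE)

# Build main effects plus one interaction column per estimated edge (s, t), as in (13).
build_features <- function(x) {
  feats <- data.frame(x)
  for (e in seq_len(nrow(edge_list))) {
    feats[[paste0("edge_", e)]] <- x[, edge_list[e, 1]] * x[, edge_list[e, 2]]
  }
  feats
}

train_df <- build_features(x_selected)
train_df$y <- factor(y_train)

# multinom maximizes the multinomial log-likelihood (14).
fit <- multinom(y ~ ., data = train_df, maxit = 1000, MaxNWts = 10000, trace = FALSE)

# Step 4: predict each testing sample by the largest fitted probability.
x_test_sel <- x_test[, A_hat, drop = FALSE]   # same screened columns as x_selected
test_df <- build_features(x_test_sel)
prob <- predict(fit, newdata = test_df, type = "probs")
y_pred <- colnames(prob)[apply(prob, 1, which.max)]
```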
3.4 Logistic regression with heterogeneous network structures in predictors
We now present an alternative to the method described in Section 3.3. Instead of pooling all the subjects' predictor measurements to characterize the predictor network structure, this method, called logistic regression with heterogeneous network structures in predictors (LR-HeteNet), stratifies the predictor information by class when characterizing the predictor network structures. The implementation is summarized in Algorithm 2.
Algorithm 2: LR-HeteNet
Under the training data $\mathcal{D}_{\text{train}}$:
for i = 1, 2, ⋯, I do
 Step 0: Let Yi denote the n-dimensional surrogate response vector formulated by (17).
 Step 1: Class-dependent active set
 Apply (18) to do feature screening and retain $[n_i/\log n_i]$ predictors among the p-dimensional predictors. The set of selected predictors for class i, $\widehat{\mathcal{A}}_i$, is given by (19).
 Step 2: Class-dependent predictor network
 Based on the selected predictors in $\widehat{\mathcal{A}}_i$, use (8) to determine the pairwise dependence structure and obtain (11). Denote $\widehat{E}_i$ as the resulting network structure.
 Step 3: Class-dependent predictive model
 Given an initial value $\gamma_i^{(0)}$, perform the Newton–Raphson algorithm;
 for step t with t = 0, 1, 2, ⋯, T, say T = 1000 do
  Step 3.1: calculate the score function evaluated at the tth iterated value: $S(\gamma_i^{(t)}) = \dfrac{\partial \ell_i(\gamma_i)}{\partial \gamma_i}\Big|_{\gamma_i = \gamma_i^{(t)}}$, where $\ell_i(\gamma_i)$ is (21) with the probabilities in (20) evaluated at $\gamma_i^{(t)}$;
  Step 3.2: calculate the Hessian matrix evaluated at the tth iterated value: $H(\gamma_i^{(t)}) = \dfrac{\partial^2 \ell_i(\gamma_i)}{\partial \gamma_i\, \partial \gamma_i^\top}\Big|_{\gamma_i = \gamma_i^{(t)}}$;
  Step 3.3: update $\gamma_i^{(t+1)} \leftarrow \gamma_i^{(t)} - \{H(\gamma_i^{(t)})\}^{-1} S(\gamma_i^{(t)})$;
 end
 Let $\widehat{\gamma}_i$ denote the resulting estimator, and combine $\widehat{\gamma}_i$ with (22) to determine the resulting predictive model $\widehat{\pi}_i(\cdot)$.
end
Under the testing data $\mathcal{D}_{\text{test}}$:
Step 4: Prediction
For a new predictor $x_{\text{new}}$ in $\mathcal{D}_{\text{test}}$, we use $\widehat{\pi}_i(\cdot)$ with i = 1, ⋯, I to compute the corresponding probabilities $\widehat{\pi}_i(x_{\text{new},\widehat{\mathcal{A}}_i})$. The predicted class i* is then determined by (23).
To be more specific, under the training data $\mathcal{D}_{\text{train}}$, we first introduce a binary surrogate response variable $Y_{ij}$ for every i = 1, ⋯, I and j = 1, ⋯, n. Let
$$Y_{ij} = \begin{cases} 1, & \text{if } Y_j = i, \\ 0, & \text{otherwise}, \end{cases} \qquad (17)$$
and let $Y_i = (Y_{i1}, \cdots, Y_{in})^\top$ be the n-dimensional vector whose elements corresponding to the subjects in class i are equal to one and whose other elements are zero, for i = 1, ⋯, I.
After that, we adopt a strategy similar to Algorithm 1 to construct a predictive model for class i. Specifically, in Step 1 of Algorithm 2, let $\mathcal{A}_i$ denote the true active set of class i, which contains all relevant predictors for the response Yi, with $|\mathcal{A}_i| < n$. Following (2), the signal between X•k and Yi is denoted $\omega_{ik}$, and it can be estimated by
$$\widehat{\omega}_{ik} = 1 - \frac{n \sum_{j=1}^{n-1} \left| r_{i,j+1} - r_{i,j} \right|}{2 \sum_{j=1}^{n} l_{i,j} \left( n - l_{i,j} \right)}, \qquad (18)$$
where, for j = 1, ⋯, n, $r_{i,j} = \#\{m : Y_{i,(m)} \le Y_{i,(j)}\}$ and $l_{i,j} = \#\{m : Y_{i,(m)} \ge Y_{i,(j)}\}$, with $Y_{i,(j)}$ being the surrogate response rearranged according to the ordering of the kth predictor X•k. Therefore, $\mathcal{A}_i$ can be estimated as
$$\widehat{\mathcal{A}}_i = \left\{ k : \widehat{\omega}_{ik} \ge c_i\, n^{-\kappa_i},\ k = 1, \cdots, p \right\}, \qquad (19)$$
where ci and κi ∈ (0, 1/2) are prespecified threshold values. Let $X_{j,\widehat{\mathcal{A}}_i} = (X_{j,k} : k \in \widehat{\mathcal{A}}_i)^\top$ denote the vector of all the active predictors that depend on Yi for the jth subject. Moreover, since Yi is defined as a binary response, derivations similar to those in [35] show that (18) is valid for measuring the dependence between categorical and continuous variables, and the point-biserial correlation coefficient is a special case of (18).
In Step 2 of Algorithm 2, let $V_i = \widehat{\mathcal{A}}_i$ denote the vertex set containing the predictors that are dependent on class i for i = 1, ⋯, I. We apply the procedure described in Section 3.2 to determine the network structure of the predictors in class i. Let $\widehat{E}_i$ denote the estimated set of edges for class i, where the corresponding $\widehat{\theta}_{st}$ is the estimate of θst derived from (8) based on the predictor measurements in class i.
After that, Step 3 of Algorithm 2 aims to fit a logistic regression model using the surrogate response vector Yi with the estimated predictor network structure incorporated, for i = 1, ⋯, I. Specifically, for the jth component of Yi, say $Y_{ij}$, define $\pi_i(x_{j,\widehat{\mathcal{A}}_i}) = P(Y_{ij} = 1 \,|\, x_{j,\widehat{\mathcal{A}}_i})$ and consider the parametric logistic regression model
$$\log\left\{ \frac{\pi_i\left(x_{j,\widehat{\mathcal{A}}_i}\right)}{1 - \pi_i\left(x_{j,\widehat{\mathcal{A}}_i}\right)} \right\} = \gamma_{i0} + \sum_{k \in \widehat{\mathcal{A}}_i} \gamma_{ik}\, x_{j,k} + \sum_{(s,t) \in \widehat{E}_i} \gamma_{ist}\, x_{j,s}\, x_{j,t}, \qquad (20)$$
where j = 1, ⋯, n and $\gamma_i = (\gamma_{i0}, \gamma_{ik}, \gamma_{ist} : k \in \widehat{\mathcal{A}}_i,\ (s,t) \in \widehat{E}_i)^\top$ is the vector of parameters associated with class i. In the spirit of the maximum likelihood estimation (MLE) method (e.g., [42]), the log-likelihood function of (20) is given by
$$\ell_i(\gamma_i) = \sum_{j=1}^{n} \left[ y_{ij} \log \pi_i\left(x_{j,\widehat{\mathcal{A}}_i}\right) + \left(1 - y_{ij}\right) \log\left\{ 1 - \pi_i\left(x_{j,\widehat{\mathcal{A}}_i}\right) \right\} \right], \qquad (21)$$
and the estimator of γi, denoted $\widehat{\gamma}_i$, is obtained by maximizing (21). In applications, we implement the Newton–Raphson algorithm to obtain $\widehat{\gamma}_i$; the detailed procedure is summarized in Algorithm 2. Consequently, for a realization $x_{\widehat{\mathcal{A}}_i}$ of the selected predictor vector $X_{\widehat{\mathcal{A}}_i}$, based on (20), $\pi_i(x_{\widehat{\mathcal{A}}_i})$ can be estimated by
$$\widehat{\pi}_i\left(x_{\widehat{\mathcal{A}}_i}\right) = \frac{\exp\left\{ \widehat{\eta}_i\left(x_{\widehat{\mathcal{A}}_i}\right) \right\}}{1 + \exp\left\{ \widehat{\eta}_i\left(x_{\widehat{\mathcal{A}}_i}\right) \right\}} \qquad (22)$$
for i = 1, ⋯, I, where $\widehat{\eta}_i(x_{\widehat{\mathcal{A}}_i})$ denotes the right-hand side of (20) evaluated at $\widehat{\gamma}_i$.
Finally, when the predictive models based on the training data have been obtained, we examine the prediction for the testing data $\mathcal{D}_{\text{test}}$ in Step 4 of Algorithm 2. Let $x_{\text{new}}$ denote the predictor vector for a new subject and let $x_{\text{new},\widehat{\mathcal{A}}_i}$ denote its subvector of predictors selected for class i. We calculate (22) with $x_{\widehat{\mathcal{A}}_i}$ replaced by $x_{\text{new},\widehat{\mathcal{A}}_i}$ for i = 1, ⋯, I, and let $\widehat{\pi}_1(x_{\text{new},\widehat{\mathcal{A}}_1}), \cdots, \widehat{\pi}_I(x_{\text{new},\widehat{\mathcal{A}}_I})$ denote the corresponding values. Let i* denote the index which corresponds to the largest value, i.e.,
$$i^* = \underset{i \in \{1, \cdots, I\}}{\operatorname{argmax}}\ \widehat{\pi}_i\left(x_{\text{new},\widehat{\mathcal{A}}_i}\right). \qquad (23)$$
Then the class label for this new subject is predicted as i*.
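A minimal R sketch of the LR-HeteNet strategy is given below, assuming hypothetical objects `x_train`, `y_train`, and `x_test` as before. For brevity it keeps only the main-effect terms of (20); the class-dependent interaction terms from $\widehat{E}_i$ can be added exactly as in the Algorithm 1 sketch, and the class-dependent cut-off $[n_i/\log n_i]$ mirrors Section 4.1.

```r
# Sketch of Algorithm 2 (LR-HeteNet), assuming the illustrative objects x_train,
# y_train (class labels), and x_test (full testing predictor matrix).
library(XICOR)

classes <- sort(unique(y_train))
prob_test <- matrix(NA, nrow(x_test), length(classes))

for (i in seq_along(classes)) {
  y_i <- as.numeric(y_train == classes[i])                   # surrogate response (17)
  omega_i <- sapply(seq_len(ncol(x_train)), function(k) xicor(x_train[, k], y_i))
  n_i <- sum(y_i)
  d_i <- ceiling(n_i / log(n_i))                             # class-dependent cut-off
  A_i <- order(omega_i, decreasing = TRUE)[seq_len(d_i)]     # estimated active set (19)

  train_i <- data.frame(x_train[, A_i, drop = FALSE], y = y_i)
  # Main-effect logistic fit maximizing (21); network terms from E_i are omitted here.
  fit_i <- glm(y ~ ., family = binomial, data = train_i)

  test_i <- data.frame(x_test[, A_i, drop = FALSE])
  prob_test[, i] <- predict(fit_i, newdata = test_i, type = "response")  # (22)
}

# Step 4: predicted class is the one with the largest fitted probability, as in (23).
y_pred <- classes[apply(prob_test, 1, which.max)]
```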
Remark 3.1 The main difference between the MLR-HomoNet and LR-HeteNet methods is that MLR-HomoNet applies the feature screening approach to retain informative predictors by pooling all subjects, while LR-HeteNet applies the feature screening approach separately to the surrogate response of each class. Consequently, the estimated active sets (19) depend on the class and differ from each other, and thus the resulting network structures determined by Step 2 of Algorithm 2 differ across classes. Therefore, we conclude that the MLR-HomoNet method adopts only different levels of gene expression values to classify tumor samples, while the LR-HeteNet method uses not only gene expression values but also class-dependent network structures to do the classification.
4 Results
In this section, we aim to implement Algorithms 1 and 2 in Section 3 to the GCM dataset introduced in Section 2.1.
4.1 Detection of informative gene expressions via feature screening
In the GCM dataset, there are I = 14 classes. The dimension of predictors is p = 16,063 and the sample size is n = 198, where the size of the training set is 144 and the size of the testing set is 54. Following the steps in Fig 1, we first implement the proposed method in Section 3 to fit models based on the training set, and then assess the prediction performance by examining the testing set.
Since the dimension of predictors is extremely larger than the sample size, i.e., p ≫ n, to determine the informative predictors, we adopt the screening signal (2) to retain informative gene expressions. The first strategy, in Algorithm 1, is to apply (2) to evaluate the signal between X•k and Y ∈ {1, ⋯, 14} and determine the estimated active set (3); the second, in Algorithm 2, is to calculate the signal between X•k and Yi for i = 1, ⋯, 14 and then obtain the estimated class-dependent active sets (19). As suggested in [33, 35, 36], under the training set, we retain $[n/\log n]$ gene expression values for the MLR-HomoNet method and retain $[n_i/\log n_i]$ gene expression values with i = 1, ⋯, 14 for the LR-HeteNet method, where ni is the sample size of class i summarized in Table 1.
4.2 Network-based classification models
After the feature screening step, we next apply the estimation procedure in Section 3.2 to determine the network structure of the selected gene expressions in the training set. Fig 2 displays the network structure with all samples accommodated, and the network structures of the selected gene expressions for the different cancers are displayed in Fig 3. In Fig 2, we can see that the selected gene expressions have complex dependence structures. For example, the gene expressions with IDs 10111, 9548, and 9446 are connected with several other gene expressions, while the three gene expressions 10884, 15854, and 10208 have no connections with others. On the other hand, as shown in Fig 3, different classes have different selected gene expressions and associated network structures, which verifies the discussion in Remark 3.1. That is, since different kinds of cancer differ in their corresponding gene expressions, we can infer which cancer each tumor sample comes from according to the specific network structures of gene expressions produced by our analysis.
To adopt the determined network structures to examine the classification, we apply the network structures and the training set to the classification models proposed in Sections 3.3 and 3.4, respectively. To see the fitness of the two models, we first apply the fitted models to the training data and examine the classification. The 14 × 14 confusion matrices based on the MLR-HomoNet and LR-HeteNet methods are shown in Tables 2 and 3, respectively, where columns are labels from the training data $\mathcal{D}_{\text{train}}$, rows are labels of fitted values, diagonal entries reflect the number of correct classifications, and off-diagonal entries are the numbers of misclassifications by the fitted values. In general, both methods show satisfactory model fitness, as the accuracy of classification is high. Moreover, we observe that the LR-HeteNet method seems to slightly outperform the MLR-HomoNet method, since the latter produces slightly more misclassifications on BR, PR, CO, and UT than the former. This result makes sense because the LR-HeteNet method is based on class-dependent network structures that can directly reflect the corresponding cancers. For a clearer visualization, we further display two heatmaps in Fig 4, which are obtained from Tables 2 and 3 with each row divided by the class-dependent sample size in the training data. We observe that the diagonal entries have dark color, which indicates that the proportion of true classification is high and that Algorithms 1 and 2 give well-fitted models.
The left panel is obtained by Algorithm 1, and the right panel by Algorithm 2. Z represents the proportion of (mis)classification.
4.3 Prediction
When the predictive models have been constructed, we assess the performance of the proposed method by examining the prediction for the testing data. We apply the two proposed methods to the predictors in the testing data and then predict the classification. After that, we summarize the responses in the testing data and the predicted classes into the 14 × 14 confusion matrices in Tables 4 and 5, respectively, where columns are labels from the testing samples $\mathcal{D}_{\text{test}}$, rows are labels of predicted values, diagonal entries reflect the number of correct classifications, and off-diagonal entries are the numbers of misclassifications by the predicted values. Moreover, we also display two heatmaps in Fig 5, which are obtained from Tables 4 and 5 with each row divided by the class-dependent sample size in the testing data. From the confusion matrices and heatmaps, we can see that the two proposed methods have satisfactory performance in prediction because most of the predicted classes are the same as the class labels in the testing data, except for a few misclassifications.
The left panel is obtained by Algorithm 1, and the right panel by Algorithm 2. Z represents the proportion of (mis)classification.
To assess the performance of classification and prediction numerically, we evaluate some commonly used criteria: micro averaged metrics, macro averaged metrics, and the adjusted Rand index. For a subject j in the testing data with j = 1, ⋯, 54, let $\widehat{y}_{\text{new},j}$ denote the predicted class label determined by the prediction models and let $y_{\text{new},j}$ denote the class label in the testing data. For class i = 1, ⋯, I, we respectively calculate the number of true positives (TP), the number of false positives (FP), and the number of false negatives (FN) as
$$\mathrm{TP}_i = \sum_{j=1}^{54} \mathbb{1}\left\{ \widehat{y}_{\text{new},j} = i \text{ and } y_{\text{new},j} = i \right\}, \qquad (24)$$
$$\mathrm{FP}_i = \sum_{j=1}^{54} \mathbb{1}\left\{ \widehat{y}_{\text{new},j} = i \text{ and } y_{\text{new},j} \ne i \right\}, \qquad (25)$$
and
$$\mathrm{FN}_i = \sum_{j=1}^{54} \mathbb{1}\left\{ \widehat{y}_{\text{new},j} \ne i \text{ and } y_{\text{new},j} = i \right\}. \qquad (26)$$
For the micro averaged metrics, precision and recall are, respectively, defined in terms of (24), (25), and (26):
$$\mathrm{PRE}_{\text{micro}} = \frac{\sum_{i=1}^{I} \mathrm{TP}_i}{\sum_{i=1}^{I} \left( \mathrm{TP}_i + \mathrm{FP}_i \right)} \qquad (27)$$
and
$$\mathrm{REC}_{\text{micro}} = \frac{\sum_{i=1}^{I} \mathrm{TP}_i}{\sum_{i=1}^{I} \left( \mathrm{TP}_i + \mathrm{FN}_i \right)}. \qquad (28)$$
Then the Micro-F-score is defined as
$$\text{Micro-F-score} = \frac{2 \times \mathrm{PRE}_{\text{micro}} \times \mathrm{REC}_{\text{micro}}}{\mathrm{PRE}_{\text{micro}} + \mathrm{REC}_{\text{micro}}}. \qquad (29)$$
On the other hand, for the macro averaged metrics, for i = 1, ⋯, I, let $\mathrm{PRE}_i = \mathrm{TP}_i / (\mathrm{TP}_i + \mathrm{FP}_i)$ denote the precision for class i, and let $\mathrm{REC}_i = \mathrm{TP}_i / (\mathrm{TP}_i + \mathrm{FN}_i)$ denote the recall for class i. Then the overall precision and recall are, respectively, given by
$$\mathrm{PRE}_{\text{macro}} = \frac{1}{I} \sum_{i=1}^{I} \mathrm{PRE}_i \qquad (30)$$
and
$$\mathrm{REC}_{\text{macro}} = \frac{1}{I} \sum_{i=1}^{I} \mathrm{REC}_i, \qquad (31)$$
and the Macro-F-score is defined as
$$\text{Macro-F-score} = \frac{2 \times \mathrm{PRE}_{\text{macro}} \times \mathrm{REC}_{\text{macro}}}{\mathrm{PRE}_{\text{macro}} + \mathrm{REC}_{\text{macro}}}. \qquad (32)$$
According to these definitions, when all subjects are correctly classified, FP and FN are equal to zero, so that PRE and REC are equal to one; if all subjects are falsely classified, then TP is equal to zero, and thus PRE and REC are equal to zero. Therefore, the values of PRE and REC lie between zero and one. Moreover, the F-score falls in [0, 1] as well by treating 0/0 as zero. In principle, higher values of PRE, REC, and F-score, based on both micro and macro averaging, reflect better performance of the methods ([20–22]).
In addition to the criteria above, another commonly used criterion is the adjusted Rand index (ARI). For i, l = 1, ⋯, I, let $n_{il} = \sum_{j=1}^{54} \mathbb{1}\{ y_{\text{new},j} = i \text{ and } \widehat{y}_{\text{new},j} = l \}$. Moreover, define $a_i = \sum_{l=1}^{I} n_{il}$ for i = 1, ⋯, I and $b_l = \sum_{i=1}^{I} n_{il}$ for l = 1, ⋯, I. Then the ARI is defined as (e.g., [43])
$$\mathrm{ARI} = \frac{\sum_{i,l} \binom{n_{il}}{2} - \left\{ \sum_{i} \binom{a_i}{2} \sum_{l} \binom{b_l}{2} \right\} \Big/ \binom{m}{2}}{\frac{1}{2}\left\{ \sum_{i} \binom{a_i}{2} + \sum_{l} \binom{b_l}{2} \right\} - \left\{ \sum_{i} \binom{a_i}{2} \sum_{l} \binom{b_l}{2} \right\} \Big/ \binom{m}{2}}, \qquad (33)$$
where m = 54 is the size of the testing data. As mentioned in [43], the ARI is bounded above by one, and a higher value of ARI indicates more accurate classification.
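For completeness, a short R sketch of the criteria (24)–(33) is given below, assuming hypothetical vectors `y_true` and `y_pred` of observed and predicted class labels for the testing samples.

```r
# Sketch of the evaluation criteria (24)-(33), assuming illustrative vectors
# y_true and y_pred of observed and predicted class labels for the testing data.
classes <- sort(unique(c(y_true, y_pred)))
tp <- fp <- fn <- numeric(length(classes))
for (i in seq_along(classes)) {
  tp[i] <- sum(y_pred == classes[i] & y_true == classes[i])   # (24)
  fp[i] <- sum(y_pred == classes[i] & y_true != classes[i])   # (25)
  fn[i] <- sum(y_pred != classes[i] & y_true == classes[i])   # (26)
}

# Micro averaged metrics (27)-(29).
pre_micro <- sum(tp) / sum(tp + fp)
rec_micro <- sum(tp) / sum(tp + fn)
f_micro   <- 2 * pre_micro * rec_micro / (pre_micro + rec_micro)

# Macro averaged metrics (30)-(32); 0/0 is treated as zero.
safe_div  <- function(a, b) ifelse(b == 0, 0, a / b)
pre_macro <- mean(safe_div(tp, tp + fp))
rec_macro <- mean(safe_div(tp, tp + fn))
f_macro   <- 2 * pre_macro * rec_macro / (pre_macro + rec_macro)

# Adjusted Rand index (33), computed from the cross-tabulation of labels.
tab <- table(y_true, y_pred)
a <- rowSums(tab); b <- colSums(tab); m <- sum(tab)
idx      <- sum(choose(tab, 2))
expected <- sum(choose(a, 2)) * sum(choose(b, 2)) / choose(m, 2)
maximum  <- (sum(choose(a, 2)) + sum(choose(b, 2))) / 2
ari      <- (idx - expected) / (maximum - expected)
```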
We primarily adopt (27), (28), (29), (30), (31), (32), and (33) to assess the performance of the two proposed methods. In addition, to compare with the proposed methods, we also examine several well-established supervised learning methods, including logistic regression models without incorporating network structure [42], the support vector machine (SVM) that was examined by [30], K-nearest neighbors (KNN), linear discriminant analysis (LDA), Bayes, artificial neural network (ANN), XGBoost, random forest (RF), bagging, and long short-term memory (LSTM) methods. The implementation and the corresponding R packages are summarized in Table 6.
The prediction results of the proposed and competing methods are summarized in Table 7. In general, we can observe that the two proposed methods have larger values of PRE, REC, F-score, and ARI than the other existing methods. Comparing the existing methods among themselves, we can see that advanced machine learning or deep learning methods (e.g., ANN, RF, bagging) outperform conventional ones, such as LDA or SVM, but are less satisfactory than the proposed methods because of slightly larger misclassification. This verifies that incorporating network structures improves the accuracy of classification and prediction. In addition, another reason is that, unlike existing methods that possibly incur overfitting because they fit models with all gene expression values directly, the two proposed methods retain only the gene expression values and network structures that are related to the response, yielding parsimonious models. In this way, noise and impacts induced by irrelevant gene expression values can be eliminated. Comparing the two proposed methods, we can see that the LR-HeteNet method outperforms the MLR-HomoNet method, with larger values of the criteria. The main reason is that the MLR-HomoNet model in Section 3.3 directly deals with multi-label classification by using a common network structure to classify tumors into the corresponding cancers. To simultaneously reflect information across all classes, the network structure displayed in Fig 2 is expected to require more gene expression values and more complex interactions. On the other hand, the LR-HeteNet method in Section 3.4 identifies predictors and a unique network structure to reflect a specific cancer, suggesting that types of cancer can be uniquely represented by different network structures of gene expression values. As shown in Fig 3, one can directly adopt a given network structure to classify tumors into their cancers with high prediction accuracy. In summary, with the noise induced by irrelevant predictors removed and informative network structures of predictors accommodated, the accuracy of classification and prediction improves significantly.
5 Discussion
In this paper, we present a network-based classification method to predict the classification of tumor samples in an ultrahigh-dimensional setting, i.e., with multitudinous gene expressions as predictors. In the proposed method, we first adopt a model-free feature screening technique to retain informative gene expressions from the ultrahigh-dimensional data. After that, we identify the network structures of the detected gene expressions based on different cancers, and the network structure recovery property allows us to fit nominal logistic regression models based on the network structure and examine classification and prediction. Compared with other existing methods, the proposed method gives more precise prediction results.
There are several possible extensions of the current work. For example, RNA sequences, regarded as count data, are also frequently explored in bioinformatics. The proposed method can be naturally extended to deal with RNA sequence data by treating them as predictors, because the screening signal (2) is free of the distribution of the random variables, and the identification of the network structure in Section 3.2 is based on exponential family graphical models. For the implementation of classification models, it is also interesting to explore other machine learning methods, such as SVM, LDA, or KNN, and other deep learning approaches that are popular in data science.
Moreover, research gaps still exist, and more explorations can be done by extending the proposed method. For example, as discussed in [32], measurement error in predictors is ubiquitous in data analysis; in particular, mismeasurement is inevitable in gene expression data (e.g., [52]). Ignoring measurement error effects is expected to increase the possibility of false classification and lead to wrong conclusions. Therefore, it is important to develop a new error-eliminating strategy to deal with measurement error based on the current method. Finally, as R packages associated with some of the existing methods have been developed, a corresponding R package for the method proposed here is anticipated.
Acknowledgments
The author would like to thank Lingyu Cai for technical support with the programming code, helpful language editing, grammar revision, and proofreading. The author thanks the editorial team for providing constructive and suggestive comments to improve the presentation of the manuscript.
References
- 1. Gálvez J. M., Castillo D., Herrera L. J., San Román B., Valenzuela O., Ortuno F. M., et al. (2018). Multiclass classification for skin cancer profiling based on the integration of heterogeneous gene expression series. PLoS ONE, 13(5), e0196836. pmid:29750795
- 2. Lee Y. and Lee C.-K. (2003). Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics. 19, 1132–1139. pmid:12801874
- 3. Cristianini N. and Shawe-Taylor J. (2000). An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge.
- 4. Huang M. W., Chen C. W., Lin W. C., Ke S. W., and Tsai C. F. (2017). SVM and SVM ensembles in breast cancer prediction. PLoS ONE, 12(1), e0161501. pmid:28060807
- 5. Guo Y., Hastie T., Tibshirani R. (2007). Regularized linear discriminant analysis and its application in microarrays. Biostatistics, 8, 86–100. pmid:16603682
- 6. Safo S. E. and Ahn J. (2016). General sparse multi-class linear discriminant analysis. Computational Statistics and Data Analysis, 99, 81–90.
- 7. Hastie T., Tibshirani R., and Friedman J. (2008). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
- 8. James G., Witten D., Hastie T., and Tibshirani R. (2017). An Introduction to Statistical Learning: with Applications in R. Springer, New York.
- 9. Chen L.-P. (2019). Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Statistical Papers, 60, 1793–1795.
- 10. Heenaye-Mamode Khan M., Boodoo-Jahangeer N., Dullull W., Nathire S., Gao X., Sinha G. R., et al. (2021). Multi-class classification of breast cancer abnormalities using Deep Convolutional Neural Network (CNN). PLoS ONE, 16(8), e0256500. pmid:34437623
- 11. Pandey A. and Roy S. S. (2022). Protein sequence classification using convolutional neural network and natural language processing. Handbook of Machine Learning Applications for Genomics, edited by S. S. Roy and Y. H. Taguchi, 133–144.
- 12. Roy S. S., Samui P., Deo R., and Ntalampiras S. (Eds.). (2018). Big Data in Engineering Applications (Vol. 44). Springer, Berlin/Heidelberg, Germany.
- 13. Roy S. S. and Taguchi Y. H. (2022). Handbook of Machine Learning Applications for Genomics. Springer Nature, Singapore.
- 14. Samui P., Roy S. S., and Balas V. E. (Eds.). (2017). Handbook of Neural Computation. Academic Press.
- 15. Zhu S. X. Y. and Pan W. (2009). Network-based support vector machine for classification of microarray samples. BMC Bioinformatics, 10, 1–11. pmid:19208121
- 16. Zi X., Liu Y., Gao P. (2016). Mutual information network-based support vector machine for identification of rheumatoid arthritis-related genes. International Journal of Clinical and Experimental Medicine, 9, 11764–11771.
- 17. Cai W., Guan G., Pan R., Zhu X., and Wang H. (2018). Network linear discriminant analysis. Computational Statistics and Data Analysis, 117, 32–44.
- 18. Huttenhower C., Flamholz A.I., Landis J.N. et al. (2007). Nearest neighbor networks: clustering expression data based on gene neighborhoods. BMC Bioinformatics, 8, 250, 1–13. pmid:17626636
- 19. He W., Yi G. Y., and Chen L.-P. (2019). Support vector machine with graphical network structures in features. Proceedings, Machine Learning and Data Mining in Pattern Recognition, 15th International Conference on Machine Learning and Data Mining, MLDM 2019, vol. II, New York, NY, USA, ibai-publishing, 557–570.
- 20. Chen L.-P., Yi G. Y., Zhang Q., and He W. (2019). Multiclass analysis and prediction with network structured covariates. Journal of Statistical Distributions and Applications, 6:6.
- 21. Chen L.-P. (2022a). Network-based discriminant analysis for multiclassification. Journal of Classification. To appear.
- 22. Chen L.-P. (2022b). Nonparametric discriminant analysis with network structures in predictor. Journal of Statistical Computation and Simulation. To appear.
- 23. Baladandayuthapani V., Talluri R., Ji Y., Coombes K. R., Lu Y., Hennessy B. T., et al. (2014). Bayesian sparse graphical models for classification with application to protein expression data. The Annals of Applied Statistics, 8, 1443–1468.
- 24. Peterson C. B., Stingo F. C., and Vannucci M. (2015). Joint Bayesian variable and graph selection for regression models with network-structured predictors. Statistics in Medicine, 35, 1017–1031. pmid:26514925
- 25. Roy S. S. and Taguchi Y. H. (2021). Identification of genes associated with altered gene expression and m6A profiles during hypoxia using tensor decomposition based unsupervised feature extraction. Scientific Reports, 11(1), 1–18. pmid:33903618
- 26. Tschodu D., Ulm B., Bendrat K., Lippoldt J., Gottheil P., Käs J. A., et al. (2022). Comparative analysis of molecular signatures reveals a hybrid approach in breast cancer: combining the Nottingham Prognostic Index with gene expressions into a hybrid signature. PloS ONE, 17(2), e0261035. pmid:35143511
- 27. Zhang X., Wu Y., Wang L., and Li R. (2016). Variable selection for support vector machines in moderately high dimensions. Journal of the Royal Statistical Society, Series B, 78, 53–76. pmid:26778916
- 28. Maugis C., Celeux G., and Martin-Magniette M.-L. (2011). Variable selection in model-based discriminant analysis. Journal of Multivariate Analysis, 102, 1374–1387.
- 29. Wang C., Cao L., and Miao B. (2013). Optimal feature selection for sparse linear discriminant analysis and its applications in gene expression data. Computational Statistics and Data Analysis, 66, 140–149.
- 30. Ramaswamy S., Tamayo P., Rifkin R. et al. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences of the United States, 98, 15149–15154. pmid:11742071
- 31. Lukashin A. V., Lukashev M. E., and Fuchs R. (2003). Topology of gene expression networks as revealed by data mining and modeling. Bioinformatics, 19, 1909–1916. pmid:14555623
- 32. Chen L.-P. (2018). Multiclassification to gene expression data with some complex features. Biostatistics and Biometrics Open Access Journal, 9, 555751.
- 33. Fan J. and Lv J. (2008). Sure independence screening for ultra high dimensional feature space. Journal of the Royal Statistical Society, Series B, 70, 849–911.
- 34. Chatterjee S. (2021). A new coefficient of correlation. Journal of the American Statistical Association, 116, 2009–2022.
- 35. Chen L.-P. (2020). A note of feature screening via rank-based coefficient of correlation. arXiv:2008.04456.
- 36. Chen L.-P. (2021). Feature screening based on distance correlation for ultrahigh-dimensional censored data with covariates measurement error. Computational Statistics, 36, 857–884.
- 37. Yang E., Ravikumar P., Allen G. I., and Liu Z. (2015). Graphical models via univariate exponential family distribution. Journal of Machine Learning Research, 16, 3813–3847. pmid:27570498
- 38. Meinshausen N. and Bühlmann P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34, 1436–1462.
- 39. Schwarz G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
- 40. Ravikumar P., Wainwright M. J., and Lafferty J. (2010). High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38, 1287–1319.
- 41. Agresti A. (2007). An Introduction to Categorical Data Analysis. Wiley, New York.
- 42. Agresti A. (2012). Categorical Data Analysis. Wiley, New York.
- 43. Hubert L. and Arabie P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.
- 44. Meyer D., Dimitriadou E., Hornik K., Weingessel A., et al. (2022). e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-11. https://CRAN.R-project.org/package=e1071
- 45. Torgo L. (2022). DMwR: Functions and data for "Data Mining with R". R package version 0.4.1. https://CRAN.R-project.org/package=DMwR
- 46. Ripley B., Venables B., Bates D. M., Hornik K., et al. (2022). MASS: Support functions and datasets for Venables and Ripley's MASS. R package version 7.3-57. https://CRAN.R-project.org/package=MASS
- 47. Fritsch S., Guenther F., Wright M. N., Suling M., and Mueller S. M. (2019). neuralnet: Training of neural networks. R package version 1.44.2. https://CRAN.R-project.org/package=neuralnet
- 48. Chen T., He T., Benesty M., Khotilovich V., et al. (2022). xgboost: Extreme gradient boosting. R package version 1.6.0.1. https://CRAN.R-project.org/package=xgboost
- 49. Breiman L., Cutler A., Liaw A., and Wiener M. (2022). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.7-1. https://CRAN.R-project.org/package=randomForest
- 50. Peters A., Hothorn T., Ripley B. D., Therneau T., and Atkinson B. (2022). ipred: Improved predictors. R package version 0.9-13. https://CRAN.R-project.org/package=ipred
- 51. Quast B. and Fichou D. (2022). rnn: Recurrent Neural Network. R package version 1.5.0. https://CRAN.R-project.org/package=rnn
- 52. Chen L.-P. and Yi G. Y. (2021). Analysis of Noisy Survival Data with Graphical Proportional Hazards Measurement Error Models. Biometrics, 77, 956–969. pmid:32687216