Model-free feature screening for categorical outcomes: Nonlinear effect detection and false discovery rate control

Feature screening has become a real prerequisite for the analysis of high-dimensional genomic data, as it is effective in reducing dimensionality and removing redundant features. However, existing methods for feature screening have been mostly relying on the assumptions of linear effects and independence (or weak dependence) between features, which might be inappropriate in real practice. In this paper, we consider the problem of selecting continuous features for a categorical outcome from high-dimensional data. We propose a powerful statistical procedure that consists of two steps, a nonparametric significance test based on edge count and a multiple testing procedure with dependence adjustment for false discovery rate control. The new method presents two novelties. First, the edge-count test directly targets distributional difference between groups, therefore it is sensitive to nonlinear effects. Second, we relax the independence assumption and adapt Efron’s procedure to adjust for the dependence between features. The performance of the proposed procedure, in terms of statistical power and false discovery rate, is illustrated by simulated data. We apply the new method to three genomic datasets to identify genes associated with colon, cervical and prostate cancers.

• Setting S2 (Gaussian versus Gaussian mixture): X i ∼ N(1, 1.1), Y j ∼ 1 2 N(0, 1 2 )+ 1 2 N(2, 1 2 ), i = 1, 2, ..., n; j = 1, 2, ..., n For both settings, sample sizes n = 50, 100, 200, 300, 500 were used and the performance of the two methods were evaluated based on 1,000 simulations. For edge-count test, a 3-MST was used as the similarity graph and p-value was evaluated by the asymptotic Chi-squared distribution. S1 Fig summarized the empirical statistical power (proportion of correct rejections) by two methods, where it can be seen that the KS test is rather conservative even for relatively large sample size. For instance, with 300 samples in each group, the KS test exhibited an extremely low statistical power (below 20%) while the edge-count test achieved a satisfactory power (70∼80%) in both settings.

C. Application to cervical and prostate cancer data
The new method was further tested on two additional datasets, namely the RNA-seq data for cervical cancer [1] and the microarray data for prostate cancer [2]. The cervical cancer data contains the expression levels of 714 microRNAs on 58 samples (29 tumor samples and 29 normal controls). The prostate cancer data contains the expression levels (microarray) of 2,000 genes (preselected from a total of 26,000 genes) on 103 samples consisted of 41 normal samples and 62 tumor samples. For the RNA-seq data, the counts were transformed by log 2 (x + 1) and same normalization was applied to the transformed data as we did for the colon cancer data. Two methods including (1) Welch's t test with BH procedure and (2) edge-count test with Efron's procedure were applied to both data sets, with FDR α = 0.10.
Similar as in the colon cancer data, the edge-count test with Efron's procedure consistently detected more genes than the Welch's t test. For instance, in the cervical cancer data, 71 genes were detected by our new method while only 55 genes were detected by the Welch's t test. In the prostate cancer, the new method identified 42 genes while the Welch's t test only detected 30 genes. All the newly discovered genes (identified by edge-count test but missed by Welch's t test) have similar sample means but significantly different distributions in normal and tumor groups. The details of nine such genes/miRNAs, including miR-17, miR-30c-2, miR-329, miR-433, HEXB, PXR1, MTCP1, CAMK1D, ATP6V1G1, were shown in Figure   K-S in S1 File.

Hsa.3180 in normal samples
Expression level

Hsa.44676 in normal samples
Expression level  Cancer is a complex disease in the sense that it is not a result of an abnormality of a single gene but a consequence of changes in many genes, it is therefore of great importance to understand the roles of different oncogenic and tumor suppressor pathways in tumorigenesis. In recent years, there have been many computational models developed to study the genetic alterations of different pathways in the evolutionary process of cancer. However, most of the methods are knowledge-based enrichment analyses and inflexible to analyze user-defined pathways or gene sets.
Results: In this paper, we develop a nonparametric and data-driven approach to testing for the dynamic changes of pathways over the cancer progression. Our method is based on an expansion and refinement of the pathway being studied, followed by a graph-based multivariate test, which is very easy to implement in practice. The new test is applied to the rich Cancer Genome Atlas data to study the ( (Tu et al. [2005]), Catmap (Breslin et al. [2004]), ErmineJ (Lee et al. [2005]), and GeneTrail (Backes et al. [2007]). Most of these approaches, however, are knowledge-based and inflexible to userdefined pathways. One exception is Edelman et al. [2008], which introduced a data-driven approach to modeling cancer progression via pathway dependencies and flexible to any user-defined pathways.
Unlike the knowledge-based enrichment analysis, Edelman et al.'s method is essentially a hierarchical analysis that targets pathways relevant to particular transition between cancer stages, for instance, from normal to primary tumor, and from primary tumor to metastasis. However, Edelman's approach relied on several computationally intensive steps, such as regularized multi-task learning, inverse regression, learning gradients as well as leave-one-out cross validation, which greatly limited its application to large-scale data.
To this end, we proposed an efficient and powerful test based on inter-point distance to identify cancerdriving pathways. Similar as Edelman's method, our method is purely data-driven and capable of dealing with new pathways adapted to certain tissue/disease. In the meanwhile, the test is model-free and very easy to implement as it only requires few simple calculations, such as minimum spanning tree and Chi-square test.
The rest of the paper is structured as follows: In Section 2, we introduce the new test and establish its asymptotic distribution. The finite sample performance is evaluated via simulations. In Section 3, we apply the new test to the rich Cancer Genome Atlas data to identify important pathways that drive serous ovarian cancer. We discuss the strengths and shortcomings of our approach in Section 4 and conclude this paper in Section 5. Technical proof about the asymptotic distribution of the edge-count test is provided in the Appendix.

Problem formulation
Detecting cancer-driving pathways is essentially detecting differentially acted pathways between cancer stages. Here, we formulate it as a statistical problem of testing the equality of two or multiple joint distributions, where each random variable represents the expression level of one gene. To be precise, we let i ∈ {1, 2, ..., p} be the index for cancer stages, and (X (i) 1 , ..., X (i) d ) be the expression levels of d genes in the pathway being studied, with a joint distribution F F F (i) . Given n i i.i.d. observations in stage i, (x kd ), we concern the following hypothesis testing: For a particular transition step, from stage i to stage i + 1 (i = 1, ..., p − 1), we could conduct the following pairwise test, which is a special case when p = 2

Graph-based multivariate test
The two-sample multivariate tests in the statistics literature can be roughly classified into two categories, namely the parametric and nonparametric multivariate tests. Hotelling's T is a simple and widely used test for comparing the mean vectors of two independent populations, which generally works well for lowdimension case. There are also numerous tests recently developed for high dimensions (Cai et al. [2014]; Gregory et al. [2015]; Srivastava et al. [2013]), which have mostly focused on the equality of mean vectors. Unlike the parametric tests, the nonparametric tests directly test the equality of two multivariate distributions, for instance, Kolmogorov-Smirnov (KS) test (Lopes et al. [2008]) and graph-based test (Friedman and Rafsky [1979]; Rosenbaum [2005]), which are two popular families of such test, but both have practical limitations in real world applications. KS test is known to be very conservative, i.e., the null hypothesis is too often not rejected. Moreover, by the brute force algorithm, the implementation of multidimensional KS test can be prohibitively computationally intensive. Graph-based tests such as edge-count tests are easy to implement but they could be problematic under certain location and scale alternatives, as pointed out by Chen and Friedman [2017]. The rationale of the edge count test is that if two groups have different distributions, samples would be preferentially closer to others from the same group than those from the other group in a similarity graph, therefore the edges in the graph would be more likely to connect samples from the same group. The test rejects the null if the number of between-group edges is significantly less than expected. Chen and Friedman [2017] developed an modified edge-count test and established its asymptotic distribution under two samples. This test works properly under different alternatives and exhibits substantial power gains over existing edge-count tests. Similar as other edge-count tests, their test is based upon a similarity graph constructed over the pooled samples from different groups. For instance, one could use a minimum spanning tree (MST, Cheriton and Tarjan [2006]) based on euclidean distance as the similarity graph.
In order to test the hypotheses in 2.1, we extended Chen and Friedman's test to a multi-sample case and derived the asymptotic result for p-value approximation. We began with pooling samples from all p groups and indexing them by 1, 2, ..., N = p ∑ i=1 n i . A similarity graph, denoted by G, was then constructed on the pooled observations. Let R i be the number of edges in the graph that connect observations within sample i. We worked under the permutation null distribution, which places 1/ N n 1 ,n 2 ,...,n p probability on each of the N n 1 ,n 2 ,...,n p choices of n i out of the total N observations as sample i, i = 1, . . . , p, with each observation chosen once. When there is no further specification, we denote by P P , E P , Var P , Cov P probability, expectation, variance, and covariance, respectively, under the permutation null distribution.
It is not hard to show that: where |G| is the number of edges in graph G, and C = 1 2 ∑ N k=1 |G k | 2 − |G| with G k being the subgraph in G that includes all edge(s) that connect to node k, so C is the number of edge pairs that share a common node in G. Consider the following test statistic where Σ is the covariance matrix of vector (R 1 , R 2 , . . . , R p ) T . The following theorem, based on extension of Chen and Friedman's result, establishes the asymptotic distribution of S under multiple samples: Theorem 1. For an edge e ∈ G, let A e = {e} ∪ {e ∈ G : e and e share a node}, B e = A e ∪ {e ∈ G : ∃ e ∈ A e , such that e and e share a node}.

Simulation study: Accuracy of p-value approximation
We conducted a simulation study under p = 4 to evaluate the finite performance of the p-value approximation under moderate sample sizes. In particular, we compared the approximate p-value with a permutation p-value from 10,000 permutations, under different dimensions (d = 10, 100) and sample sizes (N = 80, n 1 = n 2 = n 3 = n 4 = 20, N = 140, n 1 = n 2 = 20, n 3 = n 4 = 50 and N = 200, n 1 = n 2 = n 3 = n 4 = 50).
The data were generated from p-dimension Gaussian distribution with zero mean vector and identical covariance matrix. A k-MST (k = 1, 3, 5) was then constructed over the pooled observations, based on which the test statistics S could be calculated. The approximate p-value can be obtained as follows: which is to be compared with permutation p-value from 10,000 permutations. Figure 1 and Figure 2 summarized the accuracy of the approximate p-values (approximate p-value minus permutation p-value) under different dimensions, sample sizes, and similarity graphs. It can be seen that under all conditions, the approximate p-values tend to be slightly conservative and increasing dimension leads to a slightly decreasing accuracy. In addition, using a denser similarity graph, e.g., 3-MST or 5-MST can slightly improve the p-value approximation. A sample size such that min i n i > 20 seems to be enough in practice, in order to approximate p-value with Chi-square distribution.

Results
In this section ,we applied the multi-sample edge-count test to the rich TCGA data (TCGA Research Network. [2012,2014]) and studied the roles of 186 KEGG pathways (http://www.genome.jp/kegg) in the progression of serous ovarian cancer. In TCGA, each subject is represented by multiple molecular data types including gene expression, exon expression, genotype (SNP), MicroRNA expression, copy number variation (CNV), somatic mutation, DNA methylation, as well as a complete clinical record including tumor stage, survival, age, race, outcomes of debulking surgery and chemotherapy. To take advantage of this comprehensive data, we considered three important data types in our analysis including gene expression level, CNV and DNA methylation. Other data types could also be incorporated, without causing too much calculation.

Data preprocessing
The we took the average of the "seg.mean" value as its overall copy number. We normalized each data type for each gene to avoid possible dominance by any of these three data types.
The expression level of each gene was quantified by the count of reads mapped to the gene. The quantifications were done by software HTSeq of version 0.9.1 (Anders et al. [2015]) and the count data were log-transformed for further processing. In addition, we removed the effects due to different age groups and batches using a median-matching and variance-matching strategy (Hsu et al. [2012]; Zhang et al. [2014]).
For example, the batch effect can be removed in the following way : where g i jk refers to the expression value for gene i from sample k in batch j ( j = 1, 2, ..., J; k = 1, 2, ..., n j ), M i j represents the median of g i j = (g i j1 , ..., g i jn ), M i refers to the median of g i = (g i1 , ..., g iJ ),σ g i andσ g i j stand for the standard deviations of g i and g i j , respectively.

Pathway expansion and refinement
Each of the 186 pathways was expanded by integrating all the three data types. For instance, if gene i was in the pathway being studied, then its methylation level and CNV level were added to the pathways as two new nodes. As it is well known that the expression level of a gene could be greatly affected by genetic or epigenetic changes such as copy number variation and DNA methylation, the inclusion of these two factors may provide insights about the upstream cause of abnormal expression.
To adapt the KEGG pathways to our context, we further refined the pathways by removing genes that are irrelevant to phenotypic changes. Similar as in Edelman's et al. [2007], an F-test was used to calculate the p-value of each single gene. A gene was excluded if its p-value exceeds some predefined threshold. In this analysis, we set the threshold to be 0.1, and the same procedure was applied to the methylation and CNV data. By the refinement, the sizes of 186 KEGG pathways were greatly reduced, for instance, the ERBB pathway was reduced from 87 genes to 29 genes, and 165 out of 174 methylation/CNV nodes were excluded.

Cancer-driving pathways
For each pathway, we first conducted a multi-sample test to obtain the p-value for the whole process of development. By a Benjamini-Hochberg (BH) procedure with level 0.05, a set of nine pathways were identified, including some well-studied cancer-related pathways such as cell cycle, ERBB, p53 signaling and JAK-STAT signaling pathways. For each of the identified pathways, a two-sample test was further conducted in order to investigate the roles of these pathways in each particular transition step (I→II, II→III, III→IV). Table 1 lists the nine cancer-driving pathways with p-values for the entire process and each particular transition. Interestingly, we found that most of these pathways contributed only to a particular step (the ERBB pathway is an exception, which has significant p-values in both transition I→II and transition II→III). For instance, five pathways were found to contribute only to the early-stage transition, three pathways contribute only to the high-grade transition. Six pathways, including ERBB, cell cycle, prostate cancer, TGF β signaling, pancreatic cancer and p53 signaling pathways, were found to significantly contribute to the transition I→II, which confirmed some existing studies. (p-value= 6.2 × 10 −5 ) and apoptosis pathway (p-value= 1.9 × 10 −3 ), respectively. As compared to III→IV transition, the transition from stage I to stage II is more substantial, which partially confirm some existing studies and clinical statistics. For instance, the 5-year survival rate for stage-I ovarian cancer patients is above 90%, while the rate for stage-II cancer patients drop below 30% (TCGA Research Network. [2011]), suggesting an intrinsic genetic/epigenetic difference between these two groups. As a contrast, we also presented two MDS plots with insignificant p-valules in Figure 7 (p-value=0.3, prostate cancer pathway) and Figure 8 (p-value=0.6, pancreatic cancer pathway).
The regulatory relations of the genes in the refined set were statistically predicted by a logistic Bayesian network model (Zhang et al. [2014]). A coordinate descent algorithm was employed to solve the penalized likelihood function and the penalizing constant λ was tuned by a likelihood method suggested by Zhang et al. (2014). Figures 9 and 10 show the predicted Bayesian networks for the cell cycle pathway and ERBB pathway. These directed networks may provide more clues about how the cancer-driving genes regulate each other, and together change the phenotype. For instance, in the cell cycle pathway (Figure 9), we found gene CDC20 is a driver gene (high outdegree and high centrality), regulating eight other cancer-driving genes in the network. Intervention on this gene may result in a global change of this network.

Discussion
Gene set analyses hold great promise in elucidating the molecular basis of complex diseases such as cancers. Nevertheless, most of the existing gene set analyses are knowledge-based and relied on the enrichment analysis of the pathways defined in existing databases. In this paper, we proposed a data-driven tool for gene set analysis, without requiring any predefined pathways or gene sets. The new test is very easy to implement in practice as it only requires the calculation of a similarity graph over the observations. In practice, one can apply our test to large data set as an efficient searching tool for phenotype-changing pathways, as the computation is simple and efficient. Moreover, under the permutation null and some mild conditions, the test statistics converges to a Chi-square distribution and sample size in tens for each group is generally good enough for p-value approximation, as we show in the simulation study.
The new test also has some limitations. First, the new test only targets for the difference of multiple joint distributions, but cannot detect the direction of changes. A follow-up test or a graphical visualization will be needed in order to tell the direction of changes, i.e., the up-regulation or down-regulation. For example, one could calculate an enrichment score for the pathway being studied as in existing gene set analyses or use a heatmap of the gene set.
Second, as we illustrated in the TCGA application, the performance of the new method relies on a refinement of the gene set, as the existence of irrelevant genes may greatly affect the identification of genes that differentiate phenotypes. In a case that the majority of a gene set are uncorrelated with phenotypic change, the test will fail to identify that gene set. One possible way to solve this problem is redefining distance metric, for example, one can consider to use some weighted distance measures or the ∞ -norm measures that are more sensitive to a change in single gene or a small proportion of genes.

Conclusions
In this work, we introduced a simple but powerful multivariate test to identify (epi)genetic pathways that drive the cancer progression. Different from most existing gene set analyses, our test is data-driven and nonparametric, with allowance of user-defined gene sets or pathways. As we see in the method section, this graph-based test can be easily implemented by calculating Euclidian distance between observations and constructing a minimum spanning tree, thus it is efficient to high dimensional problems. Besides the cancer progression application presented in this paper, the new test can be also applied to many other problems.
For instance, one can test for pathways that differentiate cancer subtypes.

Appendix: Detailed proof for asymptotic distribution of the test statistics
Proof. To prove the theorem, we take one step back to study the statistic under the bootstrap null distribution, which is defined as follows: For each observation, we assign it to be from sample i with probability n i /N independently of other observations. Let X i be the number of observations that are assigned to be from sample i. Then, conditioning on {X i = n i } i=1,...,p , the bootstrap null distribution becomes the permutation null distribution. We use P B , E B , Var B to denote the probability, expectation, and variance under the bootstrap null distribution, respectively.
We have where λ i,N = n i /N. Under the conditions in the theorem, as N → ∞, we can prove the following results: (1) Under the bootstrap null, (W B 1 ,W B 2 , . . . ,W B p ,U 1 , . . . ,U p−1 ) becomes multivariate normal distributed and the covariance matrix of (U 1 , . . . ,U p−1 ) is of positive definite. ( where c i 's are constants. (3) rank(Σ) = p.
Given (1), the conditional distribution of (W B 1 ,W B 2 , . . . ,W B p ) given (U 1 , . . . ,U p−1 ) becomes a multivariate Gaussian distribution under the boostrap null distribution as N → ∞. Since the permutation null distribution is equivalent to the bootstrap null distribution given U i = 0, i = 1, . . . , p − 1, (W B 1 ,W B 2 , . . . ,W B p ) becomes a multivariate Gaussian distribution under the permutation null distribution as N → ∞. Since given (2), we have (W 1 ,W 2 , . . . ,W p ) becomes a multivariate Gaussian distribution under the permutation null distribution as N → ∞. Together with (3), we have the conclusion in the theorem.
To prove the first part of (1), by by Cramér-Wold device, we only need to show that To show this, we use Stein's method. Consider sums of the form V = ∑ i∈J ξ i , where J is an index set and ξ are random variables with Eξ i = 0, and E(V 2 ) = 1. The following assumption restricts the dependence between {ξ i : i ∈ J}.
Assumption 1. [Chen and Shao , 2005, p. 17] For each i ∈ J there exists K i ⊂ L i ⊂ J such that ξ i is independent of ξ K c i and ξ K i is independent of ξ L c i .
We will use the following theorem in proving Theorem 1.
where Lip(1) = {h : R → R; h ≤ 1}, Z has N(0, 1) distribution and with η i = ∑ j∈K i ξ j and θ i = ∑ j∈L i ξ j , where K i and L i are defined in Assumption 1.
For e ∈ G, let where {J e = i} means the edge connects two observations from sample i.
For k ∈ {1, . . . , N}, let where {g k = i} means node k is from sample i.
For j ∈ J, let η j = ∑ k∈K j ξ k , θ j = ∑ k∈L j ξ k . By Theorem 2, we have sup h∈Lip (1) We next check the covariance matrix of (U 1 , . . . ,U p−1 ). The diagonal elements of this matrix are all 1's .
So the covariance matrix can be written as D − vv T where D is a diagonal matrix with the ith diagonal Since the covariance matrix is by natural non-negative definite, when it is full rank, it is positive definite.
We next show result (2). Notice that σ 2 i can be re-written as We next show (3). The diagonal elements of Σ are σ 2 i 's. The off-diagonal elements are, for i = j, Then, the leading terms of Σ can be decomposed asD + (a 0 − b 0 )uu T , whereD is a diagonal matrix with is strictly larger than 0 (this quantity is strictly increasing in a 0 and equals 0 when a 0 = 0), Σ is of full rank.

First direction
Second direction Stage I sample Stage II sample

First direction
Second direction Stage III sample Stage IV sample

First direction
Second direction Stage III sample Stage IV sample  (2014), where each yellow node represents the expression level of a gene, each green node represents the overall methylation level of a gene, and each blue node represents the copy number of a gene.  (2014), where each yellow node represents the expression level of a gene, each green node represents the overall methylation level of a gene, and each blue node represents the copy number of a gene.