A Comparison of Aggregate P-Value Methods and Multivariate Statistics for Self-Contained Tests of Metabolic Pathway Analysis

For pathway analysis of genomic data, the most common methods involve combining p-values from individual statistical tests. However, there are several multivariate statistical methods that can be used to test whether a pathway has changed. Because of the large number of variables and pathway sizes in genomics data, some of these statistics cannot be computed. However, in metabolomics data, the number of variables and pathway sizes are typically much smaller, making such computations feasible. Of particular interest is being able to detect changes in pathways that may not be detected for the individual variables. We compare the performance of both the p-value methods and multivariate statistics for self-contained tests with an extensive simulation study and a human metabolomics study. Permutation tests, rather than asymptotic results, are used to assess the statistical significance of the pathways. Furthermore, both one- and two-sided alternative hypotheses are examined. In the human metabolomics study, many pathways were statistically significant, although the majority of the individual variables in the pathway were not. Overall, the p-value methods perform at least as well as the multivariate statistics for these scenarios.


Introduction
In metabolomics, we are not only interested in changes for individual metabolites, but also groups of related metabolites, which may be metabolites in the same class, metabolites in the same pathway, or other groups of bio-signatures. For simplicity, henceforth, all these categories will be referred to as "pathways." In particular, we are interested in the cases where the individual metabolites miss the cut-off for statistical significance from univariate analyses, but in aggregate are found to be statistically significant. For example, suppose we observe eight metabolites in a pathway with p-values of 0.07. If the pair-wise correlations (unless otherwise stated, "correlation" refers to the Pearson correlation) are 0.99, then we expect the aggregate p-value to be similar to an individual p-value. However, if these are statistically independent, then the Fisher meta-analysis [1] p-value = 0.0003. So the aggregate p-value could range from 0.07 (all correlations equal to 1) to 0.0003 (independence). Hence, it is desirable to formally test whether a pathway has changed.
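The aggregate p-value for the independent case can be checked directly: for an even number of degrees of freedom the chi-squared tail probability has a closed form, so no statistics library is needed. This is an illustrative sketch (the function name is ours; the paper's analyses were done in R):

```python
import math

def fisher_combined_pvalue(pvals):
    """Fisher's method: X = -2 * sum(log(p_i)) has a chi-squared
    distribution with 2m degrees of freedom when the m p-values are
    independent.  For even df = 2m, the chi-squared survival function
    has the closed form  P(X > x) = exp(-x/2) * sum_{i=0}^{m-1} (x/2)^i / i!."""
    m = len(pvals)
    x = -2.0 * sum(math.log(p) for p in pvals)
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(m))

# Eight metabolites, each with p = 0.07, assumed independent:
p = fisher_combined_pvalue([0.07] * 8)
print(round(p, 4))  # -> 0.0003, matching the aggregate p-value above
```

With a single p-value the combination is just that p-value, so `fisher_combined_pvalue([0.5])` returns 0.5, which is a convenient sanity check.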
In genomics, there are two categories of pathway analysis: (1) competitive tests and (2) self-contained tests [2,3]. Competitive tests assess whether there are more changes for those genes in a given pathway than those not in that pathway. Hence, the scenario in the previous paragraph would not be detected if a p-value < 0.05 is defined as a metabolite "changing." The self-contained tests assess whether the pathway has changed without regard to genes in other pathways. These tests are of more interest for our applications and will be the focus of this paper.
For genomics pathway analysis, the methods for self-contained tests often involve combining the p-values, and in some cases "data modeling" (e.g., logistic regression, principal components analysis) [2][3][4][5] is performed, but various multivariate statistics, such as Hotelling's T 2 statistic [6], are not considered. Conversely, in the literature of multivariate methods, the aforementioned methods are not considered, but instead multivariate statistics are employed [7][8][9][10][11][12][13][14]. Some of these methods, such as Hotelling's T 2 statistic, require the inversion of the sample covariance matrix, which is not possible when the number of observations is less than the number of variables, as is typically the case for -omics data. However, other multivariate statistics such as those proposed by Dempster [7,8], Bai and Saranadasa [9], and Srivastava and Du [10] do not require the inversion of the sample covariance matrix. The distributions of some of these statistics, such as the Bai-Saranadasa statistic [9], rely on asymptotic results, which require even larger sample sizes. Thus, in genomics, many of these statistics will not apply. However, while genomics data sets typically have over 10,000 variables, metabolomics data sets often have fewer than 1,000 (chemo-centric, rather than ion-centric [15]). Furthermore, the sizes of the pathways for metabolomics are also significantly smaller, with many containing fewer than 20 metabolites. Thus, these multivariate statistics can apply in many cases for metabolomics data.
In the next section, we describe various methods that aggregate p-values and methods that employ multivariate statistics. Only one "data modeling" approach is considered because, for the global models mentioned in [3], logistic regression performed on multiple principal components often failed to converge for the smaller sample sizes in the simulation study. Many of the tests described are two-sided tests; however, we often have pre-specified alternatives, i.e., we expect the metabolites in the pathway to change in a certain direction, so statistics that can take these into account are also desirable. We thus adapt several methods for one-sided alternatives. These statistical tests are compared in a simulation study and are then compared in an application to a human metabolomics study.

Test Statistics for Pathway Analysis
The following statistics are compared for testing whether a pathway has changed.
1. Fisher's method. Fisher's method [1] is often used in meta-analysis for testing the same hypothesis across different data sets. So in this context, each member of the pathway is seen as a representative of the same biology. Hence, this test will not work well for the case where one variable changes, but the others do not. However, this is not of primary interest for our applications. Let $p_i$, $i = 1, 2, \ldots, m$, be the set of p-values for the m members of the pathway. For example, these could be the p-values that result from performing two-sided t-tests for each metabolite. If the variables are independent, then the p-values have a Uniform(0,1) distribution, and thus the statistic $X = -2\sum_{i=1}^{m}\log(p_i)$ has a chi-squared distribution with $2m$ degrees of freedom. Since, in the scenarios examined, we are considering related metabolites, which often have strong correlations, we use a permutation test as described by Fridley [3] instead of the chi-squared test.
2. Tail Strength. The tail strength measure [4] was developed to give an overall measure of significance for a set of hypotheses and has also been used to test significance of pathways [2,3]. The test statistic compares the ordered p-values to the expected values of the Uniform(0,1) order statistics: $TS = \frac{1}{m}\sum_{k=1}^{m}\left(1 - p_{(k)}\frac{m+1}{k}\right)$, where $p_{(1)}$ is the smallest p-value and $p_{(m)}$ is the largest p-value. This test is approximately normal when m is large, which is not the case in our scenarios, so as with (1) we determine the distribution with a permutation test. Alternatively, since the variables are correlated, the moments could also be estimated from the permutation distribution, but this was not done here.
3. Adaptive Rank Truncated Product. For the adaptive rank truncated product method (ARTP) [5], the J smallest p-values are used for the statistic $W_J = p_{(1)}p_{(2)}\cdots p_{(J)}$, where J is chosen to minimize the p-value. More specifically, first permute the group labels B times in order to generate B data sets. Next, for each of these data sets compute the m p-values $p_1^{(b)}, \ldots, p_m^{(b)}$. Then for each truncation point J, $1 \le J \le m$, and each data set b, $0 \le b \le B$ (b = 0 corresponds to the original data set), compute $W_J^{(b)} = \prod_{i=1}^{J} p_{(i)}^{(b)}$. Now for each b and J, compute its estimated p-value with Ge's algorithm: $\hat{q}_J^{(b)} = \frac{1}{B}\sum_{b'=1}^{B} I\left(W_J^{(b')} \le W_J^{(b)}\right)$, where I is the usual indicator function. Next, let $\mathrm{MinP}_b = \min_{1 \le J \le m} \hat{q}_J^{(b)}$. Then the p-value for the adaptive rank truncated product is given as $\frac{1}{B}\sum_{b=1}^{B} I(\mathrm{MinP}_b \le \mathrm{MinP}_0)$.

4. Principal Components Analysis (PCA). There are various ways a principal components analysis (PCA) can be performed: one may simply choose a set number of components, or choose the first k components so that a certain percentage of the total variation is accounted for (such as 80%) [3], and then perform logistic regression on these components. Because the numbers of variables in the pathways under consideration are low, we simply choose the first component, and then perform a two-sample t-test. One advantage of this method is that a single score can be obtained for each sample and fold changes computed.
5. Hotelling's $T^2$ Test. Under the null hypothesis, $\frac{n_1+n_2-m-1}{(n_1+n_2-2)m}\,t^2 \sim F(m,\, n_1+n_2-1-m)$, where $t^2 = \frac{n_1 n_2}{n_1+n_2}(\bar{X}_1-\bar{X}_2)' S_p^{-1} (\bar{X}_1-\bar{X}_2)$, with $\bar{X}_1$ and $\bar{X}_2$ representing the usual vectors of sample means and $S_p$ representing the pooled estimate of the covariance matrix, i.e., $S_p = \frac{(n_1-1)S_1 + (n_2-1)S_2}{n_1+n_2-2}$, with $S_1$ and $S_2$ representing the usual sample covariance matrices [6]. As one can see, this requires multivariate normality and the same covariance matrix for each group, and requires that the total number of observations $(n_1+n_2)$ be greater than $(m+1)$. If the normality assumption is questionable, a permutation distribution could be used instead of the F-distribution. 6. Dempster's Test. Dempster also developed multivariate statistics for these scenarios [7,8].
This test statistic is given by $DM = \frac{\frac{n_1 n_2}{n_1+n_2}\,\lVert\bar{X}_1-\bar{X}_2\rVert^2}{\operatorname{tr}(S_p)}$, where tr represents the trace function. Let $n = n_1 + n_2 - 2$. When the data are sampled from a multivariate normal distribution, this has an approximate F-distribution with r and nr degrees of freedom, where r is given by $r = \frac{(\operatorname{tr}\Sigma)^2}{\operatorname{tr}(\Sigma^2)}$ [9,10] and can be estimated by $\hat{r} = \frac{m\hat{a}_1^2}{\hat{a}_2}$, with $\hat{a}_1 = \operatorname{tr}(S_p)/m$ and $\hat{a}_2 = \frac{1}{m}\left[\operatorname{tr}(S_p^2) - \frac{(\operatorname{tr}S_p)^2}{n}\right]$ [10]. Rather than using the approximation for the test statistic's distribution, which relies on both the F-approximation and the approximation for the degrees of freedom, a permutation-based value is used instead. Like Hotelling's $T^2$, this requires the same covariance matrix for each group, but it has a strong advantage over Hotelling's $T^2$ because the test statistic does not require the inversion of the sample covariance matrix.
7. Bai-Saranadasa Test. Bai and Saranadasa [9] also developed a test statistic that does not require the inversion of the sample covariance matrix. Let $n = n_1 + n_2 - 2$. Their test statistic is given by $BS = \frac{\frac{n_1 n_2}{n_1+n_2}\,\lVert\bar{X}_1-\bar{X}_2\rVert^2 - \operatorname{tr}(S_p)}{\sqrt{\frac{2n(n+1)}{(n+2)(n-1)}\left[\operatorname{tr}(S_p^2) - \frac{(\operatorname{tr}S_p)^2}{n}\right]}}$. This statistic has an asymptotic N(0,1) distribution, but since the sample sizes may be too small, we again use a permutation distribution of the statistic to determine the p-value.
8. Srivastava-Du Test. Srivastava and Du [10] also developed a multivariate test statistic that does not require the inversion of the sample covariance matrix. This test statistic is given by $SD = \frac{\frac{n_1 n_2}{n_1+n_2}(\bar{X}_1-\bar{X}_2)' D_s^{-1}(\bar{X}_1-\bar{X}_2) - \frac{nm}{n-2}}{\sqrt{2\left[\operatorname{tr}(R^2) - \frac{m^2}{n}\right]c_{m,n}}}$, where $D_s$ is the matrix consisting of the sample pooled variances on the diagonal and zero otherwise, $c_{m,n} = 1 + \operatorname{tr}(R^2)/m^{3/2}$, and R is the matrix obtained as $R = D_s^{-1/2} S_p D_s^{-1/2}$, where $D_s^{-1/2}$ is the matrix with the reciprocals of the pooled standard deviations along the diagonal and zero otherwise. Similar to (1)-(3), (6), (7), a permutation distribution is used to compute the p-values.
9. One-Sided Sen-Type Statistic. For one-sided alternatives, we propose a simple modification of statistics of the form (2.4) given in Sen [11], which was used in the context of union-intersection tests for one-sample problems: $T_n(b) = \frac{\sqrt{n}\, b'\bar{X}}{\sqrt{b' S b}}$, where $\bar{X}$ and S are the usual one-sample mean vector and covariance matrix, for testing a series of hypotheses of the form $H_{0,b}\colon b'\theta = 0$. We propose the natural two-sample analog. Let a represent the vector of coefficients related to the hypothesized direction of change. For example, if one is considering four variables and expects them to increase under one experimental condition (i.e., one expects that the means for group 2 are greater than those for group 1), then a = (1,1,1,1); or if we expect the first two variables to increase, but the second two variables to decrease, then a = (1,1,-1,-1). The proposed statistic for one-sided alternatives is $T_0 = \frac{a'(\bar{X}_2 - \bar{X}_1)}{\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right) a' S_p a}}$. As with (6)-(8), this does not require inversion of the sample covariance matrix. Furthermore, we are not concerned with the asymptotic distribution, as a permutation distribution will be used.
11. Additional one-sided alternatives. In the literature there are several publications concerning testing whether the parameters fall in the positive orthant, i.e., where at least one of the mean differences is positive, and the others are non-negative [12][13][14]. The common form of testing this is by using the test statistic in conjunction with the condition that the sum of the sample mean differences is positive: for a level of 0.05, reject if the test statistic has p-value less than 0.10 and the sum of the mean differences is positive [12].
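As an illustration of one of the simpler statistics above, the tail strength measure from item 2 can be sketched in a few lines. This is our own minimal Python sketch (the paper's computations were done in R; the function name is ours):

```python
def tail_strength(pvals):
    """Tail strength: TS = (1/m) * sum_k (1 - p_(k) * (m+1)/k), where
    p_(1) <= ... <= p_(m) are the ordered p-values.  Under the null with
    independent tests, p_(k) has expectation k/(m+1), so TS is centered
    at 0; large positive values indicate an excess of small p-values."""
    m = len(pvals)
    ordered = sorted(pvals)
    return sum(1.0 - p * (m + 1) / k for k, p in enumerate(ordered, start=1)) / m

# p-values sitting exactly at their null expectations give TS = 0 ...
print(tail_strength([0.25, 0.5, 0.75]))      # -> 0.0
# ... while the running example (eight p-values of 0.07) gives a large TS.
print(round(tail_strength([0.07] * 8), 3))   # -> 0.786
```

In practice the significance of the observed TS would be judged against its permutation distribution, as described above, rather than the normal approximation.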

Data
Simulation Study. For the simulation study, we use sample sizes and numbers of variables often seen in metabolomic animal studies, which typically have low numbers of animals (of the 100 most recently analyzed data sets at Metabolon, 39 were rodent studies with median group size = 5, and 38 of the 39 had group sizes from 2 to 8 animals) and small pathway sizes. We use $n_1 = n_2 = 5$, $n_1 = n_2 = 10$, and m = 4, 8. The data are simulated from multivariate normal distributions with the same covariance matrix for each group. The correlation matrices are of the form where all the off-diagonal elements are equal to ρ. Let $R_1, R_2, R_3, R_4$ represent the correlation matrices where ρ = 0, 0.5, 0.7, and 0.9, respectively. For each permutation test, 1,000 permutations were used to determine the p-value, i.e., the group labels were permuted 1,000 times, and then the test statistic was computed for each permuted data set. The p-value is computed as the proportion of these values at least as large as the observed value. More formally, let $Y_0$ represent the test statistic (from (1)-(13)), and let $Y_1, Y_2, \ldots, Y_{1000}$ represent the test statistics generated from the permuted data sets. Then $p = \frac{1}{1000}\sum_{k=1}^{1000} I(Y_k \ge Y_0)$, where p represents the empirical p-value. Next, 1,000 simulation runs were performed to determine the empirical type I error and power for each combination of parameters under consideration, i.e., the power or type I error is equal to $\frac{1}{1000}\sum_{j=1}^{1000} I(p_j < \alpha)$, where α is the level of the test and $p_j$ is the p-value computed from the j-th simulation run. For the one-sided alternatives, the tests used $a = \mathbf{1}_4$ or $a = \mathbf{1}_8$ (i.e., all mean changes are positive). All simulations were performed in R [16], and the computation of Hotelling's $T^2$ was performed with the package "Hotelling" [17].
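To make the permutation scheme concrete, here is a minimal Python sketch (the actual simulations were run in R) using Dempster's statistic from item 6, which is convenient here because it needs only the trace of the pooled covariance matrix, not its inverse. All function names and the toy data are ours:

```python
import random

def mean_vec(rows):
    # Column-wise means of a list of equal-length tuples.
    n, m = len(rows), len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(m)]

def tr_pooled_cov(g1, g2):
    # Trace of the pooled covariance matrix = sum of the pooled variances.
    n1, n2, m = len(g1), len(g2), len(g1[0])
    x1, x2 = mean_vec(g1), mean_vec(g2)
    ss = [sum((r[j] - x1[j]) ** 2 for r in g1) +
          sum((r[j] - x2[j]) ** 2 for r in g2) for j in range(m)]
    return sum(ss) / (n1 + n2 - 2)

def dempster_stat(g1, g2):
    # DM = (n1*n2/(n1+n2)) * ||xbar1 - xbar2||^2 / tr(S_p)
    n1, n2 = len(g1), len(g2)
    x1, x2 = mean_vec(g1), mean_vec(g2)
    num = sum((a - b) ** 2 for a, b in zip(x1, x2)) * n1 * n2 / (n1 + n2)
    return num / tr_pooled_cov(g1, g2)

def permutation_pvalue(g1, g2, stat=dempster_stat, n_perm=1000, seed=0):
    # Empirical p-value: proportion of permuted statistics >= the observed one.
    rng = random.Random(seed)
    pooled, n1 = list(g1) + list(g2), len(g1)
    y0 = stat(g1, g2)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # permute the group labels
        if stat(pooled[:n1], pooled[n1:]) >= y0:
            count += 1
    return count / n_perm

# Two groups of 5 with four variables and a clear mean shift of ~3:
g1 = [(0.0, 0.1, -0.1, 0.2), (0.3, -0.2, 0.0, 0.1), (-0.1, 0.2, 0.1, -0.3),
      (0.2, 0.0, -0.2, 0.1), (-0.2, -0.1, 0.2, 0.0)]
g2 = [(3.1, 2.9, 3.0, 3.2), (2.8, 3.1, 3.2, 2.9), (3.0, 3.3, 2.9, 3.1),
      (3.2, 2.8, 3.1, 3.0), (2.9, 3.0, 2.8, 3.3)]
p = permutation_pvalue(g1, g2, n_perm=1000, seed=42)
print(p)  # small: a random relabeling rarely matches the observed separation
```

With only two groups of 5 there are just 252 distinct label splits, so the permutation p-value has limited resolution, which is one reason the simulation study uses 1,000 permutations and assesses power over 1,000 runs.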
Insulin Resistance Human Metabolomic Study. Next, we apply these methods to a large human metabolomics data set concerning insulin resistance [18], which is a data set that has been previously published in this journal. Here the comparison is insulin resistant ("IR") subjects ($n_1$ = 138) vs. insulin sensitive ("IS") subjects ($n_2$ = 261), as defined in the paper. There are many more subjects in this study than in most metabolomics data sets, as it was used for human biomarker discovery. More details on the data processing are given in that paper. All statistics are performed on the log-transformed data.
This data set shows many of the difficulties in performing metabolic pathway analysis in practice, many of which are similar to problems encountered in genomics pathway analysis. For example, as genes belong to multiple pathways [19,20], some metabolites belong to multiple pathways. For example, using KEGG [21], alanine is listed in 17 pathways, many of which overlap. Furthermore, incomplete, limited, or erroneous annotations and information pose problems in classifying genes into pathways [19,20], which also occurs in metabolomics. For example, using KEGG [21], some metabolites are not assigned to any pathways, such as 1-methyladenosine, which has one reaction listed, but no pathways listed. Additionally, in metabolomics, not every element in a pathway may be observed, depending on the concentration of the metabolite and the technology used to measure its levels. Note that as the statistics used are functional class scoring methods [19], any specific relationships within a pathway are not modeled [19], and the analyses performed do not take into account relationships of the pathways to each other [19,20].
For this data set, each metabolite was assigned to a single pathway as defined by in-house experts, who made use of public databases such as KEGG [21]. As mentioned previously, "pathway" is being used to refer to any group of related metabolites, so metabolites in the same class may be grouped, even though they do not belong to the same physical pathway. The pathways are larger in genomics studies, where the minimum and maximum numbers of elements in the pathways considered are often 10 and 100-200 genes, respectively [20]. However, virtually none of the metabolic pathways would satisfy these criteria, as the metabolic pathways tend to be much smaller. Here, pathways with only one representative were excluded from the analysis. Since this data set had large sample sizes, the permutation distributions for each statistic were determined from 10,000 permutations, rather than the 1,000 permutations used for the simulation study.

Simulation Study
For all tables, MU is the mean difference, σ the vector of the standard deviations, ρ the pairwise correlation, N = $n_1$ = $n_2$, FX = Fisher's statistic using the chi-squared distribution, FP = Fisher's statistic using the permutation distribution, TS = tail strength statistic, ARTP = adaptive rank truncated product, PCA = the results from performing the two-sample t-test on the first principal component, HT = Hotelling's $T^2$, BSN = Bai-Saranadasa statistic using the normal approximation, BSP = Bai-Saranadasa statistic using the permutation distribution, DM = Dempster's statistic, SD = Srivastava and Du's statistic, and T0 = the two-sample analog of Sen's statistic given in (9).
We first assess the type I error for each of the tests under the various parameter combinations listed in the previous section. The chi-squared statistic for (1) (FX) and the normal approximation for (7) (BSN) are also examined in order to see how close these are to the desired type I error. A level of α = 0.05 was used for each test. As we can see from Tables 1 and 2, the non-permutation based measurements (FX, BSN) have type I errors larger than the nominal levels, especially with highly correlated variables, as expected. All the others (the permutation-based measures) have type I errors at the nominal level. The results for the one-sided tests (not shown) were similar. Note that if the true type I error is 0.05, then the standard error of its estimate is $\sqrt{0.05 \times 0.95 / 1000}$, which is approximately 0.007.
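That standard error is the usual binomial-proportion formula applied to 1,000 simulation runs, and can be checked directly:

```python
import math

# Standard error of an estimated type I error rate of 0.05 based on
# 1,000 independent simulation runs: sqrt(p * (1 - p) / n).
se = math.sqrt(0.05 * 0.95 / 1000)
print(round(se, 4))  # -> 0.0069
```

So an empirical type I error anywhere in roughly 0.05 ± 0.014 (two standard errors) is consistent with the nominal level.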
Because of the inflated type I errors, FX and BSN are not considered henceforth. The empirical power is computed for the combinations of means and covariances defined in the previous section. Both Bai and Saranadasa [9] and Srivastava and Du [10] have shown that asymptotically, Dempster's statistic has the same power as the Bai-Saranadasa statistic. For the small sample sizes used in the simulation runs, these two statistics also have essentially equivalent performance, as seen in Fig 1, which plots the empirical powers of one against the other (the line y = x is shown for comparison) for all the combinations shown in S1 and S2 Tables (the two-sided tests). Since these two tests are essentially equivalent, the remaining comparisons with these two will reference only the BSP values.
In the comparison of the p-value methods performed by Fridley et al., the Fisher statistic (FP) was the top overall performer [3]. Fig 2 shows the performance of the other tests relative to FP for the two-sided tests. Those above the line y = x have higher power than FP, and those below the line have lower power. From Fig 2, it is clear that Hotelling's test (HT) performs very differently from the others. In many of the parameter combinations, it has much lower power than the other methods. To see this more specifically, Fig 3 shows the empirical powers for one such combination. The power to detect the change for a single variable is 0.29, and HT performs even worse than that (also see S1, S2, and S3 Tables). This low power is the result of the low sample size relative to the number of variables, as 10 subjects (two groups of 5) is the minimum number of subjects required for the inversion of the sample covariance matrix, and this is the only method that uses the inverse of the sample covariance matrix in the computation of the statistic. There are some cases where the power for HT is much larger than the others. For example, when there are two groups of 10 subjects each, the power of HT to detect the mean difference of $m_{13}$ = (0.3,0,0,0) with all standard deviations 0.3 ($\sigma_{11}$) is much higher than the others, and much higher than its ability to detect $m_{12}$ = (0.3,0.3,0.3,0.3)! Furthermore, this power increases with increasing correlation, when a decrease would be expected (see the bottom four rows of S1, S2, S4, and S5 Tables). This is very counterintuitive, but by examining the two-variable case, we can gain some insight. Suppose there are two variables with correlation ρ and standard deviation 1.
For the same sample sizes, we evaluate the quadratic form with the sample means and sample covariance matrix replaced by their expected values: $(\mu_{11}-\mu_{12},\ \mu_{21}-\mu_{22})\,\Sigma^{-1}\,(\mu_{11}-\mu_{12},\ \mu_{21}-\mu_{22})' = \frac{1}{1-\rho^2}\left[(\mu_{11}-\mu_{12})^2 + (\mu_{21}-\mu_{22})^2 - 2\rho(\mu_{11}-\mu_{12})(\mu_{21}-\mu_{22})\right]$. Thus, when the mean difference for each variable is μ, this reduces to $\frac{2\mu^2(1-\rho)}{1-\rho^2} = \frac{2\mu^2}{1+\rho}$, and if the mean difference for variable one is μ but for variable two is 0, then we have $\frac{\mu^2}{1-\rho^2}$. For independent variables this is $2\mu^2$ vs. $\mu^2$. However, when ρ = 0.9, the first expression is approximately $1.05\mu^2$, while the second is approximately $5.26\mu^2$! This is definitely not a desirable property for the power. The other multivariate statistics behave as desired.
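These two expressions can be verified numerically from the explicit inverse of the 2x2 correlation matrix (a small sketch; the function name is ours):

```python
# Population quadratic form (mu1 - mu2)' Sigma^{-1} (mu1 - mu2) for two
# variables with unit variances and correlation rho, using the explicit
# inverse  Sigma^{-1} = (1/(1 - rho^2)) * [[1, -rho], [-rho, 1]].
def quad_form(d1, d2, rho):
    return (d1 * d1 + d2 * d2 - 2.0 * rho * d1 * d2) / (1.0 - rho * rho)

mu, rho = 1.0, 0.9
both = quad_form(mu, mu, rho)   # both variables shift by mu: 2*mu^2/(1+rho)
one = quad_form(mu, 0.0, rho)   # only one variable shifts: mu^2/(1-rho^2)
print(round(both, 2), round(one, 2))  # -> 1.05 5.26
```

At ρ = 0 the same function returns 2μ² and μ², reproducing the independent case, so the reversal really is driven by the correlation entering the inverse covariance matrix.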
From Fig 2, we see that SD and ARTP appear to be the most competitive methods with FP, while PCA, TS, and BSP often underperform compared to the others. To see some of the patterns more clearly, the empirical power for SD compared to BSP is shown in Fig 4 for the two-sided tests. For the mixed standard deviations, TS clearly underperforms FP and ARTP; ARTP often outperforms FP for the mixed standard deviations, while for many cases where the standard deviations are the same, it underperforms FP. Fig 6 shows a comparison with just the top 3 performers: FP, ARTP, and SD. Finally, to see the effect of larger sample sizes, FP was compared to SD for four variables with two groups of 20 subjects each and two groups of 50 subjects each: their empirical powers are virtually identical (see S6 Table).
For one-sided tests, comparing S1 Table to S4 Table and S2 Table to S5 Table, one can clearly see the improvement in power for the one-sided tests, as expected. Next, we compare the top 3 performers for the two-sided tests (FP, ARTP, SD) and T0. These are shown in Fig 7, with FP compared against the other three. Here, all combinations were used except for those where the mean vector contained only one non-zero element ($m_{13}$ and $m_{23}$). T0 generally underperforms against the others, except for independent variables. The others are fairly comparable, but FP consistently outperforms SD for the independent variables (though these two are similar for two-sided tests; see S1 and S2 Tables for the ρ = 0 rows).
Overall, we see that many of the methods are comparable for many of the combinations, but HT often has much lower power, sometimes lower than the power for a single variable. In most cases, the p-value methods perform at least as well as the multivariate statistics. In particular, FP is usually among the top-ranking statistics. ARTP is competitive with FP when the variances are unequal. SD is typically the top performer among the multivariate statistics and is often comparable to the top p-value methods, with the exception of independent variables for the one-sided tests.

Human Metabolomics Study
Table 3 shows a summary of the results from performing Welch's two-sample t-test for each metabolite. As mentioned previously, any pathway where only one metabolite was observed was dropped from the analysis, leaving 39 pathways. Of these, only five pathways contain at least 10 metabolites, with 24 as the maximum. There is only one pathway where every member is significant at the 0.05 level (P02 = benzoate metabolism; see Table 3 and S7 Table), and as expected, the p-values for assessing the significance of this pathway are very strong (Table 4). Pathways such as P01, where none of the individual members are even close to significant, and pathways such as P17, which has many strong p-values, have the pathway statistics one would expect. The results for some of the other pathways are not obvious from simply examining the individual p-values. In particular, there are some cases where none of the individual p-values reach the threshold for significance, but the pathway statistic is indeed significant. Two such pathways are P13 (p-values 0.1334, 0.0540) and P31 (p-values 0.1332, 0.0563), whose p-values from Fisher's statistic are 0.0449 and 0.0475, respectively. The individual p-values for P13 and P31 are similar to those for P08 (0.1444, 0.0633), but the p-value for P08 from Fisher's statistic is 0.0857.
The difference between these results can be explained by the differences in the correlations, which are positive in general (see S1 Fig). For the P13 and P31 pathways, the pair-wise correlations (computed from the pooled covariance matrices) are -0.09 and -0.08, respectively, so for both of these pathways, the two members are essentially independent. However, for P08, the pair-wise correlation between its two members is 0.81, showing that these two variables are highly redundant; hence combining the signals does not strengthen the overall results.
Although in most cases the p-values are similar for the different methods, there are a few exceptions. For example, for pathway P39 none of the statistics for the pathway are significant at the 0.05 level (e.g., Fisher's p-value is 0.1835), except for Hotelling's statistic, which has a p-value < 0.0001. Of the four metabolites in this pathway, there is only one strong p-value, and the pair-wise correlations are 0.48, 0.51, 0.57, 0.84, 0.85, and 0.88. This seems to mirror the results from the simulation study, where Hotelling's statistic had stronger empirical powers than the others for the case of one strong change and highly correlated variables.
Overall, using the pathway analyses discussed, more than half the pathways are significant at the 0.05 level (before correcting for multiple comparisons) as seen in Table 4, with several pathways that are statistically significant even though fewer than half of their members reached the 0.05 level. As the sample sizes are very large here, it is not surprising that many of the measures give very similar results, with the exception of PCA, which in some cases performed much worse. The p-values for the top performers from the simulation for the two-sided tests (Fisher's statistic, ARTP, and SD) are given in Table 4.
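The P13 result can also be understood through the naive chi-squared version of Fisher's method: because the two members of P13 are essentially independent, the chi-squared combination (which assumes independence) lands close to the permutation p-value of 0.0449 reported above. A small sketch (function name ours), using the closed-form survival function of a chi-squared variable with 4 degrees of freedom:

```python
import math

def fisher_chi2_pvalue(p1, p2):
    # Fisher's combined statistic for two p-values: X = -2*(ln p1 + ln p2)
    # has a chi-squared distribution with 4 df under independence; its
    # survival function has the closed form exp(-x/2) * (1 + x/2).
    x = -2.0 * (math.log(p1) + math.log(p2))
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

# Pathway P13's two individual p-values from Table 3:
print(round(fisher_chi2_pvalue(0.1334, 0.0540), 4))  # -> 0.0427
```

For the highly correlated pair in P08 this independence assumption fails, which is why the permutation distribution, rather than the chi-squared approximation, is used throughout.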

Discussion
An extensive simulation study was performed to compare various statistics for assessing the statistical significance of a pathway. The statistics assessed were mainly either based on an aggregation of p-values or on multivariate statistics. While all the multivariate statistics require that the covariance matrix is the same for each group, most of the multivariate statistics considered do not require the inversion of the sample covariance matrix, and hence can be applied even for small sample sizes. The sample sizes and variable sizes in the simulation study were chosen to represent sizes seen in animal metabolomic studies. The empirical powers for the p-value aggregation methods (top performers were generally Fisher's statistic and the ARTP statistic) are at least as good as those from the multivariate statistics (top performer was the Srivastava-Du statistic). Furthermore, the power can be improved if appropriate one-sided statistics are implemented. With these sample sizes, permutation distributions are recommended over asymptotic approximations. Additionally, permutation statistics are recommended over test statistics that assume independent variables, such as the chi-squared statistic for Fisher's test, since the variables within a pathway are often highly correlated. The main alternatives considered were those for changes for all metabolites in the pathway, rather than an alternative such as one metabolite being differential while the others are not. Hence, for other alternatives, the multivariate statistics may have an advantage. The p-value methods also have the advantage of being easier to adapt to various statistical designs, i.e., experimental designs that are more complicated than two independent groups.
Supporting Information