An Extension of PPLS-DA for Classification and Comparison to Ordinary PLS-DA

Classification studies are widely applied, e.g. in biomedical research to classify objects/patients into predefined groups. The goal is to find a classification function/rule which assigns each object/patient to a unique group with the greatest possible accuracy (classification error). Especially in gene expression experiments often a lot of variables (genes) are measured for only few objects/patients. A suitable approach is the well-known method PLS-DA, which searches for a transformation to a lower dimensional space. Resulting new components are linear combinations of the original variables. An advancement of PLS-DA leads to PPLS-DA, introducing a so called ‘power parameter’, which is maximized towards the correlation between the components and the group-membership. We introduce an extension of PPLS-DA for optimizing this power parameter towards the final aim, namely towards a minimal classification error. We compare this new extension with the original PPLS-DA and also with the ordinary PLS-DA using simulated and experimental datasets. For the investigated data sets with weak linear dependency between features/variables, no improvement is shown for PPLS-DA and for the extensions compared to PLS-DA. A very weak linear dependency, a low proportion of differentially expressed genes for simulated data, does not lead to an improvement of PPLS-DA over PLS-DA, but our extension shows a lower prediction error. On the contrary, for the data set with strong between-feature collinearity and a low proportion of differentially expressed genes and a large total number of genes, the prediction error of PPLS-DA and the extensions is clearly lower than for PLS-DA. Moreover we compare these prediction results with results of support vector machines with linear kernel and linear discriminant analysis.


Introduction
In discrimination studies, data sets are often handled having high numbers of features but only few samples. Especially for gene expression experiments, where thousands of genes are measured and in comparison only few samples are used, dimension reduction is advantageous as a pre-processing step before the final classification step takes place. There exist a lot of feature extraction methods for dimension reduction. One such method is powered partial least squares discriminant analysis (PPLS-DA) which is a specialized version of the well-known partial least squares discriminant analysis (PLS-DA), which was first introduced in chemometrics by Wold et al. [1] by using the PLS regression [2] for classification purposes. Here the response variable (Y) in the linear model is given in form of indicator variables. Barker & Rayens [3] and Nocairi et al. [4] were the first to formulate PLS-DA accurately. The aim is to reduce the dimensions (number of features) by coordinate transformation to a lower dimensional space. PPLS-DA was introduced by Liland and Indahl in [5] to improve the calculation of the loading weights for better separation of the groups by introducing a power parameter analogously to powered partial least squares (PPLS) [6] and maximizing the correlation between the data matrix and the group memberships, analogous to Fisher's canonical discriminant anal-ysis (FCDA). The optimization criterion in PPLS-DA is therewith not directly aimed at prediction, and therefore the original algorithm does not necessarily yield the best components for class prediction.
Former studies of Telaar et al. in [7] show similar prediction errors (PEs) for PLS-DA and PPLS-DA for most of the analysed data sets, and even lower error rates compared to other classification methods e.g. random forest and support vector machine. Therefore, we try to optimize the power parameter of PPLS-DA towards class prediction in a training set, to see if the prediction result for a test set can be improved. The power parameter and the number of components are determined according to the lowest prediction error of a linear discriminant analysis (LDA) using the PPLS-DA components, taking a crossvalidation approach. Furthermore, we compare the results of this extension and of PPLS-DA with the ordinary PLS-DA with respect to prediction error (PE) for simulated data sets and five publicly available experimental data sets. Finally the PPLS-DA results of the simulated and experimental data sets are compared to those of support vector machine (SVM) with linear kernel and LDA with ten features selected according to the t-test.

Description of a Classification Problem
An n|p data matrix X with n objects and p features together with a vector y~(y 1 , . . . ,y n ) t of group memberships form the basis of a classification problem. Here y i [ G,i~1, . . . ,n is the group label for object i which for example can be given in the form of discrete variables or symbolics. The k-th column of the matrix X is denoted by x k . We assume that each sample belongs to a unique group g and each group has a sample size of n g and a prior probability of group membership of p g . In total G different groups exist and n~P G g~1 n g . In this article, we restrict ourselves to only two different groups G~2. Our results can be extended to more than two groups.
The group information can also be given in the form of a dummy coded matrix Y as follows Y~(y ig ) i~1,...,n,g~1,2 ,y ig equals 1 if sample i belongs to groups g, otherwise the entry equals 0, i~1, . . . ,n and g~1,2. The goal is to determine a function C : R p ?G which assigns to each object a unique group (x.y) with the greatest possible accuracy [8]. In the following we assume that X is centered. Therewith the empirical total covariance matrix T is equal to X t X, and let B denote the empirical between group sum of squares and cross-product matrix, which can be formalized as

Introduction of PLS-DA and PPLS
PLS-DA. Before we start with the detailed description of PPLS-DA, we give a brief introduction to the roots of this method, PLS-DA and PPLS. Let a [ R p be a vector of loadings with a t a~1 and t [ R p a vector containing the mean values of the groups, t~X t Y(Y t Y) {1 . Nocairi et al. [4] showed that the dominant eigenvector of B maximized the covariance between Xa and t in the context of classification. Therefore this eigenvector should be used for the determination of the loading weights vector which was also recommended by Barker & Rayens [3]. An enhanced version of PLS-DA with inclusion of prior probabilities (p g ) in the estimation of B is proposed by Indahl et al. [8]. In this version, the importance of each group no longer depends on the empirical prior probabilities. Therewith a direct opportunity is given to put more weight on special groups for the calculation of the loading weights. Moreover X is transformed to an n|2 matrix Z~XW 0 , {1 and P~diag(fp g Dg~1,2g). Therewith a G|G eigenvalue problem (B [ R p|p ) is reduced to a 2|2 eigenvalue problem (2 = number of groups), because only the between-group covariance matrix according to Z is considered . Afterwards the eigenvalue needs to be back transformed by W 0 .
PPLS. Now we are turning back to PLS regression, to explain the introduction of the power parameter c in PLS regression. The loading weights vector of PLS regression w~½w 1 , . . . ,w p ~X t y DDX t yDD maximizes cov(Xw,y) and can be rewritten as.
w~k : ½cov(y,x 1 ), . . . ,cov(y,x p ) t k : sd(y)½corr(y,x 1 ) : sd(x 1 ), . . . ,corr(y, with a scaling constant k where cov, corr and sd denote covariance, correlation and standard deviation, respectively. Hence, the influence on the loading weights of the correlation and standard deviation part is balanced. Dominating X-variance which is irrelevant for prediction does not lead to optimal models, therefore Indahl et al. [6] propose PPLS which allows the user to control the importance of the correlation part and the standard deviation part by a power parameter c as follows: Here s j denotes the sign of corr(x j ,y g ). The power parameter c[½0,1 is determined such that the correlation is maximal: arg max c[(0, 1) corr(Xw(c),y). Also c equal to 0 and 1 is included in the maximization problem by calculation correlation with a loading weights vector which has only a non-zero entry (for c~0) for the feature with the largest standard deviation and by determining the correlation according to a loading weights vector which has only a non-zero entry for the feature with the largest correlation to y (c~1).

PPLS-DA: Optimization of the Power Parameter According to Correlation
PPLS-DA is designed to deliver components which are optimal for discriminating the cases coded in Y. This optimality can be understood in terms of a correlation approach. Nocairi et al. showed in [4], that correlation is determined by the so called Rayleigh quotient.
Here the correlation is measured by the squared coefficient of correlation r 2 . Maximization of the correlation is therewith equivalent to maximization of the Rayleigh quotient.
The well known solution of the maximization problem (2) is the dominant eigenvector a max of T {1 B. This is exactly the approach of Fisherss canonical discriminant analysis (FCDA) for determining the vector of loadings.
For PPLS-DA, Liland and Indahl combine in [5] the approaches of FCDA and PPLS and further include, like in [8], prior probabilities p g for each group g in the calculation of T and B. The data matrix X is transformed with W 0 (g) to the n|G matrix Z(c)~XW 0 (c), where W 0 (c) contains the possible candidate loading weights vectors as columns.
The power parameter c enables the focus on features which have a high correlation to y g or on features which have a high standard deviation. For the transformed matrix Z(c) the between group sum of squares and cross-product matrix including prior probabilities can be calculated as follows with P~diag(fp 1 ,p 2 g) and V P~d iag(fv i Dv i~n p g =n g ,i~1, . . . ,ng) [5]. V P is an n|n diagonal matrix where the non-zero entry belonging to sample i is the ratio of the prior probability of the group to which i belongs and the corresponding group size times the number of samples n.
The maximization problem of PPLS-DA is as follows.
arg max For c~0 the feature(s) with the highest standard deviation are tested and for c~1 the feature(s) with the highest correlation to the group membership vector are checked separately. We denote the solution of the optimization problem (4) by c max . To avoid singular matrices T P (c) and to get a numerically more stable solution, Liland and Indahl [5] substitute the maximization problem (4) by the maximization problem arg max c[½0,1 cca(Z(c),Y), where cca denotes the canonical correlation. In [9] the authors showed that these procedures lead to equal results.
For each component, c max is determined by maximization of the canonical correlation: In the algorithm of PPLS-DA in the R package pls, the R function optimize is used to search for the maximum. For this purpose, the whole training set is used to find c max . Transforming X into an n|g data matrix Z(c), the maximization problem depends on the number of groups and number of samples. It does not depend on the usually much higher number of features. The final loading weight vector is then w~W 0 (c max )a c max . For a detailed description of PPLS-DA we refer to [5].
In our experience with PPLS-DA, we have observed that optimization of c max does not always lead to the lowest possible prediction error available through other choices of c.

PPLS-DA: Optimizing the Power Parameter c with Respect to Prediction
To improve prediction errors in classification tasks using PPLS-DA, we aim at optimizing the power parameter c with respect to prediction error of LDA. For this, we propose a cross-validation approach, to avoid overfitting. Therefore, we separate in a first step the data into a training and a test set which are disjunct. We now call these sets outer training set and outer test set. Using only the outer training set to optimize the c{value, we use the outer test set to evaluate the computed classification function. In a next step we split the outer training set randomly in N inner different inner training and inner test sets. Because unbalanced data have influence on the estimated classifier, we down-sample the majority group objects to get equal numbers of objects in both groups for the outer training set. Here, the proportion of 0.7 of the smallest group size determines the size of the outer training set for each group. The remaining objects build the outer test set. For the optimization step, we take into account equidistant fixed c{values in [0,1] with step size n c , resulting in a sequence of c{values for the optimization c 1~0 ,c 2 , . . . ,c 1 nc z1~1 . For example for a choice of n c~0 :1, we consider 11 c{values, 0,0:1,0:2 . . . ,0:9,1.
We calculate the prediction error (PE), the proportion of wrongly classified samples of a test set, as a measure for good classification. In Figure 1 a rough overview of our proposed extension is given. In this paper, all cross-validation procedures consist of random samples of the corresponding data sets to the proportions of 0.7 (training set) and 0.3 (test set). For example, a cross-validation with 10 repeats, repeats the sampling 10 times. Utilizing the statistical software R, we use the function cppls (of the R-package pls) for PPLS-DA and the function lda (of the Rpackage MASS). Furthermore, we use the default setting for the priors in the lda function, using the proportions of the groups which are equal in our cases.
We propose an extension for optimizing both, c and the number of components to be used as input for the LDA. The optimization of the c-value depends on the parameters N inner and n c .
In our extension all components are used to optimize the c{value minimizing PE. The number of components and the power parameter are optimized in one procedure. Therewith all components have the same c{value.
In the optimization, for each fixed c{value the optimal number of components of PPLS-DA is determined as follows (see Figure 2): For the N inner different inner test sets, the PE for one up to five components Cj, j~1, . . . ,5 is calculated, all using the same c i resulting in a matrix (PE s,Cj ) s,j , s~1, . . . N inner ,j~1, . . . ,5. Then the average inner PE is calculated for each component. Therewith we select for each c i the smallest mean PE over the N inner -test sets with a corresponding optimal number of components (c i ). We search for the minimal PE leading to c opt with the optimal number of components c opt . For these c opt components and the power parameter c opt , we calculate the corresponding loading weights vectors on the outer training set, and we finally determine the PE of the outer test set.

R Functions Used
(P)PLS-DA implementations. For PPLS-DA, we used the R-function cppls of the R-package pls. We implemented an Rcode for PLS-DA based on [8]. The optimal number of components was determined by a cross-validation on the outer training set for PLS-DA, PPLS-DA with c max and our described extensions of PPLS-DA. For this step, we restricted the maximal number of components to five. The segments of the crossvalidation are randomly chosen at the proportion of 0.7 and 0.3 of the data set. We repeat this procedure 10 times.

Further Classification Methods
SVM. The classification method SVM with a linear kernel searching for a linear hyperplane for the separation of the data is considered for comparison. For this purpose the R package e1071 [10] is applied and the parameter C for the linear kernel is tuned within the interval ½2 {5 ,2 4 using the R-function tune.svm with a cross-validation of 10 steps. The interval for tuning is chosen according to the suggestion of Dettling & Buehlmann [11].
t-LDA. Additionally an LDA is performed using ten features which are filtered based on the outer training set according to a ranking list based on the lowest p-value of the t-test. For the t-test we use the R-package stats.
The same segments of the outer training and outer test set are used across all tested methods for fair comparison.

Data
We investigate simulated data and five publicly available experimental data sets. After preprocessing (like mentioned in the description of the experimental data sets), all experimental data are on the log-scale of gene expression as for the simulated data.
For detailed description of the covariance structure of our data, we use two measures analogous to Saebø et al. in [12]. The condition index, first used in [13], and the absolute value of the covariances between the principal components of X and the response vector as used in [12]. The condition index k j~ffi ffiffiffiffiffiffiffiffiffiffi l 1 =l j p , j~1, . . . ,p is used as a measure for variable dependence, with l k being the kth eigenvalue of cov(X). It can be assumed that l 1 §l 2 § Á Á Á §l p . The increase of the first five condition indexes (k 1 ,k 2 ,k 3 ,k 4 ,k 5 ) reflects the collinearity of the features. A rapid increase means, the features have a strong linear dependence, a weak increase implies a weak dependence. If we now consider the principal components, like in [12], the relevance of a component is measured by means of the absolute value of the covariances (Dcov(z j ,y)D) between the principal component z j~X e j and the class vector y. Here y i equals 1 if sample i belongs to group g~1, otherwise y i equals -1, i~1, . . . ,n. The eigenvector belonging to the kth largest eigenvalue is denoted by e j . Helland and Almøy [14] infer, that data sets with relevant components, which have small eigenvalues, are difficult to predict. The condition index is plotted for the first five largest eigenvalues (scaled to the first eigenvalue) in Figure 3. Figure 4 shows the first 50 largest scaled eigenvalues and the corresponding scaled covariances between z j and y for all experimental data sets and a simulated data set (case 3) investigated.
Simulated data sets. Gene expression data are simulated as normally distributed data, considering the log scale of microarray intensities after normalization: x ij~N (m j ,s 2 b zs 2 t ),i~1 . . . ,n and j~1, . . . ,p. Here s 2 b denotes the biological variance which we chose equal to 0.04, and s 2 t represents the technical variance which we chose in different proportions of the biological variance. Studying a two-class classification objective, we simulate 60 samples per class for the whole data set and p~1000 genes partitioned in an informative part and non-informative part for the classification. For the test set, 30 single samples per class are randomly chosen. The remaining 30 single samples per class constitute the training set. The non-informative part of the data matrix which shows no differences between the two classes, consists of normally distributed random variables with mean m g~8 and biological variance s 2 b~0 :04, as well as different cases of technical variance s 2 t . The informative part contains ten differentially expressed genes (DEGs) with a mean class difference D [7]. We take 5 cases into account. For case 1, 2 and 3, for each DEGs, D is chosen according to the uniform distribution from the interval ½0:1,0:5. Case 1 has a technical variance of zero, case 2 of one-quarter of the biological variance (s 2 We illustrate the data structure for the simulated data on the example of case 3. The condition indexes are 1.00, 1.01, 1.04, 1.05, 1.06. The increase is the weakest for all data sets considered (Figure 3), and therewith the genes are only weakly linear dependent, which is also shown by Figure 4F. This is similar for all simulated data cases. Furthermore, the proportion of DEGs is very low with 1% for all cases.
Experimental data sets. Additionally, we considered five publicly available experimental microarray gene expression data sets which are summarized in Table 1 containing information about the group size, number of genes, proportion of differentially expressed genes and original publication. For the determination of the number of differentially expressed genes (N DEGs ) we use a t-test (from the R-package stats) and an FDR correction [15] (R-package qvalue). We count all genes with a q-value below 0.05. In the following the five data sets are described: The Leukemia data were downloaded from the Whitehead Institute website. We merge the training set and the test set to get a higher sample size and sample from these to get new proportions 0.7 and 0.3 for the training and test set. The R code for data preprocessing from http://svitsrv25.epfl.ch/R-doc/library/ multtest/doc/golub.R is used which is according to [16]. The data set consists of two groups, 25 patients with acute myeloid leukemia and 47 patients with acute lymphoblastic leukemia and 3571 genes.
The condition indexes show a weak increase for this data set (1.00, 1.31, 1.49, 1.537, 1.83). This and the plot of the eigenvalues ( Figure 4A) lead to the assumption of a weak linear dependency between the genes. The more relevant components have the largest eigenvalues ( Figure 4A). Therefore we can expect good prediction performance of this data set. This data set has the highest proportion of DEGs (40.46%, Table 1).
The Lymphoma data set was downloaded from the website http://www.broadinstitute.org/mpr/lymphoma/. The data are GC-RMA normalized. Two groups are considered, 58 patients with diffuse large B-cell lymphomas and 19 patients with B-cell lymphoma, follicular lymphoma. Only genes with a non-zero variance are used in our analysis, which leads to 7129 genes.
The between-variable dependencies are comparable to the Leukemia data set (condition indexes: 1.00, 1.10, 1.40, 1.50, 1.84). The covariance structure ( Figure 4B) is also comparable to those of the Leukemia data set and the total number of DEGs is a little bit higher than for the Leukemia data set, but the proportion on the total number of genes is clearly lower (24.3%, Table 1).
The Breast Cancer data set consists of normalized and filtered data, downloaded from http://homes.dsi. unimi.it/,valenti/ DATA/MICROARRAY-DATA/R-code/Do-Veer-data.R. The normalization was performed according to [17]. In this data set, only the two groups with the highest sample size are included: 34 patients with distant metastases within 5 years and 44 patients without, after at least 5 years. The total number of genes is 4997.
We found again a weak increase in the condition indexes for the first five eigenvalues (1.00, 1.42, 1.77, 1.90 and 1.91), but slightly faster than for the Leukemia and Lymphoma data set ( Figure 3). The eigenvalue plot ( Figure 4C) illustrates also a weak linear dependence between the features. The proportion of DEGs is the lowest for all experimental data sets (1.08%, Table 1).
The Prostate 1 data set contains 52 tumor and 50 non-tumor cases and was downloaded from http://stat.ethz.ch/dettling/ bagboost.html. The preprocessing is described in [18] and the final data set contains 6033 genes.
This data set shows a rapid increase of the condition index from k 1 to k 5 (1.00, 2.96, 3.24, 5.046, 5.397), describing a strong linear dependency of the genes (Figure 3). This property is also indicated by the plot of the eigenvectors ( Figure 4D). This data set also has a high proportion of DEGs (32.26%, Table 1).
We downloaded the Prostate 2 data set, which was already normalized, from http://bioinformatics. mdanderson.org/Tail-Rank/. A description of the normalization can be found at http:// bioinformatics.mda nderson.org/TailRank/tolstoy-new.pdf. In this data set, only the two groups with 41 patients with normal prostate tissue and the 62 patients with primary tumors are in included.
The condition index shows a rapid increase (1.00, 1.71, 2.10, 2.56, 2.93) for the first five eigenvalues (Figure 3), but more moderate than for the Prostate 1 data set. Figure 4E shows that also relevant components have small eigenvalues, which indicates low prediction performance. The proportion of DEGs is very low (1.4%, Table 1) and similar to those of the Breast Cancer data set, but the total number of genes is the largest (42129) for all experimental data sets.

Results
Results for the simulated data are based on 100 repeated simulations, and for the experimental data 100 different outer training and outer test sets are sampled. We present mean PE values of the outer test sets and the corresponding 95% confidence intervals. At first, we describe and compare the results of PPLS-DA using c opt with PPLS-DA using c max , followed by the comparison between PLS-DA, PPLS-DA using c~0:5, t-LDA and SVM.
We calculate confidence intervals for PE as follows: let z be the vector with 100 estimates of prediction errors. The upper bound of the confidence interval is then calculated as u(z)~mean(z)z1:96(z)= ffiffiffiffiffiffiffi ffi 100 p . The lower bound is calculated likewise. If these confidence intervals overlap, we report no significant differences, if they are disjunct, the corresponding PEs are reported as significantly different.

Results for the Simulated Data
PE results. At first we study the dependency of c optimization on the step size n c , and on the number of internal crossvalidation steps N inner . We calculate the mean PE results for N inner~5 ,10,20,30,40,50,60,70,80 and n c~0 :05 or 0:1. Because the corresponding 95% confidence intervals overlap for all different parameter choices for fix N inner (Figure S1), we show all further results for N inner~5 0 and n c~0 :1. Figure 5 illustrates the mean PE results and corresponding 95% confidence intervals for the simulated data and Table 2     Considering the frequency distributions of c opt -values ( Figure S2 shows the corresponding histograms for case 3), the value with the highest frequency for the optimal c-value (c opt ) determined in a cross-validation approach is 0.8. This is in contrast to the values for c max , with 0.5 as highest frequency (Figures S2A and B, see Supplementary Materials).
Different power parameters have large effects on the loading weights, which can be seen in Figure 6, e.g. for the first component for c max and c opt with N inner = 50 and n c = 0.1 for case 3. The first 10 genes, which are simulated as differentially expressed, receive the highest absolute loading weight values for all methods. For c opt , these loading weights of the informative genes are increasing in absolute values, and especially the non-informative genes receive loading weight values near to zero in comparison to the loading weights induced by c max .
Comparing our above findings of the PE with those of the PE of PLS-DA and of PPLS-DA with c~0:5, PLS-DA shows an equal PE to PPLS-DA using c max or c~0:5, for all cases of the simulated data. Therewith also PPLS-DA using c~0:5 shows a significantly lower PE than PLS-DA. The number of components used for PLS-DA is equal to the corresponding number of components used for PPLS-DA using c max or c~0:5. Hence overall, PPLS-DA using c opt uses in average the lowest number of components. Now, we consider the results of the two methods, SVM and t-LDA. For all simulated data cases, the method SVM shows the largest PE for all simulated data cases. The method t-LDA does not show a significant different PE to PPLS-DA using c opt .

Results for the Experimental Data Sets
As for the simulated data, we choose N inner~5 0 and n c~0 :1. In Figure 7 the mean PE results and the corresponding 95% confidence intervals are shown for all five experimental data sets for all considered methods. For PPLS-DA using c max , c opt and c~0:5 and PLS-DA, the average number of of components are shown in Table 3.
Leukemia data set: For this data set, we found no significant differences in the PE of PPLS-DA using c opt compared to PPLS-DA using c max ( Figure 7A For PPLS-DA using c opt we find a significantly lower PE than for t-LDA, but for SVM the PE is similar to that of PPLS-DA using c opt . Lymphoma data set: The PE of PPLS-DA using c opt is significantly lower than the PE of PPLS-DA using c max ( Figure 7B). PPLS-DA using c opt leads in average to 3.1 components and PPLS-DA using c max to 2.9 components.
Considering PLS-DA, we find a significantly lower PE than for PPLS-DA using c max , which is equal to the PEs of PPLS-DA using c opt and c~0:5. PLS-DA used the smallest average number of components (1.8), for PPLS-DA using c max and c~0:5 we find similar number of components in average 2.9 and 2.8.
The method t-LDA shows a significantly higher PE in comparison to PPLS-DA using c opt , and SVM an equal PE to PPLS-DA using c opt .
Breast cancer data set: For this data set, PPLS-DA using c max and PPLS-DA using c opt does not show significantly different PEs ( Figure 7C). The number of components used are also similar for these two methods with 2.4 for PPLS-DA using c opt and 2.3 for PPLS-DA using c max .
Also PLS-DA shows an equal PE and a slightly higher number of components (2.7), compared to PPLS-DA using c opt and c max or c~0:5.
If compared to PPLS-DA using c opt , t-LDA and SVM show similar PE.
Prostate 1 data set: PPLS-DA using c opt shows equal PE to PPLS-DA using c max ( Figure 7D). PPLS-DA using c opt leads in average to 3.4 components, while PPLS-DA using c max uses 2.5 components.
Investigating PLS-DA, the PE is equal to the PE of PPLS-DA using c max or c~0:5. PPLS-DA using c~0:5 and PLS-DA use the largest average number of components, 4.1 and 4 (see Figure S4).
Also t-LDA and SVM show equal PEs to the other methods. Prostate 2 data set: The PE of PPLS-DA using c opt is equal to the PE of PPLS-DA with c max ( Figure 7E). Moreover, PPLS-DA using c max used 2.9 components and PPLS-DA using c opt uses 3.7 components.
PPLS-DA using c~0:5 and PLS-DA show equal PE and significantly higher PE than for PPLS-DA using c max or using c opt . PPLS-DA with c~0:5 and PLS-DA use more components in comparison to all versions of PPLS-DA.
Both methods t-LDA and SVM show a significantly larger PE than PPLS-DA using c max or PPLS-DA using c opt , but a significantly lower PE than PLS-DA and PPLS-DA using c~0:5.

Discussion
The focus of our study is on introduction of an extension for the method PPLS-DA for better classification in high-dimensional datasets, as for example gene expression datasets in biomedicine. The optimization criterion for the power parameter c max in the ordinary PPLS-DA is towards canonical correlation, and does not need to be best for prediction. Our extension of PPLS-DA introduces optimization of c with respect to prediction using an inner cross-validation approach. We carry along comparisons to LDA and SVM, to bring our proposed method into line with these standard classification methods.  Comparison between PPLS-DA using c max and c opt The PEs of the outer test sets for PPLS-DA were improved or showed at least equal values by optimization of the c{value with respect to the prediction with LDA in comparison to PPLS-DA using c max , for all simulated data and the experimental data sets.
Simulated data. Comparing the histograms of c{values found by PPLS-DA using c opt and PPLS-DA with c max , the reason for the lower PE can be traced back to the down-weighting of the non-informative features for the simulated data (see Figure S2 and Figure 6). The loading weights for these features are near or equal to zero. The influence of the features, which are not informative for the discrimination (and can be interpreted as noise), is reduced, because the impact on the calculation of the components is lower for PPLS-DA using c opt than for PPLS-DA with c max . Values of c near to one, leads to preference of features which show a high correlation to the dummy response (in the simulation study these are the differentially expressed genes). Changing the optimization criterion from correlation towards prediction, leads also to a lower average number of components for the simulated data.
Experimental data. For the experimental data, the c-values determined by the canonical correlation (c max ) are larger than the c{values detected by our proposed extension of PPLS-DA. Even if our analyses of simulated data suggested lower PE values for larger choices of c, for the Leukemia data set and the Lymphoma data set also PPLS-DA using c opt shows a significantly lower PE than PPLS-DA using c max . Note, that the true informative genes for the experimental data are not known, and the proportion of differentially expressed genes most likely is much larger than for the simulated data, therefore the comparison of the results for simulated and experimental data is not straightforward. Moreover experimental data are usually noisier than simulated data. The part of reality we have not been able to model in the simulated data might be a sort of noise and data structure that we cannot improve on, regardless our choice of c. If we could remove this part of the noise from the real data, the relative improvements might be just as good as with the simulated data.
Summarizing the findings for the simulated data, PPLS-DA using c opt shows significantly lower PEs than PPLS-DA with c max . For the experimental data, the results are also significantly lower or equal for the extensions considering the PE.
Additionally we had considered three further versions to determine the power parameter towards prediction. These versions for example differ in the number of considered components for the optimization of the power parameter and also one version which optimizes an individual power parameter for each component was studied. The results are similar to those of the extension presented here (data not shown).

Comparison between PPLS-DA using c max and PLS-DA
The development of PPLS-DA followed the development of powered partial least squares as a natural extension of the power methodology to handle discrete responses. Several factors motivated this advancement to PLS-DA. First the application of powers enables focusing on fewer explanatory variables in the loading weights, smoothing over some of the noise in the remaining variables. Second, focus can be shifted between the correlation and standard deviation parts of the loading weights, which is even more important for discrete responses. Finally, the maximization criterion is moved from the between-group variation (B) to the product of the between group variation matrix and the inverse of the within-group variation matrix (W {1 B~T {1 B). This has the effect of moving from covariance maximization to a correlation maximization. Instead of just searching for the space having highest variation between the groups, we also minimize the variation inside the groups, increasing the likeliness of good group separation.
In our study, PPLS-DA with c~0:5 (applying no power parameter) and PLS-DA always show equal PEs for the simulated and the experimental data sets. Hence, for this case the different optimization tasks show no great differences with respect to the PEs of the outer test sets. Including the power parameter, the PE of PPLS-DA using c max is equal to the PE of PLS-DA for all simulated data. Also, the number of components used is in average lower or equal for PPLS-DA using c max than for PLS-DA.
For two of the five experimental data sets (Leukemia and Lymphoma), the PE of PPLS-DA using c max is significantly higher than the PE of PLS-DA. For these data sets, the proportions of differentially expressed genes are large (24.4% and 40.5%) and the genes are only weakly linear dependent (considering the condition indexes). The PE of PPLS-DA using c max is significantly lower than for PLS-DA for the Prostate 2 data set, and the number of components used is also in average lower. This data set contains only a low proportion of 1.4% differentially expressed genes, and the total number of genes is very high (42129). Moreover, for this data set, the genes show a stronger linear dependency (rapid increase of the condidion index) than for the Leukemia or the Lymphoma data set.
Summarizing, a weak increase of k j ,j~1 . . . ,5 indicates no improvement for the PE when using PPLS-DA with c max instead of PLS-DA. Concerning percentage of DEGs, the PE of PPLS-DA using c max is equal to the PE of PLS-DA only for a small percentage of DEGs (Breast Cancer data and case 3 of the simulated data). For a weak increase of k j ,j~1 . . . ,5 and high percentage of DEGs, the PE for PPLS-DA using c max is even larger than for PLS-DA (Leukemia and Lymphoma data sets).
A rapid increase of k j ,j~1 . . . ,5 and a large proportion of DEGs, using c max instead of PLS-DA does not improve the PE (Prostate 1). On the contrary, for a rapid increase of k j and the case of small percentage of DEGs, we can improve the PE, employing PPLS-DA using c max instead of PLS-DA (Prostate 2).

Comparison between PLS-DA and PPLS-DA using c opt
For the simulated data, PLS-DA shows equal PE if compared to PPLS-DA using c max or c~0:5. The PE of PPLS-DA using c opt is significantly lower than the PE of PLS-DA. For the experimental data, for four of the five data sets (Leukemia, Lymphoma, Breast Cancer and Prostate 1), the PEs of PLS-DA and PPLS-DA using c opt show no significant differences. For the Prostate 2 data set, the PE of PPLS-DA using c opt is clearly lower than for PLS-DA.
We conclude first, equal PEs between PLS-DA and PPLS-DA using c opt are caused by a weak between-feature dependency, independent of the proportion of DEGs. Second, a data set with strong collinearity between the features and a low number of DEGs, in contrary shows a clearly lower PE for PPLS-DA using c opt than for PLS-DA.
Comparison between PPLS-DA using c~0:5 and PLS-DA The PEs for PLS-DA and PPLS-DA using c~0:5 are nondistinguishable for all simulated and all experimental data sets investigated. Maximization of the covariance or maximization of the correlation without the power parameter, results in equal PEs of the outer test set.
Comparison between PPLS-DA using c opt and t-LDA and

SVM
The more classic and widely available classification methods t-LDA and SVM were also run and compared to PPLS-DA using c opt on all simulated and experimental data sets. For the simulated data, t-LDA performs indistinguishable well as PPLS-DA using c opt . For the experimental datasets, SVM draws level with our proposed approach except for the case of Prostate 2. In this comparison, PPLS-DA using c opt shows a comparatively stable well performance. There may, however, exist further statistical learning methods which outperform the methods presented in this study. t-LDA and SVM should serve to demonstrate comparability of PLS-related methods to other commonly chosen approaches for the classification problems. The focus of our study, however, was on further developing PPLS-DA, a so-called multivariate method, which has already been proven its large potential for classification problems involving magnitude more features than samples as it is the case in OMICs data sets.

Conclusions and Outlook
It is conceivable to use results of an initial PPLS-DA crossvalidation series, optimizing c max , to try to judge if running the extended version would be rewarding. Data sets with a high proportion of differentially expressed genes and weak linear dependency (like the Leukemia data set and the Lymphoma data set) most probably show good prediction results for PLS-DA. Here, we found no gain using PPLS-DA with powers (c max or c opt ). On the contrary, for a rapid increase of the condition index, a low proportion of differentially expressed genes and a large total number of genes, using PPLS-DA with c max clearly improves the prediction error compared to PLS-DA. In cases where PPLS-DA using c max gives no advantages over PLS-DA, using the extensions of PPLS-DA (optimizing the power parameter) for prediction can be advantageous. Starting to analyse the eigenvalue structure and the number of differentially expressed genes, can possibly be useful to decide which method to use. One aspect of future work is to validate our conclusions by additional experimental data sets as well as further simulations implementing a more complex covariance structure. Figure S1 Mean PE of PPLS-DA for simulated data using c opt plotted against N inner . The simulated data of case 3 are constructed such that the technical variance is of the same size as the biological variance. 10 differentially expressed genes with a mean class difference D [ ½0:1,0:5 are simulated. For different numbers of the cross-validation steps, the mean prediction error