Ridge regression and its applications in genetic studies

With the advancement of technology, analysis of large-scale gene expression data has become feasible and very popular in the era of machine learning. This paper develops an improved ridge approach for genome regression modeling. When multicollinearity exists in a data set with outliers, we consider a robust ridge estimator, namely the rank ridge regression estimator, for parameter estimation and prediction. On the other hand, the efficiency of the rank ridge regression estimator depends heavily on the ridge parameter, and, in general, it is difficult to give a satisfactory answer to the question of how to select it. Because of the good properties of generalized cross validation (GCV) and its simplicity, we use it to choose the optimum value of the ridge parameter. The GCV function creates a balance between the precision of the estimators and the bias caused by ridge estimation. It behaves like an improved estimator of risk and can be used in high-dimensional problems, where the number of explanatory variables is larger than the sample size. Finally, some numerical illustrations are given to support our findings.


Introduction
High-dimensional statistical inference is essential whenever the number of unknown parameters is larger than the sample size. Typically, high-throughput technology provides large-scale gene expression data in transcriptomics. As an example, the riboflavin production data set for Bacillus subtilis (Lee et al. [1] and Zamboni et al. [2]) includes the logarithm of the riboflavin production rate as the response variable, along with 4088 covariates, which are the logarithms of the expression levels of 4088 genes, normalized using Affymetrix oligonucleotide array normalization methods. One rather homogeneous data set exists from 71 samples that were hybridized repeatedly during a fed-batch fermentation process in which different engineered strains, and strains grown under different fermentation conditions, were analyzed.
A relevant family of methods for predicting the response from high-dimensional gene expression data is sparse linear regression models. The least absolute shrinkage and selection operator (LASSO), proposed by Tibshirani [3], is the most popular, while other relevant methods are SCAD penalization [4] and the minimum concave penalty [5]. In spite of the suitable sparsity induced by these penalized methods, they can have low prediction performance for high-dimensional data sets because of their shrinkage and bias. Hence, developing shrinkage strategies to improve prediction is of interest in genome studies.
The primary aim of this study is to improve the prediction accuracy for the riboflavin production data in genome regression modeling; secondly, we focus on detecting outliers. Intuitive methods for labeling observations as outliers can be provided by diagnostic plots. Fig 1 gives the diagnostic plots for identifying outliers in the riboflavin data set, based on the ordinary least-squares model with effective genes. The plots suggest that some outliers exist in the data set. Hence, developing an efficient robust estimation strategy is another aspect of our approach.

Rank regression
Jureckova [6] and Jaeckel [7] proposed rank-based estimators for linear models as highly efficient methods that are robust to outliers in the response space. In short, rank regression is a simple technique that consists of replacing the data with their corresponding ranks. Rank regression and related inferential methods are useful in situations where
1. the relation between the response and the covariates is nonlinear and monotonic, and a simple and practical nonlinear form is of interest rather than polynomial, spline, kernel, and/or other forms;
2. there are outliers present in the study and we need a nonparametric robust procedure;
3. the sheer number of important input variables makes it difficult to find an appropriate parametric nonlinear model.
The package Rfit in R, developed by Kloke and McKean [8], is a convenient and efficient tool for estimation, testing, diagnostic procedures, and measures of influential cases in rank regression.

Rank estimator
Consider the setting where the observed data are realizations of $\{(X_i, y_i)\}_{i=1}^{n}$, with p-dimensional covariates $X_i \in \mathbb{R}^p$ and univariate continuous response variables $y_i \in \mathbb{R}$. A simple regression model has the form
$$y_i = X_i^\top \beta + \epsilon_i, \qquad i = 1, \ldots, n,$$
where β is the vector of regression coefficients and $\epsilon_i$ is the i-th error component. For simplicity, we assume that the intercept is zero; in case it exists, one can eliminate it from the study by centering the observations. We assume that:
1. The errors $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^\top$ are independently and identically distributed (i.i.d.) random variables with (unknown) cumulative distribution function (c.d.f.) F having an absolutely continuous probability density function (p.d.f.) f with finite and nonzero Fisher information.
2. For obtaining the linear rank estimator, we consider the score-generating function $\psi : (0, 1) \to \mathbb{R}$, which is assumed to be nonconstant, nondecreasing, and square integrable on (0, 1). The scores are defined in either of the following ways: for $n \ge 1$,
$$a(i) = \psi\!\left(\frac{i}{n+1}\right) \quad \text{or} \quad a(i) = \mathrm{E}\!\left[\psi(U_{i:n})\right], \qquad i = 1, \ldots, n,$$
where $U_{1:n} \le \cdots \le U_{n:n}$ are the order statistics of a sample of size n from the uniform distribution $U(0, 1)$.
To obtain the rank estimate of β, define the pseudo-norm
$$\|v\|_\psi = \sum_{i=1}^{n} a(R(v_i))\, v_i,$$
where, for $v = (v_1, \ldots, v_n)^\top$, $a(R(v)) = (a(R(v_1)), \ldots, a(R(v_n)))^\top$, $R(v_i)$ is the rank of $v_i$, $i = 1, \ldots, n$, and $a(1) \le a(2) \le \cdots \le a(n)$. Then the rank estimate of β is given by
$$\hat{\beta}_\psi = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} D_\psi(X\beta), \qquad D_\psi(\eta) = \|y - \eta\|_\psi,$$
where $X = (X_1, \ldots, X_n)^\top$ and $\hat{y}_\psi = X\hat{\beta}_\psi$ is the minimizer of the dispersion function $D_\psi(\eta)$ over $\eta \in \mathcal{C}(X)$, the column space spanned by the columns of X. Thus $\hat{\beta}_\psi$ is the solution to the rank normal equations $X^\top a(R(y - X\beta)) = 0$, and $D_\psi(X\hat{\beta}_\psi) = \|y - \hat{y}_\psi\|_\psi$. Refer to the "S1 File" for a simple example of the rank estimator, the form of ψ, and references.
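To make the estimator concrete, the following is a minimal Python sketch of Jaeckel's approach (our own illustration, not the Rfit implementation) using Wilcoxon scores $\psi(u) = \sqrt{12}\,(u - 1/2)$; the toy design, heavy-tailed errors, and optimizer settings are all illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import rankdata

def wilcoxon_scores(n):
    # Wilcoxon score function psi(u) = sqrt(12) * (u - 1/2), evaluated at i/(n+1).
    u = np.arange(1, n + 1) / (n + 1)
    return np.sqrt(12.0) * (u - 0.5)

def dispersion(beta, X, y):
    # Jaeckel's dispersion D_psi(beta) = sum_i a(R(e_i)) * e_i, with e = y - X beta.
    e = y - X @ beta
    a = wilcoxon_scores(len(y))
    ranks = rankdata(e).astype(int)   # continuous data: no ties expected
    return np.sum(a[ranks - 1] * e)

def rank_estimate(X, y):
    # The dispersion is convex and piecewise linear in beta, so a
    # derivative-free minimizer started at the least-squares fit suffices here.
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)
    res = minimize(dispersion, beta0, args=(X, y), method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 20000})
    return res.x

# Toy check: heavy-tailed (t with 3 d.f.) errors, where rank estimation shines.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + rng.standard_t(df=3, size=200)
beta_hat = rank_estimate(X, y)
```

In practice one would use Rfit, which also provides standard errors and diagnostics; this sketch only shows the defining minimization.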

Regularization methods
In situations where the matrix X<sup>⊤</sup>X is singular, the usual estimators are not applicable. From a practical point of view, high empirical correlations among two or more covariates lead to unstable results when estimating β or pursuing variable selection. To overcome this problem, the ridge version of the estimator can be considered. Ridge estimation is a regularization technique, initially introduced by Tikhonov [9] and followed by Hoerl and Kennard [10], that regularizes the parameters, or linear combinations of them, in order to provide acceptable estimators with less variability than the usual estimator in multicollinear situations (see [8, 11-14] for more details). On the other hand, when the response distribution is non-normal or outliers are present, the usual least squares and maximum likelihood methods fail to provide efficient estimates. In such cases, there is a need for an estimation strategy that is applicable in multicollinear situations and has acceptable efficiency in the presence of outliers.
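The stabilizing effect of the ridge penalty can be seen in a few lines. Below is a self-contained Python sketch with a hypothetical near-collinear design (the choice k = 1 is arbitrary, for illustration only, not a recommendation):

```python
import numpy as np

def ridge(X, y, k):
    # Ridge estimator: beta_hat(k) = (X'X + k I_p)^{-1} X'y; k = 0 gives OLS.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Nearly collinear design: column 2 is column 1 plus tiny noise.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + 1e-4 * rng.normal(size=100)])
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=100)

beta_ols = ridge(X, y, 0.0)   # often wildly unstable: huge opposite-signed coefficients
beta_rdg = ridge(X, y, 1.0)   # stabilized: both coefficients near 1
```

The individual OLS coefficients are essentially arbitrary here because only their sum is well identified; the penalty resolves that indeterminacy at the price of a small bias.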

Our contribution
Our contribution is twofold. First, for situations where both multicollinearity and outliers exist, we develop a shrinkage estimation strategy based on the rank ridge regression estimator. This creates two tuning parameters that must be optimized. Second, we define a new generalized cross validation (GCV) criterion to select the induced tuning parameters. The GCV has been applied to obtain the optimal ridge parameter in ridge regression by Golub et al. [15], and to obtain the optimal ridge parameter and the bandwidth of the kernel smoother in semiparametric regression by Amini and Roozbeh [16], as well as in partial linear models by Speckman [17]. Here, we use the GCV criterion to select the optimal values of the ridge and shrinkage parameters simultaneously. Our proposed GCV criterion creates a balance between the precision of the estimators and the bias caused by the ridge and shrinkage parameters.
The following section provides a robust shrinkage estimator based on an improved rank-based test statistic, together with a generalized cross validation (GCV) criterion for obtaining optimal values of the tuning parameters. Subsequently, the proposed improved estimation method is illustrated on real-world data, and an extensive simulation study demonstrates its usefulness. Finally, our study is concluded.

Methodology
In this section, we first define a robust test statistic for testing the null hypothesis $H_0: \beta = 0$ in the rank regression analysis. This test is then employed in the construction of a robust rank-based shrinkage estimator. Next, we consider the rank ridge regression estimator and, with the aid of the proposed test statistic, define a Stein-type shrinkage estimator of β for robust analysis. Since this estimator has two tuning parameters, we evaluate them using a generalized cross validation (GCV) criterion.

Robust shrinkage estimator
In order to define the robust rank-based shrinkage estimator, we need a robust test statistic for the following set of hypotheses:
$$H_0: \beta = 0 \quad \text{versus} \quad H_1: \beta \neq 0.$$
Denote the i-th diagonal element of $H = X(X^\top X)^{-1} X^\top$, the projection matrix onto the space $\mathcal{C}(X)$, by $h_{iin}$. We also need the following regularity condition to hold.
(A1) $\lim_{n \to \infty} \max_{1 \le i \le n} h_{iin} = 0$, where $h_{iin}$ is commonly called the leverage of the i-th data point.
where X(k) is an invertible matrix given in the "S1 File".
Proof 1 Refer to the "S1 File".
Now, using an approach similar to the formulation of the ridge estimator, we use the following rank ridge regression estimator (Roozbeh et al. [18]):
$$\hat{\beta}_\psi(k) = \operatorname*{argmin}_{\beta}\left\{ \|y - X\beta\|_\psi + k\, \beta^\top \beta \right\},$$
where k > 0 is the ridge parameter. In order to improve upon the rank ridge regression estimator, following Saleh [19], we use the Stein-type shrinkage estimator (SSE)
$$\hat{\beta}^{(S)}_\psi(k, d) = \left(1 - \frac{d}{R_n(k)}\right) \hat{\beta}_\psi(k).$$
The SSE shrinks the coefficients towards the origin using the test statistic $R_n(k)$; the amount of shrinkage is controlled by the shrinkage coefficient d.
In the following result, we show that the SSE is a shrinkage estimator with respect to the $\ell_q$-norm, $\|a\|_q = \left(\sum_{j=1}^{n} |a_j|^q\right)^{1/q}$, q > 0, with $a = (a_1, \ldots, a_n)^\top$. The reason we take the $\ell_q$-norm is that it lets us consider the $\ell_1$ and $\ell_2$ norms simultaneously: the $\ell_1$-norm keeps the scale of the observations, whereas the $\ell_2$-norm is mathematically tractable.
Theorem 2 $\hat{\beta}^{(S)}_\psi(k, d)$ is a shrinkage estimator under the $\ell_q$-norm under some regularity conditions, stated in the "S1 File", which include the set of local alternatives $K_n$.
Proof 2 Refer to the "S1 File".
The proposed SSE may be criticized because it depends on the two tuning parameters k and d, and one may wonder why we need an estimator with two tuning parameters when we already have the rank ridge regression estimator. In what follows, we elaborate on the advantages of the SSE $\hat{\beta}^{(S)}_\psi(k, d)$ in our analysis. Accepting the fact that we need a robust rank estimator, and apart from the justifications provided in Saleh [19], we give the following reasons.
1. As $d \to 0$, $\hat{\beta}^{(S)}_\psi(k, d) \to \hat{\beta}_\psi(k)$, and thus for small values of d the gain in estimation is just the information provided by the robust ridge parameter, even if the null hypothesis $H_0$ is not true. Hence, even granting that the rank ridge regression estimator shrinks the coefficients toward zero, the information provided by the test statistic $R_n(k)$, which is controlled by d in the SSE, is useful.
2. Consider a situation in which we do not have strong evidence to reject the null hypothesis.
Knowing that the ridge estimator does not select variables (see Saleh et al. [20]), we cannot estimate the zero vector using the rank ridge regression estimator; however, the shrinkage coefficient d may be chosen such that, for a given k, d = R_n(k), and the resulting shrinkage estimator becomes exactly zero. This might be a rare event, but it is theoretically sound.
3. Last but not least, under the set of local alternatives $K_n$, as in Theorem 2, the proposed SSE shrinks more than the rank ridge regression estimator. Thus, in order to have a robust shrinkage estimator, the SSE with two tuning parameters is preferred.
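Taken at face value, the shrinkage step itself is a one-line transform of the rank ridge fit. In this hypothetical Python sketch, the ridge fit and the value of the test statistic $R_n(k)$ are simply given (computing $R_n(k)$ itself is beyond the sketch):

```python
import numpy as np

def stein_shrink(beta_ridge, R_nk, d):
    # Stein-type shrinkage: scale the (rank) ridge fit by (1 - d / R_n(k)).
    # A large R_n(k) (strong evidence against H0) leaves the fit almost
    # untouched; choosing d = R_n(k) collapses the estimate exactly to zero.
    return (1.0 - d / R_nk) * np.asarray(beta_ridge)

beta_ridge = np.array([0.8, -0.3, 0.1])   # illustrative ridge estimate
beta_sse  = stein_shrink(beta_ridge, R_nk=4.0, d=0.5)  # mild shrinkage
beta_zero = stein_shrink(beta_ridge, R_nk=4.0, d=4.0)  # collapses to zero
```

This makes the second advantage above tangible: unlike the ridge fit alone, the pair (k, d) can reproduce the zero vector when the data do not contradict $H_0$.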
The SSE depends on both the ridge parameter k and shrinkage parameter d. For optimization purposes, we use the GCV of Roozbeh et al. [18] in the forthcoming section.

Generalized cross validation
The GCV chooses the ridge and shrinkage parameters by minimizing an estimate of the unobservable risk function, with
$$L(k, d) = \left(1 - \frac{d}{R_n(k)}\right)^{2} \hat{\tau}_\psi\, X\!\left(n^{-1} X^\top X + k I_p\right)^{-1} X^\top$$
termed the hat matrix of y, and $\hat{\tau}_\psi$ a consistent estimator (see Hettmansperger and McKean [21]) of the scale parameter $\tau_\psi$. In (11), $\mathrm{GCV}(\hat{\beta}^{(S)}_\psi(k, d))$ is an estimator of $\mathrm{E}(R(\beta, \hat{\beta}^{(S)}_\psi(k, d)))$ with a nearly constant bias. Using the techniques of Section 3, this can be shown to be an estimator of $\tau^2_\psi$ with positive but asymptotically negligible bias, so the resulting "F" statistic can be expected to be conservative. The main result of this section is to obtain, from the data, a good estimate of the minimizer of $\mathrm{E}(R(\beta, \hat{\beta}^{(S)}_\psi(k, d)))$ that does not require knowledge of $\tau^2_\psi$, so that, by minimizing it, we can extract the optimal values of the two tuning parameters simultaneously.
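For intuition, here is a hedged Python sketch of GCV for the plain, non-robust ridge smoother, i.e. the least-squares analogue with d = 0; the robust version would replace this hat matrix and these residuals with their rank-based counterparts. The design, sparsity pattern, and search grid are illustrative:

```python
import numpy as np

def gcv_ridge(X, y, k):
    # GCV for the linear smoother L(k) = X (X'X + k I_p)^{-1} X':
    #   GCV(k) = n^{-1} ||y - L y||^2 / (n^{-1} tr(I - L))^2
    n, p = X.shape
    L = X @ np.linalg.solve(X.T @ X + k * np.eye(p), X.T)
    resid = y - L @ y
    denom = (1.0 - np.trace(L) / n) ** 2
    return (resid @ resid / n) / denom

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 30))
beta = np.zeros(30)
beta[:3] = [3.0, -2.0, 1.5]          # sparse true signal
y = X @ beta + rng.normal(size=60)

grid = np.geomspace(1e-3, 1e2, 40)    # log-spaced candidate ridge parameters
scores = [gcv_ridge(X, y, k) for k in grid]
k_opt = grid[int(np.argmin(scores))]
```

Minimizing over a two-dimensional (k, d) grid works the same way; one simply evaluates the criterion on a lattice, as done for the 3D GCV surfaces shown later.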

Applications
In this section, we present numerical experiments to illustrate the usefulness of the suggested improved methodology in the regression model. We analyze the performance of the proposed estimators in a real-world example related to riboflavin production.

Application to riboflavin production data set
To support our assertions, we consider the data set on riboflavin (vitamin B2) production in Bacillus subtilis, which can be found in the R package "hdi". There is a single real-valued response variable, the logarithm of the riboflavin production rate, and p = 4088 explanatory variables measuring the logarithm of the expression levels of 4088 genes. There is one rather homogeneous data set of n = 71 samples that were hybridized repeatedly during a fed-batch fermentation process in which different engineered strains, and strains grown under different fermentation conditions, were analyzed. Fig 2 shows the normal Q-Q plot based on the ridge estimation for the riboflavin production data set. Also, the bivariate boxplot for selected genes of this data set is depicted in Fig 3. The bivariate boxplot is a two-dimensional analogue of the boxplot for univariate data. It is based on calculating robust measures of location, scale, and correlation, and consists essentially of a pair of concentric ellipses, one of which (the hinge) includes 50% of the data, while the other (the fence) delineates potentially troublesome outliers. In addition, robust regression lines of the response on the predictor, and vice versa, are shown, with their intersection marking the bivariate location estimator. The acute angle between the regression lines is small for a large absolute correlation and large for a small one. Figs 2 and 3 clearly reveal that the data contain some outliers.
We use GCV to select the ridge and shrinkage parameters for the proposed estimators simultaneously. As for the SSE, the GCV score functions for $\hat{\beta}(k)$ and $\hat{\beta}_\psi(k)$ can be procured from (10) by appropriate settings of the parameters. All computations were conducted in the statistical software R 3.4.3, building on the package Rfit, to calculate the proposed estimators, their test statistics, and the powers described in this paper. The R codes are available at https://mahdiroozbeh.profile.semnan.ac.ir/#downloads. The 3D diagram as well as the 2D slices of the GCV of $\hat{\beta}^{(S)}_\psi(k, d)$ versus k and d were examined; we do not have strong evidence to reject the null hypothesis $H_0$, and so the SSE can be efficient for prediction purposes, according to the second comment right after Theorem 2.
To measure the prediction accuracy of the proposed estimators, the leave-one-out cross-validation (CV) criterion was used, defined by
$$\mathrm{CV} = \frac{1}{n} \sum_{s=1}^{n} \left( \tilde{y}_{(-s)} - \tilde{X}_{(-s)} \hat{\beta}_{(-s)} \right)^2,$$
where $\hat{\beta}_{(-s)}$ is obtained by replacing X and y with $\tilde{X}_{(-s)} = (\tilde{x}_{jk(-s)})$, $1 \le k \le n$, $1 \le j \le p$, and $\tilde{y}_{(-s)} = (\tilde{y}_{1(-s)}, \ldots, \tilde{y}_{n(-s)})^\top$, with $\tilde{x}_{lk(-s)} = x_{lk} - \sum_{j \neq s} W_{nj}(t_s) x_{lj}$ and $\tilde{y}_{k(-s)} = y_k - \sum_{j \neq s} W_{nj}(t_s) y_j$. Here, the prediction of the response is made with the s-th observation left out of the estimation of β. Table 1 displays a summary of the results. In this table, a goodness-of-fit criterion, R-squared, is calculated for comparing the proposed estimators using the formula
$$R^2 = \frac{\mathrm{SSE}(\hat{\beta})}{S_{yy}}, \qquad \mathrm{SSE}(\hat{\beta}) = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \quad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2,$$
for $\hat{y}_i = X_i^\top \hat{\beta}$. From Table 1, it is seen that $\hat{\beta}^{(S)}_\psi(k, d)$ performs better than ridge regression, since it offers a smaller GCV and a bigger R-squared value in the presence of multicollinearity and outliers. Moreover, because of the outliers in the data set, the R-squared values of the robust-type estimators are more acceptable than that of the non-robust estimator.
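A plain-vanilla Python sketch of leave-one-out CV for the ordinary ridge fit may clarify the mechanics; we deliberately omit the $W_{nj}(t_s)$ adjustment used above, and the data and the value of k are illustrative:

```python
import numpy as np

def loocv_ridge(X, y, k):
    # Plain leave-one-out CV: refit with row s removed, predict y_s,
    # and average the squared prediction errors over s = 1, ..., n.
    n, p = X.shape
    errs = np.empty(n)
    for s in range(n):
        mask = np.arange(n) != s
        Xs, ys = X[mask], y[mask]
        beta = np.linalg.solve(Xs.T @ Xs + k * np.eye(p), Xs.T @ ys)
        errs[s] = (y[s] - X[s] @ beta) ** 2
    return errs.mean()

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=50)
cv = loocv_ridge(X, y, k=1.0)   # roughly the noise variance when the fit is good
```

For linear smoothers the n refits can be avoided via the usual leverage shortcut, which is what makes GCV-type criteria cheap; the explicit loop above is just the definition.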
For further illustrative purposes, we analyze some simulated data sets in the forthcoming section.

Monte-Carlo simulation
In this section, we perform Monte-Carlo simulation studies to justify our assertions and to examine the performance of the proposed estimators. As pointed out in the Introduction, the high-dimensional case p > n causes the matrix $X^\top X$ to be ill-conditioned. To accommodate ill-conditioning, apart from generating multicollinear data, we also evaluate how our estimators work in the high-dimensional case p > n.
We also examine the robustness of the proposed estimators in the presence of contaminated data. The regressors are drawn anew in every replication. The efficiency of $\hat{\beta}_i$ relative to $\hat{\beta}_1$ is defined by
$$\mathrm{Eff}(\hat{\beta}_i : \hat{\beta}_1) = \frac{\sum_{m=1}^{M} \|\hat{\beta}_1(m) - \beta\|^2}{\sum_{m=1}^{M} \|\hat{\beta}_i(m) - \beta\|^2},$$
where M is the number of iterations and $\hat{\beta}_i(m)$ is the i-th estimator of β in the m-th replication. To achieve different degrees of collinearity, following McDonald and Galarneau [23] and Gibbons [24], the explanatory variables were generated for (n, p) = {(180, 60), (180, 120)} (low-dimensional) and (n, p) = {(200, 240), (200, 360), (250, 10000)} (high-dimensional) from the model
$$x_{ij} = (1 - \gamma^2)^{1/2} z_{ij} + \gamma z_{i(p+1)}, \qquad i = 1, \ldots, n, \quad j = 1, \ldots, p,$$
where the $z_{ij}$ are independent standard normal pseudo-random numbers and γ is specified so that the correlation between any two explanatory variables is $\gamma^2$. These variables are then standardized so that $X^\top X$ and $X^\top y$ are in correlation form. Degrees of correlation corresponding to γ = 0.20, 0.50, 0.90, and 0.95 are considered. The observations on the dependent variable are determined by
$$y_i = X_i^\top \beta + \epsilon_i, \qquad i = 1, \ldots, n,$$
where $\chi^2_m(\delta)$ denotes the non-central chi-squared distribution with m degrees of freedom and non-centrality parameter δ. The main reason for selecting such a structure for the errors is to contaminate the data and evaluate the robustness of the estimators. We set the first h error terms as dependent normal random variables and the last (n − h) error terms as independent non-central chi-squared random variables. The non-centrality causes the outliers to lie on one side of the true regression model, which pulls the non-robust estimates toward them.
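The covariate-generation scheme can be sketched in Python as follows. The error generator is a simplified stand-in: it uses independent normals for the clean part rather than the dependent normals of the paper, and df and delta are illustrative values:

```python
import numpy as np

def make_collinear_X(n, p, gamma, rng):
    # McDonald-Galarneau design: x_ij = sqrt(1 - gamma^2) z_ij + gamma z_{i,p+1},
    # so any two distinct columns have correlation gamma^2.
    Z = rng.normal(size=(n, p + 1))
    return np.sqrt(1.0 - gamma**2) * Z[:, :p] + gamma * Z[:, [p]]

def make_errors(n, h, df, delta, rng):
    # First h errors: clean (here independent) normals; last n - h errors:
    # non-central chi-squared draws, i.e. one-sided outliers that pull a
    # non-robust fit toward them.
    clean = rng.normal(size=h)
    outliers = rng.noncentral_chisquare(df, delta, size=n - h)
    return np.concatenate([clean, outliers])

rng = np.random.default_rng(4)
X = make_collinear_X(2000, 6, gamma=0.95, rng=rng)   # pairwise corr ~ 0.9025
corr = np.corrcoef(X, rowvar=False)
eps = make_errors(n=200, h=160, df=3, delta=5.0, rng=rng)   # CS = 20%
```

A full replication loop would draw X and ε afresh M times, fit each estimator, and accumulate the squared errors that enter the efficiency ratio above.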
The detailed tables are reported in the "S1 File" of this paper; to save space, here we only report Fig 5 as a summary of their results. In the "S1 File", we provide 12 tables (S2-S9 and S10-S13 Tables) to analyze the numerical outputs extensively. Fig 5 summarizes the empirical type I errors and powers at the 5% significance level under the low-dimensional (based on the F test statistic) and high-dimensional (based on the $R_n(k)$ test statistic) settings for γ = 0.20, 0.50, 0.90, and 0.95, respectively. The contamination level is the percentage of the sample contaminated with outliers ($\mathrm{CS} = 100 \times \frac{n - h}{n}\%$). The F test is valid only when p is less than n. Note that in S10-S13 Tables in the "S1 File", we numerically estimate the risks and efficiencies of the proposed estimators relative to $\hat{\beta}_1$.
We apply the generalized cross-validation (GCV) method to select the optimal ridge parameter (k_opt) and shrinkage parameter (d_opt) that minimize the GCV function. Since the results were similar across cases, to save space we report here only the results for the sparse case with γ = 0.95, n = 200, p = 240, and CS = 20%. For this case, the minimum of the GCV occurred at approximately k_opt = 5.90 and d_opt = 0.006254 for the model (14). The 3D diagram as well as the 2D slices of the GCV versus k and d are plotted in Fig 6. Fig 7 shows the results of the F test and the RRB test for the non-sparse and sparse cases with parameter values γ = 0.95 and n = 180. The comparison results based on the simulations are similar. Firstly, we observe that the empirical type I errors of the F test are not reasonable in comparison with the significance level α = 0.05, while its powers are slightly smaller than those of the RRB test in most cases. Secondly, the RRB test is highly efficient in all cases under consideration: its sizes are reasonable, while its powers, compared to the F test, are high in the low-dimensional settings. Thirdly, the RRB test is powerful in the high-dimensional settings, as we would expect.
As demonstrated, the RRB test is conservative yet powerful: it achieves a smaller empirical type I error and a bigger empirical power than the F test.

Summary & conclusions
In this paper, we proposed a robust ridge test statistic to improve predictions in a regression model. In the presence of multicollinearity and outliers, we introduced a robust ridge-type estimator and improved it by shrinking toward the origin to incorporate the information contained in the null hypothesis. By defining a generalized cross validation criterion, optimal values of the ridge and shrinkage parameters were obtained simultaneously; Figs 3 and 6 showed the global minimum achieved by this criterion, which we found with respect to both parameters through nonlinear minimization algorithms. Finally, a Monte-Carlo simulation study as well as a real data example were considered to compare the performances of the proposed estimators numerically.
According to Fig 5 and the detailed tabulated numerical results in the "S1 File", we observed that the proposed robust ridge-type test statistic is more powerful than the classical F test in the presence of multicollinearity and outliers. Moreover, we found that the efficiencies of the robust-type estimators with respect to the non-robust type increase as the percentage of outliers increases. Another factor affecting the efficiency of the estimators was the number of explanatory variables. We saw that the estimator $\hat{\beta}^{(S)}_\psi(k, d)$ tends to be the best among the others, since it offers smaller risk and bigger efficiency values in all cases. Moreover, $\hat{\beta}(k)$ was not a suitable estimator in the presence of outliers, especially for high percentages of outliers. For the real example, from Table 1, we deduced that $\hat{\beta}^{(S)}_\psi(k, d)$ is quite efficient in the sense that it attains a notably high goodness-of-fit value.

Supporting information
S1 File. (PDF)