Recursive Random Lasso (RRLasso) for Identifying Anti-Cancer Drug Targets

Uncovering driver genes is crucial for understanding heterogeneity in cancer. $L_1$-type regularization approaches have been widely used for uncovering cancer driver genes based on genome-scale data. Although the existing methods have been widely applied in the field of bioinformatics, they possess several drawbacks: subset size limitations, erroneous estimation results, multicollinearity, and heavy time consumption. We introduce a novel statistical strategy, called the Recursive Random Lasso (RRLasso), for high dimensional genomic data analysis and investigation of driver genes. For time-effective analysis, we consider a recursive bootstrap procedure in line with the random lasso. Furthermore, we introduce a parametric statistical test for driver gene selection based on bootstrap regression modeling results. The proposed RRLasso is not only fast but also performs well for high dimensional genomic data analysis. Monte Carlo simulations and analysis of the "Sanger Genomics of Drug Sensitivity in Cancer dataset from the Cancer Genome Project" show that the proposed RRLasso is an effective tool for high dimensional genomic data analysis. The proposed methods provide reliable and biologically relevant results for cancer driver gene selection.


Introduction
Much research is currently underway to understand the complexity of the heterogeneous genetic networks underlying cancer. To identify these networks, various large-scale omics projects (e.g., The Cancer Genome Project, The Cancer Genome Atlas (TCGA), the Sanger Genomics of Drug Sensitivity in Cancer dataset from the Cancer Genome Project, and others) have been initiated and have provided large amounts of data, such as genomic and epigenomic data for cancer patients or cell lines. A crucial issue in cancer research is to identify cancer driver genes based on the analysis of various genomic data (e.g., expression levels, copy number variations, methylation, and others), since efficient identification of cancer drug targets facilitates development of successful anti-cancer therapies. Although various $L_1$-type regularization approaches, e.g., lasso [1] and elastic net [2], have been widely used to identify cancer driver genes, they possess several drawbacks as tools for driver gene identification [3]. The lasso and adaptive lasso [4] suffer from a limitation on subset size (i.e., these methods select at most $n$ features, where $n$ is the sample size). The elastic net, which has been widely used in bioinformatics research, may provide erroneous estimation results for coefficients of highly correlated variables with different magnitudes, especially those that differ in sign, because of its "grouping effect". Such coefficients are nevertheless frequently observed in bioinformatics research, since genes in common biological pathways are usually correlated, and their regression coefficients can have different magnitudes or different signs. Furthermore, adaptive $L_1$-type regularization methods suffer from multicollinearity, since their adaptive data-driven weights are based on ordinary least squares (OLS) estimators.
To resolve these issues, Wang et al. [3] proposed the random lasso, which combines bootstrap regression modeling with a random forest strategy. Although the random lasso overcomes the drawbacks of existing $L_1$-type regularization approaches by using a random forest strategy, the method is computationally intensive because it employs a two-step bootstrap procedure. Furthermore, Wang et al. [3] performed final feature selection based on an arbitrarily chosen threshold, even though the variable selection results depend heavily on the threshold.
We propose a novel statistical strategy to identify driver genes of anti-cancer drug sensitivity in line with the random lasso. We introduce recursive bootstrap approaches to simultaneously measure the significance of each gene and perform driver gene selection. We also propose a novel threshold based on a parametric statistical test to effectively identify driver genes based on bootstrap regression modeling. By using a recursive bootstrap procedure, we perform time-efficient bootstrap regression modeling for high dimensional genomic data analysis without loss of modeling accuracy. Furthermore, the proposed feature selection method using a parametric statistical test can be a useful tool for variable selection based on bootstrap regression modeling.
Using Monte Carlo simulations of various scenarios, we demonstrate the effectiveness of the proposed recursive random lasso and elastic net with a parametric statistical test for high dimensional regression modeling. We also apply the proposed statistical strategy to the publicly available "Sanger Genomics of Drug Sensitivity in Cancer dataset from the Cancer Genome Project" (http://www.cancerrxgene.org/), and identify potential driver genes of anti-cancer drug sensitivity. Numerical analyses show that the proposed recursive random lasso and elastic net are time-efficient procedures, and that they outperform existing methods in high dimensional genomic data analysis (i.e., from the viewpoint of feature selection and predictive accuracy).
In Section 2, we introduce the existing $L_1$-type regularization approaches, and point out their drawbacks. We then introduce the random lasso, and propose the recursive random lasso and elastic net procedures. In Section 3, we describe the Monte Carlo simulations and driver gene selection using the Sanger Genomics of Drug Sensitivity in Cancer dataset to examine the effectiveness of the proposed statistical strategies. We state our conclusions in Section 4.

Materials and Methods
Suppose we have $n$ independent observations $\{(y_i, \mathbf{x}_i);\ i = 1, \ldots, n\}$, where $y_i$ are random response variables and $\mathbf{x}_i$ are $p$-dimensional vectors of the predictor variables. Consider the linear regression model,
$$y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$
where $\boldsymbol{\beta}$ is an unknown $p$-dimensional vector of regression coefficients and $\varepsilon_i$ are the random errors, which are assumed to be independently and identically distributed with mean 0 and variance $\sigma^2$. We assume that the $y_i$ are centered and the $x_{ij}$ are standardized by their mean and standard deviation: $\sum_{i=1}^{n} y_i / n = 0$, $\sum_{i=1}^{n} x_{ij} / n = 0$ and $\sum_{i=1}^{n} x_{ij}^2 / n = 1$; thus an intercept term is excluded from the regression model in Eq (1). Many studies are currently underway on regression modeling, especially for high dimensional data analysis (e.g., genomic alterations data analysis).
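As a minimal sketch of the preprocessing assumed above (centering the response and scaling each predictor to mean 0 and mean square 1), one could write the following. The function name `center_and_standardize` and the toy data are illustrative assumptions, not part of the original study.

```python
import numpy as np

def center_and_standardize(X, y):
    """Center y and scale each column of X to mean 0 and mean square 1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    y_centered = y - y.mean()
    X_centered = X - X.mean(axis=0)
    # Divide by the population standard deviation so that sum_i x_ij^2 / n = 1.
    scale = np.sqrt((X_centered ** 2).mean(axis=0))
    if np.any(scale == 0.0):
        raise ValueError("a constant column cannot be standardized")
    return X_centered / scale, y_centered

# Hypothetical toy data: 20 observations, 3 predictors.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(20, 3))
y_raw = rng.normal(size=20)
X_std, y_ctr = center_and_standardize(X_raw, y_raw)
```

With data prepared this way, the regression model needs no intercept term.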
Tibshirani [1] proposed the lasso, which minimizes the residual sum of squares subject to an $L_1$ penalty $\lambda \sum_{j=1}^{p} |\beta_j|$, and its solution is given by
$$\hat{\boldsymbol{\beta}}^{\mathrm{lasso}} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\},$$
where $\lambda$ is a tuning parameter controlling model complexity. By imposing a penalty term, the sum of the absolute values of the regression coefficients, the lasso can simultaneously perform parameter estimation and variable selection. However, a recent work suggested that the lasso may suffer from the following limitations [2]:
• In the $p > n$ case, the lasso selects at most $n$ variables, because of the nature of the convex optimization problem. This implies that the lasso is not suitable for driver gene selection, since genomic alteration data is typically high dimensional.
• The lasso cannot account for the grouping effect of predictor variables, and thus tends to select only one variable from among highly correlated variables, even if all are related to the response variable. However, genomic alterations of genes (e.g., expression levels, copy number variations, methylation, etc.) that share a common biological pathway are usually highly correlated, and the genes may be associated with a complex cancer mechanism considered to be the response variable. This also implies that the lasso is not suitable for genomic data analysis.
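To illustrate how an $L_1$ penalty performs estimation and variable selection at the same time, the following sketch solves the lasso criterion (residual sum of squares plus $\lambda \sum_j |\beta_j|$) by cyclic coordinate descent with soft-thresholding. Coordinate descent is a standard solver chosen here for illustration; it is not necessarily the solver used in the original study, and the function names and toy design are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - x_i^T b)^2 + lam * sum_j |b_j| by coordinate descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)   # sum_i x_ij^2 for each column
    resid = y.copy()                # residual y - X @ beta (beta starts at 0)
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Inner product of x_j with the partial residual that excludes x_j.
            rho = X[:, j] @ resid + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam / 2.0) / col_ss[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

# Orthogonal toy design: each column has sum of squares 4.
X_toy = np.array([[2.0, 0.0], [0.0, 2.0], [0.0, 0.0], [0.0, 0.0]])
y_toy = np.array([3.0, 0.2, 0.0, 0.0])
beta_toy = lasso_cd(X_toy, y_toy, lam=2.0)
print(beta_toy)  # first coefficient shrunk to 1.25, second set exactly to 0
```

In this toy case the strong predictor is shrunk (from 1.5 to 1.25) while the weak one is removed from the model entirely, which is the behavior that makes the lasso a variable selection method.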
To overcome these drawbacks, various $L_1$-type regularization methods have been proposed. The elastic net [2] in particular has drawn considerable attention in the field of bioinformatics:
$$\hat{\boldsymbol{\beta}}^{\mathrm{EN}} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + \lambda_2 \sum_{j=1}^{p} \beta_j^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| \right\}.$$
The penalty term of the elastic net is a convex combination of the ridge [5] and lasso penalties: with $\alpha = \lambda_2 / (\lambda_1 + \lambda_2)$, it is proportional to $(1 - \alpha) \sum_{j} |\beta_j| + \alpha \sum_{j} \beta_j^2$. By imposing an additional $L_2$-penalty on the lasso, the elastic net performs feature selection effectively in high dimensional data analysis, i.e., there is no limitation on subset size. Furthermore, the elastic net enjoys the following grouping effect:
$$\frac{|\hat{\beta}_j - \hat{\beta}_k|}{\|\mathbf{y}\|_1} \le \frac{1}{\lambda_2} \sqrt{2(1 - r)},$$
where $r = \mathbf{x}_j^T \mathbf{x}_k$ is the sample correlation [2]. Although the elastic net performs well for high dimensional data analysis, Wang et al. [3] demonstrated that the elastic net has the following drawback:
• The "grouping effect" property leads to erroneous estimation results when the coefficients of highly correlated variables have different magnitudes, especially when they have different signs. However, such coefficients are frequently observed in bioinformatics research, since genes in a common biological pathway are usually highly correlated, and their regression coefficients can have different magnitudes or different signs.
Adaptive $L_1$-type penalties have also been proposed and are widely used in various fields of research:
• adaptive lasso:
$$\hat{\boldsymbol{\beta}}^{\mathrm{AL}} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{p} w_j |\beta_j| \right\},$$
• adaptive elastic net:
$$\hat{\boldsymbol{\beta}}^{\mathrm{AEN}} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + \lambda_2 \sum_{j=1}^{p} \beta_j^2 + \lambda_1 \sum_{j=1}^{p} w_j |\beta_j| \right\},$$
where $w_j = 1 / |\hat{\beta}_j^{\mathrm{OLS}}|^{\gamma}$ is an adaptive data-driven weight for $\gamma > 0$. By using the weight, we can impose a different penalty on each feature depending on its significance, and thus perform feature selection effectively. Zou and Hastie [4] and Zou and Zhang [2] established the oracle property of the adaptive lasso and the adaptive elastic net, respectively. However, the performance of adaptive regularization methods depends heavily on the OLS estimator, and thus these methods suffer from multicollinearity. Furthermore, the adaptive $L_1$-type regularization methods suffer from the same drawbacks as their non-adaptive counterparts, i.e., when using the adaptive lasso, the number of selected variables cannot exceed $n$, and the adaptive elastic net may also provide erroneous estimation results when highly correlated variables have coefficients of different magnitudes.
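The effect of the adaptive weight can be seen in a small sketch: a feature with a large initial estimate receives a small penalty weight, and a feature whose initial estimate is exactly zero receives an infinite weight and is effectively excluded. The helper name `adaptive_weights` and the initial estimates are hypothetical values chosen for illustration.

```python
import math

def adaptive_weights(beta_init, gamma=1.0):
    """Return w_j = 1 / |beta_init_j|^gamma, with an infinite weight for zero estimates."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return [math.inf if b == 0 else 1.0 / abs(b) ** gamma for b in beta_init]

# Hypothetical initial estimates for three features.
weights = adaptive_weights([2.0, -0.5, 0.0], gamma=1.0)
print(weights)  # [0.5, 2.0, inf]
```

Because the weights are a direct function of the initial estimator, any instability in that estimator (for example, from multicollinearity) carries over to the penalty.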

Random Lasso
Wang et al. [3] detailed the drawbacks of existing $L_1$-type approaches, and proposed the random lasso based on a bootstrap strategy that borrows from the random forest method. In the random lasso procedure, $q$ randomly selected variables are considered as candidate variables in regression modeling for each bootstrap sample. Thus, the results do not suffer from the drawbacks associated with highly correlated variables, since each bootstrap sample may include only a subset of the highly correlated variables. Furthermore, the random lasso can overcome the subset size limitation, since variable selection is based on the results of bootstrap regression modeling with $q_1$ or $q_2$ randomly selected variables in each bootstrap sample. Wang et al. [3] proposed the following algorithm based on a two-step bootstrap procedure to implement the random lasso:

ALGORITHM 1: Random lasso

• Step 1: Generating importance measures
Draw $B$ bootstrap samples with size $n$ by sampling with replacement from the original dataset. For the $b_1$-th bootstrap sample, $b_1 \in \{1, 2, \ldots, B\}$, $q_1$ candidate variables are randomly selected, the lasso is applied for regression modeling, and we obtain estimators $\hat{\beta}_j^{(b_1)}$ for $j = 1, \ldots, p$. The importance measure of $x_j$ is calculated as
$$I_j = \left| \frac{1}{B} \sum_{b_1 = 1}^{B} \hat{\beta}_j^{(b_1)} \right|.$$

• Step 2: Variable selection
Draw $B$ bootstrap samples with size $n$ by sampling with replacement from the original dataset. For the $b_2$-th bootstrap sample, $b_2 \in \{1, 2, \ldots, B\}$, $q_2$ candidate variables are randomly selected with a selection probability of $x_j$ proportional to $I_j$, the adaptive lasso is applied for regression modeling, and we obtain the estimator $\hat{\beta}_j^{(b_2)}$ for $j = 1, \ldots, p$. The final estimator is computed as
$$\hat{\beta}_j = \frac{1}{B} \sum_{b_2 = 1}^{B} \hat{\beta}_j^{(b_2)}.$$
For noise predictor variables, the coefficients in the respective bootstrap samples are estimated to be small or to have different signs, and thus the absolute value of the average coefficient (i.e., $I_j$) will be small or close to zero. On the other hand, the coefficients of crucial predictor variables may be consistently large in different bootstrap samples, and thus a crucial gene has a large value of $I_j$. This implies that using $I_j$ as the selection probability provides effective feature selection. Wang et al. [3] considered $q_1$ and $q_2$ as tuning parameters, and the importance measure $I_j$ can also be used as the weight in the adaptive lasso.
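The reasoning above can be checked on a small hypothetical example: a predictor whose bootstrap coefficients are consistently large keeps a large importance measure, while a noise predictor whose coefficients change sign averages out to zero. The function name and the coefficient values below are illustrative assumptions.

```python
def importance_measures(coef_by_bootstrap):
    """I_j = |average over bootstrap samples of the j-th coefficient|.

    coef_by_bootstrap is a list of B rows of length p; a coefficient is 0
    when the variable was not a candidate or was not selected in that sample.
    """
    n_boot = len(coef_by_bootstrap)
    n_vars = len(coef_by_bootstrap[0])
    return [
        abs(sum(row[j] for row in coef_by_bootstrap) / n_boot)
        for j in range(n_vars)
    ]

# Hypothetical coefficients from B = 4 bootstrap fits of p = 3 predictors.
coefs = [
    [2.0, 0.5, 0.0],
    [2.0, -0.5, 0.0],
    [2.5, 0.0, 0.0],
    [1.5, 0.0, 0.0],
]
importance = importance_measures(coefs)
print(importance)  # [2.0, 0.0, 0.0]
```

The second predictor is selected twice but with opposite signs, so its importance measure is zero and it would rarely be drawn as a candidate in later bootstrap samples.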
Wang et al. [3] noted that the variable selection results of the random lasso may be unreliable, since some of the final non-zero coefficients may result from a particular bootstrap sample (i.e., the random lasso can yield false positives in variable selection). Thus, a threshold $t_n = 1/n$ was added for variable selection, and predictor variables with $|\hat{\beta}_j| \le t_n$ were deleted from the final model.

Recursive Random Lasso for Effective Feature Selection
The random lasso can overcome the drawbacks of existing $L_1$-type regularization by using a random forest method with bootstrap regression modeling. Although the random lasso performs well for high dimensional regression modeling with highly correlated predictors, the method also suffers from the following drawbacks:
• The random lasso is computationally intensive, since it is based on two bootstrap procedures, each with $B$ replications. The computational cost of the random lasso increases significantly in genomic data analysis, because the dataset contains an extremely large number of predictor variables.
• The threshold is crucial in feature selection, since the feature selection results depend heavily on the threshold. However, Wang et al. [3] arbitrarily set the threshold as $1/n$, without any statistical justification.
• The method has too many tuning parameters, i.e., $\lambda$ in the $L_1$-type penalties, and $q_1$ and $q_2$ in the random forest method. The large number of tuning parameters also makes the method time consuming, since the random lasso procedure must be run repeatedly to select the optimal parameter combination.
We propose an effective modeling strategy in line with the random lasso, called the recursive random lasso (or elastic net). To efficiently perform high dimensional genomic data analysis, we propose a recursive bootstrap procedure for generating the importance measure and regression modeling. We also propose a novel threshold to effectively select predictor variables in bootstrap regression modeling using a parametric statistical test. Furthermore, the number of candidate predictors, $q$, is also randomly selected in each bootstrap sample (i.e., we do not consider $q$ as a tuning parameter). The proposed recursive random lasso (elastic net) is implemented by the following algorithm.

ALGORITHM 2: Recursive random lasso (elastic net)

1. Draw $B$ bootstrap samples with size $n$ by sampling with replacement from the original dataset.
2. For the first bootstrap sample (i.e., $b = 1$), $q$ candidate variables are randomly selected and the lasso (or elastic net) is applied for regression modeling. We then obtain estimators $\hat{\beta}_j^{(1)}$ for $j = 1, \ldots, p$.
3. For $b \in \{2, \ldots, B\}$, the importance measure of $x_j$ is calculated as $I_j = \left| \frac{1}{b-1} \sum_{r=1}^{b-1} \hat{\beta}_j^{(r)} \right|$. The $q$ candidate variables are randomly selected with a selection probability proportional to $I_j$, and the adaptive lasso (or adaptive elastic net) with $w_j = 1/I_j$ is applied for regression modeling. We obtain the estimators $\hat{\beta}_j^{(b)}$ for $j = 1, \ldots, p$.
4. Final estimators are computed as $\hat{\beta}_j = \frac{1}{B} \sum_{b=1}^{B} \hat{\beta}_j^{(b)}$.
5. Finally, we perform variable selection based on the threshold $t^*$ via the parametric statistical test.
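The recursive procedure can be sketched in a few dozen lines. This is an illustrative reading of the steps above under stated assumptions, not the authors' implementation: the weighted lasso is solved by a simple coordinate descent, $\lambda$ is fixed rather than chosen by cross validation, $q$ is drawn uniformly from $1, \ldots, p$, and a small constant `eps` (an assumption introduced here) is added to $I_j$ so that selection probabilities stay positive and the weights $1/I_j$ stay finite.

```python
import numpy as np

def weighted_lasso(X, y, lam, w, n_iter=100):
    """Coordinate descent for sum_i (y_i - x_i^T b)^2 + lam * sum_j w_j |b_j|."""
    beta = np.zeros(X.shape[1])
    col_ss = (X ** 2).sum(axis=0)
    resid = np.array(y, dtype=float)          # residual for beta = 0
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if col_ss[j] == 0.0:
                continue
            rho = X[:, j] @ resid + col_ss[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam * w[j] / 2.0, 0.0) / col_ss[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def recursive_random_lasso(X, y, lam, B=30, eps=0.1, seed=0):
    """Return the averaged coefficients and the B x p matrix of bootstrap fits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    coefs = np.zeros((B, p))
    for b in range(B):
        rows = rng.integers(0, n, size=n)      # bootstrap sample of size n
        q = int(rng.integers(1, p + 1))        # random number of candidates
        if b == 0:
            # First sample: uniform candidate selection and a plain lasso.
            cols = rng.choice(p, size=q, replace=False)
            w = np.ones(q)
        else:
            # Importance measure from all previous fits (the recursive step).
            imp = np.abs(coefs[:b].mean(axis=0))
            prob = (imp + eps) / (imp + eps).sum()
            cols = rng.choice(p, size=q, replace=False, p=prob)
            w = 1.0 / (imp[cols] + eps)        # adaptive weights, smoothed by eps
        coefs[b, cols] = weighted_lasso(X[rows][:, cols], y[rows], lam, w)
    return coefs.mean(axis=0), coefs

# Hypothetical toy data: 3 true signals among 20 predictors.
rng_demo = np.random.default_rng(1)
X_demo = rng_demo.standard_normal((60, 20))
beta_true = np.zeros(20)
beta_true[:3] = 3.0
y_demo = X_demo @ beta_true + rng_demo.standard_normal(60)
beta_hat, coef_matrix = recursive_random_lasso(X_demo, y_demo, lam=5.0, B=30)
```

Only one pass of $B$ bootstrap fits is needed, because the importance measure is updated from the fits already completed rather than from a separate first-stage bootstrap. The non-zero pattern of `coef_matrix` is the binary matrix used by the parametric test in the next subsection.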
Parametric Statistical Test for Variable Selection in Bootstrap Regression Modeling (PSTVSboot). In order to effectively perform feature selection, we propose a parametric statistical test based on the bootstrap regression modeling results. We first consider a $B \times p$ binary matrix $D$ obtained from the above recursive bootstrap procedures. We set an element of the binary matrix as $D_{bj} = 1$ for a non-zero $\hat{\beta}_j$ in the $b$-th bootstrap sample; otherwise $D_{bj} = 0$. In other words, we consider that the binary matrix is obtained from Bernoulli experiments, and let $D_j$ be a random variable associated with Bernoulli trials as follows:
$$D_j = \begin{cases} 1 & \text{if } \hat{\beta}_j \neq 0, \\ 0 & \text{otherwise.} \end{cases}$$
The Bernoulli random variable has the following probability mass function,
$$f(d_j \mid \pi) = \pi^{d_j} (1 - \pi)^{1 - d_j}, \quad d_j \in \{0, 1\},$$
where the probability $\pi$ can be estimated as follows,
$$\hat{\pi} = \frac{1}{Bp} \sum_{j=1}^{p} \sum_{b=1}^{B} D_{bj},$$
which indicates the average selection ratio of all predictor variables in the $B$ bootstrap samples. For reasonable variable selection, we then consider the following statistic:
$$C_j = \sum_{b=1}^{B} D_{bj},$$
which indicates the number of non-zero $\hat{\beta}_j^{(b)}$ in $B$ Bernoulli trials (i.e., $B$ bootstrap samples). The statistic $C_j$ follows the Binomial distribution $b(B, \pi)$ and has the following probability mass function:
$$P(C_j = c) = \binom{B}{c} \pi^{c} (1 - \pi)^{B - c}, \quad c = 0, 1, \ldots, B.$$
We then calculate a p-value for each predictor variable as follows,
$$p\text{-value}_j = P(C_j \ge c_j) = \sum_{c = c_j}^{B} \binom{B}{c} \hat{\pi}^{c} (1 - \hat{\pi})^{B - c},$$
where $c_j$ is the observed value of $C_j$, and finally perform variable selection based on the p-value with a threshold $t^* = 0.05$ as follows,
$$\hat{\beta}_j^{*} = \hat{\beta}_j \, I(p\text{-value}_j < 0.05),$$
where $I(\cdot)$ is an indicator function. We can expect that the parametric statistical test can overcome false positive feature selection results of bootstrap regression modeling. Although we have described the proposed variable selection strategy with a focus on the random lasso procedure, the parametric statistical test can be a useful tool for bootstrap regression modeling in general.
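A worked example may make the test concrete. The sketch below assumes that the p-value is the upper-tail binomial probability of observing at least the recorded number of selections under the pooled selection rate; the function name `pstvs_boot` and the toy selection matrix are illustrative assumptions.

```python
from math import comb

def pstvs_boot(D, alpha=0.05):
    """Binomial test on a B x p binary selection matrix D (list of rows).

    Returns the pooled selection rate, the per-variable selection counts,
    the upper-tail p-values P(C >= c_j), and the keep/drop decisions.
    """
    n_boot = len(D)
    n_vars = len(D[0])
    counts = [sum(row[j] for row in D) for j in range(n_vars)]
    pi_hat = sum(counts) / (n_boot * n_vars)
    p_values = [
        sum(
            comb(n_boot, c) * pi_hat ** c * (1.0 - pi_hat) ** (n_boot - c)
            for c in range(c_j, n_boot + 1)
        )
        for c_j in counts
    ]
    keep = [pv < alpha for pv in p_values]
    return pi_hat, counts, p_values, keep

# Hypothetical selection results: B = 10 bootstrap samples, p = 4 variables.
# Variable 0 is selected every time; the others only twice each.
D_toy = [
    [1, 1 if b < 2 else 0, 1 if b in (3, 4) else 0, 1 if b >= 8 else 0]
    for b in range(10)
]
pi_hat, counts, p_values, keep = pstvs_boot(D_toy)
print(counts)  # [10, 2, 2, 2]
print(keep)    # [True, False, False, False]
```

Here the pooled selection rate is 16/40 = 0.4. Variable 0 has a p-value of $0.4^{10} \approx 1.05 \times 10^{-4}$ and is kept, whereas the variables selected only twice have p-values near 0.95 and are set to zero.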

Monte Carlo Simulations
Monte Carlo simulations were conducted to investigate the effectiveness of the proposed modeling strategy. We simulated 100 datasets from the following linear regression model,
$$y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \varepsilon_i, \quad i = 1, \ldots, n,$$
where $\varepsilon_i$ are $N(0, \sigma^2)$, and the correlation between $x_l$ and $x_m$ is $0.5^{|l - m|}$. We considered the following simulation situations:
• Type1: $n = 100$ and $p = 1000$, with $\beta_j = 3$ for 50 randomly selected variables, otherwise $\beta_j = 0$,
• Type2: $n = 100$ and $p = 1000$, with $\beta_j = 3$ for 25 randomly selected variables, $\beta_j = -3$ for 25 randomly selected variables, otherwise $\beta_j = 0$,
• Type3: $n = 100$ and $p = 1000$, with $\beta_j = 3$ for 150 randomly selected variables, otherwise $\beta_j = 0$.
• Type8: $n = 50$ and $p = 2000$, with $\beta_j = 3$ for 100 randomly selected variables, $\beta_j = -3$ for 100 randomly selected variables, otherwise $\beta_j = 0$.
To evaluate the proposed recursive random lasso and elastic net procedures, we compared the performance of our methods, the recursive random elastic net (RCS.RD.EL) and the recursive random lasso (RCS.RD.LA), with the lasso (LASSO), adaptive lasso (AD.LA), elastic net (ELA), and existing random lasso (RD.LA). In the numerical studies, we used a ridge estimator as the weight in the existing adaptive lasso, and we set the threshold of the existing random lasso to $s/n$, selecting $s$ based on the root mean squared error in the validation dataset. We set the number of bootstrap samples to $B = 1000$, and each simulated dataset consisted of training, validation, and test datasets, each with sample size $n$. The tuning parameters were selected by 5-fold cross validation based on the training dataset.
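As a sketch of how one dataset of the Type1 design could be generated, the code below builds predictors with correlation $0.5^{|l-m|}$ through a first-order autoregressive recursion and places 50 coefficients of size 3 at random positions. The noise level `sigma=1.0` and the function name `simulate_type1` are assumptions made for illustration; the text above does not state the value of $\sigma$.

```python
import numpy as np

def simulate_type1(n=100, p=1000, n_signal=50, rho=0.5, sigma=1.0, seed=0):
    """Generate (X, y, beta) with corr(x_l, x_m) = rho^|l - m| and sparse beta."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, p))
    X = np.empty((n, p))
    X[:, 0] = z[:, 0]
    for j in range(1, p):
        # AR(1) recursion: unit variance and correlation rho between neighbors.
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1.0 - rho ** 2) * z[:, j]
    beta = np.zeros(p)
    signal = rng.choice(p, size=n_signal, replace=False)
    beta[signal] = 3.0
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta

X_sim, y_sim, beta_sim = simulate_type1()
```

The other designs differ only in $n$, $p$, the number of non-zero coefficients, and their signs, so they can be obtained by changing the arguments and the assignment to `beta`.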
We first evaluated the computational efficiency of our methods. Table 1 shows the computational time required for the existing random lasso in ALGORITHM 1 (RD.LA) and the proposed recursive random lasso in ALGORITHM 2 (RCS.RD.LA). The run time indicates the total time required to estimate the regression model, including tuning parameter selection and bootstrap replication. Table 1 shows that the proposed recursive random lasso is computationally more efficient than the existing random lasso in all simulation situations.
To show the effectiveness of the recursive bootstrap strategy, we compared the importance measures for the random lasso procedures. Table 2 shows the average of the importance measures $I_j$ for predictor variables with truly non-zero coefficients and truly zero coefficients in the recursive random elastic net (RCS.RD.EL), recursive random lasso (RCS.RD.LA) and random lasso (RD.LA), where the numbers in parentheses are the averages of the importance measures for a small number of bootstrap samples, $B = 20$.
In the existing random lasso, the importance measure is calculated independently of regression modeling (i.e., in Step 1 of ALGORITHM 1). However, in our method, $I_j$ is recursively calculated during regression modeling. Furthermore, the $I_j$ of our method is based on a randomly selected number of candidate predictor variables $q$, whereas in the existing random lasso method, $I_j$ is based on the tuning parameters $q_1$ and $q_2$ selected by minimizing prediction error in the validation dataset. In short, our method is more time-efficient than the existing random lasso. From Table 2, it can be seen that the importance measure in our method shows larger differences between truly zero and non-zero coefficients than it does in the existing random lasso, although the difference is small. Furthermore, we can see that the proposed recursive bootstrap procedure also provides larger differences in the importance measure even with a small number of bootstrap samples (i.e., $B = 20$, given in parentheses in Table 2). This implies that the proposed recursive bootstrap approaches perform feature selection effectively by using the random forest procedure, even though they require less computation.
We then compared the results of regression modeling based on prediction accuracy in the test dataset and the variable selection results shown in Figs 1 and 2. Fig 1 shows the prediction errors given as the average of root mean squared errors using the recursive random elastic net (RCS.RD.EL), recursive random lasso (RCS.RD.LA), random lasso (RD.LA), elastic net (ELA), adaptive lasso (AD.LA), and lasso (LASSO). It can be seen from Fig 1 that the proposed recursive random elastic net shows superior prediction accuracy in almost all simulation situations. In addition, the proposed recursive random lasso also shows much higher prediction accuracy than the lasso, adaptive lasso or elastic net, and results similar to the existing random lasso, even though the recursive random lasso requires less run time than the existing random lasso, as shown in Table 1.
We also compared variable selection results given as the average of the true positive rate (i.e., the average percentage of true non-zero coefficients that were correctly estimated as non-zero) and the true negative rate (i.e., the average percentage of true zero coefficients that were correctly set to zero) in Fig 2. We can see from Fig 2 that the proposed recursive random lasso and recursive random elastic net show outstanding performance for variable selection in all simulation situations. On the other hand, the lasso and adaptive lasso show poor results for variable selection in high dimensional data situations, since these methods suffer from the limitation on subset size.
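For concreteness, the two variable selection criteria can be computed as in the sketch below, where the true positive rate is taken as the fraction of truly non-zero coefficients estimated as non-zero and the true negative rate as the fraction of truly zero coefficients estimated as zero. The function name and the example vectors are hypothetical.

```python
def selection_rates(beta_true, beta_hat):
    """Return (true positive rate, true negative rate) of a fitted coefficient vector."""
    pairs = list(zip(beta_true, beta_hat))
    n_pos = sum(1 for t, _ in pairs if t != 0)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both zero and non-zero true coefficients")
    true_pos = sum(1 for t, e in pairs if t != 0 and e != 0)
    true_neg = sum(1 for t, e in pairs if t == 0 and e == 0)
    return true_pos / n_pos, true_neg / n_neg

# Hypothetical example: one of two signals found, one of three noise variables kept.
tpr, tnr = selection_rates([3, -3, 0, 0, 0], [2.8, 0.0, 0.0, 0.1, 0.0])
print(tpr)  # 0.5
```

In the simulations these rates would be averaged over the 100 generated datasets for each method.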
In short, the proposed recursive random lasso and elastic net methods are not only computationally efficient but also produce outstanding regression modeling results (i.e., prediction accuracy and variable selection). These results imply that our methods can be useful tools for high dimensional genomic alteration data analysis.

Real World Examples: Identifying Driver Genes of Anti-cancer Drug Sensitivity
We applied the proposed strategies to identify potential driver genes of anti-cancer drug sensitivity in the publicly available "Sanger Genomics of Drug Sensitivity in Cancer dataset from the Cancer Genome Project" (http://www.cancerrxgene.org/). The dataset contains the gene expression levels, copy number and mutation status for 654 cell lines and the half-maximal inhibitory drug concentrations (IC50 values) of 138 anti-cancer drugs as an indicator of drug sensitivity. We considered the expression levels of 13321 genes and the IC50 values of anti-cancer drugs to reveal driver genes, which are available from the resources: "Cell line genetic (mutation and copy number) and gene expression data used for EN analysis" and "Cell line drug sensitivity, mutations and tissue type", respectively, in "http://www.cancerrxgene.org/". Many IC50 values are missing from the Sanger dataset, and we therefore considered only 99 anti-cancer drugs, which have non-missing observations for at least 600 cancer cell lines, as response variables. The expression levels of 10% of the genes (i.e., 1332 genes) having the highest variance in all samples were considered as predictor variables. We employed B = 1000 bootstrap replications and the tuning parameters were selected by 5-fold cross validation.
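The variance-based prefiltering described above can be sketched as follows. The function name `top_variance_columns` and the small deterministic matrix are assumptions for illustration; truncating 10% of 13321 genes to an integer gives the 1332 candidate genes mentioned in the text.

```python
import numpy as np

def top_variance_columns(expr, fraction=0.10):
    """Return sorted column indices of the `fraction` of columns with highest variance."""
    expr = np.asarray(expr, dtype=float)
    n_keep = int(expr.shape[1] * fraction)   # e.g. int(13321 * 0.10) == 1332
    variances = expr.var(axis=0)
    order = np.argsort(variances)[::-1]      # columns from highest to lowest variance
    return np.sort(order[:n_keep])

# Hypothetical expression matrix: 5 samples, 20 genes, variance increasing with column index.
expr_toy = np.outer(np.arange(5), np.arange(1, 21))
selected = top_variance_columns(expr_toy, fraction=0.10)
print(selected)  # [18 19]
```

Filtering in this way is unsupervised, since it uses only the predictors and not the IC50 responses, so the same candidate set can be reused for all 99 drug-specific regression models.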
To evaluate the proposed methods, we compared the prediction accuracy of the recursive random lasso and elastic net, existing random lasso, elastic net, adaptive lasso and lasso based on 99 regression models corresponding to the 99 anti-cancer drugs. Table 3 shows the average root mean squared error of the 99 regression models. We can see from Table 3 that the random lasso type approaches show outstanding performance compared with the $L_1$-type regularization methods. The proposed recursive random lasso and elastic net show similar performance to the existing random lasso, while requiring less computation time, as shown by the run times listed in Table 3.
We then identified potential driver genes using the proposed recursive random elastic net. We focused on five popular anti-cancer drugs: Cisplatin, Docetaxel, Doxorubicin, Gemcitabine and Vinorelbine, which have attracted considerable attention in cancer research [6,7]. We briefly introduce the five anti-cancer drugs.
• Cisplatin (trade name: Platinol): a platinum-compound chemotherapy drug that stops cancer cells from growing. Target: DNA crosslinker. Used to treat: testicular, ovarian, bladder, head and neck, breast, cervical and prostate cancers. Side effects: nausea and vomiting, kidney toxicity, low white blood cell counts, and low red blood cell counts.
• Docetaxel (trade name: Taxotere): belongs to a class of chemotherapy drugs that work by preventing division of cancer cells. Target: Microtubules. Used to treat: breast, non-small cell lung, advanced stomach, and head and neck cancers. Side effects: nausea, diarrhea, hair loss, nail changes, low white blood cell counts, and low red blood cell counts.
• Doxorubicin (trade name: Adriamycin): an anti-cancer chemotherapy drug that is classified as an "anthracycline antibiotic". It slows or stops the growth of cancer cells, and binds to DNA by intercalation between specific base pairs, thus blocking DNA synthesis [8]. Target: DNA intercalation. Used to treat leukemia, bladder, breast, stomach, lung, ovarian and thyroid cancers, and soft tissue sarcoma. Side effects: hair loss, myelosuppression, oral mucositis, and diarrhea.
• Gemcitabine (trade name: Gemzar): an anti-cancer chemotherapy drug that is classified as an antimetabolite. Gemcitabine prevents the growth of cancer cells, eventually resulting in their destruction. It inhibits thymidylate synthetase, which leads to inhibition of DNA synthesis and cell death [9]. Target: DNA replication. Used to treat pancreatic, non-small cell lung, bladder, metastatic breast, and ovarian cancers, and soft-tissue sarcoma. Side effects: flu-like symptoms (e.g., muscle pain, fever, headache, etc.), fatigue, and poor appetite.
• Vinorelbine (trade name: Navelbine): an anti-cancer chemotherapy drug that is classified as a "plant alkaloid". Vinorelbine kills cancer cells by interfering with their DNA, which is necessary for their growth and reproduction. The antitumor activity of vinorelbine is thought to be due primarily to inhibition of mitosis at metaphase through its interaction with tubulin [9]. Target: Microtubules. Used to treat non-small cell lung, breast, and ovarian cancers, and Hodgkin's disease. Side effects: temporary decrease in white and red blood cells, muscle weakness, and constipation.
We identified the potential driver genes with the 10 largest importance measures $I_j$ among the selected genes for each anti-cancer drug (Table 4). As shown in Table 4, the identified genes are strong candidates for cancer driver genes. This implies that our method provides reliable results for uncovering driver genes. In short, the proposed strategies based on the recursive bootstrap method and parametric statistical test are useful tools for driver gene selection based on high dimensional genomic data analysis. Drug sensitivity-specific driver genes were also identified by the Cancer Genome Project, which applied elastic net regression modeling to identify driver genes. The results are given on the project website (http://www.cancerrxgene.org/). There are, however, differences between the driver genes selected in our study and those given on the project website, since we consider only the 10% of genes (i.e., 1332 genes) having the highest variance as candidate genes in regression modeling. Although the driver genes identified by our method differ from those identified by the project, we can see from Table 4 that they have strong supporting evidence as cancer driver genes.
We also show a gene network based on protein-protein interactions (PPIs). Fig 3 shows the potential driver genes identified in Table 4 as well as genes that have PPIs with the identified genes.
Solid lines indicate potential driver genes identified for each anti-cancer drug and dashed lines indicate PPIs between genes. The anti-cancer drug cisplatin has the largest sub-network constructed by PPIs with a path length of 1. In Fig 3, we can also see that the sub-networks of the five anti-cancer drugs share common genes. The common genes can be considered as driver genes for anti-cancer therapy, and investigation of the common genes may lead to development of effective cancer therapies.
We also focused on driver genes with large sub-networks, i.e., NEDD9, TCP1, CCT5, ACTC1, CS, CLIC4, and NCAM1; these genes are connected with a large number of genes ($n \ge 9$) by PPIs. Table 5 shows the genes with large sub-networks and their importance measures in the recursive random elastic net.
The numbers in parentheses indicate the number of genes connected by PPIs. We can see that the genes with large sub-networks have larger importance measures ($I_j$) than the average over all selected genes ($I_j^{\mathrm{sct}}$) and over all 1332 candidate genes ($I_j^{\mathrm{all}}$). This implies that possession of a large sub-network can be considered as a crucial feature for predicting anti-cancer drug sensitivity. We can also see from the results that the proposed recursive random elastic net can be used effectively to reveal driver genes with real biological relevance.

Conclusion
We have proposed a novel statistical strategy based on a recursive bootstrap approach and parametric statistical test (PSTVSboot) for identifying driver genes. To effectively perform high dimensional genomic data analysis, we used recursive bootstrap strategies in line with the random lasso method. Furthermore, we have proposed a parametric statistical test for gene selection based on the results of bootstrap regression modeling.
Numerical studies showed that the proposed methods achieve outstanding performance in variable selection and prediction accuracy. Furthermore, our methods required less computation time than the existing random lasso. We expect that our methods based on recursive bootstrap regression modeling and the parametric statistical test will be useful tools for high dimensional genomic data analysis, especially driver gene selection. Furthermore, we expect that the proposed parametric test can be used effectively for variable selection in bootstrap regression modeling.
Although the proposed parametric statistical test performs well for feature selection, our method is sensitive to the initial selection of predictor variables, because the initial selection result directly affects the selection probability in the random forest procedure. Thus, further work is required to develop a recursive random $L_1$-type regularization method that is robust against the initial selection.
Furthermore, we have focused on the proposed recursive random lasso from a practical rather than a theoretical viewpoint. We consider establishing the theoretical properties of our method (e.g., consistency of feature selection) as future work.
Variation in gene expression levels in cancer is known to be caused by copy number variation, and thus the two features should be considered concurrently when searching for driver genes. We also consider cancer driver gene selection via analysis of copy number driven expression levels, through an extension of the recursive random lasso strategies, as a direction for future work.