Phenotype Prediction Using Regularized Regression on Genetic Data in the DREAM5 Systems Genetics B Challenge

doi:10.1371/journal.pone.0029095

Table 1.

Highest absolute correlations of genotype and gene expression data to phenotype, versus random background.

Figure 1.

Large contribution of outliers to variance in phenotype 1.

The largest seven outliers in phenotype 1 account for the bulk of the variance in the data; in contrast, the outlier distribution for phenotype 2 is similar to that of a random normal variable.
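
One way to quantify a statement like this is to compute the share of the total sum of squared deviations contributed by the k points farthest from the mean. The following is a minimal sketch under stated assumptions: the function name `outlier_variance_share` and both sample data sets are hypothetical illustrations, not the DREAM5 phenotype data.

```python
import random


def outlier_variance_share(values, k):
    """Fraction of the total sum of squared deviations from the mean
    contributed by the k points farthest from the mean."""
    mean = sum(values) / len(values)
    sq_dev = sorted(((v - mean) ** 2 for v in values), reverse=True)
    total = sum(sq_dev)
    return sum(sq_dev[:k]) / total if total > 0 else 0.0


# Hypothetical samples (assumption: sizes and values are illustrative only).
rng = random.Random(0)
normal_like = [rng.gauss(0.0, 1.0) for _ in range(200)]
# Same sample with its last seven points replaced by extreme values.
heavy_tailed = normal_like[:193] + [20.0, -18.0, 25.0, 22.0, -21.0, 19.0, 24.0]

share_normal = outlier_variance_share(normal_like, 7)
share_heavy = outlier_variance_share(heavy_tailed, 7)
print(f"top-7 share, normal-like sample:  {share_normal:.2f}")
print(f"top-7 share, heavy-tailed sample: {share_heavy:.2f}")
```

In the heavy-tailed sample the seven extreme points dominate the total, while in the normal-like sample the top seven contribute only a modest fraction.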

Figure 2.

Single-variable, two-variable, and pairwise logic regression for phenotype 2.

The plot compares the best least squares fits attainable under three model types: single-variable regression using each genotype feature independently (blue), two-variable regression using pairs of features at once (green), and single-variable regression using pairs of features combined through a binary boolean relation (red). The best single-variable fits using boolean combination features outperform the best two-variable regressions.
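
The three model types can be compared on toy data by their residual sums of squares. The sketch below is illustrative only: the binary genotype matrix is synthetic, the phenotype is deliberately built from an XOR of two markers so that a boolean feature wins, and the set of boolean relations (AND, OR, XOR) is an assumed subset rather than the set used for the figure.

```python
import random
from itertools import combinations


def _centered(v):
    m = sum(v) / len(v)
    return [x - m for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rss_one(x, y):
    """Residual sum of squares of the least squares fit y ~ a + b*x."""
    xc, yc = _centered(x), _centered(y)
    sxx = _dot(xc, xc)
    if sxx == 0:  # constant feature: only the intercept can be fitted
        return _dot(yc, yc)
    return _dot(yc, yc) - _dot(xc, yc) ** 2 / sxx


def rss_two(x1, x2, y):
    """Residual sum of squares of the fit y ~ a + b1*x1 + b2*x2."""
    c1, c2, yc = _centered(x1), _centered(x2), _centered(y)
    s11, s22, s12 = _dot(c1, c1), _dot(c2, c2), _dot(c1, c2)
    s1y, s2y = _dot(c1, yc), _dot(c2, yc)
    det = s11 * s22 - s12 ** 2
    if abs(det) < 1e-12:  # collinear features: fall back to one variable
        return min(rss_one(x1, y), rss_one(x2, y))
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return _dot(yc, yc) - b1 * s1y - b2 * s2y


# Assumed subset of binary boolean relations.
OPS = {
    "AND": lambda u, v: u & v,
    "OR": lambda u, v: u | v,
    "XOR": lambda u, v: u ^ v,
}

# Synthetic data: 5 binary markers, phenotype driven by XOR of the first two.
rng = random.Random(1)
n = 100
markers = [[rng.randint(0, 1) for _ in range(n)] for _ in range(5)]
y = [2.0 * (a ^ b) + rng.gauss(0.0, 0.3) for a, b in zip(markers[0], markers[1])]

best_single = min(rss_one(m, y) for m in markers)
best_two = min(rss_two(a, b, y) for a, b in combinations(markers, 2))
best_logic = min(
    rss_one([op(u, v) for u, v in zip(a, b)], y)
    for a, b in combinations(markers, 2)
    for op in OPS.values()
)
print(f"best single-variable RSS: {best_single:.1f}")
print(f"best two-variable RSS:    {best_two:.1f}")
print(f"best boolean-feature RSS: {best_logic:.1f}")
```

An additive two-variable model cannot represent an XOR interaction, so on this constructed example the single boolean feature fits far better; on real data the size of the gap depends on how strongly the phenotype depends on such interactions.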

Figure 3.

Correlation coefficients between genotype markers, displaying linkage disequilibrium.

The heat map shows Pearson correlations between pairs of genotype markers; most pairs have only slightly positive or negative correlations attributable to chance, but groups of nearby markers exhibit distinctly positive correlations.
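
A correlation matrix of this kind can be sketched on simulated markers. The generator below is a hypothetical stand-in for real genotype data: markers are grouped into blocks, and within a block each marker copies its left neighbour with an assumed probability, which produces the block pattern of positive correlations.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def correlation_matrix(columns):
    """Pairwise Pearson correlations between marker columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


def simulate_markers(n_samples, block_sizes, copy_prob, rng):
    """Hypothetical generator: each block starts with an independent random
    allele; later markers copy the previous one with probability copy_prob
    and are flipped otherwise."""
    columns = [[] for _ in range(sum(block_sizes))]
    for _ in range(n_samples):
        j = 0
        for size in block_sizes:
            allele = rng.randint(0, 1)
            for pos in range(size):
                if pos > 0 and rng.random() > copy_prob:
                    allele = 1 - allele
                columns[j].append(allele)
                j += 1
    return columns


rng = random.Random(2)
columns = simulate_markers(200, [6, 6], 0.9, rng)
corr = correlation_matrix(columns)

# Neighbouring markers inside a block versus markers in different blocks.
adjacent = [corr[i][i + 1] for i in list(range(0, 5)) + list(range(6, 11))]
cross = [abs(corr[i][j]) for i in range(6) for j in range(6, 12)]
mean_adjacent = sum(adjacent) / len(adjacent)
mean_cross = sum(cross) / len(cross)
print(f"mean correlation, adjacent markers in a block: {mean_adjacent:.2f}")
print(f"mean |correlation|, markers in different blocks: {mean_cross:.2f}")
```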

Figure 4.

Goodness of fit of regularized regression models on training data using various regressor sets.

We tested elastic net, lasso, and approximate best subset selection on phenotypes 1 and 2 using regressor sets derived from the DREAM5 subchallenges B1, B2, and B3. In each case the regularization parameter(s) were chosen to optimize average Spearman correlation. We ran multiple cross-validation tests with different random fold splits to reduce uncertainty in mean performance and enable comparison between methods; error bars show one standard deviation.

Figure 5.

Example elastic net predictions versus actual values with and without rank transformation for subchallenge B2P1.

Each scatter plot shows predictions from one cross-validation run on the training data (blue points) as well as predictions of the fitted model for the gold standard test set (red points). For the elastic net modeling on rank-transformed data (right plot), predictions of phenotype 1 values on an absolute scale were obtained by interpolation. The values reported on the plots are Pearson correlation coefficients.
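
The rank transformation and the mapping back to an absolute scale can be sketched as follows. The caption does not say which interpolation scheme was used; linear interpolation between sorted training values is an assumption here, and the helper names `rank_transform` and `ranks_to_values` are hypothetical.

```python
def rank_transform(values):
    """Replace each value by its 1-based rank; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def ranks_to_values(pred_ranks, train_values):
    """Map predicted (possibly fractional) ranks back to the phenotype scale
    by linear interpolation between sorted training values (assumed scheme).
    Ranks outside 1..n are clamped to the training range."""
    sorted_vals = sorted(train_values)
    n = len(sorted_vals)
    out = []
    for r in pred_ranks:
        r = min(max(float(r), 1.0), float(n))
        lo = int(r) - 1
        hi = min(lo + 1, n - 1)
        frac = r - int(r)
        out.append(sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo]))
    return out


# Illustrative values only: one large outlier in the training phenotype.
train = [10.0, 200.0, 30.0, 40.0]
train_ranks = rank_transform(train)
back = ranks_to_values([1.0, 2.5, 4.0, 9.0], train)
print(train_ranks)
print(back)
```

Working in rank space keeps a few extreme phenotype values from dominating the least squares fit; the interpolation step is only needed when predictions must be reported on the original scale.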

Table 2.

Improvement in goodness of fit with rank transformation on phenotype 1.

Figure 6.

Variation in cross-validation and test set performance with model complexity for subchallenge B2.

Each plot follows the performance of a regression model as complexity increases. For lasso (top plots), model complexity is determined by a regularization parameter; for best subset selection (bottom plots), complexity is defined as the number of features used. The blue curves show Spearman correlations averaged over cross-validation folds, each fold having approximately the same size as the gold standard test set. Performance varies dramatically from fold to fold; error bars show one standard deviation of the Spearman correlations achieved for different folds. The red curves follow performance of the models on the actual gold standard.
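
The per-fold Spearman scores behind such a curve can be summarized as in the sketch below, which computes the mean and standard deviation across folds for one model. It assumes out-of-fold predictions are already available; the data are synthetic, and the choice of 210 samples split into seven folds of 30 is an assumption made for illustration.

```python
import random
import statistics


def _ranks(values):
    """Average 1-based ranks (ties share the mean of their positions)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def fold_summary(predicted, actual, n_folds, seed=0):
    """Split sample indices into n_folds random folds and return
    (mean, standard deviation, per-fold scores) of the Spearman correlation."""
    idx = list(range(len(actual)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    scores = [
        spearman([predicted[i] for i in f], [actual[i] for i in f])
        for f in folds
    ]
    return statistics.fmean(scores), statistics.stdev(scores), scores


# Synthetic stand-in for out-of-fold predictions: signal plus heavy noise.
rng = random.Random(3)
actual = [rng.gauss(0.0, 1.0) for _ in range(210)]
predicted = [a + rng.gauss(0.0, 2.0) for a in actual]

mean_rho, sd_rho, scores = fold_summary(predicted, actual, n_folds=7)
print(f"mean Spearman over folds: {mean_rho:.2f} (s.d. {sd_rho:.2f})")
```

With folds of only about 30 samples, the per-fold scores spread widely around their mean even when the underlying predictor is unchanged, which is the effect the error bars are meant to convey.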

Figure 7.

Stability of features and coefficients selected by elastic net regression for subchallenge B2.

The heat maps show regression coefficients chosen by the best-fit elastic net models as each cross-validation fold is in turn held out of the training set. The features shown on the vertical axis are those having a nonzero coefficient in at least one of the seven runs; they are indexed by their rank in Table 1, that is, by their correlation to the phenotype being predicted.
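
The bookkeeping behind such a stability display can be sketched as a small table of selection frequencies. The function name `selection_stability`, the feature names, and the coefficient values below are all hypothetical; in practice the per-fold coefficients would come from the fitted models.

```python
def selection_stability(fold_coefs, tol=0.0):
    """Given one {feature: coefficient} dict per held-out fold, return
    {feature: (fraction of folds with a nonzero coefficient, coefficients)}
    for every feature that is nonzero in at least one fold."""
    features = sorted(
        {f for coefs in fold_coefs for f, c in coefs.items() if abs(c) > tol}
    )
    table = {}
    for f in features:
        values = [coefs.get(f, 0.0) for coefs in fold_coefs]
        n_selected = sum(abs(v) > tol for v in values)
        table[f] = (n_selected / len(fold_coefs), values)
    return table


# Hypothetical coefficients from three held-out-fold fits.
fold_coefs = [
    {"gene_a": 0.8, "gene_b": -0.3, "gene_d": 0.0},
    {"gene_a": 0.7, "gene_c": 0.2, "gene_d": 0.0},
    {"gene_a": 0.9, "gene_b": -0.1, "gene_d": 0.0},
]
table = selection_stability(fold_coefs)
for name, (freq, values) in table.items():
    print(f"{name}: selected in {freq:.0%} of folds, coefficients {values}")
```

Each row of the resulting table corresponds to one row of a coefficient heat map: stable features have a selection frequency near one and coefficients of consistent sign and size.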

Figure 8.

Final results of the five teams participating in all three DREAM5 Systems Genetics B subchallenges.

All teams had difficulty even achieving consistently positive correlations; we suspect the main obstacles were the large amount of noise in the data and the small 30-sample gold standard evaluation sets. We achieved the best performance on the test set used for subchallenge B2 (prediction using gene expression data only).
