Penalized regression and model selection methods for polygenic scores on summary statistics

doi:10.1371/journal.pcbi.1008271

Fig 1.

Prediction r² values for simulation 1.

Error bars represent standard deviation for the r² value across 20 replications.

Fig 2.

Prediction r² values for simulation 2.

Error bars represent standard deviation for the r² value across 20 replications.

Fig 3.

Prediction r² values for simulation 3.

Error bars represent standard deviation for the r² value across 20 replications.

Fig 4.

Predictive r² on out-of-sample data for TlpSum and LassoSum for each of the 100 replications at each of the four simulation settings.

Each line is the 45-degree line through the origin (equal r² for both methods), not a line of best fit. Points below the line indicate better performance of TlpSum.

Fig 5.

Predictive r² on out-of-sample data for TlpSum and ElastSum for each of the 100 replications at each of the four simulation settings.

Each line is the 45-degree line through the origin (equal r² for both methods), not a line of best fit. Points below the line indicate better performance of TlpSum.

Fig 6.

Number of nonzero effect sizes estimated by the three penalized regression methods as compared to the true number of nonzero effects, for the three sparse simulation settings.

Fig 7.

Number of true positives for the three penalized regression methods in the three sparse simulation settings.

Fig 8.

Precision of estimated nonzero effect sizes for the penalized regression methods applied to the three sparse simulation settings.

Fig 9.

Performance of the seven different model selection methods applied to a set of candidate LassoSum models.

Performance is measured by r² on the testing data (the right bar in each group), and by squared quasi-correlation on the testing data (the left bar in each group). Error bars represent the standard deviation across 20 replications.

Fig 10.

Number of estimated nonzero effects for each model selection method across each of the simulation settings in simulation 1.

Models were selected from a set of candidate LassoSum models.

Fig 11.

Performance of the selected models for each of the model selection methods across the different simulation settings of simulation 1, as measured by precision, recall, and F1 score.

The leftmost box in each grouping of three corresponds to pseudo AIC, the center corresponds to pseudo BIC, and the rightmost corresponds to pseudovalidation. Models were selected from a set of candidate LassoSum models.
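
As a reading aid for the precision, recall, and F1 score named in this caption, the following is a minimal sketch of how such metrics can be computed for variable selection, assuming a variant counts as selected when its estimated effect is nonzero and as truly causal when its simulated effect is nonzero. The function name `selection_metrics` and the toy effect vectors are hypothetical and are not taken from the paper.

```python
def selection_metrics(true_effects, estimated_effects):
    """Precision, recall, and F1 for recovering nonzero effects.

    Assumption: a variant is a "positive" when its effect is nonzero.
    """
    true_nonzero = {i for i, b in enumerate(true_effects) if b != 0}
    est_nonzero = {i for i, b in enumerate(estimated_effects) if b != 0}

    # True positives: variants that are nonzero in both vectors.
    tp = len(true_nonzero & est_nonzero)

    # Guard against empty sets to avoid division by zero.
    precision = tp / len(est_nonzero) if est_nonzero else 0.0
    recall = tp / len(true_nonzero) if true_nonzero else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example (made-up numbers): 3 causal variants, 3 selected, 2 overlap.
true_effects = [0.5, 0.0, 0.0, -0.2, 0.0, 0.1]
estimated_effects = [0.4, 0.1, 0.0, 0.0, 0.0, 0.2]
precision, recall, f1 = selection_metrics(true_effects, estimated_effects)
```

Under these definitions, precision penalizes false selections, recall penalizes missed causal variants, and F1 is their harmonic mean.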

Table 1.

Median sample size for each study in the lipid analysis.

Table 2.

Model performance for each model selection method, as measured by the quasi-correlation of the model's predictions in the BioBank data.

Models were estimated via TlpSum on the Teslovich data.
