Simulation study to evaluate when Plasmode simulation is superior to parametric simulation in estimating the mean squared error of the least squares estimator in linear regression

doi:10.1371/journal.pone.0299989

Table 1.

Parameters for true data generating processes (DGP) and outcome generating models (OGM).

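In all scenarios, the true vector of coefficients is equal to β = 1_{p+1} (the (p+1)-dimensional vector of ones, consistent with the values of β given in the figure captions) and the error distribution is set to ε ∼ N(0, 0.3²). 0_p denotes the p-dimensional vector of zeros.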
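
As an illustration of what a parametric simulation under such a data generating process might look like, the following minimal sketch estimates the per-coefficient MSE of the least squares estimator by Monte Carlo. It is not the authors' code. The function names (`equicorrelation`, `ols_mse_parametric`), the number of repetitions, the seed, and the choice of zero-mean, unit-variance Gaussian features are assumptions made for illustration; the values p = 2, n = 100, β = (1, 1, 1)^T, σ = 0.3 and Cor(X_i, X_j) = 0.2 are taken from the figure captions.

```python
import numpy as np

def equicorrelation(p, rho):
    """Return a p x p correlation matrix with rho off the diagonal."""
    return (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))

def ols_mse_parametric(beta, sigma, corr, n, n_rep, rng):
    """Monte Carlo estimate of the per-coefficient MSE of the OLS estimator.

    Features are drawn from a zero-mean multivariate normal with covariance
    `corr` (an assumption for this sketch); outcomes follow the linear model
    y = beta_0 + X beta_{1:p} + eps with eps ~ N(0, sigma^2).
    """
    p = corr.shape[0]
    sq_err = np.zeros(p + 1)
    for _ in range(n_rep):
        X = rng.multivariate_normal(np.zeros(p), corr, size=n)
        design = np.column_stack([np.ones(n), X])  # intercept column first
        y = design @ beta + rng.normal(0.0, sigma, size=n)
        beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
        sq_err += (beta_hat - beta) ** 2
    return sq_err / n_rep

rng = np.random.default_rng(0)  # arbitrary seed, chosen for reproducibility
beta = np.ones(3)               # beta = (1, 1, 1)^T, intercept included
mse = ols_mse_parametric(beta, sigma=0.3, corr=equicorrelation(2, 0.2),
                         n=100, n_rep=2000, rng=rng)
print(mse)  # one MSE estimate per coefficient
```

Each entry of `mse` should be on the order of σ²/n, since the variance of the least squares estimator scales with the error variance divided by the sample size.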

Table 2.

Deviations from true DGP and OGM for parametric and Plasmode simulation.

Fig 1.

Relative error in MSE estimation for individual coefficients for different types of Plasmode simulation compared to parametric simulation under assumption of true DGP and OGM.

Fig 2.

Absolute value of the relative error in MSE estimation averaged over individual coefficients, for different types of Plasmode simulation compared to parametric simulation under the assumption of the true DGP and OGM, for p = 2, n = 100, β = (1, 1, 1)^T, σ = 0.3, Cor(X_i, X_j) = 0.2 ∀i ≠ j.

Fig 3.

Absolute value of relative error in MSE estimation for individual coefficients when the assumed feature distribution in parametric simulation deviates from the true distribution, for p = 2, n = 100, β = (1, 1, 1)^T, σ = 0.3, Cor(X_i, X_j) = 0.2 ∀i ≠ j.

Fig 4.

Absolute value of relative error in MSE estimation averaged over individual coefficients for different types of Plasmode simulation compared to parametric simulation, under the assumption of the true DGP and OGM, for p = 50, n = 100, β = 1₅₁, σ = 0.3, Cor(X_i, X_j) = 0.2 ∀i ≠ j.

Fig 5.

Relative error in MSE estimation for individual coefficients when the assumed mean of the marginal distribution of the second feature in parametric simulation deviates from the true mean, for p = 2, n = 100, β = (1, 1, 1)^T, σ = 0.3, Cor(X_i, X_j) = 0.2 ∀i ≠ j.

N(0,1), N(μ,1) denotes that the first feature is generated from a standard normal (truth), and the second feature is generated from a normal distribution with mean μ instead (deviation).

Fig 6.

Relative error in MSE estimation for individual coefficients when the assumed variance of the marginal distribution of the second feature in parametric simulation deviates from the true variance, for p = 2, n = 100, β = (1, 1, 1)^T, σ = 0.3, Cor(X_i, X_j) = 0.2 ∀i ≠ j.

N(0,1), N(0,σ²) denotes that the first feature is generated from a standard normal (truth), and the second feature is generated from a normal distribution with variance σ² instead (deviation).

Fig 7.

Relative error in MSE estimation for individual coefficients when the assumed correlation of the features in parametric simulation deviates from true correlation, for p = 2, n = 100, β = (1, 1, 1)^T, σ = 0.3, Cor(X_i, X_j) = 0.5 ∀i ≠ j.

Fig 8.

Relative error in MSE estimation for individual coefficients when the assumed correlation of the features in parametric simulation deviates from true correlation, for p = 2, n = 100, β = (1, 1, 1)^T, σ = 0.3, Cor(X_i, X_j) = 0.2 ∀i ≠ j.

Fig 9.

Relative error in MSE estimation for individual coefficients when the assumed correlation of the features in parametric simulation deviates from true correlation, for p = 2, n = 100, β = (1, 1, 1)^T, σ = 0.3, Cor(X_i, X_j) = 0.2^|i−j| for ith and jth feature within each of the 5 blocks.

Fig 10.

Relative error in MSE estimation for individual coefficients when the assumed marginal distribution of the second feature in parametric simulation is misspecified as Gaussian mixture with increasing proportion of data drawn from Gaussian with different expectations (bimodal distribution).

The mean and the variance of the marginal normal distribution of the first feature are set to match those of the second. The mixing proportion is given on the x-axis.

Fig 11.

Relative error in MSE estimation for individual coefficients when the assumed marginal distribution of the second feature in parametric simulation is misspecified as Gaussian mixture with increasing proportion of data drawn from Gaussian with different variance (contaminated distribution), for p = 2, n = 100, β = (1, 1, 1)^T, σ = 0.3, Cor(X_i, X_j) = 0.2 ∀i ≠ j.

The mean and the variance of the marginal normal distribution of the first feature are set to match those of the second. The mixing proportion is given on the x-axis.

Fig 12.

Relative error in MSE estimation for individual coefficients when the assumed marginal distribution of the second feature in parametric simulation is misspecified as log-normal, for p = 2, n = 100, β = (1, 1, 1)^T, σ = 0.3, Cor(X_i, X_j) = 0.2 ∀i ≠ j.

The mean and the variance of the marginal normal distribution of the first feature are set to match those of the second.

Fig 13.

Relative error in MSE estimation for individual coefficients when the assumed marginal distribution of the second feature in parametric simulation is misspecified as Bernoulli with different success probabilities, for p = 2, n = 100, β = (1, 1, 1)^T, σ = 0.3, Cor(X_i, X_j) = 0.2 ∀i ≠ j.

Fig 14.

Absolute value of relative error in MSE estimation averaged over individual coefficients when the assumed coefficients in parametric and Plasmode simulation are misspecified, for p = 50, n = 100, β = 1₅₁, σ = 0.3, Cor(X_i, X_j) = 0.2 ∀i ≠ j, β_I = (0, 0.02, …, 1)^T, β_II = 0.05₅₁, β_III = 10₅₁, β_IV = 0₅₁.

Large outliers for n out of n Bootstrap are not displayed.

Fig 15.

Absolute value of relative error in MSE estimation averaged over individual coefficients when the assumed error variance in parametric and Plasmode simulation is misspecified, for p = 50, n = 100, β = 1₅₁, σ = 0.3, Cor(X_i, X_j) = 0.2 ∀i ≠ j.

Large outliers for n out of n Bootstrap are not displayed.

Fig 16.

Absolute value of relative error in MSE estimation averaged over individual coefficients when the assumed error distributions in parametric and Plasmode simulation are misspecified, for p = 50, n = 100, β = 1₅₁, σ = 0.3, Cor(X_i, X_j) = 0.2 ∀i ≠ j.

Large outliers for n out of n Bootstrap are not displayed.

Fig 17.

Absolute value of relative error in MSE estimation for individual coefficients when the assumed feature correlation matrix in parametric simulation is misspecified.

True correlation matrix is estimated from the benchmark dataset quake (p = 3, n = 100, β = 1₄, σ = 0.3).

Fig 18.

Absolute value of relative error in MSE estimation for individual coefficients when the assumed feature correlation matrix in parametric simulation is misspecified.

True correlation matrix is estimated from the benchmark dataset wine_quality (p = 11, n = 100, β = 1₁₂, σ = 0.3).

Fig 19.

Absolute value of relative error in MSE estimation averaged over individual coefficients when the assumed feature correlation matrix in parametric simulation is misspecified.

True correlation matrix is estimated from the benchmark dataset Yolanda (p = 100, n = 200, β = 1₁₀₁, σ = 0.3).

Fig 20.

Comparison of different resampling types for different numbers of observations resampled from a dataset with 100 observations.

Absolute value of relative error in MSE estimation averaged over individual coefficients when the true model is assumed in parametric and Plasmode simulation, for p = 10, n = 100, β = 1₁₁, σ = 0.3, Cor(X_i, X_j) = 0.2 ∀i ≠ j.
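
To make the idea of different resampling types concrete, the following sketch shows one way a Plasmode-style MSE estimate could be computed: rows of an observed feature matrix are resampled, outcomes are generated from an assumed outcome generating model, and the least squares estimator is refit in each repetition. This is an illustrative assumption and not the authors' implementation. The function name `plasmode_mse`, the stand-in dataset `X_obs`, the subsample size of 63, the number of repetitions, and the seed are all hypothetical; p = 10, n = 100, β = 1₁₁, σ = 0.3 and Cor(X_i, X_j) = 0.2 follow the caption.

```python
import numpy as np

def plasmode_mse(X_obs, beta, sigma, m, replace, n_rep, rng):
    """Per-coefficient MSE of OLS when features are resampled from X_obs.

    m rows are drawn per repetition: replace=True with m = n corresponds to
    an n out of n bootstrap, replace=False with m < n to subsampling.
    Outcomes come from the assumed linear model with N(0, sigma^2) errors.
    """
    n = X_obs.shape[0]
    sq_err = np.zeros(X_obs.shape[1] + 1)
    for _ in range(n_rep):
        idx = rng.choice(n, size=m, replace=replace)
        design = np.column_stack([np.ones(m), X_obs[idx]])  # intercept first
        y = design @ beta + rng.normal(0.0, sigma, size=m)
        beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
        sq_err += (beta_hat - beta) ** 2
    return sq_err / n_rep

rng = np.random.default_rng(1)  # arbitrary seed
p, n, rho = 10, 100, 0.2
corr = (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))
# Stand-in for a real dataset with 100 observations (hypothetical data).
X_obs = rng.multivariate_normal(np.zeros(p), corr, size=n)
beta = np.ones(p + 1)

mse_boot = plasmode_mse(X_obs, beta, 0.3, m=n, replace=True, n_rep=500, rng=rng)
mse_sub = plasmode_mse(X_obs, beta, 0.3, m=63, replace=False, n_rep=500, rng=rng)
print(mse_boot.mean(), mse_sub.mean())  # MSE averaged over coefficients
```

Varying `m` and `replace` gives the kind of comparison across resampling types and resample sizes that the figure describes; the specific settings used in the study are not stated in this caption.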

Table 3.

Smallest deviations in parametric simulations for which Plasmode simulation is superior to parametric simulation.

p denotes the number of features, n the number of observations. True ρ gives the true correlation structure, scenario type the type of deviation and true value the true parameter value that the deviation refers to.
