Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data

doi:10.1371/journal.pone.0160733

Fig 1.

Estimates of accuracy as a function of L for 13028 sites imputed with Beagle.

Imputation accuracies were estimated using the sample Pearson correlation coefficient, r. The sample Pearson correlation is a function of two vectors, both of length L. Fig 1 presents estimated accuracy as a function of L for set A sites (n = 13028). The range of L is divided into a series of seven equally sized bins (i.e. 0 < L ≤ 100, 100 < L ≤ 200, …, 600 < L ≤ 700). Accuracy estimates were assigned to bins according to their corresponding values of L. Bin means and medians are presented as red and blue points, respectively.
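The accuracy estimate and binning described in this caption can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names `pearson_r` and `bin_summary`, their parameters, and the use of NumPy are assumptions made for the example.

```python
import numpy as np

def pearson_r(true_geno, imputed_geno):
    """Sample Pearson correlation between two vectors of equal length L."""
    x = np.asarray(true_geno, dtype=float)
    y = np.asarray(imputed_geno, dtype=float)
    if x.shape != y.shape or x.size < 2:
        raise ValueError("need two vectors of equal length L >= 2")
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    # r is undefined when either vector has zero variance
    return float("nan") if denom == 0 else float((xc * yc).sum() / denom)

def bin_summary(lengths, accuracies, width=100, upper=700):
    """Group accuracies into bins (0, 100], (100, 200], ..., by L.

    Returns one (lower, upper, mean, median) tuple per non-empty bin.
    """
    L = np.asarray(lengths, dtype=float)
    acc = np.asarray(accuracies, dtype=float)
    edges = np.arange(0, upper + width, width)
    # right=True assigns L to bin i when edges[i-1] < L <= edges[i]
    idx = np.digitize(L, edges, right=True)
    rows = []
    for i in range(1, len(edges)):
        vals = acc[idx == i]
        vals = vals[~np.isnan(vals)]  # skip undefined accuracies
        if vals.size:
            rows.append((int(edges[i - 1]), int(edges[i]),
                         float(vals.mean()), float(np.median(vals))))
    return rows
```

Each site contributes one accuracy value and one L, so `bin_summary` would be called once with the per-site vectors to obtain the bin means and medians plotted in the figure.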

Fig 2.

A summary and comparison of per-site and per-individual imputation accuracy from Beagle and glmnet imputation.

(A and B) The x- and y-axes report estimates of imputation accuracy for glmnet and Beagle, respectively. Each point represents the estimated accuracy for a single site (A) and individual (B). (C) Both Beagle and glmnet produced bimodal distributions of per-site accuracies, with median per-site imputation accuracies of 0.76 (black vertical line) and 0.82 (red vertical line), respectively. (D) Both methods produced left-skewed distributions of per-individual accuracies, with median per-individual accuracies of 0.991 and 0.992 for Beagle and glmnet, respectively.

Fig 3.

Per-site and per-individual imputation accuracy as a function of missing data and median read depth.

(A) Beagle and glmnet imputation accuracy as a function of missing data for sites in set C (n = 9737). (B) The x- and y-axes display the proportion of missing data and the accuracy difference between Beagle and glmnet (Beagle minus glmnet), respectively, at the site and individual level. The range of x is divided into ten equally sized bins (i.e. 0.00 < x ≤ 0.10, 0.10 < x ≤ 0.20, …, 0.90 < x ≤ 1.00), and accuracy differences are assigned to bins according to their level of missing data. Bin means and medians, summarizing the data within each bin, are displayed as red and blue points, respectively. Points falling on the black horizontal line at y = 0 indicate no observed accuracy difference between Beagle and glmnet imputation. Points falling below y = 0 represent cases where glmnet imputes with higher accuracy than Beagle.

Fig 4.

Imputation accuracy as a function of MAF.

The left and middle panels show per-site accuracy of Beagle and glmnet as a function of (estimated) MAF. The right-most panel shows the difference in accuracy between Beagle and glmnet at each site as a function of MAF. The greatest difference in accuracy was observed at low-frequency variants, where imputation accuracy was also highly variable.
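For readers unfamiliar with the x-axis quantity, a per-site MAF estimate can be computed from called genotypes as sketched below. The caption does not say how MAF was estimated, so this assumes genotypes coded as 0, 1, or 2 copies of one allele with missing calls stored as NaN; the function name `estimate_maf` is hypothetical.

```python
import numpy as np

def estimate_maf(genotypes):
    """Estimate minor allele frequency at one site.

    Assumes genotypes are coded 0/1/2 (copies of one allele) and
    that missing calls are NaN; both are assumptions of this sketch.
    """
    g = np.asarray(genotypes, dtype=float)
    called = g[~np.isnan(g)]          # drop missing calls
    if called.size == 0:
        return float("nan")           # no data at this site
    p = called.mean() / 2.0           # frequency of the counted allele
    return float(min(p, 1.0 - p))     # minor allele is the rarer one
```

Taking the smaller of p and 1 - p makes the estimate independent of which allele happens to be counted.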

Fig 5.

The accuracy difference between reference panel A and panel B as a function of MAF and proportion of missing data for 11535 sites.

(A) Genotypes in a sample of 2490 C1 individuals were imputed using two different reference panels: reference panel A, composed of 694 phased GG individuals, and reference panel B, composed of 80 phased individuals listed as progenitors of the C1 population. (B and C) Points falling on the black horizontal line at y = 0 indicate no observed accuracy difference between imputing with reference panel A and with reference panel B. Points falling below y = 0 represent cases where Beagle imputes with higher accuracy using reference panel B than using reference panel A.

Table 1.

A summary of the computational cost (in seconds) and the median per-site and per-individual accuracy of Beagle and glmnet under scenarios 1, 2, and 3.
