Detectability of runs of homozygosity is influenced by analysis parameters and population-specific demographic history

doi:10.1371/journal.pcbi.1012566

Fig 1.

A) Diagrams describing population size changes for each of the simulated demographic scenarios. Each diagram shows the final 1,000 generations of a scenario; every scenario was preceded by a burn-in of 10,000 generations with a population size of 10,000 diploid individuals. Population sizes for each interval are noted for each scenario. B,D,F,H) True overall F_ROH frequencies. C,E,G,I) Length bin-specific true F_ROH values; horizontal lines correspond to bin median values.
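
As a point of reference for the F_ROH values shown here, F_ROH is conventionally the fraction of the genome covered by ROHs. The following is a minimal sketch, assuming non-overlapping ROHs given as half-open (start, end) intervals in base pairs. The function names and the length-bin boundaries are hypothetical placeholders, not the thresholds used in the study.

```python
from bisect import bisect_right

# Hypothetical bin edges in base pairs (an assumption, not taken from the study).
BIN_EDGES = [500_000, 1_000_000, 2_000_000]
BIN_NAMES = ["short", "intermediate", "long", "very long"]


def f_roh(rohs, genome_length):
    """Fraction of the genome covered by non-overlapping (start, end) ROHs."""
    return sum(end - start for start, end in rohs) / genome_length


def f_roh_by_bin(rohs, genome_length):
    """F_ROH split by ROH length bin; the per-bin values sum to the overall F_ROH."""
    totals = {name: 0 for name in BIN_NAMES}
    for start, end in rohs:
        length = end - start
        # bisect_right maps a length to the index of the bin it falls in.
        totals[BIN_NAMES[bisect_right(BIN_EDGES, length)]] += length
    return {name: total / genome_length for name, total in totals.items()}
```

Calling `f_roh_by_bin` on one individual's true ROHs and on the same individual's called ROHs gives the two quantities that a per-bin comparison would need.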

Table 1.

Parameter values applied during ROH calling for both simulated and empirical data.

For PLINK, a total of 486 combinations were tested. PLINK default values are underlined. ROH = run of homozygosity.
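
A parameter sweep of this kind can be enumerated as a Cartesian product. The sketch below is illustrative only: the choice of parameters and every value in `GRID` are assumptions picked so that the grid size equals 486 (2 × 3^5), and they should not be read as the values in Table 1. The `--homozyg*` flags are PLINK 1.9 options; the helper name `plink_commands` and the file prefixes are hypothetical.

```python
from itertools import product

# Hypothetical grid: which parameters vary, and their values, are assumptions
# chosen only so that the number of combinations is 486.
GRID = {
    "--homozyg-snp": [50, 100],
    "--homozyg-kb": [500, 1000, 1500],
    "--homozyg-density": [25, 50, 100],
    "--homozyg-gap": [500, 1000, 2000],
    "--homozyg-window-snp": [25, 50, 100],
    "--homozyg-window-het": [1, 2, 3],
}


def plink_commands(bfile, out_prefix):
    """Yield one PLINK argument list per parameter combination."""
    names = list(GRID)
    for i, values in enumerate(product(*GRID.values())):
        args = ["plink", "--bfile", bfile, "--homozyg"]
        for name, value in zip(names, values):
            args += [name, str(value)]
        args += ["--out", f"{out_prefix}_{i}"]
        yield args
```

Each yielded list could be passed to `subprocess.run`; keeping the combination index in the output prefix makes it straightforward to map results back to parameter settings.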

Fig 2.

The relationship between true and inferred F_ROH values depends on inference method and population demographic scenario.

Each regression line represents linear model results for a single level of coverage, with shaded areas representing 95% confidence intervals. Each point represents data for a single simulated individual. Panels display outcomes using BCFtools in Genotypes mode (A-D), BCFtools in Likelihoods mode (E-H), and PLINK (I-L), as well as by population scenario, including large (A, E, I), small (B, F, J), bottlenecked (C, G, K), and declining (D, H, L) populations. The dashed line is the 1:1 line, and x- and y-axes are consistent within each demographic scenario. Note the differing slopes across demographic scenarios (e.g., among panels A-D) and differing overall accuracies across methods (e.g., differing distances between regression lines and the 1:1 line among panels D, H, and L). Another version of this figure with consistent axis limits across panels and colorized by sequencing depth is available in S19 Fig.
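
One way to produce per-coverage regression lines like these is an ordinary least squares fit of called on true F_ROH within each coverage level. The sketch below is an assumption about the general approach rather than the authors' code; the record layout and the name `fit_by_coverage` are hypothetical, and confidence intervals are omitted.

```python
from collections import defaultdict


def fit_by_coverage(records):
    """Fit called F_ROH ~ true F_ROH separately for each coverage level.

    records: iterable of (coverage, true_froh, called_froh) tuples.
    Returns {coverage: (slope, intercept)}.
    """
    groups = defaultdict(list)
    for coverage, true_f, called_f in records:
        groups[coverage].append((true_f, called_f))

    fits = {}
    for coverage, pairs in groups.items():
        n = len(pairs)
        mean_x = sum(x for x, _ in pairs) / n
        mean_y = sum(y for _, y in pairs) / n
        sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
        slope = sxy / sxx  # undefined if all true values are identical
        fits[coverage] = (slope, mean_y - slope * mean_x)
    return fits
```

A slope of 1 with an intercept of 0 corresponds to the 1:1 line; slopes below 1 indicate that underestimation grows with true F_ROH.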

Fig 3.

PLINK outperforms BCFtools with respect to false negative rates, but underperforms with respect to false positive rates.

A) False positive (i.e., incorrectly calling a base position as being located in a ROH) and B) false negative (i.e., failing to identify a base position as being located in a ROH) rates across demographic scenarios and methods. Horizontal lines indicate median values and shaded boxes are 50% quantiles. Note the difference in y-axis scale between panels A and B. Both BCFtools approaches outperform PLINK with respect to false positive rates, but the reverse is true for false negative rates. Increasing coverage corresponds to decreasing false positive rates and to increasing false negative rates. Values are displayed for 5X and 50X coverage; data for all coverage levels are presented in S9 and S10 Figs, with a scatter plot in S18 Fig.
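
The per-base definitions above can be computed from interval lengths without enumerating positions. This is a minimal sketch assuming half-open (start, end) intervals on a single sequence. The denominators are an assumption, since the caption does not state them: false positives are divided by the number of bases truly outside ROHs, and false negatives by the number of bases truly inside ROHs.

```python
def covered_length(intervals):
    """Total bases covered by half-open (start, end) intervals, merging overlaps."""
    total, current_end = 0, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            total += end - start
            current_end = end
        elif end > current_end:
            total += end - current_end
            current_end = end
    return total


def error_rates(true_rohs, called_rohs, genome_length):
    """Return (false_positive_rate, false_negative_rate) per base position."""
    true_len = covered_length(true_rohs)
    called_len = covered_length(called_rohs)
    union_len = covered_length(list(true_rohs) + list(called_rohs))
    overlap = true_len + called_len - union_len  # bases both true and called

    non_roh = genome_length - true_len
    # Assumed denominators; a rate is undefined (NaN) when its denominator is zero.
    fp_rate = (called_len - overlap) / non_roh if non_roh else float("nan")
    fn_rate = (true_len - overlap) / true_len if true_len else float("nan")
    return fp_rate, fn_rate
```

Working with lengths rather than per-base sets keeps memory use independent of genome size.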

Fig 4.

Increasing true ROH length corresponds to increasing detection.

The difference between called and true F_ROH (called F_ROH − true F_ROH) is displayed by length bin (short, intermediate, long, very long) and demographic scenario (A: large population; B: small population; C: bottlenecked population; D: declining population) at 15X (results for all coverage levels are presented in S9–S10 Figs). For BCFtools Genotypes and PLINK, F_ROH for short ROHs is consistently underestimated, whereas F_ROH for very long ROHs is overestimated when these ROHs are present. BCFtools Likelihoods does not overestimate ROHs in any length bin.

Fig 5.

All three methods tested combine multiple true ROHs into single called ROHs, with increasing coverage only providing improvements for BCFtools Likelihoods.

A) Diagram illustrating this lumping issue. B) Examples of this issue at 5X and 50X in a single simulated individual drawn from the small population demographic scenario. C) Number of true ROHs combined into a single called ROH for ROHs of varying lengths when called by all three methods at 5X and (D) at 50X in the small population (results for all coverage levels and demographic scenarios provided in S16 Fig). Points correspond to mean values and vertical and horizontal error lines indicate 95% confidence intervals. Dashed horizontal line corresponds to y = 1 (a 1:1 relationship between numbers of true and called ROHs).
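
The lumping measure can be sketched as a count of true ROHs overlapping each called ROH. The code below assumes half-open (start, end) intervals and treats any overlap of at least one base as a match; that overlap criterion and the function name are assumptions, not details given in the caption.

```python
def true_rohs_per_called(true_rohs, called_rohs):
    """For each called ROH, count how many true ROHs it overlaps.

    A count of 1 is the ideal 1:1 case, a count above 1 indicates lumping,
    and a count of 0 means the called ROH overlaps no true ROH.
    """
    counts = []
    for c_start, c_end in called_rohs:
        n = sum(
            1
            for t_start, t_end in true_rohs
            if t_start < c_end and c_start < t_end  # intervals share >= 1 base
        )
        counts.append(n)
    return counts
```

This quadratic scan is adequate for per-individual ROH lists; sorted intervals with a sweep would be preferable for very large inputs.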

Fig 6.

When applied to the empirical data set, the three ROH calling methods differ greatly in their inferences, particularly at 5X coverage.

A-C) Overall F_ROH and D-G) length bin-specific F_ROH results for each method and level of coverage, with means and 95% confidence intervals indicated by points and vertical lines, respectively. Lighter background lines indicate results for individual samples.

Fig 7.

Comparison between empirical and simulated declining population across coverages.

Across all tools, F_ROH is consistently underestimated. Increasing coverage from 5X to 10X can have significant effects on F_ROH estimates. A-C) Inferred F_ROH values for declining population data and D-F) inferred F_ROH values for empirical data at varying coverage levels for all three methods. True mean F_ROH values for simulated data are indicated by the horizontal dashed line. For the simulated data, points represent mean values and error bars are bootstrapped 95% CIs; for simplicity, lines are displayed for only 15 randomly subsampled individuals. For the empirical results, points represent mean values (n = 15) and error bars correspond to 95% CIs.
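
A bootstrapped 95% CI for a mean F_ROH can be obtained with a percentile bootstrap, sketched below. The caption does not say which bootstrap variant or how many replicates were used, so the percentile method, the default of 10,000 replicates, and the function name are assumptions.

```python
import random


def bootstrap_mean_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = random.Random(seed)  # fixed seed so results are reproducible
    n = len(values)
    # Resample individuals with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(values) / n, lower, upper
```

Passing the per-individual F_ROH values for one method at one coverage level yields the point and error bar for that combination.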
