Enhanced Methods for Local Ancestry Assignment in Sequenced Admixed Individuals

doi:10.1371/journal.pcbi.1003555

Table 1.

The average number of observed CSVs per haplotype per megabase from each ancestry.


Figure 1.

Example of CSVs in a 2-way admixed individual (e.g. African American).

Lines denote the true local ancestry, while dots denote CSVs. Different dot types denote the continental ancestry of each CSV. From visual inspection it is relatively easy to discern the true ancestry from the three observed patterns. Spurious CSVs are those whose ancestry label disagrees with the true ancestry state.


Figure 2.

Local ancestry inference accuracy in three simulated populations.

“Array data” denotes that a method was run only on the variants present on the Illumina 1 M genotyping array. “Full genome” denotes that a method was run using all the variants. RFMix requires phased haplotype input, which was inferred using Beagle; all other methods received unphased genotype data as input. Correlation values are the mean squared correlation across SNPs of the true vs. inferred ancestry across individuals. LAMP-LD and MULTIMIX were optimized to run with genotyping array data, possibly explaining the steep drop in accuracy when they are run using full sequencing data. MULTIMIX is not plotted when run on full sequencing data because it performed very poorly, possibly due to inaccurate parameters for sequencing data. Haploid and diploid errors are reported in Table 2.
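The correlation metric described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes ancestry is encoded as a per-individual dosage (0, 1 or 2 copies of one ancestry) at each SNP, and it assumes that SNPs with no variance in either the true or the inferred ancestry are skipped, since the correlation is undefined there. The function names `pearson_r` and `mean_squared_correlation` are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # no variance: correlation is undefined
    return sxy / sqrt(sxx * syy)


def mean_squared_correlation(true_anc, inferred_anc):
    """Average r^2 over SNPs.

    Both arguments are lists of rows, one row per individual and one
    column per SNP, holding an ancestry dosage (assumed encoding).
    """
    n_snps = len(true_anc[0])
    r2_values = []
    for j in range(n_snps):
        true_col = [row[j] for row in true_anc]
        inferred_col = [row[j] for row in inferred_anc]
        r = pearson_r(true_col, inferred_col)
        if r is not None:  # assumption: skip SNPs with undefined correlation
            r2_values.append(r * r)
    return sum(r2_values) / len(r2_values) if r2_values else float("nan")
```

Because the correlation is squared before averaging, an inference that is perfectly anti-correlated with the truth at a SNP scores the same as a perfect one, so the metric is best read alongside the haploid and diploid error rates.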


Figure 3.

Runtime (in CPU days) as a function of the number of individuals in a study with sequencing data.

Lanc-CSV is always faster than LAMP-LD and MULTIMIX when run on either full genome sequencing data or genotyping array data (see Figure S3 and Table S1). The full sequencing data contained ∼30 times more alleles than the genotyping array data. Only RFMix has comparable speed on full sequencing data, and it is faster on genotyping array data. We show the runtime for RFMix with phasing time included.


Table 2.

Local ancestry accuracy in simulations of African Americans, Mexicans and Puerto Ricans.


Figure 4.

Accuracy as a function of sequencing coverage.

Accuracy increases fastest for African Americans, who have only two distinct ancestral populations.


Figure 5.

Accuracy as a function of sample size.

While accuracy increases with increasing numbers of admixed individuals, the most significant increase is seen in Mexican individuals. We report accuracy for Lanc-CSV using 200 admixed individuals, but accuracy exceeds this as the number of admixed individuals increases. This is due to the method being better able to correct for spurious CSVs and to add in new CSVs when there are more individuals.


Figure 6.

Proportions of sCSVs from each population observed on a held-out haplotype.

Each row represents the ancestry of the haplotype that was held out, and each column represents the average number of sCSVs from the given population observed on the held-out haplotype. Each row is normalized by its maximum value, so that the population with the most sCSVs observed has a value of 1. In each row, higher values are associated with populations in the same continental group, as would be expected. The IBS population has only fourteen individuals, which makes determining IBS sCSVs extremely difficult.
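The row normalization described in this caption can be sketched as below. The function name `normalize_rows_by_max` and the handling of an all-zero row are assumptions made for illustration; the input is a matrix of average sCSV counts, with one row per held-out ancestry and one column per population.

```python
def normalize_rows_by_max(counts):
    """Divide each row by its largest entry so the row maximum becomes 1."""
    normalized = []
    for row in counts:
        peak = max(row)
        # Assumption: a row with no observed sCSVs is left as all zeros.
        normalized.append([value / peak if peak > 0 else 0.0 for value in row])
    return normalized
```

After this step, values are comparable within a row but not between rows, since each row has been scaled by a different maximum.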


Figure 7.

sCSVs allow for calling the sub-continental population of a haplotype.

Randomly drawn segments of haplotypes from known populations can be accurately assigned to the population of origin. Accuracy for each population is significantly correlated with the number of reference haplotypes for that population (r = 0.65, p-value = 0.042). The highest accuracies are seen in populations that are more isolated from other populations in their continents.


Figure 8.

sCSVs are able to assign the correct continental group to small haplotype segments with high accuracy.

This shows that most segments assigned to the wrong population are still assigned to the correct continental group.


Figure 9.

The average number of sCSVs from each 1000 Genomes population observed per megabase on the African-African called local ancestry regions of the real ASW individuals on chromosome 10.

The large number of YRI sCSVs seen in these regions supports the hypothesis that the African admixture component in African Americans comes from western Africa. We plot the expected number of observed sCSVs per megabase on a YRI haplotype (red diamonds) and the expected number of observed sCSVs on an LWK haplotype (green squares). The observed counts more closely resemble the count profile expected from the YRI haplotypes.


Figure 10.

The average number of sCSVs from each 1000 Genomes population observed on the European-European called local ancestry regions of the real ASW individuals.


Table 3.

The transition probabilities between ancestry pairs.


Table 4.

Probability of emitting an informative CSV from an ancestry state.
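Transition probabilities (Table 3) and emission probabilities (Table 4) are the ingredients of a hidden Markov model. As a rough illustration of how such parameters could be combined, the sketch below runs a standard Viterbi decoder over a sequence of CSV labels. It is a toy under stated assumptions and not the authors' implementation: it uses two haploid states rather than ancestry pairs, it requires a non-empty observation sequence, and every probability shown is hypothetical rather than taken from the tables.

```python
from math import log


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state path for a non-empty observation sequence (log space)."""
    # best[s] = (log-probability of the best path ending in s, that path)
    best = {
        s: (log(start_p[s]) + log(emit_p[s][observations[0]]), [s])
        for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor state that maximizes the path score into s.
            score, prev = max(
                (best[p][0] + log(trans_p[p][s]), p) for p in states
            )
            new_best[s] = (score + log(emit_p[s][obs]), best[prev][1] + [s])
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])[1]


# Hypothetical parameters, chosen only for illustration.
states = ["AFR", "EUR"]
start_p = {"AFR": 0.5, "EUR": 0.5}
trans_p = {
    "AFR": {"AFR": 0.95, "EUR": 0.05},
    "EUR": {"AFR": 0.05, "EUR": 0.95},
}
emit_p = {
    "AFR": {"AFR": 0.9, "EUR": 0.1},
    "EUR": {"AFR": 0.1, "EUR": 0.9},
}

# Continental labels of the CSVs seen along one haplotype, left to right.
csvs = ["AFR", "AFR", "EUR", "AFR", "AFR", "EUR", "EUR", "EUR"]
path = viterbi(csvs, states, start_p, trans_p, emit_p)
# path == ["AFR"] * 5 + ["EUR"] * 3
```

With these toy values the isolated "EUR" label at the third position is absorbed into the surrounding "AFR" segment, which mirrors the idea of a spurious CSV in Figure 1, while the run of three "EUR" labels at the end is decoded as a genuine ancestry switch.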
