Using population-specific add-on polymorphisms to improve genotype imputation in underrepresented populations

doi:10.1371/journal.pcbi.1009628

Fig 1.

Schematic of our add-on tag SNP selection procedure, illustrating its three steps.

Step 1) Constructing a Tanzanian reference panel and identifying candidate target variants, which are derived from variants that remain poorly imputed when the H3Africa array is imputed using the Tanzanian and AFGR reference panels. Step 2) Selecting add-on tags that optimally tag candidate target variants based on population-specific LD structures, allele frequencies, and probe qualities. Step 3) Evaluating improvements in imputation performance after adding add-on tags to the base H3Africa array by calculating imputation quality metrics, including INFO score and r² (correlation between imputed and sequencing-based genotypes). WGS, Whole-Genome Sequencing; AFGR, African Genome Resource; MAF, Minor Allele Frequency; MI, Mutual Information; LD, Linkage Disequilibrium.

Table 1.

Imputation performance of publicly available reference panels when applied to the TB-DAR data based on the H3Africa array content.

Minor allele frequency (MAF) is based on the frequency observed in the TB-DAR cohort. Imputation quality (Subcolumn 1) is measured by either INFO score (AFGR and HRC; Sanger Imputation Server) or r² (CAAPA; Michigan Imputation Server). Correlation with ground truth (Subcolumn 2) is the squared Pearson correlation coefficient (r²) between the imputed dosage and the ground truth WGS dosage. Percent of variants imputed (Subcolumn 3) is the percentage of variants observed in the TB-DAR WGS data that were successfully imputed (Imputation Quality > 0.8).
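
The two derived quantities in this table can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `dosage_r2` and `percent_imputed` are hypothetical, dosages are assumed to be per-sample values in [0, 2], and a variant absent from the imputation output is assumed to be encoded as `None` so that it still counts in the denominator.

```python
import math


def dosage_r2(imputed, truth):
    """Squared Pearson correlation between imputed and sequenced dosages.

    Both arguments hold one dosage per sample for a single variant.
    """
    n = len(imputed)
    if n != len(truth) or n < 2:
        raise ValueError("need two equally long lists with at least 2 samples")
    mean_i = sum(imputed) / n
    mean_t = sum(truth) / n
    cov = sum((a - mean_i) * (b - mean_t) for a, b in zip(imputed, truth))
    var_i = sum((a - mean_i) ** 2 for a in imputed)
    var_t = sum((b - mean_t) ** 2 for b in truth)
    # With no variation in either vector, the correlation is undefined.
    if var_i == 0 or var_t == 0:
        return math.nan
    return (cov * cov) / (var_i * var_t)


def percent_imputed(qualities, threshold=0.8):
    """Percentage of WGS variants whose imputation quality exceeds threshold.

    `qualities` has one entry per WGS variant; None marks a variant that
    is missing from the imputation output (assumed encoding).
    """
    if not qualities:
        raise ValueError("qualities must not be empty")
    passed = sum(1 for q in qualities if q is not None and q > threshold)
    return 100.0 * passed / len(qualities)
```

For example, `percent_imputed([0.9, 0.5, None, 0.85])` counts two of four variants as successfully imputed.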

Fig 2.

Genetic differentiation of African populations.

A) Sampling locations of the TB-DAR WGS cohort and populations within the AFGR reference panel, which includes the Sub-Saharan African populations of the 1000 Genomes (1KG) project. Line colors illustrate the degree of differentiation (F_ST) between TB-DAR and 1KG populations. B) Pairwise F_ST measures between 1KG populations and TB-DAR. 1000 Genomes Populations: GWD—Gambian in Western Divisions in the Gambia; MSL—Mende in Sierra Leone; YRI—Yoruba in Ibadan, Nigeria; ESN—Esan in Nigeria; LWK—Luhya in Webuye, Kenya. The map was created programmatically in R using the spData package [58], with the base layer based on public domain maps from Natural Earth (https://www.naturalearthdata.com/).

Fig 3.

Improvement in imputation performance subsequent to the addition of add-on tags.

Mean INFO score and r² (between imputed and sequenced ground truth) of target variants designed to be tagged by add-on tags, based on three array designs: 1) the H3Africa array without any add-on tags; 2) the H3Africa array with random add-on tags; 3) the H3Africa array with population-specific add-on tags selected using the proposed approach. Facet grids illustrate results for two tag SNP selection settings: coverage-guaranteeing within prioritized regions (Setting 1) and efficiency-driven in all other regions (Setting 2). Error bars represent the standard error (SE) of the mean imputation quality within each MAF bin.
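
The plotted points and error bars correspond to a per-bin mean and standard error, which can be sketched as follows. This is an illustration only: the function name `mean_and_se_by_maf_bin` and the bin edges in the usage note are hypothetical, since the caption does not state the MAF bins used.

```python
import math
from collections import defaultdict


def mean_and_se_by_maf_bin(mafs, scores, bin_edges):
    """Summarise an imputation quality metric within MAF bins.

    Returns {(low, high): (mean, se, n)} for each half-open bin
    (low, high] that contains at least one variant.
    """
    groups = defaultdict(list)
    for maf, score in zip(mafs, scores):
        for low, high in zip(bin_edges[:-1], bin_edges[1:]):
            if low < maf <= high:
                groups[(low, high)].append(score)
                break

    summary = {}
    for key, values in groups.items():
        n = len(values)
        mean = sum(values) / n
        if n > 1:
            # Sample standard deviation (n - 1 denominator), then SE = sd / sqrt(n).
            sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
            se = sd / math.sqrt(n)
        else:
            se = math.nan  # SE is undefined for a single variant
        summary[key] = (mean, se, n)
    return summary
```

With the illustrative edges `[0.0, 0.05, 0.5]`, calling `mean_and_se_by_maf_bin([0.02, 0.03, 0.2, 0.3], [0.6, 0.8, 0.9, 0.9], [0.0, 0.05, 0.5])` yields one summary for the rare bin and one for the common bin.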

Fig 4.

Improvement in imputation performance in an example region.

Example region on chromosome 10 where the incorporation of add-on tags led to an increase in imputation performance. Facet grids illustrate imputation performance of the H3Africa array without any add-on tags, with random add-on tags, and with add-on tags selected under the proposed approach. Dot color represents the type of variant (existing H3Africa tags, add-on tags, or any other imputed variants).

Table 2.

Performance of add-on tags, categorized based on settings and methods.

Number of probes (Column 2) indicates the total number of Illumina probes required to genotype the add-on tags. The mean probe-ability score (Column 3) estimates the genotyping success rate for the selected add-on probes. The number of successfully tagged imputed variants is measured either as variants with any improvement in INFO score (Column 4) or as variants whose INFO score rose above 0.8 from a previous value below 0.8 (Column 5). Per probe and per tag indicate the number of imputed variants with imputation improvements per add-on probe and per add-on tag, respectively. Standard error (SE) represents the variability of the per tag and per probe metrics across different genomic regions. %AFGR and %Tanz indicate the proportion of imputed variants with better imputation accuracy based on the AFGR or the internal Tanzanian reference panel, respectively.
