Novel genotyping algorithms for rare variants significantly improve the accuracy of Applied Biosystems™ Axiom™ array genotyping calls: Retrospective evaluation of UK Biobank array data

doi:10.1371/journal.pone.0277680

Fig 1.

RHA scheme for distinguishing true heterozygotes from false ones.

(A) Overall flow of the algorithm. (B) Illustration of acceptable range for heterozygous intensity Z-scores relative to the distribution of intensities from major homozygous samples. (B1) Acceptance region for the Z-score of the major allele intensity. (B2) Acceptance region for the Z-score of the minor allele intensity. Het = heterozygote; Hom = homozygote.
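
The acceptance test illustrated in panel B can be sketched as follows. This is a minimal illustration under stated assumptions, not the published RHA implementation: the function names, the use of a plain mean and standard deviation for the major homozygous distribution, and the Z-score bounds are all hypothetical choices made for the example.

```python
from statistics import mean, stdev

# Hypothetical acceptance bounds in Z-score units. These are illustrative
# values only, not the empirically determined ranges used by RHA.
MAJOR_Z_RANGE = (-6.0, 1.0)          # stands in for panel B1
MINOR_Z_RANGE = (3.0, float("inf"))  # stands in for panel B2


def z_score(value, reference):
    """Z-score of one intensity relative to a list of reference intensities."""
    return (value - mean(reference)) / stdev(reference)


def het_is_plausible(het_major, het_minor, hom_major, hom_minor):
    """Check a heterozygous call against the major homozygous distribution.

    het_major / het_minor: allele intensities of the sample called Het.
    hom_major / hom_minor: allele intensities of major homozygous samples.
    """
    z_major = z_score(het_major, hom_major)
    z_minor = z_score(het_minor, hom_minor)
    major_ok = MAJOR_Z_RANGE[0] <= z_major <= MAJOR_Z_RANGE[1]
    minor_ok = MINOR_Z_RANGE[0] <= z_minor <= MINOR_Z_RANGE[1]
    return major_ok and minor_ok


def filter_het_call(het_major, het_minor, hom_major, hom_minor):
    """Keep the Het call if both Z-scores are acceptable, else set No Call."""
    if het_is_plausible(het_major, het_minor, hom_major, hom_minor):
        return "Het"
    return "NoCall"
```

In this sketch a heterozygous call is kept only when both allele intensities fall inside their acceptance regions; a sample whose minor allele intensity is indistinguishable from the major homozygous background would be set to "NoCall".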

Fig 2.

Unexpected intensities can cause false heterozygous calls.

(A) & (C): Two examples of cluster plots displaying the summarized intensities of each sample for a given probeset in “signal contrast” vs. “signal strength” space. Each point is the summary (median polish) of two replicate probes (rep1 and rep2) for the same sample. Major homozygous calls are shown as blue inverted triangles and heterozygous calls as yellow dots. In A, one of the heterozygous calls was flagged and is a false positive (circled in red). In C, the heterozygous call that was flagged is circled in blue. Data are from Phase II HapMap [20] samples from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research, representing African, East Asian and European populations. The false positive is in a sample of East Asian origin. All remaining heterozygous calls shown are in samples of European origin. (B) & (D): The blue graphs represent the intensities of good-quality major homozygotes in the respective probeset. The shaded grey areas denote the empirically determined intensity ranges within which the intensity of a heterozygote is expected to fall (for legibility, the expected ranges for the minor allele are truncated on the right). Yellow dots mark the intensities of all samples with heterozygous calls.

Fig 3.

The fraction of predicted heterozygotes that are set to “No Call” by RHA decreases as the size of the heterozygote cluster increases.

For cluster sizes ranging from 1 to 4, we show the fraction of predicted heterozygotes with underlying unexpected intensities and the fraction of heterozygotes that are set to “No Call.” Both quantities decrease with increasing size of heterozygote cluster. Het = heterozygote.

Fig 4.

Improvement in positive predictive value of rare variants after application of RHA in the 50k and 200k data.

Bars indicate the mean positive predictive value (PPV) of variants genotyped by the UK Biobank Axiom array. The minor allele frequency ranges were calculated from the genotyping results (cMAF) before applying RHA. We used genotypes from exome datasets 50k FE-VCF (A) or 200k OQFE-PLINK (B) as truth, comparing performance before and after applying RHA. We also indicate the percentage of true positive heterozygous calls (TP hets) retained after applying RHA. Data for the UK BiLEVE Axiom array show similar trends (S2 Fig in S1 File). Note that for a given cMAF range, the number of variants contributing to the mean may be lower after the application of RHA: for some variants RHA eliminates all heterozygous predictions by the array, so the positive predictive value cannot be calculated.
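
The quantities plotted here can be sketched in a few lines. This is an illustrative reading of the caption, not the authors' pipeline: the call labels ("het", "hom", "NoCall"), the dictionary layout, and the choice to score only samples that have an exome call are assumptions made for the example.

```python
def variant_ppv(array_calls, truth_calls):
    """PPV of heterozygous array calls for one variant.

    Both arguments map sample ID -> call label. Returns None when the array
    makes no heterozygous call that can be scored, in which case PPV is
    undefined for that variant.
    """
    het_samples = [s for s, c in array_calls.items()
                   if c == "het" and s in truth_calls]
    if not het_samples:
        return None
    true_pos = sum(1 for s in het_samples if truth_calls[s] == "het")
    return true_pos / len(het_samples)


def mean_ppv(variants):
    """Mean PPV over (array_calls, truth_calls) pairs with a defined PPV."""
    ppvs = [p for p in (variant_ppv(a, t) for a, t in variants) if p is not None]
    return sum(ppvs) / len(ppvs) if ppvs else None


def tp_hets_retained(before, after, truth_calls):
    """Fraction of true positive hets before filtering that are still het after."""
    tp_before = [s for s, c in before.items()
                 if c == "het" and truth_calls.get(s) == "het"]
    if not tp_before:
        return None
    kept = sum(1 for s in tp_before if after.get(s) == "het")
    return kept / len(tp_before)
```

A variant whose heterozygous calls are all set to "NoCall" returns None from `variant_ppv` and drops out of `mean_ppv`, which is why fewer variants can contribute to the mean after filtering.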

Table 1.

Positive predictive value of UK Biobank Axiom™ array versus whole exome sequencing before and after application of RHA.

Fig 5.

At very low cMAF (≤0.005%) a large proportion of variants are monomorphic in the exome sequencing (monoWES).

(A) The variants in each cMAF range are divided into three groups. Group 1 is non-responsive in array data (“non-responsive in Axiom”). Group 2 is monoWES (“monomorphic in WES”). Group 3 includes all remaining variants (“other”). Note that a variant can be both non-responsive in the array data and monoWES; in this panel such variants are counted in the “non-responsive in Axiom” category. Variants that are monoWES make up 59% and 34% of the total variants in cMAF ranges 0%-0.001% and 0.001%-0.005%, respectively. (B) Further partitioning of monoWES variants into three subgroups. The partitioning covers all 13,049 monoWES variants: those shown as Group 2 (monoWES) in A, plus the 464 probesets that are identified as non-responsive (included in the Group 1 bars in A) but are also monoWES. 2a: monomorphic in Axiom (monoAx) pre-RHA (2,025 variants); 2b: monoAx only post-RHA (6,518 variants); 2c: polymorphic in Axiom pre- and post-RHA (4,506 variants). Bars indicate the total number of Group 2 variants, as well as the proportion of each subgroup, for each cMAF range. (C) Mean positive predictive value (PPV) of variants genotyped by the UK Biobank Axiom array, using genotypes from 200k OQFE-PLINK as reference, and restricting to variants that are polymorphic in the exome sequencing and responsive in the array data. Bars indicate PPV before and after applying RHA; dots indicate the percentage of true positive heterozygous calls retained after applying RHA.
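
The grouping rules in panels A and B amount to a precedence order, which the sketch below makes explicit. The boolean flags and function names are hypothetical inputs introduced for illustration; only the category labels and the subgroup counts come from the caption.

```python
def panel_a_group(non_responsive, mono_wes):
    """Assign a variant to one panel A category.

    Non-responsiveness takes precedence, so a variant that is both
    non-responsive and monoWES is counted as non-responsive.
    """
    if non_responsive:
        return "non-responsive in Axiom"
    if mono_wes:
        return "monomorphic in WES"
    return "other"


def mono_wes_subgroup(mono_ax_pre_rha, mono_ax_post_rha):
    """Assign a monoWES variant to a panel B subgroup.

    Filtering only removes heterozygous calls, so a variant that is
    monomorphic before RHA is assumed to stay monomorphic afterwards.
    """
    if mono_ax_pre_rha:
        return "2a"  # monomorphic in Axiom before RHA
    if mono_ax_post_rha:
        return "2b"  # monomorphic in Axiom only after RHA
    return "2c"      # polymorphic in Axiom before and after RHA


# Subgroup sizes as stated in the caption; they sum to the monoWES total.
SUBGROUP_COUNTS = {"2a": 2025, "2b": 6518, "2c": 4506}
TOTAL_MONO_WES = sum(SUBGROUP_COUNTS.values())
```

Summing the three stated subgroup sizes reproduces the 13,049 monoWES variants given in the caption, which confirms that the subgroups partition the full monoWES set.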

Table 2.

Effect of exclusion of probesets non-responsive in array on post-RHA sensitivity.

Table 3.

Polymorphic status in Axiom UK Biobank array of Group 2 single nucleotide variants that are monomorphic in the 200k whole exome sequencing data.

Fig 6.

Comparison of sequencing depth by exome genotyping call for the 200k OQFE-PLINK whole exome sequencing data set.

Sequencing depth for exome reference homozygous calls, which tend to correspond to major homozygous calls, is significantly lower than sequencing depth for exome heterozygous calls.

Table 4.

Performance of Axiom UK Biobank array on the BRCA module.
