
Linkage Disequilibrium-Based Quality Control for Large-Scale Genetic Studies

Figure 2

Example genotype intensity scatter plots from Affymetrix 500K technology on unrelated HapMap samples.

Original calls from the Affymetrix data are indicated by the colour and shape of the small solid points (homozygotes: blue ♦ and red •; heterozygotes: green ▴). The larger, open symbols with the same colour scheme (◊, ○, Δ) represent corrected genotype calls obtained by applying our LD-based method to the Affymetrix data. Orange symbols indicate genotypes that are discrepant between the Affymetrix and HapMap datasets, with the shape of these symbols indicating the genotype calls in the HapMap database. LD-based error rate estimates are those obtained from applying the LD-based method to the Affymetrix data. The first row shows plots for three SNPs with large numbers of discrepancies between HapMap and Affymetrix calls, but with low LD-based error rate estimates and clean intensity plots showing three well-separated clusters. The likely explanation for these results is that the discrepancies are due to errors in the HapMap database, and not in the Affymetrix calls on which the LD-based error rates are based. The second row shows plots for three SNPs where the HapMap and Affymetrix calls agree (0 discrepancies) but where the LD-based error rate estimates are high and the intensity plots are unusual. The unusual intensity results, combined with the fact that genotypes identified as likely to be incorrect by the LD-based method tend to cluster together, suggest that the high LD-based error rates reflect genuine signal at these SNPs, such as genotyping errors or other anomalies (e.g. copy number variation). This illustrates the potential for the LD-based method to detect problems that duplicate genotyping may miss. The third row shows plots for three SNPs with high LD-based error rate estimates and large numbers of discrepancies, where the intensity plots are relatively clean but the genotyping algorithm appears to have done a poor job of clustering the genotypes. In each case the LD-based method successfully identifies and corrects most of these erroneous genotypes.
Although these examples were chosen to illustrate particular points, they are not atypical in that we saw other examples of each type of behaviour.
