Conceived and designed the experiments: CRC. Performed the experiments: SL. Analyzed the data: SL. Contributed reagents/materials/analysis tools: SK. Wrote the paper: SL ZW. Participated in writing the paper: SK.

The authors have declared that no competing interests exist.

Single nucleotide polymorphisms (SNPs) have been used extensively in genetics and epidemiology studies. Traditionally, SNPs that did not pass the Hardy-Weinberg equilibrium (HWE) test were excluded from these analyses. Many investigators have addressed possible causes for departure from HWE, including genotyping errors, population admixture and segmental duplication. Recent large-scale surveys have revealed abundant structural variations in the human genome, including copy number variations (CNVs). This suggests that a significant number of SNPs must be within these regions, which may cause deviation from HWE.

We performed a Bayesian analysis of the potential effect of copy number variation, segmental duplication and genotyping errors on the behavior of SNPs. Our results suggest that copy number variation is a major factor in HWE violation for SNPs with a small minor allele frequency, when the sample size is large and the genotyping error rate is 0–1%.

Our study provides the posterior probability that a SNP falls in a CNV or a segmental duplication, given the observed allele frequency of the SNP, sample size and the significance level of HWE testing.

Single nucleotide polymorphisms (SNPs) are common biallelic variations that are widely used as genetic markers in linkage analyses and association studies

A copy number variation (CNV) is a genomic segment larger than 1 kb that occurs in variable numbers in the genome. When the variant frequency is larger than 1% in a population, it is called a copy number polymorphism (CNP). In some contexts, CNV stands for copy number variants

A segmental duplication (SD) refers to a large duplicated sequence in the genome, conventionally longer than 1 kb with at least 90% sequence identity between duplicate copies (reviewed by Bailey and Eichler).

Recent studies show that at least 12%–15% of the human genome is covered by copy number variations

We are interested in how a SNP behaves when it lies within a copy number variation. We begin with an ‘observed SNP’ site that shows two different bases in sequencing or genotyping experiments. The measured genotype and allele frequencies of an observed SNP may not reflect the true frequencies when additional copies exist. An observed SNP may not even be a true SNP, but instead a variation between two duplicate copies.

It is difficult to separate duplicate copies experimentally. The sequences flanking the two loci are nearly identical and PCR (polymerase chain reaction) and extension reactions cannot differentiate them. Finding out the exact genotypes for CNVs is also a challenging problem and only relative quantification is available to date

Our study focuses on relatively small-scale SNP studies with limited information. Detection and validation of CNVs through experimental and computational methods remain an ongoing problem. However, such information is often limited by differences between populations (e.g. ethnicity), the lack of confirmed boundaries, and quantification relative to the population average rather than the absolute number of copies.

Methods have been developed specifically for detecting CNVs using a large number of SNPs. SNP arrays that allow simultaneous genotyping of CNVs and SNPs (BeadArray™ by Illumina and GeneChip® by Affymetrix) recently became available. Software that detects CNVs from the SNP arrays has been developed (e.g. BeadStudio LOH+ by Illumina and QuantiSNP by Colella et al.).

However, not every investigator genotypes such a dense set of SNPs, depending on the goal of the genetics or epidemiology study. Closely positioned SNPs are often in linkage disequilibrium and many investigators prefer typing distant SNPs for cost effectiveness. Our goal is to compute the theoretical contribution of CNVs and SDs to the Hardy-Weinberg disequilibrium (HWD) of individual SNPs, given limited knowledge of CNVs in the particular population under study, rather than to develop a method for detecting CNVs using a dense set of genotyped SNPs.

The power to detect deviation from HWE in SNPs in a segmentally duplicated region was recently examined by theoretical analysis and simulation

According to Redon et al., only about 1–2% of CNVs are multi-allelic and 5–10% are complex

Under a biallelic CNV assumption, we can imagine a situation with an original site L1 and an ectopic site L2, as depicted in the accompanying figure. We denote by p_{1} the true frequency of allele A at L1, and by p_{2} the true frequency of allele A at L2. Though we assume that A is the observed minor allele, it does not have to be the minor allele at each site, and p_{1} and p_{2} may range from 0 to 1. Additionally, we introduce a new parameter r, the frequency of having both sites L1 and L2, as opposed to having only L1. Thus, r refers to the true allele frequency of the underlying CNV. For a CNV, r can vary between 0 and 1. When there is no duplication (i.e. regular genomic regions), r = 0. When duplication is fixed in all individuals in the population (segmental duplication), r = 1. For convenience, here r∈(0,1) (i.e. 0<r<1) is treated as equivalent to a CNV, r = 0 to a regular genomic region, and r = 1 to a SD.

All possible cases of observed SNPs on a biallelic, duplication-type CNV. Each gray box represents an individual. Two parallel lines are homologous chromosomes. The left homologous pair represents the original site (L1) and the right pair represents the ectopic site (L2). The ectopic site may not exist or exist in only one of the homologous chromosomes in some individuals.

If both sites are polymorphic with different pairs of bases, the observed SNP will be triallelic (or even quadriallelic); such cases are not considered in the current study. Here, we assume that the observed SNP is biallelic, as are the true sites and the CNV itself.

Given true SNP allele frequencies p_{1} and p_{2} and CNV allele frequency r, observed SNP genotype frequencies

In theory, SNP genotyping errors can occur in both directions, and their rates depend on which nucleotides are involved. However, it is more common to misread a heterozygote as a homozygote. In our mixture model, we take a conservative approach and assume that all genotyping errors mistake a heterozygote for a homozygote, and not the other way around. If we considered both directions, the two effects would counterbalance each other and contribute less to HWD. Thus, our assumption of one-way genotyping error means that the genotyping error fully contributes to HWD and does not cancel out within itself.
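The model described so far can be sketched in a few lines of Python. This is an illustration only, under assumptions that are not spelled out in this form in the text: alleles at L1 and L2 on the same chromosome are taken to be independent, and the CNV itself is taken to be in HWE. The function name `observed_genotype_freqs` and the parameter `err` (the fraction of heterozygotes misread as each homozygote) are hypothetical.

```python
def observed_genotype_freqs(r, p1, p2, err=0.0):
    """Return (P_AA, P_AB, P_BB) of an observed SNP on a duplication-type CNV.

    Assumptions (ours): L1 and L2 alleles are independent on a chromosome,
    and the CNV is in HWE. `err` is the fraction of heterozygotes misread
    as EACH of the two homozygotes (one-way genotyping error).
    """
    # Probability that every copy carried by one chromosome is A (or B).
    # With probability 1 - r the chromosome carries L1 only; with
    # probability r it carries both L1 and L2.
    all_a = p1 * ((1.0 - r) + r * p2)
    all_b = (1.0 - p1) * ((1.0 - r) + r * (1.0 - p2))

    # An individual is read as homozygous only if all copies on both
    # chromosomes agree; everything else is read as heterozygous.
    p_aa = all_a ** 2
    p_bb = all_b ** 2
    p_ab = 1.0 - p_aa - p_bb

    # One-way error: heterozygotes are moved to the homozygous classes.
    moved = err * p_ab
    return p_aa + moved, p_ab - 2.0 * moved, p_bb + moved


if __name__ == "__main__":
    # r = 0 recovers the usual HWE proportions p^2, 2pq, q^2.
    print(observed_genotype_freqs(0.0, 0.3, 0.0))
    # A fixed duplication (r = 1) inflates the heterozygote class.
    print(observed_genotype_freqs(1.0, 0.5, 0.5))
```

With r = 0 the function reduces to the ordinary HWE genotype proportions, which is a quick sanity check on the sketch.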

Our first goal is to understand the relationship between HWD, r, p_{1}, p_{2} and the observed allele frequency.

Under HWE, θ = 1. When there are excessive heterozygotes, θ>1. When there are more homozygotes than expected under HWE, θ<1. Unlike other HWD measures such as the disequilibrium parameter D

As seen in the plots, log(θ) increases with r across the values of p_{1} and p_{2}. This indicates that the ectopic site contributes to increasing the number of observed heterozygotes relative to homozygotes. Based on the assumption of no other causes of HWD such as SNP genotyping errors, θ never goes below 1 (log(θ) is always ≥0). Thus, duplication always results in excessive heterozygotes.
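The behavior of θ can be checked numerically. The sketch below assumes θ = P_AB^2 / (4 P_AA P_BB), which is one measure that equals 1 under HWE and exceeds 1 with excess heterozygotes; the exact definition used in the study is not reproduced in this section, so this form is an assumption. It also assumes independent alleles at L1 and L2 and a CNV in HWE. The name `log2_theta` is hypothetical.

```python
import math


def log2_theta(r, p1, p2):
    """log2 of the assumed HWD measure theta = P_AB^2 / (4 * P_AA * P_BB)."""
    # Per-chromosome probability that all copies are A (or all are B).
    all_a = p1 * ((1.0 - r) + r * p2)
    all_b = (1.0 - p1) * ((1.0 - r) + r * (1.0 - p2))
    p_aa = all_a ** 2
    p_bb = all_b ** 2
    p_ab = 1.0 - p_aa - p_bb
    if p_aa == 0.0 or p_bb == 0.0:
        # One homozygous class is empty, so theta is unbounded.
        return math.inf
    return math.log2(p_ab ** 2 / (4.0 * p_aa * p_bb))


if __name__ == "__main__":
    for r in (0.0, 0.1, 0.5, 0.9):
        print(r, log2_theta(r, 0.3, 0.0), log2_theta(r, 0.3, 1.0))
```

Under these assumptions, log2(θ) is 0 at r = 0 and grows with r for both p_{2} = 0 and p_{2} = 1, in line with the statement that duplication produces excess heterozygotes.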

A. p_{2} = 0, B. p_{2} = 1. Log base 2.

Given the observed minor allele frequency, the possible values of r vary widely depending on the assumption of p_{2}. The plots show the two extreme cases, p_{2} = 0 and p_{2} = 1.

A. p_{2} = 0, B. p_{2} = 1. Log base 2. Observed allele frequencies are derived from computed observed genotype frequencies.

The difference between the observed allele frequency and the true allele frequency p_{1} can be very large. Thus, in this case the observed allele frequency cannot serve as a substitute for the true allele frequency. In the majority of the cases, the minor allele frequency is overestimated.

The black diagonal line is the case where the true frequency p_{1} is identical to the observed frequency. Red and blue curves represent p_{2} = 0 and p_{2} = 1, respectively.

P(CNV|HWD), or the probability that a SNP is in a CNV (i.e. r∈(0,1)), given that the SNP is in HWD, was computed at different observed allele frequency(_{g}). Several hypothesis tests for HWE have been proposed, including the most commonly used chi-square goodness-of-fit test

As seen in

The posterior probabilities given HWD computed using the beta prior, at n = 100, α = 0.05 (left), and n = 1000, α = 0.01 (right), with respect to observed allele frequency

The relative contribution by duplication is quite different depending on the stringency of HWD testing (

The likelihoods of HWD computed using the beta prior, at n = 100, α = 0.05 (left), and n = 1000, α = 0.01 (right), with respect to observed allele frequency

The uniform model (

The computation by sampling directly from priors converged, as suggested by one of the cases shown here (_{1} and p_{2}. Some individual cases failed to converge but did not affect the overall summation, because the values were ignorably small (

Our simulation shows that the HWD measure θ only increases with respect to r under no experimental errors, supporting that duplication acts in the direction of increasing observed heterozygotes.

Our results suggest that copy number variation can be a major contributor to HWD, even assuming the tendency towards small variant frequencies of CNV, especially at a low observed SNP minor allele frequency and large sample size. Segmental duplication is a major effect at a higher observed SNP minor allele frequency. About 1% genotyping errors did not make much difference to P(CNV|HWD). At a 5% or higher genotyping error, CNV or SD is less likely to be the cause of HWD.

Our results show that the probability of a SNP being in a duplicated region given HWD depends on the observed allele frequency. In case of a high observed minor allele frequency, HWD tends to be due to duplication, whereas in case of a small

Hosking et al.

For the prior distribution of r, we incorporated estimates from previous studies about CNVs. Fredman et al.

Our beta prior assumes about 50% of the CNVs have a minor allele frequency (MAF) more than 3.5% and about 13% and 1.5% have >10% and >20% MAFs, respectively, which are approximately consistent with Iafrate et al.'s estimate

Genotyping error rates for Sequenom (San Diego, California, USA), Illumina (San Diego, California, USA) and other new methods were reported as less than 1% (personal communication, Cantor). Sources and types of genotyping errors may vary and such heterogeneous effects were not considered in our model.

Cox and Kraft

Hunter et al.

Although at least some CNVs are generated in tandem

In addition, we assumed that an underlying CNV itself is under HWE. Sebat et al.

Nguyen et al.

Our model assumes duplication, genotyping error and random variation as the only sources of HWD. In reality, there are other sources of HWD. One of them is the noise in the actual population. Shoemaker et al. used |D_{A}| < 0.03 as the limit of HWD in human populations, as suggested by a National Research Council report (National Research Council 1996), where D_{A} = 0 indicates HWE.

Our model does not consider population admixture effect. Population admixture is an important confounding factor in case-control studies and it is known that the admixture effect causes deviation from HWE, as we mentioned in the background section of our manuscript. Nevertheless, with sample size <1000, population admixture can be detected by HWE testing only when f>0.4 and k>0.2, where f is the allele frequency difference between the mixed populations and k the proportion of the minor population

Our study shows that the degree of HWD increases with respect to r, the frequency of two-copy alleles. Duplication acts in the direction of increasing observed heterozygotes. The results of our Bayesian analysis suggest that copy number variation can be a major contributor to HWD, when sample size is large and genotyping error is small. The relative contribution of CNV and SD to HWD varies with observed SNP allele frequency.

We varied r, p_{1} and p_{2} and computed the observed genotype frequencies and allele frequencies. Values for log_{2}(θ) were also obtained from the computed genotype frequencies. The simulation was done using a Perl script that we wrote, and the plots were drawn using the R language.

Given a value of r, either estimated or derived from genotyping experiments, we asked whether the true allele frequency for the SNP could be derived. For varying values of p_{2}, we have plotted the range of possible values of the true allele frequency p_{1}, given observed allele frequency

Here, the range of p_{1} is not less informative than a posterior distribution of p_{1}, because in this case the posterior probability depends only on p_{2}, for which we assumed a uniform prior except in marginal cases.

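The observed allele frequency is a function of r, p_{1} and p_{2}. When r is fixed, p_{1} and p_{2} have complementary effects on the observed allele frequency. Therefore, the range of p_{1} can be obtained by assuming the minimum and maximum values of p_{2}, given r and the observed allele frequency. We plotted the range of p_{1} for different values of r and the observed allele frequency, with p_{2} values ranging from 0 to 1, for a given r.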
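The range of p_{1} described here can be computed numerically. The following sketch assumes independent alleles at L1 and L2 and a CNV in HWE, and takes the observed allele frequency to be P_AA + P_AB/2 derived from the observed genotype frequencies. Because that frequency increases with both p_{1} and p_{2} under these assumptions, bisection on p_{1} at p_{2} = 1 and p_{2} = 0 gives the lower and upper ends of the range. All function names are hypothetical.

```python
def observed_allele_freq(r, p1, p2):
    """Observed frequency of allele A, derived from genotype frequencies."""
    all_a = p1 * ((1.0 - r) + r * p2)
    all_b = (1.0 - p1) * ((1.0 - r) + r * (1.0 - p2))
    p_aa = all_a ** 2
    p_bb = all_b ** 2
    p_ab = 1.0 - p_aa - p_bb
    return p_aa + 0.5 * p_ab


def solve_p1(r, p2, f_obs, tol=1e-10):
    """Find p1 in [0, 1] giving observed frequency f_obs, clamped at the ends."""
    lo, hi = 0.0, 1.0
    if observed_allele_freq(r, lo, p2) >= f_obs:
        return 0.0  # even p1 = 0 already reaches f_obs for this p2
    if observed_allele_freq(r, hi, p2) <= f_obs:
        return 1.0  # even p1 = 1 cannot exceed f_obs for this p2
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if observed_allele_freq(r, mid, p2) < f_obs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def p1_range(r, f_obs):
    """Return (min p1, max p1) compatible with r and f_obs."""
    # A larger p2 raises the observed frequency, so p2 = 1 requires the
    # smallest p1 and p2 = 0 requires the largest p1.
    return solve_p1(r, 1.0, f_obs), solve_p1(r, 0.0, f_obs)


if __name__ == "__main__":
    for r in (0.0, 0.1, 0.5, 1.0):
        print(r, p1_range(r, 0.3))
```

At r = 0 the range collapses to the observed frequency itself, and it widens as r grows.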

Additionally, we have looked at the range of p_{1}, given the observed allele frequency measured using a pooled-sample technique. The pooled-sample SNP allele frequency, which is different from the allele frequency derived from genotype frequencies (equations (4)), can also be expressed in terms of r, p_{1} and p_{2}:

Experimentally, a pooled sample allele frequency can be obtained by pooling DNA samples and measuring the relative quantities of each allele in the pooled sample
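As a hedged illustration of the pooled-sample case, the sketch below assumes that the pooled frequency is the expected number of A copies divided by the expected total number of copies per chromosome, that is (p_{1} + r p_{2}) / (1 + r). This expression is our reading of the definitions of r, p_{1} and p_{2} and is not copied from the original equation; the function names are hypothetical.

```python
def pooled_allele_freq(r, p1, p2):
    """Assumed pooled-sample frequency of allele A: A copies / all copies.

    Each chromosome carries one L1 copy and, with probability r, one L2 copy.
    """
    return (p1 + r * p2) / (1.0 + r)


def p1_range_pooled(r, f_pool):
    """Range of p1 compatible with r and a pooled frequency f_pool.

    Inverting the expression above gives p1 = f_pool * (1 + r) - r * p2,
    which is linear in p2; p2 = 1 and p2 = 0 give the two ends, clamped
    to [0, 1].
    """
    lower = max(0.0, f_pool * (1.0 + r) - r)
    upper = min(1.0, f_pool * (1.0 + r))
    return lower, upper


if __name__ == "__main__":
    print(pooled_allele_freq(0.0, 0.3, 0.9))  # no duplication: equals p1
    print(p1_range_pooled(0.5, 0.3))
```

Under this assumption the bounds on p_{1} are linear in the pooled frequency, in contrast to the curved bounds obtained from genotype-derived frequencies.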

The second goal is to compute P(CNV|HWD) and P(HWD|CNV), given the sample size (number of individuals genotyped) n, the observed allele frequency and the significance level α. HWD was determined using the χ^{2} goodness-of-fit test without continuity correction at α = 0.05 or 0.01. Though it has been proposed that other tests are superior under certain conditions, we used the χ^{2} test, to provide a practical perspective. Four different genotyping error rates were tried: 0%, 1%, 5% and 25%. An x% genotyping error is defined as follows: x% of heterozygotes are read as one of the homozygotes and another x% are read as the other homozygote. This results in excessive homozygotes. Our genotyping error model only misreads heterozygotes as homozygotes, but not vice versa. It would be trivial to include the opposite trend, but we do not, for the following reasons. An additional genotyping error in the opposite direction would only decrease the overall deviation from HWE by counterbalancing the increased number of observed homozygotes. Moreover, experimental techniques often miss one of the two existing alleles, but less often identify an allele that does not exist, unless there is contamination or a high noise level. The 25% genotyping error is not realistic but it provides a comparative perspective.
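The test and the error model in this paragraph can be sketched as follows. The code uses the standard one-degree-of-freedom critical values of the chi-square distribution (about 3.841 for α = 0.05 and 6.635 for α = 0.01) and applies the one-way error to expected genotype counts. The function names are hypothetical and the sketch is not the original R code.

```python
# Upper critical values of the chi-square distribution with 1 degree of freedom.
CHI2_CRIT = {0.05: 3.841458820694124, 0.01: 6.6348966010212145}


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit statistic for HWE, no continuity correction."""
    n = n_aa + n_ab + n_bb
    p = (2.0 * n_aa + n_ab) / (2.0 * n)  # observed frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 0.0  # monomorphic sample: no deviation can be measured
    expected = (n * p * p, 2.0 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def is_hwd(n_aa, n_ab, n_bb, alpha=0.05):
    """True if the sample deviates from HWE at significance level alpha."""
    return hwe_chi_square(n_aa, n_ab, n_bb) > CHI2_CRIT[alpha]


def apply_one_way_error(n_aa, n_ab, n_bb, x):
    """Move a fraction x of heterozygotes to each homozygous class."""
    moved = x * n_ab
    return n_aa + moved, n_ab - 2.0 * moved, n_bb + moved


if __name__ == "__main__":
    print(is_hwd(25, 50, 25))                             # exact HWE counts
    print(is_hwd(*apply_one_way_error(25, 50, 25, 0.25)))  # 25% error
```

Starting from counts in exact HWE, the 25% error rate alone is enough to produce a significant excess of homozygotes in a sample of 100.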

A procedure for computing the conditional probabilities P(CNV|HWD) and P(HWD|CNV) is described below. It combines the priors of r, p_{1} and p_{2} with the likelihood of HWD and the observed allele frequency given r, p_{1}, p_{2}, n and α.

The prior distribution of r was set in a hierarchical way. The probabilities of r∈(0,1), r = 0, r = 1 were first set to 14%, 81% and 5%, and within r∈(0,1), the density of r was set to either a beta or a uniform distribution. The beta function parameters were determined so that the mean of r within r∈(0,1) is 0.05.
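A minimal sketch of sampling from this hierarchical prior is shown below. The class probabilities (81%, 14%, 5%) come from the text. The beta parameters are an assumption: Beta(1, 19) has mean 0.05 and places roughly 51%, 13.5% and 1.4% of its mass above 0.035, 0.1 and 0.2, which appears consistent with the proportions quoted in the discussion, but the parameters actually used are not stated in this section. The function name `sample_r` is hypothetical.

```python
import random

P_REGULAR, P_CNV, P_SD = 0.81, 0.14, 0.05  # prior class probabilities (from the text)
BETA_A, BETA_B = 1.0, 19.0                 # ASSUMED beta parameters, mean 0.05


def sample_r(rng, prior="beta"):
    """Draw r from the hierarchical prior: 0 (regular), 1 (SD) or (0, 1) (CNV)."""
    u = rng.random()
    if u < P_REGULAR:
        return 0.0
    if u < P_REGULAR + P_SD:
        return 1.0
    # Remaining mass (P_CNV): r lies strictly between 0 and 1.
    if prior == "beta":
        return rng.betavariate(BETA_A, BETA_B)
    return rng.random()  # uniform alternative


if __name__ == "__main__":
    rng = random.Random(1)
    draws = [sample_r(rng) for _ in range(10000)]
    cnv = [r for r in draws if 0.0 < r < 1.0]
    print(len(cnv) / len(draws), sum(cnv) / len(cnv))
```

Switching `prior` to any other value draws r uniformly within (0,1), which corresponds to the uniform alternative mentioned above.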

The joint prior of p_{1} and p_{2} was also set to a hierarchical distribution, so that the probability of being biallelic is reasonably smaller than that of being monomorphic, for each site. Within p_{1}∈(0,1) or p_{2}∈(0,1), p_{1} and p_{2} are uniformly distributed (details in the supporting PDF file).

The likelihood was computed by summing over every possible observed genotype frequency case that corresponds to the observed allele frequency, given r, p_{1} and p_{2}. HWD was determined for each genotype frequency case using the chi-square test (details in the supporting PDF file).

In order to approximate the integrals, M independent random samples of triplets (r, p_{1}, p_{2}) or pairs or singlets were drawn from the prior distribution within (0,1)^{3}, (0,1)^{2} or (0,1), respectively.
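The likelihood and the Monte Carlo approximation can be illustrated together. The sketch below is a simplified stand-in for the procedure, not the original implementation. Its assumptions are ours: p_{1} and p_{2} are drawn uniformly from (0,1) with no extra mass on monomorphic sites, r within (0,1) follows Beta(1, 19), there is no genotyping error, alleles at L1 and L2 are independent, and the likelihood is the joint probability of the observed allele count and a significant chi-square test. All names are hypothetical.

```python
import math
import random

CHI2_CRIT = {0.05: 3.841458820694124, 0.01: 6.6348966010212145}
PRIOR = {"regular": 0.81, "cnv": 0.14, "sd": 0.05}  # class priors from the text


def genotype_probs(r, p1, p2):
    all_a = p1 * ((1.0 - r) + r * p2)
    all_b = (1.0 - p1) * ((1.0 - r) + r * (1.0 - p2))
    return all_a ** 2, 1.0 - all_a ** 2 - all_b ** 2, all_b ** 2


def is_hwd(n_aa, n_ab, n_bb, alpha):
    n = n_aa + n_ab + n_bb
    p = (2.0 * n_aa + n_ab) / (2.0 * n)
    if p == 0.0 or p == 1.0:
        return False
    expected = (n * p * p, 2.0 * n * p * (1.0 - p), n * (1.0 - p) ** 2)
    stat = sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected))
    return stat > CHI2_CRIT[alpha]


def likelihood_hwd(r, p1, p2, n, a_count, alpha):
    """P(allele A seen a_count times AND test significant | r, p1, p2)."""
    probs = genotype_probs(r, p1, p2)
    total = 0.0
    for n_aa in range(a_count // 2 + 1):
        counts = (n_aa, a_count - 2 * n_aa, n - a_count + n_aa)
        if min(counts) < 0 or not is_hwd(*counts, alpha):
            continue
        # Multinomial probability of this genotype configuration (log scale).
        logp = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
        for c, q in zip(counts, probs):
            if c > 0:
                logp = logp + c * math.log(q) if q > 0.0 else -math.inf
        total += math.exp(logp)
    return total


def sample_params(kind, rng):
    p1, p2 = rng.random(), rng.random()  # ASSUMED uniform priors
    if kind == "regular":
        return 0.0, p1, 0.0              # p2 is irrelevant when r = 0
    if kind == "sd":
        return 1.0, p1, p2
    return rng.betavariate(1.0, 19.0), p1, p2  # ASSUMED beta prior for r


def posterior(n, a_count, alpha, m=2000, seed=0):
    """Monte Carlo estimate of P(class | HWD, observed allele count)."""
    rng = random.Random(seed)
    joint = {}
    for kind, weight in PRIOR.items():
        mean_lik = sum(
            likelihood_hwd(*sample_params(kind, rng), n, a_count, alpha)
            for _ in range(m)
        ) / m
        joint[kind] = weight * mean_lik  # integral times prior probability
    total = sum(joint.values())
    return {kind: value / total for kind, value in joint.items()}


if __name__ == "__main__":
    print(posterior(n=100, a_count=20, alpha=0.05))
```

Each integral is approximated by the average likelihood over M draws from the prior and then weighted by the prior class probability, so the three joint terms normalize to the posterior. Repeating the run with a different `seed` gives an independent replicate.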

_{g} = 0, 0.01, 0.05 and 0.5. Beta and uniform priors for r were tried for comparison. Two independent replicates were generated in order to provide confidence estimates about the probabilities.

All the code was written in the R language.

PDF file describing method detail.

(0.06 MB PDF)

Range of p1, given r and pooled sample allele frequency. The black diagonal line is the case where the true frequency p1 is identical to the observed frequency. Red and blue lines represent p2 = 0 and p2 = 1, respectively.

(3.64 MB EPS)

The posterior probabilities given HWD computed using the uniform prior, at n = 100, α = 0.05 (left), and n = 1000, α = 0.01 (right), with respect to observed allele frequency. Each row represents an error rate of 0%, 1%, 5% and 25%, from top to bottom, respectively. Estimates are the sample mean of two replicates and the standard deviations are depicted with error bars.

(4.98 MB EPS)

The likelihoods computed using the uniform prior, at n = 100, α = 0.05 (left), and n = 1000, α = 0.01 (right), with respect to observed allele frequency. Each row represents an error rate of 0%, 1%, 5% and 25%, from top to bottom, respectively. Estimates are the sample mean of two replicates and the standard deviations are depicted with error bars.

(4.99 MB EPS)

Convergence of the 15 integrals. The Y values represent joint probabilities (integral multiplied by prior probabilities).

(1.30 MB EPS)

We want to thank Dr. Darryl Irwin, Dr. Min Seob Lee and other colleagues at Sequenom for the great feedback. We thank Prof. Chanwon Kang and his lab members, Hyoungseok Ju, Kwangwoo Kim, Eunjin Kim and Tae-Un Han at KAIST, for a valuable discussion. We also thank Prof. Shamil Sunyaev at Harvard Medical School and Prof. Ben Raphael at Brown University for detailed discussion over the manuscript. We also thank Dr. John Zhang, Dr. Jason Laramie and Dr. Paula Sebastiani at Boston University for helpful comments, and Konstantin Aizikov at the Electrical and Computer Engineering Department at Boston University for his advice on using mass spectrometry-based quantification data. We greatly appreciate the valuable feedback from Dr. David Cox at Harvard School of Public Health.