
BFV and JKP both conceived of and designed the model, and wrote the paper. BFV also performed the simulations and analyzed the data.

The authors have declared that no competing interests exist.

Case-control association studies are widely used in the search for genetic variants that contribute to human diseases. It has long been known that such studies may suffer from high rates of false positives if there is unrecognized population structure. It is perhaps less widely appreciated that so-called “cryptic relatedness” (i.e., kinship among the cases or controls that is not known to the investigator) might also inflate the false positive rate. Until now there has been little work to assess how serious this problem is likely to be in practice. In this paper, we develop a formal model of cryptic relatedness, and study its impact on association studies. We provide simple expressions that predict the extent of confounding due to cryptic relatedness. Surprisingly, these expressions are functions of directly observable parameters. Our analytical results show that, for well-designed studies in outbred populations, the degree of confounding due to cryptic relatedness will usually be negligible. In contrast, studies where there is a sampling bias toward collecting relatives may indeed suffer from excessive rates of false positives. Furthermore, cryptic relatedness may be a serious concern in founder populations that have grown rapidly and recently from a small size. As an example, we analyze the impact of excess relatedness among cases for six phenotypes measured in the Hutterite population.

Case-control association studies are a popular, convenient, and potentially powerful strategy for identifying genes of small effect that contribute to complex traits [

However, in their 1999 paper, Devlin and Roeder argued that another source of confounding, “cryptic relatedness,” might actually be a more serious source of error for case-control studies. Cryptic relatedness refers to the idea that some members of a case-control sample might actually be close relatives, in which case their genotypes are not independent draws from the population frequencies. When that happens, the allele frequency estimates in the case and control samples are unbiased but may have greater variance than expected, and tests of association that ignore the excess relatedness have inflated type-1 error rates. Devlin and Roeder [

At this time, there are few empirical data that bear on whether cryptic relatedness is a serious problem in practice. One study of association mapping in a founder population concluded that in that population, cryptic relatedness

In this article, we aim to address the question of whether, and when, cryptic relatedness is likely to be a serious issue for case-control association studies. Our approach is to develop a formal model of cryptic relatedness within a population framework. We show that a natural measure of the impact of cryptic relatedness, which we denote δ, depends on the population size, the genetic model parameterized by the recurrence risk ratio [

Consider a study in which

We suppose that cases and controls are sampled from a single population (i.e., without population structure) of finite size, with discrete generations, and that mating is independent of the phenotype of interest. All individuals are sampled from the current generation. Since the impact of cryptic relatedness is due to alleles that are identical by descent, it will be necessary to model the coalescence times of chromosomes. We will use _{x}

We will also assume that affected individuals have the same distribution of family sizes as do unaffected individuals, and that selection against the disease phenotype is negligible. Hence, chromosomes from affected individuals coalesce with chromosomes from random individuals at the same rate as do chromosomes from pairs of random individuals. To be precise, let _{(i,a)(i′,a′)} denote the coalescence time between chromosomes

where Φ_{i} = aff and Φ_{i} = rand indicate that individuals _{i} = aff|_{(i,a)(i′,a′)} = _{i} = rand] = _{p}_{p}

We also define a quantity _{t}_{r}_{t}_{t}_{i′} = aff|Φ_{i} = aff, _{(i,a)(i′,a′)} =

Notice, however, that the definition of _{t}_{t},

The ratio _{t}/K_{p}_{t}_{r}_{r}_{t}

Let

Then we define a test statistic,

When appropriately normalized, ^{2}/^{2} distributed with one degree of freedom [

Under the standard null hypothesis, an allele copy at a given marker is type ^{*}[

In the absence of true association between the marker and the genotype, the commonly used test of association, ^{2}/[4^{2} random variable [
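
To make the practical meaning of an inflation factor concrete, the following sketch computes the false-positive rate that results when an inflated statistic is compared against the ordinary threshold for a χ^{2} variable with one degree of freedom. It assumes that the naive statistic is distributed as δ times a χ^{2} variable with one degree of freedom; the function names are illustrative and are not taken from the paper.

```python
import math


def chi2_1df_sf(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_1df_critical(alpha, lo=0.0, hi=200.0, iters=200):
    """Critical value c such that P(X > c) = alpha, found by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if chi2_1df_sf(mid) > alpha:
            lo = mid  # tail still too heavy: the threshold must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)


def naive_false_positive_rate(delta, alpha):
    """False-positive rate when a statistic inflated by `delta` (assumption:
    statistic = delta * chi-square_1) is tested at nominal level `alpha`."""
    return chi2_1df_sf(chi2_1df_critical(alpha) / delta)


if __name__ == "__main__":
    for delta in (1.0, 1.05, 1.1, 1.5):
        print(delta, naive_false_positive_rate(delta, 0.05))
```

With δ = 1 the nominal level is recovered; any δ above 1 gives a rate above the nominal level, and the relative excess becomes larger as the significance threshold becomes more stringent.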

Values of the inflation factor,

We now characterize the extra variance that is caused by relatedness within a given case-control study, and use this to compute the expected inflation factor ^{*}[

where

Since _{i}_{j}

The following two terms in Equation 4 account for the possibility of departures from Hardy-Weinberg equilibrium in the sample. Assuming that these factors are independent of case-control status, we can write these as

where

In our model, the controls are sampled randomly from the population. This means that the terms

Next, since case alleles _{i}

where

Recall that _{p}_{t}_{t}_{t}/K_{p}_{(i,a)(i′,a′)} denote the coalescent time of allele copies _{(i,a)(i′,a′)} as _{ii′}_{i} = aff indicates that

where _{ii′}_{i} = aff|_{ii′} = _{p}

Equation 9 produces a pleasingly simple result: the coalescence rate for chromosomes from affected individuals is increased by a factor that is closely related to the standard recurrence risk ratio.
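
One way to see where such a factor comes from is Bayes' rule. The sketch below is a reconstruction under the assumptions stated earlier and is not a quotation of Equation 9. It writes T for the coalescence time of a pair of allele copies drawn from two different individuals, K_{p} for the probability that an individual is affected (assumed not to depend on T), and K_{t} for the probability that one individual is affected given that the other is affected and that the pair coalesces t generations back, with λ_{t} = K_{t}/K_{p}.

```latex
\begin{align*}
\Pr[T = t \mid \text{both aff}]
  &= \frac{\Pr[\text{both aff} \mid T = t]\,\Pr[T = t]}
          {\sum_{u} \Pr[\text{both aff} \mid T = u]\,\Pr[T = u]} \\
  &= \frac{K_p K_t \,\Pr[T = t]}{\sum_{u} K_p K_u \,\Pr[T = u]}
   = \frac{\lambda_t \,\Pr[T = t]}{\sum_{u} \lambda_u \,\Pr[T = u]} .
\end{align*}
```

The denominator equals 1 plus the sum over u of (λ_{u} − 1) Pr[T = u], which is close to 1 whenever recent coalescence is rare, so the coalescence probability at generation t is scaled by roughly λ_{t}.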

The recurrence risk ratio is an important quantity in genetic epidemiology, and is widely measured [

In our theoretical development, we will assume that disease inheritance is governed by a single additive gene [_{t}

For the additive model, [_{r},_{s}:

where Φ_{r} is the kinship coefficient between _{r} = 1/4 between sibs, and decays by 1/2 for each increment to _{r}

For example, under the standard Wright-Fisher model where individuals select their parents independently at random, most relatives are “half-relations”: half-siblings, half-first cousins, half-second cousins, etc. In that case, for _{r} = 1/8, 1/32, 1/128, and so on. Then for example, for _{t}_{s}_{r} = 1/4, 1/16, 1/64,… .

In summary, _{t}

Notice that chromosomes from affected individuals have a small excess probability of coalescing very rapidly (i.e., in the most recent ten generations or so). Otherwise, their coalescence rates are essentially like those of random chromosomes. The region at the left-hand side of the graph between the red and green lines represents the excess probability of very recent coalescence among case chromosomes (denoted _{s}
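
The size and timing of this excess can be illustrated numerically. The sketch below assumes a constant population of n diploid individuals, so that a random pair of chromosomes coalesces t generations back with probability (1 − 1/(2n))^{t−1}/(2n), and it assumes “half-relations” with the additive-model relation λ_{t} − 1 = 4Φ_{r}(λ_{s} − 1). The weight λ_{t} Pr[T = t] for case chromosomes is left unnormalized, and all function names and parameter values are illustrative.

```python
def kinship_half(t):
    """Kinship of half-relatives whose shared ancestor lived t generations
    back: 1/8, 1/32, 1/128, ... for t = 1, 2, 3, ..."""
    return 2.0 ** -(2 * t + 1)


def lambda_t(t, lambda_s):
    """Additive model: lambda_t - 1 = 4 * Phi_r * (lambda_s - 1)."""
    return 1.0 + 4.0 * kinship_half(t) * (lambda_s - 1.0)


def coalescence_pmf_random(t, n):
    """P(T = t) for two random chromosomes; constant size of n diploids."""
    p = 1.0 / (2.0 * n)
    return (1.0 - p) ** (t - 1) * p


def coalescence_weight_affected(t, n, lambda_s):
    """Unnormalized weight lambda_t * P(T = t) for two case chromosomes."""
    return lambda_t(t, lambda_s) * coalescence_pmf_random(t, n)


def excess_recent_coalescence(n, lambda_s, max_t):
    """Excess coalescence weight of case pairs over random pairs,
    accumulated over the most recent max_t generations."""
    return sum(
        coalescence_weight_affected(t, n, lambda_s) - coalescence_pmf_random(t, n)
        for t in range(1, max_t + 1)
    )


if __name__ == "__main__":
    n, lambda_s = 1000, 10.0  # hypothetical values
    for max_t in (1, 2, 5, 10, 50):
        print(max_t, excess_recent_coalescence(n, lambda_s, max_t))
```

Under these assumptions nearly all of the excess accrues within the first ten generations, because 4Φ_{r} shrinks by a factor of four with each generation.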

The dynamics of this process are reminiscent of structured coalescent models with many demes [

As described above, ancestral chromosomes of affected individuals coalesce at an increased rate during the most recent few generations (

Let

where

To evaluate

And finally, substituting Equations 11, 13, and 7 into Equation 3, we obtain

Equation 14 is worthy of discussion. When the simplest model of independence among sampled alleles holds, then _{t}

In this section, we evaluate Equation 14 under a range of specific models, in order to determine when cryptic relatedness is likely to have a substantial impact on case-control studies. The models presented assume an additive genetic model, as described above. At first, we will assume that the population is of constant size _{ii′}^{t−1}/(2

Recall from Equation 10 that λ_{t} − 1 = 4Φ_{r}(λ_{s} − 1). Recall also that when individuals select their parents independently at random, as in the standard Wright-Fisher model, most relatives are “half-relations” (e.g., half-siblings, half-cousins, etc.), and the kinship coefficients Φ_{r} are then 1/8, 1/32, 1/128, … for t = 1, 2, 3, … . Using δ_{half} to indicate this situation where individuals are related via “half-relationships,” it follows that

Noting that Σ_{t≥1} 2^{−2t+1} converges quickly to 2/3, Equation 15 can be further approximated as

If, instead, mating is purely monogamous, but partners are still chosen at random, then all relationships are “full” (e.g., full siblings, full cousins, etc.), and the kinship coefficients are two-fold higher. The corresponding inflation factor, δ_{full}, is

indicating that the impact of cryptic relatedness is approximately doubled when there is fully monogamous pairing of parents, compared to when there is independent pairing of parents for each offspring.
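
The two numerical facts used here, the limit 2/3 and the doubling under monogamy, can be checked with a few lines of arithmetic. The sketch below tabulates 4Φ_{r} = (λ_{t} − 1)/(λ_{s} − 1) for half- and full relations and accumulates its partial sums over t; the function names are illustrative.

```python
def kinship(t, full=False):
    """Kinship coefficient for relatives whose shared ancestors lived
    t generations back. Half-relations: 1/8, 1/32, 1/128, ...;
    full relations are two-fold higher: 1/4, 1/16, 1/64, ..."""
    phi_half = 2.0 ** -(2 * t + 1)
    return 2.0 * phi_half if full else phi_half


def excess_risk_weight(t, full=False):
    """(lambda_t - 1) / (lambda_s - 1) = 4 * Phi_r under the additive model."""
    return 4.0 * kinship(t, full)


def partial_sums(full=False, max_t=10):
    """Running totals of 4 * Phi_r over t = 1..max_t."""
    total = 0.0
    sums = []
    for t in range(1, max_t + 1):
        total += excess_risk_weight(t, full)
        sums.append(total)
    return sums


if __name__ == "__main__":
    print("half:", partial_sums(full=False, max_t=6))
    print("full:", partial_sums(full=True, max_t=6))
```

The half-relation sums approach 2/3 and the full-relation sums approach 4/3, and the two differ by a factor of exactly two at every t.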

To check the accuracy of our analytical results, we generated population histories via Wright-Fisher simulation and estimated the inflation factor,

Values of the Inflation Factor as a Function of Model Parameters, and a Comparison of the Simulated (δ̂_{mean}) and Analytical (_{A}

While the choice of an additive model for the phenotype (i.e., a heterozygote has exactly one-half the penetrance of a homozygote for the risk allele) is mathematically convenient, alternative modes of inheritance (including multilocus models, or models with dominance components) are certainly likely in practice. Such models will have the impact of changing the rate of decay of _{t},

^{−3} of tests would be significant at the ^{−3} level. As another example, consider a genetic model based loosely on a study of autism [_{s}_{p}

These examples notwithstanding, however, Equations 16 and 17 seem to suggest that _{p}_{s}

To be more specific, let _{s}_{s}_{s} K_{p}_{p},

Therefore, since _{s}_{s}_{p}_{s}_{s}

In summary, in populations of constant size, the impact of cryptic relatedness is generally very small, unless (1) _{s}

We now consider a model that allows for changes in population size. Let _{t}_{t}_{t}

where again _{t}_{t}_{t},

To check the accuracy of our results regarding demographic expansion, we modified the forward simulation procedure used above such that instead of a single _{onset} in the recent past starting at an initial population size _{A}_{t}_{+1} = _{t} · e^{α}_{f}
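
The growth rule and its effect on coalescence times can be sketched as follows. The code assumes that the population grows forward in time by a factor e^{α} per generation from an initial size, and that two chromosomes sampled in the final generation coalesce in a given ancestral generation with probability 1/(2N) for that generation's size N. The function names and the parameter values in the usage lines are hypothetical and are not the settings used in the simulations reported here.

```python
import math


def population_sizes(n_initial, alpha, generations):
    """Forward-in-time sizes N_0, ..., N_g with N_{t+1} = N_t * exp(alpha)."""
    sizes = [float(n_initial)]
    for _ in range(generations):
        sizes.append(sizes[-1] * math.exp(alpha))
    return sizes


def coalescence_pmf(sizes_forward):
    """P(T = t) for t = 1..g, for two chromosomes sampled in the last
    generation of `sizes_forward`. Looking backward, the parents at step t
    belong to the generation of size sizes_forward[g - t]."""
    g = len(sizes_forward) - 1
    pmf = []
    not_yet = 1.0  # probability of no coalescence in steps 1..t-1
    for t in range(1, g + 1):
        p = 1.0 / (2.0 * sizes_forward[g - t])
        pmf.append(not_yet * p)
        not_yet *= 1.0 - p
    return pmf


if __name__ == "__main__":
    sizes = population_sizes(n_initial=100, alpha=0.5, generations=10)
    print(round(sizes[-1]))
    print(coalescence_pmf(sizes))
```

In a growing population the per-generation coalescence probability rises going back in time, so a pair of chromosomes is more likely to trace to the small founding generations than it would be in a population that had always been at its present size.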

Results of the simulations, for a range of parameter values, are summarized in

Values of the Inflation Factor in Very Recently Expanded Populations

Each line plots the estimated probability that two chromosomes drawn at random, from different individuals affected with a given phenotype, or from two random control individuals, descend from a single ancestral chromosome within the last

The qualitative difference between the equilibrium model and the population-growth model can be understood as follows. Consider two studies in which

Thus far, we have considered models that assume “good” sampling design, in the sense that the sample of cases represents a random sample of the affected individuals in a population. We now consider the impact of sampling schemes that bias toward enrolling close relatives as cases in a study. For the previous models, we showed that with random ascertainment of cases, the inflation factor

As an extreme, but simple example, consider first the situation in which the case sample consists of

As a second simple example, suppose that a study recruits only a small fraction of affected individuals from a large population, but that recruits sometimes then encourage their siblings to enroll. Let the number of siblings of a recruited individual be Poisson with mean _{s}_{s}

From these examples, it seems that biased sampling of cases can have a substantial effect on inflating the test statistics—though this is less dramatic perhaps than might have been expected. For example, suppose that index cases have an average of _{s}

We have used data collected from a founder population, the Schmiedeleut (S-leut) Hutterites of South Dakota, to illustrate the impact of cryptic relatedness on association studies for phenotypes measured in that population [

It has previously been reported that naïve tests of association produce an excess of false positive signals in this population [

The fact that we have complete genealogical information for the Hutterites allows us to estimate the coalescence probabilities for pairs of alleles in any two individuals at any time since the founding of this population. These probabilities were estimated as described in the Materials and Methods section. The data do not provide information about coalescent events more than about 12 generations before the present, but the theory presented above suggests that the impact of cryptic relatedness is due to very recent coalescent events (and this is supported by our results, as follows).

The results of this analysis are presented in _{t}

We next used the genotype data to obtain an empirical estimate of

Observed (δ̂_{obs}_{A}

Should one be concerned about confounding from cryptic relatedness in association studies? To address this question, we have developed theory to predict the amount of cryptic relatedness expected in a random-mating population. Our results demonstrate that confounding effects of this kind are expected to be substantial only under rather special conditions. The bulk of the effect is due to the occurrence of quite close relationships among sampled individuals. Except in small populations, random pairs of affected individuals are unlikely to be closely related. Our results in Equation 14 show that for a given genetic model and population size, the impact of cryptic relatedness grows linearly with sample size. However, this obscures the fact that in practice, the maximum number of cases _{p}_{r}

In contrast, studies of populations in which there has been rapid and recent population growth, and where the total study population is small, should indeed be concerned about cryptic relatedness. This scenario produces higher levels of relatedness than are possible for the same values of _{r}

Another situation in which cryptic relatedness may be important is when there is extensive inbreeding. A model in which individuals are likely to mate with relatives will increase

It should be noted that our results assume that the disease phenotype is selectively neutral (see

Lastly, it should be noted that our primary model assumed a “good” epidemiological design in which individuals are ascertained randomly from the population. However, cryptic relatedness can also result from the non-random ascertainment of family members in a case-control study. For instance, affected family members might be more likely to seek treatment in the same clinic, or affected individuals might encourage their affected relatives to enroll in a study. These types of situations may be difficult to detect at the time of enrollment, but can have non-trivial consequences even in large outbred populations. We have shown that these situations indeed result in excess false positive rates. After data collection, we recommend the use of techniques for identifying cryptic relative pairs based on genetic data [

To check the accuracy of our initial analytical results, we generated population histories via Wright-Fisher simulation and estimated the inflation factor, ^{2} test statistic. We then estimated the inflation factor ^{2} statistics [

We estimated the coalescent probabilities for pairs of alleles in two individual Hutterites as follows. Starting from the affected individuals in the population, or from a matched random sample of individuals from the current population, we simulated the inheritance of a pair of randomly chosen chromosomes from different individuals, backward through time, from the present to the founders of the population. If the two chromosomes coalesced to a common ancestral chromosome within the pedigree, we counted the number of meioses back to that common ancestor, reporting the average number if the number of meioses was different on the two lineages. We repeated this procedure until we observed at least 500,000 coalescence events within the simulation. To estimate the mean inbreeding coefficient (
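
A minimal sketch of a backward pedigree simulation of this kind is given below. It is not the code used for the Hutterite analysis: the toy pedigree, the data layout, and the function names are hypothetical, and for simplicity each lineage is traced independently to a founder and the first ancestral chromosome shared by the two paths is taken as the coalescence point.

```python
import random

# Hypothetical toy pedigree: child -> (father, mother).
# Founders do not appear as keys.
PEDIGREE = {
    "A": ("F", "M"),
    "B": ("F", "M"),
}


def trace_lineage(pedigree, individual, rng):
    """Follow one randomly chosen chromosome of `individual` back to a founder.

    Returns {chromosome: meioses}, where a chromosome is (person, "pat" or
    "mat") and meioses counts the transmissions separating it from the
    sampled copy."""
    chrom = (individual, rng.choice(("pat", "mat")))
    path = {chrom: 0}
    meioses = 0
    while chrom[0] in pedigree:
        father, mother = pedigree[chrom[0]]
        parent = father if chrom[1] == "pat" else mother
        # The transmitted copy is either of the parent's two chromosomes.
        chrom = (parent, rng.choice(("pat", "mat")))
        meioses += 1
        path[chrom] = meioses
    return path


def coalescence_meioses(pedigree, ind1, ind2, rng):
    """Average number of meioses back to the first shared ancestral
    chromosome for two different individuals, or None if the lineages
    do not coalesce within the pedigree."""
    path1 = trace_lineage(pedigree, ind1, rng)
    path2 = trace_lineage(pedigree, ind2, rng)
    shared = [c for c in path1 if c in path2]
    if not shared:
        return None
    first = min(shared, key=lambda c: path1[c])
    return 0.5 * (path1[first] + path2[first])


def estimate_coalescence_probability(pedigree, ind1, ind2, reps, seed=0):
    """Fraction of replicates in which the two lineages coalesce."""
    rng = random.Random(seed)
    hits = sum(
        coalescence_meioses(pedigree, ind1, ind2, rng) is not None
        for _ in range(reps)
    )
    return hits / reps


if __name__ == "__main__":
    print(estimate_coalescence_probability(PEDIGREE, "A", "B", reps=20000))
```

For the two full siblings in the toy pedigree the estimate should be close to the sib kinship coefficient of 1/4, which gives a quick sanity check before such a routine is applied to a deeper genealogy.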

For each marker, we constructed a 2 ×

To be more careful about the possibility that some loci might be genuinely associated with a phenotype or in various degrees of linkage, we repeated the analysis using approximately 40 microsatellite markers, unlinked either to one another or to candidate gene regions showing evidence of linkage. The resulting δ̂s based on the mean were almost identical, for all phenotypes, to those from the larger marker sample (unpublished data). Finally, we generated a semi-analytical result for the phenotype by plugging the coalescent probabilities estimated from the pedigree, along with estimated inbreeding coefficients, and the average number of cases selected across all replicates, into Equation 14.
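
A mean-based estimate of the inflation factor can be sketched in a few lines. The sketch assumes that the estimator is the average of the test statistics across unlinked null markers, which is natural because a χ^{2} variable with one degree of freedom has mean 1. The simulated statistics stand in for real marker data, and the function names are illustrative.

```python
import random


def estimate_inflation_mean(chi2_stats):
    """Estimate the inflation factor as the mean of 1-df chi-square
    statistics (assumption: markers are unlinked and not associated)."""
    return sum(chi2_stats) / len(chi2_stats)


def simulate_null_statistics(delta, n_markers, seed=0):
    """Hypothetical test data: delta * Z**2 with Z standard normal, i.e.
    statistics inflated by a known factor delta."""
    rng = random.Random(seed)
    return [delta * rng.gauss(0.0, 1.0) ** 2 for _ in range(n_markers)]


if __name__ == "__main__":
    stats = simulate_null_statistics(delta=1.5, n_markers=50000, seed=1)
    print(estimate_inflation_mean(stats))
```

With only a few dozen markers the sampling error of such an estimate is substantial, so agreement between marker sets is best judged alongside the number of markers in each.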

We thank Carole Ober for providing the marker, phenotypic, and genealogical data used for the Hutterite data analysis and for comments on the manuscript; Rebecca Anderson and Natasha Phillips for additional assistance in organizing and interpreting the phenotype data; and Catherine Bourgain, Graham Coop, William Wen, Sebastian Zöllner, and the anonymous reviewers for helpful comments or discussion. This work was supported in part by the National Institutes of Health (HG002772) and a Hitchings-Elion award from Burroughs Wellcome Fund to JKP; BFV received support from the above grant to JKP as well as NIH DK55889 to Nancy J. Cox and from a Genetics Regulation Training Grant NIH/NIGMS NRSA 5 T32 GM07197.