Dosage Transmission Disequilibrium Test (dTDT) for Linkage and Association Detection

Both linkage and association studies have been successfully applied to identify disease susceptibility genes with genetic markers such as microsatellites and Single Nucleotide Polymorphisms (SNPs). As one of the traditional family-based studies, the Transmission/Disequilibrium Test (TDT) measures the over-transmission of an allele in a trio from its heterozygous parents to the affected offspring and can be potentially useful to identify genetic determinants for complex disorders. However, there is reduced information when complete trio information is unavailable. In this study, we developed a novel approach to “infer” the transmission of SNPs by combining both the linkage and association data, which uses microsatellite markers from families informative for linkage together with SNP markers from the offspring who are genotyped for both linkage and a Genome-Wide Association Study (GWAS). We generalized the traditional TDT to process these inferred dosage probabilities, which we name as the dosage-TDT (dTDT). For evaluation purpose, we developed a simulation procedure to assess its operating characteristics. We applied the dTDT to the simulated data and documented the power of the dTDT under a number of different realistic scenarios. Finally, we applied our methods to a family study of alcohol dependence (COGA) and performed individual genotyping on complete families for the top signals. One SNP (rs4903712 on chromosome 14) remained significant after correcting for multiple testing Methods developed in this study can be adapted to other platforms and will have widespread applicability in genomic research when case-control GWAS data are collected in families with existing linkage data.


Introduction
Linkage studies have been successfully used to identify many disease genes such as hyper-cholesterolaemia [1][2][3], Huntington's disease [4] and cystic fibrosis [5]. Linkage studies allow direct observation of recombination events in a family pedigree with a limited number of generations, as well as simultaneous analysis of multiple genetic markers. The LOD score (logarithm of odds), developed by Newton E. Morton [6], is a statistical test often used for linkage analysis in human. The LOD score compares the likelihood of obtaining the test data if the two loci are indeed linked, to the likelihood of observing the same data by chance. However, this setup requires tailor-made likelihood statistics. When it comes to a multi-loci model, the situation can be even more cumbersome [7]. On the other hand, because of the requirement of a large number of families with several affected generations, linkage analysis can be less helpful when dealing with diseases of late-onset with a high mortality. Alternatively, association studies are used to identify disease susceptibility genes by comparing genetic variants between individuals with and without the disease of interest. High-throughput genotyping has allowed large-scale association studies over the entire human genome. In 2005, the first Genome-Wide Association Study (GWAS) was successfully applied on human age-related macular degeneration [8]. Since then, GWAS has been widely used to identify the association between genetic variants, typically singlenucleotide polymorphisms (SNPs), and heritable traits or diseases.
In general, there are two major types of designs that are commonly used in association research: population-based and family-based studies. As the most common population-based approach, the case-control setup compares an unrelated healthy control group and affected case group. The genotyped SNPs are investigated to identify the allele frequency differences between these two groups. The study then determines whether the SNPs are associated with the genetic trait or disease based on the statistical significance of the differences. The independent samples are typically easier to obtain in a case-control study than family samples. However, many case-control samples select independent cases from existing family data that were originally used in linkage analysis. Because cases can be over-sampled from groups with higher disease prevalence, the differences of allele frequencies in an admixture of ethnic groups may produce spurious associations. Therefore, although case-control studies have shown advantages in identifying association between the disease susceptibility and markers in a candidate gene, the results may reflect type I errors (false-positive) due to unaccounted confounding factors [9][10][11] such as population stratification [12][13][14][15][16].
Unlike the population-based studies, family-based studies are resistant to type I errors arising from population stratification. The family-based Transmission/Disequilibrium Test (TDT) measures the over-transmission of an allele from heterozygous parents to their affected offspring, in which the non-transmitted parental alleles serve, in effect, as a control group. Therefore the TDT is a robust test of association in the presence of geographical or ethnical impact from the population [17]. In the original TDT [18], a parent-proband trio is considered as a basic unit, in which a proband is the first affected family member who seeks medical attention for a genetic disorder. Assuming complete genotype information for a two allele marker locus in each trio, the TDT compares the number of heterozygous parents who transmit either allele to the affected offspring. The TDT can be constructed through a 2 by 2 table (Table 1). Under the null hypothesis of no associaton, the proportions b=(bzc) & c=(bzc)are tested against (0.5, 0.5) using a binomial (asymptotically chi-square) test with one degree of freedom: Because neither genotypes nor allele frequencies are required, the TDT is considered robust to the population stratification as mentioned above.
A variety of TDT-like tests have been suggested starting with Rubinstein et al [19]. Curtis and Sham studied a multi-allelic TDT with incorporation of missing parents [20,21]. This was extended by Spielman et al and Horvath et al [22,23] with the TDT applied to different family structures in their sib-ship tests. For an allele of interest at a marker locus, the sib-ship test essentially compares the frequency of that allele among affected individuals with the frequency of the allele among unaffected individuals, which allows the TDT to be applied to diseases with late age of onset, such as non-insulin-dependent diabetes, cardiovascular diseases, Alzheimer's disease, and other diseases related to aging. Several studies also discussed the application of the TDT for mapping quantitative trait loci [24][25][26][27][28][29][30]. Gordon et al.'s TDTae allows for genotyping errors in the analysis and accommodates various error models [31]. As discussed above, multiple affected and unaffected siblings are often collected and used for both linkage and association analysis. The family-based association test (FBAT) generalized the TDT model on various phenotypic traits and multiple markers [32][33][34][35][36]. Instead of using data from only the heterozygous parents as in the TDT, the affected-family based controls (AFBAC) method [37] is developed to take advantage of all the parental information. But the trade-off of this setup is its vulnerability to population stratification as genotype frequencies are not irrelevant in this test [37,38]. Another extension of the TDT, the pedigree disequilibrium test (PDT), is specifically designed for analyzing the Linkage Disequilibrium (LD, the non-random association of alleles at two or more loci) in general pedigrees, which has been successfully applied on a number of complex traits such as diabetes [38,39]. Further, as a more powerful development to the PDT, the presence of linkage (APL) is used to handle diseases of late-onset [38,40]. However, in spite of the divergences as well as the great promise of these TDTtype analyses [9,24,41], one primary limitation that most of these extensions encounter is the dependence on completeness of the genotype information for all trio members in a single test and lack of scalability on utilizing both the linkage and association data in a study.
Disorders can often have genotype information from only one parent of the affected individuals. As a common practice, these trios are simply discarded [42] though this can result in considerable loss of information and bias to the association study [20,38]. Several studies have been proposed to allow TDT to handle missing parental genotypic information [20,22,[43][44][45][46][47][48][49][50]. Within these studies, the missing parental genotypes are mostly reconstructed based on the assumption that they are missing completely at random and do not depend on the genotypes themselves [51]. However, this assumption may not hold true and the probability that a genotype is missing may rely on the unobserved alleles [38,52,53]. Furthermore, these approaches are not designed for pedigrees with missing genotypes on the proband when both linkage and association data are available.
Against this background, we note this is a two step procedure. In the first step, the SNP data is used with both parents missing, so that the analysis depends on external allele frequency estimates and is sensitive to population stratification. In the second step, individual genotyping is performed on the parents for the top SNPs from step one, so this step is a traditional TDT and insensitive to the potential biases in step one. Accordingly, having available family DNA is needed to avoid false positive results.

Methods
As stated above, the traditional TDT requires complete genotypic information from all members of the nuclear families. However, obtaining all genotypes cannot always be feasible for some diseases or families. Therefore the traditional TDT-type studies may not be useful to identify the presence of genetic determinants in data with relatively small amounts of complete trio information. One way to solve this type of issue is to reconstruct the missing parental genotypes under the assumption that they follow the probability distribution of the fully observed cases. However, most studies designed for this purpose do not incorporate the impact from LD and thus may introduce bias to the results. On the other hand, there are data available that have genotype information on both microsatellite and SNP markers for diseases; one example is alcoholism [54]. With both the linkage and association data, usually the microsatellite genotypic information from families for linkage analysis together with SNP data from the offspring who are genotyped for both linkage and GWAS, we can ''infer'' the transmission of SNPs for the rest of the family members who have not been genotyped on SNPs. We call these family members the ''missing individuals''. In detail, first we generate the combined pedigrees in which each individual has both the linkage and GWAS genotype data filled in. Genotypes that those individuals do not have will be taken as missing data in the combined pedigrees. Then we use the program MERLIN [55] to read these combined pedigrees as input and infer the dosage probabilities of dense SNP genotypes for these missing individuals (see section Genotype Inference of Familial Individuals for more details). All the trio combinations from the inferred pedigrees are extracted on the condition that the children were affected and at least one parent in the trio was genotyped on microsatellite markers. The dosage-TDT that we have developed in this study is applied on these trio pedigrees using their inferred dosage probabilities. By incorporating the family linkage information into the GWAS data, we can potentially have higher power to detect association between our genotypic markers and the disease susceptibility alleles.

Dosage Transmission Disequilibrium Test (dTDT)
Common map of both linkage & GWAS data. The common map of both linkage and GWAS data is designed in the following way. With the genetic position of the linkage markers (microsatellite as in here) and physical position of both linkage and GWAS markers (microsatellite and SNPs), the genetic positions of all the GWAS markers (SNPs) are calculated based on equation (2): where gm denotes the genetic position of a marker, in centi-Morgan (cM) unit; pm denotes the physical position of a marker, in base pair units; MS last is the last microsatellite marker on a chromosome, pm MS jz1 ƒpm MS last . Because at both ends of a chromosome when the physical position of a SNP is either smaller than the 1 st microsatellite marker or larger than the last microsatellite marker, there is only one microsatellite marker that can be referred to compute the genetic position for the SNP. We simply use the convention that 1cM < 10 6 base pairs to convert a SNP's physical map to its genetic map. While a SNP is in between two microsatellite markers, we use the ratio pm MS jz1 {pm SNP i pm MS jz1 {pm MSj and multiply this ratio with the genetic distance between these two microsatellite markers (gm MS jz1 {gm MSj ): In this way we compute the relative genetic position of a SNP marker to the microsatellite marker that's next to it. Genotype inference of familial individuals. Initially, many approaches implicitly imputed missing genotypes based on the potential genotype distribution in a family [56][57][58]. In practice, the genetic linkage implied that family members share a certain degree of similarity through their ''identical-by-descent'' (IBD) regions on the chromosomes. In this way, genotypes of the non-typed markers for these family members can be inferred according to their shared IBD with the other relatives. Figure 1 illustrates the procedure of this genotype inference. As shown in the figure, a subset of microsatellite markers has been typed for all the family members except the founders (red), whereas both microsatellite and SNP markers have been typed in only a few selected common individuals (black). Genotypes of the dense SNPs for missing individuals can be inferred by comparing the haplotypes that are IBD with the other individuals in the family. Several studies have been published on the genotype imputation procedures described above [59,60]. These procedures are implemented in programs such as MERLIN [55,61] and MENDEL [62,63], using one of the pedigree analysis algorithms such as the Lander-Green [64] or Elston-Stewart [65] algorithms, or Monte Carlo sampling [66,67]. Merlin uses sparse trees to represent gene flow in pedigrees and is considered as one of the fastest packages among packages implementing the same algorithms such as Allegro [68] and Genehunter [69]. In this study, we use MERLIN to infer the dosage probabilities. The output of this program includes the most likely genotypes, the expected number of copies for the tested alleles (0, 1, or 2 with genotype observed), and the posterior probabilities (dosage probabilities) of the three alternative genotypes [70]. Because a large number of related individuals are included, this family-based genotype inference is expected to improve the power of association tests [59]. Furthermore, when a GWA scan follows a linkage study, only a proportion of individuals may need to be genotyped and the inferred genotypes can be useful for the next step in the association analysis.
Dosage-TDT. Because the traditional TDT is a simple representation of the x 2 statistics, it requires single counts of the transmitted/non-transmitted alleles from the heterozygous parents to the affected offspring. Thus the inferred dosage probabilities cannot be processed through this setup. In this study, we generalize the original TDT by taking all possible allele transmissions in a pedigree into account. Table 2 shows the dosage probabilities of three alternative genotypes (1/1, 1/2 and 2/2) in a trio (named a trio-dosage set in this work) from the inferred results. Table 3 lists all 11 TDT-informative allele transmissions in a trio where at least one of the parents is heterozygous. The values of b i and c i used in the x 2 calculation of the TDT in each trio i are calculated by summing up the probabilities across all these 11 types of transmissions. Let t denote the probability that allele 1 is transmitted by a heterozygote parent of an affected child. We can then write the dosage probabilities of a child in terms of the dosage probabilities of its parents and t as follows: Thus, the frequencies of allele 1 and 2 of a child appearing in a trio are as follows: Based on equations (3) to (8), we can derive that: In summary, the sum of all 11 informative transmissions for b i and c i can be written as: where i represents the i th trio. By substituting t from equation (9) into (10) and (11) and summing up b i and c i for all trios, we can compute the x 2 values for each SNP: which follows a one degree freedom x2 distribution under the null hypothesis of no association. We name this generalized TDT the dosage TDT (dTDT). The original TDT proposed by Spielman et al [18] can then be considered as a special case when t and the dosage probabilities in a trio-dosage set is 0 or 1. The beauty of the above equations is the denominator from t can be canceled out with the one in b i and c i thus equation (12) can be further written as: in which we denote for each trio i. Using form (13) can be computationally efficient.

Simulation
The dTDT makes it possible to process the inferred dosage probabilities of the un-genotyped SNPs for those missing individuals in a nuclear family. As a follow-up study of this generalized TDT approach, we develop a simulation to investigate how the power changes for association detection with different inputs. In this simulation, we generate multiple sets of trios under various settings. Each set has 1,000 trios. Each trio has one affected child. Because we focus on the interaction between SNP and microsatellite markers, only one SNP marker and one microsatellite marker are simulated. In each trio, the microsatellite markers are assigned to both the parents and the child. SNPs are only assigned to the child. The parents who do not have such SNP markers are considered as the missing individuals and their SNP genotypes are inferred by MERLIN. The dTDT is then used to process the inferred dosage probabilities and p-values are reported from the x 2 statistics.
Generating SNPs. Denote the low and high risk alleles at a disease locus D as D 1 and D 2 , with population frequencies p 1 and p 2 . Assuming Hardy-Weinberg equilibrium, the population prevalence (K) of the disease is where f 11 , f 12 and f 22 are the penetrances of the three genotypes D 1 D 1 , D 1 D 2 and D 2 D 2 .
We have considered three disease models: dominant, recessive and co-dominant. The combinations of the penetrances in these three models are designed as follows: dominant (f 11 ,f 12 = f 22 ), recessive (f 11 = f 12 ,f 22 ) and co-dominant (f 11 ,f 12 = K f 22 ).
With K and f predefined, we can compute p 1 and p 2 using the following equations: Denote the haplotype frequencies of disease locus D and SNP locus S as h 11 , h 12 , h 21 and h 22 . On condition of the child being  Table 3. Calculation of b i and c i in terms of dosage probabilities and t for the i th trio with all 11 TDT-informative transmissions.
affected, the probabilities of different haplotypes in a child can be calculated through: With a predefined correlation coefficient (R) of linkage disequilibrium (LD) between D and S, we can derive the haplotype frequencies as follows: where q 1 and q 2 are the population frequencies of the SNP alleles S 1 and S 2 . To simplify our model, we will assume p i = q i , where i = 1 or 2. The rationale behind this is that if the SNP marker and disease allele have very different frequencies, then R 2 is small and there is little power. Keeping both frequencies equal allow R to vary the full range from 21 to +1. By substituting (20) into (19), we can assign the genotypes (S 1 S 1 , S 1 S 2 , or S 2 S 2 ) at locus S to the affected children based on these derived frequencies: Generating microsatellites. Denote M i as the microsatellite marker from the parents. Because of a large number of polymorphisms (alleles) for a microsatellite marker, we assume that our microsatellite marker is completely informative (i.e., each parent is heterozygous at the microsatellite locus M) and assign alleles Parameter settings. Without losing biological meaning, i.e. with valid p i M(0, 1.0] (i = 1 or 2 and p 1 + p 2 = 1), but also with a good coverage of possible natural phenomena, we predefine the following values for the parameters to generate each set of trios: N: the number of trios = 1,000; K: prevalence = 0.01, 0.1, or 0.2; R: correlation coefficient of LD between D and S = 0.5, 0.7, 0.9, or 1.0 (as negative value of R does not produce informative divergence from the result using positive value of R, we are only considering positive value of R herewith); f or g: penetrance of disease genotype D i D i or D i D j , where i or j = 1 or 2 and i ? j. To simplify the notation, here we use f to denote f 11 and g to denote f 12 or f 22 . As noted, we separate the disease models as dominant (f, g, g), recessive (f, f, g), and codominant (f, 0.5g, g) where f = 0.0, 0.1K, 0.3K, 0.5K, 0.7K, or 0.9K and g = 1.1K, 0.5, 0.7, 0.9, or 1.0.
These values are first permuted to generate all their possible combinations then any combination that produces invalid p, i.e.p 6 [ (0,1:0, is excluded. Under each setting, we produce 1,000 trios based on equation (15) for SNPs in the affected offsprings. Parental SNPs are inferred by MERLIN and dTDT is used to process the inferred dosage probabilities.

Application to Alcohol Dependence
Alcohol dependence is a serious psychiatric disorder in which an individual is characterized as having harmful consequences of repeated or compulsive alcohol use, and (sometimes) physiological dependence on alcohol (i.e., tolerance and/or symptoms of withdrawal) [71,72]. During 2001-2005, excessive alcohol use contributed to about 79,000 deaths and 2.3 million years of potential life lost in the United States [73]. Excessive alcohol consumption, the third leading cause of preventable death in the United States, can cause damage to the central and peripheral nervous system, and to nearly every organ system in the body [74,75]. It is reported that alcohol dependence affects about 12% of American adults across their lifetime [76]. As a complex disease, alcohol dependence can be influenced by various factors such as genetic susceptibility, environmental influences and interactions among genes or between genes and environment. The nine-site national Collaborative study On the Genetics of Alcoholism (COGA) funded by National Institute of Alcohol and Alcoholism (NIAAA) aims to identify and characterize genes that affect the susceptibility to develop alcohol dependence and related phenotypes. COGA is applying multiple strategies for genetic research. The most densely affected, multiplex alcoholic families were used in a multi-wave family-based linkage study. 2,283 out of 2,459 individuals from 262 families were genotyped using microsatellite markers in Wave I and Wave II (data denoted as Map03MS) ( Table 4) [54]. At Wave III, another 1,442 out of 2,106 individuals from 312 families were selected for microsatellite genotyping by the Mammalian Genotyping Service (MGS) from Marshfield Clinic (data denoted as MarshfieldMS) ( Table 4). Combined data from all three waves are denoted as LinkageMS in this study. COGA also has high throughput GWAS data with over 1 million SNP markers from 1,884 independent individuals, generated by the Center for Inherited Disease Research (CIDR) (data denoted as CIDRSNP) (Edenberg et al., 2010). The GWAS data include 566 mutual individuals chosen from the LinkageMS families.
All participants agreed to share their DNA and phenotypic information for research purposes and provided written informed consent following instructions from institutional review boards at all data collection sites. The study was approved by the institutional review board at each COGA site, with the OHRP Assurance Numbers being: FWA00003624 (SUNY Research As described above, the dTDT uses the inferred dosage probabilities of dense SNPs for association detection. The COGA family data provides us such an opportunity to integrate the information from both linkage and association studies. In this study, we first generate the combined pedigrees with both the LinkageMS and CIDRSNP genotype data from COGA. Figure 2 demonstrates the structure of the combined pedigrees. Most individuals in the combined pedigrees were genotyped on microsatellite markers. A subset of individuals in the pedigrees was genotyped for dense SNPs. These individuals include one affected child in each of the families and other unrelated members chosen as a control group. All other individuals who have not been genotyped on SNPs are considered as the missing individuals. We use the program MERLIN [55] to read these combined pedigrees as input and infer the dosage probabilities of dense SNP genotypes for these missing individuals (described in detail below). All the trio combinations from the inferred pedigrees are extracted on the condition that the children were affected and at least one parent in the trio was genotyped on microsatellite markers. The dTDT is applied on these trio pedigrees using their inferred dosage probabilities. In addition, PLINK 77 is used to conduct a standard case-control study on the CIDRSNP data. With the idea that making use of all the available sample data would increase the power for association detection, we further combine the results from both dTDT and case-control study through the MH test [78]. With AA and mixed families excluded, we have 3,016 out of 3,567 EA individuals from 453 families genotyped on microsatellite markers in the LinkageMS data. 471 of these individuals in 398 linkage families were selected for SNP genotyping (known as the mutual individuals) ( Table 6), including 41 individuals without microsatellite genotyping data. For GWAS, from each of these 398 families, one affected child (normally the proband) was selected as the case and other biologically unrelated family member(s) were used as the control. In total, 1,399 EA CIDRSNP individuals consist of 847 cases and 552 controls. Figure 3 shows the pedigree of one of the LinkageMS EA families (FAM_ID 20059). This family has four mutual individuals. Except for proband #1, all the other three (#2, 9, and 13) selected for GWAS are relatives by affinity to this family.
Marker cleaning & mapping. In order to match up the genetic positions of all microsatellite and SNP markers, 38 microsatellite markers in Map03 and 58 microsatellite markers in Marshfield were excluded because of their missing physical positions. ,200,000 SNPs with low minor allele frequency (#5%) were excluded. In consideration of any possible impact from linkage disequilibrium (LD), we exclude ,1,500 SNPs that are within 1,000 base pairs flanking each microsatellite marker. We use equation (2) to create the common map for these microsatellite and SNP markers. A comparison of the numbers of microsatellite and SNP markers before and after cleaning is given in Table 6. Figure 4 shows the distribution of these cleaned markers in EA families. With these cleaned and mapped markers, we create new pedigrees with the Linkage MS and CIDR SNP data combined together. One combined pedigree has ind total individuals (in rows) with (mrk MS +mrk SNP ) markers (in columns). Missing SNPs of (ind totalind common ) individuals were inferred by MERLIN. (Figure 1).
dTDT and mantel-Haenszel test. In this study, we apply both the dTDT and Mantel-Haenszel (MH) tests to the COGA data. The MH test was first proposed by Mantel and Haenszel in 1959 [78]. The method has been widely applied to analysis of contingency tables (normally 2 6 2) and comparison of results from different treatments. In case-control studies, a 2 6 2 table is typically used. The discrepancy between observed and expected values in each cell from the table is evaluated by x 2 test with one degree of freedom. Comparatively, because the dTDT only takes account of values of b and c, the test can be constructed by a 1 62 table instead. To maximally benefit from all sample data and multiple studies, we extend the MH test to pool results on each SNP from these two contingency tables in both case-control study and dTDT. Calculation of each term in the MH test is shown in Table 7. Having the Observed & Expected values and Variances from case-control study and dTDT, terms in MH test can be written as the sums of corresponding values these two tests. The null hypothesis assumes no association between markers and disease.

Simulation
We separate the simulation results into nine groups that are combinations of three disease models and three K values (0.01, 0.1 and 0.2). In each group there are 100,120 settings with different R and f values. With each setting we generate 1,000 trios and replicate the inference and dTDT procedures. Because of the large number of these settings (1,320 in total), we attach the results as in supplement tables. A plot of the log 10 pÀvalue ð Þ for these models is shown in Figure 5. In the figure, graphs from the top row to the bottom row represent the dominant, recessive and codominant models respectively, and from left to right represent the results with three different K values (0.01, 0.1 and 0.2). Each blue dot corresponds to a/under that specific setting.
Because there are many factors interacting with each other, we will start with a general comparison of different models then look at the impact from one or two factors while constraining the others constant.
In general, reading the values of log 10 pÀvalue ð Þ from each model, we find that a rare (K = 0.01) recessive disease model Table 5. produces higher power compared to common (K = 0.1 or 0.2), codominant or dominant disease models. Meanwhile, high R value (0.9 or 1.0) also helps increase dTDT's ability to detect signals. This is because in a rare recessive case, markers with high LD to the disease allele both parents are heterozygous and both transmit the recessive risk allele to their offspring. Our findings from the simulation validate what we already observed in the biological phenomenon.
In the figure, each graph is broken down into four bins having R = 0.5, 0.7, 0.9 and 1.0. Interestingly, within each bin, when log 10 pÀvalues ð Þare ordered by descending f 11 (from 0.9K to 0.0) and increasing f 12 & f 22 (from 1.1K to 1.0), it shows a noticeable increasing trend as shown on the graphs. Meanwhile, when f 22 is small ( = 1.1K), the blue dots are close to the bottom line on each graph. We note that in order to have enough power to detect the signals, we need to have relatively distinguishable penetrances (i.e. f cannot be too close to g) in the model. Indeed, there is no information when all three penetrances are equal to K.
When we generate the trios, we use a roulette wheel algorithm to assign SNPs to the children. This randomness is reflected on the graphs as the dots spread in some irregular patterns. Reading the graphs from left to right, we can see that with low prevalence (K = 0.01) the dots appear in clear clusters. Each cluster corresponds to a specific f 11 value. Taking the top left graph (dominant with K = 0.01) as an example, f 11 changes in the order of 0.9K, 0.7K, 0.5K, 0.3K, 0.1K and 0.0. Within each cluster, f 12 & f 22 increase in the order of 1.1K, 0.5, 0.9 and 1.0. This clustering holds true in the other two disease models (recessive and codominant) when K is small (K = 0.01) except some f 11 valued clusters are missing because combinations with invalid p were excluded. In summary, the above observation indicates that f 11 has a higher impact to the power than f 12 & f 22 do in a rare disease model. As the prevalence increases (K = 0.1 and 0.2), the clustering effect gradually disappears. In each R valued bin when K is large (0.1 or 0.2), though the penetrances are sorted in the same order as mentioned above, the dots represent certain continuity instead of clustering. This shows that, in a common disease model, f 11 is not the only or the most effective factor as it is in a rare disease model. Other factors start to interact with each other. Especially when K = 0.2 and R = 1.0 (the fourth bin in the three graphs on the right), the dots appear in clear fan-shaped sectors. This irregularity can be partially explained by the sensitivity to randomness of the model under such setting, i.e., small changes of the parameters can have high impact on the results.
We can further see how the penetrances differ by looking at the slope of the trend in each bin. Apparently as the value of R increases across the bins, the slope of the trend also increases. This is because when R is high (such as 0.9 and 1.0), the same degree of lift in the penetrances will add more power and move/more quickly to its next level compared to the situation when R is low (such as 0.5). From another point of view, we can imagine these slopes as the (first) derivatives of a convex function in terms of R. On this convex curve, as R moves along to its rightmost end (increases), the derivative of the function increases and the function value (power) improves faster.

Application to Alcohol Dependence
In a recent work in a case-control study using GWAS data on the COGA sample, Edenberg et al [79] identified the most significant SNP rs10511260 on chromosome 3 with p-value (P) = 3.4610 26 . A cluster of SNPs was found in a region of chromosome 11 with p-values ranging from 4.8610 25 to 6.9610 24 . No single SNP showed genome-wide significance (5610 28 ). In the following sections, we will compare our results from dTDT on COGA data with these findings from Edenberg et al's work.
dTDT on COGA data. To apply the dTDT on each SNP from COGA, we re-build the inferred pedigrees by extracting all trio combinations in which every child in a trio must be affected and at least one parent was genotyped on microsatellite markers.  Because one family can have more than one affected child, the parents can be found in more than one trio. For instance, the family (FAM_ID 20059) as shown in Figure 3 has four affected children (#18, 1, 5 and 12, from left to right). Mother (#30) was genotyped on microsatellite markers. Therefore we have four trios from this family. In total, 893 trios from 323 families with 1,654 individuals in the Linkage MS EA group were extracted and used to build the trio-dosage pedigrees. 166 SNPs are found with p-values ,10 24 . This is compared to 93 SNPs at the same level in Edenberg et al's paper [79]. dTDT. This may be primarily due to the relatively small sample size used in dTDT and accuracy of the ''-infer'' program.
Combining case-Control study and dTDT. With dTDT and case-control analysis applied to the Linkage MS and CIDR SNP data respectively, we compute the p-values of MH test based on calculations in Table 7. Figure 6 shows the Manhattan plots oflog 10 P across all 23 chromosomes from the MH test. The most significant SNP (rs11583322) on chromosome 1 gives a pvalue = 1.10610 28 that meets the GWAS significance level. This SNP lies in the gene Serine/Threonine Kinase 40 (STK40) that connects pluripotency factor Oct4 to the Erk/MAPK pathway controls extraembryonic endoderm differentiation [80]. Table 8    lists the top SNPs with MH test p-values ,10 25 and their corresponding case-control study and dTDT p-values. From the table we can see that the p-value of each SNP in the MH test is approximately the product of p-values in the other two tests. However, SNPs with high rankings by p-values in the MH test do not systematically correspond to high rankings in the individual tests. The p-values in the dTDT share the highest variance (2.01610 23 ) among the three tests because of the randomness introduced by the inference procedure as well as the difference in sample sizes across the tests. A larger sample size will likely increase the power and generate more robust test results. In total, we have 257 SNPs in 75 genes with p-values ,10 24 . 14 SNPs at the same level of p-values are found in replication of Edenberg et al's study [79]. Four of these 14 SNPs have associated genes: CSMD2 on chromosome 1, LZTS2 & PDZD7 on chromosome 10, and Gcom1 on chromosome 15. There are 34 vs. 11 SNPs that have p-values ,10 25 . Five SNPs across chromosome 1, 3, 9, 12 and 14 show p-values ,5.1610 27 which is more statistically significant than the case-control analysis by Edenberg et al [79]. Our results also show clusters of SNPs by distance with p-values ,10 25 (more than five such markers in one cluster) in genes EXOC6B, FTO, NCAM2 and PPEF1 on chromosome 2, 5, 21 and X, respectively.

Experimental Validation
Based on results from the MH test (5 th column on Table 8), 19 out of the top 30 SNP markers were genotyped for 1,586 individuals from 220 Wave I & II families. We used the Sequenom MassArray technology for SNP genotyping [81]. PCR primers, extension primers, and multiplexing capabilities were determined with Sequenom MassARRAY Assay Designer software v3.1.2.2. Standard procedures were used to amplify PCR products; unincorporated nucleotides were deactivated with shrimp alkaline phosphatase. A single base pair extension step was completed with the mass extension primer and the terminator (iPLEX). The primer extension products were cleaned with resin and spotted onto a silicon SpectroChip. The chip was scanned with a mass spectrometry workstation (Bruker). The resulting genotype spectra were analyzed with Sequenom SpectroTYPER software v3.4.
Because variant rs11583322 did not work well with the Sequenom genotyping platform, we used the PrimerPicker software [82] to design the assay and followed the protocol described in KASPar SNP Genotyping System manual to run PCR reaction with an ABI GeneAmp PCR System 9700 [83]. Genotypes were accessed using an ABI 7900 HT Fast Real-Time PCR system. Because the genotypes are from linkage families, we used the program UNPHASED [84] to perform a genetic association analysis. Our colleagues in Allison Goate's lab implemented the above genotyping process. The author did the final analyses of the genotypes. Results are shown in Table 8.

Simulation
As shown above, we can see that simulation can be a powerful tool to investigate many interactions between various factors and help discover potential rules underlying these factors.
With slight modification of the above technique, we can use our simulation to investigate how the dTDT is affected by population stratification. Simulating different populations to have different prevalences, we can choose two sets of trios using different allele frequencies. Applying the dTDT to this combined set of trios, we can test whether the power is lowered or heightened because of the prevalence difference within the populations. Since there is no information on phase of two markers in a trio, we have not introduced the recombination frequency (h) in the simulation.

Tradeoff
When applying the dTDT to the alcohol dependence data from COGA, nearly twice the number of SNPs (166 vs. 93) were found having p-values ,10 24 . Further, to maximally make use of the available sample data, we combine case-control study and dTDT with the MH test. This potentially increases our sample size and makes the method more robust to uncontrolled factors. As a result, we have one signal in gene STK40 with p-value that attains a genome-wide significance level. A large number of SNPs are found having p-values ,10 24 . Several clusters of SNPs by distance with p-values ,10 25 are found in various genes across the genome.
Number of Cases (#cs) = 847; Number of Controls (#cn) = 552; f cs : allele frequency in cases; f cn : allele frequency in controls; T is short for Transmitted; NT is short for Non-Transmitted. b 2 = Sb i and c 2 = Sc i . Using either a 1 or c 1 in Case-Control study will give the same results. *equivalent to (c 1 zb 2 )or(c 1 zc 2 ). doi:10.1371/journal.pone.0063526.t007 The Quantile-Quantile (Q-Q) plots are used in GWAS to assess the inflation of FPRs by comparing the distribution of observed pvalues against the theoretical model distribution of expected pvalues [85]. In theory, without type I error arising from population stratification or other artifacts, the Q-Q plot shall align with the diagonal line. This is true if we use completely randomized data. By comparing the distortion of the Q-Q plot of the test results from this diagonal line, we can tell whether there are false positives or other errors due to genotyping or imputation. Before we draw any conclusion, we provide the Q-Q plots for results from all three Case-Control, dTDT and MH tests on the COGA data ( Figure  7). From the figure we can see that most of the observed p-values from Case-Control and dTDT are along the diagonal line. We do not observe significant distortion, i.e., type I error, in both tests. On the other hand, the Q-Q plot of MH test lies above the diagonal line. As stated earlier, both Case-Control and the dTDT are not robust to the population stratification because of the dependence of allele frequencies in the populations. MH test is basically a combined statistic of these two tests. Though we have increased the sample size in the combined test, we have reason to believe that such sensitivity to population stratification has been inflated in the MH test. This is a tradeoff that we need to pay attention to in the future studies. However, this issue can be partially addressed by restricting accurate genotypes based on the imputation quality score (IQS) [86] (discussed below in Dealing with errors).
Having the observed -log 10 (p-value) along the diagonal line in the Q-Q plots doesn't mean these tests agree with each other. To investigate the concordance among these three tests, we rank the -log 10 (p-value) from the dTDT (or MH test) and pick the top 100

Dealing with Errors
After correcting for multiple tests (p-value = 0.05/19 = 0.0026), SNP rs4903712 on chromosome 14 remained significant. This was the seventh most significant SNP from the MH test. As discussed above, there are several issues affecting the dependability of our test results. As we saw from the Q-Q plots, signals from MH test have been inflated because of double counting of the population stratification factor. On the other hand, the genotyping and imputation accuracy may be taken into account as well. To address these issues, we compute the IQS for the listed top SNP markers on Table 8 with and without setting a 0.8 threshold on the dosage probabilities. We report the number of individuals who meet such 0.8 threshold (in the last column, the numbers in the brackets next to IQS with dosage probabilities .0.8). According to the IQS, when we exclude the dosage probabilities that are below 0.8, the inference program performs very well and provides above ,0.90 IQS on average. The reason is that for dosage probabilities that are lower than 0.8, there is too much uncertainty for the program to impute, which not only heavily distorts the results (poor specificity) but also makes it difficult to filter out true  positives (low sensitivity). However, there is always a tradeoff when we enhance the accuracy. If we set a 0.8 threshold on dosage probabilities, the sample size dramatically reduces from 1,586 to 326 (intersection set across all markers). We further apply dTDT and MH test onto these 326 individuals with either the inferred or genotyped data but do not find significant markers at a 10 25 level due to a small sample size (data not shown).
As above, the experimental validation shows that the accuracy of the inference program can heavily impact the results of the dTDT and MH tests. The disagreement of results from these two tests on the real data could be attributed to several sources. First, the sample size of informative data is small. In total, we have 3,567 individuals from 453 EA families included for inference calculation. Within all these individuals, only 471 individuals were selected for SNP validation genotyping. This requires genotypes of more than 85% of individuals to be estimated. In addition, compared to the total number of SNPs, the number of microsatellite markers is also trivial (722 vs. 1 million).
According to Table 8, though the sample size may be reduced, we still recommend limiting dosage probabilities before genotyping. In our experience, a threshold at 0.7 , 0.8 level can be a good cut-off. A threshold lower than this level may contribute too much noise and a threshold higher than this may reduce the sample size significantly. Meanwhile, we suggest interpreting the dTDT signals only after genotyping validation in order to lower the risk of false positives.

Conclusions
Since the discovery of Mendel's law, genetic research has been challenged to identify genetic variants that contribute to human diseases. Along with the development of genome sequencing technologies, there have been impressive progresses within the research community over the past decade. Numerous methodologies have been developed and many disease-associated genes have been reported [87]. In this study, the work presented here embraces the recent development and addresses some of the research challenges in the field of genetic research. However, as we have seen, despite the promises of the solution we provide, it also prompts a great need to further investigate many of the issues we have presented.
As discussed, traditional case-control studies on GWA often include only unrelated individuals. By including family information in the study, we can expect an increase in power for linkage and association detection. On the other hand, because the traditional TDT requires complete genotypic information from a trio by measuring over-transmission of an allele from heterozygous parents to the affected offspring, it may be less useful in trio data like COGA where there are relatively few complete trios. To overcome these limitations, in this project, we extend the original TDT to the dTDT to accommodate dosage probabilities of a trio. The trio-dosage sets can be inferred through programs like MERLIN. Compared to a recent work from Edenberg et al, the dTDT shows increased power to detect association.
Genotype inference allows us to evaluate the evidence for association at the genetic markers that are not directly-genotyped. It helps improve the power of individual scans and is of particular usefulness for combining information from different studies such as linkage and GWAS. However, the accuracy of genotype inference may be impaired for the following reasons. First, because datasets where subjects are genotyped on different platforms may have different genotyping error rates, when we combine these datasets, inference can be problematic. Second, genotype inference for large datasets based on a small amount of shared information may encounter too much uncertainty in the procedure. For a similar reason, SNPs with low MAF may also have a higher chance of being inferred inaccurately.
On the other hand, a major advantage of the TDT is that it is not susceptible to population stratification. The dTDT is, however, sensitive in that it depends on the marker allele frequencies in the population. Because of this reason, when we combine results from both Case-Control and dTDT, the MH test potentially inflates errors due to population stratification. This can be noticed in the Q-Q plot as we present in Figure 7.
In summary, as we inspect the reasons for having low concordance among the tests as well as poor replication from our test results to the experiment validation, we have the following conclusions: I.
Because the linkage and GWAS data were genotyped on different platforms, they may carry different genotyping errors, which make it difficult to obtain genotype inference accurately; inference across these platforms can generate spurious associations; II. Due to a great sparsity in the combined dataset, a large number of the markers have to be inferred without sufficient support from the common markers, which introduces too much uncertainty in the inference; III. Because of possible population stratification, combining both the Case-Control and the dTDT to enhance the sample size may introduce false positive signals.
As one solution, when we filter out poorly-inferred SNP markers using IQS, we are able to removes thousands of false positives that can be particularly useful for SNPs with low MAF and when datasets are genotyped on different platforms. However, the tradeoff is we have to exclude a good number of individuals from our database in order to meet such restriction. But this can always be an option when enough samples are available.
Despite this area for methodological development, our work posits that the dTDT has considerable utility for linkage and association testing. By exploiting family data with inference and existing case-control information, the dTDT demonstrated here has opened a new window to possible routes for the integration of both population-based and family-based studies.

Future Direction
To address the sensitivity issue due to population stratification, it is necessary to validate the SNP genotyping and perform a test such as the PTDT to validate results and use a program such as UNPHASED. This approach minimizes expense when the case/ control sample is derived from an existing family study in which relatives have available DNA for typing. Moreover, we may implement additional application to other populations such as African Americans to compare findings with what we have from the European Americans. This will require extending the techniques described above.
We may explore using more of the family data instead of only using SNPs. Other sources of information could be captured, such as the copy number variants (CNVs) [88,89]. It is also suggested that non-genetic risk factors tend to raise the attention for complex traits and should also be incorporated into the genetic studies. Meanwhile, it is likely that COGA will obtain GWAS data in the relatives so that we can evaluate the efficiency of inference versus having GWAS genotypes available. Power calculation and sample size estimation of the new statistics are needed for general use. Due to the uncertainty inherent to the inference procedure, we plan to develop better strategies for generating dosage probabilities.
Finally, the dTDT should not be limited to dosage probabilities from the inference programs only. As a perspective, the next-generation sequencing data will provide a challenge using the method developed in this paper. Similar methods can be used when pedigree members have a SNP chip, and a subset has sequence data. Despite the discordance and poor replication from our test results, we believe that linkage can help identify regions of interest in conjunction with association testing. Computational inference has helped us reduce the experimental cost in that we may only need to do sequencing on a limited number of family members. Keeping this in mind, we need to implement appropriate selection of the most informative families when we do genotyping. All these future directions shall be promising and have potential to inform the field.