Handling Missing Data in Transmission Disequilibrium Test in Nuclear Families with One Affected Offspring

The Transmission Disequilibrium Test (TDT) compares frequencies of transmission of two alleles from heterozygote parents to an affected offspring. This test requires all genotypes to be known from all members of the nuclear families. However, obtaining all genotypes in a study might not be possible for some families, in which case, a data set results in missing genotypes. There are many techniques of handling missing genotypes in parents but only a few in offspring. The robust TDT (rTDT) is one of the methods that handles missing genotypes for all members of nuclear families [with one affected offspring]. Even though all family members can be imputed, the rTDT is a conservative test with low power. We propose a new method, Mendelian Inheritance TDT (MITDT-ONE), that controls type I error and has high power. The MITDT-ONE uses Mendelian Inheritance properties, and takes population frequencies of the disease allele and marker allele into account in the rTDT method. One of the advantages of using the MITDT-ONE is that the MITDT-ONE can identify additional significant genes that are not found by the rTDT. We demonstrate the performances of both tests along with Sib-TDT (S-TDT) in Monte Carlo simulation studies. Moreover, we apply our method to the type 1 diabetes data from the Warren families in the United Kingdom to identify significant genes that are related to type 1 diabetes.


Introduction
The Transmission Disequilibrium Test (TDT) is the most widely used family-based test for linkage disequilibrium [1], [2]. It was first introduced to handle one affected offspring in a nuclear family, and was later extended to two or more affected offspring, and to multi-allelic markers as well. The TDT is a test for linkage in the presence of linkage disequilibrium [1], [2].
The TDT compares frequencies of the transmission of two alleles from heterozygote parents to an affected offspring. The TDT requires complete genotypes from parents and offspring. However, sometimes genotypes may not be available. If genotypes of parents are missing, including only complete cases [3], [4], [5], [6], [7], or reconstructing missing parental genotypes by assuming a missing at random (MAR) model [8] have been suggested as common approaches in practice. However, if parental genotypes are missing due to his genotype at the locus of interest, then the informatively missing model is more appropriate than the MAR model [9]. Also, including only complete families and families with only one parent missing in informatively missing parent(s) [3], [6], [7], [10] reconstructing parental genotypes from their affected offspring [2], or from affected and unaffected siblings (Reconstruction-Combined TDT) [4], [11], or completely ignoring parental genotypes and comparing frequencies of genotypes of unaffected and affected offspring (S-TDT) [12], [13], [14], [15], or combining different data sets from families with parental genotypes and from families with missing parental genotype data but whose siblings' genotypes are unaffected (C-TDT) [12] has been also proposed as alternative approaches.
The robust TDT (rTDT) was proposed to handle any missing genotypes in a nuclear family with one affected offspring and biallelic marker [16]. The rTDT does not assume any missing model, and defines an interval estimate of TDT by considering all possible completions of missing genotypes. Sebastiani et al. [16] claimed that rTDT has more power than TDT. The simulation study was not performed, and the claim of having more power than TDT was shown mathematically for a specific missing pattern for each family [16]. That is, they assumed that missing families have the same form: the genotype of one parent is missing, the other parent has a heterozygous genotype, and the affected child has homozygous genotype [see Discussion section for more details]. This specific missing pattern for each family is not a reasonable assumption in practice. Alpargu (Bourget) [17] defined the rTDT for two affected offspring, and showed in simulation studies that rTDT was too conservative, and had low power. Because of its poor performance, the Mendelian Inheritance-Transmission Disequilibrium Test (MI-TDT), which takes population frequencies of the disease allele (p) and marker allele (m) into account in rTDT, was proposed [17]. The MI-TDT performed better than rTDT by controlling type I error rates and having high power. Since, MI-TDT outperformed rTDT, in this paper we propose the Mendel Inheritance-Transmission Disequilibrium Test (MITDT-ONE) for one affected offspring. The MITDT-ONE considers p and m in rTDT. The simulation study replicating real life scenarios such as different missing models and different genetic models shows that MITDT-ONE outperforms rTDT by providing better control of type I error rates and producing higher power.

Methods
We demonstrate the features of rTDT and MITDT-ONE with an example. We assume that we have genotypes of nuclear families with one affected offspring, and bi-allelic markers with alleles 1 and 2. In a given data set, there are (1,1), (1,2), or (2,2) complete genotypes or (0,0) missing genotypes. For each family, there are three genotypes with the first two genotypes for parents and the last genotype for offspring (e.g., (1,2)(1,1) (1,2)). If at least one of the genotypes is unknown, then the data is called incomplete. Otherwise it is called complete. Hence, a whole data set has two parts for a given marker: complete and incomplete trio genotypes.
The TDT considers transmission from heterozygote parents (h) to affected offspring. Let u be the number of h that transmit allele 1 to an affected offspring, and v be the number of h that transmit allele 2 to an affected offspring. Then, the TDT statistic for complete data tests linkage (h) between a disease and a marker locus in the presence of linkage disequilibrium (dw0 or dv0) [1]. Under the null hypothesis of no linkage (h~0:5), x 2 TDT follows a central chisquare distribution with 1 degree of freedom (df).
We construct interval estimates of MITDT-ONE and rTDT as follows: (1) compute maximum and minimum increments in u and v by considering all possible admissible completions of missing genotypes (u inc (v inc ) for maximum increments of u (v)), (2) find population frequencies of disease allele (p), and marker allele (m), and finally, (3) compute maximum and minimum values of u and v (u max and u min for u and v max and v min for v). While all three steps are involved in MITDT-ONE, rTDT does not require step (2). This is the only important difference between two methods. However, MITDT-ONE requires the value of p, which is difficult to know in some diseases. We can overcome the knowledge of p by assuming p&m because McGinnis (1998) [18] showed that TDT is able to detect linkage, and its power exceeds 0.5 only when d is close to its most positive value d max (see the definition of d max in the following section) when dw0, and allele frequencies m and p are similar in magnitude at marker and disease locus.
For complete families, let us assume that we have 50 heterozygote parents (h c ) in which 35 of them transmit allele 1 (u c ), and 15 of them transmit allele 2 (v c ). Using (1), we compute x 2 TDT~( 35{15) 2 =50~8. The chi-square distribution with 1 df at 5% nominal level is 3.84. Based on only complete cases, we reject the null hypothesis of no linkage at 5% nominal level. Now, assume m~0:25 and p~0:05 with two missing families as in Table 1.
Under the null hypothesis (h~0:5), heterozygote parent transmits allele 1 but not allele 2 to an affected offspring with probability h 1~( mzd=p)(1{m){(hd)=p, and the same parent transmits allele 2 but not allele 1 to an affected offspring with probability h 2~( 1{m{d=p)mz(hd)=p, where d is the coefficient of disequilibrium, m is the frequency of the marker allele 1, and p is the population relative frequency of disease allele [19]. The x 2 statistic compares the number of transmissions with probabilities h 1 and h 2 . It can be shown that these probabilities are the same under the null hypothesis. Thus, the expected number of transmissions are the same. Thus, E(u)~E(v). However, the probabilities are different when there is linkage, and hence the number of transmissions are different. This means that the x 2 statistic is related to the parameters d,m,p, and h.
All these families have equal probabilities of being considered under the null hypothesis of no linkage. However, MITDT-ONE and rTDT consider increments in u (u inc ) and v (v inc ). The exact maximum and minimum values of TDT in (1) are attained by rTDT. The interval estimate of rTDT is ½5:45,9:98. While the minimum value is attained when u~35 and v~18 (scenarios 7 and 9), the maximum value is attained when u~38 and v~15 (scenarios 5 and 8). The interval estimate of MITDT-ONE is ½7:61,8:27 with the same completion of the families as rTDT.
Both tests use the same admissible cases and consider lower limits to identify significant genes. Both methods reject the null hypothesis of no linkage at 5% nominal level in the above example. The interval estimate of MITDT-ONE is always contained in the interval estimate of rTDT (see in Construction of the MITDT-ONE and rTDT for more details). It is important to note that MITDT-ONE and rTDT have the same minimum values for u and v but differ at maximum values of u and v. Therefore, MITDT-ONE will never have less power than rTDT. Since the MITDT-ONE has more power and controls type I error rates better, we suggest using the MITDT-ONE test instead of rTDT test.

Construction of the MITDT-ONE and rTDT
There are 17 admissible missing cases in a nuclear family with one affected offspring (Table 3). Sebastiani et al. [16] proposed an interval estimate of rTDT for one affected offspring. They proceeded in the following way: x 2 TDT in (1) is a monotone convex function on a closed domain. Thus, it achieves its maximum and where u c (v c ) is the number of h that transmit allele 1 (2) to affected offspring in complete data set. And finally, the interval estimate ½x 2 min ,x 2 max of rTDT was defined as 1. If u min §v max , then x 2 min~x 2 (u min ,v max )ƒx 2 TDT ƒx 2 (u max ,v min )~x 2 max : 2. If u max ƒv min , then x 2 min~x 2 (u max ,v min )ƒx 2 TDT ƒx 2 (u min ,v max )~x 2 max : 3. In all other cases: x 2 min~0 ƒx 2 TDT ƒmaxfx 2 (u max v min ),x 2 (u min ,v max )g~x 2 max : The value of x 2 min (x 2 max ) makes a decision against (conforming) the null hypothesis. If x 2 TDT for complete data (i.e., missing data are ignored) and x 2 min reach the conclusion of the alternative hypothesis (i.e., significant genes), and x 2 min ƒx 2 TDT , then rTDT affirms significant genes of complete data. Similarly, the value of x 2 max ratifies the insignificant genes if x 2 TDT and x 2 max cannot reject Total number of admissible incomplete trios 17 The symbols z, {, and ? denote possible incomplete, impossible incomplete, and complete cases, respectively. doi:10.1371/journal.pone.0046100.t003 the null hypothesis, and x 2 TDT ƒx 2 max . In all other scenarios, rTDT cannot verify any conclusions of complete data.
Sebastiani et al. [16] did not run any simulation study to demonstrate the performance of rTDT. They theoretically showed that if all missing families are in case 9, which is not a reasonable assumption in practice, then rTDT has higher power than the classical x 2 TDT . Since the power of TDT depends on linkage disequilibrium (d), and relative frequencies of marker allele (m) and disease allele (p) [20], we ran simulation studies to take into account different realistic disease models and missing models, involving m and p. The simulation results show that rTDT overestimates the values of u max (v max ) (results are not shown), and hence becomes a conservative test with low power. Since u min (v min ) does not involve u inc (v inc ), we decided to scale down u inc (v inc ) to have a smaller value of u max (v max ) for MITDT-ONE. One way to achieve this goal is to involve m and p in scaling. These parameters appear together in maximum linkage disequilibrium d max~m inf(1{m)p,(1{p)mg when linkage disequilibrium is positive dw0, and d max~m infmp,(1{m)(1{p)g when linkage disequilibrium is negative (dv0) [18]. We scale u inc (v inc ) with (1{m)p and (1{p)m when dw0, and define u Ã max (v Ã max ) for MITDT-ONE as the average of these values. That is, Similarly, we can define v Ã inc by replacing in (4) u inc with v inc . Since TDT provides better power when linkage disequilibrium is at its maximum (d max ) for dw0, and m&p [18], we can reformulate (4) for real sample data as The lowest values of the interval estimates of rTDT and MITDT-ONE find significant genes when they are actually not. The way the interval estimate for MITDT-ONE constructed guarantees that its lowest interval estimate (x 2 2 ) is always larger than the lowest interval estimate of rTDT (x 2 1 ). This fact can be shown theoretically in the following way: let us assume u min §v max (the other two conditions in (31) can be shown similarly). Since v Ã max vv max , we have We claimed that rTDT is a conservative test. We have observed this through simulation study but not theoretically. The reason rTDT becomes conservative is that the value of x 2 1 , in general, falls below the value of chi-square distribution with 1 df at a nominal level (for example, when a~0:05, this value is 3.84).

Simulation
We replicated the simulation study in [17] for one affected offspring. Let us assume a bi-allelic marker with alleles 1 and 2 which is linked to a bi-allelic disease locus with diseasepredisposing allele D and non-predisposing allele d. The penetrance for DD,Dd and dd genotypes are a,b and c, respectively, with 0ƒa,c,bƒ1, and the population frequencies for the marker with disease locus haplotype for 1D, 1d, 2D and 2d are c 1 ,c 2 ,c 3 and c 4 , respectively, where c 1 zc 2 zc 3 zc 4~1 . The population relative frequency of disease allele D is p (~c 1 zc 3 ). The frequencies of the marker alleles 1 and 2 are m(~c 1 zc 2 ) and 1{m(~c 3 zc 4 ), respectively. The recombination fraction between the disease and marker locus is h, and the coefficient of disequilibrium is d(~c 1 c 4 {c 2 c 3 ). The probability of a heterozygote parent transmitting marker allele 1 to a particular affected child [18] is defined as where Our simulation study demonstrates realistic complex disease models. We generated 5,000 data sets for four different missing models and three genetics models (additive, dominant and recessive). In each simulation, we generated 100 families and each family consisted of one affected and one unaffected offspring, and 50 heterozygote fathers and 50 heterozygote mothers. In disease models, the probabilities of an affected child given the homozygosity (DD), heterozygosity (Dd), and absence of the disease alleles (dd) are defined as a,b, and c, respectively. The values of these parameters were as a~0:8,c~0:025 for dominant (a~b), additive (b~(azc)=2), and recessive models (b~c). In missing models, we consider (1) Missing Completely at Random (MCAR) for all genotypes, (2) informative missing for parental genotypes and MCAR for offspring genotypes, (3) informative missing for all genotypes, and (4) MCAR for parental genotypes and informative missing for offspring genotypes. A model is called ''informatively missing'' if at least two of the P 11 ,P 12 ,P 22 are not equal, where P i11 ,P i12 ,P i22 ,(i~father (f ), mother (m),offspring (o)) are missing rates for f, m, and o with (1,1), (1,2) and (2,2) genotypes, respectively. In Table 7, the first column k=l denotes the missing patterns (k~1,2,3,4) and missing rates (l~1, . . . 6).
The performances of the methods were demonstrated by validity and power analysis. The S-TDT, which ignores genotypes of the parents and compares frequencies of the affected and unaffected offspring [see 14 for the computation of S-TDT], was included to compare our methods with one of the widely used family based methods. Since S-TDT completely ignores parental genotypes and requires unaffected offspring genotypes from these families, and also assumes affected offspring genotypes are available, none of the missing mechanism models were taken into account. It means that the type I error rates for S-TDT are all the same whatever the missing mechanism models are for a given d value.
In validity and power analysis tables, the TDT ignores missing cases and considers only complete cases, S-TDT ignores parental genotypes and considers only genotypes of affected and unaffected offspring of all 100 families (genotypes are all known), and MITDT and rTDT use all 100 families after construction of all possible admissible genotypes.
The most positive value of linkage disequilibrium is defined as d max~m inf(1{m)p,(1{p)mg when dw0, and the most negative value of linkage disequilibrium is defined as d max~m in fmp,(1{p)(1{m)g when dv0. Since type I error rate and power In (i,j) , i and j represent the increments in u and v, respectively. The plus (minus) sign indicates that the increment is plausible (not plausible). The last two columns show the maximum and minimum increments in each cases. doi:10.1371/journal.pone.0046100.t006

Validity Analysis
When h~0:5, the probability that an informative parent transmits marker allele 1 to a particular affected child (P t ) becomes 0.5 because L t is zero in (8). That is, the value of d in M t and the disease model in R t are not involved in validity analysis. It means that type I error rates are the same for every disease model.
All testing procedures (TDT, MITDT-ONE, rTDT) except S-TDT were valid tests at 1% and 5% significance levels (Tables 8  and 9). Since TDT, MITDT-ONE and rTDT takes also information about genotypes of parents into account as opposed to S-TDT, this information had a positive impact on the sizes of the tests. Since S-TDT had inflated type I errors, we excluded its performance in power analysis. Overall, MITDT-ONE outperformed rTDT by providing type I error rates close to the corresponding significance levels. The rTDT was the conservative test. Actually, this was the main reason for us to propose a new test that controls type I error rates better. The results in Tables 8 and 9 show that the MITDT-ONE achieved this goal. Since MITDT-ONE (and rTDT) does not assume any specific missing models, we suggest that MITDT-ONE should be preferred over some widely used family based testing procedures.

Power Analysis
In power analysis, the null hypothesis is that there is a complete linkage (h~0). When h~0, the probability of an informative parent transmitting marker allele 1 to a particular affected child (P t ) becomes greater than or equal to 0.5 because L t ,M t , and R t contribute to the value of P t . It means information from linkage disequilibrium and a, b and c (parameters of disease model) have positive effect on power. This theoretical fact was also observed through simulation studies in Tables 10, 11, 12, 13, 14, 15. The pattern of power for all disease models, missing rates, missing models, and strength of linkage disequilibrium were the same for different significance levels (1% and 5%). However, the power values were better at 5% significance level than at 1% significance level. (P f11 ,P f12 ,P f22 ) ( P m11 ,P m12 ,P m22 ) ( P o11 ,P o12 ,P o22 )    and Inflammation Laboratory (JDRF/WT DIL) compiled data from 475 families with two affected offspring from the U.K. Warren Families for 52 SNPs. This data set was analyzed by [17] to demonstrate the method of MI-TDT for two affected offspring. The author of [21] used extensive logistic regression studies on the same data set, and identified 223 HphI, +1,140A/C, +1428 FokI, and VNTR as significant SNPs. The same SNPs as in [21] and six more were also identified by [17]. We considered the same U.K. Warren Families but chose the first affected child from each family to have only one affected offspring to demonstrate the performance of MITDT-ONE and rTDT. For the MITDT-ONE, we need to know frequencies of marker allele 1 (m) and disease allele (p) for each SNP. The values of m were provided to us along with the data set, except two (VNTR (DIL967) and TH micro' Z (DIL950)), but not the values of p. McGinnis (1998) [18] showed that TDT was able to detect linkage and its power exceeded 0.5 only when d was close to d max and allele frequencies m and p were similar in magnitude at the marker and disease locus. Therefore, we chose optimal values for p by assuming m~p.
The percentage of missing genotypes ranged from low (4% for DIL977) to high (52% for DIL997). Table 16 reports 18 significant   SNPs out of 52 at 5% significance level for complete genotypes.
Since we tested 52 SNPs, we applied Bonferroni multiple testing procedure at 0.05% significance level or 99.95% confidence level, and identified seven significant SNPs (underlined p-values). Since percentage of missing genotypes ranged from small to high, one should be cautious to declare significant SNPs when missing genotypes are ignored. Since DIL950 was insignificant for complete data, we dropped it from the computation of MITDT-ONE and rTDT. DIL967 was significant for complete data but its marker allele were not provided to us. Since we did not have any knowledge about the value of m, and did not want to assign any preferential value, we considered equal frequencies for m~0:5 and 1{m~0:5.
The MITDT-ONE and rTDT could verify if the significant SNPs for complete data are also significant when missing genotypes are taken into account. However, if either method could not reach significant result as in complete case, it does not mean that these SNPs are insignificant. It simply means that both methods reach an inconclusive decision. Moreover, the number of significant SNPs could be smaller when either test is employed, compared to the number of significant SNPs for complete data.  Out of 18 significant SNPs in complete cases, MITDT-ONE (rTDT) verified seven (three) to be significant (Table 17). The MITDT-ONE as well as rTDT found 23 HphI, +1428 FokI, and VNTR as significant SNPs as in [21] and [17]. Furthermore, MITDT-ONE identified four more same SNPs in [17] as significant; hence, we suggest researchers to investigate these SNPs as possible casual variant genes.

Discussion
Sebastiani et al. [16] proposed to handle missing genotypes of parents or offspring in a nuclear family with one affected offspring.
However, rTDT produces a conservative test and lacks power. Hence, we proposed MITDT-ONE to correct the problems of rTDT. The MITDT-ONE takes population frequencies of marker allele m and disease allele p into account in the rTDT method. With these m and p values, we restrict the domain of rTDT to have much better estimates for the maximum values of u and v.
The minimum values of the interval estimates of MITDT-ONE and rTDT make a decision against the null hypothesis of no linkage. One of the advantages of using MITDT-ONE is that significance results achieved by complete data is ratified when the minimum value of the interval estimate is smaller than the value of TDT for complete data. The other advantage of our method is that it allows researchers to implement our method to any missing rates. As discussed in the introduction, many studies deal with missing genotypes in parents but not in offspring. Moreover, these methods assume some missing mechanism (e.g., MAR) to recover parental genotypes. Thus, another strength of MITDT-ONE is that it does not assume any missing model but simply considers the Mendelian Inheritance property to define all possible admissible genotypes in parents or offspring. Also, MITDT-ONE and rTDT become classical TDT when u inc~vinc~0 .
In the construction of MITDT-ONE, we consider cases where all genotypes of family members are missing (Case 1). It is intuitive that since these families do not have any information they should be ignored from the study. We suggest that these families be omitted from the data if only one SNP is studied. However, if more than one SNP are studied then we suggest keeping them in the computation of MITDT-ONE to have same number of families for each SNP.  In summary, simulation studies show that MITDT-ONE controls type I error rates very well and produces high power when degree of linkage disequilibrium is mild.
More than one offspring: rTDT for two affected offspring was proposed by [17]. However, it was a conservative test and had low power. Hence, Alpargu [17] proposed MI-TDT to remedy the problems. With the motivation of Alpargu [17], we proposed MITDT-ONE. Both MITDT-ONE and MI-TDT correct the problems arising from rTDT. Theoretically, it is possible to propose our method for families with at least three and more affected offspring. However, the computation will be tedious because the number of missing cases increases as the number of affected offspring increases. Moreover, in the linkage studies it is very rare to have more than two affected offspring.
Multiple alleles: We proposed MITDT-ONE for bi-allelic cases. However, it is possible to extend to multi-allelic cases. We consider two approaches that have been used in practice [22,23]. In the first approach, all alleles except the allele of interest are grouped as allele 2, and the MITDT-ONE for bi-allelic case is applied [22]. In the second approach, if we have q alleles, then for each allele, the first approach is applied to obtain q MITDT-ONE statistics, then the largest MITDT-ONE is chosen as the test statistic [23] to make a decision about significant gene. The third and fourth columns show the name of the SNP defined in Barratt et al. (2004) and the SNP database, respectively. The fifth and sixth columns show the statistics for complete and incomplete data when MITDT-ONE is applied, respectively. The seventh and eight columns show the type I errors of the columns fifth and sixth, respectively. *are SNPs found in association by using rTDT. doi:10.1371/journal.pone.0046100.t017