Adaptive Combination of P-Values for Family-Based Association Testing with Sequence Data

Family-based study design will play a key role in identifying rare causal variants, because rare causal variants can be enriched in families with multiple affected subjects. Furthermore, unlike population-based studies, family studies are robust to bias induced by population substructure. It is well known that rare causal variants are difficult to detect with single-locus tests. Therefore, burden tests and non-burden tests have been developed that combine the signals of multiple variants in a chromosomal region or a functional unit. This inevitably incorporates some neutral variants into the test statistics, which can dilute the power of statistical methods. To guard against the noise caused by neutral variants, we here propose an 'adaptive combination of P-values method' (abbreviated as 'ADA'). This method combines the per-site P-values of variants that are more likely to be causal. Variants with large P-values (which are more likely to be neutral) are discarded from the combined statistic. In addition to performing extensive simulation studies, we applied these tests to the Genetic Analysis Workshop 17 data sets, where real sequence data were generated according to the 1000 Genomes Project. Compared with some existing methods, ADA is more robust to the inclusion of neutral variants. This is a particular merit when dichotomous traits are analyzed. However, ADA has some limitations. First, it is more computationally intensive. Second, pedigree structures and founders' sequence data are required for the permutation procedure. Third, unrelated controls cannot be included. We here show that, for family-based studies, the application of ADA is limited to dichotomous trait analyses with full pedigree information.


Introduction
Studies in genetic epidemiology are important for uncovering the genetic architecture of complex human diseases. The development of next-generation sequencing technologies has allowed for the mapping of all genetic variants across the human genome. With this, we can search for rare causal variants (minor allele frequency (MAF) < 1%), which are mostly not genotyped in genome-wide association studies (GWAS) but are related to the etiology of complex diseases. To date, many statistical methods have been proposed for rare variant association testing. Most of them were designed for population-based studies where unrelated cases and controls were recruited and analyzed.
Despite the variety of statistical methods, there are two concerns in population-based rare variant association studies. First, population stratification may cause false-positive results. This issue has been tackled since the era of GWAS. In GWAS, where most genotyped variants were common, principal component analysis (PCA) [23] and mixed models [24] were proposed as effective methods to deal with population stratification. However, studies of population stratification are still limited for next-generation sequencing data [25]. Existing methods, such as PCA and mixed models, can fail to correct for rare variant stratification [26]. Second, rare causal variants are difficult to observe in general populations, and therefore statistical methods are usually underpowered [27]. Although burden tests [2-5,27] and non-burden tests [7-9] have been proposed to aggregate signals of multiple variants, searching for rare causal variants remains challenging.
Family-based study design will play a key role in identifying rare causal variants, because rare causal variants can be enriched in families with multiple affected subjects [28,29]. Burden tests (such as the weighted sum approach (WS) [3] and the cumulative minor allele test (CMAT) [22]) and non-burden tests (such as the sequence kernel association test (SKAT) [7,8]) have been extended to family-based designs by incorporating within-family correlation structures into the statistics [30-40]. For continuous traits, Chen et al. [31] proposed "famSKAT", which accounts for members' relationships within families. This method is essentially equivalent to the method proposed by Schifano et al. [38] and to the adjusted sequence kernel association test (abbreviated as "ASKAT") proposed by Oualkacha et al. [40], although Schifano et al.'s method was not originally designed for rare variant association testing. Svishcheva et al. [39] then developed a fast family-based SKAT (abbreviated as "FFBSKAT"), shown to be the fastest implementation of the kernel-based association tests for continuous traits. Svishcheva et al. showed that the P-values calculated by the famSKAT, ASKAT, and FFBSKAT software programs coincide exactly [39].
Testing for the effects of rare variants individually is known to be underpowered. To strengthen association signals, both the burden tests and the non-burden tests combine information from multiple rare variants in a gene/region. This inevitably incorporates many neutral variants into the test statistics. The adaptive combination of P-values method (abbreviated as 'ADA') has been shown to outperform the burden tests (e.g., WS and the variable threshold approach [5]) and the non-burden tests (e.g., SKAT) in rare variant association testing for unrelated subjects, because ADA is more robust to the inclusion of neutral variants [13,21]. Exploiting this advantage, we here extend the ADA method to handle pedigree data, and compare its power with that of the burden test [32], the kernel statistic [32], and the FFBSKAT method [39] (famSKAT [31] and ASKAT [40] are essentially equivalent to FFBSKAT). We also apply the method to the Genetic Analysis Workshop 17 (GAW 17) data [27,41]. Some family-based association testing methods were designed for trio data (or trios plus unrelated controls), such as the rare-variant extensions of the transmission disequilibrium test (rvTDT) [36,42]; these methods are therefore not compared here.

Materials and Methods
Let Y_i be the trait value of the ith subject (i = 1, ..., n). Suppose there are L loci in the chromosomal region of interest, and let g_il be the genotype score at the lth locus of the ith subject (i = 1, ..., n; l = 1, ..., L). Under an additive genetic model, g_il is the number of minor alleles, i.e., 0, 1, or 2. The statistic to test for the association between the trait and the lth marker is

T_l = [(Y − Ŷ)' g_l]^2 / [2 · MAF_l · (1 − MAF_l) · (Y − Ŷ)' V (Y − Ŷ)],   (1)

where (Y − Ŷ) is the vector of residuals after adjusting for covariates (e.g., age, gender), g_l is the genotype score vector at the lth marker for the n subjects, MAF_l is the MAF of the lth marker calculated using founders, and V is an n × n matrix of genetic correlations of these n subjects. For autosomes, the (i, j)th element of V is 2φ_ij, where φ_ij is the kinship coefficient of the ith and the jth subjects. Because the kinship coefficient of subjects belonging to different pedigrees is 0, V is a block-diagonal matrix with block sizes equal to the pedigree sizes.
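As a concrete illustration, the ingredients of the per-marker statistic above can be assembled in a few lines. The following Python sketch is ours (the function name `marker_test` and the toy inputs are not from the paper): it takes the covariate-adjusted residuals, one marker's genotype scores, the founder-based MAF, and the 2×kinship matrix V, and returns the chi-square(1) statistic with its P-value.

```python
import math

def marker_test(residuals, g, maf, V):
    """Sketch of the per-marker kinship-adjusted score test.

    residuals : trait residuals Y - Yhat after covariate adjustment
    g         : per-subject minor-allele counts (0/1/2) at this marker
    maf       : founder-based minor allele frequency at this marker
    V         : n x n matrix of 2 * kinship coefficients
    """
    n = len(residuals)
    # numerator: squared inner product of residuals and genotype scores
    rg = sum(residuals[i] * g[i] for i in range(n))
    # denominator: genotype variance scaled by the relatedness quadratic form
    rVr = sum(residuals[i] * V[i][j] * residuals[j]
              for i in range(n) for j in range(n))
    stat = rg ** 2 / (2.0 * maf * (1.0 - maf) * rVr)
    # chi-square(1) survival function via the complementary error function
    pvalue = math.erfc(math.sqrt(stat / 2.0))
    return stat, pvalue
```

For unrelated subjects, V reduces to the identity matrix, and the statistic reduces to a standard score test.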
The test statistic in Equation (1) has an approximate chi-square distribution with 1 degree of freedom. This statistic is essentially equivalent to the one proposed by Thornton and McPeek [43] (see Equation 1 in [43]). Phenotypes and genotypes are treated as fixed and random, respectively. This retrospective view allows us to correct the ascertainment bias that arises when pedigrees are recruited through affected subjects. After performing L tests for the L markers, we have P-values p_1, p_2, ..., p_L. Suppose we consider J candidate truncation thresholds, θ_1, θ_2, ..., θ_J. Summarizing the L markers, the significance score under the jth truncation threshold is

S_j = Σ_{l=1}^{L} (−log p_l) · I[p_l < θ_j],   (2)

where I[p_l < θ_j] is an indicator variable coded as 1 if the lth marker has a P-value smaller than θ_j (the jth truncation threshold) and 0 otherwise. Throughout this study, the candidate truncation thresholds are specified as θ_1 = 0.10, θ_2 = 0.11, ..., θ_11 = 0.20, as suggested by the ADA method for population-based studies [13,21]. Using a wider range of P-value truncation thresholds, say, θ_1 = 0.05, θ_2 = 0.06, ..., θ_21 = 0.25, does not contribute a noticeable power gain to ADA (results not shown).
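To make the truncation step concrete, here is a minimal Python sketch of the significance score (the function name and the example P-values are ours, not the paper's): for each candidate threshold θ_j, it sums −log(p_l) over the markers whose per-site P-value falls below θ_j, so large-P (likely neutral) markers contribute nothing.

```python
import math

# Candidate truncation thresholds 0.10, 0.11, ..., 0.20, as in the text.
THRESHOLDS = [round(0.10 + 0.01 * j, 2) for j in range(11)]

def significance_scores(pvalues, thresholds=THRESHOLDS):
    """For each threshold, combine -log(p) over the markers that clear it;
    markers with P-values at or above the threshold are discarded."""
    return [sum(-math.log(p) for p in pvalues if p < th) for th in thresholds]
```

Raising the threshold can only add terms, so S_j is non-decreasing in θ_j; the adaptive step below decides which threshold gives the strongest evidence.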
Because multiple P-value truncation thresholds are considered, statistical significance must be obtained with permutations to correct for multiple testing. We first construct the distribution of the significance score S_j under the null hypothesis, where the transmission of haplotypes from parents to offspring is completely random, conditional on parental genotypes [44,45]. For example, the family shown in Fig. 1 consists of 12 members. Conditional on the genotypes of Nos. 1 and 2, the probability that No. 3 has the observed (or the unobserved) pattern of allelic transmission is 1/2 under the null hypothesis. Given the founders' (Nos. 1, 2, 5, 6) genotypes, this family has more than 2^8 = 256 possible patterns of allelic transmission. With a total of N families, there are at least 256^N different permutations of the genotype data.
The above permutation procedure was extended from that used for trio data [44,45]. Unambiguous haplotype phases are not always required in the process. For example, No. 3 has a probability of 1/2 of possessing the observed pattern of allelic transmission, in which case there is no need to change his genotypes. The probability of carrying the unobserved pattern of allelic transmission is also 1/2. In that case, No. 3's number of minor alleles at the lth locus (0, 1, or 2, representing three different genotype scores) is g^(U)_3l = g_1l + g_2l − g_3l, where g_1l and g_2l are the genotype scores at the lth locus of Nos. 1 and 2, respectively, and g_3l is No. 3's original observed genotype score. Conditional on g_3l and g_6l (belonging to Nos. 3 and 6), the unobserved pattern of allelic transmission for No. 7 is g^(U)_7l = g_3l + g_6l − g_7l. On the other hand, given g^(U)_3l and g_6l, the haplotype phases of Nos. 3 and 6 are required to determine the genotype scores of Nos. 7-9. A haplotype-phasing software package (such as Beagle [46]) is used to infer the most likely haplotype pairs for Nos. 3 and 6. The genotype scores of Nos. 7-9 are then determined by randomly drawing one haplotype from No. 3 and one from No. 6.
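The within-family genotype flip described above (replacing a child's genotype scores by the complementary transmission g_father + g_mother − g_child with probability 1/2) can be sketched as follows; the function name and inputs are hypothetical, and phase-dependent descendants such as Nos. 7-9 are outside this sketch.

```python
import random

def permute_child_genotypes(g_father, g_mother, g_child, rng=None):
    """With probability 1/2, swap a child's per-locus minor-allele counts to
    the complementary (unobserved) transmission pattern:
        g_unobserved = g_father + g_mother - g_child.
    Descendants whose scores depend on inferred haplotype phase (e.g.,
    Nos. 7-9 in Fig. 1) require a phasing step not shown here."""
    rng = rng or random.Random()
    if rng.random() < 0.5:
        return [gf + gm - gc for gf, gm, gc in zip(g_father, g_mother, g_child)]
    return list(g_child)
```

Note that the flip is applied jointly across all loci of a child, preserving linkage between markers within the permuted transmission pattern.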
Suppose we perform B permutations, say, B = 1,000. For the b'th permutation, the significance score under the jth truncation threshold, S_j^(b'), can be calculated in the same way as the observed score, for j = 1, ..., J and b' = 1, ..., B. We can then find the minimum P-value across the J candidate truncation thresholds for the observed sample and for the b'th permuted sample, denoted as MinP and MinP^(b'), respectively. This method is referred to as 'ADA' because the optimal P-value truncation threshold is chosen adaptively according to the data.
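The adaptive step can be sketched as follows (the function names and the small toy numbers are ours): each observed score S_j is converted to an empirical P-value against its permutation null, MinP is the smallest of these across thresholds, and the final P-value is the fraction of permutations whose own MinP is at least as small. Reusing the same permutation null for the permuted MinPs, as done here, is a common simplification.

```python
def empirical_minp(scores, null_scores):
    """Minimum over thresholds j of the empirical P-value of scores[j]
    against the permutation null null_scores[b][j] (with +1 correction)."""
    B = len(null_scores)
    return min(
        (1 + sum(1 for b in range(B) if null_scores[b][j] >= scores[j])) / (1 + B)
        for j in range(len(scores))
    )

def ada_pvalue(obs_scores, perm_scores):
    """Overall ADA P-value: how often a permuted sample attains a MinP at
    least as small as the observed MinP."""
    minp_obs = empirical_minp(obs_scores, perm_scores)
    minp_perm = [empirical_minp(perm_scores[b], perm_scores)
                 for b in range(len(perm_scores))]
    hits = sum(1 for m in minp_perm if m <= minp_obs)
    return (1 + hits) / (1 + len(perm_scores))
```

Because the final P-value is itself calibrated by permutation, no extra multiple-testing correction across the J thresholds is needed.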

Simulation Study
We generated sequence data with the SeqSIMLA software [47], which was designed to simulate sequence data for family samples. SeqSIMLA uses GENOME [48] as the default sequence generator, which can efficiently simulate sequence data according to the standard coalescent model [49-52]. In this way, we aim to evaluate statistical methods with simulated data that reflect realistic DNA sequences. In each simulation, 50, 80, or 100 three-generation families, each with 12 members, were generated. The family structure is shown in Fig. 1. For each subject, a chromosome region with m = 50, 100, or 150 SNPs/SNVs (single-nucleotide polymorphisms/single-nucleotide variants) was simulated. Dichotomous traits and continuous traits were considered in turn. When evaluating type-I error rates, no causal locus was specified. When evaluating power, five SNPs/SNVs (Nos. 10, 20, 30, 40, 50) were assumed to be causal. We did not restrict all causal variants to be rare/uncommon, because in reality causal variants can also be common. Let the 'signal proportion' be the fraction of causal variants out of all variants. The signal proportion in our simulations was set at one of three levels: 0.033 (= 5/150), 0.05 (= 5/100), or 0.1 (= 5/50). When dichotomous traits were considered, the overall population attributable risk (PAR) for all causal loci was assumed to be 0.05, 0.10, 0.15, 0.20, 0.25, or 0.30, so the marginal PAR for each causal locus was 0.01, 0.02, 0.03, 0.04, 0.05, or 0.06, respectively. In the SeqSIMLA software [47], the genotype relative risk (GRR) of the jth causal SNP/SNV was

GRR_j = 1 + PAR_j / [MAF_j × (1 − PAR_j)],   (3)

where PAR_j and MAF_j were the PAR and the population MAF of that SNP/SNV, respectively. The resulting GRRs are shown in S1 Fig. [3,18-20,47]. For a founder, SeqSIMLA randomly sampled two haplotypes {H_1, H_2} from the population sequence pool created by GENOME [48].
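For readers who wish to reproduce the marginal GRRs, the PAR-to-GRR inversion can be written in a few lines. This is our reading of SeqSIMLA's parameterization, under which PAR = MAF·(GRR − 1) / (1 + MAF·(GRR − 1)); the function name is ours.

```python
def grr_from_par(par, maf):
    """Genotype relative risk implied by a marginal population attributable
    risk (PAR) and a population MAF, inverting
        PAR = maf * (GRR - 1) / (1 + maf * (GRR - 1)).
    Rarer causal variants need larger GRRs to reach the same PAR."""
    return 1.0 + par / (maf * (1.0 - par))
```

For example, a marginal PAR of 0.01 at a variant with MAF 0.05 implies a GRR of about 1.2, while the same PAR at MAF 0.005 implies a GRR of about 3.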
According to the SeqSIMLA software [47], the disease status of this subject was determined by the penetrance

P(affected) = f_0 × Π_{j=1}^{5} Π_{k=1}^{2} GRR_j^{I[H_k,j = a_j]},   (4)

where f_0 was the baseline penetrance specified as 0.05, H_k,j was the allele at the jth causal SNP/SNV on the haplotype H_k (k = 1, 2), and a_j was the minor allele at the jth causal SNP/SNV (j = 1, 2, ..., 5) [47]. Given parental haplotypes, a child's haplotypes were formed by randomly selecting one from the father and one from the mother, and the child's disease status was again determined by Eq. (4). Based on this equation, a subject with no causal variant had a probability of f_0 (the baseline penetrance) of being diseased, while a subject carrying more causal variants had a larger probability of being diseased. When continuous traits were simulated, five SNPs/SNVs (Nos. 10, 20, 30, 40, 50) were assumed to be quantitative trait loci (QTLs). Causal variants were again not restricted to be all rare or uncommon. Let Y_i be the trait value of the ith subject. It was determined by the model Y_i = m + Σ_{l=1}^{5} G_il + e_i, where m was the overall mean of the trait, G_il was the genotypic value at the lth QTL of the ith subject (i = 1, ..., n; l = 1, ..., 5), which followed a normal distribution (with a mean of m_l, 0, or −m_l for 2, 1, and 0 minor alleles at the lth QTL, respectively [47]), and e_i was the error term for the ith subject, also normally distributed. According to the default setting of the SeqSIMLA software [47], the genetic effects of the QTLs were all assumed to be additive, and Var(Y) and m were both specified as 100. These two values were not critical because the SeqSIMLA software [47] actually controlled the proportion of Var(Y) explained by each QTL (denoted as V_P). The value of V_P was assumed to be 0.001, 0.002, 0.003, 0.004, 0.005, or 0.006 for each QTL, respectively. The corresponding proportion of variance explained by all five QTLs was therefore 0.005, 0.01, 0.015, 0.02, 0.025, or 0.03.
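A minimal sketch of the multiplicative penetrance in Eq. (4) (the function name is ours, and the GRRs and allele counts below are toy values): the baseline penetrance f_0 is multiplied by GRR_j once for each copy of the minor allele carried at causal site j.

```python
def disease_probability(f0, grrs, causal_allele_counts):
    """Multiplicative penetrance: baseline f0 times GRR_j raised to the
    number of minor alleles carried at the jth causal SNP/SNV. A subject
    carrying no causal allele is affected with probability f0."""
    p = f0
    for grr, count in zip(grrs, causal_allele_counts):
        p *= grr ** count
    return min(p, 1.0)  # cap at 1 in case of large GRRs
```

Simulating a subject's disease status then amounts to one Bernoulli draw with this probability.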

Tests under Comparison
We compared ADA with the broad classes of the burden test (referred to as "Burden") [32], the kernel test for family data (referred to as "Kernel") [32], and the FFBSKAT method [39]. Burden and Kernel were implemented with the R package "pedgene" [32]; FFBSKAT was performed with the package "FFBSKAT" (http://mga.bionet.nsc.ru/soft/FFBSKAT/) [39]. FFBSKAT was run only for continuous traits, because it cannot analyze dichotomous traits. Following the default settings of FFBSKAT [39] and Kernel [32], the (j, j)th element of the diagonal weighting matrix W was set to Beta(MAF_j; 1, 25), where MAF_j was the MAF of the jth genetic variant. The P-values of ADA were obtained with 1,000 permutations.
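The Beta(MAF_j; 1, 25) weight is simply the Beta density evaluated at the variant's MAF; with shape parameters (1, 25) it reduces to 25·(1 − MAF)^24, which sharply up-weights rarer variants. A small sketch (the function name is ours):

```python
from math import gamma

def beta_weight(maf, a=1.0, b=25.0):
    """Beta(maf; a, b) density. With the default a=1, b=25 this equals
    25 * (1 - maf) ** 24, so rare variants receive much larger weights
    than common ones."""
    const = gamma(a + b) / (gamma(a) * gamma(b))
    return const * maf ** (a - 1.0) * (1.0 - maf) ** (b - 1.0)
```

For instance, a variant with MAF 0.01 receives a weight of roughly 19.6, versus roughly 2.0 for a variant with MAF 0.10.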

Type-I Error Rates
By setting the PAR or the proportion of variance explained by causal SNPs at exactly 0%, we evaluated type-I error rates by performing 10,000 replications. In each replication, 50 three-generation families each with 12 members (shown in Fig. 1) were generated. For each subject, a chromosome region containing 100 SNPs/SNVs was simulated. Table 1 shows that all the tests (three for dichotomous traits and four for continuous traits) are valid in the sense that their type-I error rates match the nominal significance levels.
Power

Figs. 2 and 3 present the power for dichotomous traits and continuous traits, respectively. When dichotomous traits were studied (Fig. 2), Kernel was the most powerful method when m = 50 (signal proportion = 0.1), whereas ADA performed best when m = 150 (signal proportion = 0.033). When m = 100 (signal proportion = 0.05), these two methods had comparable performance. Burden was uniformly the least powerful test among the methods we compared, regardless of the size of m (50, 100, or 150). In summary, when the signal proportion was larger, Kernel was more powerful; when the signal proportion was smaller, ADA took advantage of truncating noise variants and was therefore more powerful.
When continuous traits were studied (Fig. 3), Kernel was again the most powerful method when m = 50 (signal proportion = 0.1). As the signal proportion decreased (i.e., as m increased), Kernel suffered a more substantial power loss. Burden was again the least powerful method, regardless of the size of m (50, 100, or 150). Compared with Kernel, FFBSKAT and ADA were less vulnerable to the inclusion of neutral variants. FFBSKAT became the most powerful method when m = 100 (signal proportion = 0.05) or m = 150 (signal proportion = 0.033).

Application to Genetic Analysis Workshop 17 Simulated Data
We then applied the four tests to the Genetic Analysis Workshop 17 (GAW 17) simulated data [41]. The GAW 17 data set was designed to mimic a subset of the data that might be generated in a full exome investigation of a complex disease. To reflect realistic human genomes, real sequence data were generated based on the 1000 Genomes Project [53]. The data set consisted of 697 subjects from eight large pedigrees, in which 202 founders had genotypes randomly selected from the 1000 Genomes Project. The MAFs ranged from 0.07% to 16.5%. The founders included 66 Tuscan, 50 Luhya, 28 Japanese, 19 Han Chinese, 18 Denver Chinese, 12 CEPH (European-descent residents of Utah), and 9 Yoruban samples. Pedigrees spanned four generations, and relatives were as distant as second cousins. The causal SNPs/SNVs are listed in Table 2, where the β column gives the change in the mean quantitative trait per copy of the minor allele. Phenotype simulations were performed multiple times to generate 200 replications. With this simulated data set, we could evaluate the power performance of the four statistical methods given a more general pedigree structure and a more realistic sequence composition [41].
To analyze this data set, we first obtained the residuals ((Y − Ŷ) in Eq. (1)) by regressing the trait values on age and smoking status. Table 2 lists the results of FFBSKAT, ADA, Kernel, and Burden. The P-values of ADA were estimated based on 1,000 permutations. Based on the power to detect causal genes, these methods were roughly ranked as FFBSKAT > ADA > Kernel ≈ Burden. FFBSKAT was the most powerful method. It is based on the linear mixed effects model, in which the kinship relatedness is captured by the random effects term (b ~ N(0, σ²_b V)). In the GAW 17 data set, the eight pedigrees were all quite large (see S2 Fig.). The linear mixed effects model (y = Xβ + Gγ + b + e) partitions the total phenotypic variance into several parts. When pedigrees are larger, this model is a better choice because σ²_b can be estimated more accurately. (We used SeqSIMLA2 [54] to simulate 1,380 subjects based on two pedigree structures: (A) 20 pedigrees each with 69 members (Fig. 1 of [54]), and (B) 115 pedigrees each with 12 members (Fig. 1). The variance of the polygenic effects, σ²_b, was specified as 45 in SeqSIMLA2 [54]. With 1,000 replications, the mean of the σ²_b estimate from the lmekin function [55] was 44.33 for case (A) and 47.51 for case (B).) Therefore, FFBSKAT was more powerful than the other methods when analyzing the GAW 17 large-pedigree data.

Discussion
Many statistical methods have been proposed for rare variant association testing, but most of them were designed for population-based studies. Among the family-based rare variant association testing methods, some extend the transmission disequilibrium test [56,57] and focus on parent-child trio data [36,37]. Other methods can analyze general pedigree data (including but not limited to trios), and they can be categorized as burden tests and non-burden tests (e.g., famSKAT [31,38], FFBSKAT [39], Kernel [32]). The non-burden tests have been shown to be more powerful than the burden tests in most situations [32,33]. Among the non-burden tests compared in this work, FFBSKAT [39] (and famSKAT [31,38]) can only deal with continuous traits, but it is more powerful than the other methods (ADA, Kernel, Burden) when the pedigrees are larger (see the GAW 17 data analysis) or when the signal proportion is smaller (see the bottom two rows of Fig. 3).
In population-based studies, ADA is robust to the inclusion of neutral variants, and it has therefore been shown to outperform the burden tests and the non-burden tests (e.g., SKAT) [13,21]. In this work, we extend ADA to family-based studies and compare it with Kernel and Burden, the two commonly used methods for dichotomous traits. Simulation studies show that ADA is more powerful than the other two competitors when the percentage of causal variants is smaller (see the bottom row of Fig. 2). Conversely, Kernel is more powerful when the percentage of causal variants is larger (the top row of Fig. 2). Burden has the least power across the simulation scenarios we investigated. The comparison between Kernel and Burden is consistent with the findings of previous studies [32,33].
Kernel, Burden, and FFBSKAT can provide analytical P-values when the sample size is large. ADA searches for the optimal threshold among multiple P-value truncation thresholds; therefore, permutation is required to assess statistical significance, and ADA needs more computation time than the other methods. To improve computational efficiency, ADA can be combined with a sequential Monte Carlo algorithm [58]. For simulated data sets containing 50 families and 50 SNPs/SNVs, ADA on average needs about 168.2 sec, Kernel or Burden takes about 37.6 sec, and FFBSKAT needs about 4.3 sec, as measured on a Linux platform with an Intel Xeon E5-2690 2.9 GHz processor and 2 GB memory. Although the computation time of the other competitors is much shorter than that of ADA, ADA is more robust to the inclusion of neutral variants when dichotomous traits are studied (see Fig. 2). When continuous traits are analyzed, however, FFBSKAT outperforms ADA when the signal proportion is smaller, and Kernel outperforms ADA when the signal proportion is larger (see Fig. 3).
Rare causal variants may play an important role in the etiology of complex diseases [59][60][61][62][63][64], but they are challenging to detect through single-locus tests [1,2,65,66]. Combining variants' signals in a chromosomal region and testing for association with a grouping statistic is a commonly used strategy. Compared with the burden test (Burden) and the non-burden test (Kernel), ADA is more robust to the inclusion of neutral variants when dichotomous traits are analyzed. However, there are some limitations for ADA. First, because this method is more computationally intensive, it is not realistic to apply it to genome-wide sequencing data. Second, pedigree structures and founders' sequence data are required for the permutation procedure implemented in ADA. Third, unrelated controls cannot be included in the ADA analyses. This work shows that, for family-based studies, the application of ADA is limited to dichotomous trait analyses with full pedigree information.
Supporting Information