From QTL to QTN: Candidate Gene Set Approach and a Case Study in Porcine IGF1-FoxO Pathway

Unraveling the genetic background of economic traits is a major goal in modern animal genetics and breeding. Both candidate gene analysis and QTL mapping have previously been used for identifying genes and chromosome regions related to studied traits. However, most of these studies may be limited in their ability to fully consider how multiple genetic factors may influence a particular phenotype of interest. If possible, taking advantage of the combined effect of multiple genetic factors is expected to be more powerful than analyzing single sites, as the joint action of multiple loci within a gene or across multiple genes acting in the same gene set will likely have a greater influence on phenotypic variation. Thus, we proposed a pipeline of gene set analysis that utilized information from multiple loci to improve statistical power. We assessed the performance of this approach by both simulated and a real IGF1-FoxO pathway data set. The results showed that our new method can identify the association between genetic variation and phenotypic variation with higher statistical power and unravel the mechanisms of complex traits in a point of gene set. Additionally, the proposed pipeline is flexible to be extended to model complex genetic structures that include the interactions between different gene sets and between gene sets and environments.


Introduction
The function of any biological process is determined by both its composition and its structure. The genetic architecture that links quantitative traits operates under the same conditions, wherein both gene identities as well as relationships amongst genes determine the phenotype. Over the past two decades, the candidate gene approach and linkage disequilibrium analysis have been used to discover QTGs (Quantitative trait genes) and QTNs (Quantitative Trait Nucleotide). The candidate gene approach is based on a prior knowledge including the physiological mechanisms described in other studies [1], for example, the estrogen receptor (ESR) gene [2]. Linkage disequilibrium analysis based on well-characterized populations has driven QTL (quantitative trait locus, a fragment of DNA covering one or several genes) mapping nearly to completion for almost all major livestock species. This approach exploits crosses of reference strains or individuals related through a known pedigree. Up to now, a huge amount of information has been accumulated and deposited in Animal QTLdb(http://www.animalgenome.org/QTLdb/ [3]), a valuable resource which is freely available for further mining.
In order to make use of the previous QTL mapping results, we proposed a complementary, alternative approach, named candidate gene set approach (CGSA), to identify QTG and QTN in a specific set of related genes such as regulatory networks, pathways, etc. The CGSA was designed to decrease false discoveries by considering a functional set and thus improve statistical power. The performance of the proposed approach was assessed by simulation based on 820 hydrid pigs and demonstrated by a real example of growth traits of 111 hydrid pigs.

Materials and Methods
1 Candidate Gene Set Approach 1.1 Defining candidate gene set. Gene set, as the name suggests, is a group of genes, which are functionally or structurally related to each other. There are many possible sources used for defining gene set, including gene ontology annotations [4], KEGG pathways [5], Molecular Signatures Database (MSigDB) [6], coexpression genes from microarray analysis, etc. They may even be chromosome regions covered by related QTL. The sources above are not an exhaustive list, but give a good idea on the types of analyses that can be done using the methods we described in this study. Parts of the gene sets can be selected as candidate gene sets based on some prior information for further analyses.
For animals (livestock) with a poor annotation of genes, we can define candidate gene set by comparative genomics, namely identifying orthologous genes in livestock by Blast searching, because researches on model animals have produced a huge amount of information. When such search indicates a livestock gene with unknown mapping data, the genes adjacent to the target gene in the query, can be similarly searched in the livestock genome in order to infer the location of livestock gene based on synteny. Recently, our group have developed a functional annotation platform called SNPpath for animals (livestock) SNP [7], and it can provide functional annotations of livestock SNP markers from several commercial platforms and dbSNP database. The web server may be particularly beneficial for the analysis of combining SNP association analysis with the CGSA.
1.2 Mapping genes to QTL. As mentioned above, by fine design of experiments, QTL mapping has obtained a huge amount of information, including QTL location (chromosome, location, and location span), flanking markers, peak markers, test statistics, QTL effects and traits. For example, up to June of 2012, there were 6,818 QTLs that represent 585 different traits for pig [3]. If a gene within the candidate gene set can be mapped to the confidence interval of QTL that have some physiological relevance to the trait, it will be reasonably considered as primary candidate QTGs [8], especially, overlapping with multiple QTL of interest. Some QTL in the pig QTLdb have very large confidence interval, so we suggest that research should set a limitation of confidence interval by considering overlapping with multiple QTL of interest. We have successfully written a Perl script to relatively locate genes within the interval of the QTL.
1.3 Predicting potential SNPs. Though a few thousand of SNPs in animal are currently available, which comes from commercial array or annotation database like dbSNP, the SNP density in animal is much lower than that in human. The low SNP density might limit the power of identifying genes involved in the traits of interest. The ongoing genome sequencing project and the releasing of the genome draft assembly of animal have provided the opportunity to compare and detect SNP candidates using alignment algorithms [9]. Kerstens (2009) successfully used this strategy to mine novel SNPs of pig using whole genome shotgun dataset generated by the Danish-Chinese Pig Genome Sequencing Initiative [10]. Therefore, we could perform similar pipeline to identify candidate SNPs.
1.4 SNP genotyping. The predicted candidate SNPs as well as SNPs from commercial array and annotation database need to be genotyped in a panel of target population for association studies. Owing to advances in the area of molecular genetics and improvements in genotyping technology, a large number of SNP genotyping techniques were used to type allele-specific products for SNPs of interest followed by genotype determination. In addition to high cost in genotyping and stringent requirements for accuracy, the throughput has been a critical factor in many farm animals studying with large sample populations. Therefore, in most cases, we need to make a reasonable compromise between throughput and cost. For instance, typing SNP candidates overlapped with QTLs or that in all members of gene set is determined by cost to some extent.
1.5 Association test. The CGSA method makes an attempt to answer the following questions: First, whether is there the association between genetic variation and phenotypic variation in the gene set level? Second, if the gene set is significantly associated with some traits, which SNPs or combinations play a key role?
To answer the two questions, we first used the following Mixed Liner Model (MLM) to perform our CGSA on each measured trait.
The matrix format of this model is that: where b is a vector of fixed effects, a is vector of random additive genetic effects of candidate gene set, d is vector of random dominance effect of candidate gene set, and h AA , h AD and h DD represent the interaction effects of additive-by-additive (AA), additive-by-dominance (AD) and dominance-by-dominance (DD) respectively. X, Z 1 , Z 2 , Z 3 , Z 4 and Z 5 are the corresponding design matrix. It is assumed that. MVN represents multivariate normal distribution, A and D represent the corresponding additive (dominance) effect relationship matrix in candidate gene set level, respectively. The matrices can be estimated by the SNPs in select candidate gene set using the software of PLINK [11]. The PLINK estimates the probability of sharing 0, 1, or 2 alleles IBD (Identiy by Descent) for two subjects. The elements of A matrix can be expressed as P(IBD~2)z0:5 Ã P(IBD~1) and the elements of D matrix can be calculated as P(IBD~2) [11]. The corresponding epistatic variation effect relationship matrix could obtain by the Hadamard products of the A and D matrices [12]. The likelihood ratio test could be used to test the significance of the additive effects, dominance effect and epistatic effect for the candidate gene set.
Second, if the candidate gene set is significantly associated with some traits, the single SNP general linear model or multiple higher order interaction models such as MB-MDR method can be further used to study the genetic mechanism. MB-MDR method proposed by Calle et al. [13] not only assesses joint significance of multiple higher order interaction models at once, but also facilitates distinguishing between epistatic effects and contributing main effects to the multilocus signal via the 'MB' part in MB-MDR [14].

Simulations
2.1 Simulation data. We compared the statistical power between our CGSA and the traditional candidate gene approach across a variety of heritabilities in a real experimental genotype data and simulated phenotype data produced by R program. The pig dataset was collected from 820 commercial female pigs. There were 51,385 SNPs scored on these individuals [15]. SNP markers were aligned to the corresponding Ensemble gene IDs (the Sus scrofa Build 9.2 downloaded from UCSC genome browser). The genes including 50 kb flanking regions (upstream and downstream of gene), were used for the assignment of SNPs to genes. The 57 SNPs mapped to 17 genes in the insulin and insulin growth factor-1/FoxO (IGF1-FoxO) signal pathway were used in the simulation procedure. All the makers with their positions are shown in Table  S1.
2.2 Simulation procedure. Two simulation experiments were conducted to assess the performance of our method. In the first simulation experiment, only additive genetic effects were simulated. Fifteen SNPs was randomly sampled from the IGF1/ FoxO signal pathway as causal QTNs for the simulated traits. The distribution of these QTN effects was assumed as a normal distribution with mean of 0 and variance of 1. Phenotypes were calculated as the sum of additive effect and the random residual effect. The total additive effect was the sum of additive effects across all QTNs and the residual variance was computed as V e~Va (1-h 2 )=h 2 , where V a is the empirical additive genetic variance and h 2 is the heritability. A residual error was assumed as a normal distribution with mean of 0 and variance of V e . Five levels of heritability were analyzed (h 2 = 0, 0.05, 0.1, 0.25 and 0.5) in this simulation experiment.
In the second simulation experiment, additive effects, dominance effects and epistatic effects were simulated simultaneous. Thirteen SNPs from the IGF1/FoxO signal transduction pathway were assumed as causal QTNs for the simulated traits. All the main effects and the effects of interaction (epistatic effects) of the makers with their positions are shown in Table S2. Three paired SNPs were simulated as interacting QTN effect. It is assumed that the effects of paired SNPs were seen only if an individual is heterozygous at both QTNs. The simulated additive genetic variance and total genetic variance for the true 820 commercial female pigs were 6.05 and 9.01 respectively. The residual variance was set as 10. The power was calculated as the proportion of QTNs detected or the significant pathway with Type I error rate 0.01. A total of 1000 replications were conducted for each method. The R program written by the authors based on GAPIT software was used to perform analysis for the two simulation experiments [16].

Analysis of Real Data
Ethics statements. Prior approval by the Institutional Animal Care and Use Committee of Shanghai Jiao Tong University (Contract 2009-0068) was given for all experimental procedures involving animals in the present study.
Experimental design and procedure. The 111 Duroc 6Landrace 6Yorkshire (DLY) hybrid pigs were phenotyped on daily feed intake and body weight. All pigs were fed the same cornsoybeanmill based ration during the feeding period. Pigs were automatically weighed on their way to feeding using OSBORNE FIRE system. Feed intake and body weight (BW) were measured several times a day, which were then arithmetically averaged to one value for further analysis. If some measurements were missing, the remaining measurements were used and treated as a daily measurement. For generating variables representing BW changes from born to 100 kg and further analysis of the BW data, individual measurements were first smoothed using S-shaped logistic curves. In this study, we mainly focus on the corresponding timing intervals from birth to 30 kg (BW 30), from 30 to 60 kg (BW 60), from 60 to 100 kg (BW 100), average daily gain (ADG) and total feed intake (DFI). The heritability of BW 30, BW 60, BW 100, ADG and DFI are 0.39, 0.35, 0.34, 0.37 and 0.15 separately in present population.
Defining candidate gene set. To make use of the results from the related experiments of model organisms, we defined the gene sets based on our previous work, which presents the transcriptomic profile during skeletal muscle regeneration using time-course expression data and motif scanning information [17]. Our study recovered some signaling pathways involved in the regulation of muscle differentiation, and the IGF1/FoxO signal transduction pathway was selected to define a gene set for the current study.
Predicting potential SNP positions and SNP genotyping. Two data sources were used to predict SNP positions, one from the commercial array annotations (Illumina PorcineSNP60 BeadChip), and the other from the NCBI Trace repository. The identification of novel SNPs was performed by a 2step process for the NCBI Trace repository. First, the shotgun sequences were assigned to a reference sequence (member form gene set) by clustering based on their sequence similarity. Secondly, the sequences within cluster were aligned to search for potential SNPs. A total of 111 genomic DNA samples from DLY hybrid pigs comprising 58 females and 53 castrates, were genotyped by SNaPshot method. The primer sequences for analysis of 23 identified SNPs are listed in Table S3. Association test. The mixed linear model is as following, y ijk~m zs i zp j za k zd k ze ijk , where y ijk is the phenotypic value of each trait, mis the overall mean for each trait, s i is the fixed gender effect (i = 1,2), p j is fixed sire effect (j = 1,2,3), a k and d k are the pathway additive effect and dominance effect of the kth individual,e ijk is the random residual effect. The additive and dominance genetic relationship matrix was estimated based on the TagSNPs in select candidate signaling pathway using the software of PLINK [11]. Then ASReml was used to estimate the pathway effects and test the significance level by the likelihood ratio test [18]. In this case study, we can take full advantage of the opportunities provided by MB-MDR model to study simultaneously the association of a group of genetic variants in predefined gene set, which can help us to holistically unravel the complex genetic structure of the studied trait and to gain insight into the biological processes and mechanisms.

Simulation Results
For the first simulation experiment, generally, along all the levels of heritability investigated, as heritability increases, the statistical power of both methods increased as expected (Figure 1). Our CGSA performed higher statistical power over the traditional candidate gene approach across all kinds of heritabilities. The differences among these methods varied along the level of heritabilities. Figure 1 showed that the differences among the methods with mid-level and low-level of heritability were greater than the one with high heritability. For the second experiment, the statistical power of our CGSA and candidate gene approach was shown in Table S4. Compared to the traditional candidate gene approach, the CGSA performed the significantly higher statistical power for both additive effect and dominance effects and similar statistical power for epistatic effect in the present simulation scenario.

Real Data Results
There were 28 murine genes in the IGF1/FoxO signal transduction pathway, and a simple scheme of IGF1/FoxO pathway components is shown in Figure 2 (modified from [19]). These 28 genes have been aligned to the pig chromosomes to identify orthologous genes using BLASTn program [20] from the blastall tool [21] with default parameters. We identify many pig orthologs of the murine genes and the compared results are shown in Table 1. However, the chromosomal locations for 4 genes (INS2, IRS2, PIK3CA and PIK3CB) are still unknown.
Each of the genes that were mapped on the pig genome has been co-localized with published QTL from pigQTLdb. The genes identified as being co-localized with a QTL are presented in Table S5. FOXO1 gene has been co-localized with 22 QTL, FOXO3 gene with 48 QTL, FOXO6 gene with 32 QTL, IGFBP6 gene with 27 QTL, and PDK3 gene with 4 QTL. Only the results for the FOXO3 gene are detailed here ( Table 2). We found that many traits were closely associated with growth and development traits, for example, average daily gain (QTL_id 5680, 329 and 5928) and feed intake (QTL_id 871). Each locus (QTL_id) corresponded to one publication.
At last, 23 single nucleotide polymorphisms (SNPs) in 9 genes (IGF1, IGFBP1, INSR, IRS1, PIK3R3, PDK3, PDK4, AKT3 and FOXO3) were identified in the IGF1/FoxO pathway, and 11 of which were predicted novel SNPs. Using this approach, 13 SNPs were determined to be polymorphic in our panel, three of which were novel SNPs (S14, S18 and S28 in Table S3). The allelic and genotypic frequencies, as well as the concrete distribution in this population were determined by SNaPshot method. Analysis of the same site using PCR-RFLP generated similar results to those of the SnaPshot analysis (data not shown). Our study showed that the SNaPshot method can reliably replace sequencing and RFLP testing, particularly in cases in which no restriction endonuclease recognizes the polymorphic site being analyzed. Moreover, the same results can be achieved in a shorter time by SNaPshot analysis for small or medium output.
Using both our MLM model and the likelihood ratio test, we tested the significance of the additive effect and dominance in IGF1-FoxO pathway level on each of the 5 measured traits. Our results showed that the IGF1-FoxO pathway significantly associ-ated with BW100 (0.007) and ADG (0.010). No significant dominance effect was found in the present study.
For further identifing which SNP (or combinations of SNPs) plays a key role(s) in the IGF1-FoxO pathway, we performed single-SNP association analysis and interaction models analysis. The single SNP association analysis was initiated using linear regression with adjustment for gender and pedigree. Of the 13 SNPs analyzed, 4 SNPs (AS3, AS6, AS9 and AS10) were showing the strong association with BW100 and ADG. Overall, most of SNP associations were moderate in single SNP analysis, suggesting that there appeared to be multiple weak associations in the studied pathway. These SNPs may therefore interact with each other to exert their effect.
In order to complement pathway-based analysis and provide additional insights into the genetic architecture of studied traits, a generalized multifactor dimensionality reduction method was used to determine the relationship between interacted SNPs (combined effect) and studied traits. Several interactions were shown to be significant: a two-SNP interaction model (AS3 and S14) and a three-SNP interaction (AS6, AS9 and S18) were related BW100 as well as a two-SNP interaction (AS10 and S28) and a three-SNP interaction (AS6, AS9 and S18) involved to ADG was found. Our study indicates that the susceptibility SNPs identified by single SNP analysis might independently and/or in a combined manner influence studied trait. Overall, we can identify many SNPs of moderate effect that influence studied traits though interaction, and most of the SNPs found have been reported in the literature as Figure 1. Statistical powers between CGSA and candidate gene approach under different heritability levels. The power was examined on a trait simulated from 15 causative mutations (QTNs). A total of 1000 replications were conducted for each method. The heritability of the trait varied from 0 to 0.5. The differences between our CGSA and candidate gene set approach with mid-level and low-level of heritability were greater than the one with high heritability. doi:10.1371/journal.pone.0053452.g001 functional elements correlated with the model animals' growth [22,23,24].

Discussion
Over the past decade, researchers have made great progresses in identifying economical relevance factors (genes, QTLs and SNPs) contributing to trait variation. SNPs have been the markers of choice in recent years, due to their high stability, density and the highly automated way in which SNPs are detected. Various statistical models and techniques have been developed and applied in order to make inferences about the general population using studied samples as well as relate the genotype information to observable/measurable phenotypes [25]. Our approach is different from traditional candidate gene strategies, which identify sets of SNPs/variants showing allele frequency and genotypic concordance among subjects. Parts of the traditional strategies based on the linkage disequilibrium between markers mainly consider haplotypes [26,27], but is restricted to areas of high LD (haplotype blocks). In addition, clustering and principal components were also used to group markers based on LD [28,29]. Unlike these, our CGSA approach can be considered as an improvement of traditional candidate gene approach. We group markers based on functional considerations, which we believe is in better concordance with the goal of identifying variation that is associated with phenotype.
In order to improved statistical power, several similar candidate pathway approaches [30,31] were published to deal with some of the same issues as our study. However, our study was conducted differently in a variety of ways. First, our method can make full use of the large scale study results performed on model animals (human, mouse) which have provided the potential to define candidate gene sets to identify the complex processes involved in the phenotype. Second, our approach takes full advantage of previously reported QTL, candidate gene sets that lie within the region of the QTL, and that have physiological relevance to the trait (considered as primary candidates for the QTL), particularly within multiple overlapping QTL of interest. Third, most of existing candidate pathway approaches [30] use multiple regression models to test the significance, but our method allows us to test whether a set of genes are associated with a trait of interest based on MLM framework which could eliminate false positives effectively both in candidate gene association study and GWAS [32,33]. Using our approach, we succeed in identifying combined effect members in our predefined set, which is associated with several economical traits in pigs. The increase in power when compared to single-marker approaches based on the joint reduction of false positives and false negatives. Reducing false negatives comes from the fact that moderately strong signals can be combined into a stronger signal, whereas, these members might be ignored by strong global significance in single SNP analysis. Reducing false positives is because of our method based on MLM framework which could eliminate false positives effectively [32,33]. In the present case study, 2 SNPs associated with the BW 30, 1 SNP with BW 60 in single-marker approach were excluded in case of false positives because our gene set approach didn't indicate the significant association in the pathway level. In the present case study, no significant dominance effect was found because sample size is relatively too small. So, the real results cannot demonstrate the advantage of the new approach. Our group is trying to apply our approach to other dataset with large sample size to verify the relevant conclusions furtherly.
Finally, the results from this approach offer a clearer picture on the functional relevance of the discoveries. The significant SNPs found using this approach warrant further experimental exploration. Especially, SNPs from FoxO3, which were not only significant in single SNP analysis, but also in combined effect analysis. This FoxO3 gene was overlapped with multiple QTLs (48 QTLs) of pigs' growth and development traits, and was reported to be related to muscle growth abundantly [34,35,36]. The validation experiments would be more or less complex depending on the nature of the trait (mono-or polygenic). As biotechnology and genetic transformation techniques advance, we will be able to demonstrate experimentally whether specific SNPs truly determine variation in a particular trait.
It has become clear that gene sets rather than individual gene are essential in understanding carcinoma [37,38]. One drawback of this approach, however, is the reliance of prior assumptions that genes identified in the QTL regions are well annotated, the explicitly recording from KEGG and so on. If not, we may overlook or miss some genes or SNPs involved in the phenotype for incompleteness of annotation and complex biological process.

Conclusions
We have proposed a novel approach for determining specific genes and their SNPs that are associated with studied traits in the gene set level. The simulation results demonstrated that our CGSA performed higher statistical power over the traditional candidate gene approach. We also used this approach to study the relationship between IGF1-FoxO pathway and growth and development traits of pig. The results suggested that genetic variations in the IGF1-FoxO pathway modulate the growth and development in pigs. The CGSA is flexible to be extended to model complex genetic structures and can be applied by other gene sets. Computer programs (R and perl source code) are available at http://klab.sjtu.edu.cn/CGSA. Supporting Information Table S1 The 57 SNPs with their positions in the insulin and insulin growth factor-1/FoxO (IGF1-FoxO).
(XLS) Table S2 The main effects and the effects of interaction (epistatic effects) of the makers from the IGF1/FoxO signal transduction pathway in the simulation experiment. (XLS)