Genome Wide Association Studies for Milk Production Traits in Chinese Holstein Population

Genome-wide association studies (GWAS) based on high throughput SNP genotyping technologies open a broad avenue for exploring genes associated with milk production traits in dairy cattle. Motivated by pinpointing novel quantitative trait nucleotide (QTN) across Bos Taurus genome, the present study is to perform GWAS to identify genes affecting milk production traits using current state-of-the-art SNP genotyping technology, i.e., the Illumina BovineSNP50 BeadChip. In the analyses, the five most commonly evaluated milk production traits are involved, including milk yield (MY), milk fat yield (FY), milk protein yield (PY), milk fat percentage (FP) and milk protein percentage (PP). Estimated breeding values (EBVs) of 2,093 daughters from 14 paternal half-sib families are considered as phenotypes within the framework of a daughter design. Association tests between each trait and the 54K SNPs are achieved via two different analysis approaches, a paternal transmission disequilibrium test (TDT)-based approach (L1-TDT) and a mixed model based regression analysis (MMRA). In total, 105 SNPs were detected to be significantly associated genome-wise with one or multiple milk production traits. Of the 105 SNPs, 38 were commonly detected by both methods, while four and 63 were solely detected by L1-TDT and MMRA, respectively. The majority (86 out of 105) of the significant SNPs is located within the reported QTL regions and some are within or close to the reported candidate genes. In particular, two SNPs, ARS-BFGL-NGS-4939 and BFGL-NGS-118998, are located close to the DGAT1 gene (160bp apart) and within the GHR gene, respectively. Our findings herein not only provide confirmatory evidences for previously findings, but also explore a suite of novel SNPs associated with milk production traits, and thus form a solid basis for eventually unraveling the causal mutations for milk production traits in dairy cattle.


Introduction
Over the last decades, advances in DNA-based marker technology make it possible to identify genome regions (namely quantitative trait loci, QTL) underlying complex traits such as milk yield in dairy cattle. Instead of traditional animal breeding programmes solely relying on phenotype and pedigree information, the incorporation of detected QTL into genetic evaluation provides a great potential to enhance selection accuracies, hence expediting the genetic improvement of animal productivity.
In dairy cattle, since the seminal work on QTL mapping by Georges el al [1], a large number of articles have been published concerning detection of QTLs for milk production traits. So far a total number of 1,137 QTL for milk production traits have been reported via genome scan based on marker-QTL linkage analyses (http://www.animalgenome.org/QTLdb/cattle.html, May 22, 2010). The limitations of QTL mapping using linkage analysis (LA) and/or linkage disequilibrium (LD) [2] based on panels of low to moderate density markers have been well documented previously [3,4]. In the past decades merely few strong candidate genes with potential effects on milk production traits, i.e., the DGAT1 gene [5] and the GHR gene [6], have been identified and/ or functionally confirmed from those findings derived from QTL linkage analyses and fine mapping studies.
With the advent of genome-wide panels of single nucleotide polymorphisms (SNPs), SNPs have been widely used for the detection and localization of QTL for complex traits in many species [7], and have proved powerful and useful in identification of casual mutations associated with economically important traits in livestock [8,9,10,11,12,13,14,15] as well as human diseases [16,17,18,19]. Most recently, along with maturing of genome sequencing and high throughput SNP genotyping technologies, genome-wide association studies (GWAS) are becoming practical for exploring genes associated with complex traits. Compared with traditional QTL mapping strategy, GWAS brings on major advantages both in power to detect causal variants with modest effects and in defining narrower genomic regions harboring causal variants [20]. GWAS has been widely accepted as a primary approach for gene finding and achieved huge success in identifying genes conferring modest disease risks in human. However, only few GWAS focusing on identifying genes for milk production traits have been performed [21,22]. Furthermore, the common limitation of these studies is that low-density SNP makers were employed in the analyses, leading to a decrease in power to capturing causal genes.
Motivated by searching for novel casual variants for milk production traits beyond previous findings via traditional linkage studies, the present study is to perform GWAS to detect potential casual genetic variants for milk production traits, using the Illumina BovineSNP50 BeadChip. The identified SNP loci may be considered as preliminary foundation for further replication studies and eventually unraveling the causal mutations for milk production traits in dairy cattle.

Materials and Methods
The blood samples were collected along with the regular quarantine inspection of the farms, so no ethical approval was required for this study.

Animal resource
A daughter design was employed in this study. In total 2,093 daughters as well as their 14 corresponding sires were collected to construct the study population. The numbers of daughters of the 14 sires range from 83 to 358 daughters with an average of 150. These daughters were from 15 Holstein cattle farms in Beijing, China, where regular and standard performance testing (dairy herd improvement, DHI) has been conducted since 1999. The official up to date estimated breeding values (EBVs) of five milk production traits, including milk yield (MY), fat yield (FY), protein yield (PY), fat percentage (FP), and protein percentage (PP) were used as phenotypes in this study. These EBVs were obtained based on a multiple trait random regression test-day model [23] using the software RUNGE provided by Canadian Dairy Network (CDN) (http://www.cdn.ca). The descriptive statistics of these EBVs for the five traits as well as the average reliabilities of EBVs of the 2,093 daughters are presented in Table 1. It is notable that the program RUNGE gave two sets of accuracies of EBVs for the five milk production traits. One is for milk yield (MY) and the other for the four milk content traits (FY, PY, FP, and PP). This is because that the amount of information used for calculating EBVs was different for MY and for the 4 milk content traits, while all of the 4 milk content traits provided the same amount of information for calculating EBVs.
Genotyping DNA was extracted from blood sample of the daughters and semen sample of the sires using the routine procedures. DNA was quantified and genotyped using the Illumina BovineSNP50 BeadChip containing 54001 SNPs, which is a multi-sample genotyping panel powered by Illumina's InfiniumH II Assay. Features of the Illumina BovineSNP50 BeadChip have been detailed previously [24]. All samples were genotyped using BEADSTUDIO (Illumina) and a custom cluster file developed from the 2180 samples. Genotype quality control To assess the technical reliability of the genotyping panel, a randomly selected DNA sample was genotyped twice and over 99% identity of called genotypes (two mismatches) was obtained. This demonstrates the technically robust feature of the 50K SNP BeadChip panel employed herein.
The quality control procedure can be largely split into two categories, including individual exclusion and SNP removal, as follows: Firstly, an individual would be excluded from the analyses if it had more than 10% missing genotypes or its SNP genotypes had a Mendelian error rate above 2%. For the second criterion, for each sire-daughter pair, we randomly choose 10,000 genotyped SNP loci for which both the sire and the daughter are homozygotes. A Mendelian error happens herein if the two homozygotes are different in the context that the maternal genotype is unavailable. Accordingly, if more than 200 out of 10,000 SNP have Mendelian errors, the daughter will be removed from the sample.
Secondly, a SNP would be removed if (1) its call rate was less than 90%, or (2) its minor allele frequency (MAF) was less than 3%, or (3) it was severely depart from Hardy Weinberg Equilibrium (HWE) with a P value lower than 10 26 , or (4) its minor genotype frequency was less than five individuals.
After the quality control procedures, 73 daughters with .10% missing genotypes and 205 daughters with Mendelian error rate above 2% were excluded, leading to 1,815 daughters remaining for the association analysis. On the other hand, we removed 1,218 SNPs with ,90% genotype call rate, 11,008 SNPs with a MAF ,0.03, 482 SNPs with extreme value of HWE statistics (P,10 206 ), and 1,073 SNPs with minor genotype frequency ,5 individuals. Eventually, 40,220 SNPs (74.5%) passed these quality control filters. The distribution of the remaining SNPs after filtering and the average distances between adjacent SNPs on each chromosome are given in Table 2. In addition, for the L1-TDT analyses, we excluded extra 1,057 SNPs for which all paternal genotypes are homozygotes, and 39,163 SNPs were finally utilized.

Statistical analyses
Two methods are adopted to perform GWAS in our studies as follows: TDT-based single locus regression analyses (L1-TDT). L1-TDT is a TDT-based association procedure [25], which is specifically suitable for the situation where only a single parent instead of both parents are genotyped for TDT analyses. As merely the genotypes of bulls and their daughters are available within the framework of a daughter design, we employed it to explore the existence of associations between phenotypes and SNP allele transmissions from bulls to their daughters within sire families. Under such circumstance, a phenotypic observation, i.e., EBV considered herein, can be modeled by a SNP effect within family due to transmission disequilibrium of the SNP alleles as well as the effect of the sire corresponding to each half-sib family. For each milk production trait, the equation of the model is given as follows: where y ij is the EBV of the j th daughter of sire i, m is the overall mean, s i is the fixed effect of sire i, TDS ij is an indicator variable with a value 21, 0 or 1 to indicate the transmission of a specific SNP allele from sire i to his j th daughter, which is determined according to [26], b is the regression coefficient (or the substitution effect of the SNP), and e ij is the residual error. For each SNP, b is estimated via a weighted least squares analysis with the weights equal to 1/REL ij , where REL ij is the reliability of the EBV of daughter j in family i. The association between the SNP and the trait is tested via the F-test.
Mixed model based single locus regression analyses (MMRA). Similar to the studies of [21] and [22], we performed association test for each SNP via regression analysis based on the following linear mixed model: where y is the vector of EBVs of all daughters, b is the regression coefficient of EBV on SNP genotypes, x is the vector of the SNP genotype indicators which takes values 0, 1 or 2 corresponding to the three genotypes 11, 12 and 22 (assuming 2 is the allele with a minor frequency), a is the vector of the residual polygenetic effects with a*N(0,As 2 a ) (where A is the additive genetic relationship matrix and s 2 a is the additive variance, and e is the vector of residual errors with e*N(0,Ws 2 e ) (where W is a diagonal matrix with the diagonal elements equal to 1/REL ij and s 2 e is the residual error variance). For each SNP, the estimate of b and the corresponding sampling variance Var(b b) can be obtained via mixed model equations (MME), and a Wald chi-squared statistiĉ b b 2 =Var(b b) with df = 1 is constructed to examine whether the SNP is associated with the trait.
We employed Fortran 95 to code the computing programs for L1-TDT and MMRA and they are available upon request.

Statistical Inference
For both analyses, the Bonferroni method was adopted to adjust for multiple testing from the number of SNP loci detected. We declared a significant SNP at the genome-wise significance level if a raw P value ,0.05/N, here N is the number of SNP loci tested in analyses.

Population stratification assessment
Confounding due to population stratification has been considered as a major plague to the validity of genetic association studies Figure 1. Genome-wide plots of 2log 10 (p-values) for association of SNP loci with five milk production traits in sequential order. Chromosomes 1-29 and X are shown separated by color. Fig. 1-a1, 1-a2, 1-a3, 1-a4 and 1-a5 refer to plots generated by L1-TDT for MY, FY, PY, FP and PP, respectively. Fig. 1-b1, 1-b2, 1-b3, 1-b4 and 1-b5 refer to plots generated by MMRA for MY, FY, PY, FP and PP, respectively. The corresponding horizontal lines indicate the genome-wise significance levels (2log 10 [27]. To view if the population stratification exists in our experimental population, we examined the distribution of the test statistics obtained from the numerous association tests performed and assessed their deviation from the expected distribution of no SNP being associated with the trait of interest utilizing a quantilequantile (Q-Q) plot, which is a routine and most frequently used tool for scrutinizing the population stratification in GWAS. Since merely MMRA method is not immune to potential population stratification, ''Q-Q'' plots for the test statistics of MMRA were conducted for the five traits.

Significant SNPs
The profiles of P values (in terms of 2log(p)) of all tested SNPs for the five investigated traits are shown in Fig. 1. The numbers of  genome-wise significant SNPs detected by L1-TDT or MMRA for the five traits are presented in Table 3. In total, the numbers of significant SNPs detected by either L1-TDT or MMRA for the five traits are 20, 9, 21, 65 and 28, respectively. Since some of these SNPs are associated with more than one trait, the total number of distinct identified SNPs is 105. Of these 105 SNPs, 38 were commonly detected by both methods, while four and 63 were solely detected by L1-TDT and MMRA, respectively. With exception of only four SNPs, all SNPs detected by L1-TDT were also detected by MMRA. The details of these significant SNPs for the five traits, including their positions in the genome, the nearest known genes and the raw P values, are given in Tables 4 through  8, respectively, and further described as follows.
Milk Yield (MY). As seen from Table 4, 14 out of 20 SNPs are located within a 3.63 Mb segment (between 0.07 and 3.7 Mb) on BTA 14. Ten of them fall into the regions that have been reported to harbor QTL for MY [5,21,28,29,30,31,32]. Furthermore, 6 of these SNPs are harbored within the regions of known genes, and the others are located 87 to 84,554 bp away from the nearest known genes.
Fat Yield (FY). As presented in Table 5, 6 out of 9 SNPs are clustered within a 0.55 Mb segment (between 0.05 and 0.6 Mb) on BTA 14. Eight out of them fall in the regions which have been reported to harbor QTL for FY previously [5,28,31,33,34,35]. Furthermore, two of these SNPs fall within the regions of known genes, and the others are located 160 to 285,289 bp away from the nearest known genes.
Protein Yield (PY). As shown in Table 6, among these 21 SNPs, 7 out of them are located within a 3.33 Mb segment (between 0.07 to 3.4 Mb) on BTA 14. Further, 14 out of these SNPs are within the QTL regions for PY reported in previous studies [5,21,28,33,36,37,38,39]; 5 of them are located within the regions of known genes, and the others are located 87 to 385,764 bp away from the nearest known genes.
Fat Percentage (FP). From Table 7, 60 are located within a 6.2 Mb segment (between 0.05 to 6.25 Mb) on BTA 14. 53 of them are located within the QTL regions for FP reported in previous studies [5,34,39,40,41,42,43,44]. Further, 27 of the 65 detected SNPs are located within the regions of known genes, and the others are 71 to 560,215 bp away from the nearest known genes.
Protein Percentage (PP). As given by Table 8, out of 28 identified SNPs, there are 4, 7, and 14 SNPs located within a 8.0 Mb segment (between 33.9 to 41.9 Mb) on BTA6, a 2.59 Mb segment (between 0.23 to 2.82 Mb) on BTA14, and a 7.9 Mb segment (between 34.0 to 41.9 Mb) on BTA20, respectively. Among these 28 SNPs, 17 are located within the QTL regions for PP identified in previous studies [5,28,29,34,42,45]. Further, 11 of these 28 SNPs are located within the regions of known genes, and the others are 160 to 401,634 bp away from the nearest known genes.

Population stratification assessment
The ''Q-Q'' plots for the test statistics of MMRA are shown in Fig. 2-1 to 2-5. From these plots, it is apparent that the distributions of the x 2 statistics generated from the association tests across the SNPs tested show no evidence of overall systematic bias. That is, the observed x 2 statistics of the significant SNPs are  above the expected x 2 statistics, which are largely at the adjusted genome-wide significance level. The profiles of the Q-Q plots clearly show that the significant SNPs identified by MMRA are unlikely threaten by potential population stratification.

Discussion
In this study, we performed a GWA study for five milk production traits using a daughter design in Chinese Holstein population. To our knowledge, this is one of the first GWA studies for milk production traits using the Illumina BovineSNP50 BeadChip. Two statistical methods, L1-TDT and MMRA, were implemented to analyze association between SNPs and phenotypes. These two methods belong to two distinct analytical approaches, i.e., family-based (L1-TDT) and population-based (MMRA) approaches, respectively, both of which have been widely employed in GWAS. Comparisons between the two methods have been well conducted by many investigators [27,46,47,48]. Consensus with respect to their performance is twofold. On the one hand, population-based analyses largely outperform family-based analyses in statistical power and efficiency. The power limitation of family-based analyses results from ''overmatching'' on genotype [49]. Much fewer significant SNPs detected by L1-TDT compared with MMRA in this study present consistent evidence for this aspect in practice. On the other hand, family-based analysis always guards against population admixture/stratification caused by recent migration and/or nonrandom mating, and do not give spurious significant results, although at the expense of some loss of power [50]. The ''Q-Q'' plots for the test statistics of MMRA (Fig. 2-1 to 2-5) demonstrate that no population admixture/stratification exists in our population. Therefore, it is safe to declare that the SNPs detected by MMRA as well as L1-TDT have convincing associations with the traits of interest.
BTA14 has been received wide attention by many investigators. Apart from a large number of QTL reported on BAT14 [34,39,51,52], the well-known DGAT1 gene [5] located at ,0.44Mb is generally accepted as a major gene affecting milk production traits. BENNEWITZ et al. [28] revisited the QTL on BTA14 and concluded that there should exist a further conditional QTL which should be in linkage with the DGAT1 gene, and possible epistatic effects arising from them may be an additional source of genetic variation for milk production traits. Indeed, KAUPE et al. [53] recently reported that the CYP11B1 gene located at ,1.33Mbp has significant effects on MY, PY, FP and PP, and the allele substitution effects of CYP11B1 and DGAT1 together explained more variation in milk production traits than DGAT1 alone. In our study, an apparent feature of our findings is that a large proportion of the significant SNPs (61 out of 105) are located on BTA14. Of the 61 SNPs, 59 are located within the reported QTL regions. In particular, all segments on BTA14 which harbor multiple SNPs for the five traits also harbor the DGAT1 gene, and the four segments for MY, PY, FP and PP also harbor the CYP11B1 gene. Within these segments, 13 SNPs are located very close (within 1Mb) to the DGAT1 gene with the closest one (ARS-BFGL-NGS-4939) only 160bp away from it and 14 SNPs very close to the CYP11B1 gene with the closest one (Hapmap25486-BTC-072553 ) only 8,693bp away from it.
In addition to the SNPs on BTA14, most (27 out of 44) of the significant SNPs on other chromosomes are also located within the reported QTL regions. Further, some SNPs are also within or close (within 1Mb) to the reported candidate genes (for a summary of cattle candidate genes for milk production traits, see [54]). In particular, a SNP (BFGL-NGS-118998) located at 34,036,832 bp on BTA20 was found to fall within the GHR gene, which is also generally accepted as a functional causal gene affecting milk yield and components [5,6]. The other SNPs include the SNPs BTA-121739-no-rs and Hapmap24324-BTC-062449 on BTA6, which are 20,591bp and 450,868bp away from the ABCG2 gene [55], respectively, and the SNP ARS-BFGL-NGS-26919 on BTA11, which is 41,562bp away from the LGB gene [56].
It is notable that for either L1-TDT or MMRA some detected SNPs are associated with phenotypic variation in multiple production traits, This could be explained by pleiotropic effects of these SNPs on multiple milk production traits, leading to genetic correlations among them and there were similar result in many prior studies [28,53].
In this study, we performed GWAS in the way of SNP by SNP individually via regressing the observations of a single trait on either the genotypes of a SNP (MMRA) or the allele transmission patterns of a SNP from bulls to corresponding half-sib offspring (L1-TDT). Previous studies have shown that single marker tests provide similar or greater power than haplotype-based approaches [57,58]. In contrast to haplotype-based methods, the main advantage of the single locus test is that it does not necessitate information of SNP positions and reconstruction haplotypes of multiple SNP loci. Thus, it is the preferable method for large scale genome-wise association analyses, e.g., GWAS. Also, we individually perform GWAS for each of five milk production traits. This is the most conventional strategy for current GWAS. However, the five milk production traits considered here are generally regarded as correlated and thus should share common environmental/ genetic factors. A multiple traits instead of single trait analysis may be a promising way to take correlations among these traits into consideration. Multivariate analyses have been widely adopted in linkage studies [59,60,61,62,63], and it has been generally accepted that multivariate analyses outperform univariate analyses in terms of increasing statistical power and precision of parameter estimation [64,65]. In the next step, an optimal multiple traits analytical strategy will be pursued to further enhance our GWA studies.
In our study, the EBVs of daughters were used as phenotypes for association analysis. Besides EBVs, yield deviation (YD) and de-regressed EBVs of individuals are also commonly used as phenotypic observations in GWAS as well as in LA and LA/LD analyses for milk production traits. Comparison among these three kinds of phenotypes with respect to their influence on QTL mapping [66] and marker assisted selection studies [67] demonstrate that none of them has absolute advantages over the others.
We also compared using EBVs and de-regressed EBVs as phenotypes for our GWAS and it turned out that the findings of them are basically overlap (data not shown). Therefore, only the findings from using EBVs are reported herein. In all, the present study revealed 105 genome-wise significant SNPs for milk production traits in Chinese dairy cattle population using two different association analysis approaches (L1-TDT and MMRA). Most of these SNPs (86 out of 105) are located within the previously reported QTL regions, and some within or close to the reported candidate genes. The general consistence of the significant SNPs detected herein with the reported QTL and candidate genes and the agreement of the results of the two analysis approaches present strong support for the outcomes of this study. Our findings herein lay a preliminary foundation for guiding follow-up replication studies, and eventually revealing the causal mutations underlying milk production traits in dairy cattle. Figure 2. Quantile-quantile (Q-Q) plots of genome-wide association results by MMRA for five milk production traits. Under the null hypothesis of no association at any SNP locus, the points would be expected to follow the slope lines. Deviations from the slope lines correspond to loci that deviate from the null hypotheses. doi:10.1371/journal.pone.0013661.g002