Genome-Wide Detection of Selective Signature in Chinese Holstein

Selective signatures in whole genome can help us understand the mechanisms of selection and target causal variants for breeding program. In present study, we performed Extended Haplotype Homozygosity (EHH) tests to identify significant core regions harboring such signals in Chinese Holstein, and then verified the biological significance of these identified regions based on commonly-used bioinformatics analyses. Results showed a total of 125 significant regions in entire genome containing some of important functional genes such as LEP, ABCG2, CSN1S1, CSN3 and TNF based on the Gene Ontology database. Some of these annotated genes involved in the core regions overlapped with those identified in our previous GWAS as well as those involved in a recently constructed candidate gene database for cattle, further indicating these genes under positive selection maybe underlie milk production traits and other important traits in Chinese Holstein. Furthermore, in the enrichment analyses for the second level GO terms and pathways, we observed some significant terms over represented in these identified regions as compared to the entire bovine genome. This indicates that some functional genes associated with milk production traits, as reflected by GO terms, could be clustered in core regions, which provided promising evidence for the exploitability of the core regions identified by EHH tests. Findings in our study could help detect functional candidate genes under positive selection for further genetic and breeding research in Chinese Holstein.


Introduction
When a novel beneficial mutation have been under the force of selection over a long period of time, it will show some obvious features such as unusually long-range linkage disequilibrium (LD) and a high population frequency, which therefore represents a detectable " signature of selection" [1,2]. Identifying signatures of recent positive selection could help us target causal variants for breeding and be potential to provide straightforward insights into the mechanisms of evolution. Furthermore, it can possibly highlight the genetic basis of phenotypic diversity for complex traits [3,4].
Many methods have been developed to detect such selection signals hitherto, such as the integrated Haplotype Homozygosity Score (iHS) and the extended haplotype homozygosity (EHH) test [5,6] based on haplotype lengths; F ST test [7] for population differentiation detection. These methods have been successfully implemented in many studies to exploit strong signatures of mutations under positive selection in humans and in domestic animals, e.g., genes resistance to malaria [6,8], in response to milk yield in dairy cattle [5,9] and related to pigmentation [10].
In this study, we applied the EHH test to identify selective signatures in Chinese Holstein. The method of the EHH test, firstly introduced by Sabeti et al. (2002), is based on the relationship of an allele's frequency and the extent of LD surrounding it. When under positive selection, it can make the allele frequency to increase at a very fast speed, such that the recombination does not totally break down the haplotype where the selected mutation occurs [6]. According to this, EHH is defined as the probability that two randomly chosen chromosomes carrying the core haplotype of interest are identical by descent for the entire interval from the core region to a certain point [6]. It has been reported that recombination rates are heterogeneous among chromosomal regions, and it may be a potential factor of false positives in the EHH detection [9]. REHH (relative EHH), proposed by Sabeti (2002), is supposed to overcome this limitation. The idea of REHH is to compare the EHH of the tested core haplotype with that of other core haplotypes for adjusting for local variation in recombination rates. With rectificatory recombination rates, we can readily use the REHH value to screen EHH at matched genetic distances, and haplotypes with excessive REHH values and high frequencies could indicate signal of positive selection [5,6]. The superiorities of the EHH test are that it can incorporate information on both the allele frequency and the association among SNPs, and it is designed to work with SNP markers rather than sequencing data [9].
Since different populations are undergoing different climates, geographic environment and selective forces [11], the frequencies of alleles are largely different among populations as well as the selective signals [10]. Aiming at exploring specific selective signatures in Chinese Holstein, we performed a genome-wide detection for selective signatures employing EHH tests in present study. Our findings herein as well as incorporating those findings in our previous genome-wide association studies (GWAS) for milk production traits could possibly verify some specifically biologic functions reflected by these identified core regions, and provide foundations for further pinpointing causal variants for economically important traits in dairy cattle.

Animal Resource and Genotyping
A total of 2019 daughters and 87 sires from 15 Holstein cattle farms in Beijing were collected as experimental population.
Samples were collected as part of the regular quarantine inspection of the farms, and used in this study with the permission of the animals' owners. The population structure has been detailed in our previous study [12]. DNA was extracted from blood (for daughters) and semen (for sires), and all individuals were genotyped using the Illumina BovineSNP50 BeadChip containing 54001 SNPs, with a mean distance of 48.75 kb between markers.
Before analyses, we used Beagle [13] to impute the missing genotypes and to construct haplotypes. Besides, quality control for SNP markers was performed to remove those with minor allele frequency (MAF) less than 0.03.

Detection of Selection Signatures
For the EHH test, we firstly identified core regions in the genome using the software Sweep v.1.1 [6], and then calculated EHH values for haplotypes in each core region. The core region is characterized by SNPs with strong LD and consists of some core haplotypes [9]. To compare REHH values across many core features, genetic distance of 0.5 cM was chosen as the matched distance to determine the REHH value for each core region. The distance of 0.5 cM was an appropriate setting for the longer extent LD in cattle, as compared with that in humans [9,14,15].

Bioinformatics Analyses
After carrying out EHH tests, we further performed multi-level bioinformatics analyses to explore potential biology significance of genes harbored in the identified core regions. The following two aspects were involved in the analyses: Genome annotation. In the analysis, each significant core region was extended 1 Mb towards both sides as the target region for annotation. In addition, genes involved in the core regions were compared with those in our previous GWAS [12] as well as those involved in a database of candidate genes for milk production performance and mastitis [16].
Enrichment analyses. Two different types of enrichment analyses relevant to the significant core regions were performed herein, i.e., gene ontology (GO) term enrichment analyses and pathway enrichment analyses. GeneMerge1.2 software tool [17] was employed to perform enriched GO term analyses and enriched pathway analyses.

A Descriptive Statistics for Markers and Core Haplotype
After the quality control for SNP genotypes and excluding the SNPs on chromosome X, 40130 SNPs were finally retained in the analyses. Table 1 gives an overall description for the distribution of SNPs across whole genome. The numbers of SNPs for every

Genome-wide EHH Tests
A total of 3,996 core regions were identified and corresponding 31,182 EHH tests have been performed. These identified core regions contained 14,363 SNPs, with a range of 3-12 SNPs per core. The total length of these cores was 529,108.6 kb, with a mean length of 129.7685.6 kb.
An intuitive scatter plot is presented in Fig. 1, showing the distribution of REHH values vs. haplotype frequencies. Different colors represented different ranges of p values. Moreover, a plot diagram is given in Fig. 2 to visualize the distribution of the REHH values across whole genome. Based on the concept of selective signature aforementioned, one character of such signals is the high frequency of a beneficial selected gene involved in a core haplotype. Hence we further discarded those core haplotypes with frequency ,20%. Accordingly, 16,035 EHH tests remained for all core haplotypes after filtering (See Table 2), with a maximum of 1,177 tests on chromosome 1. Among these EHH tests, 541 and 125 out of them achieved significant level corresponding to the significance levels of 0.05 and 0.01, respectively.

Genome Annotation
We annotated genes involved in all significant identified core regions based on the Gene Ontology database (http://www. geneontology.org/), and further found out those potential candidate genes relevant to milk production traits. Accordingly, a total of 9,829 genes had been identified, 6573 of them (66.87%) fall into the regions of QTL affecting milk production included in the cattle QTL database (http://www.animalgenome.org/cgibin/QTLdb/index).
In our previous GWAS [12], 75 genes were identified associated with one or multiple milk production traits in dairy cattle such as milk fat yield (FY), milk yield (MY), milk protein yield (PY), milk fat percentage (FP) and milk protein percentage (PP). Among these 75 genes, 45 out of them were contained in the core regions detected in present study, further indicating these genes under positive selection maybe underlie milk production traits in Chinese Holstein. Table 3 gives symbols of these genes as well as the closest core positions and corresponding p values of core haplotypes. Totally, 45 genes were involved in 19 core regions which contained 25 significant core haplotypes.
In addition, 203 genes within the core regions overlapped with those included in a database of candidate genes in cattle underlying milk production performance and mastitis [16]. A histogram showing the overall distribution of the candidate genes across the genome is given in Fig. 3. It can be seen that the largest number of overlapped candidate genes were located on chromosome 6. Among these 203 functional genes, 13 genes (LEP, ABCG2, CSN1S1, CSN3, IL8, TLR4, IL1B, LBP, DGAT1, TLR2, C5AR1, LTF, TNF), which were suggested as being under selection, have been frequently reported in different studies [16].

GO Term Enrichment Analyses
By searching for over-represented terms in core regions compared to those across whole genome, we can detect those functional genes clustered in the identified core regions contributing to milk production traits. We analyzed the significant enriched terms for genes involved in the core region. As a result, 36 second level GO terms (64.3%) were found to be significantly over-represented (Table 4). For instance, the term 'metabolic process'(GO: 0008152) showed significantly enriched, which indicated genes associated with milk production traits involved in 'metabolic process' were clustered in the core regions. Other terms such as 'immune system process'(GO: 0002376) and 'reproductive process'(GO: 0022414) were also showed clear biological association with mastitis and milk production traits.

Pathway Enrichment Analyses
We detected over-represented pathways relevant to genes within detected core regions. Accordingly, 9 out of 213 (4.23%) pathways were found to be significant in the core regions (Table 5), e.g., 471 genes were contributed in the significant 'metabolic pathways' (bta01100) which could plausibly be involved in the processes of milk production; The term 'chemokine signaling pathway' (bta04062) showed significant over-representation involved in the processes of inflammatory immune response in mammary gland containing 89 genes in all. Other over-represented pathways also reflected some important biological processes, such as: 'focal adhesion' (bta04510) participated in cell motility and cell    proliferation, 'spliceosome' (bta03040) assembled on the mRNA precursor with the function of folding the precursor into a conformation to allow transesterification to proceed.

Discussion
Due to serving as models for basic studies of some important pathways, cattle play a special role in biology (Bovine Genome Sequencing Initiative http://www.genome.gov/Pages/Research/ Sequencing/SeqProposals/BovineSEQ). In the past few years, many studies were performed in the field of cattle genetics. For instance, Kaupe et al. [18] genotyped 1748 DNA samples of 38 different Bos taurus and Bos indicus cattle breeds to examine the DGAT1 polymorphism in cattle breeds; another research used Bos taurus and Bos indicus cattle to detect the utilization of low-quality roughage [19]. So far searching for genomic evidence of positive selection in cattle recently has been widely considered as an attractive strategy for identifying functional polymorphisms and providing important insight into the mechanism of evolution [20]. In this study, we carried out EHH tests to detect populationspecific selection signatures by using the high density SNP chips in Chinese Holstein and then performed various systems biology This table describes second GO terms significantly enriched in core regions based on GeneMerge1.2 software. 36 terms are detected to be significant here. Pop-frec describes the frequency of genes in the population with this term, and CR-frec describes the frequency of genes in the core regions with this term. Ratio is calculated by the comparison of a term within the core regions to that in genome wide. P value here is a Bonferroni corrected P value. doi:10.1371/journal.pone.0060440.t004 analyses to confirm the biological significance for the identified core regions. Totally, there were 125 core regions been detected potentially harboring selective signals at the significance level of p = 0.01. As Qanbari et al. (2010) said, by comparing with shorter extent of homozygositiy of other haplotypes presented in these regions, it can demonstrate the longer extent of homozygosity of some haplotypes were not simply because of low recombination rates but due to the strong recent selection. In a recent study of detecting selective signatures in German Holstein cattle based on EHH tests, a total of 161 regions were identified to exhibit the signals of positive selection. Annotation for these regions also revealed a list of functional candidate genes like SPATA17, DGAT1, FABP3 and ABCE1 [9]. Another study carried out by Flori et al. (2009) in three major French dairy cattle breeds, showed 13 highly significant regions under the recent selection. Some of them contained genes that with strong effects on milk production traits, such as GHR, LAP3 and ABCG2. These two researches revealed that most of genes targeted by artificial selection in dairy cattle mainly tended to improve milk production [21]. So, for further probing into the relationships between the identified selection signatures and important traits under selection in this study, we compared these 125 core regions with findings in our previous GWAS studies as well as a candidate gene database for economic traits in dairy cattle. A suite of genes in the identified core regions have been suggested relevant to milk production traits and mastitis. For instance, two core regions overlapped with the region harboring polycystic kidney disease 2 (PKD2) gene, which functioned as a calcium permeable cation channel and had been identified significantly associated with milk protein percentage [12]; G protein-coupled receptor 20 (GPR20) gene located in an overlapped area of 3 extended core regions on chromosome 14, and it is related to milk fat percentage traits. On BTA 6, a region with a haplotype showed the lowest P value (P = 0.0022) hiding the HECT domain and RLD 3 (HERC3) and polycystic kidney disease 2 (PKD2) gene, HERC3 gene involved in the pathway of ubiquitination mediated proteolysis and it mainly functions as a signal for 26S proteasome dependent protein degradation. Another signal on chromosome 6, which harboring ATP-binding cassette sub-family G member 2 (ABCG2) has been proved as a target of selection [22,23]. These results indicated that these core regions presented biologic signals associated with milk production performance.
It was notable that the core region harboring DGAT1 gene merely showed a marginal significance (p = 0.057), although DGAT1 gene has been suggested as being under selection and been reported for many times. This could be due to the higher initial frequency of some beneficial alleles leading to less polymorphism in haplotypes [24]. Another possible reason could be the limitation of SNP density.
In our study, we conducted enrichment analyses for the second GO terms and pathways, which makes use of specifically meaningful annotations rather than uninformative ones and provides insight into regional genomic function [25]. This has been proved to successfully identify over-represented terms for general annotation analyses such as post-genomic analysis, DNA chips and gene networks analysis, and QTL global meta-analysis [26]. In the analyses, we firstly carried out enrichment GO analyses on 418 candidate genes in the database [16]. There were 57 (32.76%) over-represented terms have been found, which also provided a strong indication that genes may be clustered in a manner that reflects their functional association with particular traits. Subsequently, enrichment analyses on genes in the identified core regions were performed and 36 second level GO terms (64.3%) were found to be significantly over-represented in core regions including a total of 2740 functional genes, of which 34 significant terms were coincide with the results of the analyses for 418 candidate genes. These results were consistent with our previous hypothesis that some functional genes associated with milk production traits, as reflected by GO terms could be clustered in core regions compared against the entire genome regions.
We finally analyzed the enriched pathways with the purpose to confirm the biological associations between core regions and quantitative traits. The results of the enriched pathway analyses showed a total of 9 out of 213 (4.23%) pathways were found to be significant in core regions, of which 6 (66.67%) significant pathways were included in the results of candidate genes, confirming that genes involved in functional biological processes. As expected, milk production processes were enriched in core regions, which revealed proofs of multiple genes in a same pathway presenting selective signatures. These results were consistent with the results reported by Qanbari et al. (2010) that the identified candidate genes reflected a panel of pathways such as steroid metabolism and transportation. All these demonstrated the feasibility of EHH test as well as the biological significance of the identified cores based on this method. Findings in our study could help detect functional candidate genes under positive selection for further genetic and breeding studies in Chinese Holstein.
In addition, it was not ignorable that the length of these significant core regions identified by EHH tests was generally large and contained many genes, thus reducing the accuracy of this method; On the other hand, the method of EHH test may lack This table describes significant pathway terms over-represented in core regions based on GeneMerge1.2 software. Pop-frec describes the frequency of genes in the population with this pathway, and CR-frec describes the frequency of genes in the core regions with this pathway. Ratio is calculated by the comparison of a term within the core regions to that in genome wide. P value here is a Bonferroni corrected P value. doi:10.1371/journal.pone.0060440.t005 sensitive for identifying lower-frequency selected alleles [10], resulting in some low-frequency functional genes not included in our results. As for the enriched GO term analyses to detect the distribution of functional genes, we found that the proportion of significant terms were limited. This could be because many of the bovine gene models remain un-annotated. Hence it only roughly reflects our hypothesis that core regions displaying signatures are suggested targets of recent selection. New powerful methods are needed to enhance the power for localizing the source of selection. Due to each single test for selective signatures can provide partial information, we could possibly combine them in a composite likelihood statistic. This idea has already been proved effectively in Human population [10]. Besides, as the differences of allele frequency between populations may be more palpable in the regions harboring causal variants, we can combine with the tests for selective signatures between populations to help us narrow core regions for further accurately positioning the causal variants. In all, this study detected 125 core regions over the whole genome in Chinese Holstein using the EHH test associated with some bioinformatics analyses. Most of the genes included in these core regions were consistent with reported candidate genes for milk production and mastitis, of which some functional genes classified by GO terms or involved in important pathways of milk production traits were enriched in these identified regions. Our results provided evidence for the exploitability of the core regions identified by EHH tests, and eventually lay a substrate for further studies for targeting causal mutations underlying some important economic traits in Chinese Holstein.