Browse Subject Areas

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Whole-Genome Resequencing of Holstein Bulls for Indel Discovery and Identification of Genes Associated with Milk Composition Traits in Dairy Cattle

  • Jianping Jiang,

    Affiliation Department of Animal Genetics and Breeding, College of Animal Science and Technology, Key Laboratory of Animal Genetics and Breeding of the Ministry of Agriculture, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing, China

  • Yahui Gao,

    Affiliation Department of Animal Genetics and Breeding, College of Animal Science and Technology, Key Laboratory of Animal Genetics and Breeding of the Ministry of Agriculture, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing, China

  • Yali Hou,

    Affiliation Laboratory of Disease Genomics and Individualized Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China

  • Wenhui Li,

    Affiliation Department of Animal Genetics and Breeding, College of Animal Science and Technology, Key Laboratory of Animal Genetics and Breeding of the Ministry of Agriculture, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing, China

  • Shengli Zhang,

    Affiliation Department of Animal Genetics and Breeding, College of Animal Science and Technology, Key Laboratory of Animal Genetics and Breeding of the Ministry of Agriculture, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing, China

  • Qin Zhang,

    Affiliation Department of Animal Genetics and Breeding, College of Animal Science and Technology, Key Laboratory of Animal Genetics and Breeding of the Ministry of Agriculture, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing, China

  • Dongxiao Sun

    Affiliation Department of Animal Genetics and Breeding, College of Animal Science and Technology, Key Laboratory of Animal Genetics and Breeding of the Ministry of Agriculture, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing, China

Whole-Genome Resequencing of Holstein Bulls for Indel Discovery and Identification of Genes Associated with Milk Composition Traits in Dairy Cattle

  • Jianping Jiang, 
  • Yahui Gao, 
  • Yali Hou, 
  • Wenhui Li, 
  • Shengli Zhang, 
  • Qin Zhang, 
  • Dongxiao Sun


The use of whole-genome resequencing to obtain more information on genetic variation could produce a range of benefits for the dairy cattle industry, especially with regard to increasing milk production and improving milk composition. In this study, we sequenced the genomes of eight Holstein bulls from four half- or full-sib families, with high and low estimated breeding values (EBVs) of milk protein percentage and fat percentage at an average effective depth of 10×, using Illumina sequencing. Over 0.9 million nonredundant short insertions and deletions (indels) [1–49 base pairs (bp)] were obtained. Among them, 3,625 indels that were polymorphic between the high and low groups of bulls were revealed and subjected to further analysis. The vast majority (76.67%) of these indels were novel. Follow-up validation assays confirmed that most (70%) of the randomly selected indels represented true variations. The indels that were polymorphic between the two groups were annotated based on the cattle genome sequence assembly (UMD3.1.69); as a result, nearly 1,137 of them were found to be located within 767 annotated genes, only 5 (0.138%) of which were located in exons. Then, by integrated analysis of the 767 genes with known quantitative trait loci (QTL); significant single-nucleotide polymorphisms (SNPs) previously identified by genome-wide association studies (GWASs) to be associated with bovine milk protein and fat traits; and the well-known pathways involved in protein, fat synthesis, and metabolism, we identified a total of 11 promising candidate genes potentially affecting milk composition traits. These were FCGR2B, CENPE, RETSAT, ACSBG2, NFKB2, TBC1D1, NLK, MAP3K1, SLC30A2, ANGPT1 and UGDH. Our findings provide a basis for further study and reveal key genes for milk composition traits in dairy cattle.


In recent years, the establishment of next-generation sequencing (NGS) technology has led to substantial progress in terms of the discovery of large numbers of single-nucleotide polymorphisms (SNPs), insertions and deletions (indels), and large structural variations (SVs). Short indels are now recognized as the second most common form of genomic variation [1], which contribute to phenotypic diversity in many species and human diseases [2], such as cystic fibrosis [3] and fragile X syndrome [4]. In cattle, indels have been shown to contribute to complex traits, such as the double-muscled phenotype [5]. Since Mills et al. constructed the initial map of human indel variations [6], and with the increased availability and advances of high-throughput sequencing technology, indels have been discovered in individual genomes across different species [1, 712]. At the time of writing, more than 8 million indels are registered in the dbSNP bovine database (, updated in July 11, 2016). Therefore, it is now feasible to identify and characterize the molecular basis of complex traits using indels.

Milk production in dairy cattle is an economically important variable, especially the traits of milk yield and milk composition. Numerous studies have focused on the detection of quantitative trait loci (QTL) for milk production traits [1319], since the first report on QTL mapping in dairy cattle by Georges et al. [20]. In recent decades, based on QTL mapping and genome-wide association studies (GWASs), numerous candidate genes and mutations associated with milk production traits have been identified, including two confirmed causal mutations: DGAT1 p.Lys232Ala and GHR p.Phe279Tyr, which have large effects on milk yield and composition [21, 22]. However, to date, only limited investigations to identify genes associated with milk production traits using whole-genome resequencing have been reported [2325].

The aims of this study were thus to detect indels across the whole genome in eight Holstein bulls with extremely high and low estimated breeding values (EBVs) of milk protein percentage and fat percentage using NGS, and to explore the short indels [1–49 base pairs (bp)] that were polymorphic between these two groups and to identify functional genes that are important for milk protein and fat traits.

Materials and Methods

Frozen semen samples were collected along with the regular quarantine inspection of the farms and in strict accordance with the protocol approved by the Institutional Animal Care and Use Committee (IACUC) at China Agricultural University.

Sample collection

Eight bulls from two half-sib families and two full-sib families (n = 2 each) were selected from the Beijing Dairy Cattle Center (, according to their EBVs of milk protein percentage (PP) and fat percentage (FP), which were calculated based on a multiple trait random regression test-day model using the software RUNGE by the Dairy Data Center of China ( The reliability of EBVs for milk protein percentage and fat percentage of each bull was more than 0.80. The two bulls in each group showed high and low EBVs for milk protein percentage and fat percentage. Detailed information on the eight bulls is presented in Table 1.

DNA library construction and sequencing

Genomic DNA was extracted from frozen semen samples using the standard phenol/chloroform extraction method. DNA degradation and contamination were monitored on 1% agarose gels and the concentration and purity were assessed on NanoDrop 2000 (Thermo Scientific Inc. Waltham, DE, USA); the high-quality DNAs were then used for library construction. Two paired-end libraries were constructed for each individual, the read length was 2×100 bp, and sequencing was performed using Illumina Hiseq2000 instruments (Illumina Inc., San Diego, CA, USA).

Read mapping and variant calling

The cattle reference genome assembly (UMD3.1.69) was downloaded from the Ensembl Genome Browser website ( We applied the NGS QC Toolkit [26] with default parameters to minimize mapping errors. To map reads to the reference genome, we used the Burrows–Wheeler Alignment tool (BWA ver. 0.6.2) [27], mainly with the default parameters. To call indels, SAMtools (ver. 0.1.19) [28] was applied. The criteria were as follows: overall base quality score≥20; read depth <100 for each individual; and alternative allele on either forward or reverse supporting reads >3.

Indel validation

To evaluate the reliability of the resequencing data, 20 randomly selected indels were validated using a PCR assay and direct sequencing. PCR primers were designed using PRIMER3 ( and Oligo 6.0 to amplify the 150–300 bp fragments containing the indels (S1 Table). The PCR reaction was performed in a volume of 25 μl containing 50–100 ng of genomic DNA, 10 μl of premix, 1 μl of each primer, and 11 μl of PCR buffer, using the following PCR program: 94°C for 10 min; 35 cycles of 94°C for 40 s, 55–60°C for 40 s, and 72°C for 40 s; and then 72°C for 7 min at the end of the final cycle. The PCR products were purified and then subjected to Sanger sequencing.

Detection of differential variants and functional annotation

Among all of the indels from the eight bulls retained after filtering, we selected those that were polymorphic between the two groups with high and low EBVs, which were called the ‘common differential variants’. The pipeline in this study was as follows: we first collected all the variants that were polymorphic between the two individuals within each family, and then retained the common differential variants with the same allelic distribution directions between the high and low individuals across the four families. Subsequently, these common differential indels were annotated by ANNOVAR [29] based on the cattle gene set downloaded from the Ensembl website ( to identify the indels within functional genes. Moreover, we compared the differential indels mentioned above with the variants in cattle SNP database (

Integrated analysis of resequencing, QTL, and GWAS data and known pathways

We specifically focused on those common differential indels between the high and low groups located in genic regions or flanking regions of genes to identify candidate genes and causal mutations related to milk composition traits. On the one hand, we estimated the positions of genes containing these indels using QTL mapper v.2.019 (, and then compared the genetic positions of the genes with the confidence intervals and peak positions of previously reported QTL (, updated in April 29, 2016). We also compared the physical chromosomal positions of the above genes with the SNPs determined by previous GWASs [13, 15] to be significantly associated with milk protein and fat traits of dairy cattle. Then, we performed Gene Ontology (GO) functional annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis of the genes containing the identified indels using KOBAS tool [30]. During the functional annotation of genes, a P value of <0.05 determined by Fisher’s exact test was set as the criteria for significance. In addition, we downloaded the genes involved in well-known pathways for protein, fat, and fatty acid metabolism from the KEGG pathway website (, consisting the mTOR, insulin, AMPK, PPAR, Jak-STAT, PI3K-Akt, MAPK, and TGF-β signaling pathways, and further determined whether these genes were included in the gene list which were identified by integrated analysis of QTL and GWAS data.


Data production and read mapping

After resequencing of the genomic DNA samples for eight Holstein bulls, a total of 2,303,781,448 short reads were obtained, and the average proportion of uniquely mapped reads was as high as 82.62%. The sequencing depth ranged from 7× to 10×, with an average of 8.1×, and the genome coverage was approximately 98% in each individual (Table 2).

Table 2. Summary of sequencing, mapping statistics, and indel count for individuals.

Indel detection

In total, 912,302 nonredundant short indels (1–49 bp) were obtained from the eight bulls after filtering using SAMtools. The number of indels detected in each bull varied from 323,490 to 411,070, with an average of 365,169 (Table 2). Among them, we focused on the unique indels that were polymorphic between the bulls with high and low EBVs of milk protein percentage and fat percentage, which were inferred to be relevant to the milk protein and fat trait. As a result, after filtering, we obtained 3,625 common differential indels that were polymorphic between the high and low groups of bulls. The distribution of allele frequencies of these common differential indels was calculated and presented in Fig 1. We further compared our results with the variants in the SNP database (NCBI dbSNP, updated in July 11, 2016) and found that the majority of the common differential indels (76.67%) were novel.

Fig 1. Distribution of allele frequencies for the common differential indels between high and low groups.

The first line represents insertion frequency for each common differential indel locus in low group, and the second line represents insertion frequency for each common differential indel locus in high group.

To estimate the accuracy and reliability of our resequencing data, PCR amplification and Sanger sequencing were applied to validate 20 randomly selected indels with a range of lengths from 1 to 21 bp with the same genomic DNA samples of the eight bulls as used for resequencing. We found that the sequences of the 14 indels were consistent with those from our whole-genome resequencing data, indicating that the accuracy of our results was 70% (S1 Table). The primer sequences are shown in the supplemental material (S1 Table).

Functional annotation and genomic distribution

Among the 3,625 common indels that were polymorphic between the high and low bull groups, the largest indel was 27 bp, but most (97.96%) of them were less than 10 bp. Single-bp indels (44.47%) were the most common, and the number of insertions (n = 1,842) was slightly more than the number of deletions (n = 1,783). The cumulative length of the indels reached 9 kb. The distribution of indel length is shown in Fig 2. The indels were proportionally distributed among autosomes corresponding to chromosome length; the number of indels was lower on X chromosome than on autosomes (Fig 3).

Fig 2. Distribution of indel length (bp).

The indel represents the common differential one between the high and low groups.

Fig 3. The number of indels in each chromosome.

The indel represents the common differential one between the high and low groups.

Functional annotation of the 3,625 common differential indels was performed against known genes from the Ensembl website using ANNOVAR software (Table 3). As a result of this, the majority (68.634%) of indels was found to be located in intergenic regions, and the remaining 1,137 indels were within 767 genes, including in introns (29.379%), exons (0.138%), 3’ untranslated region (3’ UTR) (0.276%), and other regions (ncRNAs, splicing site, etc.). Regarding the exonic indels that were identified, there were 2 frameshift indels with lengths not divisible by 3, and 3 nonframeshift indels, called 3n indels [31], with lengths divisible by 3.

Table 3. Annotation of 3,625 common differential indels for eight bulls.

Integrated analysis with known QTL and significant SNPs identified by GWASs

In this part of the study, we first compared the genetic positions of the 1,137 indels located in 767 genes with QTL known to be associated with milk protein and fat traits in the Cattle QTL database ( We found that 869 indels, corresponding to 605 unique genes, overlapped with the confidence intervals of 538 known QTL. Notably, 542 of them were located in 372 functional genes that were close to the peak position of the known QTL with a genetic distance of less than 5 cM. Furthermore, of the 1,137 indels, 926 located in 635 genes were close (< 5 Mb) to significant SNPs identified by previous GWASs. Hence, a total of 695 of the genes in these two groups were retained for further analyses.

Gene Ontology and pathway analyses

We performed GO enrichment and KEGG pathway analyses using the KOBAS tool based on the 695 genes mentioned above. In total, 306 GO terms and 29 KEGG pathways were significantly enriched (P<0.05; S2 Table), mainly including ‘inorganic cation transmembrane transport’, metabolic process related to protein and lipid metabolism such as ‘lipid biosynthetic process’, ‘phosphatidylglycerol biosynthetic process’, ‘glycosaminoglycan biosynthetic process’, ‘protein kinase A binding’, ‘protein kinase C activity’ and ‘protein phosphatase inhibitor activity’ etc., and terms associated with the cell cycle and cell adhesion such as ‘regulation of cell cycle G1/S phase transition’ and ‘protein complex involved in cell adhesion’. In addition, significant enriched KEGG pathways consisted of ‘thyroid hormone synthesis’, ‘aldosterone synthesis and secretion’, ‘GnRH signaling pathway’, MAPK signaling pathway, ‘gap junction’, and ‘phosphatidylinositol signaling system’ et al.

We also found that 34 out of the 695 genes with an indel polymorphic between the high and low groups of bulls were involved in several well-known pathways for protein and fat synthesis and metabolism. These included the mTOR and Jak-STAT signaling pathways, both of which play pivotal roles in the synthesis of milk proteins; the AMPK signaling pathway as an energy supplement system; the insulin and PPAR signaling pathway for fat synthesis; the PI3K-Akt signaling pathway together with the MAPK signaling pathway, which play roles in cellular functions, such as transcription, translation, and cell proliferation; and the TGF-β signaling pathway, which is essential for mammary epithelial cell proliferation.

Identification of candidate genes associated with milk protein and fat traits

Based on the known QTL and significant SNPs identified by previous GWASs, along with biological functions of genes, 11 genes that containing at least one common differential indel locus were considered as promising candidates for an association with milk protein and fat traits. Of them, four genes, including Acyl-CoA synthetase bubblegum family member 2 (ACSBG2), nuclear factor kappa B subunit 2 (NFKB2), TBC1 domain family member 1 (TBC1D1) and mitogen-activated protein kinase kinase kinase 1 (MAP3K1), were located near to both the peak locations of the reported QTL and significant SNPs for milk protein and fat traits, and were involved in the well-known protein and lipid metabolic pathways. The Fc fragment of IgG receptor IIb (FCGR2B) and centromere protein E (CENPE) genes were located in QTL regions and near to the significant SNPs reported to be associated with milk protein and fat traits. Two genes, solute carrier family 30 member 2 (SLC30A2) and UDP-glucose 6-dehydrogenase (UGDH) were very close to the peak position of QTL and significant SNPs for milk protein and fat traits. Two genes, nemo like kinase (NLK) and angiopoietin 1 (ANGPT1), were near to the significant SNPs for milk protein percentage, and participated in the MAPK and PI3K-Akt signaling pathways. In addition, retinol saturase (RETSAT) was also near to significant SNPs for milk protein and fat traits. The chromosomal positions of these 11 genes and detailed information about their nearest QTL and significant SNPs were shown in Tables A and B in S1 File. The indel genotypes in these 11 genes between high and low groups were shown in Table 4.


With the rapid development of sequencing technologies and bioinformatics tools, it is now feasible to sequence the complete genome of many species, such as human, mouse and Arabidopsis [3234]. Compared with traditional technologies, NGS has numerous major advantages, including enabling sequencing that is cheaper and can be performed on a larger scale. Moreover, unlike GWASs, it can identify numerous rare and common variants as genetic causes of complex traits, especially in human Mendelian disorders [35, 36]. In this paper, we presented the indels in eight Holstein bulls determined using Illumina sequencing, with the aim of detecting the distribution of variants across the cattle genome and identifying the candidate genes for milk protein and fat traits. To our knowledge, this is one of the first studies involving whole-genome resequencing for milk protein and fat traits in dairy cattle; based on the indels that were polymorphic between the high and low groups, we identified a few candidate genes potentially associated with milk protein and fat traits.

Indel discovery

Since the first report of the construction of a human indel map [6], numerous indels have been found in different species. In our research, we identified approximately 365,000 short indels for each bull, which was much more than for the Fleckvieh bull (n = 115,000) [37], but slightly fewer than for the Japanese native cattle Kuchinoshima-Ushi (n = 629,256) [38], Chikso (Korean brindle cattle) (n = 551,363) [24], Korean brown cattle (n = 697,048) [39], Jeju black cattle (n = 702,965) [39], Korean Holstein (n = 631,332) [39] and French dairy breeds (n = 873,372) [25], which was probably due to the sequencing fold coverage and different genetic backgrounds. In addition, in this study, the indels were validated by sequencing at a rate as high as 70%, which was the same as in a previous study (70%) [24]. In the current research, approximately 89% of insertions and 88% of deletions were less than 4 bp, and the number of indels in each chromosome declined with decreasing chromosome length. All of these findings were similar to those in previous studies [24, 37, 38]. The distributions in each chromosome and the lengths of the indels determined here were in line with recent findings in other species, such as chicken, Tibetan macaque, Quarter Horse, and sorghum [79, 11, 40].

Candidate genes for milk composition

In the present study, we performed integrated analysis of previously reported QTL, significant SNPs identified by previous GWASs, and the biological functions of genes, which enabled us to identify 11 candidate genes potentially associated with milk protein and fat traits. These genes contained at least one indel that was polymorphic between the high and low groups as described below.

Candidate genes containing exonic indel.

Indels in a gene coding region, especially frameshift indels, had a major influence on gene function [41, 42]. Among the 11 candidate genes, the FCGR2B gene with one frameshift indel in exon 7 encodes a transmembrane glycoprotein that is a member of the low-affinity immunoglobulin gamma Fc receptors and plays a key role in phagocytosis and the clearing of immune complexes. Bovine mammary gland, which is a product of the innate immune system, is active during lactation. FCGR2B might affect the milk protein and fat traits through playing an indispensable role in preventing the individuals from disease infection during lactation. CENPE with two exonic indels (exons 58 and 68) and seven intronic indels was a component of the kinetochore complex and part of the spindle assembly checkpoint those were responsible for maintain chromosome congression and mitotic checkpoint [43]. It was a candidate gene for triple-negative breast cancer, and played a crucial role in suppressing tumorigenesis [44] Depletion of CENPE resulted in promoting the Drosophila epithelial tissues overgrowth [45].

Candidate genes with indel in regulatory regions.

RETSAT also known as PPSIG with one indel in 3’ UTR and one indel in intron, was found to be a target gene of PPARα which played an essential role in lipid metabolism and energy balance [46], and responsible for adipogenesis promotion and lipid deposition [47]. ACSBG2 with one indel in upstream region and one indel in intron encodes a protein that belongs to the acyl-CoA synthetase family, which plays a pivotal role in lipid metabolism and lipid droplet formation. Claire et al. identified ACSBG2 as a candidate gene that was significantly related to abdominal fat deposition in chicken [48]. In addition, Sun et al. demonstrated that one SNP located in an intron of ACSBG2 was also significantly associated with yolk development in chicken [49]. NFKB2 with one indel in upstream region encodes a protein subunit of the transcription factor complex nuclear factor-kappa-B (NFkB), which regulates the transcription of genes related to cell differentiation and proliferation. It was indispensable for normal development of the bovine mammary gland [50]. In our previous study, we identified that a SNP (ARS-BFGL-NGS-107403) close to NFKB2 gene was significant associated with C14:1 and C14 indices of milk fatty acids in Chinese Holstein population, and the indel identified in current study was only 0.27 Mb away from the significant SNP mentioned above [51].

Candidate genes with intronic indel.

TBC1D1 with an intronic indel is a Rab GTPase activating protein, which was involved in AMPK signaling pathway and was important to maintain the glucose and energy homeostasis [52]. Dysfunction of TBC1D1 in mice induced that lipogeneic gene expressions were up-regulated through IGF1R–PKB–mTOR signaling pathway, including FASN, ACL, ACC1 and SCD [53]. Polymorphism of TBC1D1 was found linked to the severe obesity in human [54]. NLK with four intronic indels is a serine/threonine-protein kinase, belonging to the MAP kinase (MAPK) subfamily, and is involved in MAPK signaling pathway. Yuan et al. identified the NLK as a negative regulator of mTORC1 complex which was important to the milk protein synthesis [55]. Polymorphisms of NLK were strongly associated with obesity in human and fatty acid composition in pig [56, 57].

MAP3K1 also named as MEKK1 with one intronic indel, encodes a serine/threonine kinase and is involved in MAPK signaling pathway. Also, it was demonstrated that MAP3K1 was related to human breast cancer progression [58, 59], thus, MAP3K1 may play a vital role in keeping bovine mammary gland function well. SLC30A2 was also known as the zinc transporter ZnT2 with one intronic indel, and played an important role in transporting of Zn into mammary epithelial cells and secreting into the milk [60]. It was found that loss of ZnT2 can impair mammary gland development and reduce milk composition in mouse [61, 62], and polymorphisms of ZnT2 were associated with human mammary gland dysfunction [63, 64].

ANGPT1 also named as Ang1 with one indel in intron, was one of the ligands for TEK Receptor Tyrosine Kinase (TEK) and involved in the PI3K-Akt signaling pathway that is correlated to milk protein synthesis [55]. Mice deficient with ANGPT1 were shown poor vascular network [65]. Bovine mammary gland was with an advanced microvasculature which were requisite for mammary microcirculation of nutrients transport and supply to maintain milk synthesis and secretory [66]. UGDH with two intronic indels also located within a QTL region on chromosome 6 linked to milk protein and fat traits, and its biological function and genetic effects revealed that it was a key candidate gene associated with such traits [67].

Although we identified 11 candidate genes based on the indels in genic and gene flanking regions that were polymorphic between high and low groups, further association analysis should be performed to validate functionally important indels and to characterize more important molecular markers of milk protein and fat traits. Along with the development of the human ENCODE project (2011), similar analyses could be performed for the intergenic variants. This would make it possible to identify more important potential functional variants within the intergenic regions of the genome that play major roles in regulating genes affecting bovine milk composition traits. Therefore, the complete analysis based on a large number of indels in our study should constitute a valuable resource for improving the molecular breeding of dairy cattle.


In this study, we resequenced eight Holstein bulls and detected 912,302 indels through cattle genome. We also successfully identified 3,625 indels that were polymorphic between high and low groups throughout the genome. Based on the integrated analysis of the differential indels located in genic and flanking regions, previously reported QTL, and significant SNPs identified by GWASs, along with well-known pathways and biological functions of genes, we identified 11 promising candidate genes that may affect milk protein and fat traits of dairy cattle, including FCGR2B, CENPE, RETSAT, ACSBG2, NFKB2, TBC1D1, NLK, MAP3K1, SLC30A2, ANGPT1 and UGDH. And the indels of 11 genes identified in this study will provide a useful resource for future association studies to discovery the major alleles affecting milk composition traits in dairy cattle.

Supporting Information

S1 File. This file contains the detailed information of QTL, significant SNPs and well-known pathways of 11 candidate genes.

Table A. Detailed information on the reported QTL containing the 8 genes for the high and low groups regarding EBV for milk protein and fat traits. Table B. Detailed information of the nearest and most significant SNPs from previous GWASs and well-known pathways to the 11 genes between the high and low groups for milk protein and fat traits.


S2 Table. Functional enrichment of 695 genes containing the indels that were polymorphic between the high and low groups by KOBAS tool.



The authors would also like to appreciate the kind help from official Dairy Data Center of China in providing the official EBVs data.

Author Contributions

  1. Conceptualization: DS QZ.
  2. Data curation: JJ YG YH.
  3. Formal analysis: JJ YG YH.
  4. Funding acquisition: DS.
  5. Investigation: JJ YG.
  6. Methodology: DS YH.
  7. Project administration: DS SZ QZ.
  8. Resources: JJ DS.
  9. Software: YH.
  10. Supervision: DS.
  11. Validation: YG WL.
  12. Writing – original draft: JJ.
  13. Writing – review & editing: DS.


  1. 1. Montgomery SB, Goode DL, Kvikstad E, Albers CA, Zhang ZD, Mu XJ, et al. The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes. Genome research. 2013;23(5):749–61. pmid:23478400
  2. 2. Weischenfeldt J, Symmons O, Spitz F, Korbel JO. Phenotypic impact of genomic structural variation: insights from and for human disease. Nature reviews Genetics. 2013;14(2):125–38. pmid:23329113
  3. 3. Collins FS, Drumm ML, Cole JL, Lockwood WK, Vande Woude GF, Iannuzzi MC. Construction of a general human chromosome jumping library, with application to cystic fibrosis. Science. 1987;235(4792):1046–9. pmid:2950591
  4. 4. Warren ST, Zhang F, Licameli GR, Peters JF. The fragile X site in somatic cell hybrids: an approach for molecular cloning of fragile sites. Science. 1987;237(4813):420–3. pmid:3603029
  5. 5. Grobet L, Martin LR, Poncelet D, Pirottin D, Brouwers B, Riquet J, et al. A deletion in the bovine myostatin gene causes the double-muscled phenotype in cattle. Nature genetics. 1997;17(1):71–4. pmid:9288100
  6. 6. Mills RE, Luttig CT, Larkins CE, Beauchamp A, Tsui C, Pittard WS, et al. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome research. 2006;16(9):1182–90. pmid:16902084
  7. 7. Doan R, Cohen ND, Sawyer J, Ghaffari N, Johnson CD, Dindot SV. Whole-genome sequencing and genetic variant analysis of a Quarter Horse mare. BMC genomics. 2012;13(1):1.
  8. 8. Fan W-L, Ng CS, Chen C-F, Lu M-YJ, Chen Y-H, Liu C-J, et al. Genome-wide patterns of genetic variation in two domestic chickens. Genome biology and evolution. 2013;5(7):1376–92. pmid:23814129
  9. 9. Yan Y, Yi G, Sun C, Qu L, Yang N. Genome-wide characterization of insertion and deletion variation in chicken using next generation sequencing. PloS one. 2014;9(8):e104652. pmid:25133774
  10. 10. Lai J, Li R, Xu X, Jin W, Xu M, Zhao H, et al. Genome-wide patterns of genetic variation among elite maize inbred lines. Nature genetics. 2010;42(11):1027–30. pmid:20972441
  11. 11. Zheng L-Y, Guo X-S, He B, Sun L-J, Peng Y, Dong S-S, et al. Genome-wide patterns of genetic variation in sweet and grain sorghum (Sorghum bicolor). Genome biology. 2011;12(11):1.
  12. 12. Lam H-M, Xu X, Liu X, Chen W, Yang G, Wong F-L, et al. Resequencing of 31 wild and cultivated soybean genomes identifies patterns of genetic diversity and selection. Nature genetics. 2010;42(12):1053–9. pmid:21076406
  13. 13. Jiang L, Liu J, Sun D, Ma P, Ding X, Yu Y, et al. Genome wide association studies for milk production traits in Chinese Holstein population. PloS one. 2010;5(10):e13661. pmid:21048968
  14. 14. Mai M, Sahana G, Christiansen F, Guldbrandtsen B. A genome-wide association study for milk production traits in Danish Jersey cattle using a 50K single nucleotide polymorphism chip. Journal of animal science. 2010;88(11):3522–8. pmid:20656975
  15. 15. Cole JB, Wiggans GR, Ma L, Sonstegard TS, Lawlor TJ Jr., Crooker BA, et al. Genome-wide association analysis of thirty one production, health, reproduction and body conformation traits in contemporary U.S. Holstein cows. BMC genomics. 2011;12:408. pmid:21831322
  16. 16. Schopen G, Visker M, Koks P, Mullaart E, van Arendonk J, Bovenhuis H. Whole-genome association study for milk protein composition in dairy cattle. Journal of dairy science. 2011;94(6):3148–58. pmid:21605784
  17. 17. Meredith BK, Kearney FJ, Finlay EK, Bradley DG, Fahey AG, Berry DP, et al. Genome-wide associations for milk production and somatic cell score in Holstein-Friesian cattle in Ireland. BMC genetics. 2012;13(1):1.
  18. 18. Guo J, Jorjani H, Carlborg Ö. A genome-wide association study using international breeding-evaluation data identifies major loci affecting production traits and stature in the Brown Swiss cattle breed. BMC genetics. 2012;13(1):1.
  19. 19. Minozzi G, Nicolazzi EL, Stella A, Biffani S, Negrini R, Lazzari B, et al. Genome wide analysis of fertility and production traits in Italian Holstein cattle. PLoS One. 2013;8(11):e80219. pmid:24265800
  20. 20. Georges M, Nielsen D, Mackinnon M, Mishra A, Okimoto R, Pasquino AT, et al. Mapping quantitative trait loci controlling milk production in dairy cattle by exploiting progeny testing. Genetics. 1995;139(2):907–20. pmid:7713441
  21. 21. Grisart B, Farnir F, Karim L, Cambisano N, Kim J-J, Kvasz A, et al. Genetic and functional confirmation of the causality of the DGAT1 K232A quantitative trait nucleotide in affecting milk yield and composition. Proceedings of the National Academy of Sciences of the United States of America. 2004;101(8):2398–403. pmid:14983021
  22. 22. Blott S, Kim J-J, Moisio S, Schmidt-Küntzel A, Cornet A, Berzi P, et al. Molecular dissection of a quantitative trait locus: a phenylalanine-to-tyrosine substitution in the transmembrane domain of the bovine growth hormone receptor is associated with a major effect on milk yield and composition. Genetics. 2003;163(1):253–66. pmid:12586713
  23. 23. Stothard P, Choi J-W, Basu U, Sumner-Thomson JM, Meng Y, Liao X, et al. Whole genome resequencing of black Angus and Holstein cattle for SNP and CNV discovery. BMC genomics. 2011;12(1):1.
  24. 24. Choi J-W, Liao X, Park S, Jeon H-J, Chung W-H, Stothard P, et al. Massively parallel sequencing of Chikso (Korean brindle cattle) to discover genome-wide SNPs and InDels. Molecules and cells. 2013;36(3):203–11. pmid:23912596
  25. 25. Boussaha M, Esquerré D, Barbieri J, Djari A, Pinton A, Letaief R, et al. Genome-wide study of structural variants in bovine Holstein, Montbéliarde and Normande dairy breeds. PloS one. 2015;10(8):e0135931. pmid:26317361
  26. 26. Patel RK, Jain M. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PloS one. 2012;7(2):e30619. pmid:22312429
  27. 27. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–60. pmid:19451168
  28. 28. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9. pmid:19505943
  29. 29. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic acids research. 2010;38(16):e164–e. pmid:20601685
  30. 30. Xie C, Mao X, Huang J, Ding Y, Wu J, Dong S, et al. KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases. Nucleic acids research. 2011;39(suppl 2):W316–W22.
  31. 31. Hu J, Ng PC. SIFT Indel: predictions for the functional effects of amino acid insertions/deletions in proteins. PLoS One. 2013;8(10):e77940. pmid:24194902
  32. 32. Frazer KA, Eskin E, Kang HM, Bogue MA, Hinds DA, Beilharz EJ, et al. A sequence-based variation map of 8.27 million SNPs in inbred mouse strains. Nature. 2007;448(7157):1050–3. pmid:17660834
  33. 33. Kim S, Plagnol V, Hu TT, Toomajian C, Clark RM, Ossowski S, et al. Recombination and linkage disequilibrium in Arabidopsis thaliana. Nature genetics. 2007;39(9):1151–5. pmid:17676040
  34. 34. Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, et al. Whole-genome patterns of common DNA variation in three human populations. Science. 2005;307(5712):1072–9. pmid:15718463
  35. 35. Metzker ML. Sequencing technologies—the next generation. Nature reviews genetics. 2010;11(1):31–46. pmid:19997069
  36. 36. Koboldt DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER. The next-generation sequencing revolution and its impact on genomics. Cell. 2013;155(1):27–38. pmid:24074859
  37. 37. Eck SH, Benet-Pagès A, Flisikowski K, Meitinger T, Fries R, Strom TM. Whole genome sequencing of a single Bos taurus animal for single nucleotide polymorphism discovery. Genome biology. 2009;10(8):1.
  38. 38. Kawahara-Miki R, Tsuda K, Shiwa Y, Arai-Kichise Y, Matsumoto T, Kanesaki Y, et al. Whole-genome resequencing shows numerous genes with nonsynonymous SNPs in the Japanese native cattle Kuchinoshima-Ushi. BMC genomics. 2011;12(1):1.
  39. 39. Choi J-W, Liao X, Stothard P, Chung W-H, Jeon H-J, Miller SP, et al. Whole-genome analyses of Korean native and Holstein cattle breeds by massively parallel sequencing. PloS one. 2014;9(7):e101127. pmid:24992012
  40. 40. Fan Z, Zhao G, Li P, Osada N, Xing J, Yi Y, et al. Whole-genome sequencing of Tibetan macaque (Macaca thibetana) provides new insight into the macaque evolutionary history. Molecular biology and evolution. 2014:msu104.
  41. 41. Ng PC, Levy S, Huang J, Stockwell TB, Walenz BP, Li K, et al. Genetic variation in an individual human exome. PLoS Genet. 2008;4(8):e1000160. pmid:18704161
  42. 42. Pelak K, Shianna KV, Ge D, Maia JM, Zhu M, Smith JP, et al. The characterization of twenty sequenced human genomes. PLoS Genet. 2010;6(9):e1001111. pmid:20838461
  43. 43. Weaver BAA, Silk AD, Montagna C, Verdier-Pinard P, Cleveland DW. Aneuploidy acts both oncogenically and as a tumor suppressor. Cancer Cell. 2007;11(1):25–36. pmid:17189716
  44. 44. Kung PP, Martinez R, Zhu Z, Zager M, Blasina A, Rymer I, et al. Chemogenetic evaluation of the mitotic kinesin CENP-E reveals a critical role in triple-negative breast cancer. Mol Cancer Ther. 2014;13(8):2104–15. pmid:24928852
  45. 45. Clemente-Ruiz M, Muzzopappa M, Milan M. Tumor suppressor roles of CENP-E and Nsl1 in Drosophila epithelial tissues. Cell cycle. 2014;13(9):1450–5. pmid:24626182
  46. 46. Lee SS, Chan W-Y, Lo CK, Wan DC, Tsang DS, Cheung W-T. Requirement of PPARα in maintaining phospholipid and triacylglycerol homeostasis during energy deprivation. Journal of lipid research. 2004;45(11):2025–37. pmid:15342691
  47. 47. Moise AR, Lobo GP, Erokwu B, Wilson DL, Peck D, Alvarez S, et al. Increased adiposity in the retinol saturase-knockout mouse. FASEB J. 2010;24(4):1261–70. pmid:19940255
  48. 48. D’Andre HC, Paul W, Shen X, Jia X, Zhang R, Sun L, et al. Identification and characterization of genes that control fat deposition in chickens. Journal of animal science and biotechnology. 2013;4(1):1.
  49. 49. Sun C, Lu J, Yi G, Yuan J, Duan Z, Qu L, et al. Promising Loci and Genes for Yolk and Ovary Weight in Chickens Revealed by a Genome-Wide Association Study. PloS one. 2015;10(9):e0137145. pmid:26332579
  50. 50. Sun X-F, Zhang H. NFKB and NFKBI polymorphisms in relation to susceptibility of tumour and other diseases. 2007.
  51. 51. Li C, Sun D, Zhang S, Wang S, Wu X, Zhang Q, et al. Genome wide association study identifies 20 novel promising genes associated with milk fatty acid traits in Chinese Holstein. PloS one. 2014;9(5):e96186. pmid:24858810
  52. 52. Chen S, Murphy J, Toth R, Campbell DG, Morrice NA, Mackintosh C. Complementary regulation of TBC1D1 and AS160 by growth factors, insulin and AMPK activators. Biochem J. 2008;409(2):449–59. pmid:17995453
  53. 53. Chen L, Chen Q, Xie B, Quan C, Sheng Y, Zhu S, et al. Disruption of the AMPK–TBC1D1 nexus increases lipogenic gene expression and causes obesity in mice via promoting IGF1 secretion. Proceedings of the National Academy of Sciences. 2016;113(26):7219–24.
  54. 54. Stone S, Abkevich V, Russell DL, Riley R, Timms K, Tran T, et al. TBC1D1 is a candidate for a severe obesity gene and evidence for a gene/gene interaction in obesity predisposition. Human molecular genetics. 2006;15(18):2709–20. pmid:16893906
  55. 55. Bionaz M, Loor JJ. Gene networks driving bovine mammary protein synthesis during the lactation cycle. Bioinformatics and biology insights. 2011;5:83–98. pmid:21698073
  56. 56. Pei YF, Zhang L, Liu Y, Li J, Shen H, Liu YZ, et al. Meta-analysis of genome-wide association data identifies novel susceptibility loci for obesity. Hum Mol Genet. 2014;23(3):820–30. pmid:24064335
  57. 57. Supakankul P, Mekchay S. Association of NLK polymorphisms with intramuscular fat content and fatty acid composition traits in pigs. Meat science. 2016;118:61–5. pmid:27050409
  58. 58. Cuevas BD, Winter-Vann AM, Johnson NL, Johnson GL. MEKK1 controls matrix degradation and tumor cell dissemination during metastasis of polyoma middle-T driven mammary cancer. Oncogene. 2006;25(36):4998–5010. pmid:16568086
  59. 59. Glubb DM, Maranian MJ, Michailidou K, Pooley KA, Meyer KB, Kar S, et al. Fine-scale mapping of the 5q11. 2 breast cancer locus reveals at least three independent risk variants regulating MAP3K1. The American Journal of Human Genetics. 2015;96(1):5–20. pmid:25529635
  60. 60. Lopez V, Kelleher SL. Zinc transporter-2 (ZnT2) variants are localized to distinct subcellular compartments and functionally transport zinc. Biochem J. 2009;422(1):43–52. pmid:19496757
  61. 61. Lee S, Hennigar SR, Alam S, Nishida K, Kelleher SL. Essential Role for Zinc Transporter 2 (ZnT2)-mediated Zinc Transport in Mammary Gland Development and Function during Lactation. J Biol Chem. 2015;290(21):13064–78. pmid:25851903
  62. 62. Dempsey C, McCormick NH, Croxford TP, Seo YA, Grider A, Kelleher SL. Marginal maternal zinc deficiency in lactating mice reduces secretory capacity and alters milk composition. J Nutr. 2012;142(4):655–60. pmid:22357740
  63. 63. Seo YA, Kelleher SL. Functional analysis of two single nucleotide polymorphisms in SLC30A2 (ZnT2): implications for mammary gland function and breast disease in women. Physiol Genomics. 2010;42A(4):219–27. pmid:20858712
  64. 64. Alam S, Hennigar SR, Gallagher C, Soybel DI, Kelleher SL. Exome Sequencing of SLC30A2 Identifies Novel Loss- and Gain-of-Function Variants Associated with Breast Cell Dysfunction. J Mammary Gland Biol Neoplasia. 2015;20(3–4):159–72. pmid:26293594
  65. 65. Sato TN, Tozawa Y, Deutsch U, Wolburg-Buchholz K, Fujiwara Y, Gendron-Maguire M, et al. Distinct roles of the receptor tyrosine kinases Tie-1 and Tie-2 in blood vessel formation. Nature. 1995;376(6535):70–4. pmid:7596437
  66. 66. Prosser CG, Davis SR, Farr VC, Lacasse P. Regulation of blood flow in the mammary microvasculature. J Dairy Sci. 1996;79(7):1184–97. pmid:8872713
  67. 67. Xu Q, Mei G, Sun D, Zhang Q, Zhang Y, Yin C, et al. Detection of genetic association and functional polymorphisms of UGDH affecting milk production trait in Chinese Holstein cattle. BMC genomics. 2012;13(1):1.