A system-based analysis of the genetic determinism of udder conformation and health phenotypes across three French dairy cattle breeds

Using GWAS to identify candidate genes associated with cattle morphology traits at a functional level is challenging. The main difficulty of identifying candidate genes and gene interactions associated with such complex traits is the long-range linkage disequilibrium (LD) phenomenon reported widely in dairy cattle. Systems biology approaches, such as combining the Association Weight Matrix (AWM) with a Partial Correlation in an Information Theory (PCIT) algorithm, can assist in overcoming this LD. Used in a multi-breed and multi-phenotype context, the AWM-PCIT could aid in identifying udder traits candidate genes and gene networks with regulatory and functional significance. This study aims to use the AWM-PCIT algorithm as a post-GWAS analysis tool with the goal of identifying candidate genes underlying udder morphology. We used data from 78,440 dairy cows from three breeds and with own phenotypes for five udder morphology traits, five production traits, somatic cell score and clinical mastitis. Cows were genotyped with medium (50k) or low-density (7 to 10k) chips and imputed to 50k. We performed a within breed and trait GWAS. The GWAS showed 9,830 significant SNP across the genome (p < 0.05). Five thousand and ten SNP did not map a gene, and 4,820 SNP were within 10-kb of a gene. After accounting for 1SNP:1gene, 3,651 SNP were within 10-kb of a gene (set1), and 2,673 significant SNP were further than 10-kb of a gene (set2). The two SNP sets formed 6,324 SNP matrix, which was fitted in an AWM-PCIT considering udder depth/ development as the key trait resulting in 1,013 genes associated with udder morphology, mastitis and production phenotypes. The AWM-PCIT detected ten potential candidate genes for udder related traits: ESR1, FGF2, FGFR2, GLI2, IQGAP3, PGR, PRLR, RREB1, BTRC, and TGFBR2.


Introduction
Genetic architecture of complex phenotypes in cattle includes many loci affecting a given trait [1]. Most of these loci have small effects, but few segregating loci have moderate-to-large effects possibly due to epistatic effects, varying selection goals or recent selection for the favorable mutant allele. Moreover, markers collectively capture most but not all additive genetic variance for phenotypes. The incomplete variance capture may be due to causal mutations with low allele frequencies and therefore in incomplete linkage disequilibrium (LD) with markers [2]. To reduce this LD, we can either do a between breed analysis with a large sample of genotyped cows [3] or combine the results of a within breed GWAS in a multi-breed context. This is possible because the cost of genotyping is decreasing thus allowing many breeds, including those of medium population size, to be genotyped rapidly primarily for genomic selection purpose. These large populations of genotyped cows with own performances allows us to: (1) detect QTL for newly recorded traits or traits previously not studied; (2) carry out large confirmation studies for conventional traits. Previous studies have demonstrated that polymorphic sites that segregate within and across bovine populations can be studied using imputed low-to-dense genotypes [4,5]. Such genotypes have been used in model organisms and dairy cattle leading to the identification of candidate causal variants or closely neighboring variants that control complex phenotypes [6,7]. These studies have been useful for identifying QTL regions and probable genes associated with a phenotype. So far, however, there have been few validation studies of the vast number of putative variants across and between breeds and amongst multiple phenotypes. This study uses 50k SNP data to validate such variants using the Association Weight Matrix (AWM) [8] approach as a post GWAS analysis tool. The AWM is a systems biology approach for the genetic dissection of complex traits based on applying gene network theory to the results from GWAS. Hence, if the AWM SNP matrix is used in combination with a Partial Correlation (PC) in an Information Theory (IT) framework, and for correlated phenotypes, then it is possible to generate gene networks with regulatory and functional significance for udder related phenotypes.
Despite the limitations of the chip density, previous studies have shown the usefulness of the AWM to identify candidate genes in cattle, e.g. [9][10] and corroborated across different species, e.g. [11][12] in independent studies using the 50k marker density.
In this study, we report results based on GWAS analysis for mammary conformation, milk production and health phenotypes for 78,440 dairy cows. A multi-step validation by combining the results of single SNP, single phenotype, in a multiple-breed context using the AWM-P-CIT algorithm was performed. The aim was to identify the genes associated with mammary conformation and health phenotypes in Holstein, Montbeliarde, and Normande breeds, accounting for milk production as supportive traits. We further explored gene networks with the main gene ontology domains including biological processes, cellular component, and molecular functions.

Phenotypes
The cow sample was comprised of 46,732 Holstein, 20,141 Montbeliarde, and 11,965 Normande all with known parents. Phenotypes were yield deviations as produced by French national evaluation system [13]. A yield deviation is a performance adjusted for all non-genetic effects of the model [14]. In case of repeated records, a yield deviation is adjusted for the permanent environmental effect and averaged per animal. The 12 traits were: fore udder attachment (FUA), udder depth or development (UDD), udder cleft (UC), udder balance (UB), front teat placement (FTP), milk (MY), fat (FAT and FAT%), protein (PROT and PROT%) clinical mastitis (CM) and somatic cell score (SCS). SCS are derived from Somatic Cell Counts (SCC). SCC are obtained at each monthly test-day, and SCS are defined as SCS = 3+log 2 (SCC/ 100,000). CM events are declared by the farmer and recorded by the technician at each testday. The analysis trait is defined by 1 if the cow has at least one clinical mastitis event in the lactation, and 0 if no there is no event. SCS and CM are repeated over lactations. The model for obtaining yield deviations was different according to the trait and all models included the genetic additive effect. The model for SCS and CM (recorded in the first three lactations), included the effect of herd x year, age at calving x parity x year, month of calving x parity x year, preceding days dry (for later parities), and a permanent environmental effect. All models assumed heterogeneous variances depending on herd-year. Each cow had one record for each trait. UDD can be either udder depth (Holstein and Normande) or udder development (Montbeliarde), but treated as the same trait because they have similar definitions. Depending on the trait, the number of animals with phenotypes ranged from 7,671 to 11,965 in Normande, 13,879 to 20,141 in Montbeliarde, and 32,491 to 46,732 in Holstein (Table 1). We estimated variance components by fitting a multiple trait REML animal model as implemented in the DMU software [15]. This study is based on already existing data hence we did not require ethical approval.

Genotyping, quality control, and imputation
All cows were genotyped using Illumina BovineSNP50 BeadChip (50k) or Illumina BovineLD v.2 BeadChip. We used the UMD3.1 assembly of the bovine genome [16] for SNP chromosomal positions. We did not consider mitochondrial, X-chromosomal and Y-chromosomal SNP, as well as unmapped SNP for further analyses. We examined 43,800 SNP currently used in French genomic evaluation procedure [13]. Selection criteria was: call rate higher than 99%, minor allele frequency (MAF) higher than 2% in at least one of the three breeds, lack of Hardy-Weinberg Equilibrium (P < 10 −4 ), technical quality assessed by their very low rate of Mendelian mismatch between parents and progeny and known position of genome assembly. We imputed cows genotyped from BovineLD v.2 BeadChip to 50k to obtain a genotype without missing information. This step was performed within breed in the conventional pipeline for genomic selection [13]. We imputed using FImpute software [17], and reference population included all 50k genotyped male and female animals per breed. Imputation error rate, measured in routine evaluation situation (and not in this study), varied from 0.2 to 2.5% depending on whether parents were genotyped. Across breeds, best imputation accuracies were observed in Montbeliarde which has the highest proportion of sires and dams genotyped with 50k, while accuracies were lower in Holstein, a breed with a smaller portion of genotyped dams, a lower percentage of 50k genotyped parents, and even a non-zero proportion of nongenotyped (usually foreign) sires.

Statistical framework
GWAS was done within breed with the Mixed Linear Model Association (MLMA) method as implemented in GCTA [18]. Because phenotypes were yield deviations already adjusted for non-genetic effects, we did not consider additional fixed effects. For each phenotype and SNP i, the model used in each breed was the following where y is a vector of yield deviations, μ is a mean; u is a vector of random additive polygenic effects and is $ Nð0; G s 2 u Þ where G is genomic relationship matrix based on all cows with phenotypes per breed and all autosomes. Z is incidence matrix relating phenotypes y to u, w i is a vector of genotypes for SNP i, s i is the effect of SNP i, and e is a vector of random residual effects. We calculated the relationship between two individuals' j and k as x ij being the number of alleles for individual j and SNP i and p i is the observed allelic frequency, and, w was 43,800. We applied a genome-wide Bonferroni correction on all 43,800 tests to account for multiple testing.

Candidate variant discovery
We used the Association Weight Matrix (AWM) procedure to identify candidate genes per breed [8]. The AWM is a multiple trait approach that considers the genetic contribution of correlated traits allowing selection of pleiotropic SNP associated with numerous traits rather than a single trait. We classified trait information as either key or supportive trait, and the key trait in this study was udder depth or development (UDD) which is the most important type trait with the strongest relationship with mammary health and longevity. In addition, UDD is an aggregate trait, combining size, attachments, balance and strength of support. Populating the AWM starts with the selection of significant SNP from a GWAS [19]. The SNP additive effects are z-scored normalized by deviating the allele substitution effects from their mean and dividing by their standard deviation. We then created two matrices: (a) A z-scored additive values matrix (b) The GWAS p-values matrix. In both cases, rows represent SNP and columns represent traits. We then processed these matrices using the AWM algorithm, which includes five steps: (1) Primary SNP Selection: We select SNP associated with key trait using a P-value threshold (P < 0.05). (2): Exploring the dependency among traits: For the SNP selected in step (1), and, for the same threshold (P < 0.05), we register the average number of non-key traits to which the SNP are associated. In this study, that number was five traits. (3): Secondary SNP Selection: We select SNP from step (1) associated with at least five other traits including at least two udder traits. This step depends on correlation amongst traits and allows capturing most SNP associated with remaining traits. (4): Exploiting the genome map: We annotated the SNP captured in step (1), and step (3) using the UMD3.1 Genome assembly [16]. We classified the SNP that (i) mapped a gene, (ii) <10-kb to known genes, and, (iii) >10-kb to any coding region. For genes represented by more than one SNP, we select the SNP associated with the highest number of traits and has the lowest P-value average across all traits as representative for that gene. (5): Populating the AWM: Each {i,t} cell value in the AWM matrix corresponds to the z-score normalized additive effect of the i th SNP on the t th trait. This allows exploration of trait correlations column-wise, and gene/SNP interactions row-wise. We then calculate the SNP-based correlations and compare the SNP-based and genetic correlations, the latter being calculated as pedigree-based restricted maximum likelihood (REML), established in the same cows' populations. We then use the AWM SNP matrix as the input for the PCIT algorithm [20] and for any trio of SNP; we estimate the first order partial correlation coefficients to identify meaningful gene-gene interactions. We annotated and clustered gene ontology (GO) annotations for significant PCIT gene-gene interactions using Cytoscape [21]. Finally, we compare the gene clusters amongst the three breeds and plot the most significant cluster.

GWAS for all traits
Collectively for 78,440 cows, imputation resulted in a genotype density of 43,800 SNP. Of these, 38,827, 38,109, and 40,810 SNP had a MAF greater than 0.1 in Montbeliarde, Normande, and Holstein. Distribution of allele frequencies (MAF) of imputed genotypes was almost uniformly distributed across the MAF classes (Fig 1).
We performed GWAS for real and imputed SNP and yield deviations (YD) for 12 traits: five udder conformation traits, five milk production phenotypes, somatic cell score and clinical mastitis ( Table 2). associated SNP, and, Normande had 1,039 associated SNP (Table 2). There was an overlap of significant SNP across the breeds. Some 483 SNP were common between Holstein and Montbeliarde, 356 SNP were common between Holstein and Normande, 233 SNP were common between Montbeliarde and Normande, and 206 SNP were common among three breeds ( Fig  3). We observed overlap of significant SNP between udder and milk production traits. Irrespective of the breed, 15 SNP overlapped in udder related traits, and 205 SNP overlapped in milk production phenotypes. When considering 1SNP:1Gene, 3,651 significant SNP were close to genes (<10-Kb) across three breeds. Of these, 1,017 were highly associated with udder conformation traits, 2,502 with production traits and 132 with somatic cell score and clinical mastitis. The 2,673 additional SNP satisfying step 4 of the AWM algorithm (as described in M&M) was augmented with the 3,651 significant SNP from GWAS forming the AWM matrix with 6,324 SNP (S1 Table). Of these, 1,309 SNP were common across three breeds, and they mapped 1013 genes for 12 traits.

Genetic parameters and genomic AWM-based correlations
Heritability coefficients (diagonal), pedigree-based genetic correlations (upper diagonal), and SNP correlations (lower diagonal) are presented in Table 4. Heritability values are close to reported values for type traits [22], Fore udder attachment, (FUA) is 0.34, 0.26, 0.33 for Montbeliarde, Normande, and Holstein and higher for the other traits. For example, for Milk yield (MY), they reached 0.50, 0.61, and 0.54 in Montbeliarde, Normande and Holstein, respectively. These high values reflect the nature of the yield deviations (YD), which is a mean of records for repeated traits. YD is adjusted for permanent environment effect and thereby has a reduced non-genetic variability. As an example, assuming additive genetic variance of milk  Holstein breeds, respectively. There were moderate to zero genetic correlations between milk yield (MY) and FUA for Montbeliarde (0.42), Normande (0.28), and Holstein (0), and low SNP correlation (Montbeliarde (0.10), Normande (0.16) and Holstein (0.16)) between MY and udder depth or development (UDD). Holstein breed had a highly positive genetic (0.49) and SNP correlation (0.59) between front teat placement (FTP) and clinical mastitis (CM), a trend that was not evident in other two breeds. However, Montbeliarde breed showed a strong SNP correlation between FTP and CM (0.33) and between UDD and CM (0.18). Other trends evident from genetic correlations were between FUA and PROT for Montbeliarde (0.39) and Normande (0.26), FUA and CM for Montbeliarde (0.19) and a moderate genetic correlation between FTP and CM for Holstein (0.49). Genetic and SNP correlations were also comparable between FTP and PROT for all breeds, with genetic/SNP correlation being from medium in Montbeliarde (0.21 / 0.12) and Holstein (-0.34 / -0.1) to low in Normande (-0.12 / -0.1). However, the trend deviated between udder balance (UB) and fat percent (FAT %) with minimal genetic correlation in Normande (0.22) and no genetic correlation in Montbeliarde and Holstein but with moderate SNP correlations in Montbeliarde (-0.41), Normande (0.38) and Holstein (0.38). In general, genetic correlations are in the range of usual values, with high correlations between milk, fat and protein, a moderately negative correlation between production and type traits, moderately positive correlations between conformation traits, and low correlations otherwise. However, though there were deviations between genetic and SNP correlations in some of the traits, most traits correlations were in the same direction thus drawing plausibility of SNP correlated traits.
These SNP mapped FGF2, PRLR, and BTRC genes. PRLR gene (prolactin receptor) was previously associated with milk production traits in Finnish Ayrshire dairy cows [24]. Wang et al. [25] reported the association of FGF2 gene (Fibroblast growth factor 2) to fat yield and percentage and somatic cell score in US Holstein. Coleman-Krnacik et al. [26] reported the expression of FGF2 gene in the bovine mammary gland and uterine endometrium (UE). In the mammary gland, the FGF2 gene may play a role in development and reorganization of the mammary gland, while in UE, FGF2 gene is mainly expressed throughout estrous cycle and early pregnancy. BTRC gene (Beta-Transducing Repeat Containing E3 Ubiquitin Protein Ligase) is an F-box protein involved in Wnt/β-Catenin signaling pathway and indirectly activates nuclear factor kappa-B (NF-kB) [27]. Raven et al. reported these pathways to be highly relevant during mammary development and pregnancy, and as such, could have a major functional role in lactation [27]. The AWM-PCIT algorithm identified interacting candidate genes for udder conformation traits by first establishing SNP based correlation and fixing udder depth or development (UDD) as the key trait when constructing the AWM SNP matrix. These genes were represented by several biological processes involved in positive regulation of cellular biosynthetic processes and cell development, suggesting an endogenous characterization linked to udder morphology [28]. Fortes et al. [10] reported more similar SNP and genetic correlations for traits with moderate to high heritability and less similar correlations between traits with low heritability (Table 4). In our study, most traits had a heritability >10% thus aiding both GWAS detection power and AWM SNP detection.
The AWM was assessed for gene ontology (GO). The top GO terms were Calcium cation antiporter and Calcium-activated Potassium channel activity. As reported by Paulsen et al. [29], the former is a member of the cation diffusion facilitator (CDF) superfamily which are integral membrane proteins that increase tolerance to divalent metal ions, whereas, the latter is involved in ionic signaling in cells, a critical function for hormonal control of cell proliferation and differentiation [30]. Control of calcium signaling is likely to have profound effects on mammary physiology and pathophysiology. Their high significance level can be explained by the fact that mammary glands extract large quantities of calcium from the plasma during lactation [31], to ensure sufficient calcium concentration in milk. Dendrite development and putrescine biosynthetic processes were top biological GO terms. Dendritic cells (DC) are accessory cells of the mammalian immune system whose work is availing antigen material to T cells of the immune system [32][33]. This ability to stimulate native T-cells makes this result significant because it can directly be applied to improve udder health. DC has also been reported to play a pivotal role in the initiation of an adaptive immune response [34]. Putrescine, as a member of polyamine pathway, is regulated by the periovulatory endocrine milieu [35].
The central pathways that showed enrichment were "Neuropathic pain-signaling in dorsal horn neurons pathway," and, "CREB signaling pathway." Neuropathic pain is the pain after nerve injury whereas the dorsal horn (tip of the spinal cord) pathways have been shown to offer substantial overlap with spinal projections from adjacent mammary glands in model organisms [36]. Reflex milk ejection may result from the strong integration of sensory input from mammary glands afferents that terminate in the dorsal horn. CREB (cyclic-AMP response element-binding) protein family of transcription factors (TF-a protein that controls the rate of transcription of genetic information from DNA to messenger RNA, by binding to a specific DNA sequence [37]) play a crucial role in supporting the survival of sensory neurons [38]. The intervention of sensory neurons stimulates cells to secrete nerve growth factor (NGF) via the sympathetic nervous system (SNS) that maintains homeostasis [39]. NGF mediate its functions through ligation of tyrosine kinase (Trk) receptors [40]. Regulation of Trk signaling is by a variety of intracellular signaling cascades, which include MAPK pathway, which promotes cell continuation and growth [41]. Previous studies indicate that Trk could be multifunctional growth factors that exert various effects through their receptors on non-neuronal cells such as mammary ducts [30].
Fontanesi et al. [8] reported that long-term transcriptomic adaptations of tissues depend on the action of external stimuli that induce action of cellular functions on transcription factors (TF) [42]. In this study, we identified 39 interacting gene clusters and the most significant cluster had ten TF directly involved with mammary gland development and mammary gland duct morphogenesis : BTRC, ESR1, FGF2, FGFR2, GLI2, IQGAP3, PGR, PRLR, RREB1, and, TGFBR2. These TF are homonymous with genes encoding them. This top cluster is directly involved in mammary gland development, regulation of response to a stimulus, gland development, epithelium development, tissue development, animal organ development, system development and multicellular organism development. This was partly in agreement to works reported by Yang et al. [43] on QTL associated with follicle stimulating hormone production in Chinese Holstein cattle.

Conclusion
Our study suggests the usefulness of system-based approaches to identify candidate genes from interacting gene networks in a multi-breed context. We achieved this by exploiting associations between correlated traits. The reluctant inclusion of intergenic SNP leaves the possibility that the AWM approach was not capturing significant SNP such as trans-activators. Nonetheless, the AWM proved to be more efficient for integrating related complex traits and analyzing thousands of SNP and therefore appropriate for the analysis of these complex traits. When applied to our dataset, it predicted gene interactions that are consistent with the known biology of udder morphology and health captured known TF (e.g., ESR1, FGF2, GLI2, PGR and BTRC), and provided new candidate genes for udder morphology. This experiment may be replicated using whole genome sequence data and other independent datasets.