Genome-Wide Study of Structural Variants in Bovine Holstein, Montbéliarde and Normande Dairy Breeds

High-throughput sequencing technologies have offered in recent years new opportunities to study genome variations. These studies have mostly focused on single nucleotide polymorphisms, small insertions or deletions and on copy number variants. Other structural variants, such as large insertions or deletions, tandem duplications, translocations, and inversions are less well-studied, despite that some have an important impact on phenotypes. In the present study, we performed a large-scale survey of structural variants in cattle. We report the identification of 6,426 putative structural variants in cattle extracted from whole-genome sequence data of 62 bulls representing the three major French dairy breeds. These genomic variants affect DNA segments greater than 50 base pairs and correspond to deletions, inversions and tandem duplications. Out of these, we identified a total of 547 deletions and 410 tandem duplications which could potentially code for CNVs. Experimental validation was carried out on 331 structural variants using a novel high-throughput genotyping method. Out of these, 255 structural variants (77%) generated good quality genotypes and 191 (75%) of them were validated. Gene content analyses in structural variant regions revealed 941 large deletions removing completely one or several genes, including 10 single-copy genes. In addition, some of the structural variants are located within quantitative trait loci for dairy traits. This study is a pan-genome assessment of genomic variations in cattle and may provide a new glimpse into the bovine genome architecture. Our results may also help to study the effects of structural variants on gene expression and consequently their effect on certain phenotypes of interest.


Background
Over the past decade, many studies have attempted cataloging the nature and pattern of genomic alterations in population (e.g. [1]). The advent of novel high-throughput sequencing technologies [2][3][4][5][6] with the ability to partially or completely re-sequence genomes, in a relatively cost-effective manner, has offered new opportunities to study large scale genomic variations. In addition to single nucleotide polymorphisms (SNPs) and small insertions or deletions (indels), several other studies have identified larger and more complex structural variants (SVs). Originally, SVs were considered as genomic alterations affecting DNA segments greater than 1,000 base pairs (1 kbp) in size [7]. However, with new advances in high-throughput sequencing technologies, the operational spectrum of SVs has widened to include much smaller genomic alteration events (> 50 bp in size) [8]. SVs such as large insertions, large deletions, inversions, duplications, translocations and Copy Number Variants (CNVs), are less frequent than SNPs and indels within a given genome however some of them may have more significant functional effects [9] and may also play a role in genome structure remodeling [10][11][12][13][14][15][16]. For example, during its pilot phase, the 1000 Genomes Project Consortium has sequenced 185 human wholegenomes and has identified more than 22,025 deletions and 6,000 additional SVs [17]. Some of those SVs are associated with disease susceptibility, such as autism [18][19][20] or schizophrenia [21][22][23] in humans.
Many animal genomes have now been sequenced, including the genomes of several bulls and cows . For example, Eck et al. (2009) generated the first cattle genome sequence by a next-generation sequencing method [24]. By sequencing a Fleckvieh bull genome, they discovered more than 2 million novel cattle SNPs. More recently, Daetwyler et al. (2014) have sequenced the whole-genome of 234 bulls from four different breeds and have identified more than 28 million variants (SNPs and indels). These polymorphisms have then been used to identify putative causative mutations for genetic defects or economically important complex traits [44].
Here, we performed a large scale study to investigate both small indels (< = 50 bp) and large SVs (> 50 bp) in cattle by sequencing the whole-genome of 62 bulls from the three French major dairy breeds (Holstein, Montbéliarde and Normande breeds).
The collection of SVs reported in this study may prove useful to study their potential effect on the expression levels of certain genes of interest and consequently to study their link with the genetic variability of economically important traits in cattle.

Animal ethics
No animal experimentation was used in this study, therefore no ethical permission was required from any relevant authority. Sequencing was performed using genomic DNA obtained from sperm collected from semen straws kindly provided by approved commercial artificial insemination stations as part of their regular semen collection process. The authors did not participate in the acquisition of semen samples for the purpose of this research.

Genomic DNA extraction
Genomic DNAs were extracted from semen of 62 dairy bulls (27 Holstein,17 Montbéliarde and 18 Normande bulls) chosen based on their genetic contribution to the French cattle populations, using the Wizard Genomic DNA Purification Kit (Promega, Charbonnières-les-Bains, France) or using a standard phenol-chloroform method, respectively. A quality control inspection of each purified DNA sample was performed by agarose gel electrophoresis. DNA concentration was then measured with a Nanodrop ND-100 instrument (Thermo Scientific, Ilkirch, France).

Library construction and sequencing
Genomic libraries were prepared using the TruSeq DNA Sample Preparation Kit (Illumina) according to the manufacturer's instructions. Briefly, 4 μg genomic DNA were fragmented into 150-400 bp pieces using divalent cations at 94°C for 8 min. The resulting cleaved DNA fragments were purified using Agencourt AMPure XP beads (Beckman Coulter, Villepinte, France), then subjected to end-repair and phosphorylation and subsequent purification was performed using Agencourt AMPure XP beads (Beckman Coulter). These repaired DNA fragments were 3 0 -adenylated producing DNA fragments with a single 'A' base overhang at their 3 0 -ends for subsequent adapter-ligation. Illumina adapters were ligated to the ends of these 3 0adenylated DNA fragments followed by two purification steps using Agencourt AMPure XP beads (Beckman Coulter). Ten rounds of PCR amplification were performed to enrich the adapter-modified DNA library using primers complementary to the ends of the adapters. The PCR products were purified using Agencourt AMPure XP beads (Beckman Coulter) and sizeselected (200 ± 25 bp) on a 2% agarose Invitrogen E-Gel (Thermo Scientific). Libraries were then checked on an Agilent Technologies 2100 Bioanalyzer using the Agilent High Sensitivity DNA Kit and quantified by quantitative PCR with the QPCR NGS Library Quantification kit (Agilent Technologies, Massy, France). Libraries were used for 2×100 bp paired-end sequencing on an Illumina HiSeq2000 with a TruSeq SBS v3-HS Kit (Illumina).

Alignment to the reference
Sequence alignments were carried out using the Burrows-Wheeler Alignment tool (BWA v0.6.1-r104) [78] with default parameters for mapping reads to the UMD3.1 bovine reference genome [79]. Potential PCR duplicates, which can adversely affect the variant calls, were removed using the MarkDuplicates tool from Picard version 1.4.0 [80]. Only properly paired reads with a mapping quality of at least 30 (−q = 30) were kept. The resulting BAM files were then used for all subsequent analysis.

Identification of small insertions and deletions
Small indels were detected using the Genome Analysis Tool Kit 2.4-9 (GATK) version and GATK-UnifiedGenotyper as SNP caller [81]. Prior to variant discovery, reads were subjected to local realignment, coordinate sort, quality recalibration, and PCR duplicate removal. In the GATK analysis, we used a minimum confidence score threshold of Q30 with default parameters. We have also used multi-sample variant calling in order to distinguish between a homozygous reference genotype and a missing genotype in the analyzed samples.

Identification of SVs
Bioinformatics detection of potential genomic variation events was carried out on the 62 BAM files. We have performed multi-sample variant calling by Pindel software, v. 0.2.4y [82] using parameters as described in https://trac.nbic.nl/pindel. We first set the "Maximum event size index" to 9 in order to detect events whose sizes are up to 8,286,208 bp. We also set the-m parameter (min_perfect_match_around_BP) to 30 (i.e. at the point where the read is split into two, there should at least be 30 perfectly matching bases between the read and the reference sequences). We required a minimum mapping quality of the split read of 30 to support a breakpoint or junction. We finally used a custom python script to filter out Pindel-generated raw data: Only samples presenting at least three unique reads at the breakpoint of SVs were declared positive for the corresponding SV.

Annotation of SV regions
Analyses of the overlap between SVs and functional elements were performed based on the gene build 77 database for the UMD3.1 bovine gene dataset obtained from the Ensembl Genome Browser using the Biomart software (http://www.ensembl.org/index.html). Positions of SV breakpoints predicted by Pindel were compared to gene start and end positions in order to identify SVs that may encompass an entire gene, those that overlap with exons of a given gene, those that overlap gene starts or ends and those for which both SV breakpoints are located within two different genes.
The Ensembl Biomart software was also used to find gene paralogs located within or overlapping the annotated SV regions.
Gene Ontology (GO) enrichment was also performed using the MouseMine analysis tools, a powerful new system for accessing MGI (Mouse Genome Informatics) data, using the Inter-Mine framework and is available at the MGI international database resources (http://www. mousemine.org/mousemine/begin.do).
In order to investigate QTL regions within SV regions, we first downloaded all Bovine QTL regions from the public cattle QTL database release 24 (Aug 25, 2014), available at http://www. animalgenome.org. QTLs linked to milk traits (fat and protein content and yield) and somatic cell scores were subsequently extracted. A custom python script was then used to search for SVs located within or overlapping with QTLs regions.

SV validation by high-throughput genotyping
In order to investigate our approach efficiency to detect SVs, we developed a genotyping-based strategy using the already available Illumina BovineLD custom BeadChip [83]. With this strategy, many individuals can be genotyped for many SVs at limited cost. The main idea was to convert predicted SVs into "virtual SNPs" by testing the base change at the SV breakpoints. Therefore, several selection filters were applied in order to select a panel of SVs for validation: (1) in order to overcome genotyping problems due to sequence repeats, the SV flanking sequences were first analyzed with the RepeatMasker software [84] and all SVs with masked flanking sequences were removed; (2) For deletions, if the first nucleotide of the deleted region is different from the first nucleotide which is located immediately after the SV 3' breakpoint, then we selected the corresponding SVs for further analysis. This deletion was then converted into a "virtual SNP" for which the reference allele corresponds to the first nucleotide of the deleted region and the alternative allele corresponds to the first nucleotide immediately after the SV 3' breakpoint; (3) For inversion, if the first nucleotide of the SV region is different from the reverse-complement of the last nucleotide of the same SV region, then we selected the corresponding SV for further analysis. This inversion was then converted into a "virtual SNP" for which the reference allele corresponds to the first nucleotide of the inverted region and the alternative allele corresponds to the reverse-complement of the last nucleotide of the same inverted region. Steps 2 and 3 were repeated with the reverse-complement sequences.
After applying the above filters, 331 deletions and inversions were selected for validation. They were genotyped for a large number of animals. High-throughput genotyping reactions were performed at Labogena core facility, using the custom low-density Illumina BovineLD SNP chip (San Diego, CA). SNPs with an Illumina design score above 0.4 were retained for further analysis. Oligonucleotides were designed, synthesized, and used to genotype 382 animals from at least eight major dairy breeds (Table 1). Several other breeds such as beef breeds (Limousine: 15, Charolaise: 19, Blonde d'Aquitaine: 12, Parthenaise: 12 and Gasconne: 9) were also included in our genotyping panel. However, none of the bulls used for SV identification was included in this genotyping sample list.

Analysis of population structure
To indirectly validate the results of this SV detection study, we compared the population structure assessed from SVs to those previously obtained with SNPs [85]. We first performed Principal Components Analysis (PCA) using "dudi.pca" implemented in the R package ade4 [86] using all validated SV information. Second, we used the STRUCTURE software package [87] to assess the population structure. This program implements a model-based clustering method to infer population structure using genotype data of unlinked markers. We used the admixture model and correlated allele frequency version of STRUCTURE [88].

Results and Discussion
Whole-genome sequencing, read mapping Sixty-two of the most contributing bulls from the three major French dairy breeds (Holstein, Montbeliarde and Normande) were selected for whole-genome sequencing. A total of 31,140 million raw paired-end reads with a length of 100 bases were generated, resulting in a total of 3,114 gigabases. Each sample was sequenced on 1-4 lanes and approximately 140 to 1,120 million paired-ends reads were obtained for each library. On average, 93% (from 75% to 97%) of the paired-end reads were properly aligned on the UMD3.1 bovine reference genome (S1 Table). Similar read mapping rates were obtained in other bovine whole-genome sequencing studies. For example, Kawahara-Miki et al. (2011) found that 86% of the paired-end reads they generated while sequencing the genome of a Japanese Kuchinoshima-Ushi bull mapped uniquely onto the bovine genome [25]. The average genome-wide sequence coverage from the mapped reads ranged from 5× to 42× across the different genomes, with 52 samples sequenced at least at 10 fold average coverage.

Identification of genomic variations
Search for small variations with GATK-UnifiedGenotyer software resulted in the identification of 2,021,215 indels (S2 Table). On average we found 873,372 +/-47,845 indels per bull. With this approach based on GATK, the largest indel identified was 11 bp in length.
With Pindel algorithm, we generated two categories of variations. First, we produced a catalog containing 1,384,490 small SVs mainly small insertions and deletions (< = 50 bp) out of which 1,383,007 small SVs were less than 11 bp in size (S2 Table). These were subsequently used for concordance analysis with small indels data generated by GATK. Almost 98.9% (1,368,226 out 1,383,007) of small indels detected by Pindel were also identified with GATK (Fig 1). This relatively high percentage of concordance suggests that most small SVs detected by Pindel might be true variations. However, it is difficult to precisely estimate the sensitivity (false-positive rate) of our SV detection method as small indels found with GATK but not with Pindel might not be true indels.
Second, we produced another catalog containing 6,426 putative large SVs (>50 bp) corresponding to 3,138 large deletions, 1,061 tandem duplications and 2,227 inversions (S3 Table). On average we observed nearly 199,200 small SVs and 305 large SVs per individual.
Analysis of the length distribution of large SVs (Fig 2) revealed that most deletions (38.9%) are between 51 and 1,000 base pairs-long, whereas the length of most inversions (50.5%) is between 1 and 10 Kb while the vast majority of tandem duplications (80.9%) are larger than 10 Kb. These preliminary results seem to indicate a possible correlation between SV type and size. However, these observations should further be investigated.
Deletions and tandem duplications identified in this study covered a total length of up to 277 Mb corresponding to almost 10% of the whole bovine genome, whereas inversions covered a total length up to 152 Mb, ie almost 6% of the bovine genome. However, these percentages could be overestimated as SVs identified in our study are indeed putative variations and at this stage we do not know yet the false positive rate of our detection approach.

Distribution of SVs between animals and between breeds
Overall, 61% of SVs were found only in single bulls (Fig 4). One deletion was found to be present in all 62 animals. One deletion and one tandem duplication were observed in 60 and 61 animals, respectively. Analysis of raw results generated by Pindel revealed that these 2 SVs were present in all 62 animals of our study but, for two animals, they were supported by less than the minimum number of 3 reads which was required to support an SV. These samples were therefore excluded from the final list of animals presenting the SVs. The cow genome reference sequence is derived from a single Hereford animal called Dominette. Therefore the first deletion and probably the two other SVs might be Hereford-or Dominette-specific SVs. Alternatively, these SVs could also be due to local errors in the UMD3.1 reference genome assembly.
Comparison of large SVs revealed that 12% of these were shared between the three breeds ( Fig 5) and at least one third were shared between at least two breeds. As shown in Fig 5, we identified more large SVs (2,195) in Holstein bulls than in Montbéliarde and Normande bulls (1, 103 and 1,240, respectively). This result could be partly explained by the larger number of sequenced bulls in Holstein (27) than in Montbéliarde and Normande (18 and 17, respectively). Our results suggest that at least one third of the SV events occurred before the separation of the three breeds and therefore might also be present in other cattle breeds.

Identification of potential CNV regions
CNVs are defined as loss (deletions) or gain (duplications) of copies of DNA segments. In order to identify SV regions (SVRs) that might correspond to potential CNV regions (CNVRs), we searched for DNA segments for which we could observe at the same time either a deletion in one bull and a duplication in another bull (across animals) or a deletion and a duplication at the same region within the same bull (within animal). We considered a given DNA region as a potential CNVR when the deleted and the duplicated segments are located within the same region, and are at least 70 percent overlapping.
In our study, we found 452 unique deletions and 392 unique duplications which may code for potential CNVRs (S4 Table). In parallel, all deletion and tandem duplication regions identified in our study were also compared to publicly available CNVRs. Overall, 175 regions (128 deletion and 47 tandem duplication regions) overlapped with publicly available CNV datasets [29] [89]. Out of these, 33 deletions and 29 tandem duplications were also identified with our first approach (S5 Table). Overall, we identified 957 SVs that could potentially code for CNVs. Out of these, 547 SVs were deletions and 410 were tandem duplications (S4 and S5 Tables).

Annotation of SVRs
Gene content. Analyses of functional elements lying within SVRs revealed a total of 2,415 (38%) SVRs which contain either entire gene-coding regions or only parts of genes (S6 Table). Therefore these SVs could potentially have an effect on expression of some of these genes and consequently a potential effect on some phenotypes. Out of these, 48% (1,168) were deletions, 27% (650) were tandem duplications and 25% (597) were inversions. Overall, a total of 5,011 genes overlap with these SVRs. The vast majority of these genes has paralogs (S7 Table) and correspond to uncharacterized genes (587) and genes coding for the olfactory receptor (327), U6 splicesomal RNA (159) and for the 5S ribosomal RNA (86).
Interestingly, we found 182 large deletions removing an entire gene. Overall, 115 different genes are affected by these large deletions (S8 Table). Almost 91.3% (105/115 genes) of these genes belong to large multigene families. The remaining large deletions remove 10 single-copy genes, out of which we found 3 pseudogenes, 3 protein coding genes and 4 genes encoding for microRNAs.
Alignment to the UMD3.1 bovine genome sequence of the sequence of the genes encoding for the novel miRNA ENSBTAG00000044935 and for bta-mir-2887-2 revealed several   Table), suggesting that multiple paralogous copies of these two microRNAs are located throughout the bovine genome.
A single perfect alignment match was however observed for the other two miRNAs. The gene encoding for bta-mir-2310 has been discovered in the normal adult bovine kidney (MDBK) cell line after infection with bovine herpesvirus 1 and shows a low expression in nonand infected cells [90]. Further analysis using TargetScan database [91] identified four genes to be targets of bta-mir-2310. These encode for interleukin 5, protein inhibitor of activated STAT 1 (PIAS1), solute carrier family 25 member 31 (SLC25A31) and zinc finger protein 316 (ZFN316). They are involved in different functions such as immune response, gene signaling, metabolite transport, and gene expression regulation. It is therefore possible that bta-mir-2310 plays an important role by negatively modulating the gene expression of these genes. However, its inactivation might also have limited impact as targets for numerous other miRNAs were also found in the 3'-untranslated regions of these four target genes.
The gene deleted by INRA_BovSV6339 encode for the mediator complex subunit 10 protein-coding gene (MED10) which is is a coactivator for DNA-binding factors that activate transcription of RNA polymerase II-dependent genes [92].
The other two genes deleted by INRA_BovSV1327 and by INRA_BovSV4164 encode for two yet uncharacterized proteins. The first gene contains only one exon and the predicted protein is around 100 amino acids. Alignment of this protein sequences against protein databases revealed a perfect match (100% identity) with the 3'-end of Bos taurus partitioning defective 3 homolog B isoform X5 (PARD3B). The second gene, however, contains 13 exons and code for a 463 amino acid protein. Amino acid sequence alignments against protein databases revealed high similarities with Bos Taurus ankyrin repeat domain-containing protein 26-like isoform X1 (LOC513969).
Further analyses are needed to check whether these deletions have any functional impact in cattle.
Gene Ontology. Gene Ontology analyses were also performed for all 5,011 genes and GO terms were obtained for biological processes, cellular components and molecular functions (S10 Table). Several GO terms were found to be significantly over-represented. For example, the five most enriched GO categories corresponding to biological process are related to metabolic process, primary metabolic process, organic substance metabolic process, single-organism metabolic process and cellular metabolic process.
QTLs in SVRs. The positions of the 6,426 predicted large SV events were also compared to the positions on the UMD3.1 bovine genome assembly of known quantitative trait loci (QTLs) deposited in the public database AnimalQTLdb [93]. Overall 587 SVs (246 large deletions, 236 inversions and 105 tandem duplications) were found located within or overlapping QTLs linked to milk traits and somatic cell count and scores (S11 Table). The most frequent traits corresponded to somatic cell score (257 SVs) followed by milk fat percentage (161 SVs), milk protein yield (143) and milk protein percentage (107 SVs). QTL enrichment analysis (S11 Table) showed no significant enrichment of specific QTLs linked to milk trait or somatic cell counts when comparing the SVs overlapping the QTL regions against SVs overlapping all other known QTL regions available in the AnimalQTLdb database for cattle.

Validation of large SVs by genotyping
The efficiency of the selection approach and the relevance of the resulting SVs were assessed by genotyping a selected panel of SVs in 382 animals. None of the sequenced individuals was present in this genotyped panel.
Assays were developed for 331 putative SVs (S12 Table), out of which 255 (77%) were successfully genotyped (S13 Table) while genotyping failed for 76 (23%). These did not either cluster well according to genotype or failed to amplify most probably because of the sequence complexity or the presence of polymorphisms within flanking sequences or failed manufacture with Illumina. These were considered "failed assays". Out of the 255 successfully genotyped SVs, 237 were deletions and 18 were inversions.
For almost 25% (64 SVs) of the successfully genotyped SVs, only one SV allele was identified in all individuals (S13 and S14 Tables). Out of these, 61 SVs were homozygous for the reference allele and could therefore be incorrectly identified as true SVs by Pindel. Some of these SVs may also correspond to rare variants that were not present in the samples genotyped in this study. Indeed, almost 50% (30 out of 61) of these monomorphic SVs were found in a single bull and 84% (51 out of 61) were present in less than 5 animals.
The remaining 3 SVs were homozygous for the alternative allele and were therefore considered as true SVs.
Finally, 75% (191) of the successfully genotyped SVs were polymorphic and reliably scored, and thus were considered as true SVs (S13 and S14 Tables). Out of these, 184 SVs were Results of PCA analysis. PCA analysis results were shown for the 3 main dairy breeds ( Fig 6A) and for the 8 breeds (Fig 6B).
doi:10.1371/journal.pone.0135931.g006 deletions and 7 were inversions. The observed minor allele frequency (MAF) mean among true SVs was 0.20 ± 0.15 (SD), while the observed heterozygosity mean across loci was 0.28 ± 0.17, and the PIC (Polymorphic Information Content) mean was 0.23 ± 0.13 (S15 Table). Based on Genetic population structure prediction. Genetic population structure predicted by STRUCTURE software for the 3 main dairy breeds ( Fig 7A) and for the 8 breeds (Fig 7B). the observed heterozygosity and PIC rates in the validated SV panel and across the eight main breeds analyzed (S15 Table), we could conclude that this type of markers may be informative and is therefore of particular interest for linkage analysis.
Nine deletions overlapping with publicly available CNVs and 37 others identified as potential CNVs with our first approach were also validated in our genotyping study (S4 and S5 Tables).
Assessment of population structure using SV genotyping data Our validation study was carried out using animals from at least eight major dairy breeds ( Table 1), out of which there were 29 Holstein, 32 Montbeliarde and 30 Normande animals.
Using only genotyping data related to the three dairy breeds (S14 Table), PCA grouped individuals into three clusters according to their breeds of origin (Fig 6A).
For K = 3, which corresponded to the three main breeds, STRUCTURE successfully sorted individuals into three groups entirely corresponding to the three breeds (Fig 7A).
Similar results were also observed with the eight breeds used in our validation study ( Fig 6B  and Fig 7B).
Our results are of particular interest as they could be considered as a global statistical validation step in addition to the genotyping validation approach we developed. Indeed, the SV used in these analyses provided a good description of the breed structure, similar to the one previously provided by SNP data [85].

Conclusions
In the present study, we performed a pan-genome assessment of structural variations in cattle using whole genome sequence data. Analysis of WGS data of 62 bulls from the three main dairy breeds used in France (Holstein, Montbéliarde and Normande breeds) allowed the identification of 6,426 large SVs (> 50 bp). Out of these, 547 deletions and 410 tandem duplications were identified as potential CNVs.
To analyze the accuracy of our SV detection approach, a set of 331 SVs were selected for validation using a novel high-throughput genotyping strategy. Almost 75% of the successfully genotyped SVs could be validated and were polymorphic.
The collection of newly discovered SVs may prove useful to study their link with genetic variability of economically-important traits in cattle. It will be particularly interesting to analyze the impact of the large deletions inactivating completely single-copy genes.
Supporting Information S1