Design of a Bovine Low-Density SNP Array Optimized for Imputation

The Illumina BovineLD BeadChip was designed to support imputation to higher density genotypes in dairy and beef breeds by including single-nucleotide polymorphisms (SNPs) that had a high minor allele frequency as well as uniform spacing across the genome except at the ends of the chromosome where densities were increased. The chip also includes SNPs on the Y chromosome and mitochondrial DNA loci that are useful for determining subspecies classification and certain paternal and maternal breed lineages. The total number of SNPs was 6,909. Accuracy of imputation to Illumina BovineSNP50 genotypes using the BovineLD chip was over 97% for most dairy and beef populations. The BovineLD imputations were about 3 percentage points more accurate than those from the Illumina GoldenGate Bovine3K BeadChip across multiple populations. The improvement was greatest when neither parent was genotyped. The minor allele frequencies were similar across taurine beef and dairy breeds as was the proportion of SNPs that were polymorphic. The new BovineLD chip should facilitate low-cost genomic selection in taurine beef and dairy cattle.


Introduction
Genetic improvement of several key agricultural species is accelerating with the adoption of genomic selection [1,2,3]. With this method, animals or plants can be selected for breeding on the basis of their genetic merit predicted by markers spanning the entire genome. Particularly in dairy cattle, this method has been shown to be more efficient than conventional progeny testing of bulls (up to double the rate of genetic gain) as well as substantially less expensive [4]. Moreover, genomic selection opens new opportunities for sustainable management of populations by more efficiently selecting for traits that have low heritability, e.g. fitness traits, or traits that are difficult to measure. This method is also useful for managing the accumulation of inbreeding within breeds with a small effective population size. In dairy cattle, genomic selection has been deployed at a rapid pace, and most countries with major dairy breeding programs now rely heavily on this new technology [5].
A major challenge in implementing genomic selection in most species is the cost of genotyping. The expected value of the information gained by genotyping must exceed the cost of obtaining the genotypes. During the early stages of genomic selection in the dairy industry, the cost of high-density genotyping could be justified. The primary application was to evaluate bulls that were potential candidates for production of commercial semen. Using SNP information for those evaluations resulted in more accurate selection of bulls to acquire and extensively market. Once increased accuracies of genome-enhanced breeding values had been demonstrated, breeders and buyers quickly adopted this technology to improve accuracy of selection [6]. This example of a genomic-selection application has extreme value compared with other animal food production paradigms. In contrast, profit from genomic selection is likely to be much lower for beef bulls and dairy females [5,7]. An appealing approach in situations with much lower returns from genotyping is to use a more economical, reduced-density SNP chip with markers optimized for imputation.
Imputation is the process of predicting unknown genotypes for animals from observed genotypes and often uses information from a reference population with dense genotypes to predict missing genotypes for animals with lower density genotypes. It is also applied to merge genotypes of similar densities but different SNPs. Most imputation algorithms use information from relatives and population linkage disequilibrium. A number of software programs for imputation have been developed based originally on human genetics [8,9] and more recently on animal genetics [10,11,12,13]. The limited effective population sizes and population structures in livestock allow the possibility of imputation of high-density genotypes from quite low-density genotypes [11,14,15,16].
In 2010, a low-density bovine SNP chip, the Illumina GoldenGate Bovine3K Genotyping BeadChip (http://www.illumina.com/documents/products/datasheets/datasheet_bovine3K.pdf), was developed and made commercially available. That product offered a significant advance toward low-cost genomic selection in cattle; however, imputation accuracy was highly dependent on the relationship of the individual genotyped with the Bovine3K chip to the reference population genotyped at a higher density [17]. In addition, some samples failed to provide genotypes of adequate quality for use in genomic predictions. The SNP call rate performance of the Bovine3K chip was slightly reduced compared with the BovineSNP50 chip [18] because GoldenGate chemistry relies on two hybridization events for proper SNP detection as opposed to a single event for Infinium chemistry.
In this study, the Illumina Infinium BovineLD Genotyping BeadChip (http://www.illumina.com/documents/products/datasheets/datasheet_bovineLD.pdf) was developed to provide high imputation accuracy for higher density SNP genotypes in taurine dairy and beef populations. The main objective was to provide a tool that would enable genomic estimated breeding values to be calculated from accurately imputed genotype data from an Infinium-based SNP array with very low rates of failed samples. The main features of the new BovineLD chip are presented along with its imputation performance in a range of breeds and reference populations.

SNP selection
To provide highly accurate imputation to BovineSNP50 genotypes in global taurine breeds, SNPs were selected from validated assays from existing higher density chips and similar SNP detection technology, i.e. the Illumina BovineSNP50 and BovineHD (http://www.illumina.com/documents/products/datasheets/datasheet_bovineHD.pdf) SNP arrays, with priority given to BovineSNP50 content. From the known and validated SNPs, selection priority was 1) high minor allele frequencies (MAFs) in targeted breeds, 2) uniform spacing at a minimum of 2 SNPs per Mbp, with increased SNP density within 500 kbp of chromosomal ends, 3) inclusion of SNPs for determination of sex, parentage, Y haplotypes, and subspecies and maternal lineages, 4) SNP quality and fidelity criteria for robust reproducibility (>98% call rate and <0.01% Mendelian inconsistency), and 5) a target overlap of 2,000 SNPs with the Bovine3K chip to ensure backward compatibility. The anticipated SNP spacing (2 SNPs per Mbp) obviated the need to check for highly correlated SNPs.
The SNPs were selected to be highly informative with a high MAF over a large range of breeds from around the world (Table 1). The reference MAF estimates were from breeds in 10 countries from North America, Europe, and Oceania. Content selection was optimized using taurine allele frequencies. To achieve regular spacing, the UMD3 bovine genome assembly (http://www.cbcb.umd.edu/research/bos_taurus_assembly.shtml) was used to define 500-kbp segments over the 29 autosomes. A lack of flanking information at the end of each chromosome had resulted in lower imputation efficiency in preliminary tests. To correct that problem, the SNP density was doubled in the first and last segments of each chromosome. Reflecting the diverse membership of the Bovine LD Consortium, initial SNP selection was made by one member and updated by the others. The initial SNP selection was based on two independent criteria. First, SNPs with the highest mean MAF in each 500-kbp segment were selected over a broad range of European breeds including European Holstein, Montbéliarde, Normande, Jersey, Brown Swiss, Norwegian Red, Swedish Red and White, Finnish Ayrshire, Charolais, Limousine, Blonde d'Aquitaine, and Maine Anjou, with Holstein receiving   the highest mean of the two selection criteria were selected with doubling at the chromosome ends. Next, some of the selected SNPs were replaced by Bovine3K SNPs that were in nearby locations to ensure backward compatibility. In addition, SNPs used for breed determination and parentage testing that had not already been selected were included, and some SNPs were added to fill gaps generated by map inconsistencies.
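The segment-based part of the selection described above can be illustrated with a minimal Python sketch. It is not the consortium's actual procedure: the input layout (tuples of name, chromosome, position, and mean MAF), the function name, and the tie-breaking rule are assumptions made for illustration, and the backward-compatibility and gap-filling steps are omitted. The sketch keeps the highest-MAF SNP in each 500-kbp segment and two SNPs in the first and last segments of each chromosome.

```python
from collections import defaultdict

SEGMENT_BP = 500_000  # segment length taken from the text


def select_snps(snps, chrom_lengths, segment_bp=SEGMENT_BP):
    """Pick the highest-MAF SNP per segment, two in each end segment.

    snps: iterable of (name, chrom, position_bp, mean_maf) tuples
          (hypothetical layout, not from the paper).
    chrom_lengths: dict mapping chrom -> chromosome length in bp.
    """
    by_segment = defaultdict(list)
    for name, chrom, pos, maf in snps:
        by_segment[(chrom, pos // segment_bp)].append((maf, name))

    selected = []
    for (chrom, seg), candidates in sorted(by_segment.items()):
        last_seg = (chrom_lengths[chrom] - 1) // segment_bp
        # Doubled density in the first and last segment of each chromosome.
        quota = 2 if seg in (0, last_seg) else 1
        # Highest MAF first; ties broken by name (an assumption) for determinism.
        best = sorted(candidates, key=lambda c: (-c[0], c[1]))[:quota]
        selected.extend(name for _, name in best)
    return selected
```

For a toy chromosome of 1.5 Mbp with three segments, the function returns two SNPs from each end segment (when available) and one from the middle segment.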
For the X chromosome, Bovine3K SNPs with high MAFs were selected and supplemented with BovineSNP50 SNPs, with consideration given to spacing, MAF, and fidelity. Because large gaps remained after that initial selection, additional X-chromosome SNPs were chosen from the BovineHD assay.
For the Y chromosome and mitochondrial DNA (mtDNA), 9 Y-specific and 13 mtDNA SNP markers were identified from the BovineHD chip based on assay fidelity and performance across 27 breeds, MAF across those breeds, and ability of a SNP to discern subspecies and geographic locations of breed origins.

Imputation
Imputation efficiency was assessed in 10 populations (North American, French, and Australian Holsteins; North American and Australian Jerseys; North American Brown Swiss; Australian Angus; French Montbéliarde; French Normande; and French Blonde d'Aquitaine). Beagle software (http://faculty.washington.edu/browning/beagle/beagle.html) [9] was used for the Australian and French populations and findhap.f90 (http://aipl.arsusda.gov/software/findhap/) [13] for the North American populations. These imputation programs have similar performance in large dairy cattle data sets [19]. Using existing genotypes from the BovineSNP50 chip, imputation efficiency was determined by comparing imputed and observed genotypes. Part of the population was retained as a "reference," while target individuals for imputation had their genotypes reduced in silico to either BovineLD or Bovine3K genotypes. Results were assessed as the proportion of genotypes that were correct in the target population. For example, if the imputed genotype was a heterozygote and the BovineSNP50 genotype was a homozygote, that genotype was counted as incorrectly imputed. The count of correct genotypes included both observed and imputed genotypes to measure the overall success of a lower density genotype in approximating a BovineSNP50 genotype.
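The accuracy measure described above can be sketched in a few lines of Python. The genotype coding (0, 1, or 2 copies of one allele), the dictionary layout, and the function name are assumptions made for illustration. Imputation itself is not performed here; the filled-in genotypes are taken as given, as they would be after running a program such as Beagle or findhap.f90. Because SNPs present on the low-density panel are carried over unchanged, they count as correct, which matches the statement that observed genotypes are included in the tally.

```python
def genotype_accuracy(true_geno, filled_geno):
    """Proportion of genotypes that match the high-density truth.

    true_geno, filled_geno: dict mapping animal ID -> list of genotypes
    coded 0/1/2 (hypothetical layout). filled_geno holds the observed
    low-density genotypes plus the imputed ones for all remaining SNPs.
    """
    total = 0
    correct = 0
    for animal, truth in true_geno.items():
        filled = filled_geno[animal]
        if len(filled) != len(truth):
            raise ValueError(f"SNP count mismatch for {animal}")
        for t, f in zip(truth, filled):
            total += 1
            # A heterozygote imputed where the truth is a homozygote
            # (or vice versa) is simply counted as wrong.
            correct += (t == f)
    return correct / total
```

With two animals, four SNPs each, and a single wrongly imputed genotype, the function returns 7/8 = 0.875.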

Content validation
The SNP assays for 6,914 loci were validated using data from 290 samples that represented 26 global dairy and beef breeds (Table 2) and included Bovine HapMap samples [20]. The 290 samples (234 males, 56 females) included 286 unrelated samples, 2 trios, and 2 replicates. All markers were assessed for clustering of the genotypes using Illumina GenomeStudio genotyping software (version 2010.3; http://www.illumina.com/documents/products/datasheets/datasheet_genomestudio_software.pdf). A total of 6,909 clearly identifiable and scorable clusters were retained for robust utility of the panel. The cluster positions were defined with priority given first to data from dairy breeds and second to beef breeds. The purpose of the resulting cluster position file is to apply known robust cluster positions to future genotyping data for high throughput genotype calling. For phylogenetic analysis based on Y and mtDNA SNPs, individual sequences for each breed were clustered to construct consensus sequences using SNPs from 9 Y-chromosome loci and 13 mtDNA loci with the DNASTAR SeqMan program (version 6.1; http://www.dnastar.com/t-subproducts-lasergene-seqmanpro.aspx). There were 236 X-chromosome SNPs on the final BovineLD chip. Flanking sequences and base calls for the 6,909 SNPs are given in Table S1.

SNP call rates and accuracy
The BovineLD chip, consisting of 6,909 final loci, was validated for 290 individuals from 26 major dairy and beef breeds (Table 2). The mean call rate was 99.94% among dairy breeds, 99.90% among beef breeds, and 99.93% among all samples. For taurine breeds, discordant calls compared to BovineSNP50 represented <0.01% of all genotyping calls (Table 2). Mendelian consistency was examined using two Holstein trios, which showed a single error on BTB-01149046 out of 13,797 total possible comparisons. Reproducibility was 100% across two Holstein replicated samples. Based on the nearly perfect concordance between the BovineLD and the BovineSNP50 genotypes reported in Table 2 and the similar concordance between BovineSNP50 and BovineHD genotypes, Mendelian consistency and reproducibility were also examined for the overlapping 6,844 SNPs from BovineHD genotypes. Those data included 8 parent-progeny, 24 parent-parent-progeny, and 10 replicate comparisons that represented 11 taurine, 2 indicine, and 1 hybrid breeds (Table 3). Mendelian consistency was 99.95%, and reproducibility was 99.99%.
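A trio-based Mendelian consistency check of the kind reported above can be sketched as follows. This is an illustrative Python sketch for autosomal SNPs, not the software used in the study. The 0/1/2 genotype coding, the use of None for a missing call, and the function names are assumptions. A progeny genotype is treated as consistent when it can be formed from one allele transmitted by each parent.

```python
# Number of copies of the counted allele that a parent can transmit,
# given its own genotype coded as 0, 1, or 2 copies (assumed coding).
GAMETES = {0: (0,), 1: (0, 1), 2: (1,)}


def is_consistent(sire, dam, child):
    """True if the child genotype is possible given both parents."""
    return any(s + d == child for s in GAMETES[sire] for d in GAMETES[dam])


def mendelian_errors(trios):
    """Count comparisons and Mendelian errors over a list of trios.

    trios: list of (sire_genos, dam_genos, child_genos), each a list of
    0/1/2 genotypes with None for a missing call (hypothetical layout).
    Returns (comparisons, errors).
    """
    compared = 0
    errors = 0
    for sire, dam, child in trios:
        for s, d, c in zip(sire, dam, child):
            if s is None or d is None or c is None:
                continue  # skip SNPs without a complete trio of calls
            compared += 1
            errors += not is_consistent(s, d, c)
    return compared, errors
```

Consistency is then 1 - errors / comparisons; for the two Holstein trios, one error in 13,797 comparisons corresponds to a consistency of roughly 99.99%.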
The concordance rate for 2,088 SNPs in common between BovineLD and Bovine3K assays was 98.78% for 281 females genotyped with both chips. The most likely cause of the differential performance between the BovineLD and Bovine3K chips is the chemistry difference between the Infinium and GoldenGate assays.

Performance for MAF, mean spacing, and paternal and maternal lineages
Data for calculating mean MAF (Table 1) were primarily BovineLD markers extracted from BovineSNP50 data. However, if BovineSNP50 data were not available, BovineLD markers from the validation data were used. That method allowed MAFs to be calculated more accurately. Mean MAF for the 6,909 SNPs was ≥0.29 for all taurine breeds (Table 1). For Brahman (a Bos primigenius indicus breed), mean MAF was lower (0.18). Overall, >89% of the SNPs were polymorphic in Brahman, which suggested that the BovineLD chip may be useful for imputation in this breed.
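The two summary statistics used above, mean MAF and the proportion of polymorphic SNPs, can be computed from genotype counts as in the following Python sketch. The genotype coding (0/1/2 copies of one allele, None for a missing call), the input layout, and the function names are assumptions for illustration. Here a SNP is called polymorphic when its MAF in the sample is greater than zero, which is one possible definition and may differ from the threshold used for Table 1.

```python
def minor_allele_frequency(genotypes):
    """MAF for one SNP from genotypes coded 0/1/2; None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None  # no usable calls for this SNP
    freq = sum(called) / (2 * len(called))  # frequency of the counted allele
    return min(freq, 1 - freq)


def summarize_maf(genotypes_by_snp):
    """Return (mean MAF, proportion of SNPs with MAF > 0) for one breed.

    genotypes_by_snp: dict mapping SNP name -> list of genotypes
    (hypothetical layout).
    """
    mafs = [minor_allele_frequency(g) for g in genotypes_by_snp.values()]
    mafs = [m for m in mafs if m is not None]
    mean_maf = sum(mafs) / len(mafs)
    polymorphic = sum(m > 0 for m in mafs) / len(mafs)
    return mean_maf, polymorphic
```

Applied per breed, the first value corresponds to the mean MAF column and the second to the polymorphic proportion discussed for Brahman.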
For the 6,909 SNPs selected for the BovineLD chip, median spacing was 0.348 Mbp, with only 82 (1.1%) of intervals greater than 1 Mbp (Fig. 1). These gaps originate either from the X chromosome, or from regions not covered by the BovineSNP50. The strategy of increasing SNP density at chromosome ends substantially improved imputation accuracy for those regions compared with the Bovine3K array (Fig. 2).
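Spacing statistics such as the median interval and the number of gaps above 1 Mbp can be derived from map positions alone. The Python sketch below assumes a dictionary of base-pair positions per chromosome and a hypothetical function name; intervals are measured only between adjacent SNPs on the same chromosome.

```python
from statistics import median


def spacing_summary(positions_by_chrom, gap_threshold_bp=1_000_000):
    """Return (median interval in bp, intervals above threshold, total intervals).

    positions_by_chrom: dict mapping chromosome -> list of SNP positions
    in bp (hypothetical layout).
    """
    intervals = []
    for positions in positions_by_chrom.values():
        ordered = sorted(positions)
        # Distance between each SNP and the next one on the same chromosome.
        intervals.extend(b - a for a, b in zip(ordered, ordered[1:]))
    large = sum(i > gap_threshold_bp for i in intervals)
    return median(intervals), large, len(intervals)
```

Dividing the second value by the third gives the proportion of intervals longer than the threshold.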
The sex-specific and lineage identification SNPs also appeared to perform well. The nine Y-chromosome SNPs had a 100% call rate across 230 males of different breeds and no genotype calls for the 55 females. We investigated the frequency of the haplotypes of the alleles from these 9 SNPs both within and across breeds. Four unique haplotypes were observed, which differed dramatically in frequency across breeds (Table 4). One haplotype, CGCCGCAAC (haplotype 1), was observed only in cattle with indicine lineage (e.g. Brahman, Beefmaster, Santa Gertrudis). The second haplotype (TCTCCTCAC) was associated with central European lineage; haplotype 3 (TCTCCTCAT) was 1 base different from haplotype 2 and appeared to be associated with breeds that came to the island of Jersey from France or Spain; and haplotype 4 (TCTTGTCGC) was associated with northern European lineage, including islands. Only a few breeds had more than one haplotype, e.g. Santa Gertrudis and Beefmaster, both of which are taurine-indicine hybrids. Common haplotypes across breeds appeared to reflect a common origin. Phylogenetic analysis separated the 26 breeds into four distinctive clades, which agrees with a previous report on the dual origins of dairy cattle breeds in Europe [21]. For mtDNA SNPs (Table 5), seven unique mitochondrial haplotypes were found; however, 259 of the animals sampled had the same mitochondrial haplotype. Haplotype 7 (AAGAGCAAAAAAG) was at highest frequency in indicine cattle. Most taurine × indicine cattle were derived from taurine cows. Therefore, the lack of haplotype 7 for taurine breeds in most regions is not unexpected. While more research is required, these preliminary results suggest the BovineLD markers could be useful in determining lineage origin between taurine and indicine breeds or identifying potential admixture within a population of locally adapted animals.
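The haplotype comparisons above can be reproduced with a short Python sketch. The four Y-chromosome haplotype strings are taken from the text; the sample layout (breed, haplotype pairs), the placeholder breed names in the usage, and the function names are assumptions for illustration, and no frequencies from Table 4 are reproduced here.

```python
from collections import Counter

# Y-chromosome haplotypes as listed in the text (9 SNPs each).
Y_HAPLOTYPES = {
    1: "CGCCGCAAC",
    2: "TCTCCTCAC",
    3: "TCTCCTCAT",
    4: "TCTTGTCGC",
}


def hamming(a, b):
    """Number of positions at which two equal-length haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def haplotype_frequencies(samples):
    """Within-breed haplotype frequencies.

    samples: iterable of (breed, haplotype) pairs (hypothetical layout).
    Returns a dict mapping breed -> {haplotype: frequency}.
    """
    counts = {}
    for breed, hap in samples:
        counts.setdefault(breed, Counter())[hap] += 1
    return {
        breed: {hap: n / sum(c.values()) for hap, n in c.items()}
        for breed, c in counts.items()
    }
```

For example, `hamming(Y_HAPLOTYPES[2], Y_HAPLOTYPES[3])` confirms that haplotypes 2 and 3 differ at a single position, the last of the nine SNPs.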

Accuracy of imputation
Imputation accuracy was assessed in Australian, French, and North American cattle populations. In all cases, the accuracy of imputation to BovineSNP50 genotypes was ≥95% (Table 6). Most imputation results were >97%, particularly for dairy breeds. The results were lower for some breeds, likely because of the limited reference population size used. For example, the considerably larger size of the North American reference set of Holsteins compared with the Australian set could explain why the North American imputation accuracy was 1.1 percentage points higher than for Australia. The effect of a smaller reference set of genotypes on imputation accuracy was further demonstrated by imputation from BovineLD genotypes for Australian Angus, which had the smallest reference population in the data set. For French populations, imputation efficiency also varied, with the highest accuracy for Holsteins and the lowest for Blondes d'Aquitaine (Table 6); imputation accuracy for Normandes and Montbéliardes was slightly lower than for Holsteins. Again, much of the variation is likely explained by reference population size.
For Australian and North American Holsteins, accuracy of imputation to BovineSNP50 genotypes was better for BovineLD genotypes than for Bovine3K genotypes. For Australian Holsteins, imputation accuracies were up to almost 6 percentage points higher with the BovineLD chip than with the Bovine3K chip using the same data (Table 7). Mean imputation accuracy was 92.8% for Australian Holstein Bovine3K genotypes compared with 97.6% for BovineLD genotypes. For North American Holsteins, accuracies of imputation to BovineSNP50 genotypes from Bovine3K genotypes ranged from 93.0 to 96.7% (depending on number of parents genotyped) for 2,456 animals genotyped with both Bovine3K and BovineSNP50 chips [17]. Corresponding values for BovineLD genotypes (Table 8) are 96.6 to 99.3%.
The greatest improvement in imputation for BovineLD genotypes compared with Bovine3K genotypes was for individuals with no genotyped parents. For Australian Holsteins, difference in mean imputation accuracy with and without a sire in the reference population was 2.9 percentage points for Bovine3K genotypes but only 1.3 percentage points for BovineLD genotypes. The improvement was smaller for North American Holsteins: a difference of 2.7 percentage points between both parents genotyped and no genotyped parents for BovineLD genotypes (Table 6) compared with 3.7 percentage points for Bovine3K genotypes [17]. Compared with North American Holsteins, BovineLD imputation accuracy for animals without a parent in the reference population was slightly poorer for North American Jersey and Brown Swiss populations (Table 8). However, the more than doubling of markers and the different SNP selection criteria [22] compared with the Bovine3K chip allowed high imputation accuracies across a wider range of dairy breeds as well as some beef breeds.

Discussion
The Illumina BovineLD BeadChip includes 6,909 SNPs selected to provide optimized imputation to BovineSNP50 genotypes in dairy breeds. The SNPs have MAFs of >0.3 in most breeds, and nearly uniform spacing across the genome except at the ends of the chromosome where densities were increased. The chip also includes SNPs on the Y chromosome and mtDNA loci that are useful for gender checking, determining subspecies classification and identifying certain paternal and maternal breed lineages. Accuracy of imputation to BovineSNP50 genotypes using the BovineLD chip was >99% when both parents were genotyped in the North American BovineSNP50 reference population. That high accuracy suggests that the design criteria for the BovineLD chip would be useful to consider in other species for which an "imputation chip" could dramatically lower the cost of implementing genomic selection. BovineLD imputation was about 3 percentage points more accurate across multiple populations compared with Bovine3K imputation. The improvement was greatest when neither parent had been genotyped. The gain in imputation accuracy is attributed primarily to the increased overall density of the BovineLD chip compared with the Bovine3K chip and also to the even further increased density at the ends of chromosomes. The high MAFs also contribute to the improved imputation accuracy. The MAFs were similar across taurine beef and dairy breeds as was the proportion of SNPs that were polymorphic. Although it would be expected that accuracies of imputation would be highest for those breeds which were included in the design of the chip, which was dominated by dairy breeds, the similar SNP characteristics (particularly the high MAF across many beef and dairy taurine breeds) suggest that the BovineLD chip will perform well in imputation of taurine beef cattle. Our results suggest that the imputation accuracy will also be quite dependent on the size of the population genotyped with a higher density SNP assay.
Overall, the new BovineLD BeadChip should facilitate low cost genomic selection in Bos primigenius taurus beef and dairy cattle.

Supporting Information
Table S1 Genomic locations, flanking sequences and base calls for the 6,909 SNPs on the BovineLD array. (CSV)