CRUMBLER: A tool for the prediction of ancestry in cattle

In many beef and some dairy production systems, crossbreeding is used to take advantage of breed complementarity and heterosis. Admixed animals are frequently identified by their coat color and body conformation phenotypes; however, without pedigree information it is not possible to identify the expected breed composition of an admixed animal, and in the presence of selection the actual composition may differ from expectation. As the use of DNA and genotype data becomes more pervasive in animal agriculture, a systematic method for estimating breed composition (the proportions of an animal’s genome originating from ancestral pure breeds) has utility for a variety of downstream analyses, including the estimation of genomic breeding values for crossbred animals, the estimation of quantitative trait locus effects, and the estimation of heterosis and heterosis retention in advanced generation composite animals. Currently, there is no automated or semi-automated ancestry estimation platform for cattle. The objective of this study was to evaluate the utility of extant public software for ancestry estimation and to determine the effects of reference population size and composition, and of the number of single nucleotide polymorphism loci used, on ancestry estimation. We also sought to develop an analysis pipeline that would simplify this process for members of the livestock genomics research community. We developed and tested a tool, “CRUMBLER”, to estimate the global ancestry of cattle using ADMIXTURE and SNPweights based on a defined reference panel. CRUMBLER was developed and evaluated in cattle, but it is a species-agnostic pipeline that facilitates the streamlined estimation of breed composition for individuals with potentially complex ancestries using publicly available global ancestry software and a specified reference population SNP dataset.
We developed the reference panel from a large cattle genotype data set and breed association pedigree information using iterative analyses to identify purebred individuals that were representative of each breed. We also evaluated the numbers of markers necessary for breed composition estimation and simulated genotypes for advanced generation composite animals to evaluate the precision of the developed tool. The developed CRUMBLER pipeline extracts a specified subset of genotypes that is common to all current commercially available genotyping platforms, processes these into the file formats required for the analysis software, and predicts admixture proportions using the specified reference population allele frequencies.
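The pipeline steps described above (marker extraction, conversion to the required file formats, and supervised ancestry estimation) can be illustrated with a minimal Python sketch. This is not the CRUMBLER implementation: the function names and file prefixes are hypothetical, and the sketch assumes only standard PLINK 1.9 options (--cow, --file, --extract, --make-bed, --out) and the ADMIXTURE --supervised mode. The functions only assemble command lines; nothing is executed.

```python
from typing import List


def plink_extract_cmd(ped_map_prefix: str, snp_list: str, out_prefix: str) -> List[str]:
    """Build a PLINK 1.9 command that keeps only the SNPs listed in snp_list.

    ped_map_prefix, snp_list and out_prefix are hypothetical file names.
    """
    return [
        "plink",
        "--cow",                    # cattle chromosome set
        "--file", ped_map_prefix,   # reads <prefix>.ped and <prefix>.map
        "--extract", snp_list,      # text file with one SNP ID per line
        "--make-bed",               # write binary .bed/.bim/.fam files
        "--out", out_prefix,
    ]


def admixture_supervised_cmd(bed_file: str, k: int) -> List[str]:
    """Build a supervised ADMIXTURE command for k reference populations.

    Supervised mode expects a .pop file with the same prefix as the .bed
    file, labelling reference animals and marking target animals with "-".
    """
    return ["admixture", "--supervised", bed_file, str(k)]


# Hypothetical file names and k, for illustration only.
extract = plink_extract_cmd("assay_genotypes", "shared_markers.txt", "shared")
estimate = admixture_supervised_cmd("shared.bed", 3)
print(" ".join(extract))
print(" ".join(estimate))
```

Each list could be passed to subprocess.run(cmd, check=True) if the programs are installed. Building the commands as lists rather than shell strings avoids quoting problems with file names.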


Abstract
Background
In many beef and some dairy production systems, crossbreeding is used to take advantage of breed complementarity and heterosis. Admixed animals are frequently identified by their coat color and body conformation phenotypes; however, without pedigree information it is not possible to identify the expected breed composition of an admixed animal, and in the presence of selection the actual composition may differ from expectation.
Results
We tested an approach to estimate the global ancestry of individuals using ADMIXTURE and SNPweights. ADMIXTURE estimates ancestry using a model-based approach applied to large single nucleotide polymorphism (SNP) genotype datasets. Individuals are assumed to be unrelated and a supervised analysis can be performed using reference animals sampled to represent ancestral populations. SNPweights infers ancestry using weights estimated by principal component analysis for genome-wide SNP panels that have been genotyped in the reference panel animals. We constructed analysis pipelines to determine the ancestry of individuals with potentially complex ancestries using both methods and a specified reference population SNP dataset. The reference population was constructed using breed association pedigree information and an iterative analysis to identify sets of purebred individuals
proportions using the reference population dataset.

In the U.S. and many other countries, the breed of an animal is associated with its being registered with a breed association which requires that both parents of the animal be identified and also registered with the association. For 50 years, parentage has been validated by each breed association using blood or DNA typing. Many breed associations have closed herdbooks which means, in theory, that the pedigrees of all animals can be traced back to the animals that founded the breed's herdbook.
Other breed associations have open herdbooks, which means that crossbred animals can be registered with the breed if they have been graded up by crossbreeding to have the expectation of a certain percentage of their genome (e.g. 15/16ths) originating from the respective breed based upon pedigree records and parentage validation. The term "fullblood" is used to identify cattle for which every ancestor is registered in the herdbook and can be traced back to the breed founders. The term "purebred" refers to animals that have been graded up to purebred status. Pedigree errors that occurred prior to, or that were not identified with, the implementation of blood typing and DNA testing lead to admixed animals being incorrectly classified as fullblood and to incorrectly identified admixture proportions in purebred animals. The effects of recombination and random assortment of chromosomes into gametes lead to considerable variation in the extent of identity by descent between relatives separated by more than a single meiosis [8] and can also lead to admixture proportions that differ substantially from expectation based on pedigree in purebred animals.

In commercial production of beef in the U.S., crossbreeding is extensively used to capitalize on the effects of breed complementarity and heterosis, resulting in herds of females with very complex ancestries; these herds frequently use fullblood or purebred bulls sourced from registered breeders. Changes in the decision as to which breed of bull to use can result in large changes in admixture proportions of replacement cows and marketed steers between years and large differences can occur between herds for the same reason.
When commercially sourced animals are used to generate resource populations to study the genomics of economically important traits such as feed conversion efficiency [9, 10] or bovine respiratory disease [11], the presence of extensive admixture in the phenotyped and genotyped animals may impact the genome-wide association analysis (GWAA) [9, 10] and leads to the training of genomic prediction models in populations for which the breed composition is not understood. As a consequence, the utility of these models in other industry populations, including the registered breeds in which the majority of genetic improvement is generated, is also not understood.
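The pedigree-based expectation for a graded-up animal (e.g. 15/16ths) mentioned above follows from halving the non-breed fraction at each generation of mating to a sire of the target breed. The short Python sketch below works through that arithmetic. It assumes, for illustration only, that every sire is exactly 100% of the target breed and that the pedigree contains no errors; the function name is hypothetical. As noted in the text, realized genome proportions can deviate from this expectation because of recombination and random assortment.

```python
from fractions import Fraction


def expected_breed_fraction(generations: int, dam_start: Fraction = Fraction(0)) -> Fraction:
    """Pedigree expectation after repeated matings to sires of the target breed.

    Assumes each sire is 100% target breed, so every generation the
    offspring fraction is the average of the dam's fraction and 1.
    """
    fraction = dam_start
    for _ in range(generations):
        fraction = (fraction + 1) / 2
    return fraction


for n in range(1, 5):
    print(n, expected_breed_fraction(n))  # 1/2, 3/4, 7/8, 15/16
```

Under these assumptions the expectation after n generations is 1 - (1/2)^n, so four generations of grading up from an unrelated foundation dam give the 15/16ths expectation.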

Marker set determination
To maximize the utility of the developed breed assignment tool, we identified the intersection set of markers located on the bovine assays for which we had available genotype data (Table 2). However, during the process of identifying the animals that would define the breed reference panel, only 16 individuals had been genotyped using the GGP-LDV4 (n=2) and GGP-LDV3 (n=14) assays and no animals had been genotyped using the GGP-LDV1 assay. To retain as many SNP markers as possible for subsequent analysis, we identified the intersection of markers present on the GGP-90KT, GGP-F250, GGP-HDV3, GGP-LDV3, GGP-LDV4, BovineHD, BovineSNP50 and i50K assays. This intersection set included 6,799 SNP markers (BC7K). The intersection of the markers representing 5 assays (GGP-90KT, GGP-F250, GGP-HDV3, BovineHD, and BovineSNP50) was 13,291 markers (BC13K).

Genotypes can arise from any of the common bovine genotyping platforms (Table 2), provided that a PLINK compatible MAP file is provided for each assay and data produced using only a single genotyping assay is included in each PED file. The
Fig. S1 and Table 1.
(Table 1). Following the fastSTRUCTURE analysis using K=19 after removal of Salers and using the BC7K marker set, Texas Longhorn was also removed from the reference panel breed list due to the inability to distinguish Texas Longhorn as
(Table 3).
(Fig. 5). On the other hand, using an ancestry assignment of ≥ 85% clearly captured greater diversity within each breed (Fig. 6) and maximized the self-assignment of ancestry to the breed of registration (Table 5).
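The marker intersection described at the start of this section can be sketched in a few lines of Python. The sketch assumes that SNPs are matched across assays by the identifier in the second column of each PLINK MAP file (chromosome, SNP ID, genetic distance, base-pair position); the assay names and SNP identifiers below are hypothetical toy values, not the actual assay content.

```python
from typing import Dict, Iterable, Set


def snp_ids_from_map_lines(lines: Iterable[str]) -> Set[str]:
    """Collect SNP IDs (second column) from whitespace-delimited MAP lines."""
    ids = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            ids.add(fields[1])
    return ids


def intersect_assays(assay_snps: Dict[str, Set[str]]) -> Set[str]:
    """Return the SNP IDs that are present on every assay."""
    if not assay_snps:
        return set()
    return set.intersection(*assay_snps.values())


# Toy MAP content for three hypothetical assays.
toy_maps = {
    "assay_A": ["1 snp1 0 1000", "1 snp2 0 2000", "2 snp3 0 1500"],
    "assay_B": ["1 snp2 0 2000", "2 snp3 0 1500", "2 snp4 0 9000"],
    "assay_C": ["1 snp1 0 1000", "2 snp3 0 1500"],
}
assay_snps = {name: snp_ids_from_map_lines(lines) for name, lines in toy_maps.items()}
shared = intersect_assays(assay_snps)
print(sorted(shared))  # ['snp3']
```

Matching on identifiers alone assumes consistent SNP naming across assays; if names differ between platforms, matching on chromosome and position would be an alternative. Dropping an assay from the dictionary enlarges the shared set, which mirrors the difference in size between the marker sets built from eight and from five assays.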
To examine whether the specific individuals represented in the reference panel sample influenced the self-assignment of ancestry to the sampled individuals, a second sample of ≤ 50 distinct individuals per breed was obtained from the individuals with ≥ 85% assignment to their breed of registration and analyzed with SNPweights (Fig. 7). This analysis indicates that the ability to predict ancestry was not influenced by the
(Fig. 4). Consequently, we conclude that these apparent introgressions can be

Reference panel validation
To evaluate the ability of the selected reference breed panel (Table 1)
Fig. 6). Using these SNP weights, SNPweights software was used to estimate the ancestry proportions for the 27 registered Beefmaster individuals (Fig. 8)
Shorthorn which are close to expectation. Ancestry assignments of registered fullblood animals to their breed of registration were generally in the range 31-99%, with 93% of animals being assigned to their breed of registration with genome proportions of >50%. The remaining genome proportions again appear to identify common ancestry between the breeds that predates breed formation (Fig. S16).

Moreover, the order in which the target individuals appear in the genotype input file also seems to affect the ADMIXTURE estimates of ancestry proportions for the target individuals. Fig. 13 shows the results of an ADMIXTURE analysis in which the target individuals were identical to those shown in Fig. 12, but the order of the reference individuals and the 2,005 Hereford crossbred individuals was reversed in the input files. In Fig. 12, the reference individuals appear before the 2,005 Hereford crossbred
≥ 85% assignment to their breed of registration was conducted to assess the effects of the reduction in markers used for ancestry assignment. The ancestry proportions assigned based on the BC6K marker set (Fig. 15) do not differ significantly from those obtained using the BC7K marker set (Fig. 6). This result indicates the utility of CRUMBLER and the reference panel breed set across the spectrum of commercially available genotyping platforms.

Competing interests
The authors declare that they have no competing interests.

Fig. 4
SNPweights self-assignment of ancestry for candidate reference breed individuals following evaluation of open herdbook breeds using: (a) the BC7K, or (b) the BC13K marker panels. Reference breed panels were constructed by randomly sampling ≤50 individuals per breed and SNP weights were estimated using the BC7K and BC13K marker sets.

Fig. 5
Reference breed panel constructed by the random sampling of ≤50 individuals per breed from individuals with ≥90% ancestry was self-assigned to reference breed ancestry using the BC7K marker set.

Fig. 6
Reference breed panel constructed by the random sampling of ≤50 individuals per breed from individuals with ≥85% ancestry was self-assigned to reference breed ancestry using the BC7K marker set.

Fig. 7
Reference breed panel constructed by the independent random sampling of a second sample of ≤50 individuals per breed from individuals with ≥85% ancestry after eliminating individuals represented in the first sample was self-assigned to reference breed ancestry using the BC7K marker set.

Fig. 8
(a) SNPweights ancestry assignment results for 27 Beefmaster individuals based on a reference panel sampled to contain ≤50 individuals per breed with ancestry self-assignment percentages of ≥85% to their breed of registry. (b) Breed assignment for the Beefmaster individuals can be determined using this reference breed key.

Fig. 9
(a) SNP weights were calculated using three independent reference panels with ≤50 individuals per breed sampled from the individuals with ≥85% assignment to their breed of registry. Ancestry results for the 27 Beefmaster animals are shown as 27 sets of three columns demarcated by solid lines. Each column within a demarcated set represents results for one of the three reference breed panels. (b) Breed assignment for the Beefmaster individuals can be determined using this reference breed key.

Fig. 10
(a) SNPweights ancestry results for 238 crossbred individuals with a priori breed composition estimates of 50% Angus and 50% Simmental based on a reference panel with ≤50 individuals per breed sampled from individuals with ≥85% assignment to their breed of registry. (b) Breed assignment for the crossbred individuals can be determined using this reference breed key.

Fig. 11
Self-assignment of ancestry for the animals in the reference breed set formed with ≤50 individuals per breed from the individuals that had ≥85% assignment to their breed of registration using ADMIXTURE.

Fig. 12
ADMIXTURE analysis conducted using the same data as shown in Fig. 11 (first four rows), merged with an additional 2,005 high percentage crossbred Hereford target individuals (last row).

Fig. 13
ADMIXTURE analysis conducted using the same data as shown in Fig. 12, where the reference individuals appear before the 2,005 Hereford crossbred individuals in the input file. Here, the 2,005 Hereford crossbred individuals appear before the reference individuals in the input file. The first row represents the 2,005 Hereford crossbred samples. Rows 2 to 5 show the reference panel individuals.

Fig. 14
ADMIXTURE analysis conducted using the same data as shown in Figs. 12 and 13, but with the order of the individuals in the input genotype file randomized. The first four rows represent the reference panel individuals, the fifth row shows the 2,005 Hereford crossbred animals.

Fig. 15
Reference breed panel constructed by the random sampling of ≤50 individuals per breed from individuals with ≥85% ancestry was self-assigned to reference breed ancestry using the BC6K marker set.