Genome-Wide Estimates of Coancestry, Inbreeding and Effective Population Size in the Spanish Holstein Population

Estimates of effective population size in the Holstein cattle breed have usually been low despite the large number of animals that constitute this breed. Effective population size is inversely related to the rates at which coancestry and inbreeding increase and these rates have been high as a consequence of intense and accurate selection. Traditionally, coancestry and inbreeding coefficients have been calculated from pedigree data. However, the development of genome-wide single nucleotide polymorphisms has increased the interest of calculating these coefficients from molecular data in order to improve their accuracy. In this study, genomic estimates of coancestry, inbreeding and effective population size were obtained in the Spanish Holstein population and then compared with pedigree-based estimates. A total of 11,135 animals genotyped with the Illumina BovineSNP50 BeadChip were available for the study. After applying filtering criteria, the final genomic dataset included 36,693 autosomal SNPs and 10,569 animals. Pedigree data from those genotyped animals included 31,203 animals. These individuals represented only the last five generations in order to homogenise the amount of pedigree information across animals. Genomic estimates of coancestry and inbreeding were obtained from identity by descent segments (coancestry) or runs of homozygosity (inbreeding). The results indicate that the percentage of variance of pedigree-based coancestry estimates explained by genomic coancestry estimates was higher than that for inbreeding. Estimates of effective population size obtained from genome-wide and pedigree information were consistent and ranged from about 66 to 79. These low values emphasize the need of controlling the rate of increase of coancestry and inbreeding in Holstein selection programmes.


Introduction
It is well known that the effective size of a population (N e ) is usually less than its census size as there are often deviations from the assumptions of the idealized population by having unequal sex ratios, variation in family size, unequal numbers in successive generations and overlapping generations [1]. A clear example of this is given by the Holstein cattle breed that despite being numerically very large (millions of animals spread across the world), shows a N e of the order of 100 [2,3] when calculated from the rate of inbreeding (ΔF) computed from pedigree data.
Holstein dairy cattle have dominated the milk production industry over decades. Intense and accurate artificial selection practised over many years has resulted in high rates of genetic gain for milk production traits. This has implied an extreme use of a limited number of elite sires of high genetic merit for production traits via artificial insemination. However, the high rates of gain have been accompanied by large increases in the rates at which inbreeding and coancestry (Δf) accumulate and, consequently, by large reductions in N e . This has led, in the last decades, to an increasing awareness of the need of measuring and controlling the loss of genetic variation and the increase in inbreeding to mitigate its negative effects [4]. These include inbreeding depression in production traits and reproductive ability [5,6,7,8] and an increase in the prevalence of undesirable genetic disorders such as the complex vertebral malformation (CVM) [9], deficiency of uridine monophosphate synthase (DUMPS) [10], brachyspina syndrome (BS) [11] and Bovine leucocyte adhesion deficiency (BLAD) [12].
With the advances in high-throughput genotyping techniques and the development of chips containing thousands of SNPs at a reasonable cost, the implementation of genome-wide evaluation [13] has become routine in many large scale commercial Holstein breeding programmes [14,15]. As high levels of accuracy can be achieved at an early age, generation intervals can be shortened leading to faster gain [16]. Although genomic selection leads to decreased rates of inbreeding per generation in comparison with BLUP selection [17,18,19], it is not inbreeding free. Thus, the need for measuring and controlling inbreeding remains [17].
Traditionally, coancestry and inbreeding coefficients have been estimated from pedigree records. However, the dense marker chips that have been developed for cattle with the main purpose of increasing selection responses, can be also used for obtaining genome-wide estimates of coancestry and inbreeding. Genomic estimates are expected to be more accurate than pedigree-based estimates because they i) reflect the actual percentage of the genome that is homozygous (inbreeding) or the actual percentage of the genome shared by two individuals (coancestry) whereas pedigree-based estimates are only expectations of such percentages; and ii) are able to capture relationships due to very distant common ancestors that pedigree-based estimates ignore [20].
In livestock species, genomic estimates of inbreeding have been obtained for different cattle [8,21,22,23,24], sheep [25] and pig [26,27] breeds, but estimates of genomic coancestry are rare [28]. However, both inbreeding and coancestry are of fundamental importance in animal breeding programmes. Although inbreeding depression depends on the levels of inbreeding and not on coancestry, in scenarios with non-random mating such as those in most cattle breeding programmes, the rate at which genetic variability is lost is better measured by Δf than by ΔF [29].
The objective of this study was to obtain genomic estimates of coancestry and inbreeding coefficients and N e in the Spanish Holstein population. Genomic estimates were then compared to those based on pedigree information.

Materials and Methods
Genomic and pedigree data Genomic information from 11,135 animals belonging to the Spanish Holstein population was analyzed in this study. These individuals were genotyped with the Illumina BovineSNP50 BeadChip (versions v1 or v2). Only SNPs common to both chip versions were selected for the analysis (52,340 SNPs). SNPs positions within the genome were obtained from the UMD 3.0 bovine genome assembly [30]. Unmapped SNPs (523) and those mapped on chromosomes X or Y (1,056) were excluded. In addition, 14,068 SNPs with missing genotypes for more than 5% of the individuals were discarded. After that, 566 animals with more than 5% missing genotypes for the remaining 36,693 SNPs were also removed. The final dataset included 36,693 autosomal SNPs and 10,569 animals (9,990 bulls and 579 cows). The distribution of genotyped animals by year of birth is shown in Fig 1. The pedigree data set, provided by the Spanish Holstein Association (Conafe), was constructed with all known ancestors of genotyped individuals and comprised 35,473 animals. The software Endog [31] was used to calculate the average numbers of generations traced (14.50), full traced generations (2.56) and equivalent complete generations (6.12). The generation interval (L) was calculated as the average age of parents when their offspring was born. The average generation intervals for sires of bulls, dams of bulls, sires of cows and dams of cows were 6.06, 3.88, 5.24 and 3.77, respectively. The average weighted L across paths and years was 4.2 years. We observed that L has decreased across time. The average L for the periods 1960 to 1979, 1980 to 1999 and 2000 to 2013 were 5.1, 4.2 and 3.3, respectively.

Genomic estimates of inbreeding and coancestry
Genomic estimates of inbreeding coefficients were obtained from runs of homozygosity (ROHs) which are long, uninterrupted stretches of homozygous genotypes [20]. Specifically, the inbreeding estimator, F ROH , is the proportion of the genome that is in ROH. For individual i, F ROH i was calculated as where n ROH i is the total number of ROHs in individual i, l ROH ik is the length of the k th ROH in individual i in base pairs and l g is the total length of the genome in base pairs. The criteria used for defining a ROH were as follows: (i) the minimum length that constituted a ROH was 4 Mb; (ii) the minimum number of SNPs was 30; (iii) the minimum density was 1 SNP per 100 kb; (iv) the maximum distance allowed between two consecutive homozygous SNPs in a run was 1 Mb; and (v) a maximum of two missing genotypes and one heterozygous genotype within a particular ROH were permitted. The choice of 4 Mb for the minimum length permitted for defining a ROH was based on the results of Ferenčaković et al. [26] who showed that F ROH based on shorter segments systematically overestimates autozygosity when using the Illumina BovineSNP50 BeadChip.
The coancestry coefficient based on IBD segments (f SEG ) between individuals i and j was calculated as where l SEG k ða i ; b j Þ is the length of the k th shared IBD segment measured over homologue a of individual i and homologue b of individual j and n SEGij is the total number of IBD segments between individuals i and j. The criteria for defining an IBD segment were equivalent to those used for defining a ROH. Both F ROH i and f SEG ij were obtained using the software developed by de Cara et al. [32].
In order to calculate coancestry coefficients based on IBD segments, haplotypes need to be inferred from genotype data. The phase inference was performed using the software BEAGLE [33] that uses localized haplotype clustering and fits the data through an Expectation-Maximization (EM) algorithm. Each chromosome was phased independently using the default parameters (10 iterations). After haplotype inference, f SEG ij was obtained as explained above.

Genomic estimates of homozygosity and similarity
We also obtained genomic estimates of molecular homozygosity and molecular similarity on a SNP-by-SNP basis. The genomic similarity between individuals was obtained by applying Malécot's definition [34] to the marker loci; that is, the molecular similarity between individuals i and j (S SNP ij ) is the probability that two alleles at a given locus taken at random from each individual are equal (IBS). Analogously, the genomic homozygosity of individual i (H SNP i ) is the probability that the two alleles carried by this individual at a given locus are IBS. In this study, S SNP ij was calculated as where n s is the number of SNPs and I s k i m j is the identity of the k th allele from individual i with the m th allele from individual j at SNP s, and takes a value of 1 if both alleles are identical and 0 otherwise. The genomic homozygosity of individual i was calculated as H SNP i ¼ 2S SNP ii À 1, and was equal to the proportion of homozygous SNPs.

Pedigree-based estimates
The pedigree-based coancestry (f PED ij ) and inbreeding (F PED i ) coefficients for the genotyped individuals were calculated by going back only five generations in the pedigree in order to homogenise the amount of pedigree information across individuals. This data set included 31,203 animals. Estimates of both coefficients were obtained using the software PEDIG [35] with the option that implements the algorithm of Meuwissen and Luo [36].

Rates of change in coancestry and inbreeding, and effective population size
Rates of change in genomic and pedigree-based inbreeding per year (ΔF ROH(y) and ΔF PED(y) , respectively) were computed by regressing the natural logarithm of (1-F) for each individual on the year of birth. The slopes of these regressions are approximately equal to -ΔF ROH(y) and -ΔF PED(y) . Rates of change in inbreeding per generation (ΔF ROH and ΔF PED ) were calculated by multiplying the rates per year by the generation interval. Finally, estimates of N e were obtained from the rate of change in inbreeding per generation as N eF_ROH = 1/2ΔF ROH and N eF_PED = 1/ 2ΔF PED . Rates of change in coancestry per year (Δf SEG(y) and Δf PED(y) ), and per generation (Δf SEG and Δf PED ), and effective population size (N ef_SEG = 1/2Δf SEG and N ef_PED = 1/2Δf PED ) were also computed following the same approach as for inbreeding. Rates of change in genomic homozygosity (ΔH SNP(y) and ΔH SNP ) and similarity (ΔS SNP(y) and ΔS SNP ), and the corresponding estimates of N e (N eH_SNP = 1/2ΔH SNP , and N eS_SNP = 1/2ΔS SNP ) were also obtained. Table 1 shows the mean, range, variance and coefficient of variation for the genomic homozygosity, similarity, inbreeding and coancestry coefficients. As expected, the molecular homozygosity and similarity coefficients were much higher than the inbreeding and coancestry coefficients. The reason is that the former (that are obtained on a SNP-by-SNP basis), do not discriminate alleles that are IBD or IBS. Estimates of F and f, based on ROHs and IBD segments, respectively, were close to the pedigree-based coefficients, indicating that they reflect well IBD. The highest coefficients of variation were observed for pedigree-based estimates. Table 2 shows the relationships between the genomic homozygosity and similarity, and different inbreeding and coancestry coefficients. The correlation between F ROH and F PED was higher than that between H SNP and F PED but still relatively low. Actually, 33% and 26% of the variance in F PED can be explained by F ROH and H SNP , respectively. The highest correlation was that between H SNP and F ROH . Seventy seven percent of the variance in F ROH can be explained by H SNP . Pearson's correlation coefficients involving S SNP , f SEG and f PED were substantially higher than those involving H SNP , F ROH , F PED . In fact, 61% and 53% of the variance in f PED is explained by f SEG and S SNP , respectively. Also, 90% of the variance in f SEG is explained by S SNP . Table 3 shows the rates of change in genomic homozygosity and similarity and in genomic and pedigree-based inbreeding and coancestry per year and per generation, and the N e estimated from those rates. Despite the different scales of the SNP-by-SNP based measures when compared with the inbreeding and coancestry coefficients (see Table 1), the rates of change were very similar for all these parameters, as expected. Consequently, estimates of N e obtained from ΔS SNP and from Δf SEG and Δf PED were very close. Rates of change in H SNP , F SEG and F PED were also very close. Pedigree-based inbreeding rates were slightly lower than the corresponding coancestry rates. Similarly, rates of genomic homozygosity were slightly lower than the corresponding rates of genomic similarity. Consequently, estimates of N e obtained from rates of change in molecular homozygosity and inbreeding were slightly higher than estimates obtained from rates of change in molecular similarity and coancestry.

Results
Given that L decreased over time, we also estimated N e for different periods (1960 to 1979, 1980 to 1999 and 2000 to 2013). Annual rates of homozygosity and inbreeding remained constant across time, but N e increased as a consequence of the reduction in L. These estimates of N e ranged from 61.   Table 3. Rates of change in inbreeding, molecular homozygosity, coancestry and molecular similarity per year (ΔF (y) , ΔH (y) , Δf (y) and ΔS (y) ), respectively, and per generation (ΔF, ΔH, Δf and ΔS) using different sources of information, and estimates of effective population sizes obtained from ΔF (N eF ), ΔH (N eH ), Δf (N ef ) and from ΔS (N eS ).

Discussion
In this study, estimates of coancestry, inbreeding and N e obtained from genome-wide information have been compared with those obtained from pedigree information in the Spanish population of Holstein cattle. Our results confirm the small N e of the Holstein breed and the need of controlling coancestry and inbreeding rates. The correlation between F ROH and F PED (0.57) obtained in the present study is in the range of values obtained in previous studies. Ferenčaković et al. [22] found values ranging from 0.50 to 0.72 when analysing four different cattle breeds with the 50 K chip. Keller et al. [20] showed that F ROH is preferable to F PED and other measures of genomic inbreeding because it correlates better with inbreeding depression. More specifically, for a N e similar to that estimated here for the Holstein population, they predicted that the correlation between homozygous mutation load and inbreeding coefficient was higher for F ROH (0.60) than for F PED (0.25).
Molecular homozygosity and similarity coefficients calculated on a SNP-by-SNP basis were much higher than pedigree-based inbreeding and coancestry coefficients because the latter refer to a base population where no homozygosity exists. Thus, with the SNP-by-SNP based method alleles that are IBD and IBS can not be distinguished. Several approaches have been proposed to express these genomic coefficients in the same scale as pedigree-based coefficients [37] but they require knowing the base population allele frequencies. However, given that these frequencies are usually unknown, these methods are generally unaccurate [38,39]. Another reason for correcting the raw molecular homozygosity and / or similarity is that in genomewide evaluation methods, genomic relationship matrices are usually combined with pedigreebased relationship matrices because SNP genotypes are not available for many animals included in the evaluation procedure [21,40,41,42]. Genomic matrices are usually corrected using the frequencies estimated in the present population through the formula of VanRaden [43,44]. An alternative would be to use genomic matrices based on IBD segments rather that are in the same scale as pedigree-based matrices. This could represent an advantage over current approaches used in the context of genome-wide selection for predicting breeding values [45].
Here, estimates of molecular homozygosity, molecular similarity, genomic inbreeding and coancestry were obtained using genotypes contained in the Illumina BovineSNP50 BeadChip. A higher density cattle chip (i.e., the BovineHD BeadChip) containing 777,962 SNPs is also commercially available and used in dairy cattle genetic evaluations. Although in principle it is expected that increasing marker density would lead to an increase in the accuracy of genomic predictions, there are indications that this is not the case at least when N e is low [46]. Low N e is associated with high linkage disequilibrium and, thus, increasing marker density above a certain level may not improve significantly the accuracy of genomic evaluations [47]. In fact, small differences have been detected in genomic prediction when comparing the high density panel with the 50 K chip [48]. In a similar way, two different studies have showed that the use of the high density chip does not lead to higher accuracies when estimating inbreeding [23,49]. For three different cattle breeds, Ferenčaković et al. [23] found similar correlations between F ROH and F PED when using the 50 K chip (estimates ranging from 0.62 to 0.77) and when using the high density cattle panel (estimates ranging from 0.61 to 0.75). Also, Purfield et al. [49] analysed nine different cattle breeds and indicated that the percentages of the variance in F PED explained by F ROH with the high density and 50 k chips were very similar (56% and 53%, respectively).
As expected, rates of change in molecular homozygosity, and genomic and pedigree-based inbreeding (and rates of change in molecular similarity, and genomic and pedigree-based coancestry) were very similar [26,38]. Rates of change were slightly lower for molecular homozygosity and inbreeding than for molecular similarity and coancestry, and consequently, N eH and N eF estimates were slightly higher than N eS and N ef estimates. This result could be a reflection of non-random mating (i. e. avoidance of matings between relatives) if the level of non-randomness was not constant across generations [50].
For cattle, there is evidence that the levels of N e were large (of the order of tens of thousands or more) following domestication [2,3], but current N e in some modern breeds are of the order of 100 [2,3]. Estimates of N e obtained in this study across different time periods are close to those previously published for other Holstein populations. Estimates of N e obtained from pedigree data have ranged from 39 in the US population [51] to 115 in the Canadian population [52]. Estimates obtained from genome-wide data have ranged from around 80 in the US population [53] to 150 in the Australian Holstein cattle [54].
The small N e found in Holstein cattle reflects the fact that breeding strategies followed in this breed have implied a very heavy use of few top sires and reinforces the need of controlling the rate at which coancestry and inbreeding increase in selection programmes [2]. Both selection and mating strategies have been proposed in the past for controlling the rates of coancestry and inbreeding. In particular, optimum contribution selection [55,56] provides a useful tool to manage the rate of accumulation of inbreeding. When applying this selection tool in cattle, higher genetic gains are expected at a given rate of inbreeding [4]. Both objectives can be attained using programmes where selection is based on classical BLUP-EBVs and those based on genomic EBVs [19]. The application of the method in Holstein cattle would however require a coordinated policy on the use of selected candidates and on breeding objectives [4]. This may require the cooperation of breeders and artificial insemination organizations to ensure that the correct animals and their contributions are identified [57,58]. Recently, Pryce et al. [28] and Sun et al. [59] have showed that mating strategies based on genomic information can reduce progeny inbreeding when compared with strategies based on pedigree information. Thus, dense genetic marker information is not only valuable for increasing accuracies of selection but also for controlling the rate at which coancestry and inbreeding increase in dairy cattle breeding programmes [19].