Application of single step genomic BLUP under different uncertain paternity scenarios using simulated data

The objective of this study was to investigate the application of BLUP and single step genomic BLUP (ssGBLUP) models in different scenarios of paternity uncertainty with different strategies of scaling the G matrix to match the A22 matrix, using simulated data for beef cattle. Genotypes, pedigree, and phenotypes for age at first calving (AFC) and weight at 550 days (W550) were simulated using heritabilities based on real data (0.12 for AFC and 0.34 for W550). Paternity uncertainty scenarios using 0, 25, 50, 75, and 100% of multiple sires (MS) were studied. The simulated genome had a total length of 2,333 cM, containing 735,293 biallelic markers and 7,000 QTLs randomly distributed over the 29 BTA. It was assumed that QTLs explained 100% of the genetic variance. For QTL, the amount of alleles per loci randomly ranged from two to four. The BLUP model that considers phenotypic and pedigree data, and the ssGBLUP model that combines phenotypic, pedigree and genomic information were used for genetic evaluations. Four ways of scaling the mean of the genomic matrix (G) to match to the mean of the pedigree relationship matrix among genotyped animals (A22) were tested. Accuracy, bias, and inflation were investigated for five groups of animals: ALL = all animals; BULL = only bulls; GEN = genotyped animals; FEM = females; and YOUNG = young males. With the BLUP model, the accuracies of genetic evaluations decreased for both traits as the proportion of unknown sires in the population increased. The EBV accuracy reduction was higher for GEN and YOUNG groups. By analyzing the scenarios for YOUNG (from 0 to 100% of MS), the decrease was 87.8 and 86% for AFC and W550, respectively. When applying the ssGBLUP model, the accuracies of genetic evaluation also decreased as the MS in the pedigree for both traits increased. However, the accuracy reduction was less than those observed for BLUP model. Using the same comparison (scenario 0 to 100% of MS), the accuracies reductions were 38 and 44.6% for AFC and W550, respectively. There were no differences between the strategies for scaling the G matrix for ALL, BULL, and FEM groups under the different scenarios with missing pedigree. These results pointed out that the uninformative part of the A22 matrix and genotyped animals with paternity uncertainty did not influence the scaling of G matrix. On the basis of the results, it is important to have a G matrix in the same scale of the A22 matrix, especially for the evaluation of young animals in situations with missing pedigree information. In these situations, the ssGBLUP model is an appropriate alternative to obtain a more reliable and less biased estimate of breeding values, especially for young animals with few or no phenotypic records. For accurate and unbiased genomic predictions with ssGBLUP, it is necessary to assure that the G matrix is compatible with the A22 matrix, even in situations with paternity uncertainty.


Introduction
The multiple sires natural mating is the most common mating system in extensive beef cowcalf production, where several sires are kept in the same paddock to breed with cows during the mating season. Despite the management advantages of this mating system, it does not allow the progeny paternity identification, increasing the occurrence of missing pedigree and, consequently, compromising the genetic evaluation reliability.
Several methods have been suggested to increase the genetic evaluation reliability of animals with paternity uncertainty. In this regard, the unknown parent groups (UPG) can be included in the mixed model equations to account for genetic differences among defined animal groups in genetic evaluations [1]. Henderson [2] and Kennedy [3] noticed that ignoring genetic groups or defining poor genetic groups could introduce bias in genetic evaluations. Westell et al. [4] developed rules to setting up genetic groups in mixed model equations. The hierarchical animal model (HIER) proposed by Cardoso and Tempelman [5] is another procedure to perform genetic evaluation of animals with paternity uncertainty. This method combines phenotypic records and a priori information to deduce the a posteriori probabilities of the candidate sire, inferring the animal genetic merit with paternity uncertainty and its respective sire. Furthermore, the DNA markers can be used to assign calves to their individual sires based on inheritance rules. However, in most of the extensive beef cow-calf production systems, it is not common to identify or record the sire or group of sires used in the mating season, since the sires are only used in one mating season (cleanup bulls).
Recently, Legarra et al. [6], Misztal et al. [7] and Aguilar [8] proposed the single-step genomic BLUP procedure (ssGBLUP). This procedure combines the pedigree-based relationship matrix with the genomic relationship matrix into a single matrix (H) to predict the genomic estimated breeding value (GEBV). Several studies have reported that the ssGBLUP is computationally efficient and accurate for genomic evaluation purposes [8][9][10][11]. The pedigree-based relationship matrix of genotyped animals (A 22 matrix) and the genomic relationship matrix (G matrix) should be compatible in order to decrease the occurrence of biased genetic parameter estimates [11,12,13,14]. Hence, the G matrix is adjusted to reduce the differences between the average diagonal and the average off-diagonal elements in G and A 22 matrices. Chen et al. [12] reported that the scale of G influenced the ranking of genotyped versus non-genotyped animals. Vitezica et al. [14] by exploiting the ssGBLUP, derived a formal proof and showed that a well-constructed G provides more accurate and less biased GEBV than a multistep approach. Up to date, there are no reports about the consequences of using the ssGBLUP in situations with the multiple sires natural mating system, where the A 22 matrix is less informative due to the presence of missing pedigree. In this situation, problems of compatibility and scaling between A 22 and G matrices are expected, affecting the reliability of genetic evaluations. In this regard, Forni et al. [13] reported overestimated variances and biased GEBV when the A 22 matrix was sparser than the G matrix.
The ssGBLUP procedure was developed to be implemented in situations without missing pedigree or informative A 22 matrix. Some problems have been reported when the ssGBLUP model included UPGs [15]. Misztal et al. [15] worked with missing pedigrees and studied several options to include UPG in the ssGBLUP model to eliminate or reduce biases in the UPG solutions and in the animal predictions. The authors explained that potential bias could happen in genomic EBV using ssGBLUP with UPG. Recently, Tsuruta et al. [1] examined how to define the UPG assigned in mixed-model equations to reduce bias and increase accuracy in genomic evaluations for young Holstein bulls using ssGBLUP model.
In Brazil, there are many herds belonging to breeding programs that frequently use the multiple sires natural mating system, and it is common observe roughly 60% of the progeny with unknown sires. Additionally, there is a growing interest in commercial beef cattle herds lacking the pedigree structure to run genetic evaluations with the ssGBLUP so as to identify and commercialize animals with genetic evaluation information. In this context, it is important to evaluate the technical feasibility of the ssGBLUP in situations with paternity uncertainty. There are many doubts and concerns about the most adequate G matrix scaling method under paternity uncertainty scenarios and, their impact upon genomic evaluation (accuracy and bias). Currently, models that consider paternity uncertainty using genomic information are not used in animal breeding programs. Therefore, the objective of this study was to investigate the application of BLUP and ssGBLUP models under different scenarios of paternity uncertainty with different strategies of scaling the G matrix to match the A 22 matrix, using simulated data for beef cattle.

Materials and methods
Phenotypes, pedigree, and genotypes were simulated using the software QMSim version 1.00 [16]. Two traits assuming low and moderate heritabilities were simulated: age at first calving (AFC; h 2 = 0.12) and weight at 550 days (W550; h 2 = 0.34). Heritabilities were based on real data estimates [17][18][19], and the phenotypic variance was assumed to be 1.0. Ten replicates were performed for each trait and results were averaged among replicates.

Simulated population
A historical population was created from generation zero to 2,020, with a constant size of 2,000 animals (from generation zero to 1,000) to generate different levels of linkage disequilibrium (LD). A gradual reduction in the number of animals (from 2,000 to 600) produced a "bottleneck effect" and consequently, genetic drift and LD from generation 1,001 to 2,020.
Two hundred out of the 600 animals from the latest generation of the historical population were selected (males and females equally distributed) for the expanded population, which had its effective size simulated based on the real population [20]. To simulate the expanded population, a mating system based on a random union of gametes, an absence of selection, an exponential growth of the number of females, and an average of five progeny per dam were considered.
After the expansion process, 240 males and 6,000 females from the last generation were randomly selected, including the founder animals from the selection population. This population was spanned over 10 generations and the selected males and females from each generation were randomly mated, generating a single progeny with equal probability of being a male or a female. The replacement rate of sires and dams was kept constant over the generations at a rate of 20% and 60%, respectively. The genotypes of 10,000 animals of the last three generations (8, 9, and 10) were randomly selected. The estimated LD between adjacent markers in the 8, 9 and 10 generations were 0.17, 0.18 and 0.18, respectively. These results were similar to those reported by Espigolan et al. [21] using a Nellore cattle population genotyped with the BovineHD BeadChip (Illumina).

Simulated genome
The simulated genome had a total length of 2,333 cM, 735,293 markers and 7,000 QTLs randomly distributed over the 29 Bos Taurus autosomes (BTA). The length of the bovine genome was based on Base_4.6.1 [22] and it was assumed that QTLs explained 100% of the genetic variance.
The number of markers and QTLs per chromosome ranged from 12,931 to 46,495 and from 121 to 438, respectively. All markers were bi-allelic, mimicking SNPs present in the bovine commercial panels. For QTLs, the amount of alleles per loci randomly ranged from two to four. Minor allele frequencies (MAF) were assumed equally for markers and QTLs alleles. QTLs allele effects were sampled from a gamma distribution with a shape parameter equal to 0.4 [23].
A mutation rate of 10 −5 for markers and QTLs in the historical populations was considered. A total of 335,000 markers (with MAF greater or equal to 0.02) and 1,000 QTLs were randomly selected from the last generation of the historical population to generate genotypic data for the selection population. The animal phenotypes were computed as the sum of the QTLs effects and an error term sampled from a normal distribution with zero mean and variance equal to 0.88 for AFC and 0.66 for W550.

BLUP and ssGBLUP models
In the BLUP model, a traditional genetic evaluation was performed using pedigree and phenotypic information. The model can be represented as follows: where y is the vector of phenotype, b is the vector of fixed effects, u is the vector of additive genetic effects, X e Z are incidence matrices and e is the vector of random residuals. Considering an infinitesimal model, varðuÞ ¼ Aσ 2 u , where A is the numerator relationship matrix obtained from pedigree information and σ 2 u is the variance of genetic effect. In the single-step genomic BLUP (ssGBLUP) proposed by Misztal et al. [7], the inverse of the numerator relationship matrix (A -1 ) was replaced by H -1 that combines pedigree and genomic information. The H -1 was constructed according Aguilar et al. [8] as showed below: where H -1 is the inverse of the realized relationship matrix that incorporates the inverse of the genomic relationship matrix (G -1 ) and the inverse of the numerator relationship matrix of genotyped animals A À 1 22 . The G matrix was created according to VanRaden [24]: where M is a matrix of marker alleles with m columns (m = total number of markers) and n rows (n = total number of genotyped individuals), and P is a matrix containing the frequency of the second allele (p j ), expressed as 2p j . M ij was 0 if the genotype of individual i for SNP j was homozygous for the first allele, 1 if heterozygous, or 2 if the genotype was homozygous for the second allele.

Scenarios
A total of 25 scenarios were tested for each trait (50 scenarios in total) with BLUP and ssGBLUP models, considering five proportions of paternity uncertainty and four strategies of scaling the G matrix to match to the A 22 matrix . These scenarios were proposed in order to evaluate the impact of disinformation (missing pedigree) in the A 22 matrix on the scale adjustment of the G matrix. In this sense, the A matrix was created assuming different proportions of multiple sires (0, 25, 50, 75, and 100%) in the genotyped animals. To evaluate the impact of scaling the G matrix on A 22 matrix under different situations with missing pedigree, four strategies of scaling the G matrix to match to the A 22 matrix were tested: S 1 -considering mean diagonal A 22 = mean diagonal G and mean-off diagonal A 22 = mean-off diagonal G matrix, this is the default option in the preGSf90 program to reduce the differences between the average diagonal and the average off-diagonal elements in the G and A 22 matrices [12]; S 2 -no scaling between the A 22 and G matrices, the elements of the G matrix were not adjusted for the elements of the A 22 matrix; S 3 -scaling only the animals which have known sire and dam, the elements of the G matrix were adjusted only considering the elements of the A 22 matrix of animals with both parents known; and S 4 -scaling only those animals which have one known parent, the elements of the G matrix were adjusted only considering the elements of the A 22 matrix of animals with at least one parent known. For each scenario, the accuracy of prediction, bias, and inflation were calculated for five groups of animals: ALL = all animals (66,240); BULL = only bulls with at least one progeny (1,563); GEN = genotyped animals (10,000); FEM = females (36,346); and YOUNG = young males without progeny (5,803). The accuracy of prediction was computed as the correlation between the true breeding value (TBV) and EBV or genomic EBV (GEBV). Bias was measured as the difference between predicted and simulated breeding values of the candidates [14]. Regression of TBV on EBV was used as a measure of the inflation of the prediction method, where a regression coefficient equal to one denotes no inflation. Results were the mean of 10 replicates of each scenario. For AFC, a total of 10,000; 91; 10,000; 5,008 and 4,901 genotyped animals for ALL, BULL, GEN, FEM, and YOUNG were used, respectively. For W550, a total of 10,000; 98; 10,000; 5,012 and 4,890 genotyped animals for ALL, BULL, GEN, FEM, and YOUNG were used, respectively. The variance component estimation and solutions were obtained by BLUPF90 family programs [25][26].

Results and discussion
The variance component estimates obtained for AFC and W550 with the traditional pedigree (REML) and genomic information (GREML) with different proportions of MS are presented in Table 1. It is important to highlight that the software used for the simulation did not estimate the variance components, besides, it uses heritability values provided by the user (0.12 for AFC and 0.34 for W550).
In the BLUP model, the additive genetic variance ranged from 0.13 to 0.20 and from 0.35 to 0.63 for AFC and W550, respectively. For both traits, the highest additive genetic variance was observed for the scenario with 100% of MS. The additive genetic variances were overestimated as the percentage of MS increased in the population, being more noticeable for the trait with a moderate heritability (W550). Considering 50% of MS, the additive genetic variance increased 18.75 and 27% for AFC and W550, respectively, when compared to the scenario with 0% of MS. Nevertheless, with 100% of MS, this increase was 35 and 44.4%, respectively. These results can be explained by the reduction in the number of inbred animals as the proportion of MS increased in the pedigree ( Table 2). The increase in the proportion of MS may have led to the increase in the additive genetic variance between families since the families were less related due to missing pedigree.
In addition, the poor data structure that did not match to the animal model to estimate the variance components might be another reason that led to the additive genetic overestimation. Nietlisbach et al. [27] stated that the inbreeding effect on the heritability estimate depends on the considered population, but increasing the level of inbreeding in the population reduced the heritability estimates. When applying the ssGBLUP model for the same scenarios (0 to 50% and 0 to 100% of MS) the additive genetic variance increased 37.5 and 41.2% for AFC, and 36.7 and 43.6% for W550, respectively. The inclusion of the genomic relationship matrix decreased the additive genetic variance estimates for both traits in all evaluated scenarios ( Table 2). Accuracies of genetic evaluation and bias for all studied groups using BLUP and ssGBLUP models are shown in Tables 3 and 4. With the BLUP model, the accuracies of genetic evaluations decreased for both traits as the proportion of unknown sires in the population increased. The EBV accuracy reduction was higher for GEN and YOUNG groups. Therefore, comparing the scenarios for YOUNG group (from 0 to 100% of MS) the decrease was 87.8 and 86% for AFC and W550, respectively. These results pointed out that in situations of missing pedigree, the selection of young animals with EBV estimated by BLUP can be really unreliable.
By their apply the ssGBLUP model, the accuracies of the genetic evaluation also decreased when increasing the proportion of MS in the pedigree, observed for both traits. However, the reduction in the accuracies was less evident than those observed for the BLUP model. Using the same comparison for YOUNG group (scenario 0 to 100% of MS), the reduction in the accuracies was 38 and 44.6% for AFC and W550, respectively. These observations may also support the argument that the ssGBLUP model can be more accurate for genetic evaluation of young animals in situations with missing pedigree records.
The accuracies for ALL, BULL, and FEM groups were similar for BLUP and ssGBLUP models for both traits. Additionally, the EBV and GEBV accuracies decreased as the proportion of MS increased. It is expected that genomic information would contribute less for a group of animals that have enough phenotypic or progeny information contributing to the accuracy [28]. This results support a previous study on real pig data by Forni et al. [13] that showed that the inclusion of the G matrix had little impact on the accuracy of sires, but higher impact on the accuracy of females with few phenotypic records. Considering the scenario with 0% of MS for both traits, the accuracies of genetic evaluation for ALL, BULL, and FEM groups remained almost constant with the inclusion of the G matrix in the population. However, the GEBV obtained with the ssGBLUP model for YOUNG group increased 22 and 30% for AFC and W550, respectively. Wiggans et al. [29] reported reliability gains above parent average in young bulls ranging from 2.7 to 47.6 percentage units for Holsteins, 9.6 to 29.2 percentage units for Jerseys, and 3.0 to 25.8 percentage units for Brown Swiss. In beef cattle, Garrick [30] stated that genomic prediction offers accuracies that exceed those of pedigree-based parent average of young selection candidates, and it can be equivalent to progeny tests based on up to 10 offspring. It is important to highlight that for YOUNG group, as the proportion of MS increased in the pedigree, the ssGBLUP compensated the accuracy reduction obtained with the BLUP model. In the scenario that considers all unknown sires (100% of MS), when applying the ssGBLUP model the accuracy increased from 0.05 to 0.31 for AFC and from 0.07 to 0.36 for W550, being 6 and 5 times higher for AFC and W550, respectively.
According to Olson et al. [31], the regression coefficient of TBV on EBV/GEBV is an alternative way to evaluate the genetic evaluation bias, which indicates an overestimation of the variance of genetic evaluation when it is less than 1 (inflation) and an underestimation when it is larger than 1 (deflation). In general, the regression coefficients for ALL, BULL, and FEM groups were close to 1 ( Table 4), indicating that EBV predictions were less biased. For GEN and YOUNG groups, the regression coefficients were inflated as the number of unknown sires in the population increased. Large differences in regression coefficients between the BLUP and ssGBLUP models were observed for YOUNG group in all the scenarios. Biases in GEBV have been reported and discussed in several studies [8,29,32] and it could be due to the difference in scale between pedigree-based and genomic relationships, especially for young genotyped animals.
As described by Vitezica et al. [14], the comparison of average TBV and EBV/GEBV was also used to assess the bias of genetic evaluation with different proportions of multiple sires for ALL and YOUNG groups ( Table 5). As expected for the scenario with 0% of MS, the average of EBV/GEBV was close to the average of TBV in ALL and YOUNG groups for both traits. As the percentage of missing pedigree increased, the BLUP and ssGBLUP models overestimated the TBV mean for both traits, mainly for young animals. However, the ssGBLUP model predicted less biased TBV mean than the BLUP model did in situations with missing pedigree.
The realized accuracy of prediction for AFC and W550 using different scaling for the G matrix to match to the A 22 matrix is presented in Figs 1 and 2. In general, the GEBV accuracy for both traits was invariant to the scaling method applied. The GEN and YOUNG groups showed a less accurate GEBV in situations with missing pedigree and with no scaling for the G matrix. However, no differences between the scaling methods were observed when all the genotyped animals had unknown sires. There were no differences between the strategies for Table 5 scaling the G matrix for ALL, BULL and FEM groups under the different scenarios with missing pedigree. These results pointed out that the uninformative part of the A 22 matrix, i.e. genotyped animals with paternity uncertainty, did not influence the scaling of the G matrix. These results highlight the need to apply a G matrix in the same scale of the A 22 matrix, especially for the evaluation of young animals in situations with missing pedigree information.

. Means and standard deviations (SDs) for true breeding values (TBV) and breeding values from different traits using BLUP and ssGBLUP with different proportions of multiple sires.
The theory for constructing the H matrix makes many assumptions that may not hold in practice [33]. Those assumptions include the same genetic parameters in the genotyped sample as in the whole population and the existence of complete data for all traits for which selection occurred to account for selection bias [33]. Chen et al. [12] reported that the scale of the G matrix influences the ranking of genotyped versus non-genotyped animals. The optimal G matrix should have the same average of diagonals and off-diagonals as A 22 matrix [33]. Vitezica et al. [14] derived a formal proof and showed that a well-constructed G matrix with ssGBLUP model gives a more accurate and less biased GEBV than did the multistep approach.
The ssGBLUP has been used for several large-scale analyses including dairy cattle [9,34,35], beef cattle [28], pigs [11,13], and chickens [10]. These studies showed that the ssGBLUP generally was equal or more reliable than the multistep procedure and that the GEBVs were less biased. There are large differences between beef cattle compared to dairy cattle, swine, and  chicken. The beef cattle production is often cited to be inferior to poultry and swine production [36], and most of the beef cattle production is in harsh environments and with low input and investment levels. In dairy cattle, a large proportion of calves in most populations are offspring of few artificial insemination sires. Thus, the lack of artificial insemination in beef contributes to poor genetic connectedness and sire identification, compromising the reliability of genetic evaluations compared to dairy cattle.
Several studies have been developed to apply the ssGBLUP model in situations with missing pedigree using unknown parent groups [1,15]. Tsuruta et al. [1] assigned UPG in mixedmodel equations using the ssGBLUP model, which reduced the bias and increased the accuracy of GEBV. Misztal et al. [15] explained that potential bias could occur in genomic EBV (GEBV) using ssGBLUP with UPG. The authors reported convergence problems with iterative methods and incompatibility between the G and A 22 matrices due to short or incomplete pedigrees, pedigree mistakes, incorrect assignment of genotypes, poor quality of genotypes, and the unaccounted presence of multiple/lines breeds. In our study, there were no convergence problems, even in a situation with a large proportion of missing pedigree, but it is important to emphasize that low correlations between the off-diagonal elements of the G and A 22 matrices were observed due to incomplete pedigrees, mainly once the percentage of MS was higher than 50%. It is expected higher occurrence of convergence problems with real data, in which the population structure and phenotypic records are unbalanced and the model complexity is higher. Lourenco et al. [37] showed that removing old phenotypes and pedigree helped to improve convergence without decreasing accuracy for selection candidates.
According to Berry et al. [38], the development of accurate genomic evaluations in beef populations is more difficult than in dairy populations. The reasons include the presence of multiple breeds, a poor extent of phenotyping, lack of artificial insemination, and because beef systems are generally a lower-profit business that fails in adopting new technologies. The results of this study showed that when a large proportion of the pedigree is missing, the BLUP model is not reliable. However, it is possible to increase the prediction accuracy for selection candidates using the ssGBLUP model.
In many countries, several beef cattle breeding programs need to increase the availability and market of young animals with reliable genetic information in commercial herds. However, the lack of genealogy or partial pedigree information limits the reliability of the genetic evaluation in commercial herds, and, consequently, the evaluation of candidate sires. In this context, young animals with unreliable genetic evaluation are frequently discarded. Despite this study was carried out with simulated data, the results obtained with the ssGBLUP model pointed out that is possible to obtain a more reliable genetic evaluation for young animals with missing pedigree. Moreover, the breeder can have a large availability of animals for selection, increasing the selection intensity. Considering that the MS is the most common mating system in extensive beef cattle production, our results provide valuable information to support the most adequate strategy to scale the G matrix under paternity uncertainty scenarios, so as to increase the accuracy and decrease the bias in genetic evaluations using the ssGBLUP model. The results from this study would support breeders to reduce the risk of selecting young animals using the ssGBLUP model with missing pedigree information.

Conclusions
Despite the ssGBLUP procedure was not developed to deal with paternity uncertainty situations, the ssGBLUP model is an appropriate alternative to obtaining more reliable and less biased breeding values in situations of missing pedigree, especially for young animals with few or no phenotypic records.
It is important to scale the G matrix to be compatible with the numerator relationship matrix for genotyped animals, even in situations where the latter is less informative due to the presence of missing pedigree. For accurate and unbiased genomic predictions with the ssGBLUP model, it is necessary to assure that the G matrix is compatible with the A 22 matrix even in situations with paternity uncertainty.