Genome-wide identification of major genes and genomic prediction using high-density and text-mined gene-based SNP panels in Hanwoo (Korean cattle)

It was hypothesized that single-nucleotide polymorphisms (SNPs) extracted from text-mined genes could be more tightly related to causal variant for each trait and that differentially weighting of this SNP panel in the GBLUP model could improve the performance of genomic prediction in cattle. Fitting two GRMs constructed by text-mined SNPs and SNPs except text-mined SNPs from 777k SNPs set (exp_777K) as different random effects showed better accuracy than fitting one GRM (Im_777K) for six traits (e.g. backfat thickness: + 0.002, eye muscle area: + 0.014, Warner–Bratzler Shear Force of semimembranosus and longissimus dorsi: + 0.024 and + 0.068, intramuscular fat content of semimembranosus and longissimus dorsi: + 0.008 and + 0.018). These results can suggest that attempts to incorporate text mining into genomic predictions seem valuable, and further study using text mining can be expected to present the significant results.


Introduction
Genomic prediction, which is the first step in genomic selection, is a method for calculating genomic estimated breeding values (GEBVs) using large numbers of genetic markers, such as single-nucleotide polymorphism (SNP), covering the whole genome [1]. The genomic prediction methods that are currently applied to livestock populations use the extent of linkage disequilibrium between markers and quantitative trait loci (QTL) because high-density SNPs increase the chances of co-segregation of markers with causal mutations [2]. Genetic variation in quantitative traits could be influenced by large numbers of loci affecting any given trait with small to moderate effects. In some cases, however, there are loci with moderate to large effects due to relatively recently selected mutations [3][4][5]. It is difficult to capture recently selected causal mutations in genomic prediction because the linkage disequilibrium between these dams. All animals of the two populations (n = 12,635, n = 1,039) were slaughtered at averages of 918 and 920 days, respectively. The carcass traits (n = 12,635) consisted of three traits. The carcass weight (CWT/kg), backfat thickness (BF/mm), and eye muscle area (EMA/cm 2 ) were measured after a 24-hour chill at the junction of the 12th and 13th ribs. Meat quality traits (n = 1,039) were measured by evaluating two traits in two muscles. The WBSF values of the longissimus dorsi muscle (D_SF) and semimembranosus muscle (S_SF) were measured according to the method described by Wheeler et al. (2000) [16]. Briefly, beef steak 2.5 cm 2 thick was kept in polyethylene bags for 48 hours postmortem. All of the bags were heated in a water bath at 80˚C for 30 minutes, until the internal temperature of the steaks reached 70˚C. The samples were stored at room temperature for 30 minutes prior to measurement. An Instron Universal WBSF testing machine (Instron Corporation, Canton, MA) with a crosshead speed of 200 mm/min and a 50-kg load cell was used to measure the WBSF. Each sample was divided into six representative cores with a diameter of 1.27 cm and parallel to the muscle fibers. The final phenotype of the WBSF was the mean of the maximum force required to shear each core sample. The intramuscular fat contents of the longissimus dorsi muscle (D_IMF) and semimembranosus muscle (S_IMF) were measured using the microwave solvent extraction method described by AOAC International [17].
Genotyping and quality control. The genomic DNA of each animal group was extracted from longissimus thoracis muscle samples using a DNeasy Blood and Tissue Kit (Qiagen, Valencia, CA). DNA concentration and purity were determined using a NanoDrop 1000 (Thermo Fisher Scientific, Wilmington, DE). A total of 13,674 samples were genotyped using the Illumina Bovine SNP50 BeadChip and the 1,295 samples were genotyped additionally by the Illumina Bovine HD BeadChip to use as the reference population in imputation step. All animals' 50K genotypes were imputed to a high density level (777K) using Minimac3 [18]. r 2 < 0.6 SNPs were excluded in the imputations step and SNPs on the sex chromosomes were excluded from the analysis. SNP quality control for each group was performed using PLINK1.9 software [19] based on the following criteria: minor allele frequency < 0.001 for carcass traits group and < 0.01 for meat quality traits group; gene call rate < 0.1. In the carcass trait group, 23,415 SNPs were excluded by the above step, leaving 670,080 SNPs. In the meat quality trait group, 56,477 SNPs were excluded by this step, and 637,017 SNPs were used for the analysis. The imputed 777K SNPs of each group were annotated using the SnpEff program [20].
Text mining and gene ontology term analysis. Published papers related to CWT, WBSF, IMF, BF, and EMA were searched before text mining. The workflow of the text mining is shown in S1 Fig. First, all the texts in the abstracts of papers containing queries related to traits in their abstracts or titles were collected. This step was performed using functions in the RISmed package of the R statistical programming language [21]. Words consisting of only capital letters or numbers were extracted to filter out words that were accidentally the same as gene symbols (e.g., impact, pigs). Finally, only words matching the bovine gene symbols in the BioMart databases were selected for analysis. The gene symbols were obtained from the Bioconductor package BiomaRT, and btaurus_gene_ensembl was used as the dataset [22]. SNPs contained in text-mined genes (TMG) were then extracted from the imputed 777K SNPs. Furthermore, SNPs from the intergenic region of TMG were also extracted because the intergenic region often contains functionally important elements, such as promoters and enhancers. The above two types of SNPs were used as text-mined SNPs. In this study, three marker sets-the imputed 777K SNPs (Im_777K), the SNPs excluding the text-mined SNPs from imputed 777K SNPs (exp_777K), and the text-mined SNPs-were used in genomic prediction. The Bioconductor R package 'clusterProfiler' was used for Gene Ontology (GO) analysis to identify the biological process of TMG [23]. The −log 10 adjusted P-value (P.adj) by the Bonferroni method was used to examine the significance in GO analysis. To visualize the differences between QTL regions obtained from Animal QTL DB [24] and text-mined regions, karyotypes were plotted using the Circos program [25].

Statistical analyses
Genome-wide association study (GWAS) using text-mined gene-based SNP panels. The phenotypic data on carcass and meat quality traits were pre-adjusted for fixed effects including growing sites, birth year, season, and slaughter age using a linear model implemented in R software 3.3.1 (R Foundation for Statistical Computing, Vienna, Austria). The adjusted phenotypes and text-mined SNP panel were subsequently used for GWAS under a linear mixed model. The linear mixed model can be written as: where y c is a vector of the corrected phenotype for N individuals; μ is the overall mean of the term and 1 N is a vector of N ones; D is a vector of genotype of the candidate SNPs recorded as 0, 1, or 2; β is the additive effect of the candidate SNPs; g is a vector of random polygenic effects from the genetic relationship matrix (GRM) constructed by the Im_777K; and e is a vector of residuals. This model was computed by GCTA 1.26 [26]. The GRM for the polygenic effect (g) was constructed using all SNPs except those on the chromosome where the candidate SNP was located. The P-values were adjusted using the Bonferroni method to correct multiple hypotheses. The values calculated by dividing 0.05 by the number of text-mined SNPs were used as the thresholds for obtaining significant SNPs associated with the trait.
Genomic models for estimation and prediction. The three genomic models were used to estimate genetic and residual variances as well as to predict genomic estimated breeding values (GEBV) in models 1 to 3. The two types of GRM constructed by lm_777K and exp_777K were used for models 1 and 2, respectively. The equations can be written as: where y is the vector of the observed phenotype for N individuals. X is an incidence matrix for the fixed effects and b is the vector of fixed effects, which included growing site, birth month, birth year, slaughter month, slaughter year, and slaughter age as covariates for all traits. In addition, the carcass traits included slaughter place and sex, while the meat quality trait included farm information (the owner's name of steers). In the two equations, g all is the N vector of the additive effects from the GRM with lm_777K for additive genetic effects, and g −t is the N vector of the additive effects from the GRM with exp_777K. The genetic and residual effects were assumed to be normally distributed, with mean as zero. The variances estimated by the above two models are given by: where G all and G −t are GRMs with lm_777K and exp_777K, respectively; and I is an N � N identity matrix.
In model 3, two GRMs constructed by exp_777K and text-mined SNPs were jointly used to differentially weight the random effects. The model used can be written as: where y is the vector of phenotypic observations, and g t is the N vector of the additive effects from GRM with the text-mined SNPs. The genetic and residual effects were assumed to be normally distributed, with mean as zero. The variances estimated by model 3 are given by: where G t is the GRM with the text-mined SNPs. Variance component estimation and GBLUP. The variance components, s 2 all ; s 2 À t , and s 2 t , and heritability were estimated using an average information restricted maximum likelihood (AIREML) model by implementing the AIREMLF90 program in the BLUPF90 family [27]. The proportion of genomic variance explained by each model can be written as: GEBVs were predicted using GBLUP methods and a 10-fold cross-validation scheme was used to evaluate the accuracy of the GEBVs. Samples were divided into 10 groups of equal size. Nine of these groups were used as the reference set and the other group was used as the validation set in each cross-validation. The GEBVs for the model 1 and model 2 were calculated using the following mixed model. The matrix for the model used can be written as: whereû is the vector of the GEBVs distributed as g~(0,Gs 2 g ); G is genomic relationship matrix for individuals; Z is a design matrix designed one column for each GEBV and one row for each phenotype (if an individual would have no phenotype, Z would have a column with zero's only for this individual). λ is shrinkage value calculated by (σ 2 e /σ 2 g ). The GEBV for the model 3 is calculated using two random effect linear mixed model followed by Whereû À t andû t are vectors of GEBVs calculated by exp_777K and text-mined SNPs; G −t and G t are GRMs with exp_777k and text-mined SNPs. The final GEBV of model 3 is the sum of the two GEBVs (û À t þû t ). The GRM (G) is defined as where M contains genotypes adjusted by allele frequency and p j is the allele frequency for PLOS ONE marker j [28]. All of these estimates were performed using BLUPF90 [27]. The accuracy of predicted breeding values was calculated as the Pearson's correlation between the GEBVs and adjusted phenotypes (y c ) of the validation set, and the equation can be represented by:

Text mining and gene ontology term analysis
The queries used to search the papers and a statistical summary of the text mining are shown in Table 1 In the results of Gene Ontology (GO) term analysis (Table 3), CWT, BF, EMA-related TMG showed significance relatedness with growth regulator and growth factor ("response to hormone", "regulation of signaling receptor activity", and "response to endogenous stimulus", "response to peptide"). WBSF-related TMG were identified to be associated with organic acid ("carboxylic acid metabolic process", "oxoacid metabolic process", "monocarboxylic acid biosynthetic process", "organic acid metabolic process", "monocarboxylic acid metabolic process"). For IMF, the biological process terms with lipid synthesis and lipid metabolism were statistically significant ("regulation of lipid metabolic process", "lipid metabolic process", "fatty acid metabolic process", "regulation of lipid biosynthetic process"). The karyotypes of the QTL regions registered in animal QTLDB, text-mined regions, and the intersection of the two regions are shown in Fig 1. The highest percentage of intersecting regions within the textmined regions corresponded to regions of CWT-related TMG (36.3%), and the lowest corresponded to IMF regions (5.5%).

Genome-wide association study (GWAS) with text-mined SNPs
The Manhattan plots for each trait are shown in Fig 2. The Bonferroni correction method was used for the significance test (0.05/number of SNPs) in the genome-wide association study,

Genomic prediction
The accuracy of GEBV are shown separately for the carcass traits (CWT, BF, EMA) and meat quality traits (WBSF, IMF) in Table 5. Fitting two different GRMs constructed with two different SNP panels (exp_777K + tm_SNPs) as random effects in the GBLUP model showed better accuracy than fitting one GRM with exp_777K in all traits. In CWT, the prediction accuracy

PLOS ONE
with Im_777K was 0.453, which was 0.002 higher than in the model with exp_777K + tm_SNPs. Conversely, for BF, using exp_777K + tm_SNPs resulted in an accuracy of 0.421, which was 0.002 higher than that using Im_777K. EMA also exhibited its highest prediction accuracy (0.437) when using two GRMs with exp_777K + tm_SNPs. The accuracy of genomic prediction using two GRMs for WBSF in the two muscle types, semimembranosus and longissimus dorsi, were calculated as 0.129 and 0.189, respectively, and those for IMF were 0.168 and 0.225, respectively, which were better than those using Im_777K. In order to validate the effect of text-mined SNPs in the multi-GRM model, GBLUP using evenly-mined SNPs (em_SNPs) and except SNPs was additionally conducted (Table 6). For all four meat quality traits, the GBLUP using tm_SNPs showed higher accuracy than em_SNPs. It seems that CWT and EMA may have more polygenic characteristics than other traits, because em_SNPs showed higher accuracy than tm_SNPs in these two traits.

Biological relatedness of text-mined gene with carcass and meat quality traits
Carcass traits. The top three mined genes for carcass traits were IGF1, MSTN, MC4R, SST, CAPN1, and PPARGC1A. Many previous studies have investigated the biological effect of

PLOS ONE
these genes on the quantitative traits. Insulin-like growth factor (IGF) plays a key role in cell differentiation, growth, and metabolism regulation [29]. The myostatin (MSTN) gene, also known as GDF8, encodes a member of the transforming growth factor β superfamily, which is associated with the proper regulation of skeletal muscle mass and carcass yield in cattle [30]. The melanocortin 4 receptor (MC4R) gene plays an important role in energy balance and is associated with beef economic traits [31]. Peroxisome proliferator activated receptor gamma coactivator 1 alpha (PPARGC1A) have been standing out as a candidate gene for beef fat synthesis [32]. Although somatostatin (SST) inhibits growth hormone, there has been little research on the association between the SST gene and carcass traits. This gene seemed to have been mined because the abbreviation "SST" was used with other meanings, such as "sole soft tissue", in the literature.
Meat quality traits. The CAST and CAPN1 were included in the two most frequently mined genes related to the WBSF. Calpain 1 (CAPN1) encodes the large subunit of calciumactivated neutral proteases (calpain), and the calpastatin (CAST) gene inhibits μ-and m-calpain activity. These two proteins, as key myofibrillar proteins, mediate proteolysis during postmortem storage of the carcass and cuts of meat at refrigerated temperatures and play important roles in meat tenderness [46]. The association between these CAST/CAPN1 and WBSF has been studied extensively [47][48][49][50]. In IMF, SCD, LPL, and FABP4 were the three most frequently mined genes. The stearoyl-CoA desaturase (SCD) gene encodes an enzyme involved in fatty acid biosynthesis, primarily the synthesis of oleic acid [51]. The lipoprotein

PLOS ONE
lipase (LPL) gene encodes lipoprotein lipase, which provides triglyceride-derived fatty acids to adipose tissue [52]. Fatty-acid-binding protein 4 (FABP4) plays a number of important roles, including fatty acid uptake, transport, and metabolism in the muscle [53]. In addition to these genes, CAPN3, KCNJ11, DNAJA1 are also known to be associated with beef tenderness [54][55][56] and FABP3, LEPR, FASN, DGAT1 were reported to associated with IMF in previous studies [57][58][59]. In the results of GO term analysis for WBSF, biological processes related to the carboxylic acid biosynthetic and metabolic processes were significant. Carboxylic acid is an organic acid that was shown in previous studies to affect beef tenderness [60,61]. In addition, IMF related TMG showed a significant association with the regulation of lipid metabolic and biosynthetic processes. According to these biological processes, GO term results can support that WBSF, IMF-related TMG have been associated with WBSF and IMF.

Genomic prediction
When excluding text-mined SNPs from the Im_777K marker panels, the prediction accuracy for CWT, BF, WBSF, and IMF were decreased. In a previous simulation study, a panel that excluded QTL from the 50K SNP panel showed lower accuracy than a panel that included the QTL [2]. These results indicated that text-mined SNPs may be more strongly functionally associated with QTL for CWT, BF, WBSF, and IMF and include markers in a linkage disequilibrium relationship with QTL for these traits. Fitting two GRMs constructed using exp_777K and textmined SNPs in the GBLUP model as different random effects resulted in higher accuracy than fitting one GRM constructed using Im_777K for BF, EMA, WBSF, and IMF. These results were consistent with previous studies indicating that differentially weighted subsets of markers based on genomic features increased the predictive ability [8]. The increase in accuracy was greater in the traits related to the longissimus dorsi muscle than in those related to the semimembranosus Table 5. Carcass traits average correlation between the GEBV and corrected phenotypic values (y c ) and standard error for 10-validation set. Meat quality traits average correlation between the GEBV and corrected phenotypic values (y c ) and standard error for 10-validation set.

PLOS ONE
muscle. One of the most important factors that can affect the accuracy of genomic prediction is linkage disequilibrium between common SNPs and QTL [7]. As selection for a specific trait proceeds, linkage disequilibrium between causal polymorphisms for that trait and other marker loci appears to be stronger [6]. As traits related to the semimembranosus muscle were not considered in evaluating the degree of the Hanwoo breed, the selection of these traits would not have been carried out actively. Therefore, linkage disequilibrium between QTL and other markers would be weakened, and this seemed to have been responsible for these results.
In this study, the SNPs that seemed to be related to the traits were selected by text mining, and the prediction accuracy was slightly increased when these SNPs were weighted differentially to other SNP panels. In the GBLUP method, the weights of GRMs are controlled by the lambda value (σ 2 e /σ 2 u ). As σ 2 u estimated by text-mined SNPs showed lower variance than estimated by exp_777K, higher lambda values were multiplied to GRM made by text-mined SNPs and this seemed to increase the prediction accuracy by giving more weight to text-mined SNPs in the model. Nevertheless, in comparisons between multi-GRM models, the accuracy of CWT and EMA decreased when tm_SNPs was used. These results may indicate that textmined GBLUP doesn't seem to be effective in the case of traits that are more genetically affected by polygenic effect than causal variant effect. There may be limits to the conclusion that text mining can improve prediction accuracy, since text mined SNPs didn't result in a significant improvement in prediction accuracy. However, there was a slight accuracy increase for meat quality traits and GO term analysis may suggests that text mining can play a role in finding functional genes for complex traits. Therefore, attempts to incorporate text mining into genomic predictions seem valuable and further study (i.e., other SNP effects weighting methods) using text mining can be expected to present the significant results [62,63]. In addition, text mining may be used for various population or breeds, since marker selection by text mining didn't use the phenotypic or genetic information of a specific population.

Conclusions
This study was performed to use text mining, to extract biological information from previous papers and increase the performance of genomic prediction. The results showed that text mining could be used to find genes related to specific traits because associations between each carcass and meat quality trait and TMG were identified in the results of text mining and GO term analysis. However, a word that was accidentally the same as a gene symbol but used with another meaning (i.e., SST) was also mined as a text-mined gene. Therefore, it will be necessary to develop further methods of text mining that can resolve this problem. In the genomic prediction results, text-mined SNPs seemed to be in tighter linkage disequilibrium with QTL for BF, EMA, WBSF, and IMF. There may be limits to the conclusion that text mining can improve prediction accuracy, since text mined SNPs didn't result in a significant improvement in prediction accuracy. However, attempts to incorporate text mining into genomic predictions still seem valuable, and further study using text mining can be expected to present the significant results, because a slight accuracy increase for meat quality traits may suggests that text mining can play a role in finding functional genes for complex traits. In addition, text mining may be used for various population or breeds, since marker selection by text mining didn't use the phenotypic or genetic information of a specific population.
Supporting information S1 Fig. The workflow of