Genome-Wide Association Analysis and Genomic Prediction of Mycobacterium avium Subspecies paratuberculosis Infection in US Jersey Cattle

Paratuberculosis (Johne’s disease), an enteric disorder in ruminants caused by Mycobacterium avium subspecies paratuberculosis (MAP), causes economic losses in excess of $200 million annually to the US dairy industry. To identify genomic regions underlying susceptibility to MAP infection in Jersey cattle, a case-control genome-wide association study (GWAS) was performed. Blood and fecal samples were collected from ∼5,000 mature cows in 30 commercial Jersey herds from across the US. Discovery data consisted of 450 cases and 439 controls genotyped with the Illumina BovineSNP50 BeadChip. Cases were animals with positive ELISA and fecal culture (FC) results. Controls were animals negative to both ELISA and FC tests that matched cases on birth date and herd. Validation data consisted of 180 animals including 90 cases (positive to FC) and 90 controls (negative to ELISA and FC), selected from discovery herds and genotyped by Illumina BovineLD BeadChip (∼7K SNPs). Two analytical approaches were used: single-marker GWAS using the GRAMMAR-GC method and Bayesian variable selection (Bayes C) using GenSel software. GRAMMAR-GC identified one SNP on BTA7 at 68 megabases (Mb) surpassing a significance threshold of 5×10−5. ARS-BFGL-NGS-11887 on BTA23 (27.7 Mb) accounted for the highest percentage of genetic variance (3.3%) in the Bayes C analysis. SNPs identified in common by GRAMMAR-GC and Bayes C in both discovery and combined data were mapped to BTA23 (27, 29 and 44 Mb), 3 (100, 101, 106 and 107 Mb) and 17 (57 Mb). Correspondence between results of GRAMMAR-GC and Bayes C was high (70–80% of most significant SNPs in common). These SNPs could potentially be associated with causal variants underlying susceptibility to MAP infection in Jersey cattle. Predictive performance of the model developed by Bayes C for prediction of infection status of animals in validation set was low (55% probability of correct ranking of paired case and control samples).


Introduction
Paratuberculosis or Johne's disease (JD) is a chronic bacterial infection of the gastrointestinal tract caused by Mycobacterium avium subspecies paratuberculosis (MAP). MAP is contagious; infected animals expose their cohorts to the pathogen by shedding bacterium into their colostrum, milk or feces [1]. In cattle, young calves are at the highest risk for acquiring MAP infection [2]. The major route of MAP transmission is fecal-oral [3]. MAP is a slowgrowing intracellular bacterium; infected animals remain asymptomatic for 2 to 10 years before showing clinical signs of the infection. Clinical signs of JD in MAP-infected dairy cattle usually appear after 2 nd or 3 rd lactation and include poor nutrient uptake, severe diarrhea, progressive weight loss, low milk production and eventually death [4]. There is currently no cure for this disease. The NAHMS Dairy 2007 study estimated the apparent herd-level prevalence of MAP-infected herds in the top 17 US dairy states to be at least 68% based on recovery of viable MAP in environmental fecal samples [5]. In a recent study, the true herd-level prevalence of MAP infection in these herds was estimated to be 91.1% [6]. JD is a common disease in countries with a significant dairy industry [7] and causes a negative impact on the global economy [8,9]. JD, like most other complex diseases is multi-factorial i.e. under the influence of both genetic and environmental factors. Studies have shown that susceptibility to JD is heritable with the estimates ranging from 0.03 to 0.28 in cattle [10,11,12,13,14,15,16,17].
Crohn's disease (CD) is an inflammatory bowel disease (IBD) in humans with manifestations similar to those of JD in cattle. MAP has been found in some patients with CD [18], however a causal link between the two has not been demonstrated. In the past few years, genome-wide association studies (GWAS) have been applied widely to decipher the genetic basis of complex traits and diseases in human. Using this approach for IBD has resulted in identification of 163 loci conferring risk of CD and ulcerative colitis (another common form of IBD) [19].
In recent years, availability of the BovineSNP50 platform for genotyping ,54,000 SNPs across the Bovine genome [20] has facilitated GWAS in cattle. Six GWAS seeking genomic regions underlying susceptibility or tolerance to infection with MAP in Holstein have been performed to date [21,22,23,24,25,26]. Various loci on multiple chromosomes have been reported for association with susceptibly to MAP infection in Holsteins. Jersey is the second most common dairy breed after Holstein in the United States. This is the first GWAS for susceptibility to paratuberculosis infection in the Jersey breed. Our objectives were to identify genomic regions that underlie susceptibility to infection with MAP as well as development of a multi-marker model to be used in genomic selection against susceptibility to MAP infection in Jersey cattle.

Ethics statement
The University of Wisconsin-Madison College of Agricultural and Life Sciences Animal Care and Use Committee approved the procedures used with animals in this experiment.

Resource population
Blood and fecal samples were obtained from ,5,000 mature cows (minimum age of 20 months) from 30 commercial Jersey herds throughout the US in a retrospective cross-sectional design. Nomination of herds for this study was based on the prevalence of JD evidenced by the herd's owner or veterinarian. Sampling was performed selectively for three herds and completely (whole herd) for the remaining 27 herds. Samples were shipped in insulated containers with cold packs by overnight courier to the Johne's Testing Center of the School of Veterinary Medicine at the University of Wisconsin-Madison and processed upon receipt (separation of plasma, collection of buffy coats for DNA extraction). Serum was held at 4C and fecal samples at -20uC until testing. All blood samples were tested within 7 days by a serum ELISA (JTC-ELISA) [27] with 30% sensitivity and . 99% specificity relative to fecal culture [28].
ELISA optical density (OD) values for each serum sample were converted to sample to positive ratios (S/P) using [OD Sample -OD Negative control ]/[OD Positive control -OD Negative control ] [29]. ELISA S/P ratios were categorized as negative (0 to 0.09), suspect (0.10 to 0.24), low positive (0.25 to 0.39), positive (0.40 to 0.99), and strong positive ($1.00) as suggested [29] (Table 1). Animals categorized as low-positive, positive or strong-positive were all considered to be ELISA-positive. Within-herd apparent prevalence (number of test-positive animals divided by total animals in the herd) varied from 0.03 to 0.30 based on ELISA results. Within herds, for each ELISA-positive cow, two ELISA-negative cows (matched on birth date) were selected. All ELISA-positive and selected ELISA-negative cows were also tested for evidence of MAP infection by fecal culture (FC) [30] (Table 1). The sensitivity and specificity of fecal culture have been estimated to be approximately 74% and 100% for detection of infectious cows, respectively [31].

Discovery data
In total, 1,000 cows including 500 cases and 500 controls were selected for discovery purpose (objective 1). Cases consisted of animals with positive results to both ELISA and FC (ELISA+/ FC+). Controls were cows testing negative to both ELISA and fecal culture (ELISA-/FC-) and matched with cases on herd and birth date. In a previous study [32] using a temporal clustering approach, we showed that MAP-infected animals were significantly clustered by birth date within dairy herds. This finding strengthens the hypothesis of non-uniform exposure to MAP in dairy herds. The choice of matching cases and controls on their birth dates was to ensure similar exposure to the pathogen thus reducing the environmental noise. Considering lack of a gold standard for diagnosis of MAP infection, we used the results of ELISA and FC tests in combination for defining MAP infection status. The rationale was to increase certainty with which these phenotypes represent the true infection status of animals.

Validation data
To validate the results of discovery GWAS, an additional 200 animals were selected from the original herds. All eligible ELISA+/FC+ animals and their matching ELISA-/FC-were already used in the discovery stage. The remaining test-positive animals were either ELISA+ or FC+. FC+ cows that were not used in the discovery stage were used as cases for validation (n = 100). This choice was made to reduce classification bias, as fecal culture is twice as sensitive as ELISA. Controls were ELISA-/FC-also matched with cases on herd and birth date (n = 100).

Genotyping and quality control
Genomic DNA of discovery and validation animals was extracted from buffy coats using a variation of the typical proteolytic digestion and organic extraction method [33]. The purity (A260/A280) of DNA samples was assessed by spectrophotometry. DNA samples were also quantified using PicoGreenH dsDNA (Invitrogen) and adjusted to 50 ng/ml prior to genotyping.
For the validation data, 91 cases and 91 controls passed the DNA quality assessment. The BovineLD BeadChip (Illumina Inc, San Diego, CA) including 6,909 SNPs was used for genotyping [34]. A QC analysis was performed with the criteria of animal call rate (. 90%) and SNP call rate (. 95%). A total of 180 animals (90 cases, 90 control) and 6,796 SNPs passed QC. All QC procedures were done in R GenABEL package [35]. Imputation BEAGLE (v 3.3.2) was used to impute missing genotypes in the discovery and validation data [36]. The average missing rate per SNP in the discovery data after QC was , 1%. Allelic R 2 was estimated by BEAGLE based on genotype probabilities as an indicator of imputation accuracy. The average allelic R 2 of 0.99 for imputed genotypes in the discovery data suggested high accuracy of imputations. Out of 6,796 SNPs in the validation set, 5,424 were common across LD and 50K chips. The genotypes of common SNPs were used to infer haplotypes and impute 27,163 SNPs unique to 50K panel for the validation data set. The average allelic R 2 of 0.96 indicated high accuracy of imputed genotypes.

Population stratification
Differences in allele frequencies between subpopulations of admixed populations can lead to false associations in GWAS [37]. To find genetic outliers, the genomic kinship was computed between all pairs of animals using the ibs function in the R GenABEL package [35]: where f i, j is the genomic kinship (identical-by-state) between animal i and j, k ranges from 1 to N (number of autosomal SNPs), x i,k or x j, k is the genotype of i th or j th animal for k th SNP (coded as 0, K and 1) and p k is the allele frequency at the k th SNP. The kinship matrix was transformed to a distance matrix (0.5 -kinship) and principal components (PCs) of variation of the genomic distance matrix were calculated using the cmdscale function. The first two PCs (PC1 and PC2) were used to obtain the classical multidimensional scaling (MDS) plots. All population stratification procedures were performed within the R GenABEL package.

Statistical analyses
Single-marker GWAS. Genome-wide association analysis was carried out based on regression of phenotypes (susceptibility to MAP infection) on the genotypes of animals for one SNP at a time. For single-marker GWAS, we used a three-step approach referred to as genomic GRAMMAR-GC (Genome-wide Rapid Association using Mixed Model and Regression-Genomic Control) [38,39]. This approach has been used in multiple GWAS in cattle [23,40,41]. The advantage of this approach especially in livestock is that it accounts for cryptic population structure caused by the presence of closely related animals [38]. In the absence of pedigree information, GRAMMAR-GC infers relationships through genomic marker data. Following the approach of Aluchenko et al. [38], in the first step phenotypes were corrected by accounting for familial dependence among individuals using: where y Ã i is the so-called ''environmental residual'' and y i is the binary phenotype of i th animal, m is the overall mean,Ĝ G i is the estimated polygenic contribution. In the second step, these familial correlation-free residuals were used as dependent quantitative traits for association analysis of each SNP using a linear regression model: where y Ã i is as defined before, g i is the genotype of the i th individual at the marker under study, a j is the effect of j th SNP and e i is the random residual for the i th individual. In the third step, genomic control (GC) is applied to correct the test statistic using a deflation factor (f) calculated by: f f~Median (T 2 1 zT 2 2 ,:::,T 2 j )=0:465 ð4Þ where T 2 j is the observed chi-squared (x 2 ) statistic for the j th SNP and 0.465 is the expected median of x 2 (1) distribution with a noncentral variance. T 2 for each SNP is calculated by whereâ a j is the effect of j th SNP. T 2 =f f is compared with x 2 (1) to determine whether the locus is significantly associated with the quantitative trait. The deflation factor is estimated in the same way as inflation factor (l) in conventional GC method [42] with the difference that f,1 in contrast to l that is constrained to be . 1. This difference is due to the regression of residuals instead of original trait on n loci in step 2. Polygenic [43] and qtscore [38,39] functions of the R GenABEL package were used for association analysis. Two P-value thresholds of 5610 27 and 5610 25 were considered for genome-wide ''strong'' and ''moderate'' associations [44]. The Manhattan plot of minus log 10 (P-value) against chromosomes was drawn using an in-house script in R [45]. The quantile-quantile (Q-Q) plot of observed P-values against expected P-values was generated to evaluate the overall genome-wide significance.
To validate the associations suggested by discovery GWAS, validation data (31,065 SNPs and 180 animals) were analyzed by single-marker GWAS procedures described above. For validation, strong and moderate associations suggested by discovery GWAS were required to meet two criteria in the validation analysis: P,0.01 and same direction for estimated effects.
Bayesian GWAS. In contrast to traditional single-marker regression based GWAS that fits one marker at a time, Bayesian methods simultaneously fit many markers and take into account the linkage disequilibrium (LD) relationships between markers. Bayesian methods were originally adopted for genomic prediction of breeding values [46], however, in recent years they have been applied for GWAS as well e.g. [47,48]. We used the Bayes C threshold model implemented in GenSel [49] for both mapping quantitative trait loci (QTL) of MAP infection as an alternative approach to single-marker GWAS and developing a multi-marker model to predict new phenotypes (risk assessment). Bayesian methodology combines prior information of marker effects with information from data to draw inferences from posterior distributions using Markov Chain Monte-Carlo (MCMC) sampling [50]. In Bayes C, a common variance is assumed for all SNPs and its advantage over other Bayesian approaches is that it is less sensitive to priors of genetic and residual variances [50]. The threshold model in Bayes C with a probit link function for categorical binary traits described [51] as: where y is the liability vector for case/control observations, m is the overall mean, n is number of SNPs, z represents a column vector of genotype covariates at SNP j (AA = -10, AB = 0 and BB = 10), a j is the allele substitution effect of SNP j , d j is a an indicator variable for presence (1) or absence (0) of j th SNP in the model and e is the vector of random residual effect assumed normally distributed ,N (0, I) I being an identity matrix. The proportion of SNPs with no effect (parameter p) was assumed to be 0.999. Therefore, 33 SNPs (0.1% of 32,587 SNPs) were assumed to contribute to genetic variance in any MCMC iteration. A high value of p was chosen to allow only regions with strongest associations to be identified. In Bayes C when a SNP is present in the model (i.e. d = 1), a j is assumed to be normally distributed ,N (0, s 2 a ) conditional on s 2 a , whereas when the SNP is not present (d = 0) a j is zero. Residual variance was set to 1. Assuming an average heritability of 0.1 for susceptibility to MAP-infection in dairy cattle [10,11,12,13,14,15], 0.11 was used as a prior for genetic variance (s 2 a ). s 2 a was assumed to have a scaled inverse chisquared distribution with 4 degrees of freedom and scale parameter S 2 a . A total of 41,000 MCMC cycles with 1,000 burnin cycles were implemented.
Because of LD between markers in the vicinity of a QTL, the effect of a QTL may be distributed over nearby markers. Consecutive 1 Mb non-overlapping windows (genome windows) along the bovine genome were used to calculate the cumulative effects of markers within windows [52]. SNPs were allocated to consecutive genome windows using physical map order derived from Bovine genome assembly UMD 3.1. The window effect in GenSel is expressed as the percentage of total genetic variance contributed by each window. Percentages of explained genetic variance by windows were plotted against chromosomes using R [45]. Model frequency of a marker defined as the proportion of fitted models which included that marker was used to infer associations. SNPs with the highest model frequency in top windows are potentially associated with the phenotype under study. In Bayesian GWAS, limiting the proportion of false positives (PFP) among all positive values is an approach that can be used to correct for multiple testing [53]. Therefore, if SNPs with model frequency $ 0.90 were deemed significant, PFP would be # 0.10.
Genomic prediction. A multi-SNP model was developed by Bayes C analysis in the discovery data (training set). Using the predict function in GenSel the genomic estimated breeding values (GEBV) were obtained for 180 animals in the validation data (testing set). The efficacy of predicted GEBVs in correctly ranking cases and controls was evaluated by Receiver operator characteristic (ROC) analysis implemented in ROCR package of R [54]. The ROC curve plots the true-positive rate (sensitivity) against the false-positive rate (1-specificity) which graphically depicts the accuracy with which a risk classification score (GEBV, in this study) predicts the binary outcome (infection status) across a full range of thresholds. The area under ROC curve (AUC) is a statistic that quantifies the classification power of the SNP model, where values of 1.0 and 0.5 reflect perfect classification and random assignment [55].
As an alternative, a 10-fold cross-validation was also performed using the discovery data. 450 cases and 439 controls were randomly divided into 10 approximately equal subgroups. Nine subgroups were assigned to a training set while the remaining subgroup was considered as a testing set. For each replication, we used the training set to construct a SNP model which subsequently was tested in the testing set. This procedure was repeated 9 additional times with a unique testing set each time. Ten different models were constructed in GenSel using Bayes C (same input parameters used in initial Bayesian GWAS). The GEBVs of animals in the testing sets were calculated using GenSel predict and the efficacies of SNP models were evaluated by comparing AUCs.
Combined analyses. The discovery (N = 889) and validation (N = 180) data were merged to enhance the power of analyses. The combined data comprised of 1,069 animals and 32,375 SNPs. A total of 212 SNPs were excluded in QC: 175 SNPs due to MAF , 0.01, 33 SNPs for call rate ,0.95 and 4 SNPs were out of HWE (P,10 26 ). Population stratification, single-marker GWAS, Bayesian GWAS and 10-fold cross-validation were performed in the manner described before.

Single-marker GWAS (discovery data)
Appearance of a single cluster in the MDS plot suggested the absence of population substructure in the discovery data ( Figure  S1-A). The deflation factor (f) was estimated to be 0.96 (SE = 9610 25 ). The GC-corrected P-values for the majority of SNPs corresponded well to the expected P-values under H 0 of no association, with a few departures indicating association with the trait under study ( Figure S2-A). No SNP passed the threshold of strong association (Figure 1-A). The most significant SNP was identified on BTA7 position 68 Mb (P = 4.9610 25 ) surpassing the threshold for moderate association ( Table 2). The second most significant SNP (P = 5.9610 25 ) was located on BTA3 (107 Mb) and failed to pass the moderate threshold ( Figure 1-A, Table 2). The 20 most significant SNPs (P,5610 24 ) were located on 8 chromosomes including BTA7, 3, 23, 17, 6, 1, 5 and 13 (in order of significance) ( Table 2). On BTA3 a total of six SNPs covering 101 to 107 Mb were identified (Table 2) Table 2). BTA1 contained three SNPs located at positions 125.6, 135.3 and 141.9 Mb (P,5610 24 ) representing three genomic regions based on a relatively low average pair wise r 2 (0.25) ( Table 2).

Single-marker GWAS (validation data)
The MDS plot for validation data (after imputation) did not show any outliers and confirmed genetic homogeneity in the population ( Figure S1-B). The estimated deflation factor from GRAMMAR was 0.98 (SE = 2610 24 ). The Q-Q plot of corrected P-values is shown in Figure S2-B. None of the 20 most significant SNPs identified in the discovery analysis were significant (P,0.01) in validation analysis (Figure 1-B). The effect directions of these SNPs were compared with validation results, and 65% of the effects were in the same direction.
A total of 2,657 genome windows existed along the Bovine genome with an average of 12 SNPs per window. The genome windows were sorted based on the proportion of genetic variance captured by each. In the top 20 windows, the proportion of genetic variance explained by a window ranged between 0.56% to 3.3% (Table 2). Assuming an infinitesimal model, the expected proportion of genetic variance explained by each window was ,0.04% ([1/2,657] * 100). In total, 584 windows (22%) explained more than 0.04% of the genetic variance. Window 2199 on   Only the twenty most significant SNPs are shown. 2 SNPs identified commonly by both GenABEL and GenSel analyses; order is based on P-values from GenABEL. 3 SNPs identified only by GenABEL. 4 SNPs identified only by GenSel. 5 Bos Taurus chromosomes. 6 Position of SNP based on Bovine genome build UMD 3.1 (in base pair). 7 Major allele. 8 Minor allele. 9 Estimated effect of allele B (fitted allele) and the standard error of the estimated effect in the parenthesis. 10 P-value corrected by genomic control approach (GC). 11 Frequency of B allele in cases. 12 Frequency of B allele in controls. 13 Rank based on percentage of genetic variance among the twenty most significant windows by GenSel analysis. 14 Number of 1-Mb non-overlapping genome window and number of SNPs within each window in the parenthesis. 15 Percentage of total genetic variance explained by 1-Mb windows. 16 Proportion of models in which the corresponding window accounted for . 0.04% of genetic variance (expected variance if each window had the same effect: 1/total number of windows = 2,657). 17 Proportion of MCMC iterations that included the corresponding SNP. doi:10.1371/journal.pone.0088380.t002 Table 3. Posterior means of variance components in Bayesian genome-wide association study of susceptibility to MAP infection in US Jersey cattle.  Table 2). The probability that this window explained more than the average genetic variance was 0.26. In window 2199, ARS-BFGL-NGS-11887 at position 27.7 Mb among other SNPs, showed the highest model frequency, i. e., was included in the model in 21% of MCMC iterations ( Table 2). The top 20 windows were located on BTA23 (5 windows), BTA3 (5 windows), BTA6 (2 windows) and BTA17, 5, 7, 16, 19, X, 1 and BTA10 (one window each). To compare these results with the results of single-marker GWAS, for each window the SNP with the highest model frequency (the most influential SNP) was chosen to represent the window. For BTA3 and BTA23 the results of Bayesian GWAS corresponded with the results of single-marker GWAS ( Table 2). The loci that were among the 20 most significant in the Bayesian analysis (GenSel) but not the GenABEL analysis included SNPs on BTA16 (48 Mb), 6 (1 Mb), 19 (61 Mb), 6 (136 Mb), 10 (1 Mb) and 23 (35 Mb)( Table 2). The most significant SNP based on GenABEL analysis, BTA-109542-no-rs on BTA7, ranked 11 th in the GenSel analysis. Considering the 20 most significant SNPs in each, the results of the two analyses (GenABEL and GenSel) were in high agreement (70% of loci in common). In total, 10 SNPs had model frequency . 0.10. Assuming these SNPs to be positive results, we would expect at least one of these SNPs to be truly associated with MAP infection (PFP ,0.90).

Genomic prediction (validation data)
The marker effect estimates from the Bayesian analysis were used to predict the genomic merit of 180 animals in the validation set. The predicted genomic merit was used to rank paired case and control samples which was compared with observed phenotype by ROC analysis. The predictive ability of the model was low. The AUC of the SNP model was 0.55 ( Figure 3). The AUC from 10fold-cross validation using discovery data was similar, ranging between 0.47 to 0.67 (average 0.56) (Figure 4-A).

Combined analyses
Single-marker GWAS. No population substructure existed in the combined data set ( Figure S1-C). The deflation factor from GRAMMAR-GC was 0.94 (SE = 1.6610 24 ). The Q-Q plot of corrected P-values is shown in Figure S2-C. No SNP passed the strong or moderate thresholds of association (Figure 1-C, Table 4). The two most significant SNPs were located on BTA23 at 27.7 Mb (third most significant in discovery analysis by GenA-BEL) and BTA17 position 26.3 Mb (not among the 20 most significant results in discovery data) (P,10 24 ) (Figure 1-C, Table  4). The 20 most significant SNPs by combined analysis (P, 6610 24 ), representing 8 genomic regions, were mapped to 8 chromosomes including BTA23 (4 SNPs), 11 (2), 3 (6), 17 (2), 5, 25, 6 (2) and 13 (2) (ordered by P-value) ( Table 4). Comparing singlemarker GWAS results of the combined and discovery data, the most significant SNP on BTA7 (P = 4.9610 25 ) in discovery analysis declined in significance in the combined analysis ( Table  2 & 4). Likewise, the SNPs located on BTA1 and at 7 Mb of BTA23, also declined in significance ( Bayesian GWAS. A total of 2,656 genome windows containing 32,375 SNPs were used to estimate the genetic variance explained by SNPs in combined data. The mean posterior estimate of genetic variance was 0.147 (95% HPD: 0.042-0.287). The mean posterior estimate of h 2 was 0.126 with 95% HPD interval from 0.040 to 0.223 (Table 3).
Comparing the results of Bayesian analysis between combined and discovery data, 11 SNPs [BTA23 (3 SNPs), BTA3 (4) and BTA17, 6, 5 and 16] were common between the 20 most significant SNPs in the two analyses ( Table 2 & (Table 4). Model frequencies of SNPs in most cases were equal or smaller in combined analysis compared to discovery. Likewise, the percentage of explained genetic variance by each window was generally smaller in combined analysis ( Table 2 & 4). Comparing the results of Bayesian GWAS with single-marker GWAS for combined data, 16 out of 20 most significant SNPs were common between both analyses (80% concordance) ( Table  4).

Genomic prediction (cross-validation)
In a ten-fold cross-validation, the predictive abilities of models developed by training with 90% of combined data were evaluated in the remaining 10%. AUC ranged from 0.46 to 0.65 for the  Table 4. Results of single-marker (GenABEL) and Bayesian (GenSel) genome-wide association analysis for susceptibility to MAP infection in Jersey cattle (combined data, N = 1,069) 1. Only the twenty most significant SNPs are shown. 2 SNPs identified commonly by both GenABEL and GenSel analyses; order is based on P-values from GenABEL. 3 SNPs identified only by GenABEL. 4 SNPs identified only by GenSel. 5 Bos Taurus chromosomes. 6 Position of SNP based on Bovine genome build UMD 3.1 (in base pair). 7 Major allele. 8 Minor allele. 9 Estimated effect of allele B (fitted allele) and the standard error of the estimated effect in the parenthesis. 10 P-value corrected by genomic control approach (GC). Figure 4-B). Average AUC across 10 models was 0.55.

Discussion
This is the first genome-wide association study for susceptibility to paratuberculosis in Jersey cattle. Previously, GWAS [21,22,23,24,26] and a meta-analysis [41] were conducted to identify loci responsible for susceptibility to this condition in Holsteins. These studies have found evidence for association on multiple and varying chromosomal locations.
The proportion of variance explained by all SNPs across the genome (0.15 and 0.12) was in the range of pedigree-based heritability estimates (0.08 to 0.27, unpublished data). This suggests that markers captured most of the additive genetic variation through LD with QTL. However, the previous pedigreebased heritability estimates are from studies of Holstein cattle. No report is available for the heritability of susceptibility to paratuberculosis in Jersey cattle. Studies have shown that susceptibility to MAP infection may differ between breeds [56,57]. Sorge et al., (2011) reported that Jersey cows are 1.4 times more likely to test positive to milk ELISA than Holstein cows [56]; however, this conclusion is questionable as breeds were confounded with herds in their data. To capture loci with the largest genetic contribution a high value of the p parameter (0.999) was used in the current analysis; with a smaller p, the proportion of genetic variance explained by all SNPs might be increased.
The significance level in single-marker and Bayesian GWAS was generally low; no SNPs surpassed the threshold of strong association in separate or combined analyses. Likewise, the highest model frequency of SNPs in most significant windows for both discovery and combined analyses was 0.23. The low model frequencies resulted in higher PFP. For p = 0.999, a randomly chosen locus would a priori be expected to have non-zero effect in 0.1% of MCMC samples. A model frequency of 0.23 indicates that particular SNP had a non-zero effect in 23% of MCMC samples. The relatively low significance of identified SNPs can be explained by the complex genetic architecture of susceptibility to paratuberculosis and that there are many genes with small effects influencing this disease. The power of GWAS to identify SNPs with small effect size is limited, however, this limitation may be overcome in studies of larger scale.
The results of single-maker GWAS on discovery data showed evidence for association on BTA23 and BTA3 as multiple SNPs on these chromosomes in relatively close proximity were among the 20 most significant SNPs. There was a relatively high correspondence between the results of single-marker and Bayesian GWAS; 70% and 80% of the 20 most significant SNPs by GenABEL were also among 20 most significant SNPs by GenSel for discovery and combined analyses, respectively. The first three windows from Bayesian analysis of discovery data explained ,10% of genetic variance while with combined data ,7% of genetic variance was explained by the top three windows. From both GenABEL and GenSel analyses, combining data enhanced the significance of only a few SNPs.
There is some correspondence between the results of this study and previous GWAS in Holsteins. The closest correspondence is for SNP ARS-BFGL-NGS-19381 (BTA23, 32.6 Mb). The nearest SNPs to this position that were identified in Holsteins were located at 32.1 and 32.2 Mb on BTA23 [22,41]. The 32 Mb region of BTA23 may be a case of a genomic region commonly associated with MAP infection across two breeds. Genes within 1 Mb of this location are four members of solute carrier family 17 (SLC17 A1, A2, A3 and A4) located between 31.7-31.8 Mb. The 1 Mb 11 Frequency of B allele in cases. 12 Frequency of B allele in controls. 13 Rank based on percentage of genetic variance among the twenty most significant windows by GenSel analysis. 14 Number of 1-Mb non-overlapping genome window and number of SNPs within each window in the parenthesis. 15 Percentage of total genetic variance explained by 1-Mb windows. 16 Proportion of models in which the corresponding window accounted for . 0.04% of genetic variance (expected variance if each window had the same effect: 1/total number of windows = 2,657). 17 Proportion of MCMC iterations that included the corresponding SNP. doi:10.1371/journal.pone.0088380.t004 distance was chosen based on the extent of linkage disequilibrium in cattle. Kim and Kirkpatrick (2009) showed that for pairs of markers with relatively high LD (r 2 = 0.4-0.6) the median physical distance was ,1.1 Mb. However, for Jersey with smaller population size and higher inbreeding the extent of LD may be even higher. Association of SNPs in SLC11A1 (another member of SLC family) with MAP infection has been reported [58]. SLC11A1 and SLC17A1 both encode membrane transport proteins and mutations in these genes have been associated with inflammatory diseases such as Crohn's and Gout diseases [59,60]. SLC17A1can be considered a potential candidate gene for predisposition to MAP infection in cattle. For validation of GWAS results the ideal situation is using samples from a population independent from the discovery population with the same phenotype that was used in the discovery stage. In this study, neither of these requirements was possible. Our validation data failed to replicate the results of discovery GWAS. One reason for this might be use of a different case definition; FC+ and ELISA+/FC+ may represent distinct phenotypes i.e. loci responsible for a cow's ability for MAP shedding may be different from loci underlying humoral response to the pathogen. Another limitation is the small number of samples used in the validation data set.
The predictive ability of the models developed by the Bayesian approach was low. Given that an AUC of 0.50 represents random guessing, an AUC of 0.55 or 0.56 is a weak classifier. The efficacy of the multi-marker model developed by Kirkpatrick et al. (2010) for prediction of susceptibility to MAP infection in Holsteins was 0.73 by AUC in cross-validation analyses [22]. It has been shown that the accuracy of genetic tests for prediction of disease susceptibility is limited by heritability and disease prevalence [55,61]. For low heritability traits even if GEBVs are 100% accurate, prediction of unobserved phenotypes from genomic data will never be accurate [62]. It would be of interest to study the effect of heritability and prevalence on the maximum accuracy that can be obtained by multi-marker models for prediction of the risk of MAP infection.
Combining the results of GRAMMAR-GC and Bayes C for discovery and combined data, nine SNPs distributed on four chromosomes were commonly identified by all four analyses.  [19,63,64,65]. ARS-BFGL-NGS-11887 is located in the intronic region of TCF19 (transcription factor 19) which is highly conserved among multiple species. TCF19 is located on chromosome 6p21.3 close to MHC region in human and its potential role in the etiology of Type 1diabetes (an autoimmune disease) has been suggested [66]. The most relevant candidate gene within 1 Mb of SNP ARS-BFGL-BAC-35219 on BTA23 is ubiquitin D (UBD), alternatively known as FAT10; FAT10 modifies an inflammatory mediator that inhibits its activity during cellular response to Leprosy [67]. HIVEP1 is proximate to BTA-56690-no-rs at 44.4 Mb on BTA23 and encodes for a protein that participates in the transcriptional regulation of inflammatory target genes by binding specific DNA sequences in their promoter and enhancer regions [68]. The most relevant candidate gene close to ARS-BFGL-NGS-109837 and Hap-map51790-BTA-103080 is CCDC17 (coiled-coil domain containing 17) located between 101.05-101.06 Mb on BTA3. CCDC88B, from the same family, was suggested as the most promising candidate gene at location 11q13.1 in humans for susceptibility to Sarcoidosis (a complex inflammatory disease) [69]. Zinc finger protein 684 (ZNF 684) is in 200 kb of BTB-00148619 located at 106.4 Mb on BTA3; the variants in ZNF 365 have been associated with CD [70]. UBE2K (ubiquitin-conjugating enzyme E2K) located 20kb upstream of BTA-30686-no-rs (60.6 Mb) on BTA6 could be a potential candidate gene for susceptibility to paratuberculosis infection. UBE2L3 on human chromosome 22 has been identified as a new potential risk gene for CD which is also involved in other immune-mediated diseases [71]. ARS-BFGL-NGS-100555 is in close proximity (170 kb) to FAM109A (family with sequence similarity 109, member A) on BTA17. FAM5C from the same family has been reported to be associated with gastric cancer in humans [72]. All these SNPs had the same effect directions in both discovery and validation data supporting their validation. BTA-75232-no-rs (10.1 Mb) on BTA5 was identified by all four analyses but had different effect direction in validation data. The genes described above seem to be promising candidates for response to paratuberculosis infection.

Conclusions
We performed a case-control genome-wide association study for infection with Mycobacterium avium subsp. paratuberculosis in Jersey cattle. Two statistical approaches were used: single-marker regression (GRAMMAR-GC) and Bayesian methodology (Bayes C) for multi-marker regression. Nine SNPs representing four chromosomes (BTA3, 6, 17 and 23) were identified by both GRAMMAR-GC and Bayes C analyses in discovery and combined (discovery and validation) data. Multi-marker prediction models were developed and tested by both cross-validation and application to the validation data set; predictive ability of the models to correctly rank cases and controls was low (55-56%) based on the area under ROC curve. The application of these models to predict the phenotypic outcome of animals in regard to JD is limited, however, they can be used for prediction of genetic merit. Figure S1 Multi-dimensional scaling plots. A) Discovery data (N = 889) B) Validation data (N = 180) and C) Combined data (N = 1,069). Each animal is represented by one point. PC1 and PC2 are the first two principal components obtained from genomic kinship matrix. Distance between points represents the genetic distance between animals. (PDF) Figure S2 Quantile-quantile plots of P-values from genome-wide association analysis for susceptibility to MAP infection. A) Discovery data B) Validation data C) Combined data. Y-axis represents the observed P-values and Xaxis the expected P-values under null hypothesis (diagonal) of no association. (PDF)