Conceived and designed the experiments: HN JLW GM. Performed the experiments: ML GM RZ MLS. Analyzed the data: GM AS. Contributed reagents/materials/analysis tools: ML RW JT FS. Wrote the paper: GM HN JLW.
The authors have declared that no competing interests exist.
Bovine paratuberculosis (ParaTB), also known as Johne's disease, is a contagious fatal disease resulting from infection by
The two populations used for the association analyses were a cohort of
The analyses identified several loci (P<5 e-05) on chromosomes 1, 12 and 15, plus one unassigned SNP, associated with ParaTB when cases, defined by positive ELISA and presence of bacteria in tissue, were compared with ELISA and tissue negative animals. These results confirmed the associations with ParaTB on chromosome 12 and at the unassigned SNP that had been found in the Italian population alone. Furthermore, several additional genomic regions were associated with ParaTB when ELISA and tissue positive animals were compared with tissue negative samples. These loci were on chromosomes 1, 6, 7, 13, 16, 21, 23 and 25 (P<5 e-05). The results indicate the importance of the phenotype definition when seeking to identify markers associated with different disease responses.
Control of major infectious disease in livestock remains difficult, despite the detailed characterisation of the infectious agents associated with common diseases and the deciphering of their genome sequences
Paratuberculosis (ParaTB) or Johne's disease, caused by
In recent years, new knowledge of cellular and molecular immune response has been obtained largely from studies using experimental models and defined genetic lines of laboratory animals. However, animal models of infectious diseases suffer from the limitation that the experimental conditions remove the sources of variation that impact the disease. In the case of livestock, there are many complex host-environment interactions which affect the presentation of diseases and their level of incidence. The use of field samples is therefore an indispensable complement to studies in animal models. However, such data are often difficult to interpret, as the sources of variation are not well understood. Advances in molecular and bioinformatic tools for the analysis of genomes, together with epidemiological theory previously used only in human studies, are now being applied to studies of livestock species. Through genome-wide association studies (GWAS) using high densities of SNP markers, we are finally in a position to perform a joint analysis across datasets to increase the power and discrimination of individual studies.
Until a few years ago, it was not possible to design genome-wide scans in livestock based on field samples with unknown pedigree information as markers were not sufficiently dense to detect linkage disequilibrium between the markers and causative variations. Following the publication of the bovine genome sequence, high marker density SNP assays became available. Statistical methodology has been developed to make use of these high density marker panels to detect associations between markers and traits in populations where linkage disequilibrium extends over more limited distances. The problem now is that cattle populations generally have a high level of relatedness, especially dairy cattle where the effective population size and number of sires is very small. The hidden presence of closely related animals in the sample set results in a complex population structure and an
This article reports a joint analysis of two independent GWAS datasets to identify genes and markers associated with ParaTB susceptibility in dairy cattle. The objective was to improve the power to detect associations over what was possible in each individual study, and to investigate the consistency or heterogeneity of these associations across diverse phenotype definitions and populations. The study design was based on the joint analysis of two cohorts of animals. All animals were of the same breed and were tested for the presence of antibodies for MAP or lesions in the intestinal tract. Population stratification was assessed by multidimensional scaling (MDS). No specific intervention was performed on the animals, and all samples were field samples.
No electronic search strategy was used to identify databases for inclusion in the meta-analysis. The study was conducted as a collaboration between two research institutions, and the two databases were shared under personal agreement and communication. The databases were chosen based on the breed of the animals (Holstein-Friesian), the genotyping platform used (Bovine 50K SNP CHIP) and the availability of the data for sharing. Raw genotypic datasets and phenotypic files were obtained directly from authors of the manuscript. The principal summary measure of the results is the p-value associated with each SNP tested in the combined dataset. The principal risk of bias in the study is linked to the limited number of studies analysed (2 cohorts) and to the different phenotypic definitions used to define the disease in the two datasets. No specific analysis of risk of bias was conducted on the two datasets.
In addition to comparing the results of GWA analysis of studies using different disease phenotype definitions, two different control definitions were also used. Group (A) controls consisted of animals that were ELISA or tissue culture negative, where a case was defined as an animal that was either positive for the
Group A comprised 1190 animals, 590 cases (483 ELISA positive, 107 tissue positive) and 600 controls (483 ELISA negative and 117 tissue negative), prior to genotype quality checks. Group B comprised 707 animals, 590 cases (483 ELISA positive, 107 tissue positive) and 117 controls (117 tissue negative), prior to genotype quality checks.
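The group sizes above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the counts stated in the text; the variable and function names are illustrative and are not taken from the study's own scripts. It also makes the case/control imbalance of Group B explicit.

```python
# Case and control counts as stated in the text (prior to genotype quality checks).
cases = {"ELISA positive": 483, "tissue positive": 107}
controls_a = {"ELISA negative": 483, "tissue negative": 117}
controls_b = {"tissue negative": 117}


def group_summary(case_counts, control_counts):
    """Return (n_cases, n_controls, n_total, case_fraction) for one group definition."""
    n_cases = sum(case_counts.values())
    n_controls = sum(control_counts.values())
    n_total = n_cases + n_controls
    return n_cases, n_controls, n_total, round(n_cases / n_total, 3)


group_a = group_summary(cases, controls_a)
group_b = group_summary(cases, controls_b)

print("Group A:", group_a)  # (590, 600, 1190, 0.496)
print("Group B:", group_b)  # (590, 117, 707, 0.835)
```

Group A is close to balanced, whereas Group B contains roughly five cases per control, which is worth keeping in mind when comparing the two sets of results.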
Samples were collected from routine ParaTB screening of Holstein cattle between September 2007 and December 2008 in the province of Lodi in Italy, in an area with a high prevalence of ParaTB. Animals were defined as ParaTB positive based on the detection of serum antibodies produced in response to
Two hundred forty-five Holstein cows from herds in New York, Pennsylvania, and Vermont were followed to culling between January 1999 and November 2007 and assessed for the presence of
Both sets of samples were genotyped using the Illumina (San Diego, CA) BovineSNP50 BeadChip assay, although with slightly different versions. The 966 Italian samples, plus 9 duplicated samples, were genotyped using an assay with 54,001 SNPs with an average spacing of 51.5 kb and a median spacing of 37.3 kb; detailed information on the markers tested can be found at the following webpage (
Custom Python scripts were developed to process, store, and merge the two datasets. A single Illumina file was created that contained 1190 samples and the 54,001 SNPs common to the two datasets (before quality checks). All alleles were converted from BOT to TOP coding using the SNP definitions for the Illumina BovineSNP50 BeadChip as a reference and by performing a direct comparison between the genotype calls from the two datasets. This was done to ensure that the Illumina A/B genotype calls referenced identical alleles in each study. To avoid errors and potential false positives during the joint analysis, 270 SNPs were deleted because the allele assignment to the TOP strand was not reliable.
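The study's scripts are not reproduced here, so the following is only a minimal sketch of one common way to harmonise allele coding between two datasets. It assumes that each SNP is described by its pair of nucleotide alleles in each study. The function name `harmonise` and the four outcome labels are hypothetical. A SNP is kept as-is when both studies report the same alleles, flipped when one study reports the reverse complement, and flagged when the alleles alone cannot settle the strand (A/T and C/G SNPs) or do not match at all.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# SNPs whose two alleles are complementary look identical on either strand.
STRAND_AMBIGUOUS = ({"A", "T"}, {"C", "G"})


def harmonise(reference_alleles, other_alleles):
    """Compare the allele pair of one SNP in two datasets (hypothetical helper).

    Returns one of:
      "same"      - alleles already agree with the reference coding
      "flip"      - the other dataset is on the complementary strand
      "ambiguous" - A/T or C/G SNP, strand cannot be told from alleles alone
      "mismatch"  - alleles are incompatible between the datasets
    """
    ref = set(reference_alleles)
    other = set(other_alleles)
    if ref in STRAND_AMBIGUOUS:
        return "ambiguous"
    if other == ref:
        return "same"
    if {COMPLEMENT[a] for a in other} == ref:
        return "flip"
    return "mismatch"


examples = [
    (("A", "G"), ("A", "G")),
    (("A", "G"), ("T", "C")),
    (("A", "T"), ("T", "A")),
    (("A", "G"), ("A", "C")),
]
for ref, other in examples:
    print(ref, other, harmonise(ref, other))
```

In a workflow of this kind, SNPs labelled "ambiguous" or "mismatch" would be candidates for removal before the joint analysis. Whether the 270 SNPs deleted in the study fall into these classes is not stated in the text.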
Genotype quality assurance was performed within the R statistical environment using the GenABEL package implementing the “check.marker” function on the combined raw data for the two cohorts
Genome-wide association analysis was performed with the GenABEL package
The combined dataset from both studies comprised 1190 samples and 54,001 markers. Following quality control checks, 1177 of the 54,001 markers were excluded because of low (less than 95%) call rate and 4823 markers were excluded because of low (less than 0.001) MAF. Fourteen samples were removed because of low call rate (less than 95%) and 4 were eliminated because of high autosomal heterozygosity (FDR<1%). The mean heterozygosity of the sample was 0.326, while the removed samples had heterozygosity >0.39, indicating possible sample contamination. A further 19 samples were removed due to high IBS. Mean identity by state (IBS), computed from genomic data with the “ibs” function of GenABEL (option weight = “freq”) on 2000 autosomal markers, was 0.7393, while the removed samples showed IBS values higher than 0.95. Consequently, the first dataset after quality edits comprised 1153 samples and 48,001 genome-wide SNPs.
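The marker and sample filters described above can be illustrated with a short sketch. This is not the GenABEL “check.marker” implementation. It is a simplified pure-Python version that assumes genotypes are coded as counts of the B allele (0, 1, 2) with `None` for uncalled genotypes, and all function names are hypothetical. The heterozygosity outlier test (FDR<1%) and the IBS filter are not reproduced; only per-sample heterozygosity is computed.

```python
def marker_stats(genotypes):
    """Per-marker (call_rate, minor_allele_frequency).

    genotypes: list of samples, each a list of B-allele counts (0, 1, 2)
    or None where the genotype was not called.
    """
    n_samples = len(genotypes)
    stats = []
    for column in zip(*genotypes):
        called = [g for g in column if g is not None]
        call_rate = len(called) / n_samples
        freq_b = sum(called) / (2 * len(called)) if called else 0.0
        stats.append((call_rate, min(freq_b, 1.0 - freq_b)))
    return stats


def sample_stats(genotypes):
    """Per-sample (call_rate, heterozygosity among called genotypes)."""
    stats = []
    for row in genotypes:
        called = [g for g in row if g is not None]
        call_rate = len(called) / len(row)
        het = sum(g == 1 for g in called) / len(called) if called else 0.0
        stats.append((call_rate, het))
    return stats


def passing_markers(genotypes, min_call_rate=0.95, min_maf=0.001):
    """Indices of markers that pass both the call-rate and the MAF filter."""
    return [
        i
        for i, (call_rate, maf) in enumerate(marker_stats(genotypes))
        if call_rate >= min_call_rate and maf >= min_maf
    ]


# Toy data: 4 samples x 4 markers (thresholds relaxed for the tiny example).
toy = [
    [0, 1, 2, 0],
    [1, 1, None, 0],
    [2, 0, None, 0],
    [0, 1, None, 0],
]
kept = passing_markers(toy, min_call_rate=0.75)
print(kept)  # [0, 1]: marker 2 fails call rate, marker 3 is monomorphic

# Bookkeeping check of the counts reported in the text.
markers_left = 54001 - 1177 - 4823
samples_left = 1190 - 14 - 4 - 19
print(markers_left, samples_left)  # 48001 1153
```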
To evaluate the presence of population substructure, genome-wide SNPs that were not in linkage disequilibrium (r2<0.2; 13,000 SNPs) were used for the MDS plot. The population structure of the USA and Italian populations was found to be very similar as seen from the extensive overlap in the MDS plots. There were no overall differences in the genetic background of cases and controls (
A second quality control check was performed following the sample reduction to 1036 individuals which removed 3 SNPs because of low (less than 95%) call rate and 2715 markers because of low MAF (less than 0.01). One further sample was removed because of low call rate (less than 95%). Consequently, the final dataset comprised 1035 samples and 45,282 genome-wide SNPs.
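The MDS step used to check population substructure can be sketched as classical (metric) MDS applied to a distance matrix derived from IBS. The sketch below assumes NumPy is available, complete genotypes coded 0/1/2, an unweighted IBS (unlike the weight = “freq” option mentioned in the text), and a distance defined as 1 − IBS. These choices and the function names are assumptions made for illustration, not the study's procedure.

```python
import numpy as np


def ibs_matrix(geno):
    """Unweighted pairwise identity by state for genotypes coded 0/1/2.

    geno: array of shape (n_samples, n_markers) without missing values.
    Two samples share 1 - |g_i - g_j| / 2 of their alleles at each marker.
    Builds an (n, n, m) intermediate, so only suitable for small examples.
    """
    diff = np.abs(geno[:, None, :] - geno[None, :, :])
    return 1.0 - diff.mean(axis=2) / 2.0


def classical_mds(dist, n_components=2):
    """Classical MDS: eigen-decomposition of the double-centred squared distances."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues (non-Euclidean distances, rounding) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Toy example: samples 0 and 1 are identical, sample 2 shares no alleles with them.
toy_geno = np.array(
    [[0, 0, 2, 2],
     [0, 0, 2, 2],
     [2, 2, 0, 0]],
    dtype=float,
)
coords = classical_mds(1.0 - ibs_matrix(toy_geno))
print(coords.shape)
```

Plotting the first two columns of `coords`, coloured by cohort or by case/control status, gives the kind of plot used to judge whether the populations overlap.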
The joint analysis of the combined data, from the two independent genome-wide studies of ParaTB, identified SNPs associated with
Manhattan plot displaying the −log10(
| SNP | BTA | BTA Position (bp) | UMD Position (bp) | N | effB Q.2 | P-value |
|  | 12 | 69,663,832 | 69,979,057 | 998 | 0.15 | 2.04 e-05 |
|  | 12* | - | 68,866,138 | 998 | 0.17 | 2.66 e-05 |
|  | 12 | 69,599,639 | 69,620,872 | 1017 | 0.15 | 2.88 e-05 |
|  | 15 | 66,161,046 | 65,868,375 | 1017 | −0.19 | 3.07 e-05 |
|  | 1 | 113,617,698 | 109,807,935 | 1017 | −0.21 | 3.34 e-05 |
|  | 1 | 113,855,358 | 109,950,635 | 1017 | −0.20 | 3.94 e-05 |
BTA:
BTA Position: the Btau4.0 location of the SNPs on the cattle chromosome in base pairs.
UMD Position: the UMD3.0 location of the SNPs on the cattle chromosome in base pairs.
N: number of animals represented in the comparison.
effB Q.2.: effect of the minor allele.
P-values: p-values after GRAMMAR-GC test for association.
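The genomic-control part of the GRAMMAR-GC p-values reported above can be illustrated as follows. This is a minimal sketch of the correction step only, not of the mixed-model residual step and not of the GenABEL code. It assumes 1-df chi-square test statistics as input, and it floors the inflation factor at 1, which is a common convention but an assumption here. The function name `genomic_control` is hypothetical.

```python
import math
from statistics import median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def chi2_1df_pvalue(statistic):
    """Upper-tail p-value of a 1-df chi-square statistic: P(Z^2 > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(statistic / 2.0))


def genomic_control(chi2_stats):
    """Return (lambda, corrected p-values) for a list of 1-df chi-square statistics."""
    inflation = max(1.0, median(chi2_stats) / CHI2_1DF_MEDIAN)
    corrected = [x / inflation for x in chi2_stats]
    return inflation, [chi2_1df_pvalue(x) for x in corrected]


# Toy statistics whose median is twice the expected value, so lambda = 2.
toy_stats = [2 * CHI2_1DF_MEDIAN] * 3
inflation, pvalues = genomic_control(toy_stats)
print(inflation, pvalues[0])

# The threshold used in the text, on the scale of a Manhattan plot.
threshold = -math.log10(5e-05)
print(round(threshold, 2))  # 4.3
```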
The initial common dataset comprised 707 samples and 54,001 markers. Following the first quality control check, 1209 of the 54,001 markers were excluded because of <95% call rate and 5115 markers were excluded because of <0.001 MAF. Eight samples were removed because of <95% call rate and 2 were eliminated because of high autosomal heterozygosity (FDR<1%). The mean heterozygosity of the samples was 0.327, while the samples removed had heterozygosity values >0.41, indicating possible sample contamination. A further 7 samples were removed due to high IBS. Mean IBS computed using genomic data with the identity by state (IBS) “ibs” (option weight = “freq”) function of GenABEL was 0.7363, based on 2,000 autosomal markers, while the samples removed showed IBS values greater than 0.95. Following data cleaning, 692 samples and 47,677 genome-wide SNPs remained for the analysis.
To evaluate population substructure among the 692 animals, genome-wide SNPs that were not in linkage disequilibrium (r2<0.2; 13,000 SNPs) were used to produce MDS plots. The MDS plot indicated that the two populations were genetically very similar, and that there was no evidence of clustering based on the ParaTB status. After removing outliers, 619 animals remained for the analysis (
The joint analysis of the combined dataset from the two independent genome-wide studies of ParaTB infection identified several SNPs associated with
Manhattan plot displaying the −log10(
| SNP | BTA | BTA Position (bp) | UMD Position (bp) | N | effB Q.2 | P-value |
|  | 22 | 56,087,082 | 53,837,358 | 574 | −0.07 | 1.27 e-15 |
|  | 6 | 22,013,011 | 19,487,224 | 575 | −0.07 | 3.62 e-09 |
|  | 1 | 3,083,368 | 3,591,678 | 600 | 0.03 | 4.82 e-06 |
|  | 1 | 3,083,498 | - | 600 | 0.03 | 4.82 e-06 |
|  | 7 | 40,664,184 | 40,766,325 | 549 | −0.01 | 1.57 e-05 |
|  | 26* | - | 2,091,652 | 589 | −0.05 | 1.74 e-05 |
|  | 25 | 29,929,537 | 27,952,781 | 584 | 0.04 | 2.53 e-05 |
|  | 13 | 65,977,384 | 65,155,432 | 600 | −0.06 | 2.62 e-05 |
|  | 16 | 72,179,197 | 73,582,939 | 600 | −0.08 | 2.75 e-05 |
|  | 21 | 33,135,132 | 32,869,451 | 600 | −0.13 | 3.15 e-05 |
|  | 23 | 34,108,529 | 32,251,429 | 596 | 0.03 | 4.64 e-05 |
BTA:
BTA Position: the Btau4.0 location of the SNPs on the cattle chromosome in base pairs.
UMD Position: the UMD3.0 location of the SNPs on the cattle chromosome in base pairs.
N: number of animals represented in the comparison.
effB Q.2.: effect of the minor allele.
P-values: p-values after GRAMMAR-GC test for association.
Results from independent and moderately sized GWA studies rarely stand on their own and should be considered as part of a process that accumulates evidence of association
For livestock it is often difficult, or impossible, to increase sample size within a study, either because of the difficulty in finding populations with appropriate phenotypes due to economic constraints, or because of ownership concerns that make it difficult to obtain samples from commercial populations, especially in dairy cattle. Furthermore, in order to combine studies, the overlap between marker sets needs to be maximised to optimise the accuracy of the genotyping and avoid the need for genotype imputation. Joint analysis of the raw data from two or more GWA datasets is one approach to increase the evidence for the effects of genetic loci, despite the challenges associated with combining different phenotype definitions and different fixed or random effects. Even with these problems, there are benefits to be gained, as the larger sample size increases statistical power and reduces the chance of false positives.
In the work presented here, we combined data from two independent GWA studies of ParaTB. The first study was performed in an Italian Holstein population. Samples were collected from routine ParaTB screening of Holstein cattle by ELISA testing. This analysis identified several regions on chromosomes 8, 9, 11, 12 and 27 (P<5 e-05) associated with disease status defined by the presence of anti-
Discrepancies between loci identified in the two previous studies
The two adopted strategies of analysis (Group A and Group B) identified 6 and 11 SNPs, respectively, that were associated with disease phenotypes. The analyses were optimised as the majority of markers used in both studies were identical.
The analysis strategy based on the Group A definition where controls consisted of animals that were ELISA or tissue culture negative and a case was defined as an animal that was either positive for the
Examining the region identified by the joint analysis confirmed several candidate genes located within 1 Mb of all three markers on chromosome 12, as previously reported by Minozzi et al. 2010. Two of the significant SNPs are located within coding regions of two genes that encode a single protein, the ATP-binding cassette, sub-family C (
The loci on chromosomes 1 and 15 did not reach significance in either the Italian or American studies alone. However in the American population, allele frequencies at
The region within 1 Mb of
Similarly, neither of the SNPs on chromosome 1 was significant in the independent studies, although allele frequencies were 0.03 and 0.04 in cases and 0.06 and 0.08 in controls for
The SNPs on chromosomes 1 and 15 are newly discovered associations from the joint analysis that are common to both phenotypes, i.e., antibody response and tissue burden of bacteria. These markers may be of interest for breeding schemes, as they could be used to identify animals susceptible to a disease that manifests later in the productive life of the animal, causing economic damage and an increased risk of contaminating other animals in the herd.
The analysis strategy based on the Group B definition compared controls (animals that were
A very strong new association (P = 1.27 e-15) was found on chromosome 22 at position 56,087,082. Interestingly, the allele frequencies at this SNP were similar between cases and controls in both cohorts. The SNP is not flanked by other SNPs associated with the disease (
None of the SNPs on chromosomes 6, 13, 16, and 21 have genes located within 1 Mb that could be considered as strong candidates for a role in ParaTB. Several of the SNPs that reached significance in the independent studies were not significant in the combined analysis. However, this does not exclude them as being associated with disease, as they may be associated with one or other of the phenotypes but not both.
In the GWA study of the Italian population with the antibody response phenotype, 6 genomic regions on BTA 8, 9, 11, 12 and 27, which were not detected in the joint analysis, were significantly associated with disease. The SNP on chromosome 9
The American study identified several SNPs linked to tissue infection on chromosomes 3, 5, 16, and 21 that were not identified in the joint analysis. The minor allele frequencies at
In summary, combining datasets into a joint genome-wide association analysis can improve the power for detecting and validating associations and provides the possibility of identifying new loci which were below the significance threshold in the independent studies. We combined datasets from two independent ParaTB studies. Using a genomic kinship matrix based on the Bovine SNP50 BeadChip data, we defined population relationships among samples and showed that the genetic composition of the two populations was sufficiently similar to undertake the joint analysis. In performing this analysis, using two distinct phenotypic descriptions of disease, we were able to confirm loci identified in the independent studies, and to identify new loci through increased power, presumably where the biology underlying the two phenotypes coincided. However, other loci found in the original studies were lost, again presumably where different mechanisms underlie the two phenotypes. A limitation of the study is the difference in phenotype definition between the two studies, which limited the possibility of confirming all previous results but at the same time made it possible to find markers that are common to both diagnostic measures. Further work may be carried out by adding data from other independent GWA studies to this analysis. The concept applied here, of combining datasets from studies that use the same or different trait measures, may help to decipher the genetic architecture of complex polygenic infectious disease traits, which require very large sample sizes to detect risk loci with sufficient statistical support.
In conclusion, the joint analysis of two single GWA studies confirmed previous findings and identified new genomic regions and candidate genes involved in specific and general immune responses to ParaTB. These results increase the overall understanding of the genetics of paratuberculosis and may benefit genome-based selection in livestock.