Genome-Wide Analysis Identifies Germ-Line Risk Factors Associated with Canine Mammary Tumours

Canine mammary tumours (CMT) are the most common neoplasia in unspayed female dogs. CMTs are suitable naturally occurring models for human breast cancer and share many characteristics, indicating that the genetic causes could also be shared. We have performed a genome-wide association study (GWAS) in English Springer Spaniel dogs and identified a genome-wide significant locus on chromosome 11 (praw = 5.6x10-7, pperm = 0.019). The most associated haplotype spans a 446 kb region overlapping the CDK5RAP2 gene. The CDK5RAP2 protein has a function in cell cycle regulation and could potentially have an impact on response to chemotherapy treatment. Two additional loci, both on chromosome 27, were nominally associated (praw = 1.97x10-5 and praw = 8.30x10-6). The three loci explain 28.1±10.0% of the phenotypic variation seen in the cohort, whereas the top ten associated regions account for 38.2±10.8% of the risk. Furthermore, the ten GWAS loci and regions with reduced genetic variability are significantly enriched for snoRNAs and tumour-associated antigen genes, suggesting a role for these genes in CMT development. We have identified several candidate genes associated with canine mammary tumours, including CDK5RAP2. Our findings enable further comparative studies to investigate the genes and pathways in human breast cancer patients.


Introduction
Breast cancer is a devastating disease causing a majority of cancer-related deaths in women worldwide [1]. Sub-categorisation of patients based on receptor status (oestrogen (ER), progesterone (PR) and HER2) has enabled improved targeted treatments. However, treatment could be further improved, especially for the triple negative patients, which account for 12-24% of the patients, and for which no efficient therapy exists at present [2]. There is therefore an urgent need to identify predisposing genes and prognostic tools to improve early detection and enhanced treatment options in breast cancer. One approach is to attempt to identify genes influencing susceptibility to breast cancer, which also has the potential to reveal novel targets for drug development and assist in the implementation of strategies towards personalised medicine. Breast cancer susceptibility is generally believed to be conferred by a large number of loci, each contributing with a small effect to breast cancer risk [3]. So far only a small fraction of human breast cancer cases can be explained by a single gene mutation and the prevalence of clearly hereditary breast cancer is about 5-10% of all breast cancers, leaving a large majority of cases with a more complex aetiology [4,5]. Several genes predisposing to breast cancer have been identified, including BRCA1 and BRCA2, which explain about 20% of the familial breast cancer cases [6]. A large number of association studies have been performed in search of breast cancer susceptibility genes, including pooled strategies and meta-analyses, and many genes conferring a moderately increased risk have been identified [7][8][9][10]. However, a large proportion of the inherited risk factors remain unknown.
The dog is a unique model for human disease, sharing many both complex and monogenic diseases, a similar gene set and largely the same environment as humans. In addition, the canine population structure makes trait mapping much easier than in humans. Several recent studies have proven the effectiveness of gene discovery in dog breeds for both monogenic [11][12][13][14] and complex traits including cancer [15][16][17][18][19]. Canine mammary tumours (CMT) are the most common neoplasia in intact female dogs and constitute about half of all tumours [20,21]. As in women, dogs develop mammary tumours with increasing age, rarely before 5 years of age and with a median age of occurrence of 10-11 years. However, the English Springer Spaniel (ESS) has been shown to have a median age of onset at 7 years of age in the Swedish dog population and 32% of the female dogs are affected at ten years of age in this high-risk breed [20]. This early onset mimics that of familial breast cancer in humans and indicates that inherited risk factors influence CMT development. CMTs also show a high degree of similarity to human breast tumours regarding epidemiological, clinical, morphological and prognostic features [22][23][24]. CMT is considered a heterogeneous disease with a complex background, resembling that of human breast cancer, but very little is known about the inherited genetic risk factors influencing CMT. We have previously shown that the BRCA1, BRCA2 and ESR1 genes are associated with CMT in Swedish ESS [25,26]. The associations imply similarities in predisposing genetic risk factors between human and canine mammary tumours, but explain only a minor proportion of the elevated risk for CMT in the ESS breed.
In this study, we have conducted a genome-wide association study (GWAS) to identify genetic risk factors associated with mammary tumour predisposition in the ESS breed in addition to BRCA1/2 and ESR1 [25,26]. We have identified three candidate regions containing plausible cancer susceptibility genes and pathways. The top associated region is located on canine chromosome 11 and reveals a complex genetic architecture with an abundance of risk haplotypes indicating involvement of the centrosomal cell cycle regulator CDK5 regulatory subunit-associated protein 2 (CDK5RAP2) in tumour development.

GWAS identifies three regions significantly associated with CMT
A cohort of Swedish ESS dogs was genotyped for genome-wide association analysis for CMT. A total of 332 individuals (188 cases, 144 controls) and 130,238 SNPs remained in the analysis after quality control filtering. The English Springer Spaniels display an inbreeding coefficient of 0.03±0.05. The ESS cohort showed substantial stratification, mainly due to an outlier group visible in the MDS plot (S1 Fig). The outlier group could potentially be due to genetic mix-in from other breeds. A standard case-control chi-square test resulted in a genomic inflation λ = 2.34, clearly indicating a stratified dataset. The inflation was controlled by removal of an outlier group of 33 individuals, and by mixed model analysis with PCA covariates to correct for residual stratification and cryptic relatedness in the remaining 180 cases and 119 controls (λ = 1.00, Fig 1A). Several loci showed association with CMT, indicating multiple risk factors in the ESS breed ( Fig 1B, Table 1). Genome-wide significant association was detected on canine chromosome 11 (SNP BICF2G630310626, chr11:73,290,522, p raw = 5.6x10 -7 , p perm = 0.019). Allele frequencies of this top SNP was also studied investigated in the Swedish outlier group (n = 33), a UK cohort (n = 40) and a Norwegian cohort (n = 15) to investigate a potential overlap in association signal in these minor cohorts. The association with CMT in the original Swedish ESS was however only replicated in the Swedish outlier group, which could indicate an enrichment of this risk variant in the Swedish population (S2 Fig). In addition to the top SNP, seven SNPs located in three genomic regions have p-values deviating from the expected in the QQ-plot (nominal significance threshold at -log p>4.0, Fig 1A). Three SNPs are located on canine chromosome 11, supporting the genome-wide significant locus, and four SNPs are positioned in two loci on chromosome 27. The nominally associated SNPs are listed in Table 1. No SNPs were excluded due to HWE inconsistencies.
Linkage disequilibrium clumping was used to define the associated regions for further analysis, using both association and LD values to restrict the regions, Table 2. Several of the identified regions overlap with variants associated with different forms of cancer in humans [27]. However, none of the regions contain known genes or GWAS sites for human breast cancer [27].
91% of the cases carry at least one risk allele at the three top loci (S3A Fig). The chromosome 11 peak confers a substantial risk (OR = 2.76, 95% CI 1.72-5.57) and accounts for 11.0 ±7.2% of the phenotypic variance, whereas the three associated regions together explain 28.1 ±10.0% of the phenotypic variance (S3B Fig). Interestingly, the proportion increases to 34.8 ±11.0% for the top 5 regions and 38.2±10.8% for the top 10 regions, despite the lack of genome-wide significant association.

Candidate region re-sequencing
The associated and potentially-associated regions were re-sequenced in 7 ESS dogs selected for optimal variance. The re-sequencing resulted in a coverage of 159x±33x and 90±3% of the target covered by 20x or more. On average 24,500 SNPs and 13,100 short indels were detected in each dog across the 12 Mb sequenced. Nine non-synonymous SNPs were discovered within the top region on chromosome 11, three of these were previously known canine SNPs. No SNPs with predicted deleterious effects were identified within the two nominally associated regions on chromosome 27. The identified SNPs were evaluated for their potential biological function and whether they complied with risk and protective haplotypes in the sequenced dogs. Fifty-four candidate SNPs were selected for genotyping in the ESS cohort. Several larger structural variants were also detected in the top candidate regions, with the majority overlapping repetitive elements or flanking gaps in the genome assembly, indicating alignment difficulties. This was especially evident in the region on chromosome 27:0.7Mb, where two deletions, four duplications and one inversion ranging from 200 bp to 43 kb were detected.
Genome-wide significant candidate region on chromosome 11 The chromosome 11 candidate locus (Fig 2A and 2C) shows a dispersed minor allele frequency pattern, with no signs of reduced variability due to selective pressure in the region (Fig 2B).
Haplotype and LD analysis was performed for the top candidate region (chr11:76.1-76.8Mb, Fig 2D) using a merged dataset including genotypes from the canine SNP chip combined with candidate SNPs identified by sequencing. This dataset was imputed to allow haplotype, LD and association comparisons between markers. The 700 kb region on chromosome 11 displayed a complex genetic architecture with a multitude of haplotypes, which persisted when taking potential genotype or imputation errors into account. When investigating only SNPs with signs of association (p<0.001, 15 SNPs), 51 different haplotypes were identified of which 17 were private, indicating an unusually high genotypic diversity in the candidate region. Based on data from Auton et al [28], there are four recombination hotspots within the region and the recombination hotspot density is significantly higher in this region compared to the rest of chromosome 11 (p = 0.017), which could be an explanation for the increased diversity in the region. The phylogenetic relationship between the haplotypes was investigated and the haplotypes can be clustered into three groups with 29, 5 and 17 haplotypes in each group ( Fig 3A). The haplotype frequencies are 0.59, 0.16 and 0.25 for haplotype group 1, 2 and 3, respectively. Haplotypes belonging to group 3 confer a higher risk for CMT than group 1 (p = 5.9x10 -5 , OR = 2.3, Fig 3B). No significant difference could be established between haplotype group 2 and either group 1 or 3.
The top locus on chromosome 11 could be further defined in the combined GWAS and fine-mapping dataset by markers in high LD (r 2 >0.6) with the top SNP, Fig 2D, restricting the candidate region to approximately 446 kb ). This region spans the CDK5RAP2 (CDK5 Regulatory Subunit Associated Protein 2) and parts of the MEGF9 (Multiple Epidermal Growth Factor-Like Domains Protein 9) gene, both with previous connections to cancer [29,30]. There could potentially also be a microRNA and a lincRNA gene in the region since the human MIR147A and LINC01613 overlap gaps in the dog genome assembly (CanFam 2.0 and CanFam 3.1). Three putative non-synonymous SNPs in CDK5RAP2 and one SNP in the 3'UTR of MEGF9 were included in the analysis, but the chr11:73,290,522 top SNP identified in the GWAS remained the most significantly associated after comparisons with potential candidates from re-sequencing, but with a slightly higher p-value after imputation (p = 1.66x10 -6 ), Fig 2D). This SNP is located in a small gene desert downstream of CDK5RAP2. The region is evolutionary conserved, indicating potential functional importance. The base is evolutionary conserved in 95% of the vertebrates and all mammals evaluated (UCSC genome browser, 100 vertebrates). The SNP is predicted to significantly alter the transcription factor binding abilities for the photoreceptor cell-specific nuclear receptor (PNR/NR2E3, p = 5.3x10 -5 from TOMTOM [31]), which is specific for the protective allele. PNR/NR2E3 is an orphan nuclear hormone receptor previously reported to have a regulatory role in breast cancer [32,33].
Interestingly, two other SNPs in the candidate region show high association and LD with each other (r 2 = 0.94) but relatively low LD (r 2 <0.4) with the top SNP, indicating a possibility of two independent genetic risk factors in the area. One of these SNPs produces a non-synonymous change in CDK5RAP2 (at chr11:73,692,993 bp, grey in Fig 2D). The SNP at chr11:73,692,993 creates a proline to alanine transition, but the amino acid is not well conserved evolutionary and the change is predicted to be benign (score 0.156) when analysed with PolyPhen [34]. It displays strong association to CMT (p = 1.3x10 -5 ), but is in moderate LD with the top SNP at chr11:73,290,522 bp (r 2 = 0.38), and could thus be an alternative genetic risk factor in the region. Analysing the dataset with the chr11:73,290,522 top SNP genotypes as covariates does however remove the association signal in the entire chromosome 11 region (p>0.02), indicating that the associated SNPs are not independent.

Associated candidate regions on chromosome 27
The entire proximal 13.5 Mb of chromosome 27 shows elevated levels of association, with nominally associated peaks at 0.7 and 7.7 Mb, Fig 4A. An associated region of this size could indicate selection at this site, but the allele frequencies vary and do not suggest decreased genetic variation in the area, Fig 4B. The two loci appear independent with low LD (r 2 = 0.03) between the top SNPs, Fig 4A. Fine-mapping with additional SNPs did not result in new association signals or restrict the size of either region (Fig 4C and 4D). The 0.7 Mb region contains several large gaps and is poorly annotated in both the CanFam 2.0 and CanFam 3.1 genome assemblies, but the corresponding human region includes 17 genes. Several larger structural variations (SVs) were detected when re-sequencing this region in seven dogs, either reflecting an unstable genomic region or, alternatively, could be indicative of errors in design and sequence alignment due to the incomplete dog genome assembly in this area. After resequencing, genotyping and imputation, the SNP BICF2P1040993 (chr27:735,281 bp) showed the strongest levels of association in the region (p = 6.8x10 -6 ). It is located 418 bp upstream of the lacritin gene (LACRT, annotated from human hg19), which encodes a glycoprotein involved in tear secretion [35]. LACRT expression has also been detected in breast tissue (normal breast tissue, breast cancer tissue and breast cancer cell lines) [36].
After fine-mapping of the 7.7 Mb candidate region, the strongest association is seen for two SNPs located 23 kb apart (BICF2P815910, chr27:7,683,337 and BICF2P365456, chr27:7,706,463). Based on gene annotation in human, this region includes the 5' part of the amino acid transporter SLC38A4, which is known to be imprinted [37]. The SNP with the lowest p-value in this area, BICF2P365456, chr27:7,706,463, is located intronic of the SLC38A4 gene based on human genome gene annotations (Fig 4D).

Pathway analysis reveals enrichment of snoRNA and tumour antigen genes
The ten most associated GWAS regions contain excellent candidate genes previously connected to cancer, several of which are novel in breast cancer. We used a PubMed text-based pathway analysis tool (GRAIL) to evaluate gene relationships linking the top ten GWAS loci [38]. Highly significant connections were found for six of the ten regions (p GRAIL 4.6x10 -6 , S1 Table), which all contain small nucleolar RNA (snoRNA) genes. The snoRNAs are involved in post-transcriptional modification of mainly ribosomal RNA and small nuclear RNA. Emerging evidence connect several snoRNAs to cancer [39].
In addition to the associated regions, 117 regions with reduced genetic variability (RGVs) were identified in the ESS cohort (MAF<0.01 over >250 kb), S2 Table. The RGVs cover 2.1% of the genome, and 47 of the RGVs (representing 19.5% of the X chromosome and 1.0% of the genome) are located on the X chromosome. When allowing for more variation (MAF<0.05), RGV regions cover 2.2% of the autosomes and 25.4% of the X chromosome. This is consistent with a lower recombination rate leading to reduced genetic variation on the X chromosome. There is also a bias towards the X chromosome due to a lower marker density (average distance 22.1 kb compared to 13.0 kb for the autosomes). The syntenic human regions were extracted for the 117 regions, of which 99 contain genes and could be evaluated for pathway enrichments. Using GRAIL, 29 RGV regions were significantly connected (p GRAIL <0.05), mainly through genes with connections to cancer (14 of the 29 regions). Of these, ten can be classified as tumour antigen genes, including PAGE2B (prostate-associated P antigen family, member 2B, p GRAIL = 3.3x10 -9 ), SAGE1 (sarcoma antigen 1, p GRAIL = 5.8x10 -6 ), XAGE5 (X Antigen Family, Member 5, p GRAIL = 8.2x10 -4 ), DUSP21 (Dual Specificity Phosphatase 21, cancer/testis antigen, p GRAIL = 4.5x10 -3 ), CT55 (CXorf48, cancer/testis antigen 55, p GRAIL = 3.8x10 -2 ) and five melanoma-associated antigens (MAGEA11, MAGED2, MAGED4, MAGED9 and MUM1L1) (S2 Table). The tumour antigen gene products can act as antigens in tumour tissue due to somatic mutations or aberrant expression, which can lead to an immune response. Any altered protein could act as a tumour associated antigen, but according to the T cell-defined tumour antigen peptide database [40] there is an overrepresentation of antigen genes on the X chromosome (22.9% of the unique gene entries 2013 are on X, average distance 4.7 Mb compared to 26.9 Mb in the remaining genome), which could potentially cause a bias towards enrichment in the RGV regions.
Interestingly, when combining the GWAS top ten associated regions together with the RGVs, 41 of the 109 regions (excluding regions without genes) are connected (p GRAIL <0.05, S3 Table), including both snoRNAs and tumour associated antigens in both datasets.

Discussion
In this study we have performed a genome-wide association analysis for canine mammary tumours in the English Springer Spaniel breed. We identified a genome-wide significant peak on chromosome 11 and the candidate region includes a regulator of cyclin-dependent kinase 5 (CDK5RAP2). A different cyclin-dependent kinase, CDKN2A (p16), has recently been identified in two independent dog cancer GWAS studies, suggesting an important role for these proteins in cancer development [15,16]. The region displays a high genetic diversity, which is evident from the lack of a selective sweep signature, a varied high minor allele frequency and a complex haplotype structure with a high number of haplotypes across the region. The reason for the diversity remains unclear, but there are twice as many recombination hotspots defined within the region than expected based on chromosome 11 average (p = 0.017) [28]. The locus is close to the distal end of chromosome 11 and recombination is known to occur at higher rates close to telomeres, possibly explaining the high haplotype diversity in our candidate region [28]. After resequencing and addition of new candidate SNPs to the study, the associated region could be further refined, encompassing the CDK5RAP2 gene and parts of MEGF9. However, the SNP at chr11:73,290,522bp remains the most associated SNP, and is located within a small gene desert upstream of the CDK5RAP2 gene. The risk allele disrupts a transcription factor binding motif for PNR/NR2E3, which is an orphan nuclear hormone receptor. It is a regulator of the estrogen receptor 1 (ESR1) in ER positive breast cancer cells and interleukin 13Ra2 in ER negative breast cancer cells, regulating tumour growth, cell migration and metastasis [32,33]. High expression levels of PNR/NR2E3 have also been associated with longer recurrence-free survival in breast cancer patients and enhanced response to tamoxifen treatment [32]. The function of the PNR/NR2E3 site in this region is unclear, but the high level of conservation and the presence of a transcription factor binding motif indicate that the associated SNP could have a regulatory function. Alternatively this SNP is in high LD with a causative variant that still remains to be detected. Either of these scenarios suggests that the causative variant is of regulatory nature since the three non-synonymous SNPs in CDK5RAP2 are not the most or independently associated CMT variants. Non-coding predisposing variants would be in line with the majority of the canine cancer associated germ-line variants discovered so far, which are involved in gene regulation and not directly altering the coding sequence [15,16,41]. Dysregulation of cancer genes is also believed to play a major role in predisposition to human cancer, with the majority of GWAS loci being intronic or intragenic [42], and the variants discovered in this study are therefore likely to be of comparative value. However, pinpointing the exact regulatory variants remains a challenge. If the genetic risk factor in the chromosome 11 region is of regulatory nature, it may act in cis or trans. One method for evaluating a potential trans as well as cis acting regulation would be to correlate transcriptome data with haplotype status for this region. Several strong eQTLs are known for CDK5RAP2 in humans [43], but the corresponding bases do not seem to be variable in this ESS cohort. It is however possible that different sites may confer a similar mechanism in the dog. The associated SNPs discovered in this study could have eQTL functions and control regulation of CDK5RAP2. This would be in concordance with previous studies demonstrating the importance of CDK5RAP2 expression for maintaining the normal cell cycle checkpoint [30]. The CDK5RAP2 protein has a fundamental role in the centrosome when attaching the mitotic spindle pole to the centrosome during mitosis and is also required for the mitotic spindle checkpoint in response to DNA damage [44]. Mutations in the CDK5RAP2 gene can cause primary microcephaly due to abnormalities in cell division during neurogenesis [45]. Furthermore, there are indications that CDK5RAP2 levels can influence cancer treatment response since CDK5RAP2 knockdown breast cancer cell lines display increased resistance towards paclitaxel and doxorubicin in a study by Zhang et al [30]. The germ-line risk factor identified in this study could potentially be utilised to predict treatment outcome when using these chemotherapy drugs, but further studies are needed to establish CDK5RAP2's putative involvement in ESS CMT and the possible influence it may have on treatment regimes. The candidate region also contains parts of the MEGF9 gene and although no eQTL are reported in the region identified in this study, the GTEx portal shows nearby variants regulating MEGF9 expression in breast tissue [46].
We have identified three significant CMT associated regions in this study, which account for 28% of the disease risk in the breed. In contrast, the top ten associated loci are predicted to explain 38% of the risk, which indicates that several inherited risk factors play important roles in the development of the disease. In addition to the GWAS loci, regions with a very low variability and therefore undetected by GWAS could also add to the genetic predisposition of CMT. Insurance records show that the Swedish ESS population has a very high CMT risk with 32% of the bitches being affected by ten years of age [20], and it is likely that most individuals in the breed are homozygous for genomic regions containing risk genes increasing the overall CMT risk in the ESS. These regions could have arisen due to random segregation or due to selection of a desired trait. A gene influencing cancer development could therefore be directly involved in the selection or be hitchhiking with the gene selected for. We identified fixed regions covering up to 2.1% of the autosomes and 25.4% of the X chromosome. Although it is expected that genes on the X chromosome might be associated to CMT, the bias towards long fixed regions on the X chromosome could potentially also create spurious findings. The number of regions with reduced variability is smaller than in for instance Rottweilers, Irish wolfhound and greyhounds [15], which indicates a relatively high diversity in the ESS compared to these breeds. This is also reflected in the relatively low inbreeding coefficient (0.03±0.05), which is similar to what has previously been reported in the US ESS population [47] and is probably due to a large population size compared to many breeds.
When analysing possible pathway connections between the genomic regions identified in the GWAS, five snoRNA gene clusters are present in the top regions, with highly significant pvalues (p<5x10 -6 for all linked snoRNA regions). Altered expression levels of snoRNAs have been associated with several cancer forms, including breast cancer and canine mammary tumours [48][49][50], and elevated snoRNA biogenesis has been found essential for breast cancer tumourigenicity by affecting the tumour suppressor p53's function [51]. It has been reported that snoRNAs can act as putative oncogenes [48][49][50]52,53] and regulate other cancer genes. There are also indications that snoRNAs could influence treatment response in breast cancer since altered snoRNA expression levels have been described in Tamoxifen resistant breast cancer cell lines [54].
The regions with reduced genetic variability are strongly enriched for cancer related genes. In particular, tumour antigen genes, such PAGE2B (prostate-associated P antigen family, member 2B), sarcoma antigen 1 (SAGE1) and several melanoma-associated antigens are overrepresented in the RGV regions. The tumour antigen gene products can act as antigens due to somatic mutations, overexpression or by expression in a cell type where they are normally not expressed which can lead to an activation of the host immune defence. The tumour antigens can also be utilised as biomarkers or in cancer treatment, which is reflected in the large number of ongoing clinical trials focusing on tumour antigens (>1600 in www.clinicaltrials.gov as of Jan 2015). Even though most proteins have the ability to become tumour antigens when mutated, there is an overrepresentation of antigen genes on the X chromosome, which could potentially cause a bias in the pathway analysis reflecting the RGV detection. However, there are also tumour antigen genes in two of the top ten GWAS loci (MAGEB10 and DDX43). There is a substantial overlap of similar genes and pathways in the RGV and GWAS datasets, where tumour antigen genes and snoRNAs link the two datasets and are present in both RGV and GWAS loci. This overlap indicates that the associated and fixed regions affect similar pathways, which are likely influential in CMT development. These results support the theory that both associated and RGV regions contribute to the high rate of CMT in the ESS breed. The fixed regions would increase the base level risk in the breed overall, with genes in the GWAS loci elevating the susceptibility further for individuals carrying the risk genotypes.
This study demonstrates the power of utilising a closed population such as a dog breed for finding genetic susceptibility loci. We have identified a genome-wide significant locus with a promising candidate gene using only about 300 dogs. We note however that several other loci are significantly connected to the identified pathways and also add to the CMT risk explained in the breed. We hypothesise therefore that adding more dogs and markers to the study would bring additional loci to genome-wide significance. It remains however unclear whether our results are applicable to other dog populations. We have investigated other European ESS populations (Norwegian and British), and could see no correlation of the top SNP identified in this study with CMT, but the sample sizes are too small to make any definite conclusions. There is however a trend towards replication in the Swedish outlier group, which would suggest that the risk factor is enriched in Sweden. The accumulation of the risk allele in Sweden is possibly due to an excess of carriers in the population founding the breed in Sweden. It would be of great interest to explore ESS populations from different continents as well as different breeds to investigate if the same genes and pathways are implicated. Furthermore, comparative studies to human breast cancer would be highly prioritised since further studies are required to understand the potential involvement of CDK5RAP2 and MEGF9 in breast cancer. Other animal models with mammary tumours could also be considered, such as rats. Similar to dog breeds, rat strains differ in their susceptibility to mammary tumours [55,56]. The associated region on chromosome 11 reported here overlap with rat mammary tumour susceptibility loci [57,58]. Five QTLs in rat, however large, overlap with the associated region [59]; three linked to exposure to tobacco metabolites [60,61] and two linked to estrogen levels [58,62]. Several of the detected GWAS loci overlap with variants associated with different forms of cancer in humans, such as colorectal and prostate cancer, but none of the regions have been reported as significantly associated with breast cancer [27]. We have previously shown that the known human breast cancer genes BRCA1, BRCA2 and ESR1 are associated with CMT in ESS dogs overlapping with this cohort [25,26], although not as strongly as the risk factors identified in this GWAS study. The similarities in epidemiology, clinical features and genetic predisposition suggest that CMT could be used as a model for breast cancer on many levels. The CDK5RAP2, MEGF9, snoRNA and tumour antigen pathways identified in this study could thus potentially also play roles in human breast cancer. If so, these could possibly influence treatment response to chemotherapeutic agents and therefore be used to guide the choice of treatment. Further validations, initially in the dog, are certainly necessary to assess whether this hypothesis is true.

Conclusions
We have performed the first genome-wide study to identify the underlying cause of CMT, which is a spontaneously occurring tumour with many similarities to human breast cancer. We have identified significant CMT association in a region overlapping the CDK5RAP2 gene. This study further demonstrated the value of CMT as a comparative model for breast cancer for future genetic and clinical studies.

Ethics statement
All blood and buccal swab samples were collected from English springer spaniel pet dogs with owner's consent according to the ethically approved protocols of the participating institutions.

Samples
All blood and buccal swab samples were collected from English springer spaniel pet dogs with owner's consent according to the ethically approved protocols of the participating institutions. A total of 216 CMT cases and 175 controls were collected. Of these, 336 ESS samples were collected in Sweden (190 cases and 146 controls), 40 in the United Kingdom (18 cases and 22 controls) and 15 in Norway (8 cases and 7 controls).
Swedish ESS blood samples were collected by veterinarians in different veterinary animal hospitals and veterinary clinics throughout Sweden between the years 2005 and 2010 and information was collected regarding possible risk factors for the development of mammary tumours for most dogs (signalment, age of onset, sex, spaying, lactation, use of contraceptives, diet, pregnancy, disease status, and family cancer history) as well as pathology reports and/or other clinical diagnostic information. The average age of diagnosis was 10.8 years, ranging from 5 to 17 years of age. The age of diagnosis is based on the age at time of surgery, which often occurs several years after initial detection of lumps in the mammary glands. 23% of the cases were spayed. Control dogs were over 8 years old and with a confirmed absence of CMT based on palpation of the mammary gland performed by a veterinarian. They were also unaffected by any other form of cancer. 25% of the controls were spayed, with an average age of 6.0 years at time of spaying. When samples were available from siblings only one dog was included to reduce the degree of relatedness in the study cohort. Genomic DNA was extracted from whole blood or buccal swabs using the QIAamp DNA Blood Midi Kit (Qiagen, Hilden, Germany), QIAamp DNA Mini Kit (Qiagen), or salt extraction [63]. 196 of the samples were subsequently whole-genome amplified (GenomePlex Whole Genome Amplification (WGA) Kit, Sigma) due to low DNA amounts. Associated risk allele status for BRCA1, BRCA2 and ESR1 was available for 278, 281 and 178 of the Swedish ESS dogs, respectively [25,26]. The proportion of dogs carrying at least one risk allele was 98.2% (BRCA1), 85.8% (BRCA2) and 94.4% (ESR1).

Genome build and annotation
All designs and data analyses were made using the CanFam 2.0 genome build, and the results lifted over to CanFam 3.1. All positions are in CanFam 3.1 unless otherwise specified. Gene annotations were extracted from ENSEMBL [64] and by lift-over from the human genome hg18 and hg19 using UCSC genome lift-over tool [65].

Genome-wide association mapping
The Illumina 170K canine HD SNP array was used for the genotyping of approximately 174,000 SNPs with a mean genomic interval of 13 kb [66]. The Swedish cohort of 332 samples was used for GWA analysis. Data quality control was performed using the software package PLINK [67], removing SNPs and individuals with a call rate below 90% and SNPs with a minor allele frequency below 1%. A total of 96 SNPs were removed due to platform genotyping inconsistencies. Population stratification was estimated and visualised in multidimensional scaling plots (MDS) using PLINK (S1 Fig) to detect outliers and subgroups in the dataset after removing SNPs in high linkage disequilibrium (LD) (r 2 >0.95). The GCTA software was used to estimate the inbreeding coefficient [68].
Regions associated with CMT were detected by case-control genome-wide association analysis. The EMMAX software [69] was used to calculate association p-values corrected for stratification and cryptic relatedness using mixed model statistics. The two primary eigenvectors calculated by the GCTA software [68] were used as covariates in the analysis to adjust for stratification. The LD-pruned SNP set was used for the estimations of MDS, eigenvectors in GCTA and relationship matrix in EMMAX, whereas the full QC filtered SNP set was used for association testing. Quantile-quantile (QQ)-plots were created in R [70] to assess possible genomic inflation and to establish suggestive significance levels. Permutation testing was performed in GenABEL [71] using mixed model statistics, two eigenvector covariates calculated by GCTA and 10,000 permutations to establish empirical genome-wide corrected p values. Genome-wide significance is considered for p perm 0.05. Minor allele frequencies were calculated for each cohort (cases and controls) using PLINK [67]. Odds ratios (ORs) and 95% confidence intervals were also calculated from allele frequencies in PLINK [67]. The allele frequencies in domestic animal populations do not always comply with Hardy-Weinberg equilibrium (HWE) due to the non-random mating, but the top SNPs were tested for HWE to exclude SNPs in extreme HW disequilibrium (p0.0001), in order to detect possible genotyping errors.
A restricted maximum likelihood (REML) analysis implemented in the GCTA software [68] was used to estimate how much of the phenotypic variance the associated SNPs account for. The two primary eigenvectors were used as covariates, and the prevalence set to 0.36 [20]. The top ten GWAS regions were defined using LD-based clumping in PLINK [67], where ± 5 Mb of the top SNP positions were searched for associated SNPs (p<0.1) in LD (r 2 >0.2) with the top SNP in each region. The regions analysed are listed in Table 1.

Regions with low genetic variation
The ESS cohort exhibit regions with reduced genetic variability (RGVs), which are fixed or close to fixed for certain alleles in the breed. These regions could contain risk variants contributing to the high incidence of CMT and are undetected in association studies. Regions with MAF<0.01 for >250 kb in the entire cohort without outliers (180 cases and 119 controls) were selected for further analysis. A less strict cut-off, MAF<0.05 for >250 kb, was also used to enable comparisons to other studies.

Pathway analysis
The top ten GWAS and all RGV regions were evaluated separately and together for pathway enrichment using the GRAIL software [38], which use published scientific abstracts to evaluate connectivity between genomic regions. 50 kb were added to the flanks of the GWAS and RGV regions and coordinates were translated to human genome 18 using UCSC liftover [65]. Gene size correction and PubMed Text (Aug2012) were applied in GRAIL.

Candidate region re-sequencing
We performed mutational screening of the 14 most associated regions in order to identify disease-causing variants. A homozygous region on chromosome 30 was also included in the sequencing. The regions were targeted by either hybrid selection (NimbleGen Sequence Capture arrays, Roche NimbleGen) followed by re-sequencing using next generation sequencing (Illumina Genome Analyzer II, Illumina), or by PCR of exons and conserved elements and Sanger sequencing. A total of 12 Mb was targeted and sequence capture was performed using an in-house modified protocol [72]. DNA samples from seven ESS dogs (3 cases, 4 controls), selected to carry haplotypes that captured as much genetic variation as possible, were sequenced. Targeted next generation sequencing was used to evaluate the top regions chr11:73.1-73.8Mb and chr27:0.48-0.76Mb whereas Sanger sequencing of selected regions was applied to the chr27:7.6-7.72Mb region. Several software packages were used for sequence data analysis to identify SNPs, indels and copy number variants in the sequenced regions. The BWA package [73] was used for read alignment to the dog reference genome [74], the GATK pipeline for local realignment and quality score recalibration [75], Picard for removal of read clones and to extract statistics (http://picard.sourceforge. net), Samtools for SNP and small indel variant calling and filtering [76], SnpEff to annotate variants [77], the DELLY software suite for detection of structural variants [78], IGV for visualisation of sequences [79] and SeqScoring [80] for evaluation of conservation using data from the 29 mammals project [81]. CodonCode Aligner (CodonCode) was used to evaluate Sanger sequences.

Candidate SNP genotyping
Additional genotyping was performed for 61 SNPs using iPLEX Gold Mass ARRAY (Sequenom). Fifty-one of the SNPs were selected from the sequencing data as candidate variants for CMT and eleven were top SNPs from the GWAS included for genotype confirmation. Pyrosequencing (Qiagen, Hilden, Germany) was also used for genotyping of two additional candidate SNPs in the CDK5RAP2 gene [82] and one SNP was genotyped using PCR amplification followed by restriction enzyme cleavage and gel electrophoresis. The SNPs included in the candidate SNP genotyping are listed in S4 Table.

Imputation
The Illumina 170K canine HD SNP array dataset was merged with the iPLEX Gold Mass ARRAY and Pyrosequencing SNP data using PLINK [67]. Imputation of missing genotypes was performed with the BEAGLE software [83] before evaluating LD and haplotype structure in the candidate regions. Imputed SNP calls with <90% probability were filtered out, and quality control filtering and association analysis was performed on the imputed dataset using PLINK and EMMAX as described previously in the genome-wide association mapping section. An additional GWAS analysis was also performed with the top SNP (Chr11:73,290,522) genotypes as covariates to investigate this SNP's impact on the remaining loci.

Haplotype analysis
Pair-wise r 2 -based LD between markers was used to evaluate the size of candidate regions and whether the associated loci were independent. The r 2 calculations were performed using the Haploview [84] and PLINK software packages [67] on the expanded and imputed dataset. The candidate locus on chromosome 11 was restricted by SNPs with a pairwise r 2 0.6 with the top SNPs. Haplotype analysis was performed using PHASE v2.1.1 [85] to identify haplotypes in the candidate regions. The SeaView software package was used to construct maximum-parsimony phylogenetic trees with bootstrap resampling (1000 permutations) [86]. The designation of clusters was based on branch length. Chi-square statistics were used to evaluate differences between haplotype groups. The associated SNPs within the detected 446 kb region were evaluated with the transcription factor binding motif tool TOMTOM [31] (JASPAR and UniProbe motif databases, p<0.001). The PolyPhen-2 software was used to evaluate the effect of non-synonymous SNPs [34]. Recombination hotspot data was obtained from Auton et al [28]. Student's t-test was applied to evaluate recombination hotspot density.  Table. Pathway analysis result for the top ten GWAS candidate regions. Significant p-values (p GRAIL <0.05) are indicated in bold. (DOCX) S2 Table. Regions with reduced genetic variability, including GRAIL pathway analysis results. Significant p-values (p GRAIL <0.05) and tumour associated antigen genes are indicated in bold. (DOCX) S3 Table. GRAIL pathway analysis results from associated regions and regions with reduced genetic variability.