A Genome-Wide Association Study Identifies Genomic Regions for Virulence in the Non-Model Organism Heterobasidion annosum s.s

The dense single nucleotide polymorphisms (SNP) panels needed for genome wide association (GWA) studies have hitherto been expensive to establish and use on non-model organisms. To overcome this, we used a next generation sequencing approach to both establish SNPs and to determine genotypes. We conducted a GWA study on a fungal species, analysing the virulence of Heterobasidion annosum s.s., a necrotrophic pathogen, on its hosts Picea abies and Pinus sylvestris. From a set of 33,018 single nucleotide polymorphisms (SNP) in 23 haploid isolates, twelve SNP markers distributed on seven contigs were associated with virulence (P<0.0001). Four of the contigs harbour known virulence genes from other fungal pathogens and the remaining three harbour novel candidate genes. Two contigs link closely to virulence regions recognized previously by QTL mapping in the congeneric hybrid H. irregulare × H. occidentale. Our study demonstrates the efficiency of GWA studies for dissecting important complex traits of small populations of non-model haploid organisms with small genomes.


Introduction
Genome-wide association (GWA) studies have been used to elucidate the genetic basis for disease traits [1]. The feasibility of GWA studies in plants was demonstrated by Atwell and colleagues [2] who tested 107 phenotypes in Arabidopsis thaliana inbred lines, and has also been used to investigate the genetic basis of resistance in different crops or various traits in wild plants and trees [3]. So far, very few GWA studies have been applied to fungi, even though the tools required are available and fungi have the advantages of small genomes, a homokaryotic (haploid) stage, a sexual life cycle and the ability to multiply genotypes for reproducibility. In Saccharomyces cerevisiae GWA mapping of mtDNA copy number identified one significant SNP [4] and population based GWA analysis of clinical vs. nonclinical yeast identified several genetic loci associated with the clinical background [5].
However, several different genetic traits have been studied in fungi using quantitative trait loci (QTL) analysis. In Pleurotus ostreatus, Santoyo et al. [6] located QTLs linked to lignindegrading enzymatic activities on six different chromosomes. A QTL associated with pathogenicity on wheat has been identified in the Fusarium head blight fungus Gibberella zeae [7]. In Nectria haematococca MPI, several QTLs for virulence on pumpkin have been found [8]. The genetic components controlling virulence in the necrotrophic forest pathogen Heterobasidion annosum (Fr.) Bref. sensu lato (s.l.) have been studied by QTL mapping of fungal growth within sapwood and induced lesion length in phloem in Norway spruce (Picea abies) and Scots pine (Pinus sylvestris) [9]. Four QTLs, for both lesion length and growth in sapwood on spruce and pine were found located within the same linkage group, while two other pine-specific QTLs associated with growth or lesion length were located to two other groups [9]. In addition to nuclear genetic factors, the virulence has been shown to be influenced by the mitochondrial genome using hybrids of H. irregulare 6 H. occidentale [10]. The advantage of using GWA studies over QTL mapping is that no mapping population needs to be generated. Furthermore, because many more recombination events have occurred in natural populations compared to in a single crossing, the linkage blocks are expected to be smaller than in QTL mapping, which implies a higher map resolution [3].
Population genomics and genome-wide mapping have revealed recent hybridization and genetic introgression between the pathogenic fungi Coccidioides posadasii and C. immitis [11] while a survey of the domesticated and wild yeasts Saccharomyces cerevisiae and S. paradoxus showed that phenotypic variation correlated broadly with global genome-wide phylogenetic relationships [12]. By SNP identification and whole-transcriptome sequencing, Ellison et al. [13] discovered two recently diverged populations of Neurospora crassa in which they identified genomic islands of high divergence that may be the result of local adaptation to a temperature difference between the two populations.
The necrotrophic pathogen H. annosum s.l. causes severe damage in coniferous forests throughout the Northern Hemisphere [14]. H. annosum s.l. is a complex consisting of five species with different but partially overlapping host preferences [15,16]. H. annosum sensu stricto (s.s.) mainly infects Pinus spp. but is also able to attack Picea spp. and other soft and hard wood tree species throughout Europe [17]. The economic losses due to devaluation of saw timber because of stem rot, reduced tree growth and tree mortality reach 1 billion Euros yearly for European forest owners. In addition, H. annosum s.l. challenges the coniferous forests potential to act as a carbon sink by releasing CO 2 from decayed wood.
In this study, we identify novel and well-known virulence genes in the conifer pathogen Heterobasidion annosum s.s. through GWA studies based on large scale single nucleotide polymorphism (SNP) identification and typing through next generation sequencing. Our results demonstrate the efficiency of this technique in a haploid non-model organism.

Generation of a Genomic Data Set
We sequenced the genomes of 23 haploid H. annosum s.s. isolates with an Illumina Genome Analyzer. We obtained paired end reads from a 400-base pair (bp) insert library from three of the isolates (90211/2, Sä _16-4 and W_15), while 36-bp single end reads were generated from the remaining 20. The number of reads ranged between 3.2 and 8.9 million for the single end read isolates, and between 12.3 and 15.3 million for the paired end read isolates (Table 1). A combined reference genome was de novo assembled from the paired end sequenced isolates using the Velvet assembler [18,19]. The assembly resulted in a reference genome of 56,195 contigs with an N50 of 41,157 bp and a median coverage of 38.76. The number of contigs of at least 1000 bp was 2330, which comprise in total 30,569,260 bp (internal gaps excluded). These contigs were used as reference sequences when sequence reads of each isolate were assembled including the three used in the de novo assembly. The assembly of the individual isolates was achieved through mapping towards the combined reference genome using MOSAIK (http://bioinformatics.bc.edu/ marthlab/Mosaik). For each isolate, 48% to 75% of the sequence reads were aligned to the reference and the coverage ranged between 80% and 94% with gap regions included ( Table 1). The mean read coverage was between 2.66 and 12.66 of the individual isolates. There were 4,398,305 positions where all 23 individuals had sequence coverage of at least 26. In these positions, a total of 64,055 SNPs were found and 33,018 of these were not singletons. We tested our SNP panel for population structure with the program Structure [20,21,22], using one SNP per contig. No signs of population structure were detected.

Fungal Virulence on Pine and Spruce Differs Significantly between Isolates
Virulence of H. annosum s.s. on Scots pine and Norway spruce was measured both as lesion length formed under the bark and fungal growth in the sapwood of the plants. The average infection success was 90% for each plant species and ranged between 30% and 100% for individual isolates ( Table 2). There was a significant difference in virulence between isolates on pine and on spruce, measured both as lesion length and fungal growth in sapwood (P,0.05, ANOVA). In spruce, fungal growth (SFG) was normally distributed. However, the measurements for growth in pine (PFG) and lesion length in both spruce (SLL) and pine (PLL) had to be transformed to generate a normal distribution. A correlation between lesion length and fungal growth for spruce (R 2 = 0.6684) was found, but not for pine. A weak but still significant correlation (R 2 = 0.2363, P = 0.0187) was found between PFG and SFG.

Twelve SNPs are Significantly Associated with Fungal Virulence
The general linear model revealed an association between 12 SNP markers and virulence with an adjusted P-value lower than 0.0001 (10 000 permutations) ( Table 3). Out of these twelve, six SNP markers were found on the same contig (contig 41480); the remaining six SNP markers were distributed on six different contigs (Fig. 1). Four SNP markers were associated with SFG and the remaining eight with PFG. No markers were associated with SLL or PLL.
The length of the contigs varied between 2.5 kilobase pairs (kb) and 76.5 kb (Table 4) and the SNP density varied in the contigs from 0.4 to 1.7 SNPs per kb (minor allele frequency $2/23 individuals; coverage in all individuals). The linkage disequilibrium (LD) heat maps show clear blocks of LD in the two contigs 4128 and 41480 associated with the markers for virulence traits (Fig. 2). In the remaining five contigs the SNP marker associated with virulence was not associated with any LD block ( Fig. 2 and Fig.  S1). The six SNP markers located in contig 41480 were found in two distinct LD blocks with the same statistical support (P,0.0001). The less virulent genotypes were present at low frequencies in the population (Fig. 3).

Surveying the Genomic Regions Associated with Fungal Virulence
The seven H. annosum s. s. contigs containing SNPs associated with virulence were highly homologous to the H. irregulare genome (http://genome.jgi-psf.org/Hetan2/Hetan2.home.html) [23] and the gene order within each contig was well conserved. The contigs were not linked among each other and they were distributed over different scaffolds. Contig 4128 had sequence homology to positions 49591-113806 on scaffold 2 in the genome of H. irregulare. The LD block harbouring the SNP associated with SFG was located between position 40886 and 66145 in the contig (corresponding to position 89756-113692, scaffold 2 in H. irregulare). This region contained nine genes that encode: a serine protease, a transcriptional co-repressor, a quinone oxidoreductase (ToxD), an inner centromere protein, a urea transporter, an enzyme similar to a DNA-dependent ATPase, a sorbitol dehydrogenase and two flavin-containing monooxygenases ( Fig. 2A). Marker 2541, associated with SFG, was located in between the two genes; the serine protease and the transcriptional co-repressor ( Fig. 2A). Linkage was found between this marker and SNPs in the genes that encode a cytochrome P450, a urea transporter, a DNA-dependent ATPase and a sorbitol dehydrogenase.
Contig 41480 had homology to positions 499810-584466 on scaffold 13 in the genome of H. irregulare and they were syntenic, although there was a small break in homology between positions 532643 and 533812, where H. irregulare has a transposon inserted. The LD block between 2.2 and 7.0 kb in the contig spanned a putative calcineurin, similar to S. cerevisiae CNA-1, and the end of an acetylglutamate kinase/synthase gene (Fig. 2B). Two markers at positions 11461 and 12127 were found in a smaller LD block that was located in a gene of unknown function. An absolute LD (r 2 = 1) between the significant markers and the rest of the positions in the LD blocks was observed. This linkage is localized to the genes that encode calcineurin, acetylglutamate kinase and the longer unknown gene. Two of the markers associated with virulence in the calcineurin gene were synonymous whereas the third was located in an intron (Table 3). In addition, the small contig 15627 (2.5 kb) also had homology to scaffold 13 (51802-56242) in the genome of H. irregulare. This region contains the last 200 bp of a polyprenylsynthetase gene and three unknown genes, out of which two encode secreted proteins.
Contig 16590 had homology to positions 1029321-1073048 on scaffold 10 of the H. irregulare genome (Fig. 2C). The significant SNP marker 7164 at position 17829 in the contig was located in a non-synonymous position in an exopolyphosphatase gene (Table 3). It had LD (r 2 = 1) to one unknown gene at position 14325 and to three intergenic positions.
Contig 9600 had homology to positions 533823-565834 on scaffold 1 in the genome of H. irregulare. Only a few, quite small LD blocks were found, the largest being 1.7 kb (Fig. 2D). The significant marker at position 30323 was located to a synonymous position in the transcription factor SWI5 with an absolute LD (r 2 = 1) restricted to the same gene at positions 30227, 30247, 30352 and 30602 (Table 3).
Contig 45322 and 50191 had homology to scaffold 3 and 5 of the genome of H. irregulare, respectively. The SNP in contig 45322 was localized to a gene of unknown function and the one in contig 50191 was found in an intergenic region (Fig. S1).

Discussion
Applying GWA analysis to an organism in a haploid stage that can be clonally reproduced in high numbers and phenotyped repeatedly increases the accuracy of the phenotypic measurements and the power of the association analysis with several orders of magnitude as compared to diploids. We demonstrated that as few as 23 haploid individuals could successfully be used to identify convincing associations in small fungal genomes, whereas several hundreds of individuals are needed in diploid organisms with large genomes such as humans and plants [24,25].
Lind et al. [9] used a mapping population generated from a cross between H. irregulare and H. occidentale to show that several H. annosum s.l. QTLs were associated with fungal virulence on spruce and pine. An improved linkage map [26] located two QTLs on scaffolds 1 and 2 close to or overlapping the H. annosum s.s. contigs 9600 and 4128, suggesting that these regions harbour general virulence factors important in several Heterobasidion species, e.g. H. irregulare, H. occidentale and H. annosum s.s. Such general virulence regions have previously been suggested by Lind et al. [9] who found one linkage group with QTLs for four virulence traits. The overlapping results between these different studies and mapping methods strengthen the assumption that the regions target important virulence genes. These combined data suggest that virulence in H. annosum s. l. is controlled by regions of virulence factors that occur at different positions in the genome, of which some confer general virulence conserved between the different Heterobasidion species and other species-specific virulence that have evolved separately. Some of these species-specific genes found in H. annosum s.s. may encode virulence factors that are specialized for Scots pine infection. Interestingly, several Norway spruce specific virulence factors were also found with a significant association for SFG, confirming the ability of H. annosum s.s. to infect spruce as an alternative host. In addition, this fact may indicate that H. annosum s.s. uses different mechanisms to infect spruce than H. occidentale. The success and power of an association study is dependent on the number of SNP markers and on the LD decay. We analysed LD in six contigs using an extended SNP data set and the results indicate a variable LD block size present in our population (300 bp to 31 kb). The very limited LD for the associated SNPs in four of the contigs indicates a high-resolution mapping of these regions. The fact that only part of the genome could be included in the analysis makes it difficult to predict the number of SNPs that    are needed for complete coverage in H. annosum s.s. Maize has been shown to have an LD decay between 1 and 10 kb that increased with the increase of minor allelic frequencies and with smaller sample sizes [27]. For the diploid maize, small sample size was defined as 25 individuals and a sample size of more than 50 individuals generated no significant differences in the mean r 2 . The long LD regions found in H. annosum s.s. might therefore partly be an effect of the relatively low number of individuals used.

Surveying the Genomic Regions Associated with Fungal Virulence
Gene models associated with fungal growth in spruce. Markers associated with fungal growth in spruce (SFG) were found to be located in gene models with varying functions (contig 4128, Fig. 2A) as well as to unknown/novel virulence genes (contig 45322, 50191, 15627, Fig. S1). Contig 4128 is located close to a previously known QTL for fungal growth in pine sapwood, which is why the genes found in this region are candidates as conserved, general, virulence factors. The gene encoding a quinone oxidoreductase (contig 4128) is similar to a gene that encodes zinc-binding oxidoreductase ToxD, a hostselective toxin produced by Pyrenophora tritici-repentis Pt-1C-BFP [28], with 27% sequence identity and 36% similarity. Ptr ToxD is a zinc enzyme, specific for NADPH that catalyses the one-electron reduction of certain quinones [29]. Pyrenophora tritici-repentis, causal agent of tan spot of wheat, is known to produce several additional host-selective toxins, including Ptr ToxA, Ptr ToxB and Ptr ToxC [30]. Serine carboxypeptidases, found in contig 4128, belong to the serine-type catalytic family and use the amino acid as a nucleophile to form an acyl intermediate. They can also act as transferases and are common in viruses, bacteria and eukaryotes [31]. In Schizosaccharomyces pombe it was shown that deletion of a serine carboxypeptidase gene, sxa2, causes hypersensitivity to the P-factor mating pheromone and a reduction in mating efficiency in M cells [32]. Moreover, in Cryptococcus neoformans a deletion mutant of KIN1, a serine/threonine protein kinase, was shown to have attenuated virulence [33]. The position of a significant marker between two genes that encode a serine protease and a transcriptional co-repressor indicate that the transcriptional regulation of either of the genes could be affected.
Flavin-containing monooxygenases (FMOs), found in contig 4128, are xenobiotic-metabolizing enzymes that catalyse the oxygenation of nucleophilic nitrogen, sulphur, phosphorous and selenium atoms using NADPH as a cofactor and FAD as a prosthetic group [34]. They have been implicated in the metabolism of several pharmaceuticals, pesticides and other toxicants. Arabidopsis mutants overexpressing FMO1 had an increased basal resistance against Pseudomonas syringae pv. tomato and Hyaloperonospora parasitica, which suggests that the FMO was involved in detoxifying the virulence factors produced by the pathogens [35]. Another FMO was previously found within a QTL for lesion lengths in pine bark close to contig 9600 suggesting a general and conserved role in pathogenicity [9] (Lind M, Dalman K, Olson Å , Brandström-Durling M, Stenlid J in prep.).
Gene models associated with fungal growth in pine. Three contigs were found to harbour markers associated with fungal growth in pine (PFG) (contig 41480, 16590, 9600). The general picture of LD for the significant markers in contig 41480 is that it is limited to three genes, calcineurin, acetylglutamate kinase and a longer unknown gene. Therefore these genes, along with exopolyphosphatase (contig 16590) and SWI5 (contig 9600), are the strongest candidates for the PFG virulence trait. The putative gene for calcineurin found in this study represents a strong candidate for virulence signalling in H. annosum s.s. Calcineurin is a phosphatase regulated by Ca 2+ and calmodulin, and is heavily involved in the calcium-dependent signal transduction pathways of  many processes in eukaryotes, such as T cell activation, muscle hypertrophy, memory development, glucan synthesis, ion homeostasis and cell cycle control [36]. Furthermore, the protein confers a conserved function for virulence in several fungi; e.g. Candida albicans [37], Ustilago maydis [38] and Sclerotinia sclerotiorum [39]. The plant pathogen Botrytis cinerea calcineurin-responsive transcription factor CRZ1 was found to be required for penetration of plant surfaces [40]. In the rice blast fungi Magnaporthe oryzae, deletion mutants for the genes encoding a calcineurin-responsive transcription factor had a reduced virulence due to a defect in host penetration [41,42]. A gene encoding N-acetylglutamate was found in contig 41480. N-acetylglutamate is involved in the biosynthesis of arginine in prokaryotes, lower eukaryotes and plants [43]. An insertional mutation or deletion of genes encoding N-acetylglutamate lead to reduced virulence in Gibberella zeae (anamorph, Fusarium graminearum) [43] and in Colletotrichum higginsianum [44].
Exopolyphosphatases (PPXs), found in contig 16590, hydrolyse and release the terminal phosphate from linear polyphosphate containing three or more phosphoanhydride bonds [45]. Together with polyphosphate kinases (PPKs), responsible for polyphosphate synthesis, PPXs maintain the dynamic balance of the polyphosphate level. Polyphosphate is essential for growth of cells, responses to stresses and virulence in several bacteria [45]. Mutants lacking PPX in the human pathogen Neisseria meningitidis had an increased resistance to complement-mediated killing [46].
In contig 9600 the significant marker was located to a SWI5 transcription factor. The homolog in C. albicans, CaACE2, was shown to affect virulence and deletion mutants of CaACE2 were avirulent in a mouse model [47]. Strong QTLs for lesions in pine and spruce bark, and growth in spruce sapwood, are overlapping this contig [9](Lind M, Dalman K, Olson Å , Brandström-Durling M, Stenlid J in prep.). The corresponding peaks of these QTLs were at approximately 400 kb (pine) and 600 kb (spruce) respectively on scaffold 1 in the genome of H. irregulare. Candidate genes from this region are likely to play an important part in general Heterobasidion pathogenicity.

Conclusions
In this study we show that GWA studies are useful for dissecting important complex traits of non-model organisms, in particular those with small genomes and a haploid life style. We characterized seven genomic regions associated with fungal growth in the sapwood of spruce and pine and present eight candidate virulence genes that encode quinone oxidoreductase (ToxD), serine carboxypeptidase, two flavin-containing monooxygenases, calcineurin, acetylglutamate kinase, exopolyphosphatase and the SWI5 transcription factor. Of these, all except calcineurin, acetylglutamate kinase and exopolyphosphatase were found very close to or directly overlapping with previously known virulence QTLs.

Plant and Fungal Material
Two-year-old Norway spruce (Picea abies) and Scots pine (Pinus sylvestris) plants, originating from Latvia and Gotthardsberg in Sweden, respectively, were washed and planted in 2 L pots with fertilized peat. The plants were grown for one month in the greenhouse at 20uC before inoculation. The 23 haploid H. annosum s.s. isolates used in this study originated from field studies from different geographic locations in Europe (Table 5). They were all single spore isolates isolated from basidiospores or conidiospores. All isolates were grown on Hagem medium [48] at 21uC in darkness for one week whereafter autoclaved pine wood blocks (56565 mm) were placed on the mycelia. The cultures were incubated for a further four weeks to allow thorough colonization of the wood blocks.

Virulence Assay
The spruce and pine plants were inoculated with an H. annosum s.s. isolate by cutting a small window (5610 mm) in the cambium of the plant, halfway between two nodes, inserting a colonised wood block in the wound and then wrapping it with ParafilmH. The experiment was executed in blocks: two spruce and two pine plants were inoculated with each of the 23 fungal isolates each day for five consecutive days, so that in total ten spruce and ten pine plants were inoculated with each isolate. As a control, ten spruce and ten pine plants were inoculated with sterile wood blocks. After four weeks the plants were harvested and the lesion lengths upstem and downstem from the wound was measured. The stems were then cut into 5-mm pieces and placed on wet filter paper in Petri dishes for 7 days. Growth within sapwood upstem and downstem from the wound was scored by the presence of conidiophores of H. annosum s.s. Inoculation was considered to have been unsuccessful in plants that showed no visible lesion, no fungal growth in sapwood and no conidia in the wound: these stems were discarded.

Statistical Tests for the Virulence Assay
A one-way ANOVA, performed using MinitabH 15.1.0.0, was used to test the difference in virulence between isolates. The measured values for virulence were normally distributed for SFG, but not for PFG or for SLL and PLL. To obtain a normal distribution for the three latter traits, 2 were added to every measure whereupon they were log transformed. The resulting mean values for all four traits were used in the regression analysis and in the association mapping.

SNP Extraction
Heterobasidion annosum s.s. mycelia were grown and DNA extracted according to Lind et al. [49]. All library preparations and sequencing were performed at the SNP&SEQ Technology Platform of Uppsala University Hospital on the Illumina Genome Analyzer according to Illumina's standard protocol. For three of the isolates (90211/2, Sä _16-4 and W_15), libraries with an insert length of 400 bp were sequenced from both ends (paired end).
Reads from the three paired-end-sequenced isolates were de novo assembled using Velvet version 0.7.60 (hash length 21 bp) [18,19]. Contigs larger than 1000 bp were used as reference sequences. Reads from each isolate (including those isolates that were used for the de novo assembly) were aligned to the reference contigs using MOSAIK version 1.0.1384 (http://bioinformatics.bc.edu/ marthlab/Mosaik). Two mismatches per read were allowed using a hash size of 10; only the uniquely aligned reads were accepted. The SAMtools program (pileup -c) [50,51] and an in-house Python script were used to find SNPs. Each SNP had to have a SNP quality of at least 10 according to Li et al. [51]. Only sites where all individuals had coverage of two reads or more were used. The data was submitted to the Sequence Read Archive, SRA (SRA050098).

Population Structure Analysis
Structure within the sample population was explored using the program STRUCTURE [20,21,22]. To minimise the impact of linkage between SNPs in the structure analysis, we selected a single SNP representing each contig. These were analysed with different assumed population subdivisions in the range from one to four subpopulations in a model assuming the SNPs were not linked and the organism haploid.

Association Mapping and Linkage Disequilibrium Analysis
The association between SNP markers and virulence estimates was analysed using TASSEL version 2.1 [52]. A general linear model (GLM) approach assuming a completely random mating population was used. The imported phenotypic data consisted of transformed and non-transformed mean values of the four traits. The association analysis was performed in two rounds; first using the SNP dataset from the whole genome and then using an extended dataset from each selected genomic region found to be significantly associated with virulence in the first round. The SNP dataset for the first round consisted of SNPs with a minor allelic frequency of 9% and a minimum coverage of 100% and in the second round a minor allelic frequency of 9% and a minimum coverage of 65%. A permutation test with 10 000 replicates, implemented in TASSEL, was used to correct the p-values for each site. Linkage disequilibrium (LD) heat maps, based on r 2 calculations implemented in TASSEL, were constructed for the selected genomic areas using the second SNP dataset.

Alignment and Annotation
The SAMtools pileup -c output [50] and an in-house Python script were used to retrieve the consensus sequences from the selected genomic regions of all isolates. Corresponding genomic regions were identified using BLAST searches of the complete sequenced genome of H. irregulare, isolate TC32-1 (http://genome. jgi-psf.org/Hetan2/Hetan2.home.html). The gene annotations found in TC32-1 were transferred to the reference sequence using PROT_MAP, FGENESH-2 (SoftBerry, Mount Kisco, NY) and Artemis [53]. The alignments were further analysed for synonymous and non-synonymous substitutions using MEGA version 4 [54].