A population genomic characterization of copy number variation in the opportunistic fungal pathogen Aspergillus fumigatus

Aspergillus fumigatus is a potentially deadly opportunistic fungal pathogen. Molecular studies have shaped our understanding of the genes, proteins, and molecules that contribute to A. fumigatus pathogenicity, but few studies have characterized genome-wide patterns of genetic variation at the population level. Of A. fumigatus genomic studies to-date, most focus mainly on single nucleotide polymorphisms and large structural variants, while overlooking the contribution of copy number variation (CNV). CNV is a class of small structural variation defined as loci that vary in their number of copies between individuals due to duplication, gain, or deletion. CNV can influence phenotype, including fungal virulence. In the present study, we characterized the population genomic patterns of CNV in a diverse collection of 71 A. fumigatus isolates using publicly available sequencing data. We used genome-wide single nucleotide polymorphisms to infer the population structure of these isolates and identified three populations consisting of at least 8 isolates. We then computationally predicted genome-wide CNV profiles for each isolate and conducted analyses at the species-, population-, and individual levels. Our results suggest that CNV contributes to genetic variation in A. fumigatus, with ~10% of the genome being CN variable. Our analysis indicates that CNV is non-randomly distributed across the A. fumigatus genome, and is overrepresented in subtelomeric regions. Analysis of gene ontology categories in genes that overlapped CN variants revealed an enrichment of genes related to transposable element and secondary metabolism functions. We further identified 72 loci containing 33 genes that showed divergent copy number profiles between the three A. fumigatus populations. Many of these genes encode proteins that interact with the cell surface or are involved in pathogenicity. Our results suggest that CNV is an important source of genetic variation that could account for some of the phenotypic differences between A. fumigatus populations and isolates.


Introduction
Aspergillus fumigatus is a ubiquitous, saprophytic mold found in soil, compost, and other organic matter, and plays an important ecological role as a decomposer [1,2]. This species is PLOS  also an opportunistic human pathogen and is responsible for the greatest number of deaths and the second highest number of infections of any fungal species [3]. It is estimated that A. fumigatus infection in immunocompromised individuals results in 100,000 deaths annually [4]. A. fumigatus harbors several strategies conducive to the pathogenic lifestyle. The small size of the conidia and layer of hydrophobic proteins covering the conidia permits evasion of mucociliar clearance, and mask the antigenic carbohydrate β(1,3)-glucan, which alveolar macrophages use for recognition, respectively [1]. A. fumigatus also grows optimally at 37˚C, which, coincidentally, is the internal temperature of the human body [5], and produces an arsenal of molecules used to degrade host tissue, import nutrients, and counteract host defenses [1,2,6]. High-throughput sequencing combined with comparative and population genomics is a powerful tool for identifying genes or genetic variants associated with phenotypes, including components of A. fumigatus pathogenicity. The utility and power of this approach was first exhibited in a study by Camps et al. [7], in which the resequencing and comparison of serial isolates collected from a patient before and after prolonged azole therapy revealed a novel mutation in the hapE gene that conferred azole resistance. Whole-genome sequencing of patient derived serial isolates resulted in the identification of several nonsynonymous mutations, a 38.5 Kb deletion containing 11 genes, and the presence of an isolate with the azole resistant cyp51A mutation [8]. In one of the most extensive A. fumigatus population genomic studies to date, Abdolrasouli et al. [9] resequenced the genomes of 24 A. fumigatus isolates to characterize genetic variants associated with azole resistance. This study confirmed that the TR34/L98H mutation in the cyp51A gene is the sole mechanism responsible for azole resistance in the analyzed isolates and also provided evidence for recombination, including in those isolates with the TR34/L98H mutation.
The majority of the aforementioned genomic studies analyzed single nucleotide polymorphisms (SNPs) or large scale structural variants while playing little or no attention to copy number variation (CNV). CNV is a type of segregating variation that is defined as fragments of DNA that are present at variable copy number (CN) in comparison with a reference genome [10]. CNV mutation rates are often higher than those of SNPs [11,12], and are the result of several mutational processes including non-allelic homologous recombination, non-homologous end-joining, retrotransposition, and fork stalling and template switching [13][14][15][16]. CNV can affect phenotype by directly altering gene function through gene interruption, or gene fusion, or by modifying gene expression through gene dosage, regulatory element dosage, and position effect [17]. In fungi, gene CNV has been associated with phenotypic variation and adaptation [18]. For example, population genomic analyses revealed widespread CNV in fermentation-related genes in Saccharomyces cerevisiae wine strains [19], higher α-amylase gene expression in isolates of Aspergillus oryzae and Aspergillus flavus with greater α-amylase CN [20], and differences in pathogenicity-related gene CN between closely related, but phenotypically distinct, populations of Cryptococcus gattii [21]. A recent population genomic analysis of A. fumigatus secondary metabolite encoding gene clusters revealed widespread gene CNV which likely contributes to phenotypic variation [22].
Despite the importance of CNV as a source of genetic and phenotypic variation, no study to date has characterized the genome-wide population patterns of CNV in A. fumigatus. In the present study we analyzed publicly available whole-genome Illumina sequence data from 71 A. fumigatus isolates [9,[23][24][25]. We first identified three genetic populations consisting of at least 8 individuals using a panel of high-resolution SNPs. We then performed multiple analyses to characterize the abundance, localization, variation, and functional associations of CNVs at the species-, population-, and individual-levels.

Data-mining and sequence processing
Whole genome paired-end Illumina sequence data for 71 A. fumigatus isolates [9,[23][24][25] was downloaded from the NCBI Sequence Read Archive [26] using the SRA toolkit (S1 Table). We implemented a similar data processing pipeline described previously [21]. Briefly, identical paired-end sequence reads were collapsed using tally, with the parameters "-with-quality" and "-pair-by-offset" [27]. Next, trim_galore (http://www.bioinformatics.babraham.ac.uk/ projects/trim_galore/) was used to remove residual adapter sequences, and to trim reads at bases where quality scores were below Q30. Trimmed reads shorter than 50 bp were removed. These read sets were then mapped to the A. fumigatus Af293 reference genome [28] using the "sensitive" pre-set parameters in bowtie2 [29]. SAM alignment files were converted into sorted BAM format using the view and sort functions in samtools [30]. The samtools depth function was then used to estimate average coverage for each of the 71 samples. To avoid the bias introduced by varying sequencing depths across samples, seqtk (https://github.com/lh3/seqtk) was used to randomly subsample reads such that each sample had a genome-wide average coverage of 10X. These deduplicated, quality and adapter trimmed, 10X coverage read sets were mapped against the A. fumigatus Af293 reference genome [28] and used in subsequent SNP and CNV analysis.

Identifying A. fumigatus populations
We performed three analyses to infer the population structure of the 71 A. fumigatus isolates. SNP sites for each sample were conservatively predicted using VarScan v2.3.9 with the parameters "-min-var-freq 1" and "-min-coverage 8" [31]. Consensus genotypes from polymorphic sites were extracted for each sample. Sites harboring more than 5% missing or ambiguous data were removed. This process resulted in 35,120 variant sites. We also subsampled polymorphic sites to minimize physical linkage between markers, resulting in 859 sites separated by an average distance of~33 Kb.
We first performed population structure analysis with the subsampled set of 859 variant sites using the Structure v2.3.4 while implementing the "admixture" ancestry model, and the "allele frequencies are correlated among populations" frequency model [32]. We ran 15 replicates with a burn-in length of 100,000, and a Markov Chain Monte Carlo (MCMC) of 200,000 generations for K = 1-15, where K indicates the number of genetic clusters or populations. ΔK, a measurement of the rate of change in the average log probability between successive K values, was calculated using Structure Harvester in order to predict the optimal number of populations [33,34]. We additionally used admixture with the subsampled set of 859 variant sties, and the full set of 35,120 variant sites to compare individual population assignments [35]. admixture was run for K = 1-15. Cross validation error was calculated for each K value with the lowest values corresponding to good predictors of K. Lastly, we constructed a Neighbor-Net phylogenetic network using the set of 859 variant sites with SplitsTree V4.14.4 with 1,000 bootstrap replicates [36].

Copy number variation analysis
We used the read depth based approach implemented in control-FREEC to estimate integer CN for each non-overlapping 500 bp window throughout the genome [37]. The following parameters were used: window = 500, telocentromeric = 0, minExpectedGC = 0.33, and max-ExpectedGC = 0.63. Heatmaps of CNV patterns were illustrated using the R package Complex-Heatmap [38].
We calculated two measurements of CNV diversity. First, we calculated the Polymorphic Index Content (PIC) across all samples, and within each population [39,40]. PIC is a useful measurement for the identification of diverse CN variable loci [19], and ranges from 0 (no CNV is present) to 1 (all alleles are unique). PIC was calculated as follows: where i 2 is the squared frequency of a to z CN values at a particular 500 bp window. PIC values in the upper 99th percentile of all samples, corresponding to values greater than 0.82, were considered significant within each population.
We calculated V ST to identify divergent CNV profiles between populations [12]. V ST is conceptually similar to F ST and varies from 0 (no difference in CN allele frequencies between populations) to 1 (completely differentiation of CN allele frequencies between populations). V ST was calculated as follows: where V total is total variance, V popx is the CN variance for each respective population, N VGIIx is the sample size for each respective population, and N total is the total sample size. We considered V ST values in the upper 99th percentile, corresponding to values greater than 0.68, as significantly differentiated between populations.

Genomic localization of copy number variable genes
To test whether CN variants were disproportionately represented in subtolemeric regions, we compared the number of observed CN variants that partially or entirely overlapped the subtelomeric regions to the number expected if CN variants were randomly distributed across the genomes. The proportion of observed vs. expected was assessed using a chi-square goodness of fit test. This analysis was conducted independently for each chromosome, and for the entire genome. Subtelomeric regions were defined as the 400 kb region preceding the telomere end. We ran nine independent tests (1 test per chromosome plus 1 test for the entire genome), and thus implemented a Bonferroni multiple test-corrected p-value cutoff of 0.0056 (p-value cutoff = 0.05 / 9 tests = 0.0056).

Gene ontology enrichment
We performed Gene Ontology (GO) enrichment across all samples, and within each population for genes in which CN variants partially overlapped, and for genes in which gene entirely overlapped. All GO enrichment analysis was performed in Fungifun2 [41] using the A. fumigatus Af293 annotation from AspGD [42].

Population structure analysis
Our aim was to analyze patterns of CNV at the species, population, and individual levels.
Thus, we first determined the evolutionary relationships and population structure of the 71 A. fumigatus isolates. Using a collection of 859 SNPs distributed across the genome, we used Structure to predict population structure (Fig 1A and 1B) [32, 43]. We calculated ΔK for each K value [33] and this analysis suggested that K = 2, K = 3 and K = 12 represent the best predictors of population number (Fig 1A). To better understand why K = 2 and K = 3 gave the strongest signal, we constructed a Neighbor-Net phylogenetic network of the 71 A. fumigatus isolates (S1 Fig). This analysis revealed that two closely related isolates (LMB-3Aa, and F15927) were highly divergent to all other isolates ( Fig 1B) [44]. When K = 3 these isolates also formed a single population. When K = 12, we find strong agreement in population assignment between structure, admixture, and the phylogenetic network (Fig 1 and S1 Fig). Population structure analysis was further investigated using admixture [35] with the subsampled set of 859 variant sites ( Fig 1C and 1D), and the full set of 35,120 variant sites (Fig 1E and 1F). The best predictor of population number was 7 and 5 when the subsampled set of 859 variant sites, and the full set of 35,120 variant sites were used, respectively (Fig 1C and 1E). Together, these analyses (structure with 859 variant sites, admixture with 859 variant sites, admixture with 35,120 variant sites, and SplitsTree with 859 variant sites) resulted in a concordant set of individuals falling into three populations with a sample size ! 8 (Fig 1 and S1 Fig). Columns correspond to different individuals (X-axis), while membership coefficient is depicted on the Y-axis. Admixture based estimate of the optimal predicted population number (K) by the cross validation error method using 859 SNPs (C) and 35,120 SNPs (E). Admixture based population assignment for the 71 A. fumigatus isolates for K = 7 with 859 SNPs (D) and K = 5 with 35,120 SNPs (F). Columns correspond to different individuals (X-axis), while membership coefficient is depicted on the Y-axis. Isolates colored with red, blue, and yellow correspond to populations 1, 2, and 3. https://doi.org/10.1371/journal.pone.0201611.g001 Population 1 consists mainly of clinical isolates from Japan [23], but also two clinical isolates from the United Kingdom, and one environmental isolate from the Netherlands [9]. Population 2 consists primarily of clinical isolates with azole resistance from the United Kingdom and the Netherlands [9]. Population 3 consists of clinical isolates with azole resistance from India [9]. The remaining samples are composed of both clinical and environmental isolates that have both azole susceptibility and azole resistance from Asia, Europe, North America, South America and the International Space Station [9,23,25,44]. Our subsequent analyses of CNVs at the species-level are conducted with all 71 isolates, while the population-level analyses consisted of the 43 isolates from populations 1, 2, and 3 (Fig 1 and S1 Table).

Characterizing copy number variation at the species-level
We generated CNV profiles for each non-overlapping 500 bp window throughout whole genome of the 71 A. fumigatus isolates using control-FREEC [37] (S1 File). To assess our computational CNV prediction pipeline, we first examined the CN of the ribosomal DNA (rDNA) encoding locus and compared these results to previous studies that used quantitative PCR and digital droplet PCR to quantify rDNA CN [45,46]. The average rDNA CN of the 71 A. fumigatus isolates ranged from 5 to 65, with an average of 30.72 and median of 30 (Fig 2). We are encouraged by our in silico results as they are within the range of experimentally determined A. fumigatus rDNA CN estimates [45,46].
On average, 7.88% of the genome (~2.31 Mb) is CN variable, with 5.94% (1.74 Mb) and 1.94% (0.57 Mb) deriving from absences and duplications, respectively (Fig 3). We choose to refer to CN of 0 as an "absence" rather than a deletion because the absence of a locus in an analyzed genome could be the result of a gain in the reference genome and not solely a deletion in the analyzed genomes. The average number of absences and gains per isolate is 89 and 76, respectively. Our collection of analyzed isolates included AF293 (Fig 1), which also served as our reference genome. We observed very few CNVs in the AF293 isolate further reinforcing the accuracy of our CNV prediction (Fig 3). We examined the number of CNVs that overlapped annotated genes across all isolates, and discovered substantial variation. In at least one of the 71 isolates, 433 and 922 genes overlapped partially and completely with absences, respectively. Because we observed a wide-range in gene gains across isolates that were likely the result of rare large segmental duplications (Figs 3 and 4), we separated the 71 isolates into two groups according to the number of gene gains. Isolates with the number of gene gain events greater than the 3 rd quartile + 1.5 Ã (interquartile range) were considered as group 2 isolates, while all other isolates were considered as group 1. Group 1 included 62 isolates that harbored fewer than 27 duplicated genes, while group 2 included the remaining 9 isolates that possessed greater 27 gene duplications (Fig 4). In the duplicated regions, we found 126 and 1,310 entire gene gains in group 1 and group 2, respectively, with 75 genes shared between the groups. When analyzing patterns of gene CNV that were present in at least 2 isolates, we found that more gene absences are shared between isolates (60.2% of absent genes) than gene duplications (26.89% of duplicated genes) (Fisher's exact test; p-value = 2.2e-16).
Among 71 isolates, the size of duplication events ranged from 500 bp to 537 kb with an average of 7.5 Kb and a median of 1.5 Kb. As noted, 9 of the A. fumigatus isolates (group 2) contained relatively large segmental duplications (Figs 3 and 4). The cumulative size of duplications was significantly greater in group 2 isolates (760.6 Kb) compared to group 1 isolates (344.5 Kb) (Students T-test; p-value = 0.036). Among the 9 group 2 isolates, we observed 43 segmental duplications events larger than 40 kb, covering between 11 and 474 genes. For example, isolate Afum IFM 62115 [23] contained two independent duplications larger than 160 kb that entirely overlapped the fumisoquin and fumagillin encoding secondary metabolic gene clusters (Fig 5) [47].
We also investigated the diversity of CNV loci using the Polymorphic Index Content (PIC) measurement. PIC has been used to estimate diversity of microsatellites, restriction fragment length polymorphisms, and CNVs [19,49]. We calculated PIC for each non-overlapping 500 bp region of the genome. The average PIC value for regions of the genome with CNV in at least 1 isolate is 0.04, suggesting most regions of the genome lack CNV diversity. We identified 613 windows in the top 1% of PIC values (! 0.82) (Fig 6), representing 25 distinct loci. Four and one of these high diversity CNV regions partially and completely overlapped genes, respectively. Three of the four genes mapped to the rDNA locus, while the remaining gene (Afu8g00342) plays a predicted role in the Pseurotin A secondary metabolite encoding gene cluster [50].

Divergent copy number profiles between A. fumigatus populations
Population processes, including natural selection, can shape CN patterns between populations [19,51]. To identify loci differentiated by CN between populations, we calculated V ST at each 500 bp window between populations 1, 2, and 3. V ST is conceptually derived from F ST [52,53], and ranges from 0 to 1, with a value of 0 representing complete allelic sharing between populations, and a value of 1 representing fixed allelic differences between populations. The average V ST value for regions of the genome with CNV in at least 1 isolate is 0.022, suggesting the majority of the genome is not differentiated by population. We considered the upper 99% of V ST values as significant, representing a cutoff of V ST = 0.68. In total, we identified 545 divergent CN variable windows, comprising 72 distinct non-overlapping loci. These high V ST regions contained 33 genes, 19 and 14 of which were completely and partially overlapped by CN variants, respectively (Fig 7 and Table 3). Several proteins encoded by these high V ST genes localize to or interact with the cell membrane including transmembrane transporters (Afu3g02520, Afu4g00830, Afu5g12720, and Afu6g14640), and kinases (Afu3g02460, Afu3g02500, and Afu8g06140) ( Table 3). Other genes present in the high V ST regions were associated with putative pathogenicity functions, such as oxidation reduction (Afu3g00100, and Afu3g00110) and hydrolase activity (Afu4g01070) [1]. In addition, we identified a high V ST locus containing 7 genes that are part of a highly variable secondary metabolism gene cluster [24]. This locus contains at least 6 distinct and unrelated secondary metabolism gene cluster "alleles" [54].

Low levels of polymorphic copy number variation within A. fumigatus populations
We characterized patterns of CNV within populations by independently calculating PIC for each 500 bp window in populations 1, 2, and 3 (Fig 8). The average PIC values were 0.050, 0.036, and 0.016 for populations 1, 2, and 3, respectively. Population 3 harbors the lowest levels of intrapopulation CNV, which is in agreement with a previous report of low genetic variation [9]. We considered PIC values in the upper 99 th percentile of all populations as displaying significant levels of CNV within individual populations (PIC > 0.82). We identified 4 (278.5 kb), 5 (283.5 kb), and 0 significantly divergent CN variable loci within populations 1, 2, and 3. Three and four genes overlap significant PIC regions in populations 1, and 2 respectively, including three rDNA encoding genes. An additional gene (Afu00342) was identified in population 2, and neighbors the Pseurotin A encoding cluster [50,55].

Discussion
In this study, we identified and characterized CN variants on a genetically and geographically diverse collection of 71 A. fumigatus isolates. Our results reveal that, on average, 7.88% of the genome is CNV, among which absence and duplications account for 75.38% and 24.64%, respectively. Interestingly, the distribution of CN absences displayed a strong subtelomeric bias (Fig 3). This is consistent with previous research suggesting that the A. fumigatus subtelomeric regions are hypervariable [24,48]. The subtelomeric bias of CNVs is also in line with results observed in S. cerevisiae [56,57]. Enrichment analysis of genes overlapping absence revealed an interconnected set of GO terms associated with transposable elements (Table 1). Though the A. fumigatus genome is relatively compact,~3% of the genome is composed of Copia, Gypsy, I (LINE), and Mariner family transposable elements [58]. Transposable element CN varies at the population level in Aspergillus species [24, [59][60][61]. In rare cases, transposable element activity can lead to adaptive gene duplication, as in the Tc1/mariner induced duplication of alpha-amylase in A. oryzae https://doi.org/10.1371/journal.pone.0201611.g006 [62]. The presence of transposable elements may promote CNV through retrotranspotion or nonallelic homologous recombination [63] and could account for some of the gene CNV between isolates.
The A. fumigatus genome contains 36 putative secondary metabolic gene clusters that encode a diverse set of compounds functioning in defense and communication [1,47]. For example, the secondary metabolites gliotoxin, restrictocin, and fumagillin can induce host cell apoptosis, inhibit neutrophil-mediated hyphal damage, and cause epithelial cell damage and slowed ciliary beating, respectively [1]. We identified partial or entire gene absence in 16 secondary metabolic gene clusters in at least one isolate. These gene content polymorphisms could potentially affect the structure, expression, or transport of secondary metabolites [22]. For example, Afu3g02570 encodes a nonribosomal peptide synthetases that acts as the backbone enzyme in a 21 gene secondary metabolic cluster [28]. This gene is absent in 59% of the isolates analyzed in this study. Additionally, we observed gene absence of the secondary  metabolic cluster backbone enzymes Afu1g01010 and Afu2g17690 [55]. Conversely, one isolate (Afum IFM 62115) contained duplications of two entire secondary metabolic gene clusters, including a 22 gene cluster on chromosome 6 overlapping the fumisoquin cluster (Afu6g03340 -Afu6g03610), and the fumagillin cluster on chromosome 8 (Afu8g00370-Afu8g00570) ( Fig  5) [47]. Taken together, these results are consistent with previous reports of extensive genetic variation in A. fumigatus secondary metabolic clusters [22], and suggests that gene CNV in these regions could contribute to individual variation in secondary metabolite production, and pathogenicity. Population differences in gene CN can be the result of natural selection favoring polymorphisms that are advantageous to a particular environment [19][20][21][64][65][66][67]. Consistent with other species, the vast majority of CN variable loci were not stratified by population [21, 51,68]. However, we identified 33 genes with highly differentiated CN profiles between populations 1, 2, and 3 (Fig 6 and Table 3). Many of these genes encode proteins that localized to or interact with the cell surface such as transporters, kinases, and hydrolases. For instance, fhk3 (Afu8g06140) encodes a histidine kinase and is predominantly present at a CN of 1 in population 1, while entirely absent in populations 2 and 3 ( Fig 6B). A knockout mutant of fhk3 was not phenotypically distinct from the wild type strain, although only a limited number of environments and phenotypes were evaluated [69]. Afu4g01070 is another gene overlapping a high V ST region of the genome and encodes an acid phosphatase (Fig 7). Phosphate acquisition and storage is essential for the biosynthesis of nucleic acids, sugars, proteins, and lipids. In fungal pathogens, phosphate acquisition can also mediate resistance to alkaline pH, cation, oxidative, and nitrosative stress [70]. Populations 1 and 2 have a CN of 1, while seven of the eight isolates in population 3 contain an absence that accounts for nearly half of the gene length. Previous molecular characterization of PHO80 suggest that Afu4g01070 is likely involved in the acquisition of inorganic phosphate [71]. Consistent with studies in Cyptococcus gattii and S. cerevisiae [19,21], our results suggest that CNV genes often localize to the cell surface and could be the result of sensing and responding to population-specific environmental factors.
Alvoelar macrophages kill swollen conidia with reactive oxygen species (ROS) [1]. To survive this immune response, A. fumigatus has evolved several strategies to counteract ROS, including the ability to produce of a variety of oxidation-reduction enzymes. We identified two adjacent genes with predicted functions in oxidation-reduction (Afu3g00100 and Afu3g00110) that displayed divergent CN patterns (Fig 7 and Table 3). Both genes were predominantly present at a CN of 1 in population 1 and entirely absent in populations 2 and 3. Lastly, we observed a CN divergent region that overlaps Afu6g04490. This regulatory gene is involved in sporulation and asexual development (Table 3). This gene was present at a CN of 1 in population 1, and ranged between 1 and 6 in populations 2 and 3. The Aspergillus nidulans ortholog (osaA) functions as a transcription factor that regulates sexual and asexual development [72]. Multiple copies of osaA in the A. nidulans genome leads to proliferation of vegetative cells while deletion of osaA results in heightened sexual fruiting and reduces asexual development [72]. In Fusarium oxysporum the otholog Sge1 is also confirmed as a transcription factor and is essential for pathogenicity in tomato [73].

S1 Fig. Phylogenetic network generated in Splitstree4 of the 71 A. fumigatus isolates [35].
The scale bar represents the proportion of nucleotide sites at which two sequences that were being compared were different. Isolates colored with red, blue, and yellow correspond to structure populations 1, 2, and 3. (PDF) S1