Comparative genome analysis indicates high evolutionary potential of pathogenicity genes in Colletotrichum tanaceti

Colletotrichum tanaceti is an emerging foliar fungal pathogen of commercially grown pyrethrum (Tanacetum cinerariifolium). Despite being reported consistently from field surveys in Australia, the molecular basis of pathogenicity of C. tanaceti on pyrethrum is unknown. Herein, the genome of C. tanaceti (isolate BRIP57314) was assembled de novo and annotated using transcriptomic evidence. The inferred putative pathogenicity gene suite of C. tanaceti comprised a large array of genes encoding secreted effectors, proteases, CAZymes and secondary metabolites. Comparative analysis of its putative pathogenicity gene profiles with those of closely related species suggested that C. tanaceti likely has additional hosts to pyrethrum. The genome of C. tanaceti had a high repeat content and repetitive elements were located significantly closer to genes inferred to influence pathogenicity than other genes. These repeats are likely to have accelerated mutational and transposition rates in the genome, resulting in a rapid evolution of certain CAZyme families in this species. The C. tanaceti genome showed strong signals of Repeat Induced Point (RIP) mutation which likely caused its bipartite nature consisting of distinct gene-sparse, repeat and A-T rich regions. Pathogenicity genes within these RIP affected regions were likely to have a higher evolutionary rate than the rest of the genome. This “two-speed” genome phenomenon in certain Colletotrichum spp. was hypothesized to have caused the clustering of species based on the pathogenicity genes, to deviate from taxonomic relationships. The large repertoire of pathogenicity factors that potentially evolve rapidly due to the plasticity of the genome, indicated that C. tanaceti has a high evolutionary potential. Therefore, C. tanaceti poses a high-risk to the pyrethrum industry. Knowledge of the evolution and diversity of the putative pathogenicity genes will facilitate future research in disease management of C. tanaceti and other Colletotrichum spp.

Introduction genomic architecture and their relationship to pathogenicity genes, in order to assess the plasticity and thereby the evolutionary potential.
Comparative genomics has enabled inference of patterns of speciation, pathogenesis and host determination within Colletotrichum lineages [42]. These studies have indicated that the gain and loss of putative pathogenicity gene families in Colletotrichum genomes are important determinants of host specificity and pathogenic adaptation of these species [7,[9][10][11]43]. Colletotrichum tanaceti has only been reported from pyrethrum in Australia but may have crossed over from another host plant species. However, cross-host pathogenicity has not yet been assessed and the potential host range of the pathogen is currently unknown. Comparison of putative pathogenicity gene repertoires of Colletotrichum species from different species complexes and species closely related to the genus Colletotrichum may provide insights into evolution of pathogenicity gene and the host range of C. tanaceti. Therefore, combined genomics and comparative genomics analyses can provide sound means of assessing the current and future risks posed by C. tanaceti.
In order to achieve the major goal of evaluating the potential risk to the pyrethrum industry form C. tanaceti, the aims of this study were to: 1) infer the pathogenicity gene suite of C. tanaceti; 2) infer the host range of C. tanaceti; and 3) assess the evolutionary potential of pathogenicity genes of C. tanaceti.

RNA sequencing
Inoculation of pyrethrum leaves. Pyrethrum leaves were inoculated using the leaf-sandwich method [58,59] by placing a fungal 'mat' between two pyrethrum leaves in a petridish. Each petri dish was sealed with parafilm and incubated at 24˚C with a 12 h-photoperiod. Induced mycelia were harvested at 6, 24 and 48 h after inoculation, and total RNA was extracted using the RNeasy Plant Mini kit (Qiagen, Australia) following the manufacturer's instructions. Total RNA was extracted from the saprobic stage (1-week-old cultures growing on potato dextrose agar). Contaminating genomic DNA was removed from RNA samples by Ambion DNase I (Thermo Fisher Scientific, USA) treatment; the integrity and quantity of total RNA was confirmed by 1% agarose gel electrophoresis and the Experion automated electrophoresis system (Biorad Laboratories, Australia).
RNA libaries were prepared using both E7530L and E&335L NEBNext Ultra RNA Library Prep Kits (New England Biolabs, USA) to generate fragment sizes of 351-371 bp. The transcriptome was paired-end sequenced (

Gene prediction
Genes were first predicted using the MAKER3 v3.0.0-beta [61], in which both the transcriptomic data from C. tanaceti and the proteomic and ab initio gene predictions from C. graminicola; [43] and C. higginsianum; [43,62] were combined into a consensus prediction. In brief, transcriptomic RNAseq reads of C. tanaceti were assembled into transcripts in both de novo and genome-guided modes of the program Trinity v2.2.0 [63]. In genome guided assembly, reads were mapped onto the genome using the program TopHat2 v2.1.0 [64]. Genome guided and de novo transcriptomic assemblies were combined, redundancy (99% similarity) was removed using the program cd-hit-est [65,66] and resulting transcripts were filtered for fulllength open reading frames (ORFs) using the program Transdecoder [63]. Resulting fulllength transcripts were further reduced to 80% similarity using the program cd-hit-est and checked for splicing sites. These high quality transcripts were then used as a training set for ab initio gene prediction programs AUGUSTUS v3.1 [67] and SNAP v6.7 [68] and GENEMARK v4.2.9 [69]. Evidence data from assembled transcriptomes (with 99% redundancy using cd-hitest) and the proteomes were provided to Maker3. The predicted genes (length of conceptually translated protein � 30 amino acids) were further clustered using the k-means clustering algorithm [70] with following metrics: 1) Maker3 annotation edit distance (AED); 2) number of exons in the mRNA; 3) length of translated protein sequence; 4) fraction of exons that overlap transcript alignment; 5) fraction of exons that overlap transcript and protein alignment; 6) fraction of splice sites confirmed by a SNAP prediction from Maker3; 7) percentage for repeat overlap with gene-, exon-and CDS-sequence; 8) size of the inferred orthologous group the gene belongs to using OrthoMclv2.0.9 [71]; and 9) presence of functional annotation (see Functional annotation of the C. tanaceti genome section below). Resulting clusters with transposons and ab initio gene predictions with no transcriptome or proteome support were removed.

Functional annotation of the C. tanaceti genome
Putative coding regions were subjected to protein homology searches against the NCBI (nr) and Swiss-Prot database using BLAST v 2.7.1 (E-value of � 1e-8) [72]. Conserved protein domains and gene ontology (GO) terms were assigned to predicted proteins using InterProScan 5 [73]. Additionally, Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthology (KO) terms were assigned to predicted proteins using the Blastkoala search engine [74]. Assigned KO terms were used to generate C. tanaceti pathway maps using KEGG mapper [75]. Putative genes of C. tanaceti with functional annotations were subjected to species-specific gene enrichment analysis on the DAVID functional annotation database tool [76,77] and using C. graminicola as the reference species.

Comparison to related taxa
The genome and proteome of C. tanaceti was compared to genomes of related taxa using genome alignment, synteny and orthology analyses as following.
Genome alignment and synteny analysis. Colletotrichum tanaceti genome contigs were aligned to 13 other publicly available genomes (Table 1) of Colletotrichum species using Nucmer in Mummer v 4.0 [78,79]. Contig-alignments were then filtered for a minimum 30% nucleotide identity and 200 bp in aligned length. The global coverage of each of the genomes from contigs of C. tanaceti was computed as a measure of pairwise sequence comparison between the two genomes.
The program 'Synteny Mapping and Analysis Program', SyMAP v 4.2 [80,81] was used to map C. tanaceti contigs of the highest sequence length (>150 kb) to the chromosomes of C. higginsianum IMI349063 reference genome [43] to identify the syntenic regions. PROmer was invoked within SyMAP. Orthology search and phylogenomics analysis. The proteomes of C. tanaceti and the publicly available 17 other species were subjected to ortholog searching using OrthoMCL v2.0.9 [71] and MCL [91] with an inflation value of 1.5. The orthoMCL output was used to determine the percent orthology among the species and to determine the core gene set for Colletotrichum. The ortho-groups with pathogenicity genes (inferred as below) of C. tanaceti were extracted and used to determine the percent conservation of those gene categories within the genus. Furthermore the single copy orthologs were extracted from the orthoMCL output and aligned using MAFFT v.7 [92]. These alignments were then trimmed using trimAl v. 1.3 [93] to remove all positions in the alignment with gaps in 20% or more of the sequences, unless this leaves less than 60% of the sequence remaining. The trimmed reads were concatenated using FASconCAT-G [94]. The concatenated alignment was partitioned and amino acid substitution models were predicted for each partition using ProtTest 3 [95] in FASconCAT-G. The partitioned, concatenated alignment was subjected to maximum likelihood phylogenetic analysis using RAxML v8.2.10 [96] to find the best tree from 20 maximum likelihood searches and using 100 bootstrap replicates. Evolutionary distance in number of substitutions per site was computed using the ape package [97] in the R statistical language framework v 3.5.1.
[98] from the maximum likelihood tree.
Estimation of divergence dates. The phylogram developed from above was utilized to estimate the divergence dates of the species considered as following. The final RAxML phylogenetic tree was used to generate an ultrametric tree in r8s v1. 81

Prediction of secretome and database searches for identifying other virulence factors
Predicted proteins of C. tanaceti were used in downstream prediction of the secretome [103]. A union of three software tools: SignalPv4.1 [104], Phobius [105] and WoLFPSORT [106] was used to predict the candidate proteins to be used downstream of the pipeline. Proteins with either signal peptides predicted using SignalP or Phobious or proteins predicted as 'extracellular' in WoLFPSORT were retained as candidates for secreted proteins. Proteins with transmembrane domains were identified using TMHMM v.2.0 [107] and were excluded as secreted proteins. Proteins with signals targeting the endoplasmic reticulum and GPI anchors were identified and excluded using Ps-SCAN [108] and Pred-GPI [109] respectively. NLStradamus [110] was used to identify proteins with nuclear localization signals. Curated secretome was subjected to homology search against the CDD database to identify the conserved domains (Evalue � 1e-10). The candidate secreted effector proteins were identified by passing the secretome through the program EffectorP v1.0 [21]. Predicted effector candidates were manually inspected and candidates with known plant cell wall degrading catalytic domains, such as cutinases (PF01083.21), short-chain dehydrogenases (PF00106.24), glycosyl hydrolases (PF00457), peptidases (PF04117.11) and lipases (PF13472.5) were excluded. Candidates with no detectable conserved domains and no homology (E-value � 1e-3) to any other proteins in NCBI-nonredundant protein sequence database were defined as species-specific. Putative secreted peptidases and inhibitors were predicted by stand-alone blastp (E-value � 1e-10) homology searches of the MEROPS-MPEP database (consisting only the sequences of peptidase and inhibitor units) of MEROPS release 12.0 [111]. Furthermore, potential virulence factors of C. tanaceti were identified by blastp searches (E-value � 1e-10) against PHI-base v 4.4 [28]. The online analysis tools, Antibiotics and Secondary Metabolite Analysis Shell (antiSMASHV. 4) [30] with default parameters and SMURF [112] were used to predict potential secondary metabolite backbone genes and clusters using the default parameters. Cytochrome P450s and transporters were described based on blastp (E-value � 1e-10) homology searches against the Fungal Cytochrome P450 database [113] and the Transported Classification Database [114]. The functional annotations for C. tanaceti were compared across 17 other closely related taxa (Table1). The family specific Hidden Markov Model profiles of dbCan database v6 [115] were employed using the program HMMScan in HMMER v31.b2 [116] in order to identify the carbohydrate active enzymes (CAZymes) and the CAZyme families in the proteome of C. tanaceti. Fungi-optimal cut-off E-value of 1e-17 and a coverage cut-off of 0.45 [115] were used in the analysis which was repeated for seventeen related species ( Table 1). The identified CAZymes were run though InterProScan 5 [73] (E-value � 1e-10) to check for false positives. The member counts of each CAZyme family for each taxon were corrected accordingly.

Evolution of CAZyme gene families
CAFE v4.0 [117,118] was used to estimate the number of CAZyme gene family expansions, contractions and the number of rapidly evolving gene families upon divergence of different lineages. Error-models [118] were estimated to account for the genome assembly errors and were incorporated into computations. A universal lambda value (maximum likelihood value of the birth-death parameter) was assumed and gene families with significant size variance were identified using a probability value cut-off of 0.01. The branches responsible for significant evolution, were further identified using the Viterbi algorithm [117] with a probability value cutoff of 0.05. Sizes of plant pathogenicity-related gene families from CAZomes of each of the species; the 'CAZyme pathogenicity profiles' were retrieved and compared using the online tool ClustVis [119]. The 'CAZyme pathogenicity profile' of a particular species included the gene families that have activities in binding to or degradation of plant cell wall components such as cellulose, hemicelluloses, lignin, pectin, cutin and chitin.

Testing for the bipartite nature of the C. tanaceti genome
The GC-bias of the genome was detected using OcculterCut version 1.1 with default settings [37]. The genome wide dinucleotide frequency and the two RIP indices, TpA/ApT and (CpA + TpG)/(ApC + GpT) were computed and the RIP affected genomic regions were identified using the RIPCAL V.2 [120]. Significant enrichment of A-T rich regions of the C. tanaceti genome with interspersed repeats and RIP were tested using permutation tests implemented in the package regioneR in the R statistical language framework v3.5.1 [98] with the evaluation function for number of overlaps [121]. Ten thousand random iterations were conducted, from which a Z-statistic estimate, and its associated probability, were computed. RIP affected region was compared against a distribution of distances of random samples from the whole genome. Ten thousand random iterations were conducted, from which a Z-statistic estimate, and its associated probability, were computed for each gene category.

Colletotrichum tanaceti genome and gene content
The genome of isolate BRIP57314 was assembled into 5,242 contigs with an N50 value of 103,135 bp and assembly size of 57.91Mb. The average GC content was 49.3% ( Table 2). The genome size and GC content of C. tanaceti was within the range previously reported for other Colletotrichum spp. (S1 Fig). Draft genome assembly and the raw unassembled sequences are available under the accession no PJEX00000000 in Genbank. The genome contained 12,172 coding genes with an average gene length of 2,575bp. Mean exon count per gene was 3, and 54.1% of the genome sequence contained protein-encoding genes. In the BUSCO analysis, out of the 3,725 benchmarking genes in the Sordariomycetes group, the genome was reported to contain 3,656 complete BUSCOs (98.2%), of which two were duplicated and the rest were single copy genes (98.1%). A total of 30 (0.8%) BUSCOs were fragmented and 39 were missing (1.0%). The repeat content of C. tanaceti was 24.6% of the total genome of which 85.2% was interspersed repeats (Table 3).
Of the 12,172 predicted proteins, 11,352 had an annotation edit distance (AED) value of less than 1.0, and 2962 genes had an AED value of zero. The number of genes without putative annotation from the public database searches was only 958. A total of 8,945 proteins (73.5% of proteome) had InterProScan annotations of which 6,911 contained 9,647 Pfam domain annotations and 5,452 had GO term ontology annotation. The most abundant (n = 129) Pfam domain was the cytochrome P450 family (PF00067) followed by the protein kinase domain (n = 127; PF00069). Gene enrichment analysis suggested enrichment of many GO terms including those associated with translation and chromosome telomeric region (S1 Table).
Putative proteins of C. tanaceti were subjected to KEGG pathway analysis which returned assignment of 5,883 proteins to known pathways (S2 Table). The highest number of KO identifiers was among the metabolic pathway assignments (n = 693) of which the majority (n = 363) were for amino acid metabolism followed by carbohydrate metabolism (n = 290) (S3 Table). Among the environmental information processing pathways, 81 C. tanaceti genes were assigned into 47 KO identifiers belonging to MAPK pathway (S4 Table). Furthermore, 24 C. tanaceti proteins were annotated with 10 aflatoxin biosynthesis pathway KO assignments (S5 Table) and 56 proteins were assigned KOs for ABC transporters (S6 Table).

Genome alignment and synteny
The global alignment coverage of 13 other Colletotrichum genomes from C. tanaceti contigs was proportionate to the evolutionary proximity to C. tanaceti ( Fig 1A). The highest coverage was in C. higginsianum (63.8%) and the least was in C. orbiculare 4.26%. Among the C. tanaceti contigs aligned to the chromosomes of C. higginsianum, the best alignment coverage was to chromosome NC_030961.1 (chromosome 9) (S7 Table). Colletotrichum tanaceti contigs (n = 155 of size�10 kb) were mapped in SyMAP synteny analysis to form 142 synteny blocks which covered 44.0% of the C. higginsianum and 80.0% of the C. tanaceti sequences that were used (S2 Fig). Genes were present in 92.0% of the syntenic regions in C. tanaceti and in 77.0% of C. higginsianum. No inverted synteny blocks were reported. Despite the highest coverage in C. higginsianum chromosome 9, the largest synteny block was identified between the complete C. tanaceti contig 4 (945.01 kb of length) and C. higginsianum chromosome NC_030954 (Chromosome 1). A total of 38 effector candidates of C. tanaceti were within these syntenic regions between C. tanaceti and C. higginsianum. No synteny blocks were detected to the two mini chromosomes (NC_030963.1 and NC_030964.1) of C. higginsianum.

Orthology search
Of 221,456 total genes from 18 genomes, the number of core genes reported for all ascomycetes in the orthology analysis was 3,944. A total of 10,695 putative proteins from C. tanaceti were assigned to 10,074 groups containing orthologs and/or recent paralogs and/or co-orthologs across all species tested. A total of 6,002 genes were conserved in all tested members of the genus Colletotrichum. Colletotrichum tanaceti had 9,679 orthologs with C. higginsianum which was the highest ortholog count among Colletotrichum spp. followed by 8,855 orthologs with C. nymphaea ( Fig 1B). Twenty of these groups, with 48 genes among them were exclusive to C. tanaceti and were defined as recent paralogs (in-paralogs) of C. tanaceti with no homology to the 16 other species tested.

Divergence time in Colletotrichum lineages
A total of 2,214 single copy ortholog (SCO) genes identified among the C. tanaceti and 17 closely related genomes (Table 1) were used to generate a maximum likelihood (ML) evolutionary tree in which all branches achieved bootstrap support of 100%. Colletotrichum tanaceti formed a clade with C. higginsianum, a member of the destructivum complex and the two destructivum complex members formed a sister clade with the graminicola complex members and C. incanum. A smoothing factor value of 1 was reported as the optimal value for divergence time predictions in r8s. Colletotrichum tanaceti and C. higginsianum were reported to have diverged~9.97 million years ago (mya). The most recent common ancestor (MRCA) of gloeosporioides, acutatum, and graminicola clades were reported to be 6.12, 10.98 and 15.78 mya, respectively (Fig 2).

Identification of pathogenicity related genes in C. tanaceti
Secretome of C. tanaceti. Of the 12,172 putative proteins, 1,024 (8.41%) were predicted to be secreted. A total of 2,702 Conserved Domain Database (CDD) domains were found in the secretome. Of these, 287 were specific features with NCBI curated models, 124 were generic features with only the superfamily annotations [122]. Only 433 queries had no known domain hits. The secretome was rich in alpha beta hydrolase superfamily (cl21494) containing enzymes, glycosyl hydrolases and proteolytic enzymes and cytochrome P450 monoxygenases (P450) (S8 Table). A total of 100 secreted proteins had nuclear-localization signals (S8 Table).
A total of 233 putative effector candidates were predicted by EffectorP. Following manual inspection and filtering out candidates with known plant cell wall degrading catalytic domains, a total of 170 candidates were selected as putative effectors of C. tanaceti (S9 Table). The putative secreted candidate effector repertoire of C. tanaceti contained homologs of known effectors, such as the Ecp6 of Cladosporium fulvum [123]  Colletotrichum draft genomes to contigs representing the C. tanaceti draft genome, plotted against evolutionary distance with reference to C. tanaceti (x axis), (b) Number of orthologs shared by 13 Colletotrichum draft genomes and C. tanaceti (y axis) plotted against the evolutionary distance with reference to C. tanaceti (x axis); evolutionary distance given in number of substitutions per site, computed using the ape package [98] in R from a maximum likelihood tree. https://doi.org/10.1371/journal.pone.0212248.g001 Bas3 [128]. Furthermore, among the putative effector candidates, there were proteins with conserved domains of known virulence factors. Most effector candidates were small (average length of 157 amino acids) and rich in cysteine (average cysteine composition was 3.3%) which are the hallmarks of effectors. A total of 78 conserved motifs of fungal effectors [129] were present in 62 effector candidates which had at least one motif each. Twenty-two effector candidates that did not cluster in ortholog search among the 14 Colletotrichum and three related species, and also did not show detectable homology to the NCBI-nr and swissprot databases were defined as C. tanaceti-specific. Only 25% of the putative effectors of C. tanaceti were conserved among all 14 Colletotrichum spp.
A total of 98 putative secreted peptidases were predicted with the majority (n = 64) being serine peptidases largely comprising the S08 and S09 subfamilies. The second most abundant class was the metallo peptidases (n = 19) (S10 Table). All six putative aspartic peptidases belonged to subfamily A01. A total of 20 putative secreted peptidase inhibitors were reported in C. tanaceti comprising two carboxypeptide-y inhibitors, five family-19 inhibitors and 13 family-14 inhibitors (S11 Table). Forty nine percent of the putative proteases of C. tanaceti were among the "core" set of proteases of Colletotrichum spp. tested.
Secondary metabolite-related genes and clusters. Forty-one putative secondary metabolite backbone genes were predicted in C. tanaceti using SMURF and the majority were polyketide synthases (PKS, n = 13) with four PKS-like proteins. Furthermore, nine non-ribosomal peptide synthases (NRPS), eight NRPS-like proteins, two hybrid PKS-NRPS enzymes and five dimethylallytryptophan synthases (DMATS) were also predicted as backbone genes (S12 Table). A total of 52% of these putative backbone genes were within the core set of genes in Colletotrichum. A total of 33 putative secondary metabolite gene-clusters were predicted surrounding the backbone genes. However, the program antiSMASH predicted a total of 50 clusters. Among the clusters, there were twelve typeІPKS, two typeІІІPKS, thirteen terpenes, eleven NRPS, four indoles, three typeІPKSs-NRPS, one typeІPKS-indole and four other proteins. Cluster 10 of typeІPKS showed 100% similarity to the genes in LL-Z1272 beta biosynthetic gene cluster (BGC0001390_cl). Furthermore, a homolog to the melanin biosynthetic gene SCD1 was also reported in C. tanaceti (CTA1_6632). When predictions from the two tools were compared, putative SMB clusters on 31 contigs of C. tanaceti were predicted by both tools and 19 of the backbone genes from SMURF were also predicted in antiSMASH (S12 Table). A total of 37 putative SM clusters were within the syntenic blocks of C. higginsianum. The conserved SM domains identified in each cluster were reported (S13 Table). Predictions from antiSMASH were compared across taxa and majority of the clusters were typeІPKS like followed by NRPS in all ascomycetes compared (Fig 3A). The highest number of clusters were reported from C. fructicola (n = 84) followed by C. higginsianum (n = 74) and C. gloeosporioides (n = 73). The composition of the SMB gene cluster composition of C. tanaceti was most similar to C. orchidophilum, the acutatum complex members and C. orbiculare (S3 Fig).
Cytochrome P450 monoxygenases (P450s) and transporters. In the C. tanaceti genome, 1,457 putative genes had homologs in the fungal cytochrome P450 database (S14 Table) and 911 out of that had >30% identity. There were 1,824 homologs (S15 Table) in the transport classification database for C. tanaceti with 1,276 genes with >30% identity. The majority (n = 430) of the homologs were genes of the major facilitator superfamily (MFS, 2.A.1) followed by 129 genes of the ABC transporter family (3.A.1) and 123 of N.P.C 1.I.1. Within Colletotrichum genus, members of the gloeosporioides complex had the highest number of homologs for both P450s and transporters (Fig 3B).
Homologs in PHI-base. A total of 3,497 homologs were recorded in C. tanaceti from the pathogen-host interaction database (PHI), of which 1,592 represented homologs of genes that result in reduced virulence in loss of function mutants (S16 Table). The second most common (n = 1,514) were the unaffected pathogenicity category, 382 homologs were for loss of pathogenicity and 42 were in the effector category. Notably, 141 homologs were reported to genes of which the loss of function mutants were lethal to the pathogen and 103 homologs were reported to genes in which virulence increased after loss of function mutation (Fig 3C). The two gloeosporioides complex members had the highest number of homologs in the database among the Colletotrichum spp., followed by the acutatum complex species, C. simmondsii, C. fioriniae and C. nymphaea. Despite C. higginsianum having a large number of homologs, C. tanaceti had a below average number for all the categories among the Colletotrichum spp., with a profile similar to C. orchidophilum, C. chlorophyti and C. graminicola (S4 Fig). CAZymes. A total of 608 C. tanaceti proteins were assigned to 121 CAZyme families of which 43% was putative glycosyl hydrolases followed by 18% of putative redox enzymes (auxiliary activities) and 14% putative carbohydrate esterases (S17 Table). Putative carbohydrate binding molecules and polysaccharide lyases both formed 7% each of the C. tanaceti CAZome whereas 11% was glycosyltransferases. A total of 179 CAzymes were secreted (S17 Table). Members of the gloeosporioides and acutatum complexes had the largest CAZomes among Colletotrichum spp. The CAZyme repertoires of the graminicola complex members were relatively small (Fig 3D).

Evolution of CAZyme families upon divergence of Colletotrichum lineages
A total of 152 CAZyme families, predicted at the node of MRCA for S. macrospora and B. cinerea, were used in gene family evolution analyses in CAFÉ. A uniform birth-death parameter (λ) of 0.0023 was computed. Thirty gene families were reported to be significantly evolving (family-wide p value � 0.05), of which 21 were rapidly evolving (family-wide p� 0.01 and Viterbi p � 0.01 in any lineage) (S18 Table).
At the divergence of Colletotrichum spp., 39 expansions and 12 contractions were predicted with respect to its MRCA with Verticillium species (S19 Table). Expansions included the lignin hydrolase family AA2, pectin degrading polysaccharide lyase families (PL1, 3, 4, 9 and GH78), lignocellulose degrading families (AA3, AA9, GH131, GH5, GH6, GH7), hemicelluloses degrading families (CE1, CE4, CE5, CE12, GH3, GH16, GH30, GH43, GH51, GH67, and GH10), Lys M domain containing family CBM50 and cutinase family CE5. The cellulose degrading family GH131 was the only rapidly evolving CAZyme family (family-wide p � 0.01 and Viterbi p � 0.01) which expanded upon the divergence of Colletotrichum spp. Within the genus, the highest number of expansions (n = 38) was reported at the divergence of the gloeosporioides-complex clade with only 4 contractions. Notably, the CBM18 and GH10 families were contracted and many families with plant cell wall degrading enzyme activity were expanded. The rapidly and significantly expanded families, (family-wide p � 0.01 and Viterbi p� 0.01) upon the divergence of the gloeosporioides-complex clade include GH43, GH106, CBM50 and AA7. At the divergence of the acutatum-complex clade, there were 22 expansions, of which expansions in GH78, GH43 families were rapid and significant and there was only one contraction. The divergence event of the graminicola-complex clade involved contractions in many CAZyme families with pectin degradation activity showing significant, rapid contractions (family-wide p � 0.01 and Viterbi p � 0.01)in families AA7, CBM50, CE8, GH28, GH78, PL1, and PL3. Divergence of the destructivum complex-clade was associated with 11 expansions and 21 contractions, of which expansion in AA7, GH74 and CE10 was significant and rapid.
Among the other species considered, Fusarium oxysporum had the highest number of genes (n = 344) that were gained, with 75 expanded CAZyme with respect to its MRCA (S20 Table). Colletotrichum incanum had the second highest number of gene family expansions (n = 35) and genes gained (n = 69) followed by C. higginsianum (31 and 68 respectively). Forty CAZyme families contracted and only nine expanded in C. tanaceti with respect to the MRCA with C. higginsianum. The AA2 family with lignin peroxidase activity and the hemicellulose degrading GH12, GH74 families were among the expanded families, but many families with pathogenicity and plant cell wall degrading activity had contracted in C. tanaceti. However, the highest number of significant, rapidly evolving gene families was reported from C. tanaceti (n = 9) followed by F. oxysporum and C. higginsianum, both which had seven rapidly evolving gene families each. In C. tanaceti, rapidly evolving CAZyme families included AA9, GH131 with lignocellulose degrading activity, chitin binding molecule families CBM18 and CBM50, GH18 with chitinase activity, GH3 and GH74 with hemicelluloses degrading activity, GH78 with pectinase activity and GT1 with glucuronosyltransferase activity. However, CBM18 and GH74 were the only families that expanded among those above with the rest contracting in C. tanaceti with respective to their MRCA. Gloeosporioides complex species had the largest 'CAZyme pathogenicity profiles' among all Colletotrichum species considered. The CAZyme pathogenicity profile of C. tanaceti was most similar to those of Colletotrichum species known to have an intermediate host range, infecting many hosts within a single plant family or few hosts across several plant families (Fig 4). When compared the overall pathogenicity gene profiles of all Colletotrichum spp., which included the numbers of the SMB clusters, transporters, P450s, CAZymes and the homologs to the PHI database, the profile of C. tanaceti was most similar to C. orchidophilum and C. chlorophyti (Fig 5).

RIP affected regions of the C. tanaceti genome
The RIP indices computed for the genome of C. tanaceti using the dinucleotide frequencies (S21 Table) indicated strong RIP signals in the genome. The TpA/ApT index of C. tanaceti (1.2) was higher than the cutoff 0.89 and the (CpA + TpG)/(ApC + GpT) index (0.96) was lower than the cutoff 1.03 indicating a strong RIP signal. Homologs to two genes involved in RIP of Neurospora crassa RID [130] and Dim-2 [131] were identified from the genome of C. tanaceti (CTA1_356s and CTA1_4791s respectively).

Bipartite nature of C. tanaceti genome
Distinct A-T rich regions and G-C equilibrated regions were identified in the genome of C. tanaceti (Fig 6). A total of 24.3% of the genome which had an average length of 3.77 kb was rich in A-T and had a maximum G-C of 29%. The Z-score and the P value of the permutation tests for the random association of transposable elements with the A-T rich regions were 799.354 and �0.001 respectively. The Z-score and the P value of the permutation tests for the enrichment of RIP with the A-T rich regions were 165.001 and �0.001 respectively. For the A-T rich regions of the genome, the TpA/ApT index was 1.86 and (CpA + TpG)/(ApC + GpT) index was 0.32 indicating a strong RIP signal. A total of 85 genes were reported in these regions which had a gene density of 6.04 genes per Mb but the majority (68.25%) of these genes was hypothetical. Two secondary metabolite biosynthetic genes, 3 CAZymes, 2 cytochrome P450s, 2 lipases, 4 transporters, one transcription factor and one DNA polymerase were putative pathogenicity genes among the genes in the A-T rich regions (S22 Table). The

Relationship of putative pathogenicity genes with repeat elements and RIP
The permutation tests confirmed that genes in all the tested pathogenicity-related gene categories are located significantly closer to tandem repeats than expected in a random sample ( Table 4). The negative Z-scores confirmed the mean distance between those genes and the nearest repetitive element was less than mean of a random sample of the genome. Furthermore, all gene categories except the CAZymes were located significantly closer to the interspersed repeats. However, the expanded and the contracted subgroups of the total CAZome were significantly associated with interspersed repeats (Table 4). All pathogenicity gene categories except contracted CAZymes and effectors were also located closer to the RIP affected regions of the genome than expected. The numbers of CAZymes, secondary metabolite biosynthetic gene clusters (SMB), homologs in the transporter classification database (transporters), homologs in the fungal cytochrome P450 database (P450) and the number of homologs in the PHI database, excluding the homologs to entries in the "unaffected pathogenicity" database were used in the analysis. Hierarchical clustering was performed using Euclidean distance and Ward linkage methods. The number of genes in each pathogenicity gene category is normalized using unit variance scaling. Overrepresented and underrepresented pathogenicity gene categories are represented in red to orange and blue respectively as fold standard deviations above and below the mean. https://doi.org/10.1371/journal.pone.0212248.g005 High evolutionary potential of pathogenicity genes in Colletotrichum tanaceti

Genome and the repeat content of Colletotrichum tanaceti
This study reports the first draft genome sequence and annotations of the emerging plant pathogen, C. tanaceti. The high N50 value and BUSCO completeness indicates the high quality of the assembly and AED scores of less than one for the majority of predicted genes (93.3%) suggested that these genes had at least partial congruence with the transcriptomic evidence [132]. These good quality gene predictions and annotations will provide a solid foundation for downstream genetic, population genomic and evolutionary studies.  The genome of C. tanaceti had a larger repeat content (25%) than the typical 3-10% in fungi [133]. Simple sequence repeats comprised 3.03% of the genome of C. tanaceti which itself was unusually high for fungi (generally 0.08-0.67%) [134]. However, the majority of repeats were interspersed transposable elements (TE) (21%). TE content of C. tanaceti was higher than in six previously studied Colletotrichum species, including C. higginsianum which is in the same species complex, but lower than in C. orbiculare (44.8%). The majority of TE were retrotransposons, similar to other Colletotrichum spp. [41]. Proliferation of repetitive elements especially transposons, is known to be a major mechanism driving expansion of eukaryote genomes [135,136]. Furthermore, TE activity favors chromosomal rearrangements, gene deletions, gene duplications and greater sequence diversity and is a mechanism of genome plasticity [35].
Colletotrichum tanaceti's genome consists of distinct A-T rich, gene sparse regions. These regions are also enriched in putative TE and RIP. Strong genome-wide RIP signals were observed on C. tanaceti. RIP is a method of controlling TE proliferation in fungi and facilitates genome plasticity, via high mutational rates and inactivating genes [35]. Homologs of two genes involved in RIP were present in C. tanaceti similar to C. higginsianum [15]. Therefore, TE proliferation in C. tanaceti may have caused accumulation of RIP as a control mechanism. These RIP mutations may have caused the bipartite nature of the C. tanaceti genome giving rise to A-T rich blocks [38-41] similar to the observations in C. orbiculare and C. graminicola [6,41]. Similar to C. orbiculare and C. graminicola [6,41], C. tanaceti has a known sexual stage [17] which could have activated the RIP in the genome [137]. Although genome-wide RIP was not prominent, the TEs in the genomes of C. fructicola [42], C. higginsianum [15], C. truncatum [41] and the C. cereale [138] has shown signals of RIP.

Pathogenicity genes of C. tanaceti
A large array of putative genes related to pathogenicity was inferred from the sequenced genome of C. tanaceti. Apart from many plant cell wall-degrading enzymes, effectors, P450s and the proteolytic enzymes, there were proteins with CFEM domain (pfam05730) [139] with roles in conidial production and stress tolerance [140] among the secreted proteins. The average cysteine composition, length and proportion of specificity of the candidate secreted effectors of C. tanaceti were similar to those hemibiotrophic pathogens [141]. However, a minority of effector candidates was neither small (<300bp) nor rich in cysteine (>3%), similar to previous reports of atypical effectors [142]. Effector candidates with a nuclear localization signal might translocate to the host nucleus and reprogram the transcription of genes related to host immune responses. Homologs to known effectors, and effectors with conserved domains of virulence factors may have similar functions in C. tanaceti, for example, in penetration peg formation (cyclophilin) [143], phytotoxity induction (cerato-platanin) [144] and adherence of the fungal structures to other organisms (hydrophobin) [145].
Most secreted proteases of C. tanaceti were serine proteases predicted to evade plant immune responses by degrading plant chitinases [22]. Subtilisins (S08) were the most abundant of these in C. tanaceti, similar to reports in other fungi [22]. Subtilisins, with their alkaline optima, and the proteases in other subfamilies with acidic optima, such as A01, C13, G01, M20 and S10 [146], might enable C. tanaceti to degrade plant proteins across a wide pH range. Also, the protease inhibitors of C. tanaceti might have effector-like roles via inhibition of plant defense proteases [147].
The SMB gene clusters and the candidate proteins of MAPKs pathways identified in the genome of C. tanaceti are also believed to play an important role in pathogenesis. The majority of the secondary metabolite clusters of C. tanaceti were typeІ PKs-like which are usually associated with synthesizing fungal toxins [148]. Melanin, another important secondary metabolite aids penetration via increasing turgor pressure [149]. Even though the gene cluster associated with melanin biosynthesis was not identified, the homolog of the melanin biosynthetic gene SCD1 encoding Scytalone dehydratase [150] in C. tanaceti is worth investigating further since SCD1 has been successfully used as a target for fungicides to control other pathogens [151]. Apart from their function in SM biosynthesis, the candidate P450s of C. tanaceti could be involved in housekeeping roles and therefore, could be good targets for fungicide development, as in the case of azoles targeting CYP51 [152]. Furthermore, the candidate proteins of MAPKs pathway in C. tanaceti could play a crucial role in appressorium formation [25,153], penetration [154], conidiation [155] and pathogenesis-related morphogenesis [156], as reported for C. higginsianum and C. lagenaria.
Of the CAZyme families identified to be expanded in C. tanaceti, the chitin binding family CBM18 could play a role in protecting the C. tanaceti cell wall from exogenous chitinases, as is the case in Trichoderma reesei [157]. The expansion of the hemicellulose-degrading GH74 family could promote rapid degradation of host tissues by C. tanaceti during the necrotrophic phase. The expansion of the lignin-degrading AA2 family in C. tanaceti has the potential to assist infection of xylem vessels and thereby aid translocation of propagules to different parts of the plant and establishing secondary infections.
The conserved nature of certain pathogenicity genes, such as the secondary metabolite clusters within the destructivum complex, was evident with their presence within the syntenic blocks with C. higginsianum. However, only a minority of the effectors, proteases and SM backbone genes of C. tanaceti were among the core gene set for Colletotrichum spp. tested, therefore emphasizing their role in adaptation to new hosts. The species-specific effectors, singletons from the orthology analysis and the genes exclusive to C. tanaceti might have been horizontally transferred or be related to the host affiliation and niche specialization of C. tanaceti. Taken together, this inferred pathogenicity gene suite of C. tanaceti could be targeted in future resistance breeding and other disease management strategies for C. tanaceti.

Host range of Colletotrichum tanaceti
The proposed pathogenicity gene repertoire of C. tanaceti was most similar to that of pathogens with intermediate host ranges. The number of pathogenicity genes inferred from C. tanaceti was either similar to or less than the average for all Colletotrichum spp. investigated, but the overall composition was similar to Colletotrichum spp. which either were able to infect many species within a plant family or few species across families.
The putative pathogenicity profile of C. tanaceti was very distinct from that of the other destructivum complex member, C. higginsianum, despite the two species sharing the highest number of orthologs and having the shortest evolutionary distance. Contractions in many pathogenicity gene families in C. tanaceti compared to C. higgginsianum indicated more restricted pathogenicity in C. tanaceti. The most similar CAZyme pathogenicity profile to that of C. tanaceti was from C. chlorophyti which has been reported to infect herbaceous hosts such as tomato (plant family Solanaceae) and soybean (plant family Fabaceae) [82]. The similarity to C. chlorophyti was consistent for other gene categories such as the P450s, transporters and the overall pathogenicity profile. A homolog to the demethylase (PDA), which provides tolerance to the phytoalexin pisatin synthesised by Pisum sativum [158], was predicted in C. tanaceti (CTA1_6324s) which could be an indicator of the ability of C. tanaceti to infect Fabaceae. The composition of the SMB cluster was however, more similar to C. orchidophilum, another pathogen reported to infect the herbaceous, monocot plant family of Orchidaceae [159]. The similarity of the putative pathogenicity profile of C. tanaceti to two pathogens infecting multiple herbaceous plant species was notable as the only known host of C. tanaceti is also herbaceous. Both C. chlorophyti and C. orchidophilum have been reported from multiple host species. Therefore, the putative pathogenicity gene suite of C. tanaceti suggests that C. tanaceti has the genetic ability to infect more hosts than currently recognized. If C. tanaceti can infect other hosts, such crops could also provide an external gene pool of inoculum for infection of pyrethrum crops increasing the evolutionary potential of the pathogen populations. Based on results of comparative analysis of pathogenicity profiles, a further hypothesis is that these alternative hosts are likely to be herbaceous plants. Future studies investigating the cross-host infectivity and pathogenicity of C. tanaceti are recommended.

Evolution of pathogenicity genes
Pathogenicity genes of C. tanaceti appear to be capable of evolving relatively rapidly. Tandem repeats such as simple sequence repeats have high mutation rates [160] and could promote frameshift mutations in adjacent genes by slipped misalignment during replication. Therefore, the significant overlap between the tandem repeats and the pathogenicity genes suggested high potential to mutate and create different pathotypes. Transposons promote insertional mutations that can either cause disruption or modification of gene expression or generate new proteins and also are major drivers of gene duplication [161]. Transposons were in close proximity to putative pathogenicity in C. tanaceti, such as the SMB clusters, expanded and contracted CAZymes, peptidases and effectors. The significant association of TE with pathogenicity genes were previously reported in C. truncatum [41] and C. higginsianum [15]. The RIP affected regions of the genome were also in close proximity to certain putative pathogenicity genes of C. tanaceti. RIP mutations can leak into nearby flanking regions causing mutations in those genes, further diversifying the pathogenicity gene repertoire of C. tanaceti [35]. However, unlike in Leptosphaeria maculans [162] RIP was not associated with the putative effectors of C. tanaceti. Therefore, TEs could be facilitating effector diversification in this species [41]. Although gene sparse, the A-T rich regions of the C. tanaceti genome contained several (n = 18) putative pathogenicity and virulence factors and many hypothetical proteins which could be functioning as effectors facilitating adaptive evolution. Small secretory proteins were identified in the A-T rich regions of C. orbiculare [6]. The genes in these A-T and repeat rich, gene sparse regions can evolve faster than the rest of the genome according to the "two-speed genomes" hypothesis [36].
Colletotrichum tanaceti had the highest number of rapidly evolving CAZyme families among the 17 species studied which also was indicative of the high evolutionary potential in these pathogenicity genes. Interspersed repeats were not in close proximity to the total CAZome. They were however, located significantly closer to the expanded or contracted families indicating that interspersed repeats were a major contributor to CAZyme family expansions/contractions in C. tanaceti by causing gene duplication (in expansions) or gene disruptions (in contractions) [135,163]. Diversification of pathogenicity and virulence genes through repeats and RIP mutation in C. tanaceti result a high evolutionary potential for pathogenicity genes of this pathogen. The high evolutionary potential of pathogenicity genes may cause rapid evolution of resistance to host immune responses in existing hosts or even adaptation to new host species in C. tanaceti.

Genus Colletotrichum
Phylogenetic relationship throughout the genus was consistent with previous observations, with gloeosporioides complex members and C. orbiculare forming a clade separately from the destructivum, graminicola and acutatum clades [9][10][11]. One notable difference was in the divergence time estimates for the divergence of Colletotrichum species complexes which were more ancient than reported by Liang et al [11], despite using the same calibration times. This could have been due to this study using cross-validation across 50 smoothing factors in CAFÉ as opposed to using 12 different constraints and smoothing factor combinations differences, as well as the use of the use of the different data sets.
Comparative genomic analyses emphasized the rapid evolutionary rate and the high diversity within the genus. The short time for speciation within the acutatum complex, and the fourteen Colletotrichum species in general, was suggestive of the high evolutionary rate within the genus with respective to the typical evolutionary rate of the fungal kingdom (0.0085 species units per Myr) [164]. The sequence similarity between C. tanaceti and other species of Colletotrichum varied widely and dropped drastically with evolutionary distance, suggesting high diversity within the genus. However, the drop in orthology was less dramatic, emphasizing the contribution of non-coding regions in generating diversity within the genus. The extent of synteny between C. tanaceti and C. higginsianum was high and very similar to the percentage synteny previously reported for the two graminicola complex species, C. sublineola and C. graminicola [165]. This suggested that even though there was high diversity within the genus, the species in the same species complex tend to share more synteny and orthology than the species between species complexes.
Evolutionary analysis of CAZyme families of different Colletotrichum lineages revealed an association between CAZyme families and host range. The GH131 with cellulose degrading activity was the only rapidly evolving gene family at the MRCA of Colletotrichum spp. suggesting a possible association of this family with speciation and host determination within the genus. Families GH43, with hemicellulose degrading activity and AA7, with gluco-oligosaccharide activity significantly expanded upon divergence of both the gloeosporioides and acutatum-complex clades, which could have broadened the host ranges of members of these two complexes, consistent with previous reports [7,[9][10][11].The significant expansions in pectin degrading enzyme families GH106 in gloeosporioides and GH78 in the acutatum clades could also have enabled degradation of pectin rich cell walls of young fruits [166] of these fruit-rotting species.
The most significant contractions were reported in pectin degrading families upon the divergence of the graminicola complex clade. This could have been the reason for species in this complex exclusively infecting monocot plant species considering that the pectin content of monocot cell walls is generally less than in dicots [167]. Even though this was a similar result to previous studies [6,7,[9][10][11]43], C. orchidophilum which is known to infect plants from monocot family Orchidaceae [168], deviated from this pattern. Gene family AA7 was rapidly evolving in many Colletotrichum species and could have been involved in biotransformation or detoxification of the lignocellulosic compounds [169].
In general, the overall CAZyme pathogenicity profiles of Colletotrichum spp. followed host range of those species rather than the taxonomy in consistence with the previous studies, [6, 7, 9-11, 43, 170]. The gloeoporioides and acutatum complex members which have broad host ranges, but are evolutionary distant, were clustered together. This could be a byproduct of the "two-speed" genome scenario in certain Colletotrichum spp. such as C. orbiculare, C. fructicola and C. graminicola [6,11,41] and as suggested by this study, also in C. tanaceti. In this scenario, the pathogenicity genes are located in repeat-rich regions, allowing them to evolve at a higher rate than the rest of the genome. This was also evident by the significant association of TE with pathogenicity genes in C. tanaceti, C. truncatum, C. higginsianum and C. graminicola [10,15,43]. Furthermore, in C. fructicola, two gene clusters that were horizontally transferred were within the rapidly evolving lineage specific regions [11]. This scenario would cause the species with similar pathogenicity gene profiles to cluster together, despite their evolutionary distance.

Conclusion
In conclusion, a draft genome of C. tanaceti was used to characterize the molecular basis of pathogenicity of the species and to improve the knowledge of the evolution of the fungal genus Colletotrichum. Colletotrichum tanaceti is likely to have alternative hosts to pyrethrum. The genome of Colletotrichum tanaceti contains a large component of repetitive elements that may result in genome expansion and rapid generation of novel genotypes. The tendency of the pathogenicity genes to evolve rapidly was evident in genomic signals of the RIP and association of repeats and RIPs with the putative pathogenicity genes. Therefore, due to the large array of pathogenicity genes with a high evolutionary potential, C. tanaceti is likely to become a highrisk pathogen. Complexity of the Colletotrichum genus was evident with its high diversity and evolutionary rate. The significant expansions and contractions of gene families upon divergence of different lineages within the genus could be important determinants in species and species complex diversification in Colletotrichum. The reason for pathogenicity genes to have different clustering than the phylogeny in Colletotrichum could be the occurrence of "twospeed" genomes in certain species. These findings will facilitate future research in genomics and disease management of Colletotrichum.
Supporting information S1  homolog profiles (number homologs to entries in "reduced virulence", "unaffected pathogenicity", loss of pathogenicity", "effector", "lethal" and "increased virulence" categories in the PHIbase) in Colletotrichum and related species; hierarchical clustering performed with Euclidean distance and Ward linkage. The number of genes in each PHI category is normalized using unit variance scaling. Overrepresented and underrepresented gene categories are represented in red to orange and blue respectively as fold standard deviations above and below the mean.