Molecular Evolution of the Deuterolysin (M35) Family Genes in Coccidioides

Coccidioides is a primary fungal pathogen of humans, causing life-threatening respiratory disease known as coccidioidomycosis (Valley fever) in immunocompromised individuals. Recently, Sharpton et al (2009) found that the deuterolysin (M35) family genes were significantly expanded in both the Coccidioides genus and in U. reesii, and that Coccidioides has acquired three more M35 family genes than U. reesii. In the present work, phylogenetic analyses based on a total of 28 M35 family genes using different alignments and tree-building methods consistently revealed five clades with high nodal supports. Interestingly, likelihood ratio tests suggested significant differences in selective pressure on the ancestral lineage of three additional duplicated M35 family genes from Coccidioides species compared to the other lineages in the phylogeny, which may be associated with novel functional adaptations of M35 family genes in the Coccidioides species, e.g., recent pathogenesis acquisition. Our study adds to the expanding view of M35 family gene evolution and functions as well as establishes a theoretical foundation for future experimental investigations.


Introduction
Coccidioides, a type of dimorphic fungi in the order Onygenales, is composed of two closely related species, Coccidioides posadasii and Coccidioides immitis [1,2]. They are primary fungal pathogens of humans, which can cause life-threatening respiratory disease known as coccidioidomycosis (Valley fever) in the immunocompromised individuals [3][4][5][6]. They infect about 150,000 people annually in the United States [4], and have been listed as one of the ''U.S Health and Human Services Select Agents'' of bioterrorism [7]. Such a designation has fueled research efforts to develop an effective human vaccine and new treatments against coccidioidomycosis [8,9].
Gene duplication has long been thought as a main event for evolutionary innovations and functional adaptations [17][18][19][20][21]. Up to now, several studies have revealed evidence for positive selection following gene duplication, leading to the emergence of novel functions (known as neofunctionalization) [18,[22][23][24]. Thus, the characterization of the molecular evolution of the M35 family genes in Coccidioides and U. reesii can be of great importance for understanding what evolutionary force has possibly shaped the diversification of these Mep genes in Coccidioides and U. reesii, and for inferring their biological significance.
To find putative homologs of M35 family sequences from these nine genome sequences, we used the program HMMSEARCH from the HMMER package (http://hmmer.wustl.edu/) for homologous protein search, with the HMM profile Peptidase_ M35 (PF02102; http://pfam.sanger.ac.uk/family/PF02102# tabview = tab6) used as query. Hits were considered significant when they matched the Pfam HMM profile with E values,10 25 . In total, 28 M35 family genes were identified. In addition, we also used InterProScan (http://www.ebi.ac.uk/Tools/InterProScan/) [25] to analyze the domain compositions of all the 28 M 35 family genes.

Sequence alignment
To avoid the influence of a specific alignment program on the results, we applied three methods to yield alignments for analyses. In the first, MUSCLE v3.5 was first used to generate protein alignment with default settings [26]. Based on this protein alignment, PAL2NAL v13 was used to build a codon-based alignment with default settings [27]. A 417-bp codon-based alignment obtained from PAL2NAL v13 was shown in Figure 1 and the initial protein alignment generated by MUSCLE 3.5 that includes the trimmed sites was supplied (Fig. S1). In the second, sequences were aligned using the MUSCLE v3.5 software with default settings [26]. The ambiguous areas of alignment were Figure 1. Protein alignment of 28 M35 genes. Areas shaded in black are conserved regions (100% similarity). Areas shaded in grey have a high degree of homology (more than 75% similarity), while those unshaded areas are highly variable regions between the proteases. Two positively selected residues identified in branch i are indicated. doi:10.1371/journal.pone.0031536.g001 located and removed by using the program Gblocks 0.91b [28,29] with default parameters. The gap selection criterion ''with half'' was used here. A 918-bp alignment was shown in Fig. S2. In the third approach, sequences were aligned using PRANK with default settings [30,31]. The ambiguous areas of alignment were located and removed by using the program Gblocks 0.91b [28,29] with default parameters. The gap selection criterion ''with half'' was again used here. A 531-bp alignment was shown in Fig. S3.

Phylogenetic analysis
Phylogenetic analyses of the alignments were performed using MEGA 5 [32] for Neighbor-joining (NJ) analysis, using PHYML 3.0 [33] for Maximum likelihood (ML) analysis, and using MrBayes 3.1.2 [34] for Bayesian inference. In the NJ analysis, Kimura-2 parameter model and pairwise deletion option for gaps were used. In the ML analysis, the model HKY+I+G of sequence evolution was optimized using Akaike information criterion [35,36] as implemented in Modeltest version 3.7 [37]. The reliability of these tree topologies was evaluated using bootstrap support [38] with 1000 replicates for NJ and 100 for ML analysis.
The parameters estimated by Modeltest were also used in the priors of Bayesian inference with MrBayes version 3.1.2 [34]. Four Metropolis-coupled Markov chain Monte Carlo (MCMC) analyses were run for 2610 5 generations, sampling trees every 100 generations. The dataset was run for three times independently to avoid being trapped in local optimal. We determined the burnin period by checking for likelihood stability. At the end of the run, the average standard deviation of split frequencies was less than 0.01 in all the cases, indicating a good convergence level (MrBayes 3.1.2 manual). A 50% majority rule consensus of post burn-in trees was constructed to summarize posterior probabilities (PPs) for each branch.
In addition, PhyloBayes 3.3b was run using the siteheterogeneous CAT model [39] with two independent Monte Carlo Markov Chain (MCMC) chains [40]. To check for convergence, the program bpcomp [41] was used to compare the bipartitions between the two runs. With a burn-in of 1000 and taking every two trees, the largest discrepancy (maxdiff) between the bipartitions was less than 0.1, indicating a good convergence level.

Gene duplication and loss analyses
The reconciliation between species tree and gene tree along with the inferences of the gene duplication and loss scenarios were determined by Notung 2.6 [42,43]. We infer the species tree by performing NJ analysis using MEGA 5 [32] from a combined alignment of six genes as those used in James et al [44], including 18S rRNA, 28S rRNA, ITS RNA, translation elongation factor 1a (TEF1a), RNA polymerase II largest subunit (RPB1) and RNA polymerase II second largest subunit (RPB2) (Figure 2). For the gene tree used here, we collapsed those inconsistent nodes produced by different tree-building methods, which were also poorly supported into polytomy [45] (Figure 3 and Figure 4).

Selective pressures analyses
The ratio v (d N /d S ) is the ratio of the number of nonsynonymous substitutions per non-synonymous site (d N ) to the number of synonymous substitutions per synonymous site (d S ), which provides an indication of the change in selective pressures [46]. A d N /d S ratio = 1, ,1, and .1 are indicative of neutral evolution, purifying selection, and positive selection on the protein involved, respectively [47,48].
To check whether there are substitution saturations in our data set, we plot the number of transitions and transversions vs. divergence using DAMBE, with an asymptotic relationship indicating the presence of saturation [49].
The codon substitution models implemented in the CODEML program in the PAML 4.4b package [50] were used to analyze changes of selective pressure. All models correct for transition/ transversion rate and codon usage biases (F364). Different starting v values were also used to avoid the local optima on the likelihood surface [51].
Two branch-specific models, i.e., ''one-ratio'' (M0) and ''freeratios'', were compared. M0 model assumes the same v ratio for all branches while ''free-ratios'' model assumes an independent v ratio for each branch [52]. We constructed likelihood ratio tests (LRT) to compare the two models. Significant differences between models were evaluated by calculating twice the log-likelihood difference following a x 2 distribution, with the number of degrees of freedom equal to the difference in the numbers of free parameters between models. Site-specific models, which allow for variable selection patterns among amino acid sites, M1a, M2a, M7, and M8, were used to test for the presence of sites under positive selection and identify them. Significant differences between the 2 models were evaluated by calculating twice the log-likelihood difference following x 2 distribution, with the number of degrees of freedom equal to the difference in the numbers of free parameters between the 2 models. M2a and M8 models allow for positively selected sites. When these 2 positive-selection models fitted the data significantly better than the corresponding null models (M1a and M8a), the presence of sites with v.1 would be suggested. The conservative Empirical Bayes approach [53] was then used to calculate the posterior probabilities of a specific codon site and identify those most likely to be under positive selection.
Considering that positive selection may act in very short episodes during the evolution of a protein [54] and affect only a few sites along a few lineages in the phylogeny, the likelihood models accommodating v ratios to vary both among lineages of interest and amino acid sites, that are an improved version of the ''branch-site'' model, were also considered here [55]. We used branch-site Model A as a stringency test (test 2) and identified amino acid sites under positive selection by an empirical Bayes approach along the lineages of interest [55,56]. The log-likelihoods for the null and alternative models were used to calculate a likelihood ratio test statistic, which was then compared against the x2 distribution (with a critical value of 3.84 at a 5% significance level) [50]. In addition, the Bonferroni correction [57,58] was also applied for multiple testing in the analysis according to the number of tests of significance performed. Using the 2.5 Å deuterolysin metalloproteinase from Aspergillus oryza (PDB ID: 1EB6A) [59], we conducted homology modeling with the SWISS-PROT program (http://swissmodel.expasy.org/) [60][61][62] and analyzed the structure with Pymol Ver. 0.99 [63] to investigate the possible functional shifts of the positively selected sites identified here.

Characterizations of M35 family genes
As shown in Table 1, a total of 28 M35 family genes were identified from 9 fungal genome sequences. All these genes are consisted of only one Mep domain identified by InterProScan (Fig.  S4). We used the Kimura 2-parameter model with MEGA 5 [32] to calculate the average genetic distance of M35 family genes, and found that the average genetic distance within this family was 0.640. In C. immitis, 7 genes were identified. In C. posadasii, 8 genes were identified, including 7 previously identified genes [14], and one gene (CPAT_05384) newly determined here (referred as Mep4-1 like gene). In addition, 4 genes were predicted from U. reesii. For the other six fungal species, one M35 family gene each was predicted from H. capsulatum, P. brasiliensis and F. graminearum, while two genes each were identified from B. dermatitidis, N. crassa and P. odorum (Table 1 and Figure 2).

Phylogenetic analysis
Phylogenetic analyses based on the 28 M35 family genes using different alignments and tree-building methods consistently recovered five clades (designated as Group 1 to 5; Figure 3, Fig.  S5 and Fig. S6). Figure 3 shows the results based on the alignment strategy 1. Group 1 (all BS and PP$99%) contained Mep3-like, Mep5-like and Mep6-like genes from C. posadasii, C. immitis, and U. reesii. It seemed that these genes were duplicated before the divergence of the three species. Group 2 (BS = 78-92%; PP = 55-100%) is a Coccidioides-specific lineage, containing three additional duplicated genes (Mep2-like, Mep7-like and Mep8-like) from each of the two Coccidioides species. They are most likely to have duplicated before the divergence of two Coccidioides species. Group 1 and Group 2 are more closely related to each other than to other groups. Group 3 (all BS and PP$99%) is consisted of two Mep2-like genes from B. dermatitidis and P. brasiliensis. Group 4 (BS = 41-60%; PP = 53-100%) contained Mep4-like genes from both Coccidioides and U. reesii, and Mep4-1 like gene from C. posadasii as well as two Mep-like genes from H. capsulatum and B. dermatitidis. Group 5 (BS = 61-63%; PP = 56-65%) contained Mep-like genes from F. graminearum, N. crassa and P. odorum.

Gene duplication and loss analysis
The inferences of gene duplication and loss events are shown in Figure 5. We found that there were at least 9 gene duplication and 5 gene loss events during the evolutionary history of these M35 family genes and most of the gene duplication and loss events happened in the Onygenales species. An initial duplication event emerged in the ancestral lineage leading to Onygenales.
Subsequently, three gene duplications occurred before the divergence of Coccidioides and U. reesii, and 4 gene duplications occurred before the divergence of C. posadasii and C. immitis ( Figure 5). In comparison, one gene might have lost in C. immitis, U. reesii, P. brasiliensis and H. capsulatum each ( Figure 5).

Selective pressure analyses
The DAMBE analyses suggested that there was no evidence for substitution saturation in our data set (Fig. S7). Because the likelihood analysis might be sensitive to tree topology used, we collapsed the nodes that showed inconsistent branching patterns from different tree-building methods and with poor statistical support into polytomy [45] (Figure 4). Table 2 shows the results of positive selection analyses inferred from alignment strategy 1. In the branch-specific model analyses [64], the v ratios calculated in the one-ratio model (M0) is 0.11784, suggesting that most M35 family genes in these species have evolved under strong functional constraints. Interestingly, the free-ratio model shows a significantly better fit to the data than the M0 model (2DL = 109.5459, df = 49, p,0.001) ( Table 2), indicating that these Mep genes have been the subjects of different selective pressures. A similar result was obtained based on alignment strategies 2 and 3 (see Table S1). In the site-specific model analyses, both the positive-selection models (M2a and M8) did not provide a significantly better fit to the data than did the neutral models (M1a and M8a) (P = 1.000, respectively), A similar result was obtained based on alignment strategies 2 and 3 (see Table S1).
To investigate the possible selective forces behind the Mep gene duplication in Coccidioides species and U. reesii, we conducted LRTs based on the branch-site models for those branches resulted from gene duplications (16 branches in total, a-p as indicated in Figure 4). The analyses suggest that there are five branches (branches a, d, i, j and l) showing signs of positive selection (Figure 4). After Bonferroni correction for multiple testing, we found that LRT tests were still significant in branch i, leading to the common ancestor of 3 additional duplicated genes in Coccidioides (2DL = 22.4986, P = 0.0001) (Table 2 and Figure 4). As summarized in Table 2, the Bayesian approach  in PAML predicted two sites located in mature peptide, G118Y/F/H and T119C, as positively selected for branch i with high BEB posterior probabilities larger than 0.95 (Figure  1and Table 2). When we performed the branch-site model tests based on other two alignment strategies, we found that branch i consistently displayed significant signs of positive selection after the Bonferroni correction. In addition, G118Y/F/H and T119C were consistently predicted as positively selected sites along this branch. In summary, our results showed that positive selection is most likely to have at least acted on the lineages leading to the common ancestor of 3 additional duplicated genes in Coccidioides.

Homology modeling
Ninteen M35 family proteases from Coccidioides and U. reesii were modeled and evaluated to have good structural qualities (Table S2). We then mapped two sites, G118Y/F/H and T119C (corresponding to sites 134 and 135 of the model, respectively), that were consistently showed positively selected based on three different alignment strategies onto the structure models ( Figure 6). Interestingly, all the three substituted amino acid residues, i.e., Tyrosine, Phenylalanine and Histidine, occurred in site 134 have larger side chains than Glycine in the other sequences. No obvious structure difference between T135 and C135 was identified from the structure models ( Figure 6).

Discussion
In the present paper, we analyzed the deuterolysin (M35) family genes in 9 Ascomycota species, adding the growing diversity of the Mep gene evolution. We found that the M35 family genes are significantly expanded in Coccidioides and U. reesii compared with the other fungal species examined. Phylogenetic reconstruction and gene duplication/loss analyses of these sequences placed the first gene duplication event in the common ancestor of the Onygenales species (Figure 3 and Figure 5). Subsequently, significantly more Mep genes were produced by additional gene duplication events in Coccidioides (7 genes in C. immitis and 8 genes in C. posadasii) and U. reesii (4 genes) compared with those in other Onygenales species examined (1 or 2 genes) ( Figure 5). The observation of obvious expansion of M35 family genes in the present study in Coccidioides and U. reesii based on more Onygenales species supports earlier prediction of Sharpton et al [16].
Noticeably, we further showed that positive selection promoted this unusual gene expansion in Coccidioides, at least along the ancestral lineage producing the three additional duplicated genes specifically in Coccidioides species (Table 2 and Figure 4). As far as we know, this study is the first demonstration that positive selection has acted on the duplicated Mep genes during evolution. In addition, our study provided valuable information on the potentially important adaptive amino acid replacements. Among them, residues G118Y/F/H and T119C were consistently recovered as positively selected sites from the analysis of all three different alignment strategies with high posterior probabilities (.95%). For M35 family genes, the active zinc ligands are composed of 2 histidines in the HExxH motif and the Aspartic acid in motif GTXDXXYG [15]. Intriguingly, we found that these 2 putative positive selected sites were located within the motif GTXDXXYG. For site 118, the original amino acid G changed to Y (Mep2), F (Mep7) or H (Mep8) in the three additional duplicated Meps in Coccidioides species. Moreover, when we mapped them on the 3-dimensional crystal structure of the molecule, the substitutions that occurred at G118Y/F/H (corresponding to sites 134 in Figure 6) were predicted to have larger side chains. It is generally thought that side-chain of residues can influence the flexibility of the ligand-binding site of a protein [65,66]. Therefore, these substitutions may exert influences on the binding of the zinc ion. For site 119 (corresponding to sites 135 in Figure 6), it has been considered important for M35 family gene because the hydroxyl group of T135 can interact with the second zinc binding histidine (H126), playing an important role in sustaining the coordination of the catalytic zinc ligands [15]. In our study, at site 135, T changed to C in all three additional duplicated proteases in Coccidioides species, indicating that this substitution may have influence on the coordination of the zinc ligands and protein flexibilities. (Figure 1 and Figure 6). Therefore, the two putative positively selected sites identified here may have profound effect on protein flexibility and zinc binding, which may lead to activity changes of the three additional duplicated proteases in Coccidioides.
Though the functional changes brought by these 2 positively selected residues cannot be predicted at present, they likely have functional consequences, raising the possibility that new physiologic functions of Mep genes have been developed in Coccidioides. Therefore, we proposed that the evidence of positive selection observed in the Coccidioides species might be associated with novel functional adaptations of M35 family genes. Presently, the experimental data concerning the expression of these paralogous M35 family genes in the Coccidioides is unavailable, so it is difficult to tell what selective pressure has promoted Mep gene expansion in this genus. One of the most likely speculations was that the presence of three additional Mep genes in Coccidioides was an adaptation to its recent pathogenesis acquisition because Coccidioides diverged from the nonpathogenic fungus U reesii and acquire its pathogenic phenotype relatively recently [16]. Of course, we can't exclude the possibility that these Mep genes in Coccidioides may serve other yet unknown physiological functions.
In conclusion, while novel information was generated through our analyses, more information on the structure, function, and evolution of M35 family genes is required to understand why significant expansion of Mep genes occured in Coccidioides and U reesii, and not in the other species. Our study established a foundation for the experimental investigations. It will be interesting to test the expression pattern of these Mep genes and the functional effects of amino acid substitutions for the identified positively selected sites. Figure S1 Whole protein alignment of the M35 genes with MUSCLE 3.5. Sequences were aligned using MUSCLE v3.5 with default settings [26]. 139 amino acids (corresponding to 417-bp nucleotide positions) obtained from PAL2NAL v13 [27] were encompassed by frame with black edge. (DOC) Figure S2 Protein alignment of M35 genes aligned with strategy 2. Sequences were aligned using MUSCLE v3.5 software with default settings [26]. The ambiguous areas of alignment were located and removed by using the program Gblocks 0.91b [28,29] with default parameters. The gap selection criterion ''with half'' was used here. A 918-bp alignment was obtained. (DOC) Figure S3 Protein alignment of M35 genes aligned with strategy 3. Sequences were aligned using PRANK with default settings [30,31]. The ambiguous areas of alignment were located and removed by using the program Gblocks 0.91b [28,29] with default parameters. The gap selection criterion ''with half'' was used here. A 531-bp alignment was obtained.