MIDDAS-M: Motif-Independent De Novo Detection of Secondary Metabolite Gene Clusters through the Integration of Genome Sequencing and Transcriptome Data

Many bioactive natural products are produced as “secondary metabolites” by plants, bacteria, and fungi. During the middle of the 20th century, several secondary metabolites from fungi revolutionized the pharmaceutical industry, for example, penicillin, lovastatin, and cyclosporine. They are generally biosynthesized by enzymes encoded by clusters of coordinately regulated genes, and several motif-based methods have been developed to detect secondary metabolite biosynthetic (SMB) gene clusters using the sequence information of typical SMB core genes such as polyketide synthases (PKS) and non-ribosomal peptide synthetases (NRPS). However, no detection method exists for SMB gene clusters that are functional and do not include core SMB genes at present. To advance the exploration of SMB gene clusters, especially those without known core genes, we developed MIDDAS-M, a motif-independent de novo detection algorithm for SMB gene clusters. We integrated virtual gene cluster generation in an annotated genome sequence with highly sensitive scoring of the cooperative transcriptional regulation of cluster member genes. MIDDAS-M accurately predicted 38 SMB gene clusters that have been experimentally confirmed and/or predicted by other motif-based methods in 3 fungal strains. MIDDAS-M further identified a new SMB gene cluster for ustiloxin B, which was experimentally validated. Sequence analysis of the cluster genes indicated a novel mechanism for peptide biosynthesis independent of NRPS. Because it is fully computational and independent of empirical knowledge about SMB core genes, MIDDAS-M allows a large-scale, comprehensive analysis of SMB gene clusters, including those with novel biosynthetic mechanisms that do not contain any functionally characterized genes.


Introduction
Chemists have been deciphering the chemical structures of natural products for a century and a half. Many of these natural products are produced as ''secondary metabolites'' by plants, bacteria, and fungi. During the middle of the 20 th century, several secondary metabolites from fungi revolutionized the pharmaceutical industry. These include the antibiotic, penicillin; the cholesterol-level lowering compound, lovastatin; and the immune suppressor, cyclosporin. Other fungal secondary metabolites have achieved notoriety, such as aflatoxin [1]. In the late 20 th century, with the advent of gene cloning, it became apparent that fungal secondary metabolites are biosynthesized by clusters of coordinately regulated genes. Such gene clustering is rare in eukaryotes.
In spite of limited number of secondary metabolites identified from a single species, sequencing the genomes of filamentous fungi has revealed far more than the expected numbers of secondary metabolite biosynthetic (SMB) genes. The numbers of SMB genes encoding polyketide synthases (PKSs) and non-ribosomal peptide synthetases (NRPSs) range from 17-35 and 14-24, respectively, in the individual genomes of eight Aspergillus species [2]. To identify potential secondary metabolites (SMs) in filamentous fungi, various bioinformatics tools, including SMURF [3], antiSMASH [4,5], CLUSEAN [6], and the method described by Andersen et al. [7], have been developed and successfully applied. The basic concept underlying these tools is the existence of SMB gene clusters, which typically contain approximately 20 genes, including the so-called core genes of PKS, NRPS, or dimethylallyl tryptophan synthases (DMATs). These methods are completely dependent on the known sequence motifs of the core genes; therefore, they can only be used to detect SMB gene clusters that include these core genes. In addition, they cannot distinguish functional clusters from silent or cryptic clusters in fungi [8] because they do not incorporate transcriptomics data.
Many secondary metabolites with important medicinal activities have scaffold structures that are mostly synthesized by the core genes of PKS or NRPS, but there are also others independent of those core genes such as oxylipins, a derivative of fatty acids [9]. We recently discovered the SMB gene cluster for kojic acid (KA), which is the representative secondary metabolite of Aspergillus oryzae [10,11]. The KA cluster could not be detected by conventional methods due to the lack of the core genes. KA was discovered in 1907 and has been used industrially [12], but its biosynthetic gene cluster was found only recently. This fact indicates the extreme difficulty in identifying SMB gene clusters without any core genes.
Comparative genomics has shed light on the characteristics of SMB genes that localize to so-called non-syntenic blocks (NSBs) [13][14][15]. NSBs harbor genes that have roles in the transport and metabolism of various compounds [13] and are highly divergent between species [16][17][18]. Two-thirds of the genes in NSBs are not homologous with any genes with known functions [13]. Considering our limited knowledge regarding SMB genes and their high level of diversity, it can be speculated that the significant accumulation of unknown genes on NSBs is due to the presence of a large number of SMB genes on NSBs. In support of this hypothesis, the KA gene cluster is located in an NSB [11].
To enhance the exploration of SMB gene clusters in fungal genomes, especially those without core genes, we have developed MIDDAS-M, a motif-independent de novo detection algorithm for secondary metabolite gene clusters. We used virtual gene cluster generation on an annotated genome sequence integrated with highly sensitive and accurate scoring for the cooperative transcriptional regulation of cluster member genes. MIDDAS-M accurately predicted 38 SMB gene clusters in 3 fungal strains that have been experimentally confirmed and/or predicted by other motif-dependent methods. In addition, we discovered a novel SMB cluster with a potentially new mechanism of cyclic peptide biosynthesis using MIDDAS-M. The cluster was experimentally validated to perform ustiloxin B biosynthesis. Because it is fully computational and independent of empirical knowledge about SMB core genes, MIDDAS-M permits a large-scale, comprehensive analysis of SMB gene clusters, including those with novel biosynthetic mechanisms that do not contain any functionally characterized genes.

MIDDAS-M algorithm
The algorithm depends on the concurrent expression of SMB cluster member genes. First, all possible gene clusters (virtual clusters, VCs) are identified in a previously gene-annotated genome by moving a frame with a given cluster size (ncl) from 3 to 30 genes (Fig. 1A). The cluster induction ratio (M score) for a VC is calculated by summing the induction ratios of all genes in the VC. For a given gene, the induction ratio is determined by dividing the expression level of the gene in an SM-producing condition by the expression level in a non-SM-producing condition. The M i,ncl score for each VC, which begins at gene i with cluster size ncl, was determined according to the following equation: where m k is the induction ratio of gene k, and m and s m are the mean and the standard deviation of all m values, respectively. As shown in Equation 1, each m value should be normalized by Zscore transformation before the summation. M scores are evaluated for each ncl from 3 to an appropriate upper limit (30 in this study). Using this procedure, the M scores of ''non-real'' clusters in which genes are not co-regulated should have low absolute values because positive values are cancelled out by negative values, and vice versa. In contrast, M scores of ''real'' SMB clusters show significantly high absolute values because the genes in the cluster are regulated concurrently (Fig. 1B). SMB cluster candidates exhibit relatively high M scores, but the background noise from pseudo-positive VCs remains high (Fig. 2B). To help distinguish between VCs that are SMB clusters and those that are not, M scores deviating from the normal distribution are magnified by statistical treatment. The magnified score, v i,ncl , was evaluated for each M i,ncl at each ncl using the following equation: where M ncl and s M,ncl are the mean and the standard deviation, respectively, of all M scores at ncl, d is a positive odd integer as an order of the moment (set as 3 in this study), and P i,ncl is the occurrence probability of M i,ncl in the distribution of all M scores at ncl. The moment expresses the magnitude of deviation from standard distribution, being emphasized as the order d increases.
An SMB cluster candidate with M i,ncl largely deviated from the mean value shows a large absolute value of v i,ncl , because of the large Z-score (the content in the parenthesis of Equation 2) and the logarithmic P i,ncl (,,1) converging to minus infinity. The v score shows a positive or negative value when the gene cluster is induced or repressed, respectively. For each starting gene, the ncl showing the largest v value (v max ) is chosen as the cluster size. This step contributes to the high sensitivity of MIDDAS-M by surveying clusters of different sizes. Finally, the clusters showing the largest v max among overlapping VCs (sub-clusters of a candidate cluster) are defined as the ''unique'' cluster (detailed explanation with an example is described in the ''MIDDAS-M computation'' section of the Supplementary Method in Appendix S1). MIDDAS-M also automatically generates the candidate clusters from all possible pairwise comparisons of transcriptomes from several or more culture conditions. This allows comprehensive de novo predictions using large-scale transcriptome datasets based on a variety of culture conditions. See Supplementary Method, the ''MIDDAS-M computation'' section in Appendix S1 for further details. MIDDAS-M is available for use at the following server (http:// 133.242.13.217/MIDDAS-M).

Accurate detection of experimentally validated SMB gene clusters
MIDDAS-M was applied to the filamentous fungus A. oryzae for the detection of the KA gene cluster. This metabolite is an inhibitor of pigment formation in animal tissues and is therefore used as a skin-whitening compound in cosmetics [19,20]. The KA cluster was recently found to be composed of only three genes, none of which encodes a PKS, NRPS, or other core SMB enzyme. Instead, the three genes encode an oxidoreductase, a Zn(II) 2 -Cys 6 (C6)-type transcription factor, and a major facilitator superfamily transporter [10,11]. KA production is typically observed after 3 to 4 days of inoculation of A. oryzae in liquid growth media, and can be stopped by adding a small amount of sodium nitrate to the medium [21,22]. with nitrate. Among the 12,084 genes of A. oryzae [13], 5,046 genes with expression in all three datasets were used for the analysis. The M scores for the 7/4-day dataset are normally distributed when the cluster size ncl = 1, but the symmetry was lost, and the top of the distribution slid to the left, when ncl = 3 and 5, accompanied by the emergence of large M scores outside of the normal distribution ( Fig. 2A). MIDDAS-M emphasizes this deviation of the SMB cluster candidates through Equation 2, enabling their sensitive detection. In the 7/4-day dataset, a distinct single peak emerged in the v max score from the gene induction ratio (m value) as designated by a red arrow in Fig. 2B. The gene cluster corresponding to this peak was composed of three genes, AO090113000136, AO090113000137, and AO090113000138, which were exact matches to the three KA biosynthetic genes [10,11]. The highly sensitive and specific detection of the KA gene cluster, which has a small cluster size of 3 and does not include any core genes, indicates that MIDDAS-M has strong potential as a motif-independent predictor of SMB gene clusters. In the 4/2-day and without/with nitrate datasets, only small v max signals were observed, indicating that the increase of KA productivity in the two datasets was not due to the transcriptional induction of the genes responsible for KA biosynthesis.
MIDDAS-M was also tested for Fusarium verticillioides using a time series of four transcriptomes obtained from mycelia grown in the liquid medium used to induce fumonisin production [23]. This fungus is a plant pathogen that produces mycotoxins and is phylogenetically distantly related to Aspergillus. A comprehensive comparison of the 4 transcriptomes followed by the MIDDAS-M prediction yielded several distinct peaks of v max , of which 5 corresponded to the known SMB gene clusters for fumonisin [24], perithecium pigment [23], fusaric acid [23], bikaverin [25], and fusarin [23] (Fig. 3). Although the size of the predicted SMB gene cluster for fusaric acid was three-fold larger than the experimentally validated clusters, the others were almost correct in size (Table 1). This result clearly illustrates the high sensitivity of MIDDAS-M in detecting functional SMB clusters.
The cluster harboring fusaric acid biosynthetic genes (peak c in Fig. 3B) was predicted to have 17 genes (FVEG_125192F-VEG_12535) by MIDDAS-M, whereas the cluster size reported by Brown et al was 5 (FVEG_125192FVEG_12523) [23] (Table 1, Fig. S2 in Appendix S1). The gene expression profile in this region suggests existence of another cluster adjacent to the fusaric acid gene cluster with a few additional genes in between (Fig. S2 in Appendix S1). One of the remarkable features of MIDDAS-M is the potential to predict a gene cluster even though it includes a small number of genes that are not co-regulated with other cluster member genes. This enables sensitive detection of gene clusters from the dataset containing inaccurate data points due to their low expression levels and/or biological fluctuation under the same condition. It is thought that this characteristic led to the prediction of the above cluster much longer than the actual size by combining the two clusters into one. In addition to detecting the five clusters noted above, this analysis revealed two other VCs with high v max scores (y1 and y2 in Fig. 3B). They were not predicted by SMURF, and were composed of 3 and 4 genes, respectively, the latter of which included an NRPS-like enzyme (Fig. 3B, Table S2 in Appendix S1). To assign peaks to their corresponding compounds, detailed analysis of the linkage between the gene cluster expression and compound productivity is necessary.

Large-scale detection of SMB gene clusters by MIDDAS-M
To demonstrate the fully computational and motif-independent features of MIDDAS-M for the comprehensive analysis of SMB gene clusters, we employed a systematic pairwise comparison of A. flavus 28 transcriptome datasets from a variety of cultivation conditions (GSE15435 [26], Fig. 4A). MIDDAS-M detected 240 candidate clusters with the threshold of 0.05 for the statistical  (Table 1). Using the datasets above, twenty-seven of the 55 clusters predicted by SMURF were detected by MIDDAS-M (Table 1). Secondary metabolites tend to be produced under only limited culture conditions; in other words, SMB genes are silent under most conditions. In addition, many SMB-like gene clusters may have possibly lost their functions. For example, A. oryzae has the gene cluster homologous to that for aflatoxin in A. flavus, but never produces the compound due to mutations both inside and outside the cluster [27]. SMURF, which uses only genome sequence information, predicts clusters regardless of their silence or nonfunctionality. In contrast, MIDDAS-M excludes non-functional SMB gene clusters in defined culture conditions. Similarly, MIDDAS-M predicted 35 of the 76 candidate clusters predicted by antiSMASH (the column D in the ''antiSMASH.AF'' sheet in Appendix S2). Certain peaks were detected under only limited combinations of conditions, illustrating the utility of MIDDAS-M for the comprehensive analysis of culture conditions that induce rarely expressed SMB genes (Fig. 4B). For example, the peak circled in Fig. 4B detected only in a limited conditions, composed of AFLA_035680 through AFLA_035720, was not detected either by SMURF or by antiSMASH.
The detected peaks were highly localized to NSBs (702 detected cluster genes out of 969 total; see Table S3 in Appendix S1). This result is in good agreement with the fact that the genes related to secondary metabolite biosynthesis, transport, and catabolism (Qgenes), identified in the EuKaryotic Orthologous Groups (KOG) [28,29] on NSBs [13]. In addition, the detected gene clusters were enriched for Q-genes compared with the whole genome, regardless of their inclusion of core genes (SMURF+/2) (Fig. 5A). Genes annotated as cytochrome P450 enzymes, which constitute a large enzyme family often involved in SMB gene clusters [30], represent 1.1% of the 13,471 genes in the A. flavus genome, and are contained in 9.1% of the 240 unique clusters detected by MIDDAS-M. The P450 gene content in the detected gene clusters increased drastically to .60%, by applying threshold v max $15,800 (Fig. 5B), although the number of clusters decreased exponentially along with increasing the threshold of v max score (24 clusters when v max $10,000, Fig. S3 in Appendix S1). SMB clusters are often regulated by C6-type transcription factors [31], and major facilitator superfamily (MFS) transporters are often present in SMB clusters [32]. These two genes also appear more frequently in the clusters as the threshold increased. Among 240 candidate SMB gene clusters detected by MIDDAS-M with the threshold of 0.05 false positive rate, 89% (213) were not detected by SMURF (Table S3 in Appendix S1), and this tendency continued when v max .10,000 (71% or 17 in 24). These results strongly suggest that MIDDAS-M detected clusters of SMBs even when the clusters did not include the core genes. Detection of the KA cluster is the typical example. The ustiloxin B biosynthetic gene cluster, which was first detected by MIDDAS-M and experimentally-validated in this study, is another good example. These two clusters are both lacking known core genes, thus have never been predicted by the existing software tools based on sequence information of core genes, such as SMURF and antiSMASH (see detail in the next section). Use of high threshold of v max and gene functional information will increase accuracy of predicting SMB gene clusters, though it may fail to detect novel SMB clusters.

Identification of a novel ustiloxin B gene cluster by MIDDAS-M
The comprehensive analysis of A. flavus transcriptomes by MIDDAS-M revealed a pair of culture conditions (cracked maize at 28uC versus 37uC) that showed 3 distinct peaks: the first peak corresponded to the aflatoxin biosynthetic gene cluster; the second peak to a putative cluster (designated cluster a) consisting of 18 genes (AFLA_0949402AFLA_095110; gene ID interval = 10 in most cases); and the third peak to a putative cluster (cluster b) consisting of 5 genes (AFLA_0392002AFLA_039240) (Fig. 6A). To identify the compounds produced by clusters a and b, we constructed three types of A. flavus deletion mutants for each cluster using pyrG as a selectable marker. For cluster a, mutant DAF_a had 13 genes (DAFLA_0949402AFLA_095060) deleted, mutant DAF_a_4960 had one gene (DAFLA_094960) deleted, and mutant DAF_a_5040 had one gene (DAFLA_095040) deleted. For cluster b, mutant DAF_b had five genes (DAFLA_0392002A-FLA_039240) deleted, mutant DAF_b_9210 had one gene (DAFLA_039210) deleted, and mutant DAF_b_9230 had one gene (DAFLA_039230) deleted ( Fig. S1 and Table S1 in Appendix  The detection threshold is .95th quantile (false positive rate 0.05).
The most induced combinations of culture conditions are listed in Appendix S2. aClusters with numbers are those predicted by SMURF. The list of the predicted gene clusters can be downloaded from http://jcvi.org/smurf/precomputed.php. bGene IDs are for annotated genome sequences in GenBank (A. oryzae, F. verticillioides, and A. flavus) as described in Appendix S1. cTwo numbers are described when the predicted clusters are divided into two regions and represent the corresponding clusters. dCluster size experimentally validated or predicted by SMURF (refer to Source in detail). doi:10.1371/journal.pone.0084028.t001 S1). The deletion mutant lacking the entire aflatoxin cluster and the pyrG revertant were also constructed as positive controls. After solid cultivation of the 7 deletion mutants and the control strain (pyrG revertant) on cracked maize at 28uC for 7 days, water-soluble metabolites were analyzed by high-performance liquid chromatography-mass spectrometry (HPLC-MS). By comparing metabolite profiles between mutants, we found a negative ion spectrum at m/z 644.2 with a retention time (RT) of 8.9 min that was absent only in water extracts from the three deletion mutants corresponding to cluster a (Figs. 6B, 6C) [33][34][35]. The HPLC-purified compound from the water extract of the control strain (pyrG revertant) was compared with a ustiloxin B standard using UPLC-HRMS. The two compounds showed identical mass spectra with an RT of 1.61 min (Fig. 6D) as well as identical peaks in the extracted ion chromatogram at m/z 644.231 [M-H] 2 and in the UV spectra at 290, 254, and 220 nm (Fig. 6E). These results provide the first evidence that the genes AFLA_094960 and AFLA_095040 are responsible for ustiloxin B biosynthesis, indicating that cluster a, composed of AFLA_094940 through AFLA_095110, is a ustiloxin B biosynthetic cluster. Based on its chemical structure, ustiloxin B is likely characterized as a non-ribosomal peptide. One of the genes responsible for producing ustiloxin B, AFLA_095040, was putatively annotated as an NRPS-like enzyme in the NCBI database (gene ID: 7917917). However, the AFLA_095040 gene contains only the catalytic domain of a pyridoxal 59-phosphate-dependent enzyme from aminotransferase family-5, which must be involved in reactions other than non-ribosomal peptide bond biosynthesis (Fig. S4 in Appendix S1 and the ''ust'' sheet in Appendix S2). Moreover, none of the NRPS-specific catalytic domains (A, C, PCP, or TE) were found in any genes in or near the cluster (AFLA_0949302A-FLA_095170), as determined by a BLAST [36] search against the UniProtKB database [37,38]. Accordingly, the cluster was not detected by SMURF (http://jcvi.org/smurf/precomputed.php), antiSMASH (the ''antiSMASH.AF'' sheet in Appendix S2), or other currently available conventional SMB gene cluster prediction methods, which use catalytic domain sequence motif information. This result clearly indicates that MIDDAS-M has potential use as a motif-independent predictor of functional SMB gene clusters.

Discussion
In this work, we described the first sequence motif-independent algorithm for the discovery of functional fungal SMB gene clusters based on a combination of whole genome sequence data and transcriptome information. To achieve this novel and fully computational approach, we combined an algorithm to generate comprehensive virtual gene clusters on a genome of interest with the statistical processing of signal enhancement based on deviation from a standard distribution for transcriptional induction or repression of a cluster. First, we confirmed that our algorithm, MIDDAS-M, accurately detected experimentally validated SMB gene clusters, including the fumonisin, aflatoxin/sterigmatocystin, and KA clusters, from DNA microarray datasets obtained under culture conditions associated with the production and non-production of these compounds. In contrast to the former 3 clusters, the KA gene cluster does not include any genes considered as core SMB genes, such as PKSs, NRPSs, DMATs, or terpene cyclases (TCs). The KA gene cluster predicted by MIDDAS-M was the sole candidate with a correct cluster size. Nine gene disruption experiments were required to identify this cluster without MIDDAS-M prediction in our previous work using the same transcriptomes [11].
The fully computational and motif-independent feature of MIDDAS-M allowed for the comprehensive analysis of SMB gene clusters based on expression differences in a given pair of multiple transcriptomes. Because little is known about SMB gene clusters other than those containing PKS, NRPS, TC, and DMATS, the validation of the MIDDAS-M results is extremely difficult. Nonetheless, based on the MIDDAS-M prediction, we identified the first SMB gene cluster for ustiloxin B, the non-ribosomal peptide-like compound that inhibits microtubule assembly [35], in A. flavus. Although ustiloxin B was identified more than 20 years ago, the ustiloxin B biosynthetic gene cluster had remained unknown until the present study. The lack of the NRPS catalytic domains A, C, PCP, and TE in all genes both in the cluster and within 10 adjacent genes outside the cluster strongly suggests a novel mechanism for cyclic peptide biosynthesis. Our further deletion experiments and sequence analysis revealed that at least 3 genes with unknown functions (AFLA_094970, AFLA_094980, and AFLA_094990) may be involved in the peptide bond synthesis and cyclization of the compound, supporting the idea above (data not shown). However, there still remains a possibility that additional gene encoding an NRPS for the ustiloxin biosynthesis may be located distantly from the cluster.
MIDDAS-M enables the highly sensitive identification of SMB gene clusters, but the predicted cluster sizes may be smaller than the actual cluster sizes in some cases. For example, the aflatoxin gene cluster of A. flavus is composed of 29 genes from AFLA_139150 through AFLA_139440 [39,40], but MIDDAS-M detected 23 genes, AFLA_139150 through AFLA_139410 (excluding AFLA_139330 -AFLA_139360). This discrepancy is most likely due to the Z-score transformation at each ncl used to normalize M scores before enhancement. When information from a candidate gene cluster(s) is included at a certain ncl, the standard deviation used for the denominator in Z-score transformation increases. As a result, the M score(s) of the strongly positive gene cluster tend to be smaller at the correct size. This factor does not affect the detection sensitivity of cluster positions but does affect the cluster boundary detection. One potential solution for this problem is to use another algorithm, such as co-expression analysis, for the precise prediction of cluster boundaries after the sensitive detection of cluster candidates by MIDDAS-M.
There are more than 100,000 fungal species in nature [41] that are potential producers of bioactive compounds [31]. Because fungal SMB genes are highly divergent [16,42,43], even fungal species closely related to those that have already been sequenced are worth sequencing to discover new SMB genes. We have confirmed that MIDDAS-M performs equally well when using transcriptomes from RNA-seq data in a comparative performance with DNA microarray for SMB gene cluster detection. MIDDAS-M enables the comprehensive exploration of functional SMB genes in fungal genomes by effectively utilizing the vast amount of available genome and transcriptome information, which will accelerate the discovery of biosynthesis or other functional categories of genes in the future.