Coregulation of Terpenoid Pathway Genes and Prediction of Isoprene Production in Bacillus subtilis Using Transcriptomics

The isoprenoid pathway converts pyruvate to isoprene and related isoprenoid compounds in plants and some bacteria. Currently, this pathway is of great interest because of the critical role that isoprenoids play in basic cellular processes, as well as the industrial value of metabolites such as isoprene. Although the regulation of several pathway genes has been described, there is a paucity of information regarding system level regulation and control of the pathway. To address these limitations, we examined Bacillus subtilis grown under multiple conditions and determined the relationship between altered isoprene production and gene expression patterns. We found that with respect to the amount of isoprene produced, terpenoid genes fall into two distinct subsets with opposing correlations. The group whose expression levels positively correlated with isoprene production included dxs, which is responsible for the commitment step in the pathway, ispD, and two genes that participate in the mevalonate pathway, yhfS and pksG. The subset of terpenoid genes that inversely correlated with isoprene production included ispH, ispF, hepS, uppS, ispE, and dxr. A genome-wide partial least squares regression model was created to identify other genes or pathways that contribute to isoprene production. These analyses showed that a subset of 213 regulated genes was sufficient to create a predictive model of isoprene production under different conditions and showed correlations at the transcriptional level. We conclude that gene expression levels alone are sufficiently informative about the metabolic state of a cell that produces increased isoprene and can be used to build a model that accurately predicts production of this secondary metabolite across many simulated environmental conditions.


Introduction
Isoprene (2-methyl-1,3-butadiene) is an important commodity due to its use as an aviation fuel and as a platform chemical for synthetic chemistry. Renewable methods for producing isoprene are being investigated to meet product demand and reduce the environmental impact of current production methods, which involve petroleum cracking [1,2]. To this end, methods for large scale production of isoprene from a microbial host are being explored as cleaner sources of raw material [3]. End products of the isoprenoid pathway in bacteria are involved in a multitude of essential functions, such as protein degradation and hormonebased signaling. In addition, some are structural components of membranes, such as sterols, carotenoids, ubiquinone, and dolichols [3,4]. Given the importance of these compounds in cell physiology and its utility as an industrial product, there is great interest in understanding the regulation of the enzymes that control the metabolism of these compounds.
Most animals, plants, and bacteria produce isoprene [5][6][7]. It is produced from one of two pathways: the mevalonate pathway, which is the predominant pathway in yeast and mammals, and the 1-deoxy-D-xylulose-5-phosphate (DXP) pathway, which is functional in most plants and microorganisms [6]. The role of isoprene in bacteria has not been firmly established [8][9][10], but the toxicity of isoprene precursors and the requirement of isoprenoid compounds for normal growth strongly suggest that regulation of this pathway is tightly controlled by the host cell, and must be considered in efforts to alter isoprene production levels in model or non-model organisms [8,11]. Collectively, we refer to the 11 genes of the DXP pathway and three genes of the mevalonate pathway as the terpenoid genes. One potential way to control the expression of the enzymes in this pathway is through operon usage or shared transcription factors as part of one or more regulons [12][13][14][15][16][17][18]. These regulatory controls are common in bacteria, but have not yet been described for the members of the DXP pathway [8].
The enzymatic cascade responsible for isoprene and isoprenoid production includes 11 gene products [19,20]. The dxs gene encodes the first enzyme in the pathway that condenses pyruvate and glyceradehyde-3-phosphate to generate 1-deoxy-D-xyulose-5phosphate. The next six steps in the pathway, encoded by dxr, ispD, ispE (also referred to as ipk), ispF, ispG, and ispH are required to produce both of the five carbon prenyl diphosphates, dimethylallyl diphosphate (DMAPP), and isopentenyl diphosphate (IPP). The Idi enzyme (isopentenyl pyrophosphate isomerase), encoded by the fni gene, is responsible for the isomerization reaction between DMAPP and IPP. Evidence from previous studies suggests that DMAPP is converted to isoprene by isoprene synthase (IspS), an enzyme identified in plants but currently unidentified in bacteria [7,8,11,21]. Farnesyl diphosphate synthase (IspA) condenses DMAPP and IPP to produce geranyl diphosphate (GPP), farnesyl diphosphate (FPP), and larger order isoprenoid compounds. Both heptaprenyl diphosphate synthase (HepS) and undecaprenyl pyrophosphate synthetase (UppS) are downstream of IspA in the pathway and also synthesize larger order isoprenoids.
The genes yhfS, mmgA, and pksG are expressed in B. subtilis but are annotated to be part of a dead-end mevalonate pathway [22,23]. However, it has been suggested that polyketide biosynthesis is linked to isoprenoid biosynthesis in B. subtilis [24]. Furthermore, the activities of pathways that regulate the metabolic pool available for the terpenoid pathway, such as pyruvate, are likely to influence isoprene production. Attempts to produce significant amounts of isoprene in engineered bacteria would benefit from a mechanistic understanding of the how the terpenoid genes are regulated in the native host, thereby facilitating synthetic biology approaches to control isoprene production levels in engineered microbes [18,25,26].
We hypothesized that the RNA expression levels of genes that modulate isoprene production are strictly controlled and regulated under environmental conditions that increase isoprene production. Accumulation of DMAPP and IPP is cytotoxic in E. coli, and to a lesser extent B. subtilis [11,27], which is consistent with the idea that the genes underlying isoprene production are regulated. Thus, the expression levels of genes in the terpenoid pathway should correlate with the level of isoprene produced. To test this hypothesis, RNA was isolated from B. subtilis after being subjected to 12 different perturbations and the pattern of gene expression was determined using RNA-seq. We found that different subsets of the terpenoid genes were coregulated. We also produced a multivariate model using partial least squares regression (PLSR) analyses with a subset of 213 genes. The model was tested against 19 different conditions from an independent experiment and accurately predicted isoprene production levels from these perturbations. Therefore, the gene expression-based model can be used to identify the conditions and pathways that cooperate to maximize isoprene production.

Bacterial Strains and Growth Conditions
Bacillus subtilis strain DSM10 (ATCC 6051) was obtained from the German Collection of Microorganisms and Cell Cultures and grown in Luria-Bertani (LB, Fisher Scientific) broth at 30uC with continuous shaking. The E. coli strain DH5a (Invitrogen), used in B. subtilis cloning efforts, was grown in LB media at 37uC at 180 rpm. When required, the culture media was supplemented with antibiotics (ampicillin, 100 mg/mL and chloramphenicol, 10 mg/mL, Fisher Scientific).
Cloning and Sequencing of dxs, dxr, fni, and ispA Genes The dxs, dxr, fni, and ispA genes were PCR-amplified from B. subtilis DSM10 genomic DNA and cloned using the B. subtilis -E. coli shuttle vector pHT01 (MoBiTec) [28] according to the method used by Xue et al. [3]. The recombinant plasmids are referred to as OX-dxs, OX-dxsdxr, OX-dxsfni, OX-fni, and OX-ispA. The empty pHT01 vector was also transformed into B. subtilis using the same method, and is referred to in the text as empty vector. A list of primers used in the study is provided in Table S1.

Perturbations
A starter culture of DSM10 (in 10 mL LB media) was inoculated with a single colony and grown overnight (16 to 20 hr) at 30uC with shaking at 180 rpm. The optical density of cultures at 600 nm (OD 600 ) was determined using a JENWAY 6405 UV/VIS Spectrophotometer. The starter culture was diluted to an OD 600 of 0.05 using fresh LB media to a total volume of 100 mL. When cultures reached an OD 600 of 0.5, 10 mL aliquots of the culture was transferred to 20 mL GC-MS sampling vials. At this point, the indicated treatment was applied (two vials per treatment, biological replicates were independent from this point forward). Genetic perturbations by overexpression of genes in the mutant strains were induced at mid-log phase by the addition of 1 mM isopropyl b-D-1-thiogalactopyranoside (IPTG, Fisher Scientific). Environmental perturbations were conducted as follows: wild type control, no additions; acetic acid (glacial acetic acid, Fisher Scientific), 0.2% or 2%; ethanol (200 proof molecular biology grade, Fisher Scientific), 0.1% or 1%; lactic acid (DL-Lactic Acid, Fisher Scientific), 0.2% or 2%; indole (Sigma-Aldrich, stock solutions prepared in molecular biology grade ethanol) 0.02 mg/mL or 0.2 mg/mL; hydrogen peroxide (Fisher Scientific), 0.005% or 0.02%; sodium chloride (Fisher Scientific), 0.3 M; glucose (Fisher Scientific), 0.2%; xylose (Fisher Scientific) 0.2%; mannose (Fisher Scientific) 0.2%; and dimethyl sulfoxide (DMSO; Fisher Scientific), 70 mM. The vials were tightly capped and incubated for three hours at 30uC with shaking at 150 rpm. After three hours, the culture density was determined (from one sample vial) and the concentration of isoprene in the headspace was measured (from the remaining sample vial) using the GC-MS method described below.

Detection
of isoprene production using gas chromatography-mass spectrometry. We measured isoprene as described previously [3]. Briefly, the concentration of isoprene in the headspace of liquid cultures was detected using a gas chromatography-mass spectrometry (GC-MS) method. We used an Agilent 7890A GC and a 5975 mass selective detector (MSD) coupled with a CTC PAL autosampling system that was equipped with a solid-phase microextraction (SPME) syringe (Supelco) to absorb volatile compounds in the headspace. The fiber material was 50/30 divinylbenzene-carburen on polydimethylsiloxane on a stable flex fiber. A DB-5 ms column (length, 30 m; inner diameter, 0.25 mm; film thickness, 0.25 mm) with helium as the carrier gas at a flow rate of 0.9 mL/min was used. The oven temperature was initiated at 30uC for 3.5 min, and was increased at a rate of 20uC per min until 120uC. The headspace of LB media was used as a negative control. Bacterial isoprene production was identified by comparing peak retention times and the mass spectra profiles with an isoprene standard (Sigma-Aldrich). Calibration standards were prepared by adding various concentrations to 10 mL of LB media. Isoprene eluted at 1.9 min, and the isoprene concentrations produced in the bacterial cultures were calculated by conversion of the GC-MS peak area to nanograms of isoprene using the calibration curve.

RNA Extraction
Bacterial cultures were centrifuged at 5,0006 g for 10 minutes at room temperature, and the supernatant was discarded. RNA was extracted from the resulting cell pellet using the RiboPure TM -Bacteria Kit (Life Technologies, AM1925). Briefly, 350 mL of icecold RNA WIZ was added to the cell pellet and resuspended by vigorous vortexing. We note that the level of RNA species that are rapidly degraded may be altered during this short period between culturing and RNA stabilization and extraction (approximately 10 minutes). Cells were disrupted by beating in the presence of 250 mL of Zirconia beads for 10 minutes at room temperature. The resulting lysates were collecting after centrifuging for 5 min at 4uC. Lysates were frozen at 280uC for at least 12 hours prior to final RNA extraction. Upon removal from the freezer, 0.2 volumes of ice-cold chloroform was immediately added and pipetted vigorously until the lysate thawed. All subsequent protocol steps were conducted according to the manufacturer's instructions, including use of the DNase treatment supplied with the kit. Extracted RNA was quantified using the NanoDrop ND-1000 (Fisher Scientific). RNA Integrity was determined using 1.0 mL of RNA sample with the Agilent RNA 6000 Nano Kit and chip (Agilent, part number 5067-1548) according to the manufacturer's instructions with the Agilent 2100 BioAnalyzer. Samples were analyzed according to the total prokaryote RNA method software supplied with the Agilent 2100 BioAnalyzer instrument. Samples with a RIN score .7 were used for RNA-Sequencing.

RNA-sequencing
The Applied Biosystems SOLiD TM Total RNA-Seq kit (catalog number 4445374) was used to generate the cDNA template library. The SOLiD TM EZ Bead system was used to perform emulsion clonal bead amplification to generate bead templates for SOLiD TM platform sequencing. Samples were sequenced on either the SOLiD TM 5500xl platform or the SOLiD TM 4 platform. The samples sequenced on the SOLiD TM 5500xl platform consisted of the two biological replicates (cultures supplemented independently). The training data set consisted of the following twelve samples: wild type control, 2% acetic acid, 1% ethanol, 2% lactic acid, 0.2 mg/mL indole, 0.005% and 0.02% H 2 0 2 , OX-dxs, OX-fni, 70 mM DMSO, empty vector control, and OX-ispA. One biological replicate for the OX-fni perturbation did not pass quality control metrics and was eliminated from further analysis, resulting in a single replicate for this condition. The samples sequenced on the SOLiD TM 4 platform consisted of the samples used in the testing data set with one replicate of each perturbation sequenced. The testing data set consisted of the following 19 samples: wild type control, 0.2% acetic acid, 2% acetic acid, 0.1% ethanol, 1% ethanol, 0.2% lactic acid, 2% lactic acid, 0.02 mg/ mL indole, 0.2 mg/mL indole, 0.005% and 0.02% H 2 0 2 , 0.3 M NaCl, 0.2% glucose, 0.2% mannose, 0.2% xylose, OX-dxs, OXdxsdxr, OX-dxsfni, and OX-fni. The 50-base short read sequences produced by the SOLiD TM 4 sequencer were mapped in color space using SOLiD TM BioScope TM software version 1.3. The 50base short read sequences produced by the SOLiD TM 5500xl sequencer were mapped in color space using LifeTechnologies LifeScope TM software version 2.5.1. The same mapping algorithm was used for both sequencing platforms with the default settings. BioScope TM and LifeScope TM were instructed to map the short reads onto a reference genome and set of called genes from B. subtilis subsp. str. 168, a neighboring B. subtilis strain that has previously been sequenced (RefSeq NC_000964). BioScope TM and LifeScope TM were given a filtering file to use that included the 16S, 5S, and 23S rRNA sequences for B. subtilis (in addition to the adapter and barcode sequences to filter before mapping). Therefore, ribosomal filtering was done computationally, with BioScope TM and LifeScope TM discarding any reads matching any subsequence of the three rRNA sequences in the filter file. Each BioScope TM and LifeScope TM run produced a BAM file containing the sequence of every mapped read and its mapped location. The resulting BAM files were imported into AvadisNGS (version 1.3, Strand Sciences). BAM files are available in the Sequence Read Archive at http://www.ncbi.nlm.nih.gov/sra.

Data Analyses
The expression values for each gene were quantitated using DE-Seq [29] normalization in Avadis NGS with the threshold set to one without mean centering and log 2 transformed. The mean and standard deviation of each gene's expression were calculated from the normalized expression using Matlab (2012a, Mathworks). The coefficient of variation (CV) for expression was ,22% for at least 75% of the genes for the training dataset (2 biological replicates) and ,33% for 75% of the genes when considering all three biological replicates (training and test dataset). We did not observe a perturbation-specific bias in CV in these experiments. Data used for clustering and modeling are included in Table S2. Clustering analyses were performed using the Matlab functions 'clustergram' and 'kmeans' as indicated with default parameters. Spearman's rank correlation coefficients were calculated using the Matlab function 'corr' with the option '''type', 'Spearman'''. The reduced list of regulated genes (199 genes) was generated using analysis of variance (ANOVA) in AvadisNGS to select genes differentially expressed between the control and any other perturbation with a p,0.001 with Benjamini-Hochberg correction. Gene Ontology (GO) enrichment analyses were performed using DAVID Bioinformatics Resources [30].
Partial least squares regression analyses were performed using the Matlab toolbox 'PLStoolbox' (Eigenvector Research, Inc.) using the SIMPLS algorithm. Training data were mean-centered and auto-scaled to create a model with calibration fit of R 2 = 0.99, a cross-validation fit (venetian blinds method with 6 splits) of R 2 = 0.49, and the model predicted the response variable (isoprene production) in the test set with a fit R 2 = 0.64. Models with higher prediction power were created when test data included the H 2 0 2 perturbations; however, these perturbations created models that were highly dependent on these samples and were therefore removed to create a model more representative of the other perturbations.

Identification of Perturbations that Alter Isoprene Production
The rate of isoprene production in B. subtilis is dependent on growth conditions and nutrient availability [8,21]. To identify changes in gene expression that correspond to altered isoprene production, we screened for conditions that either increase or decrease isoprene production in B. subtilis strain DSM10. After media was supplemented with chemical stresses or nutrients, isoprene production in the culture head space was measured as previously described [3]. Because isoprene production levels correlate to both total cell number and phase of growth [3,8], supplements were added mid-log phase (corresponding to an OD 600 of 0.5) and isoprene levels were normalized to culture density ( Figure 1A).
We identified eight conditions that altered the rate of isoprene production in wild type B. subtilis. Of these, the addition of acetic acid and lactic acid had the most dramatic effects on isoprene production and reduced it to nearly undetectable levels ( Figure 1B). Ethanol, which induces chemical stress, or indole, which functions as a signaling molecule in bacteria and is required for amino acid biosynthesis [31,32], reduced isoprene production, albeit to a lesser extent than either of the acids. Addition of DMSO, which negatively affects NAD + synthetase enzymatic activity and alters the available NAD/NADH pool [33,34], moderately but reproducibly lowered isoprene production. Hydrogen peroxide (H 2 O 2 ), which induces oxidative stress, was the strongest inducer of isoprene production. In plants, it has been proposed that H 2 O 2induced isoprene production evolved as a mechanism to react with and mitigate the damaging effects of reactive oxygen species [35][36][37][38]. In total, we identified seven media supplements that altered the production of isoprene in B. subtilis: 2% acetic acid, 2% lactic acid, 0.2 mg/mL indole, 70 mM DMSO, 1% ethanol, and 0.005% and 0.02% H 2 0 2 .
Genetic perturbations were also tested to determine if overexpression of selected enzymes in the DXP pathway would influence isoprene production. Isoprene production proceeds via the condensation of pyruvate and glyceraldehyde-3-phosphate by the enzyme Dxs into a metabolic cascade ultimately producing either isoprene or larger terpenoid compounds ( Figure 2A) [8,20,22,39]. Several enzymes in this pathway could serve as rate limiting steps in isoprene production. Specifically, metabolic flux through the DXP pathway appears to be heavily dependent on the activity of Dxs, Idi (encoded by fni), and IspA enzymes [9,10,[40][41][42]. The regulation of these enzymes in bacteria, however, has yet to be fully described. These genes are not part of a single operon, therefore suggesting a more complicated regulatory strategy for these genes as well as for the entire terpenoid pathway ( Figure 2B) [20,22,23]. To evaluate the influence of these enzymes on isoprene production, we overexpressed the genes encoding Dxs, Idi, and IspA. We found that dxs overexpression slightly increased isoprene levels, as described previously ( Figure 1B) [43][44][45]. Although the Idi enzyme catalyzes the conversion of IPP to DMAPP (the substrate for IspS conversion of DMAPP to isoprene), our experiments showed that fni overexpression did not affect isoprene production levels, suggesting that conversion of IPP to DMAPP is not a rate-limiting step in isoprene production in LB growth media. Overexpression of ispA, which catalyzes the conversion of DMAPP and IPP to larger terpenoids (e.g., GPP and FPP), significantly reduces the level of isoprene production ( Figure 1B), suggesting that ispA reduces the level of DMAPP, resulting in reduced substrate availability for IspS to convert to isoprene. In total, we identified 12 unique chemical and genetic perturbations that modulate isoprene production.

Terpenoid Gene Clustering to Identify Patterns of Coregulation
To determine whether the expression of genes that modulate isoprene production are coordinately regulated, we isolated RNA from B. subtilis cultures that had been perturbed, and assessed the cellular transcription profile of two biological replicates for each perturbation using RNA-seq. We were particularly interested in the expression profile of genes involved directly in the DXP pathway, as well as mmgA, yhfS, and pksG, which are reported to belong to a ''dead-end'' mevalonate pathway in B. subtilis [22,23]. The pksG gene is also known to play a functional role in polyketide biosynthesis, which has recently been linked to isoprenoid biosynthesis [24]. We collectively refer to these 14 genes as the terpenoid genes.
Hierarchical clustering was used to identify the terpenoid genes that are coexpressed under the 12 perturbations. The 14 genes are split into three main clusters. Cluster 3 contains the genes dxs, yhfS, pksG, ispD, and mmgA ( Figure 3). On average, these genes demonstrate similarly low levels of expression in the first six perturbations (as ordered on the heat map, discussed below) and moderate to high expression in the remaining six conditions. Cluster 2 contains the genes ispH, uppS, hepS, ispE, dxr, and ispF. These genes present the opposite pattern of expression compared to cluster 1 and are highly expressed in the first six conditions compared to cluster 3. Cluster 1 contains three genes (fni, ispG, and ispA) that have dissimilar expression patterns compared to the other genes.

Relationship between Isoprene Production and Terpenoid Gene Clustering
The pattern of gene expression was further examined to determine if the expression profiles of the terpenoid genes in each cluster correlate with isoprene production and whether the expression level of the cluster 2 or 3 genes might be predictive of the amount of isoprene produced. The perturbations (columns of Figure 3) are clustered based on their similarity in terpenoid gene expression. The leftmost cluster, cluster A, contains the perturbations lactic acid, acetic acid, and 0.005% and 0.02% H 2 0 2. Given that these conditions are the lowest and highest producers of isoprene (acetic acid and H 2 0 2, respectively, Figure 1B), it is surprising that these four perturbations are similar to each other. The clustering of these conditions is dominated by the high level of expression of the genes in cluster 2 (ispH, uppS, hepS, ispE, dxr, and ispF) and low expression of the genes in cluster 3 (dxs, yhfS, pksG, and ispD). This suggests that the gene expression levels in these clusters are not proportional to the level of isoprene because high and low isoprene-producing conditions show similar expression levels.
Although hierarchical clustering allowed visualization of data that successively merged similar groups of genes across conditions, this method of analysis only requires a measure of similarity between groups. We wanted to further examine the relationship between the expression of terpenoid genes and isoprene production. Therefore, we divided the terpenoid genes using k-means clustering and overlaid the isoprene production levels against the expression levels. Using this approach to represent the genes allowed us to visualize the gene expression changes with the level of isoprene produced under those perturbations. Genes in cluster 4 ( Figure 4) (yhfS, mmgA, pksG, dxs, and ispD) show a similar expression pattern, with the exception of mmgA, which has very low expression levels under most conditions. The overlay of isoprene production ( Figure 4, dashed blue line) shows that the expression of the genes in this cluster follows the same trend in expression as in the first five conditions (control, acetic acid, ethanol, lactic acid, and indole). Under the H 2 0 2 conditions, in which cells produced the highest levels of isoprene, the expression levels of the genes in this cluster are low and thus do not follow the same trend as the first set of five conditions, and appear to be inversely correlated. The trend of concordant gene and isoprene expression is observed under the remaining conditions (OX-dxs, OX-fni, DMSO, empty vector, and OX-ispA). Analysis of the genes in cluster 5 (dxr, ispE, ispF, ispH, uppS, and hepS) revealed an inverse correlation between the gene expression level and the amount of isoprene in the first five conditions. However, under the remaining conditions, there is a positive correlation between gene expression and isoprene production. The genes in cluster 6 (ispA, fni, ispG) had little similarity of expression and we identified no clear relationship to isoprene production. Thus, we have identified two main clusters of terpenoid genes that covary with isoprene production. However, the nature of the correlation between these genes and isoprene production is dependent on the type of perturbation, suggesting a complicated relationship between gene expression and isoprene production.

Correlation between Individual Terpenoid Genes and Isoprene Production
We next tested whether individual terpenoid genes show a higher correlation with isoprene production than was apparent in the previous clustering analyses. Spearman's correlation coefficient for each gene was calculated against different subsets of perturbations; specifically, all perturbations, the first five perturbations, and all perturbations excluding H 2 0 2 . In these analyses, we found that dxs correlated positively with a Spearman's correlation coefficient of 0.63 ( Figure 5A), whereas ispG had an inverse correlation in all 12 perturbations with a Spearman's correlation coefficient of 20.57 ( Figure 5A). We did note that although dxs expression levels correlated with isoprene production, the range of expression values was modest, with the largest change between conditions (excluding the overexpression condition) only a 2.1 fold change (data not shown, see Discussion). The other terpenoid genes showed lower correlation to isoprene levels under these conditions. When analyzing the first five perturbations (control, acetic acid, ethanol, lactic acid, and indole), the correlation between isoprene and gene expression increased for several genes. The expression levels of dxs, ispD, pksG, yhfS, and fni have coefficient correlations of approximately 0.6 ( Figure 5B). These genes, with the exception of fni, all cluster together in the hierarchical and k-means clustering (Figures 3, 4). Several genes show a strong inverse relationship (Spearman's correlation coefficient less than 20.6) with isoprene production under the first five perturbations, specifically ispH, ispE, uppS, and hepS ( Figure 5B). Thus, analyses of the terpenoid gene expression profiles indicate that under the acetic acid, lactic acid, ethanol, and indole conditions, there are subsets of genes that strongly correlate with isoprene production (one group positively and the other negatively) ( Figure 5C).

Genome Wide Expression Analyses and the Relationship to Isoprene Production
Although the expression clusters of terpenoid genes were predictive of isoprene production under certain perturbations, we were interested in predicting isoprene levels under all the perturbations tested, thus providing a broader investigation into the mechanisms of isoprene production. Therefore, we sought to determine whether the expression levels of additional genes outside the terpenoid pathway would contribute to a model to predict isoprene production. Identification of these genes could help provide information regarding the processes that are important for isoprene production, such as pathways that regulate substrate or cofactor levels that are necessary for isoprene production. These expanded analyses included the original 14 terpenoid genes (Figure 2A) as well as an additional 199 genes selected by their differential regulation in at least one perturbation compared to the wild type control (refer to Materials and Methods). These genes were clustered by their expression levels across the 12 perturbations. Clustering the perturbations (columns) with this expanded set of genes results in clustering the H 2 0 2 conditions with the lactic acid and acetic acid conditions (cluster C, Figure 6), the same clustering pattern observed when analyzing the terpenoid genes alone. This suggests that at the transcriptional level, the cellular response is dramatically different under these perturbations compared to any of the other conditions (cluster D, Figure 6). Because these are conditions of extremely high (H 2 0 2 ) or low (acetic acid and lactic acid) isoprene production, it is surprising that the gene expression state, which represents the state of the cell (e.g., stressed, actively growing, sporulation, etc.), is dramatically different than other conditions that correspond to moderate isoprene production levels.
We investigated whether the identity of the genes that drive the clustering are informative of the differential cell state. The top cluster of 93 genes (Cluster 7, Figure 6) are expressed at much lower levels in the perturbations with H 2 0 2 and the acidic conditions (Cluster C, Figure 6) compared to other perturbations. This group is enriched with genes involved in antibiotic metabolic processes and genes involved in polyketide biosynthesis, including pksG (Table S3) (GO analysis; Benjamini Hochberg p = 1.7610 27 compared to the entire transcriptome). The H 2 0 2 conditions contain a cluster (cluster 8) of 62 genes that were highly expressed. This group has a high proportion of genes (58%) that encode proteins predicted to be involved in prophage or phage-like elements (Table S4). This finding supports previous studies that determined these elements were involved in B. subtilis responses to H 2 O 2 exposure [46], including the prophage element PBSX, indicative of initiation of sporulation and autolytic activities [47,48]. The bottom cluster of 57 genes (Cluster 9, Figure 6) is highly expressed in H 2 0 2 conditions, with higher expression in the lactic acid and acetic acid conditions. This group is enriched with genes involved in translation, including several 30S and 50S subunits, elongation, and translation factors (Table S5)

Predicting Isoprene Production Levels from Gene Expression Profiles
Based on the clustering data, we hypothesized that rather than a simple linear relationship between groups of genes to predict isoprene production, a more accurate prediction could be made by examining the relative importance of different gene expression levels on isoprene production. Therefore, we created a model to predict isoprene production using partial least squares regression (PLSR) analyses [49]. This model was created using the 213 genes described above to evaluate the significance of each gene for determining the level of isoprene produced. Our initial PLSR model was created with 12 perturbations constituting the training set. This model was tested against RNA profiles from an independent experiment of 19 perturbations, some of which are not present in the training set (Table 1). It resulted in a high predictive capacity for isoprene production (prediction fit of R 2 = 0.69). A model created after randomizing the isoprene levels yielded a poor model (R 2 = 0.25). Because the extremely high level of isoprene produced in the H 2 0 2 perturbations is likely to dominate the predictions of the model, a reduced training set was used that did not include the H 2 0 2 perturbations. The final model was created with this reduced training set of 10 perturbations, using three latent variables (LV), which together captured 99.4% of the variance in the isoprene levels. When the model was applied to the test set, the model accurately predicted the isoprene production with prediction fit of R 2 = 0.64 (Figure 7). The level of isoprene produced in the H 2 0 2 perturbations, although not in the training data, was accurately predicted to be the highest isoprene producer; however, the model predicted a lower level of isoprene than was observed in the experiment. These results demonstrate that the set of 213 genes includes the necessary information to Figure 4. The terpenoid genes cluster into their main groups with respect to coexpression and isoprene production; covariance is dependent upon the type of perturbation. Genes were clustered using k-means. The normalized isoprene production level is represented by the dashed blue line and the mean expression for each cluster is represented by the solid black line in each cluster. The legend in each of the three panels indicates the name of the genes in each cluster. The mRNA levels are normalized (z-score) for each gene. The x-axis is representative of the perturbations as follows (1 through 12): wild type control, acetic acid, ethanol, lactic acid, indole, 0.005% H 2 O 2 , 0.02% H 2 O 2 , OX-dxs, OX-fni, DMSO, empty vector, and OX-ispA. doi:10.1371/journal.pone.0066104.g004 predict the relative amount of isoprene produce in novel cellular perturbations.
To gain a clear understanding of how our model predicts isoprene production levels, we examined how the gene expression variables of the model influenced the prediction. The PLSR model is created by assigning weights to each variable; in this case, individual genes. The locations of each gene are indicated by its position (technically, its projection) in the biplot (Figure 8, Table  S6), and the weights of the responses are indicated for each of the perturbations analyzed in the test and training dataset by their position in the biplot. This plot reveals that the predictive model assigns high positive values for the H 2 0 2 and OX-dxs perturbations, which produce the highest isoprene levels. The genes that have the highest weights in LV1 (x-axis of Figure 8) and LV2 (yaxis of Figure 8) are those that have the greatest influence in the model to predict high isoprene production. Of the terpenoid genes, only dxs is located in this area of the biplot (Quadrant I), supporting its role as a key regulator of isoprene production. The genes dxr, ispE, hepS, ispH, and ispF are located along the negative region of LV1 and along the axis of LV2 (Quadrant II). This is consistent with a negative influence on isoprene production because this is proximal to the region of the biplot with perturbations for low isoprene production. The dense cluster of genes near dxr is enriched for genes involved in translation, suggesting that increased expression of these genes is predictive of low isoprene production ( Figure 8, Table S6). The cluster in Quadrant IV (grouped near the terpenoid genes pksG, ispD, and yhfS) is enriched for genes involved in antibiotic metabolism as well as genes involved in amino acid metabolism ( Figure 8, Table S6). This model indicates that the level of isoprene produced is best predicted by considering the expression of not only terpenoid genes, but also using genes not normally associated with this pathway.

Discussion
The production of isoprene and isoprenoids from bacteria has great potential for industrial and medical applications. However, the lack of a complete understanding of the regulation and constraints inherent to the terpenoid pathway discourages practical applications. To address these limitations, we sought to identify the relationship between gene expression and isoprene production in B. subtilis and found that the terpenoid genes fall into two main groups: one group has a strong positive correlation with isoprene production, and the other has a strong negative correlation with isoprene production. The expression levels of these genes, along with a subset of genes representative of other pathways, were predictive of isoprene levels in novel perturbations. These data suggest that further investigation into the influence of other metabolic pathways will support efforts to modulate isoprene production in model organisms.
The first step in the DXP pathway, condensation of pyruvate and glyceraldehyde-3-phosphate, is mediated by Dxs and is the rate-limiting step for isoprene (and isoprenoid production) in plants and bacteria [9,45]. Here, we show that dxs overexpression drives isoprene production and that its expression correlates with isoprene production in nearly all perturbations. Accordingly, the expression level of dxs strongly influences the PLSR model that accurately predicts the amount of isoprene produced, in contrast with other terpenoid genes that either moderately or negatively correlate to isoprene according to the model (Figures 7, 8). Dxs is known to be negatively regulated by RelA during starvation, known as the stringent response, a response that modulates ,142 genes [50]. Consistently, our analyses reveal that relA expression negatively correlated with dxs expression in low isoprene conditions (data not shown). The dxs gene is predicted to be part of a four gene operon including recN, ahrC, and yqxC [51]. However, the role of these other gene products in the terpenoid pathway is unknown. Surprisingly, the magnitude of dxs regulation across conditions is modest (,2 fold), suggesting a tight level of regulation that is consistent with its key regulatory role and strengthens the notion that dxs overexpression is required to ensure maximal isoprene production.
Three other terpenoid genes, ispD, pksG, and yhfS, also positively correlate with isoprene production. However, this correlation is strongest in a subset of conditions (wild type control, acetic acid, lactic acid, ethanol, and indole). The ispD gene is part of a sigBdependent general stress response [52], but little has been reported regarding its expression. pksG and yhfS are part of a mevalonate pathway that is annotated as a dead-end pathway in B. subtilis [22,23]. The coexpression of these genes with dxs and ispD is unexpected and suggests that these genes may have a role in regulating isoprene production, consistent with a reported link between polyketide biosynthesis and the isoprenoid synthesis [53]. In our model, pksG and yhfS have weights similar to genes involved in antibiotic metabolism, which includes several other pksX genes [51,52]. Further work is required to determine the influence of the pksG and yhfS genes, and whether the mevalonate pathway in B. subtilis is indeed a dead-end pathway. Additionally, our analyses of coexpressed gene clusters in 104 environmental conditions (data not shown, analysis of data from [54]) demonstrated that the 14 terpenoid genes do not share any common clusters with one another (with the exception of two large clusters of 546 and 1383 genes that contained multiple terpenoid genes). Further experimental analyses and mining of publically available datasets and upstream coding regions should help extend our understanding of coregulation of the terpenoid genes. Additionally, further perturbing the nutritional and environmental constituents will likely yield important information regarding isoprene regulation.
The expression level of a second subset of genes (dxr, ispE, ispF, ispH, uppS, and hepS) was inversely correlated with isoprene in the first five conditions (control, acetic acid, lactic acid, ethanol, and indole). UppS and HepS are past the branch point of the DXP pathway and their activity is necessary to produce larger terpenoids downstream of GPP and FPP. In the absence of these enzymes, substantial amounts of these four molecules might reach toxic levels [11]. Thus, it is likely that an increase in the expression of these six genes is indicative of a shift from isoprene production to GPP/FPP production. The ispA gene had limited correlation with isoprene production, but overexpression of ispA decreased isoprene production, indicating that diversion of the prenyl diphosphates away from isoprene synthesis to produce GPP and FPP directly affects IspS activity. The molecular cues to explain this possible phenomenon await further investigation. The inverse correlation of dxr, ispE, ispF, and ispH is more difficult to explain because these enzymes are immediately downstream of Dxs and one would intuitively predict that their expression patterns would be similar to dxs. Of note, the expression of ccpA, a transcription factor that regulates many genes involved in carbon metabolism [55], is also inversely correlated with isoprene production (data not shown). Because CcpA can function as a positive or negative regulator of transcription depending on the levels of coregulators, its role in terpenoid gene expression and isoprene production is likely to be complex [55].
Another terpenoid gene, ispG, also had limited correlation with isoprene production across the conditions. Recently, Zhou et al. [56] reported that the efflux of methylerythritol cyclodiphosphate, the substrate for IspG, is a rate limiting step in isoprenoid production in microbes. This rate-limiting step applies another regulatory constraint in the pathway at the level of ispG, perhaps providing an explanation for the more complicated relationship between its expression and isoprene production.
The mechanisms of regulating secondary metabolites such as isoprene are complex and are often intertwined with aspects of cell growth, central carbon metabolism, stress, and redox state. The first reaction of the DXP pathway utilizes pyruvate and glyceraldehyde-3-phosphate as substrates, so any limitation in these substrates due to changes in glucose or other nutrients would be expected to alter isoprene production. We expected that analyses of transcriptional changes would be a sufficient indicator of the metabolic states of cells because substantial evidence suggests that changes in metabolic activity are preceded by transcriptional changes [57,58]. A recent report shows that in high glucose environments that cause overflow metabolism to produce excess acetate, transcriptional regulation of TCA and respiration genes is mediated by the transcription factor ArcA [59]. In Populus trichocarpa, which produces high levels of isoprene following an ultradian cycle, gene expression analyses found that ispS and dxs are strongly regulated at the transcriptional level and contain lightsensitive promoter elements, thus supporting a transcriptional level of regulation [60]. Although the members of the terpenoid pathway strongly correlate with isoprene production (Figure 5), gene expression level alone was insufficient to predict isoprene levels using a PLSR model (R 2 = 0.33). However, the expression levels of these genes did indicate a high similarity between expression patterns between the perturbations with acetic acid, lactic acid, and H 2 0 2 ( Figure 5, cluster B, and Figure 6 cluster A). This is surprising considering that acidic conditions produce the lowest amount of isoprene and H 2 0 2 perturbations produce the highest. However, this pattern of clustering was nearly identical when we clustered the perturbations using the entire transcriptome (4176 genes, not shown) and a smaller subset of regulated genes (213 genes, Figure 6). The model was created using 213 genes and trained using ten perturbations with cross validation. The green line is the fit for the training data set; the red line is the fit for the testing data set. The R 2 value for prediction of the test set is 0.64. Closed circles are representative of training set values that are labeled A through J; the red triangles are the test set conditions, the test set conditions that are identical to the training set conditions are labeled A' through J'. Unique conditions in the training set are labeled numerically 1 through 12. Perturbation abbreviations: A and A', 2% acetic acid; B and B', 2% lactic acid; C and C', 1% ethanol; D and D', 0.2 mg/mL indole; E, DMSO; F, OX-ispA; G and G', OX-fni, H, empty vector; I and I', wild type control; J and J', OX-dxs; 1, 0.2% acetic acid; 2, 0.2% lactic acid; 3, 0.02 mg/mL indole; 4, 0.1% ethanol; 5, 0.2% xylose; 6, 0.2% mannose; 7, 0.2% glucose; 8, 0.3 M NaCl; 9, OX-dxsdxr; 10, OX-dxsfni; 11 The need for B. subtilis to balance its needs for carbon through central metabolism as well as secondary metabolites requires complex regulation of enzyme expression and activity. To utilize bacteria such as B. subtilis to produce secondary metabolites requires understanding how this balance is maintained and how it can be shifted with the desired effect. The high amount of isoprene production in B. subtilis compared to other bacteria, such as E. coli [7], may suggest a strategy of comparing the metabolic networks between these organisms in order to identify key regulatory steps in isoprene production. Our study shows that measurements of gene expression, with correlation at the transcriptional level, can provide insight into the production of isoprene across many simulated environmental conditions.