The authors have declared that no competing interests exist.
Conceived and designed the experiments: YW HM JW. Performed the experiments: HM JW. Analyzed the data: HM JW. Wrote the paper: HM JW ZX FX GZ YW.
Accurate and controllable regulatory elements such as promoters and ribosome binding sites (RBSs) are indispensable tools to quantitatively regulate gene expression for rational pathway engineering. Therefore,
The coming era of synthetic biology aims at the design and construction of complex biological networks to achieve specific goals (e.g., high-level production of clinically valuable natural products), which requires fine-tuning gene expression in cellular networks to achieve an expected metabolic behaviour
The aforementioned quantitative prediction models commonly use linear regression analysis or derivative methods (e.g., linear correlation of log-transformed data) to simplify the complex process of model construction. They therefore struggle to reflect the complex non-linear relationship between sequences and their strengths, which results in low prediction accuracy and poor generality. In addition, these models are supposed to have the potential, but have not been further developed into
All strains and plasmids involved in this study are listed in
Strains & plasmids | Relevant characteristics | Source |
DH10B | F- | Invitrogen |
BL21(DE3) | F | EMD4 Biosciences |
pTrcHis2B | Ampicillin resistance marker, Trc promoter | Invitrogen |
pJF07 | Plasmid pTrcHis2B carrying a | This study |
pET28a- | pET28a derived plasmid carrying a | This study |
pET21c- | pET21c derived plasmid carrying a | This study |
pET28a- | pET28a derived plasmid carrying | This study |
s14/s05/s21- | Three pTrcHis2B derived plasmids carrying synthetic promoters s14 (0.56), s05 (1.00) and s21 (2.50) followed by a | This study |
s14/s05/s21- | Three pTrcHis2B derived plasmids carrying synthetic promoters s14 (0.56), s05 (1.00) and s21 (2.50) followed by a | This study |
s14/s05/s21- | Three pTrcHis2B derived plasmids carrying synthetic promoters s14 (0.56), s05 (1.00) and s21 (2.50) followed by a | This study |
pTrcHis2B- | pTrcHis2B derived plasmid carrying a | This study |
Reporter plasmid pJF07 was created by inserting a
Random mutagenesis of the wild-type Trc promoter & RBS sequence was performed by error-prone PCR using primers TrcF (
For primary screening, transformants were picked into 48-deep-well plates and screened by green fluorescent protein (GFP) assay (excitation/emission wavelengths = 485 nm/535 nm). The cultivation conditions for the 48-deep-well plates were as follows: 0.5 ml LB medium with 0.1 mM IPTG in each 5 ml well, at 37°C and 250 rpm for 8 h during exponential phase. The OD600 nm and green fluorescent signal of 100 µl of culture were quantified in a 96-well plate reader (Multiskan FC Microplate Photometer, Thermo Scientific). For the convenience of comparison, we used the relative strength
One hundred clones with distributed strengths were selected, cultivated overnight in LB broth and preserved in 20% glycerol at −80°C as seed cultures. Fine quantification of the selected elements was performed in tubes (15 mm×150 mm) and assayed by flow cytometry (FACSCalibur flow cytometer, Becton Dickinson). Seventy-five microliters of seed culture was inoculated into 1.5 ml LB with 0.1 mM IPTG and incubated at 37°C and 250 rpm for 3 h at exponential phase. The culture was cooled in an ice bath and assayed, using a clone containing pTrcHis2B as the blank control. For each clone, 20,000 events were sampled and the geometric mean (Gmean) of the fluorescent signal was calculated. The relative strength value
Matlab 2012a (Mathworks Inc.,
Peptide expression was performed in
Recombinant strains
Trc promoter is commonly used for protein expression in
The Trc promoter & RBS region in pTrcHis2B is selected for random mutagenesis by error-prone PCR, and mutants of various strengths are obtained by measuring the fluorescent intensity of GFP, first by screening in 48-deep-well plates and then by flow cytometry assay.
The initial ANN model was built as a backpropagation model (BP-ANN model) using Matlab functions provided by the Neural Network Toolbox. The model contains three layers: an input layer, a hidden layer and an output layer. The input layer and the output layer had 896 neurons and 1 neuron, respectively (determined by the data conversion rule). The number of neurons in the hidden layer was left variable for optimization. The initial weights for all neuron connections were randomly assigned by Matlab functions.
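The three-layer structure described above can be sketched as a single forward pass. This is a minimal illustration in Python rather than the Matlab toolbox code used in the study: the function names, the uniform weight range of ±0.1 and the hidden-layer size of 19 (one value inside the 5 to 30 range explored) are all assumptions made for the example, and the sigmoid is the standard logistic function that Matlab calls 'logsig'.

```python
import math
import random


def logsig(x):
    """Logistic sigmoid 1 / (1 + exp(-x)); its output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_inputs, n_hidden, rng):
    """Randomly initialise an n_inputs -> n_hidden -> 1 network.

    The +/-0.1 uniform range is an assumption for this sketch, not the
    initialisation scheme used by the Matlab toolbox.
    """
    def small():
        return rng.uniform(-0.1, 0.1)

    return {
        "w_hidden": [[small() for _ in range(n_inputs)] for _ in range(n_hidden)],
        "b_hidden": [small() for _ in range(n_hidden)],
        "w_out": [small() for _ in range(n_hidden)],
        "b_out": small(),
    }


def forward(net, inputs):
    """Propagate one input vector through the hidden and output layers."""
    hidden = [
        logsig(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(net["w_hidden"], net["b_hidden"])
    ]
    return logsig(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])


# 896 inputs and 1 output as stated in the text; 19 hidden neurons is illustrative.
net = init_network(n_inputs=896, n_hidden=19, rng=random.Random(0))
dummy_input = [1 if i % 4 == 0 else 0 for i in range(896)]  # placeholder 0/1 vector
prediction = forward(net, dummy_input)
```

Because the weights are random before training, `prediction` here is meaningless as a strength estimate; the sketch only shows how the layer sizes fit together.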
We evaluated the predicting performance by using the sum squared error (
The activation functions of the hidden layer and the output layer were set to be a non-linear sigmoid function ‘logsig’, which was defined as
For training of BP-ANNs, a set of example pairs was given as
The original sequence data were translated to digital data and served as the input matrix according to the following rules: A = {1, 0, 0, 0}, G = {0, 1, 0, 0}, C = {0, 0, 1, 0}, and T = {0, 0, 0, 1}. For instance, a given sequence ‘ATTGCC’ can be translated to a ‘0-1’ digital series of {1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0}.
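The conversion rule above can be written as a short lookup. The sketch below is illustrative Python, not the authors' Matlab code; `ONE_HOT` and `encode_sequence` are hypothetical names introduced for this example.

```python
# Hypothetical helper illustrating the A/G/C/T -> four-value rule quoted above.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "G": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def encode_sequence(seq):
    """Return the flat 0/1 list for a DNA sequence (four values per base)."""
    bits = []
    for base in seq.upper():
        if base not in ONE_HOT:
            raise ValueError(f"unexpected base: {base!r}")
        bits.extend(ONE_HOT[base])
    return bits


example = encode_sequence("ATTGCC")
# example == [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
```

Under this rule each base contributes four inputs, so an input layer of 896 neurons corresponds to a sequence of 224 bases.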
Because the output of the logsig function lies in (0,1) while the target strength can be greater than 1, it was necessary to normalize the target strength data by dividing by the maximum strength value, and then to multiply the simulated outputs by this value after simulation.
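The scaling step can be sketched as a pair of small functions. This is an illustrative Python version with hypothetical function names. The sample targets are used only as example numbers: 0.56, 1.00 and 2.50 are the strengths listed for s14, s05 and s21, and 3.559 is the maximum training strength quoted in the text.

```python
def normalize_targets(strengths):
    """Divide every strength by the maximum so that all targets are at most 1."""
    max_strength = max(strengths)
    if max_strength <= 0:
        raise ValueError("maximum strength must be positive")
    return [s / max_strength for s in strengths], max_strength


def denormalize_outputs(outputs, max_strength):
    """Map network outputs back onto the relative-strength scale."""
    return [o * max_strength for o in outputs]


# Illustrative values only; see the lead-in for where the numbers come from.
targets = [0.56, 1.00, 2.50, 3.559]
scaled, max_strength = normalize_targets(targets)
restored = denormalize_outputs(scaled, max_strength)
```

One consequence of scaling by the maximum is that denormalized predictions can never exceed that maximum, which is consistent with the upper limit on designable strength discussed later in the paper.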
The goal of
All 100 sequences in the library were randomly split into two data sets (a training set and a test set) to train and test the ANN prediction models. To account for the effect of training set size on prediction performance, the training set size was varied from 40 to 90 sequences (51 sizes in total), with each corresponding test set containing the remaining sequences. The number of hidden-layer neurons was optimized over the range 5 to 30 (26 values), and each configuration was trained to generate 1,000 models. Consequently, we obtained 51×26×1,000 = 1,326,000 models. Owing to the random initialization of weights, the trained models differ in prediction performance, which can be evaluated by their
Training data set scale ranges from 40 to 90 sequences. (A) Maximum and minimum
(A) The predicted relative strengths of promoter & RBS fit with the measured values using the data of training set. (B) The predicted values fit with the measured values using the data of test set. (C) The comparison results between prediction values and target values (experiment values). (D) The best fitting results of log Trc promoter & RBS activities with their PWM scores.
Owing to the high correlation and accurate prediction performance, the model NET90_19_576 can be effectively developed into a computational platform for quantitative design of novel regulatory parts. Our quantitative design strategy was achieved by consequential
(A) Change in sequence strength caused by mutation at each single site. Red indicates a positive mutation and blue a negative one; a deeper color means a more significant change in strength. Each box represents one base in the sequence. The number in each box is the position of that base, and the subscript indicates the mutation applied to it (e.g., A→C means A mutated to C, and T→G means T mutated to G). (B) Conservation analysis of high-activity sequences (strength >1). Bases in boxes are conserved positions. ‘+/−’ indicates that the position is predicted to be a positive/negative ‘key-point’; the same applies below. (C) Conservation analysis of extremely low-activity sequences (strength <0.1). The analysis was performed using online WebLogo Tool (
To further verify the effectiveness of our design, sixteen novel Trc promoters & RBS sequences (s01–s08 designed from pre-generated library by approach i, s11–s15 generated from random mutagenesis by approach i, and s21–s23 designed by approach ii) were synthesized
(A) Sequence with desired strength can be designed by the following strategies: i) 8 out of 10,000 sequences (s01–s08) are randomly selected from an
The aforementioned work shows that predicting the strength of a randomized part and designing a new part are both feasible. To further validate the methodology, the GFP reporter needs to be replaced with metabolic enzymes to test whether the designed parts are functionally reliable. Herein we attempted to apply these quantitatively designed regulatory elements in different genetic contexts in strain
(A) Sketch maps of plasmids for designed elements applications. Plasmids s21-
The second case is to fine-tune the expression of 1-deoxy-D-xylulose-5-phosphate synthase gene (
Constructing computational models that can precisely predict the strength of a regulatory element, and then using them to quantitatively build regulatory elements with a desired strength, has been a real challenge in the field of gene expression for decades. Many non-linear or unknown relationships between the sequences of regulatory elements and their strengths are still waiting to be uncovered
In contrast to the existing prediction models
Previous studies have confirmed that certain promoters can be identified or predicted based on ANN method
During the library construction process, we found that a large fraction of clones were negative mutants and that the probability of picking a positive mutant was less than 0.5%. In contrast, five designed elements with desired strength >1.0 were experimentally verified. These results demonstrate that the present methodology is well suited to obtaining a large number of elements of different strengths without laborious experimental screening, especially stronger elements. However, we cannot design an element with a relative strength greater than the maximum value in the training data set (3.559); this limit is set by the strength range of the samples used for model training.
With the rapid development of synthetic biology, quantitative characterization and standardization of regulatory elements will be generally valuable for predicting parts in ever-increasing genome sequence data