A Predictive Model of the Oxygen and Heme Regulatory Network in Yeast

doi:10.1371/journal.pcbi.1000224

Figure 2

A schematic flow chart showing the algorithmic steps for learning the oxygen regulatory program with MEDUSA.

(A) The mRNA expression data is discretized into three states, up (over-expressed), down (under-expressed), and baseline (not significantly differentially expressed), and genes are partitioned into potential regulators (transcription factors and signal transducers) and targets. The regulators are also included in the list of target genes so that their transcriptional regulation can be modeled. (B) The MEDUSA learning algorithm is presented with the promoter sequences of target genes, the discretized expression profiles of the regulators across multiple conditions, and the differentially expressed (up and down) target gene examples from these experiments. Baseline examples are not used to train MEDUSA. In the first stage of training, MEDUSA considers rules based on promoter sequence data and regulator expression states. MEDUSA uses a boosting strategy to avoid overfitting over many rounds of the algorithm. At each iteration i, a motif/regulator rule is chosen based on the current weights on the training examples; this rule predicts that targets whose promoters contain the motif will go up (or down) in experiments where the regulator is over- (or under-) expressed. Before the next iteration, the examples are reweighted to emphasize the ones that are difficult to predict. (C) To learn the sequence motif, the algorithm agglomerates predictive k-mer sequences to produce candidate PSSMs, and it optimizes both the choice of PSSM and the probabilistic threshold used to determine where the hits of the motif occur. (D) At the end of each round of training, motif/regulator rules are placed into an alternating decision tree, building a global regulatory program. This regulatory program can be used to predict target gene up/down regulation for gene-experiment examples that were not seen in training.
In order to produce a more stable decision tree, we perform a second pass of the tree-learning algorithm using a stabilized variant of boosting that gives more consistent models over different subsets of the training data. At this stage, both the motifs learned previously by MEDUSA and TF occupancies from ChIP-chip experiments are used as sequence features for the final regulatory program.
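
The discretization in (A) and the reweighting step in (B) can be illustrated with a minimal Python sketch. This is not the MEDUSA implementation: it uses a generic AdaBoost-style update for a rule that abstains when it does not apply, and the cutoff `threshold=1.0`, the `smoothing` constant, the function names, and the toy data are all assumptions made for illustration.

```python
import math

def discretize(log_ratio, threshold=1.0):
    """Map a log expression ratio to +1 (up), -1 (down), or 0 (baseline).
    The cutoff value is an assumed placeholder, not taken from the paper."""
    if log_ratio >= threshold:
        return 1
    if log_ratio <= -threshold:
        return -1
    return 0

def rule_prediction(has_motif, regulator_state):
    """Hypothetical rule: 'motif present AND regulator up' predicts target up.
    Returns 0 (abstain) whenever the rule does not apply."""
    return 1 if (has_motif and regulator_state == 1) else 0

def boosting_round(labels, predictions, weights, smoothing=1e-6):
    """One generic boosting round for an abstaining rule.
    Returns the rule's score and the renormalized example weights."""
    w_correct = sum(w for y, h, w in zip(labels, predictions, weights)
                    if h != 0 and h == y)
    w_wrong = sum(w for y, h, w in zip(labels, predictions, weights)
                  if h != 0 and h != y)
    alpha = 0.5 * math.log((w_correct + smoothing) / (w_wrong + smoothing))
    # Misclassified examples gain weight; abstained ones are only renormalized.
    raw = [w * math.exp(-alpha * y * h)
           for y, h, w in zip(labels, predictions, weights)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

# Toy gene-experiment examples: (motif in promoter, regulator state, target log ratio)
toy = [
    (True, 1, 1.8),
    (True, 1, 2.1),
    (True, 1, -1.5),
    (False, 1, -1.2),
    (True, 0, 0.2),   # baseline target, dropped before training
]

# Keep only differentially expressed (up/down) targets, as in panel (B).
train = [(m, r, discretize(x)) for m, r, x in toy if discretize(x) != 0]
labels = [y for _, _, y in train]
predictions = [rule_prediction(m, r) for m, r, _ in train]
weights = [1.0 / len(train)] * len(train)

alpha, new_weights = boosting_round(labels, predictions, weights)
print(alpha, new_weights)
```

In this toy run the third example, which the rule gets wrong, ends up with the largest weight, so a rule chosen in the next round would be pushed to explain it.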

doi: https://doi.org/10.1371/journal.pcbi.1000224.g002