Figure 1.
State-transition diagram for a PPHMM implementing a reversible and affine background indel model for a phylogeny branch of length t.
Ovals denote emitting states; arrows denote transitions. Start and stop are special non-emitting states. This model can be implemented in 33 lines of SEAL code. Parameter s = 1-(1-b∞)(1-b(t)) gives the probability of leaving the background model, where b(t) is the gain-of-function probability and b∞ = lim_{t→∞} b(t). Parameters α and β influence indel rates.
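As a worked illustration of the parameter s defined above, the following minimal sketch evaluates s = 1-(1-b∞)(1-b(t)) from two supplied probabilities. The function name and argument names are hypothetical, and no functional form for b(t) is assumed: the caller passes the values of b(t) and b∞ directly.

```python
def leave_background_probability(b_t: float, b_inf: float) -> float:
    """Return s = 1 - (1 - b_inf) * (1 - b_t).

    b_t   : gain-of-function probability b(t) on a branch of length t
    b_inf : limiting gain probability, b_inf = lim_{t -> inf} b(t)
    Both names are illustrative; the values are supplied by the caller.
    """
    for name, value in (("b_t", b_t), ("b_inf", b_inf)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1], got {value}")
    # The background model is left unless both "no gain" events occur.
    return 1.0 - (1.0 - b_inf) * (1.0 - b_t)


if __name__ == "__main__":
    # Arbitrary illustrative inputs, not values from the paper.
    print(leave_background_probability(b_t=0.5, b_inf=0.5))
```

Read this way, s is the complement of the product of two "no gain" probabilities, so s = 0 only when both b(t) and b∞ are zero.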
Figure 2.
PPHMMs for loss (top) and gain (bottom) of function in a binding site.
Ovals are emitting states. The top half of an emitting state denotes the functional class in the parent, while the bottom half denotes the functional class in the child. Dash denotes a gap. bg denotes the background functional class. W_i denotes the functional class corresponding to the i-th column in a positional weight matrix (PWM). Transition probabilities are derived from the background indel model. Emission probabilities are derived via a substitution mixture model.
Figure 3.
An example CRM evolution model that can be implemented in our framework.
Parallelograms denote groups of states in the PPHMM; small parallelograms denote states implementing a binding site profile (positional weight matrix). b(t): gain probability. q(t): loss probability. p(t): retention probability. b∞: limit of b(t) as t→∞. t: branch length. s: 1-(1-b∞)(1-b(t)). ε: 0.00001. Plus and minus denote strand. See Materials and Methods for additional details.
Figure 4.
Site-level prediction accuracy as a function of the number of species in EVOS simulation runs (the simulator and predictor modeled the same number of species).
Table 1.
Alignment accuracy for PSPE simulation runs, averaged across runs (CRMs).
Table 2.
Binding-site prediction accuracy for PSPE simulation runs.
Table 3.
Site-prediction accuracy on 17 Drosophila developmental enhancers.
Figure 5.
Histogram of the number of extant Drosophilids predicted to share a given site, for known sites (top pane) and novel predicted sites (bottom pane), over a 10-way phylogeny.
Figure 6.
ROC-like curve for MAFIA (blue) applied to ten species, rMonkey (red) applied to six species, and the “gold standard” (gold).
Sensitivity is plotted on the y-axis and false-positive rate on the x-axis. Each point corresponds to a different stringency threshold in the processed ChIP-seq data.
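One way such points could be computed is sketched below. It assumes, as a reading of the caption rather than a description of the authors' pipeline, that the predictions are held fixed while the ChIP-seq threshold redefines which positions count as bound. The function name, argument names, and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_like_points(
    chip_scores: Sequence[float],
    predicted: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Tuple[float, float]]:
    """Return one (false-positive rate, sensitivity) pair per threshold.

    chip_scores : per-position ChIP-seq signal (assumed input)
    predicted   : per-position site predictions, held fixed
    thresholds  : stringency cut-offs; score >= threshold counts as bound
    """
    if len(chip_scores) != len(predicted):
        raise ValueError("chip_scores and predicted must have equal length")
    points = []
    for thr in thresholds:
        tp = fp = fn = tn = 0
        for score, pred in zip(chip_scores, predicted):
            bound = score >= thr
            if bound and pred:
                tp += 1
            elif bound:
                fn += 1
            elif pred:
                fp += 1
            else:
                tn += 1
        # Undefined rates (empty denominator) are reported as NaN.
        sensitivity = tp / (tp + fn) if tp + fn else float("nan")
        fpr = fp / (fp + tn) if fp + tn else float("nan")
        points.append((fpr, sensitivity))
    return points


if __name__ == "__main__":
    # Toy data for illustration only.
    scores = [0.9, 0.8, 0.2, 0.1]
    calls = [True, False, True, False]
    print(roc_like_points(scores, calls, thresholds=[0.5, 0.85]))
```

Because the threshold moves the reference set rather than the predictor's cut-off, the resulting curve need not be monotone, which is consistent with calling it "ROC-like" rather than a standard ROC curve.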
Figure 7.
An example D. melanogaster developmental enhancer.
At top are F-Seq scores from ChIP-seq data for six transcription factors (kr = kruppel, kni = knirps, hb = hunchback, gt = giant, cad = caudal, bcd = bicoid); curves were scaled to maximize visual impact for the figure. Predictions and known sites are shown below, with colors denoting factor identity as per the F-Seq curves (factor tailless was not assayed in the ChIP-seq experiments and is shown in white). Plus and minus tracks correspond respectively to sense and antisense strands of the dm3 assembly for chromosome 3R. The FlyReg track depicts known binding sites according to the “gold standard” (see text). The EMMA track was produced using the –e option for that program.
Figure 8.
Example MAFIA alignments from the CRM shown in Figure 7.
Nucleotides predicted to participate in binding are shown in bold. Weight matrices for factors are shown as sequence logos above (sense strand) or below (antisense strand) the alignment.