Figure 1.
Schematic representation of the modeling framework and combinatorial predictive power of chromatin features.
(A) Schematic illustration of the modeling framework. DNA replication timing (ToR, blue line) and chromatin feature signals (indicated by gradient-filled rectangles) were quantified for each promoter in a 1 kb window centered on its TSS. The resulting input data matrix is shown (bottom left), where feature levels are encoded by colors ranging from dark green to red. TSSs were then randomized according to a permutation P and split into training and test sets. The training set was used to train a Lasso model with 10-fold cross-validation. At each model fit (CV1 to CV10), a TSS is assigned either to the training set (black square) or to the test set (white square). The model was then used to infer the replication timing of promoters in the test set, and model accuracy was evaluated against their experimentally measured replication timing. (B) Cross-validated mean squared error (CV-MSE) as a function of the regularization parameter (log10(λ)) for different Lasso models trained with 10-fold cross-validation. The average CV-MSE is shown as a solid line, with minimum and maximum CV-MSE drawn as dashed lines. A vertical line reaching a CV-MSE curve indicates the value of λ that was used to generate predictions from the corresponding model. The sets of features used for model training are indicated in the legend. (C–E) Predicted versus experimentally measured replication timing of the test set, represented as smoothed color density scatter plots. Model predictions were generated using second-order interactions between CBPs (CBPs2, C), HMs (HMs2, D) and HMs2+CBPs (E). Prediction accuracies are Pearson correlation coefficients. Orange lines indicate the model fit, whereas dashed gray lines indicate the bisector.
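The workflow in panels A and B (second-order interaction terms, a Lasso fit, 10-fold cross-validation over candidate values of λ, and Pearson correlation on a held-out test set) can be illustrated with a minimal sketch. It is not the authors' implementation. The toy data, the candidate λ values, the coordinate-descent solver and all function names are assumptions introduced here, and "second-order interactions" is taken to mean pairwise products of distinct features.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def add_pairwise(row):
    """Append all products x_i * x_j (i < j) to the first-order features."""
    p = len(row)
    return list(row) + [row[i] * row[j] for i in range(p) for j in range(i + 1, p)]


def predict(row, b0, beta):
    """Linear prediction: intercept plus weighted sum of features."""
    return b0 + sum(b * v for b, v in zip(beta, row))


def lasso_cd(X, y, lam, n_sweeps=50):
    """Minimise (1/2n)*||y - b0 - X b||^2 + lam*||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    resid = list(y)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        shift = sum(resid) / n  # unpenalised intercept update
        b0 += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            # Soft-thresholding sets small coefficients exactly to zero.
            new = math.copysign(max(abs(rho) - lam, 0.0), rho) / col_sq[j]
            delta = new - beta[j]
            if delta:
                resid = [r - X[i][j] * delta for i, r in enumerate(resid)]
                beta[j] = new
    return b0, beta


def cv_mse(X, y, lambdas, k=10, seed=0):
    """Mean held-out MSE over k folds for each candidate lambda."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)  # random permutation of the samples
    folds = [idx[f::k] for f in range(k)]
    scores = []
    for lam in lambdas:
        fold_mse = []
        for held in folds:
            held_set = set(held)
            train = [i for i in idx if i not in held_set]
            b0, beta = lasso_cd([X[i] for i in train], [y[i] for i in train], lam)
            fold_mse.append(
                sum((y[i] - predict(X[i], b0, beta)) ** 2 for i in held) / len(held)
            )
        scores.append(sum(fold_mse) / k)
    return scores


def make_toy_data(n, seed):
    """Hypothetical data: 3 features; response depends on x0 and on x1*x2."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0, 1) for _ in range(3)]
        X.append(add_pairwise(row))
        y.append(2.0 * row[0] - 1.5 * row[1] * row[2] + rng.gauss(0, 0.1))
    return X, y


X_train, y_train = make_toy_data(150, seed=1)
X_test, y_test = make_toy_data(50, seed=2)
lambdas = [1.0, 0.1, 0.01]  # assumed candidate grid
scores = cv_mse(X_train, y_train, lambdas)
best_lam = lambdas[scores.index(min(scores))]
b0, beta = lasso_cd(X_train, y_train, best_lam)
accuracy = pearson([predict(r, b0, beta) for r in X_test], y_test)
```

Here `scores` plays the role of the average CV-MSE curve in panel B, and `accuracy` corresponds to the Pearson correlation reported in panels C–E. The λ with the lowest CV-MSE is used in this sketch; the caption does not state which selection rule the authors applied.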
Figure 2.
Feature importance analysis and simplified models.
(A) Values of the model coefficients along the λ-path, i.e. the sequence of values of the regularization parameter λ used to fit the model. The λ-path is truncated at the value of λ used for model predictions. Line thickness is proportional to the total number of models in which a non-zero coefficient is assigned to the corresponding feature. The vertical dashed line denotes the value of λ yielding the selected simplified model, which is based solely on the four indicated terms. (B) Scatter plot of model features according to their z-scores and bootstrap-Lasso selection probabilities (p). Features with are colored in red (positive coefficient values) or blue (negative coefficient values) and their coefficient distributions are shown on the right as violin plots. Features are ranked by decreasing selection probability. (C) Boxplot of prediction accuracies (Pearson correlation coefficient, PCC, on test sets) of 100 Lasso models in which the indicated feature was excluded from the model fit. Rrp6 was used as a control, as stability analysis indicated no significant role for this feature in predicting replication timing. p-values were obtained using a two-sided Wilcoxon rank sum test. (D) Frequency of appearance of chromatin features in four-feature simplified models as a function of model accuracy relative to the full model. Only simplified models reaching at least 60% of the full model accuracy are shown.
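The selection probabilities in panel B can be illustrated with a short sketch. It assumes the common definition of a bootstrap-Lasso selection probability, namely the fraction of bootstrap fits in which a feature receives a non-zero coefficient. The caption does not spell out the definition, and the feature names and coefficient values below are hypothetical.

```python
def selection_probabilities(coef_runs, names, tol=0.0):
    """Fraction of bootstrap fits in which each feature has a non-zero coefficient."""
    n_runs = len(coef_runs)
    probs = {}
    for j, name in enumerate(names):
        selected = sum(1 for coefs in coef_runs if abs(coefs[j]) > tol)
        probs[name] = selected / n_runs
    return probs


# Hypothetical coefficients from four bootstrap Lasso fits (one row per fit).
names = ["feature_a", "feature_b", "feature_c"]
coef_runs = [
    [0.8, -0.3, 0.0],
    [0.7, -0.2, 0.0],
    [0.9, 0.0, 0.1],
    [0.6, -0.4, 0.0],
]

probs = selection_probabilities(coef_runs, names)
# Rank features by decreasing selection probability, as in the panel.
ranked = sorted(names, key=lambda f: probs[f], reverse=True)
```

With these made-up inputs, `feature_a` is selected in every fit and `feature_c` in only one, so a rarely selected feature ends up at the bottom of the ranking.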
Figure 3.
Predicting the replication timing profile of the Drosophila S2 cell genome.
(A) Predicted versus experimentally measured replication timing of the Drosophila S2 cell genome, represented as a smoothed color density scatter plot. Model predictions were generated using the Lasso model based on CBPs and second-order interactions between HMs (HMs2+CBPs) and trained at promoters. Prediction accuracy is the Pearson correlation coefficient. The orange line indicates the model fit, whereas the dashed gray line indicates the bisector. (B, C) Measured (top track) and predicted (middle and bottom tracks, see Methods) replication timing profiles along 6 Mb and 12 Mb of chromosomes 3R (B) and 3L (C), respectively. A color gradient representation of feature signals is shown at the bottom for chromatin features within the bootstrap-Lasso simplified model (K8ac = H4K8ac; K36me1 = H3K36me1; K79me1 = H3K79me1). The yellow rectangle in B highlights the genomic position of the Bithorax Complex.
Figure 4.
Histone modification levels predict replication timing across different cell types.
(A) Predicted versus experimentally measured differences in replication timing between S2 and Bg3 cells at unique promoters (S2-Bg3), represented as a smoothed color density scatter plot. Model predictions were generated using differences in HM levels and their pairwise interactions for a subset of HMs that were profiled in both S2 and Bg3 cell lines (CHMs). The orange line indicates the model fit, whereas the dashed gray line indicates the bisector. (B) Scatter plot of model features according to their z-scores and bootstrap-Lasso selection probabilities (p). Features with
are colored in red (positive coefficient values) or blue (negative coefficient values) and their coefficient distributions are shown on the right as violin plots. Features are ranked by decreasing selection probability. (C, top) Replication timing of promoters in S2 versus Bg3 cells. Differentially replicating promoters are color-coded according to the quadrant (delimited by dashed blue lines) they belong to (red: early replicating in S2 and late replicating in Bg3; green: early in both S2 and Bg3; blue: late in S2 and early in Bg3; aqua: late in both S2 and Bg3). A total of
promoters exhibit a log fold change greater than or equal to 1. (C, bottom) Experimentally determined replication timing in Bg3 versus predictions generated by a model based on pairwise interactions between CHMs in S2 cells. Prediction accuracy is the Pearson correlation coefficient. The dashed gray line indicates the bisector.
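The input used in panel A (differences in HM levels between the two cell lines, plus their pairwise interactions) can be sketched as follows. The caption leaves the exact construction open, so this sketch assumes that the interaction terms are products of the per-mark differences. The function name and the signal values are hypothetical.

```python
def difference_features(levels_a, levels_b):
    """Per-mark differences (a - b) followed by their pairwise products (i < j)."""
    diffs = [a - b for a, b in zip(levels_a, levels_b)]
    p = len(diffs)
    interactions = [diffs[i] * diffs[j] for i in range(p) for j in range(i + 1, p)]
    return diffs + interactions


# Hypothetical signal levels of three shared histone marks at one promoter.
s2_levels = [2.0, 0.5, 1.0]
bg3_levels = [1.0, 1.5, 0.5]

features = difference_features(s2_levels, bg3_levels)
```

Each promoter then contributes one such feature vector, and the response is the corresponding difference in replication timing (S2-Bg3).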