Fig 1.
(A) In vivo tested sequences from VISTA [4] considered in this study. For both limb-enhancers (left) and sequences not active in the developing limbs (right), pie charts show the overlap with H3K27ac peaks and/or DNase I hypersensitive sites. (B) Schematic of the different classes of chromatin and sequence features considered in this study. (C) Summary of the machine learning strategy. After calculation of the relevant chromatin and sequence features for all observations, the data were partitioned into ten equally sized bins, retaining the original ratio of positive to negative observations. Training was performed using 10-fold cross-validation (CV), separately for each model (LASSO, RF, SVM) and each category of features (chromatin, sequence). The performance of these models, as well as of their combinations, was evaluated on the ten independent, non-overlapping test sets. Models were then trained using the entire set of observations, and genome-wide predictions were made available through a track hub (see Methods) for the UCSC genome browser [25] and through a user-friendly web interface at http://leg.lbl.gov/.
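The partitioning step in (C) can be illustrated with a minimal sketch. It assumes binary labels and deals the indices of each class round-robin across the bins; the function names `stratified_folds` and `cv_splits` and the toy labels are hypothetical and are not taken from the study's code.

```python
import random

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each observation index to one of n_folds bins so that every
    bin keeps roughly the overall ratio of positives to negatives."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the indices of this class round-robin across the bins.
        for k, i in enumerate(idx):
            folds[k % n_folds].append(i)
    return folds

def cv_splits(folds):
    """Yield (train_indices, test_indices); each bin is the test set once."""
    for k, test in enumerate(folds):
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, test

# Toy labels (illustrative only): 30 positives, 70 negatives.
labels = [1] * 30 + [0] * 70
folds = stratified_folds(labels)
splits = list(cv_splits(folds))
```

With these toy labels every bin holds 3 positives and 7 negatives, and the ten test sets never share an observation.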
Fig 2.
Limb-specific chromatin features accurately predict limb-enhancers.
(A) Box plots showing the AUROC estimated on the ten held-out test sets (one per CV fold), considering an increasingly large set of chromatin features (left to right, outliers not shown). (B) Same as (A) but showing the AUPRC. (C) UCSC genome browser snapshots of two representative loci. Validated limb enhancers (bright red elements) showed different features from nearby regions that tested negative in vivo (blue). In particular, they displayed a higher DNase I enrichment (compare DNase I to Headless embryo). De novo limb-enhancers predicted based on combined models are also shown (dark red).
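For readers unfamiliar with the two metrics in (A) and (B), the following sketch computes them for a single test set. It assumes labels coded as 1 (positive) and 0 (negative) and uses average precision as one common estimator of the AUPRC; the study may have used a different estimator. The function names and toy values are hypothetical.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half), which equals the area under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive.
    Tied scores keep their input order; at least one positive is assumed."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy test set (illustrative only).
y_true = [1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.1]
roc = auroc(y_true, y_score)
ap = average_precision(y_true, y_score)
```

Repeating this calculation on each of the ten test sets gives the ten values that a box plot of this kind summarizes.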
Fig 3.
Biological interpretation of the chromatin models based on feature importance.
For each macro-category (on the far left), each dataset considered is indicated (stage, structure and reference publication are specified), followed by three distinct plots showing (left to right): (1) a box plot overlaid on a violin plot showing the distribution of the coefficients assigned to each particular feature by LASSO; (2) the selection probability as estimated by Bootstrap LASSO (darker shades of green indicating higher probability); (3) the feature importance estimated as mean decrease in accuracy by RF (separately for the positive and the negative classes, indicated by red and light blue bars, respectively).
Fig 4.
Modeling limb-enhancer sequence composition and its integration with chromatin information.
(A) AUROC estimated on the ten held-out test sets (one per CV fold) for all the indicated models (Sequence, Chromatin or their combination, S+C). (B) Same as (A) but showing AUPRC. (C) Bar plot showing the average coefficient from the LASSO models for those sequence features selected in at least 9 of the 10 splits during LASSO training and ranked in the top 25% in terms of mean decrease in accuracy, as estimated during RF training. (D) Box plot showing the distribution of the total number of publications on limb development for the genes shown in (C) (Top, orange) as well as for those TFs whose corresponding features were selected in at least one CV fold but fewer than nine (Low, light yellow) or in none (No, grey) (outliers not shown).
Table 1.
LEG performance compared to two state-of-the-art approaches.
Fig 5.
LEG predicts bona fide limb-enhancers genome-wide.
(A) Overall enrichment scores (see Methods) for the indicated functional terms based on the proximity of the newly predicted elements to the genes annotated within each category (see also S6 Fig). (B) UCSC genome browser [25] snapshots showing the landscape at two previously in vivo validated limb-enhancers that were not part of the training set but were identified in the top 10,000 predictions. The region on the left is the ZRS (ZPA Regulatory Sequence), a known regulatory element for Shh [43]; the one on the right is an intronic enhancer of Tfap2a [44]. (C) UCSC genome browser snapshot of the Hand2 gene locus. The probability of being a limb-enhancer (Ridge model) is shown along with the top 5,000 predictions from both the Ridge Regression (RR) and the Sum Of Ranks (SOR) combined models. The four elements tested for activity in the developing limbs are highlighted in boxes (green for those showing activity in the limbs at E11.5, red if negative). LacZ reporter staining (blue) indicates enhancer activity in the fore- and hindlimb mesenchyme at E11.5. One representative whole-mount picture is shown for each tested element. Pictures of a representative forelimb and hindlimb are provided for the validated enhancers. Reproducibility is indicated in brackets below each whole-mount picture, along with the corresponding VISTA identifier. The ranks for both combined scores (RR, SOR) are also reported. Scale bar, 100 μm.
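The idea behind a sum-of-ranks combination such as the SOR model in (C) can be sketched as follows. The legend does not spell out the procedure, so this sketch assumes that each candidate region is ranked separately under each model (rank 1 for the highest score) and that the ranks are then added, with a lower total indicating a stronger prediction. The function names and scores are hypothetical.

```python
def ranks(scores):
    """Rank 1 goes to the highest score; ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        result[i] = position
    return result

def sum_of_ranks(*model_scores):
    """Add up, for each region, its rank under every model."""
    per_model = [ranks(s) for s in model_scores]
    return [sum(rs) for rs in zip(*per_model)]

# Toy scores for four candidate regions (illustrative only).
chromatin = [0.9, 0.2, 0.6, 0.4]
sequence = [0.7, 0.1, 0.8, 0.3]
combined = sum_of_ranks(chromatin, sequence)
# Regions ordered from strongest to weakest combined prediction.
top = sorted(range(len(combined)), key=lambda i: combined[i])
```

A rank-based combination of this kind depends only on the ordering within each model, so it is unaffected by differences in how the individual models scale their scores.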