Fig 1.
Utilizing AkitaV2 to extract CTCF-directed sequence preferences and grammar.
Conceptual summary of the analyses performed in this study. Left: the two main approaches, genomic disruption and virtual insertion. Genomic disruption involves permuting the nucleotides of a CTCF site within its genomic context, while virtual insertion entails inserting a CTCF site into a featureless background sequence. Sequences and sites are shown as cartoon sequences with illustrative predicted maps below. Right: six types of virtual insertion experiments reveal distinct aspects of CTCF-site grammar. Three experiments tested single-site rules: (1) the impact of flanking sequences, (2) the compatibility between core motifs and their flanking sequences, and (3) nucleotide-level mutagenesis. Three experiments tested multi-motif grammar: (4) varying numbers of CTCF sites within a cluster, (5) varying spacing between sites, and (6) varying site orientation. Cartoon sequences represent the parameters tested in these experiments.
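The two sequence manipulations described in this caption can be sketched as plain string operations. This is a minimal illustration, not the study's pipeline: the function names `disrupt_site` and `insert_site` are hypothetical, the background is made with a simple mononucleotide shuffle, and the insertion is assumed to overwrite background bases so that sequence length stays constant. Scoring the resulting sequences requires the model itself and is not shown.

```python
import random

def disrupt_site(seq, start, end, rng):
    """Permute the nucleotides of seq[start:end]; the surrounding context is untouched."""
    site = list(seq[start:end])
    rng.shuffle(site)
    return seq[:start] + "".join(site) + seq[end:]

def insert_site(background, site, center):
    """Overwrite a window of `background`, centered on `center`, with `site`.
    Assumption: insertion replaces bases rather than lengthening the sequence."""
    start = center - len(site) // 2
    return background[:start] + site + background[start + len(site):]

rng = random.Random(0)
# Toy stand-in for a genomic window (random bases, illustration only).
genome = "".join(rng.choice("ACGT") for _ in range(200))
site = genome[90:109]  # a 19bp window playing the role of a CTCF core motif

# Genomic disruption: shuffle the site in place.
disrupted = disrupt_site(genome, 90, 109, rng)

# Virtual insertion: shuffle the whole window to make a background, then insert the site.
background = "".join(rng.sample(genome, len(genome)))
inserted = insert_site(background, site, center=100)
```

Both operations preserve sequence length, so the same model input size can be used for the reference and the perturbed sequence.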
Fig 2.
Disruption scores highlight impactful epigenomic features.
A) Scatterplots showing disruption scores vs. genomic features at n = 9,991 autosomal CTCF sites, profiled with single-molecule footprinting (SMF) [31], which categorizes sites as bound, nucleosome occupied, or accessible. We used the complete set of SMF-profiled CTCF sites and disrupted genomic sequence via permutation in silico. The first column displays disruption score vs. (i) frequency of being bound or (ii) accessible. The other subplots show disruption scores vs. the following genomic features: (iii) cohesin (Rad21) ChIP-seq signal [60], (iv) CTCF ChIP-seq signal, (v) conservation score (phyloP), and (vi) PWM score, with dots colored by their SMF bound frequency. ChIP-seq signal is quantified as the sum in a ±100bp window around each CTCF site. B) Matrix of pairwise correlations between disruption scores and genomic features of n = 9,991 autosomal CTCF sites. C) Partial correlation coefficients between disruption scores and subsets of genomic features from panel B, adjusting for mutual influences among these features. Partial correlations computed controlling for CTCF and cohesin ChIP-seq either from [60] (left) or [52,61] (right) are similar qualitatively and quantitatively.
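For readers unfamiliar with the partial correlations in panel C, one common way to compute them is to regress the covariates out of both variables and correlate the residuals. The sketch below uses that residual approach on synthetic data; whether the study used this exact estimator is an assumption, and the function name `partial_corr` and all variables are hypothetical.

```python
import numpy as np

def partial_corr(x, y, covariates):
    """Pearson correlation between x and y after linearly regressing out
    `covariates` (shape n x k) from both. Residual-based estimator."""
    design = np.column_stack([np.ones(len(x)), covariates])
    resid_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    resid_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(resid_x, resid_y)[0, 1])

# Synthetic example: x and y are related only through a shared covariate z.
rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
x = z + 0.5 * rng.normal(size=n)
y = z + 0.5 * rng.normal(size=n)

raw_r = float(np.corrcoef(x, y)[0, 1])        # inflated by the shared covariate
partial_r = partial_corr(x, y, z[:, None])    # close to zero once z is controlled for
```

In this toy case the raw correlation is high while the partial correlation is near zero, which is the kind of mutual influence among features that the adjustment is meant to remove.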
Fig 3.
A virtual insertion strategy reveals the impact of flanking sequences.
A) Virtual insertion strategy assesses individual CTCF site impacts. We generated background sequences by shuffling genomic sequences such that they produce mostly-featureless predicted maps. A CTCF site (green box) along with its flanking sequences (pink box) is then inserted into these background sequences (in gray). Using the sequence with an insertion as input, we generated predicted maps and quantified the impact as an insertion score. B) Scatterplot of insertion versus disruption scores, for n = 7,560 CTCF sites (PearsonR > 0.91). Sites were obtained by intersecting sites from JASPAR with mESC boundaries and filtering for lack of overlap with repetitive elements within +/- 20bp or other CTCF sites within +/- 60bp. Scores were averaged across all six mouse outputs (i.e. cell types) and all eight models. Insertion scores were additionally averaged over ten background sequences. Histograms show log density along each scatterplot axis, as the majority of sites exhibit both low insertion and disruption scores. Given this, for further analysis we selected the 1250 sites with the highest disruption scores and chose an additional 250 sites randomly from the remaining pool. C) Flanking sequence length versus insertion score for the analysis set of n = 1,500 CTCF sites. Flanking sequence was varied from 0bp (19bp core motif only) up to 35bp, depicted as cartoons above the plot. Genomic flanking sequences were symmetrically extracted around each CTCF site. For visualization, sites were divided into five groups based on their insertion score with 30bp flanks. Smoothed lines show the mean for each group, and shaded bands show the 25th to 75th percentiles. To illustrate the variability among sites, we show 10 sites chosen randomly from the strongest group as navy lines. D) Predicted contact maps illustrate the impact of increasing flanking sequence lengths for a strong CTCF site. 
Sequence of inserted CTCF site and flanks obtained from chr15:101,984,508–101,984,527 in the mouse genome. E) Heatmap of nucleotide composition around 150 strong CTCF sites (±15bp). Rows ordered by insertion score. F) Sequence logos for the sequences with top 150 and bottom 150 insertion scores highlight core motifs and flanking preferences. A gray underline indicates the CTCF core motif in the top 150 logo, with black arrows pointing to positions 6, 7, and 12. The CTCF consensus logo from JASPAR (MA0139.1) is aligned below the logos, and the weighted Jensen-Shannon difference between the top 150 CTCF core sequences and the CTCF consensus is displayed for visual comparison of sequence preferences.
Fig 4.
CTCF core and flanking sequences are broadly compatible.
A) Illustration of the test for compatibility between core and flanking sequences by assessing all possible combinations of flanks and cores classified into three strength groups. Each row represents a distinct 19bp core motif sequence and each column represents a distinct pair of 30bp flanking sequences adjacent to the core motif. B) Distributions of insertion scores for pairs of core and flanking sequences around 100 strong, 100 medium, and 100 weak CTCF sites. Each histogram shows 10,000 ($100^2$) combinations. Sites were classified as strong, medium, and weak based on their combination of core and flanking sequence seen in the mouse genome. Distributions for original genomic core-flank pairs shown in red (with count scaled by 100), synthetic combinations shown in gray. C) Scatterplot of insertion scores (panel D) versus approximate values obtained through SVD (panel E). Their high correspondence indicates that predicted strengths are largely multiplicative and core and flanking sequences are largely compatible. D) Heatmap of insertion score for 300 CTCF core and 300 flanking sequence pairs. Each row corresponds to a different core sequence, while each column represents a different flanking sequence. Rows and columns are ordered by the insertion score of the core-flank combination that occurs in the genome (i.e. by values along the diagonal). E) Heatmap of approximate insertion strength $M_{i,j}$ obtained via SVD for 300 CTCF core and 300 flanking sequence pairs. $M_{i,j}$ represents the predicted insertion strength for the combination of the core $i$ and flank $j$. Using SVD, $M = U D V^{T}$, we found $M_{i,j} \approx D_0 \, (U_0)_i \, (V_0)_j$, where $U_0$ and $V_0$ capture core and flank strengths, respectively. Rows and columns are ordered as panel D.
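The rank-1 approximation described in panel E can be reproduced on toy data with a standard SVD. The sketch below builds a synthetic core-by-flank matrix that is multiplicative by construction (plus a little noise) and keeps only the leading singular triplet; the matrix, its size, and the variable names are illustrative assumptions, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "core strength" and "flank strength" vectors (hypothetical values).
core_strength = rng.uniform(0.1, 1.0, size=30)
flank_strength = rng.uniform(0.1, 1.0, size=30)

# Multiplicative score matrix with small additive noise: rows = cores, columns = flanks.
M = np.outer(core_strength, flank_strength) + 0.01 * rng.normal(size=(30, 30))

# SVD: M = U @ diag(D) @ Vt, singular values in descending order.
U, D, Vt = np.linalg.svd(M, full_matrices=False)

# Rank-1 approximation: M[i, j] ~ D[0] * U[i, 0] * Vt[0, j].
M_rank1 = D[0] * np.outer(U[:, 0], Vt[0])

# Agreement between the full matrix and its rank-1 approximation.
r = float(np.corrcoef(M.ravel(), M_rank1.ravel())[0, 1])
explained = float(D[0] ** 2 / np.sum(D ** 2))
```

When the first singular value accounts for nearly all of the squared signal, as here, each entry is well described by a product of one per-row factor and one per-column factor, which is what "largely multiplicative" means in panel C.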
Fig 5.
Pairwise nucleotide dependencies are largely additive in core CTCF motifs and their flanking sequences.
A) Pairwise mutagenesis. Mutation score for pairs of mutations in the 19bp core motifs +/- 15bp flanks for 100 strong CTCF sites. Mutation score is calculated as the difference between insertion scores for the mutant versus the unperturbed sequence. The heatmap shows the average mutation score for each pair of positions. B) Single-nucleotide saturation mutagenesis of the 19bp core motifs +/- 15bp flanks for the same sequences in A. The heatmap presents the average over all CTCF sites for each possible substitution. C) Predicted additive impact of pairwise mutagenesis. The predicted additive pairwise impact is the sum of the average single-nucleotide impacts (panel B). Note the shared color scale across panels A-C. D) Scatterplot of predicted additive and observed pairwise mutagenesis effects from panels A,B. For pairs of weak mutations, impacts are largely additive (up to mutation scores of -40). Higher impact mutations (i.e. more negative mutation scores) appear to saturate and diverge from this linear trend.
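The additive prediction in panel C amounts to summing per-position single-mutation effects for every pair of positions. The sketch below illustrates that bookkeeping with made-up numbers. The "observed" values are a toy stand-in that simply floors the additive sum at -40 to mimic the saturation described in panel D; that floor is an assumption for illustration, not a model taken from the study.

```python
import numpy as np

# Hypothetical average single-position mutation scores (more negative = more disruptive).
single = np.array([-1.0, -5.0, -30.0, -25.0, -2.0])

# All unordered pairs of distinct positions (i < j).
i, j = np.triu_indices(len(single), k=1)

# Additive prediction: the pairwise effect is the sum of the two single effects.
predicted = single[i] + single[j]

# Toy "observed" pairwise effects that saturate at -40 (illustrative assumption).
observed = np.maximum(predicted, -40.0)

# Deviation from additivity; nonzero only where the additive sum exceeds the floor.
gap = observed - predicted
```

Pairs of weak mutations fall on the diagonal of a predicted-versus-observed scatterplot, while pairs of strong mutations depart from it once the combined effect saturates.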
Fig 6.
CTCF grammar depends on number and spacing.
A) Insertion score versus number of inserted CTCF sites. Averages over five groups of n = 1,500 CTCF sites plotted as in Fig 3C. Shaded areas indicate the 25th to 75th percentiles for each group. Variability among sites is highlighted using 10 randomly chosen sites from the strongest group (dark navy lines). All sites inserted in a rightward orientation with 30bp flanks and 180bp spacing between cores. Note that with this spacing, 10 inserted sites constitute about 2kb or one bin. B) Insertion score as a function of spacing for four possible orientations for 300 CTCF sites (the strongest 20% from A), also with 30bp flanks. Average across sites shown for each orientation, with variability indicated by 25th to 75th percentile bands. C) Predicted log-transformed observed/expected contact frequency maps by CTCF pair orientation and spacing for insertion of a representative CTCF site (sequence from chr2:93,199,043–93,199,062). D) As in B), zooming into the 0-5kb region.
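The number, spacing, and orientation experiments all rely on placing several copies of a site into one background sequence. A minimal sketch of such a cassette builder follows. The function name `build_cassette`, the `'>'`/`'<'` orientation encoding, and the convention that `spacing` is the gap between consecutive inserted units are assumptions made for illustration; the study's own spacing convention may differ.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string; characters other than ACGT are left as-is."""
    return seq.translate(COMPLEMENT)[::-1]

def build_cassette(background, site, orientations, spacing, center):
    """Overwrite `background` with one copy of `site` per character of
    `orientations` ('>' = as given, '<' = reverse complement), leaving
    `spacing` bp of background between consecutive copies, centered on `center`."""
    n_sites = len(orientations)
    total = n_sites * len(site) + (n_sites - 1) * spacing
    start = center - total // 2
    seq = list(background)
    for k, orientation in enumerate(orientations):
        unit = site if orientation == ">" else reverse_complement(site)
        pos = start + k * (len(site) + spacing)
        seq[pos:pos + len(unit)] = unit
    return "".join(seq)

# Toy example: two sites pointing toward each other, 10bp apart, in a placeholder background.
toy_background = "N" * 1000
toy_site = "AACCG"
cassette = build_cassette(toy_background, toy_site, "><", spacing=10, center=500)
```

Changing the `orientations` string (for example `">>"`, `"<<"`, `"><"`, `"<>"`) and the `spacing` argument covers the four orientations and the spacing sweep described in panels B to D.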
Fig 7.
CTCF sites do not mediate feature-specific genome folding.
A) Illustration of the test for CTCF feature specialization using two distinct layouts: (i) a ‘boundary’ with two divergent sites, 180bp apart, versus (ii) a ‘dot’ with two convergent sites, 400kb apart. CTCF insertions are shown as green rectangles (core motifs) with pink flanks (30bp); arrows indicate the orientation of the CTCF site. B) Scatterplot of boundary vs. dot strength (n = 1,500 CTCF sites). Boundary strength is the overall intensity of the map; dot strength is the local average signal within versus around the dot. Six CTCF sites spanning a range of strengths are highlighted with colored dots. C) Predicted maps for boundary and dot scenarios for the six highlighted CTCF sites in panel B. D) Predicted maps for symmetric (<<<>>>) and asymmetric (<<<<<<) insertions for a cassette of six CTCF sites into the middle of a background sequence. E) Insulation scores calculated using sliding diamond windows of three sizes (32.7kb, 131kb, 262.1kb), shown for the central 163.84kb of the map. Note that insulation minima display an offset for the asymmetric case, and that the same coloring is used for window sizes in E and F. F) Insulation minima offset for indicated CTCF cassette insertions. Insulation offset is the position of the insulation score minimum relative to the center of the sequence (window sizes of 32.7kb, 131kb, 262.1kb). Each point represents the average across 100 strong CTCF site insertions. Note the insulation offset increases with the asymmetry of the inserted CTCF site configuration and is more pronounced for larger window sizes. G) Histogram of bins around disruption-sensitive TAD boundaries stratified by their orientation. To obtain bin orientation, we disrupted sequences in non-overlapping 2048 bp bins around each TAD boundary, took the bin with the highest disruption score, and assigned a left (pink) or right (blue) orientation if all CTCF sites within the bin were aligned in the same orientation. Bins without assigned orientations are shown in grey.
The black line shows the smoothed total number of disruption-sensitive bins (across all orientations).
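The sliding diamond-window insulation score and the insulation offset from panels E and F can be illustrated on a toy contact map. The sketch below is one common formulation (mean contact in a square that straddles each bin, which appears as a diamond on a rotated map); the exact window definition used in the study is an assumption, and `insulation_profile` and the toy map are hypothetical. With 2,048bp bins, the three window sizes in the caption would correspond to 16, 64, and 128 bins.

```python
import numpy as np

def insulation_profile(contact_map, window):
    """Mean contact between the `window` bins upstream and the `window` bins
    downstream of each bin. Edge bins without a full window are left as NaN."""
    n = contact_map.shape[0]
    score = np.full(n, np.nan)
    for i in range(window, n - window):
        score[i] = contact_map[i - window:i, i + 1:i + window + 1].mean()
    return score

# Toy map: two 20-bin domains with strong internal contacts and weak contacts between them.
n_bins = 40
toy_map = np.full((n_bins, n_bins), 0.1)
toy_map[:20, :20] = 1.0
toy_map[20:, 20:] = 1.0

score = insulation_profile(toy_map, window=5)

# Insulation offset: position of the minimum relative to the center of the map, in bins.
minimum_bin = int(np.nanargmin(score))
offset_bins = minimum_bin - n_bins // 2
```

For a boundary placed at the center of the map the offset stays within one bin of zero; an asymmetric cassette would shift the minimum away from the center, which is the quantity plotted in panel F.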