Identifying Cis-Regulatory Sequences by Word Profile Similarity

doi:10.1371/journal.pone.0006901

Figure 1.

WPH-finder: Finding putative co-regulated CRMs (WPHs).

To identify putative co-regulated sequences given a known CRM, we first split the CRM into overlapping windows to allow us to leverage closely linked word or motif combinations. Each of these windows is represented by its word counts, or its word profile, which is then used to identify similar word profiles in the genome. A set of WPHs for a given CRM window consists of genome sequence windows whose word profiles are similar to the word profile of the CRM window, as measured by our similarity score Z.
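
The windowing and word-profile steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the window size, the step, and the word length `k` are all assumptions chosen for the example, and the similarity score Z is not implemented because the caption does not define it.

```python
from collections import Counter

def windows(seq, size, step):
    """Yield (start, subsequence) for overlapping windows (size and step are assumed values)."""
    last = max(len(seq) - size, 0)
    for start in range(0, last + 1, step):
        yield start, seq[start:start + size]

def word_profile(window, k):
    """Count every overlapping word of length k in a window."""
    return Counter(window[i:i + k] for i in range(len(window) - k + 1))

# Toy sequence standing in for a CRM; real CRMs are far longer.
crm = "ACGTACGTGGCCACGT"
profiles = [(start, word_profile(w, 2)) for start, w in windows(crm, size=8, step=4)]

for start, profile in profiles:
    print(start, dict(profile))
```

Each profile is a plain mapping from word to count, so any profile-to-profile score can be computed from two such mappings.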

Figure 2.

Significance of overlap between eve WPHs and test sets.

Each sequence window across the eve locus is associated with a set of WPHs. We observe significant overlap between WPHs corresponding to annotated CRMs and our test sets (stripe CRMs and four sets of ChIP-chip peaks). Stripe CRMs are shaded in gray, and ChIP-chip-bound regions are boxed with a dotted line. For p<1e-5, the p-value is reported as 6.1e-6 (−log(p) = 12). The dashed line represents p = 0.05.
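
The reported floor is consistent with a natural logarithm, since −ln(6.1e-6) ≈ 12. The short check below illustrates that arithmetic; the function name and the clamping rule are assumptions made for this sketch, based only on the values stated in the caption.

```python
import math

def neg_log_p(p, threshold=1e-5, reported=6.1e-6):
    """Return -ln(p), substituting the reported value when p falls below the threshold.

    Both defaults are taken from the caption; the clamping logic itself is assumed.
    """
    if p < threshold:
        p = reported
    return -math.log(p)

print(round(neg_log_p(1e-9), 2))   # clamped to the reported floor
print(round(neg_log_p(0.05), 2))   # height of the dashed significance line
```

Under this reading, the dashed line at p = 0.05 sits at roughly 3 on the −log(p) axis.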

Figure 3.

Summary of predictive power of stripe WPHs and their frequent words.

For WPHs associated with stripe CRMs or with ChIP-chip-bound regions found near the primary pair-rule genes, most show significant overlap (p≤0.05) with stripe CRMs (dark blue, light blue). Words overrepresented in these WPHs also correspond well with predicted TFBSs (p≤0.05, red, pink).

Figure 4.

Significance of overlap between NRSF WPHs and test sets.

WPHs are collected for windows spanning NRSF-bound sequences. At all p-value cutoffs considered, these NRSF WPHs overlap the NRSF dataset considerably more than they overlap randomly generated test sets.

Figure 5.

Identifying orthologous CRMs in distant fly species.

We scanned eve CRMs from D. melanogaster against the eve loci of several distant fly species: Sepsis cynipsea (A), Themira putris (B), and Scaptodrosophila lebanonensis (C). The upper blue track indicates experimentally verified CRMs; the lower green track shows the best match to the indicated D. melanogaster CRM. The best match is not shown if its score did not meet the score threshold (Z≥6).

Figure 6.

Finding similar and dissimilar sequence neighbors (HSNs and LSNs).

Given a block size B and a threshold pairwise score of similarity, we scanned the genome for sequence windows with either high-scoring or low-scoring neighboring sequences within B kb.
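
The neighbor scan described above can be sketched as follows. This is a hedged illustration rather than the published procedure: `find_neighbors`, its parameters, and the toy scores are hypothetical, the pairwise score is passed in as a function because the caption does not define it, and the thresholds used in the example (6.0 and −1.5) are borrowed from other captions purely for demonstration.

```python
def find_neighbors(positions, score, block_bp, high, low):
    """Return indices of windows with a high- or low-scoring neighbor within block_bp.

    positions: sorted window start coordinates (bp).
    score:     callable (i, j) -> pairwise similarity of windows i and j (assumed symmetric).
    """
    hsn, lsn = set(), set()
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            if positions[j] - positions[i] > block_bp:
                break  # positions are sorted, so later windows are farther away
            z = score(i, j)
            if z >= high:
                hsn.update((i, j))
            if z <= low:
                lsn.update((i, j))
    return hsn, lsn

# Toy data: four windows and a few made-up pairwise scores.
positions = [0, 500, 1000, 5000]
toy_scores = {(0, 1): 7.0, (0, 2): -2.0, (1, 2): 0.3}
hsn, lsn = find_neighbors(
    positions,
    score=lambda i, j: toy_scores.get((i, j), 0.0),
    block_bp=1000,   # B = 1 kb
    high=6.0,
    low=-1.5,
)
print(sorted(hsn), sorted(lsn))
```

Because the score is treated as symmetric, both windows of a qualifying pair are recorded; a directional definition would record only one of them.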

Figure 7.

Significance of overlap between HSNs/LSNs and REDfly CRMs.

Each set of HSNs (or LSNs) represents sequences with a high-scoring (or low-scoring) neighbor within a given block size for a given Z-score threshold. The significance of overlap between an HSN (or LSN) set and known CRMs is represented by a color scale (−log(p)), such that blue shades represent significant enrichment of CRMs (p<0.05). While HSNs are enriched for stripe CRMs (A), they are not enriched for the broader set of CRMs in REDfly (p>0.05 for all block sizes and Z-scores, data not shown), suggesting that CRM clustering is not a common feature of CRM organization. LSNs are enriched for both stripe CRMs (B) and REDfly CRMs (C) at modest score cutoffs. For p<1e-5, the p-value is reported as 6.1e-6 (−log(p) = 12).

Figure 8.

Average pairwise Z-score as a function of distance.

Pairwise sequence similarity decreases as the distance between the two sequences in the genome increases. On average, non-coding sequences (black) are more similar to their neighboring sequences than REDfly CRM sequences (blue) are. Stripe CRMs (red), which are known to cluster together, are similar to their close neighbors.

Figure 9.

Significance of overlap between low-scoring neighbors of REDfly sequences and REDfly CRMs.

Low-scoring (Z≤−1.5) neighbors of REDfly CRMs overlap REDfly CRMs more than expected by chance, relative to their coverage of “valid” non-coding sequences surrounding REDfly CRMs. For p<5e-7, the p-value is reported as 3e-7 (−log(p) = 15). The dashed line represents p = 0.05.
