Edited final manuscript: DIR ZD SG JHC. Manuscript Figures: DIR JHC. Designed and built website: DIR. Conceived and designed the experiments: DIR SG JHC. Performed the experiments: DIR ZD. Analyzed the data: DIR ZD SG JHC. Contributed reagents/materials/analysis tools: DIR ZD SG JHC. Wrote the paper: DIR JHC SG.
Current address: The Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
The authors have declared that no competing interests exist.
Many conserved noncoding sequences function as transcriptional enhancers that regulate gene expression. Here, we report that protein-coding DNA also frequently contains enhancers functioning at the transcriptional level. We tested the enhancer activity of 31 protein-coding exons, which we chose based on strong sequence conservation between zebrafish and human, and occurrence in developmental genes, using a Tol2 transposable GFP reporter assay in zebrafish. For each exon we measured GFP expression in hundreds of embryos in 10 anatomies via a novel system that implements the voice-recognition capabilities of a cellular phone. We find that 24/31 (77%) exons drive GFP expression compared to a minimal promoter control, and 14/24 are anatomy-specific (expression in four or fewer anatomies). GFP expression driven by these coding enhancers frequently overlaps the anatomies where the host gene is expressed (60%), suggesting self-regulation. Highly conserved coding sequences and highly conserved noncoding sequences do not significantly differ in enhancer activity (coding: 24/31 vs. noncoding: 105/147) or tissue-specificity (coding: 14/24 vs. noncoding: 50/105). Furthermore, coding and noncoding enhancers display similar levels of the enhancer-related histone modification H3K4me1 (coding: 9/24 vs. noncoding: 34/81). Meanwhile, coding enhancers are over three times as likely to contain an H3K4me1 mark as other exons of the host gene. Our work suggests that developmental transcriptional enhancers do not discriminate between coding and noncoding DNA and reveals widespread dual functions in protein-coding DNA.
The functions in a genome are often conceptually divided into protein functions for coding DNA and regulatory functions for noncoding DNA. This division is based on the intuition that constraints associated with encoding a protein would prevent the evolution of noncoding functions in a coding region. However, the validity of this division has not been well-studied. One important class of regulatory functional elements in noncoding DNA is enhancers. These are DNA sequences classically found distal to gene promoters and associated with tissue- or temporally-specific transcriptional regulation of gene expression, especially for developmental genes
Prior computational and evolutionary studies at the motif level have shown that coding DNA can hold noncoding information. This ability to contain other functional information arises from the redundancy of synonymous codons. For example, Itzkovitz and Alon compared the human genetic code to alternative permuted codes, finding that the genetic code is nearly ideal for containing short functional motifs within protein-coding DNA
Protein-coding sequences can be more critically tested by considering developmental enhancer activity. Developmental enhancers are typically much longer than individual TF-binding motifs and are often associated with strong sequence constraint. Highly conserved noncoding sequences have shown frequent enhancer activity in developmental expression assays
Relatively little is known about enhancers in coding sequence. Coding exon-controlled enhancer activity has been reported in a few cell line experiments, e.g. from the APOE, ADAMTS5 and BCL-2 genes
To address this question, we investigated the enhancer functions of 31 coding sequences from a variety of developmental genes orthologous between human and zebrafish. We chose
Conserved Coding Elements (CCEs) were identified using minimal criteria of >60% DNA sequence conservation between zebrafish and human, 100–1000 bp length, and occurrence within a set of developmental genes orthologous between zebrafish and human. These criteria were chosen to be similar to those used for identifying Conserved Noncoding Elements (CNEs) in a previous study of CNE enhancer activity
We sub-cloned each CCE into a Tol2 vector, upstream of an E1B minimal promoter driving EGFP (see
Embryos were scored for EGFP transient expression at 22–30 hours in 10 anatomies. Transient expression of reporter genes in zebrafish has been successfully used to identify noncoding enhancers by several groups
We identified CCE enhancers by comparing the fraction of embryos with activity driven by the CCE to the fraction of embryos with activity driven by the control, on an anatomy-specific basis (see
| PLASMID NAME | Control | CCE-rab11fip4a | X | CCE-abca1a | X | CCE-odz3 | X | CCE-rfx2 | X |
|---|---|---|---|---|---|---|---|---|---|
| Embryos (N) | 161 | 53 | | 43 | | 39 | | 48 | |
| Raw Expression Proportion | | | | | | | | | |
| Forebrain: | 0.3354 | 0.4340 | | 0.3488 | | 0.8205 | 2.4 | 0.3750 | |
| Midbrain/Hindbrain: | 0.2546 | 0.1509 | | 0.3953 | | 0.0256 | | 0.1250 | |
| Eye: | 0.1800 | 0.2264 | | 0.4419 | 2.5 | 0.0769 | | 0.1667 | |
| Ear/AboveHeart: | 0.0680 | 0.0189 | | 0.0698 | | 0.2821 | 4.1 | 0.0208 | |
| Heart: | 0.3160 | 0.2264 | | 0.0000 | | 0.1538 | | 0.1042 | |
| Notochord: | 0.1610 | 0.6038 | 3.8 | 0.3953 | 2.5 | 0.0769 | | 0.1042 | |
| Yolk/YolkExtension: | 0.1550 | 0.2075 | | 0.3953 | 2.6 | 0.1282 | | 0.2708 | |
| MidTrunk/AboveYolk: | 0.0800 | 0.0943 | | 0.0930 | | 0.1538 | | 0.0000 | |
| Muscle: | 0.1800 | 0.1509 | | 0.3488 | | 0.5128 | 2.8 | 0.6875 | 3.8 |
| TailRegion: | 0.1240 | 0.1887 | | 0.1163 | | 0.2051 | | 0.1667 | |
| Wilcoxon test (p-value) | | | | | | | | | |
| Forebrain: | | 0.1289 | | 0.5000 | | 5.48E-08 | | 0.3693 | |
| Midbrain/Hindbrain: | | 0.9147 | | 0.0519 | | 0.9983 | | 0.9546 | |
| Eye: | | 0.2941 | | 0.0003 | | 0.9089 | | 0.5000 | |
| Ear/AboveHeart: | | 0.8445 | | 0.5000 | | 0.0002 | | 0.8127 | |
| Heart: | | 0.8595 | | 1.0000 | | 0.9660 | | 0.9969 | |
| Notochord: | | 5.14E-10 | | 0.0009 | | 0.8627 | | 0.7732 | |
| Yolk/YolkExtension: | | 0.2512 | | 0.0006 | | 0.5694 | | 0.0539 | |
| MidTrunk/AboveYolk: | | 0.4904 | | 0.5000 | | 0.1373 | | 0.9547 | |
| Muscle: | | 0.6094 | | 0.0146 | | 1.84E-05 | | 2.44E-11 | |
| TailRegion: | | 0.1725 | | 0.5000 | | 0.1470 | | 0.3028 | |
| Proportions test (p-value) | | | | | | | | | |
| Forebrain: | | 0.1964 | | 0.5680 | | 0.0053 | | 0.5038 | |
| Midbrain/Hindbrain: | | 0.9497 | | 0.1371 | | 0.9961 | | 0.9711 | |
| Eye: | | 0.2282 | | 0.0460 | | 0.9572 | | 0.7784 | |
| Ear/AboveHeart: | | 0.9744 | | 0.7036 | | 0.0287 | | 0.9525 | |
| Heart: | | 0.9047 | | 0.9981 | | 0.9758 | | 0.9950 | |
| Notochord: | | 0.0058 | | 0.0057 | | 0.9847 | | 0.8387 | |
| Yolk/YolkExtension: | | 0.2801 | | 0.0276 | | 0.7832 | | 0.1060 | |
| MidTrunk/AboveYolk: | | 0.5277 | | 0.5800 | | 0.2014 | | 0.9966 | |
| Muscle: | | 0.7513 | | 0.0524 | | 0.0205 | | 0.0051 | |
| TailRegion: | | 0.4191 | | 0.5419 | | 0.1681 | | 0.2205 | |
Raw Expression Proportion (top) displays the fraction of surviving embryos with expression in each anatomy. X column shows the ratio of expression fraction for statistically significant anatomies. We consider anatomies as significant if they have p≤0.05 for both the Wilcoxon and Proportions test (middle and bottom). The full dataset can be found in Supplemental
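The significance rule stated in this caption can be illustrated with a short sketch. The Python below is not the Perl/R pipeline used in the study; it only applies the stated criterion (p≤0.05 in both the Wilcoxon and Proportions tests) to a subset of the CCE-odz3 values from the table and reports the ratio to the control. The function and variable names are hypothetical.

```python
# Illustrative sketch only: names are hypothetical, values are copied from
# the Control and CCE-odz3 columns of the table (five of the ten anatomies).
ALPHA = 0.05

control = {"Forebrain": 0.3354, "Ear/AboveHeart": 0.0680, "Heart": 0.3160,
           "Notochord": 0.1610, "Muscle": 0.1800}
odz3 = {"Forebrain": 0.8205, "Ear/AboveHeart": 0.2821, "Heart": 0.1538,
        "Notochord": 0.0769, "Muscle": 0.5128}
odz3_wilcoxon_p = {"Forebrain": 5.48e-08, "Ear/AboveHeart": 0.0002,
                   "Heart": 0.9660, "Notochord": 0.8627, "Muscle": 1.84e-05}
odz3_proportions_p = {"Forebrain": 0.0053, "Ear/AboveHeart": 0.0287,
                      "Heart": 0.9758, "Notochord": 0.9847, "Muscle": 0.0205}


def significant_anatomies(control, cce, wilcoxon_p, proportions_p, alpha=ALPHA):
    """Return {anatomy: ratio to control} for anatomies passing both tests."""
    hits = {}
    for anatomy, fraction in cce.items():
        passes_both = (wilcoxon_p[anatomy] <= alpha
                       and proportions_p[anatomy] <= alpha)
        if passes_both:
            # Ratio of expression fractions, as reported in the X column.
            hits[anatomy] = round(fraction / control[anatomy], 1)
    return hits


odz3_hits = significant_anatomies(control, odz3,
                                  odz3_wilcoxon_p, odz3_proportions_p)
print(odz3_hits)  # {'Forebrain': 2.4, 'Ear/AboveHeart': 4.1, 'Muscle': 2.8}
```

The three ratios reproduce the X values listed for CCE-odz3, and anatomies failing either test (here Heart and Notochord) are left out.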
We observed that 24/31 CCEs (77%) drove clear GFP expression above the control. Although there was a small amount of mosaicism, 20/24 CCEs drove expression in at least one anatomy at a level significantly greater than the control (
(A) Examples of Specific CCE Activity. CCEs from the genes gria3b, rab11fip4a, prim1, and abca1a each drove robust expression in a finely localized anatomical region. Overall, 14 CCEs produced this type of specific expression (defined as expression in 4 or fewer anatomical regions). (B) This behavior contrasts with CCEs that drove robust but non-specific expression, such as CCE-ephb3a. Six of the active CCEs drove non-specific expression.
To further confirm the validity of our assays, we made transgenic lines for one CCE: CCE-lmo1, which we chose because of its strong expression in the transient assay (see
(A) Stable transgenic F1 embryos from two independently generated lines displaying strong forebrain and hindbrain expression. Supplementary
In a previous study we reported that 76/101 of CNEs, chosen by criteria similar to those used for the CCEs, exhibited enhancer activity as measured using
As shown in
(A) Comparison of the fraction of enhancers active in conserved coding elements (CCEs) and conserved non-coding elements (CNEs). CCEs and CNEs exhibit similar enhancer activity levels, with no significant difference in activity. (B) Comparison of the fraction of enhancers exhibiting tissue specificity in CCEs and CNEs. While CNEs are marginally less tissue-specific, the difference is not statistically significant.
We have previously shown that for CNEs with greater than 60% human-zebrafish conservation, increased conservation cannot distinguish active and inactive CNEs
To investigate the target genes of CCE enhancers, we compared CCE activity to the anatomical expression of the gene in which the CCE resides (hereafter termed “host” gene) using ZFIN anatomy tags (see
In some of the non-overlapping cases, there is evidence for host gene expression in the anatomies where the CCE is active. For example, CCE-gria3b and CCE-islet1 both display strong heart expression (see CCE-gria3b:
A separate aspect of enhancer activity is that enhancers can be active in multiple tissues. Such concurrent activity may be functionally important, but mosaicism can often obscure recognition. An advantage of our method is that it yields activity annotations for all anatomies on an embryo-by-embryo basis, allowing us to quantitatively distinguish concurrent multi-anatomy activity from mosaicism (see
Histone H3 lysine 4 monomethylation (H3K4me1) has previously been associated with ∼30–40% of enhancers in mammalian non-coding regions
H3K4me1 is not a perfect predictor of enhancer activity, since only a subset of active CCEs show the mark. We note that in the more comprehensively characterized human ENCODE datasets, recent algorithms to predict enhancer activity from histone modifications (including H3K4me1) also have false negative rates of 20–40%
p300 is a bromodomain-containing histone acetyltransferase that has been associated with enhancers found in noncoding regions. To further determine whether enhancers are likely to be common in coding regions, we reanalyzed the mouse p300 ChIP-seq data of Visel et al. using CCDS coding exons, a stringently annotated set of conserved exons between mouse and human
In addition, we analyzed the clustered ChIP-seq human transcription factor dataset from ENCODE
An alternative hypothesis proposes that highly conserved coding sequences function as “poison cassettes”
We have shown that conserved coding sequences often act as enhancers, with activity, tissue-specificity and protein-binding characteristics similar to highly conserved noncoding sequences selected by analogous criteria. While we tested only 31 sequences, 168 sequences met our screening criteria of human-teleost conservation and overlap with genes active during forebrain development. At our observed success rate (77%), this would imply ∼129 coding enhancers in the zebrafish genome. In all likelihood this is an underestimate, as there may be many coding enhancers that do not meet our selection criteria. In any case, our work demonstrates that even sophisticated regulatory functions such as enhancers may occur commonly in protein coding sequence.
These experiments clearly verify the coding enhancer hypothesis
The observation of prevalent coding enhancers is counterintuitive given that protein-coding constraints would be expected to conflict with other functions in the same location. However, the degree of conflict depends on the amount of evolutionary constraint associated with both the protein function and the other function. Consider first the constraint associated with protein function. Previous studies have shown that 70% of amino acids in a protein can be altered while maintaining structure and function
The level of constraint associated with enhancer activity remains controversial, as enhancers vary widely in their sequence conservation. Conservation-blind enhancer identification approaches in noncoding regions have suggested that enhancers are typically under strong sequence constraint. McGaughey et al. tiled intergenic regions around the phox2b locus in zebrafish and found that ∼40% of sequences with enhancer activity had ≥75% zebrafish-human conservation in a block ≥100 bp
How should enhancers in coding regions be predicted? Given that enhancer-protein conflict appears to be weak, this question is essentially the same as for predicting enhancers in noncoding regions. In other words, occurrence in developmental loci and relatively high conservation (e.g. our criteria of >60% fish-human identity over 100 bp) are important features, the application of which should yield true positive rates of ∼3/4. These criteria tend to overlap: for the genome-wide set of exons with 1∶1 orthology, there are 8,693 exons (avg. conservation 77%) and many are from developmental genes. There are 6,274 exons from developmentally expressed zebrafish genes, and 61% are at least 60% similar to human. Even nonconserved sequences in developmental loci may have substantial rates of enhancers. For example, McGaughey et al. found enhancer activity in 4/13 blocks of noncoding sequence near the zebrafish phox2b developmental gene lacking conservation to fugu, tetraodon, human or mouse
Extensions of sequence-based prediction approaches, e.g. through superior neutral background models
Our work suggests that enhancers in coding regions target their own gene. This finding is consistent with the genomic regulatory block concept of Kikuta et al. that enhancers and their targets should remain syntenic through evolution
Finally, this work sheds light on the many protein-binding, histone modification, and RNA-binding events in coding DNA which have typically been regarded as ‘experimental noise.’ Given that coding sequence can contain enhancer functions, it is likely that many of these events are functional as well. A number of recent disease studies have shown the functional importance of synonymous SNPs
Additional CCE images and the program to calculate significant anatomies are available at the public website and database:
To determine a set of CCEs, we identified exons with mutual best BLAST hits among Ensembl RefSeq exons from zebrafish (dr6) and human (GRCh37) with E-value<1e−10
Additionally, 5 ultraconserved regions (UCRs) were chosen from Bejerano et al.
Primers were designed using the Primer3 executable
The plasmid (pT2KXIGQ) is a modification of the Tol2 plasmid pT2KXIG
Pooled zebrafish embryos from AB and SH strains were collected within 10 minutes of fertilization. 150–300 embryos were injected per CCE with typically ∼130 surviving. The amount of injected plasmid DNA was consistent across CCEs and was very close to that in prior zebrafish enhancer studies
150 zebrafish embryos were injected with plasmid LMO1. At 24 hpf, 50 fish were chosen with strong and specific expression. ∼20 fish survived to adulthood and 6 were chosen to cross with wildtype zebrafish. 4 of the 6 crosses resulted in GFP expression, with around 30–40% of F1 offspring displaying GFP expression similar to transient LMO1 expression.
Embryos were visually scored for EGFP expression between 22 and 30 hours post-fertilization, judged by direct visualization of the 3-D living embryos from multiple viewing angles (dorsal, ventral, lateral, oblique, etc.). Representative white-light and fluorescence images were acquired at 5–20X. All CCEs and the control plasmid were tested in multiple independent runs. Subsets of embryos were anesthetized and plated (in sets of 15–20) onto inverted 96-well cell culture dish lids. Embryos were scored using the iPhone voice recognition application Dragon Dictation and a controlled-language anatomy for 10 anatomical sections as shown in
Resulting text files were manually reviewed and processed by a PERL script and R Statistical Package
Of note, our criteria to classify a CCE as having significant expression in an anatomy are stringent. To determine enhancer activity, previous publications have used thresholds of ∼4%
To determine pairs of anatomies with concurrent activity, we assumed a null hypothesis of equal probability for the four cases: 00, 01, 10, and 11, where 0 and 1 indicate absence or presence of activity and the two digits correspond to the two anatomical regions. A co-regulation z-score was calculated as $z = (N_{11} - 0.25\,N_{\mathrm{total}}) / \sqrt{N_{\mathrm{total}} \times 0.25 \times 0.75}$ and a
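As a minimal sketch of the co-regulation z-score defined above, the following Python counts the "11" case from per-embryo annotations and applies the formula. The function names and the example counts (24 of 48 embryos) are hypothetical and chosen only so the arithmetic is easy to follow; this is not the authors' script.

```python
import math


def count_concurrent(pairs):
    """Count embryos scored 1 in both anatomies; return (N11, Ntotal).

    `pairs` is a sequence of (a, b) tuples, one per embryo, where each
    entry is 0 (no activity) or 1 (activity) in the respective anatomy.
    """
    n11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    return n11, len(pairs)


def coregulation_z(n11, n_total):
    """z-score of N11 against a null probability of 0.25 for the 11 case."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    expected = 0.25 * n_total
    std_dev = math.sqrt(n_total * 0.25 * 0.75)  # binomial SD with p = 0.25
    return (n11 - expected) / std_dev


# Hypothetical example: 24 of 48 embryos active in both anatomies.
# Expected N11 = 12, SD = sqrt(48 * 0.1875) = 3, so z = (24 - 12) / 3.
print(coregulation_z(24, 48))  # 4.0
```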
We downloaded the complete set of known anatomical annotations for every gene in the zebrafish genome from Zfin (10,746 unique genes). These annotations are based on literature-curated
The set of possible ZFIN anatomies was created by text-matching anatomical descriptions for CCE significant anatomies to ZFIN anatomical IDs. We also allowed for matches to IDs one sub-level down in the ZFIN anatomical hierarchy to account for variations in the resolution of anatomical annotations. For example, “forebrain” = ZFA:0000109. The immediate sub-level down from forebrain contains the following terms: “diencephalon” = ZFA:0001343, “eminentia thalami” = ZFA:0007010, “forebrain ventricle” = ZFA:0000101, “telencephalon” = ZFA:0001259 and “telencephalon diencephalon boundary” = ZFA:0000079. These 6 IDs were also used to query the ZFIN gene expression database for matches to forebrain enhancer activity. This matching flexibility is important when the mRNA expression covers a diffuse area. For instance, CCE-ddx18 displays overlapping expression with ddx18 mRNA
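The one-sub-level matching described above can be sketched as follows. The forebrain IDs are those quoted in the text; the dictionary layout, the function names, and the placeholder ID `ZFA:9999999` are assumptions made for illustration and do not reflect the ZFIN file format or the script actually used.

```python
# Parent term -> immediate child terms. Only the forebrain entry from the
# text is included; a real table would cover all scored anatomies.
CHILDREN = {
    "ZFA:0000109": [      # forebrain
        "ZFA:0001343",    # diencephalon
        "ZFA:0007010",    # eminentia thalami
        "ZFA:0000101",    # forebrain ventricle
        "ZFA:0001259",    # telencephalon
        "ZFA:0000079",    # telencephalon diencephalon boundary
    ],
}


def expand_one_level(term_id, children=CHILDREN):
    """Return the term itself plus its immediate sub-level terms."""
    return {term_id, *children.get(term_id, [])}


def shares_anatomy(cce_terms, gene_terms, children=CHILDREN):
    """True if any expanded CCE anatomy matches a gene expression annotation."""
    expanded = set()
    for term in cce_terms:
        expanded |= expand_one_level(term, children)
    return bool(expanded & set(gene_terms))


# A gene annotated only to telencephalon still matches forebrain activity.
print(shares_anatomy(["ZFA:0000109"], ["ZFA:0001259"]))  # True
# ZFA:9999999 is a made-up placeholder ID outside the forebrain set.
print(shares_anatomy(["ZFA:0000109"], ["ZFA:9999999"]))  # False
```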
To compare CCE expression to random genes, the host, upstream, and downstream genes for the CCEs were removed from the Zfin wildtype expression file, as were miRNA genes. The ‘shuffle’ function of the List::Util Perl module was used to randomly pick 20 genes and assign them to CCEs as “host genes.” The number of anatomies shared between the CCE and the random host gene was then counted. This process was repeated 100 times. The mean, standard deviation and proportions analyses were done using the R Statistical Package.
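The randomized control can be sketched in Python as below. This is an illustration rather than the Perl code used in the study: it draws a random set of genes (one per CCE), scores each CCE as matched if it shares at least one anatomy with its assigned gene, and repeats the draw 100 times. The function name, the fixed seed, and the toy gene and anatomy sets are hypothetical.

```python
import random


def random_host_control(cce_anatomies, gene_anatomies, n_iter=100, seed=0):
    """Return, for each random gene set, the number of CCEs with >=1 shared anatomy.

    cce_anatomies:  {cce_name: set of anatomies with significant activity}
    gene_anatomies: {gene_name: set of annotated expression anatomies},
                    already filtered to exclude host/upstream/downstream genes.
    """
    rng = random.Random(seed)          # seeded only to make the sketch repeatable
    gene_ids = sorted(gene_anatomies)
    matches_per_set = []
    for _ in range(n_iter):
        # One randomly chosen "host gene" per CCE, without replacement.
        picked = rng.sample(gene_ids, len(cce_anatomies))
        matches = sum(
            1
            for cce_set, gene in zip(cce_anatomies.values(), picked)
            if cce_set & gene_anatomies[gene]
        )
        matches_per_set.append(matches)
    return matches_per_set


# Toy data (made up for illustration).
toy_cces = {"CCE-a": {"forebrain"}, "CCE-b": {"muscle", "eye"}, "CCE-c": {"heart"}}
toy_genes = {
    "gene1": {"forebrain", "eye"}, "gene2": {"notochord"}, "gene3": {"muscle"},
    "gene4": {"heart", "yolk"}, "gene5": set(),
}
counts = random_host_control(toy_cces, toy_genes)
print(sum(counts) / len(counts))  # mean number of matched CCEs per random set
```

The resulting list of 100 counts can then be summarized by its mean and standard deviation and compared with the observed number of CCE-host matches.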
To treat CNEs and CCEs equally, coordinates from experimentally tested CNEs in Li et al
H3K4me1 binding sites were obtained from the recently published data of Aday et al.
Human (hg19) and zebrafish exons (Zv8) for tested CCEs and all exons from 100–1000 bp were extracted using CDS Fasta data from the UCSC Genome Browser. Sequences were searched for aligned 4-fold synonymous codons, and a minimum of five such codons were required for further analysis. Four-fold sites were extracted and the p-distance was calculated by counting the number of conserved sites divided by the total number of sites. Random exons were extracted using PERL to randomly shuffle the set of all exons. Random exons were required to have alignable coding sequence between human and zebrafish. The R Statistical package was used for the unpaired Wilcoxon rank sum analysis.
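A minimal sketch of the four-fold site calculation follows. It assumes the two coding sequences are already aligned in frame without gaps, and it treats a codon pair as an "aligned 4-fold synonymous codon" when both codons share the same first two bases and those bases define a four-fold degenerate family in the standard genetic code; both are assumptions, as is the function name. Following the description above, the value returned is the number of conserved four-fold sites divided by the total number of four-fold sites.

```python
# First two bases of the eight four-fold degenerate codon families in the
# standard genetic code (Leu CTN, Val GTN, Ser TCN, Pro CCN, Thr ACN,
# Ala GCN, Arg CGN, Gly GGN).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_conservation(seq_a, seq_b, min_sites=5):
    """Fraction of identical third positions at aligned four-fold codons.

    Returns None when fewer than `min_sites` four-fold sites are found.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be equal-length, in-frame alignments")
    conserved = 0
    total = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        same_family = (codon_a[:2] == codon_b[:2]
                       and codon_a[:2] in FOURFOLD_PREFIXES)
        if same_family and codon_a[2] in "ACGT" and codon_b[2] in "ACGT":
            total += 1
            if codon_a[2] == codon_b[2]:
                conserved += 1
    if total < min_sites:
        return None
    return conserved / total


# Made-up example: six four-fold codons, three with identical third bases.
human = "".join(["CTA", "GTT", "TCC", "CCG", "ACT", "GCA"])
zebrafish = "".join(["CTG", "GTT", "TCA", "CCG", "ACC", "GCA"])
print(fourfold_conservation(human, zebrafish))  # 0.5
```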
p300 peaks were obtained from Visel et al. 2009
TFBS clusters on 8 human cell lines were obtained from the UCSC Genome Browser Encode Project
Plasmid Design and Injection. Flanking Tol2 sequences integrate the control or experimental cassette into the zebrafish genome after injection with plasmid and transposase mRNA at the 1-cell stage.
(TIF)
Processing Voice-Operated Anatomical Expression Analysis. A schematic representation of how the proportions and Wilcoxon rank-sum test compare CCE-slc1a2 expression in the forebrain and yolk to the background expression of the control plasmid lacking an insert. Only anatomies with
(TIF)
A group of stable transgenic embryos (F1) derived from embryos injected with CCE-lmo1. Injected embryos were selected for forebrain and hindbrain expression and then crossed with wildtype zebrafish to yield the F1 generation.
(TIF)
Anatomy Comparison Using Zfin. For the host/upstream/downstream genes, the Zfin gene expression database was queried using anatomical terms corresponding to our CCE anatomies. The number of unique shared anatomies was counted for each CCE-gene comparison. CCEs with at least 1 shared anatomy with the gene were assigned a score of “1” while CCEs without were assigned “0.” The number of CCEs with a match was counted. Since there were 20 CCEs to be tested, in the randomized control the same procedure was used but with 100 random sets of 20 genes.
(TIF)
Comparison of CCE-lmo1 expression to Zfin stages. CCE-lmo1 maintains strong similarity to the mRNA
(TIF)
CCE-islet expression in the heart.
(TIF)
Concurrent Anatomical Activity Schematic. Each anatomy pair is compared to a null expectation of equal likelihood of expression in each of four cases: 00, 01, 10, 11. The first position represents the first anatomy, the second position represents the second anatomy. A 0 represents no expression and 1 represents expression.
(TIF)
CCE-ddx18 displays expression in the forebrain and midbrain, consistent with annotations in the ZFIN database. However, the diffuse expression patterns around the tectum and eye (particularly at ∼22–24 hpf) make it difficult to visually determine whether there is agreement on a finer scale.
(TIF)
Images from the 20 significant CCEs and their corresponding anatomies. Images are labeled with (CCE-GeneName, ExonNumber), and the significant anatomy for each CCE is labeled. To view more images for each CCE, please visit:
(TIFF)
Images from the 20 significant CCEs and their corresponding anatomies. Images are labeled with (CCE-GeneName, ExonNumber), and the significant anatomy for each CCE is labeled. To view more images for each CCE, please visit:
(TIFF)
Images from the 20 significant CCEs and their corresponding anatomies. Images are labeled with (CCE-GeneName, ExonNumber), and the significant anatomy for each CCE is labeled. Note that CCE-ddx5 has voice-expression data but lacks an image of yolk expression. To view more images for each CCE, please visit:
(TIFF)
Expression and Count Statistics for CCEs Evaluated for Whole Embryo (non-anatomy based)
(TIF)
CCE concurrent activity. 10 CCEs display concurrent activity in at least two anatomies with
(TIF)
Data file of zebrafish-human exon conservation. ExonConservation_AllExonsInGene tab: The average of all exons in the gene is highlighted in yellow, and the CCE tested is marked in red text. Exon-Cons,Total,Rank tab: a table of the average conservation, the CCE conservation, the exon rank of the CCE, and a count of exons in the gene.
(XLS)
Data file of wilcox.test and prop.test scores. The raw expression proportion (top) displays the fraction of surviving embryos with expression in each listed anatomy. X (green) shows the ratio of expression fraction for statistically significant anatomies, which are highlighted in yellow. We consider anatomies as significant if they have p≤0.05 (shown in red) for both the Wilcoxon and Proportions test (middle and bottom).
(XLS)
Data file of CCE activity patterns. CCE_HostGeneComparison tab: lists the CCE, the anatomies of experimental GFP expression, the ZFIN ID of the CCE-containing gene, the anatomies of gene expression in ZFIN database. Expression information for the closest upstream and downstream gene is also listed. RandomlyAssignedGenes tab: Counts of matching expression for 100 sets of 20 randomly assigned genes. CCE_GENE_UPSTREAM_DOWNSTREAM tab: ZFINID and common gene name for the CCE-containing gene, and the closest upstream and downstream genes. ExtendedAnatomy tab: Relational anatomy tags from the ZFIN database assigned to the anatomy tags used to visually score GFP in zebrafish embryos.
(XLS)
Expression, location and conservation of 31 Conserved Coding Elements. 20 CCEs display significant expression: 14 CCEs display significant specific expression (≤4 anatomies) and 6 display significant non-specific expression. In addition, 4 CCEs display weak expression, and 7 CCEs fail to display expression. Sequences with Ultra-Conserved Regions are marked as (UCR).
(XLSX)
The authors wish to thank Ed Hurlock for discussions and for providing comments on the manuscript.