Transcripts expressed in eukaryotes are classified as poly A+ transcripts or poly A- transcripts based on the presence or absence of the 3′ poly A tail. Most transcripts identified so far are poly A+ transcripts, whereas the poly A- transcripts remain largely unknown.
We developed the TRD (Total RNA Detection) system for transcript identification. The system detects the transcripts through the following steps: 1) depleting the abundant ribosomal and small-size transcripts; 2) synthesizing cDNA without regard to the status of the 3′ poly A tail; 3) applying the 454 sequencing technology for massive 3′ EST collection from the cDNA; and 4) determining the genome origins of the detected transcripts by mapping the sequences to the human genome reference sequences. Using this system, we characterized the cytoplasmic transcripts from HeLa cells. Of the 13,467 distinct 3′ ESTs analyzed, 24% are poly A-, 36% are poly A+, and 40% are bimorphic with poly A+ features but without the 3′ poly A tail. Most of the poly A- 3′ ESTs do not match known transcript sequences; they have a similar distribution pattern in the genome as the poly A+ and bimorphic 3′ ESTs, and their mapped intergenic regions are evolutionarily conserved. Experiments confirmed the authenticity of the detected poly A- transcripts.
Our study provides the first large-scale sequence evidence for the presence of poly A- transcripts in eukaryotes. The abundance of the poly A- transcripts highlights the need for comprehensive identification of these transcripts for decoding the transcriptome, annotating the genome and studying biological relevance of the poly A- transcripts.
Citation: Wu Q, Kim YC, Lu J, Xuan Z, Chen J, Zheng Y, et al. (2008) Poly A- Transcripts Expressed in HeLa Cells. PLoS ONE 3(7): e2803. https://doi.org/10.1371/journal.pone.0002803
Editor: Jürg Bähler, Wellcome Trust Sanger Institute, United Kingdom
Received: March 21, 2008; Accepted: July 4, 2008; Published: July 30, 2008
Copyright: © 2008 Wu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The study was funded by NIH grant R01HG002600, the Daniel F. and Ada L. Rice Foundation, Mazza Foundation, and the Chicago Biomedical Consortium supported by The Searle Funds at The Chicago Community Trust.
Competing interests: The authors have declared that no competing interests exist.
The genome is expressed through transcription that generates different classes of RNA molecules, specifically ribosomal RNAs, messenger RNAs, and small RNAs, which together constitute the transcriptome. Transcription is regulated at multiple levels, including differential promoter usage, alternative splicing, intron retention, and alternative polyadenylation. Furthermore, the abundance of individual transcripts can vary by up to a million-fold. Thus, the transcriptome is far more complicated than the original coding sequences in the genome, and decoding the transcriptome will be more challenging than decoding the genome.
Early studies identified the presence or absence of the 3′ poly A tail on transcripts, resulting in their classification as either poly A+ transcripts or poly A- transcripts. This classification has been firmly confirmed by a recent genome tiling array study. The poly A+ transcripts include mRNA, microRNA and snoRNA generated by RNA polymerase II; the poly A- transcripts currently known include ribosomal RNAs generated by RNA polymerase I, histone RNAs generated by RNA polymerase II, and tRNAs and other small RNAs generated by RNA polymerase III.
An ultimate goal of transcriptome study is to identify all transcripts at the sequence level. This has been very successful for the poly A+ transcripts, largely because the 3′ poly A tail facilitates their isolation and cDNA synthesis with oligo dT. Up to now, millions of poly A+ transcripts have been sequenced from various species. Despite the evidence indicating the wide prevalence of poly A- transcripts, however, only a few poly A- transcripts have been identified so far at the sequence level. Without the poly A- transcript information, the transcriptome complexity and genome organization cannot be fully understood.
The lack of poly A- transcript information is largely due to technical factors. Unlike the poly A+ transcripts that have the universal 3′ poly A tail, there is no known consensus sequence in poly A- transcripts for isolation and cDNA synthesis. To overcome this obstacle, we developed a technical system termed Total RNA Detection (TRD). The system consists of three key elements: 1) enriching the poly A- transcripts by depleting the abundant ribosomal and tRNA transcripts; 2) synthesizing cDNA without regard to the status of the 3′ poly A tail; and 3) using the 454 sequencer for massive 3′ EST collection. The 3′ EST provides poly A signal and poly A tail information to distinguish between poly A+ transcripts and poly A- transcripts, can be compared directly with known transcripts, and can be mapped to the genome to give the 3′ boundary of the mapped locus. Using the TRD system, we analyzed the cytoplasmic transcripts of HeLa cells, a model cell line widely used for transcriptome study. Our sequence data indicate that the poly A- transcripts indeed exist.
Results and Discussion
Transcript enrichment and sequence collection
A challenge for studying the poly A- transcripts is to distinguish the true poly A- transcripts from degraded transcripts that originate from in vivo physiological RNA metabolism or in vitro RNase activities (and that also lack the 3′ poly A tail). Another challenge is the scarcity of the poly A- transcripts in the total transcriptome content. Three key steps in the Total RNA Detection system were used to address these issues (Figure 1).
A universal RNA adaptor was first added to the 3′ ends of all RNA templates. The abundant 18S and 28S ribosomal RNAs were then subtracted by using biotinylated ribosomal-specific probes. Small-size RNAs containing the degraded RNA intermediates were removed by size-filtration. The enriched transcripts were converted into double-strand cDNA by using the 3′ end RNA adaptor-based primer. The cDNAs were further digested by NlaIII. The 3′ cDNAs were isolated by using streptavidin beads. An adaptor was added to the 5′ ends of the 3′ cDNAs. The 3′ cDNAs were then amplified by PCR using the 5′ adaptor-based sense primer and the 3′ end RNA adaptor-based antisense primer. The amplified 3′ cDNAs were sequenced from the 3′ end by the 454 system. See further details in Materials and Methods.
- Adding a universal adaptor to the 3′ end of all transcripts before processing the RNA sample. This adaptor protects the 3′ end of transcripts at the very beginning, provides a universal priming site for later cDNA synthesis, and serves as an identifier to distinguish between the sequences from the 3′ end of true transcripts and the sequences from experimental artifacts.
- Removing ribosomal RNAs. Subtraction was applied to remove the abundant 28S and 18S ribosomal RNAs by using two specific probes for both 18S ribosomal RNA and 28S ribosomal RNA.
- Removing small-size RNAs. Size exclusion was used to remove the small-size RNAs that include the abundant tRNAs, snoRNAs and the degraded transcript intermediates. Although this process will remove certain authentic small RNAs such as miRNA, these small RNAs have been extensively characterized by many other studies and are not the focus of this study. Removal of small-size RNAs provides a cleaner background for identifying the true poly A- transcripts.
The enriched transcripts were converted into cDNA by using the biotinylated primer based on the universal adaptor sequence at the 3′ ends of transcripts. The cDNA was then digested by the 4-base cutting restriction enzyme NlaIII to generate 3′ cDNAs of about 140 bp that fit within the 454 sequencing range. The 3′ cDNA was recovered by using streptavidin beads. An adaptor was ligated to the 5′ end of the recovered 3′ cDNA. To generate 3′ cDNA templates for 454 sequencing, the 3′ cDNA was amplified by PCR using the sense primer based on the 5′ adaptor sequences and the antisense primer based on the 3′ end universal adaptor sequences. 454 sequencing was performed from the 3′ ends of the cDNA. A total of 273,949 raw sequences were collected in a single 454 sequencing run. After removing sequences from remaining ribosomal RNAs, tRNAs, mitochondrial RNAs, sequences without a 3′ end adaptor, and sequences shorter than 11 bp, 148,520 sequences qualified, representing 52,571 distinct 3′ ESTs. Of the 52,571 3′ ESTs, 13,782 were mapped to the human genome reference sequences (HG18) and were used for the following analyses (Table 1, Table S1).
Classification, novelty, and abundance of 3′ ESTs
Based on the factors of poly A signal, poly A tail and matching to known transcript sequences, standards were set to classify the 3′ ESTs into three subgroups:
Poly A- 3′ EST.
A sequence with no or only one A at the 3′ end (A accounts for 1/4 probability at this position out of A, G, C, and T) AND without a poly A signal AND matched to none of the known RefSeq/mRNA/EST/SAGE tags (nearly all known sequences are from poly A+ transcripts). The only known poly A- transcripts, which are histone transcripts, were determined by direct matching to known histone mRNA sequences.
Poly A+ 3′ EST.
A sequence matched to either the RefSeq/mRNA/EST/SAGE tags and with one or more A at the 3′ end with/without a poly A signal, OR a sequence matched to none of the known RefSeq/mRNA/EST/SAGE tags (novel) but with one A and a poly A signal or more As with/without a poly A signal.
Bimorphic 3′ EST.
A sequence matched to either the RefSeq/mRNA/EST/SAGE tags and with no A at the 3′ end with/without a poly A signal, OR a sequence matched to none of the known RefSeq/mRNA/EST/SAGE tags (novel) and with no A at the 3′ end but with a poly A signal. The bimorphic 3′ EST represents the isoform of the poly A+ transcript. Its lack of the 3′ poly A tail reflects the dynamics of poly A tail metabolism in the poly A+ transcript.
Poly A signals were defined in the order of prevalence: AATAAA, ATTAAA, TATAAA, AGTAAA, AAGAAA, AATATA, AATACA, CATAAA, GATAAA, AATGAA, TTTAAA, ACTAAA, AATAGA.
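The three classification standards above can be read as a small decision procedure. The following Python sketch is one possible encoding, not the code used in the study. It assumes each 3′ EST is an uppercase sense-strand string written 5′ to 3′ with the adaptor already removed, that a poly A signal counts wherever it occurs in the sequence (no position window is applied), and that the results of database matching are supplied by the caller. The function names and the `matches_known` and `is_histone` parameters are hypothetical.

```python
# Poly A signals as listed in the text, in order of prevalence.
POLY_A_SIGNALS = [
    "AATAAA", "ATTAAA", "TATAAA", "AGTAAA", "AAGAAA", "AATATA", "AATACA",
    "CATAAA", "GATAAA", "AATGAA", "TTTAAA", "ACTAAA", "AATAGA",
]


def count_trailing_a(seq: str) -> int:
    """Number of consecutive A residues at the 3' end of the sequence."""
    return len(seq) - len(seq.rstrip("A"))


def has_poly_a_signal(seq: str) -> bool:
    """True if any listed signal occurs anywhere in the sequence.

    Assumption: no positional window relative to the 3' end is enforced.
    """
    return any(signal in seq for signal in POLY_A_SIGNALS)


def classify_3p_est(seq: str, matches_known: bool, is_histone: bool = False) -> str:
    """Return 'polyA-', 'polyA+' or 'bimorphic' following the stated standards.

    matches_known: the EST matched a RefSeq/mRNA/EST/SAGE tag (caller-supplied).
    is_histone:    the EST matched a known histone mRNA (caller-supplied).
    """
    if is_histone:
        # Histone transcripts are assigned to poly A- by direct matching.
        return "polyA-"
    trailing_a = count_trailing_a(seq)
    if matches_known:
        # Known transcript: any 3' A makes it poly A+, otherwise bimorphic.
        return "polyA+" if trailing_a >= 1 else "bimorphic"
    # Novel sequence from here on.
    if trailing_a >= 2:
        return "polyA+"
    signal = has_poly_a_signal(seq)
    if trailing_a == 1:
        return "polyA+" if signal else "polyA-"
    return "bimorphic" if signal else "polyA-"
```

Under these rules a novel sequence needs either a poly A signal or at least two terminal A residues to leave the poly A- class, which reflects the 1/4 chance of a single terminal A noted above.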
Based on the standards, 24% of 3′ ESTs (2% are histone 3′ ESTs) were classified as poly A- 3′ EST, 36% as poly A+ 3′ EST, and 40% as bimorphic 3′ EST (Table 2, Table S1A, S1B, S1C). Most of the poly A+ and bimorphic 3′ ESTs match known human transcript sequences, which derive almost entirely from poly A+ transcripts. The unmatched novel poly A+ 3′ ESTs tend to have short poly A tails (Table S2A, S2B). Although the poly A- 3′ ESTs account for 24% of total sequences, they contribute only 12% of the total sequence copies (Table 2), confirming that the poly A- transcripts are highly heterogeneous but expressed at lower levels. Similarly, the bimorphic 3′ ESTs account for 40% of the total sequences but contribute only 15% of the total sequence copies. The lack of high-copy bimorphic 3′ ESTs implies that the highly expressed poly A+ transcripts may not use changes in poly A length as a regulatory mechanism. The poly A+ 3′ ESTs account for 36% of total sequences but contribute 73% of the total copies, indicating that the poly A+ transcripts contribute far more to transcriptome abundance than the poly A- and bimorphic transcripts.
Histone transcripts are the only known polymerase II-generated poly A- transcripts with sequence information. A total of 315 distinct 3′ ESTs from 46 histone genes were detected in this study. Comparison of the 3′ ends of the detected histone 3′ ESTs to the full-length histone transcript sequences shows that 80% of the 3′ ESTs matched proximal to the 3′ end of their corresponding full-length sequences (Table 3, Figure 2, Figure S1, Table S3). The pattern of the 3′ end distribution of these histone sequences closely resembles that observed in a recent histone study. The high proportion of intact 3′ ends among the detected histone transcripts provides an internal control for the authenticity of the poly A- transcripts detected in this study.
Fifteen 3′ ESTs that map to the full-length histone 1H2AB cDNA sequences (NM_003513) are clustered proximal to the 3′ end of the full-length sequence. See Table 3, Table S3 and Figure S1 for the distribution of other histone 3′ ESTs.
Map 3′ ESTs to the human genome reference sequences
We analyzed the genome origins for the 8,178 3′ ESTs that mapped to a single location in the genome. Nearly two-thirds of the 3′ ESTs mapped to the intragenic region and a third to the intergenic region (Table 4, Table S1A, S1B, S1C). The poly A- 3′ ESTs map more often to the intergenic regions than the poly A+ and bimorphic 3′ ESTs. Of the 3′ ESTs mapped to the intragenic region, most mapped to introns, and the poly A- 3′ ESTs are more often oriented in the antisense direction. Those mapped to the intron regions might represent intron-retained isoforms of the annotated genes, transcripts from genes overlapping the annotated genes, alternatively spliced transcripts whose last exon sequences differ, or transcripts that originate largely from the intron regions and have regulatory function, such as microRNA. Indeed, 491 intron-mapped 3′ ESTs mapped to 52 of the 219 known intron-originated microRNA precursors, 38 of which were matched by a single type of 3′ EST and 14 of which were matched by more than one type of 3′ EST (Table S4A, S4B). We compared the loci mapped by the poly A-, poly A+, and bimorphic 3′ ESTs. Using the average gene spacing of 75 kb in the human genome as a cut-off (3 Gb human genome size/40,007 “genes” in the human genome = 75 kb per gene. http://www.ncbi.nlm.nih.gov/projects/Gene/gentrez_stats.cgiSNGLTAX9606), the results show that half of the loci mapped by two or three subtypes of 3′ ESTs do not overlap each other (Table 5, Table S5A, S5B1-3, S5C1-3), suggesting that many different subtypes of transcripts are transcribed from different genomic loci.
The intergenic loci mapped by the 3′ ESTs represent novel transcribed regions in the genome. To investigate the potential functional relevance, those sequences were compared with the genome sequences of 16 species to define their evolutionary conservation. The results show that of the 2,310 mapped intergenic sequences, 1,344 (59%) are conserved across different species, and the conservation rate for poly A- 3′ EST-mapped sequences is similar to the rates of the poly A+ and bimorphic 3′ EST-mapped sequences (Table 6, Table S6A, 6B, 6C).
Confirmation of the 3′ ESTs
Four approaches were used to confirm the detected 3′ ESTs.
- RT-PCR was used to verify each subtype of novel 3′ ESTs. Two types of cDNAs were used as the templates. One was generated by random priming that does not rely on the poly A tail for cDNA synthesis, and the other was generated by oligo dT priming that relies on the poly A tail for cDNA synthesis. For poly A- transcripts, only the random-priming cDNA but not the oligo dT priming-cDNA should generate positive amplification; for bimorphic transcripts, the random-priming cDNA should and the oligo dT priming-cDNA could generate positive amplification; for poly A+ transcripts, both random-priming cDNA and oligo dT-priming cDNA should generate positive amplification. The results show that for the 28 positively detected poly A- 3′ ESTs, 23 were only detected in random-priming cDNA, confirming that their original transcripts do not have poly A tails; for the 12 bimorphic 3′ ESTs, 12 were detected only in random-priming cDNA, confirming that their original transcripts lack poly A tails; for the 12 poly A+ 3′ ESTs, all were detected in both random-priming cDNA and oligo dT-priming cDNA, confirming that their original transcripts have poly A tails (Figure 3A, Table S7).
- RT-PCR was used to detect the 3′ EST-matched microRNA precursors that are transcribed from the intronic regions. For the 28 reactions, 17 were confirmed by sequencing the amplified products (Figure 3B, Table S4C).
- Northern blot was performed to verify the poly A- transcripts using poly A+ transcript-depleted RNA samples from five human cell lines. Two poly A- 3′ ESTs that were verified to originate from poly A- transcripts were used as the probes (Table S7). One probe detected signals in all five cell lines, and the other probe detected signals in four but not in HeLa cells, likely because its abundance in HeLa cells was below the threshold of Northern blot detection (Figure 3C).
- The poly A- 3′ ESTs were compared with the poly A- “transfrags” detected in HepG2 cells by the genome-tiling array study. For the 579 poly A- 3′ ESTs mapped to the 10 chromosomes covered by the array study, 210 overlapped with the poly A- “transfrags”, of which 37 (17%) mapped to cytosolic, 53 (25%) to nuclear, and 120 (57%) to both cytosolic and nuclear “transfrags” (Table 7, Table S8A, 8B, 8C). The high rate of overlap in the “both cytosolic and nuclear” category indicates that the poly A- transcripts are prevalent in both cytosolic and nuclear compartments.
(A). 3′ end verification for each subtype of 3′ EST. 3′ ESTs from poly A-, poly A+, and bimorphic subtypes were selected for the confirmation. Known poly A+ transcripts were used as positive control. Random priming-generated cDNA and oligo dT-generated cDNA were used as the templates. R: cDNA generated by random priming; T: cDNA generated by oligo dT priming. See Table S7 for primer information. (B). Verification of 3′ ESTs mapped to intronic microRNA precursors. RT-PCR was used to verify the 3′ ESTs that map to intronic microRNA precursors. Amplified products were cloned and sequenced. See Table S4C for primer information. (C). Northern blot verification of poly A- 3′ EST. Two poly A- 3′ ESTs were used as the probes (Table S7) and RNAs from five human cell lines were used for the detection.
A total of 52,571 distinct sequences were identified from the raw sequences, of which only 13,782 mapped to the human genome reference sequences. For the unmapped sequences, RT-PCR was used to test whether they were derived from true transcripts. Of the 48 tested 3′ ESTs of poly A+, poly A-, and bimorphic subgroups, 35 were detected in HeLa RNA (Figure S2A, Table S9), and 39 were detected in human fetal brain, kidney and liver RNA (Figure S2B). Although certain sequences could be produced by experimental artifacts, including non-specific PCR amplification, 454 sequence error and “homopolymer” sequences inherent to the pyrosequencing used by the 454 system, the verification results suggest that many unmapped 3′ ESTs originated from authentic transcripts. A possible source could be related to the differences between the HeLa genome and the human genomes that contributed the human genome reference sequences. HeLa cells were derived from cervical cancer cells and have adapted to in vitro culture conditions for over 50 years, resulting in a genome substantially different from the normal human genomes, as reflected by its aneuploid complement of 70 to 164 chromosomes (http://www.atcc.org/common/catalog/numSearch/numResults.cfmatccNumCCL-2). Indeed, our analysis of genome structure in a cancer cell line Kasumi-1 shows that cancer genome sequences are substantially different from the normal human genome sequences. The transcripts expressed from the unique contents in the HeLa-specific genomic DNA would not be expected to map to the human genome reference sequences.
Despite the long-term indirect evidence for the presence of poly A- transcripts in eukaryotic cells, only the histone poly A- transcripts have been systematically identified at the sequence level. Unlike the poly A+ transcripts that can be easily isolated by binding to the 3′ poly A tail using oligo dT, isolating poly A- transcripts with no known consensus sequences is technically difficult. The Total RNA Detection system developed in this study provides a solution to overcome this obstacle. Combined with next-generation sequencing platforms, this system should be useful for comprehensive poly A- transcript identification. Many fundamental questions remain to be answered, including what type(s) of RNA polymerase generates the poly A- transcripts, how the poly A- transcripts are processed, whether the poly A- transcripts code for protein or are non-coding, and more importantly, what their functions are. Answers to these questions should have significant impacts on decoding the transcriptome, annotating the transcribed elements in the genomes, and studying the biological role of poly A- transcripts.
Materials and Methods
HeLa cells (ATCC CCL-2) were cultured in MEM medium containing 10% fetal calf serum. Cells at exponential growth were harvested with trypsin treatment. Cytoplasmic RNA was isolated from the cells by using the RNeasy midi kit (Qiagen) following the manufacturer's protocol. To ligate the RNA adaptor (5′ P-UUAAUGGUAUCAACGCAGAGUGG (ddC) -3′) to the 3′ end of all RNA templates, RNA and adaptor were mixed at an approximate 1∶10 molar ratio (50 µg RNA and 1 µl of 1,000 µM of adaptor) in a total volume of 21 µl. The mixture was heated at 75°C for 5 minutes and cooled on ice, and the following items were added to the mixture: 2.5 µl of 10× T4 RNA ligation buffer, 1 µl of DMSO, and 1 µl (20 units) of T4 RNA ligase (New England Biolabs). The ligation mixture was incubated at 37°C for 1 hour. One µl of 0.5 M EDTA was added to the mixture to stop the reaction.
Subtraction of ribosomal RNAs and removal of short RNAs
Subtraction was performed by using the RiboMinus Transcriptome Isolation Kit (Invitrogen) following the manufacturer's protocol. To increase subtraction efficiency, two additional probes were used, including a probe for the 18S ribosomal RNA: 5′ biotin-AGTCAAGTTCGACCGTCTTCTCAGC (location at 1884–1909, M10098), and a probe for the 28S ribosomal RNA: 5′ biotin-ACTAACCTGTCTCACGACGGTCT (location at 4493–4515, M11167). RNA and each set of probes were mixed at a 1∶10 molar ratio in a final hybridization solution (10 mM Tris-Cl pH 7.5, 1 mM EDTA, 1 M NaCl). After denaturing at 75°C for 10 minutes, the mixture was maintained at 37°C for 10 min. MagPrep® Streptavidin Beads (Novagen) were added to the mixture to remove the hybrids and the free probes. The subtracted RNA was precipitated by adding 1/10 volume of 3 M sodium acetate, and 2.5× volume of ethanol, and was maintained at −20°C for 30 minutes. RNA was collected by centrifugation and dissolved in water. To further remove short RNAs, the subtracted RNA was passed over a mini-column of the RNeasy mini kit (Qiagen). The eluted RNA was precipitated and used for cDNA synthesis.
cDNA synthesis and 3′ ESTs collection
The enriched RNA was converted into double-strand cDNA by using a cDNA synthesis kit (Invitrogen) following the manufacturer's protocol, except for using the biotin-labeled primer based on the 3′ end adaptor sequences for the priming (5′ biotin-ATCTAGAGCGGCCGCAATGGCCACTCTGCGTTGATAC). Upon digestion of double-strand cDNA by NlaIII (New England Biolabs), the 3′ cDNA was isolated by using the MagPrep® streptavidin beads. An adaptor (sense primer: 5′-TTTGGATTTGCTGGTGCAGTACAACTAGGCTTAATAGGGACATG-3′, antisense primer: 5′-TCCCTATTAAGCCTAGTTGTACTGCACCAGCAAATCC-3′) was ligated to the 5′ end of the recovered 3′ cDNA. To integrate the 454 sequencing primer A and primer B, a 23-cycle PCR was performed by using a sense primer containing the 454 primer A and the 5′ adaptor- (5′-TTTGGATTTGCTGGTGCAGTACAACTAGGCTTAATAGGGACATG-3′, the underlined part is the 454 primer A and the rest is the 5′ adaptor), and an antisense primer containing the 454 primer B and the 3′ adaptor (5′-TCCCTATTAAGCCTAGTTGTACTGCACCAGCAAATCC-3′, underlined is the 454 primer B and the rest is the RNA adaptor ligated to the 3′ end of all RNA templates). PCR products were purified with a PCR purification kit (Qiagen), and used for 454 sequencing collection by reading from the 3′ end towards the 5′ of the 3′ cDNA templates using the 454 primer B as the sequencing primer.
The following steps were used to generate non-redundant sequences: 1) only the sequences with the 3′ adaptor sequences were kept; 2) 3′ adaptor sequences were removed; 3) sequences shorter than 11 bps were eliminated; 4) the same sequences were combined; 5) sequences were separated into three groups: no A residue at the 3′ end, one A residue at the 3′ end, and more than one A residue at the 3′ end. Within each group, homologous sequences were combined at a cut-off of at least 90% identity and 90% coverage, the longest sequence of which was selected as the representative sequence; 6) the resulting sequences in each group were further manually checked if necessary to ensure the sequence quality. The sequences were deposited in the NCBI dbEST (dbEST ID 43676141-43728711).
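Steps 1 to 4 and the grouping part of step 5 can be sketched in Python as follows. This is an illustrative outline under stated assumptions rather than the pipeline used in the study. It assumes reads have already been oriented 5′ to 3′ so that the adaptor sits at the 3′ end, that adaptor matching is exact, and that the adaptor string is passed in by the caller. The homology clustering at 90% identity and 90% coverage and the manual check of step 6 are not reproduced. `collapse_reads` and `trailing_a_group` are hypothetical names.

```python
from collections import Counter

MIN_LENGTH = 11  # sequences shorter than 11 bp are eliminated (step 3)


def trailing_a_group(seq: str) -> str:
    """Assign a sequence to one of the three groups used in step 5."""
    n = len(seq) - len(seq.rstrip("A"))
    if n == 0:
        return "no_A"
    return "one_A" if n == 1 else "multi_A"


def collapse_reads(reads, adaptor: str):
    """Filter, trim and collapse reads, then split them by 3' A content.

    reads:   iterable of uppercase sequences, 5'->3', adaptor at the 3' end
             (orientation and exact adaptor matching are assumptions).
    adaptor: 3' adaptor sequence as it appears in the oriented reads.
    Returns a dict mapping group name to {sequence: copy number}.
    """
    if not adaptor:
        raise ValueError("adaptor must be a non-empty string")
    counts = Counter()
    for read in reads:
        if not read.endswith(adaptor):      # step 1: keep adaptor-bearing reads
            continue
        insert = read[: -len(adaptor)]      # step 2: remove the adaptor
        if len(insert) < MIN_LENGTH:        # step 3: drop short sequences
            continue
        counts[insert] += 1                 # step 4: combine identical sequences
    groups = {"no_A": {}, "one_A": {}, "multi_A": {}}
    for seq, copies in counts.items():      # step 5 (grouping only)
        groups[trailing_a_group(seq)][seq] = copies
    return groups
```

Keeping the copy number alongside each distinct sequence preserves the distinction between distinct 3′ ESTs and total sequence copies that the abundance comparison in the Results depends on.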
Comparison of 3′ ESTs with known transcript sequences
The known transcript sequences were downloaded from the following databases: RefSeq (ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/), mRNA (http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/), EST (ftp://ftp.ncbi.nih.gov/repository/dbEST/). Two types of SAGE tag reference databases were used, including the SAGEmap full database that contains annotated SAGE tags based on the known human transcript sequences (http://www.ncbi.nlm.nih.gov/projects/SAGE/) and the GEO SAGE database that contains experimentally collected SAGE tags (http://www.ncbi.nlm.nih.gov/geo/). The poly A tail in the sequences was excluded for the comparison. The PatternHunter 2.0 program was used for sequence comparison with the parameter setting at e<0.1 (www.bioinformaticssolution.com/ph/, 24–26). For SAGE tag comparison, a 17-bp SAGE tag was extracted after CATG in each CATG-containing 3′ EST and matched to the reference SAGE tags.
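The SAGE tag step can be illustrated with a short Python sketch. It assumes the 3′ EST is an uppercase sense-strand string, that the tag is taken immediately downstream of the first (5′-most) CATG, and that ESTs with fewer than 17 bases after that site yield no tag. The text does not say which CATG is used when several are present, so that choice is an assumption, and the function names are hypothetical.

```python
ANCHOR = "CATG"   # NlaIII recognition site
TAG_LENGTH = 17   # 17-bp SAGE tag, as stated in the text


def extract_sage_tag(est: str):
    """Return the 17-bp tag following the first CATG, or None if unavailable.

    Assumption: the 5'-most CATG is used; a later CATG always has fewer
    downstream bases, so it cannot yield a tag when the first one does not.
    """
    pos = est.find(ANCHOR)
    if pos == -1:
        return None
    start = pos + len(ANCHOR)
    tag = est[start:start + TAG_LENGTH]
    return tag if len(tag) == TAG_LENGTH else None


def matches_reference(est: str, reference_tags) -> bool:
    """True if the extracted tag is present in a set of reference SAGE tags."""
    tag = extract_sage_tag(est)
    return tag is not None and tag in reference_tags
```

Holding the reference tags in a Python `set` makes each lookup constant-time, which matters when tens of thousands of 3′ ESTs are compared against a large tag database.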
Mapping 3′ ESTs to genome sequences
The 3′ ESTs longer than 19 bp were mapped to the human genome reference sequences through BLAT (HG18, http://hgdownload.cse.ucsc.edu/downloads.html), with a minimum of 90% coverage and 90% identity. The poly A tail in the poly A+ 3′ EST was excluded for the mapping. The intergenic region was defined as the region outside the annotated genes, and the intragenic region was defined as the region covered by the annotated genes. The microRNA precursor sequences were downloaded from miRBase (http://microrna.sanger.ac.uk/sequences/, 27).
To study the evolutionary conservation of the intergenic sequences mapped by 3′ ESTs, the mapped sequences were aligned to the genomes of 17 vertebrate species (Vertebrate Multiz 17-way genome alignments, http://www.genome.ucsc.edu). For each aligned sequence, the divergence (substitution rate) between human and any of the 16 vertebrate species was calculated by the Kimura two-parameter method. Although the neutral substitution rates between human and the majority of the 16 genomes are still lacking, based on the phylogenetic tree of the 17 species (http://www.genome.ucsc.edu/images/phylo/), rodents (mouse and rat) have the least divergence time with human among the 14 species, except for the chimpanzee and macaque. Since the mouse and rat genome sequence analyses show that the neutral substitution rate between rodent and human is around 0.5 substitution/site, a conservative 0.5 was used as the neutral substitution rate between human and any of the other 14 species except chimpanzee and macaque, as they are too similar to human. For each sequence comparison, the number of substitutions observed after correction for multiple hits (O) and the number of substitutions expected under neutral evolution (λ) were calculated. The probability that a sequence is under conservation constraints was then determined from O and λ.
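The mapping thresholds can be expressed as a simple filter over alignment records. The Python sketch below is a hedged illustration and does not parse real BLAT output. The `Hit` record and its fields are hypothetical, coverage is assumed to mean aligned query bases divided by query length, and identity is assumed to mean matching bases divided by aligned bases. The text does not spell out either definition.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Minimal, hypothetical alignment record for one 3' EST placement."""
    query_length: int   # EST length after poly A tail removal
    aligned_bases: int  # query bases covered by the alignment
    matches: int        # identical bases within the aligned region
    chrom: str
    start: int
    end: int


def passes_thresholds(hit: Hit, min_coverage: float = 0.90,
                      min_identity: float = 0.90, min_length: int = 20) -> bool:
    """Apply the stated cut-offs: longer than 19 bp, >=90% coverage and identity."""
    if hit.query_length < min_length or hit.aligned_bases == 0:
        return False
    coverage = hit.aligned_bases / hit.query_length   # assumed definition
    identity = hit.matches / hit.aligned_bases        # assumed definition
    return coverage >= min_coverage and identity >= min_identity


def unique_location(hits):
    """Return the single passing hit, or None if zero or several pass."""
    kept = [h for h in hits if passes_thresholds(h)]
    return kept[0] if len(kept) == 1 else None
```

A helper such as `unique_location` corresponds to restricting the genome-origin analysis to 3′ ESTs that map to a single location.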
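The Kimura two-parameter distance has a standard closed form, K = −½ ln[(1 − 2P − Q)·√(1 − 2Q)], where P and Q are the proportions of transitions and transversions among compared sites. The Python sketch below computes it for a pair of aligned sequences. How O and λ are turned into a probability is not stated in this section, so the `poisson_lower_tail` function is purely an assumed illustration (the chance of seeing O or fewer substitutions when λ are expected under a Poisson model) and should not be read as the authors' expression.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned, equal-length sequences.

    Columns containing gaps or ambiguous bases are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    x, y = 1 - 2 * p - q, 1 - 2 * q
    if x <= 0 or y <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(x * math.sqrt(y))


def poisson_lower_tail(observed: int, expected: float) -> float:
    """P(X <= observed) for X ~ Poisson(expected). Assumed model, see lead-in."""
    term = math.exp(-expected)
    total = term
    for k in range(1, observed + 1):
        term *= expected / k
        total += term
    return total
```

As a usage example under the same assumptions, a 100-bp aligned sequence with the neutral rate of 0.5 substitution/site would give λ = 50, and O could be taken as the K2P distance multiplied by the number of compared sites, rounded to an integer. A small lower-tail value would then indicate fewer substitutions than neutral evolution predicts.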
A group of novel poly A-, poly A+, and bimorphic 3′ ESTs mapped to the human genome sequences was selected for PCR verification (Table S7). The sense primer was designed upstream of the mapped genomic location, and the antisense primer was designed based on the 3′ end sequence of each 3′ EST. HeLa RNA was used for cDNA synthesis by using MMLV reverse transcriptase (Invitrogen), and oligo dT17 primer or random hexamer primer. Six known poly A+ transcript sequences were used as positive control. A 30-cycle PCR was performed for each reaction by using the sense and antisense primers and either type of cDNA templates at 94°C for 30 s, 55°C for 30 s, and 72°C for 30 s. PCR products were visualized on agarose gels.
To verify the microRNA precursor-derived 3′ ESTs, a set of 3′ ESTs matching to the intronic microRNA precursor sequences was selected for the test (Table S4C). Sense and antisense primers were designed based on the 3′ ESTs using the Primer3 program, and random-priming generated cDNAs from HeLa RNA were used as the templates for PCR. PCR products were cloned into pGEMT vector (Promega) and sequenced with BigDye reagents (Applied BioSystems).
Two novel poly A- 3′ ESTs were used for Northern blot confirmation. cDNA probes for each poly A- 3′ EST were generated by PCR amplification using random-priming generated HeLa cDNA templates (Table S7). Total RNAs from HepG2, kidney, HL-60, K-562 and HeLa were used for the test. The poly A+ transcript was depleted from each RNA sample by three rounds of oligo dT bead treatment (Dynal). The poly A+ depleted RNA (20 µg) was fractionated through formaldehyde-denatured agarose gel, transferred to positively charged nylon membranes, and UV cross-linked. Probes were labeled with biotin using a random primer labeling method. Blots were pre-hybridized for 1 hour in hybridization buffer at 45°C and then hybridized with probes overnight at 45°C. Blots were washed twice with low stringency buffer at room temperature and twice with high stringency buffer at 45°C. The signals were then detected by chemiluminescent detection reagents.
RT-PCR was used to verify a group of the unmapped 3′ ESTs. Sense primers and antisense primers were designed based on each 3′ EST (Table S9). Random-priming generated cDNAs from human RNA of HeLa, fetal brain, kidney, and liver were used as the templates for PCR. PCR products were checked on agarose gels.
Comparison of poly A- 3′ ESTs with poly A- “transfrag” of Affymetrix genome-tiling array
The poly A- “transfrag” data detected in HepG2 cells by Affymetrix 10 chromosome genome-tiling array were downloaded from (http://transcriptome.affymetrix.com/publication/transcriptome_10chromosomes/, 15). The poly A- 3′ ESTs mapped to the same 10 chromosomes were used to compare the genomic locations of the “transfrags”. The location shared by a 3′ EST and a “transfrag” is defined as overlapping.
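The overlap definition above amounts to an interval intersection test on shared chromosomes. A minimal Python sketch follows. It assumes both data sets use the same genome assembly and closed, 1-based coordinates, and it represents each feature as a hypothetical `(chrom, start, end)` tuple. Neither convention is stated in the text.

```python
def intervals_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def count_overlapping(ests, transfrags) -> int:
    """Count 3' ESTs whose location is shared with at least one transfrag.

    ests, transfrags: iterables of (chrom, start, end) tuples.
    """
    by_chrom = {}
    for chrom, start, end in transfrags:
        by_chrom.setdefault(chrom, []).append((start, end))
    overlapping = 0
    for chrom, start, end in ests:
        candidates = by_chrom.get(chrom, [])
        if any(intervals_overlap(start, end, t_start, t_end)
               for t_start, t_end in candidates):
            overlapping += 1
    return overlapping
```

The linear scan per chromosome is adequate for a few hundred 3′ ESTs. For larger inputs, sorting the transfrags and using binary search or an interval tree would be the more usual choice.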
The 3′ end distribution of histone 3′ ESTs.
(0.11 MB XLS)
Intron-originated microRNA precursors mapped by 3′ ESTs
(0.06 MB XLS)
Evolution conservation of 3′ EST mapped intergenic regions
(3.65 MB XLS)
RT-PCR confirmation of novel 3′ ESTs.
(0.05 MB XLS)
RT-PCR confirmation for the 3′ ESTs not mapped to the human genome sequences.
(0.03 MB XLS)
Conceived and designed the experiments: SMW. Performed the experiments: QW JC YZ TZ. Analyzed the data: YCK JL ZX MQZ CIW SMW. Wrote the paper: SMW.
- 1. Milcarek C, Price R, Penman S (1974) The metabolism of a polyA minus mRNA fraction in HeLa cells. Cell 3: 1–10.
- 2. Nakazato H, Edmonds M, Kopp DW (1974) Differential metabolism of large and small polyA sequences in the heterogeneous nuclear RNA of HeLa cells. Proc Natl Acad Sci U S A. 71: 200–204.
- 3. Grady LJ, North AB, Campbell WP (1978) Complexity of polyA+ and polyA- polysomal RNA in mouse liver and cultured mouse fibroblasts. Nucleic Acids Res. 5: 697–712.
- 4. Van Ness J, Maxwell IH, Hahn WE (1979) Complex population of nonpolyadenylated messenger RNA in mouse brain. Cell 18: 1341–1349.
- 5. Zimmerman JL, Fouts DL, Manning JE (1980) Evidence for a complex class of nonadenylated mRNA in Drosophila. Genetics 95: 673–691.
- 6. Katinakis PK, Slater A, Burdon RH (1980) Non-polyadenylated mRNAs from eukaryotes. FEBS Lett. 116: 1–7.
- 7. Galau GA, Legocki AB, Greenway SC, Dure LS 3rd (1981) Cotton messenger RNA sequences exist in both polyadenylated and nonpolyadenylated forms. J Biol Chem. 256: 2551–2560.
- 8. Salditt-Georgieff M, Harpold MM, Wilson MC, Darnell JE Jr (1981) Large heterogeneous nuclear ribonucleic acid has three times as many 5′ caps as polyadenylic acid segments, and most caps do not enter polyribosomes. Mol Cell Biol. 1: 179–187.
- 9. Moffett RB, Doyle D (1981) Polyadenylic acid-containing and -deficient messenger RNA of mouse liver. Biochim Biophys Acta. 652: 177–192.
- 10. Zimmerman JL, Fouts DL, Levy LS, Manning JE (1982) Nonadenylylated mRNA is present as polyadenylylated RNA in nuclei of Drosophila. Proc Natl Acad Sci U S A. 79: 3148–3152.
- 11. Duncan R, Humphreys T (1984) The polyA + RNA sequence complexity is also represented in polyA - RNA in sea-urchin embryos. Differentiation 28: 24–29.
- 12. Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, et al. (2005) Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science. 308: 1149–1154.
- 13. Kornberg RD (1999) Eukaryotic transcriptional control. Trends Cell Biol. 9: M46–49.
- 14. Grummt I (1999) Regulation of mammalian ribosomal gene transcription by RNA polymerase I. Prog Nucleic Acid Res Mol Biol. 62: 109–154.
- 15. Detke S, Stein JL, Stein GS (1978) Synthesis of histone messenger RNAs by RNA polymerase II in nuclei from S phase HeLa S3 cells. Nucleic Acids Res. 5: 1515–1528.
- 16. Willis IM (1993) RNA polymerase III. Genes, factors and transcriptional specificity. Eur J Biochem. 212: 1–11.
- 17. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376–380.
- 18. Lee S, Clark T, Chen J, Zhou G, Scott LR, et al. (2002) Correct identification of genes from serial analysis of gene expression tag sequences. Genomics. 79: 598–602.
- 19. Beaudoing E, Freier S, Wyatt JR, Claverie JM, Gautheret D (2000) Patterns of variant polyadenylation signal usage in human genes. Genome Res. 10: 1001–1010.
- 20. Dominski Z, Marzluff WF (1999) Formation of the 3′ end of histone mRNA. Gene 239: 1–14.
- 21. Mullen TE, Marzluff WF (2008) Degradation of histone mRNA requires oligouridylation followed by decapping and simultaneous degradation of the mRNA both 5′ to 3′ and 3′ to 5′. Genes Dev. 22: 50–65.
- 22. Rodriguez A, Griffiths-Jones S, Ashurst JL, Bradley A (2004) Identification of mammalian microRNA host genes and transcription units. Genome Res. 14: 1902–1910.
- 23. Chen J, Kim YC, Jung YC, Xuan Z, Dworkin G, et al. (2008) Scanning the human genome at kilobase resolution. Genome Res. 18: 751–62.
- 24. Li M, Ma B, Kisman D, Tromp J (2002) PatternHunter II: Highly sensitive and fast homology search. Bioinformatics 18: 440–445.
- 25. Mouse Genome Sequencing Consortium (2002) Initial sequencing and comparative analysis of the mouse genome. Nature. 420: 520–562.
- 26. Rat Genome Sequencing Project Consortium (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428: 493–521.
- 27. Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ (2006) miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. 34: D140–D144.
- 28. Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 16: 111–120.