Poly A- Transcripts Expressed in HeLa Cells

Background
Transcripts expressed in eukaryotes are classified as poly A+ transcripts or poly A- transcripts based on the presence or absence of the 3′ poly A tail. Most transcripts identified so far are poly A+ transcripts, whereas the poly A- transcripts remain largely unknown.

Methodology/Principal Findings
We developed the TRD (Total RNA Detection) system for transcript identification. The system detects transcripts through the following steps: 1) depleting the abundant ribosomal and small-size transcripts; 2) synthesizing cDNA without regard to the status of the 3′ poly A tail; 3) applying 454 sequencing technology for massive 3′ EST collection from the cDNA; and 4) determining the genome origins of the detected transcripts by mapping the sequences to the human genome reference sequences. Using this system, we characterized the cytoplasmic transcripts from HeLa cells. Of the 13,467 distinct 3′ ESTs analyzed, 24% are poly A-, 36% are poly A+, and 40% are bimorphic, with poly A+ features but without the 3′ poly A tail. Most of the poly A- 3′ ESTs do not match known transcript sequences; their distribution pattern in the genome is similar to that of the poly A+ and bimorphic 3′ ESTs, and their mapped intergenic regions are evolutionarily conserved. Experiments confirmed the authenticity of the detected poly A- transcripts.

Conclusion/Significance
Our study provides the first large-scale sequence evidence for the presence of poly A- transcripts in eukaryotes. The abundance of the poly A- transcripts highlights the need for comprehensive identification of these transcripts for decoding the transcriptome, annotating the genome and studying the biological relevance of the poly A- transcripts.


Introduction
The genome is expressed through transcription, which generates different classes of RNA molecules, specifically ribosomal RNAs, messenger RNAs, and small RNAs, that together constitute the transcriptome. Transcription is regulated at multiple levels, including differential promoter usage, alternative splicing, intron retention, and alternative polyadenylation. Furthermore, the abundance of individual transcripts can vary by up to a million-fold. Thus, the transcriptome is far more complicated than the original coding sequences in the genome, and decoding the transcriptome will be more challenging than decoding the genome.
An ultimate goal of transcriptome study is to identify all transcripts at the sequence level. This effort has been very successful for the poly A+ transcripts, largely because the 3′ poly A tail facilitates their isolation and cDNA synthesis with oligo dT. To date, millions of poly A+ transcripts have been sequenced from various species. Despite evidence indicating the wide prevalence of poly A- transcripts, however, only a few poly A- transcripts have been identified so far at the sequence level. Without the poly A- transcript information, the transcriptome complexity and genome organization cannot be fully understood.
The lack of poly A- transcript information is largely due to technical factors. Unlike the poly A+ transcripts, which share the universal 3′ poly A tail, poly A- transcripts have no known consensus sequence that can be used for isolation and cDNA synthesis. To overcome this obstacle, we developed a technical system termed Total RNA Detection (TRD). The system consists of three key elements: 1) enriching the poly A- transcripts by depleting the abundant ribosomal and tRNA transcripts; 2) synthesizing cDNA without regard to the status of the 3′ poly A tail; and 3) using the 454 sequencer for massive 3′ EST collection [17]. The 3′ EST provides poly A signal and poly A tail information to distinguish between poly A+ transcripts and poly A- transcripts, can be compared directly with known transcripts, and can be mapped to the genome, where it marks the 3′ boundary of the mapped locus. Using the TRD system, we analyzed the cytoplasmic transcripts of HeLa cells, a model cell line widely used for transcriptome study. Our sequence data indicate that the poly A- transcripts indeed exist.

Results and Discussion
Transcript enrichment and sequence collection
A challenge in studying the poly A- transcripts is to distinguish the true poly A- transcripts from degraded transcripts that originate from in vivo physiological RNA metabolism or in vitro RNase activities (which also lack the 3′ poly A tail). Another challenge is the scarcity of the poly A- transcripts in the total transcriptome content. Three key steps in the Total RNA Detection system were used to address these issues (Figure 1). miRNA, these small RNAs have been extensively characterized by many other studies and they are not the focus of this study. Removal of small-size RNAs provides a better background for identifying the true poly A- transcripts.
The enriched transcripts were converted into cDNA by using a biotinylated primer based on the universal adaptor sequence at the 3′ ends of the transcripts. The cDNA was then digested with the 4-base-cutting restriction enzyme NlaIII to generate 3′ cDNAs of about 140 bp [18], which fit the 454 sequencing range. The 3′ cDNA was recovered by using streptavidin beads. An adaptor was ligated to the 5′ end of the recovered 3′ cDNA. To generate 3′ cDNA templates for 454 sequencing, the 3′ cDNA was amplified by PCR using a sense primer based on the 5′ adaptor sequences and an antisense primer based on the 3′ end universal adaptor sequences. 454 sequencing was performed from the 3′ ends of the cDNA. A total of 273,949 raw sequences were collected in a single 454 sequencing run. After removing sequences from remaining ribosomal RNAs, tRNAs, and mitochondrial RNAs, sequences without a 3′ end adaptor, and sequences shorter than 11 bp, 148,520 sequences qualified, representing 52,571 distinct 3′ ESTs. Of the 52,571 3′ ESTs, 13,782 were mapped to the human genome reference sequences (HG18) and were used for the following analyses (Table 1, Table S1).

Classification, novelty, and abundance of 3′ ESTs
Based on the poly A signal, the poly A tail, and matches to known transcript sequences, criteria were set to classify the 3′ ESTs into three subgroups:
Poly A- 3′ EST. A sequence with no A or only one A at the 3′ end (an A occurs at this position with a probability of 1/4 by chance, out of A, G, C, and T), AND without a poly A signal, AND matching none of the known RefSeq/mRNA/EST/SAGE tags (nearly all known sequences are from poly A+ transcripts). Histone transcripts, the only known poly A- transcripts, were identified by direct matching to known histone mRNA sequences.
Poly A+ 3′ EST. A sequence matching any of the RefSeq/mRNA/EST/SAGE tags and with one or more As at the 3′ end, with or without a poly A signal, OR a sequence matching none of the known RefSeq/mRNA/EST/SAGE tags (novel) that has either one A and a poly A signal, or more than one A with or without a poly A signal.
Bimorphic 3′ EST. A sequence matching any of the RefSeq/mRNA/EST/SAGE tags and with no A at the 3′ end, with or without a poly A signal, OR a sequence matching none of the known RefSeq/mRNA/EST/SAGE tags (novel) with no A at the 3′ end but with a poly A signal. The bimorphic 3′ EST represents an isoform of a poly A+ transcript. Its lack of the 3′ poly A tail reflects the dynamics of poly A tail metabolism in the poly A+ transcript.
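The three definitions above form a complete decision rule, which can be sketched in code. The following Python sketch is illustrative only and is not the authors' implementation: the `Est3` record, its field names, and the `classify` function are hypothetical, and how a poly A signal or a database match is detected is assumed to be decided elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Est3:
    """Hypothetical summary of one 3' EST (field names are assumptions)."""
    n_terminal_a: int          # number of A residues at the 3' end
    has_polya_signal: bool     # a poly A signal was found (detection not shown)
    matches_known: bool        # matches a RefSeq/mRNA/EST/SAGE tag
    is_histone: bool = False   # direct match to a known histone mRNA


def classify(est: Est3) -> str:
    """Apply the classification criteria stated in the text."""
    # Histone transcripts are the only known poly A- transcripts and are
    # assigned by direct matching to known histone mRNA sequences.
    if est.is_histone:
        return "poly A-"
    if est.matches_known:
        # Known sequence: any 3' A makes it poly A+, none makes it bimorphic.
        return "poly A+" if est.n_terminal_a >= 1 else "bimorphic"
    # Novel sequence (no match to known tags).
    if est.n_terminal_a >= 2:
        return "poly A+"
    if est.n_terminal_a == 1:
        return "poly A+" if est.has_polya_signal else "poly A-"
    return "bimorphic" if est.has_polya_signal else "poly A-"
```

For example, a novel sequence with a single 3′ A and no poly A signal is classified as poly A-, whereas the same sequence with a poly A signal is classified as poly A+.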
Based on these criteria, 24% of the 3′ ESTs (2% are histone 3′ ESTs) were classified as poly A- 3′ ESTs, 36% as poly A+ 3′ ESTs, and 40% as bimorphic 3′ ESTs (Table 2, Table S1A, S1B, S1C). Most of the poly A+ and bimorphic 3′ ESTs match known human transcript sequences, which are essentially all from poly A+ transcripts. The unmatched novel poly A+ 3′ ESTs tend to have short poly A tails (Table S2A, S2B). Although the poly A- 3′ ESTs account for 24% of the total sequences, they contribute only 12% of the total sequence copies (Table 2), consistent with the poly A- transcripts being highly heterogeneous but expressed at lower levels [1,12]. Similarly, the bimorphic 3′ ESTs account for 41% of the total sequences but contribute only 15% of the total sequence copies. The lack of high-copy bimorphic 3′ ESTs implies that the highly expressed poly A+ transcripts may not use changes in poly A length as a regulatory mechanism. The poly A+ 3′ ESTs account for 36% of the total sequences but contribute 73% of the total copies, indicating that the poly A+ transcripts are far more abundant in the transcriptome than the poly A- and bimorphic transcripts.
Histone transcripts are the only known polymerase II-generated poly A- transcripts with sequence information [12,15,20]. A total of 315 distinct 3′ ESTs from 46 histone genes were detected in this study. Comparison of the 3′ ends of the detected histone 3′ ESTs with the full-length histone transcript sequences shows that 80% of the 3′ ESTs matched proximal to the 3′ end of their corresponding full-length sequences (Table 3, Figure 2, Figure S1, Table S3). The 3′ end distribution of these histone sequences closely resembles that observed in a recent histone study [21]. The high proportion of intact 3′ ends among the detected histone transcripts provides an internal control for the authenticity of the poly A- transcripts detected in this study.

Mapping 3′ ESTs to the human genome reference sequences
We analyzed the genome origins of the 8,178 3′ ESTs that mapped to a single location in the genome. Nearly two-thirds of the 3′ ESTs mapped to intragenic regions and one-third to intergenic regions (Table 4, Table S1A, S1B, S1C). The poly A- 3′ ESTs mapped more often to intergenic regions than the poly A+ and bimorphic 3′ ESTs. Of the 3′ ESTs mapped to intragenic regions, most mapped to introns, and the poly A- 3′ ESTs were more often oriented in the antisense direction. Those mapped to introns might represent intron-retained isoforms of the annotated genes, transcripts from genes that overlap the annotated genes, alternatively spliced transcripts whose last exon sequences differ, or transcripts that originate largely from introns and have regulatory functions, such as microRNAs [22]. Indeed, 491 intron-mapped 3′ ESTs mapped to 52 of the 219 known intron-originated microRNA precursors, 38 of which were matched by a single type of 3′ EST and 14 of which were matched by more than one type of 3′ EST (Table S4A, S4B).
We compared the loci mapped by the poly A-, poly A+, and bimorphic 3′ ESTs. Using the average gene spacing of 75 kb in the human genome as a cut-off (3 Gb human genome size / 40,007 "genes" in the human genome ≈ 75 kb; http://www.ncbi.nlm.nih.gov/projects/Gene/gentrez_stats.cgi?SNGLTAX=9606), the results show that half of the loci mapped by two or three subtypes of 3′ ESTs do not overlap each other (Table 5, Table S5A, S5B1-3, S5C1-3), suggesting that many of the different transcript subtypes are transcribed from different genomic loci.
The intergenic loci mapped by the 3′ ESTs represent novel transcribed regions in the genome. To investigate their potential functional relevance, these sequences were compared with the genome sequences of 16 species to assess their evolutionary conservation. The results show that of the 2,310 mapped intergenic sequences, 1,344 (59%) are conserved across different species, and the conservation rate for poly A- 3′ EST-mapped sequences is similar to the rates for the poly A+ and bimorphic 3′ EST-mapped sequences (Table 6, Table S6A, 6B, 6C).

Confirmation of the 3′ ESTs
Four approaches were used to confirm the detected 39 ESTs.
1. RT-PCR was used to verify each subtype of novel 3′ ESTs. Two types of cDNA were used as templates. One was generated by random priming, which does not rely on the poly A tail for cDNA synthesis, and the other was generated by oligo dT priming, which relies on the poly A tail for cDNA synthesis. For poly A- transcripts, only the random-priming cDNA, and not the oligo dT-priming cDNA, should generate positive amplification; for bimorphic transcripts, the random-priming cDNA should and the oligo dT-priming cDNA could generate positive amplification; for poly A+ transcripts, both the random-priming cDNA and the oligo dT-priming cDNA should generate positive amplification. The results show that of the 28 positively detected poly A- 3′ ESTs, 23 were detected only in random-priming cDNA, confirming that their original tran- Figure 3A, Table S7).
2. RT-PCR was used to detect the 3′ EST-matched microRNA precursors that are transcribed from intronic regions. Of the 28 reactions, 17 were confirmed by sequencing the amplified products (Figure 3B, Table S4C).
3. Northern blot was performed to verify the poly A- transcripts using poly A+ transcript-depleted RNA samples from five human cell lines. Two poly A- 3′ ESTs that were verified to originate from poly A- transcripts were used as the probes (Table S7). One probe detected signals in all five cell lines, and the other probe detected signals in four cell lines but not in HeLa cells, likely because its abundance in HeLa cells was below the threshold of northern blot detection (Figure 3C).
4. The poly A- 3′ ESTs were compared with the poly A- "transfrags" detected in HepG2 cells by the genome-tiling array study [12]. Of the 579 poly A- 3′ ESTs mapped to the 10 chromosomes covered by the array study, 210 overlapped with the poly A- "transfrags", of which 37 (17%) mapped to cytosolic, 53 (25%) to nuclear, and 120 (57%) to both cytosolic and nuclear "transfrags" (Table 7, Table S8A, 8B, 8C). The high rate of overlap in the "both cytosolic and nuclear" category indicates that the poly A- transcripts are prevalent in both the cytosolic and nuclear compartments.
A total of 52,571 distinct sequences were identified from the raw sequences, of which only 13,782 mapped to the human genome reference sequences. RT-PCR was used to test whether the unmapped sequences were derived from true transcripts. Of the 48 tested 3′ ESTs from the poly A+, poly A-, and bimorphic subgroups, 35 were detected in HeLa RNA (Figure S2A, Table S9), and 39 were detected in human fetal brain, kidney and liver RNA (Figure S2B). Although certain sequences could be produced by experimental artifacts, including non-specific PCR amplification, 454 sequencing errors and "homopolymer" sequences inherent to the pyrosequencing used by the 454 system [17], the verification results suggest that many unmapped 3′ ESTs originated from authentic transcripts. A possible source of the unmapped sequences is the difference between the HeLa genome and the human genomes that contributed to the human genome reference sequences. HeLa cells were derived from cervical cancer cells and have adapted to in vitro culture conditions for over 50 years, resulting in a genome substantially different from normal human genomes, as reflected by its aneuploid content of 70 to 164 chromosomes (http://www.atcc.org/common/catalog/numSearch/numResults.cfm?atccNum=CCL-2). Indeed, our analysis of genome structure in the cancer cell line Kasumi-1 shows that cancer genome sequences are substantially different from normal human genome sequences [23]. Transcripts expressed from the unique content of the HeLa-specific genomic DNA would not be expected to map to the human genome reference sequences.
Figure 2 (caption, partially recovered). See Table 3, Table S3 and Figure S1 for the distribution of other histone 3′ ESTs. doi:10.1371/journal.pone.0002803.g002
Despite the long-standing indirect evidence for the presence of poly A- transcripts in eukaryotic cells, only the histone poly A- transcripts have been systematically identified at the sequence level.
Unlike the poly A+ transcripts, which can be easily isolated by binding the 3′ poly A tail with oligo dT, poly A- transcripts have no known consensus sequences, which makes their isolation technically difficult. The Total RNA Detection system developed in this study provides a way to overcome this obstacle. Combined with next-generation sequencing platforms, this system should be useful for comprehensive poly A- transcript identification. Many fundamental questions remain to be answered, including which type(s) of RNA polymerase generate the poly A- transcripts, how the poly A- transcripts are processed, whether the poly A- transcripts code for protein or are non-coding transcripts, and, more importantly, what their functions are. Answers to these questions should have significant impacts on decoding the transcriptome, annotating the transcribed elements in the genomes, and studying the biological role of poly A- transcripts.

RNA preparation
HeLa cells (ATCC CCL-2) were cultured in MEM medium containing 10% fetal calf serum. Cells in exponential growth were harvested with trypsin treatment. Cytoplasmic RNA was isolated from the cells by using the RNeasy midi kit (Qiagen) following the manufacturer's protocol. To ligate the RNA adaptor (5′ P-UUAAUGGUAUCAACGCAGAGUGG(ddC)-3′) to the 3′ end of all RNA templates, RNA and adaptor were mixed at an approximate 1:10 molar ratio (50 µg RNA and 1 µl of 1,000 µM adaptor) in a total volume of 21 µl. The mixture was heated at 75°C for 5 minutes and cooled on ice, and the following items were added to the mixture: 2.5 µl of 10× T4 RNA ligation buffer, 1 µl of DMSO, and 1 µl (20 units) of T4 RNA ligase (New England Biolabs). The ligation mixture was incubated at 37°C for 1 hour. One µl of 0.5 M EDTA was added to the mixture to stop the reaction.

Subtraction of ribosomal RNAs and removal of short RNAs
Subtraction was performed by using the RiboMinus Transcriptome Isolation Kit (Invitrogen) following the manufacturer's protocol. To increase subtraction efficiency, two additional probes were used: a probe for the 18S ribosomal RNA, 5′ biotin-AGTCAAGTTCGACCGTCTTCTCAGC (location at 1884-1909, M10098), and a probe for the 28S ribosomal RNA, 5′ biotin-ACTAACCTGTCTCACGACGGTCT (location at 4493-4515, M11167). RNA and each set of probes were mixed at a 1:10 molar ratio in a final hybridization solution (10 mM Tris-Cl pH 7.5, 1 mM EDTA, 1 M NaCl). After denaturing at 75°C for 10 minutes, the mixture was maintained at 37°C for 10 minutes. MagPrep® Streptavidin Beads (Novagen) were added to the mixture to remove the hybrids and the free probes. The subtracted RNA was precipitated by adding 1/10 volume of 3 M sodium acetate and 2.5× volume of ethanol, and was maintained at -20°C for 30 minutes. RNA was collected by centrifugation and dissolved in water. To further remove short RNAs, the subtracted RNA was passed over a mini-column of the RNeasy mini kit (Qiagen). The eluted RNA was precipitated and used for cDNA synthesis.

Figure 3 (caption, beginning not recovered). ESTs from poly A-, poly A+, and bimorphic subtypes were selected for the confirmation. Known poly A+ transcripts were used as positive control. Random priming-generated cDNA and oligo dT-generated cDNA were used as the templates. R: cDNA generated by random priming; T: cDNA generated by oligo dT priming. See Table S7 for primer information. (B) Verification of 3′ ESTs mapped to intronic microRNA precursors. RT-PCR was used to verify the 3′ ESTs that map to intronic microRNA precursors. Amplified products were cloned and sequenced. See Table S4C for primer information. (C) Northern blot verification of poly A- 3′ ESTs. Two poly A- 3′ ESTs were used as the probes (Table S7) and RNAs from five human cell lines were used for the detection. doi:10.1371/journal.pone.0002803.g003

cDNA synthesis and 3′ EST collection
The enriched RNA was converted into double-strand cDNA by using a cDNA synthesis kit (Invitrogen) following the manufacturer's protocol, except that a biotin-labeled primer based on the 3′ end adaptor sequences was used for priming (5′ biotin-ATCTAGAGCGGCCGCAATGGCCACTCTGCGTTGATAC). After digestion of the double-strand cDNA with NlaIII (New England Biolabs), the 3′ cDNA was isolated by using the MagPrep® streptavidin beads. An adaptor (sense primer: 5′-TTTGGATTTGCTGGTGCAGTACAACTAGGCTTAATAGGGACATG-3′, antisense primer: 5′-TCCCTATTAAGCCTAGTTGTACTGCACCAGCAAATCC-3′) was ligated to the 5′ end of the recovered 3′ cDNA. To integrate the 454 sequencing primer A and primer B, a 23-cycle PCR was performed by using a sense primer containing the 454 primer A and the 5′ adaptor (5′-TTTGGATTTGCTGGTGCAGTACAACTAGGCTTAATAGGGACATG-3′, the underlined part is the 454 primer A and the rest is the 5′ adaptor), and an antisense primer containing the 454 primer B and the 3′ adaptor (5′-TCCCTATTAAGCCTAGTTGTACTGCACCAGCAAATCC-3′, underlined is the 454 primer B and the rest is the RNA adaptor ligated to the 3′ end of all RNA templates). PCR products were purified with a PCR purification kit (Qiagen) and used for 454 sequence collection by reading from the 3′ end towards the 5′ end of the 3′ cDNA templates, using the 454 primer B as the sequencing primer.

Sequence processing
The following steps were used to generate non-redundant sequences: 1) only the sequences with the 3′ adaptor sequences were kept; 2) the 3′ adaptor sequences were removed; 3) sequences shorter than 11 bp were eliminated; 4) identical sequences were combined; 5) sequences were separated into three groups: no A residue at the 3′ end, one A residue at the 3′ end, and more than one A residue at the 3′ end. Within each group, homologous sequences were combined at a cut-off of at least 90% identity and 90% coverage, and the longest sequence was selected as the representative sequence; 6) the resulting sequences in each group were further checked manually where necessary to ensure sequence quality. The sequences were deposited in the NCBI dbEST (dbEST ID 43676141-43728711).
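Steps 1 to 5 above can be illustrated with a short Python sketch. This is a simplified illustration under stated assumptions, not the pipeline used in the study: reads are assumed to be already oriented in the sense direction so that the adaptor appears as an exact suffix, the adaptor is written as the DNA form of the RNA adaptor given in the methods, and the function names are hypothetical. The 90% identity/coverage clustering and the manual check (steps 5b and 6) are not shown.

```python
from collections import Counter

# DNA form of the 3' RNA adaptor; requiring an exact suffix match is an assumption.
ADAPTOR = "TTAATGGTATCAACGCAGAGTGG"
MIN_LEN = 11  # sequences shorter than 11 bp are eliminated


def trailing_a_count(seq: str) -> int:
    """Number of consecutive A residues at the 3' end of seq."""
    return len(seq) - len(seq.rstrip("A"))


def process_reads(reads, adaptor=ADAPTOR, min_len=MIN_LEN):
    """Return copy counts of distinct sequences, grouped by 3' A content."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        if not read.endswith(adaptor):       # step 1: keep reads with the 3' adaptor
            continue
        insert = read[: -len(adaptor)]       # step 2: remove the adaptor
        if len(insert) < min_len:            # step 3: drop short sequences
            continue
        counts[insert] += 1                  # step 4: combine identical sequences

    groups = {"no_A": {}, "one_A": {}, "multi_A": {}}
    for seq, copies in counts.items():       # step 5: group by number of 3' A residues
        n_a = trailing_a_count(seq)
        key = "no_A" if n_a == 0 else "one_A" if n_a == 1 else "multi_A"
        groups[key][seq] = copies
    return groups
```

Keeping the copy count for each distinct sequence preserves the information needed to compare the share of distinct 3′ ESTs with the share of total sequence copies.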

Mapping 3′ ESTs to genome sequences
The 3′ ESTs longer than 19 bp were mapped to the human genome reference sequences using BLAT (HG18, http://hgdownload.cse.ucsc.edu/downloads.html), with a minimum of 90% coverage and 90% identity. The poly A tail in the poly A+ 3′ ESTs was excluded from the mapping. The intergenic region was defined as the region outside the annotated genes, and the intragenic region was defined as the region covered by the annotated genes [12]. The microRNA precursor sequences were downloaded from miRBase (http://microrna.sanger.ac.uk/sequences/) [27].
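The mapping thresholds above can be expressed as a simple filter over alignment hits. The sketch below is an illustration under assumptions, not the authors' procedure: the `Hit` record and its fields are hypothetical, identity is taken as matches divided by aligned bases, and coverage as aligned bases divided by query length; the paper does not state the exact formulas it used. The selection of ESTs with exactly one passing hit corresponds to the "single location" subset analyzed in the Results.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical, simplified alignment record for one 3' EST hit."""
    query: str        # 3' EST identifier
    query_size: int   # EST length in bp (poly A tail already excluded)
    matches: int      # aligned bases identical to the genome
    mismatches: int   # aligned bases that differ from the genome
    chrom: str
    start: int
    end: int


def passes(hit, min_identity=0.90, min_coverage=0.90, min_len=20):
    """True if the hit meets the length, identity and coverage thresholds."""
    aligned = hit.matches + hit.mismatches
    if hit.query_size < min_len or aligned == 0:   # "longer than 19 bp"
        return False
    identity = hit.matches / aligned               # assumed definition
    coverage = aligned / hit.query_size            # assumed definition
    return identity >= min_identity and coverage >= min_coverage


def uniquely_mapped(hits):
    """Return {query: hit} for ESTs with exactly one passing hit."""
    by_query = {}
    for hit in hits:
        if passes(hit):
            by_query.setdefault(hit.query, []).append(hit)
    return {q: hs[0] for q, hs in by_query.items() if len(hs) == 1}
```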
To study the evolutionary conservation of the intergenic sequences mapped by 3′ ESTs, the mapped sequences were aligned to the genomes of 17 vertebrate species (Vertebrate Multiz 17-way genome alignments, http://www.genome.ucsc.edu). For each aligned sequence, the divergence (substitution rate) between human and each of the 16 other vertebrate species was calculated by the Kimura two-parameter method [28]. Neutral substitution rates between human and the majority of the 16 genomes are not yet available. However, based on the phylogenetic tree of the 17 species (http://www.genome.ucsc.edu/images/phylo/), rodents (mouse and rat) have the shortest divergence time from human among the 14 species that remain after excluding the chimpanzee and macaque. Since the mouse and rat genome sequence analyses show that the neutral substitution rate between rodent and human is around 0.5 substitution/site [24,25], a conservative value of 0.5 was used as the neutral substitution rate between human and each of the other 14 species; chimpanzee and macaque were excluded because they are too similar to human. For each sequence comparison, the number of substitutions observed after correction for multiple hits (O) and the number of substitutions expected under neutral evolution (λ) were calculated. The probability that a sequence is under conservation
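For reference, the Kimura two-parameter divergence named above can be computed from a pairwise alignment as K = -(1/2) ln[(1 - 2P - Q) √(1 - 2Q)], where P and Q are the proportions of transitions and transversions among compared sites. The Python sketch below illustrates this standard formula only; the function name and the handling of gaps and ambiguous bases (such columns are skipped) are assumptions, and the conservation probability derived from O and λ is not reproduced here.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases (an assumption)
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine<->pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    x = 1 - 2 * p - q
    y = 1 - 2 * q
    if x <= 0 or y <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(x * math.sqrt(y))
```

A distance well below the neutral rate of 0.5 substitution/site used in the text would point toward conservation, although the exact test applied in the study is not shown in this sketch.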

Experimental verification
A group of novel poly A-, poly A+, and bimorphic 3′ ESTs mapped to the human genome sequences was selected for PCR verification (Table S7). The sense primer was designed upstream of the mapped genomic location, and the antisense primer was designed based on the 3′ end sequence of each 3′ EST. HeLa RNA was used for cDNA synthesis with MMLV reverse transcriptase (Invitrogen) and either oligo dT17 primer or random hexamer primer. Six known poly A+ transcript sequences were used as positive controls. A 30-cycle PCR was performed for each reaction by using the sense and antisense primers and either type of cDNA template at 94°C for 30 s, 55°C for 30 s, and 72°C for 30 s. PCR products were visualized on agarose gels.
To verify the microRNA precursor-derived 3′ ESTs, a set of 3′ ESTs matching the intronic microRNA precursor sequences was selected for testing (Table S4C). Sense and antisense primers were designed based on the 3′ ESTs using the Primer3 program, and random-priming generated cDNAs from HeLa RNA were used as the templates for PCR. PCR products were cloned into the pGEM-T vector (Promega) and sequenced with BigDye reagents (Applied Biosystems).
Two novel poly A- 3′ ESTs were used for northern blot confirmation. cDNA probes for each poly A- 3′ EST were generated by PCR amplification using random-priming generated HeLa cDNA templates (Table S7). Total RNAs from HepG2, kidney, HL-60, K-562 and HeLa were used for the test. The poly A+ transcripts were depleted from each RNA sample by three rounds of treatment with oligo dT beads (Dynal). The poly A+ depleted RNA (20 µg) was fractionated through a formaldehyde-denaturing agarose gel, transferred to positively charged nylon membranes, and UV crosslinked. Probes were labeled with biotin using a random primer labeling method. Blots were pre-hybridized for 1 hour in hybridization buffer at 45°C and then hybridized with probes overnight at 45°C. Blots were washed twice with low stringency buffer at room temperature and twice with high stringency buffer at 45°C. The signals were then detected with chemiluminescent detection reagents.
RT-PCR was used to verify a group of the unmapped 3′ ESTs. Sense primers and antisense primers were designed based on each 3′ EST (Table S9). Random-priming generated cDNAs from human HeLa, fetal brain, kidney, and liver RNA were used as the templates for PCR. PCR products were checked on agarose gels.

Comparison of poly A- 3′ ESTs with poly A- "transfrags" of the Affymetrix genome-tiling array
The poly A- "transfrag" data detected in HepG2 cells by the Affymetrix 10-chromosome genome-tiling array were downloaded from http://transcriptome.affymetrix.com/publication/transcriptome_10chromosomes/ [15]. The genomic locations of the poly A- 3′ ESTs mapped to the same 10 chromosomes were compared with those of the "transfrags". A 3′ EST and a "transfrag" that share a genomic location are defined as overlapping.
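The overlap definition above amounts to an interval intersection test on the same chromosome. The following Python sketch is a minimal illustration, assuming each feature is a hypothetical `(chrom, start, end)` tuple with half-open coordinates; the coordinate convention, the function names, and the compartment labels are assumptions rather than details taken from the study.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) features share at least one base.

    Coordinates are assumed half-open: start is included, end is excluded.
    """
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlaps_any(est, transfrags):
    """True if the EST overlaps at least one transfrag in the list."""
    return any(overlaps(est, t) for t in transfrags)


def compartment(est, cytosolic, nuclear):
    """Label an EST by the transfrag sets it overlaps (labels are assumptions)."""
    in_cyto = overlaps_any(est, cytosolic)
    in_nuc = overlaps_any(est, nuclear)
    if in_cyto and in_nuc:
        return "both"
    if in_cyto:
        return "cytosolic"
    if in_nuc:
        return "nuclear"
    return "none"
```

For large data sets, sorting the transfrags by chromosome and start position, or using an interval tree, would avoid the all-against-all comparison used in this sketch.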