Comparative Analysis of the Base Compositions of the Pre-mRNA 3′ Cleaved-Off Region and the mRNA 3′ Untranslated Region Relative to the Genomic Base Composition in Animals and Plants

The precursor messenger RNA (pre-mRNA) three-prime cleaved-off region (3′COR) and the mRNA three-prime untranslated region (3′UTR) play critical roles in regulating gene expression. The differences in base composition between these regions and the corresponding genomes are still largely uncharacterized in animals and plants. In this study, the base compositions of non-redundant 3′CORs and 3′UTRs were compared with the corresponding whole genomes of eleven animals, four dicotyledonous plants, and three monocotyledonous (cereal) plants. Among the four bases (A, C, G, and U for adenine, cytosine, guanine, and uracil, respectively), U (which corresponds to T, for thymine, in DNA) was the most frequent, A the second most frequent, G the third most frequent, and C the least frequent in most of the species in both the 3′COR and 3′UTR regions. In comparison with the whole genomes, in both regions the U content was usually the most overrepresented (particularly in the monocotyledonous plants), and the C content was the most underrepresented. The order obtained for the species groups, when ranked from high to low according to the U contents in the 3′COR and 3′UTR was as follows: dicotyledonous plants, monocotyledonous plants, non-mammal animals, and mammals. In contrast, the genomic T content was highest in dicotyledonous plants, lowest in monocotyledonous plants, and intermediate in animals. These results suggest the following: 1) there is a mechanism operating in both animals and plants which is biased toward U and against C in the 3′COR and 3′UTR; 2) the 3′UTR and 3′COR, as functional units, minimized the difference between dicotyledonous and monocotyledonous plants, while the dicotyledonous and monocotyledonous genomes evolved into two extreme groups in terms of base composition.


Introduction
After transcription, the three-prime (3′)-most segment of the newly made precursor RNA (pre-RNA) is usually cleaved off [1,2]. This 3′ cleaved-off region is referred to herein as "3′COR" for the sake of simplicity. The new 3′ end is polyadenylated. There is a 3′ untranslated region (3′UTR) between the coding sequence and the polyadenylation [poly(A)] tail starting position, also often known as the polyadenylation site or poly(A) site. The 3′UTR can include the 3′COR in the broad sense. In practice, however, the 3′UTR is the untranslated 3′ region in the mature messenger RNA (mRNA), because there is no information about the 3′COR in most mRNA sequences. The exact length of the 3′COR at the whole-genome level is unclear; however, some studies have used an approximate length of 200 nucleotide fragments to represent the 3′COR in some yeast genes [3]. Although the function that the pre-mRNA 3′COR performs after transcription termination is unclear, that region is believed to have an important influence on pre-mRNA length and folding as well as on pre-mRNA cleavage. In contrast, the 3′UTR is known to be very gene-specific, to play a critical role in regulating mRNA export, stability, and functionality, and to be critical for the development of living organisms [4][5][6].
Gene-density distribution in fish genomes [7] and human genomes [8] increases with increasing isochore G+C content (GC, C+G or G+C richness). G+C-rich genes are usually more fully expressed than the G+C-poor ones [9]. In vertebrates, introns are poorer in G+C and richer in A+T in comparison with exons [10]. It is widely known that introns are usually less conserved than exons. Within the same genes, however, the G+C content of exons correlates with that of introns [11]. In a previous study, base compositions were analyzed in the 3′UTRs of 271 dicotyledonous and 82 monocotyledonous plant genes [12]; however, these 3′UTRs were only approximate because the poly(A) sites had not been determined. In mammals, there are numerous studies on the motifs around the poly(A) site [13][14][15] as well as some studies on the base composition in the 3′UTR [16] or in the regions both upstream and downstream of poly(A) sites [17]. Little is known about the base-composition differences between the poly(A) site region and the whole genome in different subkingdoms of plants and animals.
In this study, the author analyzed the nucleotide contents of the 201-base region including the 100 bases of the 3′COR and the 100 bases of the 3′UTR immediately adjoining (downstream and upstream, respectively) each poly(A) site in eleven animal species and seven plant species, and compared the results with the nucleotide contents of the corresponding whole genomes (using only complete or nearly complete genomes). The order of these regions is as follows: coding region, 3′UTR, poly(A) tail attachment position, poly(A) tail starting position [usually called the poly(A) site], 3′COR. The mapping used mRNA sequences in the nucleotide database of the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/) for all the studied species and Illumina RNA-seq reads for four species. The species chosen for this study were the ones with the largest number of unique poly(A) sites mapped on their corresponding genomes. Although the decision to use the 201-base region in this analysis was arbitrary, given that the exact length of the 3′COR is unknown, our intention was to concentrate on the poly(A) site regions and to disregard base-composition effects related to translation termination (upstream part of the 3′UTR) and gene region termination (downstream part of the 3′COR) that can be unrelated to poly(A) site selection.

Animal and Plant Genomic Base Compositions
Both the animal and plant genomes (see Table 1 for the list of species) showed A+T richness in the approximate order A = T >> C = G (Figure 1A). The A and T contents were highest in dicotyledonous plants, lowest in monocotyledonous (cereal) plants, and intermediate in animals (Figure 1A). The C and G contents showed the opposite pattern to the A+T contents, with the highest contents in monocotyledonous plants and the lowest in dicotyledonous plants (Figure 1). Among the eleven animal genomes analyzed in this study, the Apis mellifera (honey bee, invertebrate) and Caenorhabditis elegans (nematode, invertebrate) genomes had the highest A and T contents, and the Drosophila melanogaster (fruit fly, invertebrate) genome had the lowest.

Comparison between pre-mRNA 3′-Cleaved-off Regions (3′COR) and Genomes
All of the animal and plant species had a 3′COR base-composition pattern of U >> A >> G > C, except for the two insect invertebrates (honey bee and fruit fly), which had the pattern A > U >> C = G, according to the poly(A) sites determined by the mapping of NCBI mRNA sequences to their reference genomes (Figure 1A). In the C. elegans 3′COR, the A and U contents were found to be very similar (Figure 1A). On average, the U content in the 3′COR was highest in dicotyledonous plants, lowest in mammals, and intermediate in monocotyledonous plants (Table 2; Figure 1A). The vertebrate non-mammal animals had higher U contents and lower C contents in the 3′COR than did the mammals (Table 2; Figure 1). The C content was higher in mammals and monocotyledonous plants, and lower in dicotyledonous plants and non-mammal animals (Figure 1A). The A content was generally much higher than the G and C contents in all species (Figure 1A), but lower (except in the two insect invertebrates) than the U content (Figure 1A). Monocotyledonous plants had the lowest A contents of all the species groups. There was no general difference between animals and dicotyledonous plants in A content (Figure 1A). The G contents were similar to, but generally higher than, the C contents in all species, except for the two insect invertebrates and Arabidopsis thaliana, in which the C and G contents were approximately the same (Figure 1A).
In comparison with the genome, the 3′COR showed consistently higher U contents in all species, slightly lower A contents, and generally lower C contents (Figure 1A). The G contents in the 3′COR were appreciably higher than the genomic G contents in mammals and substantially lower than the genomic G contents in the monocotyledonous plants (Figure 1A). ANOVA and Duncan's multiple-range test confirmed that monocotyledonous plants had the lowest genomic U content among all subkingdoms, but the monocotyledonous 3′COR had a relatively high average U content (Table 2).
The 3′COR/genome ratio for U content (i.e., the ratio obtained by dividing the U content in the 3′COR by the whole-genome T content) was significantly higher in monocotyledonous plants than in the other groups, intermediate in dicotyledonous plants, and lowest in animals (Figure 1B; Table 3). In most species, the 3′COR/genome ratio for C content was the smallest ratio for the four bases (Figure 1B). The 3′COR/genome ratios for A and G contents showed considerable variation. However, in monocotyledonous plants, only U was consistently overrepresented to a considerable extent; the other three bases (A, C, and G) were underrepresented in the 3′COR of monocotyledonous plants (Figure 1B).
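The region/genome ratio defined above can be illustrated with a minimal Python sketch. This is not the author's Perl code; the function names `base_fractions` and `region_genome_ratio` are hypothetical, genomic T is counted as U, and the genome is assumed to be supplied as a single forward-strand string.

```python
from collections import Counter


def base_fractions(seq):
    """Return the fractions of A, C, G and U in seq (T is counted as U)."""
    seq = seq.upper().replace("T", "U")
    # Ignore ambiguity codes such as N so that the fractions sum to 1.
    counts = Counter(base for base in seq if base in "ACGU")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGU"}


def region_genome_ratio(region_seqs, genome_seq):
    """Divide each base fraction in the pooled region by its genomic fraction."""
    region = base_fractions("".join(region_seqs))
    genome = base_fractions(genome_seq)
    return {base: region[base] / genome[base] for base in "ACGU"}


# Toy data only: two short "3'COR" fragments and an evenly composed "genome".
ratios = region_genome_ratio(["UUUA", "UUCG"], "AACCGGTT")
```

A ratio above 1.0 indicates overrepresentation of that base in the region relative to the genome, and a ratio below 1.0 indicates underrepresentation.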
At each position along the 100 bases of the 3′COR, the 3′COR/genome ratio for U content was consistently higher than 1.0 in both monocotyledonous and dicotyledonous plants, except for the first base after the poly(A) site in dicotyledonous plants (Figure 2). This means that U was more frequent at every position in the 100-base region in the 3′COR than in the whole genome (Figure 2). However, the 3′COR/genome ratio was consistently higher in monocotyledonous plants than in dicotyledonous plants (Figure 2). Although the first two bases had the smallest ratios, the 3′COR/genome ratio was higher in regions closer to the poly(A) site than in the regions farther downstream of the poly(A) site (Figure 2).
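The position-by-position comparison can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the hypothetical function `positional_u_ratio` takes equal-length 3′COR sequences whose index 0 is the first base after the poly(A) site, and the genomic T fraction is passed in as a precomputed number.

```python
def positional_u_ratio(cor_seqs, genome_t_fraction):
    """Return, for each position, the U frequency divided by the genomic T fraction.

    cor_seqs: equal-length sequences; index 0 is the first base after the poly(A) site.
    genome_t_fraction: whole-genome T content as a fraction (assumed precomputed).
    """
    length = len(cor_seqs[0])
    ratios = []
    for i in range(length):
        # Count U (or T, if the sequence is written in the DNA alphabet) at position i.
        u_count = sum(1 for seq in cor_seqs if seq[i].upper() in "UT")
        ratios.append((u_count / len(cor_seqs)) / genome_t_fraction)
    return ratios


# Toy data only: four 4-base fragments and a genomic T fraction of 0.25.
toy_ratios = positional_u_ratio(["AUUU", "UUCU", "GUUA", "CTUU"], 0.25)
```

Values above 1.0 at a given position correspond to U being more frequent there than in the genome as a whole.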

Comparison between mRNA 3′-Untranslated Regions (3′UTR) and Genomes
Like the 3′CORs, the mRNA 3′UTRs showed a base-composition pattern of U >> A >> G > C in all species except the two insect invertebrates (honey bee and fruit fly) (Figure 3A). The U content in the 3′UTR was highest in dicotyledonous plants, lowest in animals, and intermediate in monocotyledonous plants (Figure 3A; Table 2). In addition, like the 3′CORs, the 3′UTRs had higher U and lower C contents in the non-mammal animals than in the mammals (Figure 3A). In the 3′UTRs, the C content was highest in mammals, lowest in dicotyledonous plants, and intermediate in monocotyledonous plants and non-mammal animals (Figure 3A). The A content was generally lower than the U content but much higher than the G and C contents, except in the two insects, which had the highest A contents (Figure 1A). The A content was lowest in monocotyledonous plants, highest in animals, and intermediate in dicotyledonous plants (Figure 3A). The G contents were similar to, but usually higher than, the C contents (Figure 1A).
In comparison with the genome, the 3′UTR had the highest U content in all species (except for the fruit fly and the honey bee, two invertebrates), a slightly higher A content in animals, and an appreciably lower A content in plants (Figure 3A). The G content in the 3′UTR was lower than the genomic G content in animals and monocotyledonous plants but approximately similar in dicotyledonous plants, which had the lowest genomic C and G contents among the subkingdoms covered in this study (Figure 3A). The U content in the 3′UTR was consistently higher on average than in the 3′COR in every plant and animal subkingdom, even though the difference was not significant in mammals (Table 2 and Table S1).
The 3′UTR/genome ratio for U content was highest in monocotyledonous plants, intermediate in dicotyledonous plants, and lowest in animals (Figure 3B; Table 3). The 3′UTR/genome ratio for A content was greater than 1.0 in animals, but clearly lower than 1.0 in plants (Figure 3B). In monocotyledonous plants in the 3′UTR, only U was strongly overrepresented; the other three bases (A, C, and G) were underrepresented (Figure 3B). In the analysis of each position along the 3′UTR, the area within approximately 30 bases of the poly(A) site was found to be highly variable in base composition, but U was always the most dominant of the four bases in the region farther upstream (data not shown). Strong overrepresentation of U content was obvious at most positions in the 3′UTR. The base U was more overrepresented in the 3′UTR than in the 3′COR in terms of the region/genome ratio in every subkingdom (Table 3 and Table S2).

Figure 1 (caption, partial). The mapped extra copies were eliminated if the sequences in the 100-base three-prime untranslated region were identical. Note that for the 3′COR, U was richest in all species except Apis mellifera and Drosophila melanogaster, two insect invertebrates, in which A was richest. In Caenorhabditis elegans (a nematode), the A, C, G and U contents in the 3′COR are 34.3%, 15.8%, 16.0%, and 33.9%, respectively. Monocotyledonous plants had lower U contents but higher 3′COR/genome ratios in the 3′COR than did dicotyledonous plants. doi:10.1371/journal.pone.0099928.g001

Comparison of Dicotyledonous and Monocotyledonous Plants about Poly(A) Sites Mapped with NCBI mRNA
The dicotyledonous plants show the highest U contents in the 3′COR and 3′UTR, possibly because they have the highest genomic U contents and moderate overrepresentation of U (Figures 1A and 3A). In monocotyledonous plants, however, the high U contents in these two regions are attributable mainly to the strongest overrepresentation of U (Table 3), given that they had the lowest genomic U contents (Table 2; Figure 3B). Even though overrepresentation of U was strongest in monocotyledonous plants, the actual U contents in the 3′COR and 3′UTR were lower than in dicotyledonous plants (Table 2). The base-composition differences between dicotyledonous and monocotyledonous plants in the 3′COR were 4.0%, −3.3%, −3.7%, and 3.0% for A, C, G, and U, respectively (in terms of the contents in dicots minus the contents in monocots), which were smaller in magnitude than the corresponding differences in the whole genome (5.3%, −5.2%, −5.2%, and 5.2% for A, C, G, and T, respectively) (Figure 1A). The base-composition differences between dicotyledonous and monocotyledonous plants in the 3′UTR were 4.0%, −3.3%, −3.7%, and 3.0% for A, C, G, and U, respectively, which were also smaller in magnitude than the differences in the whole genome (Figure 3A).

Base Composition of the mRNA 3′UTR Region 50 Bases Away from Poly(A) Sites
It is known that the region within 25 bases of the poly(A) site usually has specific A- or U-rich motifs [18]. These motifs may affect the calculated A or U contents of the 100-base 3′UTR. To verify whether the 3′UTR is still U-rich (in most species) or A-rich (in the two insect species) without these motif-rich regions, the author also analyzed the base composition of the 50-base UTR region located 50 bases upstream of the poly(A) site. The rank order of A, C, G, and U in this 50-base region was similar to that in the 100-base region of the 3′UTR in all the species: U was richest for all the species except for the two insects (honey bee and fruit fly) (Table 4). The U/A ratio in this 50-base region was found to be significantly higher in plants than in animals (Table 4).
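As a rough illustration of this check, the sketch below takes the 100-base 3′UTR fragment, skips the 50 bases nearest the poly(A) site, and computes the U/A ratio of the remaining 50 bases. It assumes that each fragment ends at the base immediately upstream of the poly(A) site; the function names `distal_window` and `u_to_a_ratio` are hypothetical and do not come from the original analysis.

```python
def distal_window(utr, skip=50, width=50):
    """Return the `width` bases that end `skip` bases upstream of the poly(A) site.

    utr is assumed to end at the base immediately upstream of the poly(A) site.
    """
    end = len(utr) - skip
    return utr[end - width:end]


def u_to_a_ratio(seqs):
    """Pooled U/A ratio over a collection of sequences (T is counted as U)."""
    u_total = sum(s.upper().count("U") + s.upper().count("T") for s in seqs)
    a_total = sum(s.upper().count("A") for s in seqs)
    return u_total / a_total


# Toy data only: a 100-base fragment whose proximal half is pure A.
toy_utr = "U" * 30 + "A" * 20 + "A" * 50
toy_window = distal_window(toy_utr)
toy_ratio = u_to_a_ratio([toy_window])
```

Because the proximal 50 bases are excluded, an A-rich or U-rich motif near the poly(A) site cannot influence the resulting ratio.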

Illumina Reads-mapped 3′UTR and 3′COR
Illumina deep sequencing data of mRNA (RNA-Seq) were analyzed for poly(A) sites in nematode (C. elegans), fruit fly, honey bee, and potato (Table 5 and Figure 4). Compared with the NCBI mRNA-based analysis, the base composition of the 3′COR region from Illumina RNA-Seq reads showed the following (Figure 4): 1) much higher overrepresentation of A in all the species; 2) lower U contents; 3) smaller differences among the C, G, and U contents. In the 3′COR, the 6 bases counted from the poly(A) site were more predominantly A in mapping with Illumina TruSeq RNA-Seq reads than in mapping with NCBI mRNA sequences (Figure 5). The extremely high average A content of the 6 base positions in the poly(A) site regions strongly suggests that the Illumina TruSeq reads-based mapping was sensitive to internal priming. In the 3′UTR region, the U predominance estimated by Illumina RNA-Seq is generally lower than that estimated by NCBI mRNA-based mapping (Figure 4).

Discussion
Mapping using Illumina RNA-Seq reads was found to be very likely more sensitive to internal priming than mapping using non-deep-sequencing-generated mRNA sequences as shown in […].

[Caption fragment] Note that for the 3′UTR, compared with other nucleotides, U was the richest in all species except Apis mellifera and Drosophila melanogaster, two insect invertebrates, in which A was the richest (Figure 3A). Monocotyledonous plants had lower U contents but higher 3′UTR/genome ratios in the 3′UTR than did dicotyledonous plants. The U content in the 3′UTR was significantly different between dicotyledonous plants and monocotyledonous plants and between non-mammal animals and mammals according to one-way ANOVA and Duncan's multiple range test at the P < 0.05 level (Table 2). The C content in the 3′UTR was significantly different between non-mammal animals (15.82%) and mammals ([…]

Most of the previous studies on this topic were based on the accumulated databases of DNA clone sequences [12,16,19], because at the time not many complete or nearly complete genome sequences of animals and plants were available. Our approach largely avoids the problem of over-contribution from redundant genomic DNA clone sequences in the NCBI database. The quality of the transcript sequences we used is generally more reliable than that of single-pass reads, because most mRNA sequences submitted to the NCBI mRNA database are supposed to have been verified by sequencing from both directions, particularly if a poly(A) tail is included in the submitted sequences. The NCBI's mRNA databases are typically smaller than the databases of ESTs and other single-run reads; however, the higher quality of the mRNA can largely compensate for the use of a smaller database, provided only species with sufficient numbers of mapped poly(A) sites for statistical tests are compared and provided the comparisons made are mainly between groups of species.
Furthermore, we applied a zero-tolerance approach to mismatches in our transcript-genome alignment, which is much stricter than the mismatch tolerance of 10% applied in a previous study on the topic [16]. This zero-tolerance approach to mismatches can help to prevent or minimize ambiguity in mapping. In addition, the mapping done in the present study was based on the 100-nucleotide upstream sequence, which is much more stringent than the 60-nucleotide sequence mapping approach used in the previous study [16]. In contrast to our previous characterization of the poly(A) site starting position and the poly(A) site attachment position [20], in this study we removed any redundant poly(A) sites after mapping to minimize the inflation effects of unexpressed alleles. These stricter mapping criteria reduced the number of poly(A) sites and the number of species to be compared but greatly increased the reliability of the mapping and minimized the dilution effects from false sites.
Likely because complete genome sequences were unavailable, previous base-composition studies focused mainly on the 3′UTR and did not analyze the 3′COR in multiple animals and plants [12,16]. Although the genomes that recently became available have greatly contributed to the characterization of both the upstream and downstream regions around poly(A) sites in mammals [17], little information is available to use in comparing the base composition of the 3′UTR and 3′COR with that of the whole genome. In the current study, we examined the base composition of the 3′COR and 3′UTR relative to the genomic base composition in mammals as well as non-mammal animals, dicotyledonous plants, and monocotyledonous plants.
To reach conclusions on general differences between animals and plants and between dicotyledonous and monocotyledonous plants, it is critically important to include a sufficiently large number of species in the analysis. Although the analysis of ESTs and DNA clones in a previous study [16] generated valuable information about poly(A) sites, no conclusions could be reached about differences between animals and plants and between dicotyledonous and monocotyledonous plants, because only one dicotyledonous species (Arabidopsis thaliana), one monocotyledonous species (rice), and three animal species (fruit fly, mouse, and human) were studied. In our study, we used the RNA-genome alignment approach and analyzed both the 3′UTR and 3′COR in eleven animal species and seven plant species. Although we examined more species in each subkingdom than did previous studies, the numbers are still not large enough to permit comparisons within sub-groups such as insects and non-insects or woody and herbaceous plants. For example, honey bee (Apis mellifera) and fruit fly (D. melanogaster) were the only two insects in the invertebrates studied, and Populus trichocarpa was the only tree species. This is because we used only complete or nearly complete genomes. In future, research can be undertaken to verify whether other tree and insect species have base compositions in the 3′UTR and 3′COR similar to those of the species we studied, once more insects and trees have been completely sequenced.

Figure 5 (caption, partial). Information about the Sequence Read Archive (SRA) transcriptomic files can be found in Table 5. NCBI mRNA and RNA-Seq reads were significantly different in this 6-base region according to the Excel "ChiTest" in each of these four species (P < 0.0001). Note that this six-base region showed higher adenosine content in mapping with RNA-Seq reads than in mapping with NCBI mRNA. doi:10.1371/journal.pone.0099928.g005
The larger variation among species in the "non-mammal animals" group than in the "mammals" group (Figures 1 and 3) is likely attributable to the fact that the "non-mammal animals" group is very diverse and included both invertebrate and vertebrate animals.
Interestingly, we found that insect invertebrates (honey bees and fruit flies) preferred A over U in the 3′UTR and 3′COR (Figures 1 and 3), which is the opposite of what we found for all non-insect animals. Among the 17 plant and animal species analyzed, these two insect invertebrates clearly stood out from the other 15 species in terms of the A/U ratio. The species-specificity in the A/U ratio may suggest genetic influence on the base composition in the 3′UTR and 3′COR. Since the two insect invertebrates gave similar results (opposite to those for vertebrates and nematodes), the results are unlikely to represent an artifact associated with the relatively smaller number of poly(A) sites mapped on the honey bee genome. Although further research may lead to more optimal settings and thus improve RNA-seq sequence read analysis, the NCBI-mRNA-based approach still has its rightful place because of its higher sequence quality and its usually broader coverage of tissues and treatments relative to the currently available sequence reads. In the present study, the high base-composition similarity of the poly(A) site region among species within subkingdoms and the general difference between subkingdoms suggest that the results are unlikely to be due to coincidence or bias and that the data must reflect the true biology of these subkingdoms/species.
One of the important features of this study is the comparison between the 3′COR and 3′UTR and the whole genome. Most of the previous studies on this topic described the nucleotide contents of the 3′UTR without examining differences or similarities between that region and the corresponding genomes. It is unclear whether the base-composition difference in the 3′UTR between animals and plants is a simple reflection of differences in their whole-genome base compositions. Although the G content in the 3′UTR was found to be very similar in Arabidopsis thaliana and fruit flies [16], we found that the G content in the 3′UTR and the genomic G content were similar in A. thaliana but the G content in the 3′UTR was much lower than the genomic G content in fruit flies. This suggests that there is a bias against G in the 3′UTR in fruit flies (Figure 3A). This kind of difference can only be identified through a comparative study of the 3′UTR and the genome.
The most consistent feature of the base compositions of both the 3′COR and 3′UTR in both animals and plants was found to be lower C contents in these regions than in the whole genomes (Figures 1A and 3A). The C contents were the lowest among all the four types of nucleotides in the 3′COR and 3′UTR regions but differed significantly between subkingdoms. These results appear to be somewhat similar to the AT richness found in the introns and intergenic sequences of two animals in a previous study [21]. In the present study, however, the GC poorness in the 3′COR and 3′UTR is caused mainly by C poorness, because the G content varies depending on the species; in fact, G was usually much higher than C in the 3′UTR in both monocotyledonous and dicotyledonous plants (Figures 1 and 3).
Interestingly, each animal and plant subkingdom showed distinct characteristics in terms of base composition (Figures 1A and 3A). Different groups of living organisms have their own sets of unique genes. For example, immune system genes play high-ranked and conserved roles in mammals but are not conserved in nematodes [22]. As well, each subkingdom may differ in its own transposons and its regulation of DNA mutation/repair systems. Further research is required to investigate in what way unique genes, transposons, and DNA mutation/repair systems contribute to base-composition differences. It is known that cellular selection favouring translation differs between G+C-rich and G+C-poor classes of genes [23]. Given that base-composition patterns are known to differ between animals and plants in the 3′COR and 3′UTR (as shown in this study), it is logical to expect that different selection mechanisms apply to animals and plants. Further research is needed to find out what these selections are in living organisms. Since U (usually the most dominant base) was consistently overrepresented and C (usually the least dominant) was consistently underrepresented in the 3′COR and 3′UTR, regardless of their respective content variation in the whole genome, U and C must play important roles in both poly(A) region selection and interaction with the poly(A) complex.
It is likely that G+C richness affects gene length in vertebrate G+C-rich isochores [24]. We found differences in C or G contents between animal and plant genomes. It is unclear whether these C or G differences influence the lengths of the 3′COR and 3′UTR. Building on the present study of the 201 nucleotides in this untranslated-cleaved-off region (100 bases in the 3′COR), future research could analyze the base composition of more-distant downstream regions relative to the whole genome. This will make it possible to determine the approximate difference in 3′COR length between animals and plants.
Plant chromosomes are more often characterized by the predominance of genes in the same direction than are animal chromosomes [25]. It would be interesting to see whether gene direction has any relationship with the base composition in the 3′COR and 3′UTR. Base composition and genome or chromosome size are correlated in various organisms [26], and genome and chromosome sizes are known to strongly impact gene direction on chromosomes during the increase of life complexity [25]. The 3′COR and 3′UTR regions are either a part of, or close to, the gene end. This gene region creates a certain level of repeats between genes, in terms of conserved base-composition patterns. Further research is required to investigate whether the base composition in this 3′COR-3′UTR gene end region affects DNA recombination and consequently impacts the gene direction rearrangement on chromosomes.
Plants have extreme patterns of genomic base composition in comparison with animals. Dicotyledonous plants were found to have extremely high genomic A and T contents and extremely low C and G contents (Figure 1A). In contrast, monocotyledonous plants were found to have the lowest genomic A and T contents and the highest genomic C and G contents among the four subkingdoms (non-mammal animals, mammals, dicotyledonous plants, and monocotyledonous plants) (Figure 1A). Interestingly, among all the plant and animal species analyzed, monocotyledonous plants had the lowest T contents in the genome (27.63%, Table 2; Figure 1A) but the highest increase in the U contents in terms of the mRNA 3′UTR/genome ratio (1.30; Table 3; Figure 3B) and the pre-mRNA 3′COR/genome ratio (1.25; Table 3; Figure 1B). In contrast, dicotyledonous plants had higher genomic T contents on average than monocotyledonous plants (Figure 3A) but significantly smaller increases in the U contents in the same regions (3′UTR/genome ratio = 1.20 and 3′COR/genome ratio = 1.14; Table 3; Figures 1B and 3B). These adjustments made the U content difference between monocotyledonous and dicotyledonous plants smaller in the 3′UTR and 3′COR regions than in the genome. Our hypothesis is that the important function of the 3′UTR and 3′COR makes it less likely that these regions will undergo mutation during evolution (even though they are very rich in A+T) than most other regions of the genome. Further research is needed to determine whether this means that the T content is too low in monocotyledonous genomes and must be enriched to a certain degree to permit the 3′COR and 3′UTR to function properly in monocotyledonous cells.
Among the species within subkingdoms, the base compositions in the 3′COR (Figure 1A) were more similar than the 3′COR/genome ratios (Figure 1B). This pattern of similarity in 3′CORs among species and of weaker similarity in their 3′COR/genome ratios was particularly obvious among mammals and among monocotyledonous plants (Figure 1). The base compositions in the 3′UTR (Figure 3A) were also more similar than the 3′UTR/genome ratios (Figure 3B). These results appear to suggest that the 3′COR and the 3′UTR are evolutionarily more stable than the genomes in terms of base composition changes. Further research is required to gain a better understanding of the similarities and differences among species between the 3′COR and 3′UTR regions and the genomes. This evolutionary trend drove the analyzed sequences in the same direction: the most frequent nucleotide was U, followed by A, G, and C, in most of the animals and plants studied. Although the U content was slightly lower than the A content in the 3′COR and 3′UTR in the two insect invertebrates (honey bee and fruit fly), U was consistently much more frequent than G and C in these regions in all species. The results also suggest that the nucleotide content in these 3′COR and 3′UTR sequences evolved nearly independently of the rest of the genome. The knowledge acquired about the base compositions of the whole genome and the 3′COR and 3′UTR in eleven animals and seven plants may stimulate further research aimed at interpreting the results.
The analysis used highly reliable data, characterized the base composition of the 3′COR and 3′UTR in comparison with that of the whole genome, and identified clear differences between dicotyledonous and monocotyledonous plants and between non-mammal animals and mammals.

Genomes and mRNA Sequences
Nucleotide sequences of complete genomes were downloaded in FASTA format from the NCBI website at http://www.ncbi.nlm.nih.gov/sites/genome. Most of the animal and plant species for which both complete genomes (which serve as the reference genomes) and large mRNA databases were available in NCBI were screened as described previously [20], and the species that had sufficient poly(A)-tailed mRNA in NCBI were used for detailed analysis.

Mapping mRNA on Genomes
The screening of poly(A)-tailed mRNA and the mapping of poly(A)-tailed mRNA to the corresponding genomes were carried out essentially as described in the previous study of the dinucleotide covering the pre-mRNA 3′-end cleavage site [20]. Only transcripts with a poly(A) tail of at least 12 consecutive A's at the 3′ end were used in this analysis. The 100 bases immediately upstream of the poly(A) site were used to screen the mRNA datasets and eliminate any redundant copies, so that each sequence in the final dataset of poly(A)-tailed mRNA was unique. These sequences were mapped to the reference genomes of their corresponding species with zero tolerance for mismatches.
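The screening, de-duplication, and exact-match rules described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Perl scripts used in the study: the function names (`split_poly_a`, `unique_by_upstream`, `exact_hits`) are hypothetical, sequences are held in memory as plain strings in the DNA alphabet, and only the plus strand is searched.

```python
MIN_TAIL = 12   # minimum run of A's required at the 3' end
KEY_LEN = 100   # bases upstream of the poly(A) site used as the uniqueness key


def split_poly_a(seq, min_tail=MIN_TAIL):
    """Return (body, tail_length) if seq ends in >= min_tail A's, else None."""
    seq = seq.upper()
    body = seq.rstrip("A")            # everything before the terminal run of A's
    tail_len = len(seq) - len(body)
    if tail_len < min_tail:
        return None
    return body, tail_len


def unique_by_upstream(seqs, key_len=KEY_LEN):
    """Keep one transcript body per distinct key_len-base sequence before the tail."""
    seen = {}
    for seq in seqs:
        hit = split_poly_a(seq)
        if hit is None:
            continue                  # no qualifying poly(A) tail
        body, _ = hit
        if len(body) < key_len:
            continue                  # too short to supply a full key
        seen.setdefault(body[-key_len:], body)  # first occurrence wins
    return list(seen.values())


def exact_hits(key, genome):
    """All 0-based start positions where key occurs in genome with no mismatches."""
    hits, start = [], genome.find(key)
    while start != -1:
        hits.append(start)
        start = genome.find(key, start + 1)
    return hits
```

One simplification to note: stripping every trailing A treats a genomic A that directly precedes the cleavage site as part of the tail, so the site position in this sketch is ambiguous whenever the last templated base is A.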
Several species for which sequenced genomes were available were not included in the final comparative study, because a) the number of mapped unique poly(A) sites was too small for comparison, b) the mRNA dataset of the species (i.e., Macaca mulatta and Pan troglodytes) contained a large number of computationally predicted mRNAs, or c) many of their mapped poly(A) sites had 12 or more A's and were potentially more susceptible to internal priming (i.e., Macaca mulatta, Pan troglodytes, and Sus scrofa) than were those of most other species.
Most species compared in this study likely had a very low percentage of internal priming, partly because high-quality sequences from the mRNA database were used and partly because of the very strict settings used in the mapping. For example, only 0.3% of mapped mRNAs in plants had 12 A's [20]. The potential effect of these sequences on the content (percentage) of each specific nucleotide (A, C, G, and U) in the whole mRNA pool would be less than 0.1% on average, whereas the U content difference actually detected between the 3′UTR and the whole genome in monocotyledonous plants was about 8% (Table 2). Internal priming therefore could not change the conclusions of this mRNA study. The internal priming issue has been described and discussed in detail previously [20].
The post-mapping treatment differed from that in our previous study [20] as follows: to minimize overrepresentation from duplicated gene copies, extra copies were eliminated if the 100-base sequences (i.e., 3′UTR) upstream of the poly(A) sites were identical. Thus, every mRNA sequence and every poly(A) site analyzed was unique in terms of this 100-nucleotide 3′UTR. All of the poly(A) tail screening, mRNA-genome alignment, and base-composition counting were performed with the aid of Perl scripts.
The computation included the following steps: a) searching for poly(A)-tailed mRNA using the requirements described in the Methods section; b) eliminating duplicated mRNA sequences using the 100 bases upstream of the poly(A) site; c) aligning the poly(A)-tailed mRNA sequences with the reference genome of the same species with zero tolerance for mismatches; d) keeping only one site from identical multiple sites based on the 100-base upstream nucleotide sequence; e) eliminating the species that had high proportions of predicted mRNA in the mapped sequences; f) eliminating the species that had multiple-A stretches immediately after poly(A) sites on the genome; g) counting the base composition at each position of the 201 bases for each mapped mRNA; h) calculating the average base composition of each region [upstream and downstream of the poly(A) site] for the pooled mapped sequences; and i) comparing these averages with the whole-genome base composition.
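The counting and comparison steps at the end of this pipeline can be illustrated with a short Python sketch. It is not the study's Perl code. The function names are hypothetical, and the window layout is an assumption: each 201-base window is taken to hold 100 upstream bases in columns 0 to 99, the poly(A) tail starting position in column 100, and 100 downstream bases in columns 101 to 200. Sequences use the DNA alphabet, so T stands for U in the mRNA.

```python
from collections import Counter

BASES = "ACGT"


def position_composition(windows):
    """Percentage of each base at every column of equal-length windows."""
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    table = []
    for i in range(length):
        counts = Counter(w[i] for w in windows)
        total = sum(counts[b] for b in BASES)   # N and other symbols are ignored
        table.append({b: 100.0 * counts[b] / total if total else 0.0 for b in BASES})
    return table


def region_content(windows, start, end):
    """Pooled percentage of each base over window columns start..end-1."""
    counts = Counter()
    for w in windows:
        counts.update(w[start:end])
    total = sum(counts[b] for b in BASES)
    if total == 0:
        raise ValueError("region contains no A, C, G or T")
    return {b: 100.0 * counts[b] / total for b in BASES}


def ratio_to_genome(region, genome):
    """Region/genome ratio per base; values above 1 indicate overrepresentation."""
    return {b: region[b] / genome[b] for b in BASES}
```

Under the assumed layout, `region_content(windows, 0, 100)` would give the pooled 3′UTR composition and `region_content(windows, 101, 201)` the pooled 3′COR composition, each of which can then be passed to `ratio_to_genome` together with the whole-genome percentages.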

Nucleotide Contents
The A, C, G, or T nucleotide content in each genome is the percentage of A, C, G, or T in the total nucleotide number of the genome accumulated from all the chromosomes. In the case of species for which the complete chromosome sequences or pseudomolecules were unavailable, we used large scaffolds. The pre-mRNA 3′COR nucleotide contents were from the 100-base genomic sequence downstream of, but not including, the poly(A) tail starting position [i.e., the genomic or pre-mRNA nucleotide corresponding to the first A of the poly(A) tail]. The 3′UTR base compositions were from the 100 mRNA bases upstream of, but not including, the poly(A) tail starting position.
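As a concrete reading of these definitions, the following Python sketch computes the pooled genome content and extracts the two 100-base flanks around a mapped site. It is illustrative only and rests on assumptions: the function names are hypothetical, `site` is a 0-based plus-strand coordinate of the poly(A) tail starting position, ambiguous symbols such as N are excluded from the genome total, and the upstream flank is read from the genome, which matches the mRNA only when no intron lies within the last 100 bases.

```python
from collections import Counter

FLANK = 100  # bases taken on each side of the poly(A) tail starting position


def genome_content(chromosomes):
    """Percentage of A, C, G and T over all chromosome (or scaffold) sequences."""
    counts = Counter()
    for seq in chromosomes:
        counts.update(seq.upper())
    total = sum(counts[b] for b in "ACGT")   # assumption: N and others excluded
    return {b: 100.0 * counts[b] / total for b in "ACGT"}


def flanking_regions(chrom, site, flank=FLANK):
    """Return (utr, cor) around site, or None if either flank is incomplete."""
    if site - flank < 0 or site + flank >= len(chrom):
        return None
    utr = chrom[site - flank:site]           # upstream, excludes the site itself
    cor = chrom[site + 1:site + 1 + flank]   # downstream, excludes the site itself
    return utr, cor
```

Sites on the minus strand would need the reverse complement of the window before the same slicing is applied; that case is left out here for brevity.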
Plants of a doubled monoploid potato line, DM1-3 516R44 (S. tuberosum Group Phureja) [27], were grown in the greenhouse. Total RNA was prepared from roots using the RNeasy Plant Mini Kit (Qiagen Cat. 74903). A TruSeq mRNA cDNA library was constructed from this RNA and then sequenced with 100-cycle paired-end reads on an Illumina HiSeq 2000 at the Genome Quebec-McGill University Innovation Centre. Reads were processed with Trimmomatic [28] to remove adapters, poor-quality regions, and reads that were too short, using the settings MINLEN:50 and TRAILING:30. Details of the transcriptomic analysis will be published elsewhere. Sequence reads with 12 A's at the 3′ end were aligned to the potato (Group Phureja) reference genome (PGSC_DM_v4.03, downloaded from http://solanaceae.plantbiology.msu.edu/pgsc_download.shtml).
Illumina RNA-Seq files of nematode, fruit fly, and honey bee were downloaded from the NCBI Sequence Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra/). SRA file IDs are listed in Table 5. Sequence reads with 12 A's at the 3′ end were considered polyadenylated.
The Illumina RNA-Seq read mapping was based on the 80-base region immediately upstream of the poly(A) site. Duplicate RNA-Seq reads, defined by identical 80-base sequences upstream of the poly(A) site, were eliminated before mapping. Mapping was performed with zero tolerance for mismatches. Redundant copies of the mapped poly(A) sites, in terms of the same 80-base sequence, were also eliminated after mapping.

Statistical Analysis
The content of each type of base (A, C, G, and U), the 3′UTR/genome ratios, and the 3′COR/genome ratios were compared among non-mammal animals, mammals, dicotyledonous plants, and monocotyledonous plants by one-way ANOVA and Duncan's multiple range test at the P<0.05 level using SAS Enterprise Guide, version 4.3. The mean U/A ratios of plants and animals were compared by Student's t-test (in Excel 2010) with a two-tailed distribution and the two-sample equal-variance model. ChiTest (in Excel 2010) was used to test the A content differences for the six bases in the poly(A) site region between mapped NCBI mRNA and mapped RNA-Seq reads.

Table S1 ANOVA-Duncan's multiple range tests of base U contents of different subkingdoms (classifying animals into invertebrates and vertebrates). (DOCX)
Table S2 ANOVA and Duncan's multiple range tests of the region/genome ratios of U contents (classifying animals into invertebrates and vertebrates).
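For readers who want to see what the test statistics named in the Statistical Analysis section compute, the following Python sketch gives the one-way ANOVA F statistic and the pooled (equal-variance) Student's t statistic using only the standard library. It is not the SAS or Excel procedure used in the study. The function names are hypothetical, no P values are derived, and Duncan's multiple range test is not covered.

```python
from statistics import mean


def pooled_t(x, y):
    """Student's t statistic for two samples under the equal-variance model."""
    nx, ny = len(x), len(y)
    mx, my = mean(x), mean(y)
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    pooled_var = (ssx + ssy) / (nx + ny - 2)   # degrees of freedom: nx + ny - 2
    return (mx - my) / (pooled_var * (1 / nx + 1 / ny)) ** 0.5


def anova_f(groups):
    """One-way ANOVA F statistic with its (between, within) degrees of freedom."""
    values = [v for g in groups for v in g]
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within
```

In this setting each group would hold one value per species, for example the 3′UTR U content of every mammal in one list and of every monocotyledonous plant in another. With exactly two groups the F statistic equals the square of the pooled t statistic, which is a convenient consistency check.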