Predicting Statistical Properties of Open Reading Frames in Bacterial Genomes

doi:10.1371/journal.pone.0045103

Figure 1.

ORF length distribution of Escherichia coli O157:H7 Sakai.

ORF lengths are given in base pairs (bp). All ORFs in the six possible reading frames are shown. The prediction of a simple model based on independent and identically distributed nucleotides (IID nt) fails to reproduce the observed ORF length distribution.
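
To make the IID nt baseline concrete, the following minimal sketch computes the ORF length distribution such a model implies. It rests on assumptions that are not taken from the paper: an ORF is treated as a stop-free run of codons terminated by a stop codon (start codons are ignored), the stop codons are the standard TAA, TAG and TGA, and the nucleotide frequencies in `uniform` are a hypothetical example. Under these assumptions the length follows a geometric distribution.

```python
from itertools import product

# Assumption: standard bacterial stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_probability(nt_freq):
    """Probability that a codon built from IID nucleotides is a stop codon."""
    return sum(
        nt_freq[a] * nt_freq[b] * nt_freq[c]
        for a, b, c in product("ACGT", repeat=3)
        if a + b + c in STOP_CODONS
    )


def orf_length_pmf(nt_freq, n_codons):
    """P(exactly n_codons non-stop codons occur before the first stop codon)."""
    p_stop = stop_probability(nt_freq)
    return (1.0 - p_stop) ** n_codons * p_stop


# Hypothetical example input: equal nucleotide frequencies (50% GC).
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p_stop = stop_probability(uniform)
mean_codons = (1.0 - p_stop) / p_stop  # mean of the geometric distribution
print(f"p_stop = {p_stop:.6f}, mean ORF length = {mean_codons:.2f} codons")
```

A single geometric decay of this kind is the shape the caption says does not match the observed distribution.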

Figure 2.

ORF length distribution and survival probability.

Left panel: The relative frequency of the EHEC ORF lengths (orange triangles) and of Rcodon (blue open dots). The prediction of the mixture model is shown in red. Right panel: Survival probability (the probability of observing at least one ORF of a given length in any of the six reading frames) under the mixture model.
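
As an illustration of what a survival probability of this kind measures, here is a minimal sketch. It does not implement the mixture model: it assumes a single geometric length distribution with a per-codon stop probability `p_stop`, treats "given length" as "at least `min_codons` codons", and assumes `n_orfs` independent ORFs. All three names and assumptions are hypothetical and not taken from the paper.

```python
import math


def survival_probability(p_stop, n_orfs, min_codons):
    """P(at least one of n_orfs independent ORFs has >= min_codons codons).

    Assumes each ORF length is geometric with per-codon stop probability p_stop.
    """
    if n_orfs == 0:
        return 0.0
    tail = (1.0 - p_stop) ** min_codons  # P(a single ORF reaches min_codons)
    if tail >= 1.0:
        return 1.0
    # 1 - (1 - tail)^n_orfs, evaluated in log space to stay accurate
    # when tail is very small and n_orfs is large.
    return -math.expm1(n_orfs * math.log1p(-tail))


# Hypothetical example: 100,000 ORFs, stop probability 3/64 per codon.
for length in (100, 200, 300, 400):
    print(length, survival_probability(3 / 64, 100_000, length))
```

The curve stays close to 1 for short lengths and then drops off, which is the qualitative behaviour a survival plot of ORF lengths shows.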

Figure 3.

QQ-Plot.

Comparison of the 75% quantile of ORF lengths predicted by the mixture model to the ORF lengths observed in the natural genomes (orange triangles) and Rcodon (blue open dots), respectively. Some individual data points are labeled with an abbreviated species name and the corresponding GC-content according to Table S1.

Figure 4.

ORF number prediction.

Comparison of ORF numbers predicted by the mixture model to the ORF numbers found in natural genomes (orange triangles) and Rcodon (blue open dots), respectively. Some individual data points are labeled with an abbreviated species name and the corresponding GC-content according to Table S1.

Figure 5.

Ratio of annotated ORFs to non-annotated ORFs.

The ratio predicted by the mixture model is compared to the ratio observed in bacterial genomes (orange triangles) and Rcodon (blue open dots), respectively. The slight difference between natural genomes and Rcodon is due to the fact that the expected number of short coding ORFs in Rcodon deviates from that of the natural genomes (compare Figure 7). Some individual data points are labeled with an abbreviated species name and the corresponding GC-content according to Table S1.

Figure 6.

Influence of GC-content and sequence length.

Left panel: Average ORF length as a function of GC-content, as predicted by the mixture model (green dots), compared to bacterial genomes (orange triangles) and Rcodon (blue open dots), respectively. Right panel: Predicted versus observed number of ORFs for different bacteria as a function of sequence length. The expected number of ORFs depends on the sequence length and the GC-content. The upper bounds for the expected number of ORFs are shown for GC-contents of 32.5% and 70%.

Figure 7.

aORF length distributions.

The absolute frequency of aORF lengths in codons from the EHEC genome (NC_002695) is compared to its Rcodon and the prediction of the mixture model. The visible difference between the natural genome and the theoretical expectations, from either Rcodon or the mixture model, arises because short ORFs are generally less likely to be annotated as functional proteins. However, this is changing (e.g., [40]), and short ORFs are being annotated more frequently.

Table 1.

Number of aORFs predicted and observed.

Figure 8.

naORF length distributions.

Left panel: The relative frequencies of naORF lengths derived from EHEC (orange triangles) are compared to Rcodon (open blue dots) and the mixture model (red line). Right panel: The survival probabilities of naORF lengths for the different alternative frames, derived from the mixture model. The survival probability is the probability of observing at least one naORF of a given length. Longer naORFs are expected in reading frame −1 and, to some extent, in frame −2 (see text).

Figure 9.

Length distributions of different groups of genes for each alternative frame.

Shown are the absolute frequencies of naORF lengths for the genome of E. coli MC4100 (NC_000913) as predicted by the mixture model. Each colored line represents a different group of genes used to obtain a codon usage as input to the model. Subset 1, consisting of very highly expressed genes, is shown in green. Subset 2 contains, in addition to Subset 1, further highly expressed genes and is shown in blue (data from [41]). The group that includes all genes is shown in red. Finally, the natural frequencies obtained from the bacterial genome are shown as black triangles. In most alternative frames, the expression level of the gene in the annotated frame has negligible influence, but not so for frame −1. As Silke [38] has already stated, most, but not all, long overlapping ORFs in frame −1 might be explained by a codon usage bias for highly expressed genes. However, this finding does not hold for any other alternative frame, nor for genes that are not highly expressed.

Table 2.

Undetermined nucleotides and their substitutions [47].

Figure 10.

Ergodic Markov chain.

A Markov chain connects all codons. For each reading frame, the stationary distribution of this ergodic Markov chain is calculated to obtain the individual start and stop codon probabilities.
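
The stationary distribution of an ergodic Markov chain can be obtained in several ways; the sketch below uses plain power iteration, which is not necessarily the method used by the authors. The three-state matrix `TOY_P` and its state labels are made-up stand-ins for the full codon-to-codon transition matrix, which is not given here.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Return pi with pi = pi * P for a row-stochastic, ergodic matrix P."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # One step of the chain: new_pi[j] = sum_i pi[i] * P[i][j]
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")


# Hypothetical toy chain; states lumped as (start codon, other codon, stop codon).
# Each row sums to 1 and all entries are positive, so the chain is ergodic.
TOY_P = [
    [0.10, 0.80, 0.10],
    [0.05, 0.90, 0.05],
    [0.30, 0.60, 0.10],
]

pi = stationary_distribution(TOY_P)
print("P(start) =", pi[0], "P(stop) =", pi[2])
```

With a full codon matrix, the start and stop codon probabilities would be read off by summing the stationary probabilities of the corresponding codon states.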

Figure 11.

Two state Markov chain.

The stationary distribution of this Markov model gives the probability of being within an ORF.
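
For a two-state chain the stationary distribution has a simple closed form, sketched below. The parameter names `p_enter` (probability of moving from outside an ORF to inside) and `p_exit` (probability of moving from inside to outside) are hypothetical labels for the two transition probabilities, and the example values are invented for illustration.

```python
def prob_inside_orf(p_enter, p_exit):
    """Stationary probability of the 'inside ORF' state of a two-state chain.

    Balance condition: pi_out * p_enter = pi_in * p_exit with pi_out + pi_in = 1,
    which gives pi_in = p_enter / (p_enter + p_exit).
    """
    if p_enter + p_exit == 0:
        raise ValueError("at least one transition probability must be positive")
    return p_enter / (p_enter + p_exit)


# Hypothetical example values, not taken from the paper.
pi_in = prob_inside_orf(p_enter=0.02, p_exit=0.05)
print("P(inside ORF) =", pi_in)
```

The result depends only on the ratio of the two transition probabilities, so a chain that leaves ORFs rarely relative to how often it enters them spends most of its time inside an ORF.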
