PreTIS: A Tool to Predict Non-canonical 5’ UTR Translational Initiation Sites in Human and Mouse

doi:10.1371/journal.pcbi.1005170

Fig 1.

Example mRNA sequence showing the categorization of true positive and true negative start sites.

Suppose that a ribosome profiling experiment detected the following start sites for a given mRNA sequence: CUG at position -78 and CUG at position -120 (blue codons). These start sites were then assumed to be true positives. Consequently, all near-cognate start sites that are not listed in the ribosome profiling dataset and lie upstream of the most downstream reported true start site were assumed to be true negatives (dark red codons). The light red codons are start sites that were not counted as false starts in the analyses because they are located downstream of the most downstream reported true start site. The grey downstream part depicts the annotated CDS, whereas the italic (purple) upstream part marks the 99-nucleotide upstream window needed to calculate some of the features (see below). All marked start sites (true positive and true negative) have a surrounding window of ±99 nucleotides as well as a downstream in-frame stop codon. In total, this mRNA sequence would provide 2 true start sites and 9 false start sites out of 23 putative starts.
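
The labeling rule in this caption can be illustrated with a minimal sketch. It is not the authors' code: the function name `label_start_sites` is hypothetical, positions are assumed to be counted relative to the CDS start (the last 5' UTR nucleotide is -1), AUG is assumed to be a candidate alongside the nine near-cognate codons, and the ±99-nucleotide window and in-frame stop codon filters are left out for brevity.

```python
# Near-cognate codons differ from AUG by exactly one nucleotide.
NEAR_COGNATE = {"CUG", "GUG", "UUG", "ACG", "AAG", "AGG", "AUA", "AUC", "AUU"}
# Assumption: AUG itself is also treated as a candidate start codon.
START_CODONS = NEAR_COGNATE | {"AUG"}


def label_start_sites(utr5, reported_positions):
    """Label every candidate start codon in a 5' UTR (hypothetical helper).

    utr5: RNA sequence of the 5' UTR only (CDS not included).
    reported_positions: set of negative positions reported by ribosome
        profiling, counted relative to the CDS start.
    Returns a dict mapping position -> "TP", "TN" or "excluded".
    """
    n = len(utr5)
    most_downstream = max(reported_positions)
    labels = {}
    for i in range(n - 2):
        if utr5[i:i + 3] not in START_CODONS:
            continue
        pos = i - n  # position of the codon's first nucleotide
        if pos in reported_positions:
            labels[pos] = "TP"
        elif pos < most_downstream:
            # Unreported and upstream of the most downstream true start.
            labels[pos] = "TN"
        else:
            # Downstream of the most downstream true start: not used.
            labels[pos] = "excluded"
    return labels


labels = label_start_sites("CUGCCCGUGCCCCUGCCCUUGCC", {-23, -11})
```

In this toy sequence, the unreported GUG between the two reported CUG codons becomes a true negative, while the UUG downstream of the last reported start is excluded.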

Fig 2.

Flowchart of the regression approach.

Data balancing was repeated ten times to investigate model robustness. Significant features were identified by the Wilcoxon rank-sum test.
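
As an illustration of repeated data balancing, the following sketch draws ten balanced subsets. The caption does not state how balancing was done; random undersampling of the larger class is an assumption here, and the function names and seeds are hypothetical.

```python
import random


def balance_by_undersampling(positives, negatives, seed):
    """Return equally sized random subsets of both classes.

    Assumption: balancing means undersampling the larger class.
    """
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    return rng.sample(positives, k), rng.sample(negatives, k)


def repeated_balancing(positives, negatives, n_runs=10):
    """Create n_runs balanced datasets, one per seed (0 .. n_runs - 1)."""
    return [
        balance_by_undersampling(positives, negatives, seed)
        for seed in range(n_runs)
    ]


# Toy example: 5 positive and 50 negative sample identifiers.
runs = repeated_balancing(list(range(5)), list(range(100, 150)))
```

Training one model per balanced subset and comparing their performance gives a rough view of how sensitive the results are to the particular negatives that were drawn.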

Table 1.

Datasets used in this study.

Table 2.

Evaluation of the regression approach.

Fig 3.

Codon distribution of the test samples in the best performing human model.

AUG, CUG and GUG were the most prevalent true positive start sites (threshold t = 0.54).

Table 3.

Mean value and standard deviation of the 44 features that were used in the best human model.

Fig 4.

Frequency distribution of PWM_positive scores for the test samples of the best performing run 2.

The PWM was established using the true start sites in the training data of run 2. The difference between TPs and TNs was found to be highly significant (p = 5.5 × 10⁻¹⁷³, Wilcoxon rank-sum test).
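
A position weight matrix (PWM) score of the kind shown in this figure can be sketched as follows. The caption does not specify how the PWM was parameterized; log-odds against a uniform background with a pseudocount of 1 is an assumption, and `build_pwm` and `score` are hypothetical names.

```python
import math

ALPHABET = "ACGU"


def build_pwm(sequences, pseudocount=1.0):
    """Build a log-odds PWM from equally long aligned RNA sequences.

    Assumptions: uniform background (0.25 per base) and an additive
    pseudocount; the published PWM may be defined differently.
    """
    length = len(sequences[0])
    pwm = []
    for j in range(length):
        counts = {base: pseudocount for base in ALPHABET}
        for seq in sequences:
            counts[seq[j]] += 1
        total = sum(counts.values())
        pwm.append(
            {base: math.log2((counts[base] / total) / 0.25) for base in ALPHABET}
        )
    return pwm


def score(pwm, seq):
    """Sum the per-position log-odds of seq under the PWM."""
    return sum(column[base] for column, base in zip(pwm, seq))


# Toy training set standing in for windows around true start sites.
pwm = build_pwm(["ACGU", "ACGU", "ACGA"])
```

Sequences that resemble the training windows receive higher scores, which is the kind of separation between true positives and true negatives that the figure reports.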

Table 4.

Performance of the best human HEK293 model applied to the mouse ES and human HEK293–AUG datasets.

Fig 5.

Alternative start codons of human gene GIMAP5.

Predicted start sites were subdivided into four groups by initiation confidence c, highlighted by different colors and dashed lines: very high (hot/best candidates, c ≥ 0.9), high (0.8 ≤ c < 0.9), moderate (0.7 ≤ c < 0.8) and low (t ≤ c < 0.7, with threshold t = 0.54). For this gene, we found one hot candidate (AUG at position -203) with a very high confidence value of 0.92 of being a true start site.
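
The four confidence groups map directly onto a small lookup function. This is a sketch using the cut-offs stated in the caption; the function name and the "not predicted" label for values below the threshold are hypothetical.

```python
def confidence_group(c, t=0.54):
    """Map an initiation confidence c to the groups used in Fig 5."""
    if c >= 0.9:
        return "very high"
    if c >= 0.8:
        return "high"
    if c >= 0.7:
        return "moderate"
    if c >= t:
        return "low"
    # Below the threshold the site is not predicted as a start.
    return "not predicted"


# The AUG at position -203 of GIMAP5 has a reported confidence of 0.92.
group = confidence_group(0.92)
```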

Fig 6.

SNP analysis of gene GIMAP5.

Mutation matrix showing the impact of the flanking sequence context of four putative start sites of gene GIMAP5 on the predicted initiation confidence. In each case, only one nucleotide is mutated with respect to the reference sequence (top line). Grey means that the site was predicted as a true translational start (predicted initiation confidence greater than 0.54), whereas white means that the site was classified as a false start. Mutations at the start sites themselves were not considered. The numbers are the predicted initiation confidence values. A: CUG at position -36. B: CUG at position -44. C: AUA at position -237. D: CUG at position -160.
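
The rows of such a mutation matrix can be generated by enumerating every single-nucleotide substitution in the flanking window while leaving the start codon untouched. The sketch below only produces the mutated sequences; scoring them would require the prediction model, which is not reproduced here. The function name is hypothetical, and the position numbering (-1 immediately upstream of the codon, +4 immediately downstream, first codon nucleotide at +1) is an assumption.

```python
def single_nucleotide_mutants(window, codon_start, alphabet="ACGU"):
    """Yield (position, new_base, mutated_window) for each substitution.

    window: RNA sequence containing the start codon and its flanks.
    codon_start: 0-based index of the first codon nucleotide in window.
    The three codon nucleotides are never mutated.
    """
    for i, ref in enumerate(window):
        if codon_start <= i < codon_start + 3:
            continue  # skip the start codon itself
        # Assumed numbering: upstream -1, -2, ...; downstream +4, +5, ...
        if i < codon_start:
            position = i - codon_start
        else:
            position = i - codon_start + 1
        for alt in alphabet:
            if alt != ref:
                yield position, alt, window[:i] + alt + window[i + 1:]


# Toy window: three upstream bases, CUG codon, one downstream base.
mutants = list(single_nucleotide_mutants("GCCCUGG", codon_start=3))
```

Each flanking position yields three mutants, so a window with f flanking nucleotides produces 3f sequences to score against the reference.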

Fig 7.

In silico mutation analysis considering all 3,566 genes of the HEK293 dataset.

The flanking sequences of all possible start sites in the HEK293 dataset were mutated. Shown is the difference in the predicted initiation confidence (IC_difference = IC_mutation − IC_wildtype). Positions -3 and -12 stand out and appear to have the largest influence on the prediction. Positions within the start site itself were not mutated.
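
The quantity plotted here, IC_difference = IC_mutation − IC_wildtype, can be aggregated per flanking position as in the following sketch. The record layout and the function name are hypothetical, and summarizing by the mean is an illustrative choice that may differ from how the figure presents the data.

```python
from collections import defaultdict
from statistics import mean


def mean_ic_difference(records):
    """Average IC_mutation - IC_wildtype for each mutated position.

    records: iterable of (position, ic_wildtype, ic_mutation) tuples,
        one per in silico mutation (hypothetical input format).
    """
    differences = defaultdict(list)
    for position, ic_wildtype, ic_mutation in records:
        differences[position].append(ic_mutation - ic_wildtype)
    return {pos: mean(values) for pos, values in differences.items()}


# Toy values, not taken from the paper.
summary = mean_ic_difference([
    (-3, 0.60, 0.40),
    (-3, 0.60, 0.20),
    (4, 0.50, 0.55),
])
```

A strongly negative mean at a position indicates that mutations there tend to lower the predicted initiation confidence.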
