Reliability of plastid and mitochondrial localisation prediction declines rapidly with the evolutionary distance to the training set increasing

doi:10.1371/journal.pcbi.1012575

Fig 1.

Targeting prediction algorithms are frequently cited across disciplines and rely on a limited training set.

(a) Taxonomic distribution of the plastid and mitochondrial training datasets used for three commonly used prediction tools: TargetP, Localizer and WoLF PSORT (WPS). (b) Distribution of citations of the same three tools across disciplines between 2018 and 2022. Citation numbers are from the Web of Science.

Fig 2.

Performance of algorithms outside the training species.

Comparison of predicted versus experimentally localised plastid (a) and mitochondrial (b) proteome numbers. Each Venn diagram in the top panel shows the overlap between a predicted organelle proteome (left circle, colour-coded by the algorithm used) and the experimentally verified organelle proteome (right circle, grey). The underscored numbers in the bottom corners give the total number of predicted (bottom left) and experimentally confirmed (bottom right) proteins. The number of overlapping proteins (true positives) is given in bold in the top right corner, while the numbers of non-overlapping false positives and false negatives are shown next to each circle. See also the key for the Venn diagrams on the bottom left. Sensitivity, specificity and precision of individual algorithms and their combinations for plastids (c) and mitochondria (d).
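The three metrics in panels (c) and (d) follow from the Venn diagram counts. The sketch below shows the standard definitions, assuming that any protein in the proteome that is neither predicted nor experimentally verified counts as a true negative; the function names and the toy identifiers are hypothetical and are not taken from the study. Note that a "false positive" in this scheme may simply be a protein that has not yet been detected experimentally.

```python
def confusion_counts(predicted, verified, proteome):
    """Return (TP, FP, FN, TN) for one organelle in one species.

    Assumes `predicted` and `verified` are subsets of `proteome`.
    """
    predicted, verified, proteome = set(predicted), set(verified), set(proteome)
    tp = len(predicted & verified)             # overlap of the two circles
    fp = len(predicted - verified)             # predicted only
    fn = len(verified - predicted)             # experimentally verified only
    tn = len(proteome - predicted - verified)  # in neither set
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Standard sensitivity, specificity and precision; NaN if undefined."""
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "precision": tp / (tp + fp) if tp + fp else nan,
    }


# Toy identifiers, invented for illustration only.
proteome = {f"P{i}" for i in range(1, 11)}
predicted = {"P1", "P2", "P3", "P4"}
verified = {"P3", "P4", "P5"}

counts = confusion_counts(predicted, verified, proteome)
scores = metrics(*counts)
print(counts, scores)
```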

Fig 3.

Strong negative correlation between the precision of algorithms and the evolutionary distance from the training data.

Precision of TargetP across eukaryotes for plastids (a) and mitochondria (b); A. thaliana is shown in a darker shade. The taxonomic classification of the test species is shown on the x-axis; the sample is skewed towards eudicots because of genome sequence availability, a skew similar to that of the training data (Figs 1A and S5). (c) Precision of TargetP as a function of the evolutionary distance between the training species and 171 test genomes (plastid in green, mitochondria in orange).
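A relationship like the one in panel (c) can be summarised with a correlation coefficient. The caption does not state which coefficient was used, so the sketch below assumes a plain Pearson correlation implemented by hand; `pearson_r` and the toy values are hypothetical and do not reproduce any number from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Toy values, invented for illustration: one point per test genome.
distance = [0.0, 0.5, 1.0, 1.5, 2.0]   # evolutionary distance to training species
precision = [0.9, 0.7, 0.6, 0.4, 0.3]  # precision of the predictor

r = pearson_r(distance, precision)
print(f"r = {r:.3f}")  # negative when precision falls with distance
```

A rank-based coefficient such as Spearman's may be the better choice if the decline is monotonic but not linear.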

Fig 4.

Cross-organelle errors in proteome prediction due to physicochemical properties of the NTS.

Cross-organelle prediction errors arise either when an in-vivo plastid protein is localised in-silico to the mitochondrion (a) or vice versa (b). The overlaps between the cross-organelle in-vivo and in-silico proteomes identify these prediction errors. Analysis of the first 20 amino acids of pNTSs incorrectly predicted to be mitochondrial (c) and of mNTSs incorrectly predicted to be plastid (d). Average charge and number of phosphorylatable amino acids for the NTSs of all verified organelle proteins of each species are indicated by vertical green (pNTS) and orange (mNTS) lines. Error bars indicate the standard error of the mean (N = 4–331, S2 Fig). (e) Overlap between predicted (left) and experimentally localised (right, in grey) dual-targeted proteins. (f) Predicted (in-silico) intracellular localisation of experimentally verified (in-vivo) dual-targeted proteins (left column) and experimentally verified (in-vivo) intracellular localisation of proteins that are predicted (in-silico) to be dual targeted (right column).
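The two NTS properties compared in panels (c) and (d) can be computed directly from a protein sequence. The sketch below rests on assumptions that the caption does not confirm: net charge is taken as the count of K and R minus the count of D and E (histidine ignored), and S, T and Y are treated as phosphorylatable. The function name and the toy sequence are hypothetical.

```python
POSITIVE = set("KR")           # assumed positively charged residues
NEGATIVE = set("DE")           # assumed negatively charged residues
PHOSPHORYLATABLE = set("STY")  # assumed phosphorylatable residues


def nts_features(sequence, window=20):
    """Net charge and phosphorylatable residue count of the N-terminal window."""
    nts = sequence[:window].upper()
    charge = sum(aa in POSITIVE for aa in nts) - sum(aa in NEGATIVE for aa in nts)
    phospho = sum(aa in PHOSPHORYLATABLE for aa in nts)
    return {"net_charge": charge, "phosphorylatable": phospho, "length": len(nts)}


# Toy sequence, invented for illustration; not a real protein.
toy = "MASSTRLKRSLAAEDSTPRKAAAAAA"
features = nts_features(toy)
print(features)
```

Averaging these values over all verified plastid or mitochondrial proteins of a species would give reference lines of the kind drawn in the panels.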

Fig 5.

Success rate of predicting unique versus conserved organelle proteins.

Success rate (sensitivity) of predicting experimentally verified conserved and unique proteins for (a) plastids and (c) mitochondria. All proteins from each species were sorted into conserved or unique based on sequence-based protein clustering (see Methods, S3 Fig). Of the total plastid and mitochondrial false negatives (from Fig 2A and 2B), the number of proteins that were unique to a given species is shown for plastids (b) and mitochondria (d).
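Once each verified protein has been labelled as conserved or unique, the success rate per class is the fraction of verified proteins in that class that the algorithm also predicts. The sketch below assumes the labels are already available from an upstream clustering step, which is not shown; `sensitivity_by_class` and the toy data are hypothetical.

```python
from collections import defaultdict


def sensitivity_by_class(verified_class, predicted):
    """Fraction of verified proteins recovered by prediction, per class.

    verified_class: dict mapping verified protein ID -> "conserved" or "unique"
    predicted: iterable of protein IDs predicted to be in the organelle
    """
    predicted = set(predicted)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for protein, cls in verified_class.items():
        totals[cls] += 1
        hits[cls] += protein in predicted  # True counts as 1
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Toy labels and predictions, invented for illustration only.
verified_class = {
    "P1": "conserved", "P2": "conserved", "P3": "conserved",
    "P4": "unique", "P5": "unique",
}
predicted = {"P1", "P2", "P4", "P9"}

result = sensitivity_by_class(verified_class, predicted)
print(result)
```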

Fig 6.

A framework for improving localisation prediction algorithms.

Strategies to improve prediction reliability involve changes to the curation of the training data as well as to the training procedure. The training data should ideally be collected from a range of diverse species and, for each species, be based on different experimental techniques that support a training protein's localisation (e.g. reporters, mass spectrometry, coexpression, interactions). Proteins with non-canonical internal motifs and dually targeted proteins need to be taken into account (as they help to better distinguish between pNTS and mNTS features), and validated data could be sorted according to whether they are part of a core- or pan-proteome. Classifiers on which the algorithms are trained could include parameters such as the evolutionary distance of a species, non-coding regions, or a protein's abundance, a currently neglected factor. One can expect that combining multi-dimensional parameters from evolutionary biology, cell biology and molecular biology across evolutionarily diverse species will significantly improve the next generation of machine learning algorithms that serve localisation (and function) prediction.
