Phylogenetic Profiling: How Much Input Data Is Enough?

doi:10.1371/journal.pone.0114701

Figure 1.

Predictive accuracy of phylogenetic profiling, measured as the area under the precision-recall curve (AUPRC), as a function of the amount of data available for training the functional annotation model.

For each year from 2005 to 2013 on the x-axis, the corresponding dataset includes the genomes that were available in both the OMA database and the NCBI taxonomy database in that year; the phylogenetic profiles are annotated using the UniProt-GOA file available in January of that year. Each violin plot summarizes the distribution of GO terms by AUPRC value: the width of the plot at a given AUPRC value corresponds to the probability density of GO terms at that value, and the black dot denotes the mean AUPRC. A) We consider 1093 GO terms in total: those that had sufficient annotation information in the most recent database releases. If the model does not have enough data to infer annotations for one of the 1093 GO terms, as is the case for, e.g., 846 of these GO terms with the 2005 data, the AUPRC score of that term is zero. B) We consider only the GO terms that had sufficient annotation information throughout the analysed releases. C) We consider all the GO terms from the prokaryotic GO set.
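
As an illustration of the metric, the following minimal sketch computes AUPRC for one GO term and then averages over a fixed list of GO terms, scoring a term as zero when no predictions exist for it, as the caption describes for panel A. It assumes the average-precision estimator of the area under the precision-recall curve; the source does not state which estimator was used. The function names and the term identifiers are hypothetical, and tied scores are ranked in arbitrary order.

```python
from typing import Dict, Sequence, Tuple


def auprc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average-precision estimate of the area under the precision-recall curve.

    Assumption: average precision stands in for AUPRC here; other
    estimators (e.g. trapezoidal interpolation) give slightly different values.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0  # no positive examples: nothing can be recalled
    # Rank proteins from the highest to the lowest prediction score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_positives += 1
            area += true_positives / rank  # precision at this recall step
    return area / n_pos


def mean_auprc(
    per_term: Dict[str, Tuple[Sequence[float], Sequence[int]]],
    all_terms: Sequence[str],
) -> float:
    """Mean AUPRC over a fixed term list; terms without predictions score zero."""
    total = 0.0
    for term in all_terms:
        if term in per_term:
            total += auprc(*per_term[term])
        # else: the model could not be trained for this term, so it adds 0.0
    return total / len(all_terms)
```

Averaging over the fixed term list rather than over the terms that happen to have predictions is what lets early, data-poor releases be compared with later ones on the same footing.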

Figure 2.

Predictive accuracy of phylogenetic profiling, measured in AUPRC, when we reduce the number of annotations used for phylogenetic profiling.

For each experiment on the x-axis, we used only a fraction of the annotations available in the most recent dataset. Full and dashed lines connect the mean AUPRC scores for two sets of experiments: random sub-selection of genomes (full lines) and sub-selection that keeps maximum diversity among the selected genomes (dashed lines). Colour denotes the number of genomes used in the phylogenetic profiles.
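
Using only a fraction of the annotations can be sketched as a seeded random sub-sample. This is an illustration under stated assumptions, not the authors' procedure: it assumes annotations are stored as (protein, GO term) pairs and that sampling is done per annotation rather than per protein, which the caption does not specify. The function name and identifiers are hypothetical.

```python
import random
from typing import List, Tuple

Annotation = Tuple[str, str]  # (protein identifier, GO term identifier)


def subsample_annotations(
    annotations: List[Annotation], fraction: float, seed: int = 0
) -> List[Annotation]:
    """Keep a random fraction of the annotations, without replacement."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    n_keep = round(fraction * len(annotations))
    # A dedicated, seeded generator keeps each experiment reproducible.
    return random.Random(seed).sample(annotations, n_keep)
```

Fixing the seed makes a given sub-sample repeatable, while varying it yields independent replicates of the same experiment.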

Figure 3.

Predictive accuracy of phylogenetic profiling when we change the number of included genomes.

Full and dashed lines connect the dots representing the mean AUPRC scores for two sets of experiments: random sub-selection of genomes (full lines) and sub-selection that keeps maximum diversity among the selected genomes (dashed lines). Each dot represents the mean AUPRC over the GO terms used for annotation. The rightmost point denotes the mean AUPRC score when we include all the available bacteria in the OMA 2012 release (1078 bacteria). Isolated dots denote the mean AUPRC for the genome subsets named in their labels.
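
The two genome sub-selection strategies can be contrasted with a small sketch. The caption does not say how maximum diversity was achieved, so the greedy farthest-point heuristic below is an assumption chosen for illustration, as are the pairwise distance dictionary, the function names, and the choice of the first genome as the starting point.

```python
import random
from typing import Dict, List, Tuple

Distances = Dict[Tuple[str, str], float]  # symmetric, stored once per pair


def pair_distance(dist: Distances, a: str, b: str) -> float:
    """Look up a distance regardless of the order in which the pair was stored."""
    return dist[(a, b)] if (a, b) in dist else dist[(b, a)]


def random_subset(genomes: List[str], k: int, seed: int = 0) -> List[str]:
    """Random sub-selection of k genomes."""
    return random.Random(seed).sample(genomes, k)


def diverse_subset(genomes: List[str], dist: Distances, k: int) -> List[str]:
    """Greedy max-min selection (assumed heuristic, not the authors' method).

    Repeatedly adds the genome whose nearest already-chosen genome is
    farthest away, so near-duplicates such as close strains are picked last.
    """
    chosen = [genomes[0]]  # arbitrary starting genome
    while len(chosen) < k:
        candidates = [g for g in genomes if g not in chosen]
        best = max(
            candidates,
            key=lambda g: min(pair_distance(dist, g, c) for c in chosen),
        )
        chosen.append(best)
    return chosen
```

The greedy heuristic does not guarantee the most diverse subset overall, but it is simple and its result depends only on the distances and the starting genome.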

Figure 4.

Predictive accuracy of phylogenetic profiling when we control for the influence of the Open World Assumption.

Colours denote two sets of experiments: experiments in which we include only the well-annotated proteins (purple) and experiments in which we randomly remove 60% of the available annotations (red). Full and dashed lines connect the mean AUPRC scores for two sets of experiments: random sub-selection of genomes (full lines) and sub-selection that keeps maximum diversity among the selected genomes (dashed lines). Each dot represents the mean AUPRC over the GO terms used for annotation. The final point denotes the mean AUPRC score when we include all the available bacteria in the OMA database release used (1078 bacteria).

Figure 5.

Predictive accuracy of phylogenetic profiling was not affected when we used many strains of the same organism.

We used 31 strains of Escherichia coli and added to this set: A) 31 random organisms, B) 62 random organisms, C) 93 random organisms, and D) 124 random organisms. Each plot in a panel corresponds either to the combination of the 31 E. coli strains and the randomly selected organisms (left) or to the randomly selected organisms alone (right). Each boxplot summarizes AUPRC scores for GO terms in the dataset indicated on the x-axis. The lower, middle, and upper horizontal lines denote the first quartile, the median, and the third quartile, respectively; vertical lines extend 1.5 times the interquartile range from the respective quartile or to the extreme value, whichever is closer. Each plot summarizes the results of ten independent random organism selections.
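
The boxplot elements described above can be written out as a short calculation. The sketch follows the caption literally: each whisker ends at 1.5 times the interquartile range from its quartile or at the extreme value, whichever is closer. The quartile interpolation method is an assumption, since the caption does not name one, and many plotting tools instead end the whisker at the most extreme data point inside that limit. The function name is hypothetical.

```python
import statistics
from typing import Dict, Sequence


def box_stats(values: Sequence[float]) -> Dict[str, float]:
    """Quartiles and whisker ends for one boxplot, read literally from the caption."""
    # Assumption: "inclusive" quartiles; other interpolation rules shift them slightly.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whichever is closer to the box: the 1.5 * IQR limit or the extreme value.
        "lower_whisker": max(min(values), q1 - 1.5 * iqr),
        "upper_whisker": min(max(values), q3 + 1.5 * iqr),
    }
```

For the values 1 to 8 plus a single outlier at 50, the upper whisker stops at the 1.5 times interquartile range limit rather than at 50, while the lower whisker stops at the minimum.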
