
Table 1.

Accuracy of several variants of pWALTZ-like scoring.

The method described as “original PSSM” uses the pWALTZ PSSM; “scrambled PSSM” averages over all permutations of the positions of the PSSM; “scrambled hexamers” first scrambles the hexamers from which the PSSM is constructed; “scrambled prion domain” tests the ability of the original PSSM to detect scrambled versions of the prion domain. Accuracy is measured using the area under the receiver operating characteristic curve (AUC-ROC) and the area under the precision-recall curve (AUC-PR). All methods use a filtering step that considers only regions with negative FoldIndex scores and a Q/N content of at least 25%. Figure A in S1 File provides the ROC and PR curves corresponding to these results, and Table A in S1 File provides additional results when no pre-filtering is performed.
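The pre-filtering step can be sketched as follows. This is an illustrative sketch only: the real FoldIndex score requires a full disorder computation, so `fold_index` below is assumed to be precomputed, and only the Q/N-content test is implemented.

```python
def qn_content(window: str) -> float:
    """Fraction of glutamine (Q) and asparagine (N) residues in a window."""
    return sum(aa in "QN" for aa in window) / len(window)

def passes_filter(window: str, fold_index: float) -> bool:
    """Keep a region only if FoldIndex is negative and Q/N content is >= 25%.

    `fold_index` is assumed to be computed elsewhere (hypothetical input).
    """
    return fold_index < 0 and qn_content(window) >= 0.25

# A Q/N-rich window with a (made-up) negative FoldIndex score passes:
print(passes_filter("QNQQNYSGNQNNS", -0.1))  # prints True
```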

Table 2.

Classifier performance on the Alberti dataset.

Performance is measured with leave-one-protein-out cross-validation using the area under the ROC curve (AUC-ROC) and the area under the precision-recall curve (AUC-PR); the curves are provided in Figure B in S1 File.
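Leave-one-protein-out cross-validation can be sketched as below: in each fold, every window of one held-out protein is scored by a model trained on all remaining proteins. The `train` and `score` callables are placeholders for the actual learning method, not the paper's implementation.

```python
def leave_one_protein_out(proteins, train, score):
    """Yield (protein_id, window_scores) with that protein held out of training.

    `proteins` is a list of dicts with "id" and "windows" keys (an assumed,
    illustrative data layout); `train` and `score` are user-supplied.
    """
    for held_out in proteins:
        training_set = [p for p in proteins if p is not held_out]
        model = train(training_set)  # fit on all other proteins
        yield held_out["id"], [score(model, w) for w in held_out["windows"]]
```

Grouping by protein (rather than by window) prevents windows of the same protein from appearing in both training and test sets, which would inflate the measured accuracy.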

Fig 1.

An illustration of the concept of multiple-instance learning.

In MIL, training examples come in bags; a positive bag contains a set of examples, with the constraint that at least one of them must be positive. In our setup a positive bag corresponds to an annotated prion domain, and this constraint captures the inaccuracy that is inherent in experimentally delineating a prion-forming domain: the actual minimal domain that supports prion formation is rarely fully characterized, and is typically embedded within the annotated domain. The examples in a negative bag are all negative (all the sequence windows outside a prion-forming domain are negative examples).
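The bag construction described above can be sketched as follows, under stated assumptions: every window overlapping the annotated domain goes into the positive bag (at least one of them is presumed truly positive), and windows fully outside the domain are negative examples. The window length and step are illustrative parameters, not the paper's settings.

```python
def make_bags(sequence, domain_start, domain_end, window=41, step=1):
    """Split a protein into a positive MIL bag and negative examples.

    `domain_start`/`domain_end` delimit the annotated prion domain
    (half-open, 0-based). Window size and step are assumptions.
    """
    positive_bag, negatives = [], []
    for i in range(0, len(sequence) - window + 1, step):
        w = sequence[i:i + window]
        if i < domain_end and i + window > domain_start:  # overlaps domain
            positive_bag.append(w)
        else:
            negatives.append(w)
    return positive_bag, negatives
```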

Fig 2.

Receiver operating characteristic curves (a) and precision-recall curves (b) for proteome-wide prediction in yeast.

FPR and TPR denote the false and true positive rates. The numbers in parentheses give the area under the curve. Note that the x-axis of the ROC plot is truncated at an FPR of 5%. MW and PLAAC-LLR denote the results for the Michelitsch-Weissman score and the HMM-based algorithm presented in [25], respectively.

Table 3.

Classifier performance proteome-wide.

Performance is measured with leave-one-protein-out cross-validation using the area under the ROC curve (AUC-ROC) and the area under the precision-recall curve (AUC-PR).

Fig 3.

Comparison of amino acid weights for different methods.

(a) The pRANK weights learned on the Alberti dataset; (b) the pRANK weights from the proteome-wide evaluation; (c) the log-odds ratios, obtained by Angarica et al., of the frequencies of occurrence of the amino acids in the prion domains of yeast prions relative to their background frequencies in the protein universe; (d) the log-odds ratios obtained experimentally in the random mutagenesis experiment of Toombs et al.
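The log-odds ratios in panels (c) and (d) can be computed as sketched below: for each amino acid, the log of its frequency in prion domains over its background frequency. The frequencies in the example are made up for illustration; the real values come from Angarica et al. and Toombs et al.

```python
import math

def log_odds(prion_freq: dict, background_freq: dict) -> dict:
    """Per-amino-acid log(f_prion / f_background)."""
    return {aa: math.log(prion_freq[aa] / background_freq[aa])
            for aa in prion_freq}

# Toy frequencies (not real data): Q and N are enriched in prion domains,
# so their log-odds ratios come out positive.
scores = log_odds({"Q": 0.20, "N": 0.15}, {"Q": 0.04, "N": 0.04})
```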

Table 4.

Results of mutation analysis.

Scores in bold correspond to correct predictions by a method at the given threshold value. For PAPA and pRANK, the threshold is the highest score of a non-prion in the Alberti dataset; for PrionW, the value suggested in its paper was used. PrionW found no prion-like domain in YLR177W and YLR177mut.
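The threshold rule for PAPA and pRANK can be sketched as follows: the cutoff is the highest score assigned to any non-prion in the training data, so a protein is called a prion only if it scores above every negative example. This is an illustrative helper, not the papers' code.

```python
def pick_threshold(scores, labels):
    """Highest score among non-prions (labels: True = prion, False = non-prion)."""
    return max(s for s, is_prion in zip(scores, labels) if not is_prion)

# Toy scores for two prions (0.9, 0.7) and two non-prions (0.4, 0.2):
threshold = pick_threshold([0.9, 0.4, 0.7, 0.2], [True, False, True, False])
# threshold == 0.4; scores above it are predicted to be prions
```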

Fig 4.

Prion prediction as a classification problem.

Sequence windows within the protein are denoted by double arrows. Sequence windows within the prion-forming domain (highlighted as a red box) are shown in red; sequence windows that do not overlap the annotated domain are shown in grey and are used as negative examples.
