Assessing predictors for new post translational modification sites: A case study on hydroxylation

doi:10.1371/journal.pcbi.1007967

Table 1.

Datasets.

The “Negative” dataset contains only clusters of non-hydroxylated sites. The other datasets contain clusters with both positive and negative examples, but their negatives are removed entirely during evaluation (*). Negative sites (^) considered during evaluation are always resampled for each replica, based on the size of the positive dataset.

Fig 1.

Performance on literature examples.

The evaluation considers only hydroxylated sites detected by single-protein experiments (Literature dataset). Error bars show the standard deviation calculated over 1,000 replica sets. The consensus baseline method is the majority vote across all predictors. Suffix numbers in the method names indicate increasing quality thresholds, as defined by the developers.
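
The consensus baseline and the error bars described in this caption can be illustrated with a minimal sketch. It is not the authors' code: the function names `consensus` and `summarize` are hypothetical, ties are assumed to count as a negative call, and the sample standard deviation is assumed (the caption does not say whether the sample or population form was used).

```python
import statistics


def consensus(votes):
    """Majority vote over per-predictor calls for one site.

    `votes` is an iterable of booleans, one per predictor.
    Assumption: a tie counts as a negative call.
    """
    votes = list(votes)
    return 2 * sum(votes) > len(votes)


def summarize(replica_scores):
    """Return (mean, standard deviation) of a metric over replica sets.

    Assumption: the sample standard deviation is used.
    """
    return statistics.mean(replica_scores), statistics.stdev(replica_scores)
```

For example, `consensus([True, True, False])` gives a positive call, while `consensus([True, False])` does not under the tie rule assumed here.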

Table 2.

Methods overview.

Self-reported performance is taken from the corresponding method publications, preferring values reported for independent validation sets, i.e. sets not used in training. The “Type” column indicates the type of hydroxylated residue predicted: proline (P), lysine (K) or tyrosine (Y). “Window” indicates the number of neighbouring residues considered for a prediction. Self-reported performance includes specificity (Sp), sensitivity (Sn), accuracy (acc), Matthews correlation coefficient (MCC) and the area under the ROC curve (AUC).
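
For reference, the threshold-based measures listed above follow the standard definitions from confusion-matrix counts. The sketch below uses a hypothetical helper, `binary_metrics`, and returns 0.0 for any measure whose denominator is zero, which is a convention assumed here rather than one stated in the source. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute Sn, Sp, acc and MCC from confusion-matrix counts."""

    def ratio(num, den):
        # Assumed convention: undefined ratios are reported as 0.0.
        return num / den if den else 0.0

    sn = ratio(tp, tp + fn)                     # sensitivity (recall)
    sp = ratio(tn, tn + fp)                     # specificity
    acc = ratio(tp + tn, tp + fp + tn + fn)     # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, denom)       # Matthews correlation
    return {"Sn": sn, "Sp": sp, "acc": acc, "MCC": mcc}
```

With `tp=8, fp=2, tn=8, fn=2`, Sn, Sp and acc are all 0.8 and MCC is 0.6.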

Fig 2.

Performance on the MS-HeLa datasets.

The evaluation considers only hydroxylated sites detected by a mass-spectrometry experiment (MS-HeLa dataset). Consensus and error bars are calculated as in Fig 1. Suffix numbers in the method names indicate increasing quality thresholds, as defined by the developers.

Fig 3.

Performance on the MS-Kim datasets.

The evaluation considers only hydroxylated sites detected by a mass-spectrometry experiment (MS-Kim dataset). Consensus and error bars are calculated as in Fig 1. Suffix numbers in the method names indicate increasing quality thresholds, as defined by the developers.

Fig 4.

Performance on MS-collagen examples.

The evaluation considers only hydroxylated sites detected by mass-spectrometry experiments and belonging to collagen proteins (MS-collagen dataset). Consensus and error bars are calculated as in Fig 1. Suffix numbers in the method names indicate increasing quality thresholds, as defined by the developers.

Fig 5.

Feature distributions for MS and Literature sites.

Content refers to the fraction of residues in the site sequence associated with a given feature. Density refers to the fraction of proteins in the dataset with a given number of sites.

Fig 6.

Dataset generation.

Negative (blue dots) and positive (red dots) sites are clustered based on sequence similarity. Positive clusters (gray background) contain at least one hydroxylation site, and negative examples falling inside positive clusters are removed. 1,000 replica sets are created by randomly sampling 70% of the positive sites and the same number of negative sites.
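
The filtering and resampling steps in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the clustering itself is taken as given (each site already carries a cluster identifier), the names `split_sites` and `make_replicas` are hypothetical, sampling is assumed to be without replacement, and the 70% count is assumed to be rounded to the nearest integer.

```python
import random


def split_sites(sites):
    """Separate positives from usable negatives.

    `sites` is a list of (site_id, cluster_id, is_positive) tuples.
    Negatives that fall inside a cluster containing at least one
    positive site are dropped.
    """
    positive_clusters = {c for _, c, is_pos in sites if is_pos}
    positives = [s for s, _, is_pos in sites if is_pos]
    negatives = [s for s, c, is_pos in sites
                 if not is_pos and c not in positive_clusters]
    return positives, negatives


def make_replicas(positives, negatives, n_replicas=1000, fraction=0.7, seed=0):
    """Build replica sets with equal numbers of positives and negatives.

    Assumptions: sampling without replacement, count rounded to nearest,
    and at least as many negatives as sampled positives.
    """
    rng = random.Random(seed)
    n = round(fraction * len(positives))
    return [(rng.sample(positives, n), rng.sample(negatives, n))
            for _ in range(n_replicas)]
```

Each replica is a pair of equally sized lists, so a metric computed per replica can then be averaged and its spread reported across the replica sets.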
