PhANNs, a fast and accurate tool and web server to classify phage structural proteins

doi:10.1371/journal.pcbi.1007845

Table 1.

Summary of previous ML-based methods for classifying viral structural proteins.

Fig 1.

Non-homologous database split: To ensure that no homologous sequences are shared between the test, validation, and training sets, the sequences from each class (major capsid proteins in this figure) were de-replicated at 40% identity.

In the de-replicated set, no two proteins share more than 40% identity, and each sequence is a representative of a larger cluster of related proteins. The de-replicated set is then randomly partitioned into eleven equal-size subsets (1d_Mcp-10d_Mpc plus Test_Mpc). Those subsets are expanded by replacing each sequence with all the sequences in the cluster it represents (subsets 1D_Mpc-10D_Mpc plus TEST_Mpc). Analogous subsets are generated for the remaining ten classes, and corresponding subsets are combined to generate the subsets used for 10-fold cross-validation and testing (1D-10D and TEST).
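
The partition-then-expand step can be illustrated with a minimal sketch. It assumes the clustering has already been done by an external tool and is available as a mapping from each representative to its cluster members; the function name `split_clusters`, the input format, and the round-robin dealing are illustrative assumptions, not the authors' script.

```python
import random

def split_clusters(clusters, n_subsets=11, seed=0):
    """Partition cluster representatives into n_subsets, then expand each
    representative into all members of its cluster.

    `clusters` maps a representative ID to the list of member IDs
    (representative included). This input format is hypothetical.
    """
    reps = sorted(clusters)
    rng = random.Random(seed)
    rng.shuffle(reps)
    # Deal representatives round-robin so subset sizes differ by at most one.
    rep_subsets = [reps[i::n_subsets] for i in range(n_subsets)]
    # Expansion: replace each representative with its whole cluster.
    return [[m for rep in subset for m in clusters[rep]] for subset in rep_subsets]

# Toy data: 22 clusters with 1 to 3 members each.
toy_clusters = {
    f"rep{i}": [f"rep{i}"] + [f"rep{i}_m{j}" for j in range(i % 3)]
    for i in range(22)
}
subsets = split_clusters(toy_clusters)
```

Because whole clusters are assigned to a single subset, related sequences never end up on both sides of a train/validation/test boundary, although the expanded subsets are no longer guaranteed to be equal in size.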

Table 2.

Database numbers—Raw sequences were downloaded using a custom script available at https://github.com/Adrian-Cantu/PhANNs.

All datasets can be downloaded from the web server. *Numbers before and after removing sequences at least 60% identical to a protein in the classes database.

Table 3.

Feature types included in each of the 12 models.

di: 2-mer/dipeptide composition; tri: 3-mer/tripeptide composition; tetra: 4-mer/tetrapeptide composition; sc: side-chain grouping; p: plus all the extra features [isoelectric point, instability index (whether a protein is likely to be degraded rapidly), ORF length, aromaticity (relative frequency of aromatic amino acids), molar extinction coefficient (how much light a protein absorbs) using two methods (assuming reduced cysteines or disulfide bonds), hydrophobicity, GRAVY index (average hydropathy), and molecular weight, as computed using Biopython]. *Per-class score figures are available as supplementary material.
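
As a rough illustration of the k-mer composition features, the sketch below computes relative k-mer frequencies over the 20 standard amino acids. The function name `kmer_composition`, the normalization by total k-mer count, and the handling of non-standard residues (ignored) are assumptions; the side-chain grouping and the extra Biopython-derived features are not reproduced here.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_composition(seq, k=2, alphabet=AMINO_ACIDS):
    """Return the relative frequency of every possible k-mer over `alphabet`.

    k-mers containing characters outside `alphabet` are ignored
    (an assumption made for this sketch).
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]

# Dipeptide composition of a toy sequence: a vector of 20**2 = 400 values.
vec = kmer_composition("MKKLLAAMKK", k=2)
```

With k = 2, 3, and 4 the vector length grows to 400, 8,000, and 160,000 entries, which is why the feature combinations in the table differ so much in size.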

Fig 2.

Model-specific F₁ score—F₁ scores (harmonic mean of precision and recall) for each polypeptide model/class combination.

All models follow similar trends as to which classes are more or less difficult to classify correctly. Error bars represent the 95% confidence intervals.
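
For reference, the harmonic mean used for the F₁ score can be written as a small helper. The function name and the convention of returning 0 when both inputs are 0 are assumptions for this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A class with high precision but modest recall is pulled toward the lower value.
example_f1 = f1_score(0.9, 0.6)
```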

Fig 3.

Class-specific F₁ score—F₁ scores (harmonic mean of precision and recall) for each polypeptide model/class combination.

Some classes, such as minor capsid, tail fiber, or minor tail, are harder to classify correctly irrespective of the model used. Error bars represent the 95% confidence intervals.

Fig 4.

Model-specific validation weighted average scores—Precision, recall, and F₁ scores for all models.

Precision is higher in all models as the “others” class is the largest and easiest to classify correctly. Error bars represent the 95% confidence intervals.

Fig 5.

Per-class relationship between PhANNs score and confidence: The confidence corresponding to a particular class PhANNs score is the fraction of true positives (correctly classified sequences) among the test-set sequences that were classified as that class with that PhANNs score or higher.

As it is uncommon for the highest class PhANNs score to be less than 2, the left side of the graph includes all test proteins that were classified as that class, and the confidence corresponds to the per class precision (see Table 4).
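
A minimal sketch of this confidence calculation follows, assuming test results are available as (true class, predicted class, score) tuples. The record format, the function name `confidence_curve`, and the integer thresholds 0 to 10 are illustrative assumptions.

```python
def confidence_curve(records, target_class, thresholds=range(0, 11)):
    """Map each score threshold to the fraction of correct calls among
    records predicted as `target_class` with a score at or above it.

    `records` is an iterable of (true_class, predicted_class, score).
    Thresholds with no qualifying records are omitted.
    """
    records = list(records)
    curve = {}
    for t in thresholds:
        called = [r for r in records if r[1] == target_class and r[2] >= t]
        if called:
            correct = sum(1 for r in called if r[0] == target_class)
            curve[t] = correct / len(called)
    return curve

# Toy test-set results (made-up values, for illustration only).
toy_records = [
    ("tail", "tail", 9.5),
    ("tail", "tail", 6.0),
    ("other", "tail", 3.0),
    ("tail", "other", 8.0),
    ("other", "tail", 6.5),
]
tail_curve = confidence_curve(toy_records, "tail")
```

At the lowest threshold every sequence called as the class is included, so the value there equals the per-class precision.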

Fig 6.

Confusion matrix using the “tetra_sc_tri_p” model—Each row shows the proportional classification of test sequences from a particular class.

A perfect classifier would have 1 on the diagonal and 0 elsewhere. In general, a protein that is misclassified is predicted as “others”.
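
The row-wise proportions can be sketched as follows, assuming parallel lists of true and predicted labels. The function name and the toy labels are illustrative, not taken from the PhANNs code.

```python
def row_normalized_confusion(true_labels, predicted_labels, classes):
    """Return a matrix whose row i gives the proportion of sequences of
    true class i assigned to each predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows for classes with no test sequences are left as zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

toy_classes = ["major capsid", "tail fiber", "others"]
toy_true = ["major capsid", "major capsid", "tail fiber", "tail fiber", "others", "others"]
toy_pred = ["major capsid", "others", "tail fiber", "others", "others", "others"]
toy_matrix = row_normalized_confusion(toy_true, toy_pred, toy_classes)
```

In this toy case the off-diagonal mass sits in the “others” column, mirroring the pattern described in the caption.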

Table 4.

Results of per class classification for the test set.

Support indicates the number of test sequences in each specific class. Accuracy (fraction of observations correctly classified) is equivalent to the weighted average recall (weighted by the support of each class). The macro average is unweighted (all classes contribute equally).
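
The equivalence between accuracy and support-weighted recall, and its difference from the macro average, can be checked with a short sketch. The function name `recall_summary` and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def recall_summary(true_labels, predicted_labels):
    """Return (accuracy, support-weighted recall, macro-averaged recall)."""
    support = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    recall = {c: hits[c] / support[c] for c in support}
    n = len(true_labels)
    accuracy = sum(hits.values()) / n
    # Weighting each class recall by its support recovers the accuracy.
    weighted = sum(recall[c] * support[c] for c in support) / n
    # The macro average gives every class the same weight.
    macro = sum(recall.values()) / len(recall)
    return accuracy, weighted, macro

# Imbalanced toy data: a large, easy class and a small, harder one.
toy_true = ["others"] * 8 + ["minor capsid"] * 2
toy_pred = ["others"] * 8 + ["others", "minor capsid"]
accuracy, weighted_recall, macro_recall = recall_summary(toy_true, toy_pred)
```

With an imbalanced test set the macro average is the more sensitive indicator of how the small classes perform.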

Fig 7.

Effect of disregarding low scoring test proteins—Progression of the weighted average precision, recall and F₁-score of the test set after excluding low scoring proteins.

The portion of included proteins is the fraction that can be classified if only that score or higher is trusted. Very few test proteins have a PhANNs score of 10, and not all classes are represented at that score.

Table 5.

Results of per class classification for proteins in the test set with a PhANNs score of 8 or higher.

Support indicates the number of test sequences in each specific class. Accuracy (fraction of observations correctly classified) is equivalent to the weighted average recall (weighted by the support of each class). The macro average is unweighted (all classes contribute equally).

Fig 8.

Comparison of “tetra_sc_tri_p” model trained with and without the Minor capsid class—As minor capsid is the worst performing class in our test set, we trained an analogous ANN ensemble with it removed.

Panels A and B show the ROC curves for the models with and without minor capsid, respectively. Panels C and D show the relationship between PhANNs score and confidence for the models with and without minor capsid, respectively. Panels E and F show the confusion matrices for the models with and without minor capsid, respectively.

Table 6.

The effect on the model's scores of excluding the minor capsid class (mc): Most scores are affected only slightly and are as likely to improve as to worsen.

Table 7.

Comparison of PhANNs with VIRALpro. Results from running the VIRALpro test set through PhANNs and the PhANNs test set through VIRALpro.
