PhANNs, a fast and accurate tool and web server to classify phage structural proteins

doi:10.1371/journal.pcbi.1007845


Fig 1

Non-homologous database split. To ensure that no homologous sequences are shared between the test, validation, and training sets, the sequences from each class (major capsid proteins in this figure) were de-replicated at 40% identity.

In the de-replicated set, no two proteins share more than 40% identity, and each sequence represents a larger cluster of related proteins. The de-replicated set is then randomly partitioned into eleven equal-size subsets (1d_Mcp to 10d_Mcp plus Test_Mcp). Those subsets are expanded by replacing each sequence with all the sequences in the cluster it represents (subsets 1D_Mcp to 10D_Mcp plus TEST_Mcp). Analogous subsets are generated for the remaining ten classes, and corresponding subsets are combined to generate the subsets used for 10-fold cross-validation and testing (1D to 10D and TEST).
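The partition-then-expand procedure described in this caption can be illustrated with a minimal Python sketch. It assumes the 40% de-replication has already been run with an external clustering tool, and that its result is available as a mapping from each representative sequence ID to the IDs of all members of its cluster. The function names (`split_clusters`, `combine_classes`), the toy sequence IDs, and the toy class labels are hypothetical and are not taken from the PhANNs code base.

```python
import random


def split_clusters(clusters, n_subsets=11, seed=0):
    """Partition cluster representatives, then expand each to its full cluster.

    clusters: dict mapping representative ID -> list of member IDs
              (assumed to include the representative itself).
    Returns a list of n_subsets lists of sequence IDs.
    """
    reps = sorted(clusters)
    random.Random(seed).shuffle(reps)
    # Deal representatives round-robin so subset sizes differ by at most one.
    rep_subsets = [reps[i::n_subsets] for i in range(n_subsets)]
    # Replace every representative with all sequences in its cluster, so a
    # cluster never ends up split across two subsets.
    return [[m for rep in subset for m in clusters[rep]] for subset in rep_subsets]


def combine_classes(per_class_clusters, n_subsets=11, seed=0):
    """Merge corresponding per-class subsets into labelled folds plus a test set."""
    combined = [[] for _ in range(n_subsets)]
    for label in sorted(per_class_clusters):
        expanded = split_clusters(per_class_clusters[label], n_subsets, seed)
        for i, members in enumerate(expanded):
            combined[i].extend((seq_id, label) for seq_id in members)
    # The first ten subsets serve as cross-validation folds, the last as test.
    return combined[:-1], combined[-1]


# Toy input (hypothetical IDs and class labels, for illustration only).
toy = {
    "Mcp": {f"mcp_rep{i}": [f"mcp_rep{i}", f"mcp_rep{i}_a"] for i in range(22)},
    "Other": {f"other_rep{i}": [f"other_rep{i}"] for i in range(11)},
}
folds, test = combine_classes(toy)
```

Because whole clusters are assigned to a subset, sequences that share more than the clustering threshold with one another always land in the same fold, which is the property the split is designed to guarantee.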

doi: https://doi.org/10.1371/journal.pcbi.1007845.g001