rawMSA: End-to-end Deep Learning using raw Multiple Sequence Alignments

doi:10.1371/journal.pone.0220182

Fig 1.

Network architecture for two rawMSA networks.

On the left, the SS-RSA network predicts the secondary structure and relative solvent accessibility of each amino acid; on the right, the CMAP network predicts the full contact map of the protein. The first layers are shared by the SS-RSA and CMAP architectures, with slightly different settings, and form the basis of the rawMSA approach.

Fig 2.

2D PCA of the space of the embedded vectors representing the single residues.

In this example, we show the embedding outputs of a simpler network where the original space has a dimensionality of four. The residues that are closest (smallest cosine distance between the 4D vectors) to (a) lysine and (b) tryptophan are colored (the closer the residue, the darker the hue).
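The notion of "closest residue" used in this caption can be illustrated with a minimal sketch that ranks residues by cosine distance to a query residue. The 4D vectors below are invented for illustration and are not the embeddings learned by the network; the function names `cosine_distance` and `rank_neighbours` are likewise hypothetical.

```python
import math

# Hypothetical 4D embedding vectors, made up for illustration only;
# real values would come from the trained embedding layer.
EMBEDDINGS = {
    "K": [0.9, 0.1, -0.3, 0.2],
    "R": [0.8, 0.2, -0.2, 0.1],
    "W": [-0.5, 0.7, 0.6, -0.1],
    "F": [-0.4, 0.8, 0.5, 0.0],
    "D": [0.1, -0.9, 0.2, 0.4],
}


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_neighbours(query, embeddings):
    """Sort all other residues by increasing cosine distance to `query`."""
    q = embeddings[query]
    others = [
        (residue, cosine_distance(q, vector))
        for residue, vector in embeddings.items()
        if residue != query
    ]
    return sorted(others, key=lambda pair: pair[1])


# Residues nearest to lysine (K) and tryptophan (W) in the toy space.
nearest_to_lysine = rank_neighbours("K", EMBEDDINGS)
nearest_to_tryptophan = rank_neighbours("W", EMBEDDINGS)
```

In a plot like the one described, the distances returned by `rank_neighbours` would set the hue of each point, with smaller distances mapped to darker colors.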

Fig 3.

Per-target secondary structure (a) and four-class solvent accessibility (b) accuracy for predictions using one-hot encoding, several rawMSA networks, and a classical PSSM network, all trained and tested on the same dataset.

One hot100 skips the rawMSA embedding step and encodes the alignments using one-hot encoding, limited to the top 100 alignments for memory reasons. Four different rawMSA networks are tested at variable MSA depths, using the top 100, 200, 500 or 1000 alignments from the MSAs as input to the SS-RSA network. The average accuracies are shown as red squares.
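The difference between the two input encodings compared here can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' preprocessing code: the alphabet, the integer mapping, the toy alignment, and the helper names `top_n`, `integer_encode` and `one_hot_encode` are all hypothetical.

```python
# Assumed alphabet: the 20 standard amino acids plus a gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
# Index 0 is reserved for any symbol outside the assumed alphabet.
INDEX = {symbol: i + 1 for i, symbol in enumerate(ALPHABET)}


def top_n(msa, depth):
    """Keep only the first `depth` sequences of the alignment."""
    return msa[:depth]


def integer_encode(msa):
    """Map every residue to one integer (input for an embedding layer)."""
    return [[INDEX.get(symbol, 0) for symbol in sequence] for sequence in msa]


def one_hot_encode(msa):
    """Map every residue to a 0/1 vector with a single 1 at its index."""
    size = len(ALPHABET) + 1
    return [
        [[1 if k == idx else 0 for k in range(size)] for idx in row]
        for row in integer_encode(msa)
    ]


# Toy alignment of four sequences; a real MSA would be far deeper.
msa = ["MKV-", "MRV-", "MKIA", "-KVA"]
shallow = top_n(msa, 2)
ints = integer_encode(shallow)
onehot = one_hot_encode(shallow)
```

With this assumed alphabet, the one-hot input stores 22 values per residue where the integer input stores one, which is consistent with the caption's note that the one-hot variant was restricted to a shallower alignment for memory reasons.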

Table 1.

Results for the SS-RSA networks trained to predict secondary structure and solvent accessibility.

Table 2.

Comparison of rawMSA against the top 5 contact prediction methods in CASP12.

Fig 4.

Impact of the number of sequences in the MSA on the performance of rawMSA and METAPSICOV_baseline on the CASP13 FM targets.
