Membrane protein contact and structure prediction using co-evolution in conjunction with machine learning

doi:10.1371/journal.pone.0177866

Table 1.

The 25 membrane proteins used as the benchmark set.

Fig 1.

Contact prediction flowchart and diagram of resulting descriptor categories used for machine learning.

A.) Flowchart overview for producing contacts, divided into the DI-only method (solid black outline, left) and the all-inclusive machine learning workflow (dotted black outline, right). Both result in sets of the top L-fractions of predicted contacts, which were then used in combination with BCL::Fold to predict the structures of the 25 membrane proteins in the benchmark set. B.) Descriptor vectors include three categories: global, sequence information, and correlation descriptors. Global descriptors include the sequence positions of sites i and j, the separation from i to j, and the number of amino acids in the sequence. Sequence information descriptors include windows of biochemical properties surrounding sites i and j, such as volume, hydrophobicity, steric parameter, polarizability, isoelectric point, and BLAST profiles. The probability of each SSE state (helix/strand/coil, by JUFO9D) as well as of each membrane/transition/solution state is also included. Finally, correlation information includes the symmetric matrix around sites i and j, all unique pairwise combinations from i ± half_window_size with j ± half_window_size, the mean, max, and normalized mean of this window, and the overall mean sequence correlation.
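The global and correlation descriptors in panel B can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `pair_descriptors`, the zero padding at the sequence ends, and the choice to compute the normalized mean as the window mean divided by the overall mean are all assumptions made for illustration.

```python
import numpy as np

def pair_descriptors(corr, i, j, half_window=2):
    """Build global and correlation descriptors for residue pair (i, j).

    corr is an L x L symmetric matrix of pairwise correlation scores
    (for example direct information). Indices are 0-based.
    """
    L = corr.shape[0]
    # Global descriptors: positions, separation, and sequence length.
    global_part = [i, j, abs(j - i), L]

    # Scores for all pairs (i + a, j + b) with |a|, |b| <= half_window.
    # Positions past the sequence ends are padded with 0 (an assumption).
    window = []
    for a in range(-half_window, half_window + 1):
        for b in range(-half_window, half_window + 1):
            p, q = i + a, j + b
            inside = 0 <= p < L and 0 <= q < L
            window.append(corr[p, q] if inside else 0.0)
    window = np.array(window)

    # Window statistics plus the overall mean over unique pairs.
    overall_mean = corr[np.triu_indices(L, k=1)].mean()
    w_mean = window.mean()
    # Normalized mean taken as window mean / overall mean (an assumption).
    norm_mean = w_mean / overall_mean if overall_mean != 0 else 0.0
    stats = [w_mean, window.max(), norm_mean, overall_mean]
    return np.concatenate([global_part, window, stats])
```

With `half_window=2` the vector holds 4 global values, 25 window values, and 4 summary values. The sequence information descriptors (biochemical property windows, BLAST profiles, JUFO9D probabilities) would be appended in the same way but are omitted here.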

Fig 2.

Visualization of the predicted transmembrane position descriptor.

We used SPOCTOPUS to predict topology and then assigned each amino acid in every protein a value of 0 for inner membrane (blue), 1 for outer membrane (red), or a value between 0 and 1 based on the distance along the predicted transmembrane helix normalized by the size of the containing helix. Above are 1HZXA, 2RH1A, 1OCCC, and 3QE7A (top left to right, bottom left to right). The vast majority (23 of the 25 benchmark proteins) are similar to the top two examples, with well-defined and aligned gradients across the structure from inner to outer membrane portions (blue to red). The only two proteins with significant errors are on the bottom (1OCCC and 3QE7A). In 1OCCC, the foremost helices do not align to the expected gradient because they were inaccurately predicted as a single unbroken helix. A similar error exists in 3QE7A, where a small portion is incorrect due to a missing helix break prediction.
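The assignment described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function `tm_position_descriptor`, its input format (a list of inclusive, 0-based helix segments taken from a topology prediction), and the exact interpolation formula inside a helix are assumptions.

```python
def tm_position_descriptor(length, helices, n_term_inside=True):
    """Assign each residue a transmembrane position value in [0, 1].

    helices: sorted, non-overlapping (start, end) pairs, inclusive and
    0-based (hypothetical input format for a predicted topology).
    Residues on the inner side get 0, on the outer side 1, and residues
    in a helix get an intermediate value that depends on how far along
    the helix they lie, normalized by the helix length.
    """
    values = [0.0] * length
    side = 0.0 if n_term_inside else 1.0  # side before the first helix
    pos = 0
    for start, end in helices:
        for k in range(pos, start):
            values[k] = side
        n = end - start + 1
        for k in range(start, end + 1):
            frac = (k - start + 1) / (n + 1)  # strictly between 0 and 1
            # Run from 0 to 1 when leaving the inner side, 1 to 0 otherwise.
            values[k] = frac if side == 0.0 else 1.0 - frac
        side = 1.0 - side  # each helix crosses the membrane once
        pos = end + 1
    for k in range(pos, length):
        values[k] = side
    return values
```

Under this scheme a helix break that the predictor misses, as described for 1OCCC and 3QE7A, produces one long gradient where two opposite gradients were expected.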

Fig 3.

Best DT and ANN contact prediction ROC curve and logarithmic precision vs. fraction positive predicted (FPP) compared to naïve direct information with minimum separations of 1 and 12.

ROC curves of the merged predictions averaged from five different training iterations for each protein. Also included are results from contact prediction based solely on naïve direct information (using the optimal filtered MSA) for comparison. The training, monitoring, and independent sets included data from 15, 5, and 5 proteins, respectively. The independent predictions are the ones presented above. AUC is approximately 0.700, 0.938, and 0.928 at a minimum separation of 1 for the filtered direct information, DT, and ANN methods, respectively. For a minimum separation of 12, the AUC is approximately 0.611, 0.862, and 0.855 for the filtered DI, DT, and ANN methods, respectively. Both machine learning methods significantly outperform naïve DI, with a slight edge for DTs. The bottom panel shows precision as the fraction positive predicted (FPP) increases. The black line depicts ideal performance. Each curve includes the aggregated predicted contacts from five training iterations using DTs or ANNs. Models were trained using all contacts with a minimum separation of 1 and were tested on pairs with a minimum separation of 12. Each iteration uses 15, 5, and 5 proteins for the training, monitoring, and independent sets, respectively. The integral of the precision from 0.01% to 0.55% is approximately 0.656, 1.921, and 0.865 at a minimum separation of 1 for direct information, DT, and ANN based methods, respectively. At a minimum separation of 12, the integral is approximately 0.537, 0.469, and 0.549 for direct information, DT, and ANN based methods, respectively. Higher precision, both initially and as FPP increases, is better.
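The two quantities plotted here, ROC AUC and precision as a function of FPP, can be computed from a ranked list of scored residue pairs. The sketch below is a generic illustration with hypothetical function names, not the evaluation code used in the paper. It assumes binary labels (1 for a true contact) and does not treat tied scores specially.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve by trapezoidal integration of TPR over FPR."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y = np.asarray(labels)[order]
    # Cumulative true and false positive rates as the threshold is lowered.
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / (1 - y).sum()])
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

def precision_at_fpp(scores, labels, fractions):
    """Precision among the top-scoring fraction of pairs, per fraction."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y = np.asarray(labels)[order]
    result = []
    for f in fractions:
        k = max(1, int(round(f * len(y))))  # number of pairs predicted positive
        result.append(float(y[:k].mean()))
    return result
```

Restricting the input to pairs with a sequence separation of at least 12 before calling either function would correspond to the stricter of the two settings reported in the caption.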

Fig 4.

Accuracy comparison across best DT, ANN, naïve direct information, and processed direct information contact prediction for a minimum separation of 12.

The graph above depicts the average accuracies of each method across the entire benchmark set for each of the top L fractions examined. Accuracy is significantly higher for DI contact predictions from the filtered MSA than from the unfiltered MSA and is further improved by processing (filtering based on predicted transmembrane topology). The processed method is second best for the top L/10 predictions (44.42%), only slightly lower than the best DTs (44.99%). DTs are also best for the top L/5 (39.40%). For L/2, 1L, 2L, and 3L, ANNs optimized using an analysis of weights between nodes produce the best results (30.14%, 23.97%, 18.41%, and 15.28%, respectively). Thus, for all L-fractions, one or, more commonly, both machine learning methods have a higher average accuracy.
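A top L-fraction accuracy of the kind reported here can be sketched in a few lines. The function name and data layout are hypothetical, and the sketch assumes that accuracy means the share of selected pairs that are true contacts, which the caption does not define explicitly.

```python
def top_l_accuracy(pred, true_contacts, L, fraction=0.1, min_sep=12):
    """Accuracy of the top L * fraction predicted contacts.

    pred: dict mapping (i, j) residue pairs to a score (higher is better).
    true_contacts: set of (i, j) pairs that are contacts in the structure.
    fraction: 0.1 for L/10, 0.2 for L/5, 0.5 for L/2, 1, 2, or 3 for 1L-3L.
    """
    # Keep only pairs that satisfy the minimum sequence separation.
    candidates = [(s, p) for p, s in pred.items() if abs(p[1] - p[0]) >= min_sep]
    candidates.sort(key=lambda t: -t[0])
    n = max(1, int(round(L * fraction)))
    top = [p for _, p in candidates[:n]]
    if not top:
        return 0.0
    hits = sum(1 for p in top if p in true_contacts)
    return hits / len(top)
```

Averaging this value over all benchmark proteins for each fraction would give one point per method and fraction, as in the graph.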

Fig 5.

Comparison of protein model distribution across methods for 2RH1A, 1OCCA, and 1HZXA.

The RMSD100 distributions of 1000 predicted models across naïve DI, processed DI, the best DTs, and the best ANNs above are book-ended by the distributions of the positive and negative controls. Contact restraints consistently shift the distributions towards lower RMSD100 models. There is little difference between methods for 1OCCA and 1HZXA. However, both machine learning methods shift more substantially towards lower RMSD100 values in the case of 2RH1A. One should also note that the distributions for all experimental methods approach that of the positive control for 1OCCA.
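For readers unfamiliar with RMSD100, the sketch below assumes the commonly used length normalization in which an RMSD measured over N residues is rescaled to the value expected for a 100-residue protein, RMSD100 = RMSD / (1 + ln sqrt(N / 100)). The caption itself does not state the formula, so this is given as background rather than as the paper's exact procedure.

```python
import math

def rmsd100(rmsd, n_residues):
    """Normalize an RMSD (in angstroms) to a 100-residue reference length.

    Uses RMSD / (1 + ln(sqrt(N / 100))); the denominator is only
    positive for N greater than roughly 14 residues.
    """
    denom = 1.0 + math.log(math.sqrt(n_residues / 100.0))
    if denom <= 0:
        raise ValueError("n_residues too small for RMSD100 normalization")
    return rmsd / denom
```

For a protein of exactly 100 residues the value is unchanged. For longer proteins the normalized value is smaller than the raw RMSD, which allows models of proteins with different lengths to be compared on one scale.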

Table 2.

Full benchmark folding results for controls and predicted restraints from the best methods—DI filtered, negative control, and positive control (Top 10 average RMSD100 Å).
