Structural coordinates: A novel approach to predict protein backbone conformation

Motivation Local protein structure is usually described via classifying each peptide to a unique class from a set of pre-defined structures. These classifications may differ in the number of structural classes, the length of peptides, or class attribution criteria. Most methods that predict the local structure of a protein from its sequence first rely on some classification and only then proceed to the 3D conformation assessment. However, most classification methods rely on homologous proteins’ existence, unavoidably lose information by attributing a peptide to a single class or suffer from a suboptimal choice of the representative classes. Results To alleviate the above challenges, we propose a method that constructs a peptide’s structural representation from the sequence, reflecting its similarity to several basic representative structures. For 5-mer peptides and 16 representative structures, we achieved the Q16 classification accuracy of 67.9%, which is higher than what is currently reported in the literature. Our prediction method does not utilize information about protein homologues but relies only on the amino acids’ physicochemical properties and the resolved structures’ statistics. We also show that the 3D coordinates of a peptide can be uniquely recovered from its structural coordinates, and show the required conditions under various geometric constraints.


Is RMSD based on Calpha, or more atoms?
Thank you for noticing that. We agree that this needs to be clarified. The RMSD is based on the N, Calpha and C atoms of the backbone. We have included this clarification in the text and in the notation description of formula (2) (lines 246, 308, 358 of the Revised Manuscript with Track Changes).

RMSD and RMSDA have units. They are often forgotten.
Thank you for pointing this out; the units have been added.
6. I am not sure that 5 numbers after the digits are essential... especially when you know the precision of protein structure resolution.
Thank you for pointing this out. There is indeed no need for 5 digits; we have shortened the values to one digit in Table 2 and two digits in Table 1.

As results are provided before the Methods, reader had to interpret what the authors wanted to do. What this reviewer had understood: Authors want to use PB (based on dihedral angles) as structural seeds, but used RMSD to do the real assignment. Ok, to be honest, this reviewer is not entirely sure it is that. So, it must be precise in Results section (and also in Methods section, as it is far from being clear).
Thank you, this is a very valid point, and we need to make it clear. Exactly as the reviewer wrote, we used the PBs derived by de Brevern et al. as seeds (or cluster centres), but instead of using RMSDA as a similarity measure between a fragment and a seed, we used RMSD. To be able to do so, we reconstructed the 3D coordinates of each PB from the dihedral angles with which it was defined, using the standard values for the bond angles and bond lengths and setting the omega value to 180 degrees. To make this clearer, we now emphasise it at the very beginning of the Results subsection (lines 84-90 of the Revised Manuscript with Track Changes). We have also renamed the corresponding subsection of the Results to "Relation of RMSD- and RMSDA-based assignments" (line 75).
We agree, thank you. There is now a subsection in the Results dedicated to comparing the two assignments (line 75 of the Revised Manuscript with Track Changes), as well as a new figure (Fig. 1). Indeed, we observe more changes in assignments than expected. We believe that the discrepancy between the two distances increases for irregular structures, such as clusters PB g or PB h. The reason may be that the dihedral angles do not contribute equally to the RMSD (a deviation in the central angles produces fragments that are more distant in terms of RMSD), whereas the RMSDA makes no distinction between them.
Later the reviewer raised a very valid point about overlapping fragments (Structural Words could be an example), which may compensate for the above. We believe that this is indeed what happens in the case of regular structures, but less so for the irregular ones. We discuss this in detail in a new subsection of the Results, "Benchmarking using Structural Words" (line 290 of the Revised Manuscript with Track Changes).
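As an illustration of the reconstruction of PB coordinates from dihedral angles described in this response, the following is a minimal sketch and not the code used for the manuscript. The bond lengths and bond angles are assumed textbook values for an ideal backbone, the helper names `place_atom` and `build_backbone` are hypothetical, and the helix-like angles in the usage line are illustrative only and are not the definition of any PB.

```python
import numpy as np

# Assumed ideal backbone geometry (typical textbook values, NOT taken from the manuscript).
BOND = {"N-CA": 1.458, "CA-C": 1.525, "C-N": 1.329}          # angstroms
ANGLE = {"N-CA-C": 111.0, "CA-C-N": 116.2, "C-N-CA": 121.7}  # degrees
OMEGA = 180.0                                                # degrees, trans peptide bond

def place_atom(a, b, c, length, angle_deg, torsion_deg):
    """Return d with |cd| = length, angle(b, c, d) = angle_deg, dihedral(a, b, c, d) = torsion_deg."""
    theta, phi = np.radians(angle_deg), np.radians(torsion_deg)
    bc = (c - b) / np.linalg.norm(c - b)
    n = np.cross(b - a, bc)          # normal of the plane through a, b, c
    n /= np.linalg.norm(n)
    m = np.cross(n, bc)              # completes the local orthonormal frame (bc, m, n)
    local = np.array([-np.cos(theta),
                      np.sin(theta) * np.cos(phi),
                      np.sin(theta) * np.sin(phi)])
    return c + length * (local[0] * bc + local[1] * m + local[2] * n)

def build_backbone(psi_phi_pairs):
    """Build N, CA, C coordinates for len(psi_phi_pairs) + 1 residues.

    psi_phi_pairs: [(psi_1, phi_2), (psi_2, phi_3), ...] in degrees,
    i.e. 4 pairs (8 dihedral angles) for a 5-residue fragment.
    """
    # The first residue is placed in the xy-plane with the assumed N-CA-C angle.
    t = np.radians(ANGLE["N-CA-C"])
    n0 = np.zeros(3)
    ca0 = np.array([BOND["N-CA"], 0.0, 0.0])
    c0 = ca0 + BOND["CA-C"] * np.array([-np.cos(t), np.sin(t), 0.0])
    atoms = [n0, ca0, c0]
    for psi, phi in psi_phi_pairs:
        # Next N (torsion psi), next CA (torsion omega), next C (torsion phi).
        atoms.append(place_atom(atoms[-3], atoms[-2], atoms[-1], BOND["C-N"], ANGLE["CA-C-N"], psi))
        atoms.append(place_atom(atoms[-3], atoms[-2], atoms[-1], BOND["N-CA"], ANGLE["C-N-CA"], OMEGA))
        atoms.append(place_atom(atoms[-3], atoms[-2], atoms[-1], BOND["CA-C"], ANGLE["N-CA-C"], phi))
    return np.array(atoms)

# Illustrative helix-like angles (psi, phi); not the values of any specific PB.
coords = build_backbone([(-47.0, -57.0)] * 4)
```

The resulting array holds 15 atoms (N, Calpha, C for each of the 5 residues), which is the atom set on which the RMSD is computed.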

Is it close to the PB similarity distance used in PBALONG or iPBA ?
If we understood correctly, the reviewer is referring to GDT_PB (Global Distance Test Total analog for PB alignments) from Gelly et al. (https://doi.org/10.1093/nar/gkr333), which is used to compare alignments derived by different tools. We think, however, that it is different from what we do here: we only use the inverse of the distance, i.e. 1/RMSD, as the similarity measure between fragments. Maybe we misunderstood the question.
Fig. 2 is just an example showing that RMSD and RMSDA may have a different meaning. We chose the reference (red) fragment to be an alpha-helix because it is familiar to everyone and is easy to visualise. We then isolated all the fragments with RMSDA=14.1 degrees from the reference fragment, and chose three examples (A-C) to show that the values of RMSD may still vary. Example (D) is designed to show a situation where RMSDA=0 degrees, but the RMSD is rather high. The figure has a mainly illustrative purpose.

If the distance criteria to say it is improve is the RMSD and the assignment is redone using RMSD. It is logical that it is better? But how is it in terms of RMSDA? How does it impact mean and sd values?
Thank you for the question. It would have been interesting to explore this aspect. However, we mainly plan to use the model presented in the article for 3D coordinate reconstruction. For this task, it is better if it produces more defined (compact) classes of structures in terms of RMSD, therefore we did not explore the qualities of our (RMSD-based) assignment in terms of RMSDA further.
However, we have added a benchmarking case to cover a possibly related question. Using the example of Structural Words (mentioned later in the reviewer's comments), the RMSD-based assignment shows an improvement compared to the RMSDA-based assignment: the Structural Words generated with RMSD are almost always more compact, and the number of words is significantly larger (Table 4 and lines 307-315 of the Revised Manuscript with Track Changes).
12. RMSDA is very quick in terms of CPU times, RMSD is more expensive, isn't it?
Thank you for the question. We ourselves were surprised to discover that the computing times for RMSD and RMSDA are comparable. The reason is that, in the case of RMSDA, finding a dihedral angle requires solving for two planes and then computing the angle between the corresponding normal vectors. For a fragment of 5 residues there are 8 dihedral angles, which means that this procedure is repeated 8 times. For the RMSD, on the other hand, the alignment procedure requires the computation of eigenvectors and eigenvalues of a 3x3 matrix, followed by a change of coordinates, but it is done once for the complete fragment. This is a commonly used algorithm, so very efficient implementations of it exist. Hence, the use of RMSD did not increase the computation time.
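To make the comparison of the two computations concrete, here is a minimal sketch of both, not the implementation used for the manuscript. The function names are hypothetical, the RMSDA is assumed to be the root mean square of the wrapped angular differences, and the superposition uses the SVD form of the Kabsch algorithm, which is a closely related variant of the 3x3 eigen-decomposition mentioned above.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b1, b2, b3 = p1 - p0, p2 - p1, p3 - p2
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)   # normals of the two planes
    x = np.dot(n1, n2)
    y = np.dot(np.cross(n1, n2), b2 / np.linalg.norm(b2))
    return float(np.degrees(np.arctan2(y, x)))

def rmsda(angles_a, angles_b):
    """Root mean square deviation of angles (degrees), with differences wrapped to [-180, 180)."""
    diff = (np.asarray(angles_a, float) - np.asarray(angles_b, float) + 180.0) % 360.0 - 180.0
    return float(np.sqrt(np.mean(diff ** 2)))

def kabsch_rmsd(P, Q):
    """RMSD between two (n_atoms, 3) arrays after optimal superposition of P onto Q."""
    P = P - P.mean(axis=0)                        # remove translation
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)             # one decomposition of a 3x3 matrix per fragment
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against an improper rotation (reflection)
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = (R @ P.T).T - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))
```

For a 5-residue fragment, `dihedral` would be called 8 times before `rmsda`, whereas `kabsch_rmsd` is called once on the 15 backbone atoms (N, Calpha, C).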
13. Similarly, for PBALIGN, IPBA and myPBA, the PBs were only used as an intermediate before RMSD optimisation. What could be the interest to do RMSDA, then RMSD locally instead of RMSD globally as in the cited researches. The text must be rephrased to focus on the precise questions of the authors.
Thank you for pointing this out. We agree that the previous text might have been misleading. We have added two paragraphs explaining why we chose RMSD over RMSDA (lines 184-197 of the Revised Manuscript with Track Changes), and that we needed a particular property of the chosen distance (lines 219-222 of the Revised Manuscript with Track Changes): namely, that a small distance value guarantees well-aligned structures.
We also thought you might be interested in the underlying idea, and so wrote a longer answer. Our main idea was to find a way to retain structural information about a protein fragment after the transition to the structural alphabet. Therefore, we used the distances to the PBs as new "structural" coordinates. This approach is not yet fully developed, but as a proof of concept we show that a protein conformation (in terms of dihedral angles) can indeed be reconstructed from its 16 structural coordinates. In this context, one could potentially substitute RMSD with RMSDA and still reconstruct the conformation (this is a guess). Further on, however, we predict the RMSDs to the Protein Blocks from the sequence (rather than only assigning the fragment to a certain class). For this, we use the physicochemical properties of individual amino acids, and it becomes crucial that small values of the distance correspond only to closely aligned structures. Fig. 2 shows that this is not always the case for RMSDA, and without this property the training of our prediction model cannot really work. We have not included this complete explanation in the text of the article (it is probably too vague to be in a paper), but it is a very interesting question. We would take your advice and add it, if it is reasonable.
Thank you for this suggestion. This is a great benchmarking example that we had overlooked. In the revised version, we have used the noted paper to explore the effect of overlapping fragments and have added a subsection to the manuscript (line 290 of the Revised Manuscript with Track Changes).
Thank you for pointing this out; we agree that the original text was not detailed enough. The difference between a standard MD and our approach is the loss function we aim to optimise. Usually, some sort of energy function is optimised to recover the protein conformation.
We, however, do the following: for a proposed protein conformation, one can calculate the RMSDs to the 16 Protein Blocks, which are functions of the dihedral angles. The loss function we aim to optimise is the mean squared error between the two sets of "coordinates" (those of the proposed conformation and the target ones) in the space of our structural alphabet. Table 2 shows that we can reconstruct the true protein conformation from the "structural coordinates" with high precision. We have updated the subsection "Reconstruction of the backbone conformation" to make the difference between the MD and our approach clearer (lines 198-228, but in particular lines 211-223 of the Revised Manuscript with Track Changes).

The protein dataset is not excellent. If the reviewer understands, it came from Vriend et al. [11], but ref [11] is not the right one. If it had been done with BLASTCLUST, it keeps proteins with identity higher than the chosen threshold. Please proceed to a new evaluation with PISCES datasets that are of really better quality.
Thank you for spotting the mistake in the reference; we have corrected it. We also agree that the PISCES dataset suggested by the reviewer is much better, and we used it to assess the performance of our method. Since the new data set contains no transmembrane proteins (PISCES only contains X-ray structures), which differ significantly from non-membrane proteins in their physicochemical properties, our multi-class prediction accuracy increased to 72%.
For benchmarking against PB-kPRED, we still used the old data set (PDB30) since it is used by the authors.

I apologize; i have not in the result section understood the evaluation of reconstruction. It is too quick and not detailed.
Thank you for pointing this out. We agree with the comment and changed the text of the subsection to make it clearer (lines 198-228).
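The reconstruction objective described earlier in this response (a mean squared error between structural coordinates) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the three reference conformations are made-up toy values rather than Protein Blocks, and `angular_rms` is a simple stand-in distance used to keep the example short, whereas the manuscript uses the RMSD to each of the 16 PBs.

```python
import math

def angular_rms(a, b):
    """Stand-in distance between two tuples of dihedral angles (degrees).

    ASSUMPTION: used only to keep the sketch self-contained; it is not the
    backbone RMSD used in the manuscript.
    """
    diffs = [((x - y + 180.0) % 360.0) - 180.0 for x, y in zip(a, b)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def structural_coordinates(conformation, references, distance=angular_rms):
    """One coordinate per reference structure: the distance to that reference."""
    return [distance(conformation, ref) for ref in references]

def coordinate_loss(candidate, target_coords, references, distance=angular_rms):
    """Mean squared error between the candidate's structural coordinates and the target ones."""
    coords = structural_coordinates(candidate, references, distance)
    return sum((c - t) ** 2 for c, t in zip(coords, target_coords)) / len(target_coords)

# Toy reference conformations (hypothetical values, two angles each).
references = [(-60.0, -45.0), (-120.0, 130.0), (60.0, 45.0)]
true_conformation = (-70.0, -40.0)
target = structural_coordinates(true_conformation, references)

loss_at_truth = coordinate_loss(true_conformation, target, references)
loss_elsewhere = coordinate_loss((-120.0, 130.0), target, references)
```

An optimiser would then search over the dihedral angles of the candidate to drive this loss towards zero, which is where it differs from optimising an energy function.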

Similarly for the Q16 section, the prediction approaches presented in Methods section are applied. Provided prediction rates for LOCUSTRA and PB-kPred are the ones from the papers or new evaluations on the specific dataset? not clear and very important.
Thank you, we agree that this is a very important point. We have changed the text of the subsection and clearly stated that the dataset for LOCUSTRA is different from PDB30 (lines 270-273, Revised Manuscript with Track Changes).
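For reference, the Q16 accuracy discussed in this response can be written down in a few lines. The sketch below assumes that each fragment is assigned to the reference structure with the smallest distance and that Q16 is the fraction of fragments for which the predicted and the true assignments coincide. The function names are hypothetical, and the toy example uses 3 classes instead of 16 for brevity.

```python
def assign_closest(distances):
    """Index of the closest reference structure, i.e. the smallest distance."""
    return min(range(len(distances)), key=distances.__getitem__)

def q_accuracy(true_distances, predicted_distances):
    """Fraction of fragments whose predicted closest class matches the true one."""
    hits = sum(
        assign_closest(t) == assign_closest(p)
        for t, p in zip(true_distances, predicted_distances)
    )
    return hits / len(true_distances)

# Toy data: two fragments, three reference structures (made-up distances).
true_d = [[0.5, 1.2, 2.0], [1.5, 0.3, 0.9]]
pred_d = [[0.6, 1.0, 1.8], [0.4, 0.8, 1.1]]
accuracy = q_accuracy(true_d, pred_d)
```

The true and the predicted assignments must be derived with the same distance measure for this number to be meaningful.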
20. Finally, the major question. Is it possible to say that it is better or worse? On one side, there is an assignment based on RMSDA, it leads to the assignment of PBs a,b...,o,p; on the other side, there is an assignment based on RMSD with the same number of letters, but a different assignment. How is it possible to compare with Q16(PB-RMSDA) and Q16 (PB-like-RMSD)?
Thank you for raising this point. We agree that the meaning of the RMSDA assignment is different from that of the RMSD assignment, and it would be incorrect to compare the assignments per se. We have added the following lines to the discussion: "Nevertheless, the comparison between alignment methods and their performance is not straightforward, and requires to use a range of different criteria. Not only it cannot be captured by the Q16 classification accuracy alone, it is also important to point out that two assignment performed with different distance measures are not exactly the same thing. They may describe very different classes of structures even if the same set of reference PBs or even the same dataset is used." (lines 449-453). However, the Q16 problem is well defined in each case: we know the true assignment of the structure by the chosen metric, i.e. which PB is the closest. We then aim to predict the closest PB (in the same metric) from the sequence of a protein fragment. The assignments themselves are not comparable, but the frequency with which we make a correct prediction is. In the end, either assignment is an approximation of the original structure, and the aim is to make it correct (and meaningful) as often as possible. One could argue that it is possible to construct another set of 16 classes that are uninformative but yield a higher prediction accuracy. We believe this is not the case here, but, as the reviewer rightly noted, it is important to point out the difference between the two assignments.

Some important references of Levitt (w/Kolodony) and others are missing (Unger, Fetrow,...). They can provide another view of the question.
Thank you! This was a helpful suggestion, and it made us aware of certain complications in comparing structural alignment methods, which we had overlooked before (lines 449-453). We have added the Kolodny reference (27).

Is there an open tool for assignment of these PB-RMSD?
Thank you for asking. We agree it needs to be made available. We have added the tool to the homepage of the server: https://pbpred.eimb.ru/S/index.html ("Program PB-RMSD", second line).
In addition to the above comments, we corrected the spelling mistakes we noticed during the editing process.
We look forward to hearing from you in due time regarding our submission. We will be glad to respond to any further questions and comments you may have.
Sincerely, Vladislava Milchevskaya on behalf of all the authors