An efficient algorithm calculating common solvent accessible volume

The solvent accessible surface area and the solvent accessible volume are measurements commonly used in implicit solvent models to include the effect of forces exerted by solvents on protein surfaces (or on the atoms at protein surfaces). Both measurements are limited in describing interactions between proteins (or proteins’ atoms) that are mediated or bridged by solvents: such a description must capture the chain of protein-solvent-protein interactions, whereas the solvent accessible surface area and the solvent accessible volume capture only protein-solvent interactions. If we represent the solvent as a continuous medium, we can consider that an atom of a protein effectively interacts with the solvent within a certain distance from its surface (that is, within its own solvent-interacting sphere). In this case, the protein-solvent-protein interactions can be measured by the amount of solvent interacting with two proteins’ atoms at the same time (that is, the volume shared by the two atoms’ solvent-interacting spheres, excluding the volumes occupied by the proteins’ atoms). We call this shared volume the common solvent accessible volume (CSAV); no method has yet been developed to determine the CSAV. In this work, we propose a new sweep-line-based method that efficiently calculates the CSAV. The performance and accuracy of the proposed sweep-line-based method are compared with those of the naïve voxel-based method. The proposed method takes log-linear time in the number of atoms involved in a CSAV calculation and linear time in the resolution. Our results, tested with 52 protein structures of various sizes, show that the proposed sweep-line-based method is superior to the voxel-based method in both computational efficiency and accuracy.
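To make the CSAV definition concrete, the following is a minimal sketch of a naïve voxel-based estimate of the kind the abstract uses as a baseline. It is not the authors' implementation: the function name `csav_voxel`, its parameters, and the choice to test voxel centres are assumptions made for illustration. The caller supplies the two solvent-interacting spheres and the list of atoms whose volumes are to be excluded.

```python
import math

def csav_voxel(c1, r1, c2, r2, atoms=(), h=0.1):
    """Estimate the volume inside both spheres (c1, r1) and (c2, r2)
    and outside every atom in `atoms` (an iterable of (centre, radius)),
    by counting voxels of edge length h whose centres qualify.
    All names and the sampling rule are illustrative assumptions."""
    # Axis-aligned box that bounds the intersection of the two spheres.
    lo = [max(a - r1, b - r2) for a, b in zip(c1, c2)]
    hi = [min(a + r1, b + r2) for a, b in zip(c1, c2)]
    if any(u <= l for l, u in zip(lo, hi)):
        return 0.0  # bounding boxes do not overlap, so neither do the spheres

    def inside(p, c, r):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c)) <= r * r

    n = [int(math.ceil((u - l) / h)) for l, u in zip(lo, hi)]
    count = 0
    for i in range(n[0]):
        x = lo[0] + (i + 0.5) * h
        for j in range(n[1]):
            y = lo[1] + (j + 0.5) * h
            for k in range(n[2]):
                p = (x, y, lo[2] + (k + 0.5) * h)
                if (inside(p, c1, r1) and inside(p, c2, r2)
                        and not any(inside(p, c, r) for c, r in atoms)):
                    count += 1
    return count * h ** 3
```

The cost of this sketch grows with the cube of the resolution (the number of voxels along each axis), which is the kind of scaling the sweep-line-based method is meant to avoid.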

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?
The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Partly
Reviewer #2: Partly

2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: No
Reviewer #2: Yes

3. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: Yes
Reviewer #2: No

4. Is the manuscript presented in an intelligible fashion and written in standard English?
PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Review Comments to the Author
Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #1: The algorithm is interesting. However, the motivation that is given is to apply it to proteins, but I don't see any attempt to compare the results of CSAV to existing methods. Are the values different numerically for the set of 21 proteins that was used to benchmark the speed etc.?
Response: Yes. Each CSAV value and its calculation time are different. The CSAV value is determined by its geometry: the two solvent spheres, the atoms involved, and their locations and sizes. The computation time depends on the geometry and, additionally, on the computing environment, such as the CPU speed and the other tasks running on the computer. In our results, the primary purpose of showing the running time is i) to show the agreement between our running time analysis and the actual running time of the proposed method and ii) to compare the computational efficiency of the sweep-line-based method with that of the voxel-based method.
We re-ran the entire dataset with the priority of each program set to high. As a result, the running-time coefficients decreased. We additionally show three digits of the coefficients in our updated plots. The coefficients are now updated from 0.07 to 0.063, 0.064 and 0.066 in Fig 7, Fig 8(a) and Fig 8(b), respectively. We did not re-run our program to collect data points for Fig 9, since the computation of the voxel-based method takes too much time.
Faster is not better for protein science. Statement is made that CSAV is more accurate. How can one argue that the algorithm is more accurate without providing the criteria for it?
Response: As we responded to the editor's recommendation, there are no practical existing methods that can be used to determine the CSAV and thus be compared with our proposed sweep-line-based method. Therefore, to support the statement that the proposed method is accurate, we compared the accuracy of the proposed sweep-line-based method with that of the naïve voxel-based method.
To provide a more solid argument, we added a new experiment (Fig. 10) that compares the true errors of the two methods. The true errors are evaluated against true CSAV values that are calculated using a closed-form solution for simple random systems; if the geometry of a system is simple enough, its CSAV can be calculated algebraically. Our new results show that the sweep-line-based method is clearly (around 100 times) more accurate than the voxel-based method. We added the description regarding Fig 10 on pages 17-18. Please refer to the table below in this reply to reviewer 2 for additional details.
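As an illustration of what a closed-form reference value can look like, the sketch below computes the intersection volume of two spheres with no excluded atoms, using the standard lens-volume formula. The response does not specify which simple systems were used for Fig. 10, so treating the two-sphere case as representative is an assumption, and the function name `sphere_intersection_volume` is hypothetical.

```python
import math

def sphere_intersection_volume(r1, r2, d):
    """Exact volume shared by two spheres of radii r1 and r2 whose
    centres are a distance d apart (standard lens-volume formula).
    Illustrative only: no atoms are excluded from the shared volume."""
    if d >= r1 + r2:
        return 0.0  # spheres are disjoint or touch at a single point
    if d <= abs(r1 - r2):
        # The smaller sphere lies entirely inside the larger one.
        return 4.0 / 3.0 * math.pi * min(r1, r2) ** 3
    return (math.pi * (r1 + r2 - d) ** 2
            * (d * d + 2.0 * d * (r1 + r2) - 3.0 * (r1 - r2) ** 2)
            / (12.0 * d))
```

A numerical method's true error for such a system is then the absolute difference between its estimate and this exact value.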
So if any statement about importance for biology is stripped from the manuscript, it is a computer science paper calculating the volumes of collections of spheres of arbitrary shapes. From that perspective it is probably OK, but I still think it must be critically compared to other methods. If, however, any mention of biology remains in the manuscript, a real comparison with the existing methods must be done. After that is done, compelling evidence (other than speed of calculations and scaling with the size) must be given for why CSAV is better at providing physical insights into protein structure analysis.
Response: As we responded to the editor's recommendation, there are no practical and existing methods that can be used to determine the CSAV. We explained this on pages 3 and 19-20.
Regarding the compelling evidence, we added a discussion explaining how the proposed method can provide physical insight into protein studies. The proposed sweep-line-based method can be used to predict the amount of the spring interaction between two atoms of proteins that is influenced by solvents near the two atoms (or within their solvent-accessible spheres). This prediction can be made by replacing equations (1)-(3) with a different function that integrates the values of points in the CSAV. We give the detailed discussion in the second paragraph (including equations (4) and (5)) of the Conclusion and Discussion section on page 19.
Reviewer #2: In this work, Kim and Na describe a new method to calculate the solvent accessibility of a protein-protein interface. Their method differentiates itself from the others in that it takes both protein parties into account during the calculation.
Response: We apologize that we did not clarify how the experiments were performed. Actually, the proposed method can determine the CSAV between atoms in different proteins. However, our work focuses on evaluating the accuracy and the efficiency of the proposed method, and thus we performed the evaluation by determining the CSAVs of atom pairs in one protein. We clarified this on page 13.
Although this idea is novel and worth exploring, the way the authors presented their research requires a serious reconsideration. The main critical points are listed below.
-The authors indicated that: "The source code of the proposed sweep-line-based method will be available from GitHub repository when this paper is published: https://github.com/htna/CSAV." I do not understand, why the Github repo is not presented together with this version of the paper. The code and some example cases should be presented with the paper at the submission stage.
Response: Thank you for understanding the value of our study.
Regarding uploading our source code to GitHub, we planned to share the source code after our paper is published, since we were concerned that others could publish the same work as ours after regenerating the results from our source code (as if it were their original work). We now provide our source code as Supporting Information so that you can inspect it. We plan to upload our source code to GitHub after the paper is accepted; the dataset (protein structures) and the list of atom pairs with their corresponding CSAV values calculated using the proposed sweep-line-based method are already on GitHub.
-The authors tested their methods on 21 crystal structures. Why are these particular structures selected? What are the functions of these structures, size of their monomers, shapes, secondary structure content, stoichiometry, etc. What is the biological relevance of using these cases? Are there already any water molecules around their interfaces?
Response: The 21 structures in our previous dataset were selected only by considering the variation in their sizes rather than their functional importance. Additionally, we tried to include alpha, beta, and alpha+beta proteins in the dataset. This is because the CSAV value between two atoms is determined and calculated only from their geometry (the sizes and locations of the solvent-accessible spheres and the atoms to be excluded from the CSAV). As we responded to the editor's recommendation, we replaced our previous dataset containing the 21 structures with a new dataset containing 52 proteins obtained from the list of proteins used by G. D. Georgiev et al. (published in Algorithms Mol. Biol. in 2020). This is described in the Dataset preparation section on page 12.