Peer Review History
| Original Submission, January 10, 2022 |
Dear Mr Elmes,

Thank you very much for submitting your manuscript "A fast lasso-based method for inferring higher-order interactions" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly revised version that takes into account the reviewers' comments. The reviewers found the proposed method to be sound and an improvement upon existing methods, but raised some concerns around the limitations, accuracy, and precision of the results that need to be adequately addressed or justified.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.
Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Megan L. Matthews, Ph.D.
Guest Editor
PLOS Computational Biology

Ilya Ioshikhes
Deputy Editor
PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors introduce a novel method based on lasso regression to identify features that have an effect on fitness. One novelty is making the identification of third-order effects feasible. They achieve this by improving a previous model used for second-order interactions. Speed is improved through parallelization, and memory use is reduced by data compression. The introduction of three-way effects is cleverly combined with the a priori removal of uninformative pairwise effects. The regression problem is solved with a square-root lasso. The model seems mathematically sound, and while the authors do not introduce new methodology, the implementation and combination of several approaches yields a novel software package useful for many relevant research topics, such as drug resistance. However, I have some remarks that should be addressed.
I was able to install the package and run:

    > output <- interaction_lasso(X, Y, n = dim(X)[1], p = dim(X)[2], lambda_min = -1,
    +     frac_overlap_allowed = -1, halt_error_diff = 1.01, max_interaction_distance = -1,
    +     use_adaptive_calibration = FALSE, max_nz_beta = -1, max_lambdas = 200,
    +     verbose = FALSE, log_filename = "regression.log", depth = 2, log_level = "none",
    +     estimate_unbiased = FALSE, use_intercept = TRUE)
    Error in interaction_lasso(X, Y, n = dim(X)[1], p = dim(X)[2], lambda_min = -1, :
      unused arguments (frac_overlap_allowed = -1, use_adaptive_calibration = FALSE)
    > output <- interaction_lasso(X, Y, n = dim(X)[1], p = dim(X)[2], lambda_min = -1,
    +     halt_error_diff = 1.01, max_interaction_distance = -1, max_nz_beta = -1,
    +     max_lambdas = 200, verbose = FALSE, log_filename = "regression.log", depth = 2,
    +     log_level = "none", estimate_unbiased = FALSE, use_intercept = TRUE)
    total entries: 49944

This should obviously be fixed in the readme of the package. github.com/bioDS/lasso_data_processing leads to a 404.

p.10: I looked at reference [15] to understand how the simulated data sets were constructed. However, the authors should at least address the sizes of the data sets directly in the main part of their manuscript. This is important information and should be readily available to the reader. If I understand correctly, the sizes are n×p = 1000×100 and 10000×1000. That seems relatively small compared to the real data sets, with dimensions 6703×19533 and 259×174334. The authors should explain why the relation of n and p in the real data sets is inverse to the simulated ones. In the discussion the authors mention ≈67,000 siRNAs, so the n = 6703 from before could be a mistake.

The authors address the limitation where they talk about data set size and three-way effects, i.e., when their method breaks down. Why is that not directly shown in the simulations? The authors should add more simulations to roughly show when this three-way effect identification becomes infeasible.
What is in general very concerning is the overall low accuracy of all methods, even for small data sets. The authors need to explain why such low precision and (only for small data sets) relatively high recall are considered favorable performance. Are false positives considered the lesser evil? The authors should consider the area under the curve for the ranked effects, i.e., precision and recall for the first (i.e., strongest) effect that is found, then for the first and second, then for the first, second, and third effect, and so on. This might even be a fairer evaluation of usefulness, and it lets us know whether false positives and/or false negatives rank very low or high (which would be concerning).

Additionally, the authors never show recall for the three-way effects. The question is, how relevant is this with a precision of 6%? How well would random guessing do in this case, i.e., what is the prevalence of the effects in the data sets? If I read [15] correctly, the data sets include only {5, 20, 50, 100} main and {0, 20, 50, 100} pairwise effects. What about three-way effects in the simulated data? That is of course never discussed in [15], but also not in the main manuscript. Hence, the referral to the simulated data in [15] makes no sense for three-way effects.

Just to be clear, the authors never use the actual effect size for performance evaluation. The main concern is just to identify true positives, correct? Maybe they should make that a bit clearer in the text.

As I understand, the authors look for cell growth and survival genes in their list of identified effects from the InfectX data. While the top found genes certainly seem to make sense and are a nice result in themselves, I suggest, if possible, an additional gene set enrichment analysis to make this part more systematic and quantifiable. This option does not seem to be as straightforward for the second data set, does it?

"... so the possibility of a true combined effect of Ciprofloxacin resistance cannot be ruled out...
", but seems highly unlikely given the simulation results. Also, even if the variants were characterized as not affecting Ciprofloxacin resistance, even then a true combined effect could still not be ruled out. Hence, this last part of the sentence seems like a NULL statement and should be removed, unless I am completely missing the meaning of that sentence. E.g., what are "majority variants"? Shouldn't that be "majority of variants"? I noticed several long sentences like this one and suggest in general to keep sentence length as concise as possible.. p.13: "Additional genes identified using our method...". Did you use any other methods to identify genes in this section? If so, this is not clear at all. The authors again make the argument in the discussion that tools need to be tailored towards larger data sets and higher order interactions. That is why they developed Pint. However, if accuracy is so low for relatively small data sets (where speed is not the issue), how can they make any prediction on even larger data sets? Accuracy will drop even more. Like mentioned above maybe the top hits still look very good and an AUC as accuracy measure might visualize that in the simulations. The authors mix notation, which can be confusing. E.g., /phi is used once to denote the probability density function of the normal distribution and another time as "r, the last time column x was included in the working set" (p.5). This type of ambiguity should be avoided. It is also confusing reading about column X_x and column x, which is also referred to as the index ("... interaction with x ...", this time x is the column? p.5). The authors should make sure that their notation is consistent and unambiguous throughout the whole manuscript. Also defining /phi_x = r and then using r/cdot/phi_x in an equation seems strange. It would be less confusing just defining /phi_x as "the residuals, the last time column x was inc..." without misusing r. 
It seems that the residuals are used for the first time on page 4, eq. 4. Even though this is a common notation in the field, they still need to be formally introduced.

p.9: "Off-target effects are prediction using RIsearch2" => "Off-target effects are predicted using RIsearch2". In the follow-up sentence, the gene should be called the "off-target". The respective off-target effect would be the inhibition of the gene's expression.

p.10: "... further increasing the number of threads has to noticeable impact on performance ..." => "... further increasing the number of threads has no noticeable impact on performance ...".

p.11: The "Method" in figure 4 is supposed to be Pint, right? Please correct that/make that clear.

p.16: Reference [38] seems to be incomplete.

p.19: "We can considerably reduce the size of the active set by compression the columns." => "We can considerably reduce the size of the active set by compressing the columns."

Reviewer #2: In this paper Elmes et al. propose a fast LASSO-based method, Pint, for inferring higher-order interactions. The method is claimed to outperform known methods on simulated data, and it identifies a number of biologically plausible gene effects in both the antibiotic and siRNA models. This research is motivated by large genetic/genomic studies in which a large number of genetic/genomic features are present and their interactions need to be assessed. In such studies, because of the number of features (including interactions), computing efficiency, or scalability, becomes a main issue. One major innovation of Pint compared to existing approaches such as glinternet and WHInter is its better computing efficiency, achieved in the implementation, including parallelization. This efficiency is very much needed in such studies. The package was shown to discover interesting three-way interaction terms in the siRNA data. This paper is well written, the methods are sound, and the results are convincing.
It presents a significant development in implementing the square-root LASSO method. It will be of interest to a general audience in this field. I recommend publication in your journal. I have a few comments I would like the authors to address:

1. In Figure 4c, it is noted that the precision for the three-way interaction terms is extremely low. In Section 3.2 the authors note that this is mostly due to overlapping three-way effects. Could the authors clarify what they mean by "overlapping three-way effects"? Are the columns identical? If so, how many false positives are due to such overlapping three-way effects? When n = 1000 in the binary setting, the interaction term is either 0 or 1. How many such three-way interaction terms were detected as significant? Such low precision may limit the usefulness of the method in such applications, which may deserve some discussion.

2. Can Pint be applied to non-binary data? I can see that many applications may go beyond the SNP type of variables (i.e., continuous instead of binary), but the binary setting of variables may greatly limit the usage of this method. For example, in the gene-gene interaction case, all expression values may be continuous. The overlapping-effect issue presented in Figure 4c may then not be an issue. If Pint works for such data, it will expand the scope of applications. Can the authors discuss this?

Minor point: On page 7, Section 2.1, should X ∈ {0,1}^p be X ∈ {0,1}^{n×p}?

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository.
For example, in addition to summary statistics, the data points behind means, medians, and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g., participant privacy or use of data from a third party), those must be specified.

Reviewer #1: No: github.com/bioDS/lasso_data_processing leads to a 404.
Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous, but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology, see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.
Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols
| Revision 1 |
Dear Mr Elmes,

Thank you very much for submitting your manuscript "A fast lasso-based method for inferring higher-order interactions" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, provided that you modify the manuscript according to the review recommendations. In particular, please address the additional comments from Reviewer 1 about the AUC and PR plots and the enrichment analysis.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Megan L. Matthews, Ph.D.
Guest Editor
PLOS Computational Biology

Ilya Ioshikhes
Section Editor
PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK]

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: I appreciate the authors' effort to address all of my previous points. However, some issues remain, which to me seem fixable with relatively little work and would significantly increase the quality of the manuscript.

Maybe the authors misunderstood my previous concerns regarding the precision/recall plots and my suggestion to replace them (not add to them) with AUC plots. Precision and recall in the single digits do not show that a method is working reasonably well (on the contrary). Moreover, precision and recall depend heavily on the cutoff for the output values (in this case != 0 and == 0, correct?). Treating a small effect only marginally larger than zero and a large effect far larger than zero as the exact same kind of effect seems unreasonable. The AUC does not depend on such a cutoff. Hence, I strongly suggest that the authors replace the precision/recall plots with AUC plots unless there is a valid reason not to do so (which I cannot see). The single AUC plot that is shown suggests that the method works very well. Is there a corresponding precision/recall plot for it? I have a hard time believing that this AUC plot would result in low recall/precision numbers; or is this simulation run an outlier? There is no error shown. Especially since the authors let their method stop after some number of found non-zero effects (arbitrary? At least in the application to real data; how is that number determined?), the AUC shows that the top effects should be true positives.
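The ranked-effects evaluation the reviewer asks for can be sketched in a few lines. The following is an illustrative Python sketch (not part of the review or the manuscript; the toy `effects` list and function names are hypothetical): effects are ranked by absolute coefficient size, precision and recall are computed at every cutoff k, and the area under the resulting precision-recall curve (average precision) summarizes ranking quality without a hard zero/non-zero threshold.

```python
def ranked_precision_recall(effects):
    """effects: list of (|beta|, is_true_effect) pairs.
    Returns (precision@k, recall@k) for k = 1..len(effects),
    scoring the top-k strongest effects at each cutoff."""
    ranked = sorted(effects, key=lambda e: -e[0])  # strongest first
    n_true = sum(is_true for _, is_true in ranked)
    curve, tp = [], 0
    for k, (_, is_true) in enumerate(ranked, start=1):
        tp += is_true
        curve.append((tp / k, tp / n_true))
    return curve

def average_precision(curve):
    """Area under the precision-recall curve (step interpolation):
    precision weighted by the recall gained at each cutoff."""
    ap, prev_recall = 0.0, 0.0
    for precision, recall in curve:
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap

# toy example: 3 true effects among 5 reported, strongest first
effects = [(2.1, True), (1.7, True), (0.9, False), (0.4, True), (0.1, False)]
curve = ranked_precision_recall(effects)
print(curve)                    # per-cutoff (precision, recall)
print(average_precision(curve))
```

This makes the reviewer's point concrete: a method whose strongest reported effects are true positives scores a high average precision even if precision over the full unranked set is low.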
I do not follow the explanation of why an enrichment analysis is not possible. You have your identified genes, and there are databases (e.g., http://www.gsea-msigdb.org/gsea/msigdb/index.jsp) with known biological processes and other common functional gene groups available. Based on this, you could, for example, compute a simple confusion matrix (overlap of your genes and the group, genes exclusive to the group or to your gene list, and the rest of the gene population) and use Fisher's exact test (or an equivalent) to quantify the overlap. There is no need to rank the results.

Other:

I assume the authors actually simulated each data set several times to get the variance of the accuracy plots. Is that, and how often it was done, stated anywhere?

"... we maintain a vector Vk of the values for each column of the input matrix Xk." => "... we maintain a vector Vk of the values for each column Xk of the input matrix X."

"... We prepared one simulated and two experimental data sets ..." Should be nine (without repetition for random noise) simulated data sets, right?

"To begin with, we take simulated a simulated matrix ..." => "To begin with, we simulated a matrix ..."

"datasets" => "data sets" (or be consistent with "datasets").

Response: "In summary, yes for the applications we had in mind the main objective was to identify true positives." But this objective is not fulfilled if you identify a lot of false positives due to low precision. In a real case you cannot distinguish true and false positives. I fail to see the motivation here.

Reviewer #2: The authors have addressed my comments.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
Reviewer #1: Yes
Reviewer #2: Yes

**********

Do you want your identity to be public for this peer review?

Reviewer #1: No
Reviewer #2: No
References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article's retracted status in the References list and also include a citation and full reference for the retraction notice.
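Reviewer #1's enrichment suggestion above (a 2×2 confusion matrix scored with Fisher's exact test) can be sketched as follows. This is an illustrative Python sketch with made-up counts, not an analysis from the manuscript; the one-sided p-value is the hypergeometric tail probability of observing at least the given overlap by chance.

```python
from math import comb

def fisher_exact_greater(overlap, hits_only, group_only, rest):
    """One-sided Fisher's exact test (enrichment direction) on the
    2x2 table [[overlap, hits_only], [group_only, rest]]:
    P(X >= overlap) under the hypergeometric null."""
    n_hits = overlap + hits_only    # genes reported by the method
    n_group = overlap + group_only  # genes in the annotated gene set
    n_total = overlap + hits_only + group_only + rest
    p = 0.0
    for k in range(overlap, min(n_hits, n_group) + 1):
        p += (comb(n_group, k) * comb(n_total - n_group, n_hits - k)
              / comb(n_total, n_hits))
    return p

# hypothetical counts: 40 reported genes, 12 of which overlap a
# 100-gene annotated set, out of 19533 screened genes in total
p = fisher_exact_greater(12, 28, 88, 19533 - 128)
print(p)  # a small p-value indicates enrichment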
| Revision 2 |
Dear Mr Elmes,

We are pleased to inform you that your manuscript 'A fast lasso-based method for inferring higher-order interactions' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted, you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Megan L. Matthews, Ph.D.
Guest Editor
PLOS Computational Biology

Ilya Ioshikhes
Section Editor
PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: I greatly appreciate the additional amount of work the authors put in and think all my points have been completely satisfied.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
Reviewer #1: Yes

**********

Do you want your identity to be public for this peer review?

Reviewer #1: No
| Formally Accepted |
PCOMPBIOL-D-22-00040R2
A fast lasso-based method for inferring higher-order interactions

Dear Dr Gavryushkin,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofi Zombor
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
Open letter on the publication of peer review reports
PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.
We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.
Learn more at ASAPbio.