ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic inference

Jacob L. Steenwyk; Thomas J. Buida III; Yuanning Li; Xing-Xing Shen; Antonis Rokas

doi:10.1371/journal.pbio.3001007

Peer Review History

Original Submission: June 24, 2020
1 Jul 2020 Decision Letter - Roland G Roberts, Editor

Dear Dr Rokas,

Thank you for submitting your manuscript entitled "ClipKIT: a multiple sequence alignment-trimming algorithm for accurate phylogenomic inference" for consideration as a Methods and Resources by PLOS Biology. Your manuscript has now been evaluated by the PLOS Biology editorial staff, as well as by an academic editor with relevant expertise, and I'm writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire. Please re-submit your manuscript within two working days, i.e. by Jul 03 2020 11:59PM.

Login to Editorial Manager here: https://www.editorialmanager.com/pbiology

During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed all checks it will be sent out for review.

Given the disruptions resulting from the ongoing COVID-19 pandemic, please expect delays in the editorial process. We apologise in advance for any inconvenience caused and will do our best to minimize impact as far as possible.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,
Roli Roberts
Roland G Roberts, PhD, Senior Editor
PLOS Biology

https://doi.org/10.1371/journal.pbio.3001007.r001
Revision 1
21 Aug 2020 Decision Letter - Roland G Roberts, Editor

Dear Dr Rokas,

Thank you very much for submitting your manuscript "ClipKIT: a multiple sequence alignment-trimming algorithm for accurate phylogenomic inference" for consideration as a Methods and Resources at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by four independent reviewers.

You’ll see that three of the reviewers (#1, #2, #4) find your method useful, but rev #1 thinks that you need to improve the description of what it actually does (in contrast to other approaches); we agree, and think that this is doubly important for our broader readership. Several of the reviewers have further technical concerns, some of which may need additional analysis.

In light of the reviews (below), we will not be able to accept the current version of the manuscript, but we would welcome re-submission of a much-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent for further evaluation by the reviewers.

We expect to receive your revised manuscript within 2 months. Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may end consideration of the manuscript at PLOS Biology.

IMPORTANT - SUBMITTING YOUR REVISION

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript. NOTE: In your point by point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point. You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type.

Re-submission Checklist

When you are ready to resubmit your revised manuscript, please refer to this re-submission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

Published Peer Review

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details: https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

PLOS Data Policy

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

Blot and Gel Data Policy

We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements

Protocols deposition

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.
Sincerely,
Roli Roberts
Roland G Roberts, PhD, Senior Editor, rroberts@plos.org, PLOS Biology

*****************************************************

REVIEWERS' COMMENTS:

Reviewer #1:

General comments:

The study by Steenwyk et al. addresses an important and unappreciated issue in genomics: that sequence alignments have problems which, if not fixed, can negatively impact downstream studies, such as generating a phylogenetic tree. They came up with a solution for phylogeny that identifies and saves phylogenetically informative sites of a multiple sequence alignment, as opposed to deleting sites that are uninformative. They performed a thorough comparison of multiple methods, and found that their ClipKIT method outperforms the other methods. Overall, I think the study has made important advances. But I have some concerns that I feel need to be addressed.

The authors present one view on issues with alignments, but there are really two: 1) Determining which sites are phylogenetically informative, as ClipKIT is designed to do; 2) Removing errors in the alignment, which some other tools are designed to do, but it is not clear whether ClipKIT is designed to remove errors. Such errors include over-alignment (non-homologous sequences that should not be aligned but were aligned) and under-alignment (homologous sequences that should be aligned but were not aligned). These errors too affect phylogenetic inference and many other analyses. The authors do not seem to address how ClipKIT handles such errors. In a study of alignment of 48 bird genomes, Jarvis et al 2014 (Science) developed alignment filtering tools (described in the supplement) to remove such errors before they inferred a phylogeny. Does ClipKIT work like these?

A big issue is that the authors are not clear enough on what ClipKIT does. I looked in the methods and in their online website about it. Do the authors have equations that they programmed? What criterion is used to determine a site is phylogenetically informative? What does "constant sites" mean (identical sequence in all species)? What does ClipKIT do with the non-phylogenetically informative sites?

Also, what type of alignment software was used to make the alignments? Different alignment algorithms give more or fewer errors in the alignment (e.g. Jarvis et al 2014).

After the authors demonstrate that ClipKIT outperforms other methods at retaining phylogenetically informative sites, they infer phylogenetic trees comparing the trimmed alignments from multiple methods, and find almost identical, well-supported phylogenies with all methods. So, then, what is the need for ClipKIT? This makes me wonder whether their simulated alignments reflect real-world situations.

Overall, the paper is not prepared well enough, so it is hard to tell whether issues are due to missing information or to a flaw in the design of the analyses.

Specific comments:

Page and line numbers need to be included, otherwise writing a review is more difficult.

If ClipKIT is supposed to identify and retain phylogenetically informative sites, the term itself implies removal of sites.

In the introduction, the authors need to mention that the primary goal of many trimming/filtering tools is to remove alignment errors. They should describe what ClipKIT does with such errors.

The authors are inconsistent in stating the number of trimming approaches used: 13, 14, 6. In table 1, I count 6 trimming approaches, 12 subapproaches, plus a no trimming control. They need to be consistent.

nRF and ABS need to be spelled out on first use in the paper, in the results section.

The authors state that Gblocks and BMGE trimming was not evaluated on simulated data sets, because they could remove entire alignments. However, values from these methods are in the figures, including in the simulated data sets of Figure 1B.

Where is the evidence for the following sentence? "Finally, counter to previous evidence suggestive of a trade-off between trimming and phylogenetic accuracy [6], we found that ClipKIT aggressively trimmed MSAs in the empirical datasets without compromising phylogenetic tree accuracy and support."

The following sentence in the discussion seems contradictory to the finding in the results that ClipKIT trimming did not increase or decrease the accuracy of the phylogeny. "A previous analysis suggested that MSA trimming methods often decreased the accuracy of phylogenetic inference [6], highlighting the need for alternative alignment trimming approaches."

Reviewer #2:

Trimming MSA in phylogenomics

# Summary

This study describes a new approach to trimming multiple sequence alignments (MSAs) for phylogenomics. While most available methods focus on identifying and trimming putatively phylogenetically-uninformative sites, the new method (ClipKIT) aims to identify and retain phylogenetically-informative sites. The authors demonstrate the accuracy of ClipKIT with multiple empirical and simulated datasets using split-based distance metrics for phylogenetic trees. Overall, the method performs extremely well (as judged by the metrics used) and sometimes is on par with no-trimming, yet it saves computation time.

# Assessment

In general, there are no major issues with this study. It is a clever idea to focus on site retention instead of removal. The rationale and implementation of the algorithm are clear and well-justified (though I wish the pseudo-code was provided in the Suppl. Info). Speed and accuracy are a major improvement over alternative approaches, indicating that ClipKIT will likely be appealing to most empirical phylogenomicists.

My biggest quarrel with the study is the use of split-based distance metrics like the RF distance to assess accuracy (and thus making the PCAs not easily interpretable). It is well known that RF is a highly biased and sometimes problematic metric, even if normalized. However, I am not sure what would be reasonable to ask the authors to do here given the large numbers of MSAs they examined. On one hand, it would be useful at least to provide raw, as opposed to normalized, RF. On the other hand, it would be better to use alternative metrics of tree comparison. Not sure if the information theoretic-based metrics developed by the senior author (tree certainty (TC) and AllTC) would be a good addition to the ms. Alternatively, using information theoretic-based RF (see https://academic.oup.com/bioinformatics/article/doi/10.1093/bioinformatics/btaa614/5866976) could be more informative. I know this may be a lot of work, and hence I am not sure what would be the best way to proceed. Results may not change, but I worry about describing the accuracy of a new method using a metric that is highly biased like RF (and normalized RF). ABS is not significantly better.

A second issue is that MSA site trimming may not only affect topology but also branch lengths, an equally important parameter in empirical phylogenomics and molecular evolution. Comparing branch lengths and defining null models can also be tricky, but I wonder if the authors have any insight here.

Addressing or at least discussing these two issues may help improve an already excellent, well-written manuscript.

Reviewer #3:

In this work, Steenwyk and collaborators developed an alignment-trimming algorithm aiming to identify phylogenetically-informative sites. The authors tested its effectiveness under different conditions (simulated and not) and showed that it consistently outperformed other trimming methods across diverse datasets.

I have some significant concerns about this work:

1) The software removes parsimony-uninformative sites; however, the same sites analysed under ML and Bayesian frameworks can be informative.

2) Removing the parsimony-uninformative sites affects the branch lengths and potentially the topology.

3) It is also possible to remove parsimony-uninformative sites and gaps in MEGA. ClipKIT allows the user to do this from the command line, but this does not justify a publication in PLOS Biology.

Reviewer #4:

The submission addresses the issue of alignment trimming for phylogenetic inference. This is a routine step in phylogenetic inference, and some of the trimming software packages have thousands of citations. Recently, the benefit of alignment trimming has been called into question. The software introduced here, ClipKIT, purports to address this issue.

The work is generally well done, clearly written, and well illustrated. I only have a few comments.

Major:

- The use of "Desirability-based integration of accuracy and support metrics" makes it easier to rank the methods and summarise the results, but makes it harder to interpret the differences. Please include a precise mathematical definition of the compound measure.

Minor:

- Abstract: "Phylogenies inferred from ClipKIT-trimmed alignments are accurate, robust, and time-saving". This statement is too absolute, particularly the accuracy claim (e.g. if the input alignment is fundamentally flawed, trimming alone cannot possibly turn it into something "accurate"). It could however be said that, contrary to other methods, the trees don't worsen after ClipKIT trimming, and the method saves time.

- Define ABS (average bootstrap support?) and nRF as normalised Robinson-Foulds.

- In Supplementary figure 4, I find the z-score transform confusing. I would be interested to see how the ABS and nRF values compare for the different methods. The z-score makes things less interpretable. For one thing, over what population were the mean and variance computed? (across all methods and datasets? across all datasets separately for each method?)

https://doi.org/10.1371/journal.pbio.3001007.r002
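Illustrative note on the terms queried above: Reviewer #1 asks what criterion marks a site as phylogenetically informative and what "constant sites" means, and Reviewer #3 refers to parsimony-uninformative sites. The sketch below applies the standard textbook definitions only: a constant site has one character state across all sequences, and a parsimony-informative site has at least two states that each occur in at least two sequences. It is not ClipKIT's code and is not taken from the manuscript; the function names, the choice of gap characters, and the toy alignment are all hypothetical.

```python
from collections import Counter

# Assumption: '-' and '?' are treated as gaps/missing data and ignored.
GAP_CHARS = {"-", "?"}


def classify_site(column):
    """Classify one alignment column using textbook definitions.

    Returns one of:
      'gap-only'               no characters other than gaps
      'constant'               a single character state in all sequences
      'parsimony-informative'  >= 2 states that each occur in >= 2 sequences
      'variable-uninformative' variable, but not parsimony-informative
    """
    counts = Counter(ch.upper() for ch in column if ch not in GAP_CHARS)
    if not counts:
        return "gap-only"
    if len(counts) == 1:
        return "constant"
    # Number of character states shared by at least two sequences.
    shared_states = sum(1 for n in counts.values() if n >= 2)
    if shared_states >= 2:
        return "parsimony-informative"
    return "variable-uninformative"


def classify_alignment(sequences):
    """Classify every column of an alignment given as equal-length strings."""
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return [classify_site(col) for col in zip(*sequences)]


if __name__ == "__main__":
    # Hypothetical four-taxon toy alignment, one string per taxon.
    toy = ["ACGTA-", "ACGAA-", "ATGAC-", "ATGTG-"]
    for i, label in enumerate(classify_alignment(toy), start=1):
        print(i, label)
```

Under these definitions a trimming tool still has to decide which classes to keep; that policy is a separate choice and is deliberately left out of the sketch.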
Revision 2
8 Sep 2020 Author Response
Attachment, submitted filename: Steenwyk_etal_response_to_reviewers.docx
https://doi.org/10.1371/journal.pbio.3001007.r003
26 Oct 2020 Decision Letter - Roland G Roberts, Editor

Dear Dr Rokas,

Thank you for submitting your revised Methods and Resources paper entitled "ClipKIT: a multiple sequence alignment-trimming software for accurate phylogenomic inference" for publication in PLOS Biology. I have now obtained advice from two of the original reviewers and have discussed their comments with the Academic Editor.

Based on the reviews, we will probably accept this manuscript for publication, assuming that you will modify the manuscript to address the remaining points raised by the reviewers. Please also make sure to address the data and other policy-related requests noted at the end of this email.

IMPORTANT: Please attend to the following:

a) Please address the remaining concerns raised by reviewer #1.

b) Please address my Data Policy requests (see further down).

We expect to receive your revised manuscript within two weeks. Your revisions should address the specific points made by each reviewer.

In addition to the remaining revisions and before we will be able to formally accept your manuscript and consider it "in press", we also need to ensure that your article conforms to our guidelines. A member of our team will be in touch shortly with a set of requests. As we can't proceed until these requirements are met, your swift response will help prevent delays to publication.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

- a cover letter that should detail your responses to any editorial requests, if applicable
- a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable)
- a track-changes file indicating any changes that you have made to the manuscript.

Copyediting

Upon acceptance of your article, your final files will be copyedited and typeset into the final PDF. While you will have an opportunity to review these files as proofs, PLOS will only permit corrections to spelling or significant scientific errors. Therefore, please take this final revision time to assess and make any remaining major changes to your manuscript.

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines: https://journals.plos.org/plosbiology/s/supporting-information

Published Peer Review History

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details: https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

Early Version

Please note that an uncorrected proof of your manuscript will be published online ahead of the final version, unless you opted out when submitting your manuscript. If, for any reason, you do not want an earlier version of your manuscript published online, uncheck the box. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

Protocols deposition

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods

Please do not hesitate to contact me should you have any questions.

Sincerely,
Roli Roberts
Roland G Roberts, PhD, Senior Editor, rroberts@plos.org, PLOS Biology

------------------------------------------------------------------------

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797

Many thanks for depositing your alignments and phylogenies in Figshare and making your code available on Github. However, we also ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication.
Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figs 1, 2, S1-S13. NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

------------------------------------------------------------------------

REVIEWERS' COMMENTS:

Reviewer #1: [identifies himself as Erich Jarvis]

The authors were very responsive to the reviews, and this led to a significant improvement in the manuscript. I just have a few conceptual concerns in how things are presented, which can easily be fixed with changes in the text.

The authors were responsive to my comments about the difference between ClipKIT identifying and removing naturally phylogenetically uninformative sites versus removing sites due to alignment errors. They however did not clearly state this in the main text, nor state the difference in their effects on phylogenetic inference. This needs to be clearly stated, both for the theory and, when possible, for the proportion of sites suspected to be uninformative versus those affected by alignment errors.

The authors seem to present a contradictory message on alignment filtering strategies that are meant to improve phylogenetic inference, when some don't actually make any change at all or sometimes make the phylogeny worse. But ClipKIT trimming is supposed to make the phylogenetic inference "better" or make no change in an already accurate phylogeny. The contradiction to this view is that, in the response to reviewer 3, not removing the uninformative sites does not change anything, and removing the uninformative sites does not change the branch lengths. But isn't the point of ClipKIT that, by removing the uninformative sites, the phylogenetic inference should improve? The branch lengths should become more accurate? Isn't removing the sequence alignment errors supposed to improve phylogenetic inference? I believe the authors show improvements, but I also think they are unnecessarily trying to play two sides of the same coin (no change or improvement in phylogeny). The definition of no change also means no improvement.

Lines 157-159. Highly divergent sites may arise not only from natural mutations that are not phylogenetically informative, but also because some of them are due to alignment errors. Actually, I think the latter reason is more likely for many highly divergent sites in the alignment. This should be mentioned.

Reviewer #2:

The authors have addressed all my concerns and further provided more nuance to the manuscript in other sections. This is an excellent contribution to the field. I personally look forward to using ClipKIT!

https://doi.org/10.1371/journal.pbio.3001007.r004
Revision 3
4 Nov 2020 Author Response
Attachment, submitted filename: Steenwyk_etal_second_resubmission_response_to_reviewers.docx
https://doi.org/10.1371/journal.pbio.3001007.r005
10 Nov 2020 Decision Letter - Roland G Roberts, Editor

Dear Dr Rokas,

On behalf of my colleagues and the Academic Editor, Andreas Hejnol, I am pleased to inform you that we will be delighted to publish your Methods and Resources in PLOS Biology.

PRODUCTION PROCESS

Before publication you will see the copyedited word document (within 5 business days) and a PDF proof shortly after that. The copyeditor will be in touch shortly before sending you the copyedited Word document. We will make some revisions at copyediting stage to conform to our general style, and for clarification. When you receive this version you should check and revise it very carefully, including figures, tables, references, and supporting information, because corrections at the next stage (proofs) will be strictly limited to (1) errors in author names or affiliations, (2) errors of scientific fact that would cause misunderstandings to readers, and (3) printer's (introduced) errors. Please return the copyedited file within 2 business days in order to ensure timely delivery of the PDF proof.

If you are likely to be away when either this document or the proof is sent, please ensure we have contact information of a second person, as we will need you to respond quickly at each point. Given the disruptions resulting from the ongoing COVID-19 pandemic, there may be delays in the production process. We apologise in advance for any inconvenience caused and will do our best to minimize impact as far as possible.

EARLY VERSION

The version of your manuscript submitted at the copyedit stage will be posted online ahead of the final proof version, unless you have already opted out of the process. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

PRESS

We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have not yet opted out of the early version process, we ask that you notify us immediately of any press plans so that we may do so on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for submitting your manuscript to PLOS Biology and for your support of Open Access publishing. Please do not hesitate to contact me if I can provide any assistance during the production process.

Kind regards,
Vita Usova
Publication Assistant, PLOS Biology
on behalf of Roland Roberts, Senior Editor, PLOS Biology

https://doi.org/10.1371/journal.pbio.3001007.r006

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.