Fast and accurate imputation of genotypes from noisy low-coverage sequencing data in bi-parental populations

Cécile Triay; Alice Boizet; Christopher Fragoso; Anestis Gkanogiannis; Jean-François Rami; Mathias Lorieux

doi:10.1371/journal.pone.0314759

Peer Review History

Original Submission: April 16, 2024
23 Jul 2024
Decision Letter - Andrea Tangherloni, Editor
PONE-D-24-15414
Fast and accurate imputation of genotypes from noisy low-coverage sequencing data in bi-parental populations
PLOS ONE
Dear Dr. Lorieux,
Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.
The reviewers' comments pointed out that a revision is required to improve the current version of the manuscript. One of the most critical concerns regards the filtering step. It also needs to be clarified if the input of the three tested tools is the same. Please refer to the reviewers’ reports and the Reviewer’s Responses to Questions section for detailed comments, which could help you improve your manuscript. Please carefully address (and reply to) all the comments raised by all reviewers (this is mandatory).
Please submit your revised manuscript by Sep 06 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.
Please include the following items when submitting your revised manuscript:
- A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
- A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
- An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Andrea Tangherloni Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex. 3. Thank you for stating the following financial disclosure: "The French ANR project "LANDSREC" 385 (ANR-21-CE20-0012-03), , the French Government France Génomique program through its International RIce 485 Genome INitiative “IRIGIN” project, and the CGIAR Research Program “RICE”" Please state what role the funders took in the study. 
If the funders had no role, please state: ""The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."" If this statement is not correct you must amend it as needed. Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf. 4. Thank you for stating the following in the Acknowledgments Section of your manuscript: "We thank Karine Labadie (CEA, Institut de Génomique, Genoscope, Evry, France) for sharing the WGS data for the sequencing of rice populations, Christine Tranchant-Dubreuil (IRD, Montpellier, France) for her help with retrieving the Rice_WGS data and François Sabot (IRD, Montpellier, France) for coordinating the IRIGIN project. We are grateful to the Institut Français de Bioinformatique (IFB) for providing computing resources. We also thank the Yale Center for Research Computing for guidance and use of the research computing infrastructure. The following programs supported parts of this initiative: the French ANR project "LANDSREC" (ANR-21-CE20-0012-03), the French Government France Génomique program through its International RIce Genome INitiative “IRIGIN” project, and the CGIAR Research Program “RICE”." We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. 
Currently, your Funding Statement reads as follows: "The French ANR project "LANDSREC" 385 (ANR-21-CE20-0012-03), , the French Government France Génomique program through its International RIce 485 Genome INitiative “IRIGIN” project, and the CGIAR Research Program “RICE”" Please include your amended statements within your cover letter; we will change the online submission form on your behalf. 5. When completing the data availability statement of the submission form, you indicated that you will make your data available on acceptance. We strongly recommend all authors decide on a data sharing plan before acceptance, as the process can be lengthy and hold up publication timelines. Please note that, though access restrictions are acceptable now, your entire data will need to be made freely accessible if your manuscript is accepted for publication. This policy applies to all data except where public deposition would breach compliance with the protocol approved by your research ethics board. If you are unable to adhere to our open data policy, please kindly revise your statement to explain your reasoning and we will seek the editor's input on an exemption. Please be assured that, once you have provided your new statement, the assessment of your exemption will not hold up the peer review process. 6. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information. Additional Editor Comments: The reviewers' comments pointed out that a revision is required to improve the current version of the manuscript. One of the most critical concerns regards the filtering step. It also needs to be clarified if the input of the three tested tools is the same. 
Please refer to the reviewers’ reports and the Reviewer’s Responses to Questions section for detailed comments, which could help you improve your manuscript. Please carefully address (and reply to) all the comments raised by all reviewers (this is mandatory). [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Partly Reviewer #3: Yes ******** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: I Don't Know Reviewer #2: Yes Reviewer #3: I Don't Know ****** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No Reviewer #2: Yes Reviewer #3: Yes ****** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. 
Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ****** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The authors propose a method to genotype bi-parental populations with low-coverage NGS technology and compare their approach against FSFhap and LB-Impute. An MLE algorithm as compared to HMM makes sense with the assumptions of the bi-parental populations with inbred and exclusively homozygous parental lines and the low running time and memory consumption looks good. However, I have some comments/questions as given below: Major Comments ------------- - There is a heavy filtering of SNPs before being used by the tool. Filtering multiallelic SNPs and small indels is common in genomic tools (but newer versions of tools are updating their pipelines to incorporate them now) but further filtering of SNPs based on HWE parameters, read counts, etc might be a bit excessive. Do you have any explanation about this? How many SNPs are removed through these filtering criteria (in terms of percentage of all SNPs)? - Do you compare NOISYmputer to FSFhap and LB-Impute with the input of these filtered variants to all three tools or are the filtering criteria for the three tools different? If variants are filtered differently for the three methods, is it possible to run the three tools on one VCF and compare the results? Is it possible that the high accuracy, better map size estimation, and breakpoint precision of the tool are the outcome of the heavily filtered set of SNPs? - Line 259: How do you define too many or too few? 
Is it based on the average read depth or a constant value of X? Also it is possible that the SNPs with too many reads might be duplicated regions but this information seems to be lost when they are removed. Any comments on this? - What is the reason that windows in this work are based on the number of variant positions around a central SNP and not base pair distance-based? More clearly, why is an m-window around SNPj defined as SNPj-m to SNPj+m and not all the variants whose base pair distance is less than m from SNPj. Most references have gaps in their sequences and the existing window-system in the manuscript will include variants which can be large distances away from the central SNP if such a gap exists between the two variants. Why was such a window-system used instead of a base pair distance-based? - Why was the comparison done with FSFhap and also not with FILLIN (as the main publication of FSFhap does)? FILLIN has imputation capabilities that can also be compared with the imputation done by NOISYmputer. I could not find a direct link to the simulated and real dataset used in the experiments. Each dataset used in the experiments should be stated clearly with a proper link. For simulations, an open source repository (e.g., Zenodo) might be used. Minor Comments ------------ - Figure 2 shows that the performance of the algorithm is poor for data with high error rates. So even if the correct error rates are estimated, it does not help in generating accurate results and the tool depends on the underlying error rates. This is not a very big issue since the tool primarily deals with Illumina reads, which have very low error rates (as stated in lines 354-357). But future ideas related to extending these tools to error-prone long reads will be unsuccessful based on these results. Is it possible to generalize the model further to allow high error rates? - I would expect to see some stats about the performance of the algorithm in the abstract. 
- Line 48 states that “Reducing sequencing costs through minimized per-sample coverage has an important experimental downside: LC-NGS mechanically introduces a series of issues, the main ones being:” However, I can't see any connection of issue 3 listed in line 60 with the data being low coverage. It's a problem with short-read sequencing.
- Line 117: Where does the equation for the proportion of the genome with no recombination come from? Reference?
- Line 123: “computation time, which increases linearly according to the diplotype size” - Is there any time complexity analysis given as part of the work to show this?
- Line 153-155: Are nAi and nBi numbers, or is it a multiplication between n and Ai and Bi? If they are numbers, consider superscripting the A and B. It is not clear how they are being obtained. The binomial expansion has them as variables and their values can range from 0 to ni. But in Line 162 in the equation, they are no longer variables on the LHS of the equation while they exist on the RHS of the equation. Which value is chosen for nAi and nBi and how are they chosen?
- Line 210-225: If BB is the segment to the left of position j and AB is to the right, why is the BB window defined for j to j+m and vice versa for the AB window? Since BB is to the left, the window is for j-m to j. Further definition of PSI as a product of 1-probabilities for the two regions. There is no explanation on why this approach has been taken.
- Line 244-245: The formulation of homozygous to heterozygous transitions seems to be easily generalizable for double recombination events of transitioning from homozygous to homozygous. Why not allow such events as default and allow users to choose whether to consider the double recombination transitions?
- Line 320: Does PoPSimul also simulate gaps in the reference or is it a uniform distribution of SNPs?
- Line 338-342: It can also be considered to report the results using other metrics like precision-recall or Matthews Correlation Coefficient etc. Maybe even with a confusion matrix.
- Line 316-317: Specs of the cluster are given in the link but which of these machines did you use for the experiments? This information can also be added as a single line regarding the specs of the machine used.
Typos
-----
- Line 102: “leaving unimputed the regions”
- Line 132: “once could set”
- Line 157-159: End bracket missing in the expression in the middle.
- Line 218: Should this be SNPj+m and SNPj-m respectively, where “m” is subscripted?
Reviewer #2: The authors propose a specific scheme for handling imputation of individuals where the original genetic stock come from two completely inbred lines. They propose to compute the total number of reads supporting the two generalized alleles (A and B) and make the primary imputation decisions based on these counts. They claim that this technique will be more tolerant to sequencing and mapping errors compared to other prevalent approaches. The authors claim that this approach is more efficient than a Hidden Markov Model, but in my mind, what's proposed is similar to an HMM where the emitted symbols are the read counts in each window. Combining windowing with transition probabilities for long range haplotype consistency wouldn't be impossible. I would also be somewhat wary of the argument that it makes sense to have a single global error rate per origin line, since mapping and sequencing errors are highly variant-specific. On the other hand, I can of course agree that two parameters would maybe be better than a single global epsilon. The use of the reconstructed chromosome length as a metric for quality is ingenious, but it takes some time to wrap your head around it.
Overall, I find the contribution to be interesting, but there are several assumptions made on the type of dataset and I think the manuscript would benefit from more clearly spelling these out and also contrast to related scenarios where they would not be applicable. Reviewer #3: The paper is written well, I understood the problem of imputation and authors' approach to the maximum likelihood estimation. The results provided are understandable. However, related work or literature review could be improved. ****** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: Yes: Carl Nettelblad Reviewer #3: No ******** [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. https://doi.org/10.1371/journal.pone.0314759.r001
Revision 1
10 Sep 2024 Author Response PLOS ONE – PONE-D-24-15414 Manuscript: “Fast and accurate imputation of genotypes from noisy low-coverage sequencing data in bi-parental populations” Response to Reviewers Editor’s comments 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf > We carefully checked the styles requirements, and hope everything is fine. 2. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex. > We updated the manuscript using the PLOS LaTeX template. 3. Thank you for stating the following financial disclosure: "The French ANR project "LANDSREC" 385 (ANR-21-CE20-0012-03), the French Government France Génomique program through its International RIce 485 Genome INitiative “IRIGIN” project, and the CGIAR Research Program “RICE”" Please state what role the funders took in the study. If the funders had no role, please state: ""The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."" If this statement is not correct you must amend it as needed. Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf. > We included the amended role of funders in the cover letter. 4. 
Thank you for stating the following in the Acknowledgments Section of your manuscript: "We thank Karine Labadie (CEA, Institut de Génomique, Genoscope, Evry, France) for sharing the WGS data for the sequencing of rice populations, Christine Tranchant-Dubreuil (IRD, Montpellier, France) for her help with retrieving the Rice_WGS data and François Sabot (IRD, Montpellier, France) for coordinating the IRIGIN project. We are grateful to the Institut Français de Bioinformatique (IFB) for providing computing resources. We also thank the Yale Center for Research Computing for guidance and use of the research computing infrastructure. The following programs supported parts of this initiative: the French ANR project "LANDSREC" (ANR-21-CE20-0012-03), the French Government France Génomique program through its International RIce Genome INitiative “IRIGIN” project, and the CGIAR Research Program “RICE”." We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: "The French ANR project "LANDSREC" 385 (ANR-21-CE20-0012-03), , the French Government France Génomique program through its International RIce 485 Genome INitiative “IRIGIN” project, and the CGIAR Research Program “RICE”" Please include your amended statements within your cover letter; we will change the online submission form on your behalf. > We included the amended statement in the cover letter. Thank you for the help with this. 5. When completing the data availability statement of the submission form, you indicated that you will make your data available on acceptance. 
We strongly recommend all authors decide on a data sharing plan before acceptance, as the process can be lengthy and hold up publication timelines. Please note that, though access restrictions are acceptable now, your entire data will need to be made freely accessible if your manuscript is accepted for publication. This policy applies to all data except where public deposition would breach compliance with the protocol approved by your research ethics board. If you are unable to adhere to our open data policy, please kindly revise your statement to explain your reasoning and we will seek the editor's input on an exemption. Please be assured that, once you have provided your new statement, the assessment of your exemption will not hold up the peer review process. > All data were made available to download as required. 6. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information. > This was done. The reviewers' comments pointed out that a revision is required to improve the current version of the manuscript. One of the most critical concerns regards the filtering step. It also needs to be clarified if the input of the three tested tools is the same. Please refer to the reviewers’ reports and the Reviewer’s Responses to Questions section for detailed comments, which could help you improve your manuscript. Please carefully address (and reply to) all the comments raised by all reviewers (this is mandatory). > We carefully addressed all the points raised by the Reviewers. Reviewer #1 Note: all line numbers refer to the file “Revised Manuscript with Track Changes” The authors propose a method to genotype bi-parental populations with low-coverage NGS technology and compare their approach against FSFhap and LB-Impute. 
An MLE algorithm as compared to HMM makes sense with the assumptions of the bi-parental populations with inbred and exclusively homozygous parental lines and the low running time and memory consumption looks good. However, I have some comments/questions as given below:
Major Comments
-------------
- There is a heavy filtering of SNPs before being used by the tool. Filtering multiallelic SNPs and small indels is common in genomic tools (but newer versions of tools are updating their pipelines to incorporate them now) but further filtering of SNPs based on HWE parameters, read counts, etc might be a bit excessive. Do you have any explanation about this?
> Thank you for discussing the filtering steps. We do not filter on HWE, since we treat bi-parental populations only, but advanced filtering (before and after imputation), together with MLE-based imputation, is indeed a major feature of NOISYmputer. We filter on genotypic frequencies because some SNPs, generally those that come from erroneous mapping, produce segregations that depart heavily from what is expected in a bi-parental population, even in the case of segregation distortion. Keeping those SNPs would add confusion, reducing the difference between the likelihoods of heterozygous and homozygous states. The filter on read counts is particularly useful to eliminate SNPs that lie in repeated regions. Keeping them would lead to similar confusion. Furthermore, this is a flexible filter since it is applied per individual, and only the sites whose coverage is at least 10 times the average depth (calculated across all sites for the individual) are removed. The thresholds of these various filters therefore depend on the dataset used and are recalculated to optimally filter the input VCFs. Thus, these “pre-imputation” filters should not be considered as a step preceding NOISYmputer, but as an integral part of the software, contributing to the quality of the imputation results.
Regarding multiallelic sites, they do not correspond to the case of crosses between pure lines, where only two alleles can segregate. This is why they are filtered out. - How many SNPs are removed through these filtering criteria (in terms of percentage of all SNPs)? > We produced a table that we added as Supplemental S5_Table to better inform about the consequences of these filters on the number of remaining sites based on the real datasets analyses and modified main text (lines 423-426) to refer to it. However, we would like to emphasize that the number of filtered sites heavily depends on the filters that may have been applied during the production of the initial VCF or the sequencing method. Thus, the user might feel that a large number of sites have been lost during NOISYmputer pre-imputation filtering steps if the initial VCF was minimally filtered beforehand and contained noise. We strongly recommend applying as few arbitrary filters as possible when calling SNPs and producing the initial VCF. For example, in the Snakemake pipeline we provide, we only filter for biallelic sites, sites with at most 80% missing data across the entire population, and sites for which parents are polymorphic and homozygous. These filters are actually applied by NOISYmputer anyway, so applying them beforehand simply allows us to optimize the VCF storage by reducing their final size. No filters on depth, deviation from HWE, or other criteria are applied before running NOISYmputer. - Do you compare NOISYmputer to FSFhap and LB-Impute with the input of these filtered variants to all three tools or are the filtering criteria for the three tools different? If variants are filtered differently for the three methods, is it possible to run the three tools on one VCF and compare the results? Is it possible that the high accuracy, better map size estimation, and breakpoint precision of the tool are the outcome of the heavily filtered set of SNPs? 
> It is correct that NOISYmputer does run filters that FSFhap or LB-Impute do not run, since NOISYmputer’s algorithm is a combination of filtering and imputation. We thus re-ran FSFhap and LB-Impute on the real rice and maize datasets pre-filtered by NOISYmputer, and, although the two programs produced better results than without the filtering, they still produced noisy imputed data and therefore extremely long genetic maps. We added Supplementary S3 Table to reflect these new analyses, and modified the text (lines 378-381). We did not re-test FSFhap on simulated data, because the program PopSimul does not produce SNPs with excess counts or incoherent genotype frequencies.
- Line 259: How do you define too many or too few? Is it based on the average read depth or a constant value of X? Also it is possible that the SNPs with too many reads might be duplicated regions but this information seems to be lost when they are removed. Any comments on this?
> Thank you for the question. It does indeed require some additional explanation. This parameter is not fixed; it varies depending on the dataset used. By default, when reading the input VCF file, if the depth of a site for a sample is more than 10 times the average depth for that sample (across all its sites), the genotype at that site is set to missing. Also by default, no filter is applied on the minimum number of reads required to consider a site in a sample. We added an explanation (lines 205-209) in the main text to clarify this section. This filter is relatively “soft”. For example, it sets only 0.02% of all genotypes to missing in the Rice F2 dataset (9626/44845342) and 0.002% in the Maize F2 dataset (22/971901). Regarding the second part of the question, we intentionally filter sites with excessive coverage relative to the sample's average coverage, as it might indicate the presence of (highly) duplicated regions.
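The per-sample depth filter described in this response can be sketched as follows. This is a minimal illustration, not NOISYmputer's actual code: the function names, the list-of-lists input layout, and the strict "greater than" comparison are assumptions made for the example, with only the default factor of 10 taken from the response.

```python
def mask_excess_depth(depths, factor=10.0):
    """Flag genotypes whose read depth exceeds `factor` times the sample's mean depth.

    `depths[s][j]` is the read depth of sample s at site j (hypothetical layout).
    Returns a same-shaped list of booleans; True means "set this genotype to missing".
    """
    masked = []
    for sample_depths in depths:
        # The mean is computed per sample, across all of that sample's sites.
        avg = sum(sample_depths) / len(sample_depths) if sample_depths else 0.0
        masked.append([d > factor * avg for d in sample_depths])
    return masked


def masked_fraction(mask):
    """Fraction of all (sample, site) datapoints flagged as missing."""
    total = sum(len(row) for row in mask)
    n_masked = sum(sum(row) for row in mask)
    return n_masked / total if total else 0.0


# Toy data: one sample with a single pile-up site, one sample with even coverage.
toy_depths = [[1] * 99 + [500], [3, 3, 3, 3]]
toy_mask = mask_excess_depth(toy_depths)
```

Because the threshold scales with each sample's own mean depth, a deeply sequenced sample is not penalized relative to a shallow one. The fractions quoted in the response follow from the reported counts: 9626/44845342 is about 0.021% and 22/971901 is about 0.0023%.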
SNPs from duplicated regions are not informative because we cannot determine which alleles belong to which copy, making it impossible to compare haplotypes to estimate breakpoint positions. We have also included this information in the standard output of NOISYmputer and added an output file that now reports the number of sites per sample (datapoints) set as missing data. This feature is currently available on the development branch of our GitLab repository and will be included in the next release of NOISYmputer.
- What is the reason that windows in this work are based on the number of variant positions around a central SNP and not base pair distance-based? More clearly, why is an m-window around SNPj defined as SNPj-m to SNPj+m and not all the variants whose base pair distance is less than m from SNPj. Most references have gaps in their sequences and the existing window-system in the manuscript will include variants which can be large distances away from the central SNP if such a gap exists between the two variants. Why was such a window-system used instead of a base pair distance-based?
> This is a very interesting question. We chose to use windows based on the number of variant sites, just like FSFhap or LB-Impute do, because the sites are far from evenly distributed along the genome. Thus, in base pair-based windows, the number of sites can vary greatly. Since the goal of the windows is to calculate likelihoods, whose values depend on the number of sites in the window, using base pair-based windows would produce likelihood values that are difficult to interpret, especially when the number of sites is low. Also, the observed likelihoods depend on the real genotypes, which are modeled by recombination that is only faintly correlated with physical distance. Since the final goal is to calculate accurate genetic maps, in centimorgans, using physical distances does not seem appropriate.
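The contrast drawn in this response between site-count windows and base pair-based windows can be illustrated with a short sketch. It is not taken from NOISYmputer, FSFhap, or LB-Impute: the function names, the clipping of windows at chromosome ends, and the toy coordinates are all assumptions chosen for the example.

```python
from bisect import bisect_left, bisect_right


def site_window(positions, j, m):
    """Indices of the m sites on each side of site j, clipped at the ends."""
    lo = max(0, j - m)
    hi = min(len(positions), j + m + 1)
    return list(range(lo, hi))


def bp_window(positions, j, half_width_bp):
    """Indices of all sites within half_width_bp base pairs of site j.

    `positions` must be sorted in increasing order.
    """
    center = positions[j]
    lo = bisect_left(positions, center - half_width_bp)
    hi = bisect_right(positions, center + half_width_bp)
    return list(range(lo, hi))


# Toy chromosome: a dense cluster of sites, a large gap, then a second cluster.
toy_positions = [100, 200, 300, 400, 50_000, 50_100, 50_200]

by_count_left = site_window(toy_positions, 3, 2)   # spans the gap, always 5 sites
by_count_right = site_window(toy_positions, 4, 2)  # spans the gap, always 5 sites
by_bp_left = bp_window(toy_positions, 3, 500)      # 4 sites, stops at the gap
by_bp_right = bp_window(toy_positions, 4, 500)     # 3 sites, stops at the gap
```

Away from chromosome ends, the site-count window always holds 2m + 1 sites, so likelihoods computed over it are comparable from one window to the next. The base pair window respects physical gaps, as the reviewer notes, but its site count changes with local marker density.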
Furthermore, we verified that the site-based method still produces good results when encountering breakpoints in low marker density regions (as shown in the simulated data in main text or in the table below). We looked at the top 25% of largest loose support intervals (which implies that sites are spaced apart) in the pseudo 3X samples of the Rice F2 dataset and observed that the distance to the true position of the breakpoint (in number of SNPs) is the same as in the complete dataset. Finally, even if this distance, measured in base pairs, is higher on average, the median stays very reasonable and the higher mean is actually due to a single breakpoint among the 23 analyzed. Lastly, we report several measures in the breakpoints output file for the user to filter the breakpoint results if needed. For example, we report the size of the loose support interval along with a proxy of a confidence interval, the number of SNPs analyzed and the number of SNPs that have data in the analyzed sample. See the Breakpoints.csv file in NOISYmputer output.

- Why was the comparison done with FSFhap and not also with FILLIN (as the main publication of FSFhap does)? FILLIN has imputation capabilities that can also be compared with the imputation done by NOISYmputer.

> FILLIN, like Beagle or Impute2, was developed to impute genotypes in diversity panels (which the authors call inbred lines libraries). FSFhap, like NOISYmputer, was developed specifically to address the problem of bi-parental populations. The authors themselves report that “FSFHap and FILLIN performed very similarly, but FSFHap imputed more heterozygous sites with increased accuracy” in the case of bi-parental populations. We thus considered that FSFhap is better suited for comparison with NOISYmputer.

> I could not find a direct link to the simulated and real dataset used in the experiments. Each dataset used in the experiments should be stated clearly with a proper link.
For simulations, an open source repository (e.g., Zenodo) might be used.

> We have made both real and simulated datasets available. The simulated datasets can be reached at https://zenodo.org/records/13381283, while the real dataset is available at the IRD Dataverse: https://doi.org/10.23708/8FXUNC. We added this in the main text line ~283 and line ~361.

Minor Comments
------------

- Figure 2 shows that the performance of the algorithm is poor for data with high error rates. So even if the correct error rates are estimated, it does not help in generating accurate results and the tool depends on the underlying error rates. This is not a very big issue since the tool primarily deals with Illumina reads, which have very low error rates (as stated in lines 354-357). But future ideas related to extending these tools to error-prone long reads will be unsuccessful based on these results. Is it possible to generalize the model further to allow high error rates?

> Inde

Attachments
Attachment Submitted filename: Response to Reviewers.pdf
https://doi.org/10.1371/journal.pone.0314759.r002
21 Oct 2024 Decision Letter - Andrea Tangherloni, Editor

PONE-D-24-15414R1
Fast and accurate imputation of genotypes from noisy low-coverage sequencing data in bi-parental populations
PLOS ONE

Dear Dr. Lorieux,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The authors addressed the main Reviewers’ comments; however, a couple of concerns remain that must be considered. For instance, I agree with Reviewer 2 that the authors should improve the repository documentation. Please consider the two minor comments from Reviewer 1.

Please submit your revised manuscript by Dec 05 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.
If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Andrea Tangherloni
Academic Editor
PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

The authors addressed the main Reviewers’ comments; however, a couple of concerns remain that must be considered. For instance, I agree with Reviewer 2 that the authors should improve the repository documentation. Please consider the two minor comments from Reviewer 1.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1.
If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #1: All comments have been addressed Reviewer #2: All comments have been addressed ******** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes ****** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes ****** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ****** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. 
Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

******

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1:

- I think this sentence in the abstract has issues and should be rephrased: "NOISYmputer is particularly convincing when comparing the genetic map obtained on real datasets as it estimates accurately the size of the map where the two other software can mistake from hundred to hundred thousands centimorgans."
- A link to the data in Zenodo should be added under "Availability".
- The code repository would benefit from better documentation of the list of output files that are generated. Multiple output and intermediate files are generated, but no explanation of what each file represents has been provided.

Reviewer #2: I still think there would be room for improvement in the presentation, and I can agree with the request from reviewer 3 for additional references. I think even tools such as R/QTL and the original Lander-Botstein model are relevant in the sense that inferring trait status and outright imputation are mirror problems, especially in the biparental case.

******

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No Reviewer #2: Yes: Carl Nettelblad ******** [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. https://doi.org/10.1371/journal.pone.0314759.r003
Revision 2
23 Oct 2024 Author Response Please see the "Response to Reviewers revision 2.pdf" file Attachments Attachment Submitted filename: Response to Reviewers revision 2.pdf https://doi.org/10.1371/journal.pone.0314759.r004
18 Nov 2024 Decision Letter - Andrea Tangherloni, Editor

Fast and accurate imputation of genotypes from noisy low-coverage sequencing data in bi-parental populations
PONE-D-24-15414R2

Dear Dr. Lorieux,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Andrea Tangherloni
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1.
If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #1: All comments have been addressed ******** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes ****** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes ****** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes ****** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. 
Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

******

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

******

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

********

https://doi.org/10.1371/journal.pone.0314759.r005
Formally Accepted
27 Nov 2024 Acceptance Letter - Andrea Tangherloni, Editor

PONE-D-24-15414R2
PLOS ONE

Dear Dr. Lorieux,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited
* All relevant supporting information is included in the manuscript submission
* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of Dr. Andrea Tangherloni
Academic Editor
PLOS ONE

https://doi.org/10.1371/journal.pone.0314759.r006

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.