Towards mouse genetic-specific RNA-sequencing read mapping

Nastassia Gobet; Maxime Jan; Paul Franken; Ioannis Xenarios

doi:10.1371/journal.pcbi.1010552

Peer Review History

Original Submission
January 19, 2022
24 Mar 2022 Decision Letter - Ilya Ioshikhes, Editor, Sonika Tyagi, Editor Dear Prof. Xenarios, Thank you very much for submitting your manuscript "Towards mouse genetic-specific RNA-sequencing read mapping" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. 
We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.
Sincerely,
Sonika Tyagi, Guest Editor, PLOS Computational Biology
Ilya Ioshikhes, Deputy Editor, PLOS Computational Biology
*********************
Reviewer's Responses to Questions
Comments to the Authors: Please note here if the review is uploaded as an attachment.
Reviewer #1: In this manuscript, Gobet et al. examine the impact that the choice of reference genome and existing genetic variation have on downstream analyses. It is an interesting study with potential implications for downstream analyses in model organisms. The code is available and the methods are well described. While it is interesting, I found that the manuscript requires clearer descriptions of the experiment, more examples of types of errors, and further discussion about what the results mean for both inbred and outbred organisms.
Major comments:
1) Experimental design: While I was eventually able to understand the design, the manuscript would benefit from clearer text and a flowchart describing the analysis. Figure 1B attempts this but the legend does not sufficiently detail the image. This should be improved and presented at the outset. Further, the ‘hybrid’ strategy (BXD-specific references) is poorly described, and I am unclear what analyses, if any, were performed using this strategy.
2) The authors conclude that the modified B6 reference is superior to the two-reference approach; however, this potentially just reflects the lower quality of the D2 reference, as you later mention. On L454 you state “However, we showed that the gap in quality and completeness between the two parental assemblies is masking the genetic specificity”. While this may be true, you do not demonstrate this in the paper. In order to make any such conclusions, detailed information on the quality of the two references is required.
3) In Fig S2 you provide a few examples of read mismapping largely due to differences in the assembly and annotation. While these are useful, it would be better to try to broadly classify the types of issues encountered via a diagram or table. For example, are gene families or repeat-element regions prone to mismapping? Perhaps looking at the cases in Figure 4 that actually change the eQTL slope would be insightful.
4) Many factors are known to have a significant impact on read mappability, including the choice of aligner and read length. More discussion is needed on such factors and what (if any) impact they would be expected to have on the results of this study.
5) What is the impact on mappability of SNVs compared to small indels? While SNVs are more numerous, indels are more likely to affect mapping. What portion of mismapped reads is attributable to each variant class?
6) You recommend modifying mm10 with strain-specific variants to achieve the best results. This is not a trivial undertaking, however. Does your code offer this functionality?
7) The study describes work on inbred mice; what would your recommendations be for human as an outbred example? Or is your approach only suitable for inbred organisms?
Minor comments:
1) L129: “Assembly refers to genomic sequences assembled from DNA reads to form chromosomes whereas reference simply refers to the sequences the reads are compared to during read mapping” -> This needs to be clearer.
2) L325 “we modified the B6 reference assembly using SNPs and indels specific to the D2 strain from dbSNP. We mapped parental and F1 samples on these two assemblies with exact matches.” -> what 2 assemblies, mm10 and modified?
3) L332: “D2 samples gained between 0.0% and 3.4%” -> You don’t gain 0%; also, why were there no gains in this instance?
4) In Figure 5 you aim to maximise eQTLs. Can you explain why more eQTLs are indicative of better performance?
5) L434 “However, even though it has been shown to importantly impact genetic specificity (Yuan et al., 2015)” -> What exactly is meant by “genetic specificity”? Is it reads uniquely mapped? Regardless, it needs to be clearly defined.
6) L449: “Indeed, each BXD sample has both B6 and D2 alleles, so the samples compared for each gene are not split the same way, depending on alleles at this locus.” -> What does “depending on alleles at this locus” mean?
I found a few typos / poorly worded sentences throughout. A few are highlighted below, although I would suggest a thorough proofread.
Example typos and sentence structures:
L140: “sequences are not containing alternative haplotypes” -> do not contain
L166: “only one BXD lines” -> line
L384: “However, we assumed that globally B6 alleles are equally expressed than D2 alleles. Moreover, since our samples are inbred lines, the heterozygous sites are greatly rarer” -> Two errors here
L427: “However, this does not guaranty” -> guarantee
L428: “due to e.g. redundancy in the genome” -> not sure what this means
Reviewer #2: This manuscript describes the importance of the reference genome chosen when aligning short-read data for subsequent analysis in RNAseq experiments. The insight from the recombinant inbred strains is particularly interesting. The authors present information on how this may skew findings, especially the results of the differential mapping analysis (Fig 2D and 3B). However, some basic analyses need to be performed to increase the credibility of the results, especially surrounding the mappability analyses. Also, the manuscript has significant problems in execution, namely the methods. Some of the points raised in the discussion could actually be quickly tested by the authors with existing data.
Methods: The methods are not publication-quality and are confusing. These need to be re-written for improved understandability and credibility.
Results: At line 287, the authors note surprising deviations from their expectations of mappability of reads from B6, D2 and BXD lines. One possible explanation that is not addressed is mix-ups of individual mice and/or tubes when the original tissue samples were collected. Also, we have seen cross-contamination of samples introduced at sequencing facilities, even including the identification of reads from different mammalian species. The transcriptomic data are from pooled samples from many individual mice, and an accidental inclusion of one or more mice from the wrong strain in a pooled sample would go some way to explain the results the authors obtained. These sample mix-ups are very common, even in the largest and most automated genome centres. It would be possible to search for reads that contain genotypes that are specific to either B6 or D2 and count the relative presence of these in each sample pool. The presence or absence of contaminating reads (potentially introduced even from the sequencing step) would be instructive to the following results.
On line 290 the authors also describe mappability differences between the cortex and liver samples in the same lines. That mappability should vary between tissues from the same organisms points to a technical difference in the preparation of the samples. The reasons for the differences may be trivial, but the authors need to address this further. For example, are differences in sequencing depth or increased sequencing error important for the mappability differences that are a main focus of the manuscript?
The mappability of reads seems to have been assessed in two ways: i) exact matches, and ii) up to ten mismatches. Both might be problematic. Generally, the rate of short-read sequencing error is ~1 nucleotide per 100; in the 100 bp reads they have obtained, most will have a sequencing error. So, requiring exact matches will be too strict for many otherwise-mappable reads.
However, allowing 10 mismatches will not resolve reads that could be mapped to multiple members of gene families. Given the importance of this to the following analysis, some optimisation of the allowed mismatches in alignment is essential, as would be the inclusion of a comparison with a different aligner. The “Mapping parameters evaluation” section is interesting, but does not address the need to optimise the number of allowed mismatches.
Discussion: On line 438 the authors note the need for more investigation of tissue effects and point to existing data from GTEx. Given the relevance of this to the current manuscript, the authors might judiciously choose a small number of example datasets from GTEx that include cortex/liver comparisons, investigate these for tissue effects themselves, and include this in the manuscript.
On line 479 of the discussion, the point is made about the genetic drift of inbred lines (10 to 30 germline mutations per generation) and the relevance this has to mappability. This is true, yet de novo variation accumulated since the production of a reference sequence seems a red herring. The variation mentioned is unlikely to be problematic for read mapping so soon after the reference genomes have been sequenced. Even after a decade and assuming a very high 10 generations per year, this is only 10 x 10 x 30 = 3000 de novo variants distributed among 3 gigabases of genome. This may very likely cause phenotypic divergence. Yet, for alignment of RNAseq data, this will be only around 30 coding de novo variants; hence only a little more than 1 in 1000 genes is likely to harbour a variant that has occurred since the production of the reference sequence for the same strain.
******
Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: None
Reviewer #2: Yes
******
PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: No
Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.
Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript.
Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.
Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols
https://doi.org/10.1371/journal.pcbi.1010552.r001
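Two back-of-the-envelope figures in Reviewer #2's comments above (the share of 100 bp reads carrying a sequencing error, and the number of de novo variants accumulated in a decade) can be checked with a few lines of Python. This is an illustrative sketch only and is not part of the review. It assumes independent per-base errors at the quoted rate of about 1 in 100. The coding fraction of 1% is inferred from the reviewer's "30 out of 3000", and the gene count of 22,000 is an assumed round value; both names are hypothetical.

```python
def p_read_has_error(read_length: int, per_base_error: float) -> float:
    """Probability that a read has at least one sequencing error,
    assuming errors are independent across bases (a simplification)."""
    return 1.0 - (1.0 - per_base_error) ** read_length


def de_novo_variants(years: int, generations_per_year: int,
                     mutations_per_generation: int) -> int:
    """Total de novo variants accumulated, as in the reviewer's 10 x 10 x 30."""
    return years * generations_per_year * mutations_per_generation


# Figures quoted in the review: ~1 error per 100 nt, 100 bp reads.
p_error = p_read_has_error(read_length=100, per_base_error=0.01)

# Figures quoted in the review: a decade, 10 generations/year, 30 mutations/generation.
total_variants = de_novo_variants(10, 10, 30)

# Assumptions, not stated in the review:
CODING_FRACTION = 0.01   # implied by "around 30 coding" out of 3000
N_GENES = 22_000         # assumed round number of mouse genes

coding_variants = total_variants * CODING_FRACTION
fraction_genes_hit = coding_variants / N_GENES

print(f"P(read has >= 1 error)   = {p_error:.3f}")
print(f"Total de novo variants   = {total_variants}")
print(f"Coding de novo variants  = {coding_variants:.0f}")
print(f"Genes affected per 1000  = {1000 * fraction_genes_hit:.2f}")
```

Under these assumptions, roughly 63% of reads contain at least one error, which is in line with the remark that exact matching is strict, and the affected genes come to a little more than 1 in 1000.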
Revision 1
27 Jun 2022 Author Response Attachments Attachment Submitted filename: BXDmapping_answersreviewerscomments.pdf https://doi.org/10.1371/journal.pcbi.1010552.r002
18 Jul 2022 Decision Letter - Ilya Ioshikhes, Editor, Sonika Tyagi, Editor
Dear Prof. Xenarios,
Thank you very much for submitting your manuscript "Towards mouse genetic-specific RNA-sequencing read mapping" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.
Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.
When you are ready to resubmit, please upload the following:
[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.
[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).
Important additional instructions are given below your reviewer comments.
Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.
Sincerely,
Sonika Tyagi, Guest Editor, PLOS Computational Biology
Ilya Ioshikhes, Deputy Editor, PLOS Computational Biology
*********************
A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK]
Reviewer's Responses to Questions
Comments to the Authors: Please note here if the review is uploaded as an attachment.
Reviewer #1: My concerns have been adequately addressed
Reviewer #2: Uploaded as an attachment
******
Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
Reviewer #1: Yes
Reviewer #2: Yes
******
PLOS authors have the option to publish the peer review history of their article. Do you want your identity to be public for this peer review?
Reviewer #1: No
Reviewer #2: No
References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.
If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. Attachments Attachment Submitted filename: Revised_manuscript_comments_220713.pdf https://doi.org/10.1371/journal.pcbi.1010552.r003
Revision 2
18 Aug 2022 Author Response Attachments Attachment Submitted filename: Revised_manuscript_comments_answers2.pdf https://doi.org/10.1371/journal.pcbi.1010552.r004
7 Sep 2022 Decision Letter - Ilya Ioshikhes, Editor, Sonika Tyagi, Editor
Dear Prof. Xenarios,
We are pleased to inform you that your manuscript 'Towards mouse genetic-specific RNA-sequencing read mapping' has been provisionally accepted for publication in PLOS Computational Biology.
Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.
IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.
Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.
Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.
Best regards,
Sonika Tyagi, Guest Editor, PLOS Computational Biology
Ilya Ioshikhes, Section Editor, PLOS Computational Biology
***********************************************************
https://doi.org/10.1371/journal.pcbi.1010552.r005
Formally Accepted
22 Sep 2022 Acceptance Letter - Ilya Ioshikhes, Editor, Sonika Tyagi, Editor
PCOMPBIOL-D-22-00085R2
Towards mouse genetic-specific RNA-sequencing read mapping
Dear Dr Xenarios,
I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.
The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.
Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.
Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!
With kind regards,
Zsanett Szabo
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
https://doi.org/10.1371/journal.pcbi.1010552.r006

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.