Peer Review History

Original Submission
November 1, 2019
Decision Letter - Christos A. Ouzounis, Editor, William Stafford Noble, Editor

Dear Dr Bracht,

Thank you very much for submitting your manuscript 'Regional sequence expansion or collapse in heterozygous genome assemblies' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days. If you anticipate any delay in its return, we ask that you let us know the expected resubmission date by email at ploscompbiol@plos.org. Revised manuscripts received beyond 60 days may require evaluation and peer review similar to that applied to newly submitted manuscripts.

In addition, when you are ready to resubmit, please be prepared to provide the following:

(1) A detailed list of your responses to the review comments and the changes you have made in the manuscript. We require a file of this nature before your manuscript is passed back to the editors.

(2) A copy of your manuscript with the changes highlighted (encouraged). We encourage authors, if possible, to show clearly where changes have been made to their manuscript, e.g., by highlighting text.

(3) A striking still image to accompany your article (optional). If the image is judged to be suitable by the editors, it may be featured on our website and might be chosen as the issue image for that month. These square, high-quality images should be accompanied by a short caption. Please note as well that there should be no copyright restrictions on the use of the image, so that it can be published under the Open-Access license and be subject only to appropriate attribution.

Before you resubmit your manuscript, please consult our Submission Checklist to ensure your manuscript is formatted correctly for PLOS Computational Biology: http://www.ploscompbiol.org/static/checklist.action. Some key points to remember are:

- Figures uploaded separately as TIFF or EPS files (if you wish, your figures may remain in your main manuscript file in addition).

- Supporting Information uploaded as separate files, titled Dataset, Figure, Table, Text, Protocol, Audio, or Video.

- Funding information in the 'Financial Disclosure' box in the online system.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, see http://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-materials-and-methods.

We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us.

Sincerely,

Christos A. Ouzounis

Associate Editor

PLOS Computational Biology

William Noble

Deputy Editor

PLOS Computational Biology

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: OVERVIEW

The development of efficient and inexpensive next-generation genome sequencing has enabled an explosion of new genome sequences for 'non-model' organisms. Such organisms are either not studied much in laboratories as a matter of past custom, or not practically feasible for such study, but their genomes are of biological importance nevertheless. By their nature, such organisms are often outbred, with substantial genomic diversity. Even for the small sample populations used to extract the genomic DNA used for sequencing and assembly, levels of genetic heterogeneity can reach the 'hyperdiverse' level (e.g., 7% variation in non-selected nucleotides of the nematode Caenorhabditis brenneri), and the data almost always contain substantial amounts of unresolved allelism.

Asalone et al. have recently characterized the genome of a nonmodel but biologically interesting subterranean nematode, Halicephalobus mephisto. In so doing, they have identified a potential artifact of genomic analysis that to my knowledge has not been previously described: depending on fine details of the genome assembly programs and parameters used, different regions of the genome encoding gene families with biologically interesting functions can assemble in two different ways. They can either assemble so that two or more alleles are compressed in silico into a single sequence, or instead be assembled so that two or more alleles are artifactually linked in a tandem sequence array. Given heterozygosity throughout a genome, such variable compression or tandem expansion can have a visible effect on what genes are predicted for a genome, with expansion or compression of a gene family leading to its biological function being scored as over- or under-represented among the protein-coding genes of that genome.

The authors compare assemblies from Platanus versus SOAPdenovo2 versus their best reference assembly (generated by Platanus with PacBio long-read superscaffolding). They observe a general tendency for genome assembly regions with lower polymorphism (assayed by raw Illumina reads mapped to a given assembly) to correlate with greater length. They observe striking differences between assemblies in how heterozygous regions are represented, despite those assemblies having broadly similar total lengths. The authors conclude that, in assessing the quality of a genome assembly, it is not sufficient to look at the size-weighted median of its scaffold or contig sizes (i.e., its N50 score); it is also advisable to assess its degree of sequence coverage and heterozygosity, with caution being exercised for regions of the assembly showing abnormally high or abnormally low heterozygosity (and, in parallel, abnormally low or abnormally high coverage).

One general result: given heterozygous sequence data, Platanus seriously outperforms SOAPdenovo2. The numbers in Supplemental Table 1 make that quite clear. Although the authors do not provide results for other mainstream short-read assemblers comparable to SOAPdenovo2 (e.g., ABySS 2), their results make it advisable that researchers assembling short reads from a heterozygous organism use a heterozygosity-aware program such as Platanus.

Although the general points of the paper are well-taken, I have some specific questions and caveats about it, along with some suggested revisions.

SPECIFIC QUESTIONS AND CAVEATS

An inconspicuous-looking point of the Methods may be driving nontrivial amounts of the differences between how different genome assemblies are scored for completeness: the authors have imposed a minimum scaffold/contig size of 1,000 nt for all of their competing assemblies. This is likely to be harmless for those assemblies with high N50 values, but may be leading to substantial losses of sequence information for those four assemblies with low N50 values (Platanus step-size 1, N50 = 2.8 kb; SOAPdenovo2 k-mer 23, N50 = 2.7 kb; SOAPdenovo2 k-mer 47, N50 = 1.9 kb; and SOAPdenovo2 k-mer 63, N50 = 1.9 kb) -- particularly when one considers that the N50 values given in Fig. 1 and Supp. Tab. 1 were computed for these assemblies *after* scaffolds/contigs of <1 kb had been discarded. If the authors had performed their analyses on assemblies that had had a less stringent minimum size filter (such as 200 nt), how much would the downstream results change? This question clearly has to affect BUSCO scores (Figure 1), but could conceivably also affect evidence-based annotation of genes (Figure 3) and homology of genes to other genes (Figure 4), since assemblies with low N50 values are likely to have fragmented or partial gene predictions.

At crucial points in their Methods -- specifically, when they compute heterozygosity levels for an entire genome assembly, or for particular genes within that assembly -- the authors invoke nameless "custom python scripts". Given the central importance of this computation to their work, this is entirely unacceptable. Each Python script used in the work must be given a name in the Methods and must be explicitly available through either github or some equivalently useful public software repository. Note: I am aware that the authors have written "All python scripts are available from the github database (repository: "Name TBD").", but that is not enough!

The authors cite results based on 11 alternative (non-reference) genome assemblies for H. mephisto. It would be preferable if these genome assemblies were themselves publicly available in some data repository. One data repository that works quite well for permanent archiving of such data is the Open Science Framework (https://osf.io). Other options are Figshare (https://figshare.com) and Zenodo (https://zenodo.org).

The authors have devised their own tools for making either genome-wide or regional estimates of nucleotide heterozygosity. This is ingenious and potentially valuable to other researchers. However, there already exists a published open-source program for estimating overall heterozygosity of a given organism directly from that organism's raw Illumina sequence read set: GenomeScope (https://github.com/schatzlab/genomescope.git and https://academic.oup.com/bioinformatics/article/33/14/2202/3089939). I think it would be highly desirable for the authors to compute heterozygosities for H. mephisto from their raw Illumina sequence reads using either GenomeScope or an equivalent k-mer analysis tool, and then to compare the heterozygosity score generated with one of these tools against their own results.
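For illustration, GenomeScope only requires a k-mer histogram computed from the raw reads (the two-column "coverage count" format produced by jellyfish histo). The minimal Python sketch below shows how such a histogram could be generated; in practice a dedicated k-mer counter such as Jellyfish or KMC should be used for speed and memory, and the read file name and k-mer size here are purely illustrative.

```python
# Minimal, deliberately simple sketch: count canonical k-mers in a FASTQ file
# and write a "coverage count" histogram suitable as GenomeScope input.
# Illustrative only -- a real dataset should be counted with Jellyfish or KMC.
import gzip
from collections import Counter

K = 21  # k-mer size commonly used with GenomeScope

def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def kmer_histogram(fastq_path, k=K):
    counts = Counter()
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 != 1:  # FASTQ sequence lines only
                continue
            seq = line.strip().upper()
            for j in range(len(seq) - k + 1):
                kmer = seq[j:j + k]
                if "N" in kmer:
                    continue
                # canonical k-mer: collapse the two strands together
                counts[min(kmer, revcomp(kmer))] += 1
    # histogram: number of distinct k-mers observed at each multiplicity
    return Counter(counts.values())

if __name__ == "__main__":
    histo = kmer_histogram("reads_R1.fastq.gz")  # hypothetical read file
    with open("reads.histo", "w") as out:
        for cov in sorted(histo):
            out.write(f"{cov} {histo[cov]}\n")
    # reads.histo can then be passed to GenomeScope / genomescope.R
```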

SUGGESTED REVISIONS

The authors had no page numbers in their manuscript. Next time, please have them! Page numbers in manuscripts help readers (even though the readers in this case will be a small number of editors and reviewers). In this case, for clarity while reviewing, I am providing page numbers using my own count (with the title and abstract being on page 1).

Page 4 --

"(Borgonie et al., 2011)": although cited in the text, this was not included in the References on pp. 18-22. I assume that the authors meant Borgonie et al. (2011), Nature 474, 79-82, PubMed 21637257. Please add this reference to the References; more importantly, please proofread the entire manuscript to ensure that there are no other missing references cited anywhere.

Page 5 --

Legend for Figure 1: "N50, heterozygosity, and the BUSCO results." I would prefer something like "N50 (in nt), heterozygosity (as defined in Methods), and the BUSCO results." As it stands, the reader is left to guess what the measurement unit for N50 is, and to wonder where the heterozygosity comes from. It will be good for readers to understand that the authors are using their own methods of computing heterozygosity rather than using previously published methods.

Page 6 --

"We found that N50 is highly correlated with evidence-supported genes predicted..." What are the mean and median sizes of protein-coding sequences for these genes, and how do they vary with respect to assembly N50? It is a long-known problem in genome analysis that assemblies with low N50 values result in gene predictions that are fragmentary or partial; fragmentary or partial gene predictions, in turn, may lower the rate at which genes are scored as evidence-supported. (The same caveat also applies to Figures 3 and 4, which are cited at this point in the text.)

"However, we found that..." To avoid awkwardly starting two sentences in a row with "However", I suggest that this instance of "However" be replaced with something like "Nevertheless".

Page 7 --

"we extracted the second-largest group of proteins": why was the *second*-largest group chosen? Why not the first, or the third? The answer could go here or in Methods.

Page 10 --

The authors write: "We would predict that if an assembler maximally 'spreads out' the variation within a dataset into distinct contigs, coverage and length assembled would go up, while heterozygosity would go down as the reads are able to find their perfect match."

Unless I have misunderstood the argument of this paper badly, this is not quite correct, and they should have instead written: "We would predict that if an assembler maximally 'spreads out' the variation within a dataset into distinct contigs, length assembled would go up, while coverage and heterozygosity would go down as the reads are able to find their perfect match."

Page 11 --

"smaller than 1kb" should be "smaller than 1 kb" (i.e., do not fuse a number and its measurement unit).

Page 12 --

"These assembly variations are not easily detected particularly when assembling a genome for the first time" should read "These assembly variations are not easily detected, particularly when assembling a genome for the first time".

Pages 12 and 13 --

"sequences lower than 1000bp were removed prior to subsequent analysis", and "Sequences smaller than 1000bp were removed from these assemblies prior to downstream analysis". First, replace '1000bp' with '1000 bp'. Second, this filtering step can have strong and differential effects on genome assembly analysis. Consider the assembly N50s listed in both Figure 1 and Supplemental Table 1. For the reference genome (N50 = 313 kb), the effect of discarding scaffolds or contigs of under 1 kb will be slight -- almost all of the assembly will be over that threshold anyway. However, for four of the most fragmented genome assemblies (Platanus step-size 1, N50 = 2.8 kb; SOAPdenovo2 k-mer 23, N50 = 2.7 kb; SOAPdenovo2 k-mer 47, N50 = 1.9 kb; and SOAPdenovo2 k-mer 63, N50 = 1.9 kb), filtering out sequences of <1 kb is likely to be substantially depleting genomic contents -- particularly since these low N50s were presumably computed *after* sequences of <1 kb had been filtered out.

Given that the authors observe profound drops in their %BUSCO scores for these very same four assemblies (Figure 1 and Supp. Table 1), it is difficult not to suspect that they might have observed significantly better %BUSCO scores if they had adopted a somewhat smaller minimum scaffold/contig size (say, 200 nt instead of 1,000 nt). That, in turn, raises the question of how many *other* results in this paper would be significantly changed if the minimum size had been so lowered.
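To make this concern concrete, the total span and N50 of each assembly could be recomputed under several minimum-length cutoffs with a few lines of Python; the sketch below assumes a plain FASTA assembly file, and the file name and cutoffs are illustrative only.

```python
# Sketch: recompute sequence count, total span, and N50 of an assembly FASTA
# after filtering at different minimum scaffold/contig lengths.
def read_fasta_lengths(path):
    lengths, current = [], 0
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths

def n50(lengths):
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

if __name__ == "__main__":
    lengths = read_fasta_lengths("assembly.fasta")  # hypothetical assembly file
    for cutoff in (0, 200, 1000):                   # minimum sizes to compare
        kept = [l for l in lengths if l >= cutoff]
        print(f"min {cutoff} nt: {len(kept)} sequences, "
              f"{sum(kept) / 1e6:.1f} Mb, N50 = {n50(kept)} nt")
```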

Page 14 --

"H. Mephisto" should read "H. mephisto".

Page 15 --

"SamTools" should probably be written "Samtools" (following how it is written on the author's main software page -- see http://www.htslib.org).

"BCFTools" should be written "BCFtools" (again, following http://www.htslib.org).

"variants were called using the mpileup and call function" should read 'functions', not 'function'; also, from exactly which software suite were these functions taken? The way the sentence is written, it is not clear whether they are from SAMtools or BCFtools.

"10kb" should be "10 kb".

Page 16 --

"(Note that the dynamics..." starts with a parenthesis ['('], but does not close with one [')'].

Figure 1 --

Please revise the header "N50" to "N50 (nt)", so that the reader knows what size the N50s are in.

Please *add* a column for total genome assembly sizes (i.e., total genome assembly lengths). I know that these data are in Supplemental Table 1, but I think they would be significantly useful in Figure 1, which is what most readers will see. The genome assembly sizes should be rounded to the nearest 0.1 Mb, and the header should be something like "Genome size (Mb)".

Figures 3 and 4 --

For genes predicted in the various H. mephisto assemblies, these two figures show quite different rates of evidence-association (as scored by MAKER; Figure 3) and homology to other genes (as scored by OrthoMCL; Figure 4). The authors note that different assemblies can have similar numbers of predicted genes, but quite different values for evidence-association or homology. However, they do not show whether these genes vary in the mean or median length of their protein-coding sequences; yet it is quite likely that the four genome assemblies with lowest N50 values (under 3.0 kb) will have significant numbers of truncated or partial gene predictions, which may well affect both assays. I would like to see the authors address this point in some reasonable way.

Figure 4 --

This figure shows different assemblies as "Platanus", "Platanus and PacBio", or "SOAPdenovo2". However, I would prefer to have individual labels next to each glyph, specifying exactly which assembly is associated with each data point in the figure (for instance, *which* Platanus assembly gave rise to the unpromising data point with only ~7,250 predicted proteins and ~0.93 proportion grouped?).

Also, the x-axis lists "proteins". However, not all gene prediction methods give exactly one predicted protein isoform per gene; my guess is that there is such a relationship, in this instance, but my guess could be wrong. The authors should make it clear in the legend for this figure that there is (or, is not) a one-to-one relationship between proteins in this figure and genes predicted in the various assemblies.

Supplemental Table 2 --

Here, it would be good to add a column for the value "Observed/Expected" (i.e., the ratio of the existing "Observed Hits" and "Expected Hits" columns.) Adding such a column would allow readers to sort the Excel spreadsheet by this ratio, and thus get a clear view of which particular PANTHER functions are either most overrepresented or most underrepresented by the various genome assemblies. (They can already use the 'sort' function in Excel to reorder the PANTHER functions by ascending "Raw P-value" scores, and thus get a clear view of which over- or under-represented functions are most statistically significant.)
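For example, if the PANTHER results are exported as a spreadsheet, the ratio column can be added with a few lines of Python/pandas; the file and column names below are assumptions and should be adjusted to match the actual table.

```python
# Sketch: add an Observed/Expected ratio column to the PANTHER results and
# sort by it, so the most over- and under-represented categories are adjacent.
import pandas as pd

df = pd.read_excel("Supplemental_Table_2.xlsx")  # hypothetical file name
df["Observed/Expected"] = df["Observed Hits"] / df["Expected Hits"]
df.sort_values("Observed/Expected", ascending=False).to_excel(
    "Supplemental_Table_2_with_ratio.xlsx", index=False)
```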

Reviewer #2: In this manuscript, Asalone et al. examine the effects of assembler choice and parameter values on genome assembly of diploid genomes with high levels of heterozygosity. Specifically, they examine assemblies generated for a nematode species, Halicephalobus mephisto, using two different assemblers (Platanus and SOAPdenovo2) with various parameter settings. Assemblies are compared with a reference assembly generated using additional PacBio data and the Platanus assembler. Assemblies are evaluated with BUSCO, alignments with the reference genome, numbers of predicted protein-coding genes, and enrichment/depletion analysis of protein function groups with respect to the reference genome protein set. The overall conclusion of this work is that assemblies can vary significantly in erroneously expanded or contracted regions even if other measures of assembly quality are consistently good.

The topic of assembly accuracy in the presence of high heterozygosity is an important one, and thus this is a welcome contribution. Whereas the overall conclusions of the paper are supported by the experiments, I found the experiments and methods to be confusing and perhaps overly complicated.

Specific comments:

1. Nowhere in the manuscript is there a description of the underlying data that were assembled. After some digging through the references, I'm assuming it was the Illumina data described in Weinstein et al. 2019, but this needs to be clear and explicit in this paper. There is also mention of "RNA from H. mephisto", by which I'm assuming the authors mean RNA-seq data, but there is no description of these data anywhere.

2. It seems troubling that one of the assemblers evaluated was the same one used to generate the "reference" assembly. And as I understand it, PacBio reads were only used for scaffolding this reference assembly, and not for constructing the original contigs, and thus erroneous expansions or contractions made by Platanus on the Illumina data are not necessarily corrected by the PacBio data in this reference assembly. This issue needs clarification and discussion in this manuscript. In particular, an assembly that appears to have an enrichment or depletion of a certain protein functional group relative to the reference is not necessarily less accurate, because the reference may (perhaps equally likely) be in error with respect to this group.

3. The evaluation of expansion/contraction via enrichment/depletion of functional groups seems more indirect and complicated than necessary. Why not simply align the genomes (gene sets) pairwise to the reference and quantify how many genes/regions are expanded/contracted with respect to the reference? One would expect only expansion/contraction of highly-similar sequences, not of broad functional categories of proteins.

4. There is no logic given for why an assembly with a high (or low?) proportion of grouped proteins by OrthoMCL would be better/worse than another assembly.

5. Please provide a definition for an "evidence-supported gene".

6. I have never heard of an isolog or iso-ortholog. Perhaps simply one-to-one ortholog can be used instead.

7. Please describe early on how heterozygosity is defined/measured in these genome assemblies.

8. Fig 2 - the dot plots are not very informative. They would be greatly improved if assembly contigs were ordered and oriented according to the reference.

9. Fig 5 - there are so few points here - just show the points instead of a box plot.

10. The Borgonie et al. 2011 reference seems to be missing.

11. Benchmarking *University* Single Copy Orthologs => universal

12. The GitHub link to the software/scripts used is not provided.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Revision 1

Attachments
Attachment
Submitted filename: Response to Reviews of PLOS Computational Biology.pdf
Decision Letter - Christos A. Ouzounis, Editor, William Stafford Noble, Editor

Dear Dr. Bracht,

Thank you very much for submitting your manuscript "Regional sequence expansion or collapse in heterozygous genome assemblies" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. 

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Christos A. Ouzounis

Associate Editor

PLOS Computational Biology

William Noble

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I am fully satisfied with the response of the authors to my previous comments, and have no further questions or suggested revisions.

Reviewer #2: With this revision, the authors have satisfactorily addressed the majority of my previous comments. However, I continue to be of the opinion that a number of the analyses in this work are rather indirect and difficult to interpret.

1. In particular, the PANTHER analysis of enrichment/depletion of protein functional categories and the OrthoMCL grouping analysis are hard to interpret with regard to the quality of the assemblies. Consider one protein-coding gene in the reference genome and its assembly with one of the alternative assemblers or assembler parameters. There are many ways in which this gene might be assembled, but consider two simple erroneous cases: (1) the gene has two copies in the assembly, and (2) the gene is fragmented into two non-overlapping pieces. In both cases, assuming protein-coding components can be detected in all contigs, there is an effective doubling of the gene, but only the former is truly an "expansion" of the gene in the assembly. It does not seem that either the PANTHER or OrthoMCL analyses can distinguish between these possibilities, and thus the interpretation of their results is difficult. The OrthoMCL analysis is particularly hard to understand because an assembly that erroneously produced two copies of every gene would result in 100% grouping (because the two copies of each gene would fall into the same group), whereas an assembly that fragments each gene into many non-overlapping pieces that cannot be confidently aligned would have a much lower % grouping. This seems to be a roundabout way of assessing fragmentation but says little about expansion/contraction, which is the focus of the manuscript.

2. The LAST analysis (alignment of each assembly to the reference) and associated Figure 2 is a much more direct and easier to interpret method of understanding expansion/contraction in an assembly compared to a reference. I recommend that the authors expand on this analysis. Briefly, LAST can be used to identify the *single best place* in the reference to align each component of an assembly. I believe the authors are already using LAST for this purpose. Then, for each position in the reference genome, one can count how many positions in the assembly are aligned to it. The distribution of these counts is highly informative: the positions with zero alignments are "missing" (perhaps due to contraction) and positions with more than one alignment are duplicated/expanded in the assembly. This should be simple to implement (see the sketch after point 3 below) and more directly assesses expansion/contraction/missing-ness than much of the rest of the analyses.

3. Related to point 2 above, Figure 2 is quite important and could be improved. With a few tweaks, it can visually display expansions ("steeper" diagonals) and contractions/missing-ness ("less steep" diagonals). Suggested improvements are:

a. Clarify whether this is for the 200 bp or 1,000 bp filtered assemblies. I would suggest using the 200 bp assemblies so that one can still see if an assembly is relatively "complete" even if highly fragmented.

b. Keep the x-axis constant across all plots. It currently seems to be changing slightly between plots, which is misleading. All contigs in the reference should be plotted such that contigs that are missing in the assembly can be seen.

c. Include all contigs in the assembly on the y-axis, regardless of whether they have an alignment to the reference. That way one can visually see (1) how large the assembly is and (2) the fraction of the assembly that doesn't align anywhere in the reference.

d. Make sure the scales are the same on both x and y axes. I believe this may already be the case, which is great. This is important for interpreting the "steepness" of the diagonals.
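For concreteness, here is a minimal Python sketch of the per-position tally described in point 2 above. It assumes the best-in-reference alignments (e.g., from LAST followed by last-split) have already been reduced to a BED-like table of "reference_contig start end" intervals, and that reference contig lengths are available from a FASTA index (.fai); all file names are illustrative.

```python
# Sketch: for each reference base, count how many assembly positions align to
# it; bases with count 0 are missing from the assembly, bases with count > 1
# are duplicated/expanded in the assembly.
import numpy as np

def load_ref_lengths(fai_path):
    lengths = {}
    with open(fai_path) as fh:
        for line in fh:
            name, length = line.split("\t")[:2]
            lengths[name] = int(length)
    return lengths

def alignment_depth(bed_path, ref_lengths):
    # one zero-initialized counter per reference base
    depth = {name: np.zeros(length, dtype=np.int32)
             for name, length in ref_lengths.items()}
    with open(bed_path) as fh:
        for line in fh:
            name, start, end = line.split()[:3]
            depth[name][int(start):int(end)] += 1
    return depth

if __name__ == "__main__":
    ref_lengths = load_ref_lengths("reference.fasta.fai")  # hypothetical paths
    depth = alignment_depth("assembly_vs_reference.bed", ref_lengths)
    counts = np.concatenate(list(depth.values()))
    print(f"missing (0x): {np.mean(counts == 0):.1%}, "
          f"expanded (>1x): {np.mean(counts > 1):.1%}")
```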

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-materials-and-methods

Revision 2

Attachments
Attachment
Submitted filename: Response to Reviews v3 (1).pdf
Decision Letter - Christos A. Ouzounis, Editor, William Stafford Noble, Editor

Dear Dr. Bracht,

We are pleased to inform you that your manuscript 'Regional sequence expansion or collapse in heterozygous genome assemblies' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Christos A. Ouzounis

Associate Editor

PLOS Computational Biology

William Noble

Deputy Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I have reviewed the latest version, and am satisfied with it.

While reading it, I observed two minor possible corrections of the text.

1. In the Abstract, "downstream analyses, yet is common" might better read "downstream analyses, yet are common" (since 'are' is plural, it agrees with the preceding plural noun "High levels").

2. On page 5, "from the raw Illuminia reads" should read "from the raw Illumina reads" (i.e., "Illuminia" is a typo).

Reviewer #2: The authors have sufficiently addressed my previous comments. My only suggestion is to replace (or swap) Figure 2 with Supplementary Table 1, as it is hard to interpret the Oxford Grids when the set of reference contigs displayed changes from plot to plot (I disagree that the x-axis is not changing). If Fig 2 is retained as is, I would suggest adding some text to the caption to guide the reader in its interpretation (e.g., steepness of diagonals).

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Formally Accepted
Acceptance Letter - Christos A. Ouzounis, Editor, William Stafford Noble, Editor

PCOMPBIOL-D-19-01915R2

Regional sequence expansion or collapse in heterozygous genome assemblies

Dear Dr Bracht,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Sarah Hammond

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.