Distinguishing excess mutations and increased cell death based on variant allele frequencies

Gergely Tibély; Dominik Schrempf; Imre Derényi; Gergely J. Szöllősi

doi:10.1371/journal.pcbi.1010048

Peer Review History

Original Submission
July 26, 2021
8 Dec 2021 Decision Letter - Teresa M. Przytycka, Editor, Douglas A Lauffenburger, Editor

Dear Tibély,

Thank you very much for submitting your manuscript "Distinguishing excess mutations and increased cell death based on variant allele frequencies" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission.
We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Teresa M. Przytycka
Associate Editor
PLOS Computational Biology

Douglas Lauffenburger
Deputy Editor
PLOS Computational Biology

*********************

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: This paper seeks to disentangle whether new mutations arise from a higher mutation rate or from a higher cell turnover rate. This work assumes a fixed mutation rate and a fixed cell turnover rate during the progression of a tumor, and seeks to estimate these two rates given variant and total read counts of mutations in bulk DNA sequencing data. The authors validate their method on simulated data, finding that while their method is sensitive to per-base sequencing errors, in regimes of low sequencing error the method provides accurate estimates of ground-truth mutation and cell turnover rates. Finally, the paper considers an HCC sample sequenced using a technology with low error, and discusses the estimated rates in light of related findings. While I find the work interesting, I have several comments and questions.

1. How do cell lineage trees relate to trees inferred using tree deconvolution methods?
Many clone tree inference methods have been proposed for reconstructing clone trees from bulk DNA sequencing data of tumors. See for example [1,2]. How do these trees differ from the cell lineage trees that you sample in your method? Is there a way to make use of these clone trees in your method?

2. Copy-number aberrations
VAFs are affected by copy-number aberrations (CNAs). I suggest using cancer cell fractions instead. Also, I would appreciate more details on how VAFs were corrected for CNAs for the HCC data.

3. Justify/discuss assumptions
There are two key assumptions: (1) fixed mutation rate and (2) fixed turnover rate. While the latter assumption is discussed in the Discussion, I would have appreciated simulation experiments that assess how your method would perform if this assumption were violated. Would you still be able to accurately infer the mutation rate? Moreover, I would like to see a discussion on the assumption of a fixed mutation rate. It has been shown that the mutation rate can change during the evolution of a tumor, due to, for instance, mutations in DNA mismatch repair mechanisms, or more generally, due to exposures to mutational signatures. For an example of the former, please see Ref [3].

4. More discussion on subclonal mutation clusters
I found this point extremely interesting. I would appreciate more simulation experiments to further investigate this point. Also, please discuss this in light of the findings in the MOBSTER paper [4].

5. Extension to support multiple samples
It would be good if the method could support multiple bulk samples from the same tumor.

6. Software and documentation
- Please include example input files.
- Add comments to your code/functions.
- It looks like your code takes trees as input rather than generating them. Please describe how trees can be generated.
- To facilitate reproducibility of the experimental results, please include simulated data and preferably real data as well.

Minor comments:
- turnover or turn over?
- page 3: it'd be good to justify the assumption of independence among sites.
- Good to provide the name of the method in the paper.
- What is the correspondence between the number of mutations and number of leaves in the trees that you sample? More generally, how do you choose the number of leaves?

References
[1] El-Kebir M, Oesper L, Acheson-Field H, Raphael BJ. Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics. 2015;31(12):i62-i70. doi:10.1093/bioinformatics/btv261
[2] Deshwar AG, Vembu S, Yung CK, Jang GH, Stein L, Morris Q. PhyloWGS: Reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 2015;16(1):35. doi:10.1186/s13059-015-0602-8
[3] Christensen S, Leiserson MDM, El-Kebir M. PhySigs: Phylogenetic Inference of Mutational Signature Dynamics. In: Biocomputing 2020. WORLD SCIENTIFIC; 2019:226-237. doi:10.1142/9789811215636_0021
[4] Caravagna G, Heide T, Williams MJ, et al. Subclonal reconstruction of tumors by using machine learning and population genetics. Nat Genet. 2020;52(9):898-907. doi:10.1038/s41588-020-0675-5

Reviewer #2: In this manuscript Tibély et al. propose a likelihood-based approach to estimate the mutation rate and cell renewal rate history of a tumor from DNA sequencing data from a single bulk sample of the tumor and an adjacent normal sample. The methodology used is sound, even though the applicability and interest for a wide biological community is questionable and possibly limited due to the strong theoretical character and the various (unavoidable) strong assumptions implied. The method is applied to an experimental dataset, but there is a lack of tools for cross-validation of the inferred parameters with real estimates (even if the method is appropriate for synthetic datasets generated under the same methodological assumptions, which is not striking per se). The mathematical approach is clean but the text suffers from an overly technical style that is sometimes more appropriate for a Technical Note than for a main manuscript. Altogether, in my view the authors should pay most attention to the narrative, to better introduce the background and succinctly explain the model’s novelty on a question that is not new, in order to convince readers of its far-reaching biological implications.
MAIN POINTS

- The Introduction feels short and too simplistic given the level of concreteness of the subsequent approach, and it should be more pedagogical to acknowledge the multiple facets and limitations involved in the inference of evolutionary parameters. First, the work of Williams et al. is cited, but since those authors addressed a similar problem in their original publication and even refined their subsequent analysis based on multiple-sampling data, it is unclear what was inappropriate in their methodology, or, in other words, what is genuinely new in this work’s approach that makes it more successful at discerning mutation rate and growth-death ratio even when considering just bulk sequencing data. Is it just the likelihood approach vs. machine learning methods? Is it the fact of simulating and fitting the whole VAF distribution rather than its global scaling properties derived from the Luria-Delbrück model? Secondly, the method shows important limitations (and this is common to the work of Williams et al.): it assumes a net exponential tumor growth, which is not necessarily realistic, and purely neutral subclonal dynamics. Other authors have shown the prevalence of positive Darwinian selection across tumor types (Martincorena et al, 2017, Cell), and even if this is not incompatible with an excess of neutral passenger mutations, it would affect the subclonal dynamics and the tree shape, consequently affecting inferences from the VAF distribution, as has been widely debated (e.g. Tarabichi et al. 2018 Nat. Genet, as a criticism of Williams et al. 2016). These limitations cannot be ignored and have to be highlighted, or the authors may want to consider how much inferences would depart from the given estimated values when these assumptions are relaxed. One could get the impression that estimates can deviate by orders of magnitude if basic evolutionary assumptions are inappropriate. For the same reason, I would avoid overclaims in the Discussion.

- The model definition is not sufficiently clear or precise. The authors use a lineage tree representation but should explain what methodology they use to generate the simulated trees, i.e. T(delta). Is it by Markov-chain Monte-Carlo? Continuous or discrete time? What is the initialization condition and end time? What are the random events? From the Methods it seems clear that tree topologies and mutagenesis are simulated independently of one another, but a higher level of concreteness in the main text would be helpful. In particular, there are two major confusing points:

- (i) Whether mutagenesis events are restricted to the instant of cell division (i.e. internal nodes) or are also simulated all along branches. The text sometimes gives the impression of the former assumption, but at the same time the division rate is taken as the unit of time. This is relevant insofar as in one case both parameters are intermingled and mu changes as a result of changes in delta, while in the other they represent independent measures (delta would impact the mutation burden just indirectly, by modulating the total no. of tumor branches, but would not affect its density) and thus the parameters might be more easily resolvable? If mu is truly independent it should not depend on b.

- (ii) Related to the previous point, how empirical VAF distributions (that are indicative of tumor subclonal composition) can be directly contrasted with a tree whose branches are said to represent individual cells rather than subclones of multiple cells. I would assume a representation in terms of a subclonal tree where branches represent populations of clonal cells would be more readily contrastable with experimental VAF data. It is unclear how the fitting procedure is done if branches are restricted to individual cells and not clades. As a suggestion, I think it would help to improve the explanation of Fig. 1. This shows site frequency spectra but these are difficult to relate to the trees on the left as there is no sketch drawing read sites along the branches. And the paragraph describing this connection between site frequency spectrum and tree topology is confusing. Along the same lines, Fig. 1 would benefit much if it showed trees generated under four different scenarios (low mu-low delta, high mu-low delta, low mu-high delta, and high mu-high delta) or at least under scenarios where just one or the other parameter is changed but not both at the same time, as is done here. That would better illustrate how the tree shape and mutation burden change with the two parameters of interest.

Altogether, the model section is vague and contrasts with that of the parameter estimation procedure, which is much more detailed (indeed, the parameter estimation section already begins in the second-to-last paragraph of pg. 3, even if the title comes later).

MINOR POINTS

- I partly disagree with the diagnosis that the main limitation for evolutionary inference is that bulk sequencing does not resolve the genotypes of individual cells. I assume it is not as much a problem of spatial resolution as it is of temporal resolution, as it is limited to a snapshot. I think exploring models accounting for sampling bulk data at various serial time points is a suitable direction for future work. This is poorly explored. On the contrary, I am not persuaded of what the benefit of sampling individual cell genotypes is vs. sampling subclones in regard to tree reconstruction and evolutionary inferences. This is perhaps a misinterpretation of what individual cell trees in this work offer vs. subclone-based approaches in Williams et al, or they are essentially similarly informative.

- In the Abstract the sentence referring to the “orders of magnitude” difference between mutation burden in tumors and healthy tissues should be toned down. It is certainly the case in liver but not necessarily that dramatic in other tissues (the authors may want to refer to recent literature by the P. Campbell group and colleagues). In any case it diverts attention from the main purpose of the manuscript, which is not this comparison. This difference in mutation rate is perhaps also overstated in pg. 10, given that methods to estimate mutation rate in healthy tissues are so relaxed and heterogeneous. Yet, the authors should consider that the most reliable method to infer mutation rate in healthy tissues is by fitting data from individuals of different age, and for liver in particular, they can refer to the estimate in Brunner et al (2019) Nature. Here again it is evident that the particular choice to normalize mutation rate by cell division is inconvenient as it heavily relies on that estimate, while Brunner et al. present a mutation rate as an independent parameter with dimension site^-1 real time^-1.

- Fig. 3 is very nice, even if it is limited to some particular true mu:(1-delta) pairs. I would suggest an accompanying figure (e.g. a heatmap), but not necessarily a comprehensive one, showing how uncertainty in parameter estimates changes across a grid search on different true values of mu and (1-delta) extending to other regions of the grid. Do changes in mu affect uncertainty for the same value of delta? It is certainly interesting how uncertainty increases when the death rate approaches the growth rate. The authors point to the fluctuation in the shapes of the trees used for sample generation. I think this deserves a brief reflection. If I am right, I understand that the difficulty in capturing the distribution of bifurcation times with trees of a limited number of leaves in cases where d is similar to b arises from the broad distribution of possible outcomes when fates are balanced in stochastic branching birth-death processes (Bailey, 1964, The elements of stochastic processes with application to the natural sciences). Similar problems arise in healthy tissues (Snippert et al, 2010, Cell; Piedrafita et al, 2020, Nat Commun). Indeed, when commenting on possible reasons for tumors to show elevated values of delta, the underlying tissue architecture and stem cell renewal mode might also play important roles in determining the tumor death-to-birth ratios. These factors could also be reflected in the discussion in pg. 11. For instance, the organization of intestinal epithelial stem cells into crypts imposes constraints on clonal expansion between crypts while there is intrinsically elevated competition between stem cells belonging to the same crypt (Snippert et al, 2010, Cell).

- The considerations when discussing the cell division rate of the HCC tumor are pertinent but the authors neglect other major factors, such as the significant fraction of the HCC volume occupied by cirrhotic tissue and stroma, or the heterogeneous dynamics between tumor cells depending on their genotype (Darwinian selection) and/or the microenvironment, that can greatly affect those estimates. They should tone down the argument that “division rate is realistic, suggesting that the approximation is adequate”. They should also do so when arguing that the fact that the value becomes unrealistic when considering random sampling from a large specimen suggests that the sampling ratio may be close to 1.

- I would suggest changing the title to: “Distinguishing excess mutations and increased cell renewal based on tumoral variant allele frequencies”, which is more understandable.

- Regarding the mathematical formulation, the two conditions for m reflected in Eq. 6: is it the intersection between the two sets that is considered? In Eq. 7, it would be better to include a parenthesis for readability, to show that the second summation term is nested in the first. Both in Eq. 6 and 7, subindices i might be pertinent for r_i, m_i, m’_i and m_th,i.
- As a suggestion, variable names m_th, m’, M_obs mut can be explicitly mentioned in the text right before their first use in the respective equations.

- Eq. 7 could benefit from a little explanation in the accompanying text saying that it is a way of weighting the influence of coverage on the discovery of any actual given branch length from just significant mutant read counts, which should thus be < or = to Sum_k l_k. In this sense, I wonder how much this consideration affects the estimates in practice? Can some values for cases where mu is estimated ignoring the significance criterion be represented in Fig. 3? Similarly, in the same paragraph, the authors may want to explain that Msig is similar to the total number of accumulated true mutations.

- The definition and extent of a “synthetic dataset” is imprecise. I infer that when “10 synthetic datasets were generated for each true value” it means that 10 query trees were generated for the same given point in the delta grid search, and then “10^4 trees with 10^4 leaves were used for fitting” means per each dataset?

- Similarly, mu is said to be estimated directly from the data, but since mu_est requires L_sig computation and this depends on T(delta), one infers that it cannot be uncoupled from the delta inference; in other words it is not estimated beforehand but in conjunction with the likelihood for delta, as it is a related quantity. Is this correct? This reflection would clarify the procedure.

- What is the criterion when selecting a particular number of simulated trees (10^4)? Is this based on empirical observation, e.g. an optimality criterion like the value for which the error on the estimate gets below a certain threshold or similar? Regarding the choice of 10^4 leaves, it seems to match the number of cells in the empirical HCC dataset, but I wonder if fitting bulk data from a larger tumor sample would require more tree leaves or not necessarily, insofar as leaves are representative of the overall subclonal heterogeneity? i.e. Can we assume one branch = one subclone in the formalism used?

- Confidence intervals on MLE values should be provided for the inferred parameters, at least for the empirical dataset. From pg. 14 it is unclear whether a confidence interval is implemented or not, but the authors could resort to a Likelihood Ratio test or similar in the light of the supplementary figures.

- I am just curious to know why delta is not defined as a birth-death ratio instead of a death-birth ratio. In any case, the choice to represent one minus the death-to-birth ratio is already sufficiently convoluted that figures might benefit from indicating the extremes corresponding to 0 death and max. death.

- When introducing the likelihood model, it would be useful to state what exactly “observed data D” is. Indeed, is “site frequency spectrum” equivalent to “distribution of variant read counts”? I find the latter term less confusing.

- The authors may acknowledge in pg. 9 whether the empirical data correspond to WGS or WES, and whether it involved amplification that could impair the interpretation of the VAF distribution.

TYPOS AND REWORDING

- multiple sites where the term “linage” is used and it should read “lineage”
- (pg. 2) “mutations from individual cells are intermixed” -> rephrase
- (pg. 2) “with the cell linage tree” -> “in terms of a cell lineage tree”
- (pg. 3) “descendance of the extant cells” -> “descendance of any given cell”
- (pg. 9) Perhaps one could simply point to (ii) the level of randomness on tree structure and (iii) the randomness on mutation occurrence as main sources of noise apart from the limited size of trees (i)?
- (pg. 12) “are higher than expected for healthy tissues” -> not true in the case of delta. Rephrase.
- (pg. 14) “from a prescribed distribution” -> an obscure term… A Poisson?
- (pg. 14) “we generate a sample tree” -> “we generate a sample of a given set of simulated trees”

******

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: No: See review, simulated data missing.

Reviewer #2: No: The authors have made (well annotated) code available to generate synthetic data and compute estimates. They could perhaps include one or two examples in the repository that summarize results in some of the main figures.

******

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Gabriel Piedrafita

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

https://doi.org/10.1371/journal.pcbi.1010048.r001
Revision 1
6 Feb 2022 Author Response Attachments Attachment Submitted filename: PLOS CB response.pdf https://doi.org/10.1371/journal.pcbi.1010048.r002
22 Mar 2022 Decision Letter - Teresa M. Przytycka, Editor, Douglas A Lauffenburger, Editor

Dear Dr Szöllősi,

We are pleased to inform you that your manuscript 'Distinguishing excess mutations and increased cell death based on variant allele frequencies' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Teresa M. Przytycka
Associate Editor
PLOS Computational Biology

Douglas Lauffenburger
Deputy Editor
PLOS Computational Biology

*********************************************************

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: My previous comments have been satisfactorily addressed.

Reviewer #2: The authors have addressed all points raised and have significantly improved the narrative. The details of the model are much clearer and the potential limitations of the approach and future research directions are well acknowledged in the new version of the Discussion, which is why I consider the current version suitable for publication.

Just two very minor points for consideration (for which no further round of revision would be needed on my side):

- In pg. 4 it is explained how cell lineage trees are simulated with continuous-time branch lengths. I guessed so, but it would be nice to state: Is cell turnover simulated as a Poisson process? i.e. cell division rate drawn from an underlying exponential distribution?
- "Poisson distribution with parameter corresponding to the product": if the "expectancy" parameter is implied, it is perhaps better to specify so.

Typos:
- pg. 11: "sequnceing"
- pg. 14: "Finnally" and "dyanmics"

******

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes

Reviewer #2: None

******

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Gabriel Piedrafita

https://doi.org/10.1371/journal.pcbi.1010048.r003
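
Reviewer #2's two minor points in the letter above concern how a continuous-time birth-death process and a Poisson mutation count relate to exponential waiting times. The following is a minimal illustrative sketch of that relationship only; it is not the authors' code. The function names are hypothetical, and it is an assumption here that the "product" the reviewer quotes is the mutation rate mu times the branch length. Following the reviews, time is measured in units of the division rate and delta denotes the death-to-birth ratio.

```python
import random


def next_event(n_cells, delta, rng):
    """Draw the waiting time to, and type of, the next event.

    Assumption for illustration: per-cell birth rate is 1 (division rate as
    the unit of time), so the per-cell death rate equals delta.
    """
    total_rate = n_cells * (1.0 + delta)
    wait = rng.expovariate(total_rate)  # exponential waiting time
    is_division = rng.random() < 1.0 / (1.0 + delta)
    return wait, is_division


def mutations_on_branch(mu, branch_length, rng):
    """Mutation count on one branch: Poisson with mean mu * branch_length.

    Sampled by counting exponentially spaced arrivals that fall on the branch.
    """
    if mu <= 0 or branch_length <= 0:
        return 0
    count = 0
    t = rng.expovariate(mu)
    while t <= branch_length:
        count += 1
        t += rng.expovariate(mu)
    return count


if __name__ == "__main__":
    rng = random.Random(0)
    print(next_event(n_cells=100, delta=0.5, rng=rng))
    print(mutations_on_branch(mu=2.0, branch_length=1.5, rng=rng))
```

Under these assumptions the expectation of the Poisson count is exactly the product mu * branch_length, which is the "expectancy" reading the reviewer suggests stating explicitly.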
Formally Accepted
8 Apr 2022 Acceptance Letter - Teresa M. Przytycka, Editor, Douglas A Lauffenburger, Editor

PCOMPBIOL-D-21-01380R1
Distinguishing excess mutations and increased cell death based on variant allele frequencies

Dear Dr Szöllősi,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Katalin Szabo
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

https://doi.org/10.1371/journal.pcbi.1010048.r004

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.