Peer Review History

Original Submission: January 13, 2024
Decision Letter - Bill Sugden, Editor, Patrick Hearing, Editor

Dear Dr Chiang,

Thank you very much for submitting your manuscript "Meta-analysis of Epstein-Barr virus genomes in Southern Chinese identifies genetic variants and viral lineage associated with Nasopharyngeal Carcinoma" for consideration at PLOS Pathogens. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

Both reviewers raise compelling concerns with the analysis. If the authors revise their study to meet these concerns, then, as one of the reviewers noted of their work, "the overall goal and much of the approach is exactly right and has been missing from other papers."

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Bill Sugden

Academic Editor

PLOS Pathogens

Patrick Hearing

Section Editor

PLOS Pathogens

Michael Malim

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0002-7699-2064

***********************

Reviewer's Responses to Questions

Part I - Summary

Please use this section to discuss strengths/weaknesses of study, novelty/significance, general execution and scholarship.

Reviewer #1: Wong et al add a new set of case-control samples to two existing EBV genome datasets (one from their lab), and then extend existing SNP mapping to identify other risk variants. Importantly, this study is (to my knowledge) the first to extend these analyses to haplotype mapping, which is essential for moving towards the use of viral genetics to facilitate targeted screening. Doing so, they identify what appears to be a high risk haplotype.

While the goal and approach are generally right, I have two substantial criticisms that should be addressed before these results can be considered robust and informative. These essentially revolve around the way the analyses have been executed (or in some cases perhaps explained), in particular the lack of robust approaches to control for the unequal sizes of control and case groups from the two geographical areas being studied. Since I am not well versed in genetic methods, I cannot tell how difficult these would be to address.

Reviewer #2: Nasopharyngeal carcinoma (NPC) is prevalent in Southeast Asia. It is thought that environmental factors, human genetic variation, and the genetic diversity of EBV contribute to EBV-associated NPC. Wong et al identified 38 single nucleotide polymorphisms (SNPs) in 61 NPC patients carrying a high-risk EBV lineage which encodes a G6944A variant upstream of EBER2. However, if these SNPs indeed correlate with NPC incidence, they should be aligned to the sequence of the M81 strain, which was derived from an NPC patient, rather than the B95.8 strain (Figure 5). All SNPs in clusters 2, 3 and 7 should be listed, and a Venn diagram of intersections for all SNPs should be created. To conclude that G6944A and these 38 SNPs can be implicated in the development of NPC, EBV strains from areas other than Southeast Asia should also be analyzed to ensure the specificity of these variations and SNPs.

**********

Part II – Major Issues: Key Experiments Required for Acceptance

Please use this section to detail the key new experiments or modifications of existing experiments that should be absolutely required to validate study conclusions.

Generally, there should be no more than 3 such required experiments or major modifications for a "Major Revision" recommendation. If more than 3 experiments are necessary to validate the study conclusions, then you are encouraged to recommend "Reject".

Reviewer #1: My major issue is the way the Hong Kong and Guangzhou datasets were combined. The HK data set had a case:control ratio of 2:1 and the GZ data had a ratio of 1:2.5. Therefore any statistical analysis of SNP enrichment will also identify SNPs that statistically differ between HK and GZ. The authors present PCA that they claim shows the two populations are equivalent, but PCA is not a suitable test for this (it includes only a subset of variation) and the result is unconvincing (HK is significantly enriched in the bottom left quadrant of the PCA fig 2b, even if this simply reflects the more ethnically diverse history of HK). Therefore, the data would be much more robust if the authors used a two-factor statistical test (maybe ANOVA) combining sampling site with case-control. It would also be helpful to present sample site-split case-control data sets (and present proportions rather than case numbers) as supplementary to help validate data such as in figures 4b and 5b.
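The confounding described above can be illustrated with a minimal sketch. It swaps in a Mantel-Haenszel pooled odds ratio as one simple site-stratified estimate; this is not the two-factor test the reviewer suggests, nor a method taken from the manuscript. All counts are hypothetical: they were chosen so that the case:control ratios are 2:1 and 1:2.5, and so that the SNP is more common at one site but unrelated to disease within either site. The names `Stratum`, `crude_odds_ratio` and `mantel_haenszel_odds_ratio` are illustrative only.

```python
from typing import List, NamedTuple


class Stratum(NamedTuple):
    """2x2 table for one sampling site (carriers vs non-carriers of a SNP)."""
    case_carrier: int
    case_noncarrier: int
    ctrl_carrier: int
    ctrl_noncarrier: int


def crude_odds_ratio(strata: List[Stratum]) -> float:
    """Odds ratio after naively pooling all sites into a single 2x2 table."""
    a = sum(s.case_carrier for s in strata)
    b = sum(s.case_noncarrier for s in strata)
    c = sum(s.ctrl_carrier for s in strata)
    d = sum(s.ctrl_noncarrier for s in strata)
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(strata: List[Stratum]) -> float:
    """Pooled odds ratio that weights each site's 2x2 table separately."""
    numerator = 0.0
    denominator = 0.0
    for s in strata:
        n = sum(s)  # total samples at this site
        numerator += s.case_carrier * s.ctrl_noncarrier / n
        denominator += s.case_noncarrier * s.ctrl_carrier / n
    return numerator / denominator


# Hypothetical counts, NOT taken from the study.
# Site "HK": 200 cases, 100 controls (2:1); 60% carry the SNP in both groups.
hk = Stratum(case_carrier=120, case_noncarrier=80, ctrl_carrier=60, ctrl_noncarrier=40)
# Site "GZ": 60 cases, 150 controls (1:2.5); 20% carry the SNP in both groups.
gz = Stratum(case_carrier=12, case_noncarrier=48, ctrl_carrier=30, ctrl_noncarrier=120)

crude = crude_odds_ratio([hk, gz])
adjusted = mantel_haenszel_odds_ratio([hk, gz])
print(f"crude OR = {crude:.2f}, site-adjusted OR = {adjusted:.2f}")
```

With these made-up counts the pooled table suggests an association (crude OR of about 1.83) although the odds ratio is exactly 1 within each site, and the stratified estimate returns 1.00. A regression model with sampling site as a covariate would be a more flexible way to make the same adjustment.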

My second issue is around how the haplotype analysis was conducted (or explained). The study picked - apparently arbitrarily - 8 SNPs for their initial analysis. If this was all the missense SNPs, then this makes the assumption that any important SNPs will be coding (despite the fact that the top hit, the EBER2 promoter, is clearly non-coding). The paper then produces a maximum risk haplotype of 38 SNPs using an agnostic method. In my view this has been done the wrong way around, and the agnostic method has not been leveraged sufficiently. The haplotype selection process is a bit opaque, and there is a lack of clarity of what percentage of cases have vs do not have this haplotype, and how to leverage the haplotype (and more importantly deviations from the haplotype) to inform predictive diagnosis. If this could be incorporated, it would maximise the value of the paper.

What would be really valuable to the field would be to know which are the biggest contributors to NPC within this high risk haplotype. I have seen cumulative contribution analyses when attempting to reduce gene signatures associated with a disease to the minimum size for maximum predictive value. My colleague tells me this is a machine learning strategy called “feature selection” (specifically forward selection - adding them one by one and checking the performance metrics [significance or odds ratio] each time), and/or using logistic regression or decision trees. I think this approach to selecting the most predictive SNPs would be an agnostic approach to both identifying the most predictive SNPs, and potentially identifying SNPs that would be most important for future biomarker studies. Such a refined set of SNPs could then be more rationally used in the analyses described in figure 4/tables 1&2. This kind of insight would make the paper both more robust (in terms of avoiding investigator assumptions) and more impactful in focusing future research.
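The forward selection idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis: the score for a sample is simply the count of risk alleles it carries among the selected SNPs, the performance metric is the area under the ROC curve (AUC), and the SNP names and genotypes are invented toy data. The reviewer also mentions logistic regression or decision trees, which would replace the unweighted count used here.

```python
from typing import Dict, List, Tuple


def auc(scores: List[float], labels: List[int]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def risk_allele_count(genotypes: Dict[str, List[int]], snps: List[str], n: int) -> List[float]:
    """Per-sample number of risk alleles carried among the given SNPs."""
    return [float(sum(genotypes[snp][i] for snp in snps)) for i in range(n)]


def forward_select(
    genotypes: Dict[str, List[int]], labels: List[int], min_gain: float = 1e-9
) -> Tuple[List[str], float]:
    """Greedy forward selection: add the SNP giving the largest AUC gain, stop when none helps."""
    selected: List[str] = []
    best_auc = 0.5  # AUC of an uninformative score
    remaining = list(genotypes)
    while remaining:
        top_snp, top_auc = None, best_auc
        for snp in remaining:
            trial_auc = auc(risk_allele_count(genotypes, selected + [snp], len(labels)), labels)
            if trial_auc > top_auc:
                top_snp, top_auc = snp, trial_auc
        if top_snp is None or top_auc - best_auc <= min_gain:
            break  # no remaining SNP improves the metric
        selected.append(top_snp)
        remaining.remove(top_snp)
        best_auc = top_auc
    return selected, best_auc


# Toy data (hypothetical): 1 = risk allele present; labels 1 = NPC case, 0 = control.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_genotypes = {
    "snpA": [1, 1, 1, 0, 0, 0, 0, 0],
    "snpB": [1, 1, 0, 1, 1, 0, 0, 0],
    "snpC": [1, 0, 1, 0, 1, 0, 1, 0],  # unrelated to case status
}
chosen, chosen_auc = forward_select(toy_genotypes, toy_labels)
print(chosen, chosen_auc)
```

On the toy data the procedure keeps snpA and snpB and rejects snpC, because adding it lowers the AUC. Performance measured on the same samples used for selection is optimistic, so a real analysis would need cross-validation or an independent cohort.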

Third, this sort of study is highly dependent on the multiple sequence alignment (MSA) used by the investigators. This MSA should be directly available, either from a data repository or as supplementary data file, to help others to validate the study.

Reviewer #2: (No Response)

**********

Part III – Minor Issues: Editorial and Data Presentation Modifications

Please use this section for editorial suggestions as well as relatively minor modifications of existing data that would enhance clarity.

Reviewer #1: I also have a number of minor comments below that I compiled as I read the paper, that should be considered or addressed, for clarity, accuracy, precision, or methodological robustness. This includes some references to the major issues mentioned above.

Abstract:

22 - clarify multiple SNPs (why variants?) forming a high risk haplotype? Or independent linkages?

26 Panel of SNPs: say something about interdependency/recombination/haplotype. Multiple mentions of panel (summary) seems redundant?

61 Line mentions preponderance for Type I in NPC without context of background type1:type2 prevalence in the high risk population. Comment on relative frequencies in South China.

64 - endemic regions: terminology confusing in the context of endemic BL. Regions of high NPC prevalence?

66. Ref 22 is much more complicated than the interpretation here: the paper made a chimeric M81/B95-8 EBV that likely resulted in mismatches between polymorphisms in N- and C-terminus, it may be these mismatches that reduced EBNA-1 function, rather than this polymorphism alone. It therefore may not reflect the real biology in circulating EBV strains.

70/71 - ref 28 shows SNPs common to China/Indonesian EBVs (NPC-high regions) having altered miRNA expression profiles: it does not test NPC-association within those regions, so the claim should be toned down as a geographical link, rather than a disease link.

78-80 - when describing the EBER2 polymorphisms, it may be useful to the reader to include position relative to EBER1 (one of the two polymorphisms will be between EBER2 and EBER1) - eg “less than 20 nt upstream of EBER2”. Also, T6999G (from https://doi.org/10.1099/jgv.0.001728) position is within EBER2 on my NC7605 map. Please clarify (and clarify whether this is the functional EBER2 SNP characterised by the Delecluse (Li et al 2019) and Tibbetts (Wang et al 2022) labs).

94 - explain what “279 and 227” refer to; this was confusing: suggest attaching the numbers to the descriptors. Case vs control? [controls are asymmetrically matched?]

Statistics: How does the study mathematically handle the unequal sizes of cases and controls from each study? HK1 is 142+61/62+31; GZ is 61/156. How do you know your study is not showing differences between HK and GZ, because of this skew?

208 - Statistics conducted in R is insufficient information (unless the individual statistical tests are defined throughout the paper).

233-239 - Principal components are generated de novo for every data set, so the implication that geography falls in PC2 or 3 (line 235-6) is not valid. Indeed, just looking at the PCA, you might (by the same logic) conclude that the NPC and non-NPC samples also are not different from each other. And looking at the data, there are no GZ samples (except one NPC) in the bottom part of the PCA. So I feel that the geography in the samples is not fully resolved. A more robust statistical test is required.

244 - ‘two cohorts’ - is this NPC/ctrl or HK/GZ?

250 - top association was replicated - was this replication strengthened, weakened or similar, given the additional samples (we would expect p to become even smaller with larger n).

254 - the non-replication of 163364 warrants additional scrutiny to explain the difference: is your re-calling the sequences changing the frequency of this SNP? Is the prevalence of this SNP different between HK and GZ samples? Or was this just skewed by the low control number in GZ data set the first time around?

267 - variants within R2 0.6-0.8 - this seems to be a different analysis from that in figure 3: narratively, the analysis that gives the degree of association should link to the p value measure in figure 3, or the figure indicating R2 values should be more prominent.

Overall - not sure I follow the logic of the narrative - surely if you are looking for driver risk alleles other than EBER, you want to investigate SNPs with high P value but low linkage, to maybe identify alternate haplotypes?

277-295 - a haplotype-based analysis is absolutely the right thing to do, but the selection of the haplotype seems entirely subjective, and the basis for choosing these SNPs is unclear. Is it reasonable to select only coding mutations, as other (regulatory) changes could be more important but much harder to identify?

Table 1: REF and ALT may not be the best annotation - REF and NPC or REF/CHN (for China) may be better, since you are defining NPC-associated variants. In a table of all variants, then the “ALT” makes a bit more sense, although it is possible that a REF allele has higher association with NPC than ALT (where the OR is <1, presumably).

Table 2 is the crucial data set that shows a really strong enrichment - again I would prefer a supplemental version looking at the haplotype frequencies in the geographically separated data sets (GZ vs HK). However it is very hard to ‘read’ the haplotypes: could you have all the haplotypes labelled as bold (or coloured) for risk-allele and non-bold as no-risk allele. Then it is much easier to see both the number and structure of risk alleles.

I think it might also be useful to have a separate analysis (including rare alleles) that shows OR and/or p-value for different numbers of risk alleles from your list. Or if the haplotypes are more structured (showing recombination between two high and low risk haplotypes) then the total length of the risk haplotype.

In the 16 non-risk haplotype NPC cases, are there any risk-linked SNPs in common? Could these be assessed for being part of the haplotype, or an independent risk locus?

Thought: are there any CTL epitopes (esp in Chinese HLA) linked to the SNPs? Or even DRiPs associated with silent polymorphisms? [may not be possible to answer this]

Figure 4/317-324 - I am not clear how the clusters have been defined: they do not seem to make sense in the context of the phylogenetic tree (with cluster 3 being scattered among several parts of the tree) and the NPC risk-alleles. Perhaps it would have been more useful to cluster only using SNP positions that were defined as associated with NPC at some threshold, and place that in the wider phylogeny of the virus? Either way, the authors need to explain the rationale behind their particular choice of clustering algorithm, as the results are hard to understand.

Figure 4b is not a great visualisation of the data - surely much better to use paired bars, with each bar indicating the % of cases or controls. This would offer a much easier to interpret visual representation of cluster enrichment in one or the other group.

Line 337 - the key observation (AUC value for ROC) is not easy for

Revision 1

Attachments
Submitted filename: Point_to_point_response_PLOS_Pathog_202404253.docx
Decision Letter - Bill Sugden, Editor, Patrick Hearing, Editor

Dear Dr Chiang,

Thank you very much for submitting your manuscript "Meta-analysis of Epstein-Barr virus genomes in Southern Chinese identifies genetic variants and high risk viral lineage associated with nasopharyngeal carcinoma" for consideration at PLOS Pathogens. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Your responses to the first reviewer are clear. Those to the second reviewer are a bit confusing and need clarification. In particular: "The authors seem to recognize that their analysis requires them first to identify which strain of EBV they consider (type I or type II) and then identify sequence variants in their isolates from each strain that differ from their type I or type II reference genome. However, their description of their analysis is unclear. For example, they write: 'To determine the EBV type of our samples, an EBV Type II BAM file was generated by an alignment of the Type II EBV reference genome AG876 (GenBank accession number: NC_009334.1) to the Type I EBV reference genome. The BAM file for each sample was piled up together with the Type II EBV reference BAM file, and variants were called by BCFtools (version 1.13). Nucleotides that were identical in both Type I and Type II EBV genomes were ignored.' This description does not make sense and needs to be corrected so that a reader can understand the steps in the authors' analysis to ensure that it does make sense." Can you please revise this section of your manuscript to address this need?
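One possible reading of the quoted typing procedure is sketched below, purely as an illustration of the logic a clarified description might convey; the source does not confirm that this is what the authors did. It assumes the two reference genomes and the sample are already expressed in the same coordinate system as equal-length strings, and it does not call BCFtools or read BAM files. The function name `classify_ebv_type`, the majority rule and the toy sequences are all hypothetical.

```python
from typing import Tuple


def classify_ebv_type(
    type1_ref: str, type2_ref: str, sample: str, min_informative: int = 1
) -> Tuple[str, float]:
    """Assign a type from positions where the two references differ.

    Returns the call and the fraction of informative positions matching Type II.
    Positions identical in both references carry no typing information and are skipped.
    """
    if not (len(type1_ref) == len(type2_ref) == len(sample)):
        raise ValueError("sequences must share one coordinate system and length")

    matches_type1 = 0
    matches_type2 = 0
    for base1, base2, base_sample in zip(type1_ref, type2_ref, sample):
        if base1 == base2:
            continue  # identical in Type I and Type II: ignored
        if base_sample == base1:
            matches_type1 += 1
        elif base_sample == base2:
            matches_type2 += 1
        # any other base (e.g. N or a third allele) supports neither type

    informative = matches_type1 + matches_type2
    if informative < min_informative:
        return "undetermined", 0.0

    fraction_type2 = matches_type2 / informative
    if fraction_type2 > 0.5:
        return "Type II", fraction_type2
    if fraction_type2 < 0.5:
        return "Type I", fraction_type2
    return "undetermined", fraction_type2


# Toy sequences (hypothetical); the references differ at indices 2, 5 and 8.
toy_type1 = "ACGTACGTAC"
toy_type2 = "ACTTAGGTCC"
toy_sample = "ACTTAGGTAC"  # matches Type II at two of the three informative positions
call, fraction = classify_ebv_type(toy_type1, toy_type2, toy_sample)
print(call, round(fraction, 3))
```

In practice typing usually relies on the divergent EBNA2 and EBNA3 regions rather than a genome-wide majority vote, so the threshold and the positions used would need to be stated explicitly in the methods.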

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Bill Sugden

Academic Editor

PLOS Pathogens

Patrick Hearing

Section Editor

PLOS Pathogens

Michael Malim

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0002-7699-2064

***********************

Reviewer Comments (if any, and for reference):

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Revision 2

Attachments
Submitted filename: Point_to_point_response_PLOS_Pathog_20240507.docx
Decision Letter - Bill Sugden, Editor, Patrick Hearing, Editor

Dear Dr Chiang,

We are pleased to inform you that your manuscript 'Meta-analysis of Epstein-Barr virus genomes in Southern Chinese identifies genetic variants and high risk viral lineage associated with nasopharyngeal carcinoma' has been provisionally accepted for publication in PLOS Pathogens.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Pathogens.

Best regards,

Bill Sugden

Academic Editor

PLOS Pathogens

Patrick Hearing

Section Editor

PLOS Pathogens

Michael Malim

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0002-7699-2064

***********************************************************

Reviewer Comments (if any, and for reference):

Formally Accepted
Acceptance Letter - Bill Sugden, Editor, Patrick Hearing, Editor

Dear Dr Chiang,

We are delighted to inform you that your manuscript, "Meta-analysis of Epstein-Barr virus genomes in Southern Chinese identifies genetic variants and high risk viral lineage associated with nasopharyngeal carcinoma," has been formally accepted for publication in PLOS Pathogens.

We have now passed your article onto the PLOS Production Department who will complete the rest of the pre-publication process. All authors will receive a confirmation email upon publication.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any scientific or type-setting errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Note: Proofs for Front Matter articles (Pearls, Reviews, Opinions, etc...) are generated on a different schedule and may not be made available as quickly.

Soon after your final files are uploaded, the early version of your manuscript, if you opted to have an early version of your article, will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Pathogens.

Best regards,

Michael Malim

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0002-7699-2064

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.