MiDAS—Meaningful Immunogenetic Data at Scale

Maciej Migdal; Dan Fu Ruan; William F. Forrest; Amir Horowitz; Christian Hammer

doi:10.1371/journal.pcbi.1009131

Peer Review History

Original Submission January 15, 2021
7 Mar 2021 Decision Letter - Mihaela Pertea, Editor
Dear Dr. Hammer,
Thank you very much for submitting your manuscript "MiDAS - Meaningful Immunogenetic Data at Scale" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.
When you are ready to resubmit, please upload the following:
[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.
[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).
Important additional instructions are given below your reviewer comments.
Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.
Thank you again for your submission.
We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.
Sincerely, Mihaela Pertea, Software Editor, PLOS Computational Biology
Reviewer's Responses to Questions
Comments to the Authors: Please note here if the review is uploaded as an attachment.
Reviewer #1: Summary: In this work, the authors describe MiDAS, a free package in R where HLA alleles and amino acid sequences can be tested for association with a given phenotype. The program is specialized for HLA amino acid fine mapping and evolutionary divergence. In addition, KIR effect on phenotype can be tested by KIR genes in association with present HLA alleles. The program is capable of refining data so that all data is in the same format for cross-study analysis. MiDAS statistical tests include Hardy-Weinberg equilibrium, linear and logistic regression tests, Cox proportional hazard models and likelihood ratio tests. Overall the paper is well written and the MiDAS package is of general use to the complex trait genetics community, where HLA and KIR association testing is often complex and highly relevant. Comments: 1) I recommend that a better description of the motivation behind developing MiDAS aimed at the non-expert be included in the introduction or early in the Design and implementation section. For example, the authors state that “statistical considerations are more complex” for HLA and KIR analysis but don’t provide further description. Including this would improve readability and understanding for a general audience. 2) Some clarifications are warranted (a few examples below): On pg 2 the authors state “MiDAS accepts HLA genetic data” but don’t mention the type of data. Does it accept sequence data, textual allele names, SNP data?
Is it for a single individual or a population (the same is true for KIR data)? I see now that this is included in methods but I feel it would be helpful to specify this earlier. Authors should describe what Grantham’s distance measures and state why it is useful to know for association studies. 3) The authors mention the use case of allele-specific expression for HLA. Is this data included in MiDAS? Literature suggests that imputing at least HLA class I expression is possible and biologically informative [for example PMID: 23559252 and 29302013]. This would be a nice addition.
Reviewer #2: Migdal et al. introduce an R package to carry out multiple analyses for HLA and KIR loci. Standard tools for analyzing genetic data often cannot deal with conventions used for documenting HLA and KIR variation, or often miss important variation which could help explain disease and normal phenotypes. Even when carrying out genome-wide analyses, HLA and KIR dedicated analyses might be needed to appropriately take into account the distinct characteristics of these loci, and researchers often find themselves needing to develop custom code and methods to parse and analyze HLA and KIR data. Therefore, a tool to perform such tasks in a consistent way is very welcome and a relevant contribution. However, I have some comments which I believe can help improve the manuscript. Overall, I think that the manuscript is too brief, resembling more of a technical note instead of an original research article. There is no Author Summary section, which I believe should be included for publishing in PLOS Comp Bio. Although the main result here is the computational tool itself, I miss some biological results. When we propose a new tool, it is good practice to show the accuracy in recovering ground truth results from simulated data or in replicating previously reported results. Please consider trying to replicate results from familiar papers (e.g. Arora et al. and Carrington et al. 
papers that you cite) if their datasets are available, or at least cite published examples of particular analyses to justify why they are important (you’ve done that in the last section of your tutorial). In that regard, the tutorial document is much more complete than the manuscript. The tutorial shows the potential of the package, it provides more use case examples, and it motivates users to try the package out, more than the manuscript does. I’d like to see a version with this manuscript with more material and results, which would make clear what the potential of the package is, motivating more users to try it out. Overall I think this work is a promising contribution; I will share it with students, and maybe be a user myself. Additional comments: In the introduction, the authors try to say that (1) conventions used for documenting genomic variation are not optimal for HLA and KIR, (2) standard methods to analyze genetic data often miss important variation at HLA and KIR, and (3) genome-wide statistical association methods often miss hits at HLA and KIR, which calls for dedicated methods and analyses. Those are all relevant points which deserve discussion, however the text is not very clear and omits important examples and citations. Please try to improve this discussion, because it is indeed relevant. For example, this GWAS for COVID-19 (10.1056/NEJMoa2020283) is an example of dedicated analyses for HLA complementing a GWAS, and illustrates a potential use case for your package. Minor points: (1) Introduction “Many KIR are receptors for HLA Class I ligands, but these interactions are highly specific”. I don’t see a contrast between the 2 statements. Consider something like “Many KIR are receptors for HLA Class I ligands via highly specific interactions that depend on the individuals’ HLA and KIR genotypes”. 
(2) “the availability of immunogenetic variation data is only the first necessary step in uncovering and understanding its role in immune-related traits” This phrase does not read well since the subject is “availability of immunogenetics variation data”, and the availability of data doesn’t have any roles on traits. Consider “the HLA and KIR roles in immune-related traits”. (3) Design and implementation “Statistical association analyses of immunogenetic variants often focus on the presence vs. absence of single HLA alleles.” As written, this may be misleading because “presence vs absence” usually refers to loci which show CNV (e.g., DRB3/4 and KIR). I think the authors actually mean that statistical associations for HLA are often carried out at the HLA allele level, which is an important unit of information both for its biological meaning and knowledge accumulated by traditional studies. (4) The manuscript is too brief when explaining some points. For example, the authors describe the consideration of allelic variation at KIR as a shortcoming of their tool, but do not explain the reason why this is not possible to implement. Further, “Data transformation can also be customized using user-supplied additional data dictionaries”, but no examples are given. (5) It may be confusing that the package is named “MiDAS”, it is installed as “MiDAS”, but the package is actually “midasHLA” (library(“midasHLA”)). Consider changing this. (6) In the tutorial, please indicate that some tidyverse packages need to be loaded, otherwise the code fails. (7) One disappointing aspect of computational tools for HLA is that they usually get stuck with a single and outdated version of internal datasets, such as IMGT data or allelefrequencies.net. Please consider a simple interface for users to update datasets, so your tool can live longer. Reviewer #3: The authors have presented a paper describing in detail the MIDAS software. 
The paper addresses the challenges of analysing HLA data by providing an analysis of the dataset and disease association measures. The following queries arose when reading the manuscript. The authors describe data formats and options to refine the data to certain levels of resolution. Experience of real-life datasets is that HLA datasets are often not clean and well defined. Many datasets have missing values, some loci may be untyped and there are often mixed resolutions, strings of allele typings and NMDP MAC allele codes. How does the software cope with this data? How are homozygous types represented? Very large datasets are often typed with multiple techniques and analysed against multiple versions of the reference database, leading to inconsistencies; how is this addressed? Is there a minimum and maximum size of dataset? Two challenges in analysis of this type are the use of software to provide results where the dataset is too small for the results to be valid, or the dataset is too large to load. The HLA-DPB1 and MICA and MICB alleles use a slight variation of HLA nomenclature; how does the software cope with this? The authors briefly mention comparisons to other applications but more recently published tools like Easy-HLA and the Gene[rate] tools from HLA-net are not mentioned, with which there may be some overlap. There is no discussion of how the software was tested and validated, which would be informative.
Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.
Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes
PLOS authors have the option to publish the peer review history of their article.
If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: Yes: Paul McLaren Reviewer #2: Yes: Vitor R. C. Aguiar Reviewer #3: No
Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.
Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.
Reproducibility: To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods
https://doi.org/10.1371/journal.pcbi.1009131.r001
Revision 1
5 May 2021 Author Response
Attachment (submitted filename): Migdal_Reviewer_response.docx
https://doi.org/10.1371/journal.pcbi.1009131.r002
30 May 2021 Decision Letter - Mihaela Pertea, Editor
Dear Dr. Hammer,
We are pleased to inform you that your manuscript 'MiDAS - Meaningful Immunogenetic Data at Scale' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.
IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.
Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.
Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.
Best regards, Mihaela Pertea, Software Editor, PLOS Computational Biology
Reviewer's Responses to Questions
Comments to the Authors: Please note here if the review is uploaded as an attachment.
Reviewer #1: I am satisfied with the authors' responses to my review. I support publication of the revised version.
Reviewer #2: I am satisfied with the authors' responses and updates. 
I still believe that a lot of material in the tutorial could be moved to the manuscript, as this would provide a more substantial motivation for specific analyses. However, I respect the authors' decision to keep the manuscript brief, while providing a separate tutorial which will complement the paper. If the editor considers the current format appropriate, I believe this work is a welcome contribution and it deserves publication.
Reviewer #3: The authors have responded positively to the reviewers' comments, and the GitHub and tutorial provided work well alongside the manuscript. The authors have responded to a query regarding the data input formats; it may be of use to include some of their response to reviewers in the main manuscript. HLA data is notoriously complicated and messy (different resolutions, strings and codes), and whilst it may be the responsibility of the user to clean data before using MIDAS, there may be benefit in explicitly stating this. From experience, many users expect tools for HLA analysis to also clean the data, as well as perform the expected analysis. The authors have confirmed that MIDAS is not performing any novel statistical methods, and as such validation is limited. The response to reviewers that "When writing these functions, we tested them in parallel to this step-by-step approach, thereby making sure results are comparable." neatly sums this query up and may be useful to include in the manuscript, but is not essential.
Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. 
For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes
PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: Yes: Paul J McLaren, Ph.D. Reviewer #2: Yes: Vitor R.C. Aguiar Reviewer #3: No
https://doi.org/10.1371/journal.pcbi.1009131.r003
Formally Accepted
1 Jul 2021 Acceptance Letter - Mihaela Pertea, Editor
PCOMPBIOL-D-21-00070R1
MiDAS - Meaningful Immunogenetic Data at Scale
Dear Dr Hammer,
I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.
The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.
Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.
Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!
With kind regards, Olena Szabo
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
https://doi.org/10.1371/journal.pcbi.1009131.r004

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.