Semi-parametric empirical Bayes method for multiplet detection in snATAC-seq with probabilistic multi-omic integration

Yuntian Wu; Haoran Hu; Wei Chen; Johann E. Gudjonsson; Lam C. Tsoi; Xiaoquan Wen

doi:10.1371/journal.pcbi.1013653

Peer Review History

Original Submission October 23, 2025
29 Jan 2026 Decision Letter - Shaun Mahony, Editor, Zhixiang Lin, Editor

PCOMPBIOL-D-25-02188
Semi-parametric Empirical Bayes Method for Multiplet Detection in snATAC-seq with Probabilistic Multi-omic Integration
PLOS Computational Biology

Dear Dr. Wen,

Thank you for submitting your manuscript to PLOS Computational Biology. After careful consideration, we feel that it has merit but does not fully meet PLOS Computational Biology's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Mar 31 2026 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at ploscompbiol@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pcompbiol/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to formatting updates and technical items listed in the 'Journal Requirements' section below.
* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission.
Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Zhixiang Lin
Academic Editor
PLOS Computational Biology

Shaun Mahony
Section Editor
PLOS Computational Biology

Additional Editor Comments:

Please refer to the comments raised by the reviewers.

Journal Requirements:

1) We ask that a manuscript source file is provided at Revision. Please upload your manuscript file as a .doc, .docx, .rtf or .tex. If you are providing a .tex file, please upload it under the item type ‘LaTeX Source File’ and leave your .pdf version as the item type ‘Manuscript’.

2) Please provide an Author Summary. This should appear in your manuscript between the Abstract (if applicable) and the Introduction, and should be 150-200 words long. The aim should be to make your findings accessible to a wide audience that includes both scientists and non-scientists. Sample summaries can be found on our website under Submission Guidelines: https://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-parts-of-a-submission

3) Please upload all main figures as separate Figure files in .tif or .eps format. For more information about how to convert and format your figure files please see our guidelines: https://journals.plos.org/ploscompbiol/s/figures

4) We have noticed that you have uploaded Supporting Information files, but you have not included a list of legends. Please add a full list of legends for your Supporting Information files after the references list.

5) Some material included in your submission may be copyrighted. According to PLOS’s copyright policy, authors who use figures or other material (e.g., graphics, clipart, maps) from another author or copyright holder must demonstrate or obtain permission to publish this material under the Creative Commons Attribution 4.0 International (CC BY 4.0) License used by PLOS journals.
Please closely review the details of PLOS’s copyright requirements here: PLOS Licenses and Copyright. If you need to request permissions from a copyright holder, you may use PLOS's Copyright Content Permission form. Please respond directly to this email and provide any known details concerning your material's license terms and permissions required for reuse, even if you have not yet obtained copyright permissions or are unsure of your material's copyright compatibility. Once you have responded and addressed all other outstanding technical requirements, you may resubmit your manuscript within Editorial Manager.

Potential Copyright Issues:

- Figure 1. Please confirm whether you drew the images / clip-art within the figure panels by hand. If you did not draw the images, please provide (a) a link to the source of the images or icons and their license / terms of use; or (b) written permission from the copyright holder to publish the images or icons under our CC BY 4.0 license. Alternatively, you may replace the images with open source alternatives. See these open source resources you may use to replace images / clip-art:
  - https://commons.wikimedia.org
  - https://openclipart.org/

6) Please amend your detailed Financial Disclosure statement. This is published with the article. It must therefore be completed in full sentences and contain the exact wording you wish to be published. State what role the funders took in the study. If the funders had no role in your study, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." If you did not receive any funding for this study, please simply state: “The authors received no specific funding for this work.”

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: 1.
It would be beneficial to explore and clearly articulate the limitations of the methods used in the study.
2. Consider including additional methodologies to enable a more comprehensive comparison and evaluation.
3. A careful review of the manuscript for minor grammatical errors is recommended to ensure clarity and professionalism.
4. Improving the quality and resolution of the figures would significantly enhance the visual appeal and readability of the paper.
5. The manuscript would benefit from a more cohesive narrative structure. The abstract, introduction, and related work sections should clearly highlight the current challenges and gaps in the literature that the study aims to address.

Reviewer #2: Review on Semi-parametric Empirical Bayes Method for Multiplet Detection in snATAC-seq with Probabilistic Multi-omic Integration

The manuscript presents SEBULA, a statistical framework designed to identify multiplets (droplets containing two or more cells) in single-nucleus ATAC-seq data. The authors identify a critical limitation in existing methods like AMULET, which rely on rigid parametric assumptions (such as Poisson distributions) that fail to capture the overdispersion and multimodality often found in real chromatin accessibility data. To address this, SEBULA utilises a semi-parametric empirical Bayes framework to estimate the singlet distribution directly from the data, specifically leveraging the "high-coverage locus count" (HCLC) metric. Furthermore, the authors introduce a probabilistic integration strategy that combines SEBULA’s output with evidence from other modalities, such as scRNA-seq, via a Bayesian update rule. Benchmarking is performed on simulated data and seven annotated DOGMA-seq datasets, demonstrating that SEBULA generally outperforms AMULET and ArchR, particularly in datasets with higher doublet proportions.

This manuscript addresses a significant quality control challenge in single-cell genomics.
The move toward a semi-parametric approach is well-motivated by the specific statistical properties of snATAC-seq data (sparsity and overdispersion). The "divide-and-conquer" modularity of the method, which allows for the probabilistic integration of multiomic data, is a valuable contribution to current multimodal workflows.

Overall, the manuscript addresses a relevant challenge and introduces a method with practical utility. The semi-parametric approach is conceptually sound and empirically strong, and the focus on calibration and false discovery rate control is welcome. The paper is generally well written. Several aspects, however, would benefit from clarification, more systematic evaluation, or fuller justification before publication.

Major comments

The manuscript emphasizes the importance of calibrated probabilities over simple ranking. While the estimation of multiplet prevalence (pi_0) is presented, the paper would greatly benefit from a more detailed analysis of the calibration itself. Reliability diagrams (calibration plots) comparing predicted probabilities to observed multiplet frequencies in the ground-truth datasets or quantitative scores such as expected calibration error could further strengthen the claim that SEBULA yields "well-calibrated posterior probabilities".

The method relies on the "zero assumption," specifically that the central mode of the data is dominated by singlets and that singlets are substantially more abundant than multiplets. While this holds for standard experiments, "super-loaded" runs can have extremely high doublet rates. The authors could discuss or simulate the performance of SEBULA in scenarios with very high doublet rates to define the breaking point of this assumption. For example, demonstrations of failure modes and guidance for identifying them would strengthen reader confidence.

The current validation focuses heavily on benchmarking metrics (F1 scores, ROC curves).
To increase the relevance of the manuscript, it would be beneficial to demonstrate the impact of using SEBULA on a downstream biological application or something similar. Additionally, the authors could explicitly discuss the limitation that the method requires fragment files and cannot run on processed count matrices alone, which may limit its applicability for some users.

The multimodal integration uses a naïve Bayes assumption. Although the authors note the limitations, the performance comparisons against COMPOSITE may not be fully fair unless the independence assumption is examined. An analysis of when naïve Bayes fusion may inflate or deflate confidence due to dependence would be valuable.

Figure 1 is intended to provide an overview of the framework, but it currently lacks context regarding the problem setup. The figure should ideally illustrate the data acquisition or the biological origin of multiplets to set the stage. A "zoom-in" approach detailing the different stages (preprocessing vs. detection vs. integration) would make the workflow easier to parse visually.

Can the authors comment on further literature and its relevance, such as OmniDoublet (and the references therein, such as Scrublet and DoubletFinder)?

Minor comments

The introduction establishes the problem well, but the focus on the integration of multiple modalities appears somewhat late. Given that SEBULA-MI (the integrated version) yields the best performance, this motivation should be highlighted earlier in the introduction to better frame the paper’s contributions.

Detecting homogeneous doublets is rather difficult in general; what would be the impact on a downstream analysis of missing them?

The Box–Cox transformation is a key component of the pipeline, yet several choices are presented as default without a clear rationale. The transformation may blur biologically meaningful differences across cells with extreme HCLC.
The abbreviation "SEBULA" is introduced without a clear explanation of what the acronym stands for in the main text (it appears to be derived from the title, but this should be explicit).

In Figure 2, it is unclear which panels correspond to labels A and B. Moreover, the text claims the singlet distribution is "multimodal". However, in the density plot (Figure 2A), the singlet distribution appears as a single sharp spike. If multimodality exists, the x-axis scale or visualization needs to be adjusted to make this visible to the reader.

Runtime comparisons against the other methods would highlight SEBULA’s practical usability at scale.

A few minor spelling errors, e.g., “applyed,” (l.172) “effectively” (l.168), should be corrected.

Reviewer #3: This manuscript introduces SEBULA, a novel computational tool for detecting multiplets (doublets) in single-nucleus ATAC-seq (snATAC-seq) data. The authors identify a gap in current count-based methods (e.g., AMULET), which rely on rigid parametric assumptions (such as Poisson distributions) that often fail to capture the true data distribution. SEBULA addresses this by employing a semi-parametric empirical Bayes framework that estimates the "singlet" distribution directly from the data using statistical techniques including Box-Cox transformation, central matching, splines, and the method of moments. Furthermore, SEBULA offers a flexible Bayesian framework to integrate its predictions with other modalities (e.g., scRNA-seq), enabling multi-omic doublet detection.

Overall, the manuscript is methodologically sound and clearly written. My detailed comments follow.

Major:

1. The method uses a truncation threshold, x_min. Although an optimization procedure (minimizing cross-entropy) is provided and the Methods section claims robustness, including a sensitivity analysis (e.g., a plot of F1 scores as x_min varies) would better demonstrate SEBULA's robustness.
2.
HCLC (High-coverage loci) is defined according to AMULET’s preprocessing. As HCLC is a critically important concept to the SEBULA method, it would be helpful to briefly clarify whether this preprocessing step is computationally expensive or if SEBULA includes an optimized preprocessor, as this step is often the bottleneck in read-based methods.
3. The integration of ATAC and RNA signals relies on a "naïve Bayes" assumption of conditional independence between the modalities. While the authors acknowledge this is an approximation, the manuscript would benefit from a brief analysis or plot (perhaps in Supplementary Materials) showing the correlation between HCLC scores and RNA-based doublet scores within the singlet population. If the correlation is low, this would empirically justify the independence assumption.

Minor

1. Although the authors claim the software is "computationally efficient," the Results section lacks specific data to support this. Including a brief comparison of runtime and memory usage between SEBULA, AMULET, and ArchR on a representative dataset would be beneficial.
2. Figure 1 shows that SEBULA can integrate ATAC and RNA data. However, it is unclear whether SEBULA supports integration of more than two modalities, as newer sequencing technologies like DOGMA-seq and TEA-seq profile ATAC, RNA, and surface protein. If so, the Methods section and Figure 1 should be updated accordingly.
3. Rank-preserving Box-Cox transformation: this transformation always preserves ranks as long as \(\lambda\) is fixed. It is unclear why the term “rank preserving” needs to be emphasized.
4. I noticed several typos:
   Page 12, line 172: "applyed" should be "applied".
   Page 12, line 174: "profiles" should be "profiling".
   Page 12, line 189: "effectivey" should be "effective".
   Page 16, line 245: "provieds" should be "provides".

********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: None
Reviewer #2: Yes
Reviewer #3: Yes

******

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No
Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

Figure resubmission:

While revising your submission, we strongly recommend that you use PLOS’s NAAS tool (https://ngplosjournals.pagemajik.ai/artanalysis) to test your figure files.
NAAS can convert your figure files to the TIFF file type and meet basic requirements (such as print size, resolution), or provide you with a report on issues that do not meet our requirements and that NAAS cannot fix.

After uploading your figures to PLOS’s NAAS tool - https://ngplosjournals.pagemajik.ai/artanalysis, NAAS will process the files provided and display the results in the "Uploaded Files" section of the page as the processing is complete. If the uploaded figures meet our requirements (or NAAS is able to fix the files to meet our requirements), the figure will be marked as "fixed" above. If NAAS is unable to fix the files, a red "failed" label will appear above. When NAAS has confirmed that the figure files meet our requirements, please download the file via the download option, and include these NAAS processed figure files when submitting your revised manuscript.

Reproducibility:

To enhance the reproducibility of your results, we recommend that authors of applicable studies deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

https://doi.org/10.1371/journal.pcbi.1013653.r001
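
Illustrative note on the naïve Bayes integration raised by Reviewers #2 and #3: the reports above describe combining per-modality multiplet evidence through a Bayesian update under conditional independence. A minimal Python sketch of such a fusion is given here, assuming each modality reports a posterior multiplet probability computed from the same prior. The function name `fuse_posteriors` and the shared-prior setup are hypothetical and are not taken from SEBULA; the authors' actual update rule may differ.

```python
import math


def fuse_posteriors(posteriors, prior):
    """Fuse per-modality multiplet posteriors under conditional independence.

    Assumption (illustrative only): every posterior in `posteriors` was
    computed with the same prior multiplet probability `prior`.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    prior_logit = math.log(prior / (1.0 - prior))
    log_odds = prior_logit
    for p in posteriors:
        if not 0.0 < p < 1.0:
            raise ValueError("posteriors must lie strictly between 0 and 1")
        # Log Bayes factor contributed by this modality:
        # posterior log-odds minus prior log-odds.
        log_odds += math.log(p / (1.0 - p)) - prior_logit
    # Convert the fused log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Hypothetical droplet: ATAC-based and RNA-based posteriors, prior 0.1.
    print(fuse_posteriors([0.5, 0.5], prior=0.1))
```

With a prior of 0.1, two modalities that each report 0.5 fuse to 0.9, because each contributes a Bayes factor of 9. If the two scores were in fact strongly dependent, the same evidence would be counted twice, which is the kind of inflated confidence Reviewer #2 asks the authors to examine.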
Revision 1
17 Mar 2026 Author Response Attachments Attachment Submitted filename: response.pdf https://doi.org/10.1371/journal.pcbi.1013653.r002
9 Apr 2026 Decision Letter - Shaun Mahony, Editor, Zhixiang Lin, Editor

Dear Wen,

We are pleased to inform you that your manuscript 'Semi-parametric Empirical Bayes Method for Multiplet Detection in snATAC-seq with Probabilistic Multi-omic Integration' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Zhixiang Lin
Academic Editor
PLOS Computational Biology

Shaun Mahony
Section Editor
PLOS Computational Biology

*********************************************************

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors addressed all my concerns and the paper can be accepted now.

Reviewer #2: I appreciate the effort made by the authors. All my major concerns are addressed.
Some minor comments: The discussion introduces some new results and then refers to the supplement. These insights should ideally already be incorporated into the results part. Some minor typos remain:
- line 207: "Eqn.(3)" instead of "Eqn. (3)"
- line 231: “Nebula” instead of “Sebula”

Reviewer #3: Thank you for the revision. The updated manuscript has satisfactorily addressed all of my comments.

******

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: None
Reviewer #2: Yes
Reviewer #3: Yes

******

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No
Reviewer #3: No

https://doi.org/10.1371/journal.pcbi.1013653.r003
Formally Accepted
Acceptance Letter - Shaun Mahony, Editor, Zhixiang Lin, Editor

PCOMPBIOL-D-25-02188R1
Semi-parametric Empirical Bayes Method for Multiplet Detection in snATAC-seq with Probabilistic Multi-omic Integration

Dear Dr Wen,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

For Research, Software, and Methods articles, you will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

https://doi.org/10.1371/journal.pcbi.1013653.r004

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.