Abstract
During the COVID-19 pandemic, wastewater-based epidemiology has progressively taken a central role as a pathogen surveillance tool. Tracking viral loads and variant outbreaks in sewage offers advantages over clinical surveillance methods by providing estimates not biased by testing practices and enabling early detection. However, wastewater-based epidemiology poses new computational research questions that need to be solved in order for this approach to be implemented broadly and successfully. Here, we address the variant deconvolution problem, where we aim to estimate the relative abundances of genomic variants from next-generation sequencing data of a mixed wastewater sample. We introduce LolliPop, a computational method to solve the variant deconvolution problem. LolliPop is tailored to wastewater time series sequencing data and applies temporal regularization in the form of a fused ridge penalty. We show that this regularization is equivalent to kernel smoothing and that it makes abundance estimates robust to very high levels of missing data, which is common for wastewater sequencing. We use the bootstrap to produce confidence intervals, and develop analytical standard errors that can produce similar confidence intervals at a fraction of the computational cost. We demonstrate the application of our method to data from the Swiss wastewater surveillance efforts as well as on simulated data.
Author summary
Wastewater-based epidemiology has become a valuable tool for tracking viruses like SARS-CoV-2 across entire communities. Sequencing wastewater can reveal which viral variants are circulating, offering early insights into variant dynamics while avoiding the biases of clinical testing. A central challenge is to infer the relative abundances of these variants from observed mutation data. This task is complicated by the fact that variant profiles can be highly similar, and the data is often noisy with many missing read count values from genomic positions with no coverage, especially when the incidence of the pathogen is low. We developed LolliPop, a statistical method that leverages the time series structure of wastewater data to robustly deconvolve variant abundances and compute fast confidence intervals. Using both simulated data and real data from the Swiss national variant monitoring, we show that LolliPop is accurate and robust to high levels of missing data.
Citation: Dreifuss D, Topolsky I, Icer Baykal P, Beerenwinkel N (2026) Tracking SARS-CoV-2 genomic variants in wastewater sequencing data with LolliPop. PLoS Comput Biol 22(2): e1014003. https://doi.org/10.1371/journal.pcbi.1014003
Editor: Claudio José Struchiner, Fundação Getúlio Vargas: Fundacao Getulio Vargas, BRAZIL
Received: April 28, 2025; Accepted: February 9, 2026; Published: February 19, 2026
Copyright: © 2026 Dreifuss et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data and code used to produce the results presented in this article is available at https://doi.org/10.5281/zenodo.15277338.
Funding: DD is funded by the Swiss National Science Foundation (grant no. CRSII5_205933 to NB). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
During the COVID-19 pandemic, genomic surveillance has been applied at an unprecedented scale to support various national efforts in containing outbreaks [1]. In this context, wastewater monitoring has seen its broadest and most successful application, with PCR-based surveillance of total viral loads sometimes complemented with next-generation sequencing (NGS) [2]. As genomic analysis is extended from clinical samples to samples from wastewater treatment plants (WWTPs), new statistical and computational research questions arise: the samples are mixtures of heterogeneous, possibly highly diverse RNA molecules, for which the traditional approach of reconstructing consensus sequences is not suitable [1]. Beyond SARS-CoV-2, wastewater-based epidemiology (WBE) is becoming a central tool for pathogen surveillance in general [2]. WBE has been shown to inform on the infection dynamics of pathogens for which clinical testing rates are usually low, such as influenza and respiratory syncytial virus [3,4]. It is therefore pressing that the computational challenges of analyzing mixed, heterogeneous wastewater sequencing data be addressed.
Most of the existing viral genomic data analysis pipelines and tools were designed for clinical samples and rely on classifying the majority variant of each sample from the consensus sequence of the read alignment [5–7]. One of the main challenges in analyzing wastewater-derived NGS data is the loss of information about the full viral haplotype, i.e., which mutations occur together on the same RNA molecule. This loss can result from fragmentation of the genetic material and from sequencing protocols, which often rely on PCR amplification of the target genome in multiple amplicons. In addition, the sequencing data exhibits very high levels of noise, because from a large pool of raw wastewater, extreme downsampling steps (grab or composite sampling, filtering, random reverse transcription, etc.) are followed by extreme amplification steps (PCR). This in turn can result in low genomic coverage and high levels of missing data. Some tools have been developed to increase sensitivity in the detection of variants, for example, by searching for co-occurring mutations on the same read, which has been shown to improve early detection of newly introduced variants [8].
Beyond early detection of new introductions, quantitative estimation of the relative abundances of different variants from wastewater is a very important aspect of viral genomic surveillance. The relative growth dynamics of a new variant can inform on its fitness advantage relative to the dominating strain [9], and hence on its predicted impact on the infection dynamics. The fitness advantage is an epidemiologically important parameter that can be estimated accurately from wastewater samples using either specific PCR-based assays [10,11] or NGS data [8,12,13], while using far fewer samples as compared to using clinical data, provided that accurate quantification of the variant relative abundances can be made. For practical planning and policy making, it is therefore crucial to have accurate and time-efficient methods for estimation of the relative abundances through time of variants from wastewater NGS data, including reliable measures of uncertainty.
Prior research has shown that the relative abundance of an emerging variant can be quantified by averaging over a set of mutations that is unique for this variant [8]. However, as the number of variants grows, shared mutations cannot generally be discarded as finding sets of unique, characteristic mutations for each variant quickly becomes inefficient or impossible. To address this limitation, some methods have been developed that take into account the correlation structure of mutations between different variants [14–17]. Some of these methods – such as Freyja [16] – also rely on deconvolution of the mutation frequencies in a sample into the relative abundance of variants by inverting a linear model. However, with these methods, quantification of uncertainty is done on the basis of the computationally intensive bootstrap, which can be prohibitive when faced with large amounts of data or limited computational resources (as is often the case in real-world surveillance settings). It is also not guaranteed in general that these methods fare well with high levels of missing data, which are common in environmental sampling due to low concentrations and high PCR inhibition [18].
Here, we introduce a new method for estimating variant relative abundances from wastewater sequencing data, named LolliPop. Tailored to time series data, our method estimates the time course of relative abundances of all variants in the mixed sample. We employ temporal regularization to make this approach robust to high levels of missing data, which are frequent in wastewater sequencing. As an alternative to confidence bands based on the bootstrap, we derive analytical methods to compute asymptotic confidence bands, which provide a 30-fold speedup. We evaluate our method on simulated datasets with high missing-value rates, as well as on wastewater NGS data with low genomic coverage, where we find high correlation between the inferred variant abundances and estimates from matched clinical data [8]. LolliPop is currently used for the Swiss wastewater monitoring program (https://wise.ethz.ch/?page=viruses/variants).
Methods
We model expected mutation frequencies in wastewater samples as a linear combination of variant profiles. To address challenges such as high noise, missing data, and similarity between variant profiles, we apply temporal regularization in the form of a fused ridge penalty. This leads to a regularized loss minimization framework for deconvolving relative variant abundances over time. We provide both bootstrap-based and analytical confidence intervals, incorporating adjustments for overdispersion and using logit reparametrization to ensure valid bounds and variance reduction. We evaluate the method on real wastewater sequencing data from the Swiss national surveillance program and on simulated datasets designed to assess robustness across varying levels of missing data, variant profile similarity, and model misspecification.
Variant deconvolution
We consider an ordered collection of samples, taken at (not necessarily evenly spaced) timepoints $t_1 < t_2 < \dots < t_T$. Each studied variant $v \in \{1, \dots, K\}$ carries a subset of the mutations $m \in \{1, \dots, M\}$ relative to a fixed reference strain. Let $B \in \{0,1\}^{M \times K}$ be the design matrix of variant definitions, i.e., $B_{mv} = 1$ if variant $v$ bears mutation $m$, and $B_{mv} = 0$ otherwise. Let $y_t \in [0,1]^M$ be the observed mutation frequency vector at time $t$, where the entries $y_{tm}$ are the observed proportions of reads from the wastewater sequencing experiment supporting a certain mutation $m$. We are interested in $x_t \in [0,1]^K$, the relative variant abundances of all variants at each time point $t$. We make the assumption of a linear probability model, where at time $t$ the probability of a read carrying a certain mutation equals the sum of the relative abundances of the variants bearing that mutation. We thus have that the expected proportion of reads with a given mutation is a linear combination of the variant relative abundances:

$$\mathbb{E}[y_t] = B x_t. \tag{1}$$

Definition (Variant Deconvolution Problem): For given variant definitions $B$ and a time series of mutation frequencies $y_1, \dots, y_T$, the variant deconvolution problem is to find the relative variant abundances $x_1, \dots, x_T$ in the population of the catchment area of the WWTP, such that $y_t \approx B x_t$ for all time points $t$.

As finding the exact relative variant abundances in the population is not possible due to the randomness of the data-generating process and model misspecification, we relax the problem to finding a best approximation. Solving the variant deconvolution problem is then performed by choosing a loss function $\ell$ and optimizing

$$\hat{x}_t = \operatorname*{arg\,min}_{x_t} \; \ell(y_t - B x_t).$$

In the following, we use the soft loss (SL1)

$$\ell_{\mathrm{SL1}}(r) = \sum_{m=1}^{M} 2 s^2 \left( \sqrt{1 + r_m^2 / s^2} - 1 \right),$$

which interpolates between the least squares (LS) loss and the least absolute deviation loss and is a common choice in robust statistics. Here, $s$ is a hyperparameter controlling the tradeoff between robustness and efficiency.
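As a concrete reference, the SL1 loss can be sketched in a few lines of NumPy, following the convention used by SciPy's robust least-squares losses; the function name and the scale value 0.13 below are illustrative, not LolliPop's API:

```python
import numpy as np

def soft_l1_loss(residuals, s=0.13):
    """Soft-L1 (SL1) loss: quadratic for |r| << s, linear for |r| >> s.

    `s` is the scale hyperparameter controlling the breakpoint between
    the least-squares and least-absolute-deviation regimes.
    """
    z = (np.asarray(residuals) / s) ** 2
    return float(np.sum(2 * s**2 * (np.sqrt(1 + z) - 1)))
```

For small residuals the loss behaves like the squared error, while for large residuals it grows only linearly, which limits the influence of outlying mutation frequencies.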
Temporal regularization
Wastewater sequencing data typically displays high noise levels, possibly leading to dropouts (i.e., missing values, genomic regions with no coverage). Excluding the missing values of $y_t$ corresponds to removing rows of the variant design matrix $B$. In the worst (and common) case, this leads to the variant design matrix $B$ being rank deficient. In such a case, the variant relative abundances $x_t$ are not identifiable without further assumptions. A common regularizer in regression settings where the design matrix is singular (or has a prohibitively high condition number) is the so-called ridge penalty [19]. Here, we do not apply a ridge penalty to the magnitude of the relative abundance vectors but, further assuming temporal continuity of the variant abundances, we introduce a fused ridge penalty on the difference between relative abundance vectors of different timesteps. As variants defined by more mutations contribute to more terms in the loss computation, they risk being less penalized relative to variants with fewer mutations. To avoid this, we distribute the penalization through $B$ on the entries of $x_t$. The quadratic penalty is therefore formulated as

$$P(x_1, \dots, x_T) = \sum_{t < t'} \lambda_{t t'} \, \lVert B x_t - B x_{t'} \rVert_2^2,$$

where $\lambda_{t t'}$ controls the penalization for different time differences between $t$ and $t'$, such that the complete penalized loss to minimize is

$$\sum_{t=1}^{T} \ell(y_t - B x_t) + \sum_{t < t'} \lambda_{t t'} \, \lVert B x_t - B x_{t'} \rVert_2^2.$$

In the following, we introduce a temporal decoupling of the optimization problem, allowing each $x_t$ to be estimated independently at each time step, improving efficiency and numerical stability. Assuming a LS loss (i.e., SL1 with $s \to \infty$) we have

$$\hat{x}_{1:T} = \operatorname*{arg\,min}_{x_{1:T}} \; \sum_{t=1}^{T} \lVert y_t - B x_t \rVert_2^2 + \sum_{t < t'} \lambda_{t t'} \, \lVert B x_t - B x_{t'} \rVert_2^2.$$

For minimization, we compute the gradient w.r.t. $x_t$ and set it to zero, such that

$$B^\top B x_t + 2 \sum_{t'} \lambda_{t t'} \, B^\top B \, (x_t - x_{t'}) = B^\top y_t,$$

where we assumed, without loss of generality, that $\lambda_{t t} = 0$. We define the symmetric matrix $W \in \mathbb{R}^{T \times T}$ with entries

$$W_{t t} = 1 + 2 \sum_{t'} \lambda_{t t'}, \qquad W_{t t'} = -2 \lambda_{t t'} \;\; (t \neq t'),$$

so that the stationarity conditions read $\sum_{t'} W_{t t'} \, B^\top B \, x_{t'} = B^\top y_t$. If $W$ is full rank, which is easy to verify for example if $\lambda_{t t'} \geq 0$ for all $t, t'$ (meaning that the penalization encodes a smoothness constraint, and $W$ is then strictly diagonally dominant), we obtain the solution

$$\hat{x}_t = (B^\top B)^{-1} B^\top \, \tilde{y}_t, \qquad \tilde{y}_t = \sum_{t'} (W^{-1})_{t t'} \, y_{t'}.$$

With the common convention of nonnegative penalties $\lambda_{t t'} \geq 0$, we have that $W^{-1}$ is also doubly stochastic (its rows and columns are nonnegative and sum to one). We remark that the solution is equivalent to solving the square loss solution with additional simultaneous kernel smoothing using the kernel function $K(t, t') = (W^{-1})_{t t'}$. Thus, instead of specifying the $\lambda_{t t'}$ penalties, we rather specify the more interpretable kernel function $K(t, t')$, a non-negative, non-increasing function of $|t - t'|$. Extending equation (1), we have

$$\hat{x}_t = \operatorname*{arg\,min}_{x_t} \; \ell(\tilde{y}_t - B x_t), \qquad \text{where } \tilde{y}_t = \sum_{t'} \kappa_{t t'} \, y_{t'} \text{ and } \kappa_{t t'} = \frac{K(t, t')}{\sum_{t''} K(t, t'')}.$$

In LolliPop, we use the Gaussian kernel

$$K(t, t') = \exp\!\left( -\frac{(t - t')^2}{2 h^2} \right)$$

with bandwidth hyperparameter $h$, which is a common choice in nonparametric statistics.
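The effect of the fused ridge penalty can thus be sketched as plain kernel smoothing of the mutation frequencies over time. The minimal NumPy sketch below (function name and missing-data handling are our illustration, not LolliPop's exact API) renormalizes the Gaussian weights over observed entries only, so that missing values are simply skipped:

```python
import numpy as np

def gaussian_kernel_smooth(times, freqs, h=10.0):
    """Kernel-smooth a (T x M) matrix of mutation frequencies over time.

    NaN entries (positions with no coverage) are excluded, and the
    kernel weights are renormalized over observed entries; `h` is the
    bandwidth of the Gaussian kernel.
    """
    times = np.asarray(times, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    K = np.exp(-0.5 * ((times[:, None] - times[None, :]) / h) ** 2)  # (T, T)
    observed = ~np.isnan(freqs)                                      # (T, M)
    filled = np.where(observed, freqs, 0.0)
    num = K @ filled                      # kernel-weighted sums per mutation
    den = K @ observed.astype(float)      # total weight on observed entries
    return np.where(den > 0, num / np.maximum(den, 1e-12), np.nan)
```

Because weights are renormalized per entry, a timepoint with a dropout borrows strength from neighboring timepoints rather than propagating the missing value.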
Solving the deconvolution problem
With observed mutation frequencies $y_1, \dots, y_T$ and known variant definitions $B$ as input, we solve the deconvolution problem for a given kernel $K$ and loss function $\ell$ by finding $\hat{x}_1, \dots, \hat{x}_T$ as

$$\hat{x}_t = \operatorname*{arg\,min}_{x_t \geq 0} \; \ell\!\left( \sum_{t'} \kappa_{t t'} \, y_{t'} - B x_t \right).$$

To numerically solve this optimization problem, we use routines from the Python scientific computing library Scipy [20]. In general, we use the Trust Region Reflective method [21], but for $s \to \infty$ we can switch to using the LS loss function along with Scipy's faster non-negative least squares solver [22].
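A minimal sketch of the per-timepoint fit might look as follows; the final simplex renormalization is our simplification, and the exact constraint handling in LolliPop may differ:

```python
import numpy as np
from scipy.optimize import least_squares, nnls

def deconvolve_timepoint(B, y, s=0.13):
    """Estimate relative variant abundances x with E[y] ~ B x.

    Sketch: bounded soft-L1 fit using SciPy's Trust Region Reflective
    solver, followed by renormalization onto the simplex.
    """
    M, K = B.shape
    x0 = np.full(K, 1.0 / K)  # uniform starting point
    res = least_squares(lambda x: B @ x - y, x0, bounds=(0.0, 1.0),
                        method="trf", loss="soft_l1", f_scale=s)
    x = np.clip(res.x, 0.0, None)
    return x / x.sum() if x.sum() > 0 else x

def deconvolve_timepoint_ls(B, y):
    """LS-loss variant using SciPy's faster non-negative least squares."""
    x, _ = nnls(B, y)
    return x / x.sum() if x.sum() > 0 else x
```

In practice the smoothed frequencies $\tilde{y}_t$ would be passed as `y`, so that smoothing and deconvolution are applied jointly.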
Confidence intervals
To use WBE for robust decision making, it is essential to provide estimates of the uncertainty in the prediction of relative variant abundances. In the variant deconvolution step, we make assumptions only on the conditional first moment of the mutation frequencies. The corresponding least-squares estimator is a linear probability model estimator solving the estimating equations $B^\top (y_t - B x_t) = 0$. Although it is unbiased and consistent, its default standard errors are not suitable for confidence interval construction [23]. Further assumptions are needed for computing confidence intervals, and we pursue two different strategies: one based on the bootstrap and one based on an analytical approximation of the standard errors.
It is typical to assume that observations are independent and identically distributed, implying a binomial conditional distribution of the number of positive observations (i.e., here mutated reads). However, as sequencing and amplification processes introduce strong dependencies between reads – many reads are simply copies of each other –, this assumption can lead to misspecification of the conditional variance of the fraction of mutated reads. Therefore, we here do not assume the reads to be independent in the bootstrap procedure, nor do we derive analytical standard errors under a strict binomial model.
Bootstrapping
A popular strategy to compute confidence bands is to use the non-parametric bootstrap by resampling observations with replacement [24,25]. Here, we do not resample the individual reads due to strong dependence between them, but instead adopt a cluster bootstrap approach [26] by resampling mutation indices from $\{1, \dots, M\}$ with replacement – an approach analogous to the resampling of alignment sites in phylogenetics [27]. We do so $N$ times to construct $N$ bootstrap resamples of the whole time series. Each bootstrap sample is then processed by deconvolution and smoothing, resulting in $N$ time series of dimensionality $T \times K$. For each relative variant abundance $x_{tv}$, confidence intervals are constructed at each timepoint $t$ from the empirical quantiles of the bootstrap samples. This approach has the merit of producing confidence intervals restricted to the [0,1] range without further reparametrization.
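The cluster bootstrap can be sketched as follows; note that each replicate resamples mutation indices once and applies the same resample to every timepoint, preserving the time-series structure (the function signature is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_abundances(B, Y, deconvolve, n_boot=100):
    """Cluster bootstrap over mutations for a (T x M) frequency matrix Y.

    `deconvolve` maps (B, Y) -> a (T x K) matrix of abundances; each
    replicate resamples the M mutation indices with replacement and
    reruns the full deconvolution. Returns pointwise 95% bands.
    """
    M = B.shape[0]
    reps = []
    for _ in range(n_boot):
        idx = rng.integers(0, M, size=M)   # resample mutation indices
        reps.append(deconvolve(B[idx], Y[:, idx]))
    reps = np.stack(reps)                  # (n_boot, T, K)
    lo, hi = np.percentile(reps, [2.5, 97.5], axis=0)
    return lo, hi
```

Because the empirical quantiles are taken over valid abundance estimates, the resulting bands are automatically confined to [0, 1].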
Asymptotic confidence intervals
We assume that at a given time $t$, the proportion $y_{tm}$ of reads supporting mutation $m$ follows a scaled binomial distribution with parameter $p_{tm}$, i.e.,

$$n_{tm} \, y_{tm} \sim \mathrm{Binomial}(n_{tm}, \, p_{tm}),$$

where $n_{tm}$ is the read depth at the position of mutation $m$. Using the linear probability model (Eq. 1), we have $p_t = B x_t$. We additionally assume conditional independence of the mutation proportions such that $y_{tm} \perp y_{tm'}$ for $m \neq m'$. We thus obtain the log-likelihood

$$\log L(x_t) = \sum_{m=1}^{M} n_{tm} \left[ y_{tm} \log p_{tm} + (1 - y_{tm}) \log(1 - p_{tm}) \right] + \mathrm{const}, \quad \text{where } p_{tm} = (B x_t)_m.$$

Differentiating the log-likelihood summands twice, we find the Fisher information matrix (see Text 1 in S1 Text)

$$I(x_t) = B^\top \operatorname{diag}\!\left( \frac{n_{tm}}{p_{tm} (1 - p_{tm})} \right) B.$$

We extract the asymptotic standard errors

$$\widehat{\mathrm{se}}(\hat{x}_{tv}) = \sqrt{\left[ I(\hat{x}_t)^{-1} \right]_{vv}},$$

which are then used to construct Wald confidence intervals [28]. Here, a pseudofraction is added to the entries of $\hat{p}_t = B \hat{x}_t$ to avoid division by zero when computing the asymptotic standard errors.
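For illustration, the Fisher information and the resulting standard errors can be computed in a few lines; this is a sketch under the binomial assumptions above, and the pseudoinverse is our safeguard for rank-deficient cases:

```python
import numpy as np

def wald_standard_errors(B, x_hat, depth, pseudo=1e-4):
    """Asymptotic standard errors from the binomial Fisher information.

    I(x) = B^T diag(n_m / (p_m (1 - p_m))) B with p = B x; a small
    pseudofraction keeps p away from 0 and 1.
    """
    p = np.clip(B @ x_hat, pseudo, 1.0 - pseudo)
    W = depth / (p * (1.0 - p))            # per-mutation information weights
    I = B.T @ (W[:, None] * B)             # Fisher information matrix
    cov = np.linalg.pinv(I)                # pseudoinverse for safety
    return np.sqrt(np.clip(np.diag(cov), 0.0, None))
```

In the single-variant case this reduces to the familiar binomial standard error $\sqrt{p(1-p)/n}$.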
Logit reparametrization
To ensure that the confidence bands stay confined to the [0,1] interval, we compute the asymptotic standard errors and Wald confidence intervals on the logit scale $\theta_t = \operatorname{logit}(x_t)$ (applied entrywise), before projecting them back to the linear scale. We compute the inverse of the Fisher information matrix of $\theta_t$ by using the Delta method,

$$I(\hat{\theta}_t)^{-1} = J \, I(\hat{x}_t)^{-1} \, J^\top,$$

where $J = \operatorname{diag}\!\left( \frac{1}{\hat{x}_{tv} (1 - \hat{x}_{tv})} \right)$ is the Jacobian of the transformation, such that

$$\widehat{\mathrm{se}}(\hat{\theta}_{tv}) = \frac{\widehat{\mathrm{se}}(\hat{x}_{tv})}{\hat{x}_{tv} (1 - \hat{x}_{tv})}.$$

Here again, a pseudofraction is added to the entries of $\hat{x}_t$ to avoid division by zero.
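A minimal sketch of the logit-scale interval, using the diagonal-Jacobian simplification from the Delta method (function and argument names are ours):

```python
import numpy as np

def logit_wald_interval(x_hat, se_x, z=1.96, pseudo=1e-4):
    """Wald interval on the logit scale, back-transformed to [0, 1].

    By the delta method, se(logit(x)) = se(x) / (x (1 - x)); a small
    pseudofraction keeps x away from 0 and 1.
    """
    x = np.clip(np.asarray(x_hat, dtype=float), pseudo, 1.0 - pseudo)
    theta = np.log(x / (1.0 - x))                  # logit transform
    se_theta = np.asarray(se_x) / (x * (1.0 - x))  # delta-method se
    lo, hi = theta - z * se_theta, theta + z * se_theta
    expit = lambda t: 1.0 / (1.0 + np.exp(-t))
    return expit(lo), expit(hi)                    # back to linear scale
```

The back-transformed bounds are guaranteed to lie in (0, 1), unlike Wald intervals computed directly on the linear scale.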
Overdispersion
In the likelihood model described here, taking $n_{tm}$ as the read depth can lead to a model not capturing the dispersion of the data correctly, due to reads not being independent. We thus follow an approach analogous to quasilikelihood, making more flexible assumptions on the conditional variance of the data [29]. We fix $n_{tm}$ for each genomic position, and we compute the ratio of observed versus expected average square deviations of the observed data from the fitted values. This ratio is then taken as an overdispersion (or underdispersion) factor $\hat{\psi}_t$. At a given time $t$, the Wald confidence intervals are adjusted by scaling the asymptotic standard errors by $\sqrt{\hat{\psi}_t}$:

$$\widehat{\mathrm{se}}_{\mathrm{adj}}(\hat{\theta}_{tv}) = \sqrt{\hat{\psi}_t} \; \widehat{\mathrm{se}}(\hat{\theta}_{tv}).$$

We build on the moment-based estimator of the dispersion factor for generalized linear models [29]

$$\hat{\psi}_t = \frac{1}{M - K} \sum_{m=1}^{M} \frac{(y_{tm} - \hat{p}_{tm})^2}{\hat{p}_{tm} (1 - \hat{p}_{tm}) / n_{tm}}.$$
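The dispersion estimator amounts to a Pearson statistic divided by the residual degrees of freedom, sketched here (argument names are ours):

```python
import numpy as np

def pearson_dispersion(y, p_hat, depth, n_params):
    """Moment-based (Pearson) estimator of the dispersion factor.

    psi = (1 / (M - K)) * sum_m (y_m - p_m)^2 / (p_m (1 - p_m) / n_m).
    Standard errors are then scaled by sqrt(psi).
    """
    y, p, n = map(np.asarray, (y, p_hat, depth))
    nominal_var = p * (1.0 - p) / n        # binomial variance of a proportion
    pearson_terms = (y - p) ** 2 / nominal_var
    return float(pearson_terms.sum() / (len(y) - n_params))
```

Values above one indicate overdispersion relative to the binomial model (e.g., due to PCR duplicates), values below one underdispersion.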
Implementation and availability
The methods we present here are implemented in the Python package LolliPop, which takes as input a tabular file of observed mutation frequencies and variant definitions, performs simultaneous kernel smoothing and deconvolution using numerical optimization, and produces confidence intervals. LolliPop is available on Github (https://github.com/cbg-ethz/lollipop) and as a Bioconda package. LolliPop is available within V-pipe 3.0 [30]. All data and code used to produce the results presented in this article is available at https://doi.org/10.5281/zenodo.15277338.
Processing of wastewater sequencing data
We used the wastewater sequencing data from the Swiss surveillance project reported in [8]. The dataset contains 1295 NGS datasets from longitudinal samples collected at six major WWTPs in Switzerland, sampled daily between January 2021 and September 2021. In brief, 24-hour raw influent composite samples were subjected to filtering, total nucleic acid extraction and reverse transcription. SARS-CoV-2 was amplified using the ARTIC v3 protocol [31], which amplifies almost the whole viral genome using 98 amplicons of roughly 400 bp each, before being submitted to NGS. The data were processed using V-pipe 3.0 [30]. We defined the variants of concern (VOCs) B.1.1.7 (Alpha), B.1.351 (Beta), P.1 (Gamma), B.1.617.1 (Kappa), and B.1.617.2 (Delta) by querying the mutations present in ≥80% of the clinical sequences defining these variants on Cov-Spectrum [32] and supported by at least 100 clinical sequences. We then called these mutations in the wastewater samples from pileups of the read alignments. We defined the lineage “other” as the complement of all the mutations in this set of VOCs (i.e., a profile with no mutations). We deconvolved using different hyperparameter values (see below). We computed Wald confidence intervals adjusted for overdispersion with logit reparametrization, as well as bootstrap-based confidence intervals (1000 bootstrap samples).
Comparison to clinical data
Using the LAPIS API of Cov-Spectrum [32], we retrieved counts of sequenced SARS-CoV-2 PCR-positive clinical samples for Switzerland, stratified by submitting lab, canton, and inferred variant. We restricted the data to samples from the large clinical testing company Viollier, where the PCR-positive samples are randomly subsampled before being sent for sequencing. We compare each WWTP to the clinical data from the canton it is located in. For the Berne WWTP of Laupen, we compare to an aggregate of the clinical sequences from both the cantons of Bern and Fribourg, as the catchment area is split between those two cantons [8]. For comparing the clinical and wastewater time series, we computed their lagged cross-correlations for time lags ranging from 30 days of lead time for the clinical signal to 30 days of lead time for the wastewater signal. In each case, we computed the overall cross-correlation as well as the per-location cross-correlations, both weighted by the square root of the clinical sample size.
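The weighted lagged cross-correlation can be sketched as follows, assuming both series are aligned on a common daily grid (a simplification of the actual analysis; the function name is ours):

```python
import numpy as np

def weighted_lagged_corr(ww, clin, weights, lag):
    """Weighted Pearson correlation between a wastewater series and a
    clinical series shifted by `lag` days (positive lag = wastewater
    leads the clinical signal).
    """
    if lag > 0:
        a, b, w = ww[:-lag], clin[lag:], weights[lag:]
    elif lag < 0:
        a, b, w = ww[-lag:], clin[:lag], weights[:lag]
    else:
        a, b, w = ww, clin, weights
    w = w / w.sum()
    ma, mb = np.sum(w * a), np.sum(w * b)          # weighted means
    cov = np.sum(w * (a - ma) * (b - mb))          # weighted covariance
    return cov / np.sqrt(np.sum(w * (a - ma) ** 2) * np.sum(w * (b - mb) ** 2))
```

Scanning `lag` over a range of values and locating the maximum yields the lead-time estimate reported in the Results.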
Simulations
As a matrix inverse problem, the variant deconvolution problem can be sensitive to collinearity in the variant definition matrix, which is determined by the genetic similarity between variants and exacerbated by noise and missing values. To assess the robustness of LolliPop to the degree of similarity between variants and the fraction of missing values in the data, we simulated wastewater sequencing data for a range of different scenarios. For timesteps $t = 1, \dots, T$ and variants $v = 1, \dots, K$, we generated deterministic time series of variant relative abundances from a multinomial logistic growth model

$$x_{tv} = \frac{\exp(a_v t + b_v)}{\sum_{v'=1}^{K} \exp(a_{v'} t + b_{v'})},$$

where the mixing of variants is controlled by parameter vectors $a$ and $b$, which control the fitness and introduction times of the different variants. The expected frequencies of mutations at time $t$ are computed as

$$p_{tm} = (1 - \varepsilon)(B x_t)_m + \varepsilon \left( 1 - (B x_t)_m \right),$$

where $B$ is the variant definition matrix and $\varepsilon$ denotes the per-base error probability in the sequencing process. The read depth at position $m$ and time $t$ is then sampled as

$$n_{tm} \sim \mathrm{NegBin}(\mu, r),$$

where $\mu$ and $r$ are the expected value and overdispersion parameter, respectively, of the read depth in covered regions. To simulate levels of dropouts higher than expected from this model, we further randomly set entries $n_{tm}$ to zero with probability $d$. The observed mutation frequency at position $m$ and time $t$ is then sampled as

$$n_{tm} \, y_{tm} \sim \mathrm{BetaBin}(n_{tm}, \, p_{tm}, \, \rho),$$

with expected value $p_{tm}$ and overdispersion parameter $\rho$.
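The generative model above can be sketched end to end in NumPy; all numeric defaults below are illustrative placeholders, not the values used in the paper, and the function name is ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_wastewater(B, a, b, T=60, eps=0.005, mu=500, r=5,
                        rho=0.01, p_drop=0.5):
    """Simulate mutation frequencies from the model in the text.

    Multinomial-logistic abundances, negative-binomial read depths with
    extra dropouts, and beta-binomial mutation counts.
    """
    M, K = B.shape
    t = np.arange(T)[:, None]
    logits = a[None, :] * t + b[None, :]
    X = np.exp(logits - logits.max(axis=1, keepdims=True))
    X /= X.sum(axis=1, keepdims=True)              # (T, K) true abundances
    P = (1 - eps) * (X @ B.T) + eps * (1 - X @ B.T)  # (T, M) expected freqs
    # NegBin depth with mean mu and dispersion r, then extra dropouts
    n = rng.negative_binomial(r, r / (r + mu), size=(T, M))
    n[rng.random((T, M)) < p_drop] = 0
    # beta-binomial counts: draw per-entry success probability, then counts
    s = (1 - rho) / rho                            # alpha + beta of the Beta
    q = rng.beta(P * s, (1 - P) * s)
    y = np.where(n > 0, rng.binomial(n, q) / np.maximum(n, 1), np.nan)
    return X, n, y
```

Entries with zero depth are returned as NaN, matching the missing-data pattern that the temporal regularization is designed to handle.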
Simulation of Delta taking over Alpha
To assess the efficacy of our method to track the important situation where a new variant displaces the currently circulating one over time, we generated a 60-timestep time series of Delta taking over Alpha at a logistic growth rate of 0.1/day. We used the variant definitions generated from Cov-Spectrum [32] using the toolkit from COJAC [8]. We fixed the expected read depth $\mu$, the read depth overdispersion $r$, the per-base error probability $\varepsilon$, and the frequency overdispersion $\rho$ to constant values across scenarios. To assess the robustness of our method to varying levels of noise, we varied the dropout probability $d$ between 0, 0.1, 0.2, …, 0.9 and 0.99. These simulated datasets were deconvolved using the SL1 loss ($s = 0.13$), as well as with the LS loss, to assess the robustness added by the SL1 loss. We varied the bandwidth $h$ of the smoothing kernel between 0, 30 and 60. We compared the deconvolved values to the known ground truth and computed their squared correlation coefficient R².
Simulation of a mixture of Omicron subvariants
To further test our method in a more complicated situation where multiple related variants co-circulate, we generated another 60-timestep time series of a mixture of highly similar Omicron subvariants BA.2, BA.5, BA.2.75, BQ.1.1 and XBB. We set the other parameters of the simulation to the same values. We deconvolved using the same parameters and compared the results to the ground truth similarly.
Robustness to misspecification
To investigate the effect of adding variants not present in the samples (model overspecification), we generated data using the same two scenarios as above but included additional related variants in the deconvolution. For the Delta–Alpha time series, we added Beta (B.1.351) and Kappa (B.1.617.1), a close relative of Delta. For the Omicron mixture, we added BA.4 to the deconvolution.
Conversely, to assess the effect of omitting a truly circulating variant (model underspecification), we removed XBB from the deconvolution in the Omicron subvariant scenario. All simulations used identical parameter settings and noise levels as described above.
Simulation of a mixture of artificial closely related variants
Finally, we generated an artificially adversarial 60-timestep time series of a mixture of 5 highly related variants. Their definitions were generated by taking all size 4 subsets of a set of 5 mutations. This ensures no mutation is exclusive to any single variant, but that each of them is found in 4 of the 5 variants. Thus, each pair of variants shares all but one mutation. We set the rest of the parameters to the same values as before, deconvolved using the same procedure, and assessed the results similarly.
Hyperparameters
We assessed the sensitivity of our deconvolution method to hyperparameter choice both on simulated and on real wastewater data. For the simulated data, we assessed the root mean square error of the deconvolution of a simulated mixture of 5 Omicron subvariants across a grid of values for the smoothing bandwidth parameter $h$ and the scale parameter $s$ (controlling the breakpoint between the $L_2$ and $L_1$ regimes of the loss), for different levels of missing data.
For the real wastewater sequencing data, we analyzed the two biggest WWTPs (Zurich and Vaud) using a similar grid of hyperparameters. For each deconvolved dataset, we linearly regressed the relative abundances of the different variants in clinical sequencing on the relative abundances in wastewater inferred by the deconvolution, using the statistical software R [33], and we reported the R2. The regressions were performed with data points weighted by the square root of the clinical sample sizes.
Results
We developed LolliPop, a statistical and computational method for deconvolving variant abundances from wastewater sequencing data (Fig 1A). It estimates relative variant abundances from observed mutation frequencies using a variant profile matrix, a flexible loss function, and temporal regularization. Below, we compare its performance on Swiss wastewater data against matched clinical estimates and evaluate its robustness on simulated datasets with high levels of missing data.
A Mutation frequencies in wastewater samples of different timepoints are obtained from NGS read counts. The mutation frequencies are deconvolved using the variant definitions, producing estimates of relative variant abundances along with confidence intervals. Repeating the operation for each timepoint tracks the relative abundances of the variants through time. Some values for mutation frequencies are missing (gray), but temporal regularization allows for deconvolution with high levels of missing data. Created in BioRender. Dreifuss, D. (2025). B Estimates of variant relative abundances obtained from deconvolution of wastewater data in different WWTPs. Different colors correspond to the different genomic variants studied. The deconvolution was performed with a Gaussian smoothing kernel with bandwidth $h$ and a scale parameter $s$. C Relative abundances of variants in clinical samples of the cantons surrounding the studied WWTPs. D Cross-correlation between the wastewater deconvolved values of relative abundances and the clinical data estimates, overall (black) and for the different locations (colors). The correlation values are weighted by the square root of the clinical sample size. Dots mark measured cross-correlations at different lag times, lines mark quadratic fits. The overall cross-correlation peaks at a wastewater signal lead time of 3.3 days. E Fraction of amplicons covered in the wastewater sequencing data, for each month and for each location (colors). The coverage drops drastically for periods with low incidence (see Fig A in S1 Text). The central line within each box represents the median coverage per month, while the box itself spans the monthly interquartile range (IQR). The whiskers extend to the most extreme data points still within 1.5 times the IQR, and data points beyond this range are shown as individual dots.
Comparison to clinical data
We compared time series of relative variant abundances inferred from wastewater sequencing data using LolliPop to those estimated using clinical data. To challenge the robustness of LolliPop, we used wastewater data including samples from a low incidence period of the pandemic, where viral concentrations were extremely low (Fig A in S1 Text). The cumulative aligned read depth per sample ranged between 0 and 17,369,931, with a median of 239,812, a mean of 1,315,240 and a standard deviation of 2,583,714. Only 9 samples had full coverage, with genome coverage dropping extremely low during periods of low incidence (Fig 1E and Fig A in S1 Text). Despite this very high level of missing data, it was still possible to deconvolve the observed mutations into variant relative abundances accurately (Fig 1). We found that the wastewater-based infection dynamics closely follow those derived from clinical sequencing (Fig 1B,1C).
Because the delay distribution between infection and clinical testing does not necessarily match the delay distribution of shedding and sewage travel time, we expect that the signal in wastewater can be shifted in time compared to clinical samples. The overall highest cross-correlation value (weighted by root clinical sample size) between wastewater-derived estimates and clinical estimates was attained with wastewater having a lead time of 3.3 days (Fig 1D). In the region of Zürich – which had both the largest clinical sample size and the least impacted wastewater sequencing coverage – the wastewater signal showed a lead time of 6.1 days. We performed the deconvolution with a fixed kernel bandwidth $h$ and loss scale parameter $s$.
Confidence intervals
Bootstrapping the mutation counts with subsequent deconvolution was used to produce confidence intervals of the variant relative abundances (Fig 2A). Alternatively, we also used Wald-type confidence intervals computed on the logit scale, adjusted for dispersion and then back-transformed (Fig 2B). Confidence bands from both methods overlapped substantially – on average, Wald confidence intervals covered ~89% of the bootstrap confidence intervals – although Wald confidence intervals seemed more conservative (Fig B in S1 Text). Especially for the Wald confidence intervals, uncertainty was consistently higher during the months in which wastewater samples contained low concentrations of SARS-CoV-2 RNA (June–July) due to low incidence of the virus (Fig 1E and Fig A in S1 Text).
A Bootstrap confidence intervals computed using 1000 bootstrap samples of the mutations. B Wald confidence bands computed using the overdispersion-adjusted asymptotic standard error on the logit scale, back-transformed to the linear scale.
Simulations
The variant deconvolution is sensitive to the similarity of variants as well as to the amount of missing values (i.e., low coverage) in the sequencing data. To test the limits of the robustness of LolliPop, we simulated data from mixtures of variants with varying degrees of similarity (Fig C in S1 Text) and missing data. In all simulations, increasing the level of missing values eventually prevented the deconvolution from recovering the signal in the absence of regularization (Fig 3, Fig D in S1 Text). However, we found that temporal regularization allowed the relative abundances to be estimated accurately (R² > 0.9), even in the case of very high levels (>90%) of missing data.
Each panel shows a 60-day simulated time series of variant competition. Within each panel, columns correspond to increasing rates of missing data (0%, 50%, 90%, 99%). The first and second rows display deconvolution results without and with temporal regularization, respectively. Dashed lines indicate the simulated ground truth, while solid lines show deconvolution results obtained with the SL1 (scale = 0.13) loss function. Annotations report the R² values compared to ground truth. The third row shows the coverage of informative sites (i.e., the fraction of non-missing data) for each simulated sample. See Fig C in S1 Text for an analysis of the similarity between the variant profiles used in the simulations. A Simulated time series of B.1.617.2 (Delta) overtaking B.1.1.7 (Alpha). B Simulated time series of five highly similar Omicron subvariants. C Same as (A), but with two additional related variants included in the deconvolution (overspecified model). D Same as (B), but with one additional related variant included in the deconvolution (overspecified model). E Same as (B), but omitting XBB from the deconvolution (underspecified model). F Simulated mixture of five artificial highly related variants generated from all subsets of size four out of five informative sites.
When considering the prototypical case of a divergent variant (Delta) sweeping over the dominant strain (Alpha), the original signal was recovered with high accuracy (R² > 0.99) despite up to 99% missing values (Fig 3A, Fig D.A in S1 Text, Fig E in S1 Text). When tracking multiple co-circulating, closely related lineages (Omicron derivatives), the original signal could also be recovered with high accuracy (R² = 0.95) with up to 99% missing values (Fig 3B, Fig F in S1 Text). When the lineages were further made as similar as possible, LolliPop could still recover the signal accurately (R² = 0.92) despite up to 90% missing values – i.e., an average of only 0.5 informative sites covered in each sample (Fig 3F, Fig D.F in S1 Text, Fig G in S1 Text). In contrast to temporal regularization, switching from the LS to the SL1 loss had only a minor effect on robustness to missing data. In all simulated scenarios, it provided a slight improvement in stability, which was noticeable only in the absence of regularization.
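To make the LS-versus-SL1 comparison concrete, a minimal single-sample deconvolution can be written with `scipy.optimize.least_squares`, which provides the soft-L1 loss via `loss="soft_l1"` and its scale via `f_scale`. This is a simplified sketch, not LolliPop's implementation: the variant matrix and frequencies are toy data, positions with missing data are simply dropped, and the sum-to-one constraint is imposed by renormalization rather than during optimization:

```python
import numpy as np
from scipy.optimize import least_squares

def deconvolve(X, y, f_scale=0.13):
    """Estimate variant relative abundances from observed mutation
    frequencies y, given a variant definition matrix X (mutations x
    variants), using the soft-L1 ("SL1") loss. Rows of y that are
    missing (NaN, e.g. coverage dropouts) are dropped."""
    keep = ~np.isnan(y)
    X, y = X[keep], y[keep]
    res = least_squares(
        lambda a: X @ a - y,                 # residuals of the linear mixture
        x0=np.full(X.shape[1], 1 / X.shape[1]),
        bounds=(0, 1),                       # abundances constrained to [0, 1]
        loss="soft_l1", f_scale=f_scale,
    )
    a = res.x
    return a / a.sum()                       # renormalize to sum to one

# Two toy variants defined by disjoint mutation sets
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], float)
y = np.array([0.7, np.nan, 0.3, 0.3])        # one dropout position
abund = deconvolve(X, y)
```

Setting a large `f_scale` makes the soft-L1 loss effectively quadratic over the residual range, recovering LS behavior.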
Next, we investigated the impact of model misspecification by either including variants not present in the simulated samples (overspecified model) or omitting existing variants from the deconvolution (underspecified model). Model overspecification (Figs 3C and 3D, Fig D.C,D in S1 Text) had no noticeable effect in the case of medium or high coverage. However, under low coverage, it led to sporadic false-positive detections of the variants most closely related to the ones in circulation. In contrast, model underspecification (Fig 3E, Fig D.E in S1 Text) systematically caused the abundance of the omitted variant to be misattributed to one or more related variants. In the absence of temporal regularization, this was visible as a noisier deconvolved signal.
In all cases, the smoothing effect of temporal regularization introduced some bias at the start and end of the time series, a well-known artifact of kernel smoothers [34]. The bias introduced by smoothing was generally small, and in the presence of missing values the added error was more than compensated by the reduced variance of the estimates. In the case of high relative noise, the temporal regularization offered added accuracy even in the absence of missing data, by acting as a smoother (Fig 3B, Fig D.B in S1 Text, Fig G in S1 Text).
Hyperparameters
We ran LolliPop on simulated data consisting of a mixture of five closely related Omicron subvariants, using a grid of hyperparameter values. These tests were performed under varying levels of missing data. Increasing the smoothing bandwidth from zero initially reduced the RMSE, but the RMSE gradually increased again at higher bandwidths due to the bias introduced by oversmoothing (Fig H in S1 Text). The benefit of a nonzero smoothing bandwidth was greater in the presence of high levels of missing data. In contrast, the choice of the loss scale parameter (which controls the breakpoint between the quadratic and linear regimes of the loss) had a negligible effect on RMSE.
Next, we applied LolliPop to the Swiss wastewater monitoring data from the two largest WWTPs, again using a grid of hyperparameter values. Increasing the smoothing bandwidth improved the goodness of fit with clinical sequencing data, as measured by R² (Fig I in S1 Text). For the loss scale parameter, the best fits were obtained for values between 0.1 and 0.25.
Runtime
Wall time on a standard laptop (MacBook Air M1 2020, max clock rate 3.2 GHz, 16 GB RAM) was ~2 s for deconvolving all 1295 wastewater samples (225 time points, 6 locations, 5 variants defined through 94 mutations) using the LS loss function (i.e., SL1 with scale = 1). Using the SL1 loss function, wall time was ~1 min for the same task. Wall time was ~1 min for the reparameterized Wald confidence intervals. Constructing confidence intervals from 1000 bootstrap samples had a wall time of ~30 min. All computations were performed using a single CPU core.
Discussion
We have presented LolliPop, a method for solving the variant deconvolution problem. Deconvolving patterns of mutations in wastewater NGS data is necessary to track the relative abundances of viral variants. We showed its application to data from the Swiss wastewater monitoring program, spanning eight months and six locations across the country. We found that the deconvolved relative abundances closely follow the dynamics observed in the clinical sequencing effort, even when coverage was very low. We further evaluated the robustness of our method to missing values on simulated time series of increasing complexity, and found that ground truth values can be recovered with high accuracy even in the presence of very high rates of missing values.
LolliPop takes two main tunable hyperparameters. The most important one is the kernel bandwidth, which controls the amount of temporal regularization and smoothing. With the bandwidth set to zero, LolliPop deconvolves each sample independently; samples with coverage dropouts might then fail to deconvolve, or yield unstable estimates of variant relative abundances. In our simulations and empirical data, we have seen that setting the bandwidth to even a small non-zero value (on the order of days, for a daily sampled time series) can make the deconvolution vastly more robust and increase its accuracy. For datasets with lower sampling frequency, or when stronger smoothing is desired, the bandwidth can be increased further, though at the cost of increased computation time. The second hyperparameter is the loss scale, which controls the influence of mutation frequency outliers. In our analyses, its influence on the accuracy of the deconvolution was minor compared to that of the bandwidth. For computational efficiency, we recommend using the LS loss (i.e., SL1 with scale = 1), and for a marginal increase in robustness and accuracy we recommend loss scale values between 0.1 and 0.25.
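As an illustration of how the temporal regularization couples nearby time points, the Gaussian kernel weights can be sketched as follows (a toy computation, not LolliPop's internal code):

```python
import numpy as np

def gaussian_kernel_weights(times, bandwidth):
    """Gaussian kernel weights relating each deconvolution time point
    to every sample time; each row is normalized to sum to one."""
    d = np.subtract.outer(times, times)            # pairwise time lags
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    return w / w.sum(axis=1, keepdims=True)

times = np.arange(10.0)                            # a daily sampled series
W = gaussian_kernel_weights(times, bandwidth=1.0)
```

With a very small bandwidth, each row is dominated by its own time point, recovering nearly independent per-sample deconvolutions; increasing the bandwidth lets a sample with a coverage dropout borrow strength from its neighbors, at the cost of more smoothing.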
In this manuscript we focused on SARS-CoV-2, but the method is broadly applicable to other pathogens. LolliPop relies solely on observed mutation frequencies in mixed samples and predefined variant profiles, making it readily adaptable to different pathogens and surveillance contexts. Notably, LolliPop was recently applied to Influenza A, where it successfully deconvolved wastewater sequencing data from Switzerland during the 2023–2024 season into the relative abundances of the major H1N1 clades. These estimates showed strong agreement with matched clinical data [35]. Additionally, LolliPop was successfully applied to Respiratory Syncytial Virus (RSV), where it was used to deconvolve RSV-A and RSV-B wastewater sequencing data from Switzerland during the 2023–2024 season into the relative abundances of their major lineages, enabling the identification of temporal and regional patterns [36].
We have shown that LolliPop is computationally efficient, capable of processing thousands of samples within seconds or minutes on a standard laptop using a single CPU core. This efficiency makes it suitable and readily applicable to large-scale analyses, including national wastewater surveillance efforts such as the Swiss wastewater monitoring program (https://wise.ethz.ch).
A broader limitation of LolliPop is that it relies on predefined variant profiles to perform the deconvolution. These variant definitions are typically derived from clinical sequencing repositories. In principle, the variant profiles could also be inferred directly from wastewater using additional analytical tools, although this is currently not reliably possible. As a consequence, emerging variants cannot be accurately tracked until their characteristic mutation profile becomes available from external data. In practice, even a single clinical genome of a new variant – potentially from another country – is sufficient to define its haplotype and enable wastewater-based tracking. LolliPop is therefore most effective when embedded within an integrated genomic surveillance system.
Variant selection is a critical step for the deconvolution. Although a large number of SARS-CoV-2 variants have emerged, typically only a few circulate at any given time. We have shown in simulations that overspecifying the deconvolution can lead to false positives when the coverage is low. Selecting the appropriate set of variants to deconvolve the signal into is therefore essential for accurate inference. For this purpose, the more sensitive and selective tool COJAC [8] can be used to confirm variant presence prior to deconvolution. COJAC is integrated within the V-pipe [30] framework alongside LolliPop, facilitating streamlined analysis.
Another limitation is that the current implementation considers genomic positions independently, which can lower the sensitivity of detection for very low-abundance variants. However, local haplotype information is often available, for example from read-pair profiles. LolliPop could in principle accept local haplotype counts instead of mutation counts as input to the deconvolution, with variants being defined by their local haplotype profiles instead of mutation profiles.
Modifications to LolliPop can be easily implemented. First, the behavior of the deconvolution under different types of loss functions could be investigated. We have used the soft-L1 loss here, and the results were not strongly affected by the choice of the scale parameter, indicating robustness of the method to this choice. Other types of losses could have interesting properties for the problem at hand. For example, if the number of candidate lineages to deconvolve grows, building in a sparsity assumption by adding an ℓ1 regularization term on the relative variant abundances could lower the variance of their estimates.
Another component that can readily be modified is the type of kernel used for enforcing temporal regularization. We have used a Gaussian kernel, but other choices could be relevant depending on the application. For example, kernels elicited from expert knowledge could be used to more precisely relate the relative abundances in samples to the relative incidences in the population. The temporal dynamics of viral shedding in wastewater are generally described by a shedding load distribution [37], which could be accounted for by an asymmetric, non-zero-centered kernel. In general, we recommend using a kernel with sufficient bandwidth to robustify the method against missing values. However, temporal variability between samples does not always reflect noise or sampling error: at certain locations, such as small catchments, sharp changes in variant composition may represent true epidemiological signals. In such cases, users should avoid oversmoothing real temporal dynamics.
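For instance, a hypothetical asymmetric, non-zero-centered kernel could be derived from a gamma-shaped shedding load distribution. The shape and scale values below are placeholders for illustration, not estimates from the literature:

```python
import numpy as np
from scipy.stats import gamma

def shedding_load_kernel(lags, shape=2.0, scale=3.0):
    """Asymmetric kernel from a gamma-shaped shedding load
    distribution: only non-negative lags (in days) contribute,
    with the mode shifted away from zero."""
    w = gamma.pdf(lags, a=shape, scale=scale)
    return w / w.sum()                  # normalize weights to sum to one

lags = np.arange(0, 30.0)               # days since infection
w = shedding_load_kernel(lags)
```

Unlike a Gaussian kernel, this kernel is one-sided and peaks several days after infection, matching the intuition that wastewater signal lags the shedding process.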
To assess the uncertainty in the estimates of relative abundances, we have derived a bootstrap-based and an analytical approach to produce confidence intervals. Bootstrap confidence intervals are conceptually straightforward to derive but are by design computationally intensive, whereas analytical expressions for confidence intervals can provide substantial speedup. The Wald confidence intervals we derived here are based on a binomial/quasibinomial variance structure and include three components of the variance of the estimates: one to account for the mutation overlap between variant definitions, one to account for the quadratic form of the variance of binomial sampling, and one to account for overdispersion in the read counts. Wald confidence intervals computed on the linear scale suffer from known shortcomings, as they can exit the [0,1] range, which is why we compute them on the logit scale. In our results, the Wald confidence intervals very closely resemble the bootstrap confidence intervals, while providing an almost 30x speedup in computation time. The discrepancy between the two approaches was larger when one variant dominated and the others were at low relative abundance. In that case, the resampling scheme in the bootstrap yields low variability, while the logit-transformed standard errors can become very large.
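A percentile bootstrap over mutations can be sketched as below. The estimator is a deliberately simple stand-in (clipped least squares) rather than the full regularized deconvolution, and all inputs are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate(X, y):
    """Toy abundance estimator: least squares, clipped to be
    non-negative and renormalized (stand-in for the deconvolution)."""
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    a = np.clip(a, 0, None)
    return a / a.sum()

def bootstrap_ci(X, y, n_boot=1000, level=0.95):
    """Percentile bootstrap over mutations: resample mutation rows
    with replacement, re-estimate, and take quantiles per variant."""
    n = len(y)
    boots = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, n)      # resample mutations
        boots[b] = estimate(X[idx], y[idx])
    alpha = (1 - level) / 2
    return np.quantile(boots, [alpha, 1 - alpha], axis=0)

# Two toy variants with disjoint mutation sets, noisy frequencies
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], float)
y = np.array([0.72, 0.68, 0.31, 0.29])
ci = bootstrap_ci(X, y, n_boot=500)      # rows: lower / upper bound
```

With many mutations per variant the loop dominates the runtime, which is why the analytical Wald intervals described above are so much cheaper.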
LolliPop builds in the assumption of temporal continuity of the variant relative abundances, in the form of kernel smoothing performed simultaneously with the deconvolution, which enforces a ridge penalty on the temporal variation of the relative abundances. As monitoring programs usually cover multiple locations, another useful assumption to build into the deconvolution could be spatial continuity, if the spatial resolution allows for it. Jointly smoothing across locations might increase the robustness of the estimates by partially pooling information.
To summarize, LolliPop solves the variant deconvolution problem, taking into account the time series nature of wastewater sequencing datasets and mitigating the extremely high levels of noise and dropouts these experiments typically display. Our method can estimate uncertainty using different approaches, including analytical confidence bands with short computation times. As such, LolliPop enables genomic variant tracking in large-scale wastewater-based epidemiology projects.
Supporting information
S1 Text. Fig A: Viral loads in the wastewater samples and daily incidence in the catchment area of the treatment plants.
Points are measured values and lines are 7-day median values. Data from https://sensors-eawag.ch/sars/overview.html (A). Log10 read depth per amplicon, per sample, across the different locations. Gray values represent amplicons with zero coverage, i.e., dropouts (B). Fig B: Confidence bands of the deconvolved relative abundance values, at the 95% level. Bootstrap and Wald-type confidence bands are displayed on the same plot. Fig C: Profiles of the variants in this study. A Mutation indices (rows) and variants (columns) of the filtered variant definition matrix used in this study. A cell is colored yellow if the mutation is present in the variant, and purple otherwise. B Correlation matrix between the variant definitions, calculated using Pearson correlation. C Number of mutations in common between each pair of variants. D Jaccard index between each pair of variants, calculated as the size of the intersection divided by the size of the union of the mutation sets. Fig D: Simulation experiments assessing robustness to missing data, variant similarity, and model misspecification. Each panel shows a 60-day simulated time series of variant competition. Within each panel, columns correspond to increasing rates of missing data (0%, 50%, 90%, 99%). The first and second rows display deconvolution results without and with temporal regularization, respectively. Solid and dashed lines show deconvolution results obtained with the LS (i.e., SL1 with scale = 1) and SL1 (scale = 0.13) loss functions, respectively. Annotations report the corresponding R² values compared to ground truth. The third row shows the coverage of informative sites (i.e., the fraction of non-missing data) for each simulated sample. See Fig C in S1 Text for an analysis of the similarity between the variant profiles used in the simulations. A Simulated time series of B.1.617.2 (Delta) overtaking B.1.1.7 (Alpha). B Simulated time series of five highly similar Omicron subvariants. C Same as (A), but with two additional related variants included in the deconvolution (overspecified model). D Same as (B), but with one additional related variant included in the deconvolution (overspecified model). E Same as (B), but omitting XBB from the deconvolution (underspecified model). F Simulated mixture of five artificial highly related variants generated from all subsets of size four out of five informative sites. Fig E: Simulation experiments, displaying the effect of kernel smoothing at varying levels of missing values on a 60-timestep time series of B.1.617.2 (de) taking over B.1.1.7 (al). Columns are different values of the smoothing bandwidth, rows are different levels of missing values. Dashed lines represent ground truth; solid lines and dotted lines show deconvolution results with the LS and SL1 loss functions, respectively. Fig F: Simulation experiments, displaying the effect of kernel smoothing at varying levels of missing values on a 60-timestep time series of the five closely related Omicron derivatives BA.2 (om2), BA.2.75 (om275), BA.5 (om5), BQ.1.1 (ombq11) and the recombinant XBB (omxbb). Columns are different values of the smoothing bandwidth, rows are different levels of missing values. Dashed lines represent ground truth; solid lines and dotted lines show deconvolution results with the LS and SL1 loss functions, respectively. Fig G: Simulation experiments, displaying the effect of kernel smoothing at varying levels of missing values on a 60-timestep time series of the five artificially closely related variants. Columns are different values of the smoothing bandwidth, rows are different levels of missing values. Dashed lines represent ground truth; solid lines and dotted lines show deconvolution results with the LS and SL1 loss functions, respectively. Fig H: Root mean square error (RMSE) of the deconvolved simulated data, as a function of the kernel bandwidth and loss scale parameters of the robust regression. The accuracy of the deconvolution was measured on the simulated data for the five closely related Omicron subvariants shown in Fig F in S1 Text. Fig I: Goodness of fit of the robust kernel deconvolution of the wastewater NGS data, as a function of the kernel bandwidth and loss scale parameters of the robust regression. Goodness of fit is evaluated by regressing on estimates of relative abundances of variants obtained from clinical sequencing data.
https://doi.org/10.1371/journal.pcbi.1014003.s001
(PDF)
Acknowledgments
We thank all members of the Wastewater-based Infectious disease Surveillance (WISE) consortium.
References
- 1. Knyazev S, Chhugani K, Sarwal V, Ayyala R, Singh H, Karthikeyan S, et al. Unlocking capacities of genomics for the COVID-19 response and future pandemics. Nat Methods. 2022;19(4):374–80. pmid:35396471
- 2. Wastewater monitoring comes of age. Nat Microbiol. 2022;7(8):1101–2. pmid:35918421
- 3. Boehm AB, Hughes B, Duong D, Chan-Herur V, Buchman A, Wolfe MK. Wastewater surveillance of human influenza, metapneumovirus, parainfluenza, respiratory syncytial virus (RSV), rhinovirus, and seasonal coronaviruses during the COVID-19 pandemic. medRxiv. 2022.
- 4. Nadeau S, Devaux AJ, Bagutti C, Alt M, Ilg Hampe E, Kraus M, et al. Influenza transmission dynamics quantified from wastewater. medRxiv. 2023.
- 5. Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, et al. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics. 2018;34(23):4121–3. pmid:29790939
- 6. O’Toole Á, Scher E, Underwood A, Jackson B, Hill V, McCrone JT, et al. Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool. Virus Evol. 2021;7(2):veab064. pmid:34527285
- 7. Turakhia Y, Thornlow B, Hinrichs AS, De Maio N, Gozashti L, Lanfear R, et al. Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nat Genet. 2021;53(6):809–16. pmid:33972780
- 8. Jahn K, Dreifuss D, Topolsky I, Kull A, Ganesanandamoorthy P, Fernandez-Cassi X, et al. Early detection and surveillance of SARS-CoV-2 genomic variants in wastewater using COJAC. Nat Microbiol. 2022;7(8):1151–60. pmid:35851854
- 9. Chen C, Nadeau SA, Topolsky I, Manceau M, Huisman JS, Jablonski KP, et al. Quantification of the spread of SARS-CoV-2 variant B.1.1.7 in Switzerland. Epidemics. 2021;37:100480. pmid:34488035
- 10. Caduff L, Dreifuss D, Schindler T, Devaux AJ, Ganesanandamoorthy P, Kull A. Inferring transmission fitness advantage of SARS-CoV-2 variants of concern in wastewater using digital PCR. medRxiv. 2021.
- 11. Caduff L, Dreifuss D, Schindler T, Devaux AJ, Ganesanandamoorthy P, Kull A. Inferring transmission fitness advantage of SARS-CoV-2 variants of concern from wastewater samples using digital PCR, Switzerland, December 2020 through March 2021. Euro Surveill. 2022;27(10).
- 12. Dreifuss D, Czyż PP, Beerenwinkel N. Learning and forecasting selection dynamics of SARS-CoV-2 variants from wastewater sequencing data using Covvfit. medRxiv. 2025.
- 13. Dreifuss D, Huisman JS, Rusch JC, Caduff L, Ganesanandamoorthy P, Devaux AJ, et al. Estimated transmission dynamics of SARS-CoV-2 variants from wastewater are unbiased and robust to differential shedding. Nat Commun. 2025;16(1):7456. pmid:40796561
- 14. Baaijens JA, Zulli A, Ott IM, Petrone ME, Alpert T, Fauver JR. Variant abundance estimation for SARS-CoV-2 in wastewater using RNA-Seq quantification. medRxiv. 2021.
- 15. Valieris R, Drummond RD, Defelicibus A, Dias-Neto E, Rosales RA, Tojal da Silva I. A mixture model for determining SARS-Cov-2 variant composition in pooled samples. Bioinformatics. 2022;38(7):1809–15. pmid:35104309
- 16. Karthikeyan S, Levy JI, De Hoff P, Humphrey G, Birmingham A, Jepsen K. Wastewater sequencing uncovers early, cryptic SARS-CoV-2 variant transmission. medRxiv. 2022.
- 17. Amman F, Markt R, Endler L, Hupfauf S, Agerer B, Schedl A, et al. National-scale surveillance of emerging SARS-CoV-2 variants in wastewater. medRxiv. 2022.
- 18. Ferdous J, Kunkleman S, Taylor W, Harris A, Gibas CJ, Schlueter JA. A gold standard dataset and evaluation of methods for lineage abundance estimation from wastewater. Sci Total Environ. 2024;948:174515.
- 19. Hoerl AE, Kennard RW. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics. 1970;12(1):55–67.
- 20. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17(3):261–72. pmid:32015543
- 21. Branch MA, Coleman TF, Li Y. A Subspace, Interior, and Conjugate Gradient Method for Large-Scale Bound-Constrained Minimization Problems. SIAM J Sci Comput. 1999;21(1):1–23.
- 22. Ling RF, Lawson CL, Hanson RJ. Solving least squares problems. J Am Stat Assoc. 1977;72(360):930.
- 23. Wooldridge JM. Econometric analysis of cross section and panel data. Cambridge: MIT Press.
- 24. Efron B. Bootstrap methods: another look at the jackknife. Ann Statist. 1979;7(1):1–26.
- 25. Efron B, Tibshirani RJ. An introduction to the bootstrap. Monographs on Statistics and Applied Probability. New York: Chapman & Hall; 1993.
- 26. Cameron AC, Gelbach JB, Miller DL. Bootstrap-Based Improvements for Inference with Clustered Errors. Review of Economics and Statistics. 2008;90(3):414–27.
- 27. Felsenstein J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution. 1985;39(4):783–91. pmid:28561359
- 28. Held L, Sabanés Bové D. Likelihood and Bayesian inference: with applications in biology and medicine. Berlin, Heidelberg: Springer Berlin Heidelberg; 2020.
- 29. McCullagh P, Nelder JA. Generalized linear models. Routledge; 2019.
- 30. Fuhrmann L, Jablonski KP, Topolsky I, Batavia AA, Borgsmüller N, Baykal PI. V-pipe 3.0: a sustainable pipeline for within-sample viral genetic diversity estimation. Gigascience. 2024;13.
- 31. Quick J. nCoV-2019 sequencing protocol v1. 2020.
- 32. Chen C, Nadeau S, Yared M, Voinov P, Xie N, Roemer C. CoV-Spectrum: analysis of globally shared SARS-CoV-2 data to identify and characterize new variants. Bioinformatics. 2022;38(6):1735–7.
- 33. R Core Team. R: A Language and Environment for Statistical Computing. 2019.
- 34. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. 2nd ed. New York, NY: Springer New York; 2009.
- 35. John A, Kang S, Fuhrmann L, Topolsky I, Kent C, Quick J. Characterizing influenza A virus lineages and clinically relevant mutations through high-coverage wastewater sequencing. Water Research. 2025;287:124453.
- 36. de Korne-Elenbaas J, Rimaite A, Topolsky I, Dreifuss D, Bürki C, Fuhrmann L. Wastewater-based sequencing of Respiratory Syncytial Virus enables tracking of lineages and identifying mutations at antigenic sites. medRxiv. 2025.
- 37. Huisman JS, Scire J, Caduff L, Fernandez-Cassi X, Ganesanandamoorthy P, Kull A. Wastewater-based estimation of the effective reproductive number of SARS-CoV-2. medRxiv. 2021.