Skip to main content
Advertisement
  • Loading metrics

Library size-stabilized metacells construction enhances co-expression network analysis in single-cell data

  • Tianjiao Zhang ,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    haibinzhu@jnu.edu.cn (HZ); tjzhang@wyu.edu.cn (TZ)

    Affiliation School of Pharmacy and Food Engineering, Wuyi University, Jiangmen, China

  • Haibin Zhu

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Writing – original draft, Writing – review & editing

    haibinzhu@jnu.edu.cn (HZ); tjzhang@wyu.edu.cn (TZ)

    Affiliation Department of Statistics and Data Science, School of Economics, Jinan University, Guangzhou, China

Abstract

Single-cell RNA sequencing (scRNA-seq) deciphers cell type-specific co-expression networks to resolve biological functions but remains constrained by data sparsity and compositional biases. Conventional metacells construction strategies mitigate sparsity by aggregating transcriptionally similar cells but often neglect systematic biases introduced by compositional data. This problem leads to spurious co-expression correlations and obscuring biologically meaningful interactions. Through mathematical modeling and simulations, we demonstrate that uncontrolled library size variance in traditional metacells inflates false-positive correlations and distorts co-expression networks. Here, we present LSMetacell (Library Size-stabilized Metacells), a computational framework that explicitly stabilizes library sizes across metacells to reduce compositional noise while preserving cellular heterogeneity. LSMetacell addresses this by stabilizing library sizes during metacells aggregation, thereby enhancing the accuracy of downstream analyses such as Weighted Gene Co-expression Network Analysis (WGCNA). Applied to a postmortem Alzheimer’s disease brain scRNA-seq dataset, LSMetacell revealed robust, cell type-specific co-expression modules enriched for disease-relevant pathways, outperforming the conventional metacells approach. Our work establishes a principled strategy for resolving compositional biases in scRNA-seq data, advancing the reliability of co-expression network inference in studying complex biological systems. This framework provides a generalizable solution for improving transcriptional analyses in single-cell studies.

Author summary

Gene co-expression analysis is a widely used method to infer functional relationships between genes by measuring correlations in their normalized gene expression level. However, in this paper, through mathematical modeling and simulations, we demonstrate that these correlations are systematically skewed—particularly due to biases caused by variability in sequencing depth (library size). This issue distorts co-expression analysis results, inflating false correlations and masking true biological interactions. Traditional methods fail to address library size biases in single-cell studies where data sparsity compounds these challenges. We introduce LSMetacell, a computational framework that simultaneously tackles single-cell data sparsity and corrects for library size-induced correlation biases. By constructing metacells with stabilized sequencing depths, our method reduces technical noise while preserving biological heterogeneity. Applied to Alzheimer’s disease brain data, LSMetacell uncovered microglia-specific co-expression networks linking immune dysregulation to neurodegeneration. Our work provides a dual solution: enhancing single-cell resolution through cell aggregation and mitigating systemic biases that plague co-expression studies. LSMetacell integrates technical approaches with biological analysis, enabling researchers to extract precise and reproducible findings from compositional data.

Introduction

Gene co-expression network analysis serves as a powerful and efficient framework for deciphering transcriptional regulatory mechanisms across diverse biological contexts, offering an intuitive approach to unravel complex transcriptional interactions. [1,2]. While bulk RNA-seq enabled early discoveries, its fundamental limitation—confounding by cell-type heterogeneity—persists, masking cell-specific interactions. Single-cell RNA sequencing (scRNA-seq) holds promise to address this limitation by resolving transcriptional profiles at cellular resolution. However, the challenge of data sparsity, arising from the constraints of current single-cell sequencing workflows and technologies, continues to pose significant hurdles for downstream analysis. [3,4]. A conventional workflow accounts for these considerations by collapsing highly similar cells into “metacells” to reduce sparsity while retaining cellular heterogeneity [5,6]. However, the compositional nature of normalized scRNA-seq data (both before and after metacells aggregation) introduces critical yet underappreciated biases that undermine co-expression network inference [7].

In compositional data, where measurements represent relative abundances (e.g., gene counts normalized to library size, i.e., the total number of gene counts sequenced from a sample), spurious correlations arise due to the “closed-sum” constraint: an increase in one gene’s expression inherently distorts the apparent proportions of others [5,8]. While this issue plagues bulk RNA-seq analyses, it is exacerbated in scRNA-seq by two factors: (1) extreme library size variability across cells (e.g., often spanning 10-fold differences in unique molecular identifier [UMI] counts, where UMIs are short nucleotide tags used to accurately quantify transcript molecules and distinguish them from PCR duplicates), amplifying normalization artifacts; and (2) data sparsity, which compounds compositional noise by inflating stochastic zeros [911]. Conventional metacells construction strategies, designed primarily to mitigate sparsity, often neglect library size variance, generating metacells where normalized expression values (e.g., CPM/TPM, Log-CPM/TPM) remain compositionally biased, propagating technical noise into downstream network analyses [7,12,13].

Here, we rigorously demonstrate through mathematical modeling and in silico simulations that uncontrolled library size variance in metacells introduces systematic distortions in co-expression network inference. Specifically, it leads to false-positive correlations and obscures biologically meaningful interactions. Furthermore, to address this challenge, we introduce LSMetacell (Library Size-stabilized Metacells), a novel framework that explicitly stabilizes library sizes across metacells while preserving transcriptional heterogeneity. By explicitly accounting for library size variability, LSMetacell reduces compositionally induced biases, enabling more accurate and reliable co-expression network analyses, particularly for methods like WGCNA (Weighted Gene Co-expression Network Analysis) that rely on precise correlation estimates. Our approach not only enhances the robustness of co-expression network inference but also provides a generalizable solution for improving the accuracy of downstream transcriptional analyses in scRNA-seq studies. Finally, we evaluated the utility of LSMetacell by applying it to postmortem brain samples from Alzheimer’s disease patients and controls scRNA-seq data set and identified biological meaningful cell type-specific co-expression network.

Results

The magnitudes of correlation are overestimated by sequencing depth variations

To verify the impact of confounding effects of sequencing depth variations on the correlation of independent genes, we generated null datasets where genes are not co-expressed (null data) by permuting the single-nucleus RNA-seq (snRNA-seq) data from reference [14], while introducing different library size variation across cells (as detailed in the Methods section). Fig 1a shows that in null data with no library size variations, there were minimal biases. However, as the variations increased, both the mean and variance of correlation estimation increased. Especially, when the variance approached that of the original data (Variance Size Factor = 1), the biases were almost comparable to those observed with a tenfold increase in the original data. Furthermore, we prove mathematically that the growing variations of library size induce higher type I error, tending to over-reject independence between gene co-expression (S1 File). Our theoretical derivation aligns with the results from simulation.

thumbnail
Fig 1. Relationship between correlation bias and library-size variation.

a) Library-size variance-dependent correlation inflation in null data. Scatter plot demonstrates systematic overestimation of co-expression (Pearson correlation coefficients between gene pairs) (y-axis) with increasing mean expression levels (x-axis) under varying library size variance (Variance size factors). Variance size factors scale the variance of library size of the simulated data. The higher the variance size factor, the greater the library size variation in the simulated data. b) Library-size variance-dependent correlation absolute bias. Scatter plot showing the scale factor (variance of library size/mean of library size) and average absolute bias of the estimated correlation. The higher the variance size factor, the greater the library size variation in the simulated data.

https://doi.org/10.1371/journal.pcbi.1013697.g001

We further create a simulated snRNA-seq dataset, incorporating predefined potential correlations that are known in advance (as detailed in the Methods section). We observed that as the library size variability increased, the deviations in the estimated correlations also grew larger (Figs 1b and S1). Specifically, when the variance in library sizes was high, the estimated correlations exhibited significant biases, deviating substantially from the true predefined correlations. This finding underscores the critical impact of library size variability on the accuracy of correlation estimates in snRNA-seq data, highlighting the need for robust normalization and variance stabilization techniques to mitigate such biases.

The library stabilized metacell method achieved robust co-expression modules

To tackle the confounding effects of library size variability in single-cell co-expression analysis, we developed the Library size-Stabilized Metacell (LSMetacell) algorithm. This method is based on the hypothesis that cells of the same type share similar gene expression patterns. The LSMetacell method simultaneously accomplishes two objectives: (1) aggregating transcriptionally similar cells into metacells (2) stabilizing library size variation across metacells. This dual approach, which aims to preserve biological signals and minimize technical variance, addresses the key challenge in single-cell co-expression analysis, where library size heterogeneity often distorts downstream network inference.

The LSMetacell algorithm processes three key inputs: a cell-cell similarity network computed from transcriptional profiles (Pearson correlation among cells by default), a gene expression count matrix, and the target number of metacells. It then iteratively builds metacells by: (1) initiating with seed cells and incorporating their most similar neighbors; (2) dynamically evaluating the cumulative library size relative to the global mean metacells library size ; and (3) excluding cell additions that would substantially alter the library size distribution. This dual refinement strategy guarantees transcriptional homogeneity while maintaining uniform library sizes across all metacells (See Methods for implementation details). When evaluated on null datasets (under the condition variance = 1 in Fig 1a), LSMetacell introduced minimal technical bias compared to other benchmark methods, demonstrating its enhanced accuracy in mitigating artifactual correlations (S2 Fig).

To rigorously evaluate the enhancement of LSMetacell on WGCNA for single-cell data, we carried out a comprehensive comparison with the conventional high-dimensional WGCNA (hdWGCNA) method, which leverages k-nearest neighbor aggregation without library size stabilization [7]. In addition to this primary comparison, we also incorporated four other cutting-edge algorithms capable of generating metacells, even though they were not specifically designed for gene co-expression analysis. The first is metacells2 [15,16], which adopts a divide-and-conquer strategy to partition large single-cell datasets and identifies metacells by constructing k-nearest neighbor graphs within subsets. Another is metaQ [17], which leverages a deep learning framework to quantize single cells into a limited codebook. We also included SEACells, a metacell identification method that uses a kernel-based approach to learn a low-dimensional embedding of single-cell data and then employs sparse combinatorial optimization to group cells into archetypal aggregates [18]. Additionally, SuperCell is a method that constructs metacells by merging the most transcriptionally similar cells with a k-nearest neighbor network [19]. Furthermore, we incorporated the primary version of LSMetacell into the comparison, where cells are aggregated as LSMetacell without strict library size control. Using human dorsolateral prefrontal cortex scRNA-seq data spanning five major cell types (excitatory neurons [EN], inhibitory neurons [IN], astrocytes [AC], oligodendrocytes [OC], and microglia [MC]), we evaluated two key aspects of network quality: biological relevance and technical robustness. Notably, LSMetacell achieved the lowest coefficient of variation, indicating superior library size stability of the generated meta-cells, which is a property essential for reducing noise in downstream analyses (S2 Table). For biological validation, we leveraged the principle that functionally related genes (as evidenced by protein-protein interactions) should show stronger co-expression. Analysis of the top 10,000 correlated gene pairs revealed superior Protein-Protein Interaction (PPI) enrichment with LSMetacell across most cell types (Table 1).

thumbnail
Table 1. PPI enrichment among the top 10,000 correlated gene pairs across cell types. Values denote the number of gene pairs exhibiting significant PPI enrichment for each method within each cell type. Paired Wilcoxon test comparing LSMetacell with all benchmark methods across cell types yielded P = 0.031 for every comparison.

https://doi.org/10.1371/journal.pcbi.1013697.t001

Technical robustness was evaluated through module preservation analysis in an independent human prefrontal cortex scRNA-seq data (Fig 2). LSMetacell demonstrated higher and stable module preservation, indicating greater reproducibility of the identified co-expression patterns. The improved PPI enrichment and module preservation underscore its utility for identifying conserved regulatory programs. These findings demonstrate that stabilizing library size variation in metacells enhances co-expression analysis. It is worth noting that the method of aggregation used in the primary version of LSMetacell has the potential to generate metacells with reduced library size variation. On this basis, LSMetacell further decouples technical artifacts from biological variation through library size variation stabilization, thereby enhancing co-expression analysis.

thumbnail
Fig 2. Preservation of co-expression modules constructed by different methods across major cell types.

Co-expression modules were constructed separately for excitatory neurons (EN), inhibitory neurons (IN), oligodendrocytes (OC), astrocytes (AC), and microglia (MC). The preservation of these modules was subsequently evaluated in a fully independent single-cell sequencing dataset using the Z-Summary statistic. A Z-Summary > 10 indicates strong evidence of preservation, while a Z-Summary > 2 suggests moderate preservation.

https://doi.org/10.1371/journal.pcbi.1013697.g002

The microglia immune response in AD

Using LSMetacell method followed by weighted gene co-expression network analysis (WGCNA) on a comprehensive single-cell transcriptomic dataset, we identified 73 modules across five major central nervous system cell types (EN: 18; IN: 16; AC: 14; MC: 12; OC: 13). Strikingly, only four microglial modules (MC-Brown, MC-Green, MC-Purple, MC-Yellow) exhibited significant enrichment for immune-related pathways (GO analysis, FDR < 0.01), whereas neuronal modules predominantly mapped to synaptic function, metabolic processes, etc. This cell-type-specific functional segregation validates our algorithm’s precision in recovering biologically coherent signals. This highlights the unique position of microglia as the core immune effector of central nervous system in AD pathology (Fig 3a).

thumbnail
Fig 3. Microglia gene co-expression modules.

Module names are prefixed with “MC” (abbreviation for microglia) followed by the color label. a) UMAP representation of the co-expression network visualizes individual genes (kMEs > 0.2) with eigengene of each module indicated by larger circles colored by the module color. Positive module eigengene correlations > 0.5 are indicated by grey lines connecting the eigengenes. b) Modules-phenotype association analysis. Correlation between the module eigengene and phenotypes. Numbers indicate correlation coefficients and p-values. Braaksc: Braak Stage is the semiquantitative measure (from no effect on neocortex to severity effect) of severity of neurofibrillary tangle (NFT) pathology. Ceradsc: CERAD score is the semiquantitative measure of neuritic plague (from severity to no effect). Cogdx: Clinical consensus diagnosis of cognitive status at time of death (from no cognitive impairment to dementia). Dcfdx_lv: Clinical diagnosis of cognitive status at last visit (from no cognitive impairment to dementia). c) The biological process gene ontology enrichment analysis of genes. GeneRatio is the ratio between genes of interest in the gene set and total genes of interest. Dot color represents the adjusted p-value (Benjamini–Hochberg method) of enrichment analysis.

https://doi.org/10.1371/journal.pcbi.1013697.g003

Module MC-Brown exhibited strong positive correlations with AD severity markers (Fig 3b). This gene module orchestrates immune cell signaling and regulation through GTPase activity (e.g., RAC1, ARHGAPs, DOCK proteins), driving processes like leukocyte activation, phagocytosis, and chemotaxis [20,21]. Furthermore, lipid metabolism perturbations (dysregulated glycerophospholipid and phosphatidylinositol biosynthesis) and vesicle trafficking (autophagy, endocytosis) amplify microglial membrane dysfunction, impairing Aβ clearance and amplifying inflammatory cascades [2224] (Fig 3c).

The MC-Green module is more significantly enriched in pathways associated with adaptive immunity compared to the MC-Brown module (Fig 3c). It is worth noting the GO results of T cells, especially CD8-positive T cells. Previous studies have shown the double-edged role of T cells in the mouse model of AD [25]. Su et al. reported the protective role of CD8 ⁺ T cells in AD development [26], while Chen et al. also reported that microglia-mediated infiltration of T cells exacerbates neurodegeneration in the tauopathy-associated model [27]. Our results show that this module is negatively correlated with AD indicators. This may indicate the protective role of this module in AD. The observed decline in MC-Green activity may reflect exhaustion of adaptive-immune surveillance as the disease progresses, thereby accelerating the emergence of AD-related signatures [28]. In contrast, this module profiled with other benchmark approaches showed a notably closer association with innate-immune rather than adaptive-immune processes, along with a weaker correlation with AD indicators (S3S8 Figs).

MC-Yellow shows significant pathway in chaperone-mediated protein folding, ATP synthesis coupled electron transport, and immune related terms (Fig 3c). MC-Yellow hub genes contain ribosome-related and some known AD-related genes, including APOE [29], HSP90B1 [30], TREM2 [31], etc. The MC-Red module occupies a central position in the network analysis, with its functional enrichment linked to RNA metabolism (Fig 3c). Notably, the hub genes within this module are primarily associated with mitochondrial pathways (MT-ND1, MT-ND2, MT-ND3, MT-ND4, MT-ND4L, MT-CO1, MT-CO2, MT-ATP6, MTCO3), a finding with particular pathological relevance given the established role of microglial mitochondrial dysfunction in driving Alzheimer’s pathogenesis through mechanisms involving neuroinflammation and metabolic failure. [32]. However, when applying certain benchmark methods, many genes now assigned to the MC-Red module were merged into the MC-Brown or MC-Yellow modules, highlighting the enhanced resolution offered by the present workflow (S3S8 Figs).

Discussion

Gene co-expression network analysis is indispensable for uncovering transcriptional regulatory mechanisms in diverse biological contexts. Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool to overcome the limitations of bulk RNA-seq in dealing with cell-type heterogeneity. However, the sparsity of single-cell data and the compositional nature of normalized data pose significant challenges for accurate co-expression network inference.

Our study rigorously demonstrates the detrimental impact of uncontrolled library size variation in metacells on co-expression network analysis. Through mathematical modeling and in silico perturbations, we found that this variance introduces false-positive correlations and obscures biologically meaningful interactions. The biases in correlation estimates increase with higher variability in sequencing depths, as evidenced by both the analysis of null datasets and simulated snRNA-seq data. These findings highlight the urgent need for methods that can mitigate the confounding effects of library size variability in single-cell co-expression analysis.

A standard approach to address differences in library size is by scaling the counts for each cell using a specific factor, e.g., its total UMI count. While effective in controlling for technical variability, this method introduces a critical artifact: applying different scaling factors across cells artificially inflates gene-gene correlations, as the same multiplicative factor is applied to all genes within a cell.

To address this issue, we introduced the Library Size-Stabilized Metacell (LSMetacell) algorithm. By aggregating transcriptionally similar cells into metacells while simultaneously stabilizing their library sizes, LSMetacell effectively reduces technical noise and preserves biological signals. Our proposed pepline controls for library size effects through a two-stage process. In the first stage, we form metacells by pooling UMI counts from clusters of similar cells. This initial aggregation step naturally equilibrates library sizes across metacells to a large extent, but more importantly, it maintains gene-specific variance. For instance, the counts for each gene in a metacell represent the sum from its constituent cells, thereby preserving co-variation patterns. In the second stage, we perform a library size normalization on the count profiles of these metacells. The key innovation lies in applying normalization after aggregation. Since the normalization factors are computed across metacells, which already exhibit a more uniform library size distribution, the variance among these factors is significantly reduced. As a result, scaling each metacell to a common total count does not materially distort the gene-gene correlation matrix. This two-step procedure thus effectively controls for any residual differences in metacell size while minimizing the introduction of spurious correlations, leading to a more reliable representation of the data for downstream analysis.

The significant improvement in protein-protein interaction (PPI) enrichment and module preservation in an independent scRNA-seq dataset of human prefrontal cortex demonstrates the biological accuracy and technical robustness of LSMetacell. This method provides a reliable framework for single-cell co-expression network analysis, enabling the decoupling of technical artifacts from biological variation.

Applying the LSMetacell method to a comprehensive single-cell transcriptomic dataset in Alzheimer’s disease (AD), we identified distinct co-expression modules in major central nervous system cell types, with a particular focus on microglia. The functional segregation of these modules, such as immune-related pathways in microglial modules and synaptic/metabolic functions in neuronal modules, validates the precision of our algorithm in recovering biologically coherent signals. These findings confirm the unique role of microglia as the core immune effector in AD pathology. Given the critical role of immune mechanisms in AD, we focused on the co-expression network modules of microglia.

Among the microglial modules, MC-Brown is strongly correlated with AD and is enriched in immune activation pathways. The regulation of small GTPase-mediated signal transduction, innate immunity, and lipid metabolism perturbations in this module provide potential therapeutic targets. For example, targeting small GTPases could potentially disrupt microglial chemotaxis towards amyloid-β (Aβ) plaques. Similarly, restoring normal lipid metabolism in microglia may enhance Aβ clearance and alleviate the inflammatory cascades. The MC-Green module, which is more significantly enriched in adaptive immunity pathways and negatively correlated with AD indicators, suggests a potential protective role. However, its changing correlation patterns in control and AD groups imply possible exhaustion of the adaptive immune system in AD. Future studies could explore ways to boost the function of this module to harness its protective effects against AD progression. MC-Yellow and MC-Red also present valuable insights into AD pathogenesis. The involvement of chaperone-mediated protein folding, ATP synthesis coupled electron transport, and RNA metabolism in these modules provides new perspectives on the complex molecular mechanisms underlying AD. For instance, enhancing chaperone-mediated protein folding may help prevent the accumulation of misfolded proteins and inhibit pro-inflammatory process which are characteristic features of AD.

In the broader context, our work not only provides a novel methodological framework for single-cell co-expression network analysis but also offers in-depth insights into the molecular mechanisms of AD. The LSMetacell algorithm has the potential to be applied in other biological systems and diseases, facilitating the discovery of conserved regulatory programs and potential therapeutic targets. Future research could focus on validating the proposed targets in in vivo models and exploring the translational potential of these findings in clinical settings.

However, it is important to acknowledge that, like all count-based sequencing data, our meta-cell expression matrices remain inherently compositional. Although meta-cell aggregation significantly reduces technical variability and mitigates spurious correlations caused by variable library sizes across single cells, it does not eliminate the compositional nature of the data, as the data remain subject to the unit-sum constraint. This means that the elevated expression of one gene necessarily leads to the relative decrease of others, potentially introducing negative biases in correlation estimates. Another notable limitation of LSMetacell is its dependence on pre-annotated and transcriptionally homogeneous cell populations. While our method effectively controls for technical variation in library size during metacell construction, it is restrictive that cell type labels are uncertain, incomplete, or derived from continuous differentiation trajectories. This design reflects a deliberate trade-off in Algorithm 1: by focusing on library size stabilization within annotated groups, we prioritize suppressing technical variation for co-expression analysis at the potential expense of broader metacell versatility. Interestingly, our benchmarking results revealed that metacell2, a method that also achieves relatively stable library sizes, performed well in module preservation and other metrics, as might be expected. Furthermore, the co-expression modules derived by metacell2 exhibited greater similarity to those from LSMetacell than those from other methods. Surprisingly, SuperCell, despite showing relatively higher library size variation, also achieved competitive module preservation results, and its co-expression patterns were likewise more similar to those of LSMetacell compared to others. These observations suggest that, in addition to library size control, other factors such as low compactness and high separation of metacells may also play important roles in obtaining biologically meaningful co-expression networks. Future work could explore complementary approaches to extend the utility of library size stabilization to more complex and heterogeneous single-cell datasets, while incorporating features that enhance metacell compactness and transcriptional coherence. Furthermore, while we have significantly reduced the number of parameters compared to previous metacells algorithms (only the number of metacells needs to be provided), it is still crucial to select a cell-cell similarity algorithm. Fortunately, we achieved good results by simply using Pearson correlation in our study, but this aspect warrants further investigation. Finally, the time complexity of LSMetacell scales quadratically () with the number of cells (), making computational efficiency an important consideration for large datasets (S3 Table). Constructing metacells separately for each individual and cell type, rather than pooling cells across sources, enhances biological specificity while simultaneously reducing computational expense.

Conclusions

Through mathematical modeling and in silico simulations, we rigorously demonstrate that excessive library size variance in single-cell data systematically inflates gene-gene correlations, fundamentally compromising co-expression network fidelity. We introduce LSMetacell, a rational metacell framework that strategically balances transcriptional similarity with library size stabilization. By leveraging scRNA-seq’s cellular granularity to aggregate phenotypically coherent cells while controlling technical variability, LSMetacell mitigates false-positive interactions. Applied to Alzheimer’s disease data, our method identified robust cell type-specific modules enriched for neurodegeneration pathways, including microglia-driven immune dysregulation. This work establishes that systematic technical stabilization during metacell design is essential for reliable single-cell network inference, advancing mechanistic discovery in complex biological systems.

Methods

Data collection and preprocessing

Single-cell RNA sequencing (scRNA-seq) datasets were obtained from public repositories: dorsolateral prefrontal cortex (DLPFC) data from Synapse (syn18485175; [14]); prefrontal cortex data from Synapse (syn21261143; [33]). Gene count matrices for single-cell data were generated using the original authors’ preprocessing pipelines [14,33]. Then, we performed log-normalization with a scale factor of 10,000.

Simulated dataset

To generate null data sets from an authentic scRNA-seq dataset with co-expression levels at or close to zero among all gene pairs while pre-serving gene expression levels, we adopt the following approach that combines permutation with Poisson sampling. First, we normalize the expression level of each gene in cell , denoted as , where represents the raw expression count and is the scaling factor (e.g., sequencing depth) for cell . Subsequently, for each gene , we randomly shuffle its normalized expression values across all cells. This shuffling step decorrelates the gene expressions, effectively eliminating any potential co-expression patterns among gene pairs. Finally, to generate the UMI (Unique Molecular Identifier) counts for the permuted data, we sample from a Poisson distribution with parameters derived from the normalized, permuted expression levels. Specifically, for gene in cell , the UMI count is sampled from , where is the desired library size in cell . The sequencing depth is determined from a truncated normal distribution, with almost surely, using variance of the original scaling factors multiplied by different variance size factor. This process ensures that the null data sets maintain the overall gene expression characteristics of the original data while lacking any meaningful co-expression patterns.

To generate synthetic single-cell RNA-seq data with controlled co-expression structures, we developed a simulation framework integrating a Gamma-Poisson marginal model with a Gaussian copula. First, the Gamma distribution parameters, and for the gene , were estimated from authentic scRNA-seq data to ensure realistic overdispersion. Next, given a predefined co-expression matrix , calibrated from the real data, we sampled latent variables from multivariate normal distribution, . These latent variables were transformed by

where is the cumulative distribution function (CDF) of a standard normal distribution and is the CDF of . This approach simulates gene expression level , maintaining their predefined correlation structure for benchmark. Finally, the simulated UMI count, , is sampled from , where scales expression to match desired library sizes. By generating from a truncated normal distribution (larger than 0) with different variance we simulated datasets with low, moderate, and high library size variability.

LSMetacell method

The LSMetacell algorithm is specifically designed to operate within pre-defined, transcriptionally homogeneous cell populations. Before applying the method, users must first annotate cell types using established clustering approaches to ensure biologically meaningful aggregation. At its core, the algorithm constructs metacells through an iterative process that balances transcriptional similarity (default: Pearson correlation between cells) with library size stabilization. The algorithm begins by calculating the target average library size across all cells. Starting with the smallest-library cell as a seed, it iteratively aggregates cells based on transcriptional similarity. At each step, a candidate cell is probabilistically selected from unassigned cells based on its aggregate affinity to the current metacell. Crucially, the algorithm enforces library size stabilization: a candidate is retained only if adding it brings the metacell’s cumulative library size closer to . This process continues until the metacell’s library size approaches or no suitable candidates remain. The workflow proceeds as Algorithm 1.

Algorithm 1 LSMetacell

Input: , set of single cells with raw gene expression profiles; , target number of metacells; , cell-cell similarity matrix for cells in .

Output: , a list of metacells with stabilized library sizes.

Initialization:

Target average library size: ;

Unassigned cells: ;

Metacell list:

Main Loop:

while do

Step1: Seed Selection

  Select the cell with the smallest library size:

   ;

  Initialize current metacell:

   ;

   ;

   ;

Step2: Iterative Aggregation

While and do

  Compute affinity scores between and all :

   , ;

  Normalized weights:

   , ;

   with probability ;

  Tentatively update:

   

  if then

 Accept candidate:

   ;

   ;

   ;

  else

 Reject .

  end if

end while

end while

To extend the LSMetacell algorithm for datasets with heterogeneous groups (e.g., individuals, biological replicates), we developed a group-aware metacells construction framework (Algorithm 2).

Algorithm 2 LSMetacell by Group

Input: , set of single cells in group ; , cell-cell similarity matrix for cells in group ; , the target number of metacells for group .

Output: , a list of metacells with stabilized library sizes.

Main Loop:

Step1: Proportional Metacells Allocation

   For each group , calculate its target metacell count:

    ;

Step2: Group-Specific Metacell Construction

   For each group :

    Apply Algorithm 1 LSMetacell.

WGCNA and identification of significant modules

Gene co-expression network construction.

Following metacells construction, weighted co-expression network were constructed by the common pipeline. First, the gene-gene similarity matrix is computed by taking the pairwise signed correlation of genes.

To comply with scale-free topology criterion and the recommendations of WGCNA use, we chose appropriate soft-thresholding powers to convert the gene expression matrices to adjacency matrices. Then topology overlap matrices (TOM) were calculated by adjacency matrices [34]. We then use hierarchical clustering and dynamic tree cut method to identify gene clusters [35].

Protein-Protein Interaction (PPI) enrichment.

The protein-protein interaction annotation were download from STRING database (12.0) [36].

Module preservation.

We used Z-Summary, a network preservation statistic aggregating multiple preservation statistics (3 density-based statistics and 3 connectivity-based statistics), to quantify the conservation of the co-expression network in another dataset [37].

Module eigengenes.

Given that each module comprises genes with correlated expression patterns, it is logical to summarize each module using a single representative expression profile, known as the module eigengene. The module eigengene is defined as the first principal component of the standardized expression matrix of the genes within the module, which captures the majority of the variance within the module. Module eigengenes lead to a natural measure of similarity (membership) of all individual genes to all modules [38]. A continuous measure of module membership of gene in module is defined as

where is the expression measurement of gene and is the eigengene of module [39].

Gene enrichment analysis.

We implemented gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis by ClusterProfiler (4.15) [40].

Comparison of LSMetacell with other metacell algorithms.

To compare LSMetacell with other metacell construction algorithms, including hdWGCNA, metacell2, metaQ, SEACells and SuperCell. we generated an identical number of metacells across the same cell types using each of these methods, all under their respective default parameter settings. For primary LSMetacell (LSMetacell without strict library size control), we omitted lines (25–29) in LSMetacell (Algorithm 1) and ceased the metacell aggregation process once the required cell number (number of cells/number of designed metacells) was achieved. To ensure a fair and biologically relevant comparison of co-expression networks across different metacell methods, we applied a unified gene filtering criterion prior to WGCNA analysis: only genes expressed in at least 10% of cells within each cell type were retained. This filtering step was consistently applied to metacells generated by all methods, including LSMetacell, hdWGCNA, metacell2, metaQ, SEACells, and SuperCell. Although the absolute number of retained genes varied slightly across methods due to differences in metacell construction, the overlap among gene sets was substantial. To enable a direct and methodologically consistent comparison of functional modules, all subsequent WGCNA analyses were performed using the intersection of these filtered gene sets across methods.

Supporting information

S1 File. Proof for the main results.

Contains the theoretical proof of our main results, where we establish theoretically that compositional gene expression data suffers from inflated Type I errors in correlation analyses.

https://doi.org/10.1371/journal.pcbi.1013697.s001

(PDF)

S1 Table. Variability in library size distribution of metacells generated by different methods across major cell types.

Values represent the coefficient of variation (standard deviation divided by the mean), indicating the degree of variability in library sizes within each metacell type.

https://doi.org/10.1371/journal.pcbi.1013697.s002

(PDF)

S2 Table. Results of one-sided Wilcoxon rank-sum tests (left > right) comparing Zsummary scores between pairs of metacell construction methods presented in Fig 2.

Each cell displays the p-value from the statistical test evaluating whether the method in the row exhibits significantly higher Zsummary scores than the method in the column. Empty cells indicate comparisons that were either not applicable or not performed.

https://doi.org/10.1371/journal.pcbi.1013697.s003

(PDF)

S3 Table. The table shows the LSMetacell algorithm’s runtime on a standard laptop (64 GB RAM, Windows 11 OS) using datasets of varying sizes.

The results demonstrate a near-quadratic scaling of runtime with cell count, which aligns with its theoretical O() complexity.

https://doi.org/10.1371/journal.pcbi.1013697.s004

(PDF)

S1 Fig. Distribution of correlation absolute bias under a specific scaling condition.

This histogram illustrates the distribution of correlation absolute bias values across all gene pairs for different scale factor. Each scale factor is corresponding to one representative scaling condition shown in Fig 1b. The x-axis represents the correlation absolute bias, while the y-axis shows the frequency of gene pairs at each bias interval. Lower and more concentrated distributions indicate better control of technical artifacts induced by library size variation.

https://doi.org/10.1371/journal.pcbi.1013697.s005

(PDF)

S2 Fig. Benchmarking metacell algorithms under a synthetic null-correlation dataset.

We generated a synthetic single-cell dataset in which genes possess no intrinsic pairwise correlation (see Fig 1 Methods) and applied seven metacell-building algorithms: hdWGCNA, Metacell2, SEACells, SuperCell, MetaQ, Primary, and LSMetacell. After normalization within each metacell set, all pairwise Pearson correlations between genes were recalculated; their frequency distributions are displayed above. To quantify differences in noise suppression, two-sided Wilcoxon signed-rank tests were conducted between every distribution and the LSMetacell-derived distribution (our reference). P-values are reported directly on the plots.

https://doi.org/10.1371/journal.pcbi.1013697.s006

(PDF)

S3 Fig. Microglia gene co-expression modules identified by hdWGCNA employing the same analysis and presentation panels as in Fig 3 (see this figure for a detailed caption).

To ensure comparability across methods, the modules identified by each algorithm were aligned with those from LSMetacell using Fisher’s exact test to assess significant overlaps. Thus, modules labeled as MCgreen, MCred, MCyellow, MCblue, MCturquoise, MCblack, MCbrown, MCpink, MCpurple, MCyellow, and MCmagenta in each method correspond significantly to their counterparts in LSMetacell. Any additional modules identified by a method that are not listed above do not exhibit consistent correspondence across the other algorithms.

https://doi.org/10.1371/journal.pcbi.1013697.s007

(PDF)

S4 Fig. Microglia gene co-expression modules identified by Metacell2 employing the same analysis and presentation panels as in Fig 3 (see this figure for a detailed caption).

To ensure comparability across methods, the modules identified by each algorithm were aligned with those from LSMetacell using Fisher’s exact test to assess significant overlaps. Thus, modules labeled as MCgreen, MCred, MCyellow, MCblue, MCturquoise, MCblack, MCbrown, MCpink, MCpurple, MCyellow, and MCmagenta in each method correspond significantly to their counterparts in LSMetacell. Any additional modules identified by a method that are not listed above do not exhibit consistent correspondence across the other algorithms.

https://doi.org/10.1371/journal.pcbi.1013697.s008

(PDF)

S5 Fig. Microglia gene co-expression modules identified by MetaQ employing the same analysis and presentation panels as in Fig 3 (see this figure for a detailed caption).

To ensure comparability across methods, the modules identified by each algorithm were aligned with those from LSMetacell using Fisher’s exact test to assess significant overlaps. Thus, modules labeled as MCgreen, MCred, MCyellow, MCblue, MCturquoise, MCblack, MCbrown, MCpink, MCpurple, MCyellow, and MCmagenta in each method correspond significantly to their counterparts in LSMetacell. Any additional modules identified by a method that are not listed above do not exhibit consistent correspondence across the other algorithms.

https://doi.org/10.1371/journal.pcbi.1013697.s009

(PDF)

S6 Fig. Microglia gene co-expression modules identified by SEACells employing the same analysis and presentation panels as in Fig 3 (see this figure for a detailed caption).

To ensure comparability across methods, the modules identified by each algorithm were aligned with those from LSMetacell using Fisher’s exact test to assess significant overlaps. Thus, modules labeled as MCgreen, MCred, MCyellow, MCblue, MCturquoise, MCblack, MCbrown, MCpink, MCpurple, MCyellow, and MCmagenta in each method correspond significantly to their counterparts in LSMetacell. Any additional modules identified by a method that are not listed above do not exhibit consistent correspondence across the other algorithms.

https://doi.org/10.1371/journal.pcbi.1013697.s010

(PDF)

S7 Fig. Microglia gene co-expression modules identified by SuperCell employing the same analysis and presentation panels as in Fig 3 (see this figure for a detailed caption).

To ensure comparability across methods, the modules identified by each algorithm were aligned with those from LSMetacell using Fisher’s exact test to assess significant overlaps. Thus, modules labeled as MCgreen, MCred, MCyellow, MCblue, MCturquoise, MCblack, MCbrown, MCpink, MCpurple, MCyellow, and MCmagenta in each method correspond significantly to their counterparts in LSMetacell. Any additional modules identified by a method that are not listed above do not exhibit consistent correspondence across the other algorithms.

https://doi.org/10.1371/journal.pcbi.1013697.s011

(PDF)

S8 Fig. Microglia gene co-expression modules identified by Primary employing the same analysis and presentation panels as in Fig 3 (see this figure for a detailed caption).

To ensure comparability across methods, the modules identified by each algorithm were aligned with those from LSMetacell using Fisher’s exact test to assess significant overlaps. Thus, modules labeled as MCgreen, MCred, MCyellow, MCblue, MCturquoise, MCblack, MCbrown, MCpink, MCpurple, MCyellow, and MCmagenta in each method correspond significantly to their counterparts in LSMetacell. Any additional modules identified by a method that are not listed above do not exhibit consistent correspondence across the other algorithms.

https://doi.org/10.1371/journal.pcbi.1013697.s012

(PDF)

Acknowledgments

We express our gratitude to researchers who have shared their data online. This project benefits of the computing resources from the High-Performance Computing Center of Wuyi University.

References

  1. 1. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinform. 2008;9:559. pmid:19114008
  2. 2. Zhang T, Wong G. Gene expression data analysis using Hellinger correlation in weighted gene co-expression networks (WGCNA). Comput Struct Biotechnol J. 2022;20:3851–63. pmid:35891798
  3. 3. Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods. 2009;6(5):377–82. pmid:19349980
  4. 4. Aldridge S, Teichmann SA. Single cell transcriptomics comes of age. Nat Commun. 2020;11(1):4307. pmid:32855414
  5. 5. Bilous M, Hérault L, Gabriel AA, Teleman M, Gfeller D. Building and analyzing metacells in single-cell genomics data. Mol Syst Biol. 2024;20(7):744–66. pmid:38811801
  6. 6. Feregrino C, Tschopp P. Assessing evolutionary and developmental transcriptome dynamics in homologous cell types. Dev Dyn. 2022;251(9):1472–89. pmid:34114716
  7. 7. Morabito S, Reese F, Rahimzadeh N, Miyoshi E, Swarup V. hdWGCNA identifies co-expression networks in high-dimensional transcriptomics data. Cell Rep Methods. 2023;3(6):100498. pmid:37426759
  8. 8. Aitchison J. The statistical analysis of compositional data. J R Stat Soc Ser B Stat Methodol. 1982;44(2):139–60.
  9. 9. Skinnider MA, Squair JW, Foster LJ. Evaluating measures of association for single-cell transcriptomics. Nat Methods. 2019;16(5):381–6. pmid:30962620
  10. 10. Su C, Xu Z, Shan X, Cai B, Zhao H, Zhang J. Cell-type-specific co-expression inference from single cell RNA-sequencing data. bioRxiv. 2022:2022.12.13.520181. pmid:36561173
  11. 11. Cuevas-Diaz Duran R, Wei H, Wu J. Data normalization for addressing the challenges in the analysis of single-cell transcriptomic datasets. BMC Genomics. 2024;25(1):444. pmid:38711017
  12. 12. Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36(5):411–20. pmid:29608179
  13. 13. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19(1):15. pmid:29409532
  14. 14. Mathys H, Davila-Velderrain J, Peng Z, Gao F, Mohammadi S, Young JZ, et al. Single-cell transcriptomic analysis of Alzheimer’s disease. Nature. 2019;570(7761):332–7. pmid:31042697
  15. 15. Ben-Kiki O, Bercovich A, Lifshitz A, Tanay A. Metacell-2: a divide-and-conquer metacell algorithm for scalable scRNA-seq analysis. Genome Biol. 2022;23(1):100. pmid:35440087
  16. 16. Baran Y, Bercovich A, Sebe-Pedros A, Lubling Y, Giladi A, Chomsky E, et al. MetaCell: analysis of single-cell RNA-seq data using K-nn graph partitions. Genome Biol. 2019;20(1):206. pmid:31604482
  17. 17. Li Y, Li H, Lin Y, Zhang D, Peng D, Liu X, et al. MetaQ: fast, scalable and accurate metacell inference via single-cell quantization. Nat Commun. 2025;16(1):1205. pmid:39885131
  18. 18. Persad S, Choo Z-N, Dien C, Sohail N, Masilionis I, Chaligné R, et al. SEACells infers transcriptional and epigenomic cellular states from single-cell genomics data. Nat Biotechnol. 2023;41(12):1746–57. pmid:36973557
  19. 19. Bilous M, Tran L, Cianciaruso C, Gabriel A, Michel H, Carmona SJ, et al. Metacells untangle large and complex single-cell transcriptome networks. BMC Bioinform. 2022;23(1):336. pmid:35963997
  20. 20. Fan Y, Xie L, Chung CY. Signaling pathways controlling microglia chemotaxis. Mol Cells. 2017;40(3):163–8. pmid:28301917
  21. 21. Aguilar BJ, Zhu Y, Lu Q. Rho GTPases as therapeutic targets in Alzheimer’s disease. Alzheimers Res Ther. 2017;9(1):97. pmid:29246246
  22. 22. Xu S, Zhu Z, Delafield DG, Rigby MJ, Lu G, Braun M, et al. Spatially and temporally probing distinctive glycerophospholipid alterations in Alzheimer’s disease mouse brain via high-resolution ion mobility-enabled sn-position resolved lipidomics. Nat Commun. 2024;15(1):6252. pmid:39048572
  23. 23. Desale SE, Chinnathambi S. Phosphoinositides signaling modulates microglial actin remodeling and phagocytosis in Alzheimer’s disease. Cell Commun Signal. 2021;19(1):28. pmid:33627135
  24. 24. Li R-Y, Qin Q, Yang H-C, Wang Y-Y, Mi Y-X, Yin Y-S, et al. TREM2 in the pathogenesis of AD: a lipid metabolism regulator and potential metabolic therapeutic target. Mol Neurodegener. 2022;17(1):40. pmid:35658903
  25. 25. Hu D, Weiner HL. Unraveling the dual nature of brain CD8+ T cells in Alzheimer’s disease. Mol Neurodegener. 2024;19(1):16. pmid:38355649
  26. 26. Su W, Saravia J, Risch I, Rankin S, Guy C, Chapman NM, et al. CXCR6 orchestrates brain CD8+ T cell residency and limits mouse Alzheimer’s disease pathology. Nat Immunol. 2023;24(10):1735–47. pmid:37679549
  27. 27. Chen X, Firulyova M, Manis M, Herz J, Smirnov I, Aladyeva E, et al. Microglia-mediated T cell infiltration drives neurodegeneration in tauopathy. Nature. 2023;615(7953):668–77. pmid:36890231
  28. 28. Grayson JM, Short SM, Lee CJ, Park N, Marsac C, Sette A, et al. T cell exhaustion is associated with cognitive status and amyloid accumulation in Alzheimer’s disease. Sci Rep. 2023;13(1):15779. pmid:37737298
  29. 29. Serrano-Pozo A, Das S, Hyman BT. APOE and Alzheimer’s disease: advances in genetics, pathophysiology, and therapeutic approaches. Lancet Neurol. 2021;20(1):68–80. pmid:33340485
  30. 30. Huang C, Liu Y, Wang S, Xia J, Hu D, Xu R. From genes to metabolites: HSP90B1’s role in Alzheimer’s disease and potential for therapeutic intervention. Neuromolecular Med. 2025;27(1):6. pmid:39760808
  31. 31. Carmona S, Zahs K, Wu E, Dakin K, Bras J, Guerreiro R. The role of TREM2 in Alzheimer’s disease and other neurodegenerative disorders. Lancet Neurol. 2018;17(8):721–30. pmid:30033062
  32. 32. Li Y, Xia X, Wang Y, Zheng JC. Mitochondrial dysfunction in microglia: a novel perspective for pathogenesis of Alzheimer’s disease. J Neuroinflammation. 2022;19(1):248. pmid:36203194
  33. 33. Lau S-F, Cao H, Fu AKY, Ip NY. Single-nucleus transcriptome analysis reveals dysregulation of angiogenic endothelial cells and neuroprotective glia in Alzheimer’s disease. Proc Natl Acad Sci U S A. 2020;117(41):25800–9. pmid:32989152
  34. 34. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabási AL. Hierarchical organization of modularity in metabolic networks. Science. 2002;297(5586):1551–5. pmid:12202830
  35. 35. Langfelder P, Zhang B, Horvath S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics. 2008;24(5):719–20. pmid:18024473
  36. 36. von Mering C, Huynen M, Jaeggi D, Schmidt S, Bork P, Snel B. STRING: a database of predicted functional associations between proteins. Nucleic Acids Res. 2003;31(1):258–61. pmid:12519996
  37. 37. Langfelder P, Luo R, Oldham MC, Horvath S. Is my network module preserved and reproducible? PLoS Comput Biol. 2011;7(1):e1001057. pmid:21283776
  38. 38. Dong J, Horvath S. Understanding network concepts in modules. BMC Syst Biol. 2007;1:24. pmid:17547772
  39. 39. Wang N, Langfelder P, Stricos M, Ramanathan L, Richman JB, Vaca R, et al. Mapping brain gene coexpression in daytime transcriptomes unveils diurnal molecular networks and deciphers perturbation gene signatures. Neuron. 2022;110(20):3318-3338.e9. pmid:36265442
  40. 40. Yu G, Wang L-G, Han Y, He Q-Y. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16(5):284–7. pmid:22455463