
Network-constrained Random Lasso for biologically interpretable gene network inference across unequal sample sizes

  • Heewon Park ,

    Roles Conceptualization, Formal analysis, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing

    heewonn.park@gmail.com

    Affiliations School of Mathematics, Statistics and Data Science, Sungshin Women’s University, Seoul, Republic of Korea, M&D Data Science Center, Tokyo Medical and Dental University, Bunkyo-ku, Tokyo, Japan, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo, Japan

  • Satoru Miyano

    Roles Supervision

    Affiliations M&D Data Science Center, Tokyo Medical and Dental University, Bunkyo-ku, Tokyo, Japan, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo, Japan

Abstract

Gene regulatory network inference is a key approach for elucidating the molecular mechanisms underlying complex diseases, but accurately inferring networks from high-dimensional data, especially when sample sizes are imbalanced, remains a significant challenge. Although L1-type regularization methods have been used for gene network inference, existing methods often fail under conditions involving high dimensionality, noise, and unequal sample sizes across phenotypes. To overcome these limitations, this study developed netRL, a novel computational framework that integrates the Random Lasso with prior network biological knowledge. The proposed method leverages a bootstrap-based strategy to stabilize the selection of key regulatory genes and incorporates network-informed penalization using centrality measures (i.e., hubness and betweenness centrality). This study also introduced a statistical strategy using a hypergeometric test to assess the significance of the inferred edges, thereby enhancing the reliability of the network. Through extensive simulation studies, this study demonstrated that netRL outperforms conventional methods in both network estimation and gene selection. Applying netRL to whole-blood RNA-seq profiles from the Japan COVID-19 Task Force, this study identified distinct phenotype-specific molecular interplays between asymptomatic and critical cases despite pronounced sample imbalance. The findings reveal that asymptomatic networks were dense and enriched for ribosomal proteins, whereas critical networks were sparse, centralized, and characterized by hub genes such as NFKBIA, B2M, CXCL8, and FOS. Pathway enrichment further revealed phenotype-specific biological processes, highlighting molecular signatures of disease progression.
The results of this study suggest that enhancing the activity of asymptomatic condition-specific markers (e.g., ribosomal proteins) may provide important insights into the molecular mechanisms underlying COVID-19 severity. Collectively, these results demonstrate that netRL enables biologically interpretable and statistically robust network inference, offering new insights into the molecular basis of COVID-19 severity and broader applications in systems biology.

Introduction

Gene regulatory networks (GRNs) provide a powerful framework to represent molecular interactions, where nodes correspond to genes and edges denote regulatory influences. Accurately reconstructing GRNs from high-dimensional transcriptomic data enables researchers to elucidate disease mechanisms, identify biomarkers, and uncover therapeutic targets. With the increasing availability of large-scale RNA sequencing data, the development of statistically rigorous and biologically interpretable computational methods for network inference has become essential.

However, a significant challenge in gene network inference, particularly in clinical and biological studies, is the issue of imbalanced sample sizes across different phenotypes or conditions. For instance, in rare diseases or specific disease subtypes, the number of available samples for one group may be substantially smaller than for a control group. This disparity can lead to several statistical problems, including estimation bias and a significant reduction in statistical power. As a result, the molecular interactions identified in such studies are more likely to be artifacts of the sampling imbalance rather than those stemming from true biological differences. Although various L1-type regularization methods (e.g., lasso, adaptive lasso, and elastic net) have been developed and widely applied in gene network inference, most existing computational methods are highly susceptible to the aforementioned issues. Furthermore, the lasso method tends to select only a portion of correlated genes, potentially overlooking biologically relevant groups. Although the elastic net can circumvent some of these limitations, it may produce biased network estimates when correlated genes have edges that differ in magnitude of weight or direction to their target genes, which frequently occurs in gene networks [2]. While some methods have been developed to address these limitations, they often overlook crucial biological context, leading to networks that are statistically sound but lack biological interpretability. For example, although the standard Random Lasso framework [2] offers advantages in feature selection, it lacks integration of biological knowledge, which can be critical for accurately inferring biologically consequential networks.

To address these challenges, this study developed network-constrained Random Lasso (netRL), a novel framework that integrates the random lasso with network biological knowledge. netRL leverages a bootstrap-based strategy to mitigate sample imbalance while enhancing the stability of regulator gene selection. The proposed strategy measures gene importance based on not only statistical evidence but also network biological knowledge (i.e., hubness and betweenness centrality). In addition, netRL incorporates prior molecular interaction knowledge into the penalty structure, thereby steering gene selection toward biologically plausible solutions. To further refine inference, this study developed a statistical strategy that assesses the significance of estimated edges within the bootstrap framework using the hypergeometric test. By combining these complementary components (bootstrap-based stabilization, network-informed penalization, and statistical edge validation), netRL offers a robust and biologically interpretable solution for gene network inference. The framework enables accurate detection of phenotype-specific molecular interplays under challenging conditions such as high dimensionality, noise, and unequal sample sizes, ultimately facilitating the discovery of molecular mechanisms underlying complex diseases.

Through extensive simulation studies, this study demonstrated that netRL outperforms existing methods in terms of network estimation and crucial edge selection. This study applied the proposed netRL framework to RNA-seq profiles obtained from the Japan COVID-19 Task Force [13]. The aim was to uncover COVID-19 severity-specific molecular interplays, especially between asymptomatic and critical samples. The analysis identified distinct gene regulatory networks for the asymptomatic and critical COVID-19 groups, despite the significant sample size imbalance between these groups. In asymptomatic COVID-19 samples, the inferred networks were relatively dense, with strong connectivity among ribosomal proteins and translation-related genes. Importantly, ribosomal proteins were identified as phenotype-specific regulators in asymptomatic cases. Genes such as NFKBIA, B2M, CXCL8, and FOS emerged as common markers of critical and asymptomatic cases. Pathway enrichment analysis of the phenotype-specific networks provided additional biological insights. The network of COVID-19 critical samples was significantly enriched for “Positive regulation”-related GO terms (i.e., “Positive regulation of macromolecule biosynthetic process,” “Positive regulation of cellular biosynthetic process,” and “Positive regulation of biosynthetic process”). By contrast, the asymptomatic network was enriched for “Endoplasmic”-related (i.e., “Endoplasmic reticulum,” “Endoplasmic reticulum lumen”) and “Receptor complex” GO terms. By integrating prior biological knowledge into network inference, netRL was able to uncover robust and interpretable molecular signatures of COVID-19 progression that may have been overlooked by conventional differential expression approaches. These findings reinforce the value of netRL in dissecting the molecular complexity of infectious disease.

The remainder of this paper is organized as follows. The Methods section introduces the formulation of netRL and describes its algorithmic implementation. Simulation studies are then presented to compare the performance of netRL with existing approaches under various data scenarios. The developed framework is next applied to COVID-19 transcriptomic data to demonstrate its practical utility in differential network analysis. Finally, this study discusses the biological implications of the findings and concludes with future perspectives.

Methods

All data analyses and method development were performed during the period from May 2025 to December 2025. Different sample size scenarios can lead to various statistical challenges. In gene network inference and differential gene network analysis, unequal sample sizes may lead to bias and substantially reduce statistical power, particularly when applying methods such as Gaussian graphical models. Accordingly, the phenotype-specific molecular interactions observed under unequal sample sizes are more likely to be artifacts of sampling imbalance than authentic biological differences. To address this issue, the current study proposes a new computational method, network-constrained Random Lasso (netRL), that combines bootstrap resampling with random forest methodology in line with the random lasso framework [2]. By employing the random lasso framework, the proposed strategy can circumvent the challenges posed by unequal sample sizes in differential gene network analysis. Moreover, the developed framework integrates network biological knowledge into gene network inference, thereby enhancing the biological reliability of gene network analysis.

Existing methods for gene network estimation

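The gene network can be formalized as a weighted graph $G = (V, E, W)$, where $V$ is the set of vertices corresponding to the $p$ genes, $E = \{k \sim j\}$ represents the set of edges describing gene-gene interactions, and $W = (w_{kj})$ specifies the weights associated with each interaction $k \sim j$. The normalized Laplacian matrix $L$ for the graph $G$ is given as [6,7],

$$L(k, j) = \begin{cases} 1 - w_{kk}/d_k, & \text{if } k = j \text{ and } d_k \neq 0, \\ -w_{kj}/\sqrt{d_k d_j}, & \text{if } k \text{ and } j \text{ are adjacent}, \\ 0, & \text{otherwise}, \end{cases} \tag{1}$$

where $d_k = \sum_{j:\, k \sim j} w_{kj}$ is the degree of each gene.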
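As an illustration of the definition above, the following minimal sketch builds a normalized Laplacian from a symmetric, non-negative weight matrix and checks numerically that the quadratic form $\beta^{\top} L \beta$ equals an edge-wise sum of squared, degree-scaled differences, which is the form network-constrained penalties rely on [6]. The function names `normalized_laplacian` and `edgewise_penalty` are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def normalized_laplacian(W):
    """Normalized Laplacian of a symmetric, non-negative weight matrix W."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                       # weighted degree of each gene
    connected = d > 0
    inv_sqrt_d = np.zeros_like(d)
    inv_sqrt_d[connected] = 1.0 / np.sqrt(d[connected])
    # Off-diagonal entries: -w_kj / sqrt(d_k * d_j)
    L = -W * np.outer(inv_sqrt_d, inv_sqrt_d)
    # Diagonal entries: 1 - w_kk / d_k for connected genes, 0 for isolated genes
    diag = np.where(connected, 1.0 - np.diag(W) * inv_sqrt_d ** 2, 0.0)
    np.fill_diagonal(L, diag)
    return L

def edgewise_penalty(beta, W):
    """Sum over edges k~j (k < j) of w_kj * (beta_k/sqrt(d_k) - beta_j/sqrt(d_j))^2."""
    W = np.asarray(W, dtype=float)
    beta = np.asarray(beta, dtype=float)
    d = W.sum(axis=1)
    connected = d > 0
    scaled = np.zeros_like(beta)
    scaled[connected] = beta[connected] / np.sqrt(d[connected])
    k, j = np.triu_indices_from(W, k=1)     # each undirected edge counted once
    return float(np.sum(W[k, j] * (scaled[k] - scaled[j]) ** 2))

# Toy path graph: gene 0 - gene 1 - gene 2
W_toy = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
L_toy = normalized_laplacian(W_toy)
beta_toy = np.array([1.0, 2.0, 3.0])
quadratic_form = float(beta_toy @ L_toy @ beta_toy)
```

For the toy graph, `quadratic_form` and `edgewise_penalty(beta_toy, W_toy)` agree, which is why a penalty of this kind shrinks degree-scaled coefficients of neighboring genes toward each other.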

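Suppose that $X = (x_1, \ldots, x_p) \in \mathbb{R}^{n \times p}$ is the data matrix describing the expression levels of $p$ genes. In this study, we considered a directed gene regulatory network that is often represented by the following linear regression model:

$$x_{il} = \sum_{j \neq l} \beta_{jl}\, x_{ij} + \varepsilon_{il}, \quad i = 1, \ldots, n, \quad l = 1, \ldots, p,$$

where $x_{ij}$ is the expression level of the $j$th gene in the $i$th cell and $\varepsilon_{il}$ is the random error term accounting for residual variation. Without loss of generality, this study assumed that the response has been centered and each predictor has been standardized,

$$\sum_{i=1}^{n} x_{il} = 0, \quad \sum_{i=1}^{n} x_{ij} = 0, \quad \sum_{i=1}^{n} x_{ij}^{2} = 1, \quad j \neq l.$$

In order to estimate the gene regulatory network described by the linear regression model, the following L1-type regularization methods have often been used [1]:

$$\hat{\beta}_{l} = \arg\min_{\beta_l} \left\{ \frac{1}{2} \sum_{i=1}^{n} \Big( x_{il} - \sum_{j \neq l} \beta_{jl}\, x_{ij} \Big)^{2} + P_{\lambda}(\beta_l) \right\}, \tag{2}$$

where $\lambda > 0$ is a hyperparameter for controlling the degree of shrinkage of $\beta_l = (\beta_{jl})_{j \neq l}$ and $P_{\lambda}(\beta_l)$ is the L1-type penalty, such as

  • Lasso
    $P_{\lambda}(\beta_l) = \lambda \sum_{j \neq l} |\beta_{jl}|$,
  • Adaptive lasso
    $P_{\lambda}(\beta_l) = \lambda \sum_{j \neq l} \omega_j |\beta_{jl}|$,
    where $\omega_j = 1/|\tilde{\beta}_{jl}|$ and $\tilde{\beta}_{jl}$ is an initial estimate of $\beta_{jl}$ (e.g., a least squares or ridge estimate) [1],
  • Elastic net
    $P_{\lambda}(\beta_l) = \lambda \Big\{ \alpha \sum_{j \neq l} |\beta_{jl}| + (1 - \alpha) \sum_{j \neq l} \beta_{jl}^{2} \Big\}$,
    where $\alpha \in [0, 1]$ denotes the mixing parameter that balances the L2-norm penalty (i.e., ridge [8]) and the L1-norm penalty (i.e., lasso).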

By imposing L1-type penalties to the least squares loss, L1-type regularization methods simultaneously achieve edge selection and weight estimation.
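To make the link between the L1 penalty and edge selection concrete, the sketch below fits a lasso for one target gene by cyclic coordinate descent. It assumes the objective is scaled as $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$; the function names and the toy data are hypothetical, and this is not the solver used in the study.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)            # squared norm of each regulator column
    resid = np.asarray(y, dtype=float).copy()  # residual y - X @ beta (beta starts at 0)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]       # remove regulator j from the current fit
            rho = X[:, j] @ resid            # correlation of regulator j with partial residual
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]       # add the updated contribution back
    return beta

# Toy example: 3 true regulators among 20 candidates
rng = np.random.default_rng(0)
X_toy = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]
y_toy = X_toy @ beta_true + 0.1 * rng.standard_normal(100)
beta_hat = lasso_cd(X_toy, y_toy, lam=20.0)
selected_edges = np.flatnonzero(beta_hat)    # regulators with nonzero edge weight
```

Coefficients whose correlation with the partial residual stays below `lam` are set exactly to zero, so the set of nonzero entries of `beta_hat` is the selected edge set and their values are the (shrunken) edge weights.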

However, the existing methods suffer from the following critical limitations in gene network inference.

  • These methods suffer from unequal sample size issues, especially in uncovering phenotype specific characteristics, because, when p > n, the lasso and adaptive lasso can select at most n regulator genes [2]. When the sample size of a phenotype is relatively large, the corresponding model incorporates more substantial features, resulting in networks with an increased number of genes and their connections relative to those derived from limited samples. Hence, the identified phenotype-specific molecular interplays under the unequal sample sizes are more likely to reflect sampling imbalance rather than true biological differences.
  • Furthermore, when constructing gene regulatory networks, many genes that are tightly connected within a biological pathway or functional module tend to exhibit strong correlation patterns. However, lasso tends to select only a few representatives while discarding the rest, thereby missing biologically meaningful groups.
  • Although the elastic net alleviates some limitations of the lasso and adaptive lasso, it may produce biased network estimation results when handling highly correlated genes with regression coefficients of different magnitudes or opposing signs, owing to its grouping effect. Such situations frequently arise in gene network analysis, as genes within the same biological pathway are often strongly correlated but may contribute to the regulatory process with varying effect sizes or even opposite directions.

In summary, the existing approaches are not well suited to gene network analysis; gene network inference and differential gene network analysis between phenotypes should be conducted using methods that remain robust under unequal sample sizes.

To overcome these critical limitations, this study considered the random lasso framework [2], an L1-type regularization method based on bootstrap sampling and the random forest technique. The random lasso exhibits superior capability in feature selection and model estimation relative to conventional L1-regularization approaches (such as lasso, adaptive lasso, and elastic net). Although the random lasso addresses several limitations of the existing approaches, it was originally developed from a purely statistical and computational perspective, without incorporating biological knowledge. This often poses challenges in interpreting the results of gene network inference.

Network-constrained Random Lasso

This study developed a new computational strategy, called network-constrained Random Lasso (netRL), which explicitly incorporated network biological knowledge into the network inference procedure. In addition, this study proposed a dedicated edge selection scheme, implemented subsequent to the bootstrap resampling and random forest steps.

The proposed netRL infers gene regulatory networks based on three stages: first, assessing the importance of regulator genes, second, gene network inference (edge selection and edge weight estimation), and third, assessing the significance of edges. Fig 1 shows the overview of the developed network constrained Random Lasso (netRL).

Fig 1. Overview of the network constrained Random Lasso (netRL).

https://doi.org/10.1371/journal.pone.0344198.g001

  1. Generating gene importance measure
    (a) Suppose that there are two phenotypes of samples (e.g., COVID-19 severe and non-severe), and we estimate the gene network of the $l$th ($l = 1, \ldots, p$) target gene for the severe (or non-severe) samples. Draw $B$ bootstrap samples of size $n^{*}$ from the original dataset, where $n^{*}$ is determined by $n_S$ and $n_N$, the sample sizes of the severe and non-severe groups, respectively.
    (b) For the $b$th bootstrap sample ($b = 1, \ldots, B$), randomly select $q_1$ regulator genes, and apply the lasso to estimate the edge weights $\hat{\beta}_{jl}^{(b)}$ for $j = 1, \ldots, p$ for the $l$th target gene, where the edge weights of unselected genes are set to zero.
    (c) Compute the gene importance measure by incorporating the network biological knowledge. This study considered the widely used normalized degree centrality (i.e., hubness: $H_j$) and betweenness centrality ($B_j$) as indicators of a gene's importance [3,4],
      $$H_j = \frac{1}{|V| - 1} \sum_{k} a_{kj}, \qquad B_j = \frac{1}{(|V| - 1)(|V| - 2)} \sum_{t \neq j \neq u} \frac{b_{tu}(j)}{b_{tu}}, \tag{3}$$
      where $a_{kj}$ is the $(k,j)$ element of the adjacency matrix, $|V|$ is the number of nodes (genes) in the network, $b_{tu}$ is the total number of shortest paths from the $t$th to the $u$th gene, and $b_{tu}(j)$ is the number of these paths that pass through the $j$th gene (where $j$ is not an end point). The normalized centrality measures, i.e., hubness ($H_j$) and betweenness centrality ($B_j$), both take values in the interval [0,1], because hubness is defined as the sum of the $j$th column of the adjacency matrix normalized by its maximum possible value $|V| - 1$, and the betweenness centrality is scaled by the maximal attainable betweenness in a directed graph, $(|V| - 1)(|V| - 2)$ [5]. Hub genes are highly interconnected nodes that can propagate small local perturbations into global changes across the gene regulatory network, underscoring their functional importance. Accordingly, hubness serves as an indicator of a gene’s essentiality. In parallel, betweenness centrality evaluates a gene’s importance by measuring its control over information flow between other genes in the network. The proposed strategy incorporates these centralities to evaluate the importance of a gene through the following gene importance measure,
      $$C_j = \phi(H_j, B_j) \left| \frac{1}{B} \sum_{b=1}^{B} \hat{\beta}_{jl}^{(b)} \right|, \tag{4}$$
      where $\phi(H_j, B_j) \in [0, 1]$ is a weighting factor constructed from the standardized centrality metrics. The gene importance measure can be considered a weighted version of the absolute value of the average bootstrap coefficient (i.e., $|B^{-1} \sum_{b=1}^{B} \hat{\beta}_{jl}^{(b)}|$), which indicates the importance of genes from the statistical viewpoint. Because the weighting factor takes values in [0,1], it modulates the bootstrap coefficient without altering its scale. That is, the proposed strategy measures the importance of genes from not only a statistical but also a network biology perspective, and thus the proposed method can perform biologically reliable gene network inference.
  2. Gene network estimation
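    (a) Compute the normalized Laplacian matrix $L$ based on the following weighted adjacency matrix,
      $$W = (w_{kj}), \quad k, j = 1, \ldots, p, \tag{5}$$
      where $w_{kj}$ is constructed from the estimated regression coefficients between the $k$th and $j$th genes. The weighted adjacency matrix can also be defined by an external prior network, e.g., a gene–gene correlation matrix, where each entry $w_{kj}$ corresponds to the correlation coefficient between the expression levels of the $k$th and $j$th genes.
    (b) Draw another set of $B$ bootstrap samples of size $n^{*}$ by sampling from the original dataset.
    (c) For the $b$th bootstrap sample ($b = 1, \ldots, B$), randomly select $q_2$ candidate regulator genes with the selection probability of the $j$th regulator gene proportional to its importance $C_j$.
    (d) Construct the normalized Laplacian matrix for the randomly selected genes from the $L$ computed in stage 2.a, i.e., the $q_2 \times q_2$ submatrix $L^{(b)}$ of $L$ indexed by the selected genes.
    (e) For interpretable and biologically reliable gene network estimation, this study considered the following network-constrained L1-type regularization method [6],
      $$\hat{\beta}_{l}^{(b)} = \arg\min_{\beta} \left\{ \frac{1}{2} \sum_{i} \Big( x_{il} - \sum_{j} \beta_{j}\, x_{ij} \Big)^{2} + \lambda_1 \sum_{j} |\beta_{j}| + \lambda_2\, \beta^{\top} L^{(b)} \beta \right\}, \tag{6}$$
      $$\beta^{\top} L^{(b)} \beta = \sum_{k \sim j} \left( \frac{\beta_k}{\sqrt{d_k}} - \frac{\beta_j}{\sqrt{d_j}} \right)^{2} w_{kj}, \tag{7}$$
      where the sums over $i$ and $j$ run over the observations of the $b$th bootstrap sample and the selected genes, respectively, and $\beta_j = 0$ for genes outside the selected subset. As genes within the same network neighborhood tend to share functional roles, this study applied $w_{kj}$ in the second penalty term to encourage consistency among the estimated coefficients. This penalty induces local smoothing across the network, thereby promoting the simultaneous selection of biologically related genes. Moreover, scaling gene coefficients by the square root of node degree enables the method to impose relatively weaker penalties on highly connected hub genes. Consequently, hub genes together with their neighboring nodes are more likely to receive larger coefficients, thereby increasing their chance of being selected during network reconstruction [6]. The proposed strategy applies the network-constrained L1-type regularization method to the bootstrap samples for the selected genes and estimates $\hat{\beta}_{jl}^{(b)}$ for $j = 1, \ldots, p$, $b = 1, \ldots, B$.
    (f) Compute the edge weights of the $p$ regulator genes for the target gene,
      $$\hat{\beta}_{jl} = \frac{1}{B} \sum_{b=1}^{B} \hat{\beta}_{jl}^{(b)}, \quad j = 1, \ldots, p. \tag{8}$$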
  3. Assess significance of edges
    The procedures of stages 1 and 2 have the potential to generate false-positive regulator gene selections, because the final edge weight estimate in (8) is derived from the coefficients averaged over $B$ iterations. Hence, a regulator gene that exhibits a nonzero coefficient in only one of the $B$ models still receives a nonzero edge weight. Thus, the assessment of significance should be performed following the bootstrap-based edge estimation procedure.
    (a) Permutation test
      To enhance the effectiveness of edge selection, this study proposed a novel strategy that evaluates the significance of edges using the permutation test. As a first step, the selection frequency of the $j$th regulator gene among the $B$ bootstrap models is quantified as follows,
      $$F_j = \sum_{b=1}^{B} I\big( \hat{\beta}_{jl}^{(b)} \neq 0 \big),$$
      where $I(\cdot)$ is the indicator function. To compute the selection frequency of the $j$th regulator gene under the permutation framework, the proposed strategy permutes the expression levels of target genes and re-estimates the gene networks following “Step 2: Gene network estimation,” yielding $\hat{\beta}_{jl}^{(b),m}$ for $b = 1, \ldots, B$ for $m = 1, \ldots, M$ permutation replicates. Then, the permutation selection frequency is computed as follows,
      $$F_j^{(m)} = \sum_{b=1}^{B} I\big( \hat{\beta}_{jl}^{(b),m} \neq 0 \big).$$
      Next, the proposed strategy computes the permutation p-value as follows:
      $$p_j = \frac{1}{M} \sum_{m=1}^{M} I\big( F_j^{(m)} \geq F_j \big). \tag{9}$$
      The proposed netRL selects regulator genes with p-values less than or equal to a significance level $\alpha_0$.
    (b) Using percentile bootstrap interval
      This study also evaluated netRL with the significance of edges assessed by the percentile bootstrap interval [11]. The confidence interval was determined as follows,
      $$\Big[ \hat{\beta}_{jl}^{(\alpha_0/2)},\ \hat{\beta}_{jl}^{(1 - \alpha_0/2)} \Big], \tag{10}$$
      where $\hat{\beta}_{jl}^{(\alpha_0/2)}$ and $\hat{\beta}_{jl}^{(1 - \alpha_0/2)}$ are the $\alpha_0/2$ and $1 - \alpha_0/2$ quantiles of the generated bootstrap replications, respectively. Under a significance threshold of $\alpha_0$, genes whose $100(1 - \alpha_0)$% confidence intervals for $\beta_{jl}$ encompass zero were discarded from the list of candidate regulators of the target gene.
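The two edge-significance checks in stage 3 can be sketched as follows, assuming the bootstrap coefficients for one target gene are stored as a `(B, p)` array and the permuted re-estimates as an `(M, B, p)` array. The array layout, the function names, and the use of an unsmoothed proportion for the p-value are assumptions made for illustration.

```python
import numpy as np

def selection_frequency(coefs):
    """Number of bootstrap models (rows) in which each regulator has a nonzero coefficient."""
    return np.count_nonzero(coefs, axis=0)

def permutation_pvalues(coefs, perm_coefs):
    """Share of permutation replicates whose selection frequency reaches the observed one.

    coefs:      (B, p) coefficients from the observed data
    perm_coefs: (M, B, p) coefficients re-estimated after permuting the target gene
    """
    observed = selection_frequency(coefs)               # shape (p,)
    permuted = np.count_nonzero(perm_coefs, axis=1)     # shape (M, p)
    return (permuted >= observed).mean(axis=0)

def percentile_interval(coefs, alpha0=0.05):
    """Percentile bootstrap interval per regulator and a flag for intervals excluding zero."""
    lower = np.quantile(coefs, alpha0 / 2, axis=0)
    upper = np.quantile(coefs, 1 - alpha0 / 2, axis=0)
    keep = (lower > 0) | (upper < 0)                    # interval does not contain zero
    return lower, upper, keep

# Toy example: gene 0 is selected in every model, gene 1 in only one model
B, M, p = 10, 4, 2
coefs_toy = np.zeros((B, p))
coefs_toy[:, 0] = 1.0
coefs_toy[0, 1] = 0.5
perm_toy = np.zeros((M, B, p))
perm_toy[:, :2, :] = 1.0          # under permutation, each gene is selected in 2 of 10 models
pvals = permutation_pvalues(coefs_toy, perm_toy)
lower, upper, keep = percentile_interval(coefs_toy)
```

In the toy example, the gene selected in a single bootstrap model would still have a nonzero averaged weight in (8), but both checks flag it as unsupported, which is the false-positive pattern stage 3 is meant to remove.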

The proposed netRL framework introduces several methodological advances over existing approaches, which can be summarized as follows.

  • Integrating random lasso with Laplacian-regularized regression to handle sample imbalance: The incorporation of a random lasso framework into Laplacian-regularized regression is a key design choice of netRL. This integration mitigates severe sample imbalance across phenotypes while improving the stability of regulator gene selection. Although numerous statistical methodologies have been developed and applied to gene network estimation (e.g., sparse Gaussian Graphical Models (GGMs) [9] and Laplacian-regularized regressions [6]), existing approaches typically assume comparable sample sizes across conditions. Under unequal-sample scenarios, these methods tend to identify spurious molecular interactions driven by sampling imbalance rather than true biological differences. By repeatedly fitting Laplacian-regularized models on balanced bootstrap subsamples, netRL explicitly addresses this issue, yielding more robust phenotype-specific network inference.
  • Centrality-weighted subsampling to quantify gene importance within the random lasso framework: Traditional stability selection [10] and random lasso [2] approaches rely on uniform subsampling of predictors and do not incorporate biological or network-level information. Moreover, although prior Laplacian-regularized regression and sparse GGM methods exploit network structure as a constraint during the estimation stage, they do not use network centrality information to guide the predictor subsampling process. In contrast, netRL directly integrates biological network knowledge into the subsampling mechanism by quantifying gene-level importance through a combination of bootstrap-based coefficient stability and graph-derived structural centrality measures, such as hubness and betweenness. This approach yields a biologically meaningful sampling distribution that preferentially selects genes occupying topologically influential positions within the underlying network.
  • Two-stage inference and selection with hypergeometric calibration: Rather than relying solely on coefficient magnitudes or regularization paths, the proposed netRL employs a two-stage inference scheme. In the second stage, a hypergeometric test is used to assess whether edges repeatedly selected across bootstrap replicates occur more frequently than expected by chance under random subsampling. This calibration step addresses a known limitation of random lasso–based feature selection, namely the lack of a formal statistical criterion for edge significance, and provides principled control over false-positive discoveries.
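The centrality-weighted subsampling summarized in the second point above can be sketched as below. The code computes hubness as a normalized column sum, betweenness with Brandes' algorithm on an unweighted directed graph, and then draws regulators with probability proportional to importance. Three choices are assumptions for illustration only: the weighting factor is taken as the average of the two centralities, `A[s, t] != 0` is read as an edge from gene `s` to gene `t`, and a tiny floor `eps` keeps every gene drawable. None of the function names come from the authors' code.

```python
from collections import deque
import numpy as np

def hubness(A):
    """Column sums of the adjacency matrix divided by |V| - 1."""
    A = np.asarray(A, dtype=float)
    return A.sum(axis=0) / (A.shape[0] - 1)

def betweenness(A):
    """Normalized betweenness for an unweighted directed graph (Brandes' algorithm)."""
    A = np.asarray(A)
    n = A.shape[0]
    successors = [np.flatnonzero(A[s]) for s in range(n)]
    bc = np.zeros(n)
    for s in range(n):
        sigma = np.zeros(n); sigma[s] = 1.0       # number of shortest paths from s
        dist = np.full(n, -1); dist[s] = 0
        preds = [[] for _ in range(n)]
        order, queue = [], deque([s])
        while queue:                               # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in successors[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = np.zeros(n)
        for w in reversed(order):                  # accumulate pair dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc / ((n - 1) * (n - 2))                # directed-graph normalization

def gene_importance(A, boot_coefs):
    """Centrality weight (assumed: mean of H and B) times |average bootstrap coefficient|."""
    weight = 0.5 * (hubness(A) + betweenness(A))
    return weight * np.abs(np.mean(boot_coefs, axis=0))

def sample_regulators(importance, q, rng, eps=1e-12):
    """Draw q distinct regulators with probability proportional to importance."""
    probs = np.asarray(importance, dtype=float) + eps
    return rng.choice(len(probs), size=q, replace=False, p=probs / probs.sum())

# Toy directed path 0 -> 1 -> 2 with one bootstrap coefficient vector
A_toy = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
C_toy = gene_importance(A_toy, np.array([[1.0, 2.0, 0.0]]))
picked = sample_regulators(C_toy, q=2, rng=np.random.default_rng(0))
```

With uniform probabilities this reduces to the predictor subsampling of the standard random lasso; the importance vector only changes which genes are offered to the regression more often.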

Monte Carlo simulation

Monte Carlo simulation experiments were performed with simulated datasets to investigate the performance of the proposed netRL. In line with earlier studies on network-constrained regularization method [6], this study adopted benchmark settings to establish simulation scenarios.

This study supposed $u$ transcription factors (TFs), each of which regulates $v$ genes, with the expression levels of the TFs following a standard normal distribution, i.e., $X_{TF_j} \sim N(0, 1)$. The expression levels of a TF and a gene regulated by that TF are jointly distributed as a bivariate normal with a correlation of 0.7, i.e., conditional on the TF, a gene regulated by the $j$th TF follows $N(0.7\, X_{TF_j},\ 0.51)$. This study defined the expression data for the genes as follows,

(11)

Here, this study treated the genes as regulator genes that regulate the target gene as follows,

(12)

where .

The current study explored multiple possible configurations of the gene regulatory structure for the target gene as described below.

  • Scenario 1:
  • Scenario 2:
  • Scenario 3:
  • Scenario 4:

This study considered , number of TFs u = 10, 20, and simulated 50 datasets consisting of n = 60 observations from the 4 scenarios, where the training and test datasets consisted of 80% (48) and 20% (12) observations, respectively. This study also considered two situations in which the number of crucial regulator genes corresponding is smaller (i.e., u = 10: Situation 1) and larger (i.e., u = 20: Situation 2) than the number of observations of the training dataset (i.e., n = 48).
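The expression-generating step of this design can be sketched as follows. The conditional variance 0.51 follows from unit marginal variances and a correlation of 0.7 ($1 - 0.7^2 = 0.51$). The column layout (each TF followed by the genes it regulates), the function name, and the values passed for `u` and `v` in the example are assumptions; the coefficient vectors of the four scenarios are not reproduced here.

```python
import numpy as np

def simulate_expression(n, u, v, rng):
    """Simulate u TFs and v regulated genes per TF for n observations.

    TF expression is N(0, 1); a regulated gene given its TF is N(0.7 * TF, 0.51),
    so each TF-gene pair is bivariate normal with correlation 0.7.
    """
    tf = rng.standard_normal((n, u))
    noise = np.sqrt(0.51) * rng.standard_normal((n, u, v))
    genes = 0.7 * tf[:, :, None] + noise
    # Assumed layout: [TF_1, its v genes, TF_2, its v genes, ...]
    blocks = np.concatenate([tf[:, :, None], genes], axis=2)
    return blocks.reshape(n, u * (v + 1))

rng = np.random.default_rng(0)
X_sim = simulate_expression(n=20000, u=3, v=4, rng=rng)
corr_tf_gene = np.corrcoef(X_sim[:, 0], X_sim[:, 1])[0, 1]   # TF_1 vs its first gene
corr_tf_tf = np.corrcoef(X_sim[:, 0], X_sim[:, 5])[0, 1]     # TF_1 vs TF_2
```

A large `n` is used here only to check the correlation structure; the study itself uses n = 60 observations per dataset, where sample correlations vary considerably around these values.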

To evaluate the effectiveness of netRL, this study compared its performance against several established L1-type regularization methods, namely lasso, adaptive lasso, elastic net, and standard random lasso, where netRL with the bootstrap interval is based on the 95% confidence interval. The hyperparameters of the L1-type regularization methods were selected by using the following Bayesian Information Criterion (BIC) [12],

$$\mathrm{BIC}_l = n \log\left( \frac{\| x_l - X \hat{\beta}_l \|_2^2}{n} \right) + \widehat{df}_l \log n, \tag{13}$$

where $x_l$ is the expression levels of the $l$th gene considered as a target gene and $\widehat{df}_l$ is the degrees of freedom of the estimated model of the gene network of the $l$th gene. The optimal values of the tuning parameters are determined via a grid search that selects the combination minimizing the BIC.

This study estimated gene networks based on B = 200 bootstrap replications. Fig 2 shows the mean square error (MSE) in estimating expression levels of target genes based on the test dataset. As illustrated in Fig 2, the proposed netRL demonstrated superior performance in gene network estimation, with netRL combined with the hypergeometric test achieving the most effective results. However, incorporating the bootstrap confidence interval into netRL does not yield satisfactory results. In scenario 2, the elastic net demonstrated notably weak results. This study also evaluated the edge selection results, i.e., regulator gene selection in the model of the target gene. Tables 1 and 2 present the true positive rate (TPR), true negative rate (TNR), and overall accuracy for regulatory gene selection, with the most effective results highlighted in bold, for situations 1 and 2, respectively. Consistent with the MSE results, the integration of netRL with the hypergeometric test and median resulted in effective crucial gene selection. The developed strategy combining the hypergeometric test and median yielded similar outcomes in Situation 1; however, the hypergeometric test–based method demonstrated clear advantages in Situation 2. Although netRL with bootstrap confidence intervals achieved favorable performance in terms of true positive rate, it failed to effectively filter out noise regulator genes with zero coefficients ($\beta_j = 0$), especially in Situation 2. The results indicate that netRL has strong potential as an effective approach for biologically meaningful gene network inference.
This study also considered a more conservative strategy based on an inner bootstrap (nested resampling) scheme, in which the Laplacian matrix is constructed within each inner bootstrap replication using randomly selected samples and variables, and subsequently applied in the outer regression step. Although this strategy was also evaluated, its results are not described in the main text because the method did not perform well in practice. For completeness, the corresponding results are provided in the Supplementary Materials.
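The selection metrics reported in Tables 1 and 2 can be computed from a true and an estimated coefficient vector as in the sketch below, where a regulator counts as selected when its estimated coefficient is nonzero. The function name and the toy vectors are hypothetical.

```python
import numpy as np

def selection_metrics(beta_true, beta_hat):
    """True positive rate, true negative rate, and accuracy of regulator selection."""
    truth = np.asarray(beta_true) != 0      # regulators that truly affect the target gene
    picked = np.asarray(beta_hat) != 0      # regulators with a nonzero estimated edge weight
    tp = int(np.sum(truth & picked))
    tn = int(np.sum(~truth & ~picked))
    fp = int(np.sum(~truth & picked))
    fn = int(np.sum(truth & ~picked))
    tpr = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) > 0 else float("nan")
    accuracy = (tp + tn) / truth.size
    return {"TPR": tpr, "TNR": tnr, "accuracy": accuracy}

# Toy example: one hit, one miss, one false selection, one correct exclusion
metrics = selection_metrics([1.0, 0.0, 2.0, 0.0], [0.5, 0.1, 0.0, 0.0])
```

A method that keeps many noise regulators can still reach a high TPR, which is why TNR and accuracy are reported alongside it.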

Table 1. Simulation results of Situation 1: Regulator selection accuracy.

https://doi.org/10.1371/journal.pone.0344198.t001

Table 2. Simulation results of Situation 2: Regulator selection accuracy.

https://doi.org/10.1371/journal.pone.0344198.t002

Fig 2. Mean square error for estimating expression levels of the target gene in test dataset.

netRL.PM: netRL with permutation test; netRL.CI: netRL with bootstrap confidence interval, where B.netRL and C.netRL indicate netRL based on the Laplacian matrix constructed from regression coefficients and correlation coefficients, respectively. RL: ordinary random lasso, LA: lasso, adLA: adaptive lasso, ELA: elastic net.

https://doi.org/10.1371/journal.pone.0344198.g002

This study next evaluated the computational efficiency of the proposed netRL in comparison with the existing methods. Table 3 shows the running times for gene network estimation, i.e., estimation of the model in (11), where the running time of the bootstrap-based strategies was evaluated based on 200 bootstrap replications. As shown in Table 3, the bootstrap-based strategies (e.g., random lasso and the proposed netRL) incur a higher computational cost than ordinary L1-type regularization methods (i.e., lasso, adaptive lasso, and elastic net). However, the increased computational cost remains acceptable given the improved stability and performance of the proposed approach.

Table 3. Running time (in seconds) of the gene network estimation by using various methods. The running time of the bootstrap-based strategies were evaluated based on 200 bootstrap replications.

https://doi.org/10.1371/journal.pone.0344198.t003

Differential gene network analysis of COVID-19 severe stages

This study applied the proposed strategy to differential gene network analysis of severe stages of COVID-19. Since the proposed netRL with a Laplacian matrix constructed from regression coefficients outperformed the correlation-matrix–based approach, we employed the regression-coefficient–based netRL in this analysis. The analysis was based on whole-blood RNA-seq profiles of 1,102 genotyped individuals obtained from the Japan COVID-19 Task Force [13]. Disease severity was categorized into four stages: critical (Level 4, patients in intensive care unit or requiring intubation and ventilation), severe (Level 3, others requiring oxygen support), mild (Level 2, other symptomatic patients), and asymptomatic (Level 1, without COVID-19–related symptoms) [13]. The RNA-seq data consist of 71 asymptomatic, 241 mild, 404 severe, and 303 critical samples.

The current study aimed to identify differentially regulated molecular interplays between 303 critical and 71 asymptomatic cases. The imbalance in sample size between the critical and asymptomatic groups must be carefully considered in the analysis. Therefore, this study applied the proposed netRL with the hypergeometric test to investigate differential gene networks between critical and asymptomatic COVID-19 cases. The current study focused on 1,404 genes associated with viral infectious diseases that are annotated in the “Infectious disease: viral” pathway of the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (https://www.genome.jp/kegg/pathway.htm). Accordingly, all genes within this pathway, not only transcription factors (TFs), were included as candidate regulators, allowing the model to account for a broader spectrum of regulatory influences relevant to viral infection-related processes, including signaling and modulatory roles carried by non-TF genes. Table 4 shows the “Infectious disease: viral” pathways and the number of genes involved in each pathway.

Table 4. Infectious disease: Viral-pathways in the KEGG database.

https://doi.org/10.1371/journal.pone.0344198.t004

The current study matched the 1,404 genes to the RNA-seq data from the Japan COVID-19 Task Force, which yielded 694 matched genes. Two gene networks were then estimated for these 694 genes, one from the critical samples and one from the asymptomatic samples. The estimated networks contained 40,303 edges for critical samples and 84,922 edges for asymptomatic samples.

Fig 3 shows the infectious disease gene regulatory networks of asymptomatic and critical COVID-19 samples. To effectively visualize the complex and large-scale gene networks, we display only the top 0.1% of edges ranked by absolute weight.

Fig 3. Gene regulatory networks of asymptomatic and critical COVID-19 samples.

Arrows indicate regulatory effects from □ to △, edge thickness reflects the corresponding weight, and edge color denotes the direction of regulation (green: positive; red: negative).

https://doi.org/10.1371/journal.pone.0344198.g003
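The edge filtering used for the visualization in Fig 3 (keeping only the top 0.1% of edges by absolute weight) can be illustrated with a minimal sketch. The function name, the edge-tuple layout, the toy edge list, and the choice to round the cutoff up are all assumptions made for illustration; the paper does not state how the cutoff was rounded or how edges were stored.

```python
import math

def top_edges_by_abs_weight(edges, fraction=0.001):
    """Return the top `fraction` of edges ranked by absolute weight.

    `edges` is a list of (regulator, target, weight) tuples
    (a hypothetical layout chosen for this sketch).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not edges:
        return []
    # Rounding up keeps at least one edge; this rounding rule is an assumption.
    k = math.ceil(len(edges) * fraction)
    # Rank by magnitude so strong negative regulations are kept as well.
    ranked = sorted(edges, key=lambda e: abs(e[2]), reverse=True)
    return ranked[:k]

# Toy edge list with made-up gene names and alternating-sign weights.
toy_edges = [("g%d" % i, "g%d" % (i + 1), (-1) ** i * i / 1000)
             for i in range(1, 2001)]
shown = top_edges_by_abs_weight(toy_edges, fraction=0.001)
for regulator, target, weight in shown:
    print(regulator, "->", target, weight)
```

Ranking on the absolute value rather than the signed weight matters here, because the figure shows both positive and negative regulation.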

As shown in Fig 3, the network derived from asymptomatic samples shows a higher level of molecular interaction (i.e., a larger number of genes and edges) than the network derived from critical samples. The genes NFKBIA, B2M, CXCL8, and FOS emerged as hub genes in both networks. Furthermore, interactions between ribosomal proteins (i.e., RPS and RPL genes) were identified as a characteristic specific to asymptomatic samples.

  • B2M
  • Conca et al. [14] suggested that elevated β2-m levels may serve as an early biomarker of disease severity and could provide predictive value for the clinical trajectory and outcome of COVID-19. As noted by Zou et al. [15], abnormal expression of B2M may be an important factor underlying tracheal dysfunction induced by coronavirus infection. Song et al. [16] proposed B2M as a candidate biomarker for viral myocarditis, which may provide insights into the mechanisms underlying COVID-19–related myocarditis.
  • CXCL8
  • Pius-Sadowska et al. [17] proposed that CXCL8 and CMV seropositivity may serve as novel prognostic indicators for severe progression in patients with COVID-19. In severe COVID-19, CXCL8 is persistently elevated and has been proposed as a potential prognostic biomarker. Cross-viral comparisons involving SARS-CoV, MERS-CoV, and SARS-CoV-2 highlight its role as a central mediator of pulmonary pathogenesis and emphasize its relevance to COVID-19 progression [18]. Elevated levels of CXCL8 were observed in patients with severe COVID-19, suggesting a possible role in the underlying mechanisms of disease severity [19].
  • FOS
  • In the early recovery phase of COVID-19, CD4+ T cells exhibited pronounced upregulation of inflammatory genes, including FOS, JUN, KLF6, and S100A8, indicating their potential involvement in modulating the immune response [20]. Li et al. [21] proposed that puerarin-mediated targeting of FOS could serve as a potential therapeutic approach for clinical management of SARS-CoV-2 infection. The study by Qin et al. [22] highlighted critical molecular targets of puerarin in the context of COVID-19 treatment, encompassing novel anti–COVID-19 targets, including FOS, PTGS, PRKCB, PRKCA, and NOS3.
  • NFKBIA
  • Amini-Farsani et al. [23] demonstrated that infection with SARS-CoV-2 leads to elevated expression of essential NF-κB signaling components, including NFKBIA, NFKB1, RELA, and NFKB2. In severe COVID-19, the NFKBIA rs696 GG genotype was associated with ICU admission. Given the central involvement of NF-κB pathway dysregulation in disease severity, this variant could represent a marker for wider COVID-19 clinical outcomes [24]. Severe COVID-19 was associated with increased neutrophil counts and higher expression of NFKBIA and TNFAIP3, whereas TNFAIP3, PPP1R15A, NFKBIA, and IFIT2 showed bimodal expression across various immune cell populations relative to that in uninfected individuals or those with mild symptoms [25].

Fig 4 shows the expression levels of the identified asymptomatic-specific markers (i.e., ribosomal proteins) and common markers (i.e., NFKBIA, B2M, CXCL8, and FOS) across the COVID-19 severity stages (Lv1: asymptomatic, Lv2: mild, Lv3: severe, Lv4: critical).

Fig 4. Expression levels of COVID-19 asymptomatic-specific and common markers across severity stages (Lv1: asymptomatic, Lv2: mild, Lv3: severe, Lv4: critical).

https://doi.org/10.1371/journal.pone.0344198.g004

The identified markers characterize differential expression signatures corresponding to COVID-19 stages. Expression of COVID-19 common markers increased with disease progression from asymptomatic to critical stages, whereas asymptomatic-specific markers, such as ribosomal proteins, were highly expressed in asymptomatic samples and declined with disease severity. The findings demonstrate that severe COVID-19 is marked by elevated levels of the identified markers (i.e., NFKBIA, B2M, CXCL8, and FOS), in contrast to non-severe cases, which are distinguished by enhanced expression of asymptomatic-specific markers. Such molecular features may offer important clues to the mechanistic basis of disease severity.

To elucidate the biological pathways and functional annotations underlying the gene networks in the COVID-19 asymptomatic and critical groups, this study performed pathway analysis using the Database for Annotation, Visualization, and Integrated Discovery (DAVID) [26]. The genes constituting the gene networks of asymptomatic and critical samples in Fig 3 were subjected to Gene Ontology (GO) term analysis, using the 1,404 viral infectious disease–related genes as the background set. Fig 5 shows the three most significant GO terms (i.e., those with the smallest q-values) for each network. Statistical significance was evaluated using Benjamini–Hochberg false discovery rate (FDR)–adjusted q-values, which correct for multiple testing across all GO terms.

Fig 5. Significant Gene Ontology terms of gene networks of COVID-19 asymptomatic and critical samples.

https://doi.org/10.1371/journal.pone.0344198.g005
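The enrichment logic behind Fig 5 can be sketched in two steps: an over-representation p-value for each GO term against a fixed background set, followed by Benjamini–Hochberg adjustment across terms. The sketch below uses a plain hypergeometric upper tail; DAVID reports a more conservative modified Fisher exact (EASE) score, so the values will not match DAVID output exactly. The function names and all term counts are hypothetical; only the background size of 1,404 comes from the text.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Hypothetical counts: (genes with term in network list, genes with term in background).
background_size = 1404   # background set size stated in the text
list_size = 60           # made-up size of a network gene list
term_counts = {"term_A": (12, 90), "term_B": (5, 80), "term_C": (20, 150)}

pvals = [hypergeom_upper_tail(k, K, list_size, background_size)
         for k, K in term_counts.values()]
qvals = benjamini_hochberg(pvals)
for name, p, q in zip(term_counts, pvals, qvals):
    print(name, p, q)
```

Restricting the background to the genes that were actually eligible for the network, as the study does, keeps the test from rewarding terms that are merely common among viral infection genes in general.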

Pathway analysis revealed that “Positive regulation”-related GO terms (i.e., “Positive regulation of macromolecule biosynthetic process,” “Positive regulation of cellular biosynthetic process,” and “Positive regulation of biosynthetic process”) were specific to COVID-19 critical samples. In contrast, “Endoplasmic”-related GO terms (i.e., “Endoplasmic reticulum” and “Endoplasmic reticulum lumen”) and the “Receptor complex” GO term characterized the asymptomatic samples. These results suggest that examining the suppression of critical sample–specific GO terms (i.e., “Positive regulation”-related GO terms) may provide important insights into the underlying mechanisms of severe COVID-19.

Discussion

This study developed network-constrained Random Lasso (netRL), a novel computational framework for gene regulatory network inference that explicitly incorporates network biological knowledge into statistical modeling. The motivation stemmed from a critical limitation of differential gene network analysis under unequal sample sizes across phenotypes, which often leads to biased estimation and spurious phenotype-specific molecular interactions. By combining bootstrap resampling with random selection of regulators, netRL addressed this limitation of existing methods and achieved effective gene network inference. Furthermore, because the developed strategy incorporates network biological knowledge into network estimation, netRL enhances the biological interpretability and reliability of the inferred networks.
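As a rough illustration of how bootstrap resampling and random regulator selection can be combined under unequal group sizes, the sketch below draws the same number of samples from each phenotype in every replicate and pairs each replicate with a random subset of candidate regulators. This is a generic sketch and not the netRL procedure itself: the equal-draw rule, the uniform regulator sampling (netRL is described as centrality-guided), the function name, and the parameter values are all assumptions. Only the group sizes (303 and 71) and the number of genes (694) come from the text.

```python
import random

def balanced_bootstrap(group_sizes, regulators, n_regulators, n_boot, seed=0):
    """Generate bootstrap replicates with equal sample counts per group.

    Each replicate holds, per group, indices drawn with replacement, plus a
    random subset of candidate regulators (uniform here, by assumption).
    """
    rng = random.Random(seed)
    # Assumed rule: every group contributes as many draws as the smallest group.
    draw = min(group_sizes.values())
    replicates = []
    for _ in range(n_boot):
        samples = {
            group: [rng.randrange(size) for _ in range(draw)]
            for group, size in group_sizes.items()
        }
        candidates = rng.sample(regulators, n_regulators)
        replicates.append((samples, candidates))
    return replicates

group_sizes = {"critical": 303, "asymptomatic": 71}
regulators = ["gene%d" % i for i in range(694)]   # placeholder gene names
replicates = balanced_bootstrap(group_sizes, regulators,
                                n_regulators=50, n_boot=5)
samples, candidates = replicates[0]
print({group: len(idx) for group, idx in samples.items()}, len(candidates))
```

In a full analysis, a penalized regression would be fitted on each replicate and the per-replicate selections aggregated; that step is omitted here.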

Monte Carlo simulation experiments demonstrated the superiority of netRL compared with existing methods. Notably, netRL combined with the hypergeometric test yielded the most effective results across diverse simulation scenarios. While the bootstrap confidence interval–based version of netRL achieved high sensitivity, it failed to efficiently filter out false positives, particularly in settings with large numbers of potential regulators. These findings highlight that the choice of edge selection criterion plays a decisive role in determining the practical utility of network inference approaches. This study applied netRL to uncover severity-specific molecular interplays in COVID-19. The proposed strategy successfully identified distinct gene regulatory networks for asymptomatic and critical COVID-19 samples, despite the significant sample size imbalance. The hub genes identified in both the asymptomatic and critical samples (e.g., NFKBIA, B2M, CXCL8, and FOS) are consistent with those reported in the existing literature on COVID-19 severity, and their expression levels were found to increase with disease progression. In contrast, asymptomatic-specific markers, such as ribosomal proteins, were highly expressed in non-severe cases, and their expression decreased with increasing severity. Moreover, pathway analysis uncovered distinct Gene Ontology terms between the asymptomatic and critical networks, further suggesting that network-based molecular signatures can provide mechanistic insights into disease progression.

In the gene network analysis of the severe stages of COVID-19, this study considered all genes within the “Infectious disease: viral” pathways, not only TFs, as potential regulators. While this broader inclusion allows the model to capture a wider range of regulatory influences that may contribute to viral-infection–related processes, narrowing the analysis to transcription factors could enhance the biological interpretability of the inferred network. Therefore, incorporating TF-focused regulatory modeling represents an important direction for future work.

While the proposed method effectively addresses unequal sample sizes, the reliance on bootstrap resampling may increase computational burden for very large-scale genomic datasets. Future research should thus focus on developing efficient algorithms for large-scale data and on integrating multi-omics network resources to further refine inference accuracy.

Although the proposed permutation-based procedure assesses the significance of individual edges, a global multiple-testing correction across the entire network was not applied. Extending the framework to incorporate formal edge-level error control, for example through FDR control methods, is left as an important direction for future work. Furthermore, the increased computational complexity of netRL resulting from the bootstrap-based strategy may limit its applicability in extremely large-scale settings. Addressing this issue through algorithmic optimization or scalable approximations will be an important direction for future research.

Conclusion

This study proposed network-constrained Random Lasso (netRL), a unified framework for gene network inference that explicitly addresses sample size imbalance while enhancing biological interpretability. By integrating balanced bootstrap resampling with network-informed regularization and centrality-guided subsampling, netRL provides an effective framework for high-dimensional gene network inference. Simulation studies and real-world RNA-seq analysis demonstrated that netRL effectively performed gene network inference even under severe sample imbalance. The results indicate that incorporating network structure and gene-level importance into the random lasso framework yields more robust and biologically meaningful networks than existing approaches. Overall, netRL offers a flexible and extensible platform for differential gene network analysis and can be readily applied to a wide range of complex disease studies where unequal sample sizes and high dimensionality are inherent challenges.

Supporting information

S1 File. Supplementary file of netRL.xlsx.

This file provides the supplementary results of Monte Carlo Simulation.

https://doi.org/10.1371/journal.pone.0344198.s001

(XLSX)

Acknowledgments

This research used computational resources of the Super Computer System, Human Genome Center, Institute of Medical Science, University of Tokyo.

References

  1. Rauschenberger A, Glaab E, van de Wiel MA. Predictive and interpretable models via the stacked elastic net. Bioinformatics. 2021;37(14):2012–6. pmid:32437519
  2. Wang S, Nan B, Rosset S, Zhu J. RANDOM LASSO. Ann Appl Stat. 2011;5(1):468–85. pmid:22997542
  3. Freeman LC. A Set of Measures of Centrality Based on Betweenness. Sociometry. 1977;40(1):35.
  4. Xiang N, Wang Q, You M. Estimation and update of betweenness centrality with progressive algorithm and shortest paths approximation. Sci Rep. 2023;13(1):17110. pmid:37816806
  5. White DR, Borgatti SP. Betweenness centrality measures for disconnected graphs. Sociolog Method. 1994;24:305–34.
  6. Li C, Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24(9):1175–82. pmid:18310618
  7. Sun H, Lin W, Feng R, Li H. Network-regularized high-dimensional cox regression for analysis of genomic data. Stat Sin. 2014;24(3):1433–59. pmid:26316678
  8. Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67.
  9. Chun H, Zhang X, Zhao H. Gene regulation network inference with joint sparse Gaussian graphical models. J Comput Graph Stat. 2015;24(4):954–74. pmid:26858518
  10. Meinshausen N, Bühlmann P. Stability selection. J R Stat Soc Ser B. 2010;72(4):417–73.
  11. Abram SV, Helwig NE, Moodie CA, DeYoung CG, MacDonald AW 3rd, Waller NG. Bootstrap Enhanced Penalized Regression for Variable Selection with Neuroimaging Data. Front Neurosci. 2016;10:344. pmid:27516732
  12. Wang S, Zhang H, Chai H, Liang Y. A novel Log penalty in a path seeking scheme for biomarker selection. Technol Health Care. 2019;27(S1):85–93. pmid:31045529
  13. Wang QS, Edahiro R, Namkoong H, Hasegawa T, Shirai Y, Sonehara K, et al. The whole blood transcriptional regulation landscape in 465 COVID-19 infected samples from Japan COVID-19 Task Force. Nat Commun. 2022;13(1):4830. pmid:35995775
  14. Conca W, Alabdely M, Albaiz F, Foster MW, Alamri M, Alkaff M, et al. Serum β2-microglobulin levels in Coronavirus disease 2019 (Covid-19): Another prognosticator of disease severity?. PLoS One. 2021;16(3):e0247758. pmid:33647017
  15. Zou M, Su X, Wang L, Yi X, Qiu Y, Yin X, et al. The Molecular Mechanism of Multiple Organ Dysfunction and Targeted Intervention of COVID-19 Based on Time-Order Transcriptomic Analysis. Front Immunol. 2021;12:729776. pmid:34504502
  16. Song Y, Wang X, Tong D, Huang X, Jin X, Zhang C, et al. Identification of Potential Biomarkers and Immune Cell Signatures in COVID-19 Myocarditis Through Bioinformatic Analysis. Cardiol Res Pract. 2025;2025:2349610. pmid:40230577
  17. Pius-Sadowska E, Niedźwiedź A, Kulig P, Baumert B, Sobuś A, Rogińska D, et al. CXCL8, CCL2, and CMV Seropositivity as New Prognostic Factors for a Severe COVID-19 Course. Int J Mol Sci. 2022;23(19):11338. pmid:36232655
  18. Khalil BA, Elemam NM, Maghazachi AA. Chemokines and chemokine receptors during COVID-19 infection. Comput Struct Biotechnol J. 2021;19:976–88. pmid:33558827
  19. Park JH, Lee HK. Re-analysis of Single Cell Transcriptome Reveals That the NR3C1-CXCL8-Neutrophil Axis Determines the Severity of COVID-19. Front Immunol. 2020;11:2145. pmid:32983174
  20. Wen W, Su W, Tang H, Le W, Zhang X, Zheng Y, et al. Immune cell profiling of COVID-19 patients in the recovery stage by single-cell sequencing. Cell Discov. 2020;6:31. pmid:32377375
  21. Li H, Huang F, Liao H, Li Z, Feng K, Huang T, et al. Identification of COVID-19-Specific Immune Markers Using a Machine Learning Method. Front Mol Biosci. 2022;9:952626. pmid:35928229
  22. Qin X, Huang C, Wu K, Li Y, Liang X, Su M, et al. Anti-coronavirus disease 2019 (COVID-19) targets and mechanisms of puerarin. J Cell Mol Med. 2021;25(2):677–85. pmid:33241658
  23. Amini-Farsani Z, Yadollahi-Farsani M, Arab S, Forouzanfar F, Yadollahi M, Asgharzade S. Prediction and analysis of microRNAs involved in COVID-19 inflammatory processes associated with the NF-kB and JAK/STAT signaling pathways. Int Immunopharmacol. 2021;100:108071. pmid:34482267
  24. Camblor DG, Miranda D, Albaiceta GM, Amado-Rodríguez L, Cuesta-Llavona E, Vázquez-Coto D, et al. Genetic variants in the NF-κB signaling pathway (NFKB1, NFKBIA, NFKBIZ) and risk of critical outcome among COVID-19 patients. Hum Immunol. 2022;83(8–9):613–7. pmid:35777990
  25. Li Y, Duche A, Sayer MR, Roosan D, Khalafalla FG, Ostrom RS, et al. SARS-CoV-2 early infection signature identified potential key infection mechanisms and drug targets. BMC Genomics. 2021;22(1):125. pmid:33602138
  26. Dennis G, Sherman BT, Hosack DA, Yang J, Gao W. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003;4(5):3.