Integrating multi-omics and machine learning to decipher the molecular pathways of bisphenol a-associated lactylation-related genes driving bladder cancer

Hao Wang; Hongquan Liu; Fengze Sun; Jitao Wu

doi:10.1371/journal.pone.0347134

Abstract

In this study, we systematically investigated bladder cancer–related gene signatures using a toxicogenomics-informed framework, with particular attention to genes associated with lactylation-related pathways. Multi-omics data from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) were integrated, and Weighted Gene Co-expression Network Analysis (WGCNA), a toxicology database, and lactylation-related gene sets were combined for intersection screening. Machine learning algorithms, including LASSO, SVM, and random forest, were then applied to identify key genes. Four prioritized BPA–lactylation-associated candidate genes—ENO1, WBP11, GTF2F1, and SPR—were ultimately identified and showed consistent associations with metabolic, immune, and transcription-related features. Multi-level validation, including immune infiltration analysis, single-cell transcriptome localization, proteomic validation, and molecular docking and kinetic simulation, supported the structural plausibility of BPA–protein interactions at the molecular level. This study proposes a toxicogenomics-informed, hypothesis-generating framework that prioritizes candidate genes and pathways potentially linking BPA-related signatures with lactylation-associated processes in bladder cancer.

Citation: Wang H, Liu H, Sun F, Wu J (2026) Integrating multi-omics and machine learning to decipher the molecular pathways of bisphenol a-associated lactylation-related genes driving bladder cancer. PLoS One 21(5): e0347134. https://doi.org/10.1371/journal.pone.0347134

Editor: Rajesh Kumar Pathak, Chung-Ang University, KOREA, REPUBLIC OF

Received: December 18, 2025; Accepted: March 28, 2026; Published: May 5, 2026

Copyright: © 2026 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: “Yes - all data are fully available without restriction; All public datasets used in this study are available via official access: GEO Datasets GSE13507: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE13507; Search term: bladder cancer GSE13507 GSE130001: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE130001; Search term: bladder cancer single-cell GSE130001 GSE145281: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145281; Search term: bladder cancer immunotherapy single-cell GSE145281 External validation cohorts: GSE154261, GSE69795, GSE31684, GSE19423, GSE39281 TCGA Dataset TCGA-BLCA: https://portal.gdc.cancer.gov/projects/TCGA-BLCA; Search term: TCGA bladder cancer BLCA”.

Funding: This work was supported by the National Natural Science Foundation of China (No. 82370690), Natural Science Foundation of Shandong Province (No. ZR2023MH241), basic research project of Yantai Science and Technology Innovation development plan (No.2023JCYJ069) and Shandong Health Science Innovation Team Building Project. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Bladder cancer is one of the most common malignant tumors of the urinary system worldwide and imposes a substantial burden on patients and healthcare systems. Its pathogenesis is influenced by environmental, genetic, and epigenetic factors [1]. In recent years, environmental toxicants have been increasingly recognized as important exogenous contributors to bladder carcinogenesis [2–4]. Bisphenol A (BPA), a common endocrine disruptor, is widely used in the production of synthetic polymers and thermal paper and can accumulate in the human body [5]. BPA has been shown to exert pro-tumorigenic effects in various cancers through the activation of estrogen receptors, promotion of oxidative stress, and metabolic disruption [6,7]. Although the effects of BPA have been well studied in breast cancer, its mechanism of action in bladder cancer remains to be explored [8], and evidence on whether and how BPA may be involved in bladder cancer pathogenesis at a systems or mechanistic level is still limited.

Epigenetic modifications play an important role in the occurrence and development of bladder cancer. Recent studies have shown that multiple epigenetic mechanisms, including DNA methylation, histone modifications, and non-coding RNAs, are involved in bladder cancer progression. For example, changes in DNA methylation can lead to aberrant gene expression, thereby affecting tumor growth and metastasis [9]. In addition, epigenetic alterations have been linked to chemoresistance in bladder cancer and may offer potential targets for personalized therapy through the regulation of key genes [10]. Lactylation, a recently discovered post-translational modification, also plays an important role in the tumor microenvironment. Lactate is a major product of glycolysis, and lactylation is a protein modification induced by lactate [11]. Studies have shown that lactylation can contribute to tumor development through downstream transcriptional regulation and is a potential target for tumor therapy [12]. In bladder cancer, lactylation may promote tumor progression and metastasis by affecting metabolic reprogramming and epigenetic regulation of tumor cells [13]. Although previous studies have suggested a functional role for lactylation in bladder tumors, its specific role remains unclear, and systematic investigations into its potential links with environmental toxicants are still lacking.

Therefore, the aim of this study was to systematically identify and prioritize potential BPA-associated, lactylation-related candidate genes involved in bladder cancer using integrative bioinformatics and multi-level validation. Through a systematic multi-omics integration strategy, we constructed an analysis framework based on multi-platform datasets, including GEO and TCGA. WGCNA, a toxicology database, and a lactylation-related gene set were combined for intersection screening, and machine learning algorithms, including LASSO, SVM, and random forest, were used to identify core targets. These candidates were further evaluated through immune infiltration analysis, single-cell sequencing, proteomic validation, and molecular docking. Four key BPA-associated lactylation-related genes (ENO1, WBP11, GTF2F1, and SPR) were identified, suggesting that BPA-related signatures may be linked to the lactylation network and to metabolic, immune, and transcriptional features of bladder cancer. This study provides a new epigenetic perspective on the mechanism of BPA-associated bladder cancer, as well as a theoretical basis and candidate targets for the search for biomarkers and targeted interventions in environmental exposure-associated tumors.

Methods

Data sources and preprocessing

Transcriptomic and single-cell sequencing data from several publicly available databases were used in this study. The bladder cancer bulk transcriptome expression data were obtained from the GSE13507 dataset [14,15] in the GEO database, which contains a total of 191 samples, including primary bladder cancer tissues, recurrent tumors, paracancerous tissues, and normal bladder mucosal tissues. To unify the analysis, we combined 58 of the paracancerous tissues with 10 normal bladder mucosa samples into the normal group, and the remaining samples were assigned to the disease group. The data platform was the Illumina Human-6 v2.0 expression BeadChip, and the raw expression matrix was log2-transformed and normalized. To eliminate non-biological differences arising from technical sources, the data were corrected for batch effects using the ComBat method in the “sva” package in R 4.3.2, and the distribution of the samples before and after correction was visualized by principal component analysis (PCA) plots, which showed that the data quality met the requirements of the analysis. Toxicity target genes related to BPA were obtained from the Comparative Toxicogenomics Database [16] (CTD, https://ctdbase.org) using the search keyword “bisphenol A” and restricting the species to “Homo sapiens”; targets with experimental or literature support were retained for subsequent intersection analysis. The lactylation-related genes were collected from previously published literature, covering identified enzymes, substrate proteins, and regulatory factors related to lysine lactylation, to establish the lactylation gene set for analysis. These annotations reflect curated toxicogenomics knowledge and do not represent patient-level BPA exposure measurements.
In addition, to further analyze the expression characteristics of key genes in different cell types in the tumor microenvironment, two single-cell RNA sequencing datasets were introduced: GSE130001 [17,18] and GSE145281 [19]. The GSE130001 dataset was derived from bladder cancer tissues enriched for tumor and stromal cells by CD45-negative sorting and is suitable for the analysis of structural cell populations; GSE145281 contains peripheral blood mononuclear cells (PBMC) from bladder cancer patients undergoing immunotherapy and can be used to assess the expression profile of key genes in immune cells. Both datasets were downloaded from the GEO database and combined with standardized annotation information provided by the TISCH2 platform for subsequent single-cell level expression analysis [20]. In this study, “lactylation-related genes” refer to genes previously reported to participate in lactylation-associated metabolic or regulatory pathways, rather than proteins directly confirmed to carry lysine lactylation modifications in the analyzed samples.

WGCNA analysis and intersecting gene screening

Weighted Gene Co-expression Network Analysis (WGCNA) is a scale-free network construction method based on gene expression profiling data; it identifies modules of co-expressed genes through clustering and relates them to clinical phenotypes to screen potential key regulatory factors [21]. To identify functional modules closely related to bladder cancer and screen potential key genes, we constructed a weighted gene co-expression network using the “WGCNA” package (v1.72-1) in R. The input data were the gene expression matrix of GSE13507. The samples were first analyzed by hierarchical clustering to identify and exclude outlier samples and ensure stable network construction. Subsequently, the soft threshold (β value) was chosen according to the scale-free topology criterion: the “pickSoftThreshold” function was used to calculate the R² and the mean connectivity under different powers, and a soft threshold with R² > 0.85 at a clear inflection point of the curve was selected to construct the weighted adjacency matrix, which was then converted into the Topological Overlap Matrix (TOM). Based on the TOM dissimilarity, modules were identified using the dynamic tree cut method with a minimum module size of 50 genes, and modules with highly similar module eigengenes were merged at a merging threshold of 0.25. After network construction, the Pearson correlation coefficients between the modules and the phenotype (bladder cancer) were calculated, and a module-trait correlation heatmap was generated.
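
The adjacency and TOM steps described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the function names are hypothetical, an unsigned network is assumed, and the soft-threshold power is left as a parameter because the selected β is not reported in this section. The dissimilarity used for module detection would then be 1 − TOM.

```python
def soft_threshold_adjacency(corr, beta):
    """Unsigned adjacency a_ij = |cor_ij| ** beta, with a zero diagonal."""
    n = len(corr)
    return [[0.0 if i == j else abs(corr[i][j]) ** beta for j in range(n)]
            for i in range(n)]


def topological_overlap(adj):
    """Unsigned topological overlap for a symmetric adjacency matrix.

    TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij),
    where u runs over nodes other than i and j, and k_i is the connectivity.
    """
    n = len(adj)
    connectivity = [sum(row) for row in adj]  # diagonal is zero
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            denom = min(connectivity[i], connectivity[j]) + 1.0 - adj[i][j]
            tom[i][j] = tom[j][i] = (shared + adj[i][j]) / denom
    return tom


# Illustrative 3-gene correlation matrix (made-up values, not study data).
example_corr = [[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]]
example_tom = topological_overlap(soft_threshold_adjacency(example_corr, beta=6))
```

Genes 0 and 1 in the toy matrix are strongly correlated, so their overlap stays high after soft thresholding, while weak correlations are pushed toward zero.
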
All genes in the significantly correlated modules were extracted as the candidate gene set, which was then intersected with the set of BPA-related target genes retrieved from the Comparative Toxicogenomics Database and with the lactylation-related gene set collated from the literature. The intersection was computed in R 4.3.2 using the “VennDiagram” package, and the resulting genes were used for subsequent functional annotation and core gene identification [22]. To evaluate whether the observed overlap among the BPA target genes, lactylation-related genes, and WGCNA module genes exceeded random expectation, statistical calibration analyses were performed. Specifically, random gene sets matched for size to each of the three gene lists were repeatedly sampled from the background gene universe, and the distribution of intersection sizes was estimated under null conditions. This approach was used to assess the potential influence of gene-set size effects and module selection variability on the observed overlap.
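
The size-matched resampling described above can be sketched as follows. This is a hedged Python illustration, not the authors' implementation: the function name, the number of permutations, the fixed seed, and the add-one empirical P-value are all assumptions introduced for the example.

```python
import random


def overlap_permutation_test(universe, gene_sets, n_perm=1000, seed=0):
    """Compare an observed multi-set overlap with size-matched random sets.

    universe  : iterable of all background gene identifiers
    gene_sets : list of gene lists (e.g. module genes, BPA targets,
                lactylation-related genes)
    Returns (observed overlap size, null overlap sizes, empirical P-value).
    """
    rng = random.Random(seed)
    background = sorted(set(universe))
    sets = [set(s) for s in gene_sets]
    observed = len(set.intersection(*sets))
    sizes = [len(s) for s in sets]

    null_sizes = []
    for _ in range(n_perm):
        # Draw one random set per original list, matched for size.
        drawn = [set(rng.sample(background, k)) for k in sizes]
        null_sizes.append(len(set.intersection(*drawn)))

    # Add-one correction keeps the empirical P-value strictly above zero.
    exceed = sum(1 for x in null_sizes if x >= observed)
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, null_sizes, p_value
```

Sampling every list from the same background universe is the key design choice: it holds gene-set sizes fixed so that a large overlap cannot be explained by list length alone.
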

Immune infiltration analysis

To comprehensively assess the infiltration characteristics of immune cells in bladder cancer tissues, we applied three classical immune infiltration algorithms to the GSE13507 expression matrix from the GEO database, namely CIBERSORT, single-sample Gene Set Enrichment Analysis (ssGSEA), and MCPcounter, to quantify the immune cell composition of the tumor microenvironment [23–25]. All analyses were performed in R 4.3.2. CIBERSORT analysis deconvoluted the sample expression data using the “CIBERSORTx” algorithm together with the LM22 immune cell signature matrix to calculate the relative proportions of the 22 immune cell types in each sample; the number of permutations was set to 1000, and the results were filtered by P-value < 0.05. ssGSEA analysis was implemented using the “GSVA” package (v1.48.3), which scores each sample based on gene sets related to immune cell function, reflecting the degree of immune cell enrichment across samples [26]. MCPcounter analysis was performed using the MCPcounter module in the “immunedeconv” package to quantify more than 10 types of immune and stromal cells, including T cells, NK cells, monocytes, and endothelial cells [27]. For visualization, heatmaps were created using the “ComplexHeatmap” package (v2.18.0) to show the overall distribution of immune cell abundance between the normal and tumor groups as calculated by the different methods [28]. To further explore co-variation among immune cell subpopulations in bladder cancer tissues, we calculated correlation matrices between immune cells for each of the three algorithms (CIBERSORT, ssGSEA, and MCPcounter). Correlation analysis was performed using the “corrplot” and “ggcorrplot” packages to draw intragroup correlation bubble plots, which used a uniform color scale and were labeled with statistical significance (P < 0.05) [29].

Expression characterization and functional network analysis of intersecting genes

To further explore the expression characteristics of BPA-associated lactylation genes in bladder cancer tissues and their potential biological functions, this study first extracted the expression information of the intersecting genes in samples of the tumor group and normal group based on the GSE13507 expression matrix. After the expression matrix was log2-transformed and Z-score normalized, the expression heatmap was plotted using the “pheatmap” package (v1.0.12) in R4.3.2 to visualize the expression differences between the intersected genes in different groups. Subsequently, to analyze the functional interactions among the intersecting genes, we constructed a protein-protein interaction network with the help of the STRING database (Search Tool for the Retrieval of Interacting Genes/Proteins, https://string-db.org) [30]. STRING is a high-quality protein-protein interaction database that integrates experimental data, literature mining, computational prediction, and other sources, and supports multiple species, including human. In the analysis, the species was set as “Homo sapiens”, the minimum interaction confidence score threshold was 0.4 (medium confidence), and the results were restricted to direct interactions between intersecting genes. The constructed PPI network was used to identify potential synergistic modules and signaling relationships. We further performed Gene Set Enrichment Analysis (GSEA) using the clusterProfiler package (v4.6.2) in R to identify signaling pathways and biological functions associated with the intersected gene set. The GSEA was conducted based on a pre-ranked list of all genes derived from the differential expression analysis between tumor and normal tissues, using the signal-to-noise ratio as the ranking metric. Gene sets from the Molecular Signatures Database (MSigDB) Hallmark and KEGG collections were tested. 
Significance was assessed through 1,000 permutations, and enrichment results with a false discovery rate (FDR) q-value < 0.25 were considered statistically significant. The outcomes were visualized using a dot plot to display enrichment strength (-log10(P value)), gene counts, and the significance of the top-enriched pathways. Finally, to further characterize potential interaction networks involving BPA-related targets and bladder cancer–associated genes, we constructed a “BPA-target-tumor” network using Cytoscape 3.7.2 [31].
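
As a minimal sketch of the ranking step used for the pre-ranked GSEA, the signal-to-noise metric can be written as the difference in group means divided by the sum of group standard deviations. The Python below is illustrative only: the function names and dictionary layout are hypothetical, and it omits the minimum-standard-deviation adjustment that some GSEA implementations apply.

```python
from statistics import mean, stdev


def signal_to_noise(tumor_values, normal_values):
    """(mean_tumor - mean_normal) / (sd_tumor + sd_normal) for one gene."""
    spread = stdev(tumor_values) + stdev(normal_values)
    if spread == 0:
        raise ValueError("both groups have zero variance")
    return (mean(tumor_values) - mean(normal_values)) / spread


def rank_genes(tumor_expr, normal_expr):
    """Return (gene, score) pairs sorted from most tumor-up to most tumor-down.

    tumor_expr / normal_expr: dict mapping gene -> list of expression values.
    """
    scores = {gene: signal_to_noise(tumor_expr[gene], normal_expr[gene])
              for gene in tumor_expr}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The sorted list produced by `rank_genes` plays the role of the pre-ranked gene list; the enrichment statistic and permutation testing are left to the GSEA software itself.
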

Machine learning algorithms for screening key genes

To identify key genes with potential diagnostic value for bladder cancer from the intersecting genes, this study integrated three classical machine learning methods: LASSO (Least Absolute Shrinkage and Selection Operator) regression [32], Support Vector Machine (SVM) [33], and Random Forest (RF) [34]. All analyses were performed in R (v4.3.2), and the input data were the expression values of the intersecting genes in the tumor and normal groups of the GSE13507 dataset. 1) LASSO regression is a penalized regression method for high-dimensional data that performs variable selection and coefficient shrinkage at the same time. The analysis was implemented with the “glmnet” package (v4.1-8) in binomial logistic regression mode (family = “binomial”) with standardize = TRUE, and 10-fold cross-validation (nfolds = 10) was used to determine the optimal penalty parameter λ. Genes with non-zero regression coefficients at lambda.min were extracted as candidate features. 2) SVM-based recursive feature elimination combines a support vector machine classifier with recursive feature elimination to identify the optimal feature subset iteratively. We used the “e1071” package (v1.7-13) to construct a linear-kernel support vector machine (kernel = “linear”, cost = 1), combined with the “caret” package (v1.0-94) to perform recursive feature elimination with 5-fold cross-validation (cv = 5); the cross-validation error and accuracy were calculated for different numbers of features, and the feature subset with the smallest error and highest accuracy was selected as the optimal gene set. 3) Random forest is an ensemble learning method that builds a large number of decision trees and aggregates their predictions to improve model stability. The analysis used the “randomForest” package (v4.7-1.1) with the number of decision trees ntree = 500 and the number of variables tried at each split mtry = sqrt(p), where p is the total number of features. The top 30 genes were extracted according to the Mean Decrease Gini (MDG) ranking of each feature in node partitioning. Finally, the feature gene sets obtained by the three algorithms were intersected using the “VennDiagram” package, and the genes identified by all three methods were retained as the final candidate targets.

Diagnostic model construction and validation

To further evaluate the diagnostic potential of the four screened key genes, multiple machine learning classification models were constructed using the training dataset (TCGA+GTEx) and validated in an independent external cohort (GSE13507). The modeling and visualization were performed in R (v4.3.2), with major packages including “caret”, “caretEnsemble”, “glmnet”, “randomForest”, “pROC” and “ggplot2”. To minimize potential batch effects between TCGA and GTEx datasets, expression data were processed and normalized within each cohort prior to model training. Feature selection was performed exclusively within the training dataset to avoid information leakage, and the selected features were then applied to the external validation cohort.

We adopted a 5-fold cross-validation strategy to reduce the risk of overfitting. Various algorithms were applied, including generalized linear model boosting (glmBoost), elastic net regression (Enet) with different α parameters, ridge regression, lasso regression, support vector machine (SVM), stepwise regression (Stepglm), and their combinations. For each model, the area under the receiver operating characteristic curve (AUC) was calculated in both training and validation datasets to evaluate diagnostic performance. ROC curves were plotted to visualize model discrimination ability. Model performance should be interpreted as exploratory and predictive, rather than as definitive diagnostic evidence.
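
For readers less familiar with the metric, the AUC reported for each model equals the probability that a randomly chosen tumor sample receives a higher score than a randomly chosen normal sample. The Python sketch below computes it by pairwise comparison; it is an illustration with a hypothetical function name, not the pROC-based workflow used in the study.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores : predicted scores or probabilities
    labels : 1 for tumor (positive), 0 for normal (negative)
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the same statistic can be compared across the training and validation cohorts.
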

Validation of expression levels, evaluation of diagnostic efficacy, and prognostic analysis of key genes

To verify the expression patterns of the screened key genes and assess their diagnostic and prognostic value in bladder cancer, expression analysis, ROC analysis, and survival analysis were performed in two independent cohorts, GSE13507 and TCGA-BLCA. All statistical analyses were performed in R 4.3.2, and the main R packages used included “limma” (v3.52.2), “ggplot2” (v3.4.4), “pROC” (v1.18.4), “survival” (v3.3-1), and “survminer” (v0.4.9). In the expression level analysis, the “limma” package was used to compare tumor and normal samples in GSE13507 and TCGA; the expression differences of the four key genes (SPR, WBP11, ENO1, GTF2F1) were presented as violin plots drawn with “ggplot2”, and P-values were calculated by two-sided t-test. To evaluate the diagnostic efficacy of the key genes, ROC curves were constructed based on the GSE13507 and TCGA expression matrices, and the AUC (Area Under the ROC Curve) was calculated and visualized using the “pROC” package; a higher AUC indicates better accuracy in distinguishing tumor from normal samples. For prognostic analysis, we used the TCGA-BLCA RNA-seq data (STAR process, TPM format) and clinical data, combined with the follow-up data published in Cell (2018) by Liu et al.; after excluding normal tissue samples and samples with missing survival information, a total of 411 patients were retained. Expression data were first log2(value+1) transformed, and the association of the four genes with overall survival (OS) was then assessed. The Cox proportional hazards model was fitted using the “survival” package, and the optimal cut-off value for sample grouping was calculated using the surv_cutpoint function in the “survminer” package. Survival curves were estimated by the Kaplan-Meier method, and the log-rank test was used to compare survival between the two groups, with the significance level set at P < 0.05. Forest plots and survival curves were drawn with the “ggplot2” and “survminer” packages.
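
The Kaplan-Meier estimate mentioned above multiplies, at each event time, the fraction of at-risk patients who survive that time. A minimal Python sketch is given below; the function name is hypothetical, and the cut-point selection and log-rank test performed by the R packages are not reproduced.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if censored
    Returns a list of (event_time, survival_probability) pairs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == current_time:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((current_time, survival))
        # Both deaths and censored patients leave the risk set.
        at_risk -= leaving
    return curve
```

To compare high- and low-expression groups, the function would be applied separately to the patients on each side of the chosen cut-off.
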

Localization analysis of key genes

To further analyze the cell-type-specific expression of the four key genes in the bladder cancer microenvironment, we introduced two complementary single-cell RNA sequencing datasets from the GEO database, namely GSE130001 and GSE145281.

GSE130001 was derived from bladder cancer tissues of 4 patients, in which CD45⁻ cell sorting was performed to enrich structural cell populations. After sequencing, more than 10,000 single cells were obtained, covering epithelial cells, fibroblasts, endothelial cells, and myofibroblasts. This dataset was thus suitable for evaluating the expression characteristics of the target genes in non-immune stromal and structural cells.

GSE145281 included peripheral blood mononuclear cell (PBMC) samples from 7 bladder cancer patients undergoing immunotherapy. This dataset profiled over 20,000 immune cells, including CD4⁺ and CD8⁺ T cells, B cells, NK cells, and monocytes/macrophages, which enabled the assessment of gene expression in immune cell compartments.

The processed expression matrices and cell-type annotation files were obtained from the TISCH2 platform, and data analysis was performed using Seurat (v4.3.0) and MAESTRO in R (v4.3.2). After quality control, normalization, and clustering, t-SNE dimensionality reduction was performed, and FeaturePlot and VlnPlot functions were used to visualize expression trends across cell types. The AverageExpression function was further applied to calculate the mean expression level of each key gene across major cellular subpopulations.

Validation of protein expression levels

To further validate the expression of key genes at the protein level, we assessed the protein expression patterns of ENO1, GTF2F1, SPR, and WBP11 in bladder cancer tissues using the Human Protein Atlas (HPA, https://www.proteinatlas.org/) public database. The HPA database provides spatial expression information of proteins in human tissues through standardized immunohistochemical staining, with highly reproducible histological images acquired by automated microscopy and reviewed manually. We searched the HPA database by gene name for each target protein and analyzed the corresponding localization and expression intensity in bladder cancer specimens. All images were bladder cancer tissue sections from patients with urothelial carcinoma, and the most representative image was selected for each gene for presentation. Staining results were graded by antibody staining intensity (strong, weak, not detected), staining location (cytoplasmic, membranous, nuclear), and percentage of positive cells (<25%, 25–75%, >75%). In addition, to ensure objective interpretation, we combined the staining intensity and distribution range to make a comprehensive judgment of the expression level. To improve clarity, all images were locally enlarged at the original resolution (the figure shows the locally enlarged images) to reveal the cytoarchitectural features and localization differences of the stained regions, and the expression distribution of each gene was presented by pairing a high-magnification view with a low-magnification panoramic view. 
This part of the analysis helps to bridge the potential differences between the transcriptome level and the protein level, providing reliable experimental evidence for the functional study of key genes and the value of biomarkers.

Molecular Docking

In this study, the binding ability between target proteins and small-molecule compounds was evaluated by a systematic molecular docking approach. First, the 2D structural data of the target ligands were downloaded from the PubChem database (https://pubchem.ncbi.nlm.nih.gov) [35], converted into 3D structural models with ChemOffice software, and saved as mol2 format files. Next, high-resolution protein crystal structures corresponding to the target genes were selected as receptors from the RCSB PDB database (https://www.rcsb.org/) [36]; the crystal structures were pre-processed with PyMOL software [37], including the removal of water molecules, cofactors, irrelevant ions, and other heteroatoms, and saved in PDB format. Subsequently, docking simulations were performed using AutoDock Vina 1.5.6 software [38]. In AutoDock Tools, hydrogens were added to the protein and small-molecule structures and Gasteiger charges were assigned, the original water molecules were removed, and rotatable bonds were set for the ligands. For the docking parameters, the binding site region was defined and a grid box was constructed, with the center coordinates and grid size set to ensure that the potentially active pockets of the protein were covered. After docking, the conformation with the lowest binding energy (the most favorable Vina score) was selected as the optimal model. To further elucidate the specific binding mode between the small molecule and the target, three-dimensional structures were displayed with PyMOL and Discovery Studio Visualizer 2019, and two-dimensional interaction diagrams were drawn to visualize key interactions such as hydrogen bonding, hydrophobic interactions, and aromatic ring stacking. 
In general, a binding energy lower than −5.0 kcal/mol indicates that the ligand and the receptor have good binding activity; if it is lower than −7.0 kcal/mol, it indicates that the two have strong affinity and stable binding conformation. The lower the binding energy, the more stable and biologically active the ligand-protein binding is.
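
The pose selection and energy thresholds described above can be expressed as a short Python sketch. The function names, the category labels, and the example energies are hypothetical and serve only to make the −5.0 and −7.0 kcal/mol cut-offs concrete; no docking results from the study are reproduced.

```python
def best_pose(poses):
    """Pick the pose with the lowest (most negative) binding energy.

    poses: list of (pose_name, binding_energy_kcal_per_mol) pairs.
    """
    return min(poses, key=lambda pose: pose[1])


def classify_binding(energy_kcal_per_mol):
    """Apply the rule-of-thumb thresholds quoted in the text."""
    if energy_kcal_per_mol < -7.0:
        return "strong"           # below -7.0 kcal/mol
    if energy_kcal_per_mol < -5.0:
        return "good"             # between -7.0 and -5.0 kcal/mol
    return "below threshold"      # -5.0 kcal/mol or higher


# Made-up example values for illustration only.
example_poses = [("mode_1", -6.1), ("mode_2", -7.3), ("mode_3", -5.0)]
name, energy = best_pose(example_poses)
label = classify_binding(energy)
```

Because more negative energies are more favorable, the selection uses `min` rather than `max`.
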

Molecular dynamics simulation

To further evaluate the stability and interaction mechanism of bisphenol A (BPA) in complex with the key target proteins, 100 ns molecular dynamics simulations were performed using the GROMACS 2022 software package [39]. Protein topology parameters were derived from the CHARMM36 force field to model interatomic interactions in the biomolecular system, whereas the topology of the small-molecule ligand, BPA, was built with the General AMBER Force Field 2 (GAFF2), which accounts for its flexible structure and electron distribution. In the initial preparation stage, the protein–ligand complex was placed in a cubic simulation box, with a minimum distance of 1.2 nm between the box boundary and the outermost protein atoms to avoid interactions between periodic images. The system was solvated with the TIP3P water model to approximate the physiological environment, and Na⁺/Cl⁻ ions were then added to neutralize the overall charge. Energy minimization was performed with the steepest descent algorithm to ensure that the system adopted a reasonable low-energy conformation before the dynamics simulation. Next, the system was equilibrated for 100 ps each in the canonical (NVT) and isothermal–isobaric (NPT) ensembles, with temperature and pressure maintained at 310 K and 1 bar using a V-rescale thermostat and a Parrinello–Rahman barostat, respectively. Electrostatic interactions were calculated with the Particle Mesh Ewald (PME) method, and the cutoff radius for both van der Waals and Coulomb interactions was set to 1.0 nm to balance computational efficiency and accuracy. Neighbor searching used the Verlet list scheme. Finally, a 100 ns production simulation was run from the equilibrated system. The trajectories were used to calculate the root mean square deviation (RMSD), radius of gyration (Rg), solvent accessible surface area (SASA), number of hydrogen bonds (H-bonds), and root mean square fluctuation (RMSF), and the distribution of stable conformations of the complexes was analyzed with a free energy landscape, in order to characterize the dynamic behavior and structural features of BPA binding to the target proteins.
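As a rough illustration of two of the trajectory metrics named above, the following sketch computes an RMSD against a reference frame and a mass-weighted radius of gyration for a toy set of coordinates. It is not the GROMACS implementation used in the study: the function names, the three-atom coordinates, and the omission of a least-squares superposition step before the RMSD are simplifying assumptions made here for illustration only.

```python
import math

def rmsd(frame, reference):
    """Root mean square deviation between two coordinate sets (same units as input).

    Assumes the frame has already been superimposed on the reference.
    """
    if len(frame) != len(reference):
        raise ValueError("frame and reference must contain the same atoms")
    squared = sum(
        (a - b) ** 2
        for atom, ref_atom in zip(frame, reference)
        for a, b in zip(atom, ref_atom)
    )
    return math.sqrt(squared / len(frame))

def radius_of_gyration(coords, masses):
    """Mass-weighted radius of gyration about the centre of mass."""
    total_mass = sum(masses)
    centre = [
        sum(m * atom[k] for atom, m in zip(coords, masses)) / total_mass
        for k in range(3)
    ]
    weighted = sum(
        m * sum((atom[k] - centre[k]) ** 2 for k in range(3))
        for atom, m in zip(coords, masses)
    )
    return math.sqrt(weighted / total_mass)

# Hypothetical three-atom toy system (coordinates in nm, equal masses)
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
frame = [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0), (0.1, 1.0, 0.0)]
masses = [12.0, 12.0, 12.0]

print(round(rmsd(frame, reference), 3))                  # 0.1
print(round(radius_of_gyration(reference, masses), 3))   # 0.667
```

In a real analysis each trajectory frame would first be fitted to the reference structure, so that the RMSD reflects conformational change rather than overall translation or rotation.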

Immunofluorescence staining

Five paired bladder cancer and adjacent normal tissue samples were collected from patients who underwent surgical resection at the Department of Urology, Yantai Yuhuangding Hospital. Paraffin-embedded sections were deparaffinized, rehydrated, and subjected to antigen retrieval in citrate buffer (pH 6.0). After blocking with 5% bovine serum albumin (BSA) for 30 min at room temperature, the sections were incubated overnight at 4 °C with a primary antibody against ENO1 (1:200, Proteintech, Wuhan, China). The slides were then washed and incubated with an Alexa Fluor 488–conjugated secondary antibody (1:500, Proteintech, Wuhan, China) for 1 h at room temperature. Nuclei were counterstained with DAPI (Beyotime, Shanghai, China) for 5 min. Images were captured using a fluorescence microscope (Leica Microsystems, Germany), and fluorescence intensity was quantified by ImageJ software.

Expression validation

Three paired bladder cancer tissues and adjacent normal tissues from the same cohort were used for quantitative real-time PCR (qPCR) validation. Total RNA was extracted from tissue samples using a standard RNA extraction protocol, and complementary DNA (cDNA) was synthesized by reverse transcription according to the manufacturer’s instructions. qPCR was performed using gene-specific primers for WBP11, SPR, and GTF2F1, with GAPDH as the internal control. Relative mRNA expression levels were calculated using the 2^-ΔΔCt method. The primer sequences were as follows: WBP11 forward, CCTTCTCAGATACAAGCACCTCC; reverse, AGGTGGTCTCAGGAATGGAGGA. SPR forward, TGCAGGAAAGGCTGCTCGTGAT; reverse, TGCTGCATGTCTGTGTCCAGAG. GTF2F1 forward, GAGGTGGACTACATGTCAGACG; reverse, ACTCTCCTCACTACTGTCGCTC.
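The 2^-ΔΔCt calculation can be sketched as follows. This is a minimal worked example, not the authors' analysis script: the function name and the Ct values are hypothetical, and the reference gene is assumed to be GAPDH as stated above.

```python
def relative_expression(ct_target_tumor, ct_ref_tumor,
                        ct_target_normal, ct_ref_normal):
    """Fold change of a target gene in tumor vs. normal tissue (2^-ddCt)."""
    # Normalize the target gene to the reference gene within each sample
    delta_ct_tumor = ct_target_tumor - ct_ref_tumor
    delta_ct_normal = ct_target_normal - ct_ref_normal
    # Express the tumor sample relative to its paired normal sample
    delta_delta_ct = delta_ct_tumor - delta_ct_normal
    return 2.0 ** (-delta_delta_ct)

# Hypothetical Ct values: target gene and reference gene in one tissue pair
fold = relative_expression(
    ct_target_tumor=24.0, ct_ref_tumor=18.0,
    ct_target_normal=26.0, ct_ref_normal=18.0,
)
print(fold)  # 4.0, i.e. four-fold higher in tumor
```

A ΔΔCt of −2 corresponds to a four-fold increase because each PCR cycle is assumed to double the product; the method therefore presumes near-100% amplification efficiency for both primer pairs.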

Three paired bladder cancer and adjacent normal tissues from the same cohort were used for Western blot validation. Total protein was extracted using RIPA buffer containing protease inhibitors. Equal amounts of protein (30 µg per lane) were separated on 10% SDS–PAGE and transferred to PVDF membranes. After blocking with 5% non-fat milk for 1 h, membranes were incubated overnight at 4 °C with anti-ENO1 (1:1000, Proteintech) and anti-GAPDH (1:5000, Proteintech) antibodies. After washing, membranes were incubated with HRP-conjugated secondary antibodies (1:5000, Proteintech) for 1 h at room temperature. Protein bands were visualized by enhanced chemiluminescence (ECL, Thermo Fisher), and relative densitometric analysis was performed using ImageJ.

Results

WGCNA analysis and intersecting gene screening

The flowchart of this study is shown in Fig 1. To identify co-expression modules associated with bladder cancer and prioritize BPA- and lactylation-related candidate genes, we first performed WGCNA using the GSE13507 dataset. From the normalized expression matrix, the 7,735 genes in the top 30% by variability were selected for WGCNA. Sample clustering revealed no obvious outlier samples (Fig 2A), so all samples were included in the subsequent network construction. To construct a scale-free network, we evaluated the topological fit under different soft-thresholding powers and selected a power of 14, at which the network most closely approximated a scale-free distribution (R² = 0.88; Fig 2B). Gene–gene similarity was calculated from the Topological Overlap Matrix (TOM), and initial modules were identified with the Dynamic Tree Cut algorithm. Clustering identified a total of 17 co-expression modules (Fig 2C), excluding the gray module of unclassified genes. Pearson correlation analysis between each module and clinical phenotype (bladder cancer vs. normal) showed that several modules were significantly associated with disease status. Among them, the blue, yellow, and green modules showed the strongest correlations (|cor| > 0.3, P < 0.05; Fig 2D), suggesting their potential relevance to bladder cancer progression. A total of 3,651 genes were extracted from these significantly associated modules as a candidate set for subsequent functional analysis.
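To make the role of the soft-thresholding power concrete, the sketch below builds an adjacency of the form a_ij = |cor(i, j)|^β with β = 14 for three toy genes. It is an illustrative assumption that an unsigned network was used (the network type is not stated here), and the gene names, expression values, and helper functions are hypothetical; the study itself used the WGCNA package rather than this code.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def soft_threshold_adjacency(expression, power):
    """Unsigned adjacency a_ij = |cor(i, j)| ** power (diagonal set to 1)."""
    genes = list(expression)
    adjacency = {}
    for gene_i in genes:
        for gene_j in genes:
            if gene_i == gene_j:
                adjacency[(gene_i, gene_j)] = 1.0
            else:
                r = pearson(expression[gene_i], expression[gene_j])
                adjacency[(gene_i, gene_j)] = abs(r) ** power
    return adjacency

# Hypothetical expression values for three genes across four samples
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with geneA
    "geneC": [1.0, 3.0, 2.0, 4.0],   # correlation of 0.8 with geneA
}
adjacency = soft_threshold_adjacency(expression, power=14)
print(round(adjacency[("geneA", "geneB")], 3))  # 1.0
print(round(adjacency[("geneA", "geneC")], 3))  # 0.044
```

Raising correlations to a high power leaves strong links almost intact while shrinking moderate ones (0.8 becomes about 0.044 here), which is what pushes the connectivity distribution toward scale-free behavior.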

Fig 1. Flowchart of this study.

Bisphenol A is presented as a possible cause of bladder cancer by acting on lactylation-related genes.

https://doi.org/10.1371/journal.pone.0347134.g001

Fig 2. WGCNA co-expression module construction and intersection gene screening.

(A) Sample clustering tree diagram for detecting abnormal samples; (B) Scale-free topological fit and average connectivity analysis under different soft thresholds, selecting power = 14; (C) TOM-based module identification clustering tree, with the colors representing different modules; (D) Heatmap of the correlation between modules and clinical phenotypes (bladder cancer vs. normal); (E) Heatmap of correlations between BPA targets, lactylation genes and WGCNA Venn diagram of module genes, intersected to obtain 74 candidate key genes.

https://doi.org/10.1371/journal.pone.0347134.g002

To further refine the candidate set, BPA-related target genes were retrieved from the CTD database, yielding 26,516 genes. In addition, 328 lactylation-related genes were collected from the published literature, including lactylated substrate proteins, regulatory enzymes, and functionally related factors. Intersection analysis of these three gene sets identified 74 overlapping genes with co-expression, BPA-targeting, and lactylation-related features (Fig 2E–F), providing the basis for subsequent candidate gene prioritization and downstream analyses. The observed 74-gene intersection was significantly larger than expected under random sampling conditions.
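One common way to judge whether an overlap exceeds random expectation is a hypergeometric upper-tail test, sketched below for two gene sets. The test actually used for the 74-gene intersection is not specified here, so this is only an assumed approach; the function name and the toy numbers are hypothetical, and the choice of background universe (which strongly affects the result) is left to the analyst.

```python
from math import comb

def hypergeom_upper_tail(universe, set_a, set_b, overlap):
    """P(X >= overlap) when set_b genes are drawn at random from a universe
    in which set_a genes are marked."""
    max_overlap = min(set_a, set_b)
    total = comb(universe, set_b)
    tail = sum(
        comb(set_a, k) * comb(universe - set_a, set_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

# Hypothetical toy example: 20 background genes, two sets of 5, overlap of 3
universe, set_a, set_b, observed = 20, 5, 5, 3
expected = set_a * set_b / universe
p_value = hypergeom_upper_tail(universe, set_a, set_b, observed)
print(expected)            # 1.25 genes expected by chance
print(round(p_value, 4))   # 0.0726
```

For a three-way intersection, the same calculation can be applied stepwise (for example, testing the third set against the overlap of the first two), always with an explicitly stated background gene universe.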

Analysis of immune infiltration

Quantitative analysis of major immune cell types using the MCPcounter algorithm (Fig 3A) showed that estimated cytotoxic lymphocyte infiltration was higher in the primary tumor group than in normal and peritumoral tissues, while the recurrence group showed a modest reduction relative to primary tumors but remained elevated compared with normal tissue. CD8⁺ T cells exhibited a similar relative trend across groups.

Fig 3. Immune infiltration profiles across subgroups using multiple algorithms.

(A) MCPcounter analysis: Grouped bar plots depicting absolute infiltration scores of major immune cell types in normal bladder mucosa (NC), paracancerous tissues (Surrounding), primary bladder tumor, and recurrent bladder tumor groups. Error bars indicate standard deviation. (B) CIBERSORT analysis: Peak-shaped histograms showing relative proportions of 22 immune cell subsets across the four groups, highlighting immune subpopulation heterogeneity. (C) ssGSEA analysis: Grouped bar plots presenting enrichment scores of immune cell-associated gene sets, reflecting immune cell activity changes among groups.

https://doi.org/10.1371/journal.pone.0347134.g003

CIBERSORT-based immune deconvolution analysis (Fig 3B) indicated that estimated M2 macrophage fractions were higher in recurrent tumors than in primary tumors and peritumoral tissues, while regulatory T cells showed increased relative abundance in tumor samples compared with normal tissue. These patterns were consistent with an immunosuppressive tumor microenvironment.

ssGSEA analysis of immune-related gene sets (Fig 3C) showed differential enrichment of NK cell–related signatures across tissue groups, with higher enrichment in peritumoral samples and lower enrichment in primary tumor tissues. These pathway-level scores represent relative immune activity patterns rather than direct measurements of immune cell function.

Consistency analysis showed concordant enrichment trends for cytotoxic lymphocytes, CD8⁺ T cells, and M2 macrophages across the three algorithms. This cross-method concordance supports the robustness of the inferred immune infiltration patterns.

Correlation analyses within each algorithm (Fig 4A–C) further revealed coordinated patterns among immune cell subsets, including positive correlations among adaptive immune cell populations and inverse relationships between macrophage subtypes. In addition, the four prioritized candidate genes (SPR, WBP11, ENO1, and GTF2F1) showed consistent associations with multiple immune cell signatures across all three algorithms. These associations indicate potential links between candidate genes and immune infiltration patterns, but do not establish direct regulatory or causal relationships.

Fig 4. Correlation analysis between immune cell infiltrations.

Four key genes showed strong correlations with multiple immune cell types under three different immune infiltration algorithms. Line thickness indicates the strength of the correlation and line color indicates its significance. In the heatmap, the color of the block within each circle indicates whether the correlation coefficient is positive or negative, and the block size indicates its absolute value.

https://doi.org/10.1371/journal.pone.0347134.g004

Overall, the multi-algorithm immune infiltration analysis revealed reproducible and consistent immune-related trends in bladder cancer tissues. These findings should be interpreted as computationally inferred immune landscape patterns that generate hypotheses regarding tumor–immune interactions within a BPA-informed toxicogenomics framework, rather than as precise measurements or mechanistic evidence.

Expression characterization and functional network analysis of intersecting genes

To characterize the expression patterns of the intersecting genes in bladder cancer, their expression profiles were extracted from the GSE13507 dataset and visualized as a heatmap comparing normal and tumor samples. Most intersecting genes tended toward higher expression in tumor tissues, and several genes, including EAF1, UPF1, KHDRBS1, and SP3L, were markedly upregulated in the tumor group (Fig 5A). To explore potential interactions among the intersecting genes, a protein–protein interaction (PPI) network was constructed using the STRING database. After removal of unconnected nodes, the network contained 74 nodes and 216 edges, with an average node degree of 5.84 and an average clustering coefficient of 0.487. PPI enrichment was highly significant (P < 1.0e-16; Fig 5B), indicating that the observed interaction density was greater than expected by chance and suggesting coordinated involvement of these genes in bladder cancer–related pathways. To investigate the associated biological processes and signaling pathways, we further performed GSEA, which revealed significant enrichment of several pathways, including “Spliceosome” (NES = 2.12, FDR < 0.001), “RNA degradation” (NES = 1.98, FDR < 0.001), and “mRNA surveillance pathway” (NES = 1.85, FDR < 0.005), suggesting that the intersecting genes were mainly enriched in post-transcriptional regulatory pathways (Fig 5C). We also constructed a BPA-target-tumor network (Fig 6), which contained 76 nodes with a network density of 0.052, indicating sparse connections between nodes. The network heterogeneity was 2.959, suggesting that a few highly connected nodes may play a central regulatory role. The network was structurally complete, with no isolated nodes, self-loops, or multiple edges. The analysis time was 0.029 seconds.
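The reported topology values can be cross-checked with the standard formulas for an undirected simple graph, density = 2E / (N(N − 1)) and average degree = 2E / N. In the sketch below, the PPI figures (74 nodes, 216 edges) come from the text, whereas the edge count of 148 for the BPA-target-tumor network is not reported and is inferred here from the average number of neighbors (3.895) given in the Fig 6 legend; the function names are illustrative.

```python
def network_density(n_nodes, n_edges):
    """Density of an undirected simple graph: realized / possible edges."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))

def average_degree(n_nodes, n_edges):
    """Mean number of neighbors per node in an undirected graph."""
    return 2 * n_edges / n_nodes

# PPI network as reported: 74 nodes and 216 edges
print(round(average_degree(74, 216), 2))    # 5.84

# BPA-target-tumor network: 76 nodes; 148 edges is an inferred assumption
print(round(average_degree(76, 148), 3))    # 3.895
print(round(network_density(76, 148), 3))   # 0.052
```

Under that assumption the reported density, average neighbor count, and node count are mutually consistent, and the 5700 shortest paths in the Fig 6 legend equal 76 × 75 ordered node pairs.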

Fig 5. Expression characterization and functional network analysis of intersecting genes.

(A) Heatmap displaying the expression patterns of the 74 intersecting genes across samples in the GSE13507 dataset. Each row represents a gene, and each column represents a sample. The color scale from blue to red indicates low to high Z-score normalized expression levels. In the bar chart, the abscissa represents the fold difference, and a higher bar indicates a greater fold difference; (B) Protein–protein interaction (PPI) network constructed from the STRING database, with nodes indicating gene products and connecting lines indicating functional associations; (C) Enrichment score (ES) curves for pathways including “Metabolism of RNA”, “Processing of Capped Intron Containing Pre-mRNA”, “mRNA Splicing”, and “Aurora B Pathway” in GSEA. The lower panel shows the distribution of genes in these pathways within the phenotype-ranked gene list.

https://doi.org/10.1371/journal.pone.0347134.g005

Fig 6. The “BPA-target-tumor” network.

The network consists of 76 nodes with a density of 0.052 and a heterogeneity index of 2.959. No isolated nodes, self-loops, or multi-edge pairs were observed. The network is fully connected (1 connected component), with a diameter and radius of 2, indicating a compact structure. A high network centralization value (0.96) and characteristic path length (1.948) suggest the presence of key hub nodes. The clustering coefficient is 0.0, and the average number of neighbors per node is 3.895. A total of 5700 shortest paths were calculated, covering 100% of all node pairs.

https://doi.org/10.1371/journal.pone.0347134.g006

Machine learning screening of key genes

To identify core feature genes with potential diagnostic value for bladder cancer, three machine learning methods (LASSO, SVM-RFE, and random forest) were applied to the intersecting genes for feature selection. Ten-fold cross-validation of the LASSO model identified the minimum mean squared error at lambda.min = 0.01 (Fig 7A). At this value, 12 feature genes with non-zero regression coefficients were retained (Fig 7B). SVM-RFE achieved its smallest cross-validation error (0.133) and highest accuracy (0.867) with 10 features, corresponding to the selection of 10 optimal feature genes (Fig 7C-D). The random forest model ranked variable importance by the mean decrease in Gini index, and the top 30 genes were extracted (Fig 7E), including several key regulators such as TOP2B, WBP1, RPL14, UPF1, and AHNAK. Finally, integration of the three feature-selection results using a Venn diagram identified four overlapping genes (ENO1, WBP11, GTF2F1, and SPR) as the final candidate genes (Fig 7F). These genes were prioritized for further analysis as BPA- and lactylation-associated candidates in bladder cancer.
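The final consensus step amounts to a set intersection across the three selections, as in the sketch below. The full gene lists from LASSO, SVM-RFE, and random forest are not given in the text, so the lists here are shortened and padded with hypothetical placeholders (GENE_A, GENE_B, GENE_C); only the four overlapping genes and the named random forest genes are taken from the results above.

```python
def consensus_features(*selections):
    """Return the genes chosen by every method, sorted for stable output."""
    return sorted(set.intersection(*(set(s) for s in selections)))

# Shortened, partly hypothetical selections (placeholders are not real results)
lasso = ["ENO1", "WBP11", "GTF2F1", "SPR", "GENE_A", "GENE_B"]
svm_rfe = ["ENO1", "WBP11", "GTF2F1", "SPR", "GENE_C"]
random_forest = ["ENO1", "WBP11", "GTF2F1", "SPR", "TOP2B", "UPF1", "GENE_A"]

consensus = consensus_features(lasso, svm_rfe, random_forest)
print(consensus)  # ['ENO1', 'GTF2F1', 'SPR', 'WBP11']
```

Requiring agreement across all three methods is deliberately conservative: a gene picked by only two of them (GENE_A in this toy example) is dropped.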

Fig 7. Machine learning algorithm to screen key genes.

(A, B) LASSO regression cross-validation to determine the optimal lambda value and screen for non-zero coefficient genes; (C, D) SVM-RFE cross-validation error and accuracy for different numbers of features, used to determine the optimal gene set; (E) Variable importance in the random forest model based on the mean decrease in Gini index; (F) Venn diagram of the intersection of the three algorithms' screening results, identifying four common candidate key genes.

https://doi.org/10.1371/journal.pone.0347134.g007

Construction and validation of diagnostic models

We systematically compared the diagnostic performance of 28 candidate models in the training set (TCGA+GTEx) and validation set (GSE13507). As shown in Fig 8A, most models achieved excellent classification ability in the training cohort, with AUC values close to 1.0. When validated in the independent GSE13507 dataset, the performance decreased but remained robust, with several models retaining satisfactory discrimination. Among them, the glmBoost+Enet[α = 0.3] model demonstrated the best balance, with an AUC of 0.996 in TCGA+GTEx and 0.739 in GSE13507, while the external validation AUC of individual models generally ranged from 0.70 to 0.87. These findings indicate that although the models tend to overfit in the training set, they still possess cross-cohort generalization ability.

Fig 8. Diagnostic performance evaluation of multiple machine learning models in the training and validation sets.

(A) AUC values of 28 machine learning models constructed in the TCGA+GTEx training set (red) and validated in the GSE13507 dataset (blue). Models are sorted by validation AUC, showing that several algorithms maintain robust performance across datasets. (B) ROC curves of the glmBoost+Enet[α = 0.3] model in the TCGA+GTEx training set (blue, AUC = 0.996) and the GSE13507 validation set (red, AUC = 0.739).

https://doi.org/10.1371/journal.pone.0347134.g008

The representative ROC curves for the glmBoost+Enet[α = 0.3] model are shown in Fig 8B, with an AUC of 0.996 in TCGA+GTEx and 0.739 in GSE13507, supporting its diagnostic potential. The gap between training and validation performance is likely attributable to the high dimensionality of transcriptomic features relative to the sample size and the strong correlation structure among genes, which can lead to overly optimistic performance in the training cohort. To minimize information leakage, all feature selection procedures and model optimization steps were performed exclusively within the training set, while the validation cohort was kept completely independent and was not involved in any stage of feature selection or parameter tuning.
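For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen tumor sample receives a higher model score than a randomly chosen normal sample. The sketch below computes it directly from that definition; the function name, labels, and scores are hypothetical and unrelated to the study data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical model scores: 1 = tumor, 0 = normal
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 0.889
```

On this reading, a training AUC of 0.996 means almost every tumor/normal pair is ordered correctly, whereas a validation AUC of 0.739 means roughly three in four pairs are.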

Expression, diagnostic efficacy and prognostic analysis of key genes

In the GSE13507 cohort, violin plots showed that the expression levels of the four key genes, SPR, WBP11, ENO1, and GTF2F1, were significantly higher in bladder cancer tissues than in normal tissues (all P < 0.001, Fig 9A-D). In the TCGA-BLCA dataset, however, only WBP11 and ENO1 retained the upregulation trend observed in GSE13507 (Fig 9F,G), whereas SPR and GTF2F1 showed the opposite pattern, with higher expression in normal tissues (Fig 9E,H). These findings suggest that the expression patterns of some genes varied across datasets, potentially owing to differences in sample sources, platform technologies, or grouping criteria. ROC analysis showed that in GSE13507, ENO1 had the highest AUC (0.868), and the AUCs of WBP11, SPR, and GTF2F1 were 0.820, 0.788, and 0.755, respectively, all indicating good diagnostic efficacy (Fig 9I). In the TCGA validation set, ENO1 still showed strong discriminatory ability (AUC = 0.803), while the AUCs of SPR, WBP11, and GTF2F1 were 0.699, 0.572, and 0.537, respectively (Fig 9J), suggesting that ENO1 discriminated most consistently across datasets, whereas the other three genes performed less stably. Further survival analysis showed that only high ENO1 expression was significantly associated with poorer overall survival in the TCGA-BLCA cohort (HR = 1.367, 95% CI: 1.019–1.834, P = 0.037), whereas the other three genes did not reach statistical significance in Cox regression (Fig 9K). Kaplan–Meier analysis further showed that the ENO1 high-expression group had significantly poorer overall survival than the low-expression group (log-rank P = 0.010), whereas no significant survival differences were observed for SPR, WBP11, or GTF2F1 (Fig 9L–O). To further support the relevance of WBP11, SPR, and GTF2F1 in bladder cancer, additional survival analyses were conducted across multiple independent cohorts (GSE154261, GSE69795, GSE31684, GSE19423, GSE39281, IMvigor210, and GSE13507), showing that these genes also displayed prognostic associations in supplementary datasets (S1 Fig). These results suggest that ENO1 may play a critical role in the diagnosis and prognosis of bladder cancer.
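As a quick consistency check on the ENO1 Cox result, a Wald P-value can be recovered from the reported hazard ratio and its 95% confidence interval. The sketch assumes the interval was constructed as exp(β ± 1.96·SE) on the log-hazard scale, which is the usual convention but is not stated explicitly in the text; the function name is illustrative.

```python
import math

def wald_test_from_hr(hr, ci_low, ci_high, z_crit=1.959964):
    """Recover SE, z, and two-sided P from a hazard ratio and its 95% CI."""
    log_hr = math.log(hr)
    # Half-width of the CI on the log scale equals z_crit * SE
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = log_hr / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|))
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p

se, z, p = wald_test_from_hr(hr=1.367, ci_low=1.019, ci_high=1.834)
print(round(z, 2))   # about 2.09
print(round(p, 3))   # close to the reported 0.037
```

The recovered P-value agrees with the reported 0.037, so the hazard ratio, interval, and P-value for ENO1 are internally consistent under this assumption.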

Fig 9. Expression, diagnostic efficacy and prognostic analysis of four key genes in different cohorts.

(A-E) Violin plots showing differential expression of key genes between tumor and normal samples in TCGA and GEO cohorts. Statistical significance was assessed using Welch’s t-test, and P-values were adjusted for multiple testing using the Benjamini–Hochberg (BH) correction. Adjusted P-values are indicated in the figure (adjusted P < 0.05 considered significant); (F,G) ROC curves and AUC values of the four genes in the two datasets; (I) Univariate Cox regression analysis of the four genes in the TCGA cohort; (H,J-L) Kaplan–Meier survival curves showing that only the ENO1 high-expression group had significantly worse overall survival (P = 0.010).

https://doi.org/10.1371/journal.pone.0347134.g009