Abstract
Aqueous solubility, an essential physical property of compounds, has significant applications across various fields. However, verifying the solubility of compounds through experimental methods often requires substantial human and material resources. To address this issue, this study introduces the StackBoost model for predicting the solubility of organic compounds and systematically compares it with five well-known ensemble learning algorithms: Adaptive Boosting (AdaBoost), Gradient Boosted Regression Trees (GBRT), Light Gradient Boosting Machine (LGBM), Extreme Gradient Boosting (XGBoost), and Random Forest (RF). The prediction results indicate that the StackBoost model excels at predicting aqueous solubility, achieving a coefficient of determination (R²) of 0.90, a root mean square error (RMSE) of 0.29, and a mean absolute error (MAE) of 0.22, significantly outperforming the other comparative models. Furthermore, this study conducted high-throughput screening on large-scale datasets and successfully identified compounds with high potential for water solubility. Additionally, the model's generalization ability was verified through transfer learning. Although the performance of the StackBoost model decreases when applied to different datasets, it still shows considerable transferability, making it a more generalizable prediction model for aqueous solubility.
Citation: Pan B, Hou X, Zhang M, Yu J, Zhang C, Zhang Y, et al. (2025) A water solubility prediction algorithm based on the StackBoost model. PLoS One 20(8): e0330598. https://doi.org/10.1371/journal.pone.0330598
Editor: Mahmud Iwan Solihin, UCSI University Kuala Lumpur Campus: UCSI University, MALAYSIA
Received: April 21, 2025; Accepted: August 5, 2025; Published: August 29, 2025
Copyright: © 2025 Pan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data used in this study are publicly available on the AqSolDB platform. https://doi.org/10.1038/s41597-019-0151-1.
Funding: This work was supported in part by the National Natural Science Foundation of China under Grant 61602228, and in part by Liaoning Province Department of Education Project under Grant L2020018.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Water solubility is a crucial property of compounds, holding significant value in various fields such as geochemistry, climate prediction, biochemistry, drug design, agrochemical design, and protein-ligand binding [1]. For instance, in the case of pesticide compounds, poor water solubility can lead to precipitation from screening buffers, resulting in erroneous outcomes, false leads, and high risks of increased costs and formulation challenges during practical development. In recent years, researchers have begun to utilize machine learning to address the prediction of chemical solubility properties, aiming to develop compounds with rapid and efficient solubility [2–5].
With the rapid development of machine learning algorithms and their software implementations, ensemble learning has gained widespread attention due to its robust performance on multi-task cheminformatics datasets. Yang et al. employed the Support Vector Regression (SVR) method, combined with computed molecular descriptors, to establish prediction models for several physicochemical properties of alkylbenzene compounds; by identifying the optimal hyperplane in a high-dimensional feature space, they effectively captured the complex relationship between molecular structure and water solubility [6]. Fioressi et al. predicted the water solubility of 1,211 approved heterogeneous pesticide compounds using a widely applicable multiple linear regression (MLR) model, incorporating molecular descriptors such as topological, fingerprint, and flexibility descriptors; by exploring approximately 18,000 structural variables and using the Replacement Method to select the most representative descriptors, they achieved optimal predictive performance [7]. Kherouf et al. developed prediction models for the water solubility of 68 phenolic compounds using both MLR and Artificial Neural Network (ANN) methods, combined with computed three-dimensional structural molecular descriptors; after selecting descriptors through a Genetic Algorithm (GA), the ANN model demonstrated superior prediction accuracy compared to MLR [8]. Ensemble learning methods are thus increasingly becoming an effective approach to enhancing prediction accuracy.
Traditional machine learning models, such as LGBM and XGBoost, typically perform well on training data, but they are prone to overfitting when faced with complex or high-noise data. This means that the model overly relies on noise and local features in the training data, leading to a decrease in its predictive ability on new data. To address this issue, ensemble learning methods have been widely applied. Ensemble learning combines the predictions of multiple base learners, effectively reducing the bias and variance of individual models, thereby improving the model’s generalization ability and stability. Dietterich et al. used ensemble learning methods to combine the predictions of multiple base learners, establishing an effective multi-classifier system, which enhanced the stability and prediction accuracy of the model [9]. To further improve the model’s performance, Snoek et al. employed Bayesian optimization, combined with hyperparameter tuning for machine learning algorithms, to address the overfitting problem and significantly enhance the model’s predictive performance [10]. Rasmussen and Williams, on the other hand, used Gaussian process models combined with Bayesian inference methods, successfully solving the generalization problem of traditional models and improving the model’s robustness [11,12].
In recent years, ensemble learning has made significant progress in the field of cheminformatics, especially in molecular property prediction tasks. The AdaBoost algorithm proposed by Li and Zhang improves model performance in imbalanced data classification tasks by adjusting the sample weighting strategy, effectively enhancing its ability to represent complex nonlinear relationships [13]. Additionally, Zhang et al.’s improvement to the AdaBoost algorithm, by adjusting the weighting strategy of weak learners, significantly increased its accuracy in fraud detection tasks [14]. The Extreme Gradient Boosting (XGBoost) algorithm proposed by Chen and Guestrin introduced second-order gradient information, column sampling, and regularization strategies, significantly alleviating overfitting issues [15]. In recent years, researchers have made numerous optimizations to XGBoost to improve its performance on specific tasks. Huang et al. (2022) applied XGBoost to the prediction of the COVID-19 pandemic in the United States, and the results showed that it outperformed the traditional ARIMA model in prediction accuracy [16]. The LightGBM algorithm, proposed by Ke and Yang, improves the training speed and memory efficiency of large-scale high-dimensional data through gradient-based one-sided sampling (GOSS) and exclusive feature bundling (EFB) [17]. Recent research has further improved the performance of LightGBM. Wang et al. proposed an improved LightGBM hybrid ensemble model, incorporating the whale optimization algorithm (WOA) for hyperparameter optimization and using Jacobian regularization to reduce noise [18].
Although the aforementioned ensemble learning algorithms perform well in molecular property prediction, existing research still predominantly uses single algorithms or simple Bagging/Voting ensemble methods, lacking in-depth exploration of the internal complementarity of gradient boosting models. Furthermore, current research faces the following issues: first, the diversity and size of datasets are insufficient, which prevents them from fully representing a variety of experimental types, thus affecting the model’s generalization ability; second, traditional machine learning regression methods tend to overfit when faced with complex datasets, leading to poor predictive performance; third, traditional machine learning methods struggle with high-dimensional feature spaces, where redundant or irrelevant features can affect the model’s predictive accuracy. Therefore, this paper proposes a new ensemble learning framework, StackBoost, which combines the strengths of LGBM and XGBoost, aiming to improve the model’s prediction performance and stability.
Therefore, this study constructs a new ensemble model, StackBoost, on the solubility datasets of compounds A-I by stacking two base models, LGBM and XGBoost, whose outputs are used as new features input to a GBRT meta-learner. StackBoost not only fully leverages the strengths of each base learner but also effectively mitigates overfitting through the stacking strategy. To further optimize performance and reduce the risk of overfitting, Optuna is employed for hyperparameter tuning. Additionally, to validate the stability of the StackBoost model, Bayesian models are used to verify its prediction results, ensuring reliability and robustness across different datasets. Model performance is evaluated using three widely used regression metrics, R², RMSE, and MAE, with a systematic comparison against five mainstream ensemble learning algorithms: AdaBoost, GBRT, LGBM, XGBoost, and RF.
Dataset and preprocessing
Dataset description
The experimental water solubility data in this study were obtained from the AqSolDB database, an open-access and well-structured compound database [13]. This study collected nine water solubility datasets, encompassing 9,982 compounds with water solubility data. Using the topological and physicochemical two-dimensional descriptors from the datasets, all Chemical Abstracts Service (CAS) and Sybyl Line Notation (SLN) identifiers were converted into the Simplified Molecular Input Line Entry System (SMILES) and validated. The nine datasets were standardized and merged into a single dataset [14–16].
The data are summarized in Table 1. Dataset A is derived from the open-source chemical property database eChemPortal; after filtering by temperature range, 3,656 valid compounds were retained out of 6,110. Dataset B comes from the EPISuite Data website; through temperature filtering and SMILES validation, 3,542 valid instances were retained from 4,651 compounds. Dataset C originates from the study by Raevsky et al., containing solubility data for 1,325 solid compounds along with their SMILES representations. Dataset D also comes from the EPISuite Data website, retaining 163 liquid and crystalline compounds. Dataset E is taken from the AQUASOL and PHYSPROP databases, including 249 solubility values and SMILES data. Dataset F is sourced from the study by Wang et al.; after converting SLN to SMILES, 798 valid instances were obtained from 1,210 compounds. Dataset G includes solubility data measured at 25°C for 1,144 compounds, of which 139 valid compounds were ultimately retained. Dataset H is from Wang et al.'s cleaning of the datasets used by Jain and Yalkowsky, containing 322 liquid and 256 solid compounds, of which 64 valid compounds were retained. Dataset I contains 94 compounds, of which 46 valid compounds were retained [17–26].
In this study, the AqSolDB dataset, which includes multiple physicochemical features, was utilized for solubility prediction research [27–32]. The ID serves as the unique identifier for compounds, distinguishing data from different sources; Name refers to the name of the compound; InChI is the IUPAC International Chemical Identifier for the compound; InChIKey is the hashed form of InChI, simplifying the unique identification of chemical substances; SMILES is a linear notation used to represent the molecular structure of chemicals; the Solubility column represents the experimental water solubility values; SD (Standard Deviation) quantifies the dispersion of multiple measurement results; Occurrences indicates the frequency of the compound in the dataset; Group is a label representing the quality of the data. The physicochemical properties include Mol Wt (Molecular Weight), Mol LogP (Molecular LogP), and Mol MR (Molecular Refractivity). Molecular structure descriptors include Heavy Atom Count, Num H Acceptors (Number of Hydrogen Bond Acceptors), Num H Donors (Number of Hydrogen Bond Donors), and Num Heteroatoms (Number of Heteroatoms). Num Rotatable Bonds represents the number of rotatable bonds in the compound. Electronic and ring structure features include Num Valence Electrons, Num Aromatic Rings, and Num Saturated Rings. Topological and physicochemical features include TPSA (Topological Polar Surface Area), Labute ASA (Labute's Approximate Surface Area), Balaban J, and Bertz CT. These features contribute to the accurate prediction of compound solubility in machine-learning models [33–36]. Among them, Solubility is the target feature of the model, and the other 25 features serve as input features, as shown in Table 2.
Data preprocessing
To ensure consistency across different units, the solubility values of all compounds were converted to mol/L, and the InChI notation was used for validation. Non-numerical features were transformed with One-Hot Encoding. The numerical features were then normalized to avoid computational bias caused by differences in feature scales, highlight weaker features, and bring the data closer to a standard normal distribution, improving the accuracy and convergence speed of the algorithm. Finally, Singular Value Decomposition (SVD) was applied for dimensionality reduction to lower computational complexity and enhance the model's accuracy and stability.
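The preprocessing chain described above (one-hot encoding, standardization, SVD dimensionality reduction) can be sketched with scikit-learn. The column names and toy data below are illustrative stand-ins, not the actual AqSolDB schema:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
# Hypothetical mini-table: one categorical quality label plus numeric descriptors
df = pd.DataFrame({
    "Group": rng.choice(["G1", "G2", "G5"], size=8),
    "MolWt": rng.uniform(50, 500, size=8),
    "MolLogP": rng.normal(1.5, 2.0, size=8),
    "TPSA": rng.uniform(0, 150, size=8),
})

preprocess = Pipeline([
    ("encode_scale", ColumnTransformer([
        # One-hot encode the non-numerical feature
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Group"]),
        # Normalize the numerical descriptors
        ("scale", StandardScaler(), ["MolWt", "MolLogP", "TPSA"]),
    ])),
    # SVD-based dimensionality reduction on the encoded feature matrix
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
])

X_reduced = preprocess.fit_transform(df)
print(X_reduced.shape)  # 8 samples reduced to 2 components
```

In a real pipeline the same fitted `preprocess` object would be reused on the test split so the reduction learned from the training data carries over unchanged.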
To display the distribution of the target variable more intuitively, this paper visualizes Solubility using a scatter plot, as shown in Fig 1. The solubility values in the dataset span from -4 to 2, with most data points concentrated in the lower solubility intervals. In particular, data points exhibit a high density in the range from -4 to -2, indicating that solubility values within this range occur most frequently. As solubility increases, the data points gradually become sparser and more dispersed, suggesting that compounds with higher solubility are relatively few. The color gradient further shows that the lower solubility values (deep purple regions) correspond to lower Occurrences and TPSA values. This nonlinear distribution indicates that compounds with lower solubility dominate the dataset, whereas those with higher solubility are relatively scarce.
Application of the model
Machine learning regression model
Fig 2 illustrates the overall workflow of the StackBoost model. StackBoost uses XGBoost and LGBM as base models to make predictions on the data; these predictions are then used as inputs to train a GBRT meta-model that produces the final prediction. During training, StackBoost employs Bayesian optimization, implemented with Optuna, to fine-tune the hyperparameters of the GBRT meta-model, while incorporating cross-validation strategies to ensure stability and robustness across different datasets. By stacking multiple models, StackBoost effectively enhances prediction accuracy and generalization ability on complex data. The preprocessed data are used for training and prediction, with the dataset randomly split into training and testing sets at a ratio of 80% to 20%.
To improve the accuracy and generalization ability of water solubility prediction for materials, a multi-level, multi-dimensional StackBoost model was constructed. The overall workflow of this study is illustrated in Fig 3, which includes three parts: data analysis and processing, model prediction and analysis, and model application. For comparison, the StackBoost model was systematically compared with five mainstream ensemble learning algorithms: AdaBoost, GBRT, LGBM, XGBoost, and RF. A total of 8,352 data samples were used as the training set. The six trained ensemble learning models were employed to predict and systematically compare the water solubility of materials. Subsequently, the trained StackBoost model was applied to conduct high-throughput screening on the dataset.
Evaluation metrics
This paper primarily utilizes R², RMSE, and MAE as metrics to evaluate the model results. These three indicators quantitatively measure the discrepancies between the model's predictions and the actual values. The calculation formula for R² is as follows:

R² = 1 − SS_res / SS_tot

R² serves as a critical quantitative measure for assessing the fit of regression models, with values ranging from 0 to 1. A value closer to 1 indicates a stronger fit of the model, whereas a value closer to 0 indicates a poorer fit. SS_res, the sum of squared residuals, is the sum of the squared differences between predicted values and actual values; SS_tot, the total sum of squares, is the sum of the squared differences between actual values and their mean. The calculation formula for RMSE is as follows:

RMSE = √( (1/n) Σᵢ (yᵢ − ŷᵢ)² )
In this context, n represents the number of samples, yᵢ signifies the actual value of the i-th sample, and ŷᵢ signifies the predicted value of the i-th sample. RMSE measures the square root of the mean squared difference between predicted and actual values; lower RMSE values indicate better model accuracy. The calculation formula for MAE is as follows:

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
In this context, n represents the number of samples, while yᵢ and ŷᵢ denote the actual and predicted values for the i-th sample, respectively. MAE measures the average absolute deviation between predicted and actual values, indicating the mean degree to which the prediction model deviates from actual observations. A lower MAE indicates better predictive accuracy of the model.
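Assuming the standard definitions above, the three metrics can be computed directly with NumPy:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R^2, RMSE, and MAE from true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    mae = np.mean(np.abs(y_true - y_pred))
    return r2, rmse, mae

# Toy log-solubility values (illustrative only)
r2, rmse, mae = regression_metrics([-3.0, -1.0, 0.0, 2.0],
                                   [-2.8, -1.2, 0.1, 1.9])
# r2 ≈ 0.992, rmse ≈ 0.158, mae ≈ 0.15
```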
Results and discussion
To verify the impact of increasing the sample size on training and prediction results, 70% of the 9,936 data samples from the eight datasets (A-H) were first randomly selected as the training set, with the remaining 30% as the test set. Trained on these 6,955 samples, the StackBoost model achieved an R² of 0.857 on the test set. Next, 80% of the samples were used for training and 20% for testing; trained on 7,949 samples, the model reached an R² of 0.90 on the test set, an improvement of 0.043. Finally, with a 90/10 split (8,942 training samples), the test-set R² remained at 0.90. As the number of training samples increased, the improvement in model performance saturated; the method therefore achieves high predictive accuracy with a relatively small dataset.
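The split-size experiment above can be sketched as a loop over split ratios; synthetic data and a generic GradientBoostingRegressor stand in for the real dataset and the StackBoost model:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 9,936-sample descriptor matrix
X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)

scores = {}
for test_frac in (0.3, 0.2, 0.1):  # 70/30, 80/20, 90/10 splits as in the text
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_frac, random_state=0)
    model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
    scores[test_frac] = r2_score(y_te, model.predict(X_te))

for frac, r2 in scores.items():
    print(f"test fraction {frac}: R^2 = {r2:.3f}")
```

Plotting these scores against training-set size gives the learning curve whose flattening signals the saturation described above.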
Table 3 presents the performance metrics of AdaBoost, GBRT, LGBM, XGBoost, RF, and StackBoost models in predicting solubility.
The AdaBoost model has an R² of 0.67, an RMSE of 0.57, and an MAE of 0.44, indicating larger prediction errors and weaker performance on complex data. The GBRT model improves on this with an R² of 0.84, an RMSE of 0.40, and an MAE of 0.28. The LGBM model achieves an R² of 0.79, an RMSE of 0.46, and an MAE of 0.33, slightly behind GBRT but still strong on large datasets. The XGBoost model reaches an R² of 0.83, an RMSE of 0.40, and an MAE of 0.29, demonstrating strong predictive capability on complex data. The RF model records an R² of 0.80, an RMSE of 0.44, and an MAE of 0.30, indicating stable but slightly weaker prediction accuracy. The Lasso model, with an R² of 0.70, an RMSE of 0.52, and an MAE of 0.50, performs poorly because it cannot capture complex nonlinear relationships in the data. The StackBoost model outperforms all others, achieving the best metrics with an R² of 0.90, an RMSE of 0.29, and an MAE of 0.22, indicating superior predictive precision. These results suggest that by stacking different algorithms, StackBoost overcomes the limitations of single machine learning models on complex data. The training outcomes of these models are illustrated in Fig 4.
High-throughput screening and analysis of optimal samples
This section applies the feature-optimized StackBoost model to perform theoretical calculations and sample selection on the dataset of 9,936 compounds, computing a feasibility score for each sample, as shown in Table 4. The top 20 samples were selected and visualized using the RDKit toolkit; Fig 5 displays these 20 samples. The results indicate that samples with distinctive functional groups and stereochemistry significantly influence solubility. Functional groups such as fluorine, chlorine, bromine, and hydroxyl enhance solubility by strengthening the molecule's interactions with water. Samples with pronounced stereochemistry improve solubility through their spatial interactions with water molecules. This highlights that functional groups and molecular weight are key structural factors affecting solubility.
Ring structures significantly impact the stability, rigidity, and reactivity of molecules. For molecule A-1080, two saturated rings and three aliphatic rings enhance its stability and resistance to reactivity. Molecule A-1756, with two aromatic rings and three saturated rings, increases its electron stability and reactivity. The flexibility of a molecule is primarily related to the number of its rotatable bonds. Molecule A-606 has five rotatable bonds; its higher structural flexibility makes it adaptable to varied reaction conditions. Molecule A-560 has only one rotatable bond; its rigid structure provides higher stability in chemical reactions. The number of valence electrons is also a crucial factor in determining a molecule’s chemical reactivity. A-606, with fewer valence electrons, exhibits higher chemical activity, while A-560 shows lower reactivity due to its greater number of valence electrons. Additionally, molecular weight is a significant factor influencing the chemical properties of molecules. Variations in molecular weight affect the solubility and reactivity of samples in different solvents, with larger molecular weights typically offering greater stability and resistance to volatility.
Feature analysis
Pearson correlation coefficients were calculated among the 20 input features to analyze their interrelationships. As shown in Fig 6, the correlation coefficient between MolWt and MolLogP is 0.92, indicating that heavier molecules in this dataset tend to be more lipophilic. There is a positive correlation of 0.87 between NumAromaticRings and NumSaturatedRings, suggesting that ring structures jointly influence compound stability. The correlation coefficient between TPSA and LabuteASA is 0.92, indicating that the two surface-area descriptors capture closely related information.
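A Pearson correlation matrix over descriptor columns can be computed with pandas. The data below are synthetic, constructed to mimic the kind of strong MolWt-MolLogP correlation reported above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical descriptors with a built-in linear relationship between
# molecular weight and lipophilicity, plus an unrelated TPSA column
mol_wt = rng.uniform(100, 500, size=200)
df = pd.DataFrame({
    "MolWt": mol_wt,
    "MolLogP": 0.01 * mol_wt + rng.normal(0, 0.5, size=200),
    "TPSA": rng.uniform(0, 150, size=200),
})

corr = df.corr(method="pearson")  # full feature-feature correlation matrix
strong = corr.loc["MolWt", "MolLogP"]
print(f"r(MolWt, MolLogP) = {strong:.2f}")
```

Rendering `corr` as a heatmap (e.g. with matplotlib or seaborn) reproduces a figure of the kind shown in Fig 6.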
As illustrated in Fig 7, SHAP analysis shows that MolLogP is the most influential feature, with the highest SHAP value, indicating it is the most significant factor affecting compound solubility. The SHAP values of MolLogP suggest a positive correlation with solubility, related to the balance between molecular hydrophilicity and hydrophobicity. The SHAP values for ID, MolMR, BertzCT, TPSA, and MolWt range from 0 to 0.5, indicating that while they correlate positively with solubility, their impact is relatively weak. SHAP values for features such as NumRotatableBonds, NumValenceElectrons, HeavyAtomCount, NumAromaticRings, and Occurrences are close to zero, suggesting they have minimal impact on solubility. Overall, SHAP analysis reveals that MolLogP, ID, MolMR, BertzCT, TPSA, and MolWt significantly influence model predictions, whereas NumRotatableBonds, NumValenceElectrons, NumHDonors, NumHAcceptors, HeavyAtomCount, NumAromaticRings, and Occurrences have little effect on model output. Fig 8 presents a 3D scatter plot of MolLogP, MolMR, BertzCT, and TPSA against material solubility.
Analysis of model transferability
To evaluate the transferability of the StackBoost model, this study trained the model on the 9,936 compound samples from datasets A-H and evaluated it on the 46 compound samples of dataset I. When transferred to dataset I, the model achieved an R² of 0.84, an RMSE of 0.40, and an MAE of 0.29, indicating a decrease in accuracy after transfer but still demonstrating transferability. Fig 9 compares the true and predicted values; the figure shows that the distribution of dataset I is broader and more variable than that of datasets A-H. After standardizing the features of datasets A-H and I to the same normal distribution to eliminate differences in data distribution, the retrained StackBoost model showed improved performance. This indicates that differences in data distribution are a significant factor affecting transfer performance; once the data were standardized to a common distribution, the model demonstrated better transferability and improved robustness in cross-domain transfer.
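The standardization step described here can be sketched with scikit-learn's StandardScaler, mapping each domain's (hypothetical) feature matrix to zero mean and unit variance before retraining:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical source (datasets A-H) and target (dataset I) feature matrices;
# the target distribution is deliberately shifted and broader, mimicking the
# distribution gap observed between A-H and I
X_source = rng.normal(loc=0.0, scale=1.0, size=(500, 5))
X_target = rng.normal(loc=2.0, scale=3.0, size=(46, 5))

# Standardize each domain separately so both feature sets follow the same
# (standard normal) distribution before the model is retrained
X_source_std = StandardScaler().fit_transform(X_source)
X_target_std = StandardScaler().fit_transform(X_target)
```

After this step both matrices have per-feature mean 0 and variance 1, removing the location/scale mismatch that degraded the transferred model.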
Conclusion
The study used water solubility data for 9,982 compounds from the AqSolDB database. Data preprocessing comprised standardization, conversion to SMILES representation, and One-Hot encoding, followed by hyperparameter tuning with Optuna. Five ensemble models (AdaBoost, GBRT, LGBM, XGBoost, and RF) were employed to train on and predict the dataset. In addition, a StackBoost model was built, which not only accurately analyzed the relationships between features but also improved predictive accuracy. The StackBoost model achieved an R² of 0.90, an RMSE of 0.29, and an MAE of 0.22, demonstrating strong predictive ability.
To further validate the practicality of the StackBoost model, high-throughput screening of 9,936 samples was conducted, yielding a feasibility score for each sample and identifying compounds with high water solubility. The screening results confirm the model's effectiveness and reliability in recognizing highly water-soluble compounds. The innovation of StackBoost lies in overcoming the limitations of single models in handling complex relationships by stacking multiple machine learning models, while Bayesian hyperparameter tuning with Optuna reduced the risk of overfitting and strengthened the model's stability, predictive accuracy, and robustness. Although the model's R² decreased from 0.90 to 0.8362 on dataset I, it still demonstrated transferability.
Despite demonstrating good performance, there remains room for further optimization of the StackBoost model. Future research could explore more effective ways to handle high-dimensional feature data to improve predictive accuracy. Additionally, taking into account the diversity of datasets and variations in experimental design, future work should include a broader range of datasets and various types of compounds to validate the model’s generalization capabilities. Furthermore, this study has several limitations: first, the size of the dataset used may not be sufficient to comprehensively cover all possible compound types; thus, future efforts should consider more diversified sample datasets. Second, the computational cost of the StackBoost model is relatively high, especially when dealing with large amounts of compound data, and future optimizations for computational efficiency may be necessary. Based on these research findings, future work should focus on the following aspects: optimizing the computational efficiency of the StackBoost model to make it suitable for larger-scale compound screening; further enhancing the model’s stability and accuracy when handling high-dimensional data; and attempting to integrate other datasets and domain-specific knowledge to improve the model’s generalization ability and cross-domain applications.
References
- 1. Perryman AL, Inoyama D, Patel JS, Ekins S, Freundlich JS. Pruned machine learning models to predict aqueous solubility. ACS Omega. 2020;5(27):16562–7. pmid:32685821
- 2. Taherdangkoo R, Liu Q, Xing Y, Yang H, Cao V, Sauter M, et al. Predicting methane solubility in water and seawater by machine learning algorithms: application to methane transport modeling. J Contam Hydrol. 2021;242:103844. pmid:34111717
- 3. Altalbawy FMA, Al-Saray MJ, Vaghela K, Nazarova N, K N RP, Kumari B, et al. Machine-learning based prediction of hydrogen/methane mixture solubility in brine. Sci Rep. 2024;14(1):30227. pmid:39632964
- 4. Han X, Wang X, Zhou K. Develop machine learning-based regression predictive models for engineering protein solubility. Bioinformatics. 2019;35(22):4640–6. pmid:31038685
- 5. Sun B, Li X, Chen P, Lian Z, Wang C. Research progress on the interface performance of porous material modified carbon fiber and its reinforced composites. Journal of Liaoning Petrochemical University. 2025;45(01):33–40.
- 6. Li S, Han Q, Cai T, Yang Z. Research progress on molybdenum disulfide-doped hydrogen evolution electrocatalysts. Journal of Liaoning Petrochemical University. 2025;45(01):1–9.
- 7. Fioressi SE, Bacelo DE, Rojas C, Aranda JF, Duchowicz PR. Conformation-independent quantitative structure-property relationships study on water solubility of pesticides. Ecotoxicol Environ Saf. 2019;171:47–53. pmid:30594756
- 8. Kherouf S, Bouarra N, Bouakkadia A, Messadi D. Modeling of linear and nonlinear quantitative structure property relationships of the aqueous solubility of phenol derivatives. J Serb Chem Soc. 2019;84(6):575–90.
- 9. Ghazwani M, Yasmin Begum M, Naglah AM, Alkahtani HM, Almehizia AA. Development of advanced model for understanding the behavior of drug solubility in green solvents: Machine learning modeling for small-molecule API solubility prediction. Journal of Molecular Liquids. 2023;386:122446.
- 10. He Y, Yang A, Zou C, Fan T, Lan Q, He Y, et al. An interpretable surrogate model for H2S solubility forecasting in ionic liquids based on machine learning. Separation and Purification Technology. 2025;357:130061.
- 11. Smith J, Doe R. A survey of machine learning techniques for classification tasks. Journal of Artificial Intelligence Research. 2021;45(3):100–15.
- 12. Snoek J, Larochelle H, Adams RP. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems. 2012;25.
- 13. Li Q, Li M, Shi X, Wu B, Xiao Y. A key elements influence discovery scheme based on ternary association graph and representation learning. Knowledge-Based Systems. 2021;229:107359.
- 14. Niu T, Wang J, Lu H, Yang W, Du P. Developing a deep learning framework with two-stage feature selection for multivariate financial time series forecasting. Expert Systems with Applications. 2020;148:113237.
- 15. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. p. 785–94.
- 16. Haq FU, Bhui P, Chakravarthi K. Real time congestion management using Plug in Electric Vehicles (PEV’s): a game theoretic approach. IEEE Access. 2022;10:42029–43.
- 17. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: a highly efficient gradient boosting decision tree. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017. p. 3146–54.
- 18. Wang L, Chen X, Wei Y. Hybrid ensemble model for large-scale high-dimensional data using LightGBM and whale optimization algorithm. Computational Intelligence and Neuroscience. 2021.
- 19. Ghazwani M, Yasmin Begum M, Naglah AM, Alkahtani HM, Almehizia AA. Development of advanced model for understanding the behavior of drug solubility in green solvents: machine learning modeling for small-molecule API solubility prediction. Journal of Molecular Liquids. 2023;386:122446.
- 20. He Y, Yang A, Zou C, Fan T, Lan Q, He Y, et al. An interpretable surrogate model for H2S solubility forecasting in ionic liquids based on machine learning. Separation and Purification Technology. 2025;357:130061.
- 21. Nagulapati VM, Raza Ur Rehman HM, Haider J, Abdul Qyyum M, Choi GS, Lim H. Hybrid machine learning-based model for solubilities prediction of various gases in deep eutectic solvent for rigorous process design of hydrogen purification. Separation and Purification Technology. 2022;298:121651.
- 22. Ye Z, Ouyang D. Prediction of small-molecule compound solubility in organic solvents by machine learning algorithms. J Cheminform. 2021;13(1):98. pmid:34895323
- 23. Yang Y, Ju B, Lü G, Huang Y. Machine learning methods for predicting CO2 solubility in hydrocarbons. Petroleum Science. 2024;21(5):3340–9.
- 24. Besharati Z, Hashemi SH. Prediction of CO2 solubility in aqueous and organic solvent systems through machine learning techniques. Model Earth Syst Environ. 2024;11(1).
- 25. Bhattacherjee R, Botchway K, Pashin JC, Chakraborty G, Bikkina P. Developing statistical and machine learning models for predicting CO2 solubility in live crude oils. Fuel. 2024;368:131577.
- 26. Qin H, Wang K, Ma X, Li F, Liu Y, Ji X. Predicting the solubility of CO2 and N2 in ionic liquids based on COSMO-RS and machine learning. Front Chem. 2024;12:1480468. pmid:39544717
- 27. Liu B, Yu Y, Liu Z, Cui Z, Tian W. Prediction of CO2 solubility in aqueous amine solutions using machine learning method. Separation and Purification Technology. 2025;354:129306.
- 28. Yang A, Sun S, Su Y, Kong ZY, Ren J, Shen W. Insight to the prediction of CO2 solubility in ionic liquids based on the interpretable machine learning model. Chemical Engineering Science. 2024;297:120266.
- 29. Ma Y, Gao Z, Shi P, Chen M, Wu S, Yang C, et al. Machine learning-based solubility prediction and methodology evaluation of active pharmaceutical ingredients in industrial crystallization. Front Chem Sci Eng. 2021;16(4):523–35.
- 30. Wang C, Cheng Y, Ma Y, Ji Y, Huang D, Qian H. Prediction of enhanced drug solubility related to clathrate compositions and operating conditions: Machine learning study. Int J Pharm. 2023;646:123458. pmid:37776964
- 31. Francoeur PG, Koes DR. SolTranNet: a machine learning tool for fast aqueous solubility prediction. J Chem Inf Model. 2021;61(6):2530–6. pmid:34038123
- 32. Mu Y, Dai T, Fan J, Cheng Y. Prediction of acetylene solubility by a mechanism-data hybrid-driven machine learning model constructed based on COSMO-RS theory. Journal of Molecular Liquids. 2024;414:126194.
- 33. Alanazi M, Alanazi J, Alharby TN, Huwaimel B. Correlation of rivaroxaban solubility in mixed solvents for optimization of solubility using machine learning analysis and validation. Sci Rep. 2025;15(1):4725. pmid:39922955
- 34. Gao Q, Zhu P, Zhao H, Farajtabar A, Jouyban A, Acree WE Jr. Solubility, solvent effect, and solvation performance of MBQ-167 in aqueous cosolvent solutions. J Chem Eng Data. 2021;66(12):4725–39.
- 35. Silveira M, Mayer DA, Rebelatto EA, Araújo PHH, Vladimir Oliveira J. Solubility and thermodynamic parameters of nicotinic acid in different solvents. The Journal of Chemical Thermodynamics. 2023;184:107084.
- 36. Chirkova J, Andersone I, Grinins J, Andersons B. Sorption properties of hydrothermally modified wood and data evaluation based on the concept of Hansen solubility parameter (HSP). Holzforschung. 2013;67(5):595–600.