
An optimized machine learning framework for predicting and interpreting corporate ESG greenwashing behavior

  • Fanlong Zeng,

    Roles Methodology, Software, Visualization, Writing – original draft, Writing – review & editing, Conceptualization

    Affiliation School of Foreign Studies, Yiwu Industrial and Commercial College, Jinhua, Zhejiang, China

  • Jintao Wang,

    Roles Investigation, Project administration, Validation, Formal analysis, Funding acquisition, Data curation

    Affiliations School of Finance, Shanxi Technology and Business University, Taiyuan, Shanxi, China, Graduate School, Lyceum of the Philippines University-Batangas, Batangas, Philippines

  • Chaoyan Zeng

    Roles Conceptualization, Data curation, Investigation, Methodology, Project administration, Validation

    22229417@lpubatangas.edu.ph

    Affiliation Graduate School, Lyceum of the Philippines University-Batangas, Batangas, Philippines

Abstract

The accurate prediction and interpretation of corporate Environmental, Social, and Governance (ESG) greenwashing behavior is crucial for enhancing information transparency and improving regulatory effectiveness. This paper addresses the limitations in hyperparameter optimization and interpretability of existing prediction models by introducing an optimized machine learning framework. The framework integrates an Improved Hunter-Prey Optimization (IHPO) algorithm, an eXtreme Gradient Boosting (XGBoost) model, and SHapley Additive exPlanations (SHAP) theory to predict and interpret corporate ESG greenwashing behavior. Initially, a comprehensive ESG greenwashing prediction dataset was developed through an extensive literature review and expert interviews. The IHPO algorithm was then employed to optimize the hyperparameters of the XGBoost model, forming an IHPO-XGBoost ensemble learning model for predicting corporate ESG greenwashing behavior. Finally, SHAP was used to interpret the model’s prediction outcomes. The results demonstrate that the IHPO-XGBoost model achieves outstanding performance in predicting corporate ESG greenwashing, with R², RMSE, MAE, and adjusted R² values of 0.9790, 0.1376, 0.1000, and 0.9785, respectively. Compared to traditional HPO-XGBoost models and XGBoost models combined with other optimization algorithms, the IHPO-XGBoost model exhibits superior overall performance. The interpretability analysis using SHAP theory highlights the key features influencing the prediction outcomes, revealing the specific contributions of feature interactions and the impacts of individual sample features. The findings provide valuable insights for regulators and investors to more effectively identify and assess potential corporate ESG greenwashing behavior, thereby enhancing regulatory efficiency and investment decision-making.

1. Introduction

In recent years, ESG issues have increasingly become a focal point of global attention, with investors, regulatory bodies, and consumers placing greater emphasis on the sustainable development performance of enterprises. In March 2024, the People’s Bank of China issued the “Guiding Opinions on Further Strengthening Financial Support for Green and Low-Carbon Development,” which emphasized the need to strengthen constraint mechanisms based on information disclosure. This directive aimed to guide listed companies to disclose sustainable development information, thereby highlighting the importance of ESG information disclosure in China. However, some companies with poor actual ESG performance manipulate selective disclosure of ESG data to create a false image of sustainability, seeking market recognition and investor trust. Yu et al. [1] refer to this as corporate “greenwashing” behavior, which can lead to a series of greenwashing risks. Therefore, researching corporate ESG greenwashing behaviors is of significant importance for reducing related risks, enhancing information transparency, and improving regulatory effectiveness.

Regarding corporate ESG greenwashing behaviors, studies by Tashman et al. [2] and Hora et al. [3] have pointed out that there is often a discrepancy between ESG statements and actual performance in practice, indicating the prevalence of corporate ESG greenwashing. Liu et al. [4] identified external governance environments and internal governance mechanisms as the two main drivers of corporate ESG greenwashing behaviors. For example, the implementation of green credit policies can initially lead to a significant increase in greenwashing behaviors [5]; additionally, financially constrained companies or those with high debt levels may resort to greenwashing due to future investment and financing needs [6,7]. The drivers of corporate ESG greenwashing are varied, as are the risks. Greenwashing behaviors can mislead investors about the actual ESG performance of enterprises [8], resulting in erroneous investment decisions and potentially leading to incorrect stock market pricing [9]. Furthermore, it can undermine the market recognition and capital support deserved by genuinely sustainable enterprises, reducing market fairness and transparency [10]. To mitigate various greenwashing risks, a reliable greenwashing identification and evaluation mechanism is needed. Consequently, various methods for measuring greenwashing have emerged [1,11–15]. However, these studies primarily focus on ex-post identification of corporate greenwashing behaviors, with few addressing the prediction of corporate ESG greenwashing behaviors. Accurate prediction of corporate greenwashing behaviors, as opposed to ex-post identification, can better assist regulatory agencies in monitoring corporate ESG disclosure behaviors, enhancing market transparency, and protecting investor interests. Therefore, developing effective prediction models for corporate ESG greenwashing behaviors holds significant theoretical and practical importance.

Machine learning, with its powerful data processing and prediction capabilities, has become a popular tool for addressing prediction problems in various fields such as management decision-making [16,17], transportation [18], healthcare [19], agriculture [20], climate change [21,22], and construction engineering [23,24]. Classical machine learning algorithms include Linear Regression (LR), Support Vector Machines (SVM), Random Forest (RF), Artificial Neural Networks (ANN), and XGBoost. XGBoost, an ensemble learning algorithm based on Gradient Boosting Decision Trees (GBDT), was proposed by Chen et al. [25]. Due to its efficient computational performance and excellent predictive capabilities, XGBoost has gained widespread recognition in academia and industry. Niazkar et al. [26] systematically summarized the applications of XGBoost in the field of water resources engineering from December 2018 to May 2023; Meddage et al. [27–29] have extensively applied the XGBoost model to various predictive problems in the field of construction engineering; Shi et al. [30] investigated early prediction of acute kidney injury based on the XGBoost model; and Wang et al. [31] verified the superiority of the XGBoost model compared to K-Nearest Neighbors (KNN), SVM, RF, and Back Propagation Neural Network (BPNN) models in predicting e-commerce user purchasing behavior.

However, machine learning models generally suffer from poor interpretability, and XGBoost is no exception. To address this issue, several studies have explored the application of Explainable Artificial Intelligence (XAI) techniques. For example, Ukwaththa et al. [32] reviewed machine learning and XAI methods in additive manufacturing, highlighting the importance of interpretability in complex models. Similarly, Perera et al. [33] applied XAI techniques to streamflow modeling in ungauged basins, using local interpretable model-agnostic explanations to interpret a modified generative adversarial network model. These studies emphasize the growing use of XAI methods to improve the transparency and interpretability of machine learning models. The SHAP theory, proposed by Lundberg et al. [34], is one of the XAI techniques. SHAP, as a unified framework for enhancing the interpretability of machine learning models, enables both global and local model interpretability analyses, helping users understand the contribution of each feature to the prediction results [35]. Consequently, scholars from different research fields have begun combining XGBoost with SHAP. For example, Wu [36] used the XGBoost-SHAP framework to predict and explain the economic resilience of Chinese provinces, identifying R&D expenditure and patent authorizations as key factors influencing economic resilience in China. Yi et al. [37] employed this framework to predict the difficulty of mathematics exam questions in the field of education, finding that parameter-level and reasoning-level features were crucial factors influencing the difficulty of subjective exam questions. Notably, the XGBoost-SHAP framework is more commonly used in transportation prediction problems [38–40]. These applications demonstrate the maturity and broad applicability of the XGBoost-SHAP research framework.

Nevertheless, in practical applications, the predictive performance of the XGBoost model heavily depends on the choice of hyperparameters such as n_estimators, learning_rate, and max_depth. The interaction between these hyperparameters is highly complex, as their values not only affect the model’s learning ability but also its generalization performance. The optimization of this hyperparameter space is crucial for achieving the best predictive results. Previous studies have shown that the performance of tree-based models like XGBoost is sensitive to the settings of key hyperparameters, and optimizing these parameters can lead to significant improvements in accuracy and generalization ability [25,41–45]. Common hyperparameter optimization methods include Grid Search (GS) and Bayesian Optimization (BO). Grid Search exhaustively searches all possible parameter combinations to find the optimal solution, which, while simple and straightforward, is computationally expensive and inefficient [41]. Bayesian Optimization guides the search process by constructing a surrogate model, which is more efficient than Grid Search but may still encounter local optima issues in high-dimensional spaces [42]. Recently, some studies have attempted to use various intelligent optimization algorithms (such as Genetic Algorithms (GA) [43], Particle Swarm Optimization (PSO) [44], and Whale Optimization Algorithm (WOA) [45]) for XGBoost hyperparameter tuning, achieving certain successes. The Hunter-Prey Optimization (HPO) algorithm, proposed by Naruei et al. [46], is an emerging intelligent optimization algorithm with strong global search capabilities and convergence accuracy. However, the original HPO algorithm still has certain limitations when dealing with complex optimization problems [47]. To improve optimization efficiency and performance, this study proposes an IHPO algorithm, making it more suitable for XGBoost hyperparameter optimization.
By incorporating Tent Chaos Mapping and Lévy distribution, the algorithm enhances the quality of the initial population and global search capabilities. Additionally, an adaptive inertia weight strategy is adopted to improve the balance during the optimization process and accelerate convergence.

In summary, existing research lacks focus on the prediction of corporate ESG greenwashing behaviors, and there are still deficiencies in hyperparameter optimization and interpretability of the XGBoost model. To address these issues, this study undertakes the following tasks to pave the way for further research on corporate ESG greenwashing prediction. First, ESG-related data of Chinese A-share listed companies from 2017 to 2022 were collected, and an ESG greenwashing prediction indicator system was constructed through indicator screening, resulting in a corporate greenwashing prediction dataset. Then, the improved IHPO algorithm was applied to optimize the hyperparameters of the XGBoost model to enhance its predictive performance. Next, the optimized XGBoost model and the corporate greenwashing prediction dataset were used to predict corporate ESG greenwashing behaviors. Finally, the effectiveness of the proposed IHPO-XGBoost model was validated through model comparison analysis, and the prediction results were interpreted using SHAP theory. The specific research steps are illustrated in Fig 1. The potential contributions of this study are:

  (1) Exploring the new issue of corporate ESG greenwashing prediction.
  (2) Proposing an IHPO-XGBoost-based corporate ESG greenwashing prediction model that significantly enhances the predictive performance of the XGBoost model through hyperparameter optimization.
  (3) Enhancing model interpretability and transparency by interpreting corporate greenwashing prediction results using SHAP theory, deepening stakeholders’ understanding of greenwashing behaviors, and providing tools for regulators and investors to identify and evaluate corporate ESG greenwashing behaviors.
  (4) Offering a machine learning framework with certain generalizability and application value that can serve as a reference for studies in other fields.

2. Dataset

2.1. Measurement of corporate greenwashing index

To study the prediction of corporate ESG greenwashing behaviors, it is first necessary to determine how to quantify the degree of greenwashing by enterprises. The literature [1,13–15] provides a relatively consistent method for quantifying corporate greenwashing, which has been maturely applied in related research. Therefore, this paper refers to these studies and uses Equation (1) to calculate the greenwashing index (GWI).

GWI_{i,t} = (ESGdis_{i,t} − μ_{dis,t}) / σ_{dis,t} − (ESGper_{i,t} − μ_{per,t}) / σ_{per,t}  (1)

where ESGdis_{i,t} represents the ESG disclosure score of company i at time t, and μ_{dis,t} and σ_{dis,t} are the mean and standard deviation of the ESG disclosure scores of companies in the same industry, respectively. ESGper_{i,t} represents the actual ESG performance score of company i at time t, and μ_{per,t} and σ_{per,t} are the mean and standard deviation of the actual ESG performance scores of companies in the same industry, respectively. If the value calculated according to Equation (1) is greater than 0, it indicates that the company is engaging in greenwashing, and the larger the value, the greater the degree of greenwashing.

In practical operation, this paper uses the Bloomberg ESG rating as the ESG disclosure score and the Wind ESG rating as the actual ESG performance score. The rationale for this choice is that the Bloomberg ESG rating encompasses a broad range of information disclosures on environmental, social, and governance aspects, serving as an important means for companies to demonstrate their ESG commitments externally. On the other hand, the Wind ESG rating is based on actual operational data of the company, reflecting the true performance of the company in terms of environmental protection, social responsibility, and governance structure.
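As a concrete illustration, Equation (1) reduces to a difference of industry-level z-scores, which can be sketched in a few lines of pandas. The column names (esg_dis, esg_per, industry, year) and the toy figures below are hypothetical, not the paper's actual data.

```python
import pandas as pd

def greenwashing_index(df):
    """GWI per Equation (1): industry-standardized disclosure score
    minus industry-standardized performance score."""
    g = df.groupby(["industry", "year"])
    z_dis = (df["esg_dis"] - g["esg_dis"].transform("mean")) / g["esg_dis"].transform("std")
    z_per = (df["esg_per"] - g["esg_per"].transform("mean")) / g["esg_per"].transform("std")
    return z_dis - z_per

# Toy example: three firms in one industry-year.
toy = pd.DataFrame({
    "industry": ["A", "A", "A"],
    "year": [2020, 2020, 2020],
    "esg_dis": [80.0, 60.0, 70.0],  # disclosure scores (Bloomberg-style)
    "esg_per": [50.0, 70.0, 60.0],  # performance scores (Wind-style)
})
toy["gwi"] = greenwashing_index(toy)
print(toy["gwi"].tolist())  # [2.0, -2.0, 0.0]
```

A positive GWI flags a firm whose disclosure outpaces its measured performance, matching the interpretation of Equation (1).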

2.2. Feature variables

This study employs multiple steps to determine the feature variable set for predicting corporate ESG greenwashing behaviors. The specific steps are as follows:

Step 1: Through extensive literature review [47,48–53] and interviews with several experts in the ESG research field, an initial set of feature variables was summarized based on the principles of scientific validity and data availability [54].

Step 2: Data collection based on the initial set of feature variables was performed. Following the approach in literature [54], missing values in the initial dataset were handled using interpolation methods, and normalization methods were applied to unify data dimensions.

Step 3: Referring to literature [55], correlation coefficient analysis was used to remove variables with high correlation and weak representativeness, ensuring that the retained indicators possess independence and representativeness.

Step 4: Variables were further refined by eliminating those deemed clearly unimportant according to the F-Score [56] of all features.
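Step 3 can be illustrated with a minimal correlation filter. The 0.8 cutoff and the synthetic features below are assumptions for demonstration only (the F-Score ranking of Step 4 is omitted for brevity).

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(X, threshold=0.8):
    """Greedily drop the later feature of any pair whose absolute Pearson
    correlation exceeds `threshold` (threshold is an illustrative assumption)."""
    corr = X.corr().abs()
    # keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "size": a,                                            # e.g. firm size
    "size_dup": a * 2 + 0.01 * rng.normal(size=200),      # near-duplicate feature
    "roa": rng.normal(size=200),                          # independent feature
})
X_kept = drop_highly_correlated(X)
print(list(X_kept.columns))  # ['size', 'roa'] - the near-duplicate is removed
```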

Through these steps, a final set of 19 feature variables was obtained, as shown in Table 1. Table 1 categorizes the 19 features into five categories. Company characteristics and governance structure reflect the basic situation and governance level of enterprises. Factors such as company size, shareholder structure, and executive compensation level can influence corporate ESG behaviors and information disclosure, thereby affecting the likelihood of ESG greenwashing. Financial status and performance are core aspects of enterprise operations. A company’s financial health and profitability directly impact its investments and performance in ESG areas. Financial constraints and distress may prompt enterprises to engage in ESG greenwashing to enhance their image and attract investment. Operational efficiency and management levels determine resource utilization efficiency and innovation capacity. Indicators such as management expense ratio, digital transformation, and corporate innovation reflect a company’s operational and management efficiency and capability, thereby influencing its ESG behaviors. Media attention and environmental regulations reflect the stringency of the external regulatory environment, which can pressure companies to engage in greenwashing in their ESG disclosures. Environmental investment, common prosperity, and business credit are areas of focus for the Chinese government and stakeholders. These indicators reflect the efforts and performance of enterprises in social responsibility and sustainable development, which can also exert social responsibility pressure on enterprises, potentially leading to ESG greenwashing behaviors.

Table 1. Feature variables for corporate ESG greenwashing prediction.

https://doi.org/10.1371/journal.pone.0316287.t001

2.3. Dataset description

After data collection, calculation of the greenwashing index, and selection of feature variables, the final corporate greenwashing dataset for this study was obtained. To enhance understanding of the data, the dataset is further described as follows:

  (1) The dataset comprises samples of Chinese A-share listed companies from 2017 to 2022 that have both Bloomberg ESG ratings and Wind ESG ratings. Companies with unique capital structures in the financial and insurance industries, as well as companies with abnormal financial conditions (ST and *ST companies), are excluded from the sample.
  (2) The feature variables in the dataset are listed in Table 1, and the target variable is the GWI. In constructing the dataset, the GWI lags one year behind the feature variables to enable the prediction of the greenwashing index for year t + 1 using data from year t. Therefore, the data years for the greenwashing index are from 2018 to 2022, while the data years for the feature variables are from 2017 to 2021.
  (3) The dataset consists of 4584 observations. The data for the feature variables primarily come from the Wind and CSMAR databases, while the target variable is calculated using Equation (1). The ESGdis values used in the equation are derived from the Bloomberg ESG Disclosure Scores, and the ESGper values come from the Wind database.
  (4) The visual description of the dataset is shown in Fig 2. Fig 2a clearly illustrates the correlations between the variables. Fig 2b displays the distribution of the GWI. The results indicate that the GWI roughly follows a normal distribution, with most companies’ GWI values centered around 0, suggesting that most companies do not exhibit significant greenwashing behaviors. However, some companies have GWI values significantly greater than 0, indicating potential significant greenwashing behaviors.
Fig 2. Correlation heatmap of feature variables and distribution of the target variable.

https://doi.org/10.1371/journal.pone.0316287.g002
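The one-year lag described in Section 2.3 amounts to shifting the GWI back one year before merging it with the features. The firm identifiers and values below are made up for illustration.

```python
import pandas as pd

# Features observed in year t; GWI observed in year t+1 (illustrative values).
features = pd.DataFrame({"firm": ["F1", "F1", "F2"],
                         "year": [2017, 2018, 2017],
                         "lev": [0.4, 0.5, 0.3]})
gwi = pd.DataFrame({"firm": ["F1", "F1", "F2"],
                    "year": [2018, 2019, 2018],
                    "gwi": [0.2, -0.1, 0.6]})

# Shift GWI back one year so that GWI(t+1) lines up with features(t).
gwi_lagged = gwi.assign(year=gwi["year"] - 1)
panel = features.merge(gwi_lagged, on=["firm", "year"], how="inner")
print(panel[["firm", "year", "lev", "gwi"]].to_string(index=False))
```

Each row now pairs year-t features with the following year's greenwashing index, exactly the prediction target described above.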

3. IHPO-XGBoost model for predicting corporate greenwashing

3.1. XGBoost

The basic idea behind XGBoost is to iteratively add new weak learners to fit the residuals of the previous training. The final prediction for each sample is obtained by summing the prediction scores from all weak learners [35]. The main principle of building a corporate greenwashing prediction model based on XGBoost is as follows:

Assume the corporate greenwashing prediction dataset is D = {(x_i, y_i)}, i = 1, 2, …, n, where each sample enterprise x_i has m features and corresponds to a target value y_i. The greenwashing prediction value for the i-th sample, ŷ_i, can be expressed as follows:

ŷ_i = Σ_{k=1}^{K} f_k(x_i), f_k ∈ F  (2)

where f_k is a regression tree, F is the space of all possible regression trees, K is the total number of regression trees, and f_k(x_i) is the score calculated by the k-th tree for the i-th sample enterprise x_i.

The objective function of the corporate greenwashing prediction model is defined as:

Obj = Σ_{i=1}^{n} l(y_i, ŷ_i) + Σ_{k=1}^{K} Ω(f_k)  (3)

where l(y_i, ŷ_i) is the loss function used to measure the fitting degree between the predicted greenwashing values and the actual values, and Ω(f_k) is a regularization term used to penalize complex models to avoid overfitting.

By integrating and reorganizing the Taylor expansion of the objective function and converting it into a polynomial related to the prediction residuals, the optimal weight of the leaf nodes and the optimal solution of the objective function value are obtained as follows:

w_j* = −G_j / (H_j + λ)  (4)
Obj* = −(1/2) Σ_{j=1}^{T} G_j² / (H_j + λ) + γT  (5)

where G_j = Σ_{i∈I_j} g_i and H_j = Σ_{i∈I_j} h_i; g_i and h_i are the first and second derivatives of the loss function, I_j is the sample group of leaf node j, w_j* is the optimal weight of leaf node j, and Obj* is the tree structure score. A smaller Obj* indicates a lower overall loss. T is the number of leaf nodes, γ is the penalty coefficient, and λ represents the regularization coefficient.

To solve for the objective value, a greedy algorithm is used to split the subtrees. The greedy algorithm achieves global optimization by controlling local optimizations. It attempts to add one split point to the existing leaf nodes, enumerates feasible split points, and selects the split with the highest gain [44]. The gain formula is expressed as:

Gain = (1/2) [ G_L² / (H_L + λ) + G_R² / (H_R + λ) − (G_L + G_R)² / (H_L + H_R + λ) ] − γ  (6)

where I_L and I_R are the sample sets of the left and right subtrees after the split, G_L² / (H_L + λ) and G_R² / (H_R + λ) are the information scores of the left and right subtrees, and (G_L + G_R)² / (H_L + H_R + λ) is the information score of the current node before splitting.
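Equation (6) is simple enough to check by hand. The sketch below evaluates the gain for an assumed candidate split with illustrative gradient sums (with squared-error loss, g_i = ŷ_i − y_i and h_i = 1 for each sample).

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Structure-score gain of a candidate split, per Equation (6).
    lam (λ) and gamma (γ) are regularization terms; values are illustrative."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# Hypothetical split: left child holds gradient sum -4 over 2 samples,
# right child holds gradient sum 3 over 3 samples.
gain = split_gain(G_L=-4.0, H_L=2.0, G_R=3.0, H_R=3.0, lam=1.0, gamma=0.0)
print(round(gain, 4))  # 3.7083
```

A positive gain means the split lowers the objective; the greedy algorithm picks the candidate with the largest gain.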

3.2. IHPO-XGBoost prediction model

XGBoost has numerous hyperparameters, and inaccurate hyperparameter settings can affect the model’s predictive efficiency and effectiveness. However, the hyperparameter optimization process is essentially a black-box function optimization problem. If too many parameters are optimized, the model can become redundant, leading to increased computational complexity and affecting overall system performance [57]. Therefore, this study selects three key parameters for optimization that significantly influence the predictive performance of the XGBoost algorithm: n_estimators, learning_rate, and max_depth [58]. n_estimators specifies the number of weak estimators in the ensemble algorithm; a higher value increases the model’s learning capability but also makes the model more prone to overfitting. max_depth controls the maximum depth of the trees in the model; a higher value increases the model’s complexity and the likelihood of overfitting. learning_rate controls the iteration rate and can prevent overfitting by adjusting the step size during the learning process [59]. Yu et al. [35] pointed out that using intelligent optimization algorithms for hyperparameter adjustment can not only obtain the optimal parameter combination but also reduce time and enhance efficiency. Therefore, this study introduces the HPO algorithm, a novel intelligent optimization algorithm, to address the hyperparameter optimization problem for XGBoost.

3.2.1. Original HPO algorithm.

The basic principle of the HPO algorithm proposed by Naruei et al. [46] is as follows:

First, initialize the prey population positions. The initial position x_i of the i-th prey in the population is a random number within the range of the lower and upper limits [l, u], as shown in Equation (7), where d is the dimension of the variable. After initializing the population positions, the fitness is determined based on the objective function f(x).

x_i = l + rand(1, d) × (u − l), i = 1, 2, …, N  (7)

The hunter search mechanism is described by Equation (8), where x_{i,j}(t) represents the hunter’s current position, x_{i,j}(t + 1) represents the hunter’s next position, P_pos is the prey’s position, and μ is the mean of all positions. The balance parameter C is calculated as shown in Equation (9), where Iter is the current iteration number of the algorithm, and MaxIter is the maximum number of iterations of the algorithm. The adaptive parameter Z is calculated as shown in Equation (10), where R1 and R3 are random vectors within [0,1], R2 is a random number within [0,1], P is the index vector of R1 < C, and IDX is the index value of the vector R1 that satisfies the condition P = 0.

x_{i,j}(t + 1) = x_{i,j}(t) + 0.5 [ (2 C Z P_pos(j) − x_{i,j}(t)) + (2 (1 − C) Z μ(j) − x_{i,j}(t)) ]  (8)
C = 1 − Iter × (0.98 / MaxIter)  (9)
Z = R2 ⊗ IDX + R3 ⊗ (∼IDX)  (10)

The hunter selects the prey farthest from the population’s average position μ as the target. The Euclidean distance D_euc(i) for each member is calculated as shown in Equation (11). Considering that after capturing the prey, the hunter continues to move to the new prey position, a decay mechanism is introduced as shown in Equation (13), where N is the number of search agents. As the iteration progresses, the prey position is continuously updated as shown in Equation (12).

D_euc(i) = ( Σ_{j=1}^{d} (x_{i,j} − μ_j)² )^{1/2}  (11)
P_pos = x_i, where i is the index of the kbest-th element of the D_euc vector sorted in descending order  (12)
kbest = round(C × N)  (13)

The prey position update formula is shown in Equation (14), where x_{i,j}(t) represents the current position of the prey, x_{i,j}(t + 1) represents the next position of the prey, R4 is a random number within [−1,1], and T_pos is the global optimal position. In the process of finding the global optimal solution, the HPO algorithm selects hunters and prey based on the parameter R5 and the adjustment parameter β, where R5 is a random number within [0,1]. When R5 < β, the search agent is a hunter and updates its position using Equation (8); otherwise, the search agent is a prey and updates its position using Equation (14).

x_{i,j}(t + 1) = T_pos(j) + C Z cos(2π R4) × (T_pos(j) − x_{i,j}(t))  (14)

3.2.2. Improved HPO algorithm.

(1) Tent chaos mapping and Lévy distribution: A high-quality initial population helps improve the optimization performance of the algorithm. However, the HPO algorithm uses random initialization, which makes it difficult to ensure the quality of the initial population. Chaos mapping, with characteristics such as randomness, ergodicity, and regularity, can ensure population diversity [60]. Therefore, this study uses sequences generated by Tent Chaos Mapping to initialize the population to enhance early population diversity and improve convergence speed. The expression for Tent Chaos Mapping is as follows:

x_{i+1} = x_i / α, 0 ≤ x_i < α;  x_{i+1} = (1 − x_i) / (1 − α), α ≤ x_i ≤ 1  (15)

where i is the corresponding particle number and α ∈ (0,1) is the chaotic parameter, which is proportional to the degree of chaos.

Additionally, random variables following a Lévy distribution [61] are introduced to address the issues of small periodic points and unstable periodic points, ensuring the three properties of the Tent Chaos Mapping sequence. Therefore, the initial values based on Tent Chaos Mapping and the Lévy distribution can be calculated as follows:

(16)

Substituting Equation (16) into Equation (7) yields the initial population, which can be expressed as:

(17)
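The exact mixing used in Equations (16)–(17) is not reproduced here; the sketch below shows one plausible reading, assuming a tent-map sequence lightly perturbed by Mantegna-style Lévy steps and scaled into the search bounds. The perturbation weight 0.01 and α = 0.5 are illustrative assumptions.

```python
import numpy as np

def tent_map(n, alpha=0.5, x0=0.37):
    """Tent chaos sequence per Equation (15); alpha is the chaotic parameter."""
    xs = np.empty(n)
    x = x0
    for i in range(n):
        x = x / alpha if x < alpha else (1.0 - x) / (1.0 - alpha)
        xs[i] = x
    return xs

def levy_step(n, beta=1.5, rng=None):
    """Levy-distributed steps via Mantegna's algorithm (a common construction)."""
    from math import gamma, sin, pi
    rng = rng or np.random.default_rng(0)
    sigma = (gamma(1 + beta) * sin(pi * beta / 2)
             / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0, sigma, n)
    v = rng.normal(0, 1, n)
    return u / np.abs(v) ** (1 / beta)

def init_population(n_pop, dim, lower, upper, rng=None):
    """Sketch of Equations (16)-(17): tent-chaos values perturbed by a small
    Levy term, clipped to [0, 1], then scaled into [lower, upper]."""
    chaos = tent_map(n_pop * dim).reshape(n_pop, dim)
    levy = levy_step(n_pop * dim, rng=rng).reshape(n_pop, dim)
    mixed = np.clip(chaos + 0.01 * levy, 0.0, 1.0)
    return lower + mixed * (upper - lower)

# Initialize 20 agents over the three XGBoost hyperparameter ranges.
pop = init_population(20, 3, lower=np.array([100.0, 0.01, 1.0]),
                      upper=np.array([1000.0, 1.0, 10.0]))
print(pop.shape)  # (20, 3)
```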

(2) Adaptive inertia weights: Inertia weight is an important control parameter in intelligent algorithms, including both linear and nonlinear adjustment techniques, to improve the convergence of the algorithm [62]. Nonlinear relationships are widely present in practical optimization problems, making nonlinear strategies more broadly applicable. This allows the algorithm to have good global and local balancing capabilities during the prey iteration phase, accelerating convergence to the optimal solution. In this study, a nonlinear decreasing inertia weight based on a concave function is applied to the HPO algorithm, which can be expressed as follows:

(18)

where w_max and w_min are the maximum and minimum inertia weights, respectively, and c is the adjustment parameter.

Incorporating Equation (18) into Equation (14) yields Equation (19) as the new prey position update strategy:

(19)

3.2.3. Construction of the IHPO-XGBoost model.

Utilizing the optimization capabilities of the IHPO algorithm and the learning capabilities of the XGBoost algorithm, this study proposes an IHPO-XGBoost prediction model optimized by the IHPO algorithm. The flowchart of the model is shown in Fig 3.

This model aims to improve the predictive performance of the XGBoost algorithm by finding the optimal set of parameters during the training process, minimizing the error between the predicted results and actual values. According to Fig 3, the implementation process of the IHPO-XGBoost is as follows:

Step 1: Set the initial parameters of the XGBoost algorithm, including the parameter ranges and initial values, such as n_estimators, learning_rate, and max_depth.

Step 2: Based on Equation (17), use the Tent Chaos Mapping method combined with the Lévy Distribution to initialize the population, ensuring the diversity and high quality of the initial population to enhance the exploration capability of the algorithm.

Step 3: Set the key parameters of the HPO algorithm, including the population size (nPop) and the MaxIter.

Step 4: Evaluate the fitness and Tpos. Use the defined fitness function to evaluate the fitness of the initial population. This function is based on the performance of the XGBoost model, using cross-validation to ensure the robustness of the fitness evaluation.

Step 5: Update C using Equation (9); evaluate Z using Equation (10).

Step 6: Select the appropriate position update strategy based on whether R5 < β.

Step 7: Reevaluate the fitness of the population and Tpos after the position updates, and determine if the algorithm meets the convergence criteria. If the criteria are met, the optimization process terminates; otherwise, continue to the next iteration.

Step 8: Use the optimal hyperparameters (n_estimators, learning_rate, and max_depth) obtained by the IHPO algorithm to train and test the corporate greenwashing prediction model using XGBoost.
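Steps 1–8 can be sketched as a compact loop. For portability the fitness below is a toy surrogate with a known optimum; in the paper it is the five-fold cross-validated MSE of an XGBoost model trained with the candidate hyperparameters, and the position updates here are simplified relative to Equations (8)–(14).

```python
import numpy as np

LOWER = np.array([100.0, 0.01, 1.0])   # n_estimators, learning_rate, max_depth
UPPER = np.array([1000.0, 1.0, 10.0])

def fitness(params):
    """Toy surrogate standing in for cross-validated XGBoost MSE (assumption)."""
    n_est, lr, depth = params
    return ((n_est - 450) / 900) ** 2 + (lr - 0.1) ** 2 + ((depth - 6) / 9) ** 2

rng = np.random.default_rng(42)
n_pop, max_iter, beta = 20, 100, 0.1
pop = LOWER + rng.random((n_pop, 3)) * (UPPER - LOWER)   # Step 2 (random init here)
fit = np.apply_along_axis(fitness, 1, pop)
t_pos = pop[fit.argmin()].copy()                         # Step 4: global best
init_fit = fit.min()                                     # record initial best

for it in range(max_iter):                               # Steps 5-7
    C = 1 - it * (0.98 / max_iter)                       # balance parameter
    for i in range(n_pop):
        if rng.random() < beta:                          # hunter-style move toward mean
            pop[i] += C * rng.random(3) * (pop.mean(axis=0) - pop[i])
        else:                                            # prey-style move toward best
            r4 = rng.uniform(-1, 1, 3)
            pop[i] = t_pos + C * rng.random(3) * np.cos(2 * np.pi * r4) * (t_pos - pop[i])
        pop[i] = np.clip(pop[i], LOWER, UPPER)
    fit = np.apply_along_axis(fitness, 1, pop)
    if fit.min() < fitness(t_pos):                       # Step 7: update global best
        t_pos = pop[fit.argmin()].copy()

# Step 8: t_pos now holds the hyperparameters to pass to the final XGBoost fit.
print(np.round(t_pos, 3))
```

Because the global best is only replaced on improvement, the best fitness is monotonically non-increasing across iterations.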

4. Performance evaluation and interpretability of model

4.1. Model performance evaluation

Following the approach in [55,63,64], this study evaluates the model’s performance using the coefficient of determination (R²), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Adjusted R². R² represents the percentage of the variance in the dependent variable that is predictable from the independent variables. It characterizes the degree of regression fit to all research samples and is commonly used as the primary indicator of prediction accuracy in regression problems [65]. RMSE and MAE are crucial indicators for measuring model error, while Adjusted R² penalizes the number of input features to adjust for potential overfitting [66]. The specific calculation formulas for these four indicators are as follows:

R² = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − ȳ)²  (19)
RMSE = ( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )^{1/2}  (20)
MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|  (21)
Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1)  (22)

where y_i represents the actual value, ŷ_i represents the predicted value, ȳ represents the mean value, n represents the number of samples, and p represents the number of features. The higher the R² and Adjusted R², and the lower the RMSE and MAE, the better the model performance and the smaller the gap between the predicted and actual values.
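The four indicators can be computed directly from Equations (19)–(22); the numbers below are an arbitrary worked example.

```python
import numpy as np

def evaluate(y_true, y_pred, p):
    """R-squared, RMSE, MAE, and adjusted R-squared per Equations (19)-(22);
    p is the number of features used by the model."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / n)
    mae = np.mean(np.abs(y_true - y_pred))
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, rmse, mae, adj_r2

y = [0.2, -0.1, 0.6, 0.0, 0.3]          # illustrative actual GWI values
yhat = [0.25, -0.05, 0.55, 0.05, 0.25]  # illustrative predictions
r2, rmse, mae, adj = evaluate(y, yhat, p=1)
print(round(r2, 3), round(rmse, 3), round(mae, 3), round(adj, 3))  # 0.958 0.05 0.05 0.944
```

Note how the adjusted R² is slightly below R², since Equation (22) charges a penalty for each feature used.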

4.2. Model interpretability

SHAP is an algorithm for interpreting machine learning models, primarily based on cooperative game theory to compute Shapley values. These values assess the contribution of each feature to the model’s prediction, thereby explaining the output of the machine learning model [35]. The principle of SHAP is as follows.

Assume the i-th enterprise sample is x_i, and the SHAP value φ_{i,j} of its j-th feature is calculated as follows:

φ_{i,j} = Σ_{S ⊆ M∖{j}} [ |S|! (m − |S| − 1)! / m! ] × [ f(S ∪ {j}) − f(S) ]  (23)

where M is the set of all features in the enterprise greenwashing behavior dataset, with a dimension of m; S is a subset drawn from M∖{j} with a size of |S|; f(S) is the model’s prediction for the enterprise sample using only the feature set S, and when S is an empty set, the value of f(S) is considered the baseline, i.e., the average prediction of the model across all enterprise samples; f(S ∪ {j}) is the model’s prediction when the j-th feature value is added to the feature set S; M∖{j} represents the subset of all features excluding the j-th feature.

Based on SHAP values, the SHAP explanation for the enterprise greenwashing behavior prediction model can be expressed with an additive model as follows:

$$g(x_i) = \phi_0 + \sum_{j=1}^{m} \phi_j(x_i)\, z_j \quad (24)$$

where $\phi_0$ is the baseline output of the entire enterprise greenwashing behavior prediction model, i.e., the average prediction over all training samples; $\phi_j(x_i)$ is the SHAP value, representing the contribution of the j-th feature of the i-th sample to the model output $g(x_i)$; and $z_j$ is the simplified feature indicator, taking the value 0 or 1, where $z_j = 1$ indicates that the feature is present in the sample being explained and $z_j = 0$ indicates that it is not.

For the SHAP value in the above equation, when $\phi_j(x_i) > 0$, the j-th feature of the i-th enterprise sample increases the model’s prediction value, having a positive effect on the model output; when $\phi_j(x_i) < 0$, the feature decreases the prediction value, having a negative effect on the model output. Hence, the role of the SHAP algorithm is to use the contribution values of all features together to drive the model’s prediction from the baseline value to the final predicted value.
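For a small number of features, Eq. (23) can be implemented by brute force over all subsets, and the additivity property of Eq. (24) verified directly. The sketch below is illustrative only (the function name and the toy linear model are not from the paper; practical implementations such as TreeExplainer use far more efficient algorithms):

```python
import itertools
from math import factorial

import numpy as np

def shapley_values(f, x, baseline):
    """Exact Shapley values per Eq. (23) for a model f over m features.
    A feature outside the subset S is fixed at its baseline value."""
    m = len(x)
    phi = np.zeros(m)
    for j in range(m):
        others = [k for k in range(m) if k != j]
        for r in range(len(others) + 1):
            for S in itertools.combinations(others, r):
                weight = factorial(len(S)) * factorial(m - len(S) - 1) / factorial(m)
                z = baseline.copy()
                z[list(S)] = x[list(S)]          # prediction with subset S only
                z_j = z.copy()
                z_j[j] = x[j]                    # ... after adding feature j
                phi[j] += weight * (f(z_j) - f(z))
    return phi

# Toy linear model: each feature's Shapley value is exactly w_j * (x_j - base_j)
w = np.array([2.0, -1.0, 0.5])
f = lambda z: float(w @ z)
x = np.array([1.0, 2.0, 3.0])
base = np.zeros(3)
phi = shapley_values(f, x, base)

# Additivity of Eq. (24): prediction = baseline output + sum of SHAP values
assert abs(f(x) - (f(base) + phi.sum())) < 1e-9
```

The brute-force loop is exponential in m; tree-model-specific algorithms compute the same values in polynomial time, which is what makes SHAP practical for the XGBoost model used here.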

5. Results and discussion

5.1. Model prediction results

In this study, the constructed corporate greenwashing dataset is split into an 80% training set and a 20% test set. Five-fold cross-validation is used for model training, which helps ensure that the model generalizes well to unseen data. To further prevent overfitting, the XGBoost model incorporates regularization parameters, specifically alpha (L1 regularization) and lambda (L2 regularization); their default values are used in this study, which guards against excessive fitting while preserving the model’s generalization ability. The operating environment of the model is shown in Table 2. The hyperparameters of the XGBoost algorithm, namely n_estimators, learning_rate, and max_depth, are optimized using the IHPO algorithm. Based on a review of existing research on XGBoost hyperparameter optimization [41–45], the optimized ranges for these three parameters typically fall within [100, 1000], [0.01, 1], and [1, 10], respectively, so these ranges were set as the initial bounds for optimization. Furthermore, in the IHPO algorithm, multiple rounds of code tuning showed that a population size of 20 and 100 iterations produced the best results. Mean Squared Error (MSE) is used as the fitness function. The final optimized results are n_estimators of 450, learning_rate of 0.1, and max_depth of 6.
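The optimization loop described above can be sketched in simplified form. The code below is an illustrative population-based search standing in for IHPO (whose actual hunting/escaping update rules are defined earlier in the paper); the bounds, population size, and iteration count follow the text, while a toy quadratic fitness replaces the cross-validated XGBoost MSE so the sketch stays self-contained:

```python
import numpy as np

rng = np.random.default_rng(42)

# Search ranges from the text: n_estimators, learning_rate, max_depth
BOUNDS = np.array([[100.0, 1000.0], [0.01, 1.0], [1.0, 10.0]])

def optimize(fitness, pop_size=20, iters=100):
    """Simplified population-based search standing in for IHPO:
    move individuals toward the incumbent best with random exploration,
    never losing the best solution found so far (elitism)."""
    span = BOUNDS[:, 1] - BOUNDS[:, 0]
    pop = rng.uniform(BOUNDS[:, 0], BOUNDS[:, 1], size=(pop_size, 3))
    fit = np.array([fitness(p) for p in pop])
    for _ in range(iters):
        i = fit.argmin()
        best, best_fit = pop[i].copy(), fit[i]
        noise = rng.normal(scale=0.05, size=pop.shape) * span
        pop = np.clip(pop + 0.5 * (best - pop) + noise, BOUNDS[:, 0], BOUNDS[:, 1])
        fit = np.array([fitness(p) for p in pop])
        if fit.min() > best_fit:                 # elitism: keep the incumbent
            pop[0], fit[0] = best, best_fit
    return pop[fit.argmin()]

# Toy fitness: normalized squared distance to the reported optimum
# (in the actual pipeline this would be the five-fold cross-validated MSE
# of an XGBoost model trained with the candidate hyperparameters)
target = np.array([450.0, 0.1, 6.0])
span = BOUNDS[:, 1] - BOUNDS[:, 0]
best_params = optimize(lambda p: float(np.sum(((p - target) / span) ** 2)))
```

In the actual pipeline, `fitness(p)` would train an XGBoost model with `n_estimators=int(p[0])`, `learning_rate=p[1]`, `max_depth=int(p[2])` and return the cross-validated MSE; the integer parameters would be rounded before training.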

Table 2. Operating environment of the corporate greenwashing prediction project.

https://doi.org/10.1371/journal.pone.0316287.t002

After completing model training, the test set is used to validate the model’s predictive performance, and the prediction results on the test set are visualized (as shown in Fig 4). From Fig 4a, it can be seen that the predicted values of the sample points are very close to the actual values, and the absolute errors are relatively stable, mainly concentrated in the low error range of 0 to 0.2, indicating good prediction performance. Fig 4b shows the scatter plot of the actual versus predicted values of the greenwashing index. The vast majority of sample points fall within the 95% confidence interval error band drawn around the ideal fit line y = x, and the fitted equation for all scatter points (reported in Fig 4b) shows a strong correlation between the predicted and actual values. Additionally, Fig 4b presents the performance evaluation metrics of the IHPO-XGBoost model: R², RMSE, MAE, and Adjusted R² are 0.9790, 0.1376, 0.1000, and 0.9785, respectively. The results presented in Fig 4 demonstrate the excellent predictive performance of the IHPO-XGBoost model.

Fig 4. Performance of IHPO-XGBoost model on the test set.

https://doi.org/10.1371/journal.pone.0316287.g004

5.2. Comparative analysis of model performance

To verify the effectiveness of the improvements made to the HPO algorithm, the performance of IHPO-XGBoost is compared with HPO-XGBoost on the test set, as shown in Fig 5. The results indicate that using the improved HPO algorithm for hyperparameter optimization of XGBoost leads to certain improvements in the R² and Adjusted R² metrics while significantly reducing RMSE and MAE, demonstrating the effectiveness of the proposed optimization strategy. To further validate the superiority of the IHPO-XGBoost model, this study compares the proposed model with WOA-XGBoost [67], SSA-XGBoost [68], GA-XGBoost [69], and BO-XGBoost [70] models on the test set. As shown in Fig 6, the IHPO-XGBoost model exhibits the highest R² and Adjusted R² values, along with the lowest RMSE and MAE, indicating that the IHPO-XGBoost model has the best overall performance.

Fig 5. Comparison of model performance before and after HPO improvement.

https://doi.org/10.1371/journal.pone.0316287.g005

Fig 6. Model performance comparison based on R², RMSE, MAE and Adjusted R².

https://doi.org/10.1371/journal.pone.0316287.g006

5.3. Interpretability analysis based on SHAP theory

5.3.1. Global impact analysis of sample features.

Fig 7 presents the visualization of the sample feature global explanation of the IHPO-XGBoost model’s prediction results based on SHAP theory.

Fig 7. Global model explanation results by SHAP.

(a) Feature importance analysis. (b) SHAP value distribution. Note: Firm Size (FS), Shareholding Ratio of the Second Largest Shareholder/ Largest Shareholder (SRSL/LS), Total Asset Turnover Ratio (TATR), Shareholding Ratio of the Largest Shareholder (SRLS), Financing Constraints (FC), Firm Innovation (FI), Z-Score (ZS), Management Expense Ratio (MER), Return on Equity (ROE), Total Compensation of Top Three Executives (TCTTE), Institutional Investors Shareholding Ratio (IISR), Business Credit (BC), Digital Transformation (DT), Common Prosperity Index (CPI), Total Asset Growth Rate (TAGR), Annual Stock Return (ASR), Environmental Regulation (ER), Media Attention (MA), Environmental Protection Investment (EPI).

https://doi.org/10.1371/journal.pone.0316287.g007

Firstly, Fig 7a visualizes the importance of features by showing the mean absolute SHAP values for each feature; the larger the value, the more important the feature. As shown in Fig 7a, overall, features such as FS, SRSL/LS, and TATR have a significant impact on the prediction results, whereas features like EPI are relatively less important. Among them, FS has the greatest impact on the prediction results, with an average absolute SHAP value of 0.1772, indicating that it plays a crucial role in predicting corporate greenwashing. It is followed by SRSL/LS, with an average absolute SHAP value of 0.1030, which also has a significant impact on the prediction results. Other features with significant impacts include TATR, SRLS, and FC, with average absolute SHAP values of 0.0706, 0.0654, and 0.0628, respectively. In contrast, EPI has an average absolute SHAP value of only 0.0074, indicating its minimal impact in the model. This implies that financial status is more important than environmental protection investment in predicting corporate greenwashing.

Fig 7b is the SHAP summary plot for features, showing the distribution of SHAP values for each feature and the corresponding impact trends. In the figure, the x-axis represents specific SHAP values, and the y-axis represents the input features sorted by importance. The dots represent samples from the database, with colors indicating specific feature values, ranging from blue (low values) to red (high values). The horizontal position of a dot indicates whether its feature value increases or decreases the prediction value. For example, the red dots at the top of the figure indicate that higher FS values lead to a significant increase in the prediction value. The SHAP summary plot therefore shows not only which features are important but also how each feature affects the prediction results. Generally, as the values of features such as FS, SRSL/LS, and TATR increase, the prediction value for corporate greenwashing also increases, whereas increases in features like EPI have a smaller impact on the prediction results.
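The importance ranking in Fig 7a is simply the mean absolute SHAP value per feature. A small illustrative computation follows; the SHAP matrix below is made up for demonstration, and only the ranking rule mirrors the paper:

```python
import numpy as np

# Illustrative SHAP matrix: rows are samples, columns are features.
# Values are invented for demonstration; only the ranking rule mirrors Fig 7a.
features = ["FS", "SRSL/LS", "TATR", "SRLS", "FC", "EPI"]
shap_vals = np.array([
    [ 0.20, -0.09,  0.05,  0.07, -0.06,  0.01],
    [-0.15,  0.12, -0.08,  0.06,  0.07, -0.01],
    [ 0.18,  0.10,  0.08, -0.07, -0.06,  0.00],
])

# Fig 7a's bar heights: mean absolute SHAP value per feature
importance = np.abs(shap_vals).mean(axis=0)
order = np.argsort(importance)[::-1]            # most important first
ranking = [features[i] for i in order]
```

Taking the absolute value before averaging matters: a feature whose positive and negative contributions cancel across samples would otherwise look unimportant.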

5.3.2. Interaction impact analysis of features.

In this section, based on the model importance shown in Fig 7 and actual business significance, nine key feature pairs are selected to analyze their interaction effects, revealing how these features jointly affect the prediction results of corporate ESG greenwashing behavior. The interaction impact of two input features on the model’s prediction results can be visualized using the SHAP dependence plot shown in Fig 8. In Fig 8, the horizontal axis represents the feature values of a certain input feature, the left vertical axis represents the SHAP values corresponding to that input feature, and the right vertical axis represents the feature values of another input feature marked by color, ranging from blue to red indicating low to high feature values.

Fig 8. Interaction impact analysis of features on model prediction results.

Firm Size (FS), Shareholding Ratio of the Second Largest Shareholder/ Largest Shareholder (SRSL/LS), Total Asset Turnover Ratio (TATR), Shareholding Ratio of the Largest Shareholder (SRLS), Financing Constraints (FC), Firm Innovation (FI), Z-Score (ZS), Management Expense Ratio (MER), Return on Equity (ROE), Total Compensation of Top Three Executives (TCTTE), Institutional Investors Shareholding Ratio (IISR), Business Credit (BC), Digital Transformation (DT), Common Prosperity Index (CPI), Total Asset Growth Rate (TAGR), Annual Stock Return (ASR), Environmental Regulation (ER), Media Attention (MA), Environmental Protection Investment (EPI).

https://doi.org/10.1371/journal.pone.0316287.g008
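In dependence plots such as those of Fig 8, the colour channel is chosen so that the second feature best explains the vertical spread of the first feature's SHAP values. The sketch below illustrates that selection rule on synthetic data; the simple correlation criterion is a stand-in, and the feature names are reused for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: the SHAP values of FS are modulated by SRLS (an
# interaction), while MER is unrelated. The data are not from the paper.
n = 200
fs = rng.uniform(0.0, 1.0, n)      # Firm Size (normalized)
srls = rng.uniform(0.0, 1.0, n)    # Shareholding Ratio of the Largest Shareholder
mer = rng.uniform(0.0, 1.0, n)     # Management Expense Ratio
shap_fs = 0.1 * fs + 0.3 * fs * srls

def best_interaction(shap_col, candidates):
    """Pick the feature whose values correlate most strongly with the
    vertical spread of the SHAP column (a simplified colouring rule)."""
    scores = {name: abs(np.corrcoef(shap_col, vals)[0, 1])
              for name, vals in candidates.items()}
    return max(scores, key=scores.get)

chosen = best_interaction(shap_fs, {"SRLS": srls, "MER": mer})
```

The interacting feature (SRLS here) wins because it systematically shifts the SHAP values of FS at a given FS level, which is exactly the colour gradient read off in Fig 8.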

Based on Fig 8, the analysis and discussion are as follows:

  (1) FS, TATR, SRLS, and ER all show a trend where increasing feature values correspond to increasing SHAP values. This indicates that higher values of these features have a more positive impact on the prediction results for corporate ESG greenwashing. Specifically, larger companies (higher FS) have more resources and capabilities, making them more likely to engage in ESG greenwashing. Similarly, companies with high asset utilization efficiency (higher TATR) and high governance concentration (higher SRLS) are more inclined to showcase their management and decision-making advantages through ESG greenwashing. Strict environmental regulations (higher ER) prompt companies to actively engage in ESG greenwashing to meet regulatory requirements and reduce regulatory pressure.
  (2) ZS and CPI show a trend where increasing feature values correspond to decreasing SHAP values, indicating that lower values of these features have a more positive impact on the prediction results for corporate ESG greenwashing. Companies with poorer financial health (lower ZS) may use ESG greenwashing to cover up internal issues and enhance external trust. Similarly, companies with lower CPI, indicating poorer social responsibility performance, may use ESG greenwashing to improve their image.
  (3) ROE and DT exhibit fluctuating trends. When ROE increases from 0 to 0.8, SHAP values decrease, indicating that in the early stages of profitability, companies focus more on genuine performance improvements. When ROE increases from 0.8 to 1, SHAP values rise, suggesting that in the high profitability stage, companies may increase ESG greenwashing to enhance shareholder satisfaction. For DT, SHAP values increase as the feature value rises from 0 to 0.4, indicating a positive impact of initial digital transformation on ESG greenwashing. SHAP values then decrease from 0.4 to 0.6, possibly due to challenges in technology integration and organizational change. When DT exceeds 0.8, SHAP values rise again, indicating that deep digital transformation enables companies to better utilize digital means for ESG greenwashing.
  (4) Additionally, Fig 8 reveals clear relationships between certain features. For example, higher FS may correspond to higher SRLS, which is further related to higher IISR; high ROE is usually accompanied by high TAGR. Conversely, lower TATR often correlates with higher MER, while lower ZS is associated with higher FC, and low CPI may correspond to high BC. These relationships reveal complex interactions between features, helping us to understand corporate ESG greenwashing behavior in different contexts more deeply. For instance, higher FS typically implies higher SRLS, which may reflect a more concentrated governance structure and higher decision-making efficiency, thus leading to a greater tendency for ESG greenwashing. Similarly, the association between high ROE and high TAGR suggests that highly profitable companies may be more motivated to engage in ESG greenwashing to enhance their financial performance when expanding assets. The relationship between lower TATR and higher MER may indicate that despite lower asset turnover, companies with higher MER may still showcase their management capabilities through ESG greenwashing. Finally, the association between lower ZS and higher FC implies that companies with poorer financial health are more likely to face financing constraints and therefore may engage in ESG greenwashing to improve external perceptions of their financial status.

5.3.3. Single sample feature impact analysis.

Two samples with predicted values around 0.5 are selected from the test set to analyze the impact of input features on the prediction results using SHAP theory, as shown in Fig 9. In the figure, the horizontal axis represents the size of the predicted value, and the vertical axis represents the value of the input feature. Red indicates that the SHAP value of the input feature is positive, while blue indicates that it is negative. The larger the red or blue area, the greater the SHAP value and thus the greater the impact on the prediction result for the sample.

Fig 9. SHAP explanation of predictions for single samples.

Firm Size (FS), Shareholding Ratio of the Second Largest Shareholder/ Largest Shareholder (SRSL/LS), Total Asset Turnover Ratio (TATR), Shareholding Ratio of the Largest Shareholder (SRLS), Financing Constraints (FC), Firm Innovation (FI), Management Expense Ratio (MER), Return on Equity (ROE), Institutional Investors Shareholding Ratio (IISR), Business Credit (BC), Media Attention (MA).

https://doi.org/10.1371/journal.pone.0316287.g009

In Fig 9a, the predicted value for sample 62 is 0.491. FS has the greatest impact on the prediction result for sample 62, with a SHAP value of +0.32, indicating that a larger FS positively drives the prediction value. Additionally, BC and SRLS also have positive impacts on the prediction result, with SHAP values of +0.11 and +0.09, respectively, showing the positive contributions of market trust and shareholder structure to the company’s ESG performance. On the other hand, FC has a negative impact on the prediction value, with a SHAP value of -0.09, indicating that financial constraints inhibit the prediction value increase for this sample. Overall, the prediction result for sample 62 is primarily driven by the positive influences of FS and BC, but the negative impact of FC cannot be ignored.

In Fig 9b, the predicted value for sample 560 is 0.5. FS also has the greatest impact on the prediction result for this sample, with a SHAP value of +0.32. TATR and SRLS have significant positive impacts on the prediction result, with SHAP values of +0.09 and +0.09, respectively, indicating good performance in asset utilization efficiency and shareholder structure. However, SRSL/LS and FC have negative impacts on the prediction value, with SHAP values of -0.06 and -0.05, respectively, showing that these factors suppress the prediction value for this sample. It can be seen that the prediction result for sample 560 is primarily driven by the positive influences of FS and TATR, but the negative impacts of SRSL/LS and FC also play a role in inhibiting the prediction result.

The SHAP value analysis of the two samples with similar predicted values in Fig 9 reveals that different features have varying impacts on different samples. Therefore, explanations of input features should not be overly absolute to avoid misinterpretations.
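The force-plot reading above is an application of the additive decomposition in Eq. (24): the prediction equals a baseline plus signed per-feature SHAP values. A minimal sketch follows; the baseline value and the omission of all remaining features are hypothetical, while the four contributions echo the magnitudes reported for sample 62:

```python
# Additive reading of a single-sample SHAP explanation (Eq. 24).
# The baseline and the truncated feature set are hypothetical; the four
# contributions echo the magnitudes reported for sample 62 in Fig 9a.
baseline = 0.08
contrib = {"FS": 0.32, "BC": 0.11, "SRLS": 0.09, "FC": -0.09}

# The prediction is the baseline plus the signed SHAP contributions
prediction = baseline + sum(contrib.values())

positive = [k for k, v in contrib.items() if v > 0]   # features pushing up
negative = [k for k, v in contrib.items() if v < 0]   # features pushing down
```

In a full explanation every feature contributes a (possibly tiny) term, so the reconstructed prediction matches the model output exactly rather than approximately.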

6. Conclusion

This study investigates the issue of predicting corporate greenwashing behavior using a dataset of A-share listed companies from the Shanghai and Shenzhen stock exchanges between 2017 and 2022. A comprehensive dataset for predicting corporate greenwashing was first constructed. Then, an IHPO-XGBoost ensemble learning model was proposed to predict corporate ESG greenwashing behavior, with its effectiveness validated through comparative analysis. Finally, SHAP theory was employed to explain the model’s prediction results. The main findings of this study are as follows:

(1) Superior performance of the IHPO-XGBoost model: The model achieved R², RMSE, MAE, and adjusted R² scores of 0.9790, 0.1376, 0.1000, and 0.9785, respectively, demonstrating high accuracy and stability in predicting corporate ESG greenwashing behavior. The IHPO-XGBoost model outperforms the traditional HPO-XGBoost model across multiple metrics, confirming the effectiveness of the IHPO algorithm in hyperparameter optimization. Furthermore, compared to other optimization algorithms (such as WOA, SSA, GA, and BO) in combination with XGBoost, IHPO-XGBoost exhibits the best overall performance.

(2) Insights from SHAP theory: The application of SHAP theory enabled a more comprehensive understanding of the model’s predictions. The global explanation of sample features revealed that factors such as Firm Size, Shareholder Structure, and Total Asset Turnover Ratio play crucial roles in predicting corporate ESG greenwashing behavior. The feature interaction analysis demonstrated that the interplay between specific features significantly influences model predictions, helping to uncover the complex decision-making processes behind corporate ESG practices. The single-sample feature impact analysis further clarified the contribution of each feature to the prediction results, offering valuable insights into the context-specific dynamics of corporate ESG performance.

(3) Practical implications: The results have important implications for regulatory agencies and investors seeking to identify and assess corporate ESG greenwashing behavior. Understanding the key features and their interactions can facilitate more effective monitoring of corporate ESG disclosures, promoting transparency and reducing the risk of greenwashing. Corporate managers can also leverage these findings to optimize their ESG strategies, enhancing their sustainability efforts.

In conclusion, the IHPO-XGBoost model, combined with SHAP theory, provides a robust framework for predicting and managing corporate ESG greenwashing behavior. Future research could extend this analysis by incorporating longer time spans to explore the evolution of corporate ESG practices and by including additional data sources, such as social media and news reports, to improve the model’s predictive power and comprehensiveness.

Supporting information

S1 Dataset. ESG greenwashing prediction dataset.

https://doi.org/10.1371/journal.pone.0316287.s001

(XLSX)

References

  1. Yu EPY, Van Luu B, Chen CH. Greenwashing in environmental, social and governance disclosures. Research in International Business and Finance. 2020;52:101192.
  2. Tashman P, Marano V, Kostova T. Walking the walk or talking the talk? Corporate social responsibility decoupling in emerging market multinationals. J Int Bus Stud. 2019;50(2):153–71.
  3. Hora M, Subramanian R. Relationship between positive environmental disclosures and environmental performance: An empirical investigation of the greenwashing sin of the hidden trade‐off. J Ind Ecol. 2019;23(4):855–68.
  4. Liu Y, Li W, Wang L, Meng Q. Why greenwashing occurs and what happens afterwards? A systematic literature review and future research agenda. Environ Sci Pollut Res Int. 2023;30(56):118102–16. pmid:37932612
  5. He L, Gan S, Zhong T. The impact of green credit policy on firms’ green strategy choices: green innovation or green-washing? Environ Sci Pollut Res Int. 2022;29(48):73307–25. pmid:35622278
  6. Zhang D. Are firms motivated to greenwash by financial constraints? Evidence from global firms’ data. Journal of International Financial Management & Accounting. 2022;33(3):459–79.
  7. Xia F, Chen J, Yang X, Li X, Zhang B. Financial constraints and corporate greenwashing strategies in China. Corp Social Responsib Environ Manage. 2023;30(4):1770–81.
  8. Liu Y, Li W, Meng Q. Influence of distracted mutual fund investors on corporate ESG decoupling: evidence from China. Sustainability Accounting, Management and Policy Journal. 2023;14(1):184–215.
  9. Lin X, Zhu H, Meng Y. ESG greenwashing and equity mispricing: Evidence from China. Finance Research Letters. 2023;58:104606.
  10. Kaner G. Greenwashing: How difficult it is to be transparent to the consumer—H&M case study. Green Marketing in Emerging Markets: Strategic and Operational Perspectives. 2021;203–26.
  11. De Vries G, Terwel BW, Ellemers N, Daamen DD. Sustainability or profitability? How communicated motives for environmental policy affect public perceptions of corporate greenwashing. Corp Social Responsib Environ Manage. 2015;22(3):142–54.
  12. Mateo-Márquez AJ, González-González JM, Zamora-Ramírez C. An international empirical study of greenwashing and voluntary carbon disclosure. J Clean Prod. 2022;363:132567.
  13. Zhang D. Green financial system regulation shock and greenwashing behaviors: Evidence from Chinese firms. Energy Econ. 2022;111:106064.
  14. Liao F, Sun Y, Xu S. Financial report comment letters and greenwashing in environmental, social and governance disclosures: Evidence from China. Energy Econ. 2023;127:107122.
  15. Hu X, Hua R, Liu Q, Wang C. The green fog: Environmental rating disagreement and corporate greenwashing. Pacific-Basin Finance J. 2023;78:101952.
  16. Zeng F, Ni J. Research on bank credit decision-making in the case of historical data loss: Model construction based on prudent trust field and machine learning. Financial Regulation Research. 2020;(3):85–98.
  17. Zeng F, Ni J, Wang Y. Enterprise performance evaluation based on Entropy-VIKOR and AGA-BP model: take China’s listed logistics enterprises as an example. Journal of University of Shanghai for Science and Technology. 2022;44(1):94–102.
  18. Al-refai G, Elmoaqet H, Ryalat M. In-vehicle data for predicting road conditions and driving style using machine learning. Applied Sciences. 2022;12(18):8928.
  19. An Q, Rahman S, Zhou J, Kang JJ. A comprehensive review on machine learning in healthcare industry: classification, restrictions, opportunities and challenges. Sensors (Basel, Switzerland). 2023;23(9):4178. pmid:37177382
  20. Chen J, Liu Q, Gao L. Deep convolutional neural networks for tea tree pest recognition and diagnosis. Symmetry. 2021;13(11):2140.
  21. Niazkar M, Zakwan M, Goodarzi MR, Hazi MA. Assessment of Climate Change Impact on Water Resources Using Machine Learning Algorithms. J Water Clim Change. 2024;15(6):iii–vi.
  22. Chadee A, Narra M, Mehta D, Andrew J, Azamathulla H. Impact of climate change on water resource engineering in Trinidad and Tobago. LARHYSS Journal P-ISSN 1112-3680/E-ISSN 2521-9782. 2023;(55).
  23. Mendhe V, Kulkarni K, Nithya M, Olutoge F, Vedartham S, Chadee AA. Prediction of strength properties of ultra high-performance concrete by using artificial intelligence and machine learning techniques. Educational Administration: Theory and Practice. 2024;30(5):6479–83.
  24. Bhavana B, Kavitha MS, Bobade SS, Satyanarayana A, Chadee AA, Ravitheja A. Prediction of tensile strength of basalt fiber reinforced concrete by using artificial intelligence and machine learning techniques. African Journal of Biological Sciences. 6(Si4):3914–20.
  25. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. p. 785–94.
  26. Niazkar M, Menapace A, Brentan B, Piraei R, Jimenez D, Dhawan P, et al. Applications of XGBoost in water resources engineering: A systematic literature review (Dec 2018–May 2023). Environmental Modelling & Software. 2024;174:105971.
  27. Meddage DPP, Mohotti D, Wijesooriya K. Predicting transient wind loads on tall buildings in three-dimensional spatial coordinates using machine learning. Journal of Building Engineering. 2024;85:108725.
  28. Meddage DPP, Fonseka I, Mohotti D, Wijesooriya K, Lee CK. An explainable machine learning approach to predict the compressive strength of graphene oxide-based concrete. Constr Build Mater. 2024;449:138346.
  29. Ekanayake IU, Meddage DPP, Rathnayake U. A novel approach to explain the black-box nature of machine learning in compressive strength predictions of concrete using Shapley additive explanations (SHAP). Case Stud Constr Mater. 2022;16:e01059.
  30. Shi H, Shen Y, Li L. Early prediction of acute kidney injury in patients with gastrointestinal bleeding admitted to the intensive care unit based on extreme gradient boosting. Frontiers in Medicine. 2023;10:1221602. pmid:37720504
  31. Wang W, Xiong W, Wang J, Tao L, Li S, Yi Y, et al. A user purchase behavior prediction method based on XGBoost. Electronics. 2023;12(9):2047.
  32. Ukwaththa J, Herath S, Meddage DPP. A review of machine learning (ML) and explainable artificial intelligence (XAI) methods in additive manufacturing (3D printing). Mater Today Commun. 2024;1102;94.
  33. Perera UAKK, Coralage DTS, Ekanayake IU, Alawatugoda J, Meddage DPP. A new frontier in streamflow modeling in ungauged basins with sparse data: A modified generative adversarial network with explainable AI. Results in Engineering. 2024;21:101920.
  34. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems. 2017;30.
  35. Yu H, Wang X, Ren B, Zheng M, Wu G, Zhu K. IAO-XGBoost ensemble learning model for seepage behavior analysis of earth-rock dam and interpretation of prediction results. J Hydraul Eng. 2023;54(10):1195–209.
  36. Wu Z. Evaluation of Provincial Economic Resilience in China Based on the TOPSIS‐XGBoost‐SHAP Model. Journal of Mathematics. 2023;2023(1):1–12.
  37. Yi X, Sun J, Wu X. Novel Feature-Based Difficulty Prediction Method for Mathematics Items Using XGBoost-Based SHAP Model. Mathematics. 2024;12(10):1455.
  38. Mao C, Xu W, Huang Y, Zhang X, Zheng N, Zhang X. Investigation of Passengers’ Perceived Transfer Distance in Urban Rail Transit Stations Using XGBoost and SHAP. Sustainability. 2023;15(10):7744.
  39. Wang Z, Jiao P, Wang J, Luo W, Lu H. Contributing factors on the level of delay caused by crashes: A hybrid method of latent class analysis and XGBoost based SHAP algorithm. Journal of Transportation Safety & Security. 2024;16(2):97–129.
  40. Huang J, Peng Y, Hu L. A multilayer stacking method base on RFE-SHAP feature selection strategy for recognition of driver’s mental load and emotional state. Expert Syst Appl. 2024;238:121729.
  41. Ide Y, Ozaki S, Yamashiro M, Kodama M. Development and improvement of a method for determining the worst-case typhoon path for storm surge deviation through Bayesian optimization. Eng Appl Artif Intell. 2024;132:107950.
  42. Wang X, Jin Y, Schmitt S, Olhofer M. Recent advances in Bayesian optimization. ACM Comput Surv. 2023;55(13s):1–36.
  43. Peng K, Peng Y, Li W. Research on customer churn prediction and model interpretability analysis. PLoS One. 2023;18(12):e0289724. pmid:38064499
  44. An J, Guo W, Lv T, Zhao Z, He C, Zhao H. Joint prediction of the state of charge and the state of health of lithium-ion batteries based on the PSO-XGBoost algorithm. Energies. 2023;16(10):4243.
  45. Ramshankar N, Joe Prathap PM. Reviewer reliability and XGBoost whale optimized sentiment analysis for online product recommendation. Journal of Intelligent & Fuzzy Systems. 2023;44(1):1547–62.
  46. Naruei I, Keynia F, Sabbagh Molahosseini A. Hunter-prey optimization: Algorithm and applications. Soft Comput. 2022;26(3):1279–314.
  47. Xiang C, Gu J, Luo J, Qu H, Sun C, Jia W, et al. Structural damage identification based on convolutional neural networks and improved hunter–prey optimization algorithm. Buildings. 2022;12(9):1324.
  48. Zhang J, Li Z. Heavy is the Head that Wears the Crown: Responsibility and Fulfillment of Supply-chain Network Center Enterprises, from the Perspective of ESG. Foreign Economics & Management. 2024;07:86–101.
  49. Yao Q, Hu H, Feng Y. A Literature Review of Corporate Greenwashing and Prospects. Ecological Economy. 2022;03:86–92+108.
  50. Wang J, Liu X, Yu X, Zhou Z. Bank-Enterprise ESG Consistency and Post-loan Corporate Strategic ESG Behavior. Journal of Finance and Economics. 2024;04:109–23.
  51. Horobet A, Smedoiu-Popoviciu A, Oprescu R, Belascu L, Pentescu A. Seeing through the haze: greenwashing and the cost of capital in technology firms. Environment, Development and Sustainability. 2024;1–32.
  52. Cho CH, Laine M, Roberts RW, Rodrigue M. Organized hypocrisy, organizational façades, and sustainability reporting. Account Organ Soc. 2015;40:78–94.
  53. Liao J, Zhan Y, Zhao X. Two tigers cannot live on the same mountain: The impact of the second largest shareholder on controlling shareholder’s tunneling behavior. PLoS One. 2023;18(6):e0287642. pmid:37379292
  54. Zeng F, Sun H. Spatial Network Analysis of Coupling Coordination between Digital Financial Inclusion and Common Prosperity in the Yangtze River Delta Urban Agglomeration. Mathematics. 2024;12(9):1285.
  55. Zhang J, Yuan J, Mahmoudi A, Ji W, Fang Q. A data-driven framework for conceptual cost estimation of infrastructure projects using XGBoost and Bayesian optimization. Journal of Asian Architecture and Building Engineering. 2023;1–24.
  56. Chen YW, Lin CJ. Combining SVMs with various feature selection strategies. Feature Extraction: Foundations and Applications. 2006;315–24.
  57. Wang Z. Transformer fault diagnosis research based on feature parameter preference and sparrow optimization. Master’s thesis, Liaoning Technical University; 2022.
  58. Feng DC, Wang WJ, Mangalathu S, Taciroglu E. Interpretable XGBoost-SHAP machine-learning model for shear strength prediction of squat RC walls. J Struct Eng. 2021;147(11):04021173.
  59. Chen X, Jia J, Bai Y, Guo T, Du X. Prediction model of axial bearing capacity of concrete-filled steel tube columns based on XGBoost-SHAP. Journal of Zhejiang University (Engineering Science). 2023;56(6):1061–70.
  60. Demidova LA, Gorchakov AV. A study of chaotic maps producing symmetric distributions in the fish school search optimization algorithm with exponential step decay. Symmetry. 2020;12(5):784.
  61. Barua S, Merabet A. Lévy Arithmetic Algorithm: An enhanced metaheuristic algorithm and its application to engineering optimization. Expert Syst Appl. 2024;241:122335.
  62. Wang X, Li J, Shao L, Liu H, Ren L, Zhu L. Short-term wind power prediction by an extreme learning machine based on an improved hunter–prey optimization algorithm. Sustainability. 2023;15(2):991.
  63. Fuladipanah M, Shahhosseini A, Rathnayake N, Azamathulla HM, Rathnayake U, Meddage DPP, et al. In-depth simulation of rainfall–runoff relationships using machine learning methods. Water Practice & Technology. 2024;19(6):2442–59.
  64. Kisi O, Azamathulla HM, Cevat F, Kulls C, Kuhdaragh M, Fuladipanah M. Enhancing river flow predictions: Comparative analysis of machine learning approaches in modeling stage-discharge relationship. Results in Engineering. 2024;22:102017.
  65. Shehadeh A, Alshboul O, Al Mamlook RE, Hamedat O. Machine learning models for predicting the residual value of heavy construction equipment: An evaluation of modified decision tree, LightGBM, and XGBoost regression. Autom Constr. 2021;129:103827.
  66. Elmousalami HH. Artificial intelligence and parametric construction cost estimate modeling: State-of-the-art review. Journal of Construction Engineering and Management. 2020;146(1):03119008.
  67. Duan Y, Zhang J, Wang X, Feng M, Ma L. Forecasting carbon price using signal processing technology and extreme gradient boosting optimized by the whale optimization algorithm. Energy Science & Engineering. 2024;12(3):810–34.
  68. Li X, Wang Z, Yang C, Bozkurt A. An advanced framework for net electricity consumption prediction: Incorporating novel machine learning models and optimization algorithms. Energy. 2024;296:131259.
  69. Sun H, Luo Q, Xia Z, Li Y, Yu Y. Bottomhole Pressure Prediction of Carbonate Reservoirs Using XGBoost. Processes. 2024;12(1):125.
  70. Yu G, Jin Y, Hu M, Li Z, Cai R, Zeng R, et al. Improved Machine Learning Model for Urban Tunnel Settlement Prediction Using Sparse Data. Sustainability. 2024;16(11):4693.