
A hybrid machine learning framework for land use carbon accounting: A case study of Tanzania

  • Talemwa Byomutonzi Johansen ,

    Roles Conceptualization, Formal analysis, Investigation, Validation, Visualization, Writing – original draft, Writing – review & editing

    johansent@nm-aist.ac.tz

    Affiliation Department of Applied Mathematics and Computational Science, Nelson Mandela African Institution of Science and Technology (NM-AIST), Arusha, Tanzania

  • Mwema Felix Mwema,

    Roles Conceptualization, Supervision, Writing – review & editing

    Affiliation Department of Materials Science and Engineering, Nelson Mandela African Institution of Science and Technology (NM-AIST), Arusha, Tanzania

  • Silas Steven Mirau,

    Roles Conceptualization, Supervision, Writing – review & editing

    Affiliation Department of Applied Mathematics and Computational Science, Nelson Mandela African Institution of Science and Technology (NM-AIST), Arusha, Tanzania

  • Verdiana Grace Masanja

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Supervision, Writing – review & editing

    Affiliations Department of Applied Mathematics and Computational Science, Nelson Mandela African Institution of Science and Technology (NM-AIST), Arusha, Tanzania, Department of Science Education, Kampala International University in Tanzania (KUIT), Dar es salaam, Tanzania

Abstract

This study presents a structured integration of land-use carbon accounting with regression, time-series, and machine-learning models to examine historical patterns and prospective trajectories of land-use related CO2 emissions in data-constrained settings. Rather than proposing a new accounting methodology, the framework demonstrates how established modeling approaches can be combined to support comparative analysis and scenario exploration. The study demonstrates the framework’s application through a case study of Tanzania, integrating multi-source land-use and socio-economic data within a hybrid ensemble of multiple linear regression, ARIMA time-series modeling, and machine learning approaches (Random Forest and XGBoost). The analysis indicates that land-use change explains a substantial share of modeled emissions variability within the accounting-consistent framework, with cropland-to-forest conversion associated with comparatively larger modeled emission reductions under the explored scenarios. These results reflect model-based associations rather than causal dominance and are conditional on the accounting structure and scenario assumptions. In contrast, socio-economic drivers, particularly urbanization and economic growth, were associated with variations in modeled emissions, although the magnitude and direction of these effects depend on model specification. Scenario analysis suggests that a 20% conversion of cropland to forest is associated with a reduction in modeled emissions from 24,339 to 18,041 metric tons, whereas combined urbanization and GDP growth increase projected emissions. Machine-learning models, particularly XGBoost, exhibited lower prediction errors under the adopted validation design; however, because the data are temporally indexed, these results should be interpreted as measures of internal predictive consistency rather than strict out-of-sample forecasting accuracy.
Overall, the framework provides evidence-informed and conditional insights that may support prioritization of land-based mitigation options in developing-country contexts under the Paris Agreement, while remaining contingent on model specification, data structure, and validation design.

1 Introduction

Climate change remains one of the most urgent environmental and socio-economic challenges of the 21st century, primarily driven by the unprecedented rise in anthropogenic greenhouse gas (GHG) emissions, with carbon dioxide (CO2) being the most dominant contributor [1–3]. The Intergovernmental Panel on Climate Change (IPCC) reports that land-based systems play a dual role in climate change: they are both a significant source of emissions and a critical sink capable of mitigating the effects of atmospheric CO2 accumulation [4]. Globally, land use change, including deforestation, agricultural expansion, grassland degradation, and urbanization, accounts for approximately 10–15% of total anthropogenic CO2 emissions [5–7]. When forests and natural ecosystems are converted to croplands, pastures, or settlements, carbon stored in vegetation and soils is released into the atmosphere, disrupting the carbon balance of terrestrial ecosystems. Conversely, land management strategies such as afforestation, reforestation, and agroforestry can enhance carbon sequestration capacity, offsetting a portion of emissions [8,9].

Despite global agreements like the Paris Accord urging countries to integrate land use management into climate mitigation strategies, the capacity to quantify, predict, and manage land-based CO2 emissions varies significantly across nations [10]. While developed economies benefit from comprehensive monitoring networks and advanced modeling tools, many developing countries, particularly in sub-Saharan Africa, face substantial challenges [11]. Data gaps are a primary concern, as reliable, long-term datasets on land cover, land use change, and carbon stocks are often scarce or inconsistent [12]. Furthermore, aggregation issues arise from the need to scale up local and regional data to national and international levels, which can lead to inaccuracies and a loss of critical detail. Finally, the scenario simplification inherent in many models often overlooks the complex, site-specific socio-economic drivers that influence land use decisions, such as informal land tenure systems, subsistence farming practices, and diverse livelihood strategies [13,14]. This limitation hampers the formulation of effective, evidence-based land management and climate policies.

To address these challenges, this study develops a structured integration of land-use carbon accounting with complementary statistical, time-series, and machine-learning models. Rather than proposing a new accounting methodology or algorithmic innovation, the framework demonstrates how established approaches can be combined within a transparent analytical pipeline to support comparative assessment, scenario exploration, and sensitivity analysis under data-constrained conditions. The framework is applied through a national-scale case study of Tanzania, which is experiencing rapid land-use change and exemplifies the data and capacity constraints common across sub-Saharan Africa [15–17].

In the African context, land use change is strongly intertwined with population growth, economic development, and food security concerns [18]. Tanzania, as one of the fastest growing economies in East Africa, is experiencing rapid population expansion, accelerating urbanization, and the continuous conversion of forests and grasslands into croplands to meet agricultural demands. The National Forest Resources Monitoring and Assessment (NAFORMA) has reported a net loss of forest cover, while the expansion of settlements and infrastructural projects continues to alter the country’s carbon dynamics [19]. These pressures highlight an urgent need to understand how socio-economic drivers interact with land transitions to influence CO2 emissions, not only for historical assessment but also for conditional forecasting under alternative development and policy scenarios.

Previous research on land use and CO2 emissions has been dominated by two broad approaches: (i) empirical estimation methods that quantify emissions based on observed land cover changes and default emission factors [20,21], and (ii) process-based or global carbon cycle models that simulate biogeochemical processes at large spatial scales [22,23]. While both approaches have produced valuable insights, they often face limitations in regional applications due to scale mismatches, limited integration of socio-economic drivers, and generalized parameterization that may not capture local land dynamics. In Tanzania, existing studies have largely focused on sector-specific emissions, such as those from forestry or agriculture, without jointly modeling urbanization, GDP growth, and rural–urban migration within a unified analytical framework. The application of advanced statistical and machine-learning methods including multiple linear regression (MLR), autoregressive integrated moving average (ARIMA) models, and ensemble tree-based approaches such as Random Forest and XGBoost has also remained limited, despite their demonstrated capacity to model non-linear relationships in environmental data [24,25].

Within this context, the contribution of this study lies in the coordinated application and benchmarking of multiple modeling approaches (MLR, ARIMA, Random Forest, and XGBoost) using a consistent land-use accounting structure and shared data inputs. The analysis distinguishes between accounting-based emissions estimation, statistical association, and predictive modeling, without asserting causal identification or decision-ready optimization. Machine-learning components are employed to explore non-linear patterns and comparative predictive behavior rather than to claim algorithmic superiority or definitive forecasting accuracy. The overarching objective of this research is therefore threefold. First, it applies the integrated framework to examine historical patterns and prospective trajectories of land-use–related CO2 emissions in Tanzania. Second, it benchmarks commonly used statistical, time-series, and machine-learning models under a consistent data structure and validation design. Third, it employs scenario-based analysis and sensitivity metrics to explore the relative responsiveness of modeled emissions to alternative land-use transitions and socio-economic drivers. The resulting insights are intended to be evidence-informed and conditional, supporting strategic prioritization and comparative understanding rather than causal inference or final policy prescription.

2 Materials and methods

This study analyzes the dynamics of CO2 emissions in Tanzania by integrating land use and socio-economic data within a multi-model analytical framework. The focus is on major land use categories (cropland, forest land, grassland, other land, and settlements) alongside key socio-economic drivers, including urbanization rate, rural population, and GDP. To capture the relationships between these factors and emissions, the study employs a suite of modeling approaches: Multiple Linear Regression (MLR) to identify linear relationships, Autoregressive Integrated Moving Average (ARIMA) models to capture temporal dependencies, and ensemble machine learning algorithms (Random Forest and XGBoost) to model complex, non-linear interactions. Finally, policy scenario analyses are conducted to evaluate the impacts of alternative land management and urbanization strategies on projected CO2 emissions, thereby providing evidence to inform national climate commitments and sustainable development goals.

2.1 Study area and scope

This study focuses on quantifying and modeling the relationship between land-use change and CO2 emissions within the spatial and temporal limits of the available national-level data. The analysis covers annual records from 1996 to 2021 and includes major land-use categories such as cropland, forest, urban land, and other land types, alongside socio-economic indicators including urbanization rate, rural population, and gross domestic product (GDP).

Although the temporal coverage comprises 26 annual observations, the analytical dataset is structured in a stacked (long-format) form, in which each observation corresponds to a specific land-use category or land-use transition within a given year. As a result, the effective sample size used in regression and machine-learning analyses reflects the combination of multiple land-use categories across time rather than a single national aggregate per year. This structure allows the framework to exploit cross-category variation while maintaining consistency with national-level carbon accounting. The scope of the analysis is intentionally restricted to the national scale. This enables aggregation of land-use and socio-economic metrics that are commonly available in data-scarce contexts, while still permitting identification of temporal patterns, relative sensitivities, and comparative dynamics among land-use pathways. The framework does not attempt spatially explicit modeling or sub-national attribution, and all results should be interpreted within the constraints imposed by national aggregation and data availability.

2.2 Emissions data sources and calculation methodology

The core CO2 emissions data for this study were sourced from the Greenhouse Center of CO2 Emissions at the Sokoine University of Agriculture (SUA). This center is a recognized national entity responsible for compiling Tanzania’s land-use change and forestry emissions in accordance with the Intergovernmental Panel on Climate Change (IPCC) guidelines [26–28].

The emissions estimates are calculated using a Tier 2 methodology, as defined by the IPCC. This represents a transition from global default emission factors (Tier 1) to country-specific parameters derived from national forest inventories and empirical studies [29,30]. The Tier 2 approach incorporates domestic data on carbon stocks, forest growth rates, and land-use transitions, including those provided by the National Forest Resources Monitoring and Assessment (NAFORMA) program [31]. Emissions are estimated using the IPCC gain–loss method, which quantifies changes in carbon stocks across major carbon pools, including above-ground biomass, below-ground biomass, dead wood, litter, and soil organic carbon.

Operationally, the emissions accounting follows a structured workflow that links land-use activity data to emission estimates in a model-ready format. First, annual land-use areas are classified by category (e.g., forest, cropland, settlements) and by land-use transitions (e.g., forest-to-cropland, cropland-to-forest). Second, activity data describing the extent of each land-use category or transition are combined with Tier 2 emission factors specific to the corresponding carbon pools. Third, changes in carbon stocks are calculated for each land-use transition using the gain–loss approach, producing annual net CO2 emission values expressed in metric tons.

Emissions are therefore computed at the level of individual land-use categories and transitions before aggregation. This allows emissions to be consistently attributed to specific land-use dynamics while preserving compatibility with national totals. The resulting dataset is structured in a long-format form, where each observation corresponds to a specific land-use category or transition in a given year, rather than a single aggregated national value. These emissions entries serve as direct inputs to the statistical, time-series, and machine-learning models applied in subsequent analyses. This methodology produces net emissions values that represent the balance between carbon releases and removals, such that:

  1. Carbon sequestration (negative values) corresponds to net CO2 uptake through vegetation growth and soil carbon accumulation, for example in reforestation or afforestation following cropland abandonment.
  2. Carbon emissions (positive values) correspond to CO2 releases associated with deforestation, biomass burning, and soil disturbance during land conversion.

Accordingly, negative CO2 emission values reported in this study are not artifacts or calculation errors but reflect periods in which Tanzania’s land sector functioned as a net carbon sink. These outcomes are methodologically consistent with the IPCC Tier 2 framework and arise directly from accounting-based estimates of carbon stock gains exceeding losses under specific land-use transitions.
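The sign convention above can be illustrated with a minimal sketch of gain–loss accounting. The transition names, areas, and per-hectare carbon gains and losses are invented for illustration and are not the Tier 2 factors used in the study; only the 44/12 carbon-to-CO2 conversion is a standard constant.

```python
C_TO_CO2 = 44.0 / 12.0  # molecular-weight ratio converting tonnes C to tonnes CO2

# Hypothetical activity data: (area in ha, carbon gain t C/ha/yr, carbon loss t C/ha/yr).
# These values are assumptions made for the example, not data from the study.
TRANSITIONS = {
    "forest_to_cropland": (1200.0, 0.0, 3.0),
    "cropland_to_forest": (800.0, 1.5, 0.0),
}

def net_co2(area_ha, gain_c, loss_c):
    """Net CO2 in tonnes: positive = emission, negative = removal (sink)."""
    stock_change_c = (gain_c - loss_c) * area_ha  # gain-loss carbon stock change
    return -stock_change_c * C_TO_CO2

# Emissions are computed per transition first, then aggregated to a national total.
per_transition = {name: net_co2(*values) for name, values in TRANSITIONS.items()}
national_net = sum(per_transition.values())

for name, value in per_transition.items():
    print(f"{name}: {value:+.1f} t CO2")
print(f"national net: {national_net:+.1f} t CO2")
```

In this toy case the cropland-to-forest entry is negative (a removal), while the national total remains positive because losses from forest clearing exceed the gains.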

2.3 Data sources and preprocessing

The dataset comprises annual observations from 1996 to 2021. It includes net CO2 emissions (in metric tons), land use proportions (forest, cropland, grassland, settlements, other land), and socio-economic indicators (gross domestic product, rural and urban population). Data were compiled from national sources (NAFORMA) and international repositories (World Bank, FAOSTAT), with core emissions data provided by the Greenhouse Center of CO2 Emissions at the Sokoine University of Agriculture. For analytical purposes, the original national-level time series was restructured into a stacked panel format. Each observation (row) in the final dataset represents a specific land-use category or land-use transition in a given year. Thus, the unit of analysis is defined as land-use category (or transition) × year, rather than a single aggregated national observation per year. This restructuring yields multiple observations per year and explains the larger regression degrees of freedom reported in subsequent analyses.

Land-use categories and transitions are mutually exclusive and collectively exhaustive at the national scale, such that their aggregated areas are consistent with national land totals for each year. Emissions are attributed to individual land-use categories or transitions based on accounting-consistent allocation prior to aggregation. Missing values were imputed using multiple imputation with chained equations to preserve the temporal structure of the dataset. Continuous predictors were examined for skewness and potential variance instability. Categorical land-use categories and transitions were encoded using dummy variables, with one category omitted as the reference group to avoid multicollinearity. These dummy variables allow regression and machine-learning models to estimate differential emission responses associated with distinct land-use pathways while maintaining national accounting consistency.
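A minimal sketch of the dummy encoding described above is given below. The category labels are hypothetical, and the choice of reference level follows the baseline named in the regression section (Cropland Remaining Cropland); the helper `dummy_encode` is introduced here for illustration only.

```python
REFERENCE = "cropland_remaining_cropland"  # omitted baseline category (assumed label)

def dummy_encode(categories, reference):
    """Map each category label to a 0/1 vector, omitting the reference level."""
    levels = sorted(set(categories) - {reference})
    rows = [[1 if cat == level else 0 for level in levels] for cat in categories]
    return levels, rows

# Hypothetical land-use labels for four stacked observations.
observed = [
    "forest_to_cropland",
    "cropland_remaining_cropland",
    "cropland_to_forest",
    "forest_to_cropland",
]
levels, rows = dummy_encode(observed, REFERENCE)
print(levels)
print(rows)
```

Observations in the reference category receive an all-zero row, so each dummy coefficient is read as a difference relative to that baseline.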

All analyses were conducted using Python (version 3.12) and R (version 4.5.1), employing the packages forecast, randomForest, and xgboost. Scripts and preprocessing pipelines were documented to ensure reproducibility. Exploratory diagnostics evaluated logarithmic transformations of CO2 emissions and selected socio-economic predictors to assess variance stabilization and alternative functional forms. However, the final regression specification retains CO2 emissions in level form (metric tons) to preserve direct policy interpretation of estimated coefficients. Consequently, a coefficient β on a log-transformed predictor is interpreted as a semi-elasticity: a 1% change in the predictor is associated with a change of approximately β/100 metric tons in CO2 emissions. Coefficients on variables expressed in levels represent marginal effects measured directly in metric tons. The detailed regression specification and interpretation of coefficients are presented in Section 2.4, and the structure of the processed dataset is summarized in Table 1.
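The semi-elasticity reading of a level-log coefficient can be checked numerically. The sketch below assumes a single log-transformed predictor and a purely hypothetical coefficient value; it compares the exact change implied by a 1% increase with the usual coefficient-divided-by-100 approximation.

```python
import math

def level_log_change(beta, pct_change):
    """Change in y implied by y = b0 + beta * ln(x) when x changes by pct_change percent."""
    return beta * math.log(1.0 + pct_change / 100.0)

beta = 5000.0                        # hypothetical coefficient on a log predictor (metric tons)
exact = level_log_change(beta, 1.0)  # exact change for a 1% increase
approx = beta / 100.0                # rule-of-thumb approximation

print(f"exact: {exact:.2f} t, approximation: {approx:.2f} t")
```

The two values agree closely for small percentage changes; the approximation degrades as the change grows.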

Table 1. Structure of the processed analytical dataset used in the modelling framework.

https://doi.org/10.1371/journal.pclm.0000952.t001

2.4 Multiple linear regression (MLR) model

The Multiple Linear Regression (MLR) model was used as a baseline to quantify the relationship between CO2 emissions and explanatory variables [32]. Although logarithmic transformations of emissions and socio-economic predictors were explored during diagnostic analysis, the final regression model was estimated using CO2 emissions expressed in level form (metric tons) and socio-economic predictors primarily in their original units. The empirical specification is therefore expressed as:

$$y_t = \beta_0 + \sum_{i=1}^{k} \beta_i X_{i,t} + \varepsilon_t \tag{1}$$

where $y_t$ denotes CO2 emissions in year $t$, $X_{i,t}$ represents socio-economic predictors and land-use transition dummy variables, $\beta_0$ and $\beta_i$ are regression coefficients, and $\varepsilon_t$ is the error term. Under this specification, the coefficients represent marginal effects measured as the change in CO2 emissions (metric tons) associated with a one-unit change in the corresponding predictor.

Land-use variables enter the regression as mutually exclusive categorical indicators (dummy variables) representing land-use categories or land-use transitions, rather than as continuous proportional shares. One land-use category is omitted as the reference group to avoid perfect multicollinearity arising from the compositional constraint that land-use shares sum to one at the national level. In this study, Cropland Remaining Cropland is used as the reference category because it represents the dominant and most stable land-use state over the study period. All reported land-use coefficients therefore measure relative differences in emissions with respect to this baseline category. The model was estimated using the ordinary least squares (OLS) estimator:

$$\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y \tag{2}$$
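As an illustration of the OLS estimator, the sketch below evaluates the standard closed form, the inverse of X'X applied to X'y, on a small invented design matrix containing an intercept, one land-use dummy, and one continuous driver. The data and helper functions are assumptions made for this example; the study itself relied on standard statistical packages.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def ols(X, y):
    """Return the OLS coefficients by solving the normal equations (X'X) b = X'y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(v * t for v, t in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)

# Columns: intercept, dummy for a non-reference transition, a continuous driver.
# The response was generated as y = 10 - 5 * dummy + 2 * driver (invented values).
X = [[1, 0, 1.0], [1, 0, 2.0], [1, 1, 1.0], [1, 1, 2.0], [1, 0, 3.0], [1, 1, 3.0]]
y = [12.0, 14.0, 7.0, 9.0, 16.0, 11.0]
beta_hat = ols(X, y)
print([round(b, 6) for b in beta_hat])
```

The dummy coefficient recovered here is the emissions difference relative to the omitted reference category, holding the continuous driver fixed.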

Because the analytical dataset is structured in stacked form with multiple land-use transitions observed within each year, socio-economic covariates are shared across observations belonging to the same year. This implies that observations within a given year are not fully independent, as they contain repeated information on key explanatory variables, leading to within-year dependence and a reduction in the effective sample size relative to the nominal number of observations.

This structure may induce within-year correlation in regression residuals, potentially violating the classical independence assumption of ordinary least squares. If unaddressed, such dependence can result in underestimated standard errors and overstated statistical significance. To address this issue, heteroskedasticity-robust standard errors clustered at the year level were computed. Clustered standard errors account for potential correlation among observations within the same year while preserving independence across years, providing more reliable inference for regression coefficients. However, while clustering improves the robustness of inference, it does not fully eliminate the implications of the stacked data structure, and the estimated coefficients may still reflect inflated precision due to repeated covariate information. Accordingly, statistical significance should be interpreted conservatively. This issue is well recognized in panel and grouped-data settings, where repeated observations sharing common covariates inflate the nominal sample size relative to the effective one and can lead to overstated precision [33–35]; such observations increase apparent precision without contributing independent information [36,37].

Under the level-based specification used in this study, regression coefficients represent marginal effects. Each coefficient indicates the expected change in CO2 emissions (in metric tons) associated with a one-unit increase in the corresponding predictor, holding other variables constant.

Because land-use category indicators are derived from the same accounting framework used to construct national CO2 emissions, their inclusion creates a structural linkage between predictors and the dependent variable. As a result, the high explanatory power of the MLR model should be interpreted as reflecting accounting consistency and association rather than independent causal explanation. MLR was chosen because of its interpretability and ability to provide proportional effect estimates that are useful for scenario-based policy simulations [38]. However, the model assumes linearity in parameters, independence of errors, and homoscedasticity, which may not always hold in complex environmental systems [39]. Residual diagnostics were conducted to evaluate the validity of the regression assumptions. The residual-versus-fitted plot indicated that residuals were randomly scattered around the zero line without a systematic pattern, suggesting approximate homoscedasticity. In addition, the normal Q-Q plot showed that residuals were approximately normally distributed (see S1 Fig and S2 Fig).

2.5 Autoregressive integrated moving average (ARIMA) and ARIMA with exogenous variables (ARIMAX)

Time-series analysis was implemented to capture temporal dependencies in emissions data [40,41]. In contrast to the regression and machine-learning models, which are estimated on a stacked dataset of land-use transitions and socio-economic predictors, the ARIMA and ARIMAX models are fitted to a single aggregated national time series of annual CO2 emissions. The time-series analysis therefore comprises 26 annual observations (1996–2021), with each observation representing total national land-sector emissions for a given year. The ARIMA model is formulated as:

$$\phi(B)\,(1 - B)^{d}\, y_t = \theta(B)\,\varepsilon_t \tag{3}$$

where $B$ is the backshift operator ($B y_t = y_{t-1}$), $\phi(B)$ and $\theta(B)$ are polynomials representing the autoregressive (AR) and moving average (MA) components of orders $p$ and $q$, respectively, and $d$ denotes the degree of differencing. The ARIMAX model extends ARIMA by incorporating one or more exogenous (independent) variables $X_t$ that can help explain variations in the dependent variable $y_t$. The general form is:

$$\phi(B)\,(1 - B)^{d}\, y_t = \omega(B)\, X_t + \theta(B)\,\varepsilon_t \tag{4}$$

where $\omega(B)$ is a polynomial capturing the dynamic effects of the exogenous variables, and $X_t$ may represent climate indicators, socio-economic variables, or other relevant external drivers (like population and GDP) of emissions. Model identification was performed using autocorrelation (ACF) and partial autocorrelation (PACF) plots, with parameters selected via the Akaike Information Criterion (AIC). ARIMAX is chosen for its proven capability to combine historical time-series patterns with influential external drivers, improving forecasting accuracy in environmental modeling [42].
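To make the roles of differencing and the autoregressive term concrete, the following sketch fits a deliberately simplified ARIMA(1,1,0)-style model by conditional least squares and produces a one-step-ahead forecast. It omits the moving-average and exogenous terms, uses an invented series, and is not the estimation routine of the forecast package used in the study.

```python
def difference(series, d=1):
    """Apply (1 - B)^d by repeated first differencing."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

def fit_ar1(z):
    """Least-squares estimate of phi in z_t = phi * z_{t-1} + e_t (no intercept)."""
    num = sum(prev * cur for prev, cur in zip(z, z[1:]))
    den = sum(prev * prev for prev in z[:-1])
    return num / den

def forecast_next(series):
    """One-step-ahead forecast: last level plus the predicted next difference."""
    z = difference(series, d=1)
    phi = fit_ar1(z)
    return series[-1] + phi * z[-1]

emissions = [100.0, 104.0, 106.0, 107.0, 107.5]  # invented annual values
diffs = difference(emissions)
phi_hat = fit_ar1(diffs)
next_value = forecast_next(emissions)
print(diffs, phi_hat, next_value)
```

In practice the orders p, d, and q would be chosen from ACF/PACF plots and AIC as described above, rather than fixed in advance as in this sketch.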

2.6 Random forest (RF)

Random Forest is a non-parametric ensemble learning algorithm that constructs multiple decision trees and aggregates their predictions [43,44]. Given training data $\{(x_i, y_i)\}_{i=1}^{n}$, RF builds $B$ bootstrap samples and fits a regression tree to each, selecting a random subset of $m$ predictors at each split to reduce correlation among trees. The prediction for a new observation $x$ is:

$$\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x) \tag{5}$$

where $T_b$ denotes the $b$-th regression tree. The Random Forest model is estimated on the same stacked national dataset used for the regression analysis, where each observation corresponds to a specific land-use transition–year combination augmented with contemporaneous socio-economic indicators. Land-use transition variables are encoded as mutually exclusive categorical indicators for each year, ensuring that each unit of observation represents a distinct transition state within the national land-use accounting framework.

The model was evaluated using a temporal 80/20 train–test split, where the earliest 80% of observations were used for training and the most recent 20% for testing. Random Forest is robust to overfitting, accommodates nonlinear relationships, and provides permutation-based measures of variable importance, quantified here by the percentage increase in mean squared error (%IncMSE) when a predictor is permuted [45]. This approach was used for feature importance assessment, indicating that land use type was the dominant predictor with a %IncMSE of 62.3%. Because the underlying dataset is temporally indexed, standard random k-fold cross-validation may allow information from later years to enter the training set when predicting earlier observations. In the present implementation, cross-validation was primarily used for hyperparameter tuning rather than formal forecast evaluation. Consequently, predictive performance should be interpreted as conditional on this validation design rather than as strict out-of-sample forecasting skill. Hyperparameter tuning for the Random Forest model was conducted using grid search within the training dataset. The parameters considered included the number of trees (ntree), the number of variables randomly selected at each split (mtry), and the minimum node size. The optimal parameters were selected based on minimizing the root mean squared error (RMSE) on the training data. To ensure reproducibility, a fixed random seed was used during model training.
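The temporal 80/20 split can be sketched as follows. The row layout (a `year` field plus a transition label) and the helper name are assumptions for illustration; the key property is that no test observation precedes any training observation.

```python
def temporal_split(rows, train_fraction=0.8, year_key="year"):
    """Sort rows chronologically and cut once: earliest rows train, latest rows test."""
    ordered = sorted(rows, key=lambda r: r[year_key])
    cut = int(round(len(ordered) * train_fraction))
    return ordered[:cut], ordered[cut:]

# Hypothetical stacked data: two land-use transitions observed in each of ten years.
rows = [{"year": y, "transition": t} for y in range(1996, 2006) for t in ("a", "b")]
train, test = temporal_split(rows)

print(len(train), len(test))
print(max(r["year"] for r in train), min(r["year"] for r in test))
```

With stacked data, a cut defined on row counts can fall inside a year; splitting on year boundaries instead is one way to keep each year entirely in either the training or the test set.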

2.7 Extreme gradient boosting (XGBoost)

XGBoost is a gradient boosting framework that sequentially builds trees, each correcting errors from the previous ones [46]. For iteration t, the model predicts:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta\, f_t(x_i) \tag{6}$$

where $\eta$ is the learning rate and $f_t$ is the regression tree added at step $t$. The objective function minimized is:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t)}\big) + \sum_{k=1}^{t} \Omega(f_k) \tag{7}$$

where $l$ is the loss function (mean squared error), and $\Omega$ is a regularization term penalizing model complexity. XGBoost was selected as a suitable method for capturing nonlinear patterns in the data, with reasonable predictive performance and scalability [47]. Like the Random Forest model, the XGBoost model was evaluated using the same temporal 80/20 train–test split. While XGBoost showed reasonable predictive performance relative to the other models under this validation scheme, these results should be interpreted with caution because the dataset is temporally indexed. Random cross-validation does not fully prevent information leakage from later to earlier periods and may therefore inflate performance metrics. Accordingly, the XGBoost results are used here primarily for comparative and exploratory assessment of model behavior rather than as evidence of strict out-of-sample forecasting superiority. Hyperparameter tuning for the XGBoost model was performed using grid search within the training dataset. The parameters evaluated included the number of boosting rounds, learning rate (eta), maximum tree depth, subsample ratio, and column sampling rate. The optimal parameter combination was selected based on minimizing the RMSE on the training data. A fixed random seed was used to ensure reproducibility of the results.
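The additive update in Equation 6 can be illustrated with a plain gradient-boosting sketch for squared error, using one-split regression stumps on a single feature. This is not XGBoost itself: it omits the regularization term of Equation 7, second-order information, and subsampling, and the data are invented.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, rounds=50, eta=0.3):
    """Start from the mean, then repeatedly add eta * f_t fitted to current residuals."""
    base = sum(y) / len(y)
    trees = []
    pred = [base] * len(y)
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + eta * tree(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + eta * sum(tree(xi) for tree in trees)

def mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]  # invented nonlinear response

mse_0 = mse(boost(x, y, rounds=0), x, y)
mse_1 = mse(boost(x, y, rounds=1), x, y)
mse_50 = mse(boost(x, y, rounds=50), x, y)
print(mse_0, mse_1, mse_50)
```

Training error falls as rounds are added, which is why the number of boosting rounds and the learning rate are tuned jointly and why training-set error alone is a weak guide to forecasting skill.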

2.8 Policy scenario simulation

Policy interventions were modeled by altering predictor values in the final MLR and machine learning models to simulate hypothetical scenarios [48,49]. If the final predictive model is:

$$\hat{y} = f(X) \tag{8}$$

then policy scenarios are represented by modified predictor values $X^{*}$, yielding predicted emissions:

$$\hat{y}^{*} = f(X^{*}) \tag{9}$$

The study evaluated five policy-relevant scenarios representing different land-use and socio-economic pathways: (1) a 10% increase in urbanization rate with 5% GDP growth; (2) a 20% conversion of cropland to forest; (3) implementation of Tanzania’s Forestry Strategy through a 15% increase in protected forest area with reduced deforestation; (4) an ambitious REDD+ program combining 25% reduction in deforestation with sustainable agroforestry; and (5) SDG-aligned urban growth featuring compact city planning with 10% green cover in urban areas. These modifications were implemented in both the linear and machine learning frameworks to quantify potential emissions reductions or increases.
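A minimal sketch of this scenario mechanism, using a linear model as the predictor, is shown below. The coefficients, baseline values, and predictor names are hypothetical, and the first scenario is interpreted here as relative scaling of the urbanization rate and GDP, which is an assumption rather than the study's documented implementation.

```python
def predict(coefs, intercept, features):
    """Linear prediction: intercept plus the sum of coefficient * predictor value."""
    return intercept + sum(coefs[name] * value for name, value in features.items())

def apply_scenario(features, multipliers):
    """Return a modified copy of the predictors with selected values scaled."""
    return {name: value * multipliers.get(name, 1.0) for name, value in features.items()}

# Hypothetical coefficients and baseline predictor values (not from the study).
COEFS = {"urbanization_rate": 120.0, "gdp": 0.5, "forest_share": -300.0}
INTERCEPT = 1000.0
baseline = {"urbanization_rate": 30.0, "gdp": 2000.0, "forest_share": 10.0}

# Scenario 1 analogue: +10% urbanization rate combined with +5% GDP.
scenario = apply_scenario(baseline, {"urbanization_rate": 1.10, "gdp": 1.05})
baseline_emissions = predict(COEFS, INTERCEPT, baseline)
scenario_emissions = predict(COEFS, INTERCEPT, scenario)
delta = scenario_emissions - baseline_emissions
print(baseline_emissions, scenario_emissions, delta)
```

The same perturbed predictor set can be passed to a fitted tree-based model in place of `predict`, which is how a scenario is compared across the linear and machine-learning frameworks.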

The selection of these scenarios is guided by their relevance to widely discussed land-use policy pathways and observed development trends. The cropland-to-forest and forestry strategy scenarios reflect reforestation and conservation efforts commonly promoted in climate mitigation and biodiversity frameworks. The REDD+ scenario represents internationally supported mechanisms aimed at reducing emissions from deforestation and forest degradation, while the urbanization and SDG-aligned scenarios capture ongoing urban growth dynamics and sustainable development planning priorities [50,51]. These scenarios are constructed using simplified and stylized adjustments to key predictors and therefore do not represent fully specified or empirically validated policy interventions. Instead, they provide a transparent and interpretable set of assumptions that allow examination of how the modeling framework responds to plausible directional changes in land-use and socio-economic drivers.

Scenario analysis presented in this study is based on stylized and hypothetical perturbations of input variables rather than empirically validated policy pathways. As such, the scenarios are not intended to generate precise real-world policy forecasts. Instead, they serve as illustrative experiments designed to examine how the integrated modeling framework responds to controlled changes in land-use transitions and socio-economic drivers. The resulting outputs should therefore be interpreted as conditional and model-dependent, providing insight into system behavior and relative sensitivities rather than predictive evidence for policy outcomes.

2.9 Model evaluation

Models were evaluated using the root mean squared error (RMSE) and mean absolute error (MAE) [52], as defined in Equation 10. Lower values indicate higher predictive accuracy. For the Multiple Linear Regression (MLR) model, the adjusted R² was also reported to quantify the proportion of variance explained, penalized for the number of predictors.

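$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{10}$$

where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $n$ is the number of evaluated observations.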
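As a small worked illustration of the two error metrics, the sketch below computes RMSE and MAE from their standard definitions. The observed and predicted values are made-up numbers chosen for easy arithmetic, not data from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    n = len(y_true)
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / n


# Illustrative values only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 40.0]

print(rmse(observed, predicted), mae(observed, predicted))
```

Because squaring weights large residuals more heavily, RMSE is never smaller than MAE on the same residuals, so a wide gap between the two points to a few large errors.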

To quantify uncertainty in model performance, bootstrap resampling was applied to the test predictions. For each model, 1,000 bootstrap samples were drawn from the test set, and RMSE and MAE were recalculated for each resample. The resulting distributions were used to compute 95% confidence intervals for the reported performance metrics.

For time series models, out-of-sample forecasting performance was assessed using rolling-origin cross-validation [53]. Because the emissions dataset contains 26 annual observations (1996–2021), rolling-origin cross-validation was implemented using an expanding training window. The initial training sample covered 1996–2010 (15 observations), after which the forecast origin was moved forward one year at a time. At each origin, the model was re-estimated and a one-year-ahead forecast (forecast horizon h = 1) was generated for the subsequent year. This process produced 10 rolling evaluation folds, corresponding to forecast origins from 2010 to 2019, with predictions evaluated against observed values from 2011 to 2020. The training window expanded sequentially with each iteration, ensuring that only information available prior to the forecast origin was used for model estimation. Forecast accuracy across folds was summarized using RMSE and MAE.

For machine learning models, the dataset was divided into training (80%) and testing (20%) subsets, with the most recent observations reserved for testing to preserve the temporal ordering of the time-series data. Despite using a chronological train–test split, validation is limited by the small sample and short time span, so results reflect internal consistency rather than strong out-of-sample prediction. Such limitations of single temporal splits are well documented in recent machine learning literature, which highlights risks of temporal dependence and optimistic bias in performance estimation [54,55]. All preprocessing steps, including data transformation and any feature preparation, were performed using only the training subset and then applied to the testing data to prevent temporal information leakage.
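The expanding-window design described above can be sketched as follows. The fold boundaries follow the text (origins 2010 to 2019, targets 2011 to 2020); everything else is an assumption for illustration: the helper names are hypothetical, the series is a synthetic straight line rather than the emissions data, and a random walk with drift stands in for the re-estimated ARIMA model.

```python
def rolling_origin_folds(years, first_origin, last_origin):
    """Build (train_years, target_year) pairs with an expanding window."""
    folds = []
    for origin in range(first_origin, last_origin + 1):
        train = [y for y in years if y <= origin]
        target = origin + 1  # one-year-ahead forecast, h = 1
        if target in years:
            folds.append((train, target))
    return folds


def drift_forecast(series):
    """One-step forecast from a random walk with drift (stand-in model)."""
    drift = (series[-1] - series[0]) / (len(series) - 1)
    return series[-1] + drift


years = list(range(1996, 2022))  # 26 annual observations
folds = rolling_origin_folds(years, 2010, 2019)

# Synthetic placeholder series, NOT the study's emissions data.
values = {y: 100.0 + 5.0 * (y - 1996) for y in years}

errors = []
for train_years, target in folds:
    # Only observations up to the forecast origin are used for fitting.
    forecast = drift_forecast([values[y] for y in train_years])
    errors.append(values[target] - forecast)

print(len(folds), len(folds[0][0]), folds[0][1], folds[-1][1])
```

The fold count and the size of the first training window can be read directly from the printed values, which makes it easy to check that the split matches the stated design before any model is fitted.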

Accordingly, the machine-learning results reported in this section should not be interpreted as evidence of superior forecasting ability relative to ARIMA, but rather as indicative of how different algorithms capture non-linear structure within the same accounting-based data representation. The multi-model strategy ensures that the study benefits from the interpretability of linear models, the temporal forecasting strengths of ARIMA/ARIMAX, and the predictive power of ensemble machine learning methods. This integrated approach is supported by previous environmental modeling studies [56,57], ensuring both scientific rigor and policy relevance.

3 Results and discussion

3.1 Exploratory data analysis (EDA) and results

Before modeling CO2 emissions, an exploratory data analysis was conducted to understand the patterns, trends, and relationships in the dataset. The EDA focused on summary statistics, temporal trends, distributions, correlations, and the impact of land use type on emissions. Table 2 presents the summary statistics for key numeric variables: CO2 emissions, urban and rural population, total population, and GDP. The average annual CO2 emissions were -25,173 tons, reflecting periods of net carbon uptake in certain land use transitions.

Table 2. Summary statistics for key variables (1996–2021).

https://doi.org/10.1371/journal.pclm.0000952.t002

Negative CO2 values indicate net carbon sequestration when terrestrial uptake exceeds emissions from land-use change. Tanzania’s land sector was a net carbon sink over the study period, driven largely by cropland-to-forest transitions. Urban population increased steadily over the years, while rural population growth was slower. Total population and GDP showed consistent growth, highlighting the potential impact of economic expansion and urbanization on emissions.

Fig 1 shows a general upward trend in CO2 emissions, with occasional net absorption years. Urban population growth outpaced rural population growth, reflecting increasing urbanization. GDP also increased steadily, which may contribute to rising emissions. Fig 2 and Fig 3 show the distribution of CO2 emissions and other numeric predictors. The histograms (Fig 2) indicate right-skewed distributions for CO2 emissions and population variables, suggesting occasional extreme emission or population years. The boxplot (Fig 3) shows that forest land and cropland conversions are associated with higher emissions, emphasizing their importance in carbon dynamics. Fig 4 highlights multicollinearity between urban and rural population. Emissions are strongly associated with urbanization trends, which justifies using urbanization rate as a predictor in regression models. Fig 5 shows that transitions from cropland to forest or remaining forest land contribute significantly to carbon emissions. Land conversion to settlements produces comparatively lower emissions. These insights guide the selection of predictors in the subsequent regression models. In summary, the EDA reveals temporal trends in emissions and population, skewed distributions, correlations among predictors, and land use types with major emission impacts. These observations provide a foundation for the multiple linear regression analysis presented in Section 3.2.

Fig 1. The parallel rise of economic output (GDP), urbanization, and CO2 emissions highlights the strong link between development and environmental impact.

https://doi.org/10.1371/journal.pclm.0000952.g001

Fig 2. Histograms depict skewness and spread of Emissions, Urban Population, Rural Population, and GDP.

https://doi.org/10.1371/journal.pclm.0000952.g002

Fig 3. Transitions involving forest and cropland, such as conversions or retention, exhibit the largest swings in CO2 emissions.

This high variability confirms their outsized impact and dominant role in the national carbon budget.

https://doi.org/10.1371/journal.pclm.0000952.g003

Fig 4. CO2 emissions are more strongly linked to the degree of urbanization than to total population size, indicating the environmental impact of economic activity concentrated in cities.

https://doi.org/10.1371/journal.pclm.0000952.g004

Fig 5. Average CO2 emissions by Land Use Type.

Cropland-to-Forest and Forest Land exhibit the highest average emissions, while settlements and other land types show lower emissions.

https://doi.org/10.1371/journal.pclm.0000952.g005

3.2 Multiple linear regression results

Although logarithmic transformations of CO2 emissions and continuous socio-economic predictors were explored during preprocessing for diagnostic and robustness purposes, the Multiple Linear Regression results reported in this section and summarized in Table 3 are estimated using variables expressed in their original level units; consequently, all reported coefficients represent absolute differences in modeled CO2 emissions (metric tons) conditional on the accounting-consistent data structure. The multiple linear regression analysis was conducted to examine the influence of land use change, urbanization rate, and GDP on carbon dioxide (CO2) emissions. Model 2, which replaced absolute urban and rural population counts with the urbanization rate (urban population / total population), was selected based on its interpretability, policy relevance, and absence of multicollinearity (all variance inflation factor (VIF) values < 1.6). Multicollinearity diagnostics are reported in S2 Table. The regression is estimated on a stacked land-use transition by year dataset, where socio-economic variables vary annually but are shared across transitions within each year. Because multiple land-use transitions share the same annual socio-economic values, residuals may exhibit within-year dependence. Standard errors are therefore reported using heteroskedasticity-robust estimates clustered at the year level.
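To illustrate why replacing population counts with a rate can matter for multicollinearity, the sketch below computes the urbanization rate and a variance inflation factor. It is a simplified illustration under stated assumptions: it uses the two-predictor special case, where VIF equals 1 / (1 - r²) with r the Pearson correlation, it ignores the land-use indicators in the actual model, and the population figures are invented placeholders.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Placeholder population counts (arbitrary units), not the study's data.
urban = [10.0, 12.0, 15.0, 19.0, 24.0]
rural = [30.0, 31.0, 32.5, 34.0, 36.0]
total = [u + r for u, r in zip(urban, rural)]

# Urbanization rate as defined in the text: urban population / total population.
urban_rate = [u / t for u, t in zip(urban, total)]

vif_counts = vif_two_predictors(urban, rural)
print(round(vif_counts, 1), [round(v, 3) for v in urban_rate])
```

When both counts trend upward together, as in this toy series, their VIF is very large; a single rate variable removes that pair from the design matrix altogether.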

Table 3. Multiple Linear Regression Results for CO2 Emissions (with Land Use Type, Urbanization Rate, and GDP).

https://doi.org/10.1371/journal.pclm.0000952.t003

In Table 3, rows shaded in light gray highlight the most impactful land use transitions (those with the largest absolute estimates). The overall regression model was statistically significant, F(35, 848) = 801.9, p < 0.001, with an adjusted R² of 0.969. The high explanatory power partly reflects the inclusion of land-use transition indicators that are structurally linked to the accounting-based construction of emissions. As a result, a substantial share of variation is mechanically explained by the model’s accounting structure rather than independent causal relationships.

In particular, because CO2 emissions are derived from the same land-use accounting framework used to define several predictors, the reported R2 partly reflects accounting consistency rather than purely independent explanatory power. Given the relatively large number of transition indicators, the results should be interpreted cautiously and primarily used for comparative assessment and scenario exploration rather than as evidence of strong predictive generalization. Accordingly, the model should be interpreted as capturing conditional associations within an accounting-consistent framework, rather than providing evidence of strong independent or causal relationships. The findings may also be sensitive to model specification and the accounting-based data structure.

The coefficients for most land use change categories were statistically significant (p < 0.001), indicating substantial differences in emissions depending on the type of land conversion or retention. For example, transitions from cropland to forest land were associated with a significant reduction in emissions, whereas conversion to settlements showed comparatively smaller reductions.

The urbanization rate was negatively associated with CO2 emissions (SE = 11,960, p < 0.001), indicating that, conditional on land-use transitions and GDP, higher levels of urbanization are associated with lower emissions in the regression sample. This association should be interpreted as statistical rather than causal and may reflect structural correlations between urbanization, land-use composition, and emissions accounting. To provide an explicit measure of parameter uncertainty beyond p-values, 95% confidence intervals were computed for all regression coefficients using β̂ ± 1.96 × SE. For the urbanization rate, this corresponds to a 95% confidence interval of approximately [-125,200, -78,400] metric tons, reinforcing the direction and statistical robustness of the estimated association under the level-based specification.
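The reported interval can be checked against the reported standard error with simple arithmetic. The sketch below assumes a normal-approximation interval (estimate ± 1.96 × SE); the point estimate is not legible in this text, so the midpoint of the reported interval is used as an implied value, and small discrepancies are expected because the interval is stated as approximate.

```python
# Values taken from the text.
se = 11_960.0
ci_low, ci_high = -125_200.0, -78_400.0

# Assumption: normal-approximation critical value for a 95% interval.
z = 1.96

half_width_from_se = z * se                     # implied by the standard error
half_width_from_ci = (ci_high - ci_low) / 2.0   # implied by the reported bounds
implied_estimate = (ci_high + ci_low) / 2.0     # midpoint of the reported bounds

print(half_width_from_se, half_width_from_ci, implied_estimate)
```

The two half-widths agree to within rounding, and the whole interval lies below zero, which is what the text means by the direction of the association being robust under this specification.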

These findings indicate that land-use transitions account for a large share of the variation in emissions within the accounting-consistent regression framework, while urbanization shows a statistically significant association under the model specification considered. These results support evidence-informed prioritization of land-use related mitigation strategies, but they should be interpreted as conditional on model structure, data aggregation, and accounting assumptions rather than as definitive causal guidance.

Uncertainty in coefficient estimates is reflected in the reported standard errors and associated test statistics, and statistical significance should be interpreted conditional on the model specification, data construction, and underlying accounting assumptions.

3.3 Time series modeling of carbon dioxide emissions

To assess the potential of time series methods in forecasting carbon dioxide (CO2) emissions, the study implemented both an Autoregressive Integrated Moving Average (ARIMA) model and an ARIMA with exogenous variables (ARIMAX) using historical emissions data. The objective was to determine the most accurate and parsimonious model for projecting future emissions trends. The Augmented Dickey-Fuller (ADF) test was conducted to evaluate the stationarity of the emissions series. Results indicated non-stationarity in levels (ADF statistic = -2.1165, p-value = 0.528), prompting first differencing of the series. Based on the Akaike Information Criterion corrected for small samples (AICc) and residual diagnostics, an ARIMA(1,1,0) with drift was selected for the univariate model. For the ARIMAX specification, Urbanization Rate was included as an external regressor, resulting in a regression with ARIMA(2,0,0) errors.
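For readers unfamiliar with the notation, an ARIMA(1,1,0) model with drift can be written on the first differences as d_t = c + φ·d_{t-1} + e_t. The sketch below fits c and φ by conditional least squares and produces a one-step forecast. It is only an illustration of the model structure: the estimator is a simplification (the study's software and estimation method are not stated here), the ADF test and AICc selection are not reproduced, and the series is synthetic and noise-free.

```python
def difference(series):
    """First differences of a series."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1_on_differences(series):
    """Least-squares estimates of c and phi in d_t = c + phi * d_{t-1} + e_t."""
    d = difference(series)
    x, y = d[:-1], d[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    phi = sxy / sxx
    c = my - phi * mx
    return c, phi


def forecast_next(series, c, phi):
    """One-step-ahead forecast on the original (undifferenced) scale."""
    last_diff = series[-1] - series[-2]
    return series[-1] + c + phi * last_diff


# Synthetic series generated with c = 2.0 and phi = 0.5 and no noise,
# so the fit should recover those values. Not the study's data.
series = [100.0, 100.0]
for _ in range(12):
    d_prev = series[-1] - series[-2]
    series.append(series[-1] + 2.0 + 0.5 * d_prev)

c_hat, phi_hat = fit_ar1_on_differences(series)
next_value = forecast_next(series, c_hat, phi_hat)
print(round(c_hat, 6), round(phi_hat, 6), round(next_value, 4))
```

In practice a dedicated time-series library would also supply standard errors, information criteria, and prediction intervals, none of which this sketch attempts.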

Table 4 summarizes the comparative performance of the two models. The ARIMA(1,1,0) with drift achieved a lower AICc (407) compared to the ARIMAX (463) and demonstrated substantially lower root mean squared error (RMSE) and mean absolute percentage error (MAPE) on the test set (RMSE = 2937.42, MAPE = 25.77%). Residual diagnostics via the Ljung-Box test indicated no significant autocorrelation in the ARIMA residuals (p = 0.153), whereas the ARIMAX model exhibited significant residual autocorrelation (p = 0.002), suggesting potential misspecification.

Table 4. Comparison of ARIMA and ARIMAX Model Performance for CO2 Emissions Forecasting.

https://doi.org/10.1371/journal.pclm.0000952.t004

The ARIMA(1,1,0) model forecasts indicated a gradual but consistent increase in CO2 emissions over the forecast horizon, with relatively narrow prediction intervals, reflecting its stable residual behavior. In contrast, the ARIMAX model forecasts exhibited wider confidence bounds and higher forecast errors, likely due to the instability introduced by the exogenous regressor under the current data structure. Fig 6 presents the forecast results for both models. The visual inspection confirms the superior fit and predictive stability of the ARIMA(1,1,0) model, justifying its selection for forward-looking emissions projections.

Fig 6. Comparison of forecast performance between ARIMA(1,1,0) with drift and ARIMAX(2,0,0) errors models.

The ARIMA model demonstrates closer alignment with observed test set values and narrower prediction intervals.

https://doi.org/10.1371/journal.pclm.0000952.g006

These results highlight that, within the available dataset, a simple univariate ARIMA model outperforms the more complex ARIMAX approach for CO2 emissions forecasting. The inclusion of Urbanization Rate as an exogenous predictor did not improve accuracy and introduced significant residual autocorrelation. This suggests that additional preprocessing or the inclusion of alternative exogenous factors (e.g., industrial activity, energy consumption) may be required before ARIMAX or similar multivariate models can yield improvements in predictive performance.

3.4 Machine learning prediction of carbon emissions

To further explore the comparative predictive behavior of available land-use and socio-economic drivers, two machine-learning models, Random Forest (RF) and XGBoost, were implemented as benchmark methods within the integrated framework. The models were trained on 80% of the observations and evaluated on the remaining 20% of the data. Because the dataset is temporally indexed, this split reflects internal model evaluation rather than strict out-of-sample forecasting, and the short time span and modest sample size of the dataset further constrain the robustness of out-of-sample performance assessment. While this approach preserves temporal ordering, the use of a single train–test split does not fully eliminate the risk of temporal leakage, particularly in small samples where adjacent observations may share persistent patterns. Consequently, the reported predictive performance may be sensitive to the specific split and may not fully reflect the models’ ability to generalize to unseen data. In temporally structured datasets, more robust validation strategies such as rolling-origin evaluation or time-series cross-validation are often recommended to better account for temporal dependence and reduce optimistic bias in performance estimates [58,59].

Table 5 summarizes the resulting prediction errors under this validation design. XGBoost exhibited lower RMSE and MAE than Random Forest, indicating stronger internal fit under the adopted data partitioning within a limited validation setting rather than definitive forecasting superiority. Bootstrap analysis indicated that the reported RMSE and MAE values fall within stable 95% confidence intervals, which supports the relative ordering of the two models on this test set. Even so, the results should be interpreted cautiously and viewed as indicative of potential predictive relationships rather than definitive evidence of generalizable forecasting performance.
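The chronological 80/20 split and the bootstrap intervals mentioned above can be sketched as follows. This is a hedged illustration: the split is applied to a list of years rather than to the stacked transition-by-year rows, the residuals are invented placeholders, and a percentile bootstrap is assumed because the text does not state which interval type was used.

```python
import math
import random


def chronological_split(rows, train_frac=0.8):
    """Keep the earliest rows for training and the latest for testing."""
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]


def rmse(errors):
    """Root mean squared error computed from a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def bootstrap_rmse_ci(errors, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for RMSE (assumed interval type)."""
    rng = random.Random(seed)
    n = len(errors)
    stats = sorted(
        rmse([errors[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


rows = list(range(1996, 2022))  # stand-in for time-ordered observations
train, test = chronological_split(rows)

# Placeholder test-set residuals, not results from the study.
test_errors = [120.0, -80.0, 45.0, -200.0, 10.0, 95.0]
ci_low, ci_high = bootstrap_rmse_ci(test_errors)
print(len(train), len(test), round(ci_low, 1), round(ci_high, 1))
```

With only a handful of test residuals the interval is necessarily wide and depends heavily on the largest residual, which is one concrete way to see why small-sample validation results call for caution.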

Table 5. Comparison of Machine Learning Model Performance for CO2 Emissions Prediction.

https://doi.org/10.1371/journal.pclm.0000952.t005

For Random Forest, land use type was the dominant predictor (%IncMSE = 62.3%), while Urban Rate, Rural Population, and GDP contributed minimally. This pattern suggests that land-use transitions account for a substantial share of variation in modeled emissions within the accounting-consistent framework. Fig 7 shows the ranked importance of predictors. Overall, the results indicate that machine-learning models, particularly XGBoost, provide useful complementary benchmarks for capturing non-linear patterns in the available data, alongside the linear and time-series approaches.
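The idea behind a permutation-based importance score such as %IncMSE can be illustrated with a toy example: shuffle one predictor, re-predict, and record how much the mean squared error rises. The sketch below is in the spirit of that measure rather than a reproduction of it; the "fitted model" is a hypothetical lookup table plus a small GDP term, the data are synthetic, and no out-of-bag samples or scaling are involved.

```python
import random

# Hypothetical stand-in for a fitted model; values are placeholders.
LAND_EFFECT = {"forest": -5000.0, "cropland": 1500.0, "settlement": 200.0}


def predict(row):
    """Toy prediction: land-use effect plus a small linear GDP term."""
    return LAND_EFFECT[row["land_use"]] + 2.0 * row["gdp"]


def mse(rows, targets):
    """Mean squared error of the toy model on the given rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(rows, targets, feature, n_repeats=20, seed=0):
    """Average increase in MSE after shuffling one feature column."""
    rng = random.Random(seed)
    base = mse(rows, targets)
    increases = []
    for _ in range(n_repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
        increases.append(mse(permuted, targets) - base)
    return sum(increases) / n_repeats


rows = [
    {"land_use": lu, "gdp": gdp}
    for gdp in (40.0, 50.0, 60.0, 70.0)
    for lu in ("forest", "cropland", "settlement")
]
targets = [predict(r) for r in rows]  # noise-free synthetic targets

importance = {
    name: permutation_importance(rows, targets, name)
    for name in ("land_use", "gdp")
}
print(importance)
```

Because the toy targets are dominated by the land-use term, shuffling that column degrades predictions far more than shuffling GDP, which is the same qualitative pattern the text reports.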

Fig 7. Feature importance from Random Forest model predicting CO2 emissions.

Land use type is the most influential predictor.

https://doi.org/10.1371/journal.pclm.0000952.g007

Because the modeling approaches are estimated on different data structures, their performance metrics should be interpreted with caution. The multiple linear regression and machine-learning models are trained on the stacked dataset of land-use transitions and socio-economic predictors, where each observation corresponds to a specific land-use category or transition within a given year. In contrast, the ARIMA model is estimated on an aggregated national time series consisting of a single observation per year. This structural difference implies that the models are effectively learning from different representations of the system, and therefore their error metrics (RMSE and MAE) reflect distinct predictive tasks rather than a common forecasting objective. As a result, direct numerical comparisons of predictive accuracy across model classes may be misleading and should not be interpreted as evidence of the superiority of one modeling approach over another. Consequently, the reported RMSE and MAE values provide approximate benchmarks of predictive behavior rather than strictly comparable forecasting performance across models. Each modeling approach therefore serves a complementary analytical role: regression models provide interpretable associations between drivers and emissions, time-series models capture temporal dynamics in national emissions, and machine-learning algorithms explore nonlinear relationships within the accounting-based dataset.

The multiple linear regression (MLR) model achieved a high R-squared value (0.9707) with moderate RMSE and MAE, indicating that linear relationships between land use type, urbanization, and GDP explain most of the variation in emissions. However, MLR assumes linearity and may not fully capture complex interactions among drivers. The ARIMA(1,1,0) time series model produced RMSE and MAE similar to those of MLR, suggesting that historical temporal trends in emissions alone can provide reasonable forecasts. Nonetheless, ARIMA does not explicitly incorporate socio-economic drivers or land use changes, limiting interpretability for policy interventions.

Among the machine-learning approaches, XGBoost exhibited lower error metrics than Random Forest and the classical models, with the lowest RMSE (2,141) and MAE (1,078) under the adopted evaluation setup. This is consistent with its ability to capture non-linear relationships and complex interactions between land use transitions and socio-economic factors. However, such cross-model comparisons should be interpreted cautiously given the underlying differences in data structure and modeling objectives, and the differences reflect relative performance under internal validation rather than evidence of generalizable forecasting dominance. Random Forest, while robust, showed higher prediction errors, likely due to the smaller effect of other predictors compared to land use type.

Overall, the comparison indicates that while classical models such as MLR and ARIMA provide interpretable baseline estimates, machine-learning approaches, particularly XGBoost, exhibited lower prediction errors under the adopted evaluation design. However, these comparisons should not be interpreted as direct evidence of model superiority across approaches, as the underlying data structures and modeling objectives differ. Given that the data are temporally indexed, the reported results are sensitive to data structure and train–test partitioning and may reflect optimistic performance due to potential temporal leakage. Accordingly, the machine-learning models are best viewed as complementary benchmarking tools that enhance exploratory analysis rather than as substitutes for interpretable statistical models or as decision-ready forecasting systems.

3.5 Policy interventions and scenario-based predictions

To provide actionable insights for policymakers, the study simulated the policy scenarios using the final multiple linear regression model. As shown in Table 6, Scenario 1 combines a 10% increase in the urbanization rate with a 5% increase in GDP. Although the regression model estimates a negative coefficient for urbanization, the combined scenario results in higher predicted CO2 emissions because the positive contribution of GDP growth outweighs the emission-reducing association linked to urbanization within the model. In contrast, converting cropland to forest (Scenario 2) substantially reduces predicted emissions.

Table 6. Predicted CO2 Emissions Under Different Land Use and Socio-Economic Scenarios.

https://doi.org/10.1371/journal.pclm.0000952.t006

Policy-aligned scenarios reveal strong mitigation potential: Tanzania’s Forestry Strategy (Scenario 3) reduces emissions, ambitious REDD+ (Scenario 4) offers the largest cuts, and SDG-aligned urban growth (Scenario 5) decouples development from emissions. These results offer quantitative guidance for aligning land use, urban planning, and climate commitments.

All scenarios are generated by imposing exogenous changes to selected predictors within the final log-linear regression model, with predicted emissions subsequently back-transformed to the original scale (metric tons). Scenario outcomes therefore represent conditional, model-based projections rather than causal policy impacts or long-term forecasts.

Scenario construction respects national land accounting constraints. For each year, imposed changes to a given land-use transition are applied subject to the condition that the sum of land-use areas across all categories remains constant. Increases in one land-use category are therefore offset by proportional reductions in one or more alternative categories, ensuring that land-use transitions remain mutually exclusive and collectively exhaustive at the national scale. All scenario inputs are checked to satisfy these constraints prior to model evaluation.

Uncertainty in scenario outcomes is assessed by propagating regression coefficient uncertainty through the scenario simulations. For each scenario, predicted emissions are computed using the estimated coefficient vector and corresponding confidence intervals, yielding a range of plausible emission outcomes around the point estimate. These ranges reflect statistical uncertainty in model parameters rather than structural or policy uncertainty and are reported alongside point predictions to avoid over-interpretation of deterministic scenario outcomes.
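The land accounting constraint described above can be illustrated with a small sketch in which area moved into one category is removed from another, so that the national total is unchanged. The category names, areas, and helper functions are hypothetical, and the example offsets the increase from a single source category, which is only one of the allocation rules the text allows.

```python
def convert_land(areas, source, target, fraction):
    """Move `fraction` of the source category's area into the target category."""
    moved = areas[source] * fraction
    updated = dict(areas)
    updated[source] -= moved
    updated[target] += moved
    return updated


def total_is_conserved(before, after, tol=1e-9):
    """Check that the total area across all categories is unchanged."""
    return abs(sum(before.values()) - sum(after.values())) <= tol


# Placeholder areas in arbitrary units, not figures from the study.
baseline = {"cropland": 400.0, "forest": 450.0, "settlement": 50.0, "other": 100.0}

# Scenario 2 as described in the text: convert 20% of cropland to forest.
scenario_2 = convert_land(baseline, "cropland", "forest", 0.20)

print(scenario_2, total_is_conserved(baseline, scenario_2))
```

Running the conservation check on every scenario input before prediction is a cheap way to catch perturbations that would otherwise create or destroy land area.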

Fig 8 illustrates the predicted CO2 emissions under the five land use and socio-economic scenarios alongside the base case. The base scenario represents the current conditions, serving as a reference point. Scenario 1 models a 10% increase in the urbanization rate combined with a 5% increase in GDP. The scenario results in higher emissions relative to the base case, primarily driven by the GDP growth effect within the model, which dominates the negative coefficient associated with urbanization in the regression results. Scenario 2 examines the effect of converting 20% of cropland to forest land, which leads to a substantial reduction in predicted emissions, demonstrating the potential of land management interventions for carbon mitigation. Scenario 3 reflects implementation of the Tanzania Forestry Strategy, showing moderate emission reductions through forest protection and reduced deforestation. Scenario 4 represents ambitious REDD+ implementation, demonstrating the greatest mitigation potential through comprehensive forest conservation and sustainable agroforestry practices. Scenario 5 shows how SDG-aligned urban growth, incorporating compact city design and green infrastructure, can nearly decouple urban development from emission increases. Overall, the figure provides policymakers with a clear visual comparison of how strategic land use and socio-economic changes can influence future CO2 emissions.

Fig 8. Predicted CO2 emissions under different land use and socio-economic policy scenarios.

https://doi.org/10.1371/journal.pclm.0000952.g008

3.6 Discussion

The results provide evidence-informed insights that may support land-use policy discussions in Tanzania, while remaining conditional on the modeling assumptions and data limitations. The study makes two main contributions. First, it offers analytical insights that can inform land-use policy discussions in Tanzania. Second, it presents an integrated analytical framework with potential for broader application in data-limited contexts. By combining multiple linear regression (MLR), ARIMA, and XGBoost models, the framework highlights the relative importance of different land-use transitions within the modeled scenarios. In particular, scenarios involving the conversion of cropland to forest are associated with notable reductions in projected CO2 emissions. This pattern is consistent with global evidence on the climate mitigation potential of reforestation [60,61], while the present study quantifies these relationships using a modeling approach designed for national-level analysis under limited data conditions. For instance, [62] estimated that natural climate solutions, including forest restoration, could contribute up to 37% of the cost-effective CO2 mitigation required by 2030. This finding supports our scenario-based results, which show that cropland-to-forest conversion represents one of the more impactful pathways for reducing emissions in Tanzania within the modeled scenarios. These findings highlight the role of land-use planning in climate mitigation. In Tanzania, reforestation, conservation, and agroforestry could support NDC targets while enhancing carbon sequestration and livelihoods. The observed influence of urbanization and GDP growth on emissions trajectories is consistent with previous empirical evidence. Studies such as [63,64] have reported that rapid urban expansion in developing economies typically drives higher CO2 output due to increased energy consumption and transportation demand. 
However, our findings indicate that these effects can be mitigated through sustainable urban planning, echoing the conclusions of [65], who emphasized the value of compact city designs and integrated green infrastructure. Thus, the combination of socio-economic and environmental interventions emerges as a more effective and sustainable pathway compared to isolated sectoral measures.

From a methodological perspective, this study adds empirical evidence to the growing body of research indicating that hybrid modeling approaches can complement single-method frameworks in environmental forecasting. The relatively higher predictive accuracy observed for XGBoost in this dataset compared to traditional statistical models is consistent with previous studies [66,67], which demonstrate the capability of gradient boosting algorithms to capture complex nonlinear and interaction effects in environmental datasets. While ARIMA effectively captures temporal dynamics, it is generally less suited to representing nonlinear relationships, which machine-learning approaches may help to capture when sufficient data are available. The primary methodological contribution of this work lies in the practical integration of econometric, time-series, and machine-learning approaches within a single analytical framework applied to national land-use emissions data. Within the limits of the available dataset, the lower prediction errors obtained by XGBoost suggest that machine-learning models can provide a useful complementary perspective for capturing the complex, non-linear interactions that characterize land systems, although their results should be interpreted cautiously due to their limited transparency compared with statistical models.

Despite its strengths, the study acknowledges several limitations. The policy intervention scenarios, while informative, are based on simplified assumptions about changes in land use and socio-economic conditions. Real-world implementation may be constrained by governance capacity, policy enforcement, and socio-political factors that were not explicitly modeled. Similar challenges have been observed by [68], who emphasized the importance of strong institutional frameworks and sustained investment in translating land-use scenarios into actionable policy outcomes. Another limitation concerns the lack of a fine-grained spatial dimension. Although national-level aggregation facilitates robust time-series modeling, it obscures local emission hotspots and limits regional analysis of land management strategies [69]. Future research should integrate Geographic Information Systems (GIS) and satellite-based datasets to validate model outcomes at sub-national scales and generate high-resolution emission maps. This spatial enhancement would provide policymakers with more localized and actionable insights for targeted land-use planning and conservation efforts. A further limitation is within-year correlation in the stacked transition dataset; future research could address this by applying clustered standard errors or panel-data estimation approaches.

While the modeling framework provides structured insights into potential mitigation pathways, the results should be interpreted in light of several methodological constraints. Model performance is influenced by the limited temporal sample, data aggregation at the national scale, preprocessing transformations, and cross-validation design. In particular, machine-learning results may be sensitive to data partitioning and to potential information leakage in small datasets. In addition, because socio-economic covariates repeat across multiple land-use transition observations within the same year, the stacked dataset structure may induce within-year residual correlation, potentially violating the strict independence assumption of ordinary least squares regression. Applying clustered standard errors or panel-based estimation approaches could improve the robustness of statistical inference in future analyses. For these reasons, the identification of cropland-to-forest conversion as a high-impact mitigation pathway reflects patterns within the current analytical framework rather than a definitive ranking across all possible land transitions. Future research incorporating expanded datasets, spatial disaggregation, and alternative validation schemes would improve robustness and strengthen causal interpretation.
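One alternative validation scheme of the kind mentioned above is a year-blocked, expanding-window split. The sketch below is illustrative only and is not the validation design used in this study: the function name `year_blocked_splits`, the parameter `min_train_years`, and the toy `years` list are hypothetical. It keeps all transition rows from the same year in the same fold and ensures that every training year precedes the test year, which limits temporal leakage in a stacked dataset.

```python
from collections import defaultdict


def year_blocked_splits(years, min_train_years=3):
    """Yield (train_idx, test_idx) pairs for expanding-window validation.

    Rows that share a year stay together, and all training years come
    strictly before the test year, so no future rows reach the model.
    """
    # Group row indices by year (hypothetical stacked layout:
    # several land-use transition rows per year).
    rows_by_year = defaultdict(list)
    for i, y in enumerate(years):
        rows_by_year[y].append(i)

    ordered = sorted(rows_by_year)
    # Each fold trains on the first k years and tests on the next year.
    for k in range(min_train_years, len(ordered)):
        train = [i for y in ordered[:k] for i in rows_by_year[y]]
        test = list(rows_by_year[ordered[k]])
        yield train, test


# Toy example (assumed data): two transition rows per year, 2015 to 2019.
years = [2015, 2015, 2016, 2016, 2017, 2017, 2018, 2018, 2019, 2019]
splits = list(year_blocked_splits(years, min_train_years=3))
for train, test in splits:
    print(sorted({years[i] for i in train}), "->", years[test[0]])
```

Each `(train, test)` pair can be passed to any of the models compared here. Errors obtained this way are closer to out-of-sample forecasting performance than errors from randomly shuffled folds, at the cost of fewer usable folds when the time series is short.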

4 Conclusion and recommendations

4.1 Conclusion

This study developed and applied an integrated hybrid modeling framework to examine the relationship between land-use change and CO2 emissions, using Tanzania as a representative case study in a data-scarce national context. By combining accounting-consistent land-use transitions with statistical, time-series, and machine-learning models, the framework enables comparative exploration of emission responses under alternative land-use and socio-economic scenarios. Within the scope of the modeled scenarios, cropland-to-forest conversion consistently emerged as a land-use pathway associated with lower modeled net emissions relative to other transitions, while socio-economic drivers such as urbanization and GDP growth were associated with upward pressure on emissions.

Model evaluation indicates that machine-learning approaches, particularly XGBoost, exhibited lower prediction errors than linear and time-series models under the adopted evaluation design. However, these results are conditional on the available data, model structure, and validation approach, and should therefore be interpreted as internal benchmarking outcomes rather than as evidence of generalizable forecasting superiority. Overall, the primary contribution of this study lies in methodological integration rather than in the introduction of new predictive theory or policy prescriptions. The proposed framework is intended to support exploratory analysis, comparison of alternative land-use pathways, and structured decision support under uncertainty, rather than to identify optimal mitigation strategies or provide decision-ready forecasts. Future work incorporating longer time series, spatially explicit data, and expanded validation strategies would further strengthen the robustness and policy relevance of the approach.

4.2 Recommendations

The evidence generated by this study suggests several strategic directions for policymakers, urban planners, and researchers aiming to mitigate CO2 emissions through sustainable land use and socio-economic interventions. If the relationships identified in this study are supported by further empirical validation, targeted land use policies could be considered to promote reforestation and the conversion of low-productivity cropland into forest or grassland, thereby potentially enhancing the carbon sequestration capacity of available landscapes. Similarly, the management of urban growth may benefit from well-structured planning frameworks that aim to limit horizontal expansion, encourage vertical development, and incorporate green infrastructure solutions that could help offset a portion of emissions associated with urban activities.

In addition, predictive modeling tools such as the MLR and machine learning models developed in this research could be integrated into environmental policy workflows to support scenario-based and data-informed decision-making processes. Such integration may provide planners with useful insights for anticipating and potentially mitigating future emission trends, although further validation in different contexts is recommended. Strengthening data collection systems also appears important, particularly through the development of high-resolution, spatially explicit datasets on land use patterns, emissions sources, and socio-economic indicators, which could improve the robustness and precision of future modeling efforts. Finally, this study highlights the potential value of a more interdisciplinary approach to emissions research, encouraging collaborations that combine socio-economic, environmental, and technological perspectives to support the development of more comprehensive simulation frameworks. Taken together, these considerations may offer a useful pathway for exploring land use strategies and supporting emission reduction efforts within a broader sustainable development context, while acknowledging the conditional nature of the findings.

4.3 Contributions

This work makes several contributions to the field of environmental modeling and climate policy. First, it integrates multi-source datasets, including national emissions inventories and socio-economic statistics, into a unified analytical framework. Second, by combining traditional econometric models with modern machine-learning techniques, it draws on the complementary strengths of each approach and allows their predictive performance to be compared under a common evaluation design. Third, it offers scenario-based insights for policymakers, highlighting land-use strategies and urban planning interventions that are associated with lower modeled CO2 emissions. Finally, it contributes to the global discourse on climate change mitigation by providing a replicable methodological template for developing countries facing similar environmental and socio-economic pressures.

Supporting Information

S1 Data. Dataset used for the analysis conducted in this study.

The file contains the data used to generate the study results and statistical analyses.

https://doi.org/10.1371/journal.pclm.0000952.s001

(XLSX)

S1 Fig. Regression diagnostics for the multiple linear regression model, including residual plots.

https://doi.org/10.1371/journal.pclm.0000952.s002

(TIF)

S2 Fig. Normal Q-Q diagnostics for the multiple linear regression model residuals.

https://doi.org/10.1371/journal.pclm.0000952.s003

(TIF)

S1 Table. Variable dictionary and transformation summary for all predictors and response variables.

https://doi.org/10.1371/journal.pclm.0000952.s004

(PDF)

S2 Table. Variance Inflation Factor (VIF) diagnostics for regression predictors.

https://doi.org/10.1371/journal.pclm.0000952.s005

(PDF)

S3 Table. Estimated Land Use Carbon Sensitivity Index (LUCSI) for selected land transitions.

https://doi.org/10.1371/journal.pclm.0000952.s006

(PDF)

S1 Text. Detailed description, formulation, and interpretation of the Land Use Carbon Sensitivity Index (LUCSI).

https://doi.org/10.1371/journal.pclm.0000952.s007

(PDF)

References

  1. Kabir M, Habiba U, Iqbal MZ, Shafiq M, Farooqi ZR, Shah A, et al. Impacts of anthropogenic activities & climate change resulting from increasing concentration of Carbon dioxide on environment in 21st Century; A Critical Review. IOP Conf Ser: Earth Environ Sci. 2023;1194(1):012010.
  2. Filonchyk M, Peterson MP, Zhang L, Hurynovich V, He Y. Greenhouse gases emissions and global climate change: Examining the influence of CO2, CH4, and N2O. Sci Total Environ. 2024;935:173359.
  3. Filonchyk M, Peterson MP, Yan H, Gusev A, Zhang L, He Y, et al. Greenhouse gas emissions and reduction strategies for the world’s largest greenhouse gas emitters. Sci Total Environ. 2024;944:173895. pmid:38862038
  4. Yamanoshita M. IPCC special report on climate change and land. JSTOR; 2022.
  5. Li L, Awada T, Zhang Y, Paustian K. Global Land Use Change and Its Impact on Greenhouse Gas Emissions. Glob Chang Biol. 2024;30(12):e17604. pmid:39614423
  6. Liu M, Chen Y, Chen K, Chen Y. Progress and Hotspots of Research on Land-Use Carbon Emissions: A Global Perspective. Sustainability. 2023;15(9):7245.
  7. Mehmood J, Shahbaz M, Wang J, Malik MN. Unveiling the dynamics of agriculture greenhouse gas emissions: The role of energy consumptions and natural resources. Appl Energy. 2025;379:124946.
  8. Verma K, Sharma P, Bhardwaj D, Kumar R, Kumar NM, Singh AK. Land and environmental management through agriculture, forestry and other land use (AFOLU) system. Land and environmental management through forestry. 2023. p. 247–71.
  9. Ménard I, Thiffault E, Kurz WA, Boucher J-F. Carbon sequestration and emission mitigation potential of afforestation and reforestation of unproductive territories. New Forests. 2022;54(6):1013–35.
  10. Chlela S. Integrated modeling of the global energy system for pathways with carbon dioxide removal. 2024.
  11. Kuteyi D, Winkler H. Logistics Challenges in Sub-Saharan Africa and Opportunities for Digitalization. Sustainability. 2022;14(4):2399.
  12. Nedd R, Light K, Owens M, James N, Johnson E, Anandhi A. A Synthesis of Land Use/Land Cover Studies: Definitions, Classification Systems, Meta-Studies, Challenges and Knowledge Gaps on a Global Landscape. Land. 2021;10(9):994.
  13. Mutale B, Qiang F. Modeling future land use and land cover under different scenarios using patch-generating land use simulation model. A case study of Ndola district. Front Environ Sci. 2024;12.
  14. Kariuki RW, Capitani C, Munishi LK, Shoemaker A, Courtney Mustaphi CJ, William N, et al. Serengeti’s futures: Exploring land use and land cover change scenarios to craft pathways for meeting conservation and development goals. Front Conserv Sci. 2022;3.
  15. Sumari NS, Cobbinah PB, Ujoh F, Xu G. On the absurdity of rapid urbanization: Spatio-temporal analysis of land-use changes in Morogoro, Tanzania. Cities. 2020;107:102876.
  16. Olorunfemi IE, Olufayo AA, Fasinmirin JT, Komolafe AA. Dynamics of land use land cover and its impact on carbon stocks in Sub-Saharan Africa: an overview. Environ Dev Sustain. 2021;24(1):40–76.
  17. Govender T, Dube T, Shoko C. Remote sensing of land use-land cover change and climate variability on hydrological processes in Sub-Saharan Africa: key scientific strides and challenges. Geocarto Int. 2022;37(25):10925–49.
  18. Molotoks A, Smith P, Dawson TP. Impacts of land use, population, and climate change on global food security. Food Energy Secur. 2020;10(1).
  19. Pandey S, Kumari N, Dash SK, Al Nawajish S. Challenges and monitoring methods of forest management through geospatial application: A review. Advances in remote sensing for forest monitoring. 2022. p. 289–328.
  20. Gasser T, Crepin L, Quilcaille Y, Houghton RA, Ciais P, Obersteiner M. Historical CO2 emissions from land use and land cover change and their uncertainty. Biogeosciences. 2020;17(15):4075–101.
  21. Obermeier WA, Nabel JEMS, Loughran T, Hartung K, Bastos A, Havermann F, et al. Modelled land use and land cover change emissions – a spatio-temporal comparison of different approaches. Earth Syst Dynam. 2021;12(2):635–70.
  22. Sun Y, Goll DS, Huang Y, Ciais P, Wang Y-P, Bastrikov V, et al. Machine learning for accelerating process-based computation of land biogeochemical cycles. Glob Chang Biol. 2023;29(11):3221–34. pmid:36762511
  23. Chen Z, Huntzinger DN, Liu J, Piao S, Wang X, Sitch S, et al. Five years of variability in the global carbon cycle: comparing an estimate from the Orbiting Carbon Observatory-2 and process-based models. Environ Res Lett. 2021;16(5):054041.
  24. Schmid L. Statistical analyses of tree-based ensembles. 2024.
  25. Mishra P, Al Khatib AMG, Yadav S, Ray S, Lama A, Kumari B, et al. Modeling and forecasting rainfall patterns in India: a time series analysis with XGBoost algorithm. Environ Earth Sci. 2024;83(6):163.
  26. Mwakalukwa EE, Meilby H, Treue T. Carbon storage in a dry Miombo woodland area in Tanzania. South For. 2024;86(2):115–24.
  27. Ahmed A, Mihayo I, Ally R, Makindara J. Implications of climate data application. 2023.
  28. Kohnert D. The impact of the industrialized nation’s CO2 emissions on climate change in Sub-Saharan Africa: Case studies from South Africa, Nigeria and the DR Congo. 2024.
  29. Lao W-L, Li X-L, Gong Y-C, Duan X-F. Carbon Dioxide Emission Evaluations in the Chinese Furniture Manufacturing Industry Using the IPCC Tier-2 Methodology. For Prod J. 2023;73(1):6–12.
  30. Ramírez-Melgarejo M, Reyes-Figueroa AD, Gassó-Domingo S, Güereca LP. Analysis of empirical methods for the quantification of N2O emissions in wastewater treatment plants: Comparison of emission results obtained from the IPCC Tier 1 methodology and the methodologies that integrate operational data. Sci Total Environ. 2020;747:141288. pmid:32777511
  31. Rajala T, Heikkinen J, Gogo S, Ahimbisibwe J, Bakanga G, Chamuya N. NAFORMA: National Forest Resources Monitoring and Assessment of Tanzania Mainland: Sampling Design Options for 2nd Biophysical Inventory (NAFORMA II). Food & Agriculture Org; 2022.
  32. Shams SR, Jahani A, Kalantary S, Moeinaddini M, Khorasani N. The evaluation on artificial neural networks (ANN) and multiple linear regressions (MLR) models for predicting SO2 concentration. Urban Clim. 2021;37:100837.
  33. Bai J, Choi SH, Liao Y. Standard errors for panel data models with unknown clusters. J Econom. 2024;240(2):105004.
  34. Zhang Y, Lai MHC. Evaluating two small-sample corrections for fixed-effects standard errors and inferences in multilevel models with heteroscedastic, unbalanced, clustered data. Behav Res Methods. 2024;56(6):5930–46. pmid:38321272
  35. Pfaffermayr M. Bias-corrected cluster-robust standard errors for fixed effects PPML estimators of gravity panel models with autocorrelated disturbances. Empir Econ. 2026;70(2).
  36. Wooldridge JM. Econometric Analysis of Cross Section and Panel Data. 2nd ed. MIT Press; 2010.
  37. Colin Cameron A, Miller DL. A Practitioner’s Guide to Cluster-Robust Inference. J Hum Resour. 2015;50(2):317–72.
  38. Dinca Z, Oprean-Stan C, Balsalobre-Lorente D. UK carbon price dynamics: long-memory effects and AI-based forecasting. Fractal Fract. 2025;9(6):350.
  39. Abdullah S, Ismail M, Fong SY. Multiple linear regression (MLR) models for long term PM10 concentration forecasting during different monsoon seasons. J Sustain Sci Manage. 2017;12(1):60–9.
  40. Tanania V, Shukla S, Singh S. Time series data analysis and prediction of CO2 emissions. In: 2020 10th International Conference on Cloud Computing, Data Science & Engineering (Confluence). IEEE; 2020. p. 665–9.
  41. M. HP, Rehman MZ, Dar AA, Wangmo A. T. Forecasting CO2 Emissions in India: A Time Series Analysis Using ARIMA. Processes. 2024;12(12):2699.
  42. Majka M. ARIMAX Time Series Forecasting with External Variables. ResearchGate. 2024. https://www.researchgate.net/publication/384196976_ARIMAX_Time_Series_Forecasting_with_External_Variables
  43. Syam N, Kaul R. Random Forest, Bagging, and Boosting of Decision Trees. Machine Learning and Artificial Intelligence in Marketing and Sales. Emerald Publishing Limited; 2021. p. 139–82.
  44. Guhan T, Revathy N. EMLARDE tree: ensemble machine learning based random de-correlated extra decision tree for the forest cover type prediction. SIViP. 2024;18(12):8525–36.
  45. Zhou Z, Qiu C, Zhang Y. A comparative analysis of linear regression, neural networks and random forest regression for predicting air ozone employing soft sensor models. Sci Rep. 2023;13(1):22420. pmid:38104205
  46. Nielsen D. Tree boosting with xgboost-why does xgboost win every machine learning competition? NTNU; 2016.
  47. Dong Q, Su Y, Xu G, She L, Chang Y. A Fast Operation Method for Predicting Stress in Nonlinear Boom Structures Based on RS–XGBoost–RF Model. Electronics. 2024;13(14):2742.
  48. Mkhitaryan S, Giabbanelli PJ, Wozniak MK, de Vries NK, Oenema A, Crutzen R. How to use machine learning and fuzzy cognitive maps to test hypothetical scenarios in health behavior change interventions: a case study on fruit intake. BMC Public Health. 2023;23(1):2478. pmid:38082297
  49. Gawusu S, Jamatutu SA, Ahmed A. Predictive Modeling of Energy Poverty with Machine Learning Ensembles: Strategic Insights from Socioeconomic Determinants for Effective Policy Implementation. Int J Energy Res. 2024;2024(1):9411326.
  50. Riahi K, van Vuuren DP, Kriegler E, Edmonds J, O’Neill BC, Fujimori S, et al. The shared socioeconomic pathways and their energy, land use, and emissions implications. Nat Clim Change. 2022;12(3):231–40.
  51. van Vuuren DP, Kok MTJ, Girod B, Lucas PL, de Vries B. Scenarios in Global Environmental Assessments: Key characteristics and lessons for future use. Glob Environ Change. 2012;22(4):884–95.
  52. Hodson TO. Root-mean-square error (RMSE) or mean absolute error (MAE): when to use them or not. Geosci Model Dev. 2022;15(14):5481–7.
  53. Sarris D, Spiliotis E, Assimakopoulos V. Exploiting resampling techniques for model selection in forecasting: an empirical evaluation using out-of-sample tests. Oper Res Int J. 2017;20(2):701–21.
  54. Roberts DR, Bahn V, Ciuti S, Boyce MS, Elith J, Guillera‐Arroita G, et al. Cross‐validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography. 2017;40(8):913–29.
  55. Cerqueira V, Torgo L, Mozetič I. Evaluating time series forecasting models: an empirical study on performance estimation methods. Mach Learn. 2020;109(11):1997–2028.
  56. Wang PP, Huang GH, Li YP, Liu YY, Li YF. An ecological input-output CGE model for unveiling CO2 emission metabolism under China’s dual carbon goals. Appl Energy. 2024;365:123277.
  57. Liu H, Li P, Peng C, Liu C, Zhou X, Deng Z, et al. Application of climate change scenarios in the simulation of forest ecosystems: an overview. Environ Rev. 2023;31(3):565–88.
  58. Hyndman RJ, Athanasopoulos G. Forecasting: principles and practice. OTexts; 2018.
  59. Bergmeir C, Benítez JM. On the use of cross-validation for time series predictor evaluation. Inform Sci. 2012;191:192–213.
  60. Psistaki K, Tsantopoulos G, Paschalidou AK. An Overview of the Role of Forests in Climate Change Mitigation. Sustainability. 2024;16(14):6089.
  61. Raihan A. A review of forest’s contribution to mitigating climate change. In: Proceedings of the International Conference on Forests and Climate Change. 2024.
  62. Griscom BW, Adams J, Ellis PW, Houghton RA, Lomax G, Miteva DA, et al. Natural climate solutions. Proc Natl Acad Sci U S A. 2017;114(44):11645–50. pmid:29078344
  63. Sun C, Zhang Y, Ma W, Wu R, Wang S. The Impacts of Urban Form on Carbon Emissions: A Comprehensive Review. Land. 2022;11(9):1430.
  64. Wang Z, Ahmed Z, Zhang B, Wang B. The nexus between urbanization, road infrastructure, and transport energy demand: empirical evidence from Pakistan. Environ Sci Pollut Res Int. 2019;26(34):34884–95. pmid:31655983
  65. Artmann M, Kohler M, Meinel G, Gan J, Ioja I-C. How smart growth and green infrastructure can mutually support each other: A conceptual framework for compact and green cities. Ecol Indic. 2019;96:10–22.
  66. Chen Z-Y, Zhang T-H, Zhang R, Zhu Z-M, Yang J, Chen P-Y, et al. Extreme gradient boosting model to estimate PM2.5 concentrations with missing-filled satellite data in China. Atmos Environ. 2019;202:180–9.
  67. Gelete G. Hybrid Extreme Gradient Boosting and Nonlinear Ensemble Models for Suspended Sediment Load Prediction in an Agricultural Catchment. Water Resour Manage. 2023;37(14):5759–87.
  68. Hauck J, Schleyer C, Priess JA, Veerkamp CJ, Dunford R, Alkemade R, et al. Combining policy analyses, exploratory scenarios, and integrated modelling to assess land use policy options. Environ Sci Policy. 2019;94:202–10.
  69. District–county-level assessment of greenhouse gas emissions in China. Environmental Impact Assessment Review. 2025;110:107956.