Demographic forecast modelling using SSA-XGBoost for smart population management based on multi-sources data

Jin Wang; Shihan Ma; Qing Lv; Qiang Li

doi:10.1371/journal.pone.0320298

Abstract

Population prediction provides effective data support for social and economic planning and decision-making, particularly when sub-national populations are forecast accurately. To support efficient smart population management, this research focuses on a combination model for forecasting demographic data based on machine learning. Because high population density and mobility increase the error of population forecasts, a dynamic monitoring method based on mobile communication big data, such as mobile phone signals, is proposed. Combined with more structurally stable traditional statistical data, it forms a multi-source dataset that is both accurate and timely. In the study, the Extreme Gradient Boosting (XGBoost) model is used as the base model to create a reliable predictive model for population dynamic monitoring. The sparrow search algorithm (SSA) is used to obtain more reasonable XGBoost parameters and improve forecast accuracy. The combination model is verified on data from the 6th and 7th national population censuses and mobile phone signal data in Hebei Province, yielding predictions of mortality and migration for the following year, categorized by age and gender. Subsequently, the research compares the performance of different metaheuristic algorithms and various gradient-boosting machine-learning models on the dataset. The SSA-XGBoost model demonstrates better prediction performance in the demographic data forecast, with an R² of 0.9984, a mean absolute error of 0.0002, and a mean squared error of 6.9184. The results of the comparative experiments and cross-validation show that the proposed predictive model can effectively forecast demographic data for sub-national regions to realize smart population management.

Citation: Wang J, Ma S, Lv Q, Li Q (2025) Demographic forecast modelling using SSA-XGBoost for smart population management based on multi-sources data. PLoS One 20(6): e0320298. https://doi.org/10.1371/journal.pone.0320298

Editor: Salim Heddam, University 20 Aout 1955 skikda, Algeria, ALGERIA

Received: August 22, 2024; Accepted: February 15, 2025; Published: June 25, 2025

Copyright: © 2025 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data required to replicate the study are available in the Supporting Information files. Due to privacy and commercial restrictions, raw mobile signaling data cannot be shared publicly, but parts of aggregated data are provided. Requests for additional data can be directed to the corresponding author.

Funding: This research was funded by Soft Science Research Project of Innovation Ability Improvement Plan in Hebei Province (Grant number: 23556103D.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Rapid urbanization has led to a sharp increase in urban population density, complexity, and mobility. To address the challenges of rapid urbanization on high-quality urban governance and industrial development, generating accurate and reliable projections of population size, structure, and mobility is essential for decision-making, strategic planning, and meeting the requirements of diverse services and infrastructure, such as formulating public policy [1], allocating healthcare and educational resources [2,3], building smart city [4,5], and energy planning and management [6–8]. The demographic situations can be found and collected from sub-national areas, where provinces, cities, and Statistical Area Level 2 (SA2) [9,10] are examples of sub-national areas. Nevertheless, prediction methods based on these sub-national data as mentioned earlier are susceptible to generating highly inaccurate numbers. This is primarily attributed to the following factors: firstly, exponentially growing intra-regional population movement, migration, and social integration result in high complexity and error of population dynamics and forecast [11]; secondly, the development of populations shows complex non-linear and multi-dimensional features. Traditional forecasting methods, like time series analysis and demographic models [12–14], are often constrained by linear assumptions and limited dimensionality when predicting mortality, fertility, and migration using single-source data. In light of the rapid development of big data and digital technologies, Wilson, Grossman, and their colleagues contend that ensemble forecasting and machine learning algorithms present substantial research opportunities in the area of population prediction [15]. 
Meanwhile, the integration of mobile communication big data, such as mobile phone signaling, geographic location, satellite remote sensing, and social media data, into population prediction analysis offers robust data support for diversified and real-time forecasting. It also maximizes the strengths of machine learning and ensemble prediction models in handling complex, multi-dimensional, and large-scale data. Therefore, dynamic monitoring of population data by multi-resources is an important way to improve the accuracy of population forecasts.

This paper proposes optimization solutions that improve the real-time accuracy of urban population prediction from two perspectives: data from multiple sources and ensemble models based on machine learning. Current studies on population prediction, both in domestic and international contexts, mainly depend on census data and traditional sample survey data [16]. Conducting censuses and surveys is costly in terms of human and material resources, involves multiple intermediary steps, and is susceptible to human error. Moreover, the extended time intervals between censuses limit the ability to perform fine-grained temporal forecasting. Variations in administrative regions also result in inconsistent survey methods, complicating the acquisition of statistical data for specific areas of interest. While acknowledging the limitations of traditional statistical methods, this research does not fully transition to the use of LBS data (including mobile phone signaling data, heatmaps, remote sensing data, etc.), as is common among many researchers [17,18]. Instead, the paper establishes a baseline using traditional statistical methods, such as population census and sampling surveys, and integrates real-time correction data from mobile communication platforms like Baidu Maps and mobile phone signaling. This multi-source data framework is used for dynamic population prediction. Research on population prediction models involves both the refinement of conventional methods, including linear regression [19], ARIMA [20], and Logistic models [21], and the exploration of new approaches using multiple intelligent algorithms and machine learning models for ensemble forecasting. The No Free Lunch Theorem [22] suggests that ensemble forecasting with multiple algorithms can yield better overall results. 
However, incorporating an excessive number of algorithms can greatly increase model complexity, resulting in a higher risk of overfitting, reduced real-time performance, and inefficient use of computational resources. This research develops an ensemble prediction model that includes the Sparrow Search Algorithm (SSA) for its superior parameter optimization and the Extreme Gradient Boosting (XGBoost) model as a robust baseline based on distributed gradient boosting.

The purpose of this research is to develop an ensemble prediction model combining meta-learning and machine-learning techniques, leveraging multi-source population data to forecast and analyze the population structure, birth and death rates, and migration patterns within a region. Specifically, the paper has the following two key research aims:

To build and curate a multi-source population dataset comprising traditional statistical sources such as population censuses and sampling surveys, along with mobile communication data from mobile phone signalling and Baidu Maps.

To establish a streamlined ensemble prediction model capable of automatically optimizing the hyperparameters of machine learning models, and to validate the model’s effectiveness in regional population forecasting through comparative analysis.

In this study, the researchers gathered population census data from 2010 and 2020, along with multiple rounds of sampling survey data between 2010 and 2020, and mobile phone signaling data from 2019 to 2020 for Hebei Province. By partitioning the dataset into training and validation sets, they obtained predicted population data for Hebei Province in 2020 classified by age and gender which is presented in section Data and methodology. This section also explains the base models, the combination methods, and the overall experimental design used in this study. The Result and analysis section presents population status and structure. The Discussion Section compares the performance of SSA-XGBoost model with other models.

Literature review

Using communication data for dynamic population monitoring is a feasible method that has become a research hot spot in recent years [23–28]. For example, Calabrese et al. used mobile communication data to monitor the population of Rome in real time [29]; Naaman et al. studied the daily behaviours of the urban population based on Twitter check-in data [30]. Based on mobile positioning data, Martin Sveda et al. [31] provide a method to transform data from the mobile network into target spatial units, ensuring the precision and accuracy of population estimates. In addition, Yongping Zhang et al. [32] use mobile phone data to investigate the working and residential segregation of migrants in Longgang City, China. Fabio Ricciato et al. [33] proposed an approach to estimating present population density from data collected by Mobile Network Operators (MNOs). Population statistics based on mobile communication big data are easier to produce and cost less manpower and material resources, so they enable high-frequency monitoring. Moreover, mobile data contains multiple dimensional attributes, including temporal and spatial information, user characteristics, and flow rules. However, multi-source data differ in semantics, scale, data models, and storage formats and structures.

Computational intelligence and machine learning methods have been very promising in the field of prediction. Recurrent neural networks [34] based on improved architectures such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) exhibit strong time series prediction capabilities in scenarios like climate science [35], traffic flow [36] and emergency evacuation [37]. While these improved structures, such as LSTM and GRU, address some of the shortcomings of RNNs in terms of gradient issues and long-term dependency relationships to a certain extent, the fact remains that RNNs require multi-step backpropagation through time (BPTT), leading to longer training processes and higher consumption of computational resources. Given these challenges, machine learning algorithms based on gradient boosting decision trees (GBDT), like Adaboost [38], XGBoost [39], LightGBM [40], and CatBoost [41], have gradually come into research focus. Combining the fine-grained control over large-scale data and the generalization ability for prediction targets, the XGBoost model has become one of the preferred choices for many researchers. The XGBoost model is considered to be very flexible, and it can adapt to a variety of different problems. However, this also means that the hyperparameters have to be tuned for each of these specific tasks. Selecting the optimal values for hyperparameters through optimization can be regarded as a non-deterministic polynomial (NP) problem [42]. Metaheuristic optimizations, which use random operators, trial-and-error processes, and random scanning of the problem-solving space to generate efficient solutions to optimization problems [43], play a crucial role while dealing with NP-hard challenges that are commonly faced in parameter optimization, regression analysis, cluster analysis, etc.

There is limited research on stacking the XGBoost model with metaheuristic algorithms for population prediction analysis and comparing the performance of the stacked model with other standalone machine learning models. Practitioners need to decide not only which individual models to integrate with XGBoost but also how to integrate them and allocate weights among the algorithms. Relatively novel optimization algorithms such as the crayfish optimization algorithm [44], reptile search algorithm [45], red fox optimizer [46], sparrow search algorithm (SSA) [47], particle swarm optimization (PSO) [48], and other swarm algorithms have been used to obtain more reasonable parameter settings and improve the predictive performance of the model. Mohamed Salb et al. [49] designed a solution that combines convolutional neural networks (CNNs) for feature extraction with the XGBoost model for intrusion detection; by customizing the reptile search algorithm for hyperparameter optimization, the methodology provides a resilient defence against emerging threats in IoT security. Tamara Zivkovic et al. [50] suggested a modified variant of the reptile search optimization algorithm, named HARSA, to calibrate the XGBoost hyperparameters; compared with other metaheuristic algorithms, the proposed scheme showed superior classification accuracy. Mihailo Todorovic et al. [51] compared the performance of six metaheuristic algorithms used for tuning the XGBoost algorithm. The results showed that models with hyperparameter optimization outperformed the benchmark models in financial data prediction. Nguyen Thi Thuy Linh et al. [52] evaluated the performance of the hybrid genetic algorithm (GA) optimization method with the XGB model and K-nearest neighbour. Validated on the test dataset, using the genetic algorithm as an optimizer for determining the best parameters in the XGB model increased efficiency in that study. Luka Jovanovic et al. [53] tested eight metaheuristic algorithms for XGBoost optimization to achieve a superior level of performance in estimating the relative importance of each pollutant level and meteorological parameter for the prediction of benzene concentrations. Among these algorithms, the SSA is a population-based optimization algorithm proposed on the basis of the foraging and anti-predatory behaviours of sparrow populations and built upon existing population intelligence algorithms such as GWO, GA, and PSO. It presents certain advantages in terms of stability, convergence accuracy, and velocity.

As discussed in the related literature, the combination of metaheuristic algorithms and machine learning models has been proven to improve model accuracy across various fields. XGBoost algorithm has high precision, strong flexibility, and can prevent data over-fitting, but this algorithm has high time and space complexity. The metaheuristic algorithm SSA, on the other hand, can further enhance the predictive performance of the XGBoost model through hyperparameter optimization. Therefore, a global optimization method based on SSA is proposed in this paper to identify the improved XGBoost model to realize population dynamic monitoring.

Data and methodology

The methodology employed in this investigation is illustrated in Fig 1. The initial phase of the study started with data collection. After a data pre-processing stage, the acquired data is ready for modelling. Data pre-processing is a 4-stage process involving the following steps:

Data Integration.
Data cleaning and organization.
Finding missing data and cleaning outliers.
Generating data sets for training, testing, and validation.

Fig 1. Methodology adopted for this study.

https://doi.org/10.1371/journal.pone.0320298.g001

The data pre-processing is then followed by correlation analysis to find out the correlation between input and output variables. The machine learning models will be proposed for demographic data. The model performance evaluation is then carried out using various metrics.

Data collection

The following data are used in this study:

(1). National census data: These data were obtained from the 6th and 7th censuses and reflect the basic population situation, changes in family structure, improvements in education level, and differences in population distribution among regions. The statistics from 2010 to 2020 cover the entire year, and the statistics for 2021 cover the first to the twelfth month. The national census data can be obtained from the National Bureau of Statistics.
(2). Mobile information data: The main data source is mobile phone communication network signalling data, including mobility control (MC) port data and mobility management entity (MME) port data. MC port data is the location data of 2G/3G mobile phones, generated on average about every 20 minutes during the day. MME port data is 4G mobile phone location data, generated every 5 minutes. Voice call detail records (CDRs) are used to exclude numbers that have not received a voice call in the past six months. The 13-digit numbers starting with 106 and 144 are also excluded. The mobile information data can be obtained by purchasing desensitized data from the relevant communication companies.

Before data analysis and processing, the following two points about the population indicator statistical methods and the migration population are clarified:

(1). Population indicator statistical methods
a. Stable population: A day on which a user is present in the region for more than 10 hours is considered a stable day. A user whose stable days exceed 1/2 of the days in a month is counted in the stable population for that month.
b. Population dimension (gender, age, residence registration location): The stable population is associated with ID numbers. The gender identification digit of the ID number distinguishes male from female, and age is calculated from the birth date. Precipitation analysis was conducted on children aged 0–17 who regularly visited children’s hospitals, primary and secondary schools, playgrounds, and other locations.
(2). Migration population: The intra-provincial stable population of the current year minus that of the previous year is taken as the inter-provincial in-migration. The intra-provincial stable population of the previous year minus that of the current year is taken as the inter-provincial out-migration. For users in Hebei, the roaming place is taken as the stable place; for users in other provinces, the home place of the mobile number is taken as the stable place.
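The stable-population rule can be illustrated with a minimal Python sketch. The function names and the representation of the input as a list of daily hours are assumptions made for illustration; only the 10-hour and one-half thresholds come from the text.

```python
def is_stable_day(hours_in_region, threshold_hours=10):
    """A day counts as stable when presence exceeds the hour threshold."""
    return hours_in_region > threshold_hours


def is_stable_resident(daily_hours, min_share=0.5):
    """A user is stable when stable days exceed the given share of the period.

    daily_hours: hours of presence in the region, one value per day
    (hypothetical input format).
    """
    stable_days = sum(1 for h in daily_hours if is_stable_day(h))
    return stable_days > min_share * len(daily_hours)


# 16 stable days out of 30 exceeds one half; 15 out of 30 does not.
mostly_present = [12] * 16 + [3] * 14
half_present = [12] * 15 + [3] * 15
```

Because the rule uses strict inequalities, a user present on exactly half of the days is not counted as stable in this sketch.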

Data pre-processing

Data pre-processing mainly involves handling null or missing values and outliers, both of which need to be removed before the data is used in a model. The data was grouped to identify the missing values. The outliers were identified using the interquartile range (IQR) [54]:

$$\mathrm{IQR} = Q_3 - Q_1 \tag{1}$$

where $Q_1$ is the first quartile, corresponding to 25%, and $Q_3$ is the third quartile, corresponding to 75%. The range considered was $[Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}]$.

The outlier points appear in the data either because of faulty detection or because of an exceptional event. To ensure the comparability of the predictions, feature scaling was implemented consistently. The training set is denoted by $D = \{(x_i, y_i)\}$, where $x_i \in \mathbb{R}^n$ represents the n-dimensional explanatory space and $y_i$ is the dependent variable. The normalization is denoted as:

(2)
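The outlier filter and feature scaling can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the fence multiplier `k = 1.5` is the conventional choice, the quartile interpolation method is one of several in use, and min-max scaling stands in for the normalization formula, which is not reproduced in this text.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences built from the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


def min_max_scale(values):
    """Assumed normalization: rescale values linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


sample = [10.0, 12.0, 11.0, 13.0, 12.5, 95.0]  # made-up values
cleaned = remove_outliers(sample)
scaled = min_max_scale(cleaned)
```

In practice the scaling parameters would be computed on the training set only and then reused for the test and validation sets, so that no information leaks across the split.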

Machine learning algorithms

In recent years, the field of data science has brought machine learning and artificial intelligence to the forefront, and numerous machine learning algorithms have either emerged or have gained popularity. In this paper, XGBoost is adapted to make predictions iteratively on the training dataset and avoid too many splits, reduce the complexity of the model, and prevent the model from overfitting.

XGBoost model

XGBoost is an algorithm based on the gradient-boosting decision tree (GBDT). Compared with GBDT, XGBoost expands the loss function to the second order with a Taylor expansion and adds a regularization term to control model complexity and avoid overfitting. The objective function is composed of two parts: the loss function and the regularization term. The loss function is defined as:

$$L = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) \tag{3}$$

where $n$ is the number of training samples, $l$ is the loss function for an individual sample, $\hat{y}_i$ is the predicted value for the $i$-th training sample, and $y_i$ is the true value for the $i$-th training sample. Each tree $f_t$ could be defined as follows:

$$f_t(x) = w_{q(x)} \tag{4}$$

where $w$ is the weight vector of the leaf nodes and $q$ is the mapping from samples to leaf nodes.

Then, the complexity of a tree is expressed as follows:

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{5}$$

where $\gamma$ and $\lambda$ are the penalty coefficients, $T$ is the number of leaf nodes, and the term $\frac{1}{2}\lambda \sum_{j} w_j^2$ penalizes the scores of the leaf nodes.
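As a small numerical illustration of how the regularization term enters the objective, the sketch below evaluates the tree complexity and adds it to a loss. The squared-error loss and the function names are assumptions chosen for illustration; the text does not fix a specific per-sample loss.

```python
def tree_complexity(leaf_weights, gamma, lam):
    """Complexity of one tree: gamma * T + 0.5 * lam * sum of squared leaf weights."""
    num_leaves = len(leaf_weights)
    return gamma * num_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


def regularized_objective(y_true, y_pred, trees, gamma, lam):
    """Summed squared-error loss (assumed) plus the complexity of every tree.

    trees: list of leaf-weight lists, one per tree.
    """
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    penalty = sum(tree_complexity(w, gamma, lam) for w in trees)
    return loss + penalty


# One tree with two leaves: 1.0 * 2 + 0.5 * 2.0 * (0.25 + 0.25) = 2.5
omega = tree_complexity([0.5, -0.5], gamma=1.0, lam=2.0)
objective = regularized_objective([1.0, 2.0], [1.0, 1.0], [[0.5, -0.5]], 1.0, 2.0)
```

Larger penalty coefficients make additional leaves and large leaf scores more expensive, which is how the regularization term discourages overly complex trees.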

SSA algorithm

Sparrow Search Algorithm (SSA) is a novel swarm intelligence optimization algorithm inspired by the foraging and anti-predation behaviour of sparrows. During foraging, the sparrows are divided into discoverers (seekers) and joiners (followers). The discoverer is responsible for finding food and providing foraging areas and directions for the whole sparrow population, while the joiner uses the discovery to obtain food. In addition, sparrow populations exhibit anti-predation behaviour when they become aware of danger. In SSA, discoverers with better fitness values will preferentially obtain food during the search process. During each iteration, the location update of the discoverer is described as follows:

$$X_{i,j}^{t+1} = \begin{cases} X_{i,j}^{t} \cdot \exp\left(\dfrac{-i}{\alpha \cdot iter_{max}}\right), & R_2 < ST \\ X_{i,j}^{t} + Q \cdot L, & R_2 \geq ST \end{cases} \tag{6}$$

where $t$ denotes the current number of iterations, and $iter_{max}$ is a constant that denotes the utmost number of iterations. $X_{i,j}$ represents the position information of the $i$-th sparrow in dimension $j$. $\alpha$ is a random number. $ST$ and $R_2$ represent the safety value and warning value respectively. $Q$ is a random number that follows a normal distribution. $L$ represents a $1 \times d$ matrix in which each element is 1.

The joiner obtains food from the seekers. The continuously updated location of the joiner is as follows:

$$X_{i,j}^{t+1} = \begin{cases} Q \cdot \exp\left(\dfrac{X_{worst}^{t} - X_{i,j}^{t}}{i^{2}}\right), & i > n/2 \\ X_{P}^{t+1} + \left|X_{i,j}^{t} - X_{P}^{t+1}\right| \cdot A^{+} \cdot L, & \text{otherwise} \end{cases} \tag{7}$$

where $X_P$ denotes the producer’s optimal position and $X_{worst}$ is the sparrow population’s worst position. $A$ denotes a matrix assigning 1 or −1 randomly at each element, and its dimension is $1 \times d$. The scroungers are starving with low fitness when $i > n/2$. When the sparrow population detects danger, sparrows at the edges quickly move to a safer area. The middle sparrows of the flock will approach other sparrows at random. The sparrows update their positions according to the following formula:

$$X_{i,j}^{t+1} = \begin{cases} X_{best}^{t} + \beta \cdot \left|X_{i,j}^{t} - X_{best}^{t}\right|, & f_i > f_g \\ X_{i,j}^{t} + K \cdot \left(\dfrac{\left|X_{i,j}^{t} - X_{worst}^{t}\right|}{(f_i - f_w) + \varepsilon}\right), & f_i = f_g \end{cases} \tag{8}$$

where $X_{best}$ is the best position of the whole sparrow population; $\beta$ is defined as the control parameter of step size. $f_i$, $f_g$, and $f_w$ are the fitness values of the present sparrow, the current global best, and the current global worst, respectively. When $f_i > f_g$, the sparrow is located at the edge of the whole group. When $f_i = f_g$, the middle sparrows of the flock have spotted the danger and have to move closer to other sparrows. $K$ determines the sparrow’s movement direction. $\varepsilon$ denotes a small constant.

Algorithm 1 The framework of SSA.

Input:

$iter_{max}$: the maximum iterations; $PD$: the number of producers; $SD$: the number of sparrows who perceive the danger; $R_2$: the alarm value; $n$: the number of sparrows

Initialize a population of $n$ sparrows and define its relevant parameters.

Output: $X_{best}$, $f_g$

1: while ($t < iter_{max}$)

2: Rank the fitness values and find the current best individual and the current worst individual.

3: $R_2 = rand(1)$

4: for $i = 1 : PD$

5: Using equation (6) update the sparrow’s location;

6: end for

7: for $i = (PD + 1) : n$

8: Using equation (7) update the sparrow’s location;

9: end for

10: for $l = 1 : SD$

11: Using equation (8) update the sparrow’s location;

12: end for

13: Get the current new location;

14: If the new location is better than before, update it;

15: $t = t + 1$

16: end while

17: return $X_{best}$, $f_g$
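The framework above can be sketched in plain Python for a minimization problem. This is a minimal illustration, not the authors' implementation: the default population size, producer and danger-aware counts, safety threshold `st`, the box bounds, and the scalar treatment of the sign-matrix step in the joiner update are all assumptions, and the `sphere` test function is made up for the demonstration.

```python
import math
import random


def sparrow_search(objective, dim, lower, upper, n=20, max_iter=100,
                   n_producers=4, n_aware=2, st=0.8, seed=0):
    """Minimise `objective` over [lower, upper]^dim with a basic SSA loop."""
    rng = random.Random(seed)

    def clip(v):
        return min(max(v, lower), upper)

    pop = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n)]
    fit = [objective(x) for x in pop]

    for _ in range(max_iter):
        # Rank the flock: index 0 is the current best, index n-1 the worst.
        order = sorted(range(n), key=lambda i: fit[i])
        pop = [pop[i] for i in order]
        fit = [fit[i] for i in order]
        best, worst = pop[0][:], pop[-1][:]
        f_best, f_worst = fit[0], fit[-1]
        new = [x[:] for x in pop]
        r2 = rng.random()  # alarm value, drawn once per iteration

        # Producers (discoverers): shrink or take a Gaussian step.
        for i in range(n_producers):
            if r2 < st:
                alpha = 1.0 - rng.random()  # random number in (0, 1]
                scale = math.exp(-(i + 1) / (alpha * max_iter))
                new[i] = [clip(x * scale) for x in pop[i]]
            else:
                q = rng.gauss(0.0, 1.0)
                new[i] = [clip(x + q) for x in pop[i]]
        x_p = min(new[:n_producers], key=objective)  # best producer position

        # Joiners: starving ones fly elsewhere, the rest follow the producer.
        for i in range(n_producers, n):
            if i + 1 > n / 2:
                q = rng.gauss(0.0, 1.0)
                new[i] = [clip(q * math.exp((w - x) / (i + 1) ** 2))
                          for x, w in zip(pop[i], worst)]
            else:
                # Assumed reading of the sign-matrix term as one scalar step.
                signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
                step = sum(abs(x - p) * a
                           for x, p, a in zip(pop[i], x_p, signs)) / dim
                new[i] = [clip(p + step) for p in x_p]

        # Danger-aware sparrows, chosen at random from the flock.
        for i in rng.sample(range(n), n_aware):
            if fit[i] > f_best:
                beta = rng.gauss(0.0, 1.0)
                new[i] = [clip(b + beta * abs(x - b))
                          for x, b in zip(pop[i], best)]
            else:
                k = rng.uniform(-1.0, 1.0)
                eps = 1e-12
                denom = (fit[i] - f_worst) + eps or eps  # guard against zero
                new[i] = [clip(x + k * abs(x - w) / denom)
                          for x, w in zip(pop[i], worst)]

        # Greedy acceptance: keep a move only if it improves that sparrow.
        for i in range(n):
            f_new = objective(new[i])
            if f_new < fit[i]:
                pop[i], fit[i] = new[i], f_new

    i_best = min(range(n), key=lambda i: fit[i])
    return pop[i_best], fit[i_best]


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in x)


best_x, best_f = sparrow_search(sphere, dim=3, lower=-5.0, upper=5.0)
```

The greedy acceptance step means each sparrow's fitness never gets worse from one iteration to the next, and fixing `seed` makes a run reproducible.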

SSA-based parameter optimization

In the proposed method, each sparrow represents a set of XGBoost parameters, and its position represents the parameter values. The mean square error of cross-validation is the objective function. SSA is employed to identify the parameter values that minimize the objective function. The goal is to find a location $x^*$ that minimizes the objective function $f(x)$, i.e.,

$$x^* = \arg\min_{x} f(x) \tag{8}$$

where $x$ denotes the position of each sparrow. With each iteration, each sparrow position updates by the subsequent formula:

(9)

where the update uses the current best sparrow position, a learning rate parameter, the step size, and a random perturbation term. According to (10), the fitness of each sparrow is calculated and the position of the current best sparrow is updated. If the fitness of a specific sparrow is better than the current best one, the best position is replaced by that sparrow’s position. This process is repeated until the optimal solution remains unchanged within a specified number of iterations, or until a predetermined number of iterations is reached.
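The mapping from a sparrow position to a fitness value can be sketched as follows. This is a hedged illustration: the hyperparameter names follow common XGBoost parameters, but the search ranges in `BOUNDS`, the unit-interval encoding of positions, the contiguous fold split, and the `train_with_params` callable are all hypothetical; the study's actual parameters are those listed in its Table 1.

```python
def decode_position(position, bounds):
    """Map a position in [0, 1]^d to named hyperparameter values."""
    params = {}
    for u, (name, (low, high, kind)) in zip(position, bounds.items()):
        value = low + min(max(u, 0.0), 1.0) * (high - low)
        params[name] = int(round(value)) if kind is int else value
    return params


def cv_mse(fit_predict, xs, ys, k=5):
    """Mean squared error averaged over k contiguous validation folds."""
    n = len(xs)
    fold_errors = []
    for f in range(k):
        start, stop = f * n // k, (f + 1) * n // k
        x_val, y_val = xs[start:stop], ys[start:stop]
        x_tr, y_tr = xs[:start] + xs[stop:], ys[:start] + ys[stop:]
        preds = fit_predict(x_tr, y_tr, x_val)
        fold_errors.append(
            sum((p - y) ** 2 for p, y in zip(preds, y_val)) / len(y_val))
    return sum(fold_errors) / k


# Hypothetical search ranges: (low, high, type).
BOUNDS = {
    "learning_rate": (0.01, 0.3, float),
    "max_depth": (3, 10, int),
    "n_estimators": (50, 500, int),
}


def make_fitness(train_with_params, xs, ys):
    """Build the objective that SSA minimises: position -> cross-validated MSE."""
    def fitness(position):
        params = decode_position(position, BOUNDS)
        return cv_mse(lambda a, b, c: train_with_params(params, a, b, c), xs, ys)
    return fitness


def mean_model(params, x_train, y_train, x_val):
    """Stand-in for a real regressor: predicts the training mean."""
    mean = sum(y_train) / len(y_train)
    return [mean for _ in x_val]


fitness = make_fitness(mean_model, list(range(10)), [2.0] * 10)
example_params = decode_position([0.0, 1.0, 1.0], BOUNDS)
example_score = fitness([0.5, 0.5, 0.5])
```

In an actual experiment, `mean_model` would be replaced by a function that trains a gradient-boosting regressor with the decoded parameters and returns its predictions for the validation fold.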

Ultimately, the optimal sparrow position is the optimal solution required by the model and the parameters are shown in Table 1.

Table 1. The main parameters in the model.

https://doi.org/10.1371/journal.pone.0320298.t001

In the process of optimizing a population prediction system model based on SSA-XGBoost, each candidate set of parameters is treated as a “sparrow”, and the optimal parameter values are found by simulating the sparrows’ foraging and anti-predation behaviour. The flowchart of the SSA-XGBoost model is shown in Fig 2.

Fig 2. Flowchart of SSA-XGBoost model.

https://doi.org/10.1371/journal.pone.0320298.g002

Evaluation indices

Evaluation is a critical stage in the implementation of any research project. Each model or procedure that is implemented must undergo an assessment using one or more metrics. The various model evaluation metrics used in the study are as follows:

(10)

(11)

(12)

where, is the number of observations, and are the standard deviation of and respectively, and observed values respectively. is predicted value, is the actual value.
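For reference, the three indices reported in this study (R², mean absolute error, and mean squared error) can be computed as in the sketch below. The formulas are the conventional definitions and are an assumption here, since the exact form of the paper's equations is not reproduced in this text; the sample values are made up.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
r2 = r2_score(actual, predicted)
```

Lower MAE and MSE and an R² closer to 1 indicate a better fit; MSE is expressed in squared units of the target, so it is not directly comparable with MAE.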

Results and analysis

Study area

In this paper, the dynamic management of population data in the Hebei Province of China is taken as an example due to its profound impact on the enhancement of regional comprehensive carrying capacity, economic development imbalance, and sustainable development. Compared with the data of the sixth population census in 2010, the number of separated households increased by 11,478,362 people, an increase of 138.34% in Hebei Province. The floating population increased by 8,657,908 people, an increase of 129.71%, with the inter-provincial floating population accounting for 20.58% and the provincial floating population accounting for 79.42%.

Population status

The population’s natural growth rate is very close to zero, and population growth has substantially decreased.

The birth rate has fluctuated resulting in an accelerated decline in the population size of Hebei Province over the past decade. The birth rate experienced a brief surge in 2013, 2014, 2016, and 2017, following the implementation of the “two-child only” and “two-child universal” policies. Subsequently, it experienced a gradual decline. In 2020, it is anticipated to decrease to 8.2 per thousand. The mortality rate is 7.22 per thousand, with the rate fluctuating at a low level. In general, it has reached the stage of low birth rate, low mortality rate, and low natural growth rate, as illustrated in Fig 3.

Fig 3. The natural change of population in Hebei Province from 2009 to 2020.

(Data source: National Bureau of Statistics). Note: The population of 2000 and 2001 is the projected figure of the current population census, the population of other years is the projected data of the annual population sampling survey, and the population data of each region since 2005 is the standard of the permanent population.

https://doi.org/10.1371/journal.pone.0320298.g003

Changes in the number of offspring, working-age population, and elderly.

In the past decade, the proportion of the working-age population decreased, while the number of infants and elders increased in Hebei province. The data from the seventh population census indicates that the number of children aged 0–14 years in Hebei Province reached 15.09 million in 2020, a 2.99 million increase from 2010. Additionally, the proportion of children aged 0–14 years rose from 16.83% in 2010 to 20.22%. The working-age population decreased from 53.84 million to 49.13 million, a decrease of 4.71 million. The proportion of the population aged 15–64 decreased from 74.93% to 65.86%. The number of geriatric individuals aged 65 and older increased from 5.92 million in 2010 to 10.39 million in 2020, a 4.47 million increase. Concurrently, the percentage of individuals aged 65 and older rose from 8.24% to 13.92%. The details are shown in Fig 4.

Fig 4. Changes in the number and proportion of children, working-age, and elderly in Hebei Province.

(Data source: The sixth and seventh population censuses of Hebei Province).

https://doi.org/10.1371/journal.pone.0320298.g004

Population structure analysis

According to the statistical data of mobile signaling in 2020 and 2021 (see Fig 5), the stable population of Hebei Province and prefecture-level cities in 2020 amounted to 75,996,000, and the stable population in 2021 amounted to 74,697,000, a year-on-year decrease of 1.71%. The top three cities with the largest stable population in 2020 were Baoding, Shijiazhuang, and Handan. In 2021, the top three cities with the largest stable population are Shijiazhuang, Baoding, and Tangshan, while the cities with the least stable population are Hengshui, Chengde and Qinhuangdao. It can be seen that the population size of cities in Hebei Province does not change much, and there is a positive correlation between population size and urban location, economic level and traffic conditions.
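The year-on-year change quoted above follows directly from the two stable-population totals:

```latex
\[
\frac{75{,}996{,}000 - 74{,}697{,}000}{75{,}996{,}000}
= \frac{1{,}299{,}000}{75{,}996{,}000}
\approx 0.0171 = 1.71\%
\]
```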

Fig 5. Stable population by region in Hebei Province in 2020.

https://doi.org/10.1371/journal.pone.0320298.g005

Fig 6 shows the demographic structure of the population in Hebei Province in 2020. The population pyramid of Hebei Province is ageing, with a gradual decrease in the lower echelons. However, the working-age population continues to dominate the province, with a concentration of young and middle-aged individuals between the ages of 30 and 59.

Fig 6. Population age pyramid of Hebei Province in 2020.

https://doi.org/10.1371/journal.pone.0320298.g006

The population migration data comes from the 2020 Hebei Unicom mobile signalling data and Baidu VIP big data, and the GNP data of each region comes from the economic census data of the National Bureau of Statistics.

Population prediction comparison

In this section, the SSA algorithm is used to optimize the parameters of the XGBoost model to obtain reasonable parameter values. Twenty-five runs were conducted for the metaheuristic method, using a population of 10 solutions and a maximum of ten rounds in each run (iterations = 10), as listed in Table 2. MSE was used as the objective function to be minimized throughout the experiments. Figs 7 and 8 visualize the experimental outcomes for the fitness function as a convergence graph and a box plot, respectively. The optimal parameters and the fitness value are shown in Table 3.

Table 2. Parameter Values of the models.

https://doi.org/10.1371/journal.pone.0320298.t002

Table 3. Experiment Settings.

https://doi.org/10.1371/journal.pone.0320298.t003

Fig 7. SSA fitness curve.

https://doi.org/10.1371/journal.pone.0320298.g007

Fig 8. Box plots of the fitness function across independent runs.

https://doi.org/10.1371/journal.pone.0320298.g008

To show the improvement in prediction accuracy achieved by the proposed method, the traditional XGBoost model and the SSA-XGBoost model were both used to predict mortality and migration in Hebei Province. The results are shown in Figs 9–12.

Fig 9. Results of training and prediction of death population for traditional XGBoost model.

https://doi.org/10.1371/journal.pone.0320298.g009

Fig 10. Results of training and prediction of death population for SSA-XGBoost model.

https://doi.org/10.1371/journal.pone.0320298.g010

Fig 11. Results of training and prediction of migration data for traditional XGBoost model.

https://doi.org/10.1371/journal.pone.0320298.g011

Fig 12. Results of training and prediction of migration data for SSA-XGBoost model.

https://doi.org/10.1371/journal.pone.0320298.g012

Three evaluation indices (MSE, MAE, and R²) were used to quantitatively evaluate the prediction performance of the two models, and the results are shown in Table 4. Both models fit adequately, with R² values above 0.99 on both the training and test data.

Table 4. Statistics of the predictive performance indicators of the two models.

https://doi.org/10.1371/journal.pone.0320298.t004

Nevertheless, the R² values of 1 and 0.9999 on the training set indicate that traditional XGBoost fits the training data almost perfectly yet performs worse on the test set, which points to overfitting. The results show that SSA-XGBoost outperforms traditional XGBoost on a variety of indices, limits overfitting effectively through regularization, and achieves significantly higher prediction accuracy through parameter optimization.
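For reference, the three evaluation indices can be computed as in the short sketch below. The function name `regression_metrics` and the sample values are illustrative assumptions, not data from the study. Computing the same metrics separately on the training and test sets is what exposes the kind of train/test gap discussed above.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, and R² for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    mse = float(np.mean(residual ** 2))
    mae = float(np.mean(np.abs(residual)))
    ss_res = float(np.sum(residual ** 2))                     # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))     # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "MAE": mae, "R2": r2}


# Placeholder values, not data from the study.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(metrics)
```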

According to Table 4, SSA-XGBoost performs best on the death data set. To further verify the accuracy and stability of the model, SSA-XGBoost was cross-validated on the death data set, and the results are shown in Fig 13. The best and mean scores decrease clearly over the iterations, indicating that the model keeps improving and finding better parameter combinations. Although the median scores are not steady because of outliers or missing values, the standard deviation continues to decrease over the iterations. The rolling average line gives a smoother view of this trend and further confirms the increase in stability.

Fig 13. Performance metrics of cross-validation over iterations.

https://doi.org/10.1371/journal.pone.0320298.g013
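The cross-validation procedure can be illustrated with a minimal k-fold sketch. It assumes a plain least-squares linear model as a stand-in for SSA-XGBoost and uses synthetic data; the function name `k_fold_mse`, the choice of five folds, and the data are hypothetical and are not taken from the study.

```python
import numpy as np


def k_fold_mse(X, y, k=5, seed=0):
    """Return the held-out MSE of each of k folds for a least-squares linear model."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Fit on the training folds only (intercept via a column of ones).
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Score on the held-out fold.
        A_test = np.column_stack([X[test_idx], np.ones(len(test_idx))])
        scores.append(float(np.mean((y[test_idx] - A_test @ coef) ** 2)))
    return np.array(scores)


rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(100, 2))            # synthetic features
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=100)
scores = k_fold_mse(X, y, k=5)
print(scores.mean(), scores.std())
```

A small spread of the fold scores around their mean is the kind of stability evidence the text draws from the decreasing standard deviation.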

Discussion

Comparisons with several baseline models

To further explore the prediction ability and generality of the SSA-XGBoost model, several other models were used to forecast female deaths in 2020, and their predictions were compared. Because XGBoost is an optimized distributed gradient boosting tree method, the machine-learning models selected for comparison in this section are SSA-AdaBoost and SSA-CatBoost.

Three evaluation indices were used to quantitatively evaluate the prediction performance of the three models, and the specific values are shown in Table 5. All the evaluation indices indicated that the SSA-XGBoost model achieved the most accurate regression, and the prediction accuracy was significantly improved by parameter optimization.

Table 5. Statistics of the predictive performance indicators of three different models.

https://doi.org/10.1371/journal.pone.0320298.t005

As for deep learning, RNN variants such as LSTM and GRU are well known to handle time series well. The prediction results of the three models on the training (2010–2019) and test (2020) data sets are shown in Figs 14 and 15.

Fig 14. Comparison of prediction and true values for the training datasets.

https://doi.org/10.1371/journal.pone.0320298.g014

Fig 15. Comparison of prediction and true values for the test datasets.

https://doi.org/10.1371/journal.pone.0320298.g015

The prediction curves of the LSTM and GRU models in Fig 15 are relatively smooth. LSTM captures both short-term and long-term dependencies in the data well, and its prediction curve closely follows the actual data, particularly at peaks and troughs. Like the LSTM, the GRU model handles sequential data well; its predictions align with the actual data, although slight deviations may appear where the data change sharply. The green line representing XGBoost is closely aligned with the red dots representing the actual data, indicating that it has accurately captured the trends and patterns in the data. As the four predictive performance indices in Table 6 show, the SSA-XGBoost model has the strongest prediction ability among the three models, and its prediction error is relatively low.

Table 6. Statistics of the predictive performance indicators of the three models.

https://doi.org/10.1371/journal.pone.0320298.t006

Comparison of different competitor algorithms

This subsection presents the simulation results on the death data set for the XGBoost model optimized by SSA and by three other recent competitor algorithms: Crayfish, Reptile, and Redfox. The pseudocode for each of these competitor algorithms is outlined in Algorithms 2–4.

Algorithm 2 The framework of Crayfish.

1. Initialize the population of crayfish randomly.
2. Evaluate the fitness of each crayfish.
3. While stopping condition is not met:
    a. For each crayfish, determine its state (e.g., foraging, resting, or defending).
    b. Update the crayfish’s position based on attraction and repulsion forces:
        i. Attraction force: towards better solutions or prey.
        ii. Repulsion force: away from predators or danger zones.
    c. Foraging and defence behaviors:
        i. Simulate the crayfish searching for food while avoiding predators.
        ii. Update position accordingly.
    d. Evaluate the fitness of each crayfish’s new position.
    e. Update the global best solution if a better one is found.
4. Return the global best solution found.

Algorithm 3 The framework of Reptile.

1. Initialize a population of reptiles randomly within the search space.
2. Evaluate the fitness of each reptile.
3. While stopping condition is not met:
    a. For each reptile, simulate movement using:
        i. Exploration: Move to a random direction (search for new solutions).
        ii. Exploitation: Move towards known better solutions (use past knowledge).
    b. Account for territorial behavior:
        If the reptile encounters others, simulate conflict or cooperation.
    c. Evaluate the fitness of each reptile’s new position.
    d. Update the global best position if necessary.
4. Return the best solution found.

Algorithm 4 The framework of Red Fox.

1. Initialize a population of red foxes randomly within the search space.
2. Evaluate the fitness of each red fox.
3. While stopping condition is not met:
    a. Each red fox selects a strategy based on its current position:
        i. Search for food: move toward higher fitness (search for better solutions).
        ii. Escape predators: move to avoid worse solutions or stagnation.
        iii. Territorial behavior: defend a region or seek mates (exploit known good regions).
    b. Update the fox’s position based on its strategy.
    c. Evaluate the fitness of the updated position.
    d. Update the global best solution if a better one is found.
4. Return the best solution found.
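Algorithms 2–4 share the same outer loop: initialize, evaluate, move each individual by exploration or exploitation, and track the global best. The sketch below shows only that shared skeleton and does not reproduce the specific update rules of Crayfish, Reptile, or Red Fox. The function name `population_search`, the exploration probability, the greedy acceptance rule, and the sphere objective are assumptions made for illustration.

```python
import numpy as np


def population_search(objective, lower, upper, pop_size=10, max_iter=30,
                      explore_prob=0.3, seed=0):
    """Generic population-based minimizer following steps 1-4 of the frameworks."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    # Step 1: random initial population inside the search space.
    pos = rng.uniform(lower, upper, size=(pop_size, lower.size))
    # Step 2: evaluate the fitness of every individual.
    fit = np.array([objective(p) for p in pos])
    g = int(np.argmin(fit))
    best_pos, best_fit = pos[g].copy(), float(fit[g])
    # Step 3: iterate until the stopping condition (a fixed budget here).
    for _ in range(max_iter):
        for i in range(pop_size):
            if rng.random() < explore_prob:
                cand = rng.uniform(lower, upper)        # exploration: random jump
            else:
                pull = rng.random()                     # exploitation: move toward best
                cand = pos[i] + pull * (best_pos - pos[i])
            cand = np.clip(cand, lower, upper)
            f = float(objective(cand))
            if f < fit[i]:                              # keep the move only if it helps
                pos[i], fit[i] = cand, f
            if f < best_fit:                            # update the global best
                best_pos, best_fit = cand.copy(), f
    # Step 4: return the best solution found.
    return best_pos, best_fit


def sphere(v):
    """Toy objective with its minimum of 0 at the origin."""
    return float(np.sum(np.asarray(v) ** 2))


best_pos, best_fit = population_search(sphere, [-5.0, -5.0], [5.0, 5.0])
print(best_pos, best_fit)
```

Each named algorithm would replace the two candidate-generation branches with its own behaviour-inspired rules while leaving the surrounding loop unchanged.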

Table 7 reports the indices of XGBoost tuned by the different optimization algorithms. The SSA-XGBoost model has the strongest prediction ability among the four optimization algorithms, and its indices are relatively low.

Table 7. Statistics of the predictive performance indicators of four different competitor algorithms.

https://doi.org/10.1371/journal.pone.0320298.t007
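The Conclusions name six indices for comparing the metaheuristics (Best, Worst, Mean, Median, Std, Var). Assuming these are statistics of the final objective value over independent runs, they could be computed as in the sketch below. The function name `summarize_runs`, the use of population (ddof = 0) standard deviation and variance, and the sample values are assumptions, not results from the study.

```python
import numpy as np


def summarize_runs(final_fitness):
    """Summarize the final objective value of independent runs (lower is better)."""
    values = np.asarray(final_fitness, dtype=float)
    return {
        "Best": float(values.min()),
        "Worst": float(values.max()),
        "Mean": float(values.mean()),
        "Median": float(np.median(values)),
        "Std": float(values.std()),     # population standard deviation (ddof=0)
        "Var": float(values.var()),     # population variance (ddof=0)
    }


# Placeholder values for five hypothetical runs, not results from the study.
stats = summarize_runs([0.12, 0.10, 0.15, 0.11, 0.13])
print(stats)
```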

Figs 16 and 17 visualize the experimental outcomes as convergence curves and box plots, respectively.

Fig 16. Visualized XGBoost simulations for all four metaheuristics regarding the convergence.

https://doi.org/10.1371/journal.pone.0320298.g016

Fig 17. Visualized XGBoost simulations for all four metaheuristics regarding the box plot.

https://doi.org/10.1371/journal.pone.0320298.g017

Conclusions

In this paper, a combined prediction model named SSA-XGBoost is proposed, in which the sparrow search algorithm optimizes the parameters of the XGBoost model. Based on the 7th national population census of Hebei Province and mobile communication data, a prediction experiment was conducted for comparative analysis. The model was compared with the traditional XGBoost model, with other metaheuristic algorithms (Crayfish, Reptile, and Redfox), and with other machine-learning and deep-learning models (CatBoost, AdaBoost, LSTM, and GRU) using a variety of comparison graphs and error evaluation indicators. The following conclusions can be drawn:

(1). Compared with the traditional XGBoost model, the SSA yields more reasonable parameters that fit the actual development curve of the population, which greatly improves the model's ability to predict time series. For population prediction, the SSA-XGBoost model is far better than the other models in terms of both sequence fit and performance evaluation indicators. This shows that the combined SSA and XGBoost model predicts better than the single XGBoost model.
(2). The SSA-XGBoost model proposed in this study performs better in population prediction. Compared with other machine-learning models (CatBoost and AdaBoost), the XGBoost model optimized by SSA shows better prediction performance with better indices. Compared with the other metaheuristic algorithms (Crayfish, Reptile, and Redfox), the proposed SSA-XGBoost model exhibits the best performance in terms of six indices (Best, Worst, Mean, Median, Std, Var), a result supported by the convergence diagrams and box plots. In practice, the SSA-XGBoost model can be applied to monitor and forecast the population.
(3). Because the dataset used in this study is tabular, and the improved RNN variants we selected (LSTM and GRU) did not perform as well as XGBoost in our experiments, the results suggest that XGBoost is better suited to tabular data than deep models such as RNNs. The SSA-XGBoost model yields higher prediction accuracy and better evaluation indices. On the three evaluation indices of MSE, MAE, and R², the SSA-XGBoost model achieves larger improvements, which indicates its strong prediction performance and high robustness and offers a new way of thinking about time series prediction research.

The method proposed in this paper can effectively forecast the population and could be extended to many specific applications, such as optimizing public services by predicting demand, allocating resources in urban planning by forecasting population growth, or managing traffic by anticipating congestion patterns.

Owing to computational resource constraints, the models in this study rely mainly on national census data and mobile signalling data, which are limited in geographical and temporal resolution. The model predicts only at yearly intervals, and the data are limited to Hebei Province. Future research can incorporate more robust data for a more comprehensive study.

Supporting information

S1 Data. Dataset.

https://doi.org/10.1371/journal.pone.0320298.s001

(ZIP)

References

1. Schlembach C, Schmidt SL, Schreyer D, Wunderlich L. Forecasting the Olympic medal distribution - a socioeconomic machine learning model. Technol Forecast Soc Chang. 2022;175:121314.
2. Badmos OS, Rienow A, Callo-Concha D, Greve K, Juergens C. Simulating slum growth in Lagos: An integration of rule based and empirical based model. Computers, Environment and Urban Systems. 2019;77:101369.
3. Risanger S, Singh B, Morton D, Meyers LA. Selecting pharmacies for COVID-19 testing to ensure access. Health Care Manag Sci. 2021;24(2):330–8. pmid:33423180
4. Hasegawa Y, Sekimoto Y, Seto T, Fukushima Y, Maeda M. My city forecast: Urban planning communication tool for citizen with national open data. Computers, Environment and Urban Systems. 2019;77:101255.
5. Bautista S, Espinoza A, Narvaez P, Camargo M, Morel L. A system dynamics approach for sustainability assessment of biodiesel production in Colombia. Baseline simulation. Clean Prod. 2019;213:1–20.
6. Shafizadeh-Moghadam H. Improving spatial accuracy of urban growth simulation models using ensemble forecasting approaches. Computers, Environment and Urban Systems. 2019;76:91–100.
7. Hu Y, Ji Z, Kong X, Jin S, Yu L. Carbon footprint and economic efficiency of urban agriculture in Beijing: a comparative case study of conventional and home-delivery agriculture. Clean Prod. 2019;234:615–25.
8. Eshragh A, Ganim B, Perkins T, Bandara K. The importance of environmental factors in forecasting australian power demand. Environ Model Assess. 2022;27(1):1–11.
9. Wilson T, Grossman I, Alexander M, Rees P, Temple J. Methods for Small Area Population Forecasts: State-of-the-Art and Research Needs. Popul Res Policy Rev. 2022;41(3):865–98. pmid:34421158
10. Grossman I, Bandara K, Wilson T, Kirley M. Can machine learning improve small area population forecasts? A forecast combination approach. Comput Environ Urban Syst. 2022;95:101806.
11. Smith SK, Morrison PA. Small-Area and Business Demography. In: Poston DL, Micklin M, editors. Boston (MA): Springer; 2005. p. 761–85.
12. Rayer S. Population forecast errors: a primer for planners. Plan Educ Res. 2008;27(4):417–30.
13. Tayman J. Assessing uncertainty in small area forecasts: state of the practice and implementation strategy. Popul Res Policy Rev. 2011;30:781–800.
14. Diamond I, Tesfaghiorghis H, Joshi H. The uses and users of population projections in Australia. J Aust Popul Assoc. 1990;7(2):151–70. pmid:12343018
15. Irina G, Kasun B, Tom W, Michael K. Can machine learning improve small area population forecasts? A forecast combination approach. Computers, Environment and Urban Systems. 2022;95:101806.
16. Mu XY, Zhang XH, Anthony GY, Wang JJ. Evaluating the representativeness of mobile big data: A comparative analysis between China’s mobile big data and census data at the county level. Applied Geography. 2024;166:103260.
17. Pierre D, Catherine L, Samuel M, Andrew JT. Dynamic population mapping using mobile phone data. Applied Physical Sciences. 2014;111(45):15888–93.
18. Shen JF, Gu HY. Unravelling intercity mobility patterns in China using multi-year big data: A city classification based on monthly fluctuations and year-round trends. Computers, Environment and Urban Systems. 2023;102:101954.
19. Tuljapurkar S. Stochastic population forecasts and their uses. Int J Forecast. 1992;8(3):385–91. pmid:12157865
20. Tayman J, Smith SK, Lin J. Precision, bias, and uncertainty for state population forecasts: an exploratory analysis of time series models. Population Research and Policy Review. 2007;26(3):347–69.
21. Ullah MS, Kabir KMA, Khan MAH. A non-singular fractional-order logistic growth model with multi-scaling effects to analyze and forecast population growth in Bangladesh. Scientific Reports. 2024;13(1).
22. Wolpert DH. On the connection between in-sample testing and generalization error. Complex Systems. 1992;6:47–94.
23. Calabrese F, Colonna M, Lovisolo P, et al. Real-time urban monitoring using cell phones: A case study in Rome. IEEE Transactions on Intelligent Transportation Systems. 2011;12(1).
24. Ansah J, Liu L, Kang W, et al. Leveraging burst in twitter network communities for event detection. World Wide Web. 2020.
25. Martin S, Pavol H, Michala S, et al. When spatial interpolation matters: Seeking an appropriate data transformation from the mobile network for population estimates. Computers, Environment and Urban Systems. 2024;110:102106.
26. Zhang YP, Song Y, Zhang WW, et al. Working and residential segregation of migrants in Longgang City, China: A mobile phone data-based analysis. Cities. 2024;144:104625.
27. Fabio R, Giampaolo L, Albrecht W, et al. Towards a methodological framework for estimating present population density from mobile network operator data. Pervasive and Mobile Computing. 2020;68:101263.
28. Song GW, Cai L, Liu L, et al. Effects of ambient population with different income levels on the spatio-temporal pattern of theft: A study based on mobile phone big data. Cities. 2023;137:104331.
29. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. p. 785–94.
30. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32.
31. You Y, Gitman I, Ginsburg B. Large Batch Training of Convolutional Networks. 2017.
32. De Clercq D, Fatima T, Jahanzeb S. Pandemic crisis and employee skills: How emotion regulation and improvisation limit the damaging effects of perceived pandemic threats on job performance. Journal of Management & Organization. 2022:1–20.
33. Bhagat K, et al. Prediction and characterization of substrate specificity and thermal stability for thermostable aliphatic amidases: An in-silico approach. Journal of Advanced Scientific Research. 2021;12(1):115–27.
34. Yu S-Z. Explicit Duration Recurrent Networks. IEEE Trans Neural Netw Learn Syst. 2022;33(7):3120–30. pmid:33497341
35. Kapoor A, Negi A, Marshall L, Chandra R. Cyclone trajectory and intensity prediction with uncertainty quantification using variational recurrent neural networks. Environmental Modelling & Software. 2023;162:101654.
36. Wu Q, Jiang Z, Hong KW, Liu HZ, Yang LT, Ding JH. Tensor-Based Recurrent Neural Network and Multi-Modal Prediction With Its Applications in Traffic Network Management. IEEE Transactions on Network and Service Management. 2021;18(1):780–92.
37. Cortez B, Carrera B, Kim YJ, Jung JY. An architecture for emergency event prediction using LSTM recurrent neural networks. Expert Systems with Applications. 2018;97:315–24.
38. Liu H, Tian HQ, Li YF, Zhang L. Comparison of four Adaboost algorithm based artificial neural networks in wind speed predictions. Energy Conversion and Management. 2015;92:67–81.
39. Moore A, Bell M. XGBoost, A Novel Explainable AI Technique, in the Prediction of Myocardial Infarction: A UK Biobank Cohort Study. Clin Med Insights Cardiol. 2022;16:11795468221133611. pmid:36386405
40. Smyl S. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International journal of forecasting. 2020;36(1):75–85.
41. Ben JS, Gharib C, Mefteh WS, Ben AW. CatBoost model and artificial intelligence techniques for corporate failure prediction. Technological Forecasting and Social Change. 2021;166:120658.
42. Chen JX, Yuan WL, Chen SF, Hu ZZ, Li P. Evo-MAML: Meta-Learning with Evolving Gradient. Electronics. 2023;12(18).
43. Ozkan R, Samli R. Flood algorithm: a novel metaheuristic algorithm for optimization problems. PeerJ Comput Sci. 2024;10:e2278. pmid:39650360
44. Rajeswari V, Priya KS. Ontological modeling with recursive recurrent neural network and crayfish optimization for reliable breast cancer prediction. Biomedical Signal Processing and Control. 2025;99:106810.
45. Khan MK, Zafar MH, Rashid S, Mansoor M, Moosavi SKR, Sanfilippo F. Improved Reptile Search Optimization Algorithm: Application on Regression and Classification Problems. Applied Science-Basel. 2023;13(2):945.
46. Vaiyapuri T, Alaskar H, Aljohani E, Shridevi S, Hussain A, Liyakathunisa. Red Fox Optimizer with Data-Science-Enabled Microarray Gene Expression Classification Model. Applied Science-Basel. 2022;12(9):4172.
47. Mirjalili S, Gandomi AH, Mirjalili SZ, Saremi S, Faris H, Mirjalili SM. Salp Swarm Algorithm: A bio-inspired optimizer for engineering design problems. Adv Eng Softw. 2017;114:163–91.
48. Kayarvizhy N, Kanmani S, Uthariaraj R. Improving fault prediction using ANN-PSO in object oriented systems. Int J Comput Appl. 2013;73:0975–8887.
49. Mohamed S, Luka J, Nebojsa B, Nebojsa B, Laith A. Enhancing Internet of Things Network Security Using Hybrid CNN and XGBoost Model Tuned via Modified Reptile Search Algorithm. Applied Science. 2023;13:12687.
50. Tamara Z, Bosko N, Vladimir S, Dragan P, Nebojsa B. Software defects prediction by metaheuristics tuned extreme gradient boosting and analysis based on Shapley Additive Explanations. Applied Soft Computing. 2023;146:110659.
51. Mihailo T, Nemanja S, Miodrag Z, Nebojsa B. Improving audit opinion prediction accuracy using metaheuristics tuned XGBoost algorithm with interpretable results through SHAP value analysis. Applied Soft Computing. 2023;149:110955.
52. Nguyen TTL, Manish P, Saeid J, Gouri SB, Akbar N, Shoaib A, et al. Flood susceptibility modeling based on new hybrid intelligence model: Optimization of XGBoost model using GA metaheuristic algorithm. Science Direct. 2022;69:3301–18.
53. Luka J, Gordana J, Nebojsa B, Miodrag Z, Mirjana P, Filip A, et al. The explainable potential of coupling metaheuristics optimized XGBoost and SHAP in revealing VOCs’ environmental fate. Atmosphere. 2023;14(109).
54. Jones RH. Discourse Analysis: A Resource Book for Students. 2nd edition. Routledge; 2019.

[ref1] 1. Schlembach C, Schmidt SL, Schreyer D, Wunderlich L. Forecasting the Olympic medal distribution - a socioeconomic machine learning model. Technol Forecast Soc Chang. 2022;175:121314.
View Article
Google Scholar

[2] View Article

[3] Google Scholar

[ref2] 2. Badmos OS, Rienow A, Callo-Concha D, Greve K, Juergens C. Simulating slum growth in Lagos: An integration of rule based and empirical based model. Computers, Environment and Urban Systems. 2019;77:101369.
View Article
Google Scholar

[5] View Article

[6] Google Scholar

[ref3] 3. Risanger S, Singh B, Morton D, Meyers LA. Selecting pharmacies for COVID-19 testing to ensure access. Health Care Manag Sci. 2021;24(2):330–8. pmid:33423180
View Article
PubMed/NCBI
Google Scholar

[8] View Article

[9] PubMed/NCBI

[10] Google Scholar

[ref4] 4. Hasegawa Y, Sekimoto Y, Seto T, Fukushima Y, Maeda M. My city forecast: Urban planning communication tool for citizen with national open data. Computers, Environment and Urban Systems. 2019;77:101255.
View Article
Google Scholar

[12] View Article

[13] Google Scholar

[ref5] 5. Bautista S, Espinoza A, Narvaez P, Camargo M, Morel L. A system dynamics approach for sustainability assessment of biodiesel production in Colombia. Baseline simulation. Clean Prod. 2019;213:1e20.
View Article
Google Scholar

[15] View Article

[16] Google Scholar

[ref6] 6. Shafizadeh-Moghadam H. Improving spatial accuracy of urban growth simulation models using ensemble forecasting approaches. Computers, Environment and Urban Systems. 2019;76:91–100.
View Article
Google Scholar

[18] View Article

[19] Google Scholar

[ref7] 7. Hu Y, Ji Z, Kong X, Jin S, Yu L. Carbon footprint and economic efficiency of urban agriculture in Beijing: a comparative case study of conventional and home-delivery agriculture. Clean Prod. 2019;234:615–25.
View Article
Google Scholar

[21] View Article

[22] Google Scholar

[ref8] 8. Eshragh A, Ganim B, Perkins T, Bandara K. The importance of environmental factors in forecasting australian power demand. Environ Model Assess. 2022;27(1):1–11.
View Article
Google Scholar

[24] View Article

[25] Google Scholar

[ref9] 9. Wilson T, Grossman I, Alexander M, Rees P, Temple J. Methods for Small Area Population Forecasts: State-of-the-Art and Research Needs. Popul Res Policy Rev. 2022;41(3):865–98. pmid:34421158
View Article
PubMed/NCBI
Google Scholar

[27] View Article

[28] PubMed/NCBI

[29] Google Scholar

[ref10] 10. Grossman I, Bandara K, Wilson T, Kirley M. Can machine learning improve small area population forecasts? A forecast combination approach. Comput Environ Urban Syst. 2022;95:101806.
View Article
Google Scholar

[31] View Article

[32] Google Scholar

[ref11] 11. Smith SK, Morrison PA. Small-Area and Business Demography. In: Poston DL, Micklin M, editors. Boston (MA): Springer; 2005. p. 761–85.

[ref12] 12. Rayer S. Population forecast errors: a primer for planners. Plan Educ Res. 2008;27(4):417–30.
View Article
Google Scholar

[35] View Article

[36] Google Scholar

[ref13] 13. Tayman J. Assessing uncertainty in small area forecasts: state of the practice and implementation strategy. Popul Res Policy Rev. 2011;30:781–800.
View Article
Google Scholar

[38] View Article

[39] Google Scholar

[ref14] 14. Diamond I, Tesfaghiorghis H, Joshi H. The uses and users of population projections in Australia. J Aust Popul Assoc. 1990;7(2):151–70. pmid:12343018
View Article
PubMed/NCBI
Google Scholar

[41] View Article

[42] PubMed/NCBI

[43] Google Scholar

[ref15] 15. Irina G, Kasun B, Tom W, Michael K. Can machine learning improve small area population forecasts? A forecast combination approach. Computers, Environment and Urban Systems. 2022;95(2022):101806.
View Article
Google Scholar

[45] View Article

[46] Google Scholar

[ref16] 16. Mu XY, Zhang XH, Anthony GY, Wang JJ. Evaluating the representativeness of mobile big data: A comparative analysis between China’s mobile big data and census data at the county level. Applied Geography. 2024;166:103260.
View Article
Google Scholar

[48] View Article

[49] Google Scholar

[ref17] 17. Pierre D, Catherine L, Samuel M, Andrew JT. Dynamic population mapping using mobile phone data. Applied Physical Sciences. 2014, 111 (45) 15888–93.
View Article
Google Scholar

[51] View Article

[52] Google Scholar

[ref18] 18. Shen JF, Gu HY. Unravelling intercity mobility patterns in China using multi-year big data: A city classification based on monthly fluctuations and year-round trends. Computers, Environment and Urban Systems. 2023;102:101954.
View Article
Google Scholar

[54] View Article

[55] Google Scholar

[ref19] 19. Tuljapurkar S. Stochastic population forecasts and their uses. Int J Forecast. 1992;8(3):385–91. pmid:12157865
View Article
PubMed/NCBI
Google Scholar

[57] View Article

[58] PubMed/NCBI

[59] Google Scholar

[ref20] 20. Tayman J, Smith SK, Lin J. Precision, bias, and uncertainty for state population forecasts: an exploratory analysis of time series models. Population Research and Policy Review. 2007;26(3):347–69.
View Article
Google Scholar

[61] View Article

[62] Google Scholar

[ref21] 21. Ullah MS, Kabir KMA, Khan MAH. A non-singular fractional-order logistic growth model with multi-scaling effects to analyze and forecast population growth in Bangladesh. Scientific Reports. 2024;13(1).
View Article
Google Scholar

[64] View Article

[65] Google Scholar

[ref22] 22. Wolpert DH. On the connection between in-sample testing and generalization error. Complex Systems. 1992;6:47–94.
View Article
Google Scholar

[67] View Article

[68] Google Scholar

[ref23] 23. Calabrese F, Colonna M, Lovisolo P, et al. Real-time urban monitoring using cell phones: A case study in Rome. IEEE Transactions on Intelligent Transportation Systems. 2011;12(1).
View Article
Google Scholar

[70] View Article

[71] Google Scholar

[ref24] 24. Ansah J, Liu L, Kang W, et al. Leveraging burst in twitter network communities for event detection. World Wide Web. 2020.
View Article
Google Scholar

[73] View Article

[74] Google Scholar

[ref25] 25. Martin S, Pavol H, Michala S, et al. When spatial interpolation matters: Seeking an appropriate data transformation from the mobile network for population estimates. Computers, Environment and Urban Systems. 2024;110:102106.
View Article
Google Scholar

[76] View Article

[77] Google Scholar

[ref26] 26. Zhang YP, Song Y, Zhang WW, et al. Working and residential segregation of migrants in Longgang City, China: A mobile phone data-based analysis. Cities. 2024;144(104625).
View Article
Google Scholar

[79] View Article

[80] Google Scholar

[ref27] 27. Fabio R, Giampaolo L, Albrecht W, et al. Towards a methodological framework for estimating present population density from mobile network operator data. Pervasive and Mobile Computing. 2020;68:101263.
View Article
Google Scholar

[82] View Article

[83] Google Scholar

[ref28] 28. Song GW, Cai L, Liu L, et al. Effects of ambient population with different income levels on the spatio-temporal pattern of theft: A study based on mobile phone big data. Cities. 2023;137:104331.
View Article
Google Scholar

[85] View Article

[86] Google Scholar

[ref29] 29. Chen T, Guestrin C. XG-boost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. 785–94.

[ref30] 30. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32.
View Article
Google Scholar

[89] View Article

[90] Google Scholar

[ref31] 31. You Y, Gitman I, Ginsburg B. Large Batch Training of Convolutional Networks. 2017.

[ref32] 32. De Clercq D, Fatima T, Jahanzeb S. Pandemic crisis and employee skills: How emotion regulation and improvisation limit the damaging effects of perceived pandemic threats on job performance. Journal of Management & Organization. 2022;:1–20.
View Article
Google Scholar

[93] View Article

[94] Google Scholar

[ref33] 33. Bhagat K, et al. Prediction and characterization of substrate specificity and thermal stability for thermostable aliphatic amidases: An in-silico approach. Journal of Advanced Scientific Research. 2021;12(1):115–27.
View Article
Google Scholar

[96] View Article

[97] Google Scholar

[ref34] 34. Yu S-Z. Explicit Duration Recurrent Networks. IEEE Trans Neural Netw Learn Syst. 2022;33(7):3120–30. pmid:33497341
View Article
PubMed/NCBI
Google Scholar

[99] View Article

[100] PubMed/NCBI

[101] Google Scholar

[ref35] 35. Kapoor A, Negi A, Marshall L, Chandra R. Cyclone trajectory and intensity prediction with uncertainty quantification using variational recurrent neural networks. Environmental Modelling & Software. 2023;162:101654.

[ref36] 36. Wu Q, Jiang Z, Hong KW, Liu HZ, Yang LT, Ding JH. Tensor-Based Recurrent Neural Network and Multi-Modal Prediction With Its Applications in Traffic Network Management. IEEE Transactions on Network and Service Management. 2021;18(1):780–92.

[ref37] 37. Cortez B, Carrera B, Kim YJ, Jung JY. An architecture for emergency event prediction using LSTM recurrent neural networks. Expert Systems with Applications. 2018;97:315–24.

[ref38] 38. Liu H, Tian HQ, Li YF, Zhang L. Comparison of four Adaboost algorithm based artificial neural networks in wind speed predictions. Energy Conversion and Management. 2015;92:67–81.

[ref39] 39. Moore A, Bell M. XGBoost, A Novel Explainable AI Technique, in the Prediction of Myocardial Infarction: A UK Biobank Cohort Study. Clin Med Insights Cardiol. 2022;16:11795468221133611. pmid:36386405

[ref40] 40. Smyl S. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting. 2020;36(1):75–85.

[ref41] 41. Ben JS, Gharib C, Mefteh WS, Ben AW. CatBoost model and artificial intelligence techniques for corporate failure prediction. Technological Forecasting and Social Change. 2021;166:120658.

[ref42] 42. Chen JX, Yuan WL, Chen SF, Hu ZZ, Li P. Evo-MAML: Meta-Learning with Evolving Gradient. Electronics. 2023;12(18).

[ref43] 43. Ozkan R, Samli R. Flood algorithm: a novel metaheuristic algorithm for optimization problems. PeerJ Comput Sci. 2024;10:e2278. pmid:39650360

[ref44] 44. Rajeswari V, Priya KS. Ontological modeling with recursive recurrent neural network and crayfish optimization for reliable breast cancer prediction. Biomedical Signal Processing and Control. 2025;99:106810.

[ref45] 45. Khan MK, Zafar MH, Rashid S, Mansoor M, Moosavi SKR, Sanfilippo F. Improved Reptile Search Optimization Algorithm: Application on Regression and Classification Problems. Applied Sciences-Basel. 2023;13(2):945.

[ref46] 46. Vaiyapuri T, Alaskar H, Aljohani E, Shridevi S, Hussain A, Liyakathunisa. Red Fox Optimizer with Data-Science-Enabled Microarray Gene Expression Classification Model. Applied Sciences-Basel. 2022;12(9):4172.

[ref47] 47. Mirjalili S, Gandomi AH, Mirjalili SZ, Saremi S, Faris H, Mirjalili SM. Salp Swarm Algorithm: A bio-inspired optimizer for engineering design problems. Adv Eng Softw. 2017;114:163–91.

[ref48] 48. Kayarvizhy N, Kanmani S, Uthariaraj R. Improving fault prediction using ANN-PSO in object oriented systems. Int J Comput Appl. 2013;73:0975–8887.

[ref49] 49. Mohamed S, Luka J, Nebojsa B, Nebojsa B, Laith A. Enhancing Internet of Things Network Security Using Hybrid CNN and XGBoost Model Tuned via Modified Reptile Search Algorithm. Applied Sciences. 2023;13:12687.

[ref50] 50. Tamara Z, Bosko N, Vladimir S, Dragan P, Nebojsa B. Software defects prediction by metaheuristics tuned extreme gradient boosting and analysis based on Shapley Additive Explanations. Applied Soft Computing. 2023;146:110659.

[ref51] 51. Mihailo T, Nemanja S, Miodrag Z, Nebojsa B. Improving audit opinion prediction accuracy using metaheuristics tuned XGBoost algorithm with interpretable results through SHAP value analysis. Applied Soft Computing. 2023;149:110955.

[ref52] 52. Nguyen TTL, Manish P, Saeid J, Gouri SB, Akbar N, Shoaib A, et al. Flood susceptibility modeling based on new hybrid intelligence model: Optimization of XGBoost model using GA metaheuristic algorithm. Science Direct. 2022;69:3301–18.

[ref53] 53. Luka J, Gordana J, Nebojsa B, Miodrag Z, Mirjana P, Filip A, et al. The explainable potential of coupling metaheuristics optimized XGBoost and SHAP in revealing VOCs’ environmental fate. Atmosphere. 2023;14(109).

[ref54] 54. Jones RH. Discourse Analysis: A Resource Book for Students. 2nd edition. Routledge; 2019.
