Abstract
Population prediction provides effective data support for social and economic planning and decision-making, and accurate forecasting is especially important at the sub-national level. With the aim of efficient smart population management, this research focuses on a combination model for forecasting demographic data based on machine learning. To address the higher forecast error caused by high population density and mobility, a dynamic monitoring method based on mobile communication big data, such as mobile phone signals, is proposed. Combined with more structurally stable traditional statistical data, it forms a multi-source dataset that is both accurate and timely. In the study, the Extreme Gradient Boosting (XGBoost) model is used as the base model to create a reliable predictive model for population dynamic monitoring. The sparrow search algorithm (SSA) is used to obtain more reasonable XGBoost parameters and improve forecast accuracy. The combination model is verified on data from the 6th and 7th national population censuses and mobile phone signal data in Hebei Province, yielding predictions of mortality and migration, categorized by age and gender, for the following year. Subsequently, the research compares the performance of different metaheuristic algorithms and various gradient-boosting machine-learning models on the dataset. The SSA-XGBoost model demonstrates better prediction performance in the demographic data forecast, with a higher R² of 0.9984, a lower mean absolute error of 0.0002, and a mean squared error of 6.9184. The results of the comparative experiments and cross-validation show that the proposed predictive model can effectively forecast the demographic data for sub-national regions to realize smart population management.
Citation: Wang J, Ma S, Lv Q, Li Q (2025) Demographic forecast modelling using SSA-XGBoost for smart population management based on multi-sources data. PLoS One 20(6): e0320298. https://doi.org/10.1371/journal.pone.0320298
Editor: Salim Heddam, University 20 Aout 1955 skikda, Algeria, ALGERIA
Received: August 22, 2024; Accepted: February 15, 2025; Published: June 25, 2025
Copyright: © 2025 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data required to replicate the study are available in the Supporting Information files. Due to privacy and commercial restrictions, raw mobile signaling data cannot be shared publicly, but parts of aggregated data are provided. Requests for additional data can be directed to the corresponding author.
Funding: This research was funded by Soft Science Research Project of Innovation Ability Improvement Plan in Hebei Province (Grant number: 23556103D.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Rapid urbanization has led to a sharp increase in urban population density, complexity, and mobility. To address the challenges of rapid urbanization on high-quality urban governance and industrial development, generating accurate and reliable projections of population size, structure, and mobility is essential for decision-making, strategic planning, and meeting the requirements of diverse services and infrastructure, such as formulating public policy [1], allocating healthcare and educational resources [2,3], building smart city [4,5], and energy planning and management [6–8]. The demographic situations can be found and collected from sub-national areas, where provinces, cities, and Statistical Area Level 2 (SA2) [9,10] are examples of sub-national areas. Nevertheless, prediction methods based on these sub-national data as mentioned earlier are susceptible to generating highly inaccurate numbers. This is primarily attributed to the following factors: firstly, exponentially growing intra-regional population movement, migration, and social integration result in high complexity and error of population dynamics and forecast [11]; secondly, the development of populations shows complex non-linear and multi-dimensional features. Traditional forecasting methods, like time series analysis and demographic models [12–14], are often constrained by linear assumptions and limited dimensionality when predicting mortality, fertility, and migration using single-source data. In light of the rapid development of big data and digital technologies, Wilson, Grossman, and their colleagues contend that ensemble forecasting and machine learning algorithms present substantial research opportunities in the area of population prediction [15]. 
Meanwhile, the integration of mobile communication big data, such as mobile phone signaling, geographic location, satellite remote sensing, and social media data, into population prediction analysis offers robust data support for diversified and real-time forecasting. It also maximizes the strengths of machine learning and ensemble prediction models in handling complex, multi-dimensional, and large-scale data. Therefore, dynamic monitoring of population data from multiple sources is an important way to improve the accuracy of population forecasts.
This paper proposes optimization solutions that improve the real-time accuracy of urban population prediction from two perspectives: data from multiple sources and ensemble models based on machine learning. Current studies on population prediction, both in domestic and international contexts, mainly depend on census data and traditional sample survey data [16]. Conducting censuses and surveys is costly in terms of human and material resources, involves multiple intermediary steps, and is susceptible to human error. Moreover, the extended time intervals between censuses limit the ability to perform fine-grained temporal forecasting. Variations in administrative regions also result in inconsistent survey methods, complicating the acquisition of statistical data for specific areas of interest. While acknowledging the limitations of traditional statistical methods, this research does not fully transition to location-based service (LBS) data (including mobile phone signaling data, heatmaps, remote sensing data, etc.), as is common among many researchers [17,18]. Instead, the paper establishes a baseline using traditional statistical sources, such as population census and sampling surveys, and integrates real-time correction data from mobile communication platforms such as Baidu Maps and mobile phone signaling. This multi-source data framework is used for dynamic population prediction. Research on population prediction models involves both the refinement of conventional methods, including linear regression [19], ARIMA [20], and Logistic models [21], and the exploration of new approaches using multiple intelligent algorithms and machine learning models for ensemble forecasting. The No Free Lunch Theorem [22] states that no single algorithm performs best on every problem, which suggests that ensemble forecasting with multiple algorithms can yield better overall results.
However, incorporating an excessive number of algorithms can greatly increase model complexity, resulting in a higher risk of overfitting, reduced real-time performance, and inefficient use of computational resources. This research develops an ensemble prediction model that includes the Sparrow Search Algorithm (SSA) for its superior parameter optimization and the Extreme Gradient Boosting (XGBoost) model as a robust baseline based on distributed gradient boosting.
The purpose of this research is to develop an ensemble prediction model combining metaheuristic and machine-learning techniques, leveraging multi-source population data to forecast and analyze the population structure, birth and death rates, and migration patterns within a region. Specifically, the paper has the following two key research aims:
To build and curate a multi-source population dataset comprising traditional statistical sources such as population censuses and sampling surveys, along with mobile communication data from mobile phone signalling and Baidu Maps.
To establish a streamlined ensemble prediction model capable of automatically optimizing the hyperparameters of machine learning models, and to validate the model’s effectiveness in regional population forecasting through comparative analysis.
In this study, the researchers gathered population census data from 2010 and 2020, along with multiple rounds of sampling survey data between 2010 and 2020, and mobile phone signaling data from 2019 to 2020 for Hebei Province. By partitioning the dataset into training and validation sets, they obtained predicted population data for Hebei Province in 2020, classified by age and gender, as presented in the Data and methodology section. This section also explains the base models, the combination methods, and the overall experimental design used in this study. The Results and analysis section presents population status and structure. The Discussion section compares the performance of the SSA-XGBoost model with other models.
Literature review
Using communication data for population dynamic monitoring is a feasible method that has become a research hot spot in recent years [23–28]. For example, Calabrese et al. used mobile communication data to conduct real-time monitoring of the population of Rome [29]; Naaman et al. conducted a study on the daily behaviours of the urban population based on Twitter check-in data [30]. Based on mobile positioning data, Martin Sveda et al. [31] provide an appropriate method to transform data from the mobile network into target spatial units, ensuring the precision and accuracy of the results for population estimates. In addition, Yongping Zhang et al. [32] utilize mobile phone data as a data source to investigate the working and residential segregation of migrants in Longgang City, China. Fabio Ricciato et al. [33] proposed an approach to the estimation of present population density from data collected by Mobile Network Operators (MNOs). Population statistics based on mobile communication big data are easier to operate and cost less manpower and material resources, so high-frequency monitoring can be achieved. Moreover, mobile data contains multiple dimensional attributes, including temporal and spatial information, user characteristics, and flow rules. However, multi-source data differ in semantics, scale, and storage format, as well as in data models and storage structures.
Computational intelligence and machine learning methods have been very promising in the field of prediction. Recurrent neural networks (RNNs) [34] based on improved architectures such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) exhibit strong time series prediction capabilities in scenarios like climate science [35], traffic flow [36] and emergency evacuation [37]. While LSTM and GRU address some of the shortcomings of RNNs in terms of gradient issues and long-term dependencies, RNNs still require multi-step backpropagation through time (BPTT), leading to longer training processes and higher consumption of computational resources. Given these challenges, machine learning algorithms based on gradient boosting decision trees (GBDT), like Adaboost [38], XGBoost [39], LightGBM [40], and CatBoost [41], have gradually come into research focus. Combining fine-grained control over large-scale data with generalization ability for prediction targets, the XGBoost model has become one of the preferred choices for many researchers. The XGBoost model is considered to be very flexible, and it can adapt to a variety of different problems. However, this also means that the hyperparameters have to be tuned for each specific task. Selecting the optimal values for hyperparameters through optimization can be regarded as an NP-hard problem [42]. Metaheuristic optimizations, which use random operators, trial-and-error processes, and random scanning of the problem-solving space to generate efficient solutions to optimization problems [43], play a crucial role in dealing with the NP-hard challenges that are commonly faced in parameter optimization, regression analysis, cluster analysis, etc.
There is limited research on stacking the XGBoost model with metaheuristic algorithms for population prediction analysis and on comparing the performance of the stacked model with standalone machine learning models. Practitioners need to decide not only which individual models to integrate with XGBoost but also how to integrate them and allocate weights among the algorithms. Relatively novel optimization algorithms such as the crayfish optimization algorithm [44], reptile search algorithm [45], red fox optimizer [46], sparrow search algorithm (SSA) [47], particle swarm optimization (PSO) [48], and other swarm algorithms have been utilized to obtain more reasonable parameter settings and improve the predictive performance of the model. Mohamed Salb et al. [49] designed a solution that combines convolutional neural networks (CNNs) for feature extraction with the XGBoost model for intrusion detection. By customizing the reptile search algorithm for hyperparameter optimization, the methodology provides a resilient defence against emerging threats in IoT security. Tamara Zivkovic et al. [50] suggested a modified variant of the reptile search optimization algorithm, named HARSA, to calibrate the XGBoost hyperparameters. Compared with other metaheuristic algorithms, the proposed scheme was shown to have superior classification accuracy. Mihailo Todorovic et al. [51] compared the performance of six metaheuristic algorithms used for tuning the XGBoost algorithm. The results showed that models with hyperparameter optimization outperformed the benchmark models in financial data prediction. Nguyen Thi Thuy Linh et al. [52] evaluated the performance of a hybrid genetic algorithm (GA) optimization method with the XGB model and K-nearest neighbour. Validated on the test dataset, using the GA as an optimizer to determine the best parameters of the XGB model increased efficiency in that study. Luka Jovanovic et al. [53] tested eight metaheuristic algorithms for XGBoost optimization to achieve a superior level of performance in estimating the relative importance of each pollutant level and meteorological parameter for the prediction of benzene concentrations. Among these algorithms, the SSA is a population-based optimization algorithm proposed on the basis of the foraging and anti-predation behaviours of sparrow populations and built upon existing swarm intelligence algorithms, such as GWO, GA, PSO, etc. It presents certain advantages in terms of stability, convergence accuracy, and convergence speed.
As discussed in the related literature, the combination of metaheuristic algorithms and machine learning models has been proven to improve model accuracy across various fields. XGBoost algorithm has high precision, strong flexibility, and can prevent data over-fitting, but this algorithm has high time and space complexity. The metaheuristic algorithm SSA, on the other hand, can further enhance the predictive performance of the XGBoost model through hyperparameter optimization. Therefore, a global optimization method based on SSA is proposed in this paper to identify the improved XGBoost model to realize population dynamic monitoring.
Data and methodology
The methodology employed in this investigation is illustrated in Fig 1. The initial phase of the study started with data collection. After a data pre-processing stage, the acquired data is ready for modelling. Data pre-processing is a 4-stage process involving the following steps:
- Data Integration.
- Data cleaning and organization.
- Finding missing data and cleaning outliers.
- Generating data sets for training, testing, and validation.
The data pre-processing is then followed by correlation analysis to find out the correlation between input and output variables. The machine learning models will be proposed for demographic data. The model performance evaluation is then carried out using various metrics.
Data collection
The following data are used in this study:
- (1). National census data: These data were obtained from the 6th and 7th censuses, which reflect the basic population situation, the change in family structure, the improvement of education level, and the population distribution differences among regions. The statistics from 2010 to 2020 cover the entire year, while the statistics for 2021 cover the first to the twelfth month. The national census data can be obtained from the National Bureau of Statistics.
- (2). Mobile information data: The main data source is mobile phone communication network signalling data, including mobility control (MC) port data and mobility management entity (MME) port data. MC port data is the location data of 2G/3G mobile phones, generated on average about every 20 minutes during the day. MME port data is 4G mobile phone location data, generated every 5 minutes. Voice call detail records (CDRs) are used to exclude numbers that have not received a voice call in the past six months. The 13-digit numbers starting with 106 and 144 are also excluded. The mobile information data can be obtained by purchasing desensitized data from the relevant communication companies.
Before data analysis and processing, the following two points about the population indicator statistical methods and the migration population are clarified:
- (1). Population indicator statistical methods
- a. Stable population: The presence of more than 10 hours in the region on the same day is considered a stable day, and the number of stable days in a year is more than 1/2, which is considered a stable population in the month.
- b. Population dimension (gender, age, residence registration location): The stable population is linked to ID numbers, and the gender identification digit distinguishes male from female. Age is calculated from the birth date in the ID number. Precipitation analysis was conducted on children aged 0–17, who regularly visited children’s hospitals, primary and secondary schools, playgrounds, and other locations.
- (2). Migration population: The intra-provincial stability of the current year minus the number of the previous year is taken as the inter-provincial migration object. The intra-provincial stability of the previous year minus the one of the current years is taken as the inter-provincial migration object. Take the roaming place of users in Hebei as the stable place, and take the mobile number home place as the stable place for users in other provinces.
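The stable-population rule above lends itself to a short illustration. The following minimal Python sketch assumes that each user's presence has already been aggregated into hours per day; the function names and the input layout (a mapping from a hypothetical user ID to a list of daily hours) are illustrative assumptions and are not taken from the study's pipeline.

```python
from typing import Dict, List

STABLE_HOURS = 10    # from the text: more than 10 hours in the region
STABLE_SHARE = 0.5   # from the text: more than 1/2 of the observed days


def is_stable_day(hours_in_region: float) -> bool:
    """A day counts as stable when presence exceeds the hour threshold."""
    return hours_in_region > STABLE_HOURS


def is_stable_user(daily_hours: List[float]) -> bool:
    """A user is stable when more than half of the observed days are stable."""
    if not daily_hours:
        return False
    stable_days = sum(is_stable_day(h) for h in daily_hours)
    return stable_days > STABLE_SHARE * len(daily_hours)


def stable_population(records: Dict[str, List[float]]) -> int:
    """Count stable users; `records` maps a hypothetical user ID to daily hours."""
    return sum(is_stable_user(hours) for hours in records.values())
```

The observation window (a month or a year) is left to the caller, who decides how many daily values to pass for each user.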
Data pre-processing
Data pre-processing mainly involves handling null or missing values and outliers, both of which need to be removed before modelling. The data was grouped to identify the missing values. The outliers were identified as per the interquartile range (IQR) [54]:

$$\mathrm{IQR} = Q_3 - Q_1 \tag{1}$$

where $Q_1$ is the first quartile, corresponding to 25%, and $Q_3$ is the third quartile, corresponding to 75%. The range considered was $[Q_1 - k\,\mathrm{IQR},\; Q_3 + k\,\mathrm{IQR}]$, with $k = 1.5$ in the conventional rule.
The outlier points are seen in the data either due to a faulty detection or maybe an exceptional event. To ensure the comparability of the prediction, feature scaling was implemented consistently. The training set is denoted by $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{n}$ represents the n-dimensional explanatory space and $y_i$ is the dependent variable. Each feature was then normalized to a common scale.
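The two pre-processing steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: the multiplier `k = 1.5` is the conventional choice for the IQR rule, and min-max scaling is assumed as the normalization because the text does not specify the formula. The function names are hypothetical.

```python
import statistics
from typing import List, Tuple


def iqr_bounds(values: List[float], k: float = 1.5) -> Tuple[float, float]:
    """Return the IQR fences [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is an assumption."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Keep only the values that fall inside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


def min_max_scale(values: List[float]) -> List[float]:
    """Scale values to [0, 1]; min-max scaling is an assumed normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]
```

In practice the fences and the scaling constants would be computed on the training set only and then reused for the test set, so that no information leaks from the test data.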
Machine learning algorithms
In recent years, the field of data science has brought machine learning and artificial intelligence to the forefront, and numerous machine learning algorithms have either emerged or have gained popularity. In this paper, XGBoost is adapted to make predictions iteratively on the training dataset and avoid too many splits, reduce the complexity of the model, and prevent the model from overfitting.
XGBoost model
XGBoost is an algorithm based on the gradient-boosting decision tree (GBDT). Compared with GBDT, XGBoost expands the loss function to the second order using a Taylor expansion and adds a regularization term to control model complexity and avoid overfitting. The objective function is composed of two parts: the loss function and the regularization term. The loss function is defined as:

$$L = \sum_{i=1}^{N} l\left(y_i, \hat{y}_i\right) \tag{3}$$

where $N$ is the number of training samples, $l$ is the loss function for an individual sample, $\hat{y}_i$ is the predicted value for the $i$-th training sample, and $y_i$ is the true value for the $i$-th training sample. Each regression tree $f$ could be defined as follows:

$$f(x) = w_{q(x)}, \quad w \in \mathbb{R}^{T}, \quad q: \mathbb{R}^{n} \rightarrow \{1, \dots, T\} \tag{4}$$

where $w$ is the weight vector of the leaf nodes and $q$ is the mapping from a sample to its leaf node. Then, the complexity of a tree is expressed as follows:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2} \tag{5}$$

where $\gamma$ and $\lambda$ are the penalty coefficients, $T$ is the number of leaf nodes, and $\frac{1}{2} \lambda \lVert w \rVert^{2}$ penalizes the scores of the leaf nodes.
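To make the role of the second-order expansion and the regularization term concrete, the following Python sketch computes the closed-form optimal weight of a single leaf, $w^{*} = -G / (H + \lambda)$, where $G$ and $H$ are the sums of the first and second derivatives of the loss over the samples in that leaf. This closed form comes from the standard XGBoost derivation. The squared-error loss and all function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def squared_error_grad_hess(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Gradient and Hessian of l = 0.5 * (y_pred - y_true)^2 w.r.t. y_pred."""
    grads = [p - t for t, p in zip(y_true, y_pred)]
    hess = [1.0 for _ in y_true]
    return grads, hess


def optimal_leaf_weight(
    grads: Sequence[float], hess: Sequence[float], lam: float
) -> float:
    """Closed-form leaf weight w* = -G / (H + lambda) for one leaf."""
    return -sum(grads) / (sum(hess) + lam)


def tree_complexity(leaf_weights: Sequence[float], gamma: float, lam: float) -> float:
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2 for one tree."""
    t = len(leaf_weights)
    return gamma * t + 0.5 * lam * sum(w * w for w in leaf_weights)
```

A larger $\lambda$ shrinks every leaf weight toward zero, while a larger $\gamma$ makes each additional leaf more expensive, which is how the two penalty coefficients limit model complexity.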
SSA algorithm
Sparrow Search Algorithm (SSA) is a novel swarm intelligence optimization algorithm inspired by the foraging and anti-predation behaviour of sparrows. During foraging, the sparrow population is divided into discoverers (seekers) and joiners (followers). The discoverer is responsible for finding food and providing foraging areas and directions for the whole sparrow population, while the joiner uses the discoverer's findings to obtain food. In addition, sparrows exhibit anti-predation behaviour when they become aware of danger. In SSA, discoverers with better fitness values preferentially obtain food during the search process. During each iteration, the location update of the discoverer is described as follows:

$$X_{i,j}^{t+1} = \begin{cases} X_{i,j}^{t} \cdot \exp\left(\dfrac{-i}{\alpha \cdot iter_{max}}\right), & R_2 < ST \\ X_{i,j}^{t} + Q \cdot L, & R_2 \ge ST \end{cases} \tag{6}$$

where $t$ denotes the current number of iterations, and $iter_{max}$ is a constant that denotes the utmost number of iterations. $X_{i,j}$ represents the position information of the $i$-th sparrow in dimension $j$. $\alpha \in (0, 1]$ is a random number. $ST$ and $R_2$ represent the safety value and warning value respectively. $Q$ is a random number that follows a normal distribution. $L$ represents a $1 \times d$ matrix in which each element in the matrix is 1.
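The joiner obtains food from the seekers. The continuously updated location of the joiner is as follows:

$$X_{i,j}^{t+1} = \begin{cases} Q \cdot \exp\left(\dfrac{X_{worst}^{t} - X_{i,j}^{t}}{i^{2}}\right), & i > n/2 \\ X_{P}^{t+1} + \left| X_{i,j}^{t} - X_{P}^{t+1} \right| \cdot A^{+} \cdot L, & \text{otherwise} \end{cases} \tag{7}$$

where $X_{P}$ denotes the producer’s optimal position; $X_{worst}$ is the sparrow population’s worst position. $A$ denotes a matrix assigning 1 or −1 randomly at each element, and its dimension is 1 × d, with $A^{+} = A^{T}(AA^{T})^{-1}$. The $i$-th scroungers are starving with low fitness when $i > n/2$, where $n$ is the number of sparrows. When the sparrow population detects danger, sparrows at the edges quickly move to a safer area. The middle sparrow of the flock will approach other sparrows at random. The sparrows update their positions according to the following formula:

$$X_{i,j}^{t+1} = \begin{cases} X_{best}^{t} + \beta \cdot \left| X_{i,j}^{t} - X_{best}^{t} \right|, & f_i > f_g \\ X_{i,j}^{t} + K \cdot \left( \dfrac{\left| X_{i,j}^{t} - X_{worst}^{t} \right|}{(f_i - f_w) + \varepsilon} \right), & f_i = f_g \end{cases} \tag{8}$$

where $X_{best}$ is the best position of a whole sparrow population; $\beta$ is defined as the control parameter of step size. $f_i$, $f_g$ and $f_w$ are the fitness values of present, current global best, and worst, respectively. When $f_i > f_g$, it means the sparrow is located at the edge of the whole group. When $f_i = f_g$, the middle sparrows of the flock spotted the danger and had to move closer to other sparrows. $K$ determines the sparrow’s movement direction. $\varepsilon$ denotes a small constant.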
Algorithm 1 The framework of SSA.
Input:
$iter_{max}$: the maximum iterations;
$PD$: the number of producers;
$SD$: the number of sparrows who perceive the danger;
$R_2$: the alarm value;
$n$: the number of sparrows
Initialize a population of $n$ sparrows and define its relevant parameters.
Output: $X_{best}$, $f_g$
1: while ($t < iter_{max}$)
2: Rank the fitness values and find the current best individual and the current worst individual.
3: $R_2 = rand(1)$
4: for $i = 1 : PD$
5: Using equation (6) update the sparrow’s location;
6: end for
7: for $i = (PD + 1) : n$
8: Using equation (7) update the sparrow’s location;
9: end for
10: for $l = 1 : SD$
11: Using equation (8) update the sparrow’s location;
12: end for
13: Get the current new location;
14: If the new location is better than before, update it;
15: $t = t + 1$
16: end while
17: return $X_{best}$, $f_g$
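The framework in Algorithm 1 can be sketched in plain Python as follows. This is a minimal, illustrative version under stated assumptions rather than the implementation used in the study: the default shares of producers and danger-aware sparrows, the safety threshold of 0.8, the box-constraint clipping, and the use of the top-ranked producer as $X_P$ are all assumptions, and the function name `ssa_minimize` is hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def ssa_minimize(
    fitness: Callable[[List[float]], float],
    bounds: Sequence[Tuple[float, float]],
    n_sparrows: int = 20,
    max_iter: int = 100,
    producer_share: float = 0.2,    # PD / n (assumed default)
    danger_share: float = 0.1,      # SD / n (assumed default)
    safety_threshold: float = 0.8,  # ST (assumed default)
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Minimise `fitness` inside the box `bounds` with a basic sparrow search."""
    rng = random.Random(seed)
    d, n, eps = len(bounds), n_sparrows, 1e-12
    n_prod = max(1, int(producer_share * n))
    n_danger = max(1, int(danger_share * n))

    def clip(x: List[float]) -> List[float]:
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n)]
    fit = [fitness(x) for x in pop]

    for _ in range(max_iter):
        # Step 2: rank so that index 0 is the best and index n - 1 the worst.
        order = sorted(range(n), key=fit.__getitem__)
        pop = [pop[i] for i in order]
        fit = [fit[i] for i in order]
        best_x, best_f = pop[0], fit[0]
        worst_x, worst_f = pop[-1], fit[-1]
        r2 = rng.random()  # step 3: alarm value R2
        new_pop = [x[:] for x in pop]

        # Steps 4-6, equation (6): producers (discoverers).
        for i in range(n_prod):
            if r2 < safety_threshold:
                alpha = 1.0 - rng.random()  # uniform in (0, 1]
                scale = math.exp(-(i + 1) / (alpha * max_iter))
                new_pop[i] = [v * scale for v in pop[i]]
            else:
                q = rng.gauss(0.0, 1.0)
                new_pop[i] = [v + q for v in pop[i]]  # Q * L with L = ones
        x_p = clip(new_pop[0])  # simplification: top-ranked producer as X_P

        # Steps 7-9, equation (7): joiners (scroungers).
        for i in range(n_prod, n):
            if i + 1 > n / 2:
                q = rng.gauss(0.0, 1.0)
                new_pop[i] = [q * math.exp((w - v) / (i + 1) ** 2)
                              for v, w in zip(pop[i], worst_x)]
            else:
                # |X - X_P| * A^+ * L is one scalar shared by all dimensions,
                # because A^+ = A^T / d when every entry of A is +1 or -1.
                step = sum(abs(v - p) * rng.choice((-1.0, 1.0))
                           for v, p in zip(pop[i], x_p)) / d
                new_pop[i] = [p + step for p in x_p]

        # Steps 10-12, equation (8): sparrows that perceive danger.
        for i in rng.sample(range(n), n_danger):
            if fit[i] > best_f:
                beta = rng.gauss(0.0, 1.0)
                new_pop[i] = [b + beta * abs(v - b)
                              for v, b in zip(pop[i], best_x)]
            else:
                k = rng.uniform(-1.0, 1.0)
                denom = (fit[i] - worst_f) + eps
                new_pop[i] = [v + k * abs(v - w) / denom
                              for v, w in zip(pop[i], worst_x)]

        # Steps 13-14: keep a new location only if it improves the sparrow.
        for i in range(n):
            candidate = clip(new_pop[i])
            f_new = fitness(candidate)
            if f_new < fit[i]:
                pop[i], fit[i] = candidate, f_new

    i_best = min(range(n), key=fit.__getitem__)
    return pop[i_best], fit[i_best]
```

As a usage example, `ssa_minimize(lambda x: sum(v * v for v in x), [(-5.0, 5.0)] * 3)` searches for the minimum of a three-dimensional sphere function. Because every sparrow keeps a new location only when it improves, the best fitness never gets worse from one iteration to the next.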
SSA-based parameter optimization
In the proposed method, each sparrow represents a set of XGBoost parameters, and the positions represent the parameter values. The mean square error of cross-validation is the objective function. SSA is employed to identify parameter values that minimize the objective function. The goal is to find a location $x^{*}$ that minimizes the objective function $F(x)$, i.e.,

$$x^{*} = \arg\min_{x} F(x) \tag{9}$$

where $x$ denotes the position of each sparrow. With each iteration, the position of each sparrow is updated using the current best sparrow position, a learning rate parameter, the step size, and a random perturbation term. After each update, the fitness of each sparrow is calculated and the position of the current best sparrow is updated. If a sparrow's fitness is better than the current best, the best position is updated to that sparrow's position. This process is repeated until the optimal solution remains unchanged within a specified number of iterations, or until a predetermined number of iterations is reached.
Ultimately, the optimal sparrow position is the optimal solution required by the model and the parameters are shown in Table 1.
In the process of optimizing a population prediction system model based on SSA-XGBoost, each candidate set of parameters is treated as a “sparrow”, and the optimal parameter values are found by simulating the sparrows’ foraging and anti-predation behaviour. The flowchart of the SSA-XGBoost model is shown in Fig 2.
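The mapping from a sparrow position to a set of XGBoost parameters can be sketched as follows. The search space shown here is a hypothetical example: the parameter names are real XGBoost hyperparameters, but the ranges are assumptions and are not the values in Table 1. The cross-validated MSE is passed in as a callable so that the sketch stays independent of any particular training library.

```python
from typing import Callable, Dict, List, Sequence, Union

Param = Union[int, float]

# Hypothetical search space: (name, lower bound, upper bound, is integer).
SEARCH_SPACE = [
    ("learning_rate", 0.01, 0.3, False),
    ("max_depth", 2, 10, True),
    ("n_estimators", 50, 500, True),
    ("subsample", 0.5, 1.0, False),
]


def decode(position: Sequence[float]) -> Dict[str, Param]:
    """Turn a sparrow position into a parameter dict, clipping to the bounds."""
    params: Dict[str, Param] = {}
    for value, (name, low, high, is_int) in zip(position, SEARCH_SPACE):
        clipped = min(max(value, low), high)
        params[name] = int(round(clipped)) if is_int else clipped
    return params


def make_objective(
    cv_mse: Callable[[Dict[str, Param]], float]
) -> Callable[[List[float]], float]:
    """Wrap a cross-validated MSE function as a fitness over sparrow positions."""

    def objective(position: List[float]) -> float:
        return cv_mse(decode(position))

    return objective
```

One possible `cv_mse` would fit `xgboost.XGBRegressor(**params)` and average the negated scores from `sklearn.model_selection.cross_val_score` with `scoring="neg_mean_squared_error"`; that choice is an assumption about tooling, not a statement of how the study computed its fitness.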
Evaluation indices
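Evaluation is a critical stage in the implementation of any research project. Each model or procedure that is implemented must undergo an assessment using one or more metrics. The various model evaluation metrics used in the study are as follows:

$$R^{2} = \left( \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y} \right)^{2}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|$$

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^{2}$$

where $n$ is the number of observations, $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$ respectively, $x_i$ and $y_i$ are the observed values of $x$ and $y$ respectively, and $\bar{x}$ and $\bar{y}$ are their means. $\hat{y}_i$ is the predicted value, and $y_i$ is the actual value.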
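The three metrics can be computed with a few lines of Python. The sketch below is illustrative: R² is taken as the squared Pearson correlation between actual and predicted values, which matches the standard-deviation terms in the definitions above, whereas the coefficient-of-determination form $1 - SS_{res}/SS_{tot}$ is a common alternative that can give different values. The function names are hypothetical.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Squared Pearson correlation (assumed form of R^2)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted)) / n
    std_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual) / n)
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    if std_a == 0 or std_p == 0:
        raise ValueError("R^2 is undefined when either series is constant")
    return (cov / (std_a * std_p)) ** 2
```

Because MAE and MSE depend on the scale of the target variable, they are only comparable between models evaluated on the same, identically scaled data set.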
Results and analysis
Study area
In this paper, the dynamic management of population data in the Hebei Province of China is taken as an example due to its profound impact on the enhancement of regional comprehensive carrying capacity, economic development imbalance, and sustainable development. Compared with the data of the sixth population census in 2010, the number of people separated from their registered households in Hebei Province increased by 11,478,362, an increase of 138.34%. The floating population increased by 8,657,908 people, an increase of 129.71%, with the inter-provincial floating population accounting for 20.58% and the intra-provincial floating population accounting for 79.42%.
Population status
The population’s natural growth rate is very close to zero, and population growth has substantially decreased.
The birth rate has fluctuated, resulting in an accelerated decline in the population size of Hebei Province over the past decade. The birth rate experienced brief surges in 2013, 2014, 2016, and 2017, following the implementation of the “selective two-child” and “universal two-child” policies. Subsequently, it declined gradually. In 2020, it is anticipated to decrease to 8.2 per thousand. The mortality rate is 7.22 per thousand, fluctuating at a low level. In general, the province has reached the stage of low birth rate, low mortality rate, and low natural growth rate, as illustrated in Fig 3.
(Data source: National Bureau of Statistics). Note: The population of 2000 and 2001 is the projected figure of the current population census, the population of other years is the projected data of the annual population sampling survey, and the population data of each region since 2005 is the standard of the permanent population.
Changes in the number of offspring, working-age population, and elderly.
In the past decade, the proportion of the working-age population decreased, while the numbers of children and elderly people increased in Hebei Province. The data from the seventh population census indicate that the number of children aged 0–14 years in Hebei Province reached 15.09 million in 2020, a 2.99 million increase from 2010. Additionally, the proportion of children aged 0–14 years rose from 16.83% in 2010 to 20.22%. The working-age population decreased from 53.84 million to 49.13 million, a decrease of 4.71 million. The proportion of the population aged 15–64 decreased from 74.93% to 65.86%. The number of people aged 65 and older increased from 5.92 million in 2010 to 10.39 million in 2020, a 4.47 million increase. Concurrently, the percentage of individuals aged 65 and older rose from 8.24% to 13.92%. The details are shown in Fig 4.
(Data source: The sixth and seventh population censuses of Hebei Province).
Population structure analysis
According to the statistical data of mobile signaling in 2020 and 2021 (see Fig 5), the stable population of Hebei Province and its prefecture-level cities amounted to 75,996,000 in 2020 and 74,697,000 in 2021, a year-on-year decrease of 1.71%. The top three cities with the largest stable population in 2020 were Baoding, Shijiazhuang, and Handan. In 2021, the top three cities with the largest stable population were Shijiazhuang, Baoding, and Tangshan, while the cities with the smallest stable population were Hengshui, Chengde and Qinhuangdao. It can be seen that the population size of cities in Hebei Province does not change much, and there is a positive correlation between population size and urban location, economic level and traffic conditions.
Fig 6 shows the demographic structure of the population in Hebei Province in 2020. The population pyramid of Hebei Province is ageing, with a gradual decrease in the lower echelons. However, the working-age population continues to dominate the province, with a concentration of young and middle-aged individuals between the ages of 30 and 59.
As to the data, the population migration data comes from the 2020 Hebei Unicom mobile signalling data and Baidu VIP big data, and the GNP data of each region comes from the economic census data of the National Bureau of Statistics.
Population prediction comparison
In this section, the SSA algorithm is used to optimize the parameters of the XGBoost model to obtain reasonable parameter values. Twenty-five runs were conducted for the metaheuristic method, using a population size of 10 solutions and a maximum of thirty rounds in each run (iterations = 10), as listed in Table 2. MSE was used as the objective function to be minimized throughout the experiments. Figs 7 and 8 visualize the experimental outcomes for the fitness function as a convergence graph and a box plot. The optimal parameters and the fitness value are shown in Table 3.
To reflect the improvement of the prediction accuracy of the proposed method, the traditional XGBoost model and SSA-XGBoost model are adopted to predict mortality and mobility in Hebei Province. The results are shown in Figs 9–12, respectively.
Three evaluation indices were used to quantitatively evaluate the prediction effects of the two models, and the results are shown in Table 4. The adequacy of both models was determined to be satisfactory in terms of the R² values for the training and test data, as they were above 0.99.
Nevertheless, R² values of 1 and 0.9999 on the training set indicate that XGBoost fully fits the training data, but it performs poorly on the test set. The results illustrate that SSA-XGBoost is preferable to traditional XGBoost across a variety of indices and efficiently prevents overfitting by incorporating regularization, and that the prediction accuracy was significantly improved by parameter optimization.
According to Table 4, SSA-XGBoost performs best on the death data set. To further verify the accuracy and stability of the model, SSA-XGBoost was cross-validated on the death data set, and the results are shown in Fig 13. The Best Scores and Mean Scores show a clear downward trend over the iterations, indicating that the model is continuously optimizing and finding better parameter combinations. Although the median Scores are not steady because of the outliers or missing values, the standard deviation continues to decrease over the iterations. The rolling average line provides a smoother representation of this trend, further confirming the increase in stability.
Discussion
Comparisons with several baseline models
To better explore the prediction ability and generality of the SSA-XGBoost model, several other models were used to forecast female deaths in 2020, and the predicted results were compared. Because XGBoost is an optimized distributed gradient-boosting tree method from machine learning, the models selected for comparison in this section are SSA-AdaBoost and SSA-CatBoost.
Three evaluation indices were used to quantitatively evaluate the prediction performance of the three models, and the specific values are shown in Table 5. All the evaluation indices indicate that the SSA-XGBoost model achieved the most accurate regression and that parameter optimization significantly improved the prediction accuracy.
As for deep learning, RNNs such as LSTM and GRU are well known to handle time series well. The prediction results of the three models on the training (2010–2019) and test (2020) data sets are shown in Figs 14 and 15.
The prediction curves of the LSTM and GRU models in Fig 15 are relatively smooth. LSTM performs well in capturing both short-term and long-term dependencies in the data, and its prediction curve closely follows the actual data, particularly at peaks and troughs. Like LSTM, the GRU model is good at handling sequential data. Its predictions align well with the actual data, although there may be slight deviations during some significant changes in the data. The green line representing XGBoost is closely aligned with the red dots representing the actual data, indicating that it has accurately captured the trends and patterns in the data. As the four corresponding predictive performance indices in Table 6 show, the SSA-XGBoost model has the strongest prediction ability among the three models, and its prediction error is relatively low.
Comparison of different competitor algorithms
This subsection presents the simulation results on the death data set for the XGBoost model optimized by SSA and by three other recent competitor algorithms: Crayfish, Reptile, and Redfox. The pseudocode for each of these metaheuristic algorithms is outlined in Algorithms 2–4.
Algorithm 2 The framework of Crayfish.
1. Initialize the population of crayfish randomly.
2. Evaluate the fitness of each crayfish.
3. While stopping condition is not met:
   a. For each crayfish, determine its state (e.g., foraging, resting, or defending).
   b. Update the crayfish’s position based on attraction and repulsion forces:
      i. Attraction force: towards better solutions or prey.
      ii. Repulsion force: away from predators or danger zones.
   c. Foraging and defence behaviors:
      i. Simulate the crayfish searching for food while avoiding predators.
      ii. Update position accordingly.
   d. Evaluate the fitness of each crayfish’s new position.
   e. Update the global best solution if a better one is found.
4. Return the global best solution found.
Algorithm 3 The framework of Reptile.
1. Initialize a population of reptiles randomly within the search space.
2. Evaluate the fitness of each reptile.
3. While stopping condition is not met:
   a. For each reptile, simulate movement using:
      i. Exploration: Move to a random direction (search for new solutions).
      ii. Exploitation: Move towards known better solutions (use past knowledge).
   b. Account for territorial behavior:
      If the reptile encounters others, simulate conflict or cooperation.
   c. Evaluate the fitness of each reptile’s new position.
   d. Update the global best position if necessary.
4. Return the best solution found.
Algorithm 4 The framework of Red Fox.
1. Initialize a population of red foxes randomly within the search space.
2. Evaluate the fitness of each red fox.
3. While stopping condition is not met:
   a. Each red fox selects a strategy based on its current position:
      i. Search for food: move toward higher fitness (search for better solutions).
      ii. Escape predators: move to avoid worse solutions or stagnation.
      iii. Territorial behavior: defend a region or seek mates (exploit known good regions).
   b. Update the fox’s position based on its strategy.
   c. Evaluate the fitness of the updated position.
   d. Update the global best solution if a better one is found.
4. Return the best solution found.
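Algorithms 2–4 share one skeleton: initialize a population, evaluate it, repeatedly move each agent, and track the best solution found. The Python sketch below captures only that shared skeleton with a generic exploration and exploitation move. It does not reproduce the actual operators of Crayfish, Reptile, or Red Fox; the function name `population_search`, the `explore_prob` parameter, the step sizes, and the `sphere` test objective are assumptions made for illustration.

```python
import random


def population_search(objective, bounds, pop_size=10, max_iter=30,
                      explore_prob=0.3, seed=0):
    """Generic population-based minimizer following steps 1-4 above."""
    rng = random.Random(seed)

    def clip(x):
        return [min(hi, max(lo, v)) for v, (lo, hi) in zip(x, bounds)]

    # Steps 1-2: random initial population and its fitness.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [objective(x) for x in pop]
    best_fit = min(fit)
    best_x = pop[fit.index(best_fit)][:]

    # Step 3: iterate until the iteration budget is spent.
    for _ in range(max_iter):
        for i, x in enumerate(pop):
            if rng.random() < explore_prob:
                # Exploration: random step scaled to each dimension's range.
                cand = [v + rng.gauss(0.0, 0.1) * (hi - lo)
                        for v, (lo, hi) in zip(x, bounds)]
            else:
                # Exploitation: move part of the way toward the best-so-far.
                pull = rng.random()
                cand = [v + pull * (b - v) for v, b in zip(x, best_x)]
            cand = clip(cand)
            cand_fit = objective(cand)
            pop[i], fit[i] = cand, cand_fit
            # Update the global best whenever a better position is found.
            if cand_fit < best_fit:
                best_fit, best_x = cand_fit, cand[:]

    # Step 4: return the best solution found.
    return best_x, best_fit


def sphere(x):
    # Hypothetical test objective with its minimum at the origin.
    return sum(v * v for v in x)


search_bounds = [(-5.0, 5.0), (-5.0, 5.0)]
best_x, best_fit = population_search(sphere, search_bounds)
```

Swapping the body of the inner loop for algorithm-specific moves (state selection, attraction and repulsion, territorial behavior) would turn this skeleton into one of the named optimizers, while the surrounding bookkeeping stays the same.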
Table 7 reports the indices of XGBoost tuned by the different optimization algorithms. The SSA-XGBoost model has the strongest prediction ability among the four optimization algorithms, and its indices are relatively low.
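The conclusions list six indices for the metaheuristic comparison (Best, Worst, Mean, Median, Std, Var), which are presumably computed over the best fitness value of each independent run. The sketch below shows one way to compute them, assuming that lower fitness is better and that population (rather than sample) variance is intended; the function name `summarize_runs` and the run values are hypothetical.

```python
from statistics import mean, median, pstdev, pvariance


def summarize_runs(best_fitness_per_run):
    """Six summary indices over the final fitness of independent runs."""
    return {
        "Best": min(best_fitness_per_run),    # lowest objective value
        "Worst": max(best_fitness_per_run),   # highest objective value
        "Mean": mean(best_fitness_per_run),
        "Median": median(best_fitness_per_run),
        "Std": pstdev(best_fitness_per_run),
        "Var": pvariance(best_fitness_per_run),
    }


# Hypothetical final fitness values from eight runs of one optimizer.
run_fitness = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
run_summary = summarize_runs(run_fitness)
```

Comparing these dictionaries across optimizers gives a table of the same shape as the one described: a lower Mean indicates better average accuracy, and a lower Std or Var indicates more consistent runs.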
Figs 16 and 17 visualize the experimental outcomes of the four algorithms.
Conclusions
In this paper, a combined prediction model named SSA-XGBoost is proposed, in which a sparrow search algorithm optimizes the parameters of the XGBoost model. Based on the 7th national population census of Hebei Province and the mobile communication data, a prediction experiment was conducted for a comparative analysis. The model was compared with the traditional XGBoost model, with XGBoost tuned by other metaheuristic algorithms (Crayfish, Reptile, and Redfox), and with other deep-learning and machine-learning models (LSTM, GRU, CatBoost, and AdaBoost), using a variety of comparison graphs and error evaluation indicators. The following conclusions can be drawn:
- (1). Compared with the traditional XGBoost model, SSA-XGBoost uses SSA to obtain more reasonable parameters that fit the actual development curve of the population, which greatly improves the model's ability to predict time series. For population prediction, the SSA-XGBoost model is better than the other models in terms of both sequence fit and performance evaluation indicators. This shows that combining SSA with XGBoost gives better prediction performance than the single model.
- (2). The SSA-XGBoost model proposed in this study performs better in population prediction. Compared with other machine-learning models (CatBoost and AdaBoost), the XGBoost model optimized by SSA achieves better prediction performance with better indices. Compared with other metaheuristic algorithms (Crayfish, Reptile, and Redfox), the proposed SSA-XGBoost model exhibits the best performance on six indices (Best, Worst, Mean, Median, Std, Var), a result that the convergence diagram and box plots also support. In practice, the SSA-XGBoost model can be applied to monitor and forecast the population.
- (3). Because the dataset used in this study is tabular, and the improved RNN variants selected here (LSTM and GRU) did not perform as well as XGBoost in our experiments, the results support the view that XGBoost is better suited to tabular data than deep models such as RNNs. The SSA-XGBoost model yields higher prediction accuracy and better evaluation indices. On the three evaluation indices of MSE, MAE, and R2, the SSA-XGBoost model achieves greater improvements, which indicates its strong prediction performance and high robustness and provides a new way of thinking about time series prediction research.
The method proposed in this paper can effectively forecast the population and could be extended to specific applications such as optimizing public services by predicting demand, allocating resources in urban planning by forecasting population growth, or managing traffic by anticipating congestion patterns.
Due to computational resource constraints, the models in this study rely mainly on national census data and mobile information data, which are limited in geographical and temporal granularity. The model predicts only at yearly resolution, and the data are limited to Hebei Province. Future research can incorporate sound data for a more comprehensive study.
References
- 1. Schlembach C, Schmidt SL, Schreyer D, Wunderlich L. Forecasting the Olympic medal distribution - a socioeconomic machine learning model. Technol Forecast Soc Chang. 2022;175:121314.
- 2. Badmos OS, Rienow A, Callo-Concha D, Greve K, Juergens C. Simulating slum growth in Lagos: An integration of rule based and empirical based model. Computers, Environment and Urban Systems. 2019;77:101369.
- 3. Risanger S, Singh B, Morton D, Meyers LA. Selecting pharmacies for COVID-19 testing to ensure access. Health Care Manag Sci. 2021;24(2):330–8. pmid:33423180
- 4. Hasegawa Y, Sekimoto Y, Seto T, Fukushima Y, Maeda M. My city forecast: Urban planning communication tool for citizen with national open data. Computers, Environment and Urban Systems. 2019;77:101255.
- 5. Bautista S, Espinoza A, Narvaez P, Camargo M, Morel L. A system dynamics approach for sustainability assessment of biodiesel production in Colombia. Baseline simulation. Clean Prod. 2019;213:1e20.
- 6. Shafizadeh-Moghadam H. Improving spatial accuracy of urban growth simulation models using ensemble forecasting approaches. Computers, Environment and Urban Systems. 2019;76:91–100.
- 7. Hu Y, Ji Z, Kong X, Jin S, Yu L. Carbon footprint and economic efficiency of urban agriculture in Beijing: a comparative case study of conventional and home-delivery agriculture. Clean Prod. 2019;234:615–25.
- 8. Eshragh A, Ganim B, Perkins T, Bandara K. The importance of environmental factors in forecasting australian power demand. Environ Model Assess. 2022;27(1):1–11.
- 9. Wilson T, Grossman I, Alexander M, Rees P, Temple J. Methods for Small Area Population Forecasts: State-of-the-Art and Research Needs. Popul Res Policy Rev. 2022;41(3):865–98. pmid:34421158
- 10. Grossman I, Bandara K, Wilson T, Kirley M. Can machine learning improve small area population forecasts? A forecast combination approach. Comput Environ Urban Syst. 2022;95:101806.
- 11.
Smith SK, Morrison PA. Small-Area and Business Demography. In: Poston DL, Micklin M, editors. Boston (MA): Springer; 2005. p. 761–85.
- 12. Rayer S. Population forecast errors: a primer for planners. Plan Educ Res. 2008;27(4):417–30.
- 13. Tayman J. Assessing uncertainty in small area forecasts: state of the practice and implementation strategy. Popul Res Policy Rev. 2011;30:781–800.
- 14. Diamond I, Tesfaghiorghis H, Joshi H. The uses and users of population projections in Australia. J Aust Popul Assoc. 1990;7(2):151–70. pmid:12343018
- 15. Irina G, Kasun B, Tom W, Michael K. Can machine learning improve small area population forecasts? A forecast combination approach. Computers, Environment and Urban Systems. 2022;95(2022):101806.
- 16. Mu XY, Zhang XH, Anthony GY, Wang JJ. Evaluating the representativeness of mobile big data: A comparative analysis between China’s mobile big data and census data at the county level. Applied Geography. 2024;166:103260.
- 17. Pierre D, Catherine L, Samuel M, Andrew JT. Dynamic population mapping using mobile phone data. Applied Physical Sciences. 2014, 111 (45) 15888–93.
- 18. Shen JF, Gu HY. Unravelling intercity mobility patterns in China using multi-year big data: A city classification based on monthly fluctuations and year-round trends. Computers, Environment and Urban Systems. 2023;102:101954.
- 19. Tuljapurkar S. Stochastic population forecasts and their uses. Int J Forecast. 1992;8(3):385–91. pmid:12157865
- 20. Tayman J, Smith SK, Lin J. Precision, bias, and uncertainty for state population forecasts: an exploratory analysis of time series models. Population Research and Policy Review. 2007;26(3):347–69.
- 21. Ullah MS, Kabir KMA, Khan MAH. A non-singular fractional-order logistic growth model with multi-scaling effects to analyze and forecast population growth in Bangladesh. Scientific Reports. 2024;13(1).
- 22. Wolpert DH. On the connection between in-sample testing and generalization error. Complex Systems. 1992;6:47–94.
- 23. Calabrese F, Colonna M, Lovisolo P, et al. Real-time urban monitoring using cell phones: A case study in Rome. IEEE Transactions on Intelligent Transportation Systems. 2011;12(1).
- 24. Ansah J, Liu L, Kang W, et al. Leveraging burst in twitter network communities for event detection. World Wide Web. 2020.
- 25. Martin S, Pavol H, Michala S, et al. When spatial interpolation matters: Seeking an appropriate data transformation from the mobile network for population estimates. Computers, Environment and Urban Systems. 2024;110:102106.
- 26. Zhang YP, Song Y, Zhang WW, et al. Working and residential segregation of migrants in Longgang City, China: A mobile phone data-based analysis. Cities. 2024;144(104625).
- 27. Fabio R, Giampaolo L, Albrecht W, et al. Towards a methodological framework for estimating present population density from mobile network operator data. Pervasive and Mobile Computing. 2020;68:101263.
- 28. Song GW, Cai L, Liu L, et al. Effects of ambient population with different income levels on the spatio-temporal pattern of theft: A study based on mobile phone big data. Cities. 2023;137:104331.
- 29.
Chen T, Guestrin C. XG-boost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. 785–94.
- 30. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32.
- 31.
You Y, Gitman I, Ginsburg B. Large Batch Training of Convolutional Networks. 2017.
- 32. De Clercq D, Fatima T, Jahanzeb S. Pandemic crisis and employee skills: How emotion regulation and improvisation limit the damaging effects of perceived pandemic threats on job performance. Journal of Management & Organization. 2022;:1–20.
- 33. Bhagat K, et al. Prediction and characterization of substrate specificity and thermal stability for thermostable aliphatic amidases: An in-silico approach. Journal of Advanced Scientific Research. 2021;12(1):115–27.
- 34. Yu S-Z. Explicit Duration Recurrent Networks. IEEE Trans Neural Netw Learn Syst. 2022;33(7):3120–30. pmid:33497341
- 35. Kapoor A, Negi A, Marshall L, Chandra R. Cyclone trajectory and intensity prediction with uncertainty quantification using variational recurrent neural networks. Environmental Modelling & Software. 2023;162:101654.
- 36. Wu Q, Jiang Z, Hong KW, Liu HZ, Yang LT, Ding JH. Tensor-Based Recurrent Neural Network and Multi-Modal Prediction With Its Applications in Traffic Network Management. IEEE Transactions on Network and Service Management. 2021;18(1):780–92.
- 37. Cortez B, Carrera B, Kim YJ, Jung JY. An architecture for emergency event prediction using LSTM recurrent neural networks. Expert Systems with Applications. 2018;97:315–24.
- 38. Liu H, Tian HQ, Li YF, Zhang L. Comparison of four Adaboost algorithm based artificial neural networks in wind speed predictions. Energy Conversion and Management. 2015;92:67–81.
- 39. Moore A, Bell M. XGBoost, A Novel Explainable AI Technique, in the Prediction of Myocardial Infarction: A UK Biobank Cohort Study. Clin Med Insights Cardiol. 2022;16:11795468221133611. pmid:36386405
- 40. Smyl S. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International journal of forecasting. 2020;36(1):75–85.
- 41. Ben JS, Gharib C, Mefteh WS, Ben AW. CatBoost model and artificial intelligence techniques for corporate failure prediction. Technological Forecasting and Social Change. 2021;166:120658.
- 42. Chen JX, Yuan WL, Chen SF, Hu ZZ, Li P. Evo-MAML: Meta-Learning with Evolving Gradient. Electronics. 2023;12(18).
- 43. Ozkan R, Samli R. Flood algorithm: a novel metaheuristic algorithm for optimization problems. PeerJ Comput Sci. 2024;10:e2278. pmid:39650360
- 44. Rajeswari V, Priya KS. Ontological modeling with recursive recurrent neural network and crayfish optimization for reliable breast cancer prediction. Biomedical Signal Processing and Control. 2025;99:106810.
- 45. Khan MK, Zafar MH, Rashid S, Mansoor M, Moosavi SKR, Sanfilippo F. Improved Reptile Search Optimization Algorithm: Application on Regression and Classification Problems. Applied Science-Basel. 2023;13(2):945.
- 46. Vaiyapuri T, Alaskar H, Aljohani E, Shridevi S, Hussain A, Liyakathunisa. Red Fox Optimizer with Data-Science-Enabled Microarray Gene Expression Classification Model. Applied Science-Basel. 2022;12(9):4172.
- 47. Mirjalili S, Gandomi AH, Mirjalili SZ, Saremi S, Faris H, Mirjalili SM. Salp Swarm Algorithm: A bio-inspired optimizer for engineering design problems. Adv Eng Softw. 2017;114:163–91.
- 48. Kayarvizhy N, Kanmani S, Uthariaraj R. Improving fault prediction using ANN-PSO in object oriented systems. Int J Comput Appl. 2013;73:0975–8887.
- 49. Mohamed S, Luka J, Nebojsa B, Nebojsa B, Laith A. Enhancing Internet of Things Network Security Using Hybrid CNN and XGBoost Model Tuned via Modified Reptile Search Algorithm. Applied Science. 2023;13:12687.
- 50. Tamara Z, Bosko N, Vladimir S, Dragan P, Nebojsa B. Software defects prediction by metaheuristics tuned extreme gradient boosting and analysis based on Shapley Additive Explanations. Applied Soft Computing. 2023;146:110659.
- 51. Mihailo T, Nemanja S, Miodrag Z, Nebojsa B. Improving audit opinion prediction accuracy using metaheuristics tuned XGBoost algorithm with interpretable results through SHAP value analysis. Applied Soft Computing. 2023;149:110955.
- 52. Nguyen TTL, Manish P, Saeid J, Gouri SB, Akbar N, Shoaib A, et al. Flood susceptibility modeling based on new hybrid intelligence model: Optimization of XGBoost model using GA metaheuristic algorithm. Science Direct. 2022;69:3301–18.
- 53. Luka J, Gordana J, Nebojsa B, Miodrag Z, Mirjana P, Filip A, et al. The explainable potential of coupling metaheuristics optimized XGBoost and SHAP in revealing VOCs’ environmental fate. Atmosphere. 2023;14(109).
- 54.
Jones RH. Discourse Analysis: A Resource Book for Students. 2nd edition. Routledge; 2019.