Abstract
Urbanization and industrialization have led to a significant increase in air pollution, posing a severe environmental and public health threat. Accurate forecasting of air quality is crucial for policymakers to implement effective interventions. This study presents a novel AIoT platform specifically designed for PM2.5 monitoring in Southwestern Morocco. The platform utilizes low-cost sensors to collect air quality data, transmitted via WiFi/3G for analysis and prediction on a central server. We focused on identifying optimal features for PM2.5 prediction using Minimum Redundancy Maximum Relevance (mRMR) and LightGBM Recursive Feature Elimination (LightGBM-RFE) techniques. Furthermore, Bayesian optimization was employed to fine-tune hyperparameters of popular machine learning models for the most accurate PM2.5 concentration forecasts. Model performance was evaluated using Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the coefficient of determination (R2). Our results demonstrate that the LightGBM model achieved superior performance in PM2.5 prediction, with a significant reduction in RMSE compared to other evaluated models. This study highlights the potential of AIoT platforms coupled with advanced feature selection and hyperparameter optimization for effective air quality monitoring and forecasting.
Citation: Bekkar A, Hssina B, ABEKIRI N, Douzi S, Douzi K (2024) Real-time AIoT platform for monitoring and prediction of air quality in Southwestern Morocco. PLoS ONE 19(8): e0307214. https://doi.org/10.1371/journal.pone.0307214
Editor: Worradorn Phairuang, Chiang Mai University, THAILAND
Received: March 14, 2024; Accepted: July 1, 2024; Published: August 22, 2024
Copyright: © 2024 Bekkar et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The core code and dataset supporting the findings of this study are publicly accessible on GitHub at https://github.com/abdbekkar/paper.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
The rapid pace of industrialization and urbanization has significantly reshaped the global landscape. According to the United Nations, over 50% of the world’s population now resides in urban areas, a figure expected to rise in the coming years [1]. While urbanization has driven economic growth and improved living standards, it has also introduced significant challenges, notably in transportation, healthcare, and air pollution [2]. To address these issues, the concept of smart cities has emerged, leveraging advanced information and communication technologies to promote sustainability and enhance quality of life.
One critical challenge in urban areas is air pollution, driven by the extensive use of fossil fuel-based vehicles and industrial equipment. Harmful gases and particulate matter, including carbon oxides (COx—CO and CO2), nitrogen oxides (NOx—NO and NO2), sulfur oxides (SOx—SO2, SO3, and SO4), and particulate matter (PM10 and PM2.5), pose serious health risks due to their small size and ability to penetrate respiratory systems. The World Health Organization (WHO) attributes approximately seven million deaths annually to air pollution, with about 90% of the global population exposed to polluted air [3]. Health impacts include respiratory issues, premature mortality, and increased hospital admissions for cardiovascular and pulmonary diseases [4, 5]. Additionally, prolonged exposure to air pollutants can damage vegetation, affecting agricultural productivity and natural ecosystems [6].
Particulate matter (PM), particularly particles smaller than 2.5 microns (PM2.5), has been the focus of recent studies due to its severe health implications. These fine particles can penetrate deep into lung tissue, causing respiratory diseases, asthma, cardiovascular problems, and even mortality [7–9]. Moreover, evidence suggests that PM pollution may facilitate the spread of viruses like SARS-CoV-2 [10]. Accurate assessment and prediction of PM2.5 levels are therefore crucial for effective air pollution management.
Traditional urban air quality monitoring relies on fixed monitoring stations, which are expensive to install and maintain, with each station costing at least 10,000 USD excluding installation and maintenance expenses [11]. Despite their high cost, the distribution of these stations is often inadequate, even in developed countries [12]. For instance, Morocco, with an area of 710,850 km2 and a population of over 38 million, has only 29 regulatory air quality monitoring stations, indicating insufficient coverage [13]. In metropolitan areas, atmospheric pollutant dispersion can vary significantly over short distances (< 1 km) [12]. This variability, driven by unevenly distributed emission sources and complex urban dispersion processes, makes conventional stationary monitoring stations site-specific and often insufficient for capturing real-time air quality variations.
To address these challenges, there has been a shift towards using small, affordable sensing devices, or IoT units. Deploying numerous low-cost air sensors that provide frequent data updates is now feasible [14]. These sensors are significantly cheaper than traditional fixed monitoring stations, with costs ranging from 100 to 2500 USD [15]. The integration of IoT technology with machine learning algorithms presents a promising approach to enhance air quality management. Machine learning techniques, such as Support Vector Regression (SVR), Gradient Boosting, and LightGBM, have demonstrated effectiveness in forecasting PM2.5 concentrations using various environmental and meteorological data [16]. Mampitiya et al. showcased the high effectiveness of LightGBM in forecasting PM10 levels in urban areas of Sri Lanka, achieving near-perfect accuracy metrics [17]. Additionally, integrating machine learning with remote sensing data, as explored by Rostami et al., enhances the precision of environmental assessments [18]. Bekkar et al. highlighted the effectiveness of deep learning approaches in predicting air pollution in smart cities [19]. The synergy between advanced algorithms and IoT data collection systems offers a powerful tool for real-time monitoring and prediction, enabling proactive measures to mitigate pollution impacts.
Despite significant advances in air quality monitoring and prediction, substantial gaps remain. Most studies focus on highly industrialized regions, neglecting urban areas in developing countries like Morocco. Existing models often fail to integrate low-cost IoT sensors for real-time data collection and prediction, limiting their applicability in resource-constrained settings.
This study aims to develop an efficient, AI-integrated air pollution monitoring system tailored for smart urban environments, capable of providing timely alerts and accurate predictions of pollution levels to mitigate health impacts. The novelty of this work lies in integrating low-cost IoT sensors with advanced machine learning techniques, enabling real-time air quality monitoring and prediction in resource-limited settings. The use of feature selection methods like mRMR and LightGBM-RFE, combined with Bayesian optimization for hyperparameter tuning, ensures high accuracy in PM2.5 predictions. This approach provides a scalable and cost-effective solution for air quality monitoring in developing countries.
The remainder of this paper is structured as follows: Section 2 reviews the related work on air quality monitoring and prediction. Section 3 expands on the IoT platform devised for data collection. Section 4 describes our approach to sensor implementation. Section 5 presents preliminary results and corresponding analysis. Section 6 discusses the utilization of machine learning algorithms for air quality forecasting. Finally, Section 7 offers conclusions and potential avenues for future advancements.
2 Related work
The increasing urgency for accurate and comprehensive air quality monitoring has intensified due to rapid industrialization and urbanization. Traditional air quality monitoring methods rely on fixed monitoring stations equipped with high-precision instruments such as gas analyzers and particulate matter sensors. These stations, often managed by government agencies, provide reliable measurements of pollutants like NO2, SO2, CO, O3, and PM2.5 [20]. However, the high costs associated with their setup, calibration, and maintenance restrict their deployment, resulting in limited spatial coverage and insufficient data to capture localized pollution events or micro-scale air quality variations [21]. To address these limitations, researchers have explored alternative methods that offer greater flexibility and cost-efficiency.
The advent of the Internet of Things (IoT) has introduced a paradigm shift in air quality monitoring by enabling the deployment of low-cost sensors that can be distributed extensively. IoT-based systems utilize sensors to measure various pollutants and transmit data in real-time to centralized databases for analysis. These systems offer several advantages, including reduced costs, enhanced spatial coverage, and the ability to provide high-resolution temporal data [22]. IoT sensors can be integrated into stationary nodes, mobile units, and wearable devices, allowing for flexible and comprehensive monitoring solutions. For example, Kumar et al. demonstrated the effectiveness of deploying a dense network of low-cost IoT sensors across urban areas, providing detailed spatial and temporal pollution data [23]. The deployment of IoT sensors in smart cities has shown promising results in improving air quality monitoring and enabling proactive pollution mitigation measures [24, 25]. Additionally, IoT-based systems facilitate the development of advanced data analytics and machine learning models to predict pollution levels, identify pollution sources, and inform policy-making, while also raising public awareness by providing real-time air quality information through mobile apps and web platforms [26].
Machine learning techniques have been increasingly applied to predict air quality, leveraging historical data to forecast future pollution levels. Supervised learning models such as Support Vector Regression (SVR), Random Forests (RF), and Gradient Boosting Machines (GBM) have shown considerable promise. SVR has been widely used for its robustness in handling non-linear relationships between input variables and air quality indices. For instance, Chen et al. demonstrated the effectiveness of SVR in predicting PM2.5 concentrations with high accuracy, outperforming traditional statistical methods [27]. Random Forests, known for their ability to handle large datasets and complex interactions, have also been employed successfully. Jiang et al. utilized RF to predict air quality levels and identified significant predictors among meteorological variables and pollutant concentrations [28]. Gradient Boosting, another powerful ensemble technique, combines multiple weak predictive models to form a strong predictor. Studies like those by Li et al. have shown that GBM can effectively model the temporal and spatial variations in air pollution, achieving superior performance compared to individual models [29].
Ensemble methods, which integrate multiple learning algorithms to improve predictive performance, have gained traction in air quality prediction. Techniques such as Light Gradient Boosting Machine (LightGBM) are particularly noted for their efficiency and accuracy. LightGBM, a variant of GBM, optimizes the training process by focusing on gradient-based one-side sampling and exclusive feature bundling, making it suitable for large-scale data [30]. Comparative studies have highlighted the advantages of ensemble methods over single models. For example, Zhang et al. found that LightGBM consistently outperformed other methods in terms of prediction accuracy and computational efficiency [31]. These findings underscore the potential of ensemble approaches in enhancing the reliability and precision of air quality forecasts.
In Morocco, several studies have focused on understanding the seasonal variations and meteorological influences on air quality. Bounakhla et al. provided an overview of PM10, PM2.5, and black carbon (BC) and their relationships with meteorological variables in Kenitra, Morocco. Their research highlighted significant seasonal variations in pollutant concentrations and the influence of temperature, humidity, and wind speed on these pollutants [32]. This aligns with the findings of Sbai et al., who investigated the response of atmospheric pollutants to emission reduction and meteorological factors during the COVID-19 lockdown in northern Morocco. Their study emphasized the impact of meteorological conditions on secondary air pollutants like PM2.5 [33].
The integration of low-cost sensors and IoT technologies has been explored by Fahim et al. in developing a smart weather monitoring station for air quality assessment. Their system uses a fuzzy inference model and MQTT protocol to provide accurate and real-time air quality data, crucial for effective environmental management [34]. Additionally, a systematic review by Bouchriti et al. on the health impacts of outdoor air pollution in Morocco highlights the significant health issues caused by pollutants like PM10 and PM2.5, underscoring the need for continuous monitoring and predictive modeling to mitigate these effects [35].
Local implementations of air quality monitoring systems in Moroccan cities have demonstrated the potential for low-cost, high-efficiency solutions. For example, deploying an IoT-based air quality monitoring system in an urban area of Morocco showcased the effectiveness of real-time data collection and analysis in identifying pollution hotspots [36]. Similarly, using a geographic information system (GIS) to provide real-time air quality information to citizens has proven effective in raising awareness and promoting environmental health [37].
Despite significant advancements, gaps remain in the research on air quality monitoring and prediction in Morocco. There is a need for extensive research on the health and economic impacts of air pollution, improved air quality modeling, and a broader pollutant focus beyond just regulated ones. Developing comprehensive datasets, such as the MOREAIR dataset, which includes temporal, geographical, and air-quality measurements, can provide a richer context for understanding air pollution’s impact [38].
Our study builds upon this body of work by integrating low-cost IoT sensors, machine learning models, and real-time data analytics specifically tailored for air quality monitoring in Morocco. By leveraging these technologies, our research aims to achieve accurate air quality predictions and provide valuable insights into the health impacts of air pollution. The novel integration of AIoT platforms with advanced feature selection and hyperparameter optimization techniques, such as mRMR and LightGBM-RFE, ensures high accuracy in PM2.5 predictions. This approach addresses the challenges identified in previous studies and offers a scalable and cost-effective solution for air quality monitoring in developing countries.
3 IoT platform / IoT monitoring system
This study presents an innovative and cost-effective IoT-based monitoring system designed to accurately measure fine particulate matter, specifically PM2.5 and PM10, in micrograms per cubic meter (μg/m3). This system enables wireless data transfer from multiple geographic locations to a centralized cloud server, facilitating comprehensive storage, analysis, and real-time visualization. The IoT monitoring system consists of two core components: Remote Sensor Nodes (RSN) and a Cloud Server (CS), as depicted in Fig 1. The RSNs collect environmental data, which is then transmitted to the CS for aggregation, processing, and visualization. This architecture supports efficient, scalable monitoring and provides valuable insights into environmental conditions, aiding data-driven decision-making for urban planning and public health management.
3.1 Remote Sensor Nodes
The Remote Sensor Nodes (RSN) are crucial for the IoT monitoring system, enabling wireless environmental data collection and transmission. Each RSN integrates various sensors with an ESP8266 microcontroller, known for its cost-effective and reliable wireless communication capabilities [39] (see Fig 2).
3.1.1 DHT22 sensor.
The DHT22 measures temperature and humidity with high accuracy (±0.5°C for temperature and ±2% rH for humidity) and provides a digital output [40].
3.1.2 PMS5003 sensor.
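The PMS5003 measures PM2.5 and PM10 by laser scattering: a small fan draws air through the sensing chamber, where suspended particles are counted optically. It offers reasonable accuracy for a low-cost sensor (around 15 USD) [41].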
3.1.3 MQ135 gas sensor.
The MQ135 detects harmful gases like ammonia, sulfur dioxide, and benzene. Proper calibration ensures accurate measurements despite resistance variability [42].
3.1.4 Data transmission.
Data from the sensors is digitized and transmitted wirelessly to the Cloud Server via Wi-Fi or GPRS, using the lightweight and reliable MQTT protocol.
The integration of these components within the RSN provides a robust, scalable solution for real-time air quality monitoring, enhancing urban planning and public health management.
3.2 Server coordinator
This section outlines the robust communication framework designed to collect and present real-time environmental data from various Remote Sensor Nodes (RSNs) to a centralized Cloud Server (CS). The framework employs Mosquitto MQTT [43], Node-RED, InfluxDB, and Grafana, ensuring seamless data transmission, processing, and visualization.
3.2.1 MQTT protocol.
The Message Queuing Telemetry Transport (MQTT) protocol [44] is ideal for lightweight messaging in IoT applications due to its simplicity, efficiency, and reliability. Client devices (RSNs) connect to a centralized broker on a cloud server, publishing or subscribing to messages using designated topics.
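As an illustration of the publish step, the sketch below shows how one RSN reading might be packaged as an MQTT message, that is, a topic string plus a JSON payload. The topic layout `airquality/<node_id>/readings` and the field names are assumptions made for this example; the paper does not specify the platform's actual topic scheme. Sending the message is left to whichever MQTT client library is used.

```python
import json
from datetime import datetime, timezone


def build_message(node_id, readings, when=None):
    """Return (topic, payload) for one sensor-node reading.

    The topic layout and field names are illustrative assumptions,
    not the platform's documented scheme.
    """
    when = when or datetime.now(timezone.utc)
    topic = f"airquality/{node_id}/readings"
    payload = json.dumps({"ts": when.isoformat(), **readings}, sort_keys=True)
    return topic, payload


topic, payload = build_message(
    "S1",
    {"pm25": 18.4, "pm10": 31.0, "temp_c": 22.5, "rh": 61.0},
    when=datetime(2022, 10, 1, 8, 0, tzinfo=timezone.utc),
)
print(topic)
print(payload)
```

A subscriber such as a Node-RED flow would receive the same topic and payload and can decode the JSON back into individual measurements.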
3.2.2 Node-RED.
Node-RED [45], an open-source platform based on Node.js, connects devices, APIs, and services. It processes MQTT-transmitted sensor data, extracting and reassembling measurements for storage or transmission to InfluxDB.
3.2.3 InfluxDB.
InfluxDB [46], an open-source time-series database optimized for IoT applications, handles large volumes of time-based data efficiently. It integrates seamlessly with Python for robust data retrieval and storage.
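To make the storage step concrete, the following sketch formats one reading in InfluxDB line protocol (`measurement,tags fields timestamp`). The measurement name `air_quality` and the tag and field names are hypothetical, and the helper is deliberately simplified: it assumes names contain no characters that require escaping and that every field is a float.

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Format one point as InfluxDB line protocol.

    Simplified sketch: assumes names and tag values contain no spaces,
    commas, or equals signs, and writes every field as a float.
    """
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={float(v)}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {ts_ns}"


line = to_line_protocol(
    "air_quality",            # hypothetical measurement name
    {"site": "S1"},           # tag: which sensor node produced the reading
    {"pm25": 18.4, "pm10": 31},
    1664611200000000000,      # 2022-10-01T08:00:00Z in nanoseconds
)
print(line)
```

Tags are indexed and suit low-cardinality metadata such as the site identifier, while the numeric measurements go into fields.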
3.2.4 Grafana.
Grafana [47] is an open-source application for data analysis and visualization, supporting various data sources including InfluxDB. It offers extensive graphical representations and can send notifications based on predefined criteria for real-time monitoring.
This integrated framework ensures reliable and efficient data management, supporting the IoT monitoring system’s goal of providing accurate, real-time insights into air quality and environmental conditions, thereby facilitating informed urban planning and public health management.
4 Experiment
4.1 Deployment strategy
Ait Melloul, a municipality in southwestern Morocco, spans 40 km2 and has a population of more than 171,000. Located about 15 km from Agadir, it experiences a semi-arid to arid climate influenced by the Atlas Mountains, the Atlantic coast, and the desert. According to [48], aridity increases from west to east. The average annual precipitation is around 260 mm, with temperatures ranging from a high of 27°C in August to a low of 11°C in January. Humidity levels range between 32% and 85%, and prevailing winds from the west-northwest blow at speeds of 0.1 to 3.3 m/s. These conditions can cause temperature inversions that trap pollutants in the lower atmosphere, and they leave the region vulnerable to climate change impacts.
Two locations within Ait Melloul were selected for their proximity to vehicular traffic and industrial facilities, as shown in Fig 3. The first site, S1, is in the Industrial Zone, near the intersection of RN1 and the expressway to Agadir Al Massira International Airport. This area experiences heavy vehicular traffic, including large trucks. The second site, S2, is situated between two heavily trafficked lanes of RN1.
Map data (C) OpenStreetMap contributors, under the Open Database License (ODbL). This map was created using the folium library in Python.
No specific permits were required for this study. Air quality data was collected using a low-cost, non-invasive monitoring system. Local authorities confirmed that no specific permits were necessary for this type of data collection and its use in research.
4.2 Exploratory data analysis
Data collection ran from October 2022 to February 2023 and yielded 3504 hourly data points. Weather data were obtained via the Visual Crossing Weather API [49] for the GMAD weather station, located at coordinates (30.33, -9.4). This station was chosen because it lies within a 10 km radius of the pollutant monitoring station. The data were received in CSV format and comprise 5 pollutant variables and 13 meteorological variables.
The dataset collected by our platform is crucial as it is specifically gathered from Southwestern Morocco, ensuring relevance to local air pollution challenges. It includes high-resolution, real-time measurements taken hourly, capturing temporal variations in pollutant levels. Our AIoT platform is tailored to collect comprehensive environmental data, demonstrating the feasibility and effectiveness of low-cost IoT sensors in resource-limited settings. This innovative integration of sensors and machine learning offers a practical solution for real-time air quality monitoring and prediction, providing a valuable baseline for future research and public health benefits.
The statistical characteristics of the data collected from the monitoring points in Ait Melloul are presented in Table 1. The analysis of PM2.5 can be conducted in temporal and spatial dimensions, where prior concentrations of PM2.5 may impact subsequent measurements. The PM2.5 concentration is influenced by various factors, including the interaction between pollutants and meteorological variables, such as wind speed and direction.
Examining the interplay among PM2.5, other atmospheric pollutants, and meteorological conditions gives a clearer picture of the diverse influences on PM2.5. From a spatial perspective, the PM2.5 concentration at a target station can be affected by changes in wind direction and by airflow from surrounding stations. Spatial correlation should therefore be considered when forecasting, mitigating, and managing PM2.5 pollution.
Fig 4 illustrates the temporal fluctuations of pollution and meteorological parameters recorded at station 1 during the initial days of October. The variation patterns of PM10 and PM2.5 concentrations are closely aligned, suggesting a significant influence of PM10 on PM2.5 levels. Similar trends are observed in the fluctuations of carbon monoxide (CO) and carbon dioxide (CO2).
The data were observed at station 1 during the initial days of October.
Notably, local maxima in humidity, dew point, wind speed, and wind direction coincide with a gradual decrease in PM2.5 concentrations. Elevated relative humidity enhances the adsorption capacity of PM2.5 particles, leading to the condensation of moisture-rich fine particles and the formation of larger particles that settle out of the air, thus reducing PM2.5 concentrations. Additionally, increased wind speeds aid in the dispersion of particles, potentially contributing to lower PM2.5 levels. Initial analysis indicates a negative correlation between PM2.5 concentrations and factors such as humidity, dew point, wind speed, and wind direction [50, 51].
These findings underscore the complex interactions between pollutant levels and meteorological conditions, highlighting the critical role of these factors in air quality management in industrial zones.
A wind rose diagram (Fig 5) provides a comprehensive representation of the predominant wind patterns observed in Ait Melloul. This diagram categorizes wind direction data into discrete sectors, crucial for determining the transport capacity and overall wind direction. According to data from the airport station, prevailing winds of moderate velocity dominate the eastern region of Ait Melloul. These winds, penetrating the industrial zone with velocities ranging from 35 to 40 km/h, likely facilitate the transportation of particles to the designated monitoring locations.
Map data (C) OpenStreetMap contributors, under the Open Database License (ODbL). This map was created using the folium library in Python.
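The sector binning that underlies a wind rose can be sketched as follows. The choice of 16 compass sectors is an assumption for illustration; the paper does not state how many sectors were used for Fig 5.

```python
from collections import Counter

SECTORS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]


def sector(direction_deg):
    """Map a wind direction in degrees (0 = north, clockwise) to a sector label."""
    width = 360 / len(SECTORS)  # 22.5 degrees per sector
    # Shift by half a sector so that each label is centred on its bearing.
    index = int(((direction_deg % 360) + width / 2) // width) % len(SECTORS)
    return SECTORS[index]


def wind_rose_counts(directions_deg):
    """Count how many observations fall into each sector."""
    return Counter(sector(d) for d in directions_deg)


sample = [290, 295, 300, 280, 10, 355]  # made-up directions in degrees
counts = wind_rose_counts(sample)
print(counts.most_common())
```

Dividing each count by the number of observations gives the sector frequencies that a wind rose plots as petal lengths.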
Time plots are indispensable for the preliminary exploration of time-series data, facilitating the identification of trends, seasonality, anomalies, and disruptions. Such insights are crucial for selecting the most appropriate forecasting methods. Figs 6 and 7 present various graphical representations of the data utilized in this study.
The figure depicts the hourly concentrations of PM2.5 at two monitoring sites: S_RH (blue) and S_ZI (red).
Figs 6 and 7 illustrate the hourly and average hourly concentrations of PM2.5 recorded from October 2022 to February 2023 at two monitoring sites: S_RH (blue) and S_ZI (red). S_ZI denotes the industrial-zone site (S1) and S_RH the residential site (S2). The data reveal that the distribution of PM2.5 does not follow a linear trend, underscoring the complexity of urban air pollution dynamics. The relatively stable variance over time suggests that the series is stationary.
Significant outliers are evident, including a pronounced peak in PM2.5 levels at station S_ZI on December 30th, indicative of transient pollution events. The presence of missing data, as shown in the figure, highlights the challenges associated with low-cost sensors. These gaps were addressed through interpolation during the prediction phase but are retained in the figure to emphasize potential data quality concerns.
Temporal variations in pollutant concentrations are clearly observed, with station S_ZI, located in the industrial zone, generally exhibiting higher PM2.5 levels compared to station S_RH, situated in a residential area. This spatial disparity underscores the influence of local emission sources, such as industrial activities and vehicular traffic, on air quality.
The average hourly PM2.5 concentrations exhibit discernible peaks and troughs throughout the day, with notable discrepancies between the two stations. The highest concentrations of PM2.5 are typically observed between 8:00 and 9:00 a.m., aligning with morning traffic and industrial activities. Station S_ZI consistently records higher PM2.5 levels compared to S_RH, indicating a greater influence of local emission sources. This analysis underscores the importance of understanding temporal variations in pollutant levels for effective air quality management and forecasting.
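The average hourly profile described above amounts to grouping observations by hour of day and averaging. A minimal sketch, using made-up values rather than the study's data:

```python
from collections import defaultdict
from datetime import datetime


def mean_by_hour(records):
    """Average PM2.5 per hour of day.

    `records` is an iterable of (timestamp, value) pairs; missing
    values (None) are skipped rather than counted as zero.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in records:
        if value is None:
            continue
        sums[ts.hour] += value
        counts[ts.hour] += 1
    return {hour: sums[hour] / counts[hour] for hour in sorted(counts)}


records = [  # illustrative values only
    (datetime(2022, 10, 1, 8), 30.0),
    (datetime(2022, 10, 2, 8), 34.0),
    (datetime(2022, 10, 1, 14), 12.0),
    (datetime(2022, 10, 2, 14), None),
]
profile = mean_by_hour(records)
peak_hour = max(profile, key=profile.get)
print(profile, peak_hour)
```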
Fig 8 illustrates the correlations between PM2.5 levels and various environmental parameters at two distinct monitoring sites. The correlation matrices provide a comprehensive view of the relationships between PM2.5 and other measured variables, highlighting both positive and negative associations.
A strong positive correlation is observed between PM2.5 and PM10 levels at both sites, indicating that the two particulate matter fractions tend to rise and fall together. This likely reflects their common sources, such as vehicular emissions, industrial activities, construction projects, and natural sources like dust and pollen. Although the two fractions share sources, PM2.5 particles are smaller and can penetrate deeper into the respiratory system, posing different health risks compared to PM10. Because PM10 carries largely redundant information, it was excluded from the inputs of our machine learning models.
Positive correlations are also evident between PM2.5 and other variables such as humidity, dew point, relative humidity (RH), carbon monoxide (CO), and carbon dioxide (CO2). These correlations, although weaker than that with PM10, suggest that increases in these parameters are associated with higher PM2.5 levels.
Conversely, several environmental factors exhibit negative correlations with PM2.5, including wind gusts, visibility, wind speed, temperature, “feels like” temperature, and sea level pressure. These negative correlations imply that increases in these factors generally correspond to decreases in PM2.5 concentrations. For example, higher wind speeds and gusts can disperse particulate matter, reducing its concentration in the air.
The strength of these correlations varies between the two monitoring stations. Generally, the industrial zone (S1) shows stronger correlations compared to the residential area (S2), indicating that local environmental conditions and sources of pollution significantly influence these relationships. Understanding these correlations is crucial for effective air quality management and developing predictive models for PM2.5 concentrations.
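Each entry of a correlation matrix such as those in Fig 8 is a pairwise correlation coefficient. The sketch below computes the Pearson coefficient, assuming that this is the coefficient used (the text does not name it), on made-up values rather than the study's data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Illustrative values only.
pm25 = [10.0, 20.0, 30.0, 40.0]
pm10 = [22.0, 41.0, 63.0, 80.0]
wind = [4.0, 3.0, 2.0, 1.0]

r_pm10 = pearson(pm25, pm10)   # close to +1: the two series move together
r_wind = pearson(pm25, wind)   # close to -1: PM2.5 falls as wind speed rises
print(round(r_pm10, 3), round(r_wind, 3))
```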
5 Predictive modeling
5.1 Data preprocessing
This study examines and contrasts the effectiveness and efficiency of six machine learning models for predicting PM2.5 concentrations. Initially, data preprocessing was undertaken, which involved the removal of outliers and the imputation of missing values. After executing an exploratory data analysis, we set up three experimental scenarios for PM2.5 prediction: 1) utilizing the full dataset, 2) selecting important features with the mRMR method, and 3) replicating the second scenario using LightGBM-RFE. For each scenario, we built machine learning models using the training data and assessed their predictive accuracy based on various metrics. Each experiment incorporated a range of machine learning algorithms including linear models, decision trees (DT), gradient boosting regression (GBR), support vector regression (SVR), and ensemble methods.
The detailed steps of the proposed workflow are presented in Algorithm 1.
Algorithm 1: Proposed workflow for PM2.5 prediction
Input: Meteorological data; historical data from low-cost IoT sensors
Output: Predicted PM2.5 concentrations
begin
    // Data preprocessing
    Perform data cleaning: remove outliers and impute missing values;
    Conduct exploratory data analysis (EDA);
    // Feature selection
    Scenario 1: Use the full dataset;
    Scenario 2: Select important features using the mRMR method;
    Scenario 3: Select important features using the LightGBM-RFE method;
    // Model building and evaluation
    Split data into training (70%) and testing (30%) sets;
    Train models using DT, GBR, SVR, XGBoost, and LightGBM;
    Optimize models using Bayesian optimization with 5-fold cross-validation;
    Evaluate models using MAE, RMSE, and R2 on the test set;
end
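The split and evaluation steps of the workflow can be sketched in plain Python. The chronological (unshuffled) split is an assumption made here because the data form a time series; the paper states only the 70/30 proportion. The metric functions follow the standard definitions of MAE, RMSE, and R2.

```python
from math import sqrt


def chronological_split(rows, train_percent=70):
    """Split time-ordered rows into train/test without shuffling (assumed)."""
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


train, test = chronological_split(list(range(10)))

# Illustrative observed and predicted values, not results from the study.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```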
5.1.1 Data outlier.
The dataset contains measurements with outliers, likely caused by sensor malfunctions. Cleaning the data to remove outliers and fill in missing values is crucial for improving data quality and model accuracy [52]. We used the Interquartile Range (IQR) technique to detect outliers, defined as values below QL − 1.5 × IQR or above QU + 1.5 × IQR, where QU and QL represent the upper and lower quartiles, respectively, and IQR = QU − QL. Fig 9 shows the boxplot of hourly mean air pollution levels and meteorological factors during the study period in S1 and S2.
Outliers were treated as missing data and filled using an imputation method.
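A minimal sketch of this rule is given below: values outside the IQR fences are replaced with `None` so that a later imputation step can fill them. The quartile convention (`method="inclusive"`, i.e. linear interpolation between data points) is an assumption, since the paper does not state which convention was used, and the sample values are made up.

```python
from statistics import quantiles


def mask_outliers(values, k=1.5):
    """Replace values outside [QL - k*IQR, QU + k*IQR] with None."""
    observed = [v for v in values if v is not None]
    q_lower, _, q_upper = quantiles(observed, n=4, method="inclusive")
    iqr = q_upper - q_lower
    low, high = q_lower - k * iqr, q_upper + k * iqr
    return [v if v is not None and low <= v <= high else None for v in values]


readings = [12, 14, 15, 15, 16, 18, 19, 21, 95]  # 95 mimics a sensor spike
cleaned = mask_outliers(readings)
print(cleaned)
```

For this sample the quartiles are 15 and 19, so the fences are 9 and 25 and only the spike at 95 is masked.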
5.1.2 Data imputation.
Missing data is a prevalent issue in real-world datasets, including those related to air pollutant measurements. Factors such as sensor malfunctions, incorrect data recording, power outages, and data acquisition errors contribute significantly to this problem [53, 54]. These missing values can adversely affect study outcomes and the effective operation of public services related to air quality. Various methods are available to address missing values [55]. In air pollution datasets, missing values often occur in long consecutive periods due to sensor malfunctions or in short gaps resulting from routine maintenance or temporary power outages [56].
In this study, the dataset contains 1% to 5% missing values across all variables. We used the K-nearest neighbors (KNN) technique [57] to address these gaps. KNN leverages existing data to find and assign the most similar values based on the k nearest neighbors. Fig 10 shows the comparison between the original and imputed datasets.
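To make the KNN idea concrete, the sketch below fills each missing entry with the mean of that variable over the k most similar rows. It is a simplified illustration under stated assumptions: the paper does not report the value of k, the distance metric, or the implementation used, so `knn_impute`, `k=2`, and the sample rows are all hypothetical. Off-the-shelf alternatives such as scikit-learn's `KNNImputer` follow the same principle.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaNs in each row with the column mean over the k nearest rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    n_rows = X.shape[0]
    for i in range(n_rows):
        missing = np.isnan(X[i])
        if not missing.any():
            continue
        # Distance to every other row, using only columns observed in both rows.
        dists = np.full(n_rows, np.inf)
        for j in range(n_rows):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if shared.any():
                dists[j] = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
        order = np.argsort(dists)
        for col in np.where(missing)[0]:
            # Nearest rows that actually have a value in this column.
            donors = [j for j in order
                      if np.isfinite(dists[j]) and not np.isnan(X[j, col])][:k]
            if donors:
                filled[i, col] = X[donors, col].mean()
    return filled

# Hypothetical rows: temperature, humidity, PM2.5 (one PM2.5 value missing).
rows = np.array([
    [20.0, 60.0, 10.0],
    [21.0, 61.0, 11.0],
    [35.0, 40.0, np.nan],
    [36.0, 41.0, 30.0],
    [34.0, 39.0, 28.0],
])
imputed = knn_impute(rows, k=2)
```

In this toy example the missing PM2.5 value is filled from the two rows with similar temperature and humidity, not from the cooler, more humid rows.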
5.1.3 Feature selection.
The dataset includes many features, some of which are irrelevant and non-essential, negatively impacting regression accuracy and increasing processing time. Thus, selecting the most relevant features is crucial.
We evaluated two methods for feature selection: LightGBM-RFE and mRMR. LightGBM-RFE excels in capturing complex, non-linear relationships through its tree-based approach, effectively handling feature interactions. mRMR focuses on selecting essential features that maximize relevance to the target variable while minimizing redundancy, using mutual information measures.
Our objective was to determine which method yields the best performance for our dataset. The following sections detail the methodologies of LightGBM-RFE and mRMR, and the rationale behind their selection for this study.
5.1.4 LightGBM-RFE.
LightGBM, introduced by Ke et al. in 2017 [30], is a gradient boosting framework that builds decision trees for regression and classification tasks. It is known for its efficiency in training time and memory usage while maintaining high prediction accuracy. LightGBM employs innovative techniques such as gradient one-sided sampling (GOSS) and exclusive feature bundling (EFB). Additionally, it utilizes the histogram algorithm and a leaf growth strategy with a depth limit to optimize memory consumption and prevent overfitting.
Recursive Feature Elimination (RFE) is a wrapper algorithm for feature selection, introduced by Guyon et al. [58], and it has shown significant success in various fields, including gene selection and air pollution studies. RFE iteratively reduces the feature set size by eliminating the least important features based on their importance ranking, which is recalculated at each iteration using a specific underlying algorithm.
In this study, we integrate RFE with LightGBM (LightGBM-RFE) to enhance the model’s predictive performance by selecting the most relevant features. The procedure for LightGBM-RFE is as follows:
- Initialize the model: Set the desired number of features to select and specify the LightGBM model configuration. Let n be the total number of features, and k be the desired number of features to select.
- Train the initial model: Train the LightGBM model using the full set of n features in the dataset. Let X = [x1, x2, …, xn] be the feature matrix and y be the target vector.
- Compute feature importance: Calculate the importance of each feature based on the trained LightGBM model. The feature importance I(xi) can be defined as the total gain or split improvement brought by xi across all trees in the model:
$$I(x_i) = \sum_{t=1}^{T} \sum_{j \in \mathcal{S}_t} \Delta I_t(j)\, \mathbb{1}\left[\text{split } j \text{ uses } x_i\right]$$
where T is the total number of trees, $\mathcal{S}_t$ is the set of splits in tree t, and ΔIt(j) is the improvement in the loss function due to split j.
- Eliminate least important features: Identify and eliminate the least important features based on their importance ranking. Let $\mathcal{F}$ be the set of remaining features. Remove the feature xmin with the lowest importance:
$$x_{\min} = \arg\min_{x_i \in \mathcal{F}} I(x_i), \qquad \mathcal{F} \leftarrow \mathcal{F} \setminus \{x_{\min}\}$$
- Retrain the model: Retrain the LightGBM model using the reduced set of features $\mathcal{F}$.
- Repeat the process: Repeat the process of feature elimination and model retraining until the desired number of features k is selected.
- Train the final model: Train the final LightGBM model using the selected subset of k features.
The integration of RFE with LightGBM provides a powerful method for feature selection, leveraging the efficiency and accuracy of LightGBM to identify the most relevant features while iteratively reducing the feature set to enhance model performance and reduce overfitting.
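The elimination loop can be sketched as below. To keep the example dependency-free, the importance score is a stand-in (absolute Pearson correlation with the target) and is not LightGBM's gain importance; in a LightGBM-RFE setting, `importance_fn` would instead fit a LightGBM regressor on the remaining columns and return its gain-based feature importances. The function names, feature names, and synthetic data are hypothetical.

```python
import numpy as np

def recursive_feature_elimination(X, y, names, k, importance_fn):
    """Drop the least important feature one at a time until k remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > k:
        # Importance is recomputed on the reduced feature set at every round.
        scores = importance_fn(X[:, remaining], y)
        worst = int(np.argmin(scores))
        remaining.pop(worst)
    return [names[i] for i in remaining]

def abs_correlation_importance(X, y):
    # Stand-in importance (assumption): |Pearson r| between feature and target.
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                     for j in range(X.shape[1])])

# Synthetic data: two informative features and two pure-noise features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)
names = ["pm25_lag1", "co2_lag1", "noise_a", "noise_b"]

selected = recursive_feature_elimination(
    X, y, names, k=2, importance_fn=abs_correlation_importance
)
```

Passing the importance function as an argument keeps the loop independent of the underlying model, which is the defining property of a wrapper method.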
5.1.5 mRMR (Minimum Redundancy Maximum Relevance).
Minimum Redundancy Maximum Relevance (mRMR) [59] is a feature selection method designed to select a subset of features that are both highly relevant to the target variable and minimally redundant among themselves. Relevance is defined as the degree to which a feature is related to the target variable, while redundancy is the degree to which features are correlated with each other.
The mRMR algorithm operates in two main stages: calculating relevance and redundancy. Relevance is measured using mutual information, which quantifies the statistical dependence between two variables. Redundancy is measured using the Pearson correlation coefficient between pairs of features. The goal of mRMR is to maximize relevance and minimize redundancy, ensuring the selected features are the most representative of the target variable without being overly correlated with each other.
The steps of the mRMR algorithm are as follows:
- Initialization: Begin with an empty set of selected features $S = \emptyset$. Calculate the relevance score for each feature xi with respect to the target variable y using mutual information I(xi; y):
$$I(x_i; y) = \sum_{x_i} \sum_{y} p(x_i, y) \log \frac{p(x_i, y)}{p(x_i)\, p(y)}$$
where p(xi, y) is the joint probability distribution of xi and y, and p(xi) and p(y) are the marginal probability distributions.
- Maximum Relevance: Select the feature xi with the highest relevance score and add it to the set S:
$$x^{*} = \arg\max_{x_i} I(x_i; y), \qquad S \leftarrow S \cup \{x^{*}\}$$
- Minimum Redundancy: Calculate the redundancy between each remaining feature xj and the features already in S using average mutual information:
$$R(x_j, S) = \frac{1}{|S|} \sum_{x_s \in S} I(x_j; x_s)$$
Select the feature xj that maximizes the difference between relevance and redundancy:
$$x_j^{*} = \arg\max_{x_j \notin S} \left[ I(x_j; y) - R(x_j, S) \right]$$
- Iteration: Repeat the Minimum Redundancy step, adding the selected feature to S each time, until the desired number of features is selected or another stopping criterion is met.
- Final Subset: The final subset S consists of the features selected during the iterative process.
By balancing relevance and redundancy, the mRMR algorithm ensures that the selected features provide maximum information about the target variable while minimizing overlap, leading to a more efficient and interpretable feature set.
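A compact sketch of the greedy selection follows. It assumes a simple histogram estimate of mutual information with 8 bins, which is an illustrative choice and not taken from the paper, and it uses the relevance-minus-redundancy criterion described in the steps above. The feature names and synthetic data are hypothetical.

```python
import numpy as np

def mutual_information(a, b, bins=8):
    """Histogram estimate of I(a; b) in nats (binning is an assumption)."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)      # marginal of a, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)      # marginal of b, shape (1, bins)
    nz = p_ab > 0                               # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))

def mrmr(X, y, names, k):
    """Greedy mRMR: maximize relevance minus mean redundancy."""
    n_features = X.shape[1]
    relevance = [mutual_information(X[:, j], y) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]      # maximum-relevance start
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean(
                [mutual_information(X[:, j], X[:, s]) for s in selected]
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return [names[j] for j in selected]

# Synthetic data: a near-duplicate feature and an independent informative one.
rng = np.random.default_rng(0)
n = 2000
pm25_lag1 = rng.normal(size=n)
pm25_lag1_copy = pm25_lag1 + 0.05 * rng.normal(size=n)
wind_gust = rng.normal(size=n)
target = pm25_lag1 - 0.8 * wind_gust + 0.1 * rng.normal(size=n)

X = np.column_stack([pm25_lag1, pm25_lag1_copy, wind_gust])
names = ["pm25_lag1", "pm25_lag1_copy", "wind_gust"]
chosen = mrmr(X, target, names, k=2)
```

The near-duplicate feature is highly relevant on its own, but its redundancy with the first pick is large, so the second pick goes to the independent feature.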
5.2 Pipeline and performance criteria
Time series prediction is crucial for estimating future information using past and present data. This study employs a “next-hour prediction” methodology based on supervised learning principles. The initial dataset is divided into historical datasets (t = −24, t = −23, …, t = 0) and a dataset for the following hour (t = 1). These datasets are then combined to create a comprehensive time-series dataset for forecasting future values (t = 1, t = 2, …, t = n).
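The windowing described above can be illustrated with a univariate sketch. It assumes a history of 25 hourly values (t = −24 to t = 0) and a target one step ahead; the actual study uses several variables per time step, and the function name `make_supervised` is hypothetical.

```python
import numpy as np

def make_supervised(series, window=25, horizon=1):
    """Turn a 1-D series into (history window, future value) pairs.

    Each row of X holds `window` consecutive values ending at time t
    (t-24 ... t for window=25); y holds the value at t + horizon.
    """
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for t in range(window - 1, len(series) - horizon):
        X.append(series[t - window + 1 : t + 1])
        y.append(series[t + horizon])
    return np.array(X), np.array(y)

# Hypothetical series of 30 hourly values (0, 1, ..., 29) for easy inspection.
series = np.arange(30.0)
X, y = make_supervised(series, window=25, horizon=1)
```

Setting `horizon` to 2 or 3 yields the targets for the longer forecasting horizons with the same history window.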
In constructing the machine learning models, the dataset was first organized and then split into two subsets: 70% for training and 30% for evaluation. For LightGBM, data normalization was unnecessary, but preprocessing, including normalization, was essential for other machine learning approaches like Support Vector Regression. The normalization formula used is as follows:
$$X_{\text{norm}} = \frac{X_i - X_{i,\min}}{X_{i,\max} - X_{i,\min}} \qquad (1)$$
where Xnorm, Xi, Xi,min, and Xi,max represent the normalized value, the actual value, the minimum value, and the maximum value, respectively.
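A minimal sketch of the 70/30 split followed by min-max normalization is given below. Two choices here are assumptions rather than details reported in the paper: the split is chronological (no shuffling), and the minimum and maximum are estimated on the training portion only and then reused for the test portion. The function names and data are hypothetical.

```python
import numpy as np

def chronological_split(X, train_fraction=0.7):
    """Split rows in time order: first 70% for training, the rest for testing."""
    cut = int(len(X) * train_fraction)
    return X[:cut], X[cut:]

def fit_min_max(X_train):
    """Per-feature minimum and maximum, estimated on training data only."""
    return X_train.min(axis=0), X_train.max(axis=0)

def apply_min_max(X, x_min, x_max):
    """Apply Eq. (1); constant features are left at 0 to avoid division by zero."""
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span

# Hypothetical feature matrix with two columns on different scales.
X = np.column_stack([np.arange(10.0), np.arange(10.0) * 10.0 + 5.0])
X_train, X_test = chronological_split(X)
x_min, x_max = fit_min_max(X_train)
X_train_scaled = apply_min_max(X_train, x_min, x_max)
X_test_scaled = apply_min_max(X_test, x_min, x_max)
```

With this design, test values outside the training range map outside [0, 1], which is expected and avoids leaking test-set statistics into training.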
The optimal hyperparameters for the SVR, XGBoost, and LightGBM models were determined using Bayesian Optimization. After running the algorithm for 100 iterations to obtain the hyperparameter values, a final model was trained and tested.
5.2.1 Evaluation metrics.
The performance of the proposed model is evaluated using several metrics to provide a comprehensive assessment. The primary metrics used are RMSE (Root Mean Square Error), MAE (Mean Absolute Error), and R2 (coefficient of determination).
5.2.1.1 Root Mean Square Error (RMSE). RMSE measures the square root of the average squared differences between observed and predicted values, making it sensitive to outliers.
5.2.1.2 Mean Absolute Error (MAE). MAE measures the average absolute differences between observed and predicted values.
5.2.1.3 Coefficient of Determination (R2). R2 indicates the proportion of variance in the dependent variable predictable from the independent variables.
5.2.1.4 Additional metrics. To ensure robust evaluation, additional metrics are considered:
- Mean Squared Error (MSE): Measures the average of the squared differences.
- Explained Variance Score: Measures the proportion of variance explained by the model.
- Median Absolute Error (MedAE): Measures the median of the absolute errors.
Using multiple evaluation metrics allows for a comprehensive assessment of model performance.
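For reference, the six metrics can be computed from observed and predicted values as in the sketch below. The definitions are the standard ones; the function name `regression_metrics` and the sample values are hypothetical and unrelated to the study's results.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Standard regression metrics from observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        "MedAE": np.median(np.abs(err)),
        "R2": 1.0 - np.sum(err ** 2) / ss_tot,
        # Like R2, but insensitive to a constant bias in the predictions.
        "ExplainedVariance": 1.0 - np.var(err) / np.var(y_true),
    }

# Hypothetical observed and predicted PM2.5 values.
metrics = regression_metrics([10.0, 12.0, 14.0, 16.0], [11.0, 12.0, 13.0, 18.0])
```

R2 and the Explained Variance Score coincide only when the mean prediction error is zero, so a gap between them points to systematic over- or under-prediction.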
6 Results and discussion
The effectiveness of a machine learning model heavily depends on the selection of hyperparameters. In this study, Bayesian optimization combined with k-fold cross-validation was employed to optimize these hyperparameters. Specifically, the dataset was partitioned into k = 5 subsets for cross-validation, enhancing the model’s ability to generalize by training on different subsets and testing on the remaining ones.
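The cross-validated objective that a hyperparameter optimizer minimizes can be sketched as follows. This example is deliberately simplified and rests on several assumptions: a ridge regression with regularization strength `alpha` stands in for the actual models, a small fixed list of candidates stands in for the points a Bayesian optimizer would propose, and the folds are contiguous blocks. None of these details are taken from the paper.

```python
import numpy as np

def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index arrays for k contiguous folds."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

def cv_rmse(X, y, alpha, k=5):
    """Mean validation RMSE of a ridge model (stand-in) over k folds."""
    scores = []
    for train, val in k_fold_indices(len(y), k):
        A = X[train]
        # Closed-form ridge solution: (A^T A + alpha I)^-1 A^T y
        w = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y[train])
        scores.append(np.sqrt(np.mean((X[val] @ w - y[val]) ** 2)))
    return float(np.mean(scores))

# Synthetic regression problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# A Bayesian optimizer would choose these candidates adaptively.
candidates = [0.01, 1.0, 1000.0]
scores = {alpha: cv_rmse(X, y, alpha) for alpha in candidates}
best_alpha = min(scores, key=scores.get)
```

Averaging the error over five held-out folds gives each candidate setting a score that does not depend on a single lucky or unlucky split.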
Our experiment aimed to assess the impact of historical PM2.5 data on forecasting future concentrations. Three scenarios were evaluated: (1) forecasting without feature selection, (2) with feature selection using the mRMR algorithm, and (3) with feature selection using the LightGBM-RFE algorithm. Various regression models, including Linear Regression, Random Forest, XGBoost, Gradient Boosting, and SVR, were examined. Performance metrics such as R2, RMSE, MAE, MSE, MedAE, and Explained Variance Score were used for evaluation.
To illustrate the different sets of features used for PM2.5 prediction in the ZI dataset, we have summarized the subsets selected by the All, mRMR, and LightGBM-RFE feature selection methods in Table 2.
6.1 Primary metrics analysis
Table 3 shows the performance of the models using primary metrics (R2, RMSE, and MAE) across three forecasting horizons (1H, 2H, and 3H) and under different feature selection methods (ALL, mRMR, LightGBM-RFE).
For the 1-hour forecast, the LightGBM model exhibited superior performance with an R2 of 0.747, RMSE of 4.559, and MAE of 3.236 when using the LightGBM-RFE method. This strong alignment with observed data underscores its reliability for short-term predictions. The Random Forest and XGBoost models also performed well, particularly with the LightGBM-RFE method, achieving R2 values above 0.72.
As the forecasting horizon extended to 2 hours, a slight decline in performance was observed, which is expected due to the increased complexity of predictions. Nevertheless, the LightGBM model remained the top performer with an R2 of 0.678, RMSE of 5.237, and MAE of 3.780. Random Forest and XGBoost maintained competitive performance levels, reflecting their robustness.
For the 3-hour forecast, the LightGBM model continued to lead with an R2 of 0.644, RMSE of 5.522, and MAE of 3.995. Although accuracy decreased with the longer forecast horizon, LightGBM’s predictive power remained evident compared to other models.
Linear Regression consistently showed the lowest performance across all horizons and feature selection methods, with R2 values below 0.61 for the 3-hour forecast, indicating its limitations for complex air quality prediction tasks.
These results highlight the significant impact of feature selection on model performance, with the LightGBM-RFE method consistently enhancing predictive accuracy across all models and forecasting horizons.
6.2 Additional metrics analysis
Table 4 provides additional performance metrics (MSE, MedAE, and Explained Variance) for the same models and scenarios. These metrics offer deeper insights into error distribution and variance explanation.
For the 1-hour forecast, the LightGBM model with LightGBM-RFE achieved the lowest MSE (22.19), MedAE (2.29), and the highest Explained Variance (0.74), reaffirming its strong performance.
In the 2-hour forecast, LightGBM maintained its leading position with an MSE of 27.35, MedAE of 2.83, and Explained Variance of 0.68, demonstrating robustness as the prediction horizon extends.
For the 3-hour forecast, LightGBM continued to perform well with an MSE of 31.28, MedAE of 3.16, and Explained Variance of 0.63, indicating its consistent ability to handle longer-term forecasts.
The additional metrics support the primary metrics findings, showing that LightGBM, particularly with the LightGBM-RFE feature selection method, consistently delivers high accuracy and reliability in PM2.5 forecasting.
Figs 11 and 12 further illustrate the effectiveness of the LightGBM model. The first figure shows the observed and predicted PM2.5 concentrations, demonstrating the model’s capability to closely follow actual trends. The second figure presents the fit curve of the PM2.5 real values and the predicted values using the LightGBM model, indicating a good degree of alignment and minimal deviations.
6.3 Explainable AI with SHAP
The SHAP (SHapley Additive exPlanations) method was employed to convert the typically opaque machine learning model into an interpretable model that illustrates the influence of each feature on the prediction of PM2.5 levels. This innovative approach, based on the Shapley values concept from cooperative game theory as proposed by Lloyd Shapley [60], has been adapted to machine learning by Lundberg and Lee [61] to offer a unified framework for model interpretation. The scores are calculated based on the contributions of individual features, offering insights into how each feature affects the model’s output. The utilization of SHAP brings several notable benefits, including enhanced transparency, deeper insights into the model’s decision-making process, and more informed and accurate interpretations of predictive outcomes.
The key benefits of employing the SHAP method in our study are:
- Investigating the relationship between particular characteristics and forecasting, enhancing our understanding of how environmental and anthropogenic factors influence PM2.5 levels.
- Analyzing factors influencing predictions to obtain more comprehensive and nuanced understandings, facilitating the identification of significant predictors and their interactions within the model.
- Unraveling the intricacies of the machine learning opaque system, making the decision-making process of complex models transparent and comprehensible for both researchers and practitioners.
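To illustrate what a Shapley value measures, the sketch below computes exact Shapley values by brute force for a toy three-feature model. It is not the procedure used by SHAP for tree models, which relies on a much faster tree-specific algorithm. The toy linear model, the feature order (PM2.5 at t − 1, PM2.5 at t − 2, wind gust), the baseline, and all numbers are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values; absent features are set to their baseline value."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_model(z):
    # Hypothetical model: higher past PM2.5 raises, wind gust lowers the forecast.
    return 0.6 * z[0] + 0.3 * z[1] - 0.5 * z[2]

x = [30.0, 28.0, 10.0]          # instance to explain
baseline = [20.0, 20.0, 5.0]    # reference input, e.g. feature averages
phi = shapley_values(toy_model, x, baseline)
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the additivity property that lets SHAP values be read as a per-feature decomposition of a forecast.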
The proposed approach involved the integration of LightGBM with SHAP. As depicted in Fig 13, the color red represents the maximum value for each selected feature on its unit scale, while the color blue indicates the minimum value.
Upon analysis of Fig 13, it is observed that the curves representing the genuine values and the predictions produced by the LightGBM model demonstrate a comparable pattern and display a good degree of alignment. This finding suggests that the model put forward in this study effectively encompasses the temporal and spatial fluctuations of PM2.5, facilitating reasonably precise forecasts of PM2.5 concentrations.
Feature significance plots were constructed for the LightGBM model using the SHAP values. The most significant features, arranged in descending order based on their respective impacts, are presented in Fig 14. The features that exhibited the greatest influence included PM2.5 at time t − 1, PM2.5 at time t − 2, CO2 at time t − 1, and PM2.5 at time t − 3.
To enhance the understanding of the effects of feature importance on the model’s output, we have included a SHAP summary plot in Fig 13. Features whose SHAP values have a larger magnitude have a greater impact on the predictions generated by the LightGBM model. A positive SHAP value signifies an increase in the prediction, whereas a negative SHAP value signifies a reduction; the dot color encodes the feature value, with red for high values and blue for low values.
The variable PM2.5(t − 1) denotes the PM2.5 concentration observed one hour before the forecast time. Most of the blue sample points lie in the left half of the plot, while most of the purple sample points lie in the right half. This implies that the feature reduces the predicted concentration when the PM2.5 concentration one hour prior is lower, and increases it when the concentration one hour prior is higher. The feature PM2.5(t − 2) behaves similarly to PM2.5(t − 1), but with a smaller magnitude: the peak importance of PM2.5(t − 1) is higher.
The wind gust feature behaves differently from PM2.5(t − 1) and PM2.5(t − 2). Most of the blue sample points in this feature row lie on the right side of the plot, while the purple sample points are concentrated on the left. This implies that low wind gust values increase the predicted PM2.5 concentration, whereas high wind gust values decrease it.
6.4 Comparison, analysis, and implications of results
The Light Gradient Boosting Regressor (LightGBM) consistently outperforms the other models across all forecasting horizons, attaining the best scores in R2, RMSE, MAE, MSE, MedAE, and Explained Variance Score. For instance, the 1-hour forecast achieved an R2 of 0.747, signifying strong alignment with the observed data. The test RMSE of 4.559 and test MAE of 3.236 indicate small deviations from the actual values.
The Gradient Boosting Regressor and Support Vector Regression (SVR) models also demonstrated robust performance across all measures. These models consistently demonstrated reliable performance across various time intervals, indicating their potential for accurate air quality predictions. In contrast, Linear Regression regularly demonstrated inferior performance compared to the other models, with lower scores across all evaluation metrics, indicating a comparatively poorer alignment and increased prediction errors.
The selection of the optimal model ultimately depends on the specific demands and priorities of the analysis. Nevertheless, the results clearly indicate that the Light Gradient Boosting Regressor is a highly favorable option due to its consistently robust performance across all time intervals. This underscores the importance of model choice and feature selection in enhancing predictive accuracy and reliability.
The implications of these results highlight the importance of advanced machine learning techniques, particularly the LightGBM model with the LightGBM-RFE feature selection method, in accurately predicting PM2.5 concentrations. This capability is crucial for developing effective air quality management strategies in urban environments, enabling authorities to make data-driven decisions to mitigate pollution and protect public health.
7 Conclusion
This study introduces an innovative architectural framework for smart city applications, centered on the integration of Artificial Intelligence of Things (AIoT) to enhance sustainability and improve quality of life. The proposed architecture consists of three key levels:
- Data Collection: Sensors collect environmental data, which is transmitted to a Cloud Server (CS) via a wireless network and the MQTT (Message Queuing Telemetry Transport) protocol.
- Data Processing: Data is processed using a Node-Red infrastructure and stored in InfluxDB.
- Data Analytics: A centralized Data Analytics server employs machine learning algorithms for control and prediction.
We propose the use of a LightGBM model to predict PM2.5 levels, with data preprocessing steps including the elimination of redundant attributes, removal of outliers, and imputation of missing values. Feature selection techniques such as Minimum Redundancy Maximum Relevance (mRMR) and LightGBM Recursive Feature Elimination (LightGBM-RFE) were employed to identify the most significant features, streamlining the model for optimal performance. Bayesian Optimization and 5-fold cross-validation were used to enhance the efficiency of various machine learning models, including Random Forest (RF), Gradient Boosting Regression (GBR), XGBoost, Support Vector Regression (SVR), and LightGBM, resulting in high precision.
To provide insights into the model’s predictions, we employed the SHAP (Shapley Additive Explanations) method, which helps interpret the influence of individual features on PM2.5 predictions. Evaluation metrics such as R2, RMSE, and MAE were used to comprehensively assess the predictive performance of the models.
The novelty of this work lies in the integration of low-cost IoT sensors with advanced machine learning techniques to create a scalable and cost-effective air quality monitoring and prediction system. By combining feature selection methods like mRMR and LightGBM-RFE with Bayesian optimization, our approach achieves high accuracy in PM2.5 predictions. This study is among the first to implement such a comprehensive system in a resource-limited setting like Morocco, offering a valuable framework for other developing regions facing similar challenges.
Future research will focus on integrating deep learning methodologies with machine learning algorithms to analyze a larger, more diverse air quality dataset. This expanded dataset will incorporate innovative features, further enhancing our understanding and predictive capabilities for air quality management.
References
- 1. United Nations. (2018). World Urbanization Prospects 2018 [Report]. Retrieved June 22, 2023, from https://www.un.org/en/desa/2018-revision-world-urbanization-prospects.
- 2. World Economic Forum. (2015). Global Risks 2015 [Report]. Retrieved June 22, 2023, from https://www.weforum.org/reports/global-risks-2015.
- 3. World Health Organization. (2018). 9 out of 10 people worldwide breathe polluted air, but more countries are taking action [Report]. Retrieved June 22, 2023, from https://www.who.int/news/item/02-05-2018-9-out-of-10-people-worldwide-breathe-polluted-air-but-more-countries-are-taking-action.
- 4. Khafaie M. A., Yajnik C. S., Salvi S. S., & Ojha A. (2016). Critical review of air pollution health effects with special concern on respiratory health. Journal of air pollution and health, 1(2), 123–136.
- 5. Lelieveld J., Evans J., Fnais M., et al. “The contribution of outdoor air pollution sources to premature mortality on a global scale”. Nature, vol. 525, pp. 367–371, 2015. pmid:26381985
- 6. Dhir B., “Air Pollutants and Photosynthetic Efficiency of Plants”. In: Kulshrestha U., Saxena P. (eds) Plant Responses to Air Pollution. Springer, Singapore, 2016. https://doi.org/10.1007/978-981-10-1201-3_7.
- 7. Bowe B., Xie Y., Yan Y., and Al-Aly Z., “Burden of Cause-Specific Mortality Associated With PM2.5 Air Pollution in the United States”. JAMA network open, vol. 2, no. 11, e1915834, 2019. pmid:31747037
- 8. Xing Y. F., Xu Y. H., Shi M. H., and Lian Y. X., “The impact of PM2.5 on the human respiratory system”. Journal of thoracic disease, vol. 8, no. 1, E69–E74, 2016. pmid:26904255
- 9. Liang R., Zhang B., Zhao X., Ruan Y., Lian H., and Fan Z., “Effect of exposure to PM2.5 on blood pressure: a systematic review and meta-analysis”. Journal of hypertension, vol. 32, no. 11, pp. 2130–2141, 2014. pmid:25250520
- 10. Nor N.S.M., Yip C.W., Ibrahim N., et al., “Particulate matter (PM2.5) as a potential SARS-CoV-2 carrier”. Scientific Reports, vol. 11, 2508, 2021. pmid:33510270
- 11. The World Air Quality Project 2008–2023. “Air Quality Product Listing”, aqicn. Retrieved June 22, 2023: https://aqicn.org/products/monitoring-stations/.
- 12. Apte J. S., Messier K. P., Gani S., Brauer M., Kirchstetter T. W., Lunden M. M., et al., “High-Resolution Air Pollution Mapping with Google Street View Cars: Exploiting Big Data”. Environmental science & technology, vol. 51, no. 12, pp. 6999–7008, 2017. https://doi.org/10.1021/acs.est.7b00891. pmid:28578585
- 13. Ministry of Environment and Sustainable Development. (n.d.). Air Quality Monitoring. Environnement.gov.ma. Retrieved June 22, 2023 https://www.environnement.gov.ma/en/air/118-theme/air/209-air-quality-monitoring
- 14. Popoola O. A. M., Carruthers D. J., Lad C. S., Bright V., Mead M. I., Stettler M., et al., “Use of networks of low cost air quality sensors to quantify air quality in urban settings”, Atmospheric Environment, 2018. https://api.semanticscholar.org/CorpusID:105724438.
- 15. R. Williams, Vasu Kilaru, E. Snyder, A. Kaufman, T. Dye, A. Rutter, et al., “Air Sensor Guidebook”. U.S. Environmental Protection Agency, Washington, DC, EPA/600/R-14/159 (NTIS PB2015-100610), 2014.
- 16. Kisi O., Mohammad Azamathulla H., Cevat F., Kulls C., Kuhdaragh M., and Fuladipanah M., “Enhancing river flow predictions: Comparative analysis of machine learning approaches in modeling stage-discharge relationship,” Results in Engineering, vol. 22, pp. 102017, 2024.
- 17. Mampitiya L., Rathnayake N., Leon L. P., Mandala V., Azamathulla H. M., Shelton S., et al., “Machine Learning Techniques to Predict the Air Quality Using Meteorological Data in Two Urban Areas in Sri Lanka,” Environments, vol. 10, no. 141, pp. 1–15, 2023.
- 18. Rostami A., Raeini-Sarjaz M., Chabokpour J., Azamathulla H.M., and Kumar S., “Determination of rainfed wheat agriculture potential through assimilation of remote sensing data with SWAT model case study: ZarrinehRoud Basin, Iran,” Water Supply, vol. 22, no. 5, pp. 5331–5354, May 2022.
- 19. Bekkar A., Hssina B., Douzi S., and Douzi K., “Air-pollution prediction in smart city, deep learning approach”. Journal of big data, vol. 8, no. 1, 161, 2021. pmid:34956819
- 20. Snyder E. G., Watkins T. H., Solomon P. A., et al., “The changing paradigm of air pollution monitoring”, Environmental Science & Technology, vol. 47, no. 20, pp. 11369–11377, 2013. pmid:23980922
- 21. Castell N., Dauge F. R., Schneider M., et al., “Can commercial low-cost sensor platforms contribute to air quality monitoring and exposure estimates?” Environment International, vol. 99, pp. 293–302, 2017. pmid:28038970
- 22. AQICN, “The world air quality project”, 2021. [Online]. Available: https://aqicn.org/.
- 23. Kumar P., Morawska L., et al., “Rise of low-cost sensing for managing air pollution in cities”, Environment International, vol. 75, pp. 199–205, 2015. pmid:25483836
- 24. Mead M. I., Popoola O., et al., “The use of electrochemical sensors for monitoring urban air quality in low-cost, high-density networks”, Atmospheric Environment, vol. 70, pp. 186–203, 2013.
- 25. Gupta P., Mandariya A. K., et al., “Smart cities and air quality monitoring: Case studies from India”, Smart Cities and Urban Development Journal, vol. 10, no. 1, pp. 45–56, 2020.
- 26. Rai A. C., et al., “End-user perspective of low-cost sensors for outdoor air pollution monitoring”, Science of The Total Environment, vol. 607-608, pp. 691–705, 2017. pmid:28709103
- 27. Chen L., Guo W., Wang H., “A support vector regression model for predicting PM2.5 levels in Beijing, China”, Journal of Environmental Management, vol. 181, pp. 94–102, 2016.
- 28. Jiang X., Zhang J., Li M., et al., “Predicting air quality in Chinese cities using Random Forest”, Science of the Total Environment, vol. 579, pp. 148–157, 2017.
- 29. Li X., Liu Y., Maiheu M., et al., “Application of a gradient boosting decision tree for PM2.5 prediction”, Atmospheric Pollution Research, vol. 8, no. 5, pp. 967–973, 2017.
- 30. Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, et al, “LightGBM: A Highly Efficient Gradient Boosting Decision Tree”, in Neural Information Processing Systems, 2017. https://api.semanticscholar.org/CorpusID:3815895.
- 31. Zhang Y., Liu B., Wang X., et al., “Comparative study of ensemble learning approaches in predicting PM2.5 concentration”, Environmental Science and Pollution Research, vol. 25, no. 30, pp. 29779–29789, 2018.
- 32. Bounakhla Y., Benchrif A., Costabile F., Tahri M., El Gourch B., El Hassan K., et al. (2023). Overview of PM10, PM2.5 and BC and Their Dependent Relationships with Meteorological Variables in an Urban Area in Northwestern Morocco. Atmosphere, 14(1), 162.
- 33. Sbai S., Bentayeb F., Yin H. (2021). Atmospheric Pollutants Response to the Emission Reduction and Meteorology During the COVID-19 Lockdown in the North of Africa (Morocco). Stochastic Environmental Research and Risk Assessment, 35(12), 2183–2199. https://doi.org/10.1007/s00477-022-02224-z.
- 34. Fahim M., El Mhouti A., Boudaa T., Jakimi A. (2023). Modeling and Implementation of a Low-Cost IoT-Smart Weather Monitoring Station and Air Quality Assessment Based on Fuzzy Inference Model and MQTT Protocol. Environmental Science and Pollution Research, 30(12), 18563–18577. pmid:36776786
- 35. Bouchriti Y., Ait Haddou M., Kabbachi B. (2022). Ambient Air Quality and Health Impact of Exposure to Outdoor Air Pollution in the Moroccan Population: A Systematic Review. Pollution, 8(4), 1055–1069. https://doi.org/10.22059/poll.2022.348613.1626.
- 36. Gryech I., Ben-Aboud Y., Guermah B., Sbihi N., Ghogho M., Kobbane A. (2020). MoreAir: A Low-Cost Urban Air Pollution Monitoring System. Sensors, 20(4), 998. pmid:32069821
- 37. Guermah B., Sbihi N., Kobbane A., Ghogho M. (2022). A GIS-Based Real-Time Air Quality Monitoring System for Urban Areas: A Case Study of Morocco. International Journal of Environmental Research, 16(2), 252–267. https://doi.org/10.1007/s40808-022-01452-0.
- 38. Gryech I., Ghogho M., Mahraoui C., Kobbane A. (2022). An Exploration of Features Impacting Respiratory Diseases in Urban Areas. International Journal of Environmental Research and Public Health, 19(5), 3095. pmid:35270785
- 39. Abekiri N., Rachdy A., Ajaamoum M., Nassiri B., Elmahni L., and Oubail Y., “Platform for hands-on remote labs based on the ESP32 and NOD-red”, Scientific African, vol. 19, e01502, 2023. pmid:36531209
- 40. “DHT22 Digital Temperature and Humidity Sensor Datasheet”, SparkFun Electronics, [Online]. Available: https://www.sparkfun.com/datasheets/Sensors/Temperature/DHT22.pdf. [Accessed: 10-Mar-2024].
- 41. Plantower, “PMS5003 Particulate Matter Sensor Datasheet”, [Online]. Available: https://cdn-shop.adafruit.com/product-files/3686/plantower-pms5003-manual_v2-3.pdf. [Accessed: 10-Mar-2024].
- 42. “SNS-MQ135 Gas Sensor Datasheet”, Olimex, [Online]. Available: https://www.olimex.com/Products/Components/Sensors/Gas/SNS-MQ135/resources/SNS-MQ135.pdf. [Accessed: 10-Mar-2024].
- 43. Light R. A., “Mosquitto: server and client implementation of the MQTT protocol”, Journal of Open Source Software, vol. 2, no. 13, 265, 2017.
- 44. Shinde Shubhangi A., Nimkar Pooja A., Singh Shubhangi P., Salpe Vrushali D., and Jadhav Yogesh R., “MQTT-Message Queuing Telemetry Transport protocol”, International Journal of Research, vol. 3, no. 3, pp. 240–244, 2016.
- 45. Node-RED, “Node-RED [Website]”, Retrieved June 22, 2023, from https://nodered.org/
- 46. J. Shahid, InfluxDB Documentation, 2022.
- 47. T. Ödegaard, “Grafana”, Grafana Labs, 2014. Grafana [Website]. Retrieved June 22, 2023, from https://grafana.com/
- 48. IZ Ait Melloul, “Case Study Summary”, [PDF file], Retrieved June 22, 2023, from https://www.climate-expert.org/fileadmin/user_upload/Case_Study_Summary_IZ_Ait_Melloul_EN.pdf
- 49. Visual Crossing. (n.d.). Visual Crossing [Website]. Retrieved June 22, 2023, from https://www.visualcrossing.com/
- 50. Wang J., Ogawa S. (2015). Effects of Meteorological Conditions on PM2.5 Concentrations in Nagasaki, Japan. International Journal of Environmental Research and Public Health, 12(8), 9089–9101. pmid:26247953
- 51. Zou Z., Cheng C., Shen S. (2021). The Complex Nonlinear Causal Coupling Patterns between PM2.5 and Meteorological Factors in Tibetan Plateau: A Case Study in Xining. Research Square, June 2021. https://doi.org/10.21203/rs.3.rs-634756/v1.
- 52. Aihua Li, Mengyan Feng, Yanruyu Li, and Zhidong Liu, “Application of Outlier Mining in Insider Identification Based on Boxplot Method”, Procedia Computer Science, vol. 91, pp. 245-251, 2016, note: Promoting Business Analytics and Quantitative Management of Technology: 4th International Conference on Information Technology and Quantitative Management (ITQM 2016). https://doi.org/10.1016/j.procs.2016.07.069.
- 53. M. Peña, P. Ortega and M. Orellana, “A novel imputation method for missing values in air pollutant time series data”, in 2019 IEEE Latin American Conference on Computational Intelligence (LA-CCI), Guayaquil, Ecuador, 2019, pp. 1-6.
- 54. I Nyoman Kusuma Wardana, Julian William Gardner, and Suhaib A. Fahmy, “Estimation of missing air pollutant data using a spatiotemporal convolutional autoencoder”, Neural Computing and Applications, vol. 34, pp. 16129-16154, 2022. https://api.semanticscholar.org/CorpusID:248657032.
- 55. Lin W.C. and Tsai C.F., “Missing value imputation: a review and analysis of the literature (2006–2017)”, Artificial Intelligence Review, vol. 53, pp. 1487–1509, 2020.
- 56. Moshenberg S., Lerner U., and Fishbain B., “Spectral methods for imputation of missing air quality data”, Environmental Systems Research, vol. 4, 26, 2015.
- 57. A. Bekkar, B. Hssina, S. Douzi, and K. Douzi, “Air Quality Forecasting using decision trees algorithms”, in 2022 2nd International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), 2022, pp. 1-4. https://api.semanticscholar.org/CorpusID:247682185.
- 58. Guyon I., Weston J., Barnhill S., et al., “Gene Selection for Cancer Classification using Support Vector Machines”, Machine Learning, vol. 46, pp. 389–422, 2002.
- 59. Brown Gavin, Pocock Adam, Zhao Ming-Jie, and Mikel Luján, “Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection”, Journal of Machine Learning Research, vol. 13, no. 2, pp. 27–66, 2012. http://jmlr.org/papers/v13/brown12a.html.
- 60. Shapley Lloyd S., “A Value for n-person Games”, in Contributions to the Theory of Games (AM-28), Volume II, pp. 307–317, Princeton University Press, 1953.
- 61. Scott M. Lundberg and Su-In Lee, “A Unified Approach to Interpreting Model Predictions”, in Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017).