Bidirectional convolutional LSTM for the prediction of nitrogen dioxide in the city of Madrid

Nitrogen dioxide is one of the pollutants with the most significant health effects. Advance information on its concentration in the air can help to monitor and control further consequences more effectively, while also making it easier to apply preventive and mitigating measures. Machine learning technologies, combined with the geospatial dimension, can perform predictive analyses with higher accuracy and, as a result, can serve as a supportive tool for productive management. One of the most advanced machine learning algorithms, the Bidirectional convolutional LSTM, is used in the present work to predict the concentration of nitrogen dioxide. The model has been validated to perform more accurate spatiotemporal analysis based on the integration of temporal and geospatial factors. The analysis was carried out according to two scenarios developed on the basis of selected features, using data from the city of Madrid for the periods January-June 2019 and January-June 2020. The model's performance was evaluated using the Root Mean Square Error and the Mean Absolute Error, which emphasise the superiority of the proposed model over the reference models. In addition, the significance of a feature selection technique in improving accuracy was underlined. In terms of execution time, due to the complexity of the Bidirectional convolutional LSTM architecture, convergence and generalisation took longer, giving the reference models an advantage in that respect.


Introduction
The increase in the level of urbanisation, in addition to positive consequences, also causes problems associated with environmental changes, one of which is the deterioration of air quality [1,2]. According to the observations of the World Health Organisation (WHO), seven million deaths due to short-term and long-term exposure to air pollutants are recorded every year [3]. Regarding Spain, studies show that over 93,000 people have died there due to air pollution in recent decades [4]. The WHO has identified the most dangerous pollutants and established guidelines with specific thresholds for each of them, including particulate matter (PM), ozone (O 3 ), nitrogen dioxide (NO 2 ) and sulphur dioxide (SO 2 ) [5,6]. The prediction of one of these pollutants, NO 2 , is the main focus of the current work. A wide range of machine learning models has been deployed to predict air quality. The choice of model can be adjusted depending on the stated problem to be solved, for example, the predicted pollutant or the study region's peculiarities. Below are a few examples of research related to the subject area, extracted from the following works [14,32]. For example, Xu et al. [33] employed Extreme Gradient Boosting (XGBoost) integrated with the Shapley additive explanation technique to forecast ultrafine particle concentrations. Another work implementing XGBoost was developed by Ma et al. [34] to predict PM 2.5 in Shanghai. Leong et al. [35] applied a Support Vector Machine to predict the air pollution index. Lasisi et al. [36] proposed Fuzzy Rough Set and Artificial Immune System algorithms to predict air quality. Many of these studies have confirmed the effectiveness of the implemented algorithms; Table 1 lists the algorithms and evaluation metrics extracted from the publications focused on the prediction of NO 2 .

Work   ML Algorithm                          Evaluation Metric
[15]   BRT, SVM, XGBoost, RF, GAM, Cubist    RMSE, ME, NRMSE, NME, POD, POF, R 2

The authors of [17] predicted air pollutants for the next day using meteorological and air pollutant concentration data of Macau. Zhai and Cheng performed a one-day forecast implementing LSTM on air quality, meteorological and social media data [19]. Another work by Yang et al. proposed hybrid Convolutional Neural Network (CNN)-LSTM and CNN-Gated Recurrent Unit (GRU) models to predict PM 10 and PM 2.5 for the next seven days in Seoul using air pollution and meteorological data [37]. Heydari et al. [38] developed a hybrid model based on a combination of LSTM and the multi-verse optimization algorithm to predict the air pollution obtained from Combined Cycle Power Plants (Kerman, Iran). In addition to forecasting along the time axis, it is also important to consider the spatial dimension and identify the air quality value in places where there are no stations. Several authors have focused on the spatial factor in their studies. Danesh Yazdi et al. [39] proposed ensemble machine learning based on a Random Forest (RF), a Gradient Boosting Machine (GBM) and a k-Nearest Neighbor (KNN) to predict PM 2.5 using air quality, satellite aerosol optical depth, land use and meteorological data. Li et al. [23] suggested a Kruskal-K-means clustering method to predict NO 2 and NO x . Just et al. [40] applied XGBoost to predict PM 2.5 using satellite-derived aerosol optical depth integrated with a recursive feature selection technique. Zou et al. [41] applied a spatiotemporal attention based LSTM on the Beijing dataset. Ma et al. implemented a Bidirectional LSTM (BLSTM) network with Inverse Distance Weighting to predict PM 2.5 concentration in Guangdong, China [42]. Ma et al. [43] presented a Transfer Learning-based Stacked Bidirectional Long Short Term Memory network to predict air quality in Anhui, China. Le et al. [44] implemented Convolutional LSTM (ConvLSTM) to interpolate and predict PM 2.5 in the city of Seoul. Also, Alléon et al. [45], and Liu and Shuo [46] applied ConvLSTM for forecasting air quality. Phruksahiran implemented the geographically weighted predictor method to predict the air quality index in Bangkok, Thailand [47].

Study area and data description
The study area considered in this work is the city of Madrid (Fig 1). It has an area of about 604.31 km 2 , and it is the second largest city in the European Union in terms of population (3,305,408 [48]). According to the study by Khomenko et al. [49] on premature mortality due to air pollution in European cities, in which the pollutants PM 2.5 and NO 2 were considered, Madrid was found to be at the top of the ranking of European cities with the highest NO 2 mortality burden. Given the importance of NO 2 for Madrid, it was selected as the air pollutant for predictive analysis.
In a study carried out by Cuevas et al. [51] the authors observed the temporal evolution of NO 2 in five Spanish cities, including Madrid, over the period 1996-2012. Applying the shift trend model to NO 2 data, they found that NO 2 levels in the Madrid area had dropped by about 53%. A comparison of average annual values obtained from air quality monitoring stations showed that the decline in Madrid is 37%. This decline is associated with the implementation of environmental policies and technologies, as well as with the consequences of the global economic crisis. The study shows that in the pre-recession period the annual decline was 1.1%, and 7.8% during the economic recession. Therefore, it can be seen that economic and industrial factors significantly affect NO 2 emissions. According to the work by Izquierdo et al. [52], the implementation of the Madrid City air-quality plan would lead to an annual mean decrease in NO 2 by 4.0 μg/m 3 in 2020.
While the implementation of control policies and strategies has a positive impact on reducing air pollution, the problem nevertheless still remains the focus of attention. New technologies can help make better and more efficient decisions. Following the aforementioned belief, this work focuses on NO 2 prediction in the city of Madrid using machine learning technologies.
According to the following study [14], the publications related to the prediction of air quality using machine learning technologies used more than 26 datasets to supplement air quality data (meteorological, spatial, traffic, social media, etc.). The datasets used in this work are NO 2 data (μg/m 3 ), meteorological data and traffic data from January to June 2019 and from January to June 2020, and the location of the monitoring stations. The data were obtained from Open Data portal of the Madrid City Council [53]. There are 24 air quality control stations, 26 meteorological control stations and more than 4000 traffic measurement points (shapefiles of measurement point locations are also provided for each month). The meteorological data include ultraviolet radiation (Mw/m 2 ), wind speed (m/s), wind direction, temperature (˚C), relative humidity (%), barometric pressure (mb), solar irradiance (W/m 2 ) and precipitation (l/m 2 ),

PLOS ONE
while the traffic data include intensity, occupancy time, load and average traffic speed. The datasets have an hourly rate. Since the attributes of the traffic data can be specific to a certain area, the selected traffic attributes and their definitions for the city of Madrid are as follows:
- Intensity: intensity at the measurement point over a period of 15 minutes (vehicles/hour).
- Occupancy time: measurement point occupancy time over a period of 15 minutes (%).
- Load: vehicle loading over a 15-minute period; a parameter that takes into account the intensity, occupancy and capacity of the road and establishes the degree of road use from 0 to 100.
- Average traffic speed: average speed of the vehicles over a period of 15 minutes (km/h); available only for M30 intercity measuring points.
From the above definitions it can be seen that the traffic data are recorded every 15 minutes. However, since the NO 2 and meteorological data are at hourly rates, the traffic data were filtered and only hourly records were selected (for example, among entries at 13:00, 13:15, 13:30, 13:45 and 14:00, only the entries at 13:00 and 14:00 were kept, and the same logic was applied for the entire period). Table 2 shows summary statistics of each type of data; since the location of traffic measurement points changes monthly, summary statistics were calculated based on the part that was used in the analysis. The datasets and the code implemented are available at the following links [53-55].

Table 2. Summary statistics of the periods January-June 2019 and January-June 2020 for each data type (columns: Descriptors, January-June 2019, January-June 2020).
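The hourly filtering of the 15-minute traffic records can be sketched as follows. This is a minimal illustration, assuming timestamps are "YYYY-MM-DD HH:MM" strings; the actual field names and formats in the Madrid open data files differ.

```python
from datetime import datetime

def keep_hourly(records):
    """Keep only records whose timestamp falls exactly on the hour,
    aligning 15-minute traffic data with the hourly NO2 series."""
    hourly = []
    for ts, value in records:
        if datetime.strptime(ts, "%Y-%m-%d %H:%M").minute == 0:
            hourly.append((ts, value))
    return hourly

records = [("2019-01-05 13:00", 410), ("2019-01-05 13:15", 395),
           ("2019-01-05 13:30", 388), ("2019-01-05 13:45", 402),
           ("2019-01-05 14:00", 420)]
# only the 13:00 and 14:00 entries remain
```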
Considering the spatial factor in air quality prediction, the Pearson correlation coefficients between stations were calculated (Fig 2). It can be noticed that the stations are spatially correlated. Fig 3 shows the autocorrelation (or correlogram, the correlation between values of the same series at different time steps) and partial autocorrelation plots of NO 2 concentration; the daily interval is chosen as the lag length and the plots show the results for 80 lags. The difference between the two is that autocorrelation measures the correlation between the series and a lagged copy of itself, including the influence of the intervening observations (direct and indirect effects), whereas partial autocorrelation measures the correlation between two lags with the influence of the intervening observations removed (direct effects only). These functions help to determine the best lags to select for effective forecasting. It can be seen that in the autocorrelation plot more than 25 lags have a significant positive correlation, although the partial autocorrelation plot shows a statistically significant correlation only for lags 1 and 2. In this work, a 6-hour lag was chosen, which is within the range of significantly correlated lags.
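The quantity plotted in the correlogram can be computed directly. The paper presumably used a statistics library for Figs 2 and 3; the following is a standard-library sketch of the sample autocorrelation, with a toy series standing in for the NO 2 data.

```python
import math

def autocorr(series, lag):
    """Sample autocorrelation of a series at a given lag
    (the quantity plotted in a correlogram)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return cov / var

# toy hourly series with a strong daily (period-24) cycle
series = [50 + 20 * math.sin(2 * math.pi * t / 24) for t in range(24 * 30)]
# correlation is strong at lag 1 and again near lag 24, and negative
# at lag 12 (half a cycle)
```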
Another interesting observation can be seen in Fig 4, which shows the distribution of NO 2 concentration as boxplots (the values indicated are mean values corresponding to each boxplot). The concentration distribution can be explained by the traffic factor, which plays a decisive role in raising the level of NO 2 . This belief was also confirmed by the following study [56], which showed that in Madrid up to 90% of NO 2 comes from local traffic.

Method
The algorithm used in this work is the Bidirectional convolutional LSTM (BiConvLSTM). It is an advanced version of ConvLSTM in which hidden and cell states are kept for both forward and backward sequences. Fig 5 shows the architecture of (a) ConvLSTM and (b) BiConvLSTM cells. ConvLSTM was first used by Shi et al. [57], who showed that it is possible to preserve spatial information in an LSTM implementation by converting the internal matrix multiplications into convolution operations. This spatiotemporal factor, combined with bidirectionality, increases the ability to capture information in the temporal dimension. The hidden states from the forward and backward sequences are combined and then go through a convolution layer. There are several ways to execute the combination process (sum, average, multiplication or concatenation); this choice is a parameter that has to be defined during the tuning process (the parameter optimisation is presented in the next section).
Firstly, the ConvLSTM can be formulated with the following equations [57,58] (Eq (1)):

i t = σ(W xi * X t + W hi * h t−1 + W ci ∘ C t−1 + b i )
f t = σ(W xf * X t + W hf * h t−1 + W cf ∘ C t−1 + b f )
C t = f t ∘ C t−1 + i t ∘ tanh(W xc * X t + W hc * h t−1 + b c )
o t = σ(W xo * X t + W ho * h t−1 + W co ∘ C t + b o )
h t = o t ∘ tanh(C t )

where i t is the input gate, f t is the forget gate, and o t is the output gate (these gates control the flow of information through the cell), W are the weight matrices of the ConvLSTM cell, X t is the current input data, h t−1 is the previous hidden output, C t is the cell state, σ is the sigmoid function, "*" represents the convolution operation and "∘" represents the Hadamard product. It can be seen that ConvLSTM takes into account only information from past sequences; however, combining information from both forward and backward sequences may give better results. Below is the mathematical expression of BiConvLSTM [58] (Eq (2)):

Y t = tanh(W y f * H t f + W y b * H t b )

where H f is the hidden state from the forward ConvLSTM unit, H b is the hidden state from the backward ConvLSTM unit, W y f and W y b are their output weights, and Y t is the final output.

Experimental settings
This section includes a detailed description of the workflow. The main goal of the current work is to predict NO 2 in the next 6 hours over a given area, which was carried out based on the data on the previous 6 hours. The overall workflow of the analysis is presented in Fig 6. It can be seen that the workflow consists of the following steps: Data Generation, Feature Engineering, Model Development and Evaluation. In terms of tools, ArcGIS Pro software [59] and Google Colab cloud service [60] (with GPU enabled for Pro version) were used to accomplish the proposed tasks.

Data generation.
As already mentioned, the raw data was obtained from the Open Data portal of the Madrid City Council [53]. Since the monitoring stations and measurement points are different for each dataset, the first task is to combine them spatially and temporally. Therefore, a grid with a cell width and height of 1,000 metres was created over the given area, defined as a selected part of Madrid, using ArcPy [61], specifically the CreateFishnet function [62]. There is a total of 340 cells (20 by 17), which cover 340 km 2 , or 56.27% of the total area of the city of Madrid. The logic behind selecting this area was to use the minimum extent that includes all air quality control stations, with the aim of obtaining higher accuracy. The value of each cell includes the values of the NO 2 , meteorological and traffic attributes obtained from the assigned stations at a certain time. Cells that do not include any station were assigned a value of zero, and in the case of more than one station, an average value was assigned. The above procedure was repeated for every hour of the selected period. The following functions were used to execute the aforementioned process: arcpy.management.AddField [63], arcpy.analysis.SpatialJoin [64], arcpy.da.SearchCursor [65] and arcpy.da.UpdateCursor [66]. The output was exported as Comma Separated Values (CSV) files, which were used as input in further stages of the analysis. Overall, 4344 and 4368 CSV files were generated, corresponding to every hour during January-June 2019 and January-June 2020, respectively. A formal description of the data generation process is given by Algorithm 1.

The presence of many features sometimes prevents a model from generalising data efficiently, due to the curse of dimensionality. Hence, feature selection must be implemented to select the best combination of datasets, which in turn will prompt the model to generalise the data efficiently.
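The per-cell aggregation rule from the data generation step (zero for empty cells, averages where several stations fall in one cell) can be sketched without arcpy as below. The coordinates and grid origin are hypothetical, since the actual fishnet extent is defined in the ArcGIS project.

```python
def cell_index(x, y, x_min, y_min, cell=1000, ncols=20):
    """Map a projected coordinate (metres) to a row-major cell id
    on the 20 x 17 fishnet of 1,000 m cells."""
    col = int((x - x_min) // cell)
    row = int((y - y_min) // cell)
    return row * ncols + col

def grid_average(stations, x_min, y_min, n_cells=340):
    """Average station readings per cell; cells with no station stay 0,
    mirroring the assignment rule described above."""
    sums = [0.0] * n_cells
    counts = [0] * n_cells
    for x, y, value in stations:
        i = cell_index(x, y, x_min, y_min)
        sums[i] += value
        counts[i] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# hypothetical coordinates: two stations falling in the same cell are averaged
stations = [(435_500.0, 4_470_300.0, 40.0), (435_900.0, 4_470_900.0, 60.0)]
grid = grid_average(stations, x_min=435_000.0, y_min=4_470_000.0)
```

Repeating this for every hour of the study period yields the per-hour grids that were exported as CSV files.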
First of all, the following variables were excluded from further predictive analysis: average traffic speed, traffic load, UV and precipitation. Average traffic speed was excluded because it is available only for the M30 road, which covers 15.8% of the case study (Table 2). Traffic load, according to its definition, is a combination of intensity, occupancy time and road capacity; therefore, this variable was also excluded, taking into account that it is correlated with the other variables. Regarding UV, it was observed that June 2019 and the whole period of 2020 have no UV records. Regarding precipitation, around 99% of its records were 0, so this feature was also eliminated. Afterwards, the mutual information (MI) technique [69] was applied to the remaining features. It measures the mutual dependence between each additional dataset and the target dataset (NO 2 ). The formula to calculate mutual information is presented below (Eq (3)).

MI(x; y) = Σ i P(x i , y) log [ P(x i , y) / (P(x i ) P(y)) ] = H(x) − H(x|y)    (3)
where P(x i , y) is the joint probability distribution of the two variables, P(x i ) and P(y) are the marginal distributions, H(x) is the entropy of x, and H(x|y) is the conditional entropy. Fig 7 shows the feature importance scores of the 7 additional datasets based on mutual information. For further analysis in the second scenario, features with a score above 0.005 were selected, namely wind speed, barometric pressure, intensity and occupancy time. It should be mentioned that wind direction was also selected, considering its interconnection with wind speed. The reason for not including wind direction in the mutual information calculation is that wind direction is circular data and needs to be converted for later use (details below).
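Eq (3) can be computed directly for discrete variables. The original analysis likely used a library estimator suited to continuous features (e.g. scikit-learn's mutual_info_regression); the following standard-library sketch shows the sum in Eq (3) itself.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Discrete MI(x; y) = sum over (x, y) of P(x, y) * log(P(x, y) / (P(x) P(y)))."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        # P(x, y) / (P(x) P(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * log(c * n / (px[x] * py[y]))
    return mi

xs = [0, 0, 1, 1] * 25   # 100 samples, half 0s and half 1s
ys_indep = [0, 1] * 50   # pattern unrelated to xs
# MI(x, x) equals the entropy H(x) = log 2 for a 50/50 variable,
# while MI with an unrelated variable is zero
```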

Transformation.
In this step wind direction was converted into categorical data with the following categories: north, east, south, west, southwest, northeast, southeast and northwest, and later, by implementing a One Hot Encoder [70], it was included in the analysis. Another transformation was the conversion of the input data into a supervised learning dataset. Independent and dependent datasets were generated based on the defined time granularity (to predict NO 2 in the next 6 hours on the basis of data from the previous 6 hours).
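Both transformations can be sketched in a few lines. The 45-degree sector boundaries centred on the compass points are an assumption, as the exact binning is not stated.

```python
SECTORS = ["north", "northeast", "east", "southeast",
           "south", "southwest", "west", "northwest"]

def wind_sector(degrees):
    """Bin a wind direction in degrees into one of 8 compass categories
    (45-degree sectors centred on the compass points -- an assumed binning)."""
    return SECTORS[int(((degrees % 360) + 22.5) // 45) % 8]

def one_hot(sector):
    """One-hot encode a compass category."""
    return [1 if s == sector else 0 for s in SECTORS]

def make_supervised(series, n_in=6, n_out=6):
    """Frame a series as supervised learning: the previous n_in values
    are the input window, the next n_out values are the target window."""
    X, y = [], []
    for t in range(n_in, len(series) - n_out + 1):
        X.append(series[t - n_in:t])
        y.append(series[t:t + n_out])
    return X, y
```

With `n_in=6` and `n_out=6` this matches the 6-hours-in, 6-hours-out framing used in this work.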

Scaling.
Scaling is a very useful technique for handling differences between the ranges of the features. The current work applied Min-Max (0-1) normalisation to normalise the input data (Eq (4)):

x′ = (x − x min ) / (x max − x min )    (4)
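A minimal sketch of the Min-Max normalisation described above:

```python
def min_max_scale(values):
    """Min-Max (0-1) normalisation: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10.0, 20.0, 30.0])
# -> [0.0, 0.5, 1.0]
```

In a full pipeline the minimum and maximum would be fitted on the training set only and reused for the validation and test sets, to avoid leaking information across splits.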

Data splitting.
After preprocessing the data with the above-mentioned techniques, the next step is to split the dataset into training, validation and testing sets. The data was split in the following order: January-June 2019, training set; January-March 2020, validation set; April-June 2020, testing set. The dimensions of each set are illustrated in Table 3.

Model development.
This step presents the process of model construction. The parameter optimisation of the proposed model was performed by applying GridSearchCV with a Blocking Time Series Split. The Blocking Time Series Split was chosen instead of standard cross-validation because it respects the time series structure and prevents leakage from one set to another. In order to reduce the computation time of the parameter optimisation, GridSearchCV was applied to data for one month. Table 4 shows the optimised parameters with the options that were tried; the option finally selected is indicated in bold.
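scikit-learn does not ship a blocking splitter, so it is commonly hand-rolled; a sketch follows, with the 80/20 split inside each block as an assumed ratio. Each fold is an independent, temporally ordered block, so no observation is shared between folds.

```python
def blocking_time_series_split(n_samples, n_splits, train_frac=0.8):
    """Yield disjoint (train, test) index blocks in temporal order.
    Within each block, every training index precedes every test index,
    so no future information leaks into training."""
    block = n_samples // n_splits
    for i in range(n_splits):
        start = i * block
        split = start + int(train_frac * block)
        stop = start + block
        yield list(range(start, split)), list(range(split, stop))

folds = list(blocking_time_series_split(100, 5))
# 5 folds; each with 16 training indices followed by 4 test indices
```

Such a list of (train, test) index pairs can be passed directly to GridSearchCV's `cv` argument, which accepts any iterable of splits.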
Therefore, the architecture of the model was built based on the chosen parameters by stacking 3 bidirectional ConvLSTM layers with a kernel size of 3x3 (it should be noted that a model with a smaller kernel allows capturing slower motion), filters equal to 16 and with an Adam optimiser. It can be seen that concatenation was selected as the merge mode, which means that the forward and backward ConvLSTM units were concatenated before passing information to the next unit. Each BiConvLSTM layer was followed by Dropout and Batch Normalization layers, and the model was finalised using a 1x1 convolution layer.
Regarding the baseline models, LSTM-FC had the following structure: 2 LSTM layers with 2048 units followed by Dropout layer and the model was finalised adding a Dense layer; ConvLSTM had 5x5 kernel size with filters equal to 32, followed by Batch Normalisation and Dropout layers and it was finalised with 1x1 convolution layer.

Evaluation.
After parameter optimisation, the finalised model was evaluated on the testing set in order to answer the questions defined in the Introduction. From Table 1, it can be seen that RMSE and MAE are the most used evaluation metrics; therefore, they were chosen for evaluation. RMSE is the square root of the average squared difference between estimated and actual values and is very sensitive to large errors (Eq (5)), while MAE measures the average magnitude of the errors (Eq (6)):

RMSE = √( (1/n) Σ (E i − A i )² )    (5)

MAE = (1/n) Σ |E i − A i |    (6)

where n is the number of instances, and E i and A i are the estimated and actual values. The lower the value, the better the prediction. Algorithm 2 provides pseudocode of the NO 2 prediction procedure.
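Eqs (5) and (6) translate directly into code; the toy values below illustrate how RMSE penalises a single large error more heavily than MAE.

```python
from math import sqrt

def rmse(estimated, actual):
    """Root Mean Square Error (Eq (5)); squaring penalises large errors."""
    n = len(actual)
    return sqrt(sum((e - a) ** 2 for e, a in zip(estimated, actual)) / n)

def mae(estimated, actual):
    """Mean Absolute Error (Eq (6)); the average error magnitude."""
    n = len(actual)
    return sum(abs(e - a) for e, a in zip(estimated, actual)) / n

est, act = [10.0, 20.0, 40.0], [12.0, 20.0, 30.0]
# the single large error (10 units) pushes RMSE well above MAE
```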

Algorithm 2 NO 2 prediction
Input: CSV files for each hour including NO 2 , meteorological and traffic data
function CALCULATE

Results and discussion
As mentioned in the Introduction, the analysis was carried out according to two scenarios. Below are the results for each of them.

First scenario.
In this scenario the experiments were performed using 9 features (NO 2 , wind speed, wind direction, temperature, relative humidity, barometric pressure, solar irradiance, intensity, and occupancy time) without the remaining 4 features (UV, precipitation, load and average traffic speed), which, as mentioned above, were excluded immediately after the data exploration phase. Table 5 presents the results obtained and the runtime of the models for the next 6-hour lag. Looking at the results of the RMSE and MAE, it can be seen that BiConvLSTM outperforms ConvLSTM and LSTM-FC with values of 19.14 and 13.06, respectively.

Second scenario.
In this scenario, the analysis was carried out using the datasets selected after calculating the feature importance scores based on mutual information. Table 6 shows the RMSE and MAE values and the runtime of the models performed on the selected features. It can be seen that, as in the first scenario, BiConvLSTM surpassed the other models. In terms of RMSE, BiConvLSTM improves results compared to ConvLSTM by 16.28% and to LSTM-FC by 19.32%; in terms of MAE, it improves results compared to ConvLSTM by 18.32% and to LSTM-FC by 28.21%. Regarding runtime, in this case also BiConvLSTM converges comparatively more slowly than ConvLSTM and LSTM-FC.
The difference observed between the two scenarios is a significant decrease in both runtime and error, which is associated with the feature selection methodology. It is essential to understand why only some of the features (wind speed, wind direction, barometric pressure, intensity, and occupancy time) were chosen, and what the relationship is between NO 2 and the features with a higher mutual information score, whose inclusion improved the performance of the model. In terms of wind speed and direction, the correlation arises because an increase in wind speed implies a lower concentration due to increased dilution through advection and increased mechanical turbulence. In terms of traffic data, the transport sector has been confirmed to be one of the largest sources of nitrogen oxides (nitric oxide and NO 2 ); for example, about 46% of total nitrogen oxide emissions in the European Union in 2013 were attributed to transport [71].
It is worth mentioning that RMSE and MAE are expressed in the same unit as the target variable; in the current work, this is the unit of NO 2 (μg/m 3 ). Hence, looking at the results, the MAE is 9.72 μg/m 3 , which can be considered sufficient compared with the mean values of NO 2 (36.69 and 26.03 μg/m 3 for the 2019 and 2020 periods, respectively). It is essential to consider the impact of the Coronavirus Disease 2019 (COVID-19) during 2020, when measures such as traffic restrictions and self-isolation were introduced to combat it; these events affected air pollution concentrations. In the case of Madrid, due to COVID-19 restrictions, the concentration of NO 2 dropped by 62% [72]. These sudden changes can also affect the model's performance, and it would be ideal in future work to compare the results with a different period to identify these effects.
Overall, it can be seen that BiConvLSTM outperforms the other reference models in both scenarios; however, it takes comparatively more execution time. The superiority of the proposed model over LSTM can be explained by the fact that BiConvLSTM captures spatial information, while LSTM focuses exclusively on temporal information. Compared to ConvLSTM, the forward and backward sequences of BiConvLSTM help to collect more information and, as a result, BiConvLSTM outperforms ConvLSTM; on the other hand, these sequences lengthen the execution time.

Conclusions and future work
Taking into account the impact of NO 2 on health and the environment, the management and control of its concentration become an essential issue for governments and decision-makers (according to WHO guidelines, NO 2 has the following threshold values: 40 μg/m 3 for the annual average and 200 μg/m 3 for the 1-hour average [6]). Considering that the concentration of NO 2 correlates both temporally and spatially, this work implements BiConvLSTM, which can perform effectively in the temporal and spatial dimensions. The data used for the analysis are NO 2 , meteorological and traffic data from January to June 2019 and from January to June 2020 in the city of Madrid. Two scenarios were developed based on the subsets of features used in the analyses. The proposed model was compared to ConvLSTM and LSTM-FC, and the results showed that BiConvLSTM outperformed the reference models in both scenarios. In particular, feature selection improved the final results by 33.9% in terms of RMSE and by 25.27% in terms of MAE, underlining the importance of the feature selection step in significantly reducing the error. Regarding runtime, BiConvLSTM is slower due to the model architecture, and it takes longer to converge. It is worth noting that, comparing the MAE results with the average concentration values, the proposed model can be considered reliable and robust. As regards the limitations, it is worth mentioning that the predictive analysis was performed using Google Colab, and the cloud service itself has restrictions in terms of the amount of data and the complexity of the model [73]. However, with access to a more powerful machine learning analysis platform, the scale of the parameter optimisation of the proposed model could be expanded, more data could be generated and included in the training set, and perhaps the performance of the model could be improved.
In terms of the proposed model's limitations, the requirement on the input data, which is related to the model's architecture, can be specified: the input data must be in grid format. However, grid formatting can be challenging, since in the case of a lack of data, modification of the original data is required, which may have an impact on model performance. Therefore, another machine learning model, such as a graph neural network, could be developed in the future and the results of the alternative approaches compared. Other aspects that could be considered as future work are the integration of other datasets, such as street networks and buildings, and the application of the proposed procedure to a different pollutant (for example, PM 2.5 , as it has serious health effects) as well as to other cities, in order to compare performance based on spatial characteristics. Also, as already mentioned, it would be ideal to perform the analysis for a different period and observe the impact of COVID-19 on the model's performance.