Using satellite data on remote transportation of air pollutants for PM2.5 prediction in northern Taiwan

Accurate PM2.5 prediction is part of the fight against air pollution and helps governments manage environmental policy. Satellite remote sensing aerosol optical depth (AOD) processed by the Multi-Angle Implementation of Atmospheric Correction (MAIAC) algorithm allows us to observe the transportation of remote pollutants between regions. This paper proposes a composite neural network model, the Remote Transported Pollutants (RTP) model, that accounts for such long-range pollutant transportation and predicts local PM2.5 concentrations more accurately given such satellite data. The proposed RTP model integrates several deep learning components and learns from the heterogeneous features of various domains. We also detected remote transportation pollution events (RTPEs) at two reference sites from the AOD data. Extensive experiments using real-world data show that the proposed RTP model outperforms the base model that does not account for RTPEs by 17%-30%, 23%-26%, and 18%-22%, and state-of-the-art models that account for RTPEs by 12%-22%, 12%-14%, and 10%-11%, at +4h to +24h, +28h to +48h, and +52h to +72h, respectively.


Introduction
Rapid urban development and industrialization in recent years have increased air pollution, especially PM2.5, particulate matter with an aerodynamic diameter of less than 2.5 micrometers (μm) that cannot be filtered through nasal passages, leading to health problems such as respiratory and cardiac diseases [1][2][3]. Many nations have built urban stations to monitor the presence of PM2.5 in the environment. The resulting datasets can be used to better understand and predict PM2.5 [4]. However, the prediction of PM2.5 levels is a difficult problem, as the dispersion of pollutants is heavily dependent on the meteorological characteristics and terrain, in addition to the activities of the inhabitants [5,6]. The prediction is further complicated by factors such as pollutant migration from outside the observed area. Such long-range transport of air pollutants, which relies on wind and other meteorological effects, is called Remote Transportation Pollution Events (RTPEs) in this work [7]. The aerosol optical depth (AOD) from the Multi-Angle Implementation of Atmospheric Correction (MAIAC) algorithm is a natural solution for understanding air quality over large areas, especially as it has a strong correlation with PM2.5 [8,9]. In this study, we considered pollutants transported from northeast Asia through the East China Sea to Taiwan [10,11].
Different researchers have used satellite-based AOD measurements to estimate and predict PM2.5 due to their high correlation with PM2.5 and their large spatial coverage area [12][13][14]. The AOD of the MAIAC algorithm at 1 km resolution produces better performance than other algorithms with 10 km resolution [2]. Most models use AOD for end-to-end training, where the input to the model is AOD and other related meteorological data, and the output is PM2.5 of the same area. In this work, we used MAIAC AOD data from a large remote area as input, while the output is local PM2.5 from a part of northern Taiwan. AOD is not used to directly predict local PM2.5; instead, the proposed model first predicts RTPEs, which are later used with other data to predict local PM2.5.
In the literature, predictive models are designed using the classical dispersion approach that focuses on identifying the root cause of PM2.5 from emissions, chemicals, climatology, or a combination of these factors [15,16]. For example, the Community Multiscale Air Quality Model (CMAQ) [17] is designed to study air pollution on a global scale. The challenge of this approach includes the failure to detect the complex relationship of features that affect PM2.5 [18]. Furthermore, it does not perform well in capturing the spatial and temporal distribution of PM2.5 and suffers from high computation costs, especially for a model with complex high-order equations [19].
There are studies that simulate and quantify RTPEs to Taiwan [20][21][22][23][24]. Most of them use trajectory statistics (TS) and chemical transport modeling (CTM) approaches to discover the source of RTPEs. TS uses the frequency of backward trajectories in an area to determine whether pollution is due to remote pollutants. CTM is a brute-force method that involves two simulations: one without pollutants from the local area and one normal simulation. The difference between these simulations determines the RTPE amounts from the remote area. In this work, we predict RTPEs with the help of several deep neural networks.
The recent development of deep learning in the prediction of air pollution shows the ability to outperform the classic dispersion approach and the statistical approach. Deep learning relies on large historical datasets to capture complex interactions among various features of different datasets. For PM2.5 prediction, various deep learning methods involve different machine learning techniques to capture different knowledge from large datasets. Convolutional neural networks (CNN) [25][26][27][28] are used to capture spatial knowledge, long short-term memory (LSTM) [27,28] models are adopted for temporal knowledge, and fully connected neural network (FC) models extract complex interactions between those datasets. Convolutional Long Short-Term Memory (ConvLSTM) is another technique that combines CNN and LSTM to predict PM2.5 [26,29,30]. However, deep learning with an end-to-end training approach that directly applies the CNN and LSTM components does not perform well for complicated and heterogeneous data [27]. Specifically, the AOD image inputs cause high computation costs and long training times, and only a sub-optimal solution can be obtained. Furthermore, PM2.5 prediction becomes more challenging in the long term (for example, for 48 to 72 hours) as there are influences from both known and unknown factors [19]. In this work, we adopted ConvLSTM, CNN and FC to improve the prediction accuracy of PM2.5.
The philosophy behind deep learning is to produce good results if there is a sufficient training dataset [31]. Due to incomplete data and missing data from different observation stations, the ensemble machine learning approach [18,32,33] is used to improve the prediction of PM2.5. In an ensemble model, a linear combination of the outputs of different individual deep learning models is used for PM2.5 prediction, delivering better results than individual prediction results. Popular ensemble machine learning models include AdaBoost (AB) [32], bagging regression (BG), random forest (RF) [34,35], extreme gradient boosting (XGB) [36][37][38], and a generalized additive model (GAM) [33,39]. In this work, we used a composite neural network that outperforms those ensemble models.
Recently, a composite neural network framework that combines different pre-trained deep learning models [27,40] has been proposed to resolve complicated applications, such as PM2.5 prediction. A composite neural network is a collection of pre-trained neural network models that forms a large neural network to yield greater learning capabilities without the burden of high model training expenses. Each pre-trained model utilizes the knowledge from different datasets, and their outputs are connected into an acyclic tree construction. Later, the outputs are ensembled after constraining the weight of each component to a specific value by using a defined function, instead of being ensembled by a weighted average as in an ensemble machine learning model.
This paper answers two main questions. The first is about the identification of the occurrence of RTPEs in a local area. The second is about the incorporation of knowledge about RTPEs to improve the local prediction of PM2.5. The questions create three challenges in terms of deep learning design and practice. The first challenge is how to prioritize the factors that influence the capture of remote pollutants, as air quality is affected by multiple factors [9,28,41], each with its own spatial and temporal distribution. The second challenge is how to identify factors in the design of a neural network model to capture the complex interactions between them for better PM2.5 prediction. The third is how to fuse and train the proposed neural network model on large heterogeneous datasets for improved efficiency and prediction results.
We addressed the first challenge by considering that the AOD data and weather data of remote areas are typically provided in coarse-grained grids. Generally, RTPEs are caused by monsoons and frontal surfaces, which are synoptic; therefore, we considered wind speed, direction, and related features. To tackle the second challenge, the model from [27] was selected as the pre-trained base model and was then extended with another large deep learning model, the spatiotemporal remote information neural network (STRI), to form the proposed remote transported pollutant composite neural network (RTP model). The STRI model incorporates long-range pollutants for PM2.5, grasps spatio-temporal features from remote areas, and learns the spatial correlation between remote AOD and local PM2.5. For the third challenge, we broke the new STRI component into two parts: one for feature extraction and another for prediction. This reduces the number of training parameters and thus the computational cost of the training process with virtually equivalent prediction performance. The four contributions of this work are:
1. Proposing the composite neural network RTP model, which efficiently captures RTPEs and significantly improves PM2.5 prediction in comparison with the base model and state-of-the-art models.
2. Addressing the challenges of using RTPEs as features for local PM2.5 prediction. These challenges are addressed in a combined fashion to learn from selected features and models.
3. Developing a classification algorithm to classify RTPEs at two reference sites at different PM2.5 levels and increase rates.
4. Applying a composite neural network [42] to develop neural network models incrementally, demonstrating the design rationale and contribution of each component for PM2.5 prediction.

Study area
The area under study comprises a remote area, where we captured RTPEs, and a local area (the Taipei area), where we performed the prediction of PM2.5.

Dataset
We used the 2014 to 2016 data to evaluate the proposed neural network models. The 2014 and 2015 data were used for training the model and the 2016 data were used for testing. The data for the extended local satellite dataset (ESD) model evaluation were prepared at daily granularity, whereas for the RTP model the data were at hourly granularity.
Observed air quality concentration. We obtained hourly air quality data from the EPA website (data.epa.gov.tw), consisting of PM10, PM2.5, carbon monoxide (CO), nitrogen oxides (NOx), ozone (O3), and sulfur dioxide (SO2). The Taipei area was divided into a grid of 1140 (30 × 38) cells, and we used the four nearest neighbors (4-NN) method to fill grid cells with empty values. To evaluate ESD, we converted those datasets to a daily interval.
Meteorological data. The hourly meteorological data were obtained from the Central Weather Bureau (CWB) website (opendata.cwb.gov.tw). Each reading includes wind speed and direction, rainfall, pressure, temperature, and humidity. The data covered 77 grid cells in the Taipei area; therefore, we used 4-NN to fill the cells without monitoring stations. Again, we averaged those datasets into daily readings for ESD evaluation.
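The 4-NN filling step can be illustrated with a minimal sketch. It assumes that an empty cell takes the unweighted mean of the four nearest observed cells measured by Euclidean distance on the grid; the text does not specify the weighting, and the function name `fill_grid_4nn` and the use of `None` for empty cells are hypothetical choices made for illustration only.

```python
import math


def fill_grid_4nn(grid, k=4):
    """Fill empty (None) cells with the mean of the k nearest observed cells.

    Assumption: unweighted mean and Euclidean grid distance; the paper only
    names the method (4-NN) without giving these details.
    """
    # Collect (row, col, value) for every cell that holds a station reading.
    observed = [
        (r, c, value)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        if value is not None
    ]
    if not observed:
        raise ValueError("grid has no observed cells to interpolate from")

    filled = [row[:] for row in grid]  # copy so the input stays untouched
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value is None:
                nearest = sorted(
                    observed,
                    key=lambda cell: math.hypot(cell[0] - r, cell[1] - c),
                )[:k]
                filled[r][c] = sum(cell[2] for cell in nearest) / len(nearest)
    return filled
```

Observed cells are never overwritten, so station readings pass through unchanged and only the gaps are interpolated.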
Remote meteorological data. We used the National Centers for Environmental Prediction (NCEP) final (FNL) global analysis data (rda.ucar.edu), which cover the whole world. These data are provided over 28 × 28 km² grids every six hours, and we converted them to an hourly interval using linear interpolation. The data include the following meteorological features: temperature, pressure, vertical velocity (VVEL), absolute vorticity (ABSV), lifted index, wind speed, and wind direction. The wind speed (ws) and direction (θ) are represented as u and v components, i.e., ws × cos(θ) and ws × sin(θ). The u component is the horizontal wind speed toward the east (known as zonal velocity) and the v component is the horizontal wind speed toward the north (known as meridional velocity). The ws and θ, temperature, VVEL and ABSV were considered at pressure levels from 10 mb to 1000 mb. These data were used for evaluation of the RTP model only.
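As a rough illustration of these two preprocessing steps, the sketch below converts wind speed and direction into u and v components exactly as the formulas are written in the text, and linearly interpolates a six-hourly series to hourly values. The function names are hypothetical, and the sketch assumes θ is given in degrees; the angle convention actually used in the study is not stated.

```python
import math


def wind_to_uv(ws, theta_deg):
    """Return (u, v) = (ws*cos(theta), ws*sin(theta)) as written in the text.

    Assumption: theta is supplied in degrees.
    """
    theta = math.radians(theta_deg)
    return ws * math.cos(theta), ws * math.sin(theta)


def interpolate_to_hourly(values, step_hours=6):
    """Linearly interpolate readings spaced step_hours apart to hourly values."""
    if not values:
        return []
    hourly = []
    for start, end in zip(values, values[1:]):
        for h in range(step_hours):
            # Fraction h/step_hours of the way from one reading to the next.
            hourly.append(start + (end - start) * h / step_hours)
    hourly.append(values[-1])  # keep the final observed reading
    return hourly
```

For a series of n six-hourly readings, this produces 6(n − 1) + 1 hourly values, with the original readings preserved at every sixth position.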
Satellite MAIAC AOD dataset. This is satellite AOD data at a 1 × 1 km² resolution created using the MAIAC algorithm, which is updated twice a day and was downloaded from the National Aeronautics and Space Administration (NASA) website (ladsweb.modaps.eosdis.nasa.gov). The remote area is covered by four satellite tiles (Fig 1) with the AOD data. The AOD is used to evaluate both the RTP and ESD models, but the data pre-processing is different for each model.
For ESD evaluation, we used tiles 1 and 2 (Fig 1) to fill the AOD data in the Taipei area. For the missing grids, we used the mean of their neighboring grids (3 × 3) to fill their AOD data. For RTP evaluation, we calculated the daily mean AOD value for each grid in all tiles. We assumed the AOD value is the same for the whole day; thus we repeated the same value 24 times to match the hourly readings. Furthermore, we also downscaled all tiles to a coarser spatial resolution. A downscaling approach based on mean pooling was used on satellite images for precipitation [26]. In this work we use maximum pooling to maintain the distribution of values in each tile. In the end, each tile is reduced to a spatial dimension of 300 × 300 km². The downscaled tiles fit the available memory of the graphics processing unit (GPU) and reduce computational cost.
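The RTP pre-processing described above can be sketched as follows. This is an illustrative outline rather than the authors' pipeline: it assumes non-overlapping pooling windows and a grid with no missing values, and the function names and the `size` parameter are hypothetical.

```python
def max_pool_2d(grid, size):
    """Downscale a 2-D grid by taking the maximum of each size x size window.

    Assumption: windows do not overlap, and trailing rows or columns that do
    not fill a complete window are dropped.
    """
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [
                grid[i][j]
                for i in range(r, r + size)
                for j in range(c, c + size)
            ]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


def repeat_daily_to_hourly(daily_values, hours_per_day=24):
    """Repeat each daily value so the series aligns with hourly readings."""
    return [value for value in daily_values for _ in range(hours_per_day)]
```

Under these assumptions, a tile with 1200 cells per side pooled with `size=4` would yield 300 cells per side; whether this matches the pooling factor used in the study is not stated in the text.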

Method
The description of the methods of the study is based on the composite neural network.The idea is to access or design several pre-trained deep learning models for different tasks and then treat them as the components of the final composite neural networks, ESD and RTP.

Proposed models
The inputs of the STRI_fe model are the current four downscaled satellite tiles and remote weather. Each tile is represented by a four-dimensional (4d for short) tensor, [t, c, w, h], corresponding to time, channel, width, and height. Considering the available memory and computational resources, the model applies three-dimensional (3d) average pooling over [c, w, h] on each tile along the time axis to reduce its dimension and outputs T_q. The CNN layers receive T_q to capture spatial correlation and aggregate information between grid cells. The output from the pooling layer on the 4d tensor is denoted by P_q, where L represents the pooling layer, c is the convolutional feature from the convolutional layer, b is the additional bias, v is a vector with the same size as c, and an activation function is applied.
Here T_q is the downsampled AOD data, * represents the convolution operation, and K is the convolutional kernel. To speed up the training process, we applied batch normalization [43] between ConvLSTM layers in the STRI_fe model. The outputs of ConvLSTM for each tile (H_T1, H_T2, H_T3, H_T4) are concatenated and then flattened into a one-dimensional (1d) tensor |e|.
On the right-hand side of the STRI_fe component, the ConvLSTM structure with batch normalization is again applied to the current remote weather dataset to extract spatio-temporal features, which represent historical weather patterns of wind and other features associated with time and location. The 4d tensor output from ConvLSTM, denoted by H_W, is flattened into a 1d tensor |x|, which is later merged with |e| to form another 1d tensor [g], denoted as R_p. R_p holds the extracted spatio-temporal features of remote pollutants with their corresponding weather features and is transferred to the STRI_p model after being converted to a 2d tensor.
In the second training phase, the STRI_p model is further refined with the fixed STRI_fe model to reduce the training time, model complexity, and model parameters for improved prediction results. STRI_p receives the sequence of R_p, local sequences of PM2.5, and meteorology data, which also include future weather forecasts. Future weather forecast data are included to reflect weather fluctuations, because the current weather is not sufficient for long-term prediction, i.e., beyond 24 hours [28]. All input features are merged together to form a 2d tensor denoted as HR. Finally, fully connected (FC) layers are applied to HR to learn the complex interaction between features extracted from the remote area and local features and to make predictions. More details of the STRI model configuration can be found in S1 Table.
Base model. The Base model [27] was designed for PM2.5 prediction for 18 EPA stations using local influential factors within the Taipei area, represented as 30 × 38 = 1140 cells. This model uses 21 features from EPA and 26 features from CWB. Among the 1140 cells, there are 18 EPA stations and 77 CWB stations. The Base model itself is a composite neural network combining six heterogeneous models as its components: one LSTM, two FC (fully connected layer) and three ConvLSTM, where each component has its own input data and expected task. For example, the ConvLSTMs extract spatio-temporal knowledge from the EPA, CWB and weather forecasting datasets, respectively, and the two FC components are expected to automatically distill the information from EPA and CWB data. The trained weights of the Base model are always fixed in our subsequent steps.
LSD model. This model only considers local AOD data in the Taipei area, which makes it a simpler composite neural network than STRI. We fill the area with AOD, PM2.5, weather forecast, and meteorological data, all of which are aligned as daily readings, and use them as input to the LSD model.
The LSD model (Fig 4(a)) starts with a series of CNNs on the AOD data to capture the spatial correlation from neighbors along the temporal axis. Then a pooling layer is applied after the CNNs to reduce the spatial dimensions and aggregate features between the grids, producing the output K_o. The model uses the same series of CNN and pooling layers on the current meteorological, air quality, and weather forecast data, and outputs K_l. Later, the model merges K_o and K_l using an Add layer, and an LSTM is applied to extract temporally related features.
Finally, an FC layer is used to learn the interaction and correlation between all features in a nonlinear way [28], and the final FC layer then produces the PM2.5 prediction.
... pollutant measurements as opposed to ground measurements. The difference between the RTP and ESD models is that the ESD has the LSD component with local AOD knowledge to improve daily PM2.5 prediction, while the RTP utilizes STRI_p, which learns the remote AOD knowledge.
Given the PM2.5 prediction outputs from the Base model (P) and the LSD model (Y) for 18 stations for the next 1 to 3 days, the ESD predicts PM2.5 for the same stations at the same daily interval.

Evaluation
This section reports the experimental environment, the settings and the evaluation for the training of deep learning models.

Experimental environment and setting
The models were trained on an NVIDIA GPU and implemented in Keras with a TensorFlow backend. All models were evaluated using root mean square error (RMSE), correlation coefficient (R) and mean bias error (MBE). The RMSE and R evaluate how well the model's predicted values represent the true values, whereas MBE estimates the average bias in the prediction. These metrics are defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

$$R = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)^2}}$$

$$\mathrm{MBE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)$$

where $y_i$ and $\hat{y}_i$ are the true and predicted values at timestamp $i$, respectively, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, $\bar{\hat{y}} = \frac{1}{n}\sum_{i=1}^{n} \hat{y}_i$, and $n$ is the total number of incidents in a sample. In this work we report the mean of the RMSE values over all monitoring stations.
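For readers who want to reproduce the evaluation, the three metrics can be computed as in the minimal sketch below. It uses the standard definitions of RMSE and the Pearson correlation coefficient and assumes that MBE is the mean of predicted minus observed values, so that a positive value indicates overestimation; the function names are hypothetical and not taken from the authors' code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mbe(y_true, y_pred):
    """Mean bias error; assumed sign: predicted minus observed."""
    n = len(y_true)
    return sum(p - t for t, p in zip(y_true, y_pred)) / n


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_t * var_p)
```

To follow the reporting convention in the text, `rmse` would be evaluated per monitoring station and the resulting values averaged.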

Classification of remote pollutants
In this section we answer the first question by classifying PM2.5 levels at the Wanli and Tamsui stations in the Taipei region that are affected by RTPEs. These two stations are considered RTPE indicators because (1) they are located by the seashore, as shown in Fig 1, and (2) their background PM2.5 values are stable and relatively low, so the occurrence of RTPEs causes a rise. Therefore, we started by producing PM2.5 predictions using the STRI_fe and STRI_p models without considering the local PM2.5 influence factor. In our prediction experiment, we considered only November to May because these are the months when RTPEs have the greatest impact on northern Taiwan [24].
The proposed classification algorithm classifies the PM2.5 concentrations for the next 24, 48, and 72 hours (+24h, +48h, +72h) that are affected by RTPEs. Such RTPEs are understood to flow across the East China Sea to Taiwan; however, due to variations in wind direction, not all pollutants reach Taiwan. Thus, we seek to ascertain the amount of pollutants reaching Taiwan or the increased concentration caused by the remote pollutants, which corresponds to two conditions of RTPEs. Condition (1): the arrival of such RTPEs raises PM2.5 concentrations beyond a certain threshold at these stations. Condition (2): the PM2.5 concentration increases over two consecutive hours. In particular, Condition (2) assumes that if RTPEs arrive in the current hour (t), then the difference between the current PM2.5 and that of the previous hour (t-1) must be positive. An RTPE is said to occur if the peak value simultaneously satisfies both conditions.
Classifying remote transportation pollution events. We created three thresholds for each of the two conditions. For Condition (1), i.e., the concentration threshold (Epa_tshd), we selected 30, 33, and 36 in our experiments, based on the finding of Chuang et al. [24] that RTPEs in northern Taiwan account for PM2.5 concentrations ranging from 31 to 39. For Condition (2), i.e., the differential threshold (Diff_tshd), the true and predicted PM2.5 concentrations were converted to first-order difference vectors, after which the differential thresholds 0.5, 1.0, and 1.5 were set. Fig 5 is an example with the ground-truth (GT) and predicted results of the next one hour (+1h) and next four hours (+4h) for the Wanli station. The Epa_tshd and Diff_tshd used in the example are 30 and 0.5, respectively. The green dashed line indicates Epa_tshd, and the colored dots represent the peaks from different predicted hours that exceed the two thresholds. The 26 red dots represent the total number of RTPEs predicted at +1h, compared to the 69 ground-truth events. The accuracy of RTPE detection is thus 37.7%.
Note that RTPEs are defined by conditions that depend on the given Epa_tshd and Diff_tshd thresholds. By definition, a True Positive (TP) RTPE occurs when the ground truth and the model prediction are both larger than Epa_tshd and Diff_tshd. A True Negative (TN) event occurs if neither the ground truth nor the prediction is larger than the given thresholds. False Positive (FP) and False Negative (FN) events are defined similarly. We used these to calculate the accuracy (A), precision (Pr), recall (R), and F1 score. Specifically, the F1 score is defined as:

$$F1 = \frac{2 \times Pr \times R}{Pr + R}$$

The formulas of the remaining metrics can be found in the deep learning textbook [44]; further classification details are provided in S1 Appendix.
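The two-condition rule and the resulting scores can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it flags every hour that satisfies both thresholds and omits the peak-detection step described in the text (the full procedure is in S1 Appendix). The function names and the dictionary keys are hypothetical.

```python
def detect_rtpe(series, epa_tshd, diff_tshd):
    """Flag hours where PM2.5 exceeds epa_tshd and rose by more than diff_tshd.

    Simplification: every qualifying hour is flagged; peak detection is omitted.
    The first hour has no predecessor and is never flagged.
    """
    flags = [False] * len(series)
    for t in range(1, len(series)):
        exceeds_level = series[t] > epa_tshd                  # Condition (1)
        rises_fast = (series[t] - series[t - 1]) > diff_tshd  # Condition (2)
        flags[t] = exceeds_level and rises_fast
    return flags


def classification_scores(truth_flags, pred_flags):
    """Return accuracy, precision, recall and F1 from two boolean sequences."""
    if len(truth_flags) != len(pred_flags):
        raise ValueError("flag sequences must have the same length")
    pairs = list(zip(truth_flags, pred_flags))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

In use, `detect_rtpe` would be applied separately to the ground-truth and predicted series with the same pair of thresholds, and the two flag sequences passed to `classification_scores`.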

Performance of ESD model
As shown in Table 2, by applying heterogeneous AOD data, the ESD model improves its prediction in RMSE by 12.68%, 11.45%, and 6.65% for +1day, +2day, and +3day, respectively. It also shows a better R value than the base model, with a prediction bias (underestimation) value of 0.304. The result demonstrates that the ESD's topological changes with the addition of new local AOD knowledge decrease the prediction error between the true and predicted values.
Table 2 also shows that RTP_2tile and RTP_4tile outperformed those three models for all target days in RMSE and R. For MBE, both the RTP models and the Base model show positive (overestimation) and negative values at different target days; however, RTP has fewer outliers than the other models, as shown in the scatter plots of observed vs. predicted PM2.5 (Fig 6). Overall, the results show that the RTP model captures RTPEs from remote AOD data, which helps improve the prediction performance for all days. Relative to the Base model, RTP_4tile provides the greatest improvement in RMSE: 25.77%, 28.96% and 21.17% for +1day to +3day. RTP_4tile also outperforms RTP_2tile on all three days, which demonstrates that enlarging the remote area helps improve the local prediction of PM2.5. This matches our idea of enlarging the remote area from two tiles to four tiles to capture more RTPEs.

Prediction of RTPEs
To answer the first question that we raised in the introduction, i.e., to predict RTPEs, we first predicted the local PM2.5 for the two stations using only the local PM2.5 and weather as input to the STRI_p model, together with the extracted spatio-temporal features from remote areas. We predicted the RTPEs by applying the thresholds Diff_tshd and Epa_tshd to the PM2.5 predictions. To observe the general performance, we used combinations of various thresholds: Epa_tshd = 30, 33, 36 and Diff_tshd = 0.5, 1.0, 1.5. For example, we combined Diff_tshd = 0.5 with Epa_tshd = 30, 33, 36 and also combined Diff_tshd = 1.0 with Epa_tshd = 30, 33, 36, as shown in both Tables 3 and 4. Tables 3-5 show the classification results in terms of accuracy (A), precision (Pr), recall (R), and F1 score (F1). The first column indicates the data used; for instance, "P" represents the local PM2.5 values, "EP" represents the remote spatio-temporal features from four tiles, and "W" represents the local weather features. We first noted the increases in accuracy and other metrics for many forecasts when EP and W are added to the model, which demonstrates the contribution of RTPEs to increasing PM2.5 concentrations at the stations. A similar trend is observed for the Epa_tshd, although the increases in accuracy are lower. For example, for the next 24-hour (+24h for short) predictions for the Wanli station with a Diff_tshd of 0.5, the accuracy is 0.72, 0.62, and 0.46 for Epa_tshd thresholds of 30, 33, and 36, respectively. This shows that as the PM2.5 threshold increases, the prediction of PM2.5 tends to be conservative and cannot follow the PM2.5 increase, resulting in low accuracy. In terms of precision and recall, the highest recall of 0.44 is observed at Wanli, whereas for Tamsui the highest is 0.41, both at thresholds of 0.5 (Diff_tshd) and 30 (Epa_tshd) for +24h. For precision, the highest scores are 0.24 and 0.30 at thresholds of 0.5 (Diff_tshd) and 36 (Epa_tshd) for the Wanli and Tamsui stations, respectively. Overall, these results demonstrate the effects of the STRI model on the RTPE prediction of these two stations in northern Taiwan.

Performance of RTP model
In this section we answer the second question about improving local PM 2.5 predictions using knowledge about RTPEs.We discussed the results of different training approaches for the RTP components, knowledge captured from RTPEs, and the results of RTP models in comparison to other models.
The effect of training strategies on prediction performance was evaluated by comparing the results from a full STRI model with those using the STRI_fe and STRI_p components, as described. STRI_p yields better prediction results than the full STRI model in both RMSE and R.
Secondly, the effects of the extracted remote pollutants and local features on the STRI_p component were evaluated. Experiments were conducted starting with one feature and incrementally adding features while observing the results in terms of RMSE and R. Fig 7c and 7d show the results of various features, including spatio-temporal features from two and four tiles (t12 and t1234) as well as the local PM2.5 (P) and weather (W) features from 18 stations. The model input sequence is therefore P, t12 (tiles h28v06 and h29v06), the remaining two tiles t34 (tiles h28v05 and h29v05), and then W.
A significant gap in R and RMSE performance was observed between using P alone and adding data on remote pollutants from tiles t12 (P+t12) and t1234 (P+t1234) (Fig 7(c) and 7(d)). The impact of expanding the range from tiles t12 to t1234 was also observed. This impact is not present between +4h and +24h, possibly because events from t34 require additional time to make an impact. This fits with our goal of expanding the range to four tiles to improve prediction by capturing more RTPEs. The gap was attributed to long-term rather than short-term (+4h to +24h) weather fluctuations. Generally, the results show that the STRI_fe component captures knowledge from RTPEs by learning spatio-temporal behavior from AOD data with their corresponding weather features from remote areas.
Thirdly, we evaluated the RTP model performance for the four seasons of the year using prediction results from +4h up to +48h. We divided the dataset into four periods of three months each and evaluated the prediction results using RMSE and R. Season one (S1) runs from January to March, season two (S2) covers April to June, season three (S3) runs from July to September, and season four (S4) covers the remaining months (October to December). Fig 8a and 8b show that RTP performs better in RMSE and R for S1 than for the other seasons at every prediction hour. S1 is the period when winter is at its peak and when the northeasterly winter monsoon wind transports pollutants from central and northeastern Asia to Taiwan [45]. Therefore, the better performance of RTP in S1 is probably due to the high level of pollutants in that period in the training datasets. In addition, S2 and S3 represent the spring and summer seasons, which normally have low levels of PM2.5; this is reflected in the performance of the RTP model in that period. S4 is the period when winter starts and the level of PM2.5 is expected to increase; however, the RTP performance does not reflect that scenario. Overall, the RTP performance matches the seasonal variation in the level of PM2.5 in the northern part of Taiwan.
Fourth, we evaluated the performance of the RTP model in comparison with the Base model [27] and other state-of-the-art ensemble models with the same settings: linear regression (LR) [46], AB [32], BG, RF [34,35], XGB [36][37][38], and a GAM [33,39]. Note that XGB yields better performance than the gradient boosting machine [47] because it uses a more regularized model formalization to control over-fitting [48]. We also show RTP performance when we use RTPEs from two tiles (RTP_2tile) and four tiles (RTP_4tile) to show the impact of remote pollutants on the local prediction of PM2.5.
Fig 8(c) shows the relative prediction improvements in RMSE of both RTP models and the other state-of-the-art models with respect to the Base model from +4h to +72h; the greater the improvement, the better the model does in comparison to the Base model. The figure shows that RTP_4tile yields the greatest improvements: 17%-30%, 23%-26%, and 18%-22% for +4h to +24h, +28h to +48h, and +52h to +72h, respectively. Similarly, RTP_2tile provides large improvements: 13%-24%, 17%-23%, and 13%-17% for +4h to +24h, +28h to +48h, and +52h to +72h, respectively. XGB and GAM, in turn, improve on the Base model by 6%-8%, 10%-12%, and 8%-11% for the same three ranges, with scores similar to those of the LR model. AB is outperformed by the Base model for most hours, as is RF, but to a lesser extent. RTP_2tile and RTP_4tile outperform the other models because of their composite neural network design [27], which is flexible enough to learn nonlinear associations between input features. The advantage of RTP_4tile over RTP_2tile again demonstrates the importance of enlarging the remote area to capture more RTPEs.
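As a small illustration of how a relative improvement in RMSE with respect to the Base model can be computed, the sketch below uses the common convention (RMSE_base - RMSE_model) / RMSE_base x 100. This convention is an assumption, since the exact formula is not stated in this section, and the function name `relative_improvement` and the numbers in the usage lines are hypothetical.

```python
def relative_improvement(rmse_base, rmse_model):
    """Percentage by which a model lowers RMSE relative to the Base model.

    Positive values mean the model beats the Base model; negative values
    mean the Base model is better. Assumed convention, not the authors' code.
    """
    if rmse_base <= 0:
        raise ValueError("rmse_base must be positive")
    return 100.0 * (rmse_base - rmse_model) / rmse_base

# Hypothetical RMSE values, for illustration only.
better = relative_improvement(10.0, 7.0)   # model has the lower RMSE
worse = relative_improvement(10.0, 11.0)   # model has the higher RMSE
```

Under this convention a model that is outperformed by the Base model, as reported for AB at most hours, would receive a negative score.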

Conclusion
To characterize the occurrence of remote transportation pollution events (RTPEs), we define an RTPE as a combination of thresholds and one-hour increments of PM 2.5 concentration, and then design an algorithm to classify PM 2.5 concentrations into RTPEs. The proposed RTPE definition and algorithm are evaluated for the area in northern Taiwan and the corresponding satellite data, and we believe that the proposed method can be applied elsewhere. In particular, the evaluation shows that a well-designed deep learning model can extract knowledge from satellite data that improves the accuracy of PM 2.5 prediction.
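The idea of combining a concentration threshold with a one-hour increment can be sketched as follows. This is only an illustration under stated assumptions: the parameter names `level_tshd` and `diff_tshd`, the use of an absolute hour-to-hour difference, and the example values are all hypothetical and do not reproduce the algorithm or thresholds used in the paper.

```python
def classify_rtpe(pm25, level_tshd, diff_tshd):
    """Flag each hour as a candidate RTPE (True) or not (False).

    An hour is flagged when both conditions hold (assumed rule):
      1. the PM2.5 concentration is at or above level_tshd, and
      2. the increase over the previous hour is at or above diff_tshd.
    The first hour has no predecessor and is never flagged.
    """
    if not pm25:
        return []
    flags = [False]
    for prev, cur in zip(pm25, pm25[1:]):
        flags.append(cur >= level_tshd and (cur - prev) >= diff_tshd)
    return flags

# Hypothetical hourly PM2.5 series and thresholds, for illustration only.
series = [10, 12, 40, 41, 20]
flags = classify_rtpe(series, level_tshd=35, diff_tshd=10)
```

In this toy series only the third hour is flagged: it exceeds the level threshold and rises sharply from the previous hour, whereas the fourth hour is high but nearly flat.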
It is worth noting that RTPEs can be captured using the proposed composite RTP model and then applied to improve the prediction of PM 2.5. The proposed RTP comprises two main components, a pre-trained Base model and the STRI model, which capture the knowledge of local PM 2.5 concentrations and RTPEs, respectively. In addition, STRI learns spatio-temporal characteristics of AOD data and weather features through its component STRI_fe, and then predicts local PM 2.5 through its other component STRI_p; the performance of both components is validated empirically. Local PM 2.5 predictions using the RTP model outperform the base model and other state-of-the-art ensemble models by 12%-30%, 12%-18%, and 10%-14% for +4h to +24h, +28h to +48h, and +52h to +72h, respectively. The ESD model, although it only considers local AOD data, still improves PM 2.5 prediction, as evidenced by RMSE scores that are lower than those of the base model by 12.68%, 11.45%, and 6.65% for +1day, +2day, and +3day, respectively.
The outstanding performance of the STRI model on the prediction of PM 2.5 will help government policy makers take measures, including controlling traffic in areas that are expected to have a high level of PM 2.5. They may also use that information for warning systems and to plan mitigation actions that reduce the risk to public health. In addition, the same information can be used by individuals to organize their activities, such as deciding whether to exercise outside.
Future work will focus on expanding the remote area, using data that are updated at a higher frequency compared to AOD data, and considering other possible features and models.
Fig 1, from 20N to 40N latitude and 106.60E to 155.30E longitude), during 2014-2016.
For CWB and EPA datasets: Select the range covering the north Taiwan area, especially for the Tamsui and Wanli stations (as shown in Fig 1, from 25.17N to 25.23N latitude and 121.55E to 121.60E longitude), during 2014-2016. The deep learning programs of this research for the above datasets can be found on GitHub: https://github.com/MCC-SINICA/Using-Satellite-Data-on-Remote-Transportation.

Fig 1. Study area. Left side: Four tiles (adapted from NASA) labeled 1 (h28v06), 2 (h29v06), 3 (h28v05), and 4 (h29v05), with Taiwan in the middle between tiles 1 and 2. Right side: The map of Taiwan after zooming, with the two stations Wanli (red circle) and Tamsui (green circle). https://doi.org/10.1371/journal.pone.0282471.g001

This subsection reports the tasks and the architectures of the following neural network models: STRI model, Base model, Local Satellite Data Model (LSD), RTP model and ESD model.
STRI model. As depicted in Fig 2, STRI predicts the PM 2.5 concentrations of the 18 EPA stations in Taipei using meteorological and AOD features from remote areas together with local meteorological features and PM 2.5 values. Because both the STRI model and its input dataset are larger than the GPU memory limit, and to reduce computational costs, STRI was divided into the STRI_fe and STRI_p submodels (i.e., components). Training has two phases: the first trains the STRI_fe model, and the second fine-tunes the STRI_p model on features fed from the STRI_fe model.

Fig 3 shows the Base model for the next 72-hour prediction. For the next 24-hour and 48-hour predictions, the Base model has the same architecture but different details, such as activation functions and weights (W).
RTP model. Consisting of a pre-trained Base model and an STRI_p component, RTP is a composite neural network that handles knowledge from RTPEs. The RTP structure has two components (Fig 4(b)); each component is trained separately, after which both are used as pre-trained components in the RTP model and combined through a series of ReLU functions (ReLU()) to improve overall local PM 2.5 prediction for the 18 EPA stations. The RTP model predicts PM 2.5 concentrations using composite techniques on input from the two components. The PM 2.5 predictions from the Base model (O) and STRI_p (X) for the 18 stations for the next 4, 8, up to 72 hours are used as input to RTP. RTP then predicts PM 2.5 at the same hour intervals for the same stations. The objective is to improve local PM 2.5 prediction by accounting for RTPEs.
ESD model. We changed the topology of the Base model to obtain the ESD model and applied it to the daily prediction of PM 2.5. The ESD model, with a series of ReLU functions (Fig 4(c)), is composed of the Base model and the LSD model. AOD data in LSD is composed of columnar

Fig 5. Classify remote pollutant. Prediction of +1h and +4h with ground truth at Wanli station. The dots and stars at the bottom show all peaks that meet both EPA and Differential conditions. https://doi.org/10.1371/journal.pone.0282471.g005

Fig 6. Observed PM 2.5 vs predicted PM 2.5 scatter plots. Plots of the association between observed and predicted PM 2.5 for the Base model and the proposed models for next-day (+1day) prediction. https://doi.org/10.1371/journal.pone.0282471.g006

Table 1. Sample size (N) summary for hourly (+hr) prediction.
we compared the daily PM 2.5 prediction results of the ESD model with those of its components and the RTP_ktile model (with k = 2, 4 representing the number of tiles), where Δ% is the relative improvement in RMSE over the Base model. The Base model outperforms the LSD model for +1day in both R and RMSE, but for +2day and +3day, the LSD component outperforms the Base model due to the application of AOD data. The prediction underestimation (negative) values of the Base model are low compared with those of LSD; however, the scatter plots (Fig 6) of observed and predicted PM 2.5 values show fewer outliers than the LSD model.

Table 3. Classification results with Diff_tshd = 0.5.
Since training a full STRI model on a single GPU can be challenging, STRI_fe for feature extraction and STRI_p for prediction were used. As STRI_p consists of a small number of layers, it converges quickly during training, leaving more room for model fine-tuning. Thus, the improved performance of STRI_p validates our idea of breaking the full STRI model into two components.