Prediction of road dust concentration in open-pit coal mines based on multivariate mixed model

Dust pollution in open-pit coal mines significantly affects the health of staff, the normal progress of mining operations, and the surrounding environment, and haul roads are the largest dust source. Analyzing the factors that influence road dust concentration and establishing a model that predicts it scientifically and effectively are therefore of practical significance for reducing dust hazards. This paper uses hourly air quality and meteorological data from an open-pit coal mine in Tongliao City, Inner Mongolia Autonomous Region, collected from January 1, 2020, to December 31, 2021. We build a CNN-BiLSTM-Attention multivariate hybrid model, consisting of a convolutional neural network (CNN), a bidirectional long short-term memory network (BiLSTM), and an attention mechanism, to predict PM2.5 concentration over the next 24 h. We construct prediction models with both parallel and serial structures and run extensive experiments based on the change period of the data to determine the optimal configuration and the input and output sizes. We then compare the proposed model with Lasso regression, SVR, XGBoost, LSTM, BiLSTM, CNN-LSTM, and CNN-BiLSTM models for short-term prediction (24 h) and long-term prediction (48 h, 72 h, 96 h, and 120 h). The results show that the proposed CNN-BiLSTM-Attention multivariate hybrid model has the best prediction performance. For the short-term forecast (24 h), the mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R2) are 6.957, 8.985, and 0.914, respectively, and the evaluation indicators for the long-term forecasts (48 h, 72 h, 96 h, and 120 h) are also superior to those of the comparison models. Finally, we verified the model with field-measured data, obtaining MAE, RMSE, and R2 of 3.127, 3.989, and 0.951, respectively; the model fit was good.


Introduction
wind speed, air pressure, and relative humidity as input to establish a multiple linear regression model of PM2.5 and PM10 to predict indoor PM2.5 and PM10 concentrations. Because atmospheric processes are nonlinear, linear regression has difficulty accurately describing the nonlinear relationships, and prediction accuracy can suffer [10]. X Xi et al. built a combination of the grey-system prediction model GM(1,1) and ARMA(p, q) to predict PM2.5 concentration in Xi'an; the results show that the combination model has the smallest relative error [11]. Y Zhang et al. screened out the main parameters influencing PM2.5 prediction with the MIV algorithm and built an ARIMA model to predict PM2.5 concentration in Guilin; the results showed that MIV parameter screening can improve the model's prediction accuracy [12]. X Zhang et al. used an exponential smoothing algorithm to predict PM2.5 concentrations and a Markov model to correct them; the corrected predictions were more accurate than the uncorrected ones [13]. Y Ju et al. compared the predictive ability of multiple linear and nonlinear regression models for PM2.5 concentration in Tianmen City; the results showed that the multiple nonlinear regression model predicted better [14].
(3) Prediction methods based on machine learning use learning algorithms to analyze large amounts of related historical data and learn the nonlinear transformation laws between them to make predictions; they have been commonly used by scholars in recent years. Widely used machine learning models include random forests, support vector machines, and artificial neural networks. Compared with traditional prediction methods, machine learning simplifies the prediction process and has advantages in accuracy. Y Wang established a random forest-Markov hybrid model to predict dust concentration in the Haerwusu open-pit mine; the results show that the hybrid model predicts better than the single random forest model [15]. R Wang et al. combined a support vector machine with a genetic algorithm, trained and tested the model on common datasets, and compared it with other machine learning methods; the results show that GA-SVR has higher prediction accuracy [16]. J Yin et al. established a multilevel residual-modified least squares support vector regression model based on wavelet decomposition to predict PM2.5 concentration in Jinan, China; the results show that the model predicts better, especially in heavy-pollution weather [17]. W Zhong et al. first used the mRMR algorithm to select data features and then used the XGBoost model to predict PM2.5 concentration for the next hour; the results showed that selecting features first improves prediction accuracy [18]. J Zhong et al. built a LightGBM algorithm based on meteorological data to predict hourly PM2.5 concentration [19]. J Kang et al. built KNN, BP, SVR, GPR, XGBoost, and random forest models to predict PM2.5 concentration; the comparative results show that the XGBoost model is more accurate than the others [20].
As a newer research direction within machine learning, deep learning has gradually become a frequently used prediction method with better prediction performance, as it enhances the ability to represent complex features. Research on PM2.5 concentration prediction with deep learning mainly applies deep neural network models to extract sufficient historical information from time series. Standard deep neural network models include recurrent neural networks (RNN), convolutional neural networks (CNN), and deep belief networks (DBN). Y Zhang predicted the dust concentration of the Haerwusu open-pit mine with an LSTM recurrent neural network and compared it with traditional regression and random forest models; the LSTM's predictions were better than both [21]. M Teng et al. proposed a hybrid model combining a long short-term memory network and a convolutional neural network with monitoring information from adjacent stations to predict PM2.5 concentration, and compared it with single models; the hybrid model was more accurate [22]. F Jiang et al. used a hybrid of CEEMDAN and deep temporal convolutional neural networks to predict hourly PM2.5 concentrations; the hybrid model achieved better accuracy than a single model [23]. J Kang et al. improved XGBoost and LSTM to obtain a variable-weight combination model, XGBoost-LSTM (Variable), to predict hourly PM2.5 concentration in Shanghai: the processed datasets were input into the two models, and the final predictions were obtained with a variable-weight combination method based on residual improvement. They chose XGBoost, LSTM, and SVR for comparative experiments.
The results show that the hybrid model is more accurate than the single models [24]. M Zhu and J Xie established a CNN-BiLSTM hybrid model to predict PM2.5 concentration; the experiments show that combining monitoring data from the target site and nearby sites improves accuracy [25]. Kow P Y et al. proposed a fusion of multiple convolutional neural networks (MCNN) and a back-propagation neural network (BPNN) and used MCNN-BP to predict PM2.5 at 74 stations in Taiwan [26]. K Gu et al. proposed a composite LSTM model, GRA-ICEEMD-LSTM, using GRA for variable screening, ICEEMD for data processing, and LSTM for PM2.5 prediction; the composite model has higher accuracy and shorter training time than single prediction models [27]. Kim Yong-been et al. compared the PM2.5 prediction performance of LSTM, GRU, and BiLSTM models on data from Seoul, Daejeon, and Busan in South Korea; BiLSTM showed high accuracy in long-term prediction [28]. Scholars' research has thus developed from single neural network models to hybrid models, and hybrid models have proved more accurate than single ones. Dust concentration has many influencing factors, each affecting its changes differently, yet most current CNN- and LSTM-based prediction methods do not consider this. By introducing an attention mechanism, the prediction model can capture how variable features at different past times influence future dust concentration. Therefore, this paper proposes a CNN-BiLSTM-Attention hybrid model based on the Keras framework to predict road dust concentration in open-pit coal mines. Keras can easily define and train almost all types of deep learning models [29], and CNN can extract sufficient feature information from the input variables [30].
BiLSTM, as a variant of LSTM, not only solves the gradient vanishing and gradient explosion problems of recurrent neural networks but also reads the data in both the forward (left-to-right) and reverse (right-to-left) directions, effectively using both past and future information [31]. The model also introduces an attention mechanism that analyzes the importance of all features and assigns corresponding weights, so the model obtains the proportional influence of different features on dust concentration and outputs the final forecast results [32]. The hybrid model combines the advantages of each algorithm to improve prediction accuracy and better meet actual forecasting needs. The source code of the prediction model is available on GitHub at https://github.com/Zongwei-Yang/Thesis-code.git.
The content of this paper is as follows. The first part is the introduction. The second part is the basic theory of each model. The third part is the related processing and analysis of experimental data. The fourth part is the construction of the prediction model. The fifth part is the testing process. The sixth part is the conclusion.

CNN
A Convolutional Neural Network (CNN) is a feed-forward neural network with convolution computation and a deep structure. A feed-forward neural network adopts a one-way, multi-layer structure: the whole network transmits signals in one direction, neurons are arranged in layers, and each layer contains several neurons. Each layer of neurons is connected only to the previous layer; it receives the previous layer's output and passes its own output to the next layer. The overall structure of the CNN model consists of five parts: the input layer, convolution layer, pooling layer, fully connected layer, and output layer. The input layer of a convolutional neural network can process multidimensional data. The convolution layer extracts features from the input data, and the pooling layer then reduces the dimensionality of those features to extract them further. Finally, the training results of the model are output through the fully connected layer and the output layer [33,34]. Fig 1 shows the basic structure of the CNN model.
The mathematical expression of the CNN convolution process is:

$$Z^{l+1}(i,j)=\left[Z^{l}\otimes w^{l+1}\right](i,j)+b=\sum_{k=1}^{K}\sum_{x=1}^{f}\sum_{y=1}^{f}\left[Z_{k}^{l}\left(s_{0}i+x,\,s_{0}j+y\right)w_{k}^{l+1}(x,y)\right]+b$$

$$L^{l+1}=\frac{L^{l}+2p-f}{s_{0}}+1$$

In the formulas, $Z^{l}$ is the convolution input of layer $l+1$; $Z^{l+1}$ is the convolution output of layer $l+1$; $L^{l+1}$ is the size of $Z^{l+1}$; $Z(i,j)$ is the pixel of the corresponding feature map; $K$ is the number of feature-map channels; $f$ is the convolution kernel size; $s_{0}$ is the convolution stride; $p$ is the number of zero-padding layers; $w$ is the weight coefficient; and $b$ is the bias. The summation in the formula is equivalent to computing a cross-correlation.
The activation function is ReLU. Compared with the Sigmoid and Tanh functions, the ReLU function improves convergence speed [35], and because its gradient is simple to compute, it effectively mitigates gradient vanishing and gradient explosion [36]. The mathematical expression of the ReLU function is:

$$f(x)=\max(0,x)$$
The data used in this paper are time series data. Since time series data change only along the time axis, a one-dimensional convolutional neural network (1D-CNN) processes them more effectively [37]. Therefore, this paper uses the 1D-CNN model to extract the local features of the data.
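A minimal sketch of the two operations a 1D-CNN layer applies to a time series window, convolution (implemented as cross-correlation, as noted above) followed by ReLU. The toy signal and kernel values are illustrative assumptions, not parameters from the paper's model.

```python
import numpy as np

def relu(x):
    # ReLU activation: f(x) = max(0, x)
    return np.maximum(0.0, x)

def conv1d(series, kernel, stride=1, bias=0.0):
    # Valid (no-padding) 1-D cross-correlation, as used in Conv1D layers.
    f = len(kernel)
    out_len = (len(series) - f) // stride + 1   # L_out = (L - f) / s0 + 1
    return np.array([
        np.dot(series[i * stride:i * stride + f], kernel) + bias
        for i in range(out_len)
    ])

x = np.array([1.0, 2.0, -1.0, 3.0, 0.0, -2.0])  # toy hourly signal
k = np.array([1.0, 0.0, -1.0])                  # toy difference kernel
features = relu(conv1d(x, k))                   # non-negative local features
```

Note how the output length follows the size formula above: with $L=6$, $f=3$, $p=0$, and $s_0=1$, the layer produces $6-3+1=4$ features.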

BiLSTM
The recurrent neural network (RNN) has a particular advantage in learning the nonlinear characteristics of sequences and performs well on time series data. However, RNNs suffer from a long-term dependence problem: when the input time series is too long, the model loses the ability to learn from early data. To solve this problem, researchers proposed a variant of the recurrent neural network, the long short-term memory network (LSTM), based on a 'gate' structure [38].
The LSTM model effectively solves the gradient vanishing and gradient explosion problems of the recurrent neural network [39]. However, during training, the LSTM learns in time order from front to back, so at any moment it knows only the sample information before that moment and cannot consider the information after it. This training method leads to a low data utilization rate and cannot fully mine the inherent characteristics of the data. To solve this problem, a reverse LSTM can be added to the forward LSTM to form a bidirectional long short-term memory network (BiLSTM) for prediction. The BiLSTM model comprises a forward LSTM layer and a reverse LSTM layer. During training, it captures the internal relationships between the data at a given moment, the historical data, and the future data to obtain global data characteristics [40], improving the utilization rate of the sample data and the prediction accuracy to a certain extent. The calculation process is as follows:

$$h_{t}^{F}=f_{F}\left(w_{1}x_{t}+w_{2}h_{t-1}^{F}\right)$$

$$h_{t}^{R}=f_{R}\left(w_{3}x_{t}+w_{5}h_{t+1}^{R}\right)$$

$$h_{t}=f\left(w_{4}h_{t}^{F}+w_{6}h_{t}^{R}\right)$$

In the formulas, $x_t$ is the current input; $h_{t-1}^{F}$ is the output state of the forward LSTM layer at the previous moment; $h_{t+1}^{R}$ is the output state of the reverse LSTM layer at its previous moment; $w_1$, $w_3$ are the weight matrices from the input layer to the forward and reverse LSTM layers; $w_2$, $w_5$ are the weight matrices from the forward and reverse LSTM layers to their own recurrent connections; $w_4$, $w_6$ are the weight matrices from the forward and reverse LSTM layers to the output layer; $f_F(\cdot)$, $f_R(\cdot)$ are the cell computations; $f(\cdot)$ is the function that splices the outputs of the forward and reverse LSTM layers; $h_t^F$ is the output of the forward LSTM layer; $h_t^R$ is the output of the reverse LSTM layer; and $h_t$ is the final output.
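A conceptual sketch of the bidirectional pass described above. For brevity, a simple tanh cell stands in for the full LSTM cell, and the scalar weights are illustrative assumptions; the point is that each output step concatenates a forward state (which has seen the past) with a reverse state (which has seen the future).

```python
import numpy as np

def cell(x_t, h_prev, w_in, w_rec):
    # One recurrent step: h_t = tanh(w_in * x_t + w_rec * h_prev)
    return np.tanh(w_in * x_t + w_rec * h_prev)

def bidirectional(seq, w1=0.5, w2=0.3, w3=0.5, w5=0.3):
    hF, hR = 0.0, 0.0
    fwd, rev = [], []
    for x_t in seq:               # forward layer: left to right
        hF = cell(x_t, hF, w1, w2)
        fwd.append(hF)
    for x_t in reversed(seq):     # reverse layer: right to left
        hR = cell(x_t, hR, w3, w5)
        rev.append(hR)
    rev.reverse()
    # h_t splices forward and reverse states, so every output step
    # carries both past and future context.
    return [np.array([f, r]) for f, r in zip(fwd, rev)]

out = bidirectional([1.0, -1.0, 2.0])  # three 2-dim bidirectional states
```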

Attention mechanism
When processing long-sequence multidimensional feature data, different variable features affect the target variable differently. Important features contain more key information and have a greater impact on the target variable, so introducing an attention mechanism that gives each variable feature a different weight and highlights key information makes it possible to mine data associations more effectively. It also alleviates information overload, significantly reduces model computation, and makes reasonable use of computing resources to achieve the best training effect [41]. Fig 3 shows the structure of the attention mechanism.
First, calculate the similarity score between PM2.5 concentration and each of the other variables. Then, use the Softmax function to normalize the scores and obtain the attention distribution for each variable. Finally, the input variables other than the target variable are weighted by the attention distribution and summed to obtain the final attention value. The calculation process is as follows:

$$s_{n}=v^{T}\tanh\left(WQ+UX_{n}\right)$$

$$\alpha_{n}=\mathrm{softmax}\left(s_{n}\right)=\frac{\exp\left(s_{n}\right)}{\sum_{i=1}^{N}\exp\left(s_{i}\right)}$$

$$A=\sum_{n=1}^{N}\alpha_{n}X_{n}$$

In the formulas, $v^{T}$ is the weight vector; $W$ and $U$ are weight matrices; $Q$ is the target variable; $X_n$ is an input variable other than the target variable; $\alpha_n$ is the attention distribution; and $A$ is the final attention value.
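The three steps above can be sketched in a few lines of numpy. The vectors and matrices (`Q`, `X`, `v`, `W`, `U`) are toy values chosen for illustration, not trained parameters.

```python
import numpy as np

def attention(Q, X, v, W, U):
    # 1) additive similarity score of target Q with each input variable
    scores = np.array([v @ np.tanh(W @ Q + U @ x) for x in X])
    # 2) Softmax normalisation -> attention distribution alpha_n
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()
    # 3) weighted sum of the inputs -> final attention value A
    return alpha, (alpha[:, None] * X).sum(axis=0)

Q = np.array([1.0, 0.0])                  # target-variable representation
X = np.array([[0.5, 0.5], [2.0, -1.0]])   # other input variables
v = np.array([1.0, 1.0])                  # toy weight vector
W = np.eye(2)                             # toy weight matrices
U = np.eye(2)
alpha, A = attention(Q, X, v, W, U)
```

Subtracting `scores.max()` before exponentiating is the usual numerically stable form of Softmax and does not change the resulting distribution.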

Research objective
The research object of this paper is road dust in open-pit coal mines. According to International Organization for Standardization (ISO) regulations, dust is suspended solid matter with a particle size less than

Research data
Since training deep learning models requires a large amount of data, and the unit time span of the original data determines the length of the prediction cycle, the dataset used to train the model in this paper consists of hourly air quality and meteorological data from January 1, 2020, to December 31, 2021, collected at an open-pit coal mine in Tongliao City, Inner Mongolia Autonomous Region, supporting both short-term and long-term forecasts. The 17520 sets of data include 12 types of factors: PM2.5, PM10, SO2, NO2, CO, O3, temperature, relative humidity, wind speed, wind direction, precipitation, and atmospheric pressure. After model training is complete, the measured data from dust monitoring equipment are used to test the model's performance. Table 1 shows the variables included in the dataset, and Fig 4 shows each variable's original time series.
In the figure, the horizontal axis represents the 17520 sets of hourly data from January 1, 2020, to December 31, 2021, and the vertical axis represents the value of each variable. The temperature shows a pronounced periodic change, and the concentrations of SO2, NO2, CO, and O3, as well as relative humidity and wind speed, also change with some periodicity. Therefore, the input size of the prediction model is determined by the length of the period, which can improve the model's prediction accuracy to a certain extent. However, the figure also makes it evident that the dataset contains missing values and outliers, so these data need to be preprocessed to enhance the quality of the sample data, a prerequisite for further analysis.

Data processing

3.3.1 Data preprocessing.
(1) Missing values and outliers. During data collection, uncontrollable factors such as sensor failure, network problems, and bad weather may cause missing and abnormal data, which affect further analysis and model training. First, mark the missing values in the dataset as NaNs via Python, then fill them in with the mean method. For outliers, we use quartile analysis: let $Q_1$, $Q_2$, and $Q_3$ denote the first, second, and third quartiles, i.e., the values at the 25%, 50%, and 75% positions after all values in the sample data are sorted in ascending order. Values less than $Q_1-1.5(Q_3-Q_1)$ or greater than $Q_3+1.5(Q_3-Q_1)$ are outliers. Since the sample data are time series, each outlier is corrected with the mean of the data immediately before and after it.

(2) Data normalization. This paper studies a multivariate problem: the essence of training the model is to establish relationships between different variables, but the variables have different physical meanings and dimensions. Left unprocessed, this would seriously affect training, so all variables are normalized, scaling the data of every variable to between 0 and 1. Normalization speeds up model convergence, reduces training time, and improves prediction accuracy. The normalization formula is:

$$x^{*}=\frac{x-\min x}{\max x-\min x}$$

In the formula, $x^{*}$ is the normalized result, $x$ is the original data, $\min x$ is the minimum value in the data, and $\max x$ is the maximum value.
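A sketch of the preprocessing pipeline above for a single numpy series with NaNs: mean-fill missing values, correct IQR outliers with the mean of their neighbours, then min-max normalise. The toy series and the edge-case handling (series mean at the boundaries) are assumptions for illustration.

```python
import numpy as np

def preprocess(x):
    x = x.astype(float).copy()
    # (1) fill missing values with the series mean
    x[np.isnan(x)] = np.nanmean(x)
    # (2) quartile rule: outliers lie outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR]
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    outlier = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
    # replace each outlier with the mean of its neighbours
    for i in np.where(outlier)[0]:
        lo = x[i - 1] if i > 0 else x.mean()
        hi = x[i + 1] if i < len(x) - 1 else x.mean()
        x[i] = (lo + hi) / 2
    # (3) min-max normalisation: x* = (x - min) / (max - min)
    return (x - x.min()) / (x.max() - x.min())

raw = np.array([10.0, np.nan, 12.0, 11.0, 300.0, 13.0, 9.0])
clean = preprocess(raw)   # no NaNs, spike at index 4 smoothed, values in [0, 1]
```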
3.3.2 Data set partitioning. The multivariate hybrid model established in this paper is a supervised learning model, so the preprocessed dataset must be divided proportionally before training. Training with all the data could lead to overfitting, so the dataset is split into a training set, validation set, and test set at a ratio of 8:1:1. We train the model on the training set, validate it on the validation set, adjust the hyperparameters based on the results, and iterate until the validation error identifies an optimal model. We then fix the chosen hyperparameters and model settings, merge the validation set into the training set to form a new training set, train a new model, and finally evaluate its performance on the test set.
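The 8:1:1 split can be sketched as below. Because the data are a time series, the split is chronological rather than shuffled (an assumption consistent with the sliding-window samples described later); 17520 matches the dataset size stated above.

```python
import numpy as np

def split_811(data):
    # chronological 8:1:1 split -- no shuffling for time series data
    n = len(data)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

data = np.arange(17520)
train, val, test = split_811(data)
```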

Correlation analysis
To train a robust and efficient prediction model, in addition to a large amount of sample data, it is necessary to select variables that influence changes in dust concentration, since variable selection greatly affects prediction accuracy. We should therefore analyze the correlation between dust concentration and the other variables and choose appropriate variables. This paper uses the Pearson correlation coefficient, calculated as follows:

$$r=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}}$$

In the formula, $x_i$ and $y_i$ are sample values, $\bar{x}$ and $\bar{y}$ are the sample means, and $r$ is a statistical indicator of the correlation between variables. The value range of $r$ is [-1, 1]: the closer the absolute value of $r$ is to 1, the stronger the correlation between the two variables, and when $r$ is 0 there is no correlation. If $r>0$ the correlation is positive; if $r<0$ it is negative. Drawing a Pearson correlation heat map of the variables gives the pairwise correlation coefficients. Table 2 shows the results of the correlation and significance analysis between PM2.5 and the other variables.
Analysis of the table shows the following: the correlation coefficient between PM2.5 and PM10 is 0.97 (P<0.01), a significant and extremely high positive correlation; between PM2.5 and SO2, 0.54 (P<0.01), a significant positive correlation; between PM2.5 and NO2, 0.52 (P<0.01), a significant positive correlation; between PM2.5 and CO, 0.63 (P<0.01), a significant positive correlation; between PM2.5 and O3, -0.38 (P<0.01), a significant negative correlation; between PM2.5 and temperature, -0.22 (P<0.01), a significant negative correlation; between PM2.5 and relative humidity, 0.20 (P<0.01), a significant positive correlation; between PM2.5 and wind speed, -0.32 (P<0.01), a significant negative correlation; between PM2.5 and wind direction, -0.078 (P<0.05), a significant but extremely low negative correlation; between PM2.5 and precipitation, -0.008 (not significant), a coefficient close to 0, so the two are considered uncorrelated, which relates to precipitation intensity and duration obscuring any apparent relationship with PM2.5; and between PM2.5 and atmospheric pressure, 0.21 (P<0.01), a significant positive correlation.
Because the correlation between PM2.5 concentration and wind direction is extremely low, and there is no correlation with precipitation, these two variables are removed from the dataset. PM2.5 and PM10 are strongly correlated, and related research shows an inclusion relationship between them: PM2.5 accounts for 50%~70% of PM10. Therefore, we use PM2.5 for the research on predicting road dust concentration in open-pit coal mines.
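The Pearson coefficient defined above is straightforward to compute in numpy. The short series below are invented toy data that merely mimic the qualitative pattern reported in the table (PM10 co-varying with PM2.5, wind speed varying inversely), not the mine's measurements.

```python
import numpy as np

def pearson_r(x, y):
    # r = sum((x - x_bar)(y - y_bar)) / sqrt(sum((x - x_bar)^2) * sum((y - y_bar)^2))
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

pm25 = np.array([12.0, 20.0, 35.0, 18.0, 50.0])
pm10 = np.array([20.0, 33.0, 55.0, 30.0, 80.0])  # co-varying with PM2.5
wind = np.array([5.0, 4.0, 2.0, 4.5, 1.0])       # inversely varying

r_pm10 = pearson_r(pm25, pm10)  # strongly positive
r_wind = pearson_r(pm25, wind)  # strongly negative
```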

Data sample modeling
The model trained in this paper is a supervised learning model, so the data must be arranged in 'feature parameters-label' form to generate training samples. As shown in Fig 6, the data (feature parameters) at historical times are used to predict the PM2.5 concentration (label) at the next time step, or at multiple time steps after time T.
The dataset in this paper is multivariate time series data, and adjacent data are strongly correlated. To preserve this correlation as much as possible, training samples are generated with a sliding time window. The data at each time step include air quality data and meteorological data: the data at time T_i are T_i1, T_i2, ..., T_ij, where j is the number of variables at each moment, and the first variable of each set is the PM2.5 concentration. For example, T_i1 represents the PM2.5 concentration at time T_i.
(1) Suppose the feature parameters are all variable data at T_i and the label is the PM2.5 concentration at time T_i+1. This is multidimensional single-step prediction. Fig 7 shows the sample generation process for the training data.
(2) Suppose the feature parameters are all variable data at time T_i and the labels are the PM2.5 concentrations at T_i+1 and T_i+2. More generally, the labels can be set to the PM2.5 concentrations at times T_i+1, T_i+2, ..., T_i+n; when the number of labels exceeds 1, this is multidimensional multi-step prediction. Fig 8 shows the sample generation process for the training data in this case.
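The sliding-window sample generation described above can be sketched as follows: each sample pairs `n_in` steps of all variables (features) with the next `n_out` PM2.5 values (labels). As in the text, column 0 is assumed to hold PM2.5; the tiny 10-step, 4-variable array is a toy stand-in for the real dataset.

```python
import numpy as np

def make_samples(data, n_in, n_out):
    # data: (time steps, variables); column 0 is PM2.5
    X, y = [], []
    for i in range(len(data) - n_in - n_out + 1):
        X.append(data[i:i + n_in])                     # all variables, n_in steps
        y.append(data[i + n_in:i + n_in + n_out, 0])   # PM2.5 for next n_out steps
    return np.array(X), np.array(y)

data = np.arange(40, dtype=float).reshape(10, 4)  # 10 steps, 4 variables
X, y = make_samples(data, n_in=3, n_out=2)        # multi-step samples
```

With the paper's settings (15 days of 24 hourly records and ten variables) the same function would yield feature windows of shape (360, 10) and 24-value labels.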

Predictive model modeling
The predictive model can be built as a hybrid model with either of two structures. The first is a parallel structure, which divides feature extraction into two channels. Channel 1 uses 1D-CNN to extract local nonlinear features of the data along the positive direction of the time axis. Channel 2 uses BiLSTM to extract the data's bidirectional global temporal characteristics and assigns weights to variables through an attention mechanism; the two channels' features are then merged. Finally, the fully connected (Dense) layer extracts the correlation information between features, maps it to the output space for regression prediction, and outputs the prediction result. Fig 9 shows the parallel structure model. The second is a serial structure: the first part is a hybrid CNN-BiLSTM model, and the second part introduces the attention mechanism. 1D-CNN and BiLSTM sequentially extract features in the same channel and superimpose the features onto the same vector; the attention mechanism is then applied, and the final output passes through the fully connected (Dense) layer. The next section sets up a comparative experiment to compare the performance of the two structures and select the optimal one for further experiments. Fig 10 shows the CNN-BiLSTM-Attention serial structure model.
As shown in the figure, the CNN part contains two one-dimensional convolutional layers (Conv1D) to extract data features; the numbers of convolution kernels are 32 and 64, respectively, and the stride is 1. The data are then compressed and flattened through a one-dimensional max pooling layer (MaxPooling1D) and a Flatten layer. The features extracted by the CNN are input into the BiLSTM to further mine the time series features. A Dense layer completes the output of the CNN-BiLSTM part, which passes through a Dropout layer into the Attention layer, then through a Flatten layer and another Dropout layer. Two fully connected (Dense) layers perform the regression prediction and complete the model's training; the Dropout layers prevent overfitting, and the Dense layers can enhance the data features. Because the model is built on Keras, we use the EarlyStopping mechanism to improve training efficiency and prevent overfitting, and the Glorot uniform initializer to initialize the network's weights.
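A Keras sketch of a serial CNN-BiLSTM-Attention stack of the kind described above, not the authors' exact network. The Conv1D filter counts (32 and 64), stride 1, input shape (360 steps, 10 variables), and 24-step output follow the text; the kernel size, LSTM units, dropout rates, and the use of Keras's built-in dot-product `Attention` layer as a self-attention stand-in are assumptions.

```python
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(360, 10))                 # 15 days x 24 h, 10 variables
x = layers.Conv1D(32, kernel_size=3, strides=1, activation="relu")(inputs)
x = layers.Conv1D(64, kernel_size=3, strides=1, activation="relu")(x)
x = layers.MaxPooling1D(pool_size=2)(x)                # compress features
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Dropout(0.2)(x)
x = layers.Attention()([x, x])                         # self-attention stand-in
x = layers.Flatten()(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(24)(x)                          # next 24 h of PM2.5

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
```

In training one would then pass `tensorflow.keras.callbacks.EarlyStopping` to `model.fit`, matching the EarlyStopping mechanism mentioned above; Glorot uniform is already Keras's default kernel initializer for these layers.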

Evaluation indicators
To measure and quantify the model's prediction accuracy, we select the mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R2) as the evaluation indicators, calculated as follows:

$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|y_{i}-\hat{y}_{i}\right|$$

$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}}$$

$$R^{2}=1-\frac{\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}$$

In the formulas, $n$ is the number of PM2.5 concentration data points, $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean of the actual values. MAE and RMSE reflect the error between actual and predicted values; the smaller they are, the higher the prediction accuracy. $R^2$ is the coefficient of determination; the closer it is to 1, the better the model predicts.
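The three indicators are easy to implement directly in numpy; the toy `y_true`/`y_pred` values below are illustrative, not the paper's results.

```python
import numpy as np

def mae(y, y_hat):
    # mean absolute error
    return np.abs(y - y_hat).mean()

def rmse(y, y_hat):
    # root mean square error
    return np.sqrt(((y - y_hat) ** 2).mean())

def r2(y, y_hat):
    # coefficient of determination: 1 - SS_res / SS_tot
    ss_res = ((y - y_hat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 39.0])
```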

Performance comparison of different structural models
For time series models, a suitable input and output size must be determined to obtain higher prediction accuracy: how much historical data to input and how much forecast data to output. A variable time span that is too long increases the model's computational complexity, while one that is too short prevents the model from fully extracting the feature information in the data. Because the dataset changes periodically, periodicity can serve as a characteristic parameter of the prediction model, and a suitable input and output size can be found by determining the change period. We plot the changes in PM2.5 concentration over 5, 10, 15, and 20 days in the first month of the dataset. In Fig 11, there is one obvious peak of PM2.5 concentration within 5 days, two within 10 days, three within 15 days, and four within 20 days, so we take five days as one cycle.
According to the determined period, we set different input and output sizes (the specific settings are shown in Tables 3 and 4) and conduct repeated experiments on the prediction models of the two structures. The experiments are divided into two groups, with the following results.
1. Parallel structure model
2. Serial structure model
Comparative analysis of the experimental results shows that, for the same input and output sizes, the prediction error of the serial structure model is lower than that of the parallel structure model. The serial structure model performs best when the input size is 24*15 and the output size is 24: the MAE and RMSE reach their minimum, and the prediction error is lowest. Therefore, the subsequent experiments use the CNN-BiLSTM-Attention serial structure model with an input size of 24*15 and an output size of 24. The input matrix of the training model is 360*10: 24 means 24 hours per day, 15 means 15 days of sample data (24 sets of sample data per day, giving 360 time steps), and each set of sample data contains ten variables. The output of the training model is the 24 h PM2.5 concentration. Table 5 lists the model's hyperparameter settings.

Short-term forecast
To further verify the model's performance, we randomly selected four groups of sample data from the test set for experiments. Each group included 15 days of time series data, and the PM2.5 concentration for the 24 hours after those 15 days was output; Fig 12 shows the results. To verify the model's generalization, ten groups of sample data were randomly selected from the test set for experiments. Table 6 shows the resulting evaluation indexes.
The evaluation indexes in Table 6 show that, except for a few experiments with larger prediction errors, good prediction results were achieved overall. To verify the superiority of the proposed model, we selected traditional regression, machine learning, single deep learning, and hybrid deep learning models for experiments on the same dataset: Lasso regression, SVR, XGBoost, LSTM, BiLSTM, CNN-LSTM, CNN-BiLSTM, and the CNN-BiLSTM-Attention model, with the hyperparameters of all models tuned to their optima for the comparative experiments. When the true value of PM2.5 concentration is between 15 and 105 μg/m3, each model predicts well and the predicted values fit the actual values closely; when the actual value is below 15 μg/m3 or above 105 μg/m3, the fitting effect is poor. The results show that the traditional Lasso regression model performs worst, with the curve of predicted values deviating greatly from the actual values. Fig 14 shows scatter plots of the actual versus predicted PM2.5 concentrations for each model; the closer the points lie to the diagonal, the smaller the error between actual and predicted values. The figure shows intuitively that the traditional model predicts worst, followed by the machine learning models (with SVR performing worse than XGBoost), and that the deep learning models are significantly better than the traditional and machine learning models. The hybrid deep learning models are more accurate than the single deep learning models, and the scatter of CNN-BiLSTM-Attention lies closer to the diagonal than that of CNN-LSTM and CNN-BiLSTM, indicating higher prediction accuracy.
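The comparisons above all rest on the three evaluation indexes MAE, RMSE, and R2. Their standard definitions can be sketched directly (the small test arrays are illustrative, not from the paper's dataset):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean square error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Toy PM2.5 values (μg/m3) for illustration only.
y_true = np.array([30.0, 45.0, 60.0, 75.0])
y_pred = np.array([32.0, 44.0, 58.0, 78.0])
print(mae(y_true, y_pred))   # -> 2.0
print(rmse(y_true, y_pred))  # -> about 2.121
print(r2(y_true, y_pred))    # -> 0.984
```

Lower MAE and RMSE and an R2 closer to 1 indicate a better fit, which is the sense in which the proposed model "outperforms" the baselines in Table 6 and Fig 14.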
To compare the prediction effects of the models more intuitively, we selected the first 48 actual values and the corresponding predicted values of each model and drew line graphs. As shown in Fig 15, blue represents the actual values and the red dotted line represents the predicted values of the CNN-BiLSTM-Attention hybrid model. The predictions of the proposed model are the closest to the actual values, and its prediction accuracy is the highest. The model achieves good results because it can extract local nonlinear features and bidirectional global temporal features of the data and assign weights to the variables through the attention mechanism, thereby fully mining the data information.

Long-term forecast
To test the robustness of the proposed model, long-term prediction experiments are also required. This section sets up prediction experiments of 48h, 72h, 96h, and 120h. As seen in Fig 16, the prediction error increases gradually as the prediction horizon increases, but the predictions still fit the trend of the data. As the prediction horizon grows, the correlation between the predicted data (output) and the known data (input) decreases, the influence of each variable on PM2.5 concentration gradually weakens, and the weights the attention mechanism assigns to the variables also decrease; this lowers prediction accuracy and raises prediction error. We performed the same long-term prediction experiments for all comparison models and compared their performance using the evaluation indexes MAE and RMSE. Table 7 shows the evaluation indexes of each model under different prediction output step sizes. The table shows that the prediction errors of all models gradually increase with the prediction horizon, but the results of the proposed hybrid model remain optimal at every horizon. Therefore, the model is also suitable for long-term prediction.
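The error growth with horizon described above can be quantified by computing the MAE separately at each forecast step rather than averaging over the whole output window. A minimal sketch on synthetic data whose noise grows with the step (the shapes and noise schedule are illustrative assumptions, not the paper's results):

```python
import numpy as np

def mae_per_horizon(y_true, y_pred):
    """MAE at each forecast step, for arrays shaped (samples, horizon)."""
    return np.mean(np.abs(y_true - y_pred), axis=0)

# Toy setup: 500 forecast windows of 120 hourly steps, where the
# prediction noise grows linearly from step 1 to step 120.
rng = np.random.default_rng(0)
y_true = rng.uniform(10, 100, size=(500, 120))
noise = rng.normal(0, 1, size=(500, 120)) * np.linspace(1, 5, 120)
y_pred = y_true + noise

errs = mae_per_horizon(y_true, y_pred)
print(errs[0] < errs[-1])  # True: error increases with the horizon
```

Plotting such a per-step error curve for each model would make the horizon-dependent degradation in Table 7 visible step by step rather than only at the 48/72/96/120 h aggregates.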

Measured data verification
After model training and testing were completed, the relevant data showed that the model performs well. We used field-measured data to verify the model's performance in practical applications, setting the monitoring point at the side of the transportation road of the tailings

Conclusion
Based on the hourly air quality and meteorological data of an open-pit coal mine in Tongliao City, Inner Mongolia Autonomous Region, from January 1, 2020, to December 31, 2021, this paper analyzes the correlation between each variable and dust concentration, screens suitable variables for the prediction model, establishes a multivariate mixed dust-concentration prediction model using deep learning methods, and verifies it with field-measured data. This paper draws the following conclusions: (1) The variable correlation analysis shows that PM2.5 and PM10 are strongly correlated; therefore, PM2.5 is taken as the primary indicator for establishing the road dust concentration prediction model for the open-pit coal mine. The correlation analysis also shows that the pollutant variables (PM10, SO2, NO2, CO, O3) correlate more strongly with PM2.5 concentration than the meteorological variables (temperature, relative humidity, wind speed, atmospheric pressure) do. This result also shows that it is necessary to strengthen the prevention and control of other air pollutants during road dust control in open-pit mines.
(2) This paper establishes CNN-BiLSTM-Attention prediction models with parallel and serial structures. Many experiments show that the serial structure model performs better than the parallel structure model. When the input size is 24×15 and the output size is 24, the model has the best prediction performance and the lowest prediction error: the evaluation indexes MAE, RMSE, and R2 are 2.694, 3.384, and 0.957, respectively. We compare the model established in this paper with traditional regression, machine learning, and single deep learning models in prediction experiments with different time steps (48h, 72h, 96h, and 120h). The experimental results show that the prediction performance of the CNN-BiLSTM-Attention model is optimal at each time step, and the hybrid model has good generalization and robustness.
(3) Verification with the measured data shows good prediction results: the evaluation indexes MAE, RMSE, and R2 are 3.127, 3.989, and 0.951, respectively. The results indicate that the model can be used for the on-site prediction of dust concentration.