Highly accurate prediction of flammability limits of chemical compounds using novel integrated hybrid models

Two novel and highly accurate hybrid models were developed for the prediction of the flammability limits (lower flammability limit (LFL) and upper flammability limit (UFL)) of pure compounds using a quantitative structure–property relationship approach. The two models were developed using a dataset obtained from the DIPPR Project 801 database, which comprises 1057 and 515 literature data points for the LFL and UFL, respectively. Multiple linear regression (MLR), logarithmic, and polynomial models were used to develop the models according to an algorithm and code written in MATLAB. The results indicated that the proposed models were capable of predicting LFL and UFL values with accuracies that were among the best reported in the literature (LFL: R² = 99.72%, with an average absolute relative deviation (AARD) of 0.8%; UFL: R² = 99.64%, with an AARD of 1.41%). These hybrid models are unique in that they were developed using a modified mathematical technique that combines three conventional methods. These models afford good practicability and can be used as cost-effective alternatives to experimental measurements of LFL and UFL values for a wide range of pure compounds.


Introduction
Flammability can be broadly defined as the ease with which a material can be burned or ignited under specific conditions. The parameters of concern frequently used to characterise the flammability of chemical substances include the flash point, autoignition temperature, limiting oxygen concentration, lower flammability limit (LFL), and upper flammability limit (UFL) [1]. According to the American Society for Testing and Materials (ASTM), the LFL and UFL are defined as the lowest and highest concentrations (percentage) of the fuel (gas or vapour) in air capable of propagating a flame [2]. Flammability limits are commonly expressed using units of volume percent [3-7]. Most hydrocarbons are extremely volatile under relatively normal operating conditions [8-10]; thus, their flammability limits can be used to establish guidelines for the safe handling of these volatile substances. The flammability-limit values for pure compounds are typically found in material safety datasheets provided by the manufacturers. Extensive flammability data for pure gases and some gas mixtures can also be found in Bureau of Mines Bulletin publications [4-6, 11, 12] and elsewhere in the literature [13-15].
Many scientists performed experimental studies on flammability in the 1800s [3, 16] and 1900s [4-6, 11-13, 17]. Since then, several methods involving conventional experimental equipment, such as 20-L explosion apparatuses, have been introduced and utilised by numerous researchers to determine the flammability limits of gases and liquids [2, 18]. Shu and Wen [19] used this type of apparatus to investigate the flammability limits, maximum explosion overpressure, minimum oxygen concentration, and flammability zone of o-xylene. Chang et al. [20, 21] employed it to study the flammability of benzene and methanol with different vapour mixing ratios, as well as the flammability characteristics of 3-picoline/water mixtures. Liao et al. [22] conducted experiments to study the flammability limit of natural gas-air mixtures. Brooks and Crowl [23, 24] used the apparatus to study the flammability of vapours above aqueous solutions of ethanol and acetonitrile, as well as the flammability of methanol, ethanol, acetonitrile, and toluene mixtures. Wu et al. [25] employed a 20-L apparatus to investigate the flammability and explosion characteristics of methane with three different inert gases (CO2, N2, and Ar) at 1 atm and 30 or 100 °C. Liaw et al. [26] used a 20-L spherical explosion vessel to study the flammability of mixtures of acetone + steam, methanol + steam, methyl formate + steam, isopropyl alcohol + steam, isopropyl alcohol + nitrogen, and acetone + nitrogen.
Flammability limits can be determined using various established standard test methods: (i) ASTM methods (ASTM E681 and ASTM E918), (ii) the National Fire Protection Association method (NFPA 69), (iii) American Society of Heating, Refrigerating, and Air-Conditioning Engineers methods, and (iv) European methods (DIN 51649 and EN 1839). For details regarding these test methods, readers are referred to the work of Britton [27].
Even though these experimental standard tests are recommended for measuring the flammability limits of combustible gases, they are expensive and time-consuming. Additionally, because new chemicals are constantly being introduced in various industries, an easier and more cost-effective alternative to the experimental determination of flammability limits is needed. Scientists and engineers increasingly rely on desktop-based modelling methods for this purpose. One such method is the quantitative structure-property relationship (QSPR), which can quickly provide flammability-limit estimations with reasonable accuracy at a fraction of the cost and/or time of experimental testing.
QSPR studies have been widely applied for the prediction of the flammability limits of numerous substances. Many researchers have applied the QSPR for the estimation of LFL or UFL values. For example, Albahri [28] proposed models for estimating the flammability limits using the group contribution method. Gharagheizi 38] utilised the QSPR and developed models for estimating the flammability limits of organic compounds. Albahri [39] developed a neural network-based structural group contribution model for the prediction of LFLs. Frutiger [40] used the Marrero/Gani method to develop models for the prediction of LFLs, UFLs, flash points, and autoignition temperatures of organic chemicals. Chen et al. [41] proposed a QSAR model with four descriptors for predicting the LFLs of organic compounds. These models are compared in the "Results and discussion" section. Rowley [42] presented a comprehensive review of the use of QSPR models and other models for estimating LFL/UFL values.
In this study, we extracted 1057 LFL and 515 UFL data points published by DIPPR Project 801 [14] to develop two new accurate models for predicting the LFLs and UFLs of pure compounds using the QSPR approach. These models were developed by combining three methods: multiple linear regression (MLR), logarithmic, and polynomial methods. To the best of our knowledge, no QSPR model for the prediction of any property, including the LFL and UFL, based on a combination of these three methods has been reported in the literature.

Dataset collection and preparation
The dataset utilised in this study was obtained from the DIPPR Project 801 database [14]. The data were published by the American Institute of Chemical Engineers and can be considered a reliable, comprehensive, and accessible source for the hazard and safety properties of pure compounds. The dataset covers a myriad of organic compounds with multiple functional groups, namely hydrocarbons, halogenated hydrocarbon compounds, ethers, ketones, alcohols, aldehydes, amides, esters, amines, acids, nitriles, nitro compounds, and heterocyclic compounds. The first step in preparing the dataset was to design a 'molecular structure table' based on the molecular fragments (groups) for describing the molecular structure of the pure compounds. In this study, 1057 LFL values were selected from DIPPR Project 801 to develop the LFL model. The same dataset was previously used by Gharagheizi [29]. UFL values for 515 pure compounds were also selected from the DIPPR Project 801 database and used as the main UFL dataset (S1 and S2 Tables of the Electronic Supplementary Material). The molecular descriptors of all these pure compounds were determined using the software package Dragon [43]. This software is generally used for molecular descriptor calculations; details regarding its usage can be obtained from its website (http://www.talete.mi.it/) or from the Handbook of Molecular Descriptors [44]. The molecular descriptors (S1 and S2 Tables) were subsequently used as datasets for MATLAB processing to identify the best-fit models that afforded the most accurate predicted results (i.e. closest agreement between the experimental and predicted values). A brief description of the molecular descriptors used in this study is presented in S1 and S2 Tables.

Model development
The QSPR process quantitatively correlates the structural properties of molecules (the descriptors) with their functional properties (in this case, the LFL and UFL values) for a set of similar compounds. The process uses linear statistical methods, such as MLR, polynomial regression, and partial least-squares, or nonlinear methods, such as support vector machines (SVMs) and artificial neural networks (ANNs), to generate mathematical models that relate the experimentally measured properties of the compounds to a set of chemical descriptors.
In this study, we integrated MLR, logarithmic, and polynomial models to combine the inherent strengths of each model and enhance the predictive accuracy of the resultant model. In the case of linear regression, the dependent (prediction) variable was represented as $Y$, while the independent variables (descriptors) were represented as $X_1, X_2, \ldots, X_p$, where $X_p$ represents the $p$th predictor variable. The relationship between the response variable $Y$ and the descriptors $X_1, X_2, \ldots, X_p$ can be expressed as a linear regression model (Eq (1)) [45], in which $\varepsilon$ represents the normal random error (residual) reflecting the difference between the observed and the predicted values. Eq (1) can be expressed in a linear form (Eq (2)), where $a_0, a_1, \ldots, a_p$ are the regression coefficients for the MLR model. Eq (2) can in turn be expressed in a nonlinear (logarithmic) form (Eq (3)), where $\beta_1, \beta_2, \ldots, \beta_p$ are the regression coefficients for the logarithmic model. The MLR (Eq (2)) and logarithmic (Eq (3)) models were then integrated with a polynomial model (Eq (4)) [46].
This interaction yielded the final hybrid model (Eq (5)), where $n$ represents the number of parameters for the MLR model, $m$ represents the number of parameters for the polynomial model, $k$ represents the number of interactions between the MLR model and the polynomial model, $\sum_{i=n+1}^{k} \delta_i\, o X_i$ represents the interactions between the MLR and polynomial models, and $\sum_{j=1}^{n} \sum_{i=1}^{n} \lambda_{j,i} X_j \ln X_i$ represents the interactions between the MLR and logarithmic models.
The parameters for the MLR model are The interaction between the polynomial and logarithmic models was found to have a negligible effect on the results of the proposed model.
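To make the structure of such a hybrid model concrete, the following is a minimal sketch of how one row of a design matrix could be assembled from linear, logarithmic, polynomial, and MLR-logarithmic interaction terms. The function names `hybrid_features` and `n_hybrid_terms`, the polynomial degree of 3, and the requirement that all descriptor values be strictly positive are assumptions made for illustration; they are not taken from the MATLAB code used in this study.

```python
import math


def hybrid_features(x, degree=3):
    """Build one row of a hybrid design matrix from descriptor values x.

    Term families (all illustrative assumptions):
      - intercept
      - linear terms             X_i
      - logarithmic terms        ln X_i
      - polynomial terms         X_i**2 ... X_i**degree
      - MLR x log interactions   X_j * ln X_i
    Polynomial x log interactions are left out, mirroring the remark
    that their effect on the results was negligible.
    """
    if any(v <= 0 for v in x):
        raise ValueError("ln X_i requires strictly positive descriptor values")
    logs = [math.log(v) for v in x]
    row = [1.0]                      # intercept
    row.extend(x)                    # linear terms
    row.extend(logs)                 # logarithmic terms
    for d in range(2, degree + 1):   # polynomial terms, one power at a time
        row.extend(v ** d for v in x)
    # every pairing of a descriptor with the logarithm of a descriptor
    row.extend(xj * li for xj in x for li in logs)
    return row


def n_hybrid_terms(n, degree=3):
    """Number of columns produced by hybrid_features for n descriptors."""
    return 1 + n + n + (degree - 1) * n + n * n
```

With three descriptors and the assumed degree of 3, this sketch yields 22 columns; the interaction block grows quadratically with the number of descriptors, which is why hybrid models of this kind carry many coefficients.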
To estimate the overall parameters of the proposed model, we used the least-squares error method. The corresponding prediction equation is given by Eq (6), and the algebraic matrix for the proposed models is given by Eq (7), where $N$ represents the number of LFL/UFL experimental values for the pure compounds. MATLAB (version 7.8.0.347) was employed to build the code and predict the LFL and UFL values using the algorithm shown in Fig 1. The average relative deviation (ARD, Eq (8)), average absolute relative deviation (AARD, Eq (9)), average absolute error (AAE, Eq (10)), and standard deviation (square root of the variance, $\hat{s}^2$, Eq (11)) were used to confirm the accuracy of the developed model.
Here, $N$ represents the number of substances, $FL_{Cal}$ represents the calculated flammability value (LFL or UFL), $FL_{Exp}$ represents the experimental flammability value (LFL or UFL), and $\overline{FL}_{Cal}$ represents the mean FL value.
To determine the significant coefficients that define the relationship between the flammability limits of each compound and their molecular structures, the MLR, logarithmic, and polynomial models were combined via the group contribution method, as indicated by Eq (5). MATLAB was employed to perform the calculations. The code was written using an 80%/20% training/testing split. The purpose of the training process was to calibrate the model and to determine the optimal coefficients according to the least-squares method. This method yields the best-fitting curve between the predicted results and the DIPPR 801 LFL/UFL values. The validation process was used for predicting the values not included in the training set. To reduce the number of coefficients in the final models without losing accuracy, R² hypothesis testing was performed for each coefficient to evaluate its significance in the developed model. Coefficients found to be insignificant were eliminated to simplify the models. Only the most significant coefficients obtained from the hypothesis testing were selected and used in Eq (5) to build the final models and then to predict the results.
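The four accuracy measures named above can be sketched in a few lines. The definitions below are the commonly used textbook forms and are assumptions: ARD and AARD are taken as percentages relative to the experimental value, AAE as the mean absolute difference in the units of the data, and the standard deviation as the sample standard deviation of the calculated values about their mean. The exact forms of Eqs (8)-(11) may differ, and the function names are hypothetical.

```python
import math


def ard(calc, exp):
    """Average relative deviation in percent (signed; assumed definition)."""
    return 100.0 * sum((c - e) / e for c, e in zip(calc, exp)) / len(exp)


def aard(calc, exp):
    """Average absolute relative deviation in percent (assumed definition)."""
    return 100.0 * sum(abs(c - e) / abs(e) for c, e in zip(calc, exp)) / len(exp)


def aae(calc, exp):
    """Average absolute error in the units of the data (assumed definition)."""
    return sum(abs(c - e) for c, e in zip(calc, exp)) / len(exp)


def std_dev(calc):
    """Sample standard deviation of the calculated values about their mean."""
    mean = sum(calc) / len(calc)
    variance = sum((c - mean) ** 2 for c in calc) / (len(calc) - 1)
    return math.sqrt(variance)
```

Because ARD keeps the sign of each deviation, over- and under-predictions can cancel, so a small ARD alongside a larger AARD indicates scatter without systematic bias.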

Results and discussion
The training set was initially subjected to the least-squares method for developing the different models. The MATLAB program utilised the DIPPR data (80% of the entire dataset) to train the code and then to compute the coefficients for the developed models ($\alpha_i$, $\gamma_i$, $\delta_i$, $\lambda_{i,j}$) using Eq (5). The MATLAB code then analysed the remaining data (20% of the entire dataset) using the coefficients obtained from the training dataset to evaluate how well the models had been trained and how accurately the models could predict the results. The testing set was not used during the training process and was only used to compare the predicted results. For the development of the LFL model, 846 components were utilised for the training set, and 211 were used for the testing set. The LFL model was constructed according to the 105 molecular descriptors described in S1 Table. For the UFL model, 412 components were utilised for the training set, and 103 were used for the testing set. Furthermore, 82 molecular descriptors were used to build the UFL model (S2 Table). For the proposed method, the interactions among the three models generated a large number of coefficients, which enhanced the accuracy of the models. For instance, the number of coefficients for the LFL model was 6421, and the model had an R² of 99.72%. For the UFL model, the number of coefficients was 12481, and the model had an R² of 99.64%. It is highly recommended to use the proposed models with the aforementioned numbers of coefficients. This is because if the number of coefficients is reduced (e.g. from 6421 to 357 for LFL and from 12481 to 175 for UFL), the accuracy of the LFL and UFL models decreases (to R² = 96% and R² = 78%, respectively). Table 1 compares the accuracies of the three models interacting together (proposed models) with those of the three models individually. Table 2 presents the statistical parameters of the training, testing, and total datasets.
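The 80%/20% partition and the least-squares fit described above can be sketched as follows. This is a minimal illustration in plain Python, not the MATLAB code used in this study: the function names, the fixed random seed, and the choice to solve the normal equations by Gaussian elimination are all assumptions. Rounding 80% of 1057 and 515 reproduces the reported set sizes of 846/211 and 412/103.

```python
import random


def split_sizes(n_total, train_fraction=0.8):
    """Return (n_train, n_test) for a given training fraction."""
    n_train = round(train_fraction * n_total)
    return n_train, n_total - n_train


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Randomly partition rows into training and testing lists."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_train, _ = split_sizes(len(rows), train_fraction)
    train = [rows[i] for i in idx[:n_train]]
    test = [rows[i] for i in idx[n_train:]]
    return train, test


def least_squares(X, y):
    """Solve the normal equations (X^T X) b = X^T y for the coefficients b."""
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(X, y))]
        for i in range(p)
    ]
    # Forward elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= f * A[col][c]
    # Back substitution
    b = [0.0] * p
    for i in reversed(range(p)):
        s = A[i][p] - sum(A[i][j] * b[j] for j in range(i + 1, p))
        b[i] = s / A[i][i]
    return b
```

In this sketch the coefficients are fitted on the training rows only, and the testing rows are then passed through the fitted model to obtain predictions for comparison. The solver raises an error when the normal equations are singular, which happens when the training data cannot determine all coefficients uniquely.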
The developed model was capable of predicting the LFL with a high accuracy (R² = 99.69% for the training set, R² = 99.83% for the testing set, and R² = 99.72% for the whole dataset). Additionally, the R², ARD, AARD, AAE, and $\hat{s}^2$ values of the training and testing sets were very similar. This indicates that the predicting abilities of the proposed model were stable. To validate the proposed model, we tested the MLR, polynomial, and logarithmic models separately. The output accuracies (R²) of these methods were 76.06%, 12.64%, and 52.36%, respectively (Table 1). The proposed model (MLR + logarithm transformation) exhibited far more accurate prediction (R² = 99.72%) than the individual models, indicating that the proposed concept of using a hybrid model based on the interaction between these three models enhanced the accuracy of the results. This is because the combination of the three models took more predictor variables ($X_1, X_2, \ldots, X_n$) into consideration during the processing of the data by the MATLAB code and optimised the coefficients ($\alpha_i$, $\gamma_i$, $\delta_i$, $\lambda_{i,j}$) for use in the best model. The most significant coefficients were optimised using MATLAB and applied to Eq (5) to obtain the best model for predicting the LFL. All QSPR models require further validation before they can be considered reliable. The proposed model was validated using a dataset consisting of a random selection of 20% of the components in the dataset. The predicted results were validated against the experimental values of the dataset and were found to be consistent, with no significant deviations. An excellent fit was achieved (R² = 99.72%), as illustrated in Fig 2. The ARD, AARD, AAE, and standard deviation were 0.1%, 0.8%, 1.2%, and 6.6 × 10⁻⁴, respectively. As shown in Figs 3-5, among the 1057 components, there were approximately 800 components with 'zero' error between the predicted LFL values and DIPPR 801 values.
The results predicted using the model were also compared with results obtained using models developed by other authors, as shown in Table 3.
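For reference, the R² values quoted in this section can be reproduced from paired experimental and predicted values with a short function. The sketch below assumes the usual coefficient of determination, one minus the ratio of the residual to the total sum of squares, expressed as a fraction; multiplying by 100 gives the percentage form used in the text. The function name is hypothetical, and the exact R² definition used in the MATLAB code is not stated here.

```python
def r_squared(exp, calc):
    """Coefficient of determination between experimental and calculated values."""
    mean_exp = sum(exp) / len(exp)
    ss_res = sum((e - c) ** 2 for e, c in zip(exp, calc))   # residual sum of squares
    ss_tot = sum((e - mean_exp) ** 2 for e in exp)          # total sum of squares
    return 1.0 - ss_res / ss_tot
```

A perfect prediction gives 1.0, and a model that always predicts the mean experimental value gives 0.0.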

LFL prediction accuracy and validation
Albahri [39] developed a model for predicting the LFL with a higher accuracy than our model (R² = 99.98%). However, the number of compounds used in his study (543) was smaller than that utilised in the present study (1057). To test the efficiency of our novel MATLAB code, we utilised the dataset provided by Pan et al. [37] and developed an accurate LFL model (Eq (12)). Here, SIC0 represents information indices (structural information content, neighbourhood symmetry of 0-order), AAC represents topological descriptors (mean information index on atomic composition), PW5 represents topological descriptors (path/walk 5 Randić shape index), and GATS1v represents two-dimensional (2D) autocorrelations (Geary autocorrelation-lag 1 / weighted by atomic van der Waals volumes).

UFL prediction accuracy and validation
Table 2 shows that the developed model was able to predict the UFL values with a high accuracy (R² = 99.34% for the training set, R² = 99.33% for the testing set, and R² = 99.64% for the whole dataset). The UFL values obtained using the proposed model were compared with the experimental values from DIPPR 801. A good fit was achieved (R² = 99.64%), as illustrated in Fig 6. The ARD, AARD, AAE, and standard deviation were 0.086%, 1.41%, 9.87%, and 0.041, respectively. As shown in Figs 7-9, among the 515 components, approximately 470 exhibited 'zero' error between the predicted UFL values and the DIPPR 801 values. The results predicted by the model were also compared with results obtained using models developed by other authors, as shown in Table 3 (Eq (13)). The model predicted the UFL with an R² of 92.72%. This indicates that the accuracy of the proposed model is slightly higher than that of Gharagheizi's model [31]. Details are presented in S4 Table.

$$\mathrm{UFL} = 14.011 - 0.765\,\mathrm{MLOGP} - 33.853\,(\mathrm{Jhetv} + \mathrm{PW5}) + 0.834\,(\mathrm{SIC0} + \mathrm{MATS4m}) + 32.167\,(\mathrm{Jhetv} - \mathrm{PW5}) - 281.86\,(\mathrm{PW5})^2 + 35.904\,(\mathrm{SIC0})^2 + 2622.185\,(\mathrm{PW5})^3 - 23.\ldots$$

Here, Jhetv represents topological descriptors (Balaban-type index from van der Waals weighted distance matrix), MATS4m represents 2D autocorrelations (Moran autocorrelation-lag 4 weighted by atomic masses), and MLOGP represents molecular properties (Moriguchi octanol-water partition coefficient (log P)).

Conclusion
A new method was proposed for the development of flammability-limit (LFL and UFL) models based on a QSPR approach. The development of these models was based on code written using the MATLAB software (version 7.8.0.347) and a combination of MLR, logarithmic, and polynomial models. To develop the LFL and UFL models, 1057 and 515 pure compounds were used, respectively, spanning many families of compounds. Therefore, the developed models have a wide range of applicability. The developed models predicted the LFL and UFL with high accuracy (R² = 99.72% and R² = 99.64%, respectively) and are more accurate than previously reported models.