Key factors influencing earthquake-induced liquefaction and their direct and mediation effects

Many factors impact earthquake-induced liquefaction, and there are complex interactions between them. Therefore, rationally identifying the key factors and clarifying their direct and indirect effects on liquefaction help to reduce the complexity of the predictive model and improve its predictive performance. This information can also help researchers understand the liquefaction phenomenon more clearly. In this paper, based on a shear wave velocity (Vs) database, 12 key factors are quantitatively identified using a correlation analysis and the maximum information coefficient (MIC) method. Subsequently, the regression method combined with the MIC method is used to construct a multiple causal path model without any assumptions based on the key factors for clarifying their direct and mediation effects on liquefaction. The results show that earthquake parameters produce more important influences on the occurrence of liquefaction than soil properties and site conditions, whereas deposit type, soil type, and deposit age produce relatively small impacts on liquefaction. In the multiple causal path model, the influence path of each factor on liquefaction becomes very clear. Among the key factors, in addition to the duration of the earthquake and Vs, other factors possess multiple mediation paths that affect liquefaction; the thickness of the critical layer and thickness of the unsaturated zone between the groundwater table and capping layer are two indirect-only mediators, and the fines content and thickness of the impermeable capping layer induce suppressive effects on liquefaction. In addition, the constructed causal model can provide a logistic regression model and a structure of the Bayesian network for predicting liquefaction. Five-fold cross-validation is used to compare and verify their predictive performances.


Introduction
The selection of key factors is a critical step in any development of any model [1]. Considering too few factors will cause the model to underfit, and considering too many factors in the model will lead to overfitting. Moreover, factors with little or no effects that are added to the model will largely increase the uncertainty and complexity of the model and make it more difficult both to fit and interpret [1]. Many factors impact earthquake-induced liquefaction, mainly including seismic parameters, soil properties, and site conditions (as shown in Table 1). The contribution of each factor in these three categories to the occurrence of liquefaction is different, and the mutual influence between the factors is complicated. Therefore, identifying the key factors and screening their direct and mediation effects on the occurrence of liquefaction can largely reduce the complexity of the model and more clearly explain the influence path and mechanism of each factor, which is conducive to improving the predictive performance of the model. Table 1 summarizes almost all factors related to earthquakeinduced liquefaction and their influence rules. It should be noted that many factors do not solely affect liquefaction potential (LP). For example, for the same site, the greater the moment magnitude (M w ), the more likely the site is to liquefy, and the greater the peak ground The bigger the amplitude of the site, the less likely the site is to liquefy [3] Intensity I The bigger the I, the less likely the site is to liquefy Soil property Fine or clay content FC, CC The non-linear relationship between liquefaction resistance and FC or CC is a concave upward parabola; FC or CC has a positive effect on LP when it less than the critical value, vice versa [2][3] Soil type ST The cohesive soil and gravelly soil are usually not easy to liquefy Particle size characteristic D 50 , C c , C u The larger the D 50 and the better the gradation, the bigger the k, the less likely the soil is to liquefy Relative density D r or e The increase of relative density increases the liquefaction resistance Over-consolidation ratio OCR The larger the OCR, the better the liquefaction resistance of the soil Degree of saturation S r Usually, the saturated soil can liquefy Plasticity index I p Liquefaction resistance decreases as the I p increases Soil structure -Well-structured soil is not easy to liquefy Particle shape -The coarser the particles, the harder the soil is to liquefy Permeability coefficient k The greater the k, the less likely the site is to liquefy [4] Site condition Vertical stress The increase of σ V or s 0 V increases the liquefaction resistance of the soil [2][3] Groundwater The bigger the H n , the bigger the σ V , the less likely the site is to liquefy, whereas the occurrence of gravelly soil liquefaction requires a certain H n [5] Drainage channel D n The site with a good drainage channel is not easy to liquefy Drainage boundary -The better the drainage boundary, the less likely the site is to liquefy [4] acceleration (PGA) and duration (t); the M w can indirectly promote LP through PGA and t.
For the silty sand, the greater the fines content (FC), the more the average practical size (D 50 ) decreases, and the permeability coefficient (k) is reduced accordingly; the increase in the FC and the decrease in k are not conducive to liquefaction, while the decrease in D 50 is conducive to liquefaction, forming a competitive effect. However, these are only qualitative cognitions, and it is impossible to quantitatively analyse the contribution of each factor.
Although there are many studies on the influence rules of various factors on liquefaction, few studies have focused on the screening of significant factors. Seed and Idriss [6] suggested five factors, namely, soil type (ST), relative density or void ratio, initial confining pressure, and the intensity and duration of ground shaking, for predicting soil liquefaction. Zhu [7] selected eight significant factors from 15 total factors, namely, the groundwater table (D w ), depth of the critical layer (D s ), normalized standard penetration blow count (SPTN), thickness of the impermeable capping layer (H n ), thickness of the critical layer (T s ), D 50 , nonuniform coefficient (C u ) and frequency of the maximum particle size, for predicting liquefaction using the Bayesian regression method. Dalvi et al. [8] found eight significant factors, the M w , PGA, peak ground velocity (PGV), frequency (f), normalized SPTN, vertical effective stress (s 0 V ), dynamic shear modulus and relative density (D r ), among 16 total factors using the analytic hierarchy process and entropy analysis method. Tang et al. [2] identified 12 significant factors from 22 total factors using the bibliometric method, and these significant factors contain almost all the important factors suggested by the above studies. Lee and Hsiung [9] presented an approach for quantifying the sensitivities of the key factors in a multilayer perceptron neural network and revealed that the PGA is the most sensitive factor, and the earthquake parameters (e.g., M w , PGA, etc.) are more sensitive to liquefaction potential than soil properties (e.g., SPTN, FC). However, the conclusions of these studies were different, and some research methods, such as the analytic hierarchy process and bibliometric method, were more subjective, so that the screening results were easily affected by experience or sampling, while those objective methods, such as regression methods and artificial neural networks, only considered the direct causality between the factors and liquefaction potential, whereas the mutual influence between the factors was ignored, and the mediation effects of the factors on liquefaction were not considered. Thus, the calculation of the contribution of the factors to the occurrence of liquefaction was inaccurate, which affected the identification results of the key factors.
Path analysis is a combination of multiple regression equations that can analyse the causal relationships between factors, as well as their direct and indirect effects on LP, and obtain more accurate causal contributions. However, because path analysis needs to determine the causal relationships by assumptions in advance, it is subjective, and assumption errors will cause the model to be revised multiple times, which requires much work to finalize the model structure. Therefore, this paper studies how to identify the key factors of seismic liquefaction and uses the path analysis method to analyse their direct and mediation effects on LP without a correlation hypothesis. The research idea is shown in Fig 1. First, because of the lack of subjective assumptions about factor relationships in the path analysis method, based on the collected data and factors, on the one hand, the correlation analysis method is used to eliminate variables with multicollinearity; on the other hand, the maximum information coefficient (MIC) method is used to quantitatively screen out the relatively important variables and determine their nonlinear relationships. Then, domain knowledge is used to determine the direction of causal influence and obtain an initial path structure, which can greatly reduce the number of manual adjustments to the model structure. Finally, the significance and multiple measurement indexes are used to verify the fitting effect of the initial structure. When the fit is not good, the links between factors can be appropriately added to improve the performance of the model and obtain revised impact path models until the final model passes the test. After an analysis of the direct and mediation effects of the key factors on LP, their comprehensive contributions can be further identified. In addition, the causal model can directly provide the structure of a Bayesian network (BN) model for parameter learning, or it can also be directly extracted as a logistic regression (LR) model for predicting liquefaction. The performances of these two models are verified through the collected data.

Correlation analysis method
Correlation analysis is generally used to describe the relationship and multicollinearity between two variables. For different variable types, the calculation equations are different. For instance, the Pearson correlation coefficient [10] is used to quantitatively describe the relational degree between two continuous variables that conform to the normal distribution; the Spearman correlation coefficient [11] is used to quantitatively describe the rank correlation between any continuous variable and an ordinal variable, and the Kendall correlation coefficient [11] is used to quantitatively describe the contingency relation between two categorical variables or between any continuous variable and a categorical variable. Their calculation functions are as follows: r Spearman ¼ covðrg x ; rg y Þ=ðs rgx s rgy Þ ð2Þ where ρ Pearson , ρ Spearmas and ρ Kendall are the Pearson, Spearman, and Kendall correlation coefficients, respectively; cov(x,y) is the covariance of variables x and y; σ x and σ y are the standard deviations of x and y; rg x and rg y stand for the rank transformed values of x and y; n is the sample size; n c and n d are the numbers of concordant and discordant variables in x and y, respectively. The coefficient values range from -1.0 to 1.0. A correlation coefficient of -1.0 shows a perfect negative correlation, while a correlation coefficient of 1.0 denotes a perfect positive correlation. If a correlation coefficient value between the two variables is larger than or equal to 0.9, it means they exhibit multicollinearity.
Since the above correlation analysis methods do not perform well when calculating the nonlinear correlation between two variables, Reshef et al. [12] proposed a measuring method, the maximum information coefficient (MIC), for the dependence of two-variable relationships. The MIC is based on the idea that if a relationship exists between two variables, then a grid can be drawn on the scatterplot of the two variables that partitions the data to encapsulate that relationship. Thus, the largest possible mutual information can be calculated for every pair of integers (x, y) based on mutual information theory. After normalizing these mutual information values, the highest normalized mutual information is the MIC value. More details can be found in Reshef et al. [12]. The MIC calculated equation is as follows: where I(x,y) is the mutual information of variables x and y in a grid; i and j are the line and column numbers of the grid, respectively; n is the sample size; x×y<B(n) denotes the boundary of the grid; normally, B(n) = n 0.6 ; P(x i ) and P(y i ) are the frequency of occurrence of x i and y i in a small square given a grid, respectively; P(x i ,y j ) is the joint probability density of the two variables that is equal to the frequency of simultaneous occurrence of x i and y i in a small square. Normally, if MIC(x,y)�0.9MaxMIC(X) or MIC(y,x)�0.9MaxMIC(Y), x and y are correlated. Thus, the MIC method can obtain most of the correct connections among variables [13]. Max-MIC(X) and MaxMIC(Y) are the maxima in a given row and column, respectively. In addition, if MIC(x 1 ,y) is much less than the others MIC(x i ,y) (i 6 ¼ 1), x 1 produces little impact on y.

Path analysis method
Path analysis is a method of causality analysis first proposed by Wright [14]. The path diagram (see Fig 2) can help researchers clearly understand the influence path between variables (arrow direction) and the degree and properties of causal influence (the magnitude and positiveness of the coefficient) and analyse the direct, mediation, and total effects of independent variables on the dependent variables. The path analysis method has been widely used in the fields of psychology, sociology, and economics but less in the field of civil engineering. To date, the path analysis method has not been applied in seismic liquefaction analysis. Since path analysis does not contain latent variables, it is a special case of structural equation modelling. Path analysis includes the following four steps: 1. Assumptions about the causal relationships between variables.
2. Collection of enough data and calculation of the path coefficient. Kline [15] recommended that the sample size should be 10 times (or ideally 20 times) the number of parameters. The calculation of path coefficients is designed to solve the regression coefficients of multiple regression equations, which can usually be calculated by special softwares, such as SPSS, Amos, Mplus, etc.
3. Inspection and revision of the model. The estimated values of the regression coefficients need to be tested for statistical significance and the critical proportion value of the C.R. If the coefficients are not statistically significant (normally larger than 0.05) or the absolute value of the C.R. is less than 1.96, the above steps should be repeated, that is, redefine the assumptions and calculate the path coefficients, until the significance and the C.R. value of the model meet the requirements. After the above test is passed, the goodness of fit of the model needs to be examined using multiple statistical fit indexes. If the test fails, the model needs to be manually corrected, such as by adding some links, to improve the goodness of fit of the model. 4. Effects analysis. The researchers can determine the direct effect and the mediation effect of any independent variable on the dependent variable. For example, in Fig 2B, the direct effect is c', the mediation effect is a�b, and its total effect is c'+a�b. It is worth noting that path analysis is a technique for testing causality but cannot be used to discover or search for causality.
The statistical fit indexes include absolute indexes, comparative indexes, and parsimonious indexes for the goodness of fit, where the absolute indexes contain the ratio of likelihood-ratio χ 2 values to degrees of freedom values (χ 2 /df), root mean square error of approximation (RMSEA), the goodness of fit index (GIF), and adjusted goodness of fit index (AGIF); the comparative indexes contain the comparative fit index (CFI), normed fit index (NFI), relative fit index (RFI), incremental fit index (IFI), and Tucker-Lewis fit index (TLI); the parsimonious indexes contain the parsimony goodness of fit index (PGFI), parsimony normed fit index (PNFI), and parsimony-adjusted comparative fit index (PCFI). The calculation equations for all of these indexes and their standard values for indicating a well-fitted model (shown in Table 2) can be found in the references [15][16][17][18]. Generally, it is difficult for a model to meet the requirements of all fit indexes. Therefore, as long as most indexes can meet their standard ranges, then the model possesses a good fit. In addition, the smaller the values of the Akaike information criteria (AIC), Bayesian information criteria (BIC), and Browne-Cudeck criterion (BCC) are, the better the model fit.

Mediation effect
The mediation effect is mainly used to study the influence path and mechanism of the independent variable acting on the dependent variable indirectly through the mediation variable. Fig 2B shows a simple mediation model. In addition to the independent variable X directly affecting the dependent variable Y, it can also affect Y through a variable M. Thus, M is considered to play a mediating role between X and Y, and it is called the mediator. In Fig 2A, however, X produces only a direct effect on Y but not a mediation effect. If there is a mediation effect on the influence of X on Y, but the influence is not considered, it is unable to fully explain the influence of X on Y.
In most studies of mediation effect models, when the independent variable, mediator, and dependent variable are all continuous variables, linear regression analysis can be used directly to construct a model. However, there are relatively few studies on the situation where the dependent variable is a binary variable, such as the occurrence of seismic liquefaction. A common approach is to use logistic regression instead of linear regression in the analysis of the independent variables and dependent variables, as well as mediation analysis [19]. The calculation equations are as follows: where M is a mediator; X is an independent variable; and Y is a binary dependent variable (Y = 0 or 1). a, b, c and c' are the fitting parameters or regression coefficients in the regression analysis, where a denotes the influence of X on M; b denotes the influence of M on Y; c and c' denote the direct influences of X on Y with and without considering the influence of M, respectively. P(Y|X) and P(Y|M,X) are the conditional probabilities of Y given X and M, respectively. e 1 and e 2 are the residuals of Y in the model (a) and model (b), respectively; e 3 are the residuals of M. β 1 , β 2 , and β 3 are regression constant terms in Eqs (6), (7) and (5), respectively. In Fig 2B, there are generally two methods for calculating the size of the mediation effect; one is the coefficient difference method, i.e., c−c'; another is the coefficient product method, i.e., a�b. MacKinnon et al. [19] found that a�b is closer to the true value of the mediation effect, and compared with c−c', it has good robustness and can better represent the mediation effect. Therefore, a�b is used to represent the mediation effect in this study. However, the units of b, c and c' in logistic regression are logits, and they are inconsistent with a of the linear regression in scale. In addition, c and c' of Eqs (6) and (7), respectively, are also different in scale due to their different independent variables. Thus, one cannot simply multiply a and b. To solve the problem of different scales for the different regression equations, MacKinnon and Dwyer [20] proposed an approach to standardize regression coefficients. The calculation equations are as follows: ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi where the std superscript denotes the standardization of logistic regression coefficients. SD(�) is the standard deviation of a variable; var(�) is the variance of a variable; cov(X,M) is the covariance of X and M. Thus, the mediation effect of X is changed to a std b std . The total effect is equal to the sum of the direct effect and the mediation effect, i.e., c' std +a std b std . When c' std and a std b std possess the same sign, the mediation effect is complementary, and the mediation effect ratio is a std b std /(c' std +a std b std ). However, if their signs are different, e.g., c' std is positive whereas a std b std is negative, the mediation effect is competitive, i.e., the suppression effect is present by MacKinnon et al. [21]. The suppression effect ratio is |a std b std /c' std |.
Since the mediation effect model contains a binary dependent variable, and its mediation effect equals, Z a ×Z b , this study uses the Sobel method suggested by Iacobucci [22] to test the significance of the product of coefficients a std b std . The calculation equations are as follows: ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi ffi ffi ffi ffi ffi ffi ffiffi SEða std Þ ¼ SEðaÞSDðXÞ=SDðMÞ ð13Þ where SE(�) denotes the standard error of the regression coefficient; a |Z| value larger than 1.96 indicates that the indirect effect of X on Y is significant; otherwise, there is no mediation effect. When there are multiple independent variables and mediators, the model becomes very complicated, as shown in Fig 2C and 2D. Fig 2C is a single-step multiple mediation model, and Fig 2D is a multiple-step multiple mediation model [23]. In Fig 2D, in addition to the direct effects of the independent variable X 1 ,X 2 ,� � �,X n on the dependent variable Y, there are two parallel mediation effects via M 1 and M 2 and a chain mediation effect from M 1 to M 2 .
Thus, the regression equations are as follows: Fig:2ðcÞ LogitPðY Fig:2ðdÞ when there are more than two mediators, and readers can derive this equation by themselves according to the formula suggested by Sobel [24]. SD(Y") for calculating SE(d std ) and SE(b std ) can be expressed by The historical case data Many factors affect earthquake-induced liquefaction, and these factors are summarized in Table 1. However, some factors are difficult to characterize or quantify with a certain indicator (e.g., particle shape, soil structure, etc.) or their values are difficult to obtain in the historical database (e.g., permeability coefficient, liquid-plastic limit index, particle size distribution). Therefore, 19 factors, as shown in Table 3 Fig 8), or at least 200 cases [15], which can ensure the validity of parameter fitting in the path analysis.
For each case, the site behaviour is characterized through a binary indicator LP, where LP = 1 if liquefaction occurred and LP = 0 if it did not occur, and the surveyed fields are limited to level and gently sloping sites. Table 3 shows the statistical characteristics of the cases. Almost every variable possesses an uneven proportion between groups, especially for LP; the liquefied sample size is approximately twice that of the non-liquefied sample size, and there is sampling bias, which affects the performance of the liquefaction prediction model [26] but does not affect the parameter estimation of the path analysis model. The collected data cover almost all possible liquefaction situations, such as M w between 5 and 9.2, PGA between 0.1 and 0.789 g, FC between 0% and 99%, D 50 between 0.006 and 33.4 mm, V s between 59 and 380 m/ s, D w between 0 and 7 m, D s between 1.1 and 17.8 m, etc., which facilitates the construction of a reliable causal model.

Identification of the key factors
To avoid the adverse effects of multicollinearity on the performance of the model and to further identify variables that produce less impact on liquefaction, this section first uses the Pearson, Spearman, and Kendall methods to calculate the correlations between factors and find variables with correlations greater than 0.9. Then, the MIC method is used to calculate the nonlinear relationship between factors and liquefaction and to identify the factors with the largest contributions.      Fig 6A. It can be seen that the relationship between the variables is not directional because the MIC method can only identify the nonlinear correlation between variables. To obtain the causalities between variables, domain knowledge as shown in Table 1 is used to determine the causal direction of the variables in this study. For example, for the same site, the larger the M w is, the larger the PGA, and so the M w affects the PGA, not PGA affects M w . Using domain knowledge to determine the causal direction is very simple and convenient. When the research problem does not include domain knowledge, mathematical methods can be used to calculate the causal direction [13]. In particular, there is no direct physical relationship between FC and D 50 , but usually D 50 decreases as FC increases. In contrast, however, the relationship may not be true. Therefore, this study assumes that FC is the cause of D 50 . The causal model is determined as shown in Fig 6B, and the direction of the arrow indicates cause and effect.

Construction of an initial path model and its correction
Generally, the path analysis method is used for analysing linear causality between variables. However, most of the factors of seismic liquefaction exhibit nonlinear relationships. Therefore, this paper has computed the natural logarithms of some variables according to their functional forms (as shown in Table 4) and converted them into a linear equation in the path analysis. Moreover, the processed variables also approximately follow the normal distribution.  Table 4. Functional relationships between some variables.

Functional relationship
Reference lnY = a+b�M w +clnR [27] lnD 50 = a+b�FC [28] lnV s = a+b�D s This study D n = D w −H n (when D n is negative, D n = 0) [5] An initial path analysis model, as shown in Fig 7, is constructed according to the causal structure in  Table 5. It can be seen that the C.R. values are greater than 1.96, and the P-values are less than 0.05. Therefore, the causality path constructed by the MIC method combined with domain knowledge is effective. However, except for the parsimonious indexes, other statistical fit indexes almost fall short of the standard values. Therefore, it is necessary to add some new links in the initial model, recalculate the path coefficients, and evaluate the fit indexes.
According to the modification indexes (MI) for improving the performance of the model, the links between variables corresponding to the large MI values are added to the initial model. The  revised model is shown in Fig 8. Compared with Figs 7 and 8 adds three new links between variables (i.e., links between D s and D w , D s and D n , and D s and H n ) and six correlations (e.g. correlation coefficient 0.44 between two residual terms of M w and R) between the residual terms. The correlations between the residual terms may be caused by the exclusion of some factors, or they may show that these variables are mathematically correlated. This issue requires further study in the future. However, adding the correlations of the residual terms does not affect the path causalities of the model. After recalculating the path coefficients, it is found that all the statistical indexes of the path coefficients are significant as shown in Table 6, and most of the model's fitness indexes pass the test except for χ 2 /df, RMSE, and AGFI, but the values of these three indexes are close to their standard values. In addition, compared with the initial model, the values of the information indexes in the modified model are largely decreased. Therefore, the fitting effect of the improved model is acceptable, and it is appropriate for an analysis of the effects.

Construction of a multiple casual path model for liquefaction
In the above section, the path model of the factors of liquefaction was constructed. In this study, LP is treated as a binary variable, and it cannot be directly analysed with the Amos software along with its factors. Therefore, a stepwise logistic regression method is first adopted to construct a model between LP and its factors and eliminate some links with insignificant effects on LP. For example, the coefficients of T s and D n do not pass the significance test, so they possess no direct links to LP. However, their influences on liquefaction can be produced indirectly through D s . Then, after combining the LR model and the modified model, a multiple mediation model of seismic liquefaction can be constructed, as shown in Fig 9. The multiple mediation model is also a recursive causal model because it can not only reflect the influences of the factors on liquefaction but also the interactions between factors. The logistic regression function and path functions are as follows: 406M w À 0:576lnR þ 2:169lnPGA À 0:816lnt À 0:044FC À 0:593lnD 50 À 4:901lnV s À 0:402D w À 0:12D s þ 0:454H n þ 10:159 , lnPGA ¼ 0:576M w À 0:42lnR À 4:013 ð21Þ  where P L is the probability of LP; all estimates in the regression functions are significant.  Table 1 except for H n and t, which will be discussed in Section 6.

Analysis of direct and total effects of the factors on liquefaction
The absolute values of the total effects of these factors are ranked as M w , D 50 , V s , PGA, R, t, D w , D n , D s , T s , H n , and FC in descending order, which is different from the order of MIC values between the factors and LP, especially the ranking of FC. This is because the relationships between these factors (mediation effects) are not considered when calculating the MIC values. However, when the mediation effect is not considered, the rankings of the direct effects and MIC values are not much different. In addition, comparing the direct or total effects of earthquake parameters, soil properties, and site conditions, the effects of earthquake parameters (M w , PGA, t, and R) are much larger than those of the other two terms for most factors. These findings are consistent with the conclusions found in the literature [9]. Table 7 shows the multiple mediation effects of the factors on liquefaction. It can be seen that all mediation paths pass the Z test because their absolute values are larger than 1.96. For all factors except t and V s , their influences on liquefaction include at least one mediation path, e.g., the mediation effect of R on LP not only through PGA or t (R ! PGA ! LP or R ! t ! LP) but also through PGA to t (R ! PGA ! t ! LP), which forms multiple chain mediation effects. For factors with multiple mediation effects, the sizes and signs of their specific mediation effects are different. For instance, the specific mediation effect of R ! PGA ! LP is equal to -0.236 (a negative value means suppression), whereas the specific mediation effect of R ! t ! LP is equal to 0.019 (a positive value means promotion), and the ratio of its specific mediation effect is much less than that of the path R ! PGA ! LP. Therefore, the mediation effect of PGA as a mediation variable is much stronger than that of t; i.e., for R, PGA is more important than t for predicting liquefaction.

Analysis of multiple mediation effects
In addition, mediation effects include indirect-only mediation effects (e.g., T s and D n with mediation effect ratios of 100%) and partial mediation effects (e.g., R, M w , PGA, D 50 , D s , and D w ). Comparing these mediation effect ratios, the mediation effects of R, T s , and D n are greater than their direct effects. If their mediation effects are ignored when analysing their importance, the results will be biased. Moreover, there are two factors, FC and H n , that , produce suppressive effects. When analysing their influences on liquefaction, in addition to their mediation effects, their suppression effects should also be considered. For example, the suppression effect ratio of FC is as high as 101.2%; that is, the suppression effect is greater than the absolute of the direct effect, which reverses its influence on liquefaction, and this mechanism is consistent with the influence rule of FC in Table 1. Therefore, analysing the mediation and covering the effects of factors is helpful for further understanding of the mechanism of liquefaction. In addition to FC and H n , T s may exhibit a suppression effect, but T s is considered to have no direct effect in the causal model, so it is considered an indirect-only mediator. This situation is related to the collected data and requires the collection of more data for verification or updating.

Predictive performance of the causal model
In the construction of the causal path analysis model, it can be found that the model can directly extract a liquefied LR prediction model such as Eq (20). The logistic regression model with an accuracy of 84.8% (73.9% and 90.2% for non-liquefaction and liquefaction cases, respectively) in its training performance shows a strong learning ability. To further analyse the predictive performance of the model, 5-fold cross-validation is used to train and test the model by equally dividing the collected data into 5 folds [29]. In the crossover trial, four folds are used for training the model, and the remaining fold is used for testing its predictive performance. The process is repeated 5 times so that each fold is involved in training and testing. In addition, the causal path model can be directly taken as a structure of the BN model; discretization of factors according to Table 3, and parameter learning based on the divided data are conducted to learn the parameters or conditional probabilities of the model using the expectationmaximization algorithm. The detail of parameter learning can refer to Hu and Liu [29]. The 5-fold cross-validation is used to verify its performance.
In the 5-fold cross-validation, the comparisons of the performances of the LR and BN models are shown in Fig 11. It can be seen that the accuracies of the BN model are better than those of the LR model in each fold test, as well as in the prediction of liquefied and non-liquefied cases in each fold dataset. The reason is that the LR model ignores the impact of the important factors on liquefaction, e.g. D n , whereas the BN model contains the impact of the factor, as well as other factors, e.g. T s . In addition, the parameters in the LR model are constant, whereas parameters in the BN model are taken as random variables, their values are probability distributions that are more suitable for the calculation of uncertain problems such as liquefaction prediction. What's more, it is worth noting that these two models are more capable of identifying liquefied samples than non-liquefied samples. This is because the liquefied sample size in this study is larger than the non-liquefied sample size, i.e., there is a sampling bias in the training process for each model. Hu et al. [26] suggested that the best sampling bias ratio is between 1 and 1.5 (liquefaction/non-liquefaction) for the BN model and approximately 0.5 for the LR model. However, the ratio of liquefied samples to non-liquefied samples is approximately 2 beyond the recommended ranges. This issue can be dealt with using the oversampling technique or adding more data to balance the ratio [26].

Discussion
This study proposes an approach to quantify the importance of factors and uses a multiple mediation effects model to prove that many factors of liquefaction not only produce direct effects but also significant mediation effects. The 12 key factors identified in this study are almost the same as those concluded in Tang et al. [2], except for grain composition, drainage conditions such as permeability coefficient, and OCR. For the unselected factors, such as the permeability coefficient, grain composition, soil structure, etc., if their MIC values are large, they will be identified as significant factors and vice versa. Furthermore, these factors will slightly affect the structure of the causal model in Fig 9 due to adding new variables to the model, but will not affect the mediation effects of other factors, and might slightly increase the total effects of relevant factors. However, they are not selected because they are difficult to obtain in the historical database. Thus, if the data of these factors are available, their effects should also be considered in the path analysis. In addition, it is worth noting that several variables that were eliminated due to multicollinearity, namely, I, σ V , s 0 V , and V s1 , are also key factors. Therefore, when selecting important factors for predicting seismic liquefaction, these factors should be considered as candidates according to engineering demands.
The multiple mediation model constructed in this study can not only analyse the direct effects of the factors on liquefaction but also their mediation effects and suppression effects. Therefore, it can effectively avoid serious evaluation biases regarding contributions of the factors on liquefaction like the LR model that can only analyze the direct effects of factors. In addition, the causal model can also compare the mediation effects of different paths such as R ! PGA ! LP and R ! t ! LP, which is helpful for a clearer understanding of the liquefaction mechanism of multi-factor coupling. However, because the causal model ignores the influences of site conditions on seismic parameters, this may cause a certain deviation of the indirect influence of seismic parameters on liquefaction in the causal model.
Comparing the total effects of factors in the causal model and the correlation coefficients in Fig 3, it can be found that most factors exhibit the same influence characteristics on liquefaction except for t and D 50 . Obtaining a different or "wrong" sign in these two methods is a common phenomenon [30]. For example, t produces a negative effect on LP in the casual model, whereas it produces a positive effect in the correlation analysis. This is because the correlation analysis only considers the correlation between t and LP, while the regression analysis can consider both the effect of t on LP and the effects of other variables related to t on LP. When the inhibiting effects of other variables are too large, the regression coefficients exhibit anti-regular phenomena. McGuire and Barnhard [31] and Trifunac and Brady [32] proposed the relationships between t and M w and R as lnt = 0.19+0.15M w +0.35lnR and t = 2.33M w +0.149R, respectively. The positive regression coefficients of R in the two functions illustrate the situation. However, from a physical point of view, the larger the R is, the smaller t should be. Therefore, the regression coefficient violates a law of physics but is statistically correct. In addition, ignoring the influences of site conditions on t as mentioned above may cause the endogenous problem, which may lead to an abnormal regression coefficient. Similarly, the reason for the abnormal effect of H n on liquefaction in the casual model is the same as that for t. Therefore, compared with the correlation analysis method, the causal model can reflect the real impacts of factors by considering the mediation effects.
When determining the relationship between factors using the MIC method, the threshold of 0.9 times maxMIC in this study results in the omission of causal relationships between variables. For example, in the initial structure, H n and D s are not connected, but they share a causal connection in the subsequently modified structure. Therefore, the selection of the threshold affects the construction efficiency of the model (i.e., the number of revisions) but does not affect the structure of the final model. Therefore, using the MIC method to construct the structure of the path analysis diagram can quickly and objectively determine an initial path diagram, which greatly reduces the number of subsequent revisions, and using its structure directly as the structure of the BN also results in a strong performance.

Conclusion
The casual path analysis method is applied for the first time to study the direct and mediation effects of various factors on earthquake-induced liquefaction in this study, and a useful approach to quantitatively identify the key factors of liquefaction is presented. The important findings are as follows: 1. Twelve key factors, M w , D 50 , V s , PGA, R, t, D w , D n , D s , T s , H n , and FC, are identified in this study. In addition, I, V s1 , σ V and s 0 V are multicollinearity with PGA, V s , and D s , respectively, but they are also important factors. The results can provide a reference for the selection of factors when constructing a predictive model for liquefaction.