Stochastic gradient boosting frequency-severity model of insurance claims

The standard GLM and GAM frequency-severity models assume independence between the claim frequency and severity. To overcome the restrictions of linear or additive forms and to relax the independence assumption, we develop a data-driven dependent frequency-severity model. We combine a stochastic gradient boosting algorithm with a profile likelihood approach to estimate the parameters of both the claim frequency and average claim severity distributions, and we introduce dependence between the claim frequency and severity by treating the claim frequency as a predictor in the regression model for the average claim severity. The model flexibly captures the nonlinear relation between the claim frequency (severity) and predictors, complex interactions among predictors, and the nonlinear dependence between the claim frequency and severity. A simulation study shows the excellent prediction performance of our model. We then demonstrate the application of our model on a French auto insurance claim dataset. The results show that our model is superior to other state-of-the-art models.


Introduction
Insurance claims modeling is a topic of great concern in non-life insurance. A claims model helps an insurer accurately estimate potential loss and make appropriate actuarial decisions. Specifically, the model enables an insurer to set a fair premium for each individual policy. It is important to charge each policyholder a fair premium. For instance, Dionne, Gouriéroux, and Vanasse [1] point out that in auto insurance, if an insurer charges too little for young drivers and too much for old drivers, young drivers will be attracted while old drivers will switch to competitors; the insurer then loses profitable policies and gains underpriced ones, both resulting in economic losses. Further, the model helps the insurer determine a suitable level of risk capital. Underestimating loss can leave the insurer without enough risk capital and hence raise bankruptcy risk. On the contrary, overestimating loss reduces the insurer's liquid capital and hampers business expansion. Thus, an accurate model of insurance claims is essential to the competitiveness and profitability of an insurer.
The frequency-severity model is a standard model of insurance claims, which separately models the claim frequency and the average claim severity. The claim frequency examines the number of claims, and the average claim severity accounts for the average amount of claims conditional on occurrence. The claim frequency and severity depend on the characteristics of an individual policy. For instance, in auto insurance, the characteristics include the age, gender and motor vehicle record points of the policyholder, the per capita income and population density of the policyholder's residential area, the age and model of the vehicle, etc. Thus, there is a need for predictive models. The traditional frequency-severity model uses generalized linear models (GLM) for the claim frequency and severity: the frequency part often uses Poisson or negative binomial regressions, and the severity part uses gamma or inverse Gaussian regressions. A large literature extends the model to capture different features of the data. For instance, multivariate models give a joint analysis of the frequency or severity at different levels of classification. Anastasopoulos, Shankar, Haddock, and Mannering [2] introduce a multivariate Tobit model to study accident rates categorized by severities. The conditional autoregressive model can be used to accommodate spatial correlation. Huang, Song, Xu, Zeng, Lee, and Abdel-Aty [3] develop a macro-level Bayesian spatial model with a conditional autoregressive prior and a micro-level Bayesian spatial joint model for predicting the claim frequency. Zeng, Wen, Wong, Huang, Guo, and Pei [4] use a bivariate conditional autoregressive model to simultaneously analyze daytime and nighttime claim frequencies. Aguero-Valverde [5] introduces a multivariate conditional autoregressive model to estimate excess claim frequencies at different severity levels.
Generalized linear mixed models and other random parameters models can be used to capture unobserved heterogeneity across observations. Barua, El-Basyouny, and Islam [6] develop a multivariate random parameters conditional autoregressive model to predict claim frequencies. Zeng, Wen, Huang, Pei, and Wong [7] propose a multivariate random parameters Tobit model to analyze accident rates by severity. Zeng, Guo, Wong, Wen, Huang, and Pei [8] introduce a multivariate random parameters spatio-temporal Tobit model to accommodate spatio-temporal correlation and interaction. Dong, Ma, Chen, and Chen [9] use a mixed logit model to examine the difference between single-vehicle and multi-vehicle accident probabilities. Chen, Chen, and Ma [10] adopt a mixed logit model to analyze the hourly accident probability for highway segments. Chen, Song, and Ma [11] develop a random parameters bivariate ordered probit model to investigate the injury severity of the two drivers involved in the same crash. However, there are two major limitations of the frequency-severity model. First, the model has a linear predictor form, while in practice predictors often have nonlinear effects. For instance, in auto insurance, the nonlinear relation between the claim severity and the insured's age is well documented (Frees, Shi, and Valdez [12]). Generalized additive models (GAM), developed in Hastie and Tibshirani [13] and popularized by Wood [14], overcome the restrictive linear form by modeling continuous variables with smooth functions estimated from the data. However, the additive form of GAM models cannot automatically capture complex interactions among predictors. Though interaction terms can be added manually to the model structure, identifying them is tedious when many predictors are involved, and missing important interactions can reduce prediction accuracy. Second, the standard frequency-severity model assumes an independent relation between the claim frequency and severity.
However, in practice, the claim frequency and severity are often dependent. For instance, in auto insurance, the claim frequency and severity are often negatively correlated (Gschlößl and Czado [15]). Home insurance claims due to natural hazards such as earthquakes or floods are both large and frequent in affected areas. Frees, Gao, and Rosenberg [16] also point out that the claim frequency has a significant effect on the claim severity for outpatient expenditures. Gschlößl and Czado [15], Frees, Gao, and Rosenberg [16], Erhardt and Czado [17], Shi, Feng, and Ivantsova [18] and Garrido, Genest, and Schulz [19] capture the dependence between the claim frequency and severity by treating the claim frequency as a predictor variable in the regression model for the average claim severity. Shi, Feng, and Ivantsova [18] show that this predictor method, applied to the GLM frequency-severity model, can only measure a linear relation between the claim frequency and severity. Czado, Kastenmeier, Brechmann, and Min [20], Krämer, Brechmann, Silvestrini, and Czado [21] and Shi, Feng, and Ivantsova [18] model the joint distribution of the claim frequency and average claim severity through copulas. However, popular copulas, such as elliptical and Archimedean copulas, can only capture symmetric or limited dependence structures. The multivariate conditional autoregressive model (Aguero-Valverde [5]), with its random parameters version (Barua, El-Basyouny, and Islam [6]), and the multivariate Tobit model (Anastasopoulos, Shankar, Haddock, and Mannering [2]), with its random parameters version (Zeng, Wen, Huang, Pei, and Wong [7]) and its random parameters spatio-temporal version (Zeng, Guo, Wong, Wen, Huang, and Pei [8]), accommodate the correlation between the claim frequency and severity by modeling claim frequencies or accident rates at different severity levels.
However, the use of finitely many severity levels captures the dependence between the claim frequency and severity only partially. Thus, there is a need for a data-driven dependent frequency-severity model that can learn the optimal model structure from the data and flexibly capture the nonlinear dependence between the claim frequency and severity.
Boosting is one of the most successful ensemble learning methods; it combines a large number of weak prediction models (weak learners) in an additive form to enhance prediction performance. The seminal work is Freund and Schapire [22], which introduces a boosting algorithm named AdaBoost for classification. Breiman et al. [23] and Breiman [24] observe an intrinsic connection between the AdaBoost algorithm and the functional gradient descent algorithm. Friedman, Hastie, Tibshirani, et al. [25] reveal another important fact: AdaBoost and other boosting algorithms are additive models, i.e., additive combinations of weak learners. They then propose a general boosting algorithm named gradient boosting for both classification and regression. The algorithm can be viewed as an estimation method for an additive model that combines weak learners. From this perspective, many boosting regression models have been developed; they differ in form when different loss functions, weak learners or optimization schemes are used. Friedman, Hastie, and Tibshirani [26] and Friedman [27, 28] develop boosting regression models with the least-squares, least absolute deviation and Huber loss functions. Ridgeway [29, 30] proposes the boosting Poisson regression and boosting proportional hazards regression models. Kriegler and Berk [31] introduce the boosting quantile regression model. In the actuarial literature, Noll, Salzmann, and Wuthrich [32] show that the boosting Poisson regression model performs better than the GLM model in predicting the claim frequency. Yang, Qian, and Zou [33] develop a gradient boosting Tweedie compound Poisson model, where they use a profile likelihood approach to estimate the index and dispersion parameters. They show that the model makes more accurate premium predictions than GLM and GAM Tweedie compound Poisson models.
To cope with extremely unbalanced zero-inflated data, Zhou, Yang, and Qian [34] introduce a gradient boosting zero-inflated Tweedie compound Poisson model using a similar method. In fact, the method that combines the gradient boosting algorithm and the profile likelihood approach can be used to develop any gradient boosting exponential family regression model. Sigrist and Hirnschall [35] apply the method to develop a gradient boosting Tobit model for predicting defaults on loans made to Swiss small and medium-sized companies. They show that the model outperforms other state-of-the-art models in predictive performance.
In this paper, we apply the method to develop a gradient boosting frequency-severity model (D-FSBoost). We illustrate the model with a Poisson distribution for the claim frequency and a gamma distribution for the claim severity, and use the profile likelihood approach to estimate the dispersion parameter of the gamma distribution. Gradient boosting frequency-severity models with other exponential family distributions for the claim frequency and severity can be developed in the same manner. Following Gschlößl and Czado [15], Frees, Gao, and Rosenberg [16], Erhardt and Czado [17], Shi, Feng, and Ivantsova [18] and Garrido, Genest, and Schulz [19], we capture the dependence between the claim frequency and severity by treating the claim frequency as a predictor in the regression model for the average claim severity. Since the gradient boosting gamma regression model can learn the optimal model structure from the data, the D-FSBoost model can fully capture the nonlinear dependence between the claim frequency and severity. The D-FSBoost model inherits the advantages of boosting models, such as a data-driven model structure, high prediction accuracy, automatic feature selection and a strong capacity to avoid overfitting. In a simulation study, we demonstrate that the D-FSBoost model can flexibly capture the nonlinear relation between the claim frequency (severity) and predictors, complex and higher-order interactions among predictors, and the nonlinear dependence between the claim frequency and severity. We compare the D-FSBoost model with GLM and GAM frequency-severity models and show that the D-FSBoost model makes more accurate predictions of the claim frequency and severity distributions. We then apply the D-FSBoost model to analyze a French auto insurance claim dataset.
We provide further evidence of the dependence between the claim frequency and severity and show that the frequency-severity model can be significantly improved by taking the claim frequency as a predictor in the regression model for the average claim severity. We also show that the D-FSBoost model is superior to other state-of-the-art models in predicting the pure premium.
The rest of this paper is organized as follows. In section 2, we review the gradient boosting algorithm and introduce the D-FSBoost model. In section 3, we show the high prediction accuracy of the model in a simulation study. Finally, in section 4, we apply the model to analyze a French auto insurance claim dataset.

Stochastic gradient boosting frequency-severity model
In this section, we introduce the stochastic gradient boosting algorithm. Then, we show the implementation of the D-FSBoost model.

Stochastic gradient boosting
In this subsection, we briefly review the stochastic gradient boosting algorithm of Friedman [28]. Denote by $x = (x_1, \ldots, x_p)$ the set of predictors and by $y$ the response variable. Given a training sample $\{y_i, x_i\}_{i=1}^{d}$ and a loss function $C(y, f(x))$, the algorithm estimates the optimal prediction function $\hat{f}(x)$ by minimizing loss over the training sample, where $f(x)$ is constrained to a sum of weak learners,
$$f(x) = \sum_{m=1}^{M} \beta_m h(x; a_m),$$
where $h(x; a_m)$ is a weak learner with parameter vector $a_m$, $\beta_m \in \mathbb{R}$ is an expansion coefficient, and $M$ is the number of weak learners.
The algorithm estimates $\hat{f}(x)$ in a forward stagewise manner. The constant $f_0(x)$ serves as an initial estimate of $\hat{f}(x)$,
$$f_0(x) = \arg\min_{\rho} \sum_{i=1}^{d} C(y_i, \rho).$$
Denote by $f_{m-1}(x)$ the estimate of $\hat{f}(x)$ at the $(m-1)$-th step. At the $m$-th step, the algorithm randomly selects a subsample $\{\tilde{y}_i, \tilde{x}_i\}_{i=1}^{\tilde{d}}$ of size $\tilde{d} < d$, computes the negative gradient
$$z_i = -\left[\frac{\partial C(\tilde{y}_i, f(\tilde{x}_i))}{\partial f(\tilde{x}_i)}\right]_{f = f_{m-1}}, \quad i = 1, \ldots, \tilde{d},$$
and then fits the weak learner $h(x; a_m)$ by minimizing the least-squares sum
$$a_m = \arg\min_{a} \sum_{i=1}^{\tilde{d}} \left(z_i - h(\tilde{x}_i; a)\right)^2.$$
The optimal value of $\beta_m$ is determined by
$$\beta_m = \arg\min_{\beta} \sum_{i=1}^{\tilde{d}} C\!\left(\tilde{y}_i,\, f_{m-1}(\tilde{x}_i) + \beta h(\tilde{x}_i; a_m)\right).$$
The current estimate of $\hat{f}(x)$ is then updated as
$$f_m(x) = f_{m-1}(x) + \nu \beta_m h(x; a_m),$$
where $0 < \nu \le 1$ is the shrinkage factor that controls the learning rate. Friedman [27] points out that a small $\nu$ reduces overfitting and enhances predictive performance.
The algorithm reduces to a standard gradient boosting algorithm when the full sample is used at each iteration in place of the randomly selected subsample. Friedman [28] shows that the stochastic gradient boosting algorithm has a faster computation speed and higher prediction accuracy.
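As a concrete illustration, the steps above can be sketched with squared-error loss, where the negative gradient is simply the current residual. This is a minimal sketch rather than the paper's implementation: the depth-1 stump learner, the settings ($\nu$, $M$, subsample fraction) and all function names are our own illustrative choices.

```python
# Minimal sketch of stochastic gradient boosting with squared-error loss,
# where the negative gradient z_i is just the residual y_i - f_{m-1}(x_i).
# Depth-1 regression trees (stumps) stand in for the weak learners h(x; a_m).
import numpy as np

def fit_stump(X, z):
    """Least-squares best single-split regression tree for targets z."""
    best = None
    n = len(z)
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])
        xs, zs = X[order, j], z[order]
        csum = np.cumsum(zs)
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue
            lmean = csum[i - 1] / i
            rmean = (csum[-1] - csum[i - 1]) / (n - i)
            # minimizing SSE is equivalent to maximizing this score
            score = i * lmean ** 2 + (n - i) * rmean ** 2
            if best is None or score > best[0]:
                best = (score, j, (xs[i - 1] + xs[i]) / 2, lmean, rmean)
    _, j, thr, lmean, rmean = best
    return j, thr, lmean, rmean

def stump_predict(stump, X):
    j, thr, lmean, rmean = stump
    return np.where(X[:, j] <= thr, lmean, rmean)

def sgb_fit(X, y, M=200, nu=0.1, frac=0.5, seed=0):
    rng = np.random.default_rng(seed)
    f0 = y.mean()                                   # constant initial estimate
    pred = np.full(len(y), f0)
    stumps = []
    n_sub = int(frac * len(y))
    for _ in range(M):
        idx = rng.choice(len(y), n_sub, replace=False)  # random subsample
        z = y[idx] - pred[idx]                      # negative gradient
        stump = fit_stump(X[idx], z)                # least-squares weak learner
        stumps.append(stump)
        pred += nu * stump_predict(stump, X)        # shrunken stagewise update
    return f0, stumps

def sgb_predict(model, X, nu=0.1):
    f0, stumps = model
    return f0 + nu * sum(stump_predict(s, X) for s in stumps)

# toy check: the ensemble should fit a nonlinear target far better than a mean
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(500, 2))
y = np.sin(6 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 500)
model = sgb_fit(X, y)
mse = float(np.mean((sgb_predict(model, X) - y) ** 2))
```

For general losses (e.g. the Poisson and gamma negative log-likelihoods used later), only the negative-gradient and line-search steps change; the tree-fitting step is identical.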

The D-FSBoost model
In this subsection, we introduce the dependent frequency-severity model. Then, we estimate mean parameters by using the stochastic gradient boosting algorithm.
In the frequency-severity model, we model the claim frequency $N$ with a Poisson distribution with parameter $\lambda > 0$,
$$P(N = n) = \frac{\lambda^n e^{-\lambda}}{n!}, \quad n = 0, 1, 2, \ldots$$
For $N > 0$, denote by $\tilde{Y} = \frac{1}{N}\sum_{j=1}^{N} Y_j$ the average claim severity, where $Y_j$ is the $j$-th claim amount. We model the average claim severity $\tilde{Y}$ conditional on $N$ via a gamma distribution with mean parameter $\mu_N > 0$ and dispersion parameter $\delta > 0$, where we model the dependence between the claim frequency and severity by making the mean parameter $\mu_N$ depend on $N$.
Denote by $x$ the vector of predictors representing the characteristics of an individual policy. We assume that the parameters $\lambda$ and $\mu_N$ are determined by the following two regression models with log link functions:
$$\log \lambda = F_N(x; \alpha), \qquad \log \mu_N = F_{\tilde{Y}|N}(x, N; \beta),$$
where $F_N: \mathbb{R}^p \to \mathbb{R}$ and $F_{\tilde{Y}|N}: \mathbb{R}^p \times \mathbb{N} \to \mathbb{R}$ are two regression functions, and $\alpha$ and $\beta$ denote the parameter vectors of $F_N$ and $F_{\tilde{Y}|N}$, respectively. The functions $F_N$ and $F_{\tilde{Y}|N}$ are restricted to linear and additive forms in GLM and GAM models, respectively. In our model, $F_N$ and $F_{\tilde{Y}|N}$ are ensembles of weak learners. For the time being, we assume that the dispersion parameter $\delta$ is given; we estimate $\delta$ later. We then apply the stochastic gradient boosting algorithm to estimate the functions $F_N$ and $F_{\tilde{Y}|N}$.
Denote by $\{n_i, s_i, x_i\}$ the claim frequency, the average claim severity and the vector of predictors for the $i$-th policy, respectively. We consider $\theta$ independent insurance policies. The log-likelihood function then decomposes as
$$l(\alpha, \beta, \delta) = \underbrace{\sum_{i=1}^{\theta} \log P(N = n_i \mid x_i; \alpha)}_{l_1(\alpha)} + \underbrace{\sum_{i:\, n_i > 0} \log g_{\tilde{Y}|N}(s_i \mid n_i, x_i; \beta, \delta)}_{l_2(\beta, \delta)},$$
where $g_{\tilde{Y}|N}$ denotes the gamma density of the average claim severity. Since maximizing the log-likelihood function is equivalent to maximizing $l_1(\alpha)$ and $l_2(\beta, \delta)$ separately, we use the negative log-likelihood functions $-l_1(\alpha)$ and $-l_2(\beta, \delta)$ as loss functions and estimate the functions $F_N(x; \alpha)$ and $F_{\tilde{Y}|N}(x, N; \beta)$ by minimizing loss over the sample, where the functions $f(x)$ and $g(x, N)$ are confined to a sum of weak learners as in (2).
The gradient boosting algorithm then estimates $\hat{f}(x)$ and $\hat{g}(x, N)$ in a forward stagewise manner. The initial estimates $f_0(x)$ and $g_0(x, N)$ are the constants minimizing $-l_1$ and $-l_2$, respectively. Denote by $f_{m-1}(x)$ and $g_{m-1}(x, N)$ the estimates of $\hat{f}(x)$ and $\hat{g}(x, N)$ at the $(m-1)$-th step. At the $m$-th step, the algorithm randomly selects a subsample of size $\tilde{\theta} < \theta$ and computes the negative gradients $z_i^f$ and $z_i^g$ of the two loss functions. The algorithm then fits weak learners $h_f(x; a_m^f)$ and $h_g(x, N; a_m^g)$ by minimizing the corresponding least-squares sums. We use $K$-terminal node regression trees as weak learners, i.e.,
$$h_f\!\left(x; \{U_{k,m}, \gamma^f_{k,m}\}_{k=1}^{K}\right) = \sum_{k=1}^{K} \gamma^f_{k,m}\, \mathbb{1}(x \in U_{k,m}),$$
and analogously for $h_g$ with regions $\{V_{k,m}\}_{k=1}^{K}$. The least-squares fitting problem is solved by a greedy algorithm with a least-squares splitting criterion (Friedman [27]).
Once the weak learners $h_f(x; a_m^f)$ and $h_g(x, N; a_m^g)$ are obtained, the optimal expansion coefficients $b_m^f$ and $b_m^g$ are found by minimizing the respective losses. A better estimate of $\hat{f}(x)$ and $\hat{g}(x, N)$ is obtained by replacing the single expansion coefficient $b_m^f$ ($b_m^g$) with an optimal coefficient $\gamma^f_{k,m}$ ($\gamma^g_{k,m}$) for each region $U_{k,m}$ ($V_{k,m}$), $k = 1, \ldots, K$. The optimal coefficients solve the loss-minimization problem within each region and admit explicit solutions for the Poisson and gamma losses. The estimates of $\hat{f}(x)$ and $\hat{g}(x, N)$ are then updated as
$$f_m(x) = f_{m-1}(x) + \nu \sum_{k=1}^{K} \gamma^f_{k,m}\, \mathbb{1}(x \in U_{k,m}), \qquad g_m(x, N) = g_{m-1}(x, N) + \nu \sum_{k=1}^{K} \gamma^g_{k,m}\, \mathbb{1}\!\left((x, N) \in V_{k,m}\right),$$
where we set $\nu = 0.03$ in our implementation. The procedure is repeated $M$ times, and $f_M(x)$ and $g_M(x, N)$ are the final estimates.
The D-FSBoost algorithm is summarized as follows:

The D-FSBoost Algorithm
1. Initialize $f_0(x)$ and $g_0(x, N)$ with the constant minimizers of $-l_1$ and $-l_2$.
2. For $m = 1, \ldots, M$: randomly select a subsample of size $\tilde{\theta} < \theta$ and compute the negative gradients $\{z_i^f\}_{i=1}^{\tilde{\theta}}$ and $\{z_i^g\}_{i=1}^{\tilde{\theta}}$.
3. Fit $K$-terminal node regression trees to the two datasets $\{z_i^f, \tilde{x}_i\}_{i=1}^{\tilde{\theta}}$ and $\{z_i^g, (\tilde{x}_i, \tilde{n}_i)\}_{i=1}^{\tilde{\theta}}$ with a least-squares splitting criterion, obtaining the regions $\{U_{k,m}\}_{k=1}^{K}$ and $\{V_{k,m}\}_{k=1}^{K}$.
4. Compute the optimal coefficient $\gamma^f_{k,m}$ ($\gamma^g_{k,m}$) for each region $U_{k,m}$ ($V_{k,m}$), $k = 1, \ldots, K$, and update $f_m(x)$ and $g_m(x, N)$ with shrinkage factor $\nu$.
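To make the loss ingredients concrete, the sketch below writes out the negative gradients of $-l_1$ and $-l_2$ under log links, together with the closed-form terminal-node coefficients they imply. The gamma part assumes the mean/dispersion parameterization (mean $e^g$, dispersion $\delta$); this is our reading of the setup, not the authors' exact derivation.

```python
# Hedged sketch of the per-policy loss ingredients behind the D-FSBoost
# updates: negative gradients of -l1 (Poisson, log link) and -l2 (gamma,
# log link, mean exp(g), dispersion delta -- an assumed parameterization),
# plus the closed-form optimal coefficient for a terminal-node region.
import numpy as np

def poisson_neg_gradient(n, f):
    # per-policy -l1 term: exp(f) - n*f  =>  negative gradient is n - exp(f)
    return n - np.exp(f)

def gamma_neg_gradient(s, g, delta):
    # per-policy -l2 term (up to g-free constants): (s*exp(-g) + g)/delta
    # => negative gradient is (s*exp(-g) - 1)/delta
    return (s * np.exp(-g) - 1.0) / delta

def poisson_node_coef(n, f):
    # argmin_c sum_i [exp(f_i + c) - n_i*(f_i + c)] = log(sum n / sum exp(f))
    return np.log(n.sum() / np.exp(f).sum())

def gamma_node_coef(s, g):
    # argmin_c sum_i [s_i*exp(-(g_i + c)) + (g_i + c)] = log(mean s*exp(-g))
    return np.log(np.mean(s * np.exp(-g)))

# sanity check: at the optimal node coefficient the summed gradient vanishes
n = np.array([0.0, 1.0, 2.0, 1.0]); f = np.zeros(4)
s = np.array([1.0, 2.0]);           g = np.zeros(2)
pois_resid = float(poisson_neg_gradient(n, f + poisson_node_coef(n, f)).sum())
gam_resid = float(gamma_neg_gradient(s, g + gamma_node_coef(s, g), 2.0).sum())
```

The vanishing summed gradient at the optimal coefficient is exactly the first-order condition behind the explicit per-region solutions mentioned above.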

Estimating δ and choice of tuning parameters
We estimate the dispersion parameter $\delta$ using the profile likelihood approach. The D-FSBoost algorithm determines the value of $\beta$ for each fixed $\delta$; denote by $\beta_\delta$ the estimated value of $\beta$. The profile log-likelihood function for $\delta$ is then
$$\tilde{l}(\delta) = l_2(\beta_\delta, \delta),$$
and the value of the dispersion parameter is obtained by maximizing it, $\hat{\delta} = \arg\max_{\delta} \tilde{l}(\delta)$. In the implementation, we need to select tuning parameters, namely the size of trees $K$ and the number of trees $M$. The value of $K$ controls the degree of interaction among the predictors $x$ or $(N, x)$. An appropriate value of $M$ avoids overfitting and improves out-of-sample prediction accuracy. We determine the parameters $(K, M)$ via cross-validation. The $k$-fold cross-validation method splits the data into $k$ equal-sized folds. Let $\kappa(i): \{1, \ldots, \theta\} \to \{1, \ldots, k\}$ be an index function that indicates the fold to which the $i$-th observation is allocated by randomization. We calculate the loss of the $j$-th fold using functions estimated on the remaining $k - 1$ folds, and repeat this procedure for $j = 1, \ldots, k$. Denote by $(\hat{f}^{-j}(x; K, M), \hat{g}^{-j}(x, N; K, M))$ the functions estimated with the $j$-th fold removed and with parameters $(K, M)$. The cross-validation estimate of loss sums, over all observations, the Poisson and gamma negative log-likelihood contributions evaluated at the fold-removed estimates $(\hat{f}^{-\kappa(i)}, \hat{g}^{-\kappa(i)})$, and $(\hat{K}, \hat{M})$ is chosen to minimize this estimate. We then use $(\hat{K}, \hat{M})$ in the D-FSBoost algorithm and complete all estimates.
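A grid-search version of the profile step for $\delta$ can be sketched as follows. For illustration the boosting fit of $\beta_\delta$ is replaced by a closed-form mean estimate, and the gamma density uses an assumed shape $1/\delta$, scale $\mu\delta$ parameterization; the real algorithm would rerun D-FSBoost for each candidate $\delta$.

```python
# Hedged sketch of profiling the dispersion delta over a grid. A closed-form
# mean estimate stands in for the boosted fit beta_delta (an illustrative
# simplification), and the gamma log-density assumes shape = 1/delta,
# scale = mu*delta, so that the mean is mu and the dispersion is delta.
import numpy as np
from math import lgamma

def gamma_loglik(s, mu, delta):
    k = 1.0 / delta                       # shape; scale mu*delta gives mean mu
    return float(np.sum((k - 1) * np.log(s) - s / (mu * delta)
                        - k * np.log(mu * delta) - lgamma(k)))

def profile_delta(s, grid):
    mu = s.mean()                         # stand-in for the fitted mean model
    lls = [gamma_loglik(s, mu, d) for d in grid]
    return float(grid[int(np.argmax(lls))])

rng = np.random.default_rng(0)
true_delta = 0.5
s = rng.gamma(shape=1.0 / true_delta, scale=1.0 * true_delta, size=20000)
grid = np.arange(0.1, 1.51, 0.1)
delta_hat = profile_delta(s, grid)
```

With 20000 simulated severities the profile maximum lands on the grid point nearest the true dispersion.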

Simulation study
In this section, we compare the D-FSBoost model with GLM and GAM frequency-severity models in two simulation experiments. We consider the models both in the case where the frequency and severity are independent and in the case where they are dependent. Denote by D-FSBoost, D-GAM and D-GLM the three models in the dependent case and by FSBoost, GAM and GLM the three models in the independent case. We compare the models on prediction accuracy of the claim frequency and severity distributions. We also investigate the impact of the value of $\delta$ on estimating $F_{\tilde{Y}|N}(x, N; \beta)$ in the D-FSBoost model.
In simulation studies, we use one set of samples for training and another one for testing.
Denote by $\{\hat{n}_i, \hat{s}_i, \hat{x}_i\}_{i=1}^{\hat{\theta}}$ the testing sample with known true functions and parameters $\{F_N(x; \alpha), F_{\tilde{Y}|N}(x, N; \beta), \delta\}$. Let $\{\hat{f}(x), \hat{g}(x, N), \hat{\delta}\}$ be the functions and parameters estimated by the model. We use the out-of-sample loss and parameter estimation errors to measure the prediction accuracy of the models.

Simple case
In this subsection, we demonstrate that the D-FSBoost model can capture the nonlinear relation between the claim frequency (severity) and predictors, complex interactions among predictors, and the nonlinear dependence between the claim frequency and severity. The sample $\{n_i, s_i, x_i\}_{i=1}^{\theta}$ is generated using the following specifications:
$$n_i \sim \text{Poi}(\lambda_i), \quad s_i \sim \text{Gamma}(\mu_{n_i}, \delta), \quad x_{ij} \sim \text{Unif}(0, 1), \quad i = 1, \ldots, \theta, \; j = 1, \ldots, 4,$$
where $\lambda_i = \exp(F_1(x_{i1}, x_{i2}))$, $\mu_{n_i} = \exp(F_2(x_{i3}, x_{i4}, n_i))$, $\delta = 2$, and $F_1$ and $F_2$ are nonlinear functions that include the cross-product interaction terms $x_{i1}x_{i2}$ and $x_{i3}x_{i4}$, respectively. We generate a sample of size 10000 for training and another of size 10000 for testing. The out-of-sample loss and parameter estimation errors on the testing sample are shown in Table 2, averaged over 20 independent replications. Since the independent and dependent models share the same claim frequency model, we only list the claim frequency results for the dependent models. We find that dependent models perform better than independent ones. Among the dependent models, the D-FSBoost model has the best performance, with the smallest out-of-sample loss and parameter estimation errors. In contrast to the GLM, D-GLM, GAM and D-GAM models, the FSBoost and D-FSBoost models can capture complex interactions. Denote by $c_1$ and $c_2$ the coefficients of the cross-product terms $x_{i1}x_{i2}$ and $x_{i3}x_{i4}$, respectively. In Fig 1, we change $c_1$ from 8 to 12 and $c_2$ from 0.3 to 0.7 to increase the effects of the interaction terms. The FSBoost and D-FSBoost models show more stable predictive performance, while the parameter estimation errors of the GLM, D-GLM, GAM and D-GAM models show an increasing trend since these models cannot capture the interaction effects. Next, we use the values of $x_{i3}$ and $x_{i4}$ in the training sample and change all values of $n_i$, $i = 1, \ldots, 10000$, from 0 to 20. For each value of $n_i$ we calculate the average estimated severity mean, denoted $s$, and show the change of $s$ with respect to $n_i$ in Fig 2.
The D-GLM model can only measure a linear relation between the claim frequency and severity. Both the D-GAM and D-FSBoost models can capture the nonlinear dependence between the claim frequency and severity, and the D-FSBoost model performs better. The results also confirm that the D-FSBoost model can capture the nonlinear relation between the claim frequency (severity) and predictors.
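The data-generating scheme of this simple case can be sketched as below. The functions `F1` and `F2` here are illustrative stand-ins (the paper's exact specifications, which include the cross-product terms $x_{i1}x_{i2}$ and $x_{i3}x_{i4}$, are not reproduced), but the sketch keeps the essential feature that the severity mean depends on the simulated claim count.

```python
# Hedged sketch of a dependent frequency-severity simulation. F1 and F2 are
# illustrative; the key feature is that mu depends on the simulated n_i.
import numpy as np

rng = np.random.default_rng(42)
theta, delta = 10000, 2.0
x = rng.uniform(0, 1, size=(theta, 4))

def F1(x1, x2):
    # illustrative frequency predictor with a cross-product interaction
    return -1.0 + x1 + x2 + 1.0 * x1 * x2

def F2(x3, x4, n):
    # illustrative severity predictor; note the dependence on the count n
    return 1.0 + x3 - x4 + 0.3 * np.log1p(n)

lam = np.exp(F1(x[:, 0], x[:, 1]))
n = rng.poisson(lam)

pos = n > 0                           # severity is observed only when n_i > 0
mu = np.exp(F2(x[pos, 2], x[pos, 3], n[pos]))
s = rng.gamma(shape=1.0 / delta, scale=mu * delta)  # mean mu, dispersion delta
frac_pos = float(pos.mean())
```

A dependent model fitted to such data receives $n_i$ as an extra predictor in the severity part; an independent model ignores it and misses the `log1p(n)` term.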

Complex case
In this subsection, we evaluate the D-FSBoost model in a complex simulation experiment. We compare the models on a variety of randomly generated functions produced by the "random function generator" of Friedman [27]. The generator produces a function as a linear expansion of functions $\{g_k\}_{k=1}^{20}$:
$$F(x) = \sum_{k=1}^{20} a_k g_k(z_k).$$
Each coefficient $a_k$ is generated from a uniform distribution on $(0, 1)$. Each $z_k$ is an $m_k$-sized subset of the $p$ input variables $x$,
$$z_k = \{x_{\phi_k(1)}, \ldots, x_{\phi_k(m_k)}\},$$
where $\phi_k$ is an independent random permutation of the integers $\{1, \ldots, p\}$. The size $m_k$ is randomly selected as $\min(\lfloor 2.5 + r_k \rfloor, p)$, where $r_k$ is generated from an exponential distribution with mean 2, so the expected number of input variables for each $g_k(z_k)$ is between four and five. Each $g_k(z_k)$ is an $m_k$-dimensional Gaussian function
$$g_k(z_k) = \exp\!\left(-\tfrac{1}{2}(z_k - u_k)^{\top} V_k^{-1} (z_k - u_k)\right),$$
where each mean vector $u_k$ is generated from the same distribution as $z_k$. The $m_k \times m_k$ covariance matrix $V_k$ is generated by
$$V_k = U_k D_k U_k^{\top},$$
where $U_k$ is a random orthonormal matrix, $D_k = \text{diag}\{d_{k1}, \ldots, d_{km_k}\}$, and the square root of each eigenvalue $d_{kj}$ is generated from a uniform distribution on $(a, b)$, with the values of $a$ and $b$ determined by the distribution of $z_k$. We set the number of predictors $p = 10$ and generate the data $\{n_i, s_i, x_i, y_i\}_{i=1}^{\theta}$ using the following specifications:
$$n_i \sim \text{Poi}(\lambda_i), \quad s_i \sim \text{Gamma}(\mu_{n_i}, \delta), \quad x_i \sim N(0, I_p), \quad y_i \sim N(0, I_p), \quad i = 1, \ldots, \theta,$$
where $\lambda_i = 1.2\exp(F_1(x_i))$, $\mu_{n_i} = \exp(\log(n_i + 5) F_2(y_i))$, $\delta = 2$, and $F_1(x_i)$ and $F_2(y_i)$ are functions generated from the random function generator. The eigenvalue limits are $a = 0.1$ and $b = 2$. We generate 10000 observations for training and another 10000 for testing. Table 3 reports parameter estimation errors on the testing sample, averaged over 10 independent replications, and Fig 3 shows the out-of-sample loss. The results mirror the simple case: dependent models give more accurate predictions than independent models.
The D-FSBoost model performs best in predicting the claim frequency and severity distributions.
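The random function generator described above can be sketched as follows. The construction (random variable subsets, Gaussian bumps, random orthonormal eigenstructure) follows the description in the text, while the function names and defaults are our own.

```python
# Sketch of Friedman's "random function generator": a sum of 20 randomly
# weighted Gaussian bumps, each defined on a random subset of the inputs.
import numpy as np

def random_function(p=10, n_terms=20, a=0.1, b=2.0, seed=0):
    rng = np.random.default_rng(seed)
    terms = []
    for _ in range(n_terms):
        coef = rng.uniform(0, 1)                      # expansion coefficient a_k
        m = min(int(2.5 + rng.exponential(2.0)), p)   # subset size m_k
        idx = rng.permutation(p)[:m]                  # random variable subset z_k
        u = rng.normal(0, 1, m)                       # mean u_k ~ N(0, I_m)
        Q = np.linalg.qr(rng.normal(0, 1, (m, m)))[0] # random orthonormal U_k
        d = rng.uniform(a, b, m) ** 2                 # eigenvalues d_kj
        Vinv = Q @ np.diag(1.0 / d) @ Q.T             # V_k^{-1} = U D^{-1} U^T
        terms.append((coef, idx, u, Vinv))

    def F(X):
        out = np.zeros(len(X))
        for coef, idx, u, Vinv in terms:
            z = X[:, idx] - u
            # quadratic form (z - u)^T Vinv (z - u), row by row
            out += coef * np.exp(-0.5 * np.einsum('ij,jk,ik->i', z, Vinv, z))
        return out

    return F

F1 = random_function(seed=1)
X = np.random.default_rng(2).normal(0, 1, (100, 10))
vals = F1(X)
```

Each draw of `random_function` gives a different nonlinear target, so averaging results over seeds probes robustness across many interaction structures.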

The impact of the parameter δ
In this subsection, we investigate the impact of the value of $\delta$ on estimating $F_{\tilde{Y}|N}(x, N; \beta)$. We generate 20 sets of training samples as in the complex case. Then, we estimate $F_{\tilde{Y}|N}(x, N; \beta)$ using the D-FSBoost algorithm for each value of $\delta \in \{1.5, 1.6, \ldots, 2.5\}$. Fig 4 shows the parameter estimation errors. We can see that the value of $\delta$ has no significant effect on the estimation accuracy of $F_{\tilde{Y}|N}(x, N; \beta)$.

Application
In this section, we apply the D-FSBoost model to analyze a French auto insurance claim dataset. We compare the models on prediction of the claim frequency and severity distributions. Then, we introduce two important interpretation tools from Friedman [27], variable importance measures and partial dependence plots, to interpret the D-FSBoost model.

Data
We consider a French motor third-party liability dataset, where the data "freMTPL2freq" and "freMTPL2sev" are in the R package "CASdatasets". Noll, Salzmann, and Wuthrich [32] use the data "freMTPL2freq" to compare the GLM, regression tree, gradient boosting Poisson model and neural network in predicting the claim frequency. We perform the same data preprocessing as Noll, Salzmann, and Wuthrich [32], except that we delete records with positive claim frequency but no claim severity and use different partitions of the variables "VehAge" and "DrivAge". After preprocessing, the dataset contains 668897 records. The dataset is openly available from S1 Dataset. Table 4 shows the variables in the dataset. There are 24944 (3.73%) policies with positive claim frequency. Table 5 reports the distribution of the claim frequency and average claim severity. Only a few policies have a claim frequency larger than 3. The average claim severity shows an increasing trend as the claim frequency changes from 0 to 3, which implies a positive dependence structure between the claim frequency and severity. In Figs 5 and 6, we find that old cars tend to incur more accidents and higher claim payments, and that young drivers, who have less driving experience than middle-aged and old drivers, cause more car crashes and more serious accident losses. In Fig 7, we find interactions among the predictor variables. For young drivers, the vehicle age has a significant effect on the claim frequency; as the driver age increases, this effect gradually decreases. For young and old drivers, there are significant differences in the claim severity between vehicle age groups, while for middle-aged drivers the differences are small.

Model comparison
We use 445931 observations as training data and the remaining 222966 as testing data. Then, we estimate the GLM, D-GLM, GAM, D-GAM, FSBoost and D-FSBoost models. In dependent models, we take the frequency/exposure instead of the frequency as the predictor variable. The FSBoost and D-FSBoost models perform automatic feature selection; in the GLM, D-GLM, GAM and D-GAM models, we remove the insignificant variables. Table 6 shows the out-of-sample loss for the models. The results indicate that dependent models are more competitive than independent models, and the D-FSBoost model is the most favorable. Then, we calculate pure premium predictions from the models on the testing data. We compare the models using a Gini index that measures the discrepancy between the premium and loss distributions (Frees, Meyers, and Cummings [36, 37]). Let $B(x)$ be the base premium and $T(x)$ be the alternative premium. Denote by $P(x_i)$ and $y_i$ the pure premium and loss for the $i$-th observation, respectively. Frees, Meyers, and Cummings [36] define the relativity $R(x) = T(x)/B(x)$ and order the observations by the relativities $\{R(x_1), \ldots, R(x_\theta)\}$. They define the ordered premium distribution as
$$F_P(s) = \frac{\sum_{i=1}^{\theta} B(x_i)\, \mathbb{1}(R(x_i) \le s)}{\sum_{i=1}^{\theta} B(x_i)}$$
and the ordered loss distribution as
$$F_L(s) = \frac{\sum_{i=1}^{\theta} y_i\, \mathbb{1}(R(x_i) \le s)}{\sum_{i=1}^{\theta} y_i}.$$
The graph of $(F_P(s), F_L(s))$ is an ordered Lorenz curve. When the percentage of losses equals the percentage of premiums, the curve is the 45-degree line known as "the line of equality". The Gini index is defined as twice the area between the Lorenz curve and the line of equality, and the empirical Gini index can be computed by the trapezoid rule as
$$\widehat{Gini} = 1 - \sum_{i=1}^{\theta} \left[F_P(R(x_i)) - F_P(R(x_{i-1}))\right]\left[F_L(R(x_i)) + F_L(R(x_{i-1}))\right],$$
where $F_P(R(x_0)) = F_L(R(x_0)) = 0$. A larger Gini index represents more profits for an insurer. Table 7 reports Gini indices calculated by using the prediction from each model as the base premium and predictions from the remaining models as alternative premiums. Following Frees, Meyers, and Cummings [37] and Yang, Qian, and Zou [33], we use a "minimax" strategy to find the best model: for each base premium, we calculate the maximum Gini index over all alternative premiums.
Then, we choose the base premium model that is least vulnerable to alternative premium models, i.e., we select the base premium model that has the smallest maximum Gini index. We find that the maximum Gini index is 0.9432 when using GLM as the base premium model, -0.1300 when using D-GLM, 0.0198 when using GAM, 0.0737 when using D-GAM, 0.0233 when using FSBoost, -0.2855 when using D-FSBoost. Thus, the D-FSBoost model represents the best choice.
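The ordered Lorenz curve and the empirical Gini index used in this comparison can be computed as in the sketch below; the trapezoid-rule implementation and the function name are ours.

```python
# Sketch of the ordered-Lorenz-curve Gini index: observations are ordered by
# the relativity R = T/B, and the Gini index is twice the area between the
# Lorenz curve (F_P, F_L) and the 45-degree line (trapezoid rule).
import numpy as np

def gini_index(base_premium, alt_premium, loss):
    R = alt_premium / base_premium                 # relativities
    order = np.argsort(R)
    B, y = base_premium[order], loss[order]
    FP = np.concatenate([[0.0], np.cumsum(B) / B.sum()])  # ordered premium dist
    FL = np.concatenate([[0.0], np.cumsum(y) / y.sum()])  # ordered loss dist
    area_under = np.sum(np.diff(FP) * (FL[1:] + FL[:-1]) / 2.0)
    return 2.0 * (0.5 - area_under)                # twice area between curves

# toy check: a flat base premium is vulnerable to an alternative premium that
# ranks risks the same way the realized losses do, so the Gini index is large
loss = np.array([0.0, 0.0, 1.0, 4.0])
base = np.ones(4)
alt = np.array([0.5, 0.7, 1.5, 3.0])
g = float(gini_index(base, alt, loss))
```

In the minimax comparison, this function is evaluated for every (base, alternative) model pair, and the base model with the smallest maximum Gini index is preferred.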

Model interpretation
In this subsection, we use variable importance measures and partial dependence plots to interpret the D-FSBoost model. Variable importance measures show the importance of each predictor in predicting the frequency (severity), and partial dependence plots visualize the effect of a predictor on the frequency (severity). Variable importance. For a single $K$-terminal node tree $T_i$, Breiman, Friedman, Olshen, and Stone [38] introduce the following importance measure for the predictor $x_j$:
$$I_{x_j}(T_i) = \sum_{k=1}^{K-1} \rho_k\, \mathbb{1}_{\upsilon_k}(x_j),$$
where the sum is taken over all $K - 1$ internal nodes, $\upsilon_k$ is the splitting variable in node $k$, $\mathbb{1}_{\upsilon_k}(x_j)$ is an indicator function that equals one when the splitting variable $\upsilon_k$ is $x_j$, and $\rho_k$ denotes the decrease in squared error obtained by splitting the region associated with node $k$ into two subregions. Friedman [27] generalizes the variable importance measure to the gradient boosting model by averaging over all trees $\{T_1, \ldots, T_M\}$:
$$\hat{I}_{x_j} = \frac{1}{M} \sum_{m=1}^{M} I_{x_j}(T_m).$$
The variable importance measure is biased, since a predictor $x_j$ that is independent of the response can still be selected as a splitting variable, so $\hat{I}_{x_j}$ need not be zero. See Sandri and Zuccolotto [39, 40] for a bias correction.
In Fig 8, we show variable importance measures for the D-FSBoost model. We find that VehBrand and BonusMalus are the two most important variables in predicting the frequency, with VehBrand dominating the prediction. In predicting the severity, the variables DrivAge, Frequency/Exposure, BonusMalus and LogDensity are the most influential, with DrivAge and Frequency/Exposure exerting the leading effects. This result also provides further evidence of the dependence between the frequency and severity.
Partial dependence plots. Let $z_k$ be a subset of the variables $x$ and $z_{-k}$ its complement, so that $z_k \cup z_{-k} = x$. The partial dependence of $F_N(x; \alpha)$ on $z_k$ can be calculated as
$$\hat{F}(z_k) = \frac{1}{\theta} \sum_{i=1}^{\theta} F_N(z_k, z_{i,-k}; \alpha),$$
where $z_{i,-k}$ is the $i$-th observation of $z_{-k}$. The partial dependence plot of the frequency part is obtained by plotting the function $\hat{F}(z_k)$ against $z_k$; the plot for the severity part is obtained in the same manner.
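The partial dependence computation is a simple average over the empirical distribution of the complement variables, as the sketch below shows on an additive toy function (the function and names are illustrative, not the fitted D-FSBoost model).

```python
# Sketch of the partial dependence computation: fix z_k at each grid value,
# keep the observed complement variables z_{i,-k}, and average the fitted
# function over all observations.
import numpy as np

def partial_dependence(F, X, k, grid):
    """F: fitted function of the full predictor matrix; k: column index."""
    pd = np.empty(len(grid))
    for i, v in enumerate(grid):
        Xv = X.copy()
        Xv[:, k] = v                # set z_k = v for every observation
        pd[i] = F(Xv).mean()        # average over the observed z_{i,-k}
    return pd

# toy check on an additive function: the partial dependence on x_0 recovers
# its component 2*x_0 up to a constant vertical shift
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (2000, 3))
F = lambda X: 2.0 * X[:, 0] + np.sin(3 * X[:, 1])
grid = np.array([0.0, 0.5, 1.0])
pd = partial_dependence(F, X, 0, grid)
```

For the severity part, the same loop is run over $(x, N)$, which is how plots of the fitted severity against the claim frequency are produced.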
In Fig 9, we show the partial dependence plots for the D-FSBoost model, indicating the effects of the two most important variables on the claim frequency and severity. From the top two panels, we find that cars with brands B7-B9 cause many more accidents, and that the frequency is positively associated with the bonus-malus level. In France, a bonus-malus level below 100 means bonus and above 100 means malus, and the change from bonus to malus indicates that at least one accident has occurred; this explains the sudden change in the frequency at the bonus-malus level 100. The bonus-malus level 60 is the lowest bonus level, intended to encourage policyholders to drive more carefully, which explains the sudden increase in the occurrence of accidents when the bonus-malus level is near 60. The bottom two panels show that young drivers cause more serious accidents and that the severity increases dramatically when the claim frequency is small. This result is consistent with the observation in the distribution of the claim frequency and average claim severity.

Conclusion
This paper develops a stochastic gradient boosting frequency-severity model by combining the stochastic gradient boosting algorithm with a profile likelihood approach. We demonstrate that the model can flexibly capture the nonlinear relation between the claim frequency (severity) and predictors and complex interactions among predictors, and can also fully capture the nonlinear dependence between the claim frequency and severity. The model is superior to other state-of-the-art models in the sense that it provides more accurate predictions of the claim frequency and severity distributions and of the pure premium.
In this paper, we illustrate the model with a Poisson distribution for the claim frequency and a gamma distribution for the average claim severity. More flexible distributional choices are possible; for example, we can use the negative binomial distribution for the claim frequency and the generalized gamma distribution for the average claim severity, as in Shi, Feng, and Ivantsova [18]. The model can also be extended to capture other features of claim data. For example, it can be combined with the hurdle and zero-inflated modeling frameworks to accommodate overdispersion and zero inflation in the claim frequency; it can be generalized to a random parameters version; and the dispersion parameter can be allowed to depend on predictors and modeled with another ensemble of regression trees. These extensions are left for future research.