Gendist: An R Package for Generated Probability Distribution Models

In this paper, we introduce the R package gendist that computes the probability density function, the cumulative distribution function, the quantile function and generates random values for several generated probability distribution models including the mixture model, the composite model, the folded model, the skewed symmetric model and the arc tan model. These models are extensively used in the literature and the R functions provided here are flexible enough to accommodate various univariate distributions found in other R packages. We also show its applications in graphing, estimation, simulation and risk measurements.


Introduction
Various probability distribution models have been proposed in the past and the number increases with time. Recently, in the area of actuarial loss modeling, several new models were found to provide good fits to the loss data. For instance, mixture of Erlang distribution has been proposed to model catastrophic loss data in the United States [1]. Mixture of exponential with peaks-over-threshold has been considered to fit Danish fire losses and medical claims data found in Society of Actuaries (SOA) Group Medical Large Claims Database [2]. Mixture of lognormal and inverse Gaussian distribution were used to model fire insurance portfolio in Serbia [3]. In the actuarial context, loss data are monetary losses claimed by insureds purchasing general insurance policies such as fire and catastrophic insurances. Many reasons in which the data is found to be crucial and recorded by insurance companies or insurance service agencies, among others, to model their future financial obligations. In addition to the above, various applications of mixture model can be found in the literature for other area of studies. To name a few: a logistic mixture distribution model has been applied on polychotomous item responses [4]; mixture of logistic has been proposed to fit long tail distributions in analyzing network performance [5]; Rayleigh mixture model has been studied for plaque characterization in intravascular ultrasound [6]; a finite mixture of two Weibull distributions has been suggested to model the diameter distributions of rotatedsigmoid, uneven-aged stands [7]; two-component mixture Weibull statistics has been used to estimate wind speed distributions [8]; gamma mixture models have been proposed for target recognition [9]; Gaussian mixture model has been applied for human skin color and its applications in image and video databases [10].
Another important model receiving increasing applications in actuarial loss modeling is the composite model. Composite model is constructed by piecing two weighted distributions together. Such a model has been introduced initially with constant mixing weight [11]. It was then improved by allowing flexible mixing weight [12]. Several other related composite models have been proposed by various authors, see [13][14][15] and [16]. All these authors employed the well known Danish fire insurance data to measure their model performance. Composite models have also been applied to some simulated data belong to a particular class of distribution. Among others, a comparison study have been made on Weibull-Pareto and Lognormal-Pareto composite models [17]. Besides, some properties, inferences and numerical illustration using simulated data have been provided for composite exponential-Pareto models [18].
Another model that provides useful application in loss modeling is the folded model. It has been used to model the Norwegian fire claims data [19]. An extension to this study proposed three new folded models, namely, the folded generalized t distribution, folded Gumbel distribution and folded exponential power distribution [20]. An interesting feature of this model is the folding mechanism of a real value defined distribution into positive value distribution. Several existing folded models found in the literature are the folded normal distribution [21], folded t distribution [22], folded Cauchy distribution [23] and folded logistic distribution [24].
In addition to the above, skewed symmetric models also featured attractive properties with respect to loss data. The skewed normal and skewed t distributions have been studied in fitting insurance claims data [25]. Later, the same author applied the two models to asset returns of insurance companies [26]. Skewed t distribution is found to provide promising results for both data. A variety of other skew models have been considered, among them, skewed Cauchy [27], skewed Laplace [28], skewed logistic [29], skew reflected gamma, skew double Weibull and skew beta-prime [30] and several skew inverse reflected distributions [31].
More recently, the arc tan model has been introduced to model Norwegian fire claims data [32]. This new methodology has been proposed for an underlying Pareto distribution and found to provide good fit compared to several classical distributions. Its statistical properties also have been derived.
Beside heavily used in the actuarial field, most of the aforementioned models received vast applications in many other areas. Motivated by this, we compile several important models into an R package gendist for academics and public use. R is a free statistical computing and graphics software downloadable from http://www.r-project.org, see [33].
Distribution models provided in the R package gendist include the mixture, the composite, the folded, the skewed symmetric and the arc tan models. Computation functions of these models are given for probability density function (pdf), cumulative distribution function (cdf), quantile function (qf) and random generated values (rg).
The conventional R prefixes d, p, q and r define the pdf, cdf, qf and rg of an arbitrary distribution function. For instance dexp, pexp, qexp and rexp of the stats package [33] in R gives the pdf, cdf, qf and rg for an exponential distribution. The gendist package follows similar rule to define all the functions related to the generated probability distribution models.
In preparing the gendist package, we have thoroughly studied the Distribution CRAN Task View of R [34] which lists a number of available packages related to probability distributions. None of the packages on this page has so far provide tools to work with the folded and arc tan models discussed in this paper. Several packages are found for mixture models: mixtool [35] provides d and r functions for finite mixture models; nor1mix [36] provides d, p and r functions for a specific underlying distribution, Gaussian; GSM [37] provides d, p and r functions with underlying Gamma shape distribution; AdMit [38] provides d and r functions for mixture of student distribution.
Only the CompLognormal package [39] is available for composite models. However, the model is restricted to head of a lognormal distribution. In addition, the function is bound below at zero. For skewed symmetric model, several packages are available: skewt [40], sn [41], gamlss.dist [42] and sgt [43] provides functions d, p, q and r for variety of either skew normal or skew student distributions; Newdistns [44] provides functions related to skew symmetric G distribution. SkewHyperbolic [45] provides functions related to skew hyperbolic student distribution. All the distributions mentioned above plus others (including newly proposed distributions) can be easily managed by the functions developed in the proposed gendist package.
The paper is structured as follows. First, we discuss in detail all the above models along with examples on using related functions developed in the gendist package. Then, we provide descriptions on the program structure of the package and some discussion on appropriate usage with respect to the support of the models. We finally conclude the works carried out. Further details of the gendist package can be found at http://CRAN.R-project.org/ package=gendist.

Generated Probability Distribution Models
Models presented here are generated with underlying parent distributions. These models are specified by a predefined rule or structure. In what follows, we describe in detail each generated model giving particular emphasis on related functions encompassed in the gendist package.

The mixture models
The mixture model was first considered with an underlying normal distribution to address the decomposition issue with respect to non-normal forehead to body length attributes in a population of female shore crabs, see [46]. In general, the pdf of the mixture model is given by where 0 w i 1 for i = 1, 2, . . ., n are the mixing weights such that the sum of all w i equals to one. Note that, there are n components involved in Eq (1). The proposed mixture model in this paper is restricted to a two component distribution. Hence, the pdf can be represented as follows: We choose mixing weight of the form w ¼ 1 1þ so that ϕ > 0. g 1 (x) and g 2 (x) represent the densities of arbitrary parent distributions. Note, it is not necessary that the two arbitrary distributions in Eq (2) are identical. However, both must be defined on the same range dimension.
Finding cdf of the mixture model is straightforward. By direct integration, the cdf for the two component mixture model in Eq (2) can be written as Explicit general expression for qf of the mixture model is not available. However, it can be obtained numerically by solving the root, Q(u), of the following equation Finally, random generated numbers for the two component mixture model can be obtained as where u i for i = 1, 2, Á Á Á, n are random numbers from a uniform [0, 1] distribution.

The composite models
In general, the pdf of the composite model is given by whereby w is the mixing weight, θ denote the threshold and f Ã i ðxÞ for i = 1, 2 are the truncated pdfs of the parent distribution defined by respectively. Eq (6) can be made continuous and smooth by applying the continuity and differentiability conditions and thus provide a closed form for w, see [12]. Furthermore, it has been suggested that w take the form of w ¼ 1 1þ for ϕ > 0 for convenience [13]. We find this useful to ease program writing in R for the composite model. A more comprehensive approach to implement Eq (6) which we adopt in this paper specifies the component of mixing weight, ϕ, and the threshold, θ in term of other parameters of the model, see [15]. Further to this, we do not restrict the composite model to bound below at zero and thus allow parent distribution defined on real line as well as that defined for positive values to be considered for f Ã 1 ðxÞ. All other functions including the cdf, qf, and rg for the composite model are developed in a similar manner.
Hence, the cdf, qf and n random generated numbers provided in gendist package can be written as and where u i with i = 1, 2, . . ., n are n uniform [0, 1] random numbers. As a result, the number of parameters of the composite models equal the total number of parameters of the two underlying parent distributions selected. For instance, if truncated exponential distribution with scale parameter is chosen for f Ã 1 ðxÞ and Weibull distribution with scale and shape parameters are chosen for f Ã 2 ðxÞ, the Exponential-Weibull composite model will consist of three parameters.
We now show the implementation of the pdf function dcomposite of the composite model. All other functions related to the composite model follow similarly. Let f 1 (x) and f 2 (x) denote the pdfs of Weibull and Gamma distributions, respectively. From Eq (6) (with w ¼ 1 1þ ), the composite Weibull-Gamma model can be written as where Γ(a, z) is the incomplete gamma function defined by R 1 z t aÀ1 e Àt dt. Suppose, α = β = 1 and σ = λ = 2, then, the pdf values of the composite model for some values of x are computed as follows: R> dcomposite(1:3, spec1 = "weibull", arg1 = list(shape = 1, scale = 2), spec2 = "gamma", arg2 = list(shape = 1, scale = 2)) Output: 0.3032653 0.1839397 0.1115651 Density plots of this model can be done using curve function as below.

The folded models
Existence of folded models can be traced back to 1960s when methods were described to estimate mean and variance of normal distribution based on its folded form and an application was shown to real camber data [21]. The folded model is obtained from a transformation of random variable by taking its absolute value. Suppose that Y is a real valued random variable with cdf G(Á) and X = |Y| a positive random variable, then the cdf for X is We can then write its pdf as In this case the parent distribution has a support of real values and the resulted folded distribution has x > 0. It is obvious from Eq (13) that the quantile function may not have an explicit form. However, this may not be the case if the underlying distribution is symmetric around 0. The quantile function for a folded model of this case can be obtained as follows: where G −1 (Á) is the inverse function of the underlying symmetric distribution. For an underlying non-symmetric distribution (also applies to symmetric case), we can solve this numerically using computer programs by finding root of the cdf function, that is, Q(u) is the solution of the following equation Next, rg is obtained as follows where u i with i = 1, 2, . . ., n are n uniform [0, 1] random numbers. Note that g(Á) and G(Á) correspond to the pdf and cdf of the parent distribution of the folded model. We implement Eqs (13), (14), (16) and (17) in the gendist package for cdf, pdf, qf and rg of the folded model.
In what follows, we show applications of two risk measures by implementation of qf for the folded model, qfolded. It involves computation of Value-at-Risk (VaR) and Conditional Tail Expectation (CTE). VaR is related to the quantile function of a random variable with a specified probability, say, α. Often it describes the risk that a loss random variable exceeds a certain amount although it is also used outside the financial area. Let G −1 (α) denote the inverse function of G(x) for a random variable Y, then VaR at a given level of confidence, 1 − α, is given by For a range of α values, this is illustrated for folded normal and folded t distributions in Fig 3. Some arbitrary values are chosen for the parameters of both models. The following commands describe the plotting of VaR with respect to α: R> curve(qfolded(1-x/2, spec = "norm", arg = c(mean = 0.8, sd = 0.9), + interval = c(0,100)), xlim = c(0,1), ylim = c(0,4), + xlab = expression(alpha), ylab = "VaR") R> curve(qfolded(1-x/2, spec = "t", arg = c(df = 15), interval = c(0,100)), + add = TRUE, col = 2) The CTE measures the expected value of risk beyond VaR at a given probability level, α. Mathematically, For a range of α values, this is illustrated for folded normal and folded t distributions in  Generated Probability Distribution Models that is, higher VaR is found for higher level of confidence. In practice, α are commonly chosen as 1% or 5%. In this example, folded t distribution consistently show a lower VaR value than the normal distribution. However, the actual results will depend on the parameter values for each model. Similar to VaR, CTE for both distributions are decreasing function of the confidence level, α. This is depicted in Fig 4. The skewed symmetric models Azzalini proposed a new class of distributions with an underlying normal distribution [49]. In its original form, the pdf is given by where g(Á) and G(Á) are the pdf and cdf of the normal distribution and λ is a shape parameter. The distribution reduces to normal distribution when λ = 0.
Further study leads to a general form of the skew symmetric distribution [50]. The new skewed symmetric distribution has the following pdf where h(x) is a pdf symmetric at 0 and G(x) is a Lebesgue measurable function that satisfies 0 G(x) 1 and G(x) + G(−x) = 1 for z 2 <. In this new form, the parent distributions h(x) and G(x) may be of different type satisfying the conditions above. In the gendist package, we adopt Eq (21) for computer implementation of the skewed symmetric model. Several functions for h(x) and G(x) include the normal, logistic, students t and Cauchy distributions. General form for the cdf, qf and rg functions of the skewed symmetric models are not available. Thus, implementation of these functions in gendist are done via numerical methods. In particular, integrate and uniroot functions are used. The cdf of skewed symmetric model is found by Solving Q(u) of the following equation leads to its qf Z QðuÞ À1 2hðyÞGðyÞdy À u ¼ 0 ð23Þ and rg is obtained as follows where u i with i = 1, 2, . . ., n are n uniform [0, 1] random numbers. Some skewed symmetric models may have an explicit form. For instance, consider a skewed Cauchy-Cauchy distribution. Its pdf, cdf and qf are given by FðxÞ and respectively. In the following illustration, we check goodness-of-fit of two theoretical distributions, the skew Logistic-Logistic and the skew Normal-Normal distributions, to the length of rivers scaled by 1000. The data consist of 141 observations related to major North American rivers. The data (S2 Data) can be obtained as follows: R> data(rivers) R> x <-rivers/1000 Initial step involves parameter estimation using the maximum likelihood method.

The arc tan models
Arc tan model has been proposed to model a specific Norwegian insurance loss data [32]. Consider parent distribution with pdf and cdf of g(x) and G(x), respectively. Then, the pdf, cdf, qf and rg of the arc tan model with support [a, b] are given by FðxÞ ¼ 1 À arctan ðað1 À GðxÞÞÞ arctan ðaÞ ð29Þ QðuÞ ¼ G À1 1 À 1 a tan ðð1 À uÞ arctan ðaÞÞ ð30Þ and respectively, where u i with i = 1, 2, . . ., n are n uniform [0, 1] random numbers, arctan denote the inverse function of trigonometric tangent and G −1 (Á) is the inverse of G(Á). Note that the model consist of additional parameter α > 0 and the support is a x b where a and b take the support of parent distribution. Consider a Weibull distribution with pdf In what follows, we illustrate simulation of the arc tan distribution with parent distribution Eq (32). For simplicity, we let β = 2 and λ = 0.5. The empirical biases and mean square errors of α for the Weibull arc tan distribution can be obtained as follows: 1. Ten thousand sample sizes are generated by inversion of Eq (29). Each sample size is n. The variates of the Weibull arc tan distribution can be written as where U * U(0,1) is a variates of the uniform distribution.
An advanced example featuring empirical data related to air quality is given in the Appendix (S1 Appendix). The subroutines provided there emphasize on calling functions from the gendist package and thus avoid creating separate object running similar computation. The underlying concept pertaining to this idea make use of do.call which is discussed next.

Results and Discussion
In this paper, we specify the generated probability distribution models by its parent distribution, that is, the underlying function which generates new models. Codes for writing functions in gendist package use a general construct of parent distribution of the form do.call(paste(d, spec, sep = ""), c(list(z), arg)), do.call(paste(d, spec1, sep = ""), c(list(z), arg1)) or do.call(paste(d, spec2, sep = ""), c(list(z), arg2)).
As such, spec = "Weibull", for instance, corresponds to Weibull parent distribution and arg = list(shape, scale) list its parameters. Similarly, for two component distributions spec1 and spec2 specify the parent distributions with corresponding parameters arg1 and arg2.
The parent distribution may assume any value on real line or positive real numbers. The suitability of the parent distributions for the models must be checked by the user. No warnings are given for choosing inappropriate distributions. Table 1 describes a proper selection for the parent distribution corresponding to each model with respect to its support.
Functions to compute pdf, cdf, qf and rg for all five models produced in gendist are summarised in Table 2. As mentioned earlier, the input argument spec, spec1 and spec2 specify the parent distribution and their corresponding parameters arg, arg1 and arg2. The distribution can be one that is implemented in R base package, contributed R packages or one written by a user. In any case, the parent functions must be defined with prefix d, p, q and r attached to spec, spec1 and spec2. It is important to note that some models may not have explicit general form for its probability functions. Whenever this is the case, numerical methods are used. In gendist, we utilise integrate, nlm and uniroot functions. Some of these functions, in particular, uniroot require interval values to search for roots and they are specified by interval. nlm require a starting value specified by initial.

Conclusions
In this paper, we discuss five models provided in the R package gendist, namely, the mixture model, the composite model, the folded model, the skewed symmetric model and the arc tan model. The functions related to these models are written in R environment and are freely downloadable from http://www.r-project.org, see [33]. Users can easily create any specific model by specifying the parent distributions along with their parameters. Thus, uncountable number of distributions can be produced from the tools in the gendist package. Another advantage is that the users have the option to write their own function to serve as the parent distributions.
Next, we also show at least five important applications of the tools given in the gendist package. Usefulness of these tools in graphing are shown for Q-Q plot, P-P plot, producing pdf curves as well as to show output for some risk measures. Simulation study is provided for a specific Weibull arc tan model and produced favorable result with respect to biases and mean squared error of the parameter estimated. Both measures decrease to zero when the number of observations approaches infinity. Table 2. Calling sequences for mixture, composite, folded, skewed symmetric and arc tan models.