
On Bayesian approach to composite Pareto models

Abstract

In data modelling using the composite Pareto distribution, any observations above a particular threshold value are assumed to follow a Pareto-type distribution, whereas the rest of the observations are assumed to follow a different distribution. This paper proposes a Bayesian approach to composite Pareto models in which the prior distribution is specified on the proportion of data coming from the Pareto distribution, instead of on the threshold, as is often done in the literature. Based on a simulation study, it is found that the parameter estimates determined using a uniform prior on the proportion are less biased than the point estimates determined using a uniform prior on the threshold. Applications to income and finance data are included as illustrative examples.

1 Introduction

Size-type data may exhibit a heavy upper tail that must be taken into account in statistical modelling. Pareto-type distributions have been used successfully to model the distribution of the upper tail [1, 2], but these distributions can only model the observations above a certain threshold and do not account for observations below the threshold value. In this case, a separate distribution is required to describe the observations below the threshold.

Composite Pareto models are a family of distribution models that splice the observations into two parts: observations below a threshold value and observations above the threshold, where the threshold is treated as a parameter. In these models, any observations with values lower than or equal to the threshold are distributed according to a lower data distribution model, whereas any observations with values higher than the threshold are assumed to follow a Pareto-type distribution. The probability density function (pdf) for the composite Pareto models can be written as:

f(x|θ) = c1 f1(x|θ) for x ≤ τ, and f(x|θ) = c2 f2(x|θ) for x > τ,

where c1 and c2 are the mixing weights of the components in the composite model, τ is the threshold value, f1(x|θ) is the distribution model for observations below the threshold, and f2(x|θ) is the Pareto model for observations above the threshold. This property of having separate models for observations below and above the threshold is desirable when the upper tail observations are heavily skewed or believed to be distributed differently from the rest of the observations in the data. Composite Pareto models have found many applications, including in insurance [3], survival data [4], internet traffic data [5] and ecology [6].

In this paper, a general implementation of the Bayesian approach to composite Pareto models, in which prior information is specified on the proportion of observations coming from the Pareto tail, is outlined and discussed. Although other works have discussed and used Bayesian analysis for composite Pareto models [7, 8], their approaches are tied to specifying prior information on the threshold parameter, not on the proportion of data in the Pareto tail as proposed in this paper. Additionally, through a simulation study, we demonstrate that using a uniform prior on the proportion leads to less biased point estimates compared to using a uniform prior on the threshold parameter.

This paper is organized as follows. Section 2 gives a brief summary of the composite Pareto models. Section 3 discusses determining the threshold value when the other parameters are given. Prior distributions for composite Pareto models are then discussed in Section 4. In Section 5, a simulation study is presented to compare the performance of point estimates based on a uniform prior on the threshold versus a uniform prior on the proportion of data coming from the Pareto tail. We apply the composite Pareto models to Malaysian household income data and the NASDAQ-100 stock index in Section 6. Section 7 concludes the paper.

2 Composite Pareto models

Composite Pareto models divide the data into two parts: data with values lower than a threshold value, hereafter termed the lower data, and data with values higher than the threshold, called the upper data. The models assume that the distributions of the lower and upper data differ, with the upper data following a Pareto distribution. Typically, a lighter-tailed distribution is used for the lower data, such as a truncated lognormal, truncated gamma, or truncated Weibull distribution.

The composite Pareto models were first introduced by Cooray and Ananda [3]. In their work, the lower data follow a lognormal distribution while the upper data follow a Pareto type I distribution. However, as noted by Scollnik [9], the model introduced is limited since the mixing proportion of data coming from the Pareto distribution is fixed. Scollnik [9] freed the mixing proportion and improved the composite model by considering the Pareto type II distribution for the upper tail data. Since then, many other composite Pareto models have been proposed, including the Weibull-Pareto composite model [10–12], the gamma-Pareto composite model [12], the exponential-Pareto composite model [13], the inverse gamma-Pareto composite model [8] and the composite Stoppa model [14].

2.1 Composite Pareto I models

Let θ be the collection of all parameters and let h(x|η) be a probability density function (pdf) with a collection of parameters η and cumulative distribution function (cdf) H(x|η). We call this the lower data distribution. The composite Pareto I model has the following pdf:

f(x|θ) = (1 − ρ) h(x|η)/H(τ|η) for 0 < x ≤ τ, and f(x|θ) = ρ fP1(x|τ, α) for x > τ, (1)

where ρ > 0 is the proportion of data coming from the Pareto I distribution with pdf fP1(x|τ, α), τ > 0 is the threshold that separates the lower and upper data, and α > 0 is the tail index of the Pareto type I distribution. Note that

fP1(x|τ, α) = ατ^α / x^(α+1) for x > τ.

In the pdf (1), any observation smaller than the threshold τ is distributed according to the lower data distribution, and any observation bigger than τ is distributed according to the Pareto type I distribution. Note that the pdf (1) may not be continuous and differentiable at τ. To make it continuous, we can specify

ρ = τ h(τ|η) / [τ h(τ|η) + α H(τ|η)].

And to further make it differentiable, we let

α = −τ h′(τ|η)/h(τ|η) − 1, (2)

conditional on τh′(τ|η) < −h(τ|η) so that α > 0. This simplifies the condition on ρ to

ρ = τ h(τ|η)² / [τ h(τ|η)² − H(τ|η)(h(τ|η) + τ h′(τ|η))]. (3)
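The continuity and differentiability conditions can be verified numerically. The sketch below uses Python purely for illustration (the paper's own computations are in R); it assumes a lognormal lower distribution, for which Eq (2) simplifies to α = (log τ − μ)/σ², and checks that the two branches of the composite pdf agree at τ. The parameter values are illustrative.

```python
import numpy as np
from scipy.stats import lognorm

# Illustrative values: lognormal lower distribution (log-mean mu, log-sd sigma)
# and a threshold tau; tau > exp(mu) is needed so that alpha > 0.
mu, sigma, tau = 6.5, 0.5, 1300.0
h = lognorm(s=sigma, scale=np.exp(mu))

# Eq (2) specialised to the lognormal case: alpha = (log tau - mu) / sigma^2.
alpha = (np.log(tau) - mu) / sigma**2

# Continuity then fixes rho: rho = tau h(tau) / (tau h(tau) + alpha H(tau)).
rho = tau * h.pdf(tau) / (tau * h.pdf(tau) + alpha * h.cdf(tau))

# Check continuity of the composite pdf (1) at x = tau:
# the truncated lower density equals the Pareto I density rho * alpha / tau there.
lower = (1 - rho) * h.pdf(tau) / h.cdf(tau)
upper = rho * alpha / tau
```

Both branches agree to machine precision, confirming that the choices above make the pdf continuous (and here also differentiable) at τ.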

2.2 Composite Pareto II models

Composite Pareto II models are similar to composite Pareto I models except for the use of the Pareto type II distribution for the upper data. As before, let θ be the collection of all parameters, h(x|η) be a pdf with cdf H(x|η), and η be the collection of parameters associated with h(x|η). Then the composite Pareto II models have the following pdf:

f(x|θ) = (1 − ρ) h(x|η)/H(τ|η) for 0 < x ≤ τ, and f(x|θ) = ρ fP2(x|τ, α, β) for x > τ, (4)

where ρ > 0 is the proportion of data coming from the Pareto type II distribution with pdf fP2(x|τ, α, β), τ > 0 is the threshold that separates the lower and upper data distributions, whereas α > 0 and β > 0 are the tail index and scale parameters of the Pareto type II distribution, respectively. The pdf of the Pareto type II distribution is:

fP2(x|τ, α, β) = αβ^α / (x − τ + β)^(α+1) for x > τ.

The Pareto type II distribution is also a special case of the generalized Pareto distribution with positive scale parameter.

In this case, observations with values less than τ are assumed to follow the lower data distribution while any observation greater than τ follows the Pareto type II distribution. To make pdf (4) continuous at τ, we can specify

ρ = β h(τ|η) / [β h(τ|η) + α H(τ|η)].

And for the differentiability condition, let

β = −(α + 1) h(τ|η)/h′(τ|η), (5)

conditional on h′(τ|η) < 0 so that β > 0. This simplifies the condition on ρ to

ρ = (α + 1) h(τ|η)² / [(α + 1) h(τ|η)² − α H(τ|η) h′(τ|η)]. (6)

The twice-differentiability property can be satisfied by specifying the value of α. However, it is not pursued here as it is too restrictive [9]. In this paper, focus is given to composite models with continuous and differentiable pdfs. Going forward, when composite Pareto models are mentioned, we refer to composite Pareto models with smooth pdfs, unless specified otherwise.

2.3 Bulk models

The previous subsections deal with the general case of composite Pareto models, in which the parameter ρ is free, or is specified to satisfy the continuity and differentiability conditions. However, in some cases, we may specify:

ρ = 1 − H(τ|η) (7)

for simplification. This specification is referred to as the “bulk model based tail fraction” by Scarrott [15] and is commonly found in the extreme value literature. The term “bulk model” refers to the lower data distribution being assumed to cover the bulk of the observations when compared to the Pareto distribution. It was first proposed by Behrens et al. [16] as a way to estimate the threshold value for extreme value data, and improvements have been made since then [17–19]. In this paper, we refer to composite Pareto models with specification (7) simply as the bulk models.

Historically, the bulk models were developed separately from the general composite Pareto models of the previous subsections. There are two main differences in the development of these two classes of models. First, the general composite Pareto models appear in the actuarial science literature while the bulk models appear in the extreme value literature. Second, the general composite Pareto models typically use a frequentist approach for parameter estimation while the bulk models use a Bayesian approach. Readers interested in the bulk models can refer to [15], which outlines the developments and bibliography related to them. We include this special case of composite Pareto models since the results shown in this paper also apply to the bulk models.

For bulk models with a Pareto I tail, the pdf can be written as:

f(x|θ) = h(x|η) for 0 < x ≤ τ, and f(x|θ) = [1 − H(τ|η)] fP1(x|τ, α) for x > τ. (8)

Whereas for bulk models with a Pareto II tail, the pdf is:

f(x|θ) = h(x|η) for 0 < x ≤ τ, and f(x|θ) = [1 − H(τ|η)] fP2(x|τ, α, β) for x > τ. (9)

Note that in the literature covering the bulk models, the continuity and differentiability of the pdf are usually not imposed. This allows more flexibility in modelling the lower data distribution. In some cases, researchers may even use semi-parametric or nonparametric approaches for specifying h(x|η) [17, 20, 21].

However, if desired, the continuity condition for bulk models can be imposed by specifying

α = τ h(τ|η) / [1 − H(τ|η)] (10)

or

β = α [1 − H(τ|η)] / h(τ|η) (11)

for bulk models with Pareto I and Pareto II tails, respectively.

The differentiability condition is more complicated here as it requires specifying a parameter in the lower data distribution. For example, the pdf (8) is differentiable when Eq (10) is satisfied and

τ h′(τ|η)/h(τ|η) = −1 − τ h(τ|η)/[1 − H(τ|η)].

If h(x|η) is a lognormal distribution with log-mean μ and log-variance σ², then the above condition simplifies to

z[1 − Φ(z)] = φ(z), with z = (log τ − μ)/σ,

where φ and Φ are the pdf and cdf of the standard normal distribution, which does not have a solution for finite σ. Therefore in this case the differentiability condition cannot be met.

Similarly, the pdf (9) is differentiable when Eq (11) is satisfied and

α = −h(τ|η)² / [h(τ|η)² + h′(τ|η)(1 − H(τ|η))],

conditional on h(τ|η)² + h′(τ|η)[1 − H(τ|η)] < 0 so that α > 0. However, this condition may not be met by all lower data distributions. For example, if the lower data distribution is lognormal, then it can be proven that the condition cannot be met unless the log-variance is greater than 1.

3 Threshold as a function of other parameters

The main contribution of this paper is to propose using prior information on the proportion of observations coming from the Pareto distribution, ρ, rather than on the threshold value, τ, as is often suggested in the literature. To achieve this objective, it is important to first study the relationship between ρ and τ. For the composite Pareto models, under the continuity condition, ρ is a function of the other parameters including τ. For example, for smooth composite Pareto I models, ρ = ψ1(τ, η) as shown in Eq (3), and for smooth composite Pareto II models, ρ = ψ2(τ, η, α) as shown in Eq (6).

For the composite Pareto I model, if prior information on τ is used in Bayesian estimation, then ρ is a function of the random variables τ and η as in Eq (3). However, if we want to use prior information on ρ instead, then we have to invert the function so that τ can be determined for given values of ρ and η. Therefore we can write

τ = ψ1⁻¹(ρ, η). (12)

Similarly, for the composite Pareto II model, ρ is a function of τ, η and α as in Eq (6). The function must be inverted for us to compute the value of τ when ρ, η and α are given. We write

τ = ψ2⁻¹(ρ, η, α). (13)

Note that solving these two equations may require numerical computation.
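Since Eq (12) generally has no closed form, τ must be recovered from ρ numerically. A minimal sketch (Python here for illustration; the paper uses R) for a lognormal lower distribution, where the relationship is one-to-one so a single bracketing root-finder suffices; the bracket endpoints are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import lognorm

# Hypothetical lognormal lower-distribution parameters.
mu, sigma = 6.5, 0.5
h = lognorm(s=sigma, scale=np.exp(mu))

def psi1(tau):
    """rho as a function of tau (Eq (3)), specialised to a lognormal h,
    for which Eq (2) gives alpha = (log tau - mu) / sigma^2."""
    alpha = (np.log(tau) - mu) / sigma**2
    return tau * h.pdf(tau) / (tau * h.pdf(tau) + alpha * h.cdf(tau))

def tau_from_rho(rho):
    """Invert Eq (12) numerically. alpha > 0 requires tau > exp(mu);
    psi1 decreases from 1 to 0 on that interval, so one root exists."""
    lo = np.exp(mu) * (1.0 + 1e-9)   # just above the boundary where psi1 -> 1
    hi = np.exp(mu + 10 * sigma)     # far into the tail, where psi1 -> 0
    return brentq(lambda t: psi1(t) - rho, lo, hi)

tau = tau_from_rho(0.1)   # threshold putting 10% of the data in the Pareto tail
```

The round trip psi1(tau) returns the requested proportion, illustrating the inversion in Eq (12).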

However, the solution to Eqs (12) and (13) may not exist for some choices of lower data distributions and parameters. For example, consider a normal-Pareto II composite model where h(x|η) is the pdf of a normal distribution with mean −1 and variance 1, and suppose α is set to 2. Then the highest possible value of ρ for this model is approximately 0.3014, attained as τ → 0. Therefore, in this case, there is no solution to Eq (13) when finding the value of τ for ρ > 0.3014.

There are two ways to show that the solution always exists for some lower data distributions. First, if the pdf is only defined for x > 0, it is enough to show that xh′(x|η) < −h(x|η) for all x > 0 for Eq (12), or that h′(x|η) < 0 for all x > 0 for Eq (13). In this case, ρ = 1 when τ → 0, ρ = 0 when τ → ∞, and ρ is a continuous function of τ and the other parameters. By the intermediate value theorem, the solution must exist. The second method is to show that there exists x* > 0 such that x*h′(x*|η) = −h(x*|η) for the composite Pareto I model, or h′(x*|η) = 0 for the composite Pareto II model. In this case, there is a smooth function of ρ given τ and the other parameters going from ρ = 1 (when τ → x*) to ρ = 0 (when τ → ∞). Therefore, again by the intermediate value theorem, the solution to Eqs (12) or (13) must exist.

Unfortunately, the uniqueness of the solution to the above equations is not guaranteed for all lower data distributions. There may be multiple values of τ that satisfy the equations even when using the same values of ρ, η and α, especially when h(x|η) is multimodal. For example, suppose the lower data distribution is a two-component lognormal mixture with

h(x|η) = r1 fLN(x|μ1, σ1²) + r2 fLN(x|μ2, σ2²)

and the parameters r = (0.5, 0.5), μ = (5, 7), σ = (0.6, 0.3), where fLN(x|μ, σ²) is the pdf of the lognormal distribution with log-mean μ and log-variance σ². Then in this case there are three solutions to Eq (12) using ρ = 0.2: τ = 280.69, 476.23, or 1355.44 with α = 1.768, 0.526, and 2.357, respectively.

Determining the value of τ is important as it is required in the calculation of the likelihood function. When a unique solution for τ based on Eqs (12) or (13) is guaranteed, that solution can be used for calculating the likelihood function and estimating the posterior distribution in the Bayesian analysis. On the other hand, when there are multiple values of τ that satisfy Eqs (12) or (13), we cannot simply use any of the solutions without valid reasoning. Therefore, a method is needed to select the most suitable τ, and this is discussed in the following subsections.

For the bulk model mentioned in Section 2.3, it is assumed that ρ = 1 − H(τ|η) or, inversely, τ = H⁻¹((1 − ρ)|η). In this case, there is a unique solution for τ given the values of ρ and η since H⁻¹((1 − ρ)|η) is a strictly monotonic function. We can then use the prior on ρ rather than τ for Bayesian estimation without much problem, due to the uniqueness of τ when the values of the other parameters are given.
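For the bulk models this inversion is immediate: τ is simply the (1 − ρ) quantile of the lower distribution. A short sketch with hypothetical lognormal parameters (Python for illustration):

```python
import numpy as np
from scipy.stats import lognorm

# Hypothetical bulk-model setting: lognormal lower distribution.
mu, sigma = 6.5, 0.5
h = lognorm(s=sigma, scale=np.exp(mu))

rho = 0.05                # say 5% of observations in the Pareto tail
tau = h.ppf(1.0 - rho)    # tau = H^{-1}((1 - rho)|eta): unique, as H is monotonic
```

The round trip 1 − H(τ|η) returns ρ, which is why a prior on ρ can be transferred to τ without the multiple-solution issues of the general models.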

3.1 Unique solution of the threshold

Fortunately, the uniqueness of the solutions to Eqs (12) and (13) can be proven for many commonly used lower data distributions, including the gamma, lognormal, normal, and Weibull distributions. These distributions have the property that h′(x|η) = h(x|η)g(x|η) for some function g(x|η). The following two theorems provide a way to determine the uniqueness of the solutions to Eqs (12) and (13), with proofs shown in S1 Appendix.

Theorem 1. Suppose h′(x|η) = h(x|η)g(x|η) for some function g(x|η), and suppose that the solution to Eq (12) exists. If g(τ|η) + τg′(τ|η) < 0 for all τ that satisfy h(τ|η) + τh′(τ|η) < 0, then the solution above must be unique, and the relationship between τ and ρ is one-to-one.

Theorem 2. Suppose h′(x|η) = h(x|η)g(x|η) for some function g(x|η), and suppose that the solution to Eq (13) exists. If h′(τ|η) + H(τ|η)[g′(τ|η) − g(τ|η)²] < 0 for all τ that satisfy h′(τ|η) < 0, then the solution above must be unique, and the relationship between τ and ρ is one-to-one.

Theorem 1 can be used for the composite Pareto I models while Theorem 2 is useful for composite Pareto II models when checking for the uniqueness of the solution to Eqs (12) and (13). Under Theorem 1, for lower data distribution with h′(x|η) = h(x|η)g(x|η) for some function g(x|η), it is enough to show that g(τ|η) + τg′(τ|η) < 0 to determine the uniqueness of solution to Eq (12). For composite Pareto II models, it is enough to show that h′(τ|η) + H(τ|η)[g′(τ|η) − g(τ|η)2] < 0 to show the uniqueness of the solution to Eq (13). Following Theorem 2, the condition to determine the uniqueness of the solution can be further simplified as shown in the following corollary.

Corollary 1. Suppose h′(x|η) = h(x|η)g(x|η) for some function g(x|η). If g′(τ|η) − g(τ|η)² < 0, then the solution to Eq (13) must be unique.

As an example, let h(x|η) = fLN(x|μ, σ²) be the pdf of a lognormal distribution with log-mean μ and log-variance σ². Then

h′(x|η) = h(x|η) g(x|η), with g(x|η) = −(1/x)[1 + (log x − μ)/σ²].

It follows that g(τ|η) + τg′(τ|η) = −1/(τσ²) < 0. Therefore, using Theorem 1, the solution to Eq (12) is unique, and there is no need to check for multiple solutions of τ for given values of ρ and other parameters when using the lognormal-Pareto I composite model.

Additionally, let z = log(x) − μ, and let φ(x) and Φ(x) be the pdf and cdf of the standard normal distribution, respectively. It can be shown that

h′(x|η) + H(x|η)[g′(x|η) − g(x|η)²] = −(1/x²)[(1 + z/σ²) φ(z/σ)/σ + (1 + z + z²/σ²) Φ(z/σ)/σ²]. (14)

If z ≥ 0, the expression above is negative and the condition for Theorem 2 is satisfied. However, if z < 0, we have to use the properties of Mill's ratio to show that the condition still holds. Gordon [22] has proven that the Mill's ratio satisfies x/(x² + 1) < [1 − Φ(x)]/φ(x) < 1/x for x > 0. If z < 0, it follows that

Φ(z/σ) > (−z/σ) φ(z/σ)/[(z/σ)² + 1]

and

Φ(z/σ) < φ(z/σ)/(−z/σ).

Combining these bounds with Eq (14) shows that the condition for Theorem 2 is satisfied. Therefore, there will not be instances of multiple τ solutions for Eq (13), and prior information on ρ can be used without problem for the lognormal-Pareto II composite model.

3.2 Multiple solutions of the threshold

As mentioned previously, it is possible to have multiple solutions for τ given the values of ρ and the other parameters. When this happens, an analyst has the choice to either avoid the problem or choose one of the solutions. The former can be done by using prior information on τ, by using a lower data distribution with proven solution uniqueness as shown in the previous subsection, or by using the bulk models.

If the analyst opts to select one of the solutions to Eqs (12) or (13), a naive way to choose the value of τ out of the multiple solutions is to pick one at random, or to pick the highest value among the solutions. However, this can lead to bias in estimation and should be avoided. A better method is to let the data guide the selection of τ by choosing, out of the possible solutions, the value of τ that maximizes the loglikelihood. Therefore the equations are now

τ = arg max l(θ|x) over the solutions τ of Eq (12), (15)

τ = arg max l(θ|x) over the solutions τ of Eq (13), (16)

for the composite Pareto I and composite Pareto II models, respectively, where l(θ|x) is the loglikelihood.
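This selection rule can be sketched as follows for the composite Pareto I likelihood. The function names and the representation of candidates as (τ, α) pairs are illustrative assumptions, not the paper's code:

```python
import numpy as np

def loglik_pareto1(x, tau, alpha, rho, h_pdf, h_cdf):
    """Log-likelihood of the composite Pareto I model, Eq (1)."""
    x = np.asarray(x, dtype=float)
    lower, upper = x[x <= tau], x[x > tau]
    ll = np.sum(np.log((1.0 - rho) * h_pdf(lower) / h_cdf(tau)))
    ll += np.sum(np.log(rho * alpha * tau**alpha / upper**(alpha + 1.0)))
    return ll

def select_tau(x, candidates, rho, h_pdf, h_cdf):
    """Eq (15): among the candidate roots of Eq (12), supplied as
    (tau, alpha) pairs, keep the one with the highest log-likelihood."""
    return max(candidates,
               key=lambda c: loglik_pareto1(x, c[0], c[1], rho, h_pdf, h_cdf))
```

In the sampler, this step replaces an arbitrary choice among the roots with a data-driven one.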

This selection method assumes that the ‘true’ solution of τ is the one that gives the highest likelihood. The solution is not to be mistaken for the maximum likelihood estimate, as the set of possible solutions for τ is constrained by conditions (3) or (6). However, in Bayesian estimation, as the sample size grows, the solutions to Eqs (15) and (16) should approach the maximum likelihood estimate.

3.3 Computation to find the value of threshold parameter

Typically, numerical methods have to be used to compute the solutions to Eqs (12) and (13). It is also important to consider the constraints on the solution when computing the value of τ. Namely, the solution must satisfy τh′(τ|η) < −h(τ|η) and h′(τ|η) < 0 for the composite Pareto I and II models, respectively. Without these restrictions on τ, the functions ψ1(τ, η) and ψ2(τ, η, α) may diverge to ±∞, making the numerical computation difficult.

To compute the value of τ given the values of the other parameters, we use the following functions:

κ1(τ|ρ, η) = ρ − 1 if H(τ|η) is computationally 0; ρ if h(τ|η) is computationally 0 and H(τ|η) is computationally 1; ρ − 1 if h(τ|η) + τh′(τ|η) ≥ 0; and ρ − ψ1(τ, η) otherwise, (17)

for the composite Pareto I model, and

κ2(τ|ρ, η, α) = ρ − 1 if H(τ|η) is computationally 0; ρ if h(τ|η) is computationally 0 and H(τ|η) is computationally 1; ρ − 1 if h′(τ|η) ≥ 0; and ρ − ψ2(τ, η, α) otherwise, (18)

for the composite Pareto II model. The roots of these two functions give the values of τ that satisfy Eqs (12) and (13), respectively.

The first two lines in Eqs (17) and (18) are added for numerical stability. These lines are useful when h(τ|η) is computationally 0 and H(τ|η) is computationally 0 or 1. The third lines in Eqs (17) and (18) impose the restrictions on τ mentioned above. The values are set to ρ − 1 since, when τh′(τ|η) + h(τ|η) → 0 from the left, ψ1(τ, η) → 1, and similarly, when h′(τ|η) → 0 from the left, ψ2(τ, η, α) → 1. Setting the values to ρ − 1 makes κ1(τ|ρ, η) and κ2(τ|ρ, η, α) continuous functions.

There may be multiple roots of κ1(τ|ρ, η) and κ2(τ|ρ, η, α), which indicates multiple τ solutions. This is difficult to handle since most root-finding algorithms output only a single root. To find the values of τ, we use the uniroot.all function from the rootSolve package in R, which divides the domain of τ (for given lower and upper bounds) into a number of subintervals. The function then detects changes in the signs of κ1(τ|ρ, η) and κ2(τ|ρ, η, α) on the boundaries of the subintervals, and uses the bisection method to find the zeroes where a sign change occurs.
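The grid-scan-plus-bisection idea behind uniroot.all can be sketched in a few lines; this is a Python analogue for illustration (the paper itself uses the R function), with scipy's brentq standing in for the bisection step:

```python
import numpy as np
from scipy.optimize import brentq

def uniroot_all(f, lower, upper, n=1000):
    """Sketch of rootSolve::uniroot.all: split [lower, upper] into n
    subintervals, detect sign changes of f at the boundaries, and polish
    each bracketing interval with a bisection-type solver."""
    grid = np.linspace(lower, upper, n + 1)
    vals = np.array([f(g) for g in grid])
    roots = []
    for a, b, fa, fb in zip(grid[:-1], grid[1:], vals[:-1], vals[1:]):
        if fa == 0.0:
            roots.append(float(a))
        elif fa * fb < 0.0:          # sign change: a root lies in (a, b)
            roots.append(brentq(f, a, b))
    return roots

roots = uniroot_all(np.sin, 1.0, 10.0)   # roots near pi, 2*pi and 3*pi
```

Applied to κ1 or κ2, each returned root is one candidate τ, to be screened by the likelihood rule of Eqs (15) and (16). Note that a root falling exactly between two grid points of equal sign would be missed, which is why n should be large relative to the expected number of roots.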

4 Prior distributions for composite Pareto models

In the Bayesian setting, prior information on the parameters must be specified and included in the analysis. For composite Pareto models, a closed form for the posterior distribution cannot be found due to the complexity of the likelihood. Thus an MCMC method, specifically Metropolis–Hastings within Gibbs sampling, is used to estimate the posterior distribution. In this section, the focus is on using noninformative prior distributions for the parameters. If informative prior distributions were used, the posterior estimation process would remain the same under the MCMC algorithm.
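A single step of such a sampler can be sketched as a random-walk Metropolis update applied to one parameter at a time within the Gibbs sweep. This is a generic sketch, not the paper's implementation; the function name and proposal form are assumptions:

```python
import numpy as np

def mh_update(current, log_post, scale, rng):
    """One random-walk Metropolis step for a single parameter, as would be
    used inside a Gibbs sweep over the model parameters. log_post should
    return -inf for values violating constraints (e.g. alpha <= 0), which
    makes such proposals rejected with certainty."""
    proposal = current + scale * rng.normal()
    log_ratio = log_post(proposal) - log_post(current)
    if np.log(rng.uniform()) < log_ratio:
        return proposal   # accept
    return current        # reject
```

In a full sampler, an update of this kind would be applied in turn to each free parameter, with τ recovered from Eq (12) or (13) whenever ρ changes.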

Additionally, we assume independence between the parameters such that the prior distribution can be written as:

p(θ) = p(η) p(α) p(β) p(τ)

when using a prior distribution on τ, or:

p(θ) = p(η) p(α) p(β) p(ρ) (19)

when using a prior distribution on ρ.

4.1 Prior on τ versus prior on ρ

Setting a prior on τ can be tricky unless good prior information is available. An informative prior in the form of a normal distribution has been proposed by Behrens et al. [16], but selecting the hyperparameters can be difficult. When no information on τ is available, researchers may opt to use noninformative priors. In this case, a uniform prior can be used for τ, as proposed by Behrens et al. [16] and Cabras and Castellanos [18]. But as will be shown in Section 5, using a uniform prior on τ is not recommended. Villa [19], on the other hand, has proposed a data-dependent prior on τ that uses the Kullback–Leibler divergence to determine each order statistic's ‘worth’.

In many cases, specifying prior information on ρ is easier than specifying prior information on τ. This is mainly because ρ is bounded between 0 and 1 (whereas τ is not bounded above) and ρ is unit-free. Furthermore, in many applications that use the generalized Pareto distribution to estimate extreme quantiles, researchers typically choose the threshold based on quantiles (e.g. the 90th, 99th, or 99.5th percentiles). Therefore, using a prior on ρ is more natural than specifying information on some arbitrary threshold value. An informative prior in the form of a beta distribution can be used for ρ when required.

If there is no information on either τ or ρ, we can use a uniform prior for one of them. Note that the distribution of τ affects the distribution of ρ and vice versa. Fig 1 illustrates this, based on ρ = 1 − H(τ|η) as in the bulk models, with h(x|η) a lognormal distribution with log-mean 6.5 and log-variance 0.5². In this case, when τ is uniformly distributed, p(ρ) ∝ 1/h(H⁻¹(1 − ρ)|η), and when ρ is uniformly distributed, p(τ) = h(τ|η). It can be seen from the figure that when τ is uniformly distributed, the distribution of ρ is concentrated around 0. On the other hand, when ρ is uniformly distributed between 0 and 1, the distribution of τ is unimodal and its density decreases as τ increases.

Fig 1. The distribution of ρ when τ is uniformly distributed (left), and the distribution of τ when ρ is uniformly distributed (right), when ρ = 1 − H(τ|η), where h(x|η) is a lognormal distribution with log-mean 6.5 and log-variance 0.5².

https://doi.org/10.1371/journal.pone.0257762.g001
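The left panel of Fig 1 can be checked by simulation: draw τ uniformly and push it through ρ = 1 − H(τ|η). A Python sketch (for illustration only), with the upper bound of the uniform prior on τ set, arbitrarily, to the 99.9th percentile of h:

```python
import numpy as np
from scipy.stats import lognorm

# Bulk-model relationship rho = 1 - H(tau|eta); lognormal h with log-mean
# 6.5 and log-sd 0.5 as in Fig 1. The bound on tau is an assumption.
h = lognorm(s=0.5, scale=np.exp(6.5))
rng = np.random.default_rng(1)

tau = rng.uniform(0.0, h.ppf(0.999), size=100_000)  # uniform prior on tau
rho = 1.0 - h.cdf(tau)                              # induced values of rho

frac_small = np.mean(rho < 0.1)   # well over half of the induced mass
```

Most of the induced prior mass on ρ piles up near 0, matching the concentration visible in Fig 1.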

Note that if X is a random variable with pdf fX(x), Y = v(X) is a function of X, and v(X) is monotonic, then

fY(y) = fX(v⁻¹(y)) |d v⁻¹(y)/dy|,

where fY(y) is the pdf of Y. In the case of ρ and τ, the above equation translates to

p(τ) = p(ρ = v⁻¹(τ)) |d v⁻¹(τ)/dτ|, (20)

where τ = v(ρ) for some function v(⋅), if it exists. Assuming independence between η, (α, β), and τ, the prior can then be written as

p(θ) = p(η) p(α) p(β) p(ρ = v⁻¹(τ)) |d v⁻¹(τ)/dτ|. (21)

Using Eq (21), the MCMC algorithm can be performed by sampling τ with the prior on ρ specified through p(ρ), which gives an equivalent result to the MCMC that samples ρ using the prior written in Eq (19). However, this is only valid when the relationship between ρ and τ is one-to-one, and there is still a need to check for the uniqueness of τ given the values of ρ and the other parameters, for example by using Theorem 1 or 2. In the case of multiple τ solutions, the mapping from ρ to τ is not even a function since the relation is one-to-many. In this case, the function v(⋅) does not exist and Eq (20) cannot be used.

Not only is Eq (20) invalid in cases with multiple τ solutions, specifying the prior according to the equation may also lead to poor posterior estimation. As an example, consider using a two-component lognormal mixture model for the lower data distribution in a composite Pareto I model with r = (0.5, 0.5), μ = (5, 7), and σ = (0.6, 0.3). The relationship between ρ and τ is shown in Fig 2. The shaded area in the figure signifies the region where τh′(τ|η) ≥ −h(τ|η), in which the acceptance probability for the Metropolis–Hastings step is zero. Suppose the true value of τ is greater than 1000. Then the MCMC that samples τ using the prior specification on ρ and Eq (21) may get stuck in the region between 150 and 500 due to the area with zero acceptance probability, which in turn causes the parameter point estimates to have large errors. Therefore, it is advised to avoid setting the prior as in Eq (21) when the relationship between τ and ρ has not been proven to be one-to-one.

Fig 2. The relationship between ρ and τ when using a two-component lognormal mixture model for the lower data distribution.

https://doi.org/10.1371/journal.pone.0257762.g002

Following Cabras and Castellanos [18], it is assumed that x(m+1) ≤ τ ≤ x(n−2), where x(i) is the ith order statistic, m is the dimension of η, and n is the sample size. This assumption is required to obtain a proper posterior distribution over the parameters η, α, and β.

4.2 Prior distributions for α and β

For the smooth composite Pareto I model, and the continuous bulk model with Pareto I tail, the value of α is determined by Eqs (2) or (10). Therefore, in these cases, a prior distribution on α need not be specified since α is not a free parameter. However, when the composite Pareto I model is not assumed to be differentiable, or when the pdf of the bulk model with Pareto I tail is not assumed to be continuous, the following prior distribution can be used:

p(α) ∝ 1/α,

which is the Jeffreys prior for α for a Pareto I distribution with known threshold or scale parameter.

For the composite Pareto II model with smooth pdf, and the bulk model with Pareto II tail and continuous pdf, the value of β is specified by Eqs (5) or (11). Since in this case α is a free parameter, a prior distribution on α is required. The Jeffreys prior for α for a Pareto II distribution, when the parameter β and the threshold or location parameter are fixed or known, is used:

p(α) ∝ 1/α.

However, when the composite Pareto II model is not assumed to be differentiable, or the discontinuous bulk model with Pareto II tail is used, both α and β are free parameters and prior distributions on both are required. Cabras and Castellanos [18] have proposed using the Jeffreys prior for a Pareto II distribution with fixed or known location (threshold) parameter:

p(α, β) ∝ 1/[β(α + 1)√(α(α + 2))]. (22)

Using this prior leads to a proper posterior distribution [23]. This prior distribution is used in our analysis.

4.3 Prior distributions for η

The prior distribution for η depends on the choice of lower data distribution h(x|η). Default priors can be used when no information is available. For example, if h(x|η) is a lognormal distribution with log-mean μ and log-variance σ², then η = (μ, σ²). As recommended by Cabras and Castellanos [18], the following default prior is used:

p(μ, σ²) ∝ 1/σ².

5 Simulation study

In this section, the performance of posterior point estimates when using a uniform prior on τ versus a uniform prior on ρ is compared. To study the performance of these two prior choices, eight simulation scenarios are used: four involve composite Pareto I models, and the other four involve composite Pareto II models. For the lower data distribution, we use a lognormal distribution with log-mean 6.5 and log-variance 0.5². Table 1 presents the eight simulation scenarios as well as the parameter values. The choice of parameter values and lower data distribution is arbitrary, as similar results are obtained when using other values or distributions.

Referring to Table 1, in the first two scenarios, i.e. LNP1 and LNP1.nodiff, the parameters used are exactly the same. Here, LNP1 assumes the pdf to be differentiable while LNP1.nodiff does not. Therefore, in LNP1, α is a function of the other parameters as in Eq (2) and a prior distribution on α is not required for estimation. Meanwhile, LNP1.nodiff requires a prior distribution on α for estimation, as α is a free parameter. Similarly, LNP2 assumes a differentiable pdf while LNP2.nodiff does not, LNP1B assumes a continuous pdf while LNP1B.nocont does not, and lastly LNP2B assumes a continuous pdf while LNP2B.nocont does not.

To compare the performance of the two priors, 100 datasets with n = 1000 observations each are generated for each scenario using the parameters shown in Table 1. For each dataset, a Metropolis-within-Gibbs algorithm with 50 000 iterations, where the first 20 000 iterations are treated as the burn-in period, is used to estimate the posterior distribution. The convergence of the MCMC is checked using the Gelman–Rubin diagnostic. Then, for each dataset, the posterior mean, median, and mode of τ are estimated and the root mean square error (RMSE) is calculated. The RMSEs of the point estimates based on the posterior densities are compared in order to investigate which prior gives more accurate estimates.
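The RMSE used throughout the tables is, for each scenario, computed across the 100 replications. A one-line helper makes the definition explicit (a sketch, not the paper's code):

```python
import numpy as np

def rmse(estimates, true_value):
    """Root mean square error of a point estimate (e.g. the posterior
    mean of tau from each simulated dataset) against the true value."""
    e = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((e - true_value) ** 2)))
```

Applied separately to the collected posterior means, medians, and modes of τ, this yields the entries compared between the two prior choices.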

Let τa be the posterior samples of τ obtained when using the uniform prior on τ, and τb be the posterior samples of τ obtained when using the uniform prior on ρ. Table 2 shows the RMSE of the posterior mean, median and mode of τa and τb for the eight simulation scenarios. The last column of the table also shows the mean of the posterior standard deviation. From Table 2, it can be seen that the RMSE of the mean, median, and mode is lower for τb than for τa in all scenarios. This implies that the point estimates using the uniform prior on ρ have lower bias than the point estimates obtained using the uniform prior on τ. Additionally, the mean standard deviation of τa is higher than that of τb in all scenarios; the posterior distribution of τa is therefore generally more spread out than that of τb.

Table 2. Root mean square error (RMSE) of the posterior mean, median and mode when using uniform prior on τ versus uniform prior on ρ.

https://doi.org/10.1371/journal.pone.0257762.t002

The simulations are repeated with τ = 1850, a more extreme value compared to the previous simulations with τ = 1300. The parameter values are given in Table 3. In these cases, only around 2% of the data come from the Pareto tail, which translates to around 20 observations out of n = 1000. Using the given parameter values, 100 datasets are generated and the RMSEs of the posterior mean, median and mode are calculated and given in Table 4. Similar results are observed: first, the point estimates found using uniform prior on ρ have lower bias than those found using uniform prior on τ and second, the posterior distribution of τ obtained using uniform prior on τ is more spread out, with higher standard deviation, than that obtained using uniform prior on ρ. Comparing with the results for τ = 1300 in Table 2, the relative differences between the RMSEs of the point estimates of τa and τb are much higher when τ = 1850, as seen in Table 4.

Table 3. Simulation scenarios and their true parameters for the second round of simulations with τ = 1850.

https://doi.org/10.1371/journal.pone.0257762.t003

Table 4. Root mean square error (RMSE) of the posterior mean, median and mode when using uniform prior on τ versus uniform prior on ρ for the second round of simulations.

https://doi.org/10.1371/journal.pone.0257762.t004

One possible explanation for the poor performance when using uniform prior on τ is that the prior assigns equal probability to all values of τ, including very high ones. We believe this shifts the posterior estimate of τ towards the extreme values, sometimes close to the highest possible value, x(n−2), as we assume x(m+1) ≤ τ ≤ x(n−2). This result is further illustrated in Fig 3, which shows the posterior distributions of τa and τb under the LNP2 scenario using the same data. The posterior distribution of τa has a longer right tail than that of τb. This behaviour is commonly observed in all scenarios considered, causing the standard deviation of τa to be higher than that of τb, as shown in the last column of Tables 2 and 4.

Fig 3. Posterior distribution estimates of τa and τb under LNP2 scenario using the same data.

https://doi.org/10.1371/journal.pone.0257762.g003

In addition, the posterior obtained using uniform prior on τ may get worse as x(n−2) increases, typically when the sample size increases or when the underlying pdf exhibits very heavy tail properties. In our simulations, increasing the value of τ increases the heaviness of the upper tail and, as a result, the performance of the Bayesian approach using uniform prior on τ becomes worse, as seen in Tables 2 and 4. It is thus not recommended to use uniform prior on τ for very heavy-tailed data since τ may be overestimated. Using uniform prior on ρ is therefore preferable when little or no information is available regarding the parameters ρ and τ.

6 Applications

6.1 Application to income distribution

The composite Pareto I model is applied to the Malaysian Household Income Survey (HIS) to study the income distribution in Malaysia. Twice every five years, the HIS is conducted by the Department of Statistics Malaysia (DOSM) to gather information regarding the economic well-being of Malaysian households. The survey interviews randomly selected households and records information including household income and expenditure. Five HIS datasets are used, for the years 2007, 2009, 2012, 2014 and 2016. These data, provided by DOSM and Bank Data UKM, consist of subsets of the whole survey data. (Unfortunately, access to the data is restricted and cannot be shared publicly; however, the data can be requested online from https://www.dosm.gov.my/v1/index.php.) There are more than 12 000 observations in each dataset, representing the households in Malaysia.

Since two households with the same income may not be in the same economic condition due to differences in household size, the monthly gross income is equivalised by dividing it by the square root of the household size. This square-root equivalence scale is commonly used, including by the Congressional Budget Office [24] and the OECD [25]. Although households are assigned weights depending on their location and strata, for simplicity the weights are not used in our application. Fig 4 shows the histograms of the five datasets.
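The equivalisation step described above is a one-line transformation; the following sketch makes it explicit (the example figures are hypothetical, not from the HIS data).

```python
import numpy as np

def equivalise(income, household_size):
    """Square-root equivalence scale: divide gross income by the square root
    of the household size, so larger households are not overstated."""
    income = np.asarray(income, dtype=float)
    size = np.asarray(household_size, dtype=float)
    return income / np.sqrt(size)

# A hypothetical household earning RM 6000 per month with four members:
print(equivalise([6000.0], [4]))  # -> [3000.]
```

Under this scale, a four-person household needs twice (not four times) the income of a single person to be considered equally well off.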

Fig 4. Histograms of the equivalised household income in Ringgit Malaysia (RM) for the year 2007, 2009, 2012, 2014 and 2016.

https://doi.org/10.1371/journal.pone.0257762.g004

These datasets have been studied previously by various authors [26–30]. For example, Razak and Shahabuddin [26] have noted that the upper tail of the Malaysian household income distribution exhibits Pareto tail behaviour. Safari et al. [27], on the other hand, have applied a semi-parametric approach to model the income distribution by combining the Pareto type I distribution for the upper tail with a nonparametric approach for the rest of the data. In a recent work, the lognormal-Pareto II composite model was found to describe the income distribution quite well [31].

We apply the smooth composite Pareto I model to these datasets, with a two-component lognormal mixture as the lower data distribution. In this case, h(x|η) = r fLN(x|μ1, σ1²) + (1 − r) fLN(x|μ2, σ2²), where η = (r, μ1, μ2, σ1, σ2), r ∈ (0, 1) is the weight of the first component and fLN(x|μ, σ²) is the pdf of the lognormal distribution with log-mean μ and log-variance σ². This choice of lower data distribution does not guarantee a unique solution of τ given ρ and the other parameters. However, when the MCMC algorithm is implemented on these datasets, no instances of multiple τ solutions are found.
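The two-component lognormal mixture used as the lower data distribution can be sketched directly from its description in the text. The parameter values below are hypothetical, chosen only to illustrate the shape of the density; note that SciPy parameterizes a lognormal via `s = σ` and `scale = exp(μ)`.

```python
import numpy as np
from scipy.stats import lognorm

def lognormal_mixture_pdf(x, r, mu1, sigma1, mu2, sigma2):
    """Two-component lognormal mixture:
    r * fLN(x | mu1, sigma1^2) + (1 - r) * fLN(x | mu2, sigma2^2)."""
    f1 = lognorm.pdf(x, s=sigma1, scale=np.exp(mu1))
    f2 = lognorm.pdf(x, s=sigma2, scale=np.exp(mu2))
    return r * f1 + (1.0 - r) * f2

# Evaluate the mixture density on a grid (hypothetical parameter values):
x = np.linspace(100.0, 10000.0, 5)
print(lognormal_mixture_pdf(x, r=0.6, mu1=7.0, sigma1=0.5, mu2=8.0, sigma2=0.8))
```

In the composite model this density only describes observations below the threshold τ; above τ it is replaced by the Pareto I tail.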

For the prior distributions, reference priors are used for the parameters of the lognormal components and a uniform prior is used for r. Using these real data, the Bayesian analysis is performed twice, once using uniform prior on τ and once using uniform prior on ρ, in order to compare the two results.

The estimated posterior means of τ and ρ are given in Table 5, along with the p-values of the Kolmogorov–Smirnov (KS) goodness-of-fit test using the estimated parameter values. From the table, around 2% to 7% of the observations come from the Pareto tail. The estimates of τ obtained using uniform prior on ρ are smaller than those obtained using uniform prior on τ, which is similar to the results of the simulation study. This may indicate that using uniform prior on τ overestimates the threshold value.

Table 5. Estimates of τ and ρ for Malaysian household income distribution.

Included in the table is the p-value obtained from KS test using the estimated parameters.

https://doi.org/10.1371/journal.pone.0257762.t005

Additionally, the p-values from the KS test are higher when using uniform prior on ρ rather than uniform prior on τ, except for the year 2014, for which the p-value is higher when using uniform prior on τ. This indicates that, except for the year 2014, the estimated parameters obtained using uniform prior on ρ fit the datasets better than those obtained using uniform prior on τ. Therefore, in general, using uniform prior on ρ is deemed better than using uniform prior on τ in this application.
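The KS goodness-of-fit check above amounts to a one-sample test of the data against the fitted CDF. As a hedged sketch (the paper would use the composite Pareto I CDF at the posterior estimates; a plain lognormal with hypothetical parameters stands in here), `scipy.stats.kstest` accepts any callable CDF:

```python
import numpy as np
from scipy.stats import kstest, lognorm

rng = np.random.default_rng(42)
# Synthetic stand-in data drawn from the hypothesised model itself:
data = lognorm.rvs(s=0.5, scale=np.exp(7.0), size=500, random_state=rng)

# One-sample KS test of the data against the fitted CDF (here, the true one):
stat, p_value = kstest(data, lambda x: lognorm.cdf(x, s=0.5, scale=np.exp(7.0)))
print(stat, p_value)
```

A large p-value means the fitted distribution is not rejected; comparing p-values across the two priors, as in Table 5, then gives a rough measure of which set of estimates fits better.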

6.2 Application to NASDAQ-100

We implement the composite Pareto model on the NASDAQ-100, a stock market index made up of the 100 largest companies listed on the Nasdaq stock exchange. The daily increments of the index are taken from Yahoo Finance for the period January 1985 to May 2002, giving n = 4394 days. The daily increments are computed from Pt, the adjusted closing price of the index at day t. Fig 5 shows the histogram of the NASDAQ-100 daily increments. From the figure, it is clear that the distribution exhibits heavy-tail behaviour, so using a composite Pareto model can be beneficial.
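The paper defines the increment transformation in terms of Pt in an equation not reproduced here. Purely as an illustration, and assuming a log-increment definition (the paper's exact transformation may differ), daily increments could be computed from a price series as follows:

```python
import numpy as np

def log_increments(prices):
    """Daily log-increments log(P_t) - log(P_{t-1}) of adjusted closing prices.
    NOTE: assumed definition for illustration only; the paper specifies its
    own transformation of P_t."""
    p = np.asarray(prices, dtype=float)
    return np.diff(np.log(p))

# Hypothetical short price series:
prices = [100.0, 102.0, 101.0, 105.0]
print(log_increments(prices))
```

Whatever the exact definition, the resulting series of increments is what the composite model with a gamma bulk and Pareto II tail is fitted to.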

Fig 5. Histogram of the NASDAQ-100 daily increments used in application.

https://doi.org/10.1371/journal.pone.0257762.g005

These data have been analyzed previously by Behrens et al. [16] and Villa [19] using a bulk model with a gamma distribution for the bulk and a Pareto type II tail. For the threshold parameter, Behrens et al. [16] have used a uniform prior distribution whereas Villa [19] has used a prior distribution based on the Kullback–Leibler divergence. In our implementation, we use both uniform prior on τ and uniform prior on ρ and compare our results with these two studies.

We apply the same bulk model, without the continuity condition, to the data. In our application, the lower data distribution is h(x|η) = fG(x|a, b), where fG(x|a, b) is the pdf of a gamma distribution with shape parameter a and rate parameter b. A reference prior, which involves the trigamma function ψ(1)(⋅), is used for a and b. The prior distribution for α and β is given in Eq (22). Additionally, uniform prior on τ and uniform prior on ρ are used to compare the results based on these two choices of prior, along with the results obtained by Behrens et al. [16] and Villa [19]. Note that both studies have used informative priors for the shape and rate parameters but the values of the hyperparameters are not provided. Therefore, we are unable to duplicate their priors and opt to use a reference prior instead.

Behrens et al. [16] and Villa [19] have used a different parameterization of the model: let ξ = 1/α and σ = β/α. Table 6 shows the estimated posterior modes for the Pareto II tail along with the estimated parameters from Behrens et al. [16] and Villa [19]. Note that the estimate of ρ is not given by either Behrens et al. [16] or Villa [19] and is thus left empty in Table 6. From the table, around 30% of the observations come from the Pareto II distribution and the other 70% come from the gamma distribution. The parameter estimates using uniform prior on τ are similar to those using uniform prior on ρ, indicating that, unlike in the previous simulations and example, there is little advantage to using uniform prior on ρ over uniform prior on τ here.
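The change of parameterization used for Table 6 is a direct algebraic conversion; the sketch below applies it to hypothetical (α, β) values, not the fitted estimates.

```python
def to_gpd_parameterization(alpha, beta):
    """Convert Pareto II parameters (alpha, beta) to the GPD-style pair
    (xi, sigma) used by Behrens et al. and Villa: xi = 1/alpha, sigma = beta/alpha."""
    return 1.0 / alpha, beta / alpha

# Hypothetical values for illustration:
xi, sigma = to_gpd_parameterization(alpha=4.0, beta=2.0)
print(xi, sigma)  # -> 0.25 0.5
```

Reporting (ξ, σ) rather than (α, β) makes the posterior modes directly comparable with the generalized Pareto estimates in the two earlier studies.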

Table 6. Parameter estimates for NASDAQ-100 application with the 95% credible interval in bracket.

https://doi.org/10.1371/journal.pone.0257762.t006

Both sets of parameter estimates, found using uniform prior on τ and uniform prior on ρ, are consistent with those found by Behrens et al. [16] and Villa [19]. Although Behrens et al. [16] also used a uniform prior for the threshold parameter, the differences between our results and theirs might be due to the different prior distributions used for the parameters of the gamma distribution. As mentioned previously, the values of the hyperparameters are not provided by Behrens et al. [16] and thus we cannot duplicate their results. Additionally, the credible intervals found in our analysis are narrower than those found by Behrens et al. [16], but broader than those in Villa [19]. It is noted that Villa [19] has used a data-dependent prior distribution for the threshold parameter.

7 Conclusion

When using a Bayesian approach to composite Pareto models, the prior distribution for the threshold parameter τ must first be specified. However, choosing the prior distribution for τ can be difficult. Previous studies have opted to use normal or uniform distributions for the prior of τ. In our study, we propose specifying the prior on the proportion of data coming from the Pareto distribution, that is, the parameter ρ. The two parameters ρ and τ are dependent since ρ can be written as a function of τ and the other parameters. Therefore, specifying the prior distribution of ρ induces a prior distribution on τ.

Given the values of ρ and the other parameters, the value of τ can be obtained numerically. However, the solution of τ given ρ and the other parameters may not be unique for some lower data distributions. We have proven that for commonly used lower data distributions, such as the lognormal, gamma, and Weibull distributions, the solution is unique. For distributions not covered by the proven theorems, we propose selecting, out of the possible solutions, the τ that solves the required equations and maximizes the log-likelihood.
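The numerical step above can be sketched generically. The paper's actual equation links ρ to τ through the composite model's mixing weights; as a simplified stand-in, suppose ρ is the upper-tail mass of a known distribution G, so that τ solves G(τ) = 1 − ρ. With a monotone G this root is unique (the multiple-solution case discussed above does not arise), and Brent's method applies directly:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import lognorm

def tau_from_rho(rho, cdf, lo, hi):
    """Find tau with cdf(tau) = 1 - rho, i.e. upper-tail mass rho,
    via Brent's root-finder on a bracketing interval [lo, hi]."""
    return brentq(lambda t: cdf(t) - (1.0 - rho), lo, hi)

# Illustration with a lognormal stand-in (hypothetical parameters):
cdf = lambda t: lognorm.cdf(t, s=0.5, scale=np.exp(7.0))
tau = tau_from_rho(0.02, cdf, lo=1.0, hi=1e6)  # threshold leaving 2% in the tail
print(tau)
```

In the non-unique case, each candidate root would be found on a separate bracketing interval and the one maximizing the log-likelihood retained, as proposed above.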

One main benefit of specifying the prior on ρ instead of τ is that it is more natural and easily interpreted. Historically, researchers have used quantiles in extreme value theory to set the threshold separating the lower data distribution from the Pareto distribution, and specifying a prior on ρ can be seen as an extension of this approach. Another benefit is that ρ is bounded between 0 and 1 while τ is not bounded above. Hence, it is easier for researchers to specify the prior distribution on ρ, for example by using a beta distribution, than to specify the prior distribution of τ.

When there is no information regarding τ or ρ, a uniform prior can be used for either parameter. In our simulations, using uniform prior on ρ leads to less biased point estimates than using uniform prior on τ. Uniform prior on τ assumes equal density even at extremely high values of τ and may cause overestimation of the parameter. It is thus recommended to use uniform prior on ρ rather than uniform prior on τ when there is a lack of information regarding these parameters.

Supporting information

Acknowledgments

The authors would like to thank the Department of Statistics Malaysia (DOSM) and Bank Data UKM for the household income datasets used in this study.

References

1. Arnold BC. Pareto and Generalized Pareto Distributions. In: Chotikapanich D, editor. Modeling Income Distributions and Lorenz Curves. New York, NY: Springer New York; 2008. p. 119–145.
2. Pickands J. Statistical Inference Using Extreme Order Statistics. The Annals of Statistics. 1975;3(1):119–131.
3. Cooray K, Ananda M. Modeling actuarial data with a composite lognormal-Pareto model. Scandinavian Actuarial Journal. 2005;2005(5):321–334.
4. Cooray K. The Weibull–Pareto Composite Family with Applications to the Analysis of Unimodal Failure Rate Data. Communications in Statistics—Theory and Methods. 2009;38(11):1901–1915.
5. Dominicy Y, Sinner C. Distributions and Composite Models for Size-Type Data. In: Hokimoto T, editor. Advances in Statistical Methodologies and Their Application to Real Problems. InTech; 2017.
6. Wijeyakulasuriya DA, Hanks EM, Shaby BA, Cross PC. Extreme Value-Based Methods for Modeling Elk Yearly Movements. Journal of Agricultural, Biological and Environmental Statistics. 2019;24(1):73–91.
7. Cooray K, Cheng CI. Bayesian estimators of the lognormal–Pareto composite distribution. Scandinavian Actuarial Journal. 2015;2015(6):500–515.
8. Aminzadeh MS, Deng M. Bayesian predictive modeling for Inverse Gamma-Pareto composite distribution. Communications in Statistics—Theory and Methods. 2019;48(8):1938–1954.
9. Scollnik DPM. On composite lognormal-Pareto models. Scandinavian Actuarial Journal. 2007;(1):20–33.
10. Ciumara R. An actuarial model based on the composite Weibull-Pareto distribution. Mathematical Reports. 2006;8(4):401–414.
11. Scollnik DPM, Sun C. Modeling with Weibull-Pareto Models. North American Actuarial Journal. 2012;16(2):260–272.
12. Teodorescu S, Vernic R. On composite Pareto models. Mathematical Reports. 2013;15(1):11–29.
13. Teodorescu S, Vernic R. A composite Exponential-Pareto distribution. An Stiint Univ "Ovidius" Constanta Ser Mat. 2006;14(1):99–108.
14. Calderín-Ojeda E, Kwok CF. Modeling claims data with composite Stoppa models. Scandinavian Actuarial Journal. 2016;2016(9):817–836.
15. Scarrott C. Univariate Extreme Value Mixture Modeling. In: Extreme Value Modeling and Risk Analysis. Chapman and Hall/CRC; 2016. p. 41–67.
16. Behrens CN, Lopes HF, Gamerman D. Bayesian analysis of extreme events with threshold estimation. Statistical Modelling. 2004;4(3):227–244.
17. do Nascimento FF, Gamerman D, Lopes HF. A semiparametric Bayesian approach to extreme value estimation. Statistics and Computing. 2012;22(2):661–675.
18. Cabras S, Castellanos ME. A Bayesian Approach for Estimating Extreme Quantiles Under a Semiparametric Mixture Model. ASTIN Bulletin: The Journal of the IAA. 2011;41(1):87–106.
19. Villa C. Bayesian estimation of the threshold of a generalised pareto distribution for heavy-tailed observations. TEST. 2017;26(1):95–118.
20. Tancredi A, Anderson C, O'Hagan A. Accounting for threshold uncertainty in extreme value estimation. Extremes. 2006;9(2):87–106.
21. MacDonald A, Scarrott CJ, Lee D, Darlow B, Reale M, Russell G. A flexible extreme value mixture model. Computational Statistics & Data Analysis. 2011;55(6):2137–2157.
22. Gordon RD. Values of Mills' Ratio of Area to Bounding Ordinate and of the Normal Probability Integral for Large Values of the Argument. The Annals of Mathematical Statistics. 1941;12(3):364–366.
23. Castellanos ME, Cabras S. A default Bayesian procedure for the generalized Pareto distribution. Journal of Statistical Planning and Inference. 2007;137(2):473–483.
24. Congressional Budget Office. Projected Changes in the Distribution of Household Income, 2016 to 2021; 2019. Available from: https://www.cbo.gov/publication/55941.
25. OECD. In It Together: Why Less Inequality Benefits All. Paris: OECD Publishing; 2015.
26. Razak FA, Shahabuddin FA. Malaysian Household Income Distribution: A Fractal Point of View. Sains Malaysiana. 2018;47(9):2187–2194.
27. Safari MAM, Masseran N, Ibrahim K. A robust semi-parametric approach for measuring income inequality in Malaysia. Physica A: Statistical Mechanics and its Applications. 2018;512:1–13.
28. Masseran N, Yee LH, Safari MAM, Ibrahim K. Power Law Behavior and Tail Modeling on Low Income Distribution. Mathematics and Statistics. 2019;7(3):70–77.
29. Safari MAM, Masseran N, Ibrahim K, Hussain SI. A robust and efficient estimator for the tail index of inverse Pareto distribution. Physica A: Statistical Mechanics and its Applications. 2019;517:431–439.
30. Safari MAM, Masseran N, Ibrahim K, AL-Dhurafi NA. The power-law distribution for the income of poor households. Physica A: Statistical Mechanics and its Applications. 2020;557:124893.
31. Majid MHA, Ibrahim K. Composite Pareto Distributions for Modelling Household Income Distribution in Malaysia. Sains Malaysiana. 2021;50(7):2047–2058.