Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Copula-Based Approach to Synthetic Population Generation

  • Byungduk Jeong ,

    Contributed equally to this work with: Byungduk Jeong, Wonjoon Lee

    Affiliation Department of Industrial & Systems Engineering, KAIST (Korea Advanced Institute of Science and Technology), Daejeon, South Korea

  • Wonjoon Lee ,

    Contributed equally to this work with: Byungduk Jeong, Wonjoon Lee

    Affiliation Department of Industrial & Systems Engineering, KAIST (Korea Advanced Institute of Science and Technology), Daejeon, South Korea

  • Deok-Soo Kim,

    Affiliation Department of Mechanical Engineering, Hanyang University, Seoul, South Korea

  • Hayong Shin

    hyshin@kaist.ac.kr

    Affiliation Department of Industrial & Systems Engineering, KAIST (Korea Advanced Institute of Science and Technology), Daejeon, South Korea

    ORCID http://orcid.org/0000-0002-5161-661X

Copula-Based Approach to Synthetic Population Generation

  • Byungduk Jeong, 
  • Wonjoon Lee, 
  • Deok-Soo Kim, 
  • Hayong Shin
PLOS
x

Abstract

Generating synthetic baseline populations is a fundamental step of agent-based modeling and simulation, which is growing fast in a wide range of socio-economic areas including transportation planning research. Traditionally, in many commercial and non-commercial microsimulation systems, the iterative proportional fitting (IPF) procedure has been used for creating the joint distribution of individuals when combining a reference joint distribution with target marginal distributions. Although IPF is simple, computationally efficient, and rigorously founded, it is unclear whether IPF well preserves the dependence structure of the reference joint table sufficiently when fitting it to target margins. In this paper, a novel method is proposed based on the copula concept in order to provide an alternative approach to the problem that IPF resolves. The dependency characteristic measures were computed and the results from the proposed method and IPF were compared. In most test cases, the proposed method outperformed IPF in preserving the dependence structure of the reference joint distribution.

Introduction

Large-scale micro-simulations using agent-based models have gained wide popularity in recent years in various fields of socio-economic studies [1] including transportation planning [2] and land use [3]. Generating synthetic baseline populations is a key step in agent-based modeling and simulation. An agent in a microsimulation is described by a set of attributes such as age, income, residence type/region, and so on. These attributes are usually dependent on each other. Hence, the synthetic population generation can be considered as creating a set of agents with the attributes drawn from a joint distribution. However, a difficult element of the synthetic population generation is obtaining a relevant data set. As described in [4], there are two typical types of data sets available: disaggregated census data in form of PUMS (Public Use Micro Samples) and aggregated data in form of summary tables in census reports. PUMS contains individual samples of small size (typically less than 5% of population), which can be used to infer the joint distribution of the attributes. On the other hand, aggregated data in summary tables exhibits marginal distributions of attributes specific to each analysis zone of interest. As samples in PUMS data are chosen from a rather larger area (like state or nationwide) than target zones, the reference joint distribution from PUMS is often inconsistent with marginal distributions from the aggregate data of each zone. Beckman et al. [5] proposed to combine the disaggregated data with the aggregated data using IPF (Iterative Proportional Fitting) procedure. The primary concept of IPF is to maintain the dependence structure from the disaggregated data and alter the joint distribution to fit the marginal distribution of the attributes from the aggregated data. (We will use the term ‘marginal distribution’ and ‘margin’ interchangeably.) Since the inception by Beckman et al, there has been much research following this path: see [4,6]. IPF, which is briefly reviewed in a later section, is a very efficient and powerful technique for constructing a joint distribution table from a reference joint distribution and target margins. (In population synthesis area, the term ‘contingency table’ is often used to refer a table with the frequency of population in each cell. Since a contingency table can be easily converted to a distribution table (a probability mass function of discrete random variables), we will use the concept of distribution table instead.) Although the IPF procedure is very popularly in a variety of applications including synthetic population generation, it has some limitations as well. In this paper, we propose a novel approach based on copula theory for the same problem of constructing a joint distribution in place of IPF. It should be noted that the proposed approach can deal with only ordinal variables, not categorical ones.

Recently, some research papers have used copula theory for microsimulation of traffic behavior (e.g., see [7,8], in which the copula was used in different contexts. Kao et al. [9] also proposed a copula based approach to synthesizing households in order to preserve the dependence structure. However, they combine target margins using Gaussian copula, whose covariance matrix is determined from the reference joint distribution (possibly represented by samples). A limitation of this approach is that some dependency information is lost because of the intermediate Gaussian copula. A similar approach can be found in [10], which utilizes Copula for representing temporal dependence structure among time series of stream flow in a geographic region. These literature uses some well-known copula functions such as Gaussian or Gumbel copula, then parameters of the copula function are chosen to fit the data. On the other hand, in this paper, we propose to directly use the empirical copula as explained in “Copula based approach to joint fitting problem” section.

Problem Description

Though, in synthetic population generation, there are many research issues such as household–individual hierarchy and aggregation data inconsistency, the problem focused in this paper is the construction of a joint distribution from a given reference joint distribution and target margins. Although the approach is applicable to multi-dimensional distributions without significant modification, the description in this paper is confined to two dimensional setting, for simplicity. For the most part of this paper, we will assume that target margins are discrete distributions, then the procedure will be extended to continuous variables in “Distribution view of CBJF and extensions” section.

In order to formally describe the problem, the following notations are introduced. Let (X,Y) be a pair of discrete random variables that represent the attributes of the reference population. X and Y can have values from {x1,x2,…,xm} and {y1,y2,…,yn} respectively. We assume that X and Y are ordinal or interval variables, possible values of which have natural ordering. (For the variable types, readers are referred to [11].) The possibility of relaxing this assumption to handle categorical variables will be discussed in the conclusion section as a further research topic. Let be a pair of random variables representing the attributes of the target population. For simplicity, we assume that and have the same values as X and Y, respectively. Note that this assumption can be easily removed in the proposed approach by introducing a mapping between them. We will use the following notations, illustrated in Fig 1.

  • a = [ai,j]: an m×n matrix denoting the reference joint distribution of (X,Y),
    1. i.e. ai,j = P[X = xi, Y = yj]
  • r = [ri]: target marginal distribution of , i.e.
  • c = [cj]: target marginal distribution of , i.e.
  • b = [bi,j]: an m×n matrix denoting the target joint distribution of ,
    1. i.e.

The reference joint distribution a can be either given directly or obtained from the detailed disaggregated census data available in form of PUMS (by counting the samples in each cell), while the target margins r and c are obtained from the aggregated data. From the input data {a,r,c}, the goal is to find the target joint distribution b inheriting the dependence structure from a while fitting it to the margins r and c. We call this problem a joint fitting problem. By definition, ai,j, ri, cj, bi,j are probabilities, meaning that they are nonnegative and sum to one. The following symbols are used to denote marginal summations:

  • ai,+ = ∑jai,j: row margin of a, i.e. ai,+ = P[X = xi]
  • a+,j = ∑iai,j: column margin of a, i.e. a+,j = P[Y = yj]
  • bi,+ = ∑jbi,j: row margin of b
  • b+,j = ∑ibi,j: column margin of b

Using these symbols, the constraints of the problem are to satisfy bi,+ = ri and b+,j = cj. The goal of “preserving the dependence structure” in a may appear ambiguous. Quantitative measure of this goal is differently defined in each method described in the following sections.

Brief Overview of IPF (Iterative Proportional Fitting)

IPF (Iterative Proportional Fitting) is a concise and efficient procedure to solve the joint fitting problem described in the previous section. IPF has many names, including RAS algorithm, matrix raking, matrix scaling, bi-proportional fitting, and so on. Since its introduction in by Deming & Stephan [12], the properties of IPF has been studied thoroughly and used widely in various fields including the synthetic population generation. Although there are some variations of IPF, its essence can be described using the following algorithm (Algorithm 1):

Algorithm 1 (IPF)

bi,jai,j (Initialization)

While (convergence criterion is not met)

        

        

A common choice of convergence criterion is to measure the maximum deviation ε from the given margins r and: (1)

Ireland & Kullback [13] proved that if IPF procedure converges to a certain distribution table under the given constraints on the marginal, then the resulting table minimizes the relative entropy (called the ‘discrimination information’ in their paper), as defined below: (2)

Wong [14] investigated the reliability of IPF for use in geographical studies. Beckman et al. [5] proposes using IPF to combine disaggregated data with aggregated margins from different data sources. Following Beckman’s lead, much literature has been produced on adopting IPF for population generation, as summarized in [4]. Despite the popularity and rigorous mathematical analysis, IPF has some limitations, as follows:

  • Convergence problem: Although this is rare in practical applications, IPF procedure may not converge. Pukelsheim & Simeone [15] showed conditions when IPF may fail to converge (i.e. when a can be permuted to a block diagonal structure in 2-dimensional case, as exemplified below). Though IPF converges very fast in general as explained in [16], it may take many iterations before reaching the desired level of fitting accuracy when there are cells with non-zero initial probability that are eventually wiped out to zero.
  • Zero margin / Zero cell problem: If any of ai,+ (or a+,j) vanishes to zero while ri (or cj, respectively) is not, the IPF procedure fails. This is called zero margin problem. Furthermore, the zero cells in a do not have a chance to obtain positive probability mass during IPF procedure. That is, if ai,j = 0, then bi,j = 0 always. A zero cell is not a problem in computational sense, however it poses a semantic problem because a zero cell can be a result of underrepresentation caused by the limited sample size. As a work-around, zero cells (and hence zero margins) are replaced with a very small number at the initialization step.
  • Dependence structure preserving: Most importantly, it must be seriously considered whether IPF preserves well the dependence structure of the reference table a. Since there is no strict relationship between minimizing the discrimination information and preserving the dependence structure, minimum discrimination information does not guarantee the most similar dependence structure. Although IPF converges to the solution table b which minimizes the relative entropy subject to the marginal constraints, it does not necessarily mean that the dependence structure captured in a is best transferred to b. In this paper a new method is proposed to replace IPF and these methods are compared in terms of some dependence measures which can capture the strength of relationships between variables.

Tables 1 and 2 show a small example joint fitting problem and a solution obtained by IPF procedure. Inputs are the reference joint matrix a and the target margins r, c in Table 1. With the convergence tolerance ε = 10−5, the output of IPF procedure is shown in Table 2, obtained after more than 300 iterations. If a5,3 is changed to 0, the non-zero cells in a matrix are separated in two groups: upper-left group and lower-right group. In this case, IPF procedure does not come to convergence because probability mass in one group cannot be transferred to the other group due to the barricade of zero cells. If we replace zero cells with a very small number (10−20), IPF converges after 1000 iterations.

Copula Based Approach to Joint Fitting Problem

Copula, which was first coined by Sklar in [17] from a Latin word copulare meaning “to connect or link”, is a popular tool for modeling dependence between random variables. This section begins by reviewing the foundation of the copula theory. Further details on the copula can be found in [18]. A copula C(u,v) is a joint cumulative distribution function whose margins are uniform (0,1) distributions, satisfying the following properties:

  1. Uniform (0,1) margins:
  2. Monotonously increasing:
  3. Rectangle inequality (non-negative probability for [u,u + du] × [v,v + dv]):

A copula can also be obtained from a joint distribution. Let FXY(x,y) = P[X ≤ x, Y ≤ y] denote the joint cumulative distribution function of (X,Y). Furthermore, let FX(x) = P[X ≤ x] and FY(y) = P[Y ≤ y] be the marginal cumulative distribution functions of X and Y respectively. The copula of FXY(x,y) is defined by a function C(u,v) that satisfies the following: (3)

Sklar’s theorem states that such a function C(u,v) exists, and if X and Y are continuous, C(u,v) is uniquely determined. That is, by letting u = FX(x) and v = FY(y), the copula function associated with FXY(x,y) can be obtained as follows: (4)

However, if the variables are not continuous (like X and Y in this paper), then the copula C is not unique; in this case, the values of the copula are uniquely determined at points (x,y), and a copula C for which the properties (a)-(c) above holds can be obtained by interpolating the values at these points [19]. The proof for the general n-dimensional case is outlined in [20]. When we speak of the copula of variables X and Y, we will mean the copula whose existence is guaranteed using the bilinear interpolation, if one or both of the random variables are not continuous.

Using the bilinear interpolation (detail procedure is explained in the last part of this chapter), the copula function C(u,v) is obtained from the joint cumulative distribution function by removing the marginal information, meaning that C(u,v) extracts the dependence structure of the joint distribution without the information specific to margins. This forms the perfect basis of applying the copula to the joint fitting problem: the copula function derived from the reference joint distribution of (X,Y) can be combined with the new target margins of and to form the target joint distribution of . This is further elaborated with basic mathematical operations.

Let FXY(x,y) denote the joint cumulative distribution function of (X,Y), and let FX(x) and FX(x) denote the marginal distribution of X and Y, respectively. Let C(u,v) be the copula function obtained via bilinear interpolation. If we set U = FX(X) and V = FY(Y), then C(u, v) is the cumulative distribution function of (U,V). Now, the goal is to find the joint distribution of , whose marginal cumulative distributions are and . We would like to make the dependence structure of the joint distribution of as close to that of (X, Y) as possible. For this reason, the same copula C(u, v) is applied to the margins and in order to achieve the goal. Let and . Then, C(u, v) is used as the cumulative distribution function of .

(5)

The following symbols are introduced for denoting cumulative distributions:

  • : cumulative distribution of (X, Y), i.e. Ai,j = P[X ≤ xi, Y ≤ yj]
  • : cumulative distribution of X, i.e. ui = P[X ≤ xi] = FX(xi)
  • : cumulative distribution of Y, i.e. vj = P[Y ≤ yj] = FY(yj)
  • : cumulative distribution of , i.e.
  • : cumulative distribution of , i.e.
  • : for notational convenience

Because and , we get . Therefore: (6)

Hence, the target joint distribution b can be computed, if we have the copula function C(u,v) obtained from a. Because Ai,j = P[X ≤ xi, Y ≤ yj] = P[U ≤ FX(xi), V ≤ FY(yj)] = P[U ≤ ui, V ≤ vj] = C(ui,vj), the values of C(u,v) is uniquely defined only at each grid points (ui,vj). Since does not necessarily coincide with (u,v), C(u,v) should be estimated from adjacent known grid points. For a given (u,v), the cell [ui,ui+1] × [vj,vj+1] containing (u,v) can be found. The local parameters s and t ∈ [0,1] are defined as shown in Eq (7) and Fig 2: (7)

Then, C(u, v) is computed using a bilinear interpolation as below: (8)

With C(u, v) as defined in Eq (8), the following lemmas can be easily proved. (Proofs can be found in S1 Text.)

Lemma 1. C(u, v) is at least C0 continuous at all locations in [0,1] × [0,1].

Lemma 2. C(u, v) satisfies the properties (a)~(c) of a copula function.

Hence, C(u, v) qualifies as a copula from the reference joint distribution a. The following algorithm (Algorithm 2) is presented based on the copula, which we call copula based joint fitting (CBJF).

Algorithm 2 (CBJF)

Step 0 (Initialize)

        

Step 1 (Compute [Ai,j])

        Ai,j = ai,j + Ai−1,j + Ai,j−1Ai−1,j−1 for i = 1,…,m and j = 1,…,n

Step 2 (Compute )

        ui = ai,+ + ui−1 for i = 1,…,m

        vj = a+,j + vj−1 for j = 1,…,n

        

        

Step 3 (Compute )

        For each k = 1,…,m and l = 1,…,n

                Find the cell (i,j) such that

                

                Bk,l = (1 − s)(1 − t)Ai,j + (1 − s)tAi,j+1 + s(1 − t)Ai+1,j + stAi+1,j+1

Step 4 (Compute [bk,l])

        For each k = 1,…,m and l = 1,…,n

                bk.l = Bk,lBk−1,lBk,l−1 + Bk−1,l−1

Shown in Table 3 is the output of Algorithm 2 (CBJF) applied to the input condition in Table 1. It can be noted that some zero cells in a matrix get some probability mass by CBJF, unlike the result of IPF shown in Table 2. It can be interpreted as diffusion of probability through the lens of copula upon the request of target margins.

Distribution View of CBJF and Extensions

The computational complexity of the above Algorithm 2 (CBJF) is O(T), where T = mn is the total number of cells in the joint distribution table. Though it is a very efficient algorithm with linear complexity proportional to the input/output size, the total number of cells T grows exponentially as the dimension of target margins increases, which can easily lead to prohibitively large matrices for storage and computation. However, in many practical high dimensional distributions, non-zero cells are very sparse and this is the clue to handle high dimensional cases. In order to exploit the sparsity of the joint distribution table, we need to re-organize Algorithm 2.

We will start with an interpretation of Algorithm 2 (CBJF). In step 2 of Algorithm 2, [ui] and [vj] define a partition G on [0,1]x[0,1] space, shown as solid (blue) lines in Fig 3. [ai,j] is the probability mass assigned to each cell Gi,j of the partition. The target margins r and c overlay a new partition defined by and , shown as dashed (red) lines in Fig 3. Step 4 of Algorithm 2 can be interpreted as a “collection view”. In other words, the probability mass bi,j for a cell in the new partition is computed by collecting the probability mass of cells in G overlapping with . For example, the cell overlaps with G2,2, G2,3, G3,2, and G3,3. So, b2,2 is computed by collecting probability mass from a2,2, a2,3, a3,2, and a3,3 in proportion to the ratio of the overlapping area, which is achieved by the bilinear interpolation copula function in Eq (8). In this view, every cell in the partition should be visited and computed, since the algorithm does not know in advance whether the visit will result in a zero cell or not.

The same algorithm can be viewed from the other way around, which we call “distribution view”. Instead of collecting probability from the overlapping cells, we can visit each non-zero cell Gi,j and distribute the probability mass ai,j to the overlapping cells in , again in proportion to the ratio of the overlapping area. Fig 4 shows that a2,2 is distributed to b1,1, b1,2, b2,1, and b2,2, where bi,j acts as an accumulator of probability mass incoming from each non-zero cell of G. Sparse matrix representation can be used to store a and b. This algorithm (Algorithm 3) is shown below.

Algorithm 3 (CBJF-Distribution)

Step 0 (Initialize)

            Define partition G by computing [ui], [vj]

            Define partition by computing

            [bi,j] is an empty zero martix

Step 1 (Distribute [ai,j])

            For each non-zero ai,j

                For each overlapping cell

                        

                        bs,t = bs,t + p * ai,j

Algorithm 3 (CBJF-Distribution) is simpler, faster, and storage-efficient, and hence, it can be applied to high dimensional case. On top of these benefits, it enables further extensions. Let us first look into an extension to resampling-based population generation, which is very common. In case that disaggregated micro-samples are given for the reference joint distribution, let p(k) = (x(k),y(k),z(k)) denote the k-th survey record in PUMS, which we will call a PUMS entry (k) hereafter in order to reduce confusion in the use of the term “sample”. x(k) and y(k) are attributes whose target margins are given, while z(k) is a vector of additional attributes. Let P = {p(k); k = 1.N} denote the PUMS set. Once the target joint distribution b is obtained, it can be directly used for generating agents with attributes (x,y) drawn from b. However, we cannot set the additional attributes z. This is the motivation of resampling based population generation, where synthetic population is generated by resampling P. In this case, a selection probability w(k) is assigned to each PUMS entry p(k), so that resampling selects p(k) with probability w(k). The goal of simple resampling is to compute w(k) so as to meet the target margin requirement. Simple resampling can be efficiently combined with IPF procedure. However, in this case, the zero cell problem still persists, i.e. we cannot generate any population in a zero cell where no PUMS entry exists. This can be solved by viewing Algorithm 3 (CBJF-Distribution) at the granularity of PUMS entry. Algorithm 3 (CBJF-Distribution) allocates the probability mass of ai,j to overlapping grids in . On the same token, we can think a PUMS entry p(k) as a cell having the probability mass of 1/N. (When there are more than one PUMS entries in a cell, we can consider them as multiple layers overlaid on the same cell, each of which has the probability mass of 1/N.) For each p(k), we can find the overlapping cells , and a copy of p(k) is added to the cell with the selection probability determined in proportion to the overlapping area ratio. Note that (x,y) attribute of p(k) is to be replaced by that of . In this way, p(k) is duplicated into multiple copies, whose weights sum to 1/N, and each copy has different values of (x,y) attribute while retaining the same z(k) attribute. Details are given in Algorithm 4 (CBJF-Resampling). At the end of the algorithm, we get a set Q = {(q(k),w(k)); k = 1.M}, which can be used as the source of resampling for synthetic population generation.

Algorithm 4 (CBJF-Resampling)

Step 0 (Initialize)

        Define partition G by computing [ui], [vj]

        Define partition by computing

        Q = an empty set

Step 1 (Distribute PUMS entry)

        For each PUMS entry p(k) = (x(k),y(k),z(k)) ∈ P

                Find Gi,j (the cell where p(k) belongs)

                For each overlapping cell

                        

                        q = (xs,yt,z(k))

                        Add (q,w) to Q

The last extension is about handling continuous variables. For the continuous variable cases, it is natural to consider the reference joint is given as PUMS set P = {p(k); k = 1.N} as above, while target margins are given in functional form and . We also assume that their inverses and are available. In fact, continuous case is much simpler, at least conceptually, due to the fact that (X,Y) and share the same copula [18]. For each PUMS entry p(k) = (x(k),y(k),z(k)), the corresponding point can be computed as shown in Fig 5: and , where u(k) = FX(x(k)) and v(k) = FY(y(k)). Since the reference joint distribution is given as the PUMS set P, u(k) = i(k)/N and v(k) = j(k)/N, where i(k) (or j(k)) is the number of PUMS entries p(l) with x(l)x(k) (or y(l)y(k)). Computation of u(k) (or v(k)) can be done more efficiently by sorting {x(l)} (or {y(l)}). This procedure is described in Algorithm 5 (CBJF-Continuous) below. The overall computational complexity of CBJF-Continuous is O(N log N), where N is the size of PUMS set. The new sample set Q = {q(k); k = 1.N} is the source of resampling for synthetic population generation. In hybrid case where continuous and discrete variables are mixed, the above extensions can be easily combined and details are left to the readers.

Algorithm 5 (CBJF-Continuous)

Step 0 (Initialize)

        Sort {x(l)} and {y(l)}

        Q is an empty set

Step 1 (Transform PUMS set)

        For each PUMS entry p(k) = (x(k),y(k),z(k)) ∈ P

                Obtain i(k) and j(k) from sorted sets

                u(k) = i(k)/N and v(k) = j(k)/N

                 and

                Add to Q

Numerical Experiments and Comparison

For comparison purpose, the joint fitting problem was also formulated as the following weighted least square formulation, which is an instance of a quadratic programming (QP Formulation) problem:

QP Formulation

Subject to

        bi,+ = ri for i = 1,…,m,

        b+,j = cj for j = 1,…,n,

        bi,j ≥ 0,

        where

        and M is a sufficiently large number.

Note that the objective function in the QP can be replaced with a weighted L1-norm:

Although the L1-norm is not a linear function, a standard technique can be applied to convert the optimization problem to a linear programming (LP) (see [21], for example), resulting in the following LP formulation, where the new variable (or ) represent the positive (or negative) part of bi,jai,j. (The solution of this LP Formulation was also computed for the test cases listed in this section, however, LP results were discarded because the output quality was significantly inferior to other outputs. Though, we leave the formulation here for documentation purpose.)

LP Formulation

Subject to

        

        bi,+ = ri for ∀i

        b+,j = cj for ∀j

        

        where

        and M is a sufficiently large number.

Some quantitative measures will be introduced in order to compare the results in terms of preserving the dependence structure. The most popular measure of dependence is Pearson’s product moment correlation coefficient, which is defined as: (9)

Although it is simple and familiar, Pearson’s correlation coefficient measures only the linear dependence between X and Y. An alternative to Pearson’s correlation is a rank correlation, such as Spearman’s rho or Kendall’s tau. Spearman’s rho measures the Pearson’s correlation between the two uniform random variables FX(X) and FY(Y): (10)

For Kendall’s tau, consider the two independent samples (X1,Y1) and (X2,Y2) with the same joint distribution as (X,Y). (X1,Y1) and (X2,Y2) are concordant if (X1−X2)(Y1−Y2) > 0 or discordant if (X1−X2)(Y1−Y2) < 0. Then, Kendall’s tau is defined as follows: (11)

For the general types of dependence, the maximal information coefficient (MIC) [22] is also used. MIC is a measure of dependence which captures a wide range of (either functional or non-functional) associations between variables. In case that a functional relationship exists, MIC provides a score that roughly equals the coefficient of determination (R2) of the data relative to the regression function. (10,000 sample points were drawn from each joint distribution and used to calculate their MIC values.)

In order to test the effectiveness of the proposed method, the following five different types of reference joint distribution types were chosen (shown in Fig 6):

thumbnail
Fig 6. Reference joint types in the test set.

(a) Bivariate normal, (b) Bimodal, (c) Lower tail dependence, (d) U-shape, (e) Circle.

https://doi.org/10.1371/journal.pone.0159496.g006

  • Normal (Fig 6(A)): bivariate normal with mean μ = (0.0, 0.0), standard deviation σ = 1 and correlation ρ = 0.7
  • Bimodal (Fig 6(B)): mixture of two bivariate Gaussians
    1. -. μ1 = (0.0, 0.0), σ1 = 1 and ρ1 = 0.7
    2. -. μ2 = (3.0, 1.0), σ2 = 1 and ρ2 = -0.5
  • Tail dependent (Fig 6(C)): a joint distribution showing strong tail dependence when X is low
  • U-shape (Fig 6(D)): U-shaped distribution whose correlation coefficients are quite small
  • Circle (Fig 6(E)): Circular joint distribution.

Then, in order to obtain target margins, the following eight types of marginal modification operators were applied to the margins of each reference joint distribution (see Fig 7):

thumbnail
Fig 7. Target marginal modification operators.

(a) Original marginal distribution, (b) Skew left, (c) Skew right, (d) Uniform, (e) Fat tail, (f) Thin tail, (g) Perturbation.

https://doi.org/10.1371/journal.pone.0159496.g007

  • Skew: Left or right skew applied to the column margin and the row margin. There are 4 different combinations of skewed margins (LL, RR, LR, RL). For example, “Skew LR” indicates that the column margin [cj] is skewed left and the row margin [ri] is skewed right.
  • Uniform: Uniform target margin
  • Fat tail: To make the margin tails fatter
  • Thin tail: To make the margin tails thinner
  • Perturb: Add ±10% noise to each bin of the reference margins

Combining the five reference joint types with the eight marginal modification operators, a total of 40 combinations were used as the test set. The test set is not meant to be comprehensive, but rather some typical cases were selected in order to examine the effectiveness of the proposed methods in terms of dependency measures. Each margin was discretized into 100 bins, resulting in 100x100 cells in a joint table. For each combination of reference joint distribution and target margins, IPF procedure, QP method, and CBJF approach were applied in order to obtain the target joint distribution.

Fig 8 shows the output distributions for one out of the 40 test cases: the normal joint + fat tail operator. Fig 8(D), 8(E), 8(G) and 8(H) are the scatter plots of the 10,000 sample points (for graphical presentation purpose) from the reference distribution, the IPF result, the QP result, and the CBJF result, respectively. While copula result seems to have similar dependent structure of reference joint, IPF and QP results look having more linearly correlated dependent structure. This is one of many cases where the output from CBJF approach outperforms the other methods. (The output plots for all 40 cases can be found in S1 Fig. Another numerical experiment of various changes on marginal distributions can be found in S2 Fig.)

thumbnail
Fig 8. Output results for normal joint distribution and fat tail target margins.

https://doi.org/10.1371/journal.pone.0159496.g008

IPF produces the smallest relative entropy (Eq (2)). However, relative entropy does not convey the same information as level of dependence structure preservation which can be measured from correlation coefficients. For the quantitative comparison of dependence structure preservation, the dependency measures stated in the previous section were computed: Pearson’s correlation, Spearman’s rho, Kendall’s tau, and the MIC. The results are shown in Tables 47 (rows marked as (a)), respectively. For each dependency measure, shown in Tables 47 (rows marked as (b)) are the deviation from the measure of the original reference joint table to that of the output joint table of each method. There, the smaller deviation implies the better preservation of dependence structure. The comparison results are summarized as follows:

  • The perturbation operation (the second last columns of Tables 47) does not change the target margins significantly from those of the reference margins. For this small change, all three methods work well; showing almost no difference in the dependence measures.
  • For the larger modifications, the copula based approach was superior to the other methods in almost all combinations. This indicates that the proposed method preserves the dependence structure of the reference joint distribution, while the other methods (IPF and QP) often fail to maintain the dependence structure when the target margins are significantly different from those of the reference.

Since not only the type of marginal modification operator but also the amount of marginal change affects the result, skew, fat tail and thin tail operators are applied to each of the reference joint types with various levels of marginal change. And for each test combination, the MIC is calculated to see how the marginal change affects the dependence structure. Marginal changes are controlled by marginal variation which we define as the summation of the total variation distance of row margins and that of column margins where total variation distance between distribution P and Q is: (12)

From the results shown in Figs 913, we can see that all the methods perform well in maintaining the reference’s dependence structure when the marginal variation is very small. However, as the variation gets bigger, IPF and QP fail to preserve reference joint distribution’s dependence structure measured by MIC. On the other hand, MIC of the CBJF output remains almost unchanged as the level of marginal variation increases.

thumbnail
Fig 9. Results from the normal joint distribution with skew, fat tail and thin tail operators.

https://doi.org/10.1371/journal.pone.0159496.g009

thumbnail
Fig 10. Results from the bimodal joint distribution with skew, fat tail and thin tail operators.

https://doi.org/10.1371/journal.pone.0159496.g010

thumbnail
Fig 11. Results from the tail dependent joint distribution with skew, fat tail and thin tail operators.

https://doi.org/10.1371/journal.pone.0159496.g011

thumbnail
Fig 12. Results from the U-shape joint distribution with skew, fat tail and thin tail operators.

https://doi.org/10.1371/journal.pone.0159496.g012

thumbnail
Fig 13. Results from the circle joint distribution with skew, fat tail and thin tail operators.

https://doi.org/10.1371/journal.pone.0159496.g013

Finally, we applied the proposed method to generating synthetic patient population for simulating emergency department of a large hospital in Korea. Reference joint samples are obtained from more than 100K patient visit records during the year of 2013. Since the purpose of simulation analysis is to evaluate layout, patient admission policy, and staffing of the emergency department, we need to generate synthetic patient population which reflects the forecasted change in marginal distribution of patient age and severity of disease. As reported in [23], South Korea is one of the most rapidly aging society. ESI (Emergency Severity Index) is a 5-level triage system which classifies patients from level 1 (most urgent) to level 5 (least urgent) by both acuity and resource needs [24]. In this simulation study, the hospital wanted to analyze the capability of its emergency department in various situations, and hence the distribution of ESI was varied. Fig 14 shows the reference patient population in 2013 and synthetic populations generated by IPF, QP, and CBJF. In this study, target marginal distribution of age reflects the forecast in Statistics Korea [23], where the portion of elderly people (60+) is predicted to reach 42% in 2025 (from 31% in the reference population of 2013). In order to see the effect of reducing the patient concentration at ESI level 3, the distribution of ESI is changed from (2%, 11%, 78%, 6%, 3%) to (8%, 12%, 45%, 20%, 15%). As shown in Table 8, CBJF outperforms IPF and QP in preserving all dependency measures used in this study.

Summary and Concluding Remarks

The joint fitting problem turned out to be a natural application area of the copula concept so as to preserve the dependence structure of the reference distribution. In this paper, a novel method based on the copula concept, called CBJF, was proposed. Although IPF has long been used in a wide range of applications and studied with mathematical rigor, it is not a silver bullet for joint fitting problems including synthetic population generation. From the numerical tests, it was found that CBJF is superior to IPF or QP methods in most cases for the dependence structure preservation. Furthermore, CBJF is computationally efficient since it does not require iterative procedure. Also, its robustness is a significant advantage of CBJF as it does not need to consider the convergence problem or zero cells.

A disadvantage of CBJF is that it requires caution when X and/or Y are categorical variables, such as gender or ethnic groups. In such cases, CBJF result is affected by the ordering of the rows (or columns) when there is no definite natural ordering of the attribute values. In its present form, CBJF can be applied when the attribute values have a natural ordering, such as age, annual income, number of vehicles, or location coordinates. Handling categorical variables is a topic requiring further research. When categorical target variables are mixed with ordinal target variables, IPF may be applied first to the categorical variables and then CBJF can work on the remaining ordinal variables.

In many cases of agent-based simulation applications, micro-samples (such as PUMS) from the reference population may not be available mainly because of cost of survey or privacy issues. There are recent researches on generating synthetic population without micro-samples [2527]. We believe the concept of CBJF can be applied even when micro-samples are not available, however the details depend of the available information and require further research.

Supporting Information

S2 Fig. Experiments on various changes on marginal distributions.

https://doi.org/10.1371/journal.pone.0159496.s002

(PDF)

Author Contributions

  1. Conceived and designed the experiments: HS.
  2. Performed the experiments: BJ WL.
  3. Analyzed the data: BJ WL DK.
  4. Wrote the paper: HS BJ.

References

  1. 1. Zhou S, Chen D, Cai W, Luo L, Low M, Tian F, et al. (2010), “Crowd modeling and simulation technologies”, ACM Transactions on Modeling and Computer Simulation, 22(4), Article 20
  2. 2. MATSim-T (2010) “Multi Agent Transportation Simulation Toolkit”. Available: http://www.matsim.org
  3. 3. UrbanSim (2010) “Open Platform for Urban Simulation”. Available: http://www.urbansim.org
  4. 4. Muller K, Axhausen KW (2011), “Population Synthesis for Microsimulation: State of the Art”, Proceedings of the 90th Annual Meeting of the Transportation Research Board, Washington DC
  5. 5. Beckman RJ, Baggerly KA, McKay MD (1996), “Creating synthetic baseline populations”, Transportation Research Part A: Policy and Practice, 30 (6) 415–429
  6. 6. Bowman J L (2009), “Population synthesizers”, Traffic Engineering and Control, 49 (9) 342
  7. 7. Pinjari AR, Bhat CR, Hensher DA (2009), “Residential Self-Selection Effects in an Activity Time-use Behavior Model”, Transportation Research Part B, 43(7), 729–748
  8. 8. Rana TA, Sikder , Pinjari AR (2010), “Copula-Based Method for Addressing Endogeneity in Models of Severity of Traffic Crash Injuries”, Transportation Research Record, 2147, 75–87
  9. 9. Kao SC, Kim HK, Liu C, Cui X, Budhendra LB (2012), “Dependence-Preserving Approach to Synthesizing Household Characteristics”, Transportation Research Record, 2302, 192–200
  10. 10. Borgomeo E, Pflug G, Hall JW, Hochrainer-Stigler S (2015), “Assessing water resource system vulnerability to unprecedented hydrological drought using copulas to characterize drought duration and deficit”, Water Resource Research, 51(11), 8927–8948
  11. 11. Stevens SS (1946), “On the Theory of Scales of Measurement”, Science, 103 (2684), 677–680
  12. 12. Deming WE, Stephan FF (1940), “On the least squares adjustment of a sampled frequency table when the expected marginal totals are known”, Annals of Mathematical Statistics, 11 (4), 427–444.
  13. 13. Ireland CT, Kullback S (1968), “Contingency tables with given marginals”, Biometrika, 55, 179–188 pmid:5661043
  14. 14. Wong DWS (1992), “The reliability of using the iterative proportional fitting procedure”, The Professional Geographer, 44(3), 340–348
  15. 15. Pukelsheim F, Simeone B (2009), “On the iterative proportional fitting procedure: Structure of accumulation points and L1-error analysis”, Preprint, Available: http://opus.bibliothek.uni-augsburg.de/volltexte/2009/1368/pdf/mpreprint_09_005.pdf
  16. 16. Kalantari B, Lari I, Ricca F, Simeone B (2008), “On the complexity of general matrix scaling and entropy minimization via the RAS algorithm”, Mathematical Programming, 112, 371–401
  17. 17. Sklar A (1959), “Fonctions de répartition à n dimensions et leurs marges”, Publications de l’Institut de Statistique de l'Université de Paris, 8, 229–231
  18. 18. Nelsen RB (2006), An Introduction to Copulas, 2nd ed, Springer-Verlag
  19. 19. Darsow WF, Nguyen B, Olsen ET (1992), “Copulas and Markov processes”, Illinois Journal of Mathematics, 36, 600–642
  20. 20. Sklar A (1996), “Random variables, distribution functions, and copulas–A personal look backward and forward”, IMS Lecture Notes–Monograph Series, 28, 1–14
  21. 21. Bertsimas D, Tsitsiklis JN, Introduction to Linear Optimization, Athena Scientific, 17–20
  22. 22. Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, et al. (2011), “Detecting Novel Associations in Large Data Sets”, Science, 334, 1518–1524 pmid:22174245
  23. 23. Statistics Korea (2016) "Future population prediction of South Korea 2010–2060". Available: http://kostat.go.kr. Accessed May 2016.
  24. 24. Gilboy N, Tanabe T, Travers D, Rosenau AM (2011), “Emergency Severity Index (ESI) Version 4”, AHRQ Publication No. 12–0014, Agency for Healthcare Research and Quality
  25. 25. Gargiulo F, Ternes S, Huet S, Deffuant G (2010), “An Iterative Approach for Generating Statistically Realistic Populations of Households”, PLoS ONE, 5(1), 1–9
  26. 26. Barthelemy J, Toint PL (2013), “Synthetic population generation without a sample”, Transportation Science, 47(2), 266–279
  27. 27. Lenormand M, Deffuant G (2013), “Generating a Synthetic Population of Individuals in Households: Sample-Free vs Sample-Based Methods”, Journal of Artificial Societies and Social Simulation, 16(4)