Lexis Diagram and Illness-Death Model: Simulating Populations in Chronic Disease Epidemiology

Chronic diseases impose a tremendous global health problem of the 21st century. Epidemiological and public health models help to gain insight into the distribution and burden of chronic diseases. Moreover, the models may help to plan appropriate interventions against risk factors. To provide accurate results, models often need to take into account three different time-scales: calendar time, age, and duration since the onset of the disease. Incidence and mortality often change with age and calendar time. In many diseases such as, for example, diabetes and dementia, the mortality of the diseased persons additionally depends on the duration of the disease. The aim of this work is to describe an algorithm and a flexible software framework for the simulation of populations moving in an illness-death model that describes the epidemiology of a chronic disease in the face of the different times-scales. We set up a discrete event simulation in continuous time involving competing risks using the freely available statistical software R. Relevant events are birth, the onset (or diagnosis) of the disease and death with or without the disease. The Lexis diagram keeps track of the different time-scales. Input data are birth rates, incidence and mortality rates, which can be given as numerical values on a grid. The algorithm manages the complex interplay between the rates and the different time-scales. As a result, for each subject in the simulated population, the algorithm provides the calendar time of birth, the age of onset of the disease (if the subject contracts the disease) and the age at death. By this means, the impact of interventions may be estimated and compared.


Introduction
Chronic diseases impose a tremendous global health problem of the 21st century. The World Health Organization estimates that 63% of all deaths in 2008 were caused by chronic diseases [1]. Besides taking measures in politics and society, research efforts are needed to oppose this threat. In studying the characteristics of chronic diseases from a public health perspective, it is often important to consider different time-scales [2]. The age of the subjects in a population is a risk factor for many diseases. Changes in life-style and medical care influence the risk of contracting and dying from the disease on a secular scale. Thus, the incidence rate of the chronic disease as well as the mortality rates (with and without the disease) depend on the calendar time. Moreover, the mortality of people with the disease often depends on the duration since its onset. Examples are diabetes [3,4], dementia [5], depression [6], and systemic lupus erythematosus [7]. For decision-makers all time-scales may be important.
As a hypothetical example, consider the question which health programme to choose from two possibilities A and B if the outcome of interest is the gain of life-years. Possibility A is known to decrease the incidence of the disease by 15%, and possibility B lowers the mortality of those having the disease for more than ten years by 50%. The decision depends on several factors. If, on the one hand, the incidence rate is non-zero for children only and the birth rate in the population is low, possibility A may have little effect with respect to the gain of life-years. If, on the other hand, the chronic disease has very few people reaching ten-year survival after onset, programme B can be nearly useless. In problems similar to the example, the decision-maker may face a complex interplay of epidemiological and demographical considerations.
The aim of this work is to describe an algorithm and a flexible software framework for simulation of populations moving in a multi-state model (illness-death model) that describes a chronic disease. The simulation takes into account the different timescales: calendar time, age, and duration of the disease. Although simulations using multi-state models are subject to recent textbooks [8], to our knowledge no algorithm has been described that incorporates the effects of all the different time-scales.

Methods
A popular framework for studying irreversible diseases is the illness-death model (IDM) consisting of the three states Normal, Disease and Death, [9][10][11]. The associated transition rates, synonymously densities (in units ''per person-time'', not to be confused with risks or probabilities [12]), are the incidence i, and the mortality rates m 0 and m 1 (Figure 1). In general, these rates depend on calendar time (t), age (a) and in case of m 1 on the duration of the disease (d).
This article presents a method for simulating populations moving in the IDM. The motivation for the algorithms comes from analytical epidemiology where relations between common epidemiological measures are studied. Examples for those measures are the prevalence, the duration of a disease, the age of onset (or diagnosis), and lost life-years (due to the disease). A typical question may be: what is the mean age of diagnosis of subjects born in a certain time period? What is their mean age at death? Another interesting aim is the estimation of the incidence rate i from cross-sectional information. At a specific point in time t', each of the subjects j~1, . . . ,n, has a unique ''status''. Neglecting those who are unborn or dead at t', the status in the IDM is either normal (non-diseased) or diseased. Thus, the status can be seen as a binary random variable, and data of this kind are typically called current status data [13]. The current status is closely linked with the incidence i and the mortalities m 0 and m 1 before t'. Estimating the incidence from current status data, for example, has been a topic in research for decades [14]. The framework presented here may be useful in this field.

Overview of the simulation algorithm
The simulation is a microsimulation, i.e., it treats each person in the population as an autonomous unit. For each person, indexed j, j~1, . . . ,n, the relevant events diagnosis and death are simulated. This is accomplished in two steps: 1. Contracting the disease or dying without the disease is modelled as competing risk [11]. Given the time t (j) 0 of birth of person j, the cumulative distribution function F (j) 1 of the first failure time T (j) 1 is The term first failure time T (j) 1 refers to the time of diagnosis or death without disease and is measured in time units after birth of person j. Thus, T (j) 1 is the age at which the first transition from the state Normal occurs. Given that a transition occurs at T (j) 1 for person j, then the odds of moving into state Disease versus moving into state Death is 2. If the event at T (j) 1 is the death (without the disease), the simulation for person j is finished. If, however, the event is the diagnosis of the disease, the ''second failure time'' T (j) 2 to death (with disease) has the distribution function F (j) 2 : The next section describes in detail how the integrals in calculating the integrals, the question arises how the times T 1 and T 2 can be obtained from F 1 and F 2 . This is done by the inverse transform sampling method: Let F be a cumulative distribution function and u[(0,1). For F {1 (u) :~inf fxDF (x) §ug it holds: If U is a uniform random variable on (0,1), then F {1 (U) follows the distribution F . Thus, the simulation of T 1 and T 2 is easy, if a random number generator for U such as runif in R is available. For each of the n persons in the population we store four pieces of data: 1. a unique identifier j, 2. the date t (j) 0 of birth (dob) of person j, 3. the age at diagnosis (adi) of person j, and 4. the age at death (ade) of person j: If the person j does not contract the disease, the age at diagnosis adi is set to NA (missing). In summary, we get the Algorithm Calculating line integrals for these integrals exist, is straightforward. The first simulation in the next section is an example. However, in real world applications, analytical expressions for the integrals are rarely given. For convenience, mathematical functions (e.g., splines) may be fitted to the data and integration is accomplished with the fitted functions. Since the aim of this work is a flexible way of treating the incidence and mortality rates, we assume that the rates are given as numerical values on a regular grid. Here we focus on the most general case, which is characterized by: 1. None of the time-scales t,a, and d is negligible, and 2. the values of i,m 0 ,m 1 are given as data points only.
For many chronic diseases, we think this is the most relevant case: The mortality rates m 0 and m 1 depend on a and t. Since age is a risk factor for many diseases, the dependency on age is Equations (1) and (2) (1) obvious. Healthier life-style and medical progress in many countries lead to secular trends in m 0 and m 1 . In addition, disease duration d is likely to have an impact on m 1 in many chronic diseases. Thus, none of the time-scales is negligible, which is covered in the second and third example.
In the most general case, the integrands i,m 0 , and m 1 are given by data points only. We assume that the numerical values are located on a regular grid. The grid is two-dimensional in case of i and m 0 , which depend on two time-scales t and a; and the grid is three-dimensional in case of m 1 which depends on t,a, and d: In event history analysis [2], a useful concept is the Lexis diagram, which is a co-ordinate system with axes calendar time t (abscissa) and age a (ordinate). The t-dimension sometimes is referred to as period. Each subject is represented by a line segment from time and age at entry to time and age at exit. Entry and exit may be birth and death, respectively, or entry and exit in a epidemiological study or clinical trial. There are excellent and extensive introductions about the theory of Lexis diagrams (see for example [9,15,16] and references therein), which allows us to be short here. In irreversible diseases, the common two-dimensional Lexis diagram with axes in tand a-direction may be generalized to a three-dimensional co-ordinate system with disease duration d represented by the applicate (i.e., the z-axis). If a subject does not contract the disease during lifetime, the life line remains in the t-aplane parallel to the line bisecting abscissa and ordinate. In other words, the life line for the time without disease is parallel to e 1 :~(1,1,0) (where the triple (t,a,d) denotes the co-ordinates in time, age and duration direction, respectively). However if at a certain point in time E the disease is diagnosed, the life line changes its direction, henceforth runs parallel to e 2 :~(1,1,1).
The situation is illustrated in Figure 2. The life lines of two subjects are shown in the three-dimensional Lexis space. At birth (denoted B n ,n~1,2) both subjects are disease-free; both life lines are parallel to e 1 : The first subject contracts the disease at E: Henceforth, the life line is parallel to e 2 until death at D 1 . The second subject remains disease-free until death at D 2 .
Having the concept of the Lexis diagram at hand, we observe that F 1 and F 2 Lexis space. We start with calculating the first failure times T 1 : For subject j the associated life line starts at (t,a)~(t (j) 0 ,0). We chose an age vw0, when it is certain that a transition to one of the states Disease or Death has occurred, say v~150 (years). For calculating F (j) 1 , we trace the hypothetical life line from B j :~(t (j) 0 ,0) to D j :~(t (j) 0 zv,v): Thus, the hypothetical life line has a representation Following the life line is related to the method of ray-tracing in the field of computer graphics, where efficient algorithms for this purpose exist. In Siddon's algorithm [17], the key idea is to follow L j by calculating intersections with volume elements (voxels), which form a regular partition of the Lexis space. Let be a parametrization of the points where L j intersects the voxel faces plus the start and end points B j and D j . Details for the calculation of A ? j are described in the supporting information to this article. The parametrization A ? j is ideally suited for approximating the integral in Equation (1) by the trapezoidal rule [18]. The reason lies in the fact that in are a by-product. Algorithm 2 shows the necessary steps. in Equations (1) and (2) are line integrals in the 7: for p~2 to P (j) do 8: 12: end for 13: end for Since the values of i and m 0 are given on the voxel grid only, the calculation of f p ,p~1, . . . ,P (j) , needs bilinear interpolation of the values of the adjacent voxels [19].
After preparing F (j) 1 ,j~1, . . . ,n, the values of the times T (j) 1 can be calculated by the inverse transform sampling method. Since we have F (j) 1 calculated at points f p :~t (j) 0 p ,p~1, . . . ,P (j) , the inverse transform sampling would yield only those f p . A better accuracy can be obtained by (linear) inverse interpolation [18]. For : Then, it holds Thus, for u[(0,1) drawn from a uniform distribution, we find the unique p[f2, . . . ,P (j) g, such that F (j) 1 (f p{1 )ƒuvF (j) 1 (f p ), and set For those subjects j' who contract the disease, the associated F (j') 2 ( : DT (j') 1 ) can be derived in a similar way as in Algorithm 2. The associated line segment starts at (t,a,d)~(t (j') 0 zT (j') 1 ,T (j') 1 ,0). Again, a hypothetical maximal disease duration v' is assumed, say v'~80 (years), such that the line segment ends at (t,a,d)~(t (j') 0 zT (j') 1 zv',T (j') 1 zv',v'). Thus, the line segment is parallel to e 2~( 1,1,1): The Siddon algorithm computes the corresponding set of intersections with the voxel grid accordingly. The ages T (j') 2 at death with disease are obtained from Algorithm 2 mutatis mutandis. The interpolation of m 1 needs to be trilinear.

Examples
This section shows the results of different simulation settings. The associated R [20] code is provided with this article. The first simulation is about a hypothetical chronic disease with rates i,m 0 ,m 1 only depending on a: The corresponding age-specific prevalence can be calculated analytically, which allows crosschecking the results of the simulation. In the second example, another hypothetical disease is treated with mortality rates depending on t,a, and d: Here we use the ray-tracing approach in the Lexis diagram. Again, the outcomes of the simulation are compared with analytical results. The third simulation is about a relatively rare rheumatic disease. A hypothetical birth cohort of 100000 women is followed from birth to death and examined if the disease is diagnosed. Finally, the last simulation demonstrates applicability in the context of medical decision-making. Two interventions are compared with respect to the outcome life-years gained.

<
: one can show that the age-specific prevalence p(a) is given by p(a)~0:005(a{20) z [21]. The notation x z means the positive part of x : x z~m ax(0,x): The integrals of i(a),m 0 (a), and m 1 (a) can easily be expressed analytically. Figure 3 shows the agespecific prevalence in a simulated cohort with n~200000 subjects. For comparison, the true (analytical) prevalence is depicted as a solid line. The result of the simulation agrees very well with the analytical age profile of the prevalence, which indicates the correctness of the implemented R code.

Simulation 2: Mortality depends on duration
The second simulation is about a hypothetical disease with all time-scales t,a and d playing a role. We aim at calculating epidemiological measures that describe the population of the diseased. For example, age of onset, mean duration of the disease and age at death may be important to plan resource allocation (e.g., inpatient facilities).
In each of sixty consecutive years t~0, . . . ,59, 2000 persons are born and followed from birth to death. The incidence of a hypothetical chronic disease is assumed to be i(t,a)~( a{30) z 3000 , the age-specific mortality rate of the non-diseased is chosen to be m 0 (t,a)~exp({10:7z0:1azt ln(0:998)) and the mortality of the diseased is m 1 (t,a,d)~m 0 (t,a) : (0:04(d{5) 2 z1): In total, 42299 of the 120000 simulated persons contract the disease. The simulated data allows the derivation of important epidemiological measures. For example, the histograms of the age of onset and age at death are shown in Figure 4. The median age at death of those who contracted the disease is 77.20 (years) whereas the median age at death of those without the disease is 79.67 (years). The median duration of the disease in the 42299 ill subjects is 12:69, the interquartile range is 7.46-17.60 (years).
Finally, we can cross-check the results of the simulation by comparison with an analytical calculation. In year t~100, exactly 76548 persons are alive, 8802 of those having the hypothetical disease. Figure 5 shows the age-specific prevalence at t~100: The black lines indicate the prevalence of several age groups together with 95% confidence bounds as given by the simulation. The blue line represents the prevalence calculated analytically by the exact formula in ([9], Section 7.2). The results agree quite well within the confidence bounds.

Simulation 3: Systemic lupus erythematosus in the UK
The third simulation is about a hypothetical cohort of 100000 women born in the United Kingdom in 1945. The disease under consideration is systemic lupus erythematosus (SLE), which is an autoimmune disorder with significant morbidity and increased mortality. Women are more often affected than men; in terms of age-standardized incidence, the ratio is 5:1 [22]. The peak of the age-specific incidence in British women is located at the age of about 50. Since Somers et al. could not find a significant secular trend in the age-standardized incidence in 1990-1999, we assume that the incidence does not depend on t, [22]. The aim of this example is to study the interplay between age of onset and the mortality of the diseased women in a realistic setting.  If we mimic a cross-sectional study at year t~100, we obtain the agespecific prevalence as indicated by the black bars (with 95% confidence bounds). For comparison the analytically calculated age-specific prevalence is shown as blue line. doi:10.1371/journal.pone.0106043.g005 Mortality m 0 has been taken from the official life tables of the British Office for National Statistics [23]. The calendar time trend in m 0 has been simplified by assuming that the yearly decrease in mortality is 0:36% for all age groups. This is the geometric mean of the decrease in female mortality taken over all reported age groups in the past 60 years. Mortality m 1 of the women with SLE has been modelled according to [7], which takes into account several covariables: sex, age, duration of SLE, calendar time of diagnosis and geographical region. Unfortunately, an interaction analysis of these factors has not been reported, which forces us to make assumptions. We assume a multiplicative model of the impact of sex, age, region and SLE duration: The constant c reflects the impact of sex and the region, H age and H dur are hazard ratios. The hazard ratio H dur of SLE duration has been interpolated affine-linearly on a logarithmic scale from the published values and backtransformed. The age dependency of H age (a) has been treated similarly. Due to the weak (and possibly insignificant) effect, the impact of the calendar time of diagnosis has been neglected.
From the 100000 women in the simulation, 513 contract SLE. This corresponds to a lifetime risk of about 0.5%. For comparison, a recent publication about women in the US estimated 0.9% [24]. The study [24] included women with African ancestors, who are known to have a higher risk. The median age of onset in the simulation is 46.07 years, which is in nearly perfect agreement with the result of 46.1 years based on another simulation using the same data but a different method [25]. Table 1 shows the interplay between age of diagnosis and duration of SLE in the simulation.
From Table 1 it is apparent that ten-year survival is not easy to reach. Substantial loss of life time is also indicated by the age at death: Median age at death for the diseased and the non-diseased women is 61.4 and 77.1 (years), respectively. This indicates a considerable loss of lifetime in the population of the diseased compared to the non-diseased. Fortunately, this situation has changed in the last years with better medical care for SLE patients. Especially the introduction of optimized treatment regimes in the last decade lead to an enormous reduction of mortality [26]. These effects have not been included in the simulation.

Simulation 4: Effectiveness of two interventions
We take Simulation 2 as the basecase and compare the effectiveness of two hypothetical intervention programmes A and B with respect to the total gain of life-years. Assume that the primary prevention programme A gradually lowers the incidence rate for all a §30 in the calendar years t[½0,10 by 15% and remains at the 85% level for tw10. Programme B is assumed to reduce the mortality m 1 of the diseased people gradually by 50% starting after six years with the disease. B could be achieved, for example, by listing a new drug on the formulary. After running the simulation, we find that in total the 120000 simulated persons in the basecase, in intervention A, and intervention B, have 9275008, 9348941, and 9351338 life-years, respectively. Hence, intervention A and B yield about 73900 and 76300 life-years more than the basecase. With respect to the chosen outcome parameter, gain of life-years, programme B turns out to be superior to programme A. Thus, primary prevention by programme A, in this example, is less effective than lowering the mortality of the diseased. This result is hardly predictable without a simulation and demonstrates the usefulness of our algorithm in decision making.
To obtain the empirical distributions of the total gain of lifeyears in both interventions, 5000 random samples of size 120000 have been drawn (with replacement from the simulated 120000 persons) [27]. Figure 6 shows the resulting densities and the superiority of programme B.

Conclusions
This article is about simulating populations in an illness-death model consisting of the three states Normal, Disease, and Death. The relevant events, diagnosis and death, in the most general case depend on three time-scales: calendar time, age and disease duration.
After birth of a subject in the population, two cases may occur: 1. the subject dies without the disease, or 2. the subject contracts the disease and dies with the disease.
Both situations can be represented in the Lexis space, a common tool in event history analysis. In the first case, the life line is solely located in the time-age-plane. In the second case, the life line changes its direction after onset of the disease, which allows to model the duration of the disease. In many diseases, the duration plays an important role for the mortality. Beside systemic lupus erythematosus as treated above, other diseases such as diabetes [3,4], depression [6] and dementia [5] may serve as examples.
In the most general case, the simulation requires to numerically solve line integrals. Therefore, a synthesis between a ray-tracing technique and numerical integration is exploited. The method provides a fast way to follow the individual life lines in the Lexis diagram. Computation time is an issue, because the number of simulated subjects in the population may be large (several thousands). For example, Simulation 4 takes almost nine minutes (525 seconds) on an Intel i3 personal computer with 3.3 GHz and 8 GB RAM. Simulations 1, 2, and 3 take 40, 170 and 85 seconds, respectively.
Beside the areas mentioned above, we think the method may be applicable in following the fields: N In epidemiology, the simulation may be used to study the interplay between characteristics in chronic diseases: prevalence, mean disease duration, age of onset, age at death of the non-diseased and diseased population.
N The algorithms may serve as a test bench for estimation methods. For example, in [21] the age-specific incidence is derived from prevalence data. The simulation may be used to study the performance of this and related methods.
N In health economics, the result of our simulation allow the application of cost weights to each subject of the population. For example, in diabetes it is well-known that disease related costs depend on the duration since onset [28]. For each individual the disease related costs may be calculated at a specific point in time. By summing over all subjects, the total costs may be estimated easily.
N Similarly, for many chronic diseases, the health related quality of life depends on the duration of the disease [29]. By assigning utility weights to each individual and summing them up, the total quality-adjusted life-years (QALY) may be calculated.
The last two points are related to health economic modelling. We think that our algorithm may yield a contribution in that domain, because often health economic models are Markov models. Due to memoryless property of Markov models, the dependency of the relevant outcomes on the duration cannot be included directly ( [30], Sec. 4.2.3). In the field of diabetes, for example, at least six out of ten important economic models are not capable to accurately account for diabetes duration [31]. This may not be necessary in every research question, but if highly accurate results are needed, modelling the disease duration should be considered.