Surveillance of the Incidence of Non-Communicable Diseases (NCDs) with Sparse Resources: A Simulation Study Using Data from a National Diabetes Registry, Denmark, 1995–2004

We propose two new methods to estimate secular trends in the incidence of a chronic disease from a series of prevalence studies and mortality data. One method is a direct inversion formula, the second method is a least squares estimation. Both methods are validated in a simulation study based on data from a diabetes register. The results of the validation show that the proposed methods may be useful in epidemiological settings with sparse resources, where running a register or a series of follow-up studies is difficult or impossible.


Introduction
In 2010, more than 65% of all global deaths were caused by non-communicable diseases (NCDs) such as cancer, cardiovascular and chronic respiratory diseases, diabetes and neurological disorders [1]. Compared to 1990, not only did the absolute number of NCD-attributable deaths rise from 27 to 35 million, but the proportion of deaths attributable to NCDs also rose (1990: 57%).
The enormous and increasing burden of NCDs have attracted the highest political councils, which, for instance, led to a resolution of the United Nations' General Assembly [2]. The resolution demands strengthening of country-level surveillance systems for risk factors, determinants of health and health outcomes. This includes national surveillance of the prevalence and incidence of NCDs. Surveillance of prevalence alone is not always reasonable. The prevalence of a chronic disease may even increase though the incidence remains unchanged. The rise in prevalence may be due to improved survival of prevalent cases, which in turn may be due to improved disease treatment. Thus, surveillance of the incidence of NCDs is necessary as well. Secular trends of NCD incidences are especially important because they indicate changes in the risk profile of the population under consideration. Moreover, surveillance of temporal changes of the incidence may be helpful to judge effectiveness of population-based prevention programs, e.g., tobacco control programs in the US [3].
Common ways to detect secular trends in the incidence are to either performing a series of follow-up studies or to run a register. Both approaches may be expensive and time consuming. Furthermore, they both may lead to a variety of practical problems. As an example, consider a follow-up study where the people contracting the outcome, i.e., the cases, may feel stigmatized. This is likely to lead to a loss of follow-up among the the group of cases, possibly at a higher percentage than in the non-cases. In contrast, a series of cross-sectional studies can sometimes be easier to accomplish-and may lead to more valid estimates. An example of a periodically published summary of cross-sectional studies is the Diabetes Atlas of the International Diabetes Federation [4].
In [5] we described a method in which two cross-sectional studies combined with mortality information can be used to estimate the incidence of a chronic disease. Here we take this approach further and describe how a series of cross-sections can examine possible trends in the incidence of NCDs without the need to run a disease register and without a series of follow-up studies.
The next section introduces the necessary theoretical background for the methods. In the third section the theory is applied to data from a diabetes register in Denmark. The register observed an increasing incidence of diabetes between 1995 and 2004. We show that the trend is detectable using a sequence of prevalence and mortality data. The prevalence data can be obtained by cross-sectional studies and the mortality data may originate from official vital statistics and, for example, from case-control studies. The use of costly and lengthy follow-up studies is not always necessary.

Illness-Death Model
In modeling chronic (irreversible) diseases, the three-state model (compartment model) in Fig  1 is often used. This model is often called the illness-death model (IDM) [6] and dates back to at least 1951 [7]. The role of the IDM in biostatistics can hardly be valued enough. For example, it contributed to the development of the famous Kaplan-Meier estimator in survival and failure time analysis [8].
Let the number of persons in the states Healthy (H) and Ill (I) of the IDM be denoted by S and C, respectively. There are historical reasons for using the terminologies S (susceptibles) and C (cases). The transition intensities (synonym: rates) between the states are: the incidence rate i and the mortality rates m 0 and m 1 of healthy and diseased persons, respectively. These rates and the numbers S and C generally depend on the calendar time t, the age a and in the case of the mortality m 1 also on the duration d of the disease. Although the following derivations hold true if m 1 depends on d [5], for ease of notation we assume hereinafter that m 1 is independent from d.
In [9] we showed how the rates i, m 0 , and m 1 , are related to the age-specific prevalence pðt; aÞ ¼ Cðt;aÞ Sðt;aÞþCðt;aÞ : If we consider only chronic diseases acquired after birth, i.e., p(t, 0) = 0 for all times t, then, for differentiable rates i, m 0 , and m 1 , the age-specific prevalence p(t, a) is the unique solution of a partial differential equation (PDE) and the associated initial condition: pðt; 0Þ ¼ 0: ð2Þ Here, the notation @ x ¼ @ @x ; x 2 ft; ag; has been used. The partial differential Eqs (1) and (3) use the assumption that the persons who immigrate into or emigrate from the considered population have the same disease prevalence as the resident population. If this assumption does not hold true, the partial differential equation becomes slightly more complicated [9].
Typically, the mortality rate m 0 for the healthy persons is not known, but the general mortality m = pm 1 + (1 − p)m 0 , and the relative mortality R ¼ m 1 m 0 are known. Therefore, Eq (1) becomes Note that the fraction on the right-hand side of the PDE Eq (3) is the population attributable risk (PAR), which describes the proportion of deaths attributable to the chronic disease [10].

Direct estimation of the age-specific incidence
Eqs (1) and (3) can be solved for the incidence rate. For p(t, a) 6 ¼ 1 it holds iðt; aÞ ¼ ð@ t þ @ a Þ pðt; aÞ 1 À pðt; aÞ þ pðt; aÞ m 1 ðt; aÞ À m 0 ðt; aÞ ð Þ ¼ ð@ t þ @ a Þ pðt; aÞ 1 À pðt; aÞ þ mðt; aÞ pðt; aÞ Rðt; aÞ À 1 ð Þ pðt; aÞ Rðt; aÞ À 1 ð Þþ1 : These equations have an important consequence: they make it possible to estimate the agespecific incidence rate from two prevalence studies. Consider the situation where the age-specific prevalence p(t, a) is given for two points in time t = t 1 and t = t 2 , where t 1 < t 2 . Assume further that the general mortality m and the relative mortality R are given at some time t ? where t 1 < t ? < t 2 . This situation is depicted in Fig 2. Then, Eq (4) allows estimating i(t ? , a) [5]. Because Eq (4) follows directly from solving Eq (3) for i, we call this method the direct estimation of the incidence. In the next section we introduce a new alternative approach to estimate the incidence from two cross-sectional studies.

Least squares estimation of the age-specific incidence
In epidemiological applications the p(t, a) are sampled at discrete ages a k , k = 1, . . ., K. Furthermore, we assume that we know the general mortality rate m and the relative mortality R at some time t ? where t 1 < t ? < t 2 .
Typically, the p(t j , a k ), j = 1, 2, k = 1, . . ., K, are subject to sampling uncertainty. Let σ jk denote the standard error of p(t j , a k ).
For a moment let us assume that we know p at t ? , and that we have a "guess" of the age-specific incidence i (g) (t ? , Á). Then, we can use the PDE Eq (3) to approximate p at t 2 by where h 2 = t 2 − t ? and ¼ : denotes the first order approximation. The values of the partial derivative (@ t + @ a )p are calculated by the right-hand side of the associated Eq (3). Similarly, we can approximate p at t 1 : where h 1 = t ? − t 1 .
As i (g) was based on an initial guess, the calculated values for p(t j , a|i (g) ), j = 1, 2, are likely to deviate from the measured values p(t j , a k ).
Therefore, define the sum of the standardized squared error X 2 (i (g) ) as Then, estimation of the incidence i can be written in terms of a minimization problem: Thus, i is the weighted least squares solution, which minimizes the squared deviation between the estimated and measured p in t 1 and t 2 . Underlying this minimization approach is the idea that the error p(t j , a k ) − p(t j , a k |i (g) ) is approximately normally distributed with a mean of zero and a standard deviation σ jk [11].
So far, we have assumed that we know p at time t ? , which is not the case if we have only data from two cross-sections at t 1 and t 2 . In this case, we can approximate p(t ? , a) using pðt ? ; aÞ ¼ : By applying the direct or the least squares estimators of the incidence consecutively for a series of cross-sections at t 1 < t 2 < Á Á Á < t n , we are able to estimate trends in the incidence. This is shown in the next section.

Validation of the Method: Diabetes in Denmark
Based on data from the Danish Diabetes Register, we first simulate a sequence of cross-sectional studies for the years 1995-2004. Next, we estimate the age-specific incidence rates by repetitively applying the methods described above to the simulated prevalence data. We assume that the general mortality in the population (m) and the relative mortality R ¼ m 1 m 0 are known. Finally, the incidence rate used as input for the simulation-we will call it the "true incidence"-and the estimated incidence are compared in terms of the absolute relative error.

Simulation of a series of cross-sections
The data used for the simulation are based on a complete survey of physician diagnosed diabetes (type 1 and type 2 diabetes) in the Danish population (n > 5 million) [12]. For this validation task we confine ourselves to the male population aged ! 30. We do not use individuallevel data, only (aggregated) rate and prevalence data as published in [12]. The starting point is the prevalence of diabetes on January 1, 1995. Due to the imposed age restriction, the initial condition Eq (2) changes to: where p 0 is the age-specific prevalence in 1995. The incidence rate for 2004 is described in detail for all age groups in the article by Carstensen et al. [12]. For the years 1995-2003 the authors report a yearly increase of 5.3% per year (relative to the rate in 2004). This means that in 2003, the incidence rate was 95.0% (1/1.053) of the rate in 2004. The growth rate of 5.3% per year is an average across all age groups. Similarly, the mortality rates m 0 and m 1 of the non-diabetic and diabetic male population, respectively, are reported for 2004. The average reductions for m 0 and m 1 are 2.5% and 3.9% per year, respectively. In simulating the prevalence, we approximate the rates reported in 2004 and assume that the reported annual changes hold for all ages ! 30. This is necessary because the age-specific change rates are not reported. Further details are provided in S1 Text.
Having the initial condition and the rates i, m 0 and m 1 at hand, we solve the PDE Eq (1) with initial condition Eq (9) by numerical integration. This is performed by a Runge-Kutta method (routine rk4 in the deSolve package for the statistical software R). As result we obtain the age-specific prevalences p (1996, a), p(1997, a), . . ., p(2004, a).

Estimating the incidence from the cross-sections
After calculating the age course of the prevalences in the years 1996-2004, we estimate the mid-year age-specific incidence (denoted t = 1995.5, 1996.5, . . ., 2003.5) by successively applying Eqs (4) and (8) to pairs of consecutive years. For example, the incidence for t = 2003.5 is estimated by the age-specific prevalences in 2003 and 2004.
The mortality data (m, R) necessary for the application of Eqs (4) and (8) are assumed to be known for the validation study. We can calculate the relative mortality R by R ¼ m 1 m 0 and the mortality m of the general population by m = pm 1 + (1 − p)m 0 , and use these as the input for the algorithms. Fig 4 shows the estimated age-specific incidence for t = 1995.5 compared to the "true incidence", which was used as input for the simulation. Both methods, direct estimation and least squares estimation (with σ jk = 1), yield the same results. Visually, the curves are in nearly perfect agreement.
To quantify the deviation between the "true" and the estimated incidence, we calculate the absolute relative error. For each of the yearly estimates from 1995.5 to 2003.5, the median and maximum relative error over all ages a = 30.5, 31.5, . . ., 99.5 are calculated. The results are shown in Table 1. Note that the units in Table 1 are per 100,000, which means that the overall maximum of the absolute relative error is below 0.9 percent.
If we compare the results of the direct and the least squares method in Table 1, we see that both methods lead to the same relative errors. Indeed, both methods yield very similar (but not equal) estimates for the age-specific incidence. There are two reasons for the similarity. First, there is no sampling uncertainty. As the main difference between the methods is the possible weighting of the prevalences with different sampling uncertainties (measured by σ 2 in Eq (7)), both methods should yield similar results. Second, the prevalence is sampled in steps of one year in the age-and time-dimension. Hence, the impact of the discretization errors of the partial derivative (@ t + @ a )p in Eqs (4), (5) and (6) is small.
The incidence chosen as input data increases by 5.3% per year. Now, we examine whether this trend can be detected by the estimation methods. For illustration, we compare the data for age a = 65.5. Fig 5 shows the estimated incidence and the true incidence for t = 1995.5, 1996.5, . . ., 2003.5. Again, both methods direct and least squares estimates yield the same resultsvisually there is no difference between the two curves.
Hence, we may conclude that the estimation method is capable of accurately detecting the trend in the incidence-just by using the data from the prevalence studies and the mortality.

Estimating the incidence in the presence of sampling error
So far, the sampling errors σ jk were assumed to be equal. This means that each measurement p (t j , a k ) gets the same weight when minimizing X 2 in Eq (7). However, in real surveys the sampling of the age groups may be different. In a representative sample of a population, for instance, the sampling scheme is chosen with respect to the age distribution of the underlying population. This may lead to σ jk being different from each other. An example is given as part from S1 File (TestCase1_A). In this test case we assume age groups of length five years with a decreasing sample size in the higher age groups. The age-specific prevalence is calculated as in the previous section. Then, additive binomial noise is superimposed on the calculated prevalence to mimic sampling uncertainty. The additive binomial noise for year j and age group k is assumed to have a mean of zero and the variance given by: Then, the prevalence distorted by the additive noise serves as an input for the estimation method. To cope with the sampling uncertainty, a bootstrap technique is applied [13]. For each of B = 2000 bootstraps with binomial noise, we apply the direct incidence estimation. If we then plot the estimated incidences, we obtain the result shown in Fig 6. The true incidence (blue line) is compared to the median of the 2000 bootstrap estimates (solid black line) and the 95% confidence bounds (dashed black line).
For more details and further test cases the reader may refer to S2 Text and S1 File.

Discussion and Conclusions
This work presents two novel methods for deriving trends in incidence from a sequence of prevalence studies are presented. The first method, called direct estimation is based on repetitively applying a previously published algorithm [5]. The second method is based on a least squares approach and has not been published previously. Both methods lead to the same results when perfect measurements (i.e., those without sampling uncertainties) are assumed. However, real surveys typically face sampling uncertainties. These uncertainties may be included into the least squares estimation, giving more weight to those measurements that have a higher accuracy. Alternatively, direct estimation can address statistical uncertainties by using resampling techniques as described in the previous paragraph. Both methods, the direct and the least squares estimation, are based on a differential equation, which has shown superiority over other approaches for estimating the incidence from prevalence data [14]. The differential equation used in this work does not depend on assumptions about the transition rates. Hence, the estimation methods may be called non-parametric. The only (implicit) assumption is that the people who immigrate into or emigrate from the considered population have the same prevalence as the resident population. In situations where this assumption is violated, the differential equation must take additional terms for migration into account [9].
A simulation study based on the data from the Danish National Diabetes Register shows that both methods provide very accurate estimates of the age-specific incidence rates for a sequence of cross-sectional studies. This makes it possible to derive secular trends in the incidence. Incidence trends are especially important because they can indicate substantial changes of the risk profile in the population under consideration. With a view to the tremendous effort required to collect incidence data, this method may offer an alternative to the costly follow-up studies that would otherwise be required.
Apart from the sequence of prevalence studies, the new method requires age-and time-specific mortality data. First, we need the overall mortality for the population. Second, we need the relative mortality. While the overall mortality is often available from official vital statistics or death registries, the relative mortality, i.e. the ratio of the mortality rates of healthy and diseased persons, requires evidence from epidemiological studies. For example, case-control studies are capable of estimating the relative mortality [15]. Similarly, nested case-control studies can be used for this purpose [16]. Case-control studies provide an effective opportunity to estimate the relative mortality, because they can be conducted with relatively low effort compared to other study types [10]. Although the partial differential equation the estimation methods are based upon can be formulated in terms of the excess mortality (m 1 − m 0 , see Eq (1)), the equivalent differential Eq (3) in terms of the mortality rate ratio (m 1 /m 0 ) and the overall mortality (m) has practical advantages: Rate ratios are more likely to be generalizable from one population to another than rate differences [17]. Furthermore, estimates of the overall mortality in a population, i.e., the general mortality, are more easily obtainable than disease-specific death rates (such as m 1 ). Note that all mortality rates in this work do not take the cause of death into account. The rates refer to all-cause mortality in a specific subset of the population. For instance, m 0 is the overall mortality rate in the non-diabetic population.
Several extensions and modifications of the methods are possible. Changes in the age sampling can easily be made to the provided R scripts in S1 File. There are no restrictions on high or low age groups. Differences in covariates such as sex or education may be treated by stratification. The inclusion of covariates directly into the differential equations is subject to further research and is beyond the scope of this article.
Although the validation study uses data about diabetes, the methods are applicable to any disease where the underlying illness-death model in Fig 1 is able to capture important aspects of the disease. These comprise all diseases with incomplete cures, i.e., an elevated mortality once having contracted the disease. Examples are cardiovascular disease, cancer and rheumatic diseases. The model may be useful even in infectious diseases with incomplete cures, such as HIV.
A strength of the method proposed in this manuscript is that it does not require individual data, for example, the time of onset of a chronic disease. If these additional data are available, e.g., from electronic health records, other methods are preferable as summarized in the review of Keiding [18].
Although the treatment and impact of uncertainty in the prevalence data is discussed extensively in this article, a limitation may be seen in the fact that we have not considered the effect of uncertainty or errors in the mortality data. Studying the impact of errors in the mortality data and their interplay with errors in the prevalence data on the incidence estimates is subject to future work.
A typical application of the method is the conversion of a sequence of cross-sectional surveys that aimed to examine the age-specific prevalence of a chronic disease (or another irreversible health related state) into a sequence of incidence data. The method requires relatively low effort and therefore may be useful in settings with sparse resources.
Supporting Information S1 Text. Detailed description of generating the input data for the validation. (DOC) S2 Text. Description of the source code. To allow independent validation of the results of this article and to advocate the reproducibility of algorithms and computational experiments, the scripts that produced the results of this article are available as supporting information. In addition, scripts for generating test data and making estimates from a sequence of prevalence studies are provided. All scripts can be used with the free statistical software R (The R Foundation for Statistical Computing). (DOC) S1 File. Source code. The zip file contains the source files as described in S2 Text. After unzipping please refer to the readme.txt file. (ZIP)