Calculating Stage Duration Statistics in Multistage Diseases

Many human diseases are characterized by multiple stages of progression. While the typical sequence of disease progression can be identified, there may be large individual variations among patients. Identifying mean stage durations and their variations is critical for statistical hypothesis testing needed to determine if treatment is having a significant effect on the progression, or if a new therapy is showing a delay of progression through a multistage disease. In this paper we focus on two methods for extracting stage duration statistics from longitudinal datasets: an extension of the linear regression technique, and a counting algorithm. Both are non-iterative, non-parametric and computationally cheap methods, which makes them invaluable tools for studying the epidemiology of diseases, with a goal of identifying different patterns of progression by using bioinformatics methodologies. Here we show that the regression method performs well for calculating the mean stage durations under a wide variety of assumptions, however, its generalization to variance calculations fails under realistic assumptions about the data collection procedure. On the other hand, the counting method yields reliable estimations for both means and variances of stage durations. Applications to Alzheimer disease progression are discussed.


Properties of the counting method
Here we study mathematical properties of the counting method of calculating the averages and standard deviations of stage durations. We will suppose that the stage durations of the patients come from a probability distribution, P (x), with the mean µ and the standard deviation σ. Our method attempts to provide an approximation for quantities P (x), µ and σ.
Let us denote by n the total number of data points in a sample. To create a dataset, we pick n values (t k ) n k=1 according to some distribution on the support of P ′ (x), (t min , t max ). In what follows we will assume for simplicity that t min = 0 and that t max < ∞. The values (t k ) n k=1 are times of patients' visits. Then we draw values (T k ) n k=1 from the distribution P (x). These are actual stage durations of the patients. The data points are pairs of numbers, (t k , θ k ) n k=1 where θ k = 1, t k > T k , 0, otherwise.
The task of the estimator is to extract quantities P (x), µ and σ from the set (t k , θ k ) n k=1 . This can be compared with the same quantities estimated by standard means from the dataset (T k ) n k=1 of the actual stage durations, which are unknown in reality.
The counting method splits the interval (t min , t max ) into N subintervals, (I i ) N i=1 . Let us denote the length of subintervals by ∆t i , and the number of datapoints in the subintervals as j i . There are various possibilities for such splitting, for example, (a) the subintervals can be chosen to have equal lengths, ∆t i = t max /N , or (b) the subintervals can be chosen to have equal number of datapoints, j i = n/N . The approximation for the probability distribution is then calculated as follows, Note that no requirements of monotonicity is imposed on this probability distribution estimator. However, as we will see, under many realistic circumstances the estimator converges to the true cumulative probability distribution. The approximations for the mean µ and the standard deviation of the probability distribution are given by These estimates are derived from a discretization (by the trapezoid rule) of the following integrals, where we performed integration by parts and used P (t max ) = 1. Below we will study unbiasedness and consistency of the counting method. For this, we need to define the averaging process and the limiting process as the sample size increases. For the averaging procedure, we will assume that the points (t k ) n k=1 are fixed, and sample the probability distribution P (x) to create sets (θ k ) n k=1 . For the limiting process where n → ∞, for each n we will consider sets T n = (t k ) n k=1 such that T n ⊂ T n+1 for all n. Let us examine approximation (1) in several scenarios.
One data-point per interval. In this case (falling into category (b) above), we have j i = 1 andt i = t i , and The expectation of this value is by definition of the probability distribution P (x). Therefore, we conclude that estimator (1) is unbiased in the sense that its expectation is a discretization of the underlying probability distribution. The expectation of estimators (2) and (3) is an approximation to the true value µ and σ 2 respectively, obtained by replacing the integral with a trapezoid finite sum. Therefore, estimators (2) and (3) are biased. Note however that the bias is controlled tightly by the sample size, and is only as "bad" as a trapezoid rule is at approximating a continuous integral.
In the limit of a large number of data points we have that is, estimator (1) is not consistent. However, if we apply the obtained estimate for the probability distribution to calculations of µ and σ, equations (2) and (3), we will get a consistent estimate. To see that, let us assume for simplicity that the probability distribution for time-points t k is uniform. Let us split the interval (0, t max ) into √ n intervals, each containing on average √ n points. As n → ∞, the length of each interval goes to zero, and the number of time-points inside each interval tends to infinity. The sum N i=1 θ i that appears in expression (2) can be split into √ n parts, each containing √ n terms. For each such group, the function θ i is evaluated at points within t max /n of each other. If t j is the average value of the time-points within group j, then the corresponding sum tends to P (t j ) as n → ∞. Since the trapezoid discretization tends to the integral value as the mesh gets finer, we conclude that the estimates (2) and (3) tend to the true values µ and σ 2 in the limit of large n. Thus estimates (2) and (3) are consistent.
Many data-points per interval. To examine the unbiasedness of estimator (1) in this case, again we consider the expectation which is not equal to P (t i ). Therefore, this estimate is biased. Estimators (2) and (3) are therefore also biased. Next, let us take the limit as n → ∞, while keeping j i = J constant, which amounts to taking N → ∞. To study convergence of estimator (1) in the mean, we consider the limit Let us assume that the probability distribution for the time-points, t k , is such that as n → ∞, max i ∆t i → 0. The limit in (4) is nonzero because the expression under the summation is the difference between P (t i ) (the probability evaluated att i ) and a J-component sum of variables θ k which can only take discrete values 1/J, 2/J, . . . , (J − 1)/N . Therefore we conclude that if the number of points per interval stays fixed during the limiting process, the estimator is inconsistent. If, however, we allow both N → ∞ and J → ∞, the expression 1/J t k ∈Ii θ k converges to P (t i ), and the estimator becomes consistent. Consistency of estimators (2) and (3) follows.
Sample-size dependence. The error of estimators (2) and (3) comes from two sources: (i) the replacement of the probability distribution by finite sample averages of the quantity θ k , which is of the order 1/n, and (ii) the error resulting from the approximation of the integral by the trapezoid rule, which is of the order 1/n 2 and thus is negligible. Therefore, the accuracy of the counting method is of the order O(1/n).
The sample-size dependence of the counting method is illustrated by studying a specific example. We use the Rayleigh distribution with ν = 4 for the distribution for stage lengths, see figure 1. The true values of the mean and standard deviation are given by µ = 2 √ 2π and 4 2 − π/2. We assume that the visit times are distributed uniformly between zero and the maximum stage length value. Let us vary the number of datapoints as 10 k , k ∈ {1, . . . , 6}, and create approximations to the probability distribution, P (x), by using the counting method. We first set j 1 = J = 1, see figure 2. The "approximation" of the distribution function is a two-valued function with values in {0, 1}. It does not converge to the desired shape, see the dashed lines in figure 2. As we showed previously, the counting method with J = 1 is not a consistent estimator when it comes to restoring the underlying probability distribution. Next, we use j i = J = √ n, see figure 3. We can see that although the approximations to the probability distribution are non-monotonic, there are clear signs of convergence, which illustrates the proof of consistency given above. Finally, we calculate the means and standard deviations of the distribution by using the counting method (µ n and σ n ) with the number of datapoints given by 10 k from k = 1 to k = 6 again. The convergence of the estimator is studied in figure 4, where we plot the logarithm of the absolute error for the mean and the standard deviation. Again, we consider two cases. First, we set the number of points per interval to be j i = J = √ n (circles with solid lines) and then we set j i = J = 1 (circles with dashed lines). The convergence speed of estimators (2) and (3) as n → ∞ is compared to that of finite sample means and variance: The values for the errors of µ f.s. We can see that the power laws are similar for the counting method and for the finite-sample estimate, but the multiplicative constants are lower for the finite sample estimate.

Invertibility and error propagation concerns for the regression method
On the surface it appears that with a sufficiently large data set, we could easily form the 2n equations needed to solve for the unknown stage duration means and variances using only equation ( equation (2). However, a study of the matrices generated from these equations for all diseases with 3 or more stages shows them to be extremely ill-conditioned when the means are unknown. Therefore, we use the a priori knowledge of the mean stage durations computed using equation (2), and then apply equation (3) to reconstruct the unknown variances.
We first explore the stability of the matrix of coefficients generated from equations (2) or (3) for each transition class, C M . The least squares method is equivalent to solving the matrix equation: where E[t] is the vector of mean transition times for each transition class and E[T ] is the vector of mean stage durations. The coefficient matrix is typically not square, so the linear system is solved by finding: The ability of our reconstruction method to accurately calculate the means and variances is dependent upon the condition of the matrix T M = (C T M C M ). We therefore study the invertibility of this matrix as a function of the various transition classes used to form it.
As seen in supplementary figure 5, the transition matrix is strongly sensitive to the presence of the self-transitions. Loss of these stages (for example, from insufficient data) will have a strongly deleterious effect on the accuracy of the method. Each loss of a self-transition leads to an approximately factor of two decrease in the determinant of T M . For comparison, removal of all i → i + 2 or greater transitions (6 in a 5-stage disease) leads to an equivalent decrease in  determinant value as does removal of 2 of 5 i → i transitions. Secondly, we note the variance equations matrix is more poorly conditioned than that of the mean equations. This implies that the variance reconstruction method will be more sensitive to error propagation than the mean reconstruction.

A brief review of Kaplan-Meier statistics
For reference, we include a brief description of the well-known Kaplan-Meier (KM) statistic. This statistic attempts to recreate the cumulative distribution function of a distribution of transition times (for example, stage duration times) at each timepoint of observation. Consider a collection of observations for N P patients sorted into chronological order. Like in the case of the counting method, we assume the first observation for each patient occurs at the onset of their current stage, and that each patient eventually has a transition observation (that is, they either move on to the next stage or otherwise stop being observed). Thus at time t = 0 there are N P total patients and N P transition times (note there may be non-transition observations prior to the transition observation for each patient in the set as well). The times of the transition observations are ordered and, for each of the j = 1...N P timepoint, the KM statistic is calculated as: The KM statistic also allows for censoring of the data: if a patient drops out of observation for any reason, at that timepoint they are removed from the sample and the value of N P is decremented by one. After computing the value of S j for each of the N P transition observation times, the cumulative distribution function P (t) is given by P (t j ) = 1 − S j , and the ordered pairs (t j , P (t j )) can be used to compute the parameters of the distribution in question via the equations discussed previously.

Using KM statistics to calculate stage means and standard deviations
The Kaplan Meier (KM) statistic is commonly used to analyze data such as those generated by method 2. To that end, we apply the KM statistic to datasets so generated to compute the cumulative distribution function, from which we can calculate the stage means and standard deviations. Figures 6, 7, 8, and 9 present the results for KM simulations for distribution A (see Table 1 of the main text). In particular, figure 6 shows the mean and figure 7 the standard deviation calculations for average sampling intervals of 1 year (the finer sampling case). Figures 8 and 9 show the mean and standard deviation calculations for average sampling intervals of 2 years (the coarser sampling case). We note immediately that for both of the sampling cases, the KM statistic is strongly biased and introduces an error factor at both patient sizes, both for the mean and standard deviation calculations. While the "spread" of the KM method decreases significantly as the patient number is increased, this bias is not eliminated. We further note that the bias becomes stronger for the coarser sampling case compared to the finer sampling case. It does not decrease for larger patient numbers. Thus, without knowledge of the source of the bias, any distribution statistics calculated via the KM statistic must be considered inaccurate, and a significant overestimation of the true mean and standard deviation values.
In contrast to the KM statistics, our counting method exhibits less bias induced by a coarse sampling methodology which is common in longitudinal studies of long-term diseases such as AD. Figures 10 and 11 show the results for the means and standard deviations extracted by the counting method with stage duration probability distribution A, under data generating method 2 (the fine case). Results for the coarse sampling case are presented in figures 4 and 5 of the main text.