A new linear regression-like residual for survival analysis, with application to genome wide association studies of time-to-event data

In linear regression, a residual measures how far a subject's observation is from expectation; in survival analysis, a subject's Martingale or deviance residual is sometimes interpreted similarly. Here we consider ways in which a linear regression-like interpretation is not appropriate for Martingale and deviance residuals, and we develop a novel time-to-event residual which does have a linear regression-like interpretation. We illustrate the utility of this new residual via simulation of a time-to-event genome-wide association study, motivated by a real study seeking genetic modifiers of Duchenne Muscular Dystrophy. By virtue of its linear regression-like characteristics, our new residual may prove useful in other contexts as well.


Introduction
In this paper we develop a new form of survival analysis residual with linear regression-like (LRL) properties, in a sense to be made clear in what follows. The simplest way to motivate this work is by describing a target application: discovery of genes that modify the Duchenne muscular dystrophy (DMD) phenotype.
DMD is an X-linked recessive disorder affecting � 1 in 5,000 live male births [1,2]. DMD involves progressive muscle tissue loss with replacement by fat and fibrotic tissue, and is currently without a cure. Patients typically become reliant on wheelchairs by early to mid-adolescence, but some maintain ambulation substantially longer, and age at loss of ambulation (LOA) is an important clinical indicator of disease progression. A great deal is known about the gene (DMD) that causes DMD, including the fact that modifier genes influence the rate of disease progression in a DMD mouse model [3,4]; evidence for modifiers exists in humans as well [5][6][7][8]. The discovery of modifier genes in humans has implications both for therapeutics and for the design of DMD clinical trials.
Using data from the United Dystrophinopathy Project, a multisite consortium [9][10][11], we are currently engaged in a search for modifier genes under a Genome-Wide Association Study In this paper we argue that MR and DR do not lend themselves to an LRL interpretation, but that we can modify DR to do so. The paper is organized as follows. In (2) we review MR and DR and discuss ways in which they thwart an LRL interpretation. We introduce a new form of residual, which we call an ordinary time-to-event (OTE) residual, so called because it affords an ordinary LRL interpretation in the context of time-to-event data (TE), and we compare OTE with MR, DR and with an ordinary linear regression residual (OLRR). Then in (3) we assess the comparative behavior of MR, DR, OTE and OLRR when used in conjunction with the PPLD for analysis of TE-GWAS data, using simulations designed to mimic the DMD application.

Various residuals for time-to-event data and their interpretations
Section 2.1 reviews the definitions of MR and DR and their interpretation; 2.2 introduces the OTE residual, and also the form of OLRR we will use for comparisons; 2.3 discusses handling of a covariate; 2.4 then further illustrates key aspects of MR, DR and OTE compared with each other and with OLRR.

MR, DR and Their Interpretation
The MR for individual x i at survival time t surv (x i ) is defined as where δ i = 1 if the individual is uncensored and δ i = 0 if the individual is censored; and CHF ft surv ðx i Þg is the estimated cumulative hazard function evaluated at min[t fail (x i ), t cens (x i )], the individual's failure time and censoring time, respectively. (Eq 1 assumes fixed, rather than time-dependent, covariates [22,23], as we have in our DMD application. Technically MR is an estimate, but for notational convenience we omit the caret over expressions for residuals.) The range of MR is (-1, 1], and asymptotically its expectation is 0. MR(x i ) is sometimes said to represent the observed of events minus the expected # of events for the i th individual over (0, t i ], suggesting an LRL interpretation. Its range, however, is markedly non-LRL. DR is then defined as: ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi Eq 2 "corrects" for the fact that MR is bounded above by 1 but unbounded below. DR maintains the sign of MR, while expanding MR values close to one and contracting large negative MR values to achieve a more symmetric distribution [22]. DR thus behaves more like a residual from linear regression than does MR. We focus here on DR for this reason, but points raised in the remainder of this section apply equally to MR. There are three reasons why DR should not be interpreted in an LRL manner. First, as previously noted, for a censored individual DR is based on CHF ft cens ðx i Þg. However, the information we have regarding the individual's actual failure time is only that t fail (x i ) > t cens (x i ). If we are interested in evaluating CHF ft surv ðx i Þg at a value that lends itself to an LRL interpretation in terms of underlying failure times (as opposed to a value chosen for its utility in Martingale theory), then evaluating the CHF at t cens (x i ) seems an arbitrary and unsatisfactory choice. There is a difference here between the time-to-event scale we seek to preserve and the number-of-events framework of DR.
Second, DR assigns a different value to δ i for uncensored and censored individuals. This makes sense if we define δ i as the observed number of events at time of evaluation. One consequence of this assignment, however, is that two individuals, one censored and the other uncensored, with the same covariate status and the same survival time, are guaranteed to have a difference of 1 in their MR, with a corresponding difference in DR. Whether these two individuals should have different residuals is one question; whether the difference between their MR values should always be exactly equal to 1 is another. (See 2.4 below for a detailed illustration of this issue.) To maintain an LRL interpretation, we would like the residual to reflect only how extreme is the individual's survival time relative to expectation, but DR's dependence on δ i confounds such an interpretation.
The third issue applies equally to uncensored and censored individuals. In linear regression, an individual with observation equal to expectation would be assigned a residual equal to 0. But DR = 0 does not correspond to the situation in which an individual's observed survival time is equal to the expected survival time, even for uncensored individuals; we consider this point in greater detail in 2.2.
In aggregate these issues introduce scaling issues for DR (and MR) that undercut an LRL interpretation. And as we will show below, ignoring the non-LRL qualities of DR has implications for downstream analyses.

OTE and OLRR
OTE is obtained by making three modifications to DR, designed to remedy the three issues raised above. First, for censored individuals, we define the predicted failure time t pred (x i ) as the median time to event conditional on t fail (x i ) > t cens (x i ), which can be calculated from the estimated survival functionŜðtÞ; t pred (x i ) represents the information afforded by censoring time regarding an individual's (expected) failure time. (Our reason for using the median rather than the mean will become clear below.) In the context of OTE, we therefore define t � surv ðx i Þ as t fail (x i ) for uncensored individuals and t pred (x i ) for censored individuals.
Second, we set δ i = δ for all i. Heuristically, this reflects the new definition of t � surv ðx i Þ for censored individuals; we are in effect treating everyone as having had one event (either observed or predicted) at the time of evaluation. More importantly, this choice removes scaling anomalies between uncensored and censored individuals, as we return to below.
Third, we set δ = −log(0.5) for all individuals (uncensored and censored). The rationale for this choice is that CHF(t � ) = −log S(t � ) = −log(0.5) � 0.7, regardless of the shape of S(t), where t � is the median survival time. Thus an uncensored individual with survival time t � is assigned a positive DR, while a censored individual with survival time t � is assigned a negative DR. By setting δ = −log(0.5) we ensure OTE = 0 if and only if t � surv ðx i Þ is the median survival time. This is also our justification for defining t pred (x i ) in terms of the median survival time rather than the predicted mean. The latter depends on the shape of Ŝ ðtÞ, so that there would be no fixed LRL meaning for a residual = 0.
To summarize, OTE is calculated as follows: Ŝ ðtÞ is estimated in the usual way based on t surv (x i ), i.e., t fail (x i ) for uncensored individuals and t cens (x i ) for censored individuals. (For family data, only unrelated individuals would be used for this step.) Ŝ ðtÞ is used to calculate the median survival time t � , and, for each censored individual (including any relatives, in the case of family data), the predicted median survival time t pred (x i ) given t fail (x i ) > t cens (x i ). Ŝ ðtÞ is also used to obtain the corresponding CHF (t). A modified MR (MMR) is then calculated for each individual (including any relatives) as MMR = −log(0.5) − CHFft � surv ðx i Þg. Finally a DR-like transformation is applied to MMR, setting δ = −log(0.5): ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi ffi ffi ffi ffi ffi ffi (See Appendix B for details.) We note that for censored individuals with t cens (x i ) � 0, t pred ðx i Þ � t � , which yields a residual � 0; as t cens (x i ) increases, the corresponding residual decreases. Thus, as with MR and DR, there is no such thing as a positive OTE residual for a censored individual. Of course this produces an asymmetry with respect to uncensored individuals, but this asymmetry is inherent in the available information.
For purposes of comparison, we also consider OLRR, which is computed on the original time scale as would be done in ordinary linear regression: (We center OLRR on the median, rather than the mean, in order to maintain comparability with OTE. OLRR is in the form "expected-observed," where ordinarily we would use "observed-expected." This is again for comparability, since for DR and OTE positive values represent individuals with earlier survival times and negative values represent individuals with later survival times.) OLRR is naïve with respect to the shape of CHF: it represents residuals on the original time scale ignoring the cumulative hazard function, whereas OTE transforms time differences onto a hazard scale. OLRR is therefore not correct for time-to-event data. (OLRR makes partial use of Ŝ ðtÞ, but only insofar as Ŝ ðtÞ is used to obtain t pred (x i ) in the first place.) But because we seek a residual that is OLRR-like in interpretation, OLRR provides a useful point of reference.

Handling the covariate
We restrict our attention here to a single binary covariate y, in keeping with our intended DMD application. In a regression setting (e.g., under the CPH model), a separate CHF is estimated for each level of y, and for the i th individual MR is calculated based on the CHF corresponding to that individual's covariate status; i.e., the covariate "adjustment" is made at the level of estimation of the CHF. The same is true for linear regression, where the residual is calculated relative to a covariate-adjusted conditional mean. Accordingly, DR and OTE take the covariate into account by estimating separate survival curves based on y; similarly, OLRR subtracts t � surv ðx i Þ from the y-specific median. For our purposes we need to place the resulting residuals on the same scale across levels of y, in order to effectively "remove" the covariate effect prior to downstream analyses. Therefore, before combining residuals across levels of y, they are standardized separately for each level of y, using

Additional contrasts among DR, OTE and OLRR
Our central claim is that the modifications to DR that lead to OTE change the meaning of the residual to a quantity that accurately reflects how extreme is an individual's survival time relative to expectation (i.e., the median), on a hazard scale, after adjusting for covariate effects. Recall that OTE and OLRR depend only on the (observed or predicted) survival time, without further distinguishing uncensored from censored individuals. Hence each is represented as a single line {Fig 1A}. We see that OTE preserves the same rank-ordering as OLRR. The differences between the two reflect the fact that, while OLRR is strictly linear in t, OTE correctly incorporates information from the hazard function. Note too that t � ¼ 11:69 coincides with the intercept with y = 0 for both residuals. DR, however, not only differs in value from OTE (and OLRR) for given t, it also disrupts the rank-ordering across values for uncensored and censored individuals {Fig 1B}. Note too the slightly different placement of the intercept with y = 0, which for DR occurs at t = λ = 12.89.

PLOS ONE
New time-to-event residual with linear regression-like interpretation 10, which for this particular CHF yields t fail (x 3 ) = 13.15 and DR(x 3 ) = −0.18. OTE assigns the same value to individuals 1 and 3, who share the same evaluation time (predicted or observed). DR, on the other hand, assigns different values to individuals 1 and 3, despite the fact that all the information we have regarding underlying failure times says that the two are equally extreme relative to the median. By contrast, DR assigns the same value to individuals 1 and 2, despite the fact that individual 1's predicted failure time is less than individual 2's observed failure time. Because of the other small differences between DR and OTE, OTE also assigns a different value to Individual 3 than does DR, with OTE(T 3 ) = −0.50. Note that DR preserves the

PLOS ONE
New time-to-event residual with linear regression-like interpretation rank-ordering of MR; thus, while the residual values will differ between MR and DR, the differences illustrated above in rank-ordering and interpretation with respect to OTE remain for MR. In summary: in contrast to MR or DR, OTE brings the scale for censored individuals into alignment with uncensored individuals, and situates the 0-point so that individuals with observation at the median are assigned OTE = 0. By replacing t cens (x i ) with t pred (x i ) for censored individuals, OTE more accurately represents the failure time information provided by censored individuals. Finally, OTE preserves the OLRR rank order with respect to deviation from the median, in (standardized) units of time-to-event, while incorporating information from CHF (t). It is on this basis that we call OTE an LRL residual for time-to-event data. Table 1 summarizes these conclusions.

Simulation studies
In this section we evaluate the effectiveness of using OTE in the context of our intended genetic application using simulated data. Because our primary interest is in developing a residual for use in downstream KELVIN analyses, and because in reality the underlying genetic model cannot be known in advance of gene discovery, this is a situation in which simulated data can be far more informative than real data for purposes of methods validation. Section 3.1 describes the simulation methods, and 3.2 presents results.

Simulation methods
In order to mimic features of our DMD data set, our base model uses a sample size of N = 500 individuals, half of whom come from each of two covariate levels (y = 1, 2). Individuals were randomly assigned a genotype for a 2-allele locus (with alleles 1, 2) as a function of q = P(allele 1) = 0.5, assuming Hardy Weinberg Equilibrium. (This is in anticipation of genetic association analysis based on SNPs. Assuming q = 0.5 keeps things simple while generating sufficient data for all 3 genotypes; but it is also not unrealistic, since allele frequencies at modifier loci could be quite high in the general population.) Failure time was simulated via a random draw from a mixture of normal distributions in the form N y = 1,k (μ k , σ k ), for given genotype k = 11, 12, 22 and y = 1, and in the form N y = 2,k (μ k + 3, σ k ) for y = 2. We also considered one model with a more complicated covariate effect (see below). The generating parameter values were chosen to mimic LOA in the real DMD data set for uncensored individuals with no history of steroid use (x � ¼ 11:6, s. d. = 3.4). All normal

PLOS ONE
New time-to-event residual with linear regression-like interpretation distributions were left-truncated at 0 in order to preclude non-positive age at event. (While the normal distribution was chosen for convenience, the resulting distribution was effectively Weibull; see SI.) The genetic model was varied as shown in Table 2. Model 1 has no genotypic effect, so that the generating model is a single (non-mixed) normal distribution. Model 2 creates a simple additive mixture model for age-at-event while maintaining comparable x � and s. d. at the population level. The remaining models vary effect size by increasing the genotypic variances (Model 3), introducing dominance (Models 4, 5), and by generating genotypic effects on variances as well as means (Model 6) or solely on variances (Model 7). Model 8 complicates the covariate effect. The models were chosen to illustrate a range of possible trait distributions, and are by no means intended to exhaustively cover what we might find in a real application. For each generating model, an individual was simulated based on a random draw of t e (x i ) from the corresponding age-at-event (AE) distribution and an independent random draw of t o (x i ) from an age-at-observation (AO) distribution. If t e (x i ) < t o (x i ), the individual was considered uncensored with failure time t fail (x i ) = t e (x i ); otherwise, the individual was considered censored with censoring time t cens (x i ) = t o (x i ). AO was simulated under a negative binomial distribution with r = 10, p = 0.4 in order to roughly mimic the censoring distribution in the real data, with � 40% of individuals censored. We simulated only unrelated individuals.
For each simulated data set, we obtained Ŝ y (t) and CHF y (t), as a function of age t, via maximum likelihood estimation of a pair of 2-parameter Weibull distributions, one for each level of y, based on t fail (x i ) for uncensored individuals and t cens (x i ) for censored individuals. We verified visually that the Weibull distribution provided a reasonable fit to the simulated data in comparison to Kaplan Meier curves. We simulated 1,000 replicates per model for Models 2-8; for the "null" Model 1, we simulated 1,000,000 replicates. PPLD-RES (RES = MR, DR, OTE or OLRR) was calculated for each replicate, using the residuals as data, and distributions of the PPLD across replicates were obtained for each model.
Recall that the PPLD does not require normality or even symmetry of the trait distribution. Therefore the distribution of the residuals is not our primary interest here. However, additional information on these distributions is given in Appendix C. Table 3 shows summary statistics of the PPLD distributions for the 8 models shown in Table 2.

Simulation results
(Recall that PPLD is on the probability scale, with values < 0.0004 indicating evidence against association and values > 0.0004 indicating (some degree of) evidence in favor of association.) For the "null" (non-genetic) Model 1, we see that all four forms of residual return very small PPLDs on average, below the prior of 0.0004, indicating evidence against association, with very little variability across replicates. For the remaining models, and based on Table 3 alone, we might conclude that PPLD-OTE and PPLD-OLRR are virtually identical in their behavior; with PPLD-DR also working reasonably well for the models considered here, although with a lower mean in general, and even worse performance for MR. Recall too that MR and DR preserve rank-order with respect to one another, so that performance differences are due to scaling differences. This underscores the importance of having a residual that maintains appropriate scaling with respect to the scientific hypothesis, in our case, regarding time to event.

PLOS ONE
New time-to-event residual with linear regression-like interpretation with dissimilar standard deviations (Model 7). As can be seen, models for which the PPLD means are similar to one another can have different distributions depending on the form of residual, with the chances of obtaining a large PPLD in general higher for PPLD-OTE compared to PPLD-MR or PPLD-DR. In addition, pairwise correlation coefficients between PPLD-MR or PPLD-DR and PPLD-OTE across replicates can be quite low, as shown in Table 4; even PPLD-OLRR and PPLD-OTE show lower correlations for some models. Thus in any given data set, depending upon the underlying genetic model (which is in practice always unknown), the PPLD can vary substantially depending on which form of residual has been used as data, even when the sampling distributions of the PPLD appear similar. This can impact rank-ordering of SNPs and lead to follow-up of different genes.

Discussion
We have developed a novel OTE residual for use in conjunction with time-to-event data, and we have argued that OTE has an interpretation much like that of an ordinary linear regression residual. OTE was designed to satisfy certain requirements of scaling and interpretation. It is notable, then, that in operational terms (Table 3 and Fig 3) it out-performs the alternatives considered here in application to simulated TE-GWAS data. In general, a statistical method that retains more information and/or uses that information more accurately ought to, on average, outperform methods that rely on assumptions violated by the data or that distort the scaling of the data. Thus the superior operational characteristics of OTE, over a range of genetic generating models and across a large number of replicates, is indirect confirmation that OTE does indeed do what we intended it to do.
When designing and evaluating test statistics, operational behavior is important. But in measuring the strength of evidence for or against an hypothesis in any given data set, what counts is not just operational effectiveness, but more importantly, how accurately and efficiently the statistical method utilizes the relevant information in the data at hand. In our actual study of DMD, we have at present only the one data set to work with, and we need to know that our chosen outcome measure represents the evidence as accurately as possible in this particular data set. Key to this objective is ensuring meaningfulness for the measurement scale on which data are represented. The strikingly low correlations (Table 4) for some generating models between PPLD-MR or PPLD-DR and PPLD-OTE underscores this issue. If we are to take results in any one data set at face value, we must ensure up front that the form of residual we use accurately and reliably captures the information we set out to model: how extreme is an individual's survival time relative to expectation. On this score we hope to have made a convincing case in favor of OTE.
While we have focused on application to TE-GWAS data, OTE could be useful in other contexts as well. In particular, in any setting for which it is desirable to remove covariate effects prior to combining data across covariate levels, these new residuals represent a methodology for so doing that retains a direct connection to the original scale of the dependent variable, simply shifting from survival time to the deviation of survival time from expectation, on a hazard scale, and expressed for each covariate level in units of level-specific standard deviations.
But of course, nothing we have presented guarantees that OTE is optimal in all settings. We have not, for example, considered the handling of more complex covariate structures, including continuous covariates or covariate interactions. OTE will also depend upon the choice of a functional form for the estimated survival function. While the Weibull distribution fit our simulated data well in comparison with Kaplan Meier curves, the robustness of downstream analyses to the fit of the survival curve would need to be considered in cases where that fit could be poor, and other survival models might need to be considered.
One additional aspect of our simulation results worth noting concerns the importance of including censored individuals, who are sometimes said to contribute relatively little to the power of survival analyses. In our application censored individuals turn out to contribute quite a bit. For instance, under Model 2, if we drop censored individuals prior to analysis the average PPLD-OTE drops from 84% to 12.5%. This is not particularly surprising given that censored individuals represent � 40% of each data set, so that dropping them results in a substantial reduction in sample size. But it illustrates the importance of being able to properly incorporate censored individuals into such analyses. Recall too that the only information we have used for these individuals in constructing OTE is that t fail (x i ) > t cens (x i ). Even so, they prove to be quite informative; and of course, they are critical to proper estimation of Ŝ y in the first place.
Finally, we note a difference in the handling of the covariate between what we are doing and the standard GWAS approach. The latter in effect imposes a random-effects model for the covariate, because the regression is performed separately at each SNP, with the covariate effect re-estimated each time. Along with the myriad other sources of sampling variability in the data, if a fixed effects model is more appropriate, then this practice introduces additional noise. By contrast, we have in essence assumed a fixed-effects model, factoring in the effect of the covariate just once in calculating the residuals. One disadvantage of this is that we might be sacrificing power to find genotype x covariate interactions. For instance, in our DMD data set, it is unclear whether using OTE as described here would support discovery of genes that influence response to steroids among DMD patients. This is an interesting topic for further investigation.
algorithm ( [29]. Let π be the probability that a randomly selected SNP is within detectable LD distance of a trait locus. Then the Posterior Probability of LD is simply PPLD ¼ pBR pBRþð1À pÞ . p � 0.35 in order to vary the proportion censored across replicates. Data were then generated and residuals calculated as described in the main text. (We confirmed that Weibull estimates across replicates varied from 12.70-12.93 (scale) and 3.61-3.92 (shape), producing virtually no visually discernable effect of the variable AO distribution on the survival function itself.) Fig 4 shows the mean z � RES and median z � RES for each replicate, plotted as a function of the observed proportion of censored (OPC) individuals. As expected, z � MR ¼ 0, and z � OLRR = 0 as well, regardless of OPC; z � DR , however, depends on OPC, and for all three z � RES depends on OPC. OTE is the closest to LRL behavior: z � OTE � 0 and z � OTE � 0, regardless of OPC, implying that the underlying distributions of OTE residuals are highly symmetric.