Weighted Estimation of the Accelerated Failure Time Model in the Presence of Dependent Censoring

Independent censoring is a crucial assumption in survival analysis. However, this is impractical in many medical studies, where the presence of dependent censoring leads to difficulty in analyzing covariate effects on disease outcomes. The semicompeting risks framework offers one approach to handling dependent censoring. There are two representative estimators based on an artificial censoring technique in this data structure. However, neither of these estimators is better than another with respect to efficiency (standard error). In this paper, we propose a new weighted estimator for the accelerated failure time (AFT) model under dependent censoring. One of the advantages in our approach is that these weights are optimal among all the linear combinations of the previously mentioned two estimators. To calculate these weights, a novel resampling-based scheme is employed. Attendant asymptotic statistical results for the estimator are established. In addition, simulation studies, as well as an application to real data, show the gains in efficiency for our estimator.


Introduction
In medical studies, it is very common that death or withdrawal of study and progression on disease of interest simultaneously occur in the study. For this case, death or withdrawal of study may censor the development of disease. This type of data structure is called 'semicompeting risks data' [4].
Semicompeting risks data have been widely studied in the past decade. Some researchers used a Gamma copula to estimate the association parameter between the event of interest and dependent censoring [2], [4]. There is a literature that extended the methodology of [2] to the case that a nuisance parameter exists and also considered a more general copula model [22].
On the other hand, other researchers used semiparametric regression to model the event of interest and dependent censoring jointly. One approach is an estimation procedure based on the accelerated failure time (AFT) model [13], [17]. They used the artificial censoring technique to adjust the bias of the usual estimator. While the estimating equation of [13] is a U-statistic of order one, that of [17] is a U-statistic of order 2.
However, none of these papers fully discussed optimality of the estimator. In this case, choosing an estimator that is optimal from an efficiency viewpoint is an important issue for consideration. Here, we adapt the idea of [25], which proposed an optimal estimator whose form is a linear combination of estimators for multivariate failure time data. They used idea of [24], which proposed using combinations of dependent tests in the presence of missing values. Idea of [24] is to create a test which can maximize power based on linear combination of test statistics. Approach of [25] is simple and flexible, so it is sensible to apply their method in our case.
In this paper, we propose a weighted estimator by using methodology from [25]. Our weighted estimator combines those of [13] and [17]. The structure of this paper is as follows.
In methods section, we review estimators proposed by [13] and [17] briefly. In addition, we describe details on our new weighted estimator. In model checking section, model checking procedure is briefly discussed. In simulation studies section, results of simulation studies will be given. Application of our method to a real data example is presented in real data analysis section. Some discussion concludes discussion section.

Review of Model
Let X be time to the event of interest, D the time to dependent censoring and C the time to independent censoring. All these times are transformed on a logarithmic scale. LetX ¼ X^D^C andD ¼ D^C. Define d ¼ IðX D Þ, Δ = I(D C) and let Z be covariates. The data contain n independent and identically distributed observations ðX i ;D i ; Z i ; d i ; D i Þ; i ¼ 1; . . . ; n. The model is where θ 0 and η 0 are k × 1 vectors, and i ð X i ; D i Þ are error terms with an unknown joint distribution. In this case, we assume that the model is identifiable only in upper wedge X < D [4], [17]. We assume that has unknown distribution H. The goal is to obtain an unbiased estimator of α = (η T ,θ T ) T without nonparametrically estimating the distribution of i , i = 1,. . ., n. We further assume that given Z, C and (X, D) are independent, but X and D can be dependent given Z. Now we are going to describe the procedures of [13] and [17] in turn.
SinceD only depends on independent censoring, a standard rank regression approach is available for estimation [11], [13], [15], [17], [20], [23]. The estimating equation for η is given by whereD Ã i ðηÞ ¼D i À Z T i η. The estimator of η can be obtained by solving S n (η) = 0. For estimation of θ, simply replacingD i À Z T i η toX i À Z T i θ does not yield unbiased estimation of θ. This is because the cause-specific hazard function forX i À Z T i θ depends oñ D i À Z T i θ, which violates the model assumption [13]. To fix this problem, many authors use artificial censoring techniques [3], [6], [7], [10], [13], [17]. In [13], a single constant term g(α) is proposed so that the estimation equation will be unbiased for estimation of θ in the two-sample case. The form of g(α) is gðαÞ ¼ max 1 i n f0; Z T i ðθ À ηÞg. The proposed estimator in [13] is obtained by solving U L n ðαÞ = 0, where and a^b means minimum of a and b. In [17], pairwise comparisons of all the subjects is proposed so that each subject has different degree of the artificial censoring. The transformations suggested by [17] are The proposed estimator according to [17] is obtained by solving U P n ðαÞ ¼ 0, which is defined by Note that X and D are not observable, but we can express transformation (Eq 1) and (Eq 2) by using observable quantities [7], [13].

Weighted estimator
Given these two estimation procedures, it is natural to consider their efficiencies with respect to standard error. However, in this point of view, neither estimator is superior to the other. Moreover, these estimators may not be optimal estimators with respect to the standard error. There is an argument that estimator of [17] gains more efficiency than that of [13] because pairwise comparisons lead to less artificial censoring than that in [13]. However, this logic only holds when we look at performance of estimators in the view of bias and variance across the estimators in simulation study. Concentrating on standard error of an estimator in a single dataset, the estimator by [17] may not provide better estimator than that of [13]. This will be seen in the real data analysis section.
The reason for this is due to estimation procedure of [17]. As discussed [7], for n samples, the number of comparisons of [13] for artificial censoring is of order n, while that of [17] is of order n 2 . By definition of g ij (α), different degrees of artificial censoring is applied to observations. It may lead more variation between observations, which makes standard error larger than that of [13].
We extend the scope of estimators which provide consistent estimation of θ 0 . The natural extension of estimators of [13] and [17] is to consider collections of estimators that are linear combination of these two estimators with sum of weights being 1. By choosing proper weights, we can expect that the variance of the new combined estimator is smaller than that of each individual estimator inθ L andθ P .
The goal is to find weights such that the variance of the new estimator is smaller than the minimum of variance of the estimators by [13] and [17], which have good theoretical properties. To obtain the estimator that yields smallest variance with these properties, we can use the idea of [25], which was applied to the problem of modeling multivariate failure times.
In [25], the joint distribution of estimatorsγ ¼ fĝ mr g is considered, where m = 1,. . ., k and r = 1. . .R. In this case, m indicates index of regression parameters and r stands for index of the rth event. For obtaining an optimal estimator, they applied arguments from [24] which derived a linear combination of test statistic to maximize power against every alternative hypothesis. LetĤ be the covariance matrix for the estimatorsγ . Then we fix m and defineĤ m be covariance matrix ofγ m ¼ ðĝ m1 ; . . . ;ĝ mR Þ. It can be obtained from the entire covariance matrix by selecting the part corresponding toγ for r = 1,. . ., R under fixed m. Now we can define m e is a vector of weights which leads the best estimator among linear combinations of estimators of γ m where e is a vector consisting of R ones [24], [25].
We now apply the argument in previous paragraph to our model by considering the joint tions for β 0 . The strong consistency and asymptotic joint distribution of three estimators, described in following theorems, play a crucial role in our methodology.
It suffices to show thatη;θ L andθ P are strongly consistent, respectively. Let α = (η T ,θ T ) T . Note that we have compact region, say W and we assume regularity conditions in [27]. By [27], there exists nonrandom function m 1 such that sup η 2 N 0 jjn −1/2 S n (η) − m 1 (η)jj converges to 0 with probability 1 where N 0 is a neighborhood of η 0 . Thusη is strongly consistent. Similarly, we have another nonrandom function m 2 such that sup α2N 1 jjn À1=2 U L n ðαÞ À m 2 ðαÞjj converges to 0 with probability 1 where N 1 is a neighborhood of α 0 . Hence by [27],α L is strongly consistent.
Forθ P , by argument in Appendix of [17], note that by the U-statistics version of the law of large numbers, for all α 2 W, jjn À1=2 U P n ðαÞ À γðαÞjj converges to 0 in probability where γðαÞ ¼ Efn À1=2 U P n ðαÞg. We can partition our compact space as W 1 ,. . .,W k so that W 2 [ k j¼1 W j . Clearly, then for {α j 2 W j , j = 1,. . ., k}, max 1 j k jjn À1=2 U P n ðα j Þ À γðα j Þjj converges to 0 in probability. Then by Appendix of [17], sup jjαÀα ÃÃ jj x n À1=2 jjU P n ðαÞ À U P n ðα ÃÃ Þjj 2 nðn À 1Þ and for all > 0, there exists ξ > 0 such that Thusθ P is strongly consistent and clearly,β is strongly consistent. Theorem 2. Assuming certain technical conditions from [27] and [17], n 1=2 ðβ À β 0 Þ is asymptotically normal with mean zero vector and covariance matrix S 0 where where Γ 0 is a nonsingular matrix and Ω 0 is the asymptotic covariance matrix of G n (β 0 ).
Proof. As consistency, we assume the same regularity conditions as in [27]. Let β 0 ¼ ðη T 0 ; θ T 0 ; θ T 0 Þ T and G n ðβÞ ¼ ½S T n ðZÞ; fU L n ðαÞg T ; fU P n ðαÞg T T . Similar to [13], let l ð1Þ 0 ðtÞ be the cause-specific hazard function for theD Ã i ðηÞ and let l ð2Þ 0 ðtÞ be the cause-specific hazard function forX Ã i ðαÞ under dependent censoring. Define Then M 1i and M 2i are martingales [5], [13]. By adapting a proof in the Appendix in [13], Rebdolledo's martingale central limit theorem [5] gives [13], [17]. For j = 1,. . ., n, define By the Cramér-Wold theorem, G n (β 0 ) has an asymptotically normal distribution with mean zero and covariance matrix Ω 0 , where Note that Efn À1=2 U P n ðαÞg ¼ γðαÞ. As stated in the Appendix of [17], under conditions of N1 − N3 from [9], there exists an open neighborhood of α 0 , say K 0 , such that Using a Taylor series expansion of γ(α) around α 0 , With these two results (Eq 5) and (Eq 6), by Appendix of [17], From [27], we have that for any η in the small neighborhood of η 0 , where P 0 is k × k nonsingular matrix. From the Appendix in [13], for J 1n ðαÞ ¼ ½S T n ðZÞ; fU L n ðαÞg T T , for any α in the small neighborhood of α 0 , where L 10 is defined as is 2k × 2k nonsingular matrix and M 0 and H 0 are k × k constant matrices. Define J 2n ðαÞ ¼ ½S T n ðZÞ; fU P n ðαÞg T T . Using expansion from [17], for any α in the small neighbor- where R 0 ¼ @γðαÞ @η j α¼α 0 and V 0 ¼ @γðαÞ @θ j α¼α 0 . Combining expansions of (Eq 8), (Eq 9) and (Eq 10), we have for any β in the small neighborhood of β 0 , where Γ 0 is defined as The results from [9] and [27], along with the consistency ofβ, imply that By combining the above results with Slutsky's theorem, n 1=2 ðβ À β 0 Þ has an asymptotically normal distribution with mean zero and covariance matrix G À1 0 O 0 G À1 0 . Theorem 2 implies the asymptotic normality ofβ with the form of S 0 being LetŜ be the estimated covariance matrix of S 0 . In this covariance matrix,Ŝ 11 is a k × k covariance matrix forη,Ŝ 22 is a k × k covariance matrix forθ L andŜ 33 is a k × k covariance matrix forθ P . Moreover,Ŝ 12 andŜ 13 represent covariance terms betweenη andθ L and betweenη andθ P , respectively. DefineŜ 23 as the covariance matrix betweenθ L andθ P . Clearly, S 21 ¼Ŝ T 12 ,Ŝ 31 ¼Ŝ T 13 andŜ 32 ¼Ŝ T 23 . The issue remains of how to obtain the matrix corresponding toĤ À1 m in our context. Note thatη;θ L andθ P are correlated with each other. The estimating equation structure implies that θ L andθ P cannot be estimated separately fromη. Thus our matrix corresponding toĤ À1 m should include the effect ofη. To obtain the matrix, we need to invert whole matrix and extract submatrix corresponding toθ L andθ P . There are two approaches to obtain submatrix.
The first approach is to invertŜ and obtain the submatrix ofŜ À1 corresponding toθ L m and θ P m . Let us denote this matrix asŜ Ã m . Clearly, this matrix is 2 × 2 and also positive definite. Then we can calculateĉ We can repeat this step for the other regression coefficients. Then we obtain T . In this first approach, weights are generated through using k number of 2 × 2 matrices. We can refer this first approach as 'marginal approach'. Sometimes it is desirable to consider entire covariates all at once when obtaining weights.
The second approach is to obtain the corresponding submatrix ofŜ À1 for fðŷ L Þ T ; ðŷ P Þ T g T . We denote this matrix asŜ ÃÃ . This approach is different from first one in thatĜ m consists of elements of the covariance matrix fromθ L m andθ P m but nowŜ ÃÃ has elements of covariance matrix from corresponding entire fðŷ L Þ T ; ðŷ P Þ T g T . This approach reflects the effect of fðŷ L Þ T ; ðŷ P Þ T g T jointly on our new estimator. Let E be a 2k × k matrix such that E is a multivariate extension of h. Note that E is concatenation of two k × k identity matrices by row. Entries that are 1 in these two k × k identity matrices are source of weights forθ L and θP . The next step is to constructB, which iŝ This matrix is a multivariate extension ofĉ m from the first approach. This matrix is a contrast matrix in the sense thatĉ Ã m;m þĉ Ã ðkþmÞ;m ¼ 1 for the mth regression coefficient ofθ L andθ P . Moreover,ĉ Ã p;p þĉ Ã ðkþpÞ;p ¼ 0 for p 6 ¼ m = 1,. . ., k. Using a vector form, from this approach our new estimator, sayθ JWE , We can also refer this approach as the 'joint approach'. Now the key step is to obtainŜ. We use the resampling approach of [16], which was also used in [13] and [17].
From [13] and [17], we have We then solve the estimating equation where Q i (i = 1,. . ., n) represent standard normal random variables. Note that G n ðβÞ ¼ ½S T n ðZÞ; fU L n ðαÞg T ; fU P n ðαÞg T T is joint estimating equation for ðη T 0 ; θ T 0 ; θ T 0 Þ T . By solving this equation, we obtain many realizations ofβs, are solutions from (Eq 11). The next theorem, combined with Theorem 2, justifies the resampling approach for calculatingŜ. Theorem 3. Based on the technical conditions in [16], the unconditional distribution of n 1=2 ðβ À β 0 Þ is same asymptotically as the conditional distribution of n 1=2 ðβ R ÀβÞ whereβ R are realizations ofβ from resampling.
Proof. Recall that for any β in the small neighborhood of β 0 , we have Note thatβ R are solutions of Eq (11). By conditioning on observed data and using expansion (Eq 12) as well as by adapting arguments in [13] and [16], and hence, Note that n À1=2 P n i¼1 W i Q i is asymptotically normal with covariance matrix S 0 . Then given observed data, distribution of n 1=2 ðβ R ÀβÞ is asymptotically normal with covariance matrix G À1 0 S 0 G À1 0 . Hence conditional distribution of n 1=2 ðβ R ÀβÞ on observed data is asymptotically same as unconditional distribution of n 1=2 ðβ À β 0 Þ.
IfBS j ! sup s jjS n ðs;ηÞjjg IfBS L j ! sup t jjU n ðt;α L Þjjg IfBS P j ! sup v jjU n ðv;α P Þjjg: [10]. If the p-value is smaller than predetermined level, we reject the null hypothesis, which means that data does not have appropriate fit on our bivariate model. Note that a multiple testing problem arises for testing the models for θ. We address this by adjusting p-values based on a Bonferroni correction with two tests.

Simulation Studies
We consider two simulation settings. In first simulation setting, the errors follow a bivariate normal distribution with mean (0,1.2) with variance 1 and correlation ρ = 0,0.25. The independent censoring time C is generated from log(U Ã ), where U Ã has uniform distribution with minimum value 0 and maximum value 20. Covariate is Z * Bernoulli(0.5), where Bernoulli(0.5) is Bernoulli distribution with success probability 0.5. We run 500 simulation runs. Within each simulation run, 500 resampling runs are tried for covariance matrix calculation. Sample sizes are N = 150 and N = 300. If there is only one covariate in the model, the first and the second method of the weighted estimation are equivalent. Let this common weighed estimator beθ WE . We calculate bias (Bias), mean squared error (MSE), mean of standard error (SEE), 95% coverage rate (Coverage). The coverage is based on the normal approximation. Moreover, to evaluate robustness of estimators, we also compute median of difference of the estimator from true value (Dmedian), median of squared error of estimates (Mediansq), and median of standard errors (Sdmedian). Results are summarized on Table 1 and Table 2.
In second simulation setting, we generate Gamma random variable ν with mean μ = 1 and variance σ 2 = 0 or 1, then create W = exp ( X ), which is an exponential random variable with rate 4ν −1 and exp ( D ) with an exponential random variable with rate ν −1 . Then we generate time to the event of interest by expðXÞ ¼ expðθ T 0 ZÞexpð X Þ and time to the dependent censoring by expðDÞ ¼ expðη T 0 ZÞexpð D Þ (By notation in our paper, X, D and C are already log-transformed times. Thus in this context, exp (X), exp (D) and exp (C) are times in the original scale). The independent censoring time exp (C) has uniform distribution with minimum value 0 and maximum value 20. True parameter values are θ 0 = (0.5,1) T and η 0 = (1,0.5) T and covariates are Z 1 * U(0,1), where U(0,1) is uniform distribution with minimum value 0 and maximum value 1 and Z 2 * Bernoulli(0.5). We run 500 simulation runs. Within each simulation run, 500 resampling runs are tried for covariance matrix calculation. Letθ MWE be weighted estimators from calculating weights marginally (the first proposed method) and letθ JWE be weighted estimators from calculating weights jointly (the second proposed method). We compute the same quantities as we did in the first set of the simulation study. Results are summarized on Table 3 and Table 4.
In these simulation results, we can see that our weighted estimators have good results. In both cases, bias and mean squared error of our new estimator has similar performance compared to the estimators by [13] and [17]. Mean of standard errors and median of standard errors are smaller than the estimators by [13] and [17]. Moreover, computation results for the median of difference of the estimators from true value and the median of squared error imply that our proposed estimator is comparable with the estimators from the original methods.
In the first simulation setting, the difference of standard error between our proposed estimator andθ L is bigger than the one betweenθ P and the proposed estimator. In the second simulation setting, the phenomenon is the opposite. Furthermore, in the first simulation setting,θ P has lower standard error on average than one ofθ L whileθ L have better efficiency (with respect to standard error) than ones byθ P in the second simulation setting. This simulation result verifies our claim, which means that neither estimator is better than another. Our proposed estimator takes advantage of smaller standard error with achieving small bias and correct coverage except N = 150 with σ 2 = 1 in the second simulation setting. In this scenario, empirical coverage of proposed estimators is lower than nominal 95% coverage. This is due to low coverage ofθ L .
Since we combineθ L andθ P , if one of them has low coverage, it is highly likely that the coverage of weighted estimator may also be below the nominal coverage.

Real data analysis
We applied our method to data from the AIDS Clinical Trial Group (ACTG) Study 364 [1], which was used in [17]. This multicenter randomized study investigated patients whose plasma RNA level is at least 500 copies per ml. Subjects were assigned to three treatments, nelfinavir (NFV), efavirenz (EFV), and combination of nelfinavir and efavirenz (NFV + EFV). Details about this study can be found in [1]. The two failure times are time to HIV RNA level greater than 2000 copies per ml and time to withdrawal of study. Let X be the first time when HIV RNA level is greater than 2000 copies per ml and D be time to withdrawal of study. We considered four covariates and 194 observations. Z 1 takes value 1 if a patient receives EFV and 0 otherwise. Z 2 takes value 1 if a patient receives NFV + EFV and 0 otherwise. Z 3 is New3TC, which takes value 1 if lamivudine is given as a new nucleoside analogue therapy to a patient and 0 otherwise. Z 4 is logarithm of RNA level at the start of the study. Table 5 and Table 6 show the point estimates and standard errors ofη,θ L ,θ P ,θ MWE and θ JWE . Our method works well for the models with and without New3TC on all covariates. Some variables are seen to be statistically significant based on the weighted estimator while they are not by [13] or [17]. For example, let's consider effect of EFV to the time to first virologic failure. By Table 6, the estimated effect by using approach of [13] is 0.475 and its standard error is 0.250. From approach of [17], an estimate is 0.464 and its standard error is 0.281. Based on the fact that estimators are asymptotic normal, from Wald test using [13] and [17], EFV is not a statistically significant variable on 5% significant level. On the other hand, a weighted estimate Table 3. Simulation result when N = 150 and N = 300, σ 2 = 0 with two covariates (Z 1 : U(0, 1), Z 2 : Bernoulli(0.5)).     using first approach is 0.471 and its standard error is 0.222. In this case, EFV is a statistically significant variable on 5% significant level. Observed score process with bootstrapped processes for withdrawal of study with respect to Z 1 is shown in Fig 1. Fig 2 and Fig 3 show observed score processes and bootstrapped processes of the first virologic failure usingα L ;α P with respect to Z 1 . These three plots are based on the model without New3TC. They are fluctuating around zero, so it seems that there is no graphical evidence for lack of fit. The p-value for the lack of fit tests of withdrawal is 0.952 and the first virologic failure usingα L andα P are 0.918 and 0.959 respectively. With graphical checking, p-value indicates that there is no evidence for violation of the model assumption.
For purposes of interpretation, since D represents a standard survival time, the interpretation ofη is in terms of covariate effect for survival time. However, since the observed time for X depends on D, interpretation ofθ is difficult. One way to interpretθ is to assume that D does not exist and interpret the effect ofθ on X only. This approach is possible if there exists a reasonable extrapolation mechanism for X [18]. However, considering the estimation structure for θ, it is difficult to separate effect ofθ to X from effect ofη to D.

Discussion
In this paper, we have proposed optimal estimators using combinations of the two estimators from [13] and [17]. Our methodology can be extended to a case of recurrent event with dependent censoring, which is extensively studied [6], [7], [10]. We are currently working on this extension.
Optimality of the estimator has been discussed in other contexts. Recently, there is a publication that proposed optimal additive functions based on score functions [14]. The main point of their method is to combine unbiased estimating functions. In our case, this would be combining estimating equations and new solution can be obtained by this estimating equation. Comparing performance of this solution and our proposed estimator is of interest. This will be left open to future research.
Another way of achieving optimality is to use generalized method of moment estimator [8]. This estimator is a linear combination of estimating functions [19]. In this case, the estimating functions have a greater dimension than the dimension of the parameter vector. The optimality is achieved by the linear combination. It is shown that the estimator from this linear combination of estimating functions is consistent and asymptotically normal [8]. In the literature of statistics, this idea is applied to generalized estimating equations [19]. The estimating functions proposed by [19] are called quadratic inference function. Recently, the quadratic inference function is applied to Cox model [26]. [8] and [19] derived new estimating functions, while we combined two estimators directly. This idea of the generalized method of moments is very appealing, but the estimating functions of [13] and [17] are nonsmooth. Finding derivative for the linear combination of the estimating functions, which is a key in generalized method moments, is challenging for our work because we cannot find the derivatives in the estimating functions proposed by [13] and [17]. Applying the idea of [8] to AFT model will be interesting future research. Our estimating equations to obtain estimators involve nonsmooth functions of η and α. Many literatures used a linear programming approach for estimating θ [3], [11]. However, this linear programming method is very slow for computing estimators of θ. Thus this approach is very inefficient when implementing to solve (Eq 11) for estimation of S. Recently, an approach called a derivative free-spectral algorithm for nonlinear equations (DF-SANE) was proposed [12], and there is a publication that showed that this algorithm is better than the linear programming method using an example of estimating parameters of AFT models under independent censoring. [21]. However, under dependent censoring, the artificial censoring term leads to numerical instability in estimating parameters and calculating resampled estimators. Moreover, this algorithm does not converge well under default tolerance settings using DF-SANE [21]. Thus using this algorithm requires changing the tolerance level. Developing efficient numerical algorithms for estimating parameters is an important topic for future research.

Author Contributions
Conceived and designed the experiments: DG. Performed the experiments: YC DG. Analyzed the data: YC. Contributed reagents/materials/analysis tools: YC DG. Wrote the paper: YC DG.