# On Testing Dependence between Time to Failure and Cause of Failure when Causes of Failure Are Missing

• Isha Dewan,
• Sangita Kulathinal

## Abstract

The hypothesis of independence between the failure time and the cause of failure is studied by using the conditional probabilities of failure due to a specific cause given that there is no failure up to certain fixed time. In practice, there are situations when the failure times are available for all units but the causes of failures might be missing for some units. We propose tests based on U-statistics to test for independence of the failure time and the cause of failure in the competing risks model when all the causes of failure cannot be observed. The asymptotic distribution is normal in each case. Simulation studies look at power comparisons for the proposed tests for two families of distributions. The one-sided and the two-sided tests based on Kendall type statistic perform exceedingly well in detecting departures from independence.

## Introduction

We consider a competing risks set-up where a unit is subject to two disjoint risks of failure and each unit ultimately fails due to one of the two risks. We do not allow simultaneous failures due to both risks. The observations are made on the time to failure T and an identifier of the risk, δ = j if the failure is due to risk j, j = 0, 1. Let the joint distribution of (T,δ) be specified by the subsurvival functions Sj(t) = P(T≥t, δ = j), or the subdistribution functions Fj(t) = P(T<t, δ = j), j = 0, 1. The survival function and the distribution function of T are, respectively, given by

$$S(t) = S_0(t) + S_1(t), \qquad F(t) = F_0(t) + F_1(t).$$

We assume that the subsurvival functions are continuous. Note that the distribution of the failure time and the cause of failure is specified using the observable variables (T,δ).

Let the conditional probability of failure due to the first risk given that there is no failure up to time t be given as

$$\Phi_1(t) = P(\delta = 1 \mid T \ge t) = \frac{S_1(t)}{S(t)},$$

whenever S(t)>0. These probabilities were introduced while studying failure and preventive maintenance in a censoring setting where the interest is in the distribution of the failure time which would have been observed in the absence of preventive maintenance [1]. Another conditional probability of interest is

$$\Phi_0^*(t) = P(\delta = 0 \mid T < t) = \frac{F_0(t)}{F(t)},$$

whenever F(t)>0.

Under independence, Sj(t) = S(t)P(δ = j) and hence T and δ can be studied separately. Thus, the hypothesis of equality of subsurvival functions reduces to testing whether P(δ = 1) = P(δ = 0) = 1/2, that is, whether δ is a Bernoulli trial with success probability one half. Hence a two-dimensional problem reduces to a one-dimensional problem. The dependence between the failure time T and the cause of failure δ in terms of the above two conditional probability functions was studied in [2]. Below we give formal proofs of two results, which were stated in [2], on the independence and positive quadrant dependence (PQD) structure of (T,δ) in terms of these conditional probabilities.

Lemma 1: T and δ are independent if and only if Φ1(t) = Φ1(0) = φ, for all t where φ = P(δ = 1).

Proof: When T and δ are independent, their joint distribution is the product of the marginal distributions. Hence,

$$\Phi_1(t) = \frac{S_1(t)}{S(t)} = \frac{P(T \ge t)\,P(\delta = 1)}{P(T \ge t)} = \varphi,$$

where φ = P(δ = 1). Conversely, when Φ1(t) does not depend on t, then Φ1(t) = Φ1(0) and Φ1(0) = S1(0)/S(0) = P(δ = 1) = φ.

This in turn implies that S1(t) = φS(t), which is the product of the marginal distributions of δ and T. Also, S0(t) = S(t)−S1(t) = (1−φ)S(t). Hence the result.

The independence of T and δ is also equivalent to Φ*0(t) = Φ*0(0), for all t. A simple and easily checked dependence structure is positive quadrant dependent (PQD) indicating positive association between two random variables.

Definition 1: Random variables X and Y are Positive Quadrant Dependent (PQD) if the following inequality holds:

$$P(X \le x, Y \le y) \ge P(X \le x)\,P(Y \le y) \quad \text{for all } x, y.$$

In our case, because δ takes only two values 0 and 1, T and δ are PQD if the following inequality holds:

$$P(T \le t, \delta = 0) \ge P(T \le t)\,P(\delta = 0) \quad \text{for all } t.$$

This is because P(T≤t, δ≤0) = P(T≤t, δ = 0), P(T≤t, δ≤1) = P(T≤t), and P(δ≤1) = 1. Hence the required inequality always holds for δ = 1.

Lemma 2: T and δ are PQD if and only if Φ1(t)≥Φ1(0) = φ, for all t.

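Proof: When T and δ are PQD then

$$P(T \le t, \delta = 0) \ge P(T \le t)\,P(\delta = 0).$$

Note that, by continuity of the subsurvival functions,

$$P(T \le t, \delta = 0) = P(\delta = 0) - S_0(t) = (1 - \varphi) - S(t) + S_1(t),$$

and P(T≤t)P(δ = 0) = (1−S(t))(1−φ). Substituting these identities in the above inequality, we get

$$S_1(t) \ge \varphi\, S(t), \quad \text{that is,} \quad \Phi_1(t) \ge \varphi.$$

Each step is reversible. Hence the result.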

Note that T and δ being positive quadrant dependent (PQD) is also equivalent to Φ*0(t)≥Φ*0(0), for all t>0. Various hypothesis testing problems of checking independence of T and δ against alternatives specifying dependence structures are considered, and U-statistics are derived, when the complete data on all n units are available [2]. However, in many practical situations the experimenter may have information on failure times for all the individuals but on the causes of failure only for some.

In a mortality follow-up study, the causes of death are obtained from the death certificates. The problem of causes of death missing on death certificates is well known. This may occur for various reasons, such as a doctors' strike, an autopsy not being performed (and hence no knowledge of a definite underlying cause of death), or no legal requirement to mention an underlying cause of death on the death certificate. The present work is motivated by a follow-up study on mortality where the underlying causes of death were missing for nearly 20% of patients but the times of death were known for all. A similar situation arises in engineering, where series systems are tested for failure due to various components, possibly under accelerated life testing. In this case, a thorough autopsy of the failed system is required to identify the failed component(s) which led to the system failure. Such information may not be available for all failed systems for financial or logistic reasons. In a study of motorcycle fatalities, 40% of the death certificates had either partial or no information on underlying causes of death [3].

An example from animal bioanalysis in which not all causes were available was considered in [4]. Likelihood-based estimation in the case of missing causes of failure has been studied [5], [6], [7], [8], [9]. A modified log rank test for competing risks with missing failure type was also considered [10]. The maximum likelihood estimators and minimum variance unbiased estimators of the parameters of the exponential distribution for the missing case were obtained in [11]; their approximate and asymptotic properties were discussed and confidence intervals were derived in [12].

In this paper, we consider the problem of testing

$$H_0 : \Phi_1(t) = \varphi \ \text{ for all } t$$

against various alternative hypotheses characterising the dependence structure of T and δ, which are:

$$H_1 : \Phi_1(t) \text{ is not constant in } t, \qquad H_2 : \Phi_1(t) \ge \varphi \ \text{ for all } t, \text{ with strict inequality for some } t,$$

when causes are missing for some units. Let (Ti,δi), i = 1,…,n, be the competing-risks data available on n individuals. Here, we consider a situation in which δi may not always be observed, i.e., it may be missing for some units. Let Oi be an indicator variable which takes value one if δi is observed and zero if δi is missing. Let p = P(Oi = 1) be the probability that δ is observed. We assume that the δi are missing completely at random and hence Oi is independent of (Ti,δi): whether or not the cause of failure is observed has no bearing on the actual cause. Similar assumptions are made in [9], [13].

We extend some of the tests based on U-statistics proposed in [2] to the case when δs are not observed for all the units. We carry out simulation studies for comparing the power of the tests for two families of distributions. We also apply the proposed tests to the data on failure of switches given in [14] by artificially creating missing causes of failure. The proposed tests perform satisfactorily and the use of the data on the failure times even when corresponding causes are missing is recommended.

## Results

We apply the proposed tests Ukm, UPQDm and Ukm1 (the one-sided test based on Ukm) to simulated data from two parametric families of distributions and evaluate empirical powers. We also apply the tests to a real data set. The computations were done using SAS [15], and the source codes and a brief guide on how to use the SAS codes are provided in the supplementary material (Text S1, Text S2, Text S3 and Text S4).

Example 1: Parametric family of distributions [2]

Consider the parametric family of distributions proposed in [2]

$$F_1(t) = \varphi F^a(t), \qquad F_0(t) = F(t) - \varphi F^a(t),$$

where 1≤a≤2, 0≤φ≤0.5 and F(t) is a proper distribution function. Note that P(δ = 1) = φ and

$$\Phi_1(t) = \frac{\varphi\,[1 - F^a(t)]}{1 - F(t)},$$

which is an increasing function of t. For a = 1, Φ1(t) = φ, that is, T and δ are independent, and for 1<a≤2, Φ1(t)>φ, that is, T and δ are PQD, and hence H2 holds. Let F(t) = 1−exp{−λt} be the overall distribution function. We simulated random samples from the above distribution by varying n, p, and (λ,a,φ). Empirical level of significance and power were calculated by using 1000 replications for each combination of n and (λ,a,φ). Table 1 gives the empirical level of significance and Table 2 gives the empirical power of the three tests based on Ukm, Ukm1 and UPQDm.
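A minimal Python sketch of how such samples might be drawn is given below. It is not the SAS code of Text S1. It assumes that the family has the form F1(t) = φF^a(t) and F0(t) = F(t) − φF^a(t), which is consistent with the stated constraints and with P(δ = 1) = φ, so that P(δ = 1 | T = t) = φ a F(t)^(a−1). The function name `simulate_example1` and the parameter values are hypothetical.

```python
import math
import random

def simulate_example1(n, lam, a, phi, p, rng):
    """Draw n triples (T, delta, O), assuming F1 = phi * F^a and F0 = F - phi * F^a."""
    data = []
    for _ in range(n):
        t = rng.expovariate(lam)                  # T ~ F(t) = 1 - exp(-lam * t)
        u = 1.0 - math.exp(-lam * t)              # u = F(t)
        prob_cause1 = phi * a * u ** (a - 1)      # P(delta = 1 | T = t) = dF1/dF (assumed form)
        delta = 1 if rng.random() < prob_cause1 else 0
        observed = 1 if rng.random() < p else 0   # O: cause observed, missing completely at random
        data.append((t, delta, observed))
    return data

rng = random.Random(2024)
sample = simulate_example1(n=2000, lam=1.0, a=1.5, phi=0.4, p=0.8, rng=rng)
share_cause1 = sum(d for _, d, _ in sample) / len(sample)    # should be near phi
share_observed = sum(o for _, _, o in sample) / len(sample)  # should be near p
print(round(share_cause1, 2), round(share_observed, 2))
```

Drawing T first and then δ given T keeps the marginal distribution of T equal to F for every value of a, so only the dependence between T and δ changes with a.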

From the two tables it is clear that modified test statistic Ukm attains its level when roughly half of the failures are likely to be due to the first cause. The conclusions are valid even when 20% of the failure causes are not available. The power increases with increase in values of a and also with increase in sample size. The test has very good power even when a = 1.5. One should keep in mind the fact that the alternative of no independence is extremely general.

However, the test based on one-sided version of the Kendall's τ, Ukm1 performs much better than the test based on UPQDm for testing H0 against H2. It was observed in [2] that the test UPQDm, when p = 1 is extremely conservative and also inefficient. The entries for this test in the two tables confirm this observation. But given the fact that the level of significance attained is very low, it is able to detect alternatives reasonably well.

Example 2: Random sign censoring model [1]

A random sign censoring (RSC), also known as an age-dependent censoring, is a model in which the lifetime of a unit (X) is censored by Z = X−Wη, where W, 0<W<X, is the time at which a warning is emitted by the unit before its failure, and η is a random variable taking values in {−1,1} and is independent of X. Hence η = 1 leads to the censoring of the lifetime at X−W, giving T = Z and δ = 0, and η = −1 leads to the observation of the complete lifetime X, giving T = X and δ = 1. Assume that X has an exponential distribution with parameter λ. In this case, P(Z≥t, Z<X) = P(X−W≥t, η = 1) and P(X≥t, X<Z) = P(X≥t)P(η = −1). This gives

$$\Phi_1(t) = \frac{P(T \ge t, \delta = 1)}{P(T \ge t)} = \frac{P(X \ge t)\,P(\eta = -1)}{P(X \ge t)\,P(\eta = -1) + P(X - W \ge t, \eta = 1)}.$$

Here P(η = −1) = 1−P(η = 1) = P(δ = 1) = φ. When X is exponentially distributed with parameter λ and W = aX, 0<a<1,

$$\Phi_1(t) = \frac{\varphi}{\varphi + (1 - \varphi)\exp\{-\lambda a t/(1 - a)\}},$$

leading to the increasing (and hence PQD) nature of Φ1(t) in t. As a goes to zero, Φ1(t) goes to φ, that is, T and δ are independent. Hence, to evaluate the empirical level of significance we choose a very close to zero.
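The RSC mechanism with W = aX can be sketched in Python as follows. This is an illustration and not the SAS code of Text S2. It assumes Z = X − Wη with η independent of X, and the closed form in `phi1_closed_form` is derived here from those assumptions (exponential X, W = aX) rather than quoted from the paper. All function names and parameter values are hypothetical.

```python
import math
import random

def simulate_rsc(n, lam, a, phi, p, rng):
    """Draw (T, delta, O) under random sign censoring with W = a * X (assumed form)."""
    data = []
    for _ in range(n):
        x = rng.expovariate(lam)                 # lifetime X ~ Exp(lam)
        eta = -1 if rng.random() < phi else 1    # P(eta = -1) = phi
        if eta == 1:
            t, delta = x - a * x, 0              # censored at X - W
        else:
            t, delta = x, 1                      # complete lifetime observed
        observed = 1 if rng.random() < p else 0  # MCAR indicator for the cause
        data.append((t, delta, observed))
    return data

def phi1_closed_form(t, lam, a, phi):
    """Phi_1(t) = P(delta = 1 | T >= t), derived for exponential X and W = a * X."""
    return phi / (phi + (1.0 - phi) * math.exp(-lam * a * t / (1.0 - a)))

def phi1_empirical(data, t0):
    """Share of cause-1 failures among units with T >= t0 (uses all causes)."""
    at_risk = [delta for t, delta, _ in data if t >= t0]
    return sum(at_risk) / len(at_risk)

rng = random.Random(11)
sample = simulate_rsc(n=20000, lam=1.0, a=0.5, phi=0.4, p=0.8, rng=rng)
gap = abs(phi1_empirical(sample, 0.5) - phi1_closed_form(0.5, 1.0, 0.5, 0.4))
print(gap < 0.03)
```

Comparing the empirical share with the closed form at a few time points is a quick way to confirm that the simulated data have the intended increasing Φ1(t).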

A value of a close to zero corresponds to independence of T and δ, and a>0 gives Φ1(t) as an increasing function of t, implying that T and δ are PQD. For simulation purposes, we consider two values of a, namely 0.00001 and 0.5. The test based on Ukm almost attains its level even for sample sizes as small as n = 25, as can be seen from Table 3. This test has good power for n = 100. The test based on Ukm1 has a slightly higher level as well as higher power. The test based on UPQDm, however, is very conservative and has low empirical power even for n = 100. The one-sided test based on Ukm1 is definitely a better choice for detecting PQD alternatives.

Example 3: Nair's data revisited [14]

Here we consider the data on the failure of 37 switches due to one of two possible causes of failure published in [14]. These data were analysed in [2] and it was shown that the failure time (T) and the cause of failure (δ) of switches were not independent. Also, the conditional probability of failure due to cause A, Φ1(t), was shown to be larger than φ and hence T and δ were PQD.

We calculate the three test statistics for the entire data on 37 switches as earlier. We also artificially create missing data on the cause of failure for varying values of p, repeating this 1000 times to evaluate the empirical powers of the test statistics.

The hypothesis of independence of T and δ, H0 is rejected against H1 at α = 5% level of significance using Ukm (the value of the test statistic is 2.70 which is larger than 1.96) and the one-sided test, Ukm1 (the value of the test statistic is 2.70 which is larger than 1.64) rejects the hypothesis H0 against H2 at α = 5% level of significance. However, the test based on PQD, UPQDm (the value of the test statistic is 1.35 which is smaller than 1.64) does not reject the hypothesis H0 against H2 at α = 5% level of significance. Table 4 shows the empirical powers of the tests for various values of p.

As seen earlier with the simulated data, the test Ukm1 performs well even when 60% of the causes are missing. The power of UPQDm test is unsatisfactory.

## Discussion

Testing independence between the failure time T and the cause of failure δ is often important because of the reduction in dimensionality and the possibility of studying T and δ separately. The available tests use only completely observed data on T and δ. One cannot avoid missing data in practice and hence the effect of missing observations on the existing tests needs to be addressed.

From the simulation studies it is clear that the two-sided test, Ukm, performs well for both families of distributions for sample sizes as small as 25 and when 20% of the causes of failure are not known. These observations can be made from Table 1, Table 2 and Table 3, which show the attained level of significance and high empirical power. The empirical powers for all three tests are higher for the parametric family of distributions of Example 1 than for the RSC model of Example 2 for all sample sizes. The performance of the one-sided test, Ukm1, based on Kendall's τ is clearly superior to that of UPQDm, as demonstrated by Table 1, Table 2 and Table 3. Even when all causes are known, it was observed that the test based on Kendall's τ is four times more efficient than the test based on UPQD [2]. Even in the case of missing causes, we recommend the use of Ukm1 for testing independence against PQD. One obvious reason is that Kendall's τ uses information on (T,δ) for each pair of observations. Similar observations are made on the basis of the real data analysis of Example 3 (Table 4).

The failure times with missing information on causes of failure also provide useful information regarding departures from independence of T and δ, and hence omitting such observations from the analysis may result in loss of efficiency. For this reason, the analysis should not be based only on the complete data on both time and cause of failure (with a reduced sample size, which is random). This article is the first attempt of its kind to carry out tests for independence under the assumption of missing completely at random. How the tests perform under the assumption of missing at random or even informative missingness remains an open research problem.

## Materials and Methods

### General dependence between T and δ

First we consider the problem of testing H0 : Φ1(t) = φ, for all t against H1 : Φ1(t) is not a constant, where Φ1(t) = P(δ = 1|T≥t) = S1(t)/S(t), and φ = Φ1(0) = P(δ = 1). Recall that a pair (Ti,δi) and (Tj,δj) is a concordant pair if Ti>Tj, δi = 1, δj = 0 or Ti<Tj, δi = 0, δj = 1 and is a discordant pair if Ti>Tj, δi = 0, δj = 1 or Ti<Tj, δi = 1, δj = 0. The U-statistic based on the idea of concordant and discordant pairs, or Kendall's τ, is

$$U_k = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \psi_k(T_i, \delta_i, T_j, \delta_j),$$

where

$$\psi_k(T_i, \delta_i, T_j, \delta_j) = \begin{cases} 1 & \text{if the pair is concordant}, \\ -1 & \text{if the pair is discordant}, \\ 0 & \text{otherwise}, \end{cases}$$

and the subscript k indicates that the U-statistic is defined using the idea of Kendall's τ [2]. If δ is missing for some units then ψk(Ti, δi, Tj, δj) cannot be defined for all pairs. In Table 5, m indicates that δ is not observable and ? indicates the cases when ψk is not defined.

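Note that when Ti>Tj and δi = 1, but δj is missing, ψk(Ti, δi, Tj, δj) will take value 1 if δj = 0 and value 0 if δj = 1. Hence, in order to retrieve the best possible information we assign weight (1+0)/2 = 1/2 in this case. Similarly, when Ti>Tj and δi = 0, but δj is missing, ψk(Ti, δi, Tj, δj) will take value −1 if δj = 1 and value 0 if δj = 0. Hence, we assign value −1/2 to the kernel in this case. Now, we redefine the kernel when some observations on δ are missing as follows

$$\psi_{km}(\cdot) = \begin{cases} \psi_k(T_i, \delta_i, T_j, \delta_j) & \text{if both causes are observed}, \\ 1/2 & \text{if exactly one cause is observed and either the larger time has cause 1 or the smaller time has cause 0}, \\ -1/2 & \text{if exactly one cause is observed and either the larger time has cause 0 or the smaller time has cause 1}, \\ 0 & \text{if both causes are missing}. \end{cases}$$

Here the subscript m indicates the missing data situation. Define Ukm as the corresponding U-statistic

$$U_{km} = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \psi_{km}(T_i, \delta_i, O_i, T_j, \delta_j, O_j).$$

Then the expectation of Ukm is given by

$$E(U_{km}) = 2p \int_0^\infty [2F(t) - 1]\, dF_1(t),$$

and the asymptotic variance of √n Ukm under H0, denoted as Var(Ukm), is

$$Var(U_{km}) = \frac{4}{3}\, p^2 \varphi (1 - \varphi) + \frac{1}{3}\, p (1 - p).$$

Note that when p = 1, the variance simplifies to (4/3)φ(1−φ), which is given in [2]. Also, E(Ukm)≠0 under H1. From the central limit theorem of U-statistics [16] (see Text S5), it follows that Ukm has an asymptotic normal distribution for large n.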

Theorem 1: Under H0, √n Ukm converges in distribution to N(0, σ²km) as n→∞, where σ²km = (4/3)p²φ(1−φ)+(1/3)p(1−p).

We refer to the supplementary material (Text S6) for the explicit derivation of E(Ukm), Var(Ukm) and the proof of Theorem 1.
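As a rough check on the variance in Theorem 1, the following sketch recovers σ²km from the first-order projection of the kernel. It is not the derivation of Text S6. It assumes that the weights 1, −1, 1/2, −1/2 described above, together with weight 0 when both causes are missing, can be written compactly as sign(Ti − Tj)(di − dj), where di = δi if Oi = 1 and di = 1/2 if Oi = 0. The symbols di, μ and V are notation introduced here for illustration only.

$$
\begin{aligned}
&\text{Under } H_0,\ V = F(T) \sim \text{Uniform}(0,1) \text{ is independent of } d, \text{ with } \mu = E(d) = p\varphi + \tfrac{1}{2}(1 - p). \\
&\psi_1(t, d) = E\big[\operatorname{sign}(t - T_2)\,(d - d_2)\big] = [2F(t) - 1]\,(d - \mu), \\
&Var(d) = p\varphi + \tfrac{1}{4}(1 - p) - \mu^2 = p^2\varphi(1 - \varphi) + \tfrac{1}{4}\,p(1 - p), \\
&Var\,\psi_1(T, d) = E(2V - 1)^2 \, Var(d) = \tfrac{1}{3}\Big[p^2\varphi(1 - \varphi) + \tfrac{1}{4}\,p(1 - p)\Big], \\
&\sigma^2_{km} = 4\, Var\,\psi_1(T, d) = \tfrac{4}{3}\,p^2\varphi(1 - \varphi) + \tfrac{1}{3}\,p(1 - p).
\end{aligned}
$$

The factor 4 is the usual multiplier for a U-statistic with a kernel of degree two.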

In practice, p and φ are generally unknown and can be replaced by their consistent estimators, p̂ and φ̂ respectively. A test procedure for testing H0 against H1 is then: reject H0 at the 100α% level of significance if √n|Ukm|/σ̂km is larger than z1−α/2, the upper α/2 cut-off point of the standard normal distribution, where σ̂²km is a consistent estimator of σ²km obtained by replacing p and φ by p̂ and φ̂.
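The test procedure can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' SAS implementation. It encodes the weights described above (1, −1, 1/2, −1/2, and 0 when both causes are missing) by giving a missing cause the score 1/2, and it assumes the natural plug-in estimators p̂ = proportion of observed causes and φ̂ = proportion of cause 1 among observed causes. The function names and the four-unit `toy` data set are hypothetical.

```python
import math
from itertools import combinations

def sign(x):
    return (x > 0) - (x < 0)

def u_km(data):
    """data: list of (t, delta, observed); delta is ignored when observed == 0."""
    d = [delta if obs else 0.5 for _, delta, obs in data]   # missing cause scored 1/2
    n = len(data)
    total = sum(sign(data[i][0] - data[j][0]) * (d[i] - d[j])
                for i, j in combinations(range(n), 2))
    return total / (n * (n - 1) / 2)

def sigma2_km_hat(data):
    """Plug-in estimate of (4/3) p^2 phi (1 - phi) + (1/3) p (1 - p)."""
    n = len(data)
    n_obs = sum(obs for _, _, obs in data)        # assumes at least one observed cause
    p_hat = n_obs / n
    phi_hat = sum(delta for _, delta, obs in data if obs) / n_obs
    return (4 / 3) * p_hat ** 2 * phi_hat * (1 - phi_hat) + (1 / 3) * p_hat * (1 - p_hat)

def z_km(data):
    """Standardised statistic sqrt(n) * U_km / sigma_hat."""
    return math.sqrt(len(data)) * u_km(data) / math.sqrt(sigma2_km_hat(data))

# Hypothetical toy data: (time, cause, cause observed?)
toy = [(1.0, 0, 1), (2.0, None, 0), (3.0, 1, 1), (4.0, 1, 1)]
u = u_km(toy)
z = z_km(toy)
print(u, z)
```

For a two-sided test at the 5% level one would compare |z| with 1.96, and for the one-sided version z with 1.64. With only four units the normal approximation is of course not reliable; the toy data serve only to make the arithmetic easy to follow.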

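For computational purposes, it is necessary to express Ukm as a function of ranks. Let n1, n2, and n3 represent the numbers of observations in three groups: causes observed to be 1, causes observed to be 0, and causes not observed, respectively. Let the corresponding ordered times in each group be represented by X(1), X(2),…, X(n1), Y(1), Y(2),…, Y(n2), and Z(1), Z(2),…, Z(n3), respectively. Let Ri denote the combined rank of X(i) in the ordered arrangement of (n1+n2) samples of type X and Y, Si denote the combined rank of X(i) in the ordered arrangement of (n1+n3) samples of type X and Z, and Qi denote the combined rank of Y(i) in the ordered arrangement of (n2+n3) samples of type Y and Z. Then, the numbers of pairs for which ψkm(.) takes each of its non-zero values are

$$
\begin{aligned}
\#\{\psi_{km} = 1\} &= \sum_{i=1}^{n_1} (R_i - i), \\
\#\{\psi_{km} = -1\} &= n_1 n_2 - \sum_{i=1}^{n_1} (R_i - i), \\
\#\{\psi_{km} = 1/2\} &= \sum_{i=1}^{n_1} (S_i - i) + n_2 n_3 - \sum_{i=1}^{n_2} (Q_i - i), \\
\#\{\psi_{km} = -1/2\} &= n_1 n_3 - \sum_{i=1}^{n_1} (S_i - i) + \sum_{i=1}^{n_2} (Q_i - i).
\end{aligned}
$$

Thus, the expression for Ukm in terms of ranks is

$$\binom{n}{2} U_{km} = \Big[2\sum_{i=1}^{n_1} (R_i - i) - n_1 n_2\Big] + \frac{1}{2}\Big[2\sum_{i=1}^{n_1} (S_i - i) - n_1 n_3\Big] - \frac{1}{2}\Big[2\sum_{i=1}^{n_2} (Q_i - i) - n_2 n_3\Big]. \tag{1}$$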
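The rank representation can be checked numerically against a direct pairwise computation. The sketch below is an illustration only: the counting identities it uses (for example, that the number of Y values below X(i) equals Ri − i) follow from the definitions of Ri, Si and Qi and are not copied from equation (1). It assumes no tied failure times, and all function names are hypothetical.

```python
import random
from itertools import combinations

def combined_ranks(group, other):
    """Ranks of the ordered `group` values within the pooled sample (no ties assumed)."""
    pooled = sorted(group + other)
    return [pooled.index(v) + 1 for v in sorted(group)]

def u_km_from_ranks(x, y, z):
    """x: times with cause 1, y: times with cause 0, z: times with missing cause."""
    n1, n2, n3 = len(x), len(y), len(z)
    n = n1 + n2 + n3
    x_above_y = sum(combined_ranks(x, y)) - n1 * (n1 + 1) // 2   # sum of (R_i - i)
    x_above_z = sum(combined_ranks(x, z)) - n1 * (n1 + 1) // 2   # sum of (S_i - i)
    y_above_z = sum(combined_ranks(y, z)) - n2 * (n2 + 1) // 2   # sum of (Q_i - i)
    total = ((2 * x_above_y - n1 * n2)               # weights +1 / -1
             + 0.5 * (2 * x_above_z - n1 * n3)       # X versus missing: +1/2 / -1/2
             - 0.5 * (2 * y_above_z - n2 * n3))      # Y versus missing: -1/2 / +1/2
    return total / (n * (n - 1) / 2)

def u_km_brute(x, y, z):
    """Direct pairwise evaluation, scoring a missing cause as 1/2."""
    pts = [(t, 1.0) for t in x] + [(t, 0.0) for t in y] + [(t, 0.5) for t in z]
    total = 0.0
    for (ti, di), (tj, dj) in combinations(pts, 2):
        total += ((ti > tj) - (ti < tj)) * (di - dj)
    n = len(pts)
    return total / (n * (n - 1) / 2)

rng = random.Random(7)
x = [rng.random() for _ in range(12)]
y = [rng.random() for _ in range(9)]
z = [rng.random() for _ in range(5)]
gap = abs(u_km_from_ranks(x, y, z) - u_km_brute(x, y, z))
print(gap < 1e-12)
```

The rank version needs only three sorts, whereas the pairwise version grows quadratically in n, which matters when the statistic is recomputed over many simulation replications.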

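Consider testing H0 against H2. The U-statistic for testing H0 against H2 is

$$U_{PQD} = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \psi_{PQD}(T_i, \delta_i, T_j, \delta_j),$$

where

$$\psi_{PQD}(T_i, \delta_i, T_j, \delta_j) = \begin{cases} 1 & \text{if } T_i > T_j,\ \delta_i = 1 \ \text{ or } \ T_j > T_i,\ \delta_j = 1, \\ 0 & \text{otherwise}. \end{cases}$$

This test was proposed in [2]. If δ is missing for some units then ψPQD(Ti, δi, Tj, δj) cannot be defined for all pairs. Table 6 shows the pairs for which the kernel is defined completely and also the cases where it is not defined.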

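As in the earlier subsection, we define a modified kernel to take into account missing causes as follows.

$$\psi_{PQDm}(\cdot) = \begin{cases} 1 & \text{if the cause of the larger failure time is observed to be 1}, \\ 1/2 & \text{if the cause of the larger failure time is missing}, \\ 0 & \text{otherwise}. \end{cases}$$

Let the corresponding U-statistic be UPQDm

$$U_{PQDm} = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \psi_{PQDm}(T_i, \delta_i, O_i, T_j, \delta_j, O_j),$$

where the subscript m indicates the missing data situation. The expectations of UPQDm are given by

$$E(U_{PQDm}) = p\varphi + \frac{1 - p}{2} \quad \text{under } H_0$$

and

$$E(U_{PQDm}) \ge p\varphi + \frac{1 - p}{2}$$

under H2. The asymptotic variance of √n UPQDm under H0, denoted as Var(UPQDm), is

$$Var(U_{PQDm}) = \frac{4}{3}\, p^2 \varphi (1 - \varphi) + \frac{1}{3}\, p (1 - p).$$

Note that when p = 1, the variance simplifies to (4/3)φ(1−φ), which is given in [2]. From the central limit theorem of U-statistics [16] (see Text S5), it follows that UPQDm has an asymptotic normal distribution for large n.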

Theorem 2: Under H0, √n(UPQDm − E(UPQDm)) converges in distribution to N(0, σ²PQDm), where σ²PQDm = (4/3)p²φ(1−φ)+(1/3)p(1−p), as n→∞.

We refer to the supplementary material (Text S7) for the explicit derivation of E(UPQDm), Var(UPQDm) and the proof of Theorem 2.

We reject the null hypothesis for large values of √n(UPQDm − E(ÛPQDm))/σ̂PQDm, where E(ÛPQDm) and σ̂PQDm are obtained by replacing φ and p by their consistent estimators φ̂ and p̂, as mentioned earlier. Let Ri* denote the rank of Ti in the ordered observations (T1, T2, …, Tn). Then, it is easy to see that

$$\binom{n}{2} U_{PQDm} = \sum_{i:\, O_i = 1,\ \delta_i = 1} (R_i^* - 1) + \frac{1}{2} \sum_{i:\, O_i = 0} (R_i^* - 1).$$

Note that a one-sided test based on Ukm, where H0 is rejected for large values of Ukm, can also be used for testing H0 against H2 since E(Ukm)≥0 under H2. In fact, the one-sided test uses data on both T and δ in each pairwise comparison, while UPQDm uses only information on (T, δ) from one unit and T from the other in a pairwise comparison. We refer to the one-sided test based on Ukm as Ukm1.
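A rank-based computation of the PQD statistic might look as follows in Python. This is a sketch under stated assumptions, not the authors' SAS code: the kernel is taken to score each pair by the cause of the unit with the larger failure time (1 for observed cause 1, 1/2 for a missing cause, 0 otherwise), the null mean is taken to be pφ + (1−p)/2, which follows from that kernel, and p̂ and φ̂ are the simple sample proportions. The function names and the `toy` data are hypothetical.

```python
import math

def u_pqdm(data):
    """data: list of (t, delta, observed). Uses the ranks R*_i of the failure times."""
    n = len(data)
    order = sorted(range(n), key=lambda i: data[i][0])
    total = 0.0
    for rank, i in enumerate(order, start=1):        # rank is R*_i (no ties assumed)
        _, delta, obs = data[i]
        weight = delta if obs else 0.5               # missing cause scored 1/2
        total += weight * (rank - 1)                 # unit i is the larger one in rank - 1 pairs
    return total / (n * (n - 1) / 2)

def z_pqdm(data):
    """Standardised statistic sqrt(n) * (U - estimated null mean) / sigma_hat."""
    n = len(data)
    n_obs = sum(obs for _, _, obs in data)           # assumes at least one observed cause
    p_hat = n_obs / n
    phi_hat = sum(delta for _, delta, obs in data if obs) / n_obs
    null_mean = p_hat * phi_hat + (1 - p_hat) / 2    # assumed form of E(U_PQDm) under H0
    sigma2 = (4 / 3) * p_hat ** 2 * phi_hat * (1 - phi_hat) + (1 / 3) * p_hat * (1 - p_hat)
    return math.sqrt(n) * (u_pqdm(data) - null_mean) / math.sqrt(sigma2)

# Hypothetical toy data: (time, cause, cause observed?)
toy = [(1.0, 0, 1), (2.0, None, 0), (3.0, 1, 1), (4.0, 1, 1)]
u = u_pqdm(toy)
z = z_pqdm(toy)
print(u, z)
```

H0 would be rejected against H2 at the 5% level when z exceeds 1.64. Each pair contributes through the larger unit only, which is the sense in which this statistic uses (T, δ) from one unit and only T from the other.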

## Supporting Information

### Text S1.

SAS source code for Example 1

https://doi.org/10.1371/journal.pone.0001255.s001

(0.06 MB DOC)

### Text S2.

SAS source code for Example 2

https://doi.org/10.1371/journal.pone.0001255.s002

(0.06 MB DOC)

### Text S3.

SAS source code for Example 3

https://doi.org/10.1371/journal.pone.0001255.s003

(0.03 MB DOC)

### Text S4.

A short guide on the use of SAS codes

https://doi.org/10.1371/journal.pone.0001255.s004

(0.02 MB DOC)

### Text S5.

Central limit theorem for U-statistics

https://doi.org/10.1371/journal.pone.0001255.s005

(0.07 MB DOC)

### Text S6.

Derivation of E(Ukm), Var(Ukm) and proof of Theorem 1

https://doi.org/10.1371/journal.pone.0001255.s006

(0.14 MB DOC)

### Text S7.

Derivation of E(UPQDm), Var(UPQDm) and proof of Theorem 2

https://doi.org/10.1371/journal.pone.0001255.s007

(0.07 MB DOC)

## Acknowledgments

The authors would like to thank the Academic Editor for careful reading and comments which helped in improving the manuscript.

## Author Contributions

Other: Contributed equally to the development of theory and simulation studies: SK ID. Contributed equally to the writing of the manuscript: SK ID.

## References

1. Cooke RM (1996) The design of reliability databases, part II. Reliability engineering and system safety 51: 209–223.
2. Dewan I, Deshpande JV, Kulathinal SB (2004) On testing dependence between time to failure and cause of failure via conditional probabilities. Scandinavian Journal of Statistics 31: 79–91.
3. Lapidus G, Braddock M, Schwartz R, Banco L, Jacobs L (1994) Accuracy of fatal motorcycle injury reporting on death certificates. Accident Anal. Prevention 26: 535–542.
4. Kodel RK, Chen JJ (1987) Handling cause of death in equivocal cases using the EM algorithm. Comm. Stats.-Theory and Methods 16: 2565–2603.
5. Dinse GE (1986) Nonparametric prevalence and mortality estimators for animal experiments with incomplete cause-of-death data. J. Amer. Statist. Assoc. 81: 328–336.
6. Dewanji A (1992) A note on a test for competing risks with general missing pattern in failure types. Biometrics 79: 855–857.
7. Goetghebeur E, Ryan L (1995) Analysis of competing risks survival data when some failure types are missing. Biometrika 82: 821–833.
8. Dewanji A, Sengupta D (2003) Estimation of competing risks with general missing pattern in failure types. Biometrics 59: 1063–1070.
9. Lu K, Tsiatis A (2005) Comparison between two partial likelihood approaches for the competing risks model with missing cause of failure. Lifetime Data Anal. 11: 29–40.
10. Goetghebeur E, Ryan L (1990) A modified log rank test for competing risks with missing failure type. Biometrika 77: 207–211.
11. Miyakawa M (1984) Analysis of incomplete data in competing risks model. IEEE Transactions in Reliability 33: 293–296.
12. Kundu D, Basu S (2000) Analysis of incomplete data in presence of competing risks. J. Statist. Plann. Inference 87: 221–239.
13. Rubin DR (1976) Inference and missing data. Biometrika 63: 581–592.
14. Nair VN (1993) Bounds for reliability estimation under dependent censoring. International Stat. Review 61: 169–182.
15. SAS Institute Inc (2000) SAS/STAT user's guide, version 8.
16. Puri ML, Sen PK (1971) Nonparametric methods in multivariate analysis. John Wiley, New York-London-Sydney.