A score test for comparing cross-sectional survival data with a fraction of non-susceptible patients and its application in clinical immunology

Objectives In cross-sectional studies of time-to-event data collected by patient examinations at a single random point in time, a fraction of them will not experience the event regardless of the length of the follow-up time. This is the case in clinical immunology studies that include a mixed population, with both immune-reactive and immune-tolerant (or non-susceptible) patients. In these cases, classical tests of current status data may perform poorly. New methods for testing these data are needed. Methods In the two-sample comparison setting, we propose a score test for testing the null hypothesis that survival does not differ in either the non-susceptible fraction or the time-to-event distribution among the susceptible fraction. Results In a wide range of scenarios, simulation results show interesting improvements in power for the proposed score test compared to the logrank-type test in most of the configurations we investigated. In a cross-sectional study of drug immunogenicity among treated multiple sclerosis patients, the proposed score test reveals that gender is associated with the immunogenicity of interferon.


Introduction
Biopharmaceutical products (BP) are currently one of the fastest growing groups of drugs used in clinical immunology to treat patients with immune-based disorders, such as multiple sclerosis, rheumatoid arthritis, and inflammatory bowel diseases. These products, however, may lead to the development of antibodies (either binding or neutralizing) directed against the drug-known as anti-drug antibodies or ADAs-with treatment failures as a potential clinical consequence [1]. The prediction of ADA occurrence is a challenging problem today. Limited resources lead most studies conducted by clinical laboratories searching for the risk factors of ADAs to rely on cross-sectional sampling in which patients receiving BP are tested for ADA status (positive/negative) at a single random point in time. ADA status, however, is an active process that depends upon the dynamics of ADA production by (T-cell dependent) B-lymphocyte clones. Specific methodologies for time-to-event outcomes must be used to analyze such data. In addition, no information is available for individual patients between the first drug administration and the monitoring time point, which means that the only information available about the time-to-occurrence of ADA is whether it exceeds the random monitoring time point. This special kind of data, which is known as current status data (or case I interval-censored data), requires particular methods that differ substantially from those used for classical right-censored data [2,3].
Several k-sample tests have been proposed to compare time-to-event distributions of current status data. They rely on efficient score statistics that can be expressed as either rank tests or weighted logrank-type tests (for a review, see [3]). These statistics allow investigators to test for equality of hazard functions against constant or non-constant (over time) hazard ratio alternatives [3,4]. In general, the weighted logrank-type tests use weight functions that are either motivated by the expected deviation from the null hypothesis or model-based with some optimal properties for a particular family of alternatives (e.g. the G ρ extended family of Harrington-Fleming survival distributions as seen in [5,6]).
However, all these statistics rely on the assumption of so-called proper survival distributions. Broadly speaking, it is assumed that if the follow-up time is long enough, all patients will eventually experience the event of interest. This is obviously not the case in an investigation of BP immunogenicity; instead, we expect that a fraction of the patients receiving BP are immune-tolerant (non-susceptible) to the drugs and will not experience ADA at all during the long-term follow-up. The other patients are immune-reactive (susceptible) to ADA, and their time-to-ADA detection depends on the dynamics of ADA clonal production by B lymphocytes. Thus, the bioclinical factors studied may be associated with differences either in the proportion of immune-tolerant patients or in the distribution of the time-to-ADA occurrence among immune-reactive patients. For classical right-censored data, this problem of a mixed population has been tackled mainly from two different frameworks, one relying on two-component mixture cure models and the other on bounded cumulative hazard models [7].
The first approach considers that the study population is a mixture of two groups of patients: non-susceptible and susceptible. This formulation has led to proposals for various parametric and non-parametric models [7][8][9][10]. The second approach, called a promotion time cure model [11], assumes that the observed time-to-event is the first of some latent event time and has interesting mechanistic interpretations in various biological fields, such as oncology. Such a first-activation scheme with a Poisson process leads to the bounded cumulative hazard model introduced by Yakovlev and Tsodikov [12][13][14].
Although many estimates and testing methods for these cure models have been proposed, few have been implemented in classical survival analysis software (such as R or SAS), and none is designed to cope with interval-censored data. The latter fact explains why, at best, investigators today use the interval-censored tests that have previously been implemented, but ignore the mixed population issue.
We recently faced this methodological problem in analyzing the immunogenicity of interferon among multiple sclerosis patients. It prompted us to develop a procedure for testing the of an analysis plan. The analysis plan should explain the purpose of the use of the data and confirm the intention to use the data only for replication studies concerning anti-drug antibodies, since this is the limitation of the ethical permission on how this data can be used. The contact persons of the ABIRISK steering committee to whom the requests should be sent are Pierre Dönnes (pierre@scicross.com) and Sophie Tourdot (sophie.tourdot@inserm.fr).
null hypothesis of equality of the two survival functions with a fraction of non-susceptible patients.
The main purpose of this work is thus to provide a simple test able to detect survival differences, for use by practitioners with cross-sectional data from a mixed population.
In the Methods section, we first introduce a semi-parametric improper survival model that allows us to describe changes in the non-susceptible fraction and/or in the survival distribution in the susceptible fraction. We then present a score test for the two-sample problem to test the null hypothesis that the variable under consideration has no effect on either the proportion of susceptible patients or the time-to-event distribution. This test relies on the components of the score statistic obtained under the nu ll hypothesis and can be re-expressed as a vector of linear rank statistics. In the Results section, the Simulation study subsection reports the results of simulation experiments performed to study the power properties of the proposed test and compare them with those of the classical logrank test for current status data. In the second Results subsection, we use the proposed test to analyze the predictive effect of gender on the occurrence of ADA. In the last section, the Discussion, we review the advantages and limitations of the proposed test and its potential extensions. We also give some advice for its practical use.

Notations
Let the continuous random variables T and C represent the unobservable failure and monitoring time, respectively. Let f(t) denote the probability density function, and S(t) (resp. " SðtÞ) the survival function (resp. cumulative distribution function) of T. The hazard function (or the For current status data, we observe only whether the event of interest occurred before some single random monitoring time. Here, for each patient i (i = 1, . . ., n), Z i is a binary variable that indicates group membership (Z i = 0 or 1, with 0 the reference group). Thus, If an event occurred, we know that T i belongs to [0, C i ]; otherwise T i belongs to [C i , 1 + [. Here, we assume that the censoring and the failure times are independent. We also assume that the censoring times are independent and identically distributed random variables for all subjects.

Improper survival model
Rationale for considering a bounded cumulative hazard model. The biological mechanisms of ADA immunogenicity as well as pragmatic statistical considerations led us to consider a bounded cumulative model. The main idea is to model the distribution of the ADA detection time through a simplified mechanistic immunological model whereby each individual is potentially able to produce ADAs that arise from the activation of latent (or unobservable) BP-specific (T-dependent) B-cell clones. At the cellular level, each one of these clones can emerge and become an immunocompetent ADA-producing clone. Once a BP-specific B-cell clone is activated, its production leads sooner or later to ADA detection. ADA status becomes positive for the first time as soon as any of these immunocompetent BP-specific B-cell clones produces sufficient antibodies to reach the detection threshold.
From a statistical point of view, assuming relevant probability distributions for both the number of latent B-cell clones and the time-to-ADA detection, we can deduce the marginal (or population) survival distribution of this time to detection. In the spirit of the stochastic models developed in the seminal work of Yakovlev and Tsodikov [12], we assume that the number of B-cell clones is distributed as a Poisson distribution and that the clones are independent. This leads to the bounded cumulative model presented just below.
Survival model. In this work, we consider the following semi-parametric improper survival model such that for patient i we have: where Λ(t) is an unspecified increasing positive function from zero to infinity. For the hazard functions, h 0 (t) and h 1 (t): The survival function S z (t) is improper in the sense that lim t!+1 S z (t) > 0. Its limiting value is called the tail defect (sometimes referred as the plateau) and here equals e −θe αz . In our setting, it represents the probability of being immune-tolerant. Changes in the immune-tolerant fraction and in the time-to-event distribution (here the dynamics of ADA production) are modeled through the parameters of interest α and β. Thus when α = 0, the two groups have the same plateau (proportion of non-susceptible patients). And when β = 0, Model (1) is a proportional hazard model with a relative risk constant over time with a different plateau value.
Under Model (1), the simplified log-likelihood for the n observed current status data is: Score test for H 0 : α = β = 0 The null hypothesis H 0 : α = β = 0 to be tested is the equality of the two improper survival distributions, that is, that group membership has no effect on either time-to-event distribution or the immune-tolerant fraction. Under the null hypothesis H 0 , the score vector has the following components U = (U α , U β ) where: Thus, And, Thus, As shown from the preceding two formulas, these score statistics can be rewritten as linear rank statistics with the following ranking-like functions w α,i and w β,i for subject i, which depend upon the cumulative distribution function under the null hypothesis.
where " S H 0 ð:Þ is the cumulative distribution function under the null hypothesis H 0 .
In practice, we replace " S H 0 ðtÞ and θ by" S H 0 ðtÞ andŷ applying the following ad hoc approach. For interval-censored data, a non-parametric maximum likelihood estimator (NPMLE) for the cumulative distribution function under the null hypothesis" S H 0 ðtÞ can be obtained by running the classical pooled-adjacent-violators algorithm on the full dataset [15]. This estimator uses the estimated jumps (probability mass) occurring over the so-called innermost intervals [16]. In our setting with an improper survival distribution, we arbitrarily consider that for the last innermost interval (the upper limit of which is infinite), the jump is set to zero. Thus, under the null hypothesis, an estimator of θ is obtained byŷ ¼ À logð" S H 0 ð1ÞÞ where" S H 0 ð1Þ is the estimate of the tail defect and its value is the difference between the last estimate of 1 and " S H 0 ðtÞ. Moreover, when " ÞÞ. This approach relies on the hypothesis that the censoring mechanism verifies a condition of sufficient follow-up, that is, that the susceptible subjects will experience the event within the follow-up period [8]. In practice, it implies that we should allow a period of observation that is long enough to detect the presence of immune-tolerant individuals in the study population (i.e. the last interval should be event-free).
Since the statistics U α and U β can be expressed as linear rank statistics, we can obtain their permutational variances and the covariance under the null hypothesis (see [2,17]): Finally, the proposed test statistic of H 0 is given by: where V is the matrix of variance-covariance obtained with the preceding formulas. Under the null hypothesis, the statistic T H 0 is asymptotically distributed as a chi-square with two degrees of freedom.

Simulation study
Protocol. To examine the properties of this test, we conducted Monte-Carlo simulations with data generated under either a bounded cumulative hazard model or a two-component mixture cure model. We compared the proposed score test to the logrank-type test for interval-censored data [18], which is the test statistic used most often for interval-censored data. Both test the same null hypothesis of equality of the two survival distributions. Note that the logrank-type test supposes a mis-specified model in which the cure fraction is not considered. We review the formula for the logrank-type test in the appendix. In this simulation study, we investigated the impact of various percentages of censoring rates, covariate imbalances and restricted follow-ups.
Survival times were generated according to the two models described below. The first was a bounded cumulative model such that: The other was a two-component mixture model such that: S(t j Z = z) = e −θe αz + (1 − e −θe αz ) e −te βz . The variable Z was generated according to a Bernoulli distribution of parameter ξ. Unless otherwise stated, ξ = 0.5. The censoring (monitoring) times, C, were independently generated from an exponential distribution with rate parameter λ C ; its value was chosen according to the desired percentage of censored susceptible observations. The total number of subjects n tot was set at 400. The following configurations were considered: β varied from −3.2 to 3.2 with a pitch of 0.4, and α varied from −0.5 to 0.5 with a pitch of 0.25. The plateau values for the reference group, τ 0 = S Z = 0 (1) = e −θ , were set at 0.3, 0.5, and 0.7. We indicated for each value of α the plateau value for the group Z = 1 such that t 1 ¼ t expðaÞ 0 . The censoring rates (p) used were 20% and 40%. Here, p refers only to the percentage of censored observations among the susceptible subjects. Thus, the total percentage of censored subjects for the reference group is equal to τ 0 + p(1 − τ 0 ). The results are presented in Tables 1-6 and in Tables A-F in S1 File. For unbalanced designs, ξ was set at 0.3 and 0.7. The results are presented in Tables 7-8 and in Tables G-H in S1 File.         A score test for current status data with a fraction of non-susceptible patients Table 5. Bounded cumulative hazard model, exponential censoring, τ 0 = 50%, p = 40%, n tot = 400, ξ = 0.5.      We also evaluated the impact of a restricted follow-up on the test's properties by generating censoring times from a uniform distribution lying in a small interval (0, U max ), with U max varying according to the β value and set to ensure that 20% of the susceptible patients did not have a sufficient follow-up. The results are reported in Table I (S1 File). For the latter two scenarios, the plateau value was 30% and p = 20%.
As stated above, we used a non-parametric maximum likelihood estimator for the cumulative distribution function under the null hypothesis (" S H 0 ðtÞ) for the proposed statistic. This estimate was obtained by the pooled-adjacent-violators algorithm available from the 'isotone' package [19]. For the classical logrank-type test adapted to interval-censored data, we used the 'gLRT2' function from the 'glrt' package [20]. The R-function for computing the proposed score test is available from the corresponding author on request.
For each scenario, we generated 5,000 datasets. We considered a two-sided test with a 0.05 significance level. The results, in percentages, are summarized in the tables presented below. We report the probabilities of rejecting H 0 for the proposed score test together with the Table 7. Bounded cumulative hazard model, exponential censoring and unbalanced design, τ 0 = 30%, p = 20%, n tot = 400, ξ = 0.3.   2) (-7.9) (-6.7) (-6.9) (-6.7) (-6.9)   (Tables 1-6), the estimated values for the proposed score test under the null hypothesis (α = β = 0) were within the random sampling fluctuation of the nominal significance level based on the chi-square distribution with two degrees of freedom.
As seen in Tables 1-3 (20% censoring, balanced design), the proposed score test was more powerful than the logrank-type test in most of the scenarios. However, as expected, when β = 0, the improper model was a proportional hazard model and the logrank-type test was always more powerful. Table 1 shows that power was up to 53.1% higher for the proposed score test than for the logrank-type test, and never more than 9.6% lower. In Tables 2 and 3, power for the proposed score test was up to 41.8% and 28.1% higher and no more than 8.8% and 11.2% lower, respectively. For any given configuration, power decreased as plateau value increased. This reduced power is not surprising since fewer events are expected with a higher plateau value (as in the configurations in Tables 2 and 3). Moreover, for configurations with low values of β, power for the proposed score is lower than for the logrank-type test, as reported above for β = 0.
It should be noted that for any given combination of α and β, the power for the proposed score test is higher than for the logrank-type test when αβ > 0; on the other hand, the differences in power between these two score tests are highest in favor of the proposed score test when αβ < 0.
These power trends for the proposed score test may be explained by the model from which the test is derived: it supports two parameters for only one covariate. This means that the two parameters compete with each other to some extent. Moreover, a first-order Taylor expansion of β around zero gives us: h Z = 1 (t) % h Z = 0 (t) e α + β(1 − Λ(t)) . We then have a time-dependent hazard ratio. Broadly speaking, at earlier event times (when Λ(t) < 1), α and β each offsets the other when they have opposite signs.
High observed differences in power when αβ < 0 can also be explained by the fact that these configurations are clearly not favorable to the logrank-type test because the survival curves cross. For example, when α is negative and β is positive, the reference group has a lower plateau value and a lower risk of event.
From Tables 4-6, which used 40% censoring, we see the same trends in comparing the power of these two tests. Note that power for the proposed score test compared to that for the logrank-type test was slightly lower than it was in the same scenario with 20% censoring. It was, however, still interesting: up to 45.3% higher and no more than 8.5% lower (τ 0 = 30%). In general, for a given configuration, the power of the proposed score test was higher with a censoring rate of 40% compared to 20%. This increase in power when the censoring rate goes up may be explained by the hazard ratio form when we have a mix of positive and negative contributions associated with β and dependent on the follow-up scheme.
Looking at the unbalanced designs (ξ = 0.3 and ξ = 0.7) in Tables 7-8, we see that the differences in power were quite similar to those observed in the balanced designs. The proposed score test performed better with ξ = 0.3 than in situations with ξ = 0.7, depending on the sign of αβ.
For all the simulated scenarios under a mixture cure model (Tables A-H in S1 File), the estimated values for the proposed score test under the null hypothesis were within the random sampling fluctuation of the nominal significance level. When we look at the power results and differences in power compared with the logrank test, we see that the results were slightly better than those obtained under the cumulative bounded hazard model. As an example, in Table A (S1 File) with p = 20%, τ 0 = 30%, ξ = 0.5, power for the proposed score test, compared to that with the logrank-type test, was as much as 58.1% higher and no more than 8.7% lower.
In the case where the non-sufficient follow-up condition was violated (Table I in S1 File), the type I error rate was close to the nominal significance level (4.8%). The performances of the proposed score test were slightly lower than that obtained with sufficient follow-up and comparable censoring rates. Moreover, the power gains of the proposed score test compared with the logrank-type test were also lower in this case.

Testing the influence of gender on ADA occurrence
This dataset came from a cross-sectional study performed by the Immunology Reference Laboratory of Düsseldorf. The study, which was approved by the ethics committee of the medical faculty of the university of Düsseldorf, was part of a databank created by the European ABIR-ISK consortium [21]. The objective of this consortium is to identify and decipher the impact of bioclinical factors on the immunogenicity of BP across various immune diseases and drugs. In this work, we focused on immunogenicity of interferon products-intramuscular and subcutaneous-in newly treated multiple sclerosis patients. Here, we considered biotherapeutic-naive adult patients whose biological samples were taken within the first 21 months of treatment to be sure to capture potential late events, given that the classical window of appearance for ADAs is 18 months. The dataset comprised patients taking interferon-β: 63.3% took interferon-β-1a and 36.7% interferon-β-1b. Interferon-β-1b is only administered subcutaneously whereas interferon-β-1a can be administered subcutaneously (55.2%) or intramuscularly (44.8%). We expected a non-negligible proportion of immune-tolerant patients in this cohort. One sample per patient was provided. In all, 969 patients were analyzed, 9.9% of whom had developed ADAs (n ADA = 96). The sample included 72.6% of women (n women = 703). The variable Gender was tested. Fig 1 displays the non-parametric maximum likelihood estimator of the time-to-ADA survival distribution for each group. The proposed score test produces a significant difference at the global level of 5% with a value of 7.54 (p-value = 0.02), whereas the logrank-type test was not significant with a value of 3.33 (p-value = 0.07).

Discussion
Due to time and budget constraints, cross-sectional survival studies often use designs in which patients are randomly sampled at a single point in time. One example is the investigation of BP immunogenicity in a study population expected to include a mixture of immune-tolerant and immune-reactive patients. The analysis of these data requires an extension of classical test statistics for current status data since these statistics are not designed to cope with mixed populations. It is worth noting that this problem is not specific to clinical immunogenicity but is also encountered in oncology (early-stage cancer with cured patients) and infectious diseases (immune individuals, vaccinology), among other fields. In this paper, we proposed a novel two-sample test based on an improper survival model that is well suited for detecting departures from the equality of survival distributions in mixed patient populations. Here, the choice to use a bounded cumulative hazard model was motivated by its interesting mechanistic interpretation of ADA immunogenicity. The proposed score test is designed to detect changes either in the proportion of susceptible patients or in the time-to-event distribution among non-susceptible patients. As seen from our simulations, under the null hypothesis, the proposed test maintains a correct type I error across all the configurations we studied. Looking at the simulation results under the alternative hypotheses, we see that the power of the proposed test is better than that of the logrank-type test in most situations. For a plateau of 30% and a censoring rate of 20%, power increases by up to 53% and decreases to no more than 10% as compared to the logrank-type test. Our test is derived under a model that supports two parameters for only one covariate. Thus, the two parameters compete against each other to some extent. This explains why the power of the proposed score test is lower in situations where there is an opposite directional effect on the time-to-event distribution and the non-susceptible fraction: belonging to group 1, compared to the reference group, entails a lower risk of susceptibility but a higher risk of event for those who are susceptible. The simulation results also show that when censoring is high among the susceptible individuals, the power gains relative to the logrank-type test are preserved. When the sufficient follow-up condition is not verified, that is, when the length of follow-up is shorter than the window of time during which an event can occur, the power of the test is reduced. In that case, its estimated type I error rate is close to the nominal level.
In our immunogenicity study, using our proposed test, we identified a significant association between gender and ADA occurrence. In the analysis considered in this work, an immune-tolerant fraction exists; continuing follow-up enabled many patients to be tested after the first 18 months and therefore allowed an interpretable time sequence for ADA occurrence. We determined that gender plays a role in ADA immunogenicity, a role undetected by the classical logrank-type test but reported in other interferon-β cohorts treated for multiple sclerosis [22]. Gender may explain the lack of statistical significance of the logrank-type test, for it mainly acts on the dynamics of ADA production. As seen in Fig 1, men have a higher risk of developing ADA than women and a similar risk of susceptibility. A limitation of the proposed test is that it can detect only a global difference between groups; it cannot distinguish between a change in the non-susceptible fraction or the survival distribution among the susceptible individuals. Thus, further additional work is needed to derive simple solutions to resolve this problem. Nonetheless, in view of the improved power obtained when the survival distribution changes among susceptible individuals, the use of the proposed score test can be recommended for widespread use when a fraction of nonsusceptible individuals is expected. From a practical perspective, it should be borne in mind that this test performs best in situations with a sufficient length of follow-up, that is, a window of observation long enough for the last potential event to occur within it. In other words, we should provide a follow-up adequate for detecting the presence of non-susceptible individuals in the study population. As with the classical logrank-type test, the proposed score test relies upon the assumption that the distribution of the monitoring times is the same across the different groups to be compared. In our study, we can reasonably consider that there is no planned difference in monitoring times between men and women. It should also be noted that the proposed test can be extended to take other factors into account, by developing a stratified version with strata defined by the levels of the factors. Finally, we think that our proposed test is both easy to implement and valuable in cross-sectional survival studies with mixed populations, given that the logrank-type test can be ineffective in this situation. τ 0 = 30%, p = 20%, n tot = 400, ξ = 0.7. Table I, Bounded cumulative hazard model, uniform censoring and insufficient follow-up, τ 0 = 30%, n tot = 400, ξ = 0.5. (PDF)