Testing for Mechanistic Interactions in Long-Term Follow-Up Studies

In follow-up studies, interactions are often assessed by including a cross-product term in a (multiplicative) Cox model. However, epidemiologists and clinicians often misinterpret a significant multiplicative interaction as a genuine mechanistic interaction. Though indices specific to mechanistic interactions have been proposed, including the ‘relative excess risk due to interaction’ (RERI) and the ‘peril ratio index of synergy based on multiplicativity’ (PRISM), these indices assume no loss to follow-up and no competing death in a study. In this paper, the authors propose a novel ‘mechanistic interaction test’ (MIT) for censored data. Monte-Carlo simulation shows that when the hazard curves are proportional to, non-proportional to, or even crossing over one another, the proposed MIT can maintain reasonably accurate type I error rates for censored data. It has far greater power than the modified RERI and PRISM tests (modified for censored data scenarios). To test mechanistic interactions in censored data, we recommend using MIT in light of its desirable statistical properties.


Introduction
The assessment of interactions is important for epidemiologists and clinicians alike. Epidemiologists are interested in knowing whether the combined effect of two risk factors is greater than the effect expected when considering these same risk factors separately. For example, Jensen et al [1] found that the combination of obesity and heavy smoking creates a risk for acute coronary syndrome which is greater than the risk posed by either of these factors in isolation. On the other hand, clinicians are interested in knowing whether a particular subgroup of patients will benefit more from a new therapeutic agent. For example, Tsao et al [2] found that the presence of an EGFR mutation may increase responsiveness to erlotinib for patients with non-small-cell lung cancer. In both of these examples, the combined effect of two factors exceeds what would be expected from either factor alone. However, does this mean that the two factors interact mechanistically to bring about the outcome?
The sufficient cause framework conceptualizes causation as a collection of causal mechanisms, each requiring component causes to operate [3]. If two factors participate in the same causal mechanism, we will say there is synergism between the two factors in the sufficient cause sense, i.e., causal co-action, causal mechanistic interaction, or simply mechanistic interaction (the term we adopt in this paper). This should not be confused with the multiplicative interaction often assessed by including a cross-product term in a multiplicative model. The (multiplicative) Cox model, which assumes proportional hazards, is probably the statistical method most commonly used for follow-up data. The use of the Cox model is currently so prevalent that many epidemiologists and clinicians will mistake a significant multiplicative interaction in a Cox model for a genuine mechanistic interaction.
Recently, indices specific to mechanistic interactions have been proposed, including the 'relative excess risk due to interaction' (RERI) [4][5][6][7] and the 'peril ratio index of synergy based on multiplicativity' (PRISM) [8]. RERI is an index based on risk-ratio additivity, and a mechanistic interaction can be declared when RERI is statistically larger than one. PRISM is an index based on peril-ratio multiplicativity, and a mechanistic interaction can be declared when log PRISM is statistically different from zero. PRISM has a less stringent threshold to detect mechanistic interactions as compared to RERI [8]. Both indices assume no loss to follow up and no competing death in a study. But in fact, data censoring (due to loss to follow up, competing death, or simply study closure) is inevitable in any long-term follow-up study.
In this paper, we propose a novel statistical test, the 'mechanistic interaction test' (MIT), for censored data. We will examine the statistical properties of the MIT using Monte-Carlo simulations and demonstrate its use based upon real data.

Sufficient Causes and Mechanistic Interactions
In this section, we give an overview of the sufficient cause framework and its relation to mechanistic interactions. In a sufficient component cause model, a sufficient cause contains a combination of component causes. Any sufficient cause with all components completed is sufficient to cause the disease. For two dichotomous exposures (X and Z), sufficient causes can be classified into a total of nine classes, including one 'all-unknown' class ($U_1$), two X-only classes ($U_2$ and $U_3$), two Z-only classes ($U_4$ and $U_5$), and four interaction classes ($U_6$ to $U_9$) (see Fig. 1) [8,9]. We will say there is mechanistic interaction between X and Z if at least one of the four interaction classes is present.
Lee [8] pointed out that as follow-up time approaches infinity, $RR_{x,z}$ will tend toward a value of one for each and every x, z ∈ {0, 1}, RERI will tend toward zero (perfect risk-ratio additivity), and at this point a test for RERI > 1 will have no hope of achieving significance. Lee [8] therefore proposed an alternative metric of risk: the 'peril', which is simply an exponentiated cumulative rate or an inverse of a survival proportion (risk complement). The 'peril ratio index of synergy based on multiplicativity' (PRISM) was defined as

$$\mathrm{PRISM} = \frac{\mathrm{Peril}_{1,1} \times \mathrm{Peril}_{0,0}}{\mathrm{Peril}_{1,0} \times \mathrm{Peril}_{0,1}} = \frac{(1-\mathrm{Risk}_{1,1})^{-1} \times (1-\mathrm{Risk}_{0,0})^{-1}}{(1-\mathrm{Risk}_{1,0})^{-1} \times (1-\mathrm{Risk}_{0,1})^{-1}}.$$

Under the no-redundancy assumption [9], Lee [8] showed that a test for log PRISM ≠ 0 is a global test for mechanistic interaction (at least one of $U_6$ to $U_9$ is present).
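The limiting behaviour described above can be checked numerically. The sketch below is only an illustration: it assumes the standard definition RERI = RR(1,1) − RR(1,0) − RR(0,1) + 1 (not spelled out in this section), and it uses hypothetical constant hazard rates so that the risk at time t is 1 − exp(−h·t).

```python
import math

def reri(risk):
    """RERI from risks keyed by (x, z), with RR_{x,z} = risk[x, z] / risk[0, 0]."""
    r00 = risk[(0, 0)]
    return (risk[(1, 1)] - risk[(1, 0)] - risk[(0, 1)] + r00) / r00

def prism(risk):
    """PRISM as the ratio of perils, where peril = 1 / (1 - risk)."""
    peril = {k: 1.0 / (1.0 - r) for k, r in risk.items()}
    return peril[(1, 1)] * peril[(0, 0)] / (peril[(1, 0)] * peril[(0, 1)])

# Hypothetical constant hazards (per year) with a positive additive departure:
# 0.6 - 0.2 - 0.3 + 0.1 = 0.2
HAZARD = {(0, 0): 0.1, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.6}

def risks_at(t):
    """Cumulative risk by time t under a constant hazard for each profile."""
    return {k: 1.0 - math.exp(-h * t) for k, h in HAZARD.items()}

early, late = risks_at(1.0), risks_at(50.0)
print("RERI at t=1: ", reri(early))   # above one for these hazards
print("RERI at t=50:", reri(late))    # close to zero: all risks approach one
print("log PRISM at t=1:", math.log(prism(early)))  # equals 0.2 * t here
```

Because the peril is an exponentiated cumulative rate, log PRISM in this toy setting is simply the additive departure of the hazards times t, so it does not collapse toward its null value the way RERI does.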

Mechanistic Interaction Test
In a follow-up study, we assume that the exposure status is time-invariant. Note that here we allow the study to have censored observations, provided that the censoring mechanism is independent of the event process. Assuming that there is no confounding, selection bias, or measurement error in the study, the association between the two exposures and the disease should reflect the genuine causal effect of the exposures. Let the hazard rate at follow-up time t for people with exposure profile of X = x and Z = z be denoted by $h_{x,z}(t)$. Define the 'interaction contrast' (IC) as an index of departure from hazard-rate additivity, that is,

$$IC(t) = h_{1,1}(t) - h_{1,0}(t) - h_{0,1}(t) + h_{0,0}(t).$$

As shown in Lee [8,9], IC(t) = 0 for every t under the null hypothesis of no mechanistic interaction. [By contrast, multiplicative interaction is assessed by departure from hazard-rate multiplicativity (such as in Cox regression), that is,

$$\phi(t) = \frac{h_{1,1}(t)\, h_{0,0}(t)}{h_{1,0}(t)\, h_{0,1}(t)}.$$

Under the null hypothesis of no multiplicative interaction, ϕ(t) = 1 for every t [10]. For more elaboration on the difference between mechanistic and multiplicative interactions, see S1 Appendix.] Assume that a follow-up study yields a total of J (j = 1, 2, ..., J) 'risk sets'. (A risk set is the set of subjects contracting a disease at a particular point in time, together with all those subjects at risk for the disease at that time.) Define

$$IC_j = \frac{d_{11j}}{n_{11j}} - \frac{d_{10j}}{n_{10j}} - \frac{d_{01j}}{n_{01j}} + \frac{d_{00j}}{n_{00j}}$$

for j = 1, 2, ..., J, where $d_{xzj}$ ($n_{xzj}$) is the number of diseased subjects (subjects at risk) with an exposure profile of X = x and Z = z at the jth risk set. Under the null hypothesis, we therefore have $E(IC_j) = 0$ for j = 1, 2, ..., J. The proposed MIT is built on the weighted sum $\sum_{j=1}^{J} w_j IC_j$, where $w_j$ is the weight for the jth risk set (to be discussed shortly). Asymptotically for large J (e.g., J > 30, that is, a large number of risk sets), MIT is distributed as a chi-square distribution with one degree of freedom under the null hypothesis of no mechanistic interaction.
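To make the distinction between the two null hypotheses concrete, the following sketch evaluates both indices at a single time point. It reads 'departure from hazard-rate additivity' as IC = h11 − h10 − h01 + h00 and 'departure from hazard-rate multiplicativity' as ϕ = (h11 × h00) / (h10 × h01); the hazard values are hypothetical and chosen only for illustration.

```python
import math

def interaction_contrast(h):
    """Additive contrast: h11 - h10 - h01 + h00 (zero under hazard-rate additivity)."""
    return h[(1, 1)] - h[(1, 0)] - h[(0, 1)] + h[(0, 0)]

def multiplicative_ratio(h):
    """Ratio (h11 * h00) / (h10 * h01) (one under hazard-rate multiplicativity)."""
    return h[(1, 1)] * h[(0, 0)] / (h[(1, 0)] * h[(0, 1)])

# Hypothetical hazards at one time point, keyed by (x, z).
additive = {(0, 0): 0.1, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.4}
multiplicative = {(0, 0): 0.1, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.6}

for name, h in [("additive", additive), ("multiplicative", multiplicative)]:
    print(name, "IC =", round(interaction_contrast(h), 6),
          "phi =", round(multiplicative_ratio(h), 6))
```

In the first set of hazards IC is zero while ϕ is below one, and in the second ϕ is one while IC is positive, so a test of the cross-product term and a test of IC(t) = 0 address different questions.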
Like the PRISM test [8], the proposed MIT is a global test; a significant test result implies the presence of at least one of the four interaction classes.
When hazard rates are approximately the same for each exposure profile at each follow-up time, the weight $w_j$ is inversely proportional to the variance of $IC_j$. Such a weighting system thus makes optimal use of the information contained in different risk sets.
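The risk-set calculation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact formulation: it assumes the statistic is the squared weighted sum of the risk-set contrasts divided by an estimate of its variance, that the variance of each contrast is estimated by summing binomial variances d(n − d)/n³ over the four exposure profiles, and that the weight is the reciprocal of the sum of 1/n over the four profiles (one choice that is inversely proportional to the variance when the hazards are equal). The function names and the toy data are hypothetical.

```python
import math

PROFILES = [(1, 1), (1, 0), (0, 1), (0, 0)]
SIGNS = {(1, 1): 1, (1, 0): -1, (0, 1): -1, (0, 0): 1}

def mit_statistic(risk_sets):
    """Assumed form: (sum_j w_j * IC_j)^2 / sum_j w_j^2 * Var(IC_j).

    risk_sets is a list of (d, n) pairs; d and n are dicts keyed by (x, z)
    giving the numbers diseased and at risk in that risk set.
    """
    num_terms, var_terms = [], []
    for d, n in risk_sets:
        if any(n[p] == 0 for p in PROFILES):
            continue  # assumption: skip sets where a profile has nobody at risk
        ic = math.fsum(SIGNS[p] * d[p] / n[p] for p in PROFILES)
        var = math.fsum(d[p] * (n[p] - d[p]) / n[p] ** 3 for p in PROFILES)
        w = 1.0 / math.fsum(1.0 / n[p] for p in PROFILES)  # assumed weight
        num_terms.append(w * ic)
        var_terms.append(w * w * var)
    total_var = math.fsum(var_terms)
    if total_var == 0.0:
        raise ValueError("no usable risk sets with events")
    return math.fsum(num_terms) ** 2 / total_var

def chi2_1df_pvalue(stat):
    """Upper tail of a chi-square with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Toy risk set: one event among the doubly exposed, ten at risk per profile.
toy = [({(1, 1): 1, (1, 0): 0, (0, 1): 0, (0, 0): 0},
        {(1, 1): 10, (1, 0): 10, (0, 1): 10, (0, 0): 10})]
stat = mit_statistic(toy)
print(stat, chi2_1df_pvalue(stat))
```

A single risk set is far too few for the chi-square approximation, which the text ties to a large number of risk sets; the toy input only shows the mechanics. For a statistic of 6.2323, the tail formula gives a P value of about 0.0125, which agrees with the value reported in the example later in the paper.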

Simulation Studies
Monte-Carlo simulation was conducted to evaluate the performance of MIT. For simplicity, the hazard rate for each and every exposure profile is assumed to be a linear function of t (detailed in S2 Appendix). (S3 Appendix presents the simulation results for more complex hazard functions. The results are essentially the same as those presented in this paper.) To obtain a complete picture of the performance of MIT, we deliberately contrived scenarios where the hazard curves for the four exposure profiles were proportional to one another (Panel A in Figs. 2 and 3), where they were non-proportional (Panel D in Figs. 2 and 3), and where non-proportionality was so extreme that some of the hazard curves actually crossed others (Panel G in Figs. 2 and 3). We considered all those scenarios where the hazard curves satisfied IC(t) = 0 for every t (null hypothesis of no mechanistic interaction, shown in Panels B, E and H in Fig. 2) as well as when they did not (alternative hypothesis, shown in Panels B, E and H in Fig. 3). We assumed 250 subjects for each exposure profile at the beginning of a five-year follow-up study.
The censoring process was assumed to follow a simple exponential distribution with a hazard rate of 0.2. A total of 10,000 simulations were performed for each scenario. In each round of the simulation, MITs with different total follow-up times were calculated. The level of significance was set at α = 0.05. For comparison, we also performed the interaction test in a Cox model (only for the proportionality scenarios) and the RERI and PRISM tests (modified for censored data scenarios in this paper, detailed in S4 Appendix). For the Cox model, we tested the cross-product term of the two exposures. For RERI, we tested (one-sided) whether the index was statistically larger than one (a specific test for $U_6$), and for PRISM, we tested (two-sided) whether log PRISM was statistically different from zero (a global test for $U_6$ to $U_9$).
Fig. 2 shows the type I error rates for the scenarios of proportionality (Panel C), non-proportionality (Panel F), and crossover (Panel I), respectively. It can be seen that for the MIT and PRISM, the type I error rates with different total follow-up times are very close to the nominal α level for all scenarios. [MIT and PRISM can still maintain accurate type I error rates even when subjects with different exposure profiles have different censoring rates (see S5 Appendix).] By contrast, RERI is a very conservative test with unduly small type I errors. As for the interaction test in the Cox model, its type I error rates are severely inflated even under the proportionality assumption (Fig. 2C), and we will thus not consider it further in the following power analysis.
Fig. 3 (Panels C, F, and I) shows the simulation results for the powers. In all scenarios, the power of the proposed MIT increases as the total follow-up time increases and is higher than that of the two competitors (PRISM and RERI). The power of the PRISM test initially matches that of the MIT and it also increases as total follow-up time increases. Yet, once the total follow-up time passes a certain threshold (3 years, when around 80% of the subjects have become diseased), its power decreases instead. As for the RERI test, it has minimal power to detect mechanistic interactions, and even this is limited to a very early phase of the follow-up (< 1.5 years).
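One round of the data-generating step described in this section can be sketched as below. The intercepts and slopes of the linear hazards are hypothetical (the values actually used are in S2 Appendix); they are chosen so that the four hazards are proportional and satisfy IC(t) = 0. Event times are drawn by inverting the cumulative hazard a·t + b·t²/2, censoring times are exponential with rate 0.2, and follow-up is closed at five years. The function names are illustrative only.

```python
import math
import random

def draw_event_time(a, b, rng):
    """Inverse-transform draw for a linear hazard h(t) = a + b*t (a > 0, b >= 0)."""
    e = rng.expovariate(1.0)  # unit-exponential value of the cumulative hazard
    if b == 0:
        return e / a
    # Solve a*t + b*t^2/2 = e for the positive root.
    return (-a + math.sqrt(a * a + 2.0 * b * e)) / b

def simulate_profile(n, a, b, rng, censor_rate=0.2, closure=5.0):
    """Return n (observed_time, event_observed) pairs for one exposure profile."""
    records = []
    for _ in range(n):
        t_event = draw_event_time(a, b, rng)
        t_censor = min(rng.expovariate(censor_rate), closure)
        records.append((min(t_event, t_censor), t_event <= t_censor))
    return records

# Hypothetical (intercept, slope) pairs; each hazard is a multiple of 0.1 + 0.02*t,
# and 4 - 2 - 3 + 1 = 0, so the null hypothesis IC(t) = 0 holds for every t.
HAZARDS = {(0, 0): (0.1, 0.02), (1, 0): (0.2, 0.04),
           (0, 1): (0.3, 0.06), (1, 1): (0.4, 0.08)}

rng = random.Random(2024)
data = {profile: simulate_profile(250, a, b, rng)
        for profile, (a, b) in HAZARDS.items()}
for profile, records in data.items():
    events = sum(1 for _, observed in records if observed)
    print(profile, "events observed:", events, "of", len(records))
```

Repeating this step many times and applying a test to each simulated data set gives the empirical rejection rate, which estimates the type I error under hazards like these and the power when IC(t) differs from zero.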

An Example
Klein and Moeschberger described a follow-up study of allogeneic marrow transplantation for patients with acute leukemia (Appendix D in reference 10). A total of 137 patients from March 1, 1984 were followed up. We restricted our analysis to the five-year period afterward. During this period, 42 patients relapsed (two patients relapsed at the same time). A total of 41 risk sets can be defined (see Table 1). Using the method proposed in this paper, we calculated the MIT statistic of this example (based on the risk-set data presented in Table 1) to be 6.2323 with a P value of 0.0125. We therefore have evidence for a mechanistic interaction between CMV status and MTX treatment. For this example, the PRISM index (based on Kaplan-Meier estimates) is with a much larger and non-significant P value of 0.2173. Again we see that the RERI test is underpowered. The interaction test using the Cox regression results in a marginally non-significant P value of 0.0521 for this example. But we caution again that this is a test for multiplicative interaction and not a test for mechanistic interaction.

Discussion
Epidemiologists have long pondered the possibility of testing mechanistic interactions using only the observational data at hand. Recent developments in causal-inference methodologies have for the first time proved that this is indeed possible for cohort and case-control data [4][5][6][7][8]. However, testing mechanistic interactions in censored data is somewhat complicated. A stratified log-rank test can be used to test the equality of two survival distributions (of the exposed and the unexposed) in each and every stratum delineated by another variable. This, however, is still a test of the main effect of the exposure and not a test of interaction. A modification of the Cox model (such that hazard ratios are a linear function of covariates) can test for mechanistic interactions [9] but requires the proportional hazards assumption. By contrast, the proposed MIT is extremely robust. It can test mechanistic interactions when the hazard curves are proportional to, non-proportional to, or even crossing over, one another.
[Figure legend, partially recovered: bold dotted lines, PRISM (peril ratio index of synergy based on multiplicativity); bold broken lines, RERI (relative excess risk due to interaction); horizontal thin solid lines, the nominal α level of 0.05. The thin broken line in Panel C is for the interaction test in the Cox model.]
As mentioned previously, a test for RERI > 1 is a specific test for the $U_6$ interaction class [4][5][6][7]. In our simulation studies, the RERI test is severely underpowered. Under a strong assumption, the 'monotonicity assumption' [11][12][13][14], $U_6$ is the only interaction class to consider and the threshold of the RERI test becomes less stringent (RERI > 0 vs. RERI > 1) [4][5][6][7]. Even with this additional assumption, however, the power of the test for RERI > 0 still decreases very quickly as total follow-up time increases (S8 Appendix). The limited power of the RERI test (whether for RERI > 0 or RERI > 1) is understandable, because it assesses departure from risk-ratio additivity at one and only one point in time: the time of study closure. By contrast, the (weighted) sum operation in MIT integrates information at various time points during the follow-up (from the different risk sets, each one gauging a departure from hazard-rate additivity at a specific point in time). It therefore has much greater power to detect mechanistic interactions.
In this study, we found that the power of the PRISM test initially matches that of MIT, but beyond a certain point (when ~80% of the subjects are diseased, see also S7 in reference 8), the power does not increase but instead can actually decrease with more follow-up time. This is rather undesirable; with more effort put into longer-term follow-up, a researcher will naturally expect the study to generate more power, not less. We show in S9 Appendix that the PRISM test is akin to a hazard-rate additivity test with equal weight attached to the risk sets (cf. the adaptive weight in MIT according to the sample size of a set). As follow-up progresses, the size of a risk set will gradually decrease to a point where it contains too few subjects to provide any reliable information regarding hazard-rate additivity. Indiscriminately assigning large weights to a limited number of subjects will only overwhelm the true interaction signals that may be present in the earlier sets.
To test for mechanistic interactions between two binary exposures in long-term follow-up studies, we recommend using MIT in light of its desirable statistical properties. Further studies are warranted to expand the methodologies to include categorical or ordinal variables with three or more levels. A stratified MIT also awaits development which should prove useful to adjust for possible confounders (stratification according to the levels of confounders). Finally, further work is needed to cast the present method in a general regression framework.
Supporting Information
S1 Appendix. Comparison between mechanistic and multiplicative interactions.