Mediation model with a categorical exposure and a censored mediator with application to a genetic study

Mediation analysis is a statistical method for evaluating the direct and indirect effects of an exposure on an outcome in the presence of a mediator. Mediation models have been widely used to determine direct and indirect contributions of genetic variants in clinical phenotypes. In genetic studies, the additive genetic model is the most commonly used model because it can detect effects from either recessive or dominant models (or any model in between). However, the existing approaches for mediation model cannot be directly applied when the genetic model is additive (e.g. the most commonly used model for SNPs) or categorical (e.g. polymorphic loci), and thus modification to measures of indirect and direct effects is warranted. In this study, we proposed overall measures of indirect, direct, and total effects for a mediation model with a categorical exposure and a censored mediator, which accounts for the frequency of different values of the categorical exposure. The proposed approach provides the overall contribution of the categorical exposure to the outcome variable. We assessed the empirical performance of the proposed overall measures via simulation studies and applied the measures to evaluate the mediating effect of a women’s age at menopause on the association between genetic variants and type 2 diabetes.


Introduction
Mediation analysis is a statistical method used to evaluate the direct and indirect effects of an exposure on an outcome in the presence of a mediator. Mediation models have been widely used to determine direct and indirect contributions of genetic variants in clinical phenotypes, such as contribution of CHRNA3-A5 genes in lung cancer [1][2][3][4][5][6][7]. In many studies, one encounters right-or left-censored data instead of complete data. Approaches to assess mediation when the outcome variable is censored have been widely studied [8][9][10][11][12][13][14][15]. However, the mediator itself can also be a censored variable. For instance, genes may affect the age at which a person stops smoking, a variable that is censored for current smokers and has been associated with lung cancer-associated mortality [16]. Few studies have considered mediation models with a censored mediator. Wang  with the accelerated failure time (AFT) model to address a censored nature of a mediator when outcome was a continuous variable, and this approach yielded accurate estimations for the coefficients of different paths, the indirect effect (IE), and proportion of the total effect mediated (PM) by the mediator. Wang et al. [18] further extended the mediation approach with a censored mediator for studies with binary outcomes (e.g., case-control studies), based on the semiparametric AFT model with an unspecified error distribution combined with a pseudo-likelihood function, which was shown to be efficient yet robust.
In genetic association studies of complex diseases-including our motivating study of the association between single nucleotide polymorphisms (SNPs), woman's age at menopause, and type 2 diabetes-because often there is no concrete evidence of the genetic mode of inheritance, one usually uses three classic genotypic models: the additive, dominant, and recessive genetic models [19]. For example, consider a SNP with two alleles R and r, and let R be the risk allele and r be the normal allele. The additive genetic model is defined using a categorical random variable, X = (0, 1, 2), to denote the three genotypes (rr, Rr, RR), assuming that the disease risk depends on the dose of the risk allele R. When the dominant or recessive genetic model is assumed, we use a binary variable, X = (0, 1, 1) or X = (0, 0, 1), respectively, to denote the three genotypes (rr, Rr, RR). The additive genetic model is the most commonly used model because typically the mode of action of susceptibility SNPs is unknown and the additive model can detect effects from either recessive or dominant models, or any model in between [20][21][22]. In addition to SNPs, the highly polymorphic loci, such as human leukocyte antigen (HLA) genes, can also be involved in the mediation model as an exposure. Such genes have many different alleles, resulting in many different genotypes (i.e., more than three genotypes found in di-allelic SNPs). The previous approaches by Wang and Shete [17] and Wang et al. [18] assumed that the exposure is either continuous (e.g., gene expression) or binary (e.g., dominant or recessive mode of inheritance for a genetic variant). These methods therefore cannot be directly applied when the genetic model is additive (e.g. the most commonly used model for SNPs) or categorical (e.g. polymorphic loci), and thus, modification to measures of indirect and direct effects is warranted for these general scenarios. Therefore, we extended our approach to the scenario in which the exposure is a categorical variable which can be applied to the model where the mediator is subject to censoring.
In particular, we proposed the measures for calculating the overall IE, direct effect (DE), and total effect (TE) in such a mediation model, where we first assessed the IE, DE, and TE for each category of the exposure and then calculated the IE, DE, and TE, weighted by the frequency of each category (e.g., frequency of each genotype), to estimate overall effect (IE, DE, and TE) of the exposure on the outcome variable in the presence of a censored mediator. The proposed measures provide the overall contribution (indirect, direct and total) of the categorical exposure to the outcome variable, instead of the relative contribution comparing one category to another in previous studies [23,24]. The proposed measures are general and valid regardless of whether the outcome variable and mediator are continuous, binary, or censored.
We applied the proposed overall measures of IE, DE, and TE to the motivating study of the mediating effect of a woman's age at menopause on the effect of SNPs on risk of type 2 diabetes. Type 2 diabetes is a complex disease characterized by interplay between genetic and environmental factors [25][26][27][28][29][30]. Previous studies using epidemiologic data or genetic epidemiologic data have suggested an association between a woman's age at menopause and type 2 diabetes [31][32][33] as well as an association between several SNPs and a woman's age at menopause [34,35]. Therefore, we hypothesized that there are potential dual pathways between SNPs and type 2 diabetes, one via a direct effect and the other indirectly through a women's age at menopause, which is hypothesized as a mediator for this association and is a censored variable because not all women have had gone through menopause (Fig 1). In Methods section, we introduce the notations; mediation models; definitions of the overall IE, DE, and TE; and the associated estimation approaches [18]. We assess the empirical performance of the proposed overall IE, DE and TE via simulation studies in the Simulation section, conduct a data analysis in the section of Application to the motivation study, and provide a discussion in the Discussion section.

Methods
We first review the approach by Hayes and Preacher [23], which is a widely used mediation model in which the exposure has multiple categories, and point out its limitations in the context of our motivating study. In their approach, Hayes and Preacher proposed to code the multi-categorical exposure using different coding strategies, including dummy coding, contrast coding etc., depending on the research interest. For example, when using the dummy coding, if the multi-categorical exposure X has k groups, one can create k-1 dummy variables X i , i = 1, . . . k-1, with X i = 1 if the subject is in group i and X i = 0 otherwise, where one group is considered the reference group in the analysis. In this way, the approach creates multiple binary exposure variables in one mediation model.
As assumed in the Hayes and Preacher model, when outcome Y and mediator T are continuous, the direct and indirect effects of X on Y are estimated using the dummy-coded multiple exposure variables, mediator, and outcome using following linear regressions: where a i represents the path from the dummy-coded exposure X i to the mediator T,c i represents the path from X i to the outcome Y conditional on the mediator T; c i represents the Conceptual model for the study of the mediating effect of a women's age at menopause on the association between genetic variants and type 2 diabetes risk. Nodes represent the variables being analyzed in the mediation model, including genetic variant (i.e., exposure), a women's age at menopause (i.e., mediator) and type 2 diabetes (i.e., outcome). A direct edge implies a potential direct causal effect. A pathway from one variable (genetic variant) to another (type 2 diabetes) implies a potential causal relationship through the mediator on the path (age at menopause). relative total effect of X i on the outcome Y; i = 1, . . . k-1; and b represents the effect of the mediator T on the outcome Y. To assess the mediating effects, Hayes and Preacher adopted the terms relative IE, relative DE, and relative TE, respectively, to refer to a i b,c i , and c i ¼ a i b þc i ; and the effects were calculated for each of the binary exposure variables recoded from the original exposure. (see details in [23]).
The approach proposed by Hayes and Preacher has advantages over other approaches such as aggregating groups or discarding to construct a dichotomous exposure, but it still has certain limitations. Since the relative IE is calculated for each created binary exposure separately, it cannot provide the overall mediating effect of the mediator on the relation between the exposure and outcome variable. Also, if the exposure has many categories, the number of possible tests to be conducted can be large. In this case, multiple correction tests reduce the power of the test for the mediating effect. Importantly, the approach proposed by Hayes and Preacher [23] assumes that both outcome and mediator are continuous and normally distributed, which allows one to estimate the relative TE as the summation of relative IE and relative DE. Therefore, the approach cannot be directly applied in many practical situations, such as when the mediator is a non-normal distributed variable subject to censoring and the outcome is binary. More recent work has been conducted to develop approaches for the analysis of treatment/ exposure with multiple categories [24,[36][37][38]. However, these approaches compare one category to another of the treatment/exposure without providing overall direct and indirect effects of the categorical exposure variable on the outcome variable. Furthermore, the direct application of these methods to a mediation model with a censored mediator and a binary outcome is not straightforward. For instance, the approaches proposed by Samoilenko et al. focused on continuous mediator and outcome [24].
To address these limitations, we proposed an approach to assess the overall mediating effect of a mediation model in which the exposure is categorical, and the mediator is subject to censoring. In the next sections, we propose definitions for the overall measures for IE, DE, and TE by extending the approach proposed in Wang et al. [18]. We focused on this approach because, compared to other existing approaches, it employed the semiparametric AFT model, which does not require a parametric distribution assumption on the mediators, and the pseudo-likelihood function, which is more flexible to be extended to different outcome variables (e.g., continuous outcome). However, as mentioned above, the proposed overall measures of IE, DE and TE are general and valid regardless of whether the outcome variable and mediator are continuous, binary, or censored or the approaches used.

IE for a mediation model with a categorical exposure and a continuous outcome
We first present methodology for the scenario when outcome variable is continuous. Let X be the categorical exposure, T be the mediator subject to right censoring, Y be the continuous outcome variable and Z be the other covariates involved in a mediation model. For the mediator, given an individual i, i = 1, . . ., n, we observe m i = min(t i , c i ) and δ i = I(t i � c i ), where c i is the right-censored time and δ i is the indicator for censored (δ i = 0) or observed (δ i = 1). For the categorical exposure, we utilize the dummy coding as suggested in Hayes and Preacher. If the categorical exposure X has k categories, we create k-1 dummy variables, X j , j = 1, . . . k-1, with one of the categories as the reference (Fig 2), where X j = 1 if X = d j , and X j = 0 if X 6 ¼ d j . That is, X = d j means (X 1 = 0, . . . X j = 1, . . . X k−1 = 0). For the reference category, X = d 0 is equivalent to (X 1 = 0, . . . X j = 0, . . . X k−1 = 0). For example, in our motivation study, the exposure is coded as an additive genetic model with three categories (0, 1, 2) denoting the three genotypes (rr, Rr, RR). Here k = 3 and we create two dummy variables, X 1 and X 2 , using the genotype rr (i.e., 0) as the reference group. Specifically, X 1 = 1 when X = 1 (genotype Rr) and X 2 = 1 when X = 2 (genotype RR).
The relationships among the variables can be specified using the linear regression model (i.e., association of exposure and mediator with outcome) and AFT model (i.e., association of exposure with mediator) as follows: where b 0 , b,c j , j = 1, . . ., k − 1, γ, and y ¼ ða 0 ; a 1 ; . . . ; a kÀ 1 ;g T Þ T are the regression coefficients; . . . ; x ðkÀ 1Þi ; z T i Þ T ; ε yi~N ormal(0, σ 2 ); and ε ti represents the independently and identically distributed random errors with mean zero and an unspecified distribution. Particularly, the coefficients a j , j = 1, . . ., k − 1, correspond to the paths from k-1 dummy variables created for the original exposure, X 1 , X 2 , . . ., X k−1 , to the mediator T; the coefficientsc j , j = 1, . . ., k − 1, correspond to the paths from k-1 dummy-coded exposure variables to the outcome Y; and b corresponds to the path from the mediator T to the outcome Y. In the presence of rightcensoring, given a continuous outcome, the likelihood function for the observed data (y i , m i , We use the two-stage approach proposed in Wang et al. [18] to assess the coefficients for different paths a j ,c j , j = 1, . . ., k − 1, and b. In particular, in the first stage, based on the semiparametric AFT model [11,39,40], we use a weighted least square estimator to estimate coefficients a j , with the closed form aŝ whereĜð�Þ is the Kaplan-Meier estimator for the censoring survival function which accounts for the right-censoring. Based onŷ ¼ ðâ 0 ;â 1 ; . . . ;â kÀ 1 ;ĝ T Þ T and the AFT model error distri-butionẐŷ ð�Þ, which is estimated from the censored residues by Kaplan-Meier estimator, in the second stage, we assess the coefficients for paths b, andc j , j = 1, . . ., k − 1, with the use of a logpseudo-likelihood function as below, given a sample of n individuals: is the set of parameters to be estimated and τ is the largest observed event time on a residual scale. The conditional probabilities Pr φ (y i |m i , A i ) and Pr φ ðy i jt þ A T iŷ ; A i Þ can be formulated using the AFT model and the linear regression model. The estimatorsφ ¼ ðb 0 ;b;ĉ 1 ; . . . ;ĉ kÀ 1 ;ĝ T Þ T are assessed by maximizing the above logpseudo-likelihood with the use of the minimization algorithms such as Nelder and Mead approach [41]. See Wang et al. [18] for details on parameter estimation. We propose to assess the overall IE, DE and TE using an approach in which the IEs, DEs and TEs are calculated based on different categories of the exposure and combined using the corresponding frequencies of different categories of the exposure. The measures for IE, DE, and TE of each category of the exposure are derived following the counterfactual framework, which has been widely applied in mediation analysis, especially for scenarios involving nonlinearities and interactions [10,[42][43][44][45][46][47][48][49]. Briefly, we denote Y d j and T d j respectively to be the values of the outcome Y and mediator T that would have been observed if the exposure X had been set to d j ; and denote Y d j t to be the value of Y that would have been observed if T and X had been set to t and d j respectively [18,44,45]. Based on the counterfactual framework, conditional on the covariates Z, the natural IE is defined as E Y d j T d j jZ the effects of the mediator T at values of T d j and T d 0 on the outcome Y when the exposure X is pares the effects of the exposure X on Y by setting the mediator T to the value it would have been observed if X had been set to be d 0 (reference category). Here, the assumptions on the absence of unmeasured confounders and consistency are required, which have been extensively discussed in the literature [8,12,13,18,[43][44][45][46][47][48]. The detailed derivations for calculating IE and DE and associated assumptions are shown in the online S1 Appendix. For each of the binary dummy-coded exposure variables X j , j = 1, . . ., k − 1, we evaluate the IE in the mediation model and denote it as IE j_versus_0 (indirect effect of category X = d j versus the reference category X = d 0 ), given x j = 0 as the reference group: The overall indirect effect of the exposure X on the outcome Y mediated through the mediator T needs to account for the frequency of each possible categories d j of X. Therefore, we define overall IE as: where f j , j = 1, . . ., k − 1, is the frequency of category d j of the exposure. For genetic studies, f j is the genotypic frequencies of possible genotypes of SNP X and can be obtained from external sources that represent general population such as the 1000 genome project data [42]; or one could estimate the genotypic frequencies from the current data. If the models in the above equations are correctly specified, we can estimate the overall measures of IE based on the estimated AFT model error distribution for the mediatorẐŷ ð�Þ and the estimated coefficientsŷ On the basis of the counterfactual framework, the natural IE j_versus_0 for the dummy-coded exposure X j , as defined above, measures the expected change in outcome Y due to the effects of the mediator T at values of T x j ¼1 versus T x j ¼0 when X j is set to 1, conditional on the covariates [18,50]. In turn, the overall IE, calculated accounting for the frequency of different categories of the exposure, measures the average expected change in Y from the effects of the mediator T responding to change/transition in the categorical exposure X from any group to the reference group [50]. We can similarly define direct effects, DE j_versus_0 : Similarly, the overall DE is then defined: The overall TE is the summation of the IE and DE, calculated as The proportion of total effect of X on Y mediated by the mediator T (PM) can be estimated as PM = IE/TE, which is commonly reported when there is a significant IE [51].

IE for a mediation model with a categorical exposure and a binary outcome in a case-control study
We further extended the framework in the above section to accommodate a categorical exposure, a censored mediator and a binary outcome (e.g. presence or absence of a phenotype in case-control studies). The relationships among the variables can be specified using the logistic regression model and AFT model as follows: All coefficients are defined similarly as for the scenario with a continuous outcome. To account for the binary nature of the outcome variable in a case-control study where cases may be oversampled, in stage one, in addition to the weight to account for the right-censoring, we consider one additional sampling weight, i.e. w i , in the weighted least square estimator to estimate coefficients a j , j = 1, . . ., k − 1, as below: The log-pseudo-likelihood function can be similarly defined by including the sampling weight: can be evaluated to provide the coefficients for paths b andc j , j = 1, . . ., k − 1. Specifically, the coefficientsc j , j = 1, . . ., k − 1, correspond to the paths from k-1 dummy variables created for the exposure, X 1 , X 2 , . . ., X k−1 , to the outcome Y, and b corresponds to the path from the mediator T to the outcome Y. The sampling weight w i is included to account for the sampling mechanism of case-control study design. Such weighting strategy of using inverse disease prevalence is well-established [18,[52][53][54][55][56][57][58].
To assess the overall IE and DE, we first estimate the IE and DE for each of the dummycoded exposure variables X j , j = 1, . . ., k − 1, as described above, IE j_versus_0 and DE j_versus_0 : and Based on the estimated AFT model error distribution for the mediatorẐŷð�Þ and the esti- and DE of the exposure X for the mediation model are calculated using Eqs (1) and (2). The study and data use were approved by US National Institute of Health and The University of Texas MD Anderson Cancer Center through Material Transfer Agreement (MTA ID: 00016197). The data of the motivating study was downloaded from dbGaP (phs000209.v13.p3) [59] for the Multi-Ethnic Study of Atherosclerosis (MESA) cohort study. The MESA study was approved by the Institutional Review Board at each site of the study and informed consent was obtained from each participant [60].

Simulation approach
To examine the performance of the proposed overall measures of IE, DE, and TE, we conducted simulations for a mediation model with a categorical exposure, where the mediator is subject to right censoring.
Binary outcome. To mimic the motivation study, we first considered a case-control study in the simulations, with a binary outcome and an additive genetic variant (SNP) as the exposure. We used the robust estimating approach by Wang et al. [18] to estimate the parameters under the mediation model. For each individual i, the genotype of the genetic variant x i (exposure) was generated with the use of the genotypes' frequencies assuming the genetic variant is in Hardy-Weinberg proportion. We assumed a genetic variant with a minor allele frequency (MAF) of 0.1, 0.3 and 0.5. The genotypic frequencies could be calculated accordingly. For example, when the MAF was 0.3, the genotypic frequencies were 0.49, 0.42, and 0.09 for the three genotypes rr, rR, and RR, respectively. Given the exposure x i , the censored mediator t i was generated using an AFT model log(t i ) = a 0 + ax i + ε t , where ε t~N ormal(0, 1) and the coefficients were set as a 0 = 6 and a 0 = 0 or 0.4. The right-censoring time c i was generated independently from the uniform distributions using different intervals to create different censoring percentages,~20% and~40%. The observed censored variable m i and indicator δ i for each individual were then obtained as m i = min(t i , c i ) and δ i = I(t i � c i ). Conditioned on the values of x i and t i , the outcome y i was generated using the logistic regression model, where the coefficients were set to be b = 0 or 0.4, andc ¼ 0:5. The intercept coefficient b 0 was set to various values to reflect different disease prevalence of~10% and~30%. In this way, we simulated a large amount of data on the population, from which we randomly sampled same numbers of cases and controls based on the outcome status. In particular, we randomly sampled 500 cases and controls for the scenarios where MAF = 0.3 and 0.5; while sampled 2000 cases and controls for the scenarios where MAF = 0.1 to ensure the sufficient sample size in RR genotype category, thereby, producing stable estimations of the regression coefficients.
Note that in the simulations, we employed X = (0, 1, 2) corresponding to the genotypes (rr, rR, RR). In this situation, the regression coefficients, a andc, are interpreted as the per-allele effect, which corresponds to the effect of each copy of the deleterious allele R. When analyzing the mediation model, as described in the Methods section, we created two dummy variables, X 1 = (0, 1, 0) and X 2 = (0, 0, 1) using the genotype rr as the reference group. Given such a coding, it is straightforward to derive that a 1 = a, a 2 = 2a,c 1 ¼ c, andc 2 ¼ 2c. That is, a 1 = 0 or 0.4; a 2 = 0 or 0.8;c 1 ¼ 0:5; andc 2 ¼ 1. We report the estimates for a 1 , a 2 ,c 1 , andc 2 for the simulation studies.
For each of the MAF values (i.e., 0.1, 0.3, 0.5), when either coefficient a = 0 or b = 0, we considered twelve scenarios for which there is no IE through the mediator, with respect to different censoring percentages (~20 or 40%) and different disease prevalence (~10 or 30%). When both a and b are non-zero (i.e., a = b = 0.4), we considered four scenarios for which there is an IE through the mediator with respect to different censoring percentages and disease prevalence. For different scenarios, the theoretical true values of IEs and PMs were calculated using the prespecified parameters and prespecified normal distribution for the conditional probability of the mediator. In our estimating procedure to calculate the empirical IEs and PMs, the nonparametric Kaplan-Meier estimator of the censored residuals was used to assess the conditional probability of the mediator. To test the significance of the IEs and PMs, the bias-corrected and accelerated (BCa) bootstrap approach [61] was employed to determine the confidence intervals (CIs) for the IEs and PMs [17,18].
Continuous outcome. When investigating the performance of the proposed overall measures for a mediation model with a continuous outcome, a categorical exposure (genetic variant) and a censored mediator, we generated the exposure X and the mediator T similarly as described above. We still assumed different MAFs for the genetic variant (i.e., 0.1, 0.3, 0.5) and different censoring percentages for the mediator (i.e.,~20%,~40%). For each individual i, given the values of the exposure x i and the mediator t i , the outcome y i was generated using the linear regression model. The coefficients were set to be b 0 = 1, b = 0 or 0.4, andc ¼ 0:5. We randomly generated 1000 samples for the scenarios where MAF = 0.3 and 0.5; and 4000 samples for the scenarios where MAF = 0.1. For each of the MAF values (i.e., 0.1, 0.3, 0.5), there were six scenarios for which there is no IE through the mediator (i.e., a = 0 or b = 0); and two scenarios for which there is an IE through the mediator (i.e., a = b = 0.4), with respect to different censoring percentages of the mediator.

Simulation results
Binary outcome. Tables 1 and 2 report the simulation results for all the scenarios assuming a binary outcome in a case-control study and an additive SNP with MAF = 0.3, including the scenarios for which there is no IE through the mediator (top panel) and those for which there is an IE through the mediator (bottom panel). In Table 1, we report the means and standard errors of the estimated coefficients for the different paths, a 0 , a 1 , a 2 , b 0 , b,c 1 , andc 2 . As expected, the robust approach provided accurate estimations for all coefficients through different scenarios. As an example of no IE, scenario 3, in which the prespecified values were a 0 = 6, a = 0 (i.e., a 1 = 0, a 2 = 0), b 0 = -5 (i.e., disease prevalence =~10%), b = 0.4, andc ¼ 0:5 (i.e., c 1 ¼ 0:5,c 2 ¼ 1), the estimated values were a 0 = 5.9995, a 1 = 0.0019, a 2 = 0.0080, b 0 = -5.0064, b = 0.3984,c 1 ¼ 0:5057 andc 2 ¼ 1:0055, respectively, which were close to the prespecified parameter values in the simulation model. Similarly, under the scenarios where there is an IE through the mediator, all parameters were accurately estimated. For example, for scenario 15, in which the prespecified values were a 0 = 6, a = 0.4 (i.e., a 1 = 0.4, a 2 = 0.8), b 0 = -3.7 (i.e., disease prevalence =~30%), b = 0.4, andc ¼ 0:5 (i.e.,c 1 ¼ 0:5,c 2 ¼ 1), the estimated values were a 0 = 6.0043, a 1 = 0.3985, a 2 = 0.7809, b 0 = -3.6902, b = 0.3975,c 1 ¼ 0:5004 andc 2 ¼ 1:0074, respectively, which were close to the prespecified parameter values in the simulation model. Table 2 report the means and standard errors of the estimated IEs and PMs (for the scenarios with significant IEs) for MAF = 0.3, obtained using the overall measures proposed in the study, and the coverage probabilities of the 95% CIs of the IEs and PMs. When estimating the IEs and PMs, our overall measures provided accurate estimations for all scenarios. For example, for scenario 16, when the theoretical IE and PM were respectively 0.043 and 0.268, the estimated values obtained using our approach were 0.0430 and 0.2767, which were close to the theoretical values. The 95% coverage probabilities for the IE and PM, based on the proposed approach, were close to the nominal value of 0.95. The proposed measures were practically not impacted by different disease prevalence values (~10 or 30%) and censoring percentages (~20 or 40%).
For the other scenarios where the MAFs of the genetic variant (exposure) were 0.1 and 0.5, the simulation results are reported in the online S1 and S2 Tables, respectively. Similar results were observed. The proposed approach provided accurate estimations for all coefficients of  estimated coefficients for different paths, a 0 , a 1 , a 2 , b 0 , b,c 1 andc 2 , given the minor allele frequency (MAF) = 0.3.

Scenario a b 0 b
Theoretical IE The simulation was based on 500 replicates, each with 500 cases and 500 controls. Different scenarios were considered based on different values of a, b 0 , b, censoring percentage (CP) and disease prevalence (prev), with a 0 = 6 andc ¼ 0:5. different paths, as well as IEs and PMs, through various scenarios. It is worth to note that a relatively small MAF for the genetic variant (e.g., 10%) would affect the parameter estimations for the paths a 2 , andc 2 , which were particularly pronounced when sample size was smaller (e.g., 500 cases and controls; data not shown). This is not surprising to observe because in this situation, the expected frequency of the genotype RR is only 1%, resulting in a very small number of samples in this category. Larger sample sizes can help to ensure the accurate estimations of these parameters. Therefore, when MAF = 0.1, we increased the sample size to 2000 cases and 2000 controls (or 4000 samples for the continuous outcome) in the simulation studies. Continuous outcome. Tables 3 and 4 (Table 3). Meanwhile, the estimated overall IE and PM were 0.1877 and 0.2433, respectively, which were close to the theoretical values of 0.188 and 0.242 ( Table 4). The 95% coverage probabilities for the IE and PM were 0.948 and 0.944 and both were close to the nominal value of 0.95. The simulation results for scenarios where MAF = 0.1 and 0.5 are reported in the online S3 and S4 Tables, respectively. As in the binary outcome scenarios, we increased the sample size to 4000 when MAF = 0.1 for the genetic variant. The simulation was based on 500 replicates, each with 500 cases and 500 controls. Different scenarios were considered based on different values of a, b 0 , b, censoring percentage (CP) and disease prevalence (prev), with a 0 = 6 andc ¼ 0:5.

Application to the motivation study
We applied the proposed overall measures for IE, DE, and TE for the mediation analysis to the data from a genetic case-control study of type 2 diabetes downloaded from dbGaP [59], relating to the Multi-Ethnic Study of Atherosclerosis (MESA) cohort study. The conceptual  estimated coefficients for different paths, a 0 , a 1 , a 2 , b 0 , b,c 1  The simulation was based on 500 replicates, each with 1000 samples. Different scenarios were considered based on different values of a, b, and censoring percentage (CP), with a 0 = 6, b 0 = 1 andc ¼ 0:5.  The simulation was based on 500 replicates, each with 1000 samples. Different scenarios were considered based on different values of a, b, and censoring percentage (CP), with a 0 = 6, b 0 = 1 andc ¼ 0:5.
https://doi.org/10.1371/journal.pone.0257628.t004 mediation model is shown in Fig 1, where the genetic variant is the exposure (X), the age at menopause is the mediator (T), and type 2 diabetes status is the outcome variable (Y). There were 47,871 genetic variants from 2,956 women included in the mediation analysis. A woman's age at menopause was censored if she had not gone through menopause, and the censoring percentage was~14.5%. Assuming an additive genetic model for all the genetic variants, we conducted the association analyses of genetic variants with a woman's age at menopause (path a) as well as with type 2 diabetes status (paths b andc). In particular, when assessing the association between a woman's age at menopause and a genetic variant with type 2 diabetes (paths b andc), we used logistic regression model, where type 2 diabetes status was the dependent variable and the genetic variant and age at menopause were the predictors. We included age and ethnicity as covariates in the logistic regression model. When assessing the association between a genetic variant with a woman's age at menopause (path a), we used the AFT model, where age at menopause was the dependent variable and the genetic variant was the predictor. Ethnicity was adjusted as a covariate in the AFT model. There were four categories for the ethnicity in the MESA data, including White, Caucasian; Black, African-American; Chinese American; and Hispanic. Ethnicity was considered as a categorical variable in the analysis where White, Caucasian was used as the reference category, resulting in three related coefficients in the AFT model (g 1 ,g 2 andg 3 ) and the logistic regression model (γ 1 , γ 2 and γ 3 ). Age was considered as a continuous variable, resulting in one coefficient (γ 4 ) in the logistic regression model.
For the purpose of demonstration, we considered a threshold of 0.005 to identify top variants. Our approach identified three variants-rs12744291, rs2503182, and rs11771343-associated with both type 2 diabetes and age at menopause to be included in the mediation analysis. Specifically, the p-values were 1.27×10 −3 , 2.56×10 −3 , and 1.43×10 −3 for rs12744291, rs2503182, and rs11771343, respectively, for their association with type 2 diabetes and 5.53×10 −4 , 4.01×10 −3 , and 4.12×10 −3 , respectively, for their association with age at onset of menopause. The top and middle panels of Table 5 list the estimations for all the coefficients for the AFT model and logistic regression model, respectively, for the three top genetic variants. Consider SNP rs12744291 as an example, in the AFT model where age at menopause was the dependent variable, the estimated coefficients were a 1 = 0.8768 and a 2 = 1.4933 for the SNP (path a); andg 1 ¼ 0:6625,g 2 ¼ À 0:7286 andg 3 ¼ À 1:0543 for ethnicity. In the logistic regression model where type 2 diabetes was the dependent variable, the estimated coefficients were b = -0.0090,c 1 ¼ À 0:1704,c 2 ¼ À 0:8943 for the age at menopause and SNP (paths b andc); and γ 1 = 1.4437, γ 2 = 1.4859, γ 3 = 1.7071, and γ 4 = 0.0305 for ethnicity and age.
The bottom panel of Table 5 reports the overall IEs, DEs, TEs, and 95% CIs obtained from the mediation analysis of the three genetic variants, a woman's age at menopause, and type 2 diabetes. BCa bootstrapping was used to assess the CIs for IEs as in the simulation studies. The overall IEs for the three genetic variants, rs12744291, rs2503182, and rs11771343, were reported as -0.0007, -0.0004, and 0.0005, respectively; and the 95% CIs of IEs for all three genetic variants include zero. These results suggest no statistically significant mediating effect of the age at menopause on the association between the three variants and type 2 diabetes risk.

Discussion
In this study, we proposed overall measures to calculate the IE, DE, and TE for a single censored mediator model involving a categorical exposure. Specifically, we defined the IE, DE, and TE for each of the categories for the exposure first and then assessed the overall IE, DE, and TE of the exposure accounting for the frequencies of different categories of the categorical exposure variable.
Compared with the traditional approach for a multi-categorical exposure, the proposed measure has several advantages. First, it provides an overall IE, DE, and TE of the mediation model from the exposure, instead of relative IE, DE, and TE as described in previous studies. Second, it avoids the multiple testing issue caused by recoding the multi-categorical exposure into multiple binary exposure variables. We did not compare the proposed approach of handling categorical exposure variable with the one used in Wang et al. [18] because their approach is limited to binary exposure only.
We demonstrated the performance of proposed overall measures with simulation studies for the mediation model with a binary outcome or a continuous outcome and a right-censored mediator. Note that such measures are general and robust and can be employed regardless of whether the outcome variable is continuous or binary and the mediator is censored or not. We also investigated the performance of the proposed overall measures for mediation models in the presence of covariates using simulations (online S5 Table). In particular, we considered a mediation model with a binary outcome. Without loss of generality, we fixed the MAF at 0.3, the censoring percentage at~20% and the disease prevalence at~10%. We followed the same procedure as described in the Simulation section to generate data. In addition to the exposure, mediator and outcome, we generated a continuous covariate Z~Normal(0, 0.5 2 ), which was associated with both the mediator T and outcome Y. Based on the simulation results, we observed accurate estimations for all the coefficients, as well as IE and PM. For example, for Table 5. Estimations of the coefficients, as well as the overall total effects (TEs), direct effects (DEs), and indirect effects (IEs), along with 95% confidence intervals (CIs), for the single nucleotide polymorphisms (SNPs) associated with both type 2 diabetes and a woman's age at menopause in the real data analysis � . These results show that the proposed measures are robust even in presence of covariates. Furthermore, in practice, one may encounter censored data for both outcome variable (e.g., time to onset of disease) and mediator. The approach, using the semiparametric AFT model combined with a pseudo-likelihood function, can be extended to such a mediation model where the outcome variable is also censored. Particularly, one can revise the pseudo-likelihood function to accommodate the survival component. Survival regression models, such as the commonly used Weibull regression model [62], may be employed to address this issue. However, the development of such extension is not straightforward and will need further investigation.
We applied the overall measures of IE, DE, and TE to the motivation study of genetic variants, a woman's age at menopause, and type 2 diabetes risk. Assuming the additive genetic model for the genetic variants, we identified three variants, rs12744291, rs2503182, and rs11771343, to be included in the mediation analysis because they were associated with both the mediator (i.e., a woman's age at menopause) and the outcome (i.e., type 2 diabetes status). The results from the mediation analysis showed that a woman's age at menopause had no mediating effect on the effect of the three genetic variants on type 2 diabetes risk.
Important assumptions required for the mediation analysis have been discussed previously [17,18]. The sensitivity analysis for the assumptions about unmeasured confounders for the derivations of IE and DE have been extensively conducted and discussed for the motivation study in our previous study [18]. In addition to the "no-unmeasured-confounder" assumptions, we assumed that the mediation model was accurately specified and there were no measurement errors for all the variables in the mediation model. Specifically, for our real data application, we conceptualized the mediation model based on the literature, including the causal orders and causal directions [25][26][27][28][29][30][31][32][33][34][35], and assumed that all the variables, including the exposure, mediator, outcome, and covariates, had no measurement errors.
Besides the assumptions for the mediation analysis, for the parameter estimation approach using the semiparametric AFT model, we assumed that the censoring process for the mediator was independent of the mediator T, exposure X, outcome Y, and covariates Z. Sensitivity analysis was conducted previously and showed some degree of robustness for the approach to the violation of the independence assumption [18]. We used the AFT model to relate the exposure to the mediator because it provides the change in the length of survival time as a function of the effect of the exposure, which has an easy way to interpret in the mediation context [17,18]. Other semiparametric survival models, such as the most popularly used Cox proportional model, could be adapted in the mediation model with a censored mediator. However, such adaptation is not straightforward. For example, the measure of effect for the Cox proportional model is the hazard ratio. In such a case, the mediating effect could be difficult to be interpreted because it is the survival time but not the hazard ratio to be expected to have causal effect on the outcome variable. Using other semiparametric survival model in the mediation analysis is of interest to investigate; however, future work is necessary for the derivation of the IE, DE and TE so that these effects can be appropriately interpreted in the mediation context.
Parametric (linear and logistic regressions) and semiparametric approaches (semiparametric AFT model) were employed in the estimation of coefficients for different paths in the mediation model, which usually rely on certain modeling assumptions in one way or another [63][64][65]. Alternatively, non-parametric approaches to mediation analysis, which has been received a great deal of attention recently, could be considered [63,[66][67][68]. Extension of the current proposed approach for mediation model, where the mediator is censored and the exposure is categorical, to the non-parametric framework will be worthy of future research.
Supporting information S1 Table.