## Figures

## Abstract

Third-variable effect refers to the effect transmitted by third-variables that intervene in the relationship between an exposure and a response variable. Third-variable effect analysis has been broadly studied in many fields. However, it remains a challenge for researchers to differentiate indirect effect of individual factor from multiple third-variables, especially when the involving variables are of hierarchical structure. Yu et al. (2014) defined third-variable effects that were consistent for all different types of response (categorical or continuous), exposure, or third-variables. With these definitions, multiple third-variables can be considered simultaneously, and the indirect effects carried by individual third-variables can be separated from the total effect. In this paper, we extend the definitions of third-variable effects to multilevel data structures, where multilevel additive models are adapted to model the variable relationships. And then third-variable effects can be estimated at different levels. Moreover, transformations on variables are allowed to present nonlinear relationships among variables. We compile an R package *mlma*, to carry out the proposed multilevel third-variable analysis. Simulations show that the proposed method can effectively differentiate and estimate third-variable effects from different levels. Further, we implement the method to explore the racial disparity in body mass index accounting for both environmental and individual level risk factors.

**Citation: **Yu Q, Li B (2020) Third-variable effect analysis with multilevel additive models. PLoS ONE 15(10):
e0241072.
https://doi.org/10.1371/journal.pone.0241072

**Editor: **Gang Han,
Texas A&M University, UNITED STATES

**Received: **August 29, 2020; **Accepted: **October 8, 2020; **Published: ** October 23, 2020

**Copyright: ** © 2020 Yu, Li. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability: **All relevant data are within the manuscript and its Supporting Information files.

**Funding: **This study is funded by the National Institute on Minority Health and Health Disparities (https://www.nimhd.nih.gov/). The award number is #1R15MD012387. The PI is Dr. Qingzhao Yu. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

A third-variable effect (TVE) refers to the effect conveyed by a third-variable to an observed relationship between an exposure and a response variable of interest. Depend on whether there is a causal relationship from the exposure variable to the third-third variable and then to the response, the third-variable (denoted as *M*) is often called a mediator (when there are causal relationships) or a confounder (no causal relationship is involved). Research purposes for the third-variable analysis are generally, 1) identify significant third-variables that can partially or completely explain the relationship between the exposure variable (*X*) and the outcome (*Y*); and 2) differentiate the TVE from different paths that connect between *X* and *Y*. Except for the difference in explaining the effects from a mediator and from a confounder, the TVE analysis method can be used to make inferences on both mediation and confounding effects [1]. Third-variable analysis has been widely used in psychology, social sciences, behavioral research, health prevention, epidemiological studies, and genetic analysis. Within the context of generalized linear models, general predictive models, and counterfactual framework, a number of methods have been proposed for inferences on TVEs, to name a few, [2–7].

In research practices, the experiment data are often collected hierarchically. For example, to identify variables that are related with childhood obesity, we consider both environmental (e.g. walkability of the neighborhood) and individual factors (e.g. snacking habit). When hierarchical databases are considered, third-variable analysis method based on generalized linear models are usually not readily adaptable since the independence assumption among observations is violated. In such cases, hierarchical models, also known as multilevel or mixed-effect models, are more appropriate to fit relationships among variables since these models can catch dependencies among observations and allow for predictors from different levels of the data [8]. In this paper, we propose a third-variable analysis method based on multilevel models. For the hierarchical model, we assume there are two levels of data and refer the individual level as level 1, and the group level as level 2. Although more than two levels of hierarchy is possible, this paper focuses on two-level databases only.

In third-variable analysis, besides the pathway that directly connect the exposure variable with the outcome, we explore the *exposure* → *third*-*variable* → *response* or *X* → *M* → *Y* pathways. When doing third-variable analysis with multilevel models, the level of the variables at the left of the arrow should be higher than or equal to that of the right, since a group level variable may affect an individual level variable but not vice versa [9]. In this setting, only the 2 → 2 → 2, 2 → 2 → 1, 2 → 1 → 1, and 1 → 1 → 1 relationships are legible. Multilevel models are necessary to deal with hierarchical data base even for the 1 → 1 → 1 relationship. [10] studied the bias brought by using single level models when data are hierarchical. [9, 11–13] proposed third-variable analysis methods for all three types of multilevel models. Moreover, [14–16] proposed alternative methods to test the indirect effects in 2 → 2 → 1, 2 → 1 → 1 models. In addition, [17] proposed to use Bayesian third-variable analysis to deal with hierarchical databases.

In this paper, we propose to use generalized additive multilevel models for third-variable analysis with hierarchical databases. The major contributions of this proposed method are that: 1) we extend the general definitions of TVEs proposed by [7] to multilevel data structure; 2) the method allows any types (categorical or continuous) of the exposure(s), third-variables and response variable(s) in exploring TVEs; 3) TVEs from individual or group of third-variables and from different levels are differentiable; 4) nonlinear associations among variables are allowed in calculating the TVEs; 5) multiple exposure(s) from different levels and multivariate outcomes are allowed in the third-variable analysis, and 6) an R package *mlma* was developed for the method proposed in this paper. Practitioners can easily implement the method in real data analysis.

The rest of the paper is organized as follows. In Section 2, we review the general definitions of TVEs and then extend these concepts to the multilevel model situation. We also review the generalized additive multilevel models (GAMM) that are used to model relationships among variables. Based on that, we propose the multilevel third-variable analysis with GAMM. In Section 3, we illustrate the use of the proposed method in different multilevel data structures and the usage of the *mlma* R package. We then use the method in a real data example in Section 4: to explore the ethnic disparity in obesity considering both individual and environmental risk factors. Section 5 gives a summary of the proposed method and points out future research directions.

## 2 Third-variable analysis with multilevel additive models

[7] proposed general definitions for TVEs. First we recapitulate the definitions and then propose the multilevel additive models in third-variable analysis with hierarchical databases.

### 2.1 Definitions of third-variable effects with single-level data

The conceptual model for TVE of one level is shown in Fig 1. In the Figure, *X* is the exposure variable, *Y* is the outcome, **M** = {*M*_{1}, …, *M*_{p}} is the vector of *p* third-variables, and **Z** is the vector of other variables which relate with *Y* but do not intermediate the *X*−*Y* relationship. As usual, the TVE includes total effect—the overall effect from the exposure variable *X* to the outcome *Y*, direct effect—the remaining effect from *X* to *Y* after accounting for effects form third-variables, and indirect effect from *M*_{i}—the effect from the path *X*−*M*_{i}−*Y*.

Basically, [7] defines the total effect as the *changing rate* in the outcome when the exposure variable changes. Compared with the traditional definition of TVEs, which are defined as the *amount of change* of the outcome when the exposure variable changes from one status to another, the Yu et al.’s definition has the following benefits:

- The total effect is scale invariant to the unit of the exposure variable. This is because the effect is defined as the changing rate but not as the changing amount of
*Y*with*X*. - There is no need to define the two status of the exposure variable. Therefore the total effects can be calculated not only for binary but also for continuous exposures.
- The total effect can be calculated with nonlinear models when the relationship among variables are changing at different values of the exposure variable. The total effect can be expressed as a function of the exposure variable by the Yu definition [7].

The direct effect not from *M*_{i} is similarly defined as the total effect except that the relationship between the exposure variable and *M*_{i} is manipulatively broken. In such case, *M*_{i} are controlled not to change with *X*: *M*_{i} follows its marginal distribution from the observations. The indirect effect from *M*_{i} is then defined as the total effect minus the direct effect not from *M*_{i}. All the benefits of the definition of total effect are also fit for the direct and indirect effects. In additon, multiple third-variables can be considered simultaneously, and the indirect effects carried by individual third-variables can be separated from the total effect. For the formal definitions of TVEs, readers are referred to [7] and [18]. [7] showed that the proposed definitions of TVEs are equivalent to the conventional TVEs under certain situations (e.g., linear regression models with continuous third-variables and outcome). They also established the relationship between the proposed definitions of direct or indirect effect and the natural direct or indirect effect for binary exposure variables. Later, the definitions have been implemented to explore racial and ethnic health disparities by [19, 20] and extended to deal with time-to-event outcomes ([21]) and for multiple exposures and multivariate outcomes ([22]). In this paper, we extend the method to the context of multilevel models.

### 2.2 Definitions of third-variable effects with data of two levels

The unique data structure in multilevel models raises the potential problem of confounding TVEs from different levels. As pointed out in [13], the relationship between two level-one variables can be decomposed into between-group and within-group components. In particular, the aggregated variables at the second level can be highly related while the relationship may be very weak or even at an opposite direction when considered at the individual level [23]. For example, [24] points out that the proportion of black residents may be an important variable for the census tract, while it is different from the meaning of ethnicity as an individual-level variable. [25] discussed the difference of the two components extensively. It is important to differentiate the between-group and within-group components in third-variable analysis, where the TVEs can be decomposed to level 1 and level 2 effects. To identify the level 1 and level 2 TVEs separately, [13] proposed the group-mean centering method (CWC), where they subtracted the group means from individual level variables and added group means as level 2 covariates. In their paper, [13] showed that the CWC method efficiently separated level 1 and level 2 TVEs and resulted in less bias and more power compared with non-centering methods. In this paper, we use a different way to estimate the level 1 and level 2 TVEs: extend the definitions of TVEs with single level models by [7] to multilevel models.

With the generalized definition of TVEs, [22] has shown that a third-varaible analysis can involve multiple exposure variables and multivariate outcomes. The purpose of third-variable analysis is to differentiate the direct effect and indirect effect from each third-variable for each pair of the exposure-outcome relationship. If the outcome is at level 2, all exposure and mediators have to be level 2 as discussed in Section 1. Therefore, a single-level third-variable analysis works. If the outcome is at level 1, the exposure variable can be a level 1 or level 2 variable. The third-variables can be level 1 or 2 for a level 2 exposure, but have to be level 1 for a level 1 exposure variable. In this paper, we focus on level 1 outcomes.

Denote **M**_{ij} = (*M*_{ij1}, …, *M*_{ijK}) as the vector of *K* potential level 1 third-variables for the *ith* object at the *jth* group. **M**_{ij,−k} is the vector **M**_{ij} excluding the *kth* element. Denote **M**_{.j} = (*M*_{.j1}, …, *M*_{.jL}) as the vector of the *L* potential level 2 third-variables or level 1 third-variables aggregated at level 2 within group *j*. Let *M*_{ijk}(*x*_{ij}) be a random variable that has a conditional distribution given *X*_{ij} = *x*_{ij}. For an exposure variable *X* at any level, let *u** be the minimum unit of *X*, such that if *x* ∈ *domain*(*X*), then *x* + *u** ∈ *domain*(*X*). For now, we ignore other covariates **Z**. Assume effects of exposures and third-variables on the outcome are additive, we have the general definitions of TVEs, following [7], for level 1 (Definition 1) and level 2 exposure variables (Definition 2). Note that a level 1 exposure can have only level 1 mediators while a level 2 exposure can have both level 1 and level 2 mediators.

**Definition 1**. *For a level 1 exposure variable X, the level 1 total effect (TE*_{1}) *of X on Y, the level 1 direct effect* () *of X on Y not from level 1 third-variable M*_{k} *and the level 1 indirect effect of X on Y through M*_{k} *at X* = *x*_{ij} () *are defined as*:
(1) (2) (3)

The average level one TVEs are the mean value of the TVEs defined by Definition 1: *ATE*_{1} = *E*_{ij}[*TE*_{1}(*x*_{ij})], *ADE*_{1,\k} = *E*_{ij}[*DE*_{1,\k}(*x*_{ij})] and *AIE*_{1,k} = *ATE*_{1} − *ADE*_{1,\k}.

**Definition 2**. *For a level 2 exposure variable X, the level 2 total effect* (*TE*_{2}) *of X on Y, the level 2 direct effect* () *of X on Y not from the level 1 third variable M*_{k} *and level 2 third variable M*_{l}, *and the level 2 indirect effect of X on Y through M*_{k} *and M*_{l} *at X* = *x*_{.j} () *are defined as*:
(4) (5) (6) (7) (8)

The average level 2 TVEs are the TVEs defined by Definition 2 averaged at the group level: *ATE*_{2} = *E*_{j}[*TE*_{2,j}(*x*_{.j})], *AIE*_{21,k} = *E*_{j}[*IE*_{21,jk}(*x*_{.j})] and *AIE*_{22,l} = *E*_{j}[*IE*_{22,jl}(*x*_{.j})].

### 2.3 Multilevel additive models

We use multilevel additive models to build relationships among variables of hierarchy. The additive model is a nonlinear regression method that was first proposed by [26]. A multilevel additive model can deal with both nonlinear covariate effects and cluster-specific heterogeneity [27]. It is now gaining rapid popularity in psychological and social research [28]. Using the notations in Section 2.2, assume that we have *L* level 2 and *K* level 1 third-variables. In addition, assume that there are *E*_{1} level 1 and *E*_{2} level 2 exposure variables, denoted as and respectively. We propose the following linear additive multilevel models for multilevel third-variable analysis. The boldfaced letter indicates a potential vector of functions or numbers.

For level 2 third-variables, *M*_{.jl}, *l* = 1, …, *L*:

For level 1 third-variables, *M*_{ijk}, *k* = 1, …, *K*:

In the models, *r*_{0jk} and *r*_{0j} are second-level random errors with mean zero and finite variances. **f**(⋅) is a function/transformation vector of ⋅. The transformation enables modeling nonlinear relationships among variables. We assume that all the transformation functions are first-order differentiable. *α* and *β* are coefficient vectors for transformed variables in predicting the response variable on the left side of each equation. In addition, *g*(⋅)s are the link functions that link the expected response variable with the right-hand-side of each equation, the systematic component of a generalized linear model. For example, a binary *M*_{ijk} may have a link function *g*_{1k} = *logit*(*P*(*M*_{ijk} = 1)). Similarly, a link function can be used on the outcome. With the link function, we can deal with different types of third-variables and outcomes.

### 2.4 Third-variable effects with multilevel additive model

Based on the definitions of TVEs, we derive the TVEs in Theorems 1 and 2. In the theorems, *f*′(*x*) denotes the first derivative of function *f* on the random variable *X* and realized at *X* = *x*. In addition, *g*^{−1} denotes the inverse function of *g*. We further denote *μ*_{ijk} = *E*(*M*_{ijk}) and *μ*_{.jk} = *E*(*M*_{.jk}). The proofs of theorems are included in the S1 File.

**Theorem 1** *With the relationships among variables built by Section 2.3, the TVEs for level 1 exposure variable X*_{ije}, *e* = 1, …, *E*1 *on the outcome variable Y are*:

**Theorem 2**. *With the relationships among variables built by Section 2.3, the TVEs for level 2 exposure variable X*_{.je}, *e* = 1, …, *E*2 *on the outcome variable Y are*:

### 2.5 Bootstrap method for third-variable effect inferences

Finally, we use the bootstrap method to calculate the variances of the TVEs. In particular, at the group level, a bootstrap sample of the same size for each group is drawn with replacement from the original data set. Then multilevel additive models are fitted based on the bootstrap sample to get the estimates of *β*s, based on which the TVEs can be calculated by Theorems 1 and 2. The above process repeats many times and inferences can be made based on the repeated estimates. To draw the bootstrap sample, we keep all groups at the second level and then at the individual level, we draw observations with replacement of size *n*_{j} from the jth group, where *n*_{j} is the number of observations in the jth group. This bootstrap method is adopted in the R package *mlma* to estimate the variances of estimates and to build up confidence intervals. The *mlma* package is available from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/web/packages/mlma/index.html and illustrated by the simulations in Section 3.

### 2.6 The *mlma* R package

The authors developed a R package, *mlma*, for multilevel third-variable analysis. The analysis is based on multilevel additive models where nonlinear transformations of variables are allowed. The package **mlma** contains three groups of functions: function *data.org* is used to prepare data sets for analysis—transforming variables, dichotomizing categorical third-variables, and getting the derivatives of the transformation functions. The functions *mlma* and *boot.mlma* are used for statistical inferences on the TVEs. The former estimates the TVEs and the latter generates bootstrap samples from the original data sets and does the third-variable analysis based on the bootstrap samples. The estimates of TVEs from the bootstrap samples are then used for statistical inferences. The third group of functions are generic functions—*print, summary* and *plot* results from the *mlma* and *boot.mlma* functions. The details of how to use the functions are fully documented within the package and the mlma vignette. The analysis in Section 3 are all performed using the package.

## 3 Simulations

In this section, we show the performance of the proposed method and illustrate the use of the R package *mlma*. The first simulation is based on a 2 → 1 → 1 true model adapted from [13] but it is extended to include exposure and third-variables at both levels. There is one exposure variable at each level and one third-variable at each level. The level 1 exposure and level 2 third-variable are binary. No transformation is performed on the original variables. The first simulation is to exam how the sensitivity and specificity of identifying important third-variables are influenced by different sets of parameters.

The second simulation is based on nonlinear multilevel models with both level exposures and one level 1 continuous third-variable. The second simulation is to demonstrate how to include transformation terms of variables in the multilevel mediation analysis and how to use the graph tools in the R package *mlma* to illustrate relationships among variables.

The simulations are performed using the R package *mlma*. The R codes are included in the S1 File.

### 3.1 Simulation 1

The first simulation includes one level 1 binary exposure, *X*_{ij}, and one level 2 continuous exposure, *X*_{.j}, where *j* denotes the *jth* group and *i* denotes the ith observation in the *jth* group. There are 600 observations in total and *n* observations in each group. Therefore, there are 600/*n* groups. To check how group sizes and number of groups can influence the sensitivity of identifying important TVEs, we set *n* at 5 and 20 respectively. *X*_{ij} are generated through binomial distribution with the probability of success 0.5. *X*_{.j} are generated through a standard normal distribution. There are a level 2 binary third variable, *M*_{.j}, and one level 1 continuous third-variable, *M*_{ij}. In detail, the simulation data are generated from the following models for *i* = 1, …, *n* and *j* = 1, …, 600/*n*:
where *u*_{0j} ∼ *N*(0, 0.5), *ϵ*_{0ij} ∼ *N*(0, 1), *u*_{1j} ∼ *N*(0, *v*_{2}) and *ϵ*_{1ij} ∼ *N*(0, *v*_{1}) are independently generated random errors at both levels. The level two random error for the response variable is set as one-fifth of the level one variance—*v*_{2} = *v*_{1}/5, which makes the intraclass correlation (ICC) to be.17. As pointed out by [29], this medium ICC value facilitates model convergence. *v*_{2} is chosen to be 5 or 1, *β*_{1} and *β*_{2} ranges within the set {−.59, −.14, 0, 0.14, 0.39} and *β*_{3} and *β*_{4} from {0, 0.14, 0.59}. We then check the influence of these parameters on the sensitivity and specificity of identifying important TVEs. The data with each combination of parameters are generated 20 times. The sensitivity of a TVE is the proportion of times that the estimated confidence interval of the effect does not include 0 when the actual effect is not 0. Relatively, the specificity is defined as the proportion of times that the estimated confidence interval of the effect includes 0 when the true effect is 0.

For the level 2 exposure, the average direct effect for *X*_{.j} → *Y*_{ij} (denoted as de2), the indirect effect from *M*_{.j} for *X*_{.j} → *M*_{.j} → *Y*_{ij} (denoted as ie2.2), the indirect effect from *M*_{ij} for *X*_{.j} → *M*_{ij} → *Y*_{ij} (denoted as ie2.1), and the total effect (te2) are estimated.

For the level 1 exposure, the average direct effect for *X*_{ij} → *Y*_{ij} (denoted as de1), the indirect effect from *M*_{ij} for *X*_{ij} → *M*_{ij} → *Y*_{ij} (denoted as ie2.1), and the total effect (te1) are estimated.

Fig 2 shows the sensitivity (or 1-specificity when *β*_{2} = 0) of identifying significant level 2 direct effect, when *β*s change but *n* is fixed at 20 and *v*_{1} at 1. The x-axis is *β*_{2}, the true level 2 direct effect. Different color and line type in each plot represents different values of *β*_{1}. Each row of the 3 × 3 panel in Fig 2 is a different value for *β*_{4}, each column is for a different value of *β*_{3}. We found that the sensitivity of correctly identifying important level 2 direct effect depends on the value of other TVEs. This is mainly due to the increased correlations among variables that results in raised variances in the estimation.

The x-axis is the true direct effect, and y-axis is the sensitivity (or 1-specificity when true de is 0). Different color and line type represents different value for *β*_{1}. Rows are for *β*_{4} = 0, 0.14 and 0.59, and columns are for *β*_{3} = 0, 0.14 and 0.59 respectively.

Fig 3 shows the sensitivity or 1-specificity (when *β*_{1} = 0) of identifying important level 1 direct effect. The sensitivity is averaged over *β*_{3} and *β*_{4} for different values of *β*_{1}, *v*_{1} and *n*. We see that the sensitivities reduce a lot when the variances are bigger. However, values of *n* have minimum influence on the sensitivity.

The x-axis is the true direct effect, and y-axis is the sensitivity (or 1-specificity when the true de is 0). Different color and line type represents different value for *β*_{2}. Each plot is for a different combination of *n* and *v*_{1}.

Finally, Figs 4 and 5 show the sensitivity and specificity of identifying indirect effects. For both Figures, we fix *v*_{1} at 1. Fig 4 shows how the sensitivity of level 2 indirect effects of *M*_{ij} and *M*_{.j} changes with *β*_{3} and *β*_{4}. The left panel is the sensitivity and 1-specificity (when *β*_{3} = 0) of identifying *M*_{ij}. The level 2 indirect effect of *M*_{ij} is proportional to *β*_{3}, so the sensitivity increases with *β*_{3}. The sensitivity is minimally influenced by the value of *β*_{4}. The right panel is the sensitivity and 1-specificity (when *β*_{4} = 0) of identifying *M*_{.j}. Again, the sensitivity of identifying *M*_{.j} is proportional to *β*_{4}, but is not systematically influenced by the value of *β*_{3}.

The left panel is the sensitivity and 1-specificity of identifying *M*_{ij}. The right panel is the sensitivity and 1-specificity of identifying *M*_{.j}.

The left panel is for the level 1 indirect effect and the right panel is for the level 2 indirect effect of *M*_{ij}.

Fig 5 shows how the actual direct effect influences the estimation of the indirect effect of *M*_{ij}. The left panel is for the level 1 indirect effect and the right panel is for the level 2 indirect effect. For both panels, the black solid line is the 1-specificity and the other two lines are the sensitivity. The sensitivity for level 1 indirect effect increases with *β*_{3} and that for level 2 indirect effect increases with *β*_{4}. The x-axis is the true level 1 or 2 direct effect for the left and right panel respectively. We see very small influence of the true direct effect on the estimation of the indirect effect.

### 3.2 Simulation 2

The second simulation is to illustrate the use of the *mlma* package with transformations in the variables. In the simulation, there is only one level 2 third-variable, *M*_{ij} and the variable is causally proceeded by two exposure variables, *X*_{ij} ∼ *N*(0, 0.5) at the first level and *X*_{j} ∼ *χ*^{2}(*df* = 2) at the second level. In the simulation, there are 30 groups, and 20 observations generated for each group. That is *i* = 1, …, 20 and *j* = 1, …, 30. The data are generated from the following model:
where *ϵ*_{1ij} ∼ *N*(0, 1), *ϵ*_{ij} ∼ *N*(0, 2), *u*_{0j} ∼ *N*(0, 0.4) and *u*_{oj1} ∼ *N*(0, 0.5) are independently generated as the level 1 and level 2 random errors respectively. For the data generation mechanism, the true level 1 direct effect is 0.98 and the true level 2 direct effect is *de*2(*x*_{j}) = 0.98/*x*_{j}. In addition, the true level 1 indirect effect for *M* at *x*_{ij} is 1.2 × 0.98 × *x*_{ij} and the true level 2 indirect effect is 1.2 × 0.98 = 1.176, where the former depends on the value of *x*_{ij} representing a nonlinear relationship between *M* and *X*_{ij} and the latter represents a strong (1.176) constant indirect effect.

Fig 6 shows the estimated direct effects at both levels from the simulated data. We see that the estimated direct effect correctly delineates the nonlinear relationship among variables.

The *mlma* package provides a *plot* function to show relationships among variables. Fig 7 shows the results of the *plot* function without specifying a specific third-variable. The Figure shows the relative effect (defined as the TVE divided by the total effect) at each exposure variable.

Fig 8 shows results of the *plot* function with the third-variable *M*. The left panel is the effect of *M* with the level 1 exposure denoted as *x*1 and the right panel is for the level 2 exposure which is denoted as *x*2. The first row of Fig 8 gives the estimated indirect effect with confidence intervals. The second row shows the changing rate of *M* with the exposure variables and the third row gives the changing rate of the outcome with the third-variable *M*. Again, the figures correctly describe relationships among variables.

The codes for simulating data set and transforming variables in the third-variable analysis are available in the S1 File.

## 4 Explore the racial disparity in obesity

Finally, we implement the method in a real data example: to explore the racial disparity in body mass index (BMI). In a previous research, we found that on average, non-Hispanic blacks have a higher BMI and higher rate of obesity compared with non-Hispanic whites [19, 22] using the 2003-2006 National Health and Nutrition Examination Survey (NHANES). We are interested to see how the disparities can be explained by both individual and environmental factors. Environmental risk factors are generated at the census tract level, which include both food environments and physical activity environments. Individual level variables include age, gender, smoking status, etc. Readers are referred to [19, 22] for more details about the variables.

In this demonstration, we try to use risk factors to explain the racial disparity in BMI. We first use multiple additive regression trees to describe the relationships between BMI and all risk factors, and then we performed a data transformation on the risk factors so that the transformed variables have a roughly linear relationship with BMI. For the individual level risk factors, we did the following transformations:

- The natural cubic spline bases for age with degrees of freedom of 4.
- Truncate the physical activity measurement to two parts: smaller than 2.1 and larger than 2.1 since we see a change point at 2.1 when depicting the relationship between physical activity and BMI.

We also use the natural cubic spline basis with different degrees of freedom on some of the environmental factors. Please refer to the S1 File to see how we make the transformations. The individual level exposure variable is the race (black (0) and white (1)). The census tract level exposure is the proportion of whites in the census tract. Table 1 shows the estimated third variable effects at both the individual and census tract levels. Note that since high level variable (e.g. census tract level) can influence the lower level (individual level) variable but not the reverse, all level 1 third-variables have both individual and census tract level effects, while all level 2 third-variables have only census tract level effects.

We found that at the individual level, on average, the BMI among Whites is 1.94 (TE) lower than that for Blacks. Individual level factors can partially explain the racial difference. For example, age explains about 15%(−0.30 *dividedby* −1.94) of the difference. Fig 9 shows that a larger proportion of Blacks (*x* = 0) are at the middle age (late 30s to early 60s), compared with Whits. In addition, middle age is related with the highest BMI. The right panel of Fig 9 shows the relationship between age and BMI, which is not linear. BMI increases with age, peaked at the middle age and then declines. By the transformation in third-variable analysis, we can catch the nonlinear relationship between age and BMI.

On the other hand, the racial difference in BMI at the census tract level is not significant. The total effect is −0.9 with a 95% confidence interval that includes 0. In the study, no environmental risk factors that significantly influences the racial disparity in BMI is found. However, in the analysis, multilevel model is required to control the dependencies of subjects who come from the same residential environments.

## 5 Discussion and future work

In this paper, we extend the definitions of TVEs proposed by [7] to multilevel data settings, based on which the direct and indirect effects can be differentiated from different levels. In addition, indirect effects from different third-variables can be separated. Multilevel additive models are adapted to model the relationships among variables to account for the hierarchical data structure. Data transformation is allowed to represent potential non-linear relationship. We also create an R package, *mlma*, that implements the proposed method for multilevel third-variable analysis. Using the package, TVEs on average as well as at different values of the exposure(s) can be estimated. Simulations showed that the proposed method and package can accurately estimate TVEs at different levels. We implement the method to explore the racial disparity in BMI accounting for both environmental and individual-level risk factors.

There are some limitations in the current method. One is that we have to know how to transform the variables so that the multilevel additive linear models can accurately catch the relationships among variables. Also, the transformation functions have to be differentiable to calculate the indirect effects. Alternative, we can use the multilevel smoothing splines [30], to analyze the multilevel relationship. In such case, there is no need to know the transformation forms beforehand and the bases for smoothing splines are differentiable. Another limitation of the package of the current version is that the research is confined to deal with data of at most two levels. Extension to more levels is more complicated due to the conditions of TVEs such that the predictor should have equal or higher level than the third-variables, which in turn should have higher or equal levels than the outcome. Furthermore, when the dataset has more levels, TVEs from different levels should be estimated separately. A future work is to develop a more flexible algorithm for general models of potentially more levels.

## References

- 1.
MacKinnon D. P., Krull J., and Lockwood C. Equivalence of the mediation, confounding and suppression effect.
*Prevention Science*, 2000, 1(4):173–181. - 2.
Robins J. M. and Greenland S. Identifiability and exchangeability for direct and indirect effects.
*Epidemiology*, 1992, 3:143–155. - 3.
MacKinnon D. P., Warsi G., and Dwyer J. A simulation study of mediated effect measures.
*Multivariate Behavioral Research*, 1995, 30:41–62. pmid:20157641 - 4.
Albert J. M. Mediation analysis via potential outcomes models.
*Statistics in Medicine*. 2007; 63:926–934. - 5.
VanderWeele T. J. Marginal structural models for the estimation of direct and indirect effects.
*Epidemiology*, 2009, 20(1):18–26. pmid:19234398 - 6.
Imai K., Keele L., and Yamamoto T. Identification, inference, and sensitivity analysis for causal mediation effects.
*Statistical Science*, 2010, 25(1):51–71. - 7.
Yu Q., Fan Y., and Wu X. General multiple mediation analysis with an application to explore racial disparity in breast cancer survival.
*Journal of Biometrics Biostatistics*, 5(2):189, 2014. - 8.
Gelman A. Multilevel (hierarchical) modeling: What it can and cannot do.
*Technometrics*, 2006, 48(3):432–435. - 9.
Krull J. L. and MacKinnon D. P. Multilevel modeling of individual and group level mediated effects.
*Multivariate Behavioral Research*, 2001, 36(2):249–277. - 10.
Tofighi D., West S. G., and MacKinnon D. P. Multilevel mediation analysis: the effects of omitted variables in the 1-1-1 model.
*British Journal of Mathematical and Statistical Psychology*, 2012, 66(2):290–307. - 11.
Bauer D. J. Estimating multilevel linear models as structureal equatin models.
*Journal of Educational and Behavioral Statistics*. 2003; 28(2):135–167. - 12.
Bauer D. J., Preacher K. J., and Gil K. M. Conceptualizing and testing random indirect effects and moderated mediation in multilevel models: New procedures and recommendations.
*Psychological Methods*. 2006; 11(2):142–163. - 13.
Zhang Z., Zyphur M. J., and Preacher K. J. Testing multilevel mediation using hierarchical linear modesl, problems and solutions.
*Organizational Research Methods*, 12(4):695–719, 2009. - 14.
Kenny D. A., Bolger N., and Korchmaros J. D. Lower level mediation in multilevel models.
*Psychological Methods*, 2003, 8(2):115–128. - 15.
Gitelman A. I. Estimating causal effects from multilevel group-allocation data.
*Journal of Educational and Behavioral Statistics*, 2005, 30(4):397–412. - 16.
Pituch K. A., Whittaker T. A., and Stapleton L. M. A comparison of methods to test for mediation in multisite experiments.
*Multivariate Behavioral Research*, 2005, 40(1):1–23. - 17.
Yuan Y. and YMacKinnon D. P. Bayesian mediation analysis.
*Psychological Methods*, 14(4):301–322, 2009. pmid:19968395 - 18.
Yu Q. and Li B. mma: An r package for mediation analysis with multiple mediators.
*Journal of Open Research Software*, 5, apr 2017. - 19.
Yu Q., Scribner R., Leonardi C., Zhang L., Park C., Chen L., et al. Exploring racial disparity in obesity: a mediation analysis considering geo-coded environmental factors.
*Spatial and Spatio-temporal Epidemiology*, 21:13–23, 2017. pmid:28552184 - 20.
Yu Q., Medeiros K., Wu X., and Jensen R. Nonlinear predictive models for multiple mediation analysis: With an application to explore ethnic disparities in anxiety and depression among cancer survivors.
*Psychometrika*, Apr 2018, 83(4):991–1006. pmid:29611093 - 21.
Yu Q., Wu X., Li B., and Scribner R. Multiple mediation analysis with survival outcomes: With an application to explore racial disparity in breast cancer survival.
*Statistics in Medicine*, 38(3):398–412, sep 2018. pmid:30255567 - 22.
Yu Q. and Li B. A multivariate multiple third-variable effect analysis with an application to explore racial and ethnic disparities in obesity.
*Journal of Applied Statistics*, 2020. - 23.
Raudenbush S. W. and Bryk A. S.
*Hierarchical Linear Models: Applications and Data Analysis Methods*(2nd edition). Sage, Thousand Oaks, CA, 2002. - 24.
Snijders TAB and Bosker R. J.
*Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling*. Sage, Thousand Oaks, CA, 1999. - 25.
Alker H. R. A Typology of Ecological Fallacies, In
*Quantitative Ecological Analysis in the Social Sciences*. The MIT Press, Cambridge, MA, 1969. - 26.
Friedman J. H. and Stuetzle W. Estimating optimal transformations for multiple regression and correlation.
*Journal of the American Statistical*, 1981, 76:817–823. - 27.
Lang S., Umlauf N., Wechselberger P., Harttgen K., and Kneib T. Multilevel structured additive regression.
*Statistics and Computing*, 2012, 24(2):223–238. - 28.
Brunauer W., Lang S., and Umlauf N. Modelling house prices using multilevel structured additive regression.
*Statistical Modelling*, 2013, 13(2):95–123. - 29.
Busing F. Distribution characteristics of variance estimates in two-level models. Unpublish manuscript, 1993.
- 30.
Gu C. and Ma P. Optimal smoothing in nonparametric mixed-effect models.
*The Annals of Statistics*, 2005, 33(3):1357–1379.