Orienting the causal relationship between imprecisely measured traits using genetic instruments

Inference of the causal structure that induces correlations between two traits can be achieved by combining genetic associations with a mediation-based approach, as is done in the causal inference test (CIT) and others. However, we show that measurement error in the phenotypes can lead to mediation-based approaches inferring the wrong causal direction, and that increasing sample sizes has the adverse effect of increasing confidence in the wrong answer. Here we introduce an extension to Mendelian randomisation, a method that uses genetic associations in an instrumentation framework, that enables inference of the causal direction between traits, with some advantages. First, it is less susceptible to bias in the presence of measurement error; second, it is more statistically efficient; third, it can be performed using only summary level data from genome-wide association studies; and fourth, its sensitivity to measurement error can be evaluated. We apply the method to infer the causal direction between DNA methylation and gene expression levels. Our results demonstrate that, in general, DNA methylation is more likely to be the causal factor, but this result is highly susceptible to bias induced by systematic differences in measurement error between the platforms. We emphasise that, where possible, implementing MR and appropriate sensitivity analyses alongside other approaches such as CIT is important to triangulate reliable conclusions about causality.


Introduction
Observational measures of the human phenome are growing ever more abundant, but using these data to make causal inferences is notoriously susceptible to many pitfalls, with basic regression-based techniques unable to distinguish a true causal association from reverse causation or confounding (1-3). In response to this, the use of genetic associations to instrument traits has emerged as a technique for improving the reliability of causal inference in observational data, and with the coincident rise in genome-wide association studies it is now a prominent tool that is applied in several different guises (3-6). However, potential pitfalls remain, and one that is often neglected is the influence of non-differential measurement error on the reliability of causal inference.
Measurement error is the difference between the measured value of a quantity and its true value. This study focuses specifically on non-differential measurement error, where all strata of a measured variable have the same error rate, which can manifest as changes in scale or measurement imprecision (noise). Such variability can arise through a plethora of mechanisms, which are often specific to the study design and difficult to avoid (7,8). Array technology is now commonly used to obtain high-throughput phenotyping at low cost, but it comes with the problem of imperfect resolution; for instance, methylation levels as measured by the Illumina 450k chip are prone to some amount of noise around the true value due to imperfect sensitivity (9,10). Relatedly, if the measurement of biological interest is the methylation level in a T cell, then measurement error can be introduced by using methylation levels from whole blood samples, because the measured value will be an assay of many cell types (11). Measurement error will of course arise in other types of data too. For example, when measuring BMI one is typically interested in using it as a proxy for adiposity, but the correlation between BMI and underlying adiposity is clearly not perfect (12). A similar problem of biological misspecification is unavoidable in disease diagnosis, and measuring behaviour such as smoking or diet is notoriously difficult to do accurately. Measurement error can also be introduced after the data have been collected; for example, transforming non-normal data for the purpose of statistical analysis yields a new variable that will typically incur both changes in scale and imprecision (noise) compared to the original variable. The sources of measurement error are not limited to this list (8), and its impact has been explored extensively in the epidemiological literature (13,14).
Given the near-ubiquitous presence of measurement error in phenomic data it is vital to understand its impact on the tools we use for causal inference.
An established study design that can provide information about causality is randomisation. Given the hypothesis that trait A (henceforth referred to as the exposure) is causally related to trait B (henceforth referred to as the outcome), randomisation can be employed to assess the causal nature of the association by randomly splitting the sample into two groups, subjecting one group to the exposure and treating the other as a control. The association between the exposure and the outcome in this setting provides a robust estimate of the causal relationship. This provides the theoretical basis for randomised controlled trials, but in practice randomisation is often difficult or impossible to implement in an experimental context due to cost, scale or inability to manipulate the exposure. The principle, however, can be employed in extant observational data through the use of genetic variants associated with the exposure (instruments), where the inheritance of an allele serves as a random lifetime allocation of differential exposure levels (15,16). Two statistical approaches to exploiting the properties of genetic instruments are widely used: mediation-based approaches and Mendelian randomisation (MR).
Mediation-based approaches employ genetic instruments (typically single nucleotide polymorphisms, SNPs) to orient the causal direction between the exposure and the outcome. If a SNP is associated with an exposure, and the exposure causally influences some outcome, then it logically follows that in this simple three-variable scenario the estimated direct influence of the SNP on the outcome will be zero when conditioning on the exposure. Here, the exposure completely mediates the association between the SNP and the outcome, providing information about the causal influence of the exposure on the outcome. This forms the basis of a number of methods such as genetical genomics (17), the regression-based causal inference test (CIT) (4,18), a structural equation modelling (SEM) implementation in the NEO software (5), and various other methods including Bayesian approaches (6). They have been employed by a number of recent publications that make causal inferences in large-scale 'omics datasets (6,19-23).
MR can be applied to the same data (phenotypic measures of the exposure and outcome variables, and a genetic instrument for the exposure), but the genetic instrument is employed in a subtly different manner.
Here the SNP is used as a surrogate for the exposure. Assuming the SNP associates with the outcome only through the exposure, the causal effect of the exposure on the outcome can be estimated by scaling the association between the SNP and the outcome by the association between the SNP and the exposure. Though this assumption is difficult to test empirically, it can be relaxed in various ways when multiple instruments are available for a putative exposure (24,25), and a number of sensitivity tests are now available to improve reliability (26). Because they use genetic instruments in different ways, mediation-based analysis and MR have properties that confer distinct advantages and disadvantages for reliable causal inference. In the CIT framework (described fully in the Methods), for example, the test statistic differs depending on whether one tests for the exposure causing the outcome or the outcome causing the exposure, allowing the researcher to infer the direction of causality between two variables by performing the test in both directions and choosing the model with the strongest evidence. The CIT also has the valuable property of being able to distinguish between several putative causal graphs that link the traits with the SNP (Figure 1). This is not the case for MR, where, in order to infer the direction of causality between two traits, the instrument must have its most proximal link with the exposure, associating with the outcome only through the exposure.
Relying on assumed biological knowledge of genetic associations can be problematic: if there is a putative association between two variables, and the SNP is robustly associated with each, it can be difficult to determine which of the two variables is subject to the primary effect of the SNP (i.e. for which of the two variables is the SNP a valid instrument? Figure 1). By definition, we expect that if the association is causal then a SNP for the exposure will also be associated with the outcome, such that if the researcher erroneously uses the SNP as an instrument for the outcome then they are likely to see an apparently robust causal association of the outcome on the exposure. Genome-wide association studies (GWASs) that identify genetic associations for complex traits are, by design, hypothesis-free and agnostic of genomic function, and it often takes years of follow-up studies to understand the biological nature of a putative GWAS hit (27). If multiple instruments are available for a hypothesised exposure, which is increasingly typical for complex traits analysed in large GWAS consortia, then techniques can be applied to mitigate these issues (16). However, these techniques cannot always be applied when determining causal directions between 'omic measures, where typically only one cis-acting SNP is known. For example, if a DNA methylation probe is associated with expression of an adjacent gene, is a cis-acting SNP an instrument for the DNA methylation level or for the gene expression level (Figure 1)?
MR has some important advantages over the mediation-based approaches. First, the mediation-based approaches require that the exposure, outcome and instrumental variables are all measured in the same data, whereas recent extensions to MR circumvent this requirement, allowing causal inference to be drawn when exposure variables and outcome variables are measured in different samples (28). This has the crucial advantage of improving statistical power by allowing analysis in much larger sample sizes, and it dramatically expands the breadth of possible phenotypic relationships that can be evaluated (26). Second, the mediation-based approach of adjusting the outcome for the exposure to nullify the association between the SNP and the outcome is affected by unmeasured confounding of the exposure and outcome. This is because adjusting the outcome for the exposure induces a collider effect between the SNP and the outcome (29), and in order to fully abrogate this association one must also adjust for all (hidden or otherwise) confounders. MR does not suffer from this problem because it does not test for association through adjustment. Third, when MR assumptions are satisfied the method is robust to measurement error in the exposure variable (30). Indeed, instrumental variable (IV) analysis was in part initially introduced as a correction for measurement error in the exposure (31), whereas it has been noted that both classic mediation-based analyses (13,14,32,33) and mediation-based methods that use instrumental variables (34,35) are prone to be unreliable in its presence.
Using theory and simulations, we show how non-differential measurement error in phenotypes can lead to unreliable causal inference in the mediation-based CIT method. We then present an extension to MR that allows researchers to ascertain the causal direction of an association even when the biology of the instruments is not fully understood, together with a metric to evaluate the sensitivity of the result of this extension to measurement error. Together these extensions improve the utility of MR in cases where mediation-based methods might otherwise have been used preferentially. Finally, we apply this method to infer the direction of causation between DNA methylation levels and gene expression levels. Our analyses highlight that, because these different causal inference techniques have varying strengths and weaknesses, triangulation of evidence from as many sources as possible should be practised in causal inference (36).

Model
We model a system whereby some exposure $x$ has a causal influence $\beta_x$ on an outcome $y$ such that
$$y = \alpha_y + \beta_x x + \epsilon_y.$$
In addition, the exposure is influenced by a SNP $g$ with an effect of $\beta_g$ such that
$$x = \alpha_x + \beta_g g + \epsilon_x.$$
The $\alpha_*$ terms represent intercepts, and henceforth can be ignored. The $\epsilon_*$ terms denote random error, assumed independently and normally distributed with mean zero. Mediation-based analyses that test whether $x$ causally relates to $y$ rely on evaluating whether the influence of $g$ on $y$ can be accounted for by conditioning on $x$, such that $\operatorname{cov}(g, y - \hat{y}) = 0$, where $\hat{y} = \hat{\beta}_x x$ and, assuming no intercept, $y - \hat{y} = \epsilon_y$. MR analysis estimates the causal influence of $x$ on $y$ by using the instrument as a proxy for $x$, such that
$$\hat{\beta}_{MR} = \frac{\operatorname{cov}(g, y)}{\operatorname{cov}(g, x)},$$
where $\beta_{MR} \neq 0$ denotes the existence of causality, and $\hat{\beta}_{MR}$ is an estimate of the causal effect.
Measurement error of an exposure can be modelled as a transformation of the true value ($x$) that leads to the observed value, $x_o = f(x)$. For example, following Pierce and VanderWeele (30) we can define
$$x_o = \alpha_{mx} + \beta_{mx} x + \epsilon_{mx},$$
where $\alpha_{mx}$ and $\beta_{mx}$ influence the error in the measurement of $x$ by altering its scale, and $\epsilon_{mx}$ represents the imprecision (or noise) in the measurement of $x$. The same model of measurement error can be applied to the outcome variable $y$. In this study we assume there is no measurement error in the SNP, and that measurement error in the exposure and the outcome are uncorrelated.
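As an illustration of the MR (Wald ratio) estimator and the measurement error model, the following Python sketch simulates the model with illustrative parameter values chosen here (not taken from this study) and an assumed additive SNP with allele frequency 0.5. It suggests that adding noise to the measured exposure leaves $\operatorname{cov}(g, y)/\operatorname{cov}(g, x_o)$ approximately unbiased, whereas a change of scale rescales it to $\beta_x/\beta_{mx}$.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (n - 1)

def wald_ratio(g, x, y):
    """MR estimate: the SNP-outcome covariance scaled by the SNP-exposure covariance."""
    return cov(g, y) / cov(g, x)

rng = random.Random(2017)
n, beta_g, beta_x = 50_000, 0.5, 0.3   # illustrative values only

# Assumed additive SNP with allele frequency 0.5 (0, 1 or 2 effect alleles)
g = [(rng.random() < 0.5) + (rng.random() < 0.5) for _ in range(n)]
x = [beta_g * gi + rng.gauss(0, 1) for gi in g]   # x = beta_g * g + eps_x
y = [beta_x * xi + rng.gauss(0, 1) for xi in x]   # y = beta_x * x + eps_y

# Measured exposure x_o = alpha_mx + beta_mx * x + eps_mx
x_noisy = [xi + rng.gauss(0, 2) for xi in x]      # imprecision only (beta_mx = 1)
x_scaled = [5.0 + 2.0 * xi for xi in x]           # scale only (alpha_mx = 5, beta_mx = 2)

print(round(wald_ratio(g, x, y), 3))         # close to beta_x = 0.3
print(round(wald_ratio(g, x_noisy, y), 3))   # still close to 0.3
print(round(wald_ratio(g, x_scaled, y), 3))  # close to beta_x / beta_mx = 0.15
```

The same estimate is obtained from the SNP-exposure and SNP-outcome regression coefficients, because both share the denominator $\operatorname{var}(g)$.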

CIT test
First we describe how the CIT method (4) is implemented in the R package R/cit (18). Assume an exposure $x$ is instrumented by a SNP $g$, and the exposure $x$ causes an outcome $y$, as described above. The following tests are then performed:
1. $H_0: \operatorname{cov}(g, x) = 0$; $H_1: \operatorname{cov}(g, x) \neq 0$; the SNP associates with the exposure.
2. $H_0: \operatorname{cov}(g, y) = 0$; $H_1: \operatorname{cov}(g, y) \neq 0$; the SNP associates with the outcome.
3. $H_0: \operatorname{cov}(x, y) = 0$; $H_1: \operatorname{cov}(x, y) \neq 0$; the exposure associates with the outcome.
4. $H_0: \operatorname{cov}(g, y - \hat{y}) \neq 0$; $H_1: \operatorname{cov}(g, y - \hat{y}) = 0$; the SNP is independent of the outcome when the outcome is adjusted for the exposure.
Here $y - \hat{y} = y - (\hat{\alpha} + \hat{\beta} x)$ is the residual of $y$ after adjusting for $x$, where $x$ is assumed to mediate the association between the SNP and the outcome. The 4th condition is formulated as an equivalence testing problem that is estimated using simulations, comparing the estimate from the data against empirically obtained estimates for simulated variables where the independence model is true (full details are given in (4)). We note here that this approach is liable to fail when confounders of the exposure and outcome are present, even when there is a true causal relationship, because conditioning on the exposure then induces collider bias.
If all four tests reject the null hypothesis then it is inferred that $x$ causes $y$. The CIT measures the strength of causality by generating an omnibus p-value, $p_{CIT}$, which is simply the largest (least extreme) p-value of the four tests, the intuition being that causal inference is only as strong as the weakest link in the chain of tests.
Now we describe how we used the CIT method in our simulations. The cit.cp function was used to obtain an omnibus p-value. To infer the direction of causality using the CIT method, an omnibus p-value was estimated for the direction of $x$ causing $y$, $p_{CIT,x \to y}$ (Model 1), and for the direction of $y$ causing $x$, $p_{CIT,y \to x}$ (Model 2). The results from these two tests can then be used in combination to infer the existence and direction of causality. For some significance threshold $\alpha$ there are four possible outcomes from these two tests, and their interpretations are as follows:
• If $p_{CIT,x \to y} < \alpha$ and $p_{CIT,y \to x} > \alpha$ then Model 1 is accepted
• If $p_{CIT,x \to y} > \alpha$ and $p_{CIT,y \to x} < \alpha$ then Model 2 is accepted
• If $p_{CIT,x \to y} > \alpha$ and $p_{CIT,y \to x} > \alpha$ then there is no evidence for a causal relationship
• If $p_{CIT,x \to y} < \alpha$ and $p_{CIT,y \to x} < \alpha$ then the confounding model is accepted ($x \leftarrow g \rightarrow y$)
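The omnibus p-value and the two-direction decision rules can be sketched in Python as follows. The function names are hypothetical and this is not the R/cit interface; the four component p-values for each direction are assumed to have been computed already, and a p-value exactly equal to $\alpha$ is treated as non-significant.

```python
def cit_omnibus(p_values):
    """Omnibus CIT p-value: the largest (least extreme) of the component tests."""
    return max(p_values)

def cit_direction(p_xy, p_yx, alpha=0.05):
    """Combine the omnibus p-values for x -> y and y -> x into one of four outcomes."""
    sig_xy, sig_yx = p_xy < alpha, p_yx < alpha
    if sig_xy and not sig_yx:
        return "Model 1: x -> y"
    if sig_yx and not sig_xy:
        return "Model 2: y -> x"
    if sig_xy and sig_yx:
        return "confounding model: x <- g -> y"
    return "no evidence for a causal relationship"

# Hypothetical component p-values for each direction
p_xy = cit_omnibus([1e-8, 1e-5, 1e-6, 0.01])   # weakest link is 0.01
p_yx = cit_omnibus([1e-5, 1e-8, 1e-6, 0.30])   # weakest link is 0.30
print(cit_direction(p_xy, p_yx))               # Model 1: x -> y
```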
For the purposes of compiling simulation results we use an arbitrary α = 0.05 value, though we stress that for real analyses it is not good practice to rely on p-values for making causal inference, nor is it reliable to depend on arbitrary significance thresholds (37).

MR causal test
Two stage least squares (2SLS) is a commonly used technique for performing MR when the exposure, outcome and instrument data are all available in the same sample. A p-value for this test, $p_{MR}$, was obtained using the systemfit function in the R package R/systemfit (38). Note that the value of $p_{MR}$ is identical whether the same genetic variant is used to instrument the influence of the exposure $x$ on the outcome $y$ or, erroneously, to instrument the influence of the outcome $y$ on the exposure $x$.
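The symmetry of $p_{MR}$ can be checked with a small Python sketch of the just-identified IV (2SLS) estimator with a classical homoskedastic standard error. This is a hand-rolled illustration on simulated data with illustrative parameters, not the systemfit implementation, and it uses a normal approximation for the p-value. Swapping the roles of $x$ and $y$ inverts the estimate but leaves the test statistic unchanged.

```python
import math
import random

def iv_test(g, x, y):
    """Just-identified IV estimate of the effect of x on y, instrumented by g.

    Returns (estimate, t-statistic, two-sided normal p-value). Residuals use the
    observed x, as in 2SLS, and the standard error assumes homoskedastic errors.
    """
    n = len(g)
    mg, mx, my = sum(g) / n, sum(x) / n, sum(y) / n
    s_gg = sum((gi - mg) ** 2 for gi in g)
    s_gx = sum((gi - mg) * (xi - mx) for gi, xi in zip(g, x))
    s_gy = sum((gi - mg) * (yi - my) for gi, yi in zip(g, y))
    beta = s_gy / s_gx
    resid = [(yi - my) - beta * (xi - mx) for xi, yi in zip(x, y)]
    sigma2 = sum(e * e for e in resid) / (n - 2)
    se = math.sqrt(sigma2 * s_gg / s_gx ** 2)
    t = beta / se
    return beta, t, math.erfc(abs(t) / math.sqrt(2))

rng = random.Random(1)
g = [(rng.random() < 0.5) + (rng.random() < 0.5) for _ in range(2000)]
x = [0.5 * gi + rng.gauss(0, 1) for gi in g]
y = [0.3 * xi + rng.gauss(0, 1) for xi in x]

b_xy, t_xy, p_xy = iv_test(g, x, y)   # correct: g instruments x -> y
b_yx, t_yx, p_yx = iv_test(g, y, x)   # erroneous: g used to instrument y -> x
print(b_xy * b_yx)   # the two estimates are reciprocals, so this is ~1
print(t_xy, t_yx)    # identical up to floating-point error
```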
The method that we will now describe is designed to distinguish between two models, x → y or y → x. Unlike the CIT framework, this approach cannot infer if the true model is x ← g → y. We also assume all genetic effects are additive.
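To infer the direction of causality it is desirable to know which of the variables, $x$ or $y$, is being directly influenced by the instrument $g$. This can be achieved by assessing which of the two variables has the larger absolute correlation with $g$ (Appendix 2), formalised by testing for a difference in the correlations $\rho_{gx}$ and $\rho_{gy}$ using Steiger's Z-test for correlated correlations within a population (39). It is calculated as
$$Z = \frac{(Z_{gx} - Z_{gy})\sqrt{n - 3}}{\sqrt{2(1 - \rho_{xy})h}},$$
where $Z_* = \frac{1}{2}\ln\frac{1 + \rho_*}{1 - \rho_*}$ is the Fisher z-transformation of each correlation, and
$$h = \frac{1 - f\bar{\rho}^2}{1 - \bar{\rho}^2}, \qquad f = \frac{1 - \rho_{xy}}{2(1 - \bar{\rho}^2)} \le 1, \qquad \bar{\rho}^2 = \frac{\rho_{gx}^2 + \rho_{gy}^2}{2},$$
and a p-value, $p_{Steiger}$, is generated from the Z value to indicate the probability of obtaining a difference between correlations $\rho_{gx}$ and $\rho_{gy}$ at least as large as the one observed, under the null hypothesis that both correlations are identical.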
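A Python sketch of this test for two correlations that share the variable $g$: it compares Fisher-transformed correlations, scaled by $\sqrt{n-3}$ and by a correction that depends on $\rho_{xy}$ and the mean squared correlation, with $f$ capped at 1. We assume this standard Fisher-z formulation matches the test described in the text; R/psych's r.test may report a different variant of the statistic, so values need not agree exactly. The function names are hypothetical, and correlations are compared in absolute value as the text describes.

```python
import math

def fisher_z(r):
    """Fisher z-transformation of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def steiger_dependent(r_gx, r_gy, r_xy, n):
    """Z-test of rho_gx versus rho_gy, two correlations that share the variable g.

    Returns (Z, two-sided p-value); Z > 0 favours g acting primarily on x.
    """
    # Compare absolute correlations; flipping the sign of x or y also flips r_xy.
    r_xy *= math.copysign(1, r_gx) * math.copysign(1, r_gy)
    r_gx, r_gy = abs(r_gx), abs(r_gy)
    rbar2 = (r_gx ** 2 + r_gy ** 2) / 2
    f = min((1 - r_xy) / (2 * (1 - rbar2)), 1.0)   # f is capped at 1
    h = (1 - f * rbar2) / (1 - rbar2)
    z = (fisher_z(r_gx) - fisher_z(r_gy)) * math.sqrt(n - 3) / math.sqrt(2 * (1 - r_xy) * h)
    return z, math.erfc(abs(z) / math.sqrt(2))

# Under x -> y we expect rho_gy = rho_gx * rho_xy, e.g. 0.3 * 0.4 = 0.12
z, p = steiger_dependent(0.3, 0.12, 0.4, n=1000)
print(z > 0, p < 0.05)   # True True
```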
The existence of causality and its direction is inferred by combining information from the MR analysis and the Steiger test. The MR analysis indicates whether there is a potential causal relationship ($p_{MR}$), and the Steiger test indicates the direction ($\operatorname{sign}(Z)$) of the causal relationship and the confidence in that direction ($p_{Steiger}$). For the purposes of compiling simulation results, these can be combined using an arbitrary $\alpha = 0.05$ value:
• If $p_{Steiger} < \alpha$ and $p_{MR} < \alpha$ and $Z > 0$ then a causal association for the correct model is accepted, $x \to y$
• If $p_{Steiger} < \alpha$ and $p_{MR} < \alpha$ and $Z < 0$ then a causal association for the incorrect model is accepted, $y \to x$
Note that the same correlation test approach can be applied to a two-sample MR (28) setting. Two-sample MR refers to the case where the SNP-exposure association and SNP-outcome association are calculated in different samples (e.g. from publicly available summary statistics). Here the Steiger test of two independent correlations can be applied, where
$$Z = \frac{Z_{gx} - Z_{gy}}{\sqrt{\frac{1}{n_x - 3} + \frac{1}{n_y - 3}}},$$
and $n_x$ and $n_y$ are the sample sizes in which the SNP-exposure and SNP-outcome associations were estimated.
An advantage of using the Steiger test in the two sample context is that it can compare correlations in independent samples where sample sizes are different. Steiger test statistics were calculated using the r.test function in the R package R/psych (40).
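In the two-sample case, each Fisher-transformed correlation has sampling variance of approximately $1/(n-3)$ for its own sample size, so the two variances simply add. A Python sketch, assuming the SNP-trait correlations have already been obtained; the function name is hypothetical and is not the R/psych API.

```python
import math

def fisher_z(r):
    """Fisher z-transformation of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def steiger_independent(r_gx, n_x, r_gy, n_y):
    """Z-test comparing |rho_gx| and |rho_gy| estimated in independent samples.

    Returns (Z, two-sided p-value); Z > 0 favours g acting primarily on x.
    """
    se = math.sqrt(1 / (n_x - 3) + 1 / (n_y - 3))
    z = (fisher_z(abs(r_gx)) - fisher_z(abs(r_gy))) / se
    return z, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical example: exposure GWAS with n = 500, outcome GWAS with n = 20,000
z, p = steiger_independent(0.20, 500, 0.05, 20_000)
print(round(z, 2), p)
```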

Causal direction sensitivity analysis
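The Steiger test for inferring whether $x \to y$ is based on evaluating $\rho_{gx} > \rho_{gy}$. However, $\rho_{gx}$ (or $\rho_{gy}$) is underestimated if $x$ (or $y$) is measured imprecisely: under the measurement error model above, $|\rho_{g,x_o}| = |\rho_{gx}||\rho_{x,x_o}|$. If, for example, $x$ has lower measurement precision than $y$, then we might empirically obtain $\rho_{g,x_o} < \rho_{g,y_o}$, because $\rho_{g,x_o}$ could underestimate $\rho_{gx}$ more than $\rho_{g,y_o}$ underestimates $\rho_{gy}$.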
As we show in Appendix 2, it is possible to infer the bounds of measurement error on $x_o$ or $y_o$ given known genetic associations. The maximum measurement imprecision of $x_o$ corresponds to $\rho_{x,x_o} = \rho_{g,x_o}$: because $g$ influences $x_o$ only through $x$, and at least $\rho_{g,x_o}^2$ of the variance in $x_o$ is known to be explained by $g$, $x$ must explain at least that much. The minimum imprecision is 0 ($\rho_{x,x_o} = 1$), denoting perfectly measured trait values (the same logic applies to $y_o$). It is possible to simulate what the inferred causal direction would be for all values within these bounds.
To evaluate how reliable the inference of the causal direction is to potential measurement error in $x$ and $y$, we predict the values of $\rho_{gy} - \rho_{gx}$ implied by each possible pair of measurement error values within these bounds. We integrate $\rho_{gy} - \rho_{gx}$ over the entire range of possible measurement error values, and compute the reliability $R$ as the ratio of the volume that agrees with the inferred direction of causality to the volume that disagrees with it. A ratio $R = 1$ indicates that the inferred causal direction is highly sensitive to measurement error, with equal weight of the measurement error parameter space supporting both directions of causality. In general, $R$ denotes that the volume of the measurement error parameter space supporting the inferred direction of causality is $R$ times the volume supporting the opposite direction. Full details are provided in Appendix 2.
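A minimal Python sketch of the reliability ratio, under stated assumptions: the integral is approximated by a uniform midpoint-rule grid over the admissible ranges $\rho_{x,x_o} \in [|\rho_{g,x_o}|, 1]$ and $\rho_{y,y_o} \in [|\rho_{g,y_o}|, 1]$, the implied true correlations are taken as $\rho_{gx} = \rho_{g,x_o}/\rho_{x,x_o}$ and $\rho_{gy} = \rho_{g,y_o}/\rho_{y,y_o}$, and "volume" is read as the integral of $|\rho_{gy} - \rho_{gx}|$ on each side of zero. Appendix 2 may define the integrand or use adaptive integration instead; the function name is hypothetical.

```python
import math

def steiger_sensitivity_ratio(r_gxo, r_gyo, steps=200):
    """Hypothetical helper: midpoint-grid approximation of the reliability ratio R.

    r_gxo, r_gyo: observed SNP correlations with the measured traits x_o and y_o.
    """
    r_gxo, r_gyo = abs(r_gxo), abs(r_gyo)
    inferred_x_to_y = r_gxo > r_gyo           # direction implied by the observed data
    # Admissible ranges: rho_{x,x_o} in [|rho_{g,x_o}|, 1], rho_{y,y_o} in [|rho_{g,y_o}|, 1]
    wx = (1.0 - r_gxo) / steps
    wy = (1.0 - r_gyo) / steps
    agree = disagree = 0.0
    for i in range(steps):
        r_gx = r_gxo / (r_gxo + (i + 0.5) * wx)   # implied rho_gx for this rho_{x,x_o}
        for j in range(steps):
            r_gy = r_gyo / (r_gyo + (j + 0.5) * wy)
            gap = r_gy - r_gx                      # gap < 0 supports x -> y
            cell = abs(gap) * wx * wy              # volume under |gap| for this cell
            if (gap < 0) == inferred_x_to_y:
                agree += cell
            else:
                disagree += cell
    return agree / disagree if disagree > 0 else math.inf

print(steiger_sensitivity_ratio(0.3, 0.1))   # R > 1: x -> y holds over most of the space
```

Because the grid is symmetric in the two traits, swapping the arguments returns the same $R$ for the mirrored inference $y \to x$.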

Simulations
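Simulations were conducted by creating variables of sample size $n$ for the exposure $x$, the measured values of the exposure $x_o$, the outcome $y$, the measured values of the outcome $y_o$ and the instrument $g$. One of two models is simulated: the "causal model", where $x$ causes $y$ and $g$ is an instrument for $x$; or the "non-causal model", where $g$ influences a confounder $u$ which in turn causes both $x$ and $y$. Here $x$ and $y$ are correlated but not causally related. Each variable in the causal model was simulated such that
$$x = \alpha_x + \beta_g g + \epsilon_x, \quad y = \alpha_y + \beta_x x + \epsilon_y, \quad x_o = \alpha_{mx} + \beta_{mx} x + \epsilon_{mx}, \quad y_o = \alpha_{my} + \beta_{my} y + \epsilon_{my},$$
where $\alpha_{mx}$ and $\beta_{mx}$ are parameters that represent non-differential measurement error in the exposure variable $x$, and $\alpha_{my}$ and $\beta_{my}$ are parameters for non-differential measurement error in the outcome $y$. Similarly, in the non-causal model $u$ was generated as a linear function of $g$ plus independent error, $x$ and $y$ were each generated as linear functions of $u$ plus independent error, and $x_o$ and $y_o$ were generated from $x$ and $y$ as in the causal model. The simulation parameters were varied across a grid, giving a total of 432 combinations of parameters. Simulations using each of these sets of variables were performed 100 times, and the CIT and MR methods were applied to each in order to evaluate the causal association of the simulated variables. Similar patterns of results were obtained for different values of $\operatorname{cor}(g, x)$ and $\operatorname{cor}(g, u)$.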
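The two data-generating models can be sketched in Python as follows. This is an illustration, not the simulation code used here: the effect sizes (0.6, 0.5) are placeholders rather than values from the 432-combination grid, intercepts are set to zero, the function names are hypothetical, and $g$ is assumed to be an additive SNP with allele frequency 0.5.

```python
import math
import random

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def simulate(model, n, beta_mx=1.0, sd_mx=0.0, beta_my=1.0, sd_my=0.0, seed=0):
    """Hypothetical generator for the "causal" and "non-causal" models.

    Effect sizes (0.6, 0.5) are illustrative placeholders, intercepts are 0,
    and g is an assumed additive SNP with allele frequency 0.5.
    """
    rng = random.Random(seed)
    g = [(rng.random() < 0.5) + (rng.random() < 0.5) for _ in range(n)]
    if model == "causal":
        x = [0.6 * gi + rng.gauss(0, 1) for gi in g]   # g -> x
        y = [0.5 * xi + rng.gauss(0, 1) for xi in x]   # x -> y
    elif model == "non-causal":
        u = [0.6 * gi + rng.gauss(0, 1) for gi in g]   # g -> u (confounder)
        x = [0.5 * ui + rng.gauss(0, 1) for ui in u]   # u -> x
        y = [0.5 * ui + rng.gauss(0, 1) for ui in u]   # u -> y
    else:
        raise ValueError(f"unknown model: {model}")
    # Non-differential measurement error: scale (beta_m*) and imprecision (sd_m*)
    x_o = [beta_mx * xi + rng.gauss(0, sd_mx) for xi in x]
    y_o = [beta_my * yi + rng.gauss(0, sd_my) for yi in y]
    return g, x, y, x_o, y_o

g, x, y, x_o, y_o = simulate("causal", 20_000, sd_mx=1.0)
print(round(corr(g, x), 3), round(corr(g, x_o), 3))   # noise in x_o attenuates the SNP correlation
```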

Applied example using two sample MR
Two sample MR (28) was performed using the summary statistics for genetic influences on gene expression and DNA methylation. To do this we obtained a list of 458 gene expression-DNA methylation associations as reported in Shakhbazov et al (41). These were filtered to be located on the same chromosome, to have robust correlations after correcting for multiple testing, and to share a SNP that had a robust cis-acting effect on both the DNA methylation probe and the gene expression probe. Because only summary statistics (effect, standard error, effect allele, sample size, p-values) were available for the instrumental SNP on the methylation and gene expression levels, the Steiger test of two independent correlations was used to infer the direction of causality for each association. The Wald ratio test was then used to estimate the causal effect size in the inferred direction for each association.
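A Python sketch of how summary statistics might be turned into an oriented Wald ratio estimate. Several assumptions are made: the absolute SNP-trait correlation is recovered from the regression t-statistic via $r^2 = t^2/(t^2 + n - 2)$, which is exact for a single-predictor least-squares fit and only approximate for GWAS models with covariates; the direction is chosen by comparing point estimates only (in practice the Steiger test of two independent correlations supplies a p-value for this choice); and the standard error is the first-order approximation that ignores uncertainty in the exposure effect. The function names and input numbers are hypothetical.

```python
import math

def r_from_summary(beta, se, n):
    """|correlation| implied by a single-predictor regression t-statistic t = beta / se."""
    t2 = (beta / se) ** 2
    return math.sqrt(t2 / (t2 + n - 2))

def orient_and_estimate(meth, expr):
    """Orient methylation/expression using the SNP, then compute a Wald ratio.

    meth and expr are (beta, se, n) tuples for the same SNP and effect allele.
    """
    r_meth, r_expr = r_from_summary(*meth), r_from_summary(*expr)
    if r_meth > r_expr:
        exposure, outcome, direction = meth, expr, "methylation -> expression"
    else:
        exposure, outcome, direction = expr, meth, "expression -> methylation"
    b_exp, _, _ = exposure
    b_out, se_out, _ = outcome
    # Wald ratio with a first-order standard error that ignores uncertainty in b_exp
    return direction, b_out / b_exp, se_out / abs(b_exp)

print(orient_and_estimate((0.5, 0.05, 1000), (0.2, 0.05, 1000)))
# ('methylation -> expression', 0.4, 0.1)
```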

Mediation-based causal inference under measurement error
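In the causal inference test (CIT), the 4th condition (see Methods) employs mediation for causal inference, and can be expressed as $\operatorname{cov}(g, y - \hat{y}) = 0$, where $\hat{y} = \hat{\alpha}_x + \hat{\beta}_x x$. When measurement error in scale and imprecision is introduced, such that $x_o$ and $y_o$ are the measured values of $x$ and $y$, and $\hat{y}_o = \hat{\alpha} + \hat{\beta} x_o$ is the fitted value from regressing $y_o$ on $x_o$, it can be shown using basic covariance properties (Appendix 1) that
$$\operatorname{cov}(g, y_o - \hat{y}_o) = \beta_g \beta_x \beta_{my} \operatorname{var}(g)(1 - D), \qquad D = \frac{\beta_{mx}^2 \operatorname{var}(x)}{\beta_{mx}^2 \operatorname{var}(x) + \operatorname{var}(\epsilon_{mx})}.$$
Thus, when the true model is causal (with $\beta_g \beta_x \beta_{my} \neq 0$), an observational study will find $\operatorname{cov}(g, y_o - \hat{y}_o) = 0$ only when $D = 1$. Therefore, if there is any measurement error that incurs imprecision in $x$ (i.e. $\operatorname{var}(\epsilon_{mx}) \neq 0$), then there will remain an association between $g$ and $y_o \mid x_o$, which violates the 4th condition of the CIT. Note that scale transformation of $x$ or $y$ without any incurred imprecision is insufficient to violate the test assumptions, and henceforth mention of measurement error will relate to imprecision unless otherwise stated.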
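The covariance result can be checked numerically with a short Python simulation. The parameter values are illustrative choices, not values from this study; $g$ is assumed to be an additive SNP with allele frequency 0.5 (so $\operatorname{var}(g) = 0.5$), and $\epsilon_x$ and $\epsilon_y$ have unit variance. With no imprecision in $x_o$ the residual covariance should vanish, while with $\operatorname{var}(\epsilon_{mx}) = 1$ it should approach $\beta_g \beta_x \beta_{my} \operatorname{var}(g)(1 - D)$.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (n - 1)

def residual_cov(sd_mx, n=40_000, seed=7, beta_g=0.5, beta_x=0.4,
                 beta_mx=1.0, beta_my=2.0, sd_my=0.5):
    """Return (empirical, theoretical) cov(g, y_o - yhat_o) under the causal model."""
    rng = random.Random(seed)
    g = [(rng.random() < 0.5) + (rng.random() < 0.5) for _ in range(n)]  # var(g) = 0.5
    x = [beta_g * gi + rng.gauss(0, 1) for gi in g]
    y = [beta_x * xi + rng.gauss(0, 1) for xi in x]
    x_o = [beta_mx * xi + rng.gauss(0, sd_mx) for xi in x]
    y_o = [beta_my * yi + rng.gauss(0, sd_my) for yi in y]
    # Mediation-style adjustment: residual of y_o after regressing it on x_o
    b_hat = cov(x_o, y_o) / cov(x_o, x_o)
    resid = [yo - b_hat * xo for xo, yo in zip(x_o, y_o)]
    var_g, var_x = 0.5, beta_g ** 2 * 0.5 + 1.0
    D = beta_mx ** 2 * var_x / (beta_mx ** 2 * var_x + sd_mx ** 2)
    return cov(g, resid), beta_g * beta_x * beta_my * var_g * (1 - D)

for sd in (0.0, 1.0):
    emp, theo = residual_cov(sd)
    print(f"sd_mx={sd}: empirical {emp:.3f}, theoretical {theo:.3f}")
```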
We performed simulations to verify that this problem does arise using the CIT method. Figure 2 shows that when there is no measurement error in the exposure or outcome variables ($\rho_{x,x_o} = 1$), the CIT is reliable in identifying the correct causal direction. However, as measurement error in the exposure variable increases, the CIT eventually becomes more likely to infer a robust causal association in the wrong direction. Also of concern is that increasing sample size does not solve the issue; indeed, it only strengthens the apparent evidence for the incorrect inference.
We also performed simulations to compare the performance of MR against CIT in detecting a causal association between simulated variables under different levels of imprecision simulated in the exposure. Figure 3 compares the true positive rates of CIT and MR for detecting a causal association. We observe that the CIT has lower power in all cases, with performance declining as measurement imprecision in the exposure increases. When MR assumptions are satisfied, notably that it is known on which of $x$ and $y$ the SNP $g$ has a primary influence, the performance of MR in detecting an association is unrelated to measurement error in the exposure. Measurement error in the outcome does reduce power, but does not induce a substantive difference in performance between CIT and MR.

Using MR Steiger to infer the direction of causality
If we do not know whether the SNP g has a primary influence on x or y then CIT can attempt to infer the causal direction. Here we introduce the MR Steiger approach to similarly orient the direction of causality but in an MR analysis when the underlying biology of the SNP is not fully understood.
For a particular association, it is of interest to identify the range of possible measurement error values that agree and disagree with the empirically inferred causal direction (Figure 4a, Appendix 2). The resulting metric, $R$, can be used to evaluate the reliability of MR Steiger.
We show that in the presence of measurement imprecision, $d = \rho_{x,x_o} - \rho_{xy}\rho_{y,y_o}$ (Appendix 2) determines the range of parameters over which the Steiger test is liable to provide the wrong direction of causality (i.e. if $d > 0$ then the Steiger test is likely to be correct about the causal direction). This follows because, under the causal model, $\rho_{gy} = \rho_{gx}\rho_{xy}$, so that $\rho_{g,x_o} - \rho_{g,y_o} = \rho_{gx} d$. Figure 4b shows that when there is no measurement error in $x$, the Steiger test is unlikely to infer the wrong direction of causality even if there is measurement error in $y$. It also shows that in most cases where $x$ is measured with error, especially when the causal effect between $x$ and $y$ is not very large, the sensitivity of the Steiger test to measurement error is relatively low.
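The role of $d$ can be illustrated with a few lines of Python, assuming positive correlations and the causal model $g \to x \to y$ with measurement error independent of $g$. The scenario values are hypothetical: when $x$ is measured poorly and $y$ well, $d$ turns negative and the expected observed correlations favour $y \to x$.

```python
import math

def d_statistic(rho_xy, rho_x_xo, rho_y_yo):
    """d = rho_{x,x_o} - rho_{xy} * rho_{y,y_o}; d > 0 means Steiger orients x -> y."""
    return rho_x_xo - rho_xy * rho_y_yo

def observed_gap(rho_gx, rho_xy, rho_x_xo, rho_y_yo):
    """Expected rho_{g,x_o} - rho_{g,y_o} under the causal model g -> x -> y."""
    rho_gy = rho_gx * rho_xy   # g influences y only through x
    return rho_gx * rho_x_xo - rho_gy * rho_y_yo

# Hypothetical scenario: x measured with rho_{x,x_o} = 0.4, y measured perfectly
print(d_statistic(0.5, 0.4, 1.0))         # negative: Steiger would favour y -> x
print(observed_gap(0.3, 0.5, 0.4, 1.0))   # equals 0.3 * d
```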

Comparison of CIT and MR Steiger for obtaining the correct direction of causality
We used simulations to compare the MR Steiger approach with CIT in terms of the rate at which evidence of a causal relationship is obtained for the correct direction of causality, and the rate at which evidence of a causal relationship is obtained where the reported direction of causality is incorrect. Simulations were performed for two models: a "causal model", where there was a causal relationship between $x$ and $y$, and a "non-causal model", where $x$ and $y$ were not causally related but had a confounded association induced by the SNP $g$ influencing a confounder variable $u$. Figure 5a shows that, for the "causal model", the MR analysis is indeed liable to infer the wrong direction of causality when $d < 0$, and that this erroneous result is more likely to occur with increasing sample size. However, the CIT is in general more fallible, more often reporting a robust causal association in the wrong direction. When $d > 0$ we find that in most cases the MR method has greater power than CIT to obtain evidence for causality, and always obtains the correct direction of causality. The CIT, unlike the Steiger test for MR, is able to distinguish the "non-causal model" from the "causal model" (Methods, Figure 5b), but measurement error will often lead the CIT to identify the causal model as true when the underlying model is in fact the non-causal model.

The causal relationship between gene expression and DNA methylation levels
We used the Steiger test to infer the direction of causality for 458 putative associations between DNA methylation and gene expression levels. We found that the inferred causal direction commonly goes both ways (Figure 6a), but, assuming no or equal measurement error, DNA methylation levels were the predominant causal factor ($p = 1.3 \times 10^{-5}$). The median reliability ($R$) of the 458 tests was 3.92 (5%-95% quantiles 1.08-37.11). We then predicted the causal directions of the associations for varying levels of systematic measurement error on the different platforms. Figure 6a shows that the conclusions about the direction of causality between DNA methylation and gene expression are very sensitive to measurement error.
We performed two sample MR (28) for each association in the direction of causality inferred by the Steiger test. The sign of the MR estimate was generally the same as that of the Pearson correlation coefficient reported by Shakhbazov et al (41) (Figure 6b). There was a moderate correlation between the absolute magnitudes of the causal correlation and the observational Pearson correlation (r = 0.45). Together these results suggest that even when estimating associations between 'omic' variables, which are considered to be low-level phenotypes, it is important to use causal inference methods rather than observational associations to infer causal effect sizes.
We also observed that for associations where methylation caused gene expression the causal effect was more likely to be negative than for the associations where gene expression caused methylation (OR = 0.61 (95% CI 0.36-1.03), Figure 6c), suggesting that reducing methylation levels at a controlling CpG typically leads to increased gene expression levels, consistent with expectation (43).

Discussion
Researchers are often confronted with the problem of making causal inferences using a statistical framework on observational data. In the epidemiological literature, issues of measurement error in mediation analysis are relatively well explored (44). Our analysis extends this to related methods such as CIT that are widely used, predominantly on 'omic data. These methods are indeed susceptible to the same problem as standard mediation-based analysis; specifically, we show that as measurement error in the (true) exposure variable increases, CIT is likely to have reduced statistical power and is liable to infer the wrong direction of causality.
We also demonstrate that, counterintuitively, increasing sample size does not resolve the issue; rather, it leads to more extreme p-values for the model that predicts the wrong direction of causality.
Under many circumstances a practical solution to this problem is to use Mendelian randomisation instead of mediation-based methods such as the CIT. Inferring the existence of causality using Mendelian randomisation is robust in the face of measurement error and, if the researcher has knowledge about the biology of the instrument being used in the analysis, can offer a direct solution to the issues that the CIT faces. This assumption is often reasonable; for example, SNPs are commonly used as instruments when they are found in genes with known biological relevance for the trait of interest. On many occasions, however, especially in the realm of 'omic data, this is not the case, and mediation-based methods have been valuable both for ascertaining whether there is a causal association and for inferring the direction of causality. Here we have described a simple extension to MR which can be used as an alternative to, or in conjunction with, mediation-based methods. We show that this method is still susceptible to measurement error, but because it has different properties to the CIT it offers two main advantages. First, it uses a formal statistical framework to test the reliability of the assumed direction of causality. Second, across a comprehensive range of simulated scenarios, the MR-based approach is less likely to infer the wrong direction of causality than CIT, while substantially improving power over CIT in the cases where $d > 0$.
We demonstrate this new method by evaluating the causal relationships of 458 known associations between DNA methylation and gene expression levels using summary level data. The inferred causal direction is heavily influenced by how much measurement error is present in the different assaying platforms. For example, if DNA methylation measures typically have higher measurement error than gene expression measures, then our analysis suggests that DNA methylation levels would more often be the causal factor in the association. Indeed, previous studies that have evaluated measurement error in these platforms do support this position (45,46). Even so, drawing strong conclusions from this analysis is difficult because measurement error is likely to be study specific. We have also not accounted for the influence of winner's curse, which can inflate estimates of the variance explained by SNPs, with higher inflation expected amongst lower-powered studies. Using p-values for genetic associations from replication studies will mitigate this problem.
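Because the method needs only summary level data, the variance explained by each instrument has to be recovered from published association statistics. The following is a minimal Python sketch, assuming only a two-sided p-value (for example, from a replication study) and a sample size are available. It also assumes the test statistic is approximately standard normal, in which case $r^2 \approx z^2/(z^2 + n - 2)$. The function name and the numbers in the usage lines are hypothetical and for illustration only.

```python
from statistics import NormalDist


def r2_from_pvalue(p: float, n: int) -> float:
    """Approximate variance explained (r^2) by one SNP from its two-sided
    association p-value and sample size n.

    Assumes the test statistic is approximately standard normal, converts
    p back to |z|, and uses r^2 = z^2 / (z^2 + n - 2), which is exact when
    z is the t statistic of a simple linear regression.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    # inv_cdf(p / 2) is negative; squaring removes the sign and avoids
    # computing 1 - p / 2, which rounds to 1.0 for very small p-values.
    z = NormalDist().inv_cdf(p / 2)
    return z * z / (z * z + n - 2)


# Hypothetical replication-sample p-values for one SNP (illustrative only)
r2_methylation = r2_from_pvalue(p=1e-40, n=1000)
r2_expression = r2_from_pvalue(p=1e-12, n=1000)
inferred = ("methylation -> expression" if r2_methylation > r2_expression
            else "expression -> methylation")
```

Working from the p-value rather than the discovery effect size makes it easy to swap in replication statistics, which is the mitigation for winner's curse suggested above.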
In our simulations we focused on the simple case of a single instrument in a single-sample setting, with a view to making a fair comparison between MR and the various mediation-based methods available. However, if there is only a single instrument it is difficult to distinguish between two competing models. In one, g instruments a trait that causes another trait. In the other, g has independent pleiotropic effects on both traits (47). Under certain conditions of measurement error the CIT can distinguish these models. We also note that it is straightforward to extend the MR Steiger approach to multiple instruments, requiring only that the total variance explained by all instruments be calculated under the assumption that they are independent. Multiple instruments can also help to distinguish between the causal and pleiotropic models, for example by evaluating the proportionality of the SNP-exposure and SNP-outcome effects (16). Additionally, if there is at least one instrument for each trait then bi-directional MR can offer solutions to inferring the causal direction (16,48,49). We restricted the simulations to evaluating causal inference between quantitative traits. It may be possible to extend the analysis to binary traits by using the genetic variance explained on the liability scale, taking into account the population prevalence (50). However, our analysis goes beyond many previous explorations of measurement error by assessing the impacts of both imprecision (noise) and linear transformations of the true variable on causal inference.
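For the multiple-instrument extension, the total variance explained is simply the sum of the per-SNP contributions when the SNPs are independent. This minimal Python sketch assumes each SNP is summarised by an effect estimate and standard error from the same sample size $n$, and that the SNPs have already been made independent (for example, by LD clumping). The helper names and the $(\beta, \mathrm{SE})$ values are hypothetical.

```python
def r2_from_beta_se(beta: float, se: float, n: int) -> float:
    """Variance explained by one SNP from its effect estimate and standard
    error, via t = beta / se and r^2 = t^2 / (t^2 + n - 2)."""
    t = beta / se
    return t * t / (t * t + n - 2)


def total_r2(snps, n: int) -> float:
    """Total variance explained by several instruments, obtained by summing
    per-SNP r^2. Only valid if the SNPs are mutually independent."""
    return sum(r2_from_beta_se(beta, se, n) for beta, se in snps)


# Hypothetical (beta, se) pairs for the same two independent SNPs,
# estimated on trait x and on trait y
n = 10002
snps_on_x = [(0.10, 0.01), (0.05, 0.01)]
snps_on_y = [(0.03, 0.01), (0.02, 0.01)]
r2_x, r2_y = total_r2(snps_on_x, n), total_r2(snps_on_y, n)
direction = "x -> y" if r2_x > r2_y else "y -> x"
```

If the SNPs were in linkage disequilibrium, summing would double-count shared signal, which is why the independence assumption matters.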
In this work we assumed that pleiotropy (the influence of the instrument on the outcome through a mechanism other than the exposure) was not present. Recent methodological developments in MR (24,25) have focused on accounting for the issues that horizontal pleiotropy can introduce when multiple instruments are available, but how they perform in the presence of measurement error remains to be explored. An important advantage that MR confers over most mediation-based analyses is that it can be performed in two samples, which can considerably improve power and expand the scope of analysis. However, whether measurement error affects two-sample MR differently from one-sample MR is not yet fully understood. We have also assumed no measurement error in the genetic instrument, which is not unreasonable given the strict QC protocols that ensure high-quality genotype data are available to most studies. We have restricted the scope to non-differential measurement error and avoided the complications incurred if measurement error in the exposure and outcome is correlated. Nor have we addressed other issues pertaining to instrumental variables that are relevant to instrument-exposure specification. One such problem is exposure misspecification. For example, an instrument could associate with several closely related putative exposures, with only one of them actually having a causal effect on the outcome. This problem has been shown to arise, for example, for SNPs influencing different lipid fractions (51,52).
Mediation-based network approaches, which go beyond analyses of two variables, are very well established (35) and have a number of extensions that make them valuable tools, for example for network construction. But because they are predicated on the basic underlying principles of mediation, they are liable to suffer from the same issues of measurement error. Recent advances in MR methodology, for example applying MR to genetical genomics (53), multivariate MR (52) and mediation through MR (54-56), may offer more robust alternatives for these more complicated problems.
The overarching result from our simulations is that, regardless of the method used, inferring the causal direction using an instrument of unknown biology is highly sensitive to measurement error. Measurement error is near ubiquitous in observational data, and our ability to measure it is limited. We therefore argue that it needs to be central to any consideration of approaches used in an attempt to strengthen causal inference. Any putative results should be accompanied by appropriate sensitivity analyses that assess their robustness under varying levels of measurement error.
Figure 1: Gene expression levels (blue blocks) and DNA methylation levels (green triangles) may be correlated, but the causal structure is unknown. If a SNP (yellow circle) is associated with both DNA methylation and gene expression levels then it can be used as an instrument, but there are three basic competing models for these variables. The causal inference test (CIT) attempts to distinguish between them. a) Methylation causes gene expression. The left figure shows that the SNP influences methylation levels that in turn influence gene expression levels. The right figure shows the directed acyclic graph that represents this model. Faded symbols represent the measured values whereas solid symbols represent the true values. b) The same as in (a), except the causal direction is from gene expression to DNA methylation. c) A model of confounding, where gene expression and DNA methylation are not causally related, but the SNP influences them each through separate pathways or a confounder.
Figure 5: a) Outcome y was simulated to be caused by exposure x as shown in the graph, with varying degrees of measurement error applied to both. CIT and MR were used to infer evidence for causality between the exposure and outcome, and to infer the direction of causality. The value of $d = \rho_{x,x_o} - \rho_{x,y}\rho_{y,y_o}$ is such that when $d$ is negative we expect the Steiger test to be more likely to be wrong about the direction of causality. Rows of graphs represent the sample size used in the simulations. For the CIT method, outcome 1 denotes evidence for causality with the correct model, outcomes 2 or 3 denote evidence for causality with an incorrect model, and outcome 4 denotes no evidence for causality. b) As in (a), except the simulated model was non-causal, and a genetic confounder induces an association between x and y. MR is unable to identify this model, so any significant associations are deemed to be incorrect. Outcome 3 denotes evidence for the correct model for the CIT method.
The proportions change when we assume different levels of measurement error in gene expression levels (x-axis) or DNA methylation levels (columns of boxes). If there is systematically higher measurement error in one platform than the other, it will appear less likely to be the causal factor. b) The relationship between the Pearson correlation between DNA methylation and gene expression levels (x-axis) and the causal estimate (scaled to standard deviation units, y-axis). c) Distribution of estimated causal effect sizes, stratified into associations inferred to be due to DNA methylation causing expression (blue) and expression causing DNA methylation (red).

Appendix 1
We assume the following model

Appendix 2
Assuming that either $x \rightarrow y$ or $y \rightarrow x$, the causal direction can be inferred by evaluating which of $\rho_{g,x}$ and $\rho_{g,y}$ is larger in magnitude. The Steiger test is a hypothesis test that provides a p-value for the observed difference between these correlations under the null hypothesis that they are equal.
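As an illustration of the one-sample comparison, the following minimal Python sketch implements Steiger's (1980) Z test for two dependent correlations that share one variable (here $g$). It follows the formula as it is commonly stated, using Fisher's $z$ transformation. The sketch assumes both instrument correlations have been oriented to be non-negative, so that comparing them is equivalent to comparing magnitudes. The function name and the input values are hypothetical.

```python
import math
from typing import Tuple


def steiger_test(r_gx: float, r_gy: float, r_xy: float, n: int) -> Tuple[float, float]:
    """Steiger's (1980) Z test of H0: rho_gx == rho_gy for two correlations
    that share the variable g, measured in the same n individuals.

    r_xy is the correlation between the two non-shared variables. Assumes
    r_gx and r_gy are non-negative. Returns (Z, two-sided p-value); a
    positive Z favours x -> y and a negative Z favours y -> x.
    """
    r_bar = (r_gx + r_gy) / 2
    # Approximate covariance between the two Fisher-transformed correlations
    psi = r_xy * (1 - 2 * r_bar**2) - 0.5 * r_bar**2 * (1 - 2 * r_bar**2 - r_xy**2)
    s = psi / (1 - r_bar**2) ** 2
    z = (math.atanh(r_gx) - math.atanh(r_gy)) * math.sqrt(n - 3) / math.sqrt(2 - 2 * s)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided standard normal p-value
    return z, p


# Illustrative values consistent with x -> y, since r_gy = r_gx * r_xy
z, p = steiger_test(r_gx=0.3, r_gy=0.15, r_xy=0.5, n=1000)
```

Swapping the two instrument correlations flips the sign of $Z$ but leaves the p-value unchanged, so the sign carries the inferred direction.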
Assuming the causal direction is $x \rightarrow y$, two-stage MR is formulated using the following regression models: $x = \alpha_1 + \beta_1 g + e_1$ for the first stage and $y = \alpha_2 + \beta_2 \hat{x} + e_2$ for the second stage, where $\hat{x} = \hat{\alpha}_1 + \hat{\beta}_1 g$. Writing in scale-free terms, $\rho_{g,x}$ denotes the correlation between $g$ and the exposure variable $x$. It is expected that $|\rho_{g,x}| > |\rho_{g,y}|$ because $\rho_{g,y} = \rho_{g,x}\rho_{x,y}$, where $\rho_{x,y}$ is the causal association between $x$ and $y$ (which is likely to be less than 1 in magnitude).
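To make the two-stage formulation and the identity $\rho_{g,y} = \rho_{g,x}\rho_{x,y}$ concrete, the following minimal Python sketch simulates a single standardised instrument under $x \rightarrow y$ with no confounding. The effect sizes 0.3 and 0.5, the seed and the sample size are arbitrary illustrative choices. The sketch fits both stages by ordinary least squares and compares the three correlations. The helper names are hypothetical, and only the standard library is used.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def corr(a, b):
    return cov(a, b) / math.sqrt(cov(a, a) * cov(b, b))


def two_stage_mr(g, x, y):
    """Two-stage least squares with a single instrument g for exposure x."""
    # First stage: regress x on g and form the fitted values x_hat
    beta1 = cov(g, x) / cov(g, g)
    alpha1 = mean(x) - beta1 * mean(g)
    x_hat = [alpha1 + beta1 * gi for gi in g]
    # Second stage: regress y on x_hat; the slope is the causal estimate
    return cov(x_hat, y) / cov(x_hat, x_hat)


# Hypothetical simulation of x -> y with unit variances and no confounding
random.seed(1)
n = 20000
rho_gx, rho_xy = 0.3, 0.5
g = [random.gauss(0, 1) for _ in range(n)]
x = [rho_gx * gi + random.gauss(0, math.sqrt(1 - rho_gx**2)) for gi in g]
y = [rho_xy * xi + random.gauss(0, math.sqrt(1 - rho_xy**2)) for xi in x]

beta2_hat = two_stage_mr(g, x, y)
r_gx, r_gy, r_xy = corr(g, x), corr(g, y), corr(x, y)
```

With a single instrument, the second-stage slope reduces algebraically to the Wald ratio $\mathrm{cov}(g, y)/\mathrm{cov}(g, x)$. Because $|\rho_{x,y}| < 1$, the sample correlation of $g$ with $y$ should come out smaller than that with $x$.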
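In the presence of measurement error in $x$ and $y$, however, the empirical inference of the causal direction will instead be based on evaluating $\rho_{g,x_o} > \rho_{g,y_o}$. Assuming the measurement errors are independent of $g$ (non-differential), $\rho_{g,x_o} = \rho_{g,x}\rho_{x,x_o}$ and $\rho_{g,y_o} = \rho_{g,x}\rho_{x,y}\rho_{y,y_o}$, so the condition can be simplified to
$$\rho_{x,x_o} > \rho_{x,y}\rho_{y,y_o},$$
that is, $d = \rho_{x,x_o} - \rho_{x,y}\rho_{y,y_o} > 0$. In order to assess how reliable the inference of the causal direction is in the presence of measurement imprecision, we can evaluate the range of potential values of measurement error in $x$ and $y$ over which the empirical difference between $\rho_{g,x_o}$ and $\rho_{g,y_o}$ would return the wrong causal direction.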
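The simplification can be checked numerically with a minimal Python sketch. It assumes classical measurement error that is independent of $g$ and treats the reliability parameters $\rho_{x,x_o}$ and $\rho_{y,y_o}$ as known inputs. The grid scan at the end is purely illustrative. It treats both reliability parameters as equally plausible across $(0, 1]$, an assumption not made in the text, and counts how often $d < 0$ when $\rho_{x,y} = 0.5$. Function names are hypothetical.

```python
def observed_correlations(rho_gx, rho_xy, rho_xxo, rho_yyo):
    """Expected observed instrument correlations under x -> y, assuming
    measurement error independent of g:
    rho_g,xo = rho_g,x * rho_x,xo and rho_g,yo = rho_g,x * rho_x,y * rho_y,yo."""
    return rho_gx * rho_xxo, rho_gx * rho_xy * rho_yyo


def d_value(rho_xy, rho_xxo, rho_yyo):
    """d = rho_x,xo - rho_x,y * rho_y,yo; d < 0 means the observed
    comparison points in the wrong direction (y -> x)."""
    return rho_xxo - rho_xy * rho_yyo


# Heavy measurement error in the exposure x, none in the outcome y
rho_gxo, rho_gyo = observed_correlations(0.3, 0.5, rho_xxo=0.3, rho_yyo=1.0)
d = d_value(0.5, rho_xxo=0.3, rho_yyo=1.0)

# Fraction of a midpoint grid over (rho_x,xo, rho_y,yo) in (0, 1]^2 where
# the observed comparison would point in the wrong direction
steps = 200
grid = [(i + 0.5) / steps for i in range(steps)]
wrong = sum(d_value(0.5, u, v) < 0 for u in grid for v in grid) / steps**2
```

Under that uniform assumption the wrong-direction fraction equals $\rho_{x,y}/2$, so a weaker causal effect leaves less of the parameter space in which measurement error can flip the inference.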
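Call $z = \rho_{g,y} - \rho_{g,x}$ the true difference between the correlations of the genetic variant with $y$ and with $x$. If $z < 0$ then we infer that $x \rightarrow y$. The true correlations can be recovered from the observed ones as $\rho_{g,x} = \rho_{g,x_o}/\rho_{x,x_o}$ and $\rho_{g,y} = \rho_{g,y_o}/\rho_{y,y_o}$. Because neither can exceed 1, $\rho_{x,x_o}$ ranges over $[\rho_{g,x_o}, 1]$ and $\rho_{y,y_o}$ over $[\rho_{g,y_o}, 1]$. Some values of $\rho_{x,x_o}$ and $\rho_{y,y_o}$ within these ranges preserve $z < 0$, while others reverse it. To evaluate the reliability, $R$, of the inference of the causal direction with regard to measurement error, the objective is to compare the proportion of the parameter space that agrees with the inferred direction against the proportion that does not:
$$R = \frac{\lvert V_z - V_{z \ge 0} \rvert}{V_{z \ge 0}}$$
If $R = 1$ then the direction of causality is equally probable across the range of possible measurement error values. If $R > 1$ then $R$ times as much of the parameter space favours the inferred direction of causality. $V_z$, the total (signed) volume under $z$ (Figure 4), can be obtained analytically by solving:
$$V_z = \int_{\rho_{g,x_o}}^{1}\int_{\rho_{g,y_o}}^{1}\left(\frac{\rho_{g,y_o}}{\rho_{y,y_o}} - \frac{\rho_{g,x_o}}{\rho_{x,x_o}}\right) d\rho_{y,y_o}\, d\rho_{x,x_o} = \rho_{g,x_o}\log\rho_{g,x_o} - \rho_{g,y_o}\log\rho_{g,y_o} + \rho_{g,x_o}\rho_{g,y_o}\left(\log\rho_{g,y_o} - \log\rho_{g,x_o}\right)$$
$V_{z \ge 0}$, the portion of the volume that lies above the $z = 0$ plane, can also be obtained analytically. This region is bounded by the values of $\rho_{x,x_o}$ and $\rho_{y,y_o}$ where $0 = \rho_{g,y} - \rho_{g,x}$, which can be rearranged to $\rho_{y,y_o} = \rho_{g,y_o}\rho_{x,x_o}/\rho_{g,x_o}$. Hence, taking $x \rightarrow y$ as the inferred direction (so that $\rho_{g,x_o} > \rho_{g,y_o}$):
$$V_{z \ge 0} = \int_{\rho_{g,x_o}}^{1}\int_{\rho_{g,y_o}}^{\rho_{g,y_o}\rho_{x,x_o}/\rho_{g,x_o}}\left(\frac{\rho_{g,y_o}}{\rho_{y,y_o}} - \frac{\rho_{g,x_o}}{\rho_{x,x_o}}\right) d\rho_{y,y_o}\, d\rho_{x,x_o} = 2\rho_{g,x_o}\rho_{g,y_o} - 2\rho_{g,y_o} - \rho_{g,y_o}\log\rho_{g,x_o} - \rho_{g,x_o}\rho_{g,y_o}\log\rho_{g,x_o}$$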
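The closed forms for $V_z$ and $V_{z \ge 0}$ can be evaluated directly and cross-checked by brute-force numerical integration. This minimal Python sketch does both. Here $R$ is taken as the magnitude of the volume below the $z = 0$ plane, $V_z - V_{z \ge 0}$, divided by $V_{z \ge 0}$, which is our reading of the comparison described above. The sketch assumes $x \rightarrow y$ is the inferred direction, so $0 < \rho_{g,y_o} < \rho_{g,x_o} < 1$. Function names and the example correlations are hypothetical.

```python
import math


def v_z(a: float, b: float) -> float:
    """Closed-form V_z, where a = rho_g,xo and b = rho_g,yo."""
    return a * math.log(a) - b * math.log(b) + a * b * (math.log(b) - math.log(a))


def v_z_nonneg(a: float, b: float) -> float:
    """Closed-form V_{z>=0}; derived for a > b, i.e. x -> y inferred."""
    return 2 * a * b - 2 * b - b * math.log(a) - a * b * math.log(a)


def sensitivity_ratio(a: float, b: float) -> float:
    """R = |V_z - V_{z>=0}| / V_{z>=0}: volume agreeing with x -> y
    relative to the volume that contradicts it."""
    if not 0 < b < a < 1:
        raise ValueError("expects 0 < rho_g,yo < rho_g,xo < 1")
    vz0 = v_z_nonneg(a, b)
    return abs(v_z(a, b) - vz0) / vz0


def numeric_volumes(a: float, b: float, steps: int = 400):
    """Midpoint-rule check of both integrals, with z = b / v - a / u over
    u = rho_x,xo in [a, 1] and v = rho_y,yo in [b, 1]."""
    hu, hv = (1 - a) / steps, (1 - b) / steps
    total = above = 0.0
    for i in range(steps):
        u = a + (i + 0.5) * hu
        for j in range(steps):
            v = b + (j + 0.5) * hv
            z = b / v - a / u
            total += z
            if z >= 0:
                above += z
    return total * hu * hv, above * hu * hv


# Hypothetical observed correlations of the instrument with x and y
a, b = 0.3, 0.2
R = sensitivity_ratio(a, b)
vz_num, vz0_num = numeric_volumes(a, b)
```

For these values $R$ is roughly 2.9, meaning that nearly three times as much of the measurement-error parameter space supports $x \rightarrow y$ as contradicts it.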