Comparative analysis of genetically-modified crops: Part 1. Conditional difference testing with a given genetic background

The European Food Safety Authority (EFSA) mandates two sets of statistical tests in the comparative assessment of a genetically-modified (GM) crop: difference testing to demonstrate whether the GM crop is different from its appropriate non-traited control; and equivalence testing to demonstrate whether it is equivalent to conventional references with an history-of-safe-use. The equivalence testing method prescribed by EFSA confounds the so-called GM trait effect with genotypic differences between the reference varieties and non-traited control. Critically, these genotypic differences, which we define as a ‘control background effect’, are the result of conventional plant breeding. Thus, the result of EFSA equivalence testing often has little or nothing to do with the GM trait effect, which should be the sole focus of the comparative assessment. Here, an integrated method is introduced for both difference and equivalence testing that considers the differences of the three genotype groups (GM, control, and references) as a two-dimensional random variable. A novel statistical model is proposed, called the trait model, that treats the effects of the GM and control materials as fixed for their difference, and as random for their common background. For significance testing, the covariance structure of the three genotype groups is utilized to decompose the differences into the trait effect and the control background effect. The trait difference is then derived as a conditional mean, given the background effect. The comparative assessment can then focus on the conditional mean difference, which is independent of the control background effect. Furthermore, the trait model is flexible enough to include various types of genotype-by-environment (G×E) interactions inherent to the experimental design of the trial. Numerical evaluations and simulations show that this new method is substantially more efficient than the current EFSA method in reducing both Type I and Type II errors (protecting both the consumer and producer risk) after the background effect is removed from the test statistic, and successfully addresses two major criticisms (i.e. statistical model lack of G×E, and study-specific equivalence criterion) that have been raised.


Introduction
In the European Union, the safety of a genetically-modified (GM) crop and derived food/feed is established, in part, using a proof-of-equivalence approach [1]. Under this paradigm, any phenotypic or compositional differences between the GM crop and its near-isogenic, nontraited comparator are evaluated in the context of natural variation, estimated from conventionally-bred varieties (hereafter, references) grown in the same field trials. Currently, the GMO Panel of the European Food Safety Authority (EFSA) mandates two sets of statistical tests in the comparative assessment of a GM crop [2]: difference testing to demonstrate whether the GM crop is different from its near-isogenic, non-traited comparator; and equivalence testing to demonstrate whether it is equivalent to conventional references with an history-of-safe-use. Per [2], this assessment should be based on data from a minimum of eight sites in one growing season (or two growing seasons each with four sites). Within each site, three genotype groups are assigned to plots under a randomized complete block design. The genotype groups include: a test (GM) variety, a control (the near-isogenic comparator), and a set of (minimum six) conventional references.
At least three linear mixed models have been proposed for this design ( [3] (this paper presents the method in [2]), [4,5]). All three models include a fixed factor 'genotype group', which distinguishes the test, control, and group of references as a whole, and a fixed or random factor 'genotype' with as many levels as there are materials. Differences between the three genotype groups can be driven by one or more mechanisms. Because the genetic composition of the test and control are very similar, any difference between the two is a potential 'trait effect' and becomes a focus of the comparative assessment [6,7]. Differences between the control and references, although not directly considered by [2], are part of the natural variation inherent to the crop; these differences are defined as a 'control background effect'. Critically, differences between the test and references can be driven by a trait effect, a control background effect, or both. However, although the control itself is needed, the control background effect is not relevant in terms of the comparative assessment because differences between the control and references are the result of conventional plant breeding, and do not involve the trait.
EFSA's method [2] for equivalence testing assesses differences between the test and references. These differences, as described above, may be driven by a trait effect, a control background effect, or both. In all three linear mixed models noted above, however, the trait effect and control background effect are confounded, and cannot be decomposed. Thus, the result of EFSA equivalence testing often has little or nothing to do with the trait effect (Fig 1), which should be the sole focus of the comparative assessment. To address this problem, an integrated method is introduced for both difference and equivalence testing that considers the differences of the three genotype groups as a two-dimensional random variable. A novel statistical model is proposed, called the trait model, that treats the effects of the test and control materials as fixed for their difference, and as random for their common background. For significance testing, the covariance structure of the three genotype groups is utilized to decompose the differences into the trait effect and the control background effect. The trait difference is then derived as a conditional mean, given the background effect. The comparative assessment can then focus on the conditional mean difference, which is independent of the control background effect. The trait model is also flexible enough to separate the various types of genotypeby-environment (G×E) interactions, one for each type of genetic effect, inherent given the design of the experiment. Numerical evaluations and simulations show that this new method is substantially more efficient than the current EFSA method [2] in reducing both Type I and Type II errors, and successfully addresses two major criticisms of EFSA that have been raised (i.e., the inability to model G×E interactions; and study-specific equivalence criteria, making the GM trait much less relevant than the control in the equivalence testing [4]). The paper is organized as follows: the derivation of the conditional method and introduction of the trait model are presented first, followed by an application using empirical field data. Results of numerical evaluations and simulations are presented alongside a re-analysis of a maize grain composition example by the conditional method. The paper concludes with a discussion of the trait model as it compares to the EFSA method [2], additional thoughts on G × E interactions, and highlights areas for further research.

The TCR trial
The Test-Control-Reference (TCR) trial is the experimental base of EFSA's prescribed twostep difference and equivalence testing methodology. Despite differences in the number of sites, reference varieties, etc., all TCR trials share some common properties. Let (Δ TC , Δ CR , Δ TR ) denote genotype group differences of the test versus control, control versus references, and test versus references, respectively. Let (D TC , D CR , D TR ) be the sample estimates and s 2 D TC ; s 2 D TR ; s 2 D CR ; s D TC ;D CR ; s D TC ;D TR , and s D TR ;D CR be the respective variances and co-variances. Under common statistical assumptions, i.e. independent and identically-distributed responses, two basic properties of all TCR trials are: a. Genetic relationship: Both Δ TC and Δ TR contain the trait effect, but Δ TR also contains Δ CR due to Δ TR = Δ TC + Δ CR , and Δ CR represents the control background effect; b. Experimental design: The paired arrangement of the test and control materials in the field, along with the above genetic relationship, leads to the following relationships: Notice in property [a] that the differences of the three genotype groups are two dimensional: one dimension is the control background effect; the other dimension is the GM trait effect. While both Δ TC and Δ TR are measures of the GM trait effect, in principle, only one dimension should be left for the comparative evaluation of the trait. Conceptually, a new The background effects are genotypic differences between reference varieties and the conventional control, a result of conventional plant breeding. Conclusions of EFSA equivalence are often driven by the background effect and have little or nothing to do with a GM trait. In this hypothetical (but very realistic) example, the test with the largest trait effect (i.e. largest mean difference, shown in red) is the only one deemed to be equivalent. parameter, possibly a combination of Δ TC and Δ TR (since each one cannot represent the other), must exist for the best measure of the GM trait difference, without confounding by the background effect.
Property [b] is a direct result of the field arrangement. For example, s 2 D CR and s 2 D TR are the sampling variations of D CR and D TR , respectively, and should be equal due to the paired arrangement in the field and the common background effect. Similarly, the covariance of D TC with D TR should be the same as with D CR . Other relationships can also be derived among variances and covariances.
The control material is the result of conventional plant breeding and there is no obvious preference for one control over other commercially-available varieties. Thus, the control background effect should be considered as random, and its potential value is no different from any individual reference in the reference genotype group. The random control background has been considered by [2], [5], and [8] in establishing the equivalence criterion, but not properly modeled (as will be shown in the following trait model) and incorporated in the statistical testing. In this investigation, this has been an underlying assumption in every step of the analysis, including the development of hypotheses, estimation of the trait effect, and the derivation of test statistics.
The remainder of this section addresses three fundamental questions: (1) how do we define the GM trait effect for a given background; (2) how do we model the random background effect in addition to other factors; (3) what is the sample performance of the estimation procedure?

GM trait difference for a given control background
Given Δ TR , a mixture of the GM trait-and background effects can be decomposed using conditioning or regression. Under the assumption of a bivariate normal distribution of (D TC ,D CR ), consider the following hypothetical model: D TC jD CR , and s 2 D TR jD CR denote the conditional mean and variance of D TC and D TR given D CR . Using the notation of conditioning or regression, the model can be specified as: jD CR , and β depends on the correlation. The two equations in this model are linearly related, thus only one model equation is necessary since D TR = D TC + D CR . This implies that, given D CR , evaluation based on both D TC and D TR would be the same, and can be replaced by the evaluation of the intercept a ¼ m D TC jD CR ¼0 ¼ m D TR jD CR ¼0 . The application of the above model depends on the correlation structure among the three genotype groups, which was derived in the last section. The objective of the conditioning is to eliminate the dependence of Δ TR (when its estimate is applied in the equivalence testing) on Δ CR (the control background effect). This is achieved by making use of the estimated background effect.
Let r D TC ;D CR denote the correlation coefficient between D TC and D CR . The following equations are direct applications of bivariate normal theory [9]: . Substituting the corresponding equations in property [b], (1) and (2) become: 8 > > > > < > > > > : 8 > > > > < > > > > : By conditioning on the estimated background effect, we do not obtain total independence of Δ CR in m D TR jD CR . However, m D TR jD CR in (4) becomes only fractionally-dependent on Δ CR , and s 2 D TR jD CR is even less than s 2 D TC , which is known to be free of background variation. Also, notice that the expectations of m D TC jD CR and m D TR jD CR , with respect to D CR , are equal to the marginal means Δ TC and Δ TR , assuming D CR follows a normal distribution with mean Δ CR and variance s 2 D CR . While the difference (D CR − Δ CR ) would be reduced by increasing the size of the TCR trial, such as the number of sites, the background effect D CR in expectation (i.e. Δ CR ) can never be reduced because the same control appears in all sites. Our objective is to find a measure of the trait effect less-dependent, or ideally independent, of the background effect.
Although m D TR jD CR is a regression-type of estimate, D CR is a random variable which can take different values, making it different from classical regression. Let d TC , d TR , and d CR be values of D TC , D TR , and D CR estimated from a TCR trial. Two cases of special interest in (3) and (4) are when D CR = d CR and D CR = 0, where D CR = d CR assumes the control background as in the current trial, and D CR = 0 implies that the random control background is adjusted to the mean of the reference distribution.
Intuitively, (5) is a regression of D TR on D CR . s 2 D TC =2s 2 D CR is the regression coefficient based on the correlation structure, m D TR jD CR ¼0 represents the expectation of the sample estimate of the intercept (or the expected mean of D TR at D CR = 0), and m D TR jD CR ¼d CR represents the expectation of D TR given D CR at the observed mean d CR in the current study. The implication of m D TR jD CR ¼d CR will be discussed later in (8) and (9). The intercept m D TR jD CR ¼0 is more interesting here; it has two components: the first part is Δ TC , known to be a pure estimate of the GM trait effect; and the second part is a portion of Δ CR , due to the covariance structure defined in property [b]. This is the familiar form of a simple regression: the intercept, say α, is equal to (μ y − βμ x ) [10]; i.e. the dependent variable mean (μ y ) minus the regression coefficient (β) times the independent variable mean (μ x ). Here, the correlation and the regression coefficient are negative. The second part of (5) separates m D TR jD CR ¼0 from Δ TC as a potentially better measure of the trait effect. It's shown later when and how much m D TR jD CR ¼0 would make a maximum difference as compared with Δ TC and Δ TR .
The implication of (5) is technically-interesting and practically-important. By conditioning, three parameters of (Δ TC , Δ TR , Δ CR ) are resolved into a two-dimensional space of (m D TR jD CR ¼0 , Δ CR ). In this two-dimensional space, the value of m D TR jD CR ¼0 is determined mostly by Δ TC and depends on Δ CR only to the extent of the correlation between D TC and D CR . Besides, given the background, the expectations of D TC and D TR are the same. Thus, from a statistical point of view, m D TR jD CR ¼0 or m D TC jD CR ¼0 can be considered as the unique parameter of interest to replace Δ TC and Δ TR in the comparative assessment.

A trait model for the TCR trial
Characteristics of the TCR trial have posed some difficulties [3,4] in statistical modelling, partly due to properties [a] and [b], and because of the differential field arrangement of the test and the control materials, versus references, among sites. At least three linear mixed models have been proposed for the TCR trial. The discussion of these models is delayed until later, but two common limitations are noteworthy (which were also concerns of the agricultural biotechnology industry [4]). The first limitation is the inability to decompose the trait effect and random background effect; these effects are therefore confounded in the equivalence testing. In fact, EFSA requires a two-step procedure-using EFSA Model 1 (assuming both the trait difference and the control background effect as fixed) for difference testing and Model 2 (assuming the independent random background effect separately for the test and the control) for setting equivalence limits [2]. The second limitation is the inability to separate the GM traitrelated interaction from the control-and reference-related ones, which are all inherent to the design of the TCR trail. A GM trait-related interaction of two nearly-isogenic lines (i.e. the test and the control materials) with the site, if present, would certainly be different from a G×E interaction in a conventional plant breeding context. Even though the interaction may not always be an important source of variation, the statistical model should provide such an option, if needed. Instead, EFSA requests a separate, post hoc interaction analysis, restricting data to test and control only, and assuming the site effect is fixed [2]. These limitations are structural shortcomings of the existing methods. For this reason, a trait model is proposed, which defines effects on linear contrasts of genotype groups corresponding to either the traitor control background effect. This contrasts with the current genotype models which define effects on each genotype group.
Two factors can be defined on the three genotype groups based on property [a]: the GM trait T with two levels (GM: the test material, and Non-GM: the control material and the references), and the background G also with two levels (Non-Ref: The test and the control varieties, and Ref: reference varieties). Then, G and T(G) represent two orthogonal, single-degree-offreedom decompositions of the genotype group effects. While G represents the contrast of the test and the control as a pair versus the references, T(G) is the contrast of the test versus the control within the Non-Ref group.
The trait model can be expressed as: where Y is the observed response; μ is overall mean; [S and B(S)] are environmental effects for site and block within each site, g is the background effect, and e is the residual. G and T(G) are main effects, and GS and TS(G) are corresponding interactions. Based on property [a] of the TCR trial, the background effect is defined as: Only μ, G and T(G) are assumed to be fixed; all other factors are random and assumed to be independently distributed with mean 0 and corresponding variance. One advantage of the expression of Model A over the model equations (one equation for each genotype) in Kang and Vahl [5] and Vahl and Kang [8] is the avoidance of a specified covariance structure derived from the random common background effect between the test and the control materials.
The primary factor differentiating Model A from the existing methods can be summarized as follows. First, Model A assumes, in the effect g, that the random control background is identical for the control and the test, and has the same variance as the references, which is part of property [a] of the TCR trial. Second, there are three possible types of interactions in the TCR trial: the GM trait-by-site TS(G), the control background-by-site GS, and the reference-by-site. Inclusion or exclusion of each or all of them is optional depending on the data. Model A includes the first two types of interactions because of the balanced arrangement of the test and control materials across sites, which support the inclusion of the trait-by-site and the control background-by-site interactions in the analysis. A minor modification could also include the reference-by-site interaction, when a proper field arrangement of references is available across sites. Since our focus here is on modelling and estimating effects G and (G), and interactions GS and TS(G), Model A is sufficient. Additional discussion is provided in subsequent sections.
The components of variances and co-variances are as follows in a design with the same number, but different varieties of references, at each site. Overlapping references among sites will be discussed later. Let n s denote the number of sites, n b the number of blocks and n sg the number of references at each site, and n g the total number of references.
TSðGÞ =n s þ 2s 2 e =ðn b n s Þ 2s 2 GS =n s þ 2s 2 TSðGÞ =n s þ s 2 g þ s 2 g =n g þ s 2 e =ðn b n s Þ þ s 2 e =ðn b n sg n s Þ 2s 2 GS =n s þ 2s 2 TSðGÞ =n s þ s 2 g þ s 2 g =n g þ s 2 e =ðn b n s Þ þ s 2 e =ðn b n sg n s Þ À s 2 TSðGÞ =n s À s 2 e =ðn b n s Þ s 2 TSðGÞ =n s þ s 2 e =ðn b n s Þ 2s 2 GS =n s þ s 2 where s 2 GS ; s 2 TSðGÞ ; s 2 g , and s 2 e are corresponding variance components of GS and TS(G), g and e, respectively, in Model A. A simplified case may help the understanding of (6) when no G×E is assumed, i.e. s 2 GS ¼ 0 and s 2 TSðGÞ ¼ 0. Then, (6) becomes g þ s 2 e =ðn b n s Þ þ s 2 e =ðn b n sg n s Þ ð1 þ 1=n g Þs 2 g þ s 2 e =ðn b n s Þ þ s 2 e =ðn b n sg n s Þ À s 2 e =ðn b n s Þ s 2 e =ðn b n s Þ ð1 þ 1=n g Þs 2 g þ s 2 e =ðn b n sg n s Þ Results in (7) show that under Model A, s 2 D TC is the same as that of EFSA Model 1, but s 2

D TR
under Model A is the same as that of EFSA Model 2. This is because Model A includes not only the fixed trait effect but also the random background of the test and the control, i.e. the test and the control materials are assumed to be both fixed and random per the effect, rather than the identity of the genotype group. When fitting Model A, (d TC , d CR , d TR ) are the sample means. Let s 2 d TC and s 2 d CR and s d TC ;d CR be the sample variances and covariance. When these estimates are applied to Eq in (5), the sample estimates of the conditional means can now be obtained: ( Eq (8) indicates that d TR is the estimate of Δ TR given D CR at its observed value d CR . It simply says that, given the control background as d CR , the observed trait difference d TR would be a sum of two correlated differences d TC and d CR , which is what EFSA uses in their equivalence testing [2]. Given D CR = 0, however,m D TR jD CR ¼0 is only slightly affected by d CR . It can be shown that the correlation betweenm D TR jD CR ¼0 (the estimate of the so-defined GM trait effect m D TR jD CR ¼0 ) and d CR (the estimate of the background effect Δ CR ) is zero. That is, they are independent from each other when normality is assumed. Now, we have two independent estimates for two parameters, one for the background effect and another for the GM trait effect.
To estimate the variance ofm D TR jD CR ¼0 , the variance and co-variance of d TC and d CR can be obtained directly from (6) or (7). Even though the ratio s 2 d TC =2s 2 d CR is a random variable itself, its variance consists of higher order terms and the variances of d TC and d CR are the leading terms in the overall variance ofm D TR jD CR ¼0 . Therefore, an approximation can be obtained based on the variation of d TC and d CR :

> > < > > :
Note that s 2 m D TR jD CR ¼0 has the same form as s 2 D TR jD CR in (4) with each variance replaced by the corresponding sample estimates due to the nature of the regression type of estimation. Thus, s 2 D TR jD CR denotes the sample estimate of the variance of the conditional difference. Though d CR is the control background effect of a TCR trial and could never exactly equal zero, by property [a], d CR is random and follows a distribution with mean zero and variance s 2 D CR estimated from the references.
Since both D TR and D CR are the respective differences from the same reference mean, the regression of D TR on D CR is then equivalent to the regression of the test mean on the control mean. The conditional meanm D TR jD CR ¼0 is the test mean given the control mean at the estimated reference mean. Therefore,m D TR jD CR ¼0 would have minimum variance among estimates for given D CR at any value other than zero. Furthermore, the sample variance s 2 m D TR jD CR ¼0 would have the same form as the conditional variance of (3) and (4), i.e. (8) and (9) are likelihood estimates of the corresponding parameter functions since each component is the likelihood estimate from the mixed model analysis. Now, the following hypothesis can be tested using a t-test: Þ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi with degrees of freedom (df d TC À 1Þ. When the Kenward-Rodger [10,11] method is applied, the degrees of freedom can be obtained from the analysis for df d TC , and an additional adjustment of minus one is due to the regression.

Results
In this section, the asymptotic and sample performance of the conditional method is evaluated for a wide range of parameter values. Simulations are conducted on selected parameter sets to compare the conditional method and trait model with the EFSA method. Lastly, the EFSA example [2] on maize grain composition data is reanalyzed using the conditional method and results are compared. For the numerical evaluation and simulation, a typical TCR trial was assumed, i.e. n s = 8, n b = 4, n gs = 4. For illustration of the conditional method, all references were assumed different, i.e. n g = n s n gs . The variation setting was defined as the ratio between the reference (or genetic) versus residual variances s 2 g : s 2 e . Zero G×E interactions were assumed in data generation for simulations. In general, m D TR jD CR ¼0 has the lowest variance as compared with D TC and D TR . Even though the variation of D TR is mostly larger than that of D TC due to the background effect, it is actually smaller when σ g is small. When genotypic variation is negligible, the residual becomes the main source of variability even among references. As the total number of replicates of references is larger than that of the control (a common case in practice), s 2 D TR < s 2 D TC becomes true, which is also revealed by Eqs (6) and (7). Sample performance of the conditional mean is examined in Fig 3. Fig 3 shows 100    Conditional difference testing of genetically-modified crops common background, as long as a s 2 g is large enough, say s 2 g � 0:1 s 2 e , or s 2 g : s 2 e � 0:1. The parallel change becomes dramatic as s 2 g : s 2 e increases to 1. The conditional meanm D TR jD CR ¼0 , however, shows no correlation with the background. Results in Fig 3 strongly support the message in Fig 1 that EFSA, by using D TR as a test statistic, is mostly testing the background effect, as long as nonnegligible reference variation exists. In contrast, the conditional method usinĝ m D TR jD CR ¼0 is independent of the background, and is thus a true test of the trait effect even if the reference variation is around zero.

Conditional difference testing
In this section, the practical performance of the difference testing methods is compared using the following hypotheses: Although hypotheses (13) are for difference testing, the results are directly related to equivalence testing, as is shown next.
The empirical Type I error of the difference testing was evaluated by simulation at α = 0.1 (the same as in [2]). Unlike those results in Fig 2, simulations are subject to sampling variation. The total variation mirrored the example in [2] and assumed a coefficient of variation (CV) of 40%, including both the genetic and residual components with an overall mean 10 and variation setting s 2 g : s 2 e the same as in Fig 3. The site variation was assumed to be s 2 e =4 with no block variation. For the power analysis, the trait effect was varied from 0 to 3, which is equal to 0 to 0.75 standard deviations of the total effects of the reference and residual. The trait model and the conditional method were applied to each simulated dataset and the Kenward-Roger adjustment was applied in the difference testing. Simulations were repeated 10,000 times for each set of parameter values, and results were summarized over replicate simulations.
All three methods performed as expected with the Type I error being around the nominal level (approximately 7 to 13% except one case H 02 : Δ TR = 0 and s 2 g : s 2 e ¼ 0) (Table 1). Slight deviations from the nominal level are likely due to the bias in the likelihood estimates of the variance components; especially when the actual value is close to zero, the non-zero restriction is applied in the estimation, the assumed experimental design will dictate the degrees of freedom of the variance estimates, as well as potential biases. Results of the power analysis in Fig 4 confirm those from Figs 1-3: among the three difference tests, the conditional difference test H 03 : m D TR jD CR ¼0 ¼ 0 is always the most efficient, and the simple difference test H 02 : Δ TR = 0 is often the least efficient. When s 2 g : s 2 e is low, say < 0.1, the difference in statistical power is Conditional difference testing of genetically-modified crops substantial between testing for H 01 : Δ TC = 0 and H 03 : m D TR jD CR ¼0 ¼ 0. As s 2 g : s 2 e gets larger, such as s 2 g : s 2 e > 1, this difference diminishes, and the two tests become very similar. Results in Fig 4 are directly relevant to those of EFSA's equivalence testing. Large background variation in D TR often makes the testing H 02 : Δ TR = 0 very inefficient with extremely low power, regardless of the trait difference. This implies that, when the trait difference is small, large variation in D TR will result in a failure to reject the non-equivalence hypothesis at a rate much greater than the nominal α = 0.10 level. Alternatively, a large trait effect might confound with the background variation, resulting in a rejection of the non-equivalence hypothesis.

Revisiting EFSA
The two-step difference and equivalence testing approach was illustrated with a maize grain composition example [2,3]. The purpose of revisiting that example here is to explore the performance of the trait model and conditional method. Before the re-analysis, results reported in EFSA (2010) were confirmed, including the post hoc testing of the trait-by-site interaction.
In the re-analysis with the trait model, SAS PROC GLIMMIX [12] was used since the procedure provides a likelihood ratio test on variance components using option COVTEST. One Conditional difference testing of genetically-modified crops statistically-significant interaction (Neutral Detergent Fiber) was found at the 0.05 level (the same as used by EFSA) for the trait-by-site interaction, and none of the control backgroundby-site interactions were significant. In contrast, eight statistically-significant interactions were found using the EFSA method [2]-treating site as a fixed effect while restricting the data to test and control only (i.e., references are excluded). The discrepancy highlighted here may be the result of the data or the statistical procedure. Nevertheless, the interaction is an effect of the design, and should be reflected in the statistical model (as in the trait model) instead of in a fragmented, post hoc procedure, as recommended by EFSA [2].
EFSA could not reach a conclusion on equivalence for nine of the 53 analytes (including two analytes with reference variance estimated as zero) [2]. Among those analytes, only two were found to be significantly-different from the control ( Table 2). Using s g estimated from the corresponding model, the standardized differences d TR /s g of the seven analytes under the EFSA model are all greater than or close to 2.0 (from 1.92 to 2.75) in absolute value. Under the trait model, five of the seven show a background effect d CR /s g greater than 2 in absolute value. Obviously, the inability to reach conclusions on equivalence is the result of large control background effects, and has nothing to do with the GM trait itself. In the conditional difference testing, nine analytes showed a statistically-significant difference at the α = 0.10 level. The standardized differencesm D TR jD CR ¼0 =s g are mostly less than 1 in absolute value and only two (Niacin and Vitamin B2) were slightly greater than 1 (ranging from 0.27 to 1.38 except Phytic Acid with reference variation estimated as zero).

Discussion
The equivalence testing method prescribed by [2] assesses differences between the test and references. These differences, as the main topic of this paper, may be driven by a trait effect, a Analytes included are those showing significance in EFSA difference testing and failure of concluding equivalence (nine in total including two excluded due to zero reference variation) in the equivalence testing, and significance in the conditional difference (nine in total with Phytic Acid excluded due to zero reference variance in both methods) testing at the α = 0.10 level. The estimated differences were standardized by the reference standard deviation estimated by respective model. Asterisks ' � ' denote significant differences and lack of equivalences. https://doi.org/10.1371/journal.pone.0210747.t002 Conditional difference testing of genetically-modified crops control background effect, or both. The mixed models proposed in the literature and adopted by EFSA confound the trait and control background effects. Two direct consequences are worth noting. The first is the mismatched outcome type (i.e. non-significance in the difference testing and non-equivalent to the conventional references) that may result in the two-step procedure-using Δ TC in difference testing and Δ TR in equivalence testing. Second, if the control background effect is large, the result of EFSA equivalence testing has little or nothing to do with the trait effect, which should be the sole focus of the comparative assessment. The proportion of background variation in the test-reference difference can be estimated as the variation reduction from the conditioning: For example, when s 2 g : s 2 e ¼ 1; R 2 D TR jD CR will be as large as 94%. Since the background effect is often random among different endpoints, so is the testing result. As the re-analysis of the EFSA example indicates, s 2 g : s 2 e has a mean estimate of 3.4 and a range of 0 to 9.6, and the mean of R 2 D TR jD CR is 86%. In addition, seven out of nine analytes, which could not be concluded as either equivalent or non-equivalent, have relatively large control background effects, while the trait differences are small. Experimental results have recently been provided directly relevant to this observation [13]. They plotted means of GM stacks (i.e. materials with more than one GM trait) and references from a group of studies against the corresponding control mean in each study with different background and concluded that traditional breeding more strongly influenced grain composition than did transgenesis or crossing (stacking) GM events. Therefore, confounding the GM trait effect with natural genotypic variation is not appropriate for comparative assessment and is a misinterpretation of the Codex principle of substantial equivalence. In contrast, the conditional method proposed here evaluates the conditional mean difference, assuming a background effect of zero; this is equivalent to statistically-centering the control background at the reference mean. Consequently, the comparative assessment is independent of the natural genomic differences between the control and references and is more efficient than the EFSA method in the event of a real trait effect.

G×E effects in the TCR trial
The lack of a genotype-by-environment interaction term is one of the main concerns highlighted by Ward et al. [4]. In fact, three types of genotype-by-environment interactions (G×E) can be defined based on the design of the generic TCR trial: the GM trait-by-site, the control background-by-site, and the references-by-site. The trait model proposed in this investigation can accommodate all or any subset of the interactions based on evidence in the data. One consequence among various models is that the trait-by-site interaction is the leading term in the standard error of the estimated trait difference under the trait model. The presence or absence of this interaction term often affects the error degrees of freedom used in testing the trait difference, even if the true trait effect is negligible or absent. This is likely why only eight significant trait differences were detected under the trait model, while 22 analytes were significantly-different under EFSA Model (1) [2]. As illustrated in [4], by neglecting to include any interaction terms, the difference tests conducted under EFSA Model (1) will always be based on an error term that mixes all potential sources of interaction and the residual with a correspondingly large degrees of freedom. If the GM trait-by-site and reference-by-site interactions are both non-zero and of similar magnitude, the error term is under-estimated and the associated degrees of freedom tends to be higher (relative to the true distribution of the test statistic), leading to an inflated false positive error rate that exceeds the nominal level. Other circumstances may result in the difference test being overly conservative. In fact, the only time when the difference test performs as intended under [2] is when all interactions involving site are zero [4].
The control background-by-site interaction may not be of interest to the GM safety assessment per se. It is, however, a component in the covariance structure, and will affect the standard error of the estimated background effect. The impact of this interaction must be determined on a case-by-case basis.
The reference-by-site interaction is different from the above two types of interactions. First, the TCR design does not require overlapping references among sites. Without overlapping references, this interaction cannot be properly estimated, and a good estimate of the interaction depends on a large degree of overlap. Second, the reference-related interaction will confound with the reference variation when absent in the statistical model. Such confounding should not have any consequence on the comparative assessment.

Equivalence criterion
The equivalence of the test and references should be determined in the context of natural variation, i.e. reference variation, when reference variation is a significant part of the total variation. When the reference variation is estimated as zero, EFSA (declares the equivalence criterion and the equivalence status as undetermined [2]. Presumably, when it is near zero relative to the residual variation, the equivalence criterion and the conclusion would be poorly determined. In practice, many endpoints have negligible reference variance. For example, the ratio of s 2 g : s 2 e was estimated between 0 to 9.6 across 53 analytes in the EFSA example under Model A. It makes sense to use the reference variation for the equivalence limit only when s 2 g : s 2 e is much greater than zero, say s 2 g : s 2 e > 0:5. When s 2 g : s 2 e is small, the equivalence limit should consider other sources of variation, such as those discussed by [14].
Clearly, the conditional mean will need a different equivalence criterion because of its independence from the background effect. A plausible equivalence criterion for the conditional mean likely should consider the following: residual variation-the size of the standard error of the conditional mean difference, and the reference variation-providing biological context [15]. Ideally, such a criterion should be predetermined, like the 80-125% rule [16] used in the regulation of the generic drugs, except that the criterion for a GM crop would be on the variation instead of mean level when sufficient reference variation exists. The criterion should be general enough for all types of characteristics in terms of the variation setting s 2 g : s 2 e . In addition, the criterion should be able to take account of a GM trait-by-site interaction, when needed. These questions are topics of ongoing research.

Conclusion
The equivalence testing method prescribed by [2] directly compares a GM crop with commercial references, and confounds the so-called GM trait effect with genotypic differences between the reference materials and control. Critically, these genotypic differences are the result of conventional plant breeding. Thus, the result of EFSA equivalence testing often has little or nothing to do with the GM trait effect, which should be the sole focus of the comparative assessment. A novel statistical model, called the trait model, was introduced that treats the effects of the GM and control materials as fixed for their difference, and as random for their common background. For significance testing, the covariance structure of the three genotype groups is utilized to decompose the differences into the trait effect and the control background effect. The trait difference is then derived as a conditional mean, assuming a background effect equal to zero; this is equivalent to statistically adjusting the control background to the reference mean. Consequently, the comparative assessment is independent of the natural genotypic differences between the control and references, which substantially reduces both Type I and Type II errors and corrects a fundamental flaw in the testing method mandated by the GMO Panel of the European Food Safety Authority.