Hypothesis Testing of Inclusion of the Tolerance Interval for the Assessment of Food Safety

In the testing of food quality and safety, we contrast the contents of the newly proposed food (genetically modified food) against those of conventional foods. Because the contents vary largely between crop varieties and production environments, we propose a two-sample test of substantial equivalence that examines the inclusion of the tolerance intervals of the two populations, the population of the contents of the proposed food, which we call the target population, and the population of the contents of the conventional food, which we call the reference population. Rejection of the test hypothesis guarantees that the contents of the proposed foods essentially do not include outliers in the population of the contents of the conventional food. The existing tolerance interval (TI0) is constructed to have at least a pre-specified level of the coverage probability. Here, we newly introduce the complementary tolerance interval (TI1) that is guaranteed to have at most a pre-specified level of the coverage probability. By applying TI0 and TI1 to the samples from the target population and the reference population respectively, we construct a test statistic for testing inclusion of the two tolerance intervals. To examine the performance of the testing procedure, we conducted a simulation that reflects the effects of gene and environment, and residual from a crop experiment. As a case study, we applied the hypothesis testing to test if the distribution of the protein content of rice in Kyushu area is included in the distribution of the protein content in the other areas in Japan.


Introduction
The safety assessment of genetically modified (GM) foods was confirmed as an important issue in the Organization for Economic Cooperation and Development (OECD) discussion resumed in 1988. Substantial equivalence has been a starting point for the safety assessment for GM foods which is used worldwide since this approach was first suggested in 1993 [1]. Substantial equivalence embodies the concept that if a new food or food component is found to be substantially equivalent to an existing food or feed component, it can be treated in the same manner with respect to safety [2]. To decide if a modified product is substantially equivalent, the product is tested by the manufacturer for unexpected changes in a limited set of components such as toxins, nutrients, or allergens that are present in the unmodified food. Piaggio et al. [3] gave a clear framework of reporting of equivalence randomized trials. Ennis and Ennis [4,5] used an open interval to define equivalence and provided methods for testing a null hypothesis of nonequivalence. McNally et al. [6] proposed applying the generalized test function method in comparison to the confidence interval for assessing population bioequivalence. Herman and Price [7] examined research that has occurred over the past two decades relative to the mechanisms that affect crop composition in GM and traditionally bred crops.
In substantial equivalence tests of the population means, it is impossible to prove exact equality, so a buffer margin (c) for the treatment effect is defined. The equivalence is defined as the treatment effect being between c and −c.
A broad range of factors affect crop compositions, such as the genetic background [8,9], environmental factors [10,11], and agronomic practices [12,13]. Ricroch et al. [14] reviewed the published studies regarding the effect of genetic modification in comparison with the environmental and intervariety variations. Because the contents vary largely between crop varieties and production environments, the test of substantial equivalence should examine the inclusion of the tolerance intervals of the samples from the two populations, the population of the contents of the proposed food or feed, which we call the target population and denote as POP tar , and the population of the contents of the conventional food or feed, which we call the reference population and denote as POP ref (Fig 1).
Statistical tolerance intervals are useful in practical applications in many areas and the construction of tolerance intervals has been extensively studied [15]. Formula for tolerance intervals (regions) for known and unknown mean and variance was given by Proschan [16] for univariate normal distribution and by Chew [17] for multivariate normal distribution. The tolerance interval procedure was developed for balanced one-way random model [18], general linear mixed models for balanced data [19] and unbalanced data [20]. A (1 − γ, 1 − α) tolerance interval (TI 0 ) based on a sample is constructed so that it would include at least a proportion 1 − γ of the sampled population with confidence 1 − α [21]. Such a tolerance interval is usually referred to as (1 − γ)-content-(1 − α)-confidence (coverage) tolerance interval.
We introduced the complementary tolerance interval that is guaranteed to have at most a pre-specified level of the coverage probability. A (1 − γ, 1 − α) tolerance interval (TI 1 ) based on a sample is constructed so that it would include at most a proportion 1 − γ of the sampled population with confidence 1 − α. By applying TI 0 and TI 1 to the samples from the target population and the reference population respectively, the rejection of the test guarantees that the target population essentially does not include outliers in the reference population.

Two complementary tolerance intervals and two-sample hypothesis testing
We consider a sample X from a Gaussian population N(μ, σ 2 ). When the sample is collected by simple random sampling, the sample meanm follows N(μ, σ 2 /R 0 ), and the sample varianceŝ 2 follows (σ 2 /m) w 2 m . R 0 is the sample size, and the degree of freedom, m, is R 0 − 1. By allowing for the uncertainty of the sample mean and the sample variance, the conventional two-sided (1 − γ)-content, (1 − α)-confidence tolerance interval is defined as where w 2 1;1Àg;1=R 0 denotes the upper 100(1 − γ)% point of the non-central chi-squared distribution with degree of freedom one and non-centrality parameter 1/R 0 , and w 2 m;a denotes the upper 100α% point of chi-squared distribution with degree of freedom m [22]. The notation ncp stands for non-centrality parameter. R 0 is the ratio of σ 2 over the variance ofm, and m represents the ratio of 2ðŝ 2 Þ 2 over variance of σ 2 . The tolerance interval TI 0 covers at least 1 − γ of the population with the probability of 1 − α.
Here, we introduce a new tolerance interval TI 1 defined by where w 2 m;1Àa denotes the lower 100α% point of Chi-squared distribution with degree of freedom m. As is seen below, the tolerance interval TI 1 covers at most 1 − γ of the population with the probability of 1 − α. By increasing the sample size, the two complementary tolerance intervals both converge to the population tolerance interval.
We contrast the tolerance interval of the target population, POP tar , with the tolerance interval of the reference population, POP ref . Given the values of γ tar and γ ref (γ tar > γ ref ), the null hypothesis is that the tolerance interval of POP tar is not included in the tolerance interval of POP ref . The alternative is that the tolerance interval of POP tar is included in the tolerance interval of POP ref . To make the dependence of the tolerance intervals on the sample X and population P explicit, we express them as TI 0 (α, γ, X), TI 1 (α, γ, X), and TI(γ, P). Our framework of testing substantial equivalence is to test the null hypothesis, H 0 , against the alternative hypothesis, H 1 .
We define the test statistic as,

Mixed effect model and the coverage probabilities of the tolerance intervals
The two complementary tolerance intervals (Eqs (2) and (3)) can be generalized for the noniid samples. The effective sample size, R 0 , is obtained by comparing the variance of the estimated global mean with the total variance: V½m ¼ ðŝ 2 T Þ=R 0 . The effective degree of freedom, m, is obtained by equating the estimated variance of the estimated total variance and the expected variance of the estimated total variance by the Satterthwaite's chi-square approximation: As an example, we consider the hypothetical samples with random genetic and environmental effects. The hypothetical samples reflect the maize samples of 61 lines from eight multisite field studies. The field sites represented 47 unique environments in the commercial maizegrowing regions of the United States, Canada, Chile and Argentina [23]. The experimental design used at each field site was a randomized complete block design containing three of four blocks. Variances of random components of concentrations of two analytes (tryptophan and oleic acid) were used to generate the simulated data. Table 1 shows the variances of random components of tryptophan and oleic acid. The variance component of environmental effect is large for tryptophan, whereas the genetic effect is the major component for oleic acid.  Table 2 shows the simulation setting with total number of environment, n E = 50, total number of genotype, n G = 50 and number of blocks per environment, n B = 4. We generated 1,000 sample datasets by normal random numbers with the variances in Table 1. We applied a linear mixed model to each of the dataset, and estimated the total mean and the variance components. The variance of total mean and the total variance, which are required for the calculation of R 0 and m, were estimated by the variance among the 100 runs of parametric bootstrap.
The estimated m and R 0 were distributed widely (Fig 2). The means of the estimated m were 98.0 for tryptophan and 149.3 for oleic acid. The means of the estimated R 0 were 65.4 for tryptophan and 70.8 for oleic acid. Fig 3 shows the median, lower and upper 5 percentiles of the coverage probabilities of the tolerance intervals, TI 0 and TI 1 . The coverage probability of TI 0 is larger than the nominal coverage probability (1 -γ) with probability 0.95. For the value of γ = 0.01 (see the first dotted vertical line from the left on both panels), with probability 95%, the lower 5 percentiles of coverage probabilities of TI 0 were larger than 98.9% and 98.9% for tryptophan and oleic acid respectively; the upper 5 percentiles of coverage probabilities of TI 0 were  smaller than 99.9% and 99.8% for tryptophan and oleic acid respectively; the medians were 99.6% for both tryptophan and oleic acid. This means that TI 0 covers at least 1 − γ of the population with the probability 95%.
On the other hand, the coverage probability of TI 1 is smaller than the nominal coverage probability (1 -γ) with probability 0.95. For the value of γ = 0.01 (see the first dotted vertical line form the left on both panels), with probability 95%, the upper 5 percentiles of coverage probabilities of TI 1 were smaller than 99.0% for both tryptophan and oleic acid; the lower 5 percentiles of coverage probabilities of TI 1 were larger than 95.9% and 96.7% for tryptophan and oleic acid respectively; the medians were 97.8% and 98.1% for tryptophan and oleic acid respectively. This means that TI 1 covers at most 1 − γ of the population with the probability 95%.

Results
The p values and the power of the hypothesis test: a simulation study  A case study of testing inclusion of tolerance intervals: Contrasting protein composition of rice in Kyushu against other areas in Japan As an example of empirical study, we applied the hypothesis testing to test if the protein value of rice in Kyushu area (Kyushu, including prefectures Fukuoka and Kagoshima) was included in the other areas in Japan (Japan). We downloaded the rice component data for Japan from  Table 3. The power to reject the null hypothesis with significance level of 0.05 for each combination of sample size and the size of σ tar /σ ref .
Sample size σ tar /σ ref The Food Composition Database for Safety Assessment of Genetically Modified Crops as Foods and Feeds [24,25]. It is third-party data and not owned by the authors. Major varieties of non-glutinous rice cultivated and distributed in Japan were collected from 1999 to 2009 (except for 2003 and 2004). A total of 15 or 16 samples consisting of 10−12 varieties were collected every year. The production areas are located in Japan stretching from the far north to south of the country. Table 4 shows the number of samples of different varieties and production areas.
In total, the sample X Japan includes 120 rice samples of varieties and the sample X Kyushu includes 18 rice samples of varieties.
To test if the protein value of rice in Kyushu was included in Japan, we applied TI 0 and TI 1 to the samples from Kyushu and Japan respectively. The null hypothesis is that the tolerance interval of the protein of rice in Kyushu was not included in the tolerance interval of that in Japan. The alternative is that the tolerance interval of the protein of rice in Kyushu was included in the tolerance interval of that in Japan. The value of γ Kyushu and γ Japan were set to 0.05 and 0.01 respectively.
Using a linear mixed-effect model we estimated the total mean of POP Japan as μ Japan = 6.60 and random effects s 2 G;Japan ; s 2 E;Japan and s 2 GxE;Japan = 0.05, 0.07 and 0 respectively, and the error term, s 2 ε;Japan = 0.19. The total variance s 2 T;Japan = 0.31. The variance of the estimated total mean and that of the estimated total variance were estimated as V½m Japan ¼ 0:0128 and V½ŝ 2 T;Japan ¼ 0:00365 respectively by 1,000 runs of parametric bootstrap. These values provide the effective sample size, R 0;Japan ¼ŝ 2 T;Japan =V½m Japan ¼ 24:28 and the effective degree of freedom, m ¼ 2ðŝ 2 T;Japan Þ 2 =V½ŝ 2 T;Japan ¼ 52:70. The sample from POP Kyushu is assumed to be an iid sample from normal distribution with mean μ Kyushu = 6.92 and variance s 2 T;Kyushu = 0.22. In this case, R 0,Kyushu is the sample size, 18, and m Kyushu is R 0,Kyushu − 1 = 17. With these values of R 0 's and m's, we obtained TI 1 (α = 0.05, γ = 0.01, X Japan ) as (5.34, 7.86) and TI 0 (α = 0.05, γ = 0.05, X Kyushu ) as (5.59, 8.24). The latter does not include the former.
The value of the test statistic, α TI01 , was numerically obtained as 0.247 by solving the Eq (3). We obtained the p value by locating the value of α TI01 on the distribution under the null hypothesis. We generated this distribution by parametric bootstrap, assuming μ Kyushu = μ Japan and σ T,Kyushu = 1.41σ T,Japan . Without losing generosity, we assumed μ Kyushu = μ Japan = 0 and σ T,Japan = 1. The iid sample of Kyushu was generated by normal random numbers with mean 0 and standard deviation 1.41. As for the sample of Japan, we generated the genetic effect (G), environmental effect (E), the G×E interaction and the error term, by decomposing the total variance into the variance components by the relative size of the estimated variance components. We generated 1,000 sets of the data. For each of the simulated data, we estimated the means and the total variances of Kyushu and Japan. We estimated their variances by 100 parametric bootstrap. With these estimates, we obtained R 0 's and m's, and the value of α TI01 . From 1,000 values of α TI01 , we obtained the cumulative distribution of α TI01 under the null hypothesis ( Fig 5). As a result, we obtained the p value as 0.195.

Conclusion
In this study, we proposed a hypothesis test of inclusion of tolerance interval using the existing tolerance interval and a newly introduced the complementary tolerance interval. The result of simulation showed that the power of the test for the case of σ tar /σ ref = 1.41 stayed nearly at the value of 0.05 (Fig 4), which means that the testing procedure is almost unbiased. However, the test statistic, α TI01 , is complex in form, and we could not attach a direct interpretation to it. We need make further effort to develop candidates of test statistics that measure the extent of coverage or non-coverage of the target population by the reference population. The mixed effect model enables unbiased estimation of the effective sample size and the effective degree of freedom, when the samples consist of subsamples collected in various conditions of genetic factors and environmental factors. However, a survey may be designed to collect the samples of matched controls. Another promising project is to develop a testing procedure for such samples.
As an alternative to the testing non-inferiority or substantial equivalence of the population mean, the proposed test examines the "range" of the distribution. A statistical test on the range of the distribution may be useful, especially when it is difficult to formulate the distribution by a simple statistical model. If a large sample is available, it is possible to construct non-parametric tolerance intervals [26,27]. The future study will investigate the statistical property of the non-parametric testing procedure.