Bootstrap simulations for evaluating the model estimation of the extent of cross-pollination in maize at the field-scale level

With the recent advent of genetic engineering, numerous genetically modified (GM) crops have been developed, and field planting has been initiated. In open-environment cultivation, the cross-pollination (CP) of GM crops with wild relatives, conventional crops, and organic crops can occur. This exchange of genetic material results in the gene flow phenomenon. Consequently, studies of gene flow among GM crops have primarily focused on the extent of CP between the pollen source plot and the adjacent recipient field. In the present study, Black Pearl Waxy Corn (a variety of purple glutinous maize) was used to simulate a GM-maize pollen source. The pollen recipient was Tainan No. 23 Corn (a variety of white glutinous maize). The CP rate (%) was calculated according to the xenia effect on kernel color. We assessed the suitability of common empirical models of pollen-mediated gene flow (PMGF) for GM maize, and the field border (FB) effect of the model was considered for small-scale farming systems in Asia. Field-scale data were used to construct an optimal model for maize PMGF in the maize-producing areas of Chiayi County, southern Taiwan (R.O.C). Moreover, each model was verified through simulation and by using the 95% percentile bootstrap confidence interval length. According to the results, a model incorporating both the distance from the source and the FB can have optimal fitting and predictive abilities.


Introduction
With improvements in biotechnology and genetic engineering, the area assigned to the cultivation of genetically modified (GM) crops has increased by 1.9 million hectares from 2017 to 2018 [1]. Among the 26 countries growing GM crops in 2018, five countries grew 91% of these crops [1]. Moreover, more than 70% of plants produce offspring through cross-pollination (CP) between species; this is a commonly observed evolutionary phenomenon in plants, and pollen is the primary medium for this process [2]. Therefore, when GM crops are cultivated in recipient field to the pollen source when using this model. Šuštar-Vozlič et al. 2010 [13] used inverse power functions to develop a suitable CP model. Numerous samples must be collected and investigated to accurately estimate the CP of maize fields [13]. However, if an appropriate model and set of sampling methods are employed to calculate CP in the field, the required sample size can be reduced.
To perform farm-scale evaluation of maize fields with an area less than 5 ha, Weekes et al. [14] used experimental data collected from 55 experimental sites in England and developed a gene flow model. This model comprised two stages: (1) second-order log equations were adopted to determine the probability that a sample has a GM content of 0% at a given distance within the recipient field and (2) a beta distribution was used to calculate the mean proportion of GM content at a given distance. Based on this model, a separation distance of 3 m was recommended to ensure that the neighboring crop had a CP rate (%) less than the 0.9% threshold when using a square source field with an area of 150 × 150 m 2 .
Most PMGF studies of GM maize have been conducted in EU member states or North America on large-scale fields. However, PMGF results might be affected by environmental factors, particularly regional climate, agricultural landscape, and experimental materials. Therefore, real field experiments on PMGF from pollen source plots to recipient fields in a smallscale farming system are warranted.
In this study, a farm environment with both GM and non-GM crops was designed to investigate the gene flow of GM crops and subsequently reduce the risk of GM gene flow to conventional crops. Although many countries have permitted GM crops to be commercially cultivated over large areas, no GM crops have been approved for commercially cultivation in Taiwan. Moreover, in the small-scale agricultural landscapes of Taiwan, crop fields are often separated by field borders (FBs), such as roadways between fields. Therefore, the influence of FBs on CP was also considered. This is the first comprehensive investigation of the effects of ID and FBs on the CP rate (%) of maize in Taiwan, an island with a subtropical climate.
Chiayi County is a primary maize-producing area of Taiwan. Therefore, experiments were conducted in Puzih City, Chiayi County (23˚47 0 N, 120˚26 0 E) during the growing seasons of 2009 and 2010 to determine the optimal model for CP. The empirical models of pollen flow used in other studies were referenced to develop a CP model for maize production areas with small-scale agricultural landscapes. Moreover, the fitting abilities of the empirical models were compared, and the fitting stability and predictive abilities of the optimal model were evaluated using 1,000 bootstrap simulations. The results can aid governments in establishing coexistence systems for GM and non-GM crops in field allocation for the cultivation of GM maize. This information can also be a reference for other Asian counties with similar farming systems, such as Japan, Korea, and the Philippines.

Field design
In this study, experiments were conducted at the Puzih Branch Station of the Tainan District Agricultural Improvement Station from 2009 to 2010. Moreover, FBs are common in the agroecosystems of Taiwan because fields are often separated by roadways. Therefore, to establish a model that includes both the distance from the source and the FB, the following three field experiments were conducted (Fig 1): (1) During the first crop season in 2009 (2009-1), a study was designed to investigate circumstances in which the pollen source neighbors the pollen recipient field. Because the prevailing wind during 2009-1 is from the south, the pollen source was designated at the southern edge, and the pollen recipient was downwind from the pollen source ( Fig 1A). The total site area was approximately 0.42 ha (84.5 × 50 m 2 ), and the pollen source to recipient field area ratio was approximately 1:2.6. (2) During the second crop season in 2009 (2009-2), a study was designed to evaluate the effects of the FB. The pollen recipient field was divided into part A (without an FB; 2009-2A) and part B (with an FB; 2009-2B). Part A neighbored the pollen source, and part B had an FB with a width of 6.75 m to provide separation from the pollen source. Because the prevailing wind during fall is from the north, the pollen source was designated at the northern edge ( Fig 1B). The total site area was approximately 0.42 ha (84.5 × 50 m 2 ), and the pollen source to recipient field area ratio was approximately 1:2.5. (3) During the first crop season in 2010 (2010-1), an experiment was conducted with an FB larger than that in 2009-2B. Maize fields were separated with an FB with a width of 7.5 m between the recipient field and the pollen source ( Fig 1C). Because the prevailing wind during summer is from the south, the pollen source was designated at the southern edge. The total site area was approximately 0.55 ha, and the pollen source to recipient field area ratio was approximately 1:1.8.

Plant culture
For this experiment, two commercial glutinous maize varieties with different grain colors were selected from those grown in Taiwan. Black Pearl waxy corn (Known-You Seed Co., Kaohsiung, Taiwan), which has purple grains, was used as the pollen source. The period from planting to flowering of this variety is approximately 40-50 days. Moreover, Tainan No. 23 corn, which has white grain and is suitable for planting in central and southern Taiwan, was used as the pollen recipient. The kernel pericarp color of the maize resulting from the xenia effect (i.e., purple grains on the white ears of the pollen recipient) was used to determine the CP rate (%). The xenia effect of the maize is caused by the effect of the different pollen source gene resulted in endosperms on the development of the seeds. We adopted conventional farming management methods in this study. The distance between individual plants in a row was 0.25 m, and the distance between rows was 0.75 m. Maize fields were planted with densities of 53,018 plants ha −1 in the 2009-1 field, 53,251 plants ha −1 in the 2009-2 field, and 53,467 plants ha −1 in the 2010-1 field (Fig 1).

Plant dates
To ensure congruence between releases of male pollen from the purple maize and the silking period of the white maize, the purple maize was sown in two batches according to planting time. The white maize variety flowers slightly later than the purple variety does in Puzih City, Chiayi County. Therefore, for the 2009-1 experiment, the white maize was sown on 8 May, the first batch of purple maize was sown on 11 May, and the second batch of purple maize was sown on 13 May. However, the results of the 2009-1 experiment indicated no clear difference between the flowering periods of the white and purple maize varieties. Therefore, in subsequent experiments, the first batch of purple maize was sown 3 to 5 days before the white maize was sown, and the second batch of the purple maize was sown at the same time as the white maize was. In the 2009-2 experiment, the white maize was sown on 16 October, the first batch of purple maize was sown on 13 October, and the second batch of purple maize was sown on 16 October. In the 2010-1 experiment, the white maize was sown on 3 May, the first batch of purple maize was sown on 28 April, and the second batch of purple maize was planted on 3 May.

Climate monitoring during the flowering period
Meteorological information during the experiments, including wind speed and wind direction, was measured at the Central Weather Bureau's Yichu Branch Station (23˚36'N, 120˚28'E). Wind speed and wind direction were recorded hourly from 6 a.m. to 4 p.m. for 7 days before and after the silking period of the pollen recipient. Changes in wind rose plots were used to interpret trends in wind speed and direction.

Data collection and CP (%) rate calculation
In this study, to explore the relationship between the CP rate (%) and the distance from the pollen source, ears were measured for each small plot by using the census method. The 2009-1 field was investigated using a census scale with small sample plots with areas of 2.5 × 0.75 m 2 (number of missing plots: 5; number of measured plots: 1,635). Furthermore, to identify and precisely describe the gene flow trends in maize, ears were measured with small sample plots with areas of 1.25 × 0.75 m 2 in the 2009-2 (number of missing plots: 10, number of measured plots: 3,110) and 2010-1 fields (number of missing plots: 392, number of measured plots: 3,248).
Visual inspection of the ears of the pollen recipient and xenia counting were used to calculate the CP rate (%). That is, the CP rate (%) was evaluated by counting the number of purple xenia kernels on the white pollen recipient according to the following formula (Eq 1) [5,11]: where n is the number of ears in the sample plot, Ear i is the number of purple kernels on the ith ear in the sample plot, and AVK is the average number of kernels for an ear in the field. To determine AVK, two ears were randomly chosen from each sampling plot, and the total number of kernels was counted for each selected maize ear to calculate the average number of kernels for an ear.

PMFG models
In previous studies, the majority of models describing the relationship between the CP rate (%) and the distance from the source pollen can be categorized into the following types: the exponential model, log-log model, and log-square model. In these equations, distance refers to the distance (m) from the sampled ears to the edge of the pollen source field, and a and b are model parameters [4,5].
To consider the effects of FBs, the field designs of the 2009-2 and 2010-1 experiments included FBs. Therefore, in addition to employing the aforementioned models, the empirical modeling approach proposed by Gustafson et al. [12] was investigated. This model contained information for the adventitious presence of the GM trait in seeds within the recipient field (%) (AP), the fraction of pollen containing the GM trait (F GM ), the CP rate in the row closest to the edge of the pollen recipient field (P 0 ), the ID from the edge of the source field to the recipient field (ID; this was the same as FB in the present study), distance from edge of recipient field nearest to the source (x), and the width of the border rows (in meters) for the non-GM crop planted between the source and recipient fields (BR) (Eq 2).
After individual field trial data were fit using the empirical modeling approach, the trend of the gene flow within recipient or border rows could be reduced, as indicated by the coefficient of the proportionality constant, and twice as effective as the ID (the unplanted isolation buffer), as reported in Gustafson et al. [12]. Consequently, in this study, the proportionality constants for FB and the distance within the recipient field were re-estimated based on the data from field experiments conducted during the growing seasons of 2009 and 2010.
Furthermore, in the European commercial biotech environment, because the AP of the unavoidable GM traits of seeds must be less than 0.3%, the AP was set at 0.003 in the model [12]. However, under the current planting regulations for GM crops in Taiwan, no GM crops have been approved for commercial cultivation in open fields. Moreover, planted seeds cannot be admixed with GM seeds. Therefore, the AP for the empirical modeling approach should be adjusted to 0% for the conditions of Taiwan's current agricultural environment. In addition, to conform to the planting regulations that forbid the cultivation of GM crops, F GM was set to 1; that is, all pollen grains containing GM traits introduced a risk of spreading GM traits through pollination [12]. Gustafson et al. [12] also indicated that the mathematical descriptions of the relationships between gene flow and distance from the pollen source are the same as those for the recipient and border rows. However, rows in the isolation field with non-GM maize between the source and recipient fields were not considered as border rows. Consequently, in the empirical modeling approach, BR was set to 0.
The models with the coefficient values estimated from the data and proposed by Gustafson et al. [12] were evaluated in this study. Furthermore, the models with and without 0.003 AP were investigated to assess the assumption of receptor seeds carry GM trait due to 0.3% AP. In summary, the following five models (Eqs 3-7) were used to describe the relationship between the CP rate (%) and the distance from the pollen source in the field experiments: where CP i refers to the CP rate (%) estimated using the ith model, distance is the distance (m) from the sampled plot to the edge of the pollen source field, FB is the width of the FB, which functions as an isolation buffer, P 0 is the parameter that describes CP rate (%) at the edge of the pollen recipient field, and a and b are model parameters.

Simulation analysis
To evaluate the fitting performance and predictive ability of the models and the stabilities of the estimated parameters, bootstrapping was performed to simulate the population distribution. Bootstrapping is a nonparametric simulation method that uses a computationally intensive resampling technique to assess the variance of a statistic or parameter [15]. The benefit of the bootstrapping method is that the population distribution need not be assumed when generating data sets through random sampling with replacement. In this study, random sampling with replacement was applied to the actual data set across lumped together across the three fields. A total of 8,005 observations of raw data were used to generate new data sets through 1,000 bootstraps. Then, the PMGF models were fitted according to the 1,000 bootstrap samples. To compare the stabilities in the fitting performance and predictive ability of the calibration and validation sets, confidence intervals were calculated for specific evaluation criteria with a 5% significance level on the basis of the percentile bootstrap (PB) confidence interval method [16]. By sorting the evaluation criteria of the 1,000 bootstrap samples from smallest to largest, the 95% PB confidence intervals were calculated between the 2.5th and 97.5th percentiles.

Statistical analysis
In this study, Statistical Analysis System (SAS) version 9.4 (SAS Institute, Cary, NC, USA) was used to conduct the statistical analysis. The PROC NLMIXED was used for fitting the models. Before fitting the models, the CP rates were transformed to the count data, and the count data were assumed to follow a Poisson distribution. Thereafter, the collected experimental data were randomly partitioned into two sets, a calibration set (two thirds of the samples) and a validation set (one third of the samples), for fitting the models and validating their predictive abilities, respectively.
To compare the fitting performance of the models in describing the relationship between the CP rate (%) and distance from the pollen source, evaluation criteria were based on the deviance and Akaike information criterion (AIC) [17]. In addition, based on the validation data, deviance and AIC were used to evaluate the fitting abilities of the fitted models. Furthermore, the correlation coefficient (r) and scatter plots were used to assess the predictive ability of each fitted model based on the validation data.
Finally, similarly to the analysis of the collected experimental data, the simulated data for each bootstrap sample were also categorized into a calibration set (two thirds of the samples) and a validation set (one third of the samples) to fit the models and validate their predictive abilities. In the simulated data of each bootstrap sample, the values of deviance, AIC, and parameters were estimated for each model. Then, standard deviations (SDs), confidence intervals of deviance and AIC were calculated to assess the stabilities of the performance of each model. For the simulated validation data, the correlation coefficient was also calculated to evaluate the predictive ability of each model.

Weather patterns
The average wind speed and direction were monitored for 7 days before and after the 50% pollen silking of the pollen recipient. The average wind speed was higher and more varied during the 2009-1 and 2010-1 crop seasons than during the 2009-2 season (Fig 2). During the investigation period in the 2009-1 season (June 17 to July 1), the prevailing winds were mostly from S to SE (Fig 3A). The daily average wind speed was 4.57 ± 2.06 m s −1 (Fig 2A).
In the 2009-1 experiment, the average CP rate (%) declined from 27.24% to 2.49% at a distance from the pollen source of approximately 9 m. In the 2009-2 field, the average CP rate (%) decreased from 74.29% to 2.49% at a distance from the pollen source of approximately 18 m. However, when the pollen recipient field was separated from the pollen source by an FB of 6.75 m, a rapid decline in the average CP rate (%) from 36.12% to 2.50% was observed at a distance from the FB of approximately 9 m. When the width of the FB was increased to 7.5 m in the 2010-1 field, the average CP rate (%) decreased from 27.58% to 2.18% at a distance from the FB of approximately 4.5 m. According to these results, the average CP rate (%) rapidly decreased with increasing distance from the pollen source and with increasing FB width. In addition, with the 6.75 m natural barrier in the 2009-2A field, the average CP rate (%) of the row closest to the pollen source was only 7.43%. Therefore, border rows may be more effective buffer zones relative to FBs for decreasing the CP rate (%), whereas FBs are more suited to long-distance pollination events. However, when the distance was increased to 18.75 m in the 2009-2 experiments, the average CP rates (%) of the rows in the 2009-2B field were mostly lower than those in the 2009-2A field (Table 1).

PMGF model
The estimated values of the regression parameters, the deviance, and AIC for each CP model based on the complete data are presented in Table 2. The estimated values of a for CP 1 , CP 2 , and CP 3 as well as of b for CP 4 and CP 5 were negative, implying that CP rate (%) decreased as distance from the pollen source increased. However, the estimated values of a for CP 4 and CP 5 were positive, indicating that the FB may have enhanced pollen exchange. Moreover, in CP 5 and CP 4 , the lower deviance and AIC compared with the other models indicated a closer fit for the complete dataset. However, to evaluate the model fitting results and compare the predictive abilities of the investigated models, the experimental data were randomly partitioned into a calibration set and validation set. Table 3 presents the fitting ability of each CP model based on the deviance and AIC for calibration and validation data. When the fit was assessed using the calibration set, four of the models had deviance > 30000, and only one had deviance < 30000. The lowest deviance value was 29994 (CP 5 ). Moreover, four models had AIC > 45000, and one had AIC < 45000. The lowest AIC value was 42044 (CP 5 ).
For the validation set, the fitting abilities were similar to the fits evaluated using the calibration set. Model CP 5 exhibited optimal fitting ability with the lowest deviance value of 15878 and AIC value of 22158 and was followed by CP 4 (deviance = 17496, AIC = 23777; Table 3). The scatter plots of actual and predicted values in Fig 5 indicate the predictive abilities of the models based on the validation set. Only one model had r � 0.7 (CP 1 ), four models had r > 0.7. The highest r value was 0.79 (CP 5 ), indicating that CP 5 had higher prediction abilities than the other models.
Accordingly, CP 1 , CP 2 , and CP 3 had worse fits and predictive abilities than the other models. CP 5 exhibited the optimal fit and predictive ability, followed by CP 4 . Because the FB effect occurred in some fields, CP 1 , CP 2 , and CP 3 had worse fits and prediction abilities relative to the models that included the FB effect.

Simulation results
Bootstrap simulations were implemented to compare the stabilities of the fits and predictive abilities of the CP models. The simulation method was repeated for 1,000 runs, where 1,000 bootstrap samples were generated during 1,000 runs. The means and SDs of the parameters were calculated for each CP model based on the bootstrap simulations (Table 4). In addition, the mean and SD of the deviance and AIC of the simulated calibration and validation sets were also calculated ( Table 5). For the validation sets, the correlation coefficient was calculated to evaluate the model predictive ability.
The estimated values of parameters (Table 4) were similar to the results obtained from the complete observed data listed in Table 2. This indicated that the parameters estimated from the simulation data were identical to those estimated from the whole original data. The average deviance values of CP 5 and CP 4 were lower than those of the other models, indicating that the fits of CP 5 and CP 4 were closer to the data than were those of the other models. Among the other models, CP 1 , CP 2 , and CP 3 performed similarly in terms of average deviance. Furthermore, the SDs of deviance were higher in CP 1 , CP 2 , and CP 3 than in CP 5 and CP 4 . This indicated that CP 1 , CP 2 , and CP 3 had worse fit and greater variability than CP 5 and CP 4 did. Moreover, the average AIC values were similar to the results for deviance (Table 5). CP 5 and CP 4 exhibited the lowest average AIC values, and they had smaller SDs compared with the other models. In summary, CP 5 and CP 4 had the optimal fit, and these fits were more stable than those of the other models.  The validation results were similar to those obtained from the calibration set (Table 5). This indicated that the fit and validation results were consistent and no overfitting occurred. Moreover, CP 5 and CP 4 remained more fitted than the other models. The SDs of deviance and AIC in CP 5 and CP 4 were also smaller than those in the other models. The CP 5 also presented the highest r (r = 0.796) among the models. The value of r indicated that the CP 4 and CP 5 performed the best predictive ability.
For the calibration and validation sets, the results of CP 5 and CP 4 were superior to those of the other models. Therefore, CP 5 and CP 4 , which included FBs, had the closest fits, optimal predictive abilities, and most stable performance.
Additionally, the 95% PB confidence interval lengths (PBLs) of the average deviance and AIC values were used to further assess the stabilities of model fitting (Figs 6 and 7).
As indicated in Fig 6, the fits of CP 5 and CP 4 were more stable than those of the other models, resulting in smaller PBLs (PBL CP5 = 4075, PBL CP4 = 5140). Because the PB confidence intervals were overlapping, the CP 5 and CP 4 models exhibited similar performance. Most of the models exhibited relatively larger PBLs except CP 4 and CP 5 . Additionally, the results obtained from the validation set were similar to those obtained from the calibration set. Moreover, the 95% PBLs of the average AIC values were similar to the results for deviance. CP 5 and CP 4 exhibited similar fits and were more stable than the other models (Fig 7). In addition, the results obtained from the validation and calibration sets were similar.
In summary, CP 5 and CP 4 exhibited the optimal stability, model fits, and predictive abilities. Moreover, the values of the regression parameters estimated based on the simulation data were similar to the values calculated using the complete observed data. Furthermore, the comparison results of deviance and AIC estimated using the simulation data were similar to those obtained from the complete observed data. Additionally, the results in the calibration and validation sets were similar.

Conclusion
In Taiwan and most Asian countries, crop fields are often separated by roadways. Therefore, this study established a CP model for maize to describe the relationship between CP rate (%) and distance from a pollen source; this model was suitable at the field-scale level in southern Taiwan. Three models commonly applied in gene flow studies and empirical modeling approaches including the FB effect were investigated. The results indicated that the models that omitted the FB yielded poor performance. By contrast, models that included FB had improved fits and predictive abilities as well as the greatest stability.
Previous studies have examined the models used in the present study, but their results differ from those of the present study because of differences in experimental designs. Ma et al. [4] used the CP 2 model to describe downwind and upwind data. Their pollen source (0.07 ha) was placed in the center of the field and surrounded by the pollen recipient (0.68-1 ha). Their R 2 values for the fitting results were 0.64 for downwind areas and 0.58 for upwind areas. Therefore, Ma et al. [4] concluded that the CP 2 model was suitable for describing the relationship between CP rate (%) and distance from the pollen source. In addition, Della Porta et al. [5] used the CP 1 , CP 2 , and CP 3 models for various field designs, barriers, positions, and distances from the pollen source to perform model fitting. Their results revealed that the CP 2 model provided the optimal fit (R 2 = 0.89-0.99) when the recipient neighbored the pollen source without a barrier. In general, pollen flow tendencies were similar for a fixed FB size. Therefore, the CP 2 model exhibited similarly high performance when the FB size was fixed; resulting in R 2 values extremely close to 1.
One of the aims of this study was to investigate the effects of FBs on the gene flow tendencies between the adjacent fields. According to the estimates of the regression parameters in the empirical and simulation analyses as well as the results of the gene flow trends, the CP rate (%) decreased as the distance from the pollen source increased. The presence of an FB enhanced pollen exchange in the row of the pollen recipient closest to the pollen source and facilitated long-distance pollination events [18]. However, when the distance from the pollen source increased, the average CP rate (%) in the presence of an FB may have been lower than that in the absence of an FB (Table 1).
Considering isolation buffers (e.g., FBs) together, the CP 5 model exhibited the optimal stability, fit, and predictive ability. The CP 4 model also yielded performance similar to that of CP 5 . However, the AP must be set to 0% for the agricultural environment in Taiwan. Therefore, although the CP 5 model exhibited the optimal performance, after setting AP to 0%, CP 4 appropriately modeled the data obtained from southern Taiwan and was recommended in this study.
In addition to exploring the relationship between CP and the distances from the pollen source, studies on the CP models for GM maize have recently included other influential factors to improve the completeness and accuracy of predictive models. Loos et al. [19] used a Gaussian plume model to simulate the pollen transport of maize in and from plant canopies. This semi-empirical approach combined the atmospheric diffusion equation with the Lagrangian method. To describe the trends of maize pollen dispersal, Klein et al. [20] used individual dispersal functions containing biological parameters (i.e., difference in height between male and female flowers) and aerodynamic parameters (i.e., pollen settling velocity, wind speed, and air turbulence). Goggi et al. [21] combined an exponential model with a linear equation to establish a maize CP model. The exponential model described the relationship between the CP rate (%) and the distance from the pollen source, whereas the linear equation described the relationship between wind speed, wind direction, and distance.
In addition to the distance from the pollen source and FB considered in this study, meteorological and biological factors have been gradually introduced in CP models. Therefore, these factors should be included in future studies to further improve the fit and predictive ability of the proposed models for investigating gene flow and CP. In this study, only three experiments data were used to estimate the parameters of models. For the further study, more replications of experiment should be needed to establish a more robust model. Supporting information S1 Table. Regression parameters, deviance, and AIC for the complete dataset in models with parameter P 0 .