Incorporating statistical strategy into image analysis to estimate effects of steam and allyl isocyanate on weed control

Weeds are the major limitation to efficient crop production, and effective weed management is necessary to prevent yield losses due to crop-weed competition. Assessments of the relative efficacy of weed control treatments by traditional counting methods is labor intensive and expensive. More efficient methods are needed for weed control assessments. There is extensive literature on advanced techniques of image analysis for weed recognition, identification, classification, and leaf area, but there is limited information on statistical methods for hypothesis testing when data are obtained by image analysis (RGB decimal code). A traditional multiple comparison test, such as the Dunnett-Tukey-Kramer (DTK) test, is not an optimal statistical strategy for the image analysis because it does not fully utilize information contained in RGB decimal code. In this article, a bootstrap method and a Poisson model are considered to incorporate RGB decimal codes and pixels for comparing multiple treatments on weed control. These statistical methods can also estimate interpretable parameters such as the relative proportion of weed coverage and weed densities. The simulation studies showed that the bootstrap method and the Poisson model are more powerful than the DTK test for a fixed significance level. Using these statistical methods, three soil disinfestation treatments, steam, allyl-isothiocyanate (AITC), and control, were compared. Steam was found to be significantly more effective than AITC, a difference which could not be detected by the DTK test. Our study demonstrates that an appropriate statistical method can leverage statistical power even with a simple RGB index.


Introduction
Weed control is very important for crop management. Weed density counts or weed control efficacy has been traditionally quantified by manual weed density counts, and it is time-consuming and labor-intensive process. To overcome these challenges, photographic data taken by a digital camera can be used for quantifying weed densities instead of manual assessments. In recent years, many researchers have used image analysis techniques in agricultural research. PLOS  For instance, Peña et al. (2015) detected weed plants with visible-light and multispectral cameras to obtain highly accurate results of weed detection [1]. Mahlein (2016) introduced RGB, spectral, thermal, fluorescence, and 3D sensors to detect color difference in cell, leaf, plant, plot, field, and ecosystem conditions [2]. Soil fumigants are an important part of the weed control program for strawberries in California. However, regular use of Methyl bromide (MB), the widely used soil fumigant for soil disinfestation in strawberry, has been discontinued in many countries in 2005 due to its ozone-depleting effect [3]. Steam and allyl-isothiocyanate (AITC) are being evaluated as alternatives to MB. Both methods are effective on weed control and do not deplete the ozone layer [4] [5].
RGB imaging is a simple method for quantifying a combination of red (R), green (G), and blue (B) colors. Most people have access to digital camera sensors in mobile phones or tablet PCs, and there are abundant image analysis programs available to analyze RGB data for plant recognition, identification, classification, and leaf area. Despite the technological advances, RGB data are often difficult to interpret due to lack of appropriate statistical methods for comparison. Downie et al. (2015) noted that limitations for image analysis still exist in statistical processes [6].
Some researchers have applied statistical hypothesis testing to RGB data. Woebbecke et al. (1995) tested and compared various color indices using t-tests to distinguish between a green plant from a nonplant background [7]. Golzarian et al. (2012) used the non-parametric Kruskal-Wallis and Friedman tests, which are alternatives to parametric ANOVA tests, to compare various color indices [8]. They used 240 digital images of wheat, ryegrass, brome grass, and wild oat, and they found that the 2G-R-B index (known as the excessive green index or EGI) is computationally simple and effective when the contrast between the background and the target is high. Longchamps et al. (2012) used the non-parametric Friedman test to compare the weed coverage in three different areas of ground by randomly selecting 110 images per plot [9].
In this article, RGB data were obtained from non-treated (control), steam, and AITC treatments. According to the previous studies, a traditional multiple comparison test, the Dunnett-Tukey-Kramer (DTK) test, and two alternative statistical methods, a bootstrap method and the Poisson model, were evaluated via simulation studies to compare their statistical power. The Poisson model can be applied to the case when the number of green spots is very low, and it was indicated in the current study. The two alternative statistical methods were preferred because they were more sensitive (i.e., higher statistical power). After confirming their advantage in the simulation studies, these two alternative statistical methods were applied to provide statistical evidence for effect of steam and AITC on weed control.

Treatment description and application
Steam and Dominus (99.8% AITC) were applied on June 15, 2018. Microplots for this experiment were located at the US Department of Agriculture research station in Salinas, California, and each microplot was 1 m 2 . Each of the four treatments, control, Dominus, steam, and steam + Dominus, was replicated three times in a randomized complete block design. Steam was applied at a pressure of 0.8 bar for 60 minutes at the center of each microplot with a diesel-powered steam generator (SF-20, Sioux Corp, Beresford, SD). The microplots were covered with insulation mats for 24 h. HOBO data loggers (Onset Computer, Bourne, MA) were inserted at 0.3, 0.5, and 0.7 m from the center of each microplot. The maximum soil temperatures at 0.3, 0.5, and 0.7 m from the center were 99.6, 99.7, and 99.5˚C, respectively, and times above 65˚C were 457.3, 416.0, and 278.0 minutes, respectively. The maximum soil temperatures at 0.3 and 0.7 m from the center were 30.8 and 29.8˚C in the control microplots. Dominus (18.7 mL m −2 ) was also applied at the center of each microplot by syringe with stylet at a depth of 15 cm.

Data collection
Thirty-one days after treatment (DAT), each microplot was photographed by a digital camera (EOS 70D DSLR, Cannon, Inc., Tokyo, Japan). Each picture was converted from a JPEG file to a GIF image file for image analysis by Adobe Photoshop CC 2017 (Adobe Systems, San Jose, CA, USA). Each GIF file was uploaded to the image analysis program available at http:// mkwak.org/imgarea/ developed by Dr. Minseok Kwak. The result of image analysis was copied to a spreadsheet of Microsoft Excel to identify the RGB code. The image analysis program output 255 rows of data per microplot, and the Excel files in S1 and S2 Files include all data from each microplot. In Fig 2, the first color is close to black (RGB code of 333333), and the proportion occupied is 0.0377 (i.e., 3.37% of the observed microplot). The 42 nd color is close to green (RGB code: 669966), and the proportion occupied is 0.0056 (i.e., 0.56%). Among the various green colors observed, a common feature is a positive value of 2G -R -B which is known as an excess green index [7]  [10] [11]. Defining a green color by G > R and G > B (i.e., both conditions to be met), the estimated proportion of green area is 0.2206 (22.06%).

Challenge of comparative analysis
To formally test for differences among the three treatments (control, Dominus, and steam) on weed control, the proportion of green area, represented by the percent of pixels where G > R and G > B, were compared among the three treatments. From each microplot, the estimated proportion was 0.2206 (i.e., 22.06%), 0.1952, and 0.1436 in the control group (Group 0); 0.0843, 0.0679, and 0.1898 in the Dominus group (Group 1); and 0.0008, 0.0003, and 0.0007 in the steam group (Group 2). The nine data points, three per treatment, are shown in the left In the Dominus group, the observed proportions of green area were slightly more variable than in the other two groups, and it could be due to random error within-microplot and between-microplot (e.g., Dominus low vapor pressure and limited movement in the soil). To test for statistical significance at α = 0.05, the Dunnett-Tukey-Kramer (DTK) test was used [12]. The DTK test is a popular hypothesis testing method which performs multiple pairwise comparisons with unequal variance assumption. As shown in the right panel of Fig 3, the difference in the two proportions was estimated as -0.072 with 95% CI (-0.334, 0.189) for comparing Group 1 to 0; -0.185 with 95% CI (-0.319, -0.052) for comparing Group 2 to Group 0; and -0.113 with 95% CI for (-0.338, 0.112). At α = 0.05, the difference between steam and control was statistically significant, but the other pairwise differences were not statistically significant. Though the effect of steam seems different from the effect of Dominus according to the boxplot, the statistical power was low due to a small number of microplots (three per group). The DTK test, which treated the entire data as a sample of size nine (three per group), did not utilize all information contained in the RGB data for the hypothesis testing. In agricultural experiments, increasing sample size is not always an option due to cost, space, time, and logistics. Given the small sample size, it is important to use statistical methods with enhanced power in order to overcome the practical challenges.

Statistical methods
For the DTK test discussed in the previous section, an estimated proportion of green area (e.g., 0.2206) is treated as a single data point. In other words, the DTK test ignores individual pixel values for each green color obtained from the image analysis program, and it treats the entire data as a sample of size three per treatment. In this section, in order to utilize detail information contained in the RGB data, a bootstrap method and a Poisson model are considered for pairwise comparisons. The bootstrap method estimates and compares the proportion of green area, and a Poisson model estimates and compares the expected count of a small green spot in the case of a rare occurrence of weed. In a later section, the operating characteristics of the bootstrap method and the Poisson model (e.g., Type I error rate and statistical power) are compared to the operating characteristics of the DTK test by simulation studies.
The non-numeric values (RGB codes) have been transformed to the proportion of green area which is a numeric value between zero and one. The general structure of the data per treatment group is demonstrated in Table 1. In the table, let R i , G i , and B i denote the decimal code of red, green, and blue in the i th row. Let W i denote the weight of the i th row such that P n i¼1 W i ¼ 1. For example, in a previous section, W 42 = 19151/3422208 = 0.0056 was calculated. According to the RGB decimal code, Z i = 1 if G i > R i and G i > B i and Z i = 0 otherwise. Then the proportion of green area in the microplot is estimated by the weighted sum

Bootstrap confidence intervals
Let H 0 denote the null hypothesis, the proportion of green area is the same across all groups (i.e., no effect of steam and Dominus on weed control). Let H 1 denote the alternative hypothesis. Let θ 0 , θ 1 , and θ 2 denote the proportion of green area in Group 0, Group 1, and Group 2, respectively. For the bootstrap method, the null hypothesis is H 0 : θ 0 = θ 1 = θ 2 , and the alternative hypothesis is H 1 : To construct a confidence interval (CI) for θ 0 , θ 1 , θ 2 , or any parameter in terms of θ 0 , θ 1 , and θ 2 , a bootstrap method is considered [13] [14]. A bootstrap sample refers to a sample of size n drawn from the original sample of size n with replacement. Let W � i and Z � i denote the i th weight and the binary outcome, respectively, in a bootstrap sample for i = 1,. . ., n. For a bootstrap sample, and each W � i can be normalized such that can be calculated from a bootstrap sample. By generating bootstrap samples a large number of times, CIs can be constructed for θ 0 , θ 1 , θ 2 , and any combination (e.g., θ 1 /θ 0 ).
There are several kinds of bootstrap CI, and the method of bias-corrected and accelerated (BCa) bootstrap CI is considered [15]. It is known to be a robust method particularly when the sampling distribution is unknown or difficult to be represented mathematically [16]. Since the sampling distribution of S is not easily characterized, the BCa method would be an appropriate bootstrap method in our case. The computation can be done by the "boot" package in R version 3.3.3 [17] [18]. As a side note, other estimators for θ j (j = 0, 1, 2) could be considered than the weighted sum S. For instance, a microplot image could be divided up into multiple sections, calculate S for each section, then average S's. Since the bootstrap method is fairly robust, the bootstrap method could be utilized with an alternative estimator for θ j .
For comparing three groups, three pairwise comparisons are needed, namely θ 1 versus θ 0 , θ 2 versus θ 0 , and θ 2 versus θ 1 . For constructing 95% CIs for the multiple comparisons, the individual confidence level can be adjusted at 1 − α/m = 0.983 by using the Bonferroni's correction. In other words, a 98.3% CI is constructed for each pairwise comparison, and H 1 is concluded if at least one of three CIs excludes a null value.
The null hypothesis can be formulated as H 0 : μ jk = 1 for all j 6 ¼ k, where μ jk = θ j /θ k , the relative proportion of green area when group j is compared to group k. The parameter μ jk can be estimated by S j /S k , where S j and S k are estimated proportions of green area in groups j and k, respectively.
The method of statistical hypothesis testing described in this section can be summarized as (1) obtaining bootstrap distributions of S 0 , S 1 , and S 2 , (2) calculating BCa bootstrap CIs for μ 10 , μ 20 , and μ 21 with the Bonferroni's correction ("boot" package in R), and (3) drawing a conclusion based on resulting CIs. This method can be generalized for comparing two or more groups.

Poisson model for rare green spots
When the proportion of green area is very low (i.e., rare weed spots), the bootstrap method may not be appropriate because most bootstrap samples will estimate a zero proportion. For example, if H 0 is true with θ 0 = θ 1 = θ 2 = 0.01, a point mass at zero will be observed in the distribution of S j . If all θ j 's are close to zero, a Poisson model can be an alternative approach. Instead of comparing the proportion of green area, the expected count of green spots can be compared. Under the Poisson model, θ j is interpreted as the expected count per given experimental area. For hypothesis testing, the likelihood ratio test (LRT) is considered with the "glm" function in R. The LRT calculates the log-likelihood value under the null hypothesis H 0 , denoted by l 0 , and the log-likelihood value under the alternative hypothesis H 1 , denoted by l 1 . The LRT compares 2(l 1 − l 0 ) to the chi-square distribution with m − 1 degrees of freedom to calculate the p-value, where m is the number of groups (e.g., m = 3 in our case). If the p-value is less than α = 0.05, H 0 is rejected.

Simulation designs
For simulations, 23 scenarios were considered, 6 scenarios under H 0 and 17 scenarios under H 1 . In each treatment group, n binary variables (Z i = 1 or Z i = 0) were simulated with some true values of θ 0 , θ 1 , and θ 2 for groups 0, 1, and 2, respectively. See the left columns of Tables 2  and 3 for the true values of θ 0 , θ 1 , and θ 2 in each scenario. Here Z i = 1 represents a spot with a green color, and n weights were simulated by a beta distribution. To create a scenario close to the observed data, a beta distribution was fitted to the actual data (parameters were estimated by the methods of moment), n weights were generated from the beta distribution, then the weights were normalized such that P n i¼1 W i ¼ 1. The sample size was fixed at n = 765 per group as in our actual data (255 data points per microplot × 3 microplots). Each scenario was repeated 1,000 times to estimate the probability of rejecting H 0 in each scenario. Two thousand bootstrap samples were used to construct bootstrap CIs.

Simulation results
The simulation results are reported in the right columns of Tables 2 and 3. Focusing on Scenarios 1 to 6 when H 0 was true, the bootstrap method rejected H 0 with a probability close to α = 0.05, except for Scenario 6. In Scenario 6 (θ 0 = θ 1 = θ 2 = 0.01), the bootstrap method frequently resulted in a numerical error because Z i = 1 occurred too rarely to calculate a bootstrap CI, and NA is reported in Table 2. Similarly, the DTK test could not be done for some simulated data because there was no variation in the estimated green proportions (i.e., all zeros), and NA is reported in the table as well. In such a case, the Poisson model resulted in the Type I error rate close to α = 0.05. The Poisson model showed lower type I error rates than α = 0.05 when θ j 's were relatively high. The DTK test consistently showed conservative results for all values of θ j 's.
In Scenarios 7 to 15 of the form H 1 : θ j 6 ¼ θ k for all j 6 ¼ k, the Poisson model consistently showed highest statistical power among the three hypothesis testing methods. The DTK test showed substantially lower statistical power than the bootstrap method and the Poisson model. In Scenarios 16 to 23 of the form H 1 : θ 1 6 ¼ θ 2 and θ 2 = θ 3 , the Poisson approach still resulted in a higher statistical power than the bootstrap approach, and the DTK was not powerful. When Scenarios 13 and 20 are compared, resulting powers were generally lower in Scenario 20 because the values of θ's were closer. Similar results were found when Scenarios 9 and 18 are compared and when Scenarios 14 and 21 are compared. See Table 3 for these results.
Regarding the Poisson approach giving a power equal to one in several settings but not in all settings, it would be reasonable to compare Scenarios 8 versus 9 (Table 3) and relate to the null scenarios ( Table 2). As shown in Table 2, the Poisson approach resulted in a very low Type I error rate when θ's were close to 0.5 (i.e., too conservative), and it resulted in a Type I error rate closer to the fixed significance level α = 0.05 as θ's were closer to zero. The power in Scenario 8 (θ's were close to 0.5) was substantially lower than the power in Scenario 9 (θ's were closer to zero). In addition, although θ's were evenly spaced by 0.05 in both Scenarios 8 and 9, the effect size was greater in Scenario 9 in terms of relative expected counts (i.e., a greater magnitude of θ j /θ k in Scenario 9).
When simulations from Scenario 1 to 23 were replicated 10 times (1,000 times per replication per scenario), similar results were obtained. The replicated simulation results are included in the S5 File.

Application
From the actual data described in a previous section, μ 01 , μ 02 , and μ 12 were estimated as 1.57, 263.7, and 168.3, respectively, using the bootstrap method. When Group 0 (control) was compared to Group 1 (Dominus), the estimated proportion of green area was 1.6 times greater. When Group 0 was compared to Group 2 (steam), it was 264 times greater. When Group 1 was compared to Group 2, it was 168 times greater. The resulting bootstrap CIs were (1.13, 2.29), (137.6, 726.0), and (90.4, 451.2), respectively. At significance level α = 0.05, all three pairwise groups were significantly different. This conclusion is different than the conclusion from the aforementioned DTK test. In particular, steam is substantially more effective than both control and Dominus, and Dominus is slightly more effective compared to control. Using the Poisson model with LRT, the resulting p-value was nearly zero, and the respective CIs for μ jk were (1.17, 1.97), (7.78, 29.18), and (5.07, 19.43) which are different than the CIs using the bootstrap method. Recall that θ j is interpreted as the proportion of green area in the bootstrap method, and it is interpreted as the expected count of green spots in the Poisson model. Fig 4 graphically presents the resulting CIs from the DTK method (left), the bootstrap method (middle), and the Poisson model (right). To avoid the extremely large magnitude of the CIs in the figure, the resulting CIs for μ jk were log-transformed, so the null value became zero after the log-transformation in the bootstrap method and the Poisson model.
In the data collection, there was an additional experimental group, a combination of steam and Dominus (referred to as Group 3). In three microplots of Group 3 (steam + Dominus), the observed green areas were 0.0004, 0.0006, and 0.0004. In the three microplots of Group 2 (steam only), recall that the observed green areas were 0.0008, 0.0003, and 0.0007. Since these are very low proportions (i.e., W i is close to zero for Z i = 1), the Poisson model would be more appropriate to compare these two experimental groups. μ 23 was estimated as 0.82 with a 95% CI (0.41, 1.67). As shown in S1 Fig, the effect of steam was nearly 100% effective by itself, so it would make sense that Dominus would not improve upon the effect of steam. It appears that the estimated μ 23 is accurate under the Poisson model.
All supporting information including data and R code are available in S1 Fig, S1, S2 and S3 Files, and S1 Code. Woebbecke et al. (1995) reported that the 2G-R-B index (known as the excessive green index or EGI) or the modified hue was sensitive for separation of plants and soil [7], so the similar idea was applied for separation of weeds and soil in this study. Although they used a modified version of the EGI, Golzarian et al. (2012) reported that the EGI is computationally simple and effective when the contrast between the target (weed) and the background (soil) is high [8]. Among various indices for defining "green," the simple definition "G > R and G > B" was used in this study, and the effect of steam was statistically significant based on the bootstrap approach and the Poisson approach. These approaches also detected the small effect of Dominus based on three microplots per group. These results reflect that an appropriate statistical method can leverage the sensitivity of simple color indices. Our supplementary experiment showed that even smart phones are highly reliable for estimating the proportion of green area with an intraclass correlation of 0.9 when three randomly chosen smartphones were tested [19] [20] [21]. (The data of supplementary experiment are available in S4 File.) Therefore, the simple color indices, together with the statistical methods demonstrated in this article, can be an alternative strategy for examining the effects of weed control and other similar types of comparative analysis, particularly when a large sample size is not available. Most of the aforementioned studies focused on the technical advantages of various color indices with a less of an emphasis on statistical methodology to detect the effect of a treatment.

Discussion
There are many statistical methods available for comparing two or more treatments such as ANOVA with Tukey's range test, Dunnett-Tukey-Kramer (DTK) test, Kruskal-Wallis test, and Friedman test [22], [12] [23] [24] [25] [26]. If one of these statistical methods was considered, the RGB data would be treated as a sample of size three per treatment, and the hypothesis testing would suffer from low statistical power. For instance, the DTK test is a popular hypothesis testing method in agricultural studies to test for difference in means under the normality assumption with unequal variances [12]. With such a small sample size per group, it is difficult to check the normality assumption, and even if the assumption is true, there should be low statistical power for comparing treatment effects. To this end, two alternative statistical methods, bootstrap method and Poisson model, were considered for the RGB data. The simulation results demonstrated that the bootstrap method and the Poisson model provide higher statistical power while preserving the type I error rate at α = 0.05. In the applied example, the DTK only detected the difference between steam and control, but the bootstrap method and the Poisson model detected the difference among steam, Dominus, and control. In addition, unlike the non-parametric tests (e.g., Kruskal-Wallis and Friedman), the bootstrap method and the Poisson model can address the research objective with interpretable parameters. The bootstrap approach can estimate the relative proportion of weed area between two treatments, and the Poisson approach can estimate the relative average count of weed per given area between two treatments when the occurrence is rare. After determining the outperformance of these alternative methods when compared to the DTK test via the simulation studies, they were applied to the RGB data to compare the three treatment groups on weed control.
In the simulation study presented in this article, a number of assumptions were simplified, and more investigations are needed under weaker assumptions. In the weed control experiment presented in this article, all microplots were located close to each other (about one meter apart between two neighboring microplots), and they were completely randomized to one of the treatments. Because it was an outdoor experiment, there could be additional random error among the microplots due to environmental factors and other unknown factors. For instance, the weed species composition was not the same in all plots which may have contributed to variation. If these factors are significant, the Poisson model approach described in this manuscript can lead to an inflated Type I error rate in over-dispersed data. According to the simulation study, the Poisson approach outperforms the bootstrap approach in terms of both Type I error and power. In a case of severe over-dispersion, the Type I error rate should be controlled, and a quasi-Poisson model or a generalized linear mixed model may be considered alternatively. Comparing to these alternative approaches, the relative performance of the bootstrap approach remains unknown, and it is a potential topic for future research. In the simulation study, the Bonferroni's correction was used for multiple comparisons, and an alternative correction method could be used. Benjamini and Hochberg (1995) discussed another procedure of multiple comparisons based on ranking p-values [27]. The choice of a correction method was not a main focus in this article, but it is an important topic in agricultural studies which involve many comparisons. The procedure of ranking p-values also controls the false discovery rate (i.e., Type I error rate), and it may be more powerful particularly when the number of comparisons is very large. Despite the fact that the estimated green proportions were generally lower in the Dominus group than in the control group, the observed variance was greater in the Dominus group. If it was the case for the count data (i.e., lower expected counts with higher variance), the data could not conform the mean-variance relationship of the Poisson model. In this case, the zero-inflated Negative-Binomial distribution would be a more flexible approach than the Poisson distribution, and this comparison was not made in the simulation study. Future research should focus on the impact of the simplified assumptions when the truth is more complex, and any negative consequences (e.g., an inflation of Type I error, a loss of statistical power) should be addressed by more sophisticated statistical methods than the methods discussed in this article.

Conclusion
The bootstrap method and the Poisson model (particularly for rare green spots of weed occurrence) result in substantially higher statistical power than the DTK test. The bootstrap method and the Poisson model are more powerful than the DTK test because they utilize detail information (pixels) obtained from each microplot, whereas the DTK test treats an observed green proportion in a microplot as a single data point. Using the bootstrap method and the Poisson model, it was evident that weed control with steam is effective for at least 31 days, and weed control with steam is significantly more effective than weed control with Dominus. The powerful statistical methods described here are recommended particularly when the sample size is small. The combination of the RGB analysis and the statistical methods can be useful for various purposes such as interim evaluations (e.g., monitoring and quantifying color changes until a final harvest) and final evaluations in a large field (e.g., areas where it is practically impossible to count weeds).