High-throughput image-based plant stand count estimation using convolutional neural networks

The landscape of farming and plant breeding is rapidly transforming due to the complex requirements of our world. The explosion of collectible data has started a revolution in agriculture to the point where innovation must occur. To a commercial organization, the accurate and efficient collection of information is necessary to ensure that optimal decisions are made at key points of the breeding cycle. In particular, recent technology has enabled organizations to capture in-field images of crops to record color, shape, chemical properties, and disease susceptibility. However, this new challenge necessitates advanced algorithms to accurately identify phenotypic traits. This work advances the current literature by developing an innovative deep learning algorithm, named DeepStand, for image-based counting of corn stands at early phenological stages. The proposed method adopts a truncated VGG-16 network to act as a feature extractor backbone. We then combine multiple feature maps with different dimensions to ensure the network is robust against size variation. Our extensive computational experiments demonstrate that our DeepStand framework accurately identifies corn stands and outperforms other cutting-edge methods.


Introduction
The "phenotyping bottleneck" refers to the phenomenon that the agricultural community is not able to accurately and efficiently collect data on the physical properties of crops (Furbank and Tester, 2011). This high-throughput phenotyping (HTP) bottleneck is oftentimes due to resource limitations such as labor, time, and domain expertise in pathology and genetics, even for large-scale farming operations and commercial breeding programs. Certain phenotypic traits, such as early stage stand counting for corn (Zea mays L.), can only be accurately measured during the crop's early growth stage. If this data collection time-window is missed, then the task is nearly impossible to complete (McWilliams et al., 1999).
Email address: skhaki@iastate.edu (Saeed Khaki). Preprint submitted to ArXiv October 26, 2020. arXiv:2010.12552v1 [cs.CV] 23 Oct 2020

Understanding the data collection challenges facing modern agriculture, agronomists and researchers have turned to solutions that combine modern machine learning and imaging analytics (Yang et al., 2020).
Image-capturing devices such as unmanned aerial vehicles, high-definition cameras, and even cell phone cameras are being used to collect information to be analyzed either live or at a later time (Mogili and Deepak, 2018; Kulbacki et al., 2018). With these new technologies, organizations and farmers are no longer bound by the data collection time-window and now have the ability to manually analyze images at a later time. However, with this new approach come additional problems, such as storing massive amounts of image-based data and a familiar but new challenge: analyzing massive reserves of images accurately and efficiently. To advance modern agriculture, tools must be made easily available to agronomists to enable real-time decision making. The information contained within these images allows for timely management decisions to optimize yield against harmful attack vectors (pests, diseases, drought, etc.).
To analyze large reserves of images quickly, modern deep learning tools have been applied to various crops. Recent literature has seen the combination of plant phenotyping and traditional machine learning techniques to count crops, detect color, and classify stress in various crops through images (Singh et al., 2016; Naik et al., 2017; Yuan et al., 2018; Guo et al., 2018). These recent works demonstrate the impact traditional machine learning has on the future of agriculture. However, these methods are not without limitations.
Using traditional approaches oftentimes requires high quality images, constant lighting conditions, and fixed camera distances. These limitations act as a barrier to true HTP. With advances in state-of-the-art deep learning techniques, these constraints are no longer binding. Robust models can be constructed to analyze crops under numerous variable conditions. This is seen in the current literature combining deep learning and HTP.
Image-based phenotyping with deep learning can broadly be labeled as an application area of computer vision. Traditional tasks include classifying single images, counting objects, and detecting objects. Common deep learning frameworks using the AlexNet, LeNet, and ResNet-50 architectures have been applied to classify various fruits and vegetables from single images (Mohanty et al., 2016; Cruz et al., 2017; Wang et al., 2017). Other deep learning models, using VGG-16 as a feature extractor or the "You Only Look Once" model, have been used to count and detect leaves, sorghum heads, and corn kernels (Giuffrida et al., 2018; Ghosal et al., 2019; Mosley et al., 2020; Khaki et al., 2020c; Redmon et al., 2016). Using a novel framework to count corn tassels, Lu et al. (2017) combined convolutional neural networks (CNNs) and local count regression into a framework they call TasselNet. Additionally, open-access, high-quality, annotated datasets are being created and released to the public to engage researchers in applying their deep learning knowledge to agriculture (Zheng et al., 2019; Sudars et al., 2020; Haug and Ostermann, 2014). These recent works showcase the potential of combining modern deep learning and agriculture in hopes of mitigating the so-called "phenotyping bottleneck". For the curious reader interested in a clear, concise, and thorough review of image-based HTP, we point to the survey paper by Jiang et al. (2020).
Corn is known to be one of the world's most essential crops due to the number of products it can create (e.g., flour, bio-fuels) (Berardi et al., 2019). Additionally, a large percentage of corn is used in livestock farming to feed pigs and cattle. The world's reliance on corn cannot be overstated. Aside from the manufacturing aspect, corn has a large impact on the United States' economy. In 2019, it was estimated that the U.S. corn market contributed approximately $140 billion to the U.S. economy. It is evident that the agricultural community and the world must act to ensure the continued optimal production of corn.
By 2050, the world's population is estimated to reach approximately 9 billion (Stephenson et al., 2010). Given this increase in population and the fixed amount of arable land, changes will need to occur so that we can continue to optimize corn yield while utilizing fewer resources. Previous studies have invoked deep learning based approaches to predict corn yield using genetics, environment, and satellite imagery, but these studies are not considered HTP on commercial corn and only act as a way to estimate yield during the growing season (Khaki et al., 2019a; Russello, 2018; Khaki et al., 2019b; Khaki and Wang, 2019; Khaki et al., 2020a).
The ultimate goal of this paper is to count the number of corn stands in an image of a specific area of the field taken during the early phenological stages (VE to V6) (Ciampitti et al.). Roughly, these phenological stages refer to the visible leaves on the stem. For instance, VE (emergence) is the first phenological stage, where the stem breaks through the soil. V1 is the appearance of the first leaf, V2 is the appearance of the second leaf, and VN is the appearance of the n-th leaf. From a practical perspective, an estimated stand count allows for the establishment of a planting rate and, ultimately, yield potential. If the proper rate is planted, then farmers and breeders can estimate yield based on product by population. However, if the population is not there, i.e. poor germination or a bad planter, then farmers can identify the issue and replant, or at the very least establish what the new yield will be. Knowing that the planting density is below its desired threshold enables farmers to decide how they want to best manage their corn to make up for the difference in planting rate (e.g. more fertilizer, more aggressive pesticide control, etc.). Traditionally, farmers perform stand count estimations manually. However, this process is time consuming, labor intensive, and prone to human error. Because of this, there is a reluctance to perform a stand count, ultimately leaving farmers at a net loss for corn yield. Utilizing an image-based approach to this problem allows for the timely estimation of stand counts as well as a consistent measure of data quality.
Due to the need for efficient and effective HTP, in this paper we present a deep learning based approach, named DeepStand, to alleviate the concerns of manual, labor intensive stand counting. The proposed method adopts a truncated VGG-16 network as a backbone feature extractor and merges multiple feature maps with different scales to make the network robust against scale variations. This approach is similar to common methods in crowd counting, where models are used to detect individual people in large crowds. Due to the similarity of these problems, we utilize a point density based approach for detecting the corn stands.

Methodology
Image-based corn stand counting is challenging compared to counting tasks in other fields due to multiple factors, including occlusions, scale variations, and the small distance between corn stands. Figure 1 shows corn stands at different growing stages. This paper proposes a deep learning based method, DeepStand, to count the number of corn stands in a 180-degree image taken 4-6 feet above the ground. It is worth mentioning that as corn progresses through its phenological stages, accurately counting the planting density becomes a difficult task for a computer due to the amount of overlapping leaves. However, from a pragmatic perspective, stand counting should be performed before V4 to ensure agronomists can act in a timely manner to mitigate any crop issues.

Network Architecture
Corn stand images usually include high scale variations and corn stands occluded by nearby corn leaves. As a result, the proposed counting method should be robust against these factors. Our proposed stand counting method is inspired by methods proposed for counting tasks in other fields, such as crowd counting (Gao et al., 2020a) and dense object counting (Gao et al., 2020b).
The architecture of the proposed network is outlined in Figure 2. The proposed network generates a density map given an image of corn stands, where the integral over the density map gives the total number of corn stands. Our proposed method is a CNN-based density estimation method. We do not use other methods such as detection-based (Li et al., 2008; Zeng and Ma, 2010) or regression-based (Chan et al., 2008; Idrees et al., 2013; Wang et al., 2015) methods for the following main reasons. Detection-based approaches usually apply an object detection method such as faster R-CNN (Ren et al., 2015), SSD (Liu et al., 2016), or a detector via a sliding window (Khaki et al., 2020b) on an image to first detect the objects and then count them. However, these approaches may not work well when applied to scenes with occlusion and dense objects. Moreover, training these methods requires a considerable amount of annotated images, which is not publicly available for the task of corn stand counting. Regression-based approaches directly map an image patch to the count. These approaches deal with the problems of occlusion and background clutter successfully; however, they ignore the spatial information in the input image. As such, these approaches do not know how much each region of the image contributes to the final count (Gao et al., 2020b). We use a truncated VGG-16 (Simonyan and Zisserman, 2014) as a backbone for feature extraction in our proposed network. The truncated VGG-16 is composed of convolutional layers with a fixed kernel size of 3 × 3 that extract discriminative features from the input image for further analysis by the network. We use the VGG-16 backbone in our network mainly because of its good generalization ability to other vision tasks such as counting and object detection (Liu et al., 2016; Gao et al., 2020b; Liu et al., 2019; Li et al., 2018). The truncated VGG-16 network includes all layers of the VGG-16 network except the last max-pooling layer and all fully connected layers. The truncated VGG-16 shrinks the input image's resolution to 1/8 of its original size. We increase the input size of the truncated VGG-16 network from 224 × 224 to 300 × 300 in our proposed network to learn more fine-grained features and patterns from the input image and further improve the accuracy of our proposed method (Tan and Le, 2019).
The proposed network merges feature maps from multiple scales of the network to make it robust against scale variations in images. Similar scale-adaptive architectures have been used in other vision studies (Zhang et al., 2018; Ronneberger et al., 2015; Bai et al., 2019). To concatenate feature maps with different spatial resolutions, we use zero padding to enlarge the smaller feature maps to the size of the largest feature map.
We then use three deconvolutional layers (transposed convolutions) (Dumoulin and Visin, 2016) with a stride of 2 to up-sample the output of the network to the size of the original input image.
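The overall design described above can be sketched in PyTorch. This is a minimal illustration, not the paper's exact network: the filter counts are placeholders (Figure 2 gives the real ones), the larger feature maps are pooled down to a common size before concatenation (whereas the paper zero-pads the smaller maps up), and a 304 × 304 input is used so that three stride-2 down/up-samplings divide evenly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepStandSketch(nn.Module):
    """Sketch of the DeepStand design: a truncated VGG-16-style backbone,
    concatenation of feature maps from multiple depths, and three stride-2
    deconvolutions that restore the input resolution."""

    def __init__(self):
        super().__init__()
        # Three VGG-style stages, each halving spatial resolution (total 1/8).
        self.stage1 = self._vgg_block(3, 64)
        self.stage2 = self._vgg_block(64, 128)
        self.stage3 = self._vgg_block(128, 256)
        # 1x1 conv fuses the concatenated multi-scale features.
        self.fuse = nn.Conv2d(64 + 128 + 256, 128, kernel_size=1)
        # Three stride-2 deconvolutions up-sample 1/8 -> full resolution.
        self.up = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    @staticmethod
    def _vgg_block(cin, cout):
        return nn.Sequential(
            nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
            nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        f1 = self.stage1(x)   # 1/2 resolution
        f2 = self.stage2(f1)  # 1/4 resolution
        f3 = self.stage3(f2)  # 1/8 resolution
        # Bring all maps to a shared spatial size before concatenation.
        f1s = F.adaptive_max_pool2d(f1, f3.shape[2:])
        f2s = F.adaptive_max_pool2d(f2, f3.shape[2:])
        fused = torch.relu(self.fuse(torch.cat([f1s, f2s, f3], dim=1)))
        return self.up(fused)  # predicted density map, full input size

model = DeepStandSketch()
density = model(torch.zeros(1, 3, 304, 304))
```

Summing the returned density map over its spatial dimensions yields the predicted stand count for the image.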
Finally, we post-process the predicted density map to draw a bounding box around each corn stand. The post-processing includes the following steps: (1) threshold the estimated density map to zero out regions where the density values are insignificant, (2) find peak coordinates on the density map as the center locations of corn stands, (3) draw a bounding box around each corn stand, and (4) apply non-maximum suppression to remove overlapping bounding boxes. This post-processing has a very low computational cost and does not increase the inference time.
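The four steps above can be sketched with NumPy and SciPy as follows. The threshold, box size, and IoU cutoff are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def density_to_boxes(density, thresh=0.05, box=20, iou_max=0.3):
    """Turn a predicted density map into bounding boxes (x1, y1, x2, y2)."""
    # (1) zero out insignificant density values
    d = np.where(density >= thresh, density, 0.0)
    # (2) local maxima of the density map = stand center locations
    peaks = (d == maximum_filter(d, size=box)) & (d > 0)
    ys, xs = np.nonzero(peaks)
    scores = d[ys, xs]
    # (3) a fixed-size box around each center
    half = box // 2
    boxes = np.stack([xs - half, ys - half, xs + half, ys + half], axis=1)
    # (4) greedy non-maximum suppression on overlapping boxes
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (2 * box * box - inter)  # boxes share area box*box
        order = order[1:][iou <= iou_max]
    return boxes[keep]

# Two well-separated density peaks should give two boxes.
demo = np.zeros((100, 100))
demo[30, 30], demo[70, 70] = 1.0, 1.0
found = density_to_boxes(demo)
```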

Loss Function
Let I_i, D_i, F(I_i, Θ), Θ, and N denote the ith image, the ith ground truth density map, the predicted density map of the ith image, the network parameters, and the number of input images, respectively. As such, the network loss can be defined as below:

L(Θ) = (1/2N) Σ_{i=1}^{N} ||F(I_i, Θ) − D_i||_2^2    (1)

Euclidean loss measures the estimation error at the pixel level and has been used in other crowd counting studies (Boominathan et al., 2016; Gao et al., 2020b; Lian et al., 2019).
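For concreteness, the Euclidean loss of Equation (1) amounts to a few lines of NumPy (a minimal sketch; in training this would operate on the network's output tensors):

```python
import numpy as np

def euclidean_loss(pred_maps, gt_maps):
    """Equation (1): the sum over N images of the squared L2 distance
    between predicted and ground-truth density maps, divided by 2N."""
    pred_maps = np.asarray(pred_maps, dtype=float)
    gt_maps = np.asarray(gt_maps, dtype=float)
    n = pred_maps.shape[0]
    return float(((pred_maps - gt_maps) ** 2).sum() / (2 * n))

# Two 2x2 maps, each off by 1 everywhere: loss = 8 / (2 * 2) = 2.0
loss = euclidean_loss(np.ones((2, 2, 2)), np.zeros((2, 2, 2)))
```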

Experiments and Results
This section first introduces the dataset used in our study, data augmentation, evaluation metrics, and training procedure. Then, we report the results of our proposed method along with those of other competing methods.

Ground Truth Density Map Generation
We generate ground truth density maps following the procedure of density map generation in Boominathan et al. (2016) to train the network parameters. A corn stand located at pixel x_i can be represented by a delta function δ(x − x_i). As such, the ground truth output for an image with M annotated corn stands can be defined as follows:

H(x) = Σ_{i=1}^{M} δ(x − x_i)

Then, H(x) is convolved with a Gaussian kernel with a standard deviation σ to generate the density map, where the standard deviation can be defined based on the average distance of the k nearest neighboring annotations. The summation over the density map is equal to the number of corn stands present in the image.
The use of such density maps as ground truth can help CNNs learn from the spatial information in images.
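This procedure can be sketched as follows. A fixed σ is used here for brevity, whereas the paper derives σ from the k-nearest-neighbor annotation distances.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_density_map(shape, points, sigma=4.0):
    """Ground-truth density map: delta spikes at annotated stand centers,
    smoothed by a Gaussian kernel of standard deviation sigma."""
    h_map = np.zeros(shape, dtype=np.float64)  # H(x): sum of delta functions
    for y, x in points:
        h_map[y, x] = 1.0
    # gaussian_filter uses a normalized kernel, so the integral of the
    # density map still equals the number of annotated stands.
    return gaussian_filter(h_map, sigma)

# Three annotated stands -> density map summing to 3.
density = make_density_map((128, 128), [(40, 40), (80, 90), (100, 20)])
```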

Stand Count Data
This section presents the procedure to prepare sufficient data to train and evaluate our proposed method.
Our original dataset includes 394 images of corn stands at growing stages V1 to V6 with a fixed size of 1024 × 768, taken 4-6 feet above the ground, comprising a total of 6,154 stands across all images.
Table 1 shows the summary statistics of the dataset. We randomly selected 20% of the images (80 images) as test data and used the rest as training data (314 images). We augment the training dataset to generate sufficient data to train our proposed method. To make the proposed method robust against scale variations, we construct a multi-scale pyramidal representation of each training image following the work of Boominathan et al. (2016). The multi-scale pyramidal representation of images includes scales of 0.4 to 1.3, incremented in steps of 0.1, times the original image resolution. Then, patches of size 300 × 300 are cropped at random locations, followed by random flipping and the addition of Gaussian noise.
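The augmentation pipeline can be sketched as follows. The image-resizing step that builds each pyramid level is omitted, and the noise level and crop helper are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def pyramid_scales():
    """Scales 0.4, 0.5, ..., 1.3 of the original image resolution."""
    return [round(0.4 + 0.1 * i, 1) for i in range(10)]

def random_patch(image, rng, size=300):
    """One augmented sample: random size x size crop, random horizontal
    flip, and additive Gaussian noise."""
    h, w = image.shape[:2]
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    patch = image[y:y + size, x:x + size].copy()
    if rng.random() < 0.5:
        patch = patch[:, ::-1]          # horizontal flip
    patch = patch + rng.normal(0.0, 0.01, patch.shape)  # Gaussian noise
    return patch

rng = np.random.default_rng(0)
image = np.zeros((768, 1024))           # one original-resolution image
patch = random_patch(image, rng)
scales = pyramid_scales()
```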

Evaluation Metrics
We use standard evaluation metrics, mean absolute error (MAE) and root mean squared error (RMSE), to measure the counting performance of the proposed method. Generally, MAE indicates the accuracy of the results and RMSE measures the robustness. These two metrics are defined as follows:

MAE = (1/N) Σ_{i=1}^{N} |C_i^{pred} − C_i^{GT}|

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (C_i^{pred} − C_i^{GT})^2 )

where N is the number of test images, and C_i^{pred} and C_i^{GT} are the estimated count and the ground truth count corresponding to the ith test image.
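The two metrics translate directly to code:

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between predicted and ground-truth counts."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.mean(np.abs(pred - gt)))

def rmse(pred, gt):
    """Root mean squared error between predicted and ground-truth counts."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

# Example: predictions [10, 20] vs ground truth [12, 16]
example_mae = mae([10, 20], [12, 16])    # (2 + 4) / 2 = 3.0
example_rmse = rmse([10, 20], [12, 16])  # sqrt((4 + 16) / 2)
```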

Training Hyperparameters
The DeepStand network is trained end-to-end from scratch. The network parameters are initialized with Xavier initialization (Glorot and Bengio, 2010). The Adam optimizer (Kingma and Ba, 2014) with a learning rate of 3e-4 and a mini-batch size of 24 is used to minimize the loss function defined in Equation (1). The learning rate is gradually decayed to 2.5e-5 during the training process. The network is trained for 80,000 iterations on 93,258 image patches generated following the data augmentation procedure in Section 3.1.2.

Comparison with State-of-the-art
To evaluate the efficiency of our proposed method, we compare it with five state-of-the-art models. These five models were originally proposed for the crowd counting problem but are also applicable to other object counting problems. They are as follows: CSRNet: proposed by Li et al. (2018), uses a fully convolutional architecture that includes a truncated VGG-16 network backbone as the front-end feature extractor and a set of dilated convolutional layers as the back-end to estimate the density map. DeepCrowd: proposed by Wang et al. (2015), is a regression-based method that directly learns a mapping from image patches to the count. The DeepCrowd network includes five convolutional layers followed by two fully connected layers.

Results
This section reports the evaluation results and compares our proposed method with other state-of-the-art methods for the task of corn stand counting. After training all methods, we evaluated their performance on the hold-out test data, which includes 80 images of corn stands from growing stages V1 to V6. Table 2 illustrates that our proposed stand counting method outperforms the other methods to varying extents. MSCNN has performance comparable to CSRNet, and both perform better than the other methods except our proposed method. The main reason for the good performance of CSRNet is its use of dilated convolutional layers to aggregate multi-scale contextual information. The good performance of MSCNN can also be attributed to its use of multi-scale features, which make it robust against scale variation. The DeepCrowd method, as a regression-based method, outperformed the SaCNN and CrowdNet methods. The proposed method performed better than the other methods for two main reasons: (1) our proposed method merges feature maps from multiple scales of the network to cope with scale variation, and (2) the use of deconvolutional layers for up-sampling the feature maps increases the quality of the predicted density maps.
Even though the counting performances of CSRNet, MSCNN, DeepCrowd, and SaCNN are good, these methods are not satisfactory when the whole image is fed to them at once. As a result, an input image must be cropped into non-overlapping patches and fed to these methods patch by patch, which increases their inference time. Our proposed method and CrowdNet take the whole image and predict the density map in a single forward pass, which makes them considerably faster than the other methods.
Figure 3 visualizes some stand counting results of our proposed method, including the original images, predicted density maps, and detected corn stands.

Figure 1 :
Figure 1: The images show corn stands at vegetative stages V1 to V6. The images include high scale variations, occlusions, and small distances between corn stands, especially at stages V4 to V6.

Figure 2 :
Figure 2: The outline of the DeepStand architecture. The parameters of the convolutional and deconvolutional layers are denoted as 'Conv-(kernel size)-(number of filters)' and 'Deconv-(kernel size)-(number of filters)', respectively. All layers have a stride of 1 except for the layers with the "S2" notation, which have a stride of 2. The padding type is 'same' except for the first deconvolutional layer, for which we use 'valid' padding. The ⓒ symbol denotes matrix concatenation.

SaCNN:
proposed by Zhang et al. (2018), employs a scale-adaptive CNN to cope with scale and perspective changes in images. SaCNN uses a backbone similar to the VGG-16 architecture for feature extraction and merges feature maps from different layers of the network to make the model robust against scale variation. MSCNN: proposed by Zeng et al. (2017), extracts scale-relevant features using a multi-scale CNN. The MSCNN network consists of multiple Inception-like (Szegedy et al., 2015) modules for multi-scale feature extraction. CrowdNet: proposed by Boominathan et al. (2016), uses a multi-column network architecture. The network consists of a deep CNN and a shallow CNN to predict the density map. The deep CNN has a VGG-like architecture, and the shallow CNN includes three convolutional layers.

Figure 3 :
Figure 3: Visual results of our proposed stand counting method. The first, second, and third rows show, respectively, the input images, estimated density maps, and detected corn stands. The figure is best viewed in color.

Table 1 :
The summary statistics of the corn stand dataset. Min, Max, Avg, and Total denote the minimum, maximum, average, and total number of corn stands in the dataset, respectively.

Table 2 illustrates the stand counting performances of the proposed and comparison methods on the test data with respect to the MAE and RMSE evaluation metrics.

Table 2 :
The counting performances of the proposed and comparison methods on the test data.
As shown in Table 3, the proposed method has a consistently low error across all growing stages, which indicates its robustness. The results also indicate that the highest counting error of all methods except MSCNN belongs to stages V5-V6, due to occlusion and background clutter.

Table 3 :
The counting performances of the proposed and comparison methods on the test data across different growing stages.