High-throughput phenotyping analysis of maize at the seedling stage using end-to-end segmentation network

Image processing technologies are available for high-throughput acquisition and analysis of phenotypes for crop populations, which is of great significance for crop growth monitoring, evaluation of seedling condition, and cultivation management. However, existing methods rely on empirical segmentation thresholds, which can result in insufficient accuracy of the extracted phenotypes. Taking maize as an example crop, we propose a phenotype extraction approach from top-view images at the seedling stage. An end-to-end segmentation network, named PlantU-net, which uses a small amount of training data, was explored to realize automatic segmentation of top-view images of a maize population at the seedling stage. Morphological and color-related phenotypes were automatically extracted, including maize shoot coverage, circumscribed radius, aspect ratio, and plant azimuth plane angle. The results show that the approach can segment the shoots at the seedling stage from top-view images obtained from either a UAV or a tractor-based high-throughput phenotyping platform. The average segmentation accuracy, recall rate, and F1 score are 0.96, 0.98, and 0.97, respectively. The extracted phenotypes, including maize shoot coverage, circumscribed radius, aspect ratio, and plant azimuth plane angle, are highly correlated with manual measurements (R2 = 0.96-0.99). This approach requires less training data and thus has better expansibility. It provides practical means for high-throughput phenotyping analysis of early growth stage crop populations.


Datasets with annotated images are necessary for robust image segmentation models. In practice, this dataset was constructed using the top-view images obtained at two periods, V3 and V6 (Figure 1a). Because the soil background accounts for a large proportion of the raw images, the images were cropped around the area containing the plants and scaled to 256 × 256 pixels for training the model. A total of 192 images containing seedling maize shoots were annotated using LabelMe software. Of these, 128 images were expanded into 512 images by mirror symmetry, translation, and rotation and used as the training set. The remaining 64 labeled images formed a validation set used to decide when to stop network training: to prevent overfitting, the network was trained until the loss on the validation set stabilized. The model designed in this study is a small-sample learning model, and data augmentation was adopted to ensure the quality of the training set, as discussed later. The testing set contains 200 images randomly selected from the four maize subpopulations described in the experiment in Figure 1a. Images of 50 hybrids from each subpopulation were randomly selected (the SS subpopulation consisted of only 32 hybrids, so the test set contains 18 duplicated hybrid images from the SS subpopulation).
PlantU-net Segmentation Network

To accurately segment maize shoots at the seedling stage in field conditions from top-view images, the shoots were segmented as the foreground and output as a binary image. However, top-view images of field maize are relatively complex, with stochastic backgrounds and uneven light conditions. Consequently, existing models do not extract pixel features satisfactorily. To address this issue, we built the PlantU-net segmentation network by adjusting the model structure and key functions of U-net [37], which improves the segmentation accuracy of images taken in a complex environment.

Model Structure

PlantU-net is a network designed for the segmentation of top-view images of crops grown in the field. A fully convolutional network is adopted to extract hierarchical features via an "end-to-end" process. As shown in Figure 2, the feature contraction path is composed of three downsampling modules; each module uses a 3 × 3 convolution to extract features and a 2 × 2 pooling operation to reduce the spatial dimensionality. Two convolution operations are conducted after downsampling to adjust the input size of the expansion path. Corresponding to the contraction path, the expansion path includes three upsampling modules. In each upsampling module, a 2 × 2 upsampling convolution is first performed to expand the spatial dimension. The upsampled results are then fused with the low-level feature maps from the corresponding contraction path to connect contextual information across adjacent levels. Two convolution operations are performed during upsampling to reduce the feature dimension and facilitate feature fusion. After upsampling, a 1 × 1 convolution serves as the fully connected layer that outputs the segmented image. Same padding is used during the convolution operations, which simplifies the computation. The parameters used for each layer of the model are shown in Table 1.
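To make the architecture concrete, the following is a minimal sketch of a PlantU-net-like network in Keras (the framework named later in the text). The filter counts (64/128/256/512), dropout rate, and Leaky ReLU slope are assumptions, as the excerpt defers layer parameters to Table 1:

```python
# A minimal sketch of a PlantU-net-like architecture, assuming hypothetical
# filter counts and a 0.5 dropout rate not specified in the excerpt.
from tensorflow.keras import layers, Model, Input

def conv_block(x, filters):
    # Two 3x3 convolutions with "same" padding and Leaky ReLU, per the text.
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.LeakyReLU(0.01)(x)
    return x

def plantunet(input_shape=(256, 256, 1)):
    inputs = Input(input_shape)
    skips, x = [], inputs
    # Contraction path: three downsampling modules (conv + 2x2 max pooling).
    for f in (64, 128, 256):
        x = conv_block(x, f)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, 512)       # bottleneck convolutions
    x = layers.Dropout(0.5)(x)   # dropout against overfitting (small dataset)
    # Expansion path: three upsampling modules with skip connections.
    for f, skip in zip((256, 128, 64), reversed(skips)):
        x = layers.Conv2DTranspose(f, 2, strides=2, padding="same")(x)
        x = layers.concatenate([x, skip])  # fuse low-level feature maps
        x = conv_block(x, f)
    # 1x1 convolution with sigmoid output for the binary foreground mask.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(inputs, outputs)

model = plantunet()
model.compile(optimizer="adam", loss="binary_crossentropy")
```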

Table 1. Configuration of the model structure parameters. Refer to Figure 3 for the architecture of the PlantU-net network.

To a certain extent, the network parameters of the model are reduced to ease the computational burden and shorten the training time while maintaining the segmentation performance. Since the number of training samples is small, a dropout layer is added to prevent overfitting. In addition, to identify and utilize edge features, a maximum pooling layer is adopted for downsampling.

Main Functions
Activation Function

The activation function in deep learning introduces nonlinear factors to solve problems that are not linearly separable. In PlantU-net, Leaky ReLU is used as the activation function. It still produces an output when the input is negative, which eliminates the neuron inactivation problem in backpropagation. Its expression is:

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \quad (1)$$

where $\alpha$ is a small positive constant. For the final output layer of the model, Sigmoid is used as the activation function for binary classification. Sigmoid maps a real number to the interval (0, 1) and is therefore applicable to binary classification. Its expression is:

$$\sigma(x) = \frac{1}{1 + e^{-x}} \quad (2)$$

Loss Function

The loss function of the U-net model is replaced by the binary cross-entropy function in the PlantU-net model. Binary cross-entropy is the cross-entropy for two-class classification, a special case of the cross-entropy function. Binary classification is a logistic regression problem, so the loss function of logistic regression can also be applied. Because the last layer uses a sigmoid output, this function is selected as the loss function. The mathematical expression of the binary cross-entropy function is:

$$L = -\left[\, y \log y' + (1 - y)\log(1 - y') \,\right] \quad (3)$$

where $y$ is the true value and $y'$ is the estimate.
For a foreground pixel (y = 1), the output of this loss function is smaller when the estimated value is closer to 1 and larger when it is closer to 0. This behavior is suitable for the binary classification output of the last layer of this network.
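As a minimal numerical illustration (not the authors' code), the three expressions above can be written and checked in Python:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Eq. (1): passes positive inputs through, scales negatives by alpha.
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    # Eq. (2): maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def binary_cross_entropy(y, y_hat, eps=1e-7):
    # Eq. (3), averaged over pixels; eps guards against log(0).
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# For a foreground pixel (y = 1) the loss shrinks as the estimate nears 1:
print(binary_cross_entropy(np.array([1.0]), np.array([0.9])))  # ~0.105
print(binary_cross_entropy(np.array([1.0]), np.array([0.1])))  # ~2.303
```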

Model Training
PlantU-net was trained using the Keras framework (Figure 1) with GPU acceleration (NVIDIA Quadro P6000). Five hundred and twelve images were used to train the model. Because the model is trained on a small number of samples, data augmentation is key to giving the network the required invariance and robustness. For top-view images of maize shoots, PlantU-net must be robust to changes in plant morphology and in gray-image values. Adding random elastic deformations of the training samples is key to training segmentation networks with few labeled images. Therefore, during the data reading phase, PlantU-net applies random displacement vectors on a 3 × 3 grid to generate a smooth deformation, where the displacements are drawn from a Gaussian distribution with a standard deviation of 10 pixels. Because the number of training samples is small, a dropout layer is added to prevent the network from overfitting. Through these "data augmentation" methods, model performance is improved and overfitting is avoided. The batch size was 1, the initial learning rate was 0.0001, and Adam was used as the optimizer for fast convergence. PlantU-net was trained until the model converged (the training loss plateaued and remained nearly unchanged).
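A sketch of the elastic-deformation augmentation described above is given below; the interpolation order and boundary mode are assumptions not specified in the text:

```python
# Elastic deformation from random displacements on a coarse 3x3 grid
# (sigma = 10 px), interpolated to a smooth dense field. Cubic grid
# interpolation and reflect padding are assumptions.
import numpy as np
from scipy.ndimage import map_coordinates, zoom

def elastic_deform(image, grid=3, sigma=10.0, rng=None):
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    # Random displacement vectors on the coarse grid, std = 10 pixels.
    dx = rng.normal(0.0, sigma, (grid, grid))
    dy = rng.normal(0.0, sigma, (grid, grid))
    # Interpolate the coarse grid to a smooth per-pixel displacement field.
    dx = zoom(dx, (h / grid, w / grid), order=3)
    dy = zoom(dy, (h / grid, w / grid), order=3)
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.array([ys + dy, xs + dx])
    # Apply the same `coords` to the label mask so image and mask stay aligned.
    return map_coordinates(image, coords, order=1, mode="reflect")
```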

Evaluation of segmentation accuracy
Because the segmentation of top-view images of maize shoots by the PlantU-net model is a binary classification problem, the segmentation results can be evaluated by pixel-level comparison of the predicted output with the ground truth (GT). If a pixel belonging to the leaves is marked as 1 and the corresponding pixel in the segmented image is also 1, it is a true positive (TP); if that pixel is judged as 0 after segmentation, it is a false negative (FN). Similarly, a pixel that does not belong to a maize leaf is marked 0; if such a pixel is judged as 1 after segmentation, it is a false positive (FP), and if it is judged as 0, it is a true negative (TN). Following these rules, four indicators [39] were used in this study:

(1) Precision. Precision represents the proportion of true positive samples among those predicted to be positive and is defined as:

$$Precision = \frac{TP}{TP + FP}$$

(2) Recall. Recall indicates how many of the positive samples in the total sample are correctly predicted and is defined as:

$$Recall = \frac{TP}{TP + FN}$$

(3) F1-Score. After calculating precision and recall, the F1-Score can be calculated; it represents the weighted harmonic mean of precision and recall and is used as a standardized measure. It is defined as:

$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$

(4) DICE. Several metrics are commonly used to evaluate the segmentation results.
Here, $R_{seg}$ denotes the predicted segmentation and $R_{gt}$ the manually segmented ground truth. DICE ($\in [0, 1]$) is then defined as:

$$DICE = \frac{2\,|R_{seg} \cap R_{gt}|}{|R_{seg}| + |R_{gt}|}$$

DICE represents the ratio of the overlapping area between the segmentation result and the ground truth to the total area; a value of 1 indicates perfect segmentation.
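All four indicators can be computed from a pair of binary masks; a minimal sketch follows (note that for binary masks the DICE value coincides numerically with the F1 score):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    # pred, gt: binary masks (1 = maize shoot, 0 = background).
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    dice = 2 * tp / (pred.sum() + gt.sum())  # equals f1 for binary masks
    return precision, recall, f1, dice
```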

The phenotypic traits concerning the shape and color characteristics of each shoot were estimated from the top-view images of the maize shoots segmented by PlantU-net. The segmented images may still contain multiple maize plants. The phenotypic parameter extraction process therefore starts with edge detection on the segmentation results, followed by connected-component labeling based on the edge detection results, and finally single-plant phenotypic parameter extraction based on these connected components.
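A minimal sketch of this pipeline using OpenCV is shown below; the file name and Canny thresholds are hypothetical, and labeling the binary mask directly is a simplification of the described sequence:

```python
# Sketch of the per-plant extraction pipeline: edge detection, then
# connected-component labeling, then single-plant trait extraction.
import cv2
import numpy as np

mask = cv2.imread("segmented.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
binary = (mask > 127).astype(np.uint8)
edges = cv2.Canny(binary * 255, 100, 200)     # edge detection on the mask
n, labels = cv2.connectedComponents(binary)   # connected-component labeling
for i in range(1, n):                         # component 0 is the background
    plant = (labels == i).astype(np.uint8)
    # ... single-plant phenotypic parameters are extracted from `plant`
```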

Morphological feature extraction

The description of morphological features can be divided into two categories. The first category is the outline-based shape description, which focuses on describing the outline of the target area. The other category is the area-based shape description, which describes the target by area, geometric moment, eccentricity, and region shape. In this study, the center point

where f(x, y) represents the binary map, m is the maximum number of pixels in the x-axis direction, and k is the maximum number of pixels in the y-axis direction. In the binary maps, pixels of the target plant are labeled 1 and background pixels are labeled 0; therefore, the pixel-counting method was used, meaning that pixels with f(x, y) = 1 were counted. Calibration objects were placed in the original images of the dataset. Because the images were cropped to 256 × 256 pixels, the length and width of the cropped image can be calculated from the calibration objects. The ground area of each pixel was computed from the length and width of the image, and the size of the maize plant in the image was obtained by multiplying this per-pixel area by the total number of pixels in the segmented target area.

(4) Studies have shown that the expanded leaves of maize shoots are distributed along a vertical plane, the plant azimuth plane [40,41]. The original images for this study were oriented eastward during data acquisition; thus, the left side of each image in the dataset indicates north. In Figure 3f, the blue line indicates a single maize plant after segmentation and shows a north-south orientation. A red line was fitted by clustering in the leaf section (or as a tangent to it if the clustering result is a curve) as the plant azimuth plane. The angle β between the red line and the blue line was calculated and used as the azimuth plane angle of the plant. The extraction of the specific morphological features is shown in Figure 3.

This approach was primarily used because the HSV model is similar to the color perception of the human eye and can reduce the effect of light intensity changes on color discrimination. Therefore, the color phenotypes in this study are represented using the mean of the RGB or HSV values.
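The following sketch illustrates how the morphological and color traits described above could be computed with OpenCV; the per-pixel area is assumed to be known from the calibration objects, and the north-south reference is taken as the image x-axis per the statement that the left side of the image indicates north:

```python
# Sketch of single-plant trait extraction, assuming a known per-pixel
# ground area derived from the calibration objects (hypothetical input).
import cv2
import numpy as np

def morphological_traits(plant, pixel_area):
    # plant: binary mask of one plant (1 = shoot); pixel_area: area per pixel.
    ys, xs = np.nonzero(plant)
    pts = np.column_stack([xs, ys]).astype(np.float32)
    area = len(pts) * pixel_area                       # pixel-count method
    (_cx, _cy), radius = cv2.minEnclosingCircle(pts)   # circumscribed radius
    _x, _y, w, h = cv2.boundingRect(pts)
    aspect_ratio = w / h
    # Fit a line through the shoot pixels as the plant azimuth plane; beta is
    # its angle against the north-south axis (assumed here to be the image
    # x-axis, since the left side of the image indicates north).
    vx, vy, _, _ = cv2.fitLine(pts, cv2.DIST_L2, 0, 0.01, 0.01).ravel()
    beta = abs(np.degrees(np.arctan2(vy, vx))) % 180
    return area, radius, aspect_ratio, beta

def mean_color(image_bgr, plant):
    # Mean RGB and HSV over the masked shoot region, as used for color traits.
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    m = plant.astype(bool)
    return image_bgr[m].mean(axis=0), hsv[m].mean(axis=0)
```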
Statistical analysis

The phenotypic traits extracted from the segmentation results were compared with manually measured values. The circumscribed radius, aspect ratio, and plant azimuth plane were measured manually from the segmentation results, and maize shoot coverage was compared between the PlantU-net segmentation and manual segmentation. The coefficient of determination (R²) and the normalized root-mean-square error (NRMSE) were calculated to assess the accuracy of the extracted parameters:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - y_i')^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

$$NRMSE = \frac{1}{\bar{y}}\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - y_i')^2}$$

where n is the number of objects, $y_i$ is the result of manual segmentation, $y_i'$ is the value extracted by PlantU-net, and $\bar{y}$ is the mean of the manual segmentation results.
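A minimal sketch of these two statistics, assuming NRMSE is normalized by the mean measured value (consistent with the variables defined above):

```python
import numpy as np

def r2_nrmse(y, y_hat):
    # y: manual measurements; y_hat: PlantU-net-extracted values.
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    nrmse = rmse / y.mean()  # normalization by the mean measured value
    return r2, nrmse
```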

In the phenotypic analysis of the four subpopulations, the phenotypic trait data extracted from the test set were analyzed. The extracted phenotypic trait data were recorded in Excel, read with a Python program, and visualized as box plots using the Matplotlib library.

The PlantU-net segmentation network was trained many times. During the

Using the PlantU-net model and the phenotype extraction method, the coverage, circumscribed radius, aspect ratio, and plant azimuth plane were determined on the validation dataset, and the measured data were compared with the extracted results for verification (Figure 6). The coefficients of determination (R²) between the manual segmentation results and the automatically extracted values of the four morphological phenotypic parameters were all greater than 0.96, and the NRMSE values were all less than 10%, indicating the reliability of the PlantU-net segmentation model and the phenotypic extraction method.

To evaluate the performance of the PlantU-net model in image segmentation and phenotypic parameter extraction for a maize population, top-view images of maize seedlings obtained by both the field high-throughput phenotyping platform and a UAV were used as inputs. Figure 7 shows the segmentation results and a schematic diagram of the phenotypic parameter extraction of the PlantU-net model applied to two sample plots.

Phenotypic parameters were extracted from the segmentation results of the two sample plots using the above methods. The mean value and standard deviation of various morphological parameters of the same maize cultivar are shown in Table 3. The mean value can be used to quantify the growth potential of different maize cultivars in the same growth period, while the standard deviation can be used to evaluate the consistency of plant growth within the same cultivar. This method can therefore support quantitative evaluation of plant growth potential, allowing phenotypic analysis of top-view images of a maize population at the seedling stage obtained by multiple high-throughput phenotyping platforms in the field.

Phenotypic parameters were also extracted from the images of the test set, and four phenotypic parameters, including coverage, plant azimuth plane angle, aspect ratio, and circumscribed radius, were statistically analyzed by subgroup. Figure 8 shows the results of the phenotypic parameter analysis based on the image segmentation results of the test set. Among the four subgroups, the angle between the plant azimuth plane and due north showed no statistical differences (Figure 8b), while the other three phenotypic parameters all differed among subgroups. For these three parameters, the extracted values of the SS and NSS subgroups were similar, consistent with both groups of cultivars being temperate. The TST subgroup includes tropical and subtropical cultivars, so its extracted parameters differ from those of the SS and NSS subgroups. The differences of the Mixed subgroup are relatively distinct: the coverage analysis (Figure 8a) shows that the coverage of the Mixed subgroup in the test set is low, whereas the circumscribed radius (Figure 8d) is higher than that of the SS and NSS subgroups. This indicates that the leaves of the Mixed subgroup are more slender, resulting in low plant coverage and high leaf extension during the same growth period.

In terms of color phenotypes, RGB and HSV traits were extracted from the top-view image of each plant. Because the segmented mask region is composed of many pixels, the mean color of the pixels in the region was taken as the color phenotypic parameter of the plant. Based on the RGB and HSV color information, the color traits of maize plants in the different subgroups were analyzed (Figure 9). According to the analysis of RGB values, there was no obvious difference among the subgroups. In the analysis based on HSV color information, cultivars within the TST and NSS subgroups did not show evident color differences; however, the color difference between the TST and NSS subgroups was clear (the H and S of the cultivars in the NSS subgroup were higher than those in the TST subgroup). Approximately one third of the cultivars in both the SS subgroup and the Mixed subgroup differed from the other cultivars in their subgroup (both H and S were higher).

The above results indicate that the PlantU-net model and the phenotypic trait extraction method can be used to quantitatively analyze morphological and color phenotypic trait differences among subgroups, making them suitable for genotype-phenotype correlation analysis.

At present, threshold segmentation is often used to segment top-view images of field crops. Although threshold segmentation with specific constraints can achieve very similar segmentation results [39,43], it is sensitive to noise, and target segmentation is not ideal when gray-scale differences are small. Threshold segmentation in different application scenarios (such as light and soil background) is also heavily dependent on the selection of an empirical threshold. Manually setting different thresholds greatly increases the interactive workload of the segmentation process, making high throughput difficult to achieve when processing large quantities of data [44,46]. In comparison, the PlantU-net network model designed in this study can not only implement end-to-end segmentation of maize shoots at the seedling stage (Figure 5), but also extract phenotypic parameters with a high correlation with measured data (R² > 0.96). The results show that the PlantU-net method can replace manual measurement and threshold segmentation for the quantitative extraction and evaluation of phenotypic traits.

The location and direction of each maize plant remain relatively unchanged, and the method overcomes the problem of plants overlapping each other when viewed from above. Therefore, the plant growth and azimuth plane angle information extracted from the top-view image of a maize population can provide measured data driving 3-D modeling of a maize population [47] and light distribution calculation and analysis [48] in the later growth stages. The technology and equipment of high-throughput phenotyping platforms [51], including UAV [49], vehicle-based [50], and track-type platforms, are developing rapidly, allowing the collection of phenotypic data throughout the whole growth period. PlantU-net can also be applied to phenotypically analyze top-view images of a crop population obtained by multiple phenotyping platforms and can address problems such as continuous monitoring of plant selection, analysis of growth differences between plots, and analysis of growth consistency within the same treatment. These data would provide practical technical means for field crop breeding and cultivation research [52].

This study showed the applicability of the PlantU-net model for the extraction of phenotypic parameters at the seedling stage of maize. However, because of the extensive cross-shading in top-view images caused by the overlapping leaves of different plants, this model cannot solve the problem of phenotypic extraction in the middle and late stages of maize growth and development. Future work should determine how to use top-view continuity and the edge detection ability of the PlantU-net model to achieve phenotypic extraction of plants in the middle and late growth stages.

Conclusions
In this study, an end-to-end segmentation method named PlantU-net was proposed based on a fully convolutional network, which improved the high-throughput segmentation performance for top-view images of a seedling population and realized accurate extraction of phenotypic data. The PlantU-net model had an average segmentation precision of 0.96 for aerial images of maize plants at the seedling stage, and the phenotypic parameters extracted from the segmentation results were highly correlated with manually measured values (R² = 0.96-0.99). The model described in this manuscript is helpful for the segmentation of top-view images of maize shoots, the extraction of phenotypes, and the quantitative evaluation of phenotypic traits obtained by high-throughput UAV and ground phenotyping platforms.