
Fig 1.

Map of the locations in the USA and Canada included in our dataset.

The dataset comprises different maturity groups (MGs), some of which are labeled in the figure. The relative size of each yellow dot (representing a location) indicates the size of the dataset for that location. The dataset includes observations from the National Uniform Soybean Tests for the years 2003–2015 and is split into North (MG 0 to 4) and South (MG 4 to 8) regions [50, 51], consisting of 103,365 performance records over 13 years and 150 locations. These records are matched to weekly weather data for each location throughout the growing season (30 weeks). This generated a dataset with 35,000 plots having phenotype data for all agronomic traits.


Fig 2.

Stacked LSTM model.

The input feature vector is x<t> at time-step ‘t’. The vector x<t> is 9-dimensional when the MG and genotype cluster information are incorporated in the model and 7-dimensional otherwise; we included 7 weather variables in our study. The embedding vector a<Tx> encodes the entire input sequence and summarizes the sequential dependencies from time-step 0 to time-step Tx. We designed two variants of our proposed model based on the input information, with the time-series encoding part remaining the same for both. This model (when including MG and cluster, with Tx = 30) had 202,503 learnable parameters, and the training time per epoch was 18 seconds.
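A minimal PyTorch sketch of such a stacked-LSTM encoder. The layer width, depth, and regression head here are our assumptions for illustration; the caption fixes only the input dimension (9 with MG and cluster, 7 without) and the sequence length Tx = 30.

```python
import torch
import torch.nn as nn

class StackedLSTM(nn.Module):
    """Sketch of a stacked-LSTM yield model (sizes are assumptions)."""
    def __init__(self, input_dim=9, hidden_dim=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)  # regress yield

    def forward(self, x):          # x: (batch, Tx, input_dim)
        out, _ = self.lstm(x)      # out: (batch, Tx, hidden_dim)
        a_Tx = out[:, -1, :]       # embedding of the entire sequence
        return self.head(a_Tx)     # (batch, 1)

model = StackedLSTM()
x = torch.randn(4, 30, 9)          # 4 plots, 30 weekly time-steps
y_hat = model(x)
print(y_hat.shape)                 # torch.Size([4, 1])
```

Note that only the final hidden state a<Tx> reaches the regression head; all intermediate annotations are discarded.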


Fig 3.

Temporal attention model.

The LSTM encoding part is the same as that of the Stacked LSTM model, yielding an annotation a<t> for each time-step. Instead of using only a<Tx>, this model utilizes all annotations, which act as inputs to the temporal attention mechanism that computes a context vector. As before, two variants of this model are designed depending on the input information. This model (when including MG and cluster, with Tx = 30) had 202,632 learnable parameters, and the training time per epoch was 18 seconds.
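A hedged sketch of the attention step in PyTorch. The additive scoring function and layer sizes are assumptions; the caption specifies only that every annotation a<t> contributes, via attention weights, to a context vector that replaces a<Tx>.

```python
import torch
import torch.nn as nn

class TemporalAttentionLSTM(nn.Module):
    """Sketch: LSTM encoder + temporal attention (scoring form assumed)."""
    def __init__(self, input_dim=9, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)   # scores each annotation
        self.head = nn.Linear(hidden_dim, 1)    # regress yield

    def forward(self, x):                       # x: (batch, Tx, input_dim)
        a, _ = self.lstm(x)                     # annotations a<t>
        e = self.score(torch.tanh(a))           # (batch, Tx, 1)
        alpha = torch.softmax(e, dim=1)         # attention over time-steps
        context = (alpha * a).sum(dim=1)        # weighted context vector
        return self.head(context), alpha.squeeze(-1)

model = TemporalAttentionLSTM()
y_hat, alpha = model(torch.randn(4, 30, 9))
print(y_hat.shape)                              # torch.Size([4, 1])
```

Because the weights alpha are normalized over the 30 time-steps, they can be read off directly to see which weeks of the growing season the model attends to, as in Fig 5.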


Table 1.

Comparison of the performance of the two deep learning models (two variants of each, based on the input information) with SVR-RBF and LASSO, varying the input sequence length (Tx), using evaluation metrics computed on the test set.

Each model was trained three times to obtain the average and standard deviation of each evaluation metric.
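As an illustration of how such flattened-input baselines can be fit, here is a scikit-learn sketch on synthetic stand-in data. The data sizes and hyperparameters are assumptions; the design point is that SVR-RBF and LASSO receive the 30-week sequence flattened into one vector, discarding the temporal structure the LSTM models exploit.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 30 weekly steps x 7 weather variables,
# flattened, plus MG and genotype cluster (real dataset sizes differ).
X = rng.normal(size=(500, 30 * 7 + 2))
y = X @ rng.normal(size=X.shape[1]) * 0.1 + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)
for name, model in [("SVR-RBF", SVR(kernel="rbf")),
                    ("LASSO", Lasso(alpha=0.1))]:
    model.fit(X_tr, y_tr)
    print(name, round(r2_score(y_te, model.predict(X_te)), 3))
```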


Fig 4.

Results for different inputs to the temporal attention model.

The vertices of the triangle show results using only MG, only the genotype cluster, or only the weather variables as input. The edges show results for combinations of inputs from the respective vertices. The results improved when the genotype cluster was included with the weather variables, and the coefficient of determination increased further when MG was included with the weather variables. The best results were observed when information from all three sources was incorporated (shown at the center of the triangle).


Fig 5.

Distribution of attention weights for the entire input sequence (spanning the growing season).

Considering different ranges of actual yield, results are shown for two maturity groups (MG = 1, MG = 7) that represent starkly different geo-climatic regions (Fig 1). Early-season variables were observed to be comparatively less important for predicting the highest-yielding genotypes.
