LSTM-attention-guided graph neural networks for integrated genotype–environment modeling in maize yield prediction

doi:10.1371/journal.pcbi.1013729

Fig 1.

Distribution of maize yield (Mg/ha) across training years (2014–2021).

Fig 2.

Overview of the proposed G × E prediction pipeline.

Genomic markers are reduced to 548 principal components and used as genotype node features. Daily weather variables are encoded via LSTM into a 21-dimensional environment embedding. These embeddings form nodes in a GNN, followed by an MLP predictor for yield estimation.

Fig 3.

Architecture A: a fully connected bipartite graph between 548 genotype nodes (principal components) and 21 environment nodes (LSTM-derived features).

All genotype–environment pairs are connected via multi-head attention message passing.
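The connectivity described for Architecture A can be illustrated with a minimal sketch. The node indexing (genotype nodes first, environment nodes after) and the choice to store each pair as two directed edges are assumptions made for illustration; the function name `bipartite_edges` is hypothetical and does not come from the paper.

```python
from itertools import product


def bipartite_edges(n_genotype: int, n_environment: int) -> list[tuple[int, int]]:
    """Build a fully connected bipartite edge list.

    Assumed indexing: genotype nodes are 0..n_genotype-1 and
    environment nodes are n_genotype..n_genotype+n_environment-1.
    Each genotype-environment pair is stored in both directions
    (an assumption) so messages can flow either way.
    """
    edges = []
    for g, e in product(range(n_genotype), range(n_environment)):
        env = n_genotype + e
        edges.append((g, env))  # genotype -> environment
        edges.append((env, g))  # environment -> genotype
    return edges


edges = bipartite_edges(548, 21)
print(len(edges))  # 2 * 548 * 21 = 23016 directed edges
```

An attention-based message-passing layer would then compute one attention weight per edge in this list; that step is omitted here.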

Fig 4.

Architecture B: an extension of Architecture A with additional directed intra-set top-k similarity edges (k = 10) among genotype nodes and among environment nodes, while retaining all bipartite genotype–environment connections.
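A minimal sketch of how directed top-k similarity edges within one node set might be built. The caption does not state which similarity measure is used, so cosine similarity is an assumption here, as are the per-node feature vectors, the edge direction (neighbor to target), and the helper names `cosine` and `top_k_edges`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_edges(features, k):
    """For each node i, add a directed edge (j, i) from each of its k most similar nodes j.

    Self-loops are excluded; ties are broken by the lower node index.
    """
    edges = []
    n = len(features)
    for i in range(n):
        scored = [(cosine(features[i], features[j]), j) for j in range(n) if j != i]
        scored.sort(key=lambda t: (-t[0], t[1]))
        edges.extend((j, i) for _, j in scored[:k])
    return edges


# Toy example with four nodes and k = 1 (the caption uses k = 10).
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(top_k_edges(toy, k=1))
```

Because each node selects its own neighbors, the resulting edge set is directed and not necessarily symmetric, which is consistent with the caption's description of directed intra-set edges.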

Fig 5.

Architecture C retains the message-passing structure of Architecture B and introduces a global supernode attention readout applied after K propagation layers.

The supernode attends to all genotype and environment embeddings to produce a compact graph-level representation used for prediction.
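The supernode readout can be sketched as a single attention pooling step over all node embeddings. This is an illustrative single-head, scaled dot-product version; the query vector, the scaling, and the function name `supernode_readout` are assumptions, and the paper's actual parameterization may differ.

```python
import math


def supernode_readout(node_embeddings, query):
    """Pool node embeddings into one graph-level vector via softmax attention.

    node_embeddings: list of d-dimensional vectors (genotype and environment nodes together).
    query: d-dimensional supernode query (learned in practice; fixed here for illustration).
    Returns (pooled_vector, attention_weights).
    """
    d = len(query)
    scores = [
        sum(q * h for q, h in zip(query, emb)) / math.sqrt(d)
        for emb in node_embeddings
    ]
    # Numerically stable softmax over all nodes.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted sum of the node embeddings.
    pooled = [
        sum(w * emb[i] for w, emb in zip(weights, node_embeddings))
        for i in range(d)
    ]
    return pooled, weights


pooled, weights = supernode_readout([[1.0, 2.0], [3.0, 4.0]], query=[0.5, -0.5])
print(pooled, weights)
```

With an all-zero query the weights are uniform and the readout reduces to mean pooling, which makes a convenient sanity check.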

Table 1.

Final training configuration across architectures. Architecture A uses only bipartite edges (no intra-set top-k edges); B adds intra-set top-k edges; C keeps B’s message passing but replaces the readout with a single global supernode attention pooling step computed after message passing.

Fig 6.

Pearson correlation coefficient (PCC) on the validation set as a function of environmental embedding dimension m using Architecture A.

Each point represents the average PCC across genotype–environment pairs. Performance peaks at m = 21.
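For reference, the PCC between predicted and observed yields follows the standard definition, sketched below. The function name `pearson` and the toy values are illustrative only and are not data from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy values (hypothetical yields in Mg/ha), for illustration only.
observed = [8.0, 9.5, 10.0, 11.2]
predicted = [8.4, 9.1, 10.3, 10.9]
print(round(pearson(observed, predicted), 3))
```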

Table 2.

Test-set metrics across architectures.

Fig 7.

Predicted vs. actual yield on the test set for Architectures A, B, and C (left to right).

The red dashed line is the identity; the green line is the fitted regression. Each point in the plot refers to an individual genotype-site-year combination in the test set.

Fig 8.

Training dynamics for A (top), B (middle), and C (bottom).

Each row shows training loss (left), RMSE (middle), and PCC (right) across epochs.

Fig 9.

Per-environment predictive correlation (PCC) versus the number of unique genotypes evaluated.

Fig 10.

PCA of weather features for test environments.

Each point represents one environment projected onto the first two principal components (PC1 and PC2). Colors indicate the mean PCC achieved by the proposed model in that environment.

Table 3.

Comparison of proposed architectures with top models from the Global G×E Prediction Competition using the same forward-time split (2014–2021 train, 2022 test).
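The forward-time split named in the caption can be sketched as a filter on the trial year. The record layout (a dict with a `year` key) and the function name `forward_time_split` are hypothetical; only the year ranges come from the caption.

```python
def forward_time_split(records, train_years=range(2014, 2022), test_year=2022):
    """Split records so that the model never trains on the test year or later.

    records: iterable of dicts, each with an integer "year" field (assumed layout).
    Returns (train, test) lists; records outside both ranges are dropped.
    """
    train = [r for r in records if r["year"] in train_years]
    test = [r for r in records if r["year"] == test_year]
    return train, test


# Hypothetical records for illustration.
records = [{"year": 2014}, {"year": 2021}, {"year": 2022}, {"year": 2013}]
train, test = forward_time_split(records)
print(len(train), len(test))  # 2 1
```

Splitting by year rather than at random keeps every 2022 environment unseen during training, which is what makes the comparison with the competition models like-for-like.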
