Fig 1.
Distribution of maize yield (Mg/ha) across training years (2014–2021).
Fig 2.
Overview of the proposed G × E prediction pipeline.
Genomic markers are reduced to 548 principal components and used as genotype node features. Daily weather variables are encoded via LSTM into a 21-dimensional environment embedding. These embeddings form nodes in a GNN, followed by an MLP predictor for yield estimation.
Fig 3.
Architecture A: a fully connected bipartite graph between 548 genotype nodes (principal components) and 21 environment nodes (LSTM-derived features).
All genotype–environment pairs are connected via multi-head attention message passing.
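The connectivity described in this caption can be sketched as an edge list. This is a minimal illustration, not the authors' implementation: the node numbering (genotype nodes first, then environment nodes), the inclusion of both edge directions, and the function name `bipartite_edges` are assumptions made here.

```python
import numpy as np

N_GENO = 548  # genotype nodes (principal components), as stated in the caption
N_ENV = 21    # environment nodes (LSTM-derived features), as stated in the caption


def bipartite_edges(n_geno: int, n_env: int) -> np.ndarray:
    """Return a (2, 2 * n_geno * n_env) array of directed edges.

    Assumed convention: genotype nodes are 0..n_geno-1 and environment
    nodes are n_geno..n_geno+n_env-1. Every genotype-environment pair is
    connected in both directions; no intra-set edges are created.
    """
    g = np.repeat(np.arange(n_geno), n_env)          # each genotype, n_env times
    e = np.tile(np.arange(n_env), n_geno) + n_geno   # all environments per genotype
    forward = np.stack([g, e])    # genotype -> environment
    backward = np.stack([e, g])   # environment -> genotype
    return np.concatenate([forward, backward], axis=1)


edge_index = bipartite_edges(N_GENO, N_ENV)
```

Under these assumptions the graph has 548 × 21 = 11,508 genotype-environment pairs, each stored as two directed edges.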
Fig 4.
Architecture B: an extension of Architecture A with additional directed intra-set top-k similarity edges (k = 10) among genotype nodes and among environment nodes, while retaining all bipartite genotype–environment connections.
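One way to construct directed top-k similarity edges within a node set is sketched below. The caption does not state which similarity measure is used; cosine similarity, the neighbor-to-node edge direction, and the function name `topk_similarity_edges` are assumptions for illustration only.

```python
import numpy as np


def topk_similarity_edges(x: np.ndarray, k: int = 10) -> np.ndarray:
    """Return a (2, n * k) array of directed edges (source, target).

    For each node, an edge is drawn from each of its k most similar
    nodes (cosine similarity, an assumption) to the node itself.
    Self-loops are excluded.
    """
    n = x.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of nodes")
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)   # guard against zero vectors
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)           # a node is never its own neighbor
    # Indices of the k largest similarities in each row (order within the k is arbitrary).
    neighbors = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    src = neighbors.ravel()
    dst = np.repeat(np.arange(n), k)
    return np.stack([src, dst])
```

Applied separately to the genotype node features and to the environment node features, this yields the intra-set edges, which would then be concatenated with the bipartite edges of Architecture A.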
Fig 5.
Architecture C retains the message-passing structure of Architecture B and introduces a global supernode attention readout applied after K propagation layers.
The supernode attends to all genotype and environment embeddings to produce a compact graph-level representation used for prediction.
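A supernode attention readout of this kind can be sketched as a single query vector attending over all node embeddings. The caption does not specify the attention form; the single-head scaled dot-product variant below, the query `q` (which would be a learned parameter in practice), and the name `supernode_readout` are assumptions.

```python
import numpy as np


def supernode_readout(h: np.ndarray, q: np.ndarray):
    """Pool node embeddings h (n_nodes, d) into one vector of size d.

    q is the supernode query of size d (a learned parameter in a real
    model; supplied by the caller in this sketch). Returns the pooled
    graph-level vector and the attention weights over nodes.
    """
    scores = h @ q / np.sqrt(h.shape[1])   # scaled dot-product scores
    scores = scores - scores.max()         # numerical stability for softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()      # softmax over all nodes
    pooled = weights @ h                   # attention-weighted sum of embeddings
    return pooled, weights
```

Here `h` would hold the stacked genotype and environment embeddings after the final propagation layer, and the pooled vector would be passed to the MLP predictor.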
Table 1.
Final training configuration across architectures. Architecture A uses only bipartite edges (no intra-set top-k edges); B adds intra-set top-k edges; C keeps B’s message passing but replaces the readout with a single global supernode attention pooling step applied after message passing.
Fig 6.
Pearson correlation coefficient (PCC) on the validation set as a function of environmental embedding dimension m using Architecture A.
Each point represents the average PCC across genotype–environment pairs. Performance peaks at m = 21.
Table 2.
Test-set metrics across architectures.
Fig 7.
Predicted vs. actual yield on the test set for Architectures A, B, and C (left to right).
The red dashed line is the identity line (y = x); the green line is the fitted regression line. Each point corresponds to an individual genotype-site-year combination in the test set.
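The relationship between the two reference lines can be illustrated with an ordinary least-squares fit of predicted on actual yield. This is a sketch assuming a simple linear fit (the caption does not state how the regression line was obtained); the helper name `fit_line` and the numeric values are synthetic and are not results from the study.

```python
import numpy as np


def fit_line(actual, predicted):
    """Least-squares line: predicted ~ slope * actual + intercept."""
    slope, intercept = np.polyfit(
        np.asarray(actual, dtype=float),
        np.asarray(predicted, dtype=float),
        deg=1,
    )
    return float(slope), float(intercept)


# Synthetic values for illustration only (not data from the study).
actual = np.array([6.0, 8.0, 10.0, 12.0])
predicted = 0.5 * actual + 4.0
slope, intercept = fit_line(actual, predicted)
```

A fitted line that coincides with the identity has slope 1 and intercept 0; a slope below 1, as in the synthetic example, means the predictions span a narrower range than the observed values.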
Fig 8.
Training dynamics for A (top), B (middle), and C (bottom).
Each row shows training loss (left), RMSE (middle), and PCC (right) across epochs.
Fig 9.
Per-environment predictive correlation (PCC) versus the number of unique genotypes evaluated.
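The two quantities plotted here can be computed by grouping test records by environment. The sketch below assumes flat arrays of environment IDs, genotype IDs, observed yields, and predicted yields; these input names and the function `per_environment_pcc` are hypothetical.

```python
import numpy as np


def per_environment_pcc(env_ids, geno_ids, y_true, y_pred):
    """Map each environment to (PCC, number of unique genotypes).

    PCC is NaN when an environment has fewer than two records or when
    either the observed or predicted values are constant.
    """
    env_ids = np.asarray(env_ids)
    geno_ids = np.asarray(geno_ids)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    out = {}
    for env in np.unique(env_ids):
        mask = env_ids == env
        n_geno = len(np.unique(geno_ids[mask]))
        t, p = y_true[mask], y_pred[mask]
        if t.size < 2 or np.std(t) == 0 or np.std(p) == 0:
            pcc = float("nan")
        else:
            pcc = float(np.corrcoef(t, p)[0, 1])
        out[env.item()] = (pcc, n_geno)
    return out
```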
Fig 10.
PCA of weather features for test environments.
Each point represents one environment projected onto the first two principal components (PC1 and PC2). Colors indicate the mean PCC achieved by the proposed model in that environment.
Table 3.
Comparison of proposed architectures with top models from the Global G×E Prediction Competition using the same forward-time split (2014–2021 train, 2022 test).