Fig 1.
Samples from the public PaddyDoctor dataset spanning common diseases and varieties.
All image tiles in this figure are directly composed from PaddyDoctor images [18] without any third-party sources. The montage layout and annotations were generated by the authors using Python scripts.
Fig 2.
Calibration- and uncertainty-aware multitask network.
Schematic of the proposed approach: frozen MobileNetV2 embeddings feed a two-head MLP with MC-Dropout, multi-objective optimization (MO–BBBC), and conditional temperature scaling. The diagram was drawn manually by the authors and does not reuse any third-party graphical material.
Fig 3.
Learning dynamics for both heads: accuracy and loss per epoch (train/validation).
(a) Disease accuracy (train/val) per epoch. (b) Variety accuracy (train/val) per epoch. (c) Disease loss (train/val) per epoch. (d) Variety loss (train/val) per epoch.
Fig 4.
Reliability diagrams complement Table 2.
(a) Disease reliability diagram (test). (b) Variety reliability diagram (test).
Fig 5.
Micro-averaged ROC and precision–recall curves for disease and variety heads on the test set.
(a) Disease ROC. (b) Disease PR. (c) Variety ROC. (d) Variety PR.
Fig 6.
Confusion matrices on the test set (rows = ground truth, columns = prediction; colour intensity increases with the number of samples).
(a) Disease head. (b) Variety head.
Fig 7.
(Top) Pareto fronts along key axes; the asterisk marks the knee solution.
(Bottom) Error–energy fronts for search strategies under a matched budget (20-epoch candidates). (a) Error vs energy. (b) Error vs size. (c) Energy vs calibration. (d) MO_BBBC. (e) Random. (f) TPE_lite.
Fig 8.
NSGA2_lite front under the matched budget.
Table 1.
Headline test metrics (conditional temperature calibration where beneficial).
Table 2.
Calibration on the test set, before/after conditional temperature scaling (lower is better).
Table 3.
Knee genome (rounded) and objectives at the Pareto knee. Objective values are reported in the robustly normalized space used for knee selection (see Eq (6)).
Table 4.
Runtime/resource summary (lower is better for latency/energy).
Table 5.
Ablation of curriculum strategies (test set; uncalibrated).
Table 6.
Seed stability (mean ± std; reduced budget as described).
Table 7.
AUROC for ID vs noise-like OOD separation using uncertainty scores.
Fig 9.
Runtime and uncertainty analyses.
(a) Latency by device. (b) PE: ID vs OOD. The BALD histogram is analogous (not shown).
Fig 10.
Headline radar plot comparing the disease and variety heads across Accuracy, Macro-F1, micro-AUC, micro-AP, and AECE.
For readability, the AECE axis is inverted (larger is better after inversion). Values are from the calibrated state (Tables 1 and 2): disease acc 0.906, macro-F1 0.902, micro-AUC 0.994, micro-AP 0.961, AECE 0.0138; variety acc 0.979, macro-F1 0.907, micro-AUC 0.999, micro-AP 0.994, AECE 0.0138.
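The captions for Tables 1 and 2 refer to conditional temperature scaling applied "where beneficial". A minimal sketch of one way such a rule could work is given below. It assumes that "beneficial" means a lower validation negative log-likelihood than the uncalibrated head, and that the temperature is chosen by a simple grid search. Both are illustrative assumptions rather than the authors' stated procedure, and the function names `fit_temperature` and `conditional_temperature` are hypothetical.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def nll(logits, labels, T=1.0):
    """Mean negative log-likelihood of integer labels under softmax(logits / T)."""
    p = softmax(logits / T)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimizes validation NLL.

    The grid range and resolution are assumptions made for this sketch.
    """
    if grid is None:
        grid = np.linspace(0.5, 5.0, 91)
    losses = [nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])


def conditional_temperature(val_logits, val_labels):
    """Return the fitted temperature only if it lowers validation NLL.

    Otherwise return 1.0, which leaves the head uncalibrated (assumed
    reading of "where beneficial").
    """
    T = fit_temperature(val_logits, val_labels)
    if nll(val_logits, val_labels, T) < nll(val_logits, val_labels, 1.0):
        return T
    return 1.0
```

In a two-head setting, the rule would be applied to each head separately on held-out validation logits, and the resulting temperature would then divide that head's test logits before the softmax.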