Integrating biological and machine learning models for rainbow trout growth: Balancing accuracy and interpretability

doi:10.1371/journal.pone.0336890

Fig 1.

Methodological flowchart illustrating the analytical workflow.

Data from mark-recapture studies were merged with environmental covariates, split into training and test sets, and used to train several model classes: biological growth models (VBGM, Gompertz), tree-based machine learning models (Random Forest, XGBoost, LightGBM), and other ML approaches (Support Vector Regression, Artificial Neural Networks). Individual model predictions were combined using a stacked ensemble and Bayesian model averaging. Final model performance was evaluated on the held-out test set using RMSE, MAE, R², and information criteria; the stacked ensemble performed best (RMSE = 15.96 mm, R² = 0.9658).
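The held-out error metrics named in this caption can be sketched as follows. This is a minimal illustration, not the study's code: the function names and the toy fork lengths are assumptions introduced here, and the coefficient of determination is included on the assumption that it is the statistic reported alongside RMSE and MAE.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error, in the units of the response (mm here)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean absolute error, in the units of the response."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical fork lengths at recapture (mm); not data from the study.
observed = [300.0, 320.0, 350.0, 400.0]
predicted = [310.0, 315.0, 345.0, 410.0]

test_rmse = rmse(observed, predicted)
test_mae = mae(observed, predicted)
test_r2 = r_squared(observed, predicted)
```

RMSE penalizes large misses more heavily than MAE, so reporting both gives a rough sense of whether errors are dominated by a few poorly predicted fish.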

Table 1.

Descriptive statistics of selected numerical variables.

Fig 2.

Neural network architecture with three hidden layers (256, 128, 64 neurons), batch normalization, SiLU activations, dropout regularization, and residual connections.

The model maps 47 biological and environmental input features to a continuous prediction of fork length at recapture.

Table 2.

Comparison of model performance metrics (sorted by RMSE). Percent reduction is computed relative to the baseline von Bertalanffy growth model (VBGM). Abbreviations: VBGM = von Bertalanffy growth model; BMA = Bayesian Model Averaging; ANN = Artificial Neural Network; SVR = Support Vector Regression; RF = Random Forest; LGBM = Light Gradient Boosting Machine; XGB = XGBoost; RMSE = root mean squared error.
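One plausible reading of the percent reduction, assuming it refers to RMSE and is measured against the baseline VBGM, is given below. The subscript m (an arbitrary model in the table) is notation introduced here, not taken from the source.

```latex
\[
\text{Reduction}_m\,(\%) \;=\; 100 \times
\frac{\mathrm{RMSE}_{\mathrm{VBGM}} - \mathrm{RMSE}_m}{\mathrm{RMSE}_{\mathrm{VBGM}}}
\]
```

Under this definition a positive value means model m has a lower RMSE than the baseline, and the baseline itself scores 0%.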

Fig 3.

Feature importance heatmap showing normalized importance values (0-1) across nine models.

Models: RF = Random Forest, XGB = XGBoost, LGBM = LightGBM, SVR-L = Support Vector Regression (Linear), SVR-R = Support Vector Regression (RBF), NN-IG = Neural Network with Integrated Gradients, VBGM = von Bertalanffy Growth Model (Bayesian), G-Bayes = Gompertz (Bayesian), Avg = Average across all models. Features: L1 = Initial length, Time Large = Time at large, Weight Release = Weight at release, RT Biomass = Rainbow trout biomass, Solar Insol = Solar insolation, Release RM = Release river mile, SRP Conc = Soluble reactive phosphorus concentration, Water Temp = Water temperature. Color intensity and numerical values indicate relative feature importance within each model.
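The caption does not state how importances were brought onto the 0-1 scale. The sketch below assumes one common choice: dividing each model's absolute importances by that model's largest value, then averaging across models for the Avg column. The function names, model labels, feature labels, and numbers are all hypothetical.

```python
def normalize_within_model(raw):
    """Scale one model's importances so its largest absolute value becomes 1."""
    top = max(abs(v) for v in raw.values())
    if top == 0:
        return {name: 0.0 for name in raw}
    return {name: abs(v) / top for name, v in raw.items()}

def average_across_models(per_model):
    """Mean normalized importance of each feature over all models."""
    features = list(next(iter(per_model.values())))
    n_models = len(per_model)
    return {
        f: sum(scores[f] for scores in per_model.values()) / n_models
        for f in features
    }

# Made-up raw importances (e.g. impurity gains for one model, signed
# coefficients for another); not values from the study.
raw_importances = {
    "model_1": {"feature_a": 0.60, "feature_b": 0.30, "feature_c": 0.15},
    "model_2": {"feature_a": -4.0, "feature_b": 2.0, "feature_c": 1.0},
}

normalized = {m: normalize_within_model(s) for m, s in raw_importances.items()}
average = average_across_models(normalized)
```

Scaling within each model makes columns comparable in ranking terms even though the raw importances come from different methods with different units.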

Table 3.

Pairwise stochastic dominance probabilities (sorted by number of models beaten).
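The source does not say here how the pairwise probabilities were estimated. As a rough sketch, assume each probability is the share of paired resamples (for example bootstrap replicates of the test set) in which one model's error is lower than another's, and that a model "beats" another when that share exceeds 0.5. The function names, the threshold, the model labels, and the error values are all assumptions for illustration.

```python
def dominance_probability(errors_a, errors_b):
    """Fraction of paired resamples where model A's error is strictly lower than B's."""
    wins = sum(1 for a, b in zip(errors_a, errors_b) if a < b)
    return wins / len(errors_a)

def rank_by_models_beaten(errors_by_model, threshold=0.5):
    """Count, for each model, how many others it beats; sort by that count."""
    names = list(errors_by_model)
    beaten = {
        a: sum(
            1
            for b in names
            if b != a
            and dominance_probability(errors_by_model[a], errors_by_model[b]) > threshold
        )
        for a in names
    }
    order = sorted(names, key=lambda name: beaten[name], reverse=True)
    return order, beaten

# Made-up resampled RMSE values (mm) for three hypothetical models.
resampled_rmse = {
    "model_a": [15.8, 16.1, 15.9, 16.0],
    "model_b": [17.0, 16.0, 17.2, 16.9],
    "model_c": [30.1, 29.5, 31.0, 30.4],
}

p_a_over_b = dominance_probability(resampled_rmse["model_a"], resampled_rmse["model_b"])
order, beaten = rank_by_models_beaten(resampled_rmse)
```

A pairwise table built this way is not symmetric: the entry for A over B and the entry for B over A sum to 1 only when ties never occur.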
