Fig 1.
Approach, datasets, and tasks.
(A) Schematic of the approach for benchmarking uncertainty quantification (UQ) in machine learning for protein engineering. A panel of UQ methods was evaluated on protein fitness datasets to assess the quality of the uncertainty estimates and their utility in active learning and Bayesian optimization. (B) Our study used three protein datasets/landscapes and several train-validation-test split tasks within each dataset. These datasets and tasks covered a range of sample diversities and domain shifts (task difficulties).
Fig 2.
Miscalibration area vs. root mean square error (RMSE).
For the (A) AAV, (B) Meltome, and (C) GB1 landscapes. Miscalibration area (also called the area under the calibration error curve or AUCE) quantifies the absolute difference between the calibration plot and perfect calibration. It is desirable to have a model that is both accurate and well-calibrated, so the best performing points are those closest to the lower left corner of the plots. Each point represents an average of 5 models trained using different random seeds for initialization of the CNN parameters and batching / stochastic gradient descent. Fig A in S1 Appendix shows the corresponding results for the OHE representation. See the Uncertainty Methods section for an explanation of points for which experiments were not feasible (e.g. there is no GP Continuous model result for the AAV landscape due to memory constraints for training these models).
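The miscalibration area described in this caption can be illustrated with a minimal sketch. It assumes Gaussian predictive distributions (a mean and a standard deviation per prediction) and integrates with the trapezoid rule over an evenly spaced grid of confidence levels. The function name `miscalibration_area` and the `n_levels` parameter are illustrative and not taken from the study's code.

```python
from math import inf
from statistics import NormalDist


def miscalibration_area(y_true, y_pred, sigma, n_levels=101):
    """Area between the calibration curve and the diagonal (trapezoid rule).

    Assumes each prediction is Gaussian with mean y_pred[i] and standard
    deviation sigma[i] > 0. This is an illustrative sketch only.
    """
    std_normal = NormalDist()
    # Standardized absolute residuals |y - mu| / sigma.
    z = [abs(y - m) / s for y, m, s in zip(y_true, y_pred, sigma)]
    levels = [i / (n_levels - 1) for i in range(n_levels)]
    gaps = []
    for p in levels:
        # Half-width (in sigma units) of the central interval holding mass p.
        half_width = inf if p >= 1.0 else std_normal.inv_cdf(0.5 + p / 2)
        # Observed fraction of targets inside that interval.
        observed = sum(zi <= half_width for zi in z) / len(z)
        gaps.append(abs(observed - p))
    step = 1 / (n_levels - 1)
    return sum((gaps[i] + gaps[i + 1]) / 2 * step for i in range(n_levels - 1))
```

Under this definition a perfectly calibrated model scores close to 0, and a model that is overconfident or underconfident at every level approaches 0.5.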
Fig 3.
Coverage vs. average width / range.
For the (A) AAV, (B) Meltome, and (C) GB1 landscapes. Coverage is the percentage of true values that fall within the 95% confidence interval (±2σ) of each prediction, and the width is the size of the 95% confidence region relative to the range of the training set (4σ/R where R is the range of the training set). A good model exhibits high coverage and low width, which corresponds to the upper left of each plot. The horizontal dashed line indicates 95% coverage. Each point represents an average of 5 models trained using different random seeds for initialization of the CNN parameters and batching / stochastic gradient descent. Fig B in S1 Appendix shows the corresponding results for the OHE representation. See the Uncertainty Methods section for an explanation of several points for which experiments were not feasible (e.g. there is no GP Continuous model result for the AAV landscape due to memory constraints for training these models).
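The two quantities defined in this caption follow directly from the predictions and their standard deviations. The sketch below is one way to compute them. The function name `coverage_and_width` is illustrative, and the code assumes that the width is reported as the ratio 4σ/R averaged over test points and that values exactly on the interval boundary count as covered.

```python
def coverage_and_width(y_true, y_pred, sigma, train_range):
    """Return (coverage in percent, mean relative width 4*sigma/R).

    train_range is R, the range (max - min) of the training-set targets.
    Illustrative sketch; not the study's implementation.
    """
    # A true value is covered when it lies within +/- 2 sigma of the prediction.
    inside = [abs(y - m) <= 2 * s for y, m, s in zip(y_true, y_pred, sigma)]
    coverage = 100 * sum(inside) / len(inside)
    # Width of the +/- 2 sigma region relative to the training range.
    width = sum(4 * s / train_range for s in sigma) / len(sigma)
    return coverage, width


# Three of four targets fall inside their intervals; every width is 4 * 1 / 8.
print(coverage_and_width([0, 1, 2, 10], [0, 1, 2, 0], [1, 1, 1, 1], 8))
# (75.0, 0.5)
```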
Fig 4.
Spearman rank correlation of (A) predictions (ρ) and (B) uncertainties (ρunc) vs. extrapolation. Within each landscape (AAV, Meltome, and GB1), splits are ordered by the amount of domain shift between train and test sets, with the lowest domain shift on the left and the highest domain shift on the right. Error bars on the CNN results represent the 95% confidence interval calculated from 5 different random seeds for initialization of the CNN parameters and batching / stochastic gradient descent. Fig C in S1 Appendix shows the corresponding results for the OHE representation. See the Uncertainty Methods section for an explanation of several points for which experiments were not feasible.
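As a rough illustration of the two rank correlations plotted here, the sketch below computes ρ between predictions and true values, and ρunc between predicted uncertainties and absolute errors. The definition of ρunc used here is an assumption, since the caption does not spell it out. The helper names are illustrative, and ties receive average ranks.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / sqrt(var_a * var_b)


def rho_and_rho_unc(y_true, y_pred, sigma):
    """rho: predictions vs. truth. rho_unc (assumed): sigma vs. |error|."""
    abs_err = [abs(y - m) for y, m in zip(y_true, y_pred)]
    return spearman(y_pred, y_true), spearman(sigma, abs_err)
```

A ρunc near 1 would indicate that larger predicted uncertainties tend to accompany larger errors. The helper is undefined when either input is constant, since the rank variance is then zero.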
Fig 5.
(A) Schematic of the active learning approach. A model is trained on an initial dataset and is then retrained in each iteration after more points are added to the training set according to a selection criterion. (B-D) Uncertainty-guided active learning in protein sequence-function prediction. Spearman rank correlation of predictions (ρ) for the CNN ensemble, CNN evidential, and GP methods evaluated on the AAV/Random (B), Meltome/Random (C), and GB1/Random (D) splits. The “random” strategy acquired sequences with all unseen points having equal probabilities, the “explorative sample” strategy acquired sequences with random sampling weighted by uncertainty, and the “explorative greedy” strategy acquired the previously unseen sequences with the highest uncertainty. See the Uncertainty Methods section for an explanation of why GP experiments for the AAV landscape were not feasible.
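The three acquisition strategies named in this caption can be sketched as a single batch-selection function over the unseen pool. The function name `select_batch`, the strategy strings, and the choices to sample without replacement and to weight in direct proportion to the predicted uncertainty are all assumptions made for illustration. The caption does not specify them.

```python
import random


def select_batch(uncertainties, batch_size, strategy, rng=None):
    """Pick indices of unseen points to add to the training set.

    uncertainties[i] is the model's predicted uncertainty for pool item i.
    Illustrative sketch; the weighting scheme is an assumption.
    """
    rng = rng or random.Random()
    pool = list(range(len(uncertainties)))
    if strategy == "random":
        # Every unseen point is equally likely.
        return rng.sample(pool, batch_size)
    if strategy == "explorative_greedy":
        # Take the points with the highest uncertainty.
        ranked = sorted(pool, key=lambda i: uncertainties[i], reverse=True)
        return ranked[:batch_size]
    if strategy == "explorative_sample":
        # Draw one point at a time, weighted by uncertainty, without replacement.
        chosen = []
        for _ in range(batch_size):
            weights = [uncertainties[i] for i in pool]
            pick = rng.choices(pool, weights=weights, k=1)[0]
            pool.remove(pick)
            chosen.append(pick)
        return chosen
    raise ValueError(f"unknown strategy: {strategy}")
```

In an active learning loop, the selected indices would be moved from the pool into the training set before the model is retrained. The weighted branch requires at least one remaining point with positive uncertainty.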
Fig 6.
(A-C) Bayesian optimization in protein sequence-function prediction. Percentage of the top-100 scores in the training set found for the CNN ensemble, CNN evidential, and GP methods evaluated on the AAV/Random (A), Meltome/Random (B), and GB1/Random (C) splits. The “greedy” strategy acquired sequences with the best predicted property values. The “UCB” and “TS” strategies acquired sequences based on the upper confidence bound (UCB) and Thompson sampling (TS) approaches, respectively. The “random” strategy acquired sequences with all unseen points having equal probabilities. See the Uncertainty Methods section for an explanation of why GP experiments for the AAV landscape were not feasible. Note that in several plots, including the Gaussian process plots for Meltome and GB1 and the evidential plot for Meltome, the “greedy” curve is nearly identical to the “UCB” curve and is hidden behind it.
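The acquisition strategies and the reported metric can be illustrated with a short sketch. It assumes independent Gaussian predictions per candidate, a UCB score of mean plus `beta` times the standard deviation, and Thompson sampling by drawing one value per candidate from its predictive distribution. The function names and the default `beta=2.0` are hypothetical and not taken from the study.

```python
import random


def acquisition_scores(means, sigmas, strategy, beta=2.0, rng=None):
    """Score each unseen candidate; higher scores are acquired first."""
    rng = rng or random.Random()
    if strategy == "greedy":
        # Best predicted property value.
        return list(means)
    if strategy == "ucb":
        # Upper confidence bound: reward high mean and high uncertainty.
        return [m + beta * s for m, s in zip(means, sigmas)]
    if strategy == "ts":
        # Thompson sampling: one draw from each predictive distribution.
        return [rng.gauss(m, s) for m, s in zip(means, sigmas)]
    if strategy == "random":
        return [rng.random() for _ in means]
    raise ValueError(f"unknown strategy: {strategy}")


def top_k(scores, k):
    """Indices of the k highest-scoring candidates."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def percent_of_top_found(acquired, all_values, n_top=100):
    """Percentage of the n_top best true values among the acquired indices."""
    best = set(top_k(all_values, n_top))
    return 100 * len(best & set(acquired)) / len(best)
```

When predicted uncertainties are small relative to differences in predicted means, the UCB ranking matches the greedy ranking, which is consistent with the overlapping curves noted in the caption.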