Fig 1.
Combining datasets using CAPYBARA to predict new antibody-virus interactions.
(A) Given studies (…, Sj-1, Sj, Sj+1, …) measuring serum HAI against a subset of influenza variants V0-Vn, and study-of-interest S0 measuring HAI against V1-Vn, CAPYBARA predicts V0’s measurements in S0. (B) CAPYBARA first identifies the most predictive features (HAI against a subset of variants) using Recursive Feature Machines (pink boxes). Ridge regression is applied using those features, training on a subset of data in Sj and cross-validating on the rest (error σInternal, Table 1). This model predicts titer values μj from Sj → S0 without uncertainty. (C) To estimate cross-study prediction error, each of the other variants is withheld in turn and predicted from Sj → S0 to determine the internal (σInternal) and cross-study (σExternal) error. Combining the errors from every overlapping variant yields the transferability function fj that is applied to V0’s σInternal from Panel B to estimate the uncertainty σj in Sj. (D) Predictions from all studies are combined through a Bayesian approach to yield a consensus prediction for the study-of-interest (S0).
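The combination step in Panel D can be illustrated with a minimal sketch. It assumes that each study Sj contributes an independent Gaussian prediction with mean μj and standard deviation σj (both in log2 titer units) and that the prior is flat, in which case the posterior is the precision-weighted mean. The caption only states that a Bayesian approach is used, so this specific form and the function name `combine_gaussian_predictions` are assumptions rather than the authors' implementation.

```python
import math


def combine_gaussian_predictions(mus, sigmas):
    """Combine independent Gaussian predictions (mu_j, sigma_j) into one.

    Assumption: flat prior and independent Gaussian errors, so the
    consensus is the precision-weighted mean of the per-study predictions.
    """
    if len(mus) != len(sigmas) or not mus:
        raise ValueError("mus and sigmas must be non-empty and equal in length")
    if any(s <= 0 for s in sigmas):
        raise ValueError("every sigma must be positive")

    # Precision (inverse variance) of each study's prediction.
    weights = [1.0 / (s * s) for s in sigmas]
    total_precision = sum(weights)

    # Consensus mean and its standard deviation.
    mu = sum(w * m for w, m in zip(weights, mus)) / total_precision
    sigma = math.sqrt(1.0 / total_precision)
    return mu, sigma


# Two hypothetical studies predicting the same log2 titer.
consensus_mu, consensus_sigma = combine_gaussian_predictions([4.0, 6.0], [1.0, 2.0])
```

Under these assumptions, a study with a smaller σj pulls the consensus toward its own prediction, and the consensus uncertainty is never larger than the smallest individual σj.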
Table 1.
Definitions of CAPYBARA error terms and their roles in model training, transferability, and evaluation.
Models are trained to infer HAI titers for variant V0 in study Sj and then applied to predict V0’s titers in study S0. Titers from other variants Vk can be chosen as model features.
Table 2.
List of large-scale influenza studies used in this analysis. The 25 influenza datasets, comprising vaccine [Vac, white background] or infection [Inf, gray background] studies, were used to assess cross-study predictions. The year represents when each study was conducted (e.g., 2010-2014 implies that samples were collected annually across these 5 years). Sera collected at different time points from the same subject were considered independently. The total number of measurements in each study equals (# of sera)×(# of viruses)×(1 − fraction missing).
Fig 2.
HAI titers across vaccination and infection studies are consistently predicted within experimental noise by combining predictions from all other studies.
(A,B) Example predictions trained on an individual dataset (left and middle columns) and the combination of both datasets (right column). Labels above each plot identify the training → testing datasets. (C-E) Predicting three datasets using all other studies in Table 2. The estimated fold-error (σPredict), measured fold-error (σActual), and the number (N) of predicted titers are shown, with the gray diagonal bands representing σPredict.
Fig 3.
Predicting HAI responses across all studies.
(A) Heatmap of the average RMSE (σActual) across all subjects and overlapping variants in a study-of-interest (column). Training is done using either all studies (top row) or a single study (all other rows). (B-C) All predicted versus measured HAIs when training on (B) a single study or (C) all other studies. The number N of predictions is larger for pairwise predictions since the same serum-virus pair is predicted multiple times using different training datasets. The diagonal line y = x represents perfect predictions.
Fig 4.
Training on similar datasets marginally improves prediction accuracy.
Cross-study RMSE (σActual) when training and predicting between datasets based on (A) the age groups adult-only, children-only, or mixed (child + adult); (B) vaccination or infection studies; (C) datasets grouped in 5-year intervals based on their median year; or (D) pre-vaccination (Day 0) vs post-vaccination (~1 month) data. Each box plot shows the distribution of errors for all possible withheld variants. The horizontal line denotes the median, boxes show the interquartile range, and whiskers extend to 1.5 times the interquartile range. Circles denote outliers. Statistical significance was assessed using two-sided permutation tests with Benjamini–Hochberg correction for multiple testing. Asterisks denote adjusted p-values: **** = p < 0.0001, *** = p < 0.001, ** = p < 0.01, * = p < 0.05.
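The statistical procedure named above can be sketched as follows. This is a minimal illustration that assumes the permutation test statistic is the absolute difference in group means, which the caption does not specify. The function names and the number of permutations are hypothetical choices.

```python
import random


def permutation_pvalue(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in means (assumed statistic)."""
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            hits += 1
    # The +1 terms count the observed labeling and avoid p = 0.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def stars(p_adj):
    """Map an adjusted p-value to the asterisk notation used in the caption."""
    for threshold, label in [(0.0001, "****"), (0.001, "***"), (0.01, "**"), (0.05, "*")]:
        if p_adj < threshold:
            return label
    return ""
```

In use, one p-value would be computed per pairwise comparison within a panel, and the full list passed to `benjamini_hochberg` before assigning asterisks.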
Fig 5.
A global dictionary of influenza variant importance.
(A) Rainbow diagram of feature importance between any pair of variants (connections are bidirectional). (B) Examples of universal HAI titer equations for multiple influenza vaccine strains, using titers from one variant (when possible) or two variants. Each virus name stands for its log2(HAI/5) titer. See S1 File for all relations using ≤5 variants. (C) Measured versus predicted HAI titers for all vaccine strains in each study. Predictions were averaged from all other studies that measured the necessary variants. (D) Example using a small subset of five variants to predict ten other vaccine strains.
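Because the equations in Panel B are written in log2(HAI/5) units, a short sketch of that transform and its inverse may help when applying them to raw titers. The function names are hypothetical, and the conversion from an error in log2 units to a fold-error (2 raised to that error) is an assumption about how the σ values map to fold units, not something stated in the caption.

```python
import math


def hai_to_log2(hai):
    """Convert a raw HAI titer to log2(HAI/5) units (HAI = 5 maps to 0)."""
    if hai <= 0:
        raise ValueError("HAI titer must be positive")
    return math.log2(hai / 5.0)


def log2_to_hai(log2_titer):
    """Invert the transform: recover the raw HAI titer."""
    return 5.0 * 2.0 ** log2_titer


def fold_error(log2_error):
    """Assumed conversion of an error in log2 units to a fold-error."""
    return 2.0 ** log2_error


# A titer of 40 is three two-fold dilutions above the baseline of 5.
log2_titer = hai_to_log2(40)
raw_titer = log2_to_hai(log2_titer)
```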