Fig 1.
Schematic overview of our approach to estimate GFR robustly in new applications.
GFR estimating equations developed in one population (blue figures on the top left) will be less accurate when applied in different populations (green figures on the top right) because of variation in the distribution of factors other than GFR that determine the level of the filtration markers (referred to as non-GFR determinants). Non-GFR determinants of markers can affect GFR estimation accuracy by distorting markers in individual patients (plots on the left-hand side) and by causing systematic differences between development and application populations (plots on the right-hand side). We propose methods to address both sources of error, using techniques to detect outlying predictor variables, robust estimation, and transfer learning. Finally, we combine these approaches to obtain robust GFR estimates in new applications.
Table 1.
Summary of each combination of outlier detection methods and robust estimation approaches.
We combined each outlier detection method with each estimation approach, giving nine different approaches for robust GFR estimation in new application data.
Table 2.
Summary of analytic dataset (N = 3,554).
Fig 2.
Regression models for mGFR using each marker by study.
The average, minimum, and maximum correlations of each marker with mGFR across studies [average r (minimum, maximum)] are provided within each plot.
Fig 3.
Comparison of RMSE using all modeling and prediction approaches after mean and variance contamination of each marker individually.
The color of the points represents the underlying outlier detection strategy, and the shape represents the robust estimation approach. Results are averaged across ten cross-validation iterations. RMSE: Root Mean Square Error.
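For reference, the RMSE reported in these comparisons has the standard form below. The symbols are assumptions introduced here for illustration: $y_i$ denotes the measured GFR of observation $i$, $\hat{y}_i$ the corresponding model prediction, and $n$ the number of test observations. The captions do not state the scale on which mGFR is modeled, so none is assumed.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^{2}}
```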
Fig 4.
Comparison of RMSE using all outlier detection and robust estimation approaches after mean and variance contamination of two markers.
The color of the points represents the underlying outlier detection strategy, and the shape represents the robust estimation approach. We show four of the 28 possible pairs of contaminated markers. The selected pairs represent the results after contaminating two excellent predictors (cystatin-c and pseudouridine, average correlation r with mGFR across studies -0.73 and -0.74, respectively), one excellent predictor and one good predictor (cystatin-c and creatinine, average r for creatinine and mGFR = -0.58), one excellent predictor and one poor predictor (cystatin-c and tryptophan, average r for tryptophan and mGFR = -0.30), and two poor predictors (tryptophan and phenylacetylglutamine, average r for phenylacetylglutamine and mGFR = -0.41). Results are averaged across ten cross-validation iterations. RMSE: Root Mean Square Error.
Fig 5.
Comparison of RMSEs from naïve (red), study-specific linear (blue), and transfer learning (green) models for various training sizes.
RMSEs are shown on the y-axis and the training sample size is shown on the x-axis. All models include all 8 predictors. RMSEs from linear models that were fit using all studies except for a single held-out study, which was used as the test dataset, are shown as the horizontal red line (the naïve, external model). RMSEs from linear models fit within single studies are shown in blue. In this case, models were developed using a random sample of observations from the given study and tested on the remaining observations in that study. RMSEs from transfer learning models, shown in green, were developed using a random sample of training observations from the target data and tested on the remaining observations. Given its relatively small total sample size (n = 55), we did not include Crisp in this analysis. Results are averaged across ten cross-validation iterations. RMSE: Root Mean Square Error.
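The evaluation design behind the red and blue curves can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses simulated data, a single predictor instead of the 8 markers, and hypothetical helper names (`fit_line`, `external_rmse`, `within_study_rmse`, `simulate`). The transfer learning model is not sketched, because the caption does not specify its form.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def predict(model, x):
    intercept, slope = model
    return [intercept + slope * a for a in x]


def external_rmse(x_other, y_other, x_target, y_target):
    """Fit on the pooled other studies, test on the whole held-out study."""
    return rmse(y_target, predict(fit_line(x_other, y_other), x_target))


def within_study_rmse(x, y, n_train, n_iter=10, seed=0):
    """Fit on a random sample of n_train observations from one study, test on
    the remaining observations, and average RMSE over n_iter random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iter):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        model = fit_line([x[i] for i in train], [y[i] for i in train])
        scores.append(rmse([y[i] for i in test], predict(model, [x[i] for i in test])))
    return sum(scores) / n_iter


def simulate(n, intercept, seed):
    """Simulated (assumed) data: one marker negatively related to the outcome."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [intercept - 0.7 * a + rng.gauss(0, 0.5) for a in x]
    return x, y


# The target study differs from the pooled studies by an intercept shift,
# standing in for a systematic difference between populations.
x_other, y_other = simulate(500, 0.0, seed=1)
x_target, y_target = simulate(300, 2.0, seed=2)

naive = external_rmse(x_other, y_other, x_target, y_target)
curve = {n: within_study_rmse(x_target, y_target, n) for n in (20, 50, 100, 200)}
print("external model RMSE:", round(naive, 3))
for n, value in curve.items():
    print("within-study RMSE at n_train =", n, ":", round(value, 3))
```

In this simulated setting the external model's RMSE is a single value that does not depend on the target training size, which is why it appears as a horizontal line, while the within-study RMSE is computed separately for each training size.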
Table 3.
Summary of RMSEs after combining outlier detection and robust prediction with transfer learning for each study.