The authors have declared that no competing interests exist.

Environmental enteric dysfunction (EED) is associated with chronic undernutrition. Efforts to identify minimally invasive biomarkers of EED reveal an expanding number of candidate analytes. An analytic strategy is reported to select among candidate biomarkers and systematically express the strength of each marker’s association with linear growth in infancy and early childhood. 180 analytes were quantified in fecal, urine and plasma samples taken at 7, 15 and 24 months of age from 258 subjects in a birth cohort in Peru. Treating the subjects’ length-for-age Z-score (LAZ-score) over a 2-month lag as the outcome, penalized linear regression models with different shrinkage methods were fitted to determine the best-fitting subset. These were then included with covariates in linear regression models to obtain estimates of each biomarker’s adjusted effect on growth. Transferrin had the largest and most statistically significant adjusted effect on short-term linear growth as measured by LAZ-score–a coefficient value of 0.50 (0.24, 0.75) for each log_{2} increase in plasma transferrin concentration. Other biomarkers with large effect size estimates included adiponectin, arginine, growth hormone, proline and serum amyloid P-component. The selected subset explained up to 23.0% of the variability in LAZ-score. Penalized regression modeling approaches can be used to select subsets from large panels of candidate biomarkers of EED. There is a need to systematically express the strength of association of biomarkers with linear growth or other outcomes to compare results across studies.

Childhood undernutrition is widespread throughout the world and has severe, long-lasting health impacts. Substances measured in blood, urine and stool could be used as biomarkers to identify children undergoing growth failure before these impacts occur. However, it is not yet known which of the many markers that can be identified are accurate and clinically useful predictors of poor growth in infants and children. This study used a large number of candidate biomarkers of immune activation, metabolism and hormones and applied statistical methods to narrow them down from 110 different substances to the 36 best predictors of growth in 258 Peruvian infants. It also estimated how large the effect of each of these markers was on height two months later. The biomarker with the largest effect was transferrin, a glycoprotein that can be measured in blood samples. Children with elevated transferrin at 15 months were around two thirds of a centimeter taller on average at 17 months than those with low levels. Transferrin and the other proteins, glycoproteins, hormones and antibodies that this study identified can be measured easily and affordably in standard laboratories, making them feasible for broad use as prognostic markers in child health and nutrition programs in under-resourced settings.

Chronic undernutrition affects around one in three children under age five, rendering them susceptible to prolonged and more severe infections and putting them at increased risk of mortality [

Studying the impact of EED is challenging. Gold standard diagnostic tests for other enteropathies, such as celiac and Crohn’s disease, include endoscopy and gut biopsy, invasive and demanding procedures that cannot feasibly be deployed in resource-constrained settings or to assess disease burden at the population level [

Recently developed methods allow for quantifying large panels of soluble analytes in blood that relate to inflammation or immune status [

The objective of this study was to identify clinically relevant biomarkers of the precursors of EED that can inform intervention early in the disease process. To this end, penalized regression approaches for variable subset selection were applied to a large panel of candidate biomarkers measured in a cohort of Peruvian infants to identify the subset most predictive of nutritional status (length-for-age Z-score, LAZ-score) over a two-month lag.

Ethical approval for MAL-ED was given by the Johns Hopkins Institutional Review Board as well as the Ethics Committee of Asociacion Benefica PRISMA, and the Regional Health Department of Loreto. Written informed consent was obtained from the caregiver of every participating child.

A cohort of 303 infants was enrolled between December 2009 and February 2012 from Santa Clara de Nanay, a peri-urban community located 15 km from the city of Iquitos, Peru, a study setting that has been described in detail elsewhere [

The outcome of interest in this analysis was the subjects’ LAZ-score, a widely used measure of nutritional status and attained statural growth [

The primary exposure variables were 180 time-varying candidate fecal, urinary and plasma biomarkers of EED compiled from the following panels (biomarker names, abbreviations and units are listed in the supporting information):

1. Three overlapping panels of in-house quantitative, multiplexed immunoassays of cytokines, chemokines, hormones and other regulators of metabolism and growth [

   a. 86 analytes from 20 samples taken at age 7 months from a sub-sample of subjects and run in June 2013. These 20 subjects (10 cases and 10 controls) were selected for more expansive testing in a separate study comparing extremes in growth in this setting over the target period between 6 and 15 months, when exclusive breastfeeding is no longer the optimal feeding practice. Cases (positive deviants) were those subjects who grew by >0.77 LAZ, while controls (negative deviants) were selected from those subjects who experienced a change in LAZ of <0.25 over the 8-month follow-up period. At the age of 7 months, cases and controls had equal LAZ.

   b. 49 analytes from 178 samples mostly taken at the target age of 24 months (though with a small number taken at 7, 15, 25 and 26 months) run in May 2014.

   c. 59 analytes from 443 samples taken at the target ages of 7 and 15 months (though with a small number taken at 8–9 or 16–18 months) run in January 2015.

2. 9 chemokine and 9 proinflammatory assays run on 596 of the same blood samples as panels 1.a–c at a laboratory at Johns Hopkins University (Baltimore, MD) in 2013–2014.

3. The amino acids citrulline and tryptophan and the latter’s metabolite kynurenine (µmol/L) quantified in 640 of the same blood samples by liquid chromatography-mass spectrometry (LCMS) in the Oregon Analytics laboratory in 2015 [

4. 51 other biogenic amines quantified in 464 of the same blood samples by LCMS at a laboratory in Imperial College London in 2017 [

5. Several plasma analytes measured in the same blood samples as part of the MAL-ED protocol, including alpha-1-acid glycoprotein (AGP; mg/dl, measured by radioimmune diffusion assay in 618 samples), insulin-like growth factor (IGF) 1 and IGF-binding protein 3 (IGFBP-3), measured by enzyme-linked immunosorbent assay (ELISA) in 566 and 597 samples respectively, and hemoglobin (g/dL, measured by HemoCue).

6. Three fecal biomarkers, alpha-1-antitrypsin (AAT, mg/g), myeloperoxidase (MPO, ng/mL) and neopterin (NEO, nmol/L), measured by ELISA tests of stool samples collected from the infants at monthly intervals [

7. 5 urinary biomarkers calculated from lactulose to mannitol recovery tests of intestinal permeability performed on urine samples collected at 3, 6, 9 and 15 months of age.

Age (months) | Panel 1.a | Panel 1.b | Panel 1.c | Panel 2 | Panel 3 | Panel 4 | Panel 5: AGP | Panel 5: IGF-1 | Panel 5: IGFBP-3 | Panel 5: Hb | Panel 6 | Panel 7
---|---|---|---|---|---|---|---|---|---|---|---|---
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 174 | 267
7 | 20 | 5 | 211 | 210 | 226 | 148 | 236 | 175 | 202 | 340 | 262 | 2
8 | 0 | 0 | 2 | 2 | 7 | 2 | 2 | 2 | 1 | 7 | 264 | 1
9 | 0 | 0 | 2 | 2 | 3 | 2 | 2 | 1 | 2 | 6 | 175 | 247
10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 257 | 3
11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 248 | 0
12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 253 | 0
13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 209 | 0
14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 203 | 0
15 | 0 | 6 | 211 | 189 | 209 | 156 | 200 | 179 | 183 | 355 | 177 | 226
16 | 0 | 0 | 13 | 10 | 11 | 12 | 12 | 12 | 11 | 14 | 213 | 1
17 | 0 | 0 | 3 | 2 | 3 | 2 | 2 | 2 | 3 | 5 | 214 | 2
18 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 226 | 0
19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 221 | 0
20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 218 | 0
21 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 213 | 0
22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 206 | 0
23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 197 | 0
24 | 0 | 167 | 0 | 167 | 167 | 129 | 154 | 181 | 182 | 304 | 182 | 180
25 | 0 | 9 | 0 | 9 | 9 | 8 | 8 | 9 | 8 | 13 | 52 | 4
26 | 0 | 4 | 0 | 4 | 4 | 4 | 1 | 4 | 4 | 4 | 43 | 1

The fecal and urinary samples were matched to the plasma biomarker values that were closest in age and those that were not matched to any blood sample were excluded from the analysis. Exposure values were lagged by two months so that the analysis assessed the association between the subjects’ LAZ-score at month of age

7–15-month database: This excluded the samples in panel 1.b and any samples from panels 2–7 that were taken at ≥24 months of age, resulting in 461 observations and 110 biomarkers.

7–24-month database: This included the 24-month samples but only for those biomarkers that were included in both panels 1.b and 1.c as well as panels 1.a and 2–7, resulting in 639 observations and 80 biomarkers.
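The two-month lag between exposure and outcome can be illustrated with a minimal sketch. It assumes a hypothetical data layout in which biomarker values and LAZ-scores are keyed by (child, age in months); the function name, the layout and the example values are illustrative and are not taken from the study.

```python
def lag_exposures(biomarkers, laz, lag_months=2):
    """Pair each biomarker value measured at age t with the LAZ-score at age t + lag.

    Both inputs are dicts keyed by (child_id, age_in_months); this layout is
    an assumption made for illustration only.
    """
    pairs = []
    for (child, age), value in sorted(biomarkers.items()):
        outcome = laz.get((child, age + lag_months))
        if outcome is not None:  # drop exposures with no outcome at the lagged age
            pairs.append((child, age, value, age + lag_months, outcome))
    return pairs


# Hypothetical values: child "B" has no LAZ-score at 9 months and is dropped.
biomarkers = {("A", 7): 2.1, ("A", 15): 2.6, ("B", 7): 1.8}
laz = {("A", 9): -1.2, ("A", 17): -1.0, ("B", 10): -0.7}

for row in lag_exposures(biomarkers, laz):
    print(row)
```

Under this layout, a biomarker measured at 15 months is paired with the LAZ-score at 17 months, which is the pairing behind the 17- and 26-month height predictions reported in the results.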

Non-detectable biomarker values, for which the analyte concentration was below the lower limit of quantification (LLOQ), were substituted with LLOQ /√2 [
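The substitution rule for non-detectable values can be sketched as follows. The representation of a non-detect as `None`, the treatment of any reading below the LLOQ as a non-detect, and the example concentrations are assumptions made for illustration.

```python
import math


def impute_non_detects(values, lloq):
    """Replace non-detectable readings with LLOQ / sqrt(2).

    Non-detects are represented as None; readings below the LLOQ are also
    treated as non-detects (an assumption, not stated in the text).
    """
    substitute = lloq / math.sqrt(2)
    return [substitute if v is None or v < lloq else v for v in values]


# Hypothetical concentrations with an LLOQ of 1.0 (arbitrary units).
readings = [3.2, None, 0.9, 1.5]
print(impute_non_detects(readings, lloq=1.0))
```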

The retained biomarkers were included in penalized linear regression models with three different shrinkage methods that have been used in other studies of EED biomarkers—Adaptive LASSO (Least Absolute Shrinkage and Selection Operator), Minimax Concave Penalty (MCP) and Smoothly Clipped Absolute Deviation (SCAD) penalties [
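For orientation, the three penalties can be written as functions of a single coefficient. The sketch below uses the standard textbook definitions; the tuning-parameter defaults (`a=3.7` for SCAD, `gamma=3.0` for MCP) are conventional choices and are assumptions here, since the values and software used in the study are not given in this text.

```python
def adaptive_lasso_penalty(beta, lam, beta_init, power=1.0):
    """Weighted L1 penalty; the weight shrinks as the initial estimate grows."""
    return lam * abs(beta) / abs(beta_init) ** power


def mcp_penalty(beta, lam, gamma=3.0):
    """Minimax Concave Penalty: L1-like near zero, flat beyond gamma * lam."""
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b * b / (2 * gamma)
    return gamma * lam * lam / 2


def scad_penalty(beta, lam, a=3.7):
    """Smoothly Clipped Absolute Deviation penalty (requires a > 2)."""
    b = abs(beta)
    if b <= lam:
        return lam * b  # identical to the LASSO for small coefficients
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2  # constant: large coefficients are not shrunk further


for beta in (0.5, 2.0, 10.0):
    print(beta, scad_penalty(beta, lam=1.0), mcp_penalty(beta, lam=1.0))
```

Unlike the plain LASSO penalty, which grows linearly without bound, MCP and SCAD level off for large coefficients, so strong predictors are penalized less while weak ones are still shrunk to zero.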

Regression models were fitted with robust variance estimation to allow for intra-subject correlation. Models were first fitted for each candidate biomarker separately (adjusting for the a priori-selected non-biomarker covariates) to report their independent effects and statistical significance, and then as a multivariable model that included all biomarkers selected for the best-fitting subset, to estimate the adjusted effect of each in the presence of the others and their combined effect on LAZ-score. To account for the false discovery rate (FDR) due to the large number of comparisons, … 25^{th} and the 75^{th} percentile of each included biomarker’s distribution at the age of the final included sample (15 or 24 months, depending on which database was used) and holding all other included biomarkers at their mean values and based on the standard deviation in height at that age reported in the WHO child growth standards [

The R^{2} statistic for the final subset model was reported along with the partial R^{2} for all included biomarker terms as an estimate of the proportion of the total variability in the outcome that was explained by the selected biomarker subset. Results from the final models were compared with those obtained from adjusting for LAZ-score measured contemporaneously with the biomarker (in place of the other covariates), in order to compare the prognostic potential of the final biomarker subset in predicting future growth relative to a natural and existing alternative, namely attained LAZ-score. The potential for non-linear relationships between biomarkers in the final subsets and LAZ-score was assessed by generating nonparametric smooth plots and by applying a multivariate spline model-selection algorithm to the final models.

Finally, as a validation exercise, associations between each of three of the most important biomarkers (expressed per standard deviation [SD]) and changes in LAZ over increasing lag lengths of 1–10 months, adjusting for contemporaneous LAZ, were plotted. This replicated an existing method previously used for tryptophan and citrulline, with a known endocrinologic agent, insulin-like growth factor 1 (IGF-1), as the comparator biomarker, following the methodology of Kosek and colleagues [
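One plausible reading of the predicted height difference is sketched below: in a linear model, the difference in predicted LAZ-score between the 75th and 25th percentiles of a log2-transformed biomarker is the coefficient times the log2 interquartile distance, and multiplying by the SD of height at that age converts Z-score units to centimeters. The function name and all inputs other than the reported transferrin coefficient of 0.50 are hypothetical, and the exact computation used in the study may differ.

```python
import math


def height_difference_cm(coef_per_log2, q25, q75, height_sd_cm):
    """Predicted height gap (cm) between children at the 75th and 25th
    percentile of a biomarker, with other biomarkers held fixed.

    coef_per_log2: change in LAZ-score per log2 increase in concentration.
    height_sd_cm:  SD of height at the outcome age (for example from a growth
                   reference table); the caller supplies it.
    """
    laz_difference = coef_per_log2 * (math.log2(q75) - math.log2(q25))
    return laz_difference * height_sd_cm


# Hypothetical quartiles and height SD; only the coefficient (0.50) is from the text.
print(height_difference_cm(0.50, q25=2.0, q75=2.8, height_sd_cm=2.7))
```

Because the held-constant biomarkers cancel out of the difference in a linear model, only the coefficient, the two quartiles and the height SD are needed.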

Summary statistics of the distributions of the 180 candidate biomarkers and whether they met the criteria for inclusion in further analysis are presented in

R^{2} values for the three penalized regression models fitted on each of the two biomarker databases. MCP selected the smallest subset of biomarkers when fitted to the 7-15-month database, but not the 7-24-month database, for which adaptive LASSO selected the smallest. For both databases, the SCAD penalty resulted in the largest subset, the highest cross-validated R^{2} and the lowest cross-validation error (jointly with MCP in the 7-24-month model) and was chosen as the subset for subsequent analyses.

Statistic | 7–15 months: Adaptive LASSO | 7–15 months: MCP | 7–15 months: SCAD | 7–24 months: Adaptive LASSO | 7–24 months: MCP | 7–24 months: SCAD
---|---|---|---|---|---|---
Biomarkers selected | 17 | 8 | 23 | 5 | 22 | 25
Cross-validation error | 0.84 | 0.84 | 0.82 | 0.85 | 0.78 | 0.78
Cross-validated R^{2} | 0.06 | 0.05 | 0.08 | 0.10 | 0.10 | 0.10

In both databases, just 5 biomarkers were selected by all three penalties. In the 7-15-month models, these were hemoglobin, Immunoglobulin A (IgA), Insulin-like growth factor-binding protein 3 (IGFBP-3), Pulmonary and Activation-Regulated Chemokine (PARC) and Thyroid-Stimulating Hormone (TSH), while in the 7-24-month models these included adiponectin and IgM instead of IgA and TSH. In both databases, all biomarkers selected by MCP were also selected by SCAD and in the 7-24-month database all biomarkers selected by adaptive LASSO were also selected by the other two penalties. SCAD selected 5 biomarkers in the 7-15-months and 3 in the 7-24-month database that were not included in either of the other two subsets, however only one of these–growth hormone (GH)–was significant in the final model.

Biomarkers for which

log_{2} increase of each of the biomarkers selected by SCAD along with the difference in child’s height predicted for children aged 17 and 26 months respectively at the 25^{th} and 75^{th} percentile of the biomarker distribution (holding all other included biomarkers at their sample mean). The 36 selected biomarkers include numerous amino acids, chemokines, hormones, glycoproteins and proteins along with two antibodies, three apolipoproteins, the enzyme myeloperoxidase, the sugar lactulose and 5-OH-Indole-3-acetic Acid (5-HIAA), the metabolite of serotonin. Thirteen biomarkers were included in both final models, while 11 were only included in the 7-15-month model and 12 only in the 7-24-month model.

| Biomarker | Coefficient (LAZ score), 7 & 15 months | Predicted height difference (cm) at 17 months | Coefficient (LAZ score), 7, 15 & 24 months | Predicted height difference (cm) at 26 months |
|---|---|---|---|---|
| | - | - | -0.05 | -0.16 |
| | -0.07 | -0.10 | - | - |
| | -0.06 | -0.15 | - | - |
| Adiponectin | -0.26 | -0.40 | -0.29 | -0.54 |
| | 0.20 | 0.33 | 0.07 | 0.14 |
| | - | - | 0.06 | 0.13 |
| Apo C-I | -0.22 | -0.30 | - | - |
| | - | - | -0.08 | -0.12 |
| | 0.18 | 0.39 | 0.17 | 0.46 |
| | - | - | 0.06 | 0.38 |
| | -0.15 | -0.22 | -0.14 | -0.24 |
| | - | - | 0.06 | 0.11 |
| | 0.03 | 0.19 | 0.03 | 0.20 |
| GH | -0.07 | -0.28 | - | - |
| Hemoglobin | 0.47 | 0.30 | 0.44 | 0.31 |
| | -0.05 | -0.16 | - | - |
| | -0.12 | -0.23 | - | - |
| | - | - | -0.03 | -0.07 |
| | 0.19 | 0.25 | 0.18 | 0.28 |
| IL-8 | - | - | 0.17 | 0.63 |
| | -0.05 | -0.24 | - | - |
| | 0.02 | 0.09 | 0.06 | 0.27 |
| | - | - | -0.07 | -0.21 |
| | - | - | -0.07 | -0.25 |
| | - | - | 0.02 | 0.06 |
| | -0.02 | -0.04 | -0.14 | -0.28 |
| | -0.21 | -0.42 | -0.24 | -0.50 |
| SAP | -0.28 | -0.65 | -0.29 | -0.79 |
| | 0.00 | 0.01 | - | - |
| | -0.11 | -0.33 | -0.12 | -0.42 |
| | - | - | -0.01 | -0.05 |
| | - | - | 0.28 | 0.36 |
| Transferrin | 0.50 | 0.66 | - | - |
| | 0.17 | 0.29 | 0.23 | 0.44 |
| | -0.07 | -0.16 | - | - |
| | 0.07 | 0.23 | - | - |

The iron-transporting glycoprotein transferrin had the largest effect size in the 7-15-month model, both in terms of its estimated coefficient (a highly statistically significant 0.50 (0.24, 0.75) increase in the predicted LAZ-score) and the height difference predicted, with a 17-month-old child at the 75^{th} percentile of plasma transferrin concentration being two thirds of a centimeter taller than one at the 25^{th}. Hemoglobin had the second largest absolute coefficient value in the 7-15-month model, a slightly significant 0.47 (0.02, 0.93), but the second largest difference in height was predicted by SAP, which also had a highly statistically significant coefficient estimate: a child at the 3rd quartile of its distribution was predicted to be 0.65 cm shorter than one at the 1st quartile. Other biomarkers for which the 7-15-month model predicted large and statistically significant negative effects include the hormones adiponectin and GH (predicted height differences of around -0.4 cm and -0.28 cm respectively) and apolipoprotein (Apo) C-I (-0.3 cm), while AGP had a slightly statistically significant positive effect.

Several biomarkers that had large effect sizes in the 7-15-month model (transferrin, Apo C-I and GH) were not included in the 7-24-month database because no values were available at 24 months of age. In that model, hemoglobin again had the largest coefficient estimate, a non-significant 0.44 (-0.03, 0.91), while SAP predicted the largest difference in height between the extremes of the interquartile range of the analyte’s distribution at 24 months: 26-month-old children with high SAP concentration at 24 months were predicted to be 0.79 cm shorter than their low-SAP counterparts. The next largest differences were for the chemokine Interleukin-8 (IL-8; 0.63 cm taller) and adiponectin (0.54 cm shorter), the latter having a highly statistically significant effect estimate. Proline, arginine, tryptophan and SHBG also all had slightly statistically significant coefficient estimates and predicted among the largest height differences.

The final 7-15-month model explained 43.0% of the variance in the LAZ-score according to the R^{2} statistics, with 23.0% of the variance explained solely by the selected subset of biomarkers (the partial R^{2} statistic excluding the non-biomarker covariates). The equivalent proportions for the final 7-24-month model were 39.6% and 17.7% respectively. Adjusting instead for contemporaneous LAZ-score increased the total variance explained, giving R^{2} statistics of 83.7% and 84.3% for the 7-15-month and 7-24-month models respectively, but decreased the proportion explained by the biomarker subsets to 9.6% and 7.3% respectively, demonstrating that growth already attained has far more explanatory power for modeling short-term future growth than any combination of biomarkers.

Numerous biomarkers, including Eotaxin-3, citrulline, myoglobin, lactulose, and SHBG, exhibited evidence of having non-linear relationships with the outcome when visualized in polynomial smooth plots (

Analytical techniques such as multiplex immunoassays and mass spectrometry are increasingly being used in human studies to enable the quantification of ever more diverse and extensive panels of analytes in biological samples, many of which have biological functions that have yet to be fully characterized. At the same time, advanced statistical learning methods have emerged that can be used to identify patterns in large datasets. This study brings together these two developments and applies them to an issue that has received growing attention in recent years but has yet to be fully resolved–identifying prognostic biomarkers of EED that can predict future linear growth over time windows relevant to clinical intervention. In a birth cohort recruited from a low-resource setting in Peru, this study reports the distributions of 180 candidate biomarkers in fecal, urinary and plasma samples, of which 110 met the criteria for inclusion in variable-subsetting penalized regression models–the largest number of markers ever considered in a study of this nature.

The final subsets selected by SCAD penalty included numerous biomarkers that previous studies have implicated as potential predictors of linear growth and markers of gut function. The essential amino acid tryptophan has previously shown promise as a prognostic indicator of EED due to its role in normal infant growth and its hypothesized correlation with indoleamine 2,3-dioxygenase 1 (IDO1) activity in states of chronic low-grade endotoxin exposure [

For some other biomarkers in the subsets, evidence in previous literature on EED is more scant though known mechanisms nonetheless exist through which they might plausibly track nutritional status. Most obvious of these is hemoglobin, long the gold standard marker of severe anemia and therefore of its attendant delaying effects on growth and development [

While the SCAD-selected subset was used for the final models because it yielded the lowest cross-validated error and explained a larger proportion of the variance, it is notable that this penalty selected several biomarkers that had small, non-significant effect estimates and did not select several biomarkers that had statistically significant single-biomarker effect sizes and known associations with nutritional outcomes (such as IGF-1 and ferritin). Though SCAD has been used in numerous studies of EED biomarkers [

For other biomarkers in the subsets, the functions or pathways through which they might impact growth are as yet unclear, which demonstrates the hypothesis-generating potential of this approach. SHBG is of interest in biomarker research for its association at low levels with type-II diabetes and metabolic syndrome but, although elevated SHBG is seen following weight loss, this glycoprotein has not previously been considered as a prognostic marker of growth faltering [

C-Reactive Protein (CRP), which multiple previous studies have found to be a promising biomarker [

Although ferritin, the body’s stored form of iron, has been implicated previously [

Several other limitations warrant highlighting. Most associations that were apparently statistically significant in the single biomarker models appeared much less so after accounting for the FDR–indeed, only adiponectin remained significant at the Bonferroni-corrected α level. Furthermore, the results of the adjusted subset models do not account for the variable selection in the first stage SCAD model, a post-selection inference problem that can lead to inflated type-1 errors and overly narrow confidence intervals [

Applying the penalized regression models to the database that included the observations at 24 months of age did not improve the predictive capability of the model. Similarly, the final 7-24-month model explained a smaller proportion of the variance in the outcome than the 7-15-month model. However, for some biomarkers that were included in both models, the 7-24-month model tended to give larger and more statistically significant effect size estimates than the 7-15-month model (with the notable exception of hemoglobin). The difference in explanatory power may arise because the 7-24-month database did not include transferrin (which was not tested at 24 months of age), the biomarker with the largest effect size in the 7-15-month final model.

Studies with more intensive sample collection and frequent follow-up are needed to explore random effects and short-term intra- and inter-subject variability of these biomarkers as well as those that were excluded from this analysis and to more precisely model their effects on growth [

The expanded testing of analytes chosen for their characterization as being important immune and metabolic regulators pertinent to child growth revealed several important findings. This selected subset of biomarkers explained 17.7–23.0% of the variance in LAZ score with measurements taken at 2 or 3 time points, compared to a single biomarker such as MPO which only accounted for 2.8% of the variance with monthly follow-up up to age 3 years in the same population [

In summary, penalized regression modeling approaches, most notably SCAD, can be used to select subsets from large panels of candidate biomarkers of EED, providing translational value both as further evidence for known markers and as a source of hypotheses about new ones. Adiponectin, IL-8, proline, SAP and transferrin, among others, are promising plasma biomarkers of EED.


We wish to thank participants, their families and the study community for their dedicated time and effort to better the understanding of the transmission and more enduring impact of enteric infections in early childhood. We would also like to thank Drs. Leah Jager (JHSPH) and William Pan (Duke University) for consultation regarding the statistical analysis, and Dr. Ben Jann (University of Bern, Switzerland) for guidance in generating the figures. We would like to acknowledge support for the statistical analysis from the National Center for Research Resources and the National Center for Advancing Translational Sciences (NCATS) of the National Institutes of Health through Grant Number 1UL1TR001079.