Penalized regression models to select biomarkers of environmental enteric dysfunction associated with linear growth acquisition in a Peruvian birth cohort

Environmental enteric dysfunction (EED) is associated with chronic undernutrition. Efforts to identify minimally invasive biomarkers of EED reveal an expanding number of candidate analytes. An analytic strategy is reported to select among candidate biomarkers and systematically express the strength of each marker’s association with linear growth in infancy and early childhood. 180 analytes were quantified in fecal, urine and plasma samples taken at 7, 15 and 24 months of age from 258 subjects in a birth cohort in Peru. Treating the subjects’ length-for-age Z-score (LAZ-score) over a 2-month lag as the outcome, penalized linear regression models with different shrinkage methods were fitted to determine the best-fitting subset. These were then included with covariates in linear regression models to obtain estimates of each biomarker’s adjusted effect on growth. Transferrin had the largest and most statistically significant adjusted effect on short-term linear growth as measured by LAZ-score–a coefficient value of 0.50 (0.24, 0.75) for each log2 increase in plasma transferrin concentration. Other biomarkers with large effect size estimates included adiponectin, arginine, growth hormone, proline and serum amyloid P-component. The selected subset explained up to 23.0% of the variability in LAZ-score. Penalized regression modeling approaches can be used to select subsets from large panels of candidate biomarkers of EED. There is a need to systematically express the strength of association of biomarkers with linear growth or other outcomes to compare results across studies.


Introduction
Chronic undernutrition affects around one in three children under age five, rendering them susceptible to prolonged and more severe infections and putting them at increased risk of mortality [1]. Growth faltering in undernourished children begins to accrue early in life, is generally irreversible and leads to chronic sequelae such as impaired cognitive development and short stature that last into adulthood impeding economic productivity and increasing the risk of low birthweight in offspring [2]. Many evidence-based interventions targeting infant growth demonstrate only modest improvements in outcomes in effectiveness trials [3], a gap that, it is increasingly suspected, may be partially explained by a phenotype of intestinal abnormalities known as environmental enteric dysfunction (EED) [4], which is gaining recognition as a neglected disease [5]. According to the EED hypothesis, concurrent exposures to multiple enteric pathogens in already undernourished children cause cumulative damage to their guts' surface, increasing its permeability to microbes and large molecules, causing systemic inflammation and impairing uptake and utilization of nutrients [6][7][8], which in turn leads to suboptimal growth [9].
Studying the impact of EED is challenging. Gold standard diagnostic tests for other enteropathies, such as celiac and Crohn's disease, include endoscopy and gut biopsy, invasive and demanding procedures that cannot feasibly be deployed in resource-constrained settings or to assess disease burden at the population level [10]. For this reason, there is considerable interest in identifying and validating biomarkers of EED that can be used as surrogate endpoints in population-based studies and for evaluating nutrition and hygiene interventions [11]. The most widely adopted biomarkers of EED use saccharide-based permeability assays like the lactulose/mannitol test [12]. However, such tests, while non-invasive, have well-documented limitations in EED-endemic populations: they take hours to administer and require samples to be shipped to well-equipped facilities, which makes them cumbersome, expensive and impractical for screening and randomization in intervention trials [13]. Several fecal biomarkers, such as alpha-1-antitrypsin, myeloperoxidase and neopterin, have been shown to have complex associations with growth outcomes [9,11], while certain plasma biomarkers show correlations with suboptimal growth, including the amino acid tryptophan and its ratio to its derivative, kynurenine [14].
Recently developed methods allow for quantifying large panels of soluble analytes in blood that relate to inflammation or immune status [15]; however, there is a lack of consensus about how to select the most important markers from among these panels and quantify their association and explanatory power with respect to specific disease outcomes relevant to EED (growth, cognitive function, immune activation, intestinal permeability, nutrient bioavailability, and hormones that alter growth and metabolism) [11,16]. Machine learning approaches have been used in biomarker analyses to identify best subsets of predictors from among large databases of candidate markers [16][17][18]. More specifically, penalized regression methods estimate coefficient values for modeled variables while applying penalties to those that increase model complexity more than they improve goodness of fit, assigning such variables a coefficient value of ("shrunk" to) zero. Those variables that are assigned non-zero coefficients can be interpreted as belonging to the subset that best predicts the outcome. Although these methods do not themselves report standard errors or adjust for within-cluster correlation in longitudinal data, the selected subsets, once identified, can be included in more traditional multivariate regression models and their effect sizes described by conventional methods.
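To make the idea of shrinkage to zero concrete, the following is a minimal sketch in Python (not the software used in this study) of the soft-thresholding rule that an ordinary LASSO penalty applies to a single standardized predictor. The function name, marker names and numeric values are illustrative assumptions and do not come from the study data.

```python
def soft_threshold(z, lam):
    """LASSO solution for one standardized predictor.

    z   : unpenalized (least-squares) coefficient estimate
    lam : tuning parameter (lambda >= 0); larger values shrink more
    """
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # weak associations are set exactly to zero


# Hypothetical unpenalized coefficients for four candidate markers
unpenalized = {
    "marker_a": 0.50,
    "marker_b": -0.30,
    "marker_c": 0.04,
    "marker_d": -0.08,
}
lam = 0.10  # hypothetical tuning parameter

shrunk = {name: soft_threshold(z, lam) for name, z in unpenalized.items()}

# Markers with non-zero coefficients form the selected subset
selected = [name for name, b in shrunk.items() if b != 0.0]
print(selected)
```

In this toy case the two markers with weak unpenalized associations are dropped, while the two stronger ones are retained with coefficients pulled toward zero.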
The objective of this study was to identify clinically relevant biomarkers of the precursors of EED that can inform intervention early in the disease process. To this end, penalized regression approaches for variable subset selection were applied to a large panel of candidate biomarkers measured in a cohort of Peruvian infants to identify the optimal subset that is most predictive of nutritional status (length-for-age Z-score, LAZ-score) over a two-month lag.

Ethical approval and consent to participate
Ethical approval for MAL-ED was given by the Johns Hopkins Institutional Review Board as well as the Ethics Committee of Asociacion Benefica PRISMA, and the Regional Health Department of Loreto. Written informed consent was obtained from the caregiver of every participating child.

Study population
treated as a continuous variable. Its distribution in this study population has also been described elsewhere [11].

Exposure variables
The primary exposure variables were 180 time-varying candidate fecal, urinary and plasma biomarkers of EED compiled from the following panels (biomarker names, abbreviations and units are listed in the supporting information):
1. Three overlapping panels of in-house quantitative, multiplexed immunoassays of cytokines, chemokines, hormones and other regulators of metabolism and growth [24], each run at the Myriad RBM laboratories (Austin, TX) on a separate subset of blood samples from the cohort, including:
a. 86 analytes from 20 samples taken at age 7 months from a sub-sample of subjects and run in June 2013. These 20 subjects (10 cases and 10 controls) were selected for more expansive testing to examine extremes in growth in this setting over the target period between 6 and 15 months, when exclusive breastfeeding is no longer the optimal feeding practice. Cases (positive deviants) were those subjects who grew by >0.77 LAZ, while controls (negative deviants) were selected from those subjects who experienced a change in LAZ of <0.25.
c. 59 analytes from 443 samples taken at the target ages of 7 and 15 months (though with a small number taken at 8-9 or 16-18 months) run in January 2015.
2. 9 chemokine and 9 proinflammatory assays run on 596 of the same blood samples as 1 a-c at a laboratory at Johns Hopkins University (Baltimore, MD) in 2013-2014.
3. The amino acids citrulline and tryptophan and the latter's metabolite kynurenine (umol/L) quantified in 640 of the same blood samples by liquid chromatography-mass spectrometry (LCMS) in the Oregon Analytics laboratory in 2015 [14].
4. 51 other biogenic amines quantified in 464 of the same blood samples by LCMS at a laboratory in Imperial College London in 2017 [25].
5. Several plasma analytes measured in the same blood samples as part of the MAL-ED protocol, including Alpha-1-acid glycoprotein (AGP, mg/dl, measured by radioimmune diffusion assay in 618 samples), Insulin-like growth factor 1 (IGF-1) and IGF-binding protein 3 (IGFBP-3), measured by enzyme-linked immunosorbent assay (ELISA) in 566 and 597 samples respectively, and hemoglobin (g/dL, measured by Hemocue).
7. 5 urinary biomarkers calculated from lactulose to mannitol recovery tests of intestinal permeability performed on urine samples collected at 3, 6, 9 and 15 months of age.
Table 1 shows the number of biological samples and analytes available in each panel by age of the subjects. In addition, the following variables were included as potential confounders: infants' sex, birthweight, breastfeeding status on the previous day (a time-varying categorical variable with four categories: "exclusively breastfed", "partially breastfed", "predominantly breastfed" and "not breastfed"), age in whole months (modeled using linear and quadratic terms) and mother's height at the time of birth.

Statistical analysis
The fecal and urinary samples were matched to the plasma biomarker values that were closest in age, and those that were not matched to any blood sample were excluded from the analysis. Exposure values were lagged by two months so that the analysis assessed the association between the subjects' LAZ-score at month of age j and the exposures measured at age j-2 months. A 2-month lag was chosen because it is a length of time at which the impacts on a child's growth of interventions such as steroids [27], chemotherapy [28] or treatment for severe acute malnutrition [29] become manifest. It therefore offers a feasible time window for clinical intervention and one in which meaningful changes in ponderal growth associated with important physiologic determinants can be reproducibly detected. Two months has also been demonstrated to be optimal for predicting future growth trajectory using fecal biomarkers [11]. All biomarkers were log-transformed with base 2. Because numerous biomarkers were either only available for samples collected at 24 months of age, or only for those collected around 7 and 15 months of age, the following analyses were performed on two subsets of the full biomarker database:
2. 7-24-month database: This included the 24-month samples but only for those biomarkers that were included in both panels 1.b and 1.c as well as panels 1.a and 2-7, resulting in 639 observations and 80 biomarkers.
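The lagging and log transformation described above can be sketched as follows. This is an illustrative Python example only; the data layout (dictionaries keyed by subject and age in months), the function name and all values are hypothetical and are not taken from the study database.

```python
import math

# Hypothetical long-format records: (subject_id, age_months) -> value
biomarker = {("s1", 7): 8.0, ("s1", 15): 16.0, ("s2", 7): 4.0}
laz = {("s1", 9): -1.2, ("s1", 17): -1.5, ("s2", 9): -0.8, ("s2", 17): -1.0}

LAG_MONTHS = 2  # lag length stated in the text


def lagged_pairs(exposure, outcome, lag=LAG_MONTHS):
    """Pair the outcome at age j with the log2 exposure measured at age j - lag."""
    pairs = []
    for (subject, age), y in sorted(outcome.items()):
        x = exposure.get((subject, age - lag))
        if x is not None and x > 0:
            # Base-2 log: a one-unit increase corresponds to a doubling
            pairs.append((subject, age, math.log2(x), y))
    return pairs


pairs = lagged_pairs(biomarker, laz)
for row in pairs:
    print(row)
```

Outcome records with no exposure measured two months earlier (here, subject s2 at 17 months) simply drop out of the paired dataset in this sketch.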
Missing data. Non-detectable biomarker values, for which the analyte concentration was below the lower limit of quantification (LLOQ), were substituted with LLOQ/√2 [30]. No standard equivalent approach exists for substituting values that are above the upper limit of quantification (ULOQ); however, this only affected a small number of values for biomarkers in panel 4, which were treated as missing values. Almost all biomarkers and subjects had some number of missing values. Biomarkers for which more than 40% of the original values were missing were excluded from the imputation and further analysis, as were variables with fewer than 25 unique values within the detectable range. Observations that had missing values for more than 40% of the remaining biomarkers were excluded from the analysis. A small number of missing length measurements (n = 19, 3.0% of total) were linearly interpolated and extrapolated based on the actual or target date of assessment before calculating LAZ-scores. For time-fixed baseline variables (birth weight and maternal height), the small number of missing values were substituted with the sample mean of that variable. All other missing values of the biomarker exposures were imputed using multivariate normal regression (MVN) with an iterative Monte Carlo method to accommodate the arbitrary missing-value patterns of the continuous variables [31]. Missing values, of which there were 5,279 (11.1%) in the 7-15-month database and 4,508 (10.6%) in the 7-24-month database, were substituted with the average of the imputed values from 10 MVN imputations. The kynurenine/tryptophan (K/T) and lactulose/mannitol ratios were excluded from imputation and recalculated afterwards from their component biomarkers.
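As an illustration of the first two rules above, the sketch below substitutes LLOQ/√2 for non-detectable values and drops biomarkers with more than 40% missing values. It is written in Python with hypothetical marker names, LLOQs and values. How non-detects are coded (here as the string "<LLOQ") and the choice not to count them as missing are assumptions of this sketch; the unique-value rule and the MVN imputation itself are not shown.

```python
import math


def substitute_below_lloq(values, lloq):
    """Replace non-detectable values (coded "<LLOQ") with LLOQ / sqrt(2).

    None marks a missing value and is left untouched for later imputation.
    """
    fill = lloq / math.sqrt(2)
    return [fill if v == "<LLOQ" else v for v in values]


def fraction_missing(values):
    """Proportion of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


MAX_MISSING = 0.40  # exclusion threshold stated in the text

# Hypothetical panel: marker name -> (LLOQ, raw values)
panel = {
    "marker_a": (2.0, [5.1, "<LLOQ", 7.3, None, 6.0]),
    "marker_b": (1.0, [None, None, 3.2, None, "<LLOQ"]),
}

retained = {}
for name, (lloq, raw) in panel.items():
    if fraction_missing(raw) > MAX_MISSING:
        continue  # excluded before imputation
    retained[name] = substitute_below_lloq(raw, lloq)

print(retained)
```

Here marker_b is excluded because three of its five values are missing, while marker_a is kept with its single non-detect replaced and its one missing value left for imputation.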
Variable selection. The retained biomarkers were included in penalized linear regression models with three different shrinkage methods that have been used in other studies of EED biomarkers: Adaptive LASSO (Least Absolute Shrinkage and Selection Operator), Minimax Concave Penalty (MCP) and Smoothly Clipped Absolute Deviation (SCAD) penalties [16,17], with values for the tuning parameter λ determined through 10-fold cross validation. For each model, the variables assigned non-zero coefficients were treated as the optimal, best-predicting subset, and the subset for the method that yielded the lowest cross-validation error (calculated from the mean-squared error or deviation from the fitted mean) was retained in a final multivariable model.
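The difference between these penalties can be seen by evaluating each one for a single coefficient. The sketch below, in Python, uses the standard published forms of the L1, MCP and SCAD penalties; the shape parameters (gamma = 3 for MCP, a = 3.7 for SCAD) are commonly used defaults and are an assumption here, since the values used in the study are not stated. Adaptive LASSO applies the L1 penalty with coefficient-specific weights, which is not shown.

```python
def lasso_penalty(beta, lam):
    """L1 penalty: grows linearly without bound."""
    return lam * abs(beta)


def mcp_penalty(beta, lam, gamma=3.0):
    """Minimax Concave Penalty (requires gamma > 1)."""
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b * b / (2.0 * gamma)
    return gamma * lam * lam / 2.0  # constant for large coefficients


def scad_penalty(beta, lam, a=3.7):
    """Smoothly Clipped Absolute Deviation penalty (requires a > 2)."""
    b = abs(beta)
    if b <= lam:
        return lam * b  # identical to L1 near zero
    if b <= a * lam:
        return (2.0 * a * lam * b - b * b - lam * lam) / (2.0 * (a - 1.0))
    return lam * lam * (a + 1.0) / 2.0  # constant for large coefficients


lam = 1.0  # hypothetical tuning parameter
for beta in (0.5, 2.0, 10.0):
    print(beta, lasso_penalty(beta, lam), mcp_penalty(beta, lam), scad_penalty(beta, lam))
```

All three penalize small coefficients at a similar rate, but MCP and SCAD level off for large coefficients, so strongly associated variables are shrunk less than under an L1 penalty.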
Effect modeling. Regression models were fitted with robust variance estimation to allow for intra-subject correlation. Models were fitted first for each of the candidate biomarkers separately (adjusting for the a priori-selected non-biomarker covariates) in order to report their independent effects and statistical significance, and then for a multi-variable model that included all biomarkers selected for the best-fitting subset, to estimate the adjusted effect of each in the presence of the others and their combined effect on LAZ-score. To account for the false discovery rate (FDR) due to the large number of comparisons, p-values from the separately modeled biomarkers were compared visually in scatterplots with their corresponding q-values (a measure of significance in terms of the FDR [32], calculated using the method proposed by Simes [33]) and with a Bonferroni-corrected α value calculated from the number of comparisons. The effect measures from the single-biomarker and adjusted subset models were visualized using forest plots. For the biomarkers included in the final, multi-variable models, the coefficient estimates were reported along with the difference in a child's height predicted by the model between subjects at the 25th and the 75th percentile of each included biomarker's distribution at the age of the final included sample (15 or 24 months, depending on which database was used), holding all other included biomarkers at their mean values and based on the standard deviation in height at that age reported in the WHO child growth standards [21]. The R² statistic for the final subset model was reported along with the partial R² for all included biomarker terms as an estimate of the proportion of the total variability in the outcome that was explained by the selected biomarker subset.
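The multiple-comparison quantities mentioned above can be sketched as follows. This Python example implements a common formulation of the Simes step-up procedure (often referred to as the Benjamini-Hochberg procedure) for q-values, together with the Bonferroni-corrected α. Whether it matches the exact routine used in the study is an assumption, and the p-values are hypothetical.

```python
def simes_q_values(p_values):
    """Step-up FDR q-values: q_(i) = min over j >= i of m * p_(j) / j."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value to the smallest
        idx = order[rank - 1]
        running_min = min(running_min, m * p_values[idx] / rank)
        q[idx] = running_min
    return q


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons


# Hypothetical p-values from four single-biomarker models
p = [0.01, 0.04, 0.03, 0.20]
q = simes_q_values(p)
threshold = bonferroni_alpha(0.05, len(p))

for p_i, q_i in zip(p, q):
    print(p_i, round(q_i, 4), p_i < threshold)
```

In this toy example three of the four p-values fall below an uncorrected α of 0.05, but only the smallest passes the Bonferroni threshold, which mirrors the kind of contrast the scatterplots are intended to show.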
Results from the final models were compared with those obtained from adjusting for LAZ-score measured contemporaneously with the biomarker (in place of the other covariates), in order to compare the prognostic potential of the final biomarker subset in predicting future growth relative to a natural and existing alternative, namely attained LAZ-score. The potential for non-linear relationships between biomarkers in the final subsets and LAZ-score was assessed by generating nonparametric smooth plots and by applying a multivariate spline model-selection algorithm to the final models. Finally, as a validation exercise, associations between each of three of the most important biomarkers (expressed per standard deviation [SD]) and changes in LAZ over increasing lag-lengths of 1-10 months adjusting for contemporaneous LAZ were plotted to assess the performance using an existing method that has previously been used for tryptophan and citrulline and compared to a comparator biomarker of a known endocrinologic agent-Insulin-like growth factor 1 (IGF-1), replicating the methodology of Kosek and colleagues [14]. Analyses were carried out using Stata 15.1 [34] and R 3.6.1.
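The interquartile height contrast described under effect modeling can be read as a simple calculation: the coefficient per log2 unit is multiplied by the log2 distance between the 75th and 25th percentiles to give a difference in LAZ-score, which is then converted to centimeters using the standard deviation of height at the relevant age. The Python sketch below shows one plausible reading of that calculation, not the authors' code. The percentile values and the standard deviation are hypothetical placeholders and are not taken from the study data or the WHO tables.

```python
import math


def iqr_height_difference(beta, q25, q75, sd_height_cm):
    """Predicted height difference (cm) between the 75th and 25th percentiles.

    beta         : coefficient (change in LAZ-score per 1-unit log2 increase)
    q25, q75     : 25th and 75th percentiles of the biomarker (original units)
    sd_height_cm : standard deviation of height at the target age, in cm
    """
    laz_difference = beta * (math.log2(q75) - math.log2(q25))
    return laz_difference * sd_height_cm


# Hypothetical inputs: a doubling across the interquartile range
beta = 0.50          # coefficient per log2 unit
q25, q75 = 2.0, 4.0  # placeholder percentiles
sd_height_cm = 3.0   # placeholder SD, not a WHO reference value

difference_cm = iqr_height_difference(beta, q25, q75, sd_height_cm)
print(difference_cm)
```

A negative coefficient gives a negative difference under this reading, that is, children at the 75th percentile are predicted to be shorter than those at the 25th.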

Results
Summary statistics of the distributions of the 180 candidate biomarkers and whether they met the criteria for inclusion in further analysis are presented in S1 Table in the supporting information. A participant flowchart is provided as S1 Fig (supporting information). Before applying exclusion criteria, 639 observations were available for 258 of the 303 enrolled subjects for whom blood samples were available relating to 180 biomarkers. 23 of the biomarkers were only available in the case control panel (panel 1a.) and so were excluded from further analysis for only having 20 available observations. A further 47 biomarkers were excluded from the 7-15-month database either because more than 40% of their values were missing, fewer than 25 were within the detectable range, they were only available at 24 months of age (panel 1c.) or some combination of these. 77 biomarkers were excluded from the 7-24-month database due to missingness, detectability, or because they did not have values available at 24 months. Overall, 110 biomarkers were retained for analysis in the 7-15-month database, and 80 in the 7-24-month database. Table 2 shows the number of biomarkers selected (assigned non-zero coefficients) and the cross-validation error and R² values for the three penalized regression models fitted on each of the two biomarker databases. MCP selected the smallest subset of biomarkers when fitted to the 7-15-month database but not the 7-24-month database, for which adaptive LASSO selected the smallest. For both databases, the SCAD penalty resulted in the largest subset, the highest cross-validated R² and the lowest cross-validation error (jointly with MCP in the 7-24-month model) and was chosen as the subset for subsequent analyses.
association with the outcome, compared with 28 for which a positive association was predicted. In 17 of these models, the estimate was statistically significant at the uncorrected α = 0.05 level.
42 of the 80 biomarkers in the 7-24-month database had a negative association with the outcome in the single biomarker models, 38 had a positive association, and 12 had statistically significant estimates. No fecal or urinary biomarkers were included in the final subsets selected by adaptive LASSO, although in both the 7-15 and 7-24-month models, the SCAD and MCP penalties assigned small, non-zero coefficients to fecal MPO and SCAD also selected urinary lactulose.
In both databases, just 5 biomarkers were selected by all three penalties. In the 7-15-month models, these were hemoglobin, Immunoglobulin A (IgA), Insulin-like growth factor-binding protein 3 (IGFBP-3), Pulmonary and Activation-Regulated Chemokine (PARC) and Thyroid-Stimulating Hormone (TSH), while in the 7-24-month models these included adiponectin and IgM instead of IgA and TSH. In both databases, all biomarkers selected by MCP were also selected by SCAD, and in the 7-24-month database all biomarkers selected by adaptive LASSO were also selected by the other two penalties. SCAD selected 5 biomarkers in the 7-15-month database and 3 in the 7-24-month database that were not included in either of the other two subsets; however, only one of these, growth hormone (GH), was significant in the final model.
Table 3 presents the coefficient estimates from the final 7-15-month and 7-24-month linear regression models for a 1-unit log2 increase in each of the biomarkers selected by SCAD, along with the difference in child's height predicted for children aged 17 and 26 months respectively at the 25th and 75th percentile of the biomarker distribution (holding all other included biomarkers at their sample mean). The 36 selected biomarkers include numerous amino acids, chemokines, hormones, glycoproteins and proteins along with two antibodies, three apolipoproteins, the enzyme myeloperoxidase, the sugar lactulose and 5-OH-Indole-3-acetic Acid (5-HIAA), the metabolite of serotonin. Thirteen biomarkers were included in both final models, while 11 were only included in the 7-15-month model and 12 only in the 7-24-month model.
The iron-transporting glycoprotein transferrin had the largest effect size in the 7-15-month model, both in terms of its estimated coefficient, a highly statistically significant 0.50 (0.24, 0.75) increase in the predicted LAZ-score, and in terms of the height difference predicted, with a 17-month-old child at the 75th percentile of plasma transferrin concentration being two thirds of a centimeter taller than one at the 25th. Hemoglobin had the second largest absolute coefficient value in the 7-15-month model, a marginally significant 0.47 (0.02, 0.93), but the second largest difference in height was predicted by SAP, with a child at the 3rd quartile of its distribution predicted to be 0.65 cm shorter than one at the 1st quartile; SAP also had a highly statistically significant coefficient estimate. Other biomarkers for which the 7-15-month model predicted large and statistically significant negative effects include the hormones adiponectin and GH (predicted height differences of around -0.4 cm and -0.28 cm respectively) and apolipoprotein (Apo) C-I (-0.3 cm), while AGP had a marginally statistically significant positive effect. Several biomarkers that had large effect sizes in the 7-15-month model (transferrin, Apo C-I and GH) were not included in the 7-24-month database because no values were available at 24 months of age. Instead, in that model, while hemoglobin again had the largest coefficient estimate, a non-significant 0.44 (-0.03, 0.91), SAP predicted the largest difference in height between the extremes of the interquartile range of the analyte's distribution at 24 months: 26-month-old children with high SAP concentration at 24 months were predicted to be 0.79 cm shorter than their low-SAP counterparts. The next largest differences were predicted by the chemokine Interleukin-8 (IL-8; 0.63 cm taller) and adiponectin (0.54 cm shorter), the latter having a highly statistically significant effect estimate.
Proline, arginine, tryptophan and SHBG also all had marginally statistically significant coefficient estimates and predicted some of the largest height differences.

The final 7-15-month model explained 43.0% of the variance in the LAZ-score according to the R² statistic, with 23.0% of the variance explained solely by the selected subset of biomarkers (the partial R² statistic excluding the non-biomarker covariates). The equivalent proportions for the final 7-24-month model were 39.6% and 17.7% respectively. S2 Table in the supporting information shows the equivalent results when the non-biomarker covariates were replaced in the final models with contemporaneous LAZ-score to adjust for attained growth.
In the presence of this variable, many of the effect size estimates decreased in magnitude and

Fig 1. Forest plots of coefficient estimates and 95% confidence intervals from linear regression models of single biomarkers, for subsets of multiple biomarkers selected by the three penalized regression methods and from a final multi-variable linear regression model of the subset with the lowest cross-validation error adjusting for covariates.
https://doi.org/10.1371/journal.pntd.0007851.g001

Fig 2. Scatterplot comparing the p-values from the separately modeled biomarkers to their corresponding q-values calculated using Simes' method [33] and to the Bonferroni-corrected α values (represented by the dashed lines). Biomarkers for which q<0.1 are labeled.
https://doi.org/10.1371/journal.pntd.0007851.g002

Several biomarkers did, however, increase in statistical significance upon adjustment for attained growth, including Alpha-2-Macroglobulin (A2Macro), fecal MPO, tryptophan and TSH in the 7-15-month model and proline and hemoglobin in the 7-24-month model. Adjustment for baseline LAZ-score also greatly increased the proportion of the variability explained by the models (R² statistics of 83.7% and 84.3% for the 7-15-month and 7-24-month models respectively) but decreased the proportion explained by the biomarker subsets (9.6% and 7.3% respectively), demonstrating that growth already attained has far more explanatory power for modeling short-term future growth than any combination of biomarkers.

Numerous biomarkers, including Eotaxin-3, citrulline, myoglobin, lactulose, and SHBG, exhibited evidence of non-linear relationships with the outcome when visualized in polynomial smooth plots (S2-S6 Figs respectively in the supporting information). When a multivariate spline model-selection algorithm was run on each of the two final biomarker subsets, none of the biomarkers improved the model when represented by multiple cubic splines relative to linear terms, with the exceptions of proline in the 7-15-month model (4 degrees of freedom) and Thymus and Activation-Regulated Chemokine (TARC) and Monocyte Chemotactic Protein 4 (MCP-4) (2 degrees of freedom each) in the 7-24-month model (results not reported). Fig 3 shows the results of the validation exercise in which a previously published methodology was replicated using three biomarkers from the final subset identified here along with IGF-1 as a comparator. This analysis treated the difference in LAZ-score (ΔLAZ) over time windows of increasing length as the outcome and standard deviations of the biomarkers as exposures, and adjusted for baseline LAZ as well as the other covariates. Adiponectin, which had previously exhibited a large and highly statistically significant association with nutritional status, showed no obvious trend after adjustment for attained growth, while IGF-1 and, most markedly, transferrin showed large and statistically significant associations with changes in LAZ-score over longer time windows of 5-10 months.

Discussion
Analytical techniques such as multiplex immunoassays and mass spectrometry are increasingly being used in human studies to enable the quantification of ever more diverse and extensive panels of analytes in biological samples, many of which have biological functions that have yet to be fully characterized. At the same time, advanced statistical learning methods have emerged that can be used to identify patterns in large datasets. This study brings together these two developments and applies them to an issue that has received growing attention in recent years but has yet to be fully resolved: identifying prognostic biomarkers of EED that can predict future linear growth over time windows relevant to clinical intervention. In a birth cohort recruited from a low-resource setting in Peru, this study reports the distributions of 180 candidate biomarkers in fecal, urinary and plasma samples, of which 110 met the criteria for inclusion in variable-subsetting penalized regression models, the largest number of markers ever considered in a study of this nature. The final subsets selected by the SCAD penalty included numerous biomarkers that previous studies have implicated as potential predictors of linear growth and markers of gut function. The essential amino acid tryptophan has previously shown promise as a prognostic indicator of EED due to its role in normal infant growth and its hypothesized correlation with indoleamine 2,3-dioxygenase 1 (IDO1) activity in states of chronic low-grade endotoxin exposure [14]. However, while tryptophan was selected by the majority of the penalized regression models, and its association with LAZ-score was statistically significant in the 7-24-month final models, it was not among the biomarkers most predictive of differences in height.
A positive association between plasma tryptophan concentration and a 6-month change in LAZ-score has already been reported in this cohort and a similar one in Tanzania, and separately in a cohort in Northeast Brazil, with effect sizes comparable to that of the final model here [14,35]. Immunoglobulin A (IgA), which was retained in the final 7-15-month model, had a small, non-significant, negative effect size consistent with that observed for IgA anti-LPS antibody, also in the Brazil cohort [35].
For some other biomarkers in the subsets, evidence in previous literature on EED is more scant, though known mechanisms nonetheless exist through which they might plausibly track nutritional status. Most obvious of these is hemoglobin, long the gold standard marker of severe anemia and therefore of its attendant delaying effects on growth and development [36]. Analysis of data from the 8-site study to which the cohort described here contributed found an association (though weaker and less significant than those found here) between hemoglobin and LAZ-score at age 5 years [37], while other studies of EED have adjusted for hemoglobin as a potential confounder [38,39]. Low levels of plasma transferrin are found during protein-energy malnutrition [40]. Adiponectin is an appetite-regulating hormone that promotes satiety and therefore may inhibit food intake, which may explain its negative association with growth [41,42]. While elevated levels of circulating adiponectin have a known negative association with obesity [43], its role in child growth is unclear, and among twins this adipokine had a positive association with birthweight-adjusted LAZ-score (counter to the negative one reported here) [42]. Leptin and the serum leptin-adiponectin ratio were found to be associated with stunting in Bangladeshi children and increased in this group following food supplementation [38]. The positive association between serum arginine concentrations and nutritional status is consistent with findings from Malawi, though the same study failed to find a significant association with proline, which was one of the more predictive of the biomarkers in these results [44].
While the SCAD-selected subset was used for the final models because it yielded the lowest cross-validation error and explained a larger proportion of the variance, it is notable that this penalty selected several biomarkers that had small non-significant effect estimates and did not select several biomarkers that had statistically significant single-biomarker effect sizes and known associations with nutritional outcomes (such as IGF-1 and ferritin). Though SCAD has been used in numerous studies of EED biomarkers [16][17][18], these findings suggest that this penalty lacks both sensitivity and specificity when applied to large panels.
For other biomarkers in the subsets, the functions or pathways through which they might impact growth are as yet unclear, which demonstrates the hypothesis-generating potential of this approach. SHBG is of interest in biomarker research for its association at low levels with type-II diabetes and metabolic syndrome but, although elevated SHBG is seen following weight loss, this glycoprotein has not previously been considered as a prognostic marker of growth faltering [45]. Though known for its association with amyloidosis, SAP is also involved in the humoral innate immune system's response to infections and might plausibly lie on the pathway connecting enteric pathogen infection to growth deficits that is specific to the EED hypothesis [46][47][48]. TBG, which binds the thyroid hormones thyroxine and triiodothyronine in the blood, downregulates the activity of hormones that stimulate metabolic rate and may influence the regulation of skeletal growth [49,50].
C-Reactive Protein (CRP), which multiple previous studies have found to be a promising biomarker [17,51], was not selected despite having a statistically significant, though small, negative effect in the single biomarker 7-24 months model. The fact that CRP is inversely related to Fetuin-A [52] and, like SAP, is a calcium-dependent ligand binding plasma protein [46] may mean that the presence of the latter protein in the final model fully accounted for any effect of CRP. The three fecal biomarkers and the urinary lactulose/mannitol ratio (along with the other four urinary markers) have shown clinical potential in previous studies [11,53] but in this analysis were not significant in any of the single-biomarker or final models. It may be the case that restricting the data to assessments at just 2-3 time points meant that the analysis was underpowered to detect the true but relatively small effects of these substances [11]. Citrulline, which has shown promise in previous studies [35], was not significant in either single biomarker model, and was selected but not significant in the final models.
Although ferritin, the body's stored form of iron, has been implicated previously [17,51] and was significant in the single-biomarker model, it was not selected here for either final model. This may be because its association with growth is mediated by the stronger and more statistically significant effect of the related glycoprotein transferrin [54]. Some biomarkers that have been implicated in other studies, such as soluble CD14 [16,17], endotoxin core antibodies (EndoCAB) [12], zonulin, intestinal fatty acid binding protein [35], retinol binding protein and calprotectin [17], were not included in any of the panels. Others were excluded from the analysis because they had too few unique observations, notably almost all the interleukins, which were only tested for in the case-control panel, a limitation of this study.
Several other limitations warrant highlighting. Most associations that were apparently statistically significant in the single-biomarker models appeared much less so after accounting for the FDR; indeed, only adiponectin remained significant at the Bonferroni-corrected α level. Furthermore, the results of the adjusted subset models do not account for the variable selection in the first-stage SCAD model, a post-selection inference problem that can lead to inflated type-1 errors and overly narrow confidence intervals [55]. However, the associations identified by this analysis should be assessed not just by their statistical significance but also by their biological plausibility, and in light of the fact that the biomarkers selected for the subset and the relative strength of their associations with the outcome are broadly consistent with known biological pathways. Another limitation is the assumption, both in the subset selection stage and in fitting the final models, that any relationships between biomarkers and LAZ-score would be linear. Exploratory analysis revealed some evidence to challenge this, which may limit the accuracy of the predictions from the linear models. However, further analysis using multivariate regression splines suggested that only a very small number of biomarkers were affected by this assumption. As consensus develops around a final set of important biomarkers of EED, such non-linear effects will need to be more rigorously characterized.
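The contrast between a Bonferroni-corrected threshold and FDR control noted above can be illustrated with a minimal sketch. It assumes the Benjamini-Hochberg step-up procedure as the FDR method, which is an assumption for illustration and not a restatement of the procedure used in this analysis. The function names and the p-values are hypothetical and are not taken from the study.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Reject where p <= alpha / m (controls the family-wise error rate)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k whose p-value is at or below alpha * k / m.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    # Reject every hypothesis ranked at or below k.
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


# Hypothetical p-values for six single-biomarker models (illustration only).
example_p = [0.0001, 0.004, 0.012, 0.03, 0.2, 0.6]
bonf = bonferroni_reject(example_p)
bh = benjamini_hochberg_reject(example_p)
print("Bonferroni rejections:", sum(bonf))
print("Benjamini-Hochberg rejections:", sum(bh))
```

With these made-up inputs the Bonferroni threshold is the stricter of the two, which mirrors the pattern described above: associations that survive FDR control need not survive a Bonferroni correction.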
Applying the penalized regression models to the database that included the observations at 24 months of age did not improve the predictive capability of the model. Similarly, the final 7-24-month model explained a smaller proportion of the variance in the outcome than the 7-15-month model. However, for some biomarkers that were included in both models, the 7-24-month model tended to give larger and more statistically significant effect size estimates than the 7-15-month model (with the notable exception of hemoglobin). The difference in explanatory power may arise because the 7-24-month database did not include transferrin (which was not tested at 24 months of age), the biomarker with the largest effect size in the 7-15-month final model.
Studies with more intensive sample collection and frequent follow-up are needed to explore random effects and short-term intra- and inter-subject variability of these biomarkers, as well as those that were excluded from this analysis, and to more precisely model their effects on growth [11]. The validity of these biomarkers as clinically relevant predictors of growth in new populations can be readily assessed given that ELISA kits for most of them are commercially available. This is important considering the high burden of stunting in under-resourced settings in low- and middle-income countries, where these biomarkers can potentially be tested in regional laboratories and the results used to inform care and programs aimed at controlling stunting.
The expanded testing of analytes, chosen because they are characterized as important immune and metabolic regulators pertinent to child growth, revealed several important findings. The selected subset of biomarkers explained 17.7-23.0% of the variance in LAZ-score with measurements taken at 2 or 3 time points, compared to a single biomarker such as MPO, which accounted for only 2.8% of the variance with monthly follow-up up to age 3 years in the same population [11]. Future studies should aim to characterize changes in LAZ-scores when assessing the interaction between EED biomarkers and intestinal infections by specific pathogens. These plasma biomarkers represent a set of surrogate outcomes that can be measured at different time points, characteristics of a good biomarker of EED that circumvent the problems associated with the lactulose/mannitol test, the current gold standard (such as the variability in its association with child growth, which, even when significant, has an effect size much smaller than that of the selected panel described here) [56].
In summary, penalized regression modeling approaches, most notably SCAD, can be used to select subsets from large panels of candidate biomarkers of EED, providing translational value in the form of further evidence for known markers and in generating hypotheses about new ones. Adiponectin, IL-8, proline, SAP and transferrin, among others, are promising plasma biomarkers of EED.
Supporting information
S1 List. Biomarkers quantified in each panel and their units. (PDF)
S1 Table. Coefficient estimates (with 95% confidence intervals) from linear regression models for biomarkers selected by SCAD along with the predicted difference in child's height 2 months after the last sample for children at the 25th and 75th percentile of the biomarker distribution adjusted for contemporaneous LAZ-score. (PDF)
S3 Table.