Real-world external validation of metabolic gestational age assessment in Kenya

Using data from Ontario, Canada, we previously developed machine learning-based algorithms incorporating newborn screening metabolites to estimate gestational age (GA). The objective of this study was to evaluate the use of these algorithms in a population of infants born in Siaya County, Kenya. Cord and heel prick samples were collected from newborns in Kenya, and metabolic analysis was carried out by Newborn Screening Ontario in Ottawa, Canada. Postnatal GA estimation models were developed with data from Ontario using multivariable linear regression with elastic net regularization. Model performance was evaluated by applying the models to the data collected from Kenya and comparing model-derived estimates of GA to reference estimates from early pregnancy ultrasound. Heel prick samples were collected from 1,039 newborns from Kenya. Of these, 8.9% were born preterm and 8.5% were small for GA. Cord blood samples were also collected from 1,012 newborns. In data from heel prick samples, our best-performing model estimated GA within 9.5 days of reference GA overall [mean absolute error (MAE) 1.35 weeks (95% CI 1.27, 1.43)]. In preterm infants and those small for GA, MAE was 2.62 (95% CI 2.28, 2.99) and 1.81 (95% CI 1.57, 2.07) weeks, respectively. In data from cord blood, model accuracy decreased slightly overall [MAE 1.44 weeks (95% CI 1.36, 1.53)]. Accuracy was not affected by maternal HIV status and improved when the dating ultrasound occurred between 9 and 13 weeks of gestation, in both heel prick and cord blood data [overall MAE 1.04 (95% CI 0.87, 1.22) and 1.08 (95% CI 0.90, 1.27) weeks, respectively]. The accuracy of metabolic model-based GA estimates in the Kenya cohort was lower than in our previously published validation studies; however, inconsistency in the timing of reference dating ultrasounds appears to have contributed to the diminished model performance.


Introduction
The need for novel, non-invasive methods to accurately estimate gestational age (GA) in low resource settings has been identified by the World Health Organization as a priority area for improving global estimation of the burden of preterm birth (birth at < 37 completed weeks of gestation). Preterm birth and being born small for gestational age (SGA; birthweight below the tenth centile for gestational age) are leading causes of infant mortality and morbidity, particularly in low- and middle-income countries (LMIC) [1,2]. Furthermore, medical needs and developmental milestones differ between term, preterm and SGA infants. Thus, accurately identifying at-risk infants at birth is important, both for estimating the true population burden of preterm birth and, potentially, for informing postnatal care and supportive resources. Although the use of first-trimester ultrasound has improved our ability to estimate GA [3], it is not widely available in all low resource settings and its implementation poses significant obstacles, including cost, training, equipment maintenance and lack of standardization [4]. In low resource settings without access to prenatal ultrasound, GA estimates are often based on last menstrual period, the accuracy of which may be affected by memory recall as well as irregular menses and maternal malnutrition [5,6]. Commonly used postnatal examination methods for GA dating of infants (e.g., Dubowitz or Ballard score) also have limited accuracy, particularly in preterm and growth-restricted infants, and their utility is further limited by challenges with feasibility and high inter-user variability [7].
Given the limitations associated with existing GA dating methods, numerous research groups are testing new ways to accurately estimate GA [8-11]. We have developed novel machine learning-based algorithms that use newborn screening metabolites and clinical and demographic covariates to estimate GA [12,13]. These algorithms were originally developed and internally validated in a large cohort of newborns in Ontario, Canada [14,15]. Refinements to the algorithms incorporated machine learning and improved the accuracy of gestational age estimations [12,16]. Here we evaluate the use of these algorithms in a population of infants born in Siaya County, Kenya.

Study setting
A detailed study protocol describing the study sites, sample collection and processing has been published previously [17]. The Kenya study site is located in Kisumu at the KEMRI Centre for Global Health Research, with field sites located in Siaya County, where a maternal-infant demographic surveillance program followed a prospective cohort of pregnant women and their infants at two community hospitals: Siaya County Referral Hospital (SCRH) and Bondo sub-County Hospital (BSCH). Eligible participants were pregnant women between the ages of 15 and 49 years who resided within a 10 km radius of the research facility, were willing to deliver in the research hospital, and were not planning to relocate within 1 year of enrollment into the surveillance program. Participants were enrolled at their first antenatal care visit (ANC-1), which typically occurred prior to 20 weeks' gestation. Participants underwent a pregnancy dating ultrasound as early as possible and were offered treatment for common illnesses, including malaria, urinary tract infections, and sexually transmitted infections. A small proportion of infants were born at home and evaluated within 72 hours of delivery.

Consent
Informed written consent was obtained from all mothers prior to study enrollment. All liveborn infants of enrolled mothers were eligible for inclusion.

Collection of newborn screening specimens
Cord blood samples were collected via syringe within 30 minutes of delivery of the placenta. Four to five drops of blood from the syringe were applied to filter paper within pre-printed circles. Heel prick samples were collected from newborns ideally between 24 and 72 hours after birth, or prior to discharge if the newborn was released from the hospital within 24 hours of delivery. The newborn's heel was warmed prior to skin puncture to promote blood flow. The puncture site was cleaned and air-dried, and a sterile lancet was used to puncture the lateral plantar aspect of the newborn's heel. The first drop of blood was wiped away and four to five drops of blood were applied within pre-printed circles of a second filter paper.
Heel and cord dried blood spot (DBS) cards were dried and stored at ambient temperature and shipped weekly to the Newborn Screening Ontario (NSO) laboratory at the Children's Hospital of Eastern Ontario in Ottawa, Canada for analysis, along with clinical and demographic information required for clinical interpretation of metabolic profiles and for metabolic GA estimation models. This information included infant sex, birthweight (in grams), multiple birth status, GA (in weeks + days), date of birth, and timing of sample collection.
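As an illustration of how a GA value recorded in weeks + days could be converted to a single numeric scale before modelling, the following is a minimal Python sketch. The string format ("38 + 4"), the function names, and the use of total days as the internal unit are assumptions made for this example and are not taken from the study's data dictionary; the preterm threshold of 37 completed weeks follows the definition given in the Introduction.

```python
import re

# Hypothetical format: "<weeks> + <days>", e.g. "38 + 4". The actual
# encoding used on the study's data forms is not described in the text.
_GA_PATTERN = re.compile(r"\s*(\d{1,2})\s*\+\s*([0-6])\s*")


def ga_to_days(text):
    """Convert a 'weeks + days' GA string into total days of gestation."""
    match = _GA_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised GA value: {text!r}")
    weeks, days = int(match.group(1)), int(match.group(2))
    return weeks * 7 + days


def ga_to_weeks(text):
    """Convert a 'weeks + days' GA string into decimal weeks."""
    return ga_to_days(text) / 7


def is_preterm(text):
    """Preterm is defined as fewer than 37 completed weeks of gestation."""
    return ga_to_days(text) < 37 * 7
```

Working in whole days internally avoids floating-point comparisons at the 37-week boundary; decimal weeks are only derived when a continuous value is needed.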

Newborn screening analysis
The newborn screening analysis process has been described in detail previously [17]. Dried blood spot samples were analyzed for the following metabolites: hemoglobin profiles, 17-hydroxyprogesterone (17-OHP), thyroid stimulating hormone (TSH), immunoreactive trypsinogen (IRT), a panel of 12 amino acids and 31 acylcarnitines, T-cell receptor excision circles (TREC), biotinidase activity, and galactose-1-phosphate uridylyltransferase activity (Table 1). Real-time screening for three conditions [congenital hypothyroidism (CH), hemoglobinopathies, and medium-chain acyl-CoA dehydrogenase deficiency (MCADD)] occurred during this study. These conditions were deemed to be high priority for reporting and were treatable at the local collection sites. Results of screening for congenital metabolic conditions will be published elsewhere.
Table 1. Newborn screening analytes included in predictive models.

Data preparation and statistical analysis
All analyses were conducted using SAS 9.4 [18] and R 3.3.2 [19]. Data preparation steps, including standardization and log transformations, are detailed in S1 Appendix. Analytes were included as candidate predictors in GA estimation models based on their routine measurement as part of Ontario's expanded newborn screening program, including hemoglobin profiles, amino acids, acylcarnitines, hormone and endocrine markers, enzymes and co-enzymes (Table 1). Newborn GA was estimated from models derived using multivariable regression coupled with elastic net regularization and including the following covariates:
1. Model 1: Birth weight, sex, multiple birth status and pairwise interactions.
2. Model 2: Birth weight, sex, multiple birth status and newborn screening analytes and pairwise interactions.
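For readers unfamiliar with elastic net regularization, the penalized least-squares objective can be written as below. This is the standard textbook form; the notation is introduced here for illustration and is not taken from S1 Appendix. Here $y_i$ is the reference GA of infant $i$, $x_i$ is the vector of covariates (main effects and pairwise interaction terms), $\lambda \ge 0$ controls the overall penalty strength, and $\alpha \in [0, 1]$ sets the balance between the lasso ($\alpha = 1$) and ridge ($\alpha = 0$) penalties. The values of $\lambda$ and $\alpha$ used in the study are not stated in this section.

```latex
(\hat{\beta}_0, \hat{\beta})
  = \arg\min_{\beta_0,\, \beta}\;
    \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top} \beta \right)^2
    + \lambda \left[ \alpha \lVert \beta \rVert_1
    + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right]
```

The $\ell_1$ term can shrink some coefficients exactly to zero, which is useful when many analytes and interaction terms are candidate predictors, while the $\ell_2$ term stabilizes estimates when predictors are correlated.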
Models were trained and internally validated in independent training and validation/test cohorts of infants from Ontario, Canada (S1 Appendix). These pretrained models were then applied to the data for infants from the external cohort to estimate GA. To evaluate model accuracy, GA estimates were compared to the ultrasound reference GA for each infant, and the residual error was calculated. Several metrics were calculated to summarize model error, including mean square error (MSE), standard error of estimation [also known as root mean square error (RMSE)], and mean absolute error (MAE), which is the average of the absolute value of the residual across all subjects (or subsets of subjects). MAE is less sensitive to outliers (large residuals) than MSE and RMSE and is a helpful metric of "average error" that is often reported in model validation studies with continuous outcomes. Additionally, we calculated the proportion of model-derived estimates that fell within ± 1 week of reference GA. MAE was the main performance metric used to evaluate model accuracy, but multiple metrics were calculated and reported to facilitate comparisons with other models developed by our group and others. The model-derived frequency of preterm birth was also compared to the observed prevalence of preterm birth.
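The error metrics described above can be sketched in a few lines of Python. This is an illustrative implementation only (the study's analyses were run in SAS and R); the function name, argument names, and the example values in the usage note are hypothetical.

```python
import math


def error_metrics(estimated, reference, tolerance_weeks=1.0):
    """Summarize residuals (estimated minus reference GA, in weeks).

    Returns MSE, RMSE, MAE and the proportion of estimates that fall
    within +/- tolerance_weeks of the reference GA.
    """
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [est - ref for est, ref in zip(estimated, reference)]
    n = len(residuals)
    mse = sum(r * r for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    within = sum(abs(r) <= tolerance_weeks for r in residuals) / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": mae,
        "within_tolerance": within,
    }
```

For example, with made-up estimates `[39.0, 38.0, 35.0, 41.0]` and references `[40.0, 38.5, 33.0, 41.0]`, the residuals are -1, -0.5, 2 and 0 weeks. The single 2-week residual contributes 4 of the 5.25 total squared error but only 2 of the 3.5 total absolute error, which illustrates why MAE is less sensitive to large residuals than MSE or RMSE.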

Results
A total of 1,039 newborns had heel prick samples available, as well as clinical and demographic data including ultrasound-derived reference GA. Of these, 92 infants (8.9%) were preterm and 88 (8.5%) were SGA (Table 2), and 1,012 (97.4%) also had a cord blood sample collected. It should be noted that the Ontario cohort in which the models were developed and internally validated had a lower preterm birth prevalence of 5.6% and SGA prevalence of 3.9% (Table 2). There were 11 screen-positive results for hemoglobinopathies that were reported back to the study site for follow-up. Details of incidental screen-positive findings and follow-up will be reported in a separate manuscript.

Model-based GA estimation for heel prick samples
Overall, Model 1, which included only readily available clinical covariates (sex, birthweight and multiple birth status), estimated GA within 10.5 days on average, with a MAE of 1.5 (95% CI 1.41, 1.58) weeks, and 58.5% of model estimates were within ± 1 week of reference GA. For preterm births, Model 1 MAE was 2.64 (95% CI 2.30, 3.01) weeks and only 24.1% of estimates were within ± 1 week of reference GA. In SGA newborns, MAE was 3.13 (95% CI 2.85, 3.38) weeks and 3.4% of estimates were within ± 1 week of reference GA (Table 2). Model 2, which included clinical covariates plus analytes, estimated GA within 9.5 days overall, with a MAE of 1.35 (95% CI 1.27, 1.43) weeks, and 64.1% of estimates were within ± 1 week of reference GA.
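The text does not state how the 95% confidence intervals around MAE were obtained. As one common option, and purely as an assumption for illustration, a percentile bootstrap over residuals could be sketched as follows; the helper that converts weeks to days shows how a MAE of 1.35 weeks corresponds to roughly 9.5 days. All function names and defaults are hypothetical.

```python
import random


def mean_abs_error(residuals):
    """Mean absolute error of a list of residuals (in weeks)."""
    return sum(abs(r) for r in residuals) / len(residuals)


def bootstrap_mae_ci(residuals, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for MAE (an assumed method, not
    necessarily the one used to produce the published intervals)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(residuals)
    # Resample infants with replacement and recompute MAE each time.
    stats = sorted(
        mean_abs_error(rng.choices(residuals, k=n))
        for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    lower = stats[int(alpha * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return lower, upper


def weeks_to_days(weeks):
    """Express an error given in weeks as days."""
    return weeks * 7
```

Resampling is done at the level of the infant, so each bootstrap replicate has the same sample size as the original cohort or subgroup.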
The performance of Models 1 and 2 did not appear to be affected by the HIV status of the mother. Model performance for infants of HIV-positive mothers (n = 197) was almost identical to that for infants of HIV-negative mothers (n = 842) (Table 4).

Model-based GA estimation for cord-blood samples
Model 1 demonstrated nearly identical performance in cord blood samples and heel prick samples, because analytes were not covariates in Model 1 and the heel prick and cord blood cohorts were composed almost entirely of the same infants. Overall, in the cord blood cohort, Model 2 was slightly less accurate than in heel prick samples, with a MAE of 1.44 (95% CI 1.36, 1.53) weeks (Tables 5 and 6).

Model-based GA estimation using reference GA derived from ultrasounds within recommended window
There was substantial variation in the timing of the gestational dating ultrasound, despite best efforts to conduct the ultrasound as early as possible. Reference GA for 28 newborns (2.7%) was derived from an ultrasound conducted before 9 weeks' gestation, and 889 (85.6%) had reference GA based on an ultrasound later than 13 weeks' gestation. Only 120 newborns (11.5%) had reference GA based on an ultrasound conducted within 9-13 weeks' gestation (Table 4). When evaluated in these 120 newborns, model performance was markedly better, with Model 2 having a MAE of 1.04 (95% CI 0.87, 1.22) weeks overall and a MAE of 2.56 (95% CI 1.50, 4.00) and 1.07 (95% CI 0.46, 1.70) weeks in preterm and SGA infants, respectively (Table 5). Similar to the heel prick results, Model 2 for cord blood specimens performed markedly better when only samples with reference GA ascertained between 9 and 13 weeks of gestation were included [overall MAE 1.08 (95% CI 0.90, 1.27) weeks] (Table 5).
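The stratified comparison described above can be sketched as a simple grouping step: each infant is assigned to a timing window according to the gestational week at which the dating ultrasound was performed, and MAE is computed within each window. The sketch assumes that ultrasound timing is recorded in completed weeks and that the 9-13 week window is inclusive at both ends; neither detail is specified in the text, and the function and label names are hypothetical.

```python
from collections import defaultdict


def timing_window(ultrasound_week):
    """Assign a dating ultrasound (in completed weeks, an assumption)
    to a timing window relative to the recommended 9-13 week range."""
    if ultrasound_week < 9:
        return "<9 weeks"
    if ultrasound_week <= 13:
        return "9-13 weeks"
    return ">13 weeks"


def mae_by_window(records):
    """Compute MAE per timing window.

    records: iterable of (ultrasound_week, estimated_ga, reference_ga),
    with GA values in weeks.
    """
    grouped = defaultdict(list)
    for ultrasound_week, estimated, reference in records:
        grouped[timing_window(ultrasound_week)].append(abs(estimated - reference))
    return {window: sum(errs) / len(errs) for window, errs in grouped.items()}
```

The same grouping could be extended to finer windows (for example, splitting the late group at 20 weeks) by adding further thresholds to `timing_window`.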

Discussion
We externally validated a postnatal GA dating algorithm, developed and internally validated in a cohort of infants in Ontario, Canada, in a prospective birth cohort in Siaya County, Kenya, a lower-middle-income country in sub-Saharan Africa. Heel prick and umbilical cord blood samples were collected shortly after birth, and ultrasound was used to provide a reference GA for each infant. Overall, performance of both Model 1 and Model 2 was worse in the Kenya birth cohort than in internal validation in Ontario and in previously published external validations of metabolic GA algorithms [12,13]. The heterogeneity of reference ultrasound timing was an important contributor to diminished model performance, as only 120 out of 1,039 participants had a reference ultrasound completed between 9 and 13 weeks of gestation. Model performance was markedly better in participants whose reference GA was ascertained inside the recommended window than in those whose reference GA was ascertained outside it. For example, Model 2 had an overall MAE of 1.04 weeks among infants whose reference ultrasound was performed between 9 and 13 weeks, compared to MAEs of 1.48 weeks (<9 weeks), 1.34 weeks (14-20 weeks) and 1.43 weeks (>20 weeks) for those with dating ultrasounds earlier or later than the recommended window. A similar pattern was seen for Model 1 and Model 2 in heel and cord samples, both overall and in preterm and SGA newborns. Our study highlights the challenges in reliably estimating GA in low resource settings, even in those with access to dating ultrasound, given that the timing of dating ultrasound is critical to accurate estimation of GA [3,20]. Indeed, most pregnant women in Kenya access ANC for the first time in the second trimester [21]. These challenges further underscore the need for novel, reliable GA estimation methods that can be adopted in LMICs.
Given the significant barriers to obtaining an early dating ultrasound, the metabolic GA approach may be a more feasible approach to GA dating than ultrasound when the timing of the latter is variable. Our study also demonstrated the utility of cord blood samples, which could further strengthen the feasibility of our approach in low resource settings. Cord blood samples are obtained shortly after birth, remove the burden of sample collection before discharge, do not cause any discomfort to the newborn, and may be more readily accepted by parents who are not accustomed to the heel prick procedure. Given the relatively high prevalence of HIV in our study population, our results also provide reassurance that maternal HIV-positive status does not appear to affect the performance of algorithms based only on clinical measurements (Model 1) or those including metabolic markers measured in heel prick or cord blood (Model 2). The major limitation of our study was the small number of dating ultrasounds conducted during the optimal reference time period. As a result, a gold-standard reference GA for reliable comparison was not available for a large percentage of the sample. Further, as observed in our previous validation studies, the study sample was affected by a participation bias against preterm and extremely preterm infants. Model-estimated gestational ages were most accurate in infants born close to full term, and were overestimated in preterm infants and underestimated in post-term infants. Strengths of the study include the real-world approach to evaluating the algorithm, which allowed us to assess not only model performance but also the feasibility of this GA estimation approach.
Our study demonstrated that, despite being conducted within a prospective pregnancy cohort with a well-defined protocol in a controlled research setting, there were still challenges in obtaining a true reference GA measurement due to the timing of dating ultrasounds. Even under perfect circumstances, the metabolic prediction algorithm may not agree perfectly with the ultrasound-based GA because it is based on "metabolic maturity" rather than physical size, which may in fact be a better marker of physiological maturity. The results of this evaluation suggest that postnatal GA estimation algorithms such as the ones we have developed are both feasible and accurate, and previous analyses have indicated that GA estimation algorithm approaches are also potentially cost-effective [22]. Algorithm-based GA estimates have potential even in settings where early ultrasound is available, given the substantial heterogeneity in the timing of reference GA ultrasound in our population, a factor that may compromise the accuracy of estimates based on ultrasound alone. Given these findings, we believe that GA estimation algorithms based on metabolic analysis of heel prick or cord blood DBS may serve an important role in providing both individual infant estimates of GA and population-level estimates of preterm birth rates.