Integrative omics to detect bacteremia in patients with febrile neutropenia

Background Cancer chemotherapy-associated febrile neutropenia (FN) is a common condition that is deadly when bacteremia is present. Detection of bacteremia depends on culture, which takes days, and no accurate predictive tools applicable to the initial evaluation are available. We utilized metabolomics and transcriptomics to develop multivariable predictors of bacteremia among FN patients. Methods We classified emergency department patients with FN and no apparent infection at presentation as bacteremic (cases) or not (controls), according to blood culture results. We assessed relative metabolite abundance in plasma, and relative expression of 2,560 immunology and cancer-related genes in whole blood. We used logistic regression to identify multivariable predictors of bacteremia, and report test characteristics of the derived predictors. Results For metabolomics, 14 bacteremic cases and 25 non-bacteremic controls were available for analysis; for transcriptomics we had 7 and 22 respectively. A 5-predictor metabolomic model had an area under the receiver operating characteristic curve of 0.991 (95%CI: 0.972,1.000), 100% sensitivity, and 96% specificity for identifying bacteremia. Pregnenolone steroids were more abundant in cases and carnitine metabolites were more abundant in controls. A 3-predictor gene expression model had corresponding results of 0.961 (95%CI: 0.896,1.000), 100%, and 86%. Genes involved in innate immunity were differentially expressed. Conclusions Classifiers derived from metabolomic and gene expression data hold promise as objective and accurate predictors of bacteremia among FN patients without apparent infection at presentation, and can provide insights into the underlying biology. Our findings should be considered illustrative, but may lay the groundwork for future biomarker development.


Results
For metabolomics, 14 bacteremic cases and 25 non-bacteremic controls were available for analysis; for transcriptomics we had 7 and 22 respectively. A 5-predictor metabolomic model had an area under the receiver operating characteristic curve of 0.991 (95%CI: 0.972,1.000), 100% sensitivity, and 96% specificity for identifying bacteremia. Pregnenolone steroids were more abundant in cases and carnitine metabolites were more abundant in controls. A 3-predictor gene expression model had corresponding results of 0.961 (95% CI: 0.896,1.000), 100%, and 86%. Genes involved in innate immunity were differentially expressed. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Introduction 10] and FN (defined as <1000 neutrophils/μL 2 ), from the Brigham and Women's Hospital emergency department (receiving emergency department for the Dana Farber Cancer Institute). All patients had detailed clinical evaluations including history, physical exam, chest Xray, urinalysis, and, blood and urine cultures. Information on antibiotic use in the 24 hours prior to blood collection was also recorded. Patients presenting with focal bacterial infection (pneumonia, skin infection, urinary tract infection, intra-abdominal infection) or evidence of sepsis in the judgment of two investigators were excluded, as the goal of this study was to find an indicator of bacteremia in patients without initially-detectable bacterial infection. Trained research assistants screened for eligible patients between the hours of 7am and 11pm daily. Once subjects were identified, they collected two blood samples (in PAXgene RNA preservation tubes and lithium heparin tubes) and clinical data. Two investigators then independently categorized each subject, relying upon the diagnostic investigation conducted as part of routine care and the recommendations of the Infectious Diseases Society of America and the American Society of Clinical Oncology. [4,10,22] The investigators placed each subject into one of three categories: (i) No evidence of focal infection or sepsis at presentation and ultimately found to have been bacteremic (cases), (ii) No evidence of focal infection or sepsis at presentation and ultimately found not to have been bacteremic (controls), or (iii) Evidence of focal infection or sepsis at presentation (excluded),. The Partners Health Care Human Subjects Research Committee approved this study, and all participants provided written informed consent.

Metabolomic and gene expression profiling
Mass spectrometry-based metabolomic profiling was performed on plasma samples by Metabolon, Inc. (Durham, NC), as described previously [23,24] Briefly, the global biochemical profiling analysis was composed of four unique arms covering a broad range of the metabolome; (i) reverse-phase chromatography positive ionization methods optimized for hydrophilic compounds (LC-MS Pos Polar) and (ii) hydrophobic compounds (LC-MS Pos Lipid); (iii) reverse-phase chromatography with negative ionization conditions (LC-MS Neg), and (iv) a HILIC chromatography method coupled to negative (LC-MS Polar). Metabolites were annotated based on an iterative process of matching on mass to charge ratio, retention time and spectral fragmentation signature, followed by manual curation to confirm biochemical identification. Further details are provided in S1 Methods.
RNAseq could not be used in this population due to the paucity of white blood cells. We used EdgeSeq from HTG Molecular, Inc (Tucson, AZ). [25] to quantify the expression of 2,560 genes relevant to cancer and immunology. This technology does not require an RNA isolation step. The panel of genes was chosen by an iterative process of literature review and key opinion leader feedback, and included 24 gene groups and pathways. [26]. Analysis was performed as described previously [25,26]. Briefly, the EdgeSeq assay couples quantitative nuclease protection with next-generation sequencing. After allowing nuclease protection probes (NPPs) to hybridize to their target RNAs, S1 nuclease is added to remove excess unhybridized NPPs and RNA, leaving behind only NPPs hybridized to their target RNAs, resulting in a stoichiometric conversion of target RNA to the NPPs and producing a 1:1 ratio of NPP to RNA. The quantitative nuclease protection steps are automated on the EdgeSeq processor, followed by PCR to add sequencing adaptors and tags. The labeled samples are pooled, cleaned, and sequenced on a next-generation sequencing platform using standard protocols. The resulting data are processed and reported by EdgeSeq parsing software. Further details are provided in S1 Methods.

Quality control and data processing
Metabolomics. We used a previously published [27] QC and processing pipeline to clean the metabolomic dataset. We excluded metabolites with zero abundance in all samples, then imputed all remaining missing observations with half the lowest detected value for that metabolite. We considered metabolites with zero variance uninformative and excluded them. We then pareto scaled and log-transformed the data.
Gene expression. We performed data processing and normalization according to Edge-Seq manufacturer's standards. [25,26] Data were transferred from the Illumina MiSeq sequencer as demultiplexed FASTQ files, with one file per original well of the 96-well sample plate. The HTG EdgeSeq Parser was used to align the FASTQ files to the probe list to collate the data, which were then median-normalized. [28] Statistical analysis Prediction. We used unsupervised principal components analysis (PCA) to assess the ability of the metabolome and gene expression data to discriminate cases (with bacteremia) from controls (without bacteremia). We further interrogated these plots to determine if other clinical factors may be driving the metabolomic profiles. We then used supervised partial least squares discriminant analysis (PLS-DA) to assess predictive accuracy for bacteremia. Next, we attempted to identify metabolomics and genetic predictive profiles using two approaches; (i) We used independent logistic regression models adjusting for age, sex, body mass index (BMI), and, tumor type (solid/liquid) to identify the metabolites most strongly associated with the presence or absence of bacteremia. Differential gene expression analysis was used for the gene expression data. (ii) We employed least absolute shrinkage and selection (LASSO) sparsity-inducing logistic regression to identify more parsimonious metabolomic and genomic predictors. We ran two logistic models; one containing all metabolites, and one containing all genes to identify the subset of metabolites and genes retained in the models. These were then selected as the predictors. We used the lambda that produced the minimum mean cross validated error. [29] We then created metabolite and gene summary scores based on (i) the first principal component of metabolites/expressed genes identified as differentially abundant in cases vs. controls in the regression models; and (ii) metabolites/genes selected in the LASSO model. We used receiver operating characteristic (ROC) curve analysis to evaluate the predictive ability of these summary scores for bacteremia, and employed the method of DeLong to compare areas under the ROCcurves for the different scores. [30] We determined a cutoff that maximized sensitivity, and calculated specificity at this cutoff. The currently recommended approach to risk stratification is to classify a patient as high-risk if the MASCC score is <21 or if any of the Infectious Diseases Society of America/American Society of Clinical Oncology high risk criteria are met. [3,4] While this approach to risk stratification was not designed to detect bacteremia, no other method currently exists to classify these patients. Therefore, we compared the accuracy of this classifier as a predictor of bacteremia to the accuracy of our omic predictors of bacteremia.
Analysis of underlying biology. We explored metabolic pathways using MetaboAnalyst (http://www.metaboanalyst.ca/), which takes both overrepresentation and pathway topology into account, assigning more weight to metabolites that form key components or 'hubs' of specific pathways. For gene expression, we performed gene set enrichment analysis using the g. GOSt tools from the g.profiler package (http://biit.cs.ut.ee/gprofiler/).
In order to combine the findgins from the metabolomics and genetic analysis, we used IMPaLA (Integrated Molecular Pathway Level Analysis; http://impala.molgen.mpg.de/) [31] to identify pathways that were jointly dysregulated at the level of both metabolites and gene expression. IMPaLA performs over-representation analysis considering both genes and metabolites to provide a combined pathway p-value as well as a q-value that accounts for multiple testing.

Study population
Metabolomic profiling was performed for 58 subjects who were classified by two investigators with no disagreements. Fourteen (24%) had bacteremia (cases); 25 (43%) had no evidence of bacterial infection (controls); and, 19 (33%) had evidence of a focal infection without bacteremia (excluded from analysis), resulting in a total of 39 analyzed subjects ( Table 1). Gene expression data were available for only seven of the cases and 21 of the controls, due to logistical issues. A further control had gene expression profiling only. In total seven cases and 22 controls were included in the gene expression analysis.
Cases demonstrated a significantly higher maximum temperature, lower neutrophil and lymphocyte counts, and a lower MASCC score as expected. Patients with both solid and liquid tumors originating from a variety of organs were included. Cases were more likely to have a liquid tumor than controls, but there was no significant difference in tumor site. Antibiotic treatment was initiated prior to sample collection in 85% of subjects; there was no significant difference in the proportion of cases and controls who received antibiotics ( Table 1).

Metabolomics
A total of 1,296 metabolites were measured. After exclusion of metabolites that were missing for all subjects and those with no variance across the population, 1,204 metabolites remained for analysis. These included amino acids, carbohydrates, lipids, nucleotides, vitamins, peptides, energy metabolites, and 163 xenobiotics. PCA based on all 1,204 metabolites revealed separation between cases and controls along the first two components which together explained 27% of the variance in the data (Figure A in S1 File). To determine if these metabolomic profiles were driven by other clinical factors we also interrogated the PCA plot in terms of tumor type (liquid or solid); tumor site and antibiotic use prior to blood draw. Among the bacteremic cases only, we also explored the bacteria type subsequently identified in the culture (Grampositive, Gram-negative or both). These plots indicated no clustering based on any of these variables; and therefore provided no evidence that these factors were driving the metabolomic profiles ( Figure B in S1 File). Regression models confirmed that both PC1 (p = 8.8x10 -4 ) and PC2 (p = 0.024) were significantly associated with case status.
Partial least squares discriminant analysis (Fig 1) suggested that a metabolomic classifier could distinguish between cases and controls, with R 2 = 0.650, and a cross-validated Q 2 of 0.410 for the first component. Interrogation of the variable importance in the projection plot (VIP; a measure of the relative importance of each feature in the PLS-DA) identified 17αhydroxypregnanolone glucuronide, estrone 3-sulfate, 5α-pregnan-3, 20β-diol disulfate and pregn steroid monosulfate (C21H3405S) as the top metabolites driving the discrimination. Carnitines, also had high VIP scores.
Permutation testing revealed that the model was not robust (p = 0.366). Therefore, a refined discriminatory profile was sought by identifying metabolites significantly associated with bacteremia using multivariable logistic regression. After adjustment for age, sex, BMI and, tumor type (liquid or solid), a total of 177 metabolites were significant at p<0.05 and 19 were significant at a p<0.01 (Fig 2 and Table A in S1 File). A majority of the significant metabolites were lipids. We observed upregulation of pregnenolone steroids and downregulation of carnitine metabolites among bacteremia cases. The relative metabolite intensities in cases and controls for the top eight upregulated and top eight downregulated metabolites are shown in Figure C in S1 File. Pathway analysis identified six metabolic pathways that were enriched among these significant metabolites: pyrimidine metabolism (p = 0.002), ascorbate and aldarate metabolism (p = 0.003), purine metabolism (p = 0.017), sphingolipid metabolism (p = 0.018), pantothenate metabolism (p = 0.022) and valine, leucine and isoleucine metabolism (p = 0.022). We generated a summary score based on the first principal component of the 177 metabolites, and used ROC curve analysis to explore the discriminatory ability of this score (Fig 3).
To identify the most parsimonious model, we then used LASSO regression, and identified a five-metabolite score. The AUCs of the two currently available clinical classifiers; MASCC and MASCC plus an indicator of risk ("high-risk" classifier), were 0.624 (95%CI: 0.508-0.741) and 0.540 (95%CI: 0.486-0.594), respectively. Both metabolite scores significantly out-performed these classifiers with AUCs of 0.969 (95%CI: 0.918-1.000) (p dif MASCC classifier = 2.3x10 -8 ; p dif high risk classifier <2.2x10 -16 ) for the standard logistic score and 0.991 (95%CI: 0.972-1.000) (p dif MASCC classifier = 8.0x10 -10 ; p dif high risk classifier <2.2x10 -16 ) for the LASSO score. Furthermore, while the sensitivity of the binary MASCC and "high risk" classifiers was high (93% and 100%, respectively), the corresponding specificities were only 32% and 8%. In contrast, at the cutoff required to achieve 100% sensitivity, the specificity of the standard logistic classifier was 88%, and the parsimonious LASSO classifier was 96% ( Table 2). Sensitivity analyses. A sensitivity analysis additionally adjusting for antibiotic use in the 24 hours prior to blood draw identified 118 significant metabolites; 112 (95%) of which were also among the 177 identified in the original analyses. We ran a further sensitivity analysis adjusting for the difference between absolute neutrophil and absolute leukocyte count. Again, of the 169 metabolites that retained significance in this model, 159 (94%) were among the original 177 metabolites., demonstrating the robustness of these findings. Similarly, we wanted to determine whether the type of bacteria (Gram-negative or Grampositive) responsible for the bacteremia influenced the results. Although the numbers were too small (n = 5 cases) for the model to converge when considering only at Gram-positive bacteria, we identified 129 metabolites significantly associated with Gram-negative bacteria (n = 8 cases). Ninety-nine (77%) of these were also among the 177 metabolites; including many of the most significant hits such as the carnitines and pregnenolone steroids. Furthermore, when comparing the relative levels of these top metabolites in controls versus cases stratified by Gram-status, both the Gram-negative and Gram positive bacteria cases were distinct from the controls (Figure D in S1 File). This again suggests that case-control status rather than bacteria type among the cases was the biggest driver of the differential metabolite abundance.

Gene expression profiling
Evidence of separation in the gene expression profile of cases (n = 7) and controls (n = 22) was also suggested by a PCA model based on 2,560 mRNAs. The PLS-DA model (Fig 4) resulted in an R 2 of 0.60, but a Q 2 of only 0.09, and again permutation testing indicated that this model was not robust (p = 0.974). Differential expression analysis identified 150 nominally significant genes, of which three were significant at p<0.01 (Fig 5).  Figure D in S1 File. G.profiler analysis determined that the significant genes were enriched for 24 biological terms, including a number relating to vesicle mediated transport, and cytokines (Table B in S1 File). When LASSO regression was employed, only three genes were retained in the model: RAD18, which encodes a DNA repair protein, MAPKAPK3, a kinase activated in response to cellular stress and JAG1, which has a reported role in hematopoiesis. Sensitivity analyses were not performed on the gene-expression data due to sample size limitations.
The binary MASCC and high-risk classifiers in this gene expression subpopulation were sensitive (89% and 100%, respectively), but had low specificities (36% and 9%). In contrast, with 100% sensitivity, the specificities was 86% for both our standard logistic and LASSO genomic classifiers ( Table 2). Omics and bacteremia

Integrative analysis
When we combined the metabolomic and gene expression data into a single data set, there were too few observations to demonstrate predictive ability, because we had both datasets for only 28 participants. This caused the number of parameters to exceed the number of observations by too wide a margin for our models to converge on a solution. However, integrative analysis using IMPaLA revealed that a number of the genes and metabolites independently identified as significant in the previous analyses are involved in the same biological pathways and processes ( Table 3). Forty eight pathways were identified as being differentially perturbed in terms of both metabolomics and gene expression (joint q-value<0.05). These included pathways relating to the immune system (joint q-value = 6.29x10 -8 ) insulin regulation (including the Insulin receptor signaling cascade; joint q-value = 6.29x10 -8 , IRS-related events triggered by IGF1R;joint q-value = 4.33x10 -3 ), the MAPK signaling pathway (joint q-value = 8.38x10 -3 )  and multiple pathways involved in the signaling processes necessary for cell survival, differentiation and apoptosis, (such as signaling by NGF; joint q-value = 3.73x10 -45 , signaling by EGFR; joint q-value = 7.25x10 -4 andDAP12 signaling; joint q-value = 4.49x10 -3 ).

Discussion
We have demonstrated methods for discovery of a multi-omics-based predictor for detection of bacteremia among FN patients without apparent infection. Omic profiles are increasingly being leveraged as predictive, diagnostic and prognostic biomarkers, and can provide insight into the underlying biology. [32,33] Metabolomic and transcriptomic diagnostics have already been deployed clinically. For example, real-time analysis of the metabolome is a clinical reality, as exemplified by the iKnife, which performs instantaneous analysis of the mucosal lipidome to phenotype colorectal cancer during surgery. [34] Similarly, transcriptomics have yielded diagnostic tests that are approaching clinical implementation for detection of serious bacterial infection. [35,36] To our knowledge, this was the first study to develop multi-omic based biomarkers of bacteremic FN; one prior study used metabolomics alone to investigate infection in the setting of FN. [37] Our future work will apply these methods to a larger sample size, separated into derivation and validation sets, with the goal of developing a clinically-applicable tool for detection of bacteremia during the initial evaluation. For detection of bacteremia, the two existing clinical predictors; MASCC and the high-risk classifier, were sensitive but demonstrated poor specificity. In contrast, although derived from only a small population, our metabolomic and transcriptomic predictors maintained impressive specificities with 100% sensitivity. These results demonstrate that derivation of omicsbased predictors of bacteremia in the setting of FN is feasible, and justify further research in a larger sample.
We found that pregnenolone steroids, which are cortisol precursors, were upregulated in cases relative to controls. Prior research has linked sepsis to an overexpression of cortisol precursors despite normal cortisol levels, implying decreased activity of 3-beta-hydroxysteroid dehydrogenase. [38] Carnitines were down-regulated in cases, which is in agreement with previous evidence for a differential abundance of carnitines in bacteremic compared to non-bacteremic patients. [39,40] L-carnitine has entered clinical trials as a therapeutic agent for patients with sepsis, and metabolomic analysis has been used to identify the subset of patients responding. [41] Other pathways involved in amino acid metabolism were also perturbed, which could relate to the enhanced extraction of amino acids by the liver noted in patients with sepsis and systemic inflammatory response syndrome. [42] For example, pyrimidine metabolism was differentially regulated, possibly due to the de novo synthesis of pyrimidines required for successful proliferation of pathogens in blood. [43] Furthermore, the dysregulation of the ascorbate and aldarate metabolism pathway may relate to the lower circulating levels of ascorbate reported in patients with sepsis. [44] A number of genes showed differential expression between cases and controls. These included, PPBP, an antimicrobial protein with bactericidal and antifungal activity; CYBB, which is essential for the microbicidal oxidase system of phagocytes; LYZ, which encodes a component of the innate immune system that cleaves peptidoglycan bonds in the bacterial cell wall;,and CD86 and FGL2 which have been associated with severity and worse outcomes in sepsis [45,46] These results suggest that a weaker innate immune response might predispose to bacteremia after depletion of the adaptive immune system by chemotherapy. On a pathway level, the differentially-expressed genes were enriched for a number biological processes including some relating to the release of cytokines, supporting the idea of a unique immunologic milieu in FN patients with bacteremia.
Integrative analysis supported the findings of the individual omics analysis, and provided mechanistic links between the metabolites and genes independently associated with bacteremia. For example, MAPK signaling pathways were perturbed. MAPKs play an important role in the cascade of cellular responses evoked by extracellular stimuli such as pro-inflammatory cytokines or physical stress, and they have been shown to be activated in the setting of bacterial challenge. [47] These integrative results should be viewed with caution due to the limited sample size, which was constrained by funding. However they do provide a good indication of potential gene-metabolite relationships, and a demonstration that in a larger sample, dysregulation at multiple omic levels can be explored simultaneously to interrogate underlying mechanisms.
Regarding limitations, we caution the reader that the sample size, particularly for the transcriptomic data was limited, and we were unable to stratify by potential effect modifiers such as sex.These results should therefore be considered illustrative, and the specific biological entities identified as differentially abundant should be considered exploratory.RNAseq could not be used to quantify expression of the entire transcriptome, because leukocyte-poor blood has insufficient cells for this technique. Therefore, we used EdgeSeq, which does not require an RNA isolation step, but quantifies relative abundance of only 2,560 transcripts. However, the genes included in the EdgeSeq panel were selected on the basis of their relevance to both cancer and the immune system, making them ideally suited to this study population. Because participants had FN as an unscheduled emergency, their dietary intake prior to presenting in the Emergency room may have influenced their metabolome, but this represents the scenario in which a predictor of bacteremia would ultimately be utilized. However, we were able to provide evidence that prior antibiotic use was not influencing our results; nor was the type or site of the initial tumor. Furthermore, within the limited sample size we were able to demonstrate that the metabolomic profile appeared to be similar for Gram-negative and Gram-positive bacteremia,although it would be of interest to explore this further in the larger sample.
In conclusion, we generated metabolomic and transcriptomic predictors of bacteremia, among FN patients without apparent infection at presentation. With overfitting as a caveat, these predictors significantly outperformed currently-recommended risk-stratification tools, with markedly improved specificity and perfect sensitivity. Interrogation of differentiallyabundant biomarkers revealed biologically-plausible roles in bacteremia within the setting of FN. This is the first such study within a field that is in dire need of novel biomarkers and management strategies. Via further study in a larger sample with discovery and validation sets, development of a biologically meaningful objective predictor that utilizes both clinical and omics data is feasible, and will facilitate early diagnosis. This, in turn, will enable appropriate aggressive treatment for patients predicted to have bacteremia, and appropriate conservative treatment of those predicted not to have bacteremia.