Updated Systematic Review and Meta-Analysis of the Performance of Risk Prediction Rules in Children and Young People with Febrile Neutropenia

Introduction Febrile neutropenia is a common and potentially life-threatening complication of treatment for childhood cancer, which has increasingly been subject to targeted treatment based on clinical risk stratification. Our previous meta-analysis demonstrated 16 rules had been described and 2 of them subject to validation in more than one study. We aimed to advance our knowledge of evidence on the discriminatory ability and predictive accuracy of such risk stratification clinical decision rules (CDR) for children and young people with cancer by updating our systematic review. Methods The review was conducted in accordance with Centre for Reviews and Dissemination methods, searching multiple electronic databases, using two independent reviewers, formal critical appraisal with QUADAS and meta-analysis with random effects models where appropriate. It was registered with PROSPERO: CRD42011001685. Results We found 9 new publications describing a further 7 new CDR, and validations of 7 rules. Six CDR have now been subject to testing across more than two data sets. Most validations demonstrated the rule to be less efficient than when initially proposed; geographical differences appeared to be one explanation for this. Conclusion The use of clinical decision rules will require local validation before widespread use. Considerable uncertainty remains over the most effective rule to use in each population, and an ongoing individual-patient-data meta-analysis should develop and test a more reliable CDR to improve stratification and optimise therapy. Despite current challenges, we believe it will be possible to define an internationally effective CDR to harmonise the treatment of children with febrile neutropenia.


Introduction
Febrile neutropenia (FNP) is a common and potentially life-threatening complication of therapy for childhood cancer, which has increasingly been subject to targeted treatment based on clinical risk stratification [1]. For children, this move towards risk-directed care is based upon evidence of the low incidence of death [2], the majority of patients being without identified significant infection or sepsis [3], and small randomised trials demonstrating the feasibility of out-patient based treatment for patients at low risk of septic complications [4]. A large proportion of the evidence for risk stratification has originated from adult oncology [5]. It is acknowledged that children are not 'little adults' but are distinct in the biology of their malignancies, treatment regimens, infections and psychosocial setting, and therefore specific evidence for stratification of children with FNP is needed [6].
Since we undertook a systematic review and meta-analysis of risk stratification systems in 2008 [3], further studies have been published which address this issue [7]. Accordingly we have updated our review to summarise the most recent advances in our knowledge of evidence on the discriminatory ability and predictive accuracy of such risk stratification clinical decision rules (CDR) for children and young people with cancer.

Methods
This update review was conducted in accordance with "Systematic reviews: CRD's guidance for undertaking reviews in health care" [8] and registered on the PROSPERO Registry of systematic reviews: CRD42011001685. It sought studies which aimed to derive or validate a CDR in children or young people (aged 0-18 years) presenting with febrile neutropenia. Both prospective and retrospective cohorts were included, but those using a case-control ("two-gate") approach were excluded as these have previously been shown to exaggerate diagnostic accuracy estimates [9].

Search strategy and selection criteria
The electronic search strategy [3] was reviewed and repeated on the following databases from Reference lists of relevant systematic reviews and included articles were reviewed for further relevant articles. Published and unpublished studies were sought and no language restrictions applied. Non-English language studies were translated. Two reviewers independently screened the title and abstract of studies for inclusion, and then the full text of retrieved articles. Disagreements were resolved by consensus.

Validity assessment and data extraction
The validity of each study was assessed as with our previous review using 11 of the 14 questions from the QUADAS assessment tool for diagnostic accuracy studies [10].
Data were extracted by one reviewer and checked by the other. The data extracted included age and sex distribution of the included participants, geographical location of the study, the participant inclusion/exclusion criteria, and the performance of the CDR as a 2×k table (where k refers to the number of strata described) or as sensitivity/specificity, as well as aspects of the methods used to derive the CDR (where applicable).
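As an illustration of how a dichotomous CDR result extracted as a 2×2 table translates into sensitivity and specificity, the following minimal sketch may help. The counts are hypothetical and not taken from any included study, and the Wilson score interval is an assumed choice of confidence interval; the review does not state which interval method was used for individual studies.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def accuracy_from_2x2(tp, fn, fp, tn):
    """Sensitivity and specificity, with intervals, from 2x2 cell counts.

    tp: episodes with the outcome that the rule classed as high risk
    fn: episodes with the outcome that the rule classed as low risk
    fp: episodes without the outcome classed as high risk
    tn: episodes without the outcome classed as low risk
    """
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity": tn / (tn + fp),
        "specificity_ci": wilson_interval(tn, tn + fp),
    }


# Hypothetical counts, for illustration only
example = accuracy_from_2x2(tp=45, fn=5, fp=60, tn=40)
print(example)
```

A rule with more than two strata (k > 2) would first need to be collapsed at a chosen threshold before this calculation applies.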

Methods of analysis/synthesis
Where possible, data from new publications were added to meta-analyses undertaken in the original review [3]. Quantitative synthesis was undertaken when more than 2 studies tested the same CDR and, where appropriate, sources of heterogeneity were investigated. For this update review, only dichotomous test data were found. For CDR with 3 datasets, a univariate approach was used (pooling sensitivity and specificity separately) [11]. For those with 4 or more, a bivariate model was fitted using 'metandi' in STATA10 [12]. The protocol specified that a random-effects meta-analysis would be undertaken using WinBUGS 1.4.3 [13] for tests with 3 or more risk strata, but no data were found eligible for this analysis.
Heterogeneity between study results was explored through consideration of study populations, study design, CDR and outcomes chosen, although the small number of studies in each category limited this approach. Sensitivity analysis was undertaken by comparing results when the original (derivation) data set was included and excluded.
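The univariate pooling and the derivation-set sensitivity analysis described above can be sketched as follows. This is an assumption-laden illustration, not the authors' implementation: it uses a DerSimonian-Laird random-effects estimator on the logit scale with a 0.5 continuity correction, which is one common univariate approach, and the study counts are hypothetical.

```python
import math


def pool_proportions(events, totals, z=1.96):
    """DerSimonian-Laird random-effects pooling of proportions (logit scale).

    events[i] / totals[i] is, for example, true positives / all episodes with
    the outcome in study i (for sensitivity). Returns the pooled proportion,
    its approximate 95% interval and the between-study variance tau^2.
    """
    ys, vs = [], []
    for e, n in zip(events, totals):
        # Add 0.5 to both cells of every study so the logit stays finite
        a, b = e + 0.5, (n - e) + 0.5
        ys.append(math.log(a / b))
        vs.append(1 / a + 1 / b)

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q
    w = [1 / v for v in vs]
    y_fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, ys))

    # Method-of-moments estimate of tau^2, truncated at zero
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(ys) - 1)) / c) if c > 0 else 0.0

    # Random-effects weights and pooled estimate
    w_re = [1 / (v + tau2) for v in vs]
    y_re = sum(wi * yi for wi, yi in zip(w_re, ys)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))

    def expit(x):
        return 1 / (1 + math.exp(-x))

    return expit(y_re), expit(y_re - z * se), expit(y_re + z * se), tau2


# Hypothetical data: first study plays the role of the derivation set
tp = [40, 90, 28]
with_outcome = [50, 100, 30]
print("all studies:", pool_proportions(tp, with_outcome))
print("excluding derivation set:", pool_proportions(tp[1:], with_outcome[1:]))
```

Specificity would be pooled the same way using true negatives and the number of episodes without the outcome. With four or more datasets, a bivariate model that accounts for the correlation between sensitivity and specificity is preferable to this separate pooling.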
For those areas where a quantitative synthesis was not possible, a narrative approach was used.

Results
Nine articles reporting on 8 studies were eligible for inclusion in the review (see Figure 1). The studies included patients from 2 months to 22 years old, with a wide range of malignancies, and a total of 2591 episodes of FNP describing four groups of outcomes: death, critical care requirement, serious medical complication, and bacteraemia. Six studies undertook prospective data collection and two retrospective. Details of the CDR included in this review are given in Table 1.

Quality assessment
The studies varied in quality. Potential biases due to threats to independent outcome assessment were present in two studies [2,14], verification bias in two [2,7], and two were presented only as abstracts [14,15]. Five definitions of febrile neutropenia were used, with five definitions of fever and two of neutropenia. However, all definitions were clinically similar, with variation mainly in the duration of time for a lower temperature to be considered 'prolonged'.

New CDR derivations
Five studies attempted to derive at least one CDR. Four studies examined rules to predict significant medical complications: a group of outcomes generally encompassing death, intensive care admission, significant bacterial or fungal infection, and need for organ support such as supplemental oxygen, inotropes or dialysis [7,14,16,17]. Two examined rules to predict bacteraemia [16,18], and one intensive care admission [15]. In one case a clear CDR could not be assessed [15]. The CDR used data from the initial/admission assessment, or from a later assessment after approximately 24 hours of observation. The new CDR generally had high sensitivity for the chosen outcome at the expense of poor specificity (see Table 2) and considered patient-disease, patient-episode and laboratory factors. Considerable imprecision in the estimates was seen, due mainly to the small numbers in individual studies (fewer than 350 patients).
The newly derived CDR were subject to validation by internal statistical means (cross-validation) or in one alternative data set (see Table 3). In all except one case [15], multivariable regression analysis was used to build the model. One rule was built with a classification and regression tree (CART) approach [15].

Validation of CDR
Four studies [2,7,19,20] were explicit in undertaking validations of 9 previously described CDR. These universally demonstrated poorer discriminatory ability when tested in alternative data sets (see Table 3). The geographical settings for validations of the rules varied from those where the rule had been derived.

Synthesis of CDR accuracy
Meta-analysis was undertaken for two CDR: the "Klaassen" rule and the "Ammann" rule. Two further CDR, the PINDA rule and the "Alexander" rule, have not been subject to meta-analysis as the results are too heterogeneous; these results are presented graphically. Two further CDR, the Rondellini rule and the SPOG2003 rule, have been assessed in two datasets, too few to perform meaningful meta-analysis. No data were available to update the three-stratum "Rackoff" rule meta-analysis of the previous study [3].
The "Klaassen" rule is based on a single feature: an absolute monocyte count of greater than 100/mm³ to predict patients less likely to have significant infection. Data were pooled from 4 studies from the previous review [21,22,23,24] and two new sources [7,20]. The results of this analysis give a pooled average sensitivity of 88% (95% CI 84 to 91%) and specificity of 36% (95% CI 27 to 45%), see Figure 2.
The "Ammann" rule identifies patients at low risk of significant bacterial infection on the basis of weighted factors including: bone marrow involvement, clinical signs of viral infection, serum C-reactive protein (CRP) level, leukocyte count, presence of a central venous catheter, high haemoglobin level, and diagnosis of pre-B-cell leukaemia (see Table 1 for details). Three studies provide data to test this rule [7,18,20]. The pooled average sensitivity was 98% (95% CI 91 to 99%) but pooled average specificity only 13% (95% CI 8 to 21%), see Figure 3.
The "Alexander" rule examined adverse clinical consequences, using a combination of clinical features which predict prolonged neutropenia, and significant co-morbidities at presentation. This rule was assessed by two further studies [2,18]. There was marked heterogeneity in the results of these three studies (see Figure 4). When used at reassessment after 48 hours of hospitalisation, there was marked improvement in the discriminatory ability of the rule [2] (sensitivity = 100%, specificity = 39%).
The PINDA rule again identifies patients at low risk of significant bacterial infection on the basis of weighted factors including laboratory and chemotherapy-related parameters. This has been examined in two studies from the Santolaya group [25,26] and by two validations from European centres [7,20]. There was marked heterogeneity (see Figure 5), potentially explained by geographical variation: the rule worked well when applied in the population in Chile, but failed to differentiate patients in French and Swiss/German studies.
The rule of Rondellini [27] is a weighted score of clinical and haematological parameters (see Table 1 for details) and was assessed in two validation datasets. These demonstrated a sensitivity of 84% [7] and 62% [20] and both estimated specificity at 43%.
The SPOG2003 rule is a weighted score of haematological parameters with intensity of chemotherapy. It is applied after 8-24 hours of hospitalisation. This model was shown to have a sensitivity of 92% and specificity of 45% [7]. A validation of this model demonstrated poorer sensitivity (82%) and slightly better specificity (57%) [19].

Discussion
This update systematic review builds on previous work to bring our knowledge of currently developed clinical decision rules for risk stratification in paediatric febrile neutropenia up to date. Nine further models have now been described, bringing the total to 25, and these have included 10,000 episodes. It remains the case that no one rule is clearly better than any other, but we are now more clearly aware of the limitations of CDR which have not been subject to temporal and geographical validation. The majority of CDR in this review focus upon defining a group at 'low risk' of complications. These rules once again have clinical and physiological similarities. The dominant themes are of a relationship between underlying diagnosis, chemotherapeutic regime, and clinical and laboratory parameters at the outset of the episode of fever. A further finding from this review is the demonstration that undertaking risk stratification at 24-48 hours after the onset of the episode leads to much greater discrimination, as many occult infections will have declared in this period. Two rules have shown relative consistency of results. The first is the simplest: stratification of patients using the criterion of an absolute monocyte count >100/mm³ to define a low risk group [24]. This has a pooled average sensitivity of 88% (95% CI 84 to 91%) and specificity of 36% (95% CI 27 to 45%), and if we assume serious infectious events occur in 30% of the group, the low-risk group has a 9% risk of serious infection, and accounts for approximately 29% of the total population. The high risk group has a 37% risk of infectious complications.
The Ammann 2003 rule [7] has much better sensitivity (estimated at 98%), leading to a risk of serious infectious complications in around 5% of cases, but would only class 9% of patients as low risk, making it of little practical use.
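The step from pooled sensitivity and specificity to group sizes and group-specific risks can be made explicit with a short calculation. The sketch below assumes the 30% event rate stated above and uses the rounded pooled point estimates; the function and variable names are illustrative only. With these rounded inputs it reproduces the approximately 29% low-risk fraction and 37% high-risk-group risk for the monocyte-count rule, and a low-risk fraction of just under 10% for the Ammann rule. The low-risk-group risks it returns (about 12% and 6% respectively) sit somewhat above the figures quoted in the text, which may reflect the use of unrounded model estimates.

```python
def stratified_risks(sensitivity, specificity, prevalence):
    """Group sizes and risks implied by a dichotomous rule.

    A 'positive' result means the rule classes the episode as high risk;
    prevalence is the assumed proportion of episodes with a serious
    infectious event.
    """
    tp = prevalence * sensitivity              # events classed high risk
    fn = prevalence * (1 - sensitivity)        # events classed low risk
    tn = (1 - prevalence) * specificity        # non-events classed low risk
    fp = (1 - prevalence) * (1 - specificity)  # non-events classed high risk

    low_fraction = fn + tn
    return {
        "low_risk_fraction": low_fraction,
        "risk_in_low_group": fn / low_fraction,
        "risk_in_high_group": tp / (tp + fp),
    }


# Rounded pooled estimates from the text; 30% event rate is the stated assumption
klaassen = stratified_risks(sensitivity=0.88, specificity=0.36, prevalence=0.30)
ammann = stratified_risks(sensitivity=0.98, specificity=0.13, prevalence=0.30)

for name, result in (("Klaassen", klaassen), ("Ammann", ammann)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```

The calculation shows the trade-off directly: raising sensitivity lowers the risk carried by the low-risk group, but at low specificity very few patients are classified as low risk.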
Other rules have shown marked heterogeneity: the Alexander rule [28] and the PINDA rule. The data support the use of the PINDA rule in Chile, where it has been successfully validated [25], but do not support its use in Europe. A similar situation exists with the Brazilian rule [27], which again was not successfully validated in European data sets. The Alexander rule did not successfully differentiate patients at admission in the UK and Europe, but its use at a 48-hour reassessment was associated with successful reductions in hospital stay. A further, newer rule from the SPOG group requires more validation before a decision can be made on its usefulness.
These findings, that validation of CDR may be poor in comparison to derivation, and that geographical variation may mean CDR fail to work universally, have important clinical implications. There is a wealth of examples in the statistical and methodological literature regarding the over-optimism of newly derived CDR [29,30]. The core concept is that rules derived from one dataset fit the idiosyncrasies and anomalies of the data collected, rather than reflecting the predictive power in the whole population of children experiencing FNP. However, these frequently equation-laden papers are uncommonly read by clinicians, and the complex approaches suggested for 'shrinking' the CDR values to increase their reproducibility are tricky to understand and to implement. The geographical variation may arise through different interpretations of similar findings; for example, how "unwell" should a child appear before they fall into this diagnostic category? There may also be subtle differences in the regimes used; for example, the use of steroid pulses in maintenance treatment for acute lymphoblastic leukaemia varies across Europe, and this may affect the discriminatory ability of a CDR.
This review has demonstrated that there is an increasingly wide range of rules, mainly for the prediction of an absence of adverse outcomes during episodes of febrile neutropenia in children, despite the existence of at least sixteen other applicable CDR [3]. Six rules have been subject to further verification, each demonstrating a variable degree of over-optimism in the original reports when the CDR is applied in different settings. The small size of these reports, with low ratios of events per variable examined, may explain some of the variability in factors selected and poor reproducibility, as may undefined aspects of geographical differences between populations.
The practical application of a CDR requires it to be appropriate to the healthcare setting, and validated in the setting in which it is to be used. There remains a need for further research to reduce uncertainty around the efficiency of CDR, and potentially to generate a very robust model on the basis of a much larger dataset, with well over 20 events per variable under examination. Importantly, rules should also identify a group at the highest risk of complications, in order to concentrate potentially lifesaving early sepsis interventions in this group [31]. This project is already underway, with the PICNICC collaboration having collected data on around 5000 episodes of febrile neutropenia from 18 collaborating groups across North and South America, Europe and Asia.