Clinical Scores for Dyspnoea Severity in Children: A Prospective Validation Study

Background In acutely dyspnoeic children, assessment of dyspnoea severity and treatment response is frequently based on clinical dyspnoea scores. We aimed to validate five commonly used paediatric dyspnoea scores. Methods Fifty children aged 0–8 years with acute dyspnoea were clinically assessed before and after bronchodilator treatment; a subset of 27 children was videotaped and assessed twice by nine observers. The observers scored the clinical signs necessary to calculate the Asthma Score (AS), Asthma Severity Score (ASS), Clinical Asthma Evaluation Score 2 (CAES-2), Pediatric Respiratory Assessment Measure (PRAM) and respiratory rate, accessory muscle use, decreased breath sounds (RAD) score. Results A total of 1120 observations were used to assess fourteen measurement properties within the domains of validity, reliability and utility. All five dyspnoea scores showed overall poor results, scoring insufficiently on more than half of the quality criteria for measurement properties. The AS and PRAM were the most valid, with good values on six and moderate values on three properties. Poor results were mainly due to insufficient measurement properties in the validity and reliability domains, whereas utility properties were moderate to good in all scores. Conclusion Commonly used dyspnoea scores have insufficient validity and reliability to allow for clinical use without caution.


Introduction
Acute dyspnoea is an important reason for paediatric emergencies and hospital admissions [1]. Assessment of dyspnoea severity and treatment response is essential for therapeutic management. Evaluation of dyspnoea severity in children primarily relies on clinical evaluation, because pulmonary function tests are unreliable and infeasible in acutely dyspnoeic children [2]. This clinical evaluation usually involves a combination of clinical signs, as there is no single clinical sign that sufficiently correlates with the degree of dyspnoea or airway narrowing [3][4][5][6]. For this purpose, a range of dyspnoea scores, each comprising a combination of clinical features and signs, has been developed. Although these scores are widely used both in clinical practice and in research, evidence on their measurement properties is limited [7][8][9]. For instance, adequate information on evaluative quality (responsiveness after treatment) is lacking for all existing scores, limiting their use as evaluative instruments in clinical practice or research. Although the lack of sufficient information from validation studies does not necessarily disqualify existing scores from reliably measuring dyspnoea, it does emphasize the need for further validation [7][8][9].

Study population and design
We prospectively recruited fifty children aged 0-8 years, presenting with acute dyspnoea and wheeze to the emergency department of Isala, a large teaching hospital in Zwolle, the Netherlands. All children were assessed before and 30 minutes after treatment with a standard dose of nebulized salbutamol (2.5 mg for patients aged < 4 years and 5 mg for patients aged ≥ 4 years). Only children treated with nebulized salbutamol were included in the study. The emergency department nurse counted the respiratory rate for 1 minute and recorded heart rate and transdermal oxygen saturation by pulse oximetry. If oxygen saturation was <93% or when the child's dyspnoea was considered to be severe, supplemental oxygen was provided through nasal prongs. Respiratory and heart rate were classified using age-equivalent percentile categories [13]. After bronchodilator administration, the attending clinician rated whether the patient showed improvement, slight improvement, no change or deterioration. In addition, after parental informed consent was obtained, the head and chest of the children were videotaped before and 30 minutes after bronchodilator administration. The heart rate and transdermal oxygen saturation were visible on the pulse oximeter during the whole recording. Video recording took place in a single-bed room to prevent disturbing noises and ensure the assessors could hear the wheezing, which was confirmed during evaluations by the assessors.
Written parental informed consent was obtained on behalf of the children to videotape the children. The study was approved by the hospital's medical research ethics committee (Medical Ethical Committee Isala, Zwolle; 09.0536n).

Video assessment
Five paediatricians and four paediatric nurses, each with at least five years of experience in paediatrics, rated the videos. Observers were blinded to clinical information and to the timing of the video (before or after bronchodilator treatment). They assessed all items necessary to calculate the five dyspnoea scores: respiratory rate, wheeze, prolonged expiratory phase, retractions (subcostal, intercostal, jugular, supraclavicular), nasal flaring and mental status. Except for respiratory rate, these clinical signs were rated on a scale of none, mild, moderate or severe. Finally, an overall dyspnoea severity score on a Likert scale from 0 (no dyspnoea) to 10 (severe dyspnoea defined as respiratory insufficiency) was given. Observers rated all videos twice with an interval of at least two weeks. Participating paediatricians and paediatric nurses gave verbal informed consent.
We purposely left out auscultatory findings, to reflect daily clinical practice in which healthcare professionals assessing the child are not always trained in pulmonary auscultation, and for reasons of feasibility (having an acutely dyspnoeic child auscultated by nine independent assessors was considered infeasible and excessively distressing to the child).

Statistical analysis
The five dyspnoea scores were evaluated according to the Consensus-based Standards for the selection of health Measurement Instruments initiative (COSMIN) definitions of fourteen measurement properties for validity, reliability and utility of measures (see S2 File) [14]. Six of these properties (face and content validity, suitability, age span, ease of scoring and auscultation skills) have been described in detail previously [9]. The methodology used for the other eight measurement properties (construct and criterion-concurrent validity, measurement error, inter- and intra-observer reliability, internal consistency, responsiveness and floor and ceiling effects) is explained below [15].
Validity. To assess construct validity we formulated five pre-defined hypotheses about the difference in scores between subgroups in our study sample, three of which referred to evaluative capacity (i.e. the ability to find (small) differences in dyspnoea severity in response to treatment), and two to interpretability (do children with more severe symptoms have higher scores?). The hypotheses we tested were: 1) Dyspnoea score improvement after treatment with bronchodilator is larger in patients in whom the attending physician observed an improvement after bronchodilator, compared to the stable group; 2) The response to bronchodilator treatment is larger when risk factors for atopy (eczema or a positive family history) are present in comparison to patients without risk factors; 3) Patients diagnosed with episodic wheezing or asthma at discharge show a better response than patients diagnosed with bronchiolitis or pneumonia; 4) Children requiring supplemental oxygen have higher dyspnoea scores than those without oxygen; 5) Change in dyspnoea scores is higher in children who are admitted to the hospital in comparison to children sent home. Sufficient construct validity was reached if at least 75% of the hypotheses were confirmed by an independent t-test or Mann-Whitney U test.
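The construct validity criterion can be illustrated with a minimal sketch. The function name and the example outcomes are hypothetical, and the sketch assumes that "75% of the hypotheses" is read as "at least 75%", which for five hypotheses means at least four confirmed.

```python
def construct_validity_sufficient(confirmed, threshold=0.75):
    """Return (is_sufficient, fraction_confirmed).

    confirmed: list of booleans, one per pre-defined hypothesis
    (True = hypothesis confirmed by the statistical test).
    threshold: assumed to mean "at least 75% confirmed".
    """
    fraction = sum(confirmed) / len(confirmed)
    return fraction >= threshold, fraction


# Illustrative outcomes only (not study data): 4 of 5 hypotheses confirmed.
sufficient, fraction = construct_validity_sufficient(
    [True, True, True, True, False]
)
print(sufficient, fraction)
```

With five hypotheses, three confirmed hypotheses (60%) would fall below the threshold under this reading.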
Concurrent validity was evaluated by comparing the total dyspnoea scores with oxygen saturation, age-equivalent respiratory rate percentiles [13] and the 10-point Likert scale for dyspnoea severity. A Pearson or Spearman correlation coefficient of > 0.7 was considered sufficient [14].
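As an illustration of the concurrent validity check, the following sketch computes a Spearman correlation as the Pearson correlation of average ranks and compares it with the 0.7 threshold. The analyses in this study were run in SPSS; the helper names and the example values here are hypothetical, and comparing the absolute value of the coefficient with the threshold is our assumption (a score is expected to correlate negatively with oxygen saturation).

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the group of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_sufficient(r, threshold=0.7):
    # Assumption: the magnitude of the coefficient is what matters.
    return abs(r) > threshold


# Illustrative values only (not study data).
dyspnoea_scores = [5, 7, 6, 9, 8, 10]
likert_severity = [2, 5, 3, 6, 7, 9]
rho = spearman(dyspnoea_scores, likert_severity)
print(round(rho, 3), correlation_sufficient(rho))
```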
Reliability. Agreement was quantified by calculating the Smallest Detectable Change (SDC) and Minimal Important Change (MIC) [16]. For a score to be of evaluative value, the SDC must be smaller than the MIC [14]. The SDC was obtained by multiplying the standard deviation of the change in dyspnoea score in the stable group (not importantly improved as judged by the attending physician) by 1.96 [16,17]. We applied the visual anchor-based MIC distribution to calculate the MIC for each dyspnoea score, using two external criteria or 'anchors' for judging the degree of responsiveness to treatment [16]: 1) the clinical judgment of the response to treatment by the paediatrician who assessed the child "live" at the emergency department and 2) the difference in respiratory rate percentile before and after administration of bronchodilator.
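The SDC rule described above can be sketched as follows. This is a minimal illustration rather than the SPSS procedure used in the study: the function names and the change scores are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
import statistics


def smallest_detectable_change(changes_stable_group):
    """SDC = 1.96 x SD of the change scores in the stable group.

    Assumption: the sample standard deviation (n - 1) is used.
    """
    sd_change = statistics.stdev(changes_stable_group)
    return 1.96 * sd_change


def has_evaluative_value(sdc, mic):
    """A score is of evaluative value only if the SDC is smaller than the MIC."""
    return sdc < mic


# Illustrative change scores (after minus before) in stable patients.
stable_changes = [0, 1, -1, 2, 0, -2, 1, -1]
sdc = smallest_detectable_change(stable_changes)

# Hypothetical MIC of 2.0 points, for illustration only.
print(round(sdc, 2), has_evaluative_value(sdc, mic=2.0))
```

In this invented example the SDC exceeds the MIC, so a change of the size considered clinically important could not be distinguished from measurement error.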
To assess intra- and interrater reliability, the intraclass correlation coefficient (ICC) was calculated (two-way mixed models, absolute agreement and single measurements), considering an ICC of ≥0.70 as adequate [13]. The standard error of measurement (SEM) due to variation within observers was calculated as $\mathrm{SEM} = \mathrm{SD}_{\mathrm{difference}} / \sqrt{2}$ [16]. The SEM due to variation between observers was calculated from the pooled SD of the mean scores of the different observers using the formula $\mathrm{SEM} = \mathrm{SD}_{\mathrm{pooled}} \times \sqrt{1 - \mathrm{ICC}}$, where $\mathrm{SD}_{\mathrm{pooled}}$ was calculated as $\sqrt{(\mathrm{SD}^2_{\mathrm{observer1}} + \mathrm{SD}^2_{\mathrm{observer2}} + \ldots)/n}$ [16,17]. As the RAD score cannot be regarded as a continuous variable with a range of only 4, we chose to use weighted kappa values instead of ICC and did not calculate SEM.
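The two SEM calculations can be written out as a short sketch. The function names and input values are hypothetical; the ICC itself is taken as given here (in the study it was estimated in SPSS), and the sample standard deviation is assumed for the within-observer differences.

```python
import math
import statistics


def sem_within(differences):
    """SEM within observers: SD of the test-retest differences / sqrt(2)."""
    return statistics.stdev(differences) / math.sqrt(2)


def pooled_sd(observer_sds):
    """Pooled SD: square root of the mean of the squared observer SDs."""
    n = len(observer_sds)
    return math.sqrt(sum(sd ** 2 for sd in observer_sds) / n)


def sem_between(observer_sds, icc):
    """SEM between observers: pooled SD x sqrt(1 - ICC)."""
    return pooled_sd(observer_sds) * math.sqrt(1 - icc)


# Illustrative inputs only (not study data).
retest_differences = [0, 1, -1, 1, -1]   # second minus first rating
observer_sds = [2.0, 2.0, 2.0]           # SD of scores per observer
print(round(sem_within(retest_differences), 3))
print(round(sem_between(observer_sds, icc=0.64), 3))
```

The between-observer formula makes the dependence on reliability explicit: for a fixed pooled SD, a lower ICC gives a larger SEM.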
Responsiveness was determined by calculating the area under the curve (AUC) of the receiver operating characteristic curve of the improved versus stable group using the two abovementioned anchors [14]. A value ≥0.70 was considered appropriate.
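For illustration, the AUC for an improved versus a stable group can be computed from pairwise comparisons, using the equivalence between the ROC AUC and the Mann-Whitney statistic. This is a sketch with invented change scores and a hypothetical function name, not the SPSS procedure used in the study.

```python
def roc_auc(improved_changes, stable_changes):
    """AUC as the probability that a randomly chosen improved patient has a
    larger score change than a randomly chosen stable patient; ties count
    as one half."""
    wins = 0.0
    for a in improved_changes:
        for b in stable_changes:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(improved_changes) * len(stable_changes))


# Illustrative improvements in dyspnoea score (not study data).
improved = [3, 2, 4, 1]
stable = [0, 1, 2, 0]
auc = roc_auc(improved, stable)
print(auc, auc >= 0.70)
```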
Utility. Floor or ceiling effects were evaluated by calculating the percentage of patients with the lowest or highest possible dyspnoea score. Floor and ceiling effects were considered adequate when <15% [16].
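The floor and ceiling criterion amounts to a simple proportion, sketched below. The function name and the scores are hypothetical; the example uses a 4-12 scoring range, as applies to the AS.

```python
def floor_ceiling_percentages(scores, min_score, max_score):
    """Percentage of observations at the lowest and highest possible score."""
    n = len(scores)
    floor_pct = 100.0 * sum(s == min_score for s in scores) / n
    ceiling_pct = 100.0 * sum(s == max_score for s in scores) / n
    return floor_pct, ceiling_pct


def effects_adequate(floor_pct, ceiling_pct, limit=15.0):
    """Adequate when both floor and ceiling percentages are below 15%."""
    return floor_pct < limit and ceiling_pct < limit


# Illustrative scores on a 4-12 scale (not study data).
scores = [4, 5, 6, 7, 8, 9, 10, 12, 6, 7]
floor_pct, ceiling_pct = floor_ceiling_percentages(scores, 4, 12)
print(floor_pct, ceiling_pct, effects_adequate(floor_pct, ceiling_pct))
```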
All statistical analyses were performed using SPSS version 23.0. P values < 0.05 were considered statistically significant.

Results
Fifty patients were evaluated before and after administration of bronchodilators, some on several occasions, resulting in 148 observations. Twenty-seven of these patients were videotaped before and after bronchodilator treatment, assessed twice by the nine observers, accounting for 972 video ratings and thus a total of 1120 observations. Patient and clinical characteristics are shown in Table 1.

Validity
The AS, PRAM and RAD showed adequate construct validity, defined as at least 75% confirmed hypotheses (Table 2). Remarkably, testing of hypothesis 5 showed that the change in dyspnoea scores was not higher in patients ultimately hospitalised than in the children who were sent home after their visit to the emergency department.
Concurrent validity showed an insufficient correlation with the oxygen saturation or respiratory rate percentile for all five scores (Table 3). Correlations with the dyspnoea severity score (Likert scale) were moderate, ranging from 0.441-0.567, and did not exceed the minimum threshold.

Reliability
Results on agreement are presented in Table 4 (for a detailed calculation, see S3 File). None of the five dyspnoea scores showed good agreement: in all five the smallest detectable change (SDC) was larger than the minimal important change (MIC).
Data on intra- and interrater reliability and internal consistency are presented in Table 5. Intrarater reliability was good for the AS (ICC 0.75) and ASS (ICC 0.74) and borderline sufficient for the other three scores. Only the AS showed a moderate interrater reliability (ICC 0.64). Internal consistency was inadequate in all scores. For responsiveness, only the ASS reached the threshold of an adequate AUC (Table 4).

Utility
Floor and ceiling effects were adequate in all five dyspnoea scores (Table 6). A summary of the assessment of the five scores based on our results is given in Table 7.

Discussion
In this prospective study we aimed to examine the external validity of five commonly used dyspnoea scores in children with acute dyspnoea and wheezing. To our knowledge, this is the first study to quantitatively compare measurement properties across dyspnoea scores in dyspnoeic children using independent ratings by nine experienced clinicians. The results show that commonly used dyspnoea scores in children have poor measurement properties, with insufficient results on more than half of the quality criteria for measurement properties (7 to 9 out of 14). The AS and PRAM were evaluated as most valid, with good values on six and moderate values on three measurement properties. The poor results were mainly due to insufficient measurement properties in the validity and reliability domains, whereas utility properties were moderate to good in all scores.
Although assessment of validity (do the scores measure what they intend to measure?) is difficult due to the lack of a solid "gold" standard or reference value, all dyspnoea scores scored insufficiently on the different aspects of validity that can be assessed in the absence of a "gold" standard. None of the composite dyspnoea scores showed good correlations with other single measures of dyspnoea severity, including objective measures (oxygen saturation or respiratory rate) and subjective measures (impression of dyspnoea severity judged by the attending experienced physician). These findings are consistent with earlier reports, showing poor to modest correlation with single clinical signs, arterial oxygen saturation or airway obstruction [3][4][5][6]. The AS and PRAM showed a significantly higher, but still only slight, correlation with oxygen saturation. This is likely to be explained by the fact that SpO2 is an item in these scores, so a higher SpO2 would automatically lead to a lower score on the AS and PRAM. The validity of dyspnoea scores may be hindered by the fact that clinical signs of dyspnoea may vary largely across different ages. Even within the limited age range of preschool children in our study, signs of dyspnoea may differ between young infants and toddlers. A score that is applicable in different settings and across a broad age range is desirable.
The poor validity and especially the poor discriminative and evaluative properties of paediatric dyspnoea scores appear to be mainly due to the large interrater variation. The discriminative power of these composite scores is too low compared to the large variation in the characteristics of the population examined. In other words, the scores seem to be neither sensitive nor precise enough to detect the often subtle changes in clinical condition in young children. This means that dyspnoea scores should be used with caution. Since none of the scores is significantly better than the others, it would be preferable if clinicians and researchers in the field of paediatric pulmonology could agree on which (selection of) dyspnoea scores are to be used. We would suggest choosing the AS or PRAM since they scored best of all 36 available dyspnoea scores.
Because the large degree of interrater variation limits the validity of these dyspnoea scores in children, efforts to diminish this interrater variation are clearly needed. In the absence of studies examining this, we propose repeated training and discussions of videotaped dyspnoeic patients among health care professionals. In addition, objective measures to assess severity of dyspnoea in young children are needed. Until such more reliable methods become available, health care professionals caring for acutely dyspnoeic children need to be aware of the unreliability of these scores. This implies that repeated assessment of dyspnoeic children should preferably be done by the same professional, and simultaneous assessment during handovers is warranted.

Strengths and weaknesses
The main strength of our study is the prospective design, with a sufficient number of patients and observers to enable thorough external validation of different dyspnoea scores simultaneously. This has not been done previously. Furthermore, the choice to include patients presenting to the hospital with acute dyspnoea without a specific diagnosis increases the generalizability and applicability of our results, because this patient selection closely reflects daily practice. We used protocolled care for children presenting with acute asthma, pneumonia and bronchiolitis, with clear treatment and admission criteria, although we cannot deny that variance in practice between physicians and individualized treatments exist. The limitations of our study are mainly related to the use of audio-visual recordings and the exclusion of auscultation. The use of videos has its shortcomings in comparison to 'live' assessment, because of the lack of direct personal contact between patients and caregivers. Nevertheless, video assessment was the most suitable option enabling us to make comparisons among multiple clinicians. We left out auscultation because it was difficult to capture with video recordings. This might have led to a score that is different from what it would have been if auscultation had been included. However, previous studies underscored the weak association between auscultatory findings and the actual degree of airway obstruction [2,18]. Another limitation of our study is that most of our patients were moderately dyspnoeic, not representing the entire range of dyspnoea severity. This may have limited the discriminative power of the scores. However, because severe dyspnoea accounts for only 1-2% of the population, we believe that dyspnoea scores should be especially reliable in moderately dyspnoeic children.
Table 6. Floor and ceiling scores and percentages in five dyspnoea scores.
We are aware of the central role of the attending paediatrician in this study. The paediatrician was involved in the clinical decision making with regard to, for instance, oxygen supplementation and admission to the hospital. However, the aim of the dyspnoea scores is to reflect the clinical decision making of clinicians, and therefore comparing the dyspnoea scores to precisely these measures seems the most adequate manner to evaluate the applicability of these scores. Furthermore, by using several hypotheses without involvement of the attending physician, and by using a second, more objective comparison as an anchor (i.e. respiratory rate), we tried to optimize the study design and compensate for the lack of a "true gold standard".

Conclusion
This study is the first to prospectively compare the external validity of five different paediatric dyspnoea scores and enables us to make suggestions about which is best applicable across the population of dyspnoeic children. We found that all of the scores have poor measurement properties, leading to insufficient validity and reliability. Even the two scores with the best test results (AS and PRAM) lack sufficient discriminative and evaluative power to allow their sole use as an outcome measure for dyspnoea severity in children.

Supporting Information
S1 File. Overview of tested dyspnoea scores.