Accuracy and interobserver-agreement of respiratory rate measurements by healthcare professionals, and its effect on the outcomes of clinical prediction/diagnostic rules

Objective: In clinical prediction/diagnostic rules aimed at early detection of critically ill patients, the respiratory rate plays an important role. We investigated the accuracy and interobserver-agreement of respiratory rate measurements by healthcare professionals, and the potential effect of incorrect measurements on the scores of 4 common clinical prediction/diagnostic rules: Systemic Inflammatory Response Syndrome (SIRS) criteria, quick Sepsis-related Organ Failure Assessment (qSOFA), National Early Warning Score (NEWS), and Modified Early Warning Score (MEWS).
Methods: Using an online questionnaire, we showed 5 videos of a healthy volunteer breathing at a fixed (true) rate (13–28 breaths/minute). Respondents measured the respiratory rate, and categorized it as low, normal, or high. We analysed how accurate the measurements were using descriptive statistics, and calculated interobserver-agreement using the intraclass correlation coefficient (ICC), and agreement between measurements and categorical judgments using Cohen's Kappa. Finally, we analysed how often incorrect measurements led to under/overestimation in the selected clinical rules.
Results: In total, 448 healthcare professionals participated. Median measurements were slightly higher (1-3/min) than the true respiratory rate, and 78.2% of measurements were within 4/min of the true rate. ICC was moderate (0.64, 95% CI 0.39–0.94). When comparing the measured respiratory rates with the categorical judgments, 14.5% were inconsistent. Incorrect measurements influenced the 4 rules in 8.8% (SIRS) to 37.1% (MEWS) of cases. Both underestimation (4.5–7.1%) and overestimation (3.9–32.2%) occurred.
Conclusions: The accuracy and interobserver-agreement of respiratory rate measurements by healthcare professionals are suboptimal. This leads to both over- and underestimation of scores of four clinical prediction/diagnostic rules. The clinically most important effect could be a delay in diagnosis and treatment of (critically) ill patients.


Introduction
An abnormal respiratory rate is an important predictor of patient deterioration. [1,2] Consequently, the respiratory rate has a prominent place in many clinical prediction/diagnostic rules, which aim to identify critically ill patients early. Adequate and timely identification of these patients is important, as a delay in treatment increases morbidity and mortality disproportionately. [3][4][5] Commonly used prediction/diagnostic rules for critical illness are the Systemic Inflammatory Response Syndrome (SIRS) criteria, the quick Sepsis-related Organ Failure Assessment (qSOFA), the National Early Warning Score (NEWS), and the Modified Early Warning Score (MEWS) (Table 1). [6][7][8][9] Considering the predictive potential of the respiratory rate, one would expect healthcare professionals to assess it as often and as accurately as possible. However, in daily practice, the respiratory rate turns out to be the least often recorded vital sign, both on wards and in emergency departments (EDs). [10][11][12] Unlike body temperature, blood pressure, and heart rate, the respiratory rate is mostly measured manually, which could be one explanation for its infrequent recording. In addition, counting the respiratory rate is believed to waste valuable time. [13] In order to improve documentation of the respiratory rate, some organizations use systems that force employees to record it. This may, however, lead to inaccurate estimations of the respiratory rate, causing a delay in the identification and treatment of patients with serious conditions, such as sepsis. [7,14] Importantly, minor changes in the respiratory rate, just above or below normal, can have important effects on risk stratification for critically ill patients.
Although the accuracy and interobserver-agreement of respiratory rate measurements by healthcare professionals have been reported to be fair to good, most of these studies used a wide range of respiratory rates that included probably unnaturally low and high values (5-60 breaths/minute), and the number of observers was small. [14,15] The impact of misclassification of respiratory rate measurements on important diagnostic/prognostic rules for critically ill patients has not yet been studied.
In this study, we investigated the accuracy and interobserver-agreement of respiratory rate measurements by different healthcare professionals, using 5 videos with different respiratory rates of one healthy volunteer. We hypothesized that a substantial proportion of measurements would deviate more than 4/min from the true respiratory rate, and that there would be inconsistencies when comparing continuous measurements with categorical judgments. Furthermore, we expected that deviations from the true respiratory rate would influence the outcome of 4 frequently used clinical prediction/diagnostic rules: SIRS, qSOFA, MEWS, and NEWS. [6][7][8][9]

Design and setting
For this questionnaire-based study, we made videos of a healthy volunteer breathing at different respiratory rates. We shared these videos and a corresponding questionnaire with healthcare professionals through e-mail and social media. The research protocol was reviewed by the ethics committee METC Z, which deemed formal approval unnecessary. Participants were aware of the study aims and the intention of publishing the results in a peer-reviewed journal. They were asked to participate if interested.

Videos
We created five videos showing a healthy male volunteer in supine position in a quiet setting. In each video, the volunteer breathed at a constant respiratory rate between 13 and 28 breaths per minute (28, 13, 22, 19 and 25 breaths/minute for videos 1 to 5, respectively). In order to breathe at a constant rate, our volunteer was guided by ECG-derived respiratory signals on a monitor. We selected stable video recordings, to make sure there was no variation in

Questionnaire
In March 2018, an invitation to participate in this questionnaire was distributed among different healthcare professionals throughout the Netherlands. We sent invitations by e-mail to the professional network of the authors, and we encouraged recipients to pass the invitation on to relevant colleagues. Furthermore, we posted the link to the (Dutch) survey on social media (Twitter, LinkedIn) in order to reach as many potential respondents as possible. The questionnaire could be filled out during a period of 3 weeks. We asked respondents about their profession, their years of experience in the current profession, and their preferred method of respiratory rate assessment. Thereafter, video 1 was shown. Respondents were asked to measure the respiratory rate in each video and, after each video, to judge whether it was 'low', 'normal' or 'high'. We did not provide a definition of these three categories, as a categorical description of the respiratory rate is often used in daily practice.

Statistical analyses
All statistical analyses were performed using IBM SPSS statistical software version 25 (Chicago, Illinois, USA). We used descriptive statistics to summarize the respondents' profession, experience, and preferred method of respiratory rate assessment.
In order to assess how accurate the respondents' measurements were, we used descriptive analysis and calculated medians with interquartile ranges (IQR). In addition, we calculated the proportion of measurements that were within 4 breaths/minute of the true respiratory rate. This cut-off value was chosen since we expected that a majority of the respondents would count breaths for 15 seconds and multiply by 4. A miscount of 1 breath would therefore result in a deviation of 4 breaths/minute from the true rate. To investigate whether there were significant differences in measurements between groups of professionals, we compared groups for each video.
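The accuracy summary described above can be sketched in a few lines of Python. This is a minimal illustration and not the SPSS analysis used in the study: the function names and the example measurements are hypothetical, and counting a difference of exactly 4 breaths/minute as "within" the cut-off is an assumption.

```python
from statistics import median, quantiles


def within_tolerance(measured, true_rate, tolerance=4):
    """Fraction of measurements within +/- tolerance breaths/min of the true rate.

    Assumption: a difference of exactly `tolerance` counts as within range.
    """
    hits = sum(1 for m in measured if abs(m - true_rate) <= tolerance)
    return hits / len(measured)


def rate_from_count(breaths_counted, seconds=15):
    """Extrapolate a breath count over a short window to breaths per minute."""
    return breaths_counted * 60 / seconds


# Hypothetical measurements for a video with a true rate of 19 breaths/minute.
measured = [20, 20, 16, 24, 19, 28, 20, 18]
true_rate = 19

q1, _, q3 = quantiles(measured, n=4)  # quartile cut points for the IQR
print("median:", median(measured), "IQR:", (q1, q3))
print("within 4/min:", within_tolerance(measured, true_rate))  # 0.75

# Miscounting by one breath in a 15-second window shifts the result by 4/min.
print(rate_from_count(5) - rate_from_count(4))  # 4.0
```

The last line reproduces the reasoning behind the cut-off: one breath more or less in a 15-second count changes the extrapolated rate by 4 breaths/minute.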
We further determined the interobserver-agreement of the measured respiratory rates by calculating the intraclass correlation coefficients (ICC) and their 95% confidence intervals (CI), based on a single-measurement, absolute-agreement, 2-way random effects model. This was done for all videos together, as well as for videos 1, 3 and 5 combined (respiratory rate >20 breaths/minute), and for videos 2 and 4 combined (respiratory rate <20 breaths/minute). ICC values less than 0.50 are considered indicative of poor interobserver-agreement, values between 0.50 and 0.75 of moderate agreement, values between 0.75 and 0.90 of good agreement, and values higher than 0.90 of excellent agreement. [16] In order to achieve a large, representative group of participants, we limited the number of videos to 5. This was in accordance with the sample size we calculated to investigate interobserver agreement. We additionally calculated whether showing 10 instead of 5 videos would reduce the width of the confidence intervals, but this did not result in narrower confidence intervals.
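For readers who want to see what a single-measurement, absolute-agreement, 2-way random effects ICC involves, the sketch below computes it from the mean squares of a two-way layout. It is an illustration under stated assumptions, not the SPSS procedure used in the study: the function names are hypothetical, the rating matrix is a small made-up-style example rather than study data, confidence intervals are omitted, and assigning a value of exactly 0.50, 0.75 or 0.90 to the higher category is an assumption.

```python
def icc_single_absolute(ratings):
    """Single-measurement, absolute-agreement, two-way random effects ICC.

    ratings[i][j] is the value given to target i (e.g. a video)
    by rater j (e.g. a respondent); every rater rates every target.
    """
    n = len(ratings)       # number of targets
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def interpret_icc(icc):
    """Labels following the thresholds quoted in the text."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"


# Illustrative matrix: 6 targets (rows) rated by 4 raters (columns).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_single_absolute(example)
print(round(icc, 2), interpret_icc(icc))  # 0.29 poor
```

In the study's layout, the rows would be the 5 videos and the columns the 448 respondents. Because absolute agreement is required, systematic differences between raters (the column mean square) lower the coefficient.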
In addition, the respondents' measurements of the respiratory rate were compared with their categorical judgments ('low', 'normal', 'high'). We used the following cut-off values to define a low, normal and high respiratory rate: <12 breaths/minute for 'low', 12 through 20 for 'normal', and >20 for 'high'. These are widely used cut-off points for adults. [6] Cohen's Kappa statistics were used to measure the agreement between the respondents' measurements and their categorical answers. Kappa values of 0.6-0.8 represent moderate agreement, values of 0.8-0.9 strong agreement, and values >0.9 almost perfect agreement. [17]
In order to evaluate the potential clinical relevance of accurate respiratory rate measurements, we calculated how often an incorrect measurement of the respiratory rate would have resulted in an incorrect result on 4 clinical prediction/diagnostic rules for critical illness: SIRS, qSOFA, NEWS, and MEWS (Table 1).
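The comparison between measured rates and categorical judgments can be illustrated as follows. This is a minimal sketch, assuming the cut-offs stated above (<12 low, 12 through 20 normal, >20 high); the function names and the example responses are hypothetical and the study itself used SPSS.

```python
from collections import Counter


def categorize(rate):
    """Map a measured rate (breaths/min) to the category implied by the cut-offs."""
    if rate < 12:
        return "low"
    if rate <= 20:
        return "normal"
    return "high"


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equally long label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    # Note: undefined (division by zero) if expected agreement equals 1.
    return (observed - expected) / (1 - expected)


# Hypothetical responses: a measured rate and the judgment given with it.
measured = [20, 20, 24, 13, 19, 22, 28, 14]
judged = ["high", "normal", "high", "normal", "normal", "high", "high", "normal"]

derived = [categorize(r) for r in measured]
inconsistent = sum(d != j for d, j in zip(derived, judged))

print(inconsistent)                   # 1 (a measured 20/min judged as "high")
print(cohens_kappa(derived, judged))  # 0.75
```

The single inconsistency in this toy example is of the kind the cut-offs make possible at the boundary: a measurement of exactly 20/min falls in the 'normal' category but is judged 'high'.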

Respondents and method of assessment
In total, 452 respondents filled out the questionnaire within 3 weeks after sending out the first invitation (median 3, IQR 2-7 days). After exclusion of 4 incomplete questionnaires, we included 448 respondents in the analyses. The study sample consisted of nurses, consultants, residents, medical students, general practitioners (GPs) and other healthcare professionals (Table 2). Of these participants, 432 (96.4%) assessed the respiratory rate on a regular basis.
Fig 2 shows the measured respiratory rates for each video. In general, the median reported respiratory rate was 1-3 breaths/minute higher than the true rate. IQRs were between 2 and 4 breaths/minute, and the overall range of measurements was between 6 and 64/min. Table 2 shows the proportion of measurements within 4/min of the true respiratory rate. Overall, 78.2% of measurements were within this range (67.4%, 81.9%, 81.9%, 87.9%, and 71.7% for videos 1-5, respectively). We found no significant differences in this proportion between the different groups of professionals (Table 2).

Interobserver-agreement
For all respiratory rate measurements of the 5 videos together, the ICC was 0.64 (95% CI 0.39-0.94), which indicates moderate agreement. For videos with a high respiratory rate (videos 1, 3 and 5; >20 and ≥22/min), the ICC was 0.29 (95% CI 0.10-0.94), indicating poor agreement. Videos with a low respiratory rate (videos 2 and 4; <20/min) showed an ICC of 0.50 (95% CI 0.16-0.99), indicating moderate agreement.
Table 3 shows the agreement between the respondents' measurements and their categorical judgments. For all videos together, 324 (14.5%) inconsistencies were present. Most (n = 194, 8.7%) of these occurred when a respondent measured a "normal" respiratory rate (12 through 20/min), and incorrectly judged this to be "high". In most (n = 148, 76.3%) of these cases, the respiratory rate was measured as exactly 20/minute. In 68 cases (3.0%), a respondent measured a "high" respiratory rate (>20 breaths/minute), and incorrectly judged this to be "normal" (n = 64, 2.9%) or "low" (n = 4, 0.2%). Cohen's Kappa was 0.71 for all videos together, which represents moderate agreement. However, for all individual videos, Cohen's Kappa was lower (0.27-0.59).
Table 4 shows the potential effect of incorrect respiratory rate measurements on SIRS, qSOFA, NEWS, and MEWS. Of these rules, SIRS was least affected, with misclassification in 8.8%. qSOFA scores changed in 8.9%, NEWS in 18.2%, and MEWS scores changed in 37.1% of cases. Overall, 4.5-7.1% of patients would incorrectly receive a lower score, while 3.9-32.2% would receive a higher one, when compared to the score based on their true respiratory rate.
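The way a measurement error propagates into a rule score can be sketched as below. This is a hedged illustration rather than the study's own calculation: Table 1 is not reproduced in this text, so the respiratory rate thresholds are assumptions taken from the commonly published versions of each rule (SIRS >20/min, qSOFA ≥22/min, and the banded respiratory rate points of NEWS and MEWS), and all function names are hypothetical. Only the respiratory rate component of each rule is scored.

```python
def sirs_rr(rate):
    """SIRS respiratory criterion (assumed threshold: >20/min)."""
    return 1 if rate > 20 else 0


def qsofa_rr(rate):
    """qSOFA respiratory criterion (assumed threshold: >=22/min)."""
    return 1 if rate >= 22 else 0


def news_rr(rate):
    """NEWS respiratory rate points (assumed bands)."""
    if rate <= 8:
        return 3
    if rate <= 11:
        return 1
    if rate <= 20:
        return 0
    if rate <= 24:
        return 2
    return 3


def mews_rr(rate):
    """MEWS respiratory rate points (assumed bands)."""
    if rate < 9:
        return 2
    if rate <= 14:
        return 0
    if rate <= 20:
        return 1
    if rate <= 29:
        return 2
    return 3


RULES = {"SIRS": sirs_rr, "qSOFA": qsofa_rr, "NEWS": news_rr, "MEWS": mews_rr}


def effect_on_rules(true_rate, measured_rate):
    """Classify each rule's score as 'under', 'over' or 'same' versus the true rate."""
    result = {}
    for name, score in RULES.items():
        diff = score(measured_rate) - score(true_rate)
        result[name] = "over" if diff > 0 else "under" if diff < 0 else "same"
    return result


# A measurement 2/min above a true rate of 19/min crosses three cut-offs.
print(effect_on_rules(19, 21))
# {'SIRS': 'over', 'qSOFA': 'same', 'NEWS': 'over', 'MEWS': 'over'}

# A measurement 1/min below a true rate of 25/min lowers only the NEWS points.
print(effect_on_rules(25, 24))
# {'SIRS': 'same', 'qSOFA': 'same', 'NEWS': 'under', 'MEWS': 'same'}
```

Under these assumed bands, rules with more cut-offs near the rates shown in the videos have more opportunities for a small measurement error to change the score.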

Discussion
This study is, to our knowledge, the first to use a large, heterogeneous group of professionals to measure and categorize different clinically relevant respiratory rates. Our study shows that respiratory rate measurements by healthcare professionals are not accurate, and that the interobserver-agreement is suboptimal, which may have an important effect on the results of four common clinical prediction/diagnostic rules. We designed this study using simple tools, available to the majority of healthcare professionals today. We made five videos and shared them using e-mail and social media, after which 448 professionals completed and returned the questionnaire within three weeks. Median measured respiratory rates were slightly higher than the true respiratory rate, 78.2% of measurements were within 4 breaths per minute of the true rate, and the ICC was moderate. These results are in line with those of previous studies. [18,19] Remarkably, 14.5% of responses showed inconsistencies when comparing the respondents' measurements and their categorical judgments. In addition, incorrect respiratory rate measurements may in theory have led to both overestimation (12.9%) and underestimation (5.4%) of the score of four common prediction/diagnostic rules.
The measured respiratory rates varied widely. While IQRs were between 2 and 4/min, ranges were wide (overall 6-64/min). Overall, 78.2% of measurements were within 4 breaths per minute of the true rate. We did not find any differences between professional groups regarding the proportion of measurements within 4/min of the true rate. These results suggest that respiratory rate assessment by different groups of healthcare professionals is suboptimal.
With a value of 0.64 (95% CI 0.39-0.94), the ICC was moderate. Previous studies have demonstrated values as low as 0.26 (95% CI 0.16-0.35), but also as high as 0.99 (95% CI 0.97-1.00). [14,15] A possible explanation for this wide variation is the difference in design between these studies. One study, with a low ICC (0.26), compared values recorded in patient charts to values measured manually by residents. [14] These values were not obtained at the exact same time, and while the participating residents were informed and prepared, the nurses who performed the measurements were not. Another study, with a high ICC (0.99), performed a simulation using 5 videos as well. [15] Respondents were mostly experienced nurses, and the respiratory rates in the videos varied widely: 5, 10, 15, 30 and 60 breaths/minute. For professionals like these, it is relatively easy to differentiate between a respiratory rate of 15 and 60, or even 30 breaths/minute. However, measuring a respiratory rate just above or below commonly used cut-off points of >20 or ≥22 breaths/minute is more difficult. Therefore, the smaller range of respiratory rates in our videos, and our large, heterogeneous group of (future) healthcare professionals may have resulted in our less favourable ICCs. As the respiratory rate has been proven to predict adverse outcomes and is incorporated in many clinical prediction/diagnostic rules, this is an important finding. [2,20,21] When comparing the respondents' measurements and their categorical judgments, 14.5% of the answers were inconsistent. Respondents measuring a normal (12-20/min) respiratory rate, while judging this as 'high', caused the most inconsistencies (8.7%). In over 75% of these cases, the measured respiratory rate was exactly 20/min, which could suggest that some respondents believe that a respiratory rate of 20/min is abnormal.
We did not provide a definition of "low", "normal", or "high", but there is no current guideline that supports the use of a cut-off point <20/min for an abnormal respiratory rate. As these results suggest a lack of knowledge regarding common cut-off points, it would be worthwhile to investigate whether education would improve them. One of the most interesting results of this study was found in the impact of incorrect respiratory rate measurements on daily practice. We entered the respondents' answers into four commonly used prediction/diagnostic rules, as a proxy of the "true consequence" of incorrect measurements. This resulted in incorrect scores for SIRS in 8.8%, for qSOFA in 8.9%, for NEWS in 18.2%, and for MEWS in 37.1%. While median measurements were higher than the true respiratory rate in all videos, the incorrect measurements resulted in both incorrectly lower and incorrectly higher scores (Table 4). In daily practice, this could have led to delayed diagnosis and treatment of (critically) ill patients, or to overalerting and eventually alarm fatigue.
By performing this video-based questionnaire, we created the opportunity to have 448 healthcare professionals measure the respiratory rate of the same patient breathing at a constant rate. This design also has limitations. Respondents could only visually measure the respiratory rate. Some professionals normally use palpation of the chest to optimize their measurement. However, we made sure that the volunteer's breaths could be seen clearly in all videos, and we expect that the restriction to visual assessment had no major influence on the results. In order to provide high quality, stable recordings, we had to select specific sections of video, resulting in 4/5 videos being slightly less than 1 minute long. This could have resulted in suboptimal measurements by 8.3% of respondents, as they reported that they usually measure the respiratory rate for a full minute. Finally, we did not include a video with a low respiratory rate, so we cannot draw conclusions regarding the ability of healthcare professionals to recognize bradypnea.
Notwithstanding these limitations, this study shows that, even when professionals are asked to measure the respiratory rate to the best of their ability, results are still suboptimal. In crowded EDs, quick and reliable methods to accurately measure the respiratory rate could be valuable, especially since many EDs and hospitals rely on these measurements to identify patients at risk, for instance, of sepsis. Therefore, further research should be undertaken to investigate the reliability of non-invasive methods to measure the respiratory rate, especially in EDs. This could help to avoid incorrect alarms and, even more importantly, delays in diagnosis and treatment of patients who are potentially very ill.
In conclusion, using simple tools available to most healthcare professionals today, we showed that accuracy and interobserver-agreement of respiratory rate measurements by healthcare professionals are suboptimal. The clinical relevance of incorrect measurements is illustrated by alterations in the score of four common prediction/diagnostic rules. This happened in 8.8-37.1% of cases, with the clinically most important effect being a potential delay in diagnosis and treatment of (critically) ill patients.