Inter- and intra-unit reliability of the COSMED K5: Implications for multicentric and longitudinal testing

Purpose To evaluate the intra-unit (RELINTRA) and inter-unit reliability (RELINTER) of two structurally identical units of the metabolic analyser K5 (COSMED, Rome, Italy), which can operate with either breath-by-breath (BBB) or dynamic mixing chamber (DMC) technology. Methods Identical flow and gas signals were transmitted to both K5s, which always operated simultaneously in either BBB- or DMC-mode. To assess RELINTRA and RELINTER, a metabolic simulator was used to simulate four graded levels of respiration. RELINTRA and RELINTER were expressed as typical error (TE%) and intraclass correlation coefficient (ICC). To also assess inter-unit differences via natural respiratory signals, 12 male athletes performed one incremental bike step test each in BBB- and DMC-mode. Inter-unit differences within biological testing were expressed as percentages. Results In BBB, TE% of RELINTRA ranged 0.30–0.67 vs. RELINTER 0.16–1.39 and ICC ranged 0.57–1.00 vs. 0.09–1.00. In DMC, TE% of RELINTRA ranged 0.38–0.90 vs. RELINTER 0.03–0.86 and ICC ranged 0.22–1.00 vs. 0.52–1.00. Mean inter-unit differences ranged from -2.30 to 2.20% (Cohen’s ds 0.13–1.52) for BBB-mode and from -0.55 to 0.61% (ds 0.00–0.65) for DMC-mode. Inter-unit differences for V˙O2 and RER were significant (p < 0.05) at each step. Conclusion Two structurally identical K5-units demonstrated accurate RELINTRA with TE < 2.0% and similar RELINTER during metabolic simulation. During biological testing, inter-unit differences for V˙O2 and RER in BBB-mode were higher than 2%, with partially large effect sizes (ES). Hence, a particular K5 unit should be assigned to each individual wherever possible. Otherwise, e.g. in multicenter studies, a decrease in total reliability needs to be considered, especially when the BBB-mode is applied.



Introduction
Cardiopulmonary exercise (CPX) testing is frequently used to evaluate fitness in healthy and unhealthy populations in cross-sectional [1] or longitudinal [2] evaluations or studies, as underlined by 3357 hits for the search term "V˙O2max testing" in PubMed within 5 years (Aug. 2020). As for any scientific method, reliability is an important quality measure for CPX devices. If testing is conducted within one laboratory and only one CPX unit is applied, the overall variability of one unit between two or more trials is a key quality measure. This will be called intra-unit reliability (i.e. intra-rater reliability [3]) hereafter. As soon as more than one unit of a particular device is used, be that within one or in multiple labs, the variability between two or more CPX units needs to be considered, too. This will be called inter-unit reliability (i.e. inter-rater reliability [3]) hereafter. A third measure used throughout this paper is the difference between two units at a given intensity or stage. This will be called inter-unit difference hereafter.
Interestingly, reliability of CPX devices has been determined in several studies, with differences ranging from 0.12 to 8.15% for common respiratory variables [4,5], but these results are nearly always based on a single apparatus. Nevertheless, they are in effect generalized, assuming that a particular unit represents the whole "population" of the respective device. In the rare cases where inter-unit variability between two units of a particular device has been reported, a considerable variability of 0.4-2.1% was found [6,7]. These studies applied either simulated testing [7] via metabolic simulator [8] or biological testing via exercise tests in humans [6]. Of note, metabolic simulators exclude biological variation, but the price to pay is a questionable transferability to human individuals, in whom expired gas concentrations, temperature, and humidity may change throughout the test, as do respiratory frequency and tidal volume. Therefore, an integrative approach has been suggested to evaluate quality measures of CPX devices, in which technical and biological testing are combined [9,10].
Based on this approach, we provided intra-unit reliability and validity data for the portable CPX analyzer COSMED K5 (COSMED, Rome, Italy) [11], which allows selection between breath-by-breath (BBB) and dynamic micro mixing chamber (DMC) technology. We found significantly different measures of validity between modes (0.03-11.56%), but did not evaluate inter-unit reliability. Furthermore, since other studies have reported differences in reliability and validity for several devices applying either BBB or DMC technology [12-14], we hypothesized that inter-unit reliability might also be mode-dependent in the K5. However, to the best of our knowledge, there are no data available that account for intra- and inter-unit reliability of the two measurement principles of the K5.
We therefore aimed to study intra- and inter-unit reliability of the K5 in both BBB- and DMC-mode at low to high ventilation and gas exchange rates under artificial and natural conditions.

Materials and methods
We used the integrative approach suggested by [9,10] to evaluate intra- and inter-unit reliability via metabolic simulations, allowing for repeated measurements without any biological variation or influence (study 1). To substantiate our results, we evaluated the differences between two units in biological tests under natural respiration conditions (study 2).
During all measurements, two K5-units operated simultaneously, receiving the identical flow and gas signals (see: Equipment). Experiments for metabolic simulation were conducted within one day. Biological testing consisted of two tests per participant, which were conducted within four days, with at least one day off. All experiments were conducted in a well-ventilated laboratory with temperature and humidity ranging from 18-21˚C and 60-75%, respectively.

Participants
Twelve trained to well-trained [15] experienced male triathletes, cyclists, or rowers, all of them regularly incorporating cycling into their training (age 29±3 years, V˙O2peak 58.0±6.7 mL·kg⁻¹·min⁻¹, measured with a K5 in DMC-mode), participated in the study. They were experienced with CPX testing and were carefully instructed and familiarized with the specific test procedures. Participants were instructed to keep a standardized diet and to refrain from strenuous exercise for 24 hours preceding the tests. All participants gave their informed written consent to participate in the study, which was approved by the ethical review board of Ulm University (#123) and was conducted in accordance with the Declaration of Helsinki.

Equipment
Metabolic analyzer K5. All tests were conducted with two units of the metabolic analyzer K5 (firmware 1.2, COSMED, Rome, Italy). The K5 combines a 2 mL dynamic micro mixing chamber with proportional micro sampling technology and a dual gas sampling system (IntelliMET™), allowing for either BBB or DMC measurements. For technical details see [11].
To allow for simultaneous measurements, both K5s were connected with a single flow sensor via a customized flow signal splitter and both gas sample lines were connected to a customized turbine housing (Fig 1). Except for the flow signal, both devices operated independently and autonomously.
After a 60-min warm-up of the K5s, calibration of flow (simultaneously), gas sensors, and time-delay were conducted for both units according to the manufacturer's instructions [16].
Cycling ergometer. All exercise tests were conducted on an adjustable electrically braked cycling ergometer (Lode Excalibur, Groningen, Netherlands) that was checked for validity prior to the study. The mean difference to the calibration device was -1.59% (95% CI [-2.14, -1.03]) and therefore systematic, without relevance for this study.

Measuring procedures
Study 1 - intra- and inter-unit reliability via metabolic simulation. To study intra- and inter-unit reliability via metabolic simulations, a total of eight trials were conducted, each consisting of four increasing steady-state metabolic rates produced by the metabolic simulator (MS) (breathing frequency (Bf) 20-60 min⁻¹, V˙E 30-150 L·min⁻¹, V˙O2 0.94-3.96 L·min⁻¹). Data were recorded with both units operating four times simultaneously in BBB-mode and four times simultaneously in DMC-mode (Fig 2A).
Study 2 - inter-unit differences via biological testing. To study inter-unit differences via biological testing, participants performed two incremental tests within four days, separated by at least one day. Both tests consisted of four 4-min stages at 100, 150, 200, and 250 W followed by a ramp test (starting at 250 W, increment of 40 W·min⁻¹) up to voluntary exhaustion (Fig 2B). Each trial was used to obtain respiratory data from both K5s operating simultaneously in either BBB- or DMC-mode, in a randomized order between participants. A standardized warm-up (10 min at 60 W and 5 min at 100 W) preceded each test.

Data processing and statistics
Respiratory data were calculated as 60-s arithmetic means over the final minute of each rate (study 1) or stage (study 2) based on the exported csv-files. To identify corresponding absolute time points of each unit, markers were manually set during measurements and exported afterwards. Since the MS produced a partially humidified gas mixture at ambient temperature, whereas the K5 reports values at standard temperature and pressure, dry (STPD) conditions, a correction for simulated gas exchange data was applied using a modified spreadsheet [17]. In the subsequent ramp test, V˙O2peak was calculated as the highest 30-s moving average (BBB) or the highest single V˙O2 value (DMC), the latter being equal to a 30-s rolling average reported in 10-s intervals.
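The averaging described above can be sketched as follows. This is a minimal illustration, not the processing pipeline used in the study: the function names, the data layout (parallel lists of time stamps in seconds and values), and the use of a time-based trailing window with unweighted samples are assumptions made for the example.

```python
def final_minute_mean(times_s, values, stage_end_s, window_s=60.0):
    """Arithmetic mean of all samples in the last `window_s` seconds of a stage.

    Assumption: samples are unweighted, i.e. each breath or 10-s value
    counts equally, regardless of its duration.
    """
    selected = [
        v for t, v in zip(times_s, values)
        if stage_end_s - window_s < t <= stage_end_s
    ]
    if not selected:
        raise ValueError("no samples inside the averaging window")
    return sum(selected) / len(selected)


def peak_moving_average(times_s, values, window_s=30.0):
    """Highest trailing moving average over complete `window_s` windows."""
    best = None
    for t_end in times_s:
        if t_end - times_s[0] < window_s:
            continue  # skip incomplete windows at the start of the recording
        window = [
            v for t, v in zip(times_s, values)
            if t_end - window_s < t <= t_end
        ]
        mean = sum(window) / len(window)
        if best is None or mean > best:
            best = mean
    return best


# Hypothetical samples: time stamps in seconds, VO2 in L/min
times = [0, 10, 20, 30, 40, 50, 60]
vo2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(final_minute_mean(times, vo2, stage_end_s=60))
print(peak_moving_average(times, vo2, window_s=30))
```

For breath-by-breath data with irregular time stamps, a time-weighted mean would be an equally defensible choice; the text does not specify which variant was used.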
Fig 2. Test protocols for simulated (A, study 1) and biological (B, study 2) testing. To evaluate intra- and inter-unit reliability, study 1 consisted of four metabolic rates produced by a metabolic simulator (MS), conducted four times for breath-by-breath and four times for dynamic mixing chamber mode (left part A). To evaluate inter-unit differences, study 2 consisted of four increasing stages with a consecutive ramp test in 12 male athletes (right part B). The test was conducted twice with both units operating in a randomized order between participants either both in breath-by-breath or both in dynamic mixing chamber mode.
Overall intra- and inter-unit reliability were calculated by (i) the typical error ($\mathrm{TE\%} = \mathrm{CV}_{\text{difference score}} / \sqrt{2}$) [18], (ii) the minimal detectable change ($\mathrm{MDC\%} = 1.96 \times \mathrm{TE\%} \times \sqrt{2}$) [7], and (iii) the intraclass correlation coefficient (ICC) [3] using pooled data of all trials and rates of simulated testing (N = 4 x 4 per unit and mode). According to Koo and Li [10], the intra-unit ICC is defined as a single-measurement, absolute-agreement, 2-way mixed effects model, and the inter-unit ICC by a single-rater, absolute-agreement, 2-way random model. To compare intra-unit A vs. intra-unit B and intra- vs. inter-unit reliability, we used the 95% CI of TE%, MDC% and ICC.
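To make the two formulas for TE% and MDC% concrete, the following sketch computes both from paired measurements. It is an illustration under stated assumptions, not the authors' SPSS procedure: the coefficient of variation of the difference scores is approximated here by the standard deviation of the percentage differences between paired values, and the function names and sample values are hypothetical.

```python
import math
import statistics


def typical_error_percent(series_1, series_2):
    """TE% = SD of the percentage difference scores / sqrt(2).

    Assumption: each difference is expressed as a percentage of the
    pair mean before the standard deviation is taken.
    """
    pct_diffs = [
        100.0 * (a - b) / ((a + b) / 2.0)
        for a, b in zip(series_1, series_2)
    ]
    return statistics.stdev(pct_diffs) / math.sqrt(2)


def minimal_detectable_change_percent(te_percent):
    """MDC% = 1.96 x TE% x sqrt(2)."""
    return 1.96 * te_percent * math.sqrt(2)


# Hypothetical VO2 readings (L/min) at four simulated rates, two trials
trial_1 = [0.94, 1.95, 2.96, 3.96]
trial_2 = [0.95, 1.94, 2.99, 3.93]

te = typical_error_percent(trial_1, trial_2)
mdc = minimal_detectable_change_percent(te)
print(f"TE% = {te:.2f}, MDC% = {mdc:.2f}")
```

Because MDC% is TE% scaled by the constant factor 1.96 × √2 (about 2.77), a TE% of 0.90 corresponds to an MDC% of roughly 2.5.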
To assess detailed differences between units at particular workloads during exercise testing, we calculated percentage inter-unit differences of biological data at each stage (N = 12 x 5 per unit and mode). Inter-unit differences were also used to construct Bland-Altman plots displaying the magnitude and nature of the inter-unit differences as percentages, $100 \times (\mathrm{K5}_A - \mathrm{K5}_B) / ((\mathrm{K5}_A + \mathrm{K5}_B)/2)$, with 95% limits of agreement (mean difference ± (1.96 × SD)) [19]. In case of a magnitude dependency, Bland-Altman plots [19] were adapted by means of linear regression analysis and 95% prediction bands. Bland-Altman plots were interpreted according to Atkinson et al. [8], who distinguish between random, systematic, and proportional differences. Several cut-off values have been proposed for classifying the reliability of metabolic analyzers. Due to the lack of a generally accepted guideline, we applied the most restrictive limits for reliability suggested by Hodges et al. [20], with TE% and inter-unit differences within ±2.00% rated as "accurate" for V˙E, V˙CO2, V˙O2, and RER. Statistical analyses were conducted using SPSS (IBM Corp. Released 2017. IBM SPSS Statistics for Windows, Version 25.0. Armonk, NY: IBM Corp.) unless otherwise stated. The level of significance was set to P ≤ 0.05; distribution and normality of the data were assessed using histograms, probability plots, and Shapiro-Wilk tests.

Results
Study 1 - intra- and inter-unit reliability via metabolic simulation
Table 1 presents the simulation results for intra- and inter-unit reliability in BBB- and DMC-mode. We found a high and accurate technical intra-unit reliability, with TE% below 1% in both devices and modes. Differences in intra-unit reliability between unit A and unit B were significant in neither BBB- nor DMC-mode (indicated by the 95% CI of TE% and MDC%).

PLOS ONE
As expected, inter-unit reliability was lower than intra-unit reliability, but always remained within the 2% threshold. While TE%s for inter- vs. intra-unit reliability were only slightly higher in DMC-mode, differences for inter- vs. intra-unit reliability were higher in BBB-mode and significant for RER compared to unit B. Further, intra- vs. inter-unit reliability for V˙E differed significantly between both units and also between BBB- and DMC-mode.

Study 2 - inter-unit differences via biological testing
The Bland-Altman plots in Fig 3 visualize the inter-unit differences between both units in BBB- and DMC-mode. In BBB-mode, differences between both units for V˙E and V˙CO2 were random, for RER differences were random systematic, while they were rated as proportional systematic for V˙O2. In DMC-mode, differences for all variables were random. Inter-unit differences were significant for V˙O2 (stages 1-4) and RER (stage 3) in BBB-mode, and also significantly larger in BBB- compared to DMC-mode.
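The quantities behind such a Bland-Altman plot (percentage inter-unit difference, mean bias, and 95% limits of agreement) can be sketched as below. This is a hedged illustration only: the function names and the sample values are hypothetical, and the regression-based adaptation for magnitude-dependent differences is not included.

```python
import statistics


def percent_differences(unit_a, unit_b):
    """Percentage difference of each pair, relative to the pair mean."""
    return [
        100.0 * (a - b) / ((a + b) / 2.0)
        for a, b in zip(unit_a, unit_b)
    ]


def limits_of_agreement(diffs):
    """Return (lower limit, bias, upper limit) as bias -/+ 1.96 x SD."""
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias - 1.96 * sd, bias, bias + 1.96 * sd


# Hypothetical VO2 readings (L/min) of two units at the same stages
unit_a = [2.10, 2.85, 3.52, 4.20]
unit_b = [2.12, 2.90, 3.60, 4.31]

diffs = percent_differences(unit_a, unit_b)
lower, bias, upper = limits_of_agreement(diffs)
print(f"bias = {bias:.2f}%, limits of agreement = [{lower:.2f}%, {upper:.2f}%]")
```

In this made-up example the difference grows with the measured value, which is the pattern classified as proportional in the interpretation scheme used here.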

Discussion
The aims of this study were to quantify intra- and inter-unit reliability of two structurally identical COSMED K5 metabolic analyzers in BBB- and DMC-mode at low to maximal respiratory rates, and to quantify the differences between the analyzers. To evaluate intra- and inter-unit reliability in study 1, two identical units operated simultaneously during four trials of four simulated metabolic rates. In study 2, inter-unit differences were determined during two trials of bike exercise consisting of four increasing work rates followed by a ramp test. The major findings were a high and accurate technical intra-unit reliability (TE% below 1%) in both devices and also a high and accurate inter-unit reliability in BBB-mode, albeit almost 75% lower. During biological testing, we found several significant mean differences in V˙O2 and RER > 2% between both units operating in BBB (trivial to large; highest 95% CI range ±3.7%), but < 1% in DMC (trivial to moderate; highest 95% CI range ±1.8%). Hence, inter-unit differences were occasionally significantly larger in BBB- than in DMC-mode.

Intra- and inter-unit reliability via metabolic simulation (study 1)
Generally speaking, small variations between two or more units are likely, because calibration, data acquisition, data processing, and manufacturing tolerances of the hardware add inherent measurement noise. Nevertheless, inter-unit reliability should be close to intra-unit reliability, so that the inaccuracy of an additional device does not inflate the variability of cross-sectional (multicenter studies) and longitudinal (individual monitoring) data beyond the unavoidable intra-unit variability.
We found an accurate intra-unit reliability of the K5 in both modes and for each respiratory variable, indicated by 95% CI of the TE%s within 2.00% [20] and moderate to excellent ICCs [3]. While the ICCs for intra-unit reliability were almost perfect for V˙E, V˙CO2, and V˙O2 in both modes, ICCs for RER were remarkably low, especially in DMC-mode. This is attributable to the dilution method [8], where the MS produces constant ratios for V˙CO2 and V˙O2, resulting in a nearly constant RER of approx. 1.00 at all rates and trials. Consequently, variability within and between rates is very low and therefore even the slightest alterations will have a distinctive effect on the calculated ICC, even though TE% and MDC% indicate an accurate reliability. Aside from that, an unknown percentage of variability between repeated trials is related to the recalibration of the K5s between the repeated trials and the inaccuracy of the MS. However, intra-unit mean TE%s are below or within the inaccuracy range of the MS, which has been reported as ±0.50-1.00% [17].
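The sensitivity of the ICC to a narrow range of true values can be illustrated numerically. The sketch below implements the single-measurement, absolute-agreement, two-way ICC from the usual ANOVA mean squares; it is not the SPSS routine used in the study, and the function name and both data sets are made up for the example. The measurement error is of similar relative size in both sets; only the spread between simulated rates differs.

```python
def icc_absolute_agreement(rows):
    """Single-measurement, absolute-agreement, two-way ICC.

    `rows` holds one list per target (e.g. metabolic rate), with one
    value per rater (e.g. trial or unit).
    """
    n = len(rows)     # number of targets
    k = len(rows[0])  # number of raters
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical data: four simulated rates (rows), two measurements (columns)
vo2 = [[0.94, 0.95], [1.95, 1.94], [2.96, 2.98], [3.96, 3.94]]  # wide range
rer = [[1.00, 1.01], [1.01, 1.00], [1.00, 1.00], [1.01, 1.01]]  # near-constant

print(f"ICC VO2 = {icc_absolute_agreement(vo2):.3f}")
print(f"ICC RER = {icc_absolute_agreement(rer):.3f}")
```

With almost no variance between rates, the ICC for the RER-like data collapses even though the individual deviations are only about 1%, which mirrors the argument made above.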
Unsurprisingly, except for V˙E, which was measured by a single flow sensor, inter-unit reliability of the K5 in both modes tended to be lower than intra-unit reliability, but not significantly so. However, it is worth mentioning that inter-unit reliability of RER measured in BBB-mode was significantly lower than intra-unit reliability of unit B. Hence, if BBB-mode is applied, the addition of device A to a virtual test setup would definitely increase the variability of longitudinal or cross-sectional RER data compared with using device B alone. Notably, intra-unit reliability in our study was in line with previous data [11,13] as well as with very limited data on inter-unit reliability that solely focused on BBB-mode during metabolic simulation [7].

Inter-unit differences via biological testing (study 2)
The detailed differences for each biological stage between both units in each mode (Table 2 and Fig 3B) indicate that inter-unit differences were always below the 2.00% threshold [20] for biological testing in DMC-mode, including the upper 95% CI. In BBB-mode, the situation was less clear-cut: while inter-unit differences for V˙E and V˙CO2 were always lower than 2.00% [20], differences for V˙O2 and RER increased with intensity and the upper limits of the 95% CI exceeded the 2.00% threshold by up to 1.9-fold (Table 2 and Fig 3A). It is worth mentioning that even though mean differences exceeded the 2.00% cut-off [20] only slightly, they were significant and of moderate to large effect size. Hence, biological testing indicated that inter-unit differences could be substantial at certain workloads, even if technical inter-unit reliability is accurate.
To the best of our knowledge, the only study that has focused on inter-unit variability so far reported mean differences between two identical DMC systems (TrueOne 2400, ParvoMedics, Salt Lake City, USA) of 0.8 to 2.6% at low to moderate intensities on a bike ergometer (30 to 120 W) [6]. Considering the higher exercise intensities of up to 409 W during biological testing in our study, the K5 results for inter-unit differences fit well and indicate superior inter-unit reliability for the DMC-mode.

Practical implications of intra- and inter-unit reliability
Our results indicate that intra-unit reliability of the COSMED K5 is highly accurate for both modes and that the magnitude of the inter-unit TE%s is in a similar range. Inter-unit differences for all variables obtained in DMC-mode (-0.55 to 0.61%) are probably neither of physiological relevance (assuming a biological variability of ~2.00% [6,21-24]) nor of practical relevance (assuming a smallest worthwhile change of about 0.3-1.2% [25-27]). But when using the K5 in BBB-mode, scientists and practitioners should be aware of inter-unit differences for V˙O2 and RER of up to -2.30% (95% CI [-3.68, -0.92]) at high intensities, which exceed intra-device reliability and thereby add noise. Hence, if possible, researchers should assign a particular unit to each individual in longitudinal settings. If that is not possible, e.g. when CPX data are collected in multicenter studies, we recommend using the DMC-mode of the K5, due to its lower variability in comparison to the BBB-mode. However, irrespective of the mode, inter-unit reliability and inter-unit differences have to be considered because of their impact on the MDC%. The MDC% is of high practical relevance, because it represents the smallest detectable change within repeated measurements or the smallest detectable difference between devices, respectively. In other words, the similar intra- and inter-unit MDC%s in our dataset (e.g. 2.3 vs. 2.4% for V˙O2 in DMC) indicate that the technical variability obtained by two measurements with different but structurally identical units is already as high as the variability obtained by four repeated simulated measurements with only one instrument.
Our study has a few potential limitations that need to be mentioned. The relatively short observation period did not allow us to evaluate the potential impact of the age of the metabolic analyzer on reliability. Furthermore, the results of our biological tests might be limited to cycling ergometer exercise, because we cannot exclude that other types of exercise might influence the results. Finally, the results are, strictly speaking, limited to male athletes; female athletes were not recruited for this study because we aimed to stress the K5 with high minute ventilations. These are generally higher in males, and therefore potential problems, especially in BBB-mode, are probably mitigated in female athletes.

Conclusion
Two simultaneously operating COSMED K5s demonstrated accurate intra-unit reliability across a wide range of low to maximal simulated intensities, indicated by TE% < 2.00%, MDC% < 2.48% and ICCs > 0.93 (except for RER). Inter-unit reliability was not significantly lower, but we found a considerable mode dependency due to moderate to large inter-unit differences of 3.7-3.8% for V˙O2 and RER measurements in BBB-mode (95% CI), while differences were small to moderate and not of physiological relevance in DMC-mode. Therefore, inter-unit reliability adds variability, especially in BBB-mode. We conclude that, whenever possible, a particular K5 unit should be assigned to each individual in longitudinal studies. If several units are used in multicenter studies, inter-unit reliability needs to be considered because it increases the minimal detectable change in cardiopulmonary exercise tests.