Differences and agreement between two portable hand-held spirometers across diverse community-based populations in the Prospective Urban Rural Epidemiology (PURE) study

Introduction: Portable spirometers are commonly used in longitudinal epidemiological studies to measure and track the forced expiratory volume in the first second (FEV1) and forced vital capacity (FVC). During the course of a study, it may be necessary to replace spirometers with a different model. This raises questions regarding the comparability of measurements from different devices. We examined the correlation, mean differences and agreement between two different spirometers, across diverse populations and different participant characteristics.
Methods: From June 2015 to January 2018, a total of 4,603 adults were enrolled from 628 communities in 18 countries and 7 regions of the world. Each participant performed concurrent measurements with the MicroGP and EasyOne spirometers. Measurements were compared using the intra-class correlation coefficient (ICC) and the Bland-Altman method.
Results: Approximately 65% of the participants achieved clinically acceptable quality measurements. Overall correlations between paired FEV1 (ICC 0.88 [95% CI 0.87, 0.88]) and FVC (ICC 0.84 [0.83, 0.85]) were high. Mean differences between paired FEV1 (-0.038 L [-0.053, -0.023]) and FVC (0.033 L [0.012, 0.054]) were small. The 95% limits of agreement were wide but unbiased (FEV1 984, -1060 ml; FVC 1460, -1394 ml). Similar findings were observed across regions. The source of variation between spirometers was mainly at the participant level. Older age, higher body mass index, tobacco smoking and known COPD/asthma did not adversely affect inter-device variability. Furthermore, there were small and acceptable mean differences between paired FEV1 and FVC z-scores using the Global Lung Initiative normative values, suggesting minimal impact on lung function interpretation.
Conclusions: In this multicenter, diverse community-based cohort study, measurements from two portable spirometers showed good correlation and small, unbiased differences. These data support their interchangeable use across diverse populations to provide accurate trends in serial lung function measurements in epidemiological studies.



Introduction
Lung function assessments are now more accessible with the wide adoption of handheld portable spirometers in community and ambulatory care settings. These devices are easy to operate and many have inbuilt quality check software to enable high quality measurements. They are also commonly employed in research studies to provide rapid and reliable lung function measurements and to track lung function over time [1]. However, in large multicenter trials, it is common to have different portable spirometers across study sites, depending on the local availability of these devices, and it is often necessary to replace older devices with newer models over time [2]. This raises questions regarding the reliability and agreement of measurements obtained from different devices. Therefore, it is important to ascertain the reliability, differences and agreement between different spirometers, and to identify factors that may contribute to the variability between spirometers.
To date, only a few small studies have examined the variability between different portable spirometers [2][3][4][5][6][7][8][9][10][11]. Many were conducted in highly selected healthy young individuals in a laboratory setting. Only a few were conducted in the community, and these were limited to one population (generally from Europe or North America). It is unclear whether these findings can be generalized to other populations with different anthropometrics, demographics and underlying disease prevalence. Furthermore, little is known about the source of variability between spirometers.
The Prospective Urban Rural Epidemiology (PURE) study is an international prospective cohort study comprising adults recruited from urban and rural communities in high-, middle- and low-income countries. Baseline spirometry data were collected with a handheld portable turbine spirometer without flow volume loops (FVL). In the course of cohort follow-up, a new portable ultrasonic spirometer was introduced, which provided FVL. In the present study, we examined the correlation, agreement and mean difference between measurements from the old and new spirometer, in an unselected sub-sample. We also assessed whether the correlation and agreement between spirometers may differ across diverse populations from different socioeconomic and geographic regions. Lastly, we examined the impact of utilizing two different spirometers on the interpretation of spirometry measurements, using the Global Lung Initiative (GLI) normative values. Our findings will address some of the challenges associated with the widespread use of portable spirometers and their role in providing access to lung function measurements in the community. This information will facilitate correct interpretation of data and offer insight into how best to address the variability between spirometers.

Data availability:
The PHRI believes the dissemination of research results is vital and sharing of data is important. PHRI prioritizes access to data to researchers who have worked on the PURE study for a significant duration, have played substantial roles, and have participated in raising the funds to conduct the study. Data will be disclosed upon request and approval of the proposed use of the data by a PURE Review Committee. Specific collaborative projects can be developed with groups with similar data for joint analyses. The underlying data for this clinical study contains personal information and personal health information of participants who were involved, which is protected under Canada's privacy laws, HIPAA (US) and GDPR, amongst other international laws governing privacy. Consent for public disclosure of this information was not obtained and could pose a threat to confidentiality and violate privacy laws. PHRI has no objection in sharing the information under confidentiality and with appropriate data protection and privacy, including to the journal statisticians in a timely manner, for verification or validation of the analyses in the paper upon request. As per the Canadian funding body guidelines https://cihr-irsc.gc.ca/e/29072.html (referenced by PLOS), Element 8: "there should be strict limits on access to data and secure procedures for data linkage, subject to data-sharing agreements". PHRI follows this procedure and does not share or link data from clinical studies publicly where such data is or contains personal health information. Requests for access to data may be sent to the PURE Publications Committee and the PHRI Contracts, phri.contracts@phri.ca.

Funding:
The funding for the main study of PURE is provided in the accompanying appendix. The current substudy is not funded. The authors received no specific funding for this work. The funders of the main study of PURE had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Methods
The PURE study began recruitment in 2004 of community-based adults aged 35 to 70 years from 628 urban and rural communities across 18 high-, middle- and low-income countries. The study design and methodology have been described elsewhere [12]. In brief, standardized approaches were used for the enumeration of households, identification of participants, recruitment and data collection. As it was not feasible to collect data from a representative sample of each country, the sampling method used for each country aimed to reduce participation bias based on local risk factors and disease prevalence. [13]. The MicroGP spirometer contains a turbine, which generates rotational flow during the spirometry maneuver. The rotation of the low-inertia vane is converted into electrical impulses by means of an infrared light-emitting diode and a photodiode sensor. A microprocessor within the device converts the electrical pulses into spirometry measurements, which are displayed digitally. According to the manufacturer, the MicroGP has an accuracy of ±2%. In 2015, the EasyOne (Ndd Medical Technologies, Inc., Switzerland) ultrasonic spirometer was introduced, which provided automated quality checks, messaging, quality grades and FVL. The quality grades provided by the EasyOne after each test session include: (1) Grades A or B for three acceptable efforts and <100 ml (Grade A) or <150 ml (Grade B) variability between the two highest FEV 1 and FVC; (2) Grade C for two or more acceptable efforts and <200 ml variability; (3) Grade D for one acceptable effort or highly variable efforts (≥200 ml); and (4) Grade F for no acceptable efforts. The EasyOne spirometer uses an ultrasonic sensor to measure airflow. It has no moving parts and its accuracy is not dependent on mechanical function or the measurement of pressure or volume displacement. Accordingly, the manufacturer reports an accuracy of <3%, which is maintained throughout the device's operational life without the need for regular calibration.
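The grading rules described above can be expressed as a small decision function. The following is a minimal sketch only and not the EasyOne firmware logic: the function name `quality_grade` is hypothetical, and it assumes that a single variability figure (the larger of the differences between the two highest FEV 1 values and between the two highest FVC values, in ml) summarizes each session.

```python
from typing import Optional


def quality_grade(acceptable_efforts: int, variability_ml: Optional[float]) -> str:
    """Return a session grade (A, B, C, D or F) from the rules quoted in the text.

    variability_ml is assumed to be the larger of the differences between the
    two highest FEV1 values and the two highest FVC values; pass None when
    fewer than two acceptable efforts are available.
    """
    if acceptable_efforts <= 0:
        return "F"  # no acceptable efforts
    if acceptable_efforts == 1 or variability_ml is None or variability_ml >= 200:
        return "D"  # one acceptable effort, or highly variable efforts
    if acceptable_efforts >= 3 and variability_ml < 100:
        return "A"
    if acceptable_efforts >= 3 and variability_ml < 150:
        return "B"
    return "C"  # two or more acceptable efforts with < 200 ml variability


if __name__ == "__main__":
    for efforts, var in [(3, 80), (3, 120), (2, 50), (3, 250), (0, None)]:
        print(efforts, var, quality_grade(efforts, var))
```

In this sketch, grades A, B and C correspond to the sessions the paper treats as clinically acceptable.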
All study visits were conducted in dedicated research clinics in the community for all sites and countries. Participants were coached by trained staff prior to performing pre-bronchodilator forced inspiratory and expiratory manoeuvers (up to six attempts). All tests were performed in a standing position, with the participant's back straight and a nose-clip in place. With the introduction of the EasyOne spirometer, each center enrolled the first five consecutive participants from each community into the present substudy. Each participant provided spirometry measurements using the two devices in a random order within 3 hours, supervised by the same research staff. The order of spirometer measurements was randomly generated by the coordinating site and issued to the center prior to the day of testing. Spirometers were calibrated monthly (or as needed in extreme weather or handling) using a 3L syringe to ensure an accuracy within 105 ml or 3.5%.
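The two calibration tolerances quoted above are the same threshold expressed in different units, since 105 ml is 3.5% of a 3 L syringe volume. A minimal sketch of such a check follows; the constant and function names are hypothetical and are not taken from the study procedures.

```python
SYRINGE_VOLUME_L = 3.0   # calibration syringe volume stated in the text
TOLERANCE_L = 0.105      # 105 ml, which equals 3.5% of 3 L


def calibration_ok(measured_l: float) -> bool:
    """Return True when the measured syringe volume is within tolerance."""
    return abs(measured_l - SYRINGE_VOLUME_L) < TOLERANCE_L


if __name__ == "__main__":
    print(TOLERANCE_L / SYRINGE_VOLUME_L * 100)  # tolerance as a percentage
    print(calibration_ok(3.05), calibration_ok(2.88))
```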

Statistical analysis
Means and frequency statistics were used to describe the data. The highest FEV 1 and FVC from each spirometer were analyzed. The assumptions of normality and constant variance of the FEV 1 and FVC were assessed by visual inspection of histograms and plots of residuals against fitted values. The correlation and agreement between spirometers were assessed with scatterplots, intra-class correlation coefficients (ICC) and Bland-Altman plots [14]. Mean differences between paired FEV 1 and paired FVC were calculated as absolute (EasyOne - MicroGP) and relative ([EasyOne - MicroGP]/average × 100) differences between spirometers. A random-intercept multilevel 'null' model was used to estimate the source (region, country, center and participant level) of variation between spirometers. Stratified analyses by region, sex, age, body mass index (BMI), smoking status, known COPD or asthma, education level and quality grades were performed to explore the effect of each factor on the reliability and agreement between spirometers. Countries were classified into seven regions according to geographic location and socioeconomic level (by the World Bank classification) [15]. To examine the impact on interpretation, the GLI normative values were used to transform FEV 1 and FVC into z-scores prior to Bland-Altman analysis [16]. We used the ATS/ERS recommendation for between-effort repeatability within a test session of <150 ml to assess whether mean differences between spirometers met the criterion for within-test reproducibility [17]. Similarly, a difference in z-score of <0.5 SD was regarded as not a meaningful difference between age-, sex-, height- and ethnicity-adjusted GLI values [18]. All analyses were performed using SAS version 9.4 (The SAS Institute, Cary, NC, USA) and STATA 15 (StataCorp LLC, Texas, USA).
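The Bland-Altman quantities described above can be illustrated with a short sketch. It is not the SAS or Stata code used in the study: the function name `bland_altman` is hypothetical, the limits of agreement use the conventional mean ± 1.96 SD of the paired differences, the confidence interval uses a normal approximation, and the example values are invented for illustration.

```python
import math
from statistics import mean, stdev


def bland_altman(easyone, microgp):
    """Summarize paired measurements (in litres) from two devices.

    Differences are taken as EasyOne minus MicroGP, as in the text.
    """
    diffs = [e - m for e, m in zip(easyone, microgp)]
    # Relative difference: difference divided by the pair average, times 100.
    rel = [100 * (e - m) / ((e + m) / 2) for e, m in zip(easyone, microgp)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    half_ci = 1.96 * sd / math.sqrt(len(diffs))
    return {
        "mean_diff": bias,
        "mean_diff_ci": (bias - half_ci, bias + half_ci),
        "loa": (bias - 1.96 * sd, bias + 1.96 * sd),
        "mean_rel_diff_pct": mean(rel),
    }


if __name__ == "__main__":
    # Invented FEV1 values (L) for four hypothetical participants.
    easyone_fev1 = [3.0, 2.5, 4.0, 3.5]
    microgp_fev1 = [3.1, 2.4, 4.1, 3.4]
    print(bland_altman(easyone_fev1, microgp_fev1))
```

A mean difference near zero with wide limits of agreement, as reported in this study, corresponds to a small `mean_diff` combined with a large standard deviation of the paired differences.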

Results
A total of 4,603 participants from 628 communities in 18 countries across 7 regions completed measurements from the two spirometers. Baseline characteristics of included participants are shown in Table 1. Similar to the larger PURE study (Appendix II in S1 File), there were more females and individuals between the ages of 50-65 years. The overall proportion of participants meeting quality grades A, B or C on the EasyOne device was 65%, which is similar to the larger PURE study. There was a trend toward a higher prevalence of comorbidities, including COPD/asthma and cardiac diseases, and a lower education level in the substudy.
The correlations, mean differences and limits of agreement (LoA) between paired FEV 1 and FVC by region are shown in Table 2. Overall, paired FEV 1 and FVC between spirometers were highly correlated (Fig 1). The overall mean differences between spirometers, whether in absolute volume or as a percentage of mean FEV 1 or FVC, were small and within acceptable limits of between-effort reproducibility (Fig 2). The 95% LoA between paired measurements were wide and showed no association with the size of FEV 1 or FVC. Correlations between paired FEV 1 and FVC were similarly high across regions except for South Asia, where the correlations were of low to moderate strength (Table 2). For South America and the Middle East, the correlations between paired FVC were lower than those between paired FEV 1 . Across regions, the mean differences between paired FEV 1 were small (ranging from an absolute difference of -83 ml [relative difference -4%] to 49 ml [2.5%]) and showed no consistent bias across regions. The mean differences between paired FVC were larger, particularly for the Middle East (-203 ml [-6%]) and South America (141 ml [6.3%]), and again showed no consistent bias across regions. The 95% LoA were wide for both FEV 1 and FVC, suggesting large variation in agreement between spirometers across regions.
Table 1 notes: Variables are presented as means ±SD for continuous data and absolute numbers (% of total in each region/column). Abbreviations: BMI = body mass index calculated as weight divided by height squared; COPD (chronic obstructive pulmonary disease)/asthma, CHF (congestive heart failure) and strokes were self-reported; FEV 1 = forced expiratory volume in the first second measured in liters (L); FVC = forced vital capacity in liters (L); z-scores were estimated using the Global Lung Function Initiative normative values appropriate for age, sex, height and ethnicity; Micro = MicroGP spirometer; Easy = EasyOne spirometer. For regions S = South; N Am/Eur = North America/Europe. The grades were quality grades using the ATS guideline provided by the EasyOne spirometer. https://doi.org/10.1371/journal.pgph.0000141.t001
To understand the source of variation between spirometers, the ICC and variance components between spirometers were assessed at the region, country, center and participant levels (Table 3). The highest ICC between paired FEV 1 and FVC were observed at the participant level, indicating that the measurements between spirometers were highly correlated within individuals. This corresponds to the largest variance component, suggesting that participant factors contributed significantly to the variation between spirometers. The correlation and variance between spirometers at the region, country and center levels were substantially less, suggesting these levels contribute substantially less to the variation between spirometers. Furthermore, the increase in size of the ICC and variance component from the region to country and center levels was not dramatic, compared to the large increase from center to participant levels. This further highlights the importance of participant factors in contributing to the variation between spirometers.
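The relationship between variance components and level-specific ICCs can be sketched as follows. This is an illustration under stated assumptions and not the model fitted in the study: it assumes a nested random-intercept model in which the ICC at a given level is the cumulative variance down to that level divided by the total variance, the function name `cumulative_icc` is hypothetical, and the variance values are invented and are not those in Table 3.

```python
def cumulative_icc(components, residual):
    """Compute the ICC at each nested level from variance components.

    components: ordered list of (level, variance) from outermost to innermost.
    residual: remaining (within-participant, between-device) variance.
    """
    total = sum(var for _, var in components) + residual
    icc = {}
    running = 0.0
    for level, var in components:
        running += var            # variance shared by units at this level
        icc[level] = running / total
    return icc


if __name__ == "__main__":
    # Invented variance components (L^2), for illustration only.
    invented = [
        ("region", 0.02),
        ("country", 0.03),
        ("center", 0.05),
        ("participant", 0.70),
    ]
    for level, value in cumulative_icc(invented, residual=0.20).items():
        print(f"{level}: ICC = {value:.2f}")
```

With these invented numbers, the ICC rises only modestly from region to center and then jumps at the participant level, which is the qualitative pattern the paragraph above describes.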
To explore the participant factors that may contribute to the variation between spirometers, we examined the baseline characteristics of participants whose inter-device differences were within and outside the 95% LoA for the overall population (Appendix III in S1 File). The distributions of age, body mass index and sex were similar between these 2 groups. Furthermore, COPD/asthma, cardiac disease, strokes and tobacco smoking did not adversely impact the agreement between spirometers. However, there were higher percentages of lower quality grade spirometry and lower education level in those outside the 95% LoA. Separate stratified analyses were conducted to further explore the effects of sex, age, BMI, smoking status, known COPD or asthma, education and quality grades on spirometer variability (Table 4, Appendix IV in S1 File). The correlations between paired FEV 1 and FVC were generally high and similar across strata. The mean differences between spirometers were small with minimal variation across strata, even for the lower quality grades. However, there were lower correlations, larger variability and wider LoA between spirometers among those with lower education level and lower quality grades. Similar Bland-Altman analyses were conducted on the FEV 1 and FVC z-scores using age-, sex-, height- and ethnicity-appropriate GLI normative values (Table 5). Mean differences between paired FEV 1 and FVC z-scores from the two spirometers were small and less than 0.5 SD for the overall substudy and across regions.

Discussion
In this large international multi-center community-based sub-study, we examined the correlation, mean differences and agreement between measurements from two portable spirometers commonly used in community and field studies, and how they may vary across diverse populations. We found that an average of 65% of sessions achieved quality grades A, B or C, which represent clinically acceptable efforts. The overall correlations between paired FEV 1 and paired FVC between spirometers were high. The overall mean differences between measurements were small and within acceptable limits of between-effort reproducibility. There were moderate to high correlations between spirometers across diverse populations from different geographic and socioeconomic regions. Mean differences between paired FEV 1 were uniformly small across regions, while larger differences between paired FVC were observed. In both cases, there was no systematic bias observed across regions. The main source of variation between spirometers was at the participant level, with much less variation observed among regions, countries and centers. Exploratory analyses of participant factors identified that low education level and poor quality grade efforts were associated with higher variability between spirometers.
Table 4 notes: Analyses were stratified by sex; age; body mass index (BMI); smoking (EVER included current and ex-smokers of tobacco products); known self-reported COPD or asthma; low education level = primary school and lower; high education = secondary school and higher; quality grades were provided by the EasyOne spirometer. The mean absolute difference (EasyOne minus MicroGP value) or mean relative difference ([EasyOne minus MicroGP]/average × 100) between spirometers is provided with 95% CI. The 95% upper and lower limits of agreement (LoA) are provided. These data are also graphically represented in Appendix IV in S1 File. https://doi.org/10.1371/journal.pgph.0000141.t004
As portable spirometers become widely adopted and used in the community, more information on their quality of measurements, reliability, biases and agreement is needed to enable correct interpretation and comparison of lung function data across spirometers. To date, most studies have compared different portable devices in highly selected healthy and mainly young non-smokers within a single population [2][3][4][5][6][7][8][9][10][11]. These studies have reported high correlation and agreement between devices, which are likely to be inflated given the controlled setting under which the comparisons were made. The relatively small sample sizes and homogeneity of the populations studied also limit the ability of prior studies to adequately address the source of variation between spirometers. In contrast, we examined two commonly used portable spirometers in large numbers of unselected individuals from a wide range of urban and rural communities and geographic regions. The measurements were collected outside of a controlled laboratory setting, which makes our findings more generalizable to a broader range of populations and settings.
Similar to other community-based studies, we found an average of 25 to 35% of suboptimal quality grade efforts [19]. Even with these data included, there were high correlations and small mean differences between paired FEV 1 across regions. For paired FVC, there was more variation in the correlation and mean differences between spirometers. However, for most regions, the mean differences between paired FVC still remained within the acceptable limits of between-effort reproducibility [17]. Furthermore, we observed no consistent bias between spirometers across regions, suggesting the variation between devices was random in nature. We found the LoA were wide and variable across regions for both FEV 1 and FVC. This was expected, as other studies have shown that the LoA tend to widen with larger sample sizes and a wider range of data examined [20]. Also, in keeping with previous findings, we observed larger LoA between paired FVC than FEV 1 [2,21].
To date, there has been very limited information on the source of variability between spirometers. The few studies that have examined the effect of age and sex on inter-device variability have reported disparate findings [2,7,9]. These studies were generally small in sample size and included healthy volunteers across a limited age range. Our large sample size and diverse population enabled a robust analysis of the potential sources of variation between spirometers at the region, country, center and participant levels. We found that the largest source of variability was at the participant level, with much smaller contributions at the region, country or center levels. Importantly, participant factors such as older age, higher BMI, previous and current smoking and known COPD/asthma did not adversely affect the variation between spirometers. However, participants with low education level and poor quality grade efforts were more likely to demonstrate lower correlation and larger variation between spirometers. Even in these subgroups, the mean differences between spirometers remained small and unbiased, suggesting sufficient precision and comparable estimates of group means across devices.
Our findings have a number of implications. First, we report on the robustness of the FEV 1 measurement, which was highly correlated, with small and unbiased differences between devices across diverse populations. The correlation and mean differences between paired FVC, however, were more variable but unbiased across regions. This suggests that a more customized approach by region may be needed to adjust for the larger differences in the FVC between spirometers. Second, the LoA were wide, but random, suggesting considerable between-subject variability in agreement between devices. In this regard, it is important to differentiate the need for individual versus group level precision in estimating lung function for different types of studies. In population-based studies, exclusion of participants is undesirable (since excluded participants may be systematically different from those included), and this will inherently lead to larger inter-subject variability. Furthermore, the focus of population-based studies is mainly on the average differences in lung function between populations or the mean changes over time. In this context, it is more relevant to determine whether, on average, the recordings from different devices are well correlated and collected without systematic bias. Therefore, the precision of group mean estimates to provide accurate trends is more important than the precision of individual measurements. By contrast, in clinical studies the within-subject variability may be more relevant in assessing changes in lung function within individuals or small groups in response to an intervention. Here the precision of individual measurements is likely to be more important. To that end, our findings suggest that the two different spirometers, on average, were highly correlated, and had sufficiently high precision in estimating the group means in the overall population and in key subgroups without bias. Furthermore, when the data were transformed using GLI normative values, we observed very small and acceptable differences in the mean z-scores across spirometers, suggesting limited impact on interpretation of the data. Lastly, we did not observe a large contribution to the variation between spirometers at the region, country or center levels, suggesting consistent execution of spirometry measurements across these levels. The main source of variation identified was at the participant level and may be related to factors such as low education level and poor quality spirometry efforts. To this end, while every reasonable effort should be made to increase the precision of individual lung function measurements, efforts beyond what is easily achievable may not necessarily increase the power of the study but could lead to a considerable increase in the complexity and cost of the study and therefore compromise study feasibility [22]. Moreover, such methods may create biases (and distort results), particularly if such stringent criteria exclude participants with specific conditions or demographics that may influence lung function.
The strengths of our study include the large sample size and the diverse and unselected populations, which increase the generalizability of our findings. Measurements were taken in random order and supervised by the same trained staff, which minimizes procedure-related variability. Furthermore, all spirographs available from the EasyOne were inspected and assessed by a staff respirologist to ensure agreement with the assessment. Limitations include the measurement of lung function without bronchodilation. The use of bronchodilation can help to reduce variable airway tone in asthmatic patients, which may contribute to the variation between spirometers. However, participants were not requested to withhold any medications prior to testing. Therefore, it is reasonable to assume that those with chronic lung diseases, including asthma, would have taken their inhaler medications prior to spirometry assessments and are therefore less likely to exhibit variable airway tone.
In conclusion, we found moderate to high correlation and small mean differences between paired FEV 1 and FVC between the MicroGP and EasyOne spirometers across diverse populations. The differences between paired measurements showed no consistent biases across regions. Our findings support the use of these two spirometers in large long-term studies to provide reliable and comparable measurements, with highly correlated and small unbiased differences between group means across diverse populations.