Comparison of devices used to measure blood pressure, grip strength and lung function: A randomised cross-over study

Background Blood pressure, grip strength and lung function are frequently assessed in longitudinal population studies, but the measurement devices used differ between studies and within studies over time. We aimed to compare measurements ascertained from different commonly used devices.
Methods We used a randomised cross-over study. Participants were 118 men and women aged 45–74 years whose blood pressure, grip strength and lung function were assessed using two sphygmomanometers (Omron 705-CP and Omron HEM-907), four handheld dynamometers (Jamar Hydraulic, Jamar Plus+ Digital, Nottingham Electronic and Smedley) and two spirometers (Micro Medical Plus turbine and ndd Easy on-PC ultrasonic flow-sensor) with multiple measurements taken on each device. Mean differences between pairs of devices were estimated along with limits of agreement from Bland-Altman plots. Sensitivity analyses were carried out using alternative exclusion criteria and summary measures, and using multilevel models to estimate mean differences.
Results The mean difference between sphygmomanometers was 3.9mmHg for systolic blood pressure (95% Confidence Interval (CI):2.5,5.2) and 1.4mmHg for diastolic blood pressure (95% CI:0.3,2.4), with the Omron HEM-907 measuring higher. For maximum grip strength, the mean difference when either one of the electronic dynamometers was compared with either the hydraulic or spring-gauge device was 4-5kg, with the electronic devices measuring higher. The differences were small when comparing the two electronic devices (difference = 0.3kg, 95% CI:-0.9,1.4), and when comparing the hydraulic and spring-gauge devices (difference = 0.2kg, 95% CI:-0.8,1.3). In all cases limits of agreement were wide. The mean difference in FEV1 between spirometers was close to zero (95% CI:-0.03,0.03) and limits of agreement were reasonably narrow, but a difference of 0.47l was observed for FVC (95% CI:0.42,0.53), with the ndd Easy on-PC measuring higher.
Conclusion Our study highlights potentially important differences in measurement of key functions when different devices are used. These differences need to be considered when interpreting results from modelling intra-individual changes in function and when carrying out cross-study comparisons, and sensitivity analyses using correction factors may be helpful.


Introduction
Blood pressure, grip strength and lung function are commonly assessed in longitudinal population studies. All three are non-invasive measures of physiological function that are practical for a nurse or interviewer to administer in a home or clinical setting using portable equipment. They avoid the subjectivity of self-reports of health, enable researchers and clinicians to track changes in health and function over the life course [1] and are important biomarkers of healthy ageing [2]. Their repeat assessment within longitudinal studies, and inclusion in many studies, facilitates comparisons over time and across ages and cohorts [3,4].
Although there have been a number of initiatives to encourage standardisation of these measures [5][6][7], different devices have been adopted by different studies for a variety of practical reasons [8,9]. Furthermore, the device used within a long-running longitudinal study will often need to change over time as obsolete or outdated models are replaced with devices that are more technologically advanced and improve or extend measurement, are less costly, more portable or easier to use. Because devices of this kind are only subject to moderate regulation [10,11], the measures obtained from different makes and models of device are unlikely to be equivalent. This has important implications for research which either compares findings across studies or considers change in function longitudinally. For example, in a study modelling age-related changes in blood pressure across the life course which used data from eight British longitudinal studies, switching from a manual sphygmomanometer to an automated device, without correction for the difference in measurement, resulted in a steeper increase in mean trajectory of systolic blood pressure [4]. Similarly, artefactual findings attributable to a change in device have been observed in studies of lung function [12,13]. Indeed, concerns about potential differences in measures due to differences in spirometry devices have contributed to study investigators in the UK discouraging within- and cross-study analyses [14,15].
There are existing studies which have shown differences between devices used to measure blood pressure [16][17][18][19][20], grip strength [21][22][23][24] and lung function [6,12,13,25,26,27], but these have not yet compared all the devices commonly used in cohort and longitudinal population studies in the UK and many other countries. Further, these are only occasionally discussed in the context of both within- and between-study comparisons. To address this gap, a randomised cross-over trial was undertaken to compare measurements between devices used to assess blood pressure, grip strength and lung function commonly used in UK longitudinal population studies within the CLOSER consortium [28].

Study design and sample
For each of blood pressure, grip strength and lung function, a randomised cross-over study was carried out, so as to make within-person measurement comparisons. The study was conducted following established (CONSORT) guidelines [29]. The target sample, based on sample size calculations (S1 Appendix), was 120 men and women from the general population aged 45 to 74 years comprising 20 men and 20 women from each of three age groups (45-54, 55-64, 65-74). Participants were drawn from a list of individuals who had participated in a market research study, consented to be re-contacted for research purposes, and were living in London and the South East of England. An invitation letter and information sheet were sent and this was followed up with a telephone recruitment process including assessment of health-related exclusion criteria (S1 Appendix). Eligible participants were then invited to attend a face-to-face assessment and each participant was measured on every machine (Table 1) at a single assessment visit.
All 90-minute face-to-face assessments took place in central London between October 2015 and January 2016 and were conducted by one of seven researchers who were trained and tested in all relevant protocols. All participants gave informed, written consent. The analytical dataset was pseudo-anonymised, with each participant given a study number so that individuals could not be identified. Ethical approval for data collection was given by University College London (UCL) (Ethics Project Number: 6338/001) and, for analysis, by the University of Southampton (Ethics Project Number: 18498). Participants received feedback on their results, advice to contact their General Practitioner if their blood pressure was elevated, and a gift voucher.
During the assessment, each participant was assessed in the sequence shown in Table 2. Blood pressure was measured consecutively on each device and the remaining measures were ordered to ensure that there was sufficient time between the four grip strength and two spirometry measurements to avoid participant fatigue. Multiple measurements were recorded on each device as would be done in survey research. Height and weight were also measured and a short self-completion questionnaire was administered (S2 Appendix). For each of the three measures, the order of devices was determined before fieldwork began, using computer-generated random numbers within each age-sex stratum. Individuals were randomly allocated to one of two possible orders of blood pressure and lung function devices and to one of 24 possible orders of grip strength devices.
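The allocation step described above can be illustrated with a short sketch. It enumerates the 24 possible orders of the four dynamometers and assigns orders to the participants of one age-sex stratum. The shuffled-block scheme, the function names and the seed are assumptions made for illustration; the study reports only that computer-generated random numbers were used within each stratum.

```python
import itertools
import random

GRIP_DEVICES = [
    "Jamar Hydraulic",
    "Jamar Plus+ Digital",
    "Nottingham Electronic",
    "Smedley",
]


def grip_orders():
    """Return every possible ordering of the four dynamometers (4! = 24)."""
    return list(itertools.permutations(GRIP_DEVICES))


def allocate_orders(n_participants, orders, seed):
    """Assign one device order to each participant in a stratum.

    Assumption: orders are drawn in shuffled blocks so that each order is
    used about equally often. This is a hypothetical scheme, not the
    documented procedure of the study.
    """
    rng = random.Random(seed)
    allocation = []
    while len(allocation) < n_participants:
        block = list(orders)
        rng.shuffle(block)
        allocation.extend(block)
    return allocation[:n_participants]


if __name__ == "__main__":
    # One hypothetical stratum of 20 participants.
    for number, order in enumerate(allocate_orders(20, grip_orders(), seed=2015), 1):
        print(number, " -> ".join(order))
```

The same function covers the two-device measures by passing the two possible orders of the sphygmomanometers or spirometers as `orders`.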

Blood pressure, grip strength and lung function measurement
Standardised measurement protocols were used as follows. For blood pressure, the participant was asked to sit on a chair with legs uncrossed and their right arm resting comfortably, palm up, on a table, with the sphygmomanometers positioned so that they could not see the display. The participant was asked to expose their right arm, making sure that rolled up sleeves did not restrict circulation and that any watches or bracelets had been removed, and the sphygmomanometer cuff was then positioned over the brachial artery. After 3 minutes of quiet rest, 3 readings with a minute's rest between each reading were recorded using the first device. The device was then changed and after a further 2 minutes' rest, 3 readings were taken using the second device. There was no talking until three readings on both devices had been completed.
Grip strength assessment was based on a published measurement protocol [30]. While seated in a chair with fixed arms, participants were asked to place their forearm on the arm of the chair in the mid-prone position (the thumb facing up) with their wrist just over the end of the arm of the chair in a neutral but slightly extended position. Adjustments were made to each dynamometer to accommodate different hand sizes according to the make and model of the device. On hearing the words "And Go", the participant was encouraged, through strong verbal instruction, to squeeze as hard as possible for a few seconds until told to stop. For each device, two measurements were carried out in each hand in the sequence Left-Right-Left-Right. The value on the display was recorded to the nearest 0.1kg for the Jamar Plus+ and Nottingham Electronic, to the nearest 0.5kg for the Smedley and to the nearest 1kg for the Jamar Hydraulic.
Lung function measurements adhered to the American Thoracic Society/European Respiratory Society (ATS/ERS) lung function protocol [6]. The procedure was explained and demonstrated, and the participant then had a practice blow without completely emptying their lungs. All measurements were carried out with the participant standing unless they felt unable to do so. During measurement, maximum effort was encouraged verbally. In addition, the ndd Easy on-PC was linked to a laptop which showed a cartoon of a child blowing up a balloon. This represents a real-time trace, and because the participant is encouraged to exhale until the balloon pops, it helps ensure that a maximal FVC is achieved. After each trial the researcher recorded whether it satisfied the protocol; for example, a trial was classified as not valid if the participant did not form a tight seal around the mouthpiece or coughed during the procedure. In these instances, feedback was provided before the next attempt. Participants had up to five attempts to produce three valid measurements of lung function from each spirometer.
Readings for blood pressure and grip strength, and lung function readings taken using the Micro Medical spirometer, were entered twice, independently, and compared to ensure accuracy. Lung function readings taken using the ndd Easy on-PC spirometer were downloaded directly from the laptop.

Other measures
Height was measured using a portable Marsden Leicester stadiometer and weight using Tanita 352 scales according to standardised procedures, from which body mass index was calculated as weight (kg) / height (m)². Responses to the self-completion questionnaire provided additional information on: age at completing full-time education, self-rated health, smoking history, medication use and musculoskeletal, cardiovascular and respiratory conditions which might influence performance on the functional tests (S2 Appendix).

Primary outcome measures
For the purposes of the main analyses, outcomes commonly used in epidemiological research were derived. The mean of the second and third readings of systolic blood pressure and diastolic blood pressure in millimetres of mercury (mmHg) were used. For grip strength, the maximum of the four readings in kilograms (kg) was used. For lung function, the maximum forced expiratory volume in 1 second (FEV1) and forced vital capacity (FVC) in millilitres (ml) from the highest quality readings (quality A or B) were used. Quality grade A was when 3 or more acceptable tests were achieved with repeatability within 100 ml, and B when 3 acceptable tests were achieved with repeatability within 150 ml, as per ATS/ERS criteria [6].
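As an illustration of how the blood pressure and grip strength summaries above could be derived from raw readings, the following minimal sketch may help. The function names and the example readings are hypothetical and are not taken from the study data.

```python
from statistics import mean


def blood_pressure_summary(readings):
    """Mean of the second and third of three readings (mmHg)."""
    if len(readings) != 3:
        raise ValueError("expected exactly three readings")
    return mean(readings[1:])


def grip_strength_summary(readings):
    """Maximum of the four readings (two per hand, kg)."""
    if len(readings) != 4:
        raise ValueError("expected exactly four readings")
    return max(readings)


if __name__ == "__main__":
    # Hypothetical readings for one participant on one device.
    print(blood_pressure_summary([130, 126, 124]))
    print(grip_strength_summary([30.5, 32.0, 31.0, 29.5]))
```

Discarding the first blood pressure reading reflects the outcome definition in the text; the sensitivity analyses described later swap in other summaries, such as the mean of all three readings.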

Statistical analyses
We described relevant characteristics by randomisation group for each measure. For each device we estimated the reliability using intraclass correlations (or Rho) and within-person standard deviations using a variance-components model [31]. To investigate order effects we used two sample t-tests to compare the difference in mean values between groups with the measurements carried out in one sequence (device A followed by device B) compared with the opposite order (BA). For grip strength where 4 devices were tested, 6 pairwise comparisons were made, ignoring the exact placement of devices within the sequence.
We calculated the differences in measurement between pairs of devices then assessed the mean within-person differences between pairs of devices using paired t-tests. The assumption that the mean differences were normally distributed was checked by plotting histograms, and Bland-Altman plots (the difference between measures versus the average of the measures from the two devices for each individual) were used to assess whether the variation was dependent on the magnitude of the measurements [32,33]. The mean difference in values between the two devices, and the 95% limits of agreement, which give the range in which we would expect 95% of future differences in measurements between the two devices to lie, were plotted [33,34].
We also performed a series of sensitivity analyses to test the robustness of the results. We repeated analyses having: (i) excluded measurements where the devices were administered in the incorrect order (n = 2 for blood pressure, n = 5 for grip strength and n = 1 for lung function); (ii) removed extreme outliers identified using scatter plots (n = 1 for blood pressure and n = 2 for grip strength); and (iii) used alternative outcome definitions commonly used in analyses. For blood pressure, we considered the mean of three readings [35] and the second reading only [36] and for grip strength, the mean of the four readings [37,38]. For lung function, we used the highest reading of FEV1 and FVC drawn from all available readings irrespective of whether they adhered to the ATS/ERS quality criteria.
Finally, we used multilevel modelling, as an alternative statistical approach, to estimate the differences between devices, using all available readings rather than a summary measure, in order to account for variance between readings. The models treat the repeated readings as Level 1 and the individual as Level 2 to account for non-independence of measurements from the same person. Model 1 included device treated as a fixed effect. Model 2 also included covariates to account for the order in which the devices were administered and the position of the reading in the sequence (1 to 3 for blood pressure, 1 or 2 for the dominant and non-dominant hands for grip strength, and 1 to 5 for lung function). Model 3 was additionally adjusted for age, sex and, for blood pressure only, body mass index.
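The Bland-Altman quantities described above can be sketched as follows, assuming the conventional definition of the 95% limits of agreement as the mean difference plus or minus 1.96 standard deviations of the within-person differences. The function name and the example values are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(device_a, device_b):
    """Summarise agreement between paired measurements from two devices.

    Returns the mean within-person difference (a - b), the 95% limits of
    agreement (mean difference +/- 1.96 SD of the differences) and the
    (average, difference) pairs that would be plotted.
    """
    if len(device_a) != len(device_b):
        raise ValueError("measurements must be paired")
    diffs = [a - b for a, b in zip(device_a, device_b)]
    averages = [(a + b) / 2 for a, b in zip(device_a, device_b)]
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # sample SD of the within-person differences
    return {
        "mean_diff": mean_diff,
        "lower_loa": mean_diff - 1.96 * sd_diff,
        "upper_loa": mean_diff + 1.96 * sd_diff,
        "points": list(zip(averages, diffs)),
    }


if __name__ == "__main__":
    # Hypothetical systolic readings (mmHg) for four participants.
    result = bland_altman([120, 130, 140, 150], [118, 126, 137, 149])
    print(result["mean_diff"], result["lower_loa"], result["upper_loa"])
```

Plotting `points` with horizontal lines at the mean difference and the two limits gives the usual Bland-Altman figure; a trend in the differences across the averages would suggest that disagreement depends on the magnitude of the measurement.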
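For orientation, Model 1 can be written as a two-level random-intercept model. The specification below is an assumed textbook form rather than one stated in the paper: reading i is nested within participant j, and for grip strength the single device term would be replaced by indicators for three of the four dynamometers. Under this form the intraclass correlation is the share of the total variance that lies between participants.

```latex
% Assumed random-intercept form of Model 1 (not given explicitly in the paper)
y_{ij} = \beta_0 + \beta_1\,\mathrm{device}_{ij} + u_j + e_{ij},
\qquad u_j \sim N(0, \sigma_u^2), \quad e_{ij} \sim N(0, \sigma_e^2)

% Intraclass correlation implied by the two variance components
\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}
```

Here \beta_1 is the estimated mean difference between devices, and Models 2 and 3 would add further fixed-effect terms for device order, reading position, age, sex and body mass index.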
Data cleaning and management were carried out using Excel, IBM-SPSS Version 22 and STATA 14.0 and analyses were conducted using STATA 15.0.

Results
During fieldwork, 118 assessments were completed, with 18-21 participants in each of the age-sex strata (S1 Table). Of the seven researchers, three carried out 20-30 assessments, two carried out 10-20 assessments and two carried out fewer than ten assessments.
The socio-demographic characteristics of the randomised groups were reasonably well balanced as were key aspects of cardiovascular, musculoskeletal and respiratory health (Tables 3 and 4). The reliability of every device was good. The intraclass correlations were lowest for blood pressure (0.89-0.94), due to the acknowledged within-person variation in this measure (S2 Table). The values for grip strength of dominant hand were above 0.95 for all devices except the Smedley dynamometer (0.92). Reliability was best for lung function (≥0.96), where within-person standard deviations were small. Reliability was slightly better when including only assessments adhering to the ATS/ERS quality criteria because two measures must be within 150ml of each other. There was no evidence of order effects for blood pressure or lung function. For grip strength, there was evidence of an order effect for the comparison between the Nottingham Electronic and Smedley dynamometers (difference = -3.08kg, 95% CI: -5.93, -0.23, p = 0.03) (S3 Table). Histograms show that for all three measures, the mean differences between devices were approximately normally distributed (S1 Fig).

Blood pressure
Three participants were excluded from analyses due to missing readings, leaving 115 for analysis. The mean difference in SBP between the two devices was 3.9mmHg (95% CI: 2.5, 5.2, p<0.001) and for DBP was 1.4mmHg (95% CI: 0.3, 2.4, p = 0.01), with the Omron HEM-907 measuring higher than the Omron 705-CP (Table 5). The Bland-Altman plots showed that as blood pressure increased, the difference between the two devices remained approximately constant (Figs 1 and 2). The limits of agreement were wide, being -10.6 to 18.3mmHg for SBP and -9.8 to 12.5mmHg for DBP.

Grip strength
All 118 participants were included in the analyses. There was no evidence of a difference in mean maximum grip strength when comparing the two electronic dynamometers, the Nottingham Electronic and Jamar Plus+ (difference = 0.3kg, 95% CI: -0.9, 1.4, p = 0.6), or when comparing the hydraulic and spring-gauge dynamometers, the Jamar Hydraulic and Smedley (difference = 0.2kg, 95% CI: -0.8, 1.3, p = 0.7). However, there were mean differences in maximum grip strength of between 4 and 5kg when comparing either of the electronic dynamometers with either the hydraulic or spring-gauge dynamometer (Table 5). The limits of agreement varied depending on the pair of devices being compared; for example, these were narrower (-2.0 to 10.1 kg) when comparing the Jamar Plus+ and Jamar Hydraulic but very wide (-10.6 to 20.5 kg) when comparing the Nottingham Electronic and Smedley dynamometers. Even in cases where the mean difference was near zero, the limits of agreement indicated substantial differences in measurement between devices. The Bland-Altman plots (Figs 3-8) showed that for the comparisons of the Smedley dynamometer with all other devices, the difference increased at higher magnitudes of mean grip strength (Figs 4, 6 and 8).

Lung function
Twelve participants had missing lung function measures and just under a third (n = 32 for FEV1 and n = 39 for FVC) of the remaining participants were excluded because there were no readings of a sufficiently high quality. There was no evidence of a difference in mean FEV1 between devices (difference = 0.00 litres, 95% CI: -0.03, 0.03, p = 0.9) but there was evidence of a difference in FVC (-0.47 litres, 95% CI: -0.53, -0.42, p<0.001), with the ndd Easy on-PC measuring higher than the Micro Medical (Table 5). The Bland-Altman plots suggested that for FEV1, the difference between the two devices was approximately constant as measurements increased and close to zero (Fig 9) with reasonably narrow limits of agreement (-0.25 to 0.25 litres). The plot for FVC suggested that the difference between devices remained constant as values of FVC increased (Fig 10) but the limits of agreement were wider (-0.92 to -0.03 litres).

Sensitivity analyses
When we repeated the analyses having excluded measurements where the devices were administered in the incorrect order (n = 8), removed outliers (n = 3), included the lung function readings that did not meet ATS/ERS criteria (n = 32 for FEV1 and n = 39 for FVC), and used alternative definitions of outcomes, there were only small changes in the estimated differences between devices such that the conclusions were unaltered (S4 Table). The only differences found were a small number of additional order effects (S5 Table), but these had no impact on the findings when order of device was controlled for through multilevel analysis. Indeed, when the data were reanalysed using multilevel models, the estimates of differences between devices showed only marginal changes, though the standard errors were reduced (S6-S8 Tables).

Discussion
In a randomised cross-over study of 118 adults aged 45-74 years, we found evidence of differences in measurement of blood pressure, grip strength and lung function when assessed using different devices. For blood pressure, the newer Omron HEM-907 measured higher than the older Omron 705-CP with wide limits of agreement. For grip strength, the two electronic dynamometers recorded measurements on average 4-5kg higher than either the hydraulic or the spring-gauge dynamometer, but there were only small mean differences when comparing the two electronic dynamometers or the hydraulic and spring-gauge dynamometers. However, limits of agreement were wide for all comparisons. For lung function, the ndd Easy on-PC measures of FVC were an average of 0.47 litres higher than those for the Micro Medical, but there was no difference between measures of FEV1 and the limits of agreement were reasonably narrow.
We are aware of only a few studies that have compared combinations of dynamometers previously. For example, King [21] compared the Jamar Hydraulic with the Jamar Plus+ dynamometer and, in contrast to our findings, reported that the electronic dynamometer had consistently lower readings than the hydraulic device and narrower limits of agreement. However, the study population was a convenience sample of 40 men and women who were younger, with an average age of 32 years, and may have had better function than our older sample, which could influence comparability across machines. Another study reported a difference of 3.2kg (limits of agreement -6.3 to 12.6) when comparing the Smedley dynamometer and the Jamar Hydraulic dynamometer, which contrasts with our finding of a smaller mean difference (0.2kg) but wider limits of agreement (-10.8 to 11.3) [22]. However, this other study was carried out in an older, smaller sample of 55 participants aged 65-99 years recruited from a retirement home and social day care centre. Another study [23] found that the Smedley dynamometer measured lower than the Jamar+ Digital, similar to our study, although in this other study there were other potentially important variations in measurement protocol: measures using the Smedley device were undertaken in a standing position and those using the Jamar device were undertaken seated. Our findings provide some reassurance that there is a lack of bias in measurement between specific device combinations (i.e. the Jamar Plus+ and Nottingham Electronic; the Jamar Hydraulic and Smedley), although the limits of agreement suggest that the variation can still be substantial.
We have not identified a comparison of Micro Medical or other turbine spirometers with the ndd Easy on-PC spirometer. However, in a study of 35 volunteers, the Micro Medical turbine spirometer, used in our study, gave lower readings compared with the Vitalograph Micro pneumotachograph spirometer [13], for both FEV1 (mean difference of 0.24l) and FVC (0.34l). Another study of 49 volunteers found that the handheld ndd Easy on-PC spirometer produced systematically lower values than a pneumotachograph spirometer (Masterscreen) [25], for both FEV1 (mean difference of 0.24l) and for FVC (0.37l).
For lung function, the accuracy of measurement relies primarily on optimal coaching: maximally deep breath, a rapid blast and appropriate encouragement as well as a full seal around the mouthpiece and correct body posture [6]. The ndd Easy on-PC spirometer presents visualisation of the volume-time graph in real time, meaning that the participant can be encouraged to blow until the curve has reached a plateau, that is, when the true FVC has been achieved. In the absence of this visual display the forced manoeuvre may be terminated prematurely, and the FVC underestimated. We propose that this is the most likely explanation for the substantially higher FVC values obtained using the ndd Easy on-PC device than the Micro Medical device in our study, while there was no difference for FEV1. For FEV1 the mean difference between the 2 spirometers was zero and is, therefore, within the 150ml ATS/ERS criterion for replication of measurement. In addition, the limits of agreement did not exceed the 350ml criterion set in previous spirometry studies [27]. Whether using a group correction for FVC is valid, however, remains debatable: in the SAPALDIA study, a group correction from a quasi-experimental study was found not to be adequate, and an approach using spirometer-specific reference equations from longitudinal measurements to describe individualised correction terms was preferred [12].
In considering the potential clinical significance of the differences between devices, we have referred to published normative or predicted values of blood pressure, grip strength and lung function [3,39,40]. Based on analysis of age-related differences in mean blood pressure in the Health Survey for England 2016, the mean differences in SBP and DBP between devices that we observed are equivalent to an age difference of approximately five years, although the possible non-linearity of change with age in diastolic blood pressure across the age range of interest [41] makes that comparison more difficult. Further, the within-person standard deviation for systolic blood pressure is larger than the mean difference between devices. For grip strength [3] the observed 4-5kg difference is equivalent to an age difference of approximately 5 years among men and 10 years among women aged 65 years and above. For lung function, based on the National Health and Nutrition Examination Survey (NHANES) III data [42], predicted values for five-year age-groups (with male height of 175cm and female height of 160cm) show that a difference of 0.47l in FVC is equivalent to an age difference of around 15 years, between 45-75 years. Therefore, together with the wide limits of agreement and good measurement reliability for each device, the differences that we observed between devices are likely to have important practical implications for both grip strength and lung function. For example, the differences in dynamometers may result in discrepancies in clinical diagnoses which use cut-points when identifying an individual as sarcopenic [43]. Similarly, the difference in FVC, but not FEV1, between machines will have implications for defining participants with COPD based on the ratio FEV1/FVC.
Maintaining consistency in the make and model of device used in studies reduces the likelihood of measurement differences, but is not always realistic given that equipment becomes obsolete and new technology can improve measurement, for example through automation (as is the case with the Omron HEM-907), the transition from analogue to digital (as is the case with the transition from the Jamar Hydraulic to Jamar Plus+ devices) or the introduction of visual encouragement and specific feedback (as provided by the ndd Easy on-PC). An important implication of our findings is, therefore, that it would be advisable for researchers to include simple experiments to assess machine comparability when a new device is introduced into a study. Conducting external comparison studies, such as ours, would also help interpretation for both within-study and between-study comparisons. In addition, the differences between devices need to be considered in the context of reliability of measurements for each device being compared. Our analysis showed good reliability of measurements, particularly for the dynamometers and spirometers, suggesting the differences observed are important. The ATS/ERS quality control for lung function ensures excellent reliability, but does result in exclusion of those who cannot meet the criteria.
A key strength of this study design was that it used the same standardised measurement protocols for all devices. This is important because, for all three functional measures, the type of device is only one of several factors that can affect measurements unless the others are kept constant, as in our study. Blood pressure is affected by multiple factors [10] including the participant talking, actively listening, being exposed to cold, ingesting alcohol, having a distended bladder and recent smoking [44], and also by aspects of the measurement protocol such as arm position and cuff size [45]. For grip strength, the values and precision of measurements have also been shown to be influenced by a range of factors [30,37] including whether allowance is made for hand size and hand-dominance [46], dynamometer handle shape [47], position of the elbow [48] and wrist during testing [49], setting of the dynamometer [50,51], effort and encouragement, frequency of testing and time of day and training of the assessor [30,51]. The study also included a relatively large sample size, based on a priori sample size calculation, compared with other similar studies, and implemented a randomised design. While confidence in the results rests primarily on this randomised design [29], the fact that participants were drawn from a large database of members of the public, who had been involved in previous market research and consented to be re-contacted, suggests they may be more representative of the general population than the small-scale volunteer samples used in many previous studies.
We also acknowledge the limitations of the study. The study findings cannot be generalised beyond the parameters of the research design; for example, results might differ for those outside the sampled age range (i.e., 45 to 74), and while the trial compared devices most commonly used in UK population-based studies, no comment can be made about device combinations which were not included [15]. While standardising the measurement protocols was an important aspect of the research design, it meant deviating from the protocol for the Smedley dynamometer (normally assessed standing rather than sitting) and so may limit the applicability of the findings for this device [30]. Furthermore, in the primary analyses of lung function, a number of participants were excluded due to missing or low-quality readings, particularly on the ndd Easy on-PC, thus reducing the sample size and power of these analyses. Nevertheless, sensitivity analyses using all available readings, irrespective of quality, suggested that this did not have a big impact on findings. Indeed, sensitivity analyses considering outliers, incorrectly ordered tests and alternative coding of measures all showed that our results were robust. Assessor may be a source of variation in our study which we have not accounted for, although this variation was minimised by the consistent training and protocol, and is not likely to have had a substantial impact on differences between devices since this was a within-person comparison and the same researcher assessed the same person on all machines.
In conclusion, this randomised cross-over study showed measurement differences between devices commonly used to assess blood pressure, grip strength and lung function which researchers should be aware of when carrying out comparative research between studies and within studies over time.

Table 4. Cardiovascular, musculoskeletal and respiratory health status of the study population by first device used (N = 118).
a Includes doctor diagnosed heart attack, angina and other heart condition. b Includes eczema, hay fever, asthma, COPD, bronchitis, emphysema and other respiratory problems.
https://doi.org/10.1371/journal.pone.0289052.t004

Table 5. Differences in mean and limits of agreement for each pair of devices used to measure blood pressure, grip strength and lung function.
* p-value from paired t-test.
https://doi.org/10.1371/journal.pone.0289052.t005