Trend analysis for national surveys: Application to all variables from the Canadian Health Measures Survey cycle 1 to 4

Background Trend analysis summarizes patterns in data over time to show the direction of change, and can be used to investigate uncertainties at different time points and associations with other factors. However, this approach is not widely applied to national surveys, and only selected outcomes are usually investigated. This study demonstrates a research framework for conducting trend analysis on all variables in a national survey, the Canadian Health Measures Survey (CHMS).
Data and methods The CHMS cycles 1 to 4 were implemented between 2007 and 2015. The characteristics of all variables were screened, linked to the corresponding weight variables, and used to guide data cleaning. Missing values were identified and cleaned according to the User Guide. Trend analysis examined the statistical significance of candidate predictors: the cycles, age, sex, education, household income and body mass index (BMI). R (v3.2) and RStudio (v0.98.113) were used to develop the framework.
Results There were 26557 variables in 79 data files from four cycles. There were 1055 variables significantly associated with the CHMS cycles and 2154 associated with BMI after controlling for other predictors. The trend of blood pressure was similar to those published.
Conclusion Trend analysis for all variables in the CHMS is feasible and offers a systematic approach to understanding the data. Through trend analysis, we detected data errors and identified several environmental biomarkers with extreme rates of change across cycles. The impact of these biomarkers has not been well studied by Statistics Canada or others. This framework can be extended to other surveys, especially the Canadian Community Health Survey.

Background
Trend analysis, which summarizes patterns across time, is widely used in a variety of disciplines, such as business [1], financial markets [2], economics [3] and epidemics or mortality [4][5][6][7]. Trend analysis helps to estimate the quantities of current or previous events and their variability or uncertainty at different time points. It is also the foundation for prediction and projection after analyzing the significance of time and relationships with other predictors [8][9][10]. For national surveys, certain trends have been studied to show progress or deterioration in public health and health care [11]. These trends provide important clues for healthcare professionals to understand the unmet needs for care and the magnitude of health problems. The comparison of multiple trends allows us to prioritize issues and allocate resources [4,12]. If well conducted, projections can be made to further prepare health systems for upcoming challenges [8,9].
However, several issues arise when taking this approach. First, adjusting for survey design requires researchers to assign appropriate weights and specify survey sampling units and strata [13]. Identifying the necessary variables requires extra attention and expert knowledge. Second, adjusting for survey design also limits the choice of research tools [14]. The automatic procedures developed for time series data or repeated surveys are not applicable when survey design must be taken into account [1]. Linear methods, such as generalized linear models and principal component analysis, remain useful for surveys to generate nationally representative statistics [14,15].
Third, access to the data may be restricted. For example, some Statistics Canada data products, such as the Canadian Health Measures Survey (CHMS), can be accessed by academic researchers only through the Research Data Centres (RDC) [16]. Physical restrictions may prevent researchers outside Statistics Canada or other collaborating agencies from conducting complicated or exhaustive research protocols. Fourth, the outcomes analyzed in national surveys are often limited to individual researchers' interests. Many published studies have conducted trend analysis of the CHMS data, but only a limited number of variables have been targeted, especially hypertension- and obesity-related factors [17][18][19][20][21][22][23]. Even if trends are studied by data holders or affiliated researchers, important issues may remain unanswered. For example, the extensive review of environmental chemicals by Health Canada is not informative because statistics are listed by cycle without testing the significance of time trends or the association with other contextual factors [24][25][26]. This needs to be addressed because effective use or extensive application of trend analysis to national surveys may lead to more efficient biomonitoring [11] and better identification of unexpected disease trends [17].
Fifth, trend analysis may pose challenges to computing resources [27,28]. The large number of variables in national surveys may limit the use of this method if it is not well planned. Lastly, there may not be sufficient incentive for academia, especially researchers mainly funded by research grants, to innovate toward novel objectives in the long run [29]. Trend analysis with national surveys requires exhaustive study of the documentation and survey methods beforehand. There is no immediate benefit in studying variables other than the outcomes that are related to or can lead to research funding.
To address these issues that may be encountered while conducting trend analysis with national surveys, this study aims to 1) propose a framework of trend analysis for all variables in national surveys developed based on the CHMS data, 2) test the feasibility of trend analysis with all CHMS variables using computing resources available to most researchers, 3) summarize the results of the research framework and compatibility with previous studies, and 4) describe some of the obstacles and issues that may be encountered if applied to other surveys.

Methods
Several major steps were designed to execute this framework with the CHMS data after reviewing the data structure, data dictionaries, the CHMS User Guide [30,31] and the CHMS Cycle 1 to 8 Content Summary [32]. The framework was applied to the CHMS data to generate the customized research flowchart in Fig 1. First, all variables were imported from data files and screened for basic characteristics, including file names, weight variables, bootstrap weight files to be merged, maximal and minimal values, responses and variable types (continuous or categorical). For the CHMS variables, the maximal values were important for data cleaning because missing values were always coded with values far exceeding the observed values [30,31,33]. The values ending in 4, 5, 6, 7, 8, and 9 might represent "values higher than limits of detection", "values less than limits of detection", "not applicable", "don't know", "refusal" and "not stated", respectively [30,31,33]. For other surveys, missing values might be represented with certain values [34] or be coded with reserve values, such as -1 to -3 [35]. To prevent computer memory from being exhausted, data sets were removed from memory as soon as they were no longer in use.
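The screening step can be illustrated with a minimal sketch. The study itself used R; Python is used here for illustration only. The function names, the `width` argument and the all-nines prefix of the reserve codes (for example 994 to 999 for a three-digit field) are assumptions made for this example, and the category threshold is hypothetical.

```python
# Meaning of the last digit of a reserve code, as described in the text.
RESERVE_MEANING = {
    4: "higher than limit of detection",
    5: "less than limit of detection",
    6: "not applicable",
    7: "don't know",
    8: "refusal",
    9: "not stated",
}


def reserve_codes(width):
    """Return {code: meaning} for an integer field of `width` digits.

    Assumption: reserve codes are all nines except the last digit,
    e.g. width=3 gives 994..999.
    """
    base = 10 ** width - 10  # e.g. 990 for a three-digit field
    return {base + digit: meaning for digit, meaning in RESERVE_MEANING.items()}


def screen_variable(values, width, max_categories=10):
    """Summarize one variable: counts, observed range and a rough type."""
    codes = reserve_codes(width)
    observed = [v for v in values if v not in codes]
    return {
        "n": len(values),
        "n_reserve": len(values) - len(observed),
        "min_observed": min(observed) if observed else None,
        "max_observed": max(observed) if observed else None,
        # Hypothetical heuristic: few distinct values -> categorical.
        "type": "categorical" if len(set(observed)) <= max_categories else "continuous",
    }


summary = screen_variable([120, 135, 996, 999, 110], width=3)
```

Comparing `max_observed` with the raw maximum is one way to confirm that reserve codes, rather than real measurements, account for the largest values.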
Second, user-defined summary variables were generated once the data were stored for cleaning. The summary variables remained blank at this stage and could be summaries of medication use, biomarker abnormality, or numbers of chronic conditions, depending on the research objectives. After these two steps, an exhaustive list of the CHMS variables was created. Original and derived variables were listed together and could serve as important indicators of data processing quality. An illustration of the variable list is shown in Table 1.
Third, the CHMS data were cleaned based on the reserve values ending in four to nine [30,31,33]. A problem particular to biomarker data was that some values were above the upper or below the lower limits of detection. Health Canada imputed values less than the limits of detection with half of the limits of detection [24][25][26]. In addition, Health Canada excluded from analysis the variables for which more than 40% of subjects had values less than the limits of detection [24][25][26]. In contrast, there was no official guidance on imputing values larger than the upper limits of detection, and these were tentatively imputed with 110% of the upper limits of detection.
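The imputation rules above can be sketched as follows. This is an illustration in Python rather than the study's R code. The function name, its arguments and the example reserve codes are hypothetical, and the sketch assumes that other reserve values (such as "not stated") have already been removed, so the 40% share is computed over the remaining unweighted values.

```python
def impute_detection_limits(values, lower_lod, upper_lod,
                            below_code, above_code, max_below_share=0.40):
    """Impute values flagged as outside the limits of detection (LOD).

    Returns None when more than `max_below_share` of the values are
    below the lower LOD, meaning the variable is excluded from analysis.
    """
    n = len(values)
    n_below = sum(1 for v in values if v == below_code)
    if n and n_below / n > max_below_share:
        return None  # excluded: too many values below the LOD

    imputed = []
    for v in values:
        if v == below_code:
            imputed.append(lower_lod / 2)       # half of the lower LOD
        elif v == above_code:
            imputed.append(upper_lod * 1.10)    # tentative: 110% of the upper LOD
        else:
            imputed.append(v)
    return imputed


cleaned = impute_detection_limits(
    [0.5, 995, 2.0, 994, 1.0],
    lower_lod=0.2, upper_lod=50.0, below_code=995, above_code=994,
)
```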
Fourth, the summary variables or the derived ones needed to be recoded or calculated after data cleaning. For example, the summary variables of medication use included the use and the numbers of prescription drugs for cardiovascular conditions. These needed to be derived from the drug codes, either Anatomical Therapeutic Chemical (ATC) Classification System or American Society of Health-System Pharmacists (ASHP) drug codes [36]. Another example was that the chronic conditions reported in the CHMS could be further simplified or summarized as the number of chronic conditions diagnosed. Abnormality of disease biomarkers could be identified through external information, such as the clinical reference ranges used by health professionals [37,38]. The number of abnormal biomarkers could be derived after data labeling. Certain secondary biomarkers, such as the estimated creatinine clearance that is used to evaluate kidney health [39,40], could also be derived after data cleaning.
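As an illustration of deriving a medication summary from drug codes, the sketch below counts distinct drugs per subject by ATC code prefix. It is written in Python rather than the R used in the study. The record layout and function name are hypothetical. The prefixes rely on the ATC structure, in which group C covers the cardiovascular system and A10 covers drugs used in diabetes.

```python
from collections import defaultdict


def count_drugs_by_prefix(records, prefix):
    """Count distinct ATC codes per subject that start with `prefix`.

    `records` is an iterable of (subject_id, atc_code) pairs, one per
    reported drug (hypothetical layout).
    """
    per_subject = defaultdict(set)
    for subject_id, atc_code in records:
        code = atc_code.strip().upper()
        if code.startswith(prefix):
            per_subject[subject_id].add(code)
    return {subject_id: len(codes) for subject_id, codes in per_subject.items()}


records = [
    (1, "C09AA05"),  # ramipril
    (1, "C10AA05"),  # atorvastatin
    (1, "A10BA02"),  # metformin
    (2, "N02BE01"),  # paracetamol
    (3, "c09aa05"),  # lower-case input is normalized
]
cardio = count_drugs_by_prefix(records, "C")
diabetes = count_drugs_by_prefix(records, "A10")
```

Subjects without a matching drug are absent from the result, so a default of zero has to be supplied when the counts are attached to another file.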
In addition, some of the original variables needed to be made consistent across the CHMS cycles. The inconsistency arose for a variety of reasons, such as changes in the measurement sample (serum or plasma), whether subjects fasted or not, and the categorization of continuous variables. For example, the level of glucose was measured with plasma in the CHMS cycle 1 and with serum in the other cycles. In cycle 3 and 4, glucose was only quantified in fasted subjects. The glucose measurements with serum or plasma could be treated as compatible [41] and could be recoded to the same variable. However, fasted glucose levels had different diagnostic values from non-fasted levels and needed to be distinguished [42][43][44]. Therefore, glucose measured with serum or plasma among fasted and non-fasted subjects was recoded to two variables that represented fasted glucose in cycle 3 and 4 and non-fasted glucose in cycle 1 and 2.
Fifth, some of the summary or derived variables needed to be merged to other data sets to obtain useful statistics. For example, the file of medication use in the CHMS cycle 3 was not assigned survey weights and needed to be merged with the household or other data files to understand issues such as the prevalence of drug use or the number of prescription drugs. Another example was that the information on non-environmental biomarkers in cycle 3 was stored in a stand-alone data set with identification numbers that could be used for data merging. In such cases, the summary variables of medication or abnormality in clinical biomarkers were generated in the respective data files and merged to household data files for inference.
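The glucose recoding described above can be sketched as a small function that assigns each measurement to a fasted or non-fasted variable according to cycle. This is a Python illustration, not the study's R code. The field names are hypothetical, and the assignment by cycle alone follows the description in the text.

```python
def harmonize_glucose(record):
    """Split one glucose measurement into fasted and non-fasted variables.

    `record` is a dict with hypothetical keys "cycle" (1 to 4) and
    "glucose". Plasma (cycle 1) and serum (cycles 2 to 4) values are
    treated as compatible and kept on the same scale.
    """
    cycle = record["cycle"]
    value = record["glucose"]
    fasted = cycle in (3, 4)  # only fasted subjects were measured in cycles 3 and 4
    return {
        "cycle": cycle,
        "sample": "plasma" if cycle == 1 else "serum",
        "glucose_fasted": value if fasted else None,
        "glucose_nonfasted": None if fasted else value,
    }


rows = [harmonize_glucose({"cycle": c, "glucose": g})
        for c, g in [(1, 5.4), (2, 5.1), (3, 4.9), (4, 5.0)]]
```

Keeping two separate variables means each trend is estimated only across cycles that are comparable.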
Sixth, descriptive or analytical study of all CHMS variables could be conducted. In this study, trend analysis was performed with the CHMS cycles on a continuous scale as the only predictor to understand whether there were significantly increasing or decreasing trends across cycles. It was also possible to add more predictors that were important for researchers, such as age, sex and province. Continuous and binary outcomes were analyzed with linear and logistic regression, respectively. The sample sizes, model fit statistics, p values of predictors and variance inflation factors of all predictors were obtained. However, several issues had to be dealt with when adjusting for survey design. The sample sizes should be sufficient relative to the primary sampling units. For the CHMS, the sampling units were the cities of clinical visits [30,31,33]. The unweighted sample sizes should satisfy the vetting rules administered by Statistics Canada, which varied by survey and analytical method. Collinearity could be assessed between predictors [45]. To avoid memory overload and increase computational efficiency, only the necessary variables were loaded for regression analysis.
Lastly, the results were reorganized for vetting and release. The trends were plotted against the CHMS cycles along with the necessary summary tables designated for release vetting by the RDC analyst. Age [46] and blood pressure [47], for which official statistics had been released, served as examples of trend analysis using the CHMS data. The trends were illustrated as values relative to the mean values in the CHMS cycle 1. The 95% CIs (confidence intervals) were plotted as shaded areas. The details of the blood pressure measurement can be found elsewhere [48,49]. The significance of time trends was confirmed if there was a significant association with the CHMS cycles on a continuous scale based on linear regression adjusting for survey design.
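A trend test of this kind can be sketched as a weighted regression of the outcome on cycle, with the standard error taken from bootstrap replicate weights. The study used R; this Python sketch is an illustration under stated assumptions. The function names are hypothetical, the variance is computed as the mean squared deviation of replicate slopes from the full-sample slope, and the p value uses a normal approximation, which is a simplification because survey guidance may call for a t distribution with limited degrees of freedom.

```python
import math


def weighted_slope(x, y, w):
    """Slope of a weighted simple linear regression of y on x."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx


def trend_test(cycle, outcome, weight, replicate_weights):
    """Estimate the per-cycle trend and a bootstrap-based p value."""
    slope = weighted_slope(cycle, outcome, weight)
    replicates = [weighted_slope(cycle, outcome, rw) for rw in replicate_weights]
    # Assumed variance estimator: mean squared deviation from the full-sample slope.
    variance = sum((r - slope) ** 2 for r in replicates) / len(replicates)
    se = math.sqrt(variance)
    z = slope / se if se > 0 else float("inf")
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed, normal approximation
    return {"slope": slope, "se": se, "p_value": p_value}


cycle = [1, 1, 2, 2, 3, 3, 4, 4]
outcome = [10.0, 12.0, 11.0, 13.0, 12.0, 15.0, 13.0, 16.0]
weight = [1.0, 2.0, 1.5, 1.0, 2.0, 1.0, 1.0, 2.5]
replicate_weights = [
    [1.2, 1.8, 1.5, 0.9, 2.1, 1.0, 0.8, 2.6],
    [0.9, 2.2, 1.4, 1.1, 1.9, 1.2, 1.1, 2.3],
    [1.0, 2.0, 1.7, 1.0, 2.0, 0.8, 1.0, 2.7],
]
result = trend_test(cycle, outcome, weight, replicate_weights)
```

Additional predictors such as age or sex would require a multiple regression, which this single-predictor sketch does not cover.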
The association with body mass index (BMI) was also tested with linear regression, while age in years, sex, household income in Canadian dollars, and education in four categories (less than secondary school education, secondary school education, some post-secondary, and post-secondary graduation) were controlled. BMI was calculated as weight in kilograms divided by height in meters squared [15,50]. This study was conducted at the Research Data Centre (RDC) at McGill University (Montréal, Québec, Canada). The computer at the RDC was equipped with an Intel i7 3070 CPU (central processing unit, 4 cores, 8 threads), 16 GB RAM (random-access memory), a 128 GB SSD (solid state disk) and the Windows 7 Professional 64-bit operating system (Microsoft Corporation, Seattle, USA). Data processing and analysis were conducted with R (v3.20) [51] and RStudio (v0.98.113) [52]. Biomarkers were the variables that were identified in the CHMS Cycle 1 to 8 Content Summary [32]. This Summary also defined environmental biomarkers as the chemicals that could be detected in human specimens or living spaces. Two-tailed p values less than 0.05 were considered statistically significant. The processing time was reported to help researchers understand the complexity of trend analysis using national surveys.

Data processing and analysis
There were 26557 original variables in 79 data files released before March 2017. In 32 data files, 16064 variables were related to bootstrap weights only. There were 19212 variables created to summarize data or derived to represent important secondary outcomes for future projects. Using a typical desktop computer at the McGill RDC, the processing time of each major step was estimated as shown in Fig 1. First, the data were imported from STATA format and then stored in R data format. Data importation, storage and screening took less than five minutes to finish. In the third step, the cleaning of all original variables took less than 30 minutes. However, the creation of the summary measures or derived secondary outcomes in the fourth step, such as the numbers of chronic conditions, medication use, and abnormality in biomarkers, was time-consuming. The processing time could be up to two days. At least two factors contributed to the long processing time. The first was that efficient variable-wise calculation was not applicable. Depending on the nature of the derived variables, subject-based operations might be required and each observation needed to be screened, for example, to count the cardiovascular or diabetes medications of each individual. The other factor was the time spent loading data into memory and writing processed data back to disk.
In the fifth step, the summary or derived variables that needed to be linked to or reproduced in other data files, such as the information on medication use and biomarker summaries, were merged to destination files. For example, the summary of medication use needed to be merged to the household data set and used with appropriate bootstrap weights to obtain nationally representative statistics. This took less than one hour to finish. Sixth, trend or regression analysis with and without the adjustment of other predictors took less than one day to finish for all original or derived variables. The predicted values of all CHMS variables could also be calculated within one day. Lastly, selected trends and summary tables were produced for vetting and release from the RDC within 10 minutes. This research framework took less than four days to screen and analyze all CHMS variables.
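The merging step can be illustrated by attaching a per-subject medication summary to weighted household records and then computing a weighted prevalence. This is a Python sketch, not the R code used in the study. The field names (`subject_id`, `weight`, `n_cardio_drugs`) are hypothetical, and treating subjects absent from the medication file as having zero drugs is an assumption. Variance estimation with bootstrap weights is omitted.

```python
def merge_summary(household_rows, summary_by_id, new_field,
                  id_field="subject_id", default=0):
    """Left-join a per-subject summary onto household rows.

    Subjects missing from `summary_by_id` receive `default`
    (assumption: no record means no reported drug).
    """
    merged = []
    for row in household_rows:
        combined = dict(row)  # copy so the input rows are left unchanged
        combined[new_field] = summary_by_id.get(row[id_field], default)
        merged.append(combined)
    return merged


def weighted_prevalence(rows, is_case, weight_field="weight"):
    """Share of the weighted population for which `is_case(row)` is true."""
    total = sum(r[weight_field] for r in rows)
    cases = sum(r[weight_field] for r in rows if is_case(r))
    return cases / total


household = [
    {"subject_id": 1, "weight": 100.0},
    {"subject_id": 2, "weight": 300.0},
    {"subject_id": 3, "weight": 600.0},
]
cardio_counts = {1: 2, 3: 1}  # e.g. number of cardiovascular drugs per subject
merged = merge_summary(household, cardio_counts, "n_cardio_drugs")
prevalence = weighted_prevalence(merged, lambda r: r["n_cardio_drugs"] > 0)
```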

Characteristics of the CHMS cycles and Canadians
The summary of the CHMS data and the population characteristics is shown in Table 2. Cycle 3 had the largest number of variables, and many of them were repeated in at least one other cycle. Some cycle-4 variables were to be released after April 2017. Cycle 2 included more biomarkers than any other cycle. Because of the large numbers of biomarkers in cycle 2 and 3, there were variables designed to label limits of detection for all subjects.
The weighted population estimates increased over time, from 29 to 32 million Canadians between cycle 1 and 4. About half of them were female. The proportion of females did not appear to differ from that obtained with other data sources [53]. The minimal ages were three years in cycle 1 and six in cycle 2 to 4. The maximal age was 79 for all cycles. The mean age remained similar and might not be different from the official statistics, which described age by median values [46]. The ranges of blood pressure might also be similar to those published based on the same data [47], although Canadians of all ages were included in this study. In Fig 2, the trends of age, mean arterial pressure, and systolic and diastolic blood pressure are shown along with their 95% confidence intervals (CIs) relative to the first measures in the CHMS. None of the trends was significantly associated with the CHMS cycles (p > 0.05 for all). Age and blood pressure were significantly associated with BMI while controlling for age, sex, education and household income (p < 0.05 for all).

Summary of the trends in the CHMS data
The findings of the trend analysis are summarized in Table 3. The first row lists the numbers of the CHMS variables that were repeatedly measured. There were 519 variables measured in all of the CHMS cycles 1 to 4. The rates of change of BMI from cycle 1 to 4 are also listed. There were 429 variables significantly associated with the CHMS cycles from one to four, and 86 of them were biomarkers identified by Statistics Canada (p < 0.05 for all). There were 1099 variables significantly associated with BMI, and 152 of them were biomarkers. There were 20 and 26 variables respectively increasing and decreasing by more than 10% in three time intervals from cycle 1 to 4. There were 52 and 68 biomarkers observed to respectively increase or decrease once by at least 10% from cycle 1 to 4. Compared to the average growth rate of BMI, 0.2% per cycle, there were 130 biomarkers increasing more rapidly, and 22 of them were non-environmental biomarkers.
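The counts of variables changing by more than 10% can be understood through a minimal sketch that computes the relative change between consecutive cycle means and counts the intervals exceeding a threshold. This Python illustration is not the study's R code. The function names are hypothetical, and the reading of "three time intervals" as all three consecutive intervals between cycle 1 and 4 is an assumption. Cycle means are assumed to be non-zero.

```python
def relative_changes(cycle_means):
    """Relative change between each pair of consecutive cycle means."""
    return [(later - earlier) / earlier
            for earlier, later in zip(cycle_means, cycle_means[1:])]


def classify_trend(cycle_means, threshold=0.10):
    """Count the intervals with an increase or decrease beyond `threshold`."""
    changes = relative_changes(cycle_means)
    n_up = sum(1 for c in changes if c > threshold)
    n_down = sum(1 for c in changes if c < -threshold)
    return {
        "changes": changes,
        "n_up": n_up,
        "n_down": n_down,
        "up_in_all_intervals": n_up == len(changes),
        "down_in_all_intervals": n_down == len(changes),
    }


steady_rise = classify_trend([100.0, 115.0, 130.0, 150.0])
single_drop = classify_trend([100.0, 101.0, 89.0, 90.0])
```

Variables flagged in every interval, or with a single very large change, are natural candidates for checking measurement units and documentation.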

Discussion
Large numbers of the CHMS variables and biomarkers are increasing or decreasing at high rates. The importance of these trends to public health and wellbeing is not clear because the current rate of investigating and publishing the trends of the CHMS variables is unsatisfactory. Fewer than ten trends of the CHMS variables were published between 2015 and 2017, including those only considering selected populations [48,54,55]. At this rate, it can take more than ten years to gain a comprehensive understanding of the trends of the biomarkers, physical activities or other variables, given the large numbers of variables in national surveys. Currently the CHMS data have been mostly used as a novel data source [12, 18-23, 56, 57], rather than as part of a continuous effort to monitor population health. Only several outcomes have been studied continuously among selected populations [48,54,55], in addition to the biomonitoring activities by Health Canada [24][25][26]. This research framework of trend analysis customized to the CHMS data is highly feasible with computing resources available to most researchers. Scaling up trend analysis to all variables in national surveys has several advantages. In the first place, the automated data cleaning system is effective and efficient. It takes less than 30 minutes to clean all 79 files from the CHMS cycle 1 to 4. The results of data cleaning are examined based on parameters such as the maximal or minimal values to ensure appropriate quality for subsequent trend analysis. Another advantage is that the visualization of trends is easy to understand and useful for prioritizing biomarkers or variables for evaluation. In this study, the trends of blood pressure are plotted with the BMI trend to contrast the different patterns. We are applying this method to other variables to find unexpected trends. Moreover, certain types of data errors can also be easily highlighted with the trends. For example, the measurement unit of blood fibrinogen is mislabeled and leads to a more than 10-fold decrease in the levels after the CHMS cycle 2 (personal communication with Statistics Canada). The trends with the highest and lowest rates of increase or decrease are easy targets for data quality examination.
Finally, this framework of trend analysis can subsequently be supplemented with regression analysis, prediction and projection. Multiple regression for all CHMS variables to identify the significance of BMI and socioeconomic status has been tried and proven feasible. Predicted values are retrieved to understand the trends least explained by BMI and socioeconomic status (statistics not requested for release). The CHMS has also been used for the projection of obesity trends [10], which suggests that projection is also possible within this framework.

Limitations
However, there are several limitations to the research framework. First, there may be other data or documentation errors not identified. The data and documentation accuracy of several of the trends of the largest relative magnitude of change have been confirmed (personal  [24,26,58]. Other advanced methods may be tried to take other contextual factors into consideration [59,60]. In fact, it is unclear whether the proportions used by Health Canada are based on weighted or unweighted statistics [24,26,58]. This study uses unweighted proportions to exclude the variables from analysis. Furthermore, the code was written inside the RDC under significant time and resource constraints. The research framework will be structured into an R package for application to other major surveys and research purposes. Several improvements are expected in the implementation. For example, the evaluation of data products can be customized and made interactive. The method for creating the list of variable characteristics to be extracted depends on the research hypothesis and should be made flexible for other projects. The introduction of external information to create or derive new variables as predictors or outcomes can be improved. We are introducing the reference ranges for clinical or disease biomarkers [37,38] to further interpret clinical data and population health status. A system that describes the relationships between variables to infer information between them will be useful for sequential questions that study complicated status, such as disease history or the evolution of life events. We are also considering incorporating imputation of missing information into the research framework [60].

Extension to other surveys
This research framework can be extended to other major surveys with similar data structure, variable naming systems, missing value identification strategies and sampling frames, especially the Canadian Community Health Survey [48,56]. For other major surveys that provide cleaned data [61] or do not use bootstrap weights [35], it requires minimal revision to replicate this research framework to conduct trend analysis for all variables. The automated process for visualization of trend analysis is suggested for researchers to look for neglected trends and for survey administrators to search and correct data errors that can be demonstrated with trends of extreme rates of change across cycles or time points.

Ethics review
This secondary data analysis was approved by the ethics review committee at the Centre Hospitalier de l'Université de Montréal.