The External Validity of Randomized Controlled Trials of Hypertension within China: from the Perspective of Sample Representation

Objective To explore the external validity of randomized controlled trials (RCTs) of hypertension within China from the perspective of sample representation. Methods Comprehensive literature searches were performed in Medline, Embase, the Cochrane Central Register of Controlled Trials (CCTR) and other databases, and advanced search strategies were used to locate hypertension RCTs as well as observational studies conducted in China from 1996 to 2009. The risk of bias in RCTs and observational studies was assessed with two modified scales respectively, and studies of either type with a grading score of 3 or more were included for the evaluation of external validity. The study characteristics relevant to sample representation were then extracted from both RCTs and observational studies, and the latter were taken as external references for validating the sample representation of RCTs. Results 226 hypertension RCTs and 21 observational studies were included in the final analysis. The mean age of samples within RCTs was 54.46 years, significantly lower than that of observational studies (66.35 years) (P=0.002). The average disease course in patients of RCTs was 3.89 years and grade III hypertensive patients accounted for 17%; both were lower than in the observational studies (12.96 years, P<0.001; 34%, P=0.026 respectively). In addition, the proportions of patients with heart failure, stroke, diabetes, or coronary heart disease in RCTs were 8%, 5%, 12% and 11% respectively, all of which were lower than those in observational studies (11%, 18%, 17% and 29%). Conclusion Sample characteristics within hypertension RCTs were significantly different from those in observational studies. The samples in most RCTs were insufficiently representative. It is feasible to take samples of observational studies as a mirror of the actual composition of hypertension patients in the real world, if the reporting of observational studies is abundant and available.


Introduction
Because their design and conduct effectively reduce the possibility of bias and confounding [1], randomized controlled trials (RCTs) have favorable internal validity and are widely recognized in clinical research as the gold standard for determining the effects of treatments [2-5]. Apart from internal validity (i.e., whether the results suffer from systematic error), the external validity of RCTs also needs to be emphasized [6,7]; if RCTs are misused, or their results are irrelevant to the patients in a particular clinical setting [1,8,9], health care may be adversely affected. Lack of external validity is frequently cited as one of the obstacles to the translation of research evidence into clinical practice, and as a reason why interventions found to be effective in clinical trials and recommended in guidelines are underused in clinical practice [1,10,11]. However, in comparison with internal validity, external validity is easily neglected in clinical trials [6,9,10,12-14]. In addition, assessing external validity is complex, and studying how such assessments should be made is also challenging; currently, there is no consensus about how to assess the external validity of RCTs [9,15]. Some previous studies have highlighted potential determinants of external validity [9,14]; for example, strict eligibility criteria can limit the external validity of RCTs. A previous study indicated that fewer than 10% of patients with hypertension are managed in hospital clinics, and that this group differs from those managed in primary care [14]. However, external validity cannot be easily formalized [9], as the baseline clinical characteristics recorded often say very little about the real composition of the trial population.
Because it is easy to quantify and is abundantly reported, sample representation is often used as an important indicator of external validity [16]; however, the lack of a reference is frequently cited as an obstacle to exploring the sample representation of RCTs. As few observational studies enroll participants under stringent eligibility criteria, samples within observational studies are more likely to be representative, and could therefore serve as candidate references mirroring the real composition of patients in clinical practice. Hypertension has become a serious disease burden in China [17,18]; although a great number of clinical trials on hypertension have been conducted within China, few have been successfully developed into evidence-based information and disseminated to patients under specific circumstances [18]. This study intends to explore the sample representation of hypertension RCTs by comparison with the sample characteristics of observational studies.

Search strategy and study selection
A comprehensive literature search was performed; literature databases included Medline (Ovid), Embase, Cochrane Central Register of Controlled Trials (CCTR, Ovid), Chinese biomedical literature database (CBM), China National Knowledge Infrastructure/China Academic Journals Full-text Database (CNKI) and Chinese scientific journals database (VIP). The Medical Subject Headings (MeSH) 'hypertension', 'randomized controlled trial', 'controlled clinical trial', 'random allocation', 'cases series' and 'cohort study' were used as English and corresponding Chinese search terms to identify studies from the aforementioned databases (January 1, 1996 to December 31, 2009). In addition, references from included articles, as well as articles citing included articles, were screened for inclusion.
Two authors (ZX and WYX) screened the titles and abstracts to identify relevant studies. In cases of disagreement, consensus was achieved by discussion with the third author (KDY). Criteria for final inclusion of RCTs were: (1) drug therapy for primary hypertension involving the six classes of antihypertensive drugs recommended by WHO (ACEI, Angiotensin-Converting Enzyme Inhibitor; ARB, Angiotensin II Receptor Blocker; CCB, Calcium Channel Blocker; alpha-blocker; beta-blocker; diuretics); (2) a grading score equal to or greater than 3. Similar criteria were set for final inclusion of observational studies: (1) topics on managing primary hypertension involving the six classes of antihypertensive drugs recommended by WHO; (2) any type of design related to case series or cohort studies; (3) a grading score equal to or greater than 3. RCTs were excluded if they: (1) recruited patients with secondary hypertension; (2) were published as abstracts only; or (3) reported partial data from multi-center research. Observational studies were excluded if they: (1) recruited patients with secondary hypertension; (2) had a sample size of less than 30; or (3) were duplicate publications.

Internal validity assessment
Two scales for assessing the internal validity of RCTs and observational studies were modified from five available tools: two tools for RCTs, the Jadad scale [19] and the evaluation criteria in the Cochrane Handbook [20]; and three tools for observational studies, the Critical Appraisal Skills Programme (CASP) [21], the Newcastle-Ottawa Scale (NOS) [22] and the 'Validity Checklist for Appraising an Article on Observational Study' [23]. The scale developed for RCTs includes five domains: randomization (0-2 points), allocation concealment (0-2 points), blinding (0-2 points), attrition (0-2 points) and baseline condition (0-1 points); the total score for a perfect RCT is 9. A separate scale was used for observational studies. As judgments about quality in observational studies are often complex, this scale addresses four key issues that arise in assessing risk of bias: diagnostic criteria (0-1 points), sample source (0-1 points), recruitment (0-1 points) and setting of research (0-1 points); an observational study that effectively eliminated the possibility of bias and confounding would receive a grade of 4 points.
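As a minimal sketch of how the two modified scales translate into a total score and an inclusion decision, the following Python snippet encodes the domain maxima listed above together with the cut-off of 3 points used for inclusion. The names RCT_DOMAINS, OBS_DOMAINS, total_score and is_included are illustrative assumptions and are not part of the original tools; treating an unrated domain as 0 points is also an assumption.

```python
# Maximum points per domain, taken from the description of the two scales.
RCT_DOMAINS = {
    "randomization": 2,
    "allocation_concealment": 2,
    "blinding": 2,
    "attrition": 2,
    "baseline_condition": 1,
}
OBS_DOMAINS = {
    "diagnostic_criteria": 1,
    "sample_source": 1,
    "recruitment": 1,
    "research_setting": 1,
}

CUTOFF = 3  # inclusion threshold stated for both study types


def total_score(ratings, domains):
    """Sum the domain ratings after checking each one against its maximum."""
    unknown = set(ratings) - set(domains)
    if unknown:
        raise ValueError(f"unknown domains: {sorted(unknown)}")
    total = 0
    for name, max_points in domains.items():
        points = ratings.get(name, 0)  # assumption: an unrated domain scores 0
        if not 0 <= points <= max_points:
            raise ValueError(f"{name} must be between 0 and {max_points}")
        total += points
    return total


def is_included(ratings, domains, cutoff=CUTOFF):
    """Return True when the total score reaches the inclusion cut-off."""
    return total_score(ratings, domains) >= cutoff


# Hypothetical ratings for one RCT (illustrative values, not study data).
example_rct = {"randomization": 2, "blinding": 1, "baseline_condition": 1}
print(total_score(example_rct, RCT_DOMAINS), is_included(example_rct, RCT_DOMAINS))
```

Keeping the domain maxima in dictionaries makes the stated totals (9 for RCTs, 4 for observational studies) checkable by summing the values.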
A pilot study was then performed to validate the two modified scales; the agreement for each item ('yes' scores vs. any other scores) and for the whole tool was expressed as the percentage of actual agreement as well as the Kappa coefficient. We interpreted Kappa values of <0 as less than chance agreement, 0.01-0.20 as slight agreement, 0.21-0.40 as fair agreement, 0.41-0.60 as moderate agreement, 0.61-0.80 as substantial agreement, and 0.81-0.99 as almost perfect agreement [24]. We tested the coding framework for RCTs through comparison with the Jadad scale [25], and the criterion validity of the tool was assessed by calculating correlation coefficients.
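The agreement statistic described above can be sketched as follows, assuming the unweighted Cohen's kappa for two raters (the text does not state which kappa variant was computed). The function names and the example ratings are hypothetical. The interpretation bands follow the ranges quoted above; treating each band's upper limit as inclusive, so that values falling between the quoted ranges are still classified, is an assumption.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters who rated the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same rating by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per category.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # convention assumed here: both raters used one identical category
    return (observed - expected) / (1.0 - expected)


# Upper limits of the bands quoted in the text (assumed inclusive).
BANDS = [
    (0.20, "slight agreement"),
    (0.40, "fair agreement"),
    (0.60, "moderate agreement"),
    (0.80, "substantial agreement"),
]


def interpret_kappa(kappa):
    """Map a kappa value to the verbal label used in the text."""
    if kappa < 0:
        return "less than chance agreement"
    for upper, label in BANDS:
        if kappa <= upper:
            return label
    return "almost perfect agreement"


# Hypothetical 'yes' vs. other ratings of four items by two assessors.
ratings_a = ["yes", "yes", "no", "no"]
ratings_b = ["yes", "no", "no", "no"]
kappa = cohen_kappa(ratings_a, ratings_b)
print(round(kappa, 2), interpret_kappa(kappa))
```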
All included articles were rated with the above modified scales by two authors (ZX, WYX). Any queries were resolved through ongoing discussion among all authors throughout the coding process.

Data abstraction for evaluating external validity
Information for evaluating external validity was extracted using a pre-developed form [23,26]. Two authors (ZX, WYX) abstracted data independently and any discrepancies were resolved by discussion. The data extraction form includes 4 domains and 25 items. The domain of "source" has 5 items: region of trial setting, research setting, date of study, number of centers involved, and funding source. The domain of "subject recruitment" includes 7 items: location, setting, method, duration of recruitment, number of eligible patients, number of patients not meeting inclusion criteria, and number of patients refusing participation. The domain of "baseline characteristics of subjects" has 8 items: sample size, source of patients, age, gender, diagnostic criteria, duration of disease, state of disease, and complications. The last domain relates to patient-important outcomes and includes "effectiveness outcomes" and "adverse events".

Statistical analysis
Data were analyzed using SPSS software, version 13.0 (SPSS, Chicago, IL) and MetaXL, version 1.3 (MetaXL, Brisbane, Australia). Descriptive statistics were rates and proportions for dichotomous data, and means ± SDs or medians (ranges) for continuous data. Correlation coefficients were used to assess the criterion validity of the modified scales for internal validity. The t-test, the Mann-Whitney test and multiple linear regression were used to test sample representation in terms of age, duration of disease, and the proportions of females, grade III hypertension and the main complications. The Generic Inverse Variance (GIV) method [27] was used to synthesize the rates and proportions reported in observational studies. We also used guidelines for inferential interpretation of the overlap of CIs between two independent group rates or means to identify statistically significant differences: P<0.05 when the proportion overlap of the 95% CIs is ≤0.50, and P<0.01 when the two CIs do not overlap, that is, when the proportion overlap is about 0 or there is a positive gap [25]. All tests were two-sided and P values of 0.05 or less were considered statistically significant.
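The two synthesis and comparison rules described above can be illustrated with a short sketch. It rests on several assumptions that the text does not confirm: the pooling uses plain fixed-effect inverse-variance weights on untransformed estimates (MetaXL may apply a transformation to proportions), and the "proportion overlap" is taken as the overlap of the two intervals divided by their average margin of error (half-width). All function names are hypothetical.

```python
import math


def pool_inverse_variance(estimates, standard_errors):
    """Fixed-effect inverse-variance pooled estimate and its standard error."""
    if len(estimates) != len(standard_errors) or not estimates:
        raise ValueError("need one standard error per estimate")
    if any(se <= 0 for se in standard_errors):
        raise ValueError("standard errors must be positive")
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    return pooled, math.sqrt(1.0 / total_weight)


def proportion_overlap(ci_a, ci_b):
    """Overlap of two CIs as a fraction of their average margin of error."""
    (low_a, high_a), (low_b, high_b) = ci_a, ci_b
    overlap = min(high_a, high_b) - max(low_a, low_b)  # negative means a gap
    # Average of the two half-widths (assumed definition of margin of error).
    average_margin = ((high_a - low_a) + (high_b - low_b)) / 4.0
    return overlap / average_margin


def overlap_rule(ci_a, ci_b):
    """Apply the CI-overlap guideline quoted in the text."""
    ratio = proportion_overlap(ci_a, ci_b)
    if ratio <= 0:
        return "P < 0.01"
    if ratio <= 0.5:
        return "P < 0.05"
    return "not significant by this rule"


# 95% CIs reported in the results for grade III hypertension:
# RCTs (0.09 to 0.28) versus observational studies (0.27 to 0.42).
print(overlap_rule((0.09, 0.28), (0.27, 0.42)))

# Hypothetical study-level proportions and standard errors (not study data).
print(pool_inverse_variance([0.2, 0.4], [0.1, 0.1]))
```

Under these assumptions, the two grade III intervals overlap by much less than half of the average margin of error, which is in line with the reported P=0.026.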

Flow of included studies
1197 RCTs were identified from the searches (excluding 136 duplicates and 4888 non-relevant articles); after that, 99 RCTs were excluded based on the inclusion criteria; finally, 225 RCTs with internal validity scores of ≥3 remained (Figure 1). Meanwhile, 32 observational studies were identified from the searches (excluding 504 duplicates and 6940 non-relevant articles); 10 observational studies were further excluded based on the inclusion criteria, and 21 observational studies with quality scores of ≥3 were finally included (Figure 2). Clinical studies, whether RCTs or observational studies, may suffer from bias and confounding in their design or conduct, and so carry an additional risk of misleading results. Therefore, we took 3 as the cut-off point for inclusion of RCTs, which is equivalent to one third of the total score of 9. As observational studies are more likely than RCTs to suffer from bias and confounding, a stricter eligibility criterion is needed; we therefore used 3 as the cut-off point for including observational studies, which is equivalent to the upper quartile of the total score of 4.

Validation of modified scales for internal validity
We selected 50 RCTs randomly using a computer-generated list to validate inter-rater agreement. The kappa between the two assessors for the global assessment was 0.72 and the percentage of actual agreement was 76% (both P<0.001) (Table 1). Another 30 RCTs were randomly selected for validity evaluation. The total mean score was converted into a percentage of the maximum score for the modified scale, and the ICC against the Jadad score was 0.84; that is, the results of the modified scale were highly convergent with those of the Jadad score. However, as the number of observational studies was limited, the validation procedure could not be performed adequately for the scale for observational studies.

Internal validity of included studies after applying two modified scales
The internal validity of the selected 1099 RCTs (one citation included 2 RCTs) was assessed by applying the modified scale for RCTs; of those, 226 RCTs with a grade equal to or greater than 3 points were included (Table 2). The median grade of RCTs was 3, and RCTs with a grading score equal to or greater than 7 accounted for only 3.1% (n=7). The 22 observational studies that met the inclusion criteria were evaluated by applying the modified scale for observational studies (OBSs); 21 OBSs with a grading score equal to or greater than 3 were included (Table 2), and the median score was 4.

Comparisons of study characteristics between RCTs and observational studies
Study characteristics such as sample size, location of setting and class of hospital, sample source and diagnostic criteria, therapy regimen and type of drug, and patient-important outcomes met the minimum requirements for comparative analysis because they were adequately reported in both RCTs and observational studies.
Sample size. The median sample sizes were 99 (min-max: 29-1352, total=57813) and 360 (min-max: 73-5106, total=15789) for the included RCTs and observational studies respectively; sample sizes in RCTs were generally smaller than those in observational studies (P<0.001).
Location of setting and hospital class. All included studies reported the research setting (location of setting and class of hospital); no significant discrepancy was observed in location of setting or hospital class (both P>0.05) (Table 3).
Patient important outcomes. Blood pressure change and effective rate of anti-hypertension were the most frequently addressed primary outcomes among RCTs, while the secondary outcomes in RCTs varied considerably, including cardiovascular death, QOL, health economics, adverse events, compliance, as well as intermediate measures (such as left ventricular hypertrophy, renal function, vascular endothelial function, pulse wave velocity, new-onset diabetes, resistance ameliorating effect). Significant discrepancy was observed in effective rate of anti-hypertension and adverse events (both

Comparisons between RCTs and observational studies in terms of sample representation
Duration of disease. The duration of disease was presented in 42.5% of RCTs and 42.9% of observational studies (Table 8). Of those, the average disease course in patients of RCTs was significantly shorter than that in observational studies (3.89±4.39 vs. 12.96±4.49 years, P<0.001).
Grade III hypertension. The proportion of grade III hypertension is presented in Table 9. Patients with grade III hypertension were significantly underrepresented in RCTs in comparison with observational studies, with overall proportions of 0.17 (95%CI: 0.09 to 0.28) and 0.34 (95%CI: 0.27 to 0.42) respectively (P=0.026).
Complications. Only 10.2% (n=23) of RCTs reported complications. Proportions of complications in RCTs were lower than those in observational studies in terms of heart failure (P=0.505), stroke (P=0.018), diabetes (P=0.141) and CHD (Coronary Atherosclerotic Heart Disease, P=0.125), although only the difference for stroke was statistically significant. However, the proportion of patients with renal insufficiency was higher in RCTs than in observational studies (P<0.01, zero overlap between the two CIs).
Age, gender. Patient ages were presented in 73.9% of RCTs and 90.5% of observational studies (Table 8). Patients in RCTs were younger than those in observational studies: 54.46±6.34 versus 66.35±13.91 years (P=0.002). The proportions of females are presented in Table 9. Multiple linear regression was further used to explore factors influencing the underrepresentation of age and gender, but only study type was statistically significant (both P<0.05). Similar analyses in terms of duration of disease, proportion of grade III hypertension and proportion of complications could not be performed adequately due to the limited number of studies (Table 10).
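For readers who wish to reproduce a comparison of this kind from summary statistics, the sketch below computes a two-sample t statistic for the reported mean ages. It is only an illustration under stated assumptions: the text does not say whether a t-test or the Mann-Whitney test produced the reported P value, nor which t-test variant was used, so Welch's unequal-variance form is assumed here. The study counts are also assumptions, derived from the reported percentages (73.9% of 226 RCTs and 90.5% of 21 observational studies), with the study taken as the unit of analysis.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = sd_a ** 2 / n_a  # squared standard error of mean A
    var_b = sd_b ** 2 / n_b  # squared standard error of mean B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df


# Assumed numbers of studies reporting age, derived from the percentages.
n_rct = round(0.739 * 226)
n_obs = round(0.905 * 21)

# Reported mean ages and SDs: RCTs versus observational studies.
t, df = welch_t(54.46, 6.34, n_rct, 66.35, 13.91, n_obs)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The resulting t and degrees of freedom can be looked up in a t table or passed to a statistics package to obtain a two-sided P value.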

Discussion
Therapeutic efficacy is often studied through observational surveys of patients in clinical practice whose treatments were selected non-experimentally. Observational studies have several advantages over randomized controlled trials, including lower cost, greater timeliness, and a broader range of patients. An important advantage of the expanded observational study is its ability to estimate treatment effects across this broader spectrum of clinical practice. In this study, we attempted to use samples from observational studies of hypertension in China to create references that mirror hypertension patients in the real world. There are several interesting findings in our study. Firstly, the characteristics of RCTs on hypertension were significantly different from those of observational studies in terms of sample size, sample source, diagnostic criteria, frequency of diuretic use and types of medicine. Insufficient trial size may cause overly homogeneous patients to be enrolled; it also confers insufficient power on the statistical test employed, so that failure to attain statistical significance does not necessarily mean that the two treatments being compared are identical [28]. In comparison with inpatients, outpatients may have milder hypertension, shorter disease duration and even different therapy; if too many outpatients are recruited into hypertension RCTs, treatment effects are easily overestimated.
Secondly, the elderly, patients with a long disease course, grade III hypertension patients and patients with complications were underrepresented in RCT samples. Patients in RCTs were more likely to be young and to have a short duration of disease, as well as lower proportions of concurrent stroke and renal insufficiency, than those in actual clinical settings. Because of discrepancies in clinical characteristics, clinical manifestations and treatments also differ among hypertension patients of different ages. Including too few elderly patients in RCTs, in other words a lack of efficacy and safety information on elderly people, will directly limit the application and generalization of trial results to this spectrum of patients. RCTs tend to include patients with less serious disease or shorter disease duration, who generally respond well to drugs and are less likely to suffer severe side effects or adverse events, making it easier to obtain beneficial results. However, rates of side effects or adverse events may rebound when the intervention is applied in routine clinical practice. With regard to medicine, Angiotensin-Converting Enzyme Inhibitors (ACEI) and Angiotensin II Receptor Blockers (ARB) are recommended by the China Guideline for hypertension prevention and control [18] for hypertension patients with diabetes or renal insufficiency; however, most of the 88 RCTs excluded patients with diabetes (n=73, 83.0%) or renal insufficiency (n=87, 98.9%). Beta-blockers and ACEI are recommended for hypertension patients with CHD or heart failure [29]; among the 48 available RCTs, 15 (31.2%) excluded CHD patients and 26 (54.2%) excluded heart failure patients. Excessive exclusion of patients with complications from trials will directly weaken sample representation and lead to overestimation of intervention effects; that is, the conclusion may be valid only for the sample population, and not applicable to patients in the real world.
There are several limitations in our study. First, we assumed that these cohorts represented the "real world" in China, but they may not, possibly owing to publication bias. The ideal reference reflecting patients in the "real world" would come from a nationwide large-scale survey; however, such a survey is very difficult to perform due to financial, political or technical barriers. Second, as the reporting quality of the included original studies (both RCTs and observational studies) was not good enough, much information related to external validity was not reported or was reported insufficiently, making it hard to thoroughly analyze the factors related to sample representation, such as patient enrollment information (those who did not fit the inclusion criteria, those who fit but refused to participate, and those who were finally enrolled in the trial). Notably, more than half (58.8%) of the RCTs did not report the disease course of included participants, and only 23 (10.2%) RCTs described patients' complications. Although the inclusion and exclusion criteria for patients were set in advance, patients' characteristics remain unclear, which limits the application of trial results to patients in the real world. Therefore, there is marked room for improving the quality of reporting in RCTs, especially in respects related to external validity. Third, high-quality observational studies were too few to make up external references, as only 21 studies were identified in this study; caution is needed when using the synthesized results as substitutes for patients in routine clinical practice. Case reports by nature include one person, while the case series we refer to is a design that studies only patients exposed to the interventions; both types raise serious questions about false positive results caused by chance if the sample size is less than 30 cases.
Additionally, the case-control design is not truly representative of the general population and would not serve as a reasonable "gold standard" against which to compare any RCT for external validity. Such types of observational studies were excluded. Another potential limitation is that our study involved a considerable number of issues and multiple comparisons; those issues may be hard to follow, and multiple comparisons without correction may lead to false positive findings, that is, positive results caused by chance. Moreover, heterogeneity existed in most meta-analyses but could not be fully explained by differences in patients' age, sample source, class of hospital, or sample size; sources of heterogeneity need to be investigated in further research.

Conclusion
The samples within hypertension RCTs in China underrepresent elderly patients, patients with a long disease course, patients with complications and grade III hypertension patients. Although observational studies are frequently performed as a substitute for the randomized clinical trial, the evidence from such surveys is frequently not convincing. Using samples from observational studies to mirror the composition of patients in the real world is somewhat feasible; however, more studies are needed to demonstrate the validity of our results and their generalizability. There is also marked room for improving the quality of reporting in both RCTs and observational studies.