Abstract
Most existing quality scales have been developed with minimal attention to accepted standards of psychometric properties. Even for those that have been used widely in medical research, limited evidence exists supporting their psychometric properties. The focus of our current study is to address this gap by evaluating the psychometric properties of two existing quality scales that are frequently used in cancer observational research: (1) the Item Bank on Risk of Bias and Precision of Observational Studies developed by the Research Triangle Institute (RTI) International and (2) the Newcastle-Ottawa Quality Assessment Scale (NOQAS). We used the Rasch measurement model to evaluate the psychometric properties of the two quality scales based on the ratings of 49 studies that examine firefighters’ cancer incidence and mortality. Our study found that the RTI scale and NOQAS have acceptable item reliability. Two raters were consistent in their assessments, demonstrating high interrater reliability. We also found that NOQAS has more items that show better fit than the RTI scale. The NOQAS produced lower study quality scores with smaller variation, suggesting that NOQAS items are much easier to rate. Our findings accord with a previous study, which concluded that the RTI scale was harder to apply and thus produced more heterogeneous quality scores than NOQAS. Although both the RTI scale and NOQAS showed high item reliability, NOQAS items fit the underlying construct better, showing higher validity of internal structure and stronger psychometric properties. The current study adds to our understanding of the psychometric properties of the NOQAS and RTI scales for future meta-analyses of observational studies, particularly in the firefighter cancer literature.
Citation: Ahn S, Pinheiro PS, McClure LA, Hernandez DR, Caban-Martinez AJ, Lee DJ (2023) An examination of psychometric properties of study quality assessment scales in meta-analysis: Rasch measurement model applied to the firefighter cancer literature. PLoS ONE 18(7): e0284469. https://doi.org/10.1371/journal.pone.0284469
Editor: Simon Grima, University of Malta, MALTA
Received: November 6, 2022; Accepted: March 31, 2023; Published: July 26, 2023
Copyright: © 2023 Ahn et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting information files.
Funding: This study was supported by funds from Florida State Appropriation #2382A (Principal Investigator: Kobetz). Research reported in this publication was also supported by the National Cancer Institute of the National Institutes of Health under Award Number P30CA240139. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Assessment of study quality is a critical aspect of conducting meta-analyses. Study quality varies considerably across studies and may lead to heterogeneity in study findings [1–5]. Bérard and Bravo warned that overall effect size estimates obtained from meta-analyses that do not account for variation in study quality may suffer from increased Type I error rates [6]. In addition, other factors investigators ought to consider when evaluating studies include the sources, directions, and even plausible magnitudes of such biases [7, 8]. Therefore, many researchers suggest that the quality of primary studies should be accurately assessed and used in meta-analysis [5, 7]. Despite the importance of assessing study quality, many researchers have identified challenges in dealing with the quality of primary studies in meta-analyses [3, 6, 9]. One critical issue is that while a variety of scales exist to assess the quality of primary studies, none has been universally adopted [10]. In fact, there is no consensus about how study quality should be conceptualized or measured in the existing quality scales [5, 11–13]. Moreover, most existing quality scales have been developed with minimal attention to accepted standards of psychometric properties such as reliability and validity [14]. Most of the research has focused on interrater reliability measures, such as kappa statistics or percentage of agreement, rather than item reliability, content validity, or construct validity. In addition, even for scales that have been used widely in medical research, little to no evidence exists supporting their psychometric properties.
Therefore, the focus of our current study is to address this gap by evaluating the psychometric properties (i.e., item reliability, interrater reliability, and construct validity) of existing quality scales that are frequently used in cancer observational research: (1) the Item Bank on Risk of Bias and Precision of Observational Studies developed by the Research Triangle Institute (RTI) International [15] and (2) the Newcastle-Ottawa Quality Assessment Scale (NOQAS) [16]. Specifically, we used the Rasch measurement model [17] to evaluate the psychometric properties of these two quality scales based on the ratings of 49 studies that examine firefighters’ cancer incidence and mortality. The present study is focused on three primary research questions, namely:
- Can the RTI or the NOQAS scale be considered reliable?
- Do the items of RTI or NOQAS fit the overall quality score?
- Do the individual studies fit the overall quality score?
Study quality
Two different frameworks have been proposed in the literature to define and measure the quality of primary studies in meta-analysis [18]. One is based on the validity framework developed by Campbell and his associates [19] and the other, called “quality assessment”, was proposed by Chalmers and his colleagues [8].
The former approach, based on the idea of Campbell and his associates, suggests a matrix of designs and their features or threats to validity. The validity framework includes 33 separate threats to validity based on four distinct categories: internal, external, statistical, and construct validity [18]. This validity framework for assessing the quality of primary studies in a meta-analysis is mainly used in the social sciences. For instance, Devine and Cook [20] evaluated the quality of primary studies based on the validity framework by examining six design features representing internal, external, statistical, and construct validity (e.g., floor effect, publication bias, attrition, and domains of content).
The second approach, proposed by Chalmers and his associates, has been applied primarily to medical research [8, 18, 21–24]. The objective of Chalmers’ system is to quantify the overall quality of primary studies based on in-depth criteria for assessing randomized controlled trials. Chalmers and his colleagues mainly focused on construct validity and statistical conclusion validity, examining such features as randomization, blinding of the statistician, and minimization of data-extraction bias [18].
Study quality assessments in observational studies
An informal PubMed search of published meta-analyses and systematic reviews in the cancer literature revealed that the Newcastle-Ottawa Quality Assessment Scale (NOQAS) [16] was the most widely employed tool in review articles focused on risk factor association studies [25–31]. This tool was employed in a recent meta-analysis of the firefighter cancer literature [32]. The second identified assessment tool, less commonly employed in cancer-focused meta-analyses and systematic reviews [33, 34], was the Item Bank on Risk of Bias and Precision of Observational Studies developed by the Research Triangle Institute (RTI) International [35, 36]. Although not commonly employed in cancer meta-analyses [33], it has been utilized in a variety of syntheses of other disease outcome association studies [36–41] and was employed in a systematic review of lung function in firefighters [42]. Of note, some investigators have employed both the NOQAS and the RTI item bank to assess quality in meta-analyses and systematic reviews [37, 42, 43]. The RTI item bank comprises 29 multiple-choice questions designed to assess a range of risk of bias and precision domains across a variety of observational study designs [36, 37]. These domains include: sample definition and selection, interventions/exposure, outcomes, creation of treatment groups, blinding, soundness of information, follow-up, analysis comparability, analysis outcome, interpretation, and presentation and reporting. Investigators are encouraged to select items from the bank that are most appropriate to the content area and study design of the studies under assessment.
The 8-item NOQAS was developed to assess the quality of nonrandomized studies, with specific assessment forms for case-control and cohort study designs [16]. Several questions are designed to be tailored to the content being assessed. A simple summary quality score can be obtained by summing the individual items judged to be of high quality, although given the scale’s relatively short length, investigators often report quality levels on the individual 8 items for each study under review. The NOQAS has been recommended for use in assessing the quality of observational study designs [41, 44].
Our literature search using PubMed, PsycInfo, and Medline identified one published study comparing the psychometric evidence of NOQAS to that of the RTI item bank. In that study, Margulis and her colleagues [40] had two raters independently assess the quality of 44 primary studies with the RTI item bank and NOQAS. After coding the quality of studies, Margulis and her colleagues computed interrater agreement using percentage of agreement and first-order agreement coefficient statistics. In their study, the relationship between NOQAS and RTI for rank-ordering studies in terms of risk of bias was moderate, as indicated by Spearman’s rank correlation coefficients of .35 and .38. The authors also stated that NOQAS is easier to apply than the RTI item bank but more limited in scope, although the aspects of quality covered were similar between NOQAS and RTI. Lastly, the interrater reliabilities between raters were reported to be fair for both NOQAS and RTI.
As in the study by Margulis and her colleagues [40], the few published studies that have addressed the quality of either NOQAS or the RTI item bank relied on interrater reliability measures such as kappa statistics or percentage of agreement between raters [41, 44]. All of these studies evaluated the psychometric properties of the quality assessment tools under the Classical Test Theory (CTT) framework, which analyzes the raw instrument scores directly and is therefore relatively limited. Moreover, most of the existing studies focused on interrater reliability or the face validity of the items used to measure the quality of individual studies.
Rasch measurement model
Whereas classical test theory (CTT) has been frequently used in evaluating the validity and reliability of study quality ratings, some issues have arisen regarding the calibration of item difficulty, sample dependence of coefficient measures, and estimates of measurement error. The Rasch model enables us to address those issues by (1) assessing the dimensionality of assessment; (2) identifying redundant items or items that measure a different construct or construct-irrelevant factors through the item-fit; (3) identifying items that should be flagged based on their difficulty levels; and (4) assessing whether response categories are appropriate for distinguishing items by their quality.
Rasch Measurement Theory (RMT) is a psychometric model for analyzing categorical data (originally dichotomous responses) as a function of a person’s (e.g., a rater’s or reviewer’s) ability on a trait and the item’s difficulty [17]. Andrich [45] later developed the Rasch Rating Scale Model (RSM, also called the polytomous Rasch model) for polytomous data, that is, data with more than two ordinal categories. The RSM provides estimates of person location on a continuous latent variable (θ), item difficulties (δ), and an overall set of thresholds that are fixed across items (τ).
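For reference, the two models can be written compactly as follows. This is a standard textbook formulation (with θ for person ability, δ for item difficulty, and τ for category thresholds), not notation taken from either quality scale.

```latex
% Dichotomous Rasch model: probability that person n succeeds on item i
P(X_{ni} = 1 \mid \theta_n, \delta_i)
  = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}

% Rating Scale Model (adjacent-category form): log-odds of category k over k-1,
% with one set of thresholds \tau_k shared across all items
\ln\!\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = \theta_n - \delta_i - \tau_k
```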
RMT combines information from the person and the item to estimate the probability that a person with a given level of ability answers a given item correctly, thus connecting person ability to item difficulty [46]. This probabilistic framework allows RMT to be falsifiable and to meet the linearity assumptions of parametric statistical tests. Therefore, fit statistics for both person-fit and item-fit can be obtained, which provide evidence of validity—how well the model can predict the response to each item.
In addition, RMT transforms ordinal data into logits, which permits proper use of parametric statistical analyses without the assumption violations associated with Type I and Type II error inflation. Lastly, the item parameters estimated by RMT are generally invariant to the population used to generate the estimates. In other words, parameter estimates obtained from a sufficient sample should be equivalent to those obtained from another sufficient sample, regardless of the average ability level in each sample [46]. This property of RMT allows for greater generalization of results as well as more sophisticated applications.
Psychometric properties in Rasch measurement model
For any quality test or assessment, supporting evidence must be provided for three psychometric properties—validity, reliability, and fairness [47]. This section briefly reviews how each of these psychometric properties can be assessed when using RMT. In this study, our focus is on reliability and validity.
Reliability.
Reliability refers to the consistency or precision of scores across replications of a testing procedure. Under RMT, a Rasch-based reliability index, called the reliability of separation, is used to measure the reliability of a test or assessment. The index, which ranges from 0 to 1, is obtained from latent measures with equal intervals along the underlying continuum and reflects how distinct the latent scores are along the scale. It is defined as:

$$R = \frac{SD^2 - MSE}{SD^2} \qquad (1)$$

where SD = the standard deviation of the Rasch measures of a specific facet (e.g., students, tasks, or raters) and MSE = the average mean squared error of the Rasch measures for that facet. Higher values indicate higher reliability. High reliabilities are preferred because they indicate a good representation of Rasch measures across the entire range of the latent scale.
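As a concrete illustration of Eq 1, the sketch below computes the reliability of separation from a facet’s Rasch measures and their standard errors. The arrays are hypothetical and this is not the FACETS implementation, only a minimal rendering of the formula.

```python
import numpy as np

def separation_reliability(measures, std_errors):
    """Rasch reliability of separation for one facet (Eq 1).

    measures: Rasch measures (in logits) for each element of the facet.
    std_errors: standard error of each measure.
    """
    measures = np.asarray(measures, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    observed_var = measures.var(ddof=1)   # SD^2 of the Rasch measures
    mse = np.mean(std_errors ** 2)        # average mean squared error
    return (observed_var - mse) / observed_var

# Hypothetical item measures (logits) and standard errors, for illustration only
item_measures = [-2.1, -0.8, 0.3, 1.1, 1.9]
item_se = [0.12, 0.10, 0.09, 0.11, 0.14]
print(round(separation_reliability(item_measures, item_se), 3))
```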
Validity.
Validity refers to the degree to which theory and evidence support the interpretation of test scores [47]. Under RMT, the Infit and Outfit Mean Square (MnSq) statistics can be used to evaluate how well the measures of an individual facet (i.e., item, study, or rater) fit the constructed latent scale (i.e., the study quality score). In particular, the Infit MnSq identifies irregular response patterns, and the Outfit MnSq detects large residual values. The expected value for both Infit and Outfit MnSq statistics is 1.0, which indicates a perfect fit to the underlying scale. The fit indices provide diagnostic information for identifying misfitting elements on each facet (e.g., item, study, or rater), supporting validity arguments about internal structure. Fit can accordingly be rated on a scale ranging from A (the item, study, or rater fits the scale very well) to D (the item, study, or rater does not fit the scale). See Table 1 for guidelines for interpreting the Infit and Outfit MnSq values.
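To make the fit statistics concrete, the sketch below shows one common way Infit and Outfit MnSq are computed from residuals. The expected scores and variances would come from the fitted Rasch model; the example values here are hypothetical.

```python
import numpy as np

def infit_outfit(observed, expected, variance):
    """Infit and Outfit mean-square statistics for one element (item, study, or rater).

    observed: ratings actually given in the observations involving this element.
    expected: model-expected ratings under the Rasch model.
    variance: model variance of each rating.
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variance = np.asarray(variance, dtype=float)
    residual_sq = (observed - expected) ** 2
    outfit = np.mean(residual_sq / variance)    # unweighted; sensitive to outliers
    infit = residual_sq.sum() / variance.sum()  # information-weighted; sensitive to
                                                # irregular on-target response patterns
    return infit, outfit

# Hypothetical ratings for one item across five studies
infit, outfit = infit_outfit(observed=[2, 1, 0, 2, 1],
                             expected=[1.8, 1.2, 0.4, 1.6, 1.1],
                             variance=[0.5, 0.6, 0.4, 0.55, 0.6])
print(f"Infit MnSq = {infit:.2f}, Outfit MnSq = {outfit:.2f}")
```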
Methods
Description of 49 studies on firefighter cancer incidence and mortality
The studies evaluated in this quality assessment were gathered for a meta-analysis project that examines cancer incidence and mortality risk among firefighters. The included studies were identified through a comprehensive literature search using multiple databases, including ERIC, PsycINFO, ProQuest Dissertations & Theses, PubMed, and MEDLINE via EBSCO, and online search engines including Embase, Web of Science Core Collection, Google Scholar, and SCOPUS. A total of 49 studies met the inclusion and exclusion criteria.
Two independent raters were responsible for coding (1) study design characteristics, (2) outcome type, (3) cancer coding system, (4) cancer types, (5) source of occupation designations, (6) type of incident that firefighters attended, (7) sample characteristics, and (8) study characteristics. Two additional reviewers were responsible for coding the statistical estimates presented in these studies for computing (1) a standardized incidence ratio and (2) a standardized mortality ratio.
Procedure
Two content experts in epidemiology independently rated the 49 observational studies using the RTI item bank and NOQAS. The two independent raters were: (1) a cancer epidemiologist who holds a PhD in epidemiology and has 20 years of experience in cancer research and teaching; and (2) a chronic disease and occupational epidemiologist who holds a PhD in preventive medicine and community health, has over 30 years of teaching and research experience, and is the Principal Investigator of the Florida cancer registry (Florida Cancer Data System).
The two study quality scales were first tested by the independent reviewers on a sample of studies (i.e., random sample of 5–7 studies) to ensure that consistent assumptions and criteria were employed by raters. Slight modifications were then made to the original quality assessments to better align with the methods of the studies evaluated, and some items were removed that were not relevant. The items evaluated along with their modifications (modifications are italicized) and specific instructions (13 RTI and 8 NOQAS items) are displayed in Table 2.
Model specification
The FACETS computer program [48, 49] for Rasch analysis was used to examine the quality of the two study quality assessments using a Many-facet Rating Scale Model (MFRM) [48]. The MFRM is expressed as:

$$\ln\left(\frac{P_{jnik}}{P_{jni(k-1)}}\right) = \theta_j - \lambda_n - \delta_i - \tau_k \qquad (2)$$

where

P_{jnik} = probability of study j receiving a rating of k from rater n on item i;

P_{jni(k−1)} = probability of study j receiving a rating of k−1 from rater n on item i;

θ_j = quality measure of study j;

λ_n = severity of rater n;

δ_i = difficulty of endorsing item i;

τ_k = difficulty of endorsing category k relative to category k−1.
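The following is a minimal sketch of how Eq 2 yields category probabilities for a given study, rater, and item. All parameter values here are hypothetical; in practice FACETS estimates them from the observed ratings.

```python
import numpy as np

def mfrm_category_probs(theta_j, lambda_n, delta_i, taus):
    """Category probabilities implied by the many-facet rating scale model (Eq 2).

    theta_j: quality measure of study j (logits)
    lambda_n: severity of rater n
    delta_i: difficulty of endorsing item i
    taus: thresholds tau_1..tau_m for categories 1..m (tau_0 is fixed at 0)
    """
    taus = np.concatenate(([0.0], np.asarray(taus, dtype=float)))
    steps = theta_j - lambda_n - delta_i - taus  # one "step" term per category 0..m
    log_num = np.cumsum(steps)                   # log numerator for each category
    probs = np.exp(log_num - log_num.max())      # subtract max for numerical stability
    return probs / probs.sum()

# Hypothetical values: an above-average study, a slightly lenient rater, a
# moderately difficult item, and a three-category (0/1/2) rating scale
print(mfrm_category_probs(theta_j=1.0, lambda_n=-0.2, delta_i=0.5, taus=[-0.8, 0.8]))
```

The adjacent-category log-odds implied by this function, log(P_k / P_{k−1}) = θ_j − λ_n − δ_i − τ_k, reproduce Eq 2 exactly.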
Analyses
The ratio between P_{jnik} and P_{jni(k−1)} specified in Eq 2 is the odds, so the log-odds (logits) are a linear combination of latent measures for the different facets. Since all the measures are on a common scale with logits as the units, the MFRM creates measures on an additive interval scale. Higher logit values reflect higher quality for studies and, for items, greater difficulty of endorsement. These values were presented using a Wright map to give an empirical display of study quality scores and item difficulties.
In addition to logit values, we computed the reliability of separation indices for the item, study, and rater facets, along with Infit and Outfit MnSq statistics. The reliability of separation index shows how reproducible the scale would be with a different but equivalent study sample. Infit and Outfit MnSq statistics demonstrate how well each item and study fit the latent scale. Lastly, a chi-square test is performed to examine whether all studies can be viewed as sharing the same quality measure; a significant result indicates that the studies are distinct from each other.
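One common formulation of this “fixed (all equal)” chi-square test is a precision-weighted homogeneity statistic, sketched here under the assumption that each study’s Rasch measure and standard error are available:

```python
import numpy as np
from scipy import stats

def fixed_all_equal_chisquare(measures, std_errors):
    """Test whether all elements of a facet (e.g., the 49 studies) share one measure.

    measures: Rasch measures (logits) for each study.
    std_errors: standard error of each measure.
    Returns the chi-square statistic, degrees of freedom, and p-value.
    """
    measures = np.asarray(measures, dtype=float)
    weights = 1.0 / np.asarray(std_errors, dtype=float) ** 2  # precision weights
    pooled = np.sum(weights * measures) / np.sum(weights)     # weighted common measure
    chi2 = np.sum(weights * (measures - pooled) ** 2)
    df = len(measures) - 1
    return chi2, df, stats.chi2.sf(chi2, df)
```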
Results
Descriptive statistics of study quality scale scores for firefighters’ cancer literature
Table 3 displays the summary statistics of observed and Rasch scores for each of the two study quality measures: (1) the RTI scale and (2) NOQAS. The RTI scale items had an observed mean of 1.51 (SD = .66) and a Rasch score mean of 0.00 (SD = 3.85), while the items in NOQAS had an observed mean of 1.84 (SD = .37) and a Rasch score mean of 0.00 (SD = 2.29). In addition, the overall quality scores of the 49 studies had an observed mean of 1.48 (SD = 0.24) and a Rasch mean of 1.32 (SD = 2.24) when measured by the RTI scale, but an observed mean of 1.84 (SD = 0.17) and a Rasch mean of 0.21 (SD = 0.78) when measured by the NOQAS. These results indicate that, on average, the RTI scale item bank produced much higher latent (Rasch) quality scores across the 49 studies than the NOQAS.
Figs 1 and 2 display Wright maps, which are empirical displays of the RTI scale (Fig 1) and NOQAS (Fig 2), respectively.
In each figure, the first column shows the Rasch score on a logit scale. The items (column 4), raters (column 3), and individual studies (column 2) are located on the Wright map according to their Rasch scores. The last column displays threshold estimates of the response categories on the Likert scale. As shown in Fig 1, the latent Rasch scores of study quality measured by the RTI scale were skewed to the left, indicating that most studies appeared to be low in study quality, while the Rasch scores measured by NOQAS followed a normal distribution (see Fig 2). Of the 13 items in the RTI scale item bank, 9 were above the mean of 0 (column 4 in Fig 1), indicating that most were quite difficult to endorse. On the other hand, items on NOQAS were distributed relatively evenly in terms of item difficulty (column 4 in Fig 2), except item #5 (very easy; located at 5 standard deviations below the mean). Lastly, the two raters (column 3) were quite consistent in evaluating the quality of the primary studies using both the NOQAS and the RTI scale item bank.
Psychometric evidence for the RTI items for firefighters’ cancer literature
Dimensionality.
Results from the Many-facet Rating Scale Model (MFRM) indicated that one underlying factor explained 79.82% of the variance in the 13 items. Because the explained variance exceeds the 20% criterion [49], this result suggests that the RTI scale is unidimensional for measuring the quality of individual studies.
Reliability.
The reliability of separation for the RTI scale items was .99 (near 1.0), implying that the distribution of item measures represents the entire range of the latent scale well. A reliability value higher than .80 suggests that the RTI scale item bank shows acceptable reproducibility and consistency in the ordering of the Rasch scale scores.
Validity.
As shown in Table 4, the Infit and Outfit MnSq values for RTI scale items 6, 9, 10, and 12 fell into fit category A, indicating a good fit of each item to the study quality scale. Although items 2 and 11 fell into fit category A based on the Infit MnSq, item 11 showed a high Outfit MnSq. Items 1, 3, 5, 7, 8, and 11 were less productive based on either the Outfit or the Infit MnSq (see Table 4).
The quality measures of the 49 studies were significantly different, χ2 (46) = 177.7, p < .01. As shown in Table 5, study 29 had the highest study quality score, while study 8 had the lowest. Most studies fit the scale well with a fit category of A or B, with six studies falling into category C or D.
Psychometric evidence for the NOQAS items for firefighters’ cancer literature
Dimensionality.
The result from the MFRM indicated that one underlying factor exists. This factor explained 41.41% of the variance in the 8 items, which exceeds the 20% criterion [49] and suggests that the NOQAS is unidimensional for measuring the quality of individual studies.
Reliability.
The reliability of separation for items was .99, implying that the NOQAS shows acceptable reproducibility and consistency in the ordering of the Rasch scale scores.
Validity.
As shown in Table 6, the Infit and Outfit MnSq values for most items fell into fit category A, indicating a good fit of each item to the study quality scale. Exceptions were item 1 (B for Outfit MnSq) and item 7 (C for both Infit and Outfit MnSq). In particular, item 7 had high Infit and Outfit MnSq values, showing that this item is unproductive and distorting. Content specialists should be consulted regarding future use of item 7 in this scale.
Results indicated that the study quality measures of these 49 studies were significantly different, χ2 (46) = 71.9, p < .05. As shown in Table 7, study 12 had the highest study quality score, while study 40 had the lowest. Most studies fit the scale well with a fit category of A or B, with eight studies falling into category C or D.
Comparison between RTI and NOQAS for firefighters’ cancer literature
The reliability of item separation indices for the RTI scale and NOQAS were both found to be high (approximating 1), indicating that both scales are reproducible and consistent in the ordering of Rasch scale scores. In terms of rater agreement, the two coders rated study quality equally consistently with NOQAS and the RTI scale, as shown in Tables 8 and 9. The NOQAS produced much lower Rasch latent study quality scores with less variation, and its items were much easier to rate than those of the RTI scale. The NOQAS also had more items that showed good fit between the items and the overall quality scores. One reason could be that the NOQAS was adapted to assess the quality of the firefighter cancer literature, which may have enabled ratings to be more closely aligned and less varied, resulting in better fit between items and the quality scores. Additionally, study quality scores measured by the NOQAS were found to follow a normal distribution. Both measures were found to be unidimensional.
Discussion
Using the firefighter cancer literature, the current study is the first attempt to examine the psychometric properties of two commonly used study quality assessment measures using Rasch measurement theory. Among its many strengths, the Rasch model can be used to (a) produce invariant study quality measures on a latent continuum, (b) assess the validity, reliability, and fairness of latent measures, and (c) use latent scores to explain variation in outcome measures. These characteristics of Rasch measurement theory offer practical applications in meta-analysis: study quality scores estimated by the Rasch measurement model can be directly compared across different studies and further modeled to explain variation in study effects.
Our study found that the RTI scale and NOQAS were reproducible and consistent in evaluating the quality of the firefighter cancer literature, showing high item reliability. In terms of interrater reliability, the two raters were quite consistent in their assessments of study quality when using both the RTI and NOQAS scales. In terms of validity, we found that the NOQAS has more items that show good fit to the underlying construct of study quality than the RTI scale. This result indicates that NOQAS demonstrates better validity of internal structure for measuring the quality of the firefighter cancer literature. Lastly, latent scores measured using NOQAS were distributed across the full range of the latent scale, with much lower study quality scores and smaller variation. These results suggest that it is much easier to rate the quality of the firefighter cancer literature with NOQAS items. Our findings accord with a previous study conducted by Margulis and her colleagues [40], which concluded that the RTI item bank was harder to apply and thus produced more heterogeneous quality scores than NOQAS.
The present study is significant in at least two major respects. First, it is the first of its kind to assess the psychometric properties—reliability and validity—of the two quality assessment tools most commonly used in observational studies. Previous studies focused on the interrater reliability of the NOQAS and RTI scales, leaving the item reliability and validity of NOQAS and RTI unexamined. The current study provides the psychometric properties—reliability and validity—of NOQAS and RTI for future use beyond interrater reliability. Second, and more importantly, we used Rasch Measurement Theory (RMT), which produces comparable quality scores for the studies included in a meta-analysis, further enhancing its generalizability and applicability. This is because Rasch scores allow us to utilize parametric statistical analyses, which generally assume a normal distribution. When utilizing the Rasch scores of NOQAS and RTI in a meta-analysis of firefighters’ cancer incidence and mortality, we found that NOQAS scores significantly predict variation in the effect sizes. Specifically, results from a mixed-effects model indicate a significant and positive relationship between quality score and firefighters’ cancer incidence and mortality. Lastly, the item parameters estimated by RMT are generally invariant to the population, which offers greater generalization of meta-analytic results.
In this study, we did not address one important psychometric property: whether NOQAS and RTI show fairness in their assessments. If NOQAS and RTI are equally applicable to any study, NOQAS and RTI scores should be invariant regardless of study characteristics such as sampling method, funding sources, inclusiveness of samples, and whether a study used a good-quality instrument. Despite this limitation, the current study adds to our understanding of the psychometric properties of NOQAS and RTI for future meta-analyses of observational studies similar to the firefighter cancer literature.
References
- 1. Barley Z. Assessing the influence of poor studies on meta-analytic results. Paper presented at the Annual Meeting of the American Educational Research Association; Chicago, IL; 1991.
- 2. Cao H. A random effect model with quality score for meta-analysis [Master’s thesis]. University of Toronto; 2001.
- 3. Jüni P, Altman DG, Egger M. Systematic reviews in health care: Assessing the quality of controlled clinical trials. BMJ. 2001;323(7303):42–6. pmid:11440947
- 4. Ahn S, Becker BJ. Incorporating Quality Scores in Meta-Analysis. Journal of Educational and Behavioral Statistics. 2011;36(5):555–85.
- 5. Jüni P, Witschi A, Bloch R, Egger M. The hazards of scoring the quality of clinical trials for meta-analysis. JAMA. 1999;282(11):1054–60. pmid:10493204
- 6. Bérard A, Bravo G. Combining studies using effect sizes and quality scores: application to bone loss in postmenopausal women. J Clin Epidemiol. 1998;51(10):801–7. pmid:9762872
- 7. Colditz GA, Miller JN, Mosteller F. How study design affects outcomes in comparisons of therapy. I: Medical. Stat Med. 1989;8(4):441–54. pmid:2727468
- 8. Chalmers TC, Smith H Jr., Blackburn B, Silverman B, Schroeder B, Reitman D, et al. A method for assessing the quality of a randomized control trial. Control Clin Trials. 1981;2(1):31–49. pmid:7261638
- 9. Detsky AS, Naylor CD, O’Rourke K, McGeer AJ, L’Abbé KA. Incorporating variations in the quality of individual randomized trials into meta-analysis. J Clin Epidemiol. 1992;45(3):255–65. pmid:1569422
- 10. Clarke M, Oxman A. Assessment of study quality. In: Cochrane Reviewers’ Handbook 4.1.5 [Internet]. Oxford: The Cochrane Collaboration; Update Software; 2002.
- 11. Greenland S, O’Rourke K. On the bias produced by quality scores in meta-analysis, and a hierarchical view of proposed solutions. Biostatistics. 2001;2(4):463–71. pmid:12933636
- 12. Linde K, Scholz M, Ramirez G, Clausius N, Melchart D, Jonas WB. Impact of study quality on outcome in placebo-controlled trials of homeopathy. J Clin Epidemiol. 1999;52(7):631–6. pmid:10391656
- 13. Valentine JC, Cooper H. Can we measure the quality of causal research in education? In: Phye GD, Robinson DH, Levin J, editors. Experimental methods for educational interventions: Prospects, pitfalls and perspectives. San Diego: Elsevier Press; 2005. p. 85–112.
- 14. Moher D, Cook DJ, Jadad AR, Tugwell P, Moher M, Jones A, et al. Assessing the quality of reports of randomised trials: implications for the conduct of meta-analyses. Health Technol Assess. 1999;3(12):i-iv, 1–98. pmid:10374081
- 15. Viswanathan M, Berkman ND, Dryden DM, Hartling L. Assessing Risk of Bias and Confounding in Observational Studies of Interventions or Exposures: Further Development of the RTI Item Bank [Internet]. In: AHRQ Methods for Effective Health Care. Rockville (MD): Agency for Healthcare Research and Quality (US); 2013.
- 16. Wells GA, Shea B, O’Connell D, Robertson J, Petersen JA, Peterson VW, et al., editors. The Newcastle-Ottawa Scale (NOS) for Assessing the Quality of Nonrandomised Studies in Meta-Analyses. Ottawa: Ottawa Hospital Research Institute; 2014.
- 17. Rasch G. Probabilistic models for some intelligence and attainment tests. Studies in mathematical psychology I. Copenhagen: Nielsen & Lydiche; 1960.
- 18. Cooper H, Hedges LV, Valentine JC. The Handbook of Research Synthesis and Meta-Analysis. 2nd ed. New York, NY: Russell Sage Foundation; 2009. p. 97–109.
- 19. Campbell DT. Factors relevant to the validity of experiments in social settings. Psychol Bull. 1957;54(4):297–312. pmid:13465924
- 20. Devine EC, Cook TD. A meta-analytic analysis of effects of psychoeducational interventions on length of postsurgical hospital stay. Nurs Res. 1983;32(5):267–74. pmid:6554615
- 21. Bérard A, Andreu N, Tétrault J, Niyonsenga T, Myhal D. Reliability of Chalmers’ scale to assess quality in meta-analyses on pharmacological treatments for osteoporosis. Ann Epidemiol. 2000;10(8):498–503. pmid:11118928
- 22. Griffiths AM, Ohlsson A, Sherman PM, Sutherland LR. Meta-analysis of enteral nutrition as a primary treatment of active Crohn’s disease. Gastroenterology. 1995;108(4):1056–67. pmid:7698572
- 23. Treadwell JR, Tregear SJ, Reston JT, Turkelson CM. A system for rating the stability and strength of medical evidence. BMC Medical Research Methodology. 2006;6(1):52. pmid:17052350
- 24. Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJ, Gavaghan DJ, et al. Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials. 1996;17(1):1–12. pmid:8721797
- 25. Blond K, Brinkløv CF, Ried-Larsen M, Crippa A, Grøntved A. Association of high amounts of physical activity with mortality risk: a systematic review and meta-analysis. Br J Sports Med. 2020;54(20):1195–201. pmid:31406017
- 26. Michels N, van Aart C, Morisse J, Mullee A, Huybrechts I. Chronic inflammation towards cancer incidence: A systematic review and meta-analysis of epidemiological studies. Crit Rev Oncol Hematol. 2021;157:103177. pmid:33264718
- 27. Patterson R, McNamara E, Tainio M, de Sá TH, Smith AD, Sharp SJ, et al. Sedentary behaviour and risk of all-cause, cardiovascular and cancer mortality, and incident type 2 diabetes: a systematic review and dose response meta-analysis. Eur J Epidemiol. 2018;33(9):811–29. pmid:29589226
- 28. Ji LW, Jing CX, Zhuang SL, Pan WC, Hu XP. Effect of age at first use of oral contraceptives on breast cancer risk: An updated meta-analysis. Medicine (Baltimore). 2019;98(36):e15719. pmid:31490359
- 29. Yan S, Gan Y, Song X, Chen Y, Liao N, Chen S, et al. Association between refrigerator use and the risk of gastric cancer: A systematic review and meta-analysis of observational studies. PLOS ONE. 2018;13(8):e0203120. pmid:30161245
- 30. Qin L, Deng HY, Chen SJ, Wei W. Relationship between cigarette smoking and risk of chronic myeloid leukaemia: a meta-analysis of epidemiological studies. Hematology. 2017;22(4):193–200. pmid:27806681
- 31. Jalilian H, Ziaei M, Weiderpass E, Rueegg CS, Khosravi Y, Kjaerheim K. Cancer incidence and mortality among firefighters. Int J Cancer. 2019;145(10):2639–46. pmid:30737784
- 32. Lange L, Peikert ML, Bleich C, Schulz H. The extent to which cancer patients trust in cancer-related online information: a systematic review. PeerJ. 2019;7:e7634. pmid:31592341
- 33. Al-Saleh MA, Armijo-Olivo S, Thie N, Seikaly H, Boulanger P, Wolfaardt J, et al. Morphologic and functional changes in the temporomandibular joint and stomatognathic system after transmandibular surgery in oral and oropharyngeal cancers: systematic review. J Otolaryngol Head Neck Surg. 2012;41(5):345–60. pmid:23092837
- 34. Viswanathan M, Berkman ND. Development of the RTI item bank on risk of bias and precision of observational studies. J Clin Epidemiol. 2012;65(2):163–78. pmid:21959223
- 35. Viswanathan M, Berkman ND. Development of the RTI Item Bank on Risk of Bias and Precision of Observational Studies. Methods Research Report (Prepared by the RTI International–University of North Carolina Evidence-based Practice Center under Contract No. 290-2007-0056-I). January 6, 2022. 77 p.
- 36. Varas-Lorenzo C, Margulis AV, Pladevall M, Riera-Guardia N, Calingaert B, Hazell L, et al. The risk of heart failure associated with the use of noninsulin blood glucose-lowering drugs: systematic review and meta-analysis of published observational studies. BMC Cardiovasc Disord. 2014;14:129. pmid:25260374
- 37. O’Dwyer T, O’Shea F, Wilson F. Physical activity in spondyloarthritis: a systematic review. Rheumatol Int. 2015;35(3):393–404. pmid:25300728
- 38. Bijle MNA, Yiu CKY, Ekambaram M. Can oral ADS activity or arginine levels be a caries risk indicator? A systematic review and meta-analysis. Clin Oral Investig. 2018;22(2):583–96. pmid:29305690
- 39. Senra H, Barbosa F, Ferreira P, Vieira CR, Perrin PB, Rogers H, et al. Psychologic adjustment to irreversible vision loss in adults: a systematic review. Ophthalmology. 2015;122(4):851–61. pmid:25573719
- 40. Slattery F, Johnston K, Paquet C, Bennett H, Crockett A. The long-term rate of change in lung function in urban professional firefighters: a systematic review. BMC Pulm Med. 2018;18(1):149. pmid:30189854
- 41. Margulis AV, Pladevall M, Riera-Guardia N, Varas-Lorenzo C, Hazell L, Berkman ND, et al. Quality assessment of observational studies in a drug-safety systematic review, comparison of two tools: the Newcastle-Ottawa Scale and the RTI item bank. Clin Epidemiol. 2014;6:359–68. pmid:25336990
- 42. Reinold J, Schäfer W, Christianson L, Barone-Adesi F, Riedel O, Pisa FE. Anticholinergic burden and fractures: a protocol for a methodological systematic review and meta-analysis. BMJ Open. 2019;9(8):e030205. pmid:31439607
- 43. Zeng X, Zhang Y, Kwong JS, Zhang C, Li S, Sun F, et al. The methodological quality assessment tools for preclinical and clinical studies, systematic review and meta-analysis, and clinical practice guideline: a systematic review. J Evid Based Med. 2015;8(1):2–10. pmid:25594108
- 44. Bae JM. A suggestion for quality assessment in systematic reviews of observational studies in nutritional epidemiology. Epidemiol Health. 2016;38:e2016014. pmid:27156344
- 45. Andrich D. An index of person separation in latent trait theory, the traditional KR-20 index, and the Guttman scale response pattern. Education Research and Perspectives. 1982;9(1):95–104.
- 46. Kleppang AL, Steigen AM, Finbråten HS. Using Rasch measurement theory to assess the psychometric properties of a depressive symptoms scale in Norwegian adolescents. Health and Quality of Life Outcomes. 2020;18(1):127. pmid:32381093
- 47. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for educational and psychological testing. Washington, DC: American Educational Research Association; 2014.
- 48. Linacre JM. Inter-rater reliability. Rasch Measurement Transactions. 1991;5(3):166.
- 49. Reckase MD. Unifactor Latent Trait Models Applied to Multifactor Tests: Results and Implications. Journal of Educational Statistics. 1979;4(3):207–30.