Diagnostic accuracy research in glaucoma is still incompletely reported: An application of Standards for Reporting of Diagnostic Accuracy Studies (STARD) 2015

Background: Research has shown a modest adherence of diagnostic test accuracy (DTA) studies in glaucoma to the Standards for Reporting of Diagnostic Accuracy Studies (STARD). We applied the updated 30-item STARD 2015 checklist to the set of studies included in a Cochrane DTA systematic review of imaging tools for diagnosing manifest glaucoma.

Methods: Three pairs of reviewers, including one senior reviewer who assessed all studies, independently checked the adherence of each study to STARD 2015. Adherence was analyzed on an individual-item basis. Logistic regression was used to evaluate the effect of publication year and journal impact factor on adherence.

Results: We included 106 DTA studies, published between 2003 and 2014 in journals with a median impact factor of 2.6. Overall adherence was 54.1% for 3,286 individual ratings across 31 items, with a mean of 16.8 (SD: 3.1; range 8–23) items reported per study. Large variability in adherence to reporting standards was detected across individual STARD 2015 items, ranging from 0 to 100%. Nine items (1: identification as a diagnostic accuracy study in title/abstract; 6: eligibility criteria; 10: index test (a) and reference standard (b) definition; 12: cut-off definitions for index test (a) and reference standard (b); 14: estimation of diagnostic accuracy measures; 21a: severity spectrum of the diseased; 23: cross-tabulation of the index test and reference standard results) were adequately reported in more than 90% of the studies. Conversely, 10 items (3: scientific and clinical background of the index test; 11: rationale for the reference standard; 13b: blinding of assessors of the reference standard to index test results; 17: analyses of variability; 18: sample size calculation; 19: study flow diagram; 20: baseline characteristics of participants; 28: registration number and registry; 29: availability of study protocol; 30: sources of funding) were adequately reported in less than 30% of the studies. Only four items showed a statistically significant improvement over time: handling of missing data (16), baseline characteristics of participants (20), estimates of diagnostic accuracy (24) and sources of funding (30).

Conclusions: Adherence to STARD 2015 among DTA studies in glaucoma research is incomplete, and only modestly increasing over time.



Introduction
Researchers, journal editors and publishers acknowledge the need for adequate reporting of biomedical research as a means of improving the transparency and usability of journal articles [1,2]. For this purpose, a growing set of tools has been made available to guide authors during article preparation, which have been collected by the EQUATOR Network (http://www.equator-network.org/reporting-guidelines).
The Standards for Reporting of Diagnostic Accuracy Studies (STARD) tool was released in 2003 to guide the reporting of diagnostic test accuracy (DTA) studies [3]. DTA studies are essential to investigate the performance of a new test in detecting a target disease, and can ultimately guide clinicians in the use of diagnostic tests in clinical practice [4]. An updated version of STARD has recently been published; STARD 2015 includes 9 new items compared to STARD 2003 and now consists of a list of 30 essential items that should be reported in all reports of a DTA study [5].
In the last two decades, retinal nerve fiber layer (RNFL) and optic nerve head (ONH) imaging devices for detecting glaucoma, such as optical coherence tomography (OCT), Heidelberg retinal tomography (HRT) and scanning laser polarimetry (GDx), were introduced into ophthalmic clinical practice to identify the structural damage occurring early in glaucoma. However, the performance of these tests in clinical decision-making for detecting glaucoma is still debated [6], and since their introduction a large number of studies have been published on their diagnostic ability [7][8][9]. With such a large amount of evidence available, a high quality of reporting is crucial for clinicians to best appreciate the potential for bias and the internal/external validity of such studies [10]. In the case of suboptimal reporting, the available evidence could be misleading, and the potential role of these imaging tests in clinical decision-making could be misunderstood. As a consequence, a biased estimate of the sensitivity/specificity of the imaging tools for detecting glaucoma could generate over-referral of false-positive glaucoma suspects or under-referral of false-negative glaucoma patients [11]. Applications of the original version of STARD to published studies investigating the accuracy of RNFL and ONH imaging in diagnosing glaucoma showed an overall modest compliance, but these assessments were all published in the first few years after STARD's launch in 2003 [12][13][14]. More recent studies have investigated adherence to STARD 2015 in DTA studies published in imaging journals, evaluating imaging tests in different areas of interest and showing an overall moderate and variable compliance [15,16].
The aim of our study was to assess the adherence of a set of studies included in a Cochrane systematic review to STARD 2015. We investigated adherence overall and for each item, whether any improvement occurred over time, and which factors were associated with adherence.
STARD 2015 has been published only recently, and no formal requirement of compliance with this reporting checklist has been enforced. Nonetheless, the methodological knowledge underlying STARD guidance has been gradually made available over recent years [12][13][14]. Moreover, our study is meant to be a 'baseline' evaluation to guide the improvement that should follow the introduction of STARD 2015, which could be valuable for glaucoma specialist associations in monitoring the quality of accuracy research as well as in methodological training programs.

Methods
In this study, we considered all 106 studies included in a Cochrane DTA systematic review published in 2016, which aimed to evaluate the diagnostic accuracy of RNFL and ONH imaging-derived parameters for diagnosing manifest glaucoma; details on the search and selection of studies can be found elsewhere [17].
We used the updated version of STARD to assess the quality of reporting in the included studies [5]. The STARD 2015 checklist comprises 30 items grouped in 6 domains: title and abstract, introduction, methods, results, discussion, and other information. Four STARD items (10, 12, 13 and 21) consist of two sub-items (a and b), one generally referring to the index test and the other to the reference standard.
STARD was developed to be applied to all types of diagnostic medical tests and target diseases, and some items need further specification when applied to a given test or disease. Item 10a, for example, recommends that authors report the "index test, in sufficient detail to allow replication". Which test details are most relevant may obviously vary from test to test. In order to adapt the STARD checklist to the specific tests and target disease in the current review, we first prepared a guidance form and a data extraction form, in which specific criteria were established for scoring each STARD item. The forms were then piloted in a training session based on 5 of the included studies.
After the pilot, we drafted the final form, which did not include item 2 (structured abstract), as specific guidance for reporting abstracts has been published only recently and this should be the subject of a further study [18]. We also excluded item 13a (information available to the performers/readers of the index test) and item 25 (test-related adverse events), as they were not applicable to our index tests. The exclusion of item 13a was motivated by the fact that, although we know from a large body of research that knowledge of the reference standard at the time of interpreting the index test is an important source of bias [10,19], glaucoma imaging test results are always analyzed by standard, built-in software which provides an objective continuous measure of, e.g., RNFL thickness, or classifies the subject according to standard categories. Moreover, item 25 was not considered since the index tests are not invasive. Overall, a total of 31 items were assessed, including several sub-items (Table 1).
Each study was appraised by two independent authors: one author (MM) assessed all 106 studies, while three other authors (AM, VF, GC) each independently assessed one third of the articles. For each study, each item was scored as "yes" (indicating that the item was adequately reported) or "no" (indicating that the item was not adequately reported), as explained in detail in Table 2. We then assessed adherence to STARD at item level: for each item we calculated the percentage of studies scored "yes" as a measure of adherence to STARD. Disagreements were resolved through discussion and, when necessary, a senior author (GV) made the final decision.
We calculated the overall adherence to STARD 2015 as the mean number of items reported per included study. We used logistic regression to evaluate the effect of publication year on overall STARD adherence, as well as to test whether the impact factor (IF) of the publishing journal (in the year the paper was published) affected overall adherence. In the latter analysis, we formed approximate tertiles of the impact factor for the 106 studies at cut-offs of 2 and 3.5, assuming that the different IFs achieved yearly by each of the 25 publishing journals were independent. The effect of publication year on STARD adherence was also evaluated for each individual item.
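As a minimal sketch of the year-trend analysis described above (illustrative only: the data are synthetic, the `fit_logistic` helper is hypothetical, and plain gradient ascent stands in for whatever statistical package the authors actually used), each individual item rating can be modelled as a binary outcome with publication year as the predictor; the odds ratio of adherence per year is then exp(slope):

```python
import math
import random

def fit_logistic(x, y, lr=0.05, epochs=2000):
    """Univariate logistic regression fitted by batch gradient ascent.
    Returns (intercept, slope) on the log-odds scale."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p
            g1 += (yi - p) * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

random.seed(1)
# Synthetic item-level ratings: years centered on the 2003-2014 span,
# baseline adherence ~54%, odds drifting up ~3% per year (true OR ~ 1.03).
years = [random.randint(-5, 6) for _ in range(2000)]
ratings = [1 if random.random() < 1 / (1 + math.exp(-(0.16 + 0.03 * t))) else 0
           for t in years]

b0, b1 = fit_logistic(years, ratings)
or_per_year = math.exp(b1)  # odds ratio of adherence per one-year increase
```

The same fit, applied to the ratings of a single item, yields the per-item odds ratios reported in the Results.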


Results

Adherence to STARD 2015
Overall adherence was 54.1% for 3,286 individual ratings across 31 items, with a mean of 16.8 (SD: 3.1; range 8–23) items reported per study. Table 1 presents the adherence to STARD 2015 for each item with an explanation of the main patterns.
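As a quick arithmetic check (all figures taken from the text above), the overall adherence percentage and the per-study mean are two views of the same grid of ratings:

```python
# The summary figures describe one grid of ratings:
# 106 studies x 31 assessed items, 54.1% of which were scored "yes".
studies, items = 106, 31
total_ratings = studies * items                 # 3,286 individual ratings
yes_ratings = round(0.541 * total_ratings)      # ratings scored "yes"
mean_items_per_study = yes_ratings / studies    # ~16.8 items per study
```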
Overall, a large variability in adherence to reporting standards was detected across STARD 2015 items, ranging from 0 to 100%. Nine items were adequately reported in more than 90% of the studies: identification as a study of diagnostic accuracy in the title or abstract (item 1); eligibility criteria (item 6); index test (item 10a) and reference standard (item 10b) in sufficient detail to allow replication; definitions of test positivity cut-offs for the index test (item 12a) and reference standard (item 12b); methods for estimating measures of diagnostic accuracy (item 14); severity spectrum of the diseased (item 21a); cross-tabulation of the index test and reference standard results (item 23). Specifically, three items were reported in all the included studies (items 1, 10b and 23).
Conversely, 10 items showed adherence to STARD in less than 30% of the studies: scientific and clinical background, including the intended use and clinical role of the index test (item 3); rationale for the reference standard (item 11); whether assessors of the reference standard were blinded (item 13b); analyses of estimate variability (item 17); sample size calculation (item 18); study flow diagram (item 19); baseline characteristics of participants (item 20); registration number and registry (item 28); availability of study protocol (item 29); sources of funding (item 30). Four items showed mixed reporting among the included studies, with adherence close to 50%: study design as prospective or retrospective (item 5); setting, location and dates (item 8); basis for identifying potentially eligible participants (item 7); time interval and intervention between index test and reference standard (item 22).

Trends in and association with adherence
Overall, a modest increase in adherence was found with publication year (OR: 1.03 per year, 95%CI 1.00 to 1.05; p = 0.032). While most trends pointed towards improving adherence over time, a statistically significant improvement in reporting was found for only four items (OR: odds ratio of adherence per one year): how missing data on the index test and reference standard were handled (item 16; OR 1.22, p = 0.003); baseline demographic and clinical characteristics of participants (item 20; OR 1.24, p = 0.010); estimates of diagnostic accuracy and their precision (such as 95% confidence intervals) (item 24; OR 1.22, p = 0.018); and sources of funding and other support, including the role of funders (item 30; OR 1.20, p = 0.037). No item showed a significant decrease in adherence over time.
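To put the overall trend in perspective, an odds ratio of 1.03 per year compounds over the 11 year-steps of the 2003–2014 publication span (a back-of-the-envelope reading, not a figure reported in the analysis above):

```python
# Compounding the overall per-year odds ratio across the 2003-2014 span.
or_per_year = 1.03
year_steps = 2014 - 2003          # 11 one-year increments
cumulative_or = or_per_year ** year_steps
# ~1.38: the odds of an item being adequately reported were roughly
# 38% higher at the end of the study period than at the beginning.
```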
The journals publishing the largest number of studies were Investigative Ophthalmology & Visual Science (n = 19), Ophthalmology and the Journal of Glaucoma (n = 16), and the American Journal of Ophthalmology (n = 12). We found slightly better overall adherence for journals with an IF of 3.5 or more versus less than 2 (OR: 1.22, 95%CI 1.02 to 1.47; p = 0.033).
A mixed-effect linear model showed that most of the variance was found at the item level, while variance at the journal level was more than 100 times smaller, suggesting little effect of a journal on adherence.

Patterns of adherence and non-adherence
All included studies were identified as a diagnostic accuracy study in the title or abstract, mainly by reporting measures of accuracy such as ROC curve, sensitivity or specificity (item 1).
Only 9% of the studies were considered to have reported the scientific and clinical background adequately (item 3); although the authors often reported the imaging test characteristics and its ability to detect damage, the intended use and clinical role of the index test along the diagnostic pathway were lacking in most cases. Study objectives and hypotheses (item 4) were poorly reported (38% of cases): although objectives were almost always reported, the scientific hypothesis was often missing. Items related to study design and participant enrollment were variably reported across included studies. With the exception of eligibility criteria (item 6), which were reported in 97% of the studies, the other items were adequately reported in only about half of the studies: the prospective or retrospective nature of the study (item 5, 55% of the studies), the basis on which potentially eligible participants were identified (item 7, 49% of the studies), and whether participants formed a consecutive, random or convenience series (item 9, 42% of the studies). Setting, location and dates (item 8) were reported in 53% of the studies, with dates most often missing.
Imaging test devices and the reference standard used (items 10a and 10b) were clearly reported in 98% and 100% of the studies, respectively. The definitions of test positivity cut-offs (items 12a and 12b) were properly reported for the index test and reference standard in 92% and 98% of the studies, respectively. In contrast, authors reported the rationale for choosing the reference standard (or the existence of an alternative) in only 20% of the studies, and information about masking of assessors of the reference standard in only 27% of the studies.
The methods for estimating or comparing measures of diagnostic accuracy were reported in almost all studies (99%). How indeterminate results were handled (item 15) was reported in 88% of the studies; the exclusion of low-quality scans was the main method for handling indeterminate results. In contrast, how missing data were dealt with (item 16) was adequately reported in only 58% of cases. Authors rarely described their handling of missing data explicitly; in most cases, the amount of missing data could only be inferred by comparing the number of enrolled patients with those included in the final analysis.
Analyses of variability in diagnostic accuracy were reported in only 27% of cases (in most cases related to the disc size of the ONH or disease severity of participants), and only 6% of the studies reported the intended sample size and how it was determined.
All studies reported a cross-tabulation of the results of the index test with the results of the reference standard, or data to derive this cross-tabulation (item 23), and 99% of the studies reported the distribution of disease severity in participants with the target condition (item 21a). Most studies (84%) reported estimates of diagnostic accuracy and their precision (item 24). When this item was not properly reported (16%), a measure of precision such as a 95% confidence interval was missing.
Baseline demographic and clinical characteristics of participants (item 20) were considered properly reported if at least age, gender, intraocular pressure (IOP) and refractive status were reported, which was the case in 26% of the studies. Age was almost always reported, while gender, refraction and IOP were most often missing. No study presented a flow diagram of participants (item 19).
Study limitations (item 26) were reported in 75% of the studies. The case-control design and low generalizability due to the characteristics of included participants (such as disease severity or ethnicity) were the limitations most often reported. Only 32% of the studies reported implications for practice, including the intended use and clinical role of the index test (item 27), sometimes referring to changes between pre- and post-test probability.
Sources of funding, including the role of funders (item 30), were reported only in 21% of the studies; frequently, authors did report the source of funding but did not describe the funders' role. Registration number and name of registry (item 28) as well as full study protocol details (item 29) were reported in only 2% of the studies.

Discussion
Our review investigated adherence to STARD 2015 in a large set of DTA studies evaluating the diagnostic performance of imaging devices for detecting manifest glaucoma. In general, the completeness of reporting was modest and highly variable across items.
Overall, a mean of 16.8 of the 31 items (range 8 to 23) was adequately reported across the 106 included studies.
Across the 31 items assessed in our review, some items showed almost perfect adherence to STARD 2015, while the reporting of others was very poor. Items with the lowest levels of reporting included the scientific and clinical background (item 3), the basis on which potentially eligible participants were identified (item 7), and the setting, location and dates (item 8). This information is crucial, as the performance of a test is not fixed, but may vary when the test is applied in different settings and among patients with different characteristics [2]. The lack of this information makes it difficult to evaluate the generalizability of the results. Moreover, only one third of the studies discussed the consequences of false-positive and false-negative results in the clinical pathway. This could increase the risk of misunderstanding how the test changes the post-test probability of disease.
Poor reporting was also found regarding the rationale for choosing the reference standard (item 11), masking of assessors of the reference standard (item 13b), and handling of missing data (item 16). The use of different reference standards can introduce heterogeneity in test accuracy, as one reference standard may be more accurate than the other. Review bias can arise when the index test results are known to the assessor of the reference standard. Improper handling of missing data can also be associated with biased results.
The time interval between index test and reference standard (item 22) was reported in half of the studies. Glaucoma is a progressive disease, and functional and structural damage may occur over time rather than concurrently [20]. Different time intervals between structural index tests and functional reference standards may therefore affect the estimated diagnostic accuracy.
Demographic and clinical characteristics of participants (item 20) and alternative diagnoses in those without the target condition (item 21b) were also often inadequately reported. Details on the enrolled population allow readers to judge the potential for selection and spectrum bias and to decide on the applicability of the results to other populations.
One item was never reported in the included studies: participant flow using a diagram. STARD 2015 strongly recommends the use of a flow chart to facilitate the reader's comprehension of study design and the flow of participants along the study process [21].
Other studies have evaluated adherence to STARD 2015 in DTA studies. Hong et al. investigated 142 DTA studies published in imaging journals, and found that the mean number of reported STARD items was 16.6/30 with an overall adherence of 55%, which is similar to our results with the updated tool [15]. Better adherence to STARD was found by Choi et al., who investigated 63 DTA studies published between 2011 and 2015 in a single specialty journal (the Korean Journal of Radiology), with a mean adherence of 20/27 items (74%) [16]. We acknowledge that adherence could vary according to the type of diagnostic test (imaging, biochemistry, histopathology), as well as by specialty. Moreover, the specific guidance adopted by different reviewers to score STARD adherence might introduce differences. Despite these potential sources of variability, the limited number of studies conducted on adherence to STARD 2015 suggests there is room for improvement.
We also found that the overall completeness of reporting slightly improved over the years. Only 4 items (13%) showed a significant improvement over time and, despite this improvement, 3 of them were still reported in less than 60% of cases. Korevaar et al. identified 16 surveys analyzing the reporting of 1,496 DTA studies, and found moderate improvement of reporting in the first years after STARD's introduction, but with substantial heterogeneity among studies [22]. In 2015, Fidalgo et al. investigated the use of STARD 2003 in 58 studies on automated perimetry for glaucoma and recorded suboptimal reporting with no improvement between 1993–2004 and 2004–2013 [23].
We also hypothesized that journal IF could have affected the completeness of reporting. Overall, a higher IF was associated with only slightly better reporting, suggesting that the need for improved reporting involves journals with both low and high IF.
The Cochrane review from which our studies were retrieved [17] assessed the methodological quality of the studies using the QUADAS-2 tool [19]. We found only partial agreement between adherence to STARD 2015 and methodological quality assessed with QUADAS-2; this relationship is the subject of a separate methodological study (accepted).
The general picture emerging from the literature is that the completeness of reporting of imaging studies in different disciplines is only moderate, and DTA studies of imaging tests for detecting glaucoma are in line with these findings. The STARD group members and promoters have encouraged journal editors to prescribe the use of their checklist in submissions. Although this led to some improvement in overall adherence to STARD, many items were still not reported in studies published in journals adopting the STARD checklist [15].
Our review has limitations and strengths. All the included studies were published before STARD 2015 was introduced, so authors could only have used the previous version of STARD, published in 2003. However, we included a very large set of studies (70% of which were published after 2010), and each study was judged by two independent reviewers to improve the reliability of the assessment. Another limitation is that we evaluated only a specific disease entity and index test(s), which may limit generalizability. Moreover, we used a cohort of studies that met inclusion in a Cochrane review which, depending on the inclusion criteria applied, may have biased the included studies towards higher 'quality' or better reporting than studies that would not have met inclusion for the review.
Our study offers an updated focus on the completeness of reporting of DTA studies in ophthalmology, specifically in glaucoma research. It also confirms that the adherence of glaucoma imaging DTA studies to STARD 2015 is modest and that more work and effort are needed to improve the completeness of reporting. Finally, this study has set a basis for future evaluations of how the introduction of STARD 2015 will change the reporting of DTA studies on glaucoma over the next few years.