Representation of Cancer in the Medical Literature - A Bibliometric Analysis

Background There exists a lack of knowledge regarding the quantity and quality of scientific yield in relation to individual cancer types. We aimed to measure the proportion, quality and relevance of oncology-related articles, and to relate this output to their associated disease burden. By incorporating the impact factor (IF) and Eigenfactor™ (EF) into our analysis we also assessed the relationship between these indices and the output under study. Methods All publications in 2007 were retrieved for the 26 most common cancers. The top 20 journals ranked by IF and EF in general medicine and oncology were identified, and the presence of each malignancy within these titles was analysed. Journals publishing most prolifically on each cancer were identified and their impact assessed. Principal Findings Searches generated 63260 (PubMed) and 126845 (Web of Science, WoS) entries. The 26 neoplasms accounted for 25% of total output from the top medical publications. 5 cancers dominated the first quartile of output in the top oncology journals: breast, prostate, lung, and intestinal cancer, and leukaemia. Journals publishing most frequently on these cancers had much higher IFs and EFs than those journals associated with the other cancer types under study, although these measures were not equivalent across all sub-specialties. In addition, yield on each cancer was related to its disease burden as measured by its incidence and prevalence. Conclusions Oncology enjoys disproportionate representation in the more prestigious medical journals. 5 cancers dominate yield, although this attention is justified given their associated disease burden. The commonly used IF and the recently introduced EF do not correlate in the assessment of the preeminent oncology journals, nor at the level of individual malignancies; there is a need to distinguish between proxy measures of quality and the relevance of output when assessing its merit.
These results raise significant questions regarding the best method of assessment of research and scientific output in the field of oncology.


Introduction
Proportional representation in the medical literature of individual cancers and oncology as a whole has been difficult to establish. The rapid increase in medical research publications has been facilitated by the development of the internet, integrated search engines, and on-line publishing. The two principal repositories for medical research publications are the Web of Science (WoS) (Thomson Reuters), and PubMed (the National Library of Medicine (NLM)), the latter recognised as the most frequently used source for information in the medical field [1]. Recently developed internet-based analytical tools now allow for interrogation of these online databases and for the provision of reports that are comparable within and between datasets.
Bibliometrics is a systematic method for evaluating research output which can help map changes in the interest of a scientific community over time [2], and can provide insights into both qualitative and quantitative research trends. The bibliometric indicator most commonly used to undertake qualitative analysis is the journal impact factor (IF), which is based on two elements: the numerator, which is the number of citations in the current year to items published in the previous 2 years, and the denominator, which is the number of substantive articles and reviews published in the same 2 years [3]. Despite its popularity, there are a number of criticisms of the IF, particularly surrounding the ease with which it may be manipulated by journals [4], and the lack of clarity regarding what constitutes a 'citable' output [5]. It has further been argued that the metric lacks normalisation for reference practices across disciplines [6,7], and does not indicate the relevance (the degree to which a journal publishes on a particular topic or sub-specialty) of publications to a particular audience. In contrast, the recently developed Eigenfactor™ (EF), which ranks journals according to the number and weight of incoming citations, attempts to adjust for differences in "citation culture" between journals and across fields, and may provide an enhanced level of discrimination [8].
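As a compact restatement of the IF definition given above, the calculation for the 2007 reference year used in this study can be written as follows. The symbols C and N are notation introduced here for illustration only; they are not taken from the JCR.

```latex
\mathrm{IF}_{2007} \;=\; \frac{C_{2007}(2005) + C_{2007}(2006)}{N_{2005} + N_{2006}}
```

Here C_{2007}(y) denotes the number of citations received during 2007 by items the journal published in year y, and N_{y} denotes the number of substantive articles and reviews the journal published in year y.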
Calculation of the EF is based on the algorithm that Google uses when ranking web pages. This PageRank algorithm [9] regards hyperlinks (links on web pages which connect users to other web pages) as recommendations, with two qualifications: (1) the status of the recommender is important, and (2) the recommendation should drop in weight if the recommender is too generous in giving them. In short, a web page is important (gets a high popularity score) if it is pointed to by other important (highly ranked) pages [10]. Instead of websites, the Eigenfactor algorithm scores journals, and instead of using hyperlinks, it uses citations. By simulating random traffic on a network these algorithms calculate the popularity of journals in a self-consistent fashion. Its developers claim that the resulting score provides an estimate of the percentage of time that users spend with a particular journal, with this amount of time postulated to be a measure of that journal's influence within the overall network of academic citations [11]. The resulting rankings for journals have been published on Eigenfactor.org since 2006 [11], and are now published as part of Thomson Reuters' annual Journal Citation Reports (JCR).
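The random-traffic idea described above can be illustrated with a minimal PageRank-style power iteration over a journal citation matrix. This is a sketch of the general technique only, not the published Eigenfactor algorithm: the function name, the damping value of 0.85, the uniform "random jump" and the three-journal citation counts are all assumptions made for illustration.

```python
def pagerank_scores(citations, damping=0.85, tol=1e-10, max_iter=1000):
    """Return PageRank-style scores for journals.

    citations[i][j] is the number of citations from journal i to journal j.
    The returned scores sum to 1 and can be read as the share of time a
    random reader following citations would spend at each journal.
    """
    n = len(citations)
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Every journal receives a small uniform share (the random jump).
        new = [(1.0 - damping) / n] * n
        for i, row in enumerate(citations):
            out_total = sum(row)
            if out_total == 0:
                # A journal that cites nothing spreads its score evenly.
                for j in range(n):
                    new[j] += damping * scores[i] / n
            else:
                # Otherwise its score is split in proportion to its citations,
                # so a generous citer gives less weight per citation.
                for j, count in enumerate(row):
                    new[j] += damping * scores[i] * count / out_total
        if sum(abs(a - b) for a, b in zip(new, scores)) < tol:
            return new
        scores = new
    return scores


# Hypothetical citation counts between three journals (A, B, C).
example_citations = [
    [0, 10, 0],  # A cites only B
    [5, 0, 5],   # B cites A and C equally
    [0, 10, 0],  # C cites only B
]
example_scores = pagerank_scores(example_citations)
```

In this toy network journal B is cited by both of the others and therefore receives the highest score, while A and C receive equal scores.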
Bibliometric analysis has previously been employed as a method of correlating research productivity in oncology with geographic variation in output and funding [12,13], and the development of translational research [14]. Investigation of output across a range of disciplines within oncology has not been undertaken previously however, nor has an attempt been made to relate this output to proxy measures of quality such as the IF and EF. The principal objective of this study therefore, was to measure the proportion, quality and relevance of articles for the most common cancer types. By incorporating the IF and EF into our analysis we also aimed to assess the relationship between these bibliometric indices and the research output under study.

Materials and Methods
Publications were retrieved by searching for each cancer using its medical subject heading (MeSH) term in PubMed. The subheadings encompassed by each MeSH term were then employed to perform an equivalent search in the WoS database. Numbers were obtained for English-language entries for each of the malignancies under study. All peer-reviewed articles, including editorials, reviews, technical notes and letters to the editors were included.
Both PubMed and the WoS databases were consulted for the reference period 01/01/2007 to 31/12/2007, with all searches conducted between May and August 2009. Search results in the WoS included entries from the "Science Citation Index-Expanded" and the "Social Sciences Citation" indices, yielding 126845 articles. Search results for PubMed, which covers Medline and other specialised databases within the National Library of Medicine (NLM), and catalogues entries from 6000 journals, yielded 63260 articles.
The 26 cancers with the highest incidence as defined by the Surveillance, Epidemiology, and End Results (SEER) database of the National Cancer Institute (NCI) in 2006 were included in the study (Table S1) [15]. The cancer with the 27th highest incidence in this database was that involving bone; this was not included due to the potential confounding influence of publications relating to bone metastases on our analysis.
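Three collections of journals were included in this analysis. Cluster A (Table S2): the top 20 general medical journals ranked by IF and by EF. Cluster B: the top 20 oncology journals ranked by IF and by EF. Cluster C (Table S3): the ten journals which published most prolifically on each of the 26 cancers.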
Clusters A and B were identified in the 2007 edition of Thomson Reuters' JCR, and cluster C was identified using the web-based service PubReMiner [16].
In order to assess scientific yield relative to disease burden, a publication ratio was derived using a method described by Al Shahi et al. in 2001 [17]. Briefly, we divided the number of PubMed papers published in 2007 about each cancer by a measure of its disease burden (incidence or prevalence). Incidence and prevalence data were obtained from the SEER database [15].
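The publication ratio described above is a simple quotient, and the fold differences reported later are ratios of these quotients. The following minimal sketch shows the arithmetic; the function name and the figures for the two placeholder cancers are hypothetical and are not data from this study.

```python
def publication_ratio(publications, burden):
    """Number of papers divided by a disease-burden measure.

    `burden` may be incidence or prevalence; it must be positive.
    """
    if burden <= 0:
        raise ValueError("burden must be positive")
    return publications / burden


# Hypothetical (publications, incidence) pairs, for illustration only.
example_data = {
    "cancer_a": (1000, 4000),
    "cancer_b": (500, 250),
}

example_ratios = {
    name: publication_ratio(papers, incidence)
    for name, (papers, incidence) in example_data.items()
}

# Fold difference between the two ratios, as used when comparing cancers.
example_fold = example_ratios["cancer_b"] / example_ratios["cancer_a"]
```

A cancer with fewer papers can still have the larger ratio if its burden is much smaller, which is the pattern the ratio is designed to expose.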
Each relevant bibliographic record was downloaded and then evaluated using Microsoft Excel spreadsheet software and the Statistical Package for the Social Sciences version 15.0 (SPSS Inc, Chicago, Illinois, USA). The relationship between IF and EF values was investigated using the Spearman correlation coefficient and Kruskal-Wallis between-groups analysis; a p value of less than 0.05 was deemed statistically significant.
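For readers unfamiliar with the Spearman coefficient, the sketch below shows how it can be computed: both variables are converted to ranks (ties share the average rank) and the Pearson correlation of the ranks is taken. This is an illustrative pure-Python version, not the SPSS procedure used in the study; it does not compute a p value, and the IF and EF values shown are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical IF and EF values for five journals, for illustration only.
example_if = [2.1, 5.4, 3.3, 8.0, 1.2]
example_ef = [0.010, 0.050, 0.020, 0.040, 0.005]
example_rho = spearman_rho(example_if, example_ef)
```

Because only ranks are used, the coefficient is insensitive to the very different scales on which IF and EF are expressed.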

Results
Publications on the 26 cancers under study accounted for 8.19% (63260/772243) of output in PubMed, and 8.04% of output in WoS (126845/1576018). Breast cancer accounted for the highest percentage of oncology publications in both PubMed (13.81%) and the WoS (13.83%) (Table 1). Other cancers with high publication rates (present in the top quartile of output for both databases) included lung cancer, leukaemia, intestinal cancer, and prostatic cancer.
Articles relating to the 26 cancers with the highest incidence constituted 53% and 72% of all publications in the top 20 oncology journals according to IF (BIF, 5086/9527) and EF (BEF, 8775/12209), respectively (Table 3). Two thirds of these articles were related to just 7 cancers: breast, prostate, lung, intestinal cancer, leukaemia, ovarian cancer and cancers involving the CNS (BIF, 63.5%, 3230/5086; BEF, 65.5%, 5746/8775).
Figure 1(a) shows the publication:incidence ratio for each cancer. When research output is related to the actual incidence of each cancer, leukaemia and cancers involving the liver and central nervous system (CNS) appear overrepresented. The publication ratio for the latter, for example, was approximately 23-fold greater than that for prostate cancer and almost 10-fold greater than that for breast cancer. This is again highlighted in Figure 1(b), where cancer incidence is plotted against the percentage of the overall research yield contributed by each cancer in 2007.
Output was next assessed relative to associated disease prevalence. Cancers involving the liver and pancreas were markedly overrepresented (Figure 2(a)), with the publication ratio for the former more than 70-fold and 40-fold greater than that for prostate or breast cancer, respectively; similarly, when each cancer's percentage contribution to the total output is plotted against associated disease prevalence, it is evident that liver cancer and cancers involving the CNS are associated with high levels of research yield (Figure 2(b)).
The top 5 oncology journals by IF and by EF were in the top 10 most frequently publishing journals for 11 and 18 of the malignancies under study, respectively (Table S3). The journals which published on the widest range of cancers under study were the Journal of Clinical Oncology (n = 11), the Annals of Surgical Oncology (n = 11), and Clinical Cancer Research (n = 9).
The ten journals which published most prolifically on each cancer (cluster 'C') generated 16219 articles on the 26 cancers of interest in 2007 (Table S3). Over half of these articles were related to just 6 cancer sites: breast, leukaemia, prostate, lung, intestinal cancer, and cancers involving the CNS (n = 8570; 52%) (Table 4). There was no correlation between the total output of these journals and either the IF (p = 0.073) or the EF (p = 0.053). Hodgkin lymphoma, melanoma, lung and prostate cancer were located in the top quartile by both IF and EF, whilst cancers of the gallbladder, vulva, larynx and mouth were located in the lowest quartile by both measures.

Discussion
The relationship between a medical specialty or condition and its associated research output, with respect to volume and 'quality', is complex and may be dependent on such diverse influences as funding, socio-political influence and disease causation [18]. This complexity notwithstanding, the journal IF is increasingly used as a simple proxy measure of research productivity. Despite concerns that it was neither designed nor intended for this purpose, research funding is now frequently dependent upon it. Furthermore, the independence and objectivity of a commercially driven measure has been questioned. Until recently there was no viable alternative to the IF; the development of the EF at least offers stakeholders within the various scientific and medical disciplines an opportunity to assess research yield from a different perspective. We have demonstrated that whilst oncology contributed about 8% of the output of medical journals in 2007, it represented 25% of output from the highest IF and EF journals. This disproportionate representation of oncology topics in more prestigious journals has been suggested before [19]; that these articles are dominated by a small number of cancers is a novel finding, however.
This level of bias to specific cancers is not limited to the general medical journals. Publications relating to prostate, breast, lung, and intestinal cancer, and leukaemia, accounted for over half of the total output in the top medical oncology journals. Analysis of the top 10 most prolific journals for each cancer reveals a similar picture; the journals publishing most frequently on these cancers are associated with much higher IF and EF scores.
It may be a source of concern for those working in less fashionable areas of oncology that a small number of cancers dominate the scientific yield. This domination notwithstanding, our results support the argument that, relative to their impact on society, 4 of these cancers (breast, prostate, lung and intestinal) are actually underrepresented, with research interest in many of the rarer malignancies disproportionately greater. Furthermore, as much as 60% of cancer research is not 'site specific' and hence may hold relevance for all types of cancer [20]. In addition, whilst we focus on the relationship between disease burden and research output, there are many other influences on the level of research interest in a given area, including scientific opportunity; researchability; potential for progress; fundraising (certain tumours might attract more public donations than others); and the quality and size of the research workforce in different areas [20].
Our results have demonstrated that whilst there is significant correlation between the IF and EF in the oncology literature as a whole, differences exist both for the high impact oncology journals and at the level of individual cancers. If one measures article value on IF alone, then the highest-ranking oncology journal would be CA: A Cancer Journal for Clinicians. In contrast, Cancer Research is the highest-ranking oncology journal by EF (with CA: A Cancer Journal for Clinicians not making the top 20 oncology journals). In addition, IF and EF do not correlate when assessing output on individual malignancies; for example, publications relating to testicular and renal cancers are published most frequently in high EF, but lower IF journals. In contrast, breast cancer, which has a high IF for the ten journals which published most frequently on it, is associated with a comparatively low EF score. Given that scientific impact is a multidimensional construct, the difference in IF and EF ranking may be an effect of relevance rather than quality of the article searched.
The top 5 oncology journals by EF in 2007 included at least one of the ten most prolific journals for 18 of the 26 cancers. The top 5 oncology journals by IF, however, published prolifically on just 11 of the 26 commonest cancers. What is the reason for this difference? The developers of the EF score described it as "the result of a random walk through the scientific literature. The algorithm corresponds to a basic model of research in which readers follow chains of citations as they move from journal to journal ... Because of the structure of the citation network, our model researcher will frequently visit large, important journals ... and will seldom visit small journals in the lowest tiers of the publishing hierarchy" [8]. Our data demonstrate that those who choose to employ the EF as their discriminator will identify journals covering a greater breadth of cancer topics than those identified using the IF, and suggest that the EF is not just an index of quality, but also functions as a measure of relevance, at least within oncology as a whole.
The above analysis notwithstanding, researchers need to be aware of the fact that literature regarding certain cancers may be limited to low ranking journals according to standard proxy measures. This may not be a reflection of the quality of the output so much as targeting of specific audiences [6]. This study has demonstrated the breadth of journal titles responsible for output in oncology and has identified which journals are most prolific in which sub-specialties. Even the most wide-ranging journals published on less than half of the 26 cancers included in this analysis; it is thus clear that, for many, a reliance on quality indicators, including the EF, in choosing which journals to search will result in poor retrieval of the most relevant information, and those involved in this process thus need to be cognisant of the situation within their particular sub-specialty.
This work has a number of limitations. Whilst bibliometric indicators can provide an interesting overview of scientific yield for a given subject area, they are nevertheless just proxy measures, and cannot replace the gold standard of reading each article and journal individually to assess their quality or otherwise. We examined English-language entries only; this has implications for those cancers which have a greater disease burden in non-English speaking areas, where research output on those topics might be much greater, and it introduces a level of bias into our results. Furthermore, it should be noted that our study was limited to a single year and, whilst we tried to ensure that our searches were equivalent across the databases, it was not possible to ensure absolute uniformity in the search strategies used; hence the figures should be viewed in the context of the overall picture, rather than in absolute terms. In addition, our work does not concentrate specifically on research articles only, which some would argue to have been preferable, although we believe our search strategy gives a better indication of overall interest within each sub-discipline of oncology. Finally, our analysis of the 132 journals within the category 'Oncology' was based on the list provided by the 2007 JCR from Thomson Reuters; it should be noted that an alternative list could have been used based on the SCImago Journal Rank indicator (SJR) which is calculated within Scopus (Elsevier), but it has been previously demonstrated that those oncology journals indexed in Scopus which are not covered by the JCR tend to have low to very low impact factors [21].
The controversy surrounding the use of bibliometric indicators notwithstanding, this analysis has demonstrated the privileged position which oncology holds in the medical literature. This preferential bias is not extended uniformly across the oncological spectrum and this heterogeneity requires closer scrutiny.
It is clear that the commonly used IF and the recently introduced EF do not correlate for the preeminent oncology journals, nor do they correlate at the level of individual cancers. Researchers should be aware that selection of one measure over the other as a proxy evaluation of quality may significantly change the strength of, for example, a grant proposal.
Finally, our results also suggest that the most relevant information for those working in many of the oncologic subspecialties is not necessarily to be found in the most prestigious journals as delineated by proxy indicators of quality. This article raises significant questions regarding the best method of assessment of research and scientific output in the field of oncology.