Find Duplicates among the PubMed, EMBASE, and Cochrane Library Databases in Systematic Review

Background. Finding duplicates is an important phase of systematic review. However, no consensus regarding the methods to find duplicates has been provided. This study aims to describe a pragmatic strategy of combining auto- and hand-searching duplicates in systematic review and to evaluate the prevalence and characteristics of duplicates.
Methods and Findings. Literatures regarding portal vein thrombosis (PVT) and Budd-Chiari syndrome (BCS) were searched by the PubMed, EMBASE, and Cochrane library databases. Duplicates included one index paper and one or more redundant papers. They were divided into type-I (duplicates among different databases) and type-II (duplicate publications in different journals/issues) duplicates. For type-I duplicates, reference items were further compared between index and redundant papers. Of 10936 papers regarding PVT, 2399 and 1307 were identified as auto- and hand-searched duplicates, respectively. The prevalence of auto- and hand-searched redundant papers was 11.0% (1201/10936) and 6.1% (665/10936), respectively. They included 3431 type-I and 275 type-II duplicates. Of 11403 papers regarding BCS, 3275 and 2064 were identified as auto- and hand-searched duplicates, respectively. The prevalence of auto- and hand-searched redundant papers was 14.4% (1640/11403) and 9.1% (1039/11403), respectively. They included 5053 type-I and 286 type-II duplicates. Most of type-I duplicates were identified by auto-searching method (69.5%, 2385/3431 in PVT literatures; 64.6%, 3263/5053 in BCS literatures). Nearly all type-II duplicates were identified by hand-searching method (94.9%, 261/275 in PVT literatures; 95.8%, 274/286 in BCS literatures). Compared with those identified by auto-searching method, type-I duplicates identified by hand-searching method had a significantly higher prevalence of wrong items (47/2385 versus 498/1046, p<0.0001 in PVT literatures; 30/3263 versus 778/1790, p<0.0001 in BCS literatures). Most of wrong items originated from EMBASE database.
Conclusion. Given the inadequacy of a single strategy of auto-searching method, a combined strategy of auto- and hand-searching methods should be employed to find duplicates in systematic review.


Introduction
A systematic review is an explicitly formulated, reproducible, and up-to-date summary of the effects of health care interventions [1,2]. It provides the top level of evidence for clinical decisions [3,4]. More than 2500 new systematic reviews can be retrieved in PubMed every year [5]. Compared with the traditional narrative review, the most prominent feature of the systematic review is that the literature search is comprehensive and the literature selection is unbiased. Recently, the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) statement has recommended that a four-phase flow diagram be employed for literature search and selection in systematic review [1]. The first phase is to identify all relevant literatures through databases and subsequently to remove the duplicates simultaneously recorded by different databases or published by different journals. Finding duplicates among databases is critical: it lets researchers avoid repeatedly evaluating data from the same study, and it lets readers accurately understand the quantity of scientific publications in the field. In our previous systematic reviews [6,7,8,9], a high prevalence of duplicates was frequently observed among different databases. More importantly, not all duplicates can be readily found, because wrong information is occasionally recorded. However, no consensus has yet been reached regarding the methods to find duplicates or the prevalence of duplicates among different databases.
Herein, we attempted to describe our methods to find duplicates among the PubMed, EMBASE, and Cochrane library databases in systematic review and to evaluate the prevalence and characteristics of duplicates.

Literature search
Literatures in two fields were retrieved to minimize the potential selection bias. They included "portal vein thrombosis" and "Budd-Chiari syndrome" literatures. The selection of the two fields was primarily attributed to our research interests in the two vascular disorders of the liver [6,7,8,9,10,11]. QX searched the PubMed, EMBASE, and Cochrane library databases (from the database inception to November 12, 2012). Our search strategy aimed to maximize the quantity of literatures recorded by these databases. The search items were discussed by all review authors. For the literatures regarding portal vein thrombosis, the search items were: (portal vein thrombosis) OR (portal venous thrombosis) OR (portal vein obstruction) OR (portal venous obstruction). For the literatures regarding Budd-Chiari syndrome, the search items were: (budd chiari) OR (hepatic vein obstruction) OR (hepatic venous obstruction) OR (hepatic vein thrombosis) OR (hepatic venous thrombosis).

Definitions and classifications of duplicates
Duplicates were divided into type I (duplicates among databases) and type II (duplicate publications). Type I duplicates were defined as one paper recorded twice or more in one database, or recorded simultaneously in two or three databases (see examples in Table 1). Type II duplicates were defined as one study published in different journals or issues. According to the type of publication, type II duplicates were classified as Abstract-Abstract, Abstract-Full text, and Full text-Full text. The first two types were often permitted, but the last type was unethical in most cases [12] (see examples in Table 2).
Duplicates consisted of one index paper and one or more redundant papers. For type I duplicates, the index paper was the one with more accurate and/or adequate reference information; for type II duplicates, the index paper was the one published earlier and/or with a larger sample size [13]. According to the number of redundant papers, duplicates were classified as follows: double duplicates if only one redundant paper was found, triple duplicates if two redundant papers were found, quadruple duplicates if three redundant papers were found, and so on. According to the origin of index and redundant papers, duplicates were classified as PubMed-PubMed, PubMed-EMBASE, PubMed-Cochrane, EMBASE-EMBASE, EMBASE-Cochrane, Cochrane-Cochrane, and PubMed-EMBASE-Cochrane.

Auto-search duplicates
QX imported all literatures retrieved by the three databases into an Endnote library (ENDNOTE X3, Thomson Reuters, USA). All literatures were expressed in Vancouver reference type. In the Endnote library, QX used the "Find Duplicates" command on the "References" menu to identify the auto-searched duplicates among the three databases. Prior to this step, "Find Duplicates" preferences could be defined on the "Edit" menu. To maximize the quantity of auto-searched duplicates, our preference was consistent with the Endnote default setting. In this setting, duplicates were identified as references of the same reference type with matching "author", "title", and "publication date" items, but "journal's name", "volume", "issue", and "page" items were not compared. QX further verified the accuracy of auto-searched duplicates.
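The matching rule described above can be illustrated with a minimal sketch. It is not Endnote's implementation: the record fields (`source`, `ref_type`, `author`, `title`, `date`, `journal`), the function names, and the three sample records are hypothetical, and the choice to ignore only case and extra whitespace is an assumption.

```python
from collections import defaultdict

def match_key(record):
    """Build the comparison key from reference type, author, title, and date.

    Journal name, volume, issue, and page are deliberately left out,
    mirroring the default setting described in the text.
    """
    def clean(value):
        # Assumption: ignore case and repeated/surrounding whitespace only.
        return " ".join(value.split()).casefold()
    return (clean(record["ref_type"]), clean(record["author"]),
            clean(record["title"]), clean(record["date"]))

def find_auto_duplicates(records):
    """Group records with identical keys; groups of two or more are duplicates."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return [group for group in groups.values() if len(group) > 1]

# Made-up records for illustration only.
records = [
    {"source": "PubMed", "ref_type": "Journal Article", "author": "Smith J, Lee K",
     "title": "Portal vein thrombosis in cirrhosis", "date": "2010", "journal": "Hepatology"},
    {"source": "EMBASE", "ref_type": "Journal Article", "author": "Smith J, Lee K",
     "title": "Portal vein thrombosis in cirrhosis", "date": "2010", "journal": "Hepatology (Baltimore, Md.)"},
    {"source": "Cochrane", "ref_type": "Journal Article", "author": "Lee K, Smith J",
     "title": "Portal vein thrombosis in cirrhosis", "date": "2010", "journal": "Hepatology"},
]

for group in find_auto_duplicates(records):
    print([record["source"] for record in group])
```

In this sketch the PubMed and EMBASE records are grouped despite their different journal strings, while the Cochrane record with the reversed author order is not, which is the kind of case left for hand-searching.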

Hand-search duplicates
After auto-searched redundant papers were removed, the remaining literatures were alphabetically ordered according to the first authors' names. Then, duplicates were identified among the literatures by the same first author. Notably, if the first author's name was wrongly spelt or missing, or the authors' order was reversed in some database, we would miss some duplicates. Accordingly, to minimize the quantity of missed duplicates, the literatures were also alphabetically ordered according to the titles. Then, duplicates were identified among the literatures with the same titles. YM and JJ were responsible for the literatures regarding portal vein thrombosis, and QX and RW for the literatures regarding Budd-Chiari syndrome. QX and YM were also responsible for rechecking the accuracy of their tasks. Disagreement would be resolved by discussion among the four review authors.
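The two orderings can be sketched as a helper that prepares candidate groups for a human reviewer. This is only an illustration under stated assumptions: the field names (`id`, `first_author`, `title`), the function `candidate_groups`, and the sample records are hypothetical, and the final judgment on each group remains manual, as in the study.

```python
from itertools import groupby

def candidate_groups(records, field):
    """Sort records by one field and return runs of two or more sharing it."""
    def key(record):
        return record[field].strip().casefold()
    ordered = sorted(records, key=key)  # alphabetical ordering, as in the text
    groups = []
    for _, run in groupby(ordered, key=key):
        run = list(run)
        if len(run) > 1:
            groups.append(run)
    return groups

# Made-up records; record 3 carries a misspelt first author.
remaining = [
    {"id": 1, "first_author": "Smith J", "title": "Portal vein thrombosis in cirrhosis"},
    {"id": 2, "first_author": "Smith J", "title": "Portal vein thrombosis in cirrhosis"},
    {"id": 3, "first_author": "Smiht J", "title": "Portal vein thrombosis in cirrhosis"},
]

by_author = candidate_groups(remaining, "first_author")
by_title = candidate_groups(remaining, "title")
print([[r["id"] for r in g] for g in by_author])
print([[r["id"] for r in g] for g in by_title])
```

Ordering by first author groups only records 1 and 2; ordering by title also surfaces record 3, which is why the second pass was added.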

Difference between index and redundant papers of type I duplicates
We compared the difference of reference items between index and redundant papers only for type I duplicates, not for type II duplicates, because nearly all type II duplicates had different journal's name, volume, issue, and page between index and redundant papers. QX and YM extracted the detailed information of type I duplicates (i.e., author, title, journal's name, publication date, volume, issue, and page) into an Excel table (Microsoft Office Excel 2003, Microsoft Corporation, USA). Then, QX and YM compared the reference items between index and redundant papers and classified each difference as acceptable or unacceptable, in order to determine whether the duplicates contained wrong information.
Difference between index and redundant paper(s) would be considered acceptable to readers and reference reviewers, if the information was expressed in different styles. These different styles included: 1) punctuation, space, or case was different; 2) author's middle name was omitted; 3) title of non-English language paper and non-English language journal's name were translated into different words, but their meanings were identical; 4) journal's name was expressed in full or abbreviated style; 5) publication date was expressed in "year" or "year month (day)" style; and 6) volume, issue, or page was expressed in different styles, but their meanings were identical (see examples in Table 1).
Difference between index and redundant paper(s) would be considered unacceptable to readers and reference reviewers, if the information was wrongly expressed. These wrong styles included: 1) author's name and order, title of English language paper, and/or journal's name was wrongly recorded, added, or missing; and 2) publication date, volume, issue, and/or page was wrong or missing (see examples in Table 1). QX further obtained the full texts of the corresponding papers to identify the database from which the wrong information originated. In the cases where some full-text papers could not be obtained, we were uncertain about which database the wrong information originated from.
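Only the first acceptable style (differences in punctuation, space, or case) lends itself to a mechanical check. The sketch below covers that single criterion; the function name `classify_difference` and the labels it returns are hypothetical, and every other criterion (omitted middle names, translated titles, abbreviated journal names, date styles) would still need the manual comparison described above.

```python
import re

def classify_difference(index_value, redundant_value):
    """Compare one reference item between an index and a redundant paper."""
    if index_value == redundant_value:
        return "identical"

    def strip_style(text):
        # Drop punctuation, spaces, and underscores, then ignore case.
        return re.sub(r"[\W_]+", "", text).casefold()

    if strip_style(index_value) == strip_style(redundant_value):
        return "acceptable"
    # Anything else may be an abbreviation, a translation, or a real error.
    return "needs manual review"

print(classify_difference("Budd-Chiari syndrome.", "budd chiari syndrome"))
print(classify_difference("2009;50(1):195-203", "2009;50:195-203"))
```

The first call differs only in punctuation and case, so it is flagged as acceptable; the second has a missing issue number, so it is passed to a reviewer rather than judged automatically.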

Data analysis
The count data and/or percentage were reported in texts or bar charts. The prevalence of duplicates with 95% confidence intervals (CI) was calculated as follows:
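The formula itself is not present in this copy of the text. As a minimal sketch, assuming the prevalence is the number of redundant papers divided by the total number of papers and that the interval is the normal-approximation (Wald) interval, the calculation would look as follows. The Wald interval is an assumption, not a method stated by the authors, although it does reproduce the 11.0% (95%CI: 10.4%-11.6%) reported for auto-searched redundant papers in the portal vein thrombosis literatures. The function name `prevalence_with_ci` is hypothetical.

```python
import math

def prevalence_with_ci(redundant, total, z=1.96):
    """Return the prevalence and its Wald 95% confidence limits."""
    p = redundant / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Auto-searched redundant papers, portal vein thrombosis literatures.
p, lower, upper = prevalence_with_ci(1201, 10936)
print(f"{p:.1%} (95%CI: {lower:.1%}-{upper:.1%})")  # 11.0% (95%CI: 10.4%-11.6%)
```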

The proportion of type I and II duplicates was compared between auto-searching and hand-searching methods. The prevalence of different and wrong items in type I duplicates was compared between auto-searching and hand-searching methods. Two-tailed P values <0.05 were considered statistically significant. The statistical analyses were performed in SPSS 12.0 (SPSS Inc, Chicago, Ill).

Table 1 (table body not recoverable). Surviving descriptions of differences between index and redundant papers:
- The author's middle name was missing in PubMed.
- The journal's name was spelt in full style in EMBASE, but in abbreviated style in PubMed.
- The publication date was expressed in "year" style in EMBASE, but in "year month" style in PubMed.
- The authors' order was wrong in Cochrane library.
- The volume and page were missing in Cochrane library.
Notes:
- All examples originated from the literatures regarding portal vein thrombosis.
- In every example, the same study was simultaneously recorded by two or three databases.
- All literatures were expressed in Vancouver reference type.
- Bold and italics formatting indicated the different styles between index and redundant paper(s).
- In every example, the reference recorded by PubMed database had more complete information.
doi:10.1371/journal.pone.0071838.t001

Portal vein thrombosis literatures
Overall, 10936 papers were identified via the three databases, including 6733 from PubMed database, 4002 from EMBASE database, and 201 from Cochrane library database (Figure 1A).
Auto-searched duplicates. Initially, 2401 papers were identified as auto-searched duplicates. Notably, 2 papers with the same author, title, and publication date were excluded from duplicates, because both of them reported different contexts in different issues. Thus, 2399 papers were auto-searched duplicates, including 1198 index papers and 1201 redundant papers (Table 3). The prevalence of auto-searched redundant papers was 11.0% (95%CI: 10.4%-11.6%).
EMBASE database had the highest proportion of wrong information regarding page, issue, and volume items (Figure 2A).
Hand-searched duplicates. After auto-searched redundant papers were removed, 1307 papers were further identified as hand-searched duplicates, including 642 index papers and 665 redundant papers (Table 3). The prevalence of hand-searched redundant papers was 6.1% (95%CI: 5.6%-6.5%).
EMBASE database had the highest proportion of wrong information regarding author, title, journal, and publication date items. Cochrane library database had the highest proportion of wrong information regarding volume and page items. PubMed database had the highest proportion of wrong information regarding issue item (Figure 2B).
Comparison. The number of duplicates identified by auto-searching methods was larger than that identified by hand-searching methods (2399 versus 1307). Most of type I duplicates were identified by auto-searching methods (69.5%, 2385/3431). The proportion of type I duplicates among the auto-searched duplicates was significantly higher than that among the hand-searched duplicates (

Budd-Chiari syndrome literatures
Overall, 11403 papers were identified via the three databases, including 5894 from PubMed database, 5278 from EMBASE database, and 231 from Cochrane library database (Figure 1B).
EMBASE database had the highest proportion of wrong information regarding page, issue, and volume items (Figure 2C).
EMBASE database had the highest proportion of wrong information regarding author, title, journal's name, and publication date items. Cochrane library database had the highest proportion of wrong information regarding volume, issue, and page items (Figure 2D).
Comparison. The prevalence of duplicates identified by auto-searching methods was significantly higher than that identified by hand-searching methods (3275/11403 versus 2064/11403, p<0.0001). Most of type I duplicates were identified by auto-searching methods (64.6%, 3263/5053). The proportion of type I duplicates among the auto-searched duplicates was significantly higher than that among the hand-searched duplicates (3263/3275 versus 1790/2064, p<0.0001). Nearly all type II duplicates were identified by hand-searching methods (95.8%, 274/286). The proportion of type II duplicates among the auto-searched duplicates was significantly lower than that among the hand-searched duplicates (12/3275 versus 274/2064, p<0.0001).
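The text does not name the test behind these p values. As an illustration only, assuming a Pearson chi-square test on a 2x2 table without continuity correction, the comparison of type I versus type II duplicates between the two methods can be checked with the counts reported above. The function name `chi_square_2x2` is hypothetical, and the study itself used SPSS.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 degree of freedom) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: auto-searched, hand-searched. Columns: type I, type II (Budd-Chiari syndrome).
chi2, p_value = chi_square_2x2(3263, 12, 1790, 274)
print(f"chi-square = {chi2:.1f}, p < 0.0001: {p_value < 0.0001}")
```

Under this assumption the p value falls far below 0.0001, in line with the reported result.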
Compared with those identified by auto-searching methods, type I duplicates identified by hand-searching methods had a significantly higher prevalence of different and wrong items

Discussion
Finding duplicates among different databases is an indispensable and important phase of systematic review. In our previous experience of systematic reviews, this phase was not as easy as we expected [6,7,8,9]. However, little attention has been paid to it. To our knowledge, this study is the first systematic analysis of duplicates among the three databases commonly used in systematic reviews (i.e., PubMed, EMBASE, and Cochrane library database). We attempted to devise a scheme to identify duplicates in a systematic review (Figure 3). In this scheme, we employed two methods to find duplicates (i.e., auto-searching and hand-searching) and two approaches to find hand-searched duplicates (i.e., alphabetical ordering of literatures according to the first authors and titles). Indeed, the process of auto-searching duplicates can be easily accomplished by Endnote library software. By comparison, the process of hand-searching duplicates is time-consuming and painstaking. Four review authors spent more than four weeks on finding hand-searched duplicates, and two of them spent another two weeks checking the accuracy of this work. Certainly, further studies should be designed to assess the practical utility of this method in systematic review.
A major finding of our study was that a large number of duplicates could be found among the three databases in systematic review. Notably, about 10% of literatures remained duplicates among the three databases after auto-searching duplicates, which strongly suggested the necessity of hand-searching duplicates in systematic review.
Figure 3 (the opening of the caption is missing). [...] volumes, issues, and pages. Subsequently, if these articles had the same titles, journals' names, and issues, they would be attributed to the type I duplicates. Notably, the review authors should identify whether the difference between index and redundant papers was acceptable or not. On the other hand, if these had the same or similar titles but different journals or issues, the review authors would further read the abstracts and/or full-texts to judge whether or not they could be attributed to the type II duplicates. Third, the remaining literatures were also alphabetically ordered according to the titles in the Endnote library. If the titles were the same between two or more articles, the review authors would further read the journals' names, volumes, issues, and pages. Subsequently, if these articles had the same journals' names and issues, they would be attributed to type I duplicates. Notably, the review authors should identify whether the difference between index and redundant papers was acceptable or not. On the other hand, if these articles had the same or similar titles but different journals or issues, the review authors would further read the abstracts and/or full-texts to judge whether or not they could be attributed to the type II duplicates. Finally, review authors should check the accuracy. doi:10.1371/journal.pone.0071838.g003
We further compared the difference of reference items between index and redundant papers of type I duplicates. Nearly all type I duplicates had different items between index and redundant papers. Regardless of the literatures regarding portal vein thrombosis or Budd-Chiari syndrome and auto-searched or hand-searched duplicates, "journal's name", "publication date", and "title" were the three most commonly different items. Most of them were acceptable; for example, journal's name was expressed in full or abbreviated style, publication date was expressed in "year" or "year month" style, and titles used different punctuation and/or cases in different databases. This finding could be potentially explained by the fact that each database had its own special reference type. Differences in other items were uncommon, but were mostly unacceptable. For example, author, volume, issue, or page was wrong or missing. These mistakes should be corrected, thereby decreasing the prevalence of type I duplicates.
In addition, our study explored the origin of wrong information in type I duplicates. Regardless of the literatures regarding portal vein thrombosis or Budd-Chiari syndrome, EMBASE database had the highest proportion of wrong information regarding author, title, journal, and publication date items. These mistakes in EMBASE database were severe (see examples in Table 1), because they not only misled the readers but also disrespected the researchers. Cochrane library database had the highest proportion of wrong information regarding volume and page items in type I duplicates. This was primarily due to the reference type of Cochrane library database (volume and page were not provided). By comparison, only a minority of wrong information in type I duplicates originated from PubMed database. These findings suggested the following: 1) the accuracy of reference information recorded by EMBASE database should be substantially improved; and 2) the same reference type among these databases may be beneficial for literature screening.
Auto-searching methods could identify a larger number of duplicates, especially type I duplicates. However, only a very small proportion of type II duplicates could be identified by auto-searching methods (5.1% in portal vein thrombosis literatures; and 4.8% in Budd-Chiari syndrome literatures). This phenomenon could be readily explained by the fact that the authors, titles, and publication years were often different between index and redundant papers among type II duplicates. Additionally, the wrong reference items were rarely observed among type I duplicates identified by auto-searching methods, but very frequently among those identified by hand-searching methods. This finding also suggested the limitation of auto-searching duplicates, in which "author", "title", and "publication date" items should be exactly matched between two literatures. Accordingly, the necessity of combining auto- and hand-searching methods should be fully recognized in finding duplicates in systematic reviews.

Limitations
Several limitations of this study should be clearly recognized. First, the selection of portal vein thrombosis and Budd-Chiari syndrome literatures was subjective. Accordingly, the conclusions drawn from analyzing these literatures might not apply to literatures from other fields. However, we employed a comprehensive search strategy and literatures of two fields to strengthen our conclusions. Given that the results were similar between portal vein thrombosis and Budd-Chiari syndrome literatures, the findings of our study might be generalizable. Certainly, further studies are warranted to compare the frequency of wrong information in a random sample of literatures among the three databases. Second, only three databases were searched in our study, which might underestimate the prevalence of duplicates among databases. However, given that PubMed, EMBASE, and Cochrane library were the three most common databases used for systematic review, our results should be representative. Third, only two approaches were employed in this study to identify hand-searched duplicates. It was not easy to find duplicates when both the first author's name and the title differed between index and redundant papers. Thus, the prevalence of duplicates might be underestimated. Fourth, a minority of full texts could not be obtained to identify the origin of wrong information. However, we tried our best to contact the authors and sought help from our own and other university libraries, and these unavailable full texts did not substantially influence our judgment on the proportion of wrong information in different databases.

Conclusions
In conclusion, a high prevalence of duplicates could be identified among the PubMed, EMBASE, and Cochrane Library databases in systematic review. These findings were primarily attributed to the effect of a pragmatic strategy of combining auto- and hand-searching methods to find duplicates. Indeed, a single strategy of auto-searching method was inadequate to find duplicates, especially type II duplicates. In general, to enhance the transparency of systematic review, PRISMA might require the reporting of the detailed information regarding the methods to find duplicates and the quantity of duplicates identified by different methods. In addition, considering that wrong reference items were frequently observed in type I duplicates identified by hand-searching methods, we strongly recommend that the information of every reference be strictly examined and carefully inputted by database administrators.