Unveiling scientific articles from paper mills with provenance analysis

doi:10.1371/journal.pone.0312666

Fig 1.

Retracted papers due to image problems from 2010 to 2024.

Paper mill production has drastically increased from 2020 to 2024. The category “Others” covers ambiguous image-related retraction reasons that did not fit any other category.

Table 1.

Retractions due to problematic images by scientific area from 2010 to 2024.

Fig 2.

Filtering & evidence collection workflow.

A suspect collection of documents undergoes parsing and figure extraction, resulting in a set of figures. These figures are then processed by a filtering stage that identifies and extracts the panels of interest. Later, a machine learning model [23] creates a robust evidence representation that can withstand transformations commonly applied to panels, such as resizing, compression, and color changes. Finally, each panel representation is stored in a database for further analysis. The figure used to depict this workflow is available under the Creative Commons license at https://doi.org/10.1371/journal.pone.0152712.g002.

Fig 3.

Compound figure annotation for panel extraction.

Each colored rectangle corresponds to a panel that should be extracted as part of the panel extraction task. The categories of each rectangle are indicated by their color. For instance, the green panels are annotated as microscopy imaging. To generate this example, we used the figure distributed under a Creative Commons license found at https://doi.org/10.1371/journal.pone.0152712.g002.

Fig 4.

Provenance analysis at image level.

Provenance analysis is performed for each panel collected during the Filtering & Evidence Collection step. Let P be one of these panels. First, the method performs content retrieval by comparing P’s description with those of the other panels in the database (step 1). The top-K panels most similar to P are placed in a processing queue. Then, the next panel R_I from the queue is compared to P to determine whether they have consistent content (step 2). If R_I matches P, the method calculates the content-sharing score of P and R_I (step 3), which measures the area shared between the two panels. Using this score, the method updates a content-shared table (step 4). If the content-sharing score is above a threshold (1%), the processing queue is extended with the L (L ≤ K) panels most similar to R_I. After processing all collected panels, the method constructs the provenance graph. It uses the scores in the content-shared table to identify the relationship between each pair of images in the collection. Then, it isolates the panels that relate to one another by finding their connected components (step 5). Finally, to visualize these components more clearly, the method generates provenance graphs by computing the maximum spanning tree of each connected component (step 6). This process results in a tree-like structure that shows the relationships between the panels within each connected component. The images in this figure were retrieved from a public domain source, from which we created multiple versions of the same image for illustration’s sake (https://pixnio.com/science/microscopy-images/tularemia-francisella-tularensis/photomicrograph-of-francisella-tularensis-bacteria-using-a-methylene-blue-stain).
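
Steps 1 to 4 of this caption can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `retrieve` and `share_score` are hypothetical callables standing in for the retrieval and matching components, the default values of `k` and `l` are arbitrary, and folding steps 2 and 3 into a single function that returns 0.0 for non-matching panels is an assumption made for brevity. Only the 1% threshold comes from the caption.

```python
from collections import deque


def build_content_shared_table(panels, retrieve, share_score,
                               k=10, l=5, threshold=0.01):
    """Sketch of steps 1-4: retrieval, matching, scoring, table update.

    retrieve(panel, n)  -> up to n panels most similar to `panel` (hypothetical).
    share_score(p, r)   -> fraction of area shared by p and r, 0.0 if the
                           panels do not match (hypothetical; merges steps 2-3).
    """
    table = {}  # unordered panel pair -> content-sharing score
    for p in panels:
        queue = deque(retrieve(p, k))        # step 1: top-K similar panels
        seen = {p}
        while queue:
            r = queue.popleft()
            if r in seen:
                continue
            seen.add(r)
            score = share_score(p, r)        # steps 2-3: match and score
            if score <= 0.0:
                continue
            pair = frozenset((p, r))
            table[pair] = max(score, table.get(pair, 0.0))  # step 4
            if score > threshold:
                # Expand the queue with the L panels most similar to r.
                queue.extend(retrieve(r, l))
    return table
```

The queue expansion is what lets the search reach panels that are not among the top-K neighbours of P itself but are close to one of its confirmed matches.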

Fig 5.

Visualization of a graph before and after computing its maximum spanning tree, referred to as the provenance graph.

Each blue node represents a different image, and the links between nodes indicate that their corresponding images share content. On the left is the connected component graph, which shows a connected group of images identified by the proposed method. On the right, the corresponding provenance graph is obtained by pruning the links of the connected component graph through its maximum spanning tree (MST). The MST removes all cycles from the graph while keeping the edges that maximize the sum of the content-sharing scores between linked nodes.
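
The pruning described above can be sketched with Kruskal's algorithm run on edges sorted by decreasing score, which yields one maximum spanning tree per connected component. This is an illustrative sketch only: the caption does not say which MST algorithm the authors use, and the function name and the edge dictionary layout are assumptions.

```python
def provenance_graphs(edges):
    """Return one maximum spanning tree per connected component.

    edges: dict mapping a (u, v) pair to its content-sharing score
           (assumed layout, for illustration only).
    Each tree is a list of (u, v, score) tuples.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    # Visit the strongest links first; skip any edge that would close a cycle.
    for (u, v), w in sorted(edges.items(), key=lambda kv: kv[1], reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            kept.append((u, v, w))

    # Group the surviving edges by the component they belong to.
    trees = {}
    for u, v, w in kept:
        trees.setdefault(find(u), []).append((u, v, w))
    return list(trees.values())
```

For a triangle of panels with scores 0.9, 0.8, and 0.1, the weakest link is dropped and the two strongest are kept, which matches the left-to-right change shown in the figure.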

Fig 6.

Provenance analysis at document level.

The process of document-level provenance analysis begins with Filtering & Evidence Collection (1), followed by image-level analysis to identify relationships and graphs of the collected figures (2). Finally, it tracks related documents through their linked figures and creates a provenance visualization of them. The images used in this figure are public domain and were used only for illustration’s sake.
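
The last step, tracking documents through their linked figures, can be sketched as lifting panel-level links to document-level links. The names `panel_links` and `panel_to_doc` are hypothetical, and skipping links between panels of the same document is an assumption of this sketch, not something the caption states.

```python
from collections import defaultdict


def document_links(panel_links, panel_to_doc):
    """Lift panel-level links to document-level links.

    panel_links:  iterable of (panel_a, panel_b) pairs that share content.
    panel_to_doc: dict mapping each panel to its source document (e.g. a DOI).
    Returns a dict mapping each unordered document pair to the panel pairs
    that connect the two documents.
    """
    links = defaultdict(list)
    for a, b in panel_links:
        doc_a, doc_b = panel_to_doc[a], panel_to_doc[b]
        if doc_a != doc_b:  # assumption: ignore reuse within one document
            links[frozenset((doc_a, doc_b))].append((a, b))
    return dict(links)
```

Keeping the connecting panel pairs alongside each document pair preserves the evidence behind every edge of the document-level graph.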

Fig 7.

Extension of the SPP dataset annotations.

The original annotations (left) rely on spreadsheets and offer limited information about the shared visual content. They indicate whether a document, identified by its DOI, reuses an image and which label corresponds to the reused image. In this instance, “WH02” represents a label for a group of similar wound-healing assay photos identified across multiple articles. The original spreadsheet annotation and its detailed explanation are described on Dr. Bik’s website [11]. The proposed new annotations (right) rely on documents in JSON format to track and register all the figures within a document and all the panels within a figure that suspiciously share regions. The panel in this figure is present in the dataset and extracted from https://doi.org/10.1042/BSR20191453 under a Creative Commons license.

Table 2.

Number of items per SPP dataset.

Table 3.

Distribution of image panel types across the SPP datasets.

Table 4.

Content pairing results at image level.

Fig 8.

Content pairing results per image type for each SPP dataset version.

Blots are the most challenging image type for all solutions. Unlike the other solutions, and regardless of the SPP dataset version, the proposed method does not suffer significant drops in content pairing (CP) performance in the presence of distractors.

Table 5.

Content pairing results at document level.

Table 6.

Content grouping results at the image level.

Fig 9.

Content grouping results per image type for each SPP dataset version.

Blots are the most challenging image type for all solutions. As with CP, the proposed method does not suffer significant drops in content grouping (CG) performance in the presence of distractors.

Table 7.

Content grouping results at document level.

Table 8.

Content classification results at the image level.

Fig 10.

Content classification results per image type for each SPP dataset version.

Unlike the other solutions, the proposed method does not suffer significant drops in content classification (CC) performance when unsuspicious data are present.

Table 9.

Content classification results at the document level.

Fig 11.

Provenance graph computed by the proposed solution over the extended SPP dataset (v2).

Each graph node refers to an image panel from a scientific figure reused in a document. Blue lines linking each pair of nodes indicate the sharing of visual content. To improve visualization, we select one node from the graph and highlight its neighbors by increasing the size of their nodes and coloring them a stronger blue. We did not include the source figures in the graph due to copyright issues. Below each node, we provide the digital object identifier (DOI) of the source manuscript of the involved figure and the label the article uses to refer to that figure.

Fig 12.

Provenance graph of Western blots related to the group labeled as “SWB1” within Bik’s annotations [11].

Blue nodes refer to correctly predicted figures. Red nodes indicate figures missed by the proposed method. Due to copyright issues, we did not include the real figures in the graph. Below each node, we provide the DOI of the document that is the source of the involved figure, as well as the label the document uses to refer to that figure.

Fig 13.

Provenance graph generated by the proposed solution, representing the document level relationships between articles sharing content in the extended SPP dataset (v2).

All documents within this graph were reported as problematic by Dr. Bik’s investigation [11], without false positives. Each node in the graph corresponds to a publication, with its DOI indicated below. The most densely connected region of the graph is magnified, revealing a document that shares its content with many others.

Fig 14.

The seven false alarm provenance graphs detected by the proposed solution when applied to the SPP-v2 dataset.

The red nodes represent publications identified by their PubMed Central (PMC) ID, displayed below each red node. Connections between nodes indicate potential image duplication between their articles, as detected by our solution. When reviewing the cause of each connection, we found similar images in all connected articles, most of which properly cite their sources. All flagged publications are distractor documents added to the extended SPP dataset, indicating false connections. Out of 4096 distractor documents included in the SPP-v2 dataset, only 21 (about 0.5%) were flagged as false alarms.

Table 10.

Performance of the panel extraction solution by image panel type.
