Fig 1.
Process for labeling and predicting violations of the proportional ink principle in bar charts.
A. Figures from the PubMed Open Access Subset (PMOAS) dataset are the source data. B, C. Compound figures are separated into subfigures (B), which are then classified by chart type (C); see Materials and Methods for details. D. A large sample of bar charts is annotated by humans. E. These annotations are used to train a classifier, which then generates predictions for a large number of bar charts.
Table 1.
Fig 2.
The features of both datasets are nearly identical, so the prediction dataset could provide a reasonable estimate of the more general problems of graphical integrity.
Fig 3.
Correlation between journal rank and the likelihood of having articles with graphical integrity issues.
There is no relationship between journal ranking and the median likelihood that an article has a bar chart violating the proportional ink principle.
Fig 4.
The likelihood of having graphical integrity issues across each research field.
Fig 5.
The likelihood of having graphical integrity issues across each country.
The top three countries are the Netherlands, Spain, and France.
Fig 6.
The likelihood of having graphical integrity issues across each year.
Fig 7.
Flowchart of our data source and process.
The prediction and human-annotated datasets are randomly sampled from PubMed Open Access images. The authors annotated 8,001 bar charts from the human-annotated set, and 4,834 bar charts could be processed by the method pipeline.
Fig 8.
Flowchart of our graphical integrity classification.
Refer to Fig 9 for comparison and to Section 4.2 for details of each step in our deep learning-based method. Our method applies compound figure classification, subfigure separation, text localization, text recognition, text role classification, and graphical integrity classification. To reproduce our method, see our code at https://github.com/sciosci/graph_check.
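The sequence of steps named in this caption can be sketched as a simple staged pipeline. This is a minimal illustration only, not the code in the repository: the stage order is assumed to follow the order in which the caption lists the steps, and `STAGE_ORDER`, `run_pipeline`, and the placeholder stages are hypothetical names introduced here.

```python
from typing import Any, Callable, Dict, List

# Stage names taken from the caption; the ordering is an assumption.
STAGE_ORDER: List[str] = [
    "compound_figure_classification",
    "subfigure_separation",
    "text_localization",
    "text_recognition",
    "text_role_classification",
    "graphical_integrity_classification",
]

State = Dict[str, Any]


def run_pipeline(image: Any, stages: Dict[str, Callable[[State], State]]) -> State:
    """Run every stage in STAGE_ORDER, passing a shared state dict along."""
    missing = [name for name in STAGE_ORDER if name not in stages]
    if missing:
        raise KeyError(f"missing stage implementations: {missing}")

    state: State = {"image": image}
    completed: List[str] = []
    for name in STAGE_ORDER:
        state = stages[name](state)  # each stage returns the updated state
        completed.append(name)
    state["completed"] = completed
    return state


def _placeholder(state: State) -> State:
    """Hypothetical stand-in for a real model; returns the state unchanged."""
    return state


placeholder_stages = {name: _placeholder for name in STAGE_ORDER}
result = run_pipeline("figure.png", placeholder_stages)
```

In a real setting each placeholder would be replaced by the corresponding trained model, and the state dict would carry intermediate outputs such as subfigure crops and recognized text.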
Fig 9.
An example process for predicting violations of the proportional ink principle (see Materials and Methods for details; our code is at https://github.com/sciosci/graph_check).
A. Input image representing a scientific figure. The PubMed Open Access subset provides figures already extracted from the publications. B. Subplot extraction using the YOLO deep learning architecture [51] trained on the hand-annotated dataset (see Materials and Methods). C. Each subplot is extracted from the input image. D. Subfigure plot classification, where only bar charts are extracted (E). For each bar chart, we detect a set of low-level features (F), which are later used for predicting whether a bar chart is violating the proportional ink principle (H, yes) or not (I, no).
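To make the final decision step concrete, the following sketch shows one simplified way a violation could be flagged from recognized y-axis tick values. It is not the classifier described in this paper, which is trained on the low-level features. It assumes a common reading of the proportional ink principle for bar charts (the value axis should include zero) and that the tick values have already been recognized as numbers; the function name `violates_proportional_ink` is hypothetical.

```python
from typing import Sequence


def violates_proportional_ink(y_ticks: Sequence[float]) -> bool:
    """Simplified rule (an assumption, not the paper's classifier):
    flag a bar chart when zero lies outside the range of its y-axis ticks."""
    if not y_ticks:
        raise ValueError("at least one y-axis tick value is required")
    lowest, highest = min(y_ticks), max(y_ticks)
    # Zero must fall within the tick range for bar lengths to be
    # proportional to the values they encode.
    return not (lowest <= 0 <= highest)


truncated_axis = violates_proportional_ink([50, 75, 100])
zero_based_axis = violates_proportional_ink([0, 50, 100])
```

A rule like this ignores cases where the lowest tick label is not the true axis origin, which is one reason a learned classifier over several features may be preferable.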
Table 2.
Summary of features for proportional ink violation detection.