A Machine Reading System for Assembling Synthetic Paleontological Databases

Many aspects of macroevolutionary theory and our understanding of biotic responses to global environmental change derive from literature-based compilations of paleontological data. Existing manually assembled databases are, however, incomplete and difficult to assess and enhance with new data types. Here, we develop and validate the quality of a machine reading system, PaleoDeepDive, that automatically locates and extracts data from heterogeneous text, tables, and figures in publications. PaleoDeepDive performs comparably to humans in several complex data extraction and inference tasks and generates congruent synthetic results that describe the geological history of taxonomic diversity and genus-level rates of origination and extinction. Unlike traditional databases, PaleoDeepDive produces a probabilistic database that systematically improves as information is added. We show that the system can readily accommodate sophisticated data types, such as morphological data in biological illustrations and associated textual descriptions. Our machine reading approach to scientific data integration and synthesis brings within reach many questions that are currently underdetermined and does so in ways that may stimulate entirely new modes of inquiry.


Relational Features
Supplementary Figure 2. Overview of PDD feature extraction. Text, tables, and images in an original document are parsed (e.g., by table position extraction or natural language processing). Two or more entities and the specific properties in the document (i.e., features) that relate them are expressed as a row in a database.

Row columns: Entity1, Entity2, Feature.
Figure 7. PDD genus-level diversity (black curve) calculated using occurrences with period-level or finer temporal resolution, as opposed to the epoch-level or finer temporal resolution used in Fig. 1. The red curve shows PBDB data and is identical to the red curve in Fig. 1c.
Features: dictionary (English dictionary, GeoNames, PaleoDB, Species2000, Microstrat, MySQL stop words); part-of-speech tag from StanfordCoreNLP; named-entity tag from StanfordCoreNLP; named-entity mentions in the same sentence (paragraph, or document).

Mention-level Relations
Features: word sequence between named entities; dependency path between named entities; named-entity tag from StanfordCoreNLP.
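As an illustration of the simplest of these features, the following minimal sketch computes the word sequence between two entity mentions and stores it as an (Entity1, Entity2, Feature) row. The function name, the half-open token spans, and the example sentence are assumptions made for illustration; this is not PDD's extractor, which also relies on StanfordCoreNLP output.

```python
from typing import List, Tuple


def word_sequence_between(tokens: List[str],
                          span1: Tuple[int, int],
                          span2: Tuple[int, int]) -> List[str]:
    """Return the tokens strictly between two non-overlapping mention spans.

    Spans are half-open (start, end) token offsets; their order does not matter.
    """
    first, second = sorted([span1, span2])
    if first[1] > second[0]:
        raise ValueError("mention spans overlap")
    return tokens[first[1]:second[0]]


# Hypothetical sentence and mention offsets, for illustration only.
tokens = "The Tsingyuan Formation is of Namurian age".split()
formation_span = (1, 3)  # "Tsingyuan Formation"
interval_span = (5, 6)   # "Namurian"

feature = word_sequence_between(tokens, formation_span, interval_span)
# One database row: (Entity1, Entity2, Feature)
row = ("Tsingyuan Formation", "Namurian", " ".join(feature))
```

A dependency-path feature would have the same row shape, with the path between the two mentions in place of the surface word sequence.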

Body Size Extraction
In order to extract body size estimates from biological illustrations, we need to extract the relation: where ImageArea is a region on the PDF with known DPI, so that the actual size of the image on a printed document is known. The following table is an example of the target extracted relation.
Figure 9). Standard online tutorials were followed, with one variant for image dilation. In this step, one needs to specify a parameter for dilation. Instead of specifying a single value for the parameter, we tried a range of values and generated different versions of the segmentation. PDD then trained a logistic regression classifier on a human-labeled corpus to choose among these segmentations.
e.g., "3a-c"), Taxon (e.g., "B. rara"), and magnitude (e.g., "X1"). Supplementary Figure 10 shows an example of these mentions (raw text with OCR errors). PDD then extracts relations between these mentions using the same set of features as for other diversity-related relations.
PDD joins these two intermediate relations to form a large factor graph to populate the target relation. Joint inference on the whole factor graph is then executed.
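The role of the known DPI and of the magnification mention can be made concrete with a small sketch. It assumes that a label such as "X2" means the illustration is printed at twice natural size, and that the image extent is measured in pixels; the function names are hypothetical and are not part of PDD.

```python
MM_PER_INCH = 25.4


def parse_magnification(label: str) -> float:
    """Parse a magnification label such as "X1" or "x2.5" (assumed format)."""
    text = label.strip()
    if len(text) < 2 or text[0] not in "Xx×":
        raise ValueError(f"unrecognized magnification label: {label!r}")
    return float(text[1:])


def specimen_size_mm(pixels: float, dpi: float, magnification: float) -> float:
    """Convert an image extent in pixels to an estimated specimen size in mm.

    pixels / dpi gives the printed extent in inches; dividing by the
    magnification rescales the printed extent to natural size.
    """
    if dpi <= 0 or magnification <= 0:
        raise ValueError("dpi and magnification must be positive")
    printed_mm = pixels / dpi * MM_PER_INCH
    return printed_mm / magnification


# A 300-pixel-long image region scanned at 300 DPI and labeled "X2".
size = specimen_size_mm(300, 300, parse_magnification("X2"))
```

In this sketch the segmentation step only has to deliver the pixel extent of one image region; the caption relation supplies the taxon and the magnification.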

Body Size Extraction Validation
Corpus. Other researchers [1] recently compiled body size measurements by manually measuring illustrations and reading captions in the Treatise on Invertebrate Paleontology. Of the 55 volumes now accessible, humans have made measurements from Parts H, I, K, L, N, O, P, Q, R, S, T, and U. From these documents we created the following three sets: 1. Testing Corpus (With Ground Truth). Part H. We used the Training Corpus to generate training data for distant supervision. We compared our results with those of human annotators using the Testing Corpus (With Ground Truth). The Testing Corpus (Without Ground Truth) shows that PDD helps to extend the body size database with new extractions that are not provided by human annotators.
Results on Testing Corpus (With Ground Truth). PDD achieves high precision and slightly higher recall than humans when extracting body size measurements and their relations.
Precision. We measured the precision of PDD by randomly sampling 100 extracted instances of the target relation and manually annotating those extractions. We find that the accuracy is more than 92%.
Recall. We next counted the number of distinct (genus, figure name, figure label) tuples extracted by humans and by PDD on the same set of documents. We find that humans extracted 4,837 distinct tuples and PDD extracted 5,783 distinct tuples, or 20% more. The primary reason for the increase is the complete extraction of measurements for all parts of a figure (e.g., "1a-f"). Humans typically extract only one part.
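The tuple counting behind this comparison can be sketched as follows. The expansion of a label range such as "1a-f" into individual parts is an assumed convention used only for illustration, and the helper names are hypothetical; the two counts are the ones reported above.

```python
import re


def expand_figure_label(label: str) -> list:
    """Expand a label range such as "1a-f" into ["1a", ..., "1f"].

    Labels that do not match the assumed "<number><letter>-<letter>" form
    are returned unchanged as a single-element list.
    """
    match = re.fullmatch(r"(\d+)([a-z])-([a-z])", label)
    if match is None or match.group(2) > match.group(3):
        return [label]
    number, start, end = match.groups()
    return [f"{number}{chr(code)}" for code in range(ord(start), ord(end) + 1)]


def distinct_tuples(records) -> set:
    """Collect distinct (genus, figure name, figure label) tuples."""
    return {(genus, figure, label) for genus, figure, label in records}


def relative_increase(baseline_count: int, new_count: int) -> float:
    """Fractional increase of new_count over baseline_count."""
    return (new_count - baseline_count) / baseline_count


# Counts reported in the text: 4,837 (humans) and 5,783 (PDD).
increase = relative_increase(4837, 5783)
```

Expanding every part of a figure is what makes the PDD tuple set larger than a set in which only one part per figure is measured.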
Although selective data extraction is often a decision made for the sake of expediency and because not all images provide optimal orientations for the dimensions being targeted by a given investigation, extracting complete measurements and associated textual descriptions establishes the foundation for more complete morphometric analyses.
Results on Testing Corpus (Without Ground Truth). PDD is able to extract facts from documents that have not yet been processed by humans. PDD processed Parts A, B, C, D, E, F, G, V, and W of the Treatise on Invertebrate Paleontology, which have not yet been processed for body size by [1]. PDD extracts 7K distinct (genus, figure name, figure label) tuples from these documents.

Multi-linguistic Extraction
Corpus. We followed a protocol similar to the one used to collect the overlapping corpus for English documents. We identified the top 20 journals ranked by the number of journal articles in PBDB and attempted to download articles from their web sites. Access was limited to Vertebrata Palasiatica (Chinese), Stuttgarter Beitrage zur Naturkunde (German), and Eclogae Geologicae Helvetiae (German). A total of 1,583 Chinese journal articles and 4,393 German journal articles were obtained in this way. We used the same protocol to map these journal articles to articles in PBDB. Of these, 47 Chinese articles and 56 German articles overlapped with the PBDB.
Protocol. We compared the extractions of PDD in the overlapping set with the PBDB extractions on the same set of documents. We assess quality as recall for the tuple (Taxon, TimeInterval). This tuple is language-independent because (1) taxa have a unified Latin representation in all English, Chinese, and German articles; and (2) time intervals and their hierarchical relationships are known by PDD for all languages. To extract this tuple, PDD requires the information in all other tuples, including (Taxon, Formation), (Formation, TimeInterval), and (Formation, Location). We selected taxa common to both PDD and PBDB and labeled PDD's extraction as correct if the taxon temporal ranges overlap.
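The overlap criterion can be sketched as a simple interval test. The representation of a temporal range as an (older, younger) pair of ages in Ma, the treatment of touching boundaries as overlapping, and the numeric ranges in the example are assumptions made for illustration.

```python
def ranges_overlap(range_a, range_b) -> bool:
    """Return True if two temporal ranges share at least one point in time.

    Each range is (older_ma, younger_ma) with older_ma >= younger_ma.
    Ranges that only touch at a boundary are counted as overlapping.
    """
    a_old, a_young = range_a
    b_old, b_young = range_b
    if a_old < a_young or b_old < b_young:
        raise ValueError("expected (older_ma, younger_ma) with older >= younger")
    return a_young <= b_old and b_young <= a_old


def recall(correct: int, total_in_reference: int) -> float:
    """Fraction of reference genera for which the extraction is correct."""
    return correct / total_in_reference


# Hypothetical ranges in Ma, not taken from the paper.
pdd_range = (330.0, 315.0)
pbdb_range = (359.0, 299.0)
is_correct = ranges_overlap(pdd_range, pbdb_range)
```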
Recall. From the overlapping corpus, PBDB extracts (Taxon, TimeInterval) tuples for 85 distinct genera in Chinese and 242 distinct genera in German. We find that PDD correctly extracts (Taxon, TimeInterval) for 24 genera (28%) in Chinese and 82 genera (33%) in German. The difference between Chinese and German is caused primarily by OCR quality, even though we used commercial OCR tools for both. Chinese has lower OCR quality because of the large vocabulary in East Asian languages.
Precision. For all 24 distinct genera in Chinese articles and all 82 distinct genera in German articles, the extracted temporal intervals overlap with the PBDB extractions, indicating high precision.

Specific Technical Validation
Here we describe DeepDive, the underlying system that powers PDD [2-7].
Rule-based Systems. The earliest KBC systems used pattern matching to extract relationships from text. The best-known example is the "Hearst Pattern" proposed by Hearst [20] in 1992. In her seminal work, Hearst observed that a large number of hyponyms can be discovered by simple patterns, e.g., "X, such as Y". Hearst's technique forms the basis of many further techniques that attempt to extract high-quality patterns from text. In industry, rule-based (pattern-matching-based) KBC systems, such as IBM's SystemT [8,21], have been built to develop high-quality patterns. These systems provide the user with a (usually declarative) interface to specify a set of rules and patterns to derive relationships. These systems have achieved state-of-the-art quality after careful engineering effort, as shown by Li et al. [21].
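The "X, such as Y" pattern can be sketched with a regular expression. This is a deliberately naive, single-word approximation written for illustration, with a hypothetical example sentence; it is not Hearst's original implementation, and it will produce false matches on sentences where the list is followed directly by other words joined with "and" or "or".

```python
import re

# Hypernym, optional comma, "such as", then a list of single-word hyponyms
# separated by ", ", " and ", " or ", ", and " or ", or ".
SEPARATOR = r", (?:and |or )?| and | or "
SUCH_AS = re.compile(r"(\w+),? such as (\w+(?:(?:" + SEPARATOR + r")\w+)*)")


def hearst_such_as(sentence: str) -> list:
    """Return (hyponym, hypernym) pairs found by the "X such as Y" pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym, hyponym_list = match.groups()
        for hyponym in re.split(SEPARATOR, hyponym_list):
            pairs.append((hyponym, hypernym))
    return pairs


pairs = hearst_such_as(
    "Invertebrates such as brachiopods, trilobites, and corals are common."
)
```

The need to hand-tune such patterns until they are precise is the engineering cost that the statistical approaches described next try to reduce.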
Statistical Approaches. One limitation of rule-based systems is that the developer needs to ensure that all rules provided to the system are high-precision rules. Over the last decade, probabilistic (or machine learning) approaches have been proposed that allow the system to select among a range of a priori features automatically. In these approaches, each extracted tuple is associated with a marginal probability that it is true (i.e., that it appears in the KB). DeepDive, Google's knowledge graph, and IBM's Watson are built on this approach. Within this space there are three styles of systems:
• Classification-based Frameworks. Here, traditional classifiers, e.g., naïve Bayes or logistic regression classifiers, assign each tuple a probability score. For example, KnowItAll [12] and TextRunner [13,14] use a naïve Bayes classifier, and CMU's NELL [16,17] uses logistic regression. Large-scale systems, e.g., NELL or Watson, typically use these types of approaches in sophisticated combinations.
• Maximum a Posteriori (MAP). Here, the probabilistic approach is used, but the MAP or most likely world (which do differ slightly) is selected. Notable examples include the YAGO system [15], which uses a PageRank-based approach to assign a confidence score. Other examples include SOFIE [10] and Prospera [11], which use an approach based on constraint satisfaction.
• Graphical Model Approaches. The classification-based methods ignore the interaction among predictions, and there is a hypothesis that modeling these correlations yields higher-quality systems more quickly. A generic graphical model has been used to model the probability distribution over all possible extractions. For example, Poon et al. [19] used Markov logic networks (MLN) [22] for information extraction. Microsoft's StatisticalSnowBall/EntityCube [18] also uses an MLN-based approach. A key challenge with these systems is scalability. For example, Poon et al. were limited to 1.5K citations. Our relational-database-driven algorithms for MLN-based systems are dramatically more scalable [3].
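To make concrete the distinction between reporting a marginal probability for each tuple and selecting a single most likely world, the following toy sketch enumerates a joint distribution over two candidate tuples. The distribution is invented for illustration and does not come from any of the systems cited above.

```python
# Joint probability of each possible world over two candidate tuples (A, B);
# 1 means the tuple is true in that world. The numbers are illustrative only.
worlds = {
    (0, 0): 0.4,
    (1, 0): 0.3,
    (1, 1): 0.3,
    (0, 1): 0.0,
}

# MAP: the single most probable joint assignment.
map_world = max(worlds, key=worlds.get)

# Marginals: probability that each tuple is true, summed over all worlds.
marginals = [
    sum(prob for world, prob in worlds.items() if world[i] == 1)
    for i in range(2)
]

# Accepting each tuple whose marginal exceeds 0.5 can select an assignment
# that differs from the MAP world.
thresholded = tuple(int(m > 0.5) for m in marginals)
```

Here the most likely world rejects both tuples, while tuple A has a marginal probability above one half, so the two ways of reading the same model disagree about A.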

Calibrated Probabilities
DeepDive takes a Bayesian probabilistic approach to KBC by treating OCR, NLP, image processing, and feature recognition as one joint probabilistic inference problem in which all predictions are modeled as a factor graph (Fig. S3). This probabilistic framework ensures that all facts produced by DeepDive are associated with a marginal probability. These marginal probabilities are meaningful in DeepDive (i.e., they should correspond to the actual probabilities of a fact being correct), which provides a mechanism for evaluation and an aid to improving the system.
Calibration. In DeepDive, calibration plots are used to summarize the overall quality of the KBC results. Ideally, the probability associated with a given fact in DeepDive should equal the empirical probability that this fact is correct (i.e., an extraction with a probability of 0.95 should be correct 95% of the time when inspected in the original source). Because DeepDive uses a joint probability model, any set of predictions can be assigned a marginal probability. Queries can then be run against the model to help determine where it needs improvement. Supplementary Figure 11 and Supplementary Figure 12 show calibration plots for the ODS and the WDS presented in the main text. We use Supplementary Figure 11(1), the target relation Taxonomy in the ODS, as an example. A calibration plot contains three components: (a) Accuracy, which measures the test-set accuracy of predictions with a certain probability; (b) # Predictions (Testing Set), which measures the number of extractions in the test set with a certain probability; and (c) # Predictions (Whole Set), which measures the number of extractions in the whole set with a certain probability. The difference between the test set and the whole set is that the former has training labels for each random variable. Results are summarized as histograms, and empirically we find that a bin size of 0.1 is usually sufficient to understand the behavior of the system.
Using Calibration Plots.
(a) Accuracy. If the accuracy curve is close to the ideal (0,0)-(1,1) line, the probabilities produced by the system match the test-set accuracy. For example, Supplementary Figure 11(1) shows a reasonably good calibration curve. Differences between these two lines can be caused by (1) insufficient training data or a small testing corpus, and/or (2) bad mixing behavior of the sampler or other software bugs. For example, Supplementary Figure 12(2,3,4) shows much better calibration behavior than Supplementary Figure 11(2,3,4), primarily because the former is based on the whole corpus, which has more training data and a larger testing set.
(b) # Predictions (Testing Set). Ideally, the # Predictions histogram should have a "U" shape. That is, most of the data are concentrated at high probability (where we are confident an extraction is correct) and low probability (where we are confident it is incorrect). A large number of predictions with a probability of approximately 0.5 means that the system has little information about how to classify these extractions. This implies that more features could be defined to resolve the uncertainty. For example, Supplementary Figure 11
(c) # Predictions (Whole Set). This histogram is similar to (b), but illustrates the behavior of scaling the system to a set of documents for which we do not have any training examples. Usually we hope that (c) has a similar shape to (b).
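The three components of a calibration plot can be computed with a short binning routine. The sketch below assumes that each prediction is a (probability, label) pair in which the label is True or False for test-set extractions and None for extractions without a label; the function name and the output layout are hypothetical and do not reflect DeepDive's own tooling.

```python
def calibration_table(predictions, n_bins: int = 10):
    """Summarize predictions into probability bins (default bin size 0.1).

    Returns one row per bin:
    (bin_low, bin_high, accuracy, n_test_predictions, n_whole_predictions),
    where accuracy is None for bins without labeled predictions.
    """
    bins = [{"labeled": 0, "correct": 0, "total": 0} for _ in range(n_bins)]
    for prob, label in predictions:
        if not 0.0 <= prob <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        index = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes in the last bin
        bins[index]["total"] += 1
        if label is not None:
            bins[index]["labeled"] += 1
            bins[index]["correct"] += int(label)
    rows = []
    for i, b in enumerate(bins):
        accuracy = b["correct"] / b["labeled"] if b["labeled"] else None
        rows.append((i / n_bins, (i + 1) / n_bins, accuracy, b["labeled"], b["total"]))
    return rows


# Tiny illustrative input: four labeled test-set extractions, one unlabeled.
table = calibration_table([
    (0.95, True), (0.97, True), (0.96, False), (0.05, False), (0.52, None),
])
```

A well-calibrated system would show per-bin accuracy close to the bin's probability range, and a "U"-shaped pattern in the two count columns.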
Usage. The above techniques have proven critical to debugging and improving the quality of PDD. In response to low confidence, a user can provide labeled examples, which allows the system to learn weights that yield higher confidence. Additionally, a user may write logical inference rules that provide ways of improving quality, which is a key component of all statistical relational approaches.

Related Work
Here we survey recent efforts that focus on how to improve the quality of a KBC system.
Rich Features. Different researchers have recently noted the importance of combining and using a rich set of features and signals to improve the quality of a KBC system. Two famous efforts, the Netflix challenge [24], and IBM's Watson [25], which won the Jeopardy gameshow, have identified the importance of features and signals:
Ferrucci et al. [25]: For the Jeopardy Challenge, we use more than 100 different techniques for analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses. What is far more important than any particular technique we use is how we combine them in DeepQA such that overlapping approaches can bring their strengths to bear and contribute to improvements in accuracy, confidence, or speed.
Buskirk [24]: The top two teams beat the challenge by combining teams and their algorithms into more complex algorithms incorporating everybody's work. The more people joined, the more the resulting team's score would increase.
In both efforts, the rich set of features and signals contributed to the high quality of the corresponding system. Other researchers have found similar phenomena. For example, Mintz et al. [26] find that although surface features and deep NLP features have similar quality for relation extraction tasks, combining them achieves a significant improvement over using either one in isolation. Similar "feature-based" approaches are also used in other domains (e.g., Finkel et al. [27] use a diverse set of features to build an NLP parser with state-of-the-art quality). In our own work [28], we have also found that integrating a diverse set of deep NLP features can improve a table extraction system significantly.
Joint Inference. Another recent trend in building KBC systems is to take advantage of joint inference [5,19,28-33]. Unlike traditional models [34], such as logistic regression or SVM, joint inference approaches emphasize learning multiple targets simultaneously. For example, Poon et al. [19,31] find that learning segmentation and extraction in the same Markov logic network significantly improves the quality of information extraction. Similar observations have been made by Min et al. [29] and McCallum [30]. Our recent work also shows the empirical improvement of joint inference on a diverse set of tasks, including relation extraction [5] and table extraction [28].
Deep Learning and Joint Inference. An emerging effort in the machine learning community is to build a fully joint model for NLP tasks [32,33]. The goal is to build a single joint model from the lowest level (e.g., POS tagging) to the highest level (e.g., semantic role labeling). The PDD system is built in a similar spirit: it attempts to build a joint model that spans low-level tasks (e.g., OCR) to high-level tasks (e.g., cross-document inference for relation extraction).

The DeepDive Approach and the Impact of Rich Features and Joint Rules
DeepDive uses joint inference rules and rich features. In this section, we test whether these features and rules are important to PDD's quality by conducting a lesion study.
Protocol. All experiments were run on the overlapping corpus as described in the main text. We produced variants of PDD by removing features/rules and all components that rely on the output of the removed feature/rule. We summarize the quality of PDD by computing Spearman's rho for first differences in genus-level biodiversity (as in Fig. 1).
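The quality summary used in this protocol can be sketched in a few lines: take first differences of two diversity time series and compute Spearman's rho as the Pearson correlation of their ranks, with tied values receiving average ranks. The function names and the two short series are illustrative assumptions, not the data behind Fig. 1.

```python
def first_differences(series):
    """Change between consecutive time bins."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical genus counts per time bin for two compilations.
diversity_a = [10, 14, 13, 20, 26]
diversity_b = [100, 130, 125, 180, 230]
rho = spearman_rho(first_differences(diversity_a), first_differences(diversity_b))
```

Working on first differences compares bin-to-bin changes rather than absolute levels, so two curves with different overall magnitudes can still agree strongly.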
Features. The PDD feature extraction phase extracts a set of features, including deep linguistic features (e.g., dependency parsing results) and vision-based features (e.g., a simple table extractor based on the Hough Transform). To study their impact, we conduct a lesion study by sequentially disabling these features.
Deep NLP Features. Supplementary Figure 13(a) shows the impact of removing NLP features (e.g., dependency path). If the whole PBDB is used, dropping these deep NLP features does not have a significant effect on Spearman's rho. However, if the knowledge base used for training is reduced to 1% of its size, then dropping NLP features results in a decrease of Spearman's rho from 0.82 to 0.72.
Table Recognition. PDD contains a table recognition component to detect tables using vision-based features (e.g., Hough Transform). When this component is disabled and the 1% PBDB is used for distant supervision, PDD achieves a Spearman's rho of 0.69. This drop is the effect of decreased recall of data in tables.

Joint Inference Rules. PDD contains a set of factors for joint inference among random variables, as shown in Fig. S3. We study the impact of two types of joint inference rules: (1) joint inference within one relation; and (2) joint inference across different relations (Supplementary Figure 14).
Joint Inference for Same Relations. Disabling all joint inference rules results in a Spearman's rho of 0.64, even when using the whole PBDB knowledge base. This is a marked decline from the Spearman's rho of 0.82 obtained when these rules are enabled. This large decline in quality occurs because jointly inferring the values of random variables results in much higher-quality predictions. For example, assume that we have three candidate facts that the Tsingyuan Formation has the age (1) Carboniferous, (2) Namurian, and (3) Kungurian. In the current PDD system, a higher confidence for Carboniferous will also boost the confidence for Namurian (because of containment) and decrease the confidence for Kungurian (because the Kungurian is so much younger than the Carboniferous). This type of joint inference between random variables helps PDD to produce results with higher recall (by boosting confidence to cross the imposed 0.95 threshold) and higher precision (by eliminating wrong predictions).
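The effect described in this example can be reproduced on a toy factor graph with three binary variables, solved by exact enumeration. All weights below are invented for illustration, and exact enumeration stands in for the sampling-based inference a real system would need at scale; this is not PDD's actual rule set.

```python
from itertools import product
from math import exp


def marginals(unary, pairwise):
    """Exact marginals of a small log-linear model over binary variables.

    unary maps a variable name to the weight added when it is true;
    pairwise maps a pair of names to the weight added when both are true.
    """
    names = list(unary)
    totals = {name: 0.0 for name in names}
    partition = 0.0
    for values in product([0, 1], repeat=len(names)):
        world = dict(zip(names, values))
        score = sum(unary[name] * world[name] for name in names)
        score += sum(w * world[a] * world[b] for (a, b), w in pairwise.items())
        weight = exp(score)
        partition += weight
        for name in names:
            totals[name] += weight * world[name]
    return {name: totals[name] / partition for name in names}


# Strong textual evidence for Carboniferous, none for the other two candidates.
unary = {"Carboniferous": 2.0, "Namurian": 0.0, "Kungurian": 0.0}
joint_rules = {
    ("Carboniferous", "Namurian"): 1.5,    # containment: mutually supporting
    ("Carboniferous", "Kungurian"): -1.5,  # incompatible ages: mutually exclusive
}

independent = marginals(unary, {})
joint = marginals(unary, joint_rules)
```

Without the pairwise factors, Namurian and Kungurian both stay at probability 0.5; with them, the evidence for Carboniferous raises the marginal for Namurian and lowers the marginal for Kungurian.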
Joint Inference across Relations. The current PDD system has three joint inference rules across different relations (e.g., a geologic formation entity mention cannot concurrently be a location mention). We disabled these rules and show in Supplementary Figure 14 that doing so does not have a large impact on overall quality. This implies that the current PDD system is quite modular across different relations, so that different types of relations can be decoupled and applied to other related applications (e.g., in biology or geology).

Related Work
There is an emerging trend in both industry and academia to support statistical inference and learning, and we survey these efforts in this section.
Hardware Efficiency. One line of research tries to speed up statistical inference and learning by taking better advantage of modern hardware and clusters. For example, many industrial database vendors have integrated statistical analytics components into their products, including Oracle's ORE [35], Pivotal's MADlib [36], and IBM's SystemML [37]. These systems provide functionality such as logistic regression and collapsed Gibbs sampling for topic modeling on their data management systems. There are also efforts to design new data processing frameworks instead of relying on traditional database systems. Indeed, most data processing frameworks developed in the last few years are designed to support statistical analytics, including Mahout [38] for Hadoop, MLI for Spark [39], GraphLab [40], GraphChi [41], and Delite [42,43]. These systems have been shown to increase the performance of the corresponding statistical analytics tasks significantly.
Statistical Efficiency. One key difference between statistical inference and learning and traditional SQL-like analytics is that different ways of executing the same task usually converge to the same quality at different speeds. Therefore, another line of related work, contributed mainly by the mathematical optimization and machine learning communities, is to design more efficient algorithms for statistical inference tasks. One recent trend is to design lock-free algorithms that can be executed with high parallelism on emerging multi-socket, multi-core machines [3,44-47]. For example, Tsitsiklis et al. [44] prove asymptotic convergence for a parallel coordinate descent algorithm, and Bradley et al. [47] prove the convergence rate and theoretical speedups for parallel stochastic coordinate descent. Our own work [3,46] proves the convergence of lock-free execution for stochastic gradient descent and stochastic coordinate descent.

The DeepDive Approach and The Performance of PDD
The DeepDive Approach. The statistical inference and learning engine in DeepDive [4] addresses the challenge of designing a high-performance statistical inference and learning engine on a single machine [4,6,7,46]. Compared to traditional work, the main novelty of DeepDive is that it considers both hardware efficiency and statistical efficiency when executing an inference and learning task.
Hardware Efficiency. DeepDive takes into consideration the architecture of modern non-uniform memory access (NUMA) machines. A NUMA machine usually contains multiple nodes (sockets), where each socket contains multiple CPU cores. To achieve high hardware efficiency, it is useful to decrease the communication across different NUMA nodes.
Statistical Efficiency. Pushing hardware efficiency to the extreme might cause statistical efficiency to suffer, because the lack of communication between nodes could decrease the rate of convergence of a statistical inference and learning algorithm. DeepDive takes advantage of theoretical results on model averaging [45] and lock-free execution [7,46].
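The trade-off between communication and convergence can be illustrated with a minimal model-averaging sketch. It simulates, sequentially and in a single process, two replicas that each run stochastic gradient descent on their own data shard and synchronize only once per epoch by averaging their parameters. The one-parameter least-squares problem, the learning rate, and all names are assumptions for illustration; this is not DeepDive's engine and involves no real NUMA-aware threading.

```python
def sgd_epoch(weight, shard, learning_rate):
    """One pass of SGD for the model y = weight * x with squared loss."""
    for x, y in shard:
        gradient = 2.0 * (weight * x - y) * x
        weight -= learning_rate * gradient
    return weight


def train_with_model_averaging(shards, epochs=50, learning_rate=0.01):
    """Each replica trains on its own shard; replicas are averaged per epoch.

    In a NUMA setting each replica would live on one node, so the only
    cross-node communication is the averaging step.
    """
    replicas = [0.0 for _ in shards]
    for _ in range(epochs):
        replicas = [
            sgd_epoch(weight, shard, learning_rate)
            for weight, shard in zip(replicas, shards)
        ]
        mean_weight = sum(replicas) / len(replicas)
        replicas = [mean_weight for _ in shards]
    return replicas[0]


# Noise-free synthetic data from y = 3x, split into two shards.
data = [(i / 10.0, 3.0 * i / 10.0) for i in range(1, 21)]
shards = [data[0::2], data[1::2]]
learned_weight = train_with_model_averaging(shards)
```

Averaging less often would reduce communication further but let the replicas drift apart between synchronizations, which is the statistical cost the paragraph above refers to.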