Figure 1.
Cocaine: The Predicted Accuracy of Individual Text-Mined Facts Involving Semantic Relation Stimulate
Each directed arc from an entity A to an entity B in this figure should be interpreted as a statement “A stimulates B”, where, for example, A is cocaine and B is progesterone. The predicted accuracy of individual statements is indicated both in color and in width of the corresponding arc. Note that, for example, the relation between cocaine and progesterone was derived from multiple sentences, and different instances of extraction output had markedly different accuracy. Altogether we collected 3,910 individual facts involving cocaine. Because the same fact can be repeated in different sentences, only 1,820 facts out of 3,910 were unique. The facts cover 80 distinct semantic relations, out of which stimulate is just one example.
Table 1.
A Sample of Sentences That Were Used as an Input to Automated Information Extraction, Biological Relations Extracted from These Sentences, and the Corresponding Evaluations
Table 2.
List of Annotation Choices Available to the Evaluators
Figure 2.
The Correlation Matrix for the Features Used by the Classification Algorithms
The half-matrix below the diagonal was derived from analysis of the whole GeneWays 6.0 database; the half-matrix above the diagonal represents a correlation matrix estimated from only the manually annotated dataset. The white dotted lines outline clusters of features, suggested by analysis of the annotated dataset; we used these clusters in implementation of the Clustered Bayes classifier. We used two versions of the Clustered Bayes classifier: with all 68 features (Clustered Bayes 68), and with a subset of only 44 features but a higher number of discrete values allowed for nonbinary features (Clustered Bayes 44). The Clustered Bayes 44 classifier did not use features 1, 6, 7, 8, 9, 12, 27, 28, 31, 34, 37, 40, 42, 47, 48, 49, 52, 54, 55, 60, 62, 63, and 65.
Figure 3.
A Hypothetical Three-Layered Feed-Forward Neural Network
We used a similar network with 68 input units (one unit per classification feature) and ten hidden-layer units.
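The forward pass of such a network can be sketched as below. This is only an illustrative sketch: the caption gives the 68 input units and ten hidden units, whereas the single output unit, the sigmoid activations, the random initial weights, and the names `init_network` and `forward` are assumptions made for the example and are not taken from the study.

```python
import math
import random

N_INPUTS = 68   # one input unit per classification feature (from the caption)
N_HIDDEN = 10   # hidden-layer units (from the caption)


def sigmoid(x):
    """Logistic activation; assumed here, the caption does not name one."""
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_inputs=N_INPUTS, n_hidden=N_HIDDEN, seed=0):
    """Return small random weights; the last weight in each row is the bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(network, features):
    """Propagate one feature vector through the hidden layer to one output."""
    hidden_w, output_w = network
    if len(features) != len(hidden_w[0]) - 1:
        raise ValueError("feature vector length does not match the input layer")
    # zip stops at the shorter sequence, so the trailing bias is added separately.
    hidden_act = [
        sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, features)))
        for w in hidden_w
    ]
    return sigmoid(output_w[-1]
                   + sum(wi * hi for wi, hi in zip(output_w, hidden_act)))


if __name__ == "__main__":
    net = init_network()
    # Hypothetical feature vector, for illustration only.
    print(forward(net, [0.5] * N_INPUTS))
```

With untrained weights the output carries no meaning; in practice the weights would be fitted to the manually annotated statements.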
Table 3.
Parameter Values Used for Various SVM Classifiers in This Study
Table 4.
Machine Learning Methods Used in This Study and Their Implementations
Table 5.
List of the Features That We Used in the Present Study
Figure 4.
ROC Curves for the Classification Methods Used in the Present Study
We show only the linear-kernel SVM and the Clustered Bayes 44 ROC curves to avoid excessive data clutter.
Figure 5.
Accuracy of the Raw (Noncurated) Extracted Relations in the GeneWays 6.0 Database
The accuracy was computed by averaging over all individual information extraction examples that were manually evaluated by the human curators. The plot compactly represents both the per-relation accuracy of the extraction process (indicated by the length of the corresponding bar) and the abundance of the corresponding relations in the database (represented by the bar color). Some relations are extracted with high precision; many others are noisy. The database accuracy was markedly increased by the automated curation outlined in this study (see Figure 9).
Table 6.
ROC Scores for Methods Used in This Study, with Error Bars Calculated in 10-Fold Cross-Validation
Figure 6.
Ranks of All Classification Methods Used in This Study in Ten Cross-Validation Experiments
Table 7.
Comparison of the Performance of Human Evaluators and of the MaxEnt 2 Algorithm
Figure 7.
Comparison of a Correlation Matrix for the Features (Colored Half of the Matrix) Computed Using Only the Annotated Set of Data and a Matrix of Mutual Information between All Feature Pairs and the Statement Class (Correct or Incorrect)
The plot indicates that a significant amount of information critical for classification is encoded in pairs of weakly correlated features. The white dotted lines outline clusters of features, suggested by analysis of the annotated dataset; we used these clusters in implementation of the Clustered Bayes classifier.
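To illustrate how a pair of uncorrelated features can still carry class information, the following sketch estimates the mutual information between a feature pair and the statement class from empirical counts. It assumes discrete feature values and a plug-in (frequency-based) estimator in bits; the function names and the toy data are hypothetical, and the estimator used in the study may differ.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from two aligned discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y).
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


def pair_class_information(feature_i, feature_j, labels):
    """Mutual information between the joint value of two features and the class."""
    return mutual_information(list(zip(feature_i, feature_j)), labels)


if __name__ == "__main__":
    # Toy data: the class is the exclusive OR of two independent binary features.
    a = [0, 0, 1, 1]
    b = [0, 1, 0, 1]
    label = [0, 1, 1, 0]
    print(mutual_information(a, label))         # each feature alone: 0 bits
    print(pair_class_information(a, b, label))  # the pair: 1 bit
```

In this toy case the two features are uncorrelated with each other and individually uninformative, yet together they determine the class completely.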
Table 8.
Comparison of Human Evaluators and a Program That Mimicked Their Work
Figure 8.
Values of Precision, Recall, and Accuracy of the MaxEnt 2 Classifier Plotted against the Corresponding Log-Scores Provided by the Classifier
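Precision is defined as $\frac{TP}{TP + FP}$, recall is defined as $\frac{TP}{TP + FN}$, and accuracy is defined as $\frac{TP + TN}{TP + TN + FP + FN}$, where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. The optimum accuracy was close to 88% and was attained at a score threshold slightly above 0. We can improve precision at the expense of accuracy. For example, by setting the threshold score to 0.6702, we can bring the overall database precision to 95%, which would correspond to a recall of 77.91% and to an overall accuracy of 84.18%.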
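The trade-off between precision and recall as the score threshold moves can be sketched as follows. This is a minimal illustration rather than the evaluation code of the study: it uses the standard confusion-matrix definitions of precision, recall, and accuracy, treats a statement as predicted "correct" when its score is at or above the threshold (an assumption about how ties are handled), and runs on made-up scores and labels.

```python
def confusion_counts(scores, labels, threshold):
    """Count true/false positives and negatives at a given score threshold."""
    tp = fp = tn = fn = 0
    for score, is_correct in zip(scores, labels):
        predicted_correct = score >= threshold  # tie handling is an assumption
        if predicted_correct and is_correct:
            tp += 1
        elif predicted_correct and not is_correct:
            fp += 1
        elif not predicted_correct and not is_correct:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def precision_recall_accuracy(scores, labels, threshold=0.0):
    """Return (precision, recall, accuracy); undefined ratios become NaN."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    accuracy = (tp + tn) / total if total else float("nan")
    return precision, recall, accuracy


if __name__ == "__main__":
    # Hypothetical classifier scores and manual labels, for illustration only.
    scores = [-1.0, -0.2, 0.1, 0.5, 0.9, 1.5]
    labels = [False, True, False, True, True, True]
    for threshold in (0.0, 0.6):
        print(threshold, precision_recall_accuracy(scores, labels, threshold))
```

On this toy input, raising the threshold from 0.0 to 0.6 raises precision from 0.75 to 1.0 while recall falls from 0.75 to 0.5.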
Figure 9.
Accuracy and Abundance of the Extracted and Automatically Curated Relations
This plot represents the per-relation accuracy after both information extraction and automated curation were done. Accuracy is indicated by the length of the relation-specific bars, while the abundance of the corresponding relations in the manually curated dataset is represented by color. Here, the MaxEnt 2 method was used for the automated curation. The results shown correspond to a score-based decision threshold set to zero; that is, all negative-score predictions were treated as “incorrect.” Raising the score-based decision threshold can increase the precision of the output at the expense of recall (see Figure 8).