The authors have declared that no competing interests exist.
Contributed to the preparation of the manuscript: HL LH VK KV. Conceived and designed the experiments: HL KV. Performed the experiments: HL. Analyzed the data: HL LH VK. Contributed reagents/materials/analysis tools: LH. Wrote the paper: HL.
The biomedical text mining community has focused on developing techniques to automatically extract important relations between biological components and semantic events involving genes or proteins from literature. In this paper, we propose a novel approach for mining relations and events in the biomedical literature using approximate subgraph matching. Extraction of such knowledge is performed by searching for an approximate subgraph isomorphism between key contextual dependencies and input sentence graphs. By introducing error tolerance into the graph matching process, our approach significantly increases the chance of retrieving relations or events encoded within complex dependency contexts, while maintaining the extraction precision at a high level. When evaluated on practical tasks, it achieves a 51.12% F-score in extracting nine types of biological events on the GE task of BioNLP-ST 2011 and an 84.22% F-score in detecting protein-residue associations. This performance is comparable to that of the reported systems on these tasks, demonstrating the generalizability of our proposed approach.
Systems biology investigates the complex interactions between various components of biological systems, and the consequential impacts of these interactions on the function and behavior of the systems. Text mining of the biomedical literature has been shown to be an effective way of automatically extracting important relations between biological components such as protein-protein interactions (PPI)
While a relation generally involves a pair of entities with different participating roles, linked by a semantic relation type, an event typically captures the association of multiple participants of varying numbers and with diverse semantic roles
Graphs provide a flexible structure to represent a network and naturally describe the interactions between its components. Therefore, they are a powerful primitive for modeling relations and events. In this work, we take advantage of dependency graphs that capture syntactic relations in sentences of natural language text, based on state-of-the-art natural language parsers that can achieve accuracies in the 80–90% range
The feature-based approach encodes node tokens, edge labels and path structures of variable depths of a dependency graph as syntactic features, together with lexical features such as morphological characteristics and bag-of-word frequencies of token texts, to feed learning algorithms
On the other hand, graph matching-based techniques that directly operate on dependency graphs have also proven effective for information extraction tasks in both general English and biomedical domains. A dependency graph matching module was introduced to compute the text relatedness between student answers and correct answers in assisting the automatic grading of student answers
More recently, we proposed an approach based on exact subgraph matching (ESM) for mining various relations and events from literature in the biomedical domain
However, the overall performance of our ESM-based approach is limited by lower coverage, with an 11% recall deficit contributing to the 7.3% F-score difference with the best individual system. Careful error analysis suggests that the syntactic dependencies encoded in the rules are not sufficient to capture the variety of textual surface forms used to express biological processes. We attribute this problem to the inherent, restrictive property of the exact subgraph matching algorithm that strictly requires that all nodes and connections between nodes in one graph find their injective matches in the other. Although ensuring a high precision, this requirement does not allow partial matching, and therefore limits the generalization potential of the graph representation of rules, leading to the lower recall. In this work, we introduce a novel approach for relation and event extraction based on approximate subgraph matching (ASM). By including a certain degree of error tolerance into the graph matching process, the approach increases the chance of retrieving relational knowledge encoded within complex dependency contexts, while maintaining the extraction precision at a high level. We have successfully applied it in two biological relation/event extraction tasks, achieving results competitive with the state-of-the-art methods, demonstrating the generalizability of our proposed approach.
The rest of the paper is organized as follows: In Section 2, we review recent research advances in mining biological relations and events. Section 3 describes our ASM-based event extraction approach. Section 4 demonstrates two applications in which our approach has been successfully applied. Finally, Section 6 summarizes the paper and introduces future work.
With state-of-the-art protein annotation methods achieving a reasonable 88% F-score
Airola
In addition to binary relations, the BioNLP-ST 2009 shared task included a more ambitious task of detecting complex, nested event structures. It successfully drew interest from 24 teams and has since served as the platform for many studies on event extraction.
The Turku Event Extraction System (TEES) used multi-class SVM classifiers incorporating a wide array of features capturing both linear and dependency contexts to extract arguments of biological events
BioNLP-ST 2011 extended BioNLP-ST 2009, addressing a wider range of text types, event types, and subject domains. Riedel
As the only rule-based system among the top 5 systems of BioNLP-ST 2009, the “ConcordU” team carefully analyzed 2,000 automatically derived dependency relation paths involved in expressing biological events, and manually coded 27 dependency path patterns which were then applied sequentially to identify event participants
As one of the participating teams in BioNLP-ST 2011, we proposed an exact subgraph matching (ESM)-based method for event extraction
An index-based approximate subgraph matching tool
In this section, we first introduce the framework of our ASM-based approach. We then describe in detail the core components of the framework in the context of biological event extraction. Next, we formally describe our ASM algorithm and analyze its complexity. Finally, we compare the ASM with existing graph distance/similarity metrics in terms of the different aspects considered in the process of graph comparison.
Interactions among biological entities are expressed in various ways in the biomedical literature. The underlying assumption of our approach is that the contextual dependencies of each stated biological relation or event represent a typical context for such events in the biomedical literature. Our approach falls into the machine learning category of instance-based reasoning
Several standard preprocessing steps are first completed on both training and testing data. These include sentence segmentation and tokenization, Part-of-Speech (POS) tagging, and syntactic parsing that produces dependency graphs for sentences
The two BioNLP shared tasks focused on the recognition of biological events from the literature, in a setting where protein mentions are provided in the input
Event rules are learned automatically using the following method. Starting with the dependency graph of each training sentence, for each annotated event, the shortest dependency path connecting the event trigger to each event argument in the undirected version of the graph is selected. While additional information such as individual words in each sentence (bag-of-words), sequences of words (n-grams) and semantic concepts is typically used in the state-of-the-art supervised learning-based systems to cover a broader context
While the dependencies of such paths are used as the graph representation of the event, a detailed description records the participants of the event, their semantic role labels and the associated nodes in the graph. All participating biological entities are replaced with a single tag, e.g. “BIO_Entity”, to ensure generalization of the learned rules. As a result, each annotated event is generalized and transformed into a generic graph-based rule. Algorithm 1 shows the details of the rule induction. The resulting event rules are categorized into different target event types.
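Algorithm 1: Rule induction. Helper functions used in the pseudocode:
unDirected() transforms the directed dependency graph into its undirected version.
shortestPath() finds the shortest path(s) between trigger and argument in the undirected graph.
directed() retrieves the original directed dependencies of the selected path(s).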
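The rule induction steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the representation of dependencies as (label, governor, dependent) triples, the function names `shortest_path` and `induce_rule`, and the choice to keep the token position in the generalized "BIO_Entity" node name are all hypothetical.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Breadth-first search over the undirected view of a dependency graph.

    edges: iterable of (label, governor, dependent) triples (assumed format).
    Returns the list of nodes from start to goal, or None if unreachable.
    """
    neighbours = {}
    for _label, gov, dep in edges:
        neighbours.setdefault(gov, set()).add(dep)
        neighbours.setdefault(dep, set()).add(gov)
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

def generalise(token, entity_tokens):
    """Replace an entity token with a generic tag, keeping its position suffix."""
    if token in entity_tokens:
        return "BIO_Entity-" + token.rpartition("-")[2]
    return token

def induce_rule(edges, trigger, arguments, entity_tokens):
    """Union of trigger-to-argument shortest paths with original edge directions."""
    kept = set()
    for arg in arguments:
        path = shortest_path(edges, trigger, arg)
        if path is None:
            continue  # argument not connected to the trigger in this sentence
        steps = set(zip(path, path[1:]))
        for label, gov, dep in edges:
            # Recover the original directed dependency for each undirected step.
            if (gov, dep) in steps or (dep, gov) in steps:
                kept.add((label, gov, dep))
    return sorted((label, generalise(g, entity_tokens), generalise(d, entity_tokens))
                  for label, g, d in kept)
```

Calling `induce_rule` with a trigger and several arguments yields the union of the individual paths, while calling it with a single argument yields one trigger-to-argument path.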
For simple events such as
In this work, for complex events, in addition to computing dependency path unions, individual dependency paths connecting triggers to each argument are also considered to determine event arguments independently. If the resulting arguments share the same event trigger, they are grouped together to form a potential event. In fact, similar approaches were attempted in both BioNLP shared tasks, and have been proven successful by the best-performing systems
Rule ID | Type | Trigger | Theme | Cause | Graph representation |
E1a | Pos. reg. | lead-20/VBP | Phosphorylation: phosphorylation-23/NN | Binding: ligation-6/NN | nsubj(lead-20/VBP, ligation-6/NN); prep_to(lead-20/VBP, phosphorylation-23/NN) |
E1b | Pos. reg. | lead-20/VBP | Phosphorylation: phosphorylation-23/NN | Binding: ligation-6/NN | rcmod(ligation-6/NN, lead-20/VBP); prep_to(lead-20/VBP, phosphorylation-23/NN) |
E1c | Pos. reg. | lead-20/VBP | Phosphorylation: phosphorylation-23/NN | | prep_to(lead-20/VBP, phosphorylation-23/NN) |
E1d | Pos. reg. | lead-20/VBP | | Binding: ligation-6/NN | nsubj(lead-20/VBP, ligation-6/NN) |
E1e | Pos. reg. | lead-20/VBP | | Binding: ligation-6/NN | rcmod(ligation-6/NN, lead-20/VBP) |
Event extraction is achieved by matching the induced rules to each testing sentence and applying the descriptions of rule tokens to the corresponding sentence tokens. Since rules and sentences all possess a graph representation, event recognition becomes a subgraph matching problem. In this work, we introduce a novel
An event rule graph
The subgraph distance computes the cost of transforming a subgraph of the sentence graph into the rule graph, and is proposed to be the weighted summation of three penalty-based measures for a candidate match between the two graphs. The measure
The weights
Compared to binary relation extraction tasks, the challenge of event extraction lies in the aim of recognizing complex and nested events. For instance, simple events can serve as arguments of complex events, and complex events themselves may also act as participants of other complex events. Therefore, an iterative, bottom-up matching process is proposed in this work.
Starting with the extraction of simple events, simple event rules are first matched with a testing sentence. Next, as potential arguments of higher level events, obtained simple events continue to participate in the subsequent matching process between complex event rules and the sentence to initiate the iterative process for detecting complex events with nested structures. The process terminates when there is no new candidate event generated for the testing sentence.
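The iterative, bottom-up process can be sketched as a fixed-point loop. The sketch below is only an illustration: `extract_events` and the `match_rules` callback are hypothetical names, and for brevity every rule is re-matched on each pass rather than separating simple and complex rule sets as described above.

```python
def extract_events(match_rules, sentence, proteins):
    """Repeat rule matching until no new candidate event is generated.

    match_rules(sentence, arguments) is an assumed callback that returns the
    events whose arguments can all be drawn from `arguments` (given proteins
    plus the events found so far). Events must be hashable, e.g. tuples.
    """
    events = set()
    while True:
        available = frozenset(proteins) | events
        new_events = set(match_rules(sentence, available)) - events
        if not new_events:
            return events  # termination: this pass produced nothing new
        events |= new_events

def toy_matcher(sentence, arguments):
    """Toy stand-in for sentence matching: one simple and one nested rule."""
    found = set()
    if "P1" in arguments:
        found.add(("Phosphorylation", "P1"))
    if ("Phosphorylation", "P1") in arguments:
        found.add(("Positive_regulation", ("Phosphorylation", "P1")))
    return found
```

With the toy matcher, the first pass can only produce the simple event, the second pass uses it as an argument of the nested event, and the third pass finds nothing new and stops.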
In Section “Rule Induction” we showed that the graph representation of our induced rules, even for complex events, is “simple” in the sense that higher-order constructs are not explicitly encoded in the representation (see
Finally, post-processing is performed to transform raw sentence matching results into the required format according to the event extraction task.
Typical of instance-based reasoners, the accuracy of rules with which to compare an unseen sentence is crucial to the success of our approach. As observed in
Therefore, we measured the accuracy of each rule
Because of nested event structures, the removal of some rules might incur a propagating effect on rules relying on them to produce arguments for the extraction of higher order events. Therefore, an iterative rule set optimization process, in which each iteration performs sentence matching, rule ranking and rule removal sequentially, is conducted, leading to a converged, optimized rule set. While the ASM algorithm aims to extract more potential events, this performance-based evaluation component ensures the precision of our event extraction framework.
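A minimal sketch of such an iterative optimization loop is given below. It rests on assumptions that are not taken from the paper: the accuracy of a rule is taken to be the fraction of its predictions that are correct, rules below a hypothetical `min_accuracy` cut-off are removed, and `run_matching` is a placeholder callback for sentence matching and evaluation.

```python
def accuracy(correct, predicted):
    """Fraction of correct predictions; rules that predict nothing are kept."""
    return correct / predicted if predicted else 1.0

def optimise_rules(rules, run_matching, min_accuracy=0.25):
    """Iterate matching, scoring and removal until the rule set converges.

    run_matching(rules) is an assumed callback returning a dict that maps
    each rule to a (correct, predicted) pair for the current rule set.
    min_accuracy is a hypothetical threshold chosen for illustration.
    """
    rules = set(rules)
    while True:
        stats = run_matching(rules)
        scores = {r: accuracy(*stats.get(r, (0, 0))) for r in rules}
        weak = {r for r, score in scores.items() if score < min_accuracy}
        if not weak:
            return rules  # converged: no rule was removed in this iteration
        rules -= weak     # removal may change how dependent rules perform
```

Re-running the matching after every removal is what lets the loop account for the propagating effect on rules that rely on the output of removed rules.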
The subgraph matching problem is NP-complete
The algorithm starts by finding the start nodes for matching. Each rule is allowed only one start node, while each sentence can have a set of candidate start nodes. Two scenarios are considered. First, if the rule contains at least one “BIO_Entity” token, the “BIO_Entity” token with the lowest token number becomes the start node of the rule; this choice does not reduce the set of solutions found. Correspondingly, every “BIO_Entity” token in the sentence becomes a candidate start node for the sentence. Second, if the rule does not contain any “BIO_Entity” token, the token with the lowest token number becomes the start node of the rule, and every token in the sentence becomes a candidate start node. The second scenario applies to regulatory event rules that use only sub-events as arguments.
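The two start-node scenarios can be written down compactly. In this sketch the function name `start_nodes` and the representation of tokens as (token_number, text) pairs are assumptions made for illustration.

```python
def start_nodes(rule_tokens, sentence_tokens):
    """Pick the single rule start node and the candidate sentence start nodes.

    Tokens are (token_number, text) pairs (assumed format); token numbers are
    unique within a graph, so min() selects the lowest-numbered token.
    """
    rule_entities = [t for t in rule_tokens if t[1] == "BIO_Entity"]
    if rule_entities:
        # Scenario 1: anchor on the lowest-numbered entity token of the rule
        # and try every entity token of the sentence.
        rule_start = min(rule_entities)
        candidates = [t for t in sentence_tokens if t[1] == "BIO_Entity"]
    else:
        # Scenario 2: no entity in the rule, so every sentence token is tried.
        rule_start = min(rule_tokens)
        candidates = list(sentence_tokens)
    return rule_start, candidates
```

Restricting the candidates to entity tokens in the first scenario is what keeps the number of matching attempts per sentence small when entities are present.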
The
When comparing two graph nodes in the
Algorithm: approximate subgraph matching (main procedure). Helper functions used in the pseudocode:
matchNode() checks whether an injective match exists between two nodes, assessing whether the nodes can be matched using node features.
combinMatching() recursively generates all candidate node matching schemes.
Let us assume that the sentence graph
However, we have observed that the algorithm is relatively efficient in practice and we have successfully run it on several event and relation extraction tasks. We show that this efficient performance in practice can be expected. First, on average there are about 24 words in a sentence in the biomedical text
Supporting procedures for the subgraph distance (pseudocode summary):
Label comparison: create two empty stacks; push the Label() output for each of the two graphs, where Label() returns all labels on the shortest path between nodes; diffLabel() returns the number of different labels between the two stacks.
Direction comparison: create two empty stacks; push the Direction() output for each of the two graphs, where Direction() returns all directions on the shortest path between nodes; diffDirect() returns the number of different directions between the two stacks.
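To make the weighted subgraph distance concrete, the following Python sketch combines three penalties for a given injective node mapping. It is an interpretation rather than the published definition: the structural penalty is assumed to be the difference in shortest-path length between matched node pairs, the label and direction penalties are assumed to be multiset differences over the labels and directions on those paths, and the names `path_features`, `multiset_diff`, `subgraph_distance` and the weights `ws`, `wl`, `wd` are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations

def path_features(edges, start, goal):
    """Shortest undirected path start->goal as a list of (label, direction) steps.

    edges are (label, governor, dependent) triples (assumed format); the
    direction is "down" when a step follows an edge and "up" when it goes
    against it. Returns None if the nodes are not connected.
    """
    adjacency = {}
    for label, gov, dep in edges:
        adjacency.setdefault(gov, []).append((dep, label, "down"))
        adjacency.setdefault(dep, []).append((gov, label, "up"))
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            steps = []
            while parent[node] is not None:
                node, label, direction = parent[node]
                steps.append((label, direction))
            return steps[::-1]
        for nxt, label, direction in adjacency.get(node, []):
            if nxt not in parent:
                parent[nxt] = (node, label, direction)
                queue.append(nxt)
    return None

def multiset_diff(a, b):
    """Number of items that occur in one multiset but not in the other."""
    ca, cb = Counter(a), Counter(b)
    return sum(((ca - cb) + (cb - ca)).values())

def subgraph_distance(rule_edges, sent_edges, mapping, ws=1, wl=1, wd=1):
    """Weighted sum of structural, label and direction penalties.

    mapping: injective dict from rule nodes to sentence nodes.
    """
    struct = label = direction = 0
    for a, b in combinations(sorted(mapping), 2):
        rule_path = path_features(rule_edges, a, b)
        sent_path = path_features(sent_edges, mapping[a], mapping[b])
        if rule_path is None or sent_path is None:
            return None  # unconnected nodes: not a candidate match
        struct += abs(len(rule_path) - len(sent_path))
        label += multiset_diff([l for l, _ in rule_path], [l for l, _ in sent_path])
        direction += multiset_diff([d for _, d in rule_path], [d for _, d in sent_path])
    return ws * struct + wl * label + wd * direction
```

Under these assumptions an exact match has distance 0, and a candidate match would be accepted when its distance does not exceed the threshold set for the event type.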
While the cost function of the ASM measures the subgraph distance between two graphs, graph kernels directly compute the similarity between the graphs. Since a distance function can be converted straightforwardly into a similarity measure, we briefly compare the ASM with some existing graph kernel metrics in terms of the different aspects considered in the process of graph comparison.
The edit distance kernel
The dependency kernel
The all-paths graph (APG) kernel
In fact, all existing graph kernels are developed to facilitate the extraction of binary relationships, i.e., to help SVM make a decision on whether a co-occurrence of two entities bears a pre-defined relation type. The ASM targets a broader problem definition and is able to identify various components of a relation or event, such as predicate of a relation, and trigger or various themes of an event. However, in order to perform a direct, fair comparison between the ASM and existing graph kernel metrics, the ASM has to also be kernelized. This will allow the ASM to not only take advantage of the capability of SVM that implicitly explores a high dimensional feature space, but also be compared with existing kernels on the same IE tasks. We plan to explore the use of ASM in a graph kernel in future work.
In this section, we evaluate the proposed ASM-based approach on two biomedical applications: BioNLP shared tasks, and Protein-Residue association detection.
We use the dataset of the GENIA Event (GE) task of BioNLP-ST 2011, including training, development and testing sets. This dataset subsumes the BioNLP-ST 2009 dataset of biomedical journal abstracts, but adds full-text articles. Genes and gene products are pre-annotated as “Proteins” and provided in the dataset. The event annotation is only available for training and development sets.
Attributes Counted | Training | Development | Testing |
Abstracts+Full articles | 908 (5) | 259 (5) | 347 (4) |
Sentences | 8,759 | 2,954 | 3,437 |
Proteins | 11,625 | 4,690 | 5,301 |
Total events | 10,287 | 3,243 | 4,457 |
Sentence-based events | 9,583 | 3,058 | hidden |
The GE task includes 9 different event types. Since each type possesses its own event contexts, an individual threshold
Our GA works with a population of potential parameter settings. Parameter values are encoded as integers within a predefined range, [0, 50]. For each potential setting, the fitness function of the GA performs sentence matching between rules learned from the training set and sentences of the development set, evaluates the corresponding event extraction performance on the development set using the provided gold event annotation, and returns the resulting F-score. The GA iterates the fitness function with the goal of maximizing the F-score on the development data.
The GA starts with a randomly generated population of 100 potential parameter settings and evolves for 50 generations, each consisting of a population of 100 settings. The number of generations and the population size were chosen with the runtime cost of evaluating the fitness function in mind, since a larger value of either would make the evaluation expensive.
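The search procedure can be illustrated with a small genetic algorithm over integer-encoded settings. The population size, number of generations and value range follow the text; the selection, crossover and mutation operators are not specified here, so truncation selection, one-point crossover, elitism and a 10% per-gene mutation rate are assumptions, as is the function name `genetic_search`. In practice `fitness` would run sentence matching on the development set and return the F-score.

```python
import random

def genetic_search(fitness, n_params, pop_size=100, generations=50,
                   low=0, high=50, mutation_rate=0.1, seed=0):
    """Maximize fitness over integer parameter vectors within [low, high]."""
    rng = random.Random(seed)
    population = [[rng.randint(low, high) for _ in range(n_params)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]   # truncation selection (assumed)
        children = [scored[0][:]]          # elitism: carry over the best setting
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randint(1, n_params - 1) if n_params > 1 else 0
            child = mum[:cut] + dad[cut:]  # one-point crossover (assumed)
            child = [rng.randint(low, high) if rng.random() < mutation_rate else gene
                     for gene in child]
            children.append(child)
        population = children
    return max(population, key=fitness)
```

Because every call to `fitness` would involve a full matching run over the development set, the total number of evaluations (population size times generations) dominates the runtime, which is why both values are kept moderate.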
Parameter | Value | Parameter | Value |
7 | 3 | ||
5 | 3 | ||
7 | 3 | ||
10 | 10 | ||
10 | 10 | ||
7 | 10 |
Following the proposed framework, rules are first induced from both training and development sets. The resulting rule set is then optimized and matched with the testing sentences using the ASM with the above parameter setting and node matching criteria “P*+L”. The graph-based rules are distributed over the nine event types shown in
Event type | No. of event rules |
Gene_expression | 2,438 |
Transcription | 479 |
Protein_catabolism | 130 |
Phosphorylation | 282 |
Localization | 281 |
Binding | 1,651 |
Regulation | 1,487 |
Positive_regulation | 4,626 |
Negative_regulation | 1,619 |
TOTAL | 12,993 |
Event type(No. of events) | Recall(%) | Precision(%) | F-score(%) |
Gene_expression (1002) | 68.66 | 85.36 | 76.11 |
Transcription (174) | 47.13 | 76.64 | 58.36 |
Protein_catabolism (15) | 53.33 | 100.00 | 69.57 |
Phosphorylation (185) | 80.00 | 71.15 | 75.32 |
Localization (191) | 45.55 | 75.65 | 56.86 |
[SVT-TOTAL] (1567) | 64.65 | 81.43 | 72.07 |
Binding (491) | 35.44 | 54.55 | 42.96 |
[EVT-TOTAL] (2058) | 57.68 | 75.94 | 65.56 |
Regulation (385) | 22.34 | 42.16 | 29.20 |
Positive_regulation (1443) | 33.75 | 54.66 | 41.73 |
Negative_regulation (571) | 28.55 | 39.95 | 33.30 |
[REG-TOTAL] (2399) | 30.68 | 48.97 | 37.72 |
[ALL-TOTAL] (4457) | 43.15 | 62.72 | 51.12 |
System | SVT F-score | BIND F-score | REG F-score | TOTAL Recall | TOTAL Precision | TOTAL F-score |
UMass | 73.50 | 48.79 | 43.82 | 48.49 | 64.08 | 55.20 |
UTurku | 72.11 | 43.28 | 42.72 | 49.56 | 57.65 | 53.30 |
MSR-NLP | 71.54 | 41.39 | 40.02 | 48.64 | 54.71 | 51.50 |
ASM (this work) | 72.07 | 42.96 | 37.72 | 43.15 | 62.72 | 51.12 |
ConcordU | 70.52 | 36.88 | 40.16 | 43.55 | 59.58 | 50.32 |
UWMadison | 68.70 | 36.88 | 40.37 | 42.56 | 61.21 | 50.21 |
Stanford | 70.88 | 44.34 | 35.21 | 42.36 | 61.08 | 50.03 |
Exact subgraph matching | 68.47 | 36.21 | 36.01 | 37.45 | 66.41 | 47.89 |
System | SVT F-score | BIND F-score | REG F-score | TOTAL Recall | TOTAL Precision | TOTAL F-score |
UMass | 71.54 | 50.76 | 45.51 | 48.74 | 65.94 | 56.05 |
UTurku | 70.36 | 47.50 | 44.30 | 50.06 | 59.48 | 54.37 |
MSR-NLP | 70.08 | 43.86 | 40.85 | 48.52 | 56.47 | 52.20 |
ASM (this work) | 70.07 | 43.21 | 38.78 | 42.80 | 64.73 | 51.53 |
Stanford | 69.29 | 47.57 | 36.09 | 42.55 | 62.69 | 50.69 |
UWMadison | 65.13 | 43.21 | 41.08 | 42.17 | 62.30 | 50.30 |
ConcordU | 67.75 | 37.41 | 40.96 | 43.09 | 60.37 | 50.28 |
Exact subgraph matching | 64.78 | 41.55 | 36.68 | 36.77 | 68.86 | 47.94 |
Our approximate subgraph matching-based method achieves an overall 51.12% F-score on the GE task testing data, including both abstracts and full-text papers. Considering that “MSR-NLP”
Compared with the exact subgraph matching scenario, the ASM yields a recall gain of nearly 6% while still maintaining precision at a high level, leading to a substantial 3.2% increase in F-score. However, a recall deficit of about 5% between the ASM and the top two systems is still observed. Careful error analysis reveals that the difference comes primarily from the extraction of complex events. Specifically, only 23% of the cause arguments for regulatory events that contain both theme and cause (as in
We attribute the missed event arguments to two main causes. First, the shortest dependency paths represented in rules are accurate for inferring the mutual relationships between tokens, but are sometimes not sufficient to cover all possible linguistic contexts of multi-participant events. When relevant event components are missing from a rule, the corresponding events cannot be identified, even though the ASM attempts to maximize the generalization potential of rules. As a result, the compound effect of one missing theme of a three-theme
Second, the current implementation of the injective mapping requirement of the ASM algorithm constrains further generalization of rules. Currently, “P*+L” is used as the matching criterion, requiring that the relaxed POS tags and the lemmatized forms of tokens be identical when comparing non-“BIO_Entity” nodes in the two graphs. “P*” provides shallow syntactic information but would be too general if used as a standalone criterion. “L” is added to provide specificity. However, although somewhat abstracted from the original surface tokens, lemmas are constrained to match at the word level. For further relaxation of node matching, ontology-based, concept-level generalization is necessary. For instance, when “lysine” appears as a rule node, the ASM could allow all amino acids to match with it instead of only looking for this specific residue.
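The “P*+L” criterion can be illustrated with a short node-comparison function. How POS tags are relaxed is not spelled out in this section, so the sketch assumes that relaxation collapses a Penn Treebank tag to its first two characters (for example VBZ and VBP both become VB); the dictionary-based node format and the names `relaxed_pos` and `match_node` are likewise hypothetical.

```python
def relaxed_pos(tag):
    """Assumed relaxation: keep only the first two characters of the POS tag."""
    return tag[:2]

def match_node(rule_node, sentence_node):
    """Compare two nodes given as dicts with 'text', 'lemma' and 'pos' keys.

    Entity nodes only match other entity nodes; all other nodes must agree on
    both the relaxed POS tag ("P*") and the lemma ("L").
    """
    rule_is_entity = rule_node["text"] == "BIO_Entity"
    sentence_is_entity = sentence_node["text"] == "BIO_Entity"
    if rule_is_entity or sentence_is_entity:
        return rule_is_entity and sentence_is_entity
    return (relaxed_pos(rule_node["pos"]) == relaxed_pos(sentence_node["pos"])
            and rule_node["lemma"] == sentence_node["lemma"])
```

A concept-level relaxation of the kind suggested above would replace the lemma equality test with a lookup in an ontology, so that, for instance, any amino acid name could match a rule node for “lysine”.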
One way to improve the recall of ASM is to provide it with more training data. This can potentially be accomplished through the use of the
While
Since the gold event annotation of the GE task testing data is hidden to the public, our statistical test is performed on the development data. The 259 documents are randomly divided into 10 groups with 26 documents in 9 groups and 25 documents in the last. Each group is evaluated independently by both optimized ASM and ASM (
WilcoxonTest | ASM | ASM | ASM( |
ASM( |
|
Recall | 42.26 | 5.71 | 36.62 | 5.52 | 0.002 |
Precision | 69.39 | 5.19 | 72.93 | 9.14 | 0.037 |
F-score | 52.40 | 5.52 | 48.51 | 5.70 | 0.002 |
The test confirms that the recall and F-score increases from the ASM method itself are statistically significant, as evidenced by the 0.002
In three-dimensional protein structures, the appearance of certain amino acid residues at key structural positions plays a central role in protein function, for instance enabling ligand or substrate binding. For proteins of therapeutic importance, identifying these protein residues as potential targets is a key early step in drug design. Text mining has been shown to play an important role in such protein function prediction
Instead of manually curated annotations, sentences that contain high confidence protein-residue relationships are prepared via distant supervision using Protein Data Bank (PDB) as the biological knowledge source to drive relation extraction learning. Sentences in which at least one protein and one amino acid co-occur are selected from 18,045 abstracts of the primary references for the PDB entries. These sentences are further filtered to retain only those that contain physically validated relationships, i.e., the protein-residue co-occurrence can be substantiated by a physical match of the particular residue to the mentioned protein according to its PDB record (see
Attributes Counted | No. of instances |
Total abstracts | 18,045 |
Total sentences | 138,790 |
Sentences with co-mentions of protein and residue | 5,256 |
Physically validated protein-residue relations | 2,814 |
Association rules are induced from sentences for 2,216 physically validated relationships by extracting the shortest paths connecting association arguments. The rule set optimization process involves only one iteration as the task does not contain relations with nested structures. An empirical parameter setting for the ASM is used throughout our experiments in which the three distance function weights are
When evaluated against the remaining 598 physically validated relationships, the ASM with the above parameter setting achieved an 84.22% F-score in extracting protein-residue associations, with an 86.62% recall and an 81.96% precision. The system surpasses a co-occurrence baseline method that assumes a relation when one protein and an amino acid are mentioned together in texts, and a run of the ASM with
System | Recall(%) | Precision(%) | F-score(%) |
Co-occurrence baseline | 100.00 | 62.42 | 76.86 |
78.43 | 83.60 | 80.93 | |
86.62 | 81.96 | 84.22 |
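For reference, the F-scores in the tables are the harmonic mean of recall and precision. The short sketch below (the function name `f_score` is chosen for illustration) recomputes this value; because the reported recall and precision are themselves rounded, recomputed values can differ from the reported F-scores in the second decimal place.

```python
def f_score(recall, precision):
    """Harmonic mean of recall and precision (both given as percentages)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Recompute the scores of the co-occurrence baseline and of the ASM.
baseline = f_score(100.00, 62.42)
asm = f_score(86.62, 81.96)
print(f"baseline: {baseline:.2f}, ASM: {asm:.2f}")
```

The baseline illustrates why recall alone is uninformative here: assuming a relation for every co-mention gives perfect recall, but the low precision pulls the F-score below that of the ASM.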
Distant supervision helps to relax the reliance of rule induction on curated annotations. Taking advantage of a much broader set of training instances, more rules are reliably learned to cover diverse relation contexts, thus improving the overall coverage of our approach. While distant supervision has been shown effective for system development in relation extraction in the general English domain
In this paper, we proposed a novel approximate subgraph matching-based approach for extracting relational knowledge from biomedical literature. By introducing a certain degree of error tolerance into the graph matching process, our approach increases the chance of retrieving relations or events encoded within complex dependency contexts, while maintaining the extraction precision at a high level. Our approach has been successfully applied to two relation and event extraction tasks. We report results of 51.12% F-score in extracting nine types of biological events of the BioNLP-ST 2011 task and 84.22% F-score in detecting protein-residue associations, demonstrating the generalizability of our approach. In addition, we investigated the complexity of the proposed algorithm, and compared it with existing related graph distance/similarity metrics.
Our approach has a number of advantageous features. First, characterized by high precision, our approach is a preferable choice when accurate information about biological processes is emphasized. It works particularly well on extracting binary relations (including events containing only two participants) with training data where biological entities of the target relation are pre-annotated. Second, although already possessing a reasonable recall, the coverage of the approach can be further increased by integrating distant supervision. Meanwhile, rules learned from co-mentions of pairs of entities known to interact are not prone to over-fitting to an annotated training corpus, thus they are more generalizable across different datasets
In future work, we are interested in extending the proposed subgraph matching algorithm into a graph kernel to be integrated into an SVM, so that we can take advantage of the capability of state-of-the-art supervised learning methods and compare directly with existing graph kernel metrics on common information extraction tasks. We would also like to explore alternative, linguistically based methods to relax the current labelDist measure. Currently, the labelDist measure in the ASM subgraph distance function uses a simple strategy that tracks all different edge labels on the compared paths in two graphs. For instance, even though “prep_of(increase, immunoreactivity)” in a rule has the same meaning as “prep_in(increase, immunoreactivity)” in a sentence, because “prep_of” differs from “prep_in” in form, labelDist will record a difference of two labels, resulting in a larger labelDist score. Some approaches have been developed to prune or collapse dependency graphs by unifying labels that are equivalent in meaning in order to simplify the graphs
The authors thank Dr. John Wilbur for providing valuable feedback to the manuscript.
This research was supported in part by the Intramural Research Program of the NIH, NLM.