
Learning the Structure of Biomedical Relationships from Unstructured Text

  • Bethany Percha,

    Affiliation Biomedical Informatics Training Program, Stanford University, Stanford, California, United States of America

  • Russ B. Altman

    russ.altman@stanford.edu

    Affiliation Departments of Medicine, Genetics and Bioengineering, Stanford University, Stanford, California, United States of America

Abstract

The published biomedical research literature encompasses most of our understanding of how drugs interact with gene products to produce physiological responses (phenotypes). Unfortunately, this information is distributed throughout the unstructured text of over 23 million articles. The creation of structured resources that catalog the relationships between drugs and genes would accelerate the translation of basic molecular knowledge into discoveries of genomic biomarkers for drug response and prediction of unexpected drug-drug interactions. Extracting these relationships from natural language sentences on such a large scale, however, requires text mining algorithms that can recognize when different-looking statements are expressing similar ideas. Here we describe a novel algorithm, Ensemble Biclustering for Classification (EBC), that learns the structure of biomedical relationships automatically from text, overcoming differences in word choice and sentence structure. We validate EBC's performance against manually-curated sets of (1) pharmacogenomic relationships from PharmGKB and (2) drug-target relationships from DrugBank, and use it to discover new drug-gene relationships for both knowledge bases. We then apply EBC to map the complete universe of drug-gene relationships based on their descriptions in Medline, revealing unexpected structure that challenges current notions about how these relationships are expressed in text. For instance, we learn that newer experimental findings are described in consistently different ways than established knowledge, and that seemingly pure classes of relationships can exhibit interesting chimeric structure. The EBC algorithm is flexible and adaptable to a wide range of problems in biomedical text mining.

Author Summary

Virtually all important biomedical knowledge is described in the published research literature, but Medline currently contains over 23 million articles and is growing at the rate of several hundred thousand new articles each year. In this environment, we need computational algorithms that can efficiently extract, aggregate, annotate and store information from the raw text. Because authors describe their results using natural language, descriptions of similar phenomena vary considerably with respect to both word choice and sentence structure. Any algorithm capable of mining the biomedical literature on a large scale must be able to overcome these differences and recognize when two different-looking statements are saying the same thing. Here we describe a novel algorithm, Ensemble Biclustering for Classification (EBC), that learns the structure of drug-gene relationships automatically from the unstructured text of biomedical research abstracts. By applying EBC to the entirety of Medline, we learn from the structure of the text itself approximately 20 key ways that drugs and genes can interact, discover new facts for two biomedical knowledge bases, and reveal rich and unexpected structure in how scientists describe drug-gene relationships.

Introduction

Biomedical research generates text at an incredible rate. Each year, several hundred thousand new articles enter Medline from over 5,500 unique journals [1, 2]. The literature’s rapid growth and the rise of interdisciplinary domains like bioinformatics and systems biology are changing how the scientific community interacts with this important resource. Knowledge bases like OMIM [3], DrugBank [4] and PharmGKB [5] manually curate and restructure information from the literature to increase its accessibility to researchers and clinicians. These knowledge bases capture cross-sectional “slices” of the literature, drawing connections among facts reported in different journals, at different times, and in different research domains. Often, they examine the literature in ways not easily captured by current indexing strategies, such as MeSH terms or key words.

As the literature grows and the information we need to extract increases in complexity, full manual curation of these knowledge bases is rapidly becoming infeasible. Progress in natural language processing (NLP) has encouraged the development of automated and semi-automated methods for enabling more efficient curation of biomedical text [6–9], especially as biomedical research begins to explore even larger text-based resources, such as electronic medical records (EMRs) [10, 11]. However, tasks that are simple for human readers, such as recognizing when two different-looking statements mean the same thing, or when one statement is a more general version of another statement, are often extremely challenging for NLP algorithms. One way around this problem is to infer the meaning of words and phrases by examining their usage patterns in large, unlabeled text corpora, an approach called “distributional semantics” [12–14]. If two words or phrases are used in similar contexts, they are likely to be semantically related.

Here we introduce a novel algorithm, called Ensemble Biclustering for Classification (EBC), that applies this strategy to uncover relationships between biomedical entities, such as drugs, genes and phenotypes. We focus on the problem of drug-gene relationship extraction and characterization from unstructured biomedical text, using statistical dependency parsing to extract descriptions of drug-gene relationships from Medline sentences and applying EBC to recognize when two drug-gene pairs share a similar relationship, even when they are described differently in the text. We show that EBC significantly improves our ability to extract both pharmacogenomic and drug-target relationships, and use it to discover new drug-gene relationships for PharmGKB and DrugBank. Finally, we combine EBC and hierarchical clustering to map the global “landscape” of drug-gene interactions, revealing much unforeseen complexity in how these relationships are described in text. We learn, for example, that there are subtle differences in how static knowledge (past discoveries) and new experimental discoveries are described, even when they refer to similar phenomena like inhibition, and that seemingly well-defined relationship classes (such as pharmacogenomic and drug-target relationships) often exhibit much more detailed chimeric structure than anticipated. More generally, we demonstrate that extracting biomedical relationships based on corpus-level usage patterns, rather than on the properties of individual sentences, helps bypass the need for large, annotated biomedical training corpora–an important property in a domain where few such corpora are available.

Results

Quantifying the variability of drug-gene descriptions in Medline sentences

The full set of abstracts from the 2013 edition of Medline contains approximately 184,000 sentences in which at least one drug name and at least one gene name are present. Many of these sentences contain multiple drug and gene names; the total number of unique drug-gene-sentence combinations is approximately 236,000.

As described in the Methods, we use dependency parsing to prune away irrelevant terms and phrases and focus attention on the parts of a drug-gene sentence most relevant to the relationship between a drug and a gene. The pruned versions of drug-gene sentences are called dependency paths. Fig 1 illustrates how dependency paths are constructed from raw sentences. Table 1 provides some common drug-gene dependency paths and associated example sentences. Details about the meanings of the individual grammatical dependencies, with examples, can be found in [15].

Fig 1. Example of a dependency graph for a Medline 2013 sentence.

(a) The raw sentence. (b) The complete dependency graph for the sentence. (c) The dependency path connecting the gene CYP3A4 with the drug rifampicin. (d) A more compact representation of the dependency path.

https://doi.org/10.1371/journal.pcbi.1004216.g001

Table 1. Selected dependency paths and representative sentences.

https://doi.org/10.1371/journal.pcbi.1004216.t001

We can quantitatively estimate the diversity of drug-gene descriptions in Medline by considering the space of all unique drug-gene dependency paths. The vast majority of dependency paths are rare, indicating high variability in how drug-gene relationships are described. The total number of unique drug-gene dependency paths in Medline is approximately 197,000, of which 7,272 (4%) connect at least two different drug-gene pairs. The total number of unique drug-gene pairs co-occurring in Medline sentences is 49,564, of which 14,052 (28.4%) share a dependency path with at least one other drug-gene pair.

Table 2 describes the two datasets used in this paper, which consist of matrices, M, in which the rows are drug-gene pairs and the columns are dependency paths. A cell of M, Mij, contains “1” if drug-gene pair i is connected by dependency path j somewhere in Medline and “0” otherwise. Both of the datasets are over 99% sparse. An important goal, therefore, must be to recognize when different-looking statements are saying the same thing. Otherwise, we can only recognize that two drug-gene pairs share a relationship if their dependency paths are identical. The details of how EBC builds connections among different dependency paths can be found in the Methods.
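To make this data structure concrete, the following minimal Java sketch (Java being the language of our EBC implementation; see Methods) builds a small binary matrix from toy (drug-gene pair, dependency path) observations. The class name, the example pairs, and the path strings are illustrative only; the actual data matrices are provided in S2 Data.

```java
import java.util.*;

// Minimal sketch: build the binary drug-gene-pair-by-dependency-path matrix M
// from (pair, path) observations. The observations below are illustrative only.
public class BuildMatrix {
    public static void main(String[] args) {
        // Each observation is one (drug-gene pair, dependency path) extracted from a sentence.
        List<String[]> observations = Arrays.asList(
            new String[]{"rifampicin|CYP3A4", "path_induces_expression"},
            new String[]{"ketoconazole|CYP3A4", "path_inhibits"},
            new String[]{"rifampicin|CYP3A4", "path_inhibits"});

        // Assign row indices to pairs and column indices to paths.
        Map<String, Integer> rowIndex = new LinkedHashMap<>();
        Map<String, Integer> colIndex = new LinkedHashMap<>();
        for (String[] obs : observations) {
            rowIndex.putIfAbsent(obs[0], rowIndex.size());
            colIndex.putIfAbsent(obs[1], colIndex.size());
        }

        // Mij = 1 if drug-gene pair i was connected by dependency path j anywhere in the corpus.
        int[][] M = new int[rowIndex.size()][colIndex.size()];
        for (String[] obs : observations) {
            M[rowIndex.get(obs[0])][colIndex.get(obs[1])] = 1;
        }
        System.out.println(Arrays.deepToString(M));
    }
}
```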

Table 2. Summary of datasets for the PGx and drug-target relation extraction tasks.

In the dense dataset, the drug-gene pairs and dependency paths represented must have occurred at least five times in Medline. In the sparse dataset, the dependency paths must have occurred at least twice, and all drug-gene pairs connected by these paths were included, even if they only occurred once.

https://doi.org/10.1371/journal.pcbi.1004216.t002

Identifying pharmacogenomic and drug-target relationships in biomedical text

We evaluated EBC’s ability to mine the literature for drug-gene pairs exemplifying two specific types of drug-gene relationships. The algorithm was given only the full, unlabeled text of Medline and a small number of drug-gene pairs that exemplified each type of relationship. We refer to the small sets of labeled drug-gene pairs (sizes 1, 2, 3, 4, 5, 10, 25, 50, and 100) as “seed sets”. No text was annotated and no specific sentences were marked as “evidence” for any particular type of relationship. The two relationship types we examined were:

  1. Pharmacogenomic (PGx) relationships. PharmGKB’s relationships database [5] contains 6283 manually-curated drug-gene associations in which polymorphisms in the gene are known to impact drug response.
  2. Drug-target relationships. DrugBank [4] maintains a list of known drug-gene relationships in which the protein product of the gene is a known target of the drug. This list contains 14,594 known relationships.

Fig 2 shows EBC’s performance extracting PGx and drug-target drug-gene pairs on the two datasets described in Table 2, and compares EBC to two alternative classifiers that do not account for the semantic relatedness of different dependency paths.

Fig 2. Classifier performance at the task of recognizing (a) PGx associations (dense matrix), (b) drug-target associations (dense matrix), (c) PGx associations (sparse matrix) and (d) drug-target associations (sparse matrix).

https://doi.org/10.1371/journal.pcbi.1004216.g002

On both datasets, and on both tasks, EBC outperforms the other classifiers by a significant margin. On the dense dataset, using seed sets of only 10 labeled drug-gene pairs as input, EBC accurately (AUC > 0.7) ranks 89.6% of test sets for the PGx task and 96.5% of test sets for the drug-target task. In comparison, using the same seed and test sets, the best-performing non-EBC classifier accurately ranks only 31.3% of test sets for the PGx task and 49.6% for the drug-target task. On the sparse dataset, EBC’s increased performance is even more pronounced. Again using only 10 labeled pairs, EBC accurately ranks 54.4% of test sets on the PGx task and 90.4% on the drug-target task, compared to 1.1% and 6.3% for the best-performing non-EBC classifier.

EBC’s raw assessments of the similarity of all drug-gene pairs in both datasets can be found in S1 Data.

Inferring connections among related descriptions based on patterns in the text

The backbone of EBC is a biclustering algorithm called Information-Theoretic Co-Clustering (ITCC; [16], see Methods). Fig 3 shows the result of one ITCC run on a small sample dataset consisting of dependency paths that connect different drugs to the gene CYP3A4 (a liver cytochrome involved in the pharmacokinetic pathways of many drugs) at least five times in Medline. This dataset contains 62 drug-gene pairs (where the gene is always CYP3A4) and 14 unique dependency paths. As with the datasets in Table 2, these are arranged in a matrix, M, where an element Mij is “1” if drug-gene pair i is connected by path j somewhere in Medline, and “0” otherwise. We used ITCC to bicluster this matrix into four row clusters and six column clusters. Besides biclustering the matrix, ITCC produces a “smoothed” version of the matrix where certain elements that were not observed in the original dataset are filled in.

Fig 3. Example of ITCC output for a small matrix consisting of drug-CYP3A4 pairs and their associated dependency paths.

The top heatmap shows the original data after the clustering was performed. An orange square represents an observed path (column) between a given drug-gene pair (row). The bottom heatmap shows the approximate distribution arising from a single ITCC run.

https://doi.org/10.1371/journal.pcbi.1004216.g003

Fig 3 illustrates that the rows fragment into four clusters that reflect distinct ways that drugs can interact with CYP3A4. Row cluster 1 contains CYP3A4 inhibitors, a few of which are also substrates. Row cluster 2 contains CYP3A4 inducers. Row clusters 3 and 4 contain substrates of CYP3A4 that are not known inhibitors. EBC combines information from thousands of different biclusterings like this one to assess the relationship similarity of any two drug-gene pairs (rows) in the matrix, by looking at how frequently they cluster together.

It is also interesting to examine which columns of the matrix cluster together, as this provides insight into how the method is working. Fig 3 shows that the dependency paths naturally fragment into clusters reflecting known biomedical properties. All of the paths referring to inhibition, for example, appear together in column cluster 2. The sole path referring to induction appears by itself in column cluster 6. The other four clusters include paths describing situations where the drug is a substrate of CYP3A4, or is metabolized by it. We see a similar pattern emerge when we examine co-clustering frequencies of the columns on a larger dataset: the dense dataset from Table 2. Table 3 shows some dependency paths from this dataset that frequently cluster together over 2000 separate runs of ITCC. Paths that frequently cluster together appear to be semantically related.

Table 3. Some dependency paths that cluster together with relatively high frequency.

https://doi.org/10.1371/journal.pcbi.1004216.t003

Mapping the semantic landscape of drug-gene interactions

EBC provides a measure of relationship similarity between every drug-gene pair and every other pair (the frequency with which each pair of rows in the data matrix cluster together). By combining these assessments with hierarchical clustering, we created the dendrogram shown in Fig 4, the details of which are described in the figure caption. Table 4 summarizes the general “themes” of the clusters from Fig 4 and includes the size of each cluster and the density of known PGx and drug-target relationships within that cluster. The cluster assignments for different slices of the dendrogram are provided in S3 Data.

Fig 4. Dendrogram illustrating the semantic relationships among 3514 drug-gene pairs.

In this dendrogram, the leaves represent 3514 drug-gene pairs that co-occur in Medline sentences at least 5 times, and we have cut the dendrogram at various levels (illustrated by the red lines in the interior of the dendrogram) to produce the colored clusters shown around the edges. Drug-gene pairs that are known drug-target relationships from DrugBank are denoted by blue dots, and those that are known PGx relationships from PharmGKB are denoted by orange dots. The heights of the turquoise bars are proportional to how often the corresponding drug-gene pairs co-occur in Medline sentences (a proxy for how well-documented they are).

https://doi.org/10.1371/journal.pcbi.1004216.g004

Table 4. Explanation of the clusters shown in Fig 4.

Clusters with 20 or fewer members are not described in the table in the interest of space.

https://doi.org/10.1371/journal.pcbi.1004216.t004

Cluster 8, the largest cluster, contains drug-gene pairs whose descriptions mainly refer to inhibition. This cluster is highly enriched for both PGx and drug-target relationships. When cluster 8 is subdivided by cutting the dendrogram at a lower height, a subcluster (8a) of antagonists and their protein targets splits off from the main cluster. EBC has learned that antagonism is a subclass of inhibition. Cluster 10, which is a close relative of cluster 8 in the dendrogram, contains drug-gene pairs where the drug is both an inhibitor and a substrate of the protein, such as verapamil/P-glycoprotein.

Cluster 3, another large cluster, is almost exclusively devoted to metabolism and substrate relationships, and is highly enriched for PGx relationships, though not drug-target relationships. Cluster 3 contains three subclusters with slightly different properties. Cluster 3a involves mainly substrate relationships where the concept of "metabolism" is not mentioned. These include, for example, transport relationships like aminopterin/hOAT1. Cluster 3b contains most of the metabolic relationships, many of which involve liver cytochromes like CYP3A4 and CYP2D6. Cluster 3c includes substrate relationships where the drug is often also described as having an effect on the activity of the protein.

Other clusters enriched for drug-target relationships include cluster 12, where the protein is described as the receptor for the drug, cluster 14a, where the drug is described as an agonist of the protein, and cluster 15, which refers to protein binding. Notably, cluster 14a (agonists) is part of a larger cluster, cluster 14, that encompasses activation and stimulation relationships. Here, EBC has learned that agonism is a subclass of activation. Interestingly, cluster 14b, the part of cluster 14 that refers to activation more broadly and does not specifically refer to agonism, is not enriched for drug-target relationships.

Clusters 1–16, which comprise 3 of the 4 main high-level groups within the dendrogram, are relatively easy to interpret: in general, each displayed a consistent theme. Clusters 17–25, however, involve descriptions of experimental methods or results about drug effects on gene expression or protein activity. Here, the dendrogram reveals a distinction between past and present knowledge. Drug-gene pairs that are already well-studied are often reported in a static context–“D is an inhibitor of G”, or “D is a G agonist”–whereas other pairs are reported primarily in an experimental context–“we investigated the effect of D on G expression”, “G was activated by D”, or “exposure to D significantly increased G activity”. Depending on the relative frequency of different types of descriptions, a drug-gene pair exemplifying an inhibitory relationship might end up in cluster 8 (mostly static descriptions) or cluster 21 (mostly experimental descriptions). Interestingly, drug-gene pairs from cluster 21 appear together in the literature significantly fewer times than drug-gene pairs from cluster 8 (median 9 times for cluster 21 vs. 16 times for cluster 8; maximum 66 times for cluster 21 vs. 2722 times for cluster 8; p < 0.0001, Mann-Whitney test), which seems to corroborate our assertion that the drug-gene pairs from cluster 21 represent more tentative experimental findings as opposed to well-established static knowledge.

Finally, the dendrogram reveals that PGx and drug-target relationships do not constitute distinct classes of relationships, but are chimeras. PGx relationships are composed of relatively distinct subgroups corresponding to (a) situations where the drug inhibits the gene/protein (and therefore, mutations in the gene could be expected to impact response to the drug), and (b) situations where the protein is involved in the metabolism or transport of the drug. Drug-target relationships overlap with (a) but not (b), and include other non-PGx subclasses, such as receptor binding and agonism.

Discovering novel relationships for PharmGKB and DrugBank

Because EBC reliably detects drug-gene pairs reflecting relationships of interest to PharmGKB and DrugBank, we attempted to discover new examples from our corpus. We built seed sets containing all known relationships from PharmGKB and DrugBank and incorporated these into EBC to rank the remaining drug-gene pairs according to EBC’s certainty that they represented PGx or drug-target relationships. The two seed sets overlapped by 13.6%: 84 drug-gene pairs appeared in both, 206 in PharmGKB only, and 326 in DrugBank only; the remaining 2898 pairs were unknown to both knowledge bases.

The dendrogram shown in Fig 5 is identical to that in Fig 4, except that the clusters are replaced by vertical bars, the heights of which correspond to EBC's relative certainty that the pairs in question represent PGx relationships (shown in orange) or drug-target relationships (shown in blue). The raw prediction data can be found in S4 Data. Known PGx or drug-target pairs are excluded from the bar graphs, but are denoted beneath the bars with orange or blue dots. As expected, we see high prediction certainty for drug-target and PGx relationships among the inhibitors in cluster 8, and high certainty for PGx relationships among the metabolic/substrate relationships in cluster 3. We also observe an interesting area of high enrichment for both types of relationships among clusters 21–23, where inhibition is mostly reported in an experimental context, but the density of known PGx and drug-target relationships is quite low. These could represent new experimental findings that will be discussed as static knowledge in a few years.

Fig 5. Dendrogram illustrating predictions of novel PGx and drug-target relationships among 3514 drug-gene pairs.

The height of the bars corresponds to EBC's certainty that the pair in question represents a relationship of the corresponding type (orange: PGx relationships, blue: drug-target relationships). The dots represent known PGx and drug-target relationships, as in Fig 4.

https://doi.org/10.1371/journal.pcbi.1004216.g005

Table 5 shows the top 20 predictions of new PGx candidate pairs for PharmGKB, and Table 6 shows the top 20 candidate drug-target pairs for DrugBank. Among the top 20 PGx predictions, five are already known to PharmGKB and have been demonstrated experimentally (one or more variants of the gene have been shown to impact response to the drug), but were coded in the PharmGKB relationships file in such a way that they were not included in the seed set. One is brand new: polymorphisms in ABCB1 (P-glycoprotein) do impact clinical response to fentanyl, but this relationship is currently unknown to PharmGKB. An additional eight pairs represent likely PGx relationships, such as known inhibitory or metabolic relationships, but no experiments have yet been conducted that might relate polymorphisms in the gene to drug response. And finally, in five cases, the potential for a PGx association was considered likely enough that it was investigated experimentally, but no significant clinical association between genotype and drug response was found.

Table 5. Top 20 predictions of new drug-gene relationships for PharmGKB, and whether a PGx relationship has been documented in the literature.

https://doi.org/10.1371/journal.pcbi.1004216.t005

Table 6. Top 20 predictions of new drug-target relationships for DrugBank.

https://doi.org/10.1371/journal.pcbi.1004216.t006

Among the top 20 predictions for new drug-target relationships for DrugBank, four are already known but were listed in DrugBank under alternate gene names. An additional seven are new, proven drug-target relationships. Of these, five involve drugs that are themselves unknown to DrugBank (there are no entries for ketanserin, cangrelor, nutlin-3, or tropisetron in DrugBank). There are also several interesting, yet erroneous findings arising from parser and lexicon errors in which a molecule, such as IL-1, is mistaken for its receptor, and that receptor is the true target of the drug. These are explored further in the Discussion.

Discussion

Relationship extraction in the biomedical domain

Although a great deal of research effort has been directed at the problem of relationship extraction in pharmacogenomics [17–19], and in the biomedical domain in general [20–25], high-quality biomedical knowledge bases like OMIM, DrugBank and PharmGKB still rely almost entirely on human curators, who comb the literature manually in search of new relationships. The authors of BioGraph, a new biomedical knowledge base incorporating data from 21 different sources, recently decided to exclude databases that were not manually curated, citing data quality issues [26]. Why is biomedical relationship extraction so challenging?

We believe that one key stumbling block lies in how the problem has historically been defined. Biomedical relationship extraction is usually thought of as a sentence-level problem–does a particular sentence describe a specific type of relationship or not? However, as we have seen, sentence-level descriptions are highly erratic. Faced with a bewildering array of possibilities for how similar relationships can be described, sentence-level relationship extraction algorithms often rely on manually-constructed rules or ontologies that map diverse surface forms onto common semantics [17, 27–29]. These systems require a non-trivial amount of human maintenance and must be rebuilt for each new domain. Machine learning algorithms for sentence-level relationship extraction avoid rules but face another serious problem: the need for annotated training sentences. Recently, researchers have begun to produce annotated training sets for the biomedical domain [30, 31] but manual annotation is almost as expensive as manual curation, both in time and human effort. As a result, little to no annotated training data exist for many classes of biomedically interesting relationships.

These are important problems for NLP, but they only exist because we think of biomedical relationships at the level of individual sentences. From a biomedical research standpoint, there is no need to do so—we are most interested in the true relationship between a drug and a gene, not in the meaning of any particular sentence. As a result, we have taken a corpus-level approach where all of the information about a drug-gene pair from all of its available sentence-level descriptions is combined. Latent connections among different-looking descriptions are then discovered in an unsupervised fashion from structure inherent in the raw text, requiring no human effort and boosting our ability to extract relationships of interest.

Support for corpus-level inference

We contend that biomedical relationships should be considered properties of biomedical entities like drug-gene pairs, not individual sentences. A description like “D decreased G levels” does not constitute an inhibitory relationship; it is simply an experimental finding that increases the likelihood of such a relationship. This allows the same sentence to provide evidence for or against multiple types of relationship, the exact definitions of which are application dependent. It also allows drug-gene pairs to exhibit multiple relationship types at once.

We see evidence for such an approach when we contrast EBC’s performance at extracting PGx relationships with its performance extracting drug-target relationships. EBC was uniformly worse at extracting PGx relationships, even though these two sets of experiments used the same data matrices. We see why in Fig 4: it turns out that what we originally considered to be well-defined relationship classes (PGx and drug-target relationships) are actually composites of several finer-grained sub-classes. A high percentage of PGx relationships reside in cluster 3, the metabolism/substrate cluster, which inhabits a region of the dendrogram far from the inhibition clusters. In cases where the seed set consists mostly of metabolic relationships and the test set mostly of inhibition relationships, we would not expect EBC to perform well, even though both groups are still technically PGx relationships.

We initially believed that PGx relationships would be expressed in sentences relating specific polymorphisms to changes in drug efficacy, such as, “The CYP3A4 C3435T polymorphism influences rifampicin exposure in human hepatocytes”. In reality, however, relatively few such sentences exist. Most evidence for PGx relationships comes instead from descriptions of other types of relationships, such as inhibition and metabolism. So we see that although a PGx relationship can be considered a property of a drug-gene pair, it is not generally a property of any particular sentence describing that pair.

Distributional semantics for relationship extraction

EBC is part of a subfield of NLP called distributional semantics, in which patterns in large, unlabeled text corpora are used to create feature representations of words, phrases, or other entities (in our case, drug-gene pairs) based on how they are used in context. The similarity of these representations then serves as a proxy for semantic relatedness [12]. In EBC, these representations are the co-clustering frequencies of each drug-gene pair with every other pair, and the contextual features are the dependency paths. This theme of discovering semantic relatedness from large-scale usage patterns inspired our corpus-level approach to drug-gene relationship extraction.

EBC builds on a long history of distributional semantics work in the NLP literature, much of which focuses on assessing the semantic similarity of individual words [12, 13, 32], and some of which has tackled relationship extraction outside the biomedical domain [33–36]. EBC is most similar in spirit to matrix factorization techniques like Latent Semantic Analysis (LSA) [13]; ITCC forms a low-rank approximation of the original drug-gene-pair-by-dependency-path matrix, and EBC stacks thousands of slightly different ITCC-based approximations on top of each other to make its similarity assessments. LSA uses the singular value decomposition (SVD) [37] instead of ITCC to accomplish a similar goal, and has been applied in at least one case to corpus-level relationship extraction (a technique called Latent Relational Analysis, or LRA) [36]. We compare EBC to LSA on the PGx relationship extraction task in S2 Text.

There are dozens of other clustering and matrix factorization methods available, and some have already been applied to text mining tasks like relationship extraction. Several methods cluster textual patterns to discover latent groupings of entity pairs corresponding to distinct relations [38–41]. Others use the entity pairs flanking different textual patterns to group the patterns themselves into semantically related classes [33]. Some methods, like EBC, address both problems simultaneously [42–45]. The issue of textual “entailment”–finding the degree to which one statement implies the existence of another–is also an active area of research in NLP and is closely related to several of the methods described above [46]. Although these techniques have already shown great promise on related tasks in web and newswire data, to our knowledge none has yet been applied to relationship extraction in the biomedical domain.

Study limitations: Dependency paths, lexicons and abstracts

In our analysis of drug-gene relationships, we made several choices about (a) how to identify drugs and genes in text, (b) the type of text to use as our corpus, and (c) what constitutes a “feature” (a single column in the data matrix). In all cases, we made the simplest choices possible, both to enable others to reproduce our results, and to distinguish EBC’s own limitations from errors/omissions in the preprocessing steps and text itself.

We identify drugs and genes in the text based on simple string matching to single-word drug and gene names from PharmGKB [5]. Named entity recognition (NER) is its own area of NLP, and identifying biomedical entity names in text is itself a nontrivial proposition. We can see one obvious disadvantage of this approach in cluster 24 of Fig 4 and Table 4, which includes “gene names” like COPD (a.k.a. chronic obstructive pulmonary disease) and NIDDM (non-insulin-dependent diabetes mellitus). Table 6 also reflects a lexicon error where the term “leukotriene” is listed as a synonym for the leukotriene B4 receptor. Some such errors might be avoided if we used a more elaborate NER system [47, 48], though such systems themselves are not perfect and can introduce new sources of error. Our stipulation that the entity names be single words also led to errors in cases (see Table 6) where a molecule, such as IL-1, is mistaken for its receptor, the “IL-1 receptor”, because “IL-1 receptor” is a multi-word phrase not allowed in the lexicon, while “IL-1” is allowed.

We also made no attempt to normalize gene names, so in our results, ABCB1, MDR-1, and P-gp are all different. Again, this was done to avoid introducing normalization errors, and because genes and their corresponding proteins are often described in different contexts.

To construct dependency paths from raw Medline sentences, we used the Stanford Parser [49], a free and open-source statistical parser. The Stanford Parser was trained using labeled text from newswire corpora, so it sometimes fares poorly on biomedical text. For example, the parser often mistakes gene names for adjectives (“CYP3A4” in the phrase “CYP3A4 polymorphism” is frequently labeled as an adjective). We used the out-of-box implementation of the Stanford Parser and did not perform any manual review or correction of parses to improve its performance (again in the interest of simplicity). Because EBC operates at the level of drug-gene pairs and not individual sentences, its performance is generally robust to parsing errors as long as the parser makes the same errors consistently.

There are some errors that do lead to incorrect conclusions, however. For example, we observe some situations where dependency paths bypass important details about relationships, such as a sentence where a drug is described as “transcriptionally up-regulating G expression” and the dependency path only captures the effect on expression, not its directionality. These are usually generalizations rather than errors, but they do result in some loss of information from the sentence.

Finally, our corpus consisted of all abstracts from the 2013 edition of Medline. Including information from the full text of the research articles could help discover relationships not mentioned in the abstracts, but many journals do not provide access to the full text, and we did not wish to bias our results in favor of relationships reported in a subset of journals. Our approach would remain the same regardless of the corpus.

Extensions and future applications

The combination of EBC and dependency path features described here allows us to reliably extract biomedical relationships of interest from Medline sentences, smoothing over differences in how these relationships are described. This finding opens the door to many interesting possible future applications. For example, EBC could be used to extract relationships spanning multiple sentences or entire abstracts by using features such as individual dependencies, words, or phrases in place of dependency paths. As new gold-standard sets of biomedical relationships become available (such as all drug-gene pairs reflecting inhibitory relationships or specific collections of drug-gene pairs relevant to particular laboratories’ research efforts) these can seamlessly be incorporated into EBC to extract these relationships at scale. EBC could also potentially be used for lexicon or ontology expansion in a manner similar to LSA or random indexing [50, 51]. At its core, EBC is not relationship extraction-centric. The algorithm itself is agnostic to the type of data contained in its input matrix. EBC simply allows us to use latent structure in large, unlabeled datasets to boost our ability to extract new information from those datasets, even when our access to labeled training examples is limited. Datasets like these occur throughout biomedical research, even beyond NLP. We look forward to seeing how EBC fares on some other classes of related problems, in NLP and elsewhere.

Methods

Outline of the EBC algorithm

When applied to drug-gene relationship discovery, the EBC algorithm operates on a data matrix where the rows are drug-gene pairs and the columns are dependency paths that connect them in the literature. The algorithm has two steps, the first unsupervised and the second supervised.

First, unsupervised biclustering is used to simultaneously discover (a) latent connections among dependency paths (columns) that appear different but connect similar drug-gene pairs, and (b) latent similarities among different drug-gene pairs (rows) that are connected by similar dependency paths. Over multiple iterations of (a) and (b), the algorithm can infer that two drug-gene pairs share a similar relationship, even when they share no dependency paths in common. To make its similarity assessments, EBC uses an ensemble of biclustering runs where the cluster centers are initialized randomly on each run, providing many different guesses about which dependency paths and drug-gene pairs are related.

In the second step, EBC incorporates a small seed set of drug-gene pairs (rows) reflecting some known relationship, and ranks other pairs based on their similarity to the pairs in the seed set. The specific steps of the EBC algorithm are as follows:

Preprocessing (drug-gene relationship extraction task):

  1. Identify all drug-gene pairs co-occurring in sentences within a corpus of text. (In our experiments, these were drug-gene pairs co-occurring in Medline sentences.) Call the number of drug-gene pairs n.
  2. Extract all dependency paths connecting these drug-gene pairs in the corpus. Call the total number of observed paths m.
  3. Arrange the data in an n x m matrix where the rows represent drug-gene pairs and the columns dependency paths. A cell with coordinates (i, j) in this matrix contains “1” if drug-gene pair i has been connected by path j somewhere in the corpus, and “0” otherwise.

EBC algorithm:

  4. (Unsupervised step.) Use Information-Theoretic Co-Clustering (ITCC; [16], details below) to bicluster the n x m matrix N times, recording the number of runs in which each row appears in a row cluster with each other row. The result is an n x n array, C, of co-occurrence values. Note that no information about the seed set is incorporated at this stage, so the unsupervised step need be run only once per data matrix. (A minimal sketch of this ensemble step appears after this list.)
  5. (Supervised step.) Identify a seed set, S, of rows that share some property of interest. (In our experiments, these were drug-gene pairs with known PGx or drug-target relationships.) Rank the entity pairs in a test set, T, based on a scoring function related to how often they co-cluster with members of S (details below). Repeat this step as desired with different seed sets.
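The sketch below illustrates step 4: accumulating the n x n co-occurrence array C over N biclustering runs. The itccRowClusters method is only a placeholder standing in for a full ITCC run with random initialization (our actual Java implementation is described in S3 Text); here it returns random cluster assignments purely so the sketch compiles and runs on its own.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of EBC's unsupervised step: count, for every pair of rows, the number of
// biclustering runs in which the two rows land in the same row cluster.
public class EnsembleCounts {

    // Placeholder for a real ITCC run: returns a row-cluster assignment (0..k-1) for each row.
    static int[] itccRowClusters(int[][] M, int k, int l, Random rng) {
        int[] assignment = new int[M.length];
        for (int i = 0; i < M.length; i++) assignment[i] = rng.nextInt(k); // stand-in only
        return assignment;
    }

    // Accumulate the n x n co-occurrence array C over N runs.
    static int[][] coClusterCounts(int[][] M, int k, int l, int N, long seed) {
        int n = M.length;
        int[][] C = new int[n][n];
        Random rng = new Random(seed);
        for (int run = 0; run < N; run++) {
            int[] rowCluster = itccRowClusters(M, k, l, rng);
            for (int i = 0; i < n; i++)
                for (int j = i + 1; j < n; j++)
                    if (rowCluster[i] == rowCluster[j]) { C[i][j]++; C[j][i]++; }
        }
        return C;
    }

    public static void main(String[] args) {
        int[][] M = {{1, 0, 1}, {1, 1, 0}, {0, 1, 1}}; // toy drug-gene-pair x path matrix
        int[][] C = coClusterCounts(M, 2, 2, 2000, 42L);
        System.out.println(Arrays.deepToString(C));
    }
}
```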

Named entity recognition of drugs and genes

We identified drug and gene entity names in the text using simple string matching to lexicons, though any type of named entity recognition software could be incorporated at this stage [47, 48]. We obtained drug and gene lexicons from PharmGKB [5] and filtered them against a dictionary of common English words to remove promiscuous terms (such as “CAT”, which is both a gene name and an animal). We included only drug and gene entities with one-word names, as these names mapped to single nodes in the dependency graphs. The final drug lexicon contained 4008 unique terms, and the final gene lexicon contained 109,597 terms (many genes/proteins had multiple names).
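The matching step itself can be sketched as follows; the lexicons here are tiny, purely illustrative stand-ins for the filtered PharmGKB vocabularies described above.

```java
import java.util.*;

// Sketch of lexicon-based entity matching: flag sentence tokens that appear in the drug
// or gene lexicon, after removing gene names that are also common English words.
public class LexiconMatcher {
    public static void main(String[] args) {
        Set<String> drugs = new HashSet<>(Arrays.asList("rifampicin", "verapamil"));
        Set<String> genes = new HashSet<>(Arrays.asList("CYP3A4", "ABCB1", "CAT"));
        Set<String> commonEnglish = new HashSet<>(Arrays.asList("cat", "the", "of"));

        // Drop promiscuous gene names such as "CAT", which is also an English word.
        genes.removeIf(g -> commonEnglish.contains(g.toLowerCase()));

        String sentence = "Rifampicin strongly induced CYP3A4 expression in hepatocytes .";
        for (String token : sentence.split("\\s+")) {
            if (drugs.contains(token.toLowerCase())) System.out.println("DRUG: " + token);
            if (genes.contains(token)) System.out.println("GENE: " + token);
        }
    }
}
```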

Extraction of dependency paths from Medline abstracts

We used the Stanford Parser [49] to generate dependency graphs for all sentences in Medline 2013 between 4 and 50 words in length (roughly 95% of all sentences in Medline). The input to the parser is a raw Medline sentence, and the output is a dependency graph. A dependency graph (see Fig 1) is one way to represent the grammatical architecture of a sentence; the nodes are words, and the edges are grammatical dependencies (grammatical relationships between pairs of words, described in detail in [15]).
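For reference, the sketch below shows one way to obtain typed dependencies for a sentence using the Stanford Parser's standard Java API (essentially the demo code shipped with the parser). The model path and the choice of the CC-processed dependency variant are illustrative and not necessarily the exact configuration used in this study.

```java
import java.io.StringReader;
import java.util.Collection;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.process.TokenizerFactory;
import edu.stanford.nlp.trees.*;

// Sketch: parse one sentence and print its typed dependencies (the edges of the
// dependency graph). Based on the Stanford Parser's own ParserDemo.
public class ParseSentence {
    public static void main(String[] args) {
        LexicalizedParser lp =
            LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
        String sentence = "Rifampicin induced CYP3A4 expression in human hepatocytes.";

        // Tokenize, parse, and convert the phrase-structure tree to typed dependencies.
        TokenizerFactory<CoreLabel> tf = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
        Tokenizer<CoreLabel> tokenizer = tf.getTokenizer(new StringReader(sentence));
        List<CoreLabel> tokens = tokenizer.tokenize();
        Tree parse = lp.apply(tokens);

        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        Collection<TypedDependency> deps = gs.typedDependenciesCCprocessed();
        for (TypedDependency dep : deps) System.out.println(dep);
    }
}
```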

A dependency path is a path through a dependency graph that connects two entities of interest. Considering a dependency path, instead of an entire sentence, can help “prune out” irrelevant terms and phrases and focus our attention on the part of the sentence directly relevant to the relationship between the two entities. We extracted all dependency paths linking drugs to genes.

It was possible for a single sentence to generate more than one dependency path if multiple drug or gene names were present in the sentence. We oriented our paths so that they always started at the drug and ended at the gene, and we eliminated edge directions. (We never observed a single situation where we accidentally collapsed paths with different meanings in doing so, since most pairs of words can only be connected by a particular dependency type, like amod or nn, in one direction.) We eliminated paths containing dependencies of type conj [15], because these were usually errors arising from inadequacies in how the dependency parser represents lists. Note that because the dependency graphs are trees, there is one unique dependency path for each drug-gene pair in a sentence.
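Because each dependency graph is a tree, the path between a drug node and a gene node is unique and can be recovered with a simple breadth-first search over an undirected adjacency list, as in the sketch below. The toy graph and node names are illustrative, and edge labels (the grammatical dependencies) are omitted for brevity.

```java
import java.util.*;

// Sketch: recover the unique path between two nodes (drug and gene) in an undirected
// dependency tree stored as an adjacency list.
public class DependencyPath {

    static List<String> path(Map<String, List<String>> adj, String start, String goal) {
        Map<String, String> parent = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        parent.put(start, null);
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            if (node.equals(goal)) break;
            for (String next : adj.getOrDefault(node, Collections.emptyList())) {
                if (!parent.containsKey(next)) { parent.put(next, node); queue.add(next); }
            }
        }
        // Walk back from the gene to the drug to reconstruct the path.
        LinkedList<String> result = new LinkedList<>();
        for (String node = goal; node != null; node = parent.get(node)) result.addFirst(node);
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> adj = new HashMap<>();
        adj.put("induced", Arrays.asList("rifampicin", "expression"));
        adj.put("rifampicin", Arrays.asList("induced"));
        adj.put("expression", Arrays.asList("induced", "CYP3A4"));
        adj.put("CYP3A4", Arrays.asList("expression"));
        System.out.println(path(adj, "rifampicin", "CYP3A4"));
        // prints [rifampicin, induced, expression, CYP3A4]
    }
}
```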

Ensemble biclustering

ITCC forms a low-rank approximation of a matrix by iteratively clustering the rows and columns. ITCC treats the data matrix, M, as a joint probability distribution over its rows (Y, drug-gene pairs) and columns (X, dependency paths). Given fixed numbers of row (k) and column (l) clusters, ITCC finds a set of cluster assignments for the rows and columns that captures most of the mutual information between X and Y, with the stipulation that X and Y only interact via their cluster assignments, $\hat{X}$ and $\hat{Y}$. Mathematically, ITCC replaces the joint distribution of X and Y, $p(x, y)$, with an approximate distribution of the form $q(x, y) = p(\hat{x}, \hat{y})\, p(x \mid \hat{x})\, p(y \mid \hat{y})$, and assigns rows and columns to clusters so that $q(x, y)$ captures most of the mutual information between X and Y in $p(x, y)$ (equivalent definition: the Kullback-Leibler divergence between $p(x, y)$ and $q(x, y)$ is minimized). We implemented ITCC in Java. Some technical details about our implementation can be found in S3 Text.
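To make the objective concrete, the sketch below computes the ITCC approximation q(x, y) and the Kullback-Leibler divergence D(p || q) for a fixed, illustrative set of cluster assignments. The toy distribution and assignments are arbitrary, and the iterative cluster-update steps of ITCC [16] are omitted.

```java
// Sketch: the ITCC approximation q = p(rowCluster, colCluster) * p(row | rowCluster) * p(col | colCluster)
// and the KL divergence D(p || q), evaluated for fixed row and column cluster assignments.
public class ItccObjective {

    static double klDivergence(double[][] p, int[] rowCluster, int[] colCluster, int k, int l) {
        int n = p.length, m = p[0].length;

        // Row and column marginals, cluster masses, and the cluster-level joint distribution.
        double[] pRow = new double[n], pCol = new double[m];
        double[] pRowCl = new double[k], pColCl = new double[l];
        double[][] pClJoint = new double[k][l];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++) {
                pRow[i] += p[i][j];
                pCol[j] += p[i][j];
                pRowCl[rowCluster[i]] += p[i][j];
                pColCl[colCluster[j]] += p[i][j];
                pClJoint[rowCluster[i]][colCluster[j]] += p[i][j];
            }

        // D(p || q), where q = p(rowCluster, colCluster) * (p(row) / p(rowCluster)) * (p(col) / p(colCluster)).
        double kl = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++) {
                if (p[i][j] == 0.0) continue;
                double q = pClJoint[rowCluster[i]][colCluster[j]]
                         * (pRow[i] / pRowCl[rowCluster[i]])
                         * (pCol[j] / pColCl[colCluster[j]]);
                kl += p[i][j] * Math.log(p[i][j] / q);
            }
        return kl;
    }

    public static void main(String[] args) {
        // Toy joint distribution p over 4 rows (drug-gene pairs) and 4 columns (paths); sums to 1.
        double[][] p = {
            {0.10, 0.10, 0.00, 0.00},
            {0.10, 0.10, 0.00, 0.00},
            {0.00, 0.05, 0.15, 0.10},
            {0.00, 0.05, 0.15, 0.10}};
        int[] rowCluster = {0, 0, 1, 1};
        int[] colCluster = {0, 0, 1, 1};
        System.out.println("D(p||q) = " + klDivergence(p, rowCluster, colCluster, 2, 2));
    }
}
```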

There are two unknown input parameters to ITCC: the numbers of row (k) and column (l) clusters. The optimal choices for k and l must be decided heuristically. We describe our heuristic for choosing k and l in S1 Text A.

Due to random initialization of the row and column cluster centers, ITCC generally converges to a different locally-optimal biclustering on each run; this diversity is what guarantees optimal performance of the EBC algorithm. We ran ITCC N = 2000 times at the optimal k and l and recorded the number of runs in which each pair of rows shared a cluster. We observed that on our data matrices, EBC’s performance increased monotonically with N, stabilizing at approximately N = 1000.

Scoring of test set pairs

Once EBC’s unsupervised step is performed and appropriate seed (S) and test (T) sets identified, test set items can be ranked as follows:

  1. EBC’s scoring function. For each test set member, Ti, rank all n rows of the data matrix based on how often they co-cluster with Ti. This produces a ranking Ri of length n in which pairs that frequently co-cluster with Ti are assigned high ranks and those that seldom co-cluster get low ranks. The score for Ti is the rank sum of the members of the seed set, S, within this list, or $\mathrm{score}(T_i) = \sum_{s \in S} R_i(s)$, where $R_i(s)$ denotes the rank of seed set member s within the ranking Ri. (A minimal sketch of this scoring scheme appears after this list.)
    Using ranks instead of absolute co-clustering frequencies produces a score that does not depend on how often, on average, a given drug-gene pair co-clusters with other pairs, since this baseline “promiscuity” changes from pair to pair. For some applications, those differences might not matter (or they might be informative), but we normalized to ranks so promiscuous pairs (which are often well-known or frequently mentioned pairs) would not consistently receive higher scores than less promiscuous pairs. EBC’s scoring function will assign a high score to a test set member as long as the seed set rows tend to cluster with it more frequently than other rows do. Ties are broken randomly.
    We compared EBC’s performance to two other ranking methods that did not take the semantic similarity of different dependency paths into account:
  2. AvgCosine. Let $\mathbf{t}_i$ be the row vector in the data matrix associated with test set member i. This vector contains m elements: one for each dependency path. Let $\mathbf{s}_j$ be the row vector associated with seed set member j. Here we score each test pair Ti based on the average cosine similarity of $\mathbf{t}_i$ with all of the row vectors from the seed set, or $\mathrm{score}(T_i) = \frac{1}{|S|} \sum_{j=1}^{|S|} \frac{\mathbf{t}_i \cdot \mathbf{s}_j}{\lVert \mathbf{t}_i \rVert \, \lVert \mathbf{s}_j \rVert}$, where ‖⋅‖ denotes the Euclidean norm.
  3. RankSum. In keeping with the spirit of EBC’s scoring function, for each Ti we rank all n rows of the data matrix based on cosine similarity to $\mathbf{t}_i$. This produces a ranking Ri of length n in which rows with high cosine similarity to $\mathbf{t}_i$ are assigned high ranks and those with low cosine similarity to $\mathbf{t}_i$ get low ranks. The score for Ti is the rank sum of the members of S within this list, and looks identical to that for EBC; the only difference is that the rankings Ri are produced using cosine similarity and not EBC.
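The rank-sum scoring used by EBC can be sketched as follows. The co-occurrence counts are illustrative, and ties are broken here by row index rather than randomly, to keep the sketch short.

```java
import java.util.*;

// Sketch of EBC's rank-sum scoring: for a test row t, rank all rows by how often
// they co-cluster with t (the row C[t] of the co-occurrence array), then sum the
// ranks of the seed-set rows. Higher scores indicate greater similarity to the seed set.
public class RankSumScore {

    static double score(int[][] C, int t, Set<Integer> seedSet) {
        int n = C.length;
        // Order rows by descending co-clustering count with row t (ties broken by index).
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Integer.compare(C[t][b], C[t][a]));

        // rank[i] = n for the most frequent co-clusterer of t, down to 1 for the least frequent.
        int[] rank = new int[n];
        for (int pos = 0; pos < n; pos++) rank[order[pos]] = n - pos;

        double sum = 0;
        for (int s : seedSet) sum += rank[s];
        return sum;
    }

    public static void main(String[] args) {
        int[][] C = {                     // toy co-clustering counts over N runs
            {0, 1500, 900, 100},
            {1500, 0, 800, 200},
            {900, 800, 0, 300},
            {100, 200, 300, 0}};
        Set<Integer> seed = new HashSet<>(Arrays.asList(0, 1)); // known positive pairs
        System.out.println("score(row 2) = " + score(C, 2, seed)); // 7.0
        System.out.println("score(row 3) = " + score(C, 3, seed)); // 5.0
    }
}
```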

Evaluating rankings of PGx and drug-target relationships

For both the PGx and drug-target tasks, and for seed set sizes |S| = 1, 2, 3, 4, 5, 10, 25, 50, and 100, we generated 1000 random seed sets and 1000 corresponding test sets, ensuring that the seed sets and test sets did not overlap. The test sets were all composed of 100 drug-gene pairs, 50 of which had known PGx or drug-target relationships and 50 of which did not. All three ranking methods were used to rank the members of each test set, using its associated seed set for scoring.

We also explored the impact of data sparsity by performing these evaluations on two separate datasets. In the “dense” dataset, we included only drug-gene pairs and dependency paths that occurred at least five times in Medline. In the “sparse” dataset, we included dependency paths occurring at least twice, and any drug-gene pairs they connected (even if they only co-occurred in a single sentence). More information about the two datasets can be found in Table 2, and the data matrices themselves can be found in S2 Data.

We evaluated the quality of each ranking by calculating the area under the receiver operating characteristic curve (AUC) [52], a measure of how likely it is that a positive element of the test set will be ranked higher than a negative element. We elected to use AUC instead of precision or recall because we wanted a threshold-independent measure of the overall quality of the ranking. We used R’s ROCR package to calculate the AUCs. From a practical standpoint, we were concerned mainly with the following scenario: Given that I have a seed set about whose quality I know nothing, what is the chance I can accurately prioritize the knowledge I am looking for within my [unlabeled] corpus? Our evaluation metric was, therefore, the fraction of the 1000 seed sets that ranked their corresponding test sets with AUC > 0.7.
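For readers who want to verify the metric outside R, the AUC reduces to the Mann-Whitney statistic: the probability that a randomly chosen positive test item is scored above a randomly chosen negative one. The sketch below computes it directly, with illustrative scores and labels.

```java
// Sketch: AUC computed as the Mann-Whitney statistic over all positive/negative score pairs.
// (Our actual AUCs were computed with R's ROCR package; this is just the equivalent calculation.)
public class AucFromScores {

    static double auc(double[] scores, boolean[] positive) {
        int nPos = 0, nNeg = 0;
        double concordant = 0;
        for (int i = 0; i < scores.length; i++) {
            if (!positive[i]) { nNeg++; continue; }
            nPos++;
            for (int j = 0; j < scores.length; j++) {
                if (positive[j]) continue;
                if (scores[i] > scores[j]) concordant += 1.0;       // positive ranked above negative
                else if (scores[i] == scores[j]) concordant += 0.5; // ties count half
            }
        }
        return concordant / (nPos * (double) nNeg);
    }

    public static void main(String[] args) {
        double[] scores = {9.1, 7.4, 6.0, 3.2, 2.5};          // e.g. EBC rank-sum scores
        boolean[] positive = {true, true, false, true, false}; // known PGx pairs vs. not
        System.out.println("AUC = " + auc(scores, positive));  // prints 0.8333...
    }
}
```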

Comparing EBC to Latent Semantic Analysis (LSA)

To investigate how similar EBC’s performance was to a more established method designed to solve a similar problem, we used the singular value decomposition (SVD) [37] to decompose our two data matrices, creating “compressed” feature vectors of reduced dimensionality for each drug-gene pair and incorporating these, rather than the raw row vectors, into the two non-EBC ranking methods described above. This approach is identical to the famous text mining technique Latent Semantic Analysis (LSA; [13]) which was originally applied to overcome issues of data sparsity in document retrieval. The results of these experiments are described further in S2 Text.

Building a dendrogram of drug-gene pairs based on EBC’s similarity assessments

EBC provides a natural measure of similarity for each drug-gene pair and every other pair: the number of times the rows corresponding to those two pairs clustered together over the N biclustering runs. However, as we have seen, these raw values are not fair measures of distance for all pairs, since some drug-gene pairs tend to cluster frequently with many other pairs, and others cluster less frequently. EBC’s rank-based scoring function accounts for this by normalizing to ranks: each drug-gene pair ranks all other pairs by co-clustering frequency, and these ranks are used in place of the raw co-clustering values in the scoring function.

To implement EBC's scoring function in an unsupervised manner to construct our dendrogram, we started with our n x n matrix of co-occurrence values, C, in which Cij was the number of runs (out of N total) in which drug-gene pair i co-clustered with drug-gene pair j. We then converted C into a correlation matrix, ρ, also n x n, where ρij contained the Spearman correlation of Ci and Cj, the ith and jth rows of C (note that C is symmetric, so we could just as easily have used columns). These correlations are, as in EBC's scoring function, measures of how similarly drug-gene pair i and pair j rank all other pairs in the matrix, and are not biased in favor of promiscuous pairs. We then used 1 − ρ as the distance measure for hierarchical clustering using minimax linkage [53] to produce the dendrogram shown in Fig 4. Using a different linkage function or distance metric, obviously, would produce a different-looking dendrogram.
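The conversion from C to the distance 1 − ρ can be sketched as follows. This simplified version assumes no tied co-clustering counts within a row, which keeps the rank transform short; a production implementation would average ranks over ties, and the clustering and dendrogram themselves were produced with the R packages described below.

```java
import java.util.*;

// Sketch: convert the co-clustering matrix C into a Spearman distance matrix 1 - rho,
// where rho[i][j] is the Spearman correlation of rows i and j of C. Assumes untied values.
public class SpearmanDistance {

    // Rank-transform a vector (1 = smallest value). Assumes distinct values.
    static double[] ranks(double[] v) {
        Integer[] order = new Integer[v.length];
        for (int i = 0; i < v.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(a -> v[a]));
        double[] r = new double[v.length];
        for (int pos = 0; pos < v.length; pos++) r[order[pos]] = pos + 1;
        return r;
    }

    // Pearson correlation of two vectors; applied to ranks, this is the Spearman correlation.
    static double pearson(double[] a, double[] b) {
        int n = a.length;
        double ma = 0, mb = 0;
        for (int i = 0; i < n; i++) { ma += a[i]; mb += b[i]; }
        ma /= n; mb /= n;
        double cov = 0, va = 0, vb = 0;
        for (int i = 0; i < n; i++) {
            cov += (a[i] - ma) * (b[i] - mb);
            va += (a[i] - ma) * (a[i] - ma);
            vb += (b[i] - mb) * (b[i] - mb);
        }
        return cov / Math.sqrt(va * vb);
    }

    static double[][] distanceMatrix(double[][] C) {
        int n = C.length;
        double[][] rankRows = new double[n][];
        for (int i = 0; i < n; i++) rankRows[i] = ranks(C[i]);
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                d[i][j] = 1.0 - pearson(rankRows[i], rankRows[j]);
        return d;
    }

    public static void main(String[] args) {
        double[][] C = {{0, 1500, 900}, {1500, 0, 800}, {900, 800, 0}};
        System.out.println(Arrays.deepToString(distanceMatrix(C)));
    }
}
```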

We used several R packages to produce the dendrogram figures, including ape (a library for making phylogenetic trees), and protoclust (a library for hierarchical clustering using minimax linkage). To achieve the radially-spaced tip markers, we used a separate package [54].

Supporting Information

S1 Text. Optimizing row and column cluster numbers for EBC.

We describe our heuristic for choosing the optimal number of row (k) and column (l) clusters for EBC based on the structure of the data matrix.

https://doi.org/10.1371/journal.pcbi.1004216.s001

(PDF)

S2 Text. Comparing EBC to Latent Semantic Analysis (LSA).

We compare EBC to another related technique that was one of the first to use matrix decompositions to address the problem of data sparsity in text mining.

https://doi.org/10.1371/journal.pcbi.1004216.s002

(PDF)

S3 Text. Technical details about our implementation of EBC in Java.

https://doi.org/10.1371/journal.pcbi.1004216.s003

(PDF)

S1 Data. Co-clustering frequencies on dense and sparse matrices.

We provide the raw co-clustering frequencies of the rows (drug-gene pairs) of both matrices over N = 2000 runs.

https://doi.org/10.1371/journal.pcbi.1004216.s004

(PDF)

S2 Data. Sparse and dense data matrices for the drug-gene relationship extraction task, stored in a sparse format.

https://doi.org/10.1371/journal.pcbi.1004216.s005

(PDF)

S3 Data. Cluster assignments for the dendrogram in Fig 4, at five different cut heights.

https://doi.org/10.1371/journal.pcbi.1004216.s006

(PDF)

S4 Data. Prediction certainties from Fig 5 for PharmGKB and DrugBank.

https://doi.org/10.1371/journal.pcbi.1004216.s007

(PDF)

Acknowledgments

BP thanks Francisco Gimenez, Marshall Pierce, Sida Wang, Kenneth Jung, Lynn Eckert, and Tim Rocktaeschel for useful conversations and pointers to relevant literature, and the members of her reading committee, Art Owen, Christopher Potts, Pentti Kanerva, and Nigam Shah, for detailed feedback.

Author Contributions

Conceived and designed the experiments: BP RBA. Performed the experiments: BP. Analyzed the data: BP. Contributed reagents/materials/analysis tools: RBA. Wrote the paper: BP RBA.

References

  1. http://www.nlm.nih.gov/bsd/num_titles.html. Accessed 3/3/14.
  2. http://www.nlm.nih.gov/bsd/medline_cit_counts_yr_pub.html. Accessed 3/3/14.
  3. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA (2005) Online Mendelian Inheritance in Man (OMIM), a knowledge base of human genes and genetic disorders. Nucleic Acids Res 33(Suppl 1): D514–D517.
  4. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, et al. (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 34(Suppl 1): D668–D672.
  5. Whirl-Carrillo M, McDonagh EM, Hebert JM, Gong L, Sangkuhl K, et al. (2012) Pharmacogenomics knowledge for personalized medicine. Clin Pharmacol Ther 92: 414–417. pmid:22992668
  6. Jensen LJ, Saric J, Bork P (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 7: 119–129.
  7. Lu Z (2011) PubMed and beyond: a survey of web tools for searching biomedical literature. Database: baq036.
  8. Shatkay H, Feldman R (2003) Mining the biomedical literature in the genomic era: an overview. J Comput Biol 10: 821–855. pmid:14980013
  9. Cohen AM, Hersh WR (2005) A survey of current work in biomedical text mining. Brief Bioinform 6: 57–71. pmid:15826357
  10. Katzan IL, Rudick RA (2012) Time to Integrate Clinical and Research Informatics. Sci Transl Med 4: 162fs41. pmid:23197569
  11. Nadler JJ, Downing GJ (2010) Liberating Health Data for Clinical Research Applications. Sci Transl Med 2: 18cm6. pmid:20371480
  12. Turney PD, Pantel P (2010) From frequency to meaning: vector space models of semantics. J Artif Intell Res 37: 141–188.
  13. Deerwester SC, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent semantic analysis. J Am Soc Inform Sci 41: 391–407.
  14. Cohen T, Widdows D (2009) Empirical distributional semantics: methods and biomedical applications. J Biomed Inform 42: 390–405. pmid:19232399
  15. De Marneffe MC, Manning CD (2008) Stanford typed dependencies manual. http://nlp.stanford.edu/software/dependencies_manual.pdf.
  16. Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: Proc. Ninth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD ‘03): 89–98.
  17. Coulet A, Shah NH, Garten Y, Musen M, Altman RB (2010) Using text to build semantic networks for pharmacogenomics. J Biomed Inform 43: 1009–1019. pmid:20723615
  18. Buyko E, Beisswanger E, Hahn U (2012) The extraction of pharmacogenetic and pharmacogenomic relations–a case study using PharmGKB. Pac Symp Biocomp 17: 376–387.
  19. Xu R, Wang Q (2012) A knowledge-driven conditional approach to extract pharmacogenomics specific drug-gene relationships from free text. J Biomed Inform 45: 827–834. pmid:22561026
  20. Pustejovsky J, Castano J, Zhang J, Kotecki M, Cochran B (2002) Robust relational parsing over biomedical literature: extracting inhibit relations. Pac Symp Biocomp 7: 362–373.
  21. Rindflesch TC, Fiszman M (2003) The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform 36: 462–477. pmid:14759819
  22. McDonald R, Pereira F, Kulick S, Winters S, Jin Y, et al. (2005) Simple algorithms for complex relation extraction with applications to biomedical IE. In: Proc. of the 43rd Annual Meeting on Assoc. for Comp. Linguist: 491–498.
  23. Li J, Zhang Z, Li X, Chen H (2008) Kernel-based learning for biomedical relation extraction. J Am Soc Inf Sci Tec 59: 756–769.
  24. Fundel K, Kueffner R, Zimmer R (2007) RelEx–Relation extraction using dependency parse trees. Bioinformatics 23: 365–371. pmid:17142812
  25. Segura-Bedmar I, Martinez P, de Pablo-Sánchez C (2011) Using a shallow linguistic kernel for drug–drug interaction extraction. J Biomed Inform 44: 789–804. pmid:21545845
  26. Liekens AM, De Knijf J, Daelemans W, Goethals B, De Rijk P, et al. (2011) BioGraph: unsupervised biomedical knowledge discovery via automated hypothesis generation. Genome Biol 12: R57. pmid:21696594
  27. Friedman C, Alderson PO, Austin JH, Cimino JJ, Johnson SB (1994) A general natural-language text processor for clinical radiology. J Am Med Inform Assn 1: 161–174.
  28. Rindflesch TC, Tanabe L, Weinstein JN, Hunter L (2000) EDGAR: extraction of drugs, genes and relations from the biomedical literature. Pac Symp Biocomput 5: 514–525.
  29. http://www.technologyreview.com/news/523411/facing-doubters-ibm-expands-plans-for-watson/. Accessed 3/3/14.
  30. Kim JD, Ohta T, Tateisi Y, Tsujii JI (2003) GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics 19(Suppl 1): i180–i182. pmid:12855455
  31. Hirschman L, Yeh A, Blaschke C, Valencia A (2005) Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics 6(Suppl 1): S1.
  32. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems: 3111–3119.
  33. Lin D, Pantel P (2001) DIRT: discovery of inference rules from text. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining: 323–328.
  34. Riedel S, Yao L, McCallum A, Marlin BM (2013) Relation Extraction with Matrix Factorization and Universal Schemas. In Proceedings of NAACL-HLT: 74–84.
  35. Dagan I, Roth D, Sammons M, Zanzotto F (2013) Recognizing Textual Entailment: Models and Applications. San Rafael: Morgan and Claypool.
  36. Turney PD (2005) Measuring semantic similarity by latent relational analysis. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI-05): 1136–1141.
  37. Skillicorn D (2007) Understanding complex datasets: data mining with matrix decompositions. Boca Raton: CRC Press.
  38. Shinyama Y, Sekine S (2006) Preemptive information extraction using unrestricted relation discovery. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (pp. 304–311). Association for Computational Linguistics.
  39. Hasegawa T, Sekine S, Grishman R (2004) Discovering relations among named entities from large corpora. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics (p. 415). Association for Computational Linguistics.
  40. Zhang M, Su J, Wang D, Zhou G, Tan CL (2005) Discovering relations between named entities from a large raw corpus using tree similarity-based clustering. In Natural Language Processing–IJCNLP 2005 (pp. 378–389). Springer Berlin Heidelberg.
  41. Rosenfeld B, Feldman R (2007) Clustering for unsupervised relation identification. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management (pp. 411–418). ACM.
  42. Bollegala DT, Matsuo Y, Ishizuka M (2010) Relational duality: Unsupervised extraction of semantic relations between entities on the web. In Proceedings of the 19th international conference on World wide web (pp. 151–160). ACM.
  43. Yao L, Haghighi A, Riedel S, McCallum A (2011) Structured relation discovery using generative models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 1456–1466). Association for Computational Linguistics.
  44. Kok S, Domingos P (2008) Extracting semantic networks from text via relational clustering. In Machine Learning and Knowledge Discovery in Databases (pp. 624–639). Springer Berlin Heidelberg.
  45. Riedel S, Yao L, McCallum A, Marlin BM (2013) Relation Extraction with Matrix Factorization and Universal Schemas. In Proceedings of NAACL-HLT (pp. 74–84).
  46. Dagan I, Roth D, Sammons M, Zanzotto F (2013) Recognizing Textual Entailment: Models and Applications. San Rafael: Morgan and Claypool.
  47. Leaman R, Gonzalez G (2008) BANNER: an executable survey of advances in biomedical named entity recognition. Pac Symp Biocomp 13: 652–663.
  48. Leser U, Hakenberg J (2005) What makes a gene name? Named entity recognition in the biomedical literature. Brief Bioinform 6: 357–369. pmid:16420734
  49. De Marneffe MC, Manning CD (2008) The Stanford typed dependencies representation. In: COLING Workshop on Cross-framework and Cross-domain Parser Evaluation: 1–8.
  50. Sahlgren M (2005) An introduction to random indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE (Vol. 5).
  51. Percha B, Altman RB (2013) Inferring the semantic relationships of words within an ontology using random indexing: applications to pharmacogenomics. AMIA Annu Symp Proc: 1123–1132. pmid:24551397
  52. Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn 30: 1145–1159.
  53. Bien J, Tibshirani R (2011) Hierarchical clustering with prototypes via minimax linkage. J Am Stat Assoc 106: 1075–1084.
  54. https://github.com/willpearse/willeerd/blob/master/R/phylo.plots.R. Accessed 5/23/14.