Building Disease-Specific Drug-Protein Connectivity Maps from Molecular Interaction Networks and PubMed Abstracts

doi:10.1371/journal.pcbi.1000450

Figure 1.

A conceptual paradigm for the development of disease-specific molecular connectivity maps.

In this paradigm, molecular interaction data and PubMed abstracts are the primary data sources. Network mining is used to generate disease-related proteins from molecular interactions. Text mining is used to extract disease-related drug terms from PubMed abstracts and to build a drug-protein connectivity map in the disease context.

Figure 2.

A computational framework for developing molecular connectivity maps in any given disease context.

The framework consists of three components: network construction, text retrieval and information extraction, and molecular connectivity mapping. The network construction component takes disease-specific seed proteins as input and outputs a disease-related protein interaction network with a ranked list of disease-related proteins. The text retrieval and information extraction component takes synonym-expanded disease-related proteins and outputs a list of drug terms enriched in the retrieved collection of PubMed abstracts. The molecular connectivity mapping component takes two inputs (disease-related proteins from the protein interaction network constructed in the first component, and enriched drug terms from the second component) and outputs a drug-protein connectivity map, to which further knowledge filters and clustering analysis can be applied.

Table 1.

Top 30 ranked proteins from the AD-related protein interaction network.

Figure 3.

The effect of different disease-related protein seeding situations on the specificity and sensitivity of AD drug identification.

In the text retrieval and information extraction component, AD-related drugs are identified from the retrieved PubMed abstracts relevant to a list of AD proteins. We started with an initial set of 49 AD seed proteins. To evaluate the effect of different seeding situations on AD drug identification, we sub-sampled the initial AD seed set into 8 data sets of varying sizes, i.e., S5, S10, S15, S20, S25, S30, S35, and S40 (the number indicating the size), and also generated a random seed set with 50 proteins. Given the different seed sets, Panel (A) shows the specificity of AD-related drug identification for the top N drugs determined by FDR (false discovery rate), and Panel (B) shows the sensitivity.
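
The sub-sampling step can be illustrated with a minimal sketch. It assumes each subset is drawn independently and uniformly at random from the seed set, which the caption does not specify; the protein identifiers are hypothetical placeholders, and the function name `subsample_seed_sets` is introduced here only for illustration.

```python
import random


def subsample_seed_sets(seeds, sizes, rng):
    """Draw one random subset per requested size, labelled S<size>.

    Assumption: subsets are drawn independently (not nested).
    """
    return {f"S{n}": sorted(rng.sample(seeds, n)) for n in sizes}


# Hypothetical placeholder identifiers standing in for the 49 AD seed proteins.
seeds = [f"SEED_{i:02d}" for i in range(1, 50)]
sizes = [5, 10, 15, 20, 25, 30, 35, 40]

# A fixed random seed keeps the draw reproducible.
subsets = subsample_seed_sets(seeds, sizes, random.Random(0))

for label, members in subsets.items():
    print(label, len(members))
```

Each subset would then be fed through the same drug identification procedure so that specificity and sensitivity can be compared across seed set sizes.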

Figure 4.

Specificity and sensitivity tradeoffs for AD-related drug identification.

The ROC (receiver operating characteristic) curve shows the sensitivity vs. false positive rate (1-specificity) for AD-related drug identification as the FDR (false discovery rate) threshold is varied. Evaluation results were built by querying PubMed abstracts and Entrez Gene function descriptions for evidence containing any of the drug terms together with the term “Alzheimer's Disease”, including all term variants. Sensitivity and specificity are defined in the Methods section.
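
As a rough sketch of how such a curve is traced, the snippet below sweeps an FDR cutoff over a toy list of drugs and records one (false positive rate, sensitivity) point per cutoff. The FDR values, the true/false labels, and the function name `roc_points` are invented for illustration and are not taken from the study; a drug is assumed to be called positive when its FDR is at or below the cutoff.

```python
def roc_points(fdr_values, labels, thresholds):
    """Return (false positive rate, sensitivity) for each FDR cutoff.

    fdr_values: FDR assigned to each candidate drug.
    labels: True if the drug has supporting evidence (a true positive case).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for cutoff in thresholds:
        tp = sum(1 for q, y in zip(fdr_values, labels) if q <= cutoff and y)
        fp = sum(1 for q, y in zip(fdr_values, labels) if q <= cutoff and not y)
        points.append((fp / negatives, tp / positives))
    return points


# Toy data, for illustration only.
fdr_values = [0.01, 0.03, 0.04, 0.20, 0.50]
labels = [True, True, False, True, False]
thresholds = [0.02, 0.05, 0.30, 1.00]

points = roc_points(fdr_values, labels, thresholds)
for cutoff, (fpr, sens) in zip(thresholds, points):
    print(f"FDR <= {cutoff:.2f}: FPR = {fpr:.2f}, sensitivity = {sens:.2f}")
```

Loosening the cutoff admits more drugs, so both the sensitivity and the false positive rate can only stay the same or rise, which is what produces the monotone shape of the curve.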

Table 2.

A representative sample of enriched AD drugs.

Figure 5.

Performance assessment of comparable systems on the task of identifying AD-related drugs.

Two curated data sources (DrugBank and CTD) and two computational methods (Chi2 and BITOLA) were selected for comparison against our approach on AD drug identification. DrugBank and CTD are manually curated databases with content about disease-modifying genes/proteins and drugs. Chi2 is a baseline system that uses the common Chi-square statistical method to identify significant co-occurring drug-disease relationships cited in PubMed abstracts. BITOLA (Biomedical Discovery Support System) is a computational system based on natural language processing that can extract drug-protein relations in a disease context. The histogram shows the sensitivity, specificity, PPV (positive predictive value), F-score, and ACC (accuracy) of each group. These performance measurements are defined in the Methods section.
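
For orientation, the five measures can be computed from the counts of true and false positives and negatives as sketched below. This assumes the standard textbook definitions, with the F-score taken as the balanced harmonic mean of PPV and sensitivity; the Methods section remains the authoritative source for the exact definitions used. The counts and the function name `performance_measures` are illustrative only.

```python
def performance_measures(tp, fp, tn, fn):
    """Standard confusion-matrix measures (assumed definitions)."""
    sensitivity = tp / (tp + fn)   # fraction of true AD drugs recovered
    specificity = tn / (tn + fp)   # fraction of non-AD drugs rejected
    ppv = tp / (tp + fp)           # fraction of predictions that are correct
    f_score = 2 * ppv * sensitivity / (ppv + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "PPV": ppv,
        "F-score": f_score,
        "ACC": accuracy,
    }


# Invented counts, for illustration only.
measures = performance_measures(tp=8, fp=2, tn=18, fn=2)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```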

Table 3.

Cross-validation of protein-drug relationships identified in the AD connectivity map with target-drug relationships of AD drugs in DrugBank.

Figure 6.

An AD connectivity map linking AD-related proteins to significant drugs.

After ranking the proteins in the AD-related protein interaction network and selecting the drugs enriched in the corpus related to that network, 66 highly relevant AD proteins and 166 significant AD candidate drugs were identified to construct an AD connectivity map. Hierarchical clustering of drugs and proteins was performed before the results were rendered as the final heatmap, in which the x-dimension represents drugs and the y-dimension represents proteins. The color intensity of each cell is proportional to the connectivity score, as shown in the heatmap legend. Panels (A) and (B) show zoomed-in views of boxed regions A and B on the original map. Panel (C) shows the chemical structures of three drugs (Diazepam, Clonazepam, and Flunitrazepam) from a cluster of drugs found in Panel (B), with their common structure (Benzodiazepine) shown in a box. CID refers to the entity identifier in PubChem (http://pubchem.ncbi.nlm.nih.gov/).

Table 4.

Performance assessment of molecular connectivity maps for several representative cancers.
