Figure 1.
CTD text mining technical overview.
(1) A triaged corpus is retrieved for a target chemical of interest by querying PubMed. (2) Using the PMID, each article's title and abstract are mined for gene, chemical, disease, and action terms in CTD's integrated text-mining pipeline (red box). (3) Each text-mined term is validated against CTD's controlled vocabularies and ignored if no match is found. The pipeline runs on a Red Hat Enterprise Linux 6.2 operating system, primarily in Java 1.6, as asynchronous batch processes. (4) The text-mining tool then assigns each PMID a document relevancy score (DRS), and (5) the scored PMIDs are sent to biocurators. (6) All interactions are composed and entered in CTD's web-based Curation Tool, whose client uses HTML 5, CSS3, JavaScript 1.85, and Ajax; a server processes the interactions and stores them in the Curation Database using Tomcat 6.0, Java 1.6, Servlet 2.5, JSP/JSTL, and the Spring 3.0 framework.
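Step (3) can be illustrated with a minimal sketch. The CTD pipeline itself is written primarily in Java; the Python below is only an illustration, and the function name `validate_terms`, the case-insensitive matching rule, and the three-entry vocabulary are all assumptions, not details of CTD's implementation.

```python
def validate_terms(mined_terms, vocabulary):
    """Return the vocabulary entries matched by text-mined terms; drop the rest."""
    # Assumption: matching ignores case and surrounding whitespace.
    lookup = {entry.strip().lower(): entry for entry in vocabulary}
    matched = []
    for term in mined_terms:
        key = term.strip().lower()
        if key in lookup:
            matched.append(lookup[key])  # keep the canonical vocabulary form
        # Terms without a match are ignored, as in step (3).
    return matched


# Hypothetical three-entry chemical vocabulary and mined terms.
chemical_vocabulary = ["Cadmium", "Mercury", "Nickel"]
mined = ["cadmium", "unobtainium", " Mercury "]
validated = validate_terms(mined, chemical_vocabulary)
print(validated)  # ['Cadmium', 'Mercury']
```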
Table 1.
CTD rules-based document ranking algorithm.
Figure 2.
(1) Independent CTD-specific PubMed queries retrieved 14,904 articles for the seven heavy metals cadmium, cobalt, copper, lead, manganese, mercury, and nickel. (2) These articles were text mined and assigned a document relevancy score (DRS). (3) Of this preliminary corpus, 1,020 articles had been previously reviewed in CTD and were used as a test set to evaluate the DRS and determine suitable cut-offs. (4) Articles with DRS ≥100 (high), DRS ≤20 (low), and a subset with DRS 21–99 (medium) were combined into a final corpus of 3,583 documents, which was then (5) sent to five CTD biocurators (who were kept blind to the DRS of each article) for review. (6) Biocurators timed themselves while reviewing all articles; they rejected 1,381 (as non-curatable for CTD) and curated 2,202, (7) from which 41,208 chemical-gene-disease interactions were extracted.
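The DRS cut-offs in step (4) and the review totals in step (6) can be written out as a short sketch. The function name `drs_tier` is hypothetical, and the sample scores are made up; only the cut-offs and the article counts come from the caption.

```python
HIGH_CUTOFF = 100  # DRS >= 100 is "high"
LOW_CUTOFF = 20    # DRS <= 20 is "low"


def drs_tier(drs):
    """Map a document relevancy score to the tier named in the caption."""
    if drs >= HIGH_CUTOFF:
        return "high"
    if drs <= LOW_CUTOFF:
        return "low"
    return "medium"  # DRS 21-99


# Made-up scores chosen to sit on either side of each cut-off.
tiers = [drs_tier(score) for score in (0, 20, 21, 99, 100, 250)]

# Counts from the caption: rejected plus curated equals the final corpus.
rejected, curated = 1381, 2202
final_corpus = rejected + curated
print(tiers, final_corpus)
```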
Figure 3.
Test set of previously reviewed articles validates assigned DRS.
A total of 1,020 articles are distributed by their text-mining assigned DRS (binned in 20-unit increments, x-axis). Each column shows the percentage of articles in that bin that a CTD biocurator had previously curated (green) or rejected (gray). The number of articles in each DRS bin (n) appears at the top of each column. There were no articles in bins 280–299, 340–359, or 360–379.
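The binning behind this kind of plot can be sketched as follows, assuming integer scores and bins that start at multiples of 20 (consistent with bin labels such as 280–299). The function name and the five sample articles are hypothetical; bins with no articles simply do not appear in the result.

```python
from collections import defaultdict


def percent_curated_by_bin(articles, bin_width=20):
    """articles: iterable of (drs, was_curated) pairs.

    Returns {bin_start: (n_articles, percent_curated)} for non-empty bins.
    """
    counts = defaultdict(lambda: [0, 0])  # bin_start -> [n, n_curated]
    for drs, was_curated in articles:
        bin_start = (drs // bin_width) * bin_width
        counts[bin_start][0] += 1
        counts[bin_start][1] += int(was_curated)
    return {start: (n, 100.0 * c / n) for start, (n, c) in sorted(counts.items())}


# Hypothetical sample: two low-scoring and three high-scoring articles.
sample = [(5, False), (15, True), (105, True), (110, True), (119, False)]
bins = percent_curated_by_bin(sample)
for start, (n, pct) in bins.items():
    print(f"{start}-{start + 19}: n={n}, curated={pct:.0f}%")
```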
Figure 4.
Curation of heavy metal corpus validates assigned DRS.
Of the original 14,904 articles (boxes in top row, N), a representative set of 3,583 documents (second row, n) was assigned to CTD biocurators for curatorial review. This set included all 1,981 articles with a high DRS ≥100, all 723 articles with a low DRS ≤20, and, for the heavy metal mercury, the complete subset of 879 articles with a medium DRS 21–99. (The 1,020 previously reviewed articles were not included in the assigned set.) The articles are distributed by their text-mining assigned DRS (binned in 20-unit increments, x-axis), and each column shows the percentage of articles in that bin that a CTD biocurator curated (green) or rejected (gray). The percentage of curated articles decreases progressively for DRS <100. In total, 1,685 of the 1,981 articles (85%) with a high DRS ≥100 were curatable, while only 111 of the 723 articles (15%) with a low DRS ≤20 could be curated.
Table 2.
CTD manual curation metrics.
Figure 5.
DRS reflects the number of interactions per curated article.
Biocurators extracted 41,208 interactions from 2,202 curated articles (top row, c). The average number of interactions per curated article (log-scale, y-axis) is distributed by the assigned DRS (binned in 20-unit increments, x-axis), with the number of curated articles (c) in each bin indicated at the top. The average number of interactions per curated article increases with the DRS. The aberrant spike in bin 240–259 is due to a single article (one of nine curated documents in the bin) reporting a microarray experiment, from which 5,977 interactions were curated.
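The effect of that single article on the bin average can be worked through from the caption's own numbers. The counts for the other eight articles in the bin are not given, so the sketch below only computes a lower bound: the amount the one article contributes to the bin mean by itself. The variable names are illustrative.

```python
# Figures taken from the caption.
total_interactions = 41208
curated_articles = 2202
overall_mean = total_interactions / curated_articles

outlier_interactions = 5977  # the single microarray article
articles_in_bin = 9          # curated documents in bin 240-259

# Contribution of the outlier alone to the bin mean; the true bin mean is
# at least this large, whatever the other eight articles contain.
outlier_share_of_mean = outlier_interactions / articles_in_bin

print(round(overall_mean, 1))           # 18.7
print(round(outlier_share_of_mean, 1))  # 664.1
```

One article therefore lifts its bin's average to more than 35 times the corpus-wide average of about 18.7 interactions per curated article.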
Figure 6.
DRS effectively ranks articles for relevance.
The 3,583 text-mined articles were ranked by (A) each article's PubMed identification number (PMID) in descending order and by (B) the text-mining assigned DRS, with articles grouped into progressive quartiles (Q1–Q4), each containing approximately 896 documents. The articles were reviewed by CTD biocurators, who determined that 2,202 of the articles contained relevant data (curated, green bars) while 1,381 did not (rejected, gray bars). The percent of total curated papers vs. rejected papers is shown for each quartile.
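The two rankings and the quartile split can be sketched as below. Because 3,583 is not divisible by four, the quartiles cannot all hold exactly 896 documents; this sketch assumes a ceiling-sized split that leaves the last quartile one document short, which may differ from how the authors handled the remainder. The article records and their scores are synthetic placeholders, and the descending DRS order is also an assumption.

```python
import math


def split_into_quartiles(ranked):
    """Split a ranked list into four consecutive groups of ceil(n/4) items."""
    size = math.ceil(len(ranked) / 4)
    return [ranked[i * size:(i + 1) * size] for i in range(4)]


# Synthetic placeholder records; PMIDs and scores are not real data.
articles = [{"pmid": 1000 + i, "drs": (i * 37) % 400} for i in range(3583)]

by_pmid = sorted(articles, key=lambda a: a["pmid"], reverse=True)  # ranking (A)
by_drs = sorted(articles, key=lambda a: a["drs"], reverse=True)    # ranking (B)

drs_quartile_sizes = [len(q) for q in split_into_quartiles(by_drs)]
print(drs_quartile_sizes)  # [896, 896, 896, 895]
```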
Figure 7.
DRS effectively ranks articles for data content.
A total of 38,118 novel interactions are distributed into progressive quartiles (Q1–Q4) based upon either DRS ranking (blue) or PMID ranking (orange) for three different types of interactions: (A) 35,385 novel chemical-gene (C–G) interactions, (B) 1,549 novel chemical-disease (C–D) interactions, and (C) 1,184 novel gene-disease (G–D) interactions.
Figure 8.
DRS effectively ranks articles for productivity.
(A) The total number of interactions (both novel and repeated) for each quartile is divided by (B) the time spent curating them to produce (C) an average interaction yield rate (interactions per minute) for each quartile.
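The yield rate in panel (C) is a simple ratio, sketched below. The function name `yield_rate` is hypothetical, and the per-quartile numbers are invented solely to show the calculation; they are not the study's results.

```python
def yield_rate(interactions, minutes):
    """Interactions curated per minute of curation time."""
    if minutes <= 0:
        raise ValueError("curation time must be positive")
    return interactions / minutes


# Invented (interactions, minutes) pairs, for illustration only.
example_quartiles = {"Q1": (1200, 300), "Q4": (150, 250)}
rates = {q: yield_rate(n, t) for q, (n, t) in example_quartiles.items()}
print(rates)  # {'Q1': 4.0, 'Q4': 0.6}
```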
Figure 9.
Disease category distribution for the seven heavy metals.
The number of diseases curated for each metal is indicated for cadmium (Cd), cobalt (Co), copper (Cu), lead (Pb), manganese (Mn), mercury (Hg), and nickel (Ni). These specific disorders were then mapped and distributed across 21 generic disease categories (legend at top) using CTD's MEDIC-Slim disease mappings [2] to look for overrepresented disease classes for each individual heavy metal. For example, of the 70 specific diseases associated with copper (Cu), 23 of them (33%) are nervous system disorders and 12 of them (17%) are cardiovascular disorders.
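The copper percentages quoted above follow directly from the counts, as this short check shows. The helper name `category_share` is illustrative.

```python
def category_share(category_count, total_diseases):
    """Percentage of a metal's diseases that fall in one category, rounded."""
    return round(100 * category_count / total_diseases)


copper_total = 70  # specific diseases associated with copper
nervous_system = category_share(23, copper_total)
cardiovascular = category_share(12, copper_total)
print(nervous_system, cardiovascular)  # 33 17
```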