Fig 1.
Assay and target type distribution in ChEMBL.
Distribution of assay types in ChEMBL (by percentage of all assays in the database) and distribution of the types of associated biological targets. The molecular target category covers multiple ChEMBL target types, including “single protein”, “protein complex”, “protein family”, “nucleic acid”, “macromolecule”, and “protein-protein interaction”.
Fig 2.
Animals used in in vivo efficacy assays.
Other mammals include mainly laboratory rodents (e.g. hamster, gerbil), carnivores (cat), lagomorphs (rabbit), and primates (e.g. rhesus monkey); the latter were used in 1,157 assays. The main classes of non-mammal animals include arthropods, nematodes, and birds.
Fig 3.
Length of assay descriptions (in words).
Fig 4.
ATC classes of approved drugs tested in in vivo efficacy assays.
The dendrogram represents the Anatomical Therapeutic Chemical (ATC) drug hierarchy and the coverage of various drug classes in the ChEMBL in vivo dataset. The height and color of the bars on the circular bar plot (outer ring) represent the number of assays involving drugs assigned a given ATC code (level 2 of the ATC classification system). The most common ATC level 2 classes, corresponding to different therapeutic/pharmacological subgroups, are highlighted.
Fig 5.
Most common rodent strains and experimental disease models mentioned in the descriptions of in vivo efficacy assays in ChEMBL.
(A) The twenty strains most frequently mentioned in assay descriptions; outbred strains are marked with an asterisk (*). After identification in the text of assay descriptions, the strain names were normalized using strain synonym listings maintained by rodent genome databases. For instance, the C57BL mouse was referred to by more than 30 different terms, including names that do not follow official nomenclature guidelines, such as “BL6”, “Black6”, or “C57/Black”. (B) Bar plot showing the twenty experimental models most frequently mentioned in assay descriptions. The models were manually annotated with disease area.
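The synonym-based normalization described for panel A can be illustrated with a minimal sketch. The synonym table below is a hypothetical stand-in containing only the variants quoted in the caption (the real listings come from rodent genome databases), and the key-folding rule in `_key` is an assumption made for illustration, not the authors' procedure.

```python
import re

# Hypothetical synonym table: canonical strain name -> observed variants.
# Only the variants quoted in the caption are included here.
STRAIN_SYNONYMS = {
    "C57BL": ["C57BL", "BL6", "Black6", "C57/Black"],
}


def _key(term):
    """Fold a term to lowercase letters and digits (assumed matching rule)."""
    return re.sub(r"[^a-z0-9]", "", term.lower())


# Reverse lookup: folded synonym -> canonical strain name.
LOOKUP = {
    _key(synonym): canonical
    for canonical, synonyms in STRAIN_SYNONYMS.items()
    for synonym in synonyms
}


def normalize_strain(term):
    """Return the canonical strain name, or None if the term is unknown."""
    return LOOKUP.get(_key(term))
```

With this table, `normalize_strain("Black 6")` and `normalize_strain("c57/black")` both map to `"C57BL"`, while an unlisted term returns `None`.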
Table 1.
For each example query, the table shows the four most similar words/phrases as measured by the cosine similarity of the associated vector embeddings (shown for each result). The embeddings were learned by a Word2Vec model trained on preprocessed in vivo assay descriptions (following the shallow parsing and noun phrase extraction workflow summarized in the Methods section).
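As an illustration of the similarity ranking used in this table, the sketch below computes cosine similarity between embedding vectors and returns the closest terms for a query. The three-dimensional vectors in `toy_embeddings` are invented for the example and are not taken from the trained model; the function names are likewise hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def most_similar(query, embeddings, top_n=4):
    """Rank all other terms by cosine similarity to the query term."""
    query_vec = embeddings[query]
    scored = [
        (term, cosine_similarity(query_vec, vec))
        for term, vec in embeddings.items()
        if term != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Invented toy vectors; real Word2Vec embeddings are high-dimensional.
toy_embeddings = {
    "rat": [1.0, 0.0, 0.0],
    "mouse": [0.9, 0.1, 0.0],
    "seizure": [0.0, 1.0, 0.0],
    "convulsion": [0.1, 0.9, 0.0],
    "tumor": [0.0, 0.0, 1.0],
}
```

Calling `most_similar("seizure", toy_embeddings)` returns up to four `(term, similarity)` pairs, mirroring the layout of the table.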
Fig 6.
Semantic similarities between animal models and phenotypes.
A hierarchically clustered heatmap showing pairwise semantic similarities between 35 animal models and 35 phenotypes frequently mentioned in the assay descriptions. Red corresponds to higher and blue to lower semantic similarity; values in each row are Z-score normalized. Both rows and columns are hierarchically clustered (using average linkage and Euclidean distance), and the results are represented as dendrograms. Semantic clusters, shown as red regions on the heatmap, correspond to distinct disease areas including epilepsy, pain, inflammation, hypertension, diabesity, and cancer. The figure provides an automatically generated summary of the use of common animal models to study the effect of drugs on different types of disease-related phenotypes.
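The row-wise Z-score normalization mentioned above can be sketched as follows. The caption does not say whether the population or sample standard deviation was used; this sketch assumes the population form and maps constant rows to zeros. The hierarchical clustering step is not shown.

```python
import statistics


def zscore_rows(matrix):
    """Z-score each row: subtract the row mean, divide by the row std."""
    normalized = []
    for row in matrix:
        mean = statistics.fmean(row)
        sd = statistics.pstdev(row)  # population std (an assumption)
        if sd == 0:
            # A constant row carries no contrast; map it to zeros.
            normalized.append([0.0] * len(row))
        else:
            normalized.append([(x - mean) / sd for x in row])
    return normalized
```

After this step each row has mean 0, so colors on the heatmap compare phenotypes within a single animal model rather than across models.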
Fig 7.
Visualization of a semantic space of assay descriptions.
Vector representations calculated for individual assay descriptions were projected into two-dimensional space and visualized as points on a scatterplot. The colors correspond to ATC codes of approved drugs tested in the assays: antiepileptics, N03; anti-inflammatory, M01, M02, C01, S01; antidiabetics, A10; psycholeptics, N05; antineoplastics, L01.
Table 2.
Most common drugs, phenotypes, experimental animal models, and the five most enriched phrases (ranked by a simple Fisher test p-value) for the five most common ATC combinations.
Phenotypes, animal models, and noun phrases were text-mined from the text of assay descriptions; drug names were extracted from structured data fields in ChEMBL.
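One way to rank phrases by enrichment is a one-sided Fisher's exact test on a 2×2 table of phrase occurrence versus class membership. The sketch below assumes this one-sided form and computes the p-value as a hypergeometric tail; the caption does not state which variant of the test was used, and the function name and argument layout are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided p-value P(X >= k) under the hypergeometric null.

    N: total number of assays
    K: assays (in any class) whose description contains the phrase
    n: assays in the class of interest
    k: assays in the class of interest that contain the phrase
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the probabilities of observing k or more co-occurrences.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total
```

Phrases would then be sorted by ascending p-value within each ATC combination, with the smallest values indicating the strongest enrichment.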
Fig 8.
Confusion matrix and per-class performance measures calculated for one of the random forest classifiers.
The figure shows performance measures calculated for a multiclass random forest classifier that assigns each assay to one of the five most common ATC code combinations, a proxy for the most common disease areas in ChEMBL. The model was built with the data visualized in Fig 7; a strict partitioning method based on a random document split was used to partition the dataset into cross-validation subsets. The model achieved an overall prediction accuracy of 0.87. (A) Per-class confusion matrix. (B) Per-class classification report.
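For readers unfamiliar with these measures, the sketch below shows how a confusion matrix, overall accuracy, and per-class scores can be derived from true and predicted labels. It assumes the classification report consists of precision, recall, and F1, which the caption does not specify. The classifier itself is not reproduced here.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_report(matrix, labels):
    """Precision, recall, and F1 for each class."""
    report = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report
```

Here `labels` would hold the five ATC code combinations, and `y_pred` the predictions collected across the cross-validation subsets.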
Fig 9.
Major component of the animal model-drug network with detailed “diabesity” cluster.
The nodes in the graph correspond to approved drugs and animal models of disease, including induced, spontaneous, and transgenic disease models text-mined from assay descriptions. A drug is linked to an animal model if it was tested in at least five assays involving this model. Drug nodes are colored according to the assigned ATC (level 2) codes, while animal model nodes are blue; node size is proportional to the number of assays involving a given drug or model. Animal model-drug relationships visualized in the graph are listed in the S5 Dataset. STZ, streptozotocin-induced model; GTT, glucose tolerance test; ZDF, Zucker Diabetic Fatty rat; glucose, glucose-loaded model.
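The linking rule described above (an edge exists when a drug was tested in at least five assays involving a model) can be sketched as below. The record layout `(assay_id, drug, model)` and the function name are assumptions made for illustration; the actual relationships are those listed in the S5 Dataset.

```python
from collections import defaultdict


def build_edges(records, min_assays=5):
    """Return {(drug, model): assay_count} for pairs meeting the threshold.

    records: iterable of (assay_id, drug, model) tuples (assumed layout).
    """
    assays_per_pair = defaultdict(set)
    for assay_id, drug, model in records:
        # A set ensures each assay is counted once per drug-model pair.
        assays_per_pair[(drug, model)].add(assay_id)
    return {
        pair: len(assay_ids)
        for pair, assay_ids in assays_per_pair.items()
        if len(assay_ids) >= min_assays
    }
```

The resulting dictionary can be loaded into any graph library as a weighted edge list, with node sizes computed separately from per-drug and per-model assay counts.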
Fig 10.
Differential use of rats and mice across in vivo assays.
(A) Number of assays involving rats and mice for eight example experimental systems. (B) Differential use of the two rodent species in assays testing drugs from the 10 most common drug classes (based on the second level of the ATC classification). Classes are ordered by the difference in the number of assays involving rats and mice. The images of the animals used in the figure were obtained under an open license from the Gene Expression Atlas (https://www.ebi.ac.uk/gxa).
Fig 11.
Processing of assay descriptions, with an illustrative example case.
(A) The input data: raw assay descriptions retrieved from the ChEMBL database. (B) Shallow grammatical analysis (shallow parsing). The GENIA tagger annotates each word with its corresponding part-of-speech (POS) category (e.g. noun, adjective, verb). The POS annotations are then used to find longer chunks of text corresponding to noun phrases, represented here as yellow blocks in the shallow parse tree. (C) Custom chunking. Noun phrases detected by GENIA are simplified using custom tags and chunking rules. (D) Named entity recognition (NER). Strains, experimental animal models, and phenotypic terms are identified in the text using a combination of dictionary and rule-based NER methods. (E) Learning distributed vector representations. The entire dataset of preprocessed assay descriptions is used to train a neural network language model, Word2Vec. Words and noun phrases from each assay description are thereby converted to high-dimensional numerical vectors that can be used as input for clustering and machine learning models. S, sentence; NP, noun phrase; PP, prepositional phrase; VP, verb phrase; JJ, adjective; NN, noun; IN, preposition; NNP, proper noun; VBN, verb, past participle.