Classification and analysis of a large collection of in vivo bioassay descriptions
Fig 11
Processing of assay descriptions, with an illustrative example case.
(A) The input data: raw assay descriptions retrieved from the ChEMBL database. (B) Shallow grammatical analysis (shallow parsing). The GENIA tagger annotates each word with its part-of-speech (POS) category (e.g. noun, adjective, verb). The POS annotations are then used to find longer chunks of text corresponding to noun phrases, represented here as yellow blocks in the shallow parse tree. (C) Custom chunking. Noun phrases detected by GENIA are simplified using custom tags and chunking rules. (D) Named entity recognition (NER). Strains, experimental animal models, and phenotypic terms are identified in the text using a combination of dictionary-based and rule-based NER methods. (E) Learning distributed vector representations. The entire dataset of preprocessed assay descriptions is used to train a neural network language model, Word2Vec. The trained model converts the words and noun phrases of each assay description into high-dimensional numerical vectors that can be used as input for clustering and machine learning models. S, sentence; NP, noun phrase; PP, prepositional phrase; VP, verb phrase; JJ, adjective; NN, noun; IN, preposition; NNP, proper noun; VBN, verb, past participle.
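
Steps (B) to (D) can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the tagged sentence is hand-written rather than produced by the GENIA tagger, the chunking rule (maximal runs of adjective and noun tags that contain at least one noun) is a simplification, and the dictionary entries and the names `TAGGED`, `ENTITY_DICTIONARY`, `chunk_noun_phrases`, and `tag_entities` are hypothetical. NNS (plural noun) is a standard Penn Treebank tag that is not listed in the legend above.

```python
# Hypothetical POS-tagged assay description (hand-written, not GENIA output).
TAGGED = [
    ("Anticonvulsant", "JJ"),
    ("activity", "NN"),
    ("measured", "VBN"),
    ("in", "IN"),
    ("Wistar", "NNP"),
    ("rats", "NNS"),
]

# Hypothetical dictionary mapping lower-cased terms to entity types.
ENTITY_DICTIONARY = {
    "wistar": "STRAIN",
    "anticonvulsant": "PHENOTYPE",
}


def chunk_noun_phrases(tagged):
    """Group maximal runs of adjective/noun tokens into noun phrases.

    A run is kept only if it contains at least one noun tag (NN*).
    """
    phrases, current = [], []
    # A sentinel token with a non-matching tag flushes the final run.
    for word, tag in list(tagged) + [("", "END")]:
        if tag.startswith(("JJ", "NN")):
            current.append((word, tag))
            continue
        if any(t.startswith("NN") for _, t in current):
            phrases.append(" ".join(w for w, _ in current))
        current = []
    return phrases


def tag_entities(phrase):
    """Label each word of a phrase that appears in the entity dictionary."""
    return [
        (word, ENTITY_DICTIONARY[word.lower()])
        for word in phrase.split()
        if word.lower() in ENTITY_DICTIONARY
    ]


if __name__ == "__main__":
    for phrase in chunk_noun_phrases(TAGGED):
        print(phrase, tag_entities(phrase))
```

In this toy setting the chunker yields the noun phrases "Anticonvulsant activity" and "Wistar rats", and the dictionary lookup labels "Wistar" as a strain. A real system would need multi-word dictionary entries and rule-based patterns in addition to single-token lookup.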