Extraction of Transcript Diversity from Scientific Literature

Transcript diversity generated by alternative splicing and associated mechanisms contributes heavily to the functional complexity of biological systems. The numerous examples of the mechanisms and functional implications of these events are scattered throughout the scientific literature. Thus, it is crucial to have a tool that can automatically extract the relevant facts and collect them in a knowledge base that can aid the interpretation of data from high-throughput methods. We have developed and applied a composite text-mining method for extracting information on transcript diversity from the entire MEDLINE database in order to create a database of genes with alternative transcripts. It contains information on tissue specificity, number of isoforms, causative mechanisms, functional implications, and experimental methods used for detection. We have mined this resource to identify 959 instances of tissue-specific splicing. Our results in combination with those from EST-based methods suggest that alternative splicing is the preferred mechanism for generating transcript diversity in the nervous system. We provide new annotations for 1,860 genes with the potential for generating transcript diversity. We assign the MeSH term “alternative splicing” to 1,536 additional abstracts in the MEDLINE database and suggest new MeSH terms for other events. We have successfully extracted information about transcript diversity and semiautomatically generated a database, LSAT, that can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression. LSAT (Literature Support for Alternative Transcripts) is publicly available at http://www.bork.embl.de/LSAT/.


Inductive Learning
In inductive learning, positive and negative examples are provided to a learning method. The learning performance is then assessed on a set of examples that the learner has not seen before. The process is repeated until the classifier achieves satisfactory performance.
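The train, evaluate, and repeat cycle can be illustrated with a minimal sketch. It uses a toy bag-of-words perceptron purely for brevity; this is not the classifier used in this work, and the example sentences, the function names, and the `target` threshold are all assumptions made for illustration.

```python
# Toy illustration of inductive learning: train on labelled examples,
# evaluate on unseen ones, and repeat until performance is satisfactory.
# All sentences below are invented; label 1 = describes transcript diversity.
TRAIN = [
    ("alternative splicing generates two isoforms", 1),
    ("the gene encodes multiple transcripts by alternative splicing", 1),
    ("protein levels were measured in serum", 0),
    ("patients were treated with the drug", 0),
]
HELD_OUT = [
    ("alternative splicing produces three transcripts", 1),
    ("serum levels were measured in patients", 0),
]

def tokens(sentence):
    """Bag-of-words representation: the set of lower-cased tokens."""
    return set(sentence.lower().split())

def predict(weights, bias, sentence):
    score = bias + sum(weights.get(t, 0) for t in tokens(sentence))
    return 1 if score > 0 else 0

def train_epoch(weights, bias, examples):
    """One perceptron pass; weights are updated in place, bias is returned."""
    for sentence, label in examples:
        error = label - predict(weights, bias, sentence)
        if error:
            for t in tokens(sentence):
                weights[t] = weights.get(t, 0) + error
            bias += error
    return bias

def accuracy(weights, bias, examples):
    hits = sum(predict(weights, bias, s) == y for s, y in examples)
    return hits / len(examples)

def learn(train, held_out, target=0.9, max_epochs=20):
    """Repeat training until held-out accuracy reaches the (assumed) target."""
    weights, bias, acc = {}, 0, 0.0
    for epoch in range(1, max_epochs + 1):
        bias = train_epoch(weights, bias, train)
        acc = accuracy(weights, bias, held_out)
        if acc >= target:
            break
    return weights, bias, epoch, acc

weights, bias, epochs, acc = learn(TRAIN, HELD_OUT)
print(f"stopped after {epochs} epoch(s), held-out accuracy {acc:.2f}")
```

The held-out sentences never contribute to the weight updates, which is what makes the accuracy figure an estimate of performance on unseen text.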

Predicate argument structures
A verb that indicates the particular type of event conveyed by a sentence can occur in its verbal form, its participial modifier form, or its nominal form. For example, the verbal form of the verb used to describe the event "finding presence of something" would be detect, its participial modifier forms would be detecting or detected, and its nominal form would be detection. Sentence constituents that hold meaningful roles in completing the meaning of the event indicated by the verb are called arguments (see also the examples below).
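As a rough sketch of how the three forms can be reduced to one predicate, the lookup below maps each surface form of detect to its base verb and form class. The table `VERB_FORMS` and the function `find_predicates` are hypothetical names introduced here for illustration; they are not part of the system described in this work.

```python
import re

# Hypothetical lookup: surface form -> (base verb, form class).
# The form classes follow the three categories named in the text. A plain
# lookup cannot tell a finite past tense ("X detected Y") from a participle
# ("Y detected by X"), so "-ed" forms are simply filed as participial here.
VERB_FORMS = {
    "detect": ("detect", "verbal"),
    "detects": ("detect", "verbal"),
    "detecting": ("detect", "participial"),
    "detected": ("detect", "participial"),
    "detection": ("detect", "nominal"),
}

def find_predicates(sentence):
    """Return (surface word, base verb, form class) for each known form, in order."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [(w, *VERB_FORMS[w]) for w in words if w in VERB_FORMS]

for example in ("Detection of two transcripts by northern blot analysis",
                "Transcripts detected in brain"):
    print(find_predicates(example))
```

Whatever the surface form, the same base verb is returned, so later steps can treat all three variants as the same event.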

Merging multiple syntactic patterns to semantic patterns
For example, in the sentence 'Northern blot analysis detected the presence of a 2.4kb transcript and a 3.2 kb transcript in brain, liver and pancreas', the phrases 'Northern blot analysis' and 'brain, liver and pancreas' serve as arguments to the verb detect, with the semantic labels experimental methods and tissues, respectively. Rewording the sentence as 'Detection of 2.4 kb and 3.2 kb transcripts present in brain, liver and pancreas by northern blot analysis' does not change the semantic roles assigned to the constituents 'northern blot analysis' and 'brain, liver and pancreas'. Likewise, in the sentence 'Using RT-PCR and nucleotide sequencing, alternative splicing was confirmed in liver, brain and testis', the phrases 'RT-PCR and nucleotide sequencing' and 'liver, brain and testis' serve the roles of experimental methods and tissues, respectively.
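The point that syntactic variants collapse to one semantic pattern can be sketched with a small dictionary-based labeller applied to the three example sentences above. The vocabularies, the `semantic_frame` function, and the output layout are assumptions made for illustration; the actual system relies on predicate argument structure analysis rather than on plain dictionary matching.

```python
import re

# Hypothetical mini-dictionaries; a real system would use far larger ones.
METHODS = ["northern blot analysis", "rt-pcr", "nucleotide sequencing"]
TISSUES = ["brain", "liver", "pancreas", "testis"]
PREDICATES = {
    "detect": "detect", "detected": "detect", "detection": "detect",
    "confirm": "confirm", "confirmed": "confirm", "confirmation": "confirm",
}

def find_terms(sentence, vocabulary):
    """Return the vocabulary terms present in the sentence, in order of appearance."""
    text = sentence.lower()
    hits = []
    for term in vocabulary:
        match = re.search(r"\b" + re.escape(term) + r"\b", text)
        if match:
            hits.append((match.start(), term))
    return [term for _, term in sorted(hits)]

def find_predicate(sentence):
    """Return the base verb of the first known predicate form, or None."""
    for word in re.findall(r"[a-z]+", sentence.lower()):
        if word in PREDICATES:
            return PREDICATES[word]
    return None

def semantic_frame(sentence):
    """Reduce a sentence to its predicate plus labelled arguments."""
    return {
        "predicate": find_predicate(sentence),
        "experimental_methods": find_terms(sentence, METHODS),
        "tissues": find_terms(sentence, TISSUES),
    }

ACTIVE = ("Northern blot analysis detected the presence of a 2.4kb transcript "
          "and a 3.2 kb transcript in brain, liver and pancreas")
NOMINAL = ("Detection of 2.4 kb and 3.2 kb transcripts present in brain, "
           "liver and pancreas by northern blot analysis")
PASSIVE = ("Using RT-PCR and nucleotide sequencing, alternative splicing was "
           "confirmed in liver, brain and testis")

for sentence in (ACTIVE, NOMINAL, PASSIVE):
    print(semantic_frame(sentence))
```

The first two sentences differ in syntax but produce identical frames, which is the sense in which multiple syntactic patterns merge into one semantic pattern.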

Rules for extracting semantic patterns
For example, a rule to find out the role of the variable region in alternatively spliced transcripts in terms of structure or function could be summarized as
Apart from the phrases extracted using predicate argument structure analysis, event mechanisms were extracted based on bi-gram and tri-gram lists. Tissue specificity was identified by tagging the word 'specific*' where it follows a tagged tissue name or forms part of the word describing the tissue (e.g. brain-specific).
Similarly, the 'number of isoforms' was extracted using the observation that such numbers always preceded the tagged event mechanisms. Tissues were tagged using a dictionary compiled from Swiss-Prot and RefSeq. Gene names were tagged using an entity tagger [6].

Ensembl genes for human, mouse and rat genomes
Using literature entries present in these databases, we mapped our results to Ensembl genes. We could add 674, 637, and 359 annotations for alternative splicing (AS) for the human, mouse and rat genomes, respectively.
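The surface rules described above for event mechanisms, tissue specificity, and number of isoforms can be approximated with regular expressions, as in the sketch below. The term lists, number words, and function names are hypothetical stand-ins; the real dictionaries and n-gram lists are not reproduced here.

```python
import re

# Hypothetical stand-ins for the tissue dictionary and the bi-gram list.
TISSUES = ["brain", "liver", "testis", "heart"]
MECHANISMS = ["alternative splicing", "alternatively spliced",
              "alternative promoter", "splice variants"]
NUMBER_WORDS = {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6}

def mechanisms_in(sentence):
    """Return the listed n-grams that occur in the sentence."""
    text = sentence.lower()
    return [m for m in MECHANISMS
            if re.search(r"\b" + re.escape(m) + r"\b", text)]

def tissue_specific(sentence):
    """Return tissues followed by 'specific*', e.g. 'brain-specific' or 'liver specific'."""
    text = sentence.lower()
    return [t for t in TISSUES
            if re.search(r"\b" + re.escape(t) + r"[- ]specific\w*", text)]

def isoform_count(sentence):
    """Return the number directly preceding a mechanism phrase, or None."""
    text = sentence.lower()
    numbers = "|".join(NUMBER_WORDS) + r"|\d+"
    mechanisms = "|".join(re.escape(m) for m in MECHANISMS)
    match = re.search(r"\b(" + numbers + r")\s+(?:" + mechanisms + r")", text)
    if match is None:
        return None
    token = match.group(1)
    return NUMBER_WORDS[token] if token in NUMBER_WORDS else int(token)

EXAMPLE = ("Three alternatively spliced transcripts show brain-specific "
           "and liver specific expression")
print(mechanisms_in(EXAMPLE), tissue_specific(EXAMPLE), isoform_count(EXAMPLE))
```

This sketch only checks for a number immediately before the mechanism phrase; intervening words such as adjectives would need a looser pattern.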

Supplementary figure 3: Description of training set
Example sentences from our training set describing the generation of transcript diversity (figure 3a) and negative sentences (figure 3b) from MEDLINE. Alternative transcripts are generated by many mechanisms or combinations of them. Hence, the SVM classifier has to learn multiple patterns in addition to their syntactic variants. The sentences are classified into various categories, and semantic patterns are numbered 1 to 8. Please see table 1 for the pattern labels.