Textrous!: Extracting Semantic Textual Meaning from Gene Sets

The un-biased and reproducible interpretation of high-content gene sets from large-scale genomic experiments is crucial to the understanding of biological themes, validation of experimental data, and the eventual development of plans for future experimentation. To derive biomedically-relevant information from simple gene lists, a mathematical association to scientific language and meaningful words or sentences is crucial. Unfortunately, existing software for deriving meaningful and easily-appreciable scientific textual ‘tokens’ from large gene sets either rely on controlled vocabularies (Medical Subject Headings, Gene Ontology, BioCarta) or employ Boolean text searching and co-occurrence models that are incapable of detecting indirect links in the literature. As an improvement to existing web-based informatic tools, we have developed Textrous!, a web-based framework for the extraction of biomedical semantic meaning from a given input gene set of arbitrary length. Textrous! employs natural language processing techniques, including latent semantic indexing (LSI), sentence splitting, word tokenization, parts-of-speech tagging, and noun-phrase chunking, to mine MEDLINE abstracts, PubMed Central articles, articles from the Online Mendelian Inheritance in Man (OMIM), and Mammalian Phenotype annotation obtained from Jackson Laboratories. Textrous! has the ability to generate meaningful output data with even very small input datasets, using two different text extraction methodologies (collective and individual) for the selecting, ranking, clustering, and visualization of English words obtained from the user data. Textrous!, therefore, is able to facilitate the output of quantitatively significant and easily appreciable semantic words and phrases linked to both individual gene and batch genomic data.


Introduction
With the increasing experimental prevalence of high-throughput genomic technologies, researchers are often challenged with the task of selecting, analyzing, clustering, and interpreting lists of functionally-relevant genes to a particular experiment at hand [1]. Given that an abundance of information about individual genes is contained in the text of published literature, with the recent development of novel informatic procedures literature mining with natural language processing techniques has become much more fruitful in recent years [2]. Current developments in this emerging field include literature-based methods for determining the functional coherence of a gene set, generating related transcription factors from microarray derived gene sets, and the functional userbased clustering of related genes [3][4][5].
An important aspect of gene set interpretation is the transformation of large gene sets into interpretable and manageable forms. Bridging the gap between large gene sets and the English language is potentially valuable for a variety of applications, including the discovery of previously unknown biological connections, identifi-cation of potential research topics, visualization of biological themes, discrimination between specific data sets, and validation of existing data. Current software for the interpretation of highthroughput genomic data share one or more of the following characteristics: reliance on controlled-languages (Gene Ontology (GO), Medical Subject Headings (MeSH), BioCarta, Kyoto Encyclopedia of Genes and Genomes (KEGG)), inability to search more than a few genes, and use of standard Boolean and cooccurrence models [6][7][8][9]. For example, Gene2MeSH, LigerCat, AmiGO, and Genes2WordCloud, four tools for generating enriched biological themes from a gene set, employ Boolean models or use exclusively terms that are preselected by BioCarta, GO, MeSH, and KEGG [10][11][12][13]. Our development of Textrous! in no way makes any of these excellent resources redundant. Therefore Textrous! should, as with other applications, be seen as a complementary device that should be used in conjunction with other forms of textual analysis such as the exemplary LigerCat which facilitates data text extraction using 'MeSH Cloud' outputs [13]. The combined use of multiple data analysis tools is therefore likely to yield the most comprehensive and meaningful appreci-ation of the input data. Similar tools that fall into the same generic category include the Database for Annotation, Visualization, and Integrated Discovery (DAVID), PubMatrix, WebGestalt, and Gene Set Enrichment Analysis [14][15][16][17]. All of these important and useful applications can create structured text interpretations of complex biological data, but do so using rigid clustering criteria that may possess considerable redundancy or possess limitations in their scope.
The use of predefined vocabularies such as GO or KEGG pathways places limitations on the range of words that can be used to describe a gene set. In addition, the textual connections implied by these curated libraries may be added to (e.g. GO terms) or even rapidly superseded, e.g. for the case of KEGG pathways, by the acquisition of additional experimental knowledge. Potentially, this can lead to decreased recall, as infrequent words not suitable for curation are discarded completely. A co-occurrence model suffers from disadvantages as well; most notably, such a model is unable to extract indirect relationships and facilitate new discoveries. To address these issues, we have developed Textrous!, a web-based framework for the extraction of semantic meaning from gene sets without the use of controlled-languages and pathways. Textrous! employs various natural language processing techniques, including latent semantic indexing (LSI), sentence splitting, word tokenization, parts-of-speech tagging, and noun-phrase chunking, to mine MEDLINE abstracts, PubMed Central articles, articles from the Online Mendelian Inheritance in Man (OMIM), and Mammalian Phenotype annotation obtained from Jackson Laboratories (www. informatics.jax.org/phenotypes.shtml) [18,19]. From an input of one or more genes, Textrous! is able to generate words and nounphrases and their associated similarity scores, z-scores, and pvalues. In addition, Textrous! can easily create a hierarchical cloud, combining elements of traditional word clouds and agglomerative hierarchical clustering, as well as a heat map, illustrating the pairwise similarities between each gene and word. Textrous! therefore presents an alternative to rigidly-curated data set interpretation systems that allows experimenters to generate additional and more nuanced levels of textual appreciation of large biomedical data sets.

Generation of ''Gene-Documents''
Our corpus of ''gene-documents'' was created from a concatenation of all MEDLINE titles, abstracts, full articles, and articles from the Online Mendelian Inheritance in Man as well as all the articles from the Jackson Laboratories Mammalian Phenotype Database. Individual genes were linked to PubMed articles by manually curated citation cross-reference data in the Entrez Gene repository. High precision and low recall is expected due to the manual curation process, as there are far more gene-article links than curated links [3]. Since a small proportion of MEDLINE abstracts describe sequencing experiments that specifically mention a disproportionately large number of genes, all PubMed abstracts that mention more than ten genes are discarded. Abstracts and articles were downloaded using PubMed's E-Utilities, and articles from OMIM and Jackson Laboratories were downloaded and extracted from their respective FTP dumps.
All gene-documents were kept in the collection and left unfiltered. Punctuation was stripped from all the gene-documents, with the exception of hyphens, underscores, and apostrophes. Words occurring in more than half of all documents, only one document, or found in Cornell University's SMART stoplist were excluded from all documents [20,21]. In addition to Cornell's SMART stoplist, a small list of 200 words was manually added to the stoplist. These words were determined empirically by multiple experimenters in our laboratory after multiple diverse-user interrogations of Textrous!. The final corpus consisted of 67412 genes from a variety of plant and animal genomes and 12281 words.

Generation of the Term Document Matrix
A term-document matrix was constructed by applying both local and global weightings to the frequency of terms across each document in the corpus. In the term-document matrix, each row represents an English word, while each column represents a gene in the gene-document collection. A term frequency (TF) -inverse document frequency (IDF) weighting scheme was used in the generation of the term-document matrix. Given our total collection of words (T), and gene-documents (D), term frequency can be calculated by the number of times a word appears in a document: The global weighting function, inverse document frequency, can be calculated as follows: Weighting functions are applied to each element of the termdocument matrix to increase the significance of words that are more likely to identify particular genes and decrease the significance of common, non-descriptive words. Since our corpus is selected from exclusively biological/biomedical literature, certain colloquial words have erroneously high global weightings (''geared'', ''ready'', ''book'') while other seemingly-important words often demonstrate inappropriately low weightings due to their relative commonality in biological/biomedical literature (''energy'', ''metabolism'', ''cancer''). To account for this, inverse document frequency was calculated from two sources: the ''gene documents'' collection from PubMed, OMIM, Jackson Laboratories, and the fifteenth edition of Encyclopedia Britannica (www.britannica. com/). The Encyclopedia Britannica text source was chosen specifically for its high-quality literary standards, large amount of text, academic writing, and variety of topics, some of which are not biomedically-related.
A rank-reduced Singular Value Decomposition (SVD) was applied to the term-document matrix (M), yielding three matrices, U, S, and V T ( Figure 1A-C) [22]. S is computed by taking the square root of the eigenvalues of MM T or M T M sorted in descending order on the main diagonal. U and V T are computed by taking the eigenvectors of MM T and M T M corresponding to the eigenvectors in S and placing them into their appropriate columns and rows, respectively. A rank of 120 was employed and was empirically determined by testing varying values of k (from 100 to 500). The columns of U and rows of V T can be viewed as LSI ''concepts'' or ''topics'', dimensions by which two terms or documents can be compared. As such, U and V T are referred to as the ''term-concept'' and ''document-concept'' matrices, respectively.

Generation of the Phrase Matrix
For each gene-document, all noun-phrases were extracted by the use of four statistically-based classifiers involved in sentence splitting, word tokenization, parts of speech tagging, and nounphrase chunking. From this, a master list of all noun-phrases was obtained. This list was processed by eliminating all punctuation and capitalization with exception of hyphens, underscores, and apostrophes; stripping all preceding articles; removing duplicate  Table' is depicted first. Additional textual output modes can be accessed subsequently using the toolbar. (C) Phrase hyperlinking from Cosine Similarity tables. Each word term generated from the input query list can be clicked on to link out (in red box) to the phrases in which it resides. The phrases containing the identified word are ranked according to their cosine similarity as well. (D) In addition to the Cosine Similarity output feature, the resulting word lists can be assessed by their output Z-score table or the probability scores in their p-value table. In each of these text word output formats each word can be linked out to its phrase context scoring box as in Figure 5. doi:10.1371/journal.pone.0062665.g002 phrases; and discarding all phrases that cannot be formed by the 12281 words in the term-document matrix. The resulting set of noun-phrases (NP) was used to generate a term-phrase-document matrix, with phrases and words as rows, gene documents as columns, and each cell as the presence of a word or phrase in a document.
A term-phrase-document matrix (P) was constructed as follows: Weighting functions are not applied to avoid the application of weighting functions twice, because this matrix is left-multiplied by U k . The resulting matrix U* can be viewed as a ''term-phraseconcept matrix'', with each row representing a word or phrase and each column representing an LSI dimension ( Figure 1D) [23].

Query Processing
Since each word and phrase is represented by a row vector in U*, a similarity score between any two words and phrases can be generated by using cosine distance. Cosine distance, defined by the cosine of the angle between two vectors, was computed as follows: Higher cosine scores correspond to a higher degree of similarity between two words. A user query of a list of genes is then treated as another hypothetical word in the English language, represented by a row in the original term-document matrix. All values of this row vector are positive for genes in the user query, and zero otherwise. This, in effect, is the same as supposing such a word existed in all gene documents in the user query. Such a word would perfectly describe the user's gene set and thus serves as a point-of-reference for all other input English words.
Fortunately, recomputation of the SVD is not needed to index the query into LSI space. Since M = USV T and therefore MVS 21 = U, we can index the query vector by right-multiplying it by V and S 21 . The LSI-space query vector can then be compared to any other word or phrase in U* using cosine distance. The top associated words and phrases can then easily be retrieved by identifying the words and phrases with the highest associated cosine scores. Rudimentary statistical significance was calculated using a Student's t-test.

Generation of Hierarchical Word Clouds
Hierarchical clouds are an integration of agglomerative hierarchical clustering (typically viewed as a dendrogram) with traditional word clouds. Agglomerative hierarchical clustering was applied to the top 30 output words of any given gene-based query. Briefly, the process is defined as the initialization of each word as its own cluster and the two closest clusters at each step being iteratively joined into one at each step, forming a tree structure. Each join is represented by a 261 or 162 HTML table, chosen randomly. The resulting cloud displays the collection of words via nested HTML tables. Each cell is color-coded to represent the time at which joins were made and font sizes are adjusted to be proportional to the calculated cosine similarities.

Generation of Gene-Word Heat Maps
Two-dimensional heat maps are generated to illustrate the pairwise differences between specific genes and specific words. Gene-word similarities were pre-calculated for all possible combinations and represented in a heat map format as an HTML table. In a heatmap, words are sorted by the number of genes with which they share statistical significance. The strength of gene-word association is indicated by the color intensity of each pairwise association. For the output heatmap teal is used as color of geneword association. The most popular noun-phrase associations processing. The hierarchical cloud shows depression and stress at the conjunction between terms related to the central nervous system and terms related to obesity. Each cell is color-coded to represent the time at which joins were made. Font sizes are adjusted in proportion to the calculated cosine similarities. (B) An illustration of the heat map produced by individual processing. The top associated (Cosine Similarity) terms are shown, as well as the relationships amongst genes. Here, the heat map shows that the top words are obesity-related, and that ''Bdnf'' is dissimilar to the other genes in the query. Grey color indicates a relative lack of association, while the intensity of teal color corresponds directly to the strength of correlation of each pairwise association. (C) Each of the output textual terms can be hyperlinked, via clicking on the word, to their associated topscoring (Cosine Similarity) phrases. In this panel the output word term 'hyperphagia' was linked out to its associated phrase contexts. doi:10.1371/journal.pone.0062665.g003 from the identified words can be accessed through a hyperlink embedded in the gene-word heat map.

Programming Procedures
Generation of gene-documents, including word tokenization, web crawling, filtering and parsing, was written in Python. Construction of the term-document matrix was written in Java, and computation of the SVD was done with SVDLIBC [24]. The parsing of noun-phrases, generation of the phrase matrix, and indexing of the phrase matrix into U* was written in Java with the help of Apache OpenNLP [25]. All web development was programmed with Python CGI, and resulting data displayed with HTML and CSS.

Description of the Textrous! User Interface
The current Textrous! website can be accessed at http://textrous. cit.nih.gov/ (Figure 2A). The web interface contains a search box where the user can input one or more official gene symbols delimited by whitespace. Textrous! is able to generate words and noun-phrases and their associated similarity scores, z-scores, and pvalues. After initial Gene Symbol input and activation of 'Submit' the Cosine Similarity results are depicted ( Figure 2B). Each output word in the ranking list can be used to link out to the top-scoring phrases associated with that specific word ( Figure 2C). After initial searching, the user can interact with the results by displaying the top words Cosine Similarity as well as their Z-score and probability value tables ( Figure 2D). In addition Textrous! allows the generation of hierarchical word clouds as well as displaying heat maps: both of these options allow the linking to noun-phrases from each word. Noun-phrases can then be traced back to their original PubMed articles. The number of genes found, as well as the genes excluded from the query, can be viewed from the search bar. All features are accessible on every page. A list of stopwords can be found at http://textrous.cit.nih.gov/genes2word/ stopwords or equivalently through the Features page of the main site.

Data Examples and Applications
Textrous! is able to process multiple genes with two different methodologies: collective processing and individual processing.     Data tables, hierarchical clouds, and phrasing collectively process an entire gene set as a whole, while two-dimensional heat maps individually process each gene. Both methods can be advantageous for distinct reasons and often times generate idiosyncratically effective results from the same initial gene set query. Unlike the previously described gene set annotational applications, e.g. DAVID or WebGestalt, Textrous! is able to generate biomedically-relevant word association data from even just one input gene identity. Such flexibility may be advantageous when minimal numbers of important genomic and proteomic factors are extracted from experimental data.
Simple dataset paradigm. Using collective processing, the features of every gene are combined into an ''average'' gene, or equivalently, the sum of the vectors created by each individual gene. Such dataset management can be likened to a gestalt appreciation of the whole dataset 'phenotype'. This serves as a different method of retrieving words, and leads to potentially distinct and alternatively meaningful results. Using collective processing in addition to hierarchical clouds allows the viewing of distinct themes within a specific gene set. One can quickly observe the presence of distinct themes in the hierarchical cloud generated by the query ''Lep Bdnf Fto Lepr'' (Lep, leptin; Bdnf, brain-derived neurotrophic factor; Fto, fat mass and obesity associated; Lepr, leptin receptor): obesity related words in the upper section, nervous system related words on the lower right, and, in the conjunction, depression related words on the lower left ( Figure 3A). Additionally, collectively-processed phrasing allows users to clarify potentially vague terms. For example, expanding the word ''gain'' on the query ''Fto'' yields ''body weight gain'', ''excessive weight gain'', and ''weight gain'' as the top calculated results. Using the same method, adjectives can be traced back to the nouns that they describe. Using individual processing, a user is able to view the relationships between specific words and specific genes, as well as gene to gene relationships ( Figure 3B). For example, a query of the metabolism-and neurotrophic-associated genes ''Lep Bdnf Fto Lepr'' shows not only which genes are responsible for which output words, with eventual associated noun-phrases, e.g. hyperphagia ( Figure 3C), and at what degree. In addition from this individual processing it is evident that an additional nuance of investigation is revealed, i.e. Bdnf is 'currently' considered to be relatively dissimilar to the other genes in the query ( Figure 3B).
To further validate the potential utility of Textrous! for dataset investigation we chose to next employ two considerably larger, previously validated and investigated transcriptomic datasets representing diverse molecular signatures. In these two paradigms Textrous! is challenged with extracting phenotypically relevant behavioral data as well as demonstrating its capacity to discriminate between two closely related molecular signaling datasets.
Physiological large dataset paradigm. To demonstrate the use of Textrous! for large behavioral datasets we chose a transcriptomic dataset obtained from murine central nervous tissue from experimental mice subjected to physical and cognitive tasks designed to isolate specifically the transcriptomic signatures associated with cognitive activity from transcriptional effects induced collaterally by the physical activity required to perform the cognitive task (i.e. Morris Water Maze) [26]. The dataset for transcripts significantly altered in the murine cortex in response to a cognitive task (Morris Water Maze completion) involving physical activity (swimming) compared to a task involving the same amount of physical activity with no goal-oriented behavior (time-controlled random swimming) is available on PubMed Central [26] and has been included in Table S1. Using this dataset (392 significantly-regulated transcripts) we derived, using Textrous!, the significantly associated words linked to this dataset (Table S2). In addition we also extracted the noun-phrases associated with the top 10 significantly-associated words from this list (Table S3). In Figure 4 we demonstrate the hierarchical cloud and the cosine similarities, Z scores and P values for the words forming the cloud. The strongest elements in the cloud, e.g. brainderived, neurotrophic, neuroprotective and neuroplasticity are all words consistently linked with physiological activities (e.g. neurosynaptic reinforcement, learning, memory) as well as neurochemicals (e.g. brain-derived neurotrophic factor) that regulate cognitive behavior [27][28][29][30][31]. We next compared these Textrous!-derived hierarchical cloud outputs to a diverse array of other forms of bioinformatic analysis of the same dataset ( Figure 5, Tables S2-S9). We found that with respect to the actual experimental paradigm, i.e. assessment of cognitive and learning behavior in mice [26], the Textrous! output (using the Top 5 lowest P value scoring words: numbered 1-5) was more tightly associated with the physical experiment data than the Top 5 lowest P value scoring outputs using KEGG (Table S4), GO (Table S5), WikiPathways (Table S6), IPA BioFunction (Table S7), NIH-DAVID PIR (Protein Information Resource: http://pir.georgetown.edu/) (Table S8) or IPA Canonical Pathways analysis (Table S9). In contrast to Textrous!, the other annotational tools (KEGG) often generate and prioritize highly generic and poorly-focused outputs, e.g. metabolic pathways. Therefore Textrous! appears to at least provide an important additional resource for extracting physiologicallyrelevant information from larger-scale datasets via collective processing. As Textrous! also allows simultaneous individual processing, via heatmap generation ( Figure 6, Figure S1), we also found that the strongest gene-word associations for this specific transcript set again exhibited a profound neurophysiological learning phenotype ( Figure S1, red box). For example, the strongest connections were discovered between neurophysiological words such as: dendrites; synaptic; plasticity; potentiation and transmitter, with important neurophysiological genes linked with learning such as: Gria2 (glutamate receptor, ionotropic, AMPA 2) [32]; Nrxn (Neurexin) [33]; Arc (activity-regulated cytoskeleton-associated protein) [34]; Homer1 (homer homolog 1) [35] and Rasgrf1 (Ras protein-specific guanine nucleotide-releasing factor 1) [36]. Using this physiological model example, Textrous! was able to generate physiologically accurate textual data extraction and presentation using individual as well as collective processing techniques from this large dataset.
Compare and Contrast Paradigm. As Textrous! demonstrated a robust ability to extract physiologically-relevant phenotypic data from a single comparison behavioral dataset we next tested whether Textrous! would be able to facilitate discriminatory data extraction from two contrasting datasets instead. We have recently demonstrated that structurally distinct therapeutic molecules, human parathyroid hormone (hPTH (1-34)) and a molecular variant (bPTH (7-34)) can activate the same parathyroid hormone receptor in bone tissue in a G protein-or b-arrestindependent manner respectively [37]. The functional signaling pathways and functional sequelae entrained by these two distinct ligands both support bone development but via clearly distinguishable mechanisms. Treatment of mice with hPTH (1-34) primarily affects signaling activity associated with enhanced bone formation through collagen synthesis and matrix mineralization, while bPTH (7-34) primarily affects pathways that promote expansion of the osteoblast pool, via modulation of cell cycle regulation, cell survival, and migration. This diverse molecular activity is one of the first demonstrations of 'biased agonist activity' in an in vivo setting. As the molecular pathways of these two ligands are well characterized and mechanistically distinct we employed Textrous! in a compare-and-contrast manner between these two large datasets (Tables S10, S11). Using the Textrous! collective processing with the hierarchical clouds we found a clear distinction between the two datasets ( Figure 7A-B). From the hierarchical cloud output from the hPTH (1-34) dataset that this ligand induces a classical 'bone development' phenotype as the most significantly associated words extracted are linked with mineralization, matrix synthesis and bone structure including: catenins [38,39]; calvaria; osteocytes; cadherin [40] and mineral ( Figure 7A, Table S12). In contrast, the words depicted in the hierarchical cloud from the bPTH (7-34) dataset include words less strongly associated with classical bone modeling activity but more with the atypical cell-cycle regulatory activity demonstrated by this ligand in vivo [37], e.g. cyclin, cyclin-dependent, cdk, m-phase and mitosis ( Figure 7B, Table S13). When we compared the top 100 extracted word associations from these two datasets we found that no words were common between the two Textrous! extractions ( Figure 7C). When the noun-phrases associated with the top 10 extracted words were compared between the hPTH (1-34) and bPTH  datasets again there were no common noun-phrases (Tables S14 and S15 respectively). Even with a manual dismantling of the individual words contained in the two noun-phrase lists only a minimal overlap between the two datasets was observed ( Figure 7D). Therefore the collective processing module of Textrous! was able to both generate an accurate appreciation of the two datasets in such a manner that their distinct mechanistic natures can be confidently compared and contrasted. Indicative of the potential discovery aspect of Textrous! data extraction a potentially strong interaction of bPTH  with neuronal activity is also suggested by the following extracted words, e.g. glutamatergic, nmda and post-synaptic. Future experimentation may indeed demonstrate this potential activity of this parathyroid hormone variant.
With the individual processing of these contrasting datasets ( Figure 8) we again found a strong distinction of Textrous! output. There were no common extracted heatmap words between these two datasets and the phenotypic nature of each signaling paradigm was clearly indicated. For the 'classically-acting' G proteindependent hPTH (1-34) we were able to extract G protein signaling-associated words (gtp-binding, heterotrimer), bone differentiation-associated words (collagen, osteogenic, ossification, periosteum, etc.) and most surprisingly the word 'pluripotent' ( Figure 8A). This final word is extremely interesting as the hPTH (1-34) ligand is considered 'pluripotent' in its signaling activity compared to the 'b-arrestin-focused' bPTH   [37]. In contrast to the hPTH (1-34) dataset, the individual processing (heatmap output) for the bPTH (7-34) dataset yielded extraction of words more specifically associated with alteration of cell cycle activity (s-phase, arrests, prophase, centrosomes) and cell motility (invasion, rearrangement, cytokinesis, projection). This data output therefore accurately replicates the actual in vivo data for the specific activity of this bPTH  receptor ligand compared to the standard hPTH (1-34) variant.
Using these three group paradigms (simple data, large physiological data and compare-and-contrast data) we have found that Textrous! data analysis is able to facilitate efficient and physiologically-meaningful data extraction, via multiple processing techniques, from both small and large input data lists.

Discussion
Textrous! is a novel web-based bioinformatics application that extracts semantic meaning from gene sets without the use of potentially outdated curated datasets, signaling pathways, or controlled languages. It is useful in many different contexts to help biologists extract impartial and differential knowledge from large volumes of genes or proteins (using official Gene Symbol nomenclature). Using an LSI-based approach, we were able to extract both implicit and explicit links to specific input genomic factors from diverse forms of scientific literature. Textrous! provides the user with a wealth of functionality for visualizing words, identifying themes, interrogating results, and determining statistical significance. Taken together, the most important aspect of Textrous! is that it allows genomic and proteomic researchers to determine word associations to gene sets of arbitrary length in an objective, standardized, non-biased, and non-curated manner. Such a technological development therefore may possess considerable advantages over user-defined gene/protein clustering applications as it is not as reliant upon historically-derived functional signaling pathway data. It is highly likely that with further experimental evidence the rigid nature of specific gene/ protein biological annotation will be demonstrated to be more and more redundant and inaccurate. Therefore in using a non-curated process, Textrous! provides a less biased and more 'future-proof' informatics data set annotation process. In conclusion, when using complicated large scale datasets Textrous! is able to simply create a naturalistic and unbiased scientific interpretation of these data. Table S1 Relative transcription responses for learningbased physical activity versus non-learning based physical activity. Gene transcription z-ratios for the learning task (Morris Water Maze: Learn) mice compared to time-controlled non-goal oriented physical activity (Swim). (DOC)

Table S3
Textrous! noun-phrase output for learning task-oriented activity. The noun-phrase output from Textrous! indicated below was generated from the top 10 most-significantly associated words (Table S2)    Table S7 Ingenuity Pathway Analysis BioFunction enrichment output for learning task-oriented activity. Ingenuity Pathway Analysis (IPA: http://www.ingenuity.com/ products/ipa) was employed to generate specific BioFunction activity output from the murine learning transcriptomic dataset. The specific significant P value for each enriched BioFunction is indicated. (DOC) Table S8 NIH DAVID PIR Keyword enrichment output for learning task-oriented activity. The batch gene annotation module of NIH DAVID (http://david.abcc.ncifcrf.gov/) was employed for the derivation of the fold enrichment (using a murine background set) and the significant P value (,0.05) for the specifically enriched PIR keywords extracted from the learning oriented-task transcriptomic dataset. (DOC) Table S9 IPA Canonical signaling pathway enrichment output for learning task-oriented activity. Ingenuity Pathway Analysis (IPA: http://www.ingenuity.com/products/ ipa) was employed to generate specific Canonical Signaling Pathway activity output from the murine learning oriented-task transcriptomic dataset. The specific significant negative log 10 of the P value, as well as the enrichment ratio for each of the significantly-populated Canonical signaling pathways is indicated. (DOC)  1-34))-induced bone transcription response in wild-type mice. The transcriptomic response data indicates the significantly regulated genes expressed in calvarial bone extracts from mice intermittently dosed with hPTH (1-34).

(DOC)
Table S11 Parathyroid hormone variant (bPTH (7-34))induced bone transcription responses in wild-type mice. The transcriptomic response data indicates the significantly regulated genes expressed in calvarial bone extracts from wildtype mice intermittently dosed with the parathyroid hormone variant bPTH .

(DOC)
Table S12 Textrous! output from hPTH (1-34) calvarial bone transcription response in wild-type mice. The Cosine similarity, Z-scores and associated P values for the word data output (top 100) from hPTH (1-34)-treated mice is indicated in the table. (DOC) Table S13 Textrous! output from bPTH (7-34) parathyroid hormone variant calvarial bone transcription responses in wild-type mice. The Cosine similarity, Z-scores and associated P values for the word data output (top 100) from bPTH (7-34)-treated mice is indicated in the table. (DOC) Table S14 Textrous! noun-phrase output from hPTH (1-34) calvarial bone transcription responses in wild-type mice. The data indicated in the table consists of the Cosine similarity scores for the most strongly associated noun-phrases linked to the top 10 most significantly-associated words extracted by Textrous! from the hPTH (1-134)-induced transcriptome data. (DOC) Table S15 Textrous! noun-phrase output from bPTH (7-34) parathyroid hormone variant-treatment of calvarial bone transcription responses in wild-type mice. The data indicated in the table consists of the Cosine similarity scores for the most strongly associated noun-phrases linked to the top 10 most significantly-associated words linked to the bPTH (7-34)-induced transcriptome data. (DOC) Author Contributions