Comprehensive Map of Molecules Implicated in Obesity

Obesity is a global epidemic affecting over 1.5 billion people and is a risk factor for several diseases, such as type 2 diabetes mellitus and hypertension. We have constructed a comprehensive map of the molecules reported to be implicated in obesity. A deep curation strategy was complemented by a novel semi-automated text mining system in order to screen 1,000 full-length research articles and over 90,000 abstracts relevant to obesity. We obtained a scale-free network of 804 nodes and 971 edges, composed of 510 proteins, 115 genes, 62 complexes, 23 RNA molecules, 83 simple molecules, 3 phenotypes and 3 drugs, in a “bow-tie” architecture. We classified this network into 5 modules and identified new links between the recently discovered fat mass and obesity associated (FTO) gene and well-studied examples such as insulin and leptin. We further built an automated docking pipeline to dock orlistat as well as other drugs against the 24,000 proteins in the human structural proteome to explain the therapeutics and side effects at a network level. Based upon our experiments, we propose that the therapeutic effect comes through the binding of one drug to several molecules in the target network, and that the binding propensity is both statistically significant and different in comparison with any other part of the human structural proteome.


Introduction
Obesity, a complex condition with serious medical, psychological and social consequences, affects millions of people across the world [1]. In addition, rising numbers of juvenile-onset obesity cases contribute to an increased incidence of time-dependent complications of obesity such as insulin resistance, non-insulin-dependent diabetes mellitus, hypertension, coronary artery disease and other cardiac disorders, often grouped as "metabolic syndrome X" [2][3]. The pathophysiology of obesity is influenced by several factors such as candidate genes and their expression, single nucleotide polymorphisms, proteins, metabolic pathways and their perturbations due to mutations, nutrition, exercise, gut microbes, and diseases, e.g. hypothyroidism [4][5]. Experts recommend that increase in physical activity and reduction in intake of high calorie pertaining to obesity pathophysiology. For this reason, we retrieved 114 transcription factors from DGAP (Diabetes Genome Anatomy Project) and GenMapp (http://www.genmapp.org/default.html). Apart from these, we identified an independent set of 33 genes reported in the obesity literature (See Table A in S2 File for the list of molecules and Table B in S2 File for experimental evidence). This work was complemented by mining over 35,000 genes in 96,219 abstracts using Perl scripts. Through text mining, we found 4,274 genes as a first round of 'hits' (See Methods). Since text mining systems are known to produce a large number of false hits, we screened these hits manually and removed gene names matching common English words, abbreviations and methodology terms using various types of filters (See examples provided on our website http://tinyurl.com/d74r9xy as well as in S3 File). Out of 4,274 hits, we labelled 1,268 genes as positive hits and 3,006 as false positive hits (See Table C in S2 File).
The text mining system also reported several recently published molecules such as fat mass and obesity associated (FTO) and omega-3 fatty acid receptor 1 (GPR120) [31][32].
Based upon these techniques, we constructed two datasets (A and B) to create the comprehensive network. Set A consists of 473 genes and proteins retrieved through the deep curation strategy, whereas set B consists of 1,268 genes retrieved through the semi-automatic text mining system. We started with this set of molecules as a 'partial list' of the proposed comprehensive network and expanded it by adding more molecules based upon interactions reported in the literature in the context of obesity. The final comprehensive map was constructed from genes, proteins, receptors, transcription factors, enzymes, ion channels, drugs, RNA molecules, simple molecules and their relationships (See Fig 1).

General features of the map
We screened over one thousand research articles manually and more than 96,219 abstracts (published till December 2012) using the text mining system. The majority of molecules identified in this study can be traced to sources such as the human obesity gene map database 2005 update, GenMapp and miscellaneous literature reports (Table A in S2 File). We have prepared a resource base where each molecule is linked with its research article. Each paper is curated manually, and the portion of text denoting a gene (molecule) or its interaction with other molecules in the context of obesity is highlighted. The information on interactions of each molecule is given on our website (Supplementary Folder 1: Gene interaction evidence: "http://tinyurl.com/nc3yjj7" & A in S3 File & Table D in S2 File). During this study, we encountered a set of molecules that are involved in syndromes where obesity is one of the clinical outcomes. Since direct evidence on the role of these molecules in obesity is not known, we decided to include them as an independent part of the proposed map. We label this set the 'lesser studied (reported) group' due to the paucity of literature data. To illustrate, the ALMS1 gene is related to "Alstrom syndrome 1", where obesity is a frequent clinical outcome in patients [33], but direct evidence linking ALMS1 with obesity is not reported in the literature. Similarly, gamma-aminobutyric acid A receptor gamma 3 (GABRG3) is an early childhood obesity gene reported in Prader-Willi syndrome [34], but direct experimental evidence is not known. Likewise, genes reported from X-linked mutation studies and linkage studies could not be placed in the main network due to sparse experimental or interaction data. Therefore, out of 473 molecules, we included 389 molecules in the proposed network, and the remaining 84 molecules were reported as an independent set (See Table E in S2 File).
The process of incorporating lesser-studied molecules into the proposed network is elaborated in the following sections.
In Fig 2, we show the comprehensive map of molecules that was manually assembled based on the published literature. Various entities of the network, e.g. genes, proteins and their modifications, and protein complexes, are described using standard Systems Biology Markup Language (SBML) with the help of CellDesigner 4.1 software [35] and Systems Biology Graphical Notation (SBGN) [36] (www.sbgn.org). The nodes (also known as species) represent molecules that participate in a given reaction. The edges represent reactions among nodes. The resulting network on obesity consists of 804 nodes (including set A molecules as well as other genes/molecules interacting with set A) and 971 edges. These 804 nodes are categorized as 510 proteins, 115 genes, 1 ion, 3 drugs, 3 degraded molecules, 62 complexes, 23 RNA molecules, 83 simple molecules, 3 phenotypes and 1 unknown molecule (See Table F

Linking lesser-studied/reported molecules with comprehensive map
There are several clinical conditions and syndromes where obesity is one of the reported phenotypes, apart from other clinical features characteristic of that syndrome. These include Prader-Willi syndrome, Ulnar-mammary syndrome and Biemond 2 syndrome (See Table E in S2 File). Prader-Willi syndrome is characterised by hyperphagia, characteristic facial features, hypogonadism and short stature. This syndrome is caused by loss of genes imprinted in the 15q11-q13 region, such as gamma-aminobutyric acid (GABA) A receptor, gamma 3 (GABRG3), imprinted in Prader-Willi syndrome (IPW), small nucleolar RNA, C/D box 116 cluster (PWCR1), small nuclear ribonucleoprotein polypeptide N (SNRPN) and MAGE-like 2 (MAGEL2). We retrieved information for these genes from literature databases as well as from relevant pathway resources such as KEGG [37] and REACTOME [38]. Then, we looked for any evidence of a relationship between the less-studied genes and the obesity network molecules.
After extensive manual screening, we were able to find one study [39] that links GABRG3 with the methyl CpG binding protein 2 (MECP2) gene. The MECP2 gene is a part of module 1 of the comprehensive network (See Fig 3). Encouraged by this result, we screened over 6,000 abstracts representing the 84 lesser-studied genes using our text-mining approach. These include molecules such as CYP11B2 (cytochrome P450, family 11, subfamily B, polypeptide 2), PLSCR1 (phospholipid scramblase 1), PTPNS1 (signal-regulatory protein alpha gene interactions), ALMS1 (alstrom syndrome 1), UBR1 (ubiquitin protein ligase E3 component n-recognin 1) and GABRG3 (gamma-aminobutyric acid A receptor, gamma 3). These efforts led to the identification of several abstracts/studies through which we could link highly connected nodes with lesser-studied genes with the help of intermediary molecules. For example, we could identify that the expression and secretion of CYP11B2 (a molecule belonging to the lesser-studied group) is inhibited by peroxisome proliferator-activated receptor gamma (PPAR γ), a key molecule in the adipocyte lineage [40] and a reported hub of our map. Additional details are given at our website http://tinyurl.com/knnqsmm. We also computed a composite score for the lesser studied (reported) genes and compared it with that of hubs (well-studied genes) (See File I in S1 File).

Structure of the map
We used standard techniques to find structure in the constructed map [41]. The map has a bow-tie architecture and resembles the letter "I". To facilitate map exploration, we divided our map into three regions: top, intervening (or central) and bottom.  Table 1). The bottom region consists mainly of transcription factors and signaling molecules, including glucose transporter 4 (GLUT4), adiponectin (ADIPOQ), lipin1 (LPIN1), fatty acid binding protein 4 (FABP4), necdin, BMP delta-like 1 homolog (DLK1/PREF1), tumour necrosis factor alpha (TNF α) and PR domain containing 16 (PRDM16). In addition, several feedback loops connect the top and bottom regions (highlighted in dark green in Fig 4).

Representation of the map
Genes and proteins are represented by standard notations, whereas interactions are categorized as positive, negative, neutral and catalysis. A positive regulation is defined for a pair of molecules in which one molecule's activity is stimulated by the other. In this context, authors frequently use specific verbs such as stimulate, activate, induce, enhance, up-regulate and increase. Negative regulation is the inhibition of the neighbouring interacting molecule, which is indicated by verbs like inhibit, down-regulate, decrease, prevent, suppress and reduce. The edge representations include transcription, translation, association and dissociation using standard graphical notation. Apart from these, there are some reactions where a molecule regulates the reaction between other molecules, i.e. catalysis (See Table 2). The colour scheme and graphical representation are explained on our website (http://tinyurl.com/dykn8fd) and in File J in S1 File.
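The verb-based labelling of interactions described above can be illustrated with a small sketch. The verb stems are taken from the lists in the text; the function name, the stem-matching strategy and the rule that the first cue verb decides the label are assumptions made for illustration only, not the curation procedure used to build the map.

```python
import re

# Verb stems derived from the cue verbs listed in the text; matching by
# prefix is an assumption of this sketch.
POSITIVE_STEMS = ("stimulat", "activat", "induc", "enhanc", "up-regulat", "increas")
NEGATIVE_STEMS = ("inhibit", "down-regulat", "decreas", "prevent", "suppress", "reduc")

def classify_interaction(sentence: str) -> str:
    """Return 'positive', 'negative' or 'neutral' from the first cue verb found."""
    # Lower-case words, keeping internal hyphens so "up-regulates" stays whole.
    words = re.findall(r"[a-z][a-z-]*", sentence.lower())
    for word in words:
        if word.startswith(NEGATIVE_STEMS):
            return "negative"
        if word.startswith(POSITIVE_STEMS):
            return "positive"
    return "neutral"
```

A lexical rule of this kind says nothing about which molecule acts on which, so every label it produces would still need manual confirmation.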

Module generation-Reverse Engineering of the pathway
To understand a large network, a logical step is to divide the network into biologically meaningful smaller functional components [42]. This process is often termed reverse engineering, and several approaches have been described to identify modules. These range from spectral methods [43][44], methods that identify maximum flows or minimum cuts [45][46], heat kernels [47], betweenness centrality [48], seed node searches [49], e.g. MCODE in Cytoscape [50], brute force methods [51] and weighted kernel k-means [52]. Community structures or modules are defined when a larger density of links exists within a specific part of the network than outside it [53]. We used different methods to identify community structures (modules) in the obesity network. In addition, we clustered genes based upon tissue-specific expression data. Since each method produces different results with some degree of overlap, we decided to integrate the information to identify functionally meaningful modules (See File A in S1 File). Hence, the constructed network was divided into 5 modules based upon physiological processes and likely anatomical component (Table 3). In the following section, we attempt to relate modules with disease conditions (Fig 5).
Module 1: This module consists of highly connected nodes involved in neuro-hormonal signaling affecting energy homeostasis, hunger and mood. They include leptin, ghrelin and dopamine. Leptin is one of the most highly studied molecules in obesity after insulin (present in 7.4% of total abstracts). Leptin acts as a satiety factor, and its discovery has paved the way for the study of adipocyte-derived factors in energy balance homeostasis. Further, the secretion of leptin is directly proportional to the amount of fat cells [54]. Recently, leptin replacement therapy has been proposed to treat obese individuals [55]. The frequent association of obesity with clinical depression can be explained by impaired leptin activity in the brain [56].
Ghrelin acts as an endogenous ligand for the growth hormone secretagogue receptor (GHSR). It has been reported to be involved in energy regulation and appetite signaling through activation of peptides, including AgRP, NPY and POMC [57]. The rise and fall in plasma ghrelin levels before and after food intake supports the hypothesis that ghrelin plays a physiological role in meal initiation in humans [58]. Ghrelin levels are altered in individuals suffering from Prader-Willi and Cushing's syndrome [59]. A meta-analysis linked the levels of the gastrointestinal hormones ghrelin and obestatin with obesity [60].
The importance of dopamine signaling in obesity has been demonstrated by the alteration of dopamine receptor levels with changes in body mass index (BMI) [61]. Apart from these, several other molecules have been reported in the context of obesity; we have described their roles at our website: http://tinyurl.com/kazahj6.
Module 2: Obesity is a major risk factor for non-insulin dependent diabetes mellitus (NIDDM) [62]. Insulin is a central molecule in the pathophysiology of type 2 diabetes and also appears in a large number of abstracts related to obesity in humans (23,165 abstracts; 24% of total dataset). Module 2 primarily encapsulates insulin and its interactions with other molecules, for instance, apolipoprotein A-V (APOA5), forkhead box C2 (FOXC2), macrophage migration inhibitory factor (MIF), uncoupling protein (UCP) and v-akt murine thymoma viral oncogene homolog 2 (AKT2). This module builds a link between two tightly coupled clinical conditions, obesity and type 2 diabetes.
Module 3: Lipid storage and metabolism are frequently affected in obese patients, leading to dyslipidemia and exposing them to cardiovascular risks [63] and atherosclerosis [64]. The third module maps interactions, catalysis and processing of molecules involved in lipid metabolism, including acetyl CoA, aspartate, mevalonate, cholesterol, cholic acid, and diacylglycerol.
Module 4: It is the largest module in the network and consists mainly of transcription factors involved in adipose tissue differentiation and other biological activities in humans. The interactions are dominated by molecules like peroxisome proliferator-activated receptors-PPAR (α, β, γ) and CCAAT/enhancer-binding proteins-C/EBP (α, β, γ). Molecules such as PPAR γ (with 41 edges) provide indirect connections with lesser-studied genes/molecules reported in the context of obesity. This module contains a sub-module labelled "4A", which incorporates a set of molecules distinct from transcription factors.
Module 4A: Though the Wnt pathway has been shown to play a major role in embryogenesis and some cancers, it has also emerged as an important regulator of adipocyte differentiation [65]. In addition, recent evidence from obesity treatment using a traditional herbal medicine, SH21B, has indicated an anti-adipogenic mechanism mediated by Wnt-β catenin signaling [66].
Module 5: The last module contains information about a disjoint set of genes/proteins involved in obesity which are difficult to categorize due to inadequate information.

Quantitative Analysis
To understand the properties of constructed network, we computed several topological parameters as described below (See File B in S1 File for detailed information).
1. Degree distribution parameter: We found that the number of connections per node follows a power law, which indicates a scale-free pattern of connectivity (in-degree exponent γ_in = 2.19 and out-degree exponent γ_out = 2.11). The scale-free behaviour is also observed in the constituent modules, suggesting preferential attachment and hubs in the network (See Table 4).
3. Average shortest path length: The value was found to be 15.85 for the comprehensive network, supporting the scale-free nature of the graph [69].
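As an illustration of how a degree exponent such as γ_in or γ_out might be estimated from a list of node degrees, the sketch below uses a commonly used approximate maximum-likelihood estimator for discrete power laws, γ ≈ 1 + n / Σ ln(k_i / (k_min − 0.5)). The text does not state which fitting procedure was used, so the estimator, the function name and the default `k_min` are all assumptions; the approximation also tends to be less accurate when `k_min` is small.

```python
import math

def powerlaw_exponent(degrees, k_min=1):
    """Approximate MLE of gamma for P(k) ~ k^(-gamma), using degrees >= k_min."""
    tail = [k for k in degrees if k >= k_min]
    if not tail:
        raise ValueError("no degrees at or above k_min")
    # Sum of log-ratios against the shifted lower cut-off (k_min - 0.5).
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum
```

For a directed network, the function would be called once with the in-degrees and once with the out-degrees of all nodes.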

Randomization of Constructed Network
We constructed null models (controls) and compared the properties of the comprehensive network with those of the null models [68,70]. The protocol is as follows. In the true network, gene A (leptin) binds to gene B (leptin receptor) to perform a function X in the cell (i.e. leptin acts as a satiety factor and exerts its action by binding to the leptin receptor).
Null Model 1-In this model, we randomised the edges but kept the node labels and their degrees intact. For example, the connection (edge) between gene A (leptin) and gene B (leptin receptor) is deleted, and a new connection is established between gene A and gene C (any other gene of the network except the leptin receptor) so as to disrupt the function X.
Null Model 2-We shuffle the positions of nodes by keeping the global degree distribution of the comprehensive map intact.
Null Model 3-This is generated by shuffling both the position of nodes as well as their edges (See File C in S1 File).
Null Model 4-We construct networks with the same number of edges and nodes using the methods proposed by Erdos-Renyi [71], Watts-Strogatz [72] and Barabasi-Albert [73]. To see the effect of network size on these properties, we construct networks with node numbers from 100 to 1000. Firstly, we use the method proposed by Erdos-Renyi to construct a random graph of N nodes connected by n edges, which are chosen randomly from the N(N-1)/2 possible edges; such graphs are not scale-free [71]. Secondly, a control network was generated through the Watts-Strogatz model (1998) [72], wherein a random graph is produced with small-world properties, including short average path length and high clustering. Thirdly, in the Barabasi-Albert model [73], the generation of the random graph starts from a connected seed network of s nodes. The remaining nodes (n-s) are added one at a time, and each is connected to m existing nodes chosen at random with probability proportional to their degree. The resulting network follows a power-law degree distribution.
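The Erdos-Renyi and Barabasi-Albert constructions described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the software used in the study: the function names, the choice of a path on m+1 nodes as the seed network, and the edge-list representation are assumptions. The Watts-Strogatz model is omitted for brevity.

```python
import random
from itertools import combinations

def erdos_renyi(num_nodes, num_edges, seed=None):
    """Random graph: num_edges drawn uniformly from the N(N-1)/2 possible edges."""
    rng = random.Random(seed)
    possible = list(combinations(range(num_nodes), 2))
    return rng.sample(possible, num_edges)

def barabasi_albert(num_nodes, m, seed=None):
    """Grow a graph from a connected seed of m+1 nodes by preferential attachment."""
    rng = random.Random(seed)
    edges = [(i, i + 1) for i in range(m)]  # seed network: a path on m+1 nodes
    # Each node appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` picks a node with probability proportional to its degree.
    stubs = [node for edge in edges for node in edge]
    for new in range(m + 1, num_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for target in sorted(targets):
            edges.append((target, new))
            stubs.extend((target, new))
    return edges
```

For example, `erdos_renyi(804, 971)` returns a random edge list with the same node and edge counts as the obesity network.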
In addition, we generated randomized networks using the random network module of Cytoscape. The obesity network (true network) exhibits different properties when compared with 18 control randomized networks obtained by shuffling the obesity network associations while keeping the degree distribution of nodes fixed (Fig G and Fig H in S1 File). We find that the clustering coefficient increases from 0 (in the true network) to 0.00201 (in the randomized network after 30,000 shuffles; See Fig G in S1 File). This pattern is reversed in the case of the mean shortest path, which reduced from 18 to 11 units (See Fig H in S1 File). Additional information on the results generated during the shuffling procedure is provided in Table H in S2 File and on the website listed in S3 File.
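The degree-preserving shuffling used for the randomized controls can be sketched as a sequence of double-edge swaps. This is a minimal sketch assuming an undirected simple graph stored as a list of node pairs; the function names and the cap on attempts are hypothetical, and it is not the Cytoscape implementation used in the study.

```python
import random
from collections import Counter

def degree_preserving_shuffle(edges, num_swaps, seed=None):
    """Rewire an undirected simple graph by double-edge swaps; degrees are unchanged."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]          # work on a copy
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < num_swaps and attempts < 100 * num_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        # Proposed swap: (a, b), (c, d) -> (a, d), (c, b)
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop or a degenerate swap
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges

def degree_counts(edges):
    """Degree of every node appearing in the edge list."""
    return Counter(node for edge in edges for node in edge)
```

Because every accepted swap removes one edge and adds one edge at each of the four nodes involved, the degree of every node is the same before and after shuffling.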

Robustness of Network
To assess the robustness of the network and its dependence on the failure of a particular node, we randomly deleted nodes and computed properties for the remaining network. There are several indexes of network centrality, such as degree, eccentricity, closeness, betweenness, stress, centroid and radiality, which allow quantifying the topological relevance of single nodes in a network. Recently introduced parameters such as node interference and robustness were also included in the analysis. These parameters measure the relative importance of a given node in the context of the network [74]. It was also shown in the past that hubs (nodes with high degrees) play important roles in maintaining the structural integrity of networks against failures and attacks [75], in spreading phenomena [76] and in synchronisation [77]. Since the obesity network shows a scale-free structure with the presence of hubs, we started our deletion experiments with sequential deletion of hub nodes to see the effect on network robustness. This was achieved by removing a node and calculating the interference on the centrality of the remaining nodes using the CentiScaPe plugin of Cytoscape. We find that removal of hubs, alone or in combination, has a large impact on the network, and that various critical properties of the network change to a significant extent. For example, the betweenness of nodes of the original network (Mean = 10738.4; Var. = 3.3E8) was significantly different from that of the networks obtained after deleting all hubs (Mean = 13675.4; Var. = 1.1E9), as computed through a paired t test (P<0.05). When we randomly deleted any non-hub node, the changes in the parameters were not significant (See Table I
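The contrast between deleting hubs and deleting other nodes can be illustrated with a much simpler proxy than the centrality interference analysis described above: the size of the largest connected component after node removal. The sketch below assumes an undirected edge list; the function names and the choice of this proxy are illustrative assumptions and do not reproduce the CentiScaPe computation.

```python
from collections import defaultdict, deque

def build_adjacency(edges):
    """Undirected adjacency sets from a list of node pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def largest_component_size(adj, removed=frozenset()):
    """Size of the largest connected component, ignoring `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over the surviving nodes
            node = queue.popleft()
            size += 1
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        best = max(best, size)
    return best

def top_hubs(adj, count):
    """The `count` nodes with the highest degree."""
    return sorted(adj, key=lambda n: len(adj[n]), reverse=True)[:count]
```

Comparing `largest_component_size(adj, removed=set(top_hubs(adj, k)))` with the value obtained after removing k randomly chosen nodes gives a rough picture of how strongly connectivity depends on the hubs.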

Mapping of Microarray Data
The microarray data were obtained from Gene Expression Atlas [80] using the search term "obesity and homo sapiens" at the URL http://www-test.ebi.ac.uk/gxa/. The gene list was obtained for three possible conditions: up-regulation, down-regulation and non-differential expression. We selected 3,485 genes reported to be up-regulated in obesity and labelled them as set 'U'. Subsequently, we found 2,135 genes (labelled as D) as the down-regulated group and a very large number of genes (291,407) as non-differentially expressed (NDE). After removal of redundancy, we obtained filtered sets of 1,340 up-regulated (U) molecules, 918 down-regulated (D) molecules and 38,434 non-differentially expressed (NDE) molecules. Thereafter, we compared the filtered dataset obtained from the microarray database with our list. Based upon these comparisons, we found that 27 genes (obtained from the deep curation (DC) approach) are up-regulated in obesity, whereas 24 genes show down-regulation; a large number of genes either did not show any change in expression or have no information available in the database. Gene ontology analysis revealed that most of the up-regulated genes are involved in protein binding, whereas the down-regulated group is involved in steroid binding activity (See File D in S1 File).
Since we could not map a large number of genes, we attempted to find expression data for obesity genes in the GEO (http://www.ncbi.nlm.nih.gov/geoprofiles) microarray database. We found that 34.5% of genes (obtained from the text mining (TM) approach) are up-regulated, whereas 27.58% are down-regulated.

Applications of Obesity Network-Implications in therapeutics
We used orlistat (tetrahydrolipstatin, an FDA-approved drug for the treatment of obesity) to dock against the molecules listed in our network using our in-house docking pipeline "Docoviz". We observe that orlistat not only binds to fatty acid synthase (FASN) (ΔE = -13.7 kcal/mol; experimentally known target) but also binds to several other molecules in the obesity network. To check whether orlistat produces its clinical effect (weight reduction) through preferential binding to several molecules listed in the obesity network (N) rather than to any other part of the proteome, we created a dataset of 24,000 known human protein structures (P) and docked orlistat against them. In addition, we created datasets of randomly selected protein structures from P, labelled P1, P2, ..., Pn, as controls. We also used Alzheimer disease network molecules [81] as an additional control (D). We observed that the distributions of binding energies obtained from the controls (P1, P2, P3, ..., Pn) and the Alzheimer disease network (D) are significantly different from that of the test dataset (N) (P value <0.05, Welch's t-test).
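For readers unfamiliar with Welch's t-test, the statistic used to compare two sets of binding energies with unequal variances can be computed as below. This is a generic sketch of the standard formulas (the t statistic and the Welch-Satterthwaite degrees of freedom); the function name is hypothetical and the code is not part of the Docoviz pipeline.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means (sample variance / n).
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

Converting the statistic into a P value requires the Student t distribution; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and returns the P value directly.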
In another experiment, we docked drugs that have no known effect on obesity against the obesity network proteins. For instance, we docked acetylsalicylic acid (an anti-inflammatory medicine, selected randomly) against the obesity network proteins. Apart from that, we used drugs showing a comparable Tanimoto coefficient (Tc) to orlistat, such as 3-Carboxy-N,N,N-Trimethyl-2-(Octanoyloxy)Propan-1-Aminium (Tc value: 0.68) and 6-Deoxyerythronolide B (Tc value: 0.6), to ascertain binding energy profiles in the obesity network. We found that the binding energy profiles of the above-mentioned drugs against the obesity network proteins are different from that of orlistat (P value <0.05, Student's t-test).
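The Tanimoto coefficient used above to select structurally comparable drugs is the ratio of shared to total features of two molecular fingerprints. A minimal sketch follows, assuming each fingerprint is given as a set of "on" bit indices; the text does not say which fingerprint type was used, and the function name and the handling of two empty fingerprints are assumptions.

```python
def tanimoto(fingerprint_a, fingerprint_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(fingerprint_a), set(fingerprint_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    # Shared bits divided by the number of bits set in either fingerprint.
    return shared / (len(a) + len(b) - shared)
```

A value of 1 means the two fingerprints are identical, and a value of 0 means they share no features.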
Orlistat is known to produce several side effects, namely acne, respiratory tract infection, urinary tract infection and nausea, possibly due to binding to off-targets that perturb unrelated pathways. Using text mining systems and manual screening, we obtained a list of molecules implicated in the side effects/diseases related to orlistat. On comparison, we found that several molecules are common to the obesity network and acne (14 molecules; 2.7% of total dataset), providing a possible clue to the causation of acne in patients taking orlistat during obesity treatment. Likewise, sibutramine (an antidepressant and anorexigenic drug) was withdrawn due to adverse effects such as agitation, fever, vomiting, diarrhoea, loss of coordination, and dilated pupils. Using our map, we could link the side effects of sibutramine with disease networks. To illustrate, symptoms such as nausea, vomiting and depression are likely to be produced by the binding of sibutramine to targets such as SLC6A3 and SLC6A4 and the subsequent perturbation of pathways involving HTR2C (anxiety), HTR2A (anxiety), DRD2 (nausea and vomiting), COMT (nausea and vomiting), and MAOA (depression) (File E in S1 File).

Discussion
This work shows a new approach of combining data from heterogeneous databases including literature, structure and microarrays to construct disease networks and attempt to explain therapeutics of a drug molecule in context of networks. Our methods are generic, web enabled and open in nature to build rich networks. Each entity i.e. node or edge has been hyper-linked to its source (research papers) so as to maintain transparency in the system for users to evaluate and improve the system in a collaborative fashion.
Network targeting involves the activity of a compound across multiple pathways, which might be necessary to effectively stop neoplasms and pathogens, but can also produce side effects by targeting undesirable proteins [82]. Very few large-scale docking studies have been conducted in the past (Gao et al. used ~1,100 targets [83]; Hui-fang et al. used 1,714 targets [84]; [85]). Here, we performed docking of orlistat with obesity network proteins as well as with the whole human structural proteome (>24,000 proteins) as a test example. Based upon our predictions, we propose that a given drug (orlistat) not only binds to its known target (FASN; ΔE = −13.6 kcal/mol) but also to several other targets in the network with varying binding energies. This propensity of the drug to bind within the target network (obesity) is different from its binding to any other disease network or to a network randomly drawn from the human proteome. Further, we also observe that drugs therapeutically unrelated to a given clinical condition (e.g. acetylsalicylic acid in obesity) show different binding patterns to network proteins. These results contribute to the emerging concepts of network pharmacology [82] and chemigenomics [86] for developing safer, cheaper and more effective medicines. A possible limitation of this approach is non-specific or random binding of the ligand to many of the protein targets.
Real-world networks, including biological networks, are characterised by the presence of a few highly connected nodes known as hubs, and they tend to show a non-Poisson degree distribution. Evidence shows that hub proteins are encoded by essential genes [87] that seem to be older and evolve slowly, and that their deletion affects a larger number of nodes than the deletion of non-hub nodes [88][89][90]. Therefore, different studies have attempted to associate hub proteins with disease genes. Some studies support this hypothesis, whereas a few contradict it [90][91][92]. Our network shows a hub-based architecture with a select set of nodes, namely leptin, insulin and PPAR gamma, accounting for most of the connections. Most of these genes are likely to be essential in nature, whereas some of the recently reported candidate genes are present in the periphery of our map, e.g. the fat mass and obesity associated (FTO) gene. It may be inferred that obesity pathophysiology is primarily influenced by interactions of essential genes; therefore, obesity could be considered a system-level adaptation to chronic nutritional over-intake and other causative factors.
We compared our network with previously published datasets, including Kitano et al, 2004 [41] and Logsdon et al, 2012 [93], and found several of our network molecules present in these datasets (See File F in S1 File). Various population-wide studies have indicated that hypertension is a predominant clinical condition affecting over 40% of obese people (BMI > 30) [94], whereas type II diabetes mellitus affects 40-60% of obese people [95]. Using text mining approaches, we found that there is a significant overlap between molecules implicated in obesity and those implicated in its associated disorders such as diabetes or hypertension. This overlap is smaller when molecules implicated in obesity are compared with molecules implicated in unrelated diseases, e.g. asthma, urticaria and ataxia.
Considering wide variety of factors affecting the obesity pathophysiology, we believe that obesity comprehensive map will act as a platform to integrate information derived from gene expression experiments, protein-protein interaction data, drug information, clinical data, metagenomic and pharmacogenomic information. It will be interesting to understand how this network evolves temporally in a lifespan of a given individual(s) from lean state to obese state. What modules or links get formed or abolished during the process? It can also act as a system where new drugs may be tested against disease networks to predict their therapeutics or side-effects.

Material and Methods

(A) Retrieval of Literature Data
We screened each research article manually and highlighted the text naming molecules as well as their interactions. We also used information provided in the human obesity gene map database 2005 update [30] and GenMapp (http://www.genmapp.org/default.html). The abstracts containing the terms "obesity" and "human" were downloaded from PubMed using RefNavigator (version 2.0). We obtained 96,219 abstracts on obesity in humans published till December 2012 (See Folder 2 available at website (A) in S3 File). We used Perl scripts to parse additional information, including authors' names, affiliations, journal name and year of publication. Each abstract was processed and assigned a unique id using Perl scripts.

(B) Determining True Positives and False Positives
Researchers have used several approaches to link genes with complex traits such as obesity. Primarily, linkage analysis and association studies have been used to find the variants that affect obesity. In addition, animal models also provide lists of candidate genes through linkage studies, expression profiling, and transgenic strains. Techniques such as expression analysis and protein interaction studies also identify candidate genes for obesity. Given the wide variety of available experimental techniques, we grouped these studies (evidence) into various categories and assigned a numerical code to each of them (See Table B in S2 File). Next, we labelled each gene with a numeric code for better data management.
A gene is defined as a true positive example when we have enough evidence to link the gene with a disease. For example, leptin (Lep) deficiency is linked with an intractable form of obesity (Uniprot Id-P41159; OMIM ID-614962). As a rule of thumb, we labelled genes with high confidence when many independent research studies published in high-impact journals with sufficient citations support that link. Each gene has different types of experimental evidence, ranging from mutation studies and animal studies to genome-wide association and linkage studies and clinical studies; these are the categories of evidence coded in Table B in S2 File. The false positives are those gene names which matched common English words used in sentences, abbreviations of organizations, and author names. They also include examples which occurred in abstracts but were rejected during manual screening due to lack of clear evidence.

(C) Hybrid approach
The deep-curation (DC) approach is defined as the screening of literature data by experts, whereas text-mining (TM) systems sift through publication data at large scale for occurrences of genes and their interactions using computational software and predictive algorithms.
Although text mining systems are fast, they suffer from several problems that limit their use. For example, consider a representative statement from a research article [54]: "the binding of the SH2 domain of SH2B1 to phospho-Tyr, 813, in JAK2 enhanced leptin induction of JAK2 activity". Here, different text mining tools will report "Jak2 enhanced leptin". This is taken to be a positive interaction, but the real meaning is that leptin increases JAK2 activity upon binding of the SH2 domain to JAK2. Because of such constraints, text mining systems alone are not considered robust enough, which warrants a complementary deep-curation approach. Our TM approach is formulated as follows:
(i). Let W be the set of all human genes and their synonyms that may occur at least once in the set of abstracts, labelled A. W is represented as a matrix in which each row holds a gene (w_i) and its synonyms. The approved symbol and synonyms of each gene are stored in tab-separated format in a text file, and the notation w_ij designates the j-th name of gene w_i.
(ii). A separate matrix (M) stores the frequencies of the genes listed in W. It contains the genes (w_i) in the first column and their respective counts (c_k) in the second column. For example, w_2 represents the gene LEP, which has a gene count (c_2) of 7,159 in the PubMed abstracts (1960 to December 2012).
(iii). We also define N as the gene co-occurrence matrix. Each entry of this matrix, N_xyz, stores information extracted from research articles and is composed of three units: N_x, N_y and N_z. N_x captures the first instance of a gene encountered in the sentence, N_z holds the next instance of a gene, and N_y stores the intermediate words. To illustrate, consider the statement, "Insulin is known to increase expression of the ob gene product leptin in adipose tissue". Here, insulin and leptin are labelled as a gene pair with 10 intermediate words between them. Therefore, "insulin" is N_x, "leptin" is N_z, and "is known to increase expression of the ob gene product" is N_y.
We extract gene pairs from the abstracts and full-length articles and compute their frequencies. We also build the frequency distribution of intermediate words (N_y), which is useful for building dictionaries for subsequent natural language processing. This dataset is also useful for training machine learning systems such as hidden Markov models and support vector machines (manuscript in preparation), as well as for manual curation.
(iv). The parser is a set of dictionaries built for various types of interactions, tenses and negations. We curated data from 300 research articles to identify the words most frequently used to represent interactions, namely positive, negative and neutral. We use these dictionaries to label interactions by building a matrix O. In matrix O, O_xyz represents the data structure in which the gene O_x (insulin) is followed by the gene O_z (leptin), with their type of interaction, O_y (positive), between them. This is processed for graphical view using GraphViz (Version 2.28). A detailed example (tutorial) of the TM approach is provided in S7 in S1 File.
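The labelling step in (iv) can be illustrated with a minimal Python sketch. The three word lists below are small hypothetical stand-ins for the curated dictionaries, the tense dictionary is omitted, and treating a negated verb as "neutral" is an assumption made here for simplicity, not a rule stated in the study.

```python
# Hypothetical mini-dictionaries; the study curated its own from 300 articles.
POSITIVE = {"increase", "increases", "enhance", "enhanced",
            "stimulates", "activates", "induces"}
NEGATIVE = {"decrease", "decreases", "inhibit", "inhibits",
            "suppresses", "reduces"}
NEGATIONS = {"not", "no", "never"}


def label_interaction(n_y):
    """Map the intermediate words (N_y) of a gene pair to an interaction type (O_y)."""
    words = n_y.lower().split()
    if any(w in NEGATIONS for w in words):
        return "neutral"  # assumption: a negated verb is reported as neutral
    if any(w in POSITIVE for w in words):
        return "positive"
    if any(w in NEGATIVE for w in words):
        return "negative"
    return "neutral"


def to_dot(triples):
    """Render (O_x, O_y, O_z) triples as a GraphViz DOT digraph."""
    lines = ["digraph interactions {"]
    for o_x, o_y, o_z in triples:
        lines.append(f'  "{o_x}" -> "{o_z}" [label="{o_y}"];')
    lines.append("}")
    return "\n".join(lines)


n_y = "is known to increase expression of the ob gene product"
triple = ("insulin", label_interaction(n_y), "leptin")
dot_text = to_dot([triple])
```

The resulting `dot_text` can be written to a file and rendered with the GraphViz `dot` program.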

Text-Mining Approach Algorithm
Let abstracts = A;
Let genes = W;                      // 35,000 genes in human and their synonyms
Let gene count matrix = M;
Let co-occurrence matrix = N;
Let NLP matrix = O;
for i = 1 to n do                   // 'i' is a row representing a gene in W
    for j = 1 to n do               // 'j' is a column representing a gene name or symbol in W
        Let c_k = 0;                // Initializing the count of gene 'i' in abstracts A as 0
        if i,j ∈ A then
            write i to M;           // Write the gene 'i' in gene count matrix M
            c_k = c_k + 1;
            append c_k to M;        // The gene 'i' is appended with its count c_k in M
        next;                       // Search for the next gene
read gene x ∈ M;                    // Reading the gene x from gene count matrix M
read gene z ∈ W;                    // Reading the gene z from the dictionary W
gene-pair N_x,z;
for x = 1 to n do                   // x represents the first gene of a gene pair in M
    for z = 1 to n do               // z represents the second gene of a gene pair in W
        Let N_x,z = 0;              // Initializing the count of gene pair x, z as 0
        Let y = 0;                  // y is the words between the gene pair, initialized as 0
        for gene x ∈ A do
            for gene z ∈ A do
                if x then z then
                    read y in A
                    if length y > 3;
                        write x,y,z to N;       // N_x,y,z is a co-occurrence matrix
                        N_x,z = N_x,z + 1;      // Total occurrence of a gene pair
                        append N_x,z to N
        next;                       // Search for the next gene pair
Let NLP Parser = P;                 // Set of dictionaries P
Let Interaction verb dictionary = P_a;   // Sub-dictionary in P
Let tenses = P_b;                   // Sub-dictionary in P
Let negations = P_c;                // Sub-dictionary in P
for P_a ∈ N                         // Search for interaction verb in N
for P_b ∈ N                         // Search for tenses in N
for P_c ∈ N                         // Search for negations in N
write O;                            // O is the NLP matrix
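The counting and co-occurrence stages of the algorithm can be sketched in Python as follows. This is not the authors' Perl implementation: the function names, the tokenizer, the restriction to single-word gene names, and the reading of "length y > 3" as "more than three intermediate words" are all assumptions made for illustration.

```python
import re
from collections import Counter


def build_lookup(W):
    """W maps an approved symbol to its synonyms; return lowercased name -> symbol."""
    lookup = {}
    for symbol, synonyms in W.items():
        for name in [symbol, *synonyms]:
            lookup[name.lower()] = symbol
    return lookup


def mine(abstracts, W, min_gap=3):
    """Return gene counts M, triples N and gene-pair counts for a list of abstracts."""
    lookup = build_lookup(W)
    M = Counter()            # gene -> number of mentions
    N = []                   # (N_x, N_y, N_z) triples
    pair_counts = Counter()  # (N_x, N_z) -> number of co-occurrences
    for abstract in abstracts:
        for sentence in re.split(r"(?<=[.!?])\s+", abstract):
            tokens = re.findall(r"[A-Za-z0-9\-]+", sentence)
            hits = [(i, lookup[t.lower()])
                    for i, t in enumerate(tokens) if t.lower() in lookup]
            for _, gene in hits:
                M[gene] += 1
            # Pair each gene mention with the next gene mention in the sentence.
            for (i, g_x), (j, g_z) in zip(hits, hits[1:]):
                gap = tokens[i + 1:j]
                if g_x != g_z and len(gap) > min_gap:
                    N.append((g_x, " ".join(gap), g_z))
                    pair_counts[(g_x, g_z)] += 1
    return M, N, pair_counts


W = {"INS": ["insulin"], "LEP": ["leptin"]}
abstracts = ["Insulin is known to increase expression of the ob gene "
             "product leptin in adipose tissue."]
M, N, pair_counts = mine(abstracts, W)
```

Applied to the example sentence from step (iii), the sketch recovers the insulin and leptin pair together with its ten intermediate words. Adding "ob" as a synonym of LEP would change the result, because "ob" would then be taken as the next gene mention after insulin.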

(D) Comprehensive Map Construction
The comprehensive map of molecules in obesity was constructed using CellDesigner software [35]. CellDesigner supports the Systems Biology Graphical Notation (SBGN) and provides functions to represent molecular entities, including genes, proteins and RNA, as well as edge notations for transcription, translation, inhibition and stimulation. The activity and modulation of a molecule can also be represented. The constructed map can be exported in Systems Biology Markup Language (SBML) format, which is preferred for computational models of biological processes.

(E) Module Generation
Reverse engineering of the comprehensive map was conducted using tools and methods mentioned in A File in S1 File.

(F) Random Model Generation
Random models of the comprehensive map were generated by two approaches: first, by degree-preserving random shuffle using the Network Analyzer tool [96], and second, by generating scale-free random graphs with the Cytoscape plug-in Random Networks. We also used Perl scripts developed in-house for the randomisation process.
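For readers unfamiliar with degree-preserving randomisation, the following Python sketch shows the generic double edge swap on which such shuffles are commonly based. It is not the Network Analyzer implementation or the in-house Perl code; it assumes an undirected simple graph, and the function names are hypothetical.

```python
import random
from collections import Counter


def degrees(edges):
    """Return the degree of every node in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Rewire edges by double edge swaps: (a-b, c-d) becomes (a-d, c-b)."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # a shared node would create a self-loop or a no-op
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # the swap would duplicate an existing edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
shuffled = degree_preserving_shuffle(edges, n_swaps=10, seed=1)
```

Every swap removes one edge from each of the four nodes involved and gives one back, so the degree sequence of the randomised network is identical to that of the original.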
2. The gene ontology (GO) analysis was carried out for three categories: molecular function, biological process and cellular component using BiNGO [78] and Network Ontology Analysis (NOA) [79].
3. The protein targets of drugs, particularly orlistat, were identified with the Docoviz pipeline (Fig 6). Docoviz is an automated system for large-scale docking of drugs against protein structures using AutoDock Vina [98]. The system is written in Perl and other languages such as Ruby (manuscript in preparation). We obtained structural information for the genes implicated in obesity from the Protein Data Bank (PDB). Orlistat and other drugs were obtained from DrugBank [99], and their side effects were retrieved from the SIDER database. Protein structures were converted from pdb to pdbqt format before docking. We identified active site coordinates through a geometric search method. A grid of about 20 Å around the active site coordinates was generated to search all possible transition points (See K File in S1 File).
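As an illustration of the grid set-up, the Python sketch below writes an AutoDock Vina configuration for a search box centred on the active site. It is not part of Docoviz: the file names and coordinates are hypothetical, and interpreting "about 20 Å" as a cubic box with 20 Å edges is an assumption.

```python
def vina_config(receptor, ligand, center, box=20.0,
                out="docked.pdbqt", exhaustiveness=8):
    """Build the text of an AutoDock Vina config file.

    center -- (x, y, z) active-site coordinates in angstroms
    box    -- edge length of the cubic search box in angstroms (assumed cubic)
    """
    cx, cy, cz = center
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"out = {out}",
        f"center_x = {cx:.3f}",
        f"center_y = {cy:.3f}",
        f"center_z = {cz:.3f}",
        f"size_x = {box:.1f}",
        f"size_y = {box:.1f}",
        f"size_z = {box:.1f}",
        f"exhaustiveness = {exhaustiveness}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical inputs: file names and coordinates are placeholders.
cfg = vina_config("protein.pdbqt", "orlistat.pdbqt", (12.5, -3.0, 7.25))
```

The text can be saved to a file and passed to Vina with its `--config` option; looping over many receptor files with the same ligand gives the large-scale screen described above.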
Supporting Information S1 File.