Conceived and designed the experiments: KA SB. Performed the experiments: ASJ FJSSAR KKB NW OT. Analyzed the data: KA. Wrote the paper: KA TSJ.
The authors have declared that no competing interests exist.
Exposure to environmental chemicals and drugs may have a negative effect on human health. A better understanding of the molecular mechanism of such compounds is needed to determine the risk. We present a high-confidence human protein-protein association network built upon the integration of chemical toxicology and systems biology. This computational systems chemical biology model reveals uncharacterized connections between compounds and diseases, thus predicting which compounds may be risk factors for human health. Additionally, the network can be used to identify unexpected potential associations between chemicals and proteins. Examples are shown for chemicals associated with breast cancer, lung cancer and necrosis, and potential protein targets for di-ethylhexyl-phthalate, 2,3,7,8-tetrachlorodibenzo-p-dioxin, pirinixic acid and permethrin. The chemical-protein associations are supported by recently published studies, which illustrate the power of our approach that integrates toxicogenomics data with other data types.
Exposure to environmental chemicals and drugs may have a negative effect on human health. An essential step towards understanding the effect of chemicals on human health is to identify all possible molecular targets of a given chemical. Recently, various network-oriented chemical pharmacology approaches have been published. However, these methods limit the protein prediction to already known molecular drug targets. New findings can for example be made by using high-confidence protein-protein association databases. Here, we describe a generic, computational systems biology model with the aim of understanding the underlying molecular mechanisms of chemicals and the biological pathways they perturb. We present a novel and complementary approach to existing models by integrating toxicogenomics data, chemical structures, protein-protein interaction data, disease information and functional annotation of proteins. The high confidence protein-protein association network proposed reveals unexpected connections between chemicals and diseases or human proteins. We provide literature support to demonstrate the validity of some predictions, and thereby illustrate the power of an approach that integrates toxicogenomics data with other data types.
Humans are daily exposed to diverse hazardous chemicals via skincare products, plastic cups, computers and pesticides to mention but a few sources. The potential effect of these environmental compounds on human health is a major concern
An essential step towards deciphering the effect of chemicals on human health is to identify all possible molecular targets of a given chemical. Various network-oriented chemical pharmacology approaches have been published recently to identify novel protein candidates for drugs, using structural chemical similarity
In this paper we present a method that can associate chemicals to disease and identify potential molecular targets based on the integration of toxicogenomics data, chemical structures, protein-protein interaction data, disease information and functional annotation. The core of our procedure is derived from the “target hopping” concept defined previously
Based on the Comparative Toxicogenomics Database (CTD)
DATA: Extraction and filtering of human protein-chemical associations from CTD. The visualization of the chemical space by Principal Component Analysis projection confirms that drugs (D) and environmental chemicals (E) share structural properties and may therefore affect similar protein targets. The first two principal components, which explain about 44% of the variance in the calculated properties, are shown (green: pharmaceutical actions, red: toxic actions and blue: specialty uses of chemical). All proteins (P) were mapped to Ensembl gene identifiers to facilitate further data integration. MODEL GENERATION: Construction of the P-PAN. The P-PAN was created from associations present in the CTD (dashed edge lines) between chemicals and proteins. In the P-PAN, two proteins are connected to each other (edge lines) if they share a common chemical. A weighted score, represented by the width of the black edges, was assigned to each protein-protein association. It represents the strength of the association between two proteins as defined by the number of compounds shared by both molecular targets. Selection of a scoring function and a high-confidence P-PAN after comparison of overlaps with two human interactomes (PPIs) based on experimental evidence. Clustering of the P-PAN and evaluation of the biological meaningfulness of the clusters using Gene Ontology annotations. PREDICTION: (1) Prediction of novel molecular targets for chemicals using a neighbor protein procedure. DEHP (orange) is known to be connected with blue proteins and is predicted to be associated with green proteins. A confidence score was calculated for each protein, represented by the width of the edges: thick edges for high scores, thin edges for low scores. (2) Prediction of diseases associated with chemicals after integration of protein-disease information from GeneCards into the clusters. As an example, apocarotenal, a compound found in spinach, is predicted to be linked to necrosis.
In the CTD, drugs and environmental compounds are claimed to be associated with toxicologically important proteins. To estimate how much the information from the CTD differs from available data on pharmacological action of drugs, we compared the data shared between CTD and DrugBank, as of May 2009
To investigate the assumption that two compounds sharing similar structure can potentially affect the same molecular targets, we compared chemical properties of the compounds collected from the CTD. The chemicals were characterized by 50 properties calculated from the structure, including the molecular mass and affinity for a lipid environment. The distribution of properties, as it appears in a multi-dimensional properties space, was projected and visualized in two dimensions using principal component analysis (PCA) (shown in
The human P-PAN was generated based on the assumption that if two proteins are biologically affected by the same chemicals (defined as shared chemicals), they are likely to be involved in a common mechanism of action of those chemicals. Accordingly, two proteins are connected to each other if they are linked to the same chemical in the CTD. The resulting P-PAN consists of 2.44 million associations. To reduce noise and select the most significant associations, we assigned two reliability scores to each protein-protein association: a score based on a hypergeometric calculation and a weighted score. The weighted score was calculated as the sum of weights for shared chemicals, where weights were inversely proportional to the number of associated proteins for a given compound.
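The weighted score described above can be illustrated with a minimal Python sketch. It assumes the simplest reading of "inversely proportional", namely a weight of 1/n for a chemical associated with n proteins; the exact weighting is not given in the text, and the function name `weighted_ppa_scores` and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def weighted_ppa_scores(chem_to_proteins):
    """Return {(protein_a, protein_b): weighted score} for proteins sharing chemicals.

    Assumption: each shared chemical contributes 1 / (number of proteins
    associated with that chemical), so promiscuous chemicals count less.
    """
    scores = defaultdict(float)
    for proteins in chem_to_proteins.values():
        proteins = sorted(set(proteins))
        if len(proteins) < 2:
            continue  # a chemical with one protein creates no association
        weight = 1.0 / len(proteins)
        # Sorted input makes each pair appear once, as (a, b) with a < b.
        for a, b in combinations(proteins, 2):
            scores[(a, b)] += weight
    return dict(scores)


# Toy chemical-protein associations (hypothetical identifiers).
toy = {
    "chem1": ["P1", "P2"],
    "chem2": ["P1", "P2", "P3", "P4"],
}
scores = weighted_ppa_scores(toy)
print(scores[("P1", "P2")])  # shared via chem1 (1/2) and chem2 (1/4)
```

Under this assumption, a pair linked only through a chemical that touches hundreds of proteins receives a very low score, which is the noise-reduction behaviour the paragraph describes.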
We went one step further and compared the P-PAN with two human PPI databases: (1) a high-confidence set of experimental PPIs extracted from a compilation of diverse data sources
Since proteins tend to function in groups, or complexes, an important step has been to verify that our high confidence network mimics true biological organization. This task is commonly executed using graph clustering procedures, which aim at detecting densely connected regions within the interaction graph. Two clustering methods have been applied to our network. The molecular complex detection (MCODE) approach
In the clusters of the P-PAN, proteins are more connected with other proteins within the cluster than with the other targets in the network. As proteins are associated based on their shared relationship with chemicals, proteins within a given cluster tend to be more linked to specific compounds. It is thus possible to find associations between diseases and the chemicals that underlie the protein-protein associations within the cluster using protein-specific disease annotations. For each cluster, we investigated if specific disease annotation was found more frequently than expected by using protein-disease information
To predict which chemicals may affect human health, we then analyzed selected clusters to identify new chemical-disease associations (see
Cluster ID | Disease | Chemical name | p-Value |
1 (462 proteins) | Breast cancer (128 proteins) | 7.68e-134 | |
4.46e-92 | |||
1.15e-88 | |||
2.20e-78 | |||
7.05e-63 | |||
12 (59 proteins) | Lung cancer (29 proteins) | 1.57e-26 | |
(10 proteins) | |||
3.29e-22 | |||
(12 proteins) | |||
7.78e-06 | |||
2 (433 proteins) | Necrosis (122 proteins) | 4.76e-35 | |
1.63e-29 | |||
(8 proteins) | |||
2.66e-26 |
Chemicals already known from the literature to be associated with disease are shown in italic. Chemicals significantly associated with disease but not previously reported in the literature as disease-causing are shown in bold. The number of proteins is shown in brackets for each cluster, disease and novel association. As an example, among the 433 proteins in cluster 2, 122 are known to be linked to necrosis. Among these 122, 8 are connected to apocarotenal in the CTD.
One of the clusters showed high enrichment for breast cancer. The most significantly associated chemicals are already known from the literature to be related to cancer, thus supporting the clustering quality of the P-PAN. Among the most significantly associated chemicals are the well-known polychlorinated biphenyls (PCBs). PCBs have been used in a variety of applications, e.g. flame retardants, paints and plasticizers. Although they have been banned due to their toxicity, they still persist in the environment. Previous results suggest that specific PCBs may indeed be associated with breast cancer
Besides revealing disease-chemical associations, the network can be used to predict novel targets for chemicals. It has been shown that many small molecules affect multiple proteins rather than a single target, and that proteins sharing an interaction with a chemical are targeted by the same chemicals
Chemical | Known protein | Cpscore | Novel protein | Cpscore | Literature |
DEHP | CDO1 | 13.23 | 5.46 | Yes | |
PPARA | 9.48 | 5.44 | Yes | ||
SUOX | 4.35 | 5.40 | Yes | ||
(15 proteins) | 4.32 | Yes | |||
4.32 | Yes | ||||
4.26 | Yes | ||||
TCDD | HSPA9B | 82.69 | 10.17 | Yes | |
SLC2A4 | 82.69 | 8.97 | Yes | ||
TRIP11 | 82.69 | 6.96 | Yes | ||
TSP1 | 82.69 | 6.39 | Yes | ||
EPHX2 | 75.77 | 6.77 | No | ||
MT2A | 10.85 | 5.61 | Yes | ||
(90 proteins) | |||||
PA | CYP4X1 | 5.67 | 5.19 | No | |
PPARA | 2.53 | 5.19 | No | ||
CES1 | 1.45 | 3.19 | Yes | ||
SULT2A1 | 0.87 | 2.61 | No | ||
CYP1A1 | 0.37 | 2.80 | Yes | ||
1.34 | Yes | ||||
1.21 | No | ||||
1.08 | Yes | ||||
1.04 | No | ||||
0.93 | No | ||||
0.91 | Yes | ||||
(5 proteins) | |||||
Permethrin | AR | 4.67 | 4.43 | Yes | |
WNT10B | 4.12 | 3.51 | Yes | ||
PGR | 3.75 | 2.89 | No | ||
ESR1 | 3.31 | 2.64 | Yes | ||
TFF1 | 3.15 | ||||
NR1I2 | 2.94 | ||||
(17 proteins) |
*Proteins known to be associated with a compound were extracted from the CTD. The total number of known proteins used to query the P-PAN is given in brackets. To find novel protein targets (in bold) associated with a chemical, a neighbor protein procedure was used, which scored the association between proteins and chemicals (cpscore). Among the novel predicted proteins (which were not part of the input data), some are supported by the literature, highlighting the usefulness of the P-PAN for identifying new chemical-protein associations.
Phthalates, mainly used as plasticizers, have received a lot of attention as environmental compounds because they are potential human carcinogens. As there are many phthalates, we focused on di-ethylhexyl phthalate (DEHP), which has been associated with more proteins than other phthalates, including additional information on kinases (e.g. mitogen-activated protein kinase 1, and mitogen-activated protein kinase 3)
The examples we provide include both known and new protein associations with a given chemical, and many of the novel associations are supported by the literature. We compared our approach with STRING (version STRING 1)
We propose an approach that differs from existing computational chemical biology networks, which primarily integrate drug information, in order to identify new molecular targets for chemicals and to link them to diseases. In our approach we have integrated toxicogenomics data for drugs and environmental compounds. The ability to make new findings using a different network is illustrated by a comparison with a similar method, showing the capacity of our P-PAN to identify novel chemical-protein associations. Using phthalate as an example, our model suggests potential associations between DEHP and GABA receptors, which have not been predicted previously.
Extending this network by integrating more data, for example other chemical-protein associations or the dose levels at which a compound may affect human health, would benefit the proposed approach. Paracelsus (1493–1541) is often cited for his quote, “all things are poisons and nothing is without poison, only the dose permits something not to be poisonous”. This emphasizes that the dose of a chemical is an issue to consider in the deregulation of biological systems. Nevertheless, a global mapping could allow a better understanding of adverse effects of drugs and toxic effects of environmental compounds. This could be used as a new approach for risk assessment and regulatory decision-making for human health.
Among the examples presented, some predictions are supported by literature for other organisms. Regarding toxicogenomics, the available human data are generally sparse compared to rodent data. Data on toxicity (adverse effects of chemicals on humans) can be acquired through epidemiologic studies and from occupational or accident-related exposures, as intentional human testing of environmental compounds remains limited. However, differences exist between animal model and human responses to chemicals, including differences in the type of adverse effects experienced and the dosages at which they occur. The differences may reflect variations in the underlying biochemical mechanisms, in metabolism, or in the distribution of the chemicals. As an example, bisphenol A (BPA) does not affect proteins in a similar way across species (
Molecular targets are represented as nodes, colored by gene family. The presence of a node indicates that information was available in the CTD, and the absence of a node indicates that the information is unknown. A colored node indicates that BPA affects the protein, while an uncolored node indicates that BPA does not affect the protein. This figure highlights similarities and differences between animal model and human responses to chemical exposure.
The major limitation of our integrative systems biology approach is that the molecular target predictions are limited to the 3,528 proteins present in our P-PAN, which represent only 15% of the estimated human proteome
We downloaded the publicly available Comparative Toxicogenomics Database (CTD) as of June 26, 2008
To verify the uniqueness of chemicals, chemical names extracted from the CTD were checked using PubChem (
To investigate chemical space of drugs and environmental compounds, 50 two-dimensional properties were calculated for each structure extracted from PubChem. To visualize them, principal component analysis (PCA) was performed. All necessary data were calculated using the MOE software (Chemical Computing Group version 2007.09)
Relevant human chemical-protein associations collected from the CTD were used to create a P-PAN. The maximum number of molecular targets assigned to one compound (tert-butylhydroperoxide) was 1,189, and the maximum number of chemicals assigned to one protein, cytochrome P450 3A4 (CYP3A4), was 276. The P-PAN was generated by instantiating a node for each protein and linking by an edge any protein-protein pair for which at least one overlapping chemical was identified. Scripts were used to convert the protein-protein associations into a non-redundant list of associations: if proteins A and B are associated, the network may contain two associations, A-B and B-A, and only one of these was retained in the P-PAN. We assigned two reliability scores to each protein-protein association: a score based on a hypergeometric calculation and a weighted score. The weighted score was calculated as the sum of weights for overlapping compounds, where weights were inversely proportional to the number of assigned proteins. The resulting P-PAN is a complex structure containing a total of 2.44 million unique associations between 6,060 human proteins.
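The text does not spell out the hypergeometric calculation. One common formulation, assumed here purely for illustration, is the upper-tail probability of observing at least the given number of shared chemicals between two proteins by chance. The function name and argument names below are hypothetical.

```python
from math import comb


def hypergeom_tail(n_total, n_a, n_b, n_shared):
    """P(X >= n_shared) under a hypergeometric null model.

    Assumed setup: n_total chemicals in the database, n_a of them associated
    with protein A and n_b with protein B; X is the number of chemicals the
    two proteins would share if B's chemicals were drawn at random.
    """
    upper = min(n_a, n_b)
    denominator = comb(n_total, n_b)
    numerator = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_shared, upper + 1)
    )
    return numerator / denominator


# Toy numbers: 10 chemicals overall, each protein linked to 3, all 3 shared.
p_value = hypergeom_tail(n_total=10, n_a=3, n_b=3, n_shared=3)
print(p_value)
```

A small tail probability would indicate that two proteins share more chemicals than expected by chance; how such a value was turned into the reported score is not stated in the source.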
The reliability of the weighted score was confirmed by fitting a calibration curve of different scores against Lage's PPIs [18] (version 2.9) and Vidal's PPIs [19]. Only 35,000 high-confidence experimental interactions were extracted from Lage's PPIs, which contain interactions present in the largest databases (Reactome, KEGG…) and data inferred from model organisms. Vidal's PPIs are based on an internally consistent single data source defined using the yeast two-hybrid system and contain 3,111 interactions. The overlaps of our P-PAN scores and Lage/Vidal PPIs are shown in
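A simple way to picture the comparison with a reference interactome is to rank associations by score and measure which fraction of the top-ranked pairs also appears in the reference PPI set. The sketch below is an assumption about how such an overlap curve could be computed, not the authors' calibration procedure; `overlap_curve` and the toy pairs are hypothetical.

```python
def overlap_curve(scored_pairs, reference_pairs, top_ns):
    """Return [(n, fraction of the n best-scoring pairs found in the reference)].

    scored_pairs: dict mapping (protein_a, protein_b) to a score.
    reference_pairs: iterable of (protein_a, protein_b) from a PPI database.
    Pairs are compared as unordered sets, so A-B matches B-A.
    """
    reference = {frozenset(pair) for pair in reference_pairs}
    ranked = sorted(scored_pairs.items(), key=lambda kv: kv[1], reverse=True)
    curve = []
    for n in top_ns:
        top = ranked[:n]
        if not top:
            continue  # nothing to evaluate for n == 0 or an empty network
        hits = sum(frozenset(pair) in reference for pair, _ in top)
        curve.append((n, hits / len(top)))
    return curve


# Toy association scores and a toy reference interactome.
scored = {
    ("P1", "P2"): 0.9,
    ("P1", "P3"): 0.5,
    ("P2", "P3"): 0.2,
    ("P3", "P4"): 0.1,
}
reference = [("P2", "P1"), ("P4", "P3")]
curve = overlap_curve(scored, reference, top_ns=[1, 2, 4])
print(curve)
```

A score that ranks true interactions first produces a curve that starts high and decays as lower-scoring associations are included, which is one way to compare alternative scoring functions and to choose a threshold.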
Among the ∼200,000 high confidence associations selected, 3,528 proteins were identified, and these were significantly enriched among the high scoring protein-protein associations as shown in
A high confidence sub-network of ∼200,000 protein-protein associations was selected which contained 3,528 proteins. This sub-network was highly interconnected, with the majority of proteins belonging to a single large cluster. In order to increase the resolution and facilitate biological interpretation, two clustering methods were applied to the sub-network, MCODE
To identify chemicals associated with disease, protein-specific information such as involvement in disease was integrated in each cluster. The Online Mendelian Inheritance in Man database (OMIM)
To predict molecular targets for a chemical, a network-neighbor pull-down was performed in a three-step procedure: (1) Selection of the input protein(s): extraction from the CTD of the protein(s) known to be associated with the selected chemical. (2) Identification of the network(s) surrounding the input proteins by a neighbor protein procedure. In this procedure, our P-PAN was queried for the input proteins, and associations between these were added. Next, the first-order interactors of all the input proteins were queried and added. For each neighbor, a score was calculated taking into account the topology of the surrounding network, based on the ratio between total associations and associations with input proteins. Molecular targets with a score higher than the threshold (0.1) were kept in the final sub-network(s). This node inclusion parameter is at the conservative end of the optimal range for protein-protein interaction networks [18]. As a final step, all proteins in the complex were checked for associations among them and the missing ones were added. (3) Establishment of a confidence score for the surrounding network (cscore) and of a score for each protein (cpscore): each of the pulled-down complexes was tested for enrichment in our input set by comparing it against 1.0e4 random complexes for the protein-protein association set, to establish a cscore for each sub-network and a cpscore for each connected protein. The cpscore was used to rank proteins to select potential molecular targets for chemicals. An illustration of cpscore is available on
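Step (2) of the procedure above can be sketched in Python as follows. The sketch assumes that the neighbor score is the fraction of a neighbor's associations that connect to input proteins, which is one plausible reading of the ratio described in the text; the random-complex comparison used for cscore and cpscore is not reproduced. The function name and the toy network are hypothetical.

```python
def pull_down_neighbors(network, input_proteins, threshold=0.1):
    """Return {neighbor: score} for first-order neighbors above the threshold.

    network: dict mapping each protein to the set of proteins it is
    associated with (undirected, so A in network[B] iff B in network[A]).
    Assumed score: (associations with input proteins) / (total associations).
    """
    inputs = set(input_proteins) & set(network)
    candidates = set()
    for protein in inputs:
        candidates |= network[protein]
    candidates -= inputs  # input proteins are known, not predicted

    kept = {}
    for candidate in sorted(candidates):
        partners = network.get(candidate, set())
        if not partners:
            continue
        score = len(partners & inputs) / len(partners)
        if score > threshold:
            kept[candidate] = score
    return kept


# Toy network: X is tightly linked to the inputs, Y is a promiscuous hub.
network = {
    "A": {"B", "X", "Y"},
    "B": {"A", "X"},
    "X": {"A", "B"},
    "Y": {"A"} | {f"N{i}" for i in range(20)},
}
kept = pull_down_neighbors(network, input_proteins=["A", "B"])
print(kept)
```

With this reading of the score, a hub protein that touches one input protein among many unrelated partners falls below the 0.1 threshold, while a neighbor whose associations are concentrated on the input set is retained.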
All the CTD human protein-chemical associations were extracted from the CTD on June 26, 2008. Subsequent updates of CTD, as of June 25, 2009, did not change the overall trends or conclusions of the present study.
Structure-target relationship: Oral bioavailability profiles. For drugs, permeability and absorption are properties considered to be important for effective delivery systems, and they receive special attention in pharmaceutical research. We chose to focus on the oral bioavailability properties based on the standard Lipinski and Veber rules. It is important to keep in mind that the rules serve as guidelines only; some classes of chemicals, like antibiotics, do not respect the rules. The selected properties are the molecular weight, the octanol/water partition coefficient (an indication of the ability of a molecule to cross biological membranes), the number of hydrogen bond donors, the number of hydrogen bond acceptors and the number of rotatable bonds. The distributions of the different molecular properties partially overlap, indicating that small environmental molecules could mimic drug properties. As an example, the distribution of the molecular weight shows a similar profile for each of the three MeSH categories, with a slight tendency for ‘Toxic Actions’ chemicals to have a smaller molecular weight (MW). The mean MW for ‘Toxic Actions’ chemicals is 264 daltons, whereas the mean MW for ‘Pharmaceutical Actions’ chemicals is 386 daltons.
(0.06 MB DOC)
Comparing overlaps between protein-protein associations and protein-protein interactions. To assess the reliability of our protein-protein association scores, we fitted a calibration curve of the different PPA scores against overlaps with two PPI databases: Vidal's interactome and a high-confidence set from Lage et al. Vidal's PPIs are based on an internally consistent single data source defined using the yeast two-hybrid system. Lage's PPIs contain interactions present in the largest databases and data inferred from model organisms. All the interactions used from Lage et al. for the calibration curve are experimental (extracted from Reactome, KEGG and experimental data from small-scale experiments). In both comparisons, the weighted score (wscore, in red) appears to be superior to the score derived from a hypergeometric test (hscore, in green) and to the random scores. The vertical line represents the selected threshold, which corresponds to 8% of the complete P-PAN, i.e. 200,080 protein-protein associations.
(0.07 MB DOC)
Molecular target predictions for DEHP: novelty of the P-PAN. The novelty of our P-PAN is supported by comparing the proteins predicted to be associated with DEHP using our approach and an existing method, STRING
(0.26 MB DOC)
Distributions of the gene-disease scores from GeneCards-AKS2 and OMIM. To integrate disease information into the clusters, GeneCards was used as a source of disease-protein connections. In order to limit the use of false positives present in GeneCards, we mapped the protein-disease associations shared between OMIM and GeneCards. According to the overlap curves, we set a significance cut-off of 60 for the GeneCards-AKS2 score (in red).
(0.28 MB DOC)
Mining the P-PAN for chemicals associated with diseases.
(0.06 MB DOC)
Molecular targets predictions for chemicals.
(0.07 MB DOC)
Example of molecular target predictions for chemicals. References: 1. Mahgoub AA, El-Medany AH (2001) Evaluation of chronic exposure of the male rat reproductive system to the insecticide methomyl. Pharmacol. Res. 44:73–80. 2. Bernard L, Martinat N, Lécureuil C, Crépieux P, Reiter E, Tilloy-Ellul A, Chevalier S, Guillou F (2007) Dichlorodiphenyltrichloroethane impairs follicle-stimulating hormone receptor-mediated signaling in rat Sertoli cells. Reprod. Toxicol. 23:158–164. 3. Saqib TA, Naqvi SN, Siddiqui PA, Azmi MA (2005) Detection of pesticide residues in muscles liver and fat of 3 species of Labeo found in Kalri and Haleji lakes. J. Environ. Biol. 26:433–438. 4. Flodström S, Hemming H, Warngard L, Ahlborg UG (1990) Promotion of altered hepatic foci development in rat liver cytochrome P450 enzyme induction and inhibition of cell-cell communication by DDT and some structurally related organohalogen pesticides. Carcinogenesis 11:1413–1417. 5. Sakai H, Iwata H, Kim EY, Tsydenova O, Miyazaki N, Petrov EA, Batoev VB, Tanabe S (2006) Constitutive androstane receptor (CAR) as a potential sensing biomarker of persistent organic pollutants (POPs) in aquatic mammal: molecular characterization expression level and ligand profiling in Baikal seal (Pusa sibirica). Toxicol. Sci. 94:57–70 6. Ding X, Staudinger JL (2005) Repression of PXR-mediated induction of hepatic CYP3A gene expression by protein kinase C. Biochem. Pharmacol. 69:867–873. 7. Matsuura I, Saitoh T, Tani E, Wako Y, Iwata H, Toyota N, Ishizuka Y, Namiki M, Hoshino N, Tsuchitani M, Ikeda Y (2005) Evaluation of a two-generation reproduction toxicity study adding endpoints to detect endocrine disrupting activity using lindane. J. Toxicol. Sci. Spec No 135–161.
(0.04 MB DOC)
Illustration of cpscore for approved drugs.
(0.09 MB DOC)
The authors would like to thank Daniel Edsgärd for his technical help and Ramneek Gupta for critical reading of the manuscript.