Pathway-specific protein domains are predictive for human diseases

doi:10.1371/journal.pcbi.1007052

Fig 1.

Overview of scoring pathway specificity of the protein domains.

(A) A co-pathway protein network was constructed based on the similarity of protein domain profiles (0 and 1 represent the absence and presence, respectively, of each domain in the protein). Sub-networks that represent pathways f₁, f₂, and f₃ were enriched for domains d₁, d₂, and d₃, respectively. Edge thickness is proportional to the probability that the two connected proteins operate in the same pathway. (B) Next, each protein received a protein-pathway association (PPA) score for a specific pathway f, computed as the sum of its edge scores to all member proteins of pathway f. (C) The domain-pathway association (DPA) score of each domain was assigned as the average PPA of all proteins that harbor the domain. In this example, the DPA of domain d₃ for pathway f₃, DPA₃(f₃), was the average of PPA₈(f₃), PPA₉(f₃), and PPA₁₀(f₃). The Gini index (GI) was used to measure the impurity of the data. (D) Subsequently, pathway specificity (PS) was calculated. In this example, because domains d₁, d₂, and d₃ have high PSs for pathways f₁, f₂, and f₃, respectively, they were classified as pathway-specific domains (PSDs) for the corresponding pathways. However, domain d₄ was classified as a non-specific domain (NSD) because of its low PS for all pathways.
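
The scoring steps in panels B and C can be illustrated with a minimal sketch. The toy network, the edge scores, and all function and variable names below are hypothetical, and the caption does not state how PS is derived from the DPA scores and the Gini index, so the sketch stops at a Gini impurity computed over each domain's DPA scores across pathways (lower impurity means the domain's scores are concentrated on fewer pathways).

```python
# Toy illustration of PPA and DPA scoring; all data and names are hypothetical.

def ppa(protein, members, edges):
    """PPA: sum of edge scores from `protein` to all member proteins of one pathway."""
    return sum(edges.get(frozenset((protein, m)), 0.0) for m in members if m != protein)

def dpa(domain, members, domain_to_proteins, edges):
    """DPA: average PPA over all proteins that harbor `domain`."""
    carriers = domain_to_proteins[domain]
    return sum(ppa(p, members, edges) for p in carriers) / len(carriers)

def gini_impurity(scores):
    """Gini impurity of the scores after normalizing them to shares that sum to 1."""
    total = sum(scores)
    if total == 0:
        # Assumption: treat an all-zero profile as uniform (maximally impure).
        return 1.0 - 1.0 / len(scores)
    return 1.0 - sum((s / total) ** 2 for s in scores)

# Undirected co-pathway edges with made-up scores.
edges = {
    frozenset(("p1", "p2")): 0.9,
    frozenset(("p3", "p4")): 0.8,
    frozenset(("p1", "p3")): 0.1,
    frozenset(("p2", "p3")): 0.1,
}
pathways = {"f1": {"p1", "p2"}, "f2": {"p3", "p4"}}
domain_to_proteins = {"d1": ["p1", "p2"], "d4": ["p1", "p3"]}

# DPA profile of each domain across pathways f1 and f2, then its impurity.
profiles = {
    d: [dpa(d, pathways[f], domain_to_proteins, edges) for f in ("f1", "f2")]
    for d in domain_to_proteins
}
impurity = {d: gini_impurity(v) for d, v in profiles.items()}
```

In this toy setting, d1 occurs only in proteins of f1 and ends up with a lower impurity than d4, which is split between the two pathways. That mirrors the contrast between a pathway-specific and a non-specific domain in panel D.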


Fig 2.

Disease implications of PSDs.

(A) Regression between pathway specificity (PS) and the significance of overlap with the gold-standard domain-pathway pairs by sigmoidal curve fitting. Domain-pathway associations were divided into two groups: the top 16,000 associations, which showed significant overlap (P < 0.01 by Fisher’s exact test) with the gold-standard data, and the remaining 33,636 associations. The 4,506 domains in the top 16,000 associations were defined as pathway-specific domains (PSDs), and the 3,856 domains in the remaining associations were defined as non-specific domains (NSDs). (B) Comparison of normalized variation rates (NVRs) for neutral and pathogenic variants between PSDs and NSDs (*, P < 0.01; n.s., P > 0.05). (C) Comparison of NVRs between PSDs and NSDs for three classes of missense disease mutations described by Sahni et al. and for nonsynonymous variants known to affect physical protein interactions according to the IMEx consortium (*, P < 0.01; n.s., P > 0.05). (D) Comparison of the ratios (log base 2) of PSDs to NSDs for groups of human structural interaction network (hSIN) interfacing domains with similar sizes for different ranges of domain interaction connectivity. (E) Proposed models for the relationships between mutational consequences and the number of domain interactions. The blue node represents a hub domain that mediates interactions between a large number of proteins containing domains with one or, at most, a few interacting domains (green nodes), and the yellow nodes represent domains with moderate numbers of domain interactions, which are involved in ‘within-pathways’ (shaded areas).
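
The overlap test in panel A can be sketched as a one-sided Fisher's exact test, which is equivalent to a hypergeometric upper tail. The caption does not say whether the test was one- or two-sided or what universe of pairs was used, so both are assumptions here, and the function name and counts are purely illustrative.

```python
from math import comb

def fisher_overlap_pvalue(overlap, n_predicted, n_gold, n_universe):
    """One-sided Fisher's exact test for enrichment.

    Probability of seeing at least `overlap` gold-standard pairs among
    `n_predicted` pairs drawn without replacement from `n_universe` pairs,
    of which `n_gold` are gold-standard.
    """
    max_k = min(n_predicted, n_gold)
    denom = comb(n_universe, n_predicted)
    # math.comb returns 0 when k > n, so impossible tables contribute nothing.
    tail = sum(
        comb(n_gold, k) * comb(n_universe - n_gold, n_predicted - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / denom

# Illustrative counts only: 3 predicted pairs, all 3 among the 3 gold-standard
# pairs, in a universe of 10 candidate pairs.
p_value = fisher_overlap_pvalue(overlap=3, n_predicted=3, n_gold=3, n_universe=10)
```

With these toy counts there is exactly one way to pick all three gold-standard pairs out of comb(10, 3) = 120 possible selections, so the p-value is 1/120.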


Fig 3.

PSDs can predict disease genes.

(A) A summary of candidate gene selection for coronary artery disease (CAD) and schizophrenia (SCZ) by integration of GWAS significance and PSD occurrence data. SNPs from GWASs were divided into three groups: (i) SNPs with high significance that indicate confident candidate genes; (ii) SNPs with low significance that are generally discarded; and (iii) SNPs with moderate significance that were considered for further selection in this study. Based on the overlap between disease genes and pathway genes, we converted domain-pathway associations into domain-disease associations to identify disease-associated PSDs. Candidate disease genes of the GWAS∩PSD set were selected based on the occurrence of disease-associated PSDs in genes with moderate GWAS significance. (B) The precision of CAD gene predictions was assessed based on CADgeneDB annotations. The precision expected at random (i.e., the number of disease genes / the number of all human genes) is indicated by the blue line (~2.5%). (C) The precision of SCZ predictions was assessed based on SZdatabase annotations. The precision expected at random is indicated by the blue line (~4.1%).
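
The selection and evaluation logic of panels A to C can be sketched as follows. The gene identifiers, p-values, and the `lenient` cutoff are placeholders, since the caption does not give the thresholds that separate the three SNP groups. The `strict` default uses the conventional genome-wide significance level of 5e-8, which is likewise an assumption rather than a value taken from the caption.

```python
def select_candidates(gwas_pvalues, genes_with_disease_psd, strict=5e-8, lenient=1e-4):
    """Genes with moderate GWAS significance that also carry a disease-associated PSD.

    `strict` and `lenient` bound the moderate-significance band; both are
    hypothetical defaults, not values from the study.
    """
    return {
        gene
        for gene, p in gwas_pvalues.items()
        if strict <= p < lenient and gene in genes_with_disease_psd
    }

def precision(predicted, known_disease_genes):
    """Fraction of predicted genes that are annotated disease genes."""
    if not predicted:
        return 0.0
    return len(predicted & known_disease_genes) / len(predicted)

# Made-up example data.
gwas_pvalues = {"G1": 1e-9, "G2": 1e-6, "G3": 1e-5, "G4": 0.2, "G5": 3e-6}
genes_with_disease_psd = {"G2", "G3", "G4"}
known_disease_genes = {"G2"}

candidates = select_candidates(gwas_pvalues, genes_with_disease_psd)
observed = precision(candidates, known_disease_genes)
# Random expectation as defined in the caption: disease genes / all genes.
random_expectation = len(known_disease_genes) / len(gwas_pvalues)
```

Here G1 is excluded as already confident, G4 as too weak, and G5 because it lacks a disease-associated PSD, which leaves G2 and G3 as candidates. The observed precision (0.5) can then be compared with the random expectation (0.2), analogous to the blue baseline in panels B and C.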


Fig 4.

Experimental validation of novel genes for heart development in zebrafish.

(A) Tg(flk1:EGFP) zebrafish embryos injected with morpholinos (MOs) for novel candidate genes for CAD showed morphological heart abnormalities, such as peripheral edema at 3 days post-fertilization (arrows in the left panel, scale bar = 500 μm). Zebrafish embryos normally have hearts with a left ventricle (V) and right atrium (A), whereas the embryos injected with MOs for the candidate CAD genes exhibited either no asymmetry or reversed V and A orientation (middle panels, scale bar = 200 μm). These embryos also exhibited malformed blood vessels in the trunk (asterisks in the right panel, scale bar = 200 μm). (B) MO-injected Tg(flk1:EGFP) zebrafish embryos were counted to quantify those that exhibited heart asymmetry. (C) MO-injected Tg(flk1:EGFP) zebrafish embryos were counted to quantify those that exhibited vascular defects. More than 20 MO-injected embryos per gene were counted for each analysis (A-C).
