An Integrative Analysis of Preeclampsia Based on the Construction of an Extended Composite Network Featuring Protein-Protein Physical Interactions and Transcriptional Relationships

Preeclampsia (PE) is a pregnancy disorder defined by hypertension and proteinuria. This disease remains a major cause of maternal and fetal morbidity and mortality. Defective placentation is generally described as being at the root of the disease. The characterization of the transcriptome signature of the preeclamptic placenta has allowed to identify differentially expressed genes (DEGs). However, we still lack a detailed knowledge on how these DEGs impact the function of the placenta. The tools of network biology offer a methodology to explore complex diseases at a systems level. In this study we performed a cross-platform meta-analysis of seven publically available gene expression datasets comparing non-pathological and preeclamptic placentas. Using the rank product algorithm we identified a total of 369 DEGs consistently modified in PE. The DEGs were used as seeds to build both an extended physical protein-protein interactions network and a transcription factors regulatory network. Topological and clustering analysis was conducted to analyze the connectivity properties of the networks. Finally both networks were merged into a composite network which presents an integrated view of the regulatory pathways involved in preeclampsia and the crosstalk between them. This network is a useful tool to explore the relationship between the DEGs and enable hypothesis generation for functional experimentation.


Introduction
Preeclampsia (PE) is a pregnancy condition characterized by hypertension and proteinuria. PE affects 5 to 7% of all pregnancies and remains a major cause of maternal morbidity and mortality. The disease constitutes also a major threat for the lives of child, being both a cause of prematurity and growth retardation; reviewed by [1,2]. PE can develop at any time after 20 weeks of gestation, however early onset disease is more severe than later onset disease and associated with poorer outcomes for both mother and child. The pathogenesis of PE is thought to originate from the placenta since its delivery generally resolves the syndrome and remains the only cure. A consensus exists that in PE abnormal vascularization of the placenta leads to poor perfusion. Then, placental ischemia would cause intermittent hypoxia, oxidative stress, cell death, and the release to the maternal circulation of anti-angiogenic factors and debris that promote a systemic endothelial dysfunction [1][2][3]. To investigate the molecular mechanisms involved in this disease several studies have used genome-wide microarray expression analysis to identify differentially expressed Genes (DEGs) between the preeclamptic and the non-pathologic placenta; reviewed in [4]. Several DEGs have been found systematically modified in the preeclamptic placenta, including: LEP, FLT1, ENG, INHA [5][6][7][8]. Yet the characterization of the transcriptome signature of the preeclamptic placenta has not been followed by a detailed understanding on how the modifications in the expression of these genes impacts the function of the placenta. Biological processes in normal or pathological conditions results of complex interactions between genes, proteins, metabolites and other molecules. Systems biology approaches attempt to gain a deeper understanding of biological processes by integrating the ensemble of its components through the mathematical modelling of networks. In these networks, the components (genes/proteins/metabolites) of the system are represented as nodes and their interactions as edges [9]. Network analysis detects the relationships between its components and enables the physical or functional relationships between them to be interrogated. This methodology is useful to dissect the complexity of a biological process, providing information relevant to understand the process, including the pathways and the prioritization of genes involved in it [10]. Herein, we performed a cross-platform meta-analysis of several public gene expression datasets to identify DEGs that are consistently modified in PE. Then we used these DEGs to build a composite extended network integrating physical protein interactions (PPIs) as well as regulatory interactions with transcriptional factors (TFs). This composite extended network can be useful to investigate and gain insight into the molecular mechanisms involved in PE, to identify new biomarkers or potential drug targets.

Meta-analysis of microarray datasets
We performed a meta-analysis of several published gene expression data in preeclampsia. We searched the Gene Expression Omnibus (GEO) repository (http://www.ncbi.nlm.nih.gov/geo), to identify microarray datasets that compared gene expression in preeclamptic versus normal placentas. The keywords: preeclampsia, placenta, microarrays and gene-expression, were used for this search. This way, we identified a total of 12 microarray studies in the GEO repository. However, to be included in our study the microarray experiments had to carried out with RNAs obtained from placental biopsies collected at delivery and at relatively comparable gestational ages (30-39 weeks), include a minimal number of samples (! 5) and be comparable in terms of the microarray technology used. Thus finally, seven datasets were found meeting our criteria and considered eligible for our study, Table 1 [11][12][13][14][15][16][17].
Since these studies used different microarray platforms, we performed our meta-analysis using the rank product algorithm implemented in the online INMEX software (http://inmex. ca), [18]. Before the datasets were analyzed, we used the processing modules (utilities) implemented in the INMEX software to annotate all probe identities from each dataset as Entrez gene IDs (for data consistency), transform the intensity values for gene expression to log 2 and perform quantile normalization. A list of 369 differentially expressed genes (DEGs) were identified based on adjusted p-value (threshold was set at adjP < 0.05) and average fold change (FC ! 1.3 and FC -1.3). FC were calculated as preeclampsia (PE) versus control (CO) samples. In addition, genes found to be significantly deregulated but showing discordant fold change in at least two of the seven datasets analyzed were discarded. Correction for multiple testing was done using Benjamin Hochberg False Discovery Rate (FDR).

Enrichment analysis
The enrichment analysis of the DEGs was performed with the WebGestalt (http://bioinfo. vanderbilt.edu/webgestalt) bioinformatics resource [19]. WebGestalt (WEB-based GEne SeT AnaLysis Toolkit) incorporates updated information from different public resources and provides an easy way to make sense out of gene lists. Databases interrogated include: Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and WikiPathways [20][21][22]. The significance of the detected enrichments was calculated using the Benjamini & Hochberg multiple test adjustment. In addition, WebGestalt was also used to perform enrichment analysis of the network modules identified by community clustering.

Extended Protein-Protein Interaction networks: Construction and Analysis
The DEGs identified in our meta-analysis were used as seeds to build a network of protein-protein interactions (PPIs). We constructed a network that not only consists of the seed proteins but also of their direct PPI neighbors and the interactions between these proteins (first order network). We obtained high confidence PPIs from the Biological General Repository for Interaction Datasets (BioGRID), (http://thebiogrid.org), [23]. The PPI network was constructed, visualized and analyzed using the Cytoscape 3.2.1 software and its complementary applications [24]. The network was analyzed for both centrality and modularity parameters. The centrality parameters of the networks were analyzed using the Cytoscape application NetworkAnalyzer [25]. Two topological parameters Betweenness Centrality (BC) and node degree were used to identify hub genes [26]. In PPI networks genes or proteins are represented by nodes and the interactions between the nodes are represented as edges. BC is defined as the number of shortest paths from all vertices to all others that pass through that node. BC reflects the amount of control that the node exerts over the interaction of other nodes in the network. Degree indicates the number of edges linked to a given node. Nodes with higher degree and high BC are hub genes which correspond to the most functionally relevant elements in the network. Modular analysis was performed using the fast-greedy HE (G) algorithm of the Community Clusters Glay implemented in the Cytoscape Apps [27]. The idea behind identifying functional modules is that proteins interacting with each other have higher probability of sharing the same function Illumina human HT-12 V4.0 [17] than none interacting proteins. Thus, finding functional modules in a biological network is similar to identify clusters of densely interacting nodes.

Regulatory Network Construction
We built a regulatory network by extracting gene regulatory relationships from RegNetwork (http://www.regnetworkweb.org) an integrated database of transcriptional and post-transcriptional regulatory networks in human and mouse [28]. This database integrates the documented regulatory interactions among transcription factors (TFs), microRNAs (miRNAs) and target genes from 25 selected databases including: JASPAR, TRANSFAC, TRED and USCS [29][30][31] [32]. It also incorporates potential regulatory relationships based on TFs binding sites. The DEGs were used as seeds to extract high confidence transcriptional regulatory relationships from the human RegNetwork.

Identification of DEGs by meta-analysis
Seven microarray datasets were extracted from the GEO database, which meet our criteria for meta-analysis (Table 1). Cross-platform meta-analysis of the selected studies using the rank product algorithm resulted in the identification of a total 1940 DEGs at adjP <0.05. Significant genes having the same direction of fold change in at least 5 out of 7 datasets and with an average log 2 FC ! 0.3 were considered for further analysis (The average log 2 FC being the average log 2 FC for every single gene for the 7 datasets). This resulted in a list of 369 DEGs, of which 212 were up-regulated in PE and 157 down-regulated genes in PE (S1 Table). Among the consistently up-regulated genes in PE with the higher mean log 2 FC and lowest p-value we detected in descending order LEP, FSTL3, PAPPA2, INHA, FLT1, SPAG4 and BHLHE40. The downregulated genes with the largest mean log 2 FC and lowest p-value included CLDN1, SPP1, ACTG2, NR2F1, GCLM and MMP1. By analyzing the literature we found that several of the top DEGs identified in our meta-analysis have been validated experimentally as differentially expressed in preeclamptic placentas (S2 Table).

Functional and pathway enrichment analysis
Enrichment analysis was performed to gain insight on the biological role of the DEGs. The most over-represented Gene Ontology (GO) terms in the biological process category included: "response to steroid hormone stimulus", "response to wounding", "cell migration", "and response to lipid" and "cell proliferation". In the category of GO cellular component most enriched terms were "extracellular region", "extracellular matrix" and "plasma membrane". A complete list of GO terms is provided in S3 Table. KEGG and WIKI pathways analysis revealed that the DEGs are associated with a number of significant pathways ( Table 2). These included: "focal adhesion", "pathways in cancer", "metabolic pathways", and "cytokine-cytokine receptor interaction". "Focal adhesion" and "regulation of actin cytoskeleton" were detected in both the KEGG and Wiki pathway databases. The most significantly deregulated signaling pathways included: MAPK, P53, TGF-Beta, and Chemokine signaling. To complement the functional analysis we calculated the histogram of gene frequency in the different pathways identified by the enrichment analysis (S1 Fig). Thus, TGFB1 and PDGFRA are the genes with the higher frequencies. They are present in 8 and 7 pathways respectively. CYP11A1 and SERPINE1 belong to 6 pathways, FOS, IGF1 and PIK3CB are present in 5 pathways. COL1A1 and CYP11A1 are present in 4 pathways. The other genes present lower frequencies. 26 genes were present in 3 pathways, 32 genes in 2 pathways and 64 genes belong only to 1 pathway.

Construction of a physical PPI network
To interpret the functional meaning of the identified DEGs at the protein level, we constructed a physical PPI network for the proteins encoded by these genes. Physical PPI interactions were retrieved from the BioGRID database [23]. We constructed a network that not only consists of the seed proteins but also their direct PPI neighbors (first order interaction network) and the interactions between these proteins. The proteins (nodes) and resulting interactions (edges) were imported into Cytoscape for visualization and further analysis [24]. This resulted in a network composed 1231 proteins and 5486 interactions. The network contained 214 out of the 369 DEGs identified in the meta-analysis. For the remaining DEGs no high confidence PPIs were reported in the BioGRID database. Biological networks have topological characteristics distinguishing them from random networks [26]. Among these the node degree distribution usually follows a power law. As shown (S2A Fig), this is the case for our network: R 2 = 0.913. This also means that a few nodes are highly connected (hubs) while a majority of nodes interact only with one or a few neighbors. The hub nodes play a major role in the architecture of the network and usually correspond to proteins of special biological relevance in the system under study. Here we used two topological parameters to identify hub proteins: betweenness centrality (BC) and degree (both defined in the methods section). These parameters were calculated using the Cytoscape NetworkAnalyzer app. The topological scores (for the top 50 genes) are supplied in S4 Table. When considering the BC parameter, the principal hub genes include EGFR, ZBTB16, GRB2, EP300, TP53 and GOLGA2. Several of the DEGs are present among the top 50 hub genes including ZBTB16, CASK, FOS, BHLHE40, TGFB1 and NOS3.
Module detection using connectivity patterns The PPI network was further subdivided into connectivity modules using the fast-greedy HE (G) algorithm of the Community Clusters Glay [27]. This resulted in the identification of nine connectivity modules closely connected. The Modules were analyzed for functionality enrichment using the gene ontology (GO) database (Table 3). Thus, the bigger module, module1 was composed of 292 nodes and significantly enriched in the biological processes terms: positive regulation of transcription from RNA polymerase II promoter (adjP = 9.52e-58), positive and regulation metabolic process activity (adjP = 5.58e-51). The smaller module, module 9, contained 22 nodes and was associated with healing (adjP = 0.0059) and coagulation processes (adjP = 0.0059).

Composite interactions PE network
The physical interactions PPI Network and the TFs-DEGs regulatory network were merged using Cytoscape to generate a composite network which integrates both kinds of interactions. Thus we obtained a network composed of 1542 nodes and 7718 edges. To illustrate it, Fig 1 shows a portion of the composite network displaying the transcription factors AP2A, its target among the DEGs, its physical protein interactions and its transcriptional regulators. For visualization purposes TFs were differentiated from their targets or other proteins in the network using a different symbol. Regulatory interactions were colored in green and physical interactions in mauve. The mean log 2 FC of the DEGs was incorporated in the network to visualize up or down-regulation. The complete composite network is provided as supplementary data in form of an XML file which can be visualized and explored using Cytoscape.

Discussion
Herein we have performed a cross-platform meta-analysis of seven gene-expression datasets comparing non-pathological and preeclamptic placentas. Using a rank product algorithm we identified 369 DEGs consistently modified across the considered datasets. In a previous study we used a vote counting strategy which led to the identification of only 98 DEGs in preeclamptic placentas [8]. Sixty one DEGs identified in our previous study were also identified in the present study. Vote counting is a simple method for summarizing evidence from multiple datasets; however it does not take into account the quality of the studies, the size of the samples, or the size of the effect. In the present study we used a rank product algorithm, a simple non-parametric statistical method based on ranks of fold changes [33]. Methods combining P-values or effect sizes have the disadvantage that the results can be often dominated by outliers, a significant problem when thousands of genes are compared in the noisy environment of microarray experiments. The methods combining rank statistics can be used to alleviate this problem. Comparative studies suggest that the rank product method outperforms the other methods in terms of sensitivity and specificity, especially in a setting of small sample size and large between-study variation [34]. In addition to our studies, two other meta-analysis on transcriptomic datasets of preeclamptic placentas have been conducted recently [5,6]. The study by Moslehi and coworkers was conducted on 4 datasets using a combining P-values method and identified a total of 419 DEGs. The overlap with the present study shows 139 common DEGs. The second one published by van Uitert and coworkers, was conducted on 11 placenta gene expression datasets which were preprocessed to conduct quality analysis. A combining size effect algorithm using inverse variance random effects was applied, resulting in the identification of 388 DEGs. In this case the overlap with the current study is of 148 common DEGs. Discrepancies between the studies can be attributed to the different number of datasets and different statistics meta-analysis methodologies. Functional enrichment analysis of the DEGs identified in the present study, revealed that they are mainly related to biological functions such as Focal adhesion, metabolic pathways, ECM-interactions, cytokine-cytokine interactions, complement and coagulation cascades, etc. All these processes have been previously associated with preeclampsia [3]. To further characterize the relationship among the DEGs we attempted to identify physical interactions among the proteins encoded by the DEGs. However, the scarcity of direct interactions between the DEGs imposed the need to create a first order network. Thus we build an extended network which includes not only the interactions between the DEGs but also the interactions between the DEGs and their first neighbors. This resulted in a network composed of 1231 proteins and 5486 physical interactions. To evaluate the biological significance of proteins in a complex disease such as PE, network biology considers the topological position of a protein as crucial as can be any eventual modification in its expression level. Within a network not all nodes are equally important. Many studies have shown that "hub" nodes are more likely related to diseases [35,36]. Proteins occupying a "hub" position are potentially key molecules in signaling, as they are interacting with many other proteins, being thus able to receive and transmit multiple signals [37]. Herein proteins with high node degree and betweenness centrality were considered "hubs". Among the top 50 hubs we found EGFR, ZBTB16, GRB2, EP300, TP53 and GOLGA2. Several of the DEGs are present among the top 50 hubs including ZBTB16, CASK, FOS, BHLHE40, TGFB1 and NOS3. Below we discuss the role of some of these hubs in PE.
EGFR encoding the epidermal growth factor receptor is the principal hub of our network. This protein is a receptor for members of the epidermal growth factor (EGF) family. In addition to EGF this family includes the transforming growth factor-alpha (TGFA), heparin-binding EGF-like growth factor (HBEGF), betacellulin (BTC), amphiregulin (AREG) and epiregulin (EREG). Accumulating evidence suggests that human trophoblasts survival and invasive capacities are linked to these growth factors. EGF protects cultured human cytotrophoblats against induced apoptosis [38,39]. Also in vitro studies have shown that cytotrophoblast motility and invasiveness are stimulated by TGFA and heparin-binding EGF-like growth factor HBEGF [40,41]. HBEGF is involved in extravillous trophoblast differentiation and it also protects cytotrophoblast cells from apoptosis when exposed to hypoxia or oxidative stress consecutive to hypoxia/reoxygenation [42]. A recent study has shown that in vitro, EGF, TGFA, BTC and EREG can also prevent apoptosis in cytotrophoblast exposed to hypoxia/reoxygenation stress [43]. Decreased expression of EGF, TGFA and HBEGF has been reported in the preeclamptic placenta, while expression of BTC, EREG and AREG show no significant differences [43,44]. BTC and EREG could at least in part compensate the lack for the other EGF family members since they are effective in preventing cytotrophoblast apoptosis. However, the analysis of EGFR mRNA expression in the placenta has revealed two different full-length transcripts and one truncated transcript, p110/EGFR. In the preeclamptic placenta two independent studies have found significantly higher mRNA expression of the truncated transcript, while there were no significant differences in the mRNA expression of the two full-length transcripts [43,45]. Since the p110/EGFR isoform contains an extracellular domain and lacks the intracellular domain it should not be able to signal upon EGF-binding and thus could act as a dominant negative. The high expression of the p110/EGFR in the preeclamptic placenta would contribute to impair the EGF signaling system [43]. Thus, concordantly with our bioinformatics analysis experimental evidence support an important role for EGFR in the development of PE. Our study is also in good consonance with two previous studies. A first study by Tejera and coworkers identified EGFR and GRB2 genes as central components of the PE network [46]. Also, an integrative analysis by Moleshi and coworkers identified EGFR signaling deficiency combined to hypoxia/oxidative stress as central mechanisms of PE [5].
Among the DEGs the transcriptional repressor ZBTB16 (PLZF) appears as one of the principal hubs in the network. ZBTB16 is a member of the kruppel-like zinc finger protein family involved in a multitude of biological processes including spermatogenesis, hematopoiesis, myeloid differentiation lymphoid development, cytokine production, programming and maturation of NKT and INKT cells, cellular proliferation and apoptosis [47,48]. Due to its role in limiting cell-cycle progression and cellular proliferation PLZF is known as a tumor suppressor [49]. Recruitment of co-repressors and subsequent chromatin remodeling has been shown to underlie the repressor function of ZBTB16 [48,50]. Although considered as a repressor an increasing number of studies describe ZBTB16 as an activator of transcription [48]. A role of ZBTB16 in the control of inflammation is emerging. ZBTB16 would form a co-repressor complex with HDAC3 and NF-κB in order to moderate the inflammatory program [51,52]. A possible function of ZBTB16 in placenta physiopathology remains to be explored, however a recent study linked it to hydatidiform molar pregnancies through its physical interaction with NLRP7 [53]. According to this study NLRP7 (a protein involved in innate immunity and apoptosis) could trap ZBTB16 in the cytoplasm and disrupt its nuclear inhibitory function, and this would lead to excessive trophoblastic proliferation. Interestingly ZBTB16 is known to interact physically with the EGFR ligand HBEGF [54]. The EGFR ligands, are synthesized as type I transmembrane protein precursors and are expressed on the plasma membrane [55]. Membrane-anchored HBEGF, is cleaved from the plasma membrane in a process termed ectodomain shedding, which yields soluble HBEGF (sHBEGF) and a carboxy-terminal fragment of HBEGF (HBEGF-CTF). The sHBEGF binds to the EGFR and induces EGFR phosphorylation and subsequent activation of downstream signaling events. In parallel, the HBEGF-CTF translocates to the nucleus, where it induces nuclear export of the transcriptional repressors ZBTB16 and BCL6, thereby facilitating cell cycle progression [54,56,57]. The ectodomain shedding is done by disintegrin and metalloproteinase (ADAM) members. A recent study has shown that TGFβ induces HBEGF shedding and EGFR transactivation through ADAM17 activation in gastric cancer cells [58]. It would be interesting to investigate if a similar mechanism is also at work in trophoblasts since aberrant TGFβ signaling is involved in PE [3].
We identified BHLHE40 (also known as STRA13, DEC1) as another of the DEGs occupying a hub position. This gene has been found systematically up-regulated in the preeclamptic placenta [5,6,8]. BHLHE40 is a transcriptional repressor binding as an homodimer to class B Ebox sites with high affinity; reviewed by [59,60]. Its transcriptional repression activity includes recruitment of co-repressors (such HDAC1) that mediated repressive chromatin marks or methylate non-histone proteins; dimerization with bHLH factors to form non-functional complexes; competition for binding to E-box sites; interaction with non-bHLH transcription factors to inhibit their activity; and promotion of proteasome-mediated degradation. BHLHE40 is involved in different biological processes and its expression is up-regulated by numerous stimuli including retinoic acid, serum deprivation, TGFβ, cAMP, light, cytokines, insulin and hypoxia. BHLHE40 regulates the differentiation of several cell types, thus gain and loss of function experiments suggest that it can repress mesoderm differentiation and enhance muscle regeneration after injury [61,62]. BHLHE40 is deregulated in a wide variety of cancers and impairs the DNA mismatch repair mechanism by repressing the expression of MLH-1 [63]. Many studies have shown that BHLHE40 mediates cell cycle arrest and promotes cellular senescence [60]. It has been proposed that its effects on cell cycle could result of down-regulation of c-Myc, and cyclin D1 as well as its interaction with p53 [64,65]. Its capacity to induce senescence seems to be partially p53 dependent [66]. Also, depending on the cell type BHLHE40 has been shown to either induce or block apoptosis [59]. Hypoxia has been identified as a hallmark of the preeclamptic placenta. Interestingly, BHLHE40 is a target of the hypoxia inducible factor (HIF1) and is induced by hypoxia [67][68][69]. Under normoxic conditions, BHLHE40 expression is down-regulated by von Hippel-Lindau (VHL) a component of the VHL/E3 ligase complex which ubiquitinates and targets HIF1 for degradation [70]. Recent studies indicate that BHLHE40 could act also as a regulator of HIF1 and hypoxia responses. Thus, induction of HIF-1α and VEGF through an EGFR/PI3K/BHLHE40 pathway has been reported [71]. As an effector of HIF1 it has been shown that BHLHE40 regulates adipogenesis through a mechanism involving recruitment of HDAC1 to repress CEBPB transcriptional activity [72]. Also, under hypoxic conditions BHLHE40 represses sterol regulatory element binding protein 1c (SREBP1c), involved in the induction of several enzymes of the lipogenesis pathway [73]. The control exerted by BHLHE40 on wide range of biological functions, its role in mediating hypoxia responses, and is systematic up-regulation in the preeclamptic placenta suggests that it could play an important role in this disease. It is likely that exploring the function of BHLHE40 in cytotrophoblasts under normoxia and hypoxia could reveal new cues on the molecular mechanisms at work in the preeclamptic placenta.

Conclusions
In this study we have used publically available microarray datasets comparing gene expression in preeclamptic versus non-pathologic placentas. Other studies analyzing global gene expression in the preeclamptic placenta have been published but the datasets are not publically accessible. Thus, the number of datasets constitutes a first limitation of our study which could have reached stronger value if more datasets were included. Another limitation of the present study comes from the coverage and quality of the human interactome data. Despite the best curation efforts, the interactome remains incomplete, current coverage is estimated at 20% and biased toward much-studied proteins [74][75][76]. In principle the incompleteness of the human PPI network poses limitation to any study of network properties in the context of normal physiology or disease. However, a recent study showed that notwithstanding its incompleteness, the available interactome has sufficient coverage to pursue a systematic network-based approach to human diseases [77].
Our composite network summarizes current knowledge on the relationships between gene/ proteins involved in PE. It constitutes a model which can be easily upgraded with new PPI or regulatory interactions.  Table. List of genes identified as differentially expressed in preeclampsia in our metaanalysis and whose differential expression has been experimentally validated by other studies. (DOCX) S3 Table. GO terms "Biological process" and "Cellular Component" associated with the DEGs identified by the meta-analysis.