Gene Expression Profiling Combined with Bioinformatics Analysis Identify Biomarkers for Parkinson Disease

Parkinson disease (PD) progresses relentlessly and affects approximately 4% of the population aged over 80 years old. It is difficult to diagnose in its early stages. The purpose of our study is to identify molecular biomarkers for PD initiation using a computational bioinformatics analysis of gene expression. We downloaded the gene expression profile of PD from Gene Expression Omnibus and identified differentially coexpressed genes (DCGs) and dysfunctional pathways in PD patients compared to controls. Besides, we built a regulatory network by mapping the DCGs to known regulatory data between transcription factors (TFs) and target genes and calculated the regulatory impact factor of each transcription factor. As the results, a total of 1004 genes associated with PD initiation were identified. Pathway enrichment of these genes suggests that biological processes of protein turnover were impaired in PD. In the regulatory network, HLF, E2F1 and STAT4 were found have altered expression levels in PD patients. The expression levels of other transcription factors, NKX3-1, TAL1, RFX1 and EGR3, were not found altered. However, they regulated differentially expressed genes. In conclusion, we suggest that HLF, E2F1 and STAT4 may be used as molecular biomarkers for PD; however, more work is needed to validate our result.


Introduction
Parkinson disease (PD) is a common chronic neurodegenerative disorder characterized by selective loss of dopaminergic neurons from the substantia nigra and presence of Lewy bodies [1]. The obvious symptoms are tremor at rest, muscle rigidity, bradykinesia and other movement-related symptoms [2]. PD is difficult to diagnose in its early stages, and when it was diagnosed, the only treatment involved boosting inadequate levels of dopamine in the brain, which did not eliminate all symptoms. Therefore, it is of significantly importance to find molecular biomarkers of PD to improve diagnosis accuracy, monitor disease progression and develop therapeutic interventions.
The etiology of PD remains a puzzling mix of environmental factors, genes and the aged brain [3,4]. Epidemiological research indicates that exposure to pesticides elevates the risk of PD. By contrast, caffeine and tobacco are associated with reduced risk [5]. In recent years, several causative genes of PD have been identified, including a-synuclein (SNCA), parkin (PARK2), UCHL-1 (PARK5), PINK1 (PARK6), DJ-1 (PARK7), LRRK2 (PARK8) and ATP13A2 (PARK9) [6,7]. These PD-linked molecules are candidate biomarkers for PD [8]. Among them, the levels of DJ-1 and a-synuclein in human cerebrospinal fluid and blood between PD patients and non-PD controls are the most frequently tested biomarkers in previous studies; however, the results are conflicting [9,10,11,12,13]. At this stage, neither DJ-1 nor a-synuclein alone appears to be satisfactory as the biological biomarker for PD. Besides, changed levels of Urinary 8-hydroxydeoxyguanosin (Urinary 8-OHdGe) and proinflammatory cytokines such as tumor necrosis factor a (TNF-a), interleukin 6 (IL-6) and interleukin 1b (IL-1b) are also been studied as biomarkers for PD [14,15]. Godau et al. recently showed that the levels of serum insulin-like growth factor (IGF-2) were significantly higher in PD patients than that in controls [16].
The purpose of this study is to identify molecular biomarkers for PD initiation using a computational bioinformatics analysis of gene expression. The availability and integration of highthroughput gene expression data and the computational bioinformatics analysis may shed new lights on molecular biomarker identification of PD.

Affymetrix Microarray Data
The transcription profile of GSE 20333 was downloaded from a public functional genomics data repository GEO (Gene Expression Omnibus) (http://www.ncbi.nlm.nih.gov/geo/). Affymetrix HG-Focus array was used to determine a global gene expression profile of clinically and neuropathologically confirmed cases of sporadic Parkinson disease (n = 6) compared to controls (n = 6). Postmortem human brains were obtained from moderately to severe Parkinsonism individuals based on the Hoehn & Yahr criteria. The average age for PD and control is 76.6 and 77.8 years, respectively. The average postmortem delay for PD and control is 26.2 and 19.8 hours, respectively.

Pathway Data
KEGG (Kyoto Encyclopedia of Genes and Genomes) is one of the most popular pathway databases; it groups genes into pathways of interacting genes and substrates, and contains specific links between genes and substrates that interact directly [17,18]. The PATHWAY database records networks of molecular interactions in the cells, and variants of them specific to particular organisms (http://www.genome.jp/kegg/). We collected pathway information from KEGG on June 30, 2011.

Regulatory Data
UCSC (http://genome.ucsc.edu) is an interactive website offering access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. We downloaded the human transcription factors (TFs) and their target chromosome region from UCSC. Then, we downloaded the chromosome annotation information from NCBI and analyzed the relationships between TFs and their target genes.

Differentially Coexpression Analysis
From the perspective of systems biology, functionally related genes are frequently coexpressed across a set of samples [19,20,21]. Differentially Coexpressed Genes and Links (DCGL) is designed for identifying differentially coexpressed genes and links from gene expression microarray data [22].
For GSE20333, we used the DCGL package [22,23] in R [24] to identify differentially coexpressed genes (DCGs) and links in PD patients compared to non-PD controls. We calculated the p-values and adjusted the raw p-values into false discovery rate (FDR) using the Benjamini-Hochberg method [25] to circumvent the multi-test problem which might induce too much false positive results. The genes only with FDR ,0.25 were selected as differentially coexpressed genes.

Pathway Enrichment Analysis
In order to facilitate the functional annotation and analysis of large lists of genes in our result, we inputted all the DCGs into DAVID (The Database for Annotation, Visualization and Integrated Discovery) for KEGG (Kyoto Encyclopedia of Genes and Genomes) term enrichment analysis. The DAVID identifies canonical pathways associated with a given list of genes by calculating the hypergeometric test p-value for probability that association between this set of genes and a canonical pathway [26]. We chose p-value ,0.05 as the cut-off criterion.

Measures of RIF
Regulatory impact factor (RIF) appears to be a robust and valuable methodology to identify the regulators with the highest evidence of contributing to differential expression in two biological conditions. It is a metric given to each TF that combines the change in coexpression between the TF and the DEGs (i.e. the potential targets). The measures of RIF are computed as follows [27]: where n de is the number of DEGs; e1 j and e2j represent the expression value of the jth DEG in conditions 1 and 2, respectively; r1 ij and r2 ij represent the coexpression correlation between the ith TF and the jth DEG in conditions 1 and 2, respectively.

Identification of Differentially Coexpressed Genes in PD
We downloaded publicly available microarray dataset GSE20333 from GEO database and applied DCGL package in R to identify DCGs in 6 PD patients and 6 non-PD controls. Among all genes tested, we found a total of 1004 DCGs with FDR ,0.25. Besides, a total of 459683 links were predicted among these DCGs.

Enrichment of PD Associated Pathways
In order to functional annotation of the large lists of genes in our result, we used the online biological classification tool DAVID and observed significant enrichment of these genes in multiple KEGG categories (Table 1). Pathway analysis revealed that the DCGs were strongly associated with Ribosome (p = 2.21E-06), and Neurotrophin signaling pathway (p = 1.45E-04). In addition, Steroid biosynthesis, Spliceosome, and NOD-like receptor signaling pathway showed evidence of association with the differentially co-expressed genes (p,0.01).

Regulatory Network Construction
We matched the 1004 DCGs and the 459683 links to the known regulatory data between transcription factors (TFs) and target genes, and obtained a total of 745 pairs of relationships between 82 TFs and 601 target genes. By integrating the regulatory relationships above, we built a regulatory network using Cytoscape [28] (Figure 1).

Impact Analysis of Transcription Factor
The above network generates vast amounts of data. In order to focus on the most meaningful information, we calculated the RIF of each TF. The top 5 ranked TFs are HLF (hepatic leukemia factor), NKX3-1 (NK3 homeobox 1), TAL1 (T cell acute lymphocytic leukemia 1), RFX1 (regulatory factor X, 1) and EGR3 (early growth response 3) ( Table 2). The relationships between these top 5 TFs and their target genes were shown in Figure 2 and Table 3. From Table 3, we could find that HLF, E2F1 (E2F transcription factor 1) and STAT4 (signal transducer and activator of transcription 4) are both TFs and DCGs. Other TFs, such as NKX3-1, TAL1, RFX1 and EGR3, are not DCGs, but their target genes are.

Discussion
Molecular biomarkers are useful to improve diagnosis, to predict clinical behavior and to demonstrate new therapeutic efficacy. Since microarray can interrogate expression levels of thousands of genes in human genome simultaneously, it has been widely used in discovery of disease biomarkers [29,30,31]. In this work, we have analysed gene expression data with computational methods with the aim of uncovering genes that potentially dysregulate in PD. We identified a total of 1004 DCGs in PD patients compared to non-PD controls. After regulatory network construction and regulatory impact factor analysis, we found that the transcription factors HLF, E2F1, STAT4, NKX3-1, TAL1, RFX1 and EGR3 may play important roles in PD initiation. Of these, HLF, STAT4 and E2F1 were found have altered expression levels in PD patients. The expression levels of other transcription factors, NKX3-1, TAL1, RFX1 and EGR3, were not found altered. However, they regulated differentially expressed genes.
HLF encodes a member of the proline and acidic-rihc protein family, a subset of the bZIP transcription factors. Chromosomal translocations fusing portions of this gene with the E2A gene cause a subset of childhood B-lineage acute lymphoid leukemias [32]. While HLF has been linked to malignancies of the lymphoid system, it is detected in the liver, kidney, and adult nervous system by northern blotting [33]. Hitzler et al. found that HLF expression increased markedly with synaptogenesis and was coincident with barrel formation and suggested that HLF plays a role in the function of differentiated neurons in the adult nervous system [34]. HLF appears as the most significant transcription factors related to the differential expression of genes in PD patients.
E2F1 is a member of the E2F family of transcription factors. The E2F family plays a crucial role in the control of cell cycle and action of tumor suppressor proteins and is also a target of the transforming proteins of small DNA tumor viruses. Several studies have demonstrated that E2F1 contributes to neuronal damage and  death using in vitro models of neurodegeneration [35,36,37]. E2F1 immunoreactivity and/or protein levels were reported to increase in neurons of patients with PD [38]. They showed that pRb/E2F pathway is activated in dopaminergic neurons in PD, but also demonstrated that activation of this pathway is instrumental in the degeneration of these neurons in the MPTP/MPP+ model of the disease [38]. In a recent study, Lu and his colleagues showed that mutations in LRRK2 cause PD through inhibiting the translational repression of the transcription factors E2F1 and DP [39]. STAT4 is a transcription factor belonging to the signal transducer and activator of transcription protein family [40]. STAT4 is involved in the signaling of interleukin-12 and interferon -c, as well as interleukin-23 [41]. Though we found STAT4 was differentially expressed in PD patients compared to non-PD controls, the gene has no known role in PD pathogenesis to data.
From Table 1, we could find that the most significant enriched pathway is ribosome which is responsible for catalyzing the formation of proteins from individual amino acids. Besides, some pathways associated with protein synthesis were also enriched in the result, such as ribosome, steroid biosynthesis, and spliceosome. This result suggests that biological processes of protein turnover were impaired in PD. Our result is in line with previous study [42,43].
In conclusion, we have identified molecular biomarkers for PD initiation using a computational bioinformatics analysis of gene expression. A total of 1004 differentially coexpressed genes were identified between PD patients and non-PD controls. Pathway enrichment of these genes suggests that biological processes of protein turnover were impaired in PD. After regulatory network construction and regulatory impact factor analysis, we found that the transcription factors HLF, E2F1, STAT4, NKX3-1, TAL1, RFX1 and EGR3 may play important roles in PD initiation. Of these, HLF, STAT4 and E2F1 were found have altered expression levels in PD patients. Therefore, we suggested that HLF, E2F1 and STAT4 may be used as biomarkers for PD; however, more work is needed to validate our result.

Author Contributions
Conceived and designed the experiments: HD. Performed the experiments: XL. Analyzed the data: HD XL. Contributed reagents/materials/analysis tools: SH. Wrote the paper: YL.