Assembly of a Comprehensive Regulatory Network for the Mammalian Circadian Clock: A Bioinformatics Approach

By regulating the timing of cellular processes, the circadian clock provides a way to adapt physiology and behaviour to the geophysical time. In mammals, a light-entrainable master clock located in the suprachiasmatic nucleus (SCN) controls peripheral clocks that are present in virtually every body cell. Defective circadian timing is associated with several pathologies such as cancer and metabolic and sleep disorders. To better understand the circadian regulation of cellular processes, we developed a bioinformatics pipeline encompassing the analysis of high-throughput data sets and the exploitation of published knowledge by text-mining. We identified 118 novel potential clock-regulated genes and integrated them into an existing high-quality circadian network, generating the to-date most comprehensive network of circadian regulated genes (NCRG). To validate particular elements in our network, we assessed publicly available ChIP-seq data for BMAL1, REV-ERBα/β and RORα/γ proteins and found strong evidence for circadian regulation of Elavl1, Nme1, Dhx6, Med1 and Rbbp7 all of which are involved in the regulation of tumourigenesis. Furthermore, we identified Ncl and Ddx6, as targets of RORγ and REV-ERBα, β, respectively. Most interestingly, these genes were also reported to be involved in miRNA regulation; in particular, NCL regulates several miRNAs, all involved in cancer aggressiveness. Thus, NCL represents a novel potential link via which the circadian clock, and specifically RORγ, regulates the expression of miRNAs, with particular consequences in breast cancer progression. Our findings bring us one step forward towards a mechanistic understanding of mammalian circadian regulation, and provide further evidence of the influence of circadian deregulation in cancer.


Introduction
Altogether, our findings suggest, new potential clock genes and describe their role and topology within the circadian network. Our work delivers novel evidence to the influence of circadian deregulation in cancer and adds a novel way via which a clock-dependent cancer output may emerge, i.e., miRNA circadian regulation. Work flow used to establish a network of circadian regulated genes (NCRG). Two independent data types were used to predict genes which interact with the human core clock network (CCN, orange) and the extended core clock network (ECCN, green). Co-expression data was used to find sets of genes with strongest (anti-) correlating expression with the 43 ECCN genes across a large number of independent experiments. A total of 2357 genes were found to interact with more than 2 ECCN genes. The GeneView text-mining pipeline was used to analyse published knowledge (approximately 22 million citations) about interacting genes. A total of 961 text-mining-predicted genes were found. The intersection of both methodologies resulted in 118 new genes, which together with the ECCN form a new network of circadian regulated genes (NCRG, purple). Results A text-mining based approach for network discovery We aimed to update and extend our recently reported circadian network (ECCN-extended core-clock network) [21]. For that we combined the results of a text-mining system with highthroughput gene co-expression data to obtain new elements and interactions. This procedure resulted in a network of circadian regulated genes (NCRG) following the workflow schematized in Fig 1. The original ECCN [21] contains a core of 14 well known circadian genes, Per1,2,3, Cry1,2, Bmal1,2, Rorα,β,γ and Rev-Erbα,β as well as Clock, its paralog Npas2, and their direct neighbouring targets. We started our study by generating an update version of the ECCN using the text-mining software-GeneView (see Materials and Methods) [23] to extract all pairwise interactions among our genes of interest and their directly interacting neighbours. The new ECCN contains 43 elements as the previous network [21], the depicted interactions were updated to the current PubMed available data resulting in more than 200 regulatory relationships (Fig 2). Additional information containing all interactions and corresponding references, as well as a more detailed characterization of the ECCN is provided in S1 Table and S1 Text, respectively.

Co-expression data analysis confirms the ECCN network topology
In this work we expanded the updated ECCN with a new layer of potentially ECCN-regulated elements (genes and proteins) using co-expression data as a first source of evidence. We consider as such candidates all genes which show a strong co-expression to ECCN members and which can be confirmed using text-mining, as indicated in Fig 1. There are several public available databases providing co-expression metrics for human genes, we evaluated four different such databases [24][25][26] regarding their ability to reproduce the ECCN, to find the best suited for our analysis. Results of our comparisons are presented in S2 Text. We eventually choose COXPRESdb [25] (see Fig 3), as this database showed the highest degree of correlation within all genes of the extended core-clock network.
We expected a significant difference between the correlation measure distributions of interacting gene pairs and a chosen background, if unknown interacting pairs were to be predicted based on correlation values. All possible pairs between a random set of 43 genes and all genes (19,788) were used as background. As foreground pairs we used the 42 known available interactions amongst the CCN genes, as well as the 119 curated interactions of the ECCN gene set. Both the CCN (orange) and ECCN (green) gene pairs tended to have higher correlations compared to the random background and thus lower mutual rank (MR) values (Fig 3A and 3B and S1 Fig). The probability density functions of correlation and mutual rank are shown in S2 Fig for both datasets. All correlation values were Fisher transformed to ensure normal distribution prior to hypothesis testing to characterize differences between the CCN, ECCN, and the background. Subsequent one-sided t-tests with the alternative hypothesis to observe smaller correlations in the background confirmed the results of the visual inspection: CCN gene pairs are significantly higher correlated than the background pairs (p < 0.0195) similar to the ECCN gene pairs (p < 6e -7 ). Similarly significant differences were observed in the Hsa dataset for the CCN (p < 1.4e -3 ) and the ECCN (p < 1.2e -7 ). Furthermore, no significant difference was found between correlations in the CCN and the ECCN for the Hsa (p < 0.41) and the Hsa2 dataset (p < 0.18).
Next, we tested whether the distribution of correlations between pairs of reported interacting ECCN genes could be distinguished from all other possible pairs of ECCN genes. The known interactions exhibited positive ρ HSA2 values and thus lower MR (Fig 3C and 3D). However, only a weak tendency of known interactions towards higher correlations compared to The human core clock network (CCN) and the extended core clock network (ECCN). The CCN (orange) contains the known core-clock elements (Per1,2,3, Cry1,2, Rev-Erbα,β, Rorα,β,γ, Bmal1,2, Clock and Npas). The ECCN (green) was obtained after an extensive collection of CCN-interacting genes followed by a detailed curation for direct interactions [21] and a further update to the recent literature. Activation (green lines), inhibition (red lines) and other sort of interactions (grey lines) are represented. The resulting clock network contains 43 elements and more than 200 regulatory relationships. non-reported pairs can be observed (probability density functions shown in S3 Fig). The corresponding comparisons of correlation distributions via t-tests confirmed the weak tendency in the Hsa2 dataset (p < 0.059) whereas no signal was found in the Hsa dataset (p < 0.395). As a consequence, we conclude that known interacting gene pairs cannot reliably be distinguished Comparison of Pearson ρ and mutual rank (E, F) between all possible pairs of ECCN genes (green) and all possible pairs of non-ECCN genes as background (black). All data were taken from the Hsa2 data collection. from other pairs within the ECCN based merely on expression correlation. We then tested the assumption that expression patterns are generally higher correlated within the ECCN as compared to other non-ECCN gene pairs. The resulting distributions for Pearson ρ and mutual rank between all possible pairs of the ECCN genes included in the datasets, as compared to all combinations of the same genes with all other genes are shown in Fig 3E and 3F (probability density functions shown in S4 Fig). We confirmed the difference of the ECCN set as compared to the background set as before with t-test which yielded high significance in both datasets (p < 2.2e -16 ). In addition, the corresponding mutual rank measures were also found to be significantly lower (Wilcoxon Rank Sum test, p < 2.2e -16 ). Hence, we concluded that the examined expression correlation data provided information about the membership of a gene to the clock network, but not about the network's topology.

Expression correlation-based target prediction
We selected the 10.000 highest correlating pairs of one of the 43 ECCN genes and any other gene. This conservatively chosen threshold selects 1.18% of all pairs, corresponding to an absolute correlation cut-off of 0.3636, or 2.6 σ. In this set, the number of unique new genes was 4.183. As we sought to investigate genes that were tightly associated with the ECCN (i.e. associated with multiple ECCN genes). We defined tightness as the number of connections between a gene and the ECCN and sought to find the largest number of tightly connected predicted targets by filtering them at several levels of minimum tightness. At increasing levels of minimum tightness, we performed an overrepresentation analysis of GO and KEGG terms for varying minimal values of these counts. The largest changes in overrepresented terms occurred when changing the threshold from one to two (Fig 4). There, terms related to cell cycle and the Variation in functional annotation enrichment with increasing tightness between the predicted targets and the ECCN. The significance of 28 enriched GO terms (A) and 16 KEGG terms (B) for genes connected to the CCN steadily decreases for most terms as the minimum number connections is increased (this number of connections between a gene and a gene set is here defined as "tightness"). As the minimum tightness between the predicted targets and the ECCN increases, the enrichment and rank of functional annotation changes. We observe an overall decrease in enrichment but little in rank. The greatest changes in rank occur between a tightness of 2 and 3. At a tightness of two and above, the rank of the majority of significant GO terms such as "mitotic cell cycle" and "nuclear mRNA splicing, via spliceosome", and KEGG terms such as "Spliceosome", "Ubiquitin mediated proteolysis" and "RNA degradation" remain largely stable suggesting a natural threshold on tightness at this point. ribosome rapidly dropped in significance, while, terms related to splicing and transcription largely retained their position. When increasing the tightness further, much smaller changes occurred.
Accordingly, we defined tightly connected genes as those having two or more associations with the ECCN, which reduced the number of predicted ECCN targets from 4,183 to 2,357 with 8,180 interactions (S5 Fig).
The number of tightly connected genes associated to a given ECCN element varied greatly. While 11 ECCN members did not feature any interaction, the three elements CREB, AMPK, and CLOCK covered 48% of all predicted interactions (S6 Fig).
Gene ontology terms (GO) and KEGG pathway enriched in this set are listed in Table 1 and Table 2, respectively. To obtain an insight into which ECCN genes are of particular importance for the enriched function or pathway, we determined the cross table for each combination of ECCN gene versus enriched term, counting the number of predicted target genes featuring the corresponding term. This approach yielded a consistent pattern for GO functional annotations (50 terms with q < 0.01) and KEGG pathways (35 pathways with q < 0.01) (Fig 5). About half of the ECCN genes were associated with a multitude of genes covering a range of GO annotations, while the other half was associated with few genes covering only a small number of GO terms ( Fig 5A). The largest number of target genes was annotated with the molecular function "protein binding" and the cellular component "nucleus". The second-strongest molecular function signal was "DNA binding". The most striking association was found between the genes Csnk2a, Wdr5, Nono, and Parp-1 and the spliceosome (q < 1.5e -37 ) and RNA transport (q < 8e -38 ) pathways, where q represent the p-value adjusted by Benjamini-Hochberg multiple testing correction. These genes were also predicted to target ribosome biogenesis, cell cycle, and purine/pyrimidine synthesis related genes. Another strong association was found between cancer-related pathways such as "Pathways in cancer" (q < 7e -7 ),"Wnt signalling" (q < 2.8e -8 ), "MAPK signalling" (q < 4e -6 ), and the Ampk and Creb target genes ( Fig 5B).

An extended network of circadian regulation: beyond the core
We used text-mining to obtain a second set of genes potentially regulated by the ECCN, and then compared this set to the 2357 genes obtained from co-expression analysis (Fig 1). First, we obtained from GeneView the 50 most frequent interaction partners for each ECCN element, resulting in 961 new interacting genes, each supported by 55 sentences on average. These genes and their supporting sentences are given in S2 Table. The analysis of a large set of GeneViewoutput sentences revealed 20% of wrong sentences which corresponded to 10% false-positive interactions. Again, we subjected this gene set to enrichment analysis. A large number of significantly enriched annotations were observed in the analysis of GO terms (154 terms with q < 0.01) and KEGG pathways (115 pathways with q < 0.01) (S3 and S4 Tables). The top 4 GO terms (q < 7.6e -18 ) included positive and negative regulation of transcription from RNA polymerase II promoters (GO:0045944, GO:0000122), indicating a large fraction of transcription regulatory genes in this set. The term "anti-apoptosis" was listed on the 7 th position (q < 7.5e -13 ) with 64 annotations found, where only 16 are expected by chance. The top-three enriched KEGG annotations were "Pathways in cancer" (q < 1.6e -92 ), "Cytokine-cytokine receptor interaction" (q < 3e -34 ), and "Toll-like receptor signalling pathway" (q < 2.6e -35 ), with a range of cancer-related pathways following.
Intersecting the ECCN-interacting gene sets predicted by expression correlation (n = 2357) and text-mining (n = 961), respectively, resulted in a set of 118 genes ( Fig 6A). While 38 novel interactions with an ECCN gene were predicted by both methods, 364 interactions were co-expression-specific and 182 were text-mining-specific (S5 Table). Interestingly, enrichment Only terms with q < 0.01 (false discovery rate after Benjamini-Hochberg) are shown. analysis of the 118 target genes using KEGG annotations indicated a strong connection to signalling-and cancer-related pathways ( Fig 6B). The GO enrichment yielded the terms "telomere maintenance" and "peptidyl-serine phosphorylation" as significantly enriched biological processes ( Fig 6C). The molecular function "ligand-dependent nuclear receptor binding" was also found to be significantly enriched (q < 0.0009). Finally, we used this intersection of the text-mining analysis and co-expression analysis to extend the ECCN, resulting in a novel network of circadian regulated genes (NCRG) comprising 161 genes all together (Fig 7). An additional 220 interactions between the ECCN and the new NCRG were found amongst the text-mining dataset and 402 interactions within the co-expression data. The number of correlation-based interactions is less informative because, as we have shown above it is not a precise method to infer network topology. Since this assessment was derived from a mixture of various tissue types, the NCRG can be expected to be an aggregation of different tissue-specific interactions.
Additionally, we were interested in the possible consequences of perturbing the newly identified genes in the circadian phenotype and checked whether any of the 118 predicted ECCN-interacting genes were found to cause perturbations on the circadian clock in available siRNA datasets [28,29]. We found hits for different circadian phenotypes a) long-period phenotype: Csnk1a1 (casein kinase 1, A 1), Mapk8 (mitogen-activated protein kinase 8), Ncl (nucleolin); b) high-amplitude phenotype: Ddb1 (damage-specific DNA binding protein 1); and c) short-period phenotype: Cops2 (COP9 signalosome subunit 2). Among these, Ncl and Cops2 also showed a circadian expression pattern. Ncl yielded a JTK q-value of 6.16e-06, a period of 24h, and a phase of 18.5. Cops2 yields a p-value of 0.007, a period of 28h, and phase 2.5. These findings are summarized in Table 3.   x Genes marked in bold were found to be BMAL1 targets.
All 62 genes (of 118) are shown which exhibit at least one of the following properties: regulated by REV-ERB or ROR, circadian expression pattern, causing a clock phenotype upon RNAi knockdown, predicted as similar to known clock genes [29], and featuring an OMIM annotation. doi:10.1371/journal.pone.0126283.t003 We further compared our findings with a recent list of 1000 genes classified as-"sufficiently similar"-to known clock genes by a machine learning approach on a combination genomescale datasets from mouse fibroblast cell lines [29]. One quarter of these genes were also contained in at least one of our gene sets (253 of 993 with a homolog in the human genome), and 10 genes were also detected by our text-mining and co-expression analysis: Atf2, Ddx6, Dhx9, Elavl1, Hspa4, Ncl, Nme1, Med1, Rbbp7, Dnm1l. Out of these, the four genes Ddx6, Dhx9, Ncl, and Dnm1l exhibit a circadian expression pattern.

Clock target genes could be validated with ChIP-seq data
To further validate our 118 consensus genes gained from the bioinformatics approach, we examined the publicly available ChIP-seq datasets for REV-ERBα/β [16,17]. Additionally, a BMAL1 dataset [12] was considered. ChIP-seq peak locations were used to calculate an association score ("ClosestGene" [30]) for each gene to the corresponding transcription factor. Simple threshold calculation then yielded a TF-target prediction. The gene association score S g,tf was calculated for all annotated refSeq genes of the mouse genome build used in the corresponding experiment. The resulting log2 transformed S g,tf distributions are shown in S7 Fig. The threshold for accepting a TF-gene association was chosen as 3, which yields the higher second gene-score peak in case of the bimodal REV-ERBβ peak set, or the prominent right shoulder of the distribution for all other peak sets (S7 Fig). A total of 3847 predicted REV-ERBα and 3388 REV-ERBβ target genes [16] were found. The alternative dataset provided 4618 target genes associated with REV-ERBα/β unspecific peaks [17]. Lastly, this procedure yielded 223 significant BMAL1 target genes [12]. Since the ChIP-seq peak location data for RORα and γ, were not accessible, we relied on the list of predicted targets provided by the authors based on a less stringent target prediction method [22,31].
Overall, we obtained a set of 118 genes potentially regulated by the ECCN. Of those, 19 exhibited circadian expression patterns, 5 exhibited phenotypic changes in the clock when targeted with RNAi, 59 were targeted by REV-ERBα/β, and 14 were targeted by RORα or γ. Additionally, the two NCRG genes Ddb1 and Mapk8 were found to associate with BMAL1 binding sites. These findings are summarized in Table 3 and depicted in Fig 8, (see S7 Table for all annotations).

Discussion
The mammalian circadian clock is an endogenous, time-generating system with the peculiarity of synchronizing and propagating time-cues to the entire organism. Its relevance in the timedependent regulation of biological processes has been shown at the organismal and cellular levels. As such, it is of no surprise that malfunctions of the circadian system were found to be associated to pathological phenotypes including obesity, sleep disorders and increasing incidence of cancer. The prospect of using individual patient-timing, based on the internal circadian clock, for therapy optimization is being explored with promising results. For instance, advances in chronotherapy have proven to be efficient in reducing toxicity and increasing efficacy in some types of cancer, particularly colon cancer [32]. A more detailed knowledge of the circadian network including the pathways it regulates is of major importance for the analysis on how time effects may be propagated and to determine the time-dependent action of certain drugs.
In this work we set up to dissect such clock-regulated pathways and to analyse the extent of circadian regulation at the cellular level by expanding the core circadian network to its potential target genes. We used human high-throughput transcriptome-data sets associated to textmining of biomedical literature, for de novo clock regulated gene discovery.

A network of circadian regulation: combining independent evidences
Gene co-expression has previously been used to predict gene functions. Such works rely on the Pearson correlation coefficient and extensions of it and, although able to predict gene functions in mammals, are limited in terms of de novo network generation [24,25]. We observed that reportedly interacting ECCN genes feature correlation values which are similar to non-reported. This is a limitation of co-expression methodologies and the problem of erroneous transitive  [16,17]. The ROR α/ γ interactions (purple lines) were adopted from the report of a third ChIP-seq experiment [22]. Genes with an observed phenotype in the genome-wide RNAi screen [28,29] are shown with a coloured box, red indicating long period, blue a high amplitude, and green a short period. links inferred by correlation analysis was described before [33]. Therefore, we used a hybridmethodology where to the expression correlation data we associated the text mining as an independent source of knowledge, enabling us to find regulated genes and their connection to the ECCN with increased confidence (Fig 1). This allowed us to partially overcome the limitations of expression analysis in terms of network topology and to be able to generate a semi-regulatory network for the mammalian circadian clock. Still, we do not analyse tissue-specificity issues which go beyond the scope of this work. Nevertheless, the circadian clock has been reported, in mammals, to be present in all cells so that the core network is expected to be very similar [34]. The output genes in the large network might indeed show tissue-specific differences which will be very interesting to explore in future work.

Biological significance and impact in tumourigenesis
The detailed analysis of the network generated by our pipeline (NCRG) strengthens previous findings which associate the circadian clock to regulation of several molecular processes such as mRNA processing, cell division, cell cycle progression and DNA repair [19,21,[35][36][37][38][39][40]. Particular pathways, including RNA transport, splicing and several cancer related pathways were identified by our study as being significantly associated with the circadian clock, highlighting the important function of the circadian system in the regulation of cellular processes. By comparing the difference in overrepresented terms between genes tightly-and those loosely-associated to the ECCN, we found that cell cycle and translation related terms are highly significant in loosely associated genes in comparison to tightly associated genes. We also found that splicing remains a highly over-represented term regardless of tightness (Fig 4). Together with the enriched biological processes such as "DNA-dependent regulation of transcription" and "gene expression", it became clear that the co-expression based predicted ECCN target gene set has a stout emphasis on cellular signalling, transcriptional regulation, and cancer (Fig 5). Furthermore, several members of the predicted set of ECCN target genes are associated with Mendelian diseases as listed in the Online Mendelian Inheritance in Men dataset (OMIM) (S6 Table). 30% of the correlation/text-mining consensus genes featured such an annotation (35 of 118), pointing to the role of the circadian clock in pathogenesis.
In particular, among our top candidate genes is a group of genes associated with tumourigenesis (see Table 3): Elavl1 is known to be highly expressed in several cancers and potentiates a characteristic pro-inflammatory profile of some immunological and non-immunological diseases [41], Nme1 is considered a tumour suppressor and its expression is reduced in metastatic cancers [42], Dhx6 belongs to the DEAD box helicase superfamily and is involved in DNA repair, Med1 regulates p53-dependent apoptosis [43] and Rbbp7 interacts with the tumour-suppressor gene Brca1 [44] and may have a role in the regulation of cell proliferation and differentiation.
Remarkably, we found a subset of nine genes (Apoh, Ifnar1, Sp1, Narg2, CALU, EEF1A1, RBM14, Spag5, Med1) which are targets of both REV-ERB and ROR according to Chip-Seq experiments. These two nuclear receptors are known to bind RORE elements within the promoter regions of target genes: while REV-ERB is an inhibitor, ROR acts as an activator. APOH (Apolipoprotein H) and IFNAR1 (Interferon Alpha, Beta, Omega Receptor) are involved in immune disorders [45,46]. SP1 (Sp1 transcription factor) is also involved in immune response and in many other cellular processes, including cell differentiation, cell growth, apoptosis, response to DNA damage, and chromatin remodelling [19]. NARG2 (NMDA receptor regulated 2) is associated to breast cancer [47], and Med1 regulates p53-dependent apoptosis [43] and was found to be mutated in human carcinomas with microsatellite instability [48]. The eukaryotic translation elongation factor EEF1A1 was recently shown to mediate the alternative caspase-independent cell death mechanism induced by genetically unstable tetrapolidy [49]. The sperm associated antigen 5 (SPAG5) was found to be associated with various types of cancer, such as cervical cancer and breast cancer [50]. Circadian regulation of these genes and as such of the processes they regulate could be achieved via a fine-tuning of ROR/REV-ERB.
Two other circadian regulated genes identified by our study are nucleolin (Ncl) and Ddx6. The analysis of ChIP-seq data identified these genes as targets of RORγ and REV-ERBα, β, respectively. Interestingly, they were also reported to be involved in miRNA regulation [51][52][53]. DDX6 (RNA helicase) is found in p-bodies for mRNA degradation, needed for miRNAmediated silencing. NCL regulates several miRNAs including miR-21, miR-221, miR222 and miR-103. miR-21 is defined as an oncogene and found to be overexpressed in most tumour types [51,[54][55][56][57][58][59], whereas miR-221 and miR222 show an increased expression in human breast cancer [60,61]. Also, miR-222 was shown to promote resistance of cancer cells to cytotoxic T lymphocytes [62]. Interestingly, miR-103 which is also a target of NCL was reported to exhibit circadian pattern [63].
Altogether, our data allowed the generation of a large network of circadian regulation. The network was retrieved from human expression data intersected with text-mining of the biomedical literature, for topology refinement and de novo target identification. The novel predicted targets of the circadian clock network showed a remarkable association to cancer driving mechanisms. One of these mechanisms is miRNA regulation. Very recent studies point to an influence of miRNAs on the circadian clock [64][65][66][67][68][69][70][71], but only a few links on the regulation of miRNAs via the circadian clock have been described [69]. NCL represents a potential novel link via which the circadian clock, in particular RORγ, regulates the expression of miR-NAs, with particular consequences in cancer progression.

Methods Preprocessing
For all text-mining steps we used articles from PubMed and PubMed Central open access subset.

Named entity recognition
Genes: For gene name recognition and normalization we used the GNAT library [72]. GNAT uses custom dictionaries and conditional random fields (CRF) for gene name recognition and subsequently normalises gene mentions to Entrez Gene ID's. The system is ranked among the first in several critical evaluations [73,74] and achieves, according to these assessments, a precision of 82% and recall of 82% for abstracts and 54/47% for full-text articles.

Relation extraction
GeneView (a search engine which uses a comprehensively annotated database of all PubMed abstracts and 270,000 full texts from the open PubMed Central corpus) uses the shallow linguistic kernel [75] and LibSVM for relationship extraction between proteins. The model is trained on the ensemble of five publicly available training corpora [76]. This kernel achieved very good results in a comprehensive evaluation of nine machine learning kernels for PPI extraction from text [77][78][79]. Furthermore, is does not use dependency information and thus is very fast, a pre-requisite for usage in a large system such as GeneView. Data contained in Gene-View is available at http://bc3.informatik.hu-berlin.de/. To account for species specificity, we mapped mammalian gene identifiers to Homologene clusters [80]. To test the efficiency of text-mining in contributing to new network generation, we first evaluated its ability to reconstruct a previously designed network of clock-controlled genes (CCGs) containing 121 interactions among 41 different proteins [19]. We used GeneView to extract all pairwise interactions. GeneView contained evidence for 73% of all interactions described in the network tested. The high sensitivity of the method encouraged us to further develop our pipeline in order to ascertain potential new elements and interactions. We further used GeneView to collect all interactions among the CCN and its directly interacting neighbours. After curation and filtering for direct interactions, we enriched the core-clock network with 108 novel interactions supported by 132 PubMed references, which led to the extended core-clock network (ECCN) recently reported [21]. For the ECCN, each candidate interaction is supported by up to 851 sentences (in total 4,206 sentences). We reduced the number of sentences to 580 by ranking them by confidence and returning only 5 sentences at maximum for each candidate. Sentences containing potentially novel PPI were ranked by the confidence of the classifier (ie. distance to the hyperplane) and were subsequently evaluated.

Predicting interactions using coexpression data and overrepresentation of associated gene terms
Each dataset was assessed on the number of genes they share with the ECCN and how well the correlation coefficient distributions of known ECCN gene interactions were separated from a background distribution of all genes, where the Wilcoxon Rank Sum test was used for quantification. For more details on the dataset properties and selection, see S2 Text.
To find associated genes based on the correlation coefficients, we selected the 10000 highest correlations between any ECCN gene and a non-ECCN gene as predicted interactions, thereby considering the 1.18% most extreme correlation values.
We sought to detect and characterize only genes that were tightly associated with the ECCN, where "tightness" was defined as the number of connections between a gene and a set of genes. Accordingly, the comparison of the number of predicted NCRG with required tightness 1 to 10 shows the most drastic decline between 1 and 2, which quickly diminishes with rising tightness values (Fig 4). We therefore chose to employ a tightness threshold of 2 for the remaining analysis. We then proceeded to find the overrepresented terms and enriched clusters using the R package TopGO [81]. We annotated the associated genes with terms from the Genetic Association Database [82], Online Mendelian Inheritance in Man database [83], Swissprot Protein Information Resource [84], Gene Ontology [85], Pubmed and Kyoto Encyclopedia of Genes and Genomes [86]. Significant overrepresentation was determined using p-values corrected by Benjamini-Hochberg multiple testing correction (q-values).

Integration of the predicted NCRG with transcriptional features
We compared our NCRG prediction with the machine learning based prediction of clock genes [29]. Therefore, we retrieved the top 1000 genes as of the evidence factor ranks and used the HomoloGene database build 66 [80] to map the reported mouse genes to 993 unique entrez genes, could then be compared to our predicted genes set.
Similarly, we tested how many of the NCRG are amongst the genes with circadian expression regulation according to recent publications [14,27]. After combination of both lists of mouse genes, a total of 1771 unique entrez transcripts were obtained for comparison after mapping via HomoloGene build 68.
An extensive collection of genes which lead to circadian clock phenotypes upon knockout via RNAi has been described recently [28]. The reported 343 genes are categorized into double hitters, i.e. two different pairs of siRNAs lead to a circadian clock phenotype, and single hitters, for which only one of the two siRNA pairs designed for each gene lead to a phenotype, where amplitude-and phase-changes were considered as phenotype.

ChIP-seq data analysis
We employed the R package TFTargetCaller [30] to derive target gene sets for clock-related transcription factors from experimental Chip-seq data using the method "ClosestGene". We used available data sets to extract target genes for REV-ERB α/β [16,17] and for BMAL1 [12]. These include all available Chip-seq data sets for core-clock genes. Specifically, the genomic peak locations were obtained, and the gene association score S g,tf was calculated for all annotated refSeq genes of the mouse genome build used in the corresponding experiment. The resulting log2 transformed S g,tf distributions are shown in S7 Fig. The threshold for accepting a TFgene association was chosen as 3, which yields the higher second gene-score peak in case of the bimodal REV-ERB β peak set (S7B Fig), or the prominent right shoulder of the distribution for all other peak sets. Since the genomic locations of the peaks for the ROR α/γ dataset were not available, we used the predicted target list provided by the authors [22,31]. Association strength scores S tf,g between the core clock transcription factors REV-ERB α/β and BMAL1 and all refSeq genes annotated in the corresponding genome version (mm8 [17] or mm9 otherwise) were calculated using the "ClosestGene" method of the R package TFTargetCaller and the ChIP-seq peak annotations. The number of genes with S tf,g > 0 is shown as n gene , the cutoff for accepted TF-gene association was set to 3 as marked with a red dashed line, and the number of accepted target genes is shown as n target for each individual dataset. (EPS)   Table. List of GO term annotations enriched amongst the ECCN target genes predicted by text-mining. The table provides the term ids ("GO.ID"), the corresponding "Term", as well as the overrepresentation p-value after false discovery rate correction after Benjamini-Hochberg ("pval"). In addition, the total number of genes annotated with the respective term is provided ("Annotated"), the number of significant annotations ("Significant"), along with the number of annotations expected by chance in the gene set ("Expected"). (XLSX) S4 Table. List of KEGG pathway annotations enriched amongst the ECCN target genes predicted by text-mining. The table provides the pathway ids ("kegg.id"), the corresponding pathway "name", as well as the overrepresentation p-value before ("pval") and after false discovery rate correction after Benjamini-Hochberg ("fdr"). (XLSX) S5 Table. List of all newly predicted interactions. The columns "gene1.entrez", "gene2. entrez", "gene1.symbol", and "gene2.symbol" provide the Entrez ids and gene symbols for the two predicted interacting genes, respectively. The Boolean flags "txtmn" and "coxp" indicate interactions predicted by text-mining, and co-expression, respectively. "with.consensus" indicates interactions involving one of the 118 consensus genes, and "overlapping" indicates the interactions predicted similarly by text-mining and co-expression. (XLSX) S6 Table. Characterization of the 118 predicted NCRG regarding disease-related annotation. Annotations from OMIM, gad, and the KEGG database were integrated in addition to SNPs. Genetic association database (gad), KEGG pathway annotation, traits that significantly associate with the gene as provided in the Ensemble database (gwas_trait, gwas_pval, gwas_-pubmed_id), the number of non-synonymous SNPs found in the gene (nonsyn_count, non-syn_norm), the Uniprot database derived protein domains (up_seq_feature), the Gene Ontology biological processes annotations (goterm_bp), and the Gene Reference into Function (generif). (XLS) S7 Table. Characterization of the 118 NCRG regarding expression regulation by transcription factors. The transcription factor target prediction of all 118 NCRG for both REV-ERBα/β datasets, the BMAL1, and also the RORα/γ dataset is provided. Additionally, the phenotype upon gene knockdown observed [28] and the prediction of "similar to clock gene" [29] are included. The observed circadian expression and OMIM annotations are indicated. (XLS) S1 Text. Text-mining-based assembly and characterization of the ECCN network.