A Functional and Regulatory Network Associated with PIP Expression in Human Breast Cancer

Background The PIP (prolactin-inducible protein) gene has been shown to be expressed in breast cancers, with contradictory results concerning its implication. As both the physiological role and the molecular pathways in which PIP is involved are poorly understood, we conducted combined gene expression profiling and network analysis studies on selected breast cancer cell lines presenting distinct PIP expression levels and hormonal receptor status, to explore the functional and regulatory network of PIP co-modulated genes. Principal Findings Microarray analysis allowed identification of genes co-modulated with PIP independently of modulations resulting from hormonal treatment or cell line heterogeneity. Relevant clusters of genes that can discriminate between [PIP+] and [PIP−] cells were identified. Functional and regulatory network analyses based on a knowledge database revealed a master network of PIP co-modulated genes, including many interconnecting oncogenes and tumor suppressor genes, half of which were detected as differentially expressed through high-precision measurements. The network identified appears associated with an inhibition of proliferation coupled with an increase of apoptosis and an enhancement of cell adhesion in breast cancer cell lines, and contains many genes with a STAT5 regulatory motif in their promoters. Conclusions Our global exploratory approach identified biological pathways modulated along with PIP expression, providing further support for its good prognostic value of disease-free survival in breast cancer. Moreover, our data pointed to the importance of a regulatory subnetwork associated with PIP expression in which STAT5 appears as a potential transcriptional regulator.


Introduction
Breast cancer is one of the most common malignancies in Western countries and is associated with a high mortality rate [1,2]. Aside from a small subset of patients (,5%) with inherited genetic alterations, sporadic breast cancer accounts for the majority of all breast cancers and limited knowledge is available about the underlying process of carcinogenesis. It is widely accepted that breast cancer, like most other cancers, develops through the accumulation of genetic aberrations [3]. Some of these changes involve specific genetic loci, determining the activation of oncogenes or the inactivation of tumor-suppressor genes, while others confer genetic instability, which increases the possibility of acquiring additional genetic lesions relevant to tumorigenesis. In the last decades, PIP protein expression has been proposed as a specific and sensitive marker for breast cancer [4][5][6][7] and further used to support breast origin in metastatic carcinoma of unknown primary origin [8][9][10][11][12]. A PIP over-expression was shown in primary and metastatic breast cancers [13,14], as well as in some breast carcinoma cell lines. However, the exact functions of that protein in mammary tumor progression remain unclear.
In previous work, we reported preliminary conclusions on the PIP properties showing that the protein, a secreted factor known as prolactin-inducible protein (PIP) [13] or as gross cystic disease fluid protein-15 (GCDFP-15) [15], binds to CD4 [16][17][18], exerts a potent inhibition on T lymphocyte apoptosis mediated by CD4/ T-cell receptor (TCR) activation [19] and carries a fibronectinspecific aspartyl protease activity [20]. In addition, the PIP gene localized on the long arm of chromosome 7 at 7q34 [21] was found to display a variety of rearrangements in numerous solid tumors [22,23]. Interestingly, we found that the T47D cell line, that constitutively overexpresses PIP, exhibits an inverted duplication of the 7q34-q35 region containing the PIP gene resulting from a breakage-fusion-bridge (BFB) cycle mechanism initiated within the common fragile site FRA7I [24].
Here, we report an in-depth exploration of the functional and regulatory networks associated with PIP gene expression in breast carcinoma cell lines using DNA microarray-based gene expression profiling techniques. Taking advantage of the presence of androgenresponsive elements in the PIP gene promoter, breast carcinoma cell lines were analyzed before and after treatment with dihydrotestosterone to modulate PIP expression, allowing comparison between the PIP-expressing [PIP+] and -non expressing [PIP2] cell profiles. Thus, we identified a series of 205 genes that display significant expression changes between the [PIP+] and [PIP2] subgroups of samples. A representative part of these genes exhibited a good concordance of expression changes when assessed using tailored Q-PCR. A network analysis allowed us to propose that PIP gene expression is mainly associated with a decrease of the cell proliferation and migration potential, as well as with an increase of the apoptotic pathway. In addition, the identification of specific STAT5 (Signal Transducer and Activator of Transcription 5) motifs found within promoters of a significant part of the differentially expressed genes suggests that STAT5 could play an important role in the regulatory network associated with PIP expression. We also point to other novel modulated pathways that warrant further biological and clinical investigations.

Results and Discussion
Characterization of the cellular models Four breast cancer cell lines presenting different features especially concerning PIP expression, hormonal receptor status and invasiveness potential were selected: MDA-MB231, a poorly differentiated and highly invasive cell line, MCF7, T47D and VHB1, which are known to be differentiated and non-invasive breast cancer cell lines [25].
As PIP expression was shown to be increased by androgens [26], we first analyzed the expression of the androgen receptor (AR). Moreover, as the estrogen receptor (ER) expression is currently used as a potential marker to classify breast cancer samples, and the ER-positive tumors are often found associated with a better outcome than the ER-negative ones [27,28], we also analyzed the status of this hormonal receptor in the four cell lines. RT-PCR analysis of AR expression in the four cell lines indicated that MDA-MB231 is AR-negative and MCF7 AR-positive (data not shown). As expected both T47D and VHB1 cells were AR-positive. Similarly, the expression of ER was found positive in MCF7, T47D and VHB1 and negative in MDA-MB231. The phenotype of the cell lines further used for DNA microarray-based gene expression profile studies is summarized in Table 1.
Androgen treatments were used to induce PIP expression. Breast cancer cell lines were grown in presence or absence of 10 nM of dihydrotestosterone (DHT) for various periods, and PIP gene expression was analyzed by northern blot (Figure 1). No detectable PIP expression was found in MCF7 despite the expression of AR; similar results were obtained in MDA-MB231 even after 10 days of androgen treatment (data not shown). In contrast, in T47D which constitutively expresses PIP, DHT treatment for 6, 8 and 10 days increased PIP expression about 2.5, 4 and 4.9 times, respectively, whereas in VHB1 that did not exhibit detectable endogenous expression, DHT treatment induced a consistent PIP expression. A low PIP expression level was observed in mammary gland samples (MG) taken from healthy women (Figure 1). Identical phenotypes were observed by RT-PCR (data not shown). In the present study, our strategy was to take advantage of the differences in hormonal receptor expression level and invasiveness potential of the distinct cell lines in order to predominantly focus on the gene expression modulations associated with PIP expression, but independently of the possible influence of particular genetic backgrounds. Thus, the samples were categorized in either a [PIP+] or a [PIP2] subgroup and used for subsequent analyses.

Experimental design and statistical power simulations
Gene expression profiles were collected in duplicate from a total of 32 RNA samples derived from 4 independent cultures and RNA preparations of the 4 breast carcinoma cell lines cultured without (J0) and with DHT for 7 days (J7).
The a priori statistical power of the gene expression dataset was measured as the probability of obtaining statistical significance when true biological differences exist between the compared groups of samples (1 -b; true positive rate). A conventional power analysis requires the designation of parameters such as the anticipated variability of individual measurements for all genes within each biological group (s), the total sample size (n, n1 & n2), the magnitude of the effect to be detected (W) and the acceptable false positive rate (significance level a). It allows verifying which subgroups of samples are likely to provide the most comprehensive relevant information and that enough samples are compared to meet the objectives of the study.
Thus, samples were divided in 2 subgroups according to their PIP expression level ( (J0 and J7) samples as the PIP expression level is significantly higher than in VHB1 (J7) and can influence differently the gene expression profiles (Figure 1). Statistical power (1-b) was computed for a two-class comparison, detecting a true 1.5-fold mean difference between either the [PIP+] or [PIP++] group of samples and the [PIP2] group of samples at a significance level (a) of 0.01, considering a total sample size of 64 or 56 respectively with n2/n1 = 0.6 or 0.4 (Table 2), and an expected variability within each biological sample group s median , = 0.30. The statistical power was estimated to be satisfactory, with a limited proportion of false negatives (b = 0.01 to 0.03, Table 2), while consistent with a small number of spurious discoveries (cf. Table 3).
The simulations thus suggested that only a negligible proportion of the information relevant to the question addressed would be missed in the class comparisons, and provided a high confidence toward the differentially expressed genes identified.

Measure of the range of biological variability in samples
As the cell line intrinsic properties may obscure expression patterns related to PIP gene expression, we appreciated the range of biological variability through unsupervised clustering of the entire gene expression profiles ( Figure 2). Similarity measures between genes were computed using a Pearson correlation. Clusters were defined by an average linkage clustering method. No clear cluster was observed according to either the PIP gene expression level or the ER status, the resulting dendrograms of the samples probably reflecting predictable biological variability between cell lines. As shown in Figure 2, samples were clustered in two distinct groups, one containing DHT-untreated (J0) ortreated (J7) MDA-MB231 and VHB1 J7 samples, the other MCF7 (J0 and J7), VHB1 J0, T47D (J0 and J7) and normal mammary gland samples. Except for the VHB1 samples, serially treated cell line samples tended mainly to cluster together, independently from DHT treatment or PIP expression. This indicates that the genomewide expression profile changes induced by the hormonal treatment may be less prominent than the inherent observed cell line differences. The transcriptome data analysis strategy was therefore designed to assess the likelihood of detecting reliably significant gene expression differences linked to variations in PIP expression.

Identification of the genes co-modulated with PIP expression
Beforehand, several statistical differential comparisons were performed to first highlight gene expression modulation that may be unrelated to the PIP gene influence, but potentially resulting from the cell line heterogeneity itself. The analysis was done comparing the expression profiles of the 3 cell lines associated with a [PIP2] phenotype (i.e. MCF7, VHB1 and MDA-MB231) without DHT treatment in order to identify specific unique gene expression. This analysis pointed out 85% of the genes (7,996 clones) that were found not significantly differentially expressed between the 3 cell lines (p = 0.01). The corresponding gene list was used as a reference for subsequent statistical analyses with the drawback that a fraction of them will escape detection of differential expression in relation with PIP in subsequent analyses, being confounded by differences in the genetic background of the individual cell lines, but ensuring that the gene modulations identified are strictly related to PIP expression. Two-and threeclass comparisons of mean relative expression levels were then performed gene-by-gene between [   subgroups using a combination of tand Fstatistic approaches and yielded complementary lists (See Table S1), including a list of 606 clones (L606, two-class, data not shown) and a list of 235 clones (L235, three-class, see Table S2). The 3-class comparison was privileged to take into account the additional modulations of expression that could occur between samples with a moderate [PIP+] or a high [PIP++] expression and to focus our exploration on genes that may be found co-regulated in relation to the PIP expression level changes. The split of PIP-expressing samples in two separate subgroups of samples raised the strength of the statistical analysis and led to the identification of a more extensive and explicit list of modulated genes, as 44% of the clones in L235 were not detected in two-class comparison and therefore not included in L606. In addition, these genes are unlikely to correspond to genes modulated by DHT treatment independently from PIP expression, since both [PIP+] and [PIP++] subgroups of samples consist of treated samples and the [PIP++] subgroup contains a balanced number of treated and untreated samples. The genes represented in L235 were annotated using the Unigene Cluster Ids identifiers [29]. L235 corresponds to 205 unique genes (193 unique named genes) (L235; see Table S2), including 92 upregulated named genes (64%) and 51 down-regulated named genes in the [PIP+] group when compared with the [PIP2] group, with a foldchange over 1.35. More than one third of the selected genes were associated with a fold-change ranging from 1.35 to 1.5; thus, the effective statistical power was computed to evaluate the reliability of detection of such slight gene modulations for both two-and threeclass comparisons. We found the computed power to be satisfactory, being over 90% and 78%, thereby ensuring the reliable detection of these small variations in gene expression between [ Table 2). The false positive rate associated with this threshold ratio was estimated to be lower than 0.2% in all subgroup comparisons ( Table 3), confirming that these slightly modulated genes could be taken into account confidently.
To further probe the ability of different subsets of the genes represented in L235 to discriminate between [PIP+] and [PIP2] phenotypes, supervised hierarchical clustering of the expression profiles was performed ( Figure 3). Gene clusters with related expression patterns were clearly discernable, consistently pointing out differences between the PIP-expressing and non-expressing samples. Precisely, the samples are divided in two main groups according to their PIP expression phenotype. This observation contrasts with the sample similarity dendrogram previously obtained using the whole gene expression matrix ( Figure 2). Clusters of gene modules that appeared the most relevant to differentiate [PIP+] and [PIP2] subgroups were identified using tstatistics with a permutation-based adjustment of the gene expression matrix (n = 10,000 and a = 0.05). The top-ranked clusters were NODE222X of 60 clones (50 named genes and 7 not assigned to any Unigene cluster Id) found up-regulated in the [PIP+] subgroup (t-stat = 24.57; p = 2.5610 24 ) and NODE196X and NODE167X containing 26 and 11 clones (23 and 8 named genes, respectively, and 3 not assigned to any Unigene cluster Id in each node), which conversely represent clusters of genes upregulated in the

Validation of the microarray gene expression data
The accuracy and reliability of the results obtained with microarrays was tested by quantitative RT-PCR (Q-PCR) using a tailored TaqMan Low Density Array (LDA). The relative gene expression levels (RQ = 2 2DDCt , [30] Thirty-two genes (28 from list L235, 4 from lists L578 and L2231, see Table S1) were chosen for validation of microarray data. Nine additional genes were selected among those found not significantly differentially expressed ( Table 4). Comparison of microarray and Q-PCR results after z-statistics with FDR adjustments indicated a good agreement: 28 of the 32 (87%) differentially expressed genes detected with microarrays were fully validated by Q-PCR ( Table 5). The remaining differentially expressed genes were considered as false positive results: two of them were associated with an inverted Q-PCR expression ratio compared to that obtained with microarrays and the 2 others were not found significantly differentially expressed when analyzed by Q-PCR. Finally, four of the nine genes not detected as differentially expressed with microarrays and found discordant when analyzed by Q-PCR are likely to represent microarray false negative results or false positive results of one or the other technology (Table 5).
Taken together, these results show that most of the genes identified by microarrays were validated and only few genes were found to be false positive results. Thus, expression of the genes in L235 derived from the three-class comparison correlates directly or inversely with PIP modulations. It provides a faithful representation of the breast cancer cellular model and therefore a solid basis for further functional exploration of the results.

Functional annotation of the differentially expressed genes
Functional analysis was performed using the Ingenuity Pathway Analysis (IPA, version 4.0) tool which relies on a knowledge database of curated functional and regulatory interactions extracted from the literature and provides integrated graphical representation of the biological relationships between genes and gene products. Two distinct analyses were performed based on Locuslink ID gene identifiers, considering separately up-and down-regulated genes from list L235 ( Table 6). The p-values relative to the most enriched functions appeared highly significant (a = 0.05, Fisher's exact test). A total of 48 and 43 significantly over-represented biological functions were identified in the upand down-regulated gene lists, respectively. Among them, 37 are overlapping. The significance was higher for functions associated with down-regulated genes. The 20 most relevant functions identified with down-regulated genes and the corresponding pvalues for both analyses are reported in Table 6.
Thus, cancer, cell cycle, cellular growth and proliferation related functions appeared to be in the top five of highest-level functions highlighted in both analyses according to the assigned pvalue. Cell death was also associated with significant expression changes in genes co-modulated with PIP (1.33e210 to 9.72e25 for up-regulated genes and 1.96e212 to 3.58e27 for downregulated genes). Taken together, these results suggest that tumor proliferation might be deeply impacted by modulations of gene expression levels between the [PIP+] and the [PIP2] samples. Moreover, significantly enriched gene classes related to downregulated genes in [PIP+] versus [PIP2] samples are highly indicative of processes involving cell morphology and movement. The prominent functions associated with theses classes are cell morphology, tissue morphology, tumor morphology, cellular movement, cell to cell signalling and interaction, and cellular assembly and organization (Table 6). These biological functions are also found significantly over-represented in the up-regulated gene list even though the associated p-values are slightly smaller but statistically relevant (p-value lower limit#1e26). A large number of genes involved in cellular movement partly overlapped with cancer-related genes. Altogether these gene expression modulations might influence cell death or tumor invasiveness.

Identification of molecular networks associated with PIP gene expression modulations
We next investigated biological relationships between genes and gene products by performing a network analysis for the genes represented in L235. A total of 126 unique genes in L235, called focus genes, were mapped to genetic networks as defined by the IPA tool. Nine networks were found significantly enriched with scores ranging from 9 to 19 (data not shown), the probability for a network to be selected by chance (score,3) decreasing when its corresponding score value increases. As network identification using the IPA tool may be strongly dependent upon size and content of the input gene list used, further analyses were conducted using independent gene lists for the up-and down-regulated genes in L235. This analysis led to the identification of ten and eleven networks for up-and down-regulated genes, respectively (data not shown). These networks were associated with the same biological functions (i.e. cancer, cell death, cell cycle, cellular growth, gene expression, proliferation and tissue morphology) exhibiting higher scores (15-23 and 8-23 for the top 6 up-and 4 down-regulated gene networks, respectively) as those previously identified using the whole gene list L235 (Table 7).
The PEA-15 function was recently related to cell invasion via its ability to bind to ERK1/2 [62]. It has also been shown that PEA-15 is expressed in normal mammary gland and exhibits a decreased expression in pathologically invasive cancer, suggesting   The down-regulated gene analysis also highlighted a network linked to cell to cell signalling and cellular movement function, reflecting an impact of these gene modulations on cell motility and invasiveness (network 8, down-regulated genes, Table 7). Within this network, several genes had a function promoting migration and cell invasiveness (IQGAP1, TGFBI, CD44, CTSL, PTN) [68][69][70][71]. Accordingly, their down-regulation in [PIP+] versus [PIP2] samples could have a suppressive effect on cell invasion. Similarly, the down-regulated expression of LOXL2 might prevent tumor progression, as shown by the induction of the epithelial-tomesenchymal transition process in epithelial cells overexpressing these genes [44].
In summary, this pathway analysis strongly suggests that the majority of gene modulations, occurring in [PIP+] versus [PIP2]  cells, may contribute to a reduction of cell proliferation concomitantly with an increase of apoptosis and cell adhesion.

Visualization of relevant modulations in a master molecular interaction network
In order to visualize comprehensive interactions between the modulated genes within breast cancer cells and place them in the context of molecular interaction network, the most significant upand down-regulated gene networks were merged (Figure 4). Networks 1, 4 and 6 of up-regulated genes were chosen according to the highest score value (23), the corresponding functions in tumorigenesis (proliferation, cancer and cell cycle) and the presence of the PIP gene, respectively (Table 7). These networks were merged through the overlapping genes (CD14 and TCF3 for networks 1 & 6; CCND3 for networks 4 & 6). About 45% of the focus genes of those up-regulated gene networks are found to take part in one of the most relevant clusters identified by t-statistics in L235 hierarchical clustering. More specifically, 67% of the focus genes from network 1 are included in NODE 222X. Networks 7, 8 and 10 of down-regulated genes were selected upon the presence of overlapping genes between networks: TP53 for networks 7&10 and IQGAP1 for networks 7 & 8 ( Table 7). An important fraction of the focus genes from those down-regulated gene networks are localized within NODE 167X or NODE 196X from L235 hierarchical clustering.
A 231-member master molecular network has been assembled with 1,262 edges corresponding to a global view of gene expression modulations occurring together with PIP gene expression. This master network was constructed by merging the selected up-and down-regulated networks. The nodes and edges for each individual network were added to the merged network together with any new edges that connect these networks, resulting in incorporation of 29 additional genes. Nineteen nodes appeared to be highly connected in the network as demonstrated by the important number of edges emerging from or pointing to them. These nodes were considered as 'hub genes' and were moved to the periphery of the network, together with the PIP gene, in order to highlight them [72]. Their high connectivity is likely to reflect their ability to regulate an important number of genes within the master network and potentially to control the gene expression modulations identified between cells overexpressing PIP or not ( Figure 4) [73]. Unexpectedly, of the 15 oncogenes and tumor suppressor genes included in the master network, 10 end up among the 19 hub genes, and 8 of them have been detected as differentially expressed through microarray and/or Q-PCR analyses. Such genes are usually not detected through analysis of differential expression, and are incorporated in predictive network modules only through integration of curated protein-protein interactions [74]. This highlights their central interconnecting role in the master network, and the value of using high-precision expression measurements with careful assessment of statistical power as performed in this study. All 19 hub genes except AR, IFNG, SPL1 and TNF were present on the array. Among them, CDKN2A and HDAC3 were identified as significantly overexpressed, and NRAS and CD44 as significantly decreased in [PIP+] versus [PIP2] samples, as detected by both microarray analysis (list L235) and Q-PCR (Figure 4, grey shaded symbols and underlined names). Four hub-genes (EGFR, CCNA2, and TGFB1, TP53 for up-and down-regulated genes, respectively; Figure 4, open symbols and underlined names) were not present in L235 but were found significantly differentially expressed by Q-PCR analysis only; three hub genes (MYC, BCL2, CDKN1A; Figure 4, asterisks) were found significantly differentially expressed by both Q-PCR and microarray analysis, belonging to other relevant computed gene lists. The remaining 4 hub genes (FN1, CCNE1, ERBB2, SP1) were not assayed by Q-PCR nor identified as differentially expressed by microarrays except ERBB2 and SP1 which were selected in lists L578 and L1114 (See Table S1). As molecular relationships represented on the network include not only induction or inhibition of expression, but also protein-protein interactions, DNA-protein interactions and activation, localization, inhibition of the corresponding proteins, it is not surprising that microarrays and Q-PCR may fail to identify some of the hub genes as being significantly modulated among the [PIP+] and [PIP2] subgroups of samples. These genes might play a major role through protein activation for instance. Alternatively, their modulations may be very subtle and below the threshold for reliable detection of differences of our microarray platform despite its high sensitivity. Distinguishing between these different possibilities will require targeted validation experiments.
The PIP gene was also moved to the periphery of the network, even though it is connected to only two other genes, CD4 and EGR2 (Figure 4). The edge connecting PIP to EGR2 was previously reported by a microarray study in rat Schwann cells, which demonstrated an up-regulation of PIP in cells overexpressing EGR2 [75]. EGR2 cDNA clones are represented on the array but no significant modulation of its expression was observed in parallel with the PIP gene. The edge connecting PIP to CD4 is based on the reported interaction between these proteins leading to T lymphocyte programmed cell death inhibition induced by CD4 cross-linking and subsequent TCR activation [19]. In our in vitro models of breast cancer cell lines, the interaction between the secreted glycoprotein PIP and CD4 cannot take place since CD4 is not expressed in these cells.
In addition, previous studies reported that the PIP protein may exert an aspartyl proteinase activity able to specifically cleave fibronectin (encoded by FN1) [20], supporting a link between PIP and FN1 at the protein level. This interaction between PIP and FN1 is actually missing in the IPA database. In spite of this lack, FN1 appears as a hub gene in our study (Figure 4), thus supporting the strength of the genes identified as involved in the master network associated with PIP expression. This indicates that even though more than one million of functional, regulatory and physical interactions are included in the IPA knowledge database, its content is far from being exhaustive. Consequently, other interactions can be missed in the network of the PIP co-modulated genes represented in Figure 4.

Promoter analysis of differentially expressed genes
Co-regulation of mammalian genes usually depends on sets of transcription factors rather than on one individual factor. Therefore an analysis of the promoter regions of the genes from list L235 was conducted in order to identify potential common transcriptional regulators.
Using the cluster ElDoradoe/Gene2promotor/GEMS Launcher for promoter analysis [76], three families of transcription factor binding sites (TFBS) were identified to be common to at least 40% of the genes from L235. They correspond to the glucocorticoid responsive and related elements (GREF), LEF1/TCF (LEFF) and the signal transducer and activator of transcription (STAT). The   Table 7). The network is displayed graphically as nodes (genes/gene products) and edges (the biological relationship between the nodes). The [PIP+] relative to[PIP2] overexpressed genes are shaded in light red and down-regulated genes in green. The genes connected with PIP (EGR2 and CD4) and STAT5B, which was identified by a promoter analysis as a potential key regulator of the master network, are shaded in yellow. The nodes are represented using various shapes that represent the functional class of the gene products. Highly interconnected nodes ('hub genes') are moved to the network periphery together with the PIP gene. The hub genes belonging to L235 are shaded in gray and those, which were detected by quantitative PCR, are underlined. The gene names are written in green (down-regulated) or in red (up-regulated) relative to a [PIP+] versus [PIP2] modulation. An asterisk refers to a gene that was not selected in L235, but was identified at another level of the statistical analysis (* for L578&L2231 and ** for L1114 & L2231,  HAX1, HDAC3, IQGAP1, MPHOSPH6, NR1H3, PEA-15, PFN2,  RAB13, REEP5, RFC4, SCAMP2, SEC22B, SLC2A1, SUCLA2,  TFPI2, THBD, TNNT1, TSPAN1, TSTA3, TULP1). Specifically, STAT5 transcription factor binding sites were implicated in this GREF-STAT motif. Additionally, a TFBS from the STAT motif family only was identified in the promoter region of several other genes from L235 (data not shown).
Previous studies described that STAT5 may exhibit opposite functions in mammary oncogenesis, either increasing tumor development in several murine models [85][86][87] or inhibiting tumor progression in human breast cancer cells [88,89]. More recently, it has been proposed that STAT5 may act as a suppressor of invasion, epithelial mesenchymal transition and dispersal of breast cancer cells from the primary tumor [89][90][91]. This was novel in light of the previous tumor-promoting role attributed to STAT5 [89]. The suppressive role of STAT5 on cell invasion was confirmed in vitro in the well-differentiated ER-positive breast cancer cells T47D [88]. It was also shown that PRL may suppress human breast cancer cell invasion through multiple mechanisms, such as activation of STAT5 [90]. Indeed, STAT5, one of the main downstream effector molecules of PRL [90], has been shown to directly modulate transcriptional activity through interaction with the promoter region of the target genes [92]. Moreover, high levels of activated STAT5 have been found in a substantial proportion of human breast tumors, which interestingly exhibited a better prognosis [88,93].
Interestingly, PIP gene expression was previously reported to be synergistically induced by prolactin (PRL)-activated STAT5 and DHT-activated AR. More precisely PRL-induced phosphorylation on Tyr694 of STAT5A and Tyr699 of STAT5B was demonstrated to be required for the synergistic effect of DHT and PRL on transcriptional activation of the PIP/GCDFP-15 gene [94].
Altogether, our results suggest the potential involvement of STAT5 in the transcriptional regulation of several genes from the master network identified (Figure 4) associated with PIP. The failure to detect significant expression changes of STAT5A and STAT5B genes in PIP expressing versus non expressing breast carcinoma cells using microarray analysis suggests that the protective effects of these transcription factors on breast carcinoma development could be mainly due to their activation rather than to modifications of their gene expression levels. This hypothesis, supported by the suppression of cell invasion through STAT5 activation [90], will have to be further investigated in future studies.
In summary, we report here a comprehensive characterization of the gene expression modulations occurring in PIP-expressing versus non-expressing breast cancer cell lines. Using rigorous unsupervised and supervised analyses, we identified differentially expressed genes, which were found strictly co-modulated in relation to the PIP expression level changes and allowed us to discriminate [PIP+] and [PIP2] subgroups of samples. This study provides useful information in term of pathway modulations that occur within breast cells expressing PIP. The combination of a high-precision expression profiling with an extensive functional and regulatory network analysis has emphasized a central interconnecting role of a number of oncogenes and tumor suppressor genes in the network associated with PIP expression modulation. Many oncogenes and tumor suppressor genes, previously reported to exhibit particular breast cancer mutations, e.g. ERBB2 and TP53, are typically not detected through analysis of differential expression but can play a central role in signalling networks by interconnecting many expression-responsive genes [74]. Interestingly, half of them were found significantly differentially expressed with an increase level of PIP transcript. Consequently, our data allowed determination of a global view of the regulatory network resulting from PIP overexpression based on the aggregate behaviour of genes connected in a functional network rather than on unique genes found differentially expressed.
Functionally, the gene expression modulations associated with an increase of levels of PIP transcript appear associated with an inhibition of proliferation coupled with an enhancement of the apoptosis and the cell adhesion in breast cancer cell lines. These results provide additional and contextual support for the good prognostic value of PIP gene expression in breast cancer, as recently demonstrated by immunohistochemistry on a large cohort of tumor samples in which significantly longer disease-free survival times were associated with PIP positive tumors [95]. In addition, STAT5 was identified through in silico promoter analyses of the genes co-modulated with PIP suggesting that it might be a transcriptional regulator accounting for the observed altered functions. This unexpected result supports the view that an important part of the modulated genes act as upstream or downstream effectors of STAT5. For some of them, there is no experimental evidence of a relationship with STAT5 and additional experiments would be required to confirm this point. Finally, STAT5 is known to inhibit cell invasion and is considered as a good prognostic factor in breast cancer [88,89].
Many of the groups of genes ( Table 7) that form the basis of the master network reported here (Figure 4), including those discussed above, represent novel combinations of factors that may impact on important cancer-initiating biological processes or that may be modulated consequentially. Further biological and clinical investigations using a large cohort of patients will be necessary to identify those which contribute directly to breast cancer development and progression, have prognostic value and are possible targets for therapeutic intervention.

RNA extraction
Total RNA was extracted from monolayer cells in culture at 2/3 confluency using the RNeasy Mini Kit (Qiagen, Hilden, Germany). RNA purity and quantity was assessed by UV measurement. Healthy mammary gland RNA from three distinct healthy donors and a universal human reference RNA were obtained from commercial sources (Stratagene Europe, Amsterdam, Netherlands). RNA integrity was judged using RNA 6000 nano chips and the Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA) according to the manufacturer's instructions. RNA quality-control was performed using user-independent classifiers as described [97,98].

Microarray design and manufacture
The human cDNA microarrays used contained 11,520 sequences derived from various sequence-verified clone collections as previously described [102]. The array set provides a genomewide coverage of functional pathways, such as cell cycle and checkpoints, cell growth and/or maintenance, cell adhesion and proliferation, development, extracellular matrix, apoptosis, response to DNA damage and DNA repair, DNA replication, transcription and RNA processing. High confidence qualifications and annotations of the clone collections have been previously described [102] and are available through our web site (The Genexpress -Array s/IMAGE web site, [103]). All arrays were printed in the laboratory on amino-modified mirrored glass slides using the Lucidea array spotter (Amersham Biosciences) as described [102]. The suite of amplified cDNAs was printed as a group in two spatially separated replicates.

Hybridization experimental design and analysis
To reduce potential experimental biases, four independent RNA preparations were collected for each DHT-treated anduntreated cell lines. To assess data reproducibility and minimize dye bias effects, each of the samples was measured twice, once with Cy3 and once with Cy5. To ensure robustness and flexibility in data analysis, a reference design was used with a universal reference sample (Stratagene) serving as a baseline for the comparisons of cell line samples. Such a design does not require pre-definition of the subgroups for comparison, allows robust discovery of non-anticipated classes among the samples and is compatible with subsequent additional sampling [102].
Thirty mg of total RNA from each cell line and human universal reference RNA (Stratagene) were supplemented with known sequences (spikes, Universal ScoreCard), reverse transcribed using an oligo-dT primer and labeled alternatively with Cy-5-dCTP and Cy-3-dCTP (Amersham Biosciences). Samples were purified using the Qiagen's QIAquick PCR Purification kit procedure and submitted to a vigilant quality control procedure as previously described [102]. Hybridizations to the arrays were performed as described [102]. Array images and raw data were obtained using the GenIII array scanner (Amersham Biosciences) and ArrayVision 7.0 software (Imaging Research Inc., Amersham Biosciences, Palo Alto, CA, USA). Raw data were first imported into a Genetraffic duo database (Iobion Informatics, Toronto, Canada), local background-subtracted and normalized using a Lowess (locally weighted linear regression) transformation. The following selection criteria were applied: all spots having a mean signal (after background subtraction) less than that of the background and below that of the negative controls in both Cy3 and Cy5 channels were systematically excluded; the data were also filtered to exclude spots flagged as missing or corrupted in one array. For arrays considered as partially exploitable based on several quality criteria additional hybridizations were done and considered as technical replicates. We next calculated the average expression ratios (test/ reference) in all analyses. Log2 values of lowess-transformed data were used for all subsequent statistical analyses. For reporting genes by name, IMAGE Clone IDs corresponding to the microarray probe sequences were used to extract UniGene Cluster IDs and names (Build 199 Homo sapiens; Jan 16 2007) [104]. For genes represented by multiple probes (that is, different clones corresponding to the same gene) on the array, each probe and the related expression ratios were considered and reported separately. MIAME-compliant data [105] have been deposited in the Gene Expression Omnibus (GEO) at NCBI [106] and are accessible through GEO Series accession number GSE11627.

Modeling of experimental power
For statistical confidence and power analyses related to this specific program, power (z-score) for an unpaired t-test (twosample analyses) was computed as previously described [102] for estimation of false discoveries (FDR) [107] and using the GPower3.0.3 program [108,109] for estimation of false negatives (FNR), taking into account the standard deviation of expression measurements, the size of the distinct sample groups, a significance threshold and the fold ratio to be detected.
A priori power analyses were used to choose the appropriate number of replicates before the study was conducted. Conversely, post hoc power calculations were done to evaluate the actual power reached in our study.

Hierarchical clustering
For discriminant analysis of overall variation in samples/genes, median centering and normalization of the genes and samples were applied to the entire dataset. Genes which had missing values in more than 20% of the samples were removed from subsequent analysis. An unsupervised average-linkage hierarchical clustering algorithm using a centered Pearson correlation as similarity metric was applied to investigate relationships between samples and relationships between genes. This method leads to an expression matrix such that genes and samples with similar expression patterns are adjacent to each other. This analysis was performed using Cluster [110] and the resulting expression map was visualized with TreeView [110].
For discriminant analysis of differentially expressed genes, an average-linkage hierarchical clustering with uncentered Pearson correlation was applied to the dataset extracted from the list of the genes selected to be differentially expressed. Mean sample profiles and gene profiles were ranked based on a discrimination score, which is equivalent to the t-statistics z-score, using the Cluster Identification Tool (CIT), based on supervised t-statistics with permutation [111]. Discriminant analysis of the [PIP+] and [PIP2] samples was performed, providing a list of gene nodes that exhibit statistically significant differential expression between the two groups (a#0.05).

Statistical analysis
The statistical significance of measured intensity differences was tested using ArrayStat 1.0 software (Imaging Research Inc.). The whole data sets were adjusted using additive statistical models considering samples with homogeneous phenotypes ([PIP++], [PIP+] or [PIP2]) as replicates measures from one condition, with a minimum of 67-75% registered measures per gene. Offset corrections were applied to compensate any potential systematic errors that may exist within data for each condition across arrays. Random error was estimated using a curve-fit method; outliers were automatically detected and then excluded from subsequent analysis, based on thresholds computed over the entire dataset (in median absolute deviation (MAD) and standard deviation (sd) units). , data sets were centered for each condition and t-, Z-statistics and F-statistics were applied with false discovery rate (FDR) corrections to compensate for multiple testing effects [112]. Data from genes with significant differential expression levels between the two compared subgroups were displayed, together with a two-tailed p-value adjusted with a = 0.01.
Independent statistical analysis was achieved using SAM (Significant Analysis of Microarrays, Standford University) [113]. This class comparison method uses a modified t-test to identify genes that discriminate, for example, [PIP+] samples from [PIP2] samples. The modified t-test involves carrying out typical t-tests for each gene using the original data and a user-specified number of permuted datasets generated by randomly shuffling of the class labels. We conducted SAM analysis based on the microarray intensity level (in arbitrary units,

Real-time Quantitative RT-PCR (Q-PCR) analysis using Taqman Low Density Arrays
Pre-defined TaqMan probe and primer sets for target genes were chosen from an on-line database (Applied Biosystems, Foster City, CA,[114]). The sets were factory-loaded into 384 well microfluidic cards (Applied Biosystems) as customized with two replicates per target gene. Single-stranded cDNA was prepared from 1 mg of total RNA from breast carcinoma cell lines using the high capacity cDNA archive kit (Applied Biosystems), according to the manufacturer's instructions. Breast carcinoma cell line RNA samples derived from identical preparations for both cDNA microarray and Q-PCR analysis.
Two ml of single-stranded cDNA (equivalent to 100 ng of total RNA) were mixed with 48 ml of nuclease-free water and 50 ml of TaqMan Universal PCR Master Mix (Applied Biosystems). The sample-specific PCR mixture (100 ml) was loaded into one sample port, the cards were centrifuged twice for 1 min at 280 g and sealed to prevent well-to-well contamination. The cards were placed in the Micro Fluidic Card Sample Block of an ABI Prism 7900 HT Sequence Detection System (Applied Biosystems). The thermal cycling conditions were 2 min at 50 uC and 10 min at 95 uC, followed by 40 cycles of 30 s at 97 uC and 1 min at 59.7 uC. 96 genes were tested by quantitative PCR, using the TaqMan low density micro fluidic card (Applied Biosystems, USA). Raw data are available upon request.

Network and Gene Ontology analysis
The differentially expressed genes were used for pathway and Gene Ontology analyses. Locuslink ID gene accession numbers and their corresponding fold changes in our experiment were imported into the Ingenuity Pathway Analysis (IPA) tool and mapped to its corresponding gene object in the Ingenuity Pathways Knowledge Base (Ingenuity Systems,[115]). Genes were categorized based on their molecular functions using the software, mapped onto genetic networks in the IPA database and then ranked by score. The score associated with a particular network is the likehood (i.e. negative log of a p-value) of the genes identified as differentially expressed in a network being found together due to chance. The score is thus indicative of the proportion of genes identified as differentially expressed in our analysis among all the genes belonging to a particular network. A score of 3 reflects the likelihood that the presence of the focus genes in a network is solely due to chance is 1/1000. Therefore, scores of 3 or higher represent a .99.9% confidence level. Genes and gene products are represented as nodes and the biological relationship between two nodes is represented as an edge (line).
In functional analyses, the biological functions that were most significant to the dataset were identified. The significance value assigned to the functions is calculated using the one-side righttailed Fisher's Exact Test (a = 0.05) of the IPA tool. In this statistical test, the chances that the genes-of-interest participate in the biological functions are examined. A p-value is calculated by comparing the number of genes-of-interest in a particular function with their occurrences in all the functions in the IPA knowledge database.

Promoter sequence analysis
The human promoter sequences for all genes from L235 were extracted with the ElDoradoe/Gene2promotor system ( [76]; default 500-bp upstream of the transcription start site and 100bp downstream). The GEMS Launcher software was used to search for common transcription factor binding sites (TFBSs) in multiple sequences. The quorum constraint which determines the lower limit of loci within the input set that has to contain the common framework was set to 40% (core similarity 1). The selection of matrices associated with specific tissue was restricted to breast tissue.
The FrameWorker task of GEMS Launcher package [116] was then used to retrieve common motifs (frameworks) of transcription factor binding sites in the promoter region of the input genes.

Supporting Information
Table S1 Origins of gene lists derived from class comparison and class prediction of relative expression levels. Differential expression analyses were conducted to identify genes co-modulated with PIP using several statistical t-, z-and F-tests and an Î6 of 0.01. These analyses were conducted initially with a twophenotype sample classification