In silico recognition of a prognostic signature in basal-like breast cancer patients

Background Triple-negative breast cancers (TNBCs) display poor prognosis, have a high risk of tumour recurrence, and exhibit high resistance to drug treatments. Based on their gene expression profiles, the majority of TNBCs are classified as basal-like breast cancers. Currently, there are not available widely-accepted prognostic markers to predict outcomes in basal-like subtype, so the selection of new prognostic indicators for this BC phenotype represents an unmet clinical challenge. Results Here, we attempted to address this challenging issue by exploiting a bioinformatics pipeline able to integrate transcriptomic, genomic, epigenomic, and clinical data freely accessible from public repositories. This pipeline starts from the application of the well-established network-based SWIM methodology on the transcriptomic data to unveil important (switch) genes in relation with a complex disease of interest. Then, survival and linear regression analyses are performed to associate the gene expression profiles of the switch genes with both the patients’ clinical outcome and the disease aggressiveness. This allows us to identify a prognostic gene signature that in turn is fed to the last step of the pipeline consisting of an analysis at DNA level, to investigate whether variations in the expression of identified prognostic switch genes could be related to genetic (copy number variations) or epigenetic (DNA methylation differences) alterations in their gene loci, or to the activities of transcription factors binding to their promoter regions. Finally, changes in the protein expression levels corresponding to the so far identified prognostic switch genes are evaluated by immunohistochemical staining results taking advantage of the Human Protein Atlas. Conclusion The application of the proposed pipeline on the dataset of The Cancer Genome Atlas (TCGA)-Breast Invasive Carcinoma (BRCA) patients affected by basal-like subtype led to an in silico recognition of a basal-like specific gene signature composed of 11 potential prognostic biomarkers to be further investigated.


Results
Here, we attempted to address this challenging issue by exploiting a bioinformatics pipeline able to integrate transcriptomic, genomic, epigenomic, and clinical data freely accessible from public repositories. This pipeline starts from the application of the well-established network-based SWIM methodology on the transcriptomic data to unveil important (switch) genes in relation with a complex disease of interest. Then, survival and linear regression analyses are performed to associate the gene expression profiles of the switch genes with both the patients' clinical outcome and the disease aggressiveness. This allows us to identify a prognostic gene signature that in turn is fed to the last step of the pipeline consisting of an analysis at DNA level, to investigate whether variations in the expression of identified prognostic switch genes could be related to genetic (copy number variations) or epigenetic (DNA methylation differences) alterations in their gene loci, or to the activities of transcription factors binding to their promoter regions. Finally, changes in the protein expression levels corresponding to the so far identified prognostic switch genes are evaluated by immunohistochemical staining results taking advantage of the Human Protein Atlas.

Conclusion
The application of the proposed pipeline on the dataset of The Cancer Genome Atlas (TCGA)-Breast Invasive Carcinoma (BRCA) patients affected by basal-like subtype led to

Introduction
Breast cancer (BC) is the most common female cancer and despite important advances in early detection and research development, it continues to be the second leading cause of death in women worldwide [1]. Triple-negative BC (TNBC) accounts for a minority of all diagnosed BCs (15-20%) [2]. It is a subtype with a heterogeneous nature, defined by the low or absent expression of estrogen (ER), progesterone (PR) receptors and the lack of expression of the human epidermal growth factor (EGF) receptor-2 (HER2) receptors [3]. These cancers differ from other BC subtypes in that they grow and spread faster, have limited treatment options (typically treated with chemotherapy) and their metastatic pattern spread with a higher likelihood of brain and lung involvement and less frequently with bone lesions. Relapse is common in TNBC, usually in the first 5 years, leading to the poorest survival outcomes between all BC subtypes [4]. Currently, there are not available widely-accepted prognostic markers to predict outcomes in TNBC patients. TNBC is often used as a surrogate for identifying the aggressive basal-like BC subtype. Although the two patterns share many similarities, biologically they are not the same, but both are associated with poor clinical outcomes. Therefore, the development of new prognostic indicators for basal-like subtype represents an unmet clinical challenge that might be of benefit to the clinical management of this disease. To achieve this goal, we started from data extracted from our recent computational analysis of BC phenotypes [2]. In that study, we exploited The Cancer Genome Atlas (TCGA)-Breast Invasive Carcinoma (BRCA) dataset applying a network-based tool named SWItch Miner (SWIM), which predicts important (switch) genes within the co-expression network that regulate disease state transitions. The transcriptomic profile of BC patients was stratified into BC subtypes according to the wellestablished Immunohistochemistry (IHC) (Luminal A, Luminal B, Her2 positive and Triplenegative) and genetic (PAM50; Luminal A, Luminal B, Her2 positive and Basal-like) classification, to identify switch genes shared among four subtypes and those specific for each subtype. We focused our attention on shared switch genes to identify a common BC disease module univocally altered in all BC subtypes, leaving for a next deepening the understanding of the clinical utility of switch genes deregulated in a subtype-specific manner. So, here we focused our attention on switch genes specific for the most aggressive subtype (basal-like) to screen which of them, deregulated in BC patients, significantly associated with the survival of basallike affected subjects. Correlation analyses were performed, and the results were complemented with further studies at both DNA and protein level, to investigate whether variations in the expression of identified prognostic switch genes could be related to genetic (copy number variations), epigenetic (DNA methylation differences), and transcription factor activities. Also, changes in the protein expression levels were evaluated by immunohistochemical staining results taking advantage of the Human Protein Atlas. Overall, our findings led to an in silico recognition of a basal-like prognostic gene signature composed of 11 genes to be further investigated.

Study design
In our recent paper [2], we analysed a total of 505 BC subjects (229 Luminal A, 120 Luminal B, 58 HER2-enriched, and 98 Basal-like) and we identified a total of 108 switch genes (S1 Table) that were specific for the most aggressive BC subtype, i.e. the basal-like subtype [5][6][7]. In the present study, we aim to predict important prognostic biomarkers among these basal-like specific switch genes. A schematic for our study design is depicted in Fig 1.

Prognostic value of basal-like specific switch genes
In order to study the clinical relevance of the basal-like specific switch genes with respect to the patients' survival, we exploited their expression profiles to perform the Kaplan-Meier analysis. We used the RNA-sequencing data available on the TCGA to stratify BC patients in two groups according to the expression levels of the 108 basal-like specific switch genes. Thus, for each switch genes, low (high)-expression groups refer to patients with the expression level of that gene lower (greater) than the median of its expression values across all BC patients. Then, a log-rank test was performed to assess a statistical significance (p-value) to each gene: the lower the p-value, the better the separation between the two prognosis groups. Switch genes with log-rank p-values less than 0.05 were candidate as potential biomarkers for predicting the survival rate of breast cancer patients. We found a total of 15 basal-like specific switch genes that were significantly associated (p-values � 0.05) with BC patients' prognosis. Among them, 11 switch genes (i.e., CENPN, LRP8, DSCC1, CTPS, RCOR2, GINS4, TUBA1C, PRAME, SLC7A11, CDCA7, GSDMC) appeared to be an unfavourable prognostic gene (Fig 2), suggesting that their higher expression could be associated with poorer BC patients' overall survival (OS). The other four switch genes (i.e., NXNL2, PHGR1, LOC389033, C10orf79) appeared as a favourable prognostic gene since their high expression correlated with a better clinical outcome (S1 Fig). Hereafter, we focused only on the 11 basal-like specific switch genes whose activation appeared to be associated with the worst prognosis. Their clinical relevance was also confirmed using other BC datasets collected in the Kaplan-Meier plotter website [8] (Table 1, RNA level).

Overexpression of the basal-like prognostic biomarkers
A differential expression analysis showed that the 11 basal-like specific switch genes, whose unfavourable prognostic value was statistically significant from the previous survival analysis, were all up-regulated in the basal-like cancer condition compared to the normal condition (S2 Fig). Yet, by performing an ANOVA test and multiple pairwise-comparisons among all the BC subtypes, we found that each comparison is statistically significant and the expression value of the 11 basal-like specific switch genes is greater in the basal-like versus the others BC subtypes and always greater than the median used in the KM survival analysis, leading to an association between worst prognosis patients (high-expression groups in the KM plots) and basal-like affected subjects (Fig 3). Taken all together, these findings prompted us to identify these 11 switch genes as potential prognostic biomarkers for basal-like subtype.
To statistically quantify the increasing trend of the median expression values of these 11 switch genes as the phenotype varies from physiological to pathological condition passing across the different BC subtypes, we exploited a linear regression model, where the index Rsquared estimates the goodness-of-fit. We found that all but one showed a very strong straight-line relationship (R-squared � 0.7) between their median expression and the tumour subtypes (Table 1, RNA level), with the CENPN as the first on the list (R-squared = 0.99). These results were mostly confirmed by performing the same analysis using the pathological staging of the BC patients affected by PAM50 subtypes (Table 1, RNA level). Indeed, we observed that 6 basal-like specific switch genes (i.e., CENPN, DSCC1, CTPS, GINS4, TUBA1C, PRAME) reached an R-squared (rounded to one decimal place) � 0.7 also with respect to the staging (Table 1, RNA level). The increasing trend of the top-ranked switch  genes (highest R-squared) both with respect to the subtypes and the staging is depicted in Fig  4a and 4b, respectively.
In order to explore the expression patterns of the proteins encoded by the 11 prognostic switch genes, we queried the Human Protein Atlas (HPA) that provided representative immunohistochemistry images in BC tissues and normal breast tissues. As expected, we found that six of these proteins were overexpressed in BC tissues compared to normal breast tissues (

Gene regulatory network of the basal-like prognostic biomarkers
To provide some hints on which transcription factors (TFs) could regulate the expression of the 11 switch genes proposed as prognostic biomarkers for basal-like subtype, we built a gene regulatory network by combining information on both computationally predicted and experimentally validated TF-target relationships. In particular, we firstly exploited Pscan web tool [9] to predict TFs putatively able to bind the promoter regions of the selected switch genes. Then, we filtered the Pscan predictions keeping only the TFs known to physically interact with at least one switch genes in the human interactome [10]. These TF-target relationships were finally complemented with those experimentally validated from TRRUST database [11]. The final gene regulatory network was composed of seven switch genes and twelve TFs, including well-known TFs that, if deregulated, contribute to neoplastic transformation as MYC, TP53 and NFKB1 (Fig 6a and Table 1, DNA level). Interestingly, among the detected TFs, we found also four TFs (i.e., TP63, TWIST2, HIC1 and RARA) whose high expression appeared to be associated with the best prognosis for BC patients (Fig 6b). In accordance with this result, we observed that these four favourable TFs reached their highest value in the patients affected by the less aggressive BC subtype, i.e., luminal A (Fig 6c). It is worth noting that the other TFs of the gene regulatory network, in general, did not show a relevant increasing/decreasing trend  across the different BC subtypes (Fig 6c), indicating that the overexpression of their target basal-like specific switch genes maybe not ascribed to their transcriptomic variations but rather to other genetic and/or epigenetic alterations.

Genomic and epigenomic alterations of the basal-like prognostic biomarkers
Next, we investigated if the overexpression of the 11 basal-like prognostic biomarkers may depend on basal-like specific genomic alterations, such as Copy Number Variations (CNVs) and/or epigenomic alteration such as DNA methylation changes. In particular, we compared the CNVs and DNA methylation status of these 11 genes in basal-like subtype with respect to the less aggressive BC subtype, i.e., luminal A. The CNVs analysis was performed on a total of 317 TCGA-BRCA patients (92 basal-like and 225 luminal A) for which CNVs data were available. Hierarchical clustering analysis on this data identified three main clusters and showed a different pattern of amplification and deletion in the selected genes between basal-like and luminal A patients (Fig 7a). Interestingly, Cluster 1 appears to be enriched in basal-like samples (64/67, 96%), whereas Cluster 2 (151/ 177, 85%) and Cluster 3 (71/73, 97%) are enriched in luminal A samples. Specifically, most of the basal-like patients belong to Cluster 1 (64/92, 70%; highlighted in dark blue in Fig 7a) and almost all luminal A belong to Cluster 2 and Cluster 3 (222/225, 99%; highlighted in green in Fig 7a). Cluster 1 features are mostly related to DSCC1, GSDMC amplifications (> 1 copy amplification per gene) along with TUBA1C deletion (>1 copy deletion per gene).
Aberrant DNA methylation is another epigenetic alteration that plays a fundamental role in precipitating the development of a large and diverse number of human cancers [12]. For this reason, we investigated a potential correlation between DNA methylation patterns and mRNA expression profiles of the 11 basal-like prognostic biomarkers in basal-like and luminal A patients. The DNA methylation data analysis was performed on a total of 152 TCGA-BRCA patients (37 basal-like and 152 luminal A) for which DNA methylation data were available. Hierarchical clustering analysis on this data identified two main clusters and showed a different DNA methylation status of the selected genes between basal-like and luminal A patients (Fig 7b). In particular, Cluster 1 is enriched in basal-like patients (25/37, 68%, highlighted in dark blue in Fig 7b) and could be associated with a low methylation level especially for   Fig 7b).
We compared the frequency of amplification and deletion events between basal-like and luminal A, using Fisher's exact test (S2 Table) and we assessed the levels of methylation of the 11 basal-like prognostic biomarkers in the two groups (Fig 7c). We observed different scenarios of CNV alteration along with DNA methylation status of the 11 basal-like prognostic biomarkers. CTPS, CENPN and PRAME had a higher frequency of amplification events (> 1 copy amplification per gene) in basal-like, higher frequency of deletion events in luminal A group (p < 0.05, Fisher exact test) and they are hypomethylated in basal-like patients (Fig 7c). This first scenario showed the highest concordance between CNV alteration, DNA methylation levels and mRNA overexpression of these three genes in the basal-like group. Then, GSDMC is characterized by a higher frequency of amplification events in the basal-like group Kaplan-Meier analyzes to evaluate the correlations between the expression of the TFs and the OS in TCGA breast invasive carcinoma patients. Low-and high-expression groups refer to patients with expression levels lower and greater than the 50th percentile, respectively. c) Expression of the TFs in the gene regulatory network across the PAM50 breast cancer subtypes. The black dashed line reported in each plot indicates the median value used in the Kaplan-Meier survival analysis to split the low-expression and high-expression group. One-way ANOVA test was used to compare the means of the selected genes among the patients' groups. T-test was used to perform multiple pairwise-comparisons and statistical significance was indicated by the star symbols (i.e., ns: p > 0.05, � : p � 0.05, �� : p � 0.01, ��� : p � 0.001, ���� : p � 0.0001).
https://doi.org/10.1371/journal.pone.0264024.g006 (p < 0.05, Fisher exact test) and is hypomethylated in basal-like patients (Fig 7c), probably overlapping with its mRNA overexpression in the basal-like group. LRP8 is more amplified in the basal-like group and more deleted in luminal A patients (p < 0.05, Fisher exact test), supporting a putative correlation with its mRNA overexpression in the basal-like group. DSCC1 and CDCA7 had a higher frequency of amplification in basal-like patients (p < 0.05, Fisher exact test), which could be correlated with their mRNA overexpression in that group. Difficult to place is the result of TUBA1C, as we found that this gene has a higher frequency of deletion events in basal-like compared to luminal A group.

Discussion
BC is the neoplasia with the highest incidence and mortality affecting women worldwide [13] and is routinely analysed for ER, PR and HER2 using IHC-based assessment of protein expression levels and frequency [14]. This information is both prognostic and predictive, reflecting critical growth factor signalling dependencies that can be targeted for therapeutic benefit. Thanks to microarray technology, an intrinsic list of 496 genes used to classify BC into four molecular subtypes was identified [15,16]. This makes BC a heterogeneous group of tumours that are diverse in behaviour, outcome, and response to therapy. Among the four intrinsic molecular subtypes (Luminal A, Luminal B, HER2 positive and Basal-like), the basal-like has the worst prognosis as often aggressive and highly recurrent lesions. Basal-like subtype lacks expression of the ER, PR, and HER2 [17] and histologically shows a high grade, high mitotic indices, presence of central necrotic or fibrotic zones, pushing borders of invasion, lymphocytic infiltrate and atypical medullary features [18]. These features limit therapeutic response and impact the refractory nature of these tumours [19], thus, basal-like patients have a poor prognosis and short-term disease-free (DFS) and OS. Then, finding key genes associated with basal-like subtype aggressiveness would help identify prognostic biomarkers for survivals of BC patients as well as the most suitable target genes for new anticancer treatments [20].
Thanks to large international consortia such as The Cancer Genome Atlas (TCGA) [15] and the International Cancer Genome Consortium (ICGC) [21,22], significant inroads have been made characterizing the genomic diversity of BC using next-generation sequencing of RNA and DNA from human clinical samples. SWIM is a novel promising tool that builds upon the structural properties of gene co-expression networks to unveil key genes (called switch genes) likely associated with drastic physiological changes in many biological settings [23,24]. Until now, the relevance of switch genes related to an observed phenotype has been widely assessed through several applications [2,23,[25][26][27][28][29]. In particular, recently in [2], by using the transcriptomic profiling of TCGA breast collection [30], we analysed a total of 505 subjects for which PAM50 subtypes were provided (229 Luminal A, 120 Luminal B, 58 HER2-enriched, and 98 Basal-like) and compared their expression profiles with those of normal samples to identify switch genes associated with the transition between normal condition and each BC subtype. From this comparative analysis, we found both switch genes shared among four subtypes and switch genes specific for each subtype. In the study carried out in [2], we focused on the common switch genes and performed several in silico analysis and in vitro and ex vivo experiments to highlight molecular signatures shared among all BC subtypes. However, we believe that the in-depth investigation of the subtype-specific switch genes can allow us to find novel putative associations between gene functionality and subtype-specific aggressiveness especially for more aggressive BC subtypes. So, the goal of this study was to identify among the switch genes specific for basal-like subtype, those linked to a poor prognosis. In the wake of our recent study [2], the 108 switch genes found to be basal-like specific have been analysed for their prognostic abilities, and among them, 15 shown a significant prognostic role as demonstrated by Kaplan-Maier curves results. Of these, 11 appeared to be unfavourable prognostic genes (i.e., CTPS, CDCA7, GSDMC, LRP8, TUBA1C, CENPN, PRAME, SLC7A11, GINS4, DSCC1, RCOR2) as their overexpression was found to be associated with poorer OS; this result was confirmed using another BC dataset collection (http:// kmplot.com/analysis/). Interestingly, these 11 switch genes showed their highest mRNA overexpression in the basal-like compared to the other BC subtypes, and this data further strengthens the hypothesis that these switch genes could be poor prognostic biomarkers in basal-like subtype affected patients (Fig 3). After that, by a linear regression model, we found a straightline relationship (from 0.7 up to 0.99) among CENPN, LRP8, DSCC1, CTPS, RCOR2, GINSS4, TUBA1C and PRAME with tumour subtypes and staging, while SLC7A11 and CDCA7 correlated only with subtypes. No correlation between GSDMC with subtypes and staging were found. The protein levels of these 11 switches in BC specimens were evaluated by querying the Human Protein Atlas. IHC results were examined confirming that 6 (CTPS, LRP8, TUBA1C, DSCC1, GINS4, RCOR2) of the 11 proteins were overexpressed in BC tissues compared to normal ones. For the remaining proteins, IHC results were not yet available in the Human Protein Atlas (CDCA7, GSDMC, SLC7A11, PRAME and CENPN), nevertheless, the above citations confirmed us that all these switch proteins were overexpressed both in BC cell lines and tissues. These results led us to suspect their role in the neoplastic transformation. In fact, data from the literature, follow detailed, give to these molecules a tumorigenic characteristics being found deregulated in different human cancers including TNBC subtype, so as to make more robust our findings. CTPS1 (CTP synthase 1) gene, encodes an enzyme responsible for the catalytic conversion of UTP (uridine triphosphate) to CTP (cytidine triphosphate). This reaction is an important step in the biosynthesis of phospholipids and nucleic acids. Increased levels of the protein have been linked to several mammalian cancer types such as sarcoma [31], hepatoma [32,33] and leukaemia [33], where the activity of this enzyme is both transformations-and progressionlinked, marking out this enzyme as an important target in the design of chemotherapy. More important, in vitro experiments performed on BC cell lines demonstrated that CTP depletion results in a senescence-like growth arrest through activation of p53, whereas cells with mutated p53 undergo differentiation or apoptotic cell death [34].
LRP8 (LDL receptor-related protein 8) gene, encodes a member of the low-density lipoprotein receptor (LDLR) family. A recent study demonstrated that LRP8 was more strongly expressed in BC without hormone receptor expression (TNBC and HER2 positive) than in luminal tumours (Luminal A and Luminal B) [35]. Authors found that LRP8 depletion promoted apoptosis, impaired cell proliferation and colony formation suggesting that LRP8 has tumourigenic properties. These findings were further confirmed by experiments showing that LRP8 depletion slowed tumour growth in an in vivo xenograft model. Moreover, inhibition of LRP8 was found to attenuate Wnt/β-catenin signalling to suppress breast cancer stem cells (BCSCs) enriched in TNBC and responsible for chemoresistance and metastasis [35,36].
Tubulin alpha-1C chain is a protein that in humans is encoded by the TUBA1C gene. TUBA1C is a member of the tubulin families and several studies demonstrated that its upregulation promotes oncogenesis and predicts poor prognosis in different tumour types [37,38]. TUBA1C, TUBA1B and the β-tubulin isoform TUBB were found as isoforms with the highest expression levels compared to other isoforms in BC cell lines, and TUBA1C and TUBB were overexpressed in BC tumours compared to the normal breast tissues [39]. Also, the prognostic role of TUBA1C as a marker linked to the progression of BC was highlighted by [40], it was associated with lower OS in BC patients [41], and GTSE1 and TUBA1C combined predicted 100% probability of developing TNBC in whites [42].
Recently, overexpression of DSCC1 (DNA replication and sister chromatid cohesion 1) was found to increase proliferation, invasion and migration of breast carcinoma cells, as well as its knockdown showed opposite outcomes [43,44]. Besides, the authors found that DSCC1 could promote breast carcinoma progression by activating the Wnt/β-catenin signalling and inhibiting p53 protein.
PRAME nuclear receptor transcriptional regulator gene encodes an antigen that is preferentially expressed in human melanomas. The approved mutual link between BC and melanoma conditions emphasized the idea of utilizing this marker for targeting BC progression as well. Indeed, this protein was found to be involved in BC growth and metastasis and promote epithelial-to-mesenchymal transition in TNBC [45][46][47], suggesting that PRAME could serve as a prognostic biomarker and/or therapeutic target in TNBC.
Cancer cell requires excess nutrients to meet their biosynthetic and bioenergetics needs and to maintain appropriate redox balance. Glucose and glutamine are important nutrients supporting cancer cell survival. SLC7A11 (solute carrier family 7 member 11) gene encodes a member of a heteromeric, sodium-independent, anionic amino acid transport system that is highly specific for cysteine and glutamate; imports extracellular cystine coupled to the efflux of intracellular glutamate. SLC7A11 expression can be induced under various stress conditions, likely as an adaptive response to enable cells to restore redox homeostasis and maintain survival under stress conditions [48]. The upregulation of SLC7A11 was found correlated with a poor response to treatment in different cancers including breast [49]. Recent evidence support that cancer cells upregulate SLC7A11 expression through diverse mechanisms to enhance their antioxidant defence and to suppress ferroptosis, a key tumour suppression mechanism [50].
The gasdermin (GSDM) superfamily consist of several molecules involved in cell pyroptosis. Recently, various studies have revealed the dysfunction and abnormal expression of the GSDM family in multiple human cancers, implying the potential roles in tumorigenesis. GSDMC (gasdermin C), a member of GSDM superfamily was found to promote cell proliferation in colorectal cancer [51], and high expression of GSDMC in BC [52] and lung adenocarcinoma [53] correlates with poor survival.
CDCA7 (cell division cycle associated 7), was found to be elevated in various types of human cancer, including colon, lung, prostate and breast cancers [54], suggesting that this protein might play an important role in the development of cancer. Interestingly, CDCA7 is a DNA-binding protein and regulates the gene expression of the tumour-promoting effect of c-Myc and E2F1. Recently the role of CDCA7 in TNBC subtype has been partially clarified and authors found that high expression of CDCA7 was associated with metastatic relapse status and predicted poorer disease-free survival in patients with TNBC via transcriptionally upregulating the expression of EZH2 [55].
Centromere proteins (CENPs), which comprise 18 subtypes, are related dynamically to association and dissociation during mitosis with microtubule regulation. Among the CNPs, the protein encoded by CENPN (centromere protein N) gene, binds directly to the centromere-targeting domain of CENP-A. CENP-N depletion causes down-regulation of several CENPs and is considered essential for making a new centromere. Other functions of CENP-N, including its deregulation in BC are unclear, except the study that associated elevated expression of this protein with significantly increased mortality and risk of recurrence in BC smokers in contrast with non-smokers BC subjects [56].
RCOR2 (REST corepressor 2) is a protein-coding gene. Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and transcription corepressor activity. To date, its involvement in the growth and progression of BC was not yet bee investigated.
GINS4 is a subunit of the GINS complex (GINS1, GINS2, GINS3, and GINS4 subunits) involved in the initiation and progression of DNA replication [57]. GINS4 was found highly expressed in lung, bladder and colorectal cancers, and its downregulation in the bladder and colorectal cancers inhibits growth and cell cycle and accelerate cell apoptosis progression in vitro as well as inhibits tumorigenesis in vivo [58,59]. As for RCOR2 protein, GINS4 involvement in the growth and progression of BC was not yet bee investigated.
Based on these findings, we felt compelled to understand which regulatory events might be responsible for their upregulation in basal-like subtype. So, we investigated whether the deregulated expression of the selected switch genes could be related to the activity of known transcription factors, copy number variation and DNA methylation. The construction of a gene regulatory network showed how these switch genes interact with several TFs known to be altered in cancer condition (MYC, TP53 and NFKB1), including in TNBC [60][61][62]. Nevertheless, we did not expect, but we were not surprised, that some of the identified TFs (TP63, TWIST2, HIC1 and RARA) were overexpressed in luminal A rather than in basal-like patients. So, being found also linked to a better prognosis, these results bring us to the hypothesis that these TFs could not be involved in the basal-like switch genes activation. Interestingly, we found that for most of the 11 switch genes their overexpression seems to be ascribed to genetic and/or epigenetic alterations. Indeed, we found that CTPS, CENPN, PRAME and GSDMC were found both hypomethylated and amplified in basal-like subtype as well as, except for GSDMC, also deleted in luminal A subtype; together these results are strongly in line with their expression data alterations found in the basal-like subtype. In the same way, also DSCC1 and CDCA7 were found amplified in basal-like, and CNVs profiles analysis demonstrated that the copy number amplification of two switch genes, DSCC1 and GSDMC, clustered for basal-like patients. Results on TUBA1C were somewhat controversial as this gene was found to be amplified in luminal A subtype and no genetic or epigenetic changes were found in basal-like subtype; for this switch gene seems that neither amplification nor methylation status is responsible for its overexpression in the basal-like subtype. Taken together these data enrich the pathophysiological and prognostic role of these genes in BC basal-like subtype.

Limitations of the study
The first limitation of this study is that it is based on gene expression data and, it would need further deepening at the protein level as soon as proteomic data will be available on large scale for the disease covered in this analysis. However, even if the cause-effect relationship cannot be directly inferred by expression data, correlation networks may highlight disease co-modulated genes that are functionally coordinated in response to an external stimulus, implying that they may be part of the same complexes or pathways, and may influence each other or maybe influenced by the same underlying mechanism(s).
A further limitation of this study is that our entire results should be validated by using another independent dataset. However, the proposed bioinformatics pipeline requires a huge quantity of transcriptomic, genomic, epigenomic, and clinical data related to patients affected by different breast cancer subtypes and, currently, TCGA is the only free repository providing simultaneously all this information for the same patient cohort.
Lastly, it would have been very interesting to correlate the expression of the 11 genes constituting the basal-like gene signature with the Ki-67 labeling index (Ki-67LI), defined as the percentage of Ki-67 antigen positive cells. Indeed, Ki-67LI is commonly used as proliferation marker and it has frequently been associated with the clinical outcome of TNBC patients [63,64]. Unfortunately, TCGA does not provide this index among the clinical data of the patients affected by breast cancer.

Conclusions
In conclusion, our study showed that 11 basal-like specific switch genes are overexpressed in BC tissues compared to normal counterpart and associated with BC patients prognosis acting as unfavourable prognostic markers. Also, their highest expression was found in the basal-like subtype and this overexpression could be putatively related to genetic and epigenetic alterations as well as the action of important transcription factors. Taken together, these results turn on a beam of light on CTPS, CDCA7, GSDMC, LRP8, TUBA1C, CENPN, PRAME, SLC7A11, GINS4, DSCC1 and RCOR2 that can constituite a gene signature to evaluate the prognosis of basal-like breast cancer patients independently from the therapeutic intervention. It is worth to stress that our study has a purely computational nature and experimental validations would be necessary to investigate the actual role of the identified genes in the framework of basal-like breast cancer. However, we belive that our findings could provide advancements in the ongoing effort to identify specific prognostic biomarkers for basal-like subtype in order to improve the clinical management of this disease.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA) is a comprehensive project born in 2006 from the joint effort between the National Cancer Institute and the National Human Genome Research Institute to improve diagnosis methods and treatments against cancers [30]. This project molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types and, in the last years, generated over 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data. All this data, which has already lead to improvements in our ability to diagnose, treat, and prevent cancer, will remain publicly available for anyone in the research community to use. In this study, we exploited TCGA to obtain transcriptomic, clinicopathological, Copy Number Variations (CNVs) data referring to patients affected by breast invasive carcinoma. Male samples, as well as samples undergoing a neoadjuvant treatment, were removed from the cohort under study.

The Human Protein Atlas
The Human Protein Atlas is a research program initiated in 2003 to map all the human proteins in cells, tissues and organs using an integration of various omics technologies, including antibody-based imaging, mass spectrometry-based proteomics, transcriptomics and systems biology [65]. All the data in the knowledge resource is open access to allow scientists both in academia and industry to freely access the data for exploration of the human proteome. In this study, the Human Protein Atlas website (https://www.proteinatlas.org) was leveraged to identify tumour-type specific proteins expression patterns and to perform immunohistochemistry image a direct comparison of the protein expression of selected prognostic indicators between normal and tumour breast tissues.

SWIM software
SWIM (SWitchMiner) is a new methodology that considers differentially expressed genes within the co-expression network framework to predict important genes affected by a disease of interest, and combines this information with a structured network of correlated patterns. Considering the topological properties of the nodes and assessing their functional roles according to their ability to convey information within and between modules in the network, SWIM identifies a small pool of genes (known as switch genes) that are associated with intriguing patterns of molecular co-abundance and play a crucial role in the observed phenotype (transitions) [23]. SWIM is a freely available software developed in MATLAB that implements a series of well-defined steps described in details in the Supplementary Information of [23]. Up to now, SWIM has sparked a widespread interest within the scientific community thanks to the promising results obtained through its application in a broad range of phenotype-specific scenarios, spanning from complex diseases to grapevine berry maturation [23,[25][26][27][28]66].

Kaplan-Meier survival analysis
To analyze the correlation between the expression level of the 108 basal-like specific switch genes and patient overall survival (OS) and therefore to evaluate their prognostic value, we used the RNA-sequencing data from TCGA to split the entire cohort of BC patients (1049 samples) into two groups (called low-expression and high-expression group) according to the upper and lower expression quartile. In particular, low-and high-expression groups refer to patients with expression levels of the given switch gene lower and greater than the 50th percentile (i.e., median), respectively. For each patient cohort, the cumulative survival rates were computed for each switch gene according to the Kaplan-Meier (KM) method [67] on the clinical metadata provided by TCGA. For each switch gene, the survival outcomes of the two patients groups were compared by the log-rank test. Switch genes with log-rank pvalues less than 0.05 were suggested as candidate prognostic biomarkers. In particular, the lower the p-value, the better the separation between the two prognosis groups. If the group of patients with high expression of the selected prognostic gene has a higher observed event than expected event (worst prognosis), it is defined as an unfavourable prognostic gene; otherwise, if its high expression is associated with the best prognosis, it is a favourable prognostic gene.
To confirm the prognostic value of the basal-like specific switch genes points out from the KM survival analysis on the TCGA breast invasive carcinoma patients, we performed the KM analysis on different breast cancer dataset. To do this, we exploited the Kaplan-Meier plotter website (http://kmplot.com/analysis/), which integrates gene expression data and OS information downloaded from GEO, EGA and TCGA for several types of cancer [68]. We ran Kaplan-Meier plotter by considering the entire breast cancer database including 7,830 unique samples from 55 independent affymetrix datasets [69] and by dividing patients into high and low expression group based on the auto selected best cuttoff computed between the lower and upper quartiles of switch genes expression.

Statistical methods
The one-way analysis of variance (ANOVA) is an extension of independent two-samples t-test for comparing means in a situation where there are more than two groups. In one-way ANOVA, the data is organized into several groups based on one single grouping variable (also called factor variable). In this study, the one-way ANOVA test was used to compare the means of selected genes in patients grouped based on the PAM50 breast cancer subtypes. A p-value � 0.05 indicated that at least two groups significantly differ from each other and multiple pairwise-comparisons exploiting the t-test method were performed to identify which ones.

Gene regulatory network
The gene regulatory network of the selected switch genes was constructed by integrating information from Pscan [9], TRRUST [11] and the human interactome (i.e., that is the network of all physical interactions within a cell, from protein-protein to regulatory protein-DNA and metabolic interactions [70]).
Pscan is a web tool designed to computationally predict TF-target regulatory relationships [9]. In particular, it scans the sequence of the promoter regions from an input gene list with motifs describing the binding specificity of known transcription factors and assesses which motifs are significantly over-or under-represented, suggesting which transcription factors could be common regulators of the input genes. In this study, the promoter regions were identified as the genomic regions spanning from -450 to +50 nucleotides to transcription start sites and the TF binding profiles were retrieved from JASPAR 2018 database [71].
TRRUST is a freely available and manually curated database containing 8,444 TF-target regulatory relationships of 800 human transcription factors. These relationships have been derived from PubMed articles describing small-scale experimental studies of transcriptional regulations by using a sentence-based text mining approach [11].
The human interactome, also called protein-protein interaction (PPI) network, was downloaded from Cheng and coauthors [10], where the authors assembled their in-house systematic human interactome with 15 commonly used databases with several types of experimental evidences (e.g., binary PPIs from three-dimensional protein structures; literature-curated PPIs identified by affinity purification followed by mass spectrometry, Y2H, and/or literaturederived low-throughput experiments; signalling networks from literature-derived lowthroughput experiments; kinase-substrate interactions from literature-derived low-throughput and high-throughput experiments). This version of the interactome is composed of 217,160 protein-protein interactions connecting 15,970 unique proteins.

Copy Number Variations (CNVs) data analysis
Copy Number Variations (CNVs) data of TCGA-BRCA project were retrieved from TCGA repository and reported contiguous chromosome regions with log2 ratio segment means in a tab-delimited format. To obtain segment means values of CNVs of the selected genes for the enrolled patients, we employed GISTIC 2.0 software [72]. Gistic's parameters used in this study are the following: The hierarchical clustering analysis was performed by using "Canberra" as clustering distance and "ward.D2" as clustering method. The association between the CNVs status of the selected genes and the BC subtypes was evaluated using Fisher's exact test.

DNA methylation data analysis
DNA methylation data of TCGA-BRCA project were retrieved by Firehorse TCGA GDAC browser (https://gdac.broadinstitute.org/). The methylation data were acquired by the Illumina 450K array, which measures the level of methylation as a beta value for more than 450 000 CpG sites on the Illumina chip. The data contained information for about 485 578 CpG sites. To make available and pre-process methylation data in R environment, we used minfi package [73]. Pre-processing was performed using an in-house R scripts that eliminated probes with no methylation level detectable, removed all known single-nucleotide polymorphism (SNP)-associated CpG sites, associated CpG sites with known genes and matched patients and genes selected in our study. The hierarchical clustering analysis was performed by using "Euclidean" as clustering distance and "ward.D2" as clustering method.  Table. Basal-like specific switch genes. The table is composed of two separated sheets. The first sheet reports the complete list of 108 switch genes found to be specific for basal-like breast cancer subtype. The second sheet reports the complete list of 11 switch gene whose activation was found to be associated with the worst prognosis from the KM survival analysis on the TCGA breast invasive carcinoma patients. (XLSX) S2