Consensus Analysis of Whole Transcriptome Profiles from Two Breast Cancer Patient Cohorts Reveals Long Non-Coding RNAs Associated with Intrinsic Subtype and the Tumour Microenvironment

Long non-coding RNAs (lncRNAs) are emerging as crucial regulators of cellular processes and diseases such as cancer; however, their functions remain poorly characterised. Several studies have demonstrated that lncRNAs are typically disease and tumour subtype specific, particularly in breast cancer where lncRNA expression alone is sufficient to discriminate samples based on hormone status and molecular intrinsic subtype. However, little attempt has been made to assess the reproducibility of lncRNA signatures across more than one dataset. In this work, we derive consensus lncRNA signatures indicative of breast cancer subtype based on two clinical RNA-Seq datasets: the Utah Breast Cancer Study and The Cancer Genome Atlas, through integration of differential expression and hypothesis-free clustering analyses. The most consistent signature is associated with breast cancers of the basal-like subtype, leading us to generate a putative set of six lncRNA basal-like breast cancer markers, at least two of which may have a role in cis-regulation of known poor prognosis markers. Through in silico functional characterization of individual signatures and integration of expression data from pre-clinical cancer models, we discover that discordance between signatures derived from different clinical cohorts can arise from the strong influence of non-cancerous cells in tumour samples. As a consequence, we identify nine lncRNAs putatively associated with breast cancer associated fibroblasts, or the immune response. Overall, our study establishes the confounding effects of tumour purity on lncRNA signature derivation, and generates several novel hypotheses on the role of lncRNAs in basal-like breast cancers and the tumour microenvironment.


Results
The UBCS RNA-Seq dataset was derived from fresh frozen breast tissue samples obtained from 88 women who had surgery at the Huntsman Cancer Hospital from 2009-2012. These included tumour tissues from 69 breast cancer patients, of which 51 ER+ and 12 triple-negative breast cancer (TNBC) tumours were selected for this study. A mean of 18,704,489 reads were uniquely mapped to the human genome corresponding to a mapping success rate of 87% (S1 Table). To achieve consistency with UBCS, TCGA RNA-Seq reads across 271 ER+ and 68 TNBC patients were re-mapped using a similar protocol (see Methods), resulting in a mean of 67,741,640 reads uniquely mapped to the human genome and a mapping success rate of 86% (S2 Table).

LncRNA expression generates clusters that correspond to hormone status and intrinsic subtype
We applied non-negative matrix factorization (NMF) [19] [20] to cluster 932 and 588 of the most highly expressed ((mean FPKM + standard deviation (SD))>1.00) and variable (coefficient of variation (CV)>0.10) GENCODE [10] annotated lncRNAs across UBCS and TCGA cohorts respectively, and tested whether lncRNA expression signatures allow separation of breast cancer into distinct subtypes. Stable clusters were achieved at k = 4 (UBCS; Fig 1A) and k = 3 (TCGA; Fig 1B) where k denotes the number of clusters selected according to the procedure outlined in Methods. Model-to-cluster mappings for both UBCS and TCGA are given in S1 and S2 Tables respectively, and genes deemed as key drivers of the clustering (meta-genes) are listed in S3a (UBSC) and S3b (TCGA) Table. Across both datasets, clusters broadly corresponded to breast cancer hormone status. Of the four UBCS clusters, clusters three and four consisted exclusively of ER+ tumours, and cluster two of TNBC tumours, with cluster one comprising a "mixed" theme of two ER+ and three TNBC tumours (Fig 1C; S4a Table). The two ER+ clusters were differentiated according to PAM50 intrinsic subtype [21] with luminal A tumours comprising the majority (16/19) of cluster three, and the remainder of the cluster including three normal-like tumours. Cluster four encompassed both ER+ luminal A (19/30) and luminal B (11/30) tumours. All TNBC tumours of cluster two corresponded to the basal-like subtype, whereas the three TNBC tumours in cluster one were classified as basal, HER2-and normal-like subtypes. Clusters derived from TCGA followed a similar pattern (Fig 1D; S4b Table), with themes relating to ER+ luminal A (cluster one; 91/113 tumours), basal-like (cluster two; 60/86) and ER+ luminal A/B (cluster three; luminal A: 76/140, luminal B: 59/140). Whilst the majority of cluster two consisted of TNBC, all eight ER+ basal-like tumours were members of the cluster two, further suggesting separation was being driven by intrinsic subtype rather than hormone status. Note that no ER+ basal-like tumours were present in the UBCS dataset.

A consensus set of basal breast cancer lncRNA markers
By comparing the cluster meta-genes of UBCS and TCGA (Table 1), we found significant overlap (p<0.001 by hyper-geometric test) between the TNBC basal clusters, and between the ER+ luminal A/B clusters (p<0.001). By contrast, very little overlap was observed between the ER+ luminal A cluster, and none of the 14 genes driving the UBCS mixed subtype cluster achieved the expression and variance criteria in the TCGA sample set.
Of the nine lncRNAs defined as meta-genes by NMF in both the UBCS and TCGA basallike clusters, six were also significantly over-expressed (log 2 FC>1.00; adjusted p<0.05) in basal-like compared to other breast cancer subtypes (Table 2; Fig 2; S5a and S5b Table). We therefore classed these six genes as candidate lncRNA markers of basal-like breast cancer.
For the majority of markers, there was a clear distinction between high expression levels in basal-like tumours compared to other subtypes (Fig 2). Exceptions were CTD-2015G9.2 ( Fig  2A) and LINC01198 (Fig 2D), both of which are also expressed in the normal-like tumours. Furthermore, expression was typically low across an extra 19 ER-/PR-/HER2+ TCGA tumours classified as HER2-enriched TCGA tumours compared to basal-like tumours (S1 Fig).

Relationship between lncRNA basal-like breast cancer markers and neighbouring genes
Recent observations have shown that cis-acting lncRNAs tend to correlate strongly with their neighbouring genes [22]. Therefore, to determine potential cis-regulatory functions, we first  Long Non-Coding RNA Markers of Breast Cancer Subtype and Tumour Microenvironment defined neighbouring genes of each potential basal-like breast cancer lncRNA marker (Table 2) using GREAT [23] with the "basal plus extension" setting, and then calculated the Pearson correlation coefficient (r) between each lncRNA expression profile and its neighbouring genes across the TCGA breast cancer cohort. We repeated this calculation across all other cancer types represented in the TCGA in which the lncRNA achieved expression ((mean FPKM+SD)>1.00) and variability (CV>0.10) thresholds. We sought consistently high correlation across a number of cancer types as support for a potential cis-regulatory relationship. Three of the six markers (RP11-19E11.1, LINC00393 and CTD-2015G9.2) achieved significant correlation (p<0.0001) with a neighbouring gene (Table 3; Fig 3A-3C; S6 Table) across TCGA breast cancer tumours. Of note was the high correlation achieved between RP11-19E11.1 and the transcription factor engrailed 1 (EN1; r = 0.90; Fig 3A). EN1 is over-expressed in basal-like breast cancer [24], and achieved significant differential expression between basal and other breast cancers in both UBCS (log 2 FC = 4.03, p = 2.26E-22) and TCGA (log 2 FC = 3.01, p = 8.04E-71) datasets. EN1 is also consistently correlated with RP11-19E11.1 across multiple cancers, achieving the highest correlation amongst 17308 PC genes in 7/11 cancers ( Fig 3A; S7a Table). Similarly, LINC00393 achieved significant correlation with the transcription factor krueppel-like factor 5 (KLF5; r = 0.45; Fig 3B), whose high expression in basallike breast cancers [25] was supported by both UBCS (log 2 FC = 2.18, p = 6.52E-05) and TCGA (log 2 FC = 1.73, p = 1.07E-30) datasets. Its correlation with LINC00393 also ranked highly in lung squamous cancer compared to other PC genes ( Fig 3B; S7b Table). The strong correlation between CTD-2015G9.2 and FOXL1 forkhead box L1 (FOXL1) in breast cancer (r = 0.73; Fig  3C) was repeated across 4/8 cancers (S7c Table). By contrast, no significant correlation was observed between the three remaining lncRNAs, CTD-2527I21.15, RP11-10A14. 5 and LINC00198, and their neighbouring genes across breast cancer ( Fig 3D-3F), although there was evidence for a cis-regulatory role across other cancers. For example, protein phosphatase 1 regulatory subunit 3B (PPP1R3B) was ranked in the top 100 most correlated PC genes with RP11-10A14.5 in 10/19 cancers (Fig 3D; S7d Table), and FXYD domain containing ion transport regulator 3 (FXYD3) with CTD-2527I21.15 in 8/14 cancers (Fig 3E; S7e Table). Interestingly, a higher correlation across basal-like compared to other breast cancers was observed for both PPP1R3B (basal-like: r = 0.29, other: -0.06) and FXYD3 (basal-like: r = 0.52, other: 0.07), significant at p<0.01 for FXYD3, suggesting their cisregulatory role is restricted to the basal-like subtype. A similar pattern was observed between LINC01198 and its nearby PC gene LCP1, where correlation increased from r = 0.06 across non-basal-like to r = 0.21 across basal-like breast cancers, although this was not significant and there was no evidence of cis-regulation of LCP1 by LINC01198 in other cancer types ( Fig 3F; S7f Table).

Identification of clusters associated with the tumour microenvironment
To understand the poor overlap between ER+ luminal A clusters, and characterize the mixed subtype cluster of UBCS, we functionally profiled these clusters from both TCGA and UBCS  Table. doi:10.1371/journal.pone.0163238.g003 using the guilt-by-association approach [14]. Briefly, the Pearson correlation coefficient (r) was calculated between each member of the lncRNA meta-gene and the 11027 and 12410 PC genes achieving the detection ((mean FPKM+SD)>1.00) and variability (CV>0.10) thresholds across UBCS and TCGA respectively. PC genes achieving r>0.60 (UBCS) or r>0.50 (TCGA) were then input to DAVID functional enrichment analysis [26]. The majority (9/14) of lncRNA meta-genes of the UBCS mixed subtype cluster were associated with the immune response or related processes (S9a Table). 8/20 lncRNAs of the TCGA ER+ luminal A cluster were associated with the extra-cellular matrix (S9b Table), and all achieved significant correlation with at least one of two established cancer-associated fibroblast (CAF) markers: fibroblast activation protein (FAP) and actin, alpha 2, smooth muscle, aorta (ACTA2; S9b Table). By contrast, no significant correlation was observed between these two genes and any of the lncRNAs related to the immune response.
These findings were supported by ESTIMATE [27] prediction of tumour purity across each cluster. High stromal ( Fig 4A) and immune cell ( Fig 4B) content was observed in samples of the UBSC mixed subtype cluster and this contributed significantly to their low tumour purity (0.33±0.05; p<0.001 by T-test) compared to the other three clusters (Fig 4C). By contrast, high stromal cell content was observed in some ER+ luminal A TCGA samples ( Fig 4D) but there was no significant difference in immune cell score (Fig 4E) or tumour purity (Fig 4F) between the clusters. Overall, only 1.4% (5/339) TCGA samples achieved immune cell ESTIMATE score>2000 compared to 11/63 (17.4%) of UBCS samples. Neither sample cohort had been subject to micro-dissection to separate the tumour from non-tumour cells. We next checked for evidence of expression in samples expected to consist of exclusively tumour cells, reasoning that lncRNAs expressed in clinical samples that typically contain a proportion of non-tumour cells (TCGA: mean tumour purity = 0.60±0.17, UBCS: 0.62±0.15), but with little or no expression in samples of high tumour cell purity, were likely stromal or immune cell specific. To do so, we calculated lncRNA expression levels across 41 breast cancer cell lines from the Cancer Cell Line Encyclopaedia [28] (CCLE; mean tumour purity = 0.99 ±0.01), and tumours from 10 breast cancer patient derived xenograft (PDX) models [29] (mean tumour purity = 0.99±0.01), in which tumour had been separated from stroma using an in silico species-specific mapping strategy [30]. For 5219 lncRNAs common to all three datasets, we then compared both median cell line and PDX expression to median expression across 47 UBCS samples achieving tumour purity>0.70.
Note that the three datasets were generated using different sequencing protocols and so subject to a number of confounding factors. Therefore this was not intended as a rigorous statistical assessment, rather a conservative guide to genes consistently over-expressed in cell lines and the tumour component of PDX models compared to clinical samples.
198 lncRNAs achieved log 2 FC>0.50 in both comparisons, and median FPKM<0.50 across cell lines and PDX tumours (S10 Table), thus were classed a potentially stromal or immune cell specific (SIC). Included in this list was the maternally expressed 3 gene (MEG3), one of the few lncRNAs known to be preferentially expressed in tumour stroma [31]. This achieved the highest fold changes in both cell lines (log 2 FC = 6.74) and the PDX tumour component (log 2 FC = 6.73), adding confidence to our approach. We observed the greatest overlap between SIC lncRNAs and the UBCS mixed subtype (7/14; 43%) and TCGA ER+ luminal A (7/20; 35%) clusters. No overlap was achieved with the basal-like clusters, and only some overlap with UBCS (3/28; 11%) and TCGA luminal A/B (2/33; 6%) clusters, and UBCS ER+ luminal A cluster (2/12; 17%). SIC lncRNAs indicated by this method are listed in S11 Table. A putative set of breast cancer stromal and immune cell associated lncRNAs By combining evidence from the guilt-by-association and cell line/PDX expression profiling, we derived a set of four and six lncRNAs achieving functional enrichment FDR<0.05 for "immune response" or "extracellular matrix" respectively, and low/undetectable expression in cell lines/PDX tumour (Table 4; S12 Table). All extracellular matrix associated lncRNAs achieved significant Pearson correlation with FAP and ACTA2 (Table 4b; S9b Table), suggesting a potential role in activating fibroblasts. For comparison, we also carried out the same analyses on known stromal lncRNA MEG3 [31], which achieved significant (p<0.0001) correlation with both FAP (r = 0.63) and ACTA2 (r = 0.53), and enrichment for "extracellular matrix" (p = 8.88E-38). Consequently, MEG3 met all criteria for our classification of a stromal cell associated lncRNA.
We next explored whether the three putative immune-response associated lncRNAs were linked with a specific immune cell type. To do so, we selected a set of established markers for immune cell type and calculated the correlation between each marker and the lncRNA (S12 Table). For cell types represented by at least four markers, the median correlations achieved by each cell type were compared (Fig 5). All three immune cell associated lncRNAs achieved the strongest correlations with macrophage cell type markers such as CD68 (RP3-460G2.

Discussion
In this study, we first establish that lncRNA expression alone is sufficient to separate breast cancer into clusters that broadly correspond to intrinsic subtype. This enabled us to identify a consensus set of lncRNAs associated with the basal-like breast cancer, and generate hypotheses on their relationship with neighbouring genes. Secondly, we find evidence that the lack of agreement between signatures derived from other subtypes is a result of the varying degree of stromal/immune cell infiltrate present in a typical clinical sample. Finally, we derive a set of lncRNAs whose expression is specific to the breast tumour microenvironment, with possible roles in activating fibroblasts to support tumour progression, and macrophage recruitment to the tumour.  We derived a set of six high confidence lncRNA basal-like markers, all of which were significantly over-expressed in basal-like tumours from both UBCS and TCGA, and made a significant contribution to the basal-like NMF clusters. Basal-like breast cancers are characterised by aggressive features and frequently associated with poor prognosis. They account for 85% of TNBCs, which lack expression of oestrogen, progesterone or HER2 receptors and as such, fail to respond to hormone targeting therapies. With rigorous follow-up, our lncRNA markers could indicate novel regulatory mechanisms specific to basal-like breast cancer, providing a platform to generate new drug targets.
As an initial exploration, three of these (RP11-19E11.1, CTD-2015G9.2, LINC00393) achieved significant expression correlation with a neighbouring PC gene, suggesting a possible cis-regulatory relationship. Notably, PC genes EN1 and KLF5 neighbouring RP11-19E11.1 and LINC00393 respectively are known markers of the basal-like breast cancer subtype [24] [25]. In all three cases, the relationship was not restricted to breast cancer but extended across at least one other cancer type, and for RP11-19E11.1 and CTD-2015G9.2 across at least half the cancer types in which expression of the lncRNA could be detected. To our knowledge, this represents a unique application of the guilt-by-association approach to identify cis-regulation through consistent pan-cancer co-expression between lncRNAs and their neighbouring genes. By doing so, we may have uncovered a potential route by which these lncRNAs control important basallike PC genes.
For two of the remaining lncRNA-neighbouring PC gene pairs (CTD-2527I21.15 and FXYD3, RP11-10A14.5 and PPP1R3B) co-expression only emerged across basal-like tumours despite consistent pan-cancer correlation. The greatest increase was seen between CTD-2527I21. 15 and FXYD3, a gene not previously associated with basal-like breast cancer but whose expression has been shown to increase in response to oestrogen and tamoxifen [32]. Our results demonstrate the need to consider disease subtype specificity when seeking transcriptional evidence of cis-regulation by lncRNAs.

Tumour purity confounds clustering but identifies tumour microenvironment-associated lncRNAs
Our finding that lncRNA expression clusters broadly correspond to intrinsic molecular subtype is in accordance with previous studies [12][17] [18]. Clear enrichment was observed for ER+ luminal A, ER+ luminal A/B and TNBC/ER+ basal-like subtypes in three of the clusters from both datasets. However, the fourth UBCS cluster comprised a mix of subtypes and showed no clear correspondence to a TCGA cluster. Notably, this cluster appeared to be driven by lncRNAs associated with the immune response, concordant with a greater proportion of samples with predicted high immune cell content in the UBCS cohort than TCGA. We also found that the ER+ luminal A cluster derived from TCGA is driven partially by lncRNAs associated with tumour stroma, a trend not observed in the equivalent UBCS cluster. Our results suggest that signature inconsistency is in part driven by the varying extent of stromal and immune cell infiltrate in patient samples, supporting a recent study that determined the confounding effects of tumour purity on differential expression and co-expression measurements [33].
The presence of non-tumour cells highlighted a small set of lncRNAs whose expression is restricted to the tumour microenvironment. The tumour microenvironment consists of multiple cell types including endothelial cells, adipocytes, CAFs and immune cells such as lymphocytes and tumour-associated macrophages that play a critical role in supporting cancer growth and metastasis [34]. An association between lncRNAs and the immune response has only recently emerged [35] [36], and only a few have been shown to be expressed in endothelial cells [37], and adipocytes [38], with preferential expression of MEG3 [31] and H19 [31] [39] observed in tumour stroma.
The observation that our putative immune cell associated lncRNAs correlate strongly with macrophage markers is consistent with tumour-associated macrophages (TAMs) as the most abundant immunosuppressive cell population in breast tumours. TAMs are frequently associated with poor prognosis [40], and their relationship with tumour cells is currently under intense scrutiny since the disruption of the positive-feedback loop between TAMs and breast cancer cells could inhibit the angiogenic and/or metastatic potential of the tumour. Therefore, lncRNAs could offer a novel avenue to target the TAM population, and supplement immunotherapy.
The high correlation between the six extracellular matrix-associated lncRNAs with CAF markers FAP and ACTA2 may suggest a role for lncRNAs in the acquisition of an activated phenotype by fibroblasts in the tumour stroma. Currently, the mechanisms of fibroblast activation are poorly understood, and so the possibility that numerous lncRNAs may have a role in their regulation opens up an enticing research opportunity. Overall, our discovery of several lncRNAs specifically expressed in either stromal or immune cells should stimulate further studies to determine their precise role in the tumour life-cycle, potentially leading to novel therapeutic strategies.

Conclusions
We have performed a comprehensive analysis of lncRNA in breast cancer that builds on previous findings that lncRNA expression alone is sufficient to separate breast cancer into clusters that broadly correspond to hormone status and intrinsic molecular subtype. By combining two independent clinical datasets, we establish a set of lncRNA markers specific to basal-like breast cancer, representing a preliminary effort to exploit the disease subtype specificity of lncRNAs. With rigorous validation, at least a subset of these could have clinical potential as either biomarkers or therapeutic targets. We also demonstrate the confounding effects of tumour purity as a source of inconsistency in signatures derived from different studies. The presence of nontumour cells in our patient samples provided the opportunity to discover several lncRNAs specifically expressed in the tumour microenvironment. Our list should motivate follow-up studies to establish whether these lncRNAs are key regulators of either macrophage recruitment or fibroblast activation, and thus critical in supporting tumour growth and progression.

RNA-Seq sample preparation and data processing
Utah Breast Cancer Study (UBCS). Fresh frozen breast tissue samples were obtained from 88 women who had surgery at the Huntsman Cancer Hospital from 2009-2012, including tumour tissues from 69 breast cancer patients. One tumour sample yielded poor quality RNA (RIN = 2.5) and was removed from consideration, resulting in a panel of 68 tumour samples. RNA libraries were made with the Illumina TruSeq Stranded mRNA Sample Preparation kit with oligo dT selection according to the manufacturer's protocol. These libraries were then submitted for 50bp single-end sequencing on the Illumina HiSeq 2000 platform using eight samples per lane. For the purposes of this analysis, five breast cancers of the HER2 subtype and one of ambiguous hormone receptor status were ignored. The reads for the remaining 63 were aligned to the human (GRCh38) genome using StarAlign [41] with no more than three mismatches and only uniquely mapped reads allowed. Reads whose ratio of mismatches to mapped length was greater than 0.10 were also discarded. All other parameters were set to their defaults for stranded alignment. Mapped read counts were consistent (14M-23M) across samples, so no samples were removed due to low mapping rate (S1 Table). The expression level, based on Fragments Per Kilobase per Million fragments mapped (FPKM), of each gene present in the human (GRCh38) GENCODEv22 annotation file was estimated using Cufflinks with library type defined as "fr-firststrand" and all other parameters set to defaults [42]. Only genes annotated as "lincRNA" or "protein_coding" were considered. LncRNAs overlapping PC genes were ignored for consistency with the TCGA dataset, as well as genes whose largest transcript is less that 400bp due to potential over-estimation of expression across transcripts less than the average fragment length. The resulting gene-by-sample matrix consisted 19567 protein-coding genes and 6062 lncRNAs across 53 non-basal and 10 basal-like samples. Intrinsic subtypes were assigned according to PAM50 classification [43]. Differentially expressed genes (|log 2 FC|>1.5 and FDR<0.05) were identified using Limma [44] with eBayes function parameter "trend" set to "TRUE" and all other parameters set to their default values. Raw sequence data associated with UBCS have been deposited in the ArrayExpress database (www.ebi.ac.uk/ arrayexpress) under accession number E-MTAB-4993.
The Cancer Genome Atlas (TCGA). Raw FASTQ solid tumour sequence files from 1136 breast cancer patients were downloaded from the Cancer Genomics Hub (CGHub; https:// cghub.ucsc.edu/), and reads aligned to the human (GRCh38) genome using the same procedure as for UBCS except all parameters were set to their defaults for un-stranded alignment. To reduce possible biases introduced by variable total read counts between samples, tumours achieving <20,000,000 mapped reads were removed. FPKM values for each gene present in the human (GRCh38) GENCODEv22 annotation file were calculated as before using Cufflinks with library type defined as "fr-unstranded" [42], and then batch normalized using COMBAT [45]. Sequencing data for all other TCGA cancer types used in this study were processed using the same procedure. The number of tumours used across each cancer type is given in S8 Table. Only tumours classified as either ER+ or TNBC according to TCGA Network [46] were considered in subsequent analyses, and breast cancer samples treated with tamoxifen were discarded. The resulting matrix consisted of 19567 PC genes and 6062 lncRNAs across 271 nonbasal and 68 basal-like samples. Differentially expressed genes were identified using Limma as for the UBCS dataset.
Cancer Cell Line Encyclopaedia(CCLE). BAM files consisting of reads mapped to the human (GRCh37) genome were downloaded from the Cancer Genomics Hub (CGHub; https://cghub.ucsc.edu/) for all breast cancer cell lines represented in the CCLE [28], except those of the claudin-low subtype according to [47], or with fibroblast morphology according to ATCC (http://www.lgcstandards-atcc.org). FPKM values for each gene present in the human (GRCh38) GENCODEv19 annotation file were calculated as before using Cufflinks with library type defined as "fr-unstranded". Of the 6995 lncRNAs annotated in GENCODEv19, 5284 overlapped with v22 based on Ensembl identifier. 38 overlapped by gene name only but were discarded since an Ensembl identifier change indicates that the gene structure has changed significantly between releases. Therefore, the resulting gene-by-sample matrix consisted of 5284 lncRNAs across 41 cell lines.

Clustering of gene expression data with consensus non-negative matrix factorization (NMF)
We applied NMF to cluster breast cancer lncRNA transcriptomes from UBCS and TCGA. Only the most highly expressed and variable lncRNAs were chosen for clustering according to the following criteria: (mean FPKM+SD)>1.00 and CV>0.10, where CV = coefficient of variation. The underlying principle of NMF is dimensionality reduction in which a small number of meta-genes, each defined as a positive linear combination of the genes in the expression data, are identified and then used to group samples into clusters based on the gene expression pattern of the samples as positive linear combinations of these meta-genes. Using the R package NMF [48], factorization rank k was chosen by computing the clustering for k = 2-6 against 50 random initializations of both the actual and a permuted gene expression matrix, and selecting the k value achieving the largest difference between cophenetic correlation coefficients calculated from the actual and permutated data (S1 Fig Table. Rank of correlation between each lncRNA basal-like marker and its nearest neighbours across cancer types. (XLSX) S8 Table. TCGA codes for each cancer type and the number of tumours used in the correlation analyses. (XLSX) S9 Table. Evidence of immune and stromal cell association of lncRNA meta-genes from (a) UBCS and (b) TCGA cluster one. (XLSX) S10 Table. List of 198 lncRNAs expressed in breast cancer patient samples but with undetectable/low expression in cell lines and tumour compartment of patient-derived xenograft models.