Molecular Profiling of Breast Cancer Cell Lines Defines Relevant Tumor Models and Provides a Resource for Cancer Gene Discovery

Background: Breast cancer cell lines have been used widely to investigate breast cancer pathobiology and new therapies. Breast cancer is a molecularly heterogeneous disease, and it is important to understand how well, and which, cell lines best model that diversity. In particular, microarray studies have identified molecular subtypes (luminal A, luminal B, ERBB2-associated, basal-like and normal-like) with characteristic gene-expression patterns and underlying DNA copy number alterations (CNAs). Here, we studied a collection of breast cancer cell lines to catalog molecular profiles and to assess their relation to breast cancer subtypes.
Methods: Whole-genome DNA microarrays were used to profile gene expression and CNAs in a collection of 52 widely-used breast cancer cell lines, and comparisons were made to existing profiles of primary breast tumors. Hierarchical clustering was used to identify gene-expression subtypes, and Gene Set Enrichment Analysis (GSEA) to discover biological features of those subtypes. Genomic and transcriptional profiles were integrated to discover, within high-amplitude CNAs, candidate cancer genes with coordinately altered gene copy number and expression.
Findings: Transcriptional profiling of breast cancer cell lines identified one luminal and two basal-like (A and B) subtypes. Luminal lines displayed an estrogen receptor (ER) signature and resembled luminal-A/B tumors, basal-A lines were associated with ETS-pathway and BRCA1 signatures and resembled basal-like tumors, and basal-B lines displayed mesenchymal and stem/progenitor-cell characteristics. Compared to tumors, cell lines exhibited similar patterns of CNA, but an overall higher complexity of CNA (genetically simple luminal-A tumors were not represented), and only partial conservation of subtype-specific CNAs. We identified 80 high-level DNA amplifications and 13 multi-copy deletions, and the resident genes with concomitantly altered gene expression, highlighting known and novel candidate breast cancer genes.
Conclusions: Overall, breast cancer cell lines were genetically more complex than tumors, but retained expression patterns with relevance to the luminal-basal subtype distinction. The compendium of molecular profiles defines cell lines suitable for investigations of subtype-specific pathobiology, cancer stem cell biology, biomarkers and therapies, and provides a resource for discovery of new breast cancer genes.


Introduction
Breast cancer, a leading cause of cancer death in women, is recognized to be a molecularly heterogeneous disease. Markers such as estrogen receptor (ER), progesterone receptor (PR) and ERBB2/HER2 are used for prognostication, and to stratify patients for appropriately targeted therapies [1].
More recently, DNA microarray studies have suggested a refined classification of breast cancer, distinguishing five major subtypes based on different patterns of gene expression, underlying DNA copy number alterations (CNAs), and associated clinical outcomes [2][3][4][5]. Luminal subtypes A and B are ER positive and share expression markers with the luminal epithelial layer of cells lining normal breast ducts. Luminal-A tumors are genetically simple (1q/16p gain) and are associated with favorable outcome, while luminal-B tumors exhibit high proliferation rates, frequent DNA amplification (e.g. 8q24/MYC), and less favorable prognosis. Basal-like tumors share expression markers with the underlying basal (myoepithelial) layer of normal breast ducts, are ER negative, exhibit frequent chromosome segmental gains/losses, and are associated with poor outcome in most studies. The ERBB2 subtype is associated with expression of genes co-amplified with ERBB2 (encoding HER2) on chromosome cytoband 17q12, and the normal-like subtype shares expression patterns with normal breast tissue.
Breast cancer cell lines have been used widely to investigate breast cancer pathobiology, and to screen and characterize new therapeutics [6,7]. Advantages of cell lines include the relative ease of pharmacologic and genetic manipulation, the variety of available functional assays, and, for some studies, the purity of the cancerous epithelial population (and absence of stromal cell contamination). However, while some investigators choose particular cell lines based on the known ER or HER2 status, many others rely on standard "workhorses" like MCF7 without regard to the particular tumor subtypes being modeled. The recent recognition of microarray molecular subtypes points to the need for additional consideration in cell line selection.
The goal of our study was to profile gene expression and CNAs genome-wide in a collection of 52 publicly-available and commonly-used breast cancer cell lines, in order to assess the relation of these cell lines to the recognized molecular subtypes of breast cancer, and to discover new candidate breast cancer genes and pathways.
EFM19 and EFM192A were obtained from DSMZ (Braunschweig, Germany). HCC38, HCC70, HCC202, HCC712, HCC1007, HCC1143, HCC1395, HCC1419, HCC1428, HCC1500, HCC1569, HCC1599, HCC1806, HCC1937, HCC1954, HCC2157, HCC2185, HCC2218, HCC2688 and HCC3153 were obtained from the cell repository of the Hamon Center for Therapeutic Oncology Research, UT Southwestern Medical Center (many are now available from ATCC). CAL51 was a kind gift from J. Gioanni from the Centre Antoine-Lacassagne, Nice, France. SUM44PE, SUM52PE, SUM102PT, SUM149PT and SUM190PT were kind gifts from Dr. Stephen P. Ethier (now available from Asterand, Detroit, MI). MCF10A was grown in MEGM media (Cambrex, East Rutherford, NJ). SUM52PE and SUM149PT were grown in Ham's F12 media with 5% FBS, supplemented with 5 µg/ml insulin and 1 µg/ml hydrocortisone. SUM44PE, SUM102PT and SUM190PT were grown in Ham's F12 with 0.1% BSA, supplemented with 5 µg/ml insulin, 1 µg/ml hydrocortisone, 5 mM ethanolamine, 10 mM HEPES, 5 µg/ml transferrin, 10 nM triiodothyronine (T3) and 50 nM sodium selenite (10 ng/ml EGF was also included for SUM102PT). All other cell lines were grown in RPMI-1640 with 10% FBS and 1% Pen/Strep. Clinicopathological characteristics of cell lines are summarized in Table 1. A subset of cell lines (focused on the HCC series) was subjected to a more detailed molecular pathological characterization of ESR1, PGR, ERBB2, EGFR and BRCA1, as summarized in Table 2.

RNA and DNA isolation
Cells were grown to 70-80% confluence, then harvested for total RNA and genomic DNA. For HCC lines, RNA was prepared using the Qiagen RNeasy Midi Kit (Qiagen, Valencia, CA) and DNA by phenol/chloroform extraction. For all other lines, RNA was isolated using Trizol (Invitrogen, Carlsbad, CA) according to the manufacturer's protocol, and DNA using the Blood Cell Maxi Kit (Qiagen).

ERBB2 copy number assessment by quantitative PCR
ERBB2 copy number was quantified by real-time quantitative PCR (Q-PCR), using the Chromo4 PCR System (Bio-Rad Laboratories, Hercules, CA). GAST, located at 17q21 (on the same chromosomal arm as ERBB2), was used as a reference control. PCR primer sequences for ERBB2 and GAST were as follows (forward and reverse, respectively): ERBB2 (5′-TTGGGAGCCTGGCATTTCT-3′ and 5′-AGGTCATCGTGCCCACTCTT-3′); GAST (5′-GTAGGCATCCTTCCCCCATT-3′ and 5′-AGCCATGGTCCCTGCTTCTT-3′), with PCR product lengths of 59 and 70 base pairs, respectively. Primers were chosen by TaqMan Primer Express 1.5 (Applied Biosystems, Foster City, CA) and purchased from Invitrogen. PCR reactions were carried out in a final volume of 20 µl containing 20 ng genomic DNA, 300 nM each primer (for both ERBB2 and GAST, in independent reactions) and 1× Power SYBR Green PCR Master Mix (Applied Biosystems, Foster City, CA). PCR conditions were as follows: one cycle at 95°C for 10 minutes, followed by 40 cycles each at 95°C for 15 seconds and 60°C for 1 minute. Samples were analyzed in triplicate. Each amplification reaction was checked for the absence of nonspecific PCR products by melting curve analysis. ERBB2 copy number calculation was carried out using the comparative Ct method [8] after validating that the efficiencies of PCR reactions of both ERBB2 and GAST were equal. Human Genomic DNA (DNA20) (EMD Biosciences, Darmstadt, Germany), a mixture of pooled human whole blood from 6-8 individual male and female donors, was run in every assay as a calibrator sample. ERBB2 gene copy number in normal human genomic DNA was set as 2, and a copy number of more than 4 in cell lines was considered to be increased.

mRNA levels of ESR1, PGR, ERBB2 and EGFR
Transcript levels of ESR1, PGR, ERBB2 and EGFR were analyzed as a part of the RT2 Profiler Custom PCR Array (SuperArray Bioscience, Frederick, MD). After making cDNA from 1.0 µg total RNA using the RT2 PCR Array First Strand Kit (SuperArray Bioscience), quantitative PCR was performed with the Chromo4 PCR System (Bio-Rad Laboratories) using RT2 Real-Time SYBR Green PCR Master Mix (SuperArray Bioscience) according to the manufacturer's protocol. We chose two different housekeeping genes, β-actin (ACTB) and glyceraldehyde-3-phosphate dehydrogenase (GAPDH), as internal controls, using the average of their Ct values. Primers were chosen by TaqMan Primer Express 1.5 and purchased from Invitrogen, as follows:  [9]. We also analyzed the NC11 (normal lymphocyte) cell line for ESR1, PGR, ERBB2 and EGFR mRNA expression, and the tumor cell values were reported relative to NC11. For data analysis, the comparative Ct method [8] was used.
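The comparative Ct calculations described above can be sketched as follows. This is a minimal illustration rather than the authors' analysis code: the function names are hypothetical, the Ct values in the usage note are invented, and the base of 2 assumes that every reaction doubles the product each cycle (the text states only that the ERBB2 and GAST efficiencies were validated as equal).

```python
from statistics import mean


def delta_ct(target_cts, reference_cts):
    """Mean Ct of the target minus mean Ct of the reference (replicate wells)."""
    return mean(target_cts) - mean(reference_cts)


def housekeeping_ct(actb_cts, gapdh_cts):
    """Internal control for expression: average of the ACTB and GAPDH mean Cts."""
    return (mean(actb_cts) + mean(gapdh_cts)) / 2.0


def relative_quantity(sample_delta, calibrator_delta):
    """Comparative Ct: 2^-(ddCt); assumes ~100% efficiency (hypothetical base 2)."""
    return 2.0 ** -(sample_delta - calibrator_delta)


def erbb2_copy_number(sample_delta, calibrator_delta, calibrator_copies=2.0):
    """Scale the relative quantity by the calibrator copy number (set to 2)."""
    return calibrator_copies * relative_quantity(sample_delta, calibrator_delta)


def is_increased(copy_number, threshold=4.0):
    """Copy number of more than 4 is considered increased."""
    return copy_number > threshold
```

For example, a cell line whose ERBB2 wells cross threshold three cycles earlier than its GAST wells, while the pooled normal DNA calibrator shows no difference, would be assigned `erbb2_copy_number(-3.0, 0.0)`, i.e. 16 copies, and flagged as increased. For transcript levels, the same `relative_quantity` function applies with `housekeeping_ct` supplying the reference and NC11 as the calibrator.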

Western blot analysis and immunohistochemistry (IHC)
Preparation of total cell lysates and Western blotting were done as described previously [10]. Primary antibodies used were mouse monoclonal anti-ER-α (Cell Signaling, Beverly, MA), mouse monoclonal anti-PR (6A1) (Cell Signaling), mouse monoclonal anti-HER2 (Cell Signaling), rabbit monoclonal anti-EGFR (Cell Signaling) and mouse monoclonal anti-actin (Sigma-Aldrich). Actin levels were used as a control for protein loading. Peroxidase-labeled anti-mouse or anti-rabbit antibodies (Amersham Pharmacia, Piscataway, NJ) were used as secondary antibodies. IHC on breast cancer cell lines was described previously [11].

BRCA1 mutation analysis
DNA sequence analysis was performed on the entire BRCA1 gene in available lymphocyte DNA matched to breast cancer cell lines. In the lymphocyte DNA matching HCC3153, a heterozygous duplication of 10 base pairs was detected at position 943 in exon 11 of BRCA1 (943ins10). The region of BRCA1 exon 11 containing the 943ins10 mutation was amplified from genomic DNA in the tumor cell line (HCC3153) using standard PCR conditions. Sequence analysis revealed only the mutant sequence. Absence of the normal allele was also confirmed by single strand conformation analysis as well as gel electrophoresis of the amplified fragment on 5% acrylamide denaturing gels.

Gene expression profiling
Gene expression profiling was performed on Human Exonic Evidence Based oligonucleotide (HEEBO) arrays obtained from the Stanford Functional Genomics Facility and containing 36,192   were differentially labeled with Cy5 and Cy3, respectively, using an amino-allyl coupling protocol, then cohybridized onto the microarray in a high volume mixing hybridization at 65°C for 40 hrs. Details of the array processing and sample labeling/hybridization methods have been described [12]. Following hybridization, arrays were washed and scanned using a GenePix 4000B Axon scanner (Axon Instruments, Union City, CA). Fluorescence ratios were extracted using Spot Reader software (Niles Scientific, Portola Valley, CA) and uploaded to the Stanford Microarray Database [13] for storage, retrieval, and analysis. For two lines, HCC1806 and SUM44PE, expression profiling array hybridizations did not pass quality-control inspection and were excluded from analysis. The complete microarray expression data are available at the Stanford Microarray Database (SMD) (http://smd.stanford.edu) and at the Gene Expression Omnibus (GEO) (accession GSE15376); all microarray data reported in the manuscript are described in accordance with MIAME guidelines.

Gene expression profiling analysis
Background-subtracted fluorescence log2 ratios were globally normalized for each array, and then mean-centered for each gene (i.e. reported relative to the average log ratio across all samples). Unless otherwise specified, we included for subsequent analysis only well-measured genes, defined as those with fluorescence intensities in the Cy5 or Cy3 channel at least 1.5-fold above background in at least 60% of samples. For unsupervised hierarchical clustering, we included only the 8,750 well-measured genes whose expression varied at least 3-fold from the mean in at least 5 samples (Table S1). Hierarchical clustering was performed and displayed using Cluster and TreeView software (http://rana.lbl.gov/EisenSoftware.htm). Enrichment for functionally related genes was tested across a collection of 1,687 curated gene sets (C2) using Gene Set Enrichment Analysis (GSEA; Release 2.0) [14]. Cell lines were classified according to breast tumor subtype (luminal-A, luminal-B, ERBB2, basal-like and normal-like) using the nearest centroid method applied to the set of "intrinsic genes" (i.e. genes with small within-specimen compared to between-specimen expression variance), as done previously [15], here using  [16]. The cell line subtype classifier, comprising 484 genes, was then applied to classify primary tumors using the nearest centroid method (with Euclidean distance). We also classified each cell line as being associated with a good or bad prognosis signature (70-gene prognostic signature [17]), the presence or absence of a wound healing signature (512-gene wound signature [18]), and the presence or absence of a hypoxia signature (123-gene hypoxia signature [19]). For each signature, we calculated the gene expression centroid of the two groups of breast tumors (as determined in the original publications), and then correlated each centroid with cell line expression of the respective signature genes. Membership was assigned to the group with the highest correlation (Pearson correlation).
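The gene-centering, variable-gene filter and correlation-based centroid assignment described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the software used in the study: the function names are hypothetical, each row is one gene's log2 ratios across samples, and missing values (which real array data contain) are assumed to be absent.

```python
import math


def mean_center(rows):
    """Subtract each gene's mean log2 ratio across samples from its values."""
    centered = []
    for row in rows:
        avg = sum(row) / len(row)
        centered.append([value - avg for value in row])
    return centered


def is_variable(centered_row, fold=3.0, min_samples=5):
    """True if the gene deviates at least `fold` from its mean in >= min_samples."""
    cutoff = math.log2(fold)  # fold change expressed on the log2 scale
    return sum(1 for v in centered_row if abs(v) >= cutoff) >= min_samples


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is flat)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def assign_by_centroid(profile, centroids):
    """Return the name of the centroid most correlated with the profile."""
    return max(centroids, key=lambda name: pearson(profile, centroids[name]))
```

Here `profile` holds one cell line's expression values for the signature genes, and `centroids` maps each group label (for example, hypothetical "good" and "poor" prognosis groups) to the mean tumor expression of the same genes in the same order.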

Array-based comparative genomic hybridization (aCGH)
Arrays for CGH were obtained from the Stanford Functional Genomics Facility. aCGH was performed using cDNA arrays containing 39,632 cDNAs, representing 22,279 mapped human genes (18,049 UniGene clusters [20], together with 4,230 additional mapped ESTs not assigned to UniGene IDs), according to previously published protocols [21,22]. Briefly, 4 µg of genomic DNA from cell lines was random-primer labeled with Cy5 and cohybridized onto a microarray along with 4 µg of Cy3-labeled normal female leukocyte reference DNA. Following overnight hybridization, the arrays were washed and scanned as above. The complete aCGH data are available at SMD and at GEO (accession GSE15376).

aCGH analysis
Background-subtracted log2 fluorescence ratios were normalized for each array by mean centering. Well-measured genes used for subsequent analysis were those with fluorescence intensities in the Cy3 reference channel at least 1.4-fold above background. Map positions for arrayed cDNA clones were assigned using the NCBI genome assembly, accessed through the UCSC genome browser database (NCBI Build 36.1). For genes represented by multiple arrayed cDNAs, the average log2 ratio was used. The complete processed aCGH dataset is available as Table S2. DNA gains and losses were identified using the cghFLasso (R package for Fused Lasso) method [23], which controls the false discovery rate (FDR) by using normal-normal hybridization arrays to approximate the null distribution of the test statistics (see [23] for more details). An FDR <1% was used to call gains and losses. The fraction of the genome altered was determined by calculating the fraction of genes with fluorescence ratios ≥3 (for amplifications) or with significant non-zero fused lasso calls (for gains and losses). Some analyses (where indicated) were carried out on cytobands (boundaries defined by NCBI Build 36.1) rather than individual genes. For each cell line, cytobands exhibiting CNA were defined as those with at least two genes called by cghFLasso, and the magnitude of the CNA was defined as the average log2 ratio of genes within the cytoband. We defined high-level DNA amplifications and multi-copy deletions as continuous regions identified by cghFLasso with at least 50% of genes having fluorescence ratios ≥3 or ≤0.25, respectively. These sites were also checked against known copy number variants (CNVs) reported in the Database of Genomic Variants (http://projects.tcag.ca/variation). Significant associations between cytobands and gene-expression subtypes were identified using SAM with an FDR <5%.
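The thresholding rules above (applied after segmentation) can be sketched as follows. The sketch does not reimplement cghFLasso: it assumes per-gene calls have already been produced, represented here as a hypothetical list in which 0 means no call and any non-zero value means a called gain or loss. Function names are illustrative only.

```python
def classify_region(ratios, amp_cutoff=3.0, del_cutoff=0.25, min_fraction=0.5):
    """Label a called region from its per-gene fluorescence ratios (linear scale).

    Returns "amplification" if at least min_fraction of genes have ratio >= 3,
    "deletion" if at least min_fraction have ratio <= 0.25, otherwise None.
    """
    if not ratios:
        return None
    n = len(ratios)
    if sum(1 for r in ratios if r >= amp_cutoff) / n >= min_fraction:
        return "amplification"
    if sum(1 for r in ratios if r <= del_cutoff) / n >= min_fraction:
        return "deletion"
    return None


def cytoband_cna(log2_ratios, calls, min_called=2):
    """Average log2 ratio of a cytoband's genes, or None if < 2 genes are called."""
    if sum(1 for c in calls if c != 0) < min_called:
        return None
    return sum(log2_ratios) / len(log2_ratios)


def fraction_altered(calls):
    """Fraction of genes carrying a non-zero gain/loss call."""
    return sum(1 for c in calls if c != 0) / len(calls)
```

Note that the cutoffs act on linear fluorescence ratios for region classification but on log2 ratios for the cytoband magnitude, mirroring the wording of the definitions above.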

Integrating genomic and transcriptional profiles
To integrate DNA copy number data (generated using cDNA microarrays) and gene-expression data (HEEBO oligonucleotide arrays), each gene expression measurement was first assigned a DNA copy number from either a probe interrogating the same named gene, or the average copy number of the nearest 5′ and 3′ probes (NCBI Build 36.1). Identification of genes with correlated copy number and expression was carried out using the DR-Correlate application of DR-Integrator (K. Salari, manuscript in preparation). Briefly, for each gene a modified Student's t-test was performed comparing gene expression levels in cell lines from the lowest and the highest deciles of all cell lines' copy number for the same gene; random permutations of sample labels were used to estimate an FDR.
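The decile comparison can be illustrated for a single gene as follows. This is not the DR-Correlate implementation, whose details are not given here: the sketch substitutes a Welch-type t statistic with a small constant `s0` added to the denominator (an assumed form of "modified" t-test), and it reports a one-sided empirical permutation p-value for one gene rather than an FDR across all genes. All names and defaults are hypothetical.

```python
import math
import random
from statistics import mean, variance


def decile_groups(copy_number, fraction=0.1):
    """Indices of samples in the lowest and highest copy-number deciles."""
    n = len(copy_number)
    k = max(2, int(n * fraction))  # at least two samples so variance is defined
    order = sorted(range(n), key=lambda i: copy_number[i])
    return order[:k], order[-k:]


def modified_t(low, high, s0=0.1):
    """Welch-type t statistic with constant s0 added to the standard error."""
    se = math.sqrt(variance(low) / len(low) + variance(high) / len(high))
    return (mean(high) - mean(low)) / (se + s0)


def permutation_p(copy_number, expression, n_perm=1000, seed=0, s0=0.1):
    """Observed statistic and one-sided p-value from shuffled expression labels."""
    low_idx, high_idx = decile_groups(copy_number)

    def stat(expr):
        return modified_t([expr[i] for i in low_idx],
                          [expr[i] for i in high_idx], s0)

    observed = stat(expression)
    rng = random.Random(seed)
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the copy number/expression pairing
        if stat(shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

With roughly 50 cell lines each decile contains about five samples, so the statistic contrasts the most amplified lines for a gene against the most deleted ones; extending the sketch to an FDR would require pooling permuted statistics across all genes.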

Transcriptional profiling identifies three breast cancer cell line subtypes
To catalog molecular variation in a collection of 52 widely-used breast cancer cell lines, we first profiled gene expression using whole genome oligonucleotide microarrays. Unsupervised hierarchical clustering of the 8,750 most variably expressed genes stratified cell lines into two main groups (see dendrogram, Fig. 1B). One group, designated "luminal" (blue dendrogram branches), contained all the ER-positive cell lines (Fig. 2A), and was characterized by the expression of ERα-regulated genes (e.g. MYB, RET, EGR3, TFF1; Fig. 1H, and not shown) [24][25][26][27], as well as genes associated with luminal epithelial differentiation (e.g. GATA3 and FOXA1, Fig. 1I) [28].
The other group, designated "basal", contained only ER-negative cell lines (Fig. 2A) and was characterized by the expression of basal epithelial gene markers including MSN, ETS1, CAV1 and EGFR (Fig. 1E, and not shown) [29][30][31][32]. Basal cell lines were further stratified into two subgroups, designated A and B (in line with Neve et al. [33], discussed further below). The basal-A subtype (red dendrogram branches) contained many of the "HCC" lines established at UT Southwestern, including two known BRCA1 mutant lines (HCC1937, HCC3153) ([34], and this study). Basal-A lines were characterized by expression of PROM1 (aka CD133), a marker of various cancer stem cells [35], as well as other genes like GABRP and VTCN1 (Fig. 1F and 2C). Some of the basal-A lines also shared expression of luminal epithelial markers like KRT8 and KRT18 (Fig. 1G).
Subtype-specific differences in gene expression could also be identified by pathway analysis, using Gene Set Enrichment Analysis (GSEA) [14]. Included among the top signature associations (Table 3), the luminal cell line subtype was
In regard to molecular markers and gene mutations (Fig. 2A), the luminal subtype included all the ER-positive cancer lines (P<0.001, 2-tailed Fisher's exact test), and all but two of the ERBB2-positive lines (P = 0.002), half of which were also ER-positive. PTEN inactivating mutations and PIK3CA activating mutations, functioning on the same pathway, were mutually exclusive in all but one sample. Interestingly, PTEN mutations were more common in the combined basal-like cell lines (P = 0.020), while PIK3CA mutations were more frequent in luminal lines (P = 0.022). TP53 mutations occurred more often in basal-like lines (P = 0.038).

Relationship of breast cancer cell line and tumor subtypes
To determine the relation between breast cancer cell line subtypes (luminal, basal-A, basal-B) and breast tumor subtypes (luminal-A, luminal-B, ERBB2, basal-like, and normal-like), we first classified cell lines according to tumor subtype using a nearest centroid approach applied to the set of "intrinsic genes" used originally to define the tumor subtypes [2,3] (see Methods) (Fig. 2B). By expression patterns, most of the luminal lines most
We also carried out the reverse analysis, building a cell line subtype classifier to classify 86 breast tumors (from the original Stanford/Norway study defining the five tumor subtypes [3]) according to cell line subtype (see Methods) (Fig. 2D). Notably, all basal-like tumors most resembled basal-A cell lines. Luminal-A and -B tumors most resembled luminal cell lines, while ERBB2 subgroup tumors most resembled either luminal or basal-A cell lines. A similar analysis of breast tumors arising in carriers of BRCA1 mutation, analyzed from a different dataset (The Netherlands Cancer Institute) [17], revealed highest resemblance in 17 of 18 cases to basal-A lines (not shown), while two BRCA2 mutation-associated cases most resembled luminal cell lines.
In addition to the above cluster-derived luminal/basal tumor subtypes, alternative breast tumor subtype classifiers have been proposed, including a 70-gene prognostic signature supervised on the metastatic/non-metastatic distinction [17], a "wound" signature trained on the serum response of cultured fibroblasts [18], and a hypoxia signature derived from the hypoxic response of cultured mammary and renal tubular epithelial cells [19]. Each of the three signatures predicts unfavorable clinical outcome. Interestingly, the basal-like lines (considered together) were those predominantly expressing the 70-gene (P = 0.001, Fisher's exact test), wound (P = 0.004), and hypoxia (P<0.001) signatures (Fig. 2B).

Genomic profiles of breast cancer cell lines
To survey DNA copy number alterations in the panel of 52 breast cancer cell lines, we carried out CGH on cDNA microarrays with validated performance characteristics [21] and covering 22,000 genes with an average mapping resolution (interprobe distance) of ~70 Kb. Across the sample set, the most frequent CNAs (called by cghFLasso; see Methods) were gains on 1q, 3q, 5p, 7p, 8q, 11q, 17q, and 20q, and losses on 3p, 4, 8p, 9p, 11q, 13q, 18p, and Xq.
Overall, the spectrum of cytoband gains and losses was similar in the cell lines compared to primary tumors (Fig. 3A), though the frequency of those CNAs was generally higher in the cell lines. Cell line subtype-specific CNAs could be identified by SAM analysis (Fig. 3B). Luminal cell lines were characterized by more frequent gains on 1q, 8q, 11q, 12q, 14q, 17q and 20q, and losses on 8p, 9p, 11q, 13q, and 18p. Of these, gains on 1q, 8q, and 20q, and losses on 1p, 8p and 13q (asterisked in Fig. 3B) also characterize luminal-B breast tumors, while 17q gain characterizes ERBB2-associated tumors [4,5]. Notably, simple patterns characteristic of luminal-A tumors (1q+, 16p+, 16q−) were not well-represented among the luminal cell lines. Basal-A and basal-B cell lines also exhibited characteristic gains/losses (Fig. 2B), but none also selectively characteristic of basal-like tumors.

Integrated analysis for cancer gene discovery
The molecular profiles generated provide opportunities to identify breast cancer cell lines with altered copy number and expression of known cancer genes, useful to model pathogenesis and therapy, and to discover new breast cancer genes. For the latter, high-amplitude CNAs, i.e. high-level DNA amplifications and homozygous deletions, are particularly informative in pinpointing new cancer genes. Within the aCGH dataset we identified 80 loci of high-level amplification in 35 different cell lines, each spanning 49-49,014 Kb (median 1,115 Kb). We also identified 13 multi-copy (possibly homozygous) deletions (fluorescence ratios ≤0.25) in 8 cell lines, spanning 132-7,825 Kb (median 1,477 Kb). The boundaries of amplicons/deletions did not correspond to known germline CNVs (reported in the Database of Genomic Variants), and, for the subset of recurrent alterations, finding distinct boundaries in different cell lines was more consistent with somatic alteration. Several regions of high-level amplification contained known oncogenes, like 8q24 (MYC), 11q13 (CCND1) and 17q12 (ERBB2). Other amplicons did not correspond to known oncogenes and presumably harbor novel breast cancer genes.

Discussion
Using whole-genome DNA microarrays, we collected transcriptional and genomic profiles across a set of 52 widely used breast cancer cell lines, with the primary goals to establish their suitability in modeling known breast tumor heterogeneity, and to create a resource for cancer gene discovery. Cluster analysis of transcriptional profiles defined three cell line subtypes, one luminal and two basal (A and B), consistent with other recent studies of breast cancer cell lines [31,33,75]. The luminal subtype included all ER-positive cell lines, and associated gene expression patterns reflected both ER and luminal differentiation pathways, the latter including GATA3 and FOXA1, key transcriptional mediators of luminal differentiation [28,76]. The basal-like cell lines were ER-negative and exhibited more frequent mutations of TP53 and PTEN, consistent with findings in basal-like tumors [3,77]. The basal-A subtype exhibited enriched expression of ETS pathway genes, a pathway linked to diverse tumor phenotypes including invasion and metastasis [78]. The basal-B subtype, which included the three non-tumorigenic lines (consistent with prior studies [75]), as well as five highly invasive/metastatic lines with features of EMT, exhibited enriched expression of EMT and EGF regulated genes, the latter pathway also previously linked to basal-like tumors [79].
Recently, Neve et al. [33] profiled 51 breast cancer cell lines (though using a lower-resolution (~1 Mb) CGH platform), 38 of which (~3/4) overlapped with the 52 we profiled. All the overlapping lines except for one clustered into the same corresponding gene-expression subtype in both their study and ours. The exception was HCC1500, which we classified as luminal while Neve et al. labeled it as basal B. The discrepancy may reflect a cell line identification error. We note that ATCC describes the line as ER-positive, more consistent with a luminal classification.
Our comparisons of expression profiles between breast cancer cell line subtypes and breast tumor subtypes provided valuable information relevant to the suitability of cell lines in modeling known breast tumor heterogeneity. Luminal-A/B tumors best matched luminal cell lines. Notably, basal-like tumors most corresponded to basal-A cell lines. Consistent with this finding, two breast cancer cell lines from BRCA1 mutation carriers also clustered in basal-A (and basal-A lines exhibited enrichment of a BRCA1 signature); it has been established that BRCA1-associated tumors share many features with sporadic basal-like tumors [80]. Interestingly, ERBB2-associated tumors matched both luminal and basal-A lines. While ERBB2 represents a distinct expression tumor subtype in multiple independent cohorts [3,15,81], it is noteworthy that most ERBB2 (HER2+) cell lines clustered in the luminal subtype. The basis for the discrepant ERBB2 grouping in cell lines and tumors is unclear but warrants further investigation.
It has been suggested that the origin of the luminal vs. basal breast cancer distinction reflects the transformation of different breast epithelial progenitor cell compartments [82,83]. Breast epithelial stem/progenitor cells support mammary gland development during puberty and subsequent growth and remodeling during pregnancy [84]. A prevailing view is that breast epithelial stem cells give rise to bipotent basal/luminal progenitors, which then give rise to basal and luminal restricted progenitors, and from there to differentiated basal/myoepithelial and luminal epithelial   [40], also a presumed phenotype of normal breast epithelial stem or early progenitor cells [84]. Our transcriptional profiles of breast cancer cell lines are consistent with an origin in (or at least a likeness of the bulk cell population to) the various stem/progenitor cell compartments. Basal-B lines predominantly express CD44+/CD24−/low and MUC−/CALLA+ phenotypes characteristic of stem or bipotent progenitor cells, as well as ITGB3 (CD61), also recently characterized as a cancer stem cell marker in MMTV-wnt-1 induced murine breast cancer [41]. In contrast, basal-A lines appear mainly CD44+/CD24+, but express PROM1 (aka CD133), a marker of luminal progenitors in mice [86] also more recently characterized as a stem cell marker in BRCA1-associated breast cancer [87], while luminal lines express markers of luminal lineage restriction like GATA3 and FOXA1 [28]. Conspicuously absent from our analysis is a breast tumor subtype corresponding to the stem-cell like (and sometimes mesenchymal-like) basal-B lines. Whether basal-B lines reflect an uncommon tumor subtype not yet characterized, or else a stem/progenitor subpopulation of tumor cells enriched in culture, or even an artifact of cell culture, remains to be determined. Regardless, breast cancer cell lines are likely to prove useful for discovering new stem cell markers, and for studying stem/progenitor cell biology.
Our genomic profiles of breast cancer cell lines indicate that overall the spectrum of CNAs is reflective of breast tumors, consistent with prior findings from loss of heterozygosity (LOH) analysis [11]. Overall, however, cell lines exhibited higher frequencies and greater complexities of CNAs, and seemingly more than might be explained by a higher sensitivity of detecting CNAs in stromal-free tumor cell populations. Notably absent among the luminal subtype were the "simple" karyotypes characteristic of luminal-A tumors (i.e. 1q+, 16p+/16q−). By genomic profiles, luminal cell lines shared features characteristic of luminal-B tumors, including certain subtype-specific CNAs and overall higher levels of DNA amplification. Likewise, basal-A cell lines and basal-like tumors shared the feature of high levels of chromosome segment gain/loss. However, overall only a subset of subtype-specific CNAs was preserved. Therefore, at the genomic level it is uncertain how faithfully cell line subtypes represent their tumor subtype counterparts.
Taken together, the transcriptional and genomic profiles support the conclusion that luminal and basal-A cell lines are the most appropriate cell line models of luminal-B and basal-like tumors, respectively. Further, the basal lines are likely useful models for biological studies of the 70-gene, wound and hypoxia signatures. Despite incongruent expression results, luminal lines with amplification/overexpression of ERBB2 are likely appropriate models of ERBB2-associated tumors. Our findings indicate that new cell lines are needed to more faithfully model luminal-A tumors. Currently available cell lines likely reflect certain biases in the specimen source of cell line, and/or in the culturing methods, as suggested by the predominance of HCC lines (from UT Southwestern) among the basal-A group. Different culturing methods (e.g. ref. [88]) might support the establishment of cell lines from luminal-A tumors.
Our genomic profiles also identified numerous high-level DNA amplifications and multi-copy deletions, pinpointing known and novel cancer genes. Further, by integrating the genomic and transcriptional datasets, we could define a set of candidate cancer genes residing at these loci and exhibiting both altered copy number and expression. The larger set of amplified/overexpressed genes included several known breast cancer oncogenes, as well as many plausible candidates including genes with known functions relevant to carcinogenesis, like cell proliferation, survival and motility/invasion, and genome integrity (e.g. DNA damage response). Though genes maintaining genome integrity are more typically considered candidate tumor suppressors, the overexpression of such genes has been linked to genome instability [67,89]. The set of amplified/overexpressed genes also included many druggable targets [74], most notably several kinases. Importantly, the same cell lines used for discovery can also be used to functionally examine cancer gene candidates, for example using RNA interference to knock down the expression of amplified oncogene candidates, and then assaying loss of tumorigenic phenotypes in cultured cells or in vivo (e.g. refs. [90,91]). Indeed, high-throughput RNA interference approaches [92,93] might be used to evaluate many or all of the candidate cancer genes simultaneously.
In summary, transcriptional and genomic profiling of 52 commonly used breast cancer cell lines identifies cell line subtypes, and defines the cell line subtypes that most faithfully capture the known heterogeneity of breast tumors. Specifically, luminal and basal-A lines appear to best model the features of luminal-B and basal-like tumors, while basal-B lines might inform stem cell biology. In addition, our integrated analysis of genomic and transcriptional profiles pinpoints loci and genes with altered copy number and expression, providing a rich source for discovery and future characterization of new breast cancer genes.