The genomic landscape of metastasis in treatment-naïve breast cancer models

Metastasis remains the principle cause of mortality for breast cancer and presents a critical challenge because secondary lesions are often refractory to conventional treatments. While specific genetic alterations are tightly linked to primary tumor development and progression, the role of genetic alteration in the metastatic process is not well-understood. The theory of tumor evolution postulated by Peter Nowell in 1976 has yet to be proven in the context of metastasis. Therefore, in order to investigate how somatic evolution contributes to breast cancer metastasis, we performed exome, whole genome, and RNA sequencing of matched metastatic and primary tumors from pre-clinical mouse models of breast cancer. Here we show that in a treatment-naïve setting, recurrent single nucleotide variants and copy number variation, but not gene fusion events, play key metastasis-driving roles in breast cancer. For instance, we identified recurrent mutations in Kras, a known driver of colorectal and lung tumorigenesis that has not been previously implicated in breast cancer metastasis. However, in a set of in vivo proof-of-concept experiments we show that the Kras G12D mutation is sufficient to significantly promote metastasis using three syngeneic allograft models. The work herein confirms the existence of metastasis-driving mutations and presents a novel framework to identify actionable metastasis-targeted therapies.

tissue from pre-clinical mouse models of metastatic breast cancer (S1 Fig). We focused on the luminal-like MMTV-PyMT (PyMT) and MMTV-Her2 (Her2) genetically engineered mouse models (GEMMs) of metastatic breast cancer, which model the PI3K activation or the HER2 amplification seen in 42% or 20% of breast cancer patients, respectively [12,21]. The PyMT model produces palpable primary tumors with a mean latency of 60 days and pulmonary metastases at 100 days in 85% of mice [22]. The Her2 model produces mammary tumors with a mean latency of 100 days and pulmonary metastases at 200 days in 60% of mice [23][24][25]. We crossed the PyMT model (FVB/NJ background) with five mouse strains (FVB/NJ, C57BL/6J, C57BL/ 10J, CAST/EiJ, MOLF/EiJ) to more closely recapitulate the genetic heterogeneity of human populations. Due to the latency and high variability of the Her2 (FVB/NJ background) model disease progression compared with PyMT, we only crossed the Her2 model with FVB/NJ for this analysis; however, sequencing of tissues from additional strains is ongoing (S1A Fig).
To investigate whether somatic mutations might contribute to metastatic progression, we performed exome sequencing (exome-seq) of 53 and 12 paired primary tumor (PT) and lung metastases (LM) from the PyMT and Her2 GEMMs, respectively (S1A Fig). Single nucleotide variant (SNV) analysis was performed using three independent SNV-calling algorithms. Genes predicted to harbor SNVs by each algorithm were intersected to identify a common set of potentially mutated genes. We identified 1202 exon SNVs in PT tissue and 1768 in LM tissue compared with normal strain-specific gDNA sequence (S2A and S2B Fig).
Candidate metastasis-driving SNVs were defined by two criteria: 1) enrichment in metastatic tissue but not in primary tumor and 2) presence within the metastatic seeding cell. To identify SNVs likely present in the metastatic seeding cell, only those SNVs found in at least 60% of the metastatic lesion (variant allele frequency between 0.3 to 0.5) were considered as potential metastasis-driver events. This cut-off accounts for the infiltration of non-tumor cells but still requires a heterozygous mutation to be present in at least 60% of the cells within the lesion. The comparison of LM sequences to matched PT identified 196 SNVs in 164 genes common to all 3 SNV calling algorithms (Figs 1 and 2, S2C Fig and S1 Table). The 164 genes were then screened to identify recurrently mutated codons. Five genes (Kras, Shc1, Ccni, Mtch2, Snrk) were identified with recurrent mutations in the same codon, and again were common to all three algorithms (S1 Table, Fig 3A-3D).
The SNV calls were then analysed for genes that were recurrently mutated in different codons. A total of 147 genes were identified with mutations observed in only a single animal by all three algorithms. Twelve genes were identified as having been recurrently mutated in different codons in independent animals (S1 Table). Finally, to prioritize genes for future analysis, RNA sequencing (RNA-seq) data was examined to validate SNVs in tissues analyzed by both analyses. An additional instance of the Shc1 mutated codon was identified, and a potential recurrent mutation in Rfx8 was found (S1 Table).
Two genes, Kras and Shc1, had recurring metastasis-specific mutations in independent animals of the PyMT cohort (Fig 3A and 3C). Two PyMT animals carried the oncogenic activating Kras G12D mutations (Fig 3A), and an additional three PyMT animals carried different nucleotide substitutions all within the codon of proline 561 or 451 (isoform a or b respectively) of Shc1, resulting in P561S, P561T, or P561H ( Fig 3C). For those samples with gDNA or RNA still available (4 out of 7), Sanger sequencing validated the presence of Kras and Shc1 SNVs in the original metastatic tissues analyzed by exome-and RNA-seq, and in some cases in additional metastases from the same animal (S2 Table). The presence of recurrent metastasisenriched SNVs in several animals in the absence of therapeutic pressures suggests that specific coding mutations may play a fundamental role in driving metastasis.
To expand the search for additional genes with recurrent metastasis-enriched mutations, we screened RNA-seq data from a cohort of 42 matched primary tumors and metastatic pairs from 40 PyMT and 2 Her2 animals (S1A Fig). The RNA-seq data included 17 PyMT animals also analyzed by exome-seq plus an additional 25 independent animals (2 Her2 and 23 PyMT) (S1A Fig). First, the 164 genes with high-probability SNVs identified by exome-seq were screened against the RNA-seq data to validate exome-seq SNV predictions and identify

PLOS GENETICS
The genomic landscape of metastasis in treatment-naïve breast cancer models additional animals with mutations in these genes (S1 Table and S2 Table). For animals used in both analyses, the predicted SNVs identified by exome-seq were validated by the RNA-seq data given the gene was transcribed, suggesting that the exome-seq filtering criteria correctly identified bona fide SNVs. One additional metastasis-enriched Kras mutation (Q61R; Fig 3B) and an additional mutation in Shc1 P561 (P561A; Fig 3D) were observed in two animals from the RNA-seq-only cohort. Unexpectedly, in addition to a single metastasis-enriched Ctnnb1 oncogenically-activating SNV (S45P) [26] observed in exome-seq data, RNA-seq data revealed three more independent animals possessing metastasis-specific Ctnnb1 S45F SNVs and another two with Ctnnb1 point mutations that are also often observed in human tumors (K335N, N387K) (S1 Table). Sanger sequencing validated the presence of the SNVs for Ctnnb1 exome-seq S45P and RNA-seq N387K mutations (S2 Table). Insufficient sample material prevented Sanger validation of the remaining RNA-seq-identified SNVs. Overall, for samples

PLOS GENETICS
The genomic landscape of metastasis in treatment-naïve breast cancer models with available material, the rate of validation by Sanger sequencing for metastasis-specific SNVs was approximately 80% (8 out of 10) (S2 Table and   Metastasis-specific mutations are recurrent and stratify patient outcome. IGV screen shots showing allele frequency in the blue and red boxes, representative SNV reads, mutated codon letter, tissue type, and mouse strain. A, Exome-seq identified a C (blue) to T (red) SNV within Kras resulting in the G12D amino acid substitution in metastases from two C57BL/6J x PyMT animals. B. RNA-seq identified a T (red) to C (blue) SNV within Kras resulting in the Q61R amino acid substitution in a metastatic lesion from one FVB/NJ x PyMT animal. C, Exome-seq identified C (blue) to T (red) and C (blue) to A (green) SNVs in metastases from three animals (FVB/NJ, CAST/Ei (2 mets from 1 mouse with same SNV), and C57BL/6J x PyMT) resulting in the P561S/T or H amino acid substitutions, respectively. D, RNA-seq identified a C (blue) to G (orange) SNV resulting in the P561A amino acid substitution in one FVB/NJ x PyMT animal. E. Kaplan-Meier plots generated using METABRIC for the top six genes identified with metastasis-driver SNVs by exome-seq in mice that most significantly stratify patient survival when altered in primary tumor tissue (blue = no CNV, red = CNV present). PT, primary tumor; Met, metastasis. https://doi.org/10.1371/journal.pgen.1008743.g003

PLOS GENETICS
The genomic landscape of metastasis in treatment-naïve breast cancer models or exome-seq was used to survey for 11 of the SNVs identified by exome-and RNA-seq. This analysis identified yet another Shc1 P561H mutation, but no additional samples containing the other 10 mutations were found. Taken together, these data indicate that the putative metastasis-driving recurrently mutated genes are mutually exclusive, as we did not observe the presence of mutations in two or more of the recurrently mutated genes within a single metastasis.

Copy number variation drives metastasis and reduces overall survival among human breast cancer patients
Using the METABRIC sequencing dataset of human breast cancer primary tumors, we assessed the relevance to human disease of genes possessing metastasis-enriched SNVs identified in the PyMT and Her2 mouse models. This analysis revealed that many genes possessing SNVs in our preclinical mouse models were amplified in human tissue, and 18% (23 out of 123 human orthologs) of such alterations were significantly associated with poor patient outcome ( Fig 3E, S4 and S5 Figs).
Since human primary tumors possessed gene amplification or deletion in addition to point mutations, we analyzed the PyMT primary tumor/metastasis exome-seq pairs for potential copy number variations (CNVs) that may contribute to metastatic spread. To determine allelespecific gain or loss we used only those samples generated from progeny of the FVB/NJ-PyMT outcross, as their genomes contained an FVB and non-FVB (C57BL/6J, C57BL/10J, CAST/EiJ, or MOLF/EiJ) allele that could be readily differentiated. Therefore, pure FVB/NJ-PyMT and FVB/NJ-Her2 tumor/metastasis pairs were not analyzed due to the lower confidence of CNV calling on a homozygous genetic background. Based on these criteria, we performed CNV analysis using exome-seq data from 30 of the primary tumor/metastasis pairs in our cohort. We first identified CNVs in primary tumors and metastases compared to normal, and then analyzed this data for metastasis-specific CNVs (S3-S7 Tables). Amplification and deletion events were primarily restricted to the F1 animals of crosses with C57BL/6J and MOLF/EiJ, with recurrent metastasis-specific CNVs on chromosomes 2, 4, 9, and 10 ( Fig 4A). Examination of the putative CNVs suggested that the C57BL/6J alleles were specifically lost in the F1 hybrids, while FVB/NJ alleles were under-represented in most of the CNVs in the MOLF/EiJ F1 hybrids. In addition, the chromosome 4 and 9 CNVs overlapped with regions of the genome we previously demonstrated to harbor inherited metastasis susceptibility genes [27][28][29], suggesting these genomic intervals may contain important metastasis-associated factors. The Genomic Regions Enrichment of Annotations Tool (GREAT) [30] was used to identify the genes associated with recurrent metastasis-specific CNVs. 371 genes were associated with recurrent regions of CNV in PyMT metastatic lesions, and 46 of these genes possessed CNVs in two or more outcross strains (S8 Table and S6 Fig). The METABRIC dataset was then screened again to assess if CNVs within any of the 46 genes were significantly associated with breast cancer patient survival. The amplification of approximately 17% (8 out of 46) of the genes queried was significantly associated with worse patient outcome, including EBAG9 and PKHD1L1, located on mouse chromosome 15, and PEX2, and ZFHX4, located on mouse chromosome 2 (Fig 4B), which are all located on human chromosome 8 between q21.13 and q23.2. This analysis reveals that in addition to metastasis driver SNVs, metastatic cells may also acquire specific CNVs to support their spread to a secondary site.

Metastases can originate from multiple clones in the MMTV-PyMT model
Curation of the RNA-seq data revealed several instances of discordant SNVs between independent metastases within the same animal (e.g., Ctnnb1 K335N in animal 974). The presence of a high probability SNV in only a subset of metastatic lesions within an animal may be due to the acquisition of the mutation early during the establishment of individual metastases or metastatic seeding by multiple primary tumor subclones within an animal. To address this, exomeseq was performed for primary tumors and matched metastatic tissue samples from five additional PyMT animals. Hierarchical clustering was then performed using both SNV and CNV analysis of those sequenced tissues. This analysis revealed that the metastases from three of the five animals (10204, 10245, and 10418) were highly related and likely originated from the same primary tumor subclone (Fig 5A and 5B). In contrast, two of the mice (10507 and 10548) produced heterogeneous metastases (Fig 5 and 5B), indicating that these nodules may have originated from different primary tumor subclones.

Fusion genes are not enriched in metastases
In a large-scale analysis of 560 human breast cancer samples, Nik-Zainal et al. discovered frequent variation in genomic structure in primary tumor tissue [31]. To identify any potential targetable gene fusion events that may contribute to metastatic progression, we analyzed the RNA-seq data (S1 Fig) using the deFuse algorithm to identify discordant paired-end alignments and thus putative gene fusion events specific to metastatic tissue [32]. To reduce false positive results, putative alternative splice events from adjacent genes were excluded from the MOLF/EiJ amplification, MOLF/EiJ deletion, CAST/EiJ deletion, C57BL/6J deletion, C57BL/10J deletion. B. Kaplan-Meier plots generated using METABRIC for six out of eight genes identified as metastasis drivers by exome-seq CNV analysis in mice that significantly stratify patient survival when altered in primary tumor tissue and reside on human chromosome 8 (blue = no CNV, red = CNV present).
https://doi.org/10.1371/journal.pgen.1008743.g004 analysis to eliminate rare transcriptional read-through products. Forty-four putative fusion transcripts were detected in the primary tumors by this analysis. Seventy-one fusion transcripts were detected in the metastases, with 17 fusions in common with the primary tumors and 54 unique to the metastatic lesions. The 54 metastasis-specific putative fusion events involved a total of 85 genes. Sixteen genes were associated with putative fusion events in more than one animal or involved genes with multiple fusion partners ( Fig 6A). However, manual curation of the putative recurrent fusion events revealed that 7 of the 15 breakpoints occurred in regions of repetitive elements within introns or UTRs, suggesting potential alignment artefacts. Moreover, for the highly expressed genes (Csn3, Trf), the variant transcripts were represented at <1% of the total read count, suggesting that these transcripts were either not required for maintenance of the metastatic lesions, were rare aberrant transcripts, or were artefacts of the RNA-seq/deFuse analysis.
Additionally, we performed 10x whole genome sequencing (WGS) on 20 matched pairs of primary and metastatic lesions from FVB/NJ-PyMT x MOLF/EiJ and FVB/NJ-PyMT x CAST/ EiJ animals. This data set was comprised of tissue from 16 of the animals represented in the exome-seq dataset and 4 additional animals (S1A Fig). MOLF/EiJ and CAST/EiJ F1 animals were used for this analysis as their genomes are significantly polymorphic compared to FVB/ NJ, allowing for allele-specific identification. SNVs predicted by exome-seq were confirmed in all of those animals also included in the WGS analysis (S2 Table). Putative coding-related structural variants such as insertion-deletions (indels), inversions, and translocations were identified using the BreakDancer algorithm [33] by limiting the analysis to within transcripts and requiring a minimum of three variant reads per sample (S9 Table). The resulting gene fusions were predominantly intrachromosomal and tended to cluster within regions of gene families, suggesting potential alignment artefacts. To reduce this potential artefact, a threshold of a minimum of 10 kb between putative fusion partners was introduced, resulting in 18 putative fusion events involving 31 genes observed in either more than one animal or with multiple fusion partners (Fig 6B). No overlap between the putative structural variants observed from the deFuse and BreakDancer analysis was observed. All structural variations predicted by BreakDancer under these conditions were intrachromosomal, and 11 of the 18 predicted recurrent structural variants occurred between 2 members of the same gene family. Of the remaining seven, two of the remaining predicted structural variation breakpoints were within repetitive elements at both sites ( Fig 6B, pink loops). Two additional predicted structural variations contained a repetitive element at one of the breakpoints (Fig 6B, tan loops), while three were within single copy sequence at both sites ( Fig 6B, blue loops). These results suggest that structural variation resulting in fusion transcripts does not play a significant role in the metastatic progression of the MMTV-PyMT mammary tumor model.

Single nucleotide variants drive metastasis
The presence of recurrent oncogenic mutations in Kras and Ctnnb1, in addition to the previously unreported recurrent mutations in Shc1, are consistent with a potential metastasis-driving function of these genes in the PyMT mouse model. Interestingly, previous analysis of a panel of metastatic mouse mammary tumor cell lines revealed recurrent mutation of Kras in several cell lines [34]. Moreover, SNVs within KRAS in the primary tumors of the METABRIC data set, although rare (0.6% of patients), were significantly associated with poor survival in human breast cancer (Fig 7A and 7B). CTNNB1 and SHC1 were amplified (0.3% and 22% of patients, respectively), but were not included in the targeted sequencing performed on 173 cancer-related genes in METABRIC. Therefore, the mutational burden for these genes in patient primary tumors within this data set is unknown. As METABRIC does not contain sequencing data from metastatic tissue, differential gene expression analysis of matched PDX primary mammary gland tumor and lung metastases from a recent study of PDX models [35] was examined by Ingenuity pathway Analysis (IPA). This analysis revealed that in 2 out of the 3 PDX models, KRAS signalling was significantly elevated in metastatic PDX tissue relative to primary tumor (S10-S12 Tables), consistent with the mutational activation observed in the exome-sequence analysis described here. Furthermore, CTNNB1 signalling was also significantly altered in 2 of the 3 PDX models, however it was elevated in 1 and decreased in another (S10-S12 Tables). We thus focused on Kras as a potential breast cancer metastasis driver gene and performed a proof-of-concept experiment to determine if Kras mutation in the primary tumor alters metastatic potential. Wildtype (WT) or Kras G12D constructs were expressed in two independent Kras WT mouse mammary tumor cell lines: 4T1, which is derived from a spontaneous BALB/c mammary tumor, and MET1, derived from the MMTV-PyMT model (S7A Fig) [34]. Kras WT-, Kras G12D-, or control empty vector (EV)-transduced cells were implanted orthotopically in syngeneic mice and primary tumor weight and the number of pulmonary metastases were assessed after four weeks. Expression of Kras WT and G12D increased tumor burden compared to EV in the MET1 line and a small but non-significant increase in tumor growth was also observed with the 4T1 line (Fig 7C and 7F). Therefore, to more accurately assess metastatic capacity, the number of metastatic nodules (Fig 7D and 7G) was normalized to primary tumor weight for each mouse. This analysis revealed a significant increase in metastatic capacity of Kras G12D-expressing cells compared to both Kras WT and EV cells in both lines (Fig 7E and 7H). To further test how modulation of Kras G12D affects

PLOS GENETICS
The genomic landscape of metastasis in treatment-naïve breast cancer models metastatic potential, we utilized the 6DT1 mouse mammary cancer cell line that bears an endogenous homozygous Kras G12D mutation [34]. Despite the short-lived effects of siRNA, knock down of Kras in 6DT1 cells reduced metastasis to the lungs in spontaneous metastasis assays compared to a non-targeting siRNA control (Fig 7I-7K and S7B Fig). This result was significant before controlling for primary tumor weight, and trending after normalization (p = 0.1) (Fig 7J and 7K). These results are consistent with a function for constitutively activated Kras as a metastasis driver gene in breast cancer, as well as significant evidence for the existence of spontaneous metastasis-driving somatic mutations.
Several studies have reported an association between oncogenic Kras expression, and a more mesenchymal-like phenotype associated with breast cancer progression [36][37][38]. Using the available RNA-seq data from two of the PyMT mice with spontaneous metastasis-specific Kras G12D mutations, we assessed the mRNA levels of several epithelial-to-mesenchymal transition (EMT)-related factors in the metastatic nodules [39]. There was no difference in the expression of these factors between tissues with WT or mutant Kras These data suggest that a potentially novel mechanism, independent of EMT, may be responsible for the Kras G12D metastasis-driving function observed in this study.

Discussion
Breast cancer metastasis is the primary cause of breast cancer-related mortality, and due to the disparate biology of metastatic lesions compared with the original tumor there are few options for clinical intervention [4,5,[40][41][42][43]. In this study, we hypothesized that somatic evolution of the tumor cell genome may drive metastasis and could also provide novel therapeutic targets. However, no robust sequencing datasets of paired treatment-naïve primary and metastatic tumor samples that are large enough to identify recurrent mutations or genomic events currently exist. A key obstacle in this progress is the difficulty of obtaining appropriate tissue samples. Metastatic lesions are typically not surgically removed, and data obtained from resected lesions are often confounded by the application of neo-adjuvant and adjuvant therapies, which may select for events associated with treatment resistance rather than metastasis [44]. Our utilization of mouse models in the present study overcomes these limitations and serves as a hypothesis generation and testing platform for subsequent validation and characterization in human experimental systems.
Here we have performed exome, whole genome, and RNA sequencing on a total of 94 pairs of treatment-naïve primary and metastatic tumors [21,45] of the luminal-like PyMT and Her2 GEMMs. Moreover, sequencing of matched primary tumors and metastases from five additional MMTV-PyMT animals was also performed. Several important observations were made from these studies. First, similar to human breast cancer [40,[46][47][48], both mono-and polyclonal seeding of metastases was observed in MMTV-PyMT animals. MMTV-PyMT animals develop multiple primary tumors in independent mammary glands, so it is possible that the polyclonal metastases arise from independent tumors, rather than different tumor subclones, as is the case in human breast cancer. Resolution of this question, however, would require comprehensive sequencing of all of the tumors from individual animals, which is beyond the scope of this study.
The second important observation from this study was the identification of multiple recurring metastasis-enriched SNVs. The recurrence and high enrichment of these SNVs within metastatic lesions strongly suggest roles as metastasis drivers. The vast majority of the metastasis-enriched SNVs were validated by orthogonal sequencing methods, 96% (47 out of 49; S2 Table), indicating that the analytical strategy was robust. The fact that MMTV-PyMT animals develop multiple primary tumors is not likely to confound this interpretation, since the metastasis-enriched SNVs did not appear as primary tumor drivers in any of the animals assayed. Many of the mutated genes were also found to be associated with poor outcome in human breast cancer data sets, consistent with a potential role in metastatic breast cancer. Moreover, we performed a proof-of-concept experiment to show that the recurrent Kras G12D mutation observed in metastatic lesions from two PyMT mice increases the metastatic efficiency of mouse mammary cancer cell lines in an orthotopic implantation model. Importantly, the samples sequenced in this study are from therapy-naïve animals. Recent large-scale sequencing efforts focused on metastatic lesions and matched normal tissue have become available for human breast cancer patients. However, several of these studies were performed primarily with small targeted gene sets [49,50]. In addition, the vast majority of the patient samples in these studies [49][50][51][52][53] had previously been exposed to systemic treatment, which was found to be a major contributor to tumor mutational burden [51] and to metastasis mutational burden [49]. Moreover, all of these studies have been performed on heterogeneous human populations, with different primary tumor driver mutations on diverse genetic backgrounds that likely select for different metastatic driver mutations. Taken together, these caveats suggest that very large human sample sets may be necessary to begin to identify recurrent metastasis-driver mutations in human patients.
The use of therapy-naïve, genetically reproducible animals with defined primary tumor drivers provides an important complementary platform to identify somatic mutations that contribute to metastatic progression. This may be particularly important if, as we observed in the mouse models presented here, no single putative metastasis driver is mutated at a high frequency. This result suggests that there may be many metastasis drivers, each contributing a small fraction of the overall metastasis-driving function within a population. We did, however, observe several SNVs located in genes functioning within the Ras signaling pathway. These SNVs were observed in~15% (14 (3-Kras, 5-Shc1, 6-Ctnnb1) out of 94) of the animals in this study, which implicates the Kras pathway as an important metastasis signaling axis in this model. The KRAS G12D mutation is a well-studied oncogenic driver leading to the development of aggressive lung and colorectal cancers [54][55][56]. However, KRAS is not considered a driver of human breast cancer development and its metastasis-specific recurrence in this model was unexpected [57]. Additionally, the exact role of the Shc1 P561X mutation is unknown, but the localization of this mutation to the SH2 domain of Shc1 is consistent with activating function and warrants further investigation [58].
While KRAS is not considered a major driver of breast cancer, recent work implicates this factor as an important player for some subtypes. In 2001, D'Cruz et al. described sustained oncogenic transformation following spontaneous mutagenesis of Kras2 in an inducible c-Myc mammary tumorigenesis model [59]. In 2019, Campbell et al. reported spontaneous alterations in Kras expression and activity in the aggressive ERα + NRL-PRL postmenopausal murine model, that drove neoplastic transformation [60]. Additionally, the Kras Q61H mutation was identified in primary tumors from a triple negative murine model by Liu et. al, who also report mutations and amplifications of KRAS and the KRAS pathway in 11% and 93%, respectively, of triple negative breast cancer patients in the TCGA data set [61]. Hollern et al. recently reported that murine mammary tumors with EMT histopathology, which are similar to human claudin-low breast tumors, also showed increased activity of KRAS signaling [38]. Interestingly, we did not observe any mesenchymal shift in gene expression in any of the mutant Kras tissues analyzed. However, we did also observe KRAS activation using pathway analysis of RNA-seq data collected from metastatic PDX lung tumors compared to matched primary mammary gland tumors [35], suggesting that KRAS pathway activation may be an important mechanism for at least some breast cancer patients. While these studies bring to light a potentially important role for KRAS signaling in breast cancer initiation, the work presented herein describes the first instance of Kras mutagenesis as a spontaneous driver of mammary tumor metastasis.
The SNV data reported here also suggest that there may be many ways within metastasisassociated pathways to activate a pro-metastatic function. Again, the identification of Kras and Ctnnb1 as recurrently mutated metastasis drivers was unanticipated due to the lack of association with primary tumour biopsies of human breast cancer. The presence of known, well-studied activating mutations in these genes within metastasis-founding cells, however, strongly implicates the KRAS-CTNNB1 signaling axis as a critical component of metastatic progression, at least in the PyMT model system. The activation of the CTNNB1 pathway has been previously linked to breast cancer progression, and thus our data corroborates the importance of this network [62] and suggests an additional therapeutic opportunity to intervene in metastasis establishment and progression.
We have also identified recurrent metastasis-specific regions of CNV that may contribute to metastatic spread. Several studies have shown divergent CNV between primary and metastatic tissue from small patient cohorts [63][64][65][66], and through our use of the treatment-naive PyMT animal model we have further shown that these genomic alterations occur through natural tumor progression and can drive metastasis. Additionally, our data identified several genes associated with metastasis-specific CNV that, while distributed across the mouse genome, are all localized within human chromosome 8q. Amplifications on chromosome 8q are associated with tumor progression and worse outcomes in many cancer types, suggesting that our analysis appropriately captures genetic alterations observed in human patients [67][68][69][70][71][72]. This association suggests that our strategy for the identification of metastasis-driving CNVs aligns with human biology, revealing a robust tool for the identification of clinically relevant metastatic drivers. We anticipate that with additional pairs, strain-specific and driverspecific metastasis-driver CNVs will be distinguished.
Understanding the etiology of metastasis has been an important goal in cancer research for many decades. The accumulation of chromosomal aberrations during tumor progression led Nowell to propose the progression model, where tumor cells evolve over time until a subclone acquires all of the somatic events necessary to confer metastatic potential [7]. However, cells derived from metastatic lesions were found to be no more efficient at metastasizing than primary tumor cells, suggesting that additional, potentially transient events are critical for acquisition of metastatic capacity [73]. More recently, the finding that bulk primary tumor gene expression can stratify breast cancer patient outcome [74] has led to the suggestion that metastatic capacity is encoded by the primary tumor drivers [75]. The results of our study are most consistent with the progression model originally proposed by Nowell [7], although it does not rule out important combined contributions of the transient epigenetic effects, microenvironment, and selection as proposed by Weiss in 1979 [73]. The presence of metastasis-specific driver mutations may also, in part, contribute to the difference in response to therapeutics observed between primary tumors and metastases, both in animal models [76] and in the clinic, and suggests different therapeutic strategies targeting metastasis drivers may improve patient outcomes.
In summary, by performing a survey of the genomic landscape of metastatic progression in two preclinical mouse models, we have for the first time identified recurrent, clinically relevant spontaneous mutations in genes and pathways that drive metastasis. Breast cancer patients with metastatic disease have an average 5-year survival of only 25%, underscoring the urgent need for more comprehensive molecular analyses of metastases. By using the strategy outlined here, the continued characterization of preclinical mouse models of metastasis will provide further insight into the associations between clinical subtypes, primary tumor drivers, and the now newly confirmed metastasis-driving genomic alterations, ultimately creating new opportunities for the generation of metastasis-targeted therapies.

Ethics statement
The research described in this study was performed under the Animal Study Protocol LCBG-004 and LPG-002, approved by the National Cancer Institute (NCI) Animal Use and Care Committee. Animal euthanasia was performed by cervical dislocation after anaesthesia by Avertin.

Genetically engineered mouse models
FVB/N-Tg(MMTV-PyVT)634Mul/J (PyMT) and FVB/N-Tg(MMTVneu)202Mul/J (Her2) male mice were obtained from Jackson Labs. Male PyMT mice were crossed with female wild type FVB/NJ, MOLF/EiJ, CAST/EiJ, C57BL/6J, and C57BL10/J mice also obtained from Jackson Labs. Male Her2 mice were crossed with female wild type FVB/NJ mice. All female F1 progeny were genotyped by the Laboratory of Cancer Biology and Genetics genotype core for the PyMT or Her2 gene and grown until humane endpoint. Mice were euthanized using intraperitoneal Avertin to anesthetize followed by cervical dislocation. All primary tumors generated by one animal were isolated weighed, randomly sampled, and combined into a single cryovial. Metastatic nodules, and normal (tail) tissue were also isolated immediately following euthanasia and snap frozen in liquid nitrogen. Tissue samples were then stored at -80˚C.

gDNA isolation
The combined primary tumor tissue from one mouse was ground on dry ice and small fragments were taken for gDNA isolation. Whole metastases were used for gDNA isolation. Tissue was lysed using Tail Lysis Buffer (100 mM Tris-HCl pH 8.0, 5 mM EDTA, 0.2% SDS, 200 mM NaCl, 0.4 mg/ml proteinase K) at 55˚C overnight. Samples were then placed in a shaking (1400 rpm) heat block for 1 hour at 55˚C. RNaseA (Thermo Fisher Scientific) was added (2 mg/ml final) and lysates were incubated on the bench for 2 minutes. gDNA was then isolated using the ZR-Duet DNA/RNA MiniPrep kit (Zymo Research).

Sequencing and analysis
All analyses were carried out on the NIH Biowulf2 high performance computing environment. All analyses were performed using software default parameters if not otherwise specified.

Exome sequencing
Exome sequencing was performed by the NCI Center for Cancer Research (CCR) Genomics Core and the NCI Illumina Sequencing Core. Exome libraries were prepared by the Genomics Lab using Agilent SureSelectXT Mouse All Exon target enrichment kit. Libraries were barcoded and pooled before sequencing on an Illumina HiSeq3000 or HiSeq4000 to an average depth of 40x. Samples were trimmed of adapters using Trimmomatic software. The trimmed reads were aligned to the mm10 reference mouse genome using BWA (0.7.17) or Bowtie (2-2.3.5.1) mapping software. SAMtools (1.9) Mpileup, GATK (3.8-1) Mutect2, and Strelka (2.7.1) were used to identify potential variants and the variants were filtered for known polymorphisms from mgp.v5 (http://ftp-mouse.sanger.ac.uk/current_snps/mgp.v5.merged.snps_ all.dbSNP142.vcf.gz) and variants with Phred-scaled quality scores of < 30 were removed. Annotation was performed by Annovar (2018-04-16). High probability SNVs were identified after manual curation of the data and select SNVs were validated by Sanger sequencing at the NCI CCR Genomics Core. Copy number variation was performed for the F1 tumors using the R (3.6.0) packages BubbleTree (2.15.0) and cn.mops (1.31.0), followed by filtering for metastasis-specific events and manual curation. We used BubbleTree to do allele-specific and strainspecific copy number analysis, which required genotype data, and cn.mops for general copy number analysis, which doesn't require F1 cross experiment data. The recurrent CNVs were defined as CNVs present in multiple samples.

Whole genome sequencing
Library preparation was performed using the TruSeq DNA Sample Prep kit FC-121-1001. Samples were barcoded, pooled, and sequenced on an Illumina HiSeq4000 to a depth of~10x per sample. Samples were trimmed of adapters using Trimmomatic (0.39) software before the alignment. The trimmed reads were aligned with the mm10 reference mouse genome using BWA alignment. SNV calls were performed as described for exome sequencing. Structural variation analysis was performed using BreakDancer (1.4.5), followed by filtering for metastasisspecific events and manual curation. For breakpoints located within or near a gene, we filtered structural variations (SVs) by limiting the breakpoint within 10 kb of a gene. CNV analysis was performed using cn.mops as described for exome sequencing.

RNA sequencing
RNA fusion transcripts were identified using deFuse (0.8.1) software. We used the filtered fusion transcripts. The putative fusion transcripts were further analyzed by comparing with WGS BreakDancer data to identify fusion transcripts that were supported by genomic changes. Mutant alleles from RNA-seq data were identified using SAMtools (1.9). The sequence reads in pileup format were generated using SAMtools mpileup for BAM files spanning the SNVs. We extracted the reference allele and mutant allele counts for each sample using a custombuilt Perl (v5.24.3) script and R (3.6.0) script.

Comparative expression analysis of EMT transcripts
Using Partek Flow Genomic Analysis Software, normalized RNA-seq counts for EMT factors were compared between Kras WT and Kras G12D metastatic nodules, or between 4T1 cells transduced with empty vector, Kras WT, or Kras G12D using gene-specific analysis (GSA). Within the Partek Flow GSA report each factor was identified individually by searching with the official gene symbol, and then dot plots were created. Gene symbols used in search included Kras (for GEMM metastatic nodules only), Fn1, Epcam, Cdh1, Vim, Zeb1, Snai1, and Snai2. Factors that were not expressed did not produce dot plots following the Partek Flow GSA report search. Only those factors that were expressed produced dot plots that were then included in the final figure.

Hierarchical clustering/heatmap and PCA analysis
CNV data were generated from cnmops as described in the exome sequencing section and were further processed using the DiffBind R package to generate a data matrix for the 25 lung metastases. Each genomic region is 4000 bp, which is coded as -1, 0, or 1 for CN loss, neutral, or gain, respectively. Genomic regions were filtered by the CN alterations presented in at least 2 samples. Similarly, SNV data were also filtered by the presence of SNVs in at least 2 samples.
SNVs are coded as 0 or 1 for the absence for presence of mutations. Heatmap plots were generated using the ComplexHeatmap package.

Sanger sequencing
Polymerase chain reaction (PCR) was performed using AmpliTaq Gold polymerase (Thermo Fisher Scientific) according to the manufacturer's instruction. PCR products were then purified by gel electrophoresis using the QIAquick Gel Extraction Kit (Qiagen). The Sanger reaction and electrophoresis for the forward and reverse sequences of each purified amplicon was performed by the NCI CCR Genomics Core. Sequences were manually assessed for single nucleotide variations using Geneious software. If the original sample was RNA, then cDNA was synthesized by reverse transcription using the iScript cDNA Synthesis Kit (Bio-Rad).

Genomic Regions Enrichment of Annotations Tool (GREAT) analysis
BED files containing the recurrent regions of genomic gain or loss for each mouse non-FVB mouse strain were loaded into the GREAT tool website using the default settings for gene assignment. GREAT calculates statistics by associating genomic regions with nearby genes and applying the gene annotations to the regions.

Ingenuity Pathway Analysis
RNA sequencing DESeq2 results of mammary tumors compared to lung metastases was downloaded from the 2019 Alzubi et.al article Additional File 10 [35]. Pathway analysis was performed using Ingenuity Pathway Analysis (IPA) (Qiagen). Differentially regulated gene lists from each paired comparison (HCI01, HCI02, WHIM2) were uploaded into IPA for Core Expression Analysis of expression data. The Ingenuity Knowledge Base was chosen as the reference set of genes, and both direct and indirect relationships were considered. No other analysis parameters were specified, and the default settings were selected.

Plasmid constructs
Lentiviral pDEST Gateway Entry clones expressing Kras WT or Kras G12D with a C-terminal myc tag, and under the control of the Pol2 promoter were obtained from the NCI Ras Initiative. An empty lentiviral pDEST Gateway Entry clone (Thermo Fisher Scientific) was used as the empty vector control.

Virus transduction
1 x 10 6 293T cells were plated in 6 cm dishes 24 hours prior to transfection in P/S-free 10% FBS DMEM media. Cells were transfected with 1 μg Kras or control expression plasmid and 1 μg of viral packaging plasmids (250 ng pMD2.G and 750 ng psPAX2) using 6 μl of X-tremeGENE 9 transfection reagent (Roche). After 24 hours of transfection, media was refreshed with complete DMEM. The following day, virus-containing supernatant was passed through a 45-μm filter to obtain viral particles, which were then transferred to 100,000 4T1 or MET1 cells. The viral media was removed and fresh complete DMEM was added 24 hours post-transduction. Finally, the cells were selected with 10 μg/ml puromycin-or 5 μg/ml blasticidin-containing complete DMEM beginning 48 hours after transduction.

In vivo metastasis experiments
Female virgin FVB/NJ or BALB/cJ mice were obtained from Jackson Laboratory, and athymic NCI Nu/Nu from NCI Frederick at 6-8 weeks of age. Two days prior to in vivo experiments, cells were plated at 1 x 10 6 cells/condition into T-75 flasks (Corning) in non-selective DMEM. A total of 100,000 cells per mouse was injected into the fourth mammary fat pad of FVB/NJ (MET1 and 6DT1 cells), BALB/cJ (4T1 cells) mice. The mice were euthanized between 28-30 days post-injection. Primary tumors were resected, weighed, and lung metastases counted. Statistical significance was calculated with a Kruskal-Wallis test followed by Conover Inman test. All animal experiments were performed in compliance with the National Cancer Institute's Animal Care and Use Committee guidelines.

Western blotting
Protein lysate from 1 x 10 6 cells were extracted on ice using Golden Lysis Buffer (10 mM Tris pH 8.0, 400 mM NaCl, 1% Triton X-100, 10% Glycerol+Complete protease inhibitor cocktail (Roche), phosphatase inhibitor (Sigma)). Protein concentration was measured using a BCA Protein Assay Kit (Pierce) and analyzed on the Versamax spectrophotometer at a wavelength of 560 nm. Appropriate volumes containing 20 μg of protein lysates combined with NuPage LDS Sample Buffer and NuPage Reducing Agent (Invitrogen) were run on 4-12% (or otherwise indicated) NuPage Bis-Tris gels in MOPS buffer. Proteins were transferred onto a PVDF membrane (Millipore), blocked in 5% milk (TBST + dry milk) for 1 hour and incubated in the primary antibody (in 5% milk) overnight at 4˚C. Membranes were washed with 0.05% TBST (TBS + 5% Tween) and secondary antibody incubations were done at room temperature for 1 hour. Proteins were visualized using Amersham ECL Prime Western Blotting Detection System and Amersham Hyperfilm ECL (GE Healthcare).

Statistics
Statistical significance between groups in in vivo assays was determined using the Mann-Whitney unpaired nonparametric test using Prism (version 5.03, GraphPad Software, La Jolla, CA). Statistical significance between samples in RT-qPCR analysis was determined by an unpaired t test, also using Prism.  Table. PyMT and Her2 exome-seq high probability metastasis-specific SNVs Sheets: 1. Instances of Her2 metastasis-specific (met. spec.) SNVs, 2. Instances of PyMT met. spec. SNVs, 3. Singly mutated genes, 4. Recurrently mutated genes. Abbreviations: Chr (Chromosome number), Position (mm10 genomic position of mutated SNV), type (mutation type: synonymous, nonsynonymous, or stop gain), alt.fraction (allele fraction within the metastatic tissue), Transcript (NCBI accession number for isoform), Exon (exon harbouring SNV within designated transcript), Codon (codon harbouring SNV within designated transcript), Nuc sub (nucleotide position within designated transcript and substitution), AA sub (amino acid position within gene isoform and resulting substitution) (XLSX) S2 Table. Sequencing validation and overlap Sheets: 1. Sanger sequencing (seq.) summary, 2. All seq. summary. Abbreviations: Y (yes), N (no), "/ "(and), E (exome seq), R (RNA-seq), W (whole genome seq) (XLSX) S3 Table. PyMT regions of CNV in PT and metastatic tissue compared to normal. Number of CNV events observed in PT and metastases compared to normal tissue. This table stratifies CNVs by mouse chromosome number and mouse strain. Also listed is the number of animals used in this study per stain, as well as the number of PT or metastatic samples collected from that strain total. Blue cells represent deletion events termed "loss," and red cells represent amplification events termed "gain." (XLSX) S4 Table. PyMT regions of CNV and associated genes specific to MOLF/Ei metastatic tissue. Sheet1: "Strain" loss/gain associated (assoc.) genes, 2. "Strain" loss/gain assoc. pathways. Abbreviations: Name (gene symbol), ID (Term identifier from GREAT ontology), Rank (ordinal rank of the p-value compared to the p-values of other annotations), Raw p-value (uncorrected p-value from the binomial test over genomic regions), FDR q-Value (False discovery rate q-value), Fold Enrichment (fold enrichment of number of genomic regions in the test set with the annotation), Observed Region Hits (actual number of genomic regions in the test set with the annotation), Region Set Coverage (the fraction of all genomic regions in the test set that lie in the regulatory domain of a gene with the annotation, Sheet 2: "Strain" recurrent regions of loss or gain. Abbreviations: chr (chromosome), start (position of amplification or deletion start), end (position of amplification or del end), overlap.region (length in bp of overlap in recurrent region of amplification or deletion), freq. (number of individual animals with overlapping region of amplification or deletion), s1 / s2 (region identified in individual animals 1 and 2). Table. PyMT regions of CNV and associated genes specific to CAST/Ei metastatic tissue. Sheet1: "Strain" loss/gain associated (assoc.) genes, 2. "Strain" loss/gain assoc. pathways. Abbreviations: Name (gene symbol), ID (Term identifier from GREAT ontology), Rank (ordinal rank of the p-value compared to the p-values of other annotations), Raw p-value (uncorrected p-value from the binomial test over genomic regions), FDR q-Value (False discovery rate q-value), Fold Enrichment (fold enrichment of number of genomic regions in the test set with the annotation), Observed Region Hits (actual number of genomic regions in the test set with the annotation), Region Set Coverage (the fraction of all genomic regions in the test set that lie in the regulatory domain of a gene with the annotation, Sheet 2: "Strain" recurrent regions of loss or gain. Abbreviations: chr (chromosome), start (position of amplification or deletion start), end (position of amplification or del end), overlap.region (length in bp of overlap in recurrent region of amplification or deletion), freq. (number of individual animals with overlapping region of amplification or deletion), s1 / s2 (region identified in individual animals 1 and 2). (XLSX) S6 Table. PyMT regions of CNV and associated genes specific to C57BL10/nJ metastatic tissue. Sheet1: "Strain" loss/gain associated (assoc.) genes, 2. "Strain" loss/gain assoc. pathways. Abbreviations: Name (gene symbol), ID (Term identifier from GREAT ontology), Rank (ordinal rank of the p-value compared to the p-values of other annotations), Raw p-value (uncorrected p-value from the binomial test over genomic regions),  Table. PyMT regions of CNV and associated genes specific to C57BL6/nJ metastatic tissue. Sheet1: "Strain" loss/gain associated (assoc.) genes, 2. "Strain" loss/gain assoc. pathways. Abbreviations: Name (gene symbol), ID (Term identifier from GREAT ontology), Rank (ordinal rank of the p-value compared to the p-values of other annotations), Raw p-value (uncorrected p-value from the binomial test over genomic regions), FDR q-Value (False discovery rate q-value), Fold Enrichment (fold enrichment of number of genomic regions in the test set with the annotation), Observed Region Hits (actual number of genomic regions in the test set with the annotation), Region Set Coverage (the fraction of all genomic regions in the test set that lie in the regulatory domain of a gene with the annotation, Sheet 2: "Strain" recurrent regions of loss or gain. Abbreviations: chr (chromosome), start (position of amplification or deletion start), end (position of amplification or del end), overlap.region (length in bp of overlap in recurrent region of amplification or deletion), freq. (number of individual animals with overlapping region of amplification or deletion), s1 / s2 (region identified in individual animals 1 and 2). (XLSX) S8 Table. PyMT strain summary for CNVs associated genes specific to metastatic tissue. Genes associated with metastasis-specific CNVs and found in more than one mouse strain.  Table. Fusion events for primary tumor and metastatic tissue compared to normal. This table shows the data generated from BreakDancer algorithm analysis of 10X WGS from 20 matched pairs of primary and metastatic lesions from FVB/NJ PyMT x MOLF/EiJ and FVB/ NJ PyMT x CAST/EiJ animals. We show here the metastasis specific gene fusion data used as input data for Fig 3B. We also she here primary tumor tissue and metastatic tissue gene fusion events versus normal strain specific tissue. Sheet1:Met. spec gene fusion events circos plot input. Sheet2: PT v Norm gene fusion events. Sheet 3: Met v Norm gene fusion event. Abbreviations: Sample (mouse ID and tissue type), Chr1 (chromosome where gene 1 is located), Pos1 (position within Chr1 where fusion event break point is located), Chr2 (chromosome where gene 2 is located), Pos2 (position within Chr2 where fusion event break point is located), Type (intra-chromsome events with different strands (ITX) inversion (INV), deletion (DEL), num_-Reads (number of reads). (XLSX) S10