Exome Analysis Reveals Differentially Mutated Gene Signatures of Stage, Grade and Subtype in Breast Cancers

Breast cancers exhibit highly heterogeneous molecular profiles. Although gene expression profiles have been used to predict the risks and prognostic outcomes of breast cancers, the high variability of gene expression limits its clinical application. In contrast, genetic mutation profiles would be more advantageous than gene expression profiles because genetic mutations can be stably detected and the mutational heterogeneity widely exists in breast cancer genomes. We analyzed 98 breast cancer whole exome samples that were sorted into three subtypes, two grades and two stages. The sum deleterious effect of all mutations in each gene was scored to identify differentially mutated genes (DMGs) for this case-control study. DMGs were corroborated using extensive published knowledge. Functional consequences of deleterious SNVs on protein structure and function were also investigated. Genes such as ERBB2, ESP8, PPP2R4, KIAA0922, SP4, CENPJ, PRCP and SELP that have been experimentally or clinically verified to be tightly associated with breast cancer prognosis are among the DMGs identified in this study. We also identified some genes such as ARL6IP5, RAET1E, and ANO7 that could be crucial for breast cancer development and prognosis. Further, SNVs such as rs1058808, rs2480452, rs61751507, rs79167802, rs11540666, and rs2229437 that potentially influence protein functions are observed at significantly different frequencies in different comparison groups. Protein structure modeling revealed that many non-synonymous SNVs have a deleterious effect on protein stability, structure and function. Mutational profiling at gene- and SNV-level revealed differential patterns within each breast cancer comparison group, and the gene signatures correlate with expected prognostic characteristics of breast cancer classes. Some of the genes and SNVs identified in this study show high promise and are worthy of further investigation by experimental studies.


Introduction
Breast cancer is the most common cancer (29% of newly diagnosed cancers) in women in US, and has the second highest mortality rate that accounts for about 25% of all cancer deaths [1]. It has been recognized that categorization of breast cancers into different subtypes can effectively guide treatments and greatly improve the prognosis. Several factors like hormone receptor status, breast cancer biomarkers and gene expression profiles have been used to classify breast cancers, estimate the recurrence risk, and guide targeted treatment [2].
Breast cancers are highly heterogeneous in their clinical and molecular profiles, which suggest that the prognosis for each subtype is very distinct. For example, estrogen and progesterone hormone receptor positive (ER+ and PR+) breast cancers have a better prognosis than estrogen and progesterone receptor negative (ER-and PR-) breast cancers. In addition, ER + and PR+ breast cancers can be treated with anti-hormonal therapy, while ER-and PR-breast cancers are not responsive to such therapies. On the other hand, HER2-positive (HER2+) breast cancers usually occur in younger women, grow more invasively, and prior to the advent of targeted therapy, posed a higher risk of recurrence than HER2-negative (HER2-) breast cancers, partly because of the overexpression of HER2/neu protein (human epidermal growth factor receptor 2, also known as ERBB2) in these cancers.
So far, breast cancer is one of the few cancer types in which targeted therapies have been designed based on the molecular classification [3]. In addition, the gene expression profiling based classification of breast cancers has identified four major subtypes: luminal A, luminal B, human HER2+, and basal-like [4], which have prognostic implications. For example, Oncotype Dx, a 21-gene assay [5], and Mammaprint, a 70-gene expression signature have been developed as a prognostic assessment tool to predict the risk of breast cancer metastasis [6]. However, one disadvantage of using gene expression profiling to identify biomarkers or signatures for cancer is that gene expression levels are highly variable and unsteady, and therefore a single measure often leads to misinterpretation. In contrast, genetic mutations at DNA level can be stably detected. As all cancers carry somatic mutations in their genomes and mutational heterogeneity widely exists in cancer genomes [7], biomarkers for cancer based on gene mutation information could be detected more accurately than those based on gene expression profiling. Rapid advances in next-generation sequencing (NGS) technology have enabled sequencing of a large number of whole exome samples in parallel at a reasonable expense. As a result, a large amount of NGS data on tumor genomes have emerged that makes detection and application of genomic mutant-based biomarkers for cancer a reality.
While differential gene expression among different subtypes of breast cancer have been widely used for assessing prognosis and predicting therapeutic response [8], The Cancer Genome Atlas (TCGA) network analyzed differential somatic mutations among the four breast cancer subtypes: luminal A, luminal B, HER2+, and basal like, and identified several significantly mutated genes that showed subtype-specific patterns of mutation [9]. Some of the studies report specific DNA mutations from comparisons of ER+/- [10] or HER2+/-classes [11], simply by checking genes that encode ER (ESR1 and ESR2) and HER2 (ERBB2), respectively. However, no systematic studies have been carried out to identify DMGs between the ER, PR, HER2 subtypes, or the tumor grade and stage classes. In the present study, we analyzed 98 breast cancer exome sequencing datasets that were previously published [12]. We performed large-scale comparison of single nucleotide variation (SNV) differences between three breast cancer subtypes (ER+ vs. ER-, PR+ vs. PR-, HER2+ vs. HER2-), two different histologic grades (grade II vs. grade III), and two different stages (stage II vs. stage III), all of which are clinical features that are directly associated with prognosis of breast cancers. We did not use PAM50 or other gene-expression based subtypes for identifying DMGs because there is no evidence showing that gene expression profiles are directly correlated with gene mutation profiles. We proposed a scoring function to evaluate the deleterious impact of the sum of all mutations in a gene, and then used multiple t-tests to identify DMGs between the five breast cancer comparison groups described above. We performed an extensive examination of literature to confirm the relevance of the identified DMGs to breast or other cancers. We also identified the deleterious SNVs from the DMGs that occur with significantly different frequency in between the five breast cancer comparison groups. For some important mutations, we also examined the impact of each mutation on the structure and function of the protein using proteinmodeling tools.

Breast cancer whole exome-seq datasets
We downloaded the whole exome sequencing datasets for 103 breast cancer samples (54 samples from Mexican patients and 49 samples from Vietnamese patients) from dbGap website http://www.ncbi.nlm.nih.gov/gap (accession number: phs000369.v1.p1) [12]. In this study, we analyzed only 98 samples because 5 Mexican samples have very low sequencing quality. All the 98 breast cancer samples contain tumor/normal pairs. We assume that germline and acquired somatic mutations (till the diagnosis of cancer) could significantly contribute to the differential phenotype of breast cancers [13]; hence, we did not filter out the mutations that are present in the normal sample. Based on the clinical information provided, we sorted the 98 samples (patients) into five comparison groups that include three clinical subtype groups (ER+ vs. ER-, PR + vs. PR-, HER2+ vs. HER2-), a grade-based (grade II vs. grade III), and a stage-based (stage II vs. stage III) group. A summary of the classification information for the 98 samples is shown in Table 1 (Stage I and IV, grade I were excluded in the comparisons due to lack of sufficient data). The mutation profile for each patient is often sparse. When comparing one class with smaller sample size (<10) against another class with a larger sample size (i.e. >20) in another, one mutation in the former class will be considered to occur at a rate of more than 10%. Therefore, we set a cut off value of 10 to define the descent sample size, in order to minimize the impact of rare mutations in classes with smaller sample size in the statistical tests. In this case, any class that has less than 10 samples will not be compared separately, if applicable (Table 1). We also performed Fisher's exact test to check the race compositional differences between each comparison group. Notably, the race composition in grade II and grade III is unbalanced (Table 1 and S1 Table); therefore, we only performed the class comparison for Mexican patients, in order to eliminate the effect of any race-specific genetic variations. (A detailed description of clinical information for all samples is shown in S1 Table).
Sample quality control, alignment, SNV calling and annotation We used FastQC [14] and FastX toolkit [15] for quality control of the 98 tumor whole exome sequencing datasets. Short reads with low sequencing quality (Phred score < 20) were removed or trimmed, accordingly. Processed reads were then aligned with Borrows-Wheeler Aligner [16] to the human reference genome hg19. We then applied the Genome Analysis Toolkit [17] (GATK) best practices pipeline [18,19] from Broad Institute for SNV (Single Nucleotide Variant) calling from alignment files, and the pipeline includes multiple steps such as Mark Duplicates, Local Realignment, Quality Score Recalibration and variant calling. After 98 SNV profiles were generated, we used ANNOVAR [20] for functional annotation of all the SNVs. The SIFT [21] score reported from ANNOVAR was used to evaluate the degree of deleteriousness of SNVs.

Scoring the deleteriousness of mutated genes
The SIFT score ranges from 0 to 1. An SNV is predicted to be deleterious when its SIFT score is less than or equal to 0.05. Therefore, we filtered out all the SNVs that have the SIFT score more than 0.05. We calculated the deleterious score, D, using the following function, Where D ij : the deleterious mutation score for the i th gene in sample j; S ijkx : the SIFT score for the k th mutation in isoform x of the i th gene in sample j; N k : the number of isoforms that are affected by the mutation k for that specific gene in that sample.
This scoring function combines the SIFT scores for all deleterious mutations in a gene (including the isoforms, if any) and generates a combined deleterious score for each mutated gene. Therefore, by applying this scoring function, we obtained a matrix with 98 columns (98 patients) and about 17000 rows (~17000 RefSeq genes). Each cell represents how deleterious one gene is mutated for the specific patient. Obviously, the higher the score is, the more deleterious way the gene mutations affect the gene function.

Identification of DMGs between breast cancer classes
We identified DMGs between five pairs of breast cancer class comparisons (ER+ vs. ER-, PR + vs. PR-, HER2+ vs. HER2-, grade II vs. grade III, and stage II vs. stage III) using the univariate t-test at a two-sided significance level of 0.001. Considering the small sample size of grade I and stage I classes, we only performed grade II vs. grade III, and stage II vs. stage III class comparisons (no patients with stage IV breast cancer were present in our list). To adjust for multiple testing, we also reported the false discovery rate (FDR) for each gene identified. The FDR was estimated using the method of Benjamin and Hochberg [22]. This procedure was implemented with the class comparison tool in BRB-ArrayTools [23].

Functional analysis of DMGs and SNVs
We examined the deleterious SNVs present in the DMGs of different breast cancer classes using an odd ratio, which identifies SNVs that are at least 2-fold more frequent in one class over the other between the populations of a two-class comparison. Fisher's exact test was used to examine the significance of the differences. We then used Pfam (protein family) database [24] and CONDEL [25] to predict the functional impact of those significant SNVs on proteins. Pfam database contains information on evolutionarily conserved functional domains; hence, if an SNV occurs in the domain region, it is more likely to affect the structure and/or function of the protein. CONDEL is a method to assess the outcome of non-synonymous SNVs with the best sensitivity and specificity [25]. It uses the consensus deleterious scores by combining predictions from five different tools that include SIFT [21], PolyPhen2 [26], Logre [27], Mutatio-nAssessor [27] and MAPP [28].

Protein stability analysis for point mutations
For the DMGs, we analyzed the overall impact of point mutations on protein stability. For feasibility of analysis, we selected a set of 10 relatively rare non-synonymous SNVs that occur either in a functionally annotated (Pfam-A) or evolutionarily conserved (Pfam-B) domain region. We used iprscan version 4.8 [29] for Pfam and PANTHER motif search. We then used two reliable structure prediction tools, RaptorX [30] online webserver and I-TASSER suite [31] standalone version, for protein structure prediction. We ran I-TASSER in parallel mode with the default parameters.
Further, we used three similar and independent tools, I-Mutant-2.0 [32], PopMusic-2.1 [33] and CUPSAT [34] to analyze the overall impact of a point mutation on protein stability. I-Mutant predicts the stability of a point-mutated protein from its primary sequence, while PopMusic 2.1 and CUPSAT predict the same from its 3D structure. We evaluated the overall impact of a point mutation on protein stability based on the consensus results from these three methods; if at least two tools predict the same mutation effect on the protein structure, i.e., destabilizing or stabilizing, then only we accept that result.

Results and Discussion
Samples in ER, PR, HER2 and Stage are all having at least one class with the sample size of less than 10, if we separate each class by race (Table 1). Therefore, we merged all the samples that are available for each class. Notably, giving the clinical significance of HER2 status in breast cancer, we still performed the class comparison for HER2 group, despite the fact that HER2 + class only has 8 samples. For Grade II vs. Grade III, we excluded all the Vietnam patients, as the unbalanced sample size in Vietnam patients (0 samples in Grade II and 13 samples in Grade III) will definitely bias the test result. Fisher's exact test result from table 1 shows that only ER class has significantly different race composition (p = 0.0018). This could indicate the potential impact of the race factor on ER comparison group result. However, the statistical power of the comparison will be limited by the number of sample size, if we do the ER comparison for each race separately.
SNV profiles were generated for 98 tumor exome-seq datasets using the GATK pipeline, and annotated using ANNOVAR. A mutation score matrix was created for all 23,769 RefSeq genes (42,239 transcripts in total) in all the samples based on the annotation results. We filtered out those genes that have deleterious mutations present in less than 5 (out of 98) samples, and obtained 3,826 genes for further analysis (S2 Table). Combined mutation score (from all mutations of all isoforms of a gene) of each gene was compared between the defined breast cancer classes to identify sets of DMGs. These include 18 (ER+ vs. ER-), 9 (PR+ vs. PR-), 10 (HER2 + vs. HER2-), 10 (grade II vs. grade III) and 7 genes (stage II vs. stage III), using a two-sided t-test (p-value 0.001).
Using literature survey, we sorted the DMGs into 4 different categories in the order of their relevance to breast cancer or other types of cancers. Category 1 includes the genes that have been reported to be directly associated to breast cancer, while category 2 includes those that are related to other types of cancer, but not to breast cancer. Category 3 includes the genes whose functions have not been well studied, but other members of these gene families have been reported to be associated to cancer. Category 4 includes the other genes that do not belong to the former three categories, while their relatedness to cancer remains to be studied. Fig. 1 presents the CIRCOS [35] graph of the DMGs identified by the five class comparisons along with their corresponding chromosomal positions. It shows that chromosome 4, 11, and 19 have the largest number (5 genes in each chromosome) of DMGs identified in the comparisons, while chromosome 14, 18, 21, and 22 do not contain any of the reported DMGs. Also, chromosome 4 has the largest number of DMGs (CPZ, CSN3, KIAA0922) that are directly related to breast cancer. Moreover, because of the similarities of ER and PR status in terms of breast cancer prognosis and therapy, the positional pattern for ER+/-and PR+/-is similar by having the same 3 DMGs in both group comparisons. Fig. 2 is a heatmap showing the deleterious mutation patterns for DMGs identified by five groups of breast cancer class comparisons. It is evident that the overall deleterious mutation scores are higher in classes with poorer prognosis (ER-, PR-, HER2+, Grade III and Stage III), suggesting that deleterious mutations in these genes contribute to different prognostic features for each class. However, it is also evident that sets of genes within a class comparison show contrasting deleterious mutation patterns suggesting that their roles as oncogenes or tumor suppressor gene are balanced (Fig. 2). For instance, ERBB2, an oncogene is predominantly mutated in ER+ class (77.5%) compared to ER-class (33.3%), suggesting that dysregulation or altered function of HER2/neu protein is associated with a better prognosis in breast cancer patients. In contrast, CSN3, a part of the CSN complex that activates tumor suppressor TP53, is predominantly less frequently mutated in ER+ (17.5%) compared to ER-(63.0%) samples. Descriptive information for all the identified DMGs is presented in Tables 2-6 and in the S1 File.
Notably, some of the FDR values we reported in Tables 2-6 are relatively high. This is because in the present study, we only considered SNVs whose SIFT score is not greater than 0.05. As a result, our deleterious mutation scores lie in a relatively narrow range, which could have generated high FDR values.

Comparison of DMGs in hormone receptor positive vs. negative breast cancer subtypes
Due to many similarities between the ER+/-and PR+/-class comparisons, we are presenting these two classes together in this section. We identified 18 genes with significantly different mutation scores between ER+ and ER- (Fig. 2, Table 2), and 9 genes between PR+ and PRbreast cancer subtypes (Fig. 2, Table 3). Three genes, OR1J2, SKOR1, and DPP3 are commonly identified in both class comparisons. In Tables 2 and 3, genes are listed based on their biological relevance to breast cancer. Of these, CSN3, ERBB2, PPP2R4, CAPZA2, SKOR1, ARID5A, and CPN1 belong to category 1 that contain literature-based relevance to breast cancer. Below, we describe the functional roles of all DMGs under category 1 in each comparison group, while description of all other DMGs can be found in S1 File.
CSN3 (kappa-casein) is involved in myeloid leukemia factor 1-mediated growth arrest and CSN3 deficiency impairs p53 activation, facilitates cell proliferation and affects COP1-mediated p53 degradation [36]. It indicates that mutationally impaired CSN3 could promote cancer growth and progression by dysregulation of the tumor suppressor gene p53. This is consistent with our results that CSN3 has a higher deleterious mutation score in the more aggressive ERbreast cancers compared to that in less aggressive ER+ breast cancers (p-value = 7.06×10 -5 ). The differentially mutated genes between breast cancer subtypes. A total of 50 genes are identified that are differentially mutated by comparison of ER+ vs. ER-, PR+ vs. PR-, HER2+ vs. HER2-, grade II vs. grade III, and stage II vs. stage III breast cancer classes, respectively. Each class comparison is shown in layered circles. The differentially mutated genes are shown in the outer layer, which correspond to their chromosome coordinates and subtype comparisons. The differentially mutated genes are sorted into four different categories based on their relevance to breast cancer or other types of cancer. Category 1 includes the genes that are directly related to breast cancer (in dark red). Category 2 includes the genes that are related to other types of cancer (in green). Category 3 includes the genes whose family members are related to cancer (in blue). Category 4 includes the genes whose relatedness to cancer remains to be studied (in gray). The mean deleterious mutation score for each gene in each class comparison is shown in colored thin bar (red and blue colors refer to two different classes). The length of thin bars is proportional to the mean deleterious score. On the other hand, ERBB2 (also known as HER2/neu), is a well-characterized oncogene that is responsible for development and progression of certain aggressive types of breast cancer. ERBB2 has been shown to be associated with poor prognosis of breast cancers [37]. Overexpression of this gene has been shown to be very crucial in the development and progression of certain aggressive types of breast cancer [38]. Our results corroborate that ERBB2 shows higher mutational load (p-value = 1.57×10 -4 ) in ER+ breast cancers compared to ER-breast cancers and consequently dysregulated to negate cancer growth and progression in the former subtype.
PPP2R4, also known as protein phosphatase 2A (PP2A), regulates estrogen receptor alpha (ER-α) expression through modulation of ER mRNA stability; hence, it has been considered as a potential therapeutic target for breast cancer [39]. It has been shown that PPP2R4 is involved in PI3K/Akt signaling pathway, a pathway that modulates the interaction between BRCA1 and ER-α [40]. Mutations of PPP2R4 have been shown to contribute to many cancer types including breast cancers [41], and it has been suggested that PPP2R4 might be a tumor suppressor The deleterious mutation scores for the differentially mutated genes across the compared samples. Five heatmaps show the deleterious mutation scores across the compared samples for the differentially mutated genes identified by comparison of ER+ vs. ER-, PR+ vs. PR-, HER2+ vs. HER2-, grade II vs. grade III, and stage II vs. stage III breast cancer classes, respectively. Higher score implies more deleterious mutations one gene has. It is evident that groups with better prognosis (ER+, PR+, HER2-, Stage II and Grade II) tend to have fewer deleteriously mutated genes. gene [42]. Our results show that PPP2R4 has more deleterious mutations in ER-breast cancers than in ER+ breast cancers (p-value = 2.09×10 -4 ), suggesting that the higher degree of loss of tumor suppression function for PPP2R4 in ER-subtype relative to ER+ contributes to worse prognosis in the former.
CAPZA2, named as F-actin-capping protein subunit alpha-2, is regulated by Erbb2 in mouse model [43]. It may be also involved in human Ras-MAPK/P13K signaling pathways, as it is predicted to interact with a retinoblastoma tumor suppressor (pRB) protein [44]. Consistent with this notion, our results show that this gene has a higher deleterious mutation score in ER-breast cancers than in ER+ breast cancers (p-value = 4.02×10 -4 ). SKOR1, also known as Fussel-15, is a SKI family transcriptional co-repressor that is identified as a DMG both in ER+/-and PR+/-comparisons. It is also a potential repressor of the BMP signaling pathway [45]. A previous study shows that repressing BMP signaling pathway can efficiently prevent bone metastasis from breast cancer cells [46]. Our results show that the gene has a higher deleterious mutation score in ER+ breast cancers relative to ER-breast cancers (p-value = 7.56×10 -4 ). Similarly, this gene has a higher deleterious mutation score in PR + breast cancers than in PR-breast cancers (p-value = 1.14×10 -4 ). ARID5A has been identified as an ER-α interacting co-repressor protein. ARID5A represses transcriptional activity of endogenous ER-α in MCF-7 breast cancer cells [47]. This gene has a higher deleterious mutation score in PR-breast cancers than in PR+ breast cancers (p-value = 1.0×10 -3 ).
CPN1 gene encodes an enzyme that is responsible for C-terminal cleavage of stromal cell derived factor-1α (SDF-1) [48]. SDF-1 functions as a growth factor for immature B-lymphocytes and controls chemokine expression, thereby regulating the destination of metastasizing breast cancer cells [49]. Besides, studies show that CPN1 is an estrogen target gene in zebrafish model [50]. This gene has a higher deleterious mutation score in PR-breast cancers than in PR + breast cancers (p-value = 5.47×10 -4 ).

Comparison of DMGs in HER2+ vs. HER2-breast cancer subtypes
We identified 10 genes that have significantly different deleterious mutation scores between HER2+ and HER2-breast cancer subtypes as listed in Table 4 ( Fig. 2). Among them, BCAR1, CENPJ, EPS8, KIAA0922, and SP4 are directly related to breast cancer as described below. Literature information for Category 2-4 genes can be found in the S1 File. BCAR1 is a breast cancer anti-estrogen resistance kinase. Previous studies showed that BCAR1 is responsible for resistance to the anti-proliferative effects of tamoxifen [51,52] and its expression level often positively correlate with ERBB2 expression [53], thereby leading to aggressive tumor progression. Table 4 shows that more deleterious mutations of BCAR1 were detected in HER2+ than in HER2-breast cancer subtypes (p-value = 1.06×10 -5 ), suggesting that BCAR1 mutations lead to its hyperactivation that correlates with the overexpression of ERBB2. Interestingly, it has been found that higher BCAR1 levels were significantly associated with ER +/PR+ tumors [54]. CENPJ encodes centromere protein J that is a co-activator for STAT5 signaling pathway [55] and NF-kappa-B-mediated transcription [56]. Nuclear localization of STAT5 marks a good prognosis of ER+/PR+ breast cancers [57] and could be used as an indicator of antiestrogen therapy [58]. NF-kappa-B pathway may be involved in the gain of resistance to HER2-targeting agents therapy [59]. Our results suggest that mutations in CENPJ could potentially be the driver events as the deleterious mutation score for CENPJ in HER2+ breast cancers is much higher than that in HER2-breast cancers (p-value = 2.59×10 -4 ).
EPS8, an epidermal growth factor receptor pathway substrate 8, has been identified as a novel candidate oncogene for breast cancer [60]. EPS8 also decreases chemosensitivity and affects survival of cervical cancer patients [61]. It has been found that small interfering RNA of Eps8, could reduce proliferation and tumorigenesis in Eps8-attenuated HeLa and SiHa cells cultured in dishes or inoculated in mice [61]. Table 4 shows that EPS8 has higher deleterious mutation score in HER2+ breast cancers than in HER2-breast cancers (p-value = 4.61×10 -4 ), suggesting that its mutations might result in poor prognosis of breast cancers.
KIAA0922 is a novel gene detected in Kazusa cDNA sequencing project [62]. Recent studies on KIAA0922 show that it is a transmembrane 131-like (TMEM131L) protein and it functions as a novel regulator of thymocyte proliferation [63]. KIAA0922 also functions as a novel inhibitor of Wnt signaling pathway [63]. Abnormality of Wnt signaling pathway has been associated with breast cancer [64].
Lastly, SP4 is a transcription factor and down-regulation of this gene is associated with inhibited growth of cancer cells in pancreatic [65], colon [66] and breast cancers [67,68].

Comparison of DMGs in Grade II vs. Grade III breast cancer classes
We identified 7 DMGs between Grade II and Grade III breast cancer subtypes as are listed in Table 5 (Fig. 2). SELP is directly associated with breast cancer [69,70,71]. ANO7 belongs to category 2, and ANKRD18B, ANKRD32 and THAP8 belong to category 3. ADD1 is related to hypertension and SNVs in ADD1 is strongly linked with cancer, but there is no literature evidence showing the involvement of tumorigenesis for this gene. Literature information for Category 2-4 genes can be found in S1 File.
SELP has been a part of an invasive ductal carcinoma gene signature [69]. SELP mediates adhesions for various cells including cancer cells in inflammation, thrombosis, cancer growth and metastasis [70]. High expression of SELP correlates with worse prognosis of human cancer by promoting metastasis of the cancer cells [71].
Although there is no direct evidence for the role of ADD1 in breast cancer progression and tumorigenesis, ADD1 has a significantly higher deleterious mutation score in Grade III breast cancers than in Grade II type (p-value = 3.63×10 -4 ). Among all patients with grade II and grade III breast cancer, 14 patients have deleterious mutations (rs4961 and/or rs4963) in ADD1, 12 of those have both rs4961 and rs4963 (Fig. 3). A previous study has shown that the carriers of rs4961 were at 1.8 times increased risk for hypertension (CI: 1.32-2.43) [72]. Also, it has been confirmed that rs4963 is tightly linked with rs4961, and thus could also be linked to hypertension [73]. Hypertension has been shown to be one of the common comorbidities in breast cancer patients, and be associated with worse prognosis of breast cancers [74]. Our data shows that 76.9% (10/13) of the grade III breast cancer patients have either rs4961 and rs4963, indicating an increased risk of having hypertension, compared to 16% (4/25) of that for grade II breast cancer patients (Odd ratio is 17.5). Thus, the correlation between hypertension and breast cancer is worth investigating.
It should also be noted that almost all the DMGs between grade II and grade III classes have higher deleterious mutation scores in grade III except one gene (THAP8). This suggests that deleterious gene mutations evolve with the progression of cancer.

Comparison of DMGs in Stage II vs. Stage III breast cancer classes
We identified 10 DMGs between Stage II and Stage III breast cancer classes as are listed in Table 6. Similar to the grade class, all the genes in Table 6 display higher deleterious mutations in the worse prognosis class (stage III) supporting the general notion that higher mutational load leads to worse prognosis. Half of these genes are directly related to breast cancer (CPZ, LPPR2, PRCP, UNC45A, and PLEKHG6), MMP20 is related to other types of cancer, while CDH26 and GSTO1 belong to category 3. Literature information for category 2-4 genes can be found in S1 File.
CPZ encodes a member of the metallocarboxypeptidase family. This gene is involved in Wnt signaling pathway [75], and therefore potentially plays a role in prognosis of breast cancer. LPPR2 encodes a lipid phosphate phosphatase-related protein that regulates lysophosphatidic acid (LPA) production and signaling [76], and could promote breast cancer initiation, progression and metastasis [77,78,79]. PRCP encodes a protein that acts as a regulator of cell proliferation and autophagy [80], and is also an anti-estrogen resistant protein in ER-positive breast cancer patients [80]. Autophagy functions as a tumor suppressor mechanism, thereby preventing tumor progression [81]. UNC45A encodes a protein that plays a role in cell proliferation and myoblast fusion, and could increase human breast cancer metastasis [82]. Knockdown of UNC45A mRNA slows down human breast carcinoma cell proliferation and invasion [82]. PLEKHG6 regulates the invasion activity of breast cancer cells [83,84].
Based on the five class comparisons we made in this study, it should be noted that there are several potential limitations in this study. First, results from class comparisons with small sample size were more likely to be affected by rare mutations. Secondly, tumor heterogeneity remains a big challenge for SNV analysis, although tumor heterogeneity did not introduce many false positives in this study. Here, we only reported the most likely genotypes using the GATK tool UnifiedGenotyper. Therefore, any reported deleterious mutations should have decent allele frequency in our samples. Heterogeneity of cancer cells would only neutralize the ability to identify those mutations with lower allele frequency. On the other side, a reported deleterious mutation should be either presented in all subclones, or in one or more subclones that are the dominant population in the sample. Thus, our statistical tests only identified the dominant mutations that are more deleteriously mutated in one group compared to another one. To resolve the tumor heterogeneity issue, the single-cell sequencing technology is a good choice.

Functional analysis of deleterious SNVs
We identified 24 deleterious SNVs that have more than 2-fold difference in the odd ratio while also located inside the functional or conserved domain regions of proteins, from the 117 DMG-associated SNVs (S3 Table). These SNVs are presented in Table 7 (rs12421620 from DPP3 is present in both ER and PR class comparisons). Fisher's exact tests show that all the odd ratio differences are significant (p 0.05). For each SNV, we also determined a score that suggests the degree of mutation deleteriousness using the CONDEL software (S4 Table). Fig. 3 shows the presence or absence of SNVs in patients from the five comparison groups of breast cancer. For each class comparison, the frequencies of mutations are highly correlated with the prognostic features. Also, except for the Stage class, all the other classes show contrasting patterns of SNVs (between better and worse prognoses), within each class suggesting their enhancing or suppressing roles in cancer progression. Nine SNVs from ER-vs. ER+ class are predominately present (60.0%) in ER-(poorer prognosis class) while a different set of 6 SNVs (40.0%) is identified in ER+ (better prognosis class). In PR-vs. PR+ comparison, 6 SNVs are significantly present (60.0%) in PR-(poorer prognosis class), compared to 4 (40.0%) in PR+ (better prognosis class). Other classes with poorer prognosis, all have higher number of deleterious mutations, with 4 SNVs (66.7%) in HER2+, 7 (87.5%) in Grade III, and 10 (100.0%) in Stage III. Besides, in Grade II vs. Grade III, Fig. 3 also shows an increased risk of having hypertension comorbidity in Grade III patients because of the higher mutation rate for ADD1 gene in this class (16% in Grade II vs. 76.9% in Grade III).
The deleterious mutation shown in Table 7 for ERBB2 is rs1058808. A previous study has shown that rs1058808 may be associated with higher Body Mass Index (BMI) for high risk of endometrial cancer [85]. Although the association between this SNV and the risk of breast cancer is not identified as statistically significant in these studies [86,87], our results show that this mutation is preferably present in the ER+ compared to the ER-subtype (odd ratio 0.23, p = 0.00624). Another mutation, rs2480452 in PPP2R4 is predominantly present in the ERsubtype (40.7% in ER-vs. 5% in ER+ with odd ratio of 13.06, p = 4.29×10 -4 ). Our protein stability analysis also suggests that this mutation is destabilizing PPP2R4 protein ( Table 8). As mutations of PPP2R4 are significant in the pathogenesis of breast cancer [41], especially in different ER status patients, the downstream effect of this SNV on protein stability is further investigated in the next section. Table 7 shows SNVs that have significantly different occurrence frequency between different breast classes. For example, rs11225089, rs61751507, and rs79167802 occur more frequently in the PR-than PR+ class; rs1139916 occur more frequently in the HER2-than HER2 + class; rs76504036 occur more frequently in the grade III than in grade II class; rs11540666 and rs2229437 occur more frequently in the stage III than in stage II class. These SNVs might be related to tumor evolution and contribute to different prognosis of breast cancer subtypes.
There are some SNVs that are differentially occurring between the comparison groups but not present in the functional domain regions of proteins (S3 Table). However, it is possible that these SNVs are present in the inter-domain or loop regions, but still have an effect on the structure of protein or otherwise affect a protein's ability to bind and interact with other proteins.

Protein stability analysis
For feasibility, we selected 9 relatively rare occurring SNVs from Table 7 to analyze the consequences of point mutations on protein stability. We carried out hmmpfam/hmmpanther motif search with iprscan, to assess if the SNVs are part of the functional motif or domain region (with high confidence at an E-value < = 10 -4 ) ( Table 8). Then, we compared two protein structure prediction tools, I-TASSER and RaptorX, using the known PDB structure of DPP3 (PDB ID-3FVY) to determine which method is more reliable and accurate for protein structure prediction. The Root-Mean-Square Deviations (RMSD) of atomic positions between the known DPP3 PDB structure and the I-TASSER or RaptorX predicted models are 0.45 and 7.1, respectively, indicating that I-TASSER is performing far better than RaptorX. We repeated the structure prediction twice for each protein, in order to check if we can get the same structure for each run or not. I-TASSER always gave the same result while RaptorX often gave slightly different results for several proteins. Thus we used I-TASSER for further analysis of all proteins. Further, we analyzed the impact of point mutations (SNVs) on protein stability by using I-Mutant 2.0, PopMusic2.1 and CUPSAT tools (S5 Table). Our results suggest that 7 out of 9 SNVs tested have destabilizing effect on proteins. In contrast, the other two SNVs (present in GFM2 and ANKRD32 proteins) have a stabilizing effect (means no significant change to structure or function) after mutation (Table 8) (Fig. 4). In mutated DPP3 protein, negatively charged Glutamate residue (E) got replaced with positively charged Lysine (K) at position 690. Structure analysis of DPP3 suggests that mutant protein has almost similar structure to normal protein, except that the C-terminus has its helix structure changed to a loop structure because of the point mutation (Fig. 4). It has been reported that the C-terminal structure of this protein can play a big role in substrate binding in DPP3 [88]. As the mutation occurs close to the substrate binding residues, K666 and R669 [88], we hypothesize that the altered structure at C-terminus affects substrate binding and consequently alters protein function. Conclusions Breast cancers exhibit highly heterogeneous molecular profiles, which often reflect their distinct prognosis. Although gene expression profiles have been widely used for the classification and targeted treatment of breast cancers, DNA mutational profiles-owing to their stability of detection-are more advantageous in developing biomarkers. In this study, we attempt to detect the genetic mutations (at gene-and nucleotide-level) that are significantly different across different breast cancer classes, by performing a large-scale analysis of 98 breast cancer exome sequencing datasets. We proposed a method for scoring the deleteriousness of mutated genes and identified differentially mutated genes (DMGs) and SNVs from five breast cancer comparison classes (ER+ vs. ER-, PR+ vs. PR-, HER2+ vs. HER2-, grade II vs. grade III, and stage II vs. stage III). We have identified many DMGs such as ERBB2, EPS8, PPP2R4, KIAA0922, SP4, CENPJ, PRCP and SELP, whose mutational loads match with experimentally or clinically verified breast cancer prognosis. We also identified some category 2 genes such as ARL6IP5, RAET1E, and ANO7 that could be crucial for breast cancer development and prognosis (S1 File). Interestingly, the majority of DMGs have higher deleterious mutation scores in the classes with poor prognosis (ER-, PR-, HER2+, grade III, and stage III), which suggests that the deleterious gene mutations are gradually accumulated with the progression of cancer.
Then, we identified some SNVs such as rs1058808, rs2480452, rs61751507, rs79167802, rs11540666, and rs2229437 that potentially influence protein functions and have significantly different occurrence frequency in the populations of different breast cancer comparison groups. Protein structure analysis also suggests that many of the SNVs identified in this study could alter the protein stability and structure, and those SNVs might be associated with cancer evolution and affect prognosis of breast cancers. Some genes and SNVs we identified are worthy of further experimental investigation and verification.  Table. Clinical information of all 103 breast cancer samples. Information includes ID for this study (ID), dbGap subject ID (dbGap SubjID), submitted subject ID (SUBJID), Age, Gender, Primary Disease, Expression Subtype, Country, ER status, PR status, HER 2 status, tumor stage (Stage), tumor grade (Grade), Menopausal Status, Histology, and whether it is used in this study (In the study). (XLSX) S2 Table. Deleterious mutation score matrix for filtered 3,826 genes in 98 breast cancer samples. Genes that have deleterious mutations present in less than 5 (out of 98) samples have been filtered out to obtain 3,826 genes. Deleterious scores were calculated using the scoring function described in method section. (XLSX) S3