Targeted Sequencing of the Mitochondrial Genome of Women at High Risk of Breast Cancer without Detectable Mutations in BRCA1/2

Breast Cancer is a complex multifactorial disease for which high-penetrance mutations have been identified. Approaches used to date have identified genomic features explaining about 50% of breast cancer heritability. A number of low- to medium penetrance alleles (per-allele odds ratio < 1.5 and 4.0, respectively) have been identified, suggesting that the remaining heritability is likely to be explained by the cumulative effect of such alleles and/or by rare high-penetrance alleles. Relatively few studies have specifically explored the mitochondrial genome for variants potentially implicated in breast cancer risk. For these reasons, we propose an exploration of the variability of the mitochondrial genome in individuals diagnosed with breast cancer, having a positive breast cancer family history but testing negative for BRCA1/2 pathogenic mutations. We sequenced the mitochondrial genome of 436 index breast cancer cases from the GENESIS study. As expected, no pathogenic genomic pattern common to the 436 women included in our study was observed. The mitochondrial genes MT-ATP6 and MT-CYB were observed to carry the highest number of variants in the study. The proteins encoded by these genes are involved in the structure of the mitochondrial respiration chain, and variants in these genes may impact reactive oxygen species production contributing to carcinogenesis. More functional and epidemiological studies are needed to further investigate to what extent variants identified may influence familial breast cancer risk.


Introduction
In 2008, 14 million breast cancers (BC) were diagnosed worldwide, representing more than 10% of all cancer diagnoses [1]. One woman out of nine will develop breast cancer in her lifetime in developed countries [1,2]. Breast cancer is a complex multifactorial disease, with genetic, life-style and environmental susceptibility factors. Today we estimate that the genetic component of breast cancer represents 5% to 10% of all breast cancers, with 4% to 5% due to high-penetrance mutations in susceptibility genes [3][4][5] Mutations in two high-penetrance susceptibility genes have been identified: BRCA1 and BRCA2. Pathogenic mutations in BRCA1 confer a lifetime risk of BC of 60% to 85% [6,7], whereas such mutations in BRCA2 confer a lifetime risk of approximately 40% to 85% [6,7]. Other genomic variations (for example in genes interacting with BRCA1 and BRCA2) have been identified as modifiers of BC cancer risk and alter the risk initially conferred by BRCA1 or BRCA2 mutation [8].
Early methods such as linkage or candidate gene studies, and more recent approaches like genome-wide association studies (GWAS) have been able to identify genomic features explaining about 50% of breast cancer heritability [9]. A number of low-to moderate-effect Single Nucleotide Polymorphisms (SNPs) have been identified as associated with BC risk. However, today the large amount of genome-wide studies performed that have failed to identify other highly penetrant breast cancer susceptibility genes suggests that the remaining heritability is likely to be explained by the cumulative effect of low-penetrant and low-to moderate-effect genomic variations, by moderately penetrant but rare alleles, and by the synergistic effect of environmental exposure combined to specific genomic variations.
Oxidative stress has been shown to play a role in BC development [10,11]. Mitochondria are intimately linked to oxidative stress, as they are a main source of Reactive Oxygen Species (ROS) in the cell. ROS are cytotoxic and mutagenic components that can cause damage to DNA, in particular double strand breaks. When not correctly handled, this damage leads to genomic instability, and therefore may contribute to tumorigenesis. Germline and somatic variations in the mitochondrial genome have been linked to cancer. Mitochondrial genome mutations are found in various cancers and may alter mitochondrial metabolism, enhance tumorigenesis and permit cancer cell adaptation to changing environments [12,13].
Whereas the mitochondrial genome is haploid, classical linkage studies are dedicated to the analysis of diploid genomes. Furthermore, commercially available GWAS arrays are not suited to capture variation in specific regions of the genome, in particular the mitochondrial genome. Therefore methods specifically targeting the mitochondrial genome are needed to further explore variability in the mitochondrial genome of breast cancer patients. For these reasons, we have sequenced the mitochondrial genome of 436 women diagnosed with breast cancer, having a positive familial breast cancer history, but testing negative for BRCA1/2 pathogenic mutations. GENESIS (GENE SISters) is a French national study designed to identify new breast cancer susceptibility genes [14]. GENESIS includes pairs of sisters diagnosed with breast cancer, but testing negative for BRCA1/BRCA2 mutation. Testing was performed in the context of an oncogenetic consultation following French national guidelines. GENESIS also includes matched non-related controls of the same age group and sharing the same living environment, i.e. close friends or colleagues of index cases.

Ethics Statement
The GENESIS study has been reviewed and approved by the appropriate ethics committee (Comité de Protection des Personnes Ile de France III, 3 october 2006, agreement n°2373). Informed consent has been obtained for each woman included in the GENESIS Study.

Sample Selection
We selected unrelated women among index breast cancer cases of the study. Index cases were ranked by breast cancer phenotypic aggressiveness: i.e. women diagnosed at an early age, and presenting bilateral cancer (see Table 1). Table 1 also presents the number of cases selected from different inclusion categories. Blood samples are centralized in the Breast Cancer Genetics laboratory of the Cancer Research Center of Lyon, where DNA is extracted using automated protocols. As our hypothesis was that rare, potentially deleterious alleles would be discovered, we chose to include as many index cases as possible, and no control individuals. Statistical power to carry out classical association testing in this context would be insufficient, and formal association testing was not the aim of this study.

Mitochondrial Genome Sequencing
Primers corresponding to eleven amplicons covering the whole mitochondrial genome were designed and tested for specificity to the mitochondrial genome (Table A in S1 File).  [19][20][21] were used to process aligned bam files. Indels were realigned after primary alignment with the GATK walker IndelRealigner. Base quality scores were recalibrated with the GATK walker BaseRecalibrator. Variants were called using the GATK UnifiedGenotyper algorithm. Visual examination of read alignment in regions surrounding detected variants was performed with Integrative Genomics Viewer [22,23].

Annotation and filtering
Variants called were annotated with the Ensembl tool Variant Effect Predictor (VEP) [24], and with locally developed Python scripts. Given the haploid status of mitochondria, variants called as heterozygous were considered unreliable and filtered as false-positives. Variants for which the alternative allele was not balanced on both strands, i.e. variants having less than 10% of reads supporting alternative allele on one strand are considered as unreliable and filtered. Remaining reliable variants were annotated according to MITOMAP [25], a catalog of mitochondrial variants including mini insertions and deletions (accessed September 2013). We also annotated positions that are conserved among vertebrates, i.e. positions strictly conserved between reference genomes of nine superior eukaryotic species (data not shown). If mutated, variants at these positions are potentially more likely to have a functional impact. Finally, VEP provided gene annotation, substitution effect in transcribed region, codon and amino acid change and functional effect prediction with PolyPhen [26] and SIFT [27] for non-synonymous substitutions.
To estimate mitochondrial variability at the gene level, 2 statistics were computed. The global variant enrichment rate estimates the number of distinct variants observed per gene and per Mb. For a given gene x of length l x , if N x distinct variants were observed in our study, then the global variant enrichment rate r g for this gene is: This statistic does not take into account the frequency of each variant observed in our study. Therefore, the weighted variant enrichment rate r w , taking into account both gene variability and variants frequency was also computed. With i from 1 to N x representing the i th variant observed on gene x, and n i the count of individual carrying variant i, then: The statistics were computed for all mitochondrial genes, except for those encoding for mitochondrial transfer RNAs, as their very small length would have biased the results.

Sequencing characteristics, coverage assessment, and variant description
More than 20 million reads were obtained, with a mean read count per sample higher than 45,000 (Table B in S1 File). On average, more than 27,000 of the 45,000 sequenced reads were successfully aligned against the reference genome, which gives an alignment rate of 57.4%. (Table C in S1 File) In our experiment, approximately 90% of the mitochondrial genome is covered with a depth of coverage of 50X. The global mean coverage is 198X. More details, including coverage profile along the mitochondrial genome, are presented in Figure A in S1 File.
1157 distinct variants successfully passed our quality filters; and are detailed in S2 File. More than 99% of variants are substitutions. About 80% of variants are located in the coding sequence of mitochondrial genes. This is not surprising given the probable bacterial origin of mitochondria, and the low proportion of intergenic sequences in its genome. There are no missense substitutions or indels detected in genes coding for mitochondrial transfer RNA (Mt-tRNA). Comparing concordance between analyses with SIFT and Polyphen shows that 24 variants are predicted as deleterious by SIFT and as probably damaging by Polyphen, and these variants are detailed in Table 2. Table 3 contains a description of observed variants unknown in MITOMAP, which represent 3% of all detected variants. None of these variants is detected at high frequency in our dataset (all were observed once, with the exception of the A-T variant at bp3385 which was observed twice), and most do not affect conserved positions. These observations are consistent with unknown rare polymorphisms, and these variants are also absent from dbSNP. Table 4 summarizes the distribution of variants by mitochondrial gene, and displays the global variant enrichment rate r g and the weighted enrichment rate r w for each gene. Two genes, MT-RNR1 and MT-RNR2, show low global enrichment rates (35.6 and 37.2 variants/Mb, respectively). These two genes code for the mitochondrial 12S and 16S ribosomal RNAs, respectively structural components of the small and large ribosomal subunits of mitoribosomes. Given their essential structural function, it is not surprising to see that these genes are more conserved than other mitochondrial genes.
As represented in Fig 1, we can observe 3 groups of genes. MT-RNR1 and MT-RNR2 represent the group of MT-RNAs genes. They are characterized by a low number of distinct variants, some having a high frequency in our population. On the other hand, MT-CYB and MT-ATP6 are both characterized by a high number of distinct variants (high r g value). They also are the most frequently altered genes in our population (high r w value). The third group is composed of the remaining genes.

Discussion
In this study we have performed a deep characterization of mitochondrial genome variability among 436 women diagnosed with breast cancer, having a positive familial breast cancer history, and negative for BRCA1/2 mutation screening. Women were chosen based on the age at diagnosis and on the laterality of their cancer.
We identified 1157 distinct variants compared to the reference sequence rCRS. All the frequent variants observed in this study (i.e. with a relative frequency > 5%) are known in the MITOMAP database. The majority of these variants are common mitochondrial SNPs, some of which have been associated with an increased risk of breast cancer [28,29]. Of the 35 variants (including 3 indels) that are unknown in MITOMAP database, (also unknown in dbSNP), none was observed in more than 3 individuals among the 436 samples that were sequenced. There is no reason to reject these variants after visual examination of alignments with IGV. They could be rare variants or private mutations, i.e. rare mutations restricted to a few families, appeared recently in mitochondrial genome evolution. 4 variants are predicted as deleterious by SIFT, 2 are predicted possibly damaging, and 4 probably damaging by PolyPhen. None of these 6 loci have been described in the literature.
No variant predicted as deleterious by SIFT and probably damaging by PolyPhen was observed more than 3 times in the 436 samples sequenced, except the polymorphism rs2853506 (A15218G), which was observed in 10 samples (2%). This polymorphism has been found associated with epileptogenesis [30]. This SNP is located in the gene coding for cytochrome b, MT-CYB. Along with cytochrome c1 and Rieske protein, cytochrome b is one of the three respiratory subunits of complex III, the third complex involved in Electron Transport Chain. Alterations in this subunit have been shown to modify catalytic capacities of complex III [31]. Although the coverage profile is not homogeneous or constant along the mitochondrial genome, this profile is robust through the 10 independent sequencing runs that were performed in this study. Furthermore, despite the relatively low alignment rate, we have a mean for approximately 90% of mitochondrial genome. Similar coverage profiles were obtained in other studies on Ion Torrent sequencing platforms and with various aligners [32,33]. Whereas the Ion Torrent sequencing technology has been reported to have very high confidence rates regarding point mutation sequencing, the reliability of detecting insertions and deletions (indels) has been a subject of intense debate and a significant proportion of indels called with this technology is likely to represent false positives [33][34][35]. In our study, only 10 out of 111 indels initially called passed filters, and 93% of these were discarded in at least one sample due to an unbalanced repartition of reads supporting alternative alleles on the two DNA strands, with  less than 10% of reads on one strand. This strand specific deletion bias has already been observed in Ion Torrent [36]. Furthermore, in our results, 8 out of the 10 deletions that have passed filters are located in homopolymer tracts. The only 2 remaining indels seem highly reliable when visualized with a viewer like IGV [22,23]. Ion Torrent sequencing is known to be prone to errors for homopolymeric tracts of length more than 3, the uncertainty of pH measure often leading to a misestimation of homopolymer length. Loman et al. have estimated the Ion Torrent PGM homopolymer-associated indel errors at 1.5 per 100 bases [37]. Better results are usually obtained on Ion Torrent PGM data by using an aligner based on Burrows-Wheeler Transform (BWT) algorithm [38], an alignment algorithm used by BWA (the aligner we used) or the mapper included in Nextgene sequencing analysis software (SoftGenetics, State College, PA, USA). Filtering non-homozygous variant calls enables us to obtain a restricted list of variants after having eliminated the high level of noise generated by Ion torrent sequencing. However, by doing this we choose to not analyze the potential heteroplasmy, or potential somatic changes in the mitochondrial genome in the blood sample sequenced, and to focus on inherited variants. We did so because heteroplasmy in blood may not be of interest for studying breast cancer etiology, as compared to heteroplasmy observed in breast tissue.
Our aim is to characterize in detail mitochondrial genome variability of women with a strong familial history of breast cancer, but without BRCA1/2 pathogenic mutations. Given the diminished capability of familial linkage and GWAS analyses to detect mitochondrial genome features, a targeted sequencing approach was needed. Several recent studies [38] underline the possibility that the remaining unexplained heritability of breast cancer could be explained by rare variants specific of only one or a few families. Population based methods like case-control studies are therefore not able to detect any associations with these variants. For this reason we have not included any control subjects in our study, and have included as many cases as feasible.
We identified 1157 variants; some of which were previously associated with breast cancer risk. However, as expected, we did not identify a common potentially pathogenic genomic pattern among the 436 women included in our study. It is therefore unlikely that mutations in the mitochondrial genome have the potential to explain a high proportion of the excess of breast cancer risk of women with a familial history but without BRCA1/2 mutations. New variants, unreferenced in MITOMAP and not described in the pubic literature were characterized, and some of them are predicted to potentially alter the function of the protein encoded by their genes. MT-ATP6 and MT-CYB, two genes coding for important structural subunits of the mitochondrial respiration chain, show the highest variability with the highest frequency in our data. These results are interesting in this context, and echo recent results underlying the potential role of mitochondrial variants as susceptibility factors for familial breast cancer risks [39]. Based on these elements, further studies are needed to examine if these variants influence breast cancer risk.
Alternative approaches are now required to try to understand a larger part of breast cancer heritability. Powerful technologies such as next generation sequencing can help to achieve this goal, and has already been applied in the context of breast cancer [40,41]. New consortiums are emerging, such as COMPLEXO which aims at deciphering breast cancer missing heritability by looking further into the human exome by using massive parallel sequencing technologies [42]. Our results provide evidence that the mitochondrial genome should be considered when designing studies in order assess the role of mitochondrial genome variability in breast cancer risk and etiology.
Supporting Information S1 File. Technical details of sequencing experiments. Details of PCR primers used to isolate the mitochondrial genome including primer sequence, melting temperature, and amplicon size (Table A). Post-sequencing read characteristics (Table B). Post-sequencing mitochondrial genome coverage (Table C). Coverage distribution along the mitochondrial genome with standard deviation for all 436 samples. Coverage in overlapping regions is divided by two to take into account the overlap ( Figure A). (DOCX) S2 File. Full list of variants identified. Details of all variants identified by sequencing 436 mitochondrial genomes (Table A). (XLSX)