Whole Genome Sequencing Identifies a 78 kb Insertion from Chromosome 8 as the Cause of Charcot-Marie-Tooth Neuropathy CMTX3

With the advent of whole exome sequencing, cases where no pathogenic coding mutations can be found are increasingly being observed in many diseases. In two large, distantly-related families that mapped to the Charcot-Marie-Tooth neuropathy CMTX3 locus at chromosome Xq26.3-q27.3, all coding mutations were excluded. Using whole genome sequencing we found a large DNA interchromosomal insertion within the CMTX3 locus. The 78 kb insertion originates from chromosome 8q24.3, segregates fully with the disease in the two families, and is absent from the general population as well as 627 neurologically normal chromosomes from in-house controls. Large insertions into chromosome Xq27.1 are known to cause a range of diseases and this is the first neuropathy phenotype caused by an interchromosomal insertion at this locus. The CMTX3 insertion represents an understudied pathogenic structural variation mechanism for inherited peripheral neuropathies. Our finding highlights the importance of considering all structural variation types when studying unsolved inherited peripheral neuropathy cases with no pathogenic coding mutations.


Introduction
Charcot-Marie-Tooth (CMT) disease is the collective name given to a group of clinically and genetically heterogeneous inherited peripheral neuropathies that affect both motor and sensory neurons. Over 80 genes have been associated with CMT and other related inherited peripheral neuropathies, which account for up to 80% of CMT cases [1][2][3][4]. In our Australian cohort, a proportion of the families that remain unsolved have no detectable protein-coding mutation in the exome, even after extensive whole exome sequencing (WES) analysis of multiple family members. This suggests that point mutations and small insertions/deletions in non-coding DNA, as well as DNA structural variations, may account for some of the unsolved cases.
CMTX3, a subtype of X-linked CMT, is one such locus which has remained unsolved after extensive molecular analyses. The CMTX3 locus was initially mapped to the long arm of chromosome X in two American families [5]. The locus was confirmed and refined to a 5.7 Mb region on chromosome Xq26.3-q27.3 in a large United Kingdom/New Zealand family (CMT623) [6] and an Australian family (CMT193-ext) [7]. Affected males from these two families generally presented a slightly milder phenotype than the more common X-linked CMT subtype, CMTX1. However, the degree of severity varied. Onset of disease generally occurred in the first decade, initially presenting in the lower limbs. Sensory symptoms included marked pain and paraesthesia in hands and feet as well as sensory loss. Tremor in hands and spastic paraparesis were not observed. Nerve conduction velocity data suggested these patients have an intermediate form of CMT. Female carriers were considered asymptomatic with normal nerve conduction velocities; however, the observation of subtle clinical signs including high-arched feet, weakness in foot dorsiflexion and loss of ankle reflexes suggested female carriers may present very mild symptoms [6].
The two families carry the same CMTX3 haplotype, suggesting they share an identical genetic mutation inherited from a common ancestor. Genotype analysis of one of the original American families (US-PED2) initially suggested this family also carried the distal portion of the CMTX3 haplotype [7]. However, re-examination of family US-PED2 by WES identified a known BSCL2 mutation (c.263A>G, p.Asn88Ser) as the genetic cause of disease in the family [8]. Mutation screening of families CMT623 and CMT193-ext excluded pathogenic mutations in all coding sequences mapping within the 5.7 Mb locus [6,9]. Therefore, we employed whole genome sequencing (WGS) to interrogate the disease locus for pathogenic non-coding single nucleotide variants and structural variations in these families.

Results/Discussion
Two affected males and an unaffected male control from each of the families CMT623 and CMT193-ext (i.e. four patients and two controls) underwent WGS. An average of 134 Gb of sequence was generated for each individual. On average, 96% of total reads mapped to the reference genome and all samples had a minimum depth of coverage (DOC) of 44X across the whole genome (Table 1). The CMTX3 locus had an average DOC of 24X, which reflected the males being hemizygous for chromosome X.
Patient and control sequence alignments revealed the presence of split-reads at Xq27.1 (Table 2). The four affected males consistently showed split-reads at the genomic location chrX:139,502,948. The corresponding paired ends for the split-reads mapped both upstream and downstream of the putative breakpoint at chromosome Xq27.1. Split-reads at this location were not identified in the two unaffected males. The unaligned sequences of these split-reads mapped to two genomic regions (chr8:145,768,312 and chr8:145,848,158). These genomic positions are located 78 kb apart on chromosome 8q24.3 and represent the boundaries of the DNA region that has been inserted into chromosome Xq27.1 in the CMTX3 patients. Patient WGS data also showed split-reads on chromosome 8 that contained Xq27.1 sequence and paired with reads anchoring to these two locations on chromosome 8q24.3. Further analysis also identified discordant paired ends in which one read of the pair mapped to Xq27.1 and its mate mapped to 8q24.3. These were observed in all four patients and were absent from the two control samples. Table 2 summarizes the number of split-reads and discordant paired ends identified for each patient. Based on these data we predicted that a 78 kb sequence from 8q24.3 had been inserted into chromosome Xq27.1 in CMTX3 patient DNA.
To determine whether the entire 78 kb region from chromosome 8q24.3 had been duplicated and inserted into Xq27.1 we assessed the DOC across the genomic interval chr8:145,700,000-145,900,000 (Table 3). Control males showed a uniform DOC across the entire 200 kb region with a mean DOC of 40X. The affected males, however, showed a 1.6-fold increase in DOC (mean DOC of 64X) within the boundaries of the insertion breakpoints (chr8:145,768,312-145,846,158). The DOC for the genomic regions immediately flanking the 8q24.3 insert sequence was similar to that of the controls (Fig 1A). These data suggested that patients with CMTX3 carry an extra copy of the 78 kb region from chromosome 8q24.3 through the interchromosomal insertion event at the CMTX3 locus.
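The depth comparison above can be read as a rough copy-number estimate. The following is a minimal sketch, not the authors' pipeline: it assumes that depth scales linearly with copy number, that control males carry two copies of the 8q24.3 interval, and that overall sequencing depth is comparable between samples. The function name `estimate_copy_number` is hypothetical; the DOC values are the means reported in the text.

```python
def estimate_copy_number(patient_doc: float, control_doc: float,
                         control_copies: int = 2) -> float:
    """Estimate copy number from mean depth of coverage (DOC).

    Assumes depth scales linearly with copy number and that the
    control carries `control_copies` copies of the interval.
    """
    if control_doc <= 0:
        raise ValueError("control_doc must be positive")
    fold_change = patient_doc / control_doc
    return fold_change * control_copies


# Mean DOC values reported in the text for the 8q24.3 insert interval.
observed_fold = 64 / 40                      # 1.6-fold increase in patients
copies = estimate_copy_number(64, 40)        # about 3.2 copies
expected_fold_one_extra_copy = 3 / 2         # 1.5-fold if one copy is added
```

Under these assumptions the observed 1.6-fold increase sits close to the 1.5-fold expected for a single extra copy, which is in line with the conclusion that patients carry three copies of the interval.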
We next assessed whether the interchromosomal insertion segregated with the disease in our two distantly related families using a multiplex PCR genotyping assay (Fig 1B). Genotyping results for a subset of family members from CMT623 are shown (Fig 1C). The different sized amplicons were confirmed via Sanger sequencing (S1 Fig). The 78 kb insertion segregated in 55 individuals (25 affected males and 30 carrier females) from families CMT623 and CMT193-ext. The 78 kb insertion was not seen in the 50 unaffected members (30 males, 20 females) from families CMT623 and CMT193-ext that were available for testing. All individuals were clinically diagnosed and genotyped for the CMTX3 haplotype prior to this study. The 8q24.3 interchromosomal insertion was absent in 627 control X chromosomes from neurologically normal females (n = 252) and males (n = 123).
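The figure of 627 control X chromosomes follows from the sex of the controls, since each female contributes two X chromosomes and each male one. A short sketch of that arithmetic (the function name `count_x_chromosomes` is hypothetical):

```python
def count_x_chromosomes(n_females: int, n_males: int) -> int:
    """Count X chromosomes in a cohort: two per female, one per male."""
    return 2 * n_females + n_males


# Control cohort sizes reported in the text.
control_x = count_x_chromosomes(n_females=252, n_males=123)  # 504 + 123 = 627
```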
Sanger sequencing of the amplicons spanning the insertion breakpoints confirmed the WGS predictions (Fig 2A and 2B). The 8q24.3 sequence inserted directly between the genomic locations chrX:139,502,948-139,502,949. For the proximal breakpoint, the exact location of the end sequence from chromosome X and start position of the 8q24.3 insertion sequence could not be unambiguously defined due to a 2 bp overlap (AA) in the sequence (Fig 2A). For the purposes of defining breakpoints, we have designated the chromosome 8 insertion start position as chr8:145,768,312. The distal breakpoint is more complex (Fig 2B). The 8q24.3 insertion sequence ends at position chr8:145,848,158 followed by a small insertion from chromosome 12q13.12, which maps within an intron of the FAIM2 gene. A total of 19 bp from the small insertion sequence maps to 12q13.12; however, the first 10 bp also overlap with chromosome 8 (green sequence, Fig 2B). Adjacent to the 12q13.12 insertion, the first 12 bp of chromosome X at the distal breakpoint are inverted. There is also a single nucleotide variant (T>G) at chrX:139,502,968 and a single nucleotide deletion at chrX:139,502,976 (Fig 2B). These variants appear to be unique to the two CMTX3 families and have not been reported in variant databases including the 1000 Genomes Project [10] or dbSNP [11]. The 8q24.3 insertion region is 77,856 bp and contains a partial transcript of the ARHGAP39 gene (exons 1-7) encoded on the negative strand (Fig 2C). The duplicated 8q24.3 sequence has inserted into an intergenic region of Xq27.1 with the nearest flanking genes being LOC389895 (located 329 kb downstream, proximal to the 78 kb insertion) and SOX3 (located 84 kb distal to the insertion) (Fig 2C and 2D).
Table 3. Average depth ± SD of sequence coverage across 8q24.3.

Based on the genomic architecture of the CMTX3 interchromosomal insertion, we hypothesized two possible mechanisms that could lead to peripheral neuropathy: 1) overexpression of the partial ARHGAP39 transcript due to 8q24.3 trisomy; or 2) transcriptional dysregulation of one or more genes mapping within the CMTX3 locus.
Aberrant splicing with the ARHGAP39 partial transcript may also be a possible mechanism. However, this is unlikely, as the inserted ARHGAP39 partial transcript is predicted to be transcribed on the negative strand and the nearest downstream gene, LOC389895, is a single-exon gene transcribed from the positive strand (Fig 2C).
Copy number variations (CNVs) that result in the duplication or deletion of a gene are a well-known cause of CMT neuropathy, indicating that peripheral nerves are sensitive to gene dosage. A 1.5 Mb duplication on chromosome 17p12 [12,13], resulting in trisomy of the PMP22 gene [14][15][16][17], causes the most common form of CMT (CMT1A). This was the seminal example of a CNV causing disease. The reciprocal 1.5 Mb 17p12 deletion causes hereditary neuropathy with liability to pressure palsies (HNPP) [18]. Although relatively rare [19][20][21], whole and partial gene duplications or deletions at other CMT loci including MPZ [21][22][23], GJB1 [24][25][26], MFN2 [27], and NDRG1 [28] have also been reported in a small number of individual cases. Currently there are no interchromosomal insertions reported as a cause of CMT.
To assess whether the CMTX3 insertion affects gene expression, quantitative RT-PCR analysis was used to measure the mRNA expression levels of candidate genes in patient and control lymphoblasts. No difference in ARHGAP39 expression was observed between the patient and controls (Fig 3A). This suggested that trisomy of the ARHGAP39 partial transcript is unlikely to be the underlying cause of neuropathy.
Large rearrangements disrupting non-coding DNA sequences are likely to cause disease by dysregulating the transcriptional expression of one or more nearby genes [29]. Duplication of a 186 kb sequence located 3 kb distal to the PMP22 gene [30,31], harboring Schwann cell-specific transcription factor binding sites [32], was found to cause CMT1A by dysregulating PMP22 expression [30,31]. Non-coding DNA structural variations can disrupt the interaction between a gene and its functional non-coding DNA sequences (such as promoters, enhancers and silencers) or introduce new interactions, resulting in dysregulated temporal and spatial gene expression [29,33,34]. Recent studies have shown that regulatory elements and their target genes cluster within local chromatin interaction domains or "topologically associated domains" [35]. Genomic rearrangements that physically disrupt the boundaries of these domains introduce ectopic interactions between regulatory elements and genes that can cause disease [29]. However, based on Hi-C profile data from human embryonic stem cells [35] the 78 kb sequence from 8q24.3 appears to have inserted into a topologically associated domain without disrupting the boundaries (S2 Fig) suggesting that if the CMTX3 mutation dysregulates a nearby gene it is likely through some other mechanism.
To explore the possible mechanism of transcriptional dysregulation of one or more genes mapping within the CMTX3 locus, we assessed the expression of SOX3 and FGF13. Large DNA interchromosomal insertions at the Xq27.1 locus have been previously reported to cause a range of phenotypes [36][37][38][39][40] and these two genes are known to be dysregulated in patients with other Xq27.1 interchromosomal insertions [38,40]. SOX3 encodes the sex determining region Y-box 3 transcription factor. In an XX sex reversal patient carrying a 774 kb interchromosomal insertion from chromosome 1q25.3, an increase in SOX3 expression was observed in the patient lymphoblasts [40]. SOX3 expression, however, was not detected in the control lymphoblasts. In both our patient and control lymphoblast cell lines, SOX3 mRNA expression could not be detected (S3 Fig). These results are consistent with previous reports of SOX3 expression in control lymphoblasts [40] and it is likely that SOX3 is silenced by methylation in lymphoblasts [41]. Unlike the 1q25.3 interchromosomal insertion, the presence of the 8q24.3 interchromosomal insertion does not appear to affect SOX3 expression in lymphoblasts.
FGF13 encodes the fibroblast growth factor 13 protein that is part of the fibroblast growth factor homologous family [42]. Hypertrichosis patients carrying a 389 kb interchromosomal insertion from chromosome 6p21.1 showed reduced FGF13 expression in patient hair follicles [38]. We observed a 3-fold increase in FGF13 expression in lymphoblast cells from the CMTX3 patient (Fig 3B). Although the assay could not distinguish between the different FGF13 isoforms, our preliminary finding demonstrates that the 8q24.3 interchromosomal insertion dysregulates FGF13 expression in CMTX3 patient lymphoblasts. We hypothesize that if similar dysregulation of FGF13 gene expression were to be observed in patient neurons this could be the underlying cause of disease in CMTX3 patients. It is also possible that the observed dysregulation of FGF13 is a benign, bystander effect of the 78 kb interchromosomal insertion. Further gene expression studies on FGF13 and the remaining genes mapping to the CMTX3 locus will be required to fully determine the pathogenic consequence of the CMTX3 8q24.3 insertion.
Fig 3 (caption, beginning not recovered). Bars show the mean mRNA levels (± SD; error bars) relative to Control 1, which has been set to +1. A Student's t-test was performed comparing each value to Control 1 (*, p < 0.05). doi:10.1371/journal.pgen.1006177.g003
There have been six large interchromosomal insertions previously reported; each originating from unique genomic regions and ranging from 124-774 kb [36][37][38][39][40]. These interchromosomal insertions have been shown to cause hypoparathyroidism [36], hypertrichosis [37,38], ptosis [39], and XX male sex reversal [40]. CMTX3 is the fifth disease phenotype to be associated with an Xq27.1 interchromosomal insertion, clearly suggesting there is a recurrent mutation mechanism at the Xq27.1 locus. There are several mutation mechanisms that give rise to structural variations (recently reviewed in [43,44]). We propose that this recurring mutation mechanism is possibly due to double stranded DNA breaks occurring in the 180 bp palindrome sequence at Xq27.1 [37] followed by incorrect repair of the DNA break through microhomology-mediated break-induced replication [45,46]. For most of the interchromosomal insertions, including the CMTX3 insertion, at least one breakpoint is located near the center of the 180 bp palindrome sequence, close to where the hairpin loop is predicted to form (Fig 4) [37][38][39][40]. Hairpin loops are susceptible to double stranded DNA breaks due to endonuclease activity and are common hotspots for translocations [47]. Since the chromosome X breakpoints of these interchromosomal insertions localize within this hairpin structure, this suggests that hairpin formation of the palindrome sequence and endonuclease activity may be the initial process of the recurrent mutation mechanism.
Microhomology-mediated break-induced replication (MMBIR) coupled with fork stalling and template switching (FoSTeS) has been proposed as an alternative model for the formation of genomic rearrangements that cannot be explained by non-allelic homologous recombination [45,48,49]. In this model, microhomology-induced template switching occurs where nearby single-stranded DNA is used as a template to repair DNA breaks. Depending on the template, this results in the formation of deletions, duplications, triplications, inversions or translocations that are flanked by minimal sequence homology of 2-6 bp at the breakpoints [45]. Further complexity at the genomic rearrangement breakpoints, involving small deletions and/or small insertions of unlinked or unknown sequences, is also commonly observed and is likely due to multiple template-switching events occurring during the repair process [49].
Sequencing the breakpoints of the CMTX3 rearrangement revealed an additional 19 bp from chromosome 12q13.12, an inversion of 12 bp from chromosome Xq27.1 and microhomology between chromosome X and chromosome 8 sequence as well as between the chromosome 8 and chromosome 12 sequence (Fig 2A and 2B). Microhomology, small deletions at the Xq27.1 sequence and additional small inserted sequences, from unlinked (i.e. from another chromosome) or unknown sources, also feature in the other disease-associated interchromosomal insertions at Xq27.1 [36][37][38][39][40] suggesting these insertions arose through MMBIR/FoSTeS.
Since each unique DNA insertion causes different disease phenotypes this suggests that the inserted genomic sequence is important. Based on the varying gene dysregulation observed for patients with hypertrichosis [38], XX sex reversal [40] and CMTX3, we predict the disease specificity from each interchromosomal insertion into Xq27.1 arises from the introduction of DNA regulatory elements that interact with the nearby genes in a tissue-specific manner. Unsolved Mendelian diseases mapping to the Xq27.1 region should therefore be assessed for large interchromosomal insertions using WGS analysis.
With 20% of our CMT families remaining genetically unsolved after WES [2], finding the causes of disease in these families is an important goal for inherited peripheral neuropathies. Our discovery suggests that structural variation involving non-coding DNA may explain a portion of the unsolved families. It also highlights the importance of looking beyond CNV when analyzing the genome for structural variation. Although the CMTX3 mutation represents trisomy of 8q24.3, given that this does not result in a dosage change for ARHGAP39, it is likely that the insertion itself underlies the peripheral neuropathy.
Fig 4 (caption, beginning not recovered). ... [37]); hypertrichosis 2 (blue, [37]); hypertrichosis 3 (green, [38]); ptosis (pink; Bunyan [39]); and XX sex reversal (purple, Haines [40]) are marked out on the hairpin structure. Single breakpoints are depicted by a solid line. Multiple breakpoints are indicated by broken lines.
WGS provides a powerful tool to detect the full spectrum of DNA variation including all classes of structural variations [50,51]. Given that structural variations are found throughout the general population [52,53], distinguishing pathogenic and benign structural variations will be difficult without large families to confirm segregation. In time, improved annotation of benign genomic rearrangements in SV databases that go beyond CNV and map the location and orientation of all SV subtypes will assist in delineating pathogenic structural variations in patients. Pathogenic structural variations identified in families that are large enough for segregation analyses, as we have shown for the CMTX3 mutation, will provide genomic landmarks at which WGS data from smaller families can be mined for structural variation sequencing signatures (such as split reads and discordant paired ends). This strategy will, however, have limited use if structural variations causing inherited peripheral neuropathy prove to be rare private mutations.
With decreasing WGS costs and improved sensitivity of WGS alignment algorithms, we predict that more structural variations are likely to be identified as the pathogenic cause of CMT. However, we acknowledge that the detection of these mutations in both the research and clinical diagnostic settings will be a challenge with no immediate solution.
In conclusion, we have provided compelling data supporting the likely genetic cause of CMTX3 neuropathy as a 78 kb interchromosomal insertion at Xq27.1 [der(X)dir ins(X;8) (q27.1;q24.3)]. Based on genealogy studies we believe this founder insertion originated prior to the early 1800s in a Scottish family. Our discovery is the first neuropathy caused by an Xq27.1 interchromosomal insertion. We propose that large structural variations involving non-coding DNA, similar to the CMTX3 mutation, may account for a proportion of the unsolved CMT cases.

Research participation
Participating family members gave informed consent according to the protocols approved by the Sydney Local Health District Human Ethics Review Committee, Concord Repatriation General Hospital, Sydney, Australia (reference number: HREC/11/CRGH/105).

Genomic DNA extraction
Genomic DNA was extracted from peripheral blood using the PureGene Kit (Qiagen) following manufacturer's instructions. Extractions were performed by Molecular Medicine Laboratory, Concord Repatriation General Hospital (Sydney, Australia).

Whole genome sequencing
Genomic DNA samples (3 μg) were dispatched to NextCODE (Massachusetts, USA) who outsourced WGS of samples to Macrogen (South Korea). Paired-end (101 bp) sequencing was performed on a HiSeq 2000 sequencer (Illumina) following standard protocols.

WGS bioinformatics analyses
Raw WGS data was returned to NextCODE who performed the following bioinformatics analyses. Access to all pipeline output files and visual representation of WGS data was made available through the Sequence Miner (NextCODE) application.
Sequence alignment. Sequence reads were aligned to the human reference sequence (hg19) using the Burrows-Wheeler Aligner (BWA) version 0.5.9 [54]. Alignments were merged into a single BAM file and marked for duplicates using Picard 1.55. Non-duplicate reads were selected for further downstream analyses.
Discordant paired end and split read detection. WGS data was assessed for discordant paired end reads and split reads using in-house pipelines developed by NextCODE. For discordant paired end detection, scripts were developed to identify high quality read pairs mapping to different chromosomes or with inserts greater than 700 bp (more than twice the library mean insert size). Using a 200 bp window, the local maximum rearrangement position was identified and regions with generally poor read alignment were excluded. For split read detection, algorithms were used to extract reads in which one half of the read mapped to the genome and the second half did not map locally.
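The discordant paired end criteria described above can be illustrated with a small sketch. This is not the NextCODE pipeline, which is not described in detail here; it is a simplified illustration under stated assumptions. The `ReadPair` class, the function names, the mapping-quality cutoff of 30 (the text does not define "high quality"), the use of fixed non-overlapping 200 bp bins, and the toy read positions are all hypothetical. Insert size is approximated as the distance between the two read start positions.

```python
from collections import Counter
from dataclasses import dataclass

MAX_INSERT_BP = 700   # threshold stated in the text
WINDOW_BP = 200       # window size stated in the text
MIN_MAPQ = 30         # assumption: "high quality" is not defined in the text


@dataclass(frozen=True)
class ReadPair:
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    mapq: int


def is_discordant(pair: ReadPair) -> bool:
    """Flag a high-quality pair that maps to different chromosomes or
    spans more than MAX_INSERT_BP on the same chromosome."""
    if pair.mapq < MIN_MAPQ:
        return False
    if pair.chrom1 != pair.chrom2:
        return True
    return abs(pair.pos2 - pair.pos1) > MAX_INSERT_BP


def densest_window(pairs):
    """Return ((chrom, window_start), count) for the fixed 200 bp bin that
    holds the most discordant first-read positions, or None if none exist."""
    counts = Counter(
        (p.chrom1, (p.pos1 // WINDOW_BP) * WINDOW_BP)
        for p in pairs
        if is_discordant(p)
    )
    return counts.most_common(1)[0] if counts else None


# Toy read pairs for illustration only (not data from the study).
toy_pairs = [
    ReadPair("chrX", 139_502_850, "chr8", 145_768_400, 60),
    ReadPair("chrX", 139_502_900, "chr8", 145_768_500, 60),
    ReadPair("chrX", 139_502_940, "chr8", 145_768_350, 60),
    ReadPair("chrX", 1_000, "chrX", 1_300, 60),             # concordant pair
    ReadPair("chrX", 139_502_910, "chr8", 145_768_450, 5),  # low quality
]
hit = densest_window(toy_pairs)
```

In this toy input the three high-quality interchromosomal pairs fall in the same 200 bp bin on chromosome X, so that bin is reported as the candidate rearrangement position. A real implementation would read pair information from the BAM file and would also need the split read step, which is not sketched here.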

Tissue culture of patient lymphoblasts
Patient EBV-transformed lymphoblast cell lines were prepared using standard procedures at Genetic Repositories Australia (Sydney, Australia). Sex- and age-matched controls were obtained from Genetic Repositories Australia. Lymphoblasts were maintained in RPMI 1640 (Invitrogen) supplemented with 10% fetal bovine serum (Scientifix) and 2 mM L-glutamine (Gibco).

RNA isolation and cDNA synthesis
Total RNA was isolated from patient lymphoblast cells using Trizol (Life Technologies) according to the manufacturer's instructions. RNA was eluted in 50 μl RNase-free water, DNase-treated with Turbo DNase (Life Technologies) and stored at -80°C until required. RNA (1 μg) was converted to cDNA using the iScript cDNA Synthesis Kit (Bio-Rad) following the manufacturer's protocols.

Gene expression analysis
Isolated cDNA (100 ng) was subjected to quantitative RT-PCR analysis using TaqMan Gene Expression Assays (Invitrogen) following manufacturer's protocols. Quantitative RT-PCR was performed on a Step One Plus (Applied Biosystems) and relative fold difference was calculated using the comparative Ct method [55]. Target gene expression was determined relative to the housekeeping gene 18S. For each RNA extraction (n = 3 per sample), quantitative RT-PCR reactions were performed in triplicate.
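For readers unfamiliar with the comparative Ct calculation, the sketch below shows the usual 2^-ΔΔCt form, in which the target gene is normalized to a reference gene (18S here) and then expressed relative to a calibrator sample (Control 1 in Fig 3). It assumes roughly equal and near-100% amplification efficiency for both assays. The function name `relative_fold_change` and all Ct values are hypothetical and are not data from this study.

```python
from statistics import mean


def relative_fold_change(target_ct_sample, ref_ct_sample,
                         target_ct_calibrator, ref_ct_calibrator):
    """Comparative Ct (2^-ddCt) fold change of a sample versus a calibrator.

    Each argument is a list of technical-replicate Ct values. Assumes
    near-100% amplification efficiency for target and reference assays.
    """
    d_ct_sample = mean(target_ct_sample) - mean(ref_ct_sample)
    d_ct_calibrator = mean(target_ct_calibrator) - mean(ref_ct_calibrator)
    dd_ct = d_ct_sample - d_ct_calibrator
    return 2 ** (-dd_ct)


# Hypothetical Ct values for illustration only (triplicate reactions).
fold = relative_fold_change(
    target_ct_sample=[24.0, 24.0, 24.0],      # target gene, patient
    ref_ct_sample=[10.0, 10.0, 10.0],         # 18S, patient
    target_ct_calibrator=[25.0, 25.0, 25.0],  # target gene, Control 1
    ref_ct_calibrator=[10.0, 10.0, 10.0],     # 18S, Control 1
)
```

With these made-up values the target crosses threshold one cycle earlier in the patient after normalization, which corresponds to a 2-fold higher relative expression; the calibrator compared with itself gives 1.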