A tailored approach to fusion transcript identification increases diagnosis of rare inherited disease

Background: RNA sequencing has been proposed as a means of increasing diagnostic rates in studies of undiagnosed rare inherited disease. Recent studies have reported diagnostic improvements in the range of 7.5–35% by profiling splicing, gene expression quantification and allele specific expression. To date, however, no study has systematically assessed the presence of gene-fusion transcripts in cases of germline disease. Fusion transcripts are routinely identified in cancer studies and are increasingly recognized as having diagnostic, prognostic or therapeutic relevance. Isolated reports exist of fusion transcripts being detected in cases of developmental and neurological phenotypes, and thus, systematic application of fusion detection to germline conditions may further increase diagnostic rates. However, current fusion detection methods are unsuited to the investigation of germline disease due to performance biases arising from their development using tumor, cell-line or in-silico data.

Methods: We describe a tailored approach to fusion candidate identification and prioritization in a cohort of 47 undiagnosed, suspected inherited disease patients. We modify an existing fusion transcript detection algorithm by eliminating its cell line-derived filtering steps, and instead, prioritize candidates using a custom workflow that integrates genomic and transcriptomic sequence alignment, biological and technical annotations, customized categorization logic, and phenotypic prioritization.

Results: We demonstrate that our approach to fusion transcript identification and prioritization detects genuine fusion events excluded by standard analyses and efficiently removes phenotypically unimportant candidates and false positive events, resulting in a reduced candidate list enriched for events with potential phenotypic relevance. We describe the successful genetic resolution of two previously undiagnosed disease cases through the detection of pathogenic fusion transcripts. Furthermore, we report the experimental validation of five additional cases of fusion transcripts with potential phenotypic relevance.

Conclusions: The approach we describe can be implemented to enable the detection of phenotypically relevant fusion transcripts in studies of rare inherited disease. Fusion transcript detection has the potential to increase diagnostic rates in rare inherited disease and should be included in RNA-based analytical pipelines aimed at genetic diagnosis.


Introduction
The uptake of next-generation sequencing for clinical testing has brought about a surge in the diagnosis of rare genetic disease. Approximately 18-40% of cases originally escaping a diagnosis with traditional genetic assays are now solved by exome-based DNA sequencing [1][2][3]. Despite such advances, a clear need remains for novel and improved methods that will further increase diagnostic rates and improve patient care. While whole-genome sequencing will likely lead to higher diagnostic rates, it remains less cost effective than exome sequencing and significant advances in understanding are required before its non-coding data can be harnessed for clinical practice [4].
Recently, RNA-Seq has been promoted as a versatile clinical tool capable of distilling diverse genetic variation into more readily interpretable transcriptional manifestations [5]. RNA-based profiling of genetic disease has traditionally occurred in targeted assays, with limited assessment of transcriptome-wide applications. Three recent studies reported on the utility of RNA-Seq as a complement to exome-based sequencing in inherited muscle pathologies [6], mitochondriopathies [7] and broad-spectrum rare disease [8]. Cummings et al. studied aberrant splicing patterns and allele-specific expression (ASE), achieving a diagnostic improvement of 35%, while Kremer et al. and Fresard et al. evaluated splicing, ASE, and gene expression quantification, increasing diagnostic yields by 10% and 7.5% respectively. These studies concluded that RNA-Seq represents an essential component of the diagnostic toolkit for rare genetic disease testing.
One transcriptional phenomenon not considered by these previous studies is the expression of fusion transcripts. A fusion transcript arises when genetic material from distinct genes is aberrantly conjoined and transcribed. It can occur by translocation, inversion, deletion, and duplication, potentially leading to gained, lost or altered gene function. Human gene-fusion transcripts are known to occur in hematological and solid tissue cancers where their oncogenic, diagnostic and therapeutic relevance are well-documented [9]. However, the systematic application of fusion transcript detection in germline genetic disease is absent from the literature. This is despite the fact that mechanisms commonly responsible for fusion transcript formation, including deletions, inversions and translocations, often underlie inherited conditions [10]. Indeed, case studies have reported fusion transcripts in diseases including brain malformation [11] [12] [13], intellectual disability [14] [15] [16] [17] [18], schizophrenia [19] [20], spastic paraplegia [21], autism spectrum disorder [22], Gilles de la Tourette syndrome [23] and more [24] [25] [10]. These sporadic cases suggest that the systematic inclusion of fusion transcript detection in RNA-based analysis of rare undiagnosed disease may lead to improved diagnostic rates.
Despite the availability of fusion-detection software, its practical application to transcriptome-wide rare disease studies in germline samples is challenging. Current solutions show limited agreement in the putative fusion candidates they output and none generate fully inclusive results. An appropriate fusion caller should be selected to match the data type under analysis. However, current tools were trained using cell line, tumor, or in-silico datasets and are not applicable to germline data. Filters empirically derived from mismatched training data lead to low sensitivity when profiling unrelated sample types [26]. Another obstacle to fusion detection in germline samples is the abundance of false-positive findings arising from bioinformatics alignment artifacts, PCR artifacts, DNA fragments or unprocessed mRNA [27]. Equally, the potential remains for the detection of genuine mRNA species, commonly originating from currently unrecognized single genes, or more rarely, from trans-splicing mechanisms [27][28][29][30][31]. Furthermore, non-pathogenic constitutive fusions may be detected [32] [30], or fusions occurring transiently in subclonal cell populations [33]. Thus, any attempt to systematically apply fusion transcript detection in inherited disease studies using germline samples will require methods to detect meaningful fusion candidates and deprioritize phenotypically inconsequential results.
Here, we describe the systematic application of fusion transcript detection to a cohort of 47 individuals with undiagnosed rare genetic disease. By applying a custom annotation and categorization process to fusion candidates, we demonstrate the presence of diagnostic fusion transcripts in a subset of patients. Our findings provide an analytical framework for others in the field and provide justification for the routine application of fusion transcript identification in genetic disease patients who eluded a diagnosis with existing assays.

Ethical compliance
This study was approved by the Mayo Clinic institutional review board and all participants provided written informed consent for genetic testing.

Study subjects
All patients were clinically referred to Mayo Clinic's Center for Individualized Medicine, seeking genetic diagnosis of a suspected rare inherited disease. Patients and parents underwent genetic counselling and a full case history and family pedigree were constructed. Patients not fully diagnosed by exome sequencing were selected for whole-transcriptome RNA sequencing.

RNA-sequencing
Sequencing was conducted on blood for 46 patients and cultured fibroblasts for 1 patient due to sample availability. Blood-derived RNA was obtained by collecting peripheral whole blood in PAXgene blood RNA tubes and using the QIAcube system (Qiagen) according to the manufacturer's protocol for RNA extraction. RNA was isolated from fibroblasts as previously described [34]. Sequencing

RNA fusion analysis
Candidate fusion events were initially detected using TopHat Fusion (TopHat release 2.1.0) [35]. Minimal depth filtering was applied to candidate fusions. Each fusion candidate was required to be supported by a single split read pair (one read-pair member mapping across the breakpoint) and a single spanning read pair (one read-pair member mapped to each side of the breakpoint). Ultimately this enabled us to maintain a strategy that was more inclusive than the default filters (3 split, 2 supporting) while still requiring supporting evidence from both classes of fusion-defining read pairs. To further increase candidate inclusiveness in this germline dataset, we omitted the cancer-cell-line-derived TopHat Fusion post-processing filter steps (tophat-fusion-post) and began with the unprocessed fusion calls as input into a candidate categorization workflow. We performed sequence alignment to the human genome and transcriptome using BLASTN (v2.6.0) [36]. A word size of 7 and e-value threshold of 1 were used to enable the BLAST alignment of short sequences. Alignments with less than 90% sequence identity and 75% sequence length coverage were filtered. Top scoring alignments were individually selected for (i) full length fusion candidates including conjoined 5' and 3' segments and (ii) decoupled 5' and 3' fusion candidate segments. Alignment results were annotated with Ensembl gene models [37] to identify putative gene involvement, exon-intron composition and coding-frame status, where applicable. Subsequent candidate classification rationale is detailed in Fig 1. Standard TopHat Fusion-filtered outputs were generated alongside custom categorized outputs to enable the comparison of results.
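The two thresholds described above (minimal read support and BLASTN alignment quality) can be illustrated with a minimal sketch. The class names, field names and helper functions below are hypothetical and are not part of TopHat Fusion or BLAST. The sketch also assumes that an alignment is retained only when it meets both the 90% identity and the 75% length-coverage thresholds, which is one reading of the filtering rule.

```python
from dataclasses import dataclass

# Thresholds taken from the text; the names are illustrative only.
MIN_SPLIT_PAIRS = 1      # read pairs with one mate crossing the breakpoint
MIN_SPANNING_PAIRS = 1   # read pairs with one mate on each side of the breakpoint
MIN_IDENTITY = 90.0      # percent sequence identity
MIN_COVERAGE = 0.75      # aligned fraction of the query sequence


@dataclass
class FusionCall:
    """Hypothetical record for one unfiltered fusion candidate."""
    name: str
    split_pairs: int
    spanning_pairs: int


@dataclass
class BlastHit:
    """Hypothetical record for one BLASTN alignment of a candidate sequence."""
    percent_identity: float
    aligned_length: int
    query_length: int


def passes_read_support(call: FusionCall) -> bool:
    # Require evidence from both classes of fusion-defining read pairs.
    return (call.split_pairs >= MIN_SPLIT_PAIRS
            and call.spanning_pairs >= MIN_SPANNING_PAIRS)


def passes_alignment(hit: BlastHit) -> bool:
    # Assumption: both thresholds must be met for the alignment to be kept.
    coverage = hit.aligned_length / hit.query_length
    return hit.percent_identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE
```

Under these assumptions, a candidate with three split pairs but no spanning pair would be rejected, whereas one with a single pair of each class would be retained.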

Population frequency-based filtering
As our patient cohort suffered from rare disease, we assumed that any causative event would occur with extremely low frequency in a normal population. To control for event frequency and recurrent artifacts, we compared our putative fusion candidates to a fusion-event database generated using normal samples from our institution, the Illumina Human BodyMap and the Genotype-Tissue Expression (GTEx) project (dbGaP accession phs000424.v7.p2) [38] (approximately 11688 RNA-Seq samples from 500+ individuals and 53 tissue types in total). Most fusion candidates in normal controls were detected with only one supporting read (S1 Fig) and we theorized that the artefactual candidates were likely overrepresented close to this level of support. We therefore considered fewer than two supporting reads as insufficient evidence of a genuine fusion event in our control database. Putative fusion candidates were removed from consideration if they were identified with more than two supporting reads in a normal control specimen. Candidates were not considered further if they appeared in another sample from our rare disease cohort, since the patients were unrelated and expected to suffer from rare and distinct genetic disorders.
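A minimal sketch of the frequency-based filter is given below. The function name, the fusion-key format and the shape of the control database (a mapping from fusion key to the highest read support seen in any normal sample) are assumptions made for illustration. The exact read-support cutoff applied to controls is left as a parameter because the description above leaves the boundary case of exactly two reads open; the default of two is an assumption.

```python
from collections import defaultdict


def filter_by_frequency(calls, control_max_reads, min_control_reads=2):
    """Drop candidates seen in normal controls or recurring within the cohort.

    calls: dict mapping patient ID -> set of fusion keys (e.g. "GENE1--GENE2").
    control_max_reads: dict mapping fusion key -> highest supporting-read
        count observed for that fusion in any normal control sample.
    min_control_reads: control support at or above which a candidate is
        treated as a normal-population event (assumed default).
    """
    # Count how many patients carry each candidate.
    patients_per_fusion = defaultdict(set)
    for patient, fusions in calls.items():
        for fusion in fusions:
            patients_per_fusion[fusion].add(patient)

    kept = {}
    for patient, fusions in calls.items():
        kept[patient] = {
            fusion for fusion in fusions
            # Not credibly observed in normal controls...
            if control_max_reads.get(fusion, 0) < min_control_reads
            # ...and private to this patient within the cohort.
            and len(patients_per_fusion[fusion]) == 1
        }
    return kept
```

In this sketch a candidate supported by a single read in a control sample is not treated as a known normal event, reflecting the observation that single-read support in controls is dominated by artifacts.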

Phenotype-based prioritization of events classified as potential fusions
Putative fusion transcripts were evaluated with manual and automated approaches to ascertain potential relevance to each patient's phenotype. The manual review of fusion transcripts was carried out to identify links to patient phenotype based on case notes, medical records, Online Mendelian Inheritance in Man (OMIM) [39], Genecards [40] and relevant literature. We also applied an automated in-silico method called PCAN: Phenotype consensus analysis to support disease-gene association [41] to predict the relevance of fusion-forming genes to phenotypes. PCAN uses semantic similarity scoring to measure relationships between the phenotypic terms mutually associated with a patient and a gene. Scores are ranked by simultaneously measuring semantic similarity for all disease-associated genes in the ClinVar database [42] versus each patient's phenotype and producing a rank-score (rank/number of genes in Clinvar e.g. 0.01 indicates that a gene produces a score in the top 1% of all disease-linked ClinVar genes). PCAN also measures the phenotypic relevance of all genes sharing Reactome pathways [43] or STRING [44] protein-protein interaction networks with the fusion-forming genes, producing a p-value score and enabling indirect phenotypic-link discovery.
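The rank-score described above (rank divided by the number of ClinVar genes) can be illustrated as follows. This is a sketch of the arithmetic only and does not reproduce the PCAN package or its interface: the function name and the input dictionary of per-gene similarity scores are hypothetical, and tied scores are assumed to share the best available rank.

```python
def rank_score(similarity_by_gene, gene):
    """Return rank / number of genes for one gene's similarity score.

    similarity_by_gene: dict mapping gene symbol -> semantic similarity
        between that gene's phenotype terms and the patient's phenotype
        (higher means more similar).
    A value of 0.01 would place the gene in the top 1% of scored genes.
    """
    target = similarity_by_gene[gene]
    # Rank 1 is the most similar gene; ties share the best rank (assumption).
    rank = 1 + sum(1 for score in similarity_by_gene.values() if score > target)
    return rank / len(similarity_by_gene)


# Toy scores for four hypothetical genes.
scores = {"GENE_A": 0.9, "GENE_B": 0.5, "GENE_C": 0.7, "GENE_D": 0.1}
top = rank_score(scores, "GENE_A")      # best-scoring gene of four
bottom = rank_score(scores, "GENE_D")   # worst-scoring gene of four
```

With only four genes the smallest attainable rank-score is 0.25; across all disease-associated ClinVar genes the scale is correspondingly finer.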

Confirmation of fusion candidates
A selection of fusions passing filtering and phenotypic prioritization steps were selected for PCR validation. Fusion transcripts were amplified from cDNA generated from patient RNA using the Invitrogen Super-Script II RT Kit (Cat. No. 18064022) with random hexamer primers. PCR was performed with primers detailed in S1 File using Bioline MiTaq Polymerase (Cat. No. BIO-25043). Reaction conditions included an annealing temperature of 55˚C for 30-34 cycles.
Droplet digital PCR (ddPCR) was also performed for all fusion sequences selected for validation. gBlock constructs (Integrated DNA Technologies) were synthesized as positive controls. ddPCR primers and gBlock sequences are described in S2 File. ddPCR reactions contained 11 μL of ddPCR EvaGreen Supermix (Bio-Rad), 2.2 μL of primer mix (100 nM final concentration of each primer) and 8.8 μL of cDNA. Separate reactions were assembled for each fusion candidate using a corresponding primer set. The QX-100 Droplet Generator (Bio-Rad) generated droplets with 20 μL of sample mix and 70 μL of QX200 droplet generation oil. Droplets were transferred to a semi-skirted plate and sealed at 180˚C for 4 sec. Thermocycling conditions were as follows: enzyme activation at 95˚C for 5 min, 40 cycles of denaturation at 95˚C for 30 sec, annealing and extension at 60˚C for 1 min, and signal stabilization at 4˚C for 5 min and 90˚C for 5 min. Plates were measured on a QX200 Droplet Reader (Bio-Rad).
Further validation work was performed for select fusion events. Agilent 44k and 180k array comparative genome hybridization (aCGH), fluorescence in-situ hybridization (FISH), multiplex-ligation probe analysis (MLPA) and Molecular Inversion Probe (MIP) Analysis were performed as previously described by Oliver et al. [45]. Flow cytometry, long range PCR, Pacific Biosciences (PacBio) sequencing, targeted PCR and Sanger sequencing were performed as previously described by Cousin et al. [34].

Fig 1. Putative fusion sequences were BLASTN aligned to the human genome and transcriptome to enable categorization. A) Candidates aligning to abundant hematological genes (Globins, T-cell receptors) were not considered further due to their overrepresentation in blood samples and observed overrepresentation in fusion analysis results. These might represent artifacts or transient biological events. B & C) Full length candidates producing unbroken alignments against the human transcriptome or genome were classified as likely known transcripts or genomic sequence respectively. D) When the candidate produced no alignment against the human genome or transcriptome or only a part alignment was possible, the candidate was classified as a likely artifact, potentially containing low quality or non-human sequence including adapters. E) When the candidate produced multiple alignments within the gene boundaries of a single gene but did not align completely to a known transcript, it was classified as a potential novel transcript of a known gene. This category also has the potential to capture aberrant single-gene events. F) When the candidate produced two hits to separate immunoglobulins the event was classed as potentially representing immune diversity. Alternatively these may be generated by alignment artifacts due to high homology between immunoglobulin genes. G) When two distinct alignments were produced against two different chromosomes, the candidate was defined as a potential interchromosomal fusion. Fused genes with known homology were flagged to enable additional checking for alignment artifacts. H) When the candidate aligned to two distinct genes or regions on a single chromosome, it was classified as a potential intrachromosomal fusion. Fused genes with known homology were flagged to enable additional checking for alignment artifacts. Intrachromosomal candidates occurring between neighboring genes were annotated as potential read-through events. These events could represent true fusions or aberrant transcriptional events but might also represent biologically normal events that occur due to co-transcription of neighboring genes that have yet to be re-classified as single genes. Interchromosomal and intrachromosomal candidates were annotated as homologous when the two hits occurred against known homologous genes based on the Duplicated Genes Database (http://dgd.genouest.org/). Such instances might represent artifacts due to misalignment between closely homologous genes or might equally represent true aberrant events, preferentially occurring due to homology at the genomic sequence level.
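The categorization rules A to H described for Fig 1 can be sketched as a simple decision function. This is an illustrative reading rather than the study's implementation: the record types, their boolean fields and the order in which the rules are evaluated are all assumptions, and the read-through and homology annotations are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SegmentHit:
    """Hypothetical best alignment for one decoupled fusion segment."""
    gene: Optional[str]            # annotated gene, or None if intergenic
    chrom: str
    is_blood_abundant: bool = False  # e.g. globins, T-cell receptors
    is_immunoglobulin: bool = False


@dataclass
class Candidate:
    """Hypothetical summary of alignments for one fusion candidate."""
    full_length_transcriptome: bool  # unbroken hit to a known transcript
    full_length_genome: bool         # unbroken hit to the genome
    five_prime: Optional[SegmentHit]
    three_prime: Optional[SegmentHit]


def categorize(c: Candidate) -> str:
    hits = [h for h in (c.five_prime, c.three_prime) if h is not None]
    if any(h.is_blood_abundant for h in hits):
        return "blood-abundant gene"                       # A
    if c.full_length_transcriptome:
        return "likely known transcript"                   # B
    if c.full_length_genome:
        return "likely genomic sequence"                   # C
    if len(hits) < 2:
        return "likely artifact"                           # D
    five, three = hits
    if five.gene is not None and five.gene == three.gene:
        return "potential novel transcript of known gene"  # E
    if five.is_immunoglobulin and three.is_immunoglobulin:
        return "potential immune diversity"                # F
    if five.chrom != three.chrom:
        return "potential interchromosomal fusion"         # G
    return "potential intrachromosomal fusion"             # H
```

For example, a candidate whose two segments align to different genes on the same chromosome, with no unbroken full-length alignment, falls through to category H under this ordering.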

Patient cohort
RNA-Seq was performed on 47 patients with an incomplete diagnosis following prior testing, including exome sequencing. The cohort consisted of 23 males and 24 females. Ages at initial referral ranged from 9 months to 68 years with a mean age of 18 years and median of 11 years. Clinical presentations varied widely and comprised a spectrum of neurological, immune, muscular, gastrointestinal, connective tissue and skeletal disorders (S1 Table).

Genes of prior interest
Of 47 cases, 19 had genes or variants of potential interest identified by exome sequencing and clinical review (S2 Table). Two patients had variants or genes considered to be of exceptionally high interest. Patient 6 carried a single pathogenic variant in ATM with strong links to phenotype, but a second variant was required to fully explain the phenotype based on an autosomal recessive mode of inheritance. In patient 37, a pathogenic variant was actively sought in EXT1 or EXT2. These genes of exceptionally high prior interest were determined to have expression levels suited to analysis in available tissue. Four further patients (Patients 21, 36, 42 and 44) carried variants with predicted pathogenicity and observed zygosity that was suspected to be fully explanatory of some element of their phenotype. It was theorized that fusion profiling for these patients might yield further phenotype-relevant events in other genes. The thirteen additional cases carried a selection of variants of unknown significance (VUS). Six of the thirteen patients carried a total of eight VUS in genes that displayed low expression (< 1 TPM in the GTEx [38] database) in whole blood, however, six of these showed correspondingly low expression in fibroblasts. Ultimately it was decided to proceed with sequencing of readily available blood samples for investigative purposes (S2 Table). The remaining 28 cases were unsolved and without candidates following exome sequencing, and were consequently included for exploratory analysis.

Fusion candidate classification workflow
The fusion candidate selection workflow with the median number of candidates per category is shown in Fig 2. This workflow was designed to remove suspected artifacts or recurrent fusions and to classify remaining candidates into biologically meaningful categories. The median number of unfiltered fusion candidates entering the workflow was 31,138 per patient. The minimal read-depth filter removed a median of 27,824 likely spurious events per patient. Removal of putative fusions previously observed in normal samples further reduced candidates by a median of 2,553 per patient. Remaining filtering and categorization steps reduced fusion candidates by a median value of 97, achieving a tractable median of 12 events per patient which were classified as potential fusions and subjected to manual review for links to phenotype. The number of candidates categorized per patient at each stage is detailed in S3 Table  while all candidates classified as potential fusions are included in S4 Table. A total of 16 fusion candidates in 13 patients (including 1 reciprocal event) passed phenotypic review, with potential links between genes and phenotype identified based on a combination of PCAN analysis and manual curation (Table 1). Extended descriptions and rationale for inclusion of fusion candidates passing manual review are provided in S5 Table.

Confirmation of fusion candidates
Eleven candidate fusions with strong phenotypic relevance to the patient were selected for confirmation using orthogonal methods. Table 2 describes each of the fusions as well as the rationale for their selection and the status of their experimental confirmation. Eight fusions were successfully confirmed, with 2 clinically classified as diagnostic of the patients' phenotype. Fusion confirmation images are included in S3 File. A selection of the confirmed fusion products are discussed in detail, as follows.

SAMD12-EXT1 fusion in a patient with multiple exostoses
Patient 37 is a male child who presented with a phenotype including pachygyria, epilepsy, developmental delay, short stature, failure to thrive, facial dysmorphisms, and multiple exostoses [45]. Trio-based clinical exome sequencing identified a maternally inherited, X-linked loss-of-function variant in Doublecortin (DCX), which was classified as pathogenic and diagnostic of the patient's neurological phenotype. However, the cause of the patient's multiple exostoses remained unknown. Hereditary multiple exostoses is an autosomal dominant disorder, caused by pathogenic variants in EXT1 or EXT2 in 70-95% of cases, with EXT1 affected twice as frequently as EXT2 [46] [47]. Mosaic pathogenic events have been reported in numerous instances [48] [49]. No variant was identified in either gene despite extensive clinical testing, including array comparative genome hybridization (aCGH), metaphase karyotyping, multiplex ligation-dependent probe amplification (MLPA) and exome sequencing. RNA-Seq and subsequent fusion analysis discovered a candidate intrachromosomal fusion between SAMD12 and EXT1 (Fig 3A). The fusion was observed at the 3' boundary of SAMD12 exon 2 and the 5' boundary of EXT1 exon 2, forming a transcript predicted to be out-of-frame, leading to loss-of-function. The fusion was supported by 17 sequence reads and was not identified in our normal control database. SAMD12 lies upstream of EXT1 on Chromosome 8 and both genes are oriented on the reverse chromosomal strand. Intuitively, the fusion transcript could be expected to result from a rare interstitial deletion of genomic sequence between the two genes; however, prior clinical testing did not report this. The clinical aCGH results were re-inspected for evidence of a deletion in this region and a 604 kb genomic region intervening the fused exons (chr8:118960168-119569348) showed evidence of mosaic loss of EXT1 exon 1 and SAMD12 exons 3-5, but did not meet clinical-reporting thresholds. 
The mosaic loss was subsequently confirmed by an increased density aCGH.

Reciprocal ATM-SLC35F2 fusion in a patient with severe combined immunodeficiency
Patient 6 is a female infant diagnosed with T cell lymphopenia by newborn screening for severe combined immunodeficiency (SCID) [34]. SCID gene panel sequencing was uninformative and aCGH unrevealing. Subsequent trio-based exome sequencing discovered a paternally inherited frameshift INDEL in ATM, clinically classified as pathogenic. Pathogenic ATM variation causes Ataxia-telangiectasia in an autosomal recessive manner, and would account for the patient's phenotype if a second variant was in trans. Flow cytometry assay revealed impaired phosphorylation of ATM, supporting the presence of a second pathogenic variant [34]. RNA sequencing of patient fibroblasts revealed reciprocal ATM-SLC35F2 and SLC35F2-ATM fusion transcripts (Fig 3B). These fusions were supported by 14 and 43 reads respectively, and neither was identified in our normal control database. The ATM-SLC35F2 fusion consists of ATM exon 16 joined to SLC35F2 exon 8, while the SLC35F2-ATM fusion consists of SLC35F2 exon 7 joined to ATM exon 17. Both resulting fusions were predicted to be in-frame, with each gene fragment in its correct orientation, despite the two genes existing natively on opposing genomic strands on Chromosome 11q22.3. It was hypothesized that the reciprocal fusion transcripts were the result of a chromosomal inversion. To confirm the hypothesis, long range PCR of the putatively affected introns was conducted and sequenced using PacBio long-read technology (S6 Fig). This resulted in reads bridging the breakpoints, which were subsequently confirmed by targeted PCR (S3 File) and Sanger sequencing (S7 Fig). The event was shown to be inherited from the unaffected mother, equating to a compound heterozygous loss of ATM function in the patient. Thus the event was classified as diagnostic of the patient's phenotype in accordance with ACMG guidelines.

Fusion selection by default TopHat Fusion filtering
The default TopHat Fusion filters identified a total of 1003 candidates in our patient cohort (5-46 per patient). We classified these candidates using our categorization workflow (S6 Table). 52.3% of candidates involved blood-abundant genes while a further 19.7% involved immunoglobulin genes. The majority of candidates (994 of 1003) were removed due to their presence in our normal control database (S7 Table). All candidates detected by TopHat Fusion's default filters and classified by our workflow as potential fusions are described in S8 Table irrespective of normal tissue expression. Candidates occurring in normal tissue databases but categorized as potential fusions included known polymorphic events such as KANSL1-ARL17A/B [52] and TFG-GPR128 [53], detected in 14 and 3 patients respectively. Other events such as PFKFB3-RP11-563J2.2 (37 patients) and EIF4E3-FOXP1 (28 patients) appeared with high frequency and might represent previously unrecognized polymorphic fusion events or read-through transcription. The 9 remaining candidates not appearing in our normal database comprised 3 containing blood-abundant genes, 1 potential novel transcript and 5 events categorized as potential fusions (two representing the reciprocal ATM fusion). Thus 99.5% of the standard TopHat Fusion outputs were removed from further consideration by our classification workflow. Of the five mutually detected fusion candidates, the reciprocal ATM fusions were the only ones selected by our categorization and prioritization workflow (Table 1). The remaining three fusion candidates were excluded from our manual analysis due to lack of phenotypic relevance. Within the group of 16 phenotypically prioritized fusion candidates output by our workflow, 8 of 11 attempted were successfully validated and only 2 were detected by the default TopHat Fusion filters.

Fig 3. ... for which a second hit was sought due to the autosomal recessive nature of ATM mutations. RNA-Seq revealed reciprocal fusions that were expected to retain their protein-coding potential but lead to aberrant ATM function based on the results of a novel flow cytometry assay. The fusions were experimentally validated by several orthogonal methods and shown to be maternally inherited, equating to compound heterozygous loss of ATM function which was classified as diagnostic of the patient phenotype. These reciprocal fusions were the only members of our validation panel that were detected by standard TopHat filters. https://doi.org/10.1371/journal.pone.0223337.g003

Discussion
We have described the first systematic application of fusion transcript detection in an undiagnosed, rare inherited disease cohort. Our findings support the assertion that fusion transcription is a phenomenon whose pathogenic relevance extends beyond the traditionally recognized field of oncology, and furthermore, suggest that fusion analysis is an important component of comprehensive rare inherited disease testing. The two confirmed diagnostic fusions reported here involve genes that were previously suspected of clinical significance but for which a pathogenic event was still sought following clinical and research testing using several advanced methods. The fact that fusion analysis achieved diagnosis where multiple alternative methods failed underscores the diagnostic potential of fusion profiling in rare disease cases. We assert that fusion analysis should be considered integral to any RNA-Seq pipeline used for genetic diagnosis.
The discovery of SAMD12-EXT1 and reciprocal ATM-SLC35F2 fusions constitutes a 4.3% increase in diagnostic yield within our patient cohort. Notably, the diagnostic odyssey cases studied here represent a phenotypically diverse and challenging population, and it cannot be discounted that similar analyses might produce higher rates of diagnosis within distinct phenotypic groupings. The clinical significance of the 5 additionally validated fusions remains unknown despite experimental verification and the potential phenotypic relevance of their constituent genes. The EXT1 and ATM fusions are unique in that they affect genes with extensive prior evidence linking them incontrovertibly to each patient's phenotype. The events containing genes with lesser-evidenced links to patient phenotype are challenging to conclusively interpret and consequently these remain variants of uncertain significance. It is possible that periodic reassessment of such events will eventually identify a pathogenic role as knowledge in the field expands. Alternatively, functional validation studies remain an available but non-trivial option to clarify the role of such fusions.
We developed an inherited-disease-focused workflow to replace fusion-filtering strategies developed for alternative applications, and to lower the potential for erroneous removal of disease-relevant events while reducing an initially overwhelming number of fusion calls to a tractable quantity. Thus our workflow provides a call set that is amenable to manual analysis and interpretation in Mendelian disease studies. Furthermore all events detected by the default TopHat Fusion filters were detected by our workflow, but were biologically classified and largely deprioritized following sequence alignment and biological inference. Conversely, of the 16 fusion candidates prioritized by our workflow, 73% of those tested were experimentally validated and only one reciprocal fusion was detected by the standard TopHat Fusion filters.
Initial raw candidate identification remains wholly dependent on the underlying fusion calling algorithm, and suitable care in its selection is required. We selected TopHat Fusion based on its ability to provide output of unfiltered candidate fusions. While this approach proved effective in this study, an ensemble of multiple callers might enable the detection of additional fusion events and represents a natural extension of our approach that should be considered in future studies.
The rationale underlying our candidate categorization workflow is versatile and widely applicable. Its various components can be implemented wholly or piecemeal, as part of new or existing workflows utilizing a wide range of fusion calling algorithms. For example, we demonstrated the ability to remove events likely to have low phenotypic relevance from the outputs of standard fusion-caller filters as evidenced by our reduction of the default TopHat Fusion outputs from a median of 19 events to less than a single event per patient. Furthermore, we have demonstrated that comparison to normal tissue databases alone will markedly reduce the number of candidates of unlikely phenotypic relevance.
This study reveals that surrogate tissues, such as blood, are viable biospecimens for the profiling of fusion transcription in inherited disease studies. Inaccessibility of affected tissue is a recognized obstacle to RNA-Seq profiling because of tissue-specific gene expression and splicing patterns [6] [7]; the successful utilization of surrogate tissue sources for fusion detection is therefore encouraging. Nonetheless, this approach poses challenges and constraints that should not be overlooked. For example, while approximately 68% of OMIM genes are expressed in fibroblasts [7], the genes underlying muscle pathologies are underrepresented in both fibroblasts and blood [6]. Within our own cohort, several genes of potential interest were scarcely expressed in either blood or fibroblasts, and we cannot discount the possibility that our analyses failed to detect pathogenic events in these under-expressed genes. Thus, inaccessibility of affected tissue may limit the utility of RNA-based approaches, and the viability of these methodologies may require assessment on a case-by-case basis.
Conversely, the direct profiling of disease-affected tissue may present its own challenges. Our findings indicate that genes highly expressed in blood are a major source of transcriptional or artifactual noise. Whilst it is convenient to remove blood-abundant genes from an analysis unrelated to blood pathologies, it will be less viable to remove genes highly expressed in muscle when directly profiling the affected tissue for the underlying cause of a muscular phenotype. Illustratively, Patient 6 was the only case for which fibroblasts were utilized in our study, and a correspondingly large number of candidate fusion events were categorized, often involving highly expressed species such as collagens. It is likely that some customization of normal tissue databases and excluded gene lists will be required to enable adequate categorization of common tissue-specific normal events or artifacts. Ideally, large-scale multi-tissue sequence analysis efforts like GTEx will multiply and broaden to increase the sampled population and include protocols like fusion transcript analysis, thus facilitating continued and expanded analyses like our own.
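One way to picture such customization is the sketch below, which builds a tissue-specific exclusion list from expression values while shielding genes relevant to the phenotype under study. It is an illustration under stated assumptions only: the function names, the `protected_genes` parameter, the placeholder gene symbols and the TPM threshold are all hypothetical, and the threshold in particular is arbitrary rather than a value used in this study.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

GenePair = Tuple[str, str]


def build_exclusion_list(
    tpm_by_gene: Dict[str, float],
    tpm_threshold: float,
    protected_genes: Optional[Set[str]] = None,
) -> Set[str]:
    """Genes at or above the expression threshold in the profiled tissue,
    minus any genes that must stay visible (e.g. phenotype-relevant genes)."""
    protected = protected_genes or set()
    return {
        gene
        for gene, tpm in tpm_by_gene.items()
        if tpm >= tpm_threshold and gene not in protected
    }


def split_by_exclusion(
    candidates: Iterable[GenePair],
    excluded_genes: Set[str],
) -> Tuple[List[GenePair], List[GenePair]]:
    """Separate candidates into (kept, deprioritized) lists.

    A candidate is deprioritized if either partner is on the exclusion list.
    """
    kept: List[GenePair] = []
    deprioritized: List[GenePair] = []
    for gene5, gene3 in candidates:
        if gene5 in excluded_genes or gene3 in excluded_genes:
            deprioritized.append((gene5, gene3))
        else:
            kept.append((gene5, gene3))
    return kept, deprioritized


if __name__ == "__main__":
    # Placeholder expression values; the threshold is illustrative only.
    tpm = {"GENE_HIGH": 900.0, "GENE_LOW": 2.0, "GENE_DISEASE": 1500.0}
    excluded = build_exclusion_list(
        tpm, tpm_threshold=500.0, protected_genes={"GENE_DISEASE"}
    )
    candidates = [("GENE_HIGH", "GENE_LOW"), ("GENE_DISEASE", "GENE_LOW")]
    print(split_by_exclusion(candidates, excluded))
```

Shielding a set of protected genes reflects the concern raised above: when the affected tissue itself is profiled, abundant genes cannot simply be removed, because they may include the genes of greatest diagnostic interest.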
Automated PCAN analysis ranked our two diagnostic fusions highest (rank scores 0.001 and 0.002) and flagged 8 of our final 16 candidates in total, raising the possibility of a workflow without a requirement for manual candidate prioritization. Nonetheless, technical errors remain a reality, and PCR or other confirmation studies are necessary to verify a candidate's presence. While our unsuccessfully validated fusions might represent an assortment of artifactual species, it is notable that all but one of them fused precisely at exon-exon boundaries, consistent with RNA splicing, and that they produced non-promiscuous alignments to the human genome and transcriptome. Furthermore, fusions between SPAST and SLC30A6, as detected in Patient 13, have been previously reported in disease [54]. Such observations raise some uncertainty about the artifactual origins of these candidates. Alternative possibilities include the presence of low copy-number events due to mosaicism, subclonality, tissue-specific gene expression, or other novel RNA rearrangements. Validation efforts utilizing alternative tissue sources might therefore represent a means of categorizing putative artifactual events with more certainty.
Since both diagnostic events identified in this study result from underlying genomic deletion or rearrangement, the question arises of whether whole-genome sequencing could have detected them. Without further analysis, the possibility cannot be discounted. Whole-genome analysis nonetheless brings its own set of analytical and interpretive problems. DNA does not match the ability of RNA to measure transcriptional consequence [4], [5], [6] and has its own technical limitations that may cause failure to detect chromosomal DNA fusions [27]. We believe that whole-genome analysis will play a major role in the increased diagnosis of rare disorders as its use spreads and its complexities are further unraveled, but ultimately DNA- and RNA-based methods will serve as supplementary and parallel methodologies.
We focus primarily on DNA-Seq and RNA-Seq because they represent the most mature modern 'omics' technologies and the two that are being most widely applied in the rare disease domain. However, alternative approaches, including those that integrate proteomic technologies, also have the potential to detect aberrant fusion events. Throughput is currently higher with RNA-based methods, enabling more rapid, extensive and cost-effective profiling. Furthermore, fusion transcripts may or may not produce a protein product depending on their constitution. For example, an out-of-frame fusion leading to loss-of-function of two genes would not be expected to produce a protein. Thus, RNA-Seq offers advantages of detectability beyond those of protein-based assays. Proteomic and other approaches, including diverse multi-omic assays, will nonetheless likely reveal their own benefits as they become more accessible and more widely used.
While this study has focused on the detection of aberrant fusion transcripts, further diagnoses may yet be possible by expanding testing to include profiling of allele-specific expression (ASE), aberrant expression levels and splicing [6][7][8]. Indeed, we have previously published case studies where such events were diagnostic of rare disease [55]. Furthermore, variations of the analytical approach described herein may yield further events of interest. For example, the event category housing potential novel transcripts from single genes might contain abnormal exon combinations arising from intragenic deletions, and these have potential for disease relevance. Ultimately, however, each of these analyses is methodologically distinct and poses its own set of technical challenges. Their systematic application to this and further patient cohorts should form the basis of future work.

Conclusions
We have reported the first successful systematic application of fusion transcript detection within a rare disease cohort. We have demonstrated an increased diagnostic rate and identified further novel candidates for phenotype causation. Fusion transcript analyses such as those described herein should be considered in any RNA-Seq analysis aimed at genetic diagnosis of undiagnosed rare inherited disease.
Supporting information S1