Molecular Diagnosis of Usher Syndrome: Application of Two Different Next Generation Sequencing-Based Procedures

Usher syndrome (USH) is a clinically and genetically heterogeneous disorder characterized by visual and hearing impairments. Clinically, it is subdivided into three subclasses with nine genes identified so far. In the present study, we investigated whether the currently available Next Generation Sequencing (NGS) technologies are already suitable for molecular diagnostics of USH. We analyzed a total of 12 patients, most of which were negative for previously described mutations in known USH genes upon primer extension-based microarray genotyping. We enriched the NGS template either by whole exome capture or by Long-PCR of the known USH genes. The main NGS sequencing platforms were used: SOLiD for whole exome sequencing, Illumina (Genome Analyzer II) and Roche 454 (GS FLX) for the Long-PCR sequencing. Long-PCR targeting was more efficient with up to 94% of USH gene regions displaying an overall coverage higher than 25×, whereas whole exome sequencing yielded a similar coverage for only 50% of those regions. Overall this integrated analysis led to the identification of 11 novel sequence variations in USH genes (2 homozygous and 9 heterozygous) out of 18 detected. However, at least two cases were not genetically solved. Our result highlights the current limitations in the diagnostic use of NGS for USH patients. The limit for whole exome sequencing is linked to the need of a strong coverage and to the correct interpretation of sequence variations with a non obvious, pathogenic role, whereas the targeted approach suffers from the high genetic heterogeneity of USH that may be also caused by the presence of additional causative genes yet to be identified.


Introduction
Usher syndrome (USH) is a group of recessively inherited disorders characterized by deafness and vision loss. Traditionally, USH is subdivided into three clinical subclasses. Visual impairment due to Retinitis Pigmentosa (RP) [1] is common to all three subtypes, which are distinguished based on the severity and progression of the hearing loss and by the presence or absence of vestibular symptoms. USH shows genetic heterogeneity: at least 11 distinct loci have been identified and 9 causative genes have been cloned. Although this classification of USH remains in clinical use, atypical clinical types have been described that defy this simple classification.
The USH2 type is less severe and is characterized by moderate to severe congenital deafness, with a high-frequency sloping configuration. Owing to the overlap between types I and II in clinical appearance and age of onset, visual symptoms are not considered reliable predictors of USH type in individual cases [2,3,21]. Three genetic loci have been reported so far in USH2 (USH2A, USH2C and USH2D) and the corresponding genes have been identified. Mutations in the USH2A gene, encoding usherin, underlie the most common form of USH2 accounting for up to 85-86% of cases [7,13,22,23]. Mutations in the USH2C and USH2D genes are much rarer [7,13,24,25]. The protein encoded by the GPR98 gene at the USH2C locus is a member of the serpentine G-protein coupled receptor superfamily [24]. Defects in the DFNB31 gene, a PDZ (post-synaptic density, disc-large, Zo-1 protein domains) domain-containing scaffold protein, are responsible for USH2D and nonsyndromic hearing loss (DFNB31) [25,26]. Finally, USH3 is characterized by variable onset of progressive hearing loss, variable onset of RP, and variable impairment of vestibular function and is caused by mutations in the USH3A (clarin-1) gene, located on 3q21-q25 [27,28]. USH3 is the less common form of Usher syndrome with a prevalence of 2-4% within all USH cases [6,7,13].
Identification of the USH causative mutations is important for early diagnosis, genetic counseling and prenatal diagnosis. Despite the fact that most Usher syndrome patients can be reliably grouped into one of the three main clinical classes, a comprehensive molecular diagnostics protocol for Usher syndrome has been hampered by genetic heterogeneity, the large number of exons to analyze and by the high costs associated with application of conventional techniques [29]. Current diagnostic strategies for Usher syndrome include: a) the use of genotyping microarrays based on the arrayed primer extension (APEX) method that can detect the presence of previously reported mutations and has proved to be an adaptable and affordable mutation screening tool [29,30,31]; and b) complete exon sequencing of all known Usher syndrome genes by Sanger sequencing, which was recently demonstrated to significantly improve diagnostic efficiency for this condition [10,13], but it is a demanding procedure in terms of both cost and time.
Newly developed molecular technologies are therefore needed to facilitate the discovery of underlying gene mutation early in life and to provide estimation of its prevalence in at risk pediatric populations thus laying a foundation for its incorporation as an adjunct to newborn hearing screening programs. During the past five years, new high-throughput DNA sequencing technologies collectively referred to as next generation sequencing (NGS) have emerged [32,33,34]. NGS is nowadays widely used in biomedical research [35,36,37,38,39], but its applications in molecular diagnostics are still limited [40,41,42,43]. NGS allows to efficiently sequence an entire genome, an exome, or specific genomic regions. The latter can be achieved by using one of several enrichment methods including long-range PCR (LR-PCR), fragment-capture using solid surface arrays and in-solution oligonucleotide capture [44,45,46,47]. In this pilot study, we report the results obtained by using both exome and targeted sequencing on human genomic DNA samples from Usher Syndrome patients.

Results
To evaluate whether the currently available NGS technologies are ready for USH diagnostics we used: a) Whole exome sequencing with analysis restricted to known USH genes and b) targeted resequencing of the known USH genes starting from long-range PCR products. A schematic overview of the strategy is shown in Figure 1.

Whole Exome sequencing
We tested nine different Usher patient samples, which previously turned out to be negative for known mutations in Usher genes as assessed by APEX microarray screening [31]. In the analysis, we focused only on the coding exons and the related splice junction sites of known Usher genes. Each obtained sequence was analyzed according to a specific workflow as for the different protocols and platforms used (see Information S1 and Figures S1 and S2). An average of 12 Gb of sequence was generated per affected individual ( Figure S3). A total of 43.3 million non-duplicated reads could be mapped to the genome. We found high correlation values (r s .0.9) among coverage depth positions of samples processed using the same enrichment method. We decided to consider the minimum and not the mean of obtained coverage depth to better evaluate the detection of variations in any coding position. The library and sequencing chemistry used influenced the overall results in terms of sequence obtained: on average 58 million reads were obtained for samples sequenced with SOLiD4 Paired-end protocol ( Figure 2, samples labeled with the ''USH'' prefix) while 30 million reads were obtained for SOLiD3 fragment sequencing ( Figure 2, samples labeled with the ''A'' prefix). The sequence coverage of the USH genes was comparable with all the Human RefSeq genes regarding the increase of minimum depth demand (Figure 2A-B). Within the Usher genomic regions we identified approximately 100 total sequence variants per sample with an average coverage of 206. In the process of sequence variant calling, a) we selected those variants that were located in translated regions; b) we selected those variants supported by reads on both DNA strands in human genome; c) we removed those variants deposited in the dbSNP database (version 135), 1000 human genome dataset, and in our in-house exome database that includes healthy individuals and patients with non-ocular conditions. A schematic workflow of our filtering step procedure and data analysis, similar to the scheme reported by Wei, X. et al [48] is depicted in Figure S4. This filtering process narrowed down the observed variations to 1-2 per samples ( Table 1 and Table S3), which were all confirmed by Sanger Sequencing. An overall number of 12 variations in 5 known USH genes was detected. The variation effect was predicted following a multi-step analysis available on the Usher Syndrome Missense Analysis (USMA) website (https://194.167. 35.160/cgi-bin/USMA/USMA.fcgi). Three variants were already reported in USHbase database as neutral (MYO7A: c.4697C.T), probably neutral (USH2A: c.14074G.A) and likely pathogenic (USH2A: c.3176C.T) while the remaining 9 were not described so far (Table 1). In silico analysis of the variations allowed us to infer the pathogenic role for 3 out of 9 variants, thanks to the availability of secondary and 3D structure analysis of the corresponding protein regions (Table S1).
Notably, only for three samples (A20, USH126, USH103) we were able to identify two putatively pathogenic sequence variations in the same gene while for two samples (A26, USH135) we were not able to identify any putative mutation.

Targeted Resequencing
As an alternative strategy for the molecular analysis of Usher genes, we tested the efficacy of targeted resequencing. We designed an in-house enrichment protocol based on 218 Long-PCR fragments (Table S2) covering all the known exons of the 9 Usher genes. We carried out this analysis on a set of three Usher patients different from those analyzed by exome sequencing. All the analyzed patients had been prescreened using the APEX-based genotyping chip and some putative mutations had already been identified ( Table 2). The obtained Long-PCR amplicons were equimolarly mixed and subsequently split in two parts to use the same starting material for the different library preparations according to the NGS platform (see Material and Methods). We obtained 5 Gb of sequences using the Illumina GAII and 45 Mb using the GS-FLX platform corresponding to 14.3 millions and 0.22 millions of non-duplicated mappable reads, respectively. Overall, as expected, this Long-PCR approach yielded a much higher coverage in Usher genes as compared to whole exome sequencing and about 94% of the Usher gene exons show a mean coverage higher than 206 ( Figure 3). Since the same long-range PCR library was analyzed using two different sequencing platforms, we could also compare the two procedures. Considering the entire Long-PCR product as target, we can clearly distinguish a better on-target efficiency of the Illumina GAII Paired-End Libraries (99% On-Target) compared to GS-FLX (90% On-Target Figure S5). This difference seems to be increased using the open source BWA mapping results while platform proprietary mapping software show a lower Off-Target for the same GS-FLX data (95% On-Target). Among Long-PCR products, we registered a similar coverage performance of GS-FLX up to 24-256, but above that value the curves show a more severe drop for GS-FLX compared to GAII ( Figure 3). We decided to further investigate the trend by checking for systematic biases in the coverage among our Long-PCR results. By investigating the coverage distribution of each single amplicon, we observed an uneven distribution of the coverage, which could not be ascribed to a random effect. Using the data from different samples we uncovered a positiondependent coverage correlation with a Spearman Rank Order Correlation coefficient of 0.82 r s ( Figure S6).
For sequence variant calling, we used the same criteria described for the analysis of exome-enriched samples. On average, we obtained a total of 129 sequence variants per sample that Figure 1. Workflow of the next generation sequencing strategies used. A) whole exome sequencing workflow. Samples have been prescreened using an Apex-based Usher genotyping microarray; library preparations prior to enrichment include fragment single reads or Paired-End preparation. Three different types of enrichment methods have been used; each enrichment probe sets overlap at different extent to the RefSeq coding regions of Usher genes (horizontal bars). Sequencing protocols include single 50 bp reads on the Solid3 System, single 50 bp read on Solid4 System, Paired-end reads 50 bp+35 bp on Solid4 System. B) Long-PCR sequencing workflow. Samples have been pre-screened using Usher Apex microarray, Long-PCR approach produced 218 PCR amplicons used as input for the for Fragment and Paired-End library preparation. Sequencing was performed using both GS-FLX and GAII Systems. doi:10.1371/journal.pone.0043799.g001 narrowed down to a few variations with our selection criteria. In addition to the expected mutations, i.e., those identified by microarray analysis, two unreported mutations were identified ( Table 2). Sample 1 presents an unreported heterozygous missense mutation in CDH23: c.1423G.C;p.V475M, classified as probably-damaging according to Poliphen2 prediction and located in a domain where other pathogenic variants have previously been reported [49]. In the case of Sample 3 we identified an unreported MYO7A variation (MYO7A: c.3827C.A;p.S1276*) that complements with the previously reported variation (MYO7A: c.77C.A; p.A26E) identified through APEX screening. Interestingly many conservation scores (Placental Mammal Basewise Conservation,-Vertebrate Basewise Conservation, Primate Basewise Conservation) drastically decrease exactly after the above nonsense mutation ( Figure S7).

Discussion
Molecular diagnosis in Usher syndrome is hindered by significant genetic heterogeneity, the large size of some of the Usher genes, and the high number of polymorphic variations in genes such as MYO7A and USH2A. Furthermore, a digenic inheritance has been proposed in some cases of Usher syndrome [10], although not yet universally accepted [13], which further complicates the elucidation of the molecular basis of genotypephenotype correlation. Therefore, the molecular analysis of known genes using Sanger sequencing is challenging and cannot be offered routinely [10,13]. The goal of our study was to test the promising NGS technologies that could be applied to the molecular diagnosis of Usher syndrome.
To test all the available options for diagnostic NGS, we investigated to what extent the current NGS technologies can be used in this process and we compared the efficacy of whole exome approaches vs. gene-specific approach based on Long-PCR enrichment. In addition, we obtained sequence data with all different NGS platforms. To our knowledge, this is the first attempt to use NGS protocols in the molecular diagnosis of Usher syndrome.
The first evidence we gathered is that when using whole exome sequencing the percentage of Usher genes represented ranged from a maximum of 98% to a minimum of 76% at low sequence coverage ( Figure 2B and Figure S2). However, those values rapidly decrease as the coverage demand increases. Without a substantial increase of the number of the sequencing reads and hence of the sequencing costs/sample, the whole exome approach can be insufficient for USH diagnosis. For example, setting a minimum coverage of 306, only 50% of the USH sequences are available for analysis. In addition, it is also evident that the enrichment protocol produces an uneven coverage with peaks and falls even in closely adjacent genomic regions depending on the enrichment protocol used (see Information S1 and Figures S8 and S9), as shown by high correlation values between samples processed using the same enrichment protocol. Even if newer versions of enrichment kits usually work better than previous ones, when looking at single genes it is possible to find some specific exceptions (like in the case of USH2A) in which the earlier version of the kit gives a better coverage curve ( Figure S3).
In contrast, the Long-PCR-based enrichment leads to an almost constant curve guaranteeing a higher minimum depth coverage. Our high correlation value, between samples processed for Long-PCR, seems to indicate a systematic non uniform distribution in our approach. This may be due to bias in the PCR products since no valid correlation with GC composition or length of the sequenced region can be addressed. Nevertheless, the comparison of the exon coverage between the Long-PCR and in-solution method shows that the Long-PCR approach guarantees always a higher number of exons sequenced for any selected coverage ( Figure 3). This result encourages the development of protocols to obtain DNA templates prepared by multiplex Long-PCR in order to decrease costs and workload/sample. We adopted a high stringency multi-steps filter criteria to select the pathogenic variants. This approach lowered the number of false positive variants, although it could raise the number of false negatives leading to an underestimation of the real number of variants. The high number of unreported variants (approximately 1 variant per sample) shows that, in the process of designing any strategy for USH molecular diagnosis, the high prevalence of novel mutations is a major issue, as already suggested in some recent publications [10,13]. Sequencing of all known USH exons and not only the screening of known mutations is required for proper molecular diagnosis and an accurate genetic counseling. In this light, we should pay attention to the lack of an adequate coverage due to the enrichment step, an issue that will be solved with the technical improvements of the capturing methods.
We observed that NGS of long-PCR products did improve the detection rate of mutations over APEX screening, but not over the more demanding Sanger methods [10]. This suggests that the genetic heterogeneity of USH is much higher and makes the NGS technologies cost-effective in terms of diagnostic power only in the cases due to mutations in the known genes. Although NGS can be envisaged to have multiple applications in clinical diagnostics, the technology is currently complex and requires attention to technical issues, including sequencing template enrichment and management of massive data. In particular, for lower complexity samples a point of diminishing returns is reached when the number of counts per sequence results in oversampling with no increase in data quality.
Out of the twelve USH patients analyzed, we could genetically solve five of them as we identified two presumably pathogenic variants in the same gene (A20, USH100, USH103, USH126, and Sample3). We identified a single heterozygous putative pathogenic variant in five other patients, while we could not recognize any pathogenic sequence variations in the remaining two USH patients (A26 and USH135, Table 1). Our results are overall in agreement with those on larger cohorts of USH patients by Sanger sequencing [10,13]. The higher percentage of genetically unresolved cases (2/12 corresponding to 16% of patients analyzed) with respect to previous studies can be explained by the fact that for the whole exome sequencing analysis we selected USH patients that were not found to harbor any known mutation in USH genes by APEX screening. We thus enriched our starting dataset for patients with higher probability to carry mutations in novel, yet to be identified USH genes.
In conclusion, the constant decrease in costs of NGS procedures will make them even more attractive in the near future. Even if nowadays whole genome approaches do not seem to fulfill the requirements of a complete molecular screening of Usher  Syndrome, they nevertheless remain a potential precious tool for the identification of new players in disease.

Patient selection
The patients examined in this study underwent a screening visit at the diagnostic centers of either Trieste or Naples that included a basic ophthalmic consultation, central visual acuity (CVA), Goldman visual field (GVF), fundus oculi, standard electroretinography (ERG), and a series of periodic control visits. The patients also provided a detailed medical history and underwent genetic counseling to identify hereditary patterns. Each patient underwent a complete ear, nose and throat (ENT) and audiovestibular examination, including otomicroscopic examination, audiometry, and an auditory brainstem response (ABR) for threshold assessment of patients less than 5 years of age. Blood samples were collected and genomic DNA was extracted from blood samples using standard techniques. All patients studied entered the diagnostic centers of Trieste or Naples and signed an appropriate consent form for genetic testing as well as forms related to privacy of data. Approval for the study was obtained by the Second University of Naples and by IRCCS-Burlo Garofolo of Trieste Ethics Committees.

Exome Enrichment
For Agilent exome enrichment 3 mg genomic DNA was used. We used ABI SOLiD optimized kits (Agilent, Santa Clara, CA, USA), following the manufacturer's instructions. The different enrichment kits are reported in Information S1. Briefly, for every 3 mg DNA, we diluted the genomic DNA and, using a Covaris station, sheered the genomic DNA to 150 base pair fragments. The purified obtained samples were end repaired, adaptor ligated and the obtained library amplified according to the SureSelect Target Enrichment protocol. The final step of hybrid capture selection provided an enriched library that has been quality assessed with Agilent 2100 Bioanalyzer. The enriched exome libraries were subsequently used for e-PCRs following manufacturer's instructions (Life Technologies, Carlsbad, CA, USA), based on a library concentration of 0.5 pM. DNA sequencing was performed with the use of the SOLiD system that involves ligationbased sequencing and a two-base encoding method in which four fluorescent dyes are used to tag various combinations of dinucleotide, reducing the risk of false positive determinations. Differences in the sequencing chemistry used are reported in Information S1.

Long-PCR Design
Primers were chosen using the web-based program Primer3 (PRIMER3, primer3_www.cgi v 0.2). Each oligo contained 26-32 nt with a predicted melting temperature higher than 59uC and lower than 66uC with a GC content lower than 40% where not possible (14 primers) we selected primers with higher GC contents. Primers were also checked by BLASTn against the NCBI data bank genome for specificity (http://www.ncbi.nlm.nih.gov/ BLAST). For long-PCR, 100 ng of genomic DNA was used for each PCR. We used Takara LA buffer or equivalent (pH .9 at 25uC) with 1.5 mM dNTP (PCR grade) and 3 mM MgCl2 (final concentrations) and a volume of 20 ml (mix 1). Reactions were carried out with a DNA Thermocycler System (MWG or MJ Research) with heated lid. To avoid sample evaporation, aqueous mixture in each tube was overlaid with 30-40 ml of mineral oil that was removed by chloroform extraction after the reaction.
After the first denaturation step at 98uC for 1 min, the thermal cycler was stopped at 85uC and 5 ml of diluted DNA polymerase (Takara LA Taq) was added. We diluted the polymerase 20-fold using 16 LA buffer. This step was followed by 30 cycles (98u for 10 s, 63uC for 1 min and 68uC for 6 to 12 min). After the PCR, DNA samples were precipitated with 2.5 vol of ethanol containing 2.5 M ammonium acetate at room temperature and no further purification was performed. Concentration of PCR product was measured using the NanoDrop spectrophotometer (NanoDrop, Wilmington, DE, USA) and verified by agarose gel electrophoresis.
The list of primers is available on Table S2. All the obtained Long-PCR amplicons were equimolarly mixed and subsequently split in two parts to use the same starting material for the different library preparation for Illumina GAII and Roche 454 GS FLX.

Long-PCR Sequencing
GS FLX DNA Libraries were made following the manufacturer's protocol. Briefly, 4 mg of the 0.5 mM pooled amplicon solution was nebulized at 310 kPa for 1 min and purified using the MiniElute PCR purification kit (Qiagen). Small DNA fragments were removed using an AMPure PCR purification system (Agencourt Bioscience, Beverly, MA, USA) and analyzed on an Agilent 2100 Bioanalyzer. The nebulized DNA was subsequently end-repaired and phosphorylated using T4 DNA polymerase and T4 polynucleotide kinase, and oligonucleotide adapters were ligated to the DNA fragments with DNA ligase. Adapter-modified fragments were diluted, annealed to capture beads, and clonally amplified by emulsion PCR. After emulsion PCR, beads with clonal amplicons were enriched and deposited on a quarter of Picotiter plate flow cell and sequenced on a Roche 454 GS FLX platform.
Illumina's Paired-end libraries were made following the manufacturer's protocol with reagents supplied in the Illumina DNA sample kit. Briefly, 20 ml of the 0.5-mM pooled amplicon solution was nebulized at 220 kPa for 6 min. The nebulized DNA was end-repaired using Klenow and T4 DNA polymerases, phosphorylated with T4 polynucleotide kinase, and adenylated using Klenow exo-DNA polymerase, and oligonucleotide adapters were added using DNA ligase. Ligated products were visualized in a 2% agarose TBE gel, and a 200-to 250-bp size range was excised and purified using a Qiagen gel extraction kit. The sizeselected, adapter-modified DNA fragments were amplified using adapter-specific primers 1.1 and 2.1 with Phusion DNA polymerase using the manufacturer's protocol. The library was purified prior to quantification using an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). A single-flow cell lane was sequenced on a Illumina Genome Analyzer II. Confirmation of variants by Sanger sequencing was obtained using standard protocol. Sequences were run on ABI 3100 DNA analyzer, and assembled using ABI Prism Seqscape 2.1.

Bioinformatic Sequence data analysis
All next-generation sequencing data were initially processed using the corresponding instrument software. Roche 454 data were initially processed using the GSMapper software package (Roche Inc.) supplied with the GS FLX instrument. High quality sequencing reads were aligned to the human genome reference sequence NCBI36/hg18. Variants with respect to NCBI36/hg18 reference sequence were identified with the Newbler software and the features obtained were remapped to GRCh37/hg19 using UCSC's liftOver tool (http://www.ncbi.nlm.nih.gov/genome/ tools/remap) to be directly comparable with the other results. Illumina Genome Analyzer II data were initially processed using the Illumina Sequence Control Software (SCS) with Real Time Analysis (RTA) and Genome Analyzer Pipeline software supplied with the instrument. Sequence were aligned to the human genome reference sequence GRCh37 and variants were identified with using Illumina's Consensus Assessment of Sequence and Variation (CASAVA) software. SOLiD data were initially processed using the ICS software to obtain primary sequence analysis consisting of image analysis and basecalling colorspace fasta sequence. Color space reads were mapped to the GRCh37 reference genome with the SOLiD bioscope software v1.3 (reference) which utilizes an iterative mapping approach. Differences in the parameter used for the different chemistry used are available in Information S1. Single nucleotide variants were subsequently called by the DiBayes algorithm using the conservative default call stringency. Small insertions and deletions were detected using the SOLiD Small InDel Tool. Called SNP variants and indels were combined and annotated using a custom analysis pipeline. To spare software biases all the cross-platforms, comparison has been performed re-analyzing all the starting data using open-source software. Briefly the raw sequences obtained from each platform have been aligned to the human genome reference sequence GRCh37 using BWA [50] and quality filter before variant calling using Samtools software [51]. In the process of sequence variant calling, we focused only on variants included in coding exons and in corresponding canonical splicing sites of known Usher genes, considering as relevant for splicing sites investigations only the exons boundaries plus 2 base pairs. We utilized for further filtering our MYSQL in-house exome database, which includes sequencing data from 27 healthy individuals and 73 patients with non-ocular conditions. All the SNVs present in the database with an allele frequency .1% were not regarded as putative pathogenic mutations.
The Poliphen2 [52] software programs have been used to predict the influence of any amino acid substitution on the protein structure and function. PhastCons, and GERP have been used evaluate the conservation score of each variation. The pathogenic effect of missense variants was predicted as previously described [53] following a multi-step analysis [54] and is available in https:// neuro-2.iurc.montp.inserm.fr/cgi-bin/USMA/USMA.fcgi. In addition, alignments of orthologs are accessible in USMA. Additional details are present as Information S1.