Integrative Data Mining Highlights Candidate Genes for Monogenic Myopathies

Inherited myopathies are a heterogeneous group of disabling disorders with still barely understood pathological mechanisms. Around 40% of afflicted patients remain without a molecular diagnosis after exclusion of known genes. The advent of high-throughput sequencing has opened avenues to the discovery of new implicated genes, but a working list of prioritized candidate genes is necessary to deal with the complexity of analyzing large-scale sequencing data. Here we used an integrative data mining strategy to analyze the genetic network linked to myopathies, derive specific signatures for inherited myopathy and related disorders, and identify and rank candidate genes for these groups. Training sets of genes were selected after literature review and used in Manteia, a public web-based data mining system, to extract disease group signatures in the form of enriched descriptor terms, which include functional annotation, human and mouse phenotypes, as well as biological pathways and protein interactions. These specific signatures were then used as an input to mine and rank candidate genes, followed by filtration against skeletal muscle expression and association with known diseases. Signatures and identified candidate genes highlight both potential common pathological mechanisms and allelic disease groups. Recent discoveries of gene associations to diseases, like B3GALNT2, GMPPB and B3GNT1 to congenital muscular dystrophies, were prioritized in the ranked lists, suggesting a posteriori validation of our approach and predictions. We show an example of how the ranked lists can be used to help analyze high-throughput sequencing data to identify candidate genes, and highlight the best candidate genes matching genomic regions linked to myopathies without known causative genes. This strategy can be automatized to generate fresh candidate gene lists, which help cope with database annotation updates as new knowledge is incorporated.


Background
A large number of disorders affecting skeletal muscle have a genetic basis, with multiple modes of inheritance. They are classified based on phenotype and histopathological features into several groups, which include muscular dystrophies, congenital myopathies and myotonic syndromes, among others (Table 1) [1]. Muscular dystrophies and congenital muscular dystrophies, for example, are characterized by dystrophic changes on muscle biopsy, as opposed to congenital myopathies, which have nondystrophic peculiar histopathologic findings [2][3][4][5]. Despite being rare, most inherited myopathies impose a heavy burden on the life of affected persons, and have a strong impact on the health care system. The identification of the causative gene and mutations is often a pre-requisite for genetic counseling and potentially prenatal diagnosis, improved disease care, and access to more specific therapies or inclusion in clinical trials. A lot of advances have been made in the last few decades on the molecular bases of inherited myopathies, which included the discovery of about 130 genes associated with different disorders [1]. Still, it is estimated that around 40% of patients afflicted with myopathies remain without a molecular diagnosis, supporting the implication of additional genes [6,7]. Further identification of these genes is the focus of a tremendous research effort at present, and will help understand pathological mechanisms and defining novel drug targets.
Next-generation sequencing (NGS) is a relatively new technology that enables massive parallel sequencing of a huge number of bases. It has revolutionized molecular diagnosis and genetic research, as it represents a cost-effective way of testing several genes at once in disorders with genetic heterogeneity, such as myopathies [8][9][10]. Moreover, exome sequencing (ES) or genome sequencing (GS) aid in the discovery of new genes associated with various diseases [11,12]. There has recently been a surge in publications that use NGS to discover new genes associated with diseases, including myopathies [13][14][15][16][17][18][19][20].
The biggest challenge of NGS is to cope with the complexity of analyzing the massive amount of variants generated by the approach. Indeed, comparing two unrelated individuals may lead to about 3 million matching variants in their genomes or about 20,000 in their exomes, but only one of these variants can cause a monogenic disease. The resolution of this issue demands good filtering pipelines to exclude common or meaningless variants, based on the biochemical function of genome location as studied through the ENCODE project [21], and on relationships between human variations and phenotype as in ClinVar and in locus specific databases [22,23]. In addition, ranking systems can help prioritize validation of the most promising variants. It makes sense to focus on genes presumably implicated in the disease process via functional, structural or phenotypical links with known genes. One of the approaches to collect and compare these data is via in silico analysis using a multitude of open-access knowledge information sources. This approach has been recently done successfully for some disorders but not yet for myopathies [24,25]. Lists of candidate genes thus generated can be ranked and used to prioritize variants resulting from NGS analysis.
Here, we propose ranked lists of candidate genes for individualized groups of inherited myopathies and related diseases that were obtained via data mining of online information databases.
These lists can be coupled to NGS analyses pipelines to help filter and prioritize variants aiming at the discovery of novel genes. We also put forward a number of genetic and functional insights taken from the generation of signatures for such disease groups to suggest common pathological pathways between them that can be subject of further scrutiny.

Classification of myopathy genes into 9 overlapping disease groups
The disease groups and associated known genes were based on a modified version of the Gene Table of Neuromuscular Disorders (GTNMD) [26]. We selected the following disease groups, which are primarily related to skeletal muscle pathology: Muscular Dystrophies, Congenital Muscular Dystrophies, Congenital Myopathies, Myotonic Syndromes, Ion Channel Muscle Diseases, Metabolic Myopathies, and Congenital Myasthenic Syndromes. To cope with an ill-defined classification of ''Other Myopathies'' in the GTNMD, we decided to cluster genes from this group into two new disease groups, Myofibrillar Myopathies and Vacuolar Myopathies. A literature search was performed to find recently published genes not yet listed in the Gene Table version that was used in our present study, which resulted in the addition of the following genes: VMA21 [16] and EPG5 [14] to the Vacuolar Myopathies group; TRAPPC11 [13] and TNPO3 [19,27] to the Muscular Dystrophies group; and STIM1 [20], CCDC78 [15] and KLHL40 [17] to the Congenital Myopathies group.
The disease groups have some degree of overlap due to phenotypic heterogeneity of certain genes. For example, SEPN1 is  -girdle  muscular dystrophies   ANO5, CAPN3, CAV3, DAG1, DES, DMD, DNAJB6, DPM3, DUX4, DYSF,  EMD, FHL1, FKRP, FKTN, LMNA, MYOT, PABPN1, PLEC, POMGNT1,  POMT1, POMT2, PTRF, SGCA, SGCB, SGCD, SGCG, SYNE1, SYNE2,  TCAP, TMEM43, TNPO3, TRAPPC11, TRIM32, TTN implicated in multi-minicore disease (a congenital myopathy), and in rigid-spine muscular dystrophy (a congenital muscular dystrophy); CAV3 both causes limb-girdle muscular dystrophy 1C (a muscular dystrophy), and rippling muscle disease (a myotonic syndrome). The largest overlap is found between Muscular Dystrophies and Congenital Muscular Dystrophies, with 8 genes out of 34 muscular dystrophy-associated genes also found among 20 congenital muscular dystrophy-associated genes. GTNMD's disease group "Distal Myopathies" was not included as a separate class in this work due to the lack of a gene uniquely associated with it -all genes were also found in other disease groups. Non-myopathy disease groups, such as ataxias, neuropathies, and motor neuron diseases, were also not included, as well as genes that, although listed in the GTNMD under included disease groups, do not lead to a skeletal muscle phenotype. This was the case of MYBPC3, implicated in cardiomyopathies, removed from the Congenital Myopathies group; PRKAG2, which causes a glycogen storage disease of the heart, not included in the Metabolic Myopathies group; and genes excluded from the Ion Channel Muscle Diseases group because they lead to various ataxia and cardiac arrhythmia syndromes, while not resulting in periodic muscle paralysis. The full list of genes and disease groups used in this work can be found in Table 1.

Data-mining from online databases to address complex biological questions
We used the data mining system Manteia [28], a public resource available online (manteia.igbmc.fr) that retrieves and combines data from freely available online data sources such as Ensembl, Reactome, OMIM, NCBI, Human Phenotype Ontology (HPO), Gene Ontology (GO), Mouse Genome Informatics, and InterPro. Manteia makes it possible to address complex biological questions by running several queries at the same time to mine and statistically analyze gene sets to highlight their annotation specificities compared to the rest of the genome. This study was conducted with Manteia version 2 with data downloaded in June 2013 from the different databases used in the system.
Using Manteia's orthology module, we analyzed human gene sets and their mouse orthologs to find an enrichment on statistically significant terms within several annotation categories, including Gene Ontology (GO), Human Phenotype Ontology (HPO), Mammalian Phenotype Ontology (MPO), pathways (Reactome), protein motifs (Interpro) and interacting complexes (Reactome). Gene length was not taken into account as there is no clear enrichment of large genes mutated in myopathies; while some large genes are indeed implicated (TTN, NEB), smaller genes were found to accumulate mutations along their sequence (e.g. ACTA1). Manteia calculates the enrichment of each term in the gene set compared to all genes in the genome, and sorts the terms according to individual statistical significances.
The list of specific terms for each data set can then be used to screen the genome looking for genes that have similar properties. This is achieved using a query builder, which outputs a list of candidate genes ranked according to their similarity with the data set annotation signature and the weight given for each term.
Extraction of specific signatures for each disease group based on known genes Statistical analysis for human genes in each disease group was individually performed for Gene Ontology (GO) terms, Human Phenotype Ontology (HPO) terms, pathways, complexes and protein motifs. Mouse orthologs were additionally used to get statistical breakdowns of Mammalian Phenotype Ontology (MPO) terms.
Signatures were represented by a weighted combination of GO terms, HPO terms, MPO terms, and what were collectively called "Interactions Annotation" (IA) terms -pathways and protein complexes descriptors from Reactome and descriptors of protein motifs from InterPro. For each disease group, terms were chosen from the various domains in order to obtain a signature of the disease group. We used the following criteria to select GO terms, HPO terms and MPO terms for each group: 1) significance pvalue less than 0.05 (corrected using the Benjamini-Hochberg false discovery rate (FDR) procedure); 2) occurrence in the disease group gene set greater than 1; 3) occurrence in the genome ,800; 4) GO level (or HPO level or MPO level) .2.
The FDR-BH correction of the p-value was chosen because it reduced the large size of the list of resulting terms while not being as stringent as the Bonferroni correction. Terms with only one occurrence in the gene set were deemed not representative of the set. Criteria 3 and 4 enrich for specificity and are closely associated owing to less specific terms (higher ontology level) being associated with a large number of genes; such general terms would not only be unproductive in compounding a signature for a disease group, but also could degrade the performance of a complex query.
For IA terms, the restriction on the occurrence of only one gene in the set was dropped with the aim of improving the scores of new genes related to protein function linked to single known myopathy gene. Indeed, a large proportion of significant terms have only one occurrence in any given disease group gene set. Finally, criteria 4 does not apply to Interpro and Reactome data, which are not structured in defined hierarchies as gene and phenotype ontologies.

Ranking formula based on weighted scores of signature terms
After experimenting with different signature definitions, we decided to define a signature as having an equal contribution of GO terms, phenotype terms (HPO and MPO terms) and IA terms (Figure 1), so as not to a priori give more importance to any term set. A signature with a stronger component of phenotype terms, for example, yields a list of purported candidates strongly biased to genes implicated in known diseases or for which mouse models have been extensively phenotyped. Likewise, if GO terms are the main component of the signature, genes with functional links are preferentially ranked. Finally, IA terms boost the interactome of the known genes to the top of the ranked lists.
Thus, in our approach, for each disease group, the weight of individual terms was calculated so that the added weights of all GO terms was the same as the total PO score (added HPO terms weights combined with added MPO terms weights) and as the added weights of all IA terms, which was arbitrarily set as 1000.
In the GO domain, we defined strata of term weights with percentile cutoffs so that terms with a higher significance would respond for a larger share of the total GO score. The top 20% of terms (p80) contributed to 40% of the total GO score, a middle tier comprising 40% of terms (p40-p80) contributed to an additional 40% of the total GO score, and the lower 40% of terms (p40) provided 20% of the total GO score.
A similar approach was used to calculate individual weights for HPO and MPO terms, with the exception that, as the total PO score reflects the combination of equal shares of HPO and MPO terms, the maximum score of either HPO terms or MPO terms was set at 500.
The weight of each IA term, on the other hand, was the same no matter its position on the corresponding list, and was simply calculated as 1000 divided by the number of significant IA terms for each disease group. This approach helped mine all genes that interacted with any single gene in the training set, provided pathways and interactions were statistically significant.
While the choice of percentile cutoffs that define the strata of weights was arbitrary, we observed that the modification of the cutoffs did not result in substantially different ranked candidate lists for each disease group, as long as the signature definition is the same. All terms for every domain in each disease group, with their corresponding calculated weights, can be found in Table S1.

Generation of ranked lists of candidate genes for each disease group
Manteia's query builder feature was used to filter genes in the human genome that matched the signature defined for each disease group. Queries combining terms that constitute the signature were run to obtain a list of genes ranked by a score represented by the sum of all matched term weights ( Figure 1). More specifically, a gene score results as the sum of a GO score (sum of weights of the disease group's signature GO terms that match the gene's GO terms), a PO score (sum of weights of matching HPO and MPO terms), and an IA score (sum of weights Figure 1. Integrated data mining workflow. A signature of a disease group, composed of weighted terms, is generated from statistical analyses of genes already implicated in diseases of the group. Terms come from the three main annotation groups, GO (Gene Ontology), PO (Phenotype Ontology, an aggregate of Human Phenotype Ontology and Mammalian Phenotype Ontology) and IA (Interactions Annotation), are mined using Manteia and receive weights proportional to the their enrichment in the set of genes implicated in the disease group, as compared to the set of all genes in the human genome. Weights are attributed to terms so that annotation groups contribute equally to the composition of the signature. The signature of the disease group is then used to mine the genome for additional genes. Every gene in the genome receives a score equal to the sum of weights of terms that describe the gene if they match terms that define the disease group signature, for a maximum possible score of 3000. Further filtering steps mark genes that have low relative skeletal muscle expression or are annotated with known diseases. doi:10.1371/journal.pone.0110888.g001 Data Mining of Candidate Genes for Myopathies PLOS ONE | www.plosone.org of matching IA terms). The maximum total score a gene can receive is thus 3000 (1000 for GO score+1000 for PO score+1000 for IA score). To deal with MPO terms applied to murine orthologs, the predicted best human orthologs were selected using Manteia's ORTHO function after the ranking process. The ranked lists for each of the 9 disease groups, including annotation detailed in the next subsection, can be found in Table S2.
Additional filtering of ranked candidate genes using expression data and association with human diseases The ranked lists for each disease group include the known genes of the group, which were used to create the signatures, genes known to be associated with myopathies but implicated in other disease groups, and genes that are not listed in any disease group and thus represent potentially good candidate genes for myopathies. Among those, additional filtering was performed using tissue expression databases. Data from Illumina Body Map E-MAT513, established from mRNA-Seq of 16 human tissues, was downloaded for every gene, and genes with no expression in skeletal muscle or with an expression in skeletal muscle that was less than a cutoff of 10% of the maximum expression found in any other tissue were excluded. The rationale behind this filtering is that if a gene is expressed in a tissue other than skeletal muscle at a much higher level, one expects such a gene to be implicated in disorders primarily involving that tissue. The 10% cutoff was empirically determined due to the fact that all genes already implicated in myopathies have skeletal muscle expression levels above this cutoff. To deal with missing expression data and eventual heterogeneity in Illumina Body Map's expression database, genes ruled out by the 10% threshold and candidate genes within the 100 first positions in the rankings for each disease group were double checked using expression data from the Genotype-Tissue Expression Project (GTEx) [29].
For the Congenital Myasthenic Syndromes disease group, which includes diseases primarily related to neuromuscular junction protein defects, but also some peripheral nerve terminal protein defects, we decided to disregard the muscle expression filtering due to the fact that the implicated genes AGRN, CHAT, and CHRNE do not have significant skeletal muscle expression (they are instead expressed in the nerve terminal).
The lists of candidate genes after skeletal muscle expression filtration was further annotated with Online Mendelian Inheritance in Men (OMIM) data on existing human phenotype in the form of well-characterized diseases or syndromes, in order to easily identify genes biased by phenotype, such as SMN1, which results in a phenotype very similar to many myopathies, characterized by flaccid proximal limb weakness, but which gives rise instead to a motor neuron disease; or biased by interactions, as occurs to a number of carbohydrate metabolism genes that share common pathways to metabolic myopathies but cause instead inborn errors of metabolism without a muscle phenotype.

Myopathy groups are clustered by gene ontology and protein function
To identify novel candidate genes for myopathies, we established an integrated data mining approach aiming first to extract specific signatures for disease groups encompassing previously implicated genes, and then to use these signatures to search for additional matching genes in the human genome. As detailed in Figure 1 and the methods section, this approach consists of a weighted ranking of three main sets of data: gene ontology, human and mouse phenotypes ontologies, and ''interactions annotation'' incorporating pathways and protein motifs and complexes.
To test this approach and better visualize signature composition analysis, we first analyzed a training set that consisted in all myopathy-associated genes using the data mining system Manteia [28]. Figure 2 shows graphs with relationships between all known genes of the nine chosen disease groups. In particular, the combination of GO, PO and IA terms aggregate most genes that are part of the same myopathy group for metabolic myopathies, the congenital myasthenic syndromes, and the glycosylation components of congenital muscular dystrophies ( Figure 2A). Of note, the gene GFPT1, which causes a congenital myasthenic syndrome with tubular aggregates, has mainly relationships with genes in the metabolic myopathy cluster, presumably because it codes for an enzyme in the metabolism of glycoproteins. Another large cluster in the graph encompasses the main genes implicated in muscular dystrophies and congenital or myofibrillar myopathies, without subdivision, suggesting a strong overlap in the function of the related genes and potentially in the pathogenesis. This approach can thus retrieve several phenotypic and pathologic clusters. However, applying only the human phenotype ontology analysis generates a single large, highly connected graph ( Figure 2B), even when the threshold for representing an edge in the graph -number of matching HPO term between two genes -is increased or decreased, or when HPO term hierarchy is taken into account. This means that genes implicated in myopathies share a common hierarchy of phenotype ontology terms, e.g. with most genes annotated with muscle weakness or abnormal muscle physiology related terms. While they do not help separate genes into disease groups, HPO terms are important to help emerge genes with phenotype annotation associated to skeletal muscle. GO terms and IA terms, on the other hand, are responsible for the final clustering. Different myopathy groups appear using only GO terms ( Figure 2C), while IA terms, even considering a lower threshold of 5 terms shared between genes, create smaller clusters of genes that interact closely by sharing the same pathways, interactions complexes or motifs, such as constituents of collagen VI, genes responsible for the assembly of nicotinic cholinergic receptors, or conglomerated proteins involved with the sarcomere ( Figure 2D). Only the combination of the different GO, PO and IA terms reaches the most precise clustering.

Characterization of disease groups via biological processes annotation
We next aimed to extract specific signatures for each disease group, classified based on the Gene Table of Neuromuscular Disorders [1]. Statistical analysis of known genes was conducted for each disease group.
GO terms include three types of ontologies: cellular components indicate the localization of gene products; molecular function refers to the normal roles of genes at the molecular level; and biological processes describe the higher-order roles of genes from a biological perspective. Four main general skeletal muscle-related biological processes were extrapolated from the hierarchy of GO terms: muscle contraction, calcium homeostasis, muscle development, and muscle intracellular organization ( Table 2). Analysis of the breakdown of biological process-related GO terms that make up the signatures of different disease groups reveals differences in the implicated skeletal muscle processes and hints on other important biological processes that do not primarily involve the skeletal muscle.
Vacuolar myopathies and myofibrillar myopathies did not receive in their signature GO terms associated with biological processes, because the training set genes for these groups were annotated with heterogeneous terms that did not attain statistical significance. Metabolic myopathies and congenital myasthenic syndromes inferred biological processes were, as expected, not primarily muscle related, but had mostly to do with glycogen metabolism and neuromuscular junction, respectively. Congenital muscular dystrophies, while having two GO terms associated with muscle development, were also primarily annotated with non muscle-specific biological processes, especially protein glycosyla- Figure 2. Graph representation of relationships of known genes. All known genes for the different disease groups were concurrently analyzed for matching terms in different ontologies. Nodes represent genes, and edges between two given nodes are depicted when the number of terms shared by the two connected genes is greater than a certain threshold. Edge width is proportional to the number of terms shared between two genes, and node size and color in a scale from green (lowest) to red (highest) is proportional to the number of associations of a gene in the graph. Closely related genes appear clustered together, and hubs in the graph appear centrally located. A: graph for combined terms from Gene Ontology (GO), Human Phenotype Ontology (HPO) and Interactions Annotation (IA), with a threshold of 30 matching terms. The cluster with a yellow background includes genes implicated in metabolic myopathies, the one with a red background groups congenital muscular dystrophy genes, and the cluster with a gray background represents genes associated with congenital myasthenic syndromes. B: graph for HPO terms with a threshold of 20 matching terms. C: graph for GO terms, with a threshold of 10 matching terms. Background colors correspond to clusters represented in A. D: IA terms with a threshold of 5 matching terms. The gray background highlights a cluster with gene that code subunits of cholinergic receptors, implicated in congenital myasthenic syndromes, the green one groups components of collagen VI, and the cluster with a blue background links elements of the contractile apparatus. doi:10.1371/journal.pone.0110888.g002 tion. Myotonic myopathies and ion channel muscle diseases take the larger contribution from calcium homeostasis-related terms. Muscular dystrophies mainly involve muscle intracellular organization terms, but also receive some contribution from muscle development and muscle contraction terms. Other important biological processes for muscular dystrophies are associated with heart muscle contraction and development. Finally, for congenital myopathies, muscle development and calcium homeostasis seem to be the most significant processes, but muscle contraction-related terms also play a role, as well as processes not specific for skeletal muscle, such as catabolism of nucleotides -these appear enriched due to the association of DNM2 to the catabolism of GTP, as well as MYH7 and TPM2 to the catabolism of ATP.
Training set genes appear at the top of the ranked lists of the disease groups We used the signature specific to each disease group to screen the whole set of human genes and identify candidate genes for myopathies. The breakdown of the gene score for these training set genes shows that a similar contribution of the different term domains can be consistently found throughout the various disease groups. GO score, PO score and IA score respond for approximately 30 to 45%, 40 to 55%, and 10 to 15% of the gene score, respectively. Table 3 shows the ranked lists of known genes for three disease groups (Congenital Myopathies, Muscular Dystrophies and Metabolic Myopathies), along with each gene score and breakdown of partial scores. Table S3 shows similar additional data for all disease groups.
As expected, genes already known to be mutated in the various disease groups, which were used as the training set to create the mining signatures, appear at the top of the ranked lists of data mining. Considering congenital myopathies, out of the 22 genes chosen as the training set, 19 genes appeared in the data mining, while genes CCDC78, KBTBD13, and KLHL40 did not have annotation in the databases used at the time of this work. Thirteen of these genes were ranked within the first 100 genes, a coverage of 13/19 (68.4%). The muscular dystrophy group had 31 out of 34 genes of its training set appearing in the data mining list, and of these 23 were found within the first 100 ranked genes (79.3%). In the metabolic myopathy disease group, all 19 genes of the training set were ranked, and 18/19 were found within the top 100 genes in the rank (94.7%). Outliers among the known genes are mostly poorly annotated genes, and genes with no score are actually not annotated at all (see discussion). Thus, the high ranking of most previously implicated genes supports the signature choice having adequately defined the disease group.

Proposed candidate genes after filtration
A number of candidate genes sharing disease group signatures with known myopathy genes are barely expressed in skeletal muscle or sometimes mutated in other diseases not affecting skeletal muscle. We thus added filtering steps based on tissue expression and known implication in diseases (see methods for details). Table 4 shows the top 8 ranked genes for each disease group after filtration on skeletal muscle expression and absence of link with diseases in OMIM (Online Mendelian Inheritance in Men, omim.org) database. Table S2 lists the full ranked list of genes for each disease group without filtration, but annotated with skeletal muscle expression and OMIM diseases, and can be linked to NGS filtering pipelines to help prioritization of novel gene discovery, as shown in the discussion. In the following paragraphs, we discuss a few genes found as candidates in some of the disease groups, to illustrate the connections between the integrated data mining results and evidence from the literature. Table 2. Composition of biological processes GO terms that make up the signature of each disease group.   Candidate genes for muscular dystrophies display strong links with muscle development, contraction and intracellular organization, expected subcomponents of skeletal muscle-related biological processes terms from the breakdown of GO terms. ITGB1 codes for a subunit of ubiquitous fibronectin receptors and has a number of suggested functions in different tissues. In skeletal muscle, it has been proposed as a possible target for myostatin in mice myoblast differentiation [30] and is also critical for the development of neuromuscular junctions [31]. TMOD1 encodes for tropomodulin, a protein that regulates tropomyosin and F-actin organization. Knockout mice present with age-dependent sarcomere misalignment and sarcoplasmic reticulum morphological defects [32]. MYL1 is involved with early differentiation of fast muscle cells [33] and TNNI1 codes for the slow-twitch skeletal muscle isoform of troponin I, which has yet to be associated with human diseases even though the fast-twitch isoform is responsible for a subtype of arthrogryposis and the cardiac isoform causes cardiomyopathy syndromes.
Candidate genes for congenital myopathies have a significant overlap with genes proposed for muscular dystrophies, for example for TMOD1 and TNNI1. Among those with a high rank in congenital myopathies are RYR3 and MYH1. RYR3 codes for a ryanodine calcium release channel with a low Ca 2+ sensitivity that has a physiologic role in the excitation-contraction coupling of neonatal skeletal muscles and is up regulated in steroid-associated muscle damage [34], while MYH1 is one of the adult skeletal muscle isoforms of myosin heavy chain that predominates in 2B myofibers. RYR3 high ranking is boosted by a strong contribution of calcium homeostasis terms, explaining why RYR3 received a similarly high score in the Myotonic Syndromes and Ion Channel Muscle Diseases groups, which also have a strong component of calcium homeostasis terms. Also in the group of Ion Channel Muscle Diseases, the gene CACNB1 encodes both the brain and skeletal muscle isoforms of the calcium channel beta subunit, and its loss in mouse is associated to a phenotype similar to that seen in mice with mutations in the known genes CACNA1S or RYR1 [35].
Within congenital muscular dystrophies, the gene B3GALNT2, ranked in the 97th place out of 4841 genes with annotation for this group's signature, was recently found to be associated with hypoglycosylation of alpha-dystroglycan and a congenital muscular dystrophy phenotype in humans [18]. Two other genes, GMPPB, ranked in 225th, and B3GNT1, ranked in 479th, were also implicated in a form of congenital muscular dystrophy with hypoglycosylation of alpha-dystroglycan and Walker-Warburg disease, respectively [36,37]. These genes had not been used in the training set of genes for congenital muscular dystrophies, and have since been annotated in OMIM, but their high placement in the ranking list validate the proposed data mining strategy and subsequent filtering steps.

Candidate genes within genomic regions linked to myopathies
A number of neuromuscular diseases have mapped loci awaiting gene identification [1]. Matching the genomic positions of the top 100 candidate genes of each disease group with such mapped loci reveals some interesting candidates ( Table 5).
The gene XIRP1, matching the locus for hyalin body myopathy and congenital muscular dystrophy with joint hyperlaxity, was originally studied in relation to murine cardiac morphogenesis and later shown to bind skeletal muscle actin in in vitro assays [38]. Its product, the Xin protein, is skeletal muscle-specific and has recently been put forward as a potentially useful biomarker of muscle damage, which can be used to monitor disease progression Table 3. Cont.   and treatment effects in myopathies [39]. OBSCN encodes obscurin, a giant sarcomeric signaling protein similar to titin, which has a suspected role in myofibrillogenesis. It is also involved in dystrophin localization and maintenance of sarcolemma integrity [40], and is proposed here as a candidate for congenital muscular dystrophy with merosin deficiency (MDC1B). An additional candidate gene mapped in the linked region is GALNT2, a glycosylating enzyme similar to B3GALNT2 recently found mutated in another form of congenital muscular dystrophy [18], and also involved in the O-glycosylation of peptides in the Golgi apparatus.
Although not directly analyzed in this work, malignant hyperthermia susceptibility regions encompass CACNG1 and CACNA2D1, which are associated with calcium homeostasis and calcium channels, are highly ranked for Ion Channel Muscle Diseases, and are thus interesting candidates. CACNA2D1, specifically, has been suggested at least as a modifier of hyperthermia susceptibility in association to other genes [41]. These genes have been excluded in a limited number of families not linked to RYR1 mutations [42], results which may be revisited with the advent of NGS data. Additionally, another candidate gene, CACNB1, has no associated human disease. However, CACNG1, CACNB1 and CACNA2D1 encode for subunits of the DHPR calcium channel, which is in direct contact and regulating RYR1 in skeletal muscle, and one mutation in the channel subunit CACNA1S of DHPR was linked to malignant hyperthermia [43].
The CALR and AP1M1 genes both map to 19q13, the locus associated to autosomal muscular dystrophy with rimmed vacuoles. In a recent work, the product of the CALR gene, calreticulin, has been shown to localize in cardiomyocyte mitochondria, and its content increases in mouse models with dilated cardiomyopathy [44]. Strikingly, calreticulin was found to be highly expressed in GNE myopathy, a distal myopathy associated with rimmed vacuoles [45]. Also in distal myopathies with rimmed vacuoles, though not necessarily GNE myopathy, adaptin related-proteins subunits, which are normally not marked in the immunohistochemistry of normal muscle, appear inside or on the rims of vacuoles. The AP1M1 gene codes for the mu subunit of adaptin related-proteins [46].

Discussion
In this study, an integrated data mining strategy was used to cluster and rank genes with known or potential importance for skeletal muscle, and to provide candidate genes for myopathies and some related diseases. Results from the clustering and ranking highlight pathological pathways specific for disease groups. The list of candidate genes was further filtered based on expression data and association with other diseases, and the ensuing identification of mutations in high-ranked genes for congenital muscular dystrophies (B3GALNT2, GMPPB and B3GNT1) illustrated the validity of this approach.

Gene clustering and ranking are dependent on database annotations
Proposed genes in the final ranked list have gene scores with a major contribution of GO and IA terms, and eventual contribution of MPO terms. Thus, they represent genes that have mostly functional links with known myopathy genes (IA terms and GO term ontologies for biological processes and molecular functions), but also some degree of product colocalization in the muscle cell, as expected from matching cellular component-related GO terms. When available, data on altered skeletal muscle function in mouse models also tend to contribute to higher scores for proposed candidate genes.
Database annotation can vary from one gene to another, as it is dependent on the history of research for each gene, including both the date when the gene was discovered and the amount of effort spent for its functional characterization. In addition, animal models are generally phenotyped with a specific organ system in mind.
To illustrate the effects of incomplete annotation, the genes TRAPPC11 and TNPO3, recently implicated in muscular dystrophies, were used as components of this disease group training set, but did not impact the results of the gene ranking due to their poor database annotation. Likewise, they were themselves not captured by the signature used for the Muscular Dystrophies disease group. TRAPPC11 does not appear in the ranking, as it was annotated with only two GO terms that are not significant for muscular dystrophies ("vesicle-mediated transport" and "Golgi apparatus"), it has no annotation for pathways or phenotypes, and its two protein motifs are unique in the genome. Annotation biases also account for higher placements of better-annotated genes that have some kind of overlap with myopathy genes. Such is the case for motor neuron disease-associated genes, which give rise to human and mouse phenotypes that present some degree of phenotypic overlap with myopathies and tend to share many HPO or MPO terms with myopathy phenotypes. In the ranked list of muscular dystrophies, high scores with a predominant contribution of HPO terms were given to the genes SMN1, SMN2, ALS2, IGHMBP2 and AR. These genes are linked to different types of motor neuron disease, which ultimately manifest with muscle weakness and atrophy. In silico approaches need thus periodic revisits to adjust candidate lists based on association of new genes that impact training sets and discovery of new pathways or interactions that change corresponding database annotation, such as the recently published interactome of skeletal muscle proteins centered on proteins that cause limb-girdle muscular dystrophies [47].

Possible insights into pathological mechanisms
The integrated data mining identified gene signatures revealing common function within specific myopathy groups or between groups, and highlighting pathological mechanisms.
Known and candidate genes for metabolic myopathies, congenital myasthenic syndromes, myotonic syndromes, and ion channel muscle diseases define distinctive functions for each disease group (Table 2). Highly ranked genes for congenital myasthenic syndromes are associated to function not primarily linked to skeletal muscle but point as expected to the neuromuscular junction. The cellular basis of myotonic myopathies and ion channel muscle diseases consists in the alteration of ion homeostasis. Additional genes contributing to glycogen metabolism were identified as good candidate genes for the metabolic myopathies.
Muscle development, muscle contraction and calcium homeostasis are key pathways linked to congenital myopathies; indeed this myopathy group presents generally at or before birth, and is characterized by histological hallmarks reflecting alteration and aggregation of proteins implicated in muscle contraction (nemaline myopathies) or due to primary defects in the excitation-contraction coupling (core myopathies and potentially the centronuclear myopathies).
Muscular dystrophies mainly involve muscle intracellular organization terms that reflect the structural importance of most proteins already reported mutated. However, other pathways may have been overlooked because the way the first genes were discovered; once the DMD gene was found, investigators started seeking mutations on genes from the dystrophin-glycoprotein complex. Of note, based on the terms breakdown and candidate genes identified, muscular dystrophies may have a larger contribution from the contractile apparatus than previously assumed, which would bring this disease group closer to congenital myopathies.

Allelic diseases
The integrated data mining reveals or confirms allelic diseases. Indeed, while proposed genes for metabolic myopathies or myasthenic syndromes are rather group-specific, a larger overlap occurred between congenital myopathies and muscular dystro-phies than what was expected from the analysis of overlap between these groups' training sets. While only 2 genes out of 34 muscular dystrophy training set genes also appeared among the 22 congenital myopathy training set genes (TTN and TRIM32), the first positions on the ranked lists of candidate genes after filtering for known diseases encompass a large overlap of genes: 5 out of the top 8 candidate genes for muscular dystrophies are within the top 8 for congenital myopathies (Table 4), and 33 out of the top 50 candidate muscular dystrophy genes are also within the top 50 candidate genes for congenital myopathies (Figure 3 and Table  S2). Overlaps are also substantial between both these lists and the one for myofibrillar myopathies, but in this case the overlap was expected as the training set for myofibrillar myopathies, despite being small (7 genes), included 2 genes also associated to muscular dystrophies and 1 gene associated to congenital myopathies. On the other hand, while the training sets of muscular dystrophies and congenital muscular dystrophies overlap with a significant share of 8 genes, only 3 genes within the top 50 candidate genes is the same for both groups. The reasons for these results lie in the signature of the disease groups: the breakdown of biological processes terms (depicted in Table 2), which represent the larger share of GO terms, is more comparable between congenital myopathies and muscular dystrophies, with similar contributions of terms involving muscle contraction and development, as opposed to the absence of resemblance between these disease groups and the congenital muscular dystrophies breakdown, which is enriched with mainly non-skeletal muscle-related terms, especially glycosylation. Taken together, gene clustering and candidate genes retrieval suggest that mutations in the same genes will eventually be linked to both muscular dystrophies and congenital myopathies. Table 4 presents the top 8 ranked genes after excluding genes with known disease annotation. Genes with known disease annotation (listed in Table S2) might still be good candidate genes for myopathies, considering that phenotypic variability is more a rule than an exception for known myopathy genes. This is the case for genes linked to both myopathy and cardiomyopathy. It is thus expected that known cardiomyopathy-associated genes may be found associated with a skeletal muscle phenotype. Such phenotypic variability may even transcend the realm of muscle alteration. DNM2, for example, is associated both with centronuclear myopathy, a congenital myopathy, and Charcot-Marie Tooth disease, a hereditary neuropathy [48,49]. LMNA, in addition to multiple myopathic phenotypes, also causes Charcot-Marie Tooth disease or progeria [50,51], and SYNE1 can cause one type of Emery-Dreifuss muscular dystrophy, a dilated cardiomyopathy syndrome, a form of autosomal recessive arthrogryposis, and autosomal recessive spinocerebellar ataxia  [52][53][54][55]. The origin of the variability may stem from varying impacts of mutations in different protein domains. For example, in DNM2, mutations giving rise to a centronuclear myopathy phenotype are enriched in the interface between the middle domain and the pleckstrin homology domain, while mutations implicated in Charcot-Marie-Tooth disease tend to cluster in other parts of the pleckstrin homology domain [56]. After we carried out this work, a form of vacuolar myopathy was associated to CLN3, implicated in neuronal ceroid lipofuscinosis, ranked 343rd for vacuolar myopathies [57]. In addition, LRP4, a gene associated with Cenani-Lenz syndactyly syndrome and sclerosteosis, and ranked high (54th) for congenital muscular dystrophies, was implicated in a patient with a CMS disease [58]. We therefore suggest that genes with disease annotation in the ranked lists should be considered with caution in analyzing NGS results, but not a priori excluded when using filtering pipelines.

Example of the usage of the ranked lists
We found the ranked lists to be helpful in our own analysis of exome data to prioritize the scrutiny of potential novel genes implicated in myopathies. The gene ranks in the Excel file sheets in Table S2 can be easily used as additional genomic annotation.
Consider this exome of a sporadic patient affected with nemaline myopathy, a congenital myopathy, from unaffected parents (Table S4). Out of an initial 86,333 variants called in the exome data, 250 remained after variants filtering to exclude purported sequencing errors and polymorphisms. The first variants we analyze closely are the ones found in genes with known implication in myopathies. The heterozygous variants in DMD and CACNA1S were subsequently found in the unaffected father, while the missense variant in ANO5, associated with autosomal recessive limb girdle dystrophy 2L, would require an association to a second mutation to cause disease. We can thus exclude the implication of these known genes.
We next focus on the candidate genes for congenital myopathies (Table 4 and Table S2). If a gene has more than 10% expression in skeletal muscle and is not associated to a disease, it receives a flag as a ''candidate''. Only 29 variants, a significant reduction from the original 250, survive this additional filtering and are shown in Table 6.
The gene with the highest rank was PXN, which codes for paxillin, a protein believed to have a function related to integrins and cytoskeletal localization in multiple tissues, skeletal muscle included [59]. The second best gene was MYBPH, which codes for myosin-binding protein H, the second most abundant protein of the family of myosin-associated proteins [60]. Except for its cloning, not much is known about its specific function. The third gene, PPARGC1A, has regulatory functions on glucose and fat oxidation in muscle cells and protects skeletal muscle fibers against atrophy in mouse models [61]. However, in all these genes, single mutations were found in a heterozygous state, thus a putative dominant negative effect or haploinsufficiency would be required for a pathogenicity call. The next gene in the list, SYNPO, produces synaptopodin, a protein whose name stems from its involvement in synapses involving dendritic spines, in addition to renal podocytes [62]. In spite of its name, skeletal muscle is actually the tissue where it is most strongly expressed. Furthermore, synaptopodin directly binds actin, one of the proteins known to be involved in nemaline myopathy. The missense variant found in this gene was homozygous, in a highly evolutionary conserved position, with prediction of pathogenicity in multiple tools, and was Sanger confirmed to be homozygous in the patient and heterozygous in her parents. We believe SYNPO is the best candidate gene for this family, based on a recessive scenario. While we cannot discard the other genes, the ranking of candidate genes based on our integrative data mining quickly highlights the best genes to proceed further in functional analysis.

Conclusions
The above integrated data mining approach was successfully used to retrieve both specific signatures for different myopathy groups and to uncover and rank interesting candidate genes for myopathies. Recent discoveries of gene implications that were correctly identified by the disease group's signature validated this approach. In silico approaches allow for systematic, but modifiable criteria to be used in generating ranked candidate lists and have the added benefit of automation, whereby such lists can be updated on the fly as new knowledge is incorporated in genomic databases.
Signatures and candidate genes highlighted both potential common pathological mechanisms and overlap between several disease groups. In addition, the ranked candidate gene lists are helpful to prioritize functional validation of filtered variants from overwhelming NGS data.