Evaluation of Candidate Genes from Orphan FEB and GEFS+ Loci by Analysis of Human Brain Gene Expression Atlases

Febrile seizures, or febrile convulsions (FEB), represent the most common form of childhood seizures and are believed to be influenced by variations in several susceptibility genes. Most of the associated loci, however, remain ‘orphan’, i.e. the susceptibility genes they contain have yet to be identified. Further orphan loci have been mapped for a related disorder, genetic (generalized) epilepsy with febrile seizures plus (GEFS+). We show that both spatially mapped and ‘traditional’ gene expression data from the human brain can be successfully employed to predict the most promising candidate genes for FEB and GEFS+, apply our prediction method to the remaining orphan loci and discuss the validity of the predictions. For several of the orphan FEB/GEFS+ loci we propose excellent, and not always obvious, candidates for mutation screening, with the aim of gaining a better understanding of the genetic origin of the susceptibility to seizures.


CNS-related Mendelian disorders
For the leave-one-out cross validation (LOOCV), we used the same information on human Mendelian disease phenotypes as in our previous study [6] on the Allen Mouse Brain Atlas [1], so that we could compare the results from the two studies. We obtained data from OMIM [2,3] on 17 June 2009, considering only the 749 phenotype entries of known molecular basis (OMIM symbol: #) containing the term 'central nervous system' in their Clinical Synopsis section. We downloaded the lists of known associated disease genes (mim2gene) from Entrez Gene [3] on 16 June 2009. Between 1 and 25 genes (on average 1.3 genes) were associated with each OMIM phenotype ID; only six phenotypes (<1%) had 10 or more associated genes.

Similarity of human disorders
To measure the pairwise similarity of OMIM phenotype entries, we processed the textual descriptions of all OMIM phenotype entries (not limited to CNS-related disorders) using MimMiner, essentially as described by van Driel et al. [4].
MimMiner scores are normalized and range from 0 (unrelated) to 1 (highly related or identical). Since it was established that similar phenotypes can be identified with reasonable accuracy considering a minimum score of 0.4 [4], we used the same threshold for our work.
Using a notion of phenotype similarity allows us to select reference genes even for phenotypes whose molecular basis is so far unknown, and to increase the number of reference genes for phenotypes of known molecular basis, by taking reference genes known to be involved in similar phenotypes.

Leave-one-out
We performed large-scale LOOCVs for the spatially mapped expression data from the HBA and for the GEO dataset as previously described [6] and briefly summarized as follows: For each known gene-phenotype (g-p) pair, we constructed an artificial locus comprising g and the N closest genes on either side of g on the chromosome (thus containing 2N+1 genes centered around g, or fewer for g close to a chromosome end). We chose four representative sizes for artificial loci (N=50, N=100, N=200, and N=400, with a maximum number of 101, 201, 401, and 801 positional candidates, respectively) and determined the lists of positional candidates (in terms of Entrez gene IDs) within these loci from the UCSC Genome Browser [5].
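The construction of an artificial locus can be illustrated with a minimal Python sketch. It assumes the genes of one chromosome are available as a list sorted by genomic position; the function name `artificial_locus` and the toy gene identifiers are hypothetical and not taken from the original analysis.

```python
from typing import List


def artificial_locus(chrom_genes: List[str], g: str, n: int) -> List[str]:
    """Return g together with up to n neighbouring genes on each side.

    chrom_genes: gene IDs of a single chromosome, sorted by position
    (hypothetical input format, assumed for this sketch).
    """
    i = chrom_genes.index(g)
    start = max(0, i - n)                    # truncate at the chromosome start
    end = min(len(chrom_genes), i + n + 1)   # truncate at the chromosome end
    return chrom_genes[start:end]


# Toy chromosome with 1000 genes (illustrative identifiers only).
genes = [f"gene{k}" for k in range(1000)]
locus_mid = artificial_locus(genes, "gene500", 50)   # full locus, 2N+1 genes
locus_edge = artificial_locus(genes, "gene10", 50)   # truncated near the end
```

For a gene far from either chromosome end the locus holds 2N+1 genes; for a gene near an end it is correspondingly smaller.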
As candidate genes we considered those genes within the artificial locus for which expression data were available (simulating an 'orphan' locus obtained by linkage analysis or comparable techniques), applied the prioritization method as described in [6], and recorded the relative rank/position $R^{rel}_g$ of the phenotype-causing gene g among the prioritized positional candidates from the artificial locus:
$$R^{rel}_g = \frac{R_g}{|C_p|}$$
where $R_g$ is the rank of g within the prioritized genes and $|C_p| \le 2N+1$ is the number of 'effective' candidates for which expression data were available. We limited the analysis to gene-phenotype pairs whose artificial loci had $|C_p| \ge 50$, reasoning that a lower number of effective candidate genes would introduce an undesired bias by automatically placing the true phenotype-causing gene in higher ranks. We also required g itself to have expression data (i.e. $g \in C_p$).
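The following Python sketch shows how a relative rank could be recorded for one artificial locus. It assumes that the relative rank is the rank of g divided by the number of effective candidates, and that the prioritization yields one score per gene with higher scores ranking first. The `scores` dictionary is a hypothetical stand-in for the output of the method in [6], not its actual interface.

```python
from typing import Dict, List, Optional

MIN_EFFECTIVE_CANDIDATES = 50  # minimum |C_p| stated in the text


def relative_rank(locus: List[str], scores: Dict[str, float],
                  g: str) -> Optional[float]:
    """Relative rank of g among candidates with expression data.

    Returns None when the locus is excluded from the analysis.
    Assumption of this sketch: a higher score means a better candidate.
    """
    # Effective candidates: locus genes for which a score (expression) exists.
    effective = [c for c in locus if c in scores]
    if g not in effective or len(effective) < MIN_EFFECTIVE_CANDIDATES:
        return None
    ranked = sorted(effective, key=lambda c: scores[c], reverse=True)
    rank = ranked.index(g) + 1  # ranks are 1-based
    return rank / len(effective)


# Toy locus of 100 genes, of which only the first 60 have expression data.
locus = [f"g{k}" for k in range(100)]
scores = {f"g{k}": float(k) for k in range(60)}
rel = relative_rank(locus, scores, "g57")            # third best of 60
too_small = relative_rank(locus[:40], scores, "g5")  # only 40 candidates
no_data = relative_rank(locus, scores, "g80")        # g lacks expression data
```

Under this definition, values close to 0 indicate that the true disease gene was placed near the top of the prioritized list.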
As reference genes for the LOOCVs we took either all genes known to be involved in the given OMIM phenotype p (excluding g itself), to simulate phenotypes of already partly known molecular basis, or all genes known to be involved in OMIM phenotypes similar to p (excluding p itself), to simulate phenotypes of so far unknown molecular basis. For the simulation of phenotypes of known molecular basis, at least two known disease genes were required (one taken as candidate and one as reference gene).
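The two ways of choosing reference genes can be sketched in Python as follows. The data structures (`genes_of`, `similarity`) and function names are hypothetical. The threshold of 0.4 is the MimMiner cutoff given above. Removing g from the reference set in the similar-phenotype mode is an assumption of this sketch, since the text only states that p itself is excluded.

```python
from typing import Dict, Set, Tuple

SIMILARITY_THRESHOLD = 0.4  # MimMiner cutoff used in the text


def refs_known_basis(p: str, g: str,
                     genes_of: Dict[str, Set[str]]) -> Set[str]:
    """Reference genes: other genes already known for phenotype p."""
    return genes_of.get(p, set()) - {g}


def refs_similar_phenotypes(p: str, g: str,
                            genes_of: Dict[str, Set[str]],
                            similarity: Dict[Tuple[str, str], float]
                            ) -> Set[str]:
    """Reference genes: genes of phenotypes similar to p, excluding p."""
    refs: Set[str] = set()
    for (a, b), score in similarity.items():
        if score < SIMILARITY_THRESHOLD or a == b:
            continue
        if a == p:
            refs |= genes_of.get(b, set())
        elif b == p:
            refs |= genes_of.get(a, set())
    # Assumption of this sketch: the left-out gene is never its own reference.
    return refs - {g}


# Toy data (illustrative identifiers only).
genes_of = {"P1": {"A", "B"}, "P2": {"C"}, "P3": {"D"}}
similarity = {("P1", "P2"): 0.55, ("P1", "P3"): 0.2}

known = refs_known_basis("P1", "A", genes_of)
similar = refs_similar_phenotypes("P1", "A", genes_of, similarity)
```

In the toy data, phenotype P3 contributes no reference genes because its similarity to P1 falls below the threshold.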