A Multi-Marker Genetic Association Test Based on the Rasch Model Applied to Alzheimer’s Disease

Results from Genome-Wide Association Studies (GWAS) have shown that the genetic basis of complex traits often includes many genetic variants with small to moderate effects whose identification remains a challenging problem. In this context, multi-marker analysis at the gene and pathway level can complement traditional point-wise approaches that treat the genetic markers individually. In this paper we propose a novel statistical approach for multi-marker analysis based on the Rasch model. The method summarizes the categorical genotypes of SNPs by a generalized logistic function into a genetic score that can be used for association analysis. Through different sets of simulations, the false-positive rate and power of the proposed approach are compared to a set of existing methods, and the approach shows good performance. The application of the Rasch model to the Alzheimer’s Disease (AD) ADNI GWAS dataset also allows a coherent interpretation of the results. Our analysis supports the idea that APOE is a major susceptibility gene for AD. Among the top genes selected by the proposed method, several could be functionally linked to AD. In particular, a pathway analysis of these genes also highlights the metabolism of cholesterol, which is known to play a key role in AD pathogenesis. Interestingly, many of these top genes can be integrated in a hypothetical signalling network.


Introduction
With the recent improvement of high-throughput genotyping technologies, the use of Genome-Wide Association Studies (GWAS) has become widespread in genetic research to identify significant associations between genetic markers such as Single Nucleotide Polymorphisms (SNPs) and complex phenotypes such as common diseases. GWAS generally yield results at the SNP level, that is, sets of SNPs associated with the disease. However, the vast majority of loci that have been identified for common diseases show modest effects and generally explain only a small part of the variance or heritability of the observed phenotype [1]. In a recent study of Body Mass Index (BMI), the associated markers explained only 0.84% of the variance, although it is considered that genetic factors should actually account for 40%-70% of the variance of BMI [2]. One explanation for the missing heritability is that the common analysis approach, assessing the effect of each SNP individually, is not well suited for the detection of small effects of multiple SNPs. Disease susceptibility is actually likely to depend on the cumulative effect of multiple variants in several genes interacting in functional pathways [3].
It is increasingly recognized that analyzing the combined association of multiple markers at the gene or pathway level may provide a complementary approach to the more common single SNP association approach, with several key benefits [4]. First, it incorporates a priori biological knowledge in the analysis: in Genetics, the gene is often considered as the unit of interest since the analyses of the functional mechanisms of a disease are generally based on genes and their products such as RNA or proteins [5]. Determining the genes associated with the disease opens the door to additional research such as targeting genes of interest for candidate-gene studies or replicate association studies. It also allows the consideration of biological information, such as pathways or protein interactions, in the analysis of GWAS [6]. For instance, enrichment analysis such as that performed by the method Gene Set Enrichment Analysis (GSEA) [7] aims to determine sets of genes involved in common biological processes or biological pathways. Such an analysis is possible through the use of functional information that is only available at the gene level. Second, as the number of genes or pathways is substantially smaller than the number of markers genotyped in GWAS, fewer hypotheses are tested, requiring less stringent multiple-testing correction [8]. Finally, by combining SNPs with modest associations, evidence of association at the gene or pathway level may emerge, even when the analysis of individual SNPs failed to identify any significant association.
In this context, the measure that summarizes the association between multiple SNPs and the trait of interest into a single statistic is a crucial step that raises several statistical issues. Among them, the number of SNPs considered and the impact of the possible Linkage Disequilibrium (LD) between them are often considered [4]. The most widely used approach is the minimum p-value of all the SNPs assigned to the set of SNPs, i.e. the p-value of the most significant SNP [9]. However, it focuses on the most significant SNP only, rather than using the information provided by all the SNPs simultaneously, which can be viewed as a limitation. In addition, when applied directly, it has an inflated false-positive rate as it does not account for the two statistical issues described above [10]. In order to correct for both the number of SNPs and the LD, a phenotype permutation procedure can be used [11]. But permutations are time consuming, particularly if we want to reach a sufficient level of precision on p-values. Over the years, a number of alternatives have been proposed, such as Fisher's statistic to combine p-values of association over a set of SNPs [12].
Here we propose an adaptation of the Rasch model as a novel statistical approach to evaluate the combined effect of multiple genetic variants. Named after Georg Rasch, the Rasch model is a mathematical framework initially proposed to analyze rating scales; it evaluates a latent variable that is not directly measurable (eg, disability, cognition or quality of life) from a set of categorical items. The Rasch model is increasingly used in many areas of application such as Psychometrics, Social Sciences, Education, and Clinical Trials [13], but has yet to be applied to Genetics. We believe that the application of the Rasch model to association studies offers a solution to the joint analysis of multiple genetic markers. Through different sets of simulations, the false-positive rate and power of the proposed approach are compared to a set of existing methods. By way of illustration, we also apply it to the Alzheimer ADNI GWAS data.

Introduction to the Rasch model
Some variables can be measured directly (eg, height and weight); other variables are measured indirectly by how they manifest (eg, disability, cognitive function, quality of life). Therefore, we need a method to transform the manifestations of these "latent" variables into numbers that can be taken as measurements [14]. Rating scales are a means to measure latent variables by a set of items, each of which has two or more ordered response categories that are assigned sequential integer scores.
For the analysis of rating scales, the Classical Test Theory is usually applied, whereby the item scores are summed to give a total score. However, this simple and natural approach has two main limitations [13]. First, scoring the items with sequential integers implies equal differences at the item level (differences between each response category are assumed to be equal) and at the summed score level (a change of one point implies an equal change across the range of the scale, no matter which item is concerned by this change). Consequently, such ordinal scores cannot provide us with a stable frame of reference in terms of the distance between individuals on the ability scale. Second, when applying the Classical Test Theory, the latent trait of interest is estimated by a summed score which is actually difficult to match to each single item in order to know what an individual can actually perform: individuals with the same summed score may not be able to achieve the same item task. To establish a reliable rating scale, the information of the relative difficulties of items which is actually lost in the summed score must be taken into account.
As a main alternative to overcome these limitations, the Item Response Theory assumes that the probability of a specified score of a person on an item is a function of the person's ability and the item difficulty [15]. In this formulation, $X_{ni} = x \in \{0, 1, \ldots, m_i\}$ is an integer random variable for item i, where $m_i$ is the maximum score, $\beta_n$ corresponds to the ability parameter of person n and $\tau_{ki}$ corresponds to the difficulty of obtaining the score k for item i. When the person's ability is high and the item difficulty is low, the probability of having a high score for that item increases.
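For reference, a standard way of writing such a probability with the symbols defined above is the partial-credit form below. It is given here as an assumption about the intended expression, with the convention that an empty sum (x = 0) equals zero:

```latex
\[
\Pr\{X_{ni} = x\} \;=\;
\frac{\exp\!\left(\sum_{k=1}^{x} (\beta_n - \tau_{ki})\right)}
     {\sum_{j=0}^{m_i} \exp\!\left(\sum_{k=1}^{j} (\beta_n - \tau_{ki})\right)},
\qquad x \in \{0, 1, \ldots, m_i\}.
\]
```

With $m_i = 1$ this reduces to the logistic form $\exp(\beta_n - \tau_{1i}) / (1 + \exp(\beta_n - \tau_{1i}))$, so a higher ability or a lower difficulty raises the probability of the higher score.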
The Rasch model constitutes a particular case of the Item Response Theory and can be viewed as applying a transformation to the total scores [16]. It has two useful properties. First, the Rasch transformation preserves the order of the raw scores while allowing the distance between individuals to be assessed, and not only the rank ordering. Second, both the item difficulty and person ability are defined on the same scale; if a person's ability is known, we can predict how that person is likely to perform on an item.
The Rasch model has several forms and extensions according to the data. The simplest form is the dichotomous Rasch model and corresponds to the situation where items have only two response categories (0 and 1). Specifically, the probability of a correct response is modeled as a logistic function of the difference between the person and item parameters, so that when the person's ability equals the item difficulty, the probability of score 1 for item i is 0.5. The polytomous Rasch model is a generalization of the dichotomous Rasch model [17]. Here, we specifically consider the Partial Credit model, which allows different difficulty parameters for different items [14].
The Rasch model is based on four assumptions: 1) in the model there is only one latent variable of interest, which is the focus of the measurement, and all items tap into this latent variable; 2) the total scores over an item or a person contain sufficient information for calculation of the parameters of the model; 3) for a person, the responses to different items are independent; 4) the relationship between the probability of a given score on an item i and the latent trait is described by a logistic curve. Based on these assumptions, the item difficulty parameters ($\tau_{ki}$) can be estimated by Conditional Maximum Likelihood; then the person's ability parameters ($\beta_n$) can be estimated by Maximum Likelihood.
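The behaviour of the dichotomous and Partial Credit forms can be illustrated with a few lines of base R. This is a minimal sketch that assumes the usual partial-credit parameterization with step difficulties; the helper names `pcm_probs` and `expected_score` and all numeric values are hypothetical.

```r
# Category probabilities under the partial-credit form (illustrative helper,
# not part of any package). `beta` is a person's ability; `tau` holds the
# step difficulties tau_1, ..., tau_m of one item.
pcm_probs <- function(beta, tau) {
  # Cumulative sums of (beta - tau_k); score 0 corresponds to an empty sum.
  eta <- c(0, cumsum(beta - tau))
  w <- exp(eta - max(eta))  # subtract the maximum for numerical stability
  w / sum(w)                # probabilities of scores 0, 1, ..., m
}

# Expected item score given a vector of category probabilities.
expected_score <- function(p) sum((seq_along(p) - 1) * p)

# Dichotomous case (one threshold): ability equal to difficulty gives 0.5.
p_dicho <- pcm_probs(beta = 1.2, tau = 1.2)

# A SNP-like item with three categories (0, 1, 2) and two thresholds,
# evaluated for a low-ability and a high-ability person.
p_low  <- pcm_probs(beta = -1, tau = c(-0.5, 1))
p_high <- pcm_probs(beta =  2, tau = c(-0.5, 1))

print(p_dicho)
print(round(rbind(p_low, p_high), 3))
print(c(low = expected_score(p_low), high = expected_score(p_high)))
```

The expected item score increases with ability, which is what allows the person parameter to act as a one-dimensional summary of the item responses.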

Application of the Rasch model to multi-marker genetic association
The Rasch model is a measurement model that has potential application in any context where the objective is to measure a trait or ability through a process in which responses to items are scored with successive integers. When dealing with bi-allelic SNPs with possible alleles a and A, a set of SNPs can be considered as a set of items with possible categories 0 (= aa), 1 (= aA or Aa) or 2 (= AA), assuming an additive effect, which is a reasonable hypothesis for complex traits, and analyzed with the polytomous Rasch model in order to summarize the information into one score. This score corresponds to the person's ability parameter defined previously. In summary, our approach takes the genotypes of a set of SNPs as input and applies the Rasch model to calculate one multi-marker Rasch genetic score per subject.
Once this score is estimated for each subject, its association with a trait of interest can be assessed within classical statistical inference models according to the trait of interest (linear for quantitative traits, logistic for binary traits), with the possibility to adjust for covariates such as population stratification or gender.

Implementation with R
Several software programs and R packages are available for Rasch model analysis such as ConQuest (https://shop.acer.edu.au/group/CON3), RUMM (www.rummlab.com.au), ltm (cran.r-project.org/package=ltm) and eRm (cran.r-project.org/package=eRm). Considering its flexibility and ease of integration into an analysis pipeline, we chose to use the eRm R package.
The multi-marker Rasch genetic score of each subject of a dataset of interest is obtained with a short R script based on eRm functions, where 'Geno' is a data matrix of genotypes coded by 0, 1 and 2, with subjects in rows and markers in columns. If 'Trait' is a binary disease trait coded by 1 for cases and 0 for controls, the association of the multi-marker Rasch genetic score with the disease can then simply be assessed with a logistic model:
> glm(Trait ~ score, family = "binomial")
If 'Trait' is a quantitative trait, the association of the multi-marker Rasch genetic score with the trait can be assessed in the same way with a linear model.
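As a minimal sketch of this step, the score could be obtained along the following lines. It assumes that the eRm functions `PCM()` and `person.parameter()`, and the `coef()` method for person parameters, behave as documented and return one value per subject. The wrapper name `rasch_score` and the simulated `Geno`, `Trait` and `Quant` objects are hypothetical and only serve the illustration.

```r
# Hypothetical wrapper around eRm: fit a Partial Credit model to a genotype
# matrix (subjects in rows, SNPs in columns, coded 0/1/2) and return one
# person parameter (the multi-marker Rasch genetic score) per subject.
rasch_score <- function(Geno) {
  stopifnot(is.matrix(Geno), all(Geno %in% c(0, 1, 2)))  # complete data only
  fit <- eRm::PCM(Geno)               # item parameters by conditional ML
  pp  <- eRm::person.parameter(fit)   # person parameters by ML
  as.numeric(coef(pp))                # assumed: one value per subject
}

if (requireNamespace("eRm", quietly = TRUE)) {
  set.seed(42)
  n <- 300
  # Simulated example data (not the ADNI data): 24 SNPs, allele frequency 0.3.
  Geno  <- matrix(rbinom(n * 24, size = 2, prob = 0.3), nrow = n)
  Trait <- rbinom(n, size = 1, prob = 0.5)   # binary trait unrelated to Geno
  score <- rasch_score(Geno)
  stopifnot(length(score) == n)

  # Binary trait: logistic model, as in the text.
  print(summary(glm(Trait ~ score, family = "binomial"))$coefficients)

  # Quantitative trait (simulated here): linear model.
  Quant <- rnorm(n)
  print(summary(lm(Quant ~ score))$coefficients)
}
```

Covariates such as gender or principal components of population structure would simply be added to the right-hand side of the model formula.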

Simulations
The performance of our Rasch-based multi-marker genetic association test is first evaluated in terms of false-positive rate and power, based on simulations over three scenarios of dependence between SNPs and varying levels of association. For each scenario, we consider:
-a binary disease trait (500 cases and 500 controls) of prevalence $K_p = 0.05$;
-a set of 24 SNPs including 12 disease susceptibility loci (DSL) simulated with relative risks ranging from 1 (no association) to 2 (strong association).
This simulation framework, detailed hereafter, follows principles widely used previously [18][19][20][21][22].
Scenario 1: SNPs are independent. The simulation model for one SNP is based on Wright's model [23] applied to a bi-allelic marker with alleles a and A having the frequencies $p_a$ and $p_A = 1 - p_a$. $p_0$, $p_1$ and $p_2$ are the frequencies of genotypes aa, aA/Aa and AA defined by the Hardy-Weinberg proportions:
$$p_0 = p_a^2 + F p_a p_A, \qquad p_1 = 2 p_a p_A (1 - F), \qquad p_2 = p_A^2 + F p_a p_A$$
where F is the consanguinity coefficient. This coefficient can indicate a deficit (F > 0) or conversely an excess (F < 0) of heterozygotes. Here, we consider F = 0, so that the locus is under the Hardy-Weinberg equilibrium. We then want to compute the genotype frequencies of the SNP for cases and controls, $p_{Di}$ and $p_{Hi}$ where i = 0, 1 or 2, using the disease prevalence $K_p$, the penetrances $f_0$, $f_1$ and $f_2$ of the genotypes and the mode of inheritance. The main modes of inheritance can be defined by considering the relative risks $RR_1 = f_1 / f_0$ and $RR_2 = f_2 / f_0$. Using $f_0 = K_p / (p_0 + RR_1 \times p_1 + RR_2 \times p_2)$, $f_1 = RR_1 \times f_0$, $f_2 = RR_2 \times f_0$ and the Bayes formulas, we can easily derive the desired frequencies:
$$p_{Di} = \frac{f_i \, p_i}{K_p}, \qquad p_{Hi} = \frac{(1 - f_i) \, p_i}{1 - K_p}, \qquad i = 0, 1, 2$$
The 24 SNPs are simulated independently according to this model, the 12 non-associated SNPs with a relative risk of 1 and the 12 DSLs with a relative risk ranging from 1 to 2.
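The genotype frequencies used in Scenario 1 can be sketched in base R from Wright's proportions, the penetrances and Bayes' formula. The function names, the argument `F_coef` (the consanguinity coefficient F) and the chosen allele frequency and relative risks are hypothetical choices for illustration.

```r
# Genotype frequencies in cases and controls for one bi-allelic SNP.
# Genotypes are ordered aa, aA/Aa, AA; `p_A` is the frequency of allele A,
# `RR1` and `RR2` are the relative risks of aA/Aa and AA versus aa,
# `Kp` is the disease prevalence and `F_coef` the consanguinity coefficient.
genotype_freqs <- function(p_A, RR1, RR2, Kp = 0.05, F_coef = 0) {
  p_a <- 1 - p_A
  p <- c(p_a^2 + F_coef * p_a * p_A,     # p0 (aa)
         2 * p_a * p_A * (1 - F_coef),   # p1 (aA/Aa)
         p_A^2 + F_coef * p_a * p_A)     # p2 (AA)
  f0 <- Kp / (p[1] + RR1 * p[2] + RR2 * p[3])
  f  <- c(f0, RR1 * f0, RR2 * f0)        # penetrances f0, f1, f2
  list(population = p,
       cases    = f * p / Kp,                # P(genotype | diseased)
       controls = (1 - f) * p / (1 - Kp))    # P(genotype | healthy)
}

# Draw genotypes (0, 1, 2) for cases first, then controls.
simulate_snp <- function(freqs, n_cases = 500, n_controls = 500) {
  c(sample(0:2, n_cases,    replace = TRUE, prob = freqs$cases),
    sample(0:2, n_controls, replace = TRUE, prob = freqs$controls))
}

set.seed(1)
null_snp <- genotype_freqs(p_A = 0.3, RR1 = 1,   RR2 = 1)  # no association
dsl      <- genotype_freqs(p_A = 0.3, RR1 = 1.5, RR2 = 2)  # hypothetical DSL
geno     <- simulate_snp(dsl)
print(round(rbind(cases = dsl$cases, controls = dsl$controls), 4))
```

With both relative risks equal to 1, cases and controls share the population genotype frequencies, which corresponds to the null hypothesis of no association.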
Scenario 2: SNPs in moderate Linkage Disequilibrium. To account for SNPs in Linkage Disequilibrium (LD), our simulation model follows an approach based on the diplotype frequencies of real datasets. These frequencies are used as an empirical distribution of the range of possible diplotypes. First, 12 DSLs are simulated independently from the model described in Scenario 1. Then the remaining SNPs are completed based on a real dataset (here the chromosome 6 of the ADNI dataset described below) in order to generate one LD block of moderate magnitude (0.4-0.7) around each DSL. Simulating this way leads to genetic patterns similar to those found in real data and therefore allows us to finely control the level of LD between SNPs.
Scenario 3: SNPs in strong Linkage Disequilibrium. The simulation is the same as for Scenario 2 with the difference that we consider SNPs in strong LD (0.8-1).
Monte-Carlo estimation of false-positive rate and power. For each scenario and each level of DSL relative risk, we ran B = 1000 simulations in order to provide accurate Monte-Carlo estimates of false-positive rate and power. For each simulation we obtain a p-value of association for the simulated set of SNPs by applying our Rasch-based multi-marker association test. The false-positive rate is estimated by $\Pr_{H_0}(\text{p-value} \le \alpha)$ and the power is estimated by $\Pr_{H_1}(\text{p-value} \le \alpha)$, with $\alpha$ the significance level usually set to 5%. Consequently in our simulations, by placing ourselves under the null hypothesis $H_0$ of no association ($RR_2 = 1$), then under the alternative hypothesis $H_1$ of association ($RR_2 > 1$), we can respectively estimate both false-positive rate and power of our method by considering the same quantity:
$$\frac{\#(\text{p-value} \le \alpha)}{B}$$
where #() represents the number of p-values inferior or equal to $\alpha$.
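The Monte-Carlo estimate can be sketched as follows. For brevity, this example tests a single simulated SNP with a logistic regression rather than a Rasch score over a set of SNPs, and the allele frequencies in cases and controls are hypothetical stand-ins for the simulation model; the number of replicates is also reduced for speed.

```r
# Monte-Carlo estimate: proportion of simulated p-values at or below alpha.
rejection_rate <- function(pvalues, alpha = 0.05) mean(pvalues <= alpha)

# p-value of a single simulated SNP in n cases and n controls.
# The allele frequencies are hypothetical stand-ins for the simulation model.
sim_pvalue <- function(maf_cases, maf_controls, n = 500) {
  trait <- rep(c(1, 0), each = n)
  geno  <- c(rbinom(n, size = 2, prob = maf_cases),
             rbinom(n, size = 2, prob = maf_controls))
  fit <- glm(trait ~ geno, family = "binomial")
  summary(fit)$coefficients["geno", "Pr(>|z|)"]
}

set.seed(2015)
B <- 200  # kept small for speed; the text uses B = 1000
p_h0 <- replicate(B, sim_pvalue(0.30, 0.30))  # H0: no association
p_h1 <- replicate(B, sim_pvalue(0.38, 0.30))  # H1: association
fpr   <- rejection_rate(p_h0)
power <- rejection_rate(p_h1)
print(c(false_positive_rate = fpr, power = power))
```

The same function estimates the false-positive rate or the power, depending only on whether the p-values were generated under the null or the alternative hypothesis.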
Comparison to existing methods. We compared the performance of our Rasch-based multi-marker association test to four existing methods:
-minP [9] is the simplest and most naive method. It takes the most significant p-value of the set of SNPs considered as the p-value of the set. This method is obviously biased since it does not take the multiple testing and the dependence of tests into account. It is used here as a negative control and also because it is nevertheless the most widely used approach in practice.
-GATES [24] is a multi-marker association test that applies an extended Simes procedure to the SNP-level p-values. The p-values, computed by a standard linear trend test of association on each SNP, are combined while accounting for the correlation structure: significant p-values in high LD count less than significant p-values of independent SNPs.
-Fisher [12] is the well-known Fisher's combination of p-values. For m SNPs, the multi-marker test statistic is given by $T = -2 \sum_{i=1}^{m} \ln(p_i)$, which has a chi-square distribution with 2m degrees of freedom under the null hypothesis when the m tests are independent. An adjustment to dependent tests is also available and used here [25].
-SKAT [26] is the SNP-set Kernel Association Test. It aggregates individual test score statistics of SNPs in a set and efficiently computes the set-level p-value. It regresses the phenotype on all variants in the set while adjusting for covariates to account for population stratification, computes the p-value with the Davies method, and upweights rare variants.
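Two of the quantities above lend themselves to a short base R sketch: Fisher's combination for independent tests, and the false-positive rate of an uncorrected minP rule under independence. The function names and the example p-values are hypothetical, and the adjustment for dependent tests [25] is not implemented here.

```r
# Fisher's combination of m independent p-values: T = -2 * sum(log(p)),
# referred to a chi-square distribution with 2m degrees of freedom.
fisher_combined <- function(p) {
  stat <- -2 * sum(log(p))
  df   <- 2 * length(p)
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

# Uncorrected minP: probability that the smallest of m independent null
# p-values falls at or below alpha.
minp_false_positive <- function(m, alpha = 0.05) 1 - (1 - alpha)^m

snp_p <- c(0.04, 0.20, 0.11, 0.35)   # hypothetical SNP-level p-values
res <- fisher_combined(snp_p)
print(res)
print(minp_false_positive(m = 24))   # 24 independent SNPs, alpha = 0.05
```

Under independence, the uncorrected minP rule applied to 24 SNPs rejects about 71% of the time at a nominal 5% level, which illustrates why it serves as a negative control.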

Application to the Alzheimer ADNI GWAS data
Alzheimer's disease (AD) is the most common neurodegenerative disorder and affects more than 35 million people worldwide. It is characterized by brain atrophy reflecting neuronal and synaptic loss and the presence of amyloid plaques and neurofibrillary tangles, leading to a progressive deterioration of cognitive functions involving memory, reason, judgment and orientation [27]. AD pathogenic mechanisms are still unclear and the disease remains a condition without cure. According to age at onset, two main types of AD are differentiated: Early-Onset AD (EOAD, which generally appears before the age of 65, accounts for less than 10% of the AD population and has clear genetic determinants with mutations found in the APP, PSEN1 and PSEN2 genes) and Late-Onset AD (LOAD, which accounts for more than 90% of the AD population, generally appears after the age of 65 and has a complex etiology based on genetic and environmental factors) [28]. In recent years, several Genome-Wide Association Studies (GWAS) were performed to detect genetic loci associated with LOAD [29][30][31]. These studies support the hypothesis that APOE is a major susceptibility gene for LOAD [32]. In addition to APOE, markers within several other genes gave replicated evidence of association with LOAD [33]. The identification of these genes improves our knowledge of AD. For instance, CR1 has been shown to produce a protein that is up-regulated in AD [34]. Although these new loci have been found, some problems remain unsolved. First, to date none of these loci has proven accurate or sensitive enough to serve as a biomarker. Second, the replication of results is a tedious task in GWAS. To push the boundaries of current knowledge on AD, further studies on GWAS and statistical models are still necessary.
By way of illustration, we applied our Rasch-based multi-marker association test to the genes of the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu) [31]. The study population is made up of 359 cases and 226 controls, genotyped with an Illumina Human 610-Quad (620901 SNPs). A standard quality control process based on minor allele frequency, Hardy-Weinberg equilibrium, missingness and relatedness excluded 31 cases, 49 controls and 82071 SNPs [35]. The dataset was also reduced with a minimal loss of information by pruning with Plink (window size of 50 SNPs, shift of 5 SNPs at each step and threshold correlation coefficient of 0.2) [36]. Missing genotypes were imputed with a weighted k-Nearest-Neighbors method [37]. SNPs are considered attached to a gene if they are located within a distance of 20 kb around it. The curated dataset to analyze comprises 16514 genes. For each gene and each subject, a Rasch-based multi-marker genetic score is computed, and the association of this score with the disease is evaluated by a logistic regression model.
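The attachment of SNPs to genes within a 20 kb window can be sketched in base R as below. The function name, the column names (`snp`, `gene`, `chr`, `pos`, `start`, `end`) and the toy coordinates are hypothetical; real annotation files may be organized differently.

```r
# Assign SNPs to genes: a SNP is attached to a gene if it lies within
# `flank` base pairs of the gene boundaries on the same chromosome.
# Column names (snp, gene, chr, pos, start, end) are hypothetical.
map_snps_to_genes <- function(snps, genes, flank = 20000) {
  hits <- lapply(seq_len(nrow(genes)), function(i) {
    g <- genes[i, ]
    keep <- snps$chr == g$chr &
      snps$pos >= g$start - flank & snps$pos <= g$end + flank
    snps$snp[keep]
  })
  names(hits) <- genes$gene
  hits
}

# Toy annotation: four SNPs and two genes.
snps <- data.frame(snp = c("rs1", "rs2", "rs3", "rs4"),
                   chr = c(19, 19, 19, 7),
                   pos = c(100000, 125000, 200000, 125000),
                   stringsAsFactors = FALSE)
genes <- data.frame(gene = c("GENE_A", "GENE_B"),
                    chr = c(19, 7),
                    start = c(110000, 300000),
                    end = c(120000, 310000),
                    stringsAsFactors = FALSE)
gene_sets <- map_snps_to_genes(snps, genes)
print(gene_sets)
```

Each non-empty SNP set would then be scored and tested separately, so that one p-value is obtained per gene. A SNP lying within 20 kb of two genes is attached to both.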
The top genes identified by the Rasch analysis were integrated into a hypothetical signalling network. Protein-protein interaction data and functional findings were extracted from QIAGEN's Ingenuity Pathway Analysis (IPA, QIAGEN Redwood City, www.qiagen.com/ingenuity), manually analysed and supplemented by literature curation.

Simulations
False-positive rate and power for Rasch, minP, GATES, Fisher and SKAT across the three scenarios are given in Fig 1. The first observation is that minP has a strongly inflated false-positive rate, far above the expected 5% level and decreasing with the level of LD (0.691 for Scenario 1, 0.285 for Scenario 2 and 0.145 for Scenario 3). This observation was expected given the known drawbacks of the minP method, and validates our simulations. On the other hand, Rasch, GATES, Fisher and SKAT correctly control the false-positive rate at 5% (Fig 1a,1b,1c) and have a power that increases toward 100% with an increasing level of association with the disease (Fig 1d,1e,1f). In terms of false-positive rate, it is worth mentioning that in Scenario 3 (Fig 1c) Rasch is the closest to the 5% level (estimated at 0.043) whereas GATES and Fisher are more conservative (estimated at 0.039 and 0.034 respectively) and SKAT is more inflated. In terms of power, Rasch has the best performance on independent SNPs, followed by Fisher (Scenario 1, Fig 1d). Both methods perform similarly well when applied to SNPs with moderate and strong LD (Scenarios 2 and 3, Fig 1e,1f). GATES performs better than SKAT on independent SNPs but is limited in the LD block simulations (Fig 1d,1e,1f).

Application to the Alzheimer ADNI GWAS data
The association of 16514 genes with Alzheimer's disease (AD) was analyzed with our Rasch-based multi-marker association test. The standard QQ plot is given in Fig 2 and the 20 top genes are detailed in Table 1. First, our analysis supports the hypothesis that APOE on chromosome 19 is a major susceptibility gene for AD (p = 2.30e-8). It is well known that its ε4 allele is associated with an increased risk of developing Alzheimer's disease [38]. This result was expected and can be considered as a validation of our approach. Two other genes also markedly deviate from the QQ-line (Fig 2): ZNF398 (p = 9.71e-6) and AEN (p = 1.27e-5).
AEN encodes an apoptosis-enhancing nuclease, and apoptosis takes part in the neuronal loss observed in AD. We unfortunately did not find any indication about the possible functional implication of ZNF398 in AD. As we noticed a slight deviation from the QQ-line at 10^-3 (Fig 2), we also investigated the other 17 top genes. Several of them could be functionally linked to AD:
-PSMA5 is a proteasome subunit involved in the apoptosis process that takes part in the neuronal loss observed in AD [39]. PSMA5 was also found to interact directly with the AD-associated PSEN1 gene [40].
-FXN encodes the frataxin mitochondrial protein, which functions in regulating mitochondrial iron transport and respiration. Frataxin deficiency leads to mitochondrial dysfunction and oxidative damage that are at the origin of numerous neurodegenerative diseases such as Friedreich ataxia, Parkinson's disease and AD [41]. Interestingly, another top gene, VKORC1L1, is also involved in the regulation of oxidative stress and mediates vitamin K-dependent intracellular antioxidant function [42]. Remarkably, the blood level of vitamin K in APOE4 carriers is lower than in persons with other APOE genotypes, implying a hypothetical link between vitamin K deficiency and the pathogenesis of AD [43,44].
-Alzheimer's disease is sometimes named 'type 3 diabetes' because it occurs twice as frequently in diabetic patients [45,46]. Two top genes from our list (COL5A3 and WDTC1) were identified as potent modulators of insulin signalling [47,48]. Notably, vitamin K-dependent modification of osteocalcin was also shown to affect glucose homeostasis [49].
-NTM encodes a neural cell adhesion molecule that modulates neurite outgrowth and adhesion via a homophilic mechanism [50]. Some data indicate that NTM might directly bind to amyloid beta [51]. It has been associated with intelligence in a family-based association study [52] and lies at locus 11q25, which has been associated with AD [53].
-SEMA7A belongs to the semaphorin family involved in neuronal processes. Semaphorins and their downstream signaling components regulate synaptic physiology and neuronal excitability in the mature hippocampus, and these proteins are also implicated in a number of developmental, psychiatric, and neurodegenerative disorders [54]. Remarkably, SEMA7A not only enhances axon growth via beta1-integrin, but also possesses immunomodulatory activity and regulates endothelial functions [55,56]. Likewise, another top gene (ADAMTS12) is implicated in the control of immune responses and angiogenesis, which are deregulated in the course of Alzheimer's disease [57,58].
-Finally, LARP1 protein associates with the mTOR complex 1 (mTORC1) regulating global protein synthesis. Functional importance of mTOR signalling has been experimentally confirmed in Alzheimer's disease, and therapeutic targeting of this signalling module is considered as a promising strategy for developing neuro-protective treatments [59][60][61].
We also performed a formalized network analysis based on these top genes with the Ingenuity Pathway Analysis. The resulting network is given in Fig 3 and seems to highlight the metabolism of cholesterol, which plays a key role in AD pathogenesis [62][63][64]. Nine of the 20 top genes are connected in this network (in orange). Remarkably, most of them can be functionally linked to AD. For example, the integrin ITGB1 mediates the effect of SEMA7A on axon growth [65]. The integrins are modulated by the CASR gene, which forms a functional complex with the metabotropic glutamate receptor GRM5 [66,67]. It was recently shown that GRM5 is a co-receptor for cytotoxic Aβ oligomers bound to the prion protein PRNP [68].

Discussion
With the recent improvement of high-throughput genotyping technologies, the use of Genome-Wide Association Studies has become widespread in genetic research. However, the high dimension of the genetic data, the simultaneous testing of many markers and the necessity to account for the complex genetic structure of human populations are, among others, tricky issues that have raised doubts about the relevance of these studies' findings. The development of methods in Statistical Genetics is therefore very important to ensure that such studies are correctly conducted and to provide a proper interpretation of their findings, and this research has involved scientists from many disciplines. In this context, applying the Rasch model initially developed for psychometric data to the analysis of genetic data can be viewed as a new link between two areas of research that was not obvious before. Our novel statistical approach may be useful to complement at the gene or pathway level, the findings of significant associations made at the single SNP level.
Based on simulations, the approach showed good performance in different situations in terms of false-positive rate and power compared to other popular methods (minP, GATES, Fisher and SKAT). We noticed that the benefits of Rasch in terms of power were more important when applied to independent SNPs, which is coherent with one of the assumptions of the model, namely that the responses to different items are independent. As this loss of power is observed for all the methods when the level of dependence between the SNPs (Linkage Disequilibrium) increases, a Rasch model taking dependence into account could be of interest and further increase the power of the method.
The application of the Rasch model to the genes of the Alzheimer ADNI GWAS data allowed a coherent interpretation of the data. Our analysis supports the view that APOE is a major susceptibility gene for AD. Among the other top genes, several (AEN, ADAMTS12, PSMA5, FXN, NTM, LARP1, WDTC1, SEMA7A, VKORC1L1, COL5A3) can be functionally linked to Alzheimer's disease. A pathway analysis of these genes also highlights the metabolism of cholesterol, which is known to play a key role in AD pathogenesis. All these elements can be integrated in a hypothetical signalling network based on known protein-protein, functional and phenomenological interactions (Fig 4). Interestingly, this network could potentially be targeted by acamprosate, a drug that was first approved in 1989 and has since been widely used to treat alcohol dependence [69]. In combination with baclofen, acamprosate has recently been shown to be effective over a range of preclinical AD models [70], and has demonstrated promising results in a phase 2a clinical trial for AD [71].
Through this study, we encountered three limitations for the application of the Rasch model. First, it works on complete data without missing values. However, missing values are a common problem in most scientific research domains as they can arise from different sources such as mishandling of samples, low signal-to-noise ratio, measurement error, non-response or deletion of aberrant values. Consequently, the application of the Rasch model requires preliminary imputation of missing values. This imputation is a general and separate scientific topic that has been thoroughly discussed to date [72][73][74][75][76]. Second, in some particular cases the estimation of the Rasch model with the eRm R package does not converge and consequently does not provide any results. This happened for instance for 9 genes out of the 16514 genes analyzed in the ADNI GWAS data, and the reasons for that problem were not clear to us. Finally, applying a Rasch model necessitates access to individual-level genetic data. But often, only summary statistics are available for published GWAS. This is a real limitation for most of the existing multi-marker methods in order to correctly account for gene size and LD, although some authors have found a solution in using the genotype data from a reference panel such as the 1000 Genomes or the HapMap projects [77][78][79], which is not applicable here.
The application of the Rasch model also opens opportunities that were not yet considered here. The analysis of multiple markers is not limited to the gene level, and the Rasch-based multi-marker genetic association test could also be applied to the analysis of whole pathways. In addition, this genetic score could also be used as a predictor of the disease for the supervised classification of cases versus controls. The Rasch model is also suitable for the inclusion of rare variants, as most rare variant analyses focus on gene-level tests by collapsing the effects of all rare SNPs in a gene into a single test of association [4]. These applications deserve further investigation.
From a broader point of view, given the urgent need to understand how the thousands of loci that have been identified in genome-wide association studies contribute to the genetic basis of complex traits, the application of multi-marker methods at the gene or pathway level becomes an increasingly important approach for secondary analysis of GWAS data [80][81][82]. The main recognized benefits include the incorporation of biological knowledge, the reduction in multiple testing and the consideration of SNPs with modest effects. But this type of analysis also has clear limitations [4]. For instance, determining whether a particular SNP is part of, or regulates, a gene is a thorny problem. In addition, by focusing on SNPs that can be assigned to genes, analyzing GWAS data at the gene level also misses many disease-associated SNPs that cannot be linked to genes (such as SNPs in gene deserts). In that case, the delimitation of genomic regions made of contiguous SNPs and associated as a whole should also complement our understanding of the genetics of complex traits [20].