Evaluation of a Partial Genome Screening of Two Asthma Susceptibility Regions Using Bayesian Network Based Bayesian Multilevel Analysis of Relevance

Genetic studies indicate high number of potential factors related to asthma. Based on earlier linkage analyses we selected the 11q13 and 14q22 asthma susceptibility regions, for which we designed a partial genome screening study using 145 SNPs in 1201 individuals (436 asthmatic children and 765 controls). The results were evaluated with traditional frequentist methods and we applied a new statistical method, called Bayesian network based Bayesian multilevel analysis of relevance (BN-BMLA). This method uses Bayesian network representation to provide detailed characterization of the relevance of factors, such as joint significance, the type of dependency, and multi-target aspects. We estimated posteriors for these relations within the Bayesian statistical framework, in order to estimate the posteriors whether a variable is directly relevant or its association is only mediated. With frequentist methods one SNP (rs3751464 in the FRMD6 gene) provided evidence for an association with asthma (OR = 1.43(1.2–1.8); p = 3×10−4). The possible role of the FRMD6 gene in asthma was also confirmed in an animal model and human asthmatics. In the BN-BMLA analysis altogether 5 SNPs in 4 genes were found relevant in connection with asthma phenotype: PRPF19 on chromosome 11, and FRMD6, PTGER2 and PTGDR on chromosome 14. In a subsequent step a partial dataset containing rhinitis and further clinical parameters was used, which allowed the analysis of relevance of SNPs for asthma and multiple targets. These analyses suggested that SNPs in the AHNAK and MS4A2 genes were indirectly associated with asthma. This paper indicates that BN-BMLA explores the relevant factors more comprehensively than traditional statistical methods and extends the scope of strong relevance based methods to include partial relevance, global characterization of relevance and multi-target relevance.


Introduction
Asthma is a multifactorial disease influenced by wide range of genetic and environmental factors. Despite the numerous genetic studies carried out to gain insight into its complexity, our understanding of the nature and consequences of the genetic variations remains limited. Several potentially important genomic regions have already been identified but the causal alterations which could be responsible for the observed linkages are sometimes not discovered or more frequently the results of the association studies cannot be confirmed by other studies. One of the disputed genomic regions in asthma is the chromosome 11q13 where positive linkage to atopy was first reported by Cookson et al. [1]. Subsequent association tests carried out on this region revealed the importance of the genes MS4A2 (earlier FceRI-b or b chain of the high-affinity receptor for IgE), SCGB1A1 (earlier uteroglobin or Clara cell secretory protein (CC16)), glutathione Stransferase pi (GSTP1) and GPR44 (earlier CRTH2) in asthma and asthma related phenotypes [1,2,3,4].
Another frequently investigated asthma-related genomic region is the chromosome 14q22. Its attendance in the asthma development has been suggested by different whole-genome linkage studies [5]. Since then several genes have been shown to be involved in asthma-related processes in this region. These are genes for prostanoid transmembrane receptors, such as the prostaglandin-D2 receptor (PTGDR), the prostaglandin-E2 receptor (PTGER2) and the gene for galectin-3 (LGALS3) [6,7,8].
Despite the fact that associations of these regions and genes to asthma phenotypes were found earlier, none of them was confirmed subsequently by genome-wide association studies (GWAS) and meta-analyses [9,10,11].
This frequent phenomenon of genetic association studies has been explained by several theories, like the insufficient phenotypic descriptors (e.g. oversimplified case-control approach), insufficient observations (e.g. the negligence of rare genetic and epigenetic variants), ethnic differences between the study populations, confounding factors (population stratification and admixture), conventional design errors (small sample size, the multiple testing problem, small minor allele frequencies, curse of dimensionality, bias-variance dilemma, model complexity and significance, etc.), the inappropriate approach towards complex traits (i.e. disregarding the role of high-number of weak factors, gene-gene interactions, pathway-based interpretation), and the high redundancy of predictors (e.g. the discovery of non-causal, transitively associated descriptors) [12].
To overcome several of these limitations, probabilistic graphical models (PGMs) were proposed. Thanks to their ability to efficiently and accurately represent complex networks, PGMs represent powerful tools to dissect the genetic susceptibility of complex diseases. Bayesian networks are a popular class of PGMs, because they provide a clear, graphical semantics for representing a complete dependency-independency map of the domain. Therefore, the graphical representation presents a crucial advantage to allow the distinction between direct (causal) and indirect (due to LD) SNP-phenotype dependencies, thus ensuring a precise mapping of causal mutations. Additionally, BNs have the advantage of being able to efficiently deal with SNP-SNP interactions impacting the phenotype, a situation that is called epistasis.
Due to the high computational complexity and particularly because of the high sample (statistical) complexity, the learning of complete Bayesian network models is computationally prohibitive. To cope with complexity several 'local' approaches have emerged which limit their scope, and focus on the identification of strongly relevant variables, and possibly their interaction and causal structure. Thus, they omit a global and detailed characterization of relevance relations [13].
Other approaches apply a resampling scheme (i.e. bootstrap) or the Bayesian statistical framework to cope with the relatively small sample size, and to provide uncertainty and robustness measures for various model properties [14,15,16]. Although these methods have a much higher computational complexity than local causal identification approaches, their main advantage is that the modeling is not restricted, thus global characterization of the dependencies in the domain is possible. In fact, these approaches provide an overall characterization of the domain without the use of dedicated target variables [15].
Specifically within the Bayesian statistical framework the uncertainty of the validity of a discrete hypothesis given the observations is expressed by a probability, which is interpreted as an a posteriori belief in the hypothesis (e.g. above 0.5 it is more probable than not, see [17]). On the contrary, a p-value for a hypothesis in the frequentist approach is defined using the concept of repeatability and used indirectly through rejection to confirm a hypotheses.
Previously, we investigated the applicability of Bayesian networks using the Bayesian framework to learn the relevant variables with respect to a set of target variables. This could be seen as the Bayesian interpretation of the feature subset selection problem. We reported Monte Carlo methods to efficiently estimate posteriors over structural model properties, such as Markov blanket sets and Markov blanket graphs, which represent the strongly relevant variables and their interactions [18]. Based on these basic concepts we introduced a new statistical methodology, named Bayesian network based Bayesian multilevel analysis of relevance (BN-BMLA), which supports association analysis by estimating posteriors of strong relevance. In the context of genetic association studies, strongly relevant variables with respect to a phenotype represent the genetic and phenotypic factors that directly influence the phenotype (e.g. disease susceptibility). Strongly relevant variables statistically isolate the target variable from all other variables. However, the standard concept of pairwise association is not identical with the concept of strong relevance. First, if the dependency of a SNP to the phenotype is indirect due to LD with the causal SNP, then it is (transitively) associated but not strongly relevant to the phenotype. Second, if a SNP has no main effect on the phenotype, but has an epistatic effect along with an other factor, then this SNP is not associated (according to the prevailing terminology), but strongly relevant to the phenotype (i.e. it is in pure interaction with the phenotype). Therefore, strong relevance indicates either a strong, direct association or a pure interactionist relevance. See Figure S1 for further comparison of the concept of association and strong relevance.
Additionally, posteriors for strong relevance and partial strong relevance for subsets of variables can also be computed. The concept of partial strong relevance, that a set of k predictors are strongly relevant, is particularly useful, because it defines a hierarchical, embedded hypotheses space with varying complexity. For a given domain and sample size the posteriors over k-(strong)relevance can be used to determine the sufficiency of the data and these posteriors can also be used to investigate statistical interactions [19]. These model properties with varying complexities support the overview, the post hoc interpretation, and the offline meta-analysis of weakly significant results [20]. We tested the BN-BMLA method in a case-control setup using artificial datasets for identifying interactions and conditional relevance and it was proven to be superior over other multivariate methods using conditional models designed to detect associations between genotypic variables and the target variable [21].
In this paper we report results of the BN-BMLA statistical method on real-world genotype data from our partial genome screenings of the 11q13 and 14q22 regions in asthmatic patients and controls. We apply and demonstrate the unique ability of BN-BMLA to provide (1) a complete overview about partial multivariate relevance (i.e., posteriors of partial relevance of k predictors for varying k), (2) an overview about various types of dependencies between a predictor and a target based on global characterization (i.e., posteriors that a predictor is directly or transitively dependent, or is in pure interaction), and (3) a joint characterization of relevance for multiple target variables (i.e. to model the dependencies of the targets). We compare these results with results from traditional frequentist methods, and discuss the relevant candidate genes and their interactions from the aspect of association with asthma in our population. Among the newly identified genes FRMD6 showed the strongest association with asthma, and we confirmed the possible role of this gene in the disease in an animal model and in human asthmatics.

Subjects
The study population comprised 1201 unrelated individuals of Hungarian (Caucasian) population. Approximately, 5% of tested subjects were probably of Gypsy origin (estimation based on state population statistics).
Four hundred and thirty six asthmatic children, age 3-18 were recruited to the study. All the asthmatic children had specialist physician-diagnosed asthma with the following characteristics: (1) recurrent breathlessness and expiratory dyspnea requiring treatment; (2) physician diagnosed wheeze; (3) reversibility of the wheezing and dyspnea by bronchodilator treatment measured as forced expiratory volume 1 s (FEV 1 ) by a spirometer. All the asthmatics (or their parents) were instructed to record their symptoms accurately for 2 weeks, the treatment, and the peak expiratory flow (PEF) twice daily (in the evening and in the morning). PEF 100% was determined by calculation from the personal best value and the expected value according to the height of the patient.
If the patient is younger than 5 years old, the determination of lung function tests (PEF or FEV 1 ) is usually not possible. In that case the diagnosis and classification of the disease were made according to the frequency and severity of other symptoms.
The treatment of the patients remained unchanged before the blood was drawn. None of the asthmatic had experienced an exacerbation or a respiratory infection for at least 4 weeks as indicated by increased symptoms.
The control children were randomly selected from outpatients from the Orthopaedic Department in the Budai Children's Hospital and from the Urological Department of Heim Pal Hospital, Budapest. Children in the control group had mild musculoskeletal alterations (like pes planus or scoliosis), phimosis, or other small urogenital problems, showed no symptoms of asthma and required no medication. The adult controls were blood donors without asthma, and according to questionnaires they had not experienced asthma symptoms earlier. The control group comprised 765 subjects (mean age: 19612 years, 405 males/360 females).
In this paper we also demonstrate the ability of the BN-BMLA method to predict various types of dependencies from small amount of available data. For this purpose we applied the method to three embedded datasets: (1) the asthma status was known in all cases (1201 subjects, ''A'' dataset); (2) in all asthmatics (436) and in 664 controls, altogether in 1100 cases the status of rhinitis was also known (''RA'' dataset). Rhinitis was diagnosed in 278 asthmatics (64.0%), and in 233 controls (35.1%). Rhinitis was defined by troublesome sneezing or blocked or runny nose severely affecting the well being of the patient in periods without common cold or flu. Only those subjects were involved in this dataset whose rhinitis status was verified by specialists; (3) in 200 children (106 asthmatics and 94 controls) the status of rhinitis and the serum levels of total IgE and eosinophil were known (''CLI'' dataset, Table 1).
For the measurement of gene expression level in induced sputum, 31 adults between 19 to 61 years old were enrolled in the study, but 10 of them were excluded from the analysis because the qualities of the sputum samples were not appropriate. The study population comprised of 12 asthmatic patients (5 males and 7 females; mean age 36.3613.0 years) and 9 controls (4 males and 5 females; mean age 29.364.9 years). The 12 asthmatic patients had mild atopic stable asthma, no other lung diseases and no lower respiratory tract infection. Patients were required to have FEV1 of greater than 70% of predicted, baseline methacholine PC20 (the provocative concentration of methacholine causing a 20% fall in FEV1) of less than 16 mg/ml. The 9 healthy volunteers were recruited from the staff and students of the participating Hungarian universities. They gave no history of respiratory diseases, had a FEV1/FVC.80% and normal methacholine airway responsiveness (PC20.16 mg/ml).The included groups did not differ statistically regarding age, sex, smoking status and allergy.
The study was conducted according to the principles expressed in the Declaration of Helsinki, and approved by the Ethics Committee of the Hungarian Medical Research Council (ETT TUKEB; http://www.ett.hu/tukeb/tukeb.htm). Written informed consent was obtained from all patients or from the parents or guardians of the minors involved in the study.

Laboratory analysis
Genomic DNA was extracted from peripheral blood using DNA blood isolation midi kit (Qiagen, Valencia, CA).
Multiplex PCR and SNP genotyping for 144 SNPs were performed using 48-and 12plex genotyping assays on GenomeLab SNPstream genotyping platform (Beckman Coulter Inc. Fullerton, CA) using single-base primer extension technology. In addition, SNP rs545659 was genotyped using TaqMan SNP Genotyping Assay (Applied Biosystems, Warrington, UK) on an Applied Biosystems 7900 Real Time PCR System as per instructions of the manufacturer. Some samples were genotyped in duplicate in the same and in different plates. Only those SNPs where the genotyping call rate was .90% were included in the analysis. Some SNPs genotyped on GenomeLab SNPstream genotyping platform were also genotyped using TaqMan SNP Genotyping Assays. No difference between the results of the two methods was revealed.
Total serum IgE levels were determined by 3gAllergy blood tests in Immulite 2000 Immunoassay System (Siemens Healthcare Diagnostics; Deerfield, IL USA).
The eosinophil cell counts were measured by Coulter MAXM Analyser.
RNA was isolated with the Qiagen Mini RNeasy kit (Maryland, USA). RNA was transcribed to cDNA with the High Capacity cDNA Reverse Transcription Kit (Applied Biosystems, Foster City, CA). Real-time quantitative PCR was performed for the genes found to be relevant in asthma in our analyses (FRMD6, PTGDR, PTGER2, MS4A2, AHNAK, PRPF19, TXNDC16) and bactin using an ABI 7900HT Fast Real-Time PCR System (Applied Biosystems, Foster City, CA) and TaqMan Gene Expression Assays. Relative gene expression was determined by the comparative CT method (ddCT) with b-actin as the endogenous control.

Sputum induction and analysis
Sputum was collected from 20 asthmatic and 11 healthy volunteers. The participants inhaled 4.5% saline solution generated by a De Vilbiss Nebulizer (Ultra-NebTm 2000 model 200HI) for 5 minutes after pre-treatment with 400 mg of inhaled salbutamol. Induction was performed three times and the pulmonary function was measured each time after the sputum induction. All portions that macroscopically appeared free of salivary contamination were selected. Samples were diluted with phosphate buffered saline containing 0.1% dithiotreitol (Sigma, St Louis, MO, USA), portions were agitated with a vortex and placed on a bench rocker for 30 minutes. Samples were filtered through a 40 mm Falcon cell strainer, and centrifuged at 1500 rpm for 10 minutes. The cell pellet was resuspended in 1 ml PBS and viability (Trypan blue exclusion method) was determined using Burker chamber. After differential cell count, cells were stocked on lyses buffer at 280uC until use.

SNP selection
SNPs in chromosome region 11q12.2-q13.1 and 14q22.1-22.3 were selected using HapMap data analyzed with Haploview 4.1 (http://www.hapmap.org). Our gene coverage pipeline included tag SNPs for all of the detected haplo blocks with minor allele frequency .0.05 and selected at an r 2 of 0.8 for the CEU population. In addition to using LD criteria, we also added SNPs based on spacing across the region and their estimated functionality. In this manner we selected 145 SNPs in the given regions of Chr 11 and Chr 14 (68 and 77 SNPs, respectively) for genotyping. See Table S1 for detailed information on the examined SNPs.

Statistical methods
Frequentist methods. Allele frequencies were calculated by allele counting. Data were analyzed using MedCalc 5.0 and SPSS 11.5 programs. Hardy-Weinberg equilibrium was tested by using a x 2 goodness-of-fit test. x 2 test was used to test for differences in allele distribution between the groups. Logistic regression adjusted for age and sex was applied to assess the effect of the genetic background to dichotomous clinical characteristics. Confidence intervals were calculated at the 95 percent level. Estimated haplotype frequency was calculated by Haploview 4.1: http:// www.broad.mit.edu/mpg/haploview/. Haplotype-specific ORs were estimated using conditional logistic regression to model the log odds of disease as a function of the individuals' haplotype probabilities. Normalized gene expression levels were compared by t-test. Power analysis was carried out by genetic power calculator (http://pngu.mgh.harvard.edu/,purcell/gpc [22]. In order to facilitate a multivariate based frequentist analysis we applied multivariate logistic regression and multifactor dimensionality reduction (MDR). MDR is a nonparametric and genetic model-free data mining method for detecting nonlinear interactions among discrete genetic and environmental variables [23].
BN-BMLA method. A Bayesian Network (BN) is a directed acyclic graph (DAG) that represents a joint probability distribution of a set of random variables{X 1 , X 2 , …, X n }. These refer to the observed or measured factors (e.g. SNPs, clinical parameters) from a specific domain. A node of the graph represents a variable and an edge between two nodes represents a direct dependency between the variables. Learning a BN structure (i.e. the dependence relations of the variables) is finding a DAG that best describes the dataset. In most cases, where the amount of available data is modest relative to the number of variables there are likely to be many models that have non-negligible posteriors. However, there might be certain structural features, e.g. the presence of an edge, that we can extract reliably. A central structural feature is related to the concept of strong relevance of a single variable or a set of variables. For the definition of Markov blanket sets (MBS), see [24], for strong relevance, see [25], for Markov blanket membership (MBM), see [26] and for their relation, see [27].
With Bayesian learning we can estimate the strength with which the data indicates the presence of a certain feature by estimating its a posteriori probability (Eq. (1)): where G represents a BN structure, D is the dataset, and f(G) is 1 if the feature holds in G and 0 otherwise (for an overview, see [28]. For example, we refer to p(MBS = s|D) as the MBS posterior of s.
Bayesian learning. To calculate the first term of the summation in Eq.(1) we use Bayes rule, and we have that The term P(D|G) is the marginal likelihood of the data given structure G, and the term P(G) is the prior probability of a structure G. We used uniform prior over structures in our experiments.
Assuming that the dataset D is complete (i.e., there are no missing values), the variables are multinomial with a Dirichlet parameter prior for every possible instantiations of their parents, and the prior P(G) satisfies parameter independence, parameter modularity, and structure modularity then the marginal likelihood has an efficiently computable closed form [29].
Bayesian model averaging. As it is mentioned before, our goal was to compute the posterior probability of a feature (i.e., Eq.(1)). Because the number of BN structures is super-exponential in the number of random variables in the domain, exact summation of all possible structures G is computationally intractable (see [29,30]. We used Metropolis Coupled Markov Chain Monte Carlo (MC 3 ) [31] methods for the approximation of Eq.(1). We defined parallel Markov chains over the space of DAGs, whose stationary distribution were the posterior distribution P(G|D). We then generated samples by doing random walks in these chains, and used them to estimate Eq. (1) (for convergence and confidence diagnostics, see [32]. Each step in a Markov Chain corresponds to local transformations of the DAG, called operators [33,34]. Following Castello, we used three operators: (1) adding an edge to the DAG, if it does not violate the acyclic constraint, (2) reversing an edge, if it does not violate the acyclic constraint and (3) removing an edge from the DAG. The probability of the operator selection was uniform.
We ran the MCMC sampler with a burn-in period of 10 6 steps and then collected 5610 6 samples. We restricted the space of the possible structures limiting the number of parents per node to 8.
We computed a posteriori probabilities for structural features summarized in Table 2.
See Figure 1 for a graphical example. In a post-processing step, we introduced the concept of k-MBS as the k sized subset of the strongly relevant variables [19]. The a posteriori probability of the sub-relevance of a k-sized set s is where the terms are the exact MBS posterior of the set s and the MBS posterior for all its superset.
The concept of k-MBS was motivated by the observation that typically the most probable strongly relevant variable sets often share a significant common part. This partial multivariate approach provides an intermediate and scalable complexity for the analysis. Note that the cardinality of the k-MBS features is O(n k ) and the aggregation involves optionally 2 n MBS subsets, which is limited by the number of the DAG structures visited in the MCMC run, i.e. by the number of steps, which is typically 5610 6 , resulting in 10 4 different DAGs and MBS sets.
Posterior values are between 0 and 1, where 0 indicates no relevance, 1 indicates 100% relevance between a predictor and a target variable. In our study we consider a variable relevant when its posterior is greater than or equal 0.5.

Frequentist analysis
From the 145 genotyped SNPs 5 SNPs were monomorphic (MAF = 0), 5 deviated from Hardy-Weinberg equilibrium in controls (p,0.005) and 33 had poor genotyping results (poor genotype clusters or low call rates) and were not considered for further analyses leaving 102 SNPs (59 in 14q22.1-22.3 and 43 in 11q12.2-q13.1) for frequentist and BN-BMLA analyses. Table S2 shows the minor allele and genotype frequencies in asthmatic and control patients. Table S3 presents the statistical evaluation for association of SNPs with asthma at allele and genotype levels.
When allele frequencies were considered, one SNP (rs3751464) in the FRMD6 gene provided an evidence for an association with asthma (OR = 1.43 (1.18-1.75); p = 3610 24 ). Only this SNP could withstand the correction for multiple testing. Because 102 SNPs were considered, the Bonferroni corrected significance level was 5610 24 . This result is compatible with a power analysis which showed that the power of frequentist statistical tests are less than 0.2 for OR below 1.3 at sample size of 1200 (complete dataset).
When the genotype frequencies were considered, the CC genotype of the SNP rs17831682 provided strong evidence for an association (P = 3.9610 24 ), indicating a recessive model. This SNP is located in the 39 UTR of the PTGDR gene and considered as an exonic splicing enhancer (Genecards). Haplotypes were constructed from the investigated SNPs ( Figure S2). The permutation based

Bayesian network based Bayesian multilevel analysis of relevance
Sufficiency of the data. In the Bayesian context we first investigated the sufficiency of the sample size of all datasets (A: 1201, RA: 1100, CLI: 200) for performing a full-scale multivariate analysis to identify the set of relevant variables. This confirmed the necessity of a partial multivariate approach, because there was neither a dominant set nor a set of highly significant sets for the A/ RA and especially not for the CLI datasets ( Figure S3).
Strong univariate relevance. The most relevant SNPs and genes (i.e. with high posteriors for strong relevance with respect to asthma) according to the BN-BMLA are presented in Table 3. In the last column the p values and odds ratios calculated with logistic regression are also shown. Altogether 5 SNPs in 4 genes were found relevant in connection with asthma phenotype: PRPF19 on chromosome 11, and FRMD6, PTGER2 and PTGDR on chromosome 14. The SNP rs7928208 in the gene PRPF19 is also associated with early childhood asthma (in case of children under 6 years). Table 4 summarizes the posterior probability of strong relevance for the most relevant SNPs for asthma and for multiple targets, in case of RA and CLI data sets. By multiple targets we mean rhinitis+asthma (RA data set) and IgE level+eosinophil level+rhinitis+asthma (CLI data set) respectively. In case of the RA data set two SNPs in the AHNAK gene gave the strongest correlation with the phenotype asthma+rhinitis. Interestingly, a SNP in the most studied gene of the 11q13 region, the MS4A2, the gene for the high affinity IgE receptor b subunit showed an association when the CLI data was considered.
Partial multivariate relevance. The multivariate analysis of strong relevance (i.e. the posterior probabilities of Markov blanket sets with asthma as a target, for details see Methods) indicated a very flat posterior distribution. This means that there are several possible strongly relevant sets with low posteriors instead of a dominant set with a high posterior. On the other hand, the aggregation of the multivariate results into univariate conclusions, i.e. strong univariate relevance (described previously) indicated that model averaging can unhinge significant results. Furthermore, it is also possible to aggregate multivariate results into partial multivariate results (i.e. k-sized subsets, as detailed in the Methods section), This allows the investigation of SNP-SNP interactions, because a subset with a high posterior can indicate that these SNPs have a joint effect on the susceptibility of asthma (for a detailed investigation of interactions, see [19]. Results for partial relevance are systematically presented in Figure 2 showing the highest posteriors for various subset sizes (k). In this case the ''A'' dataset was evaluated using asthma as a target variable. The high posteriors for k = 1,2,3,4 indicate that the data is sufficient to infer that these variables are jointly strongly relevant, but above that level (k$5) as shown in Figure S4 the multivariate results are weakly significant.
We applied the MDR method on the ''A'' dataset for model sizes (k = 1,2,3,4) using an exhaustive evaluation. The results confirmed the strong significance of rs3751464 of FRMD6, because it was the most frequent part of the models. However, the difference in model scores within a certain model size was low. The range of scores for the 20 highest scoring models containing 1, 2, 3, and 4 variables were 0.548-0.521, 0.561-0.556, 0.592-0.586, and 0.639-0.630 respectively. Furthermore, the 20 highest scoring models of each size (1-4) contained 51 different SNPs altogether, which also confirms the low power of MDR in case of this data set. In contrast to the approach followed in BN-BMLA, model averaging was not possible because of the frequentist nature of the MDR score (compare with Eq.1.). Manual investigation of MDR results indicated that the following SNPs were often parts of the highly significant models: rs3751464 in FRMD6, rs17831675 and rs17831682 in PTGDR, rs708502 and rs708486 in PTGER2.
We also performed logistic regression analysis on the ''A'' data set with SPSS using PIN = 0.05 as the probability threshold for variable entry, and POUT = 0.1 as the threshold of removing a variable from the model. The forward variable selection method confirmed the strong significance of rs3751464 in FRMD6 Gene-gene interactions. The k-MBS concept from BN-BMLA represents the joint relevance of k variables. However, their effects can be linear or non-linear at some scale, which is indicated by the absence or presence of interaction terms in appropriate statistical models. Furthermore, (non-linear) interactions can be epistatic, i.e. a SNP has no main effect on the phenotype, but has an effect along with an other factor. As can be seen in the last column of Table 3, the distribution of the two SNPs in the PTGER2 gene does not differ between the asthmatic and control groups. It implies that these SNPs do not influence asthma risk alone, only in interaction with other variables, in this case with other SNPs. Table 5 presents the most probable gene-gene (SNP-SNP) joint relevances and interactions for asthma as a target variable. In this evaluation significant interactions between 2 and 3 SNPs were revealed. The table shows the p-values and the  corresponding odds ratios for the interaction terms in logistic regression calculated with SPSS using the enter method. The most relevant interactions include intrachromosomal (e.g. rs708502 in PTGER2 and rs3751464 in FRMD6) and interchromosomal interactions: rs7928208 in PRPF19 (chr. 11) with rs708502 in PTGER2 (chr. 14) and rs7928208 in PRPF19 (chr. 11) with rs3751464 in FRMD6 (chr. 14). Interestingly, the joint interaction of these three variables is also significant (exp(B) = 0.72, C.I. 95%: [0.58-0.89], p-value = 0.002) which is indicated by the logistic regression backward method (described in the previous section). According to these results, the most significant SNP in this study is the rs3751464 in the FRMD6 gene. It influences the asthma risk both alone and in interactions with other SNPs in PRPF19 and PTGER2 genes. In all the interactions with the minor rs3751464 TT genotype significantly increase the asthma risk, while interactions with the more frequent CC genotype decrease the asthma risk (data not shown). Detailed characterization of association relations. The association of a genetic variant to a phenotypic feature can have multiple types. First, because of the dependency of the genetic factors (due to linkage disequilibrium or evolutionary patterns), the markers typically found in genetic association studies are rarely directly associated to the phenotype. In these cases, the association is transitive, i.e. it is mediated by the causal SNP. For example, in case of PTGDR, the posterior of association of the SNPs (rs17831675, rs17831682, and rs803012) to asthma are larger than 0.9, but rs803012 can be excluded as strongly relevant, because its posterior of strong relevance is lower than 0.005, which indicates its non-causal, non-functional role. Second, in case of multiple targets, which originate potentially from a complex dependency model, the association can be mediated by phenotypic variables. For example in case of the asthma a possible dependency model is shown in Figure 3, in which a SNP affecting IgE might be both directly and transitively associated with asthma.
To evaluate the global characterization of association types shown in Table 2, we computed the a posteriori probability   whether a variable is directly relevant or its association is only mediated in cases where the rhinitis status was known (RA dataset; Table 6). Additionally, we computed whether a variable is in pure interaction or its association is pure confounded.
Here we present some examples how the results in Table 6 can be interpreted. The posterior that rs7928208 (PRPF19) is transitively associated to asthma is 0.822, and the posterior for a ''direct'' relation, which is not blocked by any other variable, is similarly high (0.718; RA dataset, asthma as a target variable). On the contrary, the posterior that, rs569108 in MS4A2 is transitively associated with asthma is 0.633, but the posterior for a ''direct'' relation is only 0.087. Another interesting example is rs11231128 in AHNAK. The posterior that it is transitively associated with asthma is 0.535, the posterior that it is strongly relevant is 0.736, and the posterior for a ''direct'' relation is only 0.029, The higher probability of strong relevance compared to the posterior for a transitive relation indicates a pure interaction (0.708), which suggests that this SNP is relevant only if the rhinitis status is known. Furthermore, when rhinitis status was excluded from the data set the strong relevance of rs11231128 for asthma vanished (data not shown). This shows that this SNP is strongly relevant through interaction with rhinitis.
Association to multiple targets. To decompose the relevance of genetic factors for various phenotypes we computed and compared the posteriors for the strongly relevant variables with respect to each target variable, namely IgE, and eosinophil levels, rhinitis and asthma (CLI dataset), which participate in a complex causal model with multiple paths. Posteriors of the decomposed relevance for multiple target variables are presented in Table 7.
We would like to demonstrate how the results in Table 7 can be interpreted through an example. Following the earlier discussion of rs569108 in MS4A2, the results on the CLI data indicate that this SNP is strongly relevant to some of the CLI targets with a posterior of 0.77. In case of rs708498 in PTGER2 the difference between the posteriors is more significant. The posteriors for strong relevance with respect to IgE, eosinophil level, rhinitis and asthma is 0.09, 0.04, 0.05,0.61, which clearly indicates that a relationship between rs708498 and asthma is more probable than a relationship with any other targets. Furthermore, the posterior probability that rs708498 is strongly relevant exclusively to IgE, eosinophil level, rhinitis or asthma are 0.03, 0.01,0.02, 0.50, which shows that it is more probable that this SNP is exclusively related to asthma than to any other phenotype. This is also supported by the posteriors that this SNP is strongly relevant to other targets, but not to IgE,eosinophil level, rhinitis or asthma: 0.58, 0.64, 0.62, and 0.07, with the lowest posterior for asthma in this respect. Finally, the posterior for rs708498 being a relevant SNP for multiple phenotypes (0.67) is relatively close to the posterior of strong relevance for asthma (0.61), which shows that relations to other targets are negligible. Figure 4 displays the posteriors for strong relevance detailed in Table 7 providing an overall view on strong relevance with respect to the targets in CLI data set on the level of genes. It indicates that PTGER2 (rs12587410, rs17197, rs1254600, rs708498) is related to asthma with a higher probability, whereas PTGDR (rs17831682, rs17831675, rs17125273) is slightly more relevant with respect to IgE levels, though the difference between posteriors for strong relevance of different targets is lower in the latter case.
The fact that the average posterior for strong relevance in case of the most relevant SNPs is moderate (in case of asthma as a target) or low (in case of IgE, eosinophil and rhintis) indicates that the CLI data set is at its limit in terms of data sufficiency. In fact, this resembles the flat posterior case mentioned previously. However, even in this case, the results provided by the BN-BMLA method at least allow a restricted analysis of the domain. Furthermore, in case of multiple targets, the joint strong relevance (multi-target relevance) approach provides a more robust posterior. Figure 4 also shows the relationship between the strong multitarget relevance and its approximation based on single target posteriors. The former accounts for the interdependencies between targets, while the approximation treats the targets as independent factors. Therefore, in domains that contain a complex dependency model, the multi-target approach is more viable, since the assumption of independency will not hold. Thus the approximation for a joint relevance based on independently treated posteriors of targets will yield inaccurate results. Table 6. The a posteriori probability that a SNP is directly relevant (D-Relevant), associated, strongly relevant (S-Relevant), transitive or in pure interaction with asthma using the RA data set.

Analysis of gene expression in mice and men
Earlier we have carried out measurements of gene expression levels by Agilent Whole Mouse Genome Oligo Microarray 4644 K chips in the lungs of mice with allergic airway inflammation and control mice (GSE11911 record number in GEO database) [35]. All of the genes found to be relevant in the present study were expressed in the lung of the mice. We compared the expression level of the genes in the lungs of mice with OVA-induced allergic airway inflammation and control mice. Altogether, 1134 transcripts showed .2.0-fold statistically significant differential expression [35], but none of the relevant genes in this study showed this level of difference. However, in all mice with OVA-induced experimental asthma the expression level of FRMD6 was consequently lower (in average with 1.52 fold). From Table 7. The posterior probability of strong relevance of predictors for each target and for a multi-target case based on the CLI data set.   the relevant genes in the present study no other genes showed such a consequent correlation. Next, we studied the changes in the expression levels of genes in the known pathway involving FRMD6. The FRMD6 is part of the conserved Hippo pathway playing a critical role in controlling organ size by regulating both cell proliferation and apoptosis [36] ( Figure 5). One of the best known target genes of this pathway is the antiapoptotic Birc5 [37,38] (also known as survivin) whose expression level showed 5.94 fold increase (corrected P = 0.001) in the lung of OVA induced mice.
We compared the gene expression levels of genes found to be relevant in asthma in the SNP analysis in sputum samples of 12 asthmatics and 9 controls using TaqMan Gene Expression Assays (FRMD6, PTGDR, PTGER2, MS4A2, AHNAK, PRPF19, TXNDC16). Sputum mRNA level of FRMD6 was significantly lower in the asthmatic patients compared to healthy controls with a fold change of 2.73 (p = 10 26 ). No other gene showed statistically significant difference in this comparison.

Discussion
In this paper we presented the results of a partial genome screening in asthma evaluated by several statistical methods. In 11q12.2-q13.1 and 14q22.1-22.3 genome regions, which were earlier identified as asthma susceptibility regions, we successfully genotyped 102 SNPs in 57 genes. Earlier, in different association studies several asthma genes were identified in these regions, but none of them were confirmed later by GWAS, although two of them (PTGDR and GSTP1) were verified in candidate gene association studies in several independent populations [9]. Additionally, in a study where the results of GWAS were combined with a candidate gene approach, polymorphisms in GSTP1 also showed an effect on asthma susceptibility [9]. In our present study, using the BN-BMLA method, several earlier results were confirmed. Associations were confirmed between SNPs in PTGDR, PTGER2, MS4A2 and asthma. Interestingly, however, the frequentist method could only identify the association of the PTGDR gene, and was unable to detect it in the case of the other two genes. The explanation for this phenomenon is, that according to our evaluation, SNPs in the PTGER2 influence asthma susceptibility in interactions or indirectly, and similarly, the association between a SNP in MS4A2 and asthma is transitive, which is hard to detect with traditional frequentist methods. This might be one explanation why the association of the polymorphisms in this gene, which is otherwise a very plausible gene in asthma and atopic diseases, could not be confirmed in the majority of the studies using traditional statistical methods [39,40,41].
MS4A2 gene (earlier known as FceRI-b), which codes for the high affinity IgE receptor b subunit has a central role in mast cell degranulation and IgE mediated allergy.
The rs569108 SNP, which corresponds to the E237G amino acid substitution, is predicted to introduce a hydrophobicity change within the C-terminus of the receptor. It is adjacent to the immunoreceptor tyrosine activation motif, and may affect the intracellular signaling capacity of the receptor. The MS4A2 was one of the first candidate genes in atopic diseases, and already in 1996 associations were found between E237G and significantly elevated skin test responses to different allergens and bronchial reactivity to methacholine in a UK population [42]. Since then, several studies in different populations have investigated the role of this polymorphism in asthma and atopy with very controversial results. In this study we could not confirm a direct association between E237G and asthma, but found a transitive association only when each target variable, namely IgE, and eosinophil levels, rhinitis and asthma (CLI dataset) were considered, which corresponded to a complex model with multiple paths.
It is well documented that, in asthmatics, prostaglandin D2 modulates the physiology of the airways by causing bronchoconstriction, vasodilation, and increase in capillary permeability and mucus production. Mice, lacking PTGDR fail to develop bronchial hyperresponsiveness upon ovalbumin challenge, suggesting that this receptor has an important role in the disease [43]. Polymorphisms in the gene have been reported to be associated with asthma in American, European and Japanese populations, but not in Chinese children, Latinos or Koreans [44]. In our study on Hungarian children, several SNPs showed different types of associations (direct or transitive). Interestingly, three SNPs (rs17831682, rs17831675, rs17125273) were slightly more relevant with respect to IgE levels, than asthma or other targets.
Prostaglandin E2 exerts anti-inflammatory and bronchoprotective mechanisms in asthma, it inhibits the chemotaxis of eosinophils toward eotaxin, prostaglandin D2 and C5a [45]. Polymorphisms in its receptor PTGER2 were mainly found to be associated with aspirin-intolerant asthma, but in some studies also with asthma in general [46]. In our study the distribution of the SNPs in the PTGER2 gene did not differ between the asthmatic and control groups. But, when SNP-SNP interactions were calculated, polymorphisms of the PTGER2 participated in each significant association. This implies that these SNPs do not influence asthma risk alone, but in interaction with other SNPs. When multiple targets were considered the SNPs of this gene showed relevance only to asthma and the relations to other targets (IgE and eosinophil levels or rhinitis) were negligible.
Besides confirming previous results, the present study also detected new asthma genes in these regions. The most remarkable result of the study is the role of FRMD6 in asthma. Association between a SNP in FRMD6 and asthma risk was identified with both the frequentist and BN-BMLA methods. A haplotype in this gene also influenced the disease susceptibility, and the rs3751464  [36,37,38]. According to this pathway lower level of FRMD6 might be associated with higher level of Birc5, as was found in the lung of the animal model of asthma. doi:10.1371/journal.pone.0033573.g005 showed an influential role in interactions with other SNPs in this respect. Additionally, the expression level of the FRMD6 was consistently lower in the lungs of mice with allergic airway inflammation and it was significantly lower in human asthmatics compared with controls. The exact function of this gene and its possible role in asthma are unknown, but some theories can be generated from earlier and our present data. FRMD6 (FERM domain containing 6, earlier also EX1, or Willin) is suspected to be an upstream component of the Hippo signaling pathway [36]. Several recent publications establish that the pathway is one major conserved mechanism governing cell contact inhibition and organ size control [36,37]. Clearly, even small changes in this pathway can alter the lung morphogenesis which can lead to modified response to environmental challenges and susceptibility to asthma. The role of this pathway in asthma is also supported by our findings that the expression of the antiapoptotic Birc5 (also known as survivin), one of the best known target genes of this pathway was drastically reduced in the lung of mice with airway inflammation. This finding supports the theory that one of the mechanisms which can play a role in asthma is the reduced apoptotic potential of the airway epithelium. Studies supporting this theory showed that after rhinovirus infections, the epithelial cells of the asthmatic lungs were unable to enter into apoptosis with the consequence that the replicating virus caused cytopathic cell death with extensive virus shedding [47,48]. It can be asked, however, why earlier GWAS were unable to detect FRMD6? The simplest explanation is that in these studies only one SNP in the FRMD6 gene was genotyped (rs7140150), which is not in LD with rs3751464. This supports the observation of Michel et al that GWAS coverage is insufficient for many asthma candidate genes [9]. In addition, these GWASs were carried out in other populations, which could significantly influence the results.
The rs3751464 SNP localizes in the 59 untranslated region of the FRMD6 gene on chromosome 14q22.1 at 52117892 base pair. With in silico methods, we were unable to find out the role of this SNP (or SNPs in LD with rs3751464) in the expression or function of FRMD6, or in the disease. In public databases no data were available whether this SNP or its haplotype influence the binding of any regulator elements.
Another interesting and novel finding of this study is the indirect but strong relevance of the SNPs in the AHNAK gene to asthma. The SNPs in this gene influenced asthma risk in interaction with rhinitis. AHNAK is a ubiquitous protein expressed in a variety of cell types. In epithelial cells AHNAK is distributed mainly on the cell membranes, suggesting its role in the formation of tight junction. Now, it is widely accepted that impaired formation of tight junction leads to reduced barrier function and increased susceptibility to asthma [49,50,51]. Notably, the rs11231128 is a missense SNP, causing a serine proline amino acid substitution at position 3724, which might influence the 3D structure of the protein. In addition, according to the Ingenuity database, AHNAK interacts with several factors playing important role in the disease, like TGFB1, EGFR, IL6 and STAT4 (Ingenuity Pathway Analysis (IPA) 9.0 Software (Ingenuity Systems, Redwood City, CA, USA; www.ingenuity.com)).
Similarly to the SNPs in AHNAK, a SNP in the TXNDC16 gene also influenced asthma risk in interaction with rhinitis. Until now, however, there has been no information about the function of this gene in any databases.
Strong and direct association was found between rs7928208 in the PRPF19 gene and asthma and this SNP was also associated with asthma development before 6 years of age. Although its possible role in asthma is unknown, the product of the gene, similarly of some earlier found asthma genes (e.g. RAD50) plays a role in DNA repair, thus theoretically its altered function might influence the resistance of the cells to environmental stress [52,53,54].
The above discussed results clearly show the several advantageous features of the BN-BMLA method over the traditional frequentist methods generally used in gene association studies. As can be seen from the results, the advantage is not only that the BN-BMLA can detect more relevant variables, but the Bayesian networks offer a rich language for the detailed representation of types of relevance, including causal, acausal, and multitarget aspects. Additionally, Bayesian statistics offers an automated and normative solution for the multiple hypothesis testing problem. The computational complexity is manageable using high-throughput and high-performance computing resources for medium sized problems with hundreds of variables. This extends the scope of local 'causal' discovery methods, and because of the direct interpretation of Bayesian posteriors, contrary to p-values from the frequentist approach, makes it an ideal candidate for creating probabilistic knowledge bases to support off-line meta-analysis and fusion of background knowledge.
We analyzed partial multivariate strong relevances, because the Bayesian statistical framework allows the calculations of posteriors over a wide range of hypotheses, such as strong relevance of variables, pairs of variables, triplets of variables, etc. This shows the advantage of the Bayesian framework, because it allows the selection of appropriate level of complexity of hypotheses, which is not possible in the traditional hypothesis testing approach.

Conclusion
In a partial genome screening of asthma we identified FRMD6 as a novel asthma gene. The possible role of this gene was also confirmed in an animal model and in human asthmatics. Beside FRMD6, using BN-BMLA method, we identified several additional genes (PTGDR, PTGER2, MS4A2, AHNAK, PRPF19, TXNDC16), which directly, or indirectly might play a role in the disease. In contrast to BN-BMLA, the traditional and novel frequentist based methods (x 2 test, multivariate logistic regression and multifactor dimensionality reduction) could consistently identify only the direct effect of FRMD6 on asthma risk, and in some models the possible effect of PTGDR, PRPF19 and PTGER2.
The BN-BMLA on one hand extends the scope of strong relevance based methods towards (1) partial multivariate relevance, (2) global characterization of pairwise relevances, and (3) multi-target relevances. On the other hand it can be seen as focusing the general, global feature learning techniques towards relevance analysis, i.e. from learning arbitrary dependency substructures to learning strongly relevant sets. Furthermore, this Bayesian global relevance analysis method provides posteriors, which are direct statements about hypotheses, thus it can also be used to construct probabilistic data analytic knowledge bases in genetic association studies to support complex quering, off-line meta-analysis, and fusion with background knowledge. Although, in artificial datasets we have previously demonstrated the superior ability of the BN-BMLA method to detect real factors and interactions in genetic association studies, the method must be applied to other real word datasets, and the results must be validated in alternative methods. If these studies confirm the usefulness of this new statistical method, it could be a good alternative to evaluate the results of gene or genome association studies.
The tool is accessible at http://webbmla.genagrid.eu. Figure S1 Comparison of standard concept of (pairwise) association and strong relevance. The concept of strong relevance and association (with respect to a target) has only one common element, direct relevance, i.e. non-mediated relationship between the target and a variable. Association also includes confounded and transitive relevance, where there is a mediator between the given variable and the target. In cause-effect relationship terms, the confounded case corresponds to a common cause; the transitive case corresponds to a cause-effect path. Strong relevance, on the other hand, includes interactionist relevance, i.e. a common effect type relationship. (TIF) Figure S2 Haplotype blocks of the investigated SNPs in the present study. SNPs are numbered sequentially and their relative location is indicated along the top. Markers 1-54 and markers 55-102 correspond to the studied SNPs on Chromosome 14 and 11, respectively. Triangles surrounding markers represent haplotype blocks.

Supporting Information
(TIF) Figure S3 The peakness of the posteriors of the most probable 100 MBS sets. The x axis denotes the rank of a Markov blanket set (MBS) of SNPs, the y axis denotes the joint probability of an MBS, e.g. an MBS with rank = 5 is the fifth most probable set. The ''RA'' and ''CLI'' prefixes denote the corresponding dataset, and the ''asthma'' and ''multitarget'' suffixes indicate the target variables. RA-multitarget: Rhinitis, Asthma; CLI-multitarget: Asthma, Rhinitis, Eosinophil and IgE level. Note, that the peakness of the posteriors decreases within the same dataset in the multitarget case; and between different data sets, the smaller sample size (CLI:200, RA:1100) results in weaker posteriors. In terms of data sufficiency, the flatness of the CLI MBS posterior curve (both in the single target and the multitarget case) indicates that the CLI data is not sufficient for a complete multivariate analysis. The RA dataset, on the other hand, is more appropriate having a relatively peaked posterior, although the maximum posterior is not particularly high.
(TIF) Figure S4 The most probable univariate (MBM), bivariate (2-MBS), trivariate (3-MBS), 4-MBS, 5-MBS subsets of variables. All posteriors of partial k-relevance are ordered, x axis denotes the rank (e.g.: the number 2 means the second highest posterior), and y axis denotes the posterior probabilities. As k increases (i.e. the set size of jointly considered SNPs) the maximum posteriors decrease, and the slope of the posteriors are much less ''peaked'', which means that the lower ranked posteriors are significantly higher than those with higher ranks. The univariate case can be considered as a relatively peaked curve, and as k increases, the curve ''flattens''. The curve of whole sets (MBS) is the ''flattest'', showing only a small difference between the lower ranked and the higher ranked posteriors. In this case, estimating the top 20 MBS is less informative than estimating the top 10 partial k-relevance of k = 2. This indicates the viability of the concept of partial k-relevance. (TIF)