Genes to Diseases (G2D) Computational Method to Identify Asthma Candidate Genes

Asthma is a complex trait for which different strategies have been used to identify its environmental and genetic predisposing factors. Here, we describe a novel methodological approach to select candidate genes for asthma genetic association studies. In this regard, the Genes to Diseases (G2D) computational tool has been used in combination with a genome-wide scan performed in a sub-sample of the Saguenay−Lac-St-Jean (SLSJ) asthmatic familial collection (n = 609) to identify candidate genes located in two suggestive loci shown to be linked with asthma (6q26) and atopy (10q26.3), and presenting differential parent-of-origin effects. This approach combined gene selection based on the G2D data mining analysis of the bibliographic and protein public databases, or according to the genes already known to be associated with the same or a similar phenotype. Ten genes (LPA, NOX3, SNX9, VIL2, VIP, ADAM8, DOCK1, FANK1, GPR123 and PTPRE) were selected for a subsequent association study performed in a large SLSJ sample (n = 1167) of individuals tested for asthma and atopy related phenotypes. Single nucleotide polymorphisms (n = 91) within the candidate genes were genotyped and analysed using a family-based association test. The results suggest a protective association to allergic asthma for PTPRE rs7081735 in the SLSJ sample (p = 0.000463; corrected p = 0.0478). This association has not been replicated in the Childhood Asthma Management Program (CAMP) cohort. Sequencing of the regions around rs7081735 revealed additional polymorphisms, but additional genotyping did not yield new associations. These results demonstrate that the G2D tool can be useful in the selection of candidate genes located in chromosomal regions linked to a complex trait.


Introduction
Asthma involves genetic and environmental factors in its development, chronicity and severity [1,2]. Although some of its underlying mechanisms have been elucidated in recent years, more work is needed to gain a clearer understanding of genetic determinants. The mapping of asthma has been one of the most important areas of human genetics in the last two decades. According to an overview by Blumenthal (2005), twelve complete and two incomplete genome scans for asthma have been published, identifying a total of twenty chromosomal linked regions to asthma [3]. Discrepancies often appeared between linkage studies [3][4][5], leading to the use of standard phenotype definitions and founder populations as a way to decrease phenotypic and genetic heterogeneity [6][7][8].
To date, common strategies have been employed to identify genes involved in asthma predisposition. Linkage studies followed by positional cloning identified six genes while association studies identified over a hundred genes, the majority of these having quite small effects on asthma susceptibility (see [9,10] for a review). Here, we describe a novel methodological approach that combines classical genetic approaches with a computational data-mining tool. In this regard, a genome-wide scan for asthma and atopy in families originating from the Saguenay-Lac-St-Jean (SLSJ) founder population (Northeastern Quebec, Canada) [11][12][13][14][15][16] has been combined with the Genes to Diseases (G2D) new computational tool in the prioritization of asthma candidate genes [17,18]. G2D performs the selection of the candidates in the chromosomal regions genetically linked to a disease by highlighting genes whose functions are related to the phenotype of the disease according to a data mining analysis of the bibliographic and protein public databases, or according to the genes already known to be associated with the same or a similar phenotype.
Single nucleotide polymorphisms (SNPs) within the candidate genes prioritized by the G2D tool have been subsequently used to conduct an association study with asthma, atopy and allergic asthma phenotypes in the SLSJ asthma familial collection. The chromosomal regions around associated SNPs have then been sequenced in order to find causal mutations, and a replication of the positive association findings has been assessed in the Childhood Asthma Management Program (CAMP) independent cohort. The goal was to apply the G2D tool in the search of genetic determinants associated with a complex trait, using asthma as a model. This approach led to the hypothesis-driven identification of ten genes that may not have been selected otherwise (LPA, NOX3, SNX9, VIL2, VIP, ADAM8, DOCK1, FANK1, GPR123 and PTPRE). Of these, one positive genetic association, resisting to corrections for multiple testing, has been found between the protein tyrosine phosphatase receptor type E gene (PTPRE) and allergic asthma in the SLSJ sample.

Subjects
Clinical evaluation and phenotyping criteria of the Saguenay-Lac-St-Jean (SLSJ) subjects have been described in recent reports [19][20][21] and summarized in Table 1. This familial sample is predominantly composed of probands that reported an onset age of asthma below 12 years old (81.6% of the probands). The mean age of onset for the probands is 7 years and the mean age of onset for the asthmatic family members is 22 years. The entire sample has been used for the association study while for the genome scan, the first 79 recruited families, that were available at the time when the genome scan was performed have been used (n = 609 individuals-see supplementary Table S1 for subjects characteristics and studied phenotypes). The Chicoutimi Hospital local ethics committee approved the study and all subjects provided informed consent.

Genome Scan
DNA was extracted for all SLSJ participants from whole blood by using the QIAGEN genomic purification procedure (QIAGEN Inc., Valencia, CA). Genotyping was completed on 367 autosomal and 21 X-chromosome microsatellite markers evenly spaced throughout the genome (average spacing of 9.2 cM). The marker set is a modification of the Cooperative Human Linkage Centre Screening Set (http://gai.nci.nih.gov/CHLC/, version 6.0), showing an average heterozygosity of 0.72 in our data set. Each primer was amplified separately and then pooled into panels of eight markers and products were interrogated using ABI 3700 sequencers (Applied Biosystems, Foster City, CA) with a size standard ladder. Duplicate of two CEPH control DNAs and one water were included in each genotyping plate. In our data set of 195939 autosomal genotypes, 91.4% of the alleles were called and the proportion of observed Mendelian error was 1.2%. Linkage of chromosomal regions with putative genetic risk factors for a given trait was assessed by evaluating the extent of excess sharing of alleles identical by descent in affected relatives within families. Test statistics are reported on the LOD scale. Briefly, a multipoint, one-parameter likelihood ratio test that is robust against incompleteness of marker data (when the descent of alleles in a pedigree is not fully known) was used [22,23]. We moreover evaluated the specific contribution of mothers and fathers to the test of linkage, to look for parent-of-origin effects. The above tests of linkage, linkage through mothers and linkage through fathers are described in details in [24].

Genes to Diseases (G2D)
The G2D tool has been applied in the two best genome scan susceptibility regions: 6q26 between markers D6S476 and D6S305, and 10q26.3 between markers D10S1223 and D10S1248. We used two of the approaches considered in G2D [18] to pre-select gene lists in these regions. The first approach uses a description of the phenotype to point to genes in a region. This method works with automatically derived relationships between the disease symptoms (as MeSH C terms) and gene features (as Gene Ontology or GO terms [25]) that are obtained from the literature and Entrez gene database (http://www.ncbi. nlm.nih.gov/). We call this procedure the ''PHENOTYPE'' method. The second approach consists on automatically finding genes in a region that are similar to other genes previously associated to asthma. To do this, G2D measures the semantic distance between the annotations of the ''known genes'' and the annotations of genes in the problem region assigned by homology searches. We call this procedure the ''KNOWN GENES'' method. For details about the algorithm see the G2D web site at http:// www.ogic.ca/projects/g2d_2/ and [17,18].
Association Study SNP selection. The HapMap database (http://www. hapmap.org/) has been used to identify SNPs assumed to be polymorphic in the SLSJ population. TagSNPs were then selected with the tagger program implemented in the Haploview software (version 3.32) [26] using an r 2 cutoff of 0.8 and a minor allele frequency (MAF) over 0.10 to cover each whole gene. SNPs were also prioritized on their localization (coding or untranslated regions-see supplementary Table S2). All SNPs are referred using their reference sequence number (rs#).
SNP genotyping. Eighty SNPs have been genotyped by the SequenomH matrix-assisted laser desorption/ionization time-offlight mass array spectrometer (Sequenom Inc., San Diego, CA)  (Table S2). Sequenom primers were designed using the Sequenom SNP Assay Design software version 3.0 for iPLEX reactions. A total of 74 assays were designed for a single multiplex reaction. The assay group file containing the PCR primers and the iPLEX extension probes can be supplied on request to the corresponding author. The protocol and reaction conditions are in accordance with the manufacturer [27]. The genotypes were viewed and analyzed using the MassARRAY Typer software version 3.4 (Sequenom Inc., San Diego, CA). The ten remaining SNPs (Table  S2) have been genotyped by the TaqManH SNP Genotyping Assays (Applied Biosystems, Foster City, CA) using the Rotor-Gene TM real-time PCR (Corbett Research Ltd, Sydney, Australia). Protocol and method were supplied by the manufacturer and PCR conditions were optimized to get a good cluster separation between different genotypes (see supplementary  Table S3 for PCR conditions). Genotypes were attributed by the Rotor Gene software using the scatter graph analysis option.
Statistical analysis. Family-based association testing has been performed with the FBAT software (version 1.7) using an empirical estimate of the variance [28][29][30] to correctly account for linkage. A Sidak correction for multiple testing has been applied on the p-values accounting for the effective number of independent phenotypes and SNPs, according to the definition of Li and Ji [31], as implemented in the SNPSpD program [32] (see the online supporting Text S1 file for supplementary details). Parent-specific transmission disequilibrium tests were performed using sib_tdt from the ASPEX package (http://aspex.sourceforge. net/). Mendelian errors have been assessed by FBAT and Hardy-Weinberg equilibrium has been assessed with Haploview software (version 3.32) [26].

PTPRE Sequencing
Forty unrelated SLSJ subjects (validated with the BALSAC database [33]) presenting full-fit allergic asthmatic criteria and that have contributed to the PTPRE association were selected. PTPRE sequence information was obtained from Ensembl database (http://www.ensembl.org/index.html, release 46). The sequencing was divided in six regions that spanned 2.7 kb, starting from the exon 2 to the exon 4. Oligonucleotides and PCR conditions are listed in Table S4. Amplification products were purified with multiscreen PCR plates (Millipore Corporation, Billerica, MA), sequenced with BigDye terminator v3.1 chemistry following instructions of the manufacturer and analyzed on a 3100 Genetic analyzer (Applied Biosystems, Foster City, CA). Sequence analysis was performed with Codoncode Aligner software (CodonCode Corporation, http://www.codoncode.com/). The identified SNPs presenting MAF over 0.05 have been genotyped in the SLSJ sample using a Sequenom panel and analyzed with FBAT, as described above. Newly described SNPs have been submitted to NCBI (http://www.ncbi.nlm.nih.gov/) SNP database.

Replication Study
The PTPRE rs7081735 association has been assessed in the Childhood Asthma Management Program (CAMP) study [34,35]. This analysis includes the 497 non-Hispanic white children and their parents for whom adequate DNA was available. The genotyping of the PTPRE rs7081735 was performed using a TaqManH SNP Genotyping Assays (Applied Biosystems, Foster City, CA) (see Table S3 for PCR conditions). Plates were scanned using the 7900HT Fast Real-Time PCR System (Applied Biosystems, Foster City, CA) and genotypes were assigned by the SDS 2.2 software using the scatter graph analysis option. The Institutional Review Board of the Brigham and Women's Hospital (BWH), as well as those of the other CAMP study centers, approved this study. Informed assent and consent were obtained from the study participants and their parents to collect DNA for genetic studies.

Genome Scan
The genome-wide linkage scan analysis revealed at least two regions showing suggestive evidence for linkage, as well as differential maternal and paternal contribution (Figure 1). The two regions are 6q26 for asthma (LOD = 1.54, p = 0.0038) and 10q26.3 for atopy (LOD = 2.82, p = 0.00016). In these two regions, affected sibs tend to share more alleles inherited from the mothers than from the fathers. The linkage tests through the mothers reach a LOD of 2.19 (p = 0.00074) in 6q26 for asthma and a LOD of 2.96 (p = 0.00011) in 10q26.3 for atopy. The respective LODs obtained through the fathers at the same loci are only 0.01 (p = 0.40) and 0.58 (p = 0.05). Moreover, even though the 10q region does not show a great strength of linkage with asthma (LOD = 0.57, p = 0.051), the asthmatic sibs tend to share these alleles when received from their mothers (LOD = 2.82, p = 0.00016). Each of the two chromosomal regions show LOD values (either for tests of linkage, or parent-specific LODs, or both) that are, in order of magnitude, consistent with what has been defined as «suggestive» for linkage by Lander and Kruglyak [36], which are expected to occur once per whole genome scan on average. Because of their great hypothesis generating potential, the 6q26 and 10q26 regions as well as asthma and atopy phenotypes were selected for the following G2D and association studies.

G2D
We applied the two algorithms PHENOTYPE and KNOWN GENES to the 6q26 and 10q26.3 regions. For the PHENOTYPE method, we used the OMIM [37] record 600807 as input, particularly the MeSH C terms from the MEDLINE references in OMIM entry 600807 which deals with susceptibility to asthma and asthma related traits. The most prevailing MeSH C terms are ''Asthma'' and ''Bronchial Hyperreactivity'', but the list also includes ''Hypersensitivity'', ''Respiratory Hypersensitivity'' and ''Eosinophilia''. We thus considered that the terms that refer to asthma were ''Asthma'' and ''Bronchial Hyperreactivity'' and that the terms that refer to atopy were ''Hypersensitivity'', ''Respiratory Hypersensitivity'' and ''Eosinophilia''. The highest scoring GO terms, associated to these MeSH C terms, describe a variety of molecular functions and processes that include leukotriene and interleukin signaling, glutathione metabolism, etc (see header of supplementary Table S5). We applied this method to the 6q26 region between D6S476 and D6S305, the two markers directly flanking the linkage peak seen in the region, which corresponds to the 10.48 MB band between positions 151,685,574 and 162,165,587 of chromosome 6. After discarding candidates that did not overlap with any known or hypothetical Entrez Gene sequence, 16 genes remained (see supplementary Table S5-A). A similar analysis was carried between D10S1223 and D10S1248, the two markers directly flanking the linkage peak in 10q26.3, between positions 129,150,822 and 130,982,363 in chromosome 10. In that case, 12 candidates were obtained (see supplementary  Table S5-B). For the KNOWN GENES method, we compiled a list of genes reported to be associated with asthma and atopy from the literature [38] and from the Genetic Association Database GAD [39] (supplementary Table S6). We then extracted the GO annotation of those genes in Entrez Gene [37]. We derived a scoring system for the candidates according to the minimal semantic distance between their GO annotation and the ones from the compiled known-gene list. For example, genes annotated with GO terms such as ''dipeptidyl-peptidase IV activity'' or ''chemokine receptor binding'' would score high as candidates. The method takes into account the hierarchical structure of GO as well as the specificity of GO terms. In that sense, similarity with more infrequent terms receive higher scores. For the complete list of GO terms see the header of supplementary Table S7. We applied the KNOWN GENES method to both regions 6q26 and 10q26.3, obtaining 15 and 10 genes, respectively, after filtering out those candidates that did not overlap with either known or hypothetical genes (see supplementary Table S7).
Use of a complementary method based on genomic sequence. We applied the G2D complementary Disease Gene Prediction (DGP) tool [40] to both genetically linked regions in order to predict the involvement of genes in inherited disease by their sequence features. This method analyses the probability of a gene to be associated with any disease phenotype. Genes with no associated phenotype and a probability greater than 0. Final list of candidate genes. Genes that received the higher scores in the pre-selected lists, and those that were pointed by the two different analyses were preferred, allowing the construction of a list of 17 candidates (displayed in the Table 2).  Table 2. Genes identified by G2D data mining analysis and those selected for the association study based on their number of appearance in G2D analyses and on their biological function. We then applied a candidate gene approach to select a final list of ten genes with the best biological potential related to asthma pathophysiology for genotyping (five in each chromosomal region): LPA, NOX3, SNX9, VIL2, VIP, ADAM8, DOCK1, FANK1, GPR123 and PTPRE (marked in bold in Table 2).

Association Study
A final panel of 91 SNPs (75 tagSNPs and 16 non-tagSNPs) was selected among the ten candidate genes (supplementary Table S2). Of these, four were non-polymorphic, six failed Sequenom genotyping assays, none presented deviation from Hardy-Weinberg equilibrium (p-values .0.001) and one presented more than two Mendelian errors (Table S2). For the remaining 80 SNPs, genotypes of individuals with Mendelian errors were considered as missing data in the FBAT analyses. Genotyping presented a mean success rate of 99.0% for the Sequenom assays and a mean success rate of 98.3% for the TaqMan assays. Accounting for the residual correlation between the tagSNPs, it is estimated that the 80 partially correlated SNPs correspond to an effective number of 59 independent ones [31]. As for the effective number of phenotypes, simulation indicates that the three studied phenotypes (asthma, atopy and allergic asthma) correspond to 1.75 effective independent ones (see supplementary Text S1 file). Accordingly, the estimated total effective number of independent tests is 103.25 (59 effective independent SNPs61.75 effective independent phenotypes). Thus, applying Sidak correction, the p-value threshold of significance is estimated to be 0.000483.
The FBAT single marker analyses were performed under an additive genetic model for each SNP and the three studied phenotypes. For the sake of brevity, only results showing a p-value under 0.05 before correction for multiple testing for one or more phenotypes are presented in Table 3. For the asthma phenotype, minor alleles LPA_rs12175867C, GPR123_rs11101913T, GPR123_rs11101932T and GPR123_rs12257731A were overtransmitted to the asthmatic probands, suggesting a susceptibility effect of these alleles for asthma (0.0085,p,0.047). Inversely, minor alleles DOCK1_rs1051039G, PTPRE_rs4369314A and PTPRE_ rs7081735G were undertransmitted to the asthmatic probands, suggesting a protective effect of these alleles for asthma (0.010,p,0.049).
For the atopy phenotype, only ADAM8_rs11101672G and PTPRE_rs7081735G minor alleles have been undertransmitted to the atopic probands, suggesting a protective effect (p = 0.039 and 0.037, respectively). Finally, for the allergic asthma phenotype, minor alleles LPA_rs12175867C, GPR123_ rs11101913T, GPR123_rs11101932T and GPR123_rs12257731A were overtransmitted to the allergic asthmatic probands, suggesting a susceptibility effect of these alleles for allergic asthma (0.035,p,0.041). Inversely, minor alleles ADAM8_rs11101672G, GPR123_rs11101916A, GPR123_rs761777G, PTPRE_rs11016002A, PTPRE_rs4002572C and PTPRE_rs7081735G were undertransmitted to the allergic asthmatic probands, suggesting a protective effect of these alleles for allergic asthma (0.000463,p,0.037). None of the SNPs reported above showed a significantly greater extent of transmission distortion from mothers than from fathers, thus not providing insights to the observed parental distortions seen in the linkage results (not shown).

PTPRE Sequencing
PTPRE rs7081735 shows the strongest association to allergic asthma (p = 0.000463) and is the only SNP shown to be significant after multiple testing correction (corrected p = 0.0478). We thus sequenced strategic genomic regions around the rs7081735, including coding regions in order to identify causal mutation. Figure 2 shows the PTPRE sequenced regions, the rs7081735 localization and the nine identified variants (diamonds), including four novel ones (c.86172A.G, c.140901G.A, c.140903G.A and c.141102C.T) ( Table 4). It is worth to note that any of these identified polymorphisms may affect either PTPRE isoforms. A family-based association analysis has been performed between the three studied phenotypes and variants presenting a MAF over 0.05 that were not included in the first genotyping panel, which was the case for four of the nine identified variants (rs7911506, c.86172A.G, rs7895103 and c.140901G.A). Only rs7895103 showed a modest association with asthma, at a level that does not provide additional insights (p = 0.013, compared to p = 0.000463 for rs7081735). Table 3. Significant Family-Based Association Test (FBAT) results between the ten G2D candidate genes studied SNPs and asthma, atopy and allergic asthma phenotypes under an additive genetic model.

Replication Study
Considering that the PTPRE rs7081735 association to allergic asthma is the strongest in the SLSJ sample, we evaluated it in an independent familial cohort, the Childhood Asthma Management Program (CAMP) [34]. The genotyping completion rate was 96% and no discordance was observed upon repeat genotyping of two random plates. The minor allele frequency was 0.33 and was in Hardy-Weinberg equilibrium (p = 0.90). Family-based association showed no significant association for asthma or atopy phenotypes (data not shown). However, to ensure that the selection of the CAMP study for the replication is appropriate and to demonstrate that the association found could result from a childhood subset of asthma instead of an adulthood one, we stratified the SLSJ association analyses considering only the probands that reported an asthma age of onset below 12 years old. Thus, the positive association found for PTPRE rs7081735 and allergic asthma (p = 0.000463, Table 3) remains significant when considering only those probands (Family number = 74; Z for the minor  allele = 23.166; p = 0.001546). Even with the loss of statistical power due to the stratified analysis, this comparison allowed to assume that the association found for PTPRE is probably more related to a childhood asthma, comforting the choice of the CAMP cohort as a replication study.

Discussion
This study proposes a novel approach for the selection of candidate genes for asthma association studies using the computational G2D tool to find genetic determinants for this disease. Based on a genome-wide scan performed in an asthmatic familial sample from the SLSJ founder population, we selected the two best-linked regions (6q26 and 10q26.3) and applied the G2D data mining approach [17,18] to identify ten candidate genes for an association in the SLSJ sample. Among these, five (LPA, ADAM8, DOCK1, GPR123 and PTPRE) presented modest associations with asthma, atopy or allergic asthma. After corrections for multiple testing, only the PTPRE rs7081735 association to allergic asthma remained significant. These findings demonstrate that the G2D tool can be useful in the selection of candidate genes for asthma genetic studies.
Because the PTPRE association to allergic asthma remained significant after correction for multiple testing, we sequenced strategic regions around the rs7081735, aiming to find the causal mutation. Sequencing allowed the identification of five known and four novel variants. Testing four SNPs with MAF .0.05 did not identify additional associated SNPs with asthma or atopy related phenotypes in the SLSJ sample. Thus, the rs7081735, located in the 59 untranslated region, could be the causal variant, or be in linkage disequilibrium with an unknown causal mutation.
PTPRE is a member of the protein tyrosine phosphatase (PTPs) family, which includes genes that are important regulators of signal transduction pathways involved in various cellular processes such as control of metabolic pathways, cellular adhesion, cell cycle progression and immune response [41,42]. PTPRE encodes two different isoforms, cytoplasmic and transmembrane [43], based on its different promoters [44]. PTPs receptors participate in transmembrane signaling and cellular adhesion processes, whereas intracellular PTPs take part in signal transduction within the cell [45]. For example in mice, PTPe-deficient macrophages present abnormalities in the regulation of the respiratory burst and the production of cytokines in response to bacterial lipopolysaccharide, suggesting a role of the PTPe isoform in inflammation as well as in host defense [46]. However, PTPRE has been shown to be highly expressed in peripheral human monocytes and granulocytes, and antigen-receptor stimulation induces the expression of PTPRE in activated lymphocytes [47]. These observations and our finding suggest a protective effect of PTPRE in allergic asthma, and lead us to hypothesize that this potential protective role could involve the leukocyte cellular processes in the limitation of lung inflammation following an allergen sensitization. Further work is needed to define the PTPRE possible role in asthma pathophysiology.
The PTPRE rs7081735 association has been evaluated in the independent CAMP cohort. Results showed no positive association for asthma and atopic phenotypes. Taking into account that the SLSJ and the CAMP studies are both family-based designed, well powered [48,49] and presenting a childhood onset asthma, the PTPRE association lack of replication in the CAMP study can be explained by other reasons: natural variability of asthma history [50,51], differences of proband mean age between samples (SLSJ mean age of 18, and CAMP mean age of 8, respectively), differences in the genetic background of the two populations [52] (SLSJ individuals descend predominantly from French European founders [11][12][13][14][15][16] whereas CAMP individuals come from a white North-American admixed population), or population specific gene-gene and gene-environment interactions [52]. Based on these observations, we conclude that the PTPRE rs7081735 association to allergic asthma is more penetrant in the SLSJ population, possibly resulting from an interaction with others genes and/or environmental factors that are more common in this founder population.
In summary, this study demonstrates that the G2D tool can be useful in the prioritization of candidate genes for a complex disease as it allowed us to find a novel asthma genetic association with the PTPRE gene in the SLSJ familial asthma sample. This association represents a potential protective factor for asthma pathogenesis, as it is more likely related to a childhood onset asthma. The present genetic study is an example of how the combination of different methodological approaches can be relevant to target asthma genetic determinants and to motivate further genetic and functional investigations.