A Genome-Wide Homozygosity Association Study Identifies Runs of Homozygosity Associated with Rheumatoid Arthritis in the Human Major Histocompatibility Complex

Rheumatoid arthritis (RA) is a chronic inflammatory disorder with a polygenic mode of inheritance. This study examined the hypothesis that runs of homozygosity (ROHs) play a recessive-acting role in the underlying RA genetic mechanism and identified RA-associated ROHs. Ours is the first genome-wide homozygosity association study for RA and characterized the ROH patterns associated with RA in the genomes of 2,000 RA patients and 3,000 normal controls of the Wellcome Trust Case Control Consortium. Genome scans consistently pinpointed two regions within the human major histocompatibility complex region containing RA-associated ROHs. The first region is from 32,451,664 bp to 32,846,093 bp (−log10(p)>22.6591). RA-susceptibility genes, such as HLA-DRB1, are contained in this region. The second region ranges from 32,933,485 bp to 33,585,118 bp (−log10(p)>8.3644) and contains other HLA-DPA1 and HLA-DPB1 genes. These two regions are physically close but are located in different blocks of linkage disequilibrium, and ∼40% of the RA patients' genomes carry these ROHs in the two regions. By analyzing homozygote intensities, an ROH that is anchored by the single nucleotide polymorphism rs2027852 and flanked by HLA-DRB6 and HLA-DRB1 was found associated with increased risk for RA. The presence of this risky ROH provides a 62% accuracy to predict RA disease status. An independent genomic dataset from 868 RA patients and 1,194 control subjects of the North American Rheumatoid Arthritis Consortium successfully validated the results obtained using the Wellcome Trust Case Control Consortium data. In conclusion, this genome-wide homozygosity association study provides an alternative to allelic association mapping for the identification of recessive variants responsible for RA. The identified RA-associated ROHs uncover recessive components and missing heritability associated with RA and other autoimmune diseases.

A run of homozygosity (ROH) denotes a contiguous set of homozygous genotypes in an intact genomic region. A practically used definition of ROH allows a rich homozygote region interrupted by a small number heterozygous genotypes arising from genotyping errors, missing genotypes, or mutations. An ROH that includes a sizable tract of homozygosity and deviates from a random distribution in the genome is denoted as ''homozygosity disequilibrium'' in this study. This type of ROH may result from various mechanisms including: 1) chromosomal aberrations, (e.g., uniparental disomy, hemizygous deletion, and/ or loss of heterozygosity [28,29,30,31,32]); 2) autozygosity as a consequence of inbreeding, consanguineous marriage, or a recent common ancestor [33,34,35,36,37]; and 3) natural selection, e.g., positive selection or selective sweep [38,39,40]. Homozygosity disequilibrium has frequently been observed in the general outbred population [34,41,42], but it is also not entirely benign as it increases the susceptibility to diseases such as neurodevelopment-related disorders [40,43] and other autoimmune diseases [44].
Homozygosity mapping aims to identify ROH(s) associated with disease states and was originally developed to map genes responsible for recessive diseases by using genetic marker data from inbred pedigrees [45,46,47,48,49]. Recent studies have also showed that homozygosity association mapping is a statistically powerful method when identifying susceptibility genes associated with complex diseases [40,43], cancers [50,51,52,53], and phenotypic traits [54,55,56]. Various statistical methods of homozygosity association mapping have been developed in order to analyze genotype data [35,53,57,58,59] or fluorescence intensity data [60,61,62,63] from SNP microarrays. To the best of our knowledge, however, studies have not been performed for genome-wide homozygosity association mapping for RA. Additionally, ROHs have not been used as genetic markers for the prediction of RA status. Instead of focusing on allelic association as have previous genome-wide association studies for RA [17,18,20], this study examined the hypothesis that ROHs act as recessiveacting determinants in the underlying genetic mechanisms of RA and identified RA-associated ROHs using genome-wide homozygosity association mapping.

Power calculations
Based on the simulation procedures described in Appendix S1, values for the powers of simulated genome-wide homozygosity association mappings were calculated using 2,000 patients and 3,000 controls in a simulation study of 1,000 replications ( Figure 1). We always used a genome-wide significance level of 2log 10 (p).8. First, we considered the scenario for which a disease-associated ROH consisted of L consecutive SNPs (L = 200). When 30%, 20%, and 10% of the RA patients carried this ROH (effect size, d, = 0.3, 0.2, 0.1), the power needed to detect the ROH was calculated as 1.000, 1.000, and 0.814, respectively, for a genome scan using a window size (W) of 100 SNPs (W = 100), or calculated as 1.000, 1.000, and 0.790, respectively, for W = 150, or as 1.000, 1.000, and 0.795, respectively, for W = 200. We also incorporated a heterozygous interference value (e), as a fraction that denoted incomplete homozygosity in the disease-associated ROH that may be caused by genotyping errors or unknown mutation mechanisms. The power required for no heterozygous interference was very similar to the power required for 10% heterozygous interference. When e = 0.2 and d = 0.3 or 0.2, the power was 1.000. However, the power was reduced to 0.141 for a genome scan with W = 100, reduced to 0.263 for W = 150, and to 0.463 for W = 200 (when e = 0.2 and d = 0.1). We also considered a diseaseassociated ROH for L = 150 or 100 and found the powers to be very similar to that found for L = 200.

Genome-wide homozygosity association scans
We conducted genome-wide homozygosity association scans with W = 100, 150, and 200 for the Wellcome Trust Case Control Consortium (WTCCC) SNP data (WTCCC_100, WTCCC_150, and WTCCC_200, respectively). Each genomic scan identified ROHs that satisfied the genome-wide significance criterion of 2log 10 (p).8 ( Figure 2). The identified regions and the respective maximum values of 2log 10 (p) within the identified regions are as follows. The WTCCC_100 scan identified three regions on chromosome 6p [2log 10  The first of these overlapping ROHs ranged from 32,451,664 bp to 32,846,093 bp and is located within the human major histocompatibility complex (MHC) region at 6p21.3, and the second ranged from 32,933,485 bp to 33,585,118 bp and overlaps the MHC region ( Figure 3). The two regions are located in different blocks of linkage disequilibrium (LD). The names of the genes within these two regions are shown in red in Figure 3. The first region contains 10 genes (from BTNL2 to HLA-DQB2), and the number of SNPs and the average intermarker distance are 125 and 3.1554 kb, respectively. The maximum 2log 10 (p) values for the scans are 37.5332 for WTCCC_100, 34.2091 for WTCCC_150, and 22.6591 for WTCCC_200. The second region contains 33 genes (from PSMB9 to ZBTB9), and the number of SNPs and the average intermarker distance are 134 and 4.8629 kb, respectively.
The proportion that RA patients carried a specific ROH (pROH) is higher than in normal controls in the two regions of homozygosity disequilibrium. For sliding windows anchored by SNPs within the first region, the maximum number of pROHs, as a fraction, for the patient data is 0.2206 for WTCCC_100, 0.2331 for WTCCC_150, and 0.2071 for WTCCC_200. These values are greater than those of the controls: 0.0996 for WTCCC_100, 0.1003 for WTCCC_150, and 0.1003 for WTCCC_200. In the second region, the maximum 2log 10 (p) values are 9.8852 for WTCCC_100, 9.0952 for WTCCC_150, and 8.3644 for WTCCC_200. The maximum number of pROHs is 0.1381 for WTCCC_100, 0.1331 for WTCCC_150, 0.1341 for WTCCC_200, and these figures are greater than the maximum number of pROHs for the normal control data (0.1003 for WTCCC_100, 0.1003 for WTCCC_150, and 0.1003 for WTCCC_200).

Copy number determination
We detected genomic deletions (copy number ,2) and amplifications (copy number .2) in the MHC regions of the 2,000 RA patients and 3,000 controls from the WTCCC study ( Figure 4). Regarding the genomic deletions, no region in the RA patients was found to have a significantly greater proportion (a proportion difference .2%) of deletions than regions of the controls. Conversely, one region from the controls, rs1431403 (33,155,009 bp) to rs7764491 (33,168,818 bp), had a greater proportion of deletions than the regions from RA patients. The , and WTCCC_200 scans are identified by the sky blue, light green, and orange horizontal bars, respectively. The outermost vertical bars denote the first SNP (gray tick) in the first window and the last SNP (gray tick) in the last window. Additionally, the first anchor SNP and the last anchor SNP for regions with 2log 10 (p).8 are marked using bold ticks. If two regions identified by the same genome scan overlap, the segment containing the overlapping regions is shown in dark blue, green, and orange for WTCCC_100, WTCCC_150, and WTCCC_200, respectively. The names of genes within the annotated regions are given above the bars. The names of genes in the regions identified by the three scans are shown in red, and the names of the genes identified by one or two scans are shown in blue. The location and width of the bars that prefix the gene names reflect the physical position and the size of the genes. LD structures are provided in the lower panels in which the higher intermarker LDs are in red. doi:10.1371/journal.pone.0034840.g003 average proportion difference is 0.0509. Regarding genomic amplifications, three regions from the RA patients had a greater proportion of amplifications (a proportion difference .2%) than those of the controls. The three regions are rs2516670 (30,542,978 bp) to rs9295931 (30,977,693 bp), rs9295961 (31,275,477 bp) to rs9295967 (31,291,999 bp), and rs2736177 (31,694,073) to rs2299851 (31,826,581 bp), and the average proportion differences for the RA patient data minus the control data are 0.0282, 0.0201, and 0.0214, respectively.

Discussion
Our study represents the first genome-wide homozygosity association scans for RA; we pinpointed important RA-associated ROHs in the MHC region and confirmed this region to be associated with RA [64,65]. For the two ROHs, the window with the best prediction accuracy 62% is anchored by the SNP rs2027852. We validated the results derived from the WTCCC data by using the independently acquired NARAC data ( Figure  S2). Homozygosity disequilibrium was consistently found in the MHC region, for which the respective maximum values of 2log 10 (p) for NARAC_100 (W = 100) and NARAC_200 (W = 200) are 2log 10 (p) = 7.6973 and 2log 10 (p) = 7.1334, respectively, which are highly significant values.
The SNP rs2027852 is flanked by HLA-DRB6 and HLA-DRB1. The HLA-DRB1-shared epitope is an important determinant of RA susceptibility [10]. Associations between HLA-DRB1 and RA susceptibility [10,11,12,13,66,67] and between HLA-DRB1 and the severity of RA [68,69] have been made. In addition to HLA-DRB1, a second relevant ROH includes HLA-DPA1 and HLA-DPB1. Previous studies produced inconclusive results concerning the relationship between RA and HLA-DPA1 and HLA-DPB1 [70,71,72]. Despite the evidence of statistical significance supported by this study, more functional studies are necessary to reconfirm the genetic associations with RA.
We found that the observed homozygosity disequilibrium in the MHC region is not explained by mechanisms associated with hemizygous deletion because our copy number analysis found only a very small proportion of the samples had acquired DNA deletions in the MHC region ( Figure 4). The RA-related ROHs probably were not generated from copy-neutral chromosomal aberrations, e.g., uniparental disomy and loss of heterozygosity, because such chromosomal abnormalities often result in severe inherited disorders and cancers, which the patients of the study did not have. Inbreeding, as the cause of the homozygosity disequilibrium, also seems unlikely as the patients were not an inbred population(s).
Selective sweep, a type of natural selection, seems to be a plausible mechanism for the appearance of homozygosity disequilibrium in general population [40]. Homozygosity disequilibrium in the MHC region, which has been shown to contain the important functional genes related to RA and other autoimmune diseases [64,65,73,74], results in a loss of genetic diversities and thereby influences quantitative and/or qualitative alternations of expression profiles. Some studies have found that autoimmunity susceptibility genes are positively selected in RA [75,76,77,78]. Selected alleles accumulate in the gene pool over time and consequently increase the probability of generating an ROH. Genomic regions with a small recombination fraction and a large LD tend to contain even more ROHs than do regions with large recombination fractions or a small LD; for example, the time necessary for a region to be affected by selective pressure is so short that a limited number of recombinations prevents a rapid decay of LD and thereby promotes the occurrence of ROHs [39]. For type-1 diabetes, a relevant study has also pointed out significant SNP identity and conserved extended haplotypes in the MHC region [44]. That and our study reinforce the idea that natural selection may be critical to maintaining functionally important genes [79] and susceptibility to complex diseases [80].
Our study attempted to tackle several difficulties associated with homozygosity association mapping, which is defined as the identification of ROHs associated with a given disease. However, the observed, extended homozygosity may contain a run of homozygotes, hemizygotes, or a combination of both, and the different types of runs may reflect different genetic mechanisms associated with a disease. For genotype-based homozygosity association mapping, it is difficult to distinguish the differences between true homozygosity (a homozygous run) and spurious homozygosity (a hemizygous run) [81,82]. Therefore, we employed genotype-based homozygosity association scans and intensity-based copy-number characterization to discriminate between copy-neutral homozygosity and deletion-induced hemizygosity for the RA-associated ROHs. Additionally, missing genotypes or heterozygous calls that arise from genotyping errors or recent mutations may interrupt a homozygous run (imperfect ROH). The genome-wide homozygosity association mapping used in this study overcame these obstacles by imputing missing genotypes and correcting for the modest heterozygous interference with the use of a local polynomial fit [53].
The required minimum power value and sample size for genome-wide homozygosity association mapping for complex diseases have not been explicitly determined in previous studies [81]. Our simulations provided an objective assessment of how the values for the power and the number of samples affect the results, and the results for the simulations suggest that we used sufficient sample numbers to attain reasonable statistical power to detect RA-associated ROHs in this study. In contrast to a single-SNP recessive model, the homozygosity association tests provided by LOHAS and ROH programs are multilocus analysis methods. The two multilocus methods make use of genetic information from extent of homozygosity, which is a function of LD, recombination fraction, and population history [40]. Recessive-acting disease alleles in an ROH predisposing to a disease are accumulated and made use to elevate the low power of a single-SNP analysis due to rare disease alleles at single SNPs.
Population substructure/admixture is an important confounding factor in genome-wide case-control association studies. Ignoring the difference of genetic substructure/admixture in case and control groups may lead to false-positive findings. We thus also performed genome-wide homozygosity association test with an adjustment for population substructure/admixture using principal components. We regressed the homozygosity intensity estimates from LOHAS software [53] on case/control disease status and the first 10 principal components from EIGENSTRAT software [83] to validate genetic association we identified in the MHC region. We found that genetic association between the identified ROHs in the MHC region and RA disease status remained very significant after taking population substructure/ admixture into account ( Figure S3). The maximum 2log 10 (p) values for the scans were 28.4155 for WTCCC_100, 23.1904 for WTCCC_150, and 14.6061 for WTCCC_200 in the first peak region and 8.6160 for WTCCC_100, 7.5250 for WTCCC_150, and 7.4240 for WTCCC_200 in the second peak region. The results explain that our findings in the MHC region are valid and robust to population substructure/admixture. RA-associated ROHs identified by LOHAS software was also evaluated by a second homozygosity association method. ROH program [40], which has been integrated into HelixTree software (HelixTree, Inc.), was run to examine homozygosity association in the MHC region. Several parameter combinations for defining an ROH were considered in the analysis using ROH program. At the Bonferroni significance level, two significant RA-associated ROHs identified by LOHAS software were validated by ROH program ( Figure S4).
In conclusion, our genome-wide homozygosity association study used high-density SNP array data to provide an alternative method to an allelic association study for mapping RAsusceptibility genes. Excess ROHs were found in the MHC regions of RA patients compared with those of controls, which uncovered a recessive component and missing heritability for RA and possibly other autoimmune diseases.

Study materials
We used SNP data from the WTCCC [18] that was obtained from 1,999 RA patients and 3,002 controls. Of the control samples, 1,502 were from the 1958 British Birth Cohort study and 1,500 were from the UK Blood Service. All samples were genotyped using the Affymetrix 500K SNP GeneChip system (Affymetrix Inc., Santa Clara, CA, USA). Genotypes were called using the genotype-calling algorithm, CHIAMO [18]. Samples from 868 RA patients and 1,194 normal controls participating in the NARAC [17] were used to independently validate the results of the WTCCC data. All samples were genotyped using the Infinium HumanHap550 SNP BeadChip system (Illumina Inc., San Diego, CA, USA). Genotypes were called with the genotyping module of BeadStudio. All samples passed a quality control examination. The SNP and gene annotation information including the physical positions and the associated genes were taken from the NCBI dbSNP Build 123.

Statistical methods
A genome-wide non-parametric association test was applied to map regions of homozygosity disequilibrium in the genomes of the RA patients. Given a target SNP (anchor) on a chromosome, a window containing the target SNP and W-1 nearest neighbor SNPs was constructed. Windows were slid along the chromosomes. For the genomes of each individual studied and for each window, a homozygote intensity (fraction) of SNPs was estimated by non-parametric local polynomial fitting [84] with a tricubic weight function. Dependent variable in the local regression is the homozygous/heterozygous states of SNPs and independent variable is physical position of the SNPs [53]. Then, in each window, the estimated homozygote intensities for each individual were compared with the median homozygote intensities for all patient and control samples to calculate the Kullback-Leibler distance [85]. The larger the distance was, the greater the fraction of homozygous SNPs. A Wilcoxon rank sum test [86] was applied to compare the Kullback-Leibler distances for the patient and control groups, and then to identify windows/regions of greater median homozygote intensity for the patient genomes. The aforementioned procedures were executed by using LOHAS software (http://www.stat.sinica.edu.tw/hsinchou/genetics/loh/ LOHAS.htm) [53]. Homozygote intensities in the regions of ROHs are used to predict RA status using statistical discriminant analysis [87] and a 10-fold cross-validation procedure. The average prediction accuracy of the fitted classifiers for the RA patients and the controls was calculated using the R package. Copy number analysis was performed using the Partek Genomics Suite (Partek, Inc.). Copy numbers were determined from the allele intensities with an adjustment for local GC content. Copy number alternations, including gene amplifications and deletions, were inferred by genomic segmentation for which the default parameters recommended by Partek were used. Figure S1 Distribution of the fraction of RA patients carrying ROHs in the two regions of homozygosity disequilibrium. There are 60 anchor SNPs in the two regions that satisfy 2log 10 (p).8. The first region (ROH1) contains 26 anchor SNPs and 5 genes, and the second region (ROH2) contains 34 anchor SNPs and 4 genes. A red point is plotted if a patient carried an ROH at an anchor SNP; otherwise the space is blank. The relative positions of 9 genes in these 2 regions are shown, and the 5 anchor SNPs used to tag rs9268831, rs2027852, rs9272723, rs3077, and rs9277542 are also marked. (PPT) Figure S2 Genome-wide homozygosity association scans for the NARAC and WTCCC data. The values of 2log 10 (p) at anchor SNPs for the two genome-wide homozygosity association scans, NARAC_100 (W = 100) and NARAC_200 (W = 200), are displayed. A genome-wide significance level of 2log 10 (p) = 8 is marked by the purple, horizontal line. The results for the WTCCC_100 and WTCCC_200 scans are provided for comparison. Peaks with 2log 10 (p) values above the significance line and signals that were consistently identified by the four scans were found in the MHC region on chromosome 6p21.3. (TIF) Figure S3 Homozygosity association scans with an adjustment for population substructure/admixture in the MHC region for the WTCCC data using principal components. The values of 2log 10 (p) at the anchor SNPs for the three homozygosity association scans, WTCCC_100, WTCCC_150, and WTCCC_200, are displayed. WTCCC_100, blue line, circles; WTCCC_150, green line, crosses; WTCCC_200, orange line, triangles. (TIFF) Figure S4 Homozygosity association scans in the MHC region for the WTCCC data using ROH program. Two parameters for defining an ROH are required in ROH program: the minimum run length (Rmin) and the minimum number of samples (Smin). ROHs are disregarded if the number of homozygous SNPs is less than Rmin. SNPs are removed if the number of samples for which that SNP is a member of an ROH is less than Smin (the details can refer to the user guide of ROH program in HelixTree software). This analysis considered Rmin = {50, 100, 150, 200} and Smin = {100, 150, 200, 250, 300}. Moreover, 10,000 permutations were performed to evaluate genetic association between affection status of RA and ROHs in the MHC region. In each subfigure, the horizontal axis denotes physical position (unit: Mb) on chromosome 6 and the vertical axis denotes p-value (2log 10 scale) from the homozygosity association test used in ROH program. A green solid line indicates a raw empirical p-value of homozygosity association tests from 10,000 permutations. Value of the raw empirical p-value is shown above the green line. Physical positions of starting and ending SNPs of an ROH are listed below the green line. A red dashed line indicates the Bonferroni significance level, i.e., 0.05/30 in this analysis. If no ROH was found under a certain parameter combination of Rmin and Smin, an empty subfigure is shown.

(TIFF)
Appendix S1 Simulation studies for evaluating power of the homozygosity association test used in this paper. (DOC)