Specificity of the STAT4 genetic association for severe disease manifestations of systemic lupus erythematosus.

Systemic lupus erythematosus (SLE) is a genetically complex disease with heterogeneous clinical manifestations. A polymorphism in the STAT4 gene has recently been established as a risk factor for SLE, but the relationship with specific SLE subphenotypes has not been studied. We studied 137 SNPs in the STAT4 region genotyped in 4 independent SLE case series (total n = 1398) and 2560 healthy controls, along with clinical data for the cases. Using conditional testing, we confirmed the most significant STAT4 haplotype for SLE risk. We then studied a SNP marking this haplotype for association with specific SLE subphenotypes, including autoantibody production, nephritis, arthritis, mucocutaneous manifestations, and age at diagnosis. To prevent possible type-I errors from population stratification, we reanalyzed the data using a subset of subjects determined to be most homogeneous based on principal components analysis of genome-wide data. We confirmed that four SNPs in very high LD (r(2) = 0.94 to 0.99) were most strongly associated with SLE, and there was no compelling evidence for additional SLE risk loci in the STAT4 region. SNP rs7574865 marking this haplotype had a minor allele frequency (MAF) = 31.1% in SLE cases compared with 22.5% in controls (OR = 1.56, p = 10(-16)). This SNP was more strongly associated with SLE characterized by double-stranded DNA autoantibodies (MAF = 35.1%, OR = 1.86, p<10(-19)), nephritis (MAF = 34.3%, OR = 1.80, p<10(-11)), and age at diagnosis<30 years (MAF = 33.8%, OR = 1.77, p<10(-13)). An association with severe nephritis was even more striking (MAF = 39.2%, OR = 2.35, p<10(-4) in the homogeneous subset of subjects). In contrast, STAT4 was less strongly associated with oral ulcers, a manifestation associated with milder disease. We conclude that this common polymorphism of STAT4 contributes to the phenotypic heterogeneity of SLE, predisposing specifically to more severe disease.


Introduction
Systemic lupus erythematosus (SLE) (OMIM 152700) is a disabling and chronic autoimmune disease with remarkable heterogeneity. The eleven classification criteria for SLE established by the American College of Rheumatology [1] -any four of which can confirm classification as SLE -include arthritis, renal disease, mucocutaneous manifestations, photosensitivity, neurological disorders, production of a variety of autoantibodies, and hematological disorders. Many of these characteristics are correlated, and may indicate different underlying disease mechanisms. SLE also has an established but complex genetic component [2]. Understanding the relationships between SLE risk genes and subtypes of the disease may help to elucidate disease mechanisms and pathways.
Recently, a polymorphism of the STAT4 gene on chromosome 2q has been strongly implicated in the risk for both SLE and rheumatoid arthritis [3]. We investigated whether variation in STAT4 contributes to the heterogeneity of SLE. Using 4 independent SLE case series, a large set of healthy controls, and two independent sets of genotypes for the STAT4 region on these subjects, we have found strong evidence that this is the case. In particular, we have found that the STAT4 susceptibility polymorphism is associated with more severe disease manifestations, including nephritis and early disease onset. It is also strongly associated with SLE characterized by double-stranded DNA autoantibody production.

Subjects
SLE cases were obtained from four sources. Patients from the University of California, San Francisco (UCSF) were participants in the UCSF Lupus Genetics Project and were recruited from UCSF Arthritis Clinics and private rheumatology practices in northern California, as well as by nationwide outreach [4]. SLE patients contributed by the Autoimmune Biomarkers Collaborative Network (ABCoN) [5] were recruited from the Hopkins Lupus cohort [6]. A third case series was part of the Multiple Autoimmune Disease Genetics Consortium (MADGC) collection [7]. Finally, a fourth set of cases recruited from the Pittsburgh Lupus Registry were obtained from the University of Pittsburgh [8]. Only subjects of self-described European descent were retained. Unrelated controls of European ancestry were from the New York Health Project (NYHP) [9] (http://www.amdec.org/ amdec_initiatives/nycp.html). The study populations are a superset of those recently used to establish a link between SLE and STAT4 [3], with the addition of the University of Pittsburgh cases and more than doubling the number of NYHP controls (see Table 1). The Institutional Review Boards of all investigative institutions approved these studies, and all cases and controls gave written informed consent.
Clinical data for the cases was obtained from medical records which were reviewed and tabulated at each institution. We chose to examine the ACR criteria [1] (http://www.rheumatology.org/ publications/classification/SLE/sle.asp) and age at diagnosis, categorizing the age at diagnosis for association analyses. The mean and median for age at diagnosis were 34 and 32, respectively; we chose a cutoff for early diagnosis of under 30 years of age versus greater than or equal to 30 years of age. We also chose to examine production of autoantibodies to doublestranded DNA (anti-dsDNA), as this is typically associated with severe disease and this was available in the clinical data from all sites. Finally, we used more detailed nephritis information available for the UCSF and ABCoN cohorts, namely a characterization of those patients with severe nephritis as defined by end-stage renal disease or histopathologic evidence of severe, progressive renal disease on renal biopsy.

Genotyping and SNP Selection
Genotype data were obtained from two parent studies (see Table 1). The four SLE case series and 1762 NYHP controls were genotyped using the Illumina HumanHap550 array as part of a genome-wide association study of SLE [10]. In addition, three of the case series (UCSF, ABCoN, and MADGC) and 1243 NYHP controls were genotyped for 67 SNPS in the STAT1/STAT4 region of chromosome 2q as part of a case-control study of STAT4 and two systemic autoimmune diseases, rheumatoid arthritis and SLE [3]. Selection and genotyping of these 67 fine-mapping SNPs was done by the National Institute for Arthritis and Musculoskeletal and Skin Diseases (NIAMS), using Sequenom MassARRAY Technology as previously described [3]. From the Illumina 550K panel, 91 contiguous SNPs from the STAT1/STAT4 region, extended with flanking regions 200kb on either side, were selected; of these, 45 were contained in the same region as the 67 SNPs, with 21 of those being identical. Coverage of these SNPs was analyzed using Tagger [11] in Haploview [12] with an r 2 threshold

Author Summary
Systemic lupus erythematosus is a chronic disabling autoimmune disease, most commonly striking women in their thirties or forties. It can cause a wide variety of clinical manifestations, including kidney disease, arthritis, and skin disorders. Prognosis varies greatly depending on these clinical features, with kidney disease and related characteristics leading to greater morbidity and mortality. It is also complex genetically; while lupus runs in families, genes increase one's risk for lupus but do not fully determine the outcome. It is thought that the interactions of multiple genes and/or interactions between genes and environmental factors may cause lupus, but the causes and disease pathways of this very heterogeneous disease are not well understood. By examining relationships between subtypes of lupus and specific genes, we hope to better understand how lupus is triggered and by what biological pathways it progresses. We show in this work that the STAT4 gene, very recently identified as a lupus risk gene, predisposes specifically to severe manifestations of lupus, including kidney disease. Duplicate genotyping enabled an analysis of SNP concordance between the two genotyping methods, and inclusion of genotypes that were called by either method. Conflicting genotype calls were dropped from analyses when using combined data.

Statistical Analysis
Subjects were first removed for whom there was evidence of duplication or relatedness in the Illumina 550K data, using IBS estimation in PLINK [13] (http://pngu.mgh.harvard.edu/ purcell/plink). For the choice of which sample to remove, preference was first based on availability of phenotype data, and then on overall genotyping call rate. SNPs were removed from analysis that had a minor allele frequency less than 5%, greater than 10% missing genotypes, or Hardy-Weinberg equilibrium p,0.001 in controls.
In order to choose loci to examine for phenotype analyses, we performed allelic and conditional tests. These analyses were performed separately for the 91 SNPs from the Illumina 550K panel and the 67 Sequenom SNPs, since they contained overlapping but different sets of SNPs (Table S1) and subjects (Table 1), and large amounts of missing data could bias haplotype estimation. We also analyzed the full set of 137 SNPs together using only the subset of subjects who were genotyped on both platforms. For each analysis, subjects were removed that had less than 90% genotyping. We first conducted allelic tests of cases and controls using PLINK [13] and selected those that had p,0.005; at this first screening stage we used a liberal p-value, considering that there are well over 10 independent haplotype blocks in the complete region. To eliminate redundant SNPs having effects only due to linkage disequilibrium, we then performed conditional analysis using WHAP [14] (http://pngu.mgh.harvard.edu/ purcell/whap).
SNP rs7574865 was chosen for phenotype analysis based on the allelic and conditional analyses (see Results). SNP rs7574865 was genotyped on both platforms with a very high rate of concordance (see Results), so genotypes from both platforms were combined for the phenotype analysis of rs7574865; the single subject for whom the calls conflicted was dropped. We first performed case-only analyses (e.g. presence of renal disorder versus no renal disorder) to establish which subphenotypes are associated with rs7574865 variation. We then performed case-control analyses (e.g. SLE with renal disorder versus controls) to examine the risk that is conferred by rs7574865 on subtypes of SLE characterized by each of those subphenotypes. In both sets of analyses, first bivariate odds ratios (ORs) and 95% confidence limits were determined for each subphenotype. To correct for variability among strata when combining data from different cohorts, we used Mantel-Haenszel tests and combined ORs. In order to investigate the possibility of associations with unknown but common underlying disease mechanisms, principal components analysis (PCA) was performed using all subphenotypes except severe nephritis (a subclass of nephritis, and available only for the UCSF and ABCoN case series). Values for the first two principal components (PCs) were evaluated as above as additional subphenotypes, categorized by positive or negative.
To address the concern that case-control studies may give spurious associations due to undetected population admixture or population substructure differences between cases and the controls, we utilized ancestry data for the Illumina 550K genotyped subjects. Ancestry was derived from ancestry-informative markers (AIMs) contained in the Illumina 550K panel. First a set of 235 AIMs was used to estimate percent European ancestry, using STRUCTURE [15]. For those subjects with .90% European ancestry, another set of 1409 EUROSTRUCTURE [16] AIMs was used to estimate percent Northern European versus Southern ancestry. Finally, a subset of 1253 subjects (751 cases and 502 controls) was identified that was homogeneous based on the first four PCs determined by PCA using the 550K panel and EIGENSTRAT [17] software. Minimum covariance determinant (MCD) estimators of PC location and scatter were calculated using R [18]; outliers were then determined using robust Mahalanobis distance. The procedure was applied in two steps, first using both cases and controls (significance level a = 0.05), and then using the case-only robust estimators of location and scatter (a = 0.10), which led to a more homogeneous case-control sample set. The l gc was decreased from 1.256 to 1.045 for the homogeneous set when assessed using the 550K panel (see Figure S1).
We analyzed the associations between $90% European versus ,90% European and $90% Northern European versus ,90% Northern European ancestry and rs7574865 in controls, using an allele-based exact test. We also reanalyzed all tests using the homogeneous subset of subjects. Finally, in multivariate analysis we adjusted for ancestry, sex, and disease duration. For this multivariate analysis, ancestry was a 3-category variable as follows: 1) ,90% European, 2) $90% European and $90% Northern European, and 3) $90% European and ,90% Northern European. We chose this coding due to the highly skewed distribution of continuous ancestry, and the collinearity between the European and Northern European variables.
Since we are examining associations for 13 phenotypes, the issue of multiple testing must be considered. However, since these are not independent phenotypes, a simple Bonferroni correction of a = 0.05/13 = 0.004 is clearly overly conservative, while an unadjusted a = 0.05 is clearly liberal. For this reason we have chosen to present unadjusted p-values so that these may be directly interpreted by the reader. Stata 9.2 (http://www.stata.com/) was used for correlations, odds ratios and p-values, Mantel-Haenszel tests and combined ORs, phenotype principal components analysis, and multivariate logistic regressions.

Subjects and Phenotypes
The numbers of independent cases and controls in each cohort and a summary of available genotype and phenotype data are listed in Table 1. For overlapping SNPs, including rs7574865, there were 1396 genotyped cases with phenotype data, and 2560 genotyped healthy controls. A summary of subphenotypes by cohort is presented in Table 2. There were significant differences among the cohorts for all phenotypes except neurologic disorder and age at diagnosis less than 30 years old.
Some of these phenotypes are highly correlated; in particular anti-dsDNA is a subcriterion for the ACR immunologic criterion, and is associated with renal disease. Pairwise correlation coefficients, for those pairs having r.0.1, are shown in Table 3. All p-values for these pairs were #0.0001. In principal components (PC) analysis of the phenotype data, the top 3 components of the first PC are anti-dsDNA, the immunologic criterion, and renal disease. The top 3 components of the second PC are malar rash, photosensitivity, and discoid rash. Variables corresponding to the first and second PCs were included in phenotype analyses (see Methods).

SNP Coverage, Concordance, and Merging
Of the 67 Sequenom SNPs (shown with the study genotypes in Figure 1A), 62 passed quality control filters, including MAF $5%. These 62 SNPs had 86% coverage of the common variation (MAF $5%) in the STAT1/STAT4 region. In the Illumina 550K panel (shown with the study genotypes in Figure 1B), 77 out of 91 SNPs passed quality control and had 83% coverage of the extended region obtained by adding flanking markers 200kb on either side of the Sequenom STAT1-STAT4 region. A complete list of SNPs on both platforms passing our quality control criteria, along with their MAF and percentage genotyped, is provided in Table S1.
We examined the concordance between calls for the 21 overlapping SNPs and 1458 subjects who were genotyped using both methods. Results for this are shown in Supplementary Table  S2. The average and minimum agreement were 99.90% and 99.65%, respectively. In particular for SNP rs7574865, the agreement was 99.93%. Given this high rate of concordance, we chose to merge genotype data for rs7574865 for the phenotype analyses (see below).  Allele Tests and Conditional Analysis Table 4 contains allelic p-values before and after conditioning on the most significant SNP, for those with initial allelic p-values of 0.005 or less. We did four separate conditional analyses: (A) subjects and SNPs genotyped on the Illumina 550K; (B) the genetically homogeneous subset of subjects (see Methods) typed on the Illumina 550K; (C) subjects and SNPs genotyped on the Sequenom platform; and (D) all SNPs for those subjects that were genotyped on both platforms. In the Illumina 550K panel, rs7574865 (circled in Figure 1B) was the most significant SNP in both the full set of subjects and the homogeneous subset (see Methods). In the Sequenom 67-SNP set and in the combined set, the 4 top SNPs were in high LD (D' = 0.97 to 0.99, r 2 = 0.94 to 0.99) and made up a 4-marker haplotype for which the components could not be independently analyzed (circled in Figure 1A). Of the estimated individual haplotypes of these 4 markers, over 99% were either CGTC or TTCG, so that any one SNP fully determined the other 3 in the vast majority of subjects. Table 4 test the significance of each SNP conditional on the values of the top SNP(s) which are given in bold. While there were some results of borderline significance, they were neither strong nor consistent across the different analyses. The only compelling evidence after conditioning was for the 4locus haplotype above. Since any of the 4 SNPs serves as a marker of this haplotype and rs7574865 is contained in both genotyping sets, we chose to carry out phenotype analysis using this marker for maximum power.

Ancestry Variability of rs7574865
We examined the minor allele frequencies for rs7574865 (Table  S3)    HCB, JPT, YRI, and CEU populations, respectively.) The minor allele frequencies were very similar, 22.1% and 22.6% respectively, for subjects of either ,90% or $90% Northern European ancestry; thus we did not observe a Northern-Southern European gradient for rs7574865. Analyses were repeated with the homogeneous subset of cases and controls (n = 1253) as described in Methods.

Phenotype Case-Only Analysis
We first examined each subphenotype for association with rs7574865 within the SLE cases. In unadjusted results (Table S4), only one phenotype showed borderline evidence for heterogeneity of association among the four SLE case series (p = 0.04 for immunologic disorder); thus we retained combined Mantel-Haenszel odds ratios and p-values. The most significant associations were with anti-dsDNA autoantibodies and the first principal component, OR = 1.44 (95% CI 1.23-1.70), p = 10 25 , and OR = 1.43 (95% CI 1.21-1.70), p = 3610 25 , respectively. Severe nephritis, available on a smaller subset of cases, had the highest OR = 1.50 (95% CI 1.11-2.01), p = 0.0075. There was also support for associations with immunologic criteria (OR = 1.24, p = 0.017), renal disorder (OR = 1.23, p = 0.024), age at diagnosis under 30 (OR = 1.22, p = 0.018), and an inverse association with oral ulcers (OR = 0.80, p = 0.0087). Table 5 contains case-only analyses, for phenotypes having unadjusted p,0.05, repeated first on the homogeneous subset of subjects (see Methods), and also using multivariate adjustment for ancestry, sex, and disease duration. There is consistency in odds ratios throughout and these analyses continue to support the aforementioned phenotypic associations with rs7574865. Some associations are even stronger in the homogeneous subset analysis,  Table 6 shows our primary results, the risk of SLE characterized by each subphenotype versus healthy controls. This illustrates a There is also strong evidence that the STAT4 risk allele is less frequent in SLE with oral ulcers, MAF = 28.8%, which is generally associated with milder disease.

Discussion
Genotype-phenotype associations between risk alleles and disease subtypes may give insight into disease etiology and mechanisms. Recent results show that rs7574865, a variant allele of STAT4, confers an increased risk for both SLE and rheumatoid arthritis (RA) [3,19], suggesting the involvement of common pathways of pathogenesis among these two autoimmune diseases. STAT4-deficiency is associated with accelerated renal disease and increased mortality [20] in a murine lupus model, but with protective effects for arthritis in knockout mice [21]. Since SLE is an extremely heterogeneous disease, with multiple correlated subphenotypes, we sought to investigate whether or not STAT4 appears to contribute to this phenotypic heterogeneity in human SLE. We believe that our data provide compelling evidence that STAT4 is associated with more severe SLE manifestations, particularly with nephritis and with the production of autoantibodies to double-stranded DNA. In contrast, other recentlydiscovered SLE risk polymorphisms do not appear to be strongly associated with severe disease manifestations [10].
There have been recent successes in the study of genotypephenotype associations in SLE and other autoimmune diseases. For example, PDCD1 has been shown to be associated with lupus nephritis and anti-phospholipid antibodies in ethnic subgroups [4], and PTPN22 is primarily associated with anti-cyclic citrullinated peptide (anti-CCP) [22] and rheumatoid factor (RF) [23] autoantibody positive RA. The STAT4 gene has been shown to be associated with both anti-CCP positive and negative RA [3]; it has not yet been investigated in the context of SLE subphenotypes.
Replication of genotype-phenotype associations can be challenging [24]; a strength of our study is the inclusion of four independent case series. Other strengths include the availability of two overlapping genotype sets in the STAT4 region for most of the subjects, including genome-wide data to facilitate ancestry analysis, and of course the availability of detailed phenotype data on all four of the case series.
A limitation of our study is that the subjects are of self-reported European ancestry and primarily female. It could be insightful to look at these associations in other populations, particularly since SLE has higher prevalence among African-Americans and other non-European populations [2]. The STAT4 gene has recently been shown to be associated with RA in a Korean population [19]; however significant associations with subphenotypes -namely age at onset, radiographic progression, and serologic status -were not found.
Another limitation is the inherent difficulty in obtaining accurate phenotype data. Differences between our 4 SLE cohorts may be true differences in patient characteristics, perhaps as a result of differences in selection, but could also be influenced by different methods of assessment and accuracy of individual records. However, although some of the phenotypes we examined are related to disease activity, and may fluctuate naturally or as a result of treatment, we classified SLE patients according to a history of these specific phenotypes. We are encouraged by the fact that our results were quite homogeneous across the different cohorts. Also, any misclassification would presumably be nondifferential with respect to genotypes, thus diluting our results rather than causing type I error.
Finally, it is important in genetic studies to protect against false associations due to undetected population substructure. Indeed there were some subjects in our cohort with sizeable non-European ancestry, in spite of being self-reported European, and those had a higher minor allele frequency for rs7574865. However, reanalysis of a more homogeneous subset of subjects of primarily northern European ancestry was very consistent with our overall results. There is even stronger evidence in this subset for relationships between the STAT4 rs7574865 SNP and nephritis subphenotypes, and for an inverse relationship with oral ulcers.
Since the subphenotypes having the strongest risk conferred by rs7574865 were highly correlated, we included clinical variables based on principal components (PC) analysis to investigate the possibility of common underlying effects. The first PC, associated with the severe manifestations of anti-dsDNA antibodies, nephritis and immunologic abnormalities, had similar associations as those of its components. Severe nephritis was consistently the most strongly associated subphenotype. The second PC, associated with the milder skin disease manifestations of malar rash, photosensitivity, and discoid rash, was not significantly associated with rs7574865 in any analysis.
In summary, our study has identified multiple correlated subphenotypes that are strongly associated with the STAT4 gene, including nephritis, autoantibodies to double-stranded DNA, and early age at diagnosis. The next challenge is identifying how these correlated features fit into causal pathways, and therefore to help elucidate the complex etiology of SLE.