Predicting the Risk of Rheumatoid Arthritis and Its Age of Onset through Modelling Genetic Risk Variants with Smoking

The improved characterisation of risk factors for rheumatoid arthritis (RA) suggests they could be combined to identify individuals at increased disease risks in whom preventive strategies may be evaluated. We aimed to develop an RA prediction model capable of generating clinically relevant predictive data and to determine if it better predicted younger onset RA (YORA). Our novel modelling approach combined odds ratios for 15 four-digit/10 two-digit HLA-DRB1 alleles, 31 single nucleotide polymorphisms (SNPs) and ever-smoking status in males to determine risk using computer simulation and confidence interval based risk categorisation. Only males were evaluated in our models incorporating smoking as ever-smoking is a significant risk factor for RA in men but not women. We developed multiple models to evaluate each risk factor's impact on prediction. Each model's ability to discriminate anti-citrullinated protein antibody (ACPA)-positive RA from controls was evaluated in two cohorts: Wellcome Trust Case Control Consortium (WTCCC: 1,516 cases; 1,647 controls); UK RA Genetics Group Consortium (UKRAGG: 2,623 cases; 1,500 controls). HLA and smoking provided strongest prediction with good discrimination evidenced by an HLA-smoking model area under the curve (AUC) value of 0.813 in both WTCCC and UKRAGG. SNPs provided minimal prediction (AUC 0.660 WTCCC/0.617 UKRAGG). Whilst high individual risks were identified, with some cases having estimated lifetime risks of 86%, only a minority overall had substantially increased odds for RA. High risks from the HLA model were associated with YORA (P<0.0001); ever-smoking associated with older onset disease. This latter finding suggests smoking's impact on RA risk manifests later in life. Our modelling demonstrates that combining risk factors provides clinically informative RA prediction; additionally HLA and smoking status can be used to predict the risk of younger and older onset RA, respectively.


Introduction
Rheumatoid arthritis (RA) is a common chronic inflammatory disorder. It results in substantial morbidity and disability alongside high medical and societal costs [1], [2]. There is therefore growing interest in preventing its development. Such prevention requires an ability to reliably predict who will develop RA. Advances in characterising genetic and environmental risk factors for RA together with developments in modelling methodology make predicting its development a realistic possibility.
RA is a clinical syndrome spanning multiple subsets [3]. The commonest subdivision is by the presence or absence of rheumatoid factor (RF)/anti-citrullinated protein antibodies (ACPA), termed seropositive and seronegative RA respectively.
Risk factor evaluation has mainly focussed on seropositive RA with nearly half its genetic architecture known. HLA-DRB1 alleles, in particular those encoding the shared epitope, dominate genetic risk accounting for approximately 36% of heritability [4]; 45 non-HLA variants explain approximately 15% of heritability [4].
Smoking is the main environmental risk factor [5]; it predisposes to seropositive RA and has a synergistic relationship with the shared epitope [6], [7]. Although single factors do not provide sufficient risk stratification, combining multiple factors within a prediction model may identify clinically relevant high- and low-risk groups. The large risks conferred by HLA make such modelling an attractive prospect in RA despite limited success in other complex disorders [8]–[10].
RA develops over many years prior to clinical presentation [11]. Initially, individuals with genetic susceptibility variants are exposed to environmental risks; some may develop autoantibodies (RF/ACPA) [12]. A proportion will subsequently develop arthralgia, which may progress to an unclassified arthritis followed by a fully expressed RA phenotype. Pilot studies in unclassified arthritis indicate that secondary prevention may be possible with corticosteroids [13], [14], methotrexate [15] and biologics [16] attenuating the progression to RA. Although preventive treatments may be more effective before immune dysregulation and symptoms develop, primary prevention is not currently possible as no reliable method exists to identify asymptomatic high-risk individuals.
We report an alternative modelling approach to predicting RA. Our novel modelling method uses computer simulation to categorise risk profiles; our models also incorporate a larger number of HLA risk variants. The risk factors included in our modelling comprise 15 four-digit/10 two-digit HLA-DRB1 alleles, 31 SNPs and male ever-smoking status (as ever-smoking is a significant risk for RA in males only). We applied our models to two large cohorts of European ancestry: the Wellcome Trust Case Control Consortium (WTCCC) and the UK RA Genetics Group (UKRAGG) Consortium. Our primary aim was to determine if our approach would generate clinically relevant predictive values. Our secondary aim was to determine if our modelling better identified YORA. We demonstrate that clinically informative RA risk prediction is possible and that the risk of younger and older onset RA can be predicted using information on HLA and smoking status, respectively.

Ethics Statement
All participants in WTCCC and UKRAGG were recruited after providing informed consent. UKRAGG was approved by the North West Multi-Centre Research Ethics Committee (MREC 99/8/84). The authors gained written permission and approval from WTCCC to undertake this work in the publicly available WTCCC1 collections.

Study Populations
The WTCCC dataset contains SNP data on 1,999 RA cases and 3,004 controls [28]. Controls were obtained from the 1958 British Birth Cohort and UK Blood Services. Genotyping was performed on the Affymetrix GeneChip 500k Mapping Array Set. Quality control (QC) procedures were undertaken excluding individuals with <97% SNP call rates, high heterozygosity, non-European ancestry or relatedness, discordance between genotype and phenotype data and duplicate samples. In the post-QC dataset information was available on 490,031 SNP markers; the total genotyping rate was 1.00. Two- or four-digit resolution HLA-DRB1 tissue typing data were available on 1,837 cases and 1,647 controls.
The UKRAGG dataset contains SNP data on 5,024 RA cases and 4,281 controls from 6 UK centres [29]. Genotyping was performed using the Sequenom platform. Four hundred and four SNPs were genotyped over 8 staggered plexes; for each plex separate QC was undertaken excluding individuals and SNPs with <90% data present. In the post-QC dataset total genotyping rates were 0.73 owing to systematic differences in samples run on each plex. Two- or four-digit resolution HLA-DRB1 tissue typing data were available on 3,420 cases and 1,500 controls.
Both datasets contained cases fulfilling the 1987 ACR classification criteria for RA [30]. HLA-DRB1 tissue typing was undertaken (at two-digit or four-digit resolution) at individual centres, using commercially available semiautomated polymerase chain reaction-sequence-specific oligonucleotide probe (PCR-SSOP) typing techniques (or research assays based on PCR-SSOP linear array technology) [29]. Two-digit typing includes the allele group (Field 1) only; four-digit typing includes both the allele group and the allele subtype encoding a specific HLA protein (Field 2) (http://hla.alleles.org/nomenclature/naming.html).

We undertook prediction modelling in seropositive cases and controls with HLA-DRB1 tissue typing data available with or without additional SNP and smoking data (as most replicated risk loci are for seropositive RA and genetic risk is dominated by HLA) [4], [31]. The final cohorts comprised 1,516 cases and 1,647 controls from WTCCC and 2,623 cases and 1,500 controls from UKRAGG (Table 1).

Author Summary
Rheumatoid arthritis (RA) is a common, incurable disease with major individual and health service costs. Preventing its development is therefore an important goal. Being able to predict who will develop RA would allow researchers to look at ways to prevent it. Many factors have been found that increase someone's risk of RA. These are divided into genetic and environmental (such as smoking) factors. The risk of RA associated with each factor has previously been reported. Here, we demonstrate a method that combines these risk factors in a process called "prediction modelling" to estimate someone's lifetime risk of RA. We show, firstly, that our prediction models can identify people with very high risks of RA and, secondly, that they can be used to identify people at risk of developing RA at a younger age. Although these findings are an important first step towards preventing RA, as only a minority of people tested had substantially increased disease risks, our models could not be used to screen the general population. Instead they need testing in people already at risk of RA, such as relatives of affected patients. In this context they could identify sufficient numbers of high-risk people to allow preventive methods to be evaluated.

Prediction Modelling Overview
Our modelling was performed within the R package, REGENT (Risk Estimation for Genetic and Environmental Traits), developed within our unit. This program incorporates published gene-environment risk factor and disease statistics to categorise risk using a confidence interval (CI)-based approach within a simulated population. The methodology underlying REGENT has previously been described in detail [32], [33].
Genetic and environmental risk factors for input into REGENT are selected from the literature. Genetic risk factors require allelic ORs, allele frequencies, and sample sizes from relevant studies, in order to estimate precision. Environmental risk factors require ORs, standard errors and the proportion of the population exposed to the risk factor. Data on these risk factors are entered into REGENT as summary statistic input files, which are processed in two stages: the first develops the prediction model and the second runs the prediction model in real-life data.
In the first stage REGENT simulates a population-distribution of disease risk. Risk profiles are simulated based on the frequency of each risk factor in the general population. Summary ORs for each risk profile are generated through combining the ORs for each genetic and environmental risk factor in a multiplicative model that assumes risk factor independence. CIs are generated using information on the variability of genetic risk factors (derived from the sample size of the risk variant discovery cohort) and environmental risk factors (standard error of the effect size). Each simulated risk profile's OR is initially calculated relative to a profile with no risk factors present; these are subsequently adjusted to ensure correct disease prevalence in the population, assigning a risk profile with a mean OR as having a baseline risk of 1.0. CIs are used to classify risk profiles into four risk categories (reduced, average, elevated and high-risk). Starting with the risk profile of baseline risk (OR = 1.0), any risk profile whose CI overlaps with this baseline CI is classified as being of average-risk (as this profile is not statistically different from baseline). Any risk profile whose CI resides fully below the baseline CI is classified as reduced-risk. Profiles with CIs above the baseline CI are classified as elevated-risk. Furthermore, a high-risk group is determined by profiles whose CIs reside completely above the CI of the first risk profile classified as elevated-risk. An example of how this process is undertaken in a simplified model using 3 SNPs is provided in Figure S1.
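The combination and categorisation steps described above can be illustrated with a minimal Python sketch. It is not the REGENT implementation: the function names, the Wald-type CI on the log-OR scale (with variances summed under independence) and the example CIs used in the usage note are assumptions made for illustration, and the prevalence rescaling step is omitted.

```python
import math

def combine_or(factors):
    """Combine risk factors multiplicatively and return (OR, CI lower, CI upper).

    factors: list of (odds_ratio, se_log_or) pairs for the factors present.
    Assumption: factors are independent, so log-ORs and their variances add,
    and a 95% Wald interval is formed on the log scale.
    """
    log_or = sum(math.log(o) for o, _ in factors)
    variance = sum(se ** 2 for _, se in factors)
    half_width = 1.96 * math.sqrt(variance)
    return (math.exp(log_or),
            math.exp(log_or - half_width),
            math.exp(log_or + half_width))

def categorise(ci, baseline_ci, first_elevated_ci=None):
    """Assign a risk category by comparing a profile's CI with reference CIs.

    ci, baseline_ci, first_elevated_ci: (lower, upper) tuples.
    first_elevated_ci is the CI of the first profile classed as elevated-risk;
    if it is not supplied, no profile can be classed as high-risk.
    """
    lower, upper = ci
    base_lower, base_upper = baseline_ci
    if upper < base_lower:
        return "reduced"      # CI lies fully below the baseline CI
    if lower > base_upper:    # CI lies fully above the baseline CI
        if first_elevated_ci is not None and lower > first_elevated_ci[1]:
            return "high"     # also fully above the first elevated-risk CI
        return "elevated"
    return "average"          # CI overlaps the baseline CI
```

For example, with a hypothetical baseline CI of (0.8, 1.25), `categorise((0.9, 1.5), (0.8, 1.25))` gives `"average"` because the two intervals overlap, whereas a profile with CI (0.3, 0.6) would be classed as `"reduced"`.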
In the second stage REGENT applies this simulated population profile to individual level data. Genotypes and environmental risk factor exposure data on each individual in the dataset of interest (WTCCC and UKRAGG) are entered into REGENT, which generates two measures of disease risk. Firstly, each individual's summary OR (95% CI) for RA is calculated (relative to the baseline individual with an OR of 1.0); as with the simulated population, risk factors are combined in a multiplicative model. This summary OR informs the individual of their risk of developing RA. Secondly, each individual is assigned a risk category for RA. This is undertaken through comparing the CI of each individual's summary OR to those of the simulated risk distribution in the same manner as described in stage 1. This risk category informs an individual whether they are at an increased or reduced risk of disease, relative to the average person in the general population.

Prediction Model Components Identified from Meta-Analyses
Genetic Risk Factors. We identified genetic susceptibility variants for potential inclusion in our prediction modelling from two large, recently published meta-analyses [34], [35]. We sought to include only susceptibility alleles attaining genome-wide significance (P_GWAS < 5×10⁻⁸); this ensured that the alleles modelled were replicated RA genetic risk factors. These comprised 15 four-digit and 10 two-digit HLA-DRB1 alleles and 35 non-HLA SNPs.
Environmental Risk Factor. We included the environmental risk factor smoking in our modelling. Other factors proposed to influence RA risk such as alcohol were not included: firstly the evidence underlying these is uncertain, with associations often present in case-control and not cohort studies [36] and secondly detailed data on non-smoking risk factors were not captured in WTCCC and UKRAGG. We used published ORs from the most recent meta-analysis evaluating smoking as an RA risk factor [5]. In this meta-analysis ever-smoking was a significant risk for seropositive RA in males only (OR 3.02; 95% CI 2.35-3.88) with a substantially smaller and non-significant (CIs contain 1) impact seen in females (OR 1.34; 95% CI 0.99-1.80). We therefore hypothesized that smoking would not improve prediction in women (confirmed in preliminary analyses; Table S1). As a result only males were evaluated in our modelling incorporating ever-smoking.
Although smoking interacts with the shared epitope we did not factor this into our modelling. This is because studies reporting summary ORs for this interaction [6], [29], [37], [38] have marked heterogeneity between them; therefore using meta-analysis techniques to obtain pooled ORs for shared epitope-smoking status combinations would be inaccurate and thus inappropriate. Examples of this heterogeneity include: (1) studies reporting risks stratified by different smoking levels, which would require an inverse variance fixed-effects model to obtain common ORs for all smokers within studies in addition to a random-effects model to estimate pooled ORs across studies; (2) two studies classifying the shared epitope at two-digit resolution, thus incorporating non-shared epitope alleles [6], [37]; (3) two studies not including all known shared epitope alleles [29], [38].

Prediction Model Component Availability in WTCCC and UKRAGG
Two-digit or four-digit HLA-DRB1 tissue typing data were available in all evaluated individuals. In WTCCC 1,342 seropositive cases, 966 ACPA-positive cases and 1,126 controls had four-digit resolution data available on both alleles; 29 seropositive cases, 14 ACPA-positive cases and 159 controls had two-digit resolution data available on both alleles; 145 seropositive cases, 81 ACPA-positive cases and 362 controls had mixed-digit resolution data (one HLA-DRB1 allele known at four-digit and the other at two-digit resolution) available. In UKRAGG 1,534 seropositive cases, 1,108 ACPA-positive cases and 735 controls had four-digit resolution data available on both alleles; 312 seropositive cases, 66 ACPA-positive cases and 205 controls had two-digit resolution data available on both alleles; 777 seropositive cases, 334 ACPA-positive cases and 560 controls had mixed-digit resolution data available.
We excluded 4 SNPs attaining P_GWAS in the meta-analysis from each dataset for the following reasons: 1 (rs11676922) was in high linkage disequilibrium (r² > 0.9; HapMap release 22 CEU population panel) [39] with another (rs10865035), in which case the latter SNP was included owing to its previous association with RA; and 3 SNPs/proxy SNPs were unavailable (rs10488631, rs6859219 and rs934734 in UKRAGG; rs6822844, rs874040 and rs951005 in WTCCC). Eleven and two proxy SNPs were used in WTCCC and UKRAGG respectively (Table S2) [39].
Data on ever-smoking status were available in 287 male cases and 739 male controls in WTCCC and 529 male cases and 322 male controls in UKRAGG.

Final Prediction Models
To examine the contribution of each gene-environment component to prediction we constructed several models. These comprised a SNP model (with 31 SNPs), an HLA model (10 two-digit and 15 four-digit HLA-DRB1 alleles), an HLA-SNP model (combining HLA and SNP model components), an HLA-smoking model (combining HLA-DRB1 alleles with ever-smoking status) and an HLA-SNP-smoking model (combining HLA-DRB1 alleles, 28 SNPs and ever-smoking status). Only the 28 SNPs present in both WTCCC and UKRAGG were incorporated in the last model. The latter two models, which included smoking, were evaluated in males only.
The decision to combine two-digit and four-digit HLA-DRB1 alleles in the HLA model was undertaken to avoid removing the substantial number of individuals with mixed resolution typing. Preliminary analyses confirmed the validity of this approach with no significant differences seen in the discriminative abilities of HLA models incorporating (1) two-digit alleles only; (2) four-digit alleles only and (3) a mixed resolution of alleles (Table S3). Within our mixed resolution modelling the risks for each HLA allele were included only once per individual at the highest resolution at which they were known.
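The mixed resolution rule can be sketched as follows. This is an illustrative Python fragment only: the function name, the OR tables and their values are hypothetical, and treating alleles that are absent from the relevant table as neutral (OR of 1.0) is an assumption rather than a description of how REGENT handles them.

```python
def hla_summary_or(genotype, four_digit_or, two_digit_or):
    """Multiply the ORs of an individual's two HLA-DRB1 alleles.

    genotype: two allele strings, e.g. ("04:01", "13"); an allele written
    with a colon is taken to be typed at four-digit resolution.
    Each allele is looked up once, in the table matching the highest
    resolution at which it is known. Alleles not in that table are
    treated as neutral (assumption).
    """
    result = 1.0
    for allele in genotype:
        table = four_digit_or if ":" in allele else two_digit_or
        result *= table.get(allele, 1.0)
    return result

# Hypothetical ORs, for illustration only (not the published estimates).
FOUR_DIGIT_OR = {"04:01": 4.0}
TWO_DIGIT_OR = {"04": 3.0, "13": 0.5}
```

With these illustrative tables, an individual typed as `("04:01", "13")` has one allele scored from the four-digit table and the other from the two-digit table, and neither allele is counted twice.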
Only individuals with available data on relevant risk factors were included in models incorporating those risk factors. Therefore only males with available smoking data were included in the HLA-smoking and HLA-SNP-smoking models. Similarly only individuals with data available on the modelled SNPs could be included in the HLA-SNP and HLA-SNP-smoking models. Owing to missing data the number of individuals evaluated in each prediction model fell as more risk factors were included (Figure 1).

Statistical Analyses
Evaluating Dataset Validity. To compare the representativeness of our datasets to published RA populations we summarised clinical features of cases and controls (Table 1) and calculated effect allele frequencies and allelic ORs (95% CIs) (Tables 2 and 3). For the HLA-DRB1 allele case-control association analysis (Table 2) the two-digit resolution allele results included both individuals with two-digit resolution typing and collapsed four-digit resolution typing. This approach was undertaken due to the small number of individuals with two-digit typing data in WTCCC/UKRAGG. The meta-analysis from which we obtained our risk alleles had almost identical allele frequencies when comparing two-digit alleles and four-digit alleles collapsed to two-digit resolution [35]; comparing our datasets to the meta-analysis findings in this manner was therefore appropriate.
Comparing Model Classification Abilities. To evaluate the ability of each model to correctly classify disease status we constructed receiver operating characteristic (ROC) curves and measured the AUC; this is established methodology in determining genetic classification test efficacy [40], [41]. Higher AUCs indicate better classification. An AUC > 0.5 signifies some discriminative ability; a perfect classifier has an AUC of 1. AUCs were calculated and compared using DeLong's method [42] performed within the R package, pROC [43].
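To make the interpretation of the AUC concrete, the following Python sketch computes it as the probability that a randomly chosen case has a higher risk score than a randomly chosen control, counting ties as one half. This is the standard rank-based definition of the AUC and is shown for illustration only; it does not reproduce pROC, and DeLong's comparison of two AUCs is not implemented. The function name and example scores are hypothetical.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve from risk scores (e.g. summary ORs).

    Equals the proportion of case-control pairs in which the case has
    the higher score, with tied pairs contributing 0.5.
    """
    concordant = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                concordant += 1.0
            elif case == control:
                concordant += 0.5
    return concordant / (len(case_scores) * len(control_scores))
```

Under this definition a model that scores every case above every control has an AUC of 1, and a model whose scores carry no information about disease status has an expected AUC of 0.5.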
Comparing Model Generated Risk Distributions. The risk distributions for cases and controls under each model were compared by plotting the logarithmic OR for seropositive RA for each individual ordered by risk.
Calculating Lifetime Risk of RA. Due to the low prevalence of RA [44], ORs approximate relative risks [45]. Therefore to calculate lifetime risks of seropositive RA we multiplied published lifetime risks by the summary OR for RA generated by our prediction models. As UK lifetime risks of RA are unknown we used estimates from a large US cohort study (2.4% for women; 1.1% for men) [46].
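The calculation amounts to a single multiplication, sketched below in Python. The baseline risks are those quoted above from the US cohort [46]; the function name is hypothetical, and capping the result at 100% is an assumption added here because the OR-as-relative-risk approximation breaks down for very large ORs.

```python
# Published lifetime risks of RA used as baselines (2.4% women; 1.1% men).
BASELINE_LIFETIME_RISK = {"female": 0.024, "male": 0.011}

def lifetime_risk(summary_or, sex):
    """Approximate lifetime risk as baseline risk multiplied by the summary OR.

    The cap at 1.0 is an illustrative safeguard (assumption), not part of
    the published method.
    """
    return min(1.0, BASELINE_LIFETIME_RISK[sex] * summary_or)
```

For example, a summary OR of 20 corresponds to an estimated lifetime risk of 48% in a woman and 22% in a man.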
Evaluating YORA Prediction. The role of HLA, SNPs and ever-smoking status in determining age of RA onset was evaluated using individual-level OR outputs from the REGENT models in a Cox univariate analysis with gender, smoking status and smoking status-gender interaction used as covariates. Factors indicated as likely predictors of age of onset were then examined simultaneously in a multivariate analysis incorporating backward elimination of non-significant factors (P > 0.05). We found no evidence of a "gender-smoking interaction" effect on the age of RA onset in either dataset (WTCCC P = 0.0823 and UKRAGG P = 0.8369; Table 4). This excluded a significant influence of gender on the relationship between smoking and the age at which RA developed. We therefore included both sexes when evaluating smoking's effect on the age of onset. Proportional hazards assumptions were verified using visual inspection of log-log plots [47]. To further demonstrate associations between significant factors and age of onset we constructed Kaplan-Meier estimates of the cumulative risk for cases, stratified by REGENT risk categorisation from the relevant models, alongside the presence/absence of other risk factors. We used a Cox multivariate approach to establish which four-digit HLA-DRB1 alleles influenced age of onset (fitting all alleles simultaneously using stepwise selection, removing non-significant alleles from the final model). All time to event analyses were performed using SAS version 9.3 (SAS Institute, Cary, NC).

Separate Analyses for ACPA-Positive RA
We undertook modelling separately for seropositive (RF and/or ACPA present) RA and ACPA-positive RA since HLA-DRB1 allelic ORs were obtained from a meta-analysis evaluating ACPA-positive RA [35], and the shared epitope alleles, non-HLA SNPs and smoking predominantly associate with ACPA-positive disease [4], [48]–[50]. We therefore hypothesised our modelling would perform better for ACPA-positive RA. As this was confirmed in the risk categorisation results we restricted further analyses (AUC and lifetime risk calculations, examining modelling associations with age of RA onset) to ACPA-positive RA.

Dataset Validity
Genetic Risk Factors. In both WTCCC and UKRAGG the effect allele frequencies and ORs for seropositive RA were generally similar to published data (Tables 2 and 3). Exceptions occurred at the four-digit HLA-DRB1 alleles *04:08 and *15:01 (absent from controls in WTCCC and UKRAGG respectively), at *01:01, *11:01, *11:04, *13:01 and *15:01 in WTCCC and *08:01 in UKRAGG (significantly lower allele frequencies in controls than expected). The absence of *04:08 in controls was probably a chance finding since it has a frequency of 0.005. The remaining discrepancies resulted from lower four-digit tissue typing rates for these alleles in controls, which were more often typed at two-digits, compared with cases. Although this could introduce bias, especially in the context of case-control association analyses, we do not consider it significantly affected our prediction modelling because these alleles were incorporated in our models at both two-digit and four-digit resolution (in most cases in the reference meta-analysis the two-digit alleles had similar allele frequencies and ORs compared with the four-digit alleles) and our risks were obtained from an external source [35].
SNP discrepancies occurred at rs3761847 in WTCCC and rs26232 and rs540386 in UKRAGG, which had ORs in the opposite direction to published results, although the dataset and meta-analysis 95% CIs overlapped for two SNPs. Additionally the minor allele frequencies (MAFs) in controls were similar to those expected. These discrepancies probably represent normal variation as opposed to systematic genotyping differences.
Environmental Risk Factors. The ORs for seropositive RA in ever-smokers were 3.10 (95% CI 2.22-4.37) in WTCCC and 4.32 (95% CI 3.16-5.92) in UKRAGG for males and 1.02 (95% CI 0.84-1.25) in WTCCC and 1.96 (95% CI 1.61-2.40) in UKRAGG for females. The meta-analysis gender discrepancy surrounding the effect of ever-smoking on RA risk [5] was therefore mirrored in our datasets, supporting the inclusion of only males in our smoking models.

Risk Prediction
Risk Categorisation. As hypothesized, our modelling more accurately categorised ACPA-positive RA as high-risk compared with seropositive RA (Figure 2 and Table S4). The HLA model provided most prediction in both datasets, classifying approximately one third of ACPA-positive RA as high-risk and two thirds of controls reduced-risk. Although the SNP model provided some prediction it classified most individuals as average-risk, reflecting the overlapping CIs generated by including many risk factors of a small effect size.
In WTCCC, the full genetic (HLA-SNP) model performed slightly better than HLA alone. Additional smoking data conferred subtle improvements in categorisation; this is particularly seen with the HLA-SNP-smoking model, which classified over half of ACPA-positive RA elevated/high-risk and 59% of controls reduced-risk.
In UKRAGG the addition of SNPs to HLA alleles increased the average-risk group size with no clear predictive benefits. The incorporation of smoking substantially improved prediction: the [...] discrimination with differences observed between HLA and HLA-smoking model AUCs (P = 0.0051) and HLA-SNP and HLA-SNP-smoking model AUCs (P = 0.0120).
An overview of the main findings for each of the 5 prediction models, alongside the differences between them is provided in Figure S2.
Risk Distributions. In both datasets the HLA model provided most risk prediction generating substantially higher and lower ORs for RA in cases and controls respectively compared with the SNP model (Figure 4).
In WTCCC the addition of other risk factors to the HLA-DRB1 alleles resulted in further small incremental increases in ORs for RA in cases; a less pronounced reduction in risk was seen in controls. In UKRAGG the addition of SNPs to HLA data provided no changes in case risk profiles, although a minority of controls had lower ORs. Additional smoking data resulted in significantly higher ORs for cases; only the HLA-SNP-smoking model clearly generated lower risk profiles for controls.
Lifetime Risk Prediction. Evaluating risks using genetics (HLA-SNP model) alone, the highest-risk WTCCC ACPA-positive case had an OR for seropositive RA of 79; as a male his lifetime risk was estimated at 86%. The highest-risk control had an OR of 22; as a female her lifetime risk was estimated at 53%. Despite such high individual odds only a relative minority had relevant increased lifetime risks: using the same HLA-SNP model 49 (4.61%) ACPA-positive cases and 1 (0.07%) control had ORs for seropositive RA > 20 (lifetime risks >48% if female and >22% if male) in WTCCC. In UKRAGG 9 (3.06%) ACPA-positive cases and 1 (0.17%) control had ORs > 20.
The HLA-SNP-smoking model identified the greatest proportion of cases with substantially increased lifetime risks for RA. This model identified 18 (7.53%) and 3 (3.75%) ACPA-positive male cases to have ORs for seropositive RA > 20 (lifetime risk >22%) in WTCCC and UKRAGG respectively; no controls had ORs > 20.

Younger Onset RA Prediction
In WTCCC the HLA model summary OR score was the only significant predictor of age of RA onset (Table 4). The hazard ratio (HR) was 1.034 (P < 0.0001), which indicated that the hazard (the rate at which RA occurred) was greater in individuals with higher HLA derived ORs than those with lower ORs. Therefore a higher HLA model generated risk score associated with RA occurring at a faster rate and thus YORA. Conversely ever-smoking was associated with older onset RA: the HR of 0.902 indicated a smaller hazard (RA occurred at a slower rate) in ever-smokers compared with never-smokers, although this was not significant (P = 0.1301).
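The interpretation of these HRs follows from the form of the Cox proportional hazards model, written below for a single covariate x (here the HLA model summary OR score). The notation is introduced for illustration, and reading the reported HR as the effect of a one-unit increase in the score is an assumption, as the scaling of the covariate is not stated here.

```latex
h(t \mid x) = h_0(t)\,\exp(\beta x), \qquad
\mathrm{HR} = \exp(\beta), \qquad
\frac{h(t \mid x + \Delta)}{h(t \mid x)} = \mathrm{HR}^{\Delta}
```

Under that assumption, an HR of 1.034 would imply that two cases whose summary ORs differ by 10 units have hazards differing by a factor of approximately 1.034^10 ≈ 1.40, whereas an HR below 1 (as for ever-smoking) corresponds to a lower hazard and hence a later onset.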
In UKRAGG the HLA model summary OR score, gender and smoking status were significant independent predictors of age of onset. An increasing HLA summary OR score associated with YORA (P = 0.0003, HR 1.026); ever-smoking (P = 0.0041, HR 0.848) and male gender (P = 0.0465, HR 0.885) associated with older onset RA.
We considered that the non-significant relationship between smoking and age of onset in WTCCC reflected a limited sample size with our power to detect a 0.88 HR in the 962 WTCCC cases approximately 51% compared with 65% for the 1,361 UKRAGG cases. We therefore undertook a pooled analysis of both datasets (incorporating an additional "study" variable to account for dataset median age of onset differences). This confirmed that HLA derived risk scores significantly associated with YORA (P < 0.0001, HR 1.030) and ever-smoking significantly associated with older onset RA (P = 0.0489, HR 0.889).
Kaplan-Meier curves of age of onset stratified by HLA model risk categorisation further demonstrate the association of HLA risk profiles with YORA (Figure 5) with cases classified high-risk having significantly younger onset ages compared to those classified reduced-risk. In WTCCC the difference in the median time to RA (time point at which half the cases have developed RA) was 3 years between those classed high- and reduced-risk (Log-Rank = 11.43; P = 0.0007). In UKRAGG a stronger association was seen (Log-Rank = 27.33; P < 0.0001) with a difference in median time to RA onset between risk groups of 6 years. Further stratification by ever-smoking status demonstrated a trend towards an older onset age in ever-smokers. In WTCCC the median time to onset difference between high-risk never-smokers and reduced-risk ever-smokers was 7 years (Log-Rank = 14.42; P = 0.0024); a larger disparity was seen in UKRAGG with a difference of 12 years observed (Log-Rank = 46.2505; P < 0.0001).

Discussion
We have demonstrated that predicting RA development is possible with our prediction models able to identify individuals with clinically relevant increased risks for seropositive RA. Our modelling indicates that most prediction is provided by HLA-DRB1 alleles and, to a lesser extent, smoking in males; non-HLA susceptibility SNPs provide only minor predictive benefits. These findings are consistent with the estimations of heritability variance conferred by different genetic components. We have also shown it is possible to predict the age of RA onset, using information on HLA and smoking to identify those at risk of younger and older onset RA, respectively. Whilst our novel modelling approach, which uses computer simulation-based categorisation alongside a greater number of HLA alleles, significantly improves upon the discriminative abilities of existing models [26], [27] it remains unsuitable for population screening with only a minority at significantly increased lifetime risks for RA.
Our approach provides some potential advantages over existing RA prediction modelling [26], [27]. Firstly, by using a simulated population to generate risk profiles we do not require an entire population of real-life data to stratify risks. In contrast existing approaches categorise weighted genetic risk scores (wGRS) using their Gaussian distribution in control groups. Secondly, our CI-based approach considers the precision with which risk factor effect sizes are known when classifying risk; this prevents classifying people high-risk if their risk is imprecisely known. Thirdly, our models provide greater discrimination: the highest AUC for existing clinical-genetic models in discerning ACPA-positive RA from controls is 0.752; the highest AUC for our clinical-genetic model is 0.857.
SNPs provided only minor improvements in prediction, highlighting the limitations of genome-wide association study (GWAS) derived data in this field. Although GWAS-established SNPs have helped identify cellular pathways relevant to RA pathogenesis [51] their modest effect sizes limit their predictive utility. It has been proposed that the missing heritability of RA may reflect the involvement of rare variants of large effect sizes or structural variants [52]. Alternative genotyping technologies such as next-generation sequencing may identify these variants, although only loci with large effect sizes will substantially improve prediction modelling.
Although individuals with clinically relevant increased lifetime risks (such as 86%) for RA were identified there was, overall, only a minority of individuals at a significantly elevated risk: 7% of ACPA-positive individuals had lifetime risks of 22% or more when evaluated using all available risk factors. Therefore despite high AUCs our modelling is unsuitable for population level screening. However, if its use was targeted to groups with a priori increased risks, such as first degree relatives of RA probands [53][54][55], then a substantially greater proportion of very high-risk individuals might be identified.
Individuals classified high-risk by our HLA model were more likely to develop RA at a younger age. This finding, mainly attributable to the *04:01 allele, is supported by existing literature. Hellier et al reported a higher frequency of *04 RA associated alleles in YORA (present in 52% of 262 RA cases with onset age <60) compared with elderly onset RA (present in 37% of 60 cases with onset age >60; P = 0.045) [18]. Similarly, Wu et al identified a significantly younger age of onset in Caucasian RA patients carrying shared epitope encoding *04 alleles (P = 0.0003) [19]. Other studies report positive correlations between YORA and shared epitope alleles [25], [56]. Our finding of ever-smoking associating with older onset RA is less established. It has only been examined in three relatively small studies, with contrasting outcomes: one study reported a significant relationship between smoking at disease onset and a younger onset age [57]; one reported a younger onset age in current vs. never-smokers (although ex-smokers had older onset RA in comparison to both these groups) [58]; the final study found no association [59]. Our findings, demonstrated in 2,323 individuals across two independent datasets, are biologically plausible. As risk genotypes are present from birth they can exert their effects on disease risk throughout an individual's lifetime; therefore possessing high-risk HLA-DRB1 alleles predisposes to RA at a younger age. In contrast the risk of RA increases as more cigarettes are smoked [60], [61] and cumulative exposure rises with age; smokers are therefore more likely to develop RA as they get older, and smoking associates with older onset RA. This logic also explains why ever-smoking associates with older onset RA in both men and women, with heavy smoking a risk factor for RA in both genders [5]. We were, however, unable to incorporate heavy smoking in our prediction modelling due to a paucity of data on smoking pack-years in WTCCC/UKRAGG.
We incorporated many genetic risk factors in our modelling but included only one environmental risk factor, smoking. This reflects uncertainty regarding relevant environmental risks alongside limited environmental data within current genetic datasets. Although many environmental factors are linked to RA their associations are usually identified in case-control studies, which are subject to multiple biases, rather than cohort studies. Examples include alcohol consumption [36], parity [62], [63] and oral contraceptive pill use [64]. Better characterisation of environmental risks will enhance predictive modelling.
Our modelling has several limitations. Firstly, WTCCC participants were included in the meta-analyses from which we obtained our genetic risk loci data; however WTCCC comprised only a proportion of the meta-analysis datasets (20% of the HLA meta-analysis; 29% of the SNP meta-analysis) and our findings were independently replicated in UKRAGG. Secondly, missing data meant the number of individuals included in each model fell as more risk factors were included; this is particularly seen in models incorporating smoking. Thirdly, due to marked heterogeneity in published data on gene-gene/gene-environment interactions we assumed independence between these factors despite known interactions existing between the shared epitope alleles and PTPN22 and smoking [6], [7], [29], [37], [38].
Improving RA prediction requires better clarification of its genetic and environmental risk factors. Identifying risk factors with large effect sizes of known precision will most enhance prediction modelling. This could be facilitated through fine-mapping studies that better tag causal variants [65] alongside prospective cohort studies examining environmental risk factors in RA cases subdivided by ACPA status, with increasing evidence that risks differ between these serological subsets [36], [66]. It is, however, unlikely that identifying such risk factors will substantially increase the proportion of individuals with clinically relevant high disease risks. We therefore consider that prediction modelling requires evaluation in a priori higher risk groups. In this context it may identify sufficient numbers of very high-risk individuals, facilitating a better understanding of pre-RA immunopathology and enabling the assessment of primary prevention strategies.
Proxy SNPs used in modelling. a = proxy SNP obtained using 1,000 Genomes CEU population panel [39]; b = proxy SNP obtained using HapMap release 22 CEU population panel [39]; c = proxy SNP obtained using Ricopili (Broad Institute, Boston, USA) from the GWAS meta-analysis of RA risk (http://www.broadinstitute.org/mpg/ricopili/). (DOCX)