A genome-wide screen for variants influencing certolizumab pegol response in a moderate to severe rheumatoid arthritis population

Certolizumab pegol (CZP) is a PEGylated Fc-free tumor necrosis factor (TNF) inhibitor antibody approved for use in the treatment of rheumatoid arthritis (RA), Crohn’s disease, psoriatic arthritis, axial spondyloarthritis and psoriasis. In a clinical trial of patients with severe RA, CZP improved disease symptoms in approximately half of patients. However, variability in CZP efficacy remains a problem for clinicians, thus, the aim of this study was to identify genetic variants predictive of CZP response. We performed a genome-wide association study (GWAS) of 302 RA patients treated with CZP in the REALISTIC trial to identify common single nucleotide polymorphisms (SNPs) associated with treatment response. Whole-exome sequencing was also performed for 74 CZP extreme responders and non-responders within the same population, as well as 1546 population controls. No common SNPs or rare functional variants were significantly associated with CZP response, though a non-significant enrichment in the RA-implicated KCNK5 gene was observed. Two SNPs near spondin-1 and semaphorin-4G approached genome-wide significance. The results of the current study did not provide an unambiguous predictor of CZP response.


Introduction
Rheumatoid arthritis is a chronic, systemic autoimmune disease of unknown etiology, affecting between 0.3 and 1% of the global population. It affects the joints, connective tissues, muscle, tendons, and fibrous tissue leading to reduced quality of life, disability, and early mortality for sufferers [1]. To date, no GWAS has been reported for certolizumab pegol. We report the first genomewide study of a North American cohort of RA patients with moderate to severe RA, with the objective of discovering DNA variants associated with CZP response.

Patients
The study protocol was approved by the Institutional Review Boards at Duke and Columbia University and written informed consent for participation was obtained at study entry for all subjects. North American patients (�18 years) were selected from the REALISTIC phase IIIb study (NCT00717236). All participants were diagnosed with adult onset RA as defined by 1987 American College of Rheumatology (ACR) criteria for at least 3 months. In addition, they had either not tolerated or responded unsatisfactorily to at least one DMARD prior to study entry [30]. Patients were assessed using a 28 joint count and were defined as having active disease if they had at least five tender and at least four swollen joints, C-reactive protein (CRP) �10 mg/l and/or � 28mm/hour erythrocyte sedimentation rate (ESR) and had moderate to severe RA at study entry [30] (Table 1).

GWAS of common SNPs
Genotyping. A total of 2,372,361 SNPs in 413 patients were genotyped using the Illumina Omni2.5M array platform. SNP genotypes were called using the Illumina GenomeStudio Software package Version 2011.1 with genotyping module version 1.9.4 according to the manufacturer's instructions.
Quality control (QC). A series of QC checks were carried out to ensure sample integrity. Biological sex was assessed for concordance between the genetically-inferred and self-reported gender. Duplicate samples and cryptic relatedness were identified based on genetic data using PLINK [31]. Samples found to be related at the level of first cousins or closer (pairwise estimate of identity-by-descent exceeding 0.125) were considered excessively related and one sample from each pair was removed at random. Quantitative estimates of ancestry (principal components analysis (PCA) implemented in EIGENSTRAT software [32]) were performed using LD pruned SNPs to estimate genomic ancestry and compared with self-reported ethnicity.
Data quality was also evaluated at the marker level to remove low quality genotypes. Two subjects missing >10% of genotype calls and one subject with an anomalously low inbreeding coefficient (F statistic outside of 6 standard deviations from the mean) were excluded. A total of 8629 poorly performing markers missing >10% of genotype calls and 3144 heterozygous haploid genotypes were removed. In addition, rare variants with a minor allele frequency (MAF) of <1% were excluded. No SNPs were found to deviate from Hardy-Weinberg equilibrium when a Bonferonni-adjusted threshold of 2.91x10 -8 was applied.
Following QC, data were available for 1,685,541 SNPs in 360 individuals of primarily European ancestry. Of these, 302 individuals treated with CZP were included in the statistical analysis. Disease activity Score (DAS28 ESR) data were not available for all individuals (Table 2) and where data was missing for individual outcome measures, those patients were excluded from subsequent analysis. Only a minor difference in ACR20 response (45.7% vs 48.3%, Table 1) was observed between weeks 6 and 12, therefore only week 6 (ACR20_W6) was used for the analysis.
Imputation. After QC, non-ambiguous SNPs, including rare variants with MAF<1%, were used as input for imputation using haplotypes from the 1000 Genomes phase1 integrated reference panel. Prephasing of haplotypes was performed using SHAPEIT software [33]. Imputation of missing and un-genotyped SNPs was then performed on 5Mb segments from each autosome and chromosome X using IMPUTE2 [34] according to recommended best practices. High confidence SNPs with greater than 90% imputation confidence were then merged using GTOOL and converted back to PLINK format for association analysis.
Statistical analysis. Four primary outcomes were tested for association within the study: ACR20 and decrease in DAS28 ESR >1.2 at week 6 (both as dichotomous variables), and change in DAS28 ESR at week 6 and 12 (continuous variables). Testing for association between individual outcome measures and each SNP were carried out using logistic or linear regression, after correcting for gender and genetic ancestry using the top 3 principal component axes with significant Tracy-Widom statistics, explaining 2.78% of the observed variance. Analysis of candidate SNPs. Using a targeted approach harnessing known GWAS hits in autoimmune disease and genes involved in the TNF pathway, SNPs with higher prior probability of association with CZP response were tested for enrichment of association beyond that expected under the null hypothesis. Of 1618 SNPs previously reported to be associated with auto-immune diseases (SNPs with p< 5x10 -8 obtained by searching GWAS Central for the MeSH term "autoimmune disease"), 72 were specifically associated with RA. These SNPs have a higher probability of association with the CZP outcomes than SNPs selected randomly. A total of 1403 out of 1618 SNPs were present on the Omni2.5M chip or were available in the imputed dataset. Of the 72 RA-associated SNPs, 64 were available for testing. To examine the role of genetic variation in genes involved in TNF signaling, a further hypothesis driven approach was carried out using 13303 common SNPs located within 112 genes that were annotated as participating in TNF signaling (hsa04668) according to KEGG pathway ontologies [35]. Significance scores for the subset of 13303 TNF-signaling SNPs were compared against a uniform distribution of p-values that would be expected under the null hypothesis using quantile-quantile plots. A review of the literature was used to identify 94 SNPs previously reported to be associated with TNF inhibitor response. Of these, 86 SNPs were present in the Omni2.5M array or imputed dataset.

Whole-exome sequencing
Patient selection. Extreme non-responders (hereafter referred to as "non-responders") were defined as ACR20 failures at week 6 and 12, with maximum reduction in DAS 28 ESR at week 12 compared with baseline (BL) of 0.33 and maximum reduction in DAS 28 ESR at week 6 compared with BL of 0.6, to eliminate secondary loss of response. Extreme responders (hereafter referred to as "responders") were defined as individuals with a positive ACR70 score at 12 weeks. Using these criteria, a total of 81 non-responders and 40 responders were identified (summary shown in Table 3).
Sequencing and quality control. Whole-exome sequencing (WES) was performed using the Roche Nimblegen Human Exon v2 capture kit and KAPA library prep kit according to the manufacturer's instructions. Sequencing was performed using the 100bp paired-end read protocol on the Illumina HiSeq 2000 instrument. After quality filtering the raw sequence data using CASAVA (Illumina, Inc., San Diego, CA, USA), adapter sequences were trimmed and the sequencing reads were aligned to the reference human genome (NCBI37/hg19) using BWA software [36]. Duplicate reads were then removed from the dataset using Picard (Broad Institute, Boston, MA, USA). Variant calling was performed using GATK [37] with local realignment around insertion/deletion variants (indels) and base quality recalibration for singlenucleotide variants (SNVs) [38]. Similarly sequenced control samples available at the Duke Center for Human Genome Variation (CHGV) served as the comparison group for a subset of the exome sequencing analyses. Assesment of population structure. As the majority of subjects were self-described Caucasians, we sought to limit our analyses to that ancestry group in order to minimize the influence of population stratification between cases and controls. Preliminary ancestry predictions were perfomed during bioinfomatic processing using a panel of 12840 high coverage SNPs, based on principal components generated by EIGENSTRAT for cases, controls, and set of 4289 samples with pre-evaluated genetic ancestries. All cases and controls were initially assigned to one of six geographic ancestry groups (Caucasian, African, Hispanic, East Asian, South Asian or Middle Eastern) using a multinomial logistic regression model which provided a probability estimate of a sample belonging to each of the six groups. After restricting the cases and control samples to those predicted to be of Caucasian ancestry, all samples were further required to have EIGENSTRAT PC scores within 6 standard deviations of the mean on PCs 1 and 2. From an initial 121 CZP-treated patients and 2654 CHGV controls, ancestry pruning resulted in 74 CZP-treated patients (19 responders, 55 non-responders) and 1546 controls of European ancestry (Fig 1A and 1B). After removal of ancestry outliers, no major differences in population structure were observed between case-control groups ( Fig 1C). Therefore no additional adjustment for genetic ancestry was included in the WES association analysis. The first two principal components accounted for 0.3% of the observed variance. Study characteristics after ancestry pruning are shown in S1 Table. Variant quality filtering and association testing. Variants were filtered to include only those located in CCDS genes, with a GATK variant quality score in the 99.9 percentile and in the following functional categories: stop-gained, stop-lost, nonsynonymous coding, essential splice site, and coding region indels. Genotypes with �3 reads were considered missing for any individual. Variants were excluded if they were missing in �10% of cases or controls, or if they showed significant deviation from Hardy-Weinberg equilibrium in controls. In addition, 6044 variants previously identified as being artifacts based on comparison of CHGV controls Table 3. Characteristics of study sample for whole exome sequencing.

Patient demographics Responders Non-responders
Self-reported gender, % female 77.5 80. with publically available data from the Exome Variant Server (EVS; http://evs.gs.washington. edu/EVS/) were removed from the dataset. Individual variants were tested for association using Fisher's exact test. In addition, genebased collapsing tests were performed following the method of Li and Leal [39], with adjustment for differences in coverage at the exon level between cases and controls. Variants qualified for inclusion in these tests if they had a minor allele frequency (MAF) <1% in both the study sample (combined cases and controls) and EVS, and fell into the functional categories above.
There were three major comparisons in the study: responders vs non-responders, responders vs controls + non-responders, and non-responders vs controls + responders. The withincohort comparison of responders and non-responders benefits from the clarity of the two alternative phenotypes; however, since the patients in these groups were selected from the extremes of the outcome distribution, it is reasonable to also compare each extreme to a population control sample, as the misclassification rate in controls is unlikely to be high and the sample sizes available are much larger. Statistical significance was assessed using Bonferroni correction for the number of tested variants or genes (in the collapsing analysis).

Results
We have carried out a two-stage genetic analysis seeking to identify DNA variants associated with response to CZP in a North American cohort with moderate to severe RA.

Patient population
The patient population was derived from the REALISTIC phase IIIb study [30] and the subjects used for the GWAS are summarized in Table 1. 79.5% of participants in the study were women, with average age at entry of 56 years. Mean duration of disease was 9.1 years with 22% of the population having disease duration of <2 years. Median CRP level at entry was 9.0 mg/l and median ESR was 36.0 mm/h. Mean DAS 28 ESR at entry was 6.4. 60% of the population for which data were available was positive for anti-citrullinated peptide antibodies and 69% positive for rheumatoid factor. Previous exposure to anti-TNF agents had occurred in 47% of patients, while 72% were on methotrexate, 62% on steroids and 23% on statins at baseline. A

PLOS ONE
small number of patients were receiving other DMARDs on study entry (lefluonamide n = 21; azathioprine n = 1).

GWAS
After Bonferroni correction for multiple testing, using categorical or continuous outcome variables (ACR20 and DAS28), we were unable to demonstrate genome-wide significant associations between any genotyped SNP and therapeutic response to CZP for any of the outcome measures. For three of the measures, ACR20 and DAS28 ESR reduction of >1.2 at week 6 (dichotomous variables), and change in DAS28 ESR at week 6 (a continuous variable), there was no evidence of SNPs associated with outcome (S1 Fig). For the change in DAS28 ESR at week 12 outcome measure, two SNPs approaching genome-wide significance were observed (rs12287315 and rs35355083) (Fig 2). Neither of these SNPs were located within genes; rs12287315 (p = 5.78x10 -8 ; β = -1.08) is located 76 kb upstream of the spondin-1 (SPON1) gene and rs35355083 (p = 1.57x10 -7 ; β = -1.14) is located in an intergenic region 110 kb upstream of the semaphorin 4G (SEMA4G) gene. Imputation of ungenotyped SNPs in a 10Mb window around each lead SNP identified two additional SNPs in these regions with stronger associations with DAS28 ESR at week 12 (Fig 3). These included rs78675205 (p = 2.61x10 -8 ; β = -1.21) on chromosome 10, located downstream of the Paired Box 2 (PAX2) gene, and rs72873110 (p = 9.61x10 -9 ; β = -1.13) located on chromosome 11, approximately 3kb upstream of the SPON1 coding region. While the significance scores for these two variants exceeded the original threshold for statistical significance, they were not statistically significant when the association analysis was performed using all common SNPs from a genome-wide imputation and no additional associations were detected outside of these two regions (S2 Fig).

Candidate SNP and pathway analysis
We supplemented our genome-wide search for variants with a targeted analysis focused on SNPs with a higher prior probability of association, through previous documentation of association with auto-immune disease, as well as a sub-group of 64 SNPs associated specifically with RA. Similarly, we generated a candidate list of 13303 common variants observed in 112 genes reported to be involved in TNF biology [35]. No enrichment of association beyond that expected under the null hypothesis was observed using these approaches (Fig 4, S2-S4 Tables).
We further tested whether SNPs previously associated with response to TNF inhibitors showed evidence of association in the current study. As shown in Table 4, the previously reported associations with TNF inhibitor response were not observed in our cohort.

Exome sequencing analysis
In a complementary analysis, we screened for rare variants by WES of 19 extreme responders and 55 non-responders. Additionally, patients from these extremes of the response distribution were compared with a large sample of more than 1500 ethnically matched control samples sequenced in the same laboratory using identical data processing methods. S5 Table shows the sample size and the number of variants tested in each comparison.   For all comparisons, the single-variant tests did not reveal any variant showing significant association with responder or non-responder status (Fig 5). The p-value distributions were non-uniform and tended to be less significant than expected under the null, probably owing to both the small sample sizes and the relatively low allele frequency distribution of the variants tested. For gene-based collapsing tests, the results were similar (Fig 6), with no individual gene showing significant association with therapeutic response after correcting for multiple testing. The top ten SNPs and genes most significantly associated with CZP response are shown in Tables 5 and 6. Interestingly, among the top ten most highly associated results in the gene-based analysis was the potassium channel, subfamily K, member 5 (KCNK5) gene, which is known to be a critical factor in T cell activation [40]. In addition, KCNK5 expression has previously been reported to be strongly correlated with RA disease severity [41] and up-regulation of KCNK5 expression was found to be predictive of treatment failure in RA patients receiving the anti-IL-6 therapy tocilizumab [41], which inhibits RA inflammation through a different biological pathway to that targeted by anti-TNF therapies. In our collapsing analysis, non-synonymous variants in the KCNK5 gene were present in 16% (3/16) of CZP extreme responders, while no patients (0/55) from the CZP non-responder group were found to have any predicted functional variants in the KCNK5 gene.

Discussion
We have completed an exploratory pharmacogenetic analysis of a cohort of 302 RA patients of Western European descent treated with CZP. A two-stage analytical approach was employed, using high content beadchip genotyping analysis to examine the role of common human genetic variation, followed by a second stage consisting of WES of extreme responders and non-responders, to investigate the association of rare genetic variants with CZP treatment response.
In the first stage of the analysis examining common variation, we were unable to demonstrate robust associations with any of the clinical outcome measures. Two novel candidates (rs35355083 near SEMA4G and rs12287315 near SPON1), showed strong trends toward association with change in DAS28 ESR at week 12, but did not meet genome-wide statistical significance. These variants have not been associated with response to TNF inhibitors previously, however a low-frequency variant located in the FAR1-SPON1 intergenic region was previously reported to be associated with Etanercept response [42]. Imputation of ungenotyped SNPs near these suggestive associations revealed two additional variants with stronger associations

PLOS ONE
with DAS28 ESR at week 12. While the p-values for these associations exceeded the original threshold for statistical significance, they were not found to be significant when the association analysis was performed using all common variants from a genome-wide imputation. Semaphorins are a diverse group of proteins involved in regulation of cell movement and migration via interaction with cognate plexin or neuropilin receptors [43]. Recent studies have indicated that they play a role in many aspects of the immune system, including innate immunity and cell trafficking [44]. In the mouse, SEMA4G is required for cerebellar development [45], however, a clear role in auto-immune disease has yet to emerge. Spondin 1(SPON1) is an extracellular matrix protein involved in regulation of neuronal outgrowth and inhibition of angiogenesis. It has been shown to bind to the extracellular domain of amyloid precursor protein and impairs cleavage by the beta secretase BACE 1, reducing beta-amyloid production [46]. Its expression has been reported to be elevated in osteoarthritis lesions in both humans and rodents, and its activation of TGF-beta in situ may promote osteoarthritis pathogenesis [47].
The involvement of these genes in immune signaling, as well as the reported role of spondin 1 in cartilage homeostasis [47], suggests that these associations may be biologically relevant. However, their failure to show significant association with CZP response after correction for multiple testing should prompt caution with regard to interpretation of the findings. As with all such candidates, replication of these SNPs in additional cohorts and biological validation will be required.
Whilst the genotyping stage of the analysis was intended to screen for common variants associated with CZP response, for complex diseases such as RA, such approaches typically require large cohorts and/or large effect sizes. One particularly compelling reason to search for rare variants in the study of therapeutic response to drugs is that drug response may be correlated with the underlying genetic cause of disease, and it is expected that much of the risk for human disease will be found in the rare-variant spectrum [48]. To address the risk of overlooking low frequency variants that contribute to CZP response, we implemented a WES approach in the second stage of the study. This approach enables identification of rare variants with larger effect sizes, despite the small sample sizes available. In this case, sequencing focused on 74 patients from the REALISTIC trial with extreme responses to CZP treatment; including 55 non-responders and 19 super-responders. Neither the single-variant tests nor the gene-based collapsing method identified any gene or variant that was significantly associated with CZP response. A modest, but non-significant, enrichment of variation was identified in the KCNK5 gene in our collapsing analysis. The previous implication of this gene in RA disease progression, as well as the specific observation that increases in KCNK5 expression are correlated with the failure of antibody-based therapies targeting similar inflammatory pathways involved in RA disease pathology, may suggest that additional investigation of KCNK5 as a modifier of RA therapies is merited.
With the caveat that KCNK5 and the two SNPs in SEMA4G and SPON1 may be worthy of follow up, we have failed to demonstrate the presence of any variants with a clinically robust association with response to the anti-TNF agent CZP. While the lack of statistically significant associations could result from the limited size of our study, this finding is also consistent with many previous candidate gene and genome-wide efforts to discover genetic determinants of response to anti-TNF agents. Researchers have either been unable to discover or replicate initial findings in clinically relevant cohorts, or the statistical performance of the candidates has not been sufficiently compelling to justify follow up or clinical implementation. A post hoc analysis of the estimated statistical power for the GWAS component of our study indicated it had over 80% power to detect common variants with a MAF of 25% and a relative risk of 2.5, but was only sufficiently powered to detect more rare variants with a MAF of 5% when the relative risk was over 5 (S3A Fig). Similarly, the WES portion of our study was likely to be underpowered to detect enrichments of rare variants with moderate effect sizes (RR<6.0; S3B Fig).
RA is a heterogeneous disease with complex genetic and environmental etiology. Although we have focused our analysis on a largely homogenous Caucasian cohort of 302 patients, the study population comprised patients with differing clinical characteristics at entry. We have accommodated variability in disease severity and duration, and use of other medications in the analysis but were not able to adjust for the variability in biology underpinning disease heterogeneity and response/non-response. In addition, the development of antibodies against TNF inhibitors has recently been recognized a significant contributor to treatment failure [49,50]. Our study did not account for the induction of antibodies against CZP. It therefore remains a possibility that heritable biomarkers predictive of risk do exist for sub-populations within diverse RA populations such as this one, but that their penetrance is diluted to an extent whereby they are not detectable.
Further complexity is added by the routine use of composite outcome measures such as DAS28 and ACR20, the primary endpoints used in this clinical study. They are comprised of a mixture of objective physical measurements, laboratory biomarker data, and subjective patient assessments. Robust associations may be discernible with a more refined focus on a single objective endpoint, although it should be noted that this may be misleading in a mixed patient population.
The goal of personalizing RA therapy is to identify patients where the likelihood of a beneficial outcome is optimized and the risk of adverse events is reduced or eliminated. Many studies with conflicting and ambiguous outcomes have been published investigating the pharmacogenomics of etanercept, infliximab and adalimumab in RA populations, and to date there have been no predictive anti-TNF genomic biomarkers established for clinical use. This report, the first pharmacogenomic analysis of CZP response, is in accord with these earlier efforts, and has been unable to demonstrate compelling evidence for the existence of genomic biomarkers of value in targeting this anti-TNF to a responsive patient sub-population.
In diseases such as RA, where there are complex, polygenic contributions to pathology and response to therapeutic intervention, it remains a possibility that combinations of clinical, environmental, and molecular markers will hold more promise for managing therapy. Graphs of statistical power estimated across a range of minor allele frequencies and effect sizes for (A) the GWAS of common SNPs and ACR20 response, and (B) collapsing analysis of rare variants identified by WES for non-responders vs population controls and super-responders. For rare variant collapsing analysis, cMAF indicates the cumulative minor allele frequency summed across multiple rare variants. All power calculations were performed using CaTS. (TIF) S1 Table. Characteristics of exome sequencing sample after ancestry pruning (N = 74). (DOCX) S2