Subphenotypes of inflammatory bowel disease are characterized by specific serum protein profiles

Objective Genetic and immunological data indicate that inflammatory bowel disease (IBD) is characterized by specific inflammatory protein profiles. However, the serum proteome of IBD is still to be defined. We aimed to characterize the inflammatory serum protein profiles of Crohn’s disease (CD) and ulcerative colitis (UC), using the novel proximity extension assay (PEA).
Methods A panel of 91 inflammatory proteins was quantified in a discovery cohort of CD patients (n = 54), UC patients (n = 54), and healthy controls (HCs; n = 54). We performed univariate analyses by t-test, with false discovery rate correction. A sparse partial least-squares (sPLS) approach was used to identify additional discriminative proteins. The results were validated in a replication cohort.
Results By univariate analysis, 17 proteins were identified with significantly different abundances in CD and HCs, and 12 when comparing UC and HCs. Additionally, 64 and 45 discriminant candidate proteins, respectively, were identified with the multivariate approach. Correspondingly, significant cross-validation error rates of 0.12 and 0.19 were observed in the discovery cohort. Only FGF-19 was identified from univariate comparisons of CD and UC, but 37 additional discriminant candidates were identified using the multivariate approach. The observed cross-validation error rate for CD vs. UC remained significant when restricting the analyses to patients in clinical remission. Using univariate comparisons, 16 of 17 CD-associated proteins and 8 of 12 UC-associated proteins were validated in the replication cohort. The area under the curve for CD and UC was 0.96 and 0.92, respectively, when the sPLS model from the discovery cohort was applied to the replication cohort.
Conclusions By using the novel PEA method and a panel of inflammatory proteins, we identified proteins with significantly different quantities in CD patients and UC patients compared to HCs. Our data highlight the potential of the serum IBD proteome as a source for identification of future diagnostic biomarkers.

Introduction Inflammatory bowel disease (IBD), comprising Crohn's disease (CD) and ulcerative colitis (UC), is a chronic inflammatory disease affecting the gastrointestinal tract. The inflammation arises at the intersection of genetic predisposition and factors related to the exposome. Current theories suggest that the disease is caused by an aberrant immune response to commensal gut microbiota in genetically predisposed individuals, which is precipitated by environmental factors. [1] However, the exact etiology of IBD remains unknown.
There has been great progress within the field of genetics: genome-wide association studies (GWAS) and subsequent meta-analyses have contributed significantly to our understanding of the genetic landscape of IBD. More than 200 genetic variants have been associated with the disease, but the causal gene has only been identified for a subset of these variants. As a consequence, the impact of IBD on the proteome remains largely undescribed.
The majority of identified genetic variants represent pathways of innate and adaptive immunity. [2] The key role of the immune response is also supported by data generated in mouse models and in in vitro and in vivo studies. However, the characteristics of both the innate immune response and the adaptive immune response differ between CD and UC. [3,4] Pronounced differences in inflammatory mediators such as cytokines and chemokines have also been shown between patients with CD and UC. [1] Recently, the categorization of IBD into CD and UC has been suggested to be overly simplistic. [5] Some genetic variants are disease-specific, and the genetic risk scores, generated from all known risk alleles for IBD, seem to be associated with different subphenotypes of CD, such as ileal CD and colonic CD, as well as UC. [6] Thus, based on genetic variants and immunological data, it can be hypothesized that IBD, and possibly also subphenotypes of the disease, might be characterized by specific inflammatory profiles within the serum proteome.
There is considerable variation within individuals in the concentration of different proteins in serum, which makes the use of traditional proteomic techniques such as 2-dimensional gel electrophoresis (2DE) and mass spectrometry challenging. [7] In contrast, antibody-based techniques such as the proximity extension assay (PEA) offer a combination of high sensitivity and a wide quantitative window for individual proteins, which avoids the problems associated with variation in protein concentration. [8] We aimed to characterize the inflammatory serum protein profiles of CD and UC using the novel PEA technique.

Population
Adult patients with CD and UC were consecutively recruited at the outpatient IBD clinic of Örebro University Hospital, Sweden. Similarly, consecutive healthy blood donors with no history of chronic gastrointestinal disease were recruited at Örebro University Hospital. All individuals were recruited between 2005 and 2012 and the cohort has been described previously in detail. [9] After obtaining informed written consent, blood samples were collected and the serum was separated after centrifugation at 2,400 g for six minutes at room temperature. All serum samples were stored as aliquots at −80 °C. The diagnoses of CD and UC were based on clinical, endoscopic, radiological, and histological criteria. [10] Disease characteristics were classified according to the Montreal classification, [11] with the exception of disease activity, for which the physician's global assessment was used. [12,13] A random sample of CD patients (n = 54) was selected, as well as UC patients (n = 54) and healthy blood donors with no history of chronic gastrointestinal disease (HC; n = 54); both groups were matched according to sex and age (± 5 years). Altogether, these 162 individuals constituted the discovery cohort. For validation of results obtained from the discovery cohort, a replication cohort consisting of 30 CD patients, 30 UC patients, and 30 HCs was recruited in a similar manner with matching according to sex and age (± 5 years).
The study was approved by the Örebro University Hospital ethics committee.

Protein analysis
The commercially available panel, ProSeek Multiplex Inflammation I 96x96 (Olink Proteomics, Uppsala, Sweden), consists of 91 preselected proteins, all related to inflammation (S1 Table). The concentrations of the proteins in the panel were assessed as previously described. [8] Briefly, a PEA was performed, where pairs of antibodies with oligonucleotides attached were incubated with the antigens. Oligonucleotides in close proximity produced a template for hybridization and extension. Pre-amplification was based on universal primers and PCR. Residual primers were digested before quantification with specific primers on a quantitative real-time PCR chip (Dynamic Array IFC; Fluidigm Biomark) on a Biomark HD Instrument. The analyses were performed at the Clinical Biomarkers Facility, Science for Life Laboratory, Uppsala. Normalized log2 values corresponding to protein quantities were generated with the Olink Wizard for GenEx (Multid Analyses, Sweden).

Statistics
Continuous variables, representing clinical characteristics, are presented as median and interquartile range (IQR) and differences were tested with the Mann-Whitney U test. Corresponding categorical data are presented as frequencies, and they were compared using Pearson's chi-square test or Fisher's exact test when appropriate. Proteins with signals below the limit of detection (LOD) in > 80% of the samples were excluded if the remaining signals were evenly distributed between cases and controls. This was done in order to reduce the effect of biologically irrelevant differences or non-informative protein features. All samples in the discovery cohort were normalized using quantile normalization, and the concentration of proteins below the LOD was then reset to zero.
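The two preprocessing steps described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the function names, the data layout (one list of protein values per sample) and the tie handling are assumptions made for illustration, and the filter implements only the "> 80% below LOD" part of the rule, not the additional check that the remaining signals are evenly distributed between groups.

```python
def quantile_normalize(samples):
    """Quantile-normalize samples (each a list of protein values of equal length).

    Every value is replaced by the mean of the values holding the same rank
    across all samples. Ties are broken by position (a simplification).
    """
    n_proteins = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Mean of the r-th smallest value across samples.
    rank_means = [
        sum(s[r] for s in sorted_samples) / len(samples) for r in range(n_proteins)
    ]
    normalized = []
    for s in samples:
        order = sorted(range(n_proteins), key=lambda i: s[i])
        out = [0.0] * n_proteins
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


def drop_low_detection(values_by_protein, lod_by_protein, max_fraction_below=0.8):
    """Keep proteins whose fraction of samples below the LOD does not exceed the threshold."""
    kept = {}
    for protein, values in values_by_protein.items():
        below = sum(v < lod_by_protein[protein] for v in values)
        if below / len(values) <= max_fraction_below:
            kept[protein] = values
    return kept
```

After quantile normalization, every sample has the same distribution of values, which is what makes samples run on different plates comparable.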
In order to identify possible outliers, and to evaluate consistency of the data in the discovery cohort, principal component analyses (PCAs) were performed and score plots were visually inspected.
Univariate analyses were performed by Welch's t-test. P-values were adjusted for multiple comparisons using a false discovery rate (FDR) approach with q-values reported. Proteins were regarded as being differentially regulated in the discovery cohort if they had a fold change of at least 1.2 and a q-value of < 0.05.
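As a rough illustration of this selection rule, the sketch below computes the Welch statistic with its Welch-Satterthwaite degrees of freedom, adjusts a list of p-values, and applies the two thresholds. Several points are assumptions rather than details from the text: the Benjamini-Hochberg procedure stands in for the unspecified FDR method, the fold change is taken as a difference in mean log2 values (so a 1.2-fold change corresponds to log2(1.2)), and all function names are hypothetical. Turning the t statistic into a p-value requires the t-distribution, which is left to a statistics library.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def differentially_regulated(log2_diffs, p_values, min_fold_change=1.2, alpha=0.05):
    """Flag proteins with |log2 difference| >= log2(min_fold_change) and adjusted p < alpha."""
    q_values = benjamini_hochberg(p_values)
    threshold = math.log2(min_fold_change)
    return [abs(d) >= threshold and q < alpha for d, q in zip(log2_diffs, q_values)]
```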
Each sample in the replication cohort was normalized independently against all samples of the discovery cohort, to simulate diagnostic conditions for newly measured samples. Proteins showing significant up- or down-regulation, based on univariate analysis of the discovery cohort, were considered to be validated if they had a similar (defined as the 95% confidence interval of fold change for each protein in the discovery cohort) or larger fold change of identical direction in the replication cohort.
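One possible reading of this validation criterion is sketched below. The interpretation is an assumption: a protein counts as validated when the replication fold change has the same sign as in the discovery cohort and is at least as extreme as the nearer bound of the discovery 95% confidence interval, with all quantities expressed as log2 differences. The function name and arguments are hypothetical.

```python
def is_validated(discovery_diff, ci_low, ci_high, replication_diff):
    """Return True if the replication effect is similar or larger, in the same direction.

    All arguments are log2 differences (patients minus controls); ci_low and
    ci_high bound the 95% confidence interval from the discovery cohort.
    """
    if discovery_diff > 0:
        # Up-regulated: inside the interval or above it.
        return replication_diff > 0 and replication_diff >= ci_low
    if discovery_diff < 0:
        # Down-regulated: inside the interval or below it.
        return replication_diff < 0 and replication_diff <= ci_high
    return False
```

Under this reading, a protein with a discovery effect of 0.8 (95% CI 0.4 to 1.2) is validated by a replication effect of 0.5 or 1.5, but not by 0.2 or -0.5.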
Multivariate analysis comprised both principal component analysis (PCA) and sparse partial least-squares analysis (sPLS). [14] Being a supervised learning method, the sPLS model is optimized to separate groups. For the sPLS analysis, CD patients were stratified based on disease location. Due to the complexity of the model and the limited number of patients, disease location was divided into two categories: colonic disease (L2) and ileal/ileocolonic disease (L1/L3). A second series of analyses was performed stratifying for clinical disease activity and excluding patients with active disease. Variable importance in the projection (VIP) was calculated for all variables and the analysis was optimized for both the number of variables and the number of components to use in the prediction model, with a rigorous double cross-validation design. [15,16] The selection of proteins was based on the optimized cut-off for the VIP scores. The prediction model was validated through leave-one-out (LOO) cross-validation and then tested on the replication cohort data set to determine its accuracy in class prediction of disease groups (UC, colonic CD, ileal/ileocolonic CD, CD in clinical remission and UC in clinical remission). [17] Significance of the observed LOO error rates was established by resampling analysis, i.e. by randomly permuting the class labels and re-running the double cross-validation analyses, in order to calculate permutation p-values for the observed LOO prediction hit rates for the original data.
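The resampling logic (LOO error rate, then a permutation p-value obtained by shuffling class labels) can be sketched in Python as follows. This is a deliberately simplified stand-in: a nearest-centroid classifier replaces the sPLS model, a single LOO loop replaces the double cross-validation, and the function names, the number of permutations and the seed are all illustrative assumptions.

```python
import random


def nearest_centroid_predict(train_x, train_y, x):
    """Predict the class whose mean profile is closest (squared Euclidean distance) to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_error_rate(xs, ys):
    """Leave-one-out error: each sample is predicted by a model fitted on all the others."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)


def permutation_p_value(xs, ys, n_permutations=200, seed=0):
    """Observed LOO error and the share of label permutations doing at least as well."""
    rng = random.Random(seed)
    observed = loo_error_rate(xs, ys)
    as_good = 0
    for _ in range(n_permutations):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        if loo_error_rate(xs, shuffled) <= observed:
            as_good += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return observed, (as_good + 1) / (n_permutations + 1)
```

A small p-value indicates that an error rate as low as the observed one is rarely reached when the class labels carry no information.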
Statistical analyses and data processing were performed in R version 3.

Clinical and demographic characteristics of the discovery cohort
Clinical characteristics of patients with CD and UC in the discovery cohort are given in Table 1. The median (IQR) disease duration at inclusion in CD and UC patients was 17.5 (8-28) and 13 (5-25) years, respectively.

Preparation of data set
Serum samples from 162 individuals in the discovery cohort were run in parallel on two ProSeek plates. Ninety-one target proteins were quantified successfully. However, IL-13, IL-33, IL-1 alpha, and TSLP were below LOD in > 80% of the individuals. These proteins were therefore excluded from further analyses, since the concentrations observed in the remaining samples were evenly distributed between CD patients, UC patients, and HCs.

Differentially regulated proteins in the discovery cohort identified by univariate analysis
Differentially regulated proteins between different groups of individuals, identified by univariate analysis, are shown in Table 2. Twenty-two proteins were identified by univariate analysis when CD and UC patients were compared to HCs. Seven of these 22 proteins differed in both CD and UC (Fig 1).

Differentially regulated proteins in the discovery cohort identified by multivariate analysis
The score plots for the first two components of the sPLS model showed a partial separation of CD, UC, and HC samples within the discovery cohort (Fig 2).
Proteins of importance for disease classification were then identified by discriminant sPLS analyses and cut-offs for optimal separation in each analysis were selected by the VIP score (S2 Table). All the proteins identified in univariate analysis as being significantly altered (when CD patients and UC patients were compared to HCs) were also included in the discriminant sPLS analyses. In total, 64 candidate proteins were identified as being of importance for the differentiation of CD patients from HCs. The corresponding figure for UC was 51. In addition to FGF-19, discriminant sPLS analysis identified 38 additional candidate proteins when CD was compared to UC. Score plots for the first two components of the sPLS model also showed a partial differentiation of ileal/ileocolonic CD (L1/L3), colonic CD (L2), and UC samples in the discovery cohort (Fig 3). Finally, candidate proteins of importance for the differentiation of subphenotypes of CD and UC as well as for patients in clinical remission were identified by discriminant sPLS analyses (S2 Table).
Cross-validation error rates for discrimination of subgroups of individuals and corresponding resampling p-values, based on our LOO double cross-validation design, were calculated; all comparisons to HCs yielded significant p-values (Table 3).

Validation of differentially regulated proteins in the replication cohort
The data obtained in the discovery cohort were subsequently validated in the replication cohort, consisting of 30 CD patients, 30 UC patients, and 30 HCs. The PEA of the validation samples was separate from the analysis of the discovery samples and performed on a later occasion. There were no significant differences in clinical characteristics between the discovery cohort and the replication cohort (Table 1). In total, 16 of the 17 proteins that were associated with CD, based on univariate comparisons in the discovery cohort, could be validated in the replication cohort (Fig 4A and 4B). Correspondingly, eight of the twelve proteins that were apparently associated with UC could be confirmed in the replication cohort (Fig 5). FGF-19 was also validated as being differentially regulated in CD relative to UC, and also in ileal/ileocolonic CD (L1/L3) relative to UC. Validation of the multivariate discrimination of subgroups in the replication cohort generated slightly higher error rates for all statistically significant models (Table 4). Receiver operating characteristic (ROC) curves for the different prediction models were computed (Fig 6). An area under the curve (AUC) of 0.95 was observed for CD vs. HCs, 0.96 for UC vs. HCs and 0.65 for CD vs. UC.
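For readers less familiar with the AUC, it can be read as the probability that a randomly chosen patient receives a higher model score than a randomly chosen control. The short Python sketch below computes it by direct pairwise comparison; the function name and the example scores are hypothetical and unrelated to the study data.

```python
def roc_auc(scores_cases, scores_controls):
    """AUC as the share of case-control pairs in which the case scores higher (ties count 0.5)."""
    wins = 0.0
    for case in scores_cases:
        for control in scores_controls:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(scores_cases) * len(scores_controls))


# Hypothetical prediction scores: 8 of the 9 case-control pairs are ordered correctly.
example_auc = roc_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which puts the reported value of 0.65 for the comparison between disease groups in context.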

Discussion
There has been great progress in the characterization of the genetic landscape of IBD in recent years. However, genetic variants alone do not appear to be sufficient to cause the disease, with the possible exception of some cases of very early onset IBD. Gene expression is also influenced by epigenetic mechanisms and the transcriptome undergoes additional modification before translation into proteins. Thus, the proteome of IBD patients may reflect both genetic and environmental factors important in the pathophysiology of IBD. However, the profound difference in concentrations of different proteins in serum has hampered the exploitation of the serum proteome, [7] and previous attempts to address the IBD proteome in serum. [18-24] This drawback has been overcome by some recently introduced methods. As a proof of concept, we applied the novel proximity extension assay (PEA) technique to identify serum protein profiles of IBD using a panel of 91 proteins.
Based on univariate comparisons of CD patients and HCs in a discovery cohort and on validation in a replication cohort, 16 serum proteins were identified which differed significantly between CD patients and HCs. In the same way, eight proteins were validated in patients with UC. One of these proteins, FGF-19, was also down-regulated in CD compared with UC.
In order to identify protein profiles that are specifically associated with disease group, we applied a supervised method, namely the sPLS model. Using this supervised model, we were able to partially differentiate IBD patients from HCs based on their protein profiles. Cross-validation revealed that CD patients could be accurately discriminated from HCs in 88% of cases; the corresponding figure for UC was 81%. This accuracy of discrimination was found to be significant based on resampling analysis, although the observed accuracy dropped slightly when the model was applied to the replication cohort. The observed accuracy was supported by the ROC analyses, where the AUC was 0.95-0.96 when comparing CD and UC patients with HCs. For the comparison of subphenotypes, such as UC vs. colonic CD, the prediction accuracy was lower and not significant. However, identification of clinically relevant biosignatures would probably benefit from inclusion of additional variables, including clinical information, and should not rely on the inflammatory protein panel only.
Several of the differentially regulated proteins that were validated in patients with CD or UC are cytokines, including IL6, IL17A, IL18, IFN-γ, eotaxin-1, CXCL9, caspase-8, and CXCL11, which have already been associated with IBD. [1,18,20,25-27] Similarly, an association between the neutrophil-derived protein S100-A12 and IBD has also been reported previously. [28,29] We were able to validate the association of S100-A12 for CD but not for UC. Since S100-A12 has been associated with active disease, [30] the non-significant change in [...] In addition, our results also show reduction of TNFSF14 and TGF-alpha in both CD and UC. TNFSF14 has been suggested to be an important mediator of the pathogenesis in CD, and in murine models neutralizing antibodies for TNFSF14 have reduced symptoms of induced colitis. [34] Conflicting results for TGF-alpha have been published before, with increasing numbers of TGF-alpha-containing cells in inflamed mucosa in UC, in contrast to the results of another study showing relatively lower protein signals in inflamed biopsies compared to non-inflamed biopsies in both CD and UC. [35,36]
Intriguingly, the observed predictive power seemed to remain when stratifying for disease activity and restricting the analyses to patients in clinical remission. This observation reveals that the observed difference between IBD patients and HCs is not only seen in patients with a high systemic inflammatory burden. Hence, the different phenotypes of IBD seemed to involve different inflammatory pathways, which might be of interest for distinguishing the different IBD entities.
The univariate comparisons of CD and UC patients only identified FGF-19. FGF-19 is produced by enterocytes in the distal ileum on uptake of bile acids, [37] and a correlation with bile acid malabsorption in Crohn's disease has recently been shown. [38] The observed decrease in serum FGF-19 was most pronounced in patients with ileal involvement, probably reflecting that 77% of the patients with ileal/ileocolonic disease had undergone surgical resection. A poor cross-validation error rate (0.41) and a non-significant resampling p-value were observed when trying to differentiate between CD and UC using multivariate analyses. However, the error rate seemed to improve when we stratified for location of CD and restricted the analysis to CD patients with ileal involvement (L1 and L3). These results are in line with recent genetic data from the international IBD genetic consortium (IIBDGC), where CD patients with ileal disease were reported to be more distantly related genetically to patients with UC than patients with colonic CD. [6] The model separating CD patients from UC patients also seemed to improve when stratifying for disease activity and including patients in clinical remission. The sPLS method identified CCL28 as a potentially interesting marker together with FGF-19, CSF-1, IL-18, and TGF-beta-1. CCL28 has been shown to exhibit a protective function against bacteria by direct antimicrobial effect and by recruiting IgA-producing leukocytes. [39,40]
To our knowledge, most studies of the serum proteome in IBD to date have analyzed a sparse selection of proteins. In that context, this study advances the field by introducing a novel protein signature approach. The validation process and the use of a replication cohort were a major strength of our study. On the other hand, the results were limited by the pre-selection of candidate biomarkers, since we used a predefined commercially available panel of inflammatory proteins.
A further limitation of this study is the reliance on clinical assessments of disease activity, which poorly predict mucosal inflammation. [41] Objective assessments of inflammatory activity such as endoscopic evaluation, CRP or fecal calprotectin measurements would help to address this limitation. The inflammatory activity of the healthy controls was not assessed by any objective measures, but subjects with acute inflammation or ongoing treatment for any inflammatory disease are not eligible as blood donors. The mixed cohort represents the many stages of IBD and the data may therefore be influenced by previous and ongoing pharmacological treatments as well as surgery. Notably, few patients were on anti-TNF therapy since patients were recruited at the outpatient clinic, mostly before the widespread use of biologics, and not at the infusion unit. Thus, this cohort can be used for detection of differences between established IBD and healthy controls, but it is not ideal for the purpose of diagnostic biomarker identification.
In summary, by using the novel PEA method and a panel of inflammatory proteins, we were able to identify proteins with significantly different quantities in CD patients and UC patients compared to HCs. Moreover, the protein profiles identified allowed us to partially differentiate between different subgroups of IBD patients, even when restricting the analyses to patients in clinical remission. Our work highlights the potential of the serum IBD proteome as a source for identification of future diagnostic biomarkers, but such efforts should be made in an inception cohort of treatment-naïve IBD patients, where patients with symptoms mimicking IBD are used for comparisons.
Supporting information S1