Plasma Protein Profiling Reveals Protein Clusters Related to BMI and Insulin Levels in Middle-Aged Overweight Subjects

Background
Biomarkers that allow detection of the onset of disease are of high interest since early detection would allow intervening with lifestyle and nutritional changes before the disease is manifested and pharmacological therapy is required. Our study aimed to improve the phenotypic characterization of overweight but apparently healthy subjects and to identify new candidate profiles for early biomarkers of obesity-related diseases such as cardiovascular disease and type 2 diabetes.

Methodology/Principal Findings
In a population of 56 healthy, middle-aged overweight subjects, Body Mass Index (BMI) and fasting concentrations of 124 plasma proteins and insulin were determined. The plasma proteins are implicated in chronic diseases, inflammation, endothelial function and metabolic signaling. Random Forest was applied to select proteins associated with BMI and plasma insulin. Subsequently, the selected proteins were analyzed by clustering methods to identify protein clusters associated with BMI and plasma insulin. Similar analyses were performed for a second population of 20 healthy, overweight older subjects to verify associations found in population I. In both populations similar clusters of proteins associated with BMI or insulin were identified. Leptin and a number of pro-inflammatory proteins previously identified as possible biomarkers for obesity-related disease, e.g. Complement 3, C Reactive Protein, Serum Amyloid P and Vascular Endothelial Growth Factor, clustered together and were positively associated with BMI and insulin. IL-3 and IL-13 clustered together with Apolipoprotein A1, were inversely associated with BMI and might be potential new biomarkers.

Conclusion/Significance
We identified clusters of plasma proteins associated with BMI and insulin in healthy populations. These clusters included previously reported biomarkers for obesity-related disease and potential new biomarkers such as IL-3 and IL-13. These plasma protein clusters could have potential applications for improved phenotypic characterization of volunteers in nutritional intervention studies or as biomarkers in the early detection of obesity-linked disease development and progression.


Introduction
Cardiovascular disease (CVD) and type 2 diabetes (T2DM) are common disorders affecting millions of people worldwide. Evidence is accumulating that chronic low-grade inflammation plays a role in the development of both diseases [1,2]. Increased plasma levels of several pro-inflammatory proteins and decreased levels of anti-inflammatory proteins have been observed in subjects with obesity and obesity-related diseases such as CVD and T2DM [3,4,5,6].
Certain pro-inflammatory plasma proteins are used as diagnostic biomarkers for disease state, but specific plasma proteins may also be used as biomarkers for an early stage in the development of a disease. Such improved pre-disease diagnostics would allow intervention with relatively mild strategies, such as lifestyle interventions with specific dietary regimes and increased physical activity, in contrast to the pharmacological therapy required once the disease is manifested. Identification of biomarkers that allow detection of the onset of disease will help in prevention of the disease. Plasma proteins might be good candidates as they circulate throughout the whole body, thereby reflecting total body metabolic and inflammatory status. Moreover, blood can be easily obtained from human subjects and therefore plasma proteins can be easily measured for screening purposes.
So far, in most studies that investigated the use of plasma proteins as biomarkers only a few plasma proteins were measured. However, the etiology of diseases such as CVD and T2DM is complex and the measurement of multiple biomarkers will provide additional information about the individual phenotype and health status as compared with measurement of a single biomarker [7,8].
Recent technological advances such as multiplex immunoassays allow for the measurement of over a hundred proteins at a time in one small plasma sample. Identification of biomarker profiles in such large protein datasets requires advanced statistical analyses. Random Forest (RF) has been shown to be suitable for analysis of complex data sets such as those derived from proteomics analysis [9,10]. RF is a technique that can prioritize and select, from a large number of variables, a set of variables that is likely to be related to the outcome of interest. Furthermore, in the prioritization and selection process, it provides a way to take interactions between proteins into account [11,12]. The proteins that are selected by RF can subsequently be analyzed by clustering methods, offering the opportunity to identify clusters of proteins that are associated with different health outcomes.
Our study aimed to improve the phenotypic characterization of overweight but apparently healthy subjects and to identify new candidate profiles for early biomarkers of obesity-related diseases such as CVD and T2DM.

Subjects
Two populations were included in this study; population I was the primary study population of interest and population II was a smaller population used for verification of the results found in population I.
Population I consisted of 56 healthy men and women who participated in a controlled feeding trial [13]. Subjects included were aged 40-65 years with a BMI ≥25 kg/m² or a waist circumference ≥94 cm for men and ≥80 cm for women. Excluded were hypercholesterolemic subjects (fasting total cholesterol ≥8 mmol/L) and subjects with non-treated diabetes mellitus (according to WHO criteria) as measured by an oral glucose tolerance test during screening. Other exclusion criteria were the use of serum lipid or blood pressure lowering medication.
Population II consisted of 20 healthy, independently living elderly men and women. This population is a subgroup of the population participating in the study of Van de Rest et al. [14]. Subjects included were aged >65 years without depression, dementia or serious liver disease. This population was chosen because subjects were healthy and had average BMI and insulin values comparable to those from population I.
Both studies were approved by the Medical Ethics Committee of Wageningen University and all subjects gave written informed consent.

Plasma proteins
In population I, concentrations of 124 proteins, including insulin, were measured in fasting plasma by quantitative multiplex immunoassay based on Luminex xMAP technology (Rules Based Medicine Inc, Austin, Texas, USA) according to the procedure described by Domenici et al. [15,16]. For each multiplex, both calibrators and controls were included on each microtiter plate. 8-point calibrators were run in the first and last column of each plate and 3-level controls were included in duplicate. Testing results were determined first for the high, medium and low controls for each multiplex to ensure proper assay performance. Unknown values for each of the analytes localized in a specific multiplex were determined using 4- and 5-parameter, weighted and non-weighted curve fitting algorithms included in the data analysis package. The plasma samples were run in duplicate and data were reported as concentrations (average of two independent measures), together with data for the least detectable dose (LDD). Any value above the LDD has a coefficient of variation (CV) of less than 20%. Rules-Based Medicine's Multi-Analyte Profiles have been validated to Clinical Laboratory Standards Institute guidelines.
The set of proteins present on the assay consists of factors that are implicated in chronic diseases, inflammation, endothelial function and metabolism. In table S1 all proteins measured are listed.
Before analysis, 16 out of 124 proteins (figure S1) were removed from the dataset as the concentrations of these proteins were below the detection limit in more than half of the samples. For the remaining 108 proteins, values below the detection limit were replaced by 0.1 × LDD. As IL-6 was one of the removed proteins and IL-6 is an important factor for obesity and diabetes [6], this protein was separately measured by high-sensitivity enzyme immunoassay (Human IL-6 Quantikine HS ELISA Kit, R&D Systems, Abingdon, United Kingdom). Apolipoprotein B (ApoB) was not included in the Human Multi-Analyte Profiles, and ApoB levels were additionally measured on a Hitachi 912 autoanalyser (Roche, Lelystad, The Netherlands) using a commercially available kit (Roche cat. no. 1551779). In population II, the plasma concentrations of 107 out of the 124 proteins measured in population I were determined using multiplex immunoassay (Rules Based Medicine Inc, Austin, Texas, USA) (table S1). In total, 110 proteins were included in the analysis for population I and 90 proteins were included in the analysis for population II (figure S1).
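The filtering and substitution rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' pipeline: the function name, the data layout (a dictionary of per-protein value lists, with None marking a value reported as below detection) and the toy numbers are all assumptions introduced here.

```python
from typing import Dict, List, Optional


def filter_and_impute(
    data: Dict[str, List[Optional[float]]],
    ldd: Dict[str, float],
    max_below_fraction: float = 0.5,
    fill_factor: float = 0.1,
) -> Dict[str, List[float]]:
    """Drop proteins mostly below the LDD; impute the rest with fill_factor * LDD."""
    cleaned: Dict[str, List[float]] = {}
    for protein, values in data.items():
        limit = ldd[protein]
        # A value counts as "below detection" if it is missing or under the LDD.
        below = [v is None or v < limit for v in values]
        if sum(below) / len(values) > max_below_fraction:
            continue  # below the detection limit in more than half of the samples
        cleaned[protein] = [
            fill_factor * limit if is_below else v
            for v, is_below in zip(values, below)
        ]
    return cleaned


# Hypothetical toy data: 4 samples, 2 proteins (names and values are made up).
toy = {
    "protein_a": [5.0, None, 7.0, 6.0],    # 1 of 4 below LDD: kept and imputed
    "protein_b": [None, None, None, 2.0],  # 3 of 4 below LDD: removed
}
toy_ldd = {"protein_a": 1.0, "protein_b": 1.0}
result = filter_and_impute(toy, toy_ldd)
```

In this toy case only protein_a survives, and its single undetected value is replaced by one tenth of its LDD.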

Statistical analyses
Univariate analyses. The association between individual protein concentrations, BMI and plasma insulin concentration was calculated by univariate analysis. The statistical package PASW (version 17.0; SPSS, Chicago, IL) was used for the univariate analysis. Since the distribution of several variables was slightly skewed in the population, Spearman correlation coefficients were calculated for the association between protein concentrations, BMI and insulin concentrations.
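For readers unfamiliar with the rank-based coefficient, the following is a minimal Python sketch of a Spearman correlation: values are replaced by their ranks (ties receive the average rank) and a Pearson correlation is computed on the ranks. The study itself used PASW; the function names and the toy BMI and protein values below are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: the protein is strongly skewed but rises monotonically with BMI.
bmi = [24.1, 27.3, 29.8, 31.2, 35.0]
protein = [1.2, 1.9, 2.5, 9.0, 40.0]
rho = spearman(bmi, protein)
```

Because only the ordering matters, the skewed protein values in the toy example still give a perfect rank correlation, which is the property that motivates the choice of Spearman coefficients for slightly skewed variables.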
Random Forest. Random Forest (RF) was used to provide a ranking in the importance of proteins in their relationship with BMI as well as with plasma insulin concentrations, taking possible interactions between proteins into account. The R package randomForest (http://cran.r-project.org/), which is based on the original FORTRAN code from Breiman et al. [11] (www.stat.berkeley.edu/breiman/RandomForests/), was used for the analysis.
In RF a group of tree-based models (the forest) is used to rank the proteins with an important contribution to BMI or insulin values [12,17]. Each tree starts with the total data set, which is split into smaller and more homogeneous groups to fit models for predicting the outcome from the measured proteins. Within the forest, different trees are obtained by bootstrap sampling and random subset selection.
Importance of proteins in association with the outcome of interest is defined by a measure referred to as the importance index, Im. For each protein, this Im is obtained by comparing the predictive performance of the forest for all proteins with the predictive performance of the forest in which the values of the protein are randomly permuted in the trees for the left-out observations. Larger differences in the predictive performance give a larger Im, indicating that the protein is more important. By permuting the values for one protein, not only the effect of this protein is taken into account, but also all possible interactions of this protein with other proteins. Interactions between proteins increase the Im for each of the proteins that are part of the interaction. Thus, in the ranking of proteins by their importance, RF takes interactions between proteins into account.
To perform the RF analyses we used the scaled mean decrease in prediction accuracy. To obtain stable estimates of the Im and to capture as many important interactions as possible, the analyses were performed with a large number of trees (40,000). Im was used as the measure to rank the proteins. We chose not to apply an FDR estimation of the Im scores, because FDR estimation of importance scores derived by tree-based approaches usually overestimates the real FDR and can lead to an unreliable selection of a subset of variables [18].
For RF analysis a threshold value of significance does not exist. In this study a threshold was set at an Im of 5 and only proteins with an Im >5 were considered for subsequent cluster analyses. We chose this liberal threshold to avoid the possibility of leaving out proteins that might be of importance in relation to BMI and insulin.
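The permutation idea behind the importance index can be illustrated with a simplified Python sketch. Note the assumptions: a real Random Forest computes this per tree on the left-out (out-of-bag) observations and averages over trees, whereas the sketch below permutes one variable for a single, fixed, hypothetical predictor on the full toy data set. All names and numbers are made up for illustration.

```python
import random
from typing import Callable, Dict, List

Row = Dict[str, float]


def mse(predict: Callable[[Row], float], rows: List[Row], y: List[float]) -> float:
    """Mean squared prediction error of `predict` on the given rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)


def permutation_importance(
    predict: Callable[[Row], float],
    rows: List[Row],
    y: List[float],
    feature: str,
    n_repeats: int = 50,
    seed: int = 0,
) -> float:
    """Mean increase in MSE after shuffling one feature's values across rows."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, y)
    increases = []
    for _ in range(n_repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        # Copy each row, replacing only the permuted feature.
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
        increases.append(mse(predict, permuted, y) - baseline)
    return sum(increases) / n_repeats


# Hypothetical stand-in for a fitted forest: the outcome depends on "leptin" only.
def toy_model(row: Row) -> float:
    return 20.0 + 2.0 * row["leptin"]


rows = [{"leptin": float(i), "noise": float((i * 7) % 5)} for i in range(10)]
y = [toy_model(r) for r in rows]
imp_leptin = permutation_importance(toy_model, rows, y, "leptin")
imp_noise = permutation_importance(toy_model, rows, y, "noise")
```

Permuting a variable the predictor ignores leaves the error unchanged, so its importance is zero, while permuting a variable the predictor relies on degrades the predictions and yields a positive importance. Because the whole row is re-predicted after permutation, any contribution the variable makes through interactions with other variables is lost as well, which is why interacting variables each gain importance.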

Clustering of the proteins
The program MultiExperimentViewer, version 4.3 was used for hierarchical clustering and visualization of the data [19]. Hierarchical clustering organizes the data into a binary tree that groups similar elements together. Proteins, BMI and insulin were clustered based on their Spearman correlation coefficients to select groups of proteins with high correlation with BMI or insulin and with each other. Besides clustering of the proteins with BMI and proteins with insulin based on their correlation coefficients, individual protein profiles were also clustered based on similarities in protein concentrations. To compare the individual protein concentrations, z-scores, (x − μ)/σ, were calculated for each individual protein. To compare the association of the identified clusters of proteins and BMI to the association of single traditional biomarkers and BMI, regression analysis was performed using the statistical package PASW (version 17.0; SPSS, Chicago, IL).
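As a small worked example of the z-score standardization used to compare protein concentrations across subjects, the Python sketch below standardizes one protein so that proteins measured on very different scales become comparable. The use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not specify it, and the concentrations are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def z_scores(values: Sequence[float]) -> List[float]:
    """Standardize one protein across subjects: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample SD (n - 1 denominator); an assumption
    return [(x - mu) / sigma for x in values]


# Hypothetical concentrations of one protein in five subjects.
concentrations = [0.5, 1.0, 1.5, 2.0, 5.0]
z = z_scores(concentrations)
```

After standardization each protein has mean 0 and standard deviation 1 across subjects, so a subject's profile shows how far above or below the population average each protein lies.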

Pathway analysis
Ingenuity Pathways Analysis, version 8.7 (Ingenuity® Systems, www.ingenuity.com) was used to identify connections between proteins and canonical pathways and diseases that were most significant to the data. Proteins were entered in Ingenuity based on their Swiss-Prot ID and only connections, both direct and indirect, between proteins for humans and human primary cells were considered in the analysis.

Baseline characteristics
Characteristics of the two study populations used in the analysis are displayed in table 1. BMI, waist circumference, insulin levels and percentage of smokers were comparable in the two populations, whereas age was significantly higher (13.7±1.1) in population II.
Plasma protein concentrations for both populations are displayed in table S2.

Associations of proteins with BMI
In population I the RF Im of 20 proteins was above 5 and these proteins were considered to be associated with BMI (table 2). Using univariate analysis, 14 out of these 20 selected proteins were significantly correlated with BMI (table 2). For these 20 proteins and BMI, mutual correlation coefficients were calculated. Based on these correlation coefficients a correlation matrix was constructed in which the proteins and BMI were clustered by similarity in their correlations (figure 1A). Using this approach, three clusters of proteins associated with BMI could be identified; clusters 1 and 3 were positively associated with BMI while cluster 2 was negatively associated with BMI. Cluster 1 showed robust associations with BMI and contained proteins highly positively associated with BMI and with each other: insulin, leptin, Complement 3 (C3), Interleukin 6 (IL-6), C Reactive Protein (CRP), Plasminogen Activator Inhibitor (PAI-1), Serum Amyloid P (SAP) and Vascular Endothelial Growth Factor (VEGF). Cluster 2 also showed robust associations with BMI; the included proteins were inversely associated with BMI and positively with each other (figure 1A), and comprised Apolipoprotein A1 (ApoA1), Cancer Antigen 19-9 (CA 19-9), Eotaxin, IL-3 and IL-13. The third cluster included proteins that were positively associated with BMI and each other, but most of these associations were less pronounced than in clusters 1 and 2.
In population II, 22 proteins were associated with BMI, based on an RF Im above 5, of which ten proteins were also associated with BMI in population I (table 2 and figure S1). A correlation matrix of these ten proteins and BMI was made for population II. Clustering of the proteins in population II was similar to the clustering in population I (figure 1A and 1B). The protein clusters 1 and 2, which showed robust associations in population I, were also associated with BMI in population II. The weaker associations of protein cluster 3 could not be confirmed. Associations in population II were more robust than associations in population I.
Regression analysis was performed to compare the association of the identified clusters of proteins and BMI to the association of single traditional biomarkers such as CRP and IL-6. The explained variance was higher when all proteins of cluster 1 were included in the model than when only IL-6 or CRP or a combination of both were included. For BMI, the proportion of variance explained was 16.3% for IL-6 alone, 19.4% for CRP alone, 22.0% for CRP and IL-6 combined, and 32.3% when all proteins from cluster 1 were included in the model. For insulin, cluster 1 explained 25.6% of the variance, compared to 8.3% by CRP alone, 11.8% by IL-6 alone, and 12.0% by CRP and IL-6 combined.
From the cluster analysis we selected the highly BMI-associated proteins from cluster 1 to plot individual plasma profiles (figure 2). Subjects with similar plasma protein concentrations were clustered and their BMI values were subsequently displayed. Figure 2 shows that, as expected, persons with higher BMI in general have higher concentrations of the selected proteins. However, a few persons with BMI values <25 kg/m² had high plasma levels of these proteins and a few persons with BMI values >30 kg/m² had low plasma protein levels.

Association of proteins with insulin concentration
The association between protein profiles and fasting insulin concentration was investigated using the same approach as for BMI. In population I, RF analysis identified 20 proteins that were considered to be associated with insulin concentration (table 3). Using univariate analysis, ten of these proteins were significantly correlated with insulin concentration. Hierarchical clustering of the 20 selected proteins and insulin based on correlation coefficients resulted in the formation of four separate protein clusters (figure 3A). The proteins forming clusters 1, 2 and 3 were all positively associated with insulin concentration and the proteins in the fourth cluster were negatively associated with insulin.
In population II, 9 out of the 20 proteins selected in population I were associated with plasma insulin concentrations (table 3, figure 3B and figure S1). As for BMI, associations with insulin in population II were more robust than associations with insulin in population I.

Pathway analysis
An overview of all clusters of proteins and interactions between the single proteins selected by RF is displayed in figure 4. This figure also shows the top 5 most significant pathways and diseases for the BMI- and insulin-associated proteins. Acute phase response signaling was the most significant pathway for BMI-associated proteins and was also significant for insulin-associated proteins.

Discussion
Elevated plasma levels of several pro-inflammatory proteins are related to the development of obesity-linked diseases, in particular to T2DM and CVD [4,20,21,22]. In the current study we observed associations of clusters of pro- and anti-inflammatory proteins with BMI and insulin in a presumably healthy population. These BMI- and insulin-associated protein clusters may serve as biomarkers for a pre-disease state of people at risk of developing CVD and T2DM. The protein clusters could possibly be used to improve individual disease risk prediction and help in the design of personalized strategies to prevent disease as early as possible.
The protein cluster showing the most robust positive association with BMI contained a number of pro-inflammatory proteins of which several are involved in the acute phase response, such as CRP, IL-6, C3 and SAP. These and other proteins (i.e. ASP, MPO, PAI-1 and VEGF) included in this cluster and in cluster 1 that was associated with insulin were previously found to be increased in subjects with insulin resistance, CVD, or both [3,21,23,24,25,26,27,28,29,30,31,32]. From prospective studies evidence has accumulated that increased levels of C3 and CRP can predict T2DM and coronary events and could be candidate biomarkers for a pathological state preceding the ultimate disease [3,8,26,27]. We hypothesize that the proteins clustering together with CRP and C3 could be a similar type of biomarker. Moreover, Macrophage Colony-Stimulating Factor (MCSF), which was highly positively associated with BMI, was discovered to be a prognostic marker of cardiovascular events in patients with chronic coronary artery disease [33]. We speculate that MCSF may also be an early biomarker for cardiac disease in healthy subjects.
The positive correlations of BMI with leptin and other adipose tissue-derived proteins, as seen in our study, support the view of adipose tissue as an important source of immune-related proteins [34]. However, besides adipose tissue-derived proteins, proteins produced by the liver, endothelial cells and immune cells, as well as lipoproteins, were also associated with BMI and insulin in our healthy subjects. This indicates that low-grade inflammation in the early disease state is not only an adipose tissue-specific effect, but that cells of other organs also increase their secretion of inflammatory proteins with increasing BMI or plasma insulin levels.
A small cluster of proteins (ApoA1, CA19-9, Eotaxin, IL-3 and IL-13) was negatively associated with BMI. Plasma levels of tumor marker CA19-9 and ApoA1, the major protein component of plasma HDL, were reported to be lower in obesity [35,36]. Less is known about the negative association of BMI with the Th2 cytokines IL-3 and IL-13 and Eotaxin, which are correlated with each other and closely clustered [37,38,39]. Concentrations of IL-3 and IL-13 were on average very low in the populations, so these data should be interpreted with care. Nevertheless, we observed associations of IL-3 and IL-13 with BMI in both study populations; therefore these proteins might be considered as potential new early biomarkers for obesity-related diseases.
Associations of the strongly associated protein clusters 1 and 2 with BMI and insulin in the primary study population were confirmed in the second population. Weaker associations were not all confirmed, which could be attributable to the lower number of subjects in population II and the fact that fewer proteins were measured, reducing the chance of finding protein interactions with RF. However, the strong associations found in population I were even more pronounced in population II. The latter population was significantly older but concentrations of individual plasma pro-inflammatory proteins were not higher. Possibly an elevated BMI or insulin concentration in elderly subjects is directly linked to an increase in pro-inflammatory protein secretion, while this is not always the case in younger subjects, who may display a more flexible metabolic phenotype in handling changes in BMI or insulin concentration.
Levels of individual proteins were not extremely elevated in our study subjects, but based on protein profiles subjects may be differentially classified with lower or higher risk of developing a more pathological phenotype. Despite the fact that the risk profile was BMI-related in the whole study population, there were a few subjects with low BMI who also had this risk profile and a few subjects with high BMI who did not. Besides BMI, waist-to-height ratio can be used as a determinant for obesity-related health risk. When the same analyses were performed for the association of proteins with waist-to-height ratio, results similar to those for BMI were found. Our findings support the recent ideas of using a 'multimarker' approach, i.e. measuring multiple plasma proteins instead of single biomarkers or only BMI, to increase the prognostic value for individual disease risk [8,40,41].
Still, the measurement of multiple proteins makes data analysis complex and calls for more advanced methods of data analysis. Using RF we were able to identify associations of proteins with BMI or insulin that were not observed when each protein was analyzed separately, but that may be relevant in combination with other proteins. For RF we chose a liberal threshold for selection of proteins associated with BMI or insulin, which increases the risk of selecting proteins that are not related to the outcome of interest. However, using this threshold, we observed a high overlap between the results from RF and univariate analysis. Furthermore, we applied this analysis to two independent populations to replicate and verify the associations of protein clusters with BMI and insulin. In our view, this approach has increased the reliability of our results.
The results of our study are based on a single measurement of plasma proteins that provides a "snapshot" of the actual health status. Whether it can be used in predicting long-term health and inflammatory status requires more long-term studies. Furthermore, our primary study population consisted of a relatively low number of subjects. However, we have observed associations of proteins with BMI or insulin that are physiologically relevant and were found in epidemiological and clinical studies [3,5,6,8,21,23,24,31,32]. Moreover, we were able to confirm the most robust associations in a second, even smaller population consisting of subjects from a different age group.
With our study it is not possible to draw conclusions about the causal relationship between elevated concentrations of the selected proteins and occurrence of disease. Further prospective studies with clinical end points are needed to determine whether the protein clusters found in this study could be used as reliable biomarkers for early identification of persons at risk for T2DM and CVD.
Our study aimed at identifying new leads for clusters of early biomarkers of disease by using plasma protein profiling in healthy subjects. In healthy subjects we identified clusters of proteins associated with BMI and insulin that included previously identified biomarkers for obesity-related disease risk and potential new biomarkers for which an association with disease is not well established. We showed that plasma protein profiling allows a more subtle phenotypic characterization and differentiation of people with otherwise similar phenotypical features such as BMI or insulin levels. This could be of great value for dietary and pharmacological intervention studies where subgroups of volunteers with matching phenotypes could be included in order to improve the power of such interventions. Improved individual risk assessment and classification of subjects may ultimately lead to a more tailored and adequate intervention either by pharmacology or changes in lifestyle.

Supporting Information
Table S1 Proteins included in the analysis for both populations. '+' indicates that the protein is measured and detected in more than half of the samples and included in the analysis. '-' indicates that the protein is not measured or not detected in more than half of the samples and therefore not included in the analysis. Found at: doi:10.1371/journal.pone.0014422.s001 (0.14 MB DOC)