Significant common environmental effects on leukocyte subpopulations

Major efforts are invested in the analysis of phenotypic variation in a population of individuals. While many of these studies focus on the genetic basis of phenotypic variation via measurements of DNA polymorphic sites, the environmental effects are still elusive. Here we propose a methodology, called CCCE ('Cell Composition Common Environment'), to identify environmental effects on the composition of immune cell functionalities. Specifically, CCCE is focused on the common experiences that are shared between siblings (the ‘common environment’), designed to correct for cell subpopulation heterogeneity, and is based on a multicolor flow cytometry analysis across a large cohort of human monozygotic and dizygotic twins. We demonstrate that the CCCE methodology can provide insights on the relations between common environmental effects and the heterogenic functions of several immune cell types, such as NK cells effector functions and coagulation-related capabilities of monocytes. The software described in this article is available at http://csgi.tau.ac.il/CCCE.


Introduction
Trait variation is a result of both genetic and life history differences between individuals. The life experience of individuals, referred to as the 'environmental effect', has two main components: first, the common experiences shared between siblings that grow up in the same household (e.g., similar maternal care), referred to as the common environmental effect, and second, the specific experiences, which are unique to each individual, including biological and technical noise [1].
Common environmental effects, such as the same intrauterine experience and maternal care, diet and exposure to air pollution in shared indoor housing, are of particular interest in analyzing phenotypic variation, as they often convey important influences on early development [1]. In addition, the common environment also dictates the exposure to pathogens and immunization during early childhood, thus bearing possible implications on immune homeostasis and a variety of immune-related diseases [2]. Understanding the possible effects of the common environment on specific phenotypes is of fundamental interest and could facilitate the development of strategies for preventing and reducing environmental risks. PLOS  Immune homeostasis is an attractive system for the analysis of the environmental effects on global regulation of normal physiological conditions. Many aspects of immune cell physiology, such as proliferation and differentiation, can be revealed based on the quantity of different cell types. Recent immunological studies of twins examine the contribution of genetic and environmental effects to immune traits variation, and suggest that a substantial proportion of the variation in homeostatic cell quantities in the blood is due to non-genetic factors [3][4][5][6][7]. Moreover, the variation attributed to the common environment appears in a larger number of cell subpopulations than previously assumed [8]. Importantly, any protein presented on the cell surface likely performs part of a certain functionality of the cell, either the main cell functionality, or alternatively, a secondary or a subsidiary functionality. We therefore refer to any such protein in the context of a specific cell type as a subfunction. For instance, monocytes presenting the cell surface marker Thrombomodulin (THBD) carry the anticoagulation subfunction, in addition to their main role in innate and adaptive immunity [9]. Notably, recent work has demonstrated genetic effects acting on immune subfunctions [10]. Yet, despite the importance of such variation, we currently lack an assessment of the role of the environment in determining the composition of immune subfunctions during homeostasis.
Advances in flow cytometry analysis have opened the way to comprehensive multiparametric analysis of immune cell subfunctions, facilitating quantification and isolation of the various cell types [11]. In addition, multicolor flow cytometry allows monitoring the expression of a panel of markers on the surface of cells as a rich annotation of various subfunctions in each individual single cell [12]. A typical multicolor flow cytometry, performed across a large cohort of individuals, can be therefore used to assess the magnitude of common environmental effects on immune cell subfunctions.
Here, we developed a computational approach, referred to as CCCE ('Cell Composition Common Environment'), which aims to identify common environmental effects on immune subfunctions. Whereas the standard approach investigates the inter-individual variation in the composition of immune cell subpopulations [8], CCCE identifies environmental effects on the composition of immune cell subfunctions while accounting for the heterogeneity in cell subpopulations. Notably, CCCE only relies on multicolor flow cytometry measurements across a large cohort of human twins, without requiring any prior knowledge or measurements of the underlying environmental factors.
Using CCCE, we demonstrate the existence of a common environmental effect on the distribution of subfunctions in four cell types (monocytes, effector NK cells, CD4 + CD8 + αβ T cells and CD8 + T cells), highlighting core cell surface markers that likely relate to these common environmental effects. Our CCCE analysis further suggests that the environmental effects on the composition of cell subpopulations may fail to describe environmental effects on the composition of cell subfunctions. Taken together, our study highlights the importance of probing the effect of the common environment on immune subfunctions, even when knowledge about the underlying environmental factors is lacking.

Input data
The input dataset used by the CCCE algorithm is a collection of multicolor flow cytometry cell quantities, measured in a given cell type across a large cohort of monozygotic and dizygotic twins. The data is collected for each individual separately in two steps: first, one particular cell type is isolated using flow cytometry; and secondly, the isolated cells are analyzed using multicolor flow cytometry that simultaneously measures the level of several cell surface proteins in each single cell. We hereby refer to the set of cell surface markers analyzed simultaneously as a panel. Altogether, each dataset refers to a single cell type and a single panel of proteins whose cell surface level is measured by multicolor flow cytometry across a cohort of twins.
Typically, a flow cytometry 'gating criterion' is a discretization of the cells into two groups based on the level of their cell surface proteins, either above (+) or below (-) a certain cutoff. In accordance, a cell subpopulation is defined as a subset of cells (of the same cell type) carrying the same discretized levels across the entire pre-defined panel of proteins (Fig 1A, left). In particular, a panel of k proteins defines 2 k cell subpopulations, where each such subpopulation is a group of cells carrying the same vector of discrete protein assignments x element of {+, -} k . We next define the cell subpopulation frequency (in short, CSF) as the ratio between the number of cells of a given cell subpopulation and the total number of cells of the same cell type. Such a measure is calculated independently for each cell subpopulation and for each individual. More globally, we further define the CSF trait as the CSF vector across all individuals (Fig 1A, right).
Overall, the CCCE input dataset is a collection of 2 k different CSF traits measured using a certain k-proteins panel (and its accompanying discretization cutoffs) in a specific cell type. Each trait encompasses individuals of both dyzygotic and monozygotic twin pairs ( Fig 1A).
We note that each particular cell surface protein contributes to specific cellular subfunctions. In accordance, the term cell subfunction reflects the existence of one particular protein on the cell surface of a given cell type, regardless of the combination with any other cell surface protein (Fig 1A, left). Throughout this study we therefore distinguish between two interrelated terms: whereas a 'cell subpopulation' refers to a group of cells carrying the same combination of protein markers, a 'cell subfunction' refers to the functionality of a single protein, which may appear in many different cell subpopulations.

Overview of CCCE
The CCCE input is a single dataset consisting of a collection of CSF traits for a single cell type (that is, a single flow cytometry panel) across the individuals participating in the study (all monozygotic and dizygotic twins). Each of the traits is accompanied by its corresponding signature of cell subfunctions (Fig 1A). Given these inputs, the algorithm aims to identify common environmental effects on specific cell subfunctions. Our rationale is that calculations of common environmental effects on the frequencies of cell subfunctions may lead to false positive predictions due to confoundings related to imbalance in cell subpopulation frequencies.
For example, assume a highly prevalent cell subpopulation A that carries a cell surface marker x. Further assume that marker x resides on the cell surface of several rare subpopulations in the same tissue. We consider a scenario in which the common environmental effect acts only on the frequency of subpopulation A and has no effect on any other subpopulation. Due to the high prevalence of type-A cells in the data, it may be erroneously determined that the common environmental effect acts on the presence of marker x (subfunction x) rather than on the cell subpopulation A. To discriminate between these possibilities, CCCE evaluates the relations between the common environment and cell subfunctions while eliminating potential biases due to subpopulation-specific evidence.
In particular, CCCE first utilizes standard methods to calculate the common environmental effect for each cell subpopulation ( Fig 1B). Next, CCCE aims to assess the ability of the various cell subfunctions to predict the common environmental effect, using a regularized regression framework and assuming an unbiased evidence from the different cell subpopulations ( Fig  1C). Using permutations, CCCE determines the statistical significance of the relation between the immune subfunctions and the common environmental effects (a P-value score, Fig 1D). It may also be useful to extract the dominant subfunction that contributes to these relationships ( Fig 1E). The CCCE algorithm proceeds as follows:

Preprocessing step: Calculating the common environmental effect
Many computational methodologies aim to assess the statistical significance of the impact of a given (common or non-common) environmental factor on the phenotypic outcome [13]. These methods provide important insights regarding the effect of environmental factors, but require a-priori information about the putative environmental factor. However, the environmental factor is not always known a-priori. Even when the environmental factor is known, its monitoring still requires intensive labor and time (e.g., [14]). In addition, many environmental factors, such as the actual exposure to pathogens during childhood, are impractical to monitor.
To address these difficulties, an alternative strategy would be to avoid monitoring the environmental factors. In particular, the contribution of the common and non-common environmental effects on the total phenotypic variation is assessed solely based on the phenotypic diversity, using the Falconer's formula [1]. In brief, the Falconer's formula utilizes the difference between monozygotic and dizygotic twins to decompose the total trait variation (V p ) into three types of variation: genetic heritability (h 2 ), common environment variation (c 2 ) and noncommon environment variation (e 2 ), through the equation: V p = h 2 + c 2 + e 2 (it is possible to add additional effect types such as gender). The calculation is based on the phenotypic differences between monozygotic twins, who share their genetic makeup and are exposed to the same common environment, versus dizygotic twins, who are also exposed to the same common environment but share only about 50% of their genetic makeup. Specifically, h 2 = 2(r mz − r dz ), c 2 = r mz − h 2 = 2r dz − r mz , and e 2 = V p − h 2 − c 2 , where r mz is the trait's correlation between the monozygotic twins, and r dz is the trait's correlation between the dizygotic twins. The Falconer formula thus allows evaluation of the common environment effect solely based on phenotypic variation in dizygotic and monozygotic twins, without requiring direct environmental measurements.
CCCE assumes a single common environmental effect acting on each of the cell subpopulations. The common environmental effects were calculated using the Falconer formula as described in Roederer et al. [8] (Fig 1B). Briefly, the calculation involves two steps: first, evaluating the age effect, estimated based on a linear least-squares fit of the CSF trait value by age, and then adjusting the trait for the confounding effect of individual age. Second, for each nonage-related CSF trait, the Falconer's formula calculates the genetic and environmental effects. We hereby denote by c 2 the common, non-age-related, environmental effect (c 2 values were downloaded from [8]).
Step 1: Identify common environment effects on cell functions CCCE aims to calculate the extent to which the cell subfunctions are related to the heterogeneity of the common environmental, non-age-related effects (c 2 ) across cell subpopulations. Naively, this can be done by calculating the c 2 values for single-marker subpopulations (as if the panel consists of only one marker). However, such an approach is prone to false positive results due to the internal composition of the subpopulations. To tackle this, CCCE assumes that if a certain subfunction is affected by the common environment, a similar effect would be observed in all subpopulations carrying this subfunction. To avoid multiple testing of each individual subfunction, the CCCE model encompasses all subpopulations of a given isolated cell type into a single multiple linear regression model. In this regression model, the dependent variable is the common environmental effect (c 2 ) and the k predictors are the collection of cell subfunctions (+1 if the corresponding protein is present, otherwise -1) across all 2 k subpopulations (CSF traits) of the same panel (Fig 1C). By solving the regression jointly for all cell subpopulations, the model provides optimized coefficients that are shared among all cell subpopulations. The CCCE model is therefore substantially different from the Falconer formula that is solved independently for each cell subpopulation.
To avoid overfitting, we used regularized multiple regression framework (here, elastic net [15]). In order to obtain the best elastic-net parameters, we performed K-fold cross validation and chose the elastic net parameters that minimize the model error. The K-fold cross validation is a procedure in which the dataset is partitioned into K smaller subsets. For each such subset a model is trained using the other K-1 subsets and the remaining subset is used to calculate the model error. The output 'prediction error' of the model is the averaged error across the K trained models. This procedure is repeated for different elastic net parameters, and the set of parameters that minimizes the prediction error is chosen. Here we set K to be the number of cell subpopulations divided by 5, in order to ensure a large number of folds while maintaining at least five cell subpopulations in each fold.

Step 2: Assigning statistical significance
To assign statistical significance to this measure, we make the most conservative permutation by breaking potential ties between the dependent and predictor variables while maintaining the internal structure within each of these components. In particular, CCCE created 1000 shuffled datasets, each of which consists of permuted values of the dependent variable c 2 . For each shuffled dataset, CCCE calculates the best set of elastic net parameters (and the corresponding prediction error). Finally, CCCE compares the 1000 shuffling-based prediction errors to the real prediction error (Fig 1D). This process allowed us to give a significance value for a model of a given cell type (a permutation-based P-value score).

The leading subfunction
Multiple subfunctions (markers) were included in a cell type model, yet not all of them were related to or controlled by the common environment. The top-scoring (highest absolute correlation coefficient value) subfunctions are therefore useful for interpretation of the results, being the subfunction with the highest contribution to the model (Fig 1E). In particular, a positive (negative) coefficient links a surface marker presence (absence) with common environmental effect. We refer to the top-scoring subfunctions as 'leading subfunctions'.
In this study our main biological interest was to identify common environmental effects related to the presence of cell surface markers (positive-coefficients). However, the methodology is general and can be applied not only on common environmental effects but also on additional effect types (e.g., heritability or age). Furthermore, the analysis of leading subfunctions is general and may refer to both positive and negative coefficients.

Results
We applied the CCCE method on a collection of recently published datasets ( [8]; downloaded from http://www.tinyurl.com/twinsFACSdata). These datasets were measured across 500 human female twins (156 monozygotic and 344 dyzygotic twins) and encompass 32 different cell types with 3 to 8 proteins in a flow panel for each cell type. As previously proposed [8], we included in our analysis only high-quality CSF traits exhibiting high replicate correlation (>0.3) and mean trait value ranging between 1% to 99%. As a result, we applied our analysis on 16 datasets with at least 18 of high-quality CSF traits (S1 Table). Fig 2A lists the identified datasets with significant CCCE P-values (FDR < 0.1): effector NK cells, monocytes, CD4 + CD8 + αβ T cells [panels A and B] and total CD8 + T cells. For each of these datasets, Fig 2B demonstrates the CCCE calculation of statistical significance. As expected, CSF traits in significant datasets do include, but are not limited to, high common environmental effects (Fig 2C).
Characterization of the leading cell surface proteins provide insights into the cell subfunctions that are likely affected by the common environment. For instance, the leading protein of the monocytes dataset is Thrombomodulin (THBD), an important thrombin cofactor involved in reduction of blood coagulation and regulation of inflammation [16,17]. Although monocytes primarily participate in the innate immune response, they have an additional subfunction in thrombosis-related processes [9,18]. Specifically, monocytes contribute both as procoagulants (for example in the presence of LPS or other infective stimulation) and as inhibitors of coagulation via their cell surface Thrombomodulin [9,18]. Taken together, our results indicate that the balance of the coagulation process and its coupling with inflammation through monocytes are affected by common environmental factors (Fig 3A).
The analysis further suggested an effect of the common environment on subfunctions of effector NK cells. The leading protein of NK cells was CD2, which plays important roles in NK cell activation by cell-cell adhesion to the target cell, leading to elimination of infected cells [19] (Fig 3B). An example for such cell-cell adhesion is the interaction between CD2 residing on the NK cell and CD58 and CD48 on the target cell [20]. This interaction also has an effect on the activation and regulation of NK cytotoxic activity [21]. In addition, CD2 expression is essential to the formation of nanotube between the NK cell and the target cell [21], thereby establishing a physical contact between the cells that enables a selective intercellular trafficking of vesicles, further influencing the NK cell toxicity [22,23].
The leading T cells proteins also have known important subfunctions. In particular, in both CD8 + T cells (activated T c cells) and CD4 + CD8 + αβ T cells [Panel B] (resting or naïve T cells), CD45RA (PTPRC) was found to be the leading protein. The subfunction of PTPRC, a tyrosine phosphatase, is regulation of T-and B-cell antigen receptor signaling, JAK kinases and cytokine receptor signaling [24]. We refer to this result as a positive control, as it is expected that stimulations from the environment influence the balance between naïve and memory T cells. The analysis further suggested CD39 (ENTPD1), a nucleotide metabolizing surface enzyme that is likely involved in T cell activation, as the leading protein of CD4 + CD8 + αβ T cells [Panel A] [25].

Discussion
The environmental factors influencing the immune system are diverse and might include diet [26], physical activity [27], smoking habits [28] and air pollution [29]. Notably, many of the factors are not yet known. Our goal was to explore common environmental factors and identify significant effects on the composition of cell subfunctions. Specifically, we aimed to assign statistical significance to common environmental effects without prior knowledge about the actual environmental factors. We thus developed the CCCE methodology that is based on a dataset of multicolor flow cytometry measurements across a population (Fig 1). When applied on a collection of datasets, CCCE identified several cell types whose distribution of subpopulations is likely affected by the common environment, including effector NK cells, monocytes, CD4 + CD8 + αβ T cells and total CD8 + T cells (Fig 2). The functionality of some of the associated leading proteins in the specific cell type context, such as THBD in monocytes and CD2 in NK cells, has been previously described (e.g., THBD in monocytes and CD2 in NK cells, Fig  3A and 3B).
Our results open the opportunity to build a model of the relations between common environmental effects and various diseases. To construct such a model, we focused on the cell types identified by CCCE as targets of the common environment, and highlighted the leading  (Fig 1A, circles), each linked to the leading subfunction on which the environmental effect is exerted (hexagon). Based on the literature, each subfunction protein is further connected to certain diseases (rectangles). Whereas plots A and B provide zoom-in on the subnetwork of monocytes and effector NK cells, respectively, plot C shows the entire connectivity network. protein of each cell type. We further added diseases that are known to be related to the cell surface protein (Fig 3C). For instance, THBD was found to be involved in bleeding disorder [30], THBD mutations were found associated with thrombosis [31], and its level of expression was associated with cancer [32][33][34], atherosclerosis [35] and immune-related diseases [36,37]; the expression of CD2 was associated with psoriasis [37] and T-cell acute lymphoblastic leukemia [38]; changes in the regulation of PTPRC were also associated with Colitis, Crohn's disease [39], diabetes [40] and many other disease phenotypes; and finally, ENTPD1 down-regulation was found to be associated with uterine serous papillary cancer [41], and different ENTPD1 mutations were found to be associated with Crohn's disease [42], hereditary spastic paraplegia [43] and hereditary mental retardation [44].
The integrated model suggests cell subfunctions that might mediate the effect of the common environment on physiological disease traits. Notably, the effect of the environment on the same disease can be mediated through several distinct subfunction proteins (e.g., THBD and CD2 mediate psoriasis; Fig 3C). This suggests combinatorial regulation of the environment on phenotypic diversity through cellular functionality, and further emphasizes the importance of comprehensive studies to uncover an increasingly detailed map of disease-environment relationships.
Supporting information S1 Table. Input datasets used in this study. The table shows the cell type name of each dataset (column 1), its panel proteins (column 2), the number of high-quality CSF traits that were included in the analysis (column 3), the FDR based on the CCCE P-values (column 4), and the leading subfunction (a protein symbol; column 5). Only datasets with more than 17 high-quality CSF traits were included. (XLS)