Graphical Modeling of Gene Expression in Monocytes Suggests Molecular Mechanisms Explaining Increased Atherosclerosis in Smokers

Smoking is a risk factor for atherosclerosis with reported widespread effects on gene expression in circulating blood cells. We hypothesized that a molecular signature mediating the relation between smoking and atherosclerosis may be found in the transcriptome of circulating monocytes. Genome-wide expression profiles and counts of atherosclerotic plaques in carotid arteries were collected in 248 smokers and 688 non-smokers from the general population. Patterns of co-expressed genes were identified by Independent Component Analysis (ICA) and network structure of the pattern-specific gene modules was inferred by the PC-algorithm. A likelihood-based causality test was implemented to select patterns that fit models containing a path “smoking→gene expression→plaques”. Robustness of the causal inference was assessed by bootstrapping. At a FDR ≤0.10, 3,368 genes were associated to smoking or plaques, of which 93% were associated to smoking only. SASH1 showed the strongest association to smoking and PPARG the strongest association to plaques. Twenty-nine gene patterns were identified by ICA. Modules containing SASH1 and PPARG did not show evidence for the “smoking→gene expression→plaques” causality model. Conversely, three modules had good support for causal effects and exhibited a network topology consistent with gene expression mediating the relation between smoking and plaques. The network with the strongest support for causal effects was connected to plaques through SLC39A8, a gene with known association to HDL-cholesterol and cellular uptake of cadmium from tobacco, while smoking was directly connected to GAS6, a gene reported to have anti-inflammatory effects in atherosclerosis and to be up-regulated in the placenta of women smoking during pregnancy. Our analysis of the transcriptome of monocytes recovered genes relevant for association to smoking and atherosclerosis, and connected genes that before, were only studied in separate contexts. Inspection of correlation structure revealed candidates that would be missed by expression-phenotype association analysis alone.


Introduction
Smoking is a major risk factor for atherosclerosis and its complications, particularly coronary artery disease (CAD) and peripheral arterial disease [1][2][3]. Pathophysiological mechanisms by which smoking promotes atherogenesis are relatively well known, in particular through alterations of lipid metabolism [4,5] and endothelial function [6]. However, the molecular mechanisms by which smoking exerts its adverse effects at the cellular level are less documented. The advent of transcriptomic studies allowing investigation of all the genes expressed in a given type of cell has opened a new window for exploring in a global way the biological mechanisms underlying pathophysiological conditions. Using such transcriptomic approach, widespread perturbation of gene expression by smoking has been recently shown in whole blood [7], circulating lymphocytes [8], and monocytes [9] of humans.
Increasing evidence supports the hypothesis that oxidative stress and activation of the immune system provide a pathophysiological link between cigarette smoking and CAD [10,11]. Monocytes are key cells of the immune system involved in the inflammatory response to external agents. We hypothesized that the effect of smoking on atherosclerosis might be reflected by perturbation of gene expression in circulating monocytes and that it might be possible to identify gene networks causally involved in the relationship linking smoking to atherosclerosis.
Questions about causal effects in observational studies can be addressed by statistical methods that can translate statements about correlations and conditional independencies into structural equations [12] or Bayesian Networks [13]. Implementation of both techniques, however, can be difficult when the number of variables is large and inference of the ''true'' network that generated the data may not be feasible, even with large sample sizes. In addition, current implementations of Bayesian Network inference restrict to systems of Gaussian only, binomial (or multinomial) only or hybrids where binomials can only precede but not be caused by Gaussian variables [14]. A third class of methods based on information theory has been developed for the problem of identification of large networks of direct gene interactions, which does not rely on the correct specification of distribution functions, but where such interactions do not have a causal interpretation (e.g. ARACNE [15], SA-CLR [16], and Parmigene [17], among others). For the problem under study here, an approach that is not limited in the types of variables that can be modeled, i.e. binary, continuous, and counts, or in how they may be associated, that allows inferences about causal relations, and that can deal with large number of variables was needed.
The objective of the present study was to identify groups of genes that may help explain the causal effect of smoking on extent of atherosclerosis. For this purpose, genomewide gene expression in monocytes was modeled as a molecular phenotype potentially linking smoking to carotid atherosclerosis. Data were obtained from the Gutenberg Health Study (GHS), a community-based project primarily aimed at improving cardiovascular risk prediction [9]. We devised a stepwise approach to: 1. Identify patterns of expression associated to smoking and/or atherosclerosis using Independent Component Analysis (ICA); 2. Select expression patterns showing relatively high support for a causal role in the mediation between smoking and atherosclerosis using graphical modeling and Bayesian Network (BN) inference; 3. For patterns compatible with a potential causal role, infer the network skeleton connecting smoking, genes and atherosclerosis.
Our approach identified three gene networks that were compatible with a causal effect of gene expression mediating the relation between smoking and atherosclerosis where a few genes are candidates for mediating the perturbing effect of smoking in these networks. By performing causal inference on independent patterns of expression instead of the expression of a single gene, we not only dramatically reduced the space of models to be tested, but also applied BN inference in a space that is less prone to the effects of hidden variables. We restricted to detecting classes of best fitting models rather than a single causal model and present arguments for why some causal models may be favored in this study. We also provide cautionary statements to avoid misinterpretation of the reported causal models.

Smoking and extent of atherosclerosis in the GHS cohort
Association between smoking habits and atherosclerosis was investigated in a cohort of subjects of both sexes aged 35 to 74 years who participated in the GHS [9]. Study participants were classified into current ($1 cigarette/day) smokers (n = 248) and nonsmokers (n = 688). Occasional smokers (n = 42) and ex-smokers (n = 547) were excluded from the study (Methods). Characteristics of the study population are given in Table 1.
Atherosclerosis extent was defined as the total number of atherosclerotic plaques measured in the two carotid arteries by ultrasound echography (Methods). Carotid intima-media thickness and carotid plaque are well-recognized markers of subclinical atherosclerosis [18] which are influenced by smoking [19]. The number of plaques observed per person ranged from 0 to 11, with a skewed distribution and an average of 0.72 plaques (variance 1.95). The prevalence of atherosclerosis, defined as the presence of at least one plaque, was 31.2% in this middle-aged population. The prevalence and the extent of atherosclerosis were higher in men than in women and in smokers than in non-smokers (Table 2).
In the following, the phenotype considered was the number of atherosclerotic plaques, referred to as ''plaques''. Because the distribution showed overdispersion, a negative binomial distribution was used for modeling plaques as a function of covariables (see Text S1). The major determinants of plaques were age, sex and smoking, which all together explained 30% of the variability of plaques. The effect of smoking on plaques was stronger in men than in women although the significance of the interaction test was borderline (p = 0.027). Additional cardiovascular factors tested for association resulted in a modest increase of the explained variance of plaques (from 30% to 34%) (Table S1).

Gene expression in monocytes is associated to smoking and plaques
The analysis workflow of expression data is outlined in Figure 1. Expression of 18,364 genes was detected in total RNA from circulating monocytes by 23,214 probes in the Illumina Human HT-12 BeadChip (Methods). Association of probe expression level with smoking or plaques (log-transformed) was assessed by linear model adjusted for age and sex, as well as for the 6 first singular value decomposition (SVD) components of the expression matrix taken as surrogate variables for technical sources of variability (Methods and Text S1).
In a first step, we identified genes whose expression was associated to either smoking or plaques by univariate analysis. At FDR #0.1, we found 3,774 probes (3,062 distinct genes) associated to smoking. The list of smoking-associated genes was enriched in three ''biological processes'' in the Gene Ontology (GO) database: platelet activation, interferon-gamma-mediated signaling pathway, and Toll signaling pathway (Table 3). Association between plaques and gene expression was much less prevalent than with smoking, with only 258 probes (236 distinct genes) associated to plaques at FDR #0.1. No GO terms were significantly enriched for genes associated to plaques. Table 4 shows the 10 genes with strongest association to smoking or plaques, respectively. The whole list of associated genes to either phenotype is given in Table S2. There were 2 members common to both top 10 gene lists, SASH1 and PTGDS. In addition to SASH1 and PTGDS, 4 genes of the top 10 list for smoking were ranked among the top 100 genes for plaques: FUCA1, LOC157627, MMP25 and PID1 (Table S2). Smoking was associated to a much larger variability of gene expression (from 35% to 15% for the top 10 smoking-related genes) than plaques (from 5% to 2% for the top 10 plaques-related genes). The correlation of r 2 D values for smoking and plaques across all genes was 0.4, indicating a strong association between smoking and plaques effects on gene expression as a result of their confounding effects.

Comparison of smoking-associated gene expressions in monocytes and lymphocytes
We investigated the robustness of the association between gene expression and smoking by comparing the list of smokingassociated genes in monocytes to a list of 323 genes that have been found associated to smoking in lymphocytes [8]. The two studies used different microarray platforms and the 13,707 genes common to both studies were taken as the reference set. Of the 323 genes associated to smoking in lymphocytes, 268 were detected in monocytes of GHS of which 151 were associated to smoking (56.4%). This represented a 2.5-fold enrichment versus the reference (p = 4.6610 234 ). Using a more stringent FDR threshold of 0.05 rather than 0.10 as in [8] did not significantly affect the results (2,477 unique genes associated to smoking in GHS; 2.6-fold enrichment, p = 2.4610 234 ). Results from both studies did not only overlap in the list of genes associated to smoking but also in the magnitude and the direction of the effects (Pearson correlation coefficient of 0.72 between the r 2 D estimated in GHS and the corresponding correlation values reported in [8]) ( Figure S1). Association between plaques and smoking conditional on a single gene expression In a second step, we investigated whether single gene expressions might mediate the effects of smoking on plaques. For this purpose, we modeled plaques as a function of smoking and each single gene expression. In this one-dimensional scan, no gene could entirely explain the association between plaques and smoking by conditioning on its expression. SASH1 was the gene that, once accounted for, contributed to the largest reduction in the covariation between plaques and smoking. PTGDS and PPARG were the second and third genes contributing the most to this reduction (Table S3). These results suggested that, not unexpectedly, the underlying link between smoking and plaques might involve more complex networks, including several genes and/or hidden variables. Next, we devised an approach to explore a more comprehensive set of models on multiple genes and other variables at a time.
Clusters of genes associated to plaques or smoking Unsupervised gene clustering was used to reduce the dimensionality of the data before testing causal models involving multiple genes. Prior to this, we reduced the set of gene expressions by considering only genes that were significantly associated to smoking or to plaques when tested either separately or jointly (see Methods). This constituted a set of 3,960 probes. The probe with the highest variance in intensity across samples was chosen for each gene, leaving a set of 3,368 distinct gene profiles associated to plaques or smoking.
Independent component analysis (ICA) was used to identify patterns of co-expression in this set of 3,368 genes. ICA is an efficient algorithm that factorizes a matrix of multivariate data into a mixing matrix A of patterns for independent ''hidden'' components causing correlation among variables and an S matrix of signatures, which are vectors of coefficients associating variables (genes) to components (see Methods and [20] for details and [21] for a recent application to gene expression data). A pattern is a linear combination of gene expressions whose level varies among individuals. A signature is a vector of the contributions of a component to each gene expression that can be characterized by the genes that are most affected by that component (see module  Association of probe level was tested separately with smoking and ln(plaques+1) by linear model adjusted for age, sex and the 6 first SVD components. The beta regression coefficient is shown. Probes were ranked according to decreasing r 2 D for smoking (plaques, respectively). In case of several probes by gene (e.g. SASH1 and PPARG), the probe with the highest r 2 D is shown. doi:10.1371/journal.pone.0050888.t004 below). In the following, the terms pattern, signature or module are used for referring to a component according to the context where it applies (i.e. individuals or genes).
The number of components to extract was determined by a permutation test, which indicated that 59 components could be detected in this dataset (see Methods). Components that did not meet pre-specified quality control criteria were discarded, leaving 29 components for analysis (see Methods and Text S1 for details).
Each component was associated to a specific module of genes characterizing its signature. A module was defined as the subset of genes that were selected on either tail of the signature distribution by controlling local FDR #0.001, as done previously [21]. This resulted in 29 modules of 9 to 125 genes (Table S4). Two modules were enriched in GO pathways: module 18 (interferon-gammamediated signaling pathway) and module 39 (antigen processing and presentation via MHC class II) ( Table S5).

Association of ICA expression patterns with smoking and plaques
As mentioned above, the patterns obtained by ICA factorization are linear combinations of gene expressions whose level can be interpreted as reflecting the ''degree of activation'' of subsets of coexpressed genes among individuals. Association of expression patterns with smoking or plaques was investigated in a similar fashion to that employed for individual gene expressions, except that we used a Bonferroni-corrected significance threshold (0.05/ 29 = 0.0017) instead of a FDR. At this significance threshold, 14 of the 29 patterns were associated to smoking and 7 to plaques, 5 being common to both ( Table 5).
Worthy of note, ICA was able to recover a strong signature of smoking effects on gene expression in monocytes (pattern 43), as reflected by the high proportion of variance of that pattern (56%) explained by smoking ( Table 5). The module associated to pattern 43 comprised 34 genes listed in Table S4. At the core of module 43 was SASH1, which had the highest coefficient for signature 43 in the S matrix and therefore was the most correlated to the overall pattern of expression.
Significant overlap with genes associated to smoking in lymphocytes [8] was tested for each ICA module in the same manner as for the full set of genes associated to smoking (above). Six of the 29 modules were significantly enriched for genes associated to smoking in lymphocytes (Table S6). The most enriched module was module 43, for which 12 of the 34 genes were also observed in lymphocytes (OR = 27.35; p = 9.1610 213 ).
As in the single-gene case, no single pattern was able to entirely account for the association between smoking and plaques. The top-ranking pattern by amount of covariation explained between P and S was pattern 43 (Table 5). When this pattern was introduced in the model relating plaques to smoking, the proportion of variability of plaque counts explained by smoking (r 2 D ) decreased from 8.8% to 2.3%.

Selection of expression patterns with potential causal role in the relationship between smoking and plaques
Though no single pattern could entirely explain the relationship between smoking and plaques, we sought to determine whether some ICA patterns may show evidence for a causal effect partially mediating the link of smoking to plaques. For this analysis, we used graphical modeling (Methods). For each expression pattern, the best-supported causal model involving smoking (S), plaques (P) and gene expression pattern (G) was selected by a likelihood-based model selection approach. All equivalence classes of graphical models among these three variables were enumerated ( Figure 2).
Two classes, f and k, were of primary interest because both comprise models with a path SRGRP, which is the causal relation of interest. Under model class f, all the association between smoking and plaques can be explained by the effect of one single pattern. In model k, the pattern does not explain all the covariation between smoking and plaques, but it is associated to both (Text S1). Maximum likelihood was used to identify the equivalence class that was best supported by the data for each pattern. The process was repeated for 1000 bootstraps of the data to account for uncertainty in model selection.
The results of causal model inferences are summarized in Figure 3 and Table S4 (spreadsheet ''Causality''). The probabilities of the different models across the 1000 bootstraps for each pattern are shown in bottom half of plot in Figure 3. The probability of selecting a model from a causal class was defined as the sum of probabilities for the model classes f and k (top half of plot in Figure 3). According to this criterion, 4 patterns (21, 29, 31 and 51) had a relatively high support for causality (probability $0.6). Worthy of note, pattern 43, the one the most influenced by smoking, was associated with the model G-S-P with a probability of 0.7, which is incompatible with SRGRP causality.

Inference of the gene network underlying ICA expression patterns
To further characterize the ICA patterns showing some support for causality, we inferred the topology of the networks underlying patterns. For each pattern, the network was constructed from the subset of genes composing the module specific to that pattern. We applied the PC algorithm 1 to discover the skeleton of conditional independencies (algorithm 1 in [22]; see Methods). The network was represented as an undirected graph. To decrease the possibility of hidden variables for the network, we considered in these analyses all cardiovascular risk factors that were associated to each pattern by stepwise regression (Table S7). This is important to avoid spurious edges resulting from untested confounding variables. To assess uncertainty in the inference, the process was repeated 1000 times by bootstrapping individuals and the proportion of data samples that recovered an edge was recorded. We considered only edges with a recovery probability $0.6 (see Methods). Graphic representations for all networks are found in Text S2.

Networks selected by causality test
For the 4 modules with suggestive support for causal effects (21, 29, 31 and 51), we estimated the minimum path(s) between smoking and plaques, starting from each gene directly connected to smoking. Pattern 31 did not reveal any path because no gene was connected to plaques in more than 60% of bootstraps.
For pattern 21 (Figure 4), the different paths connecting smoking to plaques involved four genes directly connected to smoking (MAP3K6, GAS6, HTRA1 and DSC2) and only one gene directly connected to plaques (SLC39A8) ( Table 6). Therefore, the SLC39A8 gene funneled all information paths between smoking and plaques in this network. SLC39A8 expression level was positively correlated to plaques and smoking was positively correlated to all the genes in the cluster, suggesting that an upregulation of genes of the cluster was associated to increased atherosclerosis extent. Additionally, SLC39A8 was negatively correlated to HDL-cholesterol levels (,1 probability), which is consistent with the protective role of HDL-cholesterol in atherosclerosis.
In pattern 29, there were 7 genes directly connected to smoking (CXCL16, DHRS9, FAM20C, FPR3, PDE4B, PTGFRN, and TBC1D8) and only one gene connected to plaques (RASD1) ( Table 6). RASD1 was positively correlated to plaques, although this association was recovered with probability of only 0.64. This was a large network with 98 genes (Table S4). However, smoking and plaques were separated by a relatively low number of genes where the 7 paths had between 3 to 5 connecting genes ( Table 6).
Pattern 51 was the smallest (15 nodes) and among the most interconnected networks, with an average number of connections per node of 4.3. Among the different paths connecting smoking to plaques, 2 genes were directly connected to smoking (CYP1B1 and HOXA10) and only one gene was directly, negatively, connected to plaques (TMEM136) ( Table 6).

PPARG network
PPARG was the gene showing the strongest association to plaques and the third in the reduction of the r 2 D of smoking to plaques. In addition, this gene is known to be involved in atherosclerosis [23]. This prompted us to examine in more details the network(s) comprising PPARG. Actually, PPARG was only present in module 42, which included 71 genes (Table S4). Pattern 42 was not significantly associated to smoking or plaques (Table 5). However, causality testing gave inconclusive results, with two model classes, h (SRPrG) and k (S-G-P-S), alternatively selected by bootstrapping with probabilities 0.5 and 0.31, respectively (Table S4). In model h which had the best support, smoking and pattern expression were associated only when conditioned on plaques, meaning that both have causal effects on plaques but independently from one another. Conversely, the topology of the network suggested that the shortest path from smoking to plaques was through PPARG only (Text S2). This path, which was recovered in 96% of bootstraps, supported a causal effect of PPARG. This discrepancy between PPARG and pattern 42 might be explained by the fact that the contribution of pattern 42 to PPARG was weak (ranking 67 out of 71 genes of the module) and therefore, the behavior of PPARG did not exactly match that of the entire pattern. n: number of genes in the pattern-specific module, b: sign of the regression coefficient of G on S and G on P, respectively. All models included age, sex and the first 6 SVD components. Only the 29 patterns that passed quality control are shown. P-values #0.05/29 = 0.0017 are in bold. Genes are ranked by increasing r 2 D associated to S in the model P,S|G, which corresponds to decreasing reduction of the amount of covariation between smoking and plaques explained by pattern expression. doi:10.1371/journal.pone.0050888.t005

Discussion
We present results from a genome-wide survey of gene expression in monocytes that revealed widespread effects from smoking, with .3000 genes either over-or under-expressed in smokers. Because no information was collected about number of cigarettes smoked per day, we could not test dose-dependence of the effects on gene expression. Due to the non-stringent FDR adopted, the list was rather large because our primary objective was not to miss any gene of potential interest. The list of smokingassociated genes showed significant overlap with those observed in lymphocytes from a large cohort of Mexican Americans [8] indicating high robustness of smoking effects across different circulating cell types and genetic background. By contrast, the number of gene expressions associated to atherosclerosis was much lower (236 genes), which is probably explained by the fact that atherosclerosis is a complex and distal phenotype with multiple genetic and non genetic determinants.
As expected, a one-dimensional scan across the transcriptome revealed that no single gene could explain the association between smoking and plaques, leading us to search for networks of genes that might be more relevant. Using ICA, we identified 29 patterns  Smoking, Gene Expression and Atherosclerosis PLOS ONE | www.plosone.org of co-expressed genes, 14 of which were strongly associated to smoking. Two patterns were enriched in functional GO categories (interferon-gamma-mediated signaling pathway and MHC class II antigen processing) but none of these two patterns was related to atherosclerosis after conditioning on smoking. Worthy of note, one of the patterns (pattern 43 driven by SASH1) could be interpreted as a robust signature of the impact of smoking on the transcriptome of circulating blood cells, as demonstrated by the substantial overlap of smoking-associated genes between monocytes and lymphocytes [8]. Actually, all the genes directly connected to smoking in network 43 (i.e. SASH1, MMP25, P2RY6, FUCA1, PID1, DTNA, GFRA2, CLEC10A and PTGDS) had been previously identified as the strongest correlates of smoking in GHS [9]. Surprisingly, none of these genes was found differentially expressed in a recent experimental study performed in a human monocytic cell line (THP-1 cells) exposed to cigarette smoke extract [24]. This discrepancy suggests that in vivo chronic exposure to cigarette smoke may have a different impact from in vitro acute exposure, in particular because of the important role played in vivo by the lung, kidney and liver in metabolizing xenobiotics.
We then tested causality by graphical modeling. Actually, two model classes, S-G-P and S-G-P-S contained the causal path of interest SRGRP. Although both of these graphs describe multiple Bayesian networks, a priori information can be used to favor only a few of them. Indeed, genes expressed in monocytes are unlikely to affect smoking behavior, eliminating SrG edges. Supporting this assumption, only three genes across all networks have been reported by GWAS to be associated to smoking behavior and none of them were in modules selected for causal effects (Table S8). On the other hand, by excluding ex-smokers, we reduced the possibility that smoking behavior might be modified by the presence of atherosclerosis, making SrP edges less likely (see Text S1 for details). Therefore, the networks most likely underlying the two graphical models of interest were SRGRP for the two-edge case and SRGRPrS or SRGrPrS for the three-edge case.
We identified four patterns of expression (21, 29, 31, and 51) that were compatible with an effect of expression partially mediating the relation between smoking and atherosclerosis. Pattern 31 did not have a gene network topology consistent with any gene mediating smoking to plaques effects. Pattern 29 was associated to a large module (94 genes) with some support for causal effects (0.59 probability). All the paths in this network were connected to plaques through a single gene, RASD1. This gene, whose exact function is unknown, encodes a Ras-related protein stimulated by dexamethasone, a drug with anti-inflammatory and immunosuppressive actions. Because of the relatively low probability of causality of pattern 29 and the modest recovery of the RASD1-plaques edge (0.64 frequency), caution is needed in the interpretation of this result.
Pattern 51 had a relatively high support for causality (0.85 probability). In this network, smoking was directly connected to 2 genes, CYP1B1 and HOXA10, which both have relevance to atherosclerosis and smoking. CYP1B1 encodes a member of the cytochrome P450 protein superfamily that localize to the endoplasmic reticulum and metabolize procarcinogens including polycyclic aromatic hydrocarbons in tobacco-smoke [25]. CYP1B1 expression showed the strongest difference in placenta from smoking and nonsmoking women [26] and was increased in THP-  Table 6. Shortest gene paths connecting smoking to plaques. 1 cells after in vitro exposure to cigarette smoke [24]. In endothelial cells, CYP1B1 expression is regulated by shear stress and is associated to anti-atherogenic effects [27] and decreased oxidative stress [28]. HOXA10 encodes a transcription factor associated to cell proliferation in the monocytes cell lineage [29] and repression of PHOX genes involved in oxidative stress [30]. It is expressed in the endothelium in a location-dependent manner, with lower expression in atheroprone than in atheroresistant arteries [31,32]. HOXA10 expression in endometrium has been shown to be directly affected by cigarette-smoke extract both in humans and mice [33]. Therefore, there is support in the literature linking at least two genes of the network to smoking and plaques. However, TMEM136, which was the only gene of the network directly connected to plaques, was expressed at low levels in monocytes, only exceeding detection threshold in 5% of male non-smokers and a few individuals outside this group. TMEM136 encodes a transmembrane protein of unknown function. Therefore, it is possible that association between plaques and this gene is instead reflecting association to a different unobserved network member. Pattern 21 which had the highest support for causality (0.87 probability) also appeared to have the highest relevance in the context of smoking-induced atherosclerosis. In this network, connection to plaques was mediated by SLC39A8 (aka ZIP8), a transmembrane zinc transporter. Genetic variants in SLC39A8 have previously been associated to several cardiovascular risk factors such as HDL-cholesterol [34,35], blood pressure [36,37], obesity [38], and activation of plasminogen [39]. In addition to being connected to plaques, SLC39A8 was directly connected to HDL-cholesterol in the network (Figure 4), a result consistent with the association found with SLC39A8 genetic variants [34,35] and supporting a causal role of SLC39A8 in atherosclerosis. However, here we provide evidence for an effect on plaques not completely mediated by HDL-cholesterol, since a direct plaques-SLC39A8 edge was recovered with 0.69 probability. SLC39A8 is known to have a cytotoxic role by intracellular transport of cadmium, a toxic heavy metal and carcinogen that is abundant in cigarette smoke [40,41]. In lung epithelia, SLC39A8 expression is increased by TNFa, a pro-inflammatory cytokine that is abundant in smoker's lung [42]. In addition, the cadmium-mediated toxicity induced by cigarette smoke has been shown to be enhanced through NF-kBmediated activation of SLC39A8 expression [42].

Pattern Paths
The shortest path from smoking to plaques in network 21 was smoking-GAS6-MUC1-SLC39A8-plaques. MUC1 (mucin 1) encodes a membrane protein involved in cell adhesion and signal transduction, not previously associated to smoking effects or atherosclerosis. By contrast, GAS6 (growth arrest-specific gene 6) has a strong relevance to atherosclerosis. It belongs to a family of vitamin K-dependent coagulation proteins and has a pleiotropic role in atherosclerosis, with pro-and anti-atherogenic effects [43]. In human atherosclerotic plaques, which are the focus of the present study, GAS6 has been shown to be expressed mainly by vascular smooth muscle cells and to have an anti-inflammatory action by stimulating the anti-inflammatory cytokine TGFß and inhibiting expression of TNFa [44]. Therefore, TNFa may be a molecular signal underlying the correlation between SLC39A8 and GAS6, but this hypothesis needs to be confirmed. GAS6 was also among the genes reported to be up-regulated in the placenta of women smoking during pregnancy [26]. To the best of our knowledge, GAS6 and SLC39A8 have not been connected before in the context of atherosclerosis and their functional link needs to be experimentally confirmed.
Because ICA was performed on a subset of genes pre-selected by their association with smoking or atherosclerosis, we cannot exclude the possibility that we missed some genes that could be important nodes in the networks subsequently identified. However, not doing this pre-selection would have led to a larger number of ICA components (91 components were actually detected in our previous ICA application in the complete GHS expression dataset [21]), most of them being irrelevant for the problem under study. Also, we cannot exclude the possibility of spurious edges in networks resulting from untested confounding variables.
Networks discovered in this study may represent diverse mechanisms of gene-by-gene co-expression. For instance, expression of genes may be co-regulated by a signaling pathway in a single cell type or they may represent coordinated variation in the proportion of cell population subclasses of monocytes. Both possibilities are equally interesting since causal effects may be mediated by either mechanism. For instance, HDL levels modulate monocytes proliferation and activation, changing the composition of the myeloid cell lineage, which is thought to explain in part its anti-inflammatory and athero-protective effects [45,46]. Further studies are needed to determine what specific molecular mechanisms underlie the correlation patterns reported here. Additionally, replication of candidate gene networks in independent datasets should be performed before laborious functional studies are undertaken.
A word of caution is in place about inferring causal associations from observational data. Direct associations in causal graphs are appropriate only in the context of the variables that are included in the system, i.e. the variables that were measured. Causal graphs are considered complete only in the sense of common causes between variables but they do not include all causes of variables [47]. Therefore, a direct causal association may represent the net effect of a large number of direct causal associations among variables that were not measured and that mediate the effects among the variables that were observed [48]. A causal graph may change if variables are removed or new ones are included. Furthermore, the real causal associations that can be inferred from graphical models are not in the edges but in the lack of edges between variables, that is, two variables are not causally associated given that we account for the other variables in the system. Instead the presence of an arrow only indicates the possibility of a causal association, which has to be determined from data [49]. The method used here can only identify networks, which in the context of the variables measured, present a correlation structure that is not incompatible with the causal effects of interest SRGRP. An effort was made to include all other variables that may be relevant in the system, but of course there is no guarantee that all relevant variables were included. In addition, although the method does not always allow identifying the best model, providing the best fitting class of models is an honest and useful summary of the information encoded in these data.
In conclusion, we have used a graphical modeling approach to investigate the potential role of gene expression in monocytes in mediating smoking effects on atherosclerosis. The analytic approach implemented here allowed discriminating among competing causal models affecting multiple genes and revealed gene networks that included multiple members with known causal roles in atherosclerosis or mediation of smoke-tobacco effects. To the best of our knowledge, this is the first application of causal inference on gene modules rather than individual genes. Our results put together previously unconnected genes that led to the formulation of new hypotheses about potential molecular mechanisms linking smoking effects to atherosclerosis phenotypes. Therefore, inspection of the correlation structure of risk factors, gene expression and atherosclerosis, revealed candidate genes that would have been missed by looking at strength of gene-phenotype associations alone.

Materials and Methods
More details are provided in Text S1.

Subjects
Study participants of both sexes aged 35-74 yr, were successively enrolled into the GHS, a community-based cohort study in the Rhein-Main region in western mid-Germany. Participants were of European origin. More details about the GHS study are provided in [9]. There were 1,536 individuals with microarray expression data that passed quality control tests. Nine individuals with missing number of plaques were removed. Smoking status was recorded by interview at recruitment. Current smokers (individuals smoking $1 cigarette per day since at least 6 months before recruitment, n = 248) and nonsmokers (individuals who had never been smoking over a period of at least 6 months, n = 688) were used for analyses. Occasional smokers (n = 42), ex-smokers (n = 547) and individuals with missing information on smoking status (n = 2) were excluded.

Ethics statement
All subjects gave written informed consent. Ethical approval was given by the local ethics committee and by the local and federal data safety commissioners (Ethik-Kommission der Landesä rztekammer Rheinland-Pfalz 22/03/2007 Number 837.020.07 (5555)).

Evaluation of the number of atherosclerotic plaques
Intima-media thickness (IMT) in the common carotid arteries was assessed with an ie33 ultrasound system (Philips, NL) using an 11 to 3 MHz linear array transducer. Measurements were performed by an experienced technologist and evaluated in the QLab software (Philips, NL). The presence of an atherosclerotic plaque was determined by an increment of 1.5 mm or more in IMT when compared to a region without plaque 1 cm before the carotid bulb, averaging from left and right carotids. Plaques were counted in the common, external, and internal carotid arteries on both left and right sides [9]. The phenotype considered was the count of plaques summed over both carotid arteries.

RNA samples processing
Total RNA from circulating monocytes was extracted as previously described [9]. Briefly, monocytes were isolated from blood samples with the RosetteSep Monocyte Enrichment Cocktail (StemCell Technologies, Vancouver, Canada), cells were resuspended in Trizol (Invitrogen, Karlsruhe, Germany), RNA was extracted with the RNeasy Mini kit (Qiagen, Hilden, Germany) and controlled for quality in an Agilent Bioanalyzer 2100 (Agilent Technologies, Boeblingen, Germany).

Microarray data processing
RNA samples were hybridized to the Illumina HT-12 BeadChip v3 (Illumina, San Diego CA) containing 48,803 50-mer DNA probes. Probe-mapping to the genome was obtained from ReMOAT annotations [50] in the illuminaHumanv3.db package from Bioconductor v2.8 (http://www.bioconductor.org). There were 13,445 probes with ''Bad'' scores for genome alignment, which were discarded. Probes were annotated with RefSeq and EntrezGene ids using the org.Hs.eg.db package from Bioconductor. Of the remaining 35,358 probes, 28,515 were annotated to Ensembl transcripts and 28,137 to EntrezGene ids. For the purpose of counting number of genes throughout, unannotated (but of good quality) probes were considered as targeting distinct genes. Note that probe annotations from Illumina were not used to discard probes, which is different from our previous analyses on this dataset [9,21]. However, 98% of probes that were removed in our previous analyses were also removed by at least one of the filtering steps applied here.
Bead-level data were processed with BeadStudio (Illumina, San Diego CA) to perform quality control and summarization of intensity values at probe level. Data were further processed by quantile normalization. An arcsinh transformation was applied to stabilize the variance [51]. A transcript was considered detected when the normalized intensity reported by its targeting probe was significantly above that of negative control probes on the same array (detection p-value,0.04). Probes with ''detected'' calls in $5% of samples within any smoking/sex group were considered for analysis, representing 23,214 probes. All analyses were performed with R v2.13 [52].

Correction for technical sources of variability in gene expression
Major components of variance in the gene expression dataset were identified by singular value decomposition (SVD) using the La.svd function in R. The six largest components were not associated to any individual characteristic, and therefore were thought to reflect systematic effects from sample-processing protocols. The first 6 SVD components were then used as surrogate variables of technical effects and were adjusted for in expression analyses [53]. In addition, potential sample contamination with B and T cells, and megakaryocytes was corrected as previously described [21] (Text S1).

Modeling the number of plaques
In analyses modeling the number of plaques as dependent variable, we used a negative binomial (NB-2) distribution [54]. Other distribution functions were considered but not used (see Text S1 for details). The negative binomial model was fitted by the glm.nb function with a log link using the R MASS package. A pseudo-R 2 coefficient was defined as R Criterion BIC~{2(log L(yDĥ h)zk log(n), where L(yDĥ h) is the maximum likelihood of the data given parametersĥ h, k is the number of parameters inĥ h and n is the number of observations. When comparing different negative binomial models, the dispersion parameter a was held constant and fixed at the estimated value from the model with age, sex and smoking effects (a = 1.34). Individuals with missing values for any variables being considered were removed before model fitting.

Single gene expression association analysis
Analysis of expression data was performed at the probe level. Association of a single probe expression with plaques counts and smoking was performed by linear regression analysis where the dependent variable was the probe expression level and the independent variables were plaques and smoking, tested either individually or simultaneously. In these models, the variable considered for plaques was ln(plaques+1). All models were systematically adjusted for age, sex and the 6 first SVD components. Linear models were tested by the glm function in R. An FDR #0.1 was used for selecting the probes associated to plaques or to smoking [57]. See Text S1 for details.
The list of 3,368 distinct gene expressions associated to smoking was compared to the list of 323 gene expressions previously reported to be associated to smoking in lymphocytes [8]. This study used the Illumina Human WG-6 v1 microarray platform, which is an earlier version of the HT-12 used in GHS. Updated gene annotations were obtained from Bioconductor. There were 19,614 unique genes in WG-6 of which 17,243 were also queried in HT-12. Of these, 13,707 were detected in monocytes of GHS, which were used as the reference set. Gene set enrichment analysis was performed by the Fisher's exact test implemented in the fisher.test function in R.

GO enrichment analysis
Gene set enrichment analysis of gene ontology (GO) terms was performed with the topGO (v 2.4.0) package in R. GO data were obtained from the GO.db Bioconductor metadata package v 2.5.0 (March 2011). Enrichment was tested by a Fisher's exact test using the set of 18,364 unique known detected genes as a reference. In order to account for the redundancy and hierarchy of GO terms, the weight01 algorithm was used, which is a hybrid between the elim and weight algorithms described in [58]. Control for multiple testing was done by a Bonferroni correction on the number of GO terms represented in the reference set.

Covariation between plaques and smoking explained by single genes
To test whether a single gene expression could explain part of the covariation between plaques and smoking, we modeled plaques as a function of smoking and gene expression using the glm.nb function. The dispersion parameter h was held constant across genes to a value of 0.75, which was estimated with the glm.nb function on the reduced model without gene effect. The strength of the association of smoking before and after including gene expression was tested by the r 2 D coefficient as described above. The difference between these values was regarded as covariation explained by a single gene.

Clustering gene expressions by Independent Component Analysis (ICA)
The fastICA algorithm was used to factorize the matrix of 3,368 gene expressions associated to smoking or plaques [20]. A single probe per gene was selected, that was the probe showing the largest variance across samples. This approach avoids the bias that would be introduced by, for instance using the probe with the strongest association to smoking or plaques and favors probes with more information content. Using the mean expression across probes for a gene was not considered because of the added noise that that may result from errors in probe annotation and because the largely uneven number of probes per gene would dramatically affect the distribution of technical error across genes. All unannotated probes were kept.
Normalized data from each probe were centered and standardized. SVD was initially performed to determine the a priori number of components for the ICA algorithm as explained in [21]. This number was found to be 59.
The fastICA algorithm identifies major variance components by iteratively estimating the ''mixing'' matrix A that satisfies the equation X = SA, where X is an m6n data matrix, S is a m6p matrix of signatures across genes, A is p6n matrix of patterns across samples, n is the number of samples (n = 936), m is the number of genes (m = 3,368), and p is the number of ICA components set a priori (p = 59). The iterative algorithm minimizes dependency among signatures (columns of S), while maximizing non-Gaussianity, i.e. negentropy, of signature distributions [20]. The fastICA function in the R-package of same name was used [59]. The algorithm was run multiple times to avoid trapping in a local maximum. The results were processed as previously described in [21] to remove components that were not consistently detected across random start points. Briefly, fastICA was repeated 500 times and the best run (with the maximal negentropy) was selected. The stability of components over the 500 runs was calculated. Components that did not meet quality control criteria were discarded (see Text S1 and Tables S9, S10, S11).
For each component, a signature-specific module was defined as the subset of genes on either tail of the signature distribution selected by controlling local FDR #0.001 [21]. The association between ICA expression patterns and smoking or plaques was tested in the same way as single gene expression, except that adjustment was made on all risk factors associated to pattern by stepwise regression analysis and not only on age and sex.

Investigation of causality models
All triplets including smoking (S), plaques (P), and a gene expression pattern (or a single gene expression) (G) were constructed. Potential causal relationships among triplet members were represented as Bayesian Networks, i.e. graphical models, where variables are vertices and causal associations are indicated by directed edges, i.e. arrows, between nodes. The probability function used depended on the distribution of each variable.
Smoking was modeled by a binomial, plaques by a negative binomial, and gene expression by a Gaussian distribution function. The inference of network structure and parameters was performed as follows. Each node in the graph was fitted a linear model with its parents as independent variables. Given a graph M with variables X = {X 1 , …, X N } represented as nodes, a BIC score [60] for M was computed as BIC(M) = 22 log(L(X|M))+k log(n), where n is the sample size, N is the number of nodes, k is the number of parameters of the model, and L(X|M) is the maximum likelihood of the data given graph M, which by the directed Markov property of the graph, is given by L(X DM)~P N i~1 Pr(X i DPa M (X i )). The term Pr(X i DPa M (X i )) denotes the conditional probability of observation X i given its set of parents Pa M (X i ) in graph M. The graph with the lowest BIC is selected as the most likely model, which produces identical values for equivalent networks [61]. Here, however, because different density functions were used for different variables, equivalent networks may results in slightly different scores. Therefore, we computed BIC scores for all models in a class, which was selected if any network in the class had the minimum score. The probability of selecting a model class, estimated by 1000 bootstraps of the data was used as a measure of confidence in the Bayesian network inference [62].

Inference of network skeleton
The skeleton of the network connecting genes, smoking, risk factors and plaques was learned by conditional associations using the PC algorithm 1 implemented in the pcalg R package v 1.1-4 [22]. Briefly, the algorithm starts with a matrix of marginally associated variables (nodes connected by edges) and for each pair of connected, i.e. adjacent, nodes it successively tests whether the pair becomes independent after conditioning on any group of adjacent nodes. If for any set of conditioning nodes the pair is independent, the edge is removed. The Pearson correlation with significance level of 0.05 was used to test independence. The process was repeated for 1000 data bootstraps and the proportion of samples where the edge is recovered was recorded. Shortest paths between smoking and plaques were inferred with the Dijkstra's algorithm on a reduced graph after removing nodes corresponding to covariates. Whenever more than one shortest path existed, the algorithm chooses one according to a greedy search. The Dijkstra's algorithm implemented in the sp.between function of the RBGL package for R was used [63].
Note that although in its full version the PC algorithm (Algorithm 2 in [22]) can asymptotically infer causal associations in a network by directing some of the edges of the skeleton, when applied to samples of finite size the algorithm may lead to inconsistencies that make such inference impossible [64]. This was the case with the present data. Therefore, we decided to limit to inferring the skeleton of significant partial correlations among variables using a first part of the PC algorithm.