A Model-Based Joint Identification of Differentially Expressed Genes and Phenotype-Associated Genes

Over the last decade, many analytical methods and tools have been developed for microarray data. The detection of differentially expressed genes (DEGs) among different treatment groups is often a primary purpose of microarray data analysis. In addition, association studies investigating the relationship between genes and a phenotype of interest such as survival time are also popular in microarray data analysis. Phenotype association analysis provides a list of phenotype-associated genes (PAGs). However, it is sometimes necessary to identify genes that are both DEGs and PAGs. We consider the joint identification of DEGs and PAGs in microarray data analyses. The first approach we used was a naïve approach that detects DEGs and PAGs separately and then identifies the genes in an intersection of the list of PAGs and DEGs. The second approach we considered was a hierarchical approach that detects DEGs first and then chooses PAGs from among the DEGs or vice versa. In this study, we propose a new model-based approach for the joint identification of DEGs and PAGs. Unlike the previous two-step approaches, the proposed method identifies genes simultaneously that are DEGs and PAGs. This method uses standard regression models but adopts different null hypothesis from ordinary regression models, which allows us to perform joint identification in one-step. The proposed model-based methods were evaluated using experimental data and simulation studies. The proposed methods were used to analyze a microarray experiment in which the main interest lies in detecting genes that are both DEGs and PAGs, where DEGs are identified between two diet groups and PAGs are associated with four phenotypes reflecting the expression of leptin, adiponectin, insulin-like growth factor 1, and insulin. Model-based approaches provided a larger number of genes, which are both DEGs and PAGs, than other methods. Simulation studies showed that they have more power than other methods. Through analysis of data from experimental microarrays and simulation studies, the proposed model-based approach was shown to provide a more powerful result than the naïve approach and the hierarchical approach. Since our approach is model-based, it is very flexible and can easily handle different types of covariates.

provide a more powerful result than the naïve approach and the hierarchical approach. Since our approach is model-based, it is very flexible and can easily handle different types of covariates.

Background
The development of new technologies has greatly affected the biological research field. Specifically, the advent of microarray technology provides a crucial turning point in biological research [1,2,3,4]. Microarray technology has commonly been used for simultaneously identifying the gene expression patterns in cells for thousands of genes. In addition, the sensitivity and specificity of microarray technology continues to improve, and microarrays are becoming a more economical research tool [5]. An important emerging medical application for microarray technology is clinical decision support for diagnosis of a disease as well as the prediction of clinical outcomes in response to a treatment [6].
Recently, the improvements in microarray technology have been guiding the development of various platforms. Many studies have tried to integrate several platforms; for example, the MicroArray Quality Control (MAQC) project provided gene expression levels that were measured from seven different platforms. The MAQC study provided a resource representing an important first step toward establishing a framework for the use of microarrays in clinical and regulatory settings [7]. In addition, microarray technology has been successfully commercialized, and as a result, a substantial amount of microarray data has been generated. Several studies have performed an integration analysis of microarray data. Meta-analysis is powerful for unifying the results of various gene expression studies (for example, breast cancer [8]). Statistical models such as analysis of variance are effective in integration analysis for identifying genes that have different gene expression profiles in the presence of many controlling variables [9].
In general, the primary goal of microarray data analysis is to identify differentially expressed genes (DEGs). Microarray technology allows us to obtain data on the expression of target genes more easily than other technologies. DEGs have become more easily detected by microarray technology than ever before. When applied to experimental data, the causal genes related to diseases can be obtained by discovering DEGs. Over the last decade, numerous statistical methods have been proposed such as t-tests, significance analysis of microarray (SAM) [10], regression modeling, mixed modeling [11], and local pooled error (LPE) tests [12].
Of these approaches, the t-test is the most popular statistical test for comparing the means between two groups. The t-test is a parametric method that requires a normality assumption. However, microarray data rarely satisfy the normal distribution assumption. Therefore, a permutation test that does not require such assumptions is preferably used to detect DEGs [13,14]. The SAM [10] uses a t-type of statistics using a fudge factor to stabilize the variance, and controls for the false discovery rate (FDR) [15]. The SAM is also a non-parametric analysis that does not require normality distributional assumption.
The application of microarray technology has also led to diverse studies that go beyond identifying DEGs such as a study examining the relationship between phenotype and expression data. Various phenotypes have been used in microarray experiments; for example, the survival time was utilized as a phenotype for analyzing the recurrence of cancer in clinical studies [16,17]. Several genes associated with the survival time were identified. Microsatellite instability (MSI) was utilized as a phenotype in a microarray study of colorectal cancer. Since the CpG island methylator phenotype (CIMP) was associated with MSI and BRAF mutations in colorectal cancer [18], MSI has played an important role in colorectal cancer studies. In addition, the tumor subtype can also be an important phenotype. For example, estrogen receptor (ER), progesterone receptor (PR), and HER2 jointly define the subtypes of breast cancer. The triple-negative phenotype (ER-negative, PR-negative, and HER2-negative) is most commonly used [19].
Phenotype-associated genes (PAGs) are the genes that are associated with a phenotype of interest. The PAGs can be identified by regression analyses such as linear regression analysis for continuous phenotypes and the Cox regression model for survival time phenotypes [20]. When the phenotype is a binary variable representing two groups, the identification of PAGs becomes equivalent to identification of DEGs.
In this article, we focus on the joint identification of DEGs and PAGs in microarray data analyses. Our study was motivated by the need for an analysis of a microarray experiment consisting of high fat diet (HFD) and normal diet (ND) groups. Ten mice were assigned to each group for the microarray experiment. In addition, four phenotypes reflecting the expression levels of leptin, adiponectin, insulin-like growth factor 1 (IGF-1) and insulin were measured in blood samples. Leptin is an adipocyte-secreted hormone with a key role in energy homeostasis [21]. IGF-1 is similar in molecular structure to insulin and is an important hormone for childhood growth. Adiponectin controls glucose levels as well as fatty acid breakdown, and Insulin is one of the most important hormones in the metabolic system of mammals. The microarray experiment focused on gene expression changes associated with dietary fat control, and the determination of influential genes associated with obesity-related phenotypes. Thus, we need to identify DEGs for HFD and ND groups that are also PAGs for the four obesity-related phenotypes.
Although many approaches have been proposed for the separate identification of DEGs and PAGs, only a few approaches are available for the joint identification of DEGs and PAGs. The first approach we used for the joint identification of DEGs and PAGs was a naïve approach that detects DEGs and PAGs separately and then identifies the intersecting genes from the lists of PAGs and DEGs. The second approach is a hierarchical approach [22] that detects DEGs first and then chooses PAGs among DEGs or vice versa. Both approaches are two-stage analyses that require separate testing of DEGs and PAGs, making it difficult to control false positive errors.
We propose a new model-based approach for the simultaneous identification of DEGs and PAGs. Our model-based approach uses a linear regression model. We have use the linear regression model since it is easy to use, flexible in dealing with individual covariates, and easy extendibility (i.e. extension to permutation test can be done without using normality assumption). Our method is a one-stage analysis that takes less computing time, makes it easier to control false positive errors, and has greater power than naïve or hierarchical approaches. Through analysis of data from a microarray experiment carried out in mice and from simulation studies, we compare our model-based approach with naïve and hierarchical approaches.

Ethics statement
All animal experimental procedures were reviewed and approved by the Institutional Animal Care and Use Committee of Sookmyung Women's University (SMU-IACUC-2011-0401-005).

Data
Microarray data consisted of data obtained from HFD and ND groups of mice to determine influential genes associated with obesity. Four-week-old male C57BL/6J mice were purchased from SLC Japan (Hamamatsu, Tokyo, Japan). Mice were housed in plastic cages (three to four mice per cage) under a constant temperature (23 ± 2°C) and humidity (50 ± 10%) with a 12-h light/dark cycle. Animals were allowed to acclimatize to the laboratory environment for 1 week before the experiment onset. The composition of experimental diet was based AIN-93G. The fat sources of normal diet (ND, 15% of fat calories) and high-fat diet (HFD, 45% of fat calories) were based on corn oil and lard. The reference we have used for such fat percentage definition can be seen in "A high-fat diet impairs neurogenesis: Involvement of lipid peroxidation and brain-derived neurotrophic factor" [23]. Fresh diet was provided every 2~3 days and mice had free access to water and food throughout all experiments. Animals were maintained for 8 weeks and were killed by CO 2 inhalation at 13 week of age. At necropsy, blood and tissue samples were collected; serum samples were prepared by centrifugation of whole blood samples at 650 × g for 20 min and stored at -80°C until analysis; colon tissues were rapidly removed, immediately frozen in liquid nitrogen, and stored at -80°C until microarray analysis.
Illumina MouseRef-8 v1.1 Expression BeadChip was used in our microarray experiment. We observed changes in the gene expression pattern due to HFD-induced obesity. We assigned 10 mice to each ND group and HFD group. Then, three mice from the ND group and six mice from the HFD group were selected via QC for the microarray experiment, and each sample had 45281 probes.
Four phenotypes associated with regulating metabolism were extracted using levels of expression in the blood sample including leptin, adiponectin, insulin-like growth factor 1 (IGF-1), and insulin. Serum insulin concentration was measured with an ELISA kit (Linco Research, St Louis, MO, USA) according to the manufacturer's instruction. Serum concentrations of IGF-1, leptin (R&D Minneapolis, MN, USA) and adiponectin (Biovendor, Brno, Czech Republic) were also measured with an ELISA kit, according to the manufacturer's instructions. IGF-1 is similar in molecular structure to insulin and is an important hormone for childhood growth. Adiponectin controls glucose levels as well as fatty acid breakdown, and insulin is one of the most important hormones in the metabolic system of mammals. The expression values are log-transformed. After log-transformation, QQ plots and goodness of fit tests for normal distribution did not provide evidence that the data do not follow normal distribution. We pro-

Detection of DEGs
First, we detected DEGs by using a two-sample t-test. Secondly, we used significance analysis of microarray (SAM) [10] for identifying DEGs. SAM uses the t-statistics modified by adding a fudge factor (s 0 ) to common statistics as one of the penalty methods. The variable s i is the estimated standard error from gene i, and s 0 is calculated as a percentile based on α. Then, the following test statistic is used: In addition, the SAM method uses a permutation algorithm to control for the false discovery rate (FDR) [15]. Therefore, we can control FDR more easily with this test than for the other tests such as the t-test.

PAGs detection
Linear regression analysis is utilized to determine PAGs. There are two treatment groups in our microarray data: ND and HFD. Group information is denoted by Group. Expression i indicates the expression value for each gene. As mentioned earlier, the phenotypes of interest consist of leptin, adiponectin, IGF-1, and insulin expression. Linear regression analysis is performed for each phenotype. Two linear regression models are applied to identify the linear relationship between genes and phenotypes.
where i (= 1,2,. . .,p) represents the gene. Group information is denoted by Group. Expression i indicates the expression value for each gene. The first model M1 is to identify the effect of expression on the phenotype, whereas the second model M2 is an extension of M1 with an additional Group covariate. The significance of the linear relationship between gene and phenotype may be affected by the group effect because some genes may not have marginal effects on the phenotype but may have conditional effects given the group information. M1 is used for detecting the marginal effect, while M2 is used for detecting conditional effects. PAGs may depend on the group effect. For example, the v1rh4 gene is a non-PAG by model M1. However, it is identified as a PAG by model M2 (Fig 1). Model M2 is a more appropriate model than M1, when a group effect exists. However, model M1 provides PAGs that do not depend on the group effect, suggesting both M1 and M2 need to be fitted. Therefore, we use models M1 and M2 simultaneously to identify PAGs.
In model M1, the expression effect β 1 is of the main interest. In model M2, β 1 is still of the main interest even though the group effect β 2 is added to explain the high fat diet effect between the ND group and the HFD group. The PAGs can be identified by testing the following hypotheses: significances of these effects can be tested by calculating F-statistic for each gene.

Joint identification analysis
Naïve approach. The naïve approach detects DEGs and PAGs separately, and then identifies the intersection of PAGs and DEGs. It consists of the following two steps: 1. Identifying DEGs set and PAGs set separately for total data.

Determining intersection gene sets of DEGs and PAGs.
Hierarchical approach. The hierarchical approach [21] detects DEGs first and then chooses PAGs among DEGs, or vice versa. The hierarchical approach consists of two steps: identifying DEGs and then detecting PAGs among the DEGs, or alternatively, identifying PAGs and then detecting DEGs among the PAGs.  Both naïve and hierarchical approaches are two-stage analyses that require separate testing of DEGs and PAGs, and as a result, it was not straightforward to control the false positive errors because two tests are not independent. Thus, we propose a new model-based approach for joint identification of DEGs and PAGs. Our model-based approach uses a linear regression model, as explained in the following section.
Model-based approach. We propose a new test based on the likelihood ratio to determine DEGs and PAGs simultaneously. Consider the following model (6) M3 : For finding the DEGs and PAGs, consider the following hypotheses.
Under the null hypothesis, the likelihood can be maximized as follows.
Thus, the likelihood ratio test statistic Λ is derived as follows: Where b g 1 0 is MLE of γ 1 when γ 2 = 0, b g 2 0 is MLE of γ 2 when γ 1 = 0 and ( b g 1 ; b g 2 ) are MLE in the whole parameter space Un the null hypothesis, if γ 1 or γ 2 is large, the LRT statistic asymptotically follows χ 2 (1) distribution. Since this likelihood ratio test statistic is smaller than each likelihood ratio test statistic for testing H 01 : γ 1 = 0 or H 02 : γ 2 = 0, the p-value of LRT statistics is larger than that of both likelihood ratio test statistics for testing each beta. In this reason, even we test DEG and PAG at the same time with Chi-square test under 1 degree of freedom, type I error will be conserved.
However, we cannot say that our test is most powerful among the tests. Thus, we performed simulation for power comparison between our proposed method and naïve approach.
Simulation study. For a more systematic comparison, we performed an extensive simulation study with following settings. First, we generated phenotype data from the normal distribution with the same mean and variance as the actual phenotype data. Since adiponectin most closely followed the normal distribution according to the normality test [24], the mean 25.06 and the variance 274.896 of adiponectin were used to simulate the phenotype data from the normal distribution. 50 samples consisting of 25 cases and 25 controls were generated. Finally, expression levels were generated from the Eq (6), where ε follows a normal distribution with mean 0 and variance 78.3.

Results DEGs
Two-sample t-tests, and the SAM method, were utilized to identify DEGs. The results of t-tests were not significant after controlling for multiple testing at 5% level using Bonferroni correction or FDR. The SAM method provided significant results at the 0% median FDR = 0: Tfrc, Sprr1a, Cyp4f16, and 9030605I04Rik. Although SAM provided only a few genes that are all up regulated genes, the SAM result is expected to be very reliable because it did not contain any false positives (FDR = 0) [24]. In Table 1, we displayed top 5 genes by using the nominal p-values from the Student's t-test without any corrections for multiple comparisons. These lists of genes from two sample t-tests are different from those of SAM. Fig 2 shows a pairwise plot that illustrates the correlation coefficients among four phenotypes. All of them are highly positively correlated. Specifically, leptin and insulin have a particularly high correlation coefficient (0.836). Both leptin and insulin are known to be associated with body composition [25], BMI, and type 2 diabetes [26,27].

PAGs
Models M1 and M2 were employed to detect PAGs for these phenotypes. Fig 3 shows the Venn diagram for the number of PAGs identified by M1 and M2 at the 5% significance level. Depending on the phenotypes, the numbers of overlapping and non-overlapping PAGs differ greatly. Thus, we could assume that in many genes, conditional distributions of gene expressions given groups are different from marginal distribution.

Joint identification analysis
We applied three joint identification methods to the microarray data from the mice. We first applied the naïve approach by comparing the list of DEGs and the list of PAGs. We then applied the hierarchical approach by detecting DEGs first and then PAGs. Finally, we applied the model based approach using the likelihood ratio test from model M3. For the purpose of a fair comparison, we fixed significance and FDR at the same level. Table 2 summarizes the numbers of significant genes that are both DEGs and PAGs. A larger number of significant genes were identified by the model based approach than the other methods. The top gene list for four phenotypes at a significance level of 5%, the p-value and q-value for each effect are summarized in Table 2. Since naïve approach and hierarchical approach provided the same results, the numbers of significant genes in Table 2 are the same for both approaches.   Fig (A) shows detected genes that are significant with leptin. Fig (B) shows detected genes that are significant with IGF-1. Fig (C) shows detected genes which are significant with ADIPO. Fig (D) shows detected genes that are significant with insulin. In all of these four figures, Both M1 and M2 reveal a large number of different significant PAGs. Thus, we could assume that conditional distributions of expression levels, given group, are different from marginal distribution in many genes. doi:10.1371/journal.pone.0149086.g003

Simulation study
To check the type I error and power, we set the values of γ 1 as follows: 0, 0.08, 0.12, 0.16, 0.2, 0.24, 0.28, and let γ 2 vary from 0 to 8 by 2. These parameter values represent the effect sizes derived from the real data. Since our type I error needs to be evaluated not at a point but two axes generated by γ 1 and γ 2 , we checked both (γ 1 = 0, γ 2 6 ¼0) and (γ 1 6 ¼0, γ 2 = 0) cases. Table 3 shows how well each method preserves the type I error under the various simulation settings. When both γ 1 and γ 2 equal 0, the type I error is smaller than 0.05 for all methods. However, as either γ 1 or γ 2 increases, the type I error becomes closer to 0.05. Therefore, all methods tend to preserve the type I errors. Figs 5 and 6 show the powers of each methods for varying simulation settings. In this paper, power means the probability that the statistical hypothesis test is rejected when the null

Discussion
As the experimental designs using microarrays becomes more complex, a more complicated analysis method needs to be developed. In early microarray studies, either DEGs or PAGs need to be identified. However, recent microarray designs require a more complicated method to detect the genes that are simultaneously DEGs and PAGs. Our proposed model provided some interesting lists of genes (Table 4). Gene expression profiling figures are included in S1 File (Figs C-E). We also provided gene expressions for lists of genes and phenotypes for real data analysis in S2 and S3 Files. The four selected phenotypes  are known to be associated with regulating metabolism. Therefore, many of the significant genes tend to be associated with this regulatory function. For example, The Grb10 gene is known to encode a growth factor receptor-binding protein that interacts with insulin receptors and insulin-like growth-factor receptors [28]. FABP9 (Fatty Acid Binding Protein 9, Testis) is a Protein Coding gene which is associated with sporadic breast cancer. GO annotations related  to this gene include transporter activity and lipid binding [29]. SGCE (Sarcoglycan, Epsilon) is a Protein Coding gene which is associated with myoclonic, dystonia-11, or dystonia. It is known that SGCE gene works together with SGCA [30]. SGCA is known that is associated with type 2d [31]. Although SGCA cannot reach q-value below 0.05, that of p-value for model-based approach is 0.0011. Therefore, these genes would have significant relationships for finding differences of effects between high-fat diet and low-fat diet.
LGALS4 (Lectin, Galactoside-Binding, Soluble, 4) is a Protein Coding gene. Diseases associated with LGALS4 include colon adenocarcinoma and measles. Among its related pathways are Cell adhesion_Cell-matrix glycoconjugates. GO annotations related to this gene include carbohydrate binding [32]. These candidate genes demonstrate that the proposed method successfully identified the genes known to be associated with regulating metabolism. In addition, we have performed gene ontology analysis using DAVID functional annotation tool, using the genes only detected by model-based method with q-value below 0.05 as input. We found five GO terms contained in the CC_FAT terms that reached p-values below 0.05 or around 0.05 [33,34]. These results are summarized in Table 5. On the contrary, other methods could not find any significant molecular signatures. This may infer that genes undetected by the naïve and the hierarchical approaches could play important roles; in other words, important candidate genes may be overlooked if only the traditional approaches are used.
In this paper, we propose a statistical model for simultaneously detecting DEGs and PAGs. The proposed model is more efficient than other naïve methods for the joint identification of DEGs and PAGs. Using microarray data from an experiment and using simulation studies, the proposed model was compared to the other tested methods and was shown to have greater power. In other words, the proposed model identified more significant genes than other approaches under the same conditions ( Table 4). The proposed approach was flexible and easy to extend. Since our model is a linear regression model, it can be extended for the cases where there are multiple factors. For example, our model can be applied to the analysis of a variety of covariates at the same time.
Supporting Information S1 File. For checking the normality of log-transformed gene expressions, we performed Shapiro-Wilks test for each gene and drew a histogram of the p-values. P-values for most genes are larger than 0.05 which means log-transformed gene expressions follow normal distributions (Fig A). QQ-plots for some genes which are significant in model-based approach ( Fig  B). Gene expression profiles for significant genes with leptin detected by model-based approach. Blue dots are ND samples and Red dots are HFD samples. X-axis shows log-normalized gene expression for each gene and Y-axis shows expression of leptin (Fig C). Gene expression profiles for significant genes with adiponectin detected by model-based approach. Blue dots show ND samples and Red dots show HFD samples. X-axis shows log-normalized gene expression for each gene and Y-axis shows expression of adiponectin (Fig D). Gene expression profiles for significant genes with insulin detected by model-based approach. Blue dots are ND samples and Red dots are HFD samples. X-axis shows log-normalized gene expression for each gene and Y-axis shows expression of insulin (Fig E).
(PPTX) S2 File. Expression data for significant genes listed on Table 4.