Investigation of Variation in Gene Expression Profiling of Human Blood by Extended Principle Component Analysis

Background Human peripheral blood is a promising material for biomedical research. However, various kinds of biological and technological factors result in a large degree of variation in blood gene expression profiles. Methodology/Principal Findings Human peripheral blood samples were drawn from healthy volunteers and analysed using the Human Genome U133Plus2 Microarray. We applied a novel approach using the Principle Component Analysis and Eigen-R 2 methods to dissect the overall variation of blood gene expression profiles with respect to the interested biological and technological factors. The results indicated that the predominating sources of the variation could be traced to the individual heterogeneity of the relative proportions of different blood cell types (leukocyte subsets and erythrocytes). The physiological factors like age, gender and BMI were demonstrated to be associated with 5.3% to 9.2% of the total variation in the blood gene expression profiles. We investigated the gene expression profiles of samples from the same donors but with different levels of RNA quality. Although the proportion of variation associated to the RNA Integrity Number was mild (2.1%), the significant impact of RNA quality on the expression of individual genes was observed. Conclusions By characterizing the major sources of variation in blood gene expression profiles, such variability can be minimized by modifications to study designs. Increasing sample size, balancing confounding factors between study groups, using rigorous selection criteria for sample quality, and well controlled experimental processes will significantly improve the accuracy and reproducibility of blood transcriptome study.


Introduction
Peripheral blood is a very promising material for biomedical research due to its critical role in immune responses and metabolism. The ease and minimal invasiveness with which it can be collected have also made peripheral blood attractive for clinical use. Over the last decade, advances in microarray have offered the opportunity to study the expression of thousands of genes simultaneously in a biological system. The microarray-based transcriptome analyse of peripheral blood may provide new insights into the variations in global gene expression specifically associated with physiological and pathological events. Numerous studies have addressed the use of gene expression profiling of peripheral blood from patients with malignancies, infectious diseases, autoimmunity and cardiovascular diseases [1,2].
Before the blood transcriptomic biomarkers can be used for clinical purposes, it is essential to understand the underlying factors that contribute to the sources of variability in blood gene expression profiles. Previous studies have investigated the blood gene expression profiles of normal individuals [3]. Those studies used either whole blood or peripheral blood mononuclear cells. Gene expression profiles were found varying greatly among the different blood cell types [4][5][6]. The results demonstrated the significant contributions of genetic and physiological factors, as well as varying proporations of different cellsubsets, in determining the overall gene expression profiles of human blood [7][8][9]. Yet other research also provided evidences that gene expression profiles of normal individuals can be remarkably stable over time [10,11]. Several studies reported an excess of hemoglobin mRNA present in total RNA extracted from whole blood resulting in high noise and reduced sensitivity in transcriptome analysis [12,13]. The usefulness of available methods to minimize excess hemoglobin mRNA was then evaluated [14]. Furthermore, technological factors such as sample collection, transportation and storage conditions, as well as RNA isolation and amplification techniques, in addition to biological factors, can have a significant impact on the blood gene expression profiles [15][16][17][18][19][20].
In the study described here, we used the PAXgene TM Blood RNA System and GeneChipH U133Plus2 Microarray to analyze gene expression profiles in peripheral blood from healthy Chinese volunteers. For each donor, the physiological variables and blood cell counts were measured. The blood gene expression profiles were investigated for possible interference of age, gender, body mass index (BMI), sample RNA quality, as well as the presence of varying proportions of different blood cell types.

Human Blood Sample Collection
For gene expression profiling, 2.5 ml of peripheral blood were drawn from each volunteer by PAXgene TM Blood RNA tubes (PreAnalytix, Hilden, Germany). Another 2 ml of blood were collected for Complete Blood Count analysis. The test was performed by standard procedures at the Fudan University Shanghai Cancer Center Clinical Laboratory. The measures included white blood cell counts (leukocyte counts, relative counts for neutrophils, lymphocytes and monocytes), red blood cell counts (erythrocyte counts, hemoglobin amount, and relative reticulocyte count), as well as platelet counts. This study was carried out at the Fudan University Shanghai Cancer Center and was approved by the Ethical Committee of Fudan University Shanghai Cancer Center for Clinical Research. The written informed consents were obtained from all participants.

RNA Isolation and Preparation
Once blood samples were drawn into the PAXgene TM Blood RNA tubes by standard phlebotomy procedure, the samples were inverted ten times, maintained at room temperature for 2 hours, frozen at 220uC overnight and then moved to 280uC for storage until further use. Frozen blood samples were thawed at room temperature for 3 hours, followed by total RNA extraction with the PAXgene TM Blood RNA kit according to the manufacturer's instruction. The intact total RNA of each participant was separated into three tubes ( Figure 1). While the 1 st tube was kept intact, the 2 nd and 3 rd tubes were heated at 70uC for 10 and 20 minutes, respectively. The quantity of the total RNA was measured by a spectrophotometer at an optical density (OD) of 260 nm. Total RNA purity was assessed by the A260/A280 ratio. RNA Integrity Number (RIN) was determined using RNA 6000 Nano Chips and the Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, U.S.A.).

Microarray Experiments
We reversely transcribed 50 ng of total RNA and linearly amplified single-stranded cDNA using Ribo-SPIA TM technology with the WT-Ovation TM RNA Amplification System (NuGEN Technologies Inc., San Carlos, CA, USA) according to the manufacturer's standard protocol. The reaction products were purified with a QIAquick PCR purification kit (QIAGEN GmbH, Hilden, Germany). We subsequently fragmented 2 mg of amplified and purified cDNA with RQ1 RNase-Free DNase (Promega Corp., Fitchburg, WI, USA) and labelled the fragments with biotinylated deoxynucleoside triphosphates using terminal transferase (Roche Diagnostics Corp., Indianapolis, IN, USA) and GeneChipH DNA Labeling Reagent (Affymetrix Inc., Santa Clara, CA, USA). The labelled cDNA was then hybridized onto a GeneChipH U133Plus2 Array (Affymetrix) in a Hybridization Oven 640 (Agilent Technologies) at 60 rotations per minute and 50uC for 18 hours. After hybridization, the arrays were washed and stained according to Affymetrix protocol EukGE-WS2v4 using a GeneChipH Fluidics Station 450 (Affymetrix). The arrays were scanned with a GeneChipH Scanner 3000 (Affymetrix). All reactions and array hybridizations were carried out by the same technician to minimize the technical variation.

Statistical Analysis
Statistical analysis was performed using R software and packages from the Bioconductor microarray analysis environment [21,22], adapted to our needs. Gene expression profiles were quantified using the Robust Multi-array Average (RMA) method [23][24][25] implemented in the ''Simpleaffy'' package [26]. The GeneChipH U133plus2.0 Array contains 54,000 probesets for the interrogation of 38,500 human genes. Very often, multiple probesets are targeting to the unique gene. It is also found that several probesets were poorly annotated without any target gene. To reduce the noise and redundancy, we applied a bioinformatics-based filter using the information of Entrez Gene Database. For multiple probesets mapping to the same Entrez Gene ID, only the probeset showing the largest Inter Quantile Range (IQR) were retained and the others were removed. The probesets without Entrez Gene ID annotation were also removed. After bioinformatics-based filtering, the expression profiles of 9859 genes in 24 samples were retained for the downstream analysis.
Let Y be a m|nmatrix, where the rows of Y are the genes (m = 9859) and the columns of Y are the samples (n = 24). Although with the bioinformatics filtering, the difficulty remained in the fact that the number of genes was still much larger than the number of samples. Thus, we performed Principle Component Analysis (PCA) to reduce the dimensionality of gene expression data. In PCA, singular value decomposition was applied to the mean centered data matrix and decomposed Y into the following: where the matrices U and V were column orthogonal, so that U T U~V T V~I and D was a diagonal matrix. The columns of V were the right eigenvectors and also called principle components (PCs). We were particularly interested in the PCs because these represented the aggregated trends in the gene expression profiles. Specifically, the first PC was the linear combination of the gene expression profiles that explained the most variation in the data. The second PC was the linear combination of the gene expression profiles that explained the most variation in the data once the first PC had been removed, and so on. The proportion of total variation captured by the i-th PC was given by: where i = 1,2,..,n; d i was the eigenvalue of the i-th eigenvector, which was obtained from the i-th diagonal entry of D. Afterwards, a PCA extended method, called ''Eigen-R 2 '' [27], was used to precisely determine the proportion of variation explained by the predefined biological and technological variables X based on the PCA transformed data. Let p i be the i-th PC and letp p i be the fitted values when modelling p i in terms of the predefined variable X . For each of PCs, the proportion of variation in p i that is explained by X was calculated by: where p p p p i was the mean ofp p i and p p i was the mean of p i . Since p i of the total variation in the data is explained by p i , R 2 p p i should be weighted by p i . Additionally, given each pair of PCs is uncorrelated, the variation explained by X in p i is orthogonal to the variation explained by X in p j where j=i. Therefore, the proportion of total variation explained by X was estimated by: Figure 1. The study design for evaluating the impact of RNA quality on the blood gene expression profiles. From each healthy volunteer, 2.5 ml of peripheral blood was collected using a PAXgene TM Blood RNA tube. The extracted total RNA from each individual was aliquoted into three tubes. The 1 st tube was kept intact, while the 2 nd and 3 rd tubes were heated at 70uC for 10 and 20 minutes, respectively. According to the RNAs processing, microarray experiments were then performed in the sequential series: Series_1, Series_2 and Series_3. The RIN value was measured and used as the technical variable for evaluating the impact of RNA quality on the blood gene expression profiles. doi:10.1371/journal.pone.0026905.g001 Given the small cohort of present study, we further adjusted the R 2 calculation to accommodate small sample sizes by: where d 0 and d were the degrees of freedom spent in fitting the null model and the target model, respectively. By default, the null model was a model with only intercept (d 0 = 1). The adjusted estimator had been shown to be a unbiased estimator of the true value [28]. The gene-wise linear analysis was performed using the lmFit() function from the ''Limma'' package [29], adjusted for multiple testing with False Discovery Rate (FDR) [30]. For each gene, a linear model was fitted over the series of arrays with the log2transformed gene expression intensity as the dependent variable and the biological and technological factors as the independent variables. The functional annotation enrichment analysis for gene ontology and pathways were performed by the DAVID Bioinformatics Resource (http://david.abcc.ncifcrf.gov) with default parameters [31].
The microarray dataset discussed in this study were MIAME compliant [32]. The raw data had been deposited in ArrayExpress Database [33] and were accessible through the accession number: E-MEXP-2917.

Using the Independent Dataset for in silico Validations
Given the small sample size in the present study, we further used an independent, previously published blood gene expression dataset to verify our findings [10]. The Karlovich et al. study employed the same methodologies for the sample collection (PAXgene TM Blood RNA System) and gene expression profiling (HG-U133Plus2 Microarray), hence we considered the dataset an appropriate material to perform the in silico validation. From the Karlovich et al. dataset, we selected a total of 100 arrays with blood samples drawn from 20 (11 female and 9 male) healthy volunteers at five time points (Starting day, Day 14, Day28, Day90 and Day 180). All volunteers were Caucasians and lived in eastern France. The volunteers ranged in age from 23 to 64 yr (mean 6 SD = 46615). The age distribution was well balanced between men and women. The raw data were downloaded from NCBI's Gene Expression Omnibus [34] with the accession number: GSE16028.

Gene Expression Analysis by Real-time PCR
For each sample, 0.2 mg of total RNA was reverse-transcribed into cDNA using the Prime Script TM reverse transcriptase (TaKaRa, Dalian, China). Real-time PCR analysis was performed by the LightCyclerH 480 system (Roche Diagnostics, Mannheim, Germany) in 96-well plates using the SYBR Premix Ex Taq TM (TaKaRa, Dalian, China) according to the manufacturer's instructions. Primer designs were provided in the Table S1. ACTB (b-actin) was used as an internal control. The relative quantification of mRNA expression was calculated as a ratio of target gene to ACTB. The correlations of real-time PCR data with microarray data and predefined variables were assessed by the Spearman's Rank Correlation Test.

Sample Characteristics and Variable Definition
Peripheral blood samples were taken from 8 (4 female and 4 male) apparently healthy volunteers. All volunteers were Chinese, not on medication and non-fasted. The volunteers ranged in age from 22 to 35 yr (mean 6 SD = 27.164.1). The age distribution was balanced between men and women. The volunteers' height and weight were measured and converted into BMI. The BMI ranged from 17.6 to 29.4 (mean 6 SD = 21.363.7). For each volunteer, the blood cell counts including leukocyte count, lymphocyte%, monocyte%, neutrophil%, erythrocyte count, hemoglobin amount, reticulocyte% and platelet count were measured.
We performed Spearman's Rank Correlation Test to explore the correlationship between age, gender, BMI and blood cell counts ( Figure 2). Given that lymphocytes and neutrophils together making up 80-95% of total leukocytes, not surprisingly, the lymphocyte% and neutrophil% was inversely correlated with each other (r = 20.95, P-value,0.001). The erythrocyte count, hemoglobin amount, monocyte% as well as BMI were observed significantly correlated with the gender factor (P-value,0.001). The reticulocyte% and platelet count were also found to be correlated with each other (r = 0.69, P-value,0.001). The age variable was not apparently correlated with any other variable. Furthermore, the RIN value was measured to assess the RNA integrity on a scale from 0 (low integrity) to 10 (high integrity) [35]. Afterward, it was used to investigate the effect of RNA quality on the blood gene expression profiles. All the sample characteristics on the basis of demography, blood cell counts and RNA quality were given in the Table 1.

Investigation of Variation in Blood Gene Expression Profiles
Our blood gene expression dataset contained 9859 genes and 24 samples. The goal is to dissect the variation of thousands of gene expression profiles and characterise the underlying factors that contributed to the data variability. For this purpose, sophisticated data analysis methods were required. PCA is a widely used unsupervised linear technique for dimensionality reduction. The central idea behind PCA is to transform the original dataset consisting of a larger number of interrelated variables to a new set of uncorrelated principal components (PCs), while retaining as much as possible the variation present in the original data set [36]. Therefore, we used PCA to decompose the overall data variation into a set of PCs (Table S2). It was noteworthy that the Top-10 PCs explained 85% of the total variation, suggesting that the gene expression profiles might be affected by only few but significant factors. To explore and visualise the major sources of variation, samples were displayed in a 3-dimensional space consisting of the Top-3 PCs which explained 28.2%, 17.0% and 10.2% of the total variation, respectively. In the Figure 3, the spatial distance between the samples actually reflected their approximate degree of transcriptional similarity. Interestingly, the samples from each individual across experimental series were in close proximity to one another on the plane of PC1 and PC2, while male and female groups were separated in the direction of PC3. All of PCs transformed from original gene expression data were retained to avoid any information loss. Furthermore, a PCA extended method, called ''Eigen-R 2 '', was used to precisely determine the proportion of variation related to the predefined biological and technological factors based on the PCA transformed data ( Table 2).  It is widely believed that the highest quality RNA should be used for gene expression analyses. However, in some cases, such as human autopsy samples or paraffin embedded tissues, high quality RNA samples may not be available. Previous studies had investigated how the RNA quality might affect the gene expression profiles in the tissue samples [37,38], remarkably little is know for the situation in blood. In this work, with a specific study design, we intended to explore the possible effect of sample quality on the blood gene expression profiles (Figure 1). The extracted total RNA from each individual was aliquoted into three tubes. The 1 st tube was kept intact, while the 2 nd and 3 rd tubes were heated at 70uC for 10 and 20 minutes, respectively. According to the RNAs preparation, microarray experiments were then performed in the sequential series: Series_1, Series_2 and Series_3. The average RIN value for Series_1, Series_2 and Series_3 were 8.9, 7.7 and 6.5, respectively. The distributions of RIN value between series were significantly different (P-value,0.001).
The RIN variable was shown to be associated with 2.1% of the total variation in blood gene expression data. The effect of RNA quality on the expression of individual genes was estimated through the gene-wise linear analysis at FDR ,0.01. We identified 28 genes which the expression profiles were significantly associated with the RIN variable. The list of genes was submitted to the DAVID Bioinformatics Resource for functional enrichment analysis. The overview of variable-associated-gene list and enriched functional annotation terms were provided in the Table  S3. The genes involved in ''mRNA metabolic process'' (ZFP36L2, PTBP2, SFRS2IP, HNRNPC and RBM25; P-value = 3.2E-3), ''endoplasmic reticulum'' (CTSZ, CNIH4, SLMAP, ATF6B, CNPY3 and ERGIC1; P-value = 1.4E-2), as well as ''chromatin organization'' (EPC1, HUWE1, MLL3 and MPHOSPH8; P-value = 2.5E-2) were significantly enriched. As an example, the expression profiles of ERGIC1 (Endoplasmic Reticulum-Golgi Intermediate Compartment 1) were found highly variable between the series of  Table 2. Eigen-R 2 analysis for dissecting data variation,variable-associated gene identification and functional annotation enrichment. arrays (Figure 4). Given the same cohort, the average expression intensity of ERGIC1 dropped from 615 to 394 between Series_1 and Series_3, indicating the notable influence of sample RNA quality on the individual gene expression profiles. Recently, the technology advances such as RNA 6000 Nano Chips and Agilent 2100 Bioanalyzer had made it possible to access the samples' RNA quality beforehand. In case the quality of RNA was not satisfied, one would simply discard it. However, our results showed that this may not solve all the problems. Given the blood samples drawn from the same donors and high quality RNAs in Series_1 and Series_2 with the RIN values between 7.3 and 9.2, the expression intensity of ERGIC1 were significantly different (P-value = 0.002). Hence, it could be also important to consider the distribution of RIN variable and make it equally distributed between the comparison groups.

Variation Associated with Age
The aging effect might contribute to 8.3% of the total variation in blood gene expression profiles. A list of 196 genes was identified to be significantly associated with the age variable. Interestingly, several biological processes including ''blood coagulation'' (CD40LG, CD59, TFPI, SERPING1 and PF4; P-value = 0.02), ''blood vessel development'' (CEACAM1, EPAS1, IL8 and TCF7L2; P-value = 0.45), ''stem cell maintenance'' (RIF1, TCL1A and TCF7L2; P-value = 0.03) and ''inflammatory response'' (C3AR1, TNFSF4, IL8, CD40LG, HRH4, RIPK2, SERPING1, IDO1 and CXCL10; P-value = 0.02) were significantly enriched. Given the fact that the blood vessel walls become thicker and tougher during the aging process, it was not unexpected that the angiogenesisrelated genes were appeared in the list. On the other hand, the blood itself also changes with age in different aspects. The amount of bone marrow decreases with age, causing a decline in the formation of new blood cells. Therefore, recovery from bleeding episodes will be slowed. Age-related decline also occurs in white blood cells. Most of the white blood cells stay at the same levels, but certain white blood cells important to immunity (e.g. lymphocytes) decrease in their number and ability to resist inflammatory and infection.
We performed PCA and Eigen-R 2 analysis using the independent Karlovich et al. dataset. The age variable was found to be associated with 2.3% of the total variation. This value was in accordant with what had been reported by Karlovich et al., however, somewhat lower compared to our result. We noticed some differences between two studies that might contribute to the discordance. The age range was smaller in our study (22-35 yrs) compared to the Karlovich et al. study (23-64 yrs). Actually, 14 out of 20 volunteers from Karlovich et al. dataset were above the 35 yrs old. Hence, it was not unexpected that our result could not be fully retrieved from those samples. Furthermore, Karlovich et al. explained that the broad age range in their study might have prevented the detection of the aging effect in blood [10]. Although similar observations had been reported previously [7,11], the aging effect in blood transcriptome needs to be further specified with more samples and a broad range of ages.

Variation Associated with Gender
The gender effect was found to be associated with 9.2% of the total variation in blood gene expression data. A total of 105 gender-associated genes were identified at FDR ,0.01. It was intriguing that the Top-10 gender-associated genes were located on either the X or Y chromosomes (XIST, RPS4Y1, EIF1AY, KDM5D, DDX3Y, CYorf15A, CYorf15B, USP9Y, UTY and PRKY). The ectopic expression patterns of gene XIST (X Inactive Specific Transcript) was observed showing an average intensity of 3,000 in females and almost no expression in males. On the contrary, the Y chromosome linked gene RPS4Y1 (Ribosomal Protein S4, Y-linked 1) was highly expressed in males with an average intensity of 1,800 but nearly no expression in females ( Figure S1). In the Karlovich et al. dataset, gender effect was associated with 9.7% of the total variation, which was very similar to what was found in our dataset. Furthermore, the specific patterns of XIST and RPS4Y1 were also retrieved. The ratio of XIST to RPS4Y1 showed significantly higher in females compared to males (P-value = 1.6E-7), which might represent a useful gender-associated biomarker for the blood gene expression analyses in the future.

Variation Associated with BMI
Given the distinct distribution of BMI between men and women, we included the gender variable to the Eigen-R 2 linear regression model for the estimation of variation. The BMI variable was found to be associated with 5.3% of the total variation. A total of 122 BMI-associated genes were identified at FDR ,0.01. Of these genes, a cluster of 9 genes known to be related to ''natural killer cell mediated cytotoxicity'' were enriched (TNFRSF10A, PRF1, KIR2DL5A, KIR2DS1, GZMB, KIR2DL1, KIR2DL3, SH2D1B and KIR3DL1; P-value = 8.2E-6). Other genes involving in the ''hematopoietic cell lineage'' (CD38, CD8B, IL7 and CD5; Pvalue = 0.03) and ''phosphatase activity'' (ALPL, PTPRN2, INPP4B, PSPH and NT5E; P-value = 0.05) were also enriched in the list.

Variation Associated with White Blood Cells
It was noteworthy that the white blood cell measures (''Leukocytes'', ''Lymphocyte %'', ''Monocyte %'' and ''Neutrophil %'') were associated with 7.8% to 16.4% of the total variation. Previously, white blood cells had been described as the most transcriptionally active of all cell types in blood and might present the most sensitive gene expression profiles in response to biological and technological factors [7,11]. Taking the advantages of PCA and Eigen-R 2 method, our result demonstrated that the heterogeneity of white blood cell constituents indeed contributed to the most significant portion of the overall data variation. Via the genewise linear analysis at FDR,0.01, we identified 50, 557, 1146 and 1560 genes associated with ''Monocyte %'', ''Leukocytes'', ''Lymphocyte %'' and ''Neutrophil %'', respectively. The white blood cell counts had been shown highly correlated, hence we combined the four gene sets together and resulted in a list of 1893 unique genes. As expected, numerous specific (e.g. ''leukocyte activation'', P-value = 2.7E-7; ''mitochondrial membrane organization'', P-value = 2.7E-5; ''homeostasis of number of cells'', Pvalue = 8.1E-5; and ''T cell receptor signalling pathway'', Pvalue = 1.1E-2) as well as more general biological processes (e.g. ''immune response'', P-value = 4.2E-10; ''cell activation'', Pvalue = 2.8E-8; ''hemopoiesis'', P-value = 2.2E-5; and ''regulation of apoptosis'', P-value = 1.8E-4) were significantly enriched among the list of leukocytes-associated genes. Hierarchical clustering of the leukocytes-associated genes resulted in perfect clustering of individual samples, demonstrating the predominating effect of individual heterogeneities of leukocyte subsets on the blood gene expression profiles ( Figure 5).

Variation Associated with Red Blood Cells
Given the different physiological conditions between men and women, the red blood cell counts were correlated with gender. Hence, we included the gender variable into the Eigen-R 2 linear regression model for the estimation of variation. The red blood measures (''reticulocyte%'', ''erythrocyte'' and ''hemoglobin'') were found to be associated with 3.6% to 8.6% of the total variation. A total of 7, 60 and 299 genes were identified to be associated with ''reticulocyte%'', ''erythrocyte'' and ''hemoglobin'', respectively. The three lists were combined and resulted in a set of 306 unique genes. Interestingly, multiple relevant biological processes, such as: ''negative regulation of apoptosis'' (HTATIP2, GNRH1, SOCS3, CLU, SNCA, PF4, PIM3, BCL2L1, STRADB, CSDA, MIF, PROK2,

Variation Associated with Platelets
Human blood platelets play critical roles in normal hemostatic processes and pathologic conditions such as inflammation, thrombosis, vascular remodelling, and wound repair. Although platelets are anucleate and lack nuclear DNA, they retain megakaryocyte-derived mRNAs [39]. Platelets contain rough endoplasmic reticulum and polyribosomes, and thereby they maintain the ability for protein biosynthesis from cytoplasmic mRNA [40]. Quiescent platelets generally display minimal translational activity, but newly formed platelets synthesize various a-granule and membrane glycoproteins [41]. In our dataset, the platelets variable was associated with 4.3% of the total variation. The gene-wise linear analysis identified 24 platelet-associated genes. Several genes involved in ''immune response'' (DDX58, OAS3, RSAD2, IFI44L, OAS1, TREML1 and IFI6; P-value = 3.6E-5) as well as ''nucleotide binding'' (OAS1, OAS3, DDX6, DDX58, NUAK1, CMPK2 and RNF213; P-value = 5.6E-3) were significantly enriched. Moreover, it was intriguing that PEAR1 (platelet endothelial aggregation receptor 1) and GP3A (platelet glycoprotein III) were also found significantly associated, but with slightly higher FDRs at 0.02 and 0.03, respectively.

Validation of the Microarray Data by Real-time PCR
Real-time PCR is generally considered the "gold-standard" assay for measuring gene expression and is often used to confirm findings from microarray data. A total of five genes (C3AR1, XIST, LCK, OAS1 and IFIT1), which associated with age, gender and blood cell counts in microarray data, were selected for real-time PCR validation. The gene expression profiles determined by microarray and real-time PCR were found to be highly correlated (Table 3). Meanwhile, the real-time PCR based gene expression profiles were also significantly associated with the age, gender, lymphocyte%, reticulocyte% and platelet variables.

Discussion
In the study described here, we intended to characterize the major sources of variation in the gene expression of human blood.
We used the PCA and Eigen-R 2 method to dissect the overall variability of gene expression data and associate the major sources of variation with the predefined biological and technological factors. The results indicated that the predominating sources of the variation could be traced to the individual heterogeneity of the relative proportions of various blood cell types (leukocyte subsets and erythrocytes). The physiological factors (age, gender and BMI) were found to be associated with a significant proportion of variation in blood gene expression profiles. We further used a large independent dataset to perform the in silico validation. By applying the same statistical analyse, the similar gender effect as well as lessened aging effect was verified in the Karlovich et al. dataset. Through the specific design, we investigated the blood gene expression profiles from the same donors but with different levels of RNA quality. Although the proportion of variation associated to the RIN variable was mild, the significant impact of RNA quality on the expression of individual genes was observed.
Remarkably, the individual heterogeneities of blood cell constituents represent the most significant portion of the overall variation in blood gene expression. As a major defense and transport system, the blood interacts with virtually every organ and tissue in the human body, and thereby the gene expression responses of circulating white blood cells can potentially provide early warning of any abnormalities they discover. From this point of view, our results lend support to the clinical use of blood transcriptomic biomarkers for exposure, disease progression, diagnosis or prognosis. Given the significant effect of the relative proportions of different blood cell types, it is highly recommended to perform the CBC tests simultaneously and integrated the blood cell counts into the blood transcriptome study. For the donors with the blood cell counts falling outside normal ranges, they should be excluded. In some cases the size effects of biological or environmental factors are mite, the differences of gene expression profiles between comparison groups can be whelmed by the unconcerned individual heterogeneity and resulting in no differentially expressed genes was found. Then, it is possible to deconvolute blood gene expression profiles ''in silico'' if the measures of blood cell counts are available. This type of analysis can help to deduce cellular composition or cell-specific levels of gene expression using statistical methodologies [42][43][44]. Ultimately, increasing the sample size of comparison groups whenever possible is always helpful to neutralize the unconcerned individual heterogeneity. In the study design, physiological factors (age, gender and BMI) need to be well controlled and have them equally distributed between the comparison groups. The RNA quality of blood samples should be accessed before gene expression profiling. In case the quality of RNA not fit the predefined criteria, samples should be removed. It is also important to pay an attention to the global distribution of sample quality and make it well balanced between the comparison groups. Finally, our study extends the limited information base currently available for the baseline variation of blood gene expression profiling in Chinese population. The clinical and transcriptomic data described in the study have been made freely available and should represent a useful resource for the design of future studies. Figure S1 The ectopic expression patterns of XIST and RPS4Y1 in men and women. In x-axis, the samples were arranged in accordance with the array Series 1-3 from the left to the right. Black and red dots represent blood samples collected from men and women, respectively. The y-axis indicates the gene expression signal intensity. (TIF)