Collective Dynamics of Specific Gene Ensembles Crucial for Neutrophil Differentiation: The Existence of Genome Vehicles Revealed

Cell fate decision remarkably generates specific cell differentiation path among the multiple possibilities that can arise through the complex interplay of high-dimensional genome activities. The coordinated action of thousands of genes to switch cell fate decision has indicated the existence of stable attractors guiding the process. However, origins of the intracellular mechanisms that create “cellular attractor” still remain unknown. Here, we examined the collective behavior of genome-wide expressions for neutrophil differentiation through two different stimuli, dimethyl sulfoxide (DMSO) and all-trans-retinoic acid (atRA). To overcome the difficulties of dealing with single gene expression noises, we grouped genes into ensembles and analyzed their expression dynamics in correlation space defined by Pearson correlation and mutual information. The standard deviation of correlation distributions of gene ensembles reduces when the ensemble size is increased following the inverse square root law, for both ensembles chosen randomly from whole genome and ranked according to expression variances across time. Choosing the ensemble size of 200 genes, we show the two probability distributions of correlations of randomly selected genes for atRA and DMSO responses overlapped after 48 hours, defining the neutrophil attractor. Next, tracking the ranked ensembles' trajectories, we noticed that only certain, not all, fall into the attractor in a fractal-like manner. The removal of these genome elements from the whole genomes, for both atRA and DMSO responses, destroys the attractor providing evidence for the existence of specific genome elements (named “genome vehicle”) responsible for the neutrophil attractor. Notably, within the genome vehicles, genes with low or moderate expression changes, which are often considered noisy and insignificant, are essential components for the creation of the neutrophil attractor. Further investigations along with our findings might provide a comprehensive mechanistic view of cell fate decision.


Introduction
Cell fate decision involves reprogramming of precursor cells into the differentiated state. It is intriguing to grasp how a specific path is chosen by a cell, among the several possibilities that can arise, through the complex multi-molecular interactions during differentiation. The understanding of such deterministic process, where the macroscopic stable cell fate transition requires the coordinated regulation of thousands of genes forming networks, could uncover mechanisms that control cell differentiation, as well as reveal better strategy to suppress disease progression, e.g., cancer proliferations.
The study of large-scale network dynamics has been investigated in a variety of fields including mathematics, physics, information sciences, ecology and biology ever since the onset of the nineties [1][2]. A large number of studies have already shown that the emergence of collective behavior, such as synchronization of processes, can arise due to the non-linear regulations of complex network systems with environmental perturbations. For example in biology, the secretion and detection of autoinducer molecules between bacteria enable a population of them to collectively regulate gene expression and, therefore, produce coordinated group behavior such as the formation of biofilm by Pseudomonas aeruginosa [3][4]. However, it remains unclear how the complex and dynamically evolving molecular networks found in biological systems can give rise to a globally coherent orchestrated response.
High-throughput omics (transcriptomics, proteomics & metabolomics) analyses have indicated that the molecular interactions within a living cell typically form a single, largely interconnected network [5][6][7][8]. It is, thus, necessary to have an integrated network view to understand cellular processes such as cell fate transitions or differentiations in which cells receive a broad range of biological signals or perturbations which influence gene expressions across the entire genome to produce reliable and robust outcome.
To demonstrate the genome-wide integrated response for cell fate decision, Huang et al. investigated the differentiation of human pro-myelocytic leukaemia HL-60 cells into neutrophil by the action of two different reagents, DMSO and atRA [9]. Based on the 2773 highest expressed genes (based on two-fold expression changes), Huang et al. showed the convergence of cell fate despite different initial transcriptome dynamics arising from the different stimuli, thus suggesting the presence of stable multidimensional attractor states in biology [10][11][12][13]. Although this result is the first step towards understanding the existence of cell fate attractors, many other fundamental questions remain to be investigated. For example, what are the intracellular origins and mechanisms that instill genome-wide response? What form cellular attractors? How these emerge through the complex molecular networks? If the entire genome is linked through networks, is attractor state achieved by self-regulation [12]?
The majority of large-scale gene expression studies have focused on genes with high expression changes or variations to decipher key regulatory processes, since low-level expression changes of genes have been considered as noisy due to the issue of poor signal-to-noise ratio in microarray experiments. This is due to the difficulty in the estimation of unspecific binding abundance between probe and target in signal intensity [14][15][16], and especially for the low level expression changes, the effect of background noises, compared with specific binding activity, is likely larger than that for highly variable genes. However, in our recent study, we demonstrated that the splitting of whole genome into different ensembles to analyze their temporal expression changes from the initial time resulted in the reduction of their fluctuations as the ensemble size is increased. This resulted in collective genome-wide expression behaviors which exhibited local and global effects of lipopolysaccharide (LPS) stimulated macrophages; local being the well-known pro-inflammatory response of a small number of highly expressed genes, while global being the novel collective activation of diverse processes comprising the rest of the lowly expressed genes [17][18].
In this paper, we investigated the entire microarray data of HL-60 cells for atRA and DMSO stimuli including lowly variable signals over time [9]; DMSO is known to activate key transcription factors such as NF-kB [19], whereas atRA penetrates the nucleus and directly remodels chromatin structure [20]. To uncover the orchestrated gene expressions guiding cell fate decision, we used Pearson (linear) correlation and mutual information (nonlinear correlation) metrics to investigate the collective dynamics of gene expressions for each stimulus. To overcome the difficulties of dealing with single gene expression noises in microarray data, we formed grouping of genes (chosen randomly from the whole genome and ranked according to group expression changes across time) which showed the reduction of correlation noises as ensemble size is increased. From this, in contrast to a previous finding which suggested the whole genome's role in differentiation, we demonstrate that only selective portions of fractal-like gene ensembles are responsible for the neutrophil attractor. Notably, the removal of these specific gene ensembles from the whole genome, for both atRA and DMSO stimuli, destroys the attractor. Thus, for the first time, we reveal the existence of 'genome vehicle' and show that genes with low or moderate expression changes, contained within genome vehicles, are crucial for the neutrophil attractor.

Results and Discussion
Reduction of correlation noises when grouping genes Previously, we have shown that the collective proinflammatory response of whole genome can be captured by random gene sampling of ensemble size above 80 [17][18]. Thus, to investigate the collective behavior of HL-60 cell differentiation, we randomly grouped genes from whole genome into different ensemble sizes (n = 10, 50, 100, 200, 500, 1000) and evaluated their expression dynamics in the correlation space (see Methods, ''Correlation analysis of gene expressions''). Both the temporal (modified) Pearson correlation of gene variation, r v , and the corresponding temporal mutual information, I, distributions of the gene ensembles transited from scattered and incoherent ones to clear bell-shaped ones for n t $100, where n t & ffiffiffiffi ffi N p (N = 12625, Figure 1). These result show that standard deviations of r v and I distributions at each time point are reduced according to the a= ffiffi ffi n p law with increasing n, where a is the fitting coefficient. Thus, the ensemble size of n t = 200, with good resolution, was chosen to evaluate the probability distributions of r v and I for each time point of the gene expression data, {V(t 0 ),..,V(t M )}, where V(t i ) is the whole genome expression deviation vector at t i (i = 0,1,..,12) (see Methods).
Localization and overlapping of probability distributions of correlations for atRA and DMSO responses indicate neutrophil attractor Utilizing the noise reduction by grouping genes, we plotted the probability distributions of r v and I versus time, and observed that as the ensemble size is increased, the distributions localized to specific points (r v , I), especially after 48h (Figure 2A-B). These localizations may be derived by the presence of neutrophil attractor. To test whether the localization of probability distributions indicate attractor state, we analyzed the superposition of r v and I distributions after 48h for both atRA and DMSO and found they possess distinct peaks for both the atRA and DMSO responses ( Figure 2C). Moreover, the superposition of the probability distributions (SPD) of r v and I of atRA and DMSO responses overlap indicating the presence of cell fate attractor, as it corresponds to the fact the two stimuli elicit the same biological end-point, the generation of a mature neutrophil cell.
To define the attractor region, we adopted the concept of critical (inflection) points as used in phase transitions in thermodynamic systems to determine the boundary of the neutrophil attractor. Note that due to the limited temporal data points, we are unable to determine the attractor basin for neutrophil differentiation as defined in continuous dynamics. Thus, we evaluated the gradients of the SPD for r v and I to determine the inflection points for atRA and DMSO responses and the resultant plots reveal distinctive crater-like feature with the rings indicating inflection points ( Figure 2D) and the common overlapping area of the inflection points of the SPDs, i.e., the SPD boundaries for the atRA and DMSO responses was defined as the neutrophil attractor ( Figure 2E).
As a further test, the attractor boundary also encompasses the convergence of the atRA-and DMSO-trajectories ( Figure 2E, right panel). To check the statistical significance of the localized SPD of r v and I within the attractor, we verified that its standard deviation of both r v and I distributions for each stimulus also scales with a= ffiffi ffi n p as n is increased ( Figure 2F). Note that for the other localized SPD of r v and I before 48h, it coincided with the whole genome trajectory loops indicating intermediary cell differentiation states [21] ( Figure 2E, left panel).

Emergence of asymptotic whole genome collective behaviors
To investigate the whole genome collective behavior, we grouped genes according to their variance across time. The whole genome deviation vector, V(t 0 ), was sorted from the highest to the lowest standard deviation s (see Methods, ''Ranking gene ensembles''). This sorting order at t 0 was retained for all other time points (t = 1,..,12). Next, we split the ranked whole genome into p groups, where p is the integer values of N/n for n = 10, 50, 100, 200, 500, 1000. Similarly for the random selection of genes, we checked whether expression noises can be reduced for the ranked groups as the group size is increased. We plotted the set of mean values of gene deviations for p groups, .,12) (see Methods, ''Ranking gene ensembles'' and Figure 3). The plots show the group's mean values transited from scatter to the emergent asymptotic curves, f i G k (t 0 ) À Á , as n is increased for all t i (i = 1,..,12) (transition at n t % ffiffiffiffi ffi N p , Figure 3A-B). Note that f i G k (t 0 ) À Á (k = 1,2,…,p) is the gene deviation value on the asymptotic curve, and f i is the function of the asymptotic curve for the i th time point determined by the nonlinear least squares fitting with cubic polynomial for the set of points (G k (t 0 ), G k (t i )). This is due to the fact that as n is increased, , decreases following the a= ffiffi ffi n p law ( Figure 3A-B, center panels). These asymptotic curves suggest that the genome-wide averaging behavior of collective expression dynamics exists. Once again n t = 200 genes produced acceptably good resolution. Thus, n t is the basic size of the genome element, G k (t 0 ), and for the whole genome, we obtained 63 genome elements, totaling 12600 genes. The remaining 25 genes with very low s were discarded. Note: we used s instead of coefficient of variation (CV~s=m) for ranking genome elements as we are dealing with trajectories of ensemble of genes, rather than normalized form as often used in conventional approaches. Nevertheless, we compared CV versus s and found linear relationships between them (Figure S1), ruling out any possible trivial scale effect as explanation of our results. Transition from scattered to smooth bell shaped distributions of r v and I when grouping genes. Distributions of (A) r v and (B) I for ensembles of n randomly chosen genes from whole genomes (n = 10, 50, 100, 200, 500, 1000), estimated by Gaussian kernel with 100 repeats at a representative t = 48h (similar profiles are obtained for all time points), left panels for atRA and middle panels for DMSO response. Standard deviation of r v and I distributions (right panels of (A) and (B)) at t = 48h decreases as n is increased, following a a= ffiffi ffi n p law, a>1 for r v and a>0.3 for I. Note that this transition also occurred for all time points (data not shown). doi:10.1371/journal.pone.0012116.g001 Specific genome elements fall into the attractor in a fractal-like manner We investigated trajectories of the 63 ranked genome elements ( Figure 4A, left panel) by creating a sequence of binary numbers where 1 and 0 indicate genome elements falling into and not falling into the attractor, respectively, against their standard deviation, s ( Figure 4B, upper panels). The result showed that the genome elements falling into the attractor are non-continuous in s. To understand the discontinuity, we checked the sensitivity of genome elements falling into the attractor, i.e., changing from 0 to 1 or viceversa for single-gene shift ( Figure 4A, right panel). We found that even a single replacement of the highest s gene from a genome element with the highest s gene of the next lower ranked genome element results in its destiny change of falling or not falling into the attractor (see e.g., G 11 (t 0 ) for DMSO, Figure 4B, lower panels). This expansion of a genome element shows fractal-like binary distributions and the sensitivity of single-gene shift within a genome element demonstrates the non-linear nature of gene expression dynamics. Note that the expansion processes are limited by the lack of continuous data to show true fractal characteristics [22][23].
Next, we evaluated trajectories of the 63 genome elements and compared each with the whole genome trajectory, in terms of the Euclidean distance, d, of r v and I. We observed that 12 genome elements for atRA and 20 for DMSO, fall into the attractor, and among them more than 50% (with 0.24,s,0.40 for atRA and 0.20,s,0.59 for DMSO) are close to the whole genome trajectory with minimum distance (for d,0.11) ( Figure 4C). This indicates that the genome elements falling into attractor scale with the whole genome trajectory. Overall, these results suggest that whole genome responses possess fractal-like nature for neutrophil differentiation.
To exhaustively search for more possible genome elements that can enter the attractor, we performed an iterative procedure where we removed the initial elements into attractor and shifted the remaining genomes by 50 genes from the highest s values to create new genome elements ( Figure S2). Through this, we obtained additional 9 elements for atRA and 8 elements for DMSO.

The loss of the attractor when specific genome elements are removed
We evaluated the SPDs of r v and I of all the genome elements into the attractor and found the SPD boundaries of atRA and DMSO responses overlapped to maintained the attractor ( Figure 5A), while those for the rest of genome elements did not ( Figure 5B). Moreover, for the genome elements falling into attractor, the corresponding trajectories of both atRA and DMSO converged, but not for the trajectories of the rest of genome elements ( Figure 5A-B). These results indicate that the rest of genomes for both atRA and DMSO stimuli failed to demonstrate the convergence and to form the neutrophil attractor.
Previously, Huang et al. indicated the convergence of atRA and DMSO trajectories in the space spanned by the first two principal components based on 2773 high expression genes (2-fold change in expression values from t 0 ) [9]. Notably, our analysis shows that for the high expression genes, neither their correlation trajectory converged nor their SPD boundaries overlapped ( Figure 5C). Furthermore, SPD boundaries of the specific genome elements    without these highly expressed genes maintained a common neutrophil attractor, albeit with less area ( Figure 5D). Thus, the lowly and moderately variable genes within genome vehicles play an important role in the formation of the neutrophil attractor ( Figure 5E). These results demonstrate that the collective dynamics of specific gene elements for both atRA and DMSO responses are responsible for the cell fate decision and we call these genome elements that effectively drive cells toward the attractor for each stimulus as ''genome vehicle''.
In summary, we show the existence of genome vehicles is responsible for neutrophil differentiation. Despite initial differences of the transcriptional program induced by atRA and DMSO stimulations, the self-regulation of the genome vehicles leads to the formation of a common neutrophil attractor. In addition, we demonstrate that the collective motion of lowly and moderately variable genes within the genome vehicle, which are often considered as noisy and insignificant, play an important role in the formation of the neutrophil attractor, perhaps indicating the non-instructive signaling of genes related to small-amplitude DNA motions [24][25]. Since the dynamics of gene expression is connected with the dynamics of chromatin structural changes, finding the underlying mechanisms, such as the collective dynamics of small-amplitude DNA fluctuations within chromatin structure, for the motion of the genome vehicle might decipher fluctuations in chromatin dynamics that determines cell fate decision. It will be interesting to know how the concerted motion of the genome vehicle, together with well-known master instructive genes, such as Yamanaka factors [26], drives the differentiation of pluripotent stem cells as well as other biological processes that could acquire a completely different perspective under the proposed model.

Correlation analysis of gene expressions
Microarray technologies monitoring large-scale gene expressions simultaneously have revealed mutual and highly correlated behaviors [18], [27][28]. This is conceivable due to the fact that gene expressions, i.e., net mRNA concentrations, are tightly controlled by the transcriptional system (consisting of transcription factors, RNA degradation, DNA physiochemical modifications, etc.) which regulates multiple sets of genes rather than a single gene. Hence, the use of correlation metrics has been widely adopted in microarray studies.
The majority of studies have used Pearson correlation, r(X,Y), when analyzing two N-dimensional expression vectors, X and Y, e.g., comparing the response of genomes between two time points for a given biological stimulation; where X i~xi {x and Y i~yi {y, x i and y i are gene expression of the i th gene of expression vectors, X and Y, respectively, x and y are mean values, and h is the angle between the two N-dimensional expression vectors from the center of mass. Thus, this form of analysis of gene expressions reveals linear relationship, e.g., h = 0 for perfect positive linear relationship, h = p for perfect negative (anti-correlated) linear relationship and h = p/2 for linearly independent relationship. However, if the relationship between the two vectors is nonlinear, then Pearson correlation analysis is insufficient. In such cases, the use of mutual information has been instrumental in biology [29][30]. In this paper, we adopted both a modified version of Pearson correlation (see below) and mutual information, to investigate the whole genome collective dynamics in the process of neutrophil differentiation to two distinct stimuli (atRA and DMSO), revealing the existence of neutrophil attractor as well as the whole genome expression dynamics toward the attractor.

Modified Pearson correlation r v for measuring expression variation
Each stimulus's dynamic genome expression activity (partial and whole) is defined by N-dimensional gene deviationfrom-average vectors at time , where x j t i ð Þ is expression value of the j th gene at t i , x j is its average expression over M+1 discrete time points, and v j (t i )~x j t i ð Þ{x j is its deviation from the average gene expression (called gene deviation). In our study, N = 12625 genes/ORFs and M = 12, where t = 0, 2, 4, 8, 12, 18, 24, 48, 72, 96, 120, 144, 168h. We modified the typical Figure S3 for r v vs. r) by subtracting their average expression value, x j , from each expression value at all time points, instead of subtracting the mean of whole genome expression, x. This index thus measures the temporal correlation of genome-wide expression deviations from their average values so as to allow discriminating gene expressions with different amplification but similar temporal profiles. For simplicity we included ORFs as genes, and for the microarray data, we applied RMA normalization which is known to produce robust reproducible results for all range of expression units [31].

Mutual information I
Nonlinear dependency between vectors V(t i ) and V(t 0 ) is checked by mutual information [32][33] ) p(x,y) ln (p(x,y)){e, where the joint probability distribution function p(x,y), and marginal probability distribution functions, p i (x) at t i and p 0 (y) at t 0 are estimated by means of an histogram-based approach by discretizing the gene expression into K = 10 bins [32]. Note: due to the discretization, mutual information I incurs a systematic error e [32]. Since randomly ordered data should destroy correlations, we expect I to be close to zero, therefore, we calculated the minimum I for 100 random permutations of gene deviation vectors {V(t i )}. However, we found a positive value for minimum I instead of zero, and so subtracted this minimum positive constant value from the final I. For comparing I of atRA and DMSO response, we used the  s si ,s sj for i.j and s sj is the standard deviation of the j th gene deviation. Next, we split the whole genome into p groups (each having n genes) at t 0 , so that the whole genome at t 0 is represented by G t 0 ð Þ~S p k~1 G k ( s s k ; t 0 ),whereG k ( s s k ; t 0 )~S n j~1 s (k{1)nzj (s s (k{1)nzj ; t 0 ), s s k~1 n P n j~1 s s (k{1)nzj , s (k{1)nzj (s s (k{1)nzj ; t 0 ) is the j th gene deviation in the k th group, and s s(k{1)nzj is its standard deviation. Note that we choose p to be an integer value of N/n for n = 10, 50, 100, 200, 500, 1000, and the residual genes were not evaluated. From here onwards, we simplified all notations without s symbols, e.g., G k ( s s k ; t 0 )~G k (t 0 ). The set of p groups' average gene deviation from the whole genome at t i is represented by G t i ð Þ~S p k~1 G k t i ð Þ, where G k t i ð Þ~1 n P n j~1 s (k{1)nzj (t i ) and i = 0,1,..,12. Figure S1 Comparing CV with s. Standard deviation, s versus CV for genome elements of n t = 200 genes sorted by s.