Systems Mapping for Hematopoietic Progenitor Cell Heterogeneity

Cells with the same genotype growing under the same conditions can show different phenotypes, which is known as “population heterogeneity”. The heterogeneity of hematopoietic progenitor cells has an effect on their differentiation potential and lineage choices. However, the genetic mechanisms governing population heterogeneity remain unclear. Here, we present a statistical model for mapping the quantitative trait locus (QTL) that affects hematopoietic cell heterogeneity. This strategy, termed systems mapping, integrates a system of differential equations into the framework for systems mapping, allowing hypotheses regarding the interplay between genetic actions and cell heterogeneity to be tested. A simulation approach based on cell heterogeneity dynamics has been designed to test the statistical properties of the model. This model not only considers the traditional QTLs, but also indicates the methylated QTLs that can illustrate non-genetic individual differences. It has significant implications for probing the molecular, genetic and epigenetic mechanisms of hematopoietic progenitor cell heterogeneity.


Introduction
Cell fate decision is an important question during developmental processes, such as embryogenesis, neurogenesis, and hematopoiesis. During the hematopoiesis process, hematopoietic stem cells (HSCs) proliferate to self-renew or differentiate to progenitor cells, which generate mature blood cells. These progenitors, including common lymphoid progenitors (CLPs) and common myeloid progenitors (CMPs), can differentiate into more committed progenitors that give rise to blood cells. These progenitors can be used for bone marrow transplantation to treat diseases such as leukemia, sickle cell anemia, and thalassaemia [1][2][3][4].
Hematopoietic multipotential progenitors have two major differentiation choices: erythroid and myeloid lineages, which are regulated by the key transcription factors, Gata1 and PU.1. These two transcription factors positively regulate lineage-specific genes and repress each other [5]. In addition to transcriptional networks, genetic and epigenetic mechanisms are critical in determining cell fate.
Genetically identical hematopoietic progenitor cells growing under the same conditions can show differences in phenotypic characteristics, which is known as "population heterogeneity", and has attracted interest for many years. However, it remains unclear whether this non-genetic characteristic affects cell fate determination. A previous report published in Nature by Chang et al. [6] showed that gene expression of noise controls the lineage choices of hematopoietic progenitor cells. However, the genetic mechanisms that control this process have not been explored in that paper. Cell fate conversions are dynamic with changes in chromatin structure regulated by DNA and histone modifications, including DNA methylation at symmetrical CG dinucleotides (CpG) and histone methylation and acetylation. Epigenetic regulation has been studied in hematopoietic lineage specification based on coordinated changes in gene expression, chromatin state, and DNA methylation [5,7,8].
Genetic mapping can provide a view of network and gene actions, as well as interactions with quantitative trait loci (QTLs), which can demonstrate the effects of genetic variation. Functional mapping developed by Wu et al. differs from the traditional mapping strategies, and is a very useful method to analyze dynamic data, as well as mapping QTLs related to development processes including cell apoptosis, cancer stem cell proliferation [9][10][11][12], et al. Clonal population heterogeneity, which is known as "non-genetic cell individuality", cannot be analyzed using traditional QTLs. Several studies have examined the link between DNA methylation and gene expression, as well as mapping the methylated QTLs (meQTLs) to interpret the mechanisms underlying genetic variants [13,14]. meQTLs may increase our understanding of population heterogeneity and lineage choice problems, which cannot be demonstrated by alterations in the DNA sequences.
The aim of this article was to explore the genetic mechanisms regulating cell population heterogeneity and hematopoietic progenitor cell lineage choices. Besides the traditional QTL analysis, we mapped the effects of genetic variation on DNA methylation, focusing on mapping meQTLs that determine population heterogeneity and lineage choices.

Mathematical modeling of the evolution of two hematopoietic progenitor cell subpopulations
Hematopoietic progenitor cells show heterogeneity in one clonal population. The expression level of stem-cell-surface marker Sca-1 showed an approximately 1000-fold range within one newly derived clonal cell population based on flow cytometric analysis. Cells with the highest, middle and lowest~15% Sca-1 expression level were isolated from one clonal population as separate subpopulations using fluorescence-activated cell sorting (FACS). Within hours, all three subpopulations showed narrow Sca-1 histograms; however, the three fractions regenerated Sca-1 histograms similar to that of the parental (unsorted) population after 21-day culture. A two-Gaussian model that best fitted the observed histogram evolution and restoration of the parental distribution was predominantly driven by state transitions between the subpopulations. Linear and nonlinear ordinary differential equations (ODEs) are used to describe the transition of the two subpopulations, respectively.
Based on Fig 1, Chang et al. [6] proposed a linear model of equations for the size x i of subpopulation i: where x 1 and x 2 represent the sizes of subpopulations 1 and 2, r is the growth rate of both subpopulations, k 1 is the transition rate from x 1 to x 2 and vice versa for k 2 .
To increase our understanding of the asymptotic behavior and explain the sigmoidal increase of x 1 and x 2 , a nonlinear dynamic model of two interacting and growing progenitor cell populations was proposed based on Fig 2. The ordinary differential equations governing the growth of the two subpopulations will then be: where x 1 and x 2 represent the sizes of subpopulations 1 and 2, r is the growth rate of both subpopulations, k 1 is the transition rate from x 1 to x 2 and vice versa for k 2 , and k 3 and k 4 are parameters determining the feedback rate.

Statistical modeling of systems mapping
Systems mapping is a statistical model that views a complex phenotype as a dynamic system, dissects it into its underlying components, coordinates different components in terms of biological laws through mathematical equations, and maps specific genes that mediate each component and its connection with other components [15]. As a bottom-top model, a systems approach can identify specific QTLs that govern the developmental interactions between various components, giving rise to the function and behavior of the system. By estimating and testing mathematical parameters that specify the system, systems mapping allows us to predict the critical genetic mechanisms governing alterations in the physiological status of a phenotype.

Sampling strategy
Suppose there is a natural population drawn at random. From a genetic perspective, we assume the original population follows Hardy-Weinberg equilibrium. The n sampled subjects are genotyped for a set of molecular markers, such as single nucleotide polymorphisms (SNPs). Assume that there exists a particular gene that controls the dynamics of cell interactions and growth. This gene has three alleles; namely, Q, q, and q + , whose frequencies are q 1 , q 2, and 1-q 1q 2 in the population, and q + represents the methylated state of q. Let μ j denote the genotypic value of size for x 1 and x 2 with j = 0 to 5 for QQ, Qq, Qq + , qq, qq + , and q + q + , respectively. The actual gene for cell dynamics, called the quantitative trait locus (QTL), cannot be observed directly, but it can be inferred from associated markers. Consider a SNP with alleles M vs. m, with respective frequencies p 1 vs. 1-p 1 in the population. The SNP and QTL form six haplotypes; namely, MQ, Mq, Mq + , mQ, mq and mq + , with respective frequencies P MQ = P M P Q + D 1 ,P Mq = P M P q +D 2, P Mq þ ¼ P M P q À D 1 À D 2, P mQ = P m P Q -D 1, P mq = P m P q -D 2, and P mq þ ¼ P m P q þ þ D 1 þ D 2 in the population, where D 1 and D 2 are a linkage disequilibrium between the marker and QTL.
In the parental progeny of the population, the haplotypes unite randomly to generate 21 diplotypes and 18 genotypes, whose frequencies can be expressed in terms of haplotype frequencies. We express the genotype frequencies in matrix notation, with rows representing marker genotypes and columns representing QTL genotypes (Table 1). Thus, the conditional probabilities of QTL genotypes, conditional upon marker genotypes, can be derived according to Bayes' theorem.
Likelihood. Systems mapping embeds a system of differential equations into a genetic mapping setting constructed using a segregating population. Genetic mapping uses a mixture model-based likelihood to estimate genotype-specific parameters by assuming j QTL genotypes. For a progenitor cell population i, we obtain the size, x 1 and x 2 , for the two following cell subpopulations at a finite set of times, 1, . . ., T. A linear model for describing the phenotypic values of population i controlled by a putative gene is expressed as: where ξ i is an indicator variable defined as 1 if this population belongs to a specific genotype and 0 otherwise; time t, m x 1 j ðtÞ, and m x 2 j ðtÞ are the genotypic values of type k for the cell size at time t, respectively; and e x 1 i and e x 2 i are the residual errors of subject i for cell x 1 and x 2 at time t, respectively. For the mapping population of n members with marker information (M) and phenotypic data (Y i ) for the two subpopulations, we formulate the likelihood as: whereY i = (x 1i(1) ,. . .,x 1i(T), x 2i(1) ,. . .,X 2i(T) ) represents phenotypic data for the subpopulation weights, π j|i is the mixture proportion representing the conditional probability of QTL genotype k given the marker genotype of progenitor cell population i; and f j (Y i ; Θ j , C) is a multivariate normal distribution with an expected mean vector, m j ¼ ðm x 1jð1Þ ; . . .; m x 1jðTÞ ; m x 2jð1Þ ; . . .; m x 2jðTÞ Þ, for population i that belongs to QTL genotype j, and covariance matrix with S 1 and S 2 being the time-dependent covariance matrix for x 1 and x 2 , respectively, and S 1x2 = S 2×1 T being the time-dependent covariance matrix between these two variables. Modeling mean vectors. The biological merit of systems mapping is to model genotypespecific differences in the time-dependent mean vector using a biologically meaningful mathematical equation [9]. When modeling the dynamic behavior of progenitor cell transition, Chang et al. [6] developed a set of ordinary differential equations (ODEs) as shown in (1) and (2). The overall behavior of progenitor cell heterogeneity can be described and quantified by ODE solution's parameters (r, k 1 , k 2 ) for the linear model and parameters (r, k 1 , k 2 , k 3 , k 4 ) for the nonlinear model. Differences in any one or more of these parameters will influence cell transition dynamics.
For a particular QTL genotype j, the dynamic behavior of ODE can be specified by a set of parameters Θ uj = (r j ,k 1j ,k 2j ) (for the linear differential equation) or Θ uj = (r j ,k 1j ,k 2j, K 3j, k 4j ) (for nonlinear differential equations). In other words, by changing any one or more parameters in Θ uj , the trajectory of cell transition dynamics may change. Thus, by incorporating ODE into functional mapping, a statistical framework originally derived to estimate QTL genotype-specific ODE parameters, Θ uj , can be used to assess the genetic effects of QTLs on progenitor cell transition by comparing the genotypic differences of the parameters.
Modeling covariance structure. The statistical power of systems mapping is partly the result of structured modeling of the covariance matrix. There are many approaches available to model the covariance structure, including autoregressive [9], antedependent [16], autoregressive moving average [17], nonparametric [18], and semiparametric [19]. These approaches have their own advantages and disadvantages in terms of efficiency, flexibility, and parsimony. Of these approaches, the antedependent model shows a good balance between these properties, and Zhao et al. [20] extended it to model the covariance structure of multiple variables. In this article, we use a bivariate antedependence approach for modeling the structure of the covariance matrix (5).
The residual errors of the linear model (3) at time t depend on the previous residuals with the degree of dependence decaying with time lag. If the current residuals only depend on their first preceding ones, the model is called a first-order antedependence model [SAD (1)]. The SAD(1) model specifies time-dependent residual errors for population i as: where ϕ 1 (or φ 2 ) and φ 1 (or ϕ 2 ) are the unrestricted antedependence parameters induced by variable 1 (or 2) itself and by the other variable 2 (or 1), respectively; ε 1 i ðtÞ and ε 2 i ðtÞ are the ''innovation" errors assumed to be bivariate-normally distributed with mean zero and variance matrix: For simplicity, we assume that S ε (t) is time-independent; i.e., ν 1 , ν 2 , and ρ are not dependent on time. Based on the above assumptions, S ε becomes a block matrix with the diagonal repeating the following unit T times, All parameters modeling the covariance matrix S ε are arrayed in vector C e = (ϕ 1 ,φ 2 ,φ 1 ,ϕ 2 , υ 1 ,υ 2 ,ρ). With model (6), closed forms for the determinant and inverse of the structured matrix, which enhances computing efficiency, were derived [20].
Computational algorithms. Three algorithms were integrated to estimate marker-QTL haplotype frequencies using the EM algorithm; the genotype-specific ODE parameters for progenitor cell transition dynamics by the fourth-order Runge-Kutta algorithm [21,22] and the SAD (1) parameters for the covariance structure using simplex algorithm. We derived a closedform solution for the EM algorithm to estimate haplotype frequencies.
In the E step, we calculated the posterior probability with which population i carries QTL mating type genotype j m using: In the M step, the calculated posterior probability was used to estimate marker-QTL haplotype frequencies by: In each iteration of the EM step, we estimated ODE parameters, Y u j ¼ ðr j ; k 1j ; k 2j Þ(for the linear differential equation) or Y u j ¼ ðr j ; k 1j ; k 2j ; k 3j ; k 4j Þ (for nonlinear differential equations) using the Runge-Kutta algorithm, and covariance-structuring parameters, and C e = (ϕ 1 ,φ 2 ,φ 1 , ϕ 2 ,υ 1 ,υ 2 ,ρ), using the simplex algorithm. The Runge-Kutta and simplex algorithms are integrated within the EM framework to generate each iterative procedure, which is repeated until the parameters converge to stable values.

Hypothesis tests
Whether there are specific QTLs for progenitor cell transition dynamics can be tested by formulating the following hypotheses: H 1 : At least one of the equalities above does not hold; where the H 0 corresponds to the reduced model, in which only one single curve of transition dynamics exists; and H 1 corresponds to the full model, in which there exists six such different curves to fit the data. We calculate this hypothesis to reduce the full model using the log-likelihood ratio (LR). The critical threshold is determined empirically from permutation tests [23]. Two variables, x 1 and x 2 , may or may not be controlled by the same gene. The pleiotropic control of the gene over these two variables can be tested by formulating the null hypotheses: H 0 :r jm r and k ijm k i j = 5,4,3,2, 1, 0; m = 2, 1, 0; i = 1,2 for the linear model and 1,2,3,4 for the nonlinear model H 1 : At least one of the equalities above does not hold ð12Þ If the null hypothesis is rejected, this indicates that the gene detected exerts a pleiotropic effect on progenitor cell heterogeneity.

Results
We performed simulation experiments to examine the statistical properties of the model built for genetic mapping of the two subpopulations of hematopoietic progenitor cell transitions. Fu et al. provided much more detailed information regarding the test and validation of systems mapping [24]. The transition process was described by linear and nonlinear ODEs, and the simulation assumed different heritabilities (0.05, 0.1 and 0.2) under different human population sizes (200 and 400) at Hardy-Weinberg equilibrium, considering a SNP marker that is associated with a putative QTL(with six genotypes QQ, Qq, Qq + , qq, qq + , and q + q + ) through linkage disequilibrium (LD). The transition parameters Y u j = (r j , k 1j , k 2j ) and Y u j = (r j , k 1j , k 2j , k 3j , k 4j ) for the six genotypes were determined in the ranges of empirical estimates of these parameters. Note that for computational simplicity, r, k 1 , k 2 , k 3 and k 4 are provided as shown in Tables 2, 3, 4 and 5. Based on allele frequencies of the marker, QTL, and their LD, joint marker and phenotypic data were simulated. The genetic parameters (P, Q 1 , Q 2 , D 1 , and D 2 ) of the QTL can be estimated with high precision using the EM algorithm. The genotype-specific mean vectors were modeled based on ODE's solution (3) where the covariance matrix was structured by the first-order antedependence model [SAD(1)] with correlation and variance parameters, C e = (ϕ 1 ,φ 2 ,φ 1 ,ϕ 2 ,υ 1 ,υ 2 ,ρ), for w 1 and w 2 .
The simulation results of cell transition parameters, genetic parameters, and covariancestructural parameters with the linear and nonlinear model are shown in Tables 2 and 3 and Tables 4 and 5, respectively. The cell transition parameters can be estimated with high precision     Table 4.   in both linear and nonlinear models. The precision of estimation of marker allele frequency is not affected by differences in heritability, but estimates of transition parameters, QTL allele frequency, and marker-QTL linkage disequilibrium are more precise for a higher than a lower heritability and for a bigger than a smaller population. The covariance-structural parameters can also be estimated, partly because of their simple structure for covariance modeling. The six QTL genotypes QQ, Qq, Qq + , qq, qq + , and q + q + are each hypothesized to have different response curves for different human populations. Figs 3 and 4 illustrate different forms of the cell population transition for six QTL genotypes, QQ, Qq, Qq + , qq, qq + , and q + q + , under different heritabilities (0.05, 0.1 and 0.2) with size 400 populations, with the transitional values given in Table 3 for the linear model and Table 5 for the nonlinear model. Pronounced differences among the genotypes suggest that the QTL may affect cell transitions, resulting in different cellular phenotypes. Meanwhile, we considered the methylated QTL status (Qq + , qq + , and q + q + ), which could illustrate phenotypic differences among cells with the same DNA sequences. The cell transition values can be estimated from the model. The model displays great power in detecting a QTL responsible for cell transitions using the associated marker. The trajectories of additive and dominant effects of subpopulations x 1 and x 2 are shown as Figs 5 and 6 for the linear model and nonlinear model, respectively. We could observe the genetic architecture of the transition dynamics of the two subpopulation cells.

Discussion
As more investigators use stem/progenitor cells to study the mechanisms of pluripotency and differentiation, they pay greater attention to cell heterogeneity and variability, which affects the use of pluripotent cells in regenerative medicine, disease modeling, and studying development processes. New methodologies, including single-cell and single-molecule analysis, as well as mathematical and computational modeling have been developed to explore pluripotent cell population heterogeneity. Emerging evidence has found differences in gene expression profiles at the mRNA level and functional differences in the differentiation within a heterogeneous cell population. However, the genetic status such as genetic backgrounds and epigenetic profiles including methylation of CpG islands and histone modifications remain unclear. Previous reports by Chang et al. [6] showed that transcriptome noise leads to clonal heterogeneity and controls lineage choices, but the genetic mechanisms of the heterogeneity and differentiation potential were not described.
In this article, we developed a statistical model that combined the mathematical concepts regarding cellular mechanisms and a general framework for mapping dynamic traits [15]. Both linear and nonlinear ODEs use specific parameters to describe how cells transit from one state to another during cell culture and test the magnitude and patterns of genetic effects in the process. The estimates of response curves are more precise for the linear model (Fig 3). However, the nonlinear model includes more parameters and information regarding complex cell proliferation and transition processes, so it may be more close to the true conditions. This model first considers methylated DNA as an allele in this "heterogeneity" based question, which indicates that different methylation states affect the phenotypes of the cells with the same genotype. Therefore, the model provides a tool to generate biologically meaningful hypotheses for understanding the genetic control of population cell heterogeneity.
The model proposed in this article considered only six genotypes; namely, QQ, Qq, Qq + , qq, qq + , and q + q + , but biologically there should be more methylated states of the genotypes such as Q + Q + , Q + q, and Q + q + . The allele frequencies of the marker, QTL and their LD, joint marker and phenotypic data, and the statistical model will be more complex in this situation. In addition, further cellular and molecular experiments are required to confirm the exact DNA Estimated and true curves of systems mapping for the linear progenitor cell transition dynamics model. The curves show a putative QTL having six genotypes, QQ, Qq, Qq + , qq, qq + , and q + q + , as indicated by colors in a natural population of 400 assuming heritability H 2 = 0.05 (a), H 2 = 0.1 (b), and H 2 = 0.2 (c), respectively. The broad consistency between the estimated (solid) and true curves (broken) suggests that the model provides a good estimate of the dynamic system.  Estimated and true curves of systems mapping for the nonlinear progenitor cell transition dynamics model. The curves show a putative QTL having six genotypes, QQ, Qq, Qq + , qq, qq + , and q + q + , as indicated by colors in a natural population of 400 assuming heritability H 2 = 0.05 (a), H 2 = 0.1 (b), and H 2 = 0.2 (c), respectively. The broad consistency between the estimated (solid) and true curves (broken) suggests that the model provides a good estimate of the dynamic system. modification states of these cells, and a model incorporating multiple QTL and their interactive networks should be derived. However, the model will be useful in elucidating the genetic architecture, especially epigenetic mechanisms of cell heterogeneity, and could increase our understanding of the driving forces behind cell heterogeneity and transitions.