An Observation-Driven Agent-Based Modeling and Analysis Framework for C. elegans Embryogenesis

With cutting-edge live microscopy and image analysis, biologists can now systematically track individual cells in complex tissues and quantify cellular behavior over extended time windows. Computational approaches that utilize the systematic and quantitative data are needed to understand how cells interact in vivo to give rise to the different cell types and 3D morphology of tissues. An agent-based, minimum descriptive modeling and analysis framework is presented in this paper to study C. elegans embryogenesis. The framework is designed to incorporate the large amounts of experimental observations on cellular behavior and reserve data structures/interfaces that allow regulatory mechanisms to be added as more insights are gained. Observed cellular behaviors are organized into lineage identity, timing and direction of cell division, and path of cell movement. The framework also includes global parameters such as the eggshell and a clock. Division and movement behaviors are driven by statistical models of the observations. Data structures/interfaces are reserved for gene list, cell-cell interaction, cell fate and landscape, and other global parameters until the descriptive model is replaced by a regulatory mechanism. This approach provides a framework to handle the ongoing experiments of single-cell analysis of complex tissues where mechanistic insights lag data collection and need to be validated on complex observations.


Introduction
Recent breakthroughs in light microscopy have opened new doors to study complex tissues in vivo at single-cell resolution. With genetically encoded fluorophores, 3D time-lapse imaging has provided unprecedented temporal and spatial resolution to observe cellular dynamics in diverse organisms. The recordings often contain hundreds to thousands of cells over

Methodology
In recent years, many platforms for agent-based modeling have emerged to achieve specific goals, such as NetLogo [20], FLAME [21], MASON [22], Repast [23], and others. NetLogo is used in this study because NetLogo is straightforward in implementing the agent-based modeling concept for cell development. With its own syntax and grammar, as well as a friendly graphical interface, NetLogo is very easy to use, meaning that we can setup, run, and observe models without writing codes with complicated programming languages. A number of builtin models related to biology can greatly help our current model. Moreover, NetLogo supports 3D modeling, which is appropriate for simulating morphogenesis of tissues. Also, the built-in visualization provides useful tools for model verification. NetLogo's main problem is its lack of speed [24], but this is not an issue at current scope of the project. Therefore, NetLogo is an appropriate choice of platform. The ODD (overview, design concepts, and details) protocol [25] is used to describe our model below.

Overview
The following sub-sections present a brief description of the framework, including the purpose of this simulation, key variables used for constructing the model, and the overall framework. Details are then provided in the "Detail" sub-section.
Purpose. The long-term goal of this study is to provide a general platform to incorporate experimental data at the molecular and cellular levels to examine emergent properties of C. elegans embryogenesis. The purpose of the current study is to provide a basic implementation that can be extended to incorporate complex experimental data. In this regard, the eggshell, body axes and the lineage (mother-daughter relationship) are implemented to provide the foundation to describe embryogenesis. The implementation of cellular behaviors is limited to cell division and cell migration, and each behavior is controlled by a statistical model (mean and standard deviation) instead of through gene networks. The goal is to provide timely validation of the overall design and implementation. Different tools are also implemented to plot and visualize the results of simulation and examine the key temporal and spatial features of the simulated embryo.
State variables and scales. In order to simulate cell behaviors during development, two kinds of parameters are defined in advance: (1) attributes of the agent (cells in the framework); and (2) environmental parameters, such as spatial and temporal scales. The variables used in each category and a brief description of each variable are provided in Tables 1 and 2, respectively.
Process overview and scheduling. Our ABM framework for C. elegans embryogenesis contains agents corresponding to the exact number of cells from the 4-cell to 350-cell stage, as well as the mother-daughter relationships between cells known as the cell lineage. The current implementation includes three behaviors for each cell: cell-cycle length, the direction in which the cell divides, and its migration paths. Each of these behaviors is modeled based on statistical measurements in the wild-type embryo [7]. The model also includes an ellipsoidal eggshell and three body axes of the embryo.
On a higher level, our model can be divided into two processes (Fig 1): (1) the "setup" process and (2) the "execution" process (including a "movement" activity and a "division" activity). The setup process establishes the environment in which cells live and initializes a cell's state such as nuclear size and coordinates. This will be further described in the "Initialization" section.
The "execution" phase is further divided into two biological stages, namely the interphase stages and mitosis for each cell cycle [26]. During the interphase, a cell moves in the ellipsoidal eggshell. The movement is controlled by several conditions of the environment and attributes of the cell itself. During mitosis, a cell divides. Two new cells are generated and initialized, replacing the old parent cell. The timing and direction of the division are controlled by attributes of the cell (see Table 1). As time goes by, the model continuously checks whether a cell is in the ellipsoidal eggshell, whether it is at the movement stage or enters the division stage, and implements the right actions. Detailed descriptions of each activity are provided in the "Submodels" sub-section.

Design concepts
A number of general concepts in ABM apply to this framework for C. elegans embryogenesis.
The fundamental goal of this agent-based C. elegans modeling framework includes understanding how cells become the appropriate cell type and find their appropriate positions through cell-cell interactions. The objectives of a cell include becoming the appropriate type and moving to the appropriate position based on the type and position of its neighbors. In the current implementation, a cell's type is modeled as its lineal identity. In wild-type embryogenesis, a cell's type equates with its lineal identity. At each time point, a cell adapts the direction of its movement based on its current position relative to the destination, moves to the next location, and continuously checks the division timing with appropriate stochasticity built into it. The model is designed to regularly output the agent's status in standard data format. Such information is further used to validate the model by comparing it with observational data.

Details
Initialization. The overall initiation of the simulation is achieved through a "setup" button in NetLogo. It triggers the import of various inputs that are derived from observational Hold the X, Y and Z coordinates of the next place that the cell will move to.
cellTargetX, cellTargetY, cellTargetZ Long term target of cells. Hold the X, Y and Z coordinates of the destination that the cell will move to.
data (see the "Input and output" section) and sets the timer. Most specifically, four initial cells are hatched by the following procedures: 1. Set the name, size and original location (the X, Y, Z coordinates) for each cell.
2. Arrange a new color for every current cell in order to distinguish their later generations from other initial parents and their offspring.
3. Import divTime and the standard deviation.
6. If there exist other cell(s) to be hatched, go to step 1.

Input and output.
To achieve the goals of the cell development simulation, three groups of data are introduced at the current stage. The "timeTable" holds an index of all cells by their names, time points and durations of division, along with their standard deviations, respectively. A "dirTable" is introduced for keeping names of cells and their two daughters. Also, there are three other parameters recording the movement directions of the newly hatched cells. Finally, a "positionTable" is always necessary for maintaining every cell's destination target.
In Table 3, a brief description and example of each parameter are given.

Fig 1. High level representation of framework.
Two processes in the the framework, namely "setup" and "execution". The "execution" process is further divided into a "movement" and a "division" activity. For the output, to establish a standard form for not only the current simulation, but further research as well, the data is organized as a format including 20 fields. In such format, motherdaughter relationship is coded in the first four columns: Column 1 represents an arbitrary index for a cell, specific to the time point. Column 2 is the index of the corresponding cell at the previous time point, either the cell itself or its mother, and −1 means NULL. Columns 3 and 4 are introduced as indices of the corresponding cell(s) at the next time point, either the cell itself or its two daughters. When the cell is not dividing, column four is set to −1. Columns 5 to 10 provide the x, y, z coordinates, diameter and name of a cell, respectively. Columns 11 to 20 are unused for current simulation, but will be used for further research. They are set to 0 except Column 15 is left blank.
An example of output data format is given in Table 4 for intuitive understanding. The parameter "outTimeResolution" (see Table 2) functions as controlling the output interval of the standard format data. In our model, the default value is 60 seconds.
Sub-models. As previously mentioned in the section "Process overview and scheduling" section, behaviors of the agents are divided into two activities: movement and division. Each is described in detail below.
In the "movement" process, a cell moves to its target destination, where the next division takes place. To achieve this goal, two kinds of calculation of the cell's location are implemented. (1) When a cell is just divided into two single ones, the primary thing is to split the two cells completely, without any overlap. In our model, we set the variable divCycleTime to 90 seconds for this process. (2) After that, the cell's next location is calculated based on the differences between current time and cell's division time (divTime), current location and its   (1) and (2) are introduced to achieve such calculations Finally, we check whether each cell is still in the eggshell. The whole process of "movement" activity is shown in Fig 2. cellMove ¼ ðcor þ ððdiv Â cellSize=2Þ=ðdivCycleTimeÞÞÞ; where cellMove = [cellMoveX, cellMoveY, cellMoveZ], cor = [xcor, ycor, zcor] represents the current coordinates of a cell, and div = [divX, divY, divZ].
where cellTarget = [cellTargetX, cellTargetY, cellTargetY] and ticks represents the current time. When it is time for division, information on the cell to be divided is temporarily saved for further potential usage. After that, two new daughter cells are generated, replacing the old parent one. In the current simulation, new cells are named by finding out their parents' names in dirTable, and then obtaining the corresponding daughters' names (see Table 3). Also, division directions are extracted from dirTable and transmitted into the cells' attributes dirX, dirY and dirZ, which are used for the direction setting. DivTick, representing the current status of a cell, should be reset and goes into the next cycle. This operation helps to determine whether a cell is dividing, that is, separating into two halves, or has divided, and then to choose a corresponding method (see Eqs (1) and (2)) to calculate its next location in the "movement" activity. When this is done, information on division time point and duration from the timeTable is extracted, and a new divTime for cells are recalculated by adding random disturbances based on their standard deviations. Finally, destination target coordinates for the cell are obtained by indexing its name in positionTable. The whole process is shown in Fig 3.

Model validation strategies
The long-term objectives and potential research implications of this project aim to allow biologists to generate testable hypotheses about cell development. The short term objectives are complete verification, to confirm that the framework is properly built.
We use a MATLAB application to verify the division cycle movement, by comparing the standard format output simulation data with the observational data. The MATLAB application used is named Dev-scape, which is an open source software that was first presented in [7]. The software package provides a comprehensive analysis of a cell's differentiation, proliferation and morphogenesis at single-cell resolution. Input for Dev-scape is exactly the standard format output file of NetLogo. All the text or graphic information for validation, such as cell cycle, division time, and others, will be outputted for further analysis. Besides Dev-scape, we also use Java for the visualization of cell lineage tree, and POV-Ray [27] to analysis the cell migration process. Inputs for the above two validations are also obtained from Dev-scape.
Specifically, two main points of verification are introduced for the current stage of simulation, division cycle movement, and cell migration.

Computational experiments
Quantifying division timing is relatively simple, since every sixty seconds, the program outputs standard format data, from which the number of cells at current time point can be counted. A curve that presents the number of cells over time is created as we continuously track this value. Theoretically, the number of cells is rigorously limited by the division time and its standard deviation in the framework. As a result of this assertion, simulation results outputted from simulation should not show a large difference as compared with observational data.
Number of cells in embryo overtime is properly modeled. From Fig 4 we can see that the two curves, representing the number of cells in the embryo from simulation and observation, respectively, largely overlap with each other, except that a few intervals have some slight deviation. This makes sense since for better recurring realistic situations, we add a random difference to the division time of each cell based on their statistical standard deviations.
Systematic and quantitative analysis of the result is implemented by utilizing root-meansquare error (RMSE) to measure the deviation between simulation and observational data (Eq (3)).
where n is the number of measurements, and d i is the deviation between simulation and observational data. The number of cells in an embryo over time is sampled every 5 minutes from both simulation result and observational data. If data is missing at the minute multiple of 5, the nearest minute with data for both cases is chosen as the sampling point. Because of the frequent divisions of cells during some time intervals (especially 100 to 140 min), and the limited output resolution (60 s in this case), when the above two sampling rules cannot be met, we select the point with minimum time difference between simulation result and observation data instead of an approximate estimation (see Table 5).
From the result we find that the deviations increase when the number of cells increases rapidly, especially the interval between 100 and 140 minutes. For the other time intervals, the deviations are relatively small. With Eq (3), we calculate that the RMSE σ for this case is 2.491. The immediate follow-up efforts are to (1) optimize the "movement" and "division" function to reduce the RMSE σ, and (2) substitute statistical models with regulatory mechanisms to approach the RMSE we obtain here.
Simulation generated cell lineage tree and cell cycle length of founder cells during C. elegans early embryogenesis show consistencies with those from wild-type embryo. A cell lineage tree drawn from simulation data is also provided (see Fig 5). The shape of this lineage tree intuitively describes every cell's division timing, as well as different paces of cell divisions in their sublineages. We compare the simulation result with the cell lineage tree of wild-type embryos [28], and the large extent of similarity validates that division timing of each cell in simulation conforms to the situation of wild-type embryos. Also, consistency in the shape of both trees,where two sister cells show different paces of cell divisions in their sublineages, validates the correctness of body axes and cell naming.  Twelve founder cells in early embryogenesis are selected for the analysis of cell cycle length. We collect the born and divided times for each founder cell during simulation and compare the result with observational data (see Table 6). Cell cycle standard deviation is calculated based on born time and divided time standard deviations with Eq (4).
where σ c , σ b and σ d are the cell cycle, born time and divided time starnard deviations, respectively. We find that most of the deviations of cell cycle length (11 of 12) between simulation result and observational data are less than 60% of the corresponding standard deviations. Moreover, the RMSE σ for cell cycle length is 52.877.
Cell migration patterns and distribution of positions. Fig 6 demonstrates the cell migration of simulation and observational data, respectively, in which different types of cells are color coded. A 3D view of cell migration dynamics is provided during AB64 (when AB divides to 64 cells) stage. We cannot expect the two figures are completely the same, since the observational result is an average movement from large amounts of data, and on the other hand, random deviations (see Table 3 and the "Sub-models" section) are also added into the simulation model. The conclusion that trends of cell movement in simulation largely follows the situation in observation illustrates that the experimental result of cell displacement patterns looks good.
In Fig 7, we also visualize distributions of cell positions at AB32 and AB64 stage, respectively. Ellipsoids are drawn relying on a large amounts of data, centered at the average position of cell positions at the certain stage with radii of one standard deviation on each direction. The single point in each figure represents one certain result of cell position from simulation and observation. By comparing the corresponding figures in each stage for a number times of experiments, we observe that cells always locate near the spots where they are most likely to appear (in the ellipsoid), which is reasonable, even though there are slight differences between corresponding cells. Moreover, we examine whether the cumulative effect of positional deviations will have an influence on the results, and it turns out that errors produced at earlier stages are controlled in a proper range as they accumulate towards later stages. Finally, we compare the positions of twelve founder cells at the time when they divide in simulation and observational cases (see Table 7). Deviations in each direction are produced owning the random errors added to the division times of all the cells. Compared with the standard deviations of observational data on each direction (data not shown here), the deviations between simulation result and observational data are extremely small. The RMSE on each direction σ x , σ y and σ z is 0.0059, 0.0067, and 0.0033, respectively.

Conclusion and future work
In this paper, an agent-based C. elegans model has been developed to incorporate observational data on cellular behavior, such as cell lineage, fate, division time and direction, and movement path. Observation-driven data structures and flowcharts have been implemented in the framework to establish a statistic model, and regulatory mechanisms are allowed to be incorporated when more insights are discovered. System-level results have been generated, analyzed and validated in a simple and clean ways to gain potential possibilities to guide further C. elegans embryogenesis research. The software system is available through a public code repository at https://bitbucket.org/abm_utk/.
Our model considers fundamental aspects of C.elegans embryogenesis a whole process, rather than in separate modules [29]. This approach/design provides a framework to deal with experiments of complex tissues at a single-cell resolution under the circumstance that mechanistic rules lag data collection and remain to be substantiated on complex observations, which is of general interest.
Although we have built a basic descriptive agent-based model, and acquired a number of meaningful simulation results, a comprehensive framework including other main aspects during C. elegans embryogenesis is needed for further research and experimentation. Therefore, we design and reserve the following necessary interfaces, which can be considered as our immediate follow-up efforts.

Gene list
Each agent should hold a list that contains genes related to the current model, with their genetic states and active states. Genetic information, as the basis of all agent (cell in our case) behaviors, is indispensable in agent-based C. elegans embryogenesis framework. Such kinds of structures built in the agents is also a prerequisite for further investigation of cell movement, division, polarity, interaction, fate, and landscape.

Polarity and cell interaction
Polarity and cell interaction are the two most important ways to change genetic and active states of genes in cells. Such processes play significant roles during cell differentiation and Agent-Based Modeling for C. elegans Embryogenesis control cell fate differentiation, and further impact other downstream cell activities, such as movement and division. Therefore, we maintain interfaces for these two processes, in which cells continuously seek the changes produced in themselves, from other cells, or the surrounding environment. When regulatory mechanisms that can lead to genetic change are targeted, states of the corresponding genes in certain target cells will change. The domino effect of the agent-based changes of gene states in huge network will provide us with much more interesting results than ever before.

Cell fate and developmental landscape
As mentioned earlier, cell fate and developmental landscape [2] play important roles for research on developmental problems. The classic definition of cell fate is implemented by obtaining gene expression information in each cell. However, the limitation of such an approach is that it is quite difficult to acquire such a large amounts of data. Therefore, we reserve a hybrid definition of cell fate, in which the gene expression information in the cells themselves is not only utilized, but in their descendants is gathered as well. This allows us to maximize the potential to find more meaningful results by utilizing known knowledge obtained from regulatory mechanisms and experimentations.

Global parameters
Factors that are not related to agent (cell) can be considered global parameters. It is obvious that such kind of factors will have a influence on biological processes. Interfaces reserved for this purpose will help us investigate the changes resulting from non agent-related factors: whether changes of temperature and humidity will result in different cell behaviors or fate changes during embryogenesis.