Data analysis and modeling pipelines for controlled networked social science experiments

There is great interest in networked social science experiments for understanding human behavior at scale. Significant effort is required to perform data analytics on experimental outputs and to build computational models of custom experiments. Moreover, experiments and modeling are often performed in a cycle, enabling iterative experimental refinement and data modeling to uncover interesting insights and to generate or refute hypotheses about social behaviors. The current practice for social analysts is to develop tailor-made computer programs and analytical scripts for experiments and modeling. This often leads to inefficiencies and duplication of effort. In this work, we propose a pipeline framework that takes a significant step toward overcoming these challenges. Our contribution is to describe the design and implementation of a software system that automates many of the steps involved in analyzing social science experimental data, building models to capture the behavior of human subjects, and providing data to test hypotheses. The proposed pipeline framework consists of formal models, formal algorithms, and theoretical models as the basis for the design and implementation. We propose a formal data model such that, if an experiment can be described in terms of this model, then our pipeline software can be used to analyze its data efficiently. The merits of the proposed pipeline framework are demonstrated through several case studies of networked social science experiments.


Background and motivation
Online controlled networked temporal social science experiments (henceforth referred to as NESS experiments or experimental loop) are widely used to study social behaviors [1][2][3][4][5][6] and group phenomena such as collective identity [6,7], coordination [8], and diffusion and contagion [3,6,9]. There are several distinguishing features of NESS experiments. First, experiments and analyses are performed in a loop. Second, experiment subjects or participants interact through prescribed communication channels, where the players and interactions can be represented as nodes and edges, respectively, of networks. Third, experiments are carried out until a specified condition is met or for a particular amount of time (as opposed to one-shot games). (Sometimes the term game is used in this work as a substitute for experiment because some experiments can be viewed as games, in the sense that human subjects are working to achieve some goal. However, we are not addressing gaming in this work.) Besides carrying out NESS experiments, data analytics on experimental data and computational modeling of experiments are also very important. Analytics are required to interpret experimental results, and modeling is useful in reasoning about and extending results from experiments [10,11]. Combining experiments with modeling, in a repeated, iterative process, enables each to inform and guide the other [12][13][14]. This approach has been undertaken in several studies without automation [15][16][17] or purely conceptually [18]. Reference [18] takes a combined experiment/modeling approach by defining a framework for conceptual modeling for simulation-based serious gaming. Often, there is emphasis on one or the other (experiments or modeling) with no experiment-and-modeling iterations. That is, experiments are emphasized and there are no iterations [9], or modeling is emphasized and there are no iterations [19][20][21].
The simple idea of iterative experiments and modeling can be operationalized in various ways, including deductive and abductive analyses. In deduction, models are first developed, and predictions from them are then compared to subsequently-generated experimental data, in order to validate the models. In abductive looping, experiments are performed first, patterns are searched for in the experimental data, and this information is used to construct and modify models. Detailed abductive looping examples for the study of collective identity in the social sciences are provided in [7,22]. Fig 1 provides one representation of the steps in abductive looping. Experiments are conducted; raw data are transformed into a common format (e.g., cleaned) for processing. Then experimental data are analyzed in different ways to understand player actions, identify patterns, and evaluate hypotheses. Models are developed based on these data, and model properties are inferred from the data. Models are executed and validated, and modeling results are compared against experimental data. Predictions may be made to explore counterfactuals. These latter results and the existing experimental data are used to determine conditions for the next experiments, if any, and the loop may repeat. See [7,23,24] for further discussion of abduction. We note that the steps in deduction are essentially the same, but the sequencing of experiments and modeling is reversed.
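The abductive looping steps above can be sketched as a simple control loop. The sketch below is a minimal illustration only: every function body is a hypothetical stand-in (a toy "experiment" of random draws, a sample-mean "model," and an arbitrary validation threshold), not part of the system described in this paper.

```python
import random

# Toy sketch of abductive looping (Fig 1). All stages are hypothetical
# stand-ins: real stages would run experiments, clean data, mine
# patterns, and fit/validate behavioral models.
def run_experiment(conditions):
    random.seed(conditions["seed"])            # conduct experiment
    return [random.random() for _ in range(100)]

def to_common_format(raw):
    return sorted(raw)                         # transform/clean raw data

def fit_model(data):
    return sum(data) / len(data)               # toy "model": sample mean

def validate(model, data):
    return abs(model - 0.5) < 0.25             # toy validation criterion

def abductive_loop(seed, max_iters=3):
    conditions = {"seed": seed}
    model = None
    for i in range(max_iters):
        data = to_common_format(run_experiment(conditions))
        model = fit_model(data)                # patterns -> model
        if validate(model, data):              # compare model vs. data
            break                              # model accepted
        conditions = {"seed": seed + i + 1}    # conditions for next experiment
    return model
```

Calling `abductive_loop(0)` runs experiment, transformation, model fitting, and validation in sequence, repeating with new conditions until the model is accepted or the iteration budget is spent.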
In this work, our focus is automating many steps in NESS experiments. Automating these steps can lead not only to improved productivity, but also to improved scalability and reproducibility. (This has been the case in our research group.) NESS experiments require several classes of operations: (1) experimental design; (2) experiment execution and data collection; (3) data fusion and integration; (4) experimental data analysis; (5) model design, construction, and verification; (6) model parameter inference; (7) exercising models (e.g., simulations for agent-based modeling approaches); (8) comparisons of experimental data against model output; (9) model executions beyond the ranges of experimental data (e.g., to explore counterfactuals); and (10) iteration on these steps.
However, current practice often entails producing custom programs and analytical scripts that pertain to the experiments and modeling. Our lab has found that this often leads to inefficiencies and duplication of effort. We propose a pipeline framework that automates many of the steps involved in analyzing social science experimental data, building models to capture the behavior of human subjects, and providing data to test hypotheses. The proposed pipeline framework is based on formal models, formal algorithms, and theoretical models. We also provide a data model such that if an experiment can be formally described in terms of this data model, then data from the experiment can be analyzed with our system. While there are software systems that address some of these operations [25,26], they do not take the semantics of social experiments into account and largely focus on providing a generic data schema. It is important to note that our software system, presented in this work, is agnostic to deductive or abductive methodologies because our pipelines (described below) are composable. This composability also enables abduction using an experiment-only approach by removing the modeling activities in Fig 1.

Technical challenges of building software systems to analyze social science experiments
To realize an automated and extensible software system for NESS experiments, there are two major groups of technical challenges: those pertaining to pipelines in general, and those specific to the social sciences. Addressing the first group, abstractions that capture data analytics and computation are important [27]. High-level abstractions render a system more understandable and reusable [28]. General challenges include identifying appropriate levels of abstraction for tasks, pipelines, and systems. The problems of abstraction are important for automation, traceability, reproducibility, interoperability, composability, extensibility, and scalability [29]. Formal models help solve these abstraction problems [30].
In the case of the NESS system, there are three unique challenges to address. The first is specific to the features of NESS experiments. NESS experiments are often multi-phased, multi-subject, and multi-action, and hence are sophisticated. Each subject can take repeated actions from a set of action types, at any time and in any order. Interactions among subjects change the environment of a subject because they share resources. This is a far more complicated setup than many types of social science experiments, such as one-shot games, experiments with a single type of action, and individualized experiments. Such experiments require more sophisticated software. Second, a greater range in modeling functionality is required, even for one class of problems. This is because a "model" in the social sciences is often a qualitative textual description that is open to different interpretations due to lack of detail and due to uncertainty (e.g., in human behavior). Consequently, multiple interpretations of a textual description can result in different algorithmic models to build and evaluate. Third, experiments in the social sciences can vary widely, depending on the phenomena being studied [31]. Hence, data analytics for these varying experiments, including data exploration, requires custom analyses. These custom analyses can be addressed at the task level (i.e., new individual tasks within a pipeline) or at the pipeline level (i.e., the addition of new pipelines).

Solution approach and roadmap of work
To better present our work, Fig 2 provides a roadmap of this manuscript and the relationships among sections. Section 2 provides an overview of our solution approach, and specific contributions of the work. The data model (Section 3) is a formal specification of the features of experiments whose data can be analyzed with our system. If an experiment can be represented by this data model, then the experimental data can be analyzed with our pipelines. Graph dynamical systems (GDS) (Section 4) is a theoretical framework that we use for generating models of human behavior from experimental data. Both the data model and GDS are integral to the pipeline system software design and implementation (Section 7): the data model identifies the features of experiments and data that must be analyzed in the system, and GDS provides a formalism for model building. The pipeline system conceptual overview (Section 5) identifies the different components of the pipeline system. From this, the mathematical model for the pipeline system (framework and h-functions) in Section 6 is provided. This theoretical representation of the system is then used to specify the design of the system. That is, we have three theoretical models (in Sections 3, 4, and 6) that are the basis for software system design. This design, and implementation, of the pipelines are the subjects of Section 7. The implementation, along with the data model, are used in the case studies of Section 8.
A pipeline is a composition of tasks, where each task takes a set of inputs and produces a set of outputs. Our use of pipelines is motivated by the Pipes and Filters architecture pattern [32,33]. A pipeline combines tasks in analyst-specified ways. We distinguish our work from workflows because, while there is much overlap between the capabilities of workflows and pipelines, here we do not address provenance of digital objects. Although the analysis loop in terms of experiments and modeling is presented in Fig 3, these analyses and abductive and deductive looping can be executed within a study that exclusively uses experiments (i.e., no modeling). The importance of experiments, even with modeling, is observed in Fig 3 because experimental data play a major role in pipelines 1, 2, 3, and 5. Experiments are critical, for example, in establishing causality, by comparing results from control experiments with those using treatments.
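The notion of a pipeline as a composition of tasks, in the Pipes and Filters style, can be sketched in a few lines of Python. The task names (`clean`, `count_actions`) are illustrative placeholders, not tasks from our system.

```python
from functools import reduce

# A pipeline is a composition of tasks: each task consumes the
# previous task's outputs. Tasks here are illustrative placeholders.
def make_pipeline(*tasks):
    def run(inputs):
        return reduce(lambda data, task: task(data), tasks, inputs)
    return run

def clean(records):
    # filter out malformed (None) records
    return [r for r in records if r is not None]

def count_actions(records):
    return {"n_actions": len(records)}

pipeline = make_pipeline(clean, count_actions)
result = pipeline([1, None, 2, 3])  # -> {'n_actions': 3}
```

Because tasks are combined only through their inputs and outputs, an analyst can reorder, add, or remove tasks without touching the composition machinery.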
Our experimental data analysis and modeling software pipelines are complementary to current efforts to build configurable software platforms to perform social science experiments. See [35][36][37][38]. Usually, these systems focus only on designing and running online lab experiments. Just as these experiment platforms provide the infrastructure for users to instantiate a particular experiment in software, we provide a pipeline framework that can be used to build pipelines for performing various types of analyses on the experimental data.
The focus of this paper is on formal theoretical models, and on the architecture, design, implementation, and use of the pipelines that instantiate these models in software. The goal of the software system is to automate many of the steps in analyzing social science experimental data, and in building and exercising models. We presume that in the great majority of cases, no one person is going to identify a social science problem or question; specify experiment requirements and design; build experimental platforms and execute experiments; specify analyses; build software to analyze experiments and perform data analyses; specify, design, build, and validate models of experiments; and evaluate hypotheses. Rather, we view this social science research as "team science," and as such, this system is not focused on all members of such a team. So while all team members can have a general appreciation of the need for and value of such a system, the paper is focused on the team members who design and build software to automate many analysis steps.
We use the term experiment to mean human subjects interacting in a controlled setting with their actions recorded. Modeling refers to building mathematical representations of experiments. Simulation is the execution of software implementations of models, e.g., agent-based models (ABMs). We avoid ambiguous terms such as computational experiment. This paper is a full treatment, and a significant extension, of a preliminary version (a conference paper) that appears as [34].

Novelty of work
There are three novel aspects of our proposed pipeline framework. First, we devise an abstract data model that is a representation of experiments and simulation models. One can rigorously determine whether experimental data and model outputs can be analyzed with our pipelines. Furthermore, we incorporate a second model called graph dynamical systems (GDS) [39]. GDS and the abstract data model provide foundations to ensure proper mappings, from experimental conditions to computational model structure, and from model structure to experiments. See Fig 4, where we have an experimental platform and a modeling and simulation (MAS) platform, and we need these two to interoperate through our data and GDS models. It shows specific, illustrative types of data sources and modeling approaches. Second, our pipeline framework is based on formal theoretical models; the three models that inform the pipelines are denoted by the dashed arrows in Fig 2. These models are crucial in providing a principled approach to software design and implementation. This is also useful for reasoning about abstractions. Third, our pipelines use a microservices conceptual approach [40][41][42] wherein the components (i.e., tasks) of a pipeline (which we call functions, h-functions, or tasks) have well-defined minimal scopes. (Functions are described below, but basically represent the software codes that provide the functionality that pipelines orchestrate.) This way, reuse is fostered because new functions can be added surgically for experiments, analyses, and models without introducing redundant capabilities. The pipeline framework can accommodate the insertion of new h-functions at arbitrary points in the pipeline.

The five pipelines referenced throughout this work are as follows.
(1) EDTP, Experimental Data Transformation Pipeline: experimental data are transformed, by the EDTP, into a data common specification that conforms to our data model (see Section 3).
(2) DAP, Data Analytics Pipeline: the DAP analyzes data, and generates and prepares data for property inference.
(3) PIP, Property Inference Pipeline: the PIP determines properties for probabilistic agent-based modeling (ABM) and simulation (ABMS).
(4) MASP, Modeling and Simulation Pipeline: simulations are performed in the MASP.
(5) MEAPP, Model Evaluation and Prediction Pipeline: the MEAPP generates comparisons between experimental data and model predictions using statistical and logical testing. This is part of model validation. We can then specify test conditions for the next experiments (experiment specification).
In comparing our software system with others in the social science realm, we note that according to [28], "the current focus of many social science systems is social network analysis." See other works in Section 9. As illustrated in Figs 1 and 3, our work goes far beyond structural analyses of static networks: our work centers on experiments of human behavior, where interactions among players are specified as edges in a network whose nodes are the players. Our system is used for quantifying the behavior of humans in experiments: (i) analyzing experimental data, (ii) developing models and their properties for the behavior of human subjects in these experiments, and (iii) conducting agent-based simulations to model these experiments, and conditions beyond those tested. Furthermore, the system is applicable to a wide range of experiments, as long as they conform to the data model in Section 3. To the best of our knowledge, there are no other pipeline software systems for these types of studies.

Contributions
We itemize our contributions below.
1. Development of formal models, formal algorithms, and software implementations for both a data model and a pipeline model. For each of the data and pipeline representations (down the left-hand column of Table 2), we provide formal models, formal algorithms, and implementations. This approach demonstrates the power of modeling (including theory) to inform software system implementations. (Elements of Table 2 in blue and bold are our contributions; elements taken from other works are in normal typeface.) Thus, taking the data, GDS, and pipeline systems each in turn, this contribution is specifically that we provide a consistent (and unified) view of, and approach to, building pipeline systems for social experiments and for modeling them. Specific contributions within this context follow.
2. Formal data model specification for NESS experiments and modeling. We develop a formal abstract data model for NESS experiments. The primary use of our data model is this: any experiment that can be formally described in terms of this data model can be analyzed within our pipeline system. The model provides a single specification for both experiments and modeling, thus ensuring a correspondence between experiments and the modeling and simulation (MAS) tasks that represent the experiments. The abstract data model provides an abstraction level per Section 1.2. Characteristics of our data model are: (i) an experiment may contain one or more phases (i.e., sub-experiments); (ii) the finite duration of each phase may be different; (iii) the interaction structures among players (represented as networks) may be different for different phases; (iv) the set of player actions and the set of multi-player interactions may be different for different phases; and (v) players may repeat these actions and interactions any number of times, in any order, within a phase (i.e., temporal freedom of actions and interactions). A significant class of experiments is represented by these five characteristics. Illustrative works whose experiments are in this class are [1-6, 8, 9]. The data model, with our dynamical systems computational model (Section 4), provides a formal specification for experiments and models. The data common specification in Fig 3 is based on the data model.
3. Formal pipeline framework. We provide a conceptual view of pipelines used to construct a formal theoretical model of our pipeline framework. The pipeline framework is the infrastructure that executes common operations that are invariant across pipelines that have different functionality. (It is the same among all five pipelines that we introduce in this paper to study social science experiments and to model them.) These common fundamental operations are shown in Fig 5. From the model, we present an algorithm that covers these operations, and then design and construct a pipeline framework to execute these operations for any pipeline. The framework is extensible to additional pipelines; indeed, our particular pipelines have been constructed over time using the same framework.

4. Pipeline h-functions (also called functions and tasks). We use a microservices conceptual approach [40][41][42] for our pipelines, wherein the tasks or components in a pipeline, which we call functions or h-functions, have minimalist scopes. The h-functions are software components that give a pipeline its application domain functionality. For example, one h-function will perform a particular data analytics operation, such as computing time histories, or computing a particular property for a particular model from data. We provide 29 implemented h-functions within the five pipelines (see Appendix D). All h-functions are serial codes written in C++, Python, and R. New functions can be introduced for new experiments, analyses, and models in a targeted fashion (as we have done), fostering reuse without redundancy. Note that a pipeline comprises the pipeline framework and a sequence of h-functions (Fig 5). We put these parts together to form particular pipelines in the next contribution.
5. Five extensible pipelines for modeling and simulation, and analysis, of controlled networked experiments. We design and construct pipelines for (1) transforming experimental data, (2) analysis of data, (3) inferring model properties, (4) MAS, and (5) comparing model results with experiment results, and predicting results in the absence of data (i.e., counterfactuals). Each pipeline consists of an extensible collection of functions that can be composed to accomplish particular objectives. Moreover, there are several ways to order these pipelines (Fig 3 is one way), and some pipelines may be omitted or implemented as multiple instances. An example is the use of experiments only for devising and testing hypotheses (i.e., studying a phenomenon with experiments, without modeling). Across multiple iterations of Fig 3, the experiment may change, necessitating different Data Analytics Pipelines for different experiments. Execution of pipelines and tasks is robust because of syntactic data validation of inputs and outputs at the task (function) level. These pipelines execute operations (3) through (10) in Section 1.1 (note: we do not automate the process of generating software verification cases, and model design is a human task). The Fig 3 caption explains why we emphasize controlled experiments; however, this is not a requirement for the pipelines (e.g., they can be used with social media or other types of observational data). The automated steps in Fig 3 are executed with a human-in-the-loop to inspect results. The pipelines also help ensure extensibility, scalability, and other "ilities" of Section 1.2.
6. Case studies. Use of the NESS system is demonstrated with three case studies. Case study 1 combines experiments and modeling. Case study 2 addresses experiments only. Case study 3 focuses on modeling only. In case study 1, we describe social experiments to generate collective identity (CI) within a collection of individuals [7]. CI is an individual's cognitive, moral, and emotional connection with a broader community, category, practice, or institution [43]. Experiments and all five pipelines in Fig 3 are used. Two additional case studies use published works from other teams, appearing as [3,44]. The point of these case studies is to demonstrate that our pipelines are useful for other types of experiments, and can be used in other settings.
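As a concrete illustration of an h-function with a minimal scope (Contribution 4 above), the sketch below computes a time history, i.e., counts of actions per time bin, from action tuples. The tuple layout follows the action-tuple schema of Section 3, but the function itself and its bin-width parameter are our illustrative assumptions, not code from the actual pipelines.

```python
from collections import Counter

# Hypothetical h-function: compute a time history (action counts per
# time bin) from action tuples laid out as
# (sequence_id, action, initiator, target, time, payload).
def time_history(action_tuples, bin_width=1.0):
    bins = Counter()
    for (_seq, _act, _src, _dst, t, _payload) in action_tuples:
        bins[int(t // bin_width)] += 1
    return dict(sorted(bins.items()))

tuples = [
    ("v1-0", "request_letter", "v1", "v2", 0.4, "z"),
    ("v1-0", "reply_letter",   "v2", "v1", 1.2, "z"),
    ("v3-0", "request_letter", "v3", "v1", 1.7, "q"),
]
history = time_history(tuples)  # -> {0: 1, 1: 2}
```

The minimal scope means this function does exactly one analytics operation; a different property computation would be a separate h-function rather than an option of this one.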
Empirical context for our pipelines. The works of [7,22,45] demonstrate the usefulness of our pipeline system, where collective identity was studied via online experiments and modeling of them. That is, these provide empirical context where our software tools are important. Analogous works that also provide context are the experiments in [1][2][3][4][5][6]. Returning to [7,22,45], these works demonstrated that CI could be formed among players in a group anagram game, where multiple players interact with their assigned neighbors to form words from collections of letters. The games were devised and implemented in software, and were played online through players' web browsers. Game data were analyzed to understand game dynamics, to develop a model of player behaviors in the game, and to compute properties for the model. The work [45] produced additional models for the individual actions of players (word formation, letter requests of game neighbors, and replies of letters to neighbors' letter requests) in the anagram game. Although all three of the works [7,22,45] used the software pipelines of this work, there is no mention or description of the software pipelines in them. The purpose of our work is to describe the software pipeline system for general NESS experiments. That is, our pipeline software system is far more general than its use in those works. Nonetheless, those works demonstrate the value of our pipeline system.
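The task-level syntactic validation of inputs and outputs mentioned in Contribution 5 can be sketched as a wrapper that checks declared schemas around each h-function. The schema format here (field name mapped to a Python type) is our illustrative assumption; the real pipelines may validate differently.

```python
# Sketch of task-level syntactic validation: each h-function declares
# simple input/output schemas that are checked before and after
# execution. Schema format (field -> type) is an assumed convention.
def validate(schema, record):
    for field, ftype in schema.items():
        if field not in record or not isinstance(record[field], ftype):
            raise ValueError(f"field {field!r} fails schema check")

def checked_task(in_schema, out_schema, fn):
    def wrapper(record):
        validate(in_schema, record)   # reject malformed inputs early
        result = fn(record)
        validate(out_schema, result)  # catch h-function output errors
        return result
    return wrapper

# Hypothetical h-function wrapped with validation.
mean_score = checked_task(
    {"scores": list}, {"mean": float},
    lambda rec: {"mean": sum(rec["scores"]) / len(rec["scores"])},
)
out = mean_score({"scores": [2.0, 4.0]})  # -> {'mean': 3.0}
```

Failing either check raises an error at the offending task, which localizes faults within a pipeline run rather than letting malformed data propagate downstream.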

Significant work beyond the conference paper
A preliminary 12-page version of this paper was published as [34]. Significant extensions of that work, presented herein, are summarized as follows. (1) In Section 3, we demonstrate how our abstract data model can be transformed into data models used in software development, such as an entity-relationship diagram in unified modeling language (UML) format. This enables reasoning about and representing the data model as a software artifact. (2) In Section 4, the graph dynamical systems (GDS) framework is presented in more detail and an example is given that uses the model. This makes more precise the GDS framework and its correspondence with the data model. (3) In Section 6, we provide a formal mathematical model of the pipeline system; we provide an algorithm of its functionality; and we describe how the model maps onto software. This is important because the formal model is the basis for the architecture and design of the pipeline system. (4) In Section 7 and Appendices A through D, we provide a greatly expanded description of the software design and implementation. This also demonstrates how the model of Section 6 is used to design and implement the software pipelines.

Abstract data model for NESS experiments and for modeling and simulation
We present a formal abstract data model. The utility of this model is to determine whether an experiment can be analyzed with our pipeline system. If an experiment can be represented by the characteristics of our data model, then data from the experiment can be analyzed with our pipelines. We provide a short example of its use, and then we demonstrate how the data model can be transformed into an entity-relationship diagram that is a more typical representation for reasoning about software, for implementation purposes.

Formal data model
A general adaptive abstract data model is presented. This data model for networked social experiments follows the five characteristics of Section 2.3, Contribution 2. The purpose of the data model, provided in Table 3, together with the computational model of Section 4 and the pipeline model in Section 6, is to provide formal representations for experiments and MAS, and their iterative interactions, per Fig 4. We focus only on the data model, and for compactness, we describe the data model in terms of experiments, but the description is equally valid for modeling. Given a description of an experiment or model, one can determine whether our system of five pipelines can be applied. Also, given a phenomenon to study, the data model can be used to formulate experiments and models for simulating experiments. The data model produces the "data common specification" in Fig 3 (blue). We note that even for different types of experiments that do not conform to our data model, a pipeline system of collections of operations can still be built, but it would have different h-functions than those we have constructed.

Rows 8 through 13 of the phase schema in Table 3 specify the following (note that the edge set E′ may be empty).
Row 8, node attributes for a phase: Γ_j(t) = (γ_j1(t), γ_j2(t), ..., γ_j,η_v(t)) is the sequence of η_v attributes for v_j ∈ V′ in phase i_{n_p} at time t. Γ is a triply nested sequence in attributes, player ID, and time.
Row 9, edge attributes for a phase: C_j(t) = (c_j1(t), c_j2(t), ..., c_j,η_e(t)) is the sequence of η_e attributes for e_j ∈ E′ in phase i_{n_p} at time t. C is a triply nested sequence in attributes, edge ID, and time.
Rows 10 and 11, initial conditions for nodes and edges: B^e_j = (b_j1, ..., b_j,μ_e) is the sequence of μ_e initial conditions for the phase, for e_j ∈ E′, with μ_e ≥ 0; node initial conditions B^v_j are defined analogously.
Row 12, action set: A = {a_1, a_2, ..., a_{n_a}} is the set of n_a actions that each player can execute, over time, any number of times, during a phase, where n_a ≥ 0.
Row 13, action tuple schema: T_i = (σ_i, a_j, v_k, v_ℓ, t_o, py_q) is the schema for an action tuple. σ_i is a string that is a unique identifier for an action sequence. Action a_j ∈ A is initiated by node v_k ∈ V′, and v_ℓ is the target node of the action, with edge e = {v_k, v_ℓ} ∈ E′. t_o ∈ R is the time of the action (0 ≤ t_o ≤ t_p); py_q is the payload, represented as a JSON schema.

The experiment schema describes experiment parameters. The phase schema structure describes parameter types for an experimental phase; an experiment can have any number n_p of phases. Particular instance variables within the phase schema structure can vary across phases. We use experiment throughout the table and text for ease of exposition, but the data model is also used for (simulation) models.

We now describe the two major sections of Table 3.

Experiment schema. Per Table 3, an experiment has the following parameters: a unique ID exp_id, the number n_p of experiment phases, the number n of players (i.e., human subjects) over all phases of the experiment, a t_begin timestamp for the start of the experiment, and a t_end timestamp for the end of the experiment. Each player has a (universally) unique ID v_i for identification. The set V of players in an experiment is defined by V = {v_1, ..., v_n}. An experiment has n_sa attributes defined for each player. Player attributes O are invariant across phases (e.g., age, gender, education level, and income, which might be obtained through a questionnaire).
Phase schema. An experiment is composed of one or more phases. All phases have a common schema, per Table 3, but particular phases may have different variable values for parameters in the schema.
Each phase schema has the following parameters: a unique ID ph_sch_id, the number i_{n_p} (1 ≤ i_{n_p} ≤ n_p) of the phase in the sequence of phases, a t_ph_begin timestamp for the start time of the phase, the number t_p of time increments in the phase, and the unit u_p of time of one time increment. The interaction channels of pairwise interactions among players are defined by a network G(V′, E′), with meanings of edges Λ, for each phase. Edge attributes C and node attributes Γ over all edges and nodes capture time-varying attribute changes for phase i_{n_p}. Players (i.e., nodes) and edges may have initial conditions B^v and B^e, respectively, whose elements may be the same as Γ and C. The permissible player actions during a phase are denoted by the set A. An action tuple T_i, which captures pairwise interactions between players, may be intimately tied to the attribute sequences Γ and C of a phase because action tuples, for example, may cause or be caused by changes in node and edge attributes. In essence, Γ and C can be viewed as sequences of node and edge states. Items 8 through 11 and 13 of the phase schema in Table 3 follow the same basic pattern, to capture features by node or edge, and by time.
There is a sequence of values for a particular node v_j or edge e_j (e.g., Γ_j, C_j, B^v_j, B^e_j, and T_j). Each entry in a sequence can be a scalar, array, set, map, or other structure. These entries are then sequenced over time, via the union of entries from time 0 through t_p, as shown in rows 8, 9, and 13 of Table 3. The exceptions are the initial conditions B^v_j and B^e_j (rows 10 and 11) because, by definition, they are specified only at time 0.
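To make the schema concrete, part of the abstract data model can be mapped onto simple data structures. The sketch below uses Python dataclasses; the field names mirror Table 3 parameters, but the class layout itself is our illustrative choice, not the paper's implementation.

```python
from dataclasses import dataclass, field

# Illustrative mapping of part of the abstract data model (Table 3)
# onto dataclasses. Field names mirror the schema parameters; the
# layout is an assumption for illustration only.
@dataclass
class Phase:
    ph_sch_id: str
    t_ph_begin: float
    t_p: int                                   # number of time increments
    nodes: set = field(default_factory=set)    # V' for the phase
    edges: set = field(default_factory=set)    # E' (may be empty)
    actions: set = field(default_factory=set)  # action set A
    action_tuples: list = field(default_factory=list)

@dataclass
class Experiment:
    exp_id: str
    players: set                               # V = {v_1, ..., v_n}
    phases: list = field(default_factory=list)

    @property
    def n_p(self):
        # number of phases is derived rather than stored separately
        return len(self.phases)

exp = Experiment("exp-01", {"v1", "v2"})
exp.phases.append(Phase("ph-1", 0.0, 300,
                        nodes={"v1", "v2"},
                        edges={frozenset({"v1", "v2"})},
                        actions={"request_letter", "reply_letter"}))
```

A check that an experiment description instantiates these structures is, in effect, a check that the experiment conforms to the data model and can therefore be analyzed with the pipelines.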

Illustrative instances of data model parameters
We provide a few illustrative examples of data model elements. A 3-phase game is described in Section 8, Case Study 1. Phase 2 is a group anagram (word construction) game. In phase 2, a network G(V′, E′) is imposed on the players, where the meaning λ of an edge is a communication channel used to request letters and to reply to requests. A node initial condition b_{j1} for a game is the number of alphabet letters a player receives at the beginning of the phase to use in forming words, and b_{j2} is the set of those letters. Each player can execute any action from the action set A, such as requesting a letter from a neighbor.
We now provide an example of an action tuple of an action sequence. Suppose player v_i requests letter "z" (a request is an action a_ℓ ∈ A) from player v_j at time t_o, initiating a sequence of actions (because there may be a subsequent letter reply from v_j). The resulting action tuple contains the sequence ID σ_i, the time t_o, the initiator v_i, the recipient v_j, the action a_ℓ, and the letter "z". Here, σ_i = v_i + "−" + counter (e.g., a string) is a concatenation of the initiator's (v_i's) ID with a player-specific counter, forming a unique ID for the sequence of actions initiated with the letter request. If v_j responds with "z," then this (second) action tuple will use the same σ_i as the first element of the tuple, consistent with T_i. This is how action tuples are defined and identified in data processing, in forming action sequences T for a phase.
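A small sketch of this construction follows; the exact field order of the tuple is an illustrative assumption, since the key point is only that both tuples carry the same σ_i and are thereby joined into one action sequence.

```python
def make_sigma(initiator_id, counter):
    """sigma_i: concatenation of the initiator's ID with a player-specific
    counter, forming a unique ID for one action sequence."""
    return f"{initiator_id}-{counter}"

sigma = make_sigma("v1", 7)
# A letter request from v1 to v3 at time t_o = 12, and the later reply from
# v3 at time 14. Both tuples share sigma, so data processing can group them
# into one action sequence when forming T for the phase.
request = (sigma, 12, "v1", "v3", "request", "z")
reply   = (sigma, 14, "v3", "v1", "reply", "z")
```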

From abstract data model to software specification
Ours is an abstract mathematical data model. There are several reasons for our choice of model representation. First, a mathematical representation is more abstract (which means, among other things, more versatile and flexible) in its use. Second, it corresponds much more closely to the information required for pipeline capabilities, and enables compact representations of simulation models. Third, it is naturally amenable to translation into other data model representations that are more common in software. We elaborate on each of these.
1. Abstract representation. An element of a sequence can abstractly represent any type of data, including scalars, vectors, sets, tensors, and complicated data structures (that may be implemented via a JSON schema). For example, consider γ_{j2} of Γ_j of Γ in Table 3, which is an attribute for node (or player) v_j ∈ V′. This variable might represent a 2-D matrix or a set. Furthermore, if the representation needs to be changed, it is much easier to do so with an abstract representation.
2. Compactness. Consider a capability for a simulation model, as part of a pipeline: multiplying two matrices, M_1 and M_2. A mathematical representation is simply M_1 · M_2 or M_1 M_2. A pseudocode representation of this functionality would require some five lines of code, including three FOR loops. Clearly, M_1 · M_2 is far more compact.
3. Translation. The rows of Table 3 contain data structures. Instances of our abstract data model (generated from the execution of an experiment) can be represented as entity-relationship diagrams, which are conceptual or logical data models. Examples are relational models [46], object-oriented models like the Object Definition Language (ODL) [47] or the Unified Modeling Language (UML) [48], and data structure diagrams [49], among others. A UML representation of an entity-relationship diagram for our abstract data model is presented in Fig 7; UML is the industry-standard language for specifying, visualizing, constructing, and documenting the artifacts of software systems [48]. All of the structures from the abstract data model of Table 3 are translated into an entity-relationship diagram in UML form, demonstrating that the abstract data model can be translated into standard forms of data models that are more amenable to software development.
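For concreteness, the pseudocode that the compactness argument alludes to looks roughly like this in Python (a generic triple-loop matrix multiply, not project code):

```python
def matmul(M1, M2):
    """Triple-loop matrix product: the roughly five lines of code, with
    three FOR loops, that the single expression M1 * M2 replaces."""
    n, m, p = len(M1), len(M2), len(M2[0])
    R = [[0] * p for _ in range(n)]
    for i in range(n):          # rows of M1
        for j in range(p):      # columns of M2
            for k in range(m):  # inner dimension
                R[i][j] += M1[i][k] * M2[k][j]
    return R
```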
Data common specification. Every JSON input file in the pipelines needs a corresponding JSON schema for the verification of formats. For our Data Common Specification, there are five classes of input that every experiment needs to define. The formal data model in Section 3.1 specifies that an experiment can have any number n_p of phases, and a different set of players with an action set for each phase. Table 8 in Appendix A describes the elements of the Data Common Specification. Figures in Appendix A define, through JSON schemas, the formats and compositions of the elements of the Data Common Specification. These are implementation aspects of our pipelines. These are also the types of files we use in the case studies in Section 8.
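The flavor of such format verification can be sketched as follows. This is a deliberate simplification using a field-to-type map rather than a full JSON schema (which the project's Data Common Specification actually uses); it mirrors the 0/1 schema evaluators of the formal model below.

```python
import json

# Simplified stand-in schema: required field -> expected Python type.
# The real Data Common Specification uses full JSON schemas.
SCHEMA = {"exp_id": str, "n_p": int, "players": list}

def verify(json_text, schema=SCHEMA):
    """Return 1 when every schema field is present with the expected type,
    else 0 (mirroring the 0/1 schema evaluators of the formal model)."""
    try:
        obj = json.loads(json_text)
    except json.JSONDecodeError:
        return 0
    ok = isinstance(obj, dict) and all(
        isinstance(obj.get(k), t) for k, t in schema.items())
    return 1 if ok else 0
```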

Graph dynamical system model
In this section, we present a formal framework for NESS experiments and agent-based models. We use a computational model known as a discrete graph dynamical system (GDS) [39] to specify, build, and execute experiments and simulators of experiments (and of other conditions). GDS also corresponds to the data model of Section 3 and is a general model of computation [50,51]; hence it can help ensure that experiments and models are synchronized, per Fig 4. A number of other formal models could have been used; we find GDS to be a natural model for specifying NESS experiments. Table 4 describes all of the symbols used in our equations. Table 4. Symbols used to describe our computational model known as a discrete Graph Dynamical System (GDS).

[Table 4 entries include: S, a synchronous Graph Dynamical System (GDS); G(V, E), the underlying undirected graph of the GDS, with node set V, edge set E, and n = |V|; the sequence of vertex states; and the system state or configuration.]

GDS formal model
A synchronous Graph Dynamical System (GDS) [52] S is specified as S = (G, W, F, U), where we define each component in the following. (a) G ≡ G(V, E) is an undirected graph with n = |V|, and represents the underlying graph of the GDS, with node set V and edge set E. Nodes represent agents in a system, or test subjects in our experiments, and edges denote pairwise interactions between agents. (b) W is the state space, which is the union of the state space W_v for nodes and the state space W_e for edges; i.e., W = W_v ∪ W_e. These are the states that nodes and edges can take during the dynamics.
(c) F = (f_1, f_2, . . ., f_n) is a collection of functions in the system. Function f_i represents the local function associated with node v_i, 1 ≤ i ≤ n, and describes how v_i updates its state. (d) U is the method that describes how the local functions are ordered at each discrete time. Here, we use the synchronous update scheme, where all f_i execute in parallel. Each node v_i ∈ V of G has a state value from W_v at each time t. Each edge e_ij ∈ E of G has a state value from W_e at each t. Each function f_i specifies the local interaction between node v_i and its neighbors in G. The inputs to function f_i are the state of v_i, the states of the neighbors of v_i, and the states of the edges outgoing from v_i in G. Function f_i maps each combination of inputs to a next state s′_i ∈ W_v for v_i, and to a next state s′_ij ∈ W_e for each directed edge e_ij. These functions are executed in parallel at each time step t. We now provide details of the dynamics of a GDS, based on the overview above. We assume here that only nodes have states; there are no edge states. Let G(V, E) be a graph with node set V and edge set E, where n = |V|. Each node v_i has a state s_i ∈ W_v. Let N(v_i) be the sequence of vertices in the closed neighborhood of v_i (i.e., v_i and its distance-1 neighbors), and let s(v_i) be the corresponding sequence of their states; we call s(v_i) the restricted state of v_i. The system state or configuration C of a GDS is the vector of length n, C = (s_1, s_2, . . ., s_n).
A local function f_i : (W_v)^{d(v_i)+1} → W_v quantifies the dynamics of node v_i by computing v_i's next state s′_i from the states of the nodes in its closed 1-neighborhood, as s′_i = f_i(s(v_i)). Updating the entire set of nodes in G at some time t is accomplished with the GDS mapping F : (W_v)^n → (W_v)^n. For the synchronous update scheme, where all f_i, i ∈ {1, 2, . . ., n}, execute in parallel, the GDS mapping is defined by F(C) = (f_1(s(v_1)), f_2(s(v_2)), . . ., f_n(s(v_n))). In a simulation, we compute successive system states using this last equation, as C(t + 1) = F(C(t)), where C(t) is the system state or configuration at time t, and C(t + 1) is the next system state.
To make this explicit, we now cast the preceding formalism as a pseudo-algorithm for computing the dynamics of a GDS. Assume for simplicity that only nodes possess state, and edges do not. At any time t, the configuration of a GDS is C(t) = (s_1(t), s_2(t), . . ., s_n(t)). In a synchronous GDS, all nodes compute and update their next states synchronously, i.e., in parallel. A GDS transition from one configuration C(t) to the next configuration C(t + 1) can be expressed as follows: for each node v_i, compute the value of f_i (Eq (3)) using states in C(t) and assign it to s_i(t + 1). Note that if the f_i are stochastic, C(t + 1) may not be unique. The extension to the update of edge states s′_ij is natural. Associations between the data model and GDS. The data model in Section 3 is consistent with a GDS. The graph G(V′, E′), per phase, in Table 3 corresponds to the graph G(V, E) of the GDS. The node (W_v) and edge (W_e) state spaces in the model represent subsets of the node (Γ) and edge (C) attributes in the data model, respectively. Attributes may have additional parameters that are not part of the node or edge state, such as gender and age. Action tuples may be part of the state. The sequencing of action tuples is related to the update scheme U, e.g., whether each node takes turns performing some action in series, or whether players can act simultaneously.

Example GDS and resulting dynamics: Threshold systems
We provide an example of a GDS and the dynamics that it generates. We use a threshold contagion system, motivated by work [3,13,53] in the social sciences. We also use this model in the second case study of Section 8. A progressive threshold system works as follows. The network G(V, E) is provided at the left in Fig 8. The valid state set for a node is W = W_v = {0, 1}, where state 0 means that a node does not possess a contagion and state 1 means that a node possesses the contagion and will assist in transmitting it. The threshold local function works as follows. Each node v_i is assigned a threshold 0 ≤ θ_i ≤ d_i + 1, where d_i is the degree of v_i in G. If the state of node v_i at time t is 1 (i.e., s_i^t = 1), then the output of f_i is 1 (that is, a node in state 1 at t remains in state 1 at t + 1). If s_i^t = 0, then the output of f_i is 1 if and only if n_1 ≥ θ_i, where s^t(v_i) is the sequence of states in the closed neighborhood of v_i at time t, and n_1 is the number of nodes in state 1 in s^t(v_i). This is a deterministic GDS. The dynamics evolve as follows; see Fig 8. We specify as initial conditions that v_1 has the contagion at t = 0, i.e., s_1^0 = 1, and all other nodes are in state 0. At t = 1, v_2 transitions to state 1 because the threshold for v_2 is just met by v_1. For the same reason, s_5^1 = 1 (v_1 is a neighbor of v_5, and all other neighbors of v_5 are in state 0). No other node changes state at t = 1, and therefore C(1) has three nodes in state 1. At t = 2, v_4 changes state, even though its threshold is large (θ_4 = 3), because three of v_4's neighbors (v_1, v_2, and v_5) are now in state 1. This is the only node that changes state at t = 2, and so C(2) is as shown in Fig 8. The same reasoning applies to the transitions of other node states. Note that v_3 will never transition because its threshold (2) is greater than the number of its neighbors (1). Also note that the system reaches a fixed point at t = 3 because no further state changes are possible.
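The dynamics above can be simulated with a few lines of code. The sketch below implements a synchronous progressive threshold GDS; the 5-node graph and thresholds are chosen to be consistent with the description (they are not a claim about the exact network of Fig 8).

```python
def threshold_step(neighbors, theta, state):
    """One synchronous update of the progressive threshold GDS: a node in
    state 1 stays in state 1; a node in state 0 flips to 1 when the number
    of 1-states in its closed neighborhood meets its threshold."""
    new_state = {}
    for v, nbrs in neighbors.items():
        if state[v] == 1:
            new_state[v] = 1
        else:
            n1 = state[v] + sum(state[u] for u in nbrs)
            new_state[v] = 1 if n1 >= theta[v] else 0
    return new_state

# Hypothetical 5-node graph and thresholds consistent with the described
# dynamics (v_3 has one neighbor but threshold 2, so it never transitions).
neighbors = {1: [2, 4, 5], 2: [1, 4], 3: [4], 4: [1, 2, 3, 5], 5: [1, 4]}
theta = {1: 1, 2: 1, 3: 2, 4: 3, 5: 1}
state = {v: 0 for v in neighbors}
state[1] = 1                      # seed: v_1 has the contagion at t = 0

trajectory = [state]
for _ in range(3):                # compute C(1), C(2), C(3)
    state = threshold_step(neighbors, theta, state)
    trajectory.append(state)
```

Running this reproduces the narrative: three nodes in state 1 at t = 1, v_4 flipping at t = 2, and a fixed point at t = 3.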

Conceptual view of pipelines
The purpose of this section is to provide a high-level overview of the pipeline system. Pipeline composition, the pipeline framework, particular pipelines, and operations (h-functions) within pipelines are covered. This sets up the formal theoretical model of Section 6 and the software implementation in Section 7.

Pipeline system
Pipeline compositions. Our system of five pipelines is shown in Fig 3. We separate the experimental platform from the pipelines so that the system can be used with different experimental software platforms, as long as an experiment conforms to the Data Common Specification, which is the data model of Section 3. An iteration of the loop may use any number of the five pipelines, and any number of functions within them, for flexible composability, consistent with data dependencies [54].
Pipeline framework. A box is drawn around the pipeline in the figure; it represents the pipeline framework, i.e., the invariant part of pipelines that is used across all pipeline instances. The operations executed by the pipeline framework are listed in Section 2.3. It is the h-functions that tailor a pipeline to a given domain-based purpose.
Pipelines. The five pipelines of Fig 3 are now described. (1) The Experimental Data Transformation Pipeline cleans the experimental data and transforms them into the data common specification. (2) The Data Analytics Pipeline analyzes temporal interactions among players to identify patterns in the data, in order to understand human behavior and to assist in model development. Computational models are developed offline, as this is a human reasoning-based effort. Thereafter, direct and derived data are used as input to (3) the Property Inference Pipeline. This pipeline generates property values for parameters of simulation models, often by combining data from multiple experiments. Simulation models (e.g., ABMs) are built offline, and software implementations of these models are part of (4) the Modeling and Simulation Pipeline. This pipeline invokes the code to run simulations, using the generated property values, as well as network descriptions, initial conditions, and other inputs. Simulations may model completed or future experiments, or other scenarios beyond the scope of experiments. (5) The Model Evaluation and Prediction Pipeline compares multiple sets of data. As one case, experimental data and model predictions may be compared. As another, results from two models may be compared. One objective may be to predict beyond experiment data (counterfactuals) and to propose further investigations suggested by analysis findings.
Each pipeline is currently a sequential composition of functions. This composition is specified by an analyst through a job definition. Similarly, compositions of the pipelines of Fig 3 are specified by an analyst. The pipeline process takes care of file dependencies between functions. It also validates the input and output data of functions, as described below. The structure of a pipeline is shown in Fig 9, where function h_1 takes two inputs and generates three outputs (two are inputs to function h_2 and one is an input to function h_3); function h_2 generates two outputs, one of which is an input to h_3. Note that the pipelines control execution of functionality; execution control consists of a pipeline invoking functions sequentially, as illustrated in Fig 9.

Functions within pipelines
Functions are designed as microservices (modular software with limited scope) within pipelines. They provide a range of capabilities, from straightforward plotting routines to data cleaning and organizing, storing and accessing data sets, inferring properties, and running simulations. Users may add other functions and continue community-based development. This concept is illustrated in Fig 9. Currently, inputs and outputs are files, but they may include other digital objects, such as database table entries. Input data q̂_i (e.g., in the form of an ASCII data file that may be raw data or output from a preceding function) may need to be transformed into the formats required by h. This transformation is performed by transformation code τ_j, which generates the input k̂ in the required format. The input objects k̂_1 and k̂_2 conform to JSON specifications to ensure compliance for inputs to h. The outputs of h are ℓ̂_1, ℓ̂_2, and ℓ̂_3.
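The τ-then-h pattern can be sketched in miniature as follows. Both the file format and the function names are illustrative assumptions, not the project's actual pipeline API: τ converts a raw comma-separated action log into a JSON input k̂, and a tiny analytics h-function then consumes it.

```python
import json, os, tempfile

def tau_actions(raw_path, out_path):
    """Transformation tau: convert raw comma-separated action lines into the
    JSON format required by the h-function (format is illustrative)."""
    rows = []
    with open(raw_path) as f:
        for line in f:
            player, action, t = line.strip().split(",")
            rows.append({"player": player, "action": action, "time": int(t)})
    with open(out_path, "w") as f:
        json.dump(rows, f)

def h_count_actions(in_path):
    """h-function: count actions per player, a tiny analytics microservice."""
    with open(in_path) as f:
        rows = json.load(f)
    counts = {}
    for r in rows:
        counts[r["player"]] = counts.get(r["player"], 0) + 1
    return counts

# Demonstration with a throwaway raw file.
d = tempfile.mkdtemp()
raw, k_hat = os.path.join(d, "raw.txt"), os.path.join(d, "k1.json")
with open(raw, "w") as f:
    f.write("v1,request,0\nv2,request,0\nv1,reply,1\n")
tau_actions(raw, k_hat)           # tau produces the conforming input k_hat
counts = h_count_actions(k_hat)   # h consumes the verified input
```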
Microservices. Our functions map directly to microservices. Appendix E addresses characteristics, benefits, and comparisons of microservices. We provide details of microservices because they are the fundamental execution units within our pipelines.

Formal pipeline framework model
With the conceptual view in Section 5, we now provide a formal mathematical model for the pipeline framework, the invariant part of a pipeline, and h-functions, which are particular operations to perform on data (e.g., from experiments). First, we provide the theoretical model. Then, we provide an algorithm for its execution, which moves the system closer to the software and facilitates system design. In Section 7, we combine the pipeline framework with h-functions to produce particular pipelines; the emphasis there is on software design and implementation.

Pipeline framework model
Let 𝒫 be a collection of pipelines, with pipeline P ∈ 𝒫 represented as P(Q, Q̂, S_ID, S, T_ID, T, H). Here, Q is a set of datatypes q ∈ Q; Q̂ is the set of all data instances q̂ ∈ Q̂; S_ID is a set of mappings s_ID ∈ S_ID from datatypes to schema evaluators; S is a set of schema evaluators s ∈ S; T_ID is a set of mappings τ_ID ∈ T_ID from h-functions and datatypes to transformations; T is a set of data transformations τ ∈ T; and H is a sequence of h-functions h ∈ H. We detail each of these in turn.
First we address the types of data that are inputs and outputs of h-functions. Let q ∈ Q be a datatype of the set Q of all datatypes. Let k ∈ K be an input datatype of the set K of all input datatypes. Let ℓ ∈ L be an output datatype of the set L of all output datatypes. Datatypes can be primitive datatypes found in most programming languages (e.g., integer, float, real, char) and data structure types (e.g., records) that are combinations of primitive types and data structures such as maps and arrays. An element q ∈ Q may be either or both an input data element k and an output data element ℓ; we have Q = K ∪ L. Moreover, the intersection of K and L will almost always be non-empty, i.e., K ∩ L ≠ ∅, because in a pipeline, an output element of an h-function may be an input to a subsequent h-function. We use k to denote an input datatype, ℓ to denote an output datatype, and q to denote an input datatype, an output datatype, or both.
We have the instance analogs of the datatypes above. That is, instances have numerical values and character strings assigned for each datatype. Data instances q̂ ∈ Q̂, input data instances k̂ ∈ K̂, and output data instances ℓ̂ ∈ L̂ must conform to the datatypes of Q, K, and L, respectively. Note that there will be an implicit relationship between an instance q̂ and a datatype q because these are based on the semantics of a problem. In general, the relationship between one q and q̂ is 1-to-many: there are many possible instances for a single datatype. Each data instance has as a parameter the datatype to which it must conform.
We now address data schema and data format verification. Let S_ID be the set of schema ID mappings s_ID ∈ S_ID, where s_ID : Q → S is defined by a mapping from each datatype q to a unique schema evaluator s ∈ S. That is, s = s_ID(q). If we have a universal schema identifier, then |S_ID| = 1, i.e., a single s_ID is used across all q ∈ Q.
To verify that an instance q̂ of a datatype q has a valid format, we use a schema evaluator s : Q̂ → {0, 1}. A schema evaluator takes as input a data instance q̂ and outputs a 1 when q̂ conforms to the datatype q (i.e., q̂ is successfully verified against q using s), and outputs a 0 otherwise. That is, s(q̂) returns a 0 or a 1. The next phase of the model addresses data transformations. Let T_ID be the set of transformation ID mappings τ_ID ∈ T_ID. A transformation ID mapping τ_ID : H × K → T is a mapping from a target h-function h ∈ H and a target input datatype k ∈ K for the h-function, to a transformation function τ. That is, τ = τ_ID(h, k). Hence, there is one transformation function τ for each input datatype k (and instance k̂, respectively) to an h. Without loss of generality, we can have a universal transformation ID mapping τ_ID across the entire set of tuples H × K, so that |T_ID| = 1. The role of a data transformation function is to operate on inputs and outputs from one or more h-functions (defined below) and produce a new data instance that is in the required format for input to another h-function. A set T of data transformation functions τ ∈ T transforms data instances q̂ ∈ Q̂ into data instances k̂ ∈ K̂, of types q ∈ Q and k ∈ K, respectively, that are suitable for input into an h. Formally, a data transformation function τ : Q̂^{n_τ} → K̂ is defined as k̂ = τ(q̂_1, q̂_2, . . ., q̂_{n_τ}), where k̂ ∈ K̂ and q̂_j ∈ Q̂, 1 ≤ j ≤ n_τ. Here, n_τ is the number of input arguments to τ. An h-function (or function) h ∈ H represents a microservice that performs some unit of work in a pipeline. An h-function takes as input a sequence of n_i input data instances and computes a sequence of n_o output data instances. Each input data element k̂_j ∈ K̂, 1 ≤ j ≤ n_i, has been verified through an s ∈ S, identified from s_ID ∈ S_ID, so that the inputs to h are valid (i.e., so that the appropriate s ∈ S outputs a 1 for each instance k̂_j).
Also, each of these input data instances may have been generated by transforming data into the required format, using one data transformation function τ ∈ T. Each h outputs a sequence of instances ℓ̂_j ∈ L̂, 1 ≤ j ≤ n_o, which are also verified through s_ID ∈ S_ID and elements s ∈ S, so that the sequence of outputs from h is valid (i.e., so that the appropriate s ∈ S outputs a 1 for each instance ℓ̂_j). Thus, we have the following. An h-function h : K̂^{n_i} → L̂^{n_o} is defined by (ℓ̂_1, ℓ̂_2, . . ., ℓ̂_{n_o}) = h(k̂_1, k̂_2, . . ., k̂_{n_i}), where k̂_j ∈ K̂, 1 ≤ j ≤ n_i, and ℓ̂_j ∈ L̂, 1 ≤ j ≤ n_o.
It is useful to define the composition of all h-functions within a pipeline, because this composition identifies the order in which h-functions execute. It naturally identifies the (input) data files that must exist before the pipeline starts and the output files that are generated. Some input files for some h-functions are not specified initially because they are generated by other (preceding) h-functions. As the preceding model description indicates, one data transformation function may need to be executed on each input before each h-function is invoked, to put each input data instance k̂ into the required format for h. If there are n_i inputs to h, then the number of data transformation functions is n_i (one or more transformation functions may be the identity function). Hence, executing one h-function h_j can be thought of as executing a composition of functions (h_j ∘ τ*_j), where τ*_j represents the n_{i,j} transformation functions required to put all inputs for h_j into the proper formats to execute h_j. A composition of n_f h-functions, H : K̂^{n_{p,i}} → L̂^{n_{p,o}}, is defined by (ℓ̂_1, ℓ̂_2, . . ., ℓ̂_{n_{p,o}}) = H(k̂_1, k̂_2, . . ., k̂_{n_{p,i}}).
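The pattern of pairing each h-function with its per-input transformations can be sketched directly as function composition; the names below are illustrative only.

```python
def compose(h, taus):
    """Return the composition (h ∘ tau*): apply one transformation per input
    (identity when no reformatting is needed), then invoke the h-function."""
    def wrapped(*q_hats):
        k_hats = [tau(q) for tau, q in zip(taus, q_hats)]
        return h(*k_hats)
    return wrapped

identity = lambda q: q
# Example: an h-function that sums two numeric inputs; the first input
# arrives as a string and needs a parsing transformation, the second does not.
h1 = compose(lambda a, b: a + b, [float, identity])
```

Here `h1("2.5", 1.5)` transforms the string to a float before the h-function runs, exactly as each τ*_j prepares inputs for h_j.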
We define K̂* ≡ K̂^{n_{p,i}} and L̂* ≡ L̂^{n_{p,o}} as shorthand. Thus, the n_{p,i} input files that must exist before the pipeline is invoked are represented by K̂*. The n_{p,o} pipeline outputs are represented by L̂*. It is often convenient to represent H as the (ordered) sequence H = ((h_1 ∘ τ*_1), (h_2 ∘ τ*_2), . . ., (h_{n_f} ∘ τ*_{n_f})), where the ordering gives the order of execution.

Algorithm of the execution of the pipeline framework
With the formalism of Section 6.1, we now present the execution of the pipeline framework, given in Algorithm 1. The algorithm steps through each h_i ∈ H and, for each input k̂_i of h_i, determines whether it needs to be created by transforming one or more data instances. If so, the inputs q̂′_j to the transformation function τ for computing k̂_i are obtained and verified using schema verification functions s. The transformation function is executed, and the output data instance k̂_i is verified. At this point the required input data for h_i exist; h_i is invoked, and the output files are generated and stored. Note that at various points, data file formats are verified using schema verification functions. The output files L̂* are returned.
Algorithm 1 Steps of the Algorithm PIPELINE EXECUTION. For each h_i ∈ H, and for each input k̂_i of h_i that must be created by a transformation:
A. Get the datatype k_i from instance k̂_i.
B. Identify the transformation function τ using τ = τ_ID(h_i, k_i).
C. Let Q̂′ = {q̂′_1, q̂′_2, . . ., q̂′_{n_τ}} be the set of n_τ existing input instances to the transformation function τ, obtained from the definition of τ, such that k̂_i = τ(q̂′_1, q̂′_2, . . ., q̂′_{n_τ}).
D. For each q̂′_j, obtain the schema s ∈ S as s = s_ID(q′_j), and verify the format of q̂′_j by invoking s(q̂′_j).
E. Use the data transformation function τ to compute the input k̂_i for h_i, in the proper format, according to k̂_i = τ(q̂′_1, q̂′_2, . . ., q̂′_{n_τ}); obtain the schema s = s_ID(k_i) and verify k̂_i by invoking s(k̂_i).
Then execute h_i to generate its outputs, and verify the format of each output ℓ̂_j (1 ≤ j ≤ n_o) by obtaining the corresponding datatype ℓ_j and schema s = s_ID(ℓ_j), and invoking s(ℓ̂_j). If s(ℓ̂_j) = 1, then the output file format is verified. Otherwise ℓ̂_j is not verified, which is an error, and the pipeline gracefully terminates.
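A compact Python sketch of this control flow follows. All names are stand-ins for the formal mappings: `schema_of(x)` plays the role of s_ID composed with the schema evaluators, and `tau_for(h, i)` plays the role of τ_ID (returning None when an input is already in the required format).

```python
def run_pipeline(H, pipeline_inputs, schema_of, tau_for):
    """Sketch of Algorithm 1. H: ordered h-functions; pipeline_inputs[h]:
    list of instances (raw or pre-formatted) for h; schema_of(x): returns a
    0/1 evaluator for x's datatype; tau_for(h, i): transformation for h's
    i-th input, or None."""
    store = {}
    for h in H:
        k_hats = []
        for i, q_hat in enumerate(pipeline_inputs[h]):
            tau = tau_for(h, i)
            if tau is not None:
                assert schema_of(q_hat)(q_hat) == 1   # step D: verify tau input
                q_hat = tau(q_hat)                     # step E: compute k_hat
            assert schema_of(q_hat)(q_hat) == 1        # verify k_hat's format
            k_hats.append(q_hat)
        ell_hats = h(*k_hats)                          # execute the h-function
        for ell in ell_hats:
            assert schema_of(ell)(ell) == 1            # verify each output
        store[h] = ell_hats
    return store

# Tiny demonstration: one h-function whose single input arrives as a string
# and is transformed (parsed) to an int before execution.
is_valid = lambda x: (lambda y: 1 if isinstance(y, (int, str)) else 0)
def h_double(a):
    return (a * 2,)
result = run_pipeline([h_double], {h_double: ["3"]}, is_valid, lambda h, i: int)
```

The graceful-termination behavior of the real framework is modeled here only by the assertions failing.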

Return L̂*.
The description thus far in this section has focused on a single pipeline. However, the model is equally valid across pipelines. In fact, grouping sets of h-functions into multiple pipelines, as we do herein, is largely a matter of practicality, and aids in software system organization and in reasoning about such systems. However, from Sections 6.1 and 6.2, it should be clear that all data transformation functions and h-functions could be put into a single large pipeline.

Mapping of model onto the software system
One reason for the particular development in Section 6.1 above is that it parses the model into components that are the responsibility of the pipeline framework, software that users put into a pipeline, and user-supplied information regarding data. For example, the input datatypes K and instances K̂ for a pipeline or a collection of pipelines must be supplied by an analyst, or come from some previous analysis.
The schema ID mapping and the schemas themselves are provided by the analyst to ensure that inputs and computed results conform to specified formats and contain the proper types of information. The execution of schemas to verify data representation instances is the responsibility of the pipeline (not the functions). Data transformation functions and h-functions are executable software, and may be stand-alone executables that constitute processes. They are provided by an analyst or software developer. It is the pipeline's responsibility to invoke the correct functions in the correct order, and to access the proper input files and store the resulting output files, all of which are specified in a human-generated pipeline configuration file (addressed below). Functions are responsible for generating correct outputs.

Pipeline design and implementation
With the conceptual view of pipelines in Section 5 and the mathematical model and algorithm in Section 6, we now present the pipeline design and implementation. We address several topics in this section and in the referenced appendices. These include the composability of pipelines, pipeline configuration files, descriptions of the five pipelines, h-functions and their configuration files, examples of pipeline configuration files, detailed representations of two of the pipelines, and a compilation of all implemented h-functions.

Pipelines
Two pipelines are depicted with black boxes in Fig 11. The major elements of a pipeline are the configuration file, data files and schemas, the pipeline framework, h-functions, and transformation functions. Table 5 provides an additional overview of several of these elements.
All pipelines in the system have been developed on this project and for the work described herein. We have added pipelines and functions over the course of a year, demonstrating the extensibility of the system, without modifying the pipeline framework code discussed in Section 7.1.2.

Pipeline configuration file.
To run a pipeline (called a job), a configuration input file specifies functions and their order of execution. Table 6 overviews the entire pipeline configuration file, with a definition for each parameter. JSON schema files exist for each component in the data common specification from Section 3.3. The functions component defines the available h-functions to run in the pipeline and the input files for each function. Appendix B contains a detailed example of a configuration file.
Fig 11 illustrates this process. A pipeline-specific configuration input file is verified and read by the pipeline framework. The file specifies h-functions and their order of execution, as well as the required input files to the pipeline. The figure shows how function h_1 is executed in a pipeline 1 and how h_4 is executed in a pipeline 2. The pipeline framework invokes the corresponding functions. If specified in the configuration file, the pipeline framework invokes a transformation function that transforms the contents of one or more files into an input file of correct format for the h-function. There may be one transformation function for each direct input to an h-function. At appropriate points in a pipeline, data files are verified against their corresponding JSON schemas (input file verification). The h-function is executed and output files are generated (these digital object outputs may be, e.g., plot files, ASCII data files, and binary data files). There may be additional h-functions within pipeline 1, indicated by the ellipsis below the pipeline 1 function h_1 execution. In this example, outputs from the generic pipeline 1 are inputs for the generic pipeline 2. Function h_4 in pipeline 2 is executed in a fashion similar to function h_1 in pipeline 1. See the text for descriptions of these various components. Note: the pipeline framework (in brown) is the same code for all pipelines. See Table 5 for implementation details of the elements in this figure.
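An abbreviated, hypothetical configuration along these lines might look as follows; the keys, function names, and file names are made up for illustration (the authoritative example is in Appendix B).

```json
{
  "job_id": "anagram-analytics-01",
  "pipeline": "data_analytics",
  "functions": [
    {"name": "h_clean_actions", "inputs": ["exp1_raw_actions.json"]},
    {"name": "h_request_counts", "inputs": ["h_clean_actions.output_1"]}
  ]
}
```

The ordering of the functions array gives the order of execution, and an entry's inputs may name outputs of earlier functions, mirroring the file dependencies described above.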
Table 5. Sections and files from the execution of a generic pipeline.
Pipeline i (input files specified for execution):
1. Configuration input file (JSON): specifies the h-functions to execute within pipeline i, and their order of execution.
2. Input files (JSON): input files to the pipeline, i.e., files required to execute h-functions in the pipeline (possibly outputs from upstream pipelines).
Pipeline framework: the functions are invoked in the order in which they are specified to execute.

Functions within pipelines.
Each pipeline has a list of available functions. The functions can be written in any programming language. Currently we have h-functions written in C++, Python, and R. A function may use as input any combination of outputs from preceding functions in the same pipeline, functions in preceding pipelines, files from previous iterations, and data from experiments.
Currently there are 29 functions across five pipelines. A summary of the h-functions in each of the five pipelines is provided in Table 7. Listings and details of all functions implemented per pipeline are provided in Appendix D (one table for each pipeline).

Case studies
The purpose of the three case studies is to demonstrate the utility (i.e., usefulness) of the pipeline system. The first case study (Study 1) uses all five pipelines. This study took two years to complete: building software, running experiments, varying treatments, analyzing data, building multiple models, validating and exercising models, and testing hypotheses. We iterated over these operations, as suggested in Fig 3. The pipelines of this manuscript were used for all of the work in this case study. We consider this to be a very large case study. The purpose of case studies 2 and 3 is different. Our goal there is to demonstrate the versatility and wide applicability of the pipeline system. For each of these case studies, we take experiments or computations from other researchers' works in the literature, and demonstrate through our data model that our system can analyze the data and computations of those works. In case study 3, we could also include their model in our pipelines. Other works in the literature [1, 3-6, 9, 55] can also be analyzed with our pipelines.

Study 1: Entire system execution for collective identity experiments
Collective identity (CI), as defined by [43], is an individual's cognitive, moral, and emotional connection with an enclosing broader group such as a team or a community. CI is important in many applications and contexts, making it worthy of study. For example, CI is important in the formation and maintenance of teams, and team behavior [56,57]. It is also important in the formation and enforcement of norms [56,57].
Here, we use a completely cooperative game to produce CI among the team members playing it. We want to measure the amount of CI created between team players in an experiment. The experiment includes three phases. In phase-1, the DIFI index [58] measures (as a baseline) the individual levels of CI. In phase-2, CI is created between team members using a collaborative anagram game. In phase-3, using the same index as in phase-1, the individual levels of CI in players are measured again.
Here, we use the Dynamic Identity Fusion Index (DIFI) score [58] as a proxy for CI. The DIFI score is measured individually as part of our online experiments in the following way. A small (movable) circle represents an individual player and a second (stationary) larger circle represents the team. A player moves the small circle along a horizontal axis, where the distance between circle centroids represents that player's sense of identity with the team; it is their DIFI score. The range of the DIFI value is −100 ≤ DIFI ≤ 125; DIFI = 0 corresponds to the two circles just touching, DIFI < 0 means that the two circles are disjoint (an individual has no positive affinity for the team), and DIFI > 0 means that the two circles overlap (an individual identifies with the team).
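The stated semantics of the DIFI value can be captured in a small helper. This is an illustrative sketch only; the function name and the error handling are ours, not part of the experiment software:

```python
def interpret_difi(score: float) -> str:
    """Classify a DIFI score per the semantics stated in the text.

    Valid range: -100 <= DIFI <= 125. A score of 0 means the two circles
    just touch, negative means disjoint, positive means overlapping.
    """
    if not -100 <= score <= 125:
        raise ValueError("DIFI score out of range [-100, 125]")
    if score < 0:
        return "disjoint"      # no positive affinity for the team
    if score == 0:
        return "touching"      # circles just touch
    return "overlapping"       # individual identifies with the team
```

For example, `interpret_difi(-50)` returns `"disjoint"`, while any score above zero maps to `"overlapping"`.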
As a priming activity to foster CI among team members, in phase-2 they play a collaborative word construction (anagram) game motivated by [6]. This phase is the focus of our case study.

Web-based experiment software platform, game play and data collection.
We built a web application to conduct experiments. The primary components of our platform are the oTree framework [59], Django Channels, and the online web interface. Each phase of the experiment has software, designed and developed by us, that interfaces with oTree. Interactions among players are supported by Django Channels; individual participants and the server communicate via websockets. Fig 12 shows the web interface for each player of the anagram game. The experiment interface enlists players from Amazon Mechanical Turk (MTurk) and registers actions from all the players in all phases. Clicks and their event times are recorded as actions on defined HTML objects, such as letters and submit buttons.
In phase-2, at the beginning of a game, players receive three letters and communication channels to d other players; through these channels players can help each other form new words by sharing letters. Based on the recruited number n of players, the experimental platform creates a d-regular graph on the n players. Players can perform the following actions: request letters from neighbors, reply to letter requests from neighbors, and form words; these actions are explained in detail in the caption of Fig 12. The objective of the game is to form as many words as possible as a team. The total number of words formed by the team defines the earnings in a game. Earnings are divided uniformly among players. For a word to be valid for a player, it must not repeat in that player's list of formed words; however, more than one player may form the same word. Each player possesses an infinite stock of each of the three initial letters received. This means a player can use these initial letters more than once to form words, and also freely share them with neighbors. These features are designed to promote cooperation.

8.1.2 Data analysis, modeling and simulations, and modeling evaluations using the pipelines. Some data model features from Table 3 are provided in Fig 13. For the DIFI measures (phases 1 and 3), the action set A, with its one element (submit DIFI score), is shown, and the action sequence T is the action tuple of submitting the DIFI score for each agent. For phase 2, the word construction game, the edge set E for the four players is provided, as is the action set A, containing four elements. The action "thinking" is a no-op in the model. Initial letter assignments to players, which are part of B_{v_j} for each node (player) v_j, are shown. So, too, is an illustrative sequence of action tuples. For example, T_3 states that v_i requests the letter "G" from v_3.
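The word-validity rules of the anagram game can be sketched as follows. This is a simplified illustration: the names are ours, and we assume a dictionary check and that letters received from neighbors are reusable, neither of which the text states explicitly:

```python
def can_form_word(word, own_letters, received_letters, formed_words, dictionary):
    """Sketch of the word-validity check described in the text (assumed rules).

    own_letters / received_letters: sets of letters treated as being in
    infinite supply (the text states initial letters are unlimited; we
    assume fulfilled letter requests behave the same way).
    """
    word = word.upper()
    if word in formed_words:       # no repeats in this player's own list
        return False
    if word not in dictionary:     # must be a real word (assumed)
        return False
    available = set(own_letters) | set(received_letters)
    return set(word) <= available  # every letter must be available
```

For example, a player holding "R," "O," "L" who has received an "E" can form "ROLE", but not "AREA" (no "A" available), and not "ROLE" a second time.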
Several ABMs were built to model the phase 2 group anagram game. The ABM described here is built on a transition probability matrix, where the transition probability from one action a(t) = a_i at time t to the next action a(t + 1) = a_j, for each agent v, with i, j ∈ [1..4] and a(t) ∈ A, is given by π_ij = Pr(a(t + 1) = a_j | a(t) = a_i). We use i and j to represent the actions a_i and a_j ∈ A. Agent v executes a stochastic process driven by the transition probability matrix P = (π_ij)_{m×m}, where m = |A| (here, m = 4). A multinomial logistic regression model is used for π_ij. Details are in [7]. During the 5-minute game, the ABM predicts action tuples T_i for each participating player v_i.
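A minimal sketch of such an agent follows: a transition matrix estimated here by simple counting (the paper instead fits π_ij with multinomial logistic regression) drives a stochastic action process. All names are illustrative:

```python
import random
from collections import defaultdict

def estimate_transition_matrix(sequences, actions):
    """Estimate P = (pi_ij) from observed action sequences by counting
    (a stand-in for the paper's multinomial logistic regression fit)."""
    counts = {a: defaultdict(int) for a in actions}
    for seq in sequences:
        for a_i, a_j in zip(seq, seq[1:]):
            counts[a_i][a_j] += 1
    P = {}
    for a_i in actions:
        total = sum(counts[a_i].values())
        # Fall back to a uniform row when an action was never observed.
        P[a_i] = {a_j: (counts[a_i][a_j] / total if total else 1 / len(actions))
                  for a_j in actions}
    return P

def simulate(P, start, steps, rng=random):
    """Run one agent's stochastic process driven by P for `steps` moves."""
    seq, a = [start], start
    for _ in range(steps):
        a = rng.choices(list(P[a]), weights=P[a].values())[0]
        seq.append(a)
    return seq
```

Each row of `P` is a probability distribution over the next action, so row sums are 1; `simulate` then produces an action sequence analogous to the predicted action tuples.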
In this study, the system of Fig 3 is executed over many loops, sometimes completely and other times only portions of it. In this case study we examine only the anagram game. We perform one iteration of three experiments, with n = 6 players and d = 5 neighbors per player. Figs 14-16 display results for the Data Analytics Pipeline (DAP); Fig 17 is discussed below.

Fig 12. The anagram game screen, phase-2, for one player. This player's own letters are "R," "O," and "L," and the player has requested an "E" and an "A" from neighbors. The "E" is green, so this request has been fulfilled and the "E" can be used in forming words; the request for the "A" is still outstanding, so it cannot yet be used in words. Below these letters, the screen shows that Player 2 has requested "O" and "L" from this player. This player can reply to these requests, if she so chooses. Below that is a box where the player types and submits new words.

The following paragraphs discuss details of these results. Fig 14 presents a plot, generated by h_3, of the time series of words formed by each player of one game. When a new word is formed, a step in a curve indicates the time. "Form word" is a_4 ∈ A in Fig 13.

The β coefficients in Fig 17 are parameters in the multinomial logistic regression model alluded to above. In the π_ij terms above, each transition is from action i to action j. For example, the β coefficients at the bottom are for transitions from forming a word (a_4 in Fig 13) to the next action being a_2 through a_4; the probability that the next action is a_1 (thinking) is 1 minus the sum of the other three transition probabilities.
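A hedged sketch of how β coefficients yield one transition row, assuming a standard multinomial logit with a_1 ("thinking") as the baseline category; the paper's actual covariates are not given in this excerpt, so the feature vector here is illustrative:

```python
import math

def transition_probs(betas, x):
    """Multinomial-logit transition row with a_1 as the baseline action.

    betas: one coefficient vector per non-baseline next action (a_2..a_4);
    x: feature vector (illustrative; actual covariates are in the paper).
    """
    # exp(beta_j . x) for each non-baseline next action
    scores = [math.exp(sum(b * f for b, f in zip(beta, x))) for beta in betas]
    denom = 1.0 + sum(scores)
    probs = [s / denom for s in scores]   # P(next = a_2), ..., P(next = a_4)
    return [1.0 - sum(probs)] + probs     # P(next = a_1) = 1 - sum of others
```

By construction the four entries are positive and sum to 1, matching the statement that the probability of a_1 is 1 minus the sum of the other three.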
In Fig 18, the Modeling and Simulation Pipeline is employed to create all three plots (the first two for simulating experiments, the third for predictions beyond the experiments). The Model Evaluation and Prediction Pipeline is employed in the first two plots to compare experiments and model predictions.
Appendix F describes two more case studies. Study 2 in Appendix F.1 shows the data model for the online experiment in [3]. Study 3 in Appendix F.2 shows the data model for a simulation study in [44].

Related work
We address several different topics below.

Online social science experiments
In order to understand human behavior, there has been significant interest in using online systems to carry out social science experiments. These experiments analyze a variety of phenomena, such as collective identity [17,60,61] and cooperation and contagion [62], to name a few. The methodological and practical challenges of online interactive experimentation, and the value of an online labor market, have been discussed in different studies [63,64]. The benefits of online experiments, compared to in-person experiments, include reduced costs, an agile logistics process, and the collection of detailed data. Research teams use different options to design and deploy their online experiments. While some teams create web-based programs especially designed for their research [17,61,62], others use web-based experimental platforms that provide this service [60,63]. In [60] the online platform Volunteer Science [35] was used to implement a web-based public goods experiment and to recruit participants around the world. In [63], a repeated public goods experiment was implemented in LIONESS [36], a free web-based platform for interactive online experiments, and participants were recruited via Amazon Mechanical Turk (MTurk). In [37], a modular virtual lab named Empirica offers a development platform for virtual lab experiments, which the authors claim is accessible even to novice programmers. There are tools that focus on adaptive experimentation, like Facebook Ax [38], an accessible, general-purpose platform for understanding, managing, deploying, and automating adaptive experiments. Usually these platforms focus only on the design and running of online lab experiments; they do not offer a complete automated solution for experiments, analysis, modeling and simulation, and evaluation.

(Networked) experiments in the social sciences. Experiments with interacting participants can be represented as networks, where edges represent interaction channels.
There are several online and in-person experiments with individuals [60,61,65-69] and groups [1,3-6,9]. Some include modeling of the experiment [9]. However, none of these works appears to perform iterative evaluations involving modeling and experiments. There is no platform, that we know of, that supports the iterative process of data analysis, design of data-driven models to simulate experiments, and model validation and verification in order to predict behavior. In this work our focus is to formalize a general methodology, through a generic data pipeline, for online controlled experiments with human subjects aimed at explaining diverse phenomena.
Simulation frameworks. There are many frameworks for developing simulations. In [19], four design patterns systematize and simplify the modeling and implementation of multilevel agent-based simulations. In [20], a framework for developing agent-based simulators as mobile apps and online tools is presented, with a case study in the field of health and welfare. In [21], a methodology for artificial neural network based metamodeling of simulation models is presented, for the case where online decision making routines are invoked repetitively by the simulation model throughout the simulation run. We believe none of these frameworks provides composable and extensible pipelines for studying networked social science phenomena through combined modeling and experiments.

Workflow systems
There are many workflow systems. Here, we cite several popular workflow systems and then describe how they relate to social sciences and pipelines for computation. Examples include Taverna [70] for bioinformatics, chemistry, and astronomy; Pegasus [71] and CyberShake, built on Pegasus [72], for large-scale workflows in astronomy, seismology, and physics; Kepler [73,74] for ecology and environmental workflows. Other workflow engines include Toil [75], and Rabix [76] developed for computational biology.
We believe none of these systems addresses social sciences for modeling/experiments as we do here. As an illustration, suicide data is analyzed with Taverna in [77] and Galaxy is used for genomic research [78]; neither has a component for modeling.
In the social sciences most workflows are for social network analyses [28]; we seek to go well beyond that. In [79], a taxonomy of features is defined from the way scientists make use of existing workflow systems; this provides end users with a mechanism to assess the suitability of workflow systems and make an informed choice for a particular application. The importance of interoperability between these systems is detailed in [80], which identifies three dimensions: execution environment, model of computation (MoC), and language. MoCs provide the semantic foundation, but a data model is a prerequisite. References [27,28,79,81] are among the works that overview several workflow systems. An overview and discussion of future directions is provided in [82]. Challenges and future directions for life science workflows are provided in [83]. Ontologies for workflow objects are discussed in [84].
Workflow languages are usually represented in a textual manner or through graphical interfaces. A textual representation is often employed for storing the workflows in files, even when a graphical representation is employed. For full interoperability, it is important to have the capacity to translate between workflow languages [80]. Wings [85] uses rich semantic representations to compactly describe complex scientific applications in a data-independent manner. Swift [86] and Swift/T [87,88] are workflow languages built for executing parallel programs within workflows. NextFlow [89] is a domain specific language for computational workflow management systems. Workflow languages include Common Workflow Language (CWL) [76,90] and Workflow Description Language (WDL) [91]. Script of Scripts [92] is a workflow system with an emphasis on support for different scripting languages.

Microservices
Our pipelines take a microservices conceptual approach. First defined in 2012, microservices [93] are an architectural style addressing how to build, manage, and evolve architectures out of small, self-contained units [40-42,94]. The h-functions of our pipelines have a narrow scope; this way, new functions can be added for new experiments and models in a targeted way, promoting reuse by avoiding duplicated capabilities.
Microservices Architecture (MSA) and Service-Oriented Architecture (SOA) both rely on services as the main component, but they vary greatly in terms of service characteristics. SOA divides applications into sets of business applications offering services through different protocols; this aims to manage complexity. SOA applications are costly and complex, and are designed to support high workloads and large numbers of users. Reference [93] states that microservices keep services independent so that a service can be individually replaced without impacting an entire application.
In 2012, [95] defined microservices as a way to build software more swiftly by dividing and conquering, using Conway's Law to structure teams. Issues, advantages, and disadvantages of microservices are identified in [96]. For example, one issue identified is system decomposition. Advantages include increased scalability and clear boundaries; disadvantages include the difficulty of learning the approach. The microservice architectural style is widely used by companies such as Amazon [97], Netflix [98], and many others.

Data models
In [99], a data model is presented for supporting the modeling, execution and management of emergency plans before and during a disaster. In [100], aspects of a business data model are described. In [101], a data model is presented for capturing workflow audit trail data relevant to process performance evaluation. In [102], models for social networks that have mainly been published within the physics-oriented complex networks literature, are reviewed, classified and compared.
In [103], an object-relational graph data model is proposed for modeling a social network. It aims to illustrate the power of this generic model to represent the common structural and node-based properties of different social network applications. A multi-paradigm architecture is proposed to efficiently manage the system. In [104], a semantic model that can naturally represent various academic social networks is presented; it describes various complex semantic relationships among social actors.
Formal models of pipelines. The possibility of incorporating formal analytics into workflow design is investigated in [100]. It provides a model that includes data dependencies. The workflow design analytics they propose helps construct a workflow model based on information about the relevant activities and the associated data. Also, it helps determine whether the given information is sufficient for generating a workflow model and ensures the avoidance of certain workflow anomalies. A detailed treatment of data dependencies is found in [54].
In [105], to improve data curation process efficiency for biological and chemical oceanography data studies, pipelines are defined using a declarative language. The pipelines are serialized into formal provenance data structures using the Provenance Ontology (PROV-O) data model (defined in the paper).

"-Ilities;" reproducibility; interoperability; composability; extensibility; scalability; reusability; and traceability
Foreseeable and unforeseeable changes occur in a system; ilities are attributes that characterize a system's ability to respond to both. Ilities describe what a system should be, providing an enduring architecture that is potent and durable, yet flexible enough to evolve with the insertion of new systems.
The use of ilities for systems engineering of subsystems and components is investigated in [106]. They show how some ilities are passed and used as a non-functional property of electrical and structural subsystems in aircraft. They demonstrate that a useful practice for systems engineers, to ensure that customer needs are actually met by the system under design or service, is to flow ilities down to the subsystem level. The system ilities are passed down and translated from non-functional to functional requirements by subject matter experts.
Pipelines and workflows provide reproducibility [84], interoperability [107], and reusability [84]. The microservices conceptual approach of our pipelines satisfies the reproducibility, interoperability, and reusability properties. We also demonstrate the pipelines' composability, as well as their extensibility, scalability, and traceability properties.

Conclusion, future work, and limitations
Online social science experiments are used to understand behavior at-scale. Considerable work is required to perform data analytics for custom experiments. Furthermore, modeling is often used to generalize experimental results, enabling a greater range of conditions to be studied than through experiments alone. In order to transition from experiments to modeling, model properties must also be inferred. Consequently, our work presents a software pipeline system for evaluating social phenomena that are generated through controlled experiments. Our scope in this manuscript ranges from formal models through software design and implementation. Our models include a formal experimental data model (and data common specification), a network-based discrete dynamical systems model (graph dynamical system, GDS), and a formal model for pipeline composition. These models aid in reasoning, in a principled way, about the architecture, design, and implementation of five software pipelines, which currently contain 29 functions. The pipelines are composable and extensible, and they can be operationalized for different methodologies (e.g., deductive and abductive analyses). We provide three case studies, on collective identity, complex contagion, and explore-exploit behavior, respectively, to illustrate the successful use of the system. We are adding these pipelines to a larger job management system and are developing new h-functions for new models. Contact Vanessa Cedeno (vcedeno@vt.edu) or Chris Kuhlman (cjk8gx@virginia.edu) for the system code. A repository with a user manual is available at https://github.com/vcedeno/PLOS_ONE_Pipelines_Supporting_Information.
There are limitations to this work. There is a host of other types of experiments that might demand different types of data analytics, and there is a variety of modeling approaches, e.g., structural equation, statistical, differential equation models, that can be added to a pipeline system. Another limitation, and an opportunity for future work, is to provide a data specification for both experiments and analyses. Specifically, Section 2.1 identified experimental platforms that are customizable [35][36][37][38] in ways that are analogous to our approach for customizable software analysis pipelines. A single specification language for experiments and analyses could be used to coordinate experiments and analyses. Also, it may be possible to use artificial intelligence techniques to provide insight into external validation based on an experiment specification.

A Appendix: Data common specification
This appendix provides a concrete view into the system. The definition of a data common specification in Fig 3 provides the bridge between the abstract data model and the implementation of the pipelines; see Fig 6. Table 8 shows a description of the elements of the Data Common Specification. JSON schemas provide a detailed specific view of the implementation aspect of our pipelines. Because we go into detail, this is an exemplar for other types of problems. These are the types of files we use in the case studies in Section 8.

C Appendix: Examples of the software system
This Appendix shows examples of input files for the Experimental Data Transformation Pipeline (Fig 30), and the Data Analytics Pipeline (Fig 31). Here we show how a function is executed in a generic pipeline. Input files are validated against their corresponding JSON schema. If necessary, file contents are transformed (possibly outputs from upstream functions) to obtain the direct input for a function in the correct format. After verification of formats by the corresponding JSON schemas, the function is executed and output files are generated (these digital object outputs may be, e.g., plot files, ASCII data files, and binary data files).
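The validate-then-execute step can be illustrated with a minimal stand-in for JSON Schema validation. The real pipelines use full JSON schemas; this sketch only checks required keys and primitive types, and the schema and function names shown are hypothetical:

```python
def validate_input(obj, schema):
    """Minimal stand-in for JSON Schema validation: checks required keys
    and primitive types only (the pipelines use complete JSON schemas)."""
    types = {"integer": int, "string": str, "array": list, "object": dict}
    for key in schema.get("required", []):
        if key not in obj:
            return False, f"missing required field: {key}"
    for key, spec in schema.get("properties", {}).items():
        if key in obj and not isinstance(obj[key], types[spec["type"]]):
            return False, f"wrong type for field: {key}"
    return True, "ok"

# Hypothetical schema for one pipeline input file.
SESSION_SCHEMA = {
    "required": ["exp_id", "players"],
    "properties": {"exp_id": {"type": "integer"},
                   "players": {"type": "array"}},
}
```

A function would be executed only when `validate_input` succeeds; otherwise the pipeline rejects the file and reports which field failed.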

D Appendix: Pipeline functions
In this Appendix, we describe the characteristics of the atomic element of a pipeline: the function. If a new component is added to a pipeline, it is introduced as a new function. We provide a listing of the types of functions, as microservices, within each of the five pipelines. We show five tables, one for each pipeline, with a list of available functions. Table 9 shows one function for the (1) Experimental Data Transformation Pipeline (EDTP). Table 10 shows fourteen functions for the (2) Data Analytics Pipeline (DAP). Table 11 shows four functions for the (3) Property Inference Pipeline (PIP). Table 12 shows five functions for the (4) Modeling and Simulation Pipeline (MASP). Table 13 shows five functions for the (5) Model Evaluation and Prediction Pipeline (MEAPP).
The functions provide a range of capabilities, from simple plotting routines to cleaning and organizing, storing and accessing data sets, and inferring properties and running simulations. Users may add other functions and continue community-based development, as these functions are not exhaustive. Each function completes one well-defined task. Many of these functions can be used in multiple contexts; functions use the pipeline as a universal interface. For example, the action progression function h_3 of the Data Analytics Pipeline generates a plot of the number of actions a_i per player in time, for all a_i ∈ A. Also, often a function represents a category of operation; e.g., there are six different agent-based models (ABMs) under h_1 of the Modeling and Simulation Pipeline. Currently, functions are written in the following programming languages (PLs): C++, Python, and R.

This is an example of the (1) Experimental Data Transformation Pipeline execution to transform raw experimental data into the data common specification. Here we show how function h_1 is executed, with an input CSV file as an example of the "Completed Session Summary" input file. If necessary, file contents are transformed to obtain the direct input for a function in the correct format; here, the "Completed Session Summary" CSV input file is transformed into a "Completed Session Summary" JSON file that becomes the input for the function. After verification of formats by the corresponding JSON schemas, the function is executed and output files are generated; here we show the output JSON file for the "Experiment" data common specification.

9. Wary of sharing capability between services: the more multiple microservices share, the more services become coupled to internal representations, which decreases autonomy.

Distance of actions. Generate a file with the distance between two actions. The distance has to be provided by the analyst (e.g., for the action of forming a word, the Levenshtein distance between two words formed). Purpose: compare action characteristics in an experiment. Output: data files.

h_13 Rank of actions. Generate a file with the rank of an action. The rank has to be provided by the analyst (e.g., for the action of requesting a letter, the letter rank comes from a specified list). Purpose: compare action characteristics in an experiment. Output: data files.

h_14 Score of actions. Generate a file with a score of an action. The method to calculate the score has to be provided by the analyst (e.g., for the action of forming a word, the Scrabble score for a word formed). Purpose: compare action characteristics in an experiment. Output: data files.
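As an example of an analyst-supplied distance for the word-forming action, the standard Levenshtein (edit) distance can be computed as follows. This is a textbook implementation for illustration, not the pipeline's actual code:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two formed words (the kind of metric the
    analyst supplies for the word-forming action)."""
    # Classic dynamic program, keeping only one row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```

For example, `levenshtein("role", "lore")` is 2 (two substitutions).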
Many functions may be considered as collections of functions because they can handle multiple types of data through the data model.

Transition probabilities. Use the sequences of discrete actions to generate the probability of transition from an action a_i to an action a_j, as measured in the experiment data. Purpose: generate the properties for a Markovian transition matrix. Output: data files.

10. APIs (Application Programming Interfaces): specify/select/prefer technology-agnostic APIs so that the services are not constrained by technology. Achieve decoupling: the success of the "change/upgrade" feature is an evaluation of decoupling success; decoupling also requires good models.

2. Technology changeout.

R-squared. R-squared is a statistical measure of how close the data are to the fitted regression line. Output: data files.

h_5 Cross-Validation. The original experiment sample is randomly partitioned into k equal-size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k−1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. Purpose: demonstrate that the model is a reasonable representation of the actual system. All observations are used for both training and validation, and each observation is used for validation exactly once. Output: data files.

Many functions may be considered as collections of functions because they can handle multiple types of data through the data model.
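The k-fold procedure described for h_5 can be sketched as follows. Function names are ours; `fit` and `score` stand in for whatever model-fitting and evaluation routines the analyst supplies:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition n sample indices into k (near-)equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def cross_validate(fit, score, data, k=5):
    """Each fold serves as the validation set exactly once; the other
    k-1 folds form the training set for that round."""
    folds = k_fold_indices(len(data), k)
    results = []
    for f in range(k):
        val = [data[i] for i in folds[f]]
        train = [data[i] for g, fold in enumerate(folds) if g != f
                 for i in fold]
        model = fit(train)
        results.append(score(model, val))
    return results
```

By construction, every observation is used for training in k−1 rounds and for validation in exactly one round, matching the description above.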

E.3 Microservices as a type of service oriented architecture
Pipelines are intimately tied to microservices. While microservices may be used individually, typically, the small scope and limited features (or one feature) per service implies that they must be composed to accomplish many tasks. This composition can be accomplished with pipelines. This is not necessarily true with larger, more monolithic service oriented architectures (SOAs): these may provide broader-scope services within one module.
Microservices are one type of service oriented architecture (SOA). One example of the difference between the two is that microservices generally tend to avoid shared libraries that are used across microservices. This is because use of shared libraries means increased coupling of services. Based on the authors' experiences, this difference between microservices and SOAs in general is analogous to the difference between shared memory multi-process systems versus distributed systems, as described next.
By multi-process shared memory systems, we mean a software system composed of multiple processes that run asynchronously and use shared memory to exchange information (e.g., no message passing). In this environment, the processes are tightly coupled, because if one process requires changes in shared storage structures, these will affect all other processes that use those storage structures. That is, the software for these other processes needs to be changed, too, leading to increased maintenance. Hence, there are many interdependencies. However, in an asynchronous distributed system, each process has its own storage structures and memory, so that changes in storage structures for one process have no effect on other processes. While additional infrastructure is required for distributed systems (e.g., for message passing), this additional requirement is offset by the autonomy realized for each process. The analogy here is that a multi-process shared memory system is a classic SOA, while microservices are the distributed system.

F.1.1 Overview. In [3], the effects of network structure on complex contagion diffusion are studied via the spread of health behavior through networked online communities. We represent this experiment with the data model from Section 3. Each experiment, exp_id, consists of two independent phases (n_p = 2), one with G(V′, E′) being a clustered-lattice network and another with H(V″, E″) being a random network. V = V′ ∪ V″ is the set of all players, with player v_i ∈ V and 1 ≤ i ≤ n. There are n/2 players in each of the two networks, assigned randomly. Γ_i contains variables for v_i's profile (i.e., avatar, username, health interests), ratings of the forum content, and the state of v_i in time, i.e., whether v_i has joined the forum. The meaning of an edge is λ = communication channel between pairs of subjects. B_{v_i} contains initial conditions for the game, including values for the elements of Γ_i.
The set of actions is A = {a_1, a_2, a_3}, where a_1 is "send a message" to encourage a neighbor to adopt a health-related behavior; a_2 is "join forum," which notifies a participant every time a neighbor adopts the behavior; and a_3 is "input rating content" in the forum. In T_1, v_1 sends a message to v_2; then in T_2, v_1 sends a message to v_3. All of these are signals from v_1 to encourage health buddies to join the forum. In T_3, v_2 decides to join because of v_1's message; this is why the unique identifier σ_i for the action sequence is the same as in T_1. After this, the news is propagated to v_2's health buddy v_3 in T_4. v_2 sends a message to v_4 in T_5. In T_6, v_1 inputs rating content to the forum. This data model instance, coupled with a GDS formulation (not shown), means that the experimental data can be analyzed (and modeled) with the pipeline system.

F.1.2 Formal data model. Table 14 details the online social network experiment in [3], defined with our data model. We define one experiment with two independent phases, one with a clustered-lattice network and another with a random network. Each has a population size n = 98 and number of health buddies per person d = 6. Fig 33 shows the model of Table 14 translated into an entity-relationship diagram in unified modeling language (UML) form. This data model instance, which represents an experiment instance, means that the experimental data can be analyzed (and modeled) with the pipeline system. We can perform similar mappings for other social experiments [1,9,61].

Table 14. Data model (Table 3) for the online social network experiment in [3].
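The action tuples T_1..T_6 described above might be encoded as follows. This is a hypothetical in-memory form: the field names ("t", "actor", "target", "sigma") and the interpretation of T_4 as a notification event are our own choices, not the paper's exact data common specification:

```python
# Action labels from the text.
A = {"a1": "send message", "a2": "join forum", "a3": "input rating content"}

# Hypothetical encoding of the six action tuples; sigma links an action
# to the earlier action that triggered it (T_3 shares sigma with T_1).
T = [
    {"t": 1, "actor": "v1", "action": "a1", "target": "v2", "sigma": 1},
    {"t": 2, "actor": "v1", "action": "a1", "target": "v3", "sigma": 2},
    {"t": 3, "actor": "v2", "action": "a2", "target": None, "sigma": 1},
    {"t": 4, "actor": "v3", "action": "a2", "target": None, "sigma": 2},  # assumed interpretation
    {"t": 5, "actor": "v2", "action": "a1", "target": "v4", "sigma": 3},
    {"t": 6, "actor": "v1", "action": "a3", "target": None, "sigma": 4},
]

def actions_by(agent):
    """All action tuples performed by one agent, in time order."""
    return [a for a in T if a["actor"] == agent]
```

With such an encoding, pipeline functions can query the sequence directly, e.g., `actions_by("v1")` returns the tuples at times 1, 2, and 6.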

Experiment Schema (excerpt)
Row 1:  exp_id = 1. Experiment id for an experiment.
Row …:  G_v(t) = (g_{j,1}(t); …; g_{j,η_v}(t)) is the sequence of η_v attributes for v_j ∈ V'. Here η_v = number of initial ratings in the forum to provide content for the early adopters.
Row 10: B_v, where B_{v_j} = (avatar_{j1}, username_{j2}, health interest_{j3}, …).
Row 12: A = {a_1, a_2, a_3}, where a_1 is send message, a_2 is join forum, and a_3 is input rating content.

Note: One experiment has two independent phases, one with a clustered-lattice network and the other with a random network; each has population size n = 98 and number of health buddies per person d = 6.
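The experiment record summarized in the table can be sketched as a small data structure. Class and field names here are assumptions for illustration only; the two phases differ solely in network topology.

```python
# Minimal sketch (names assumed) of the Table 14 experiment instance:
# one experiment, two independent phases over different network topologies.
from dataclasses import dataclass, field

@dataclass
class Phase:
    network: str      # "clustered-lattice" or "random"
    n: int = 98       # population size
    d: int = 6        # health buddies per person

@dataclass
class Experiment:
    exp_id: int
    phases: list = field(default_factory=list)

exp = Experiment(exp_id=1,
                 phases=[Phase("clustered-lattice"), Phase("random")])
```

An experiment expressed this way matches the schema rows above (exp_id, population size, degree), which is the condition for the pipeline software to analyze its data.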

F.2 Study 3: Data model for a simulation study in [44]
F.2.1 Overview. In this case study, we evaluate research that is purely simulation-based and cast the problem of [44] in terms of our data model. With this mapping, we can reason that if we performed experiments according to this data model, there would be a correspondence between those experiments and the simulation system. Hence, in a sense, this case study demonstrates a process of going from modeling to experiments. Note also that even with simulation models and no experiments, our pipeline system can still be used. The model in [44] investigates how the structure of communication networks among actors can affect system-level performance. It is an agent-based computer simulation model of explore-exploit tradeoffs with information sharing: [44] produces an arbitrarily large number of statistically identical "problems" for the simulated agents to solve (explore), and the less successful agents emulate the more successful ones (exploit). The authors state that solutions involve the conjunction of multiple activities, in which the impact of one dimension on performance is contingent on the values of the other dimensions. For example, activities A, B, and C may each hurt performance unless all are performed simultaneously, in which case performance improves dramatically. Such conjunctions are defined as synergies, and the presence of such synergies produces local optima.
F.2.2 Formal data model. Table 15 details the model in [44], defined with our data model. We define one experiment with one phase, with a population of 100, 20 human activities, and 5 synergies (i.e., sets of activities that, when performed simultaneously, dramatically improve performance). Here we also provide an example of an action sequence. In T_1, v_1 posts a solution; in T_2, v_2 posts a solution. In T_3, v_3 evaluates v_1's solution, and in T_4, v_3 copies the solution from v_1. The payload holds information about how faithfully agents copy solutions from others, i.e., whether the copy was a "mimic" or an "adapt." The model of Table 15 is also translated into an entity-relationship diagram in unified modeling language (UML) form.
This data model instance, which represents a modeling instance, means that the computational modeling results can be analyzed with the pipeline system.