
Fig 1.

A representation of the steps in iterative abductive analysis.

The process begins with conducting experiments and flows clockwise through reasoning about data and what experiments to perform next, whereupon the process repeats. Deductive analyses include these steps, but modeling occurs before experiments, so that the steps are rearranged. Parts of many of these steps (e.g., computing model properties) can be automated, and this automation is the focus of this paper. Other steps are not automated, such as the process of developing a model, because this requires a significant element of human reasoning. Thus, our software system requires human-in-the-loop execution. The process can be used in a purely experimental approach (i.e., no modeling). See the text for a description of this graphic.

Fig 2.

Roadmap of, and relationships among, sections in this manuscript.

Arrows indicate dependencies among sections, and dashed arrows identify the theoretical models that impact the design and implementation of the software pipeline system. The Introduction, Related Work, and Conclusions are not shown. See text for details.

Fig 3.

Five software pipelines (in gray) for NESS experiments.

The five pipelines are itemized and described in Table 1. In this human-in-the-loop analysis, experiments (upper left in the figure) are performed. Any experiment whose data can be cast in terms of the data model specification can be analyzed with this system. These pipelines are the focus of this work. The pipeline composition shown here, for abductive looping, is one of several possibilities. The first, second, and fifth pipelines can be used with a purely experimental approach (omitting modeling). An earlier version of the pipeline system is provided in [34], Fig 1.

Table 1.

Description of the five pipelines for NESS experiments.

Fig 4.

The three types of models described in this work: (Abstract) data model, graph dynamical system model, and pipeline model.

The data model enables rigorous reasoning about both (i) experiments and experimental data specifications (requirements) and (ii) modeling and simulation (MAS) specifications. It, along with the graph dynamical system (GDS) model, helps to ensure consistency and correspondence between experiments and MAS. We use GDS to model the dynamics of particular application systems. Specific data sources and modeling approaches are shown. These are used within our pipeline model. Figure adapted from [34].

Table 2.

This work involves three major topics (left column of table): Data representation, modeling representation, and software pipelines.

Fig 5.

An application-specific pipeline is composed of an invariant framework that performs general operations (see text) and application-specific h-functions.

Table 3.

Definition of our abstract data model.

Fig 6.

Sequence of data models for reasoning about experiments and modeling and simulation.

We advocate prepending the abstract data model to the front end of the modeling process, as shown here. Table 3 shows our abstract data model, and Fig 7 shows this data model translated into an entity-relationship diagram in unified modeling language (UML) form. The table and figures in A (which support Section 7) show the Data Common Specification for our software design.

Fig 7.

Data model of Table 3 translated into an entity-relationship diagram in unified modeling language (UML) form.

This illustrates that the abstract data model can be translated into customary forms of data models (e.g., UML) that are more amenable to software development.

Table 4.

Symbols used to describe our computational model known as a discrete Graph Dynamical System (GDS).

Fig 8.

Network G(V, E) for a GDS example, with V = {v1, v2, v3, v4, v5, v6}.

Thresholds θi are provided for nodes vi, in blue, by the respective nodes. The local functions fi are threshold functions for vi ∈ V, 1 ≤ i ≤ 6; see text for details. The discrete system dynamics are given by the configurations at successive times from 0 to 4, at the right in the figure. Each configuration is given by the vector of node states. The system reaches a fixed point at time t = 3, as evidenced by no change in the configuration in going from t = 3 to t = 4.
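The synchronous threshold dynamics described here can be sketched in a few lines of Python. The network, threshold values, and initial configuration below are illustrative stand-ins (the figure's exact values are not reproduced in this text), but the update rule and fixed-point check follow the GDS description:

```python
# Minimal sketch of a synchronous graph dynamical system (GDS) with
# threshold local functions. The network, thresholds, and initial
# configuration are assumed for illustration.
neighbors = {
    1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 5],
    4: [2, 5, 6], 5: [3, 4, 6], 6: [4, 5],
}
theta = {i: 1 for i in range(1, 7)}  # threshold values (assumed)

def step(x):
    # Node i takes state 1 when the number of 1s in its closed
    # neighborhood (itself plus neighbors) meets its threshold.
    return {i: int(x[i] + sum(x[j] for j in neighbors[i]) >= theta[i])
            for i in x}

x = {1: 1, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0}  # assumed initial configuration
trajectory = [x]
for t in range(4):                         # configurations at times 0..4
    x = step(x)
    trajectory.append(x)

# A fixed point is reached when two successive configurations agree.
fixed_point = trajectory[-1] == trajectory[-2]
```

With these assumed values, the state spreads through the network and the system reaches a fixed point at t = 3, mirroring the behavior the caption describes.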

Fig 9.

Conceptual view of a pipeline that is composed of the pipeline framework (represented by the bounding box) and the h-functions that provide the application-based functionality of a particular pipeline.

Functions, or h-functions, hi, 1 ≤ i ≤ 3, are implemented as software within a pipeline. The pipeline framework (red box) controls the execution order of functions, and the inputs and outputs for each function, through a pipeline job specification. Circles in the figure denote input and output digital objects, such as ASCII files or database tables. This figure is a more detailed representation of Fig 4. Adapted from [34], Fig 3.

Fig 10.

One arbitrary software h-function within a pipeline.

Data instances , , and are transformed by transformation code τ1 to conform to the required input for h. Similarly, and are used by τ2 to produce input . Outputs from the h-function are , , and . Inputs and outputs are subjected to verification through comparisons with specified schemas (not shown here). The pipeline framework is represented by the red box, which controls execution of the h-functions and transformation codes. This is a more detailed representation of Figs 5 and 9.

Fig 11.

Two pipelines are shown to illustrate similarities and differences between them.

To run a pipeline (called a job), a pipeline-specific configuration input file is verified and read by the pipeline framework. The file specifies h-functions and their order of execution, as well as required input files for the pipeline. Here we show how function h1 is executed in pipeline 1 and how h4 is executed in pipeline 2. The pipeline framework invokes the corresponding functions. If specified in the configuration file, the pipeline framework invokes a transformation function that transforms the contents of one or more files into an input file of the correct format for the h-function. There may be one transformation function for each direct input to an h-function. At appropriate points in a pipeline, data files are verified against their corresponding JSON schemas (input file verification). The h-function is executed and output files are generated (these digital-object outputs may be, e.g., plot files, ASCII data files, and binary data files). There may be additional h-functions within pipeline 1, indicated by the ellipsis below the pipeline 1 function h1 execution. In this example, outputs from the generic pipeline 1 are inputs for the generic pipeline 2. Function h4 in pipeline 2 is executed in a similar fashion to function h1 in pipeline 1. See the text for descriptions of these various components. Note: the pipeline framework (in brown) is the same code for all pipelines. See Table 5 for implementation details of the elements in this figure.
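The control flow described here (read a job specification, optionally transform inputs, then invoke each h-function in order) can be sketched as follows. All function, field, and file names are hypothetical illustrations, not the authors' actual API:

```python
# Minimal sketch of the pipeline-framework idea: a job specification
# names h-functions, their execution order, and their inputs; the
# framework applies optional transformations, then invokes each
# function. Names are illustrative, not the authors' actual code.

def h1(data):
    # Example h-function: count actions per player.
    counts = {}
    for player, action in data:
        counts[player] = counts.get(player, 0) + 1
    return counts

def to_pairs(rows):
    # Example transformation: raw dict rows -> (player, action) tuples.
    return [(r["player"], r["action"]) for r in rows]

JOB_SPEC = {
    "steps": [
        {"function": h1, "transform": to_pairs, "input": "raw_rows"},
    ]
}

def run_pipeline(spec, inputs):
    outputs = {}
    for step in spec["steps"]:
        data = inputs[step["input"]]
        if step.get("transform"):          # optional pre-transformation
            data = step["transform"](data)
        outputs[step["function"].__name__] = step["function"](data)
    return outputs

result = run_pipeline(JOB_SPEC, {"raw_rows": [
    {"player": "p1", "action": "word"},
    {"player": "p1", "action": "request"},
    {"player": "p2", "action": "reply"},
]})
# result["h1"] == {"p1": 2, "p2": 1}
```

Because the framework code is the same for every pipeline, only the job specification and the h-functions change from one application to another, which is the design point the figure makes.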

Table 5.

Sections and files from the execution of a generic pipeline.

Table 6.

Configuration input file description.

Table 7.

Summary table of h-functions.

Fig 12.

The anagram game screen, phase-2, for one player.

This player has their own letters “R,” “O,” and “L” and has requested an “E” and an “A” from neighbors. The “E” is green, so this player's request has been fulfilled and the “E” can be used in forming words; the request for “A” is still outstanding, so it cannot yet be used in words. Below these letters, the screen shows that Player 2 has requested “O” and “L” from this player. This player can reply to these requests, if she so chooses. Below that is a box where the player types and submits new words.

Fig 13.

Case study 1.

Partial representation of the data model for the online experiment composed of 3 phases with a set V of players (n = |V|). The phase 1 DIFI measure, a proxy for CI, uses a null (i.e., empty) network on n players; i.e., there are no edges in the graph because players play individually. In phase 2, a team-based CI-priming game, edges E are communication channels. Initial conditions Bv include letter assignments to players. The individual DIFI measure is repeated in phase 3. The action set A and illustrative action tuples Ti are given for each phase.

Fig 14.

The Data Analytics Pipeline (DAP) was executed to analyze phase 2 of three experiments with n = 6 and d = 5.

The time series of the number of words formed by each player for experiment #2 is generated by function h3.

Fig 15.

The Data Analytics Pipeline (DAP) was executed to analyze phase 2 of three experiments with n = 6 and d = 5.

The histogram of the number of “letter request” actions for three experiments is generated by function h5. The x-axis is time in the group anagram game, binned in 30-second intervals.
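The binning step behind such a histogram is straightforward; a minimal sketch (the timestamps below are invented for illustration) is:

```python
# Sketch: bin "letter request" action timestamps (in seconds) into
# 30-second intervals, as in the histogram. Timestamps are made up.
request_times = [3, 12, 28, 31, 45, 95, 100, 118]

bin_width = 30
counts = {}
for t in request_times:
    b = t // bin_width  # bin index: 0 -> [0, 30), 1 -> [30, 60), ...
    counts[b] = counts.get(b, 0) + 1
# counts maps bin index -> number of requests in that interval
```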

Fig 16.

The Data Analytics Pipeline (DAP) was executed to analyze phase 2 of three experiments with n = 6 and d = 5.

The discrete-time actions for all three experiments are generated by function h7. This output will inform the Property Inference Pipeline in computing parameters for simulation models. Time (in seconds) is shown in the first row as 1, 2, 3, …, and counts of the z vector components, per player and per experiment, are given.

Fig 17.

The Property Inference Pipeline receives the input from h7 of the Data Analytics Pipeline (DAP).

The parameters in this figure were generated to inform an agent-based model (ABM) for the Modeling and Simulation Pipeline (MASP). The transitions in the figure are from i to j, where ai ∈ A is the action at time t and aj ∈ A is the action at (t + 1). Rows not shown indicate that there are no such transitions in the data.
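Estimating such action-transition parameters from a discrete-time action sequence amounts to counting consecutive pairs and normalizing by row. A minimal sketch (the action sequence is invented; this is the general technique, not necessarily the authors' exact estimator):

```python
# Sketch of property inference for an ABM: estimate transition
# probabilities P(a_j at t+1 | a_i at t) from an action sequence.
from collections import Counter

actions = ["idle", "request", "request", "reply", "word", "idle",
           "request", "reply", "word", "word"]   # illustrative data

# Count consecutive (a_i, a_j) pairs.
pair_counts = Counter(zip(actions, actions[1:]))

# Total transitions out of each action a_i.
row_totals = Counter(a for a, _ in pair_counts.elements())

# Conditional transition probabilities; pairs never observed are
# simply absent (the "rows not shown" in the figure).
P = {(i, j): c / row_totals[i] for (i, j), c in pair_counts.items()}
```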

Fig 18.

The Modeling And Simulation Pipeline (MASP) and Model Evaluation And Prediction Pipeline (MEAPP) were run to obtain simulation results and model predictions, and to compare experimental data to model predictions.

All three plots contain model predictions and use results from h1 of the MASP. Function h1 of MEAPP plots corresponding experimental and model output data (top plot) and compares experiment and model output using KL-divergence (center plot) for six parameters. Function h2 of MEAPP uses h3 of the Data Analytics Pipeline (DAP) to plot model predictions from h1 of the MASP (bottom plot), where now n = 15 (in experiments, n = 6).
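The KL-divergence comparison between an experimental distribution p and a model-predicted distribution q can be sketched as follows (the two distributions are hypothetical; both must be over the same bins, sum to 1, and q must be positive wherever p is):

```python
# Sketch of a KL-divergence comparison: D_KL(p || q) between an
# experimental distribution p and a model prediction q (both assumed).
import math

p = [0.5, 0.3, 0.2]   # experimental distribution (assumed)
q = [0.4, 0.4, 0.2]   # model-predicted distribution (assumed)

# D_KL(p || q) = sum_i p_i * ln(p_i / q_i); terms with p_i = 0 vanish.
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

A value of 0 indicates identical distributions; larger values indicate greater divergence between experiment and model.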

Table 8.

Data common specification.

Fig 19.

JSON schema for the “Experiment” of the data common specification.

Fig 20.

JSON schema for the “Phase” of the data common specification.

Fig 21.

JSON schema for the “Phase Description” of the data common specification.

Fig 22.

JSON schema for the “Player” of the data common specification.

Fig 23.

JSON schema for the “Action” of the data common specification.

Fig 24.

To run a pipeline (called a job), a configuration input file specifies functions and their order of execution.

This figure shows a portion of the schema for a configuration file that specifies the experiment JSON schema file location.
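A schema fragment of this kind, together with a minimal validation check, might look like the following sketch. The field name and the check are hypothetical illustrations, not the authors' actual schema (which the figure shows):

```python
# Hypothetical fragment of a configuration-file JSON schema that
# requires a property giving the location of the experiment JSON
# schema file. Field names are illustrative.
import json

schema_fragment = {
    "type": "object",
    "properties": {
        "experiment_schema_location": {"type": "string"},
    },
    "required": ["experiment_schema_location"],
}

config = json.loads('{"experiment_schema_location": "schemas/experiment.json"}')

# Minimal check in the spirit of JSON-schema validation (stdlib only):
# every required key is present, and string-typed keys hold strings.
missing = [k for k in schema_fragment["required"] if k not in config]
valid = not missing and all(
    isinstance(config[k], str)
    for k, spec in schema_fragment["properties"].items()
    if spec["type"] == "string" and k in config
)
```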

Fig 25.

To run a pipeline (called a job), a configuration input file specifies functions and their order of execution.

This figure shows a portion of the schema for a configuration file that specifies the phase description JSON schema file location.

Fig 26.

To run a pipeline (called a job), a configuration input file specifies functions and their order of execution.

This figure shows a portion of the schema for a configuration file that specifies the phase JSON schema file location.

Fig 27.

To run a pipeline (called a job), a configuration input file specifies functions and their order of execution.

This figure shows a portion of the schema for a configuration file that specifies the action description JSON schema file location.

Fig 28.

To run a pipeline (called a job), a configuration input file specifies functions and their order of execution.

This figure shows a portion of the schema for a configuration file that specifies the player description JSON schema file location.

Fig 29.

To run a pipeline (called a job), a configuration input file specifies functions and their order of execution.

In this configuration file there are five possible functions that can be executed in any order. This figure shows a portion of the schema for a configuration file that specifies how to compose and execute one or more functions of a simple pipeline. For example, it defines that a parameter called “actionId” is necessary only for functions h2 through h5.

Fig 30.

An example of the (1) Experimental Data Transformation Pipeline execution, which transforms raw experimental data into the data common specification.

Here we show how function h1 is executed, using an input CSV file as an example of the “Completed Session Summary” input file. If necessary, file contents are transformed to obtain the direct input for a function in the correct format: here, the “Completed Session Summary” CSV input file is transformed into a “Completed Session Summary” JSON file that becomes the input for the function. After verification of formats against the corresponding JSON schemas, the function is executed and output files are generated. The output shown is the JSON file for the “Experiment” data common specification.
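A CSV-to-JSON transformation of this kind can be sketched with the Python standard library. The file contents below are invented stand-ins for the real “Completed Session Summary” data:

```python
# Sketch of the transformation step: a "Completed Session Summary"
# CSV (contents invented here) is converted to JSON before being
# passed to the h-function and validated against its schema.
import csv
import io
import json

csv_text = "player,words_formed\np1,4\np2,7\n"   # stand-in for the real file

# Parse CSV rows into dicts, then serialize as JSON.
rows = list(csv.DictReader(io.StringIO(csv_text)))
json_text = json.dumps(rows)

# The h-function would then read this JSON as its direct input.
records = json.loads(json_text)
```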

Fig 31.

An example of the (2) Data Analytics Pipeline execution, which analyzes data files in the common specification.

Here we show how function h7 is executed. Input files are validated against their corresponding JSON schemas; an example JSON schema file for the “Experiment” description input file is shown (Fig 19 contains the whole file). If necessary, file contents are then transformed to obtain the direct input for the function in the correct format. After verification, function h7 is executed and output files are generated. In this example, the output file is an input for the (3) Property Inference Pipeline.

Table 9.

Listing of types of functions as microservices for the (1) Experimental Data Transformation Pipeline (EDTP).

Table 10.

Listing of types of functions as microservices for the (2) Data Analytics Pipeline (DAP).

Table 11.

Listing of types of functions as microservices for the (3) Property Inference Pipeline (PIP).

Table 12.

Listing of types of functions as microservices for the (4) Modeling and Simulation Pipeline (MASP).

Table 13.

Listing of types of functions as microservices for the (5) Model Evaluation and Prediction pipeline (MEAPP).

Fig 32.

Elements of the data model (Table 3) for the online social network experiment in [3].

Table 14.

Online social network experiment in [3], defined with our data model.

Fig 33.

Data model of Table 14 translated into an entity-relationship diagram in Unified Modeling Language (UML) form.

Table 15.

How the structure of communication networks among actors can affect system-level performance is studied in [44].

Fig 34.

Data model of Table 15 translated into an entity-relationship diagram in Unified Modeling Language (UML) form.
