Fig 1.

Workflow for aggregating and transforming laboratory-derived experimental results for data science use.

Prior to formal analyses, diverse experimental data must be aggregated and organized into tidy data files (see Fig 2 for examples of different data types that may be sourced for this purpose). Additional data transformation steps will then likely take place, concurrent with initial exploratory analyses to identify and refine key parameters (see Fig 2 for examples of considerations that can modulate data transformation outcomes). Once these steps (outlined in green) are completed, hypothesis-driven research questions and model development can proceed, in isolation or in tandem, such as the generation of simple statistical models (outlined in blue) and ML models (outlined in purple). Establishing ML models may require additional experimental considerations to be optimized before models are trained, tested, and refined. Best practice is to validate models with new, independent data and to use cross-validation methods, which help ensure that predictive performance is not influenced by noise in the data.
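The cross-validation step named in this caption can be illustrated with a minimal sketch. It assumes a tidy table already reduced to one numeric predictor and one numeric outcome per row, and it uses a simple least-squares line as a stand-in model; the function names (`k_fold_indices`, `fit_line`, `cross_validated_mse`) are hypothetical and are not taken from the article.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k shuffled, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def cross_validated_mse(xs, ys, k=5, seed=0):
    """Mean squared error averaged over k held-out folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit only on rows outside the held-out fold.
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        # Score only on rows the model has not seen.
        fold_errors.append(
            mean((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out)
        )
    return mean(fold_errors)
```

Each row is held out exactly once, so the averaged error reflects performance on data the model was not fitted to. This complements validation on new, independent data and does not replace it.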


Fig 2.

Considerations when aggregating and tidying in vivo and molecular/laboratory data.

In vivo-generated data can encapsulate a wide range of serially collected and/or discrete (stand-alone) specimens, observations, and experimental outcomes. Results from in vivo experimentation are frequently contextualized with diverse pathogen sequence-based information and laboratory-based assays. Examples of data types within these groupings are shown on the left-hand side of this figure. Depending on the data type, a range of options is available for distilling complex laboratory-based readouts into the discrete values that many data science applications require; these decisions can meaningfully impact the conclusions drawn from the work. Examples of how complex data can be tidied for this purpose for each data type are shown on the right-hand side of this figure. AUC, area under the curve; RBS, predicted receptor binding preference; PA, predicted polymerase activity. Data types and analysis considerations are representative only and do not encapsulate all potential parameters used in data science applications that employ in vivo data. Image generated entirely by CDC illustrators by hand.
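As one illustration of distilling a serially collected readout into a discrete value, the sketch below computes an area under the curve (AUC) with the trapezoidal rule and, for contrast, the peak value of the same series. The function names and the example measurements are hypothetical and chosen only for illustration; the caption does not specify how AUC is calculated.

```python
def trapezoid_auc(times, values):
    """Area under a serially sampled curve using the trapezoidal rule."""
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    if len(times) < 2:
        raise ValueError("at least two time points are required")
    # Sort by time so unordered input still integrates correctly.
    pairs = sorted(zip(times, values))
    auc = 0.0
    for (t0, v0), (t1, v1) in zip(pairs, pairs[1:]):
        auc += (t1 - t0) * (v0 + v1) / 2
    return auc


def summarize_series(times, values):
    """Two alternative single-value summaries of the same series."""
    return {"auc": trapezoid_auc(times, values), "peak": max(values)}


# Hypothetical measurements on days 1, 3, 5, and 7 (illustrative only).
days = [1, 3, 5, 7]
readout = [4.0, 6.0, 5.0, 2.0]
summary = summarize_series(days, readout)  # {'auc': 28.0, 'peak': 6.0}
```

The two summaries can rank subjects differently: a brief, high peak and a lower but sustained readout may give similar AUC values yet distinct peaks. This is one way the choice of tidying method can affect downstream conclusions.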
