OpenStats: A robust and scalable software package for reproducible analysis of high-throughput phenotypic data

doi:10.1371/journal.pone.0242933

Fig 1.

The increase in the total number of IMPC mouse lines/data points across the IMPC major data releases, from the first release in 2012 to the current release in 2020.

There may be more than one release, or some minor releases, in a given year. The y-axis shows the number of mouse lines/data points for the specific major release or the average over the minor releases. (Left) The total number of IMPC phenotyped lines for each IMPC data release. (Right) The overall increasing trend in the data points by data type: non-time series (red), time series (green), categorical (black) and total (blue), for each IMPC data release. These plots show that, on average, the total number of data points and phenotyped mouse lines increases by 20% between IMPC major data releases.

Fig 2.

The schematic illustration of the OpenStats workflow.

The OpenStats software is designed with a four-layer structure: input data and model specification, dataset processing and preparation, statistical analysis, and reporting/exporting the results.
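
The four layers named above can be illustrated with a minimal Python sketch. This is not the OpenStats implementation or its API: every function and field name below is hypothetical, and a simple difference in group means stands in for the actual statistical models, purely to show how the layers hand data to one another.

```python
import json
from statistics import mean


# Layer 1: input data and model specification (hypothetical field names).
def make_input(records, response, group, reference):
    return {"records": records, "response": response,
            "group": group, "reference": reference}


# Layer 2: dataset processing and preparation.
def prepare(spec):
    """Drop records with a missing response and split values by group label."""
    groups = {}
    for rec in spec["records"]:
        value = rec.get(spec["response"])
        if value is None:
            continue
        groups.setdefault(rec[spec["group"]], []).append(float(value))
    return groups


# Layer 3: statistical analysis.
# Placeholder only: difference between each group mean and the reference mean.
def analyse(groups, reference):
    ref_mean = mean(groups[reference])
    return {name: mean(values) - ref_mean
            for name, values in groups.items() if name != reference}


# Layer 4: reporting/exporting the results as JSON.
def report(effects):
    return json.dumps({"effect_vs_reference": effects}, sort_keys=True)


def run_workflow(records, response, group, reference):
    """Chain the four layers in order."""
    spec = make_input(records, response, group, reference)
    return report(analyse(prepare(spec), reference))


# Made-up example data, for illustration only.
example_records = [
    {"genotype": "control", "weight": 20.0},
    {"genotype": "control", "weight": 22.0},
    {"genotype": "mutant", "weight": 25.0},
    {"genotype": "mutant", "weight": None},  # removed in layer 2
]
example_report = run_workflow(example_records, "weight", "genotype", "control")
```

Keeping each layer as a separate function means the placeholder analysis step could be swapped for a real model without touching the input, preparation or reporting code.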

Fig 3.

Schematic view of the IMPC statistical pipeline.

Measurements of several parameters per specimen are collected from 13 centres around the world, inspected for possible QC issues, carefully filtered to form individual working datasets, pre-optimised for processing on the cluster computing platform and ultimately passed to the statistical analysis engine, either PhenStat or OpenStats. The analysis engine applies an appropriate statistical method to each working dataset and stores the analysis results in a format that facilitates the downstream processes. All outputs from the statistical engine are inspected for failures and errors, and must pass a random QC check before being released to the downstream processes.

Table 1.

The comparison between OpenStats and PhenStat for analysing the IMPC continuous and categorical data.

Fig 4.

The comparison of the time efficiency of the IMPC statistical pipeline when analysed by OpenStats and by PhenStat.

(Top row) The left and right charts show the largest average time savings, in minutes, from using OpenStats rather than PhenStat across the IMPC procedures and parameters. (Bottom row) The largest average losses, in minutes, where PhenStat performs faster than OpenStats. These plots show that OpenStats improves the efficiency of the IMPC statistical pipeline.
