The authors have declared that no competing interests exist.

Left-censored missing values are common in targeted metabolomics datasets and can be considered missing not at random (MNAR). Improper handling of missing values adversely affects subsequent statistical analyses. However, few imputation methods have been developed for and applied to the MNAR situation in metabolomics, so a practical left-censored missing value imputation method is urgently needed. We developed GSimp, a left-censored missing value imputation approach based on an iterative Gibbs sampler. We compared GSimp with three other imputation methods on two real-world targeted metabolomics datasets and one simulation dataset using our imputation evaluation pipeline. The results show that GSimp outperforms the other imputation methods in terms of imputation accuracy, observation distribution, univariate and multivariate analyses, and statistical sensitivity. Additionally, a parallel version of GSimp was developed for large-scale metabolomics datasets. The R code for GSimp, the evaluation pipeline, a tutorial, and the real-world and simulated targeted metabolomics datasets are available at:

Missing values caused by the limit of detection/quantification (LOD/LOQ) are widely observed in mass spectrometry (MS)-based targeted metabolomics studies and can be regarded as missing not at random (MNAR). MNAR leads to biased parameter estimates and jeopardizes subsequent statistical analyses in several ways, such as distorting sample distributions and impairing statistical power. Although a wide range of missing value imputation methods have been developed for -omics studies, only a limited number are designed appropriately for the MNAR situation. To alleviate the problems caused by MNAR and to facilitate targeted metabolomics studies, we developed a Gibbs sampler based missing value imputation approach, called GSimp, which is publicly accessible on GitHub. We compared our method with existing approaches using an imputation evaluation pipeline on both real-world and simulated metabolomics datasets to demonstrate its superiority from different perspectives.

Missing values commonly exist in mass spectrometry (MS) based metabolomics datasets. Many statistical methods require a complete dataset, which makes missing data an unavoidable problem for subsequent data analysis. Generally speaking, missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR) are the three commonly accepted types of missingness [

The processing of missing values has been developed and studied in MS data, which is an indispensable step in the metabolomics data processing pipeline [

To reduce the adverse effects of missing values on subsequent metabolomics data analyses, we developed a left-censored missing value imputation framework, GSimp, in which a prediction model is embedded in an iterative Gibbs sampler. Next, we compared GSimp with HM, QRILC, and kNN-TN on two real-world metabolomics datasets and one simulation dataset to demonstrate the advantages of GSimp regarding imputation accuracy, observation distribution, univariate and multivariate analysis [

A variable containing missing elements from free fatty acids (FFA) dataset was randomly selected to track the sequence of corresponding parameters and estimates across the first 500 iterations out of a total of 2000 (100 × 20) iterations using GSimp. From

The first 500 iterations out of a total of 2000 (100×20) iterations using GSimp where

We evaluated four different MNAR imputation/substitution methods on FFA, bile acids (BA) targeted metabolomics and simulation datasets. First, we measured the imputation performances using label-free approaches. Sum of ranks (SOR) was used to measure the imputation accuracy regarding the imputed values of each missing variable. From the upper panel of

SOR on FFA dataset (upper left) and BA dataset (upper right) along with different numbers of missing variables based on four imputation methods: HM (red circle), QRILC (green triangle), GSimp (blue square), and kNN-TN (purple cross). PCA-Procrustes sum of squared errors on FFA dataset (lower left) and BA dataset (lower right) along with different numbers of missing variables based on four imputation methods: HM (red circle), QRILC (green triangle), GSimp (blue square), and kNN-TN (purple cross).

Then, we measured the imputation performances with clinical group information provided. We compared the results of univariate and multivariate analyses for imputed and original datasets. Since this is a case-control study, student’s

Pearson's correlation between log-transformed p-values of student’s t-tests on FFA dataset (upper left) and BA dataset (upper right) along with different numbers of missing variables based on four imputation methods: HM (red circle), QRILC (green triangle), GSimp (blue square), and kNN-TN (purple cross). PLS-Procrustes sum of squared errors on FFA dataset (lower left) and BA dataset (lower right) along with different numbers of missing variables based on four imputation methods: HM (red circle), QRILC (green triangle), GSimp (blue square), and kNN-TN (purple cross).

On the simulation dataset, we compared QRILC, kNN-TN, and GSimp using the same approaches. Consistent results were observed (

The purpose of this study was to develop a left-censored missing value imputation approach for targeted metabolomics data analysis. We compared GSimp with three other imputation methods (kNN-TN, QRILC, and HM), and the different evaluation methods indicated that GSimp was superior to the others. To illustrate the performance of GSimp, we randomly selected one variable containing missing values from the FFA dataset (

Scatter plots of imputed values (X-axis) against original values (Y-axis) for one example missing variable, with non-missing elements shown as blue dots and missing elements as red dots, based on four imputation methods: HM (upper left), QRILC (upper right), kNN-TN (lower left), and GSimp (lower right). Rug plots show the distributions of imputed values and original values.

In this study, we comprehensively evaluated our algorithm on targeted metabolomics datasets for the MNAR situation. We additionally tested a non-targeted GC/MS profiling metabolomics dataset and found that most of the missing values were manually retrievable because they resulted from the misidentification of peaks. These retrievable missing elements were randomly distributed across the dataset and unrelated to their true abundances (

GSimp offers more than that: other truncation values can also be applied in real-world analyses. For example, the known LOQ/LOD of a metabolite or a quantile of the observed values (e.g., 10%) can be set as the upper truncation point under different conditions. Additionally, when the signal intensity of a compound exceeds the upper limit of the quantification range or saturates the instrument during analysis, an informative lower truncation point can be applied to the right-censored missing value. Moreover, when non-informative bounds are applied for both the upper and lower limits (e.g., +∞, -∞), GSimp can be extended to the MCAR/MAR situation. With flexible use of upper and lower limits, our approach may provide a versatile and powerful imputation technique for different missing types. For other -omics datasets with missing values (especially MNAR), such as single cell RNA-sequencing data, the method could also be applied with few modifications to the default settings. Thus, GSimp is worth evaluating in other complex scenarios in the future.
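The role of the truncation bounds can be illustrated with a minimal Python sketch (not the authors' R implementation). The function name `sample_truncated_normal`, the inverse-CDF sampling strategy, and all numeric values are assumptions chosen for illustration; only the mapping of bounds to missing types follows the text.

```python
import math
import random
from statistics import NormalDist


def sample_truncated_normal(mean, sd, lo=-math.inf, hi=math.inf, rng=random):
    """Draw one value from N(mean, sd^2) restricted to [lo, hi] (inverse-CDF method)."""
    dist = NormalDist(mean, sd)
    # Probability mass below each bound; infinite bounds map to 0 and 1.
    p_lo = 0.0 if lo == -math.inf else dist.cdf(lo)
    p_hi = 1.0 if hi == math.inf else dist.cdf(hi)
    # Uniform draw between the two probabilities, kept strictly inside (0, 1)
    # because inv_cdf is undefined at the endpoints.
    u = min(max(rng.uniform(p_lo, p_hi), 1e-12), 1 - 1e-12)
    # Final clamp guards against rounding when the bounds lie far in a tail.
    return min(max(dist.inv_cdf(u), lo), hi)


rng = random.Random(0)
# Left-censored (MNAR below LOD): informative upper bound, e.g. a hypothetical LOD of 0.8.
left = [sample_truncated_normal(1.0, 0.5, hi=0.8, rng=rng) for _ in range(100)]
# Right-censored (above the quantification range): informative lower bound.
right = [sample_truncated_normal(1.0, 0.5, lo=1.5, rng=rng) for _ in range(100)]
# Non-informative bounds: an ordinary normal draw, as in the MCAR/MAR case.
free = [sample_truncated_normal(1.0, 0.5, rng=rng) for _ in range(100)]
```

Only the bounds change between the three cases; the mean and standard deviation would come from the prediction model in the full method.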

Since GSimp employed an iterative Gibbs sampler method, a large number of iterations (

In conclusion, we developed a new imputation approach, GSimp, that outperformed the traditional determined-value substitution method (HM) and other approaches (QRILC and kNN-TN) in MNAR situations. GSimp uses the predictive information of variables and simultaneously maintains a truncated normal distribution for each missing element by embedding a prediction model into the Gibbs sampler framework. With proper modifications of the parameter settings (e.g., truncation points, pre-processing), GSimp may be applicable to different types of missing values and to different -omics studies, and thus deserves further exploration in the future.

We employed datasets from a study comparing serum metabolites between obese subjects with diabetes mellitus (N = 70) and healthy controls (N = 130), where N represents the number of observations. Dataset 1: a total of 42 free fatty acids (FFAs) were identified and quantified in those participants in order to evaluate their FFA profiles [

For the simulation dataset, we first calculated the covariance matrix ^{2}) and

For two real-world targeted metabolomics datasets, we generated a series of MNAR datasets by using the missing proportion (number of missing variables/number of total variables) from 0.1 to 0.6 in a step of 0.05 with quantile cut-off for each missing variable drawn from a uniform distribution
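The MNAR generation procedure described above can be sketched as follows. This is an illustrative Python sketch, not the authors' pipeline: the function name `make_mnar`, the column-wise list layout, and the uniform bounds `q_low` and `q_high` are hypothetical, since the actual bounds are not given in this text.

```python
import random


def make_mnar(data, missing_prop, q_low, q_high, rng):
    """Censor the lowest values of a random subset of variables.

    data: list of columns (one list of floats per variable).
    Returns a censored copy (missing entries are None) and the chosen column indices.
    """
    n_vars = len(data)
    n_missing = round(missing_prop * n_vars)       # number of missing variables
    chosen = rng.sample(range(n_vars), n_missing)
    out = [list(col) for col in data]
    for j in chosen:
        q = rng.uniform(q_low, q_high)             # quantile cut-off for this variable
        ordered = sorted(out[j])
        k = int(q * len(ordered))                  # how many values fall below the cut-off
        if k == 0:
            continue
        cutoff = ordered[k - 1]
        out[j] = [None if v <= cutoff else v for v in out[j]]
    return out, chosen


rng = random.Random(1)
# Toy matrix: 10 variables, 20 observations each (made-up log-normal values).
data = [[rng.lognormvariate(0, 1) for _ in range(20)] for _ in range(10)]
censored, chosen = make_mnar(data, missing_prop=0.3, q_low=0.1, q_high=0.5, rng=rng)
```

Because only the smallest values in each chosen variable are removed, the missingness depends on the unobserved abundance itself, which is what makes it MNAR.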

A prediction model was employed for the prediction of missing values by setting a targeted missing variable as outcome and other variables as predictors. Different prediction models (e.g., linear regression, elastic net [

The L1 penalty controls the number of predictors by assigning zero coefficients to the "unnecessary" predictors. From a Bayesian point of view, the regularization is a mixture of Gaussian and Laplacian prior distributions on the coefficients, which can pull the full model of maximum likelihood estimates

The Gibbs sampler is an MCMC technique that sequentially updates each parameter while the others are held fixed, and it can be used to generate posterior samples. For each missing variable in the dataset, we applied a Gibbs sampler to impute the missing values by sampling from a truncated normal distribution, with the fitted value of the prediction model as the mean and the root mean square deviation (RMSD) of the missing part as the standard deviation, truncated at specified cut-points. Assume we have a data matrix X = (x_{1}, x_{2}, x_{3}, …, x_{p}) with only one variable x_{j} containing left-censored missing values. We denote the missing part of x_{j} as y_{m} and the non-missing part as y_{f}, each with its own length, and the remaining variables as X_{-j}; the upper truncation point can be set to the minimum of y_{f} or a given LOQ. The truncation bounds ensure imputation results are constrained within [

Step-1 (initialization): we initialize missing values (QRILC in our case), and get

Step-2 (prediction): we then build a prediction model (elastic net in our case):

Step-3 (estimation): based on the prediction model, we get the predicted value

Step-4 (sampling): we draw sample

We iteratively repeat step-2 to step-4 and update _{j}.
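The four steps can be sketched for a single missing variable as follows. This is a simplified Python illustration under stated assumptions, not the GSimp implementation: a one-predictor least-squares line stands in for the elastic net, the missing values are initialized at the upper bound rather than with QRILC, and the names `truncated_draw`, `fit_line`, and `gibbs_impute` are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean


def truncated_draw(mean, sd, lo, hi, rng):
    """One draw from N(mean, sd^2) restricted to [lo, hi] via the inverse CDF."""
    dist = NormalDist(mean, sd)
    p_lo = 0.0 if lo == -math.inf else dist.cdf(lo)
    p_hi = 1.0 if hi == math.inf else dist.cdf(hi)
    u = min(max(rng.uniform(p_lo, p_hi), 1e-12), 1 - 1e-12)
    return min(max(dist.inv_cdf(u), lo), hi)


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (stand-in for the elastic net)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def gibbs_impute(x, y, lo, hi, n_iter=50, seed=0):
    """Impute None entries of y from predictor x, constrained to [lo, hi]."""
    rng = random.Random(seed)
    miss = [i for i, v in enumerate(y) if v is None]
    if not miss:
        return list(y)
    y_cur = [hi if v is None else v for v in y]           # Step 1: initialization
    for _ in range(n_iter):
        a, b = fit_line(x, y_cur)                         # Step 2: prediction model
        pred = [a + b * x[i] for i in miss]               # Step 3: predicted values
        rmsd = math.sqrt(fmean([(y_cur[i] - p) ** 2 for i, p in zip(miss, pred)]))
        sd = max(rmsd, 1e-8)                              # avoid a zero standard deviation
        for i, p in zip(miss, pred):                      # Step 4: truncated sampling
            y_cur[i] = truncated_draw(p, sd, lo, hi, rng)
    return y_cur


# Toy data: a linear relationship with noise, five smallest values censored.
x = [float(i) for i in range(20)]
noise = random.Random(1)
y_true = [2.0 + 0.5 * xi + noise.gauss(0, 0.3) for xi in x]
cutoff = sorted(y_true)[5]
y_obs = [None if v < cutoff else v for v in y_true]
imputed = gibbs_impute(x, y_obs, lo=-math.inf, hi=cutoff)
```

Here the upper bound is the minimum observed value, so every imputed value stays at or below what was actually detected.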

A whole data matrix _{1}, _{2}, _{3}, …, _{p}) contains a number of

1. ^{imp} ← initialize the missing values for

2.

3.

4.

5.

6.

7. _{j} and _{j};

8.

9. Gibbs sampler step 2 to 4;

10.

11. Update

12.

13.

14.^{imp}

Three other left-censored missing value imputation/substitution methods were used in our study for performance comparison:

kNN-TN (Truncation

QRILC (Quantile Regression Imputation of Left-Censored data) [

HM (Half of the Minimum): This method replaces missing elements with half of the minimum of non-missing elements in the corresponding variable.
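As a point of reference, HM substitution as described above fits in a few lines. The Python sketch below uses a hypothetical function name, `half_min_impute`, and represents missing elements as `None`.

```python
def half_min_impute(column):
    """Replace missing entries (None) with half of the minimum observed value."""
    observed = [v for v in column if v is not None]
    fill = min(observed) / 2
    return [fill if v is None else v for v in column]


result = half_min_impute([None, 4.0, 2.0, None])  # -> [1.0, 4.0, 2.0, 1.0]
```

Every missing element in a variable receives the same value, so no variance is introduced among the imputed elements.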

Normalized Root Mean Squared Error (NRMSE) [_{i}(_{th} missing variable.
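A rank-based accuracy summary of this kind can be sketched in Python as follows. The exact formula is not recoverable from this text, so the sketch assumes a common NRMSE definition (root of the mean squared error divided by the variance of the true values) and assumes that SOR sums, per method, the rank of its NRMSE on each missing variable. The function names and toy numbers are hypothetical.

```python
import math
from statistics import fmean, pvariance


def nrmse(imputed, truth):
    """Assumed definition: sqrt(mean squared error / variance of the true values)."""
    mse = fmean([(a - b) ** 2 for a, b in zip(imputed, truth)])
    return math.sqrt(mse / pvariance(truth))


def sum_of_ranks(nrmse_by_method):
    """Map {method: [NRMSE per missing variable]} to {method: sum of ranks}.

    Rank 1 is the lowest NRMSE on a variable; ties are broken by method order.
    """
    methods = list(nrmse_by_method)
    n_vars = len(next(iter(nrmse_by_method.values())))
    sor = {m: 0 for m in methods}
    for i in range(n_vars):
        ordered = sorted(methods, key=lambda m: nrmse_by_method[m][i])
        for rank, m in enumerate(ordered, start=1):
            sor[m] += rank
    return sor


# Toy NRMSE values for two methods on three missing variables (made-up numbers).
sor = sum_of_ranks({"A": [0.1, 0.2, 0.1], "B": [0.3, 0.1, 0.4]})  # -> {"A": 4, "B": 5}
```

Under these assumptions a lower SOR indicates better imputation accuracy across the missing variables.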

Procrustes analysis, a statistical shape analysis, could be used to evaluate the similarity of two ordinations by calculating the sum of squared errors [

Labeled measurements include correlation analysis for log-transformed

Furthermore, we evaluated the impacts of different imputation methods on the statistical sensitivity of detecting biological variances. On the simulation dataset, we calculated

(PDF)

SOR (upper left), PCA-Procrustes sum of squared errors (upper right), Pearson's correlation between log-transformed

(TIF)

SOR on simulation dataset along with different numbers of missing variables based on four different numbers of iterations:

(TIF)

# missing variables: number of missing variables; iters_each: number of iterations for imputing each missing variable; iters_all: number of iterations for imputing the whole matrix; n_cores: number of cores.

(XLSX)

R. Wei and J. Wang would like to thank their parents (B. Wei, X. He, K. Wang, and Q. Peng) for their endless love and support. They are also grateful to Mr. Link who is always curious about the unexplored land.