GSimp: A Gibbs sampler based left-censored missing value imputation approach for metabolomics studies

Left-censored missing values commonly exist in targeted metabolomics datasets and can be considered as missing not at random (MNAR). Improper data processing procedures for missing values will cause adverse impacts on subsequent statistical analyses. However, few imputation methods have been developed and applied to the situation of MNAR in the field of metabolomics. Thus, a practical left-censored missing value imputation method is urgently needed. We developed an iterative Gibbs sampler based left-censored missing value imputation approach (GSimp). We compared GSimp with other three imputation methods on two real-world targeted metabolomics datasets and one simulation dataset using our imputation evaluation pipeline. The results show that GSimp outperforms other imputation methods in terms of imputation accuracy, observation distribution, univariate and multivariate analyses, and statistical sensitivity. Additionally, a parallel version of GSimp was developed for dealing with large scale metabolomics datasets. The R code for GSimp, evaluation pipeline, tutorial, real-world and simulated targeted metabolomics datasets are available at: https://github.com/WandeRum/GSimp.

Missing values caused by the limit of detection/quantification (LOD/LOQ) were widely observed in mass spectrometry (MS)-based targeted metabolomics studies and could be recognized as missing not at random (MNAR). MNAR leads to biased parameter estimations and jeopardizes following statistical analyses in different aspects, such as distorting sample distribution, impairing statistical power, etc. Although a wide range of missing value imputation methods was developed for-omics studies, a limited number of methods was designed appropriately for the situation of MNAR currently. To alleviate problems caused by MNAR and to facilitate targeted metabolomics studies, we developed a Gibbs sampler based missing value imputation approach, called GSimp, which is public-

Introduction
Missing values are commonly existed in mass spectrometry (MS) based metabolomics datasets. Many statistical methods require a complete dataset, which makes missing data an inevitable problem for subsequent data analysis. Generally speaking, missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR) are three commonly accepted missing types [1,2]. When the probability of a missing value is depended on other observed variables but not the value itself, it is regarded as MAR [1,2] (e.g., false peak matching, deconvolution of co-eluting compounds). MCAR is from completely unexpected missingness without any relationships with other variables (e.g., stochastic fluctuations, random errors from incomplete derivatization or ionization). Targeted metabolomics studies have been widely used for the accurate quantification of specific groups of metabolites. Due to the limit of compound quantifications (LOQ), missing values are usually caused by signal intensities lower than LOQ, also known as left-censored missing, which can be assigned to MNAR. The processing of missing values has been developed and studied in MS data, which is an indispensable step in the metabolomics data processing pipeline [3]. One simple solution is the substitution of missing by determined values, such as zero, half of the minimum value (HM) or LOQ/c where c denotes a positive integer. Determined value substitutions, although commonly applied for dealing with missing values in metabolomics studies [4][5][6], can significantly affect the subsequent statistical analyses in different ways (e.g., underestimate variances of the variable, decrease statistical power, fabricate pseudo-clusters among observations, etc.) [1]. Advanced statistical imputation methods have been developed for high-dimensionalomics studies (e.g., k-nearest neighbors (kNN) [7], singular value decomposition (SVD) [8,9], random forest (RF) [10]) that are available to users on several metabolomics data analysis software [11][12][13][14][15]. MetaboAnalyst [15][16][17] is a popular metabolomics data processing web-tool providing kNN, Probabilistic PCA (PPCA), Bayesian PCA (BPCA), SVD, or substitution by determined values (HM, mean, median, minimum). However, most advanced statistical imputation methods are mainly aiming at imputing MCAR/MAR and not suitable for the situation of MNAR. So far, a limited number of approaches dealing with left-censored missing values were applied by researchers [18,19]. Quantile regression approach for left-censored missing (QRILC) imputes missing data using random draws from a truncated distribution with parameters estimated using quantile regression [18]. Although this imputation keeps the overall distribution of missing parts compared to determined value substitutions, it may produce stochastic imputed values since no extra information is used for the prediction of missing parts. Another imputation method recently developed for MNAR is k-nearest neighbor truncation (kNN-TN) [19]. This approach applies Maximum Likelihood Estimators (MLE) for the means and standard deviations of missing variables based on truncated normal distribution. Then a Pearson correlation based kNN imputation method was implemented on standardized data. Although the author stated that kNN-TN could impute both MNAR and MAR, the imputed values were entirely dependent on the nearest neighbors while no constraint was placed upon the imputation. Thus, this approach might cause an overestimation of MNAR missing values.
To reduce adverse effects caused by missing values in following metabolomics data analyses, we developed a left-censored missing value imputation framework, GSimp, where a prediction model was embedded in an iterative Gibbs sampler. Next, we compared GSimp with HM, QRILC, and kNN-TN on two real-world metabolomics datasets and one simulation dataset to demonstrate the advantages of GSimp regarding imputation accuracy, observation distribution, univariate and multivariate analysis [20], and sensitivity. Our findings indicate that GSimp is a robust method in handling left-censored missing values in targeted metabolomics studies.

Gibbs sampler in GSimp
A variable containing missing elements from free fatty acids (FFA) dataset was randomly selected to track the sequence of corresponding parameters and estimates across the first 500 iterations out of a total of 2000 (100 × 20) iterations using GSimp. From Fig 1, we can observe that both fitted value ŷ and sample value ỹ reach to the convergence after several iterations and the standard deviation estimate σ drop to a steady state of small values. In addition, an upper constraint for the distribution of ỹ indicated that it was drawn from a truncated normal distribution.

Imputation comparisons
We evaluated four different MNAR imputation/substitution methods on FFA, bile acids (BA) targeted metabolomics and simulation datasets. First, we measured the imputation performances using label-free approaches. Sum of ranks (SOR) was used to measure the imputation accuracy regarding the imputed values of each missing variable. From the upper panel of Fig 2, we can observe that GSimp has the best performance with the lowest SOR across all varying numbers of missing variables in both FFA and BA datasets. To measure the extent of imputation induced distortion on observation distributions, the PCA-Procrustes analysis was conducted between the original data and imputed data. The lower panel of Fig 2 shows that GSimp has the lowest Procrustes sum of squared errors compared to other methods, which means GSimp kept the overall observation distribution of original dataset with the least distortions.
Then, we measured the imputation performances with clinical group information provided. We compared the results of univariate and multivariate analyses for imputed and original datasets. Since this is a case-control study, student's t-tests were applied for univariate analyses. Then we compared the results by calculating Pearson's correlation between log-transformed p-values calculated from imputed and original data for missing variables. Again, GSimp performs best with the highest correlations among four methods (upper panel of Fig 3) along with different numbers of missing variables, and it implies GSimp keeps the most original biological On the simulation dataset, we compared QRILC, kNN-TN, and GSimp using same approaches. Consistent results were recognized (S1 Fig), and GSimp presents the best performances on the simulation dataset with the lowest SOR and PCA/PLS-Procrustes sum of squared errors and the highest correlation of univariate analysis results. Moreover, to examine the influences of statistical power using different imputation methods, we calculated the true positive rate (TPR) as the capacities to detect differential variables on different imputation datasets. Again, with both p-cutoff of 0.05 and 0.01, GSimp shows the overall highest TPR over different missing numbers (Fig 4). This implies that GSimp impairs the sensitivity to the least extent among three methods, which is reasonable since GSimp also keeps the highest correlation of p-values in previous comparisons.

Discussion
The purpose of this study is to develop a left-censored missing value imputation approach for targeted metabolomics data analysis. We evaluated GSimp with other three imputation methods (kNN-TN, QRILC, and HM) and suggested that GSimp was superior to others using different evaluation methods. To illustrate the performance of GSimp, we randomly selected one variable containing missing values from FFA dataset (Fig 5) to compare the imputed values and original values. Although determined value substitution (e.g., HM) were widely used by researchers in the field of metabolomics, our results indicated that HM could severely distort the data distribution (upper left panel of Fig 5), thus impairing subsequent analyses. In comparison, QRILC kept the overall data distribution and variances (upper right panel of Fig  5). However, stochastic values would be generated by this approach since QRILC imputes each missing variable independently without utilizing the predictive information from other variables. Statistical learning based method, kNN-TN, applied a correlation based kNN algorithm with parameters of missing variables estimated with truncated normal distributions. This method utilized the information of highly correlated variables of targeted missing variable, thus kept a linear trend between original values and imputed values. However, since no constraint was applied for the imputation, a right shift of missing part occurs, causing imputed values to exceed the truncation point (lower left panel of Fig 5). In contrast, GSimp utilized the predictive information of other variables by employing a prediction model and held a truncated normal distribution for each missing element simultaneously, which ensured a favorable linear trend between imputed and original values as well as a reasonable bound for the imputed values (lower right panel of Fig 5).
In this study, we comprehensively evaluated our algorithm on targeted metabolomics datasets for the MNAR situation. We additionally tested a non-targeted GC/MS profiling metabolomics dataset and found that most of missing values are manually retrievable due to the missidentification of peaks. These retrievable missing elements were randomly distributed across the dataset and irrelevant to their true abundances (S1 Appendix). Based on this, we assumed the majority of missing values are MCAR/MAR situation for non-targeted GC/MS data before manually missing retrieval. For the rest un-retrievable missing elements, we found much lower signal to noise ratios which could be assigned to the situation of left-censored MNAR. Thus, for non-targeted profiling datasets, missing retrieval from raw spectral data will be most recommended. Since we applied the minimum observed value of missing variable as an informative upper truncation point and -1 as a non-informative lower truncation point for leftcensored missing, GSimp with this default settings might be applicable for the imputation of post-missing retrieval non-targeted data.
GSimp is more than that, other truncation values could also be applied in real-world analyses, such as known LOQ/LOD of metabolites or quantile of observed values (e.g., 10%) can be set as upper truncation points for different conditions. Additionally, when signal intensity of certain compound is larger than the upper limit of quantification range or saturation during instrument analysis, an informative lower truncation point could be correspondingly applied for the right-censored missing value. What's more, when non-informative bounds for both upper and lower limits (e.g., +1, -1) were applied, GSimp could be extended to the situation of MCAR/MAR. With the flexible usage of upper and lower limits, our approach may provide a versatile and powerful imputation technique for different missing types. For other-omics datasets with missing values (especially MNAR) (e.g., single cell RNA-sequencing data), we could also apply this method with few modifications of default settings. Thus, it is worthy to evaluate our approach, GSimp, in other complex scenarios in the future.
Since GSimp employed an iterative Gibbs sampler method, a large number of iterations (iters_all = 20, iters_each = 100) are preferable for the convergence of parameters in Markov chain Monte Carlo (MCMC) method. However, as we tested on the simulation dataset with different number of iterations, a smaller number of iterations (iters_all = 10, iters_each = 50) won't severely affect the imputation accuracy (S2 Fig). Among iterations for the whole data matrix, we applied a sequential imputation procedure for missing variables from the least number of missing values to the most. To improve the computational efficiency of GSimp on large scale datasets, we additionally implemented a parallel version which can run Gibbs sampler on multiple missing variables simultaneously, then update all imputed values of missing elements. Increasing the number of cores will significantly decrease the running time of GSimp as we tested on a random generated dataset (S1 Table).
In conclusion, we developed a new imputation approach GSimp that outperformed traditional determined value substitution method (HM) and other approaches (QRILC, and kNN-TN) for MNAR situations. GSimp utilized predictive information of variables and held a truncated normal distribution for each missing element simultaneously via embedding a prediction model into the Gibbs sampler framework. With proper modifications on the parameter settings (e.g., truncation points, pre-processing, etc.) GSimp may be applicable to handle different types of missing values and in different -omics studies, thus deserved to be further explored in the future.

Diabetes datasets
We employed datasets from a study of comparing serum metabolites between obese subjects with diabetes mellitus (N = 70) and healthy controls (N = 130) where N represents the number of observations. Dataset 1: a total of 42 free fatty acids (FFAs) were identified and quantified in those participants in order to evaluate their FFA profiles [21]. Dataset 2: a total of 34 bile acids (BAs) were identified and quantified in a similar way using different analytical protocol [22].

Simulation dataset
For the simulation dataset, we first calculated the covariance matrix Cov based on the whole diabetes dataset (P = 76) where P represents the number of variables. Then we generated two separated data matrices with the same number of 80 observations from multivariate normal distributions, representing two different biological groups. For each data matrix, the sample mean of each variable was drawn from a normal distribution N(0, 0.5 2 ) and Cov was kept using SVD. Then, two data matrices were horizontally (column-wise) stacked together as a complete data matrix (N×P = 160×76) so that group differences were simulated and covariance was kept.

MNAR generation
For two real-world targeted metabolomics datasets, we generated a series of MNAR datasets by using the missing proportion (number of missing variables/number of total variables) from 0.1 to 0.6 in a step of 0.05 with quantile cut-off for each missing variable drawn from a uniform distribution U(0.1, 0.5). The elements lower than the corresponding cut-off were removed and replaced with NA. For the simulation dataset, we generated a series of MNAR datasets by using the missing proportion from 0.1 to 0.8 step by 0.1 with MNAR cut-off drawn from U (0.3, 0.6) for a more rigorous testing.

Prediction model
A prediction model was employed for the prediction of missing values by setting a targeted missing variable as outcome and other variables as predictors. Different prediction models (e.g., linear regression, elastic net [23], regression trees [24] and random forest [25], etc.) could be embedded in our imputation framework. Elastic net was applied in our approach as an ideal prediction model considering its stability, accuracy, and efficiency. This model is a regularized regression with the combination of L1 and L2 penalties of the LASSO [26] and ridge [27] methods. The estimates of regression coefficients in elastic net are defined aŝ The L2 penalty ð1 À aÞ=2kbk 2 2 improves the model's robustness by controlling the multicollinearities among variables which are widely existed in high-dimensional-omics data. And the L1 penalty αkβk 1 controls the number of predictors by assigning zero coefficients to the "unnecessary" predictors. From a Bayesian point of view, the regularization is a mixture of Gaussian and Laplacian prior distributions of coefficients which can pull the full model of maximum likelihood estimates argmin b ky À Xbk 2 towards the null model of prior coefficients distribution, thus controls the risk of overfitting and increase the model robustness. R package glmnet was used for the elastic net. We set hyperparameters λ as 0.01 (default setting for highdimensional data) and α as 0.5 (an equally mixture of LASSO and ridge penalties) [28].

Gibbs sampler
Gibbs sampler is a MCMC technique that sequentially updates parameters while others are fixed. It can be used to generate posterior samples. For each missing variable in the dataset, we applied a Gibbs sampler to impute the missing values by sampling from a truncated normal distribution with prediction model fitted value as mean and root mean square deviation (RMSD) of missing part as standard deviation while truncated by specified cut-points. Assuming we have a n × p data matrix X = (X 1 , X 2 , X 3 , . . ., X p ) with only one variable X j containing left-censored missing values. We denote X j as y and the missing part as y m with length m and non-missing part as y f with length f, and the rest of matrix X -j as X'. We can then set the lower truncation point lo as -1 (centralized data) or 0 (original data) and upper hi as the minimum/ quantile value of y f or a given LOQ. The truncation bounds ensure imputation results are constrained within [lo, hi]. Then, the Gibbs sampler approach can be described as following steps: Step-1 (initialization): we initialize missing values (QRILC in our case), and get y'; Step-2 (prediction): we then build a prediction model (elastic net in our case): y'~X'; Step Step-4 (sampling): we draw sampleỹ m i from a truncated normal distribution Nðŷ m i ; s 2 j½lo; hiÞ for ith missing element and update y'.
We iteratively repeat step-2 to step-4 and update X j .

GSimp framework
A whole data matrix X = (X 1 , X 2 , X 3 , . . ., X p ) contains a number of k (k p) left-censored missing variables. We present our imputation framework as following algorithm. Algorithm: Gibbs sampler based left-censored missing value imputation approach Require: X an n × p data matrix, iters_all the number of iterations for imputing the whole matrix X, iters_each the number of iterations for imputing each missing variable, a vector of upper limits U (+1 for non-missing variables) with length p and a vector of lower limits L (-1 for non-missing variables) with length p. y 0 X imp j , y 0 can be divided into two parts: y 0 m is a vector of the imputed part (original missing part) with length m and y 0 f is a vector of the non-missing part with length f while n = m + f;

6.
X 0 X imp À j , represents the matrix X with jth column removed; 7.
lo L j and hi U j ; 8.
end for 11. Update X imp j ; 12.
end for 13. end for 14.return X imp

Other imputation approaches
Other three left-censored missing imputation/substitution methods were conducted in our study for performance comparison: • kNN-TN (Truncation k-nearest neighbors imputation) [19]: this method applied a Newton-Raphson (NR) optimization to estimate the truncated mean and standard deviation. Then, Pearson correlation was calculated based on standardized data followed by correlation-based kNN imputation.
• QRILC (Quantile Regression Imputation of Left-Censored data) [18,29]: this method imputes missing elements randomly drawing from a truncated distribution estimated by a quantile regression. R package imputeLCMD was applied for this imputation approach.
• HM (Half of the Minimum): This method replaces missing elements with half of the minimum of non-missing elements in the corresponding variable.

Assessments of performance
Normalized Root Mean Squared Error (NRMSE) [30] has been commonly used to evaluated the differences between true values and imputed values. Considering the skewed distribution of missing values in MNAR, NRMSE based sum of ranks (SOR) was derived, a robust nonparametric measurement, to compare different imputation methods. The formula is as follows [31]: where Rank i (NRMSE) represent the NRMSE ranks of different imputation methods in i th missing variable. Procrustes analysis, a statistical shape analysis, could be used to evaluate the similarity of two ordinations by calculating the sum of squared errors [32]. We applied principal component analysis (PCA) as the unsupervised (un-labeled) ordination measurement and Procrustes analysis to measure the alteration of the original sample distribution and the imputed sample distribution in the space of top PCs. R package vegan was applied for Procrustes analysis [33].
Labeled measurements include correlation analysis for log-transformed p-values between true data and imputed data from Student's t-test, partial least square (PLS)-Procrustes analysis that measures the differences between original and imputed sample distributions on top PCs using supervised PLS for the dimensional reduction. R package ropls was applied for PLS analysis [34]. These measurements were done using our imputation evaluation pipeline from our previous study [31], which is also accessible through: https://github.com/WandeRum/MVIevaluation.
Furthermore, we evaluated the impacts of different imputation methods on the statistical sensitivity of detecting biological variances. On the simulation dataset, we calculated p-values from student's t-tests between two groups from original and imputed datasets. We marked a set S as real differential variables at a significant level of p-cutoff (e.g., 0.05) from original simulation data, and a set S' as detected differential variables at the same significant level from imputed simulation data. Then we calculated the true positive rate TPR ¼ #of ðS\S 0 Þ #of S to evaluate the effects of different imputation methods in terms of detecting differential variables.