Accounting for multiple imputation-induced variability for differential analysis in mass spectrometry-based label-free quantitative proteomics

Imputing missing values is common practice in label-free quantitative proteomics. Imputation aims at replacing a missing value with a user-defined one. However, the imputation itself may not be optimally accounted for downstream of the imputation process, as imputed datasets are often treated as if they had always been complete. Hence, the uncertainty due to the imputation is not adequately taken into account. We provide a rigorous multiple imputation strategy, leading to a less biased estimation of the parameters' variability thanks to Rubin's rules. The imputation-based estimator of the peptide intensity variance is then moderated using Bayesian hierarchical models. This estimator is finally included in moderated t-test statistics to provide differential analysis results. This workflow can be used on quantification datasets at both the peptide and protein level. Indeed, an aggregation step is included to derive protein-level results from peptide-level quantification data. Our methodology, named mi4p, was compared to the state-of-the-art limma workflow implemented in the DAPAR R package, both on simulated and real datasets. We observed a trade-off between sensitivity and specificity, while mi4p outperforms DAPAR in terms of overall F-score.


Introduction
Dealing with incomplete data is one of the main challenges in statistical analysis. Different strategies can be used to tackle this issue. The simplest consists of deleting from the dataset the observations for which there are too many missing values, leading to a complete-case dataset. However, it causes information loss, might create bias and could ultimately result in poorly informative datasets.
Another way to cope with missing data is to use methods that account for the missing information. Over the last decades, researchers have advocated the use of a technique called imputation. Imputing missing values consists of replacing a missing value with a value derived from a user-defined rule (such as the mean, the median or a value provided by an expert, thus incorporating the user's knowledge). It thereby makes it possible to perform the analysis as if the data were complete. In particular, the vector of parameters of interest can then be estimated.
Figure 1: Single imputation workflow. (1) Initial dataset with missing values, made of N observations split into I groups. (2) Single imputation provides a completed dataset. (3) The vector of parameters of interest is estimated based on the single imputed dataset.
Single imputation means completing the dataset once and considering the imputed dataset as if it was never incomplete, see Figure 1. However, single imputation has the major disadvantage of discarding the variability from the missing data and the imputation process. It may also lead to a biased estimator of the vector of parameters of interest.
Multiple imputation, described by Little and Rubin (2019), closes this loophole by generating several imputed datasets. These datasets are then used to build a combined estimator of the vector of parameters of interest, usually by averaging the estimates over all the imputed datasets, see Figure 2. This combined estimator is known as the first Rubin's rule. The second Rubin's rule states a formula to estimate the variance of the combined estimator, decomposing it as the sum of a within-imputation component and a between-imputation component. The rule of thumb suggested by White et al. (2011) takes the number of imputed datasets equal to the percentage of missing values in the original dataset. Recent work focused on better estimating the fraction of missing information (Pan et al.) or on improving that rule (von Hippel). Note that Rubin's rules cannot be used to get a combined imputed dataset; instead, they provide an estimator of the vector of parameters of interest and an estimator of its covariance matrix, both based on multiple imputation, see Figure 2.
Figure 2: Multiple imputation workflow. (1) Initial dataset with missing values, made of N observations split into I groups. (2) Multiple imputation provides D estimators of the vector of parameters of interest. (3a) The D estimators are combined using the first Rubin's rule to get the combined estimator. (3b) The estimator of the variance-covariance matrix of the combined estimator is provided by the second Rubin's rule.
Dealing with missing values is also one of the main struggles in label-free quantitative proteomics. Intensities of thousands of peptides are obtained by liquid chromatography-tandem mass spectrometry, using extracted ion chromatograms. Several reasons may cause missing peptide intensities. Either the considered peptide is absent from the given biological sample, in which case the intensity is missing not at random (MNAR), or it was missed during the acquisition process, in which case the intensity is missing at random (MAR).
In state-of-the-art software for statistical analysis in label-free quantitative proteomics, single imputation is the most commonly used method to deal with missing values. In the MSstats R package (available on Bioconductor), Choi et al. (2014) distinguish missing completely at random values and missing values due to low intensities. The user can then choose to impute the censored value using a threshold value or an Accelerated Failure Time model. The Perseus software by Tyanova et al. (2016) offers three methods for single imputation: imputing by "NaN", imputing by a user-defined constant, or imputing according to a Gaussian distribution in order to simulate intensities lower than the limit of detection. Recently, Goeminne et al. (2020) implemented a single imputation method based on a hurdle model in their MSqRob R package (Goeminne et al., 2018). As far as machine learning is concerned, Song and Yu (2021) suggested a method for imputing missing values in label-free mass spectrometry-based proteomics datasets based on the XGBoost algorithm.
The ProStaR software, based on the DAPAR R package and developed by Wieczorek et al. (2017), splits missing values into two categories, depending on whether they are Missing in an Entire Condition (MEC) or Partially Observed Values (POV) (Lazar et al., 2016). The software allows single imputation, using either a small quantile from the distribution of the considered biological sample, the k-Nearest Neighbours (kNN) algorithm, the Structured Least Squares Adaptive algorithm, or a user-chosen fixed value. The PANDA-view software developed by Chang et al. (2018) also enables the use of the kNN algorithm or a fixed value. Moreover, both software programs give the possibility to impute the dataset several times before combining the imputed datasets in order to get a final dataset without any missing values. PANDA-view relies on the mice R package by Van Buuren and Groothuis-Oudshoorn (2011), whereas ProStaR accounts for the nature of missing values and imputes them with the imp4p R package implemented by Giai Gianetto et al. However, both software programs consider the final dataset as if it had always been complete. The uncertainty due to multiple imputation is not properly taken into account downstream of the imputation step.
In the following, we will carry the multiple imputation process through to its end, as described by Little and Rubin (2019), and use the imputed datasets to provide a combined estimator of the vector of parameters of interest as well as a combined estimator of its variance-covariance matrix. We will then project this matrix to get a unidimensional variance estimator before moderating it using the empirical Bayes procedure defined in the seminal paper of Smyth et al. (2004) and later developed by Phipson et al. (2016). It is well known that such a moderating step highly improves subsequent statistical analyses such as significance testing or confidence interval estimation, both at the peptide level (Suomi et al., 2015; Goeminne et al., 2015) and at the protein level (Goeminne et al., 2015, 2016).

Simulated datasets
We evaluated our methodology on three types of simulated datasets. First, we considered an experimental design where the distributions of the two groups to be compared scarcely overlap. This design leads to a fixed-effect one-way analysis of variance (ANOVA) model, which can be written as:

$$y_{ijk} = \mu + \delta_{ij} + \varepsilon_{ijk},$$

with $\mu = 100$, $\delta_{ij} = 100$ if $1 \leq i \leq 10$ and $j = 2$, $\delta_{ij} = 0$ otherwise, and $\varepsilon_{ijk} \sim \mathcal{N}(0, 1)$. Here, $y_{ijk}$ represents the log-transformed abundance of peptide i in the k-th replicate of condition j. Thus, we generated 100 datasets of 200 individuals and 10 variables, divided into 2 groups of 5 variables, using the following steps:

1. For the first 10 rows of the data frame, set as differentially expressed, draw the first 5 observations (first group) from a Gaussian distribution with a mean of 100 and a standard deviation of 1. Then draw the remaining 5 observations (second group) from a Gaussian distribution with a mean of 200 and a standard deviation of 1.
2. For the remaining 190 rows, set as non-differentially expressed, draw the first 5 observations as well as the last 5 observations from a Gaussian distribution with a mean of 100 and a standard deviation of 1.
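The two steps above can be sketched as follows. This Python/NumPy stand-in mirrors the original R simulations; the function name and seed are illustrative:

```python
import numpy as np

def simulate_design1(rng):
    """One dataset of the first design: 200 peptides x 10 samples (2 groups of 5)."""
    # baseline: mean 100, sd 1 for every peptide in every sample
    data = rng.normal(100.0, 1.0, size=(200, 10))
    # first 10 rows are differentially expressed: mean 200 in the second group
    data[:10, 5:] = rng.normal(200.0, 1.0, size=(10, 5))
    return data

rng = np.random.default_rng(42)
datasets = [simulate_design1(rng) for _ in range(100)]
```

With these parameters the two group distributions are 100 standard deviations apart, which is what makes the overlap negligible.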
Secondly, we considered an experimental design where the distributions of the two groups to be compared might highly overlap. We based it on the random hierarchical ANOVA model of Lazar et al. (2016), derived from Karpievitch et al. (2012). The simulation design follows the model:

$$y_{ij} = P_i + G_{ik} + \varepsilon_{ij},$$

where $y_{ij}$ is the log-transformed abundance of peptide i in the j-th sample, $P_i$ is the mean value of peptide i, $G_{ik}$ is the mean difference between the condition groups, and $\varepsilon_{ij}$ is the random error term, which stands for the peptide-wise variance. We generated 100 datasets of 1000 individuals and 20 variables, divided into 2 groups of 10 variables, using the following steps:

1. Generate the peptide-wise effect P_i by drawing 1000 observations from a Gaussian distribution with a mean of 1.5 and a standard deviation of 0.5.
2. Generate the group effect G ik by drawing 200 observations (for the 200 individuals set as differentially expressed) from a Gaussian distribution with a mean of 1.5 and a standard deviation of 0.5 and 800 observations fixed to 0.
3. Build the first group dataset by replicating 10 times the sum of P i and the random error term, drawn from a Gaussian distribution of mean 0 and standard deviation 0.5.
4. Build the second group dataset by replicating 10 times the sum of P i , G ik and the random error term drawn from a Gaussian distribution of mean 0 and standard deviation 0.5.
5. Bind both datasets to get the complete dataset.
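The five steps above can be sketched in the same way. Again, this is a Python/NumPy stand-in for the original R simulations, with illustrative names and seed:

```python
import numpy as np

def simulate_design2(rng):
    """One dataset of the second design: 1000 peptides x 20 samples (2 groups of 10)."""
    P = rng.normal(1.5, 0.5, size=1000)                  # peptide-wise effect P_i
    G = np.concatenate([rng.normal(1.5, 0.5, size=200),  # 200 differentially expressed
                        np.zeros(800)])                  # 800 non-differentially expressed
    noise = lambda: rng.normal(0.0, 0.5, size=(1000, 10))
    group1 = P[:, None] + noise()                        # replicate 10 times: P_i + error
    group2 = (P + G)[:, None] + noise()                  # P_i + G_ik + error
    return np.hstack([group1, group2])                   # bind both group datasets

rng = np.random.default_rng(1)
data = simulate_design2(rng)
```

Here the group difference (mean 1.5, sd 0.5) is on the same scale as the noise, so the two group distributions can overlap substantially, unlike in the first design.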
Finally, we considered an experimental design similar to the second one, but with random effects P i and G ik . The 100 datasets were generated as follows.
1. For the first group, replicate 10 times (for the 10 variables in this group) a draw from a mixture of 2 Gaussian distributions. The first one has the following parameters: a mean of 1.5 and a standard deviation of 0.5 (corresponds to P i ). The second one has the following parameters: a mean of 0 and a standard deviation of 0.5 (corresponds to ij ).
2. For the second group, replicate 10 times (for the 10 variables in this group) a draw from a mixture of the following 3 distributions.
(a) The first one is a Gaussian distribution with the following parameters: a mean of 1.5 and a standard deviation of 0.5 (corresponds to P i ).
(b) The second one is the mixture of a Gaussian distribution with a mean of 1.5 and a standard deviation of 0.5 for the 200 first rows (set as differentially expressed) and a zero vector for the remaining 800 rows (set as not differentially expressed). This mixture illustrates the G ik term in the previous model.
(c) The third distribution has the following parameters: a mean of 0 and a standard deviation of 0.5 (corresponds to ij ).

Real datasets
We challenged our methodology on several real datasets coming from two different experiments described hereafter.
We consider a first real dataset from Muller et al. (2016). The experiment involved six peptide mixtures, composed of a constant yeast (Saccharomyces cerevisiae) background, into which increasing amounts of the UPS1 standard protein mixture (Sigma) were spiked at 0.5, 1, 2.5, 5, 10 and 25 fmol, respectively. In a second well-calibrated dataset, yeast was replaced by a more complex total lysate of Arabidopsis thaliana, in which UPS1 was spiked in 7 different amounts, namely 0.05, 0.25, 0.5, 1.25, 2.5, 5 and 10 fmol. For each mixture, technical triplicates were constituted. The Saccharomyces cerevisiae dataset was acquired on a nanoLC-MS/MS coupling composed of a nanoAcquity UPLC device (Waters) coupled to a Q-Exactive Plus mass spectrometer (Thermo Scientific, Bremen, Germany), as extensively described in Muller et al. (2016). The Arabidopsis thaliana dataset was acquired on a nanoLC-MS/MS coupling composed of a nanoAcquity UPLC device (Waters) coupled to a Q-Exactive HF-X mass spectrometer (Thermo Scientific, Bremen, Germany), as described in Supplementary data, Section S7.3.
For the Saccharomyces cerevisiae and Arabidopsis thaliana datasets, the MaxQuant software was used to identify peptides and derive extracted ion chromatograms. Peaks were assigned with the Andromeda search engine with full trypsin specificity. The database used for the searches was concatenated in-house from the Saccharomyces cerevisiae entries extracted from the UniProtKB-SwissProt database (16 April 2015, 7806 entries) or the Arabidopsis thaliana entries (09 April 2019, 15 818 entries) and those of the UPS1 proteins (48 entries). The maximum false discovery rate was set at 1% at both peptide and protein levels using a target-decoy strategy. For the Arabidopsis thaliana + UPS1 experiment, data were extracted both with and without Match Between Runs, and two pre-filtering criteria were applied before statistical analysis: only peptides with at least 1 out of 3 quantified values in each condition, on the one hand, or with at least 2 out of 3, on the other hand, were kept. Thus, 4 datasets derived from the Arabidopsis thaliana + UPS1 experiment were considered. The same filtering criteria were applied to the Saccharomyces cerevisiae + UPS1 experiment, but only on data extracted with Match Between Runs, leading to 2 datasets being considered.

Normalization
Normalising peptides' or proteins' intensities aims at reducing batch effects and sample-level variations, and therefore at allowing better comparison of intensities across the studied biological samples. In this work, quantile normalisation (as described by Bolstad et al. (2003)) was performed using the normalize.quantiles function from the preprocessCore R package (Bolstad et al., 2021).
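The idea behind quantile normalisation is simple to sketch: replace each sample's values by a common reference distribution, assigned by rank. The following Python stand-in reproduces the principle of Bolstad et al. (2003); unlike the R implementation, it does not average tied values:

```python
import numpy as np

def quantile_normalize(mat):
    """Quantile-normalise the columns of an intensity matrix (peptides x samples).

    Each column is replaced, rank for rank, by the mean sorted profile across
    samples, so all samples share the same empirical distribution.
    """
    order = np.argsort(mat, axis=0)        # per-sample ranking of the values
    ranked = np.sort(mat, axis=0)
    mean_profile = ranked.mean(axis=1)     # reference distribution (row-wise mean)
    out = np.empty_like(mat, dtype=float)
    for j in range(mat.shape[1]):
        out[order[:, j], j] = mean_profile # assign reference values by rank
    return out

mat = np.array([[5., 7., 3.],
                [2., 1., 4.],
                [3., 4., 6.],
                [4., 2., 8.]])
norm = quantile_normalize(mat)
```

After normalisation, every column of `norm` has exactly the same set of values, only ordered according to the original ranks within each sample.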

Multiple imputation methods
Several methods for imputing missing values in mass spectrometry-based proteomics datasets have been developed in the last decade. However, recent benchmarks of imputation algorithms do not reach a consensus (as shown in Supplementary data, Table S1.1), mainly because of the complex nature of the underlying missing-value mechanism. In this work, we chose to focus on some of the most commonly used methods, listed in Table 1.

Table 1: Overview of the considered imputation methods, their implementations and references.

The k-Nearest Neighbours (kNN) method imputes missing values by averaging the k nearest observations of the given missing value in terms of Euclidean distance. This method was described by Hastie et al. (1999) and Troyanskaya et al. (2001) and implemented in Hastie et al. (2021). The Maximum Likelihood Estimation method imputes missing values using the EM algorithm proposed by Schafer (1997) and implemented by Giai Gianetto et al. The Bayesian linear regression method imputes missing values using the normal model, following the method described by Rubin (1987) and implemented by Van Buuren and Groothuis-Oudshoorn (2011). The Principal Component Analysis method imputes missing values using the algorithm proposed by Josse and Husson (2013) and implemented by Giai Gianetto et al. The Random Forests method imputes missing values using the algorithm proposed by Stekhoven and Bühlmann (2012) and implemented by Giai Gianetto et al.
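As an illustration of the first of these methods, a minimal kNN imputer in the spirit of Troyanskaya et al. (2001) can be written as follows. This is a Python stand-in for the cited R implementations, with illustrative names and toy data:

```python
import numpy as np

def knn_impute(mat, k=2):
    """Impute NaNs row-wise: for each row with missing entries, find the k
    nearest complete rows (Euclidean distance on the observed columns) and
    average their values in the missing columns."""
    out = mat.copy()
    complete = mat[~np.isnan(mat).any(axis=1)]          # rows with no missing value
    for i in np.where(np.isnan(mat).any(axis=1))[0]:
        obs = ~np.isnan(mat[i])                         # observed columns of row i
        d = np.sqrt(((complete[:, obs] - mat[i, obs]) ** 2).sum(axis=1))
        nearest = complete[np.argsort(d)[:k]]           # k nearest complete rows
        out[i, ~obs] = nearest[:, ~obs].mean(axis=0)    # average them columnwise
    return out

mat = np.array([[1.0, 2.0, 3.0],
                [1.1, 2.1, 3.2],
                [0.9, np.nan, 3.1],
                [10.0, 11.0, 12.0]])
imputed = knn_impute(mat, k=2)
```

In this toy example the incomplete third peptide is closest to the first two, so its missing intensity is imputed as the average of theirs.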

Estimation
The objective of multiple imputation is to estimate, from the D drawn datasets, the vector of parameters of interest and its variance-covariance matrix. Notably, accounting for multiple-imputation-based variability is possible thanks to Rubin's rules, which provide an accurate estimation of these parameters.

Let us consider a D-times imputed dataset with P individuals (corresponding to P peptides or proteins) and N observations (corresponding to N biological samples), divided into I groups (corresponding to the I conditions to be compared). Let $\beta_p$ denote the vector of parameters of interest for peptide p. The first Rubin's rule provides the combined estimator of $\beta_p$:

$$\hat{\beta}_p = \frac{1}{D} \sum_{d=1}^{D} \hat{\beta}_{p,d},$$

where $\hat{\beta}_{p,d}$ is the estimator of $\beta_p$ in the d-th imputed dataset. The second Rubin's rule gives the combined estimator of the variance-covariance matrix of the estimated vector of parameters of interest for peptide p through the D imputed datasets:

$$\hat{\Sigma}_p = \frac{1}{D} \sum_{d=1}^{D} W_d + \frac{D+1}{D} \cdot \frac{1}{D-1} \sum_{d=1}^{D} \left(\hat{\beta}_{p,d} - \hat{\beta}_p\right)\left(\hat{\beta}_{p,d} - \hat{\beta}_p\right)^T,$$

where $W_d$ denotes the variance-covariance matrix of $\hat{\beta}_{p,d}$, i.e. the variability of the vector of parameters of interest as estimated in the d-th imputed dataset. The first term is the within-imputation component and the second the between-imputation component.
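Rubin's two rules can be sketched numerically. Below is a minimal Python/NumPy stand-in (the actual mi4p implementation is in R; the function name `rubin_pool` and the toy data are illustrative):

```python
import numpy as np

def rubin_pool(betas, covs):
    """Pool D imputed-dataset estimates with Rubin's rules.

    betas : (D, K) array, one row of parameter estimates per imputed dataset.
    covs  : (D, K, K) array of within-imputation covariance matrices W_d.
    Returns the combined estimate and its combined covariance matrix.
    """
    D = betas.shape[0]
    beta_bar = betas.mean(axis=0)          # first Rubin's rule: average the estimates
    W = covs.mean(axis=0)                  # within-imputation component
    dev = betas - beta_bar
    B = dev.T @ dev / (D - 1)              # between-imputation component
    Sigma = W + (1 + 1 / D) * B            # second Rubin's rule: total covariance
    return beta_bar, Sigma

# toy example: D = 5 imputed datasets, K = 2 parameters per peptide
rng = np.random.default_rng(0)
betas = rng.normal([1.0, 2.0], 0.1, size=(5, 2))
covs = np.tile(0.01 * np.eye(2), (5, 1, 1))
beta_hat, Sigma_hat = rubin_pool(betas, covs)
```

The (1 + 1/D) factor inflates the between-imputation spread to correct for the finite number of imputations, which is why the pooled variance is never smaller than the average within-imputation variance.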

Projection
State-of-the-art tests, including Student's t-test, Welch's t-test and the moderated t-test, rely on a variance estimation. Here, the variability induced by multiple imputation is described by a variance-covariance matrix. Therefore, a projection step is required to get a unidimensional variance parameter. In our work, we perform this projection using a formula based on the diagonal elements $\hat{\Sigma}_{p,(k,k)}$ of the matrix $\hat{\Sigma}_p$ and on the design matrix X.
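For illustration only, one simple way to collapse a covariance matrix into a single variance is sketched below. The averaging weights used here (diagonal of X^T X) are an assumption chosen for the example, not the exact mi4p projection formula:

```python
import numpy as np

def project_variance(Sigma, X):
    """Collapse a variance-covariance matrix into one variance parameter.

    Illustrative choice (NOT the exact mi4p formula): average the diagonal
    elements Sigma[k, k], each rescaled by the corresponding diagonal element
    of X^T X so the result is on the scale of a residual variance.
    """
    scale = np.diag(X.T @ X)
    return float(np.mean(np.diag(Sigma) * scale))

# two-group design matrix with 3 samples per condition
X = np.vstack([np.repeat([[1, 0]], 3, axis=0),
               np.repeat([[0, 1]], 3, axis=0)]).astype(float)
Sigma = np.diag([0.2, 0.4])    # toy pooled covariance for one peptide
s2 = project_variance(Sigma, X)
```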

Testing
In our work, we focus our methodology on the moderated t-test introduced by Smyth et al. (2004). This testing technique relies on the empirical Bayes procedure, commonly used in microarray data analysis and more recently for differential analysis in quantitative proteomics (Wieczorek et al., 2017). The moderated t-test procedure relies on the following Bayesian hierarchical model:

$$\hat{s}_p^2 \mid \sigma_p^2 \sim \frac{\sigma_p^2}{d_p} \chi^2_{d_p}, \qquad \frac{1}{\sigma_p^2} \sim \frac{1}{d_0 s_0^2} \chi^2_{d_0},$$

where $\sigma_p^2$ is the peptide-wise variance, and $d_0$ and $s_0$ are hyperparameters to be estimated (Phipson et al., 2016). From there, a so-called moderated estimator $\hat{s}^2_{p[mod]}$ of the variance $\sigma_p^2$ is derived:

$$\hat{s}^2_{p[mod]} = \frac{d_0 s_0^2 + d_p \hat{s}_p^2}{d_0 + d_p}.$$

This estimator is then plugged into the test statistic associated with the null hypothesis $H_0: \beta_{pi} = 0$:

$$T_{pi[mod]} = \frac{\hat{\beta}_{pi}}{\hat{s}_{p[mod]} \sqrt{(X^T \Omega_p X)^{-1}_{k,k}}},$$

with $(X^T \Omega_p X)^{-1}_{k,k}$ the k-th diagonal element of the matrix $(X^T \Omega_p X)^{-1}$. Under the null hypothesis $H_0$, $T_{pi[mod]}$ follows a Student distribution with $d_p + d_0$ degrees of freedom. Therefore, the results of this testing procedure account both for the specific structure of the data and for the uncertainty caused by the multiple imputation step.
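Once the hyperparameters are known, the moderated variance and test statistic are straightforward to compute. A minimal Python sketch follows; the numerical values of d0 and s0² are made up for the example, whereas in practice they are estimated by empirical Bayes as in Phipson et al. (2016):

```python
def moderated_t(beta_k, s2_p, d_p, d0, s0_sq, xtx_inv_kk):
    """Moderated t-statistic in the spirit of Smyth (2004).

    s2_mod shrinks the peptide-wise variance estimate s2_p towards the prior
    value s0_sq, with weights given by the respective degrees of freedom.
    Under H0, t follows a Student distribution with d_p + d0 degrees of freedom.
    """
    s2_mod = (d0 * s0_sq + d_p * s2_p) / (d0 + d_p)   # moderated variance estimator
    t = beta_k / (s2_mod * xtx_inv_kk) ** 0.5         # moderated t-statistic
    return t, s2_mod

t, s2_mod = moderated_t(beta_k=1.2, s2_p=0.5, d_p=4, d0=2, s0_sq=0.3, xtx_inv_kk=0.2)
```

Note how a small d0 leaves the peptide-wise estimate almost untouched, while a large d0 pulls all variances towards the common prior value.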
As many tests are performed as there are peptides considered. Hence, the proportion of falsely rejected hypotheses has to be controlled. Here, the False Discovery Rate control procedure of Benjamini and Hochberg (1995) was applied using the cp4p R package by Giai Gianetto et al. (2016).
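The Benjamini-Hochberg step-up procedure can be sketched as follows (a Python stand-in for the cp4p call; the adjusted p-values should match R's `p.adjust(..., method = "BH")`):

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR control, 1995)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    adj = p[order] * m / np.arange(1, m + 1)       # p_(i) * m / i on sorted p-values
    adj = np.minimum.accumulate(adj[::-1])[::-1]   # enforce monotonicity from the top
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)              # back to the original order
    return out

adj = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.2])
```

Rejecting the hypotheses whose adjusted p-value is below the chosen threshold then controls the expected proportion of false discoveries at that level.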

Aggregation
The methodology implemented in the mi4p R package can be applied to peptide-level quantification data as well as protein-level quantification data. However, we were interested in evaluating our method on a peptide-level dataset and inferring results at the protein level, as is common practice in proteomics. Therefore, for intensity aggregation, we chose to sum all unique peptides' intensities for each protein. The detailed pipeline for intensity aggregation is described in Supplementary data, Section S2.
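The aggregation rule (sum of the unique peptides' intensities per protein) can be sketched as follows. The peptide/protein names and data structures below are illustrative, not the mi4p API:

```python
import numpy as np

peptide_to_proteins = {            # peptide -> set of matching proteins
    "PEP1": {"ProtA"},
    "PEP2": {"ProtA"},
    "PEP3": {"ProtA", "ProtB"},    # shared (non-unique) peptide: filtered out
    "PEP4": {"ProtB"},
}
intensities = {                    # peptide -> intensities across samples
    "PEP1": np.array([10.0, 12.0]),
    "PEP2": np.array([5.0, 4.0]),
    "PEP3": np.array([7.0, 7.0]),
    "PEP4": np.array([3.0, 2.0]),
}

protein_intensities = {}
for pep, prots in peptide_to_proteins.items():
    if len(prots) != 1:            # keep unique (proteotypic) peptides only
        continue
    (prot,) = prots
    protein_intensities[prot] = protein_intensities.get(prot, 0) + intensities[pep]
```

Discarding shared peptides before summing avoids counting the same signal towards several proteins.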

Measures of performance
We compared our methodology to the limma testing pipeline implemented in the state-of-the-art ProStaR software, through the DAPAR R package. To assess the performance of both methods, we used the following measures: sensitivity (also known as true positive rate or recall), specificity (also known as true negative rate), precision (also known as positive predictive value), F-score and Matthews correlation coefficient. In our work, we define a true positive (respectively true negative) as a peptide/protein that is correctly considered as differentially expressed (respectively not differentially expressed) by the testing procedure. Similarly, we define a false positive (respectively false negative) as a peptide/protein that is falsely considered as differentially expressed (respectively not differentially expressed) by the testing procedure. The expressions of these performance indicators are given in Supplementary data, Section S3.
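These indicators are straightforward to compute from the counts of the confusion matrix; a minimal sketch with made-up counts:

```python
def performance(tp, fp, tn, fn):
    """Confusion-matrix indicators used to compare the two workflows."""
    sensitivity = tp / (tp + fn)                   # true positive rate / recall
    specificity = tn / (tn + fp)                   # true negative rate
    precision = tp / (tp + fp)                     # positive predictive value
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / (
        ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    )                                              # Matthews correlation coefficient
    return sensitivity, specificity, precision, f_score, mcc

sens, spec, prec, f1, mcc = performance(tp=90, fp=10, tn=880, fn=20)
```

Unlike the F-score, the Matthews correlation coefficient uses all four cells of the confusion matrix, which makes it informative even when the classes are strongly imbalanced, as is typical of spike-in proteomics datasets.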

Results and Discussion
We highlight here the results obtained using the maximum likelihood estimation imputation method. Results from other imputation methods on simulated data can be found in Supplementary data (Tables S4.3 to S4.6, S5.8 to S5.11 and S6.13 to S6.16). For each experiment, simulated or real, the performance of each method is based on adjusted p-values, with a 5% significance level and a 1% Benjamini-Hochberg False Discovery Rate. Figure 3 describes the evolution of the distribution of the differences in sensitivity and specificity between mi4p and DAPAR depending on the proportion of missing values in the first set of simulations. For a small proportion of missing values (1%), where the imputation process induces little variability, performance in terms of sensitivity, specificity and F-score is equivalent between both methods. Neither improvement nor deterioration was observed for sensitivity, as it remains at 100% regardless of the missing value proportion. Specificity and F-score are improved with the mi4p workflow above 5% missing values. The same observations can be drawn for precision and the Matthews correlation coefficient (see Figure S4.1 in Supplementary data). Detailed results can be found in Table S4.2. Figure 4 describes the evolution of the distribution of the differences in sensitivity and specificity between mi4p and DAPAR depending on the proportion of missing values in the second set of simulations. For all proportions of missing values, we observe a trade-off between sensitivity and specificity. A slight loss in specificity (which remains above 99%) provides a greater gain in terms of sensitivity. The mean F-score across the 100 datasets is also higher with the mi4p workflow than with the DAPAR one. The Matthews correlation coefficient highlights the gain in performance (as illustrated in Supplementary data, Figure S5.2).

Simulated datasets
Extending the simulation model from fixed effects to random effects using the last set of simulations provides similar results, as shown in Supplementary data ( Figure S6.3 and Tables S6.12 to S6.16).

Real datasets
The trade-off suggested by the simulation study is confirmed by the results obtained on the real datasets. In the Saccharomyces cerevisiae + UPS1 experiment, a decrease of 70% in the number of false positives is observed, improving specificity and precision (Table S8.23 in Supplementary data). However, this comes at a cost in the number of true positives (see Table 2), thus decreasing sensitivity. The same trend is observed in the Arabidopsis thaliana + UPS1 experiment: the number of false positives is decreased by 50% (see Table 3 and Table S8.17), thus improving specificity and precision at the cost of sensitivity. The loss in sensitivity is larger at the highest points of the range in both experiments. The structure of the calibrated datasets used here can explain these observations. Indeed, the quantitative dataset considered takes into account all samples from all conditions, while the testing procedure focuses on one-vs-one comparisons. Two issues can be raised:

• The data preprocessing step can lead to more data filtering than necessary. For instance, we chose to use the filtering criterion such that rows with at least one quantified value in each condition were kept. The more conditions are considered, the more stringent the rule is, possibly leading to a poorer dataset (with fewer observations) for the conditions of interest.
• The imputation process is done on the whole dataset, as well as the estimation step. Then, while projecting the variance-covariance matrix, the estimated variance (later used in the test statistic) is the same for all comparisons. Thus, if one is interested in comparing conditions with fewer missing values, the variance estimator will be penalised by the presence of conditions with more missing values in the initial dataset.
This phenomenon is illustrated in Table S8.18, where only the two highest points of the range were compared, using only the quantitative data from those two conditions. More peptides were thus taken into account in the statistical analysis. This strategy leads to better scores for precision, F-score and the Matthews correlation coefficient compared to the previous framework.
As far as data extracted without the Match Between Runs algorithm are concerned, the results were equivalent for both methods considered in the Arabidopsis thaliana + UPS1 experiment (as illustrated in Tables S8.20 and S8.21).

Table 2: Performance of the mi4p methodology, expressed in percentage with respect to the DAPAR workflow, on the Saccharomyces cerevisiae + UPS1 experiment, with Match Between Runs and at least 1 out of 3 quantified values in each condition. Missing values (6%) were imputed using the maximum likelihood estimation method. (Columns: condition vs. 25 fmol, true positives, false positives, sensitivity, specificity, F-score.)

Table 3: Performance of the mi4p methodology, expressed in percentage with respect to the DAPAR workflow, on the Arabidopsis thaliana + UPS1 experiment, with at least 1 out of 3 quantified values in each condition. Missing values (6%) were imputed using the maximum likelihood estimation method. (Columns: condition vs. 10 fmol, true positives, false positives, sensitivity, specificity, F-score.)
Furthermore, the same observations can be drawn from datasets filtered with the criterion of a minimum of 2 out of 3 quantified values in each group, for the Arabidopsis thaliana + UPS1 experiment (Tables S8.19 and S8.21) as well as for the Saccharomyces cerevisiae + UPS1 experiment (Table S8.24). These observations reflect a loss of global information in the dataset, as the filtering criteria lead to fewer peptides being considered, with fewer missing values per peptide. The mi4p methodology also provides better results at the protein level (after aggregation) in terms of specificity, precision, F-score and Matthews correlation coefficient, with a minor loss in sensitivity (Table S8.25). In particular, a decrease of 63.2% to 80% in the number of false positives is observed, with a smaller loss in the number of true positives and in sensitivity (up to 2.6%), for the Saccharomyces cerevisiae + UPS1 experiment, as illustrated in Table 5. As far as the Arabidopsis thaliana + UPS1 experiment is concerned, the same trend is observed (Table S8.22). Indeed, the number of false positives is decreased by 31% to 66.8%, with a maximum loss in the number of true positives of 9.8%, as illustrated in Table 4.


Conclusion
In this work, we presented a workflow whose key step is a rigorous multiple imputation method, which estimates both the parameters of interest and their variability. This variability is then carried downstream into the statistical analysis by including it in the moderated t-test statistic. The methodology was implemented in the R statistical language through a package called mi4p. Its performance was compared on both simulated and real datasets to state-of-the-art methodologies, such as the DAPAR package, using confusion matrix-based indicators. The results showed a trade-off between those indicators. In real datasets, the methodology reduces the number of false positives in exchange for a minor reduction in the number of true positives. The results are similar among all imputation methods considered, especially when the proportion of missing values is small. With its additional aggregation step, our methodology provides better results with a minor loss in sensitivity and can be of interest for proteomicists, who will benefit from results at the protein level while using peptide-level quantification data.

Supplemental Information for "Accounting for multiple imputation-induced variability for differential analysis in mass spectrometry-based label-free quantitative proteomics"

S1 State of the art on imputation in quantitative proteomics

Table S1.1 gives an overview of the recent literature on imputation methods in quantitative proteomics. Imputation methods are abbreviated as follows.

• BPCA: Bayesian principal component analysis

Table S1.1: State of the art on imputation methods used in quantitative proteomics and type of data used.

S2 Aggregation of peptides' intensities
The methodology implemented in the mi4p R package can be applied to peptide-level quantification data as well as protein-level quantification data. However, we were interested in evaluating our method on a peptide-level dataset and inferring results at the protein level, as is common practice in proteomics. Therefore, for intensity aggregation, we chose to sum all unique peptides' intensities for each protein. We then adjusted our pipeline as follows: 1. Out-filtration of non-unique peptides from the peptide-level quantification dataset.

Aggregation by summing all peptides' intensities (non-log2-transformed) from a given protein in each imputed dataset.
7. Projection of the estimated variance-covariance matrix.

S3 Indicators of performance
Let TP, TN, FP and FN respectively denote the numbers of true positives, true negatives, false positives and false negatives.
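For reference, the standard definitions of these indicators are:

```latex
\begin{align*}
\text{Sensitivity} &= \frac{TP}{TP+FN}, &
\text{Specificity} &= \frac{TN}{TN+FP},\\[4pt]
\text{Precision} &= \frac{TP}{TP+FP}, &
F\text{-score} &= \frac{2\,TP}{2\,TP+FP+FN},\\[4pt]
MCC &= \multispan{3}{$\dfrac{TP\cdot TN - FP\cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.$}
\end{align*}
```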
S4 Results on the first set of simulations

S4.1 Simulation design
We considered an experimental design where the distributions of the two groups to be compared scarcely overlap. This design leads to a fixed-effect one-way ANOVA model, which can be written as:

$$y_{ijk} = \mu + \delta_{ij} + \varepsilon_{ijk},$$

with $\mu = 100$, $\delta_{ij} = 100$ if $1 \leq i \leq 10$ and $j = 2$, $\delta_{ij} = 0$ otherwise, and $\varepsilon_{ijk} \sim \mathcal{N}(0, 1)$. Here, $y_{ijk}$ represents the log-transformed abundance of peptide i in the k-th replicate of condition j. Thus, we generated 100 datasets of 200 individuals and 10 variables, divided into 2 groups of 5 variables, using the following steps:

1. For the first 10 rows of the data frame, set as differentially expressed, draw the first 5 observations (first group) from a Gaussian distribution with a mean of 100 and a standard deviation of 1. Then draw the remaining 5 observations (second group) from a Gaussian distribution with a mean of 200 and a standard deviation of 1.
2. For the remaining 190 rows, set as non-differentially expressed, draw the first 5 observations as well as the last 5 observations from a Gaussian distribution with a mean of 100 and a standard deviation of 1.

S4.2 Performance evaluation
This subsection provides the evaluation of the mi4p workflow compared to the DAPAR workflow on the first set of simulations. The performance is described using the indicators detailed in Section S3. Figure S4.1: Distribution of the difference of performance between mi4p and DAPAR workflows on the first set of simulations imputed using maximum likelihood estimation.
S5 Results on the second set of simulations

S5.1 Simulation design

Secondly, we considered an experimental design where the distributions of the two groups to be compared might highly overlap. We based it on the random hierarchical ANOVA model of Lazar et al. [2016], derived from Karpievitch et al. [2012]. The simulation design follows the model:

$$y_{ij} = P_i + G_{ik} + \varepsilon_{ij},$$

where $y_{ij}$ is the log-transformed abundance of peptide i in the j-th sample, $P_i$ is the mean value of peptide i, $G_{ik}$ is the mean difference between the condition groups, and $\varepsilon_{ij}$ is the random error term, which stands for the peptide-wise variance. We generated 100 datasets of 1000 individuals and 20 variables, divided into 2 groups of 10 variables, using the following steps:

1. Generate the peptide-wise effect P_i by drawing 1000 observations from a Gaussian distribution with a mean of 1.5 and a standard deviation of 0.5.
2. Generate the group effect G ik by drawing 200 observations (for the 200 individuals set as differentially expressed) from a Gaussian distribution with a mean of 1.5 and a standard deviation of 0.5 and 800 observations fixed to 0.
3. Build the first group dataset by replicating 10 times the sum of P i and the random error term, drawn from a Gaussian distribution of mean 0 and standard deviation 0.5.
4. Build the second group dataset by replicating 10 times the sum of P i , G ik and the random error term drawn from a Gaussian distribution of mean 0 and standard deviation 0.5.
5. Bind both datasets to get the complete dataset.
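The five steps above can be sketched as follows (illustrative NumPy code; the function name `simulate_dataset2` and the seed are ours, not part of the mi4p package):

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

def simulate_dataset2(n_peptides=1000, n_de=200, n_per_group=10):
    """One dataset of the second design, y_ij = P_i + G_ik + eps_ij:
    P_i ~ N(1.5, 0.5) is fixed per peptide, G_ik ~ N(1.5, 0.5) for the
    first `n_de` (differentially expressed) peptides and 0 otherwise,
    and eps_ij ~ N(0, 0.5)."""
    P = rng.normal(1.5, 0.5, size=n_peptides)
    G = np.zeros(n_peptides)
    G[:n_de] = rng.normal(1.5, 0.5, size=n_de)
    # replicate P (group 1) and P + G (group 2) across samples, then add noise
    group1 = P[:, None] + rng.normal(0.0, 0.5, size=(n_peptides, n_per_group))
    group2 = (P + G)[:, None] + rng.normal(0.0, 0.5, size=(n_peptides, n_per_group))
    return np.hstack([group1, group2])  # bind both groups column-wise

data = simulate_dataset2()  # 1000 x 20 complete dataset
```

Note that $P_i$ is drawn once per peptide and shared by both groups, so the two group distributions can overlap substantially, as intended by this design.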

S5.2 Performance evaluation
This subsection provides the evaluation of the mi4p workflow compared to the DAPAR workflow on the second set of simulations. The performance is described using the indicators detailed in Section S3.

Figure S5.2: Distribution of the difference of performance between the mi4p and DAPAR workflows on the second set of simulations, imputed using maximum likelihood estimation.

S6 Results on the third set of simulations

S6.1 Simulation design

Finally, we considered an experimental design similar to the second one, but with random effects $P_i$ and $G_{ik}$. The 100 datasets were generated as follows.
1. For the first group, replicate 10 times (for the 10 variables in this group) a draw from a mixture of 2 Gaussian distributions. The first one has a mean of 1.5 and a standard deviation of 0.5 (corresponding to $P_i$). The second one has a mean of 0 and a standard deviation of 0.5 (corresponding to $\varepsilon_{ij}$).
2. For the second group, replicate 10 times (for the 10 variables in this group) a draw from a mixture of the following 3 distributions.
(a) The first one is a Gaussian distribution with a mean of 1.5 and a standard deviation of 0.5 (corresponding to $P_i$).
(b) The second one is the mixture of a Gaussian distribution with a mean of 1.5 and a standard deviation of 0.5 for the first 200 rows (set as differentially expressed) and a zero vector for the remaining 800 rows (set as not differentially expressed). This term corresponds to $G_{ik}$ in the previous model.
(c) The third one is a Gaussian distribution with a mean of 0 and a standard deviation of 0.5 (corresponding to $\varepsilon_{ij}$).

S6.2 Performance evaluation
This subsection provides the evaluation of the mi4p workflow compared to the DAPAR workflow on the third set of simulations. The performance is described using the indicators detailed in Section S3.

Figure S6.3: Distribution of the difference of performance between the mi4p and DAPAR workflows on the third set of simulations, imputed using maximum likelihood estimation.
The following tables provide results expressed as the mean ± standard deviation of the given indicator over the 100 simulated datasets. Results are based on adjusted p-values using the Benjamini-Hochberg procedure [Benjamini and Hochberg, 1995] and a false discovery rate of 1%.

S7 Real datasets

S7.1 Experimental design

We consider a first real dataset from Muller et al. [2016]. The experiment involved six peptide mixtures, composed of a constant yeast (Saccharomyces cerevisiae) background, into which increasing amounts of UPS1 standard protein mixtures (Sigma) were spiked at 0.5, 1, 2.5, 5, 10 and 25 fmol, respectively. In a second well-calibrated dataset, the yeast background was replaced by a more complex total lysate of Arabidopsis thaliana, in which UPS1 was spiked at 7 different amounts, namely 0.05, 0.25, 0.5, 1.25, 2.5, 5 and 10 fmol. For each mixture, technical triplicates were prepared. The Saccharomyces cerevisiae dataset was acquired on a nanoLC-MS/MS coupling composed of a nanoAcquity UPLC device (Waters) coupled to a Q-Exactive Plus mass spectrometer (Thermo Scientific, Bremen, Germany), as extensively described in Muller et al. [2016]. The Arabidopsis thaliana dataset was acquired on a nanoLC-MS/MS coupling composed of a nanoAcquity UPLC device (Waters) coupled to a Q-Exactive HF-X mass spectrometer (Thermo Scientific, Bremen, Germany), as described hereafter.
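All reported results rely on Benjamini-Hochberg-adjusted p-values at a 1% false discovery rate. For reference, the step-up procedure can be sketched as follows (illustrative Python; the function name `benjamini_hochberg` is ours, not the implementation used in mi4p or DAPAR):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.01):
    """Benjamini-Hochberg step-up procedure: returns a boolean mask of the
    hypotheses rejected at false discovery rate `alpha`."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # compare the sorted p-values with the step-up thresholds k/m * alpha
    below = p[order] <= (np.arange(1, m + 1) / m) * alpha
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest k with p_(k) below its threshold
        rejected[order[:k + 1]] = True  # reject all hypotheses up to rank k
    return rejected
```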

S7.2 Data preprocessing
For the Saccharomyces cerevisiae and Arabidopsis thaliana datasets, the MaxQuant software was used to identify peptides and derive extracted ion chromatograms. Peaks were assigned with the Andromeda search engine with full trypsin specificity. The database used for the searches was concatenated in house from the Saccharomyces cerevisiae entries extracted from the UniProtKB-SwissProt database (16 April 2015, 7806 entries) or the Arabidopsis thaliana entries (09 April 2019, 15 818 entries) and those of the UPS1 proteins (48 entries). The minimum peptide length required was seven amino acids and a maximum of one missed cleavage was allowed. Default mass tolerance parameters were used. The maximum false discovery rate was set to 1% at both peptide and protein levels, using a decoy strategy. For the Arabidopsis thaliana + UPS1 experiment, data were extracted both with and without Match Between Runs, and two prefiltering criteria were applied prior to statistical analysis: on the one hand, only peptides with at least 1 out of 3 quantified values in each condition were kept; on the other hand, only peptides with at least 2 out of 3 were kept. Thus, 4 datasets derived from the Arabidopsis thaliana + UPS1 experiment were considered. For the Saccharomyces cerevisiae + UPS1 experiment, the same filtering criteria were applied, but only to data extracted with Match Between Runs, leading to 2 datasets considered.
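The prefiltering criterion ("at least n out of 3 quantified values in each condition") can be expressed as a simple mask over the peptide-level intensity matrix. The helper below is a hypothetical illustration in NumPy, not the filtering code actually used in the study:

```python
import numpy as np

def filter_min_quantified(data, groups, min_quant=1):
    """Keep only the rows (peptides) of `data` (a peptides x samples array,
    with NaN encoding missing values) that have at least `min_quant`
    quantified values in every condition, where `groups` labels each
    column (sample) with its condition."""
    data = np.asarray(data, dtype=float)
    groups = np.asarray(groups)
    keep = np.ones(data.shape[0], dtype=bool)
    for g in np.unique(groups):
        # count non-missing values per peptide within condition g
        n_quant = np.sum(~np.isnan(data[:, groups == g]), axis=1)
        keep &= n_quant >= min_quant
    return data[keep]
```

With `min_quant=1` and `min_quant=2`, this reproduces the two criteria described above for triplicate designs.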

S7.3 Supplemental methods for Arabidopsis thaliana dataset
Peptide separation was performed on an ACQUITY UPLC BEH130 C18 column (250 mm × 75 µm with 1.7 µm diameter particles) and a Symmetry C18 precolumn (20 mm × 180 µm with 5 µm diameter particles; Waters). The solvent system consisted of 0.1% FA in water (solvent A) and 0.1% FA in ACN (solvent B). The samples were loaded into the enrichment column over 3 min at 5 µL/min with 99% of solvent A and 1% of solvent B. The peptides were eluted at 400 nL/min with the following gradient of solvent B: from 3 to 20% over 63 min, 20 to 40% over 19 min, and 40 to 90% over 1 min. The MS capillary voltage was set to 2 kV at 250°C. The system was operated in a data-dependent acquisition mode with automatic switching between MS (mass range 375-1500 m/z with R = 120 000, automatic gain control fixed at 3 × 10⁶ ions, and a maximum injection time set at 60 ms) and MS/MS (mass range 200-2000 m/z with R = 15 000, automatic gain control fixed at 1 × 10⁵, and the maximum injection time set to 60 ms) modes. The twenty most abundant peptides were selected on each MS spectrum for further isolation and higher energy collision dissociation fragmentation, excluding unassigned and monocharged ions. The dynamic exclusion time was set to 40 s.

S8 Results on real datasets
This section provides the evaluation of the mi4p workflow compared to the DAPAR workflow on the real datasets considered. The performance is described using the indicators detailed in Section S3. Results are based on adjusted p-values using the Benjamini-Hochberg procedure [Benjamini and Hochberg, 1995].

Table S8.25: Performance evaluation on the Saccharomyces cerevisiae + UPS1 dataset, at the protein-level and filtered with at least 1 quantified value in each condition.