Signal Partitioning Algorithm for Highly Efficient Gaussian Mixture Modeling in Mass Spectrometry

Mixture - modeling of mass spectra is an approach with many potential applications including peak detection and quantification, smoothing, de-noising, feature extraction and spectral signal compression. However, existing algorithms do not allow for automated analyses of whole spectra. Therefore, despite highlighting potential advantages of mixture modeling of mass spectra of peptide/protein mixtures and some preliminary results presented in several papers, the mixture modeling approach was so far not developed to the stage enabling systematic comparisons with existing software packages for proteomic mass spectra analyses. In this paper we present an efficient algorithm for Gaussian mixture modeling of proteomic mass spectra of different types (e.g., MALDI-ToF profiling, MALDI-IMS). The main idea is automated partitioning of protein mass spectral signal into fragments. The obtained fragments are separately decomposed into Gaussian mixture models. The parameters of the mixture models of fragments are then aggregated to form the mixture model of the whole spectrum. We compare the elaborated algorithm to existing algorithms for peak detection and we demonstrate improvements of peak detection efficiency obtained by using Gaussian mixture modeling. We also show applications of the elaborated algorithm to real proteomic datasets of low and high resolution.


Introduction
Current computational methodology for processing signals from spectra registered by mass spectrometry (MS) in mixtures of proteins and/or peptides usually involves sequences of signal processing operations organized in a manner leading to the detection and quantification of spectral peaks. When proteomic mass profiles are analyzed and interpreted, spectral peaks are used as features of MS spectra. It is assumed that each spectral peak corresponds to a certain peptide/protein species, and the composition of the mass spectrum carries direct information on composition of the analyzed samples. Currently, there are already more than a dozen sense of performing sufficiently large computational experiments and involving e.g., comparisons of sensitivities/specificities of peak detection, accuracy of estimation of m/z etc.
In this paper we present a new algorithm for the Gaussian mixture modeling of protein mass spectra based on partitioning the MS signal into smaller fragments. The fragmented spectra are separately decomposed into mixture models. The obtained parameters of components for all fragments are then aggregated and used as the mixture model of the whole spectrum. The main idea of partitioning the MS signals into fragments by using "splitters", as well as other ideas of the elaborated algorithm, are described in detail in the "Methods" section of this paper. Partitioning the MS signal into fragments allows for overcoming obstacles encountered in the previous studies. Both initializing and executing EM iterations is much easier for the smaller fragments of the MS signal than for the whole spectrum. Another advantage of partitioning the MS signals is the possibility of parallelizing the computations. Partition of the spectral signals into fragments is augmented by the use of an existing algorithm for peak detection for proteomic MS spectra.
We verify efficiency of the developed algorithm. In the first step of verification of our methodology we use our algorithm as a tool for improving peak detection in simulated mass spectra. We present comparisons of our algorithm of peak detection to the two peak detection algorithms of high efficiency published in the literature, MassSpecWavelet (based on continuous wavelet transform, CWT, approach) [3] and Cromwell (based on spectra differentiation) [2]. Comparisons are based on a large number of artificially generated datasets. We demonstrate the improvements achieved by using Gaussian mixture modeling. In the second step of verification of the methodology we show Gaussian mixture decompositions of real proteomic datasets [25,26,27]. For the Aurum dataset [27] we demonstrate the improvement of the accuracy of estimation of positions of peaks by using GMM. For datasets from [25,26] we highlight abilities of GMM modeling method to detect hidden spectral components to represent skewed shapes of spectral signals encountered in real data.

Methods
In this section we describe our algorithm for automated, whole spectrum scale Gaussian mixture modeling (GMM) of proteomic mass spectra and for MS peak detection based on Gaussian mixture representation. We first introduce the notations for such spectra and their Gaussian mixture models. Afterwards, we first present the main idea of the algorithm.
A typical proteomic mass spectrum contains data on mass-to-charge (m/z) values of the registered ions, denoted by x n versus their abundances i.e., the numbers of counts from the ion detector, denoted by y n , n = 1,2,. . .,N. The number of data points in the spectrum is denoted by N. In real experiments the analyzed datasets most often consist of more than one spectrum, multiple counts by y mn , m = 1,2,. . .,M correspond to each point x n along the m/z axis, where m denotes the index of the spectrum and M is the number of the spectra.
As the model for proteomic mass spectra, we use the univariate Gaussian mixture probability density function of the form where K is the number of Gaussian components, α k , k = 1,2,. . .K are component weights (mixing proportions), which sum up to 1, X K and f k (X n , μ k, σ k )denotes the probability density function of the Gaussian distribution.

Scaling
The mixture model (1) must be appropriately scaled. Due to finite sensitivity of the ion detector, numbers of counts in the average spectrum, y n , correspond to ranges (intervals) in the m/z axis, (x n − Δ n / 2, x n + Δ n / 2), where Δ n is the width of the interval centered at x n . In other words the data are binned and the numbers of counts y n are modeled by the multinomial probability distribution with probabilities given by areas of bins [28]. For real proteomic MS data bin widths Δ n are changing with n; they are narrower for low x n and wider for high x n . One can assume that bins are dense, i.e., their areas are well approximated by products of bin widths and values of the probability density functions at bin centers, which corresponds to the following model The parameter η in (4) is called the total ion current (TIC) [29]. From (4) the value of the total ion current η is We call model (4) the globally scaled model of the MS signal y n (due to its function of changing the scaling from probability densities to ion counts).
We are, however, more interested in using locally scaled models for the MS signal y n . When we analyze only a small fragment of the spectral signal n min n n max , we can assume constant bin widths Δ n = Δ. This allows for writing the locally scaled model in the following form Where w k = sα k , (s = ηΔ),. The scale factor s can be computed "locally".
s ¼ X n max n¼n min y n X n max n¼n min Main idea of the algorithm Fitting the scaled mixture model (4) or (6) to spectral data x n , y n , is done by using expectation maximization (EM) algorithm [28,30]. A special version of the EM algorithm for binned data [28] is needed, described in detail in S1 File. Fitting a mixture model to MS data by EM iterations at the whole spectrum scale is, however, impractical (impossible) due to reasons described in detail at the beginning of the "Discussion and Conclusion" section. Therefore we have developed a method to decompose the MS signal into smaller fragments. Our method uses the property of the MS signal that after removing baseline (which is a wide component of the spectral signal) the remaining components are relatively narrow.
The main idea of the algorithm is partitioning the mass spectrum. The partitioning algorithm uses the following concepts: "splitters", "clear peaks", "splitter segments" and "segments". These concepts are defined below and further explained and illustrated (Figs 1 and 2) in the remaining part of this subsection. Figs 1 and 2 are plotted for one artificially generated MS datasets available in S2 File.

Definitions
A clear peak is one of the peaks of the mass spectrum, chosen on the basis of its quality index and its position in the spectrum. We also use another term, splitting peak-synonymously to the term clear peak.
A splitter is a fragment of the mixture model of a protein mass spectrum, which contains a clear peak. A splitter-segment is a suitably chosen fragment of the mass spectrum, which contains a clear peak (around a clear peak).
A segment is a fragment of the mass spectrum obtained by subtracting two neighboring splitter signals from the mass spectrum signal.

Explanations
An example of a clear peak and the related splitter in the protein mass spectrum is shown in  In order to compute (estimate) the splitter signal we "cut out" a fragment of the MS signal around the clear peak ( Fig 1A). This fragment of the MS signal is a splitter-segment. We perform Gaussian mixture decomposition of the splitter-segment ( Fig 1B). Cutting (truncating) the MS signal leads to possible errors in modeling. However, on the basis of our assumption of narrow components we expect that errors occur only close to boundaries of the splitter-segment and that they do not affect the model of the splitter (in the middle). Intuitively, in the vicinity of the clear peak the MS signal can be modeled by either only one or a small number of Gaussian components. These components are reliable parts of the decomposition of the MS signal into mixture of Gaussians.
For a given MS signal we need a set of splitters. Therefore, in the phase 1 of our algorithm, for a given MS signal, we search for a set of splitters by applying a heuristic procedure, which uses a peak detection algorithm as its first step. In principle any peak detection algorithm can be applied. In our implementation we used "mspeaks" function from Matlab Bioinformatics toolbox [31] (with the default options). The heuristic procedure for searching for clear peaks (splitting peaks) is designed such that it returns a set of clear peaks, which are neither too close nor too far from each other and each of them is of sufficient quality (measured by the ratio of the peak height and heights of the neighboring lowest points of the MS signal). Then, by using EM iterations we compute decompositions of splitter-segments (fragments of MS signals around each of the splitting peaks), as shown in Fig 1, and we obtain models of all splitters signals.
Since models of splitters signals are reliable parts of Gaussian mixture decomposition of the MS signal, in the phase 2 of our algorithm we subtract splitters signals from the MS signal, which leads to splitting (partitioning) the whole spectrum into separate fragments-segments. Then, segments are decomposed into Gaussian mixtures, again by using EM iterations. The idea of the phase 2 of our algorithm is illustrated in Fig 2. In Fig 2A we present a fragment of the MS spectrum with two (neighboring) splitters. In Fig 2B we show the MS signal of the segment obtained by subtracting splitters models signals from the MS signal. In Fig 2C and 2D we show Gaussian mixture decomposition of the segment signal from the middle plot (C-GMM and D-GMM components).
Finally, we aggregate all the computed GMM components into one set, which is a wholespectrum mixture model of the MS signal. A more detailed description of the steps of our algorithm is given in S1 File. A Matlab implementation of our algorithm and exemplary data are available as S2 File.

Results
In this section we present some evaluations of the performance of our algorithm and comparisons to methods of analyses of MS signals based on picking spectral peaks. The presented results concern both simulated (low resolution) datasets and real proteomic datasets of low and high resolution.
Computations for all datasets were performed with the use of the computational server with two hi-end Intel Xeon X5680 processors (3.4 GHz in normal work, 3.6 GHz in turbo mode) and 32 GB DDR3 1333 Mhz RAM memory. Average processing times were, for one low resolution spectrum (approximately 10000 m/z points and 100-500 components)-1.5 minutes and for one high resolution spectrum (approximately 100000 m/z points and 1000-15000 components)-25 minutes.

Simulated data
First we applied our algorithm as a tool for peak detection for simulated proteomic MS spectra. We compare our algorithm to two existing procedures for protein MS peak detection published in [2] and [3]. Our choice of the reference algorithms is based on the comparative studies [32][33][34] of algorithms for peak detection for the MS. The algorithm and associated computer program (R environment) published by Du and coauthors [3] was rated high in all comparisons studies as showing high sensitivity for peak detection with quite low false discovery rate. It is based on computing continuous wavelet transform (CWT) of the spectral signal, with the "Mexican Hat" mother wavelet function, and relating spectral peaks to the "ridge" lines in the parameter space. The algorithm developed by Coombes and coauthors in [2] (with publicly available implementation in the Matlab environment) was rated lower in comparisons [32][33][34]. When using this algorithm it is quite difficult to compromise between sensitivity of peak detection and false discovery rate. However, its advantage is that it uses natural ideas for peak detection, smoothing (with the use of wavelet functions) and differentiation of smoothed spectral signal. For the three compared algorithms we use the following abbreviations: MS-GMM-for our algorithm, CWT (continuous wavelet transform)-for the algorithm from [3] and CROM (Cromwell)-for the algorithm from [2].
Similarly to other studies devoted to comparisons of peak detection algorithms, [32][33][34], we use mass spectra, obtained with the use of the virtual mass spectrometer (VMS), [35], where the true positions of peaks in spectrum are known. We additionally change the structure of the simulated data by assuming different numbers of true peaks in the spectra, and we study their influence on the detection power of different algorithms.

Spectral peaks
Three algorithms MS-GMM, CWT and CROM are compared with respect to their efficiency in detection of spectral peaks. Spectral peaks are features of mass spectra given by lists of m/z values. Peak detection algorithms are designed such that estimated spectral peaks should be as close as possible to the true m/z values corresponding to peptides (proteins) in the analyzed samples. In the case of CWT and CROM algorithms, spectral peaks m/z values are estimated by maxima detection procedures accompanied by noise rejection and smoothing. In the case of MS-GMM spectral peaks are given by m/z values corresponding to expectations of Gaussian components in the GMM. These Gaussian components are obtained in the post-processing step of the algorithm described in detail in S1 File.

Virtual mass spectrometer datasets
Synthetic spectral datasets are obtained with the use of the VMS algorithm/tool [35] based on the physical principles underlying mass spectrometry instruments. This tool enables the generation of realistic virtual spectra with known underlying protein (peptide) compositions, and has already been widely used by many authors, [32,34]. VMS signal contains the same parts as those (hypothetically) encountered in actual spectral signals, namely the true spectral signal consisting of a sum of overlapping Gaussian components (each corresponding to a protein or peptide species) multiplied by a random multiplicative factor adjusting for random amounts of proteins/peptides ionized and desorbed from each slide, a baseline signal and a zero mean Gaussian error with the variance given by a smooth function of m/z. For a given protein/peptide ion (i.e. spectral component) we summarize its distribution across samples by three quantities: its prevalence defined by the proportion of samples in the population containing the component, the mean and the standard deviation of corresponding peak intensity across samples that contain the component.
By using the VMS algorithm we have generated five datasets with different true numbers of protein (peptide) species, 100, 150, 200, 250 and 300. The detailed description of the scenario for simulating artificial (synthetic) datasets is presented in S1 File, and the generated spectra are provided in S3 File.

Comparisons of performances of algorithms
We compute several performance indexes, useful to characterize/compare results obtained by different algorithms. The specificity index (defined by false discovery rate) is abbreviated by FDR. FDR is the number of peaks among those detected by the procedure which do not correspond to the true peaks, divided by the number of all peaks detected by the procedure. The sensitivity index is abbreviated by S. S is the number of true peaks detected by the procedure divided by the number of all true peaks in the sample. We also aggregate the performance measures FDR and S into one index, F1 (defined as the harmonic mean of 1-FDR and S) Obviously, higher values of F1 index imply better performance and lower values-poorer performance of the evaluated method. Finally, we also report the number of peaks detected by a peak detection algorithm.
Results of comparisons are presented in Fig 3 where we show plots of indexes F1, FDR, S and the numbers of (hypothetical) peaks detected by the algorithms versus numbers of true peaks in the simulated spectra.
Performance indexes for all algorithms, MS-GMM, CWT and CROM show similar patterns of change. As reported in published literature, CROM achieves the lowest values of the F1 index for the whole range of true numbers of peaks in virtual spectra. Differences between algorithms are also seen when comparing numbers of the detected (hypothetical) peaks, shown in the Fig 3D. All algorithms underestimate the number of peaks.
Our algorithm MS-GMM exhibits the best performance in terms of values of the F1 index, for the whole range of values of true numbers of peaks in the virtual spectra. The possibility of tuning the parameter of this algorithm to achieve best compromise between sensitivity and FDR follows from highest values of sensitivities of our algorithm compared to other algorithms ( Fig 3B). Our algorithm is also closest to the truth when estimation the number of peaks in the spectral signal is considered.
In Table A in S1 File, optimal values of parameters used in algorithms MS-GMM, CWT and CROM, when computing performance indexes. We also show, in Fig C in S1 File, ROC curves (FDR versus sensitivity) obtained by applying the algorithms with different values of their adjustable parameters. These curves demonstrate the increase of the sensitivity, at the same values of FDR, obtained by our MS-GMM algorithm, when compared to CWT and CROM. Improvement of performance (increased sensitivity) of peak detection achieved by using our Gaussian mixture model is obtained thanks to detection of "hidden" peaks in the spectral signals. This is additionally illustrated in Fig 4 below, where we have reproduced a small fragment (2900-3300 Da) of one spectral signal (with 200 true peaks). The plot includes positions of peaks detected by using CWT algorithm (blue asterisks) and components (peaks) detected by using our algorithm MS-GMM (red Gaussian curves). Along the m/z line (x-axis) we have marked with circles all true peaks in the spectrum (within the analyzed range) and we have additionally colored the circles depending on the detection status using the following code: detected only by MS-GMM method-red, detected only by CWT method-blue, detected by both MS-GMM and CWT-black and not detected by any of algorithms-empty circle. One can see several examples of hidden peaks, which have been detected thanks to the use of the Gaussian mixture model.

Real proteomic datasets
We show some results of application of our GMM decomposition approach and comparisons to peak picking algorithms for real proteomic datasets.

Aurum dataset
The first real dataset, which we have analyzed in this paper is the Aurum Dataset [27]. The Aurum dataset contains high resolution MS and MS/MS spectra of 246 human proteins of known amino-acid sequences expressed in Escherichia coli, individually purified and trypsindigested. MS and MSMS spectra were registered with the use of the ABI 4700 MALDI TOF/ TOF mass spectrometer.
We have analyzed MS data in the Aurum dataset. The set of the MS data in the Aurum dataset includes six series of experiments (batches) of registering MS, each containing 192 spectra, which in total gives 1152 spectra. Each of the spectra in the Aurum dataset is accompanied by a list of ground truth, m/z values corresponding to peptide species present in the registered samples. We have computed GMM decompositions and peak detection by CWT algorithm for all 1152 spectra and we have performed a comparison between these two approaches.
For the Aurum spectra we highlight improvement of accuracy of estimation of positions of peaks achieved by application of MS-GMM. This improvement is related to skewness of shapes of components of spectra (spectral peaks) corresponding to the peptide species and concerns accuracy of estimation of positions of peaks. For the skewed components of MS the position of the maximum of the MS signal and the true m/z of the peptide species may not coincide [19]. According to the methodology outlined by Kempka and coauthors [19] (see also references therein), skewed spectral peaks can be modeled by a mixture of two Gaussian components and the position of the higher-narrower component is a better estimate of the m/z value of the peptide species than the maximum point of the peak.
We have performed computations analogous to those described in [19] for all 1152 Aurum MS spectra. We have tuned both MS-GMM and CWT algorithms for high sensitivity (in order to detect all ground truth peaks) and we have compared relative absolute errors of estimates of the m/z values between GMM and CWT algorithms. Absolute relative error is defined as When we have modeled the spectrum by the GMM method, the estimate of m/z was the position of the higher-narrower Gaussian component closest to the true m/z, and for the CWT method the estimate of the m/z value was the position of the detected peak closest to the true m/z.
Results of comparison of relative absolute errors between the two algorithms are shown in Fig 5. Histograms of absolute relative errors corresponding to MS-GMM (red) and CWT (blue) are presented. One can see that MS-GMM algorithm outperforms the CWT algorithm in the aspect of the value of the absolute relative error (9). Values of RE obtained with the use of MS-GMM are on the average closer to zero than values obtained by using the CWT algorithm.
Distributions of absolute relative errors shown in Fig 5 are right-skewed. Therefore it is statistically sound to report medians and interquartile ranges for comparisons between these distributions. In Table 1 we present median values and interquartile ranges (IQR) of distributions of the absolute relative errors, obtained by using either MS-GMM or CWT algorithms, for all six series of experiments (batches) in the Aurum dataset.
In Fig 6 we present a short fragment of MS T10761_Well A24_18836 from the Aurum dataset including one ground truth peak m/z = 1690.766 Da from the spectrum. We present the MS fragment, its GMM decomposition and GMM components. We also mark, by vertical lines ground truth position of the peak (1690.766 Da) and m/z positions corresponding to its estimates by CWT and MS-GMM algorithms. One can see that the shape of the peak is right-skewed and that the MS-GMM estimate of peak's m/z is closer to the true value of m/z than the CWT estimate.

Other proteomic datasets
In the majority of cases of analyses of real datasets, true compositions of samples are not known, which makes impossible comparisons of different algorithms based on differences between detected and true positions of peaks. For such real proteomic spectral datasets comparisons of GMM modeling to methods based on spectral features defined by peaks can still be done on the basis of indirect methods, e.g., on the basis of comparing efficiencies of spectral classifiers using different definitions of spectral features. We are, however, deferring such analyses to separate studies. Here, instead, we provide some technical comments on results of analyses of two real proteomic datasets (low resolution dataset and high resolution dataset) concerning abilities of GMM modeling method to detect certain spectral components and concerning shapes of spectral signals encountered in real data. Importance of these properties of GMM modeling was highlighted in the presented earlier analyses of the simulated datasets and the Aurum dataset.

Low resolution dataset
The low resolution dataset comes from published clinical study aimed at detection of colorectal cancer using serum peptidome profiling by MALDI-ToF mass spectrometry [25]. The dataset  included 116 MALDI-ToF spectra of the low-molecular-weight fraction of serum proteome of cancer patients and healthy volunteers, each covering the m/z range 960-11,169 Da. Spectra were registered by Ultraflex MALDI-ToF spectrometer (Bruker Daltonics) working in the linear mode. The raw spectra contained approximately 45000 points along m/z axis. We used operations of averaging and binning described in the original paper, which resulted in "low resolution" spectra including approximately 10000 data points along m/z axis (~1 point per Da). Using our algorithm with the default settings resulted in computing the GMM model with 472 components. In Fig 7 we illustrate results of this computational experiment for the case where no operations of post-processing of the GMM model were applied. In the upper plot (A) whole average MS is shown. It is seen that due to large size of MS it is not possible to recognize any details of the structure of MS. Therefore in (B) we show a short fragment of MS from (A). Along with the fragment of MS in (B) we also show, with blue asterisks, positions of peaks detected by using the CWT algorithm. In (C) we present its GMM and in (D) GMM components. From Gaussian components, not detected by peak detection algorithm (hidden peaks). In the righthand part of the spectrum model one can see a low, wide component, which seems to result from some residue of baseline. It is seen that due to the fact that no post-processing was applied, there is some excess of Gaussian components used to modeling.
Application of the post-processing algorithm for MS-GMM (with parameters optimized for the case of 300 true peaks in the spectrum, see Table A in S1 File) reduced the number of GMM components from 472 to 391. The total number of peaks detected by using the CWT algorithm was 258. Parameters of the CWT algorithm were also optimized for the case of 300 true peaks in the spectrum-see Table A in S1 File. Application of the detection threshold 0.3% leads to the estimate of overlap between MS-GMM and CWT estimates of peaks positions including 141 elements. Differences between two sets of peaks follow from presence of hidden components, such as those seen in (Fig 7B and 7D).

High resolution dataset
The high resolution MS dataset was generated in our team during characterization of head and neck cancer tissue proteome [26]. In the study a post-operative tissue sample was analyzed using MALDI Imaging Mass Spectrometry (MALDI-IMS). Tissue section processed with trypsin digestion was imaged with 50-100 μm raster using UltrafleXtreme MALDI-ToF spectrometer (Bruker Daltonics) working in the reflectron mode. Spectra were registered in the 800-4,000 Da range, which resulted in 20000 spectral signals, each containing 100000 data points along the m/z axis (i.e.,~30 points per Da, which could be considered as "high resolution" spectra). We have computed a mean spectrum (over 20000 signals) and we have decomposed it according to the GMM model, using our algorithm with the default settings, which resulted in obtaining in total 6216 components. In Fig 8 we illustrate results of computations. In the upper plot (A) we plot whole mean MS. Again due to large size of MS one cannot recognize details of the structure of MS. Therefore in (B) we zoom out a short fragment (1019-1030 Da) of mean spectrum. In (C) the GMM is shown and in (D) we presented GMM components. One can see a characteristic high resolution MS signal isotopic pattern with neighboring peaks occurring in the distance 1 Da. One can also observe that isotopic parts of the spectral signal are right skewed. Application of our algorithm results in modeling each of them by two Gaussian components, analogously to the case of Aurum spectra.

Discussion and Conclusion
The standard approach to modeling signals by mixtures of Gaussian component functions is by using expectation-maximization (EM) algorithm-a recursive procedure for maximization of the log likelihood function [28,30]. However, there are serious obstacles on the way to developing an EM recursive algorithm for fitting a mixture model to the multi-component proteomic mass spectra. The first obstacle stems from the difficulty in setting initial conditions for EM iterations. Fitting a mixture model with a large number of components to data is difficult due to problems with setting appropriate initial conditions for the EM algorithm. The problem of setting the initial values of mixture parameters for the EM algorithm has been researched numerous times in the literature [36][37][38][39]. However, the published approaches are practical only for mixtures with a relatively low number of components. When the number of components increases, the precision of estimation of mixture parameters obtained with the use of the mentioned methods of initialization rapidly decreases. This makes the published methods inapplicable for mixtures with hundreds or even thousands of components encountered in spectra registered for complex proteomic mixtures (like serum or cancer tissue). An approach useful for setting initial components for the EM iterations dedicated to such spectra was proposed in [18]. This approach applies an algorithm for detecting MS peaks as a first step and then sets initial mean values of components equal to detected locations of peaks. While the idea of using available information on locations of peaks of the spectrum is certainly reasonable and useful, the proposed approach still suffers from serious drawbacks: (i) EM iterations started with mean values of components positioned at MS peaks can still converge to undesired solutions due to imprecisions of initial values of component weights and standard deviations, (ii) the method is blind to hidden components, which are not identified (detected) by MS peaks, (iii) the method may require launching EM iterations at the full spectrum scale, which can be difficult (impossible) for large datasets. The second obstacle is the size of the proteomic MS data. For very large datasets, with numbers of points along the m/z axis of a spectrum of the order of tens or even hundreds of thousands, executing (iterating) EM algorithm can be difficult due to large sizes of the necessary data structures and problems with the slow convergence.
We have managed to overcome the previously encountered difficulties and to develop an efficient method for the automated whole spectrum decomposition of MS signals into Gaussian mixtures. The idea of the algorithm is based on partition of the spectral signal into separate fragments. The partition is obtained by defining "splitters" (fragments of the GMM model, which contain "clear peaks", as shown in Figs 1 and 2). The possibility of the partition by splitters follows from properties of MS signals (with removed baselines). In the baseline corrected MS signals components are relatively narrow, which excludes long ranging overlaps. Separate segments are decomposed into GMM model by using EM iterations initialized with the use of the high efficiency algorithm. Despite the multi-step design of our algorithm its ideas are simple. Partitioning of spectra allows for separate analyses leading to mixture models of sufficient precision. Aggregating results of decompositions of segments leads to Gaussian mixture model of the whole spectrum.
Separate decompositions of MS segments allow for estimation of whole spectrum Gaussian mixture models of MS signals of arbitrarily large sizes (proven by automated analyses of high resolution spectra with numbers of m/z values of orders of hundreds of thousands). Separation also enables easy parallelizing of computations, which can be used to elaborate high efficiency computational environments based on multi-processor hardware systems. Efficiency of the idea of partitioning can be exemplified by comparing partitioned and un-partitioned versions of GMM decomposition algorithms Partitioned version of the GMM decomposition algorithm allows for obtaining results in a shorter time and leads to mixture model of a better quality, when compared to un-partitioned version of the algorithm. Examples of such comparisons are shown in S1 File.
When using the obtained mixture models for peak detection we have encountered a problem of selecting peaks from the set of mixture components. This problem arises because some of the mixture components obtained in the iterative EM algorithm may not correspond to spectral peaks. We have proposed a solution to this problem by a post-processing algorithm described in S1 File, including a threshold value for component weights and a method for merging Gaussian components with a tunable parameter MZ-thr. We have compared the peak detection method based on our GMM decomposition algorithm to two literature algorithms for peak detection (CWT, CROM), on the basis of artificially generated MS signals, and we have demonstrated its supremacy (Fig 3 and Fig C in S1 File).
For the publicly available Aurum dataset [27] we have highlighted the supremacy of the GMM approach over the pick picking algorithms when comparing relative accuracies of estimating positions of MS peaks between MS-GMM and CWT algorithms. Consistent to findings in [19], application of MS-GMM leads to estimates of m/z values of peaks of lower relative error than estimates obtained by using CWT algorithm.
Apart from improvements of the efficiency of peak detection/estimation demonstrated in this paper, there are also other areas of possible applications for an algorithm for the automated, whole spectrum scale GMM decomposition of MS signal. Gaussian mixture modeling of MS signals can be potentially used as a tool for smoothing and de-noising spectral signals, for modeling and/or removing baselines in the spectra, for MS signals peak quantification, for MS signal compression and for spectral deisotoping algorithms. Other applications can involve using mixture models for defining spectral features to be further used in construction of protein spectral classifiers.