Identifying Multiple Potential Metabolic Cycles in Time-Series from Biolog Experiments

Biolog Phenotype Microarray (PM) is a technology allowing simultaneous screening of the metabolic behaviour of bacteria under a large number of different conditions. Bacteria may often undergo several cycles of metabolic activity during a Biolog experiment. We introduce a novel algorithm to identify these metabolic cycles in PM experimental data, thus increasing the potential of PM technology in microbiology. Our method is based on a statistical decomposition of the time-series measurements into a set of growth models. We show that the method is robust to measurement noise and captures accurately the biologically relevant signals from the data. Our implementation is made freely available as a part of an R package for PM data analysis and can be found at www.helsinki.fi/bsg/software/Biolog_Decomposition.


Introduction
Biolog Phenotype Microarray (PM), not to be confused with RNA expression microarrays, is a commercially available 96-well format test system capable of multiple parallel testing of the bacterial growth responses to different nutrients and/or supplements [1]. The standard Biolog PM plates contain a variety of different substrates, such as carbon and nitrogen sources, heavy metals, antibiotics, etc. The substrates are pre-dispensed and dried, requiring only inoculation with bacteria and a buffer containing a dye (usually tetrazolium violet). Bacterial metabolism during growth leads to the irreversible reduction of the dye in the well with production of a purple colour which can be read as the change in absorbance over time [2]. The level of colouration is generally determined by a scanner at 15 minute intervals during the experiments, which are usually carried out over 48-72 hours depending on the studied bacterial species. The measurements of the colouration are recorded in arbitrary units.
The levels of colouration measured in a single well during one experiment are here referred to as signal. Signals are normally consistent between experimental replicates ( Fig 1A) and depend on bacterial adaptation to a substrate and experimental conditions. Preferred or utilisable substrates support active metabolism which is reflected by a rapid signal growth (Fig 1, substrates B03, A03). In contrast, toxic substrates inhibit bacterial growth or lead to cell death in which case only a small amount of colour is produced (Fig 1, substrate H04).
Bacteria may often undergo several cycles of metabolic activity, seen as differences in the rates of colour accumulation during growth (Fig 1, substrate A03). Multiple cycles may represent different metabolic pathways sequentially used by bacteria as they switch between Fig 1. Kinetics of accumulation of the colour production may represent metabolic cycles in bacteria. Panel A-colour production of three replicates of the bacterial E. coli strain IMT17887 during growth on plate PM1 in the substrates H03 (Tyramine), B03 (Glycerol) and A03 (N-Acetyl-D Glucosamine). Panel B-lagged difference L of the colouration of these signals. Panels C-growth rate S (smoothed lagged difference of the colouration) of these signals. The smoothing coefficient is set to b = 0.5. nutrients, undergo depletion of substrates, or after excretion of end-products followed by reutilisation. They may also represent different subpopulations in the growing cultures.
A number of methods have been used for analysing the Biolog metabolic signals and comparing the metabolic activity triggered by different substrates. The simplest approaches describe metabolic signals with a single summary statistic, e.g. maximum intensity reached or the area under the curve [3,4]. Some methods split signals into growth or no-growth curves by using an arbitrary cut-off or comparison to a reference signal [5,6]. However, describing a time-series with only a single summary statistics leads to a loss of information, and may introduce bias in the results [7]. If more than two samples are provided, the differences in summary statistics can be tested, e.g. by using t-test or ANOVA.
In addition to simple summary statistics, model-based methods are widely applied to Biolog data [7][8][9][10]. They are able to utilize more information by fitting growth models such as logistic, Gompertz and Richard, to the metabolic profiles. An R package opm is a widely used tool for reading in, processing and visualizing Biolog data [9]. It fits Gompertz's and Richard's models by using the grofit R package, and enables the comparison of the curves based on the 95% confidence intervals of the model parameter estimates. The most recent software by Gerstgrasser et al. [7] fits several models at once, and chooses the most suitable one utilizing Bayesian inference in parameter estimation and model selection. The signals are then compared against each other by defining maximum colour change, steepest slope and length of lag phase based on the fitted models. However, none of these growth models, or other methods [5,6,[11][12][13][14] are able to capture more than one potential metabolic cycle at a time.
To address the problem, we propose an algorithm for identifying multiple potential metabolic cycles of bacteria by decomposing the PM well signal into multiple growth models. In addition, we propose a method for comparing signals with each other using summary statistics gained from the growth models. We show that the method is robust to measurement noise and captures accurately the biologically relevant information from the data, thus increasing the potential of PM technology in microbiology.
To illustrate the proposed algorithm, we use Biolog metabolic signals from three E. coli strains: IMT17887, PCV17887 and T17887. Three biological replicates per strain were tested on plate PM1. Strain number IMT (Institut für Mikrobiologie und Tierseuchen) 17887 was isolated from a horse with wound infection. It is an extended-spectrum beta-lactamase (ESBL)producing E. coli of sequence type (ST) ST648. ESBL-plasmid extraction using a heat technique resulted in the ESBL-plasmid-"cured" variant PCV17887 [15]. Transformant T17887 contains the ESBL-plasmid, which was transferred into PCV17887 via electroporation. The data is provided in S1 File.

Signal decomposition
The proposed algorithm is applied separately to each of the 96 wells on the PM plates. Here, the time-series R = (R 1 , R 2 . . . R T ) contain the raw signal, i.e. the sequence of T integers between 0 and 400 representing the measured intensity of the colouration in one particular well at several time points. In theory, as the production of the purple colour is irreducible, R would be an increasing sequence. In practice, R is subject to low-frequency observational noise and can show a decreasing pattern due to measurement errors.
As the metabolic activity is represented not by the absolute level of colouration but rather by its change due to growth of the cultures, we are not interested in the values in R as such, but the increments in the process. However, the lagged difference of the signal L = (R 1 , is typically noisy (Fig 1B), which calls for a statistical approach to analyse the curves. To filter the noise the lagged difference is smoothed with a Gaussian kernel with a predefined smoothing coefficient b (Fig 1C). We define the target signal S as: To identify metabolic cycles, the target signal S is approximated with the sum of n components S % P n i¼1 C ðiÞ , where each component C ðiÞ ¼ ðC ðiÞ 1 ; C ðiÞ 2 . . . C ðiÞ T Þ represents one period of colour accumulation (e.g. potential metabolic cycle). We focus on three basic types of components based on the following growth models (Fig 2): a Gaussian sequence: a brick sequence: and a slope sequence: Here A, μ, t 0 , t 1 and v are the component parameters. Each component is defined by exactly three parameters. To compensate for the smoothing, all sequences C are also processed by the Gaussian kernel with the same smoothing coefficient b. Gaussian and brick sequences represent a logistic and a linear growth of a colouration respectively, while slope sequences represent a dynamics with initial growth slowing down with time (Fig 2). The various types of components are not intended to represent strictly different biological processes, but are used to increase the capability of matching the data patterns well.
Our decomposition algorithm consists of the following three steps: pre-processing converts the raw signal into the target signal; initial decomposition specifies the number of components required for optimal decomposition; calibration minimizes the distance between the components and the target signal and separates the components. To retain biological interpretability and prevent overfitting we set the constraint ∑C t ! δ (where δ is a predefined threshold). Pre-processing. During the pre-processing step the target signal S is obtained by smoothing the lagged difference of the raw signal (see Eq 1). The smoothing is required to remove highfrequency observation noise.
Initial decomposition. During the initial decomposition step a crude decomposition is proposed using a greedy algorithm. The first component C (1) is fitted to the whole signal S, the second component is fitted to the residuals S − C (1) , the third-to the residuals S − C (1) − C (2) and so on.
The fitting is done by optimizing ith component's type and parameters to minimize squared error between the target signal and the proposed components: Optimization could in principle be done with any optimization algorithm capable of finding a global minimum in the restricted space of the parameters. In our implementation we use a grid method combined with the built-in R function optimize [16]. To choose the component type, we fit three different types separately and choose the one with a minimal squared error.
The iterations continue while fitted components satisfy the condition P C ðiÞ t ! d. When the first small component with P C ðiÞ t < d is encountered, the initial decomposition step is stopped and the i − 1 components are set as the initial decomposition. If the first proposed component is small P C ð1Þ t < d we conclude that no periods of colour accumulation have occurred (for example, no growth or the initial bacterial inoculum contained no live bacteria), the decomposition is not continued and a stale process is reported as a result.
Calibration Initial decomposition may be imprecise, as first components are obtained ignoring the later ones. If two or more components are found during the initial decomposition (n ! 1), the components are calibrated to achieve concordance between them. Components are calibrated sequentially: C (1) ,C (2) ,. . .C (n) ,C (1) ,C (2) . . . until a pre-determined condition is reached. It could be based on the number of iterations or a change in the distance between the components and the target signal. When a component C (i) is calibrated, its type and parameters are updated by minimizing the function: The first part of the function is the squared error between the proposed components and the target signal. The second part penalizes the correlation among the calibrated component and the rest of components. The latter is scaled with the correlation weight γ. The same optimization method as in the initial decomposition step is used. During the pre-processing raw signal is converted to the target signal. During the initial decomposition three components (putative cycles of metabolic activity) are revealed: two Gaussian sequences and a slope sequence. During the calibration these components are refined: the first component changes its type to a brick sequence, the second and the third are slightly adjusted. Calibrating may change the type of a component. If a component ceases to satisfy the constraint C ðiÞ t ! d after the calibration, it is removed and the rest of the components are recalibrated.

Using decomposition to compare signals
The summary statistics of the identified components could be used to measure a similarity between two signals and thus among replicates or different strains. We suggest three summary statistics: maxðCÞ reflects the peak growth speed caused by the component; Using these summary statistics we can define a similarity measure between two components A and B: Here δ (max) , δ (size) and δ (center) are coefficients scaling the importance of the differences in summary statistics. sim(A, B) varies between 0 and 1 and does not depend on components' types.
Finally we propose the following similarity measure for two decompositions A and B consisting of m and n components, respectively: A ¼ ðA ð1Þ ; A ð2Þ . . . This similarity metric varies between 0 and 1 and depends on summary statistics of the components. Since components in decompositions A and B may be in different order, and since decompositions may have incorrectly identified small false components, it is important to check all possible pairings between the decomposition and choose the best one. If decompositions have no components to compare, the metric is either set to 0 (one signal is active and one is non-active) or 1 (both are non-active).
Similarity between two decompositions can be used to cluster signals. Fig 5 shows the example of decomposition for signals of three E. coli strains (IMT17887, PCV17887 and T17887). The decompositions are consistent between experimental replications and vary between strains. Fig 6 shows the similarity measures computed for the same data based on the Euclidean distance (Panel A) and similarity between decompositions (Panel B). In the second case, the difference between the three E. coli strains is more pronounced. We recommend comparing the decompositions obtained with the same parameter values.

Performance analysis
Sensitivity test. We tested the algorithm's ability to identify the correct number and type of components and related summary statistics using a synthetic data set. First, we generated one component or a pair of negatively correlated components C Ã(i) using randomly sampled parameters. The raw signal was constructed as a cumulative sum of components: Non-normal noise was added to the raw signal: The non-normal noise was chosen to reflect the apparent non-normality of the Biolog observational noise. We then estimated the component (or components) C (i) from S. We repeated the simulations 1000 times and measured a probability of correctly guessing the number of components n. If n was correct, we measured a probability identifying the component type and the mean absolute difference in the summary statistics max(C), size(C) and center(C). The results are shown in Table 1. The code used for testing is available at www.helsinki.fi/bsg/software/ Biolog_Decomposition. We used T = 50 hours, blur strength b = 1, threshold δ = 20, correlation weight γ = 2 and the observation noise λ = 10. Single components were almost always (in 97-99% cases) identified correctly. In 20% of the cases brick and slope components were misidentified because smoothing during the pre-processing step blurs the distinctions between the types. Narrow components are especially susceptible for this. The errors in the summary statistics were insignificant.
Identifying two components correctly was more challenging: the number of components was correctly identified in 56-74% of the cases. If the components were located close to each other, decomposition algorithm often mistook them as one or separated them in an incorrect position. The type of the components was correctly estimated in 53-79% of the cases. The errors in the summary statistics were larger in this setting as well.
Robustness to parameter choice. To assess the robustness of the decomposition algorithm to measurement errors and parameter choices with simulations, we applied the same protocol

Discussion
The proposed algorithm decomposes Biolog Phenotype Microarray data into potentially biologically meaningful components, i.e. components that could be interpreted directly as bacterial metabolic cycles and/or population changes. Identification of these components could be useful for further investigations, such as identifying sub-populations within bacterial cultures. Different signals (metabolic cycles) may arise after the initial death or growth stasis of a subpopulation of bacteria followed by growth of a second sub-population. Also, metabolites generated during growth on the initial substrate might result in a second decomposition signal in a later phase of the experiment. Among the most promising future applications would be a direct link to concurrent RNA-sequencing data to detect different metabolic pathways.
The decomposition of growth kinetics and comparison of similarity among replicates and different strains is a meaningful tool for analysing the growth of different bacteria in a manner of high resolution, in contrast to methods only analysing the respiration kinetics data as endpoint assays.
Performance analysis revealed that the presented method has a lowered sensitivity if there are several correlated components. Due to the observational errors, it is only possible to identify evident metabolic cycles. While the probability of correctly inferring the component type was low, the summary statistics were estimated accurately. Therefore any further analysis should rely on the summary statistics rather than on the component types.
The algorithm requires several pre-defined options to determine the sensitivity and level of smoothing. A user-specified tuning may be required to obtain an optimal fit for a particular data set. For some parameters, such as the component size threshold δ prior knowledge may also be used. We have investigated different modifications of the basic algorithm. We tested to use an L1 norm instead of the L2 norm (Euclidean distance) to identify the components and using a sliding mean smoothing instead of a Gaussian kernel. Our analysis suggested (data not shown) that the presented version of the method provides the best sensitivity, specificity and robustness among the considered alternatives. We also considered including a fourth component type: a right half of a Gaussian bell (C t = 0 for t < μ, C t ¼ A e À ðmÀ tÞ 2 v for t ! μ). However, this half-Gaussian sequence was almost never observed in a sample data sets, as similar patterns are better described with a slope sequence. The time-series generated by the Biolog PMs are inevitably subject to measurement noise. In addition to the signal-level noise (which is handled by the smoothing), there are plate-level biases, such that two plates with the same substrates in the same conditions may produce different amount of colouration. To handle the plate-level noise all time-series representative of the same array may be analysed concurrently and normalized a priori.
The presented algorithm does not require extensive computational resources. The runtime depends on the number of components identified and takes typically about 20 minutes to complete for a single plate in a standard single CPU desktop computing environment. The code is written in R and can be downloaded at www.helsinki.fi/bsg/software/Biolog_Decomposition. It is a part of a pipeline for analysing Biolog PM data [8] (www.helsinki.fi/bsg/software/R-Biolog) built upon the opm package [9].
Supporting Information S1 File. Sample data. Biolog metabolic signals from E. coli IMT17887, PCV17887 and T17887 tested on plate PM1. Three biological replicates per strain. (ZIP)