Splitting Gaussian Process Regression for Streaming Data

Gaussian processes offer a flexible kernel method for regression. While Gaussian processes have many useful theoretical properties and have proven practically useful, they suffer from poor scaling in the number of observations. In particular, the cubic time complexity of updating standard Gaussian process models make them generally unsuitable for application to streaming data. We propose an algorithm for sequentially partitioning the input space and fitting a localized Gaussian process to each disjoint region. The algorithm is shown to have superior time and space complexity to existing methods, and its sequential nature permits application to streaming data. The algorithm constructs a model for which the time complexity of updating is tightly bounded above by a pre-specified parameter. To the best of our knowledge, the model is the first local Gaussian process regression model to achieve linear memory complexity. Theoretical continuity properties of the model are proven. We demonstrate the efficacy of the resulting model on multi-dimensional regression tasks for streaming data.


Introduction
Gaussian process (GP) regression is a flexible kernel method for approximating smooth functions from data. Assuming there is a latent function which describes the relationship between predictors and a response, from a Bayesian perspective a GP defines a prior over latent functions. When conditioned on the observed data, the GP then gives a posterior distribution of the latent function. In practice, GP models are fit to training data by optimizing the marginal log likelihood with respect to hyperparameters of the kernel function, k.
A notable shortcoming of GP regression is its poor scaling both in memory and computational complexity. In what we will hereafter refer to as full GP regression, the posterior mean and variance can be directly computed at any finite number of points using linear algebra [24]. However, a computational bottleneck is caused by the necessary matrix inversion of the kernel matrix K, which is well-known to have a time complexity of O(n 3 ), where n is the number of observations. The O(n 2 ) memory complexity of storing K may also pose issues for massive data sets.
GP regression may also struggle to well-approximate latent functions which are non-stationary [7,9]. Non-stationary latent function's mean or covariance may vary over its domain. Non-stationarity may be induced in more subtle ways as well, such as heteroscedastic additive noise in the observed response.
Local GP regression is a class of models which address both of these problems, to varying extents. The commonality of these models is the assignment of observations to one of many local GPs during training, and the aggregation of the local models' individual predictions. As a result of observations being assigned to a single local model, effectively only a block-diagonal approximation of the full kernel matrix K is maintained [20,3], easing the time and memory complexity. The price of this flexibility and computational advantage is a potential decrease in predictive ability, relative to full GP regression, on tasks for which a full GP is well-suited.
A successful method for assigning observations to local GPs is partitioning the input space, the domain of the latent function, and creating a local GP model for each cell of the partition. Existing local GP models may encounter various difficulties during the partitioning process, as detailed in Section 2, and/or neglect a setting where only a sequential data source is available and partitioning must be performed sequentially.
This sequential setting is important since it encompasses tasks with changing dynamics. For example, in applications such as process quality monitoring [29,13] and motion tracking [11], a sequential approach allows for the model to adapt to regimes which are yet-unobserved.
The primary contribution of this paper is an algorithm which recursively partitions the input space to construct local GPs using sequential observations. The resulting model is dubbed the splitting GP. The algorithm is shown to have superior asymptotic time and memory complexity relative to other state-of-the-art local GP methods which can be used in this problem setting. By design of the algorithm, we also ensure an exact upper bound on the time complexity of updating the splitting GP model. We also prove theoretical properties of the model related to continuity of the predictions and empirically demonstrate the efficacy of the model. Additionally, a software implementation of the algorithm is provided, which leverages the computational advantages of the GPyTorch library [6], to facilitate the use of the algorithm by others.

Related work
It appears that the first exploration of local GP regression is due to Rasmussen and Ghahramani [23], as a special case of their mixture of GP experts model, where prediction at a point is performed by one designated "expert" GP. Snelson and Ghahramani [26] further developed this idea, but only as a supplementary method to sparse GP regression.
An adjacent line of research on treed GP models was begun by Gramacy [7]. Treed GP models perform domain decomposition in the same manner as classification and regression trees [2], and fit distinct GP models to each resulting partition. Predictions are then formed using k-nearest neighbors. Gramacy [7] found that local GP regression methods, such as treed GP models, were well-suited to non-stationary regression tasks since each leaf GP model in the tree could specialize to local phenomena in the data. Since the inception of this method, further advances have been made by Gramacy and collaborators [9,8], particularly with respect to large scale computer experiments and surrogate modeling. It is acknowlegded by Gramacy [7] and Gramacy and Lee [9] that predictions by the treed GP model are not continuous in the input space. Additionally, the construction of the tree is performed probabilistically, and does not make use of the intrinsic structure of the data for domain decomposition.
Since the original proposal of local GP regression, several methods have been proposed which are adapted to the sequential setting. Shen et al. [25] reduced the prediction time and kernel matrix storage for isotropic kernels by building a kd-tree from the training data set. Nguyen-Tuong et al. [18] proposed a method of local GP regression for online learning, which assigns incoming data to local models by similarity in the feature space and forms mean predictions by weighting local predictions. This local GP model was among the first to consider a fully sequential setting. This model has two notable drawbacks: it suffers from discontinuities in its predictive mean, and depends on a hyperparameter which is difficult to tune in practice and strongly affects prediction performance. Another local GP method which may be used in the sequential setting is the robust Bayesian committee machine (rBCM) by Deisenroth and Ng [4], which can be seen as a product of GP experts [10]. The rBCM emphasizes rapid, distributed computation over model flexibility. This is demonstrated by a) its assumption that the latent function is well-modeled by a single GP and b) consequent random assignment of observations to GP experts, rather than a partitioning-based approach. This modeling approach does not address potential non-stationarity of the latent function.
More recently, Park and collaborators have done significant work in this area, applying mesh generating procedures from finite element analysis [21,22] and the recursive Principal Direction Divisive Partitioning algorithm [20,1] to partition the input space for fitting local models. However, in these papers it is assumed that a substantial number of observations are available during the initial model construction to perform the partitioning procedure, particularly when using mesh generation methods. These models also suffer from discontinuities at the boundaries of partition cells, an issue which the authors have creatively addressed by adding constraints to the hyperparameter optimization which force equality at finitely many boundary points, or by adding "pseudo-inputs" at the boundaries to induce continuity.

Recursive splitting of local Gaussian processes
As previous works have shown, there is a trade-off between the predictive ability of the aggregate model, the number of local models, and computation speed. Given this fact, we aim to construct a model which maintains strong predictive capability while keeping its computational demands below a pre-specified upper bound. We show that this can be accomplished in a straightforward manner by recursively splitting GP models once they surpass a presupposed threshold in the number of observations. Splitting the model amounts to performing a clustering subroutine which divides the observations associated with the model into two subsets, and then fitting a new local model to each subset.
We consider streaming data that requires the model to permit sequential updating and prediction. A natural quantity of interest is the time τ required for a single update; since τ is a function of the size, m × m, of the kernel matrix K i of a local model indexed by i, m is then an interpretable parameter describing the period of splitting. The parameter m may be interpreted as the splitting limit, the maximum number of observations which may be assigned to a local model before it is split. For the remainder of the paper, we will describe the splitting GP model, which is characterized almost entirely by this intuitive parameterization. Additionally, the full specification of the splitting GP algorithm may be found as pseudo-code in the appendix.

Notation
In preparation for the proceeding material, we define some notation. We assume some familiarity of readers with the theory of GPs and kernel methods [24].
The input data matrix and the response vector associated with the i th local model are denoted by is a column vector and n i is the number of observations associated with the i th local model. We call X the input space.
When creating the i th local GP, its center is defined to be the centroid of X i as follows: Centers are critical to the assignment of observations to local GPs, as well as prediction, in the splitting GP model. Each time a new observation is assigned to a local model, its center is recomputed.
The kernel function k: X ×X → R is a positive-definite, symmetric function. A vector θ parameterizes the kernel, and is called the hyperparameter of the GP model. For simplicity, we omit θ from our notation. We use the term feature space to refer to the reproducing kernel Hilbert space in which k implicitly computes an inner product.
We write f ∼ GP(µ, k) to say that the function f is distributed as a GP with mean µ and kernel function k, and implicitly assume that the domain of f is X . The notation f |X i , Y i denotes that f is conditioned on the data X i , Y i . We use x * to denote a test input at which a prediction is to be made and f * |x * to be a GP conditioned on the test input, i.e. the posterior distribution.

The splitting procedure
To address modeling a potentially non-stationary latent function, each local GP model is split in a manner which centers the resulting child GPs on regions of different means. This is done by performing principal component analysis (PCA) on the training inputs associated with the parent model. The first principal component gives the direction of most variance of the training inputs, which implies that the corresponding orthogonal hyperplane is the minimizer of within-cluster variance among all linear bisections, leading to Principal Direction Divisive Partitioning (PDDP) [1].
Two new GPs, which we call child GPs, are then created, each of which is assigned the data of the parent model from one side of the hyperplane, as in Fig. 1(a). This heuristic is based on the idea that, by fitting separate GPs to the subsets of training inputs which are maximally different, the model may best adapt to non-stationary behavior of the latent function. We utilize PDDP since the splitting procedure is computed efficiently in closed form, i.e. without the use of a convergent algorithm such as k-means. It also allows for different choices in how the principal direction is computed. For example, in a setting where observations are fully sequential, the principal direction can be efficiently approximated using Oja's rule [19] to avoid a singular value decomposition of the data matrix. (a) Splitting a 2-d input data set using PDDP into two subsets for two child GPs. Note the principal direction (orange vector), orthogonal hyperplane (green line), and the centroid (×) of each subset colored in blue or red.  Critically, the prior of each child GP is taken to be the posterior of the parent model when conditioned on the m training observations, reflecting the Bayesian belief that the latent function's local covariance structure is not too dissimilar from its covariance structure in the larger subspace prior to splitting.
Formally, in the function space inference view of GP regression, where f child is the prior of the child and f parent |X parent , Y parent is the posterior of the parent given the training data X parent , Y parent . This assumption of local similarity is implicit in the use of many smooth kernel functions, particularly the radial basis function family, which are infinitely differentiable.

Aggregating predictions of local models
A new data point (x, y) is assigned to a child GP whose center is most similar to the predictor x in the feature space, as determined by the kernel function. That is, for C child GPs, the data point is assigned to the child GP indexed by Similarly, the posterior mean at the test input x * is computed by weighting the prediction of each child GP by the relative similarity of its center to the test input in the feature space. This idea is based upon an interpretation of the predictive mean as weighting observations by their similarity to the point of prediction in the feature space, according to Shen et al. [25]. The same idea was also used by Nguyen-Tuong et al. [18].
Unlike these previous works, we take a weighted average of all predictions of child GPs, rather than only a subset, as follows: where . This may be interpreted as weighting each child GP's mean prediction by the similarity of its local region to the test input x * . In the next section, we will show that using predictions from all child GPs has important consequences on the theoretical properties of the resulting model. Proposition 1. Let f * parent |X parent , Y parent ∼ GP(µ, k) be the full GP prior to splitting and let x * be a test input. The child GPs f i |X i , Y i ∼ GP(µ i , k), i = 1, 2 from the first split have the property that, prior to being updated with any new observations, f * |x * = f * parent |X parent , Y parent , x * . That is, the predictive distribution is preserved by the splitting procedure.
This result is somewhat intuitive, since no additional evidence has been obtained which might alter our Bayesian beliefs about the latent function; we have simply altered the model structure. In agreement with this idea, the equality of the parent and childrens' predictive distribution no longer holds after any new data is assigned to one of the child GPs. Note that, in general, this relationship does not hold for any split after the first.
An important property of the splitting GP model is the preservation of continuity properties of the child GPs. We provide a result which shows the necessary and sufficient condition for continuity of the splitting GP model's predictions.
Proposition 2. Suppose k is a kernel function and f i |X i , Y i ∼ GP(µ i , k), i = 1, ..., C. Then the random field given by is mean square continuous in the input space if and only if the kernel function, k, is continuous. Under the same condition, the mean prediction E[ f * |x * ] is also continuous in the input space.
While it may seem obvious that the mean prediction is continuous in the input space, it is reassuring to know that the random field created by the splitting model has the same sufficient condition for mean square continuity as the underlying child GPs. Intuitively, we are computing a continuously weighted average of smooth random fields, so the resulting random field should also be smooth.
Note that it is not necessary to aggregate the predictions of each local model. Instead, one may elicit predictions only from the C 0 < C local models most similar to the point of prediction, as measured by the kernel. This variation of the prediction method, as used in [18], may result in faster computation of predictions. However, we caution against this practice without careful consideration. This prediction method will result in a loss of continuity of the mean prediction E[ f * |x * ], and consequently the mean square continuity of the random field f * |x * , since the mean prediction is computed using a maximum function, which is not continuous. The discontinuity will manifest as sharp jumps in the mean predictions, as illustrated in Fig. 1(b).

Complexity analysis of the algorithm
The algorithm for the splitting GP model has improved complexity in both memory and training/prediction time compared to the full GP regression. Additionally, it maintains some benefit in asymptotic complexity relative to other local GP methods.
The complexity of updating the splitting GP model with a single datum is bounded above by O(m 3 ), corresponding to a matrix inversion of one child GP. The contribution of the PCA-based splitting procedure to the time complexity is negligible. Assuming the PCA is performed via naive singular value decomposition, the procedure would have O(m 2 ) complexity amortized over m sequential observations, which yields a linear additive term m < m 3 .
While each child GP has at most m observations, the average case will clearly be lower. It should be noted that we explicitly chose to not utilize a rank-one update of the Cholesky decomposition of the kernel matrix during the update procedure, a method which is used in other local GP methods such as in [18]. We made this choice to permit updating the model with a batch of observations, in addition to fully sequential updating, which would require an update of rank greater than one. We also observed empirically that repeatedly updating a single child GP using rank-one updates would cause numerical issues if the kernel length-scale parameter is small.
Since the time complexity of the algorithm is characterized primarily by the parameter m, the largest number of observations which may be associated with a single child model, it is straightforward to select an appropriate parameter value for applications requiring rapid sequential updating. After empirically determining the necessary wall-clock time needed for an update, the parameter may be adjusted appropriately.
The splitting algorithm also imposes a lower memory complexity in terms of the number of observations, n. It is well known that local GP methods effectively store only a block-diagonal approximation of the full covariance matrix [20,3]. In particular, when the splitting algorithm has n observations, n/m child GPs have been created, each of which will store a kernel matrix with up to m 2 entries. The resulting memory complexity of the algorithm is then O(mn) < O(n 2 ), where O(n 2 ) is the memory complexity of a full GP regression. In contrast, the rBCM by [4] has an asymptotic memory complexity of O(n 2 /E), where E is the (constant) number of experts specified as a parameter. Nguyen-Tuong et al. [18] did not present the memory complexity of their local GP model but, under the mild assumption that the input space is bounded, it can be shown that the asymptotic memory complexity is O(n 2 ). Notably, the splitting GP model is, to the best of our knowledge, the only local GP model to achieve a linear memory complexity.

Experiments
The efficacy of the splitting GP model was experimentally evaluated on one synthetic and two realworld data sets. For each experiment, all models used the radial basis function kernel, with automatic relevance determination [15] enabled. We compared the performance of the splitting GP model to the local GP regression of Nguyen-Tuong et al. [18], the robust Bayesian committee machine (rBCM) of Deisenroth and Ng [4], and full GP regression as a baseline comparison. Each of these models was chosen since they may be updated using sequential data, and make no use of a "complete" training data set to inform the model. This is in contrast to, for example, the patchwork kriging method of Park and Apley [20], which utilizes information from the entirety of data for domain decomposition.

Synthetic data with a non-stationary latent function
The synthetic data set (see Fig. 2) was constructed to have a non-stationary latent function f of a two-dimensional predictor (x 1 , x 2 ) ∈ [−1, 1] 2 . The response function is defined as for each replicate of the experiment. This sampling over the grid ensured that no two observations are too close to one another, which may cause numerical issues during training.  For this experiment, we compared three metrics of the models' performance: prediction error (in mean squared error, or MSE), memory usage (in kilobytes, or kB), and training time (in seconds). Each metric was estimated using a 5-fold cross validation procedure. We utilized common random numbers [28] as a variance reduction technique. Each experiment was replicated 10 times to reduce variability of results. In Fig. 3, the shaded region shows the 95% pointwise confidence region for the mean metric. The experiment was performed with different numbers of observations, ranging from 100 to 2500 in increments of 100, to demonstrate the relative data requirements for each model to converge. Each model was updated sequentially with single observations to simulate a streaming data setting.
For each alternative model compared, the experiment was replicated for a range of parameter values and the most favorable, in terms of MSE, results for each model are reported. For the splitting GP model, we used a splitting limit of m = 500. For the rBCM, E = 10 local experts were used. We found the parameter w gen of the local GP model difficult to tune and eventually used w gen = 10 −3 based on an extensive grid search (detailed in the appendix).
In Fig. 3, one can see that the splitting model and full GP required relatively few observations to achieve strong predictive power. The rBCM was much slower to converge, and exhibited high variability in its MSE across replicates of the experiment. We attribute the greater variability in MSE to the rBCM's random assignment of data to GP experts. On the other hand, the splitting GP model exhibited remarkably low variability in MSE between replicates. It is worth noting that the splitting model's MSE slowly increased as it splits at intervals of 500 observations (see Fig. 4).
The memory usage of the splitting GP model proved to be significantly lower than the full GP model, and marginally higher than that of the rBCM. However, for~2500 observations, it can be seen that the memory usage of the rBCM model began to surpass that of the splitting GP model. This is to be expected, since the asymptotic memory complexity of the rBCM is quadratic, as opposed to the linear complexity of the splitting GP model.
The training time of the rBCM and local GP models were found to be comparable, and the splitting GP model took slightly longer. The full GP took significantly longer to train. The change in regime of the training time of the full GP at~1100 observations is due to specialized numerical methods in the GPyTorch library [6].

Real-world data in robotic control application
The first real-world data set, kin40k, consists of 40,000 records describing the location of a robotic arm as a function of an 8-dimensional control input [17]. This data set was chosen since it exemplifies a task where the fast sequential updating and low memory profile offered by the splitting GP model are   desirable. Furthermore, kin40k is a popular benchmark for other GP regression methods [4,16,14], and thus facilitates direct comparison. We used the same training/test split of 10,000/30,000 points as in the previous work [4,16,14]. Observations were added to the models in batches, with each batch having a number of observations equal to the number of observations per local model. The results can be seen in Fig. 5(a). The splitting model's root-mean-square error (RMSE) is comparable with that of the rBCM, which is designed under a stronger assumption (that the latent function is well-modeled by a single GP), which is satisfied in this stationary regression task. In contrast, the local GP model struggled to achieve the same RMSE in this experiment (see the appendix for more detail on the difficulty tuning the parameter w gen ).

Real-world data in power plant control application
The second real-world data set is powergen, which consists of four predictors describing control inputs of a combined cycle power plant, and a response of the net electrical energy production. This data set is due to Kaya et al. [12] and Tüfekci [27] and is publicly available [5]. We pre-processed the data to remove duplicate observations, leaving a total of 7622 observations.
In the powergen experiment, we compared both the predictive error (in RMSE) and the combined training and prediction time (in seconds) of the splitting GP model and rBCM. We compared only these two models, since they proved to be the most competitive local GP methods in earlier experiments. The data set was randomly divided into a 80%/20% train/test split, and each model evaluated for several different parameter values. Observations were added to the models in batches, with each batch having a number of observations equal to the number of observations per local model. The experiment was replicated 10 times, and common random numbers were used for splitting the data set and training the models for sharper comparisons between the models. The results can be seen in Fig. 5(b).
The splitting GP model's RMSE decreased with the number of observations per local model, achieving a comparable performance with the rBCM. In terms of runtime, the splitting GP model was significantly faster, while having little variability between replicates. While the rBCM's runtime decreased with the number of observations per expert, its average runtime was 2.5-6 times longer than that of the splitting GP model, depending on the choice of parameters.

Conclusion
In this paper, we have developed an algorithm for constructing splitting GP regression models for potentially non-stationary latent functions using streaming data. We have shown that splitting GP models attain comparable predictive performance, while addressing critical shortcomings of other local GP models, such as discontinuity of predictions, lack of flexibility for modeling non-stationary latent functions, and opaque parameters which may be challenging to tune. Furthermore, splitting GP models are shown to enjoy linear memory complexity which, to the best of our knowledge, is the best among existing local GP methods, which typically have quadratic memory complexity. An implementation of the splitting GP model is available in the supplementary material to this article.