The gradient clusteron: A model neuron that learns to solve classification tasks via dendritic nonlinearities, structural plasticity, and gradient descent

Synaptic clustering on neuronal dendrites has been hypothesized to play an important role in implementing pattern recognition. Neighboring synapses on a dendritic branch can interact in a synergistic, cooperative manner via nonlinear voltage-dependent mechanisms, such as NMDA receptors. Inspired by the NMDA receptor, the single-branch clusteron learning algorithm takes advantage of location-dependent multiplicative nonlinearities to solve classification tasks by randomly shuffling the locations of “under-performing” synapses on a model dendrite during learning (“structural plasticity”), eventually resulting in synapses with correlated activity being placed next to each other on the dendrite. We propose an alternative model, the gradient clusteron, or G-clusteron, which uses an analytically-derived gradient descent rule where synapses are "attracted to" or "repelled from" each other in an input- and location-dependent manner. We demonstrate the classification ability of this algorithm by testing it on the MNIST handwritten digit dataset and show that, when using a softmax activation function, the accuracy of the G-clusteron on the all-versus-all MNIST task (~85%) approaches that of logistic regression (~93%). In addition to the location update rule, we also derive a learning rule for the synaptic weights of the G-clusteron (“functional plasticity”) and show that a G-clusteron that utilizes the weight update rule can achieve ~89% accuracy on the MNIST task. We also show that a G-clusteron with both the weight and location update rules can learn to solve the XOR problem from arbitrary initial conditions.


Introduction
In the discipline of machine learning, artificial neural networks (ANNs) have gained great popularity due to their impressive success in solving a wide variety of computational tasks [1]. ANNs also hold a special appeal due to their similarity to the networks of neurons in biological brains, lending hope to the possibility that ANNs might serve as a general-purpose framework for replicating human-like cognition. ANNs have also been used by computational neuroscientists to model activity dynamics, computation, and learning processes in the brain [2][3][4][5][6].
Despite their utility for both machine learning and neuroscience, ANNs operate at a level of abstraction that disregards many of the details of biological neural networks. In particular, the artificial neurons in ANNs integrate their inputs linearly; that is to say, each input to a neuron in an ANN is given an independent weight and the neuron applies an activation function to the linear weighted sum of its inputs.
In contrast, synapses in biological neurons display an assortment of nonlinear interactions due to the passive biophysical properties of the cell as well as the active properties of voltagegated ion channels. One nonlinear channel of particular interest is linked to the N-methyl-Daspartate (NMDA) receptor. The NMDA channel is both ligand-gated and voltage-gated [7][8][9], allowing neighboring synapses on a dendrite to interact in a cooperative, supralinear manner. When two nearby NMDA synapses are activated simultaneously, the voltage-dependence of the NMDA receptor creates a voltage perturbation (depolarization) that is greater than the linear sum of the two inputs had they been activated independently [10].
The NMDA receptor has been shown to play a crucial role in structural plasticity, or the growth and elimination of synapses between neurons [11][12][13][14]. Structural plasticity plays a crucial role in development and learning [15][16][17][18][19]. Structural plasticity increases the likelihood that temporally correlated synapses are placed next to each other [12,14,[20][21][22], which may take advantage of the NMDA supralinearity. While some structural plasticity may be due to new presynaptic-postsynaptic pairs being formed, structural plasticity can allow neurons with extant synaptic connections between them to find the most functionally effective dendritic location to make a synapse. The commonality of multiple synaptic contacts between a single presynaptic-postsynaptic pair of cells, as shown by electron microscopy [23][24][25][26], as well as evidence that the number of multiple synapse boutons increases in an enriched environment [27], lends support to the idea that connected neurons may be sampling dendritic locations so as to optimize synaptic placement.
The synergistic coactivation of neighboring inputs on the dendrite via nonlinear NMDA receptors and the ability of a neuron to relocalize its synapses in response to correlated activity have been theorized to have a computational function. An early framework proposed that, by placing synapses with correlated inputs next to each other, a neuron could learn to solve classification tasks. To demonstrate this capability, Bartlett Mel [28] created a simplified model of a neuron with a single dendrite called the clusteron. The clusteron's dendrite consisted of a sequence of discrete locations, from 1 to N, where N is the number of features in the input (in an image, N would be the number of pixels). Each input to the neuron would impinge upon a specific dendritic location. The "activation" of each synapse was defined as the direct input to that synapse multiplied by the sum of the inputs to nearby synapses within a fixed radius on the dendrite.
In order to train the clusteron to recognize a class of patterns (e.g. to identify pictures containing the digit 2), the clusteron is presented with patterns from the class to be recognized (called the positive class) and stores the average activation of each synapse per pattern. At every epoch of learning, the clusteron randomly swaps the locations of "poorly performing" synapses (i.e. synapses with a low activation relative to the other synapses) with each other on the dendrite, eventually resulting in a configuration wherein correlated inputs become spatially adjacent to each other. The adjacent correlated synapses interact nonlinearly with each other, leading to a higher overall activation for the positive class of patterns.
The clusteron model has spurred a wave of research into synaptic clustering, the idea that synapses may be selectively bunched together on a dendrite (structural clustering), and that synaptic clusters may receive correlated inputs which can influence a neuron's computation (functional clustering) [29]. Recent experimental work has provided evidence for both structural and functional clustering [14,19,[30][31][32][33] with implications for visual orientation selectivity [32] and multimodal sensory adaptation [19]. For a review of both experimental and modeling work pertaining to synaptic clustering, see [29].
In this work, we propose an alternative model for learning via dendritic nonlinearities and structural plasticity on the single dendrite, which we call the gradient clusteron, or G-clusteron. Unlike the original clusteron model, the G-clusteron uses a dendrite with continuous-valued locations (as opposed to the discrete locations in the clusteron) and a bell-shaped distancedependent function to model the location-dependence of interactions between synapses. These modifications allow us to analytically derive a gradient descent-based learning rule for the synaptic locations of the inputs to the G-clusteron. We show that the G-clusteron's location update rule can learn to solve the classic MNIST handwritten digit multiclass classification task with accuracy comparable to logistic regression (i.e. a linear classifier of the form y ¼ sigmoidðw � xÞ, whereŷ represents the predicted probability that the input pattern x belongs to the positive class and w is a learned weight vector). Moreover, we derive an additional plasticity rule for the synaptic weights of the G-clusteron and show that when the synaptic location update rule and the synaptic weight update rule are used simultaneously, the Gclusteron can learn to solve the exclusive or (XOR) binary classification task, a feat which cannot be accomplished by a linear classifier [34].

Results
The G-clusteron is a model neuron with a single one-dimensional "dendrite" containing synapses at various dendritic locations (Fig 1A). The activation of a synapse is defined as the product of its weighted input with a distance-weighted sum of the weighted input of every synapse on the dendrite (including itself).
Formally, for a given real-valued input pattern vector x = [x 1 ,x 2 . . .x N ], the real-valued activation, a i , of the synapse with index i is: where l i is the real-valued location of synapse i on the dendrite, w i is the real-valued weight of synapse i, and f(l i , l j ) is a bell curve-shaped distance-dependent function which determines how much each synapse affects each other synapse, defined as: where r is a positive-valued "radius" hyperparameter that determines the width of the curve.
(From a biophysical standpoint, Eq 2 is analogous to the distance-dependence of the voltage response of a passive cable to an instantaneous current impulse [35]. In this framework, r can be thought of as the square of the cable length constant λ, where l 2 ¼ r m r a , i.e. the ratio of the membrane resistance r m (in O�cm) to the axial resistance r a (in O/cm).) Note that f(l i , l j ) = 1 when l i = l j (i.e. when synapses i and j occupy the same location) and that f(l i , l j ) approaches 0 as synapses i and j move further away from each other (Fig 1B). Also note that f(l i , l j ) = f(l j , l i ). For convenience of notation and computation, we define a matrix F Each input x i is associated with a synapse at location l i and weight w i . The activation of each synapse is affected by each other synapse according to a distance-dependent factor (dashed curves). A threshold function is applied to the sum of the activations. (B) Schematic of distance-dependent interactions [Eqs 2 and 3]. The distance-dependent factor between synapse 1 and synapses 2, 3, and itself are shown as blue points along the distance-dependent function (dashed curve). Synapse 1 interacts maximally with itself, moderately with synapse 2, and slightly with synapse 3. Even though the inputs and weights for all synapses are identical, the activations (shaded rectangles, darker shades indicate larger activation) differ due to the relative locations of the synapses (e.g. synapse 1 is centrally located so it is affected by synapses 2 and 3 more than synapses 2 and 3 affect each other). (C) Schematic of the signdependence of the location update rule (Eq 7). Colors of the synapses indicate the sign of the weighted inputs (wx). For positive patterns (left column), two synapses attract each other if their weighted inputs have the same sign (left column, top and center) and repel each other if the weighted inputs have opposite signs (left column, bottom). In negative patterns this is reversed (right column). (D) Effect of input and weight magnitude on the magnitude of the location update rule. The magnitude of the attraction and repulsion (arrow length) is proportional to |w i x i � w j x j |. (E) Top: Magnitude of the location update rule as a function of the distance between two synapses. Bottom: Three examples for the location-dependence of the update rule (aligned to the curve above). Arrow length denotes magnitude of update. This magnitude is multiplied by the weighted input factor of the learning rule from D as well as the magnitude of the error |ŷ À y| to obtain the final magnitude. https://doi.org/10.1371/journal.pcbi.1009015.g001

PLOS COMPUTATIONAL BIOLOGY
The gradient clusteron such that F is thus a symmetric matrix with ones on its diagonal. For a given pattern of presynaptic inputs x, the G-clusteron sums its activations together with a bias term b to produce: Because we will be using the G-clusteron as a binary classifier, we apply a sigmoidal nonlinearity g to h(x) and produce an outputŷ ranging between 0 and 1: As in logistic regression,ŷ can be interpreted as a probability estimate for the binary label for the input pattern x, with values closer to 0 representing a prediction for the label 0 (called the negative class) and values closer to 1 representing a prediction for the label 1 (called the positive class). The G-clusteron thus differs from the original clusteron [28] in two important ways: 1) each synapse has a real-valued location on the dendrite instead of an integer-indexed location and 2) each synapse's activation function depends on the inputs of its neighbors as a gradually decreasing distance-dependent function as opposed to a hard cutoff at a fixed distance.
Defining the output of the G-clusteron in this fashion allows us to derive a gradient descent plasticity rule for the synaptic locations (assuming a logit or cross-entropy error loss function, see Methods). For each input pattern presented to the G-clusteron, we update each synaptic location according to the rule: where Δl i is the update to the location of synapse i defined as: where y20,1 is the true label (negative or positive class) for pattern x, and η L is the positive-valued learning rate for the synaptic locations. This gradient rule for each synapse can be understood as summing over "forces" that depend on pairwise interactions between that synapse and each of the other synapses on the dendrite. The interaction between two synapses depends both on the weighted inputs of the synapses (w i x i and w j x j ) and the distance between the synapses in the following manner (see S1 Text for further explanation): 1. For positive-class training patterns (y = 1), synapses with same-sign weighted inputs (i.e. sign(w i x i ) = sign(w j x j )) exhibit "attraction" while synapses with opposite-sign weighted inputs (i.e. sign(w i x i ) = −sign(w j x j )) exhibit "repulsion" (Fig 1C, left).
3. The magnitude of attraction and repulsion between two synapses is proportional to the product of the weighted inputs of the two synapses ( Fig 1D). 4. The magnitude of attraction and repulsion between two synapses is distance-dependent. The attractive/repulsive force is small at very small distances, becomes larger at intermediate distances, and shrinks again at large distances ( Fig 1E).
The magnitude of the update for each synapse is also scaled by the magnitude of the error term, |ŷ À y|.
Eq 7 can be interpreted as a vector field along the dendrite, like force fields in physical systems of particles. If we consider a "unit synapse" such that wx = 1 at an arbitrary location l on the dendrite, the magnitude and direction of the "force vector" created by the plasticity rule at that location for a given input pattern x is given by: This interpretation of the location plasticity rule can provide some intuition as to how such an algorithm might be implemented biologically. We can imagine extracellular chemical signals being released at regions of the dendrite where there is a concentration of synaptic activity. Excitatory and inhibitory inputs would have different chemical signals associated with them. The diffusion of these chemicals in the extracellular space around the dendrite would create a location-dependent field of chemical gradients. These chemical gradients could induce presynaptic axons to form or eliminate synapses on the dendritic regions that were recently active, with differential effects for excitatory and inhibitory synapses.

Toy examples
To illustrate how the location update learning rule works, we consider several toy problems. For these problems, instead of training the G-clusteron to discriminate between two classes, we will show what happens when the G-clusteron is given a dataset where all examples are from the positive class (in other words, the G-clusteron is tasked with maximizing its output on all the input patterns, see Methods for an explanation of how the error signalŷ À y is implemented in these tasks) and when all examples are from the negative class (so the G-clusteron must minimize its output for all input patterns). For all the examples in this section, all synaptic weights are fixed to have the value 1 and don't change over the course of learning; our results thus strictly depend on the synaptic locations and input values. For the first toy problem, we create a single synaptic input vector x where the values of x are randomly distributed between -1 and 1. This input vector is repeatedly presented to the Gclusteron with a positive label (y = 1), such that the G-clusteron will attempt to maximize its overall activation. Because the interaction between synapses in Eq 1 is multiplicative, a sensible strategy would be to minimize the distance between synapses with same-sign inputs and maximize the distance between synapses with opposite-sign inputs. This would result in two clusters on the dendrite: one with positive-valued synaptic inputs and one with negative-valued synaptic inputs. The location update rule, by causing attraction between same-sign synapses and repulsion between opposite-sign synapses, does exactly this (Fig 2A and S1 Movie).
We now show how the radius hyperparameter r (i.e. the width of the distance-dependent curve) affects the learning rule for input patterns from the positive class. In the first example, r was sufficiently large relative to the initial dispersion of synapses such that all the synapses could "feel" each other, thus enabling all synapses with the same sign to eventually aggregate into a single cluster. However, if r is reduced, synapses that start further apart from each other do not exert a strong pull on each other, so instead the learning rule operates in a more local fashion, creating several same-sign clusters (Fig 2B and S2 Movie).
We next train the G-clusteron on the same input pattern, but this time with a negative label (y = 0), such that the G-clusteron will minimize its overall activation. Here we pre-initialize the G-clusteron in a high-activation state where synapses start in separated homogenous-signed clusters. Inversely to the first strategy, we would like opposite-sign inputs to become intermixed. Because the plasticity rule flips attraction and repulsion for input patterns from the negative class, we achieve the expected result ( Fig 2C and S3 Movie).
Having demonstrated that the G-clusteron learns by aggregating and dispersing same-sign and opposite-sign synaptic inputs for individual input patterns, we now illustrate that this mechanism serves to learn the correlation structure of a dataset comprised of multiple input patterns. To this end, we create a 20-dimensional multivariate Gaussian distribution where each dimension of the Gaussian had a mean of 0 and a covariance matrix S such that: In other words, dimensions x 1 . . .x 5 were all positively correlated with each other, as were dimensions x 6 . . .x 10 , x 11 . . .x 15 , and x 16 . . .x 20 , but otherwise, dimensions were slightly negatively correlated ( Fig 3A). As such, each vector sampled from this distribution would exhibit similarity between the first five dimensions, second five dimensions, and so forth ( Fig 3B).
As the mean of each dimension is 0, a linear neuron (without a bias) would be unable to increase its average output on this dataset, as there is no independent information in each synapse that could be given a linear weight. However, the G-clusteron can take advantage of the correlational structure of the data. Because the groups of positively correlated inputs will tend to all be either all-positive or all-negative in any given input pattern, the learning rule will gradually move the positively correlated synapses together, creating dendritic clusters of synapses with correlated inputs. As these clusters form, the output of the G-clusteron increases ( Fig 3C  and S4 Movie). Inversely, the G-clusteron can learn to decrease its output on this dataset by clustering negatively correlated inputs ( Fig 3D and S5 Movie). We verified that the G-clusteron appropriately increases or decreases its average activation for this task by running 100 trials of the algorithm and checking the average activation before and after learning (See S2 Text). The average activation on the positive class task before learning was 0.86, and after learning was 2.95 (average signed d' = 1.095). For the negative class task, the average activation before learning was 2.03, and after learning was 0.44 (average signed d' = −1.155).

Learning MNIST
The ability of the G-clusteron to learn the correlational structure of a dataset enables it to perform supervised classification by aggregating correlated inputs from the patterns where y = 1 and disaggregating correlated inputs from the patterns where y = 0 ( Fig 4A). We trained the G-clusteron on the MNIST dataset of images of handwritten digits [36] and compared its performance to that of the original clusteron from [28] as well as to logistic regression. Before attempting the all-versus-all classification task, we test the G-clusteron separately on each digit in a one-versus-all classification paradigm (example shown in Fig 4B). For each digit from 0-9, a G-clusteron was trained on a dataset where half of the images were of that digit (positive class, label y = 1) and half of the images contained other digits (negative class, label y = 0). The G-clusteron was then tested on a holdout test set (See Methods). This procedure was repeated for the original clusteron (only positive-class training examples are used for the original clusteron, see Methods) as well as logistic regression. The learning and testing process was repeated 10 times for each classifier to ensure performance stability. The results are shown in Table 1.  Schematic of Gclusteron performing the one-versus-all classification task, where it must produce a larger activation for images containing the positive class 2 (digit with red border) than any other number (digit with blue border). Colored circles in the digit images indicate exemplar pixel locations mapped to exemplar synapses on the dendrite with the same color, generating two dendritic clusters (green-blue and orange-yellow, respectively). The synapses within these clusters will be maximally activated if both inputs to the cluster are the same sign (i.e. both black or both white), while opposite sign inputs will result in lower activation. (B) Exemplar histogram of sum of synaptic activations from the G-clusteron tasked with classifying the digit 2 (label y = 1, red) versus all other digits (label y = 0, blue) before learning (left) and after learning (right). Over the course of learning, the G-clusteron increases its activation on images containing a 2 relative to images displaying other digits, enabling binary classification. Blue vertical line is the value of the bias term. (C) Synaptic activations for positive (top) and negative (bottom) patterns For all digits, all three classifiers achieved a classification accuracy far above chance level of 50%. Depending on the particular digit, the clusteron achieved between 74.4% and 92.8% accuracy, the G-clusteron achieved between 80.7% and 96.8% accuracy, and logistic regression achieved between 91.2% and 98.3% accuracy. We emphasize that both the clusteron and the G-clusteron had all their weights fixed to 1 during the entire learning procedure and were only able to update their synaptic locations. Our results on the one-versus-all MNIST task thus demonstrate the remarkable efficacy of structural plasticity-based learning and suggest that the G-clusteron algorithm may be superior to the original clusteron algorithm for solving certain tasks.
The distribution of synapses and synaptic activations before and after learning MNIST (Fig 4C-4E) can be instructive in understanding how the G-clusteron operates. In many instances, over the course of learning, the synapses on the dendrite moved from an approximately uniform synaptic density to a more clustered structure, with some regions of the dendrite having a higher synaptic density than others ( Fig 4E). These higher density areas occasionally had larger activations for patterns from the positive class than for patterns from the negative class (Fig 4D), suggesting that the G-clusteron may be building structural clusters to take advantage of correlated inputs in the positive class relative to the negative class. However, not all high-density clusters showed high activation, and some high-activation regions were low density. Thus, while the G-clusteron learning algorithm does produce structural clusters as a consequence of learning, there is not a guaranteed correspondence between the structural clusters and activation level in a complex task like image classification.
Having shown that the clusteron and G-clusteron exhibit satisfactory performance on the one-versus-all MNIST task, we now turn to the harder problem of all-versus-all classification. Here, we wish to train a single-layer network of 10 classifiers on the MNIST dataset (one for each digit) and have the network classify each digit correctly. We consider two standard ways to train a single-layer network on a multiclass classification problem. In the one-versus-rest (OVR) method, 10 units are independently trained on a one-versus-all paradigm, as before, and the classifier which produces the largest output (ŷ) for a given input pattern is declared the "winner", and the input pattern is assigned to the positive class for that classifier (Fig 5A). In the softmax (SM) method, the units are trained simultaneously on each example and the raw outputs (h(x)) of all units are passed through a softmax function (see Methods), which normalizes the output of each unit by the sum of the outputs of all the units in the layer (Fig 5B). This normalized output is then used in the error termŷ À y when calculating the update rule. Because the softmax method allows the classifiers to communicate with each other via the output normalization, it can often lead to superior results for multiclass classification [37].
Importantly, the original single-dendrite clusteron does not utilize an error signal in its learning rule, so the original clusteron can only be trained on the multiclass task with the OVR before (left) and after (right) learning. Note the increase in synaptic activations for positive patterns during learning. (D) Sum of the activations within small bins of dendritic length averaged over positive (red) and negative (blue) datasets before and after learning (aligned with C). Note that there are small regions (functional clusters) where the activations for positive patterns is much larger than for negative patterns. (E) Synaptic density per bin (same for positive and negative input patterns, aligned with C-D). Note that there are several high-density regions (structural clusters) that sometimes correspond to the functional clusters.

PLOS COMPUTATIONAL BIOLOGY
The gradient clusteron paradigm ( Fig 5C). Both logistic regression and the G-clusteron, however, do use an error term, making it straightforward to perform multiclass classification with both the OVR and SM methods.
To test the performance of the G-clusteron under both multiclass classification paradigms, we trained the original clusteron, the G-clusteron, and logistic regression using the OVR method, and the G-clusteron and logistic regression (but not the original clusteron) using the SM method. As with the previous task, learning for all classifiers was repeated 10 times to ensure stability.
When using the OVR paradigm, the original clusteron achieved an average accuracy of 70.1%, the G-clusteron achieves 74.3% accuracy, and logistic regression achieves 92.2% accuracy. While both the original clusteron and the G-clusteron achieve performance far better than chance level of 10%, neither of them do nearly as well as logistic regression. However,

PLOS COMPUTATIONAL BIOLOGY
The gradient clusteron when we train the G-clusteron using the SM method, it achieves an impressive accuracy of 85.3%, closing in on logistic regression's 92.6% accuracy in the SM scheme (Fig 5D). Although not superior to logistic regression, 85.3% accuracy on an all-versus-all MNIST classification task is notable for an algorithm that can only modify synaptic locations and not synaptic weights.

Synaptic weight update rule for the G-clusteron
In addition to the location update rule, we also derive a gradient descent rule for the weights of the G-clusteron of the form: Where Δw i is the update to the weight of synapse i defined as (assuming a logit or crossentropy error loss function, see Methods): with η W as the positive-valued learning rate for the weights. This rule can either be used on its own or in conjunction with the location update rule (Eq 6) by simultaneously updating the weights and locations using their respective rules during each epoch. To see if the weight update rule can improve the accuracy of the G-clusteron on the all-versus-all MNIST task, we trained a single-layer G-clusteron network that used only the weight update rule as well as a network that used both the location update rule and the weight update rule simultaneously, using both the OVR and SM protocols. When only the weight update rule was used, the G-clusteron achieved an accuracy of 77.9% for OVR and 89.3% for SM, and when both the weight and location update rules were used simultaneously, the G-clusteron achieved an improved accuracy of 81.2% for OVR and a similar accuracy of 89.1% for SM. Thus, being able to manipulate synaptic weights may allow the G-clusteron to achieve slightly superior accuracy relative to a G-clusteron that can only update its synaptic locations (Fig 5D).
Although the G-clusteron with both the weight and location update rules did not perform as well as logistic regression on the MNIST task, this does not indicate that the theoretical classification capacity of the G-clusteron is inferior to that of a linear classifier. In fact, we prove (see S3 Text) that if a G-clusteron with an arbitrary localization of synapses is equipped with a single additional "bias synapse" x 0 (x 0 = 1 for all input patterns, this is distinct from the 0 thorder bias term b), it can approximate a linear classifier to an arbitrary degree of precision by appropriately assigning the weights. The discrepancy in classification accuracy in this task between logistic regression and the G-clusteron with a weight update rule is thus likely due to the G-clusteron's failure to converge to a global minimum of the loss function rather than an inability to implement a linear separation boundary. Future optimizations to the learning algorithm might therefore allow the G-clusteron to surpass the accuracy of logistic regression on MNIST. In any event, we will now show that the G-clusteron is indeed superior to a linear classifier with respect to one particularly famous task: the XOR problem.

XOR problem
To motivate the use of the weight update rule in combination with the location update rule for the G-clusteron, we consider the XOR function. The XOR function receives two bits of binary input (x 1 and x 2 ) and returns a 1 if the input bits are different or a 0 if the input bits are the same. The XOR function famously cannot be implemented by linear classifiers like the perceptron [34]. It is thus valuable to demonstrate that the G-clusteron can implement this function. We first wish to show that there are values for the parameters (i.e. the weights-w 1 and w 2 -and the locations-l 1 and l 2 -of the two synapses) of the G-clusteron that will result in a correct implementation of the XOR function. We then show that a G-clusteron must be able to update both its synaptic locations and its synaptic weights if it is to learn to solve the XOR problem from every arbitrary initialization of its parameter values.
To implement the XOR function, the parameters of the G-clusteron must satisfy both of the following inequalities (See Methods): Note that although we originally have 4 parameters (w 1 , w 2 , l 1 , and l 2 ), the solution space satisfying these inequalities (Fig 6A and 6B) can be expressed in terms of only three parameters: w 1 , w 2 , and F 12 . When expressed in this manner, there are several things to observe about these inequalities. First, note that for F 12 �0.5, there is no way to assign the weights such that Eqs 12 and 13 are satisfied (Fig 6B, left). This means that if the two synapses are initialized sufficiently far away from each other such that their distance-dependent factor is less than 0.5, a G-clusteron with only a weight update rule (Eq 10) would have no way to solve the XOR problem.
Moreover, the range of weights that are valid to solve the XOR problem increases as F 12 increases from 0.5 to 1, where the solution space for the weights is maximal (Fig 6B, center and  right). In other words, as the synapses move closer together, there is a larger set of weights that would satisfy Eqs 12 and 13. However, even when the synapses occupy the exact same location (i.e. F 12 = 1), there is still a large range of invalid weights. For example, if both weights are initialized with the same sign, a G-clusteron with only a location update rule (Eq 6) could not implement the XOR function (Fig 6B and 6C). Therefore, for the G-clusteron to solve XOR from arbitrary initial parameter values, it must be able to update both its weights and its locations.
To demonstrate this numerically, we created 3 G-clusterons: one with only a weight update rule, one with only a location update rule, and one with both update rules. For each G-clusteron, we ran 1000 trials with randomized initializations of the inputs and labels for the XOR problem. Each trial ran for 10,000 epochs or until convergence (see Methods).
For a G-clusteron with only the weight update rule, a trial was designated as "possible to converge" if some weight assignment could produce the correct XOR output given the initialization of the two synaptic locations, which can only occur if the initial F 12 was greater than 0.5. For a G-clusteron with only the location update rule, a trial was designated as "possible to converge" if some location assignment could produce the correct XOR output given the initialization of its synaptic weights. Because the solution space for XOR grows as the synapses move closer to each other, this entails that the initial weights would have to satisfy Eqs 12 and 13 given that F 12 = 1 (in other words, the weights must fall in the overlap shaded region of Fig 6B,  right panel). For the G-clusteron with both the weight and the location update rules, all trials could potentially converge (Fig 6D).
The G-clusteron with only the weight update rule converged for 475 trials out of 485 trials analytically possible to converge (Fig 6D and 6E), unexpectedly failing on trials where F 12 was initialized slightly above the boundary condition of 0.5, because the solution space for the weights when F 12 is fixed near 0.5 is small (Fig 6B middle), and it can therefore take the learning algorithm an exorbitantly long time to direct the weights into this exactingly narrow range of values.
The G-clusteron with only the location update rule converged for 247 trials out of 251 trials analytically possible to converge (Fig 6D and 6F), unexpectedly failing on trials where F 12 was initialized very close to 0, as before, and also on trials where the weights were initialized very close to the boundary of the solution space ( Fig 6B, right panel). In these cases, it was evidently difficult for the algorithm to find the bias value with sufficient precision to correctly perform the classification, even though the weights and locations were analytically valid.
Only the G-clusteron with both the weight and the location update rules was able to converge on almost all of the trials (947 out of 1000 trials), unexpectedly failing only on trials where F 12 was initialized very close to 0 (Fig 6D and 6G). These unexpected failures occurred because the magnitude of the location update rule for a given epoch scales with the value of F 12 and the weights at that epoch (Eq 7). Thus, if F 12 is initialized very close to 0, the synaptic locations may only move a miniscule amount each time the gradient rule is applied, depending on the values of the other parameters.

Discussion
We have shown that the G-clusteron is a robust single-neuron learning algorithm that can solve real-world classification tasks, such as MNIST handwritten digit classification. The Gclusteron's ability to solve this task merely by moving synapses on its dendrite without using synaptic weights demonstrates the computational potential of structural plasticity that makes use of distance-dependent nonlinearities.
While the original single-dendrite clusteron [28] also exhibits impressive performance when tasked with one-versus-all classification on a single digit, the lack of a supervised gradient descent-based plasticity rule makes it difficult to scale this algorithm to a multiclass classification problem, as the clusteron learning rule does not have an error signal that can be communicated between neurons within a layer via a softmax activation function.
The gradient descent plasticity rule of the G-clusteron enables effective multiclass classification and allows the use of a variety of techniques developed for classic ANNs. For example, we used the ADAM momentum-based adaptive learning method [38] to dynamically optimize our learning rate. Bringing dendritic cluster-based plasticity algorithms within the fold of gradient descent learning opens up many possibilities for the extension of the G-clusteron algorithm using the extant mathematical frameworks and robust literature for learning with gradient descent in ANNs [39]. The G-clusteron requires the weight and location update rules to solve the XOR problem from arbitrary initial conditions. (A) Analytical solution to XOR problem. Green area represents the parameter space in terms of the two weights, w 1 and w 2 , and the distance-dependent factor, F 12 , which would produce the correct outputs for the XOR problem. (B) Representation in weight space of the analytical solution to XOR, shown for selected values of F 12 (i.e. slices in the F 12 axis of A). The blue region satisfies the inequality from Eq 12, and the yellow region satisfies the inequality from Eq 13. The G-clusteron can solve the XOR problem when both inequalities are satisfied (overlap region, same as green region in A). Note that the solution space for the weights is null when F 12 <0.5 and increases as F 12 approaches 1. However, there are many weight assignments that never satisfy both inequalities. (C) Examples of correct and incorrect parameter assignments for the XOR problem. Some weight assignments never produce a correct solution for XOR regardless of the distance between the synapses (top). Moreover, if the synapses are too far away from each other, there are no weights that satisfy the XOR relation (middle). Thus, solving XOR from arbitrary initial conditions requires that the synapses be close together and the weights appropriately set (bottom). (D) Number of trials that were analytically possible to converge versus trials that successfully converged on the XOR problem for the G-clusteron trained with only the weight update rule, only the location update rule, or both rules. (E) Top: Convergence (marker shape and color, see legend) as a function of the initial parameters of 1000 randomly initialized G-clusterons with only the weight update rule. Note that only trials in which the initial value for F 12 was greater than 0.5 converged. Bottom: Convergence as a function of the final parameters of the G-clusterons after learning. Note that converged trials fall within the region specified in A. (F) Same as E for G-clusteron using only the location update rule. Note that only trials in which the initial values for the weights were within the overlap region of B (right panel) were able to converge. (G) Same as E for G-clusteron with both the weight and location update rules. Almost all trials converge to the correct solution. https://doi.org/10.1371/journal.pcbi.1009015.g006 We have also shown that incorporating a plasticity rule for the weights in a G-clusteron alongside the dendritic location plasticity rule enables the G-clusteron to solve the classic linearly inseparable XOR problem [34]. This reinforces the possibility that neuron models with dendritic nonlinearities may be more computationally powerful than the linear nodes used in ANNs [40][41][42][43].

Relationship to previous work
The original clusteron model [28] has been developed into the "two-layer model" [40,41] in which synapses are linearly summed within dendritic branches, and each branch computes a sigmoidal nonlinearity of its input before sending its output to the soma where a final nonlinearity is applied. Unlike the original clusteron, this two-layer model does employ a supervised learning strategy that applies gradient descent to modify synaptic stability, which in turn determines how synapses are mapped to individual branches [40], however it does not directly learn exact synaptic locations via gradient descent as we do here. The two-layer model has been shown to achieve above 96% accuracy on a binary MNIST classification task (although not directly comparable to the tasks in our study, this result was superior to a linear classifier) [44]. A two-layer neuron with two branches can also implement the XOR function by, for example, having one branch implement OR, the other branch implement NAND, and the soma implement AND, each of which is a linear operation. We emphasize that the G-clusteron, as a single-branch model, is not in competition with the two-layer model in terms of classification performance. Rather, the G-clusteron is a model of very local, distance-dependent nonlinearities that may occur within a single dendritic branch. The sigmoidal output of the G-clusteron can be interpreted as a branch-level nonlinearity (as opposed to a somatic nonlinearity) and the G-clusteron can thus be incorporated as the branch-level unit in a twolayer model. This would provide an interesting avenue of future research.
In addition to the two-layer model [40,41], there have been a variety of other attempts to model learning with dendrites and dendritic nonlinearities [45][46][47]. These models also tend to treat the dendritic branch (or sometimes an entire dendritic tree) as discrete loci of nonlinearity. Conceivably, both discrete, branch-or tree-dependent nonlinearities and fluid, locationdependent nonlinearities could exist simultaneously, further expanding the computational capabilities of real neurons.
Although our work here treats dendrites with NMDA receptors as nonlinear integration units, recent work [48] has shown that a detailed biophysical model of a cortical layer 5 pyramidal cell with NMDA synapses and other active mechanisms can implement the perceptron learning algorithm, implying that a neuron can behave as a linear classifier. However, this does not preclude the possibility of nonlinear integration and plasticity rules; rather it should be taken as an indication a neuron can elect to apply a simple plasticity rule that ignores the synergistic interactions between synapses and still manage to solve classification tasks as a perceptron would. Moreover, Mel [49] has shown that a clusteron-like algorithm can also be implemented in a detailed biophysical model of a cortical pyramidal cell with NMDA synapses.
The original clusteron [28] fits within the framework of subsampled quadratic classifiers, as the clusteron sums a subset of its N 2 mixed terms x i x j [42]. The G-clusteron can be thought of as a "constrained" quadratic classifier in the sense that the equation for the G-clusteron contains all N 2 mixed terms, but the coefficients of these terms-the entries of the F matrix-are "tied together" in the sense that F must be produced from a matrix representing distances between points in one dimension, and not all matrices are valid distance matrices. The G-clusteron therefore gives us N 2 mixed terms for the price of N location parameters, but at the cost of not being able to manipulate each of the N 2 coefficients independently.
From a computational standpoint, we note that the covariance perceptron [50] also takes a correlation-based approach to solving learning tasks.

Biological plausibility
The G-clusteron takes several liberties with respect to the biological phenomena from which it is inspired. While a multiplicative activation function may be a tolerable first-order approximation of the synergetic cooperation between NMDA synapses, a sigmoidal function provides a closer fit [41]. Even to the extent that a multiplicative model does portray the synergistic relationship between excitatory NMDA synapses, such a relationship does not exist between excitatory synapses and inhibitory synapses or between inhibitory synapses and other inhibitory synapses. The multiplicative interaction with negative inputs, however, is essential to the learning protocol of the G-clusteron, so this aspect of our algorithm should be viewed as "biologically inspired" rather than an accurate depiction of what occurs in real biological cells. We do note, however, that clustering dynamics have been observed to occur between dendritic spines and inhibitory synapses [51].
Although in real biological neurons there is voltage attenuation from the dendrite to the soma [52,53], we have chosen not to include this feature in the G-clusteron in order to emphasize the effect of local interactions, to maintain similarity to the original clusteron model [28], and to keep the model relatively simple. This choice can be justified biologically either on the basis of "synaptic democracy" [54,55], where synaptic weights are adjusted to compensate for attenuation, or by noting that in some cases the high input resistance of distal dendritic regions may help balance out attenuation to the soma in active dendrites [56].
While some have posited that the dendritic branch may be the fundamental computational unit within a neuron [57,58], there is reason to believe that interesting clustering dynamics occur within branches on a more local scale. Although there is no consensus definition for synaptic clustering, several studies have shown that cluster-related dynamics occur on a scale from 0-15 μm. Synapses are more likely to exhibit correlated activity if they are within 10 μm of each other [14]. Relatedly, the intercontact distance (ICD) between synapses decreases from 10 μm to~5 μm in owl inferior colliculus following an adaptation protocol [19]. Moreover, the relevant range for BDNF-dependent clustering seems to be~15 μm [11], and the spread of Ras, which allows "crosstalk" between synapses in a long-term potentiation (LTP) protocol, is limited to~10 μm [59,60]. Heterosynaptic plasticity-structural long-term depression (LTD) occurring near spines that underwent structural LTP-also occurs on a scale of~10 μm [21].
The mechanism of learning synaptic locations in the G-clusteron, namely the attraction and repulsion of synapses based on the activity of their presynaptic inputs, also warrants discussion from a biological standpoint. As we have described, the computation of the "forces" exerted by each synapse results in a vector field along the dendrite with regions of attraction and repulsion. Such a vector field could be implemented by the release of attractive or repulsive chemical factors in proportion to the local dendritic activation which diffuse in the extracellular space, creating a distance-dependent gradient that effectively sums the "pull" of the synapses that were active along the dendrite. The attractive factors could stimulate the growth of presynaptic axonal boutons or postsynaptic filopodia at specific regions of the dendrite, while the repulsive factors could eliminate existing synapses.
Brain-derived neurotrophic factor (BDNF) and its precursor, proBDNF, are strong candidates for the signaling mechanism underlying the sort of structural plasticity we suggest here. BDNF has been shown to be responsible for structural plasticity in development by stabilizing correlated synapses during development, while proBDNF weakens synapses that exhibit uncorrelated activity [11]. Another possible signaling agent is estradiol, which is also involved in the formation of new spines [61] and interacts with BDNF in spine regulation pathways [62]. Astrocytes and microglia, which play an important role in spine elimination [63] may also be implicated in our model as a mechanism for intelligently rearranging synapses on the dendrite in response to local activity.
We note that the NMDA receptor is not only a mechanism for voltage nonlinearities; the receptor also plays a crucial role in the plasticity process itself by allowing Ca 2+ ions to enter into the postsynaptic cell, initializing a cascade that can result in LTP or LTD, depending on the stimulation protocol used [64,65]. Interestingly, the NMDA receptor is involved in structural plasticity as well, as the release of proBDNF and BDNF is NMDA-dependent [11,12]. Moreover, there is evidence that both LTP and LTD protocols can cause structural plasticity changes, with LTP encouraging the growth of new spines and LTD resulting in spine retraction [66]. This lends support to the idea that structural and functional plasticity may be happening simultaneously, both mediated by the NMDA receptor.
Several testable experimental predictions emerge from the G-clusteron's location update rule. If presynaptic axons are indeed being coaxed into moving their synapses along chemical gradients, we might expect to see axonal boutons being destroyed and new boutons from the same axon forming a short distance away. We may also expect to see different clustering patterns on different regions of the dendrite due to different lengths constants, as in Fig 2A and  2B. Because the length constant λ increases with dendritic diameter [35,53], we might expect that thinner regions of dendrite would have a larger number of small clusters than wider regions, which may tend to group synapses into a smaller number of large clusters. However, these effects would likely depend on the exact nature of the task that the neuron performs.
The distance-dependence of the weight update rule (Eq 11) of the G-clusteron can also be understood in a biological framework. Experimental evidence has shown that when LTP is induced in one synaptic spine, the threshold for LTP induction in nearby spines within~10μm is reduced [67], a process mediated by the intracellular diffusion of Ras [59]. The G-clusteron's weight update rule qualitatively expresses a similar phenomenon-if a synapse has a strong input such that its weight increases a large amount, nearby synapses will require less input to achieve a large weight update.
Both the weight and location update rules in our model require an error signal (ŷ À y). A prominent candidate for this error signal is the neuromodulator dopamine, which has been shown to correspond to reward prediction error [68]. Dopamine has also been demonstrated to be necessary for LTP in hippocampal CA1 [69] and is involved in learning and plasticity processes in cortex [70], adding support to the idea that dopamine-mediated error signals play an important role in supervised learning.

Future directions
The G-clusteron can be extended in a variety of directions. Because the G-clusteron is inspired by biological dendrites, we use one-dimensional synaptic locations, but our model and algorithm can be modified such that synapses interact with each other as a function of distance in a space with arbitrarily high dimensions, instead of merely moving along a line. Presynaptic axons ostensibly do not just need to localize their boutons within a branch; they may also want to decide between branches of the same neuron or among different neurons. As such, a 2D or 3D G-clusteron, where synapses are localized within the plane of a neuron's branching structure or within a 3D volume of brain tissue, also has some biological motivation. The 2D or 3D case may be more relevant during early development, when the brain is being wired up for the first time, whereas the one-dimensional case presented here may be more relevant to later stages of development or adulthood, when the brain is densely packed and neuronal processes are likely more restricted in their motion, possibly limited to resampling locations along a dendritic branch.
Our model can also be made more biologically plausible in a number of ways, such as by using a sigmoidal nonlinearity instead of a multiplicative nonlinearity [40,41], incorporating recent work regarding distance-dependent interactions between nonlinear synapses [71][72][73], or incorporating attenuation from the dendrite to the soma [52,53]. One recent study [74] used an attenuation-based model to show how neurons could learn probability distributions via creating multiple synaptic connections at different dendritic locations. It would be valuable to see how incorporating these features into the G-clusteron model would affect both the weight and location gradient update rules.
As with any single-neuron model, the G-clusteron should also be explored in the context of a multi-layer network. The G-clusteron's gradient descent-based update rules lend themselves to the possibility of a backpropagation algorithm for a deep network of G-clusteron neurons. This can create exciting directions for a version of deep learning that incorporates synaptic weight updates together with dendritic nonlinearities and structural plasticity.

Derivation of location update rule for the G-clusteron
We wish to derive a learning rule that updates the synaptic locations on each iteration of the algorithm of the form: Where Δl i is proportional to the gradient of the error with respect to the locations. For an arbitrary loss function J and nonlinearity g where we have J(g(h(l i ))) we have: By the chain rule we have: The first two factors of the gradient, dJ dg and dg dh , are specific to the cost function and non-linearity chosen. In our case, we use the logit cross-entropy error JðyÞ ¼ À ylogðŷÞ À ð1 À yÞlogð1 ÀŷÞ ð14Þ and sigmoidal nonlinearity We therefore have (see [75] for elaboration): For the final term (see S4 Text for full derivation of dh dl i ): We thus have: We can subsume the constant factor 4 r into the learning rate η L , which gives: Which gives our location update rule (Eq 7). This same expression holds if we use a softmax nonlinearity:ŷ and a cross-entropy cost function: as we do for the multiclass MNIST classification task (See [75]). (Here i and j refer to the class labels in a multiclass classification task).

Derivation of bias update rule
We also require a rule for the update of the bias term b. As above: The first two terms are as above, the final term is: Where η B is the positive-valued bias learning rate. (Note the difference in sign relative to the location and weight update rules.)

Derivation of weight update rule
In addition to the location update rule, we also derive a gradient descent rule for the weights of the G-clusteron of the form: Where: As with the location update rule, we have: Taking the derivative dh dw i we obtain (see S5 Text for full derivation of dh dw i ): We can subsume the constant factor 2 into the learning rate. The weight update rule for the Gclusteron is thus:

Efficient computation and batch learning
For computational efficiency on the MNIST task, the output of the G-clusteron and the learning rules can be implemented in a vectorized fashion. We assume here that we will be working with a dataset comprised of multiple input patterns, with indices p21,2. . .P. We will denote input j in pattern p as x ðpÞ j . To efficiently calculate each synaptic activation over an entire dataset, we define a matrix S such that We also define a signed distance matrix D where: The distance-dependent factor matrix F can be expressed in terms of D: The activation for synapse j in pattern p, a ðpÞ j , is the element A pj of matrix A, which can be computed as: Where˚denotes the Hadamard product, i.e. elementwise multiplication.
For the location update rule, we utilize a batch protocol where updates are performed after observing P patterns (P here is the size of the batch, not the dataset). Here, for each pattern p in the batch we define a matrix Q (p) such that The derivative of the loss function on the entire batch with respect to location l i is given by: Using our matrix notation, we have However, this is computationally intensive as it requires an elementwise multiplication for every input pattern in the batch (F˚D can be precomputed for the entire batch, but F˚D˚Q (p) is different for every input pattern). We therefore rearrange and obtain: Which averages over all input patterns in the batch first, requiring only two elementwise multiplications regardless of batch size. For the weight update rule we have:

Toy examples
To train the G-clusteron to continually increase or decrease its activation on a particular input pattern or dataset, we treated the activation function as a threshold function which always returned the wrong answer such that it would perform the maximal update at each epoch. For patterns with label y = 0, we fixedŷ ¼ 1 such thatŷ À y ¼ 1, for patterns with label y = 1, we fixedŷ ¼ 0 such thatŷ À y ¼ À 1.

Learning MNIST
For the MNIST learning tasks, we used the Tensorflow [76] MNIST dataset [36], which is composed of 60,000 training examples and 10,000 test examples. The examples are split roughly evenly between the 10 digits. As a preprocessing step, we centered the data such that each image had a mean pixel value of 0.
The logistic regression implementation used as our baseline for comparison on MNIST was the default implementation in the Scikit-learn library [77], which uses the 'LBFGS' solver [78] and an L2 regularization penalty. Using the stochastic average gradient ('SAG') [79] solver and no regularization penalty achieves similar results (see repository linked in Code and Data Availability Statement). The original clusteron algorithm was implemented in Python by the authors using the description from [28].
For the one-versus-all experiments, the original clusteron was trained only on positive training examples,~6,000 examples per digit. The G-clusteron and logistic regression classifiers were trained on a balanced dataset with~6000 images of the digit from the positive class and~6000 total images of other digits. The test set was evenly split between positive and negative examples. Hyperparameters and runtimes for the one-versus-all task can be found in Table 2.
For the all-versus-all experiments, each of the 10 clusteron units were trained on all images from their positive class. Each G-clusteron and logistic regression classifier was trained with the entire MNIST dataset. Hyperparameters and runtimes for all-versus-all tasks are shown in Table 3.
For both one-versus-all and all-versus-all (SM and OVR), the hyperparameter r of the Gclusteron was always set to 0.23. For the original clusteron, synapses interacted if they were within 10 synapses of each other. Hyperparameters were hand-tuned to maximize accuracy in all cases.

Logistic regression
Default logistic regression from the Scikit-learn library [77]

PLOS COMPUTATIONAL BIOLOGY
The gradient clusteron

XOR problem
The XOR function is defined as (Table 4): For the G-clusteron to solve the XOR problem, we require: maxðhð0; 0Þ; hð1; 1ÞÞ < minðhð0; 1Þ; hð1; 0ÞÞ ð27Þ where h(x) is the output of the G-clusteron on the input vector x (In our case, the vector [x 1 , x 2 ]) before applying the sigmoidal nonlinearity, as defined in Eq 4.
In the case of two inputs, h(x) can be written as: We therefore have (Table 5): From Eqs 27 and 28 and Table 5, we require: and which can be simplified to Eqs 12 and 13 in Results. (Note that there are two other inequalities that follow from Eqs 27 and 28, namely that w 2 1 > 0 and w 2 2 > 0, which require that neither w 1 nor w 2 are equal to 0, however Eqs 29 and 30 already guarantee this.) To test the different G-clusteron learning rules on the XOR dataset, we created a G-clusteron for each of the three learning paradigms: weight update rule only, location update rule only, and both weight and location update rules together. Each G-clusteron was run on 1000 trials with different parameter initializations for 10,000 epochs per trial.
For each trial, the weight values were chosen randomly from a uniform distribution within [-1,1], and the initial locations were chosen such that F 12 was uniformly distributed within [0,1]. The G-clusterons were trained with a stochastic gradient descent protocol such that each epoch, one out of the four input vectors for the XOR function (Table 4) were presented to the G-clusteron, which would update its parameters according to the relevant update rule(s). For Table 5. G-clusteron outputs for two binary inputs.
Input Output computational efficiency, we stopped the learning protocol on a given trial when the algorithm converged. Convergence was defined as achieving perfect accuracy on a test set of all 4 input vectors of the XOR problem for 10 consecutive epochs. Hyperparameters and the net runtime for all 10,000 trials are shown in Table 6.
For the XOR problem, the r hyperparameter was set to 1. Momentum was not used for the XOR problem. Hyperparameters were hand-tuned to maximize the number of converged trials. Runtimes differed due to different convergence probabilities for the different rules.
All scripts were written in Python and run on an Acer Aspire 5 Notebook laptop with an i7-10510U quad-core processor (1.80 GHz) and 16GB RAM, running on a Windows 10 operating system.