Neuromorphic computing for content-based image retrieval

Neuromorphic computing mimics the neural activity of the brain by emulating spiking neural networks. In numerous machine learning tasks, neuromorphic chips are expected to provide superior solutions in terms of cost and power efficiency. Here, we explore the application of Loihi, a neuromorphic computing chip developed by Intel, to the computer vision task of image retrieval. We evaluated the functionalities and the performance metrics that are critical in content-based visual search and recommender systems using deep-learning embeddings. Our results show that the neuromorphic solution is about 2.5 times more energy-efficient than an ARM Cortex-A72 CPU and 12.5 times more energy-efficient than an NVIDIA T4 GPU for inference with a lightweight convolutional neural network at a batch size of 1, while maintaining the same level of matching accuracy. The study validates the potential of neuromorphic computing in low-power image retrieval as a complementary paradigm to existing von Neumann architectures.


Introduction
Neuromorphic computing is a non-von Neumann computer architecture, aiming to obtain ultra-high-efficiency machines for a diverse set of information processing tasks by mimicking the temporal neural activity of the brain [1][2][3]. In neuromorphic computing, numerous spiking signals carry information among computing units, i.e., artificial neurons, synchronously or asynchronously [4], forming a mesh-like, nonlinear dynamical system [5]. The information can be encoded in the temporal characteristics of the signals, for example, firing rates [6].
In this work, we implement and analyze a low-power computer vision model for visual search engines and recommender systems that evaluate the visual similarity between a query image and a database of product images. In conventional machine learning pipelines, this is often performed by transfer learning using a deep convolutional neural network (CNN) [7] pre-trained on a large-scale dataset, e.g., ImageNet [8,9], and fine-tuned on a domain-specific image dataset, e.g., DeepFashion2 for apparel [10]. The embeddings of the images are calculated by inferring the activation values of the last few layers of the neural network as visual features [11][12][13][14][15][16][17]. The distances between the embeddings of the query image and the database images are used to find the nearest neighbors of the query image in the embedding space, identifying the most visually similar items [18]. Here, we evaluate the same visual search and recommendation technique using embeddings generated by neuromorphic neural networks. We train spiking convolutional neural networks on a clothing-specific image classification dataset, Fashion-MNIST [19]. The trained spiking neural networks are then used to extract features from the product images and the query images. The embeddings are based on the patterns of the temporal spikes and, as with conventional convolutional neural networks, are used to find the nearest visual neighbors of the query image among the product images. Our results show considerable power efficiency in finding the most visually similar products using neuromorphic chips, particularly Loihi [20].

Methods
To explore applications of neuromorphic computing in image retrieval, we built and deployed a spiking neural network (SNN) on Intel's Loihi neuromorphic chip. Our image search pipeline is shown in Fig 1. First, we convert a trained artificial neural network (ANN) into a spiking neural network (SNN) and deploy it on the Loihi chip. We then feed training and test images into the SNN and probe the neurons of the layer before the output layer to get the image embeddings. Finally, nearest neighbor search is run on CPU cores to find the best matches in the training dataset for each test image.
In the first step, we train different ANNs by minimizing the cross-entropy loss function for the classification of the Fashion-MNIST dataset via backpropagation. Then, we convert the ANNs into SNNs, compare the classification test accuracies of the SNNs, and select the most accurate SNN model. As suggested by Hunsberger and Eliasmith [21] and Sengupta et al. [22], we reduce the feature map size using average pooling rather than max pooling, and employ dropout for regularization [23].
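A rationale for the pooling choice: averaging spike counts is straightforward in the rate-coded spiking domain, whereas selecting a maximum over spike trains is not. The following is a minimal numpy sketch of 2x2 average pooling for illustration only, not our training code:

```python
import numpy as np

def avg_pool2d(x, k=2):
    """k x k average pooling over an H x W feature map (stride = k).

    Average pooling is SNN-friendly because the mean of spike rates
    approximates the mean of the underlying activations, while a max
    over spike trains has no equally simple rate-coded counterpart.
    """
    h, w = x.shape
    # Crop to a multiple of the pool size, then average each k x k tile.
    x = x[: h - h % k, : w - w % k]
    return x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
pooled = avg_pool2d(fmap)
print(pooled)  # each entry is the mean of one 2x2 tile
```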
Note that there are two constraints on the neural network architectures that can be deployed on Loihi chips. The first is that the synaptic memory, which stores the synaptic weights per neuromorphic core, is 128 KB; this limits the number of parameters associated with the neurons in a core. The second is the maximum fan-in of 4,096 per neuromorphic core, which means the input size of a neuron cannot exceed 4,096 [24]. These two constraints force the neural networks deployed on the Loihi chip to have relatively slim layers rather than wide ones.
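These limits can be made concrete with a small feasibility check. The per-synapse byte cost below is an assumption for illustration; the actual memory layout produced by the Loihi toolchain is more involved:

```python
# Illustrative check against the two per-core limits named above:
# 128 KB of synaptic memory and a fan-in of 4,096 inputs per neuron.
SYN_MEM_BYTES = 128 * 1024
MAX_FAN_IN = 4096
BYTES_PER_SYNAPSE = 2  # assumed storage for one quantized weight

def fits_on_core(num_synapses, fan_in):
    """Rough test of whether a layer slice could fit on one core."""
    return (num_synapses * BYTES_PER_SYNAPSE <= SYN_MEM_BYTES
            and fan_in <= MAX_FAN_IN)

# A 3x3 conv over 32 input channels: fan-in = 3*3*32 = 288 -> fits.
print(fits_on_core(num_synapses=288 * 64, fan_in=288))    # True
# A dense layer fed by a 28*28*8 = 6,272-unit feature map exceeds
# the fan-in limit, so such wide layers must be avoided or split.
print(fits_on_core(num_synapses=6272 * 10, fan_in=6272))  # False
```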
Given an ANN, the conversion is done by building an SNN with the same architecture as the ANN but changing the neuron type to a leaky integrate-and-fire (LIF) neuron with soft reset, a variant of the Residual Membrane Potential (RMP) neuron proposed by Han et al. [25]. Then, the floating-point ANN parameters are scaled to integers and transplanted into the SNN, as the Loihi chip executes operations on integer numbers. The spiking threshold of each LIF neuron is determined at the same time as the parameter scaling, using a method provided by the Loihi NxSDK [26]. The parameter scaling and threshold calculation are shown in Algorithm 1 (for more details, see the work on mapping spiking neural networks onto a manycore neuromorphic architecture by Lin et al. [26]). Similar to the spike-norm algorithm proposed by Sengupta et al. [22], a set of images is fed into the network and the threshold at each layer is set to the maximum activation at that layer. However, the Loihi toolchain uses a rate-based simulation of the SNN, instead of the actual SNN forward pass, to calculate the spiking thresholds.
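The soft-reset behaviour can be shown with a toy simulation. The code below is a simplified floating-point illustration, not Loihi's fixed-point neuron model: on a spike, the threshold is subtracted from the membrane potential rather than resetting it to zero, so the residual potential carries over and the spike rate tracks the input closely.

```python
def lif_soft_reset(dvdt, threshold, steps, leak=0.0):
    """Integrate a constant input and count spikes (soft-reset LIF).

    On a spike, the threshold is subtracted from the membrane
    potential instead of resetting it to zero -- the RMP-style
    behaviour that preserves the residual potential. Illustrative
    only; Loihi's exact fixed-point dynamics differ.
    """
    v, spikes = 0.0, 0
    for _ in range(steps):
        v = (1.0 - leak) * v + dvdt
        if v >= threshold:
            v -= threshold      # soft reset keeps the remainder
            spikes += 1
    return spikes, v

# With no leak, the spike rate approaches dvdt / threshold.
spikes, residual = lif_soft_reset(dvdt=3.0, threshold=10.0, steps=100)
print(spikes / 100)  # 0.3, matching 3.0 / 10.0
```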
In Algorithm 1, there are two important variables. One is param_scale, the factor by which we scale the ANN parameters to integers to obtain the SNN parameters. The other is threshold, the spiking threshold that governs the LIF neuron spiking activity.
Algorithm 1 requires a batch of input images to tune the spiking thresholds, represented as an N × H × W × C matrix with floating-point elements ranging between zero and one; N is the number of images and H, W, C are the image height, width, and number of channels. Line 1 sets the W_MAX and b_MAX variables, and line 2 sets the slope, param_percentile, and activation_percentile variables. If we use 9 bits to represent the SNN weights on the Loihi chip, then the maximum weight W_MAX is 2^(9-1) - 1 = 255. We set the maximum bias b_MAX in the same way. The slope variable is the ratio between the SNN neuron output and the ANN neuron output at the current layer and is initialized to one.
In line 3, we get snn_layer and its corresponding ann_layer. In lines 4 to 6, if snn_layer is the input layer, which encodes the input images into spike time series, we set param_scale to W_MAX and multiply the input by param_scale to get dvdt, the membrane potential increment rate of the neurons. Note that dvdt here still has the shape N × H × W × C, as the input layer only multiplies the input by a scalar.
In lines 7 to 16, if snn_layer is not the input layer, we scale the ANN parameters and set the SNN parameters. In line 8, we get the ANN weight and bias from ann_layer. In line 9, we multiply bias by slope to account for the scaling of the previous layer. In line 10, we set weight_norm to a single value by taking a percentile of abs(weight), and likewise set bias_norm in line 11. In line 12, we set weight_ratio to the ratio between W_MAX and weight_norm, which tells us how far we can scale up weight without exceeding W_MAX, and we calculate bias_ratio in the same way. In line 13, we set param_scale to the smaller of weight_ratio and bias_ratio. In lines 14 and 15, we use param_scale to scale the ANN weight and bias, quantize them to integers, and set them as the parameters of snn_layer. In line 16, we calculate dvdt by simulating the ANN neuron activation; the shape of dvdt becomes N × FH × FW × FC, where FH, FW, and FC are the feature map height, width, and number of channels.
In lines 18 and 19, we set the threshold of the neurons at snn_layer to the quantized percentile value of dvdt, so there is a single threshold value for the layer. In line 20, we calculate spikerate, an estimate of the spiking probability of the neurons, as the output of snn_layer; it has the same shape as dvdt. In line 21, we update slope by multiplying it by the ratio of param_scale to threshold.
Now, having an SNN at hand, we start feeding images into the network. For each image, we probe the neurons of the layer before the output layer at the last execution time step to read the neuron membrane potentials. The membrane potential vector is the embedding of the input image.
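The walkthrough of Algorithm 1 can be condensed into a Python sketch for the dense-layer case. The variable names follow the text, but the percentile settings, the ReLU assumption, and the rate-based simulation are simplifications of the NxSDK routine, not its actual implementation:

```python
import numpy as np

W_MAX = 2 ** (9 - 1) - 1   # 9-bit signed weights -> 255
B_MAX = W_MAX              # assumed: same budget for biases
PARAM_PCT = 100            # percentile for parameter norms (100 = max)
ACT_PCT = 100              # percentile for activations (100 = max)

def convert(ann_layers, images):
    """Rate-based parameter scaling and threshold search (sketch).

    ann_layers: list of (weight, bias) pairs with float parameters.
    images: N x D inputs with values in [0, 1].
    Returns a list of (int_weight, int_bias, threshold) per layer.
    """
    snn_params = []
    slope = 1.0
    dvdt = images * W_MAX                        # input layer: scalar scaling
    for weight, bias in ann_layers:
        bias = bias * slope                      # carry the previous scaling
        weight_norm = np.percentile(np.abs(weight), PARAM_PCT)
        bias_norm = max(np.percentile(np.abs(bias), PARAM_PCT), 1e-12)
        param_scale = min(W_MAX / weight_norm, B_MAX / bias_norm)
        w_int = np.round(weight * param_scale).astype(int)
        b_int = np.round(bias * param_scale).astype(int)
        dvdt = np.maximum(dvdt @ w_int + b_int, 0)   # simulated ReLU activation
        threshold = max(int(np.percentile(dvdt, ACT_PCT)), 1)
        snn_params.append((w_int, b_int, threshold))
        slope *= param_scale / threshold
        dvdt = dvdt / threshold                  # spike-rate estimate for next layer
    return snn_params

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), rng.normal(size=3))]
w_int, b_int, thr = convert(layers, rng.random((8, 4)))[0]
print(np.abs(w_int).max() <= W_MAX, thr >= 1)  # True True
```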
Our SNN takes images in the training and test sets as inputs and generates their embeddings. We see the training image embeddings as a corpus of image features. For each test image, we apply nearest neighbor search using cosine similarity to find images in the corpus that are the closest to the test image in the embedding space.
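The retrieval step above can be sketched with plain numpy; the embeddings and k below are toy values for illustration:

```python
import numpy as np

def top_k_cosine(query, corpus, k=3):
    """Indices of the k corpus embeddings most similar to the query.

    Rows are L2-normalized so the dot product equals cosine
    similarity; argsort on the negated scores ranks best-first.
    """
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    sims = corpus_n @ query_n
    return np.argsort(-sims)[:k]

# Toy corpus: rows 0 and 2 point in nearly the query's direction.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
query = np.array([1.0, 0.05])
ranked = top_k_cosine(query, corpus)
print(ranked)  # -> [0 2 1]
```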

Results
We implemented and tested 3-layer, 4-layer, and 5-layer SNNs for the classification of the Fashion-MNIST dataset. We selected Fashion-MNIST as our evaluation dataset because it is suitable for benchmarking small-footprint computer vision models. Note that we use this dataset without data augmentation in our experiments. The architectures analyzed are shown in Table 1.
In the architecture column, the numbers of convolutional kernels (output channels) in each layer are concatenated by hyphens. Note that the last architecture in Table 1 was not deployable on the Loihi chip because the maximum fan-in was exceeded. The fourth architecture in Table 1 scores the best classification test accuracy when converted to an SNN; this architecture is shown in Fig 2. It consists of three layers: two convolutional layers and one dense layer. We use this SNN architecture to conduct the rest of the experiments. The network (Fig 2) is relatively compact, so the number of cores it occupies is small compared with the number of cores available on a Loihi chip.
An SNN has an intrinsic execution time parameter, the number of time steps, which defines how many discrete time slots the network is given to process information during inference. Intuitively, the more time steps we give our SNN to process the information, the higher the performance, but the runtime also grows. This tradeoff between performance and number of time steps is shown in Fig 4. The performance metrics rise sharply between 4 and 16 time steps and then plateau, showing that 16 time steps are enough to reach a reasonable level of performance. The error bars indicate negligible variation among five independently trained networks, demonstrating the reproducibility of our results.
The relation between the runtime and the number of time steps is shown in Fig 5. As we gradually increase the number of time steps, the runtime scales up almost linearly. However, the runtime is nearly independent of the number of time steps for small numbers, e.g., 4 or 8 time steps, because the overhead takes up the majority of the runtime.
The performance comparison between the selected SNN and its ANN counterpart is shown in Table 2. Note that the number in the parentheses next to the model type is the number of time steps used per example during SNN inference. The ANN and SNN have the same network architecture but different neuron types and parameters. We can see that the SNN using 128 time steps has accuracies very close to the ANN's, indicating that the SNN is capable of achieving comparable performance with its ANN counterpart. Using fewer time steps, e.g., 16, our SNN suffers a classification accuracy degradation, but the gap is smaller than 5%. Nevertheless, the top-1 and top-3 accuracies of the SNN with 16 time steps are still very close to those of the ANN. This means that the SNN with 16 time steps per inference generates reasonable embeddings, suitable for the image retrieval task.
Several examples of the SNN image retrieval are shown in Fig 6. The first column shows query images, each from a class in the dataset. The next three columns present three randomly selected images from the corpus with the same class label as the query images. The next three columns show the top-three images retrieved from the corpus using the ANN-generated image embeddings. The last three columns show the top-three images retrieved from the corpus using the SNN-generated image embeddings. The retrieval results, whether using the ANN or the SNN, are visually much closer to the query images than the randomly selected images from the corpus. Again, our SNN implemented on the Loihi chip demonstrates performance comparable to the ANN's.
The neural network inference latency (forward-pass runtime per example) comparison between the selected SNN and its ANN counterpart is shown in Table 3. Note that Loihi could not support batch sizes larger than one at the time of the experiments. We can see that when the batch size equals one, the SNN on Loihi using 16 time steps has approximately 13.8x/11.3x longer runtime than the ANNs on the Xeon/i7 CPUs, 3.8x longer than the ANN on the ARM CPU, and 2.3x/2.5x longer than the ANNs on the V100/T4 GPUs. The difference is even more dramatic if we use larger batch sizes for inference on the CPUs or GPUs. Clearly, the SNNs on the Loihi chip do not have an advantage in terms of inference latency. The multiple time steps an SNN requires to converge to its result lead to long execution times. Reducing this runtime is a direction in which we hope neuromorphic hardware will improve.
The comparison of the average power consumption between the SNNs and the ANNs is shown in Table 4. With the batch size set to one, the SNN with 16 time steps uses 217.0x/24.0x less power than the ANNs on the Xeon/i7 CPUs, 9.3x less than the ANN on the ARM CPU, and 40.8x/31.3x less than the ANNs on the V100/T4 GPUs. This is where neuromorphic hardware starts to shine, as it consumes far less power than conventional hardware. By appropriately exploiting the temporal sparsity of SNNs, we believe neuromorphic hardware can reduce its power consumption even further. Another observation from Table 4 is that the static (idle) power dominates the power consumption of the Loihi chip.
We measured the total energy used per inference (forward pass), reported in Table 5. These results can also be estimated by combining the results of Tables 3 and 4. As summarized in Table 5, with the batch size set to one, the energy consumption of the SNN with 16 time steps is 15.6x/3.2x less than the ANNs on the Xeon/i7 CPUs, 2.5x less than the ANN on the ARM CPU, and 17.5x/12.5x less than the ANNs on the V100/T4 GPUs per inference. This demonstrates the benefits of neuromorphic hardware in low-energy-budget machine learning applications, particularly lightweight image search engines and visual recommender systems. It is apparent that when large batch sizes are used, CPUs and GPUs consume less energy per example. However, there are many use cases where inference is executed in small batches, and they are the targets for neuromorphic hardware at the current stage. Another observation is that the energy consumption does not scale linearly for small numbers of time steps. For example, the energy consumption per inference for 128 time steps is only 4.0 times larger than for 16 time steps (Table 5). This is due to the constant portion of the energy needed for running each inference, which does not change with the number of time steps.
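The sublinear scaling admits a simple explanation: per-inference energy is well modeled as a constant overhead plus a per-time-step cost. In the sketch below, the energy unit is arbitrary and the constant term is hypothetical, chosen only to reproduce the reported 4.0x ratio rather than measured; under this model, the fixed overhead is worth roughly 21 time steps of computation.

```python
# Affine model of per-inference energy: constant overhead + per-step cost.
#   E(T) = e_const + e_step * T
# The reported ratio E(128)/E(16) = 4.0 pins down the constant term:
#   e_const + 128*e_step = 4*(e_const + 16*e_step)
#   => 3*e_const = 64*e_step  =>  e_const = (64/3)*e_step
e_step = 1.0                     # arbitrary unit; only ratios matter
e_const = (64.0 / 3.0) * e_step  # hypothetical, fit to the 4.0x ratio

def energy(T):
    """Modeled energy per inference for T time steps."""
    return e_const + e_step * T

print(round(energy(128) / energy(16), 2))  # 4.0
```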
We use energy probes provided by Loihi NxSDK to perform the power and energy measurements of the Loihi chips. For the CPUs, we use Intelligent Platform Management Interface (IPMI) and the system profiler information to measure the power consumption and then, we integrate the power readings over time to get the energy consumption. For the GPUs, we use

Discussion
Our results confirm the energy efficiency of the Loihi neuromorphic chip. However, we noticed that the inference latency becomes impractically large when a network of Loihi chips is used. We surmise this is due to the interchip communication latencies. Nowadays, many applications use deep neural network models with millions of parameters and billions of intermediate activations. Neuromorphic chips need to scale up, possibly by increasing the number of neuromorphic cores and the on-chip memory, to support these applications in the future. The energy efficiency obtained by the Loihi chip in our experiments is owing to two factors. First, the model parameters are stored in the local memory of the neuromorphic cores, minimizing the energy cost of data transfer to shared memory. Second, the neuromorphic cores are optimized for specialized functionalities; this efficiency is very similar to that of other specialized accelerators, e.g., graphics processing units (GPUs). Typical ANN-to-SNN conversion methods, including Algorithm 1 used here, do not capitalize on the temporal sparsity possible on neuromorphic processors, as in the brain. Designing better training and conversion algorithms that employ temporally sparse signals for neuromorphic machine learning is therefore a promising future direction. Finally, it is worth emphasizing that, to implement the complete image retrieval pipeline, we performed the nearest neighbor search on the host CPU cores. It is possible to carry out an approximate k-nearest neighbors algorithm on neuromorphic chips [28], but we believe that CPU cores will remain necessary for some stages of a machine learning pipeline. Thus, the role of neuromorphic computing is to accelerate specialized tasks and supplement general-purpose processors.

Conclusion
We studied the application of the Loihi chip, a neuromorphic computing hardware developed by Intel, to image retrieval. Our results show that generating deep-learning embeddings with a spiking version of a lightweight convolutional neural network is about 2.5 times more energy-efficient than on a CPU and 12.5 times more energy-efficient than on a GPU. We confirm the long-term potential of neuromorphic computing in machine learning, not as a replacement for the predominant von Neumann architecture, but as an accelerated coprocessor.