Efficient Computation of k-Nearest Neighbour Graphs for Large High-Dimensional Data Sets on GPU Clusters

This paper presents an implementation of the brute-force exact k-Nearest Neighbor Graph (k-NNG) construction for ultra-large high-dimensional data cloud. The proposed method uses Graphics Processing Units (GPUs) and is scalable with multi-levels of parallelism (between nodes of a cluster, between different GPUs on a single node, and within a GPU). The method is applicable to homogeneous computing clusters with a varying number of nodes and GPUs per node. We achieve a 6-fold speedup in data processing as compared with an optimized method running on a cluster of CPUs and bring a hitherto impossible -NNG generation for a dataset of twenty million images with 15 k dimensionality into the realm of practical possibility.


Introduction
K-Nearest neighbor graphs have a variety of applications in bioinformatics [1,2], data mining [3], machine learning [4,5], manifold learning [6], clustering analysis [7], and pattern recognition [8]. The main reason behind the popularity of neighborhood graphs lies in their ability to extract underlying information about structure and governing dimensions of data clouds. The k-NNG problem is similar to the k-NN problem and a k-NNG can be built by repeatedly applying the k-NN query for every object in the input data once a convenient search indexing data structure has been built. Such search data structures include kd-trees [9], BBD-trees [10], random-projection trees (rp-trees) [11], and hashing based on locally sensitive hash [12]. These method focus on optimizing the k-NN search, i.e., finding k-NNs for a set of query points w.r.t. a set of points with which the search data structure is built, ignoring the fact that every query point is also a data point. These methods are generally less efficient compared with one that focuses on k-NNG construction directly.
The problem of constructing an exact k-NNGs has been investigated extensively to avoid the O n 2 À Á complexity of the brute force method. An O(n log d{1 n) was presented in [13]. [14] presented a O(c d log n) algorithm and [15] presented a worst case O((c 0 d) d n log n). In [16] two practical algorithms for k-NNGs based on recursive partitioning and pivoting have empirically shown time complexities 0:685e 0:23d n 1:48 and 0:858e 0:11d n 2:15 , respectively. All these algorithms have a time complexity that is exponential in the dimenision d of the data. The same exponential dependence on dimension of time complexity is seen in methods based on space filling curves such as the Hilbert's curve [17] and Morton's curve [18]. There are several approximate methods that can handle moderately high dimensional data, typically with a trade-off between speed and accuracy. One set of techniques typically is based on a hybrid of spatial subdivision up to a threshold granularity and small scale brute force evaluation or heuristics for refinement [19][20][21]. These methods rapidly lose accuracy or speed or both when the dimension of the data exceeds 10 3 .
The curse of dimensionality leads to a belief supported by many researchers that the most efficient method for finding k-NNGs for high-dimensional data clouds is in fact the brute force method [22]. The brute force algorithm breaks into two parts: distance calculation and comparison. In the distance calculation part, all distances between all points for graph construction are computed. This results in a M|N distance matrix, where M is the number of query points and N the number of data-base points. Next, each row of the matrix is sorted to get the nearest k neighbors to each of the query points. Recently, there have been several methods that accelerate brute force k-NN and k-NNGs on graphics processing units. They primarily differ in the manner of selecting k smallest elements in every row in the distance matrix. In [23,24] each row of the distance matrix is processed by one thread. Each thread uses an modified insertion sort algorithm to select the k nearest elements. The number of steps that each thread takes to process a given stage is not pre-determined. This can lead to branch divergence and loss of efficiency. Also, since the insertion sort data structure is stored in global memory, this can cause a significant loss in performance due to un-coalesced memory writes. In [25], every row of the distance matrix is processed by a thread block. A heap (one per row) maintains the nearest k smallest distances. Each thread in the thread block strides through the row reading and storing the element in a buffer if it is smaller than the largest element in the heap. When the buffer fills up, all threads synchronize and then push their elements on to the heap in a serialized manner. The last stage of this algorithm can be extremely expensive especially for large k. In [26], a thread block is used to process a single row. Each thread in the thread block strides through the array storing the k smallest elements in a local heap maintained in global memory. Next, all the thread heaps are merged into 32 heaps in shared memory threads in a single thread warp. Finally a single thread merges the 32 heaps in shared memory to find the k nearest neighbors. Storing heaps in global memory has the same disadvantages of thread divergence and uncoalesced memory write as in [25]. Finally, for kw192 it may not even be possible to execute the method in share memory even on the latest GPUs. All these methods dramatically lose performance for large k. In [27] a radix sort based approach is used to select the k nearest neigbhbors. They claim that for large data sets, especially for a large number of queries, the selection process dominates. A simple complexity analysis suggests that this is quite impossible (O(dmn) for distance calculation vs O(mn log n) for sorting where d&120 is the data dimension, m&2000 is the number of query points and n&30,000 is the number of data points). A closer examination shows that they process each row of the distance matrix in a separate sort. For n that fit into GPU memory, this process underuses GPU resources.
Our target application is Manifold Embedding to recover structure and conformations from a large data set of images with high noise [28]. The underlying assumption behind manifold embedding states that a cloud of high dimensional data makes a low dimensional hyper-surface in high dimensional space. The generated hyper surface, called a manifold, contains the information about the individual objects, 3D structure of images, and the system that generated the data. The main computational part of manifold embedding is the construction of a k-NNG for the image data set that can be normalized to a diffusion map in subsequent stages of algorithm. There are upwards of 10 7 images with each image having 15 k pixels. We require kw200 for accurate results. The nature of the data, coupled with the accuracy requirements, makes the brute-force algorithm the only viable alternative. In this paper we describe our parallelized brute-force k-NNG algorithm on a cluster of graphics processing units. We describe three levels of parallel distribution of tasks and data partitioning: between nodes, between multiple GPUs within a node, and finally within the GPU. We have also developed a novel algorithm based on sorting for the comparison and selection step of the brute force k-NNG routine. Our benchmarks show that our routine outperforms the best GPU-based comparison and selection methods. Overall, we achieve a 6| gain in performance over a distributed solution on CPU clusters using the fastest libraries that are available. This bring a hitherto impossible k-NNG generation for a dataset of twenty million images with 15k dimensionality into the realm of practical possibility.

Methods
Given a set of vectors V~fv 1 ,v 2 ::: Proximity is defined using a metric d. In this work, we are primarily concerned with the Euclidian distance metric given by: The square of the distance metric can be written as: ,v i2 ::::::::::::::::::v iD g T Defining matrices A N|D , B N|N be given by :: :: :: Defining the matrix of squared distances S as ::: :: :: :: We can now write the following equation Finding the set of k nearest neighbors for vector v i involves sorting the i th row of S and picking the column indices corresponding to the k smallest distances. In our application of interest, N~10 6 to 10 7 . Therefore, computing and storing S will require 1.8-18 terabytes of memory. Therefore, our approach is to compute k nearest neighbors in parts. As illustrated in Figure 1., the computation of the squared distance matrix S is split into P|P partitions. Consequently, each portion S (I,J) is computed as: where A I ,B I I~1,2:::P are partitions of A,B respectively. Other metrics such as Cosine and Pearson distances can also be incorporated in the k-NNG algorithm. The Cosine distance is given by: The Pearson distance is given by: where m is the average vector of the dataset. The Pearson distance is essentially a centered Cosine distance and therefore an additional pre-processing step for centering is necessary. Clearly, these distance metrics can be formulated in terms of matrix multiplication (v T i v j ) and operations using vector norms. Therefore, the same decomposition and task partitioning applicable in Euclidian case can be used.

Distribution of Data and Tasks between Computing Nodes
Computing clusters typically have several nodes that are connected by high-speed interconnects. One of the nodes is designated as the head node. The head node typically co-ordinates the tasks between different worker nodes. Each node has its own hard disk. In addition, there is a large shared disk accessible through parallel (I/O) by all nodes that typically hold input data and results. The worker nodes, with smaller local disk space, copy input data as and when required from the shared disk. Table 1 illustrates the steps involved in computing the k-NNG. For load balancing, we distribute computing of the partitions of S in a block cyclic manner. Since S is symmetric, we only need to compute the upper triangular portion, i.e., all sub-matrices S (I,J) j J §I. Figure 2 illustrates the task partitioning between  different nodes. For example, in block row I, node q will process all sub-matrices S (I,J) JjJ%Q~q,J §I. In the beginning, the input data matrix A is split into P parts. Each node q then reads portion A J where q~J%Q into its local drive. The calculation of the distance sub-matrix S (I,J) and the calculation of associated k-NNs is handled by node q, where q~J%Q. The computation proceeds in a block-row by block-row manner (lines 5-20 in Table  1). Any sub-matrix S (I,J) requires inputs A I ,A J ,B I ,B J . Of these A I ,B I are used by all nodes that are processing the I th block row. Each node q also requires B J ,A J jJ : J%Q~q. For each block row I, the partition A I is read in parts in parallel by all nodes from the shared drive. All nodes then share the parts that they each read through an 'all_gather' operation [29] to build a local copy of A I in memory. Each node q then reads A J jJ : J%Q~q from its local drive. Since the local disk read operation is slower, we use a memory buffer and overlap computation and disk read operation to completely hide latency. We do not actually build the matrix B. Instead, the vectorB B~fEv 1 E 2 ,Ev 2 E 2 ,:::::::::::Ev N E 2 g T is computed in advance and stored in the RAM of each node. Even for N~10 7 the size ofB B (,10 MB) is quite small compared to the RAM in each node (48 GB). Each node q computes the vector norms for all vectors in the partitions A J jJ%Q~q resident on its disk space Figure 3. Processing global k-NNs. In this figure node q~4 is responsible for calculating the global k-NNs of all vectors that are in A 4 . This is done by computing and merging local row k-NNs of S (4,1) ,S (4,2) ,:::S (4,p) ,::::S (4,P) . Note that the local row k-NNs w.r.t. S (4,J) J~1,2,3 have already been calculated when node 4 calculated local block-column k-NNs w.r.t. S I,4 I~1,2,3. The merged results are stored in a heap. The k-NNs w.r.t S (4,J) j Jw4 are cooperatively computed by all nodes. For example, node q~3 successively computes and merges k-NNs w.r.t. S (4,Qz3) ,S (4,2Qz3) ,:::,S (4,P{Qz3) . It then transmits the results to node q~4, which receives results from other nodes and does a global merge. doi:10.1371/journal.pone.0074113.g003 locally. Using the 'all_gather' operation, the nodes can then share data among themselves and build a complete local copy ofB B.
Once the partition S (I,J) is computed, the local block k-NNs with respect to both the rows (k R -NNs) and columns (k C -NNs) are computed (lines 9-14 in Table 1). Since the matrix S is symmetric, the local block k C -NNs w.r.t. partition S (I,J) are identical to the local block k R -NNs w.r.t. partition S (J,I) . Therefore, each node q maintains one heap per block column JjJ%Q~q of S that it processes. Each of these heaps contains the merged local k C -NNs w.r.t. partitions S (I,J) jI~1,2,::J, J%Q~q. For example, as shown in Figure 3, node 4 maintains one heap for each of the columns 4,Qz4,:::(P{Qz4). The heap for column 4 will contain the merged block k C -NNs for S (1,4) ,S (2,4) ,S (3,4) ,S (4,4) . The heap for column Qz4 will contain the merged local k C -NNs for partitions S (1,Qz4) ,S (2,Qz4) ,:::::,S (Qz4,Qz4) .
The node q is used to compute the global k-NNs for all vectors in partition A I jI%Q~q. The global k-NNs for all the vectors in A I are generated by merging the local block k R -NNs w.r.t. partitions S (I,J) J~1,2,::::,P. However, the merged results of the local block k R -NNs w.r.t all partitions S (I,J) jJ~1,2,:::,I are already available in node q from the local block k C -NNs computed when processing rows 1,2:::I. The block k R -NNs w.r.t. all partitions S (I,J) J~Iz1,Iz2,:::::,P are cooperatively computed by different nodes. Each node maintains a heap to merge the results of finding the local k R -NNs of the partitions that it processes (line 15 in Table 1). At the end, the merged results are communicated to the node processing the global k-NNs for the block row I for merging at the global level (line 18 in Table 1). For example, as illustrated in Figure 3, for I~4, node 3 will compute block k R -NNs w.r.t. all partitions S (4,Qz3) ,S (4,2Qz3) ,::S (4,Q{Pz3) . The results will be merged and stored in a heap. At the end of the computation, the results in the heap will be communicated to node 4. Node 4 will have computed all k R -NNs w.r.t. S (4,1) ,S (4,2) ,S (4,3) and merged them while processing block rows I~1,2,3. These results and the results communicated to it by all other nodes processing parts of block row I~4 will be merged by node 4 to compute global k-NNs for all vectors in A 4 .
Data partitioning parameters. Our tests reveal that compute time dominates communication time. For a variety of  where, P is the smallest multiple of Q that satisfies the above relation, R in the size of the partition of A J within the node and is dependent on GPU RAM size. Here the first term in the right hand side is the size of A I , the second term is the size of a buffered portion of A T J (one buffer portion is for reading from the disk while the second is for feeding the GPUs. These buffers are swapped at the end of computation.), the third term is the size of the buffer to store Q row k-NNs (this data is used to compute the global k-NNs. The data structure is a tuple consisting of a float/double to represent the distance and a int for index.), the fourth term accounts for the block column k-NN storage for all columns being processed by a node, the fifth term represents the memory required for partial column k-NN generated by the M GPUs in each node, and the sixth term represents the memory requirements for the operating system. R is given by the smallest integer value such that the following relationship holds:  the first term represents the storage for a portion of A I , the second term represents the storage for a portion of A J and the third term represents the storage for partial R row k-NNs which will be merged by the GPU.

Distribution of Tasks and Data Within Nodes
We assume that each node has M GPUs. In our current setup, M~2. As mentioned previously, each node is responsible for computing a partition S (I,J) of the squared distance matrix. A I s In this test our input data has the dimension d~4096 and the number of input objects/ vectors n~16384. Figure 7(a) shows the performance vs. [23]. Figure 7(b) shows the performance vs. [24]. doi:10.1371/journal.pone.0074113.g008 are read from the head node through parallel I/O. A J s are read from the local disk. Within the node, A I is divided into M equal partitions. A J is divided into R partitions.
Task distribution within each node is described in Figure 4. Each GPU m is responsible for computing the sub-matrices S I,J (m,r) and the associated row and column k-NNs. This process requires portions of the input vector data matrices A I (m) and A J (r). Once again the computation proceeds in a block row manner. Therefore the portion A J (r) is read by all GPUs while as the portion A I (m) is read by the GPU m. When the computation of S (I,J) (1,r),S (I,J) (2,r), ::::,S (I,J) (m,r) is being done by the M GPUs, the node simultaneously reads the file partition A J (rz1) into the RAM. The node also has the norm vectorB B in its RAM. B B is partitioned in two ways: one has M partitions with each of these partitions going to the M different GPUs and the other has R partitions, with each partition being read sequentially by all GPUs. When the computation of S (I,J) (m,r) is complete, the column k-NNs as well as the row k-NNs are computed using sorting. The column k-NNs are written back to CPU memory while the row k-NNs are kept on global memory to be merged.  Figure 7(a) shows the performance vs. [23]. Figure 7(b) shows the performance vs. [24]. doi:10.1371/journal.pone.0074113.g009 Note that S (I,J) (m,r) r~1,2::,R are being processed by the same GPU m, therefore, it makes sense to merge row k-NNs in the GPU memory without write back to CPU RAM. However, S (I,J) (m,r) m~1,2,::M are being processed by different GPUs, therefore, the column k-NNs are generated by different GPUs. The accumulated column k-NNs are then merged by one GPU per column. The final result of this operation is the local row and column k-NNs for the partition S (I,J) .
We use OpenMP multi-threading [30]. Each node runs Mz1 CPU threads. M threads control the M GPUs while one thread is in charge of I/O from the local disk as well as the shared disk.

Distribution of Tasks and Data within GPUs
The tasks that are accomplished within each GPU include the following:   Figure 7(a) shows the performance vs. [23]. Figure 7(b) shows the performance vs. [24]. As n increases, the total performance gain asymptotically approaches that of matrix multiplication because for large n, this computation dominates. doi:10.1371/journal.pone.0074113.g010 Each of these tasks is coded as kernels. In our clusters we have Tesla C2050 compute GPUs from NVIDIA. We use the CUDA programming environment [31] to code our kernels.
Finding vector norms. Finding vectorB B of input data norms is done once in the beginning at the same time that the input files A J are communicated to each node. For example, if node q will receive all files A J jq~J%Q. When the file is received, it is partitioned into M partitions, one partition per GPU. Although there is a library function to calculate vector norms in CUBLAS [32], using it will be inefficient since the vectors are relatively large in number (N = 10 6 to 10 7 ) with much smaller dimension D&15000. Finding the norm of vectors one at a time will not fully use GPU resources. We have written our own kernel that overcomes underuse of GPU resources by computing multiple vector norms in one kernel invocation.This process is illustrated in Figure 5. Every vector in partition A J (m) is processed by one thread block. All threads in the thread block cooperatively compute the vector norm. Each thread strides through the vector components adding the square of the entries with a stride length equal to the number of threads in the thread block. Finally, all threads in the thread block write to a single global memory location using atomic-add.
Dense matrix multiplication. For dense matrix multiplication A I (m)A T J (m), we use the optimized library function from CUBLAS [?]. The result of the dense matrix multiplicatioñ S S (I,J) (m,r) is stored in global memory. For summation, we have written a special kernel to take advantage of the particular structure to minimize memory transactions. Figure 6 illustrates the summation kernel. As mentioned previously, we do not build the matrices B I (m) and B T J (r) but simply use the portions of the vector of input data normsB B I (m) andB B J (r). We process S (I,J) (m,r) one row at a time. Every thread reads one element ofB B J (r) into its register. Next, the thread block reads a section ofB B I (m) into shared memory. Finally, all threads simultaneously update the entire row of S (I,J) (m,r). Each thread reads the corresponding element of row m ofS S (I,J) (m,r), adds to it the corresponding element ofB B J (r) that is in its register, and an elementB B I (m) that is in shared memory. In reading an elementB B I (m) by all threads in the thread block, we use the broadcast mechanism in GPUs.
Batch index sort for k-NN. We could obviously sort each column and each row separately. However, this is not efficient because the resources on the GPU are not fully used. Also, for sorting according to columns, we will need to execute an expensive operation to rearrange the data in the column major format, Instead, we use a process we called Batch Index Sorting. Note that in GPU RAM, S (I,J) (m,r) is laid out as a linear array in a rowmajor format. Each element of S (I,J) (m,r) is also then associated with its row index and column index. We use radix sort with the elements of S (I,J) (m,r) as key. In the next two steps, we execute an order preserving sort of the result of the previous step with the column index as the key. Separately, we also execute an order preserving sort with the result of the first sort with the row index as the key. These two sorting operations generate the nearest neighbors for each column and row. We then execute a separate kernel to extract the k-NNs for each row and column w.r.t. S (I,J) (m,r). Figure 7 illustrates and example with 3 data vectors and 3 query vectors.

Results
We tested our implementation against that of [24] and [23]. All implementations were compiled using C++ using appropriate compiler optimization flags. The implementations were run on a NVIDIA Tesla C2050. We used synthetic data sets. We benchmarked distance computation, k-NN selection, and total time. Figure 8 shows the comparison when we varied k. In this test we set the data dimension to d~4096 and the number of objects n~16384. Figure 8(a) shows the speedup versus [23]. The distance computation is almost the same because both works formulate distance computation as a matrix-matrix multiplication and use the optimized CUBLAS library. Our version of k-NN selection breaks even with the work of [23] at about k~128 and outperforms it by 15| for k~1024. Overall our implementation has a performance advantage of 7:87|. Figure 8(b) shows the speedup versus [24]. For this test we re-formulated the Pearson distance computation to enable the use of optimized matrix-matrix multiplication. Consequently, our distance implementation has a roughly 9| performance advantage. The k-NN implementation is a per thread linear insertion sort with each thread handling one row of the distance matrix. Our implementation breaks even at k~128 and ends up with a 42| performance advantage when k~1024. Overall, our implementation has a 24| performance gain at k~1024.
In the second sets of tests we kept n~16384 and k~512 and varied d. Figure 9(a) shows the comparison with [23]. Both the speedup with respect to distance and with respect to k-NN selection remains constant at &1| and &6|, respectively. However, the proportion of time for distance computation increases linearly with d. Therefore, we see that the performance gain for total time decreases from 5:7| to 3:5|. Figure 9(b) Figure 11. Performance benchmarks for multi-GPU execution. In this test we used 2 GPUs. For the implementation in [24] the 2 GPUs (Tesla 2050) were mounted on a single desktop machine. For our implementation, we use 2 nodes in our GPU cluster and opted to use only one GPU per node. The input data had the dimension d~16384, and the number of closest neighbors k~512. doi:10.1371/journal.pone.0074113.g011 Figure 12. Performance scaling with number of GPUs. In this test we used a data set with n~1966080,d~2048,k~512. We achieve a performance speedup that is slightly below linear in the number of GPUs. doi:10.1371/journal.pone.0074113.g012 shows the speedup with respect to [24]. Once again, the speed up with respect to distance and with respect to k-NN selection remains constant at &9:5| and &10|, respectively. Once again as d increases, the proportion of time for distance computation increases linearly with d. For large d, the graph shows stabilization of the overall speedup at 10:25|.
In a third sets of tests we kept d~1024 and k~512 and varied n. Figure 10(a) shows the comparison with [23]. For small n, the speedup with respect to selection is &200|. As n increases, the performance gains taper off to &12|. Overall speedup starts off at &100| and falls to &11|. Figure 10(b) shows the comparison with [24]. Once again for small n, the speedup with respect to selection is &37|. As n increases, the speedup tapers off to &5:6|. Overall speedup starts off at &20| and tapers off to &4:7|. Finally, for the tests with varying d and n, while our implementation was able to handle a model size up to n~32,767, the implementations by [23] could only handle a model size up to n~16,384.
We also tested our implementation vs. [24] for multi-GPU configuration. While the implementation in [24] requires all GPUs be on a single computer (connected through PCI Express bus with OpenMP multi-threading), our implementation is designed for execution on GPU clusters, i.e., the scalability is much larger. In the tests, we ran the implementation in [24] on two GPUs on a single desktop and our implementation was run on two nodes of a cluster, with each node containing a single GPU. We used a combination of MPI and OpenMP for multi-GPU execution. Figure 11 shows the result. Here d~16384 and k~512. We achieve up to 156overall speedup. However, for data with a small dimension (dv500) and a small k (kv64), the implementation in [24] can be faster. In fact, for n~1507328, d~294 and k~20, our implementation is roughly 2:2| slower. This is mainly because our k-NN algorithm is not as efficient for small ks. Our new k-NN algorithm to be published in a subsequent paper is much faster. Figure 12 illustrates the scalability of the method w.r.t. the number of GPUs in the cluster. For this test we used a data set with n~1966080, d~2048 and k~512. As can be seen from the graph, we achieve a performance gain that is lightly below linear in the number of GPUs. Since each node contains 2 GPUs, we went from using 2 nodes to 6 nodes in the cluster.
To demonstrate the full capabilities of our implementations, we executed k-NNG construction for two datasets. The first data set contains two million images of simulated diffraction patterns of a randomly oriented Adenylate kinase (ADK) molecule. Each image has 126|126~15876 pixels. The second dataset consists of twenty million images of simulated diffraction patterns of melting ADK in ten different molecular conformations. For more information about the structure of data sets, please refer to [28]. Unfortunately, the second data set on our CPU cluster would take approximately 3 months to compute, and therefore, we do not have actual timings.
We evaluated our method with a previous implementation of a neighborhood graph construction using MATLAB technical computing language. The MATLAB implementation took 56 hours on an exclusive CPU cluster with 32 nodes for two million images of diffraction patterns with the use of highly a optimized ATLAS-BLAS library [33] for multi-threaded Matrix-Matrix Multiplication in double precision. The cluster had one Xeon E5420 quad-core CPU per node with 16 kB of L1 cache, 6144 kB or L2 cache and 40 GFLOPS of double precision computing power. Comparison and selection was accomplished using the quick sort algorithm. Since the parallel MATLAB implementation did not take advantage of the symmetry of the distance matrix, one can assume that such an implementation would roughly take about 28 hours. Our GPU cluster had 16 nodes with each node equipped with two NVIDIA Tesla C2050 GPUs. Each Tesla C2050 GPU has a RAM of 3 GB with 506 GFLOPS of double precision computing power. There are 14 multi-processors sharing 720 kB of L2 cache and each multi-processor having 48 kB of user-configurable L1 cache/Share memory. In addition, each of the GPU nodes had two quad-core Xeon E5620. Note that in our GPU cluster, CPUs are used mostly for managing the GPUs and movement of data between nodes and not for computation. Our GPU cluster implementation took 4.23 hours, giving a roughly 6:6| gain in performance.
To investigate the efficiency of our implementation we also benchmarked the most expensive part of the computation, i.e., matrix matrix multiplication, we tested both double and single precision matrix-matrix multiplication on a single GPU (Tesla C2050) vs. a multi-core optimized (with SSE instructions) on a Xeon processor. We achieved a roughly 7:7| speed-up using GPUs just for matrix multiplication alone. As shown earlier, our complete implementation is slightly worse at a 6:6| gain in performance. This slight degradation may be attributed to communication overhead while executing on the GPU cluster.

Discussion
We have successfully implemented a parallel brute-force k-NNG algorithm on a cluster of graphics processing units. We have used multiple levels of parallelism, task distribution, and data partitioning to achieve a roughly 6:6| performance gain over an implementation using MATLAB on a comparable CPU cluster. There are several shortcomings in our implementation, as evidenced from comparing the raw float point processing power of the processors. The most expensive part of the brute force k-NNG is matrix multiplication. With the best tuned GPU libraries, we see that there is only 50% utilization of GPU resources as opposed to 90% use by finely tuned CPU libraries. A better GPU matrix multiplication library would go a long way toward addressing this deficiency. Another possible route to reducing the computational load is to recursively partition the input data set into sub-divisions with overlaps. This pre-processing step will reduce the computations (distance computation and selection) for each input vector based on set membership. While our current method from comparison and selection is faster than the state-ofthe-art, it is clear that by grouping all rows in the distance matrix together and sorting, we are doing a significant amount of redundant work. For example, in sorting distances, there is no need to compare entries in a given row with entries in all other rows to get the smallest k values in the given row. We are currently working on an elegant solution based on quick-sort that will have a complexity of O(n). With these improvements we will expect to gain a 30{40% increase in overall performance.

Author Contributions
Conceived and designed the experiments: IK AD RMD. Performed the experiments: AD. Analyzed the data: RMD AD. Contributed reagents/ materials/analysis tools: AD IK. Wrote the paper: AD RMD.