Novel Online Dimensionality Reduction Method with Improved Topology Representing and Radial Basis Function Networks

This paper presents improvements to the conventional Topology Representing Network that build more appropriate topology relationships. Based on this improved Topology Representing Network, we propose a novel method for online dimensionality reduction that integrates the improved Topology Representing Network with a Radial Basis Function Network. The method can find meaningful low-dimensional feature structures embedded in the high-dimensional original data space, process nonlinearly embedded manifolds, and map new data online. Furthermore, it can handle large datasets thanks to the improved Topology Representing Network's vector quantization. Experiments illustrate the effectiveness of the proposed method.


Introduction
Techniques for dimensionality reduction have attracted much attention in many fields such as machine learning and data mining [10]. Dimensionality reduction methods map high-dimensional observations into a desired low-dimensional space while preserving the features hidden in the original space. Over the past decades, a number of dimensionality reduction methods have been proposed. Principal Component Analysis (PCA) [11] [12] [13] [14] [15] [16] [17] [18] and Multidimensional Scaling (MDS) [19] [20] [21] have been the two most popular methods because of their relative simplicity and effectiveness. However, PCA is designed to operate when the manifold is embedded linearly or almost linearly in the subspace, and it cannot project previously "unseen" patterns. Classical MDS finds a low-dimensional embedding of patterns whose distances in the target space reflect the dissimilarities in the original sample. Neither PCA nor classical MDS can disclose nonlinearly embedded manifolds because both operate on Euclidean distances. To overcome this limitation, many nonlinear methods have been proposed. Locally Linear Embedding (LLE) [22] maps the high-dimensional original data feature space into a single global coordinate system of low dimensionality. The Laplacian Eigenmap [23] uses spectral techniques to perform dimensionality reduction. ISOMAP [24] [25] applies classical MDS to geodesic distances in the original data feature space. L-ISOMAP [26] improves ISOMAP's efficiency by approximating a large global computation in ISOMAP with a much smaller set of calculations.
Because geodesic distances are especially suitable for computing distances among data points embedded in nonlinear manifolds, many methods for building graphs on the data have been proposed. The Topology Representing Network (TRN) [27] [28] [29] [30] is representative because of its effectiveness and simplicity. TRN, which combines the neural gas (NG) vector quantization method with the competitive Hebbian learning rule, is used to quantize embedded manifolds and learn the topological relations of the input space without the need to prespecify a topological graph. Several dimensionality reduction methods are based on TRN. Online data visualization using the neural gas network (OVI-NG) [31] is a distance-preserving mapping of the codebook vectors (vector quantization) obtained by the NG algorithm. The codebook positions (the codebook vectors' projections in low-dimensional space) are adjusted in a continuous output space using an adaptation rule that minimizes a cost function favoring local distance preservation. OVI-NG cannot disclose nonlinearly embedded manifolds because it uses Euclidean distances. The Geodesic Nonlinear Projection Neural Gas (GNLP-NG) algorithm [32] is an extension of OVI-NG that uses geodesic distances instead of Euclidean distances, so GNLP-NG performs well in projecting nonlinearly embedded manifolds. However, neither GNLP-NG nor OVI-NG can project new data. The method RBF-NDR [33], which combines the NG algorithm with an RBFN, can process data online. Nonetheless, RBF-NDR's mapping quality is inconsistent because it minimizes STRESS [33] at each iteration without clear targets.
In this paper, we propose a new method for online and nonlinear dimensionality reduction called ITRN-RBF. We improve the conventional TRN so that it builds a more appropriate topology relationship; the resulting method, which we call the Improved TRN (ITRN), is specifically suited to calculating geodesic distances. Furthermore, large amounts of data can be processed thanks to ITRN's vector quantization. We chose MDS as the mapping method. In contrast to classical MDS, which operates on Euclidean distances, our method operates on the geodesic distances of the topology graph reconstructed by ITRN. The mapping between the original high-dimensional space and the embedded low-dimensional feature structures is then learned by a supervised RBFN, whose target values are generated by the mapping method. In particular, we give two implementations of the RBFN: one is trained by the Widrow-Hoff learning algorithm, and the other is an exact RBFN designed by precise mathematical calculation. Finally, the RBFN is used to reduce the dimensionality of the original high-dimensional data. ITRN-RBF can process nonlinearly embedded manifolds, preserve the global structure of these manifolds, and project new data online.

Methods
ITRN-RBF comprises two procedures: capturing the topology of the given dataset using ITRN, and learning the mapping using an RBFN. The first procedure learns the topology of the input data embedded in the high-dimensional original data feature space and generates a graph using ITRN. ITRN connects the subgraphs together to ensure the connectivity of the resulting graph; the method for connecting the subgraphs is discussed in the section below. Using the output of the first procedure (codebook vectors with similarity relationships), the second procedure calculates the pairwise graph distances as geodesic distances, constructs the mapping between the high-dimensional original space and the low-dimensional target space, and then uses an RBFN to learn this mapping. There are a variety of ways to implement the RBFN; we give two different implementations, described below. Finally, the RBFN serves as the dimensionality reduction tool, with the desired capabilities of processing nonlinearly embedded manifolds and projecting new data online. In the following, ITRN-RBF is introduced and discussed in detail.

ITRN
TRN is a vector quantization algorithm based on neural network models, which are capable of adaptively quantizing a given set of input data. Given a set of data X = {x_1, x_2, ..., x_N}, x_j ∈ R^D, TRN employs a finite set V = {v_1, v_2, ..., v_n}, v_i ∈ R^D, called codebook vectors (or reference vectors, neural units), to encode X. TRN learns the topological relations of X by distributing nodes among the data and connecting them using the competitive Hebbian rule. The purpose of TRN's learning is to reconstruct a topology graph G = (V, C) for X, where C represents the adjacency matrix of V, whose values are constrained to 0 (unconnected) or 1 (connected). The conventional TRN algorithm operates as follows.
1. Initialize the codebook vectors v_i randomly and set all connection edges c_ij = 0.
2. Randomly select input pattern x from X.

3. Determine the ranking i_0, i_1, ..., i_{n−1} of the nodes by increasing distance to x, i.e., ||x − v_{i_0}|| ≤ ||x − v_{i_1}|| ≤ ... ≤ ||x − v_{i_{n−1}}||.
4. Update all nodes according to the neural gas rule v_{i_k} = v_{i_k} + ε · exp(−k/λ) · (x − v_{i_k}), k = 0, 1, ..., n − 1.
5. Connect the two nodes closest to the randomly selected input pattern x: set c_{i_0 i_1} = 1 and set this connection's age to zero (t_{i_0 i_1} = 0).
6. Increase the age of all connections of v_{i_0} by setting t_{i_0 j} = t_{i_0 j} + 1 for all nodes v_j that are connected to node v_{i_0} (c_{i_0 j} = 1).
7. Remove the connections of node v_{i_0} that have exceeded their lifetime by setting c_{i_0 j} = 0 for all j with c_{i_0 j} = 1 and t_{i_0 j} > T.
8. Increase the iteration step t = t + 1. If the maximum number of iterations has not yet been reached (t < t_max), continue with step 2.
There are several parameters in this algorithm. The number of codebook vectors n and the maximum number of iterations t_max are both set by the user. The neighborhood range λ, the step size ε, and the lifetime T depend on the iteration step. These time-dependent parameters are set according to the form

g(t) = g_i (g_f / g_i)^(t / t_max)

Here, g_i is the initial value of the variable, g_f is the final value, t denotes the iteration step, and t_max represents the maximum number of iterations. Suggestions as to how to tune these parameters have been proposed by Martinetz and Schulten [27].
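The conventional TRN loop described above can be sketched as follows. This is a minimal NumPy illustration; the function name, default parameter values, and initialization details are our assumptions, not the paper's exact implementation.

```python
import numpy as np

def trn(X, n=20, t_max=None, eps=(0.3, 0.05), lam=(10.0, 0.5), T=(20.0, 100.0), seed=0):
    """Sketch of conventional TRN: neural gas updates plus competitive
    Hebbian edges with aging. Step numbers refer to the text above."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    t_max = t_max or 20 * n
    V = X[rng.choice(N, n, replace=False)].astype(float)  # codebook vectors
    C = np.zeros((n, n), dtype=int)    # adjacency matrix (0/1)
    age = np.zeros((n, n), dtype=int)  # connection ages
    decay = lambda gi, gf, t: gi * (gf / gi) ** (t / t_max)
    for t in range(t_max):
        x = X[rng.integers(N)]                             # step 2: random pattern
        rank = np.argsort(np.linalg.norm(V - x, axis=1))   # step 3: ranking
        lam_t, eps_t = decay(*lam, t), decay(*eps, t)
        k = np.empty(n)
        k[rank] = np.arange(n)
        V += eps_t * np.exp(-k / lam_t)[:, None] * (x - V)  # step 4: NG update
        i0, i1 = rank[0], rank[1]
        C[i0, i1] = C[i1, i0] = 1                          # step 5: Hebbian edge
        nbrs = np.flatnonzero(C[i0])                       # step 6: age winner's edges
        age[i0, nbrs] += 1
        age[nbrs, i0] += 1
        age[i0, i1] = age[i1, i0] = 0                      # reset the fresh edge
        old = age[i0] > decay(*T, t)                       # step 7: prune old edges
        C[i0, old] = C[old, i0] = 0
    return V, C
```

The same exponential schedule g(t) serves all three time-dependent parameters, including the lifetime T, which here grows from its initial to its final value.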
In fact, to obtain a denser graph that is better suited to calculating geodesic distances, we make some improvements. For the randomly selected input pattern at each iteration, ITRN creates a connection between the 1st and (k + 1)th nearest nodes (1 ≤ k ≤ kn, typically kn ∈ {2, 3, 4}) instead of connecting only the first and second closest codebook vectors. In addition, we connect the subgraphs to avoid the existence of infeasible nodes. Specific details about ITRN are presented below. Steps 1-5 are the same as steps 1-5 of the conventional TRN, hence we list only the steps that follow.
6. If the following condition is satisfied for k = 1, 2, ..., kn, then create a connection between nodes v_{i_s} and v_{i_k} by setting c_{i_s i_k} = 1 and t_{i_s i_k} = 0.
7. Increase the age of all connections of v_l (l = i_0, i_1, ..., i_{kn−1}) by setting t_{lj} = t_{lj} + 1 for all nodes v_j that are connected to node v_l (c_{lj} = 1).
8. Remove the connections of node v_l (l = i_0, i_1, ..., i_{kn−1}) that have exceeded their lifetime by setting c_{lj} = 0 for all j for which c_{lj} = 1 and t_{lj} > T.
9. Increase the iteration step t = t + 1. If the maximum number of iterations has not yet been reached (t < t_max), continue with step 2.
10. If the resulting graph G = (V, C) is unconnected, connect the subgraphs: for each pair of subgraphs G_i and G_j, let e_ij be the shortest edge obtained by connecting the closest nodes in G_i and G_j. Finally, choose suitable edges e_ij to add to C and obtain the connected graph G_E = (V, C_E).
Compared with conventional TRN, we note that:
• ITRN modifies the TRN strategy to establish the connections in steps 6-8 (see Fig 1) and connects subgraphs in step 10 (see Fig 2).
• Conventional TRN causes deviation because it ignores some topological relations of the codebook vectors. ITRN, by contrast, connects multiple points so that more topological relations can be established; a relation caused by miscalculation is removed when its lifetime exceeds the limit. An experiment illustrating the different constructions is shown in Fig 3.
• The distance ratio, defined as r_ij = GD_ij / ED_ij, can be used to quantitatively evaluate the connection quality, where GD_ij denotes the geodesic distance and ED_ij the Euclidean distance between codebook vectors v_i and v_j. The bar charts of the distance ratios (Fig 4) show that ITRN's has a larger gradient and a much more restricted ratio range, both of which are desirable.
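The distance ratio above can be computed directly from the graph produced by ITRN. The sketch below (function name and Floyd-Warshall choice are ours; any shortest-path algorithm would do) derives geodesic distances from the adjacency matrix and returns the ratios for all connected pairs.

```python
import numpy as np

def distance_ratios(V, C):
    """Distance ratio r_ij = GD_ij / ED_ij for each connected pair of
    codebook vectors, with geodesic distances GD computed on the graph
    (V, C) by Floyd-Warshall using Euclidean edge lengths."""
    n = len(V)
    ED = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1)
    GD = np.where(C == 1, ED, np.inf)   # edge weights = Euclidean lengths
    np.fill_diagonal(GD, 0.0)
    for k in range(n):                  # Floyd-Warshall shortest paths
        GD = np.minimum(GD, GD[:, k:k+1] + GD[k:k+1, :])
    iu = np.triu_indices(n, k=1)
    mask = np.isfinite(GD[iu]) & (ED[iu] > 0)
    # Ratios are always >= 1; values close to 1 mean the graph's geodesic
    # distances track the Euclidean geometry well.
    return GD[iu][mask] / ED[iu][mask]
```

On a connected graph every pair gets a finite ratio, so a histogram of these values corresponds to the bar charts discussed in the text.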

RBFN
In this section, we propose two methods to train or design an RBFN. The first approach, called the training RBFN (TRBF), is a D-h-d network that includes an input layer with D units (equal to the codebook vectors' dimensionality), a hidden layer with h units (set by the user), and an output layer with d units (equal to the dimensionality of the output space). The second approach, named the exact RBFN (ERBF), is a D-n-d network with the same parameters as the training RBFN, except that its number of hidden-layer units n equals the number of codebook vectors. Both take the codebook vectors obtained by ITRN as inputs and the targets given by MDS for training. More importantly, the MDS is based on geodesic distances calculated from the graph G_E = (V, C_E), and the training targets are defined accordingly [39].

TRBF. For TRBF, we chose a Gaussian activation function

φ_l(x) = exp(−||x − c_l||² / (2σ_l²))

where c_l and σ_l are the center and width of the l-th hidden unit, so that the k-th network output is y_k = Σ_l w_lk φ_l(x) + b_k. The loss function is

E = (1/2) Σ_k (t_k − y_k)²

The TRBF network provides four types of adjustable parameters: centers c_li, widths σ_li, weights w_lk, and biases b_k. Based on the Widrow-Hoff learning algorithm, each parameter is updated by gradient descent on E, where η_c, η_σ, η_w, and η_b, the individual step sizes for c_li, σ_li, w_lk, and b_k, respectively, can be set by the user.

ERBF. ERBF's weights W and output-layer bias B are obtained by solving linear equations exactly, so the RBFN can, in theory, ensure zero error on the training set. The input-layer bias is set as b_in = sqrt(−log 0.5) / spread, so there is only one parameter, the spread, that needs to be set by the user. How to set the spread is described in the Results section.
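A minimal NumPy sketch of the ERBF design follows. The helper names, the least-squares solve, and the toy data are our assumptions; the point is that with one Gaussian unit per codebook vector, the weights come from a single linear solve and the network reproduces its training targets exactly.

```python
import numpy as np

def erbf_design(V, T, spread=1.0):
    """Exact RBFN (ERBF) sketch: one Gaussian unit per codebook vector,
    weights and output bias obtained by solving a linear system so the
    network is exact on the training set. b_in = sqrt(-log(0.5))/spread
    as in the text."""
    b_in = np.sqrt(-np.log(0.5)) / spread

    def phi(X):
        # Hidden-layer outputs for patterns X (Gaussian of scaled distance).
        D = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=-1)
        return np.exp(-(b_in * D) ** 2)

    n = len(V)
    Phi = np.hstack([phi(V), np.ones((n, 1))])   # append output-bias column
    Wb = np.linalg.lstsq(Phi, T, rcond=None)[0]  # solve Phi @ [W; B] = T
    return lambda X: np.hstack([phi(X), np.ones((len(X), 1))]) @ Wb

# Usage: map codebook vectors to their MDS targets, then project new data.
V = np.random.default_rng(0).normal(size=(10, 3))  # stand-in codebook vectors
T = np.random.default_rng(1).normal(size=(10, 2))  # stand-in MDS targets
f = erbf_design(V, T, spread=2.0)
```

Because the Gaussian kernel matrix is nonsingular for distinct centers, the system is exactly solvable and f(V) recovers T up to floating-point error.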

ITRN-RBF method
The detailed algorithm proceeds as follows:
1. Construct the graph G_E = (V, C_E) using ITRN. By construction, the graph is connected.
2. Calculate the geodesic distances on G E .
3. Construct the mapping between the high-dimensional original space and the low-dimensional target space by applying MDS to the geodesic distances of the topology graph. For every v_j, we obtain an output t_j as its target.
4. Train or design an RBFN with explicit inputs V and targets T. In this step, any appropriate RBFN, such as ERBF or TRBF, can be applied.

5. Use the RBFN to map the dataset.
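Step 3 above can be sketched as classical MDS applied to a geodesic distance matrix: double-center the squared distances and embed with the leading eigenvectors. The function name and the toy chain graph are ours.

```python
import numpy as np

def classical_mds(GD, d=2):
    """Classical MDS on a geodesic distance matrix GD: double-center the
    squared distances and embed with the top-d eigenvectors. The rows of
    the result are the targets t_j used to train the RBFN (steps 3-4)."""
    n = len(GD)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (GD ** 2) @ J          # double-centered Gram matrix
    w, U = np.linalg.eigh(B)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:d]         # keep the d largest
    return U[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Usage on a toy chain graph: three nodes with geodesic distances 0-1-2
# embed on a line whose pairwise distances match the input exactly.
GD = np.array([[0., 1., 2.],
               [1., 0., 1.],
               [2., 1., 0.]])
T = classical_mds(GD, d=1)
print(np.ptp(T))  # spread of the 1-D embedding, equals 2 for this chain
```

Clamping negative eigenvalues to zero guards against the geodesic matrix not being exactly Euclidean, which is the usual situation for graph distances.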

Results
In this section, ITRN-RBF is used for visualization and feature extraction, and is compared with other methods, including methods based on TRN and classical dimensionality reduction methods such as ISOMAP, L-ISOMAP, and PCA. We also present a computational complexity analysis of the method and a table with running times. There are many parameters in the experiments. The common parameters of TRN, OVI-NG, and GNLP-NG were set as follows: t_max = 20n, ε_i = 0.1, ε_f = 0.05, λ_i = 0.05n, λ_f = 0.01, T_i = 0.05n, and T_f = n. The auxiliary parameters of OVI-NG and GNLP-NG were set as α_i = 0.3, α_f = 0.001, σ_i = 0.7n, and σ_f = 5. The extra ITRN parameter kn was set to two (for the Swiss roll) or three (for the artificial faces, handwritten digit "2", and UMist faces datasets). The RBFN parameters in the Swiss roll experiment were set as follows: η_c = 0.03, η_σ = 0.03, η_w = 0.2. For the image processing experiments, they were changed to η_c = 0.002, η_σ = 0.002, and η_w = 0.05. The ERBF's spread parameter is obtained from the Euclidean distances d_ij between the codebook vectors. The number of neighbors used in the computations for ISOMAP and L-ISOMAP was set to 12. The number of landmarks used in L-ISOMAP was set to 0.1n.

Comparison with the methods based on TRN
We chose two standard metrics of mapping quality that are widely used for analyzing dimensionality reduction methods based on TRN.
• Distance preservation: This value evaluates the difference between inter-node distances in the input space and in the output space. We chose the classical MDS [19] [20] and Sammon [40] stress functions to quantify it:

E_MDS = Σ_{i<j} (d_ij − d̂_ij)²

E_SM = (1 / Σ_{i<j} d_ij) Σ_{i<j} (d_ij − d̂_ij)² / d_ij

where d_ij is the distance between nodes in the original space and d̂_ij is the distance between nodes in the output space. Moreover, when the mapping method uses geodesic distances, the expression is calculated using geodesic distances; otherwise, Euclidean distances are used.
• Neighborhood preservation: This value evaluates the degree to which patterns adjacent in the input space remain close in the output space. The measures of trustworthiness M_1(k) and continuity M_2(k) [41] [42] are suitable:

M_1(k) = 1 − (2 / (nk(2n − 3k − 1))) Σ_i Σ_{v_j ∈ U_k(v_i)} (r_ij − k)

M_2(k) = 1 − (2 / (nk(2n − 3k − 1))) Σ_i Σ_{v_j ∈ V_k(v_i)} (r̂_ij − k)

where U_k(v_i) is the set of nodes that are in the k-size neighborhood of codebook vector i in the output space but not in the original space, and V_k(v_i) denotes the set of nodes that belong to the k-size neighborhood of codebook vector i in the original space but not in the output space. The rank r_ij refers to the rank of v_j with respect to v_i in the original space, while r̂_ij denotes the corresponding rank in the output space. Trustworthiness and continuity are thus functions of the number of neighbors k.
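The two bullet points above can be made concrete with a short implementation of trustworthiness and continuity. This is a sketch in the usual Venna-Kaski form, which we assume matches the paper's equations; the helper names are ours.

```python
import numpy as np

def ranks(D):
    """Rank matrix: R[i, j] = rank of node j among i's neighbours
    (1 = nearest), computed from distance matrix D, diagonal excluded."""
    n = len(D)
    R = np.zeros((n, n), dtype=int)
    for i in range(n):
        order = np.argsort(np.delete(D[i], i))  # other nodes by distance
        others = np.delete(np.arange(n), i)
        R[i, others[order]] = np.arange(1, n)
    return R

def trustworthiness_continuity(D_in, D_out, k):
    """M_1(k) and M_2(k) from pairwise distance matrices in the original
    (D_in) and output (D_out) spaces."""
    n = len(D_in)
    R_in, R_out = ranks(D_in), ranks(D_out)
    norm = 2.0 / (n * k * (2 * n - 3 * k - 1))
    U = (R_out <= k) & (R_in > k)     # in output k-NN but not input k-NN
    M1 = 1.0 - norm * np.sum((R_in - k)[U])
    Vset = (R_in <= k) & (R_out > k)  # in input k-NN but not output k-NN
    M2 = 1.0 - norm * np.sum((R_out - k)[Vset])
    return M1, M2

# Sanity check: identical input and output spaces give the maximum value.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
M1, M2 = trustworthiness_continuity(D, D, k=5)
print(M1, M2)  # both 1.0 for an identity mapping
```

Sweeping k then produces the line charts used in the comparisons below.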
Three methods, OVI-NG, GNLP-NG, and RBF-NDR, were selected for comparison. In particular, OVI-NG and GNLP-NG can only map the codebook vectors; hence, to keep the comparison fair, we used the RBFNs obtained by RBF-NDR and ITRN-RBF to map the codebook vectors. Line charts of trustworthiness and continuity for all methods are given after each experiment, except for OVI-NG, because that method cannot process nonlinearly embedded manifolds (we show its results separately in the Swiss roll experiment only). Table 1 presents the stress functions for the different methods.
Swiss roll. The Swiss roll (S2 Dataset) corresponds to a two-dimensional pattern distributed uniformly on a plane and embedded nonlinearly in 3D (Fig 5a). We used ITRN to learn this manifold and ensure the connectivity of the resulting graph. The graph in Fig 5b shows the manifold embedded in the high-dimensional original data feature space as reconstructed by ITRN. We then trained an RBFN to reduce the dimensionality. The projection estimated by the ERBF module is given in Fig 5c and 5d: Fig 5c shows the mapping of the training patterns (2000 nodes), and Fig 5d shows the mapping of a new dataset (5000 nodes) taken from the Swiss roll by random sampling. We observe that ITRN-RBF is able to recover the intrinsic two-dimensional structure of the Swiss roll and to process a new dataset.
The different mappings of the Swiss roll's codebook vectors are presented in Fig 6. All methods disclose the embedded manifold of the Swiss roll except OVI-NG. The neighborhood preservation achieved by OVI-NG is presented in Fig 7. Because this method performs so poorly, only ITRN-RBF, RBF-NDR, and GNLP-NG are discussed in the following. Moreover, because the iterative adjustment in both RBF-NDR and GNLP-NG aims to minimize the stress function, the two produce similar mapping structures.
Analyzing each of the measures shown in Fig 8 and Table 1, it is clear that ITRN-ERBF retains two distinct advantages with respect to distance and neighborhood preservation. Closest to ITRN-ERBF in performance is RBF-NDR, with GNLP-NG and ITRN-TRBF performing almost as well.
Artificial and real-world images. The artificial images (S3 Dataset) are from the domain of visual perception. The dataset contains 698 artificially generated images of faces (image size: 64 × 64, 688 images for training and 10 for testing, referred to as AFs) under different poses and different illumination conditions.
The real-world images (S4 Dataset) come from the Mixed National Institute of Standards and Technology (MNIST) database. We chose the handwritten digit "2" (image size: 28 × 28, 1000 images for training and 10 for testing, referred to as "2") for this experiment because of its varied forms.
In particular, the two datasets receive different treatments. The AFs are preprocessed by PCA: principal components that contribute less than 0.1% of the explained variance are discarded, and the dimensionality reduction methods are then applied to the preprocessed dataset. For "2," however, we chose the original dataset as the training patterns.
ITRN-ERBF and the other methods were applied to this visual perception task. The resulting two-dimensional projections of the training patterns obtained by ITRN-ERBF are given in Figs 9 and 10. A comparison of mapping quality is presented in Figs 11 and 12 as well as Table 1. Blue plus signs represent the two-dimensional projections of the training patterns, and red circles represent the testing patterns' positions. For ease of inspection, only some of the training patterns' corresponding images are plotted. The major articulation features of the AFs, left-right (x-axis) and up-down (y-axis), are captured from the input space. For the "2" dataset, the bottom loop (x-axis) and lean (y-axis) are captured from the input space.
In terms of mapping quality, ITRN-ERBF shows high adaptability and performance. In contrast, ITRN-TRBF, GNLP-NG, and RBF-NDR perform less well. In a few rare cases, GNLP-NG shows the best distance preservation because the goal of GNLP-NG is to minimize the stress function.

Comparison with RBF-NDR. Dimensionality reduction methods equipped with an RBFN can process new datasets. However, an imprecise RBFN could lead to imprecise projections. Hence, ITRN-ERBF, ITRN-TRBF, and RBF-NDR were tested to determine whether they generate definitive results. All of them use an RBFN to project the dataset.
All methods were run 20 times on a uniform Swiss roll dataset. At each run, the manifold learning procedure was executed afresh and the RBFN was designed or trained again. The results are shown in Fig 13, where the x-axis denotes the runs and the y-axis represents the value of E_MDS or E_SM. We observe that ITRN-ERBF has the smoothest line, indicating that ITRN-ERBF produces the most definitive results. In contrast, ITRN-TRBF and RBF-NDR show obvious fluctuations because their trained RBFNs could not fully minimize the stress or loss functions.

Comparison against the classical methods
In this section, ITRN-RBF was compared with classical dimensionality reduction methods, including ISOMAP, L-ISOMAP, and PCA. Three quality metrics [43], namely the stress function, the correlation coefficient, and smooth neighborhood preservation, were used for the analysis. We detail the three quality metrics in the following.
• Stress function. See Eq 16.
• Correlation coefficient. This value measures how well distances in the original space are correlated with those in the visual space:

E_CC = 1 − (<D ∘ D̂> − <D><D̂>) / (σ_D σ_D̂)

where D and D̂ are the upper-triangular distance matrices before and after projection, ∘ is the element-by-element product, <·> is the average operator, and σ is the standard deviation of a vector's elements. The smaller the value of E_CC, the better the visualization.
• Smooth neighborhood preservation. This is also a neighborhood preservation metric, but it is based on distance instead of rank order, in contrast with trustworthiness and continuity. The local misplacing metrics are defined over N_T(v_i), the set of nodes in the k-nearest neighborhood (we set k = 12 for this analysis) of a node i that are not mapped among the k-nearest neighbors of i in the output space, and N_FN(v_i), the set of nodes that are not among the k-nearest neighbors of i but are mapped among the k-nearest neighbors of i in the output space; |W| denotes the number of elements in a set W, and w(r, t) is a smooth distance-based weighting function. Smooth neighborhood preservation E_NP is then obtained by averaging the local misplacing metrics over S, the set of nodes under analysis. The smaller the value of E_NP, the better the neighborhood preservation.
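The correlation-based metric above can be sketched as follows. We compute the Pearson correlation ρ between the upper-triangular distances and report 1 − ρ, a sign convention we assume so that, as in the text, smaller values indicate better distance preservation.

```python
import numpy as np

def correlation_metric(D_in, D_out):
    """Sketch of the correlation coefficient metric: 1 minus the Pearson
    correlation between upper-triangular pairwise distances before and
    after projection (the 1-minus convention is our assumption)."""
    iu = np.triu_indices(len(D_in), k=1)
    a, b = D_in[iu], D_out[iu]
    rho = ((a * b).mean() - a.mean() * b.mean()) / (a.std() * b.std())
    return 1.0 - rho

# A uniformly scaled projection preserves all distance ratios, so rho = 1
# and the metric is 0 (its best value).
D = np.abs(np.subtract.outer(np.arange(5.0), np.arange(5.0)))
print(correlation_metric(D, 2.0 * D))
```

Any monotone distortion of the distances pushes ρ below 1 and the metric above 0.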
We add one more dataset, face images of three people (S5 Dataset) from the UMist Faces database (575 images in total, size 112 × 92, manually cropped by Daniel Graham [44]), to demonstrate feature extraction (Fig 14). Table 2 presents the quality metric values for the different methods. We observe that PCA performs poorly because the datasets are nonlinear. ISOMAP is better than L-ISOMAP because L-ISOMAP only approximates a large global computation. ITRN-ERBF is better than ITRN-TRBF because ITRN-TRBF is trained and has fewer center nodes in its network. ITRN-RBF, ISOMAP, and L-ISOMAP give similar results, and in some cases ITRN-RBF performs better than ISOMAP and L-ISOMAP, which illustrates the effectiveness of ITRN-RBF.

Computational complexity analysis
Assume that the number of nodes in the input space is N, the number of codebook vectors is n, the number of TRN epochs is k_1, and the number of TRBF epochs is k_2. The most time-consuming part of TRN is sorting the distances to obtain the ranks r_i, which costs O(N log_2 N). Our improvement of TRN increases the time cost because of building the connected graph; this extra cost is O(n²). However, in most applications this cost is negligible because of the small value of n. The MDS has complexity O(n³), the TRBF O(k_2 n), and the ERBF O(n). ITRN-RBF therefore runs in O(k_1 N log_2 N + n³ + k_2 n) (based on TRBF) or O(k_1 N log_2 N + n³) (based on ERBF).
We list the running times in Table 3. Note that training the RBFN and mapping the dataset are timed separately, so the speed with which the RBFN maps the dataset is quite remarkable. We note that:
• In most applications, n << N, so MDS and training the RBFN run quickly.
• Once the RBFN is obtained, mapping the dataset costs only O(N).
• ITRN-TRBF is slower than ITRN-ERBF because the trained RBFN requires an iterative procedure. However, once the RBFN is obtained, mapping with TRBF is always faster than with ERBF, because ERBF has more center nodes in its network.

Discussion
Classical dimensionality reduction methods such as PCA and MDS cannot disclose nonlinearly embedded manifolds. ISOMAP and L-ISOMAP use geodesic distances to improve MDS, providing good performance. ITRN-RBF offers comparable performance but has a faster mapping speed and the ability to deal with new data. Among the dimensionality reduction methods based on TRN, OVI-NG also cannot process nonlinear datasets because it uses Euclidean distances in the observation space. GNLP-NG makes improvements similar to ISOMAP's. However, neither OVI-NG nor GNLP-NG can project new data online. ITRN-RBF and RBF-NDR overcome these problems: they can project nonlinear data because they use geodesic distances, and they can map new data because of the RBFN. In this paper, we proposed two methods to obtain the RBFN, each with distinct advantages and disadvantages. ERBF has only one parameter, its spread. A larger spread generates a more robust network, but too large a spread causes numerical problems. ERBF is calculated only once, without accumulating error, hence it is fast and exact; however, a large number of training patterns results in a large-scale network. Because ITRN uses vector quantization to decrease the number of training patterns, ERBF is the recommended approach for obtaining an RBFN. The other method, TRBF, trains an RBFN, which requires a large number of adjustable parameters and considerable calculation time. Compared with RBF-NDR, ITRN-RBF produces definitive results and high mapping quality. ITRN-RBF also has good scalability with reasonable hardware costs: if more effective methods for obtaining the RBFN are adopted, better performance is obtained.
To sum up, the proposed ITRN-RBF, which uses ITRN to build a more appropriate topology relationship suitable for geodesic distances, handles nonlinearly embedded manifolds, large amounts of data, and the online projection of new data. The method can be applied to a wide range of applications, including visualization and feature extraction.
Supporting Information
S1 Dataset. Randomly generated nodes dataset.