A two-route CNN model for bank account classification with heterogeneous data

Classifying bank accounts using transaction data is promising for cracking down on illegal financial activities. However, few studies simultaneously use the heterogeneous features that are embedded in such time series data. In this paper, a two-route convolutional neural network model, TRHD-CNN, fed with two types of heterogeneous feature matrices, is proposed for classifying bank accounts. TRHD-CNN adopts a divide-and-conquer strategy to extract characteristics from the two data sources independently. This strategy proves able to mine complementary classification characteristics. We first transform the original log data into a directed and dynamic transaction network. On that basis, two feature generation methods are devised to extract information from the local topological structure and the time series transactions respectively. A DirectedWalk method is developed in this paper to learn the network vectors of vertices, used for embedding the neighbor relationships of bank accounts. Extensive experimental results, obtained on a real bank transaction dataset that contains illegal pyramid selling accounts, show the significant advantage of TRHD-CNN over existing methods: it provides recall scores up to 5.15% higher than competing methods. In addition, the two-route architecture of TRHD-CNN is easy to extend to multi-route scenarios and other fields.


Introduction
Almost any system that involves money and services can breed fraudulent activities, e.g. bank fraud, insurance fraud, telecommunication fraud, and e-commerce fraud. With economic development, financial fraud activities seriously threaten the property security of customers. Identifying anomalous bank accounts helps provide clues to the police in combating financial crimes. Therefore, the study of bank account classification has attracted great interest among researchers in recent years [1]. In this paper, we refer to the task of classifying bank accounts as CBA for short.
Due to its practical importance, classification has attracted extensive attention in many financial fraud scenarios. The existing classification methods can be categorized into three types: statistical based methods, time series based methods, and network based methods. Statistical based methods [2] utilize profile characteristics of the customers. The experimental results show that the TRHD-CNN model significantly outperforms the existing methods in terms of accuracy, recall and F1-score.
The main contributions of this paper are as follows. (1) We build a weighted directed transaction network, which embeds the transaction relationships of bank accounts. (2) A network vector generation method, DirectedWalk, is proposed to extract the local topological relationships of bank accounts. (3) We build a two-route CNN model, TRHD-CNN, to classify bank accounts using two kinds of heterogeneous data. TRHD-CNN can be easily extended to other fields owing to its divide-and-conquer structure. (4) The experimental results show that TRHD-CNN significantly outperforms the existing methods.
The rest of this paper is organized as follows. Related work is reviewed in Sect. 1. In Sect. 2, we propose the generation methods for the local topological feature matrix and the time series feature matrix respectively. The TRHD-CNN model is presented in Sect. 3. The performance evaluation of TRHD-CNN is analyzed in Sect. 4. Sect. 5 concludes the paper, and finally Sect. 6 describes our future work.

Related work
The classification method is one of the most widely used methods in detecting financial fraud [9]. In this field, traditional methods divide the classification task into two separate parts, i.e. feature mining and classifying. This section reviews the existing classification methods corresponding to financial fraud scenarios, and the multi-input CNN models.
Fig 1. Trading behavior in a money laundering organization. As shown in (A), the money transaction paths of MLM members are arranged in a tree-structured network. The transaction topological structures of members, especially members in the same layer, are similar to each other. A leaf node pays a membership fee to its parent node and earns rebates from an upper member (not depicted). In addition, the amount of money earned by members decreases from the top level to the bottom. As depicted in (B), the money transaction paths are arranged in a spindle-shaped structure. A large fund is first injected into the left node and split into a number of small funds. These small amounts are transferred in parallel and finally flow into the right node. https://doi.org/10.1371/journal.pone.0220631.g001

Traditional classification method in financial fraud detection field
In recent years, different types of financial fraud and different fraud detection techniques have been studied widely. The classification methods, according to the data they focus on, can be categorized into three groups: statistical based methods, time series based methods, and network based methods.
The statistical based methods exploit the statistical features of the historical data to profile an entity. In order to detect credit card fraud, [10] devises a number of decision tree and SVM models, using trading characteristics extracted from the continual monitoring of an account's trading records. The records contain geographical locations, transaction dates, Merchant Category Codes, etc. Any new incoming record is quantified and compared with the existing profiles to determine whether the difference exceeds a suspicion score. [11] mines frequent patterns from legal and fraudulent transaction data separately, and proposes a matching algorithm to find which pattern the incoming data is closer to. The Apriori algorithm [12], a classical algorithm in data mining, is utilized to return a set of frequent itemsets. Similarly, the technique of fuzzy association rule mining is exploited by [13] to extract knowledge useful for detecting credit card fraud accounts. Maes et al. [14] perform experiments, using Bayesian belief networks and artificial neural networks, on new transaction data to identify whether it is fraudulent or not. They demonstrate that the Bayesian network has better fraud detection capability.
The time series based methods use time series characteristics of transaction data to depict the trading pattern of an entity. [3] applies an improved time series decomposition method, EMD (Empirical Mode Decomposition), to extract fluctuation features: the complex financial time series data is decomposed into several local detail parts and one global tendency part. [3] also introduces the concept of peer group comparison, implemented by a novel linear segment approximation method. [15] proposes a sequence matching method to detect the suspicious activities of an account according to its own transaction trend and temporal pattern. To create the temporal sequence of an account, [15] takes two kinds of information into consideration, i.e. the historical transaction information and information from the account's peer group. [16] treats the transaction data of a credit card as a data stream, and classifies the fraudulent users using an extended Very Fast Decision Tree (VFDT) method.
The network based methods, on the basis of a constructed transaction network, analyze the suspicious activities involving a cooperative group. Redmond et al. [17] employ network analysis in detecting time-related suspicious groups in a P2P (peer-to-peer) lending system. [17] creates a directed loan network with parallel edges, in which each edge corresponds to a timestamp. In this temporal network, the characteristics of variation in local structure are used to detect suspicious active subgraphs, which occur during a short time interval. The experimental results demonstrate the existence of three dense structures, i.e. cliques [18], trusses [19] and FOFI. [20] describes a classification system that is capable of analyzing group behaviors by combining network analysis and supervised learning. A transaction network is modeled by [20], in which vertices are parties, edges denote relationships, and communities are defined as near-k-top neighbors. Furthermore, an ego-centric approach is taken to build the communities in a bottom-up process. Finally, the authors create an SVM classifier and a random forest classifier using four kinds of community-related features, i.e. demographic, network, transaction and dynamic features. In order to identify illegal pyramid selling accounts, [21] builds a telecommunications network and creates ego networks for different kinds of users. It shows that the ego networks of normal and service users differ substantially from those of the MLM members; the visualization of the ego network of an MLM member has a tree-like structure. Based on the characteristics of MLM transactions, six network attributes are quantified and used to detect the MLM members.
Summary Most of the statistical features are extracted by analyzing the knowledge of specific suspicious cases [10] [13]. Table 1 summarizes the research on methods and techniques in the financial fraud detection field. The significant drawback of statistical methods is their weak scalability. Complemented with time series characteristics, the users' trading pattern is expressed more fully in [15]. The challenge of time series based methods is how to design a sequence matching method that is low in complexity but high in accuracy. Network analysis techniques are employed in detecting suspicious accounts that exhibit abnormal collective behaviors [19] [20]. However, the employed network features are too coarse-grained to achieve satisfying classification results. In order to capture the intrinsic patterns, [7] proposes a CNN-based detection framework to learn latent trading patterns from labeled data. In [7], a novel trading feature representation is devised and the one-dimensional transaction data is transformed into a feature matrix.

Multi-input CNN models
In order to extract more discriminative classification features from more perspectives, many CNN models with multi-input structures have been proposed. [22] presents a handwritten digit recognition method based on cascaded heterogeneous CNNs. Each CNN structure is built to recognize a proportion of the input samples, and the following CNN is fed with the rejected samples. The reliability and complementarity of heterogeneous CNNs are investigated in [22]. Simonyan et al. [23] extend deep Convolutional Networks (ConvNets), a state-of-the-art still image representation, to a two-stream ConvNet architecture for recognizing actions in video data. The two-stream structure incorporates two separate ConvNet-style networks, training the spatial stream and the temporal stream together in a multi-task learning framework. The complementarity of the two recognition streams is demonstrated by the experimental results. [24] devises a multi-column deep neural network (MCNN) for crowd counting from a single image using three columns of CNNs. Filters of three different sizes are used in the three columns to adapt to the large variation in people size across images.
Summary An additional CNN structure is helpful for extracting complementary features for classifying samples. However, in [22] [23] [24], the types of convolution filters of the multiple CNNs are the same. In this paper, two kinds of heterogeneous data, i.e. network relationship data and time series data, are processed into two feature matrices and used to describe the trading patterns of an account. Because the relationships between adjacent elements in those two matrices are quite different, each CNN input requires a distinct convolution filter. We therefore propose a TRHD-CNN model that contains two different CNN substructures to solve this problem.

Preliminaries
In this section, the problem definitions are given first. Then, we devise a DirectedWalk algorithm for learning the network vectors of accounts. Subsequently, the local topological features and time series features of accounts are formed into two matrices respectively.

Problem definition
This part gives the formal definitions of bank account classification.
Definition 1 (CBA problem). Given the bank account set A = {a_1, a_2, . . ., a_n} and its category label set C = {c_1, c_2, . . ., c_m}, where a_i denotes the i-th bank account and c_j the label of the j-th category, the CBA problem is to find, within the Cartesian product A × C, the pair (a_i, c_j) for each account, i.e. to learn a mapping from accounts to category labels.
Viewing the accounts as vertices, the transaction relationships between accounts as directed edges, and the transaction information as edge weights, the transaction data can be organized into a directed and dynamic network. We define the network as below.
Definition 2 (Transaction network). The transaction network is a directed and dynamic graph G = (V, E, W, T), where V is the set of account vertices, E is the set of directed edges, W contains the edge weights, with the weight vector w_ij of edge e_ij recording the transaction amounts from v_i to v_j in chronological order, and T contains the corresponding timestamps.
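The construction in Definition 2 can be sketched in code. The sketch below is a minimal, hypothetical implementation: the function name, the tuple layout of the log records, and the dict-based edge store are all assumptions, not the paper's actual data structures.

```python
from collections import defaultdict

def build_transaction_network(records):
    """Build the directed, dynamic transaction network G = (V, E, W, T).

    `records` is an iterable of (payer, payee, timestamp, amount) tuples
    taken from the raw log (hypothetical layout). Each directed edge e_uv
    carries its weight vector w_uv: the amounts sent from u to v, kept in
    chronological order together with their timestamps.
    """
    vertices = set()
    edges = defaultdict(list)          # (u, v) -> [(timestamp, amount), ...]
    for payer, payee, ts, amount in records:
        vertices.update((payer, payee))
        edges[(payer, payee)].append((ts, amount))
    for key in edges:                  # keep every weight vector time-ordered
        edges[key].sort(key=lambda entry: entry[0])
    return vertices, dict(edges)

# Toy log: a1 pays a2 twice (out of order), a2 pays a3 once.
log = [("a1", "a2", 3, 100.0), ("a1", "a2", 1, 50.0), ("a2", "a3", 2, 30.0)]
V, E = build_transaction_network(log)
```

Storing each edge's amounts and timestamps as one chronologically sorted list matches the role the weight vectors w_uv play in the feature-generation steps that follow.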

Generating the local topological feature matrix
Economic crime investigators believe that it is helpful to consider a vertex itself and its neighbors comprehensively in determining whether the vertex is fraudulent. In other words, for any v in G, the category it belongs to is closely related to its local topological structure G_v. We propose a DirectedWalk algorithm, an improved version of DeepWalk [8] for directed and dynamic networks, to learn the network vectors of accounts. Based on this, a method to generate an account's local topological feature matrix is presented by constructing its transactional relationships.
Table 1. Summary of methods and techniques in the financial fraud detection field.
Statistical features | decision tree and SVM, frequent pattern matching, fuzzy association rules, Bayesian networks | credit card fraud accounts | [10], [11], [13], [14]
Time series features | EMD and peer group comparing, sequence matching, VFDT | suspicious financial transaction, suspicious activities and accounts, credit card fraud accounts | [3], [15], [16]
Network features | network analysis method, network analysis method and SVM+random forest, ego-network | P2P members, suspicious active subgraphs, MLM members | [17], [20], [21]
https://doi.org/10.1371/journal.pone.0220631.t001

DeepWalk.
To capture network topology information, [8] proposes the DeepWalk approach, which learns features that describe the graph structure. Optimization techniques originally designed for language modeling are used to learn social representations of a graph's vertices. DeepWalk takes a graph as input and produces its latent structure representation as output. The learned structural features have been used in many applications, such as network classification and anomaly detection, with strong results.
The DeepWalk algorithm, borrowing concepts from language modeling, treats a set of short truncated random walks as its corpus and the graph vertices as its vocabulary. There are two main components: a random walk generator and an update procedure. Given graph G as input, the random walk generator uniformly samples a random vertex v as the root of a random walk W_v. Each vertex v serves as the start of γ random walks, and within each walk the next of the |W_v| vertices is selected at random. To update the vector representation of v, Skip-Gram [25] is utilized to maximize the probability of its neighbors in the walk W_v. Inspired by DeepWalk, we propose a DirectedWalk algorithm to learn network vectors of accounts in the transaction network.
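The random walk generator described above can be sketched as follows. This is a minimal illustration, not DeepWalk's reference implementation; the function name and the adjacency-dict input are assumptions.

```python
import random

def deepwalk_corpus(adj, walks_per_vertex, walk_len, seed=0):
    """Sketch of DeepWalk's random walk generator: every vertex roots
    `walks_per_vertex` truncated walks, and each step moves to a
    uniformly sampled neighbour of the current vertex."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_vertex):       # gamma passes over the vertices
        for root in adj:
            walk = [root]
            while len(walk) < walk_len and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            corpus.append(walk)
    return corpus

adj = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
corpus = deepwalk_corpus(adj, walks_per_vertex=2, walk_len=4)
```

Each walk in the corpus then plays the role of a "sentence" fed to Skip-Gram in the update procedure.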

Learning the network vector.
We generate a corpus and a vocabulary from the directed and dynamic network, which is the only required input for learning the network vectors. Given G = (V, E, W, T), the vertex set V is considered as the vocabulary, and the directed sequential transaction paths are taken as the corpus. It is well known that, in DeepWalk, the relationship strength of two vertices is determined by the frequency with which they occur in adjacent positions in random walks. Here, as defined in Definition 2, the strength of the relationship between any two vertices v_i and v_j is determined by weight w_ij of e_ij and w_ji of e_ji. Moreover, in G, the order of vertices passed by a transaction path is not random but time-related. Therefore, we propose a DirectedWalk algorithm for learning the network vectors of the transaction vertices. Each vertex can be seen as the start vertex of a directed walk with a maximum length of l. For the last visited vertex v, the walk passes over all of its directed neighbors, but only those that v has traded with within a time interval τ. The walk grows iteratively until the constraints are no longer satisfied.

Algorithm 1 (DirectedWalk)
Require: network G = (V, E, W, T); maximum walk length l; timestamp interval τ; window size w; embedding size d
Ensure: matrix of vertex representations Φ ∈ R^{|V|×d}
1: initialize the directed corpus NC to be empty
2: for v_i ∈ V do
…
Lines 2-18 in Algorithm 1 show the core of our algorithm. The outer loop specifies the paths that start with each vertex and generates a time-ordered sequence of the directed neighbor vertices. The state factor of a path p_i is initialized to True and reset to False once its last vertex has no directed neighbor vertex or its length reaches l. Once all of the paths reach their False states, the generation of the path set P_i starting at v_i is complete. As depicted in lines 8-16, for the last passed vertex v, the timestamp information is used to determine whether a vertex will be appended after v. In summary, for a fixed start vertex, the longer the weight vectors of the traversed directed edges, the more walks our DirectedWalk approach produces. Therefore, frequent traders are more likely to appear in the walks, and to appear within a window, with high probability.
As shown in line 21, the Skip-Gram model [25] is employed to maximize the co-occurrence probability F among the vertices in the time-ordered walks. Skip-Gram, under its independence assumption, iterates over all possible collocations in the directed corpus NC within the window w.
We finally obtain the optimal network vector representations of the vertices by using the same learning process of DeepWalk. DirectedWalk maps the directed and dynamic transaction network into a d-dimensional vector space. The vertices that contain similar local directed topological structures are mapped into adjacent vectors.
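The walk-generation step of DirectedWalk can be sketched as below for a single start vertex. This is a simplified interpretation of the procedure described above, not the paper's Algorithm 1: the function name, the edge-dict layout, and the rule "extend only along trades made within τ of the walk's arrival time" are our assumptions.

```python
def directed_walks(edges, start, max_len, tau):
    """Sketch of DirectedWalk's corpus generation for one start vertex.

    `edges` maps (u, v) to the time-ordered timestamps of u's payments
    to v. A walk ending at u is extended to every directed neighbour v
    that u traded with within `tau` time units of the walk's arrival at
    u; a path is finished when no such neighbour exists or its length
    reaches `max_len`.
    """
    active = [([start], None)]        # (path, arrival time of last vertex)
    finished = []
    while active:
        path, t_in = active.pop()
        last = path[-1]
        grown = False
        if len(path) < max_len:
            for (u, v), stamps in edges.items():
                if u != last:
                    continue
                # timestamps compatible with the walk's arrival time
                ok = [t for t in stamps
                      if t_in is None or (t >= t_in and t - t_in <= tau)]
                if ok:
                    active.append((path + [v], min(ok)))
                    grown = True
        if not grown:                 # the path's state factor turns False
            finished.append(path)
    return finished

# a->b at t=1; b->c at t=2 (within tau); b->d at t=10 (outside tau)
edges = {("a", "b"): [1], ("b", "c"): [2], ("b", "d"): [10]}
walks = directed_walks(edges, "a", max_len=3, tau=5)
```

The finished walks from all start vertices form the directed corpus NC that Skip-Gram consumes.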

Constructing the local topological feature matrix
Having calculated the Euclidean distance between v^t and each v_i^t ∈ N_v^(h), we store the top k nearest v_i^t in ascending order. This yields the local topological feature matrix T_v, whose rows are the network vectors v_0^t, v_1^t, . . ., v_k^t, where v_i^t (1 ≤ i ≤ k) is the network vector of v_i ∈ N_v^(h), and v_0^t denotes the network vector of v itself.
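The construction of T_v can be sketched as follows; the function name and the dict-based inputs are illustrative assumptions.

```python
import math

def topological_feature_matrix(v, vectors, neighbours, k):
    """Sketch of building T_v: row 0 is the network vector of v itself,
    rows 1..k are the k vertices in N_v^(h) whose network vectors are
    nearest to v's in Euclidean distance, in ascending order."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(neighbours, key=lambda u: dist(vectors[u], vectors[v]))
    return [vectors[v]] + [vectors[u] for u in ranked[:k]]

vectors = {"v": [0.0, 0.0], "a": [1.0, 0.0], "b": [3.0, 0.0], "c": [0.0, 2.0]}
T_v = topological_feature_matrix("v", vectors, ["a", "b", "c"], k=2)
```

The result is a (k + 1) × d matrix whose rows are directly usable as the feature map of Route 1.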

Creating the time series feature matrix
For each account, the time series records consist of incoming and outgoing transactions, and each record is composed of three kinds of information, i.e. the two accounts, a timestamp, and the transaction amount. In a real-life scenario, the latent information in its transaction sequence is helpful for determining which category an account belongs to.
According to Definition 2, given u, v ∈ V, the weights of e_uv and e_vu are denoted as w_uv and w_vu respectively; for instance, w_uv contains the transaction amounts from u to v in chronological order. Therefore, all the incoming transactions of v are represented as the set S_v^- = (w_1v, w_2v, . . ., w_nv), where n = Σ_{i∈N_v^(in)} |w_iv|, and its outgoing transactions are denoted as the set S_v^+ = (w_v1, w_v2, . . ., w_vm), where m = Σ_{j∈N_v^(out)} |w_vj|. Integrating the above two sets and sorting their elements by occurrence timestamp yields the time series vector v^s of v, of length N = n + m. In this paper, the time series feature matrix of vertex v is created from its top k nearest neighbor vertices, obtained in Sect. 2.2.2: the rows of the feature matrix W_v are the time series vectors of v and of those neighbors, each padded to d_w = max_{i∈[0,|V|]} (|v_i^s|), the maximum length of the time series vectors over V ∈ G; the empty positions in matrix W_v are padded with 0.
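The padding step above can be sketched as below; the function name and dict inputs are illustrative assumptions.

```python
def time_series_matrix(v, series, neighbours, d_w):
    """Sketch of building W_v: the rows are the time series vectors of v
    and of its top-k nearest neighbours (from the topological step), each
    right-padded with zeros up to the global maximum length d_w."""
    rows = [series[v]] + [series[u] for u in neighbours]
    return [row + [0.0] * (d_w - len(row)) for row in rows]

series = {"v": [1.0, 2.0], "a": [3.0]}
W_v = time_series_matrix("v", series, ["a"], d_w=3)
```

Because trading frequencies differ widely between accounts, most rows are much shorter than d_w, which is why W_v ends up sparse.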
Thus far, the local topological and time series information of bank accounts has been embedded into two feature matrices respectively. These two features are two types of heterogeneous data in the same matrix format, which can be used as the input of a CNN model.

The CNN classification model
In our earlier work, a bank account classification model, MHD-CNN, was presented, which splices the two matrices vertically into a concatenated matrix and then classifies the new matrix using a single CNN structure. In order to achieve better classification performance at lower computational complexity, a novel two-route CNN classification model, TRHD-CNN, is devised in this paper. This section begins with a review of the framework of MHD-CNN, summarizing the defects of the previous study. Then, the TRHD-CNN model is described in detail.

The MHD-CNN model
For any v ∈ V, the concatenated matrix H_v is formed, as in formula 4, by stacking T_v on top of W_v, where d_h = max{d, d_w} is the maximum of the column numbers of T_v and W_v; the empty positions of H_v are padded with 0. Obviously, H_v expresses the characteristics of vertex v comprehensively with two kinds of heterogeneous data. Taking H_v as input, the framework of MHD-CNN is depicted in Fig 5. Considering that each row of T_v within H_v, as defined in formula 1, is the network vector of a certain neighbor vertex, one row of H_v should be taken as the basic unit when constructing the convolution kernels. That means a convolution kernel of MHD-CNN should cover at least one full row of H_v. Similar to the n-gram technique in NLP, our convolution kernels are designed in n-vertex format, e.g., one-vertex, two-vertex and three-vertex. In this way, local shallow features relating to different numbers of vertices are extracted into high-level features. As depicted in Fig 5, the parameters of the convolutional layers are set as follows: • The number of types of convolution kernels: fc_1 and fc_2 are set as 3.
• The stride length for each convolution kernel is set as 1.
• The number of each type of convolution kernel: fk_1 and fk_2 are set as 30.
• The size of each convolution kernel: r_1 and r_2 are the same and are selected from {1, 2, 3}.
Although heterogeneous data are used to increase the discriminative classification features, MHD-CNN has some problems, listed as below.
• MHD-CNN applies the same types of convolution kernels to two kinds of heterogeneous data, so it cannot fully mine their unique characteristics.
• The integrated input matrix doubles the size of the model parameters, resulting in high computational complexity.
Therefore, we propose an improved version of MHD-CNN, which possesses a parallel convolution structure to extract two kinds of characteristics independently.

The TRHD-CNN model
The novel two-route CNN classification model TRHD-CNN employs different convolutional and pooling mechanisms on the two kinds of data. Its architecture is depicted in Fig 6.
The first route in the TRHD-CNN architecture.
The first route (i.e., Route 1 in Fig 6) extracts the local topological structure features of an account from its input matrix T_v. As described in Sect. 2.2.3, the row vector of T_v is the minimum unit for describing its local topological structure. Thus, the convolution structure of this route is similar to that of the MHD-CNN model, with the following processing steps.
The first route of the TRHD-CNN model, fed with T_v, employs the same architecture as the MHD-CNN model, except for the softmax output layer (see Fig 5). The vector X_1^re, part of the output of the convolutional component of TRHD-CNN, is used in the subsequent classification procedure. The structure of Process 1 is explained as follows.
1. Input layer. For any v ∈ V, the input data is the matrix T_v.
2. Convolutional layer. The input matrix T_v is used as the feature map of the first layer. The convolution kernels are the same as those used in MHD-CNN. For the l-th convolutional layer, the processing on each feature map is X_i^l = f(K_i^l * X^{l-1} + b_i^l) (formula 5), where f(·) is an activation function, the operator "*" denotes the convolution operation, X^{l-1} is an input feature map, K_i^l (1 ≤ i ≤ 3) is the i-th convolution kernel mentioned in Sect. 3.1, and b_i^l is the bias vector. Therefore, the number of kinds of output feature maps is Σ_i k_i times that of the input.
3. Pooling layer. The processing on a feature map of the l-th pooling layer is X_i^l = f(P_i^l · pool(X_i^{l-1}) + b_i^l) (formula 6), where pool(·) is a pooling function used to downsize the i-th input matrix X_i^{l-1}, and P_i^l and b_i^l denote the weight matrix and bias vector respectively. The number of feature maps is not changed by the pooling procedure.
4. Connection layer. In this layer, the input feature maps, each of size 1 × 1 and obtained from the last pooling layer, are concatenated into a one-dimensional vector X_1^cc.
5. Fully connected layer. This layer compresses the vector X_1^cc into a shorter vector X_1^re of length K_1 through a fully connected structure.
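The "n-vertex" convolution of step 2 can be illustrated with a minimal sketch; the function and variable names are ours, and the kernel values are arbitrary rather than learned.

```python
def n_vertex_conv(T, kernel, bias=0.0):
    """Sketch of one 'n-vertex' convolution from Route 1: the kernel spans
    n full rows of T (so a whole network vector is the basic unit), slides
    down one row at a time (stride 1), and each output passes a ReLU."""
    n, d = len(kernel), len(kernel[0])
    out = []
    for r in range(len(T) - n + 1):
        s = bias + sum(kernel[i][j] * T[r + i][j]
                       for i in range(n) for j in range(d))
        out.append(max(0.0, s))        # ReLU activation f(.)
    return out

T = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # tiny T_v with 3 row vectors
feature_map = n_vertex_conv(T, kernel=[[1.0, 1.0], [1.0, 1.0]])  # 2-vertex kernel
```

Because the kernel is as wide as a full row, each output value aggregates n whole neighbor vectors, mirroring the n-gram analogy in the text.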

The second route in the TRHD-CNN architecture.
The second route (i.e., Route 2 in Fig 6) is primarily used to extract discriminative classification features from the time series feature matrix of a vertex v, denoted as W_v. The difference between W_v and T_v is that the minimum feature representation unit of T_v is a row vector, while for W_v it is a single data element. Therefore, a new kind of convolution structure, different from that of the first route, is needed to extract more useful information. The second route of TRHD-CNN is designed as follows and shown in Fig 7.
Fig 6. The architecture of TRHD-CNN. Features in T_v are mapped into vector X_1^re by Process 1, and those in W_v are mapped into X_2^re through Process 2 similarly. After being concatenated into X^mer, the features from X_1^re and X_2^re are compressed into a shorter vector X^fu by a fully connected layer. The output layer can be seen as a classifier that maps the features of X^fu onto the two category labels. https://doi.org/10.1371/journal.pone.0220631.g006
1. Input layer. For any vertex v, the input data is its time series feature matrix W_v. The length of the row vectors of W_v is set to the length of the longest transaction sequence in network G. W_v is a sparse matrix, because the trading frequencies of accounts are quite different.
2. Convolutional layer and pooling layer. As shown in Fig 7, the structure, two convolutional layers each followed by a pooling layer, is built with the following details. The trading behaviors of different accounts are independent of each other, and are hidden in their time series transaction amounts. Therefore, the convolution kernels of the first convolutional layer, c_1 kinds in total, are created in the form 1 × n_1, n_1 ∈ N_1. Taking the matrix W_v as input, this first layer generates |N_1| × k_1 feature maps by applying k_1 convolution kernels of each kind to each input feature map. In order to extract the trading characteristics of a single vertex, the pooling window is created in 1 × p_1 format, which means a pooling window does not slide over adjacent vertices. Similar to the first one, the second convolutional layer is also built in the multi-kernel structure: c_2 kinds of convolution kernels are utilized, each of shape 1 × n_2, n_2 ∈ N_2. Aiming to describe the local region features of a feature map, we employ pooling windows of shape m × n in the second pooling layer, where m and n equal the row and column counts of the corresponding input feature map respectively. The processing of the convolutional layers and the pooling layers corresponds to formula 5 and formula 6 in Sect. 3.1, respectively.
3. Fully connected layer. There are two fully connected layers in this route. The first connects the |N_1| × k_1 × |N_2| × k_2 feature maps, generated by the second convolution and pooling stage, into a one-dimensional vector X_2^cc. Subsequently, another fully connected layer is constructed to compress vector X_2^cc and balance the effects of each route: X_2^cc is reduced to X_2^re of length K_2, which is then used as part of the input for the succeeding classification process.
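The 1 × n kernels of Route 2 can be sketched as below. The names and kernel values are illustrative assumptions; the point is that the kernel slides along each row separately, so transactions of different accounts never mix.

```python
def row_conv(W, weights, bias=0.0):
    """Sketch of a 1 x n convolution from Route 2: the kernel slides along
    each row of W_v independently (stride 1), and each output passes a
    ReLU, so per-account time series features stay separate."""
    n = len(weights)
    out = []
    for row in W:
        feat = [max(0.0, bias + sum(weights[i] * row[j + i] for i in range(n)))
                for j in range(len(row) - n + 1)]
        out.append(feat)
    return out

W = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]     # tiny W_v with 2 accounts
maps = row_conv(W, weights=[1.0, 1.0])     # a 1 x 2 kernel
```

The 1 × p_1 pooling windows described above keep this per-row separation for the same reason.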

Concatenating the two routes of TRHD-CNN.
This part describes the classification component of TRHD-CNN, including three portions: concatenation portion, compression portion and classification portion.
1. The concatenation portion. The vectors X_1^re and X_2^re, coming from the first route and the second route, are concatenated into the vector X^mer = (X_1^re, X_2^re).
2. The compression portion. A fully connected layer is used to compress X^mer into X_f^re.
3. The classification portion. This portion acts as the output layer of TRHD-CNN, outputting the category of a vertex v from X_f^re.
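The three portions can be sketched end to end. The weights and biases below are hypothetical stand-ins for learned parameters, and the linear-plus-softmax output layer is our assumption about the classifier's form.

```python
import math

def classify(x_re1, x_re2, weights, biases):
    """Sketch of the classification component: X_mer concatenates the two
    route outputs, a linear layer produces one logit per category, and a
    softmax output layer yields the category probabilities."""
    x_mer = list(x_re1) + list(x_re2)           # concatenation portion
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x_mer)) + b
              for w, b in zip(weights, biases)]  # compression/output weights
    m = max(logits)                              # numerically stable softmax
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two categories (normal / illegal), toy parameters.
probs = classify([2.0], [0.0], weights=[[1.0, 0.0], [0.0, 1.0]], biases=[0.0, 0.0])
```

The predicted label is simply the index of the largest probability.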

Environmental settings
A deep learning server with an NVIDIA Tesla P100 GPU and 128 GB of memory is adopted to speed up the training of the CNN models. TensorFlow is selected to train the MHD-CNN and TRHD-CNN models.

Data set
In recent years, we have been exploring computational models to classify bank accounts for combating illegal pyramid selling. The department of economic investigation provided us with plenty of transaction data from real bank accounts. An instance of the transaction records is shown in Fig 2. An account holds many transaction records, each of which includes the bilateral transaction accounts, timestamp, amount of money, transaction direction, etc. We sample the transaction records belonging to 10145 bank accounts to form the dataset for training our model. There are 9270 normal accounts and 875 accounts involved in an MLM organization. As shown in Table 2, the number of transaction records generated by the normal accounts runs up to 6732730, and the fraud records created by MLM members amount to 275804 rows. These MLM members are manually annotated as "illegal" by economic investigators. Before training the models, we filtered out some noisy data, i.e. we deleted the duplicate records, incomplete records, and records whose transaction amounts are no more than 50. In this way, 1371914 records are filtered out from the set of normal accounts' transaction records and 91341 records created by illegal accounts are deleted. In total, more than 5 million transaction records are used after denoising. In our experiments, all the accounts are mixed and a five-fold cross validation strategy is adopted to evaluate the classification effectiveness of our TRHD-CNN model. That means the dataset is divided into 5 portions, and each time the model is trained with four portions as the training set and the remaining portion as the testing set. In other words, the proportion of training accounts to testing accounts is 4:1 in the training process.
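The five-fold protocol described above can be sketched as follows; the function name and the round-robin fold assignment are illustrative choices, not the paper's exact split procedure.

```python
import random

def five_fold_splits(accounts, seed=0):
    """Sketch of five-fold cross validation: accounts are shuffled into 5
    folds, and each fold serves once as the test set while the remaining
    four form the training set (a 4:1 ratio)."""
    items = list(accounts)
    random.Random(seed).shuffle(items)
    folds = [items[i::5] for i in range(5)]
    for i in range(5):
        test = folds[i]
        train = [a for j, fold in enumerate(folds) if j != i for a in fold]
        yield train, test

splits = list(five_fold_splits(range(10)))
```

Reported scores are then averaged over the five train/test rounds.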
To obtain the optimal model, the back-propagation method [26] is used to update the model parameters. To speed up the convergence of TRHD-CNN, we adopt the Mini-batch Gradient Descent (MBGD) method [27] in the iteration process; training stops once the change in the objective function stabilizes at its minimum. Note that the parameters of the two routes are updated by joint training.
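The data feed for MBGD can be sketched as below; the function name and batch layout are our assumptions, and the actual gradient step is left to the training framework.

```python
import random

def mini_batches(samples, batch_size, seed=0):
    """Sketch of the MBGD data feed: the training set is shuffled and cut
    into small batches; one gradient step is taken per batch rather than
    per full pass, which speeds up convergence."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

batches = mini_batches(range(10), batch_size=4)
```

Joint training means both routes' parameters receive gradients from the same per-batch loss.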

Experimental settings
For our TRHD-CNN model, the optimal parameters are selected from numerous experimental settings. The selection ranges of these parameters are determined by default values and certain characteristics of the real dataset.

Common parameters of MHD-CNN and TRHD-CNN.
Experimental results show that the two CNN models converge within 100 epochs, so we set the training termination condition to a fixed epoch count of 100. In our experiments, the learning rate is dynamically adjusted during each iteration to accelerate convergence, i.e. the learning rate is selected from the set {0.01, 0.05, 0.09, 0.13, 0.17, 0.21, 0.25} in descending order according to the training errors. In these two models, L2 regularization and dropout are introduced to avoid over-fitting. These two parameters are assigned the default values used in the relevant study [28], i.e. the coefficient of the L2 term and the dropout rate are set to 10e-4 and 0.5, respectively.

Parameters of MHD-CNN.
The maximum length of the directed edge weight vectors (i.e. d_h) is derived from the transaction network built on the real dataset; it equals 1359. To achieve optimal classification performance, the MHD-CNN model adopts the ReLU activation function in its convolutional layers and max pooling in its pooling layers.

Parameters of TRHD-CNN. In this part, we list the parameters of each component of TRHD-CNN in turn.
A. Parameters of the convolutional component: In the first route of the TRHD-CNN model, the parameter values and the selected activation functions are consistent with those of the MHD-CNN model. For the second route, the parameter settings are listed in Table 3. In its first convolutional layer, the number of convolution kernel shapes c1 and the number k1 of kernels of each shape are set to 4 and 100, respectively, and the c1 kernel shapes are {1 × n1, n1 = 2, 3, 4, 5}. The width p1 of the pooling window of the first pooling layer is set to 3. Similarly, in the second convolutional layer, c2 and k2 are set to 3 and 50, respectively, and the c2 kernel shapes are {1 × n2, n2 = 2, 3, 4}. The following pooling layer adopts a pooling window of shape m × n, where m and n are the number of rows and columns of the corresponding input feature map, respectively.
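The feature-map widths implied by these settings can be traced with simple shape arithmetic. The sketch below assumes "valid" convolution and non-overlapping pooling, and uses a hypothetical input width W = 64 (the true width of the second-route feature matrix is dataset-dependent); taking the minimum branch width before the second layer is likewise a simplification.

```python
def conv1xn_out(width, n):
    """'Valid' convolution of a length-`width` row with a 1×n kernel."""
    return width - n + 1

def pool_out(width, p):
    """Non-overlapping pooling with a window of width p (floor division)."""
    return width // p

# Hypothetical input width W for one row of the second-route feature matrix.
W = 64
# First conv layer: c1 = 4 kernel shapes (1×2 .. 1×5), k1 = 100 kernels each.
after_conv1 = {n: conv1xn_out(W, n) for n in (2, 3, 4, 5)}
# First pooling layer: window width p1 = 3.
after_pool1 = {n: pool_out(w, 3) for n, w in after_conv1.items()}
# Second conv layer: c2 = 3 kernel shapes (1×2 .. 1×4); for simplicity we
# feed it the narrowest branch width from the first stage.
after_conv2 = {n2: conv1xn_out(min(after_pool1.values()), n2) for n2 in (2, 3, 4)}
# Final pooling uses a window spanning the whole map, yielding a 1×1 output.
final = {n2: pool_out(w, w) for n2, w in after_conv2.items()}
```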
B. Parameters of the classification component: The experimental results show that the concatenated vector achieves the best results in classifying bank accounts when K1 and K2 are set to 100 and 50, respectively.

Analysis of the classification performance
In this section, we analyze the classification performance of the proposed CNN models from two aspects: evaluating the impact of different parameters, and comparing the classification performance with that of other models.

Evaluating different architecture parameters of TRHD-CNN.
We test the influence of the values of parameters K1 and K2, and of the number of convolutional layers, on the classification performance of TRHD-CNN.
(1) Tuning the values of K1 and K2. In this section, we tune the parameters K1 and K2 to obtain the best classification results for TRHD-CNN. The details are shown in Fig 8. In the legend, first-route refers to the parameter K1 and second-route corresponds to K2. AUC denotes the area under the ROC curve; as is well known, the greater the AUC value, the better the classification performance of the model. As shown in Fig 8, the TRHD-CNN model achieves the best results with K1 = 100 and K2 = 50, so we adopt these settings in TRHD-CNN. The result indicates that the topological network data has a greater impact on classification than the time series data, which further illustrates the advantage of our two-route framework.
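The AUC has a useful probabilistic reading: it equals the probability that a randomly chosen positive (illegal) account receives a higher score than a randomly chosen negative (normal) one, with ties counting one half. A minimal stdlib computation of this quantity:

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a positive outscores a negative
    (ties count one half); equivalent to the area under the ROC curve."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

In practice one would use a library routine (e.g. scikit-learn's `roc_auc_score`), which computes the same quantity efficiently via ranking.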
(2) Adding extra convolutional layers. We find that the TRHD-CNN model obtains the best results when the first route contains one convolutional layer followed by a pooling layer, while the second route contains two convolutional layers, each followed by a pooling layer. The reason why adding more convolutional layers brings no significant improvement in classification performance may be that the features are already well captured by the shallow network structure of the TRHD-CNN model.

Comparing the classification performances of different models.
In this section, three groups of experiments are carried out on three models, the IForest method [29], the MHD-CNN model and the TRHD-CNN model, to test their performance in classifying bank accounts.
Note that we tested several traditional abnormal-account detection methods based on statistical features in our prior research [29], which focuses on detecting abnormal bank accounts by applying traditional classifiers to three novel kinds of behavioral features. In [29], the classification problem on an imbalanced dataset is treated as an abnormal detection problem. The classification features, extracted from the financial time series transaction records, are grouped into three categories: transaction statistical features, network behavioral features and periodic behavioral features. A comparative analysis of several classical classifiers, one-class SVM, isolation forest (IForest), local outlier factor (LOF) and robust covariance (RC), is made in [29]. Given a predefined outlier fraction, the experimental results show that IForest achieves the best F1-score, i.e. 58%. The traditional abnormal-account detection methods are thus clearly uncompetitive with the methods considered here. Therefore, in the remainder of this paper, the performance of the IForest method is regarded as a benchmark, and the other two models are tested with different parameter values as described in Sect. 4.3. In particular, for the IForest method, three values of a necessary threshold parameter, the outlier fraction, i.e. 0.01, 0.08 and 0.1, are tried in our experiments, and the value that achieves the best result on the training set is selected. The classification results of the three models on the dataset are shown in Table 4, where the optimal results of MHD-CNN and TRHD-CNN are exhibited.
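The effect of the predefined outlier fraction can be sketched without the forest itself: given anomaly scores, the fraction simply determines how many of the highest-scoring accounts are flagged (in scikit-learn's `IsolationForest` this corresponds to the `contamination` parameter). The scores below are hypothetical.

```python
import math

def flag_outliers(scores, outlier_fraction):
    """Flag the ceil(fraction * N) highest-scoring accounts as outliers.
    `scores` maps account id -> anomaly score (higher = more anomalous)."""
    n_out = math.ceil(outlier_fraction * len(scores))
    ranked = sorted(scores, key=scores.get, reverse=True)
    flagged = set(ranked[:n_out])
    return {acc: (acc in flagged) for acc in scores}
```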
As shown in Table 4, both CNN models achieve better results than the traditional statistical method (i.e. IForest), demonstrating the superiority of deep learning methods in mining classification features. The reason may be that the statistical features are manually extracted and remain static during the classification process, whereas the classification procedure of a CNN model contains a feedback process that modifies the adopted classification features dynamically. Moreover, the MBGD algorithm, a back-propagation method used to update the parameters, ensures that the CNN models converge to the minimum-loss state. Therefore, a CNN always chooses the most effective features for the specific classification task and learns the optimal classification parameters automatically. The improvement of the CNN models is not intuitively obvious, given the generally high accuracies of the three methods. It should be noted that recall is the more important evaluation indicator in a fraud identification scenario; moreover, the high accuracy is caused by the imbalance between the numbers of illegal and normal samples. Table 5 gives details of the methods' classification results on the test set, where true positive, false negative, false positive and true negative are denoted as TP, FN, FP and TN, respectively. Each time, one fifth of the dataset is randomly selected as the test set, containing 175 MLM accounts and 1854 normal accounts. As presented in Table 5, 144 MLM accounts are identified by TRHD-CNN, compared with the 100 recognized by IForest. TRHD-CNN also makes a significant improvement over MHD-CNN in TP.
On the whole, TRHD-CNN is higher than MHD-CNN in precision, recall and F1-score, indicating that using different types of convolution kernels helps extract more useful features.
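These indicators follow directly from the confusion-matrix counts. The sketch below uses the test-set figures above (175 MLM accounts, 144 identified by TRHD-CNN, so FN = 31); the FP count is hypothetical, as Table 5's full contents are not reproduced here.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 175 MLM accounts in the test set; TRHD-CNN identifies 144 (TP), so FN = 31.
# The FP count of 10 is hypothetical, for illustration only.
p, r, f1 = prf(tp=144, fp=10, fn=175 - 144)
```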
The results in Table 4 show that TRHD-CNN clearly outperforms the other methods in accuracy and recall. To further investigate the superiority of TRHD-CNN, we perform a paired-samples t-test (using SPSS) for statistical significance. Five-fold validation was performed four times on the data, and the accuracy results on the test set are used for the test; IForest and MHD-CNN are each compared with TRHD-CNN. The results are shown in Table 6. By the definition of the probability value (p-value), the smaller the p-value, the more significant the difference; in particular, the difference is regarded as statistically highly significant when the p-value is smaller than 0.01. Based on the p-values in Table 6, we reject the null hypothesis, indicating a definite improvement provided by TRHD-CNN.
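The test statistic behind this comparison is simple to reproduce: with per-fold accuracy differences d, t = mean(d) / (sd(d) / sqrt(n)). The sketch below computes the statistic from hypothetical fold accuracies (the paper's actual fold-level numbers are not reproduced here); feeding the same paired samples to `scipy.stats.ttest_rel` would also yield the p-value.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(sample_a, sample_b):
    """Paired-samples t statistic on per-fold accuracy pairs:
    t = mean(d) / (sd(d) / sqrt(n)), where d = a_i - b_i."""
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))
```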
In testing the proposed CNN models, we adopt the method of dynamically selecting learning step sizes to accelerate the convergence of the learning model. Fig 9 shows the changes of the training and testing errors of the CNN models as the number of iterations increases.
The x-axis and the y-axis represent the number of iterations and the cost values of the objective function, respectively. As shown in Fig 9, as the number of iterations increases, all curves drop sharply at the beginning, then decline more slowly, and finally converge to stable values. The training and testing errors of the TRHD-CNN model are generally lower than those of the MHD-CNN model, because the TRHD-CNN model extracts more useful classification features. In addition, with the strategy of dynamically selecting the step size, the fluctuations of the cost curves almost disappear after a certain number of iterations (beginning at the 80th iteration). Hence, reducing the step size after a specific number of iterations helps the objective function converge to an optimal value correctly and rapidly.
In summary, the experimental results indicate that the existing methods cannot efficiently exploit the information embedded in the time series transaction data. In contrast, the TRHD-CNN model, which employs a two-route convolution structure to extract the specific classification features, gains a significant advantage over the other algorithms. Its superiority in the recall indicator demonstrates the complementarity of local topological structure features and time series transaction features in describing the category of an account. That said, TRHD-CNN is time-consuming when learning the classification features of a newly arrived account.

Conclusion
This paper addresses the CBA problem by using features derived separately from two kinds of heterogeneous data. It is found that the two kinds of data are complementary in describing the classification category of an account. This paper develops a two-route TRHD-CNN model, which adopts a divide-and-conquer strategy to train on the two kinds of input matrices simultaneously, and employs a two-route convolution structure to extract the specific classification features. During training, TRHD-CNN utilizes local topological features to describe the relationship information and time series transaction features to capture the spending behavior. This paper also develops a DirectedWalk algorithm to learn the network vectors of vertices in a directed and dynamic network. The experimental results on a real dataset show that TRHD-CNN has a significant advantage over the other algorithms.

Future work
In our research, we have found that sub-sequences of the time series transaction records are of great significance for identifying the category of an account. Therefore, in future research, we intend to use phase space theory to extract sub-sequence features to improve the classification results. In addition, the Long Short-Term Memory (LSTM) model has been shown to be well suited to classifying data by their time series characteristics; how to combine LSTM with the processing of time series transaction data to improve the TRHD-CNN model is another problem we will address. Finally, the real bank account dataset is huge in volume but rare in labelled samples, so we also aim to adopt a semi-supervised strategy to improve the TRHD-CNN model.
In summary, research on account classification using transaction data is still in its infancy. It remains our ongoing goal to explore highly efficient classification models based on transaction data.