Development of stock correlation networks using mutual information and financial big data

Stock correlation networks use stock price data to explore the relationships between different stocks listed in the stock market. Currently this relationship is predominantly measured by the Pearson correlation coefficient. However, financial data suggest that nonlinear relationships may exist between the prices of different stocks. To address this issue, this work uses mutual information to characterize the nonlinear relationship between stocks. Using 280 stocks traded at the Shanghai Stock Exchange in China during the period 2014-2016, we first compare the effectiveness of the correlation coefficient and mutual information for measuring stock relationships. Based on these two measures, we then develop two stock networks using the Minimum Spanning Tree method and study the topological properties of these networks, including degree, path length and the power-law distribution. The relationship network based on mutual information has a better degree distribution and a larger power-law exponent than the network based on the correlation coefficient. Numerical results show that mutual information is a more effective measure of stock relationships than the correlation coefficient in a stock market that may undergo large fluctuations of stock prices.


Introduction
Complex network analysis in recent years has become a powerful tool to investigate challenging problems in a wide range of research areas. A complex network is defined as a system with a large number of nodes and relationships between these nodes [1]. A variety of methods have been applied to study complex networks in biology, social sciences, finance and engineering. Among them, the stock network is an important financial system [2]. Each node in a stock network stands for a stock, and the edge connecting a pair of stocks represents the correlation between the prices of these two stocks. The stock networks have been used to observe and analyze the dynamics of the stock market as well as make predictions of future prices [3].
To build stock networks, the commonly used algorithms include the Minimum Spanning Tree (MST) [4], the Planar Maximally Filtered Graph (PMFG) [5,6], and the Correlation independent component analysis [28] and the analysis for both small and high-dimensional data sets [29][30][31]. Mutual information comes from Shannon's entropy theory and is closely tied to Shannon's entropy. However, mutual information is not always easy to estimate, and its estimation is therefore an important problem in information theory [32,33]. Pluim et al. gave an algorithm to compute mutual information for high-dimensional variables and applied it to medical image registration [34]. It has been shown that the network based on mutual information could replace the network using the correlation coefficient [35]. Although mutual information has recently been used to develop genetic regulatory networks [29,35], the stock network based on mutual information is still at an early stage of development. Only the partial mutual information and the mutual information rate have been compared with the correlation coefficient for developing stock networks [36,37].
To address the issue of nonlinear correlation, this work proposes a novel framework to develop stock networks by using mutual information. The stock price data from the Shanghai Stock Exchange (SSE) are used to demonstrate the effectiveness of this new approach. The remaining part of this paper is organized as follows. Section 2 discusses the computation of mutual information and the MST for developing stock networks. In Section 3, we develop two stock networks using mutual information and the Pearson correlation coefficient, respectively, and then study the topological properties of these networks.

Mutual information
Mutual information from entropy theory is a generalized correlation measurement. According to Shannon's entropy theory [32], the entropy of a discrete random variable $X$ is defined by

$$H(X) = -\sum_{i} p(x_i) \log p(x_i), \tag{1}$$

where $p(x_i)$ is the probability distribution of $X$. Entropy is used to measure the uncertainty of a random variable, which is equivalent to the quantity of information it owns. For two-dimensional random variables $(X, Y)$, the joint entropy is given by

$$H(X, Y) = -\sum_{i} \sum_{j} p(x_i, y_j) \log p(x_i, y_j), \tag{2}$$

where $p(x_i, y_j)$ is the joint probability distribution of $(X, Y)$. The mutual information of $X$ and $Y$ is then defined by

$$I(X; Y) = H(X) + H(Y) - H(X, Y), \tag{3}$$

which can be interpreted as the information that $X$ and $Y$ share. In addition, mutual information can be defined as

$$I(X; Y) = H(X) - H(X|Y), \tag{4}$$

where $H(X|Y)$ is the conditional entropy of $X$ under the condition $Y$, which is defined as

$$H(X|Y) = -\sum_{i} \sum_{j} p(x_i, y_j) \log p(x_i | y_j), \tag{5}$$

where $p(x_i | y_j)$ is the conditional probability. In this definition, mutual information is regarded as the uncertainty of random variable $X$ removed under the condition $Y$. Mutual information $I(X; Y) = 0$ holds if and only if $X$ and $Y$ are independent. We can normalize mutual information into the interval $[0, 1]$ by using

$$\mathrm{NMI}(X; Y) = \frac{2 I(X; Y)}{H(X) + H(Y)}. \tag{6}$$

From the above definitions, we need the probability distributions to compute mutual information exactly. Since it is difficult to obtain such distributions for complex problems, we use a numerical method to compute the mutual information of stock returns [29]. Considering a network of $n$ stocks with prices in $d$ trading days, denote $P_{i,t}$ and $R_{i,t}$ as the closing price and log-return of stock $i$ at day $t$, respectively, given by

$$R_{i,t} = \ln P_{i,t} - \ln P_{i,t-1}. \tag{7}$$

Dividing the range $[\min R_{i,t}, \max R_{i,t}]$ uniformly into $k$ bins and denoting $f_{i,q}$ as the frequency of log-returns of stock $i$ falling into bin $q$, the probability distribution is substituted with

$$p_{i,q} \approx \frac{f_{i,q}}{d}, \quad (i = 1, \ldots, n;\ q = 1, \ldots, k), \tag{8}$$

and the entropy of stock $i$ is then approximated by

$$H_i \approx -\sum_{q=1}^{k} \frac{f_{i,q}}{d} \log \frac{f_{i,q}}{d}. \tag{9}$$

To compute the joint entropy of stocks $i$ and $j$, we uniformly divide the rectangle of log-returns $[\min R_{i,t}, \max R_{i,t}] \times [\min R_{j,t}, \max R_{j,t}]$ into $k \times k$ bins. Denote $f_{i,j,q,r}$ as the frequency of joint log-returns falling into the bin $(q, r)$, which can substitute the joint probability distribution with

$$p_{i,j,q,r} \approx \frac{f_{i,j,q,r}}{d}, \quad (i, j = 1, \ldots, n;\ q, r = 1, \ldots, k). \tag{10}$$

The joint entropy of stocks $i$ and $j$ can be approximately computed by

$$H_{i,j} \approx -\sum_{q=1}^{k} \sum_{r=1}^{k} \frac{f_{i,j,q,r}}{d} \log \frac{f_{i,j,q,r}}{d}, \tag{11}$$

and the mutual information of stocks $i$ and $j$ is estimated by

$$I_{i,j} \approx H_i + H_j - H_{i,j}. \tag{12}$$

When computing the normalized mutual information by using (11) and (12), we can choose a different number of bins. To test the influence of the bin number on the value of mutual information, we calculate the value using 10×10, 15×15 and 20×20 bins. For the same stock pair, the largest difference between the values obtained with 10×10 and 15×15 bins is 0.0073, and that between 10×10 and 20×20 bins is 0.0107. This result shows that once the bin number is adequately large, any further increase of the bin number has little influence on the accuracy of mutual information. Thus, we use 10×10 bins in this study.
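The histogram-based estimate described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `log_returns`, `entropy_from_counts` and `normalized_mutual_information` are hypothetical, the natural logarithm is assumed (the normalized value does not depend on the base), and empty bins are skipped under the usual convention 0 log 0 = 0.

```python
import numpy as np

def log_returns(prices):
    """Log-returns R_t = ln P_t - ln P_{t-1} of one closing-price series."""
    return np.diff(np.log(np.asarray(prices, dtype=float)))

def entropy_from_counts(counts):
    """Shannon entropy (natural log) of a histogram; empty bins are skipped."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log(p)))

def normalized_mutual_information(r_i, r_j, k=10):
    """Estimate NMI of two return series from a k x k equal-width histogram."""
    joint, _, _ = np.histogram2d(r_i, r_j, bins=k)
    h_i = entropy_from_counts(joint.sum(axis=1))   # marginal of series i
    h_j = entropy_from_counts(joint.sum(axis=0))   # marginal of series j
    h_ij = entropy_from_counts(joint.ravel())      # joint entropy
    if h_i + h_j == 0.0:                           # both series constant
        return 0.0
    return 2.0 * (h_i + h_j - h_ij) / (h_i + h_j)
```

Taking the marginal counts from the rows and columns of the joint histogram keeps the three entropies consistent with one another, so a series compared with itself gives a value of 1 up to rounding.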
On the other hand, the correlation coefficient of stocks $i$ and $j$ is computed by

$$\rho_{i,j} = \frac{\sum_{t} (R_{i,t} - \bar{R}_i)(R_{j,t} - \bar{R}_j)}{\sqrt{\sum_{t} (R_{i,t} - \bar{R}_i)^2 \sum_{t} (R_{j,t} - \bar{R}_j)^2}},$$

where $\bar{R}_i$ is the average log-return of stock $i$ over $d$ trading days. In a network, the distance between nodes must be given by a metric. In the network based on the Pearson correlation coefficient, a usual metric is

$$d_{i,j} = \sqrt{2 (1 - \rho_{i,j})}.$$

We can verify that it is non-negative and symmetric and that it satisfies the triangle inequality. In addition, this metric has a normalized version. Similarly, a distance between stocks $i$ and $j$ is defined for the stock network.
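As a hedged sketch of how the distance matrices can be formed, the code below uses the widely used correlation metric d = sqrt(2(1 - ρ)). The function names are hypothetical, and the mutual-information distance 1 - NMI is an assumption made here for illustration only; it is not taken from the text above.

```python
import numpy as np

def correlation_distance(returns):
    """returns: array of shape (n_stocks, n_days). Gives d_ij = sqrt(2(1 - rho_ij))."""
    rho = np.corrcoef(np.asarray(returns, dtype=float))
    # Clip guards against tiny negative values caused by rounding when rho is about 1.
    dist = np.sqrt(np.clip(2.0 * (1.0 - rho), 0.0, None))
    np.fill_diagonal(dist, 0.0)
    return dist

def nmi_distance(nmi_matrix):
    """ASSUMPTION: distance 1 - NMI, so strongly dependent stocks are close."""
    dist = 1.0 - np.asarray(nmi_matrix, dtype=float)
    np.fill_diagonal(dist, 0.0)
    return dist
```

With this correlation metric, perfectly correlated stocks are at distance 0 and perfectly anti-correlated stocks at distance 2.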

Minimum Spanning Tree
We will use the MST method to build the stock network. Here, a graph is denoted by $G = (V, E)$, where $V = \{v_1, \ldots, v_n\}$ is the set of nodes, $E = \{e_1, \ldots, e_m\}$ is the set of edges, and the edge $(v_i, v_j)$ connects nodes $v_i$ and $v_j$. If the edges are undirected, the graph is called an undirected graph. A path is a graph that has finitely many nodes in which each edge connects two adjacent nodes. If the nodes belonging to a path are all different, the path is called a simple path. If the two endpoints are equal, the path is called a loop. When each edge has a weight, the graph is called a weighted graph. For an unweighted graph, the length of a path is the number of edges. For a weighted graph, the length of a path is the sum of the weights. In an undirected graph, if there is a path linking endpoints $v_i$ and $v_j$, these endpoints are said to be connected. If any two nodes are connected, the graph is connected. A tree is a connected acyclic graph. An MST is a spanning tree with a minimal sum of weights. For the stock network, we use the distance between two stocks as the weight of an edge.

There are two popular algorithms for constructing an MST. The Kruskal algorithm ranks the weights of the edges in ascending order and adds the next edge with the smallest weight if this addition does not create a cycle. The complexity of the Kruskal algorithm is $O(m \log m)$, where $m$ is the number of edges. On the other hand, the Prim algorithm grows the spanning tree from a given node and iteratively adds the shortest edge from a node already in the tree to a node that has not been reached yet, until all the nodes are reached. The complexity of the Prim algorithm is $O(n^2)$, where $n$ is the number of nodes. Generally, the Kruskal algorithm is suitable for sparse networks, while the Prim algorithm is better for dense networks.
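For comparison, the Kruskal procedure described above can be sketched with a union-find structure to detect cycles. This is an illustrative sketch only; the function name `kruskal_mst` and the `(weight, u, v)` edge format are assumptions, with nodes labelled 0 to n-1.

```python
def kruskal_mst(n, edges):
    """edges: iterable of (weight, u, v). Returns the MST as a list of (u, v, weight)."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):        # ascending order of weight
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                  # the edge joins two components: no cycle
            parent[root_u] = root_v
            tree.append((u, v, weight))
            if len(tree) == n - 1:            # a spanning tree has n - 1 edges
                break
    return tree
```

Sorting dominates the running time, which is where the O(m log m) cost comes from.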
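In this work, we use the Prim algorithm to construct stock networks. Suppose that $G(V, E)$ is a weighted undirected connected graph with $n$ nodes. The MST, denoted as $T(TV, TE)$, is constructed by:
1. Initialize $TV$ with a chosen starting node and set $TE$ to the empty set.
2. Among the edges $(u, v)$ with $u$ in $TV$ and $v$ not in $TV$, select the shortest one.
3. If adding this edge does not create a cycle, add $v$ into $TV$ and add $(u, v)$ to $TE$. Otherwise, reject this edge and then consider the next shortest edge.
4. Repeat steps 2 and 3 until $TV$ contains all $n$ nodes.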
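A minimal sketch of the Prim construction on a dense distance matrix is given below, assuming a symmetric NumPy array `dist` of pairwise stock distances. The function name `prim_mst` and the `start` parameter are hypothetical. Because each new edge always leads from the tree to a node outside it, a cycle cannot arise and no explicit cycle check is needed.

```python
import numpy as np

def prim_mst(dist, start=0):
    """dist: symmetric (n, n) distance matrix. Returns MST edges as (u, v, weight)."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[start] = True
    best = dist[start].copy()          # shortest known distance from the tree to each node
    parent = np.full(n, start)         # tree node that realises that distance
    edges = []
    for _ in range(n - 1):
        candidates = np.where(in_tree, np.inf, best)
        v = int(np.argmin(candidates))             # closest node not yet reached
        u = int(parent[v])
        edges.append((u, v, float(dist[u, v])))
        in_tree[v] = True
        closer = (~in_tree) & (dist[v] < best)     # nodes now nearer via v
        best[closer] = dist[v][closer]
        parent[closer] = v
    return edges
```

Each of the n - 1 iterations scans all nodes once, which gives the O(n²) cost quoted for dense networks.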

Chinese stock market
There are more than 2000 companies traded at the Shanghai Stock Exchange (SSE). In this work, we consider a subsystem that is related to the real estate industry. Currently the real estate industry is a very important part of the market economy in China. A number of stocks in the financial, banking and chemical sectors have much influence on the stock market. We choose stocks from companies related to the real estate, chemical industry, automobile, banking, building materials, cement, non-banking financial, as well as iron and steel sectors. We remove the stocks that have poor business performance and face the risk of delisting. The Chinese stock market is a growing market. Each year a number of companies are added to the market. Thus we cannot use the market data over a long time period. Otherwise, a proportion of stocks will have to be excluded from our study because of the incompleteness of data. We

Comparison of robustness of two measures
We first test the robustness of the correlation coefficient and mutual information for measuring stock prices with large variations. For each measure, we first calculate five values based on the stock prices in the four time periods as well as the prices in the whole time period. Then we calculate the standard deviation (STD) of these five values. To remove the influence of the mean, we further calculate the Fano factor, given by

$$F = \frac{\sigma^2}{\mu},$$

where $\mu$ and $\sigma^2$ are the mean and variance of the five values for each stock pair. Fig 1D shows that the range of the Fano factor values for mutual information is much smaller than that of the correlation coefficient in Fig 1C. These results suggest that mutual information is a more robust measure than the correlation coefficient for stock price data with large variations.

In the second type, the stock pairs have large values of mutual information but small values of the correlation coefficient. These stocks can be further divided into two major groups. In the first group, stock prices change with large volatilities. For example, Southwest Securities (600369) and Industrial Securities (601377) in Fig 3C have similar fluctuation trends, but are not linearly dependent. The stock price of Southwest Securities declined nearly vertically from its highest price of 25 Chinese Yuan. Its price trend is consistent with the Shanghai Composite Index. This highly nonlinear correlation measured by mutual information cannot be expressed well by the correlation coefficient. In the second group, companies had rationed their shares before the large price movement. One example is the Shanghai Construction Group (600170) and Shanghai Tunnel Engineering Company (600820) in Fig 3D. In

For these stock pairs, normally one of them has large volatility in price, but the other is relatively stable. Anxin Trust in Fig 3E paid a stock dividend on 23/09/2015, and its stock price fluctuated violently before this date. However, the stock price of Kibing Group (601636) was always stable. In the second example, the price of Fujian Cement (600802) was stable at a low level due to its industrial development; however, the price of Guihang Automotive Components (600523) experienced relatively large fluctuations in Fig 3F. Finally, the fourth type includes stock pairs whose mutual information and correlation coefficient both have small values. This type of stock pair is not discussed in this work.
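The Fano factor used in the robustness comparison above is the variance divided by the mean. A small sketch of the calculation for one stock pair follows; the function name `fano_factor` is hypothetical, and the use of the population variance (rather than the sample variance) is an assumption, since the text does not say which is used.

```python
import numpy as np

def fano_factor(values):
    """Variance-to-mean ratio of the measure values computed for one stock pair.

    ASSUMPTION: population variance (ddof=0). The ratio is undefined when the
    mean is zero, which can happen for correlation coefficients.
    """
    values = np.asarray(values, dtype=float)
    return float(values.var() / values.mean())
```

In the setting described above, `values` would hold the five values of one measure for a stock pair: one for each of the four time periods and one for the whole period.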

Comparison of top stock pairs
During the development of the Chinese stock market, especially in the studied period of 04/01/2014-30/12/2016, the market underwent violent fluctuations from time to time. Thus it is inappropriate to consider the third type of relationship discussed above, even though these stock pairs have large values of the correlation coefficient. However, the second type of relationship is important for the nonlinear correlation between stock pairs. Therefore, in this work we propose to use mutual information to measure the relationship between stocks. For comparison, we also develop the corresponding networks using the correlation coefficient.

Hierarchical networks
Based on the values of mutual information and the correlation coefficient for each stock pair, we next use the MST method to build the undirected weighted network. We label each stock using its corresponding stock code and distinguish stocks in different sectors by using different colors, namely chemical (red), building materials (yellow), ornament (green), automobile (blue), household electrical appliance (white), real estate (black), banking (purple), non-banking financial (gray), and iron and steel (brown). For the network based on mutual information, Fig 4 shows that stocks in the same sector possess certain internal connection properties. Stocks are more likely to connect to stocks within the same sector. Indeed, companies in the same sector provide similar products and services, and thus their shares react similarly to external influences. Fig 4 also shows different densities of interconnections between different sectors. According to the interconnection density, the nine sectors in Fig 4 can be classified into three major groups. The first group, which is the largest, includes the non-banking financial, banking, ornament and real estate sectors. Sixteen non-banking financial stocks form a sub-group and connect to the network through the Industrial Bank (601166). Banking stocks connect to the network through Poly Real Estate (600048). Stocks in the financial sector, such as banks and insurance companies, usually offer high dividend yields and their stock prices are low. After the stock market crash in 2015, banking stocks were usually the primary investment option for maintaining the stability of the stock market. This particularity leads to the strong clustering of the stocks in the first group.
The second group includes companies in the automobile, chemical, and household electrical appliance sectors. The major activities of these companies cover a wide range of business activities, and often there are overlaps between the business activities of different sectors. In addition, the correlation between stocks inside each sector is higher than that between different sectors. Stocks in these sectors form small sub-groups inside each sector and are connected to the sub-group in other sectors. For example, the automobile industry has been developing business in new energy and intelligence industry, and the development of the real estate sector also accelerates the growth of the automobile industry. Thus, stocks in the automobile sector connect to stocks in the chemical and real estate sectors. In addition, companies in the chemical industry have a wide range of business activities. There are a number of stocks in this sector forming a few small sub-groups connected to other sectors. Thus, the companies in these three sectors are closely related to each other.
The third group includes companies in the iron and steel as well as building material sectors. The iron and steel stocks connect to the network through stocks in the ornament sector, and the center of this sub-group is Shanghai Iron and Steel (600022). Due to excess production capacity, the price of iron and steel has continued to decline, and companies in this sector have had to merge or reorganize in recent years. Companies in the building material sector have low internal relevance and do not cluster; they are mainly affected by companies in the real estate sector. Most of the stocks in the third group are on the boundary of the network, namely as leaf nodes.
As mentioned earlier, the Chinese stock market has experienced different developmental stages during the last three years. To find the influence of different stages on the network structure, we develop two networks for simplicity using the data in 02/01

Network topological properties
We next investigate the topological properties of the developed networks, including degree, path-length and the power-law distribution. The degree of a node is the number of edges connecting it. A node with a larger degree plays a more important role in the network. According to the distribution of degree in Table 1, we analyze three types of stocks with different degrees.
The first type includes important nodes that have degrees of more than 6. When financial news affects the stock market, these stocks react first and the fluctuations of their stock prices influence the stocks near them. All these stocks represent the major companies in their sectors. The second type of stocks have degrees between 2 and 6. These stocks deliver market information along the branches. The third type is the boundary stock with degree 1. The majority of the nodes are boundary nodes in these MST networks. Although the difference between the degree distributions of the two networks in Figs 4 and 5 is not large, the variance of degrees for the network using mutual information is smaller than that using the correlation coefficient. Finally, connections between nodes in these two networks are highly non-uniform. The   (Fig 4),

To further study the influence of degree, we consider the probability distribution $P(k)$ of degree $k$. Fig 6 gives the scatter diagrams of the calculated frequency. It suggests that, for the stock networks in Figs 4 and 5, the probability $P(k)$ follows the power-law distribution $P(k) \propto k^{-\gamma}$, where $\gamma$ is the power exponent. In addition, the cumulative distribution follows a power law with exponent $\gamma - 1$. Based on mutual information, Table 1 shows that the power exponents of the networks based on the whole dataset, stage one dataset and stage two dataset are 2.09, 1.82, and 2.17, respectively. However, when the correlation coefficient is used, the power-law exponents of the networks based on the whole dataset, stage one dataset and stage two dataset are 1.98, 1.98, and 1.93, respectively. Judging from the degree distribution and power-law exponent for these three datasets, the network based on mutual information represents the stock system more effectively than the network based on the correlation coefficient.
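The degree analysis above can be illustrated with a short sketch that counts node degrees from the tree edges and estimates the power exponent. The fitting method is not specified in the text, so the ordinary least-squares fit on log-log axes used here is an assumption, and the names `degree_sequence` and `power_law_exponent` are hypothetical.

```python
from collections import Counter
import numpy as np

def degree_sequence(n, edges):
    """Degree of every node 0..n-1 given the tree edges as (u, v) pairs."""
    degrees = np.zeros(n, dtype=int)
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return degrees

def power_law_exponent(degrees):
    """ASSUMPTION: estimate gamma in P(k) ~ k^(-gamma) by a log-log least-squares fit."""
    counts = Counter(int(k) for k in degrees if k > 0)
    ks = np.array(sorted(counts), dtype=float)            # distinct degrees
    freq = np.array([counts[int(k)] for k in ks], dtype=float)
    p = freq / freq.sum()                                 # empirical P(k)
    slope, _ = np.polyfit(np.log(ks), np.log(p), 1)
    return float(-slope)
```

At least two distinct degree values are needed for the fit. A least-squares fit on log-log axes is simple but sensitive to sparsely populated high-degree bins, so estimates from small trees should be read with caution.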
The path length for a stock pair is the number of intermediate stocks through which these two stocks are connected. The average path length of a network reflects its size. The average path length of the network based on mutual information is 9.1419, which suggests that, on average, the influence of one stock must pass through about 10 stocks to reach another. The longest path length is 23, which connects the Sailun Group (601015) and Pacific Securities (601058). On the other hand, the average path length of the network using the correlation coefficient is 8.0096.
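The average and longest path lengths of a tree can be obtained with a breadth-first search from every node, as sketched below. Path length is counted here as the number of edges between two nodes, which is an assumption about the convention, and the function name `path_length_stats` is hypothetical. The tree is assumed to be connected, as an MST is.

```python
from collections import deque

def path_length_stats(n, edges):
    """Return (average, longest) path length over all node pairs of a connected tree."""
    adjacency = [[] for _ in range(n)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    total, longest = 0, 0
    for source in range(n):
        dist = [-1] * n                 # -1 marks nodes not reached yet
        dist[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if dist[neighbour] < 0:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total += sum(dist)
        longest = max(longest, max(dist))
    # Every unordered pair is counted twice in total, once from each endpoint.
    return total / (n * (n - 1)), longest
```

For a tree of a few hundred nodes, running one search per node is inexpensive, since each search visits every edge once.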