Abstract
Applying network analysis to stock return correlations, we study the dynamical properties of the network and how they correlate with the market return, finding meaningful variables that partially capture the complex dynamical processes of stock interactions and the market structure. We then use the individual properties of stocks within the network, along with the global ones, to find correlations with the future returns of individual S&P 500 stocks. Applying these properties as input variables for forecasting, we find a 21% improvement in the R2 score for the prediction of stock returns on long time scales (per year), and 3% on short time scales (2 days), relative to baseline models without network variables. These findings highlight the potential of integrating network-based variables into stock return prediction models, which could enhance forecasting accuracy and provide a deeper understanding of market dynamics. This approach could be valuable for both investors and researchers seeking to model and predict stock behaviour in complex financial networks.
Citation: Achitouv I (2025) Dynamical analysis of financial stocks network: Improving forecasting using network properties. PLoS One 20(5): e0319985. https://doi.org/10.1371/journal.pone.0319985
Editor: Alejandro Raúl Hernández-Montoya, Universidad Veracruzana, MEXICO
Received: October 3, 2024; Accepted: February 11, 2025; Published: May 9, 2025
Copyright: © 2025 Ixandra Achitouv. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data for this study are publicly available from the figshare repository (https://doi.org/10.6084/m9.figshare.28238414.v1).
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Financial market dynamics exhibit properties of complex systems, such as non-linear behaviour and inter-dependencies of stock prices [1–3], emergent phenomena such as trends and bubbles from interactions of market participants [4–6], adaptive behaviour [7, 8] and feedback mechanisms [9] (e.g. rising prices attract more investors, further driving up prices). Collective behaviours emerging in financial markets are known as market modes, referring to synchronized movements of groups of stocks. They have been studied in stock return correlations using different tools including principal component analysis (e.g. [10]), complex network analysis (e.g. [1, 11–13]) and random matrix theory (e.g. [14–16]). Indeed, in a complex system, the behaviour of individual components, or agents, is coupled to the collective dynamics of the system. When applied to financial markets, one could ask how these couplings can be extracted and used to improve forecasting of the global stock market as well as in individual stock returns.
Many studies have proposed to forecast individual stock returns using soft computing methods [17, 18] (e.g. machine learning regression algorithms), relying mostly on individual features of a stock (e.g. previous closing value, difference between the high and low value of the stock each day, mean volume exchanged, etc.). Others have included macroeconomic variables to forecast stock returns (e.g. [19]), or sentiment analysis using natural language processing techniques (e.g. [20–22]). Some studies have analyzed the structural properties of stock market networks to detect influencers in stock market “communities” (e.g. [23, 24]), but little research has been conducted on applying the properties of a network built on stock return correlations to forecast the future movements of the global stock market [25, 26], or of individual stocks [27]. Unlike previous work, which often focuses on static network structures or market-wide influence detection, this study applies dynamic network analysis to understand how stock return interdependencies evolve over time, offering insights into the collective behaviours of the market. Previous methodologies, such as [26], focused on applying network properties to predict market movements but did not investigate how these properties change over time and how they can influence individual stock forecasts.
In this article, we use complex network analysis to study the dynamical inter-dependencies of stock returns and their collective behaviours, and test how the properties of the network correlate with the market stock return, which we define as the unweighted average return across all assets. Indeed, networks underpin complex systems and influence their behaviour [28]. Therefore, by analysing the dynamical evolution of the stock returns network one could hope to infer the overall dynamics of the stock returns. We further consider the properties of individual stocks within the network and test how these properties correlate with the future returns of individual stocks. Finally, we consider how our results change depending on the time scale employed to build the network, to test whether the correlations observed between the log return and the network properties are scale invariant. Indeed, scale-invariance is often considered a property of complex systems [29], including financial markets [29, 30]. The fractal market hypothesis, discussed in [31–34], suggests that market prices exhibit fractal properties and that patterns observed in short-term price movements can resemble those in long-term trends. This contrasts with the efficient market hypothesis, which assumes that market prices follow a random walk and are inherently unpredictable. It is therefore interesting to test the added value of the network input variables for forecasting on short and long time scales, and to study whether the key predictors are the same. This work thus extends previous studies in financial network analysis by explicitly considering the scale-invariance of network correlations over different time scales and testing whether network features serve as reliable predictors of future stock returns.
This article is organized as follows: first, we describe the key network properties we consider and how we constructed our financial network. Second, we measure the dynamic evolution of these properties and test their correlation with the overall market return. Third, we examine the individual properties of the stocks within the network and test if they have meaningful correlations with their future returns. We then use these properties to forecast individual stock returns. Finally, we present our conclusion.
Network properties and construction of the financial stock return network
Network properties
In order to study the properties of the network we consider centrality measures:
(1) The degree ki of a node i in a network represents the number of connections it has to other nodes [28]. In an undirected network, the degree ki of a node i is given by the sum of the adjacency matrix elements corresponding to that node:
$$k_i = \sum_{j} a_{ij},$$
where aij is an element of the adjacency matrix A, which is 1 if there is an edge between nodes i and j, and 0 otherwise.
(2) Closeness centrality CC(u) of a node u is based on the average length of the shortest path from that node to all other nodes in the network, representing how quickly information spreads from a given node to others [35]. It identifies nodes that can quickly interact with all other nodes, and is defined as:
$$C_C(u) = \frac{1}{\sum_{v \in V} d(u, v)},$$
where $d(u, v)$ is the shortest path distance between nodes u and v, and V is the set of all nodes in the network.
(3) Betweenness centrality $C_B(v)$ of a node v is a measure of the extent to which a node lies on the shortest paths between other nodes, indicating its role as a bridge or connector within the network. Nodes with high betweenness centrality can control the flow of information or resources between different parts of the network [36]:
$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}},$$
where $\sigma_{st}$ is the total number of shortest paths from node s to node t and $\sigma_{st}(v)$ is the number of those paths that pass through v.
(4) Eigenvector centrality CE(i) of a node i is a measure of the influence of a node in a network based on the idea that connections to high-scoring nodes contribute more to the score of the node in question [37]. A high eigenvector score means that a node is connected to many nodes that themselves have high scores. It is computed using the principal eigenvector of the adjacency matrix of the network:
$$C_E(i) = \frac{1}{\lambda} \sum_{j \in N(i)} A_{ij}\, C_E(j),$$
where $\lambda$ is a constant (the eigenvalue), $N(i)$ is the set of neighbors of i, and A is the adjacency matrix of the network. Applied to a financial market, high eigenvector values can be problematic for systemic risk, as contagion risk is linked to these influential stocks [38]. Indeed, a highly interconnected financial network can act as an amplification mechanism by creating channels for a shock to spread, leading to losses that are much larger than the initial changes [39].
In addition, we consider the weighted clustering coefficient $C^{w}(i)$ of a node i. It measures the tendency of nodes to cluster together while taking into account the strength of the connections (weights) between nodes, providing a more nuanced view of network cohesiveness [40]:
$$C^{w}(i) = \frac{1}{s_i (k_i - 1)} \sum_{j,h} \frac{w_{ij} + w_{ih}}{2}\, a_{ij}\, a_{ih}\, a_{jh},$$
where si is the sum of the weights of the edges connected to node i, ki is the degree of node i, wij is the weight of the edge between nodes i and j, and aij is the adjacency matrix element that is 1 if nodes i and j are connected and 0 otherwise. Applied to a financial market, networks with high clustering coefficients are often more robust and resilient to node failures, as the redundancy in connections within clusters can help maintain the overall connectivity [41]. This could be used as a network systemic risk measure, similarly to the eigenvector centrality, when building an optimal portfolio. In fact it was found (e.g. [42]) that one can improve on traditional portfolio selection by maximizing the usual Sharpe ratio where the standard deviation of the portfolio is replaced by a weighted average of node centrality measures (e.g. eigenvector centrality, betweenness centrality, closeness centrality).
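The node-level measures above can be illustrated on a toy graph. This is a minimal sketch, not the pipeline used in this work: the graph, node names and weights are hypothetical, and the helper `barrat_clustering` is our own illustration of the weighted clustering formula of [40]. Note that NetworkX rescales closeness by the number of reachable nodes minus one, and that its built-in weighted `clustering` uses a different (geometric-mean) definition, which is why the coefficient is computed by hand here.

```python
import networkx as nx

# Toy weighted, undirected graph; node names and weights are hypothetical.
G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 0.9), ("A", "C", 0.8), ("B", "C", 0.7),
    ("C", "D", 0.6), ("D", "E", 0.5),
])

def barrat_clustering(G, i, weight="weight"):
    """Weighted clustering coefficient of node i (illustrative helper)."""
    nbrs = list(G[i])
    k = len(nbrs)
    if k < 2:
        return 0.0  # undefined for k < 2; set to zero by convention
    s = sum(G[i][j][weight] for j in nbrs)  # node strength s_i
    total = 0.0
    for j in nbrs:
        for h in nbrs:
            # ordered pairs (j, h) of neighbours that are themselves linked
            if j != h and G.has_edge(j, h):
                total += (G[i][j][weight] + G[i][h][weight]) / 2.0
    return total / (s * (k - 1))

degree = dict(G.degree())                                  # k_i
closeness = nx.closeness_centrality(G)                     # rescaled by (n - 1)
betweenness = nx.betweenness_centrality(G)                 # normalized by default
eigenvector = nx.eigenvector_centrality(G, max_iter=1000)  # power iteration
clustering = {i: barrat_clustering(G, i) for i in G}
```

In this toy graph node C, which closes the triangle and bridges to D and E, has the largest degree, betweenness and eigenvector centrality.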
We also measure properties that characterize the network structure globally, such as:
(1) Community stability measure. In a network, this refers to the persistence of community structures over time or under perturbations, indicating how robust the communities are to changes in the network. It can be quantitatively assessed using various metrics, one of which involves evaluating the modularity Q of the network [43]. The modularity of a network is defined as:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),$$
where Aij is the adjacency matrix, ki and kj are the degrees of nodes i and j respectively, m is the total number of edges, and $\delta(c_i, c_j)$ is a function that is 1 if nodes i and j are in the same community and 0 otherwise.
(2) The largest component L, defined as the largest subset of nodes such that there is a path connecting any two nodes within this subset. It represents the most extensive cluster of interconnected nodes in the network, providing insights into the network’s structure and functionality [28]. Let G=(V,E) be a graph where V is the set of vertices and E is the set of edges. If Ci represents the i-th connected component of G, then the largest component L is given by:
$$L = \max_{i} |C_i|,$$
where $|C_i|$ is the size (number of nodes) of component Ci.
(3) The resilience of a network, R, measures its ability to maintain its structural integrity and functionality in the face of failures or attacks on its nodes or edges. It characterizes the robustness and vulnerability of networks [28, 41]. Resilience can be quantified by assessing the size of the largest connected component after a certain fraction of nodes or edges have been removed:
$$R = \frac{S_f}{N},$$
where $S_f$ is the size of the largest connected component after removing a fraction $f$ of the nodes, and $N$ is the total number of nodes in the original network.
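The three global measures can be sketched as follows. This is an illustration under stated assumptions rather than the exact implementation behind our results: the removal strategy in `resilience` (deleting the highest-degree nodes first) and the default fraction are hypothetical choices, and the barbell graph is a toy example.

```python
import networkx as nx

def largest_component_size(G):
    """Number of nodes in the largest connected component."""
    return max((len(c) for c in nx.connected_components(G)), default=0)

def resilience(G, fraction=0.1):
    """Largest-component size after node removal, relative to the original N.

    Assumption: the removed nodes are those with the highest degree.
    """
    n = G.number_of_nodes()
    n_remove = int(fraction * n)
    ranked = sorted(G.degree, key=lambda kv: kv[1], reverse=True)
    H = G.copy()
    H.remove_nodes_from(node for node, _ in ranked[:n_remove])
    return largest_component_size(H) / n

# Toy graph: two 5-node cliques joined by a single edge.
G = nx.barbell_graph(5, 0)
communities = nx.community.louvain_communities(G, seed=0)
Q = nx.community.modularity(G, communities)
L = largest_component_size(G)
R = resilience(G, fraction=0.2)
```

Here the Louvain partition is expected to recover the two cliques, and removing the two bridging nodes splits the graph into two components of four nodes each.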
Construction of the network
The S&P 500 index was chosen for this study as it represents a diverse cross-section of the largest U.S. companies across various sectors, making it a widely recognized benchmark for the overall stock market and an ideal candidate for studying network properties and their correlation with market dynamics. The closing stock prices can be downloaded from Yahoo finance. The periods chosen for this analysis were selected to balance the availability of comprehensive data with the need to capture meaningful long-term and short-term market dynamics. For the long-term period, we selected daily stock closing prices spanning from 1993-01-01 to 2024-01-01 to ensure coverage of multiple economic cycles, including major market events such as the dot-com bubble, the 2008 financial crisis, and the COVID-19 pandemic, which provide a robust basis for analyzing the dynamics of stock return networks over extended time scales. For the short-term period, the data from 2022-08-31 to 2024-07-31 captures recent market activity, offering insights into network properties under current market conditions while focusing on higher-frequency (hourly) dynamics. Both datasets used for this analysis can be accessed on https://github.com/IxandraAchitouv/Dyn_analysis_FSN_forecasting. The acronyms of the stocks from 1993 to 2024 can be accessed in the same link under the name historicalsp500components.csv.
For the long term period, we clear out stocks that were not present in the entire time range, ending up with 267 stocks and 7805 days of closing values for each stock. We then split our data into 30 yearly samples covering Nobs = 30 years of business days.
For the short term period, we also clear out stocks that were not present in the entire time range, ending up with 488 stocks and 3346 hours of recorded prices for each stock. We then split our data into time intervals of 14 hours (corresponding to two business days), ending up with Nobs = 238 measures (one measure every two business days).
The structure of the financial market can be mapped as a network where nodes represent the stocks and the edges connecting them represent the correlations between their returns after applying some filtering, e.g. [1, 44–46].
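To compute the correlation of their returns we take the log-return of the closing price Pi(t), $r_i(t) = \ln P_i(t) - \ln P_i(t-1)$, and compute the correlation coefficient
$$C_{i,j} = \frac{\langle r_i r_j \rangle - \langle r_i \rangle \langle r_j \rangle}{\sigma_i \sigma_j},$$
where $\langle \cdot \rangle$ denotes the average over the period and $\sigma_i$ is the standard deviation of the returns of stock i computed over the year we consider.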
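As a minimal sketch of this step, assuming the prices of one period are stored in an array with one row per time step and one column per stock (the function names and the toy prices are hypothetical):

```python
import numpy as np

def log_returns(prices):
    """Log-returns r_i(t) = ln P_i(t) - ln P_i(t-1); input shape (T, N)."""
    prices = np.asarray(prices, dtype=float)
    return np.diff(np.log(prices), axis=0)

def return_correlation(prices):
    """Pearson correlation matrix C_ij of the log-returns, shape (N, N)."""
    return np.corrcoef(log_returns(prices), rowvar=False)

# Toy prices: stocks 0 and 1 move together, stock 2 moves against them.
prices = np.array([
    [1.0, 10.0, 2.0],
    [2.0, 20.0, 1.0],
    [1.0, 10.0, 2.0],
    [2.0, 20.0, 1.0],
])
C = return_correlation(prices)
```

Because the returns are logarithmic, stocks 0 and 1 are perfectly correlated despite their different price levels.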
In this correlation matrix all stocks are connected to one another (the degree distribution is therefore a constant equal to the number of nodes), which does not provide any useful information. One approach to convert this correlation matrix into an adjacency matrix Ai,j is the threshold method originally introduced in [44] and used in many studies, e.g. [23, 24, 47]: an edge between stocks i and j is kept if Ci,j is above a threshold value, otherwise Ai,j = 0. Qualitatively, it filters out 'spurious' correlations and keeps an edge between stocks when the correlation of their returns is above a certain limit. The question remains of which threshold value optimally filters the correlation matrix of the returns, differentiating between noise and signal. Note that one could start by extracting the market modes from the correlation matrix as in [13] before applying the threshold; however, this requires the correlation matrix to be full rank, which may not be the case for short time scales. In previous studies, the value of the threshold is usually a heuristic choice, as discussed in [48]. For instance, in [49] it is set in terms of the standard deviation of the Ci,j distribution and an integer n. Recently, a criterion based on network properties was introduced in [50]. In our work we also propose to favour a value based on the network.
Many complex systems naturally evolve towards a scale-free structure due to the dynamic processes governing their formation. The scale-free property means that the degree distribution follows a power law: a few nodes are connected with many (high degree) while most are not connected (null degree). The scale-free network emerges because the probability of a node gaining new connections is proportional to its existing connectivity [51]. In fact, many natural networks grow over time by the addition of new nodes. New nodes are more likely to attach to existing nodes that already have a high degree of connectivity, leading to a few highly connected hubs. This process, known as preferential attachment, is a fundamental mechanism in the formation of scale-free networks. Scale-free networks may also provide an evolutionary advantage by enabling complex systems to adapt and evolve more efficiently. Highly connected nodes (hubs) can serve as control points that help the network respond to changes and perturbations, thus enhancing the system’s overall adaptability and resilience [52]. Hence we propose to define the threshold value as the minimum value at which the degree distribution of the nodes follows a power law.
In practice, we implement a loop starting with an initial threshold value of . For each iteration, we compute the adjacency matrix and the degree distribution of the nodes. If the degree distribution is not convex, we increment the threshold by 0.1 and repeat the process. This iterative procedure continues until the degree distribution becomes convex, at which point we stop, resulting in threshold values of
. This threshold can be seen as a hyperparameter for the construction of the adjacency matrix [23]. While exploring alternative criteria for threshold selection goes beyond the scope of this analysis, we note that as long as the threshold value is sufficiently large to filter meaningful correlations, the results should remain robust.
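The iterative procedure can be sketched as below. Several ingredients are assumptions made for illustration only: the binary edge rule (an edge where the correlation exceeds the threshold), the starting value `theta0` passed by the caller, and in particular the convexity check, which we approximate here by requiring the degree counts P(k), k = 0, 1, ..., to be non-increasing with non-negative second differences. The function names are hypothetical.

```python
import numpy as np

def adjacency_from_threshold(C, theta):
    """Binary adjacency: edge where the correlation exceeds theta (assumed rule)."""
    A = (np.asarray(C) > theta).astype(int)
    np.fill_diagonal(A, 0)  # no self-loops
    return A

def looks_power_law_like(A):
    """Assumed proxy: degree counts are non-increasing and convex in k."""
    counts = np.bincount(A.sum(axis=1))  # counts[k] = number of nodes of degree k
    non_increasing = np.all(np.diff(counts) <= 0)
    convex = np.all(np.diff(counts, n=2) >= 0)
    return bool(non_increasing and convex)

def select_threshold(C, theta0, step=0.1, theta_max=1.0):
    """Raise theta from theta0 in fixed steps until the proxy criterion holds."""
    n_steps = int(round((theta_max - theta0) / step)) + 1
    for n in range(n_steps):
        theta = theta0 + n * step
        A = adjacency_from_threshold(C, theta)
        if looks_power_law_like(A):
            return theta, A
    raise ValueError("no threshold in the scanned range satisfied the criterion")

# Toy correlation matrix: weak background plus one strongly correlated chain 0-1-2.
C = np.full((7, 7), 0.15)
np.fill_diagonal(C, 1.0)
C[0, 1] = C[1, 0] = C[1, 2] = C[2, 1] = 0.9
theta, A = select_threshold(C, theta0=0.1)
```

In this toy case the first threshold leaves the graph fully connected and is rejected; the next step removes the weak background and leaves four isolated nodes, two nodes of degree one and one node of degree two.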
Once we have built the adjacency matrix, we use the NetworkX library [53] to visualize the network and compute most of the network statistics we previously introduced.
In Fig 1 we display the resulting networks for 4 different years (top panels), where nodes are colored by the sector of the stocks. The spatial layout of the network is computed using ForceAtlas2 [54], which places nodes joined by low-weight edges far apart and nodes joined by high-weight edges close together. The size of the nodes is proportional to their degree. Interestingly, we can observe some clustering where stocks of the same sector (shown by colors) are closer. We run a Louvain community finder [55], which locally optimizes the difference between the number of edges between nodes in a community and the expected number of such edges in a random graph with the same degree sequence. As a result, we identify different numbers of clusters depending on the year (Ncluster). For visibility we display the label of a node if it is in the top 3 of the eigenvector centrality distribution, if its eigenvector centrality is the maximum within its cluster, or if it has the largest eigenvector centrality within its S&P 500 sector.
In the lower panels of Fig 1 we display the degree distribution of the nodes as well as the histograms of eigenvector centrality and local clustering. The red vertical line corresponds to the mean. In the appendix we also show 4 selected random networks built using the short time period.
Global evolution of the network and the market stock returns
To characterize the global dynamical evolution of the network we consider the following measures: the average degree of the top nodes, the mean closeness centrality (the mean corresponds to the average over all nodes), the mean betweenness centrality, the mean eigenvector centrality, the mean clustering, the largest component, the resilience of the network, and the community stability. We also measure the maximum eigenvalue of the stock return correlation matrix, as this can be used as a precursor to financial crises by indicating market turbulence and systemic risk [14]. The global dynamical properties of the graphs for the long and the short term periods can be accessed at https://github.com/IxandraAchitouv/Dyn_analysis_FSN_forecasting.
Long term period
In Fig 2 we display these measurements (black curves) along with the global stock returns evolution (blue curves) rescaled by some constant factor to fit on the same y-axes.
The number of observed time steps is Nobs = 30. The vertical dashed lines correspond to (from left to right): Asian Financial Crisis, 1997 - Dotcom Crash, 2000 - Subprime Crisis, 2008 - Federal Reserve’s QE3 Announcement, 2012 - COVID-19 pandemic, 2020.
The vertical dashed lines correspond to (from left to right):
- The Asian Financial Crisis (1997), which primarily affected East Asian markets but also had a global impact. The U.S. market experienced a decline in 1998 due to global instability.
- The Dotcom Bubble Burst (2000–2002), which led to a sharp decline in the S&P 500 as technology stocks crashed.
- The Subprime Crisis (2008), triggered by the subprime mortgage meltdown and global banking failures, which caused a significant plunge in the S&P 500.
- The Federal Reserve's announcement of its third round of quantitative easing (QE3) in September 2012, committing to purchase 40 billion USD in mortgage-backed securities per month to boost economic growth and reduce unemployment. The S&P 500 responded positively, gaining momentum in the latter half of 2012.
- The COVID-19 pandemic, which caused severe market disruptions, leading to a sharp and rapid decline in the S&P 500 in March 2020. However, the market rebounded quickly after the initial sell-off due to government stimulus measures and rapid monetary interventions by central banks. Despite the sharp drop, the S&P 500 recovered and reached new highs in the subsequent months.
Interestingly it seems that qualitatively the mean clustering and the largest eigenvalue of the stock returns increase over the last 30 years which would indicate that the market becomes more connected (largest eigenvalue is the market mode).
In Fig 3 we compute the correlation matrix of the global network properties and the market log-return using the 30 time steps (years). We observe that some variables are anti-correlated to the log return with an absolute value greater than : the mean clustering and the max eigenvalue of the stock returns. This suggests that when stocks are highly connected to one another (mean clustering) or when they follow the market trend (largest eigenvalue), the global log return decreases. On the contrary, the log return is positively correlated with a coefficient greater than
with the community stability variable.
Given the limited number of observations (Nobs = 30), it is challenging to discuss the correlations between network evolution and stock returns. Nonetheless, we perform a Granger causality test [56] using a maximum lag of 5 years as an indicator to test whether the time series of the network properties can be used to predict the stock returns. We only quote the result for the most restrictive statistical test, which corresponds to the Sum of Squared Residuals (SSR) F-test for computing the p-value. (Other statistical tests for the SSR are the χ²-test and the likelihood ratio. While we obtain p-values less than 0.05 in some cases, we choose to disregard them.) We obtain a p-value less than 0.05 for the measurements given in Table 1, suggesting that these variables could be used as predictors for forecasting the global stock returns, especially the community stability at lag 1.
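To make the SSR F-test concrete, the sketch below computes its statistic directly from two least-squares fits: a restricted model that predicts the target from its own lags, and an unrestricted model that also includes the lags of the candidate predictor. This is an illustration with hypothetical function names and synthetic data, not the code used for Table 1; the p-value follows from the F distribution with the returned degrees of freedom (for instance via `scipy.stats.f.sf`).

```python
import numpy as np

def _ssr(X, y):
    """Sum of squared residuals of an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def granger_ssr_f(y, x, lag):
    """F statistic for H0: the first `lag` lags of x do not help predict y."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    T = len(y)
    target = y[lag:]
    # column k-1 holds the series shifted by k steps: y(t-k), x(t-k)
    y_lags = np.column_stack([y[lag - k:T - k] for k in range(1, lag + 1)])
    x_lags = np.column_stack([x[lag - k:T - k] for k in range(1, lag + 1)])
    const = np.ones((T - lag, 1))
    X_restricted = np.hstack([const, y_lags])
    X_unrestricted = np.hstack([const, y_lags, x_lags])
    ssr_r = _ssr(X_restricted, target)
    ssr_u = _ssr(X_unrestricted, target)
    df_num = lag
    df_den = (T - lag) - X_unrestricted.shape[1]
    F = ((ssr_r - ssr_u) / df_num) / (ssr_u / df_den)
    return F, df_num, df_den

# Synthetic example: y is driven by the previous value of x, not the reverse.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.zeros(200)
y[1:] = 0.8 * x[:-1] + 0.1 * rng.normal(size=199)
F_xy, df_num, df_den = granger_ssr_f(y, x, lag=1)
F_yx, _, _ = granger_ssr_f(x, y, lag=1)
```

With only 30 yearly observations and lags up to 5, the denominator degrees of freedom become small, which is one reason the long-term results should be read with caution.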
The short term period
In Fig 4 we display the global dynamical evolution of the network properties (black curves) and the rescaled global stock returns smoothed with a rolling mean average window W = 23. The vertical dashed lines mark key market events (from left to right) [57–60]: Market Volatility Amid Inflation Concerns (Oct 2022) led to uncertainty and sharp fluctuations, SVB Bank Collapse (March 2023) triggered financial sector panic, US Debt Ceiling Agreement (June 2023) provided relief with avoided default, Tech Sector Rally (July 2023) boosted indexes driven by AI growth, Federal Reserve Interest Rate Hike (Sept 2023) slowed markets with tightened monetary policy, and Market Correction Due to Rising Bond Yields (Oct 2024) reflected investor shifts to fixed-income assets.
Each time step corresponds to two business days, smoothed over a rolling window W = 23. The number of observed time steps is Nobs = 238. The vertical dashed lines correspond to (from left to right): Market Volatility Amid Inflation Concerns, Oct 2022 - SVB Bank Collapse, March 2023 - US Debt Ceiling Agreement, June 2023 - Tech Sector Rally, July 2023 - Federal Reserve Interest Rate Hike, Sept 2023 - Market Correction Due to Rising Bond Yields, Oct 2024.
The correlation matrix is displayed in Fig 5, computed from 238 observations (one every two business days). On the short time scale, the five variables most correlated with the log return (in absolute value) are:
- The max eigenvalue of the stock returns, with a coefficient of −0.55, which was also the highest correlation for the long term period.
- The mean closeness centrality, with a coefficient of 0.49.
- The mean clustering, with a coefficient of −0.48, which was the second highest coefficient for the long term period.
- The largest component, with a coefficient of 0.47.
- The resilience, with a coefficient of −0.35.
This is quite interesting, pointing out that the dynamical processes between the stocks on very different time scales can be partially captured by the maximum eigenvalues of the stock returns and the mean clustering of the network. The former is not surprising as it captures the market mode [14] and it was previously shown that high eigenvalues are an indicator of financial crisis [61]. The latter was also studied in [44, 62, 63] as a metric that can reflect the underlying stability of the market. Here we find that they are consistent over very different time scales when studying the global stock returns.
The Granger causality test [56] result for the best lag of our network variables is reported in Table 2, suggesting again that these variables could be used as predictors for forecasting the global stock returns. Again, it is interesting to compare these variables with the ones we find for the long term period. For the long term period we only find the mean closeness centrality and the community stability as meaningful properties to predict the stock returns. For the short term period, we find that all 9 variables (with different lags) are correlated with the future stock return, the best p-values being for the resilience at lag 3 and the 90th percentile degree at lag 2. Here the number of observations is an order of magnitude larger, which makes this test more robust.
Evolution of the individual properties of stocks within the network: Application to forecasting stock returns
Previously we have studied how the market log return correlates in time with respect to global properties of the network. We find some interesting correlations with these properties suggesting some predictability and a deeper understanding of how stocks interact with each other, leading to a coherent collective behaviour. Now we might wonder how these interactions impact individual stocks: by considering the position and properties of stock i within the network, along with the global properties of the network, can we predict its log return more accurately than by using standard variables?
To address this question, we perform a forecast analysis using the following individual variables for each stock: (1) its degree centrality, (2) its closeness centrality, (3) its betweenness centrality, (4) its eigenvector centrality and (5) its clustering coefficient. These variables characterize the stock within the network. In addition we take the global network variables that we previously considered. They contain information about the overall market structure.
The standard variable that we consider for the baseline scenario is the stock’s log return at the previous time step (lag).
Methodology
Preparing the data.
We start by randomly selecting 85% of the stocks for training and 15% for testing. For each stock we smooth all variables f(t) (including the log return) using a rolling mean of window W, defined as:
$$\bar{f}(t) = \frac{1}{W} \sum_{i=0}^{W-1} f(t-i),$$
where $\bar{f}(t)$ is the rolling mean of variable f at time t and f(t–i) is the value of variable f at time t–i. Note that one could adapt the averaging window to each variable, for instance by performing a grid search on the best forecasting results. In what follows we adopt a fixed criterion for the value of W, taking 10% of the number of observed time steps.
We further compute all variables at different lags, up to a maximum lag given by Nlag. We then remove the time steps t<Nlag. Finally we concatenate all training stocks into one training dataset, implicitly assuming that the time series are approximately stationary.
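The smoothing and lagging steps can be sketched as follows, assuming each variable of a stock is held in a one-dimensional array ordered in time. The function names are hypothetical, and the treatment of the first W−1 entries (set to NaN) is an illustrative choice.

```python
import numpy as np

def rolling_mean(f, W):
    """Trailing mean over the last W values; the first W-1 entries are NaN."""
    f = np.asarray(f, dtype=float)
    out = np.full(f.shape, np.nan)
    csum = np.cumsum(np.insert(f, 0, 0.0))
    out[W - 1:] = (csum[W:] - csum[:-W]) / W
    return out

def lag_matrix(f, n_lag):
    """Rows for t = n_lag, ..., T-1 with columns f(t-1), ..., f(t-n_lag)."""
    f = np.asarray(f, dtype=float)
    T = len(f)
    return np.column_stack([f[n_lag - k:T - k] for k in range(1, n_lag + 1)])

f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = rolling_mean(f, W=2)
features = lag_matrix(f, n_lag=2)   # inputs
target = f[2:]                      # values to predict, aligned with `features`
```

Dropping the first Nlag time steps, as described above, corresponds to keeping only the rows for which every lagged column is defined.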
The regression models.
We consider multiple regression models for this analysis. For the predictions that do not use the network features but only the log-return at previous lags we use:
1- a linear regression model using solely the lag 1 of the log return as input variable (LRbase)
2- a Random Forest Regressor [64] using as input variables the most correlated lags of the log return (RFRbase)
3- an XGBoost [65] model with only the lags of the log-return (XGBbase)
4- a LightGBM [66] model with only the lags of the log-return (LGBMbase)
5- a CatBoost [67] model with only the lags of the log-return (CatBbase)
6- a weighted average of models 1–5 (wAbase)
For the predictions that use the network features and the log-return at previous lags we use:
1- a Random Forest Regressor using a fraction of all input variables (those most correlated with the log-return, defined as having a Pearson correlation coefficient in the top 30th percentile of the correlation distribution) (RFR)
2- a Gradient Boosting Regressor model [68] using the same selected variables (GBR)
3- an XGBoost model on all input variables (XGB)
4- a LightGBM model on all input variables (LGBM)
5- a CatBoost model on all input variables (CatB)
6- a weighted average of models 1–5 (wA)
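The variable selection used for the RFR and GBR models can be illustrated as below. This is a sketch under assumptions: we rank the variables by the absolute value of their Pearson coefficient with the target and keep those at or above the 70th percentile of that distribution. Whether the ranking uses the absolute value, as well as the function name and the synthetic data, are our assumptions.

```python
import numpy as np

def select_top_correlated(X, y, top_fraction=0.30):
    """Indices of columns of X whose |Pearson r| with y is in the top fraction."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    cutoff = np.quantile(np.abs(corr), 1.0 - top_fraction)
    keep = np.flatnonzero(np.abs(corr) >= cutoff)
    return keep, corr

# Synthetic example: only the first three of ten variables carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 * X[:, 2] + 0.1 * rng.normal(size=2000)
keep, corr = select_top_correlated(X, y)
```

The returned indices can then be used to subset the input matrix before fitting the regressors.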
Hyperparameters and pipelines.
For the RFRbase, RFR and GBR models we use the scikit-learn implementations without tuning their hyperparameters. For the XGBbase, LGBMbase, CatBbase, XGB, LGBM and CatB models we perform a random sampling of the hyperparameters, use cross-validation to evaluate the performance of each hyperparameter combination and select the best model for each. The entire pipeline to reproduce our results, along with the data, can be accessed at https://github.com/IxandraAchitouv/Dyn_analysis_FSN_forecasting.
Long time scale forecasting
We start with the selected S&P 500 stocks from 1993-01-01 to 2024-01-01 (267 stocks), where we have Nobs = 30 (one measurement for each business year) and Nlag = 5, such that the training sample size is Ntrain = 0.85 × 267 × (Nobs − Nlag) ≈ 5673. We use a window W = 3 (10% of Nobs) to compute the rolling mean.
We perform a Granger causality test on the network variables and report the best lag p-value for the SSR F-test in Table 3.
The first 5 variables are specific to each stock while the others are the global variables of the network and the max eigenvalue of the stock return correlation matrix. The result is interesting as it confirms that a stock’s network properties do correlate with its future value at lag 1, and that the global properties of the network also do so at higher lags, with lower p-values.
The variables most correlated with the log-return (top 30th percentile of the Pearson coefficient distribution), also used for the RFR and GBR models, are reported in the appendix S1 Table. The two most correlated variables are the log return at lag 1 and lag 2, followed by the 90th percentile degree at lag 2. The most correlated network variable that is not a global network variable is the closeness centrality at lag 3. In fact, among the selected network variables, 6 are characteristic of individual stocks and 17 describe the global network properties.
In Fig 6 we show the differences between the wA prediction and the log return (green curves), the LRbase and the log return (orange curves) and the mean of the log return with itself (grey curves) for 5 randomly selected stocks from the testing set. The horizontal lines correspond to the mean of the differences. Qualitatively, the differences between LRbase and wA are not significant but we observe that the two models do better than a simple mean average of the log return.
Fig 6. Differences between the wA prediction and the log return (green curves), the LRbase and the log return (orange curves) and the mean of the log return with itself (grey curves). The horizontal lines correspond to the mean of the differences.
For each stock of the testing set we measure the R2 score and the MAE, and compute their distributions for our different models. The performance summary is given in Table 4, and Fig 7 displays the distributions for some of these models. The question we want to address is whether adding the network input variables improves the overall prediction of the individual stock returns. In Table 4 we find that the relative improvement, defined as model/modelbase−1, is systematically positive: adding the network features improves all scores on average, for both the R2 score (excluding the RFR and the wA comparisons) and the MAE. Finally, the wide, skewed distributions of the R2 score and the MAE in Fig 7 point to the high variability of these simple predictive models.
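The per-stock scoring and the relative improvement model/modelbase−1 can be illustrated with the short sketch below. The predictions are synthetic and the noise levels arbitrary, so the printed numbers carry no information about the results in Table 4; the function and variable names are hypothetical.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination for one stock."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def mean_absolute_error(y_true, y_pred):
    """Mean absolute error for one stock."""
    return float(np.mean(np.abs(y_true - y_pred)))

def relative_improvement(model_score, base_score):
    """Relative change of a score with respect to its baseline: model/base - 1."""
    return model_score / base_score - 1.0

# Synthetic predictions (assumption): the model with network features is
# given slightly less noise than the baseline, for 40 hypothetical stocks.
rng = np.random.default_rng(0)
r2_base, r2_net, mae_base, mae_net = [], [], [], []
for _ in range(40):
    y_true = rng.normal(size=50)
    pred_base = y_true + 0.6 * rng.normal(size=50)
    pred_net = y_true + 0.5 * rng.normal(size=50)
    r2_base.append(r2_score(y_true, pred_base))
    r2_net.append(r2_score(y_true, pred_net))
    mae_base.append(mean_absolute_error(y_true, pred_base))
    mae_net.append(mean_absolute_error(y_true, pred_net))

# Compare the medians of the per-stock distributions.
median_r2_base = float(np.median(r2_base))
median_r2_net = float(np.median(r2_net))
r2_gain = relative_improvement(median_r2_net, median_r2_base)
print(round(median_r2_base, 3), round(median_r2_net, 3), round(r2_gain, 3))
```

For the R2 score a positive value of this ratio means a better fit. How the sign is read for an error metric such as the MAE depends on the convention adopted, which this sketch does not assume.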
Fig 7. R2 distribution (top panel) and MAE distribution (lower panel) computed using the testing set of stocks for the long time scale forecasting. Vertical lines correspond to the median values. The added value of the network parameters can be observed by comparing the base models (without network features) with the others.
Short time scale forecasting
As for the long-term forecast, we use a rolling mean window, and for the number of lags we take Nlag = 12, after checking that including higher lags does not change the accuracy of the prediction.
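The construction of the lagged inputs from rolling means can be sketched as follows. This is a minimal example on a synthetic price series: the helper names are hypothetical, and the window W = 10 is only one of the sizes tested in this section, not necessarily the default one.

```python
import numpy as np

def log_returns(prices):
    """Log return between consecutive observations."""
    prices = np.asarray(prices, dtype=float)
    return np.diff(np.log(prices))

def rolling_mean(values, window):
    """Mean over a sliding window, keeping only fully covered positions."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

def lag_matrix(series, n_lag):
    """Design matrix with columns series[t-1], ..., series[t-n_lag] and target series[t]."""
    n = len(series)
    X = np.column_stack([series[n_lag - k : n - k] for k in range(1, n_lag + 1)])
    y = series[n_lag:]
    return X, y

# Synthetic price path (assumption), used only to exercise the helpers.
rng = np.random.default_rng(3)
prices = 100.0 * np.exp(np.cumsum(0.01 * rng.normal(size=400)))

W, N_LAG = 10, 12
smoothed = rolling_mean(log_returns(prices), W)
X, y = lag_matrix(smoothed, N_LAG)
print(X.shape, y.shape)
```

Consecutive rolling means share W−1 of their W terms, so neighbouring values of the smoothed series are strongly correlated by construction, which helps explain why the first lags of the log return dominate a correlation ranking.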
Following the same methodology as above, we report in Table 5 the Granger causality test results for the best lag.
In this case, the first 4 variables are associated with the individual properties of the stock within the network, while the others are global properties of the network and the max eigenvalue of the stock return correlation matrix. Interestingly, the global variables have lower p-values than the individual ones, suggesting that the market structure has more weight in the forecast of individual stocks than the interactions of the stock within the network.
The variables most correlated with the log return (top 30th percentile of the Pearson coefficient distribution), also used for the RFR and GBR models, are reported in the appendix (S2 Table). In this case the most correlated variables are the log return itself at lags 1, 2, ..., 12, with correlation coefficients of 0.95, 0.9, ..., 0.44, respectively. None of the variables selected for the training is an individual network variable, except for the closeness centrality at lags 2 and 3, which appears at ranks 48 and 59 with correlation coefficients of 0.09 and 0.08. This suggests again that the global network properties have more impact on the individual stock return than the stock properties within the network. Interestingly, the closeness centrality was also the most correlated individual variable in the long time period.
In the appendix (S2 Fig) we also show the differences of the predicted log return for the wA and LRbase for 5 randomly selected stocks from the testing set.
The performance summary is given in Table 6, and Fig 8 displays the distributions for some of these models. For these predictions the distributions of the R2 score and the MAE are much better than for the long time scale. The larger training sample and the larger rolling window are probably the main reasons. Note that we also tested the forecasting with different rolling average windows (W = 10, W = 30 and W = 5); the median R2 score improves in each case, with a 21% improvement for W = 5.
Fig 8. R2 distribution (top panel) and MAE distribution (lower panel) computed using the testing set of stocks for the short time scale forecasting. Vertical lines correspond to the median values. The added value of the network parameters can be observed by comparing the base models (without network features) with the others.
Nonetheless, we find again that adding the network input variables improves the overall prediction of the individual stock returns. In Table 6 the relative improvement, defined as model/modelbase−1, is positive: adding the network features improves all scores on average, for both the R2 score and the MAE.
Discussion
We find some interesting results in both the long and the short time periods. First, as expected, the log return itself at previous lags is the most important variable to predict the future value of a stock. Second, the global network variables are more correlated with the future return of a stock than the stock characteristics within the network. This suggests that the overall market structure plays a more important role than the stock’s interactions within the network. The closeness centrality measure of a stock is the most important individual feature to predict the future return of a stock in our models. The resilience of the network and the 90th percentile degree are also strong global features that correlate with the future return of a stock. Overall, the network variables improve the forecasting of individual stock returns on both short and long time periods.
Summary and conclusions
In this article we study the dynamical evolution of a network built from stock return correlations and how its properties correlate with the log return itself. We consider global properties of the network as well as individual properties of the nodes. This allows us to study the correlation between (1) the evolution of the market return and the global properties of the network, and (2) the evolution of individual stock returns and the characteristics of the stock within the network. At both the collective and the individual level we find meaningful indicators for the future evolution of the stocks (Granger causality test). We also study the dynamical evolution on two different time scales: long period and short period. For the collective evolution, we find that some indicators are scale invariant (the largest eigenvalue of the correlation matrix and the average clustering of the network). For the individual stocks, the degree of the node, the closeness centrality, the eigenvector centrality and the clustering of the node are meaningful indicators on both long- and short-term periods, although, except for the closeness centrality, they were not selected to forecast the future value of our testing stocks for the short time period. These findings are relevant for risk management, as they suggest that some network variables remain consistent across the time scales relevant to different trading strategies, such as daily or monthly trading. Hence, these variables could be used as indicators for measuring risk, helping investors and analysts assess market conditions and adjust strategies accordingly.
Finally, we use the network variables to forecast the evolution of stock returns and find an improvement over baseline scenarios that do not include network features (21% for the long period and 3% for the short period). This indicates that network variables capture some of the interactions of the complex financial system, thus adding value for predicting the dynamical evolution of collective and individual stock returns. While this study provides valuable insights, the key limitation of these findings is the small number of observations; the analysis could benefit from a larger dataset, but such data were not available. Additionally, the analysis relies on a specific data source (S&P 500), which is appropriate for the scope of this work; the approach could be further tested on other stock markets. Nevertheless, the results presented here are robust within the defined framework. Exploring alternative models or incorporating additional variables, such as macroeconomic indicators or extended datasets, would go beyond the scope of this article but could enhance future research in this area.
One could also include additional variables in the training dataset when predicting individual stock returns, such as the variance of some of the network variables. It would also be interesting to test more sophisticated forecasting models.
To conclude on the policy implications of this study: the largest eigenvalue of the correlation matrix and the average clustering could inform the development of early warning systems for market instability, offering regulators and policymakers a tool to monitor systemic risks. Similarly, the identification of meaningful indicators for individual stocks, such as closeness centrality, suggests potential applications for portfolio optimization and risk management strategies. While these implications were outside the immediate scope of this article, they highlight the potential for translating network-based analyses into practical tools for both policymakers and market participants.
The exploration of alternative machine learning models could enhance predictive accuracy and applicability. For instance, leveraging advanced methods such as graph neural networks (GNNs) or temporal convolutional networks (TCNs) could better capture the complex, dynamic interactions within financial networks. These models, combined with network variables, may offer more robust tools for predicting stock returns and systemic risks, paving the way for their integration into decision-making processes by regulators and institutional investors. While such developments are beyond the scope of this article, they represent a promising avenue for future research and policy design.
Materials and methods
The data used for this project were downloaded from Yahoo Finance’s API and can be accessed at https://github.com/IxandraAchitouv/Dyn_analysis_FSN_forecasting. The dynamical evolution of the network variables measured in this article is also publicly available at the same link, along with the ML pipeline for forecasting. The data collection and analysis for this project complied with Yahoo Finance’s Terms of Service. The dataset was used solely for non-commercial academic research purposes and was processed in accordance with the platform’s terms to ensure proper attribution and compliance.
Supporting information
S1 Fig. Network of the stocks for different years and their properties for the short time scale.
In this figure, we display the network built on the short time scale for 4 randomly selected scale [46,106,151,211] (top panels), along with the degree distribution (bottom left panels), the eigenvector centrality histogram (bottom middle panels) and the clustering histogram of the stocks (bottom right panels). Red vertical lines correspond to the mean. Compared to Fig 1, the number of stocks is higher. We observe some clustering per sector (shown by the colors). On average, the number of clusters found with the Louvain algorithm is smaller than in the long time period.
https://doi.org/10.1371/journal.pone.0319985.s001
(TIF)
S2 Fig. Comparison of the predictions with the log return for 5 randomly selected stocks from the testing set on the short time scale.
Differences between the wA prediction and the log return (green curves), the LRbase and the log return (orange curves) and the mean of the log return with itself (grey curves), for 5 selected stocks in the testing set. The horizontal lines correspond to the mean of the differences. There is no qualitative difference between the LRbase prediction and the wA one. Both outperform the simple mean of the log return.
https://doi.org/10.1371/journal.pone.0319985.s002
(TIF)
S1 Table. Selected variables for the training of the long time period.
https://doi.org/10.1371/journal.pone.0319985.s003
(PDF)
S2 Table. 42 selected variables out of 63 for the training of the short time period: we do not show the ones with a correlation coefficient less than 0.10.
https://doi.org/10.1371/journal.pone.0319985.s004
(PDF)
Acknowledgments
I.A. would like to thank the Institut des Systèmes Complexes de Paris Île-de-France for supporting the article processing charge of this article.
References
- 1. Mantegna RN. Hierarchical structure in financial markets. Eur Phys J B. 1999;11(1):193–7.
- 2.
Bouchaud J, Potters M. Theory of financial risk and derivative pricing: from statistical physics to risk management. Cambridge University Press; 2003.
- 3. Cont R. Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance. 2001;1(2):223–36.
- 4. Farmer JD, Lo AW. Frontiers of finance: evolution and efficient markets. Proc Natl Acad Sci U S A. 1999;96(18):9991–2. pmid:10468547
- 5. Lux T, Marchesi M. Scaling and criticality in a stochastic multi-agent model of a financial market. Nature. 1999;397(6719):498–500.
- 6.
Sornette D. Why stock markets crash: critical events in complex financial systems. Princeton University Press; 2003.
- 7.
Arthur W, Durlauf S, Lane D. The economy as an evolving complex system II. Addison-Wesley; 1997.
- 8. Lo AW. The Adaptive Markets Hypothesis. JPM. 2004;30(5):15–29.
- 9. LeBaron B. Stochastic volatility as a simple generator of apparent financial power laws and long memory. Quant Finance. 2001;1(6):621–31.
- 10.
Jolliffe I. Principal component analysis. New York: Springer; 2002.
- 11. Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang D. Complex networks: structure and dynamics. Phys Rep. 2006;424:175–308.
- 12. Peron T, Rodrigues F. Collective behavior in financial market. arXiv. 2011. https://arxiv.org/abs/1109.1167
- 13. Achitouv I. Inferring financial stock returns correlation from complex network analysis. arXiv. 2024. https://arxiv.org/abs/2407.20380
- 14. Laloux L, Cizeau P, Bouchaud J, Potters M. Noise dressing of financial correlation matrices. Phys Rev Lett. 1999;83(7):1467–70.
- 15. Plerou V, Gopikrishnan P, Rosenow B, Nunes Amaral LA, Stanley HE. Universal and Nonuniversal Properties of Cross Correlations in Financial Time Series. Phys Rev Lett. 1999;83(7):1471–4.
- 16. Bouchaud J, Potters M. Financial applications of random matrix theory: a short review. arXiv preprint. 2009.
- 17. Yang J, Plassmann F, Pogue TF, Shen Y. Stock market forecasting by soft computing techniques. Int Rev Financ Anal. 2002;11(3):363–80.
- 18. Kim K. Financial time series forecasting using support vector machines. Neurocomputing. 2003;55(1–2):307–19.
- 19. Li X, Xie H, Wang R, Cai Y, Cao J, Wang F. Forecasting stock market returns using macroeconomic variables: an analysis of the Russian stock market. Res Int Bus Finance. 2014;32:100–13.
- 20. Bollen J, Mao H, Zeng X. Twitter mood predicts the stock market. J Comput Sci. 2011;2(1):1–8.
- 21. Schumaker RP, Chen H. Textual analysis of stock market prediction using breaking financial news: the AZFinText system. ACM Trans Inf Syst (TOIS). 2009;27(2):12.
- 22. Plakandaras V, Gupta R, Katrakilidis C, Wohar M. Market sentiment and exchange rate directional forecasting. J Int Money Finance. 2015;53:58–76.
- 23. Namaki A, Shirazi AH, Raei R, Jafari GR. Network analysis of a financial market based on genuine correlation and threshold method. Phys A: Stat Mech Appl. 2011;390(21–22):3835–41.
- 24. Huang W-Q, Zhuang X-T, Yao S. A network analysis of the Chinese stock market. Phys A: Stat Mech Appl. 2009;388(14):2956–64.
- 25. Kim M, Sayama H. Predicting stock market movements using network science: an information theoretic approach. Appl Network Sci. 2016;1(1):14.
- 26. Seong N, Nam K. Forecasting price movements of global financial indexes using complex quantitative financial networks. Knowl-Based Syst. 2022;235:107608.
- 27. Baitinger E. Forecasting asset returns with network‐based metrics: A statistical and economic analysis. Journal of Forecasting. 2021;40(7):1342–64.
- 28.
Newman MEJ. Networks: an introduction. USA: Oxford University Press. 2010.
- 29. Stanley H, Amaral L, Gopikrishnan P, Plerou V, Gabaix X. Anomalous fluctuations in the dynamics of complex systems: from DNA and physiology to econophysics. Phys A: Stat Mech Appl. 1996;224(1–2):302–21.
- 30. Calvet L, Fisher A. Multifractality in asset returns: theory and evidence. Rev Econ Stat. 2002;84(3):381–406.
- 31.
Mandelbrot B. The fractal geometry of nature. W H Freeman and Company; 1982.
- 32.
Peters EE. Fractal market analysis: applying chaos theory to investment and economics. John Wiley & Sons; 1994.
- 33.
Mandelbrot BB. Fractals and scaling in finance: discontinuity, concentration, risk. Springer; 1997.
- 34.
Mandelbrot BB, Hudson R. The (mis)behavior of markets: a fractal view of risk, ruin, and reward. Basic Books; 2004.
- 35. Freeman LC. Centrality in social networks conceptual clarification. Soc Netw. 1978;1(3):215–39.
- 36. Freeman LC. A Set of Measures of Centrality Based on Betweenness. Sociometry. 1977;40(1):35–41.
- 37. Bonacich P. Power and centrality: a family of measures. Am J Sociol. 1987;92(5):1170–82.
- 38. Zhu T, Wang G, Liu Y. Measuring contagion risk in the international stock markets: a network approach. Phys A: Stat Mech Appl. 2016;45341–52.
- 39.
Jackson M, Pernoud A. Systemic risk in financial networks: a survey. 2020.
- 40. Barrat A, Barthélemy M, Pastor-Satorras R, Vespignani A. The architecture of complex weighted networks. Proc Natl Acad Sci U S A. 2004;101(11):3747–52. pmid:15007165
- 41. Albert R, Jeong H, Barabasi A. Error and attack tolerance of complex networks. Nature. 2000;406(6794):378–82. pmid:10935628
- 42. Heshmati Z. Portfolio selection by optimizing risk and return based. Int J Ind Eng Manag Sci. 2021;XX(YY):ZZ-ZZ.
- 43. Newman MEJ. Modularity and community structure in networks. Proc Natl Acad Sci U S A. 2006;103(23):8577–82. pmid:16723398
- 44. Boginski V, Butenko S, Pardalos PM. Statistical analysis of financial networks. Computational Statistics & Data Analysis. 2005;48(2):431–43.
- 45. Onnela J-P, Kaski K, Kertész J. Clustering and information in correlation based financial networks. Eur Phys J B - Condens Matter. 2004;38(2):353–62.
- 46. Kim H-J, Lee Y, Kahng B, Kim I. Weighted Scale-Free Network in Financial Correlations. J Phys Soc Jpn. 2002;71(9):2133–6.
- 47. Xu X-J, Wang K, Zhu L, Zhang L-J. Efficient construction of threshold networks of stock markets. Phys A: Stat Mech Appl. 2018;509:1080–6.
- 48. Park J, Cho C, Lee J. A perspective on complex networks in the stock market. J Econ Interact Coord. 2020;15(2):203–23.
- 49. Nobi A, Maeng SE, Ha GG, Lee JW. Effects of global financial crisis on network structure in a local stock market. Phys A: Stat Mech Appl. 2014;407:135–43.
- 50. Xu X-J, Wang K, Zhu L, Zhang L-J. Efficient construction of threshold networks of stock markets. Phys A: Stat Mech Appl. 2018;509:1080–6.
- 51. Barabási A-L, Bonabeau E. Scale-free networks. Sci Am. 2003;288(5):60–9. pmid:12701331
- 52. Barabási A-L, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004;5(2):101–13. pmid:14735121
- 53.
Hagberg A, Schult D, Swart P. Exploring network structure, dynamics, and function using NetworkX. In: Proceedings of the 7th Python in Science Conference (SciPy2008). 2008. 11–5.
- 54. Jacomy M, Venturini T, Heymann S, Bastian M. ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software. PLoS One. 2014;9(6):e98679. pmid:24914678
- 55. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech. 2008;2008(10):P10008.
- 56. Granger CWJ. Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica. 1969;37(3):424–38.
- 57.
CNBC. Inflation sparks market volatility in October 2022. CNBC; 2022.
- 58.
Forbes. AI-driven tech sector rally boosts market in July 2023. Forbes; 2023.
- 59.
. Federal reserve interest rate hike announcement, September 2023. 2023.
- 60.
Nasdaq. Market Correction Due to Bond Yields, October 2024. Nasdaq; 2024.
- 61. Plerou V, Gopikrishnan P, Rosenow B, Amaral L, Stanley H. A random matrix theory approach to financial cross-correlations. Phys Rev E. 2000;62(3):3023.
- 62. Marti G, Nielsen F, Biémont E, Donnat P. A review of two decades of correlations, hierarchies, networks and clustering in financial markets. Progr Artif Intell. 2017;:33–43.
- 63. Onnela J-P, Chakraborti A, Kaski K, Kertész J, Kanto A. Dynamics of market correlations: taxonomy and portfolio analysis. Phys Rev E Stat Nonlin Soft Matter Phys. 2003;68(5 Pt 2):056110. pmid:14682849
- 64. Breiman L. Random forests. Breiman L. Mach Learn. 2001;45(1):5–32.
- 65.
Chen T, Guestrin C. Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. 785–94.
- 66.
Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W. Lightgbm: a highly efficient gradient boosting decision tree. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017. 3149–57.
- 67. Dorogush AV, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support. arXiv. 2018.
- 68. Friedman J. Greedy function approximation: A gradient boosting machine. Ann Stat. 2001;29(5):1189–232.
- 69. Powers D, M W. Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation. J Mach Learn Technol. 2011;2(1):37–63.
- 70. Willmott C, Matsuura K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim Res. 2005;30:3079–82.