## Abstract

A key metric to determine the performance of a stock in a market is its *return* over different investment horizons (*τ*). Several works have observed heavy-tailed behavior in the distributions of returns in different markets, which are observable indicators of underlying complex dynamics. Such prior works study return distributions that are marginalized across the individual stocks in the market, and do not track statistics about the joint distributions of returns conditioned on different stocks, which would be useful for optimizing inter-stock asset allocation strategies. As a step towards this goal, we study emergent phenomena in the distributions of returns as captured by their pairwise correlations. In particular, we consider the pairwise (between stocks *i*, *j*) partial correlations of returns with respect to the market mode, *c*_{i,j}(*τ*) (thus correcting for the baseline return behavior of the market), over different time horizons (*τ*), and discover two novel emergent phenomena: (i) the standardized distributions of the *c*_{i,j}(*τ*)’s are observed to be invariant of *τ* ranging from 1000min (2.5 days) to 30000min (2.5 months); (ii) the scaling of the standard deviation of *c*_{i,j}(*τ*)’s with *τ* admits good fits to simple model classes such as a power-law *τ*^{−λ} or stretched exponential function (λ, *β* > 0). Moreover, the parameters governing these fits provide a summary view of market health: for instance, in years marked by unprecedented financial crises, such as 2008 and 2020, values of λ (scaling exponent) are substantially lower. Finally, we demonstrate that the observed emergent behavior cannot be adequately supported by existing generative frameworks such as single- and multi-factor models. We introduce a promising agent-based Vicsek model that closes this gap.

**Citation:** Miyahara H, Qian H, Holur PS, Roychowdhury V (2024) Emergent invariance and scaling properties in the collective return dynamics of a stock market. PLoS ONE 19(2): e0298789. https://doi.org/10.1371/journal.pone.0298789

**Editor:** Juan E. Trinidad Segovia, University of Almeria: Universidad de Almeria, SPAIN

**Received:** July 27, 2023; **Accepted:** January 30, 2024; **Published:** February 23, 2024

**Copyright:** © 2024 Miyahara et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** All data are available from Wharton Research Data Services (WRDS) database (https://wrds-www.wharton.upenn.edu/) and are widely used for research in this field. Code is provided in a cloud repository (https://github.com/pholur/stock-market) for easy reproduction of the results.

**Funding:** The author(s) received no specific funding for this work.

**Competing interests:** The authors have declared that no competing interests exist.

## Background

Stock prices demonstrate considerable volatility, a result of several confounding factors such as traders’ collaborative and competitive decision-making to buy, hold or sell, differing appetites for risk, and various time horizons for expected returns on investment [1–7]. A considerable body of literature has focused on identifying patterns in price fluctuations and on developing succinct dynamical models that display characteristics similar to those of a real market. Financial experts have, for instance, frequently observed seasonality patterns in individual stock prices and the fractal nature of price fluctuations [8, 9]. In contrast, macroscopic patterns that concern the ensemble of stocks would emerge because of correlated dynamics in investment decisions, and would reflect inter-stock asset allocation strategies used by investors. The abundant literature in swarming and “econophysics” [4, 10] provides a framework both for numerical analysis of such joint price data and for formulating theoretical generative models.

Conventionally, investors and economists compute the *return* [11]: the return over an investment horizon of *τ* is defined as the equivalent compounded interest rate if one bought a stock at time *t* and sold it at time *t* + *τ*, and estimated as *r*(*t*;*τ*) ≔ ln(*p*(*t* + *τ*)) − ln(*p*(*t*)), where *p*(*t*) and *p*(*t* + *τ*) are the stock prices at time *t* and *t* + *τ*, respectively. Return sequences are a stationary measure (a parameter constant over the interval *τ*) of price change characterized by several statistical properties observed in empirical evidence across markets (often referred to as *stylized facts* [12]).

Prior research has explored the statistical properties of these returns as a means to characterize the dynamics of the market. Some [11, 13] have identified linear relationships in the log-log scale on the distributions of the returns. Plerou et al. [14] discover power law fits [15, 16] on (i) the cumulative distribution function (CDF) of the return distribution and (ii) the standard deviation of the return distribution as a function of market capitalization. Similarly, Müller et al. [17] demonstrated evidence of a scaling regime governing the mean of returns with respect to the return horizon (for *τ* ≤ 20 seconds). Return distributions and their properties for longer horizon *τ* beyond high frequency trading time scales have not been studied. *Moreover, such prior works only study return distributions that are marginalized across the individual stocks in the market*. For example, in [11, 13], the authors fix a *τ*, compute *r*_{i}(*τ*) for all stocks *i* in a market, and then estimate the distribution of this set of *r*(*τ*)’s, thus marginalizing over all the stocks. The corresponding Complementary CDFs (CCDFs) for our dataset are presented in Fig 1 of the S1 File; for large enough *τ*, the tails can be fit with those of power-law distributions. Similarly, in [14] the authors compute the standard deviation, *σ*_{i}(*τ*), of returns for a fixed *τ* and for all stocks with market capitalization *S*_{i}, and then show that log(*σ*_{i}(*τ*)) scales linearly with log(*S*_{i}); again, the returns are marginalized over all the stocks with the same capitalization. *None of these works track statistics about the joint distributions of returns* conditioned on different stocks, which would be useful for optimizing inter-stock asset allocation strategies.

To address such issues, other methods have attempted to model the inter-stock return correlations in order to compare stocks’ relative performances over time [4]. These correlations are particularly useful to construct *graphical models* of the market, in this case, a fully connected network, where the stocks are nodes and the pair-wise correlations correspond to the inter-stock edge weights. Structures are distilled from within the network representation by adopting various graph theory algorithms [18–21]. For example, one can compute the Minimum Spanning Tree (MST), which, for certain return horizons, exhibits a local aggregation of communities of stocks (nodes), such that each community is shared by stocks belonging to the same market sector [11, 22]. Recent studies have attempted to refine this method of identifying clusters by calculating the normalized *partial* correlations in relation to the market mode [23]. With *fixed* *τ* > *τ*_{0} (where the MST structures are obtained), these partial correlation scores have been observed to contain enough inter-stock information to facilitate agglomerative clustering of the Korean Stock Exchange (KOSPI) that aligned with GICS sectors, while network modeling of the partial correlations computed using daily returns has helped uncover specific stocks that are influential in driving the return patterns in a subset of high-capitalization NYSE stocks [24].

## Our contributions

Such MST-based studies (see Fig 8) provide only a limited visual representation of the underlying return correlation distribution and its dependence on *τ*. We extend such MST-based analysis of the correlation statistics to the study of the density function governing the return correlations. In contrast to existing work discussed earlier, we are interested in: (a) the distributions of *market-mode adjusted* partial correlations computed at both short (1 minute) and *long* investment horizons (50000 minutes); and (b) the distribution properties as a function of the investment horizon *τ*. *As our first main contribution*, we find that for a significant range of *τ* (varying from approximately 2.5 days to around 2.5 months), the standardized distributions (scaled by the standard deviation *σ*(*τ*) and zero-shifted by the mean *μ*(*τ*)) of the market-mode adjusted partial return correlations are invariant of *τ*.

The above distribution invariance results suggest that both the standard deviation *σ*(*τ*) and the mean *μ*(*τ*) of the pairwise partial correlations would be a function of return horizon *τ* in the regime where such invariance is observed. We find that *μ*(*τ*) has no significant scaling behavior with respect to *τ* (see Fig 2 in the S1 File). *As our second main contribution*, we find that the standard deviations of the partial return correlations do indeed scale as a function of *τ* in the distribution invariance regime, and demonstrate convincing fits via either a power law or a stretched exponential function. The critical model parameters—the scaling exponent in the case of the power law (λ), and the stretching parameter (*β*) in the stretched exponential function—appear to be rich indicators of macroeconomic volatility patterns. The distribution invariance as well as the scaling of *σ*(*τ*) are observed to hold for 1000min ≤ *τ* ≤ 30000min. Evidence spans 17 years of real S&P500 stock price data, sampled every minute. Data can be accessed for research purpose at the Wharton Research Data Services (WRDS) (https://wrds-www.wharton.upenn.edu/) and the code repository (https://github.com/pholur/stock-market) is linked.

Finally, we explore if these numerically observed emergent behavior properties can be replicated by an agent-based generative model. We first reexamine the single- and multi-factor generative models, popular generative frameworks used to model consensus behavior in financial markets [25]. These models for the most part fail to replicate the above-mentioned emergent trends—the invariant standardized histograms and the power-law/stretched-exponential fits with respect to *τ*: (a) The single-factor model fails to reproduce the vine MST structure; (b) Multi-factor models, while generating the vine, fail to produce both the distribution-invariance and the scaling phenomena. *As our third main contribution,* we introduce an alternate framework, a modified Vicsek model—commonly used to describe the dynamics of active matter—that proves to be a much more promising candidate for reproducing the empirical evidence. In these approaches, the stock market would be modeled as a closed environment, where individual stocks behave as agents in a vector space that influence each other. At each time-step, the position of an agent corresponds to a particular stock’s instantaneous market behavior. Agents that exhibit correlated behavior over multiple time steps cluster together as swarms.

## Materials and methods

### Correlations of returns and partial correlation with respect to market mode

Suppose a market has *N* stocks; in the S&P500, *N* ≈ 500. Let us denote, by *p*_{i}(*t*), the price of stock *i* at time *t* for *i* = 1, 2, …, *N* and *t*_{ini} ≤ *t* ≤ *t*_{fin}. Typical macroeconomic market analyses such as Year-over-Year (YoY) gain, annualized returns and GDP growth, cap *T*_{int} ≔ *t*_{fin} − *t*_{ini} to 1 year from January 1st to December 31 to avoid seasonality patterns and resulting artifacts in the correlations. We similarly consider each calendar year separately and the evidence of scaling is thus presented individually for each of the 17 years (2004−2020).

We sample each of the stock prices at a granularity of 1 minute. Let the price sequence of a stock *i* be *p*_{i}. For *T*_{int} = 1 year, there are ∼98000 values per sequence. We compute the effective return or interest rate *r*_{i}(*t*;*τ*) of the stock *i* at time *t* over a time horizon of *τ* (*τ* ≤ *T*_{int}), which is generally preferred over the absolute price as a *first-order* metric for investing. An investment in the *i*^{th} stock at time *t* (say, a sum of *mp*_{i}(*t*) by purchasing *m* units), when *compounded continuously* at the given rate, would yield the same amount as that obtained by selling the stock at time (*t* + *τ*) (i.e., *mp*_{i}(*t* + *τ*)). Quantitatively, $m\,p_i(t)\,e^{r_i(t;\tau)} = m\,p_i(t+\tau)$. Thus, we get:
$$r_i(t;\tau) = \ln p_i(t+\tau) - \ln p_i(t) \tag{1}$$
for *t*_{ini} ≤ *t* ≤ *t*_{fin} − *τ*. Therefore, an investment horizon of *τ* yields a return sequence of length *T*_{int} − *τ* + 1. Note that for the longest considered *τ* = 30000 min, the return sequence for each stock still contains a significant number of return values (∼68000). After computing the return sequences of all stocks, we can find the market-mode return sequence as:
$$r_0(t;\tau) = \frac{1}{N}\sum_{i=1}^{N} r_i(t;\tau) \tag{2}$$
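As a minimal sketch of the two steps above, the snippet below computes log returns for a horizon `tau` and then a market-mode sequence. The toy prices, the stock names, and the helper names are hypothetical, and taking the market mode as the equal-weighted average of the per-stock returns is an assumption made here for illustration.

```python
import math

def log_returns(prices, tau):
    """r(t; tau) = ln p(t + tau) - ln p(t) for every t with t + tau in range."""
    return [math.log(prices[t + tau]) - math.log(prices[t])
            for t in range(len(prices) - tau)]

def market_mode(return_sequences):
    """Equal-weighted average across stocks at each time step (assumed form)."""
    n = len(return_sequences)
    return [sum(column) / n for column in zip(*return_sequences)]

# Toy minute-level prices for two hypothetical stocks.
prices = {
    "A": [100.0, 101.0, 102.0, 101.0, 103.0],
    "B": [50.0, 50.5, 50.0, 51.0, 51.5],
}
tau = 2
returns = {name: log_returns(series, tau) for name, series in prices.items()}
r0 = market_mode(list(returns.values()))
```

With 5 prices and `tau = 2`, each return sequence has 3 entries, which matches the length *T*_{int} − *τ* + 1 when *T*_{int} counts intervals rather than samples.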

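We denote the time averages of *r*_{i}(⋅;*τ*) and *r*_{i}(⋅;*τ*)*r*_{j}(⋅;*τ*) for *i*, *j* = 0, 1, 2, …, *N* by
$$\langle r_i(\cdot;\tau)\rangle = \frac{1}{T_{int}-\tau+1}\sum_{t=t_{ini}}^{t_{fin}-\tau} r_i(t;\tau) \tag{3}$$
$$\langle r_i(\cdot;\tau)\,r_j(\cdot;\tau)\rangle = \frac{1}{T_{int}-\tau+1}\sum_{t=t_{ini}}^{t_{fin}-\tau} r_i(t;\tau)\,r_j(t;\tau) \tag{4}$$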

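Now we are ready to define the *conventional correlation*. Between any pair of stocks *i*, *j* = 1, 2, …, *N*:
$$\rho_{i,j}(\tau) = \frac{\langle r_i(\cdot;\tau)\,r_j(\cdot;\tau)\rangle - \langle r_i(\cdot;\tau)\rangle\,\langle r_j(\cdot;\tau)\rangle}{\sigma_i(\tau)\,\sigma_j(\tau)} \tag{5}$$
where
$$\sigma_i(\tau) = \sqrt{\langle r_i(\cdot;\tau)^2\rangle - \langle r_i(\cdot;\tau)\rangle^2} \tag{6}$$
$$\sigma_j(\tau) = \sqrt{\langle r_j(\cdot;\tau)^2\rangle - \langle r_j(\cdot;\tau)\rangle^2} \tag{7}$$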

Note: In an optimized market, the cross-correlation between *r*_{i}(*t*) and *r*_{j}(*t* + Δ) for non-zero Δ can be written by replacing the right-hand side of Eq (4) by:
$$\frac{1}{T_{int}-\tau-\Delta+1}\sum_{t=t_{ini}}^{t_{fin}-\tau-\Delta} r_i(t;\tau)\,r_j(t+\Delta;\tau) \tag{8}$$

However, these correlations should equal 0; otherwise investors would use one return series to predict another stock’s return (recall Stylized Fact I [12]).

Next we introduce the concept of partial correlation between stocks *i* and *j* with respect to the market mode [26]: consider the residuals obtained when predicting *r*_{i}(*t*;*τ*) from *r*_{0}(*t*;*τ*) using a linear fit. The correlation between the residuals associated with stocks *i* and *j* is the partial correlation, given by
$$c_{i,j}(\tau) = \frac{\rho_{i,j}(\tau) - \rho_{i,0}(\tau)\,\rho_{j,0}(\tau)}{\sqrt{\left(1-\rho_{i,0}^{2}(\tau)\right)\left(1-\rho_{j,0}^{2}(\tau)\right)}} \tag{9}$$
where *ρ*_{i,j}(*τ*) is the (conventional) correlation between stocks *i* and *j*, and *ρ*_{i,0}, *ρ*_{j,0} are correlations of returns of stocks *i* and *j* with respect to the market return.
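The equivalence between the residual-based definition and the closed-form expression in terms of *ρ*_{i,j}, *ρ*_{i,0} and *ρ*_{j,0} can be checked numerically. The sketch below is illustrative only: the three short return series are made up, and the helper names are hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def pearson(x, y):
    """Conventional correlation built from time averages."""
    mx, my = mean(x), mean(y)
    cov = mean([a * b for a, b in zip(x, y)]) - mx * my
    var_x = mean([a * a for a in x]) - mx * mx
    var_y = mean([b * b for b in y]) - my * my
    return cov / math.sqrt(var_x * var_y)

def partial_corr(ri, rj, r0):
    """Closed-form partial correlation of ri and rj given the market mode r0."""
    rho_ij, rho_i0, rho_j0 = pearson(ri, rj), pearson(ri, r0), pearson(rj, r0)
    return (rho_ij - rho_i0 * rho_j0) / math.sqrt((1 - rho_i0 ** 2) * (1 - rho_j0 ** 2))

def residuals(y, x):
    """Residuals of the least-squares linear fit of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    slope = (mean([a * b for a, b in zip(x, y)]) - mx * my) / (mean([a * a for a in x]) - mx * mx)
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

# Made-up return series: market mode r0 and two stocks ri, rj.
r0 = [0.010, -0.020, 0.015, 0.000, -0.005, 0.020]
ri = [0.012, -0.018, 0.020, 0.004, -0.010, 0.015]
rj = [0.008, -0.025, 0.010, -0.002, 0.001, 0.030]

c_formula = partial_corr(ri, rj, r0)
c_residual = pearson(residuals(ri, r0), residuals(rj, r0))
```

Both routes give the same number up to floating-point error, so in practice the closed form is the cheaper option once the conventional correlation matrix is available.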

### Distributions and invariance of partial correlations

Let *p*_{τ}(*x*) be the probability density function of *c*_{i,j}(*τ*) empirically estimated as:
$$p_\tau(x) = \frac{2}{N(N-1)}\sum_{1 \le i < j \le N} \delta_D\big(x - c_{i,j}(\tau)\big) \tag{10}$$
where *δ*_{D}(⋅) is the Dirac delta function. We observe that, as expected, the functional form of *p*_{τ}(⋅) is *τ*-dependent (see Fig 1). However, **the standardized distributions**, i.e., the distributions obtained when the *c*_{i,j}(*τ*) are mean-shifted and scaled by the standard deviation,
$$\tilde{p}_\tau(x) = \sigma(\tau)\,p_\tau\big(\mu(\tau) + \sigma(\tau)\,x\big) \tag{11}$$
**are invariant over a significant regime of τ**; i.e., *τ*_{max} ≥ *τ* ≥ *τ*_{min}. This indicates that the scaling factor (in this case, the standard deviation) scales with *τ*.

### Scaling phenomena during distribution invariance

Let *σ*(*τ*) be the standard deviation and *μ*(*τ*) the mean of the partial correlations *c*_{i,j}(*τ*). In this work, we use the *inverse* of the standard deviation, rather than the standard deviation itself, as an interpretable measure of *precision*, $b(\tau) = 1/\sigma(\tau)$.
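For concreteness, standardizing a set of partial correlations and reading off the precision might look like the following sketch. The values in `c_values` are hypothetical, and the population standard deviation is used here as an assumption.

```python
import statistics

def standardize(values):
    """Mean-shift and scale by the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values], mu, sigma

# Hypothetical partial correlations c_ij for a single horizon tau.
c_values = [-0.10, -0.05, 0.00, 0.02, 0.04, 0.09, 0.15]
z_values, mu, sigma = standardize(c_values)
precision = 1.0 / sigma  # b(tau): inverse of the standard deviation
```

The standardized values have zero mean and unit variance by construction, so any remaining dependence on *τ* must show up in the shape of their histogram.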

We will demonstrate that during the regime where the *standardized* distributions are invariant (*τ* > *τ*_{0}), the dependence between *τ* and the precision *b*(*τ*) is well-explained by simple models with few and interpretable parameters such as the power law,
$$b(\tau) \propto \tau^{-\lambda} \tag{12}$$
and the stretched exponential function:
(13)
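A power law becomes a straight line in log-log coordinates, so its exponent can be estimated with ordinary least squares on (ln *τ*, ln *b*). The sketch below assumes the sign convention *b*(*τ*) = *a* *τ*^{−λ}, with a hypothetical prefactor `a`, and uses synthetic data rather than the paper's measurements.

```python
import math

def fit_power_law(taus, bs):
    """Least-squares fit of ln b = ln a - lam * ln tau; returns (a, lam)."""
    xs = [math.log(t) for t in taus]
    ys = [math.log(b) for b in bs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope

# Synthetic precision values generated from an exact power law.
taus = [1000, 2000, 5000, 10000, 20000, 30000]
bs = [12.0 * t ** -0.3 for t in taus]
a_hat, lam_hat = fit_power_law(taus, bs)
```

On noiseless synthetic data the fit recovers the generating exponent; on real data the residuals of this line are what distinguish a power law from a stretched exponential.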

Aside from convincing model fits, the critical model parameters—the scaling exponent λ (in the case of the power law) and the stretching parameter *β* (in the case of the stretched exponential)—once trained, emerge as candidates for macroeconomic indicators of market volatility. Indeed, we suspect that *any other simple model class that can convincingly fit and validate the (near-linear in log-log scale) dependence of b(τ) with respect to τ should similarly express the market characteristics within its model parameters.*

## Empirical results: Emergent phenomena in real-world data

Recall the dataset descriptors: *T*_{int} = 1 year; the sampling time interval of stock prices is 1 minute; there are ∼251 business days in a year when the stock market is open from 9:30 to 16:00 ET; for every day the market is open, each stock has ∼390 prices. For *T*_{int} = 1 year, each stock has a ∼98000-length price series; the price series is arranged such that the closing price at 16:00 ET on the current market day immediately precedes the opening price at 9:30 ET on the following market day. While volatility in after-hours trading may result in drastic price fluctuations at particular indices in the series, an increasing *τ* has a smoothing effect on these spikes, and we believe that the return correlation PDFs are not significantly affected by these gaps. Results presented below are replicated for a shorter *T*_{int} = 3 months (see S1 File).
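The sizes quoted above follow from simple arithmetic, sketched below. The constants are the approximate figures from the text, and `return_sequence_length` is a hypothetical helper.

```python
MINUTES_PER_DAY = 390   # 9:30 to 16:00 ET
TRADING_DAYS = 251      # approximate business days per year

series_length = MINUTES_PER_DAY * TRADING_DAYS  # prices per stock per year

def return_sequence_length(n_prices, tau):
    """Number of returns r(t; tau) obtainable from n_prices consecutive prices."""
    return n_prices - tau

for tau in (1000, 30000):
    n_returns = return_sequence_length(series_length, tau)
    trading_days = round(tau / MINUTES_PER_DAY, 2)
    print(tau, n_returns, trading_days)
```

The product 390 × 251 = 97890 is the "∼98000" series length, and subtracting the longest horizon of 30000 minutes leaves 67890 returns, the "∼68000" quoted earlier.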

### Functional form of the standardized partial correlation PDF is invariant during a finite *τ* regime

We provide qualitative and quantitative evidence of the invariance of the functional form of the standardized distribution for a finite regime 30000min > *τ* > 1000min. First, in Fig 2, we plot the standardized histograms for 6 years (remaining years can be verified using the attached codebase), by superimposing the functions across different *τ*. Quantitative evidence is provided next:

As *τ* exceeds 1000 min, the shape of the standardized PDF takes a more stable form. A similar analysis with *T*_{int} = 3 months is presented in Fig 5 of the S1 File.

**Pairwise KL divergence between standardized partial correlation PDFs**: For each year from 2004 to 2020, we compute the KL divergence (KLD) between pairs of standardized partial correlation PDFs computed with *τ*_{1} and *τ*_{2}, respectively. We would like to show that inside the regime where functional invariance was visually observed (1000min ≤ *τ* ≤ 30000min), the KLD for any pair {*τ*_{1}, *τ*_{2}} is small compared to the KLD computed between a pair of standardized PDFs for which *τ* is outside the scaling region. The pairwise KL divergences between the standardized partial correlation PDFs across the 6 evaluated years are presented in Fig 3. The dark square block in the bottom right of each heatmap implies that the KL divergence between any pair of standardized distributions sampled from the region of *τ* specifying the functional invariance is low. In order to compute the KL divergence in a consistent and comparable fashion, each standardized PDF is re-sampled (*N* = 10000) using Gaussian smoothing.

**Probing the onset of the function invariance using Gaussian Mixture Models and Kurtosis**: We fit a 2-mode Gaussian Mixture Model (GMM) on the partial correlation PDF and probe the weights of the two components across *τ*. We expect to see a transition as the function invariance sets in. As shown in Fig 4, initially, one mode is dominant, and once *τ* > 1000 min, the weights of the two modes become comparable. Such transitions are observed with 3-, 4- and 5-mode fits as well.

**Kurtosis of the density function with respect to *τ***: In Fig 5, we demonstrate the same transition (from *τ* < 1000min to *τ* > 1000min) by plotting the kurtosis of the partial correlation PDF with respect to *τ*. When the invariance property takes effect, the kurtosis values suggest a corresponding transition from a leptokurtic (>3) to a platykurtic (<3) regime.
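The two scalar diagnostics used above, KL divergence between a pair of distributions and kurtosis, can be sketched as follows. This is not the paper's pipeline: a plain histogram with a small additive constant stands in for the Gaussian-smoothed re-sampling, and the two samples are made up.

```python
import math

def histogram(values, edges, eps=1e-9):
    """Normalized bin masses; eps keeps empty bins strictly positive."""
    counts = [eps] * (len(edges) - 1)
    for v in values:
        for k in range(len(edges) - 1):
            if edges[k] <= v < edges[k + 1]:
                counts[k] += 1.0
                break
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """Discrete KL divergence: sum over bins of p_k * ln(p_k / q_k)."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))

def kurtosis(values):
    """Pearson kurtosis m4 / m2^2, which equals 3 for a Gaussian."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    return m4 / m2 ** 2

edges = [-3.0 + 0.5 * k for k in range(13)]  # 12 bins covering [-3, 3)
sample_a = [-1.6, -0.9, -0.4, -0.1, 0.0, 0.2, 0.5, 0.8, 1.5]
sample_b = [-2.4, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 2.4]
p_a, p_b = histogram(sample_a, edges), histogram(sample_b, edges)
kld_ab = kl_divergence(p_a, p_b)
```

Because kurtosis is unchanged by shifting and rescaling, it gives the same value for the raw and the standardized partial correlations.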

Mode 1 corresponds to the mode with the lower standard deviation. As *τ* increases, the second mode starts contributing significantly to the fit, signaling the onset (shaded cyan).

As *τ* increases, the correlation distribution becomes flatter, resulting in lower kurtosis (platykurtic regime).

### The scaling behavior and its emergent properties

Motivated by the observed function invariance in the standardized distributions of the partial return correlations, in Fig 6 we plot the precision *b*(*τ*) (defined in Materials and Methods) as a function of *τ* for each year, 2004 to 2020, to find evidence of a scaling phenomenon. Within the *τ* range where the invariance is identified (*τ* = 1000 min to 30000 min, the regime highlighted in light cyan), we observe a near-linear relationship between ln*τ* and ln*b*, suggesting a Stretched Exponential or Power Law fit. Note that for very small values of *τ* (1min ≤ *τ* ≤ 200min), the estimated pairwise correlations are not reliable due to the Epps effect [27]. We now evaluate these fits using Model Architecture Search (MAS).

From *τ* = 1000 min to 30000 min (the regime highlighted in light cyan), observe the near-linear relationship between ln*τ* and ln*b*. A similar visualization with *T*_{int} = 3 months is presented in Fig 6 of the S1 File.

### Model architecture search

We consider 6 candidate regression models to fit the {*τ*, *b*(*τ*)} data samples from *τ* = 1000min to 30000min: Linear, Polynomial (degree = 2, 5), Exponential, Stretched Exponential, and Power Law. A 4-fold cross-validation setup is used: for every year between 2004 and 2020, we fit each candidate model on 75% of the samples and report the training Mean Squared Error (MSE) together with the validation MSE on the remaining 25%. Error bars indicate the standard deviation of the MSE across the 4 folds. In the case of log-transformed target variables, the MSE is computed in the original scale to ensure fair comparison. Fig 7 indicates that the Power Law and Stretched Exponential models have the best fits among the candidates. When *T*_{int} = 1 year, the Stretched Exponential fit is slightly better. When *T*_{int} = 3 months, the Power Law fit is marginally better (see Fig 7 in S1 File).
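The cross-validation procedure can be illustrated with two of the six candidates. In this sketch the data are synthetic, the folds are interleaved (an assumption, since the fold assignment is not specified above), and the validation MSE of the log-fitted power law is evaluated in the original scale as described.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def linear_model(train):
    slope, intercept = fit_line([t for t, _ in train], [b for _, b in train])
    return lambda tau: slope * tau + intercept

def power_law_model(train):
    # Fit in log-log space, but predict in the original scale.
    slope, intercept = fit_line([math.log(t) for t, _ in train],
                                [math.log(b) for _, b in train])
    return lambda tau: math.exp(intercept) * tau ** slope

def cross_validate(samples, make_model, k=4):
    """Mean validation MSE over k interleaved folds (original scale)."""
    fold_errors = []
    for fold in range(k):
        train = [s for idx, s in enumerate(samples) if idx % k != fold]
        valid = [s for idx, s in enumerate(samples) if idx % k == fold]
        model = make_model(train)
        fold_errors.append(sum((model(t) - b) ** 2 for t, b in valid) / len(valid))
    return sum(fold_errors) / k

# Synthetic (tau, b) samples drawn from an exact power law.
samples = [(tau, 12.0 * tau ** -0.3) for tau in range(1000, 30001, 1000)]
mse_linear = cross_validate(samples, linear_model)
mse_power = cross_validate(samples, power_law_model)
```

On these samples the power-law candidate attains a validation error near machine precision while the linear candidate does not, which is the kind of gap the comparison in Fig 7 is designed to expose.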

Observe that the Power Law and the Stretched Exponential fits consistently report lower validation MSE. Error bars are computed across 4 folds of cross-validation. Polynomial models demonstrate clear signs of overfitting, while the exponential model (*β* = 1) is only slightly worse than the best fits. A similar analysis with *T*_{int} = 3 months is presented in Fig 7 of the S1 File.

## Generative models

We have observed so far that real S&P500 data demonstrate a functional invariance in the standardized distributions of the partial return correlations and an associated near-linear dependence, in log-log scale, of the precision with respect to *τ*. Economists attempt to construct generative models to explain these results in order to better characterize the consensus-forming taking place in the stock market. A starting point, as noted in the Background, is the correlation graph of inter-connected stocks, which reveals emergent stock communities for a return horizon *τ* > *τ*_{0} corresponding to industry sectors. These are shown in Fig 8.

Each vertex is colored depending on its GICS sector. Edge weights are computed as: . Observe that for larger values of *τ*, stocks belonging to the same sector cluster together.
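As an illustration of how such a tree can be built, the sketch below runs Prim's algorithm on a small correlation matrix. The edge-weight formula used for Fig 8 is not reproduced here; the distance √(2(1 − *c*)), common in the correlation-network literature, is assumed instead, and the 4 × 4 matrix is made up.

```python
import math

def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense symmetric distance matrix; returns tree edges."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        i, j = min(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda edge: dist[edge[0]][edge[1]])
        edges.append((i, j))
        in_tree.add(j)
    return edges

# Made-up correlations: stocks 0 and 1 form one group, stocks 2 and 3 another.
corr = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]
# Assumed correlation distance: strongly correlated pairs are close together.
dist = [[math.sqrt(2.0 * (1.0 - c)) for c in row] for row in corr]
mst = minimum_spanning_tree(dist)
```

The resulting tree links each strongly correlated pair directly and joins the two groups through their least dissimilar members, which is the mechanism behind the sector-wise clustering described above.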

Many generative models—such as the single- and multi-factor models—have been proposed to explain these interactions by quantifying the pair-wise inter-stock interactions. We show that these do not replicate the observed functional invariance and/or scaling behavior and propose a suitable replacement—a modified Vicsek model—that is more promising.

### Factor models

#### Single factor model.

The conventional single-factor model [28] uses only the fluctuations of the market mode and individual stock prices to model the correlations of return, i.e.,
(14)
where *r*_{0}(*t*) represents the market mode describing the overall fluctuation of the financial market. In Eq (14), *ξ*_{i}(*t*) is the part not included in the market mode. In the one-factor model, *ξ*_{i}(*t*) is a zero-mean Gaussian-distributed time series that is independent across stocks and independent of *r*_{0}(*t*). In fact, we can derive the values of the correlation coefficients in the one-factor model: *ρ*_{ij} = *ρ*_{i0}*ρ*_{j0}, so that the partial correlations in Eq (9) vanish. The standardized distribution in Eq (11) reduces to the delta function (*μ* = 0, *b*(*τ*)→∞), violating the structure of the empirically observed correlation distribution.
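The vanishing of the partial correlations can be observed in a small simulation. The sketch below assumes the generic form *r*_{i}(*t*) = *β*_{i} *r*_{0}(*t*) + *ξ*_{i}(*t*) with hypothetical loadings `betas` and unit-variance noise, which may differ from the exact parameterization of Eq (14).

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def partial_corr(ri, rj, r0):
    rho_ij, rho_i0, rho_j0 = pearson(ri, rj), pearson(ri, r0), pearson(rj, r0)
    return (rho_ij - rho_i0 * rho_j0) / math.sqrt((1 - rho_i0 ** 2) * (1 - rho_j0 ** 2))

rng = random.Random(0)
T, N = 4000, 5
betas = [0.5 + 0.2 * i for i in range(N)]         # hypothetical market loadings
r0 = [rng.gauss(0.0, 1.0) for _ in range(T)]      # market mode
returns = [[beta * m + rng.gauss(0.0, 1.0) for m in r0] for beta in betas]

partials = [partial_corr(returns[i], returns[j], r0)
            for i in range(N) for j in range(i + 1, N)]
max_abs = max(abs(c) for c in partials)
```

All sampled partial correlations stay close to zero, with a spread that shrinks as the number of time steps grows, so the model has no mechanism to produce the broad distributions seen in the data.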

#### Multi-Factor model.

The MSTs of the pairwise stock correlations clearly show clustering of stocks belonging to the same sector, and one can formulate a multi-factor model [29–31] wherein we supply additional parameters that correspond to the individual sectors. Since the computational models we are considering directly output returns (not the prices), one needs to introduce an additional parameter to simulate the effect of time scale *τ*: by varying this parameter, one can control whether the market mode dominates, drowning out the effect of the sectors (as observed for small *τ* in real data), or is suppressed, allowing the sector correlations to emerge (as observed for large *τ*).

Consider *K* sectors in a market. The multi-factor model takes the form:
(15)
where,
(16) (17)
and, for *i* = 1, 2, …, *N*,
(18) (19)

Here, is the Gaussian perturbation function and is non-zero when stock *i* belongs to sector *k* and 0 otherwise. Additionally, the variance of market and sector returns are set to 0.05 × Δ*t*, and 0.10 × Δ*t* respectively. Note that increasing corresponds to larger perturbations of in successive time steps. Thus it plays the role of 1/*τ*: a large implies market-mode dominance and small , sector-mode dominance (see Figs 9 and 10).

We varied : (left) , (center) , and (right) . We set the number of steps 10000 and . Observe for small , the sectors are separated into distinct groups in the MST—similar to when *τ* is large. As increases, the groups lose identity and merge; i.e the sector information is devalued: a similar effect to when *τ* is small. Precision in the edge weights is set to 1.

We set the number of steps 10000 and .

We performed the numerical simulation of the multi-factor model with *K* = 2 sectors. We set *N* = 500 and the number of stocks in each sector as 250. We swept in Eq (19) across multiple values. The other parameters are set as follows: , , , and if stock *i* belongs to sector *k* and otherwise zero. In Fig 9, we show the MSTs of the multi-factor models for and . Observe that for small , the stocks per sector belong in separate communities in the MST. As increases, the communities collapse.

In Fig 10, we plot the PDF *p*_{τ}(⋅) in Eq. (Eq (10)) of the multi-factor model and standardized PDF in Eq (11). The following dynamics are observed: (a) For small —may correspond to large *τ*—two peaks originate from two sector modes and one peak originates from the market mode; (b) for moderate , the market mode dissipates and the two sector modes dominate the distribution; and (c) for large , the return correlation distribution becomes random due to large perturbations. The multi-factor model explicitly uses sector affiliation as a parameter resulting in multi-modal correlation distributions. This multi-modal structure of results in the precision of return correlations *b*(*τ*) not scaling with *τ*.
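The multi-modality can be reproduced with a generic two-sector factor model. The sketch below is not the parameterization of Eqs (15)-(19): it assumes unit-variance market, sector and idiosyncratic terms added with equal weight, 4 hypothetical stocks per sector, and an equal-weighted empirical market mode.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def partial_corr(ri, rj, r0):
    rho_ij, rho_i0, rho_j0 = pearson(ri, rj), pearson(ri, r0), pearson(rj, r0)
    return (rho_ij - rho_i0 * rho_j0) / math.sqrt((1 - rho_i0 ** 2) * (1 - rho_j0 ** 2))

rng = random.Random(1)
T, per_sector = 3000, 4
market = [rng.gauss(0.0, 1.0) for _ in range(T)]
sectors = [[rng.gauss(0.0, 1.0) for _ in range(T)] for _ in range(2)]
labels = [0] * per_sector + [1] * per_sector
returns = [[m + s + rng.gauss(0.0, 1.0) for m, s in zip(market, sectors[k])]
           for k in labels]
r0 = [sum(column) / len(returns) for column in zip(*returns)]  # empirical market mode

same, cross = [], []
for i in range(len(returns)):
    for j in range(i + 1, len(returns)):
        c = partial_corr(returns[i], returns[j], r0)
        (same if labels[i] == labels[j] else cross).append(c)
mean_same = sum(same) / len(same)
mean_cross = sum(cross) / len(cross)
```

Same-sector pairs cluster around a positive partial correlation and cross-sector pairs around a negative one, giving two separated peaks rather than a single distribution whose width could scale smoothly.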

### Modified Vicsek model

The Vicsek model is a generative model that can display some of the salient group characteristics of swarming behavior, as observed in the motion patterns of flocks of birds and swarms of fish. Compared to the multi-factor model, where group assignments are provided in advance, such assignments emerge naturally in the Vicsek model: each particle in the swarm is influenced by other particles that are within a neighborhood. Based on such local-only interactions, long-distance order emerges and groups of particles cluster together in their dynamical behavior, akin to sectors emerging in stock markets.

Our model uses the standard setup [32] with the following modifications: (a) Consistent with the factor model setup, the predicted variable is the return *r*_{i}(*t*); (b) Particles (individual stocks) move in ℝ (rather than in ℝ^{2}); a stock’s offset from 0 is the return value. The proximity of one particle *i* to another *j* at time *t* is the absolute value of the difference of the returns |*r*_{i}(*t*) − *r*_{j}(*t*)| rather than the typical cosine distance metric used in ℝ^{2}; (c) Time steps are discretized rather than continuous. The update step is:
(20)
where
(21)
and *N*_{i,δ} is the number of elements *j* that satisfy |*r*_{i}(*t*) − *r*_{j}(*t*)| < *δ*. An extended derivation of the Vicsek update step is presented in the S1 File.
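To make the neighborhood rule concrete, the sketch below implements a generic one-dimensional, Vicsek-style update in which each particle moves to the average return of its *δ*-neighbors and then receives Gaussian noise of scale *η*. It is a simplified stand-in and not the update of Eqs (20)-(21); the function names and the starting state are hypothetical.

```python
import random

def neighbors(returns, i, delta):
    """Indices j with |r_i - r_j| < delta; particle i always counts itself."""
    return [j for j, rj in enumerate(returns) if abs(returns[i] - rj) < delta]

def step(returns, delta, eta, rng):
    """One synchronous update: local average plus Gaussian noise of scale eta."""
    updated = []
    for i in range(len(returns)):
        nbrs = neighbors(returns, i, delta)
        local_mean = sum(returns[j] for j in nbrs) / len(nbrs)
        updated.append(local_mean + eta * rng.gauss(0.0, 1.0))
    return updated

rng = random.Random(0)
state = [-1.0, -0.9, 0.0, 0.95, 1.0]

# Neighbor counts N_{i,delta} for a small radius: two tight pairs and one loner.
counts = [len(neighbors(state, i, 0.2)) for i in range(len(state))]

# With no noise and a radius covering everyone, all particles merge in one step.
merged = step(state, delta=10.0, eta=0.0, rng=rng)
```

The two limiting cases mirror the qualitative picture in the text: a small radius leaves particles in isolated pockets, while a large radius without noise collapses them into a single group.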

**Evaluating the Vicsek model under different parameter settings of** *δ*, *η*: We next discuss how the parameters *δ* (radius of influence) and *η* (standard deviation of noise), individually and collaboratively, can play roles analogous to the return horizon parameter *τ* in the empirical stock price data. In particular, we analyze the dependence of the distributions of correlation of returns defined in Eq (20) on *δ* and *η*.

*Role of δ*: The parameter *δ* plays a crucial role in determining the extent to which particles in the model are influenced by their neighbors. We anticipate that very large values of *δ* lead to substantial inter-particle influence, producing highly correlated return sequences, while very small *δ* values result in independent particle behavior with little correlation. Thus, we would expect small values of *δ* to lead to very small return correlations, akin to short return horizons *τ*, and as *δ* is increased we expect pockets of correlated returns, just as sectors emerge in the empirical data with increasing *τ*. Indeed, as shown in Fig 13, for intermediate values of *δ* the precision follows a near-linear dependence in the log-log scale with respect to *δ*, which is a scaling phenomenon.

*Role of η*: Injected noise adds randomness to the trajectories of particles and, together with the radius of influence determined by *δ*, the noise level *η* facilitates the formation of distinct pockets. In the absence of this noise and with a sufficiently wide radius of influence, particles tend to merge into a unified group, exhibiting strong correlations with each other. Visual evidence illustrating this effect can be seen in the MST structure in Fig 11(a). As illustrated in Fig 11(b) and 11(c), increasing the noise factor results in the formation of communities. Of course, if *η* is increased further, the vine structure will disintegrate.

The number of steps was set to 10000, and *δ* was fixed at 0.10. For *η* = 100.0, the vine structure is apparent and we used these vines to define the analogs of sectors in stocks. In particular, we performed community detection on the MST [33] corresponding to the vine structure, and identified 11 communities, matching the number of GICS sectors. These communities indeed constitute individual vines, as shown by colored nodes in the right-most figure. Next we tracked the associated stocks as *η* decreased with *δ* held fixed. Notably, as *η* decreases, the sectors collapse due to the fixed neighborhood of *δ*, which encourages more particles (stocks) to interact with one another. This increased interaction arises as the particles experience less perturbation from *Ξ*, leading to homogeneous behavior and radial MSTs.

We set *δ* = 1.0 and the number of steps 10000. (b) Standardized PDF in Eq (11) of the modified Vicsek model. Note that the subscript is changed from *τ* to *δ*. We set *η* = 10.0 and the number of steps 10000.

*Indeed, δ and η behave as duals of one another while influencing the distribution of the return correlations:* For example, increasing the noise in particle trajectory (*η*) has a similar effect to decreasing each particle’s radius of influence (*δ*). Given these constraints, we look to discover a scaling effect with respect to *δ*, *η* and functional invariance of the standardized PDF for *intermediate* *δ*, *η* values. For simulations, we set *N* = 500, *α*_{i} = *γ*_{i} = 1.0 and *β*_{i} = 0.05 for *i* = 1, 2, …, *N*, and Δ*t* = 1.0.

*Functional form of the correlation PDFs:* In Fig 12, we present the standardized correlation PDF *p*_{(⋅)}(⋅) for various *η* values (on the left) and *δ* values (on the right) (compare with the empirical result in Fig 2). Notably, within a finite range of *η* and *δ*, we observe that the functional form shows invariance properties similar to those observed in the empirical data.

*Scaling behavior with respect to η and δ:* In Fig 13, we plot the relationship between the precision and each of the parameters *η* (left) and *δ* (right), keeping the other fixed (please refer to Fig 6 for a comparison). We observe the scaling phenomenon for intermediate values of *η* and *δ*. While at the extremes particle trajectories are either completely uncorrelated (high *η*, low *δ*) or globally correlated (low *η*, high *δ*), the range in between allows particles to be locally correlated (akin to sectors; see Fig 11 (right)).

The highlighted region (in which) denotes the near-linear fit between *b*(⋅) and *η*, *δ*. For different parameter settings, the near-linear fit in the log-log plot is visualized in cyan.

## Concluding remarks

In this paper, we first observe that the standardized distributions of the partial correlations of returns reach an invariance for a finite range of *τ*. Second, within this *τ* regime, we demonstrate a scaling phenomenon governing the precision of the raw distributions, *b*(*τ*), with respect to the investment horizon *τ*. We additionally review existing stochastic and generative factor models to show that they fail to model these observed emergent phenomena, and we propose a modified Vicsek-inspired framework that is a more promising candidate. The scaling behavior was demonstrated yearly from 2004 to 2020 on real stock price data sampled every minute of trading hours.
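
The two quantities summarized above can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's code: the function names are hypothetical, the market-mode series is taken as a given input (its construction is not restated in this section), the standard first-order partial correlation formula is used, and the precision is taken as *b* = 1/*σ*.

```python
import numpy as np


def partial_corr_wrt_mode(returns, mode):
    """Pairwise partial correlations of stock returns given a market-mode series.

    returns: array of shape (T, N); mode: array of shape (T,) (assumed input).
    """
    n = returns.shape[1]
    full = np.corrcoef(np.column_stack((returns, mode)), rowvar=False)
    rho, rho_m = full[:n, :n], full[:n, n]
    # First-order partial correlation: remove the part explained by the mode.
    num = rho - np.outer(rho_m, rho_m)
    den = np.sqrt(np.outer(1.0 - rho_m ** 2, 1.0 - rho_m ** 2))
    return num / den


def standardize_offdiag(c):
    """Standardize the distinct pairwise values and return them with b = 1 / sigma."""
    vals = c[np.triu_indices_from(c, k=1)]
    sigma = vals.std(ddof=1)
    return (vals - vals.mean()) / sigma, 1.0 / sigma
```

Repeating this for returns computed at several horizons *τ* would give one standardized sample and one precision value per horizon, which is the input needed to examine invariance and scaling.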

The compelling presence of such a scaling phenomenon warrants investigating the role of the model parameters that are crucial to the fit that explains the dependence of *b*(*τ*) on *τ*. Specifically, in the case of a Power Law fit, λ appears as a macro-economic indicator of market health. A similar analysis on the Stretched Exponential fit—also a good fit of the {*τ*, *b*(*τ*)} data in Fig 7—shows a similar effect with respect to the *β* parameter (see S1 File).

Fig 14 plots the scaling exponent across the 17 years of evaluation. The figure shows that λ’s exhibit inter-annual variations, portraying a distinct linear decline from past to present, characterized by intriguing anomalies (highlighted in cyan). We seek to make sense of this trend and interpret its significance.

The definition of λ is given in Eq (12). Error bars are computed using 4-fold cross-validation while estimating the linear fit. In the cyan highlighted regions, anomalies are observed where λ deviates from the linear fit.
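
One way such an estimate with cross-validated error bars could be computed is sketched below. The exact procedure is not spelled out here, so the details are assumptions: the function name `fit_lambda_cv` is hypothetical, the fit is an ordinary least-squares line in log-log space, the error bar is the spread of the slope across the four training splits, and the sign convention *b*(*τ*) ∝ *τ*^{−λ} follows from *σ* = 1/*b* and *σ* ∝ *τ*^{λ}.

```python
import numpy as np


def fit_lambda_cv(tau, b, n_folds=4, seed=0):
    """Estimate lambda from a log-log linear fit, with a k-fold spread as error bar."""
    log_tau, log_b = np.log(tau), np.log(b)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(tau)), n_folds)

    slopes = []
    for held_out in folds:
        # Fit on all points except the held-out fold.
        train = np.setdiff1d(np.arange(len(tau)), held_out)
        slope, _intercept = np.polyfit(log_tau[train], log_b[train], deg=1)
        slopes.append(slope)
    slopes = np.asarray(slopes)

    # Assumed convention: b(tau) ~ tau^(-lambda), hence lambda = -slope.
    return float(-slopes.mean()), float(slopes.std(ddof=1))
```

Applied to the {*τ*, *b*(*τ*)} pairs of a single year, this would return one λ value and one error bar, i.e. one point of the kind plotted per year.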

### Setup

Recall that the standard deviation *σ* of the return correlations is proportional to *τ*^{λ}. To quantify the change in standard deviation as we transition from a short-term (*τ*_{I}) to a long-term (*τ*_{F}) investment horizon, we introduce a novel metric defined as follows:
where *σ*(*τ*_{F}) and *σ*(*τ*_{I}) represent the standard deviations corresponding to the long-term (*τ*_{F}) and short-term (*τ*_{I}) investment horizons, respectively. This measure captures the fractional increase in the standard deviation from the short term to the long term. A large *R*_{σ} indicates that the standard deviation of the return correlations in the short term is much smaller than in the long term. A small *R*_{σ} suggests that the short- and long-term investment horizons look statistically similar.

Using the scaling law, we get: where . Referring back to Fig 14, we observe empirically that λ ∈ (0, 1).

Therefore, *R*_{λ} is an increasing function in λ. Given that we have noticed a consistent *decrease* in the value of λ over the years, our focus now shifts to understanding the implications of a corresponding declining trend in *R*_{λ} across years:
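
A worked version of this step, as a sketch under two assumptions: the metric is read as the fractional increase that the text describes, and *ρ* = *τ*_{F}/*τ*_{I} is a shorthand introduced here for illustration. Under the scaling law *σ*(*τ*) ∝ *τ*^{λ}:

```latex
\begin{aligned}
R_\sigma &= \frac{\sigma(\tau_F) - \sigma(\tau_I)}{\sigma(\tau_I)}, \\
R_\lambda &= \frac{\tau_F^{\lambda} - \tau_I^{\lambda}}{\tau_I^{\lambda}}
           = \rho^{\lambda} - 1, \qquad \rho = \frac{\tau_F}{\tau_I} > 1, \\
\frac{\partial R_\lambda}{\partial \lambda} &= \rho^{\lambda} \ln \rho > 0 .
\end{aligned}
```

Under this reading, *R*_{λ} grows with λ whenever *τ*_{F} > *τ*_{I}. For instance, taking the endpoints of the range quoted in the Abstract, *τ*_{I} = 1000 min and *τ*_{F} = 30000 min, gives *ρ* = 30, so λ = 0.5 yields *R*_{λ} = √30 − 1 ≈ 4.48.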

*Market Maturity:* We first consider the y-intercepts depicted in Fig 6. Specifically, the values of *σ*(*τ*_{I}) (which is 1/*b*(*τ*_{I})) demonstrate a consistent and gradual increase from past to present, while *σ*(*τ*_{F}) remains relatively constant across the same period. Consequently, the decreasing trend in *R*_{λ} suggests that *σ*(*τ*_{I}) gets closer to *σ*(*τ*_{F}) every successive year. In more general terms, *the communities of stocks observed over longer investment horizons in earlier years appear at shorter time horizons in later ones, a sign that investors are becoming increasingly efficient and adept at identifying stock return patterns*, which is a sign of market maturity.

*Global Financial Crises:* We now consider the cyan-colored windows in Fig 14 corresponding to two recent global crises: the subprime mortgage crisis in 2008 and the COVID-19 pandemic in 2020. In these cases, λ dips significantly below the linear fit. As markets stabilized after the 2008 crisis, the λ values rebounded to the linear trend. Since our data stops at 2020, it remains to be seen whether a similar rebound will take effect.

In summary, the discovery of such scaling phenomena and their associated summary statistics in the partial correlations of stock price returns adds to a growing body of work in macro-economic modeling. By extending the qualitative observations of the variations in MST structure to the correlations at large in a quantifiable manner, we demonstrate one robust path to probe market health based on collective dynamics.

## Supporting information

### S1 File. Additional experiments and proofs.

Reporting the Complementary Cumulative Distribution Functions (CCDF) of returns, scaling and function invariance for a 3-month investment horizon, an interpretation of the Stretched Exponential model, and a derivation of the Vicsek model update rule.

https://doi.org/10.1371/journal.pone.0298789.s001

(PDF)

## References

- 1. Newman MEJ. Resource Letter CS–1: Complex Systems. American Journal of Physics. 2011;79(8):800–810.
- 2. Ladyman J, Lambert J, Wiesner K. What is a complex system? European Journal for Philosophy of Science. 2013;3(1):33–67.
- 3. Ladyman J, Wiesner K. What is a complex system? Yale University Press; 2020.
- 4. Mantegna RN, Stanley HE. Introduction to econophysics: correlations and complexity in finance. Cambridge university press; 1999.
- 5. Bonanno G, Lillo F, Mantegna RN. Levels of complexity in financial markets. Physica A: Statistical Mechanics and its Applications. 2001;299(1-2):16–27.
- 6. Bouchaud JP. The subtle nature of financial random walks. Chaos: An Interdisciplinary Journal of Nonlinear Science. 2005;15(2):026104. pmid:16035906
- 7. Stanley H, Gabaix X, Gopikrishnan P, Plerou V. Economic fluctuations and statistical physics: Quantifying extremely rare and less rare events in finance. Physica A: Statistical Mechanics and its Applications. 2007;382(1):286–301.
- 8. Rozeff MS, Kinney WR. Capital market seasonality: The case of stock returns. Journal of Financial Economics. 1976;3(4):379–402.
- 9. Mandelbrot BB, Cootner PH, Gomory RE, Fama EF, Morris WS, Taylor HM. Fractals and Scaling in Finance: Discontinuity, Concentration, Risk. Selecta Volume E. SpringerLink: Bücher. Springer New York; 2013. Available from: https://books.google.com/books?id=H6jqBwAAQBAJ.
- 10. Luna F, Perrone A. Agent-based methods in economics and finance: simulations in Swarm. vol. 17. Springer Science & Business Media; 2002.
- 11. Kwapień J, Drożdż S. Physical approach to complex systems. Physics Reports. 2012;515(3-4):115–226.
- 12. Cont R. Empirical properties of asset returns: stylized facts and statistical issues. Quantitative finance. 2001;1(2):223.
- 13. Drożdż S, Forczek M, Kwapień J, Oświęcimka P, Rak R. Stock market return distributions: From past to present. Physica A: Statistical Mechanics and its Applications. 2007;383(1):59–64.
- 14. Plerou V, Gopikrishnan P, Nunes Amaral LA, Meyer M, Stanley HE. Scaling of the distribution of price fluctuations of individual companies. Phys Rev E. 1999;60:6519–6529. pmid:11970569
- 15. Simkin MV, Roychowdhury VP. Re-inventing Willis. Physics Reports. 2011;502(1):1–35. https://doi.org/10.1016/j.physrep.2010.12.004
- 16. Kong JS, Sarshar N, Roychowdhury VP. Experience versus talent shapes the structure of the Web. Proceedings of the National Academy of Sciences. 2008;105(37):13724–13729. pmid:18779560
- 17. Müller UA, Dacorogna MM, Olsen RB, Pictet OV, Schwarz M, Morgenegg C. Statistical study of foreign exchange rates, empirical evidence of a price change scaling law, and intraday analysis. Journal of Banking & Finance. 1990;14(6):1189–1208.
- 18. Tumminello M, Aste T, Di Matteo T, Mantegna RN. A tool for filtering information in complex systems. Proceedings of the National Academy of Sciences. 2005;102(30):10421–10426. pmid:16027373
- 19. Aste T, Di Matteo T, Hyde S. Complex networks on hyperbolic surfaces. Physica A: Statistical Mechanics and its Applications. 2005;346(1-2):20–26.
- 20. Tumminello M, Di Matteo T, Aste T, Mantegna RN. Correlation based networks of equity returns sampled at different time horizons. The European Physical Journal B. 2007;55(2):209–217.
- 21. Onnela JP, Kaski K, Kertész J. Clustering and information in correlation based financial networks. The European Physical Journal B. 2004;38(2):353–362.
- 22. Kwapień J, Oswiecimka P, Forczek M, Drozdz S. Minimum spanning tree filtering of correlations for varying time scales and size of fluctuations. Physical Review E. 2016;95. https://doi.org/10.1103/PhysRevE.95.052313
- 23. Jung SS, Chang W. Clustering stocks using partial correlation coefficients. Physica A: Statistical Mechanics and its Applications. 2016;462:410–420.
- 24. Kenett DY, Tumminello M, Madi A, Gur-Gershgoren G, Mantegna RN, Ben-Jacob E. Dominating Clasp of the Financial Sector Revealed by Partial Correlation Analysis of the Stock Market. PLOS ONE. 2010;5(12):1–14. pmid:21188140
- 25. Bai J. In: Factor Models. London: Palgrave Macmillan UK; 2016. p. 1–7. Available from: https://doi.org/10.1057/978-1-349-95121-5_2298-1.
- 26. Baba K, Shibata R, Sibuya M. PARTIAL CORRELATION AND CONDITIONAL CORRELATION AS MEASURES OF CONDITIONAL INDEPENDENCE. Australian & New Zealand Journal of Statistics. 2004;46(4):657–664.
- 27. Epps TW. Comovements in Stock Prices in the Very Short Run. Journal of the American Statistical Association. 1979;74(366):291–298.
- 28. Luenberger DG, et al. Investment science. OUP Catalogue. 1997.
- 29. Melas D. Best practices in factor research and factor models. MSCI Research Insight. 2018.
- 30. Levin A. Stock selection via nonlinear multi-factor models. Advances in Neural Information Processing Systems. 1995;8.
- 31. Fama EF, French KR. Comparing Cross-Section and Time-Series Factor Models. The Review of Financial Studies. 2019;33(5):1891–1926.
- 32. Vicsek T, Czirók A, Ben-Jacob E, Cohen I, Shochet O. Novel Type of Phase Transition in a System of Self-Driven Particles. Phys Rev Lett. 1995;75:1226–1229. pmid:10060237
- 33. Girvan M, Newman MEJ. Community structure in social and biological networks. Proceedings of the National Academy of Sciences. 2002;99(12):7821–7826. pmid:12060727