Covariance matrix filtering with bootstrapped hierarchies

Cleaning covariance matrices is a highly non-trivial problem, yet of central importance in the statistical inference of dependence between objects. We propose here a probabilistic hierarchical clustering method, named Bootstrapped Average Hierarchical Clustering (BAHC), that is particularly effective in the high-dimensional case, i.e., when there are more objects than features. When applied to DNA microarray, our method yields distinct hierarchical structures that cannot be accounted for by usual hierarchical clustering. We then use global minimum-variance risk management to test our method and find that BAHC leads to significantly smaller realized risk compared to state-of-the-art linear and nonlinear filtering methods in the high-dimensional case. Spectral decomposition shows that BAHC better captures the persistence of the dependence structure between asset price returns in the calibration and the test periods.

Covariance matrix inference is a cornerstone of the dependence inference between objects.This kind of matrix suffers however from the curse of dimensionality, as they become very noisy when the number of objects is similar to the number of features.Even worse, unfiltered covariance matrices are pathological in the high dimensional case, i.e., when the number of features exceeds the number of objects.This case is frequent e.g. in biological data and in multivariate dynamical systems such as financial markets in which only the most recent history is likely to be relevant.
Given its importance, covariance matrix filtering has a long history.A popular approach is to obtain filtered covariance matrices from the corresponding correlation matrices.Two types of approaches stand out: i) spectral methods, e.g.Random Matrix Theory, Rotationally Invariant Estimators [1], and Shrinkage [2,3]; ii) ansatz for the correlation matrix, e.g.block-diagonal [4] or hierarchical [5].
The usual setting is to have n objects and t features and to compute the correlation matrix between these n objects.Recent results on Rotationally Invariant Estimators [6] propose non-linear shrinkage methods able to correct the eigenvalue spectrum of covariance matrices optimally: the inversion of the QuEST function [7], the Cross-Validated (CV) eigenvalue shrinkage [8] and the IW-regularization [1], the latter being valid only in the low dimensional regime q = n/t < 1, i.e., when there are more features than objects.Eigenvector filtering is more complex.However, ansätze for the shape of the true correlation matrix impose constraints on the structure of the eigenvectors and of the eigenvalues.Such ansatz should be simple enough to clean noise but flexible enough to account for fine relevant details.The popular hierarchical clustering ansatz (HC thereafter) is indeed simple: it assumes that correlations are nested [5], which is equivalent to assume that dependencies are described by a dendrogram (a tree).In practice, it is hard to find statisticallyvalidated hierarchical structures [9] when the fitted hierarchical structure is highly sensitive to small variations of data.
An obvious problem of HC occurs when the structure is more complex than a tree: for example the non-diagonal blocks in Figs 1 and 2 are ignored by a hierarchical ansatz: one needs more than a single hierarchical structure to describe these empirical dependence structures.As a consequence, a non-negligible part of the dependence structure is left out, and in a dynamic context, the stability of a single hierarchical structure is likely to be poor.
Here, we introduce a more flexible hierarchical ansatz able to capture more of the structure of the eigenvectors.The rationale is to compute filtered hierarchical structures of many bootstrapped copies of the initial data, which yields probabilistic hierarchical structures.Such procedure describes the structure of correlation and covariance matrices better while keeping the robustness of hierarchical clustering.We illustrate the power of our method with data from two relevant fields.First, in bioinformatics, DNA micro-array gene expression dependence in tissues is frequently characterized by correlation matrices.Hierarchical clustering and its variants are commonly used [10,11], which helps simplify the covariance matrix by linkage averaging [12] (see Fig. 1).When there are several strong candidates of hierarchical structure, this approach selects a single one, which neglects possibly crucial information held by alternative structures.Comparing unfiltered correlation matrices with the filtering yielded by hierarchical clustering and average linkage (HCAL) [5] (Fig. 1) makes it clear first that (i) hierarchical clustering does capture some of the structure and (ii) a substantial part of the structure is lost (see the bottom plot).This is because hierarchical clustering imposes too strict a structure, which erases out an uncontrolled amount of information.
Another domain in which covariance matrix filtering plays a central role is risk management.Broadly speaking, the problem amounts to minimize future uncertainty by determining the fraction of resources to allocate to every possible choice.Risk in this particular context is due to fluctuations of the future value of the choices.The usual procedure consists in minimizing a suitable risk measure in the calibration window and hoping that the future, realized, risk will bear some relationship with the calibrated risk.The simplest approach consists in defining risk as the variance of the weighted sum of choices' values and to minimise it.This is known as globaly minimum-variance portfolios, a subfield of quadratic portfolio optimization which has a wide range of applications: investment into technologies [13], energy sources mix for countries [14,15], wind farm locations [16], and capital allocation in finance [17].We shall focus on financial risk because data are abundant, which makes it possible to compare the out-of-sample performance of filtering methods.In addition, the high-dimensional regime is particularly relevant in finance: there are many assets to choose from and the speed with which the dependence structure between asset price returns may change asks for an as short as possible calibration period [18].
In an inference or descriptive context such as DNA microarray data analysis, filtering correlation matrices is meant to bring estimated covariance matrices closer to the ground truth.In a dynamical context, especially for non-stationary systems such as financial markets, what matters is the part of the ground truth that most likely persists after the calibration period, i.e., when one uses the allocation weights computed from the filtered covariance matrix.Thus, ideally, the filtered covariance matrix should contain as much of the persistent structure as possible.The nature of the most likely persistent structure is of course unknown from the calibration window only.Figure 2 shows that there are indeed strongly persistent dependence structures of asset price returns between two non-overlapping periods.Similarly to correlation matrices of DNA microarray data, while a pure HC does capture a sizeable part of the useful structure, the non-diagonal correlation patterns blocks e.g., around (x, y) = (140, 600) indicate that HC itself is not sufficient.
Here, we propose a method that improves on hierarchical clustering.We exploit the fact that the less adequate a hierarchical ansatz, the more fragile it is with respect to small data perturbations.At a global level, the idea is thus to take bootstraps of the data and to average the resulting hierarchical structures.More precisely, we apply HCAL to bootstraps of the original data and then average all HCAL-filtered matrices to obtain a new kind of filtered matrix.We call our method BAHC, which stands for Bootstrapped Average Hierarchical Clustering, and define it for covariance and correlation matrices.BAHC rests on multiple hierarchical structures weighted by their frequency.A single hierarchical structure will only emerge if all the bootstrap realizations lead to the same dendrogram.Thus, this method is particularly adapted to data that is well-described by a hierarchical structure in a first approximation [19] but avoids selecting a single fragile structure.

Microarray DNA
We first apply the BAHC method to DNA microarray data [20] where the objects are n = 327 tissues of patients affected by pediatric acute lymphoblastic leukemia and features are the expression intensities of t = 271 genes (q 1.21).Classifying leukemia subtypes based on their gene expression profile is crucial to correct prognosis and risk assessment.However, the simplistic classification obtained from a single tree could lose relevant information coming from the complex interactions among the elements analyzed.
To show the new insights brought by BAHC compared to a simple hierarchical clustering, we kept the dendrograms of all the bootstraps and produced a bidimensional t-SNE projection [21] of their cophenetic correlation coefficients.Two main clusters appear, which essentially differ by the topmost branches, as shown by the tanglegram (right plot of Fig. 3).This means that two parts of the dendrogram which appear to be far away in a dendrogram may be much closer in another one.We applied FIG. 3. Bidimensional t-SNE projection of the cophenetic distance between the dendrograms of 1000 bootstraps of DNA microarray data [20].Two main clusters emerge, with further subclusters, corresponding to distinct potential hierarchies of dependence that are compatible with data.The red crosses indicate the centroids of the two largest clusters whose structure differences appear in the tanglegram of right plot.
spectral clustering [22] to determine sub-clusters of each main cluster.Typically, sub-clusters within either of the main clusters differ at lower levels of branching.In summary, sub-groups of cancers that lie on far branches of the sample dendrograms could be miss-classified as uncorrelated despite being possibly much closer in the dendrograms of many bootstraps.

Risk minimization
Given the n × (t + 1) matrix of values of choice i at time k, p i,k , and the value returns r i,k = p i,k /p i,k−1 − 1, one must determine the fraction of investment given to each choice i, the i−th component of vector w.The risk is measured by the standard deviation of the portfolio return, denoted by v P , whit v 2 P = w T Σw, where Σ is the n × n covariance matrix of the matrix of returns R. If the weights can be negative, the optimal weights w = with the condition i w i = 1 in order to avoid the trivial solution w = 0.This situation is called longshort portfolio in the following.In some situations, e.g., when choosing one's portfolio of energies or products, only positive weights are allowed, in which case one has to solve a quadratic programming problem; we refer to this situation as long-only portfolio.
The realized (out-of-sample) risk is the relevant performance measure.Using the out exponent, the realized risk is where w are computed from the in-sample covariance matrix, filtered or not, and X † is the transpose of matrix X.
All the results reported below use the simulation setup described in the Methods section: in short, we perform 10,000 simulations of n = 100 random assets in random periods.We compare the out-of-sample risk computed from BAHC and several other well-known methods: the classic Ledoit and Wolf linear shrinkage method (LW henceforth) [2] and the more recent nonlinear shrinkage approach based on the inversion of the QuEST function (QuEST) [7].We also include the Cross-Validated eigenvalue shrinkage (CV) [8] and HCAL [5], denoted by <.
Figure 4 shows that BAHC outperforms all the alternative methods for t in 300, i.e., for q = n/t 1  3 , which includes all of the high-dimensional regime q > 1.In particular, for the long-only portfolios, the BAHC method reaches the absolute minimum out-of-sample risk over all t in and all methods for t in 200, i.e., q 1/2.The right-hand-side plots of Fig. 4 report the probability that BAHC outperforms each alternative method when q > 1/2, which confirms that BAHC is better than all the other methods not only with respect to the average realized risk, but also in probability in this region.
Finally, we vary the length of the test window, t out .We report the probability that the BAHC method outperforms all its competitors as a function of both t in and t out in Fig. 5. Our approach achieves lower realized riskwith in more than half the simulations than any other method tested here as soon as t in < 226 (q > 1/2.26) for every t out in the considered range.Remarkably, as t out increases, the calibration length below which BAHC has better than 50% chances to outperform all its competitors only weakly increases.We interpret this result by the fact that our method is able to extract the right kind of persistent structure in that particular data, which is confirmed below by spectral analysis.We found similar results for the Hong Kong equity market (see S.I.).

Spectral Properties
In order to understand why and when our method has a better performance than the other methods based  on spectral clustering, it is instructive to compare the in-and out-of-sample persistence of the eigenvalues and eigenvectors produced by all the filtering methods considered here.The spectral decomposition of correlation matrix C is denoted by C = U † ΛU , where U is a n × n matrix formed by the eigenvectors of C and Λ is the diagonal matrix obtained from the corresponding eigenvalues.

Eigenvectors stability
A simple way to characterise eigenvectors stability is to compare the empirical out-of-sample correlation matrix C out with the Oracle correlation estimator defined as is the Oracle eigenvector estimator, the idea being that Ξ in C = C out if in-and out-of-sample eigenvectors coincide (see S.I.).The Oracle estimator for the covariance matrix, denoted by Ξ in Σ , is defined in a similar way. Figure 6   tion of t in for n = 100 assets.Note that CV, LW and QuEST methods all use the in-sample eigenvectors and thus do not need separate computations.Generally, our method yields more stable correlation and covariance matrices for t in < 300 (q > 1/3), i.e., already in the lowdimensional case.The difference is due to the fact that the eigenvectors obtained by our method are more stable than the vanilla in-sample eigenvectors, which mechanically improves the Oracle estimator.
Figure 6 also shows that the probability that the eigenvectors of BAHC-filtered correlation matrices are more stable than those provided by the alternative filtering methods grows as t in becomes smaller.The same applies to the comparison between BAHC -filtered and empirical covariance matrices, while HCAL, denoted by <, has better performance in about a 25% of samples almost independently of t in .In short, as soon as q > 1/3 in this dataset, the BAHC method likely yields more persistent eigenvectors than all the other filtering methods considered here.

Eigenvalues stability
Since both the covariance Σ and precision Σ −1 matrices are relevant to minimum-variance optimization, we measure two types of residues that focus on large and small eigenvalues, defined as where λ i = (Λ) ii is the i-th eigenvalue of the in-sample estimator and z i = (Z in ) ii comes from the Oracle estimator computed with the respective filtered eigenvector matrix and i is the respective rank of these eigenvalues.The residue measure hi mainly accounts for the discrepancy between the largest eigenvalues and the residue measure low attributes more weight to the discrepancy between the smallest eigenvalues.Figure 7 plots the residues of the correlation and covariance matrices respectively as a function of t in .We compare our approach with the sample estimator, HCALfiltered matrix, and the Cross-Validated (CV) eigenvalue distribution.While CV method outperforms all the other methods when t in 1000 (q > 0.01), the eigenvalues produced by our method are still much closer to the Oracle than those of the raw sample estimator when t in 500.

Filtered correlation and covariance matrices
The ultimate test is of course to compare filtered insample matrices with out-of-sample matrices.Figure 8 reports the Frobenius distance between the filtered insample and out-of-sample correlation and covariance matrices for all the tested methods.Expectedly, BAHC outperforms all the other ones for t in 300. Figure 8 plots the fraction of times the Frobenius norm of our method is lower than the other methods, which shows that the BAHC method outperforms HCAL filtering for every t in .

I. DISCUSSION
Filtering covariance and correlation matrices requires to take care of O(n 2 ) coefficients.Focusing on O(n) variables, for example by tweaking the eigenvalues or using a single hierarchical ansatz, works to some extend.Making further progresses requires to filter more variables, if possible while keeping an O(n) ansatz.This is what the BAHC method that we introduce achieves: by using m bootstraps and applying an O(n) structure, BAHC allows some additional flexibility, while keeping the overall structure simple.
Our method both filters out estimation noise and improves the stability of the eigenvectors in a dynamical context.Indeed, the spectral decomposition of BAHCfiltered correlation matrices is close to the optimal CV method with respect to the eigenvalue distribution.Furthermore, in the dynamical context investigated here, the eigenvectors produced by our method have a higher overlap with the out-of-sample ones than the unfiltered insample eigenvectors for reasonably small q = t/n.This is why our method leads to better minimum-variance portfolios than all the competing filtering methods when the calibration window is small.In particular, if no short selling is allowed, our approach produces, on average, the lowest-risk portfolio.
Future work is needed to characterize the average dependence structure produced by BAHC better, from both theoretical and empirical points of view.In addition, BAHC may still be too strict in some cases and thus leave out valuable information, hence, further refinements of the ansatz will need to be investigated.

Datasets description
We consider the daily close-to-close returns of US equities, adjusted for dividends, splits, and other corporate events.More precisely, the dataset consists of large-capitalization stocks, from 1992-02-03 to 2018-06-29.The number of stocks with data varies over time: it ranges from 399 in 1992-02-06 to 723 in 2018-06-29 and is roughly constant from 2008 onwards.The list of tickers is reported in S.I.
DNA microarray data [20] can be downloaded from [23].It consists of gene expression intensity of 327 tissues of patients affected by pediatric acute lymphoblastic leukemia and a subset of 271 genes.

Numerical simulations with financial data
All the simulations are carried out in the same way: each point of each plot is an average over 10, 000 simulations, each of which includes an in-sample window of length t in and an out-of-sample window of length t out = 42 days (about two trading months) unless otherwise specified; it starts from a random day uniformly chosen in the available dataset.To have meaningful inand out-of-sample windows given the maximum t in considered, the first day of the out-of-sample must be after 01-01-2000; each simulation selects n = 100 assets at random among the assets with no missing value in both inand out-of-sample windows.

BAHC algorithm
Given matrix R ∈ R n×t , our method prescribes to create a set of m bootstrap (feature-wise) copies of R, denoted by {R (1) , R (2)  The Pearson correlation matrix of each bootstrapped data matrix R (b) is then computed and denoted by C (b) ; in turn the latter is filtered with the hierarchical clustering average linkage (HCAL) proposed in [9], which yields C (b)< .In short, the HCAL uses two ingredients: the distance D = 1 − C to agglomerate cluster in a hierar-chical way, and the averaging of the correlation between clusters (see S.I. and [9] for more details).
Finally, the filtered correlation matrix C BAHC is the average of the HCAL-filtered matrices C (b)< To build a BAHC-filtered covariance matrice, we estimate the variance of r and finally obtain the BAHC-filtered covariance matrix

Frobenius norms
We use rescaled Frobenius norms to account for the fact that the number of assets in our dataset depends on time: In addition, because CV, LW and QuEST methods do not guarantee the identity on the diagonal of filtered correlation matrices; therefore, contrarily to BAHC, we do not include the diagonal elements in the metric and thus define We found that the performance of CV, LW, QuESTbased correlation estimators is slightly improved by replacing c ij with cij √ cii cjj , which also ensures that the diagonal elements equal one, and thus have used this modification in our analysis.

Source code
We have written a BAHC package for both R and Python, available from CRAN and PyPI, respectively.

FIG. 1 .FIG. 2 .
FIG.1.Correlation matrix from tissue-gene micro-array data of patients affected by lung cancer.The upper left plot is the sample correlation matrix, the upper right plot is the result of hierarchical and average-linkage averaging (HCAL).The bottom plot is the difference between the two: it still has evident structure unaccounted for by HCAL.

FIG. 5 .
FIG.5.Fraction of time BAHC yields a smaller realized risk than all the alternative methods.Left plot: portfolios with positive and negative weights; right plot: portfolios with only positive weights.The dotted line corresponds to q = t/n = 1, and the level curve to a 50% probability.10, 000 independent simulations per point; t out = 42 days, n = 100 assets, US equities.

F
reports the Frobenius distances (see the Methods section) C out − Ξ in C C F and Σ out − Ξ in Σ Σ as a func-

FIG. 6 .
FIG.6.Frobenius distance between the out-of-sample matrices and the Oracle estimators obtained with the in-sample eigenvectors (in), the in-sample BAHC-filtered eigenvectors (BAHC) and the in-sample HCAL-filtered eigenvectors (<).Upper panels refer to correlation matrices C, lower panels to covariance matrices Σ.The left panels are the Frobenius norm of the difference between the estimator and the out-of-sample realization; the right panels are the fraction of time BAHC outperforms the alternative estimators.10, 000 independent simulations per point; t out = 42 days, n = 100 assets, US equities.

FIG. 7 .
FIG.7.Average residue hi and low over 10, 000 simulations with random calibration windows and a random selection of n = 100 assets.The upper panel refers to the correlation matrix, the lower panel refers to the covariance matrix.10, 000 independent simulations per point; t out = 42 days, n = 100 assets, US equities.
, • • • , R (m) }.A single bootstrap copy of the data matrix R (b) ∈ R n×t has elements r (b) ij = r is (b) j , where s (b)is a vector of dimension t obtained by random sampling with replacement of the elements of vector {1, 2, • • • , t}.The vectors s (b) , b = 1, • • • , m are independently sampled.