
Assessing portfolio diversification via two-sample graph kernel inference. A case study on the influence of ESG screening

  • Ragnar L. Gudmundarson ,

    Contributed equally to this work with: Ragnar L. Gudmundarson, Gareth W. Peters

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    rlg2000@hw.ac.uk

    Affiliations Department of Actuarial Mathematics and Statistics, Heriot-Watt University, Edinburgh, United Kingdom, Centre for Networks & Enterprise, Edinburgh Business School, Edinburgh, United Kingdom

  • Gareth W. Peters

    Contributed equally to this work with: Ragnar L. Gudmundarson, Gareth W. Peters

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Supervision, Validation, Writing – original draft, Writing – review & editing

    Affiliation Department of Statistics & Applied Probability, University of California, Santa Barbara, Santa Barbara, California, United States of America

Abstract

In this work we seek to enhance the frameworks that practitioners in asset management and wealth management may adopt to assess how different screening rules influence the diversification benefits of portfolios. The problem arises naturally in Environmental, Social, and Governance (ESG) based investing practices, as practitioners often need to select subsets of the total available assets based on some ESG screening rule. Once a screening rule is identified, one constructs a dynamic portfolio, which is usually compared with another dynamic portfolio to check whether it satisfies or outperforms the risk and return profile set by the company. Our study proposes a novel method for comparing the diversification benefits of portfolios constructed under different screening rules. Each screening rule produces a sequence of graphs, where the nodes are assets and the edges are partial correlations; to compare the diversification benefits of screening rules, we propose to compare the obtained graph sequences. The proposed method is based on a machine learning hypothesis testing framework called the kernel two-sample test, whose objective is to determine whether the graphs come from the same distribution: if they do, then the risk and return profiles should be the same. Because the sample data points are graphs, one needs a graph testing framework, and the problem is natural for kernel two-sample testing as one can use so-called graph kernels to work with samples of graphs. The null hypothesis of the two-sample graph kernel test is that the graph sequences were generated from the same distribution, while the alternative is that the distributions are different. A failure to reject the null hypothesis would indicate that ESG screening does not affect diversification, while rejection would indicate that ESG screening does have an effect.
The article describes the graph kernel two-sample testing framework and provides a brief overview of different graph kernels. We then demonstrate the power of the graph two-sample testing framework under different realistic scenarios. Finally, the proposed methodology is applied to data from the S&P 500 to demonstrate the workflow one can use in asset management to test for structural differences in the diversification of portfolios under different ESG screening rules.

1 Introduction

The ability to investigate the time-varying nature of portfolio diversification is fundamental to portfolio managers. In this work we seek to enhance the frameworks that practitioners in asset management and wealth management may adopt when assessing how their investment decision-making may influence portfolio diversification.

We consider a setting where an asset manager can select a total of n assets. However, due to considerations such as regulation, fund risk appetite, investor mandates on investment practice, scope of investment objectives, etc., the fund manager has to select a subset of the n assets which satisfies these restrictions. The procedure we develop in this work allows one to statistically test for differences in the diversification and risk and return profiles of the obtained portfolio with respect to other possible portfolios. That is, the procedure allows for robust testing and comparison of various screening criteria or optimal investment strategies. This is achieved at the level of the inter-asset statistical return relationships rather than directly at the portfolio level, thereby also providing valuable interpretable insights into the resulting portfolio structures.

The problem of comparing screening rules and investment strategies arises in many investment contexts, and we will illustrate this specifically in the timely and topical area of Environmental, Social, and Governance (ESG) based investing practices. For instance, take the iShares ETF investment platform or the Vanguard platforms and look at the investment goals suggested by the robo-advisory. Each has some version that allows investors to select something akin to “go sustainable”, where in the case of the iShares platform, this selection directs the investor to a collection of iShares ETFs that hold portfolios of screened assets aligned with particular ESG objectives across four broad categories: “Screened”, a selection of ETFs designed to eliminate exposure to certain controversial business activities that pose risks or do not align with stated preferences; “Broad ESG”, a selection of ETFs that invest in securities based on overall environmental, social, and/or governance (ESG) performance; “Thematic ESG”, a collection of ETFs that pursue specific environmental, social, governance, or Sustainable Development Goals (SDG) issues; and “Impact”, a collection of ETFs that intend to contribute to measurable positive environmental, social, or SDG outcomes while pursuing financial returns.

The statistical frameworks developed in this manuscript will allow for the statistical testing of changes to diversification structures between portfolios constructed from different screening criteria, which enhances the growing literature on questions of ESG portfolio risk and return performance.

1.1 Motivation for studying portfolio diversification based on ESG screening

ESG investing is a broad field with many different investment approaches addressing various investment objectives across an array of equity and fixed-income markets. However, it is important to acknowledge that the field is still very much in its infancy, and naturally this comes with many challenges that are yet to be resolved, such as the basic question: Do ESG scores drive returns and value in the long run? The construction of ESG scores is in itself a challenging problem and, consequently, there is an emerging literature highlighting challenges and issues that arise when developing coherent ESG scoring methods, see [1–5]. Nevertheless, sustainable investing is increasingly becoming central to capital allocation in many markets, and as such ESG metrics and scores have become of critical importance. Hence, in this work, we do not seek to enter the debate regarding the efficacy of ESG scores, as important as it is, as we feel that this area of investment practice is here to stay. We take this view in light of the significant infrastructure being built by Bloomberg, Moody's, Fitch, MSCI, Sustainalytics, etc. for measuring and monitoring ESG scores for all listed company stocks, warrants, depository receipts, ETFs, and mutual funds, and the continued effort to add such ratings also in fixed-income bond issuances. Therefore, whilst the form of the scoring methodology will undoubtedly evolve, the consideration of ESG scores and the screening of investment assets by ESG scores will be increasingly relevant in investment decision making.

In this work, we seek to develop a statistically rigorous framework that allows one to test the effect ESG screening rules have on the risk profiles of portfolios; for example, do the diversification effects change? Such a framework, we believe, will be useful for practitioners who seek to form ESG factor investing-based models, ESG ETF funds, impact funds, and various other screened asset-traded instruments. Just as it is important to understand the role of counter-cyclical assets, growth assets, and defensive assets when forming an investment portfolio in the equity space, the new dimension of screening assets by ESG scores may also influence the performance of the portfolio. This is a novel approach, as the focus of ESG studies has been on associations and possible causation between ESG scores and portfolio performance, see examples in [6–8].

In [7] they studied the link between ESG information and the valuation and performance of companies. This was achieved by examining a variety of transmission channels within a standard discounted cash flow model. Namely, they considered a cash-flow channel, an idiosyncratic risk channel, and a valuation channel. They argue based on the work of [9] that companies that possess a strong ESG profile are more competitive than their peers. In particular, their competitive advantage can be identified as arising from factors such as greater efficiency in the use of resources, enhanced human capital infrastructure, and better innovation management. Furthermore, they argue that companies that have better ESG ratings are typically better at developing long-term business plans and long-term incentive plans for senior management. They argue that not only do such companies often produce cashflows as dividends but that they are also considered growth stocks as they have a greater capacity to utilize their competitive advantage to generate abnormal returns, which ultimately leads to higher profitability. Consequently, higher profitability results in higher dividends. Evidence was provided for the validity of such economic channels in practice using the constituents of the MSCI ACWI Index, where the MSCI is a global equity index designed to represent the performance of the full opportunity set of large- and mid-cap stocks across 23 developed and 24 emerging markets. It is a good cross-section and highly representative as it covers more than 2,933 constituents across 11 sectors and approximately 85% of the free float-adjusted market capitalization in each market. The index is built using MSCI’s Global Investable Market Index (GIMI) methodology, which is designed to take into account variations reflecting conditions across regions, market cap sizes, sectors, style segments, and combinations.

Furthermore, [10–12] have studied the risk profiles of companies that have good ESG scores. For instance, they have shown that companies with strong ESG characteristics typically have above-average risk control and compliance standards. Consequently, they tend to suffer less frequently from severe incidents such as fraud, embezzlement, corruption, or litigation cases [9, 13]. As a result, they argue that with a reduction of such incidents there will be fewer stock-specific downside price pressures on the share price of companies. One could also argue that with fewer such events, the operational risk capital that companies may wish to set aside to mitigate such losses is reduced, freeing up capital for more productive use in expanding the business. In [9] they again study such an economic channel in the MSCI ACWI constituent assets and show how each individual ESG characteristic of companies is linked to the tail risks of stock price returns. They compare the residual volatility of companies across ESG quintiles, that is, the volatility that is not explained by the common factors in the MSCI Barra Global Equity Model. They also compared the kurtosis of stock returns across ESG quintiles; kurtosis is a commonly used measure of tail risk. They concluded that each of these stock-specific risk measures shows lower idiosyncratic risk for high ESG-rated companies, in particular with respect to tail risks. Furthermore, such downside risks have been further studied in the pension fund context by [14].

When it comes to the question of individual company valuation and ESG ratings, [13, 15] demonstrated that the transmission channel from lower systematic risk to higher valuations can also be explained through the relative size of the investor base. They subsequently argued that companies with poor ESG ratings have a relatively small investor base since increasingly risk-averse investors are socially conscious and so will tend to avoid exposure to poorly ranked ESG rated companies, especially when they are aided by robo-advisory tools that provide them with guidance on such responsible investing screening of assets.

Whilst the studies highlighted above have elucidated the economic channels that link ESG scores to enhanced risk-return performance, it should be noted that there exists a large literature on such studies and that the aforementioned works are just a small selection of the many that have arisen in both academia and the asset management industry. Challenges remain in the analysis undertaken in this domain, exacerbated by two common causes. First, the underlying ESG data used differ: there is still no standardized and agreed-upon framework and methodology for how best to quantify ESG scores for companies and assets, see discussions in [16]. Secondly, many empirical studies analyzing the link between ESG and financial performance do not strictly differentiate between correlation and causality. With the exception of the aforementioned papers, numerous works have estimated a correlation between ESG and financial variables and implicitly interpreted this to mean that ESG is the cause and financial value the effect, although the transmission could easily be reversed. For instance, one can argue that companies with high ESG scores are better at managing their risks, leading to higher valuations. Alternatively, companies with higher valuations might be in better financial shape and therefore able to invest more in measures that improve their ESG profile; such investments might lead to higher ESG scores. Discussions of such challenges have been undertaken on a large scale through several meta-studies that have summarized the results of over 1,000 research reports and found the correlation between ESG characteristics and financial performance to be inconclusive. The existing literature found positive, negative, and nonexistent correlations between ESG and financial performance, although the majority of researchers found a positive correlation, see further discussions in [17, 18].

Finally, it has been argued that responsible ESG practices might mitigate the tail risk of a company, meaning that they can reduce the left-tail risk of companies and, therefore, reduce ex-ante expectations of a left-tail event [19, 20]. Furthermore, good ESG scores are said to reduce the probability of an adverse event occurring and essentially reduce expected litigation costs, reputation losses, and environmental hazards [21, 22]. The impact of carbon risk on stock pricing has been investigated by [23, 24], where a brown-minus-green (BMG) risk factor is developed. [25] predict the accuracy of main financial indicators using ESG indicators. [20] used an R-vine copula to model the (tail-)dependencies of assets using ESG information. It is apparent that ESG factors have made the investment process noticeably more complex, and while most empirical studies have analyzed the interplay of ESG factors and financial performance, little has been done to test structural changes in the financial market due to ESG factors [26] across a large collection of screened assets, say from the S&P 500 universe of 500+ assets which are all ESG rated.

From the actuarial perspective, there is also activity taking place in the profession to explore the role of ESG-based investing practice and the role of environmental practice in asset management; see for instance the Institute and Faculty of Actuaries (IFoA), which has initiated a working party on sustainability research covering: net zero and the implications for investment portfolios, climate change disclosures, managing sustainability in the absence of metrics and measurements, and asset management. This demonstrates that the actuarial community places great emphasis on developing adequate methods to measure and address the difficult subject of social responsibility.

1.2 Contributions

As outlined above, current studies have largely focused on marginal relationships between a company's risk-return profile and its ESG ratings. In this work, we focus on a different aspect of the problem. We study the structural relationships between assets and the portfolio diversification structures that arise when various ESG-based screening rules are applied to select assets for investment consideration in a portfolio strategy. This allows us to address formally, in a rigorous statistical framework, questions such as: Does ESG screening of assets influence the diversification structure of a market, a portfolio, an ETF, or other such targeted investment strategies? Furthermore, which components are affected most by ESG screening criteria, e.g., sectors, industries, defensive stocks, counter-cyclical stocks?

In order to study this aspect rigorously, we develop a statistical testing framework utilizing kernel graph two-sample testing. This research is related to risk analysis within the ESG universe. We introduce a method that complements risk modeling at the portfolio level by allowing for robust testing of structural changes in the dependency structures of assets within a portfolio under different screening rules; our screening rules are based on ESG scores. This is achieved using a graph testing framework that will be described in detail below. We further note that graph testing can also be used to test structural changes in the dependencies of financial markets after an intervention event, should one also seek a causal analysis study.

Specifically, this paper makes the following contributions. First, an extensive statistical analysis is undertaken to assess the expressiveness of various graph kernels for various graph structures using synthetic case studies. We consider cases where traditional graph hypothesis testing settings fail, such as where the nodes and/or edges are attributed or labeled, and where the graph topology can be binary or weighted. For some kernels, we furthermore discuss why, and in which settings, they will give good or bad performance. Second, we present a real-world application in risk management by showing how graph kernel two-sample testing can be used to compare financial graphs (portfolios). The method provides a statistical test of whether two portfolios are the same or not, and can furthermore be easily extended to change-point detection analysis.

1.3 Notation

We use the following notation. Small boldface letters denote vectors of observations and capital boldface letters denote matrices. Capital letters (non-bold) may denote sets or random variables; which is meant should be clear from the context. G denotes a graph; if G denotes a random graph then we underline the fact by writing G ∼ 𝒢, where 𝒢 is a probability distribution. E is a set of edges and V is the set of vertices; we also write E(G) and V(G) to denote the edge and node set of graph G. A is the adjacency matrix of a graph. Ω denotes a space containing graphs. ℋ is a Hilbert space, and ℬ_ℋ is the unit ball in a Hilbert space. k(⋅, ⋅) is a kernel function, where ⋅ represents an input. ‖⋅‖_ℋ denotes a norm with respect to the metric (Hilbert) space ℋ, and ⟨⋅, ⋅⟩_ℋ is a dot product defined on the space ℋ. 1:M is the set {1, …, M}, where M is an integer. If S is a set then |S| denotes its cardinality. ℱ denotes a σ-algebra and ℙ is a probability distribution. ⊗ is the Kronecker product.

1.4 Software for reproducible results

In undertaking this study, the graph kernels are calculated using the GraKel package [27], with the exception of the WWL and RW kernel choices. The WWL kernel was calculated using the original code from [28]. The RW code was written from scratch using the ideas from [29]. The code to calculate the robust MMD estimator was taken from [30] but adjusted to allow for different sample sizes. The Huge package was used for graph construction [31]. The code repository for this paper can be found at https://github.com/ragnarlevi/MMD_Graph_Diversification.

2 Methodology

Whilst hypothesis testing and inference procedures are well established in standard Euclidean domains, there is a pressing need to develop families of inference procedures that will facilitate the development of testing frameworks when working with the topologies of graphs. This is particularly relevant as there is a growing literature that is developing graph and network-based statistical models for financial risk analysis and insurance modeling, see [32, 33].

There are well-established techniques for transforming complex data structures into graph data embeddings; examples include obtaining samples of graphs independently, as observed sub-graphs/communities in a larger graph over time, or as sub-graphs in a larger graph, see [34, 35]. Alternatively, one can use graph construction methods such as graph lasso methods to construct graph-valued data, see [36]. Furthermore, substantial work has been performed on graph estimation from correlation matrices, especially in the financial research area. The main methods used are graphical lasso types of estimation [37–40], Laplacian estimation [41], or threshold rules [42, 43]. In this paper, we use the graphical lasso to estimate the precision matrix/graph of assets, as it is a widely used covariance estimation tool that allows both positive and negative correlations.
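To make the graph-construction step concrete, the following sketch (our illustration, not the paper's code) builds a partial-correlation adjacency matrix from simulated returns. For simplicity it inverts the sample covariance directly, whereas the paper estimates a sparse precision matrix with the graphical lasso (via the Huge package); the hard threshold below only mimics that sparsity, and the asset count and threshold are hypothetical.

```python
import numpy as np

# Illustrative sketch: construct a partial-correlation asset graph from
# simulated returns. The paper uses the graphical lasso instead of a plain
# inverse; the threshold here only imitates the lasso's sparsity.
rng = np.random.default_rng(0)
returns = rng.normal(size=(500, 4))   # 500 days, 4 hypothetical assets
cov = np.cov(returns, rowvar=False)
precision = np.linalg.inv(cov)        # Omega = Sigma^{-1}

# Partial correlation between assets i and j:
#   rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj)
d = np.sqrt(np.diag(precision))
partial_corr = -precision / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

# Weighted adjacency matrix of the asset graph: thresholded off-diagonal
# partial correlations, allowing both positive and negative edge weights.
A = np.where(np.abs(partial_corr) > 0.05, partial_corr, 0.0)
np.fill_diagonal(A, 0.0)
```

The nodes of the resulting graph are assets and the edge weights are (signed) partial correlations, matching the graph-valued data used throughout the paper.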

However, having obtained such graph and network-structured data, there are far fewer studies on how to perform inference on such data structures. There has been some work on graph hypothesis testing; for example, [44] consider two-sample testing of large Erdős–Rényi graphs and prove asymptotic results both for single graphs and for samples of graphs. [45] present a framework based on a matrix representation of networks and consider test statistics based on pairwise matrix distances. [46] consider a two-sample hypothesis testing problem for undirected graphs where one has access to only one observation from each model. [47] consider two-sample testing of weighted graphs. [48] use a test statistic based on a similarity graph constructed on the pooled observations from the two samples. In this paper, we consider a kernelized version of graph sample testing. The main advantage of the kernelized graph testing framework is that it generalizes various graph testing frameworks by allowing a wide range of additional features, such as node labels or attributes and edge labels or weights, along with allowing different graph structures such as undirected, directed, bipartite, and weighted.

The work presented in this manuscript takes a novel perspective on two-sample testing for graph-valued data. A detailed development of kernel-based families of hypothesis testing frameworks for graph-valued data will be established. It will be demonstrated that such frameworks are very flexible and can accommodate many types of graph testing structures relevant to applications in financial risk and insurance modeling. The focus of the applications in this paper will be on Environmental, Social, and Governance (ESG) scoring methods and equity returns. This is a topic of importance to all investment management communities, wealth management practitioners, pension funds, and endowments.

In formulating the concept of hypothesis testing on graph-based data, one can define many forms of inference questions and hypotheses to test based on specific structures of the graph data. The core focus of this work will be to explore two-sample hypothesis testing, effectively generalizing the classical notion of two-sample testing for distributions in Euclidean space to two-sample testing for distributions on graph structures via kernel two-sample testing. Such applications have relevance in addressing a multitude of inferential hypotheses.

In particular, the framework proposed in this manuscript will utilize the recently developed kernel two-sample testing framework introduced by [49], with an extension to graph-valued data samples through the development of various graph kernels. In the two-sample kernel testing methodology, one constructs the test using the so-called maximum mean discrepancy (MMD) statistic. This has been shown to provide various forms of test statistics which trade off bias against computational efficiency, something that can be significant when testing collections of large graphs. Subsequent developments were made on kernel two-sample testing: a statistic to account for autocorrelation was developed in [50], and robust methods were developed in [30]. Furthermore, there has been progress in estimating the distribution under the null via permutation, parametric, and eigenvalue methods, see [49, 51].
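As a concrete reference point, the unbiased squared-MMD estimator of [49] can be sketched as follows. This is our illustration, not the paper's implementation: with graph samples, the Gram matrices would come from a graph kernel, whereas here a plain RBF kernel on toy Euclidean data is used purely to show the mechanics.

```python
import numpy as np

# Minimal sketch of the unbiased squared-MMD estimator from Gram matrices.
def mmd2_unbiased(Kxx, Kyy, Kxy):
    """Unbiased MMD^2 from Gram matrices Kxx (m x m), Kyy (n x n), Kxy (m x n)."""
    m, n = Kxx.shape[0], Kyy.shape[0]
    # The unbiased estimator excludes the diagonal terms k(x_i, x_i).
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

def rbf(A, B, s=1.0):
    # Gaussian (RBF) kernel matrix between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * s ** 2))

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, (100, 2))   # sample from P
Y = rng.normal(3.0, 1.0, (120, 2))   # sample from a shifted Q
Z = rng.normal(0.0, 1.0, (120, 2))   # second sample from P

mmd_far = mmd2_unbiased(rbf(X, X), rbf(Y, Y), rbf(X, Y))    # large: P differs from Q
mmd_near = mmd2_unbiased(rbf(X, X), rbf(Z, Z), rbf(X, Z))   # near zero: same P
```

Replacing `rbf` with a graph kernel Gram matrix yields exactly the statistic used for graph two-sample testing in the later sections.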

The application of kernel MMD two-sample testing has to date focused on problems such as evaluating the performance of models [52, 53] and two-sample tests for nonstationary random processes [54]. [52] introduced a method to select kernels that maximize the power of a test. In addition to kernel MMD two-sample testing, there exist kernel methods for measuring independence and conditional independence [55, 56]. Such kernel testing methods have not yet been extended to graph-valued data sets; this is therefore one of the novel contributions of the work undertaken in this manuscript.

To achieve the development of kernel two-sample testing for graph data, we will explore various families of graph kernels. Graph kernels seek to quantify a notion of closeness or similarity between pairs of graphs; depending on the type of graph kernel used, one can define various notions of similarity between graphs at a macro scale and at a micro vertex-neighborhood scale. Graph kernels have mainly been used in graph classification using support vector machines and are therefore benchmarked accordingly [57, 58]. This paper takes a different approach and evaluates the performance, or expressiveness, of kernels on various graph structures. To do so we carefully perform various experiments and identify the best kernels to use for a given graph structure. The graph structures tested are binomial graphs, scale-free graphs, stochastic block graphs, weighted graphs, node-labeled graphs, node-attributed graphs, and signed graphs. All these types of graphs represent different graph-valued data sets that arise in applications in practice.

The majority of graph kernels are based on a convolution kernel [59], and multiple different graph kernels exist. The Weisfeiler-Lehman kernel [60, 61] is one of the most used graph kernels and furthermore acts as a building block for other kernels. It utilizes the Weisfeiler-Lehman (WL) algorithm and can be used on node-labeled graphs. Random walk kernels are popular kernels [62–64] that can be used on undirected, directed, labeled, and edge-labeled graphs. Their computational efficiency is generally low, making such graph kernels slow when large collections of graph-valued data are being studied or when each graph-valued datum involves a very large number of vertices. However, recently [29] introduced a fast computation for the case of the geometric random walk graph kernel. The shortest path kernel [65] measures and compares the shortest paths of graphs. It can be used for node-labeled graphs and attributed graphs; the computation is, however, slow. The Wasserstein Weisfeiler-Lehman graph kernel is based on the WL algorithm and can be used on node-labeled graphs [28]. It measures the distance between Weisfeiler-Lehman-inspired embeddings using the Wasserstein distance. Propagation kernels are based on monitoring how information spreads through a set of given graphs [66]. They can be used on directed, node-labeled, and attributed graphs. The pyramid match kernel can be used on graph samples with or without node labels [67, 68]. The WL optimal assignment kernel can be used on node-labeled graphs [69], where it has been proven that an optimal assignment kernel can be derived from the WL algorithm. Finally, we have the baseline kernels, the vertex histogram and the edge histogram, which simply count how often a node label or an edge label occurs, respectively. It should be noted that many more graph kernels exist, for example, see [57, 58].
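As a minimal illustration of the baseline kernels mentioned above, the vertex histogram kernel reduces to a dot product of node-label count vectors. The sketch below is ours, with graphs represented simply by the lists of their node labels (the sector labels are hypothetical):

```python
from collections import Counter

# Vertex histogram baseline kernel: k(G, G') is the dot product of the
# node-label count vectors of the two graphs.
def vertex_histogram_kernel(labels_g, labels_h):
    cg, ch = Counter(labels_g), Counter(labels_h)
    return sum(cg[lab] * ch[lab] for lab in cg.keys() & ch.keys())

g1 = ["bank", "bank", "tech", "energy"]
g2 = ["bank", "tech", "tech"]
k12 = vertex_histogram_kernel(g1, g2)   # 2*1 (bank) + 1*2 (tech) = 4
```

Because it ignores topology entirely, this kernel is only expressive when label frequencies differ between the two graph samples, which is precisely why it serves as a baseline in the experiments.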

3 Graph hypothesis testing

In this section, we begin by defining the concept of graph-valued data and the various forms and characteristics that make it distinct from classical ways of encoding data for financial data analysis and insurance and risk modeling applications. This will also serve to introduce some core notation and concepts used in later methodological developments for two-sample graph testing frameworks.

3.1 Graph-valued data descriptors

In order to proceed with developing the concept of graph kernels and graph-valued data two-sample testing, we must first begin by defining a graph data point (i.e. a sample datum that is a graph) along with some of the characteristics of such a datum that will be useful to define the graph kernels that will measure similarity between data points (i.e. pairs of graphs).

Definition 3.1 (Graph). A graph is a pair G = (V, E) where V is a set of n vertices or nodes, V = {v1, …, vn}, and E ⊆ V × V is a set of edges.

The size of the graph is characterized by the number of nodes within the graph, denoted throughout by N = |V|, and the structure of the graph is characterized by its density, which relates to the number of edges, denoted by M = |E|. Furthermore, a graph G is termed undirected if (vi, vj) ∈ E ⇔ (vj, vi) ∈ E. We sometimes write V(G) and E(G) to denote the node and edge set of a graph G respectively. One way to characterize a graph is via its adjacency matrix A, see Definition 3.2.

Definition 3.2 (Adjacency Matrix). The adjacency matrix A of an unweighted graph is given entrywise by Aij = 1 if (vi, vj) ∈ E and Aij = 0 otherwise.

The adjacency matrix A of a weighted graph is given by Aij = wij if (vi, vj) ∈ E and Aij = 0 otherwise, where wij ∈ ℝ is the weight of edge (vi, vj).
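Definition 3.2 can be illustrated directly in code. The sketch below is our illustration for a small undirected graph on vertices {0, 1, 2, 3}, with hypothetical edge weights (e.g. partial correlations):

```python
import numpy as np

# Unweighted adjacency A and weighted adjacency W for the same edge set.
edges = [(0, 1), (1, 2), (2, 3)]
weights = {(0, 1): 0.8, (1, 2): -0.3, (2, 3): 0.5}

n = 4
A = np.zeros((n, n))
W = np.zeros((n, n))
for (i, j) in edges:
    A[i, j] = A[j, i] = 1.0              # Aij = 1 iff (vi, vj) in E
    W[i, j] = W[j, i] = weights[(i, j)]  # Aij = wij for weighted graphs
```

Both matrices are symmetric because the graph is undirected; signed weights are exactly what the partial-correlation graphs of the application carry.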

The local structure of a given sample graph datum provides substantial information about the larger structure of the graph. In particular, the adjacent nodes, or the neighborhood, of the i-th node vi carry much of the information about the node itself, a fact exploited in node classification and graph kernel construction. Formally, we can define the notion of a node or vertex neighborhood.

Definition 3.3 (Neighborhood of a Node). The neighborhood of a node vi is denoted 𝒩(vi) and is the set of all vertices adjacent to vi: 𝒩(vi) = {vj ∈ V : (vi, vj) ∈ E}.

One of the most important node statistics, often used to summarise graph-valued sample data, is the node degree, which describes the significance of each node in the graph in terms of its connectedness to other vertices or nodes.

Definition 3.4 (Node Degree). The degree of a node vi in an undirected graph G is deg(vi) = |𝒩(vi)| = ∑j Aij.

The in-degree of a node vi in a directed graph G is deg_in(vi) = ∑j Aji.

The out-degree of a node vi in a directed graph G is deg_out(vi) = ∑j Aij.

Furthermore, some graph data will have an additional structure such as labels associated with the vertices or nodes as well as labels potentially also associated with its edge set. These labels can give substantial information related to encoding the semantics of complex objects.

Definition 3.5 (Labeled Graph). A labeled graph G is endowed with a function ℓ: V ∪ E ↦ Σ that assigns labels to the vertices and/or edges of the graph from a discrete set of labels Σ, called the alphabet. If the labeling function only labels nodes then the graphs are called node-labeled graphs, and if it only labels edges then they are called edge-labeled graphs.

Note that there can be two labeling functions, one for the edges and one for the nodes with two distinct alphabets. Similar to labels, one may often encounter graph data sets in which each graph sample has nodes or edges that have an associated real-valued vector or matrix, usually called the node attributes.

Definition 3.6 (Attributed Graph). An attributed graph G is endowed with a function a: V ∪ E ↦ ℝ^d that assigns real-valued attribute vectors to the vertices and/or edges.

Other considerations one must think about when developing a graph kernel are whether it will focus on local vertex neighborhood structures or on global structures encompassing the entire graph when developing measures of similarity. In this vein, in the microlocal structure setting one can focus on the neighborhood of a collection of nodes, or the community they live in, rather than the whole graph itself. These communities are called sub-graphs and are often used when developing measures of similarity in graph kernel constructions or community analysis.

Definition 3.7 (Sub-Graphs). Let S ⊆ V be a set of vertices. Then, G[S] = (S, E[S]) is the subgraph induced by S, where E[S] is the set of edges that have both end-points in S:

E[S] = {(u, v) ∈ E : u ∈ S and v ∈ S}.

Besides the adjacency matrix A it is oftentimes convenient to work with the Laplacian matrix L. The Laplacian is of great importance in financial applications, where it is often assumed to encode the precision matrix associated with financial networks and is therefore used in the objective function and constraints in network estimation problems, see [41].

Definition 3.8. Let A be the adjacency matrix of an undirected graph G and let D be a diagonal matrix with the degree of each node on the diagonal, Dii = ∑j Aij. Then the Laplacian L is:

L = D − A.
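For concreteness, the quantities in Definitions 3.2, 3.4 and 3.8 can be computed directly from an edge list. The following is a minimal numpy sketch on a hypothetical toy graph (the edge list is purely illustrative, not an example from the paper):

```python
import numpy as np

# Toy undirected graph on 4 nodes: a path 0-1-2-3 plus the edge 0-2.
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
n = 4

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1       # undirected: symmetric adjacency (Def. 3.2)

degrees = A.sum(axis=1)         # node degrees d_i = sum_j A_ij (Def. 3.4)
D = np.diag(degrees)
L = D - A                       # graph Laplacian (Def. 3.8)

print(degrees)                  # [2. 2. 3. 1.]
print(L.sum(axis=1))            # every row of L sums to 0
```

Note that each row of L sums to zero by construction, a property exploited in the network estimation problems of [41].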

Having presented some core components required to develop the graph two sample testing methodology presented in this paper, we can now proceed to introduce the framework.

3.2 Inference procedures for graph valued data: Two sample testing

We now aim to explore the notion of a distribution taking support on graphs. This is important because, when we discuss two sample testing for graph-valued data, we are effectively exploring either equivalence between two population distributions or equivalence of their properties, such as moments, cumulants, etc. As such, it will be meaningful to recall the definition of a random graph in the context of two sample testing.

In the two-sample testing of graph-valued data we will explore, we assume we are given two sets of samples/observations that comprise collections of graph-valued data {G1, …, Gn} and {G′1, …, G′n′}. The graphs in the two samples are all generated independently from two probability spaces with distributions 𝒫 and 𝒬 respectively, and the goal is to infer whether 𝒫 = 𝒬. A visualization of the problem is given in Fig 1.

thumbnail
Fig 1. Graph two sample testing.

The graph two sample testing scenario. Here we have observed 4 graphs from 𝒫 and 3 graphs from 𝒬. The sample space of 𝒫 and 𝒬 is the same (the same set of possible edges).

https://doi.org/10.1371/journal.pone.0301804.g001

It is worth pausing for a moment to inspect the probability spaces more closely. In the simplest case, the sample space Ω contains all possible edges that can occur in a graph G, that is Ω = {(v1, v2), …, (v1, v|V|), (v2, v1), …, (v|V|, v|V|−1)} (we are assuming that a node cannot be connected to itself). As the sample space is discrete we can define the σ-algebra as the power set of Ω, namely 2^Ω. The probability function then defines the probability of obtaining a certain graph in the sample set of graph-valued data. As an example we can define a uniform population distribution 𝒫(G(|V|, |E|)) = 1/2^M, where M is the total number of possible edges and G(|V|, |E|) is a graph with |V| vertices and |E| edges.
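To make the idea of a distribution over graphs concrete, the following Python fragment draws samples from a binomial (Erdős–Rényi) graph measure, in which each possible edge is included independently with probability p; the node count and edge probability are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_graph(n_nodes, p, rng):
    """Draw an undirected graph in which each of the n(n-1)/2 possible
    edges is present independently with probability p."""
    A = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < p:
                A[i, j] = A[j, i] = 1
    return A

# With p = 0.5 every graph in the sample space is equally likely,
# i.e. P(G) = (1/2)^M with M possible edges: the uniform example above.
sample = [sample_graph(5, 0.5, rng) for _ in range(100)]
```

Each element of `sample` is one realization of a random graph, represented by its adjacency matrix.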

Now, returning to the concept of two sample testing for graph valued data, the goal is to infer whether the two samples of graphs are generated according to the same distribution. This involves developing a statistical test to determine from the population samples whether there is sufficient evidence to reject a null hypothesis that both population distributions generating the two samples of graphs are equivalent: (3.1)

H0: 𝒫 = 𝒬 versus H1: 𝒫 ≠ 𝒬.

As in classical inference procedures, one can directly employ the ideas from standard inference in graph-valued testing settings. The test (3.1) is the main theme of this study. A type I error is made when H0 is rejected although it is true, and a type II error is made when H0 fails to be rejected although H1 is true. The level α of a test is an upper bound on the type I error probability and is a parameter decided beforehand. The power of a test is the probability that the test correctly rejects H0 when H1 is true. A consistent test achieves level α and a type II error that tends to zero as the sample sizes approach infinity.

When constructing graph sample tests one is required to define some kind of features that summarize the graphs within the sample. This will be used to make an inference or seek empirical evidence to reject a null statement that the data-generating population distributions for each graph-valued sample set are equivalent.

In this context, one can choose to map from graph-valued data to summary statistics of each graph data point and then seek statistical evidence that such features are sufficiently different so as to reject a null that the two graph samples were from the same population. In this regard, one could resort to summary statistics such as the average of the average degrees of the graphs or the average of the shortest paths. These, however, might be too simple as measures for graphs, which can have rich features that are not captured by single summary statistics. The challenge with this approach is that the summary statistics may not sufficiently characterise the features of the graph distribution; such basic summaries will in general not act as sufficient statistics, and the consequence is a less powerful test for the decision as to whether to reject the null given two samples of graphs.

Instead, to preserve the power of testing such a hypothesis we advocate in this paper for the use of graph kernels, which utilize the entirety of the graph-valued data samples, rather than just crude summary statistics, when forming the test statistic. This however comes with some level of complexity, as one must now define a mechanism to measure the similarity between pairs of graphs from each population sample. This is where kernel methods come into play [70, 71]. In this paper, we are interested in computing the similarity between two graphs G and G′ by computing their similarity in a reproducing kernel Hilbert space (RKHS) ℋ.

3.3 Graph valued data embedding to kernel RKHS space

The main idea of the kernel method is that if two functions f and g are close in the RKHS norm, then f(G) and g(G) are close for all G ∈ Ω, where Ω is a space of graphs. This fact holds precisely when the space is a so-called reproducing kernel Hilbert space (RKHS).

We begin by introducing a special functional on ℋ, namely the functional that assigns to each f ∈ ℋ its value at G ∈ Ω, which plays an important role in the theory of RKHS:

Definition 3.9 (RKHS). Let ℋ be a Hilbert space of functions f: Ω ↦ ℝ. Consider the linear evaluation functional δG over the space of functions in ℋ that evaluates each function at a point G,

δG(f) = f(G).

If, for all G ∈ Ω, δG is continuous at any f ∈ ℋ, or equivalently, δG is bounded, then the Hilbert space of functions ℋ is called a reproducing kernel Hilbert space (RKHS).

Since the evaluation functional δG is linear and bounded/continuous, we have by the Riesz representation theorem [72] that there exists an element/function kG ∈ ℋ such that f(G) = ⟨f, kG⟩ℋ for all f ∈ ℋ. kG is called the reproducing element.

Definition 3.10 (Reproducing Kernel). Let ℋ be a Hilbert space of functions f: Ω ↦ ℝ and let G, G′ ∈ Ω be any two graphs. The kernel function measuring similarity between G and G′, defined by k(G, G′) = ⟨kG, kG′⟩ℋ, is called the reproducing kernel of ℋ if it satisfies:

  • ∀G ∈ Ω, k(⋅, G) ∈ ℋ,
  • ∀G ∈ Ω, ∀f ∈ ℋ, ⟨f, k(⋅, G)⟩ℋ = f(G). (reproducing property)

When working with kernels, the usual approach is to use a feature mapping kG = ϕ(G) (see Definition 3.11). In particular, it can be seen that k(G, G′) = ⟨kG, kG′⟩ℋ = ⟨ϕ(G), ϕ(G′)⟩ℋ. In our case, we are mapping each G to a function that is a linear combination of kernels within ℋ. The kernel function is defined as:

Definition 3.11 (Kernel Function). A kernel is a positive definite function k such that k(G, G′) = ⟨ϕ(G), ϕ(G′)⟩ℋ for all graphs G, G′ ∈ Ω, where ϕ: Ω ↦ ℋ is called the feature mapping.

When working with data we need to calculate the kernel function between all pairs of data points and store the values in a matrix called the kernel matrix K.

Definition 3.12 (Kernel Matrix). The kernel matrix is defined as Kij = k(Gi, Gj), where {G1, …, GN} is the data.

In the context of two sample graph testing, we have two data sources: sample 1 of graphs assumed drawn from 𝒫 and sample 2 of graphs assumed drawn from 𝒬. It can be good to order the kernel matrix such that K has a block structure as follows: (3.2)

K = [ K𝒫𝒫  K𝒫𝒬 ; K𝒬𝒫  K𝒬𝒬 ],

where K𝒫𝒫 is the kernel function evaluated at pairs of data points within the sample coming from the unknown distribution 𝒫, K𝒬𝒬 is the kernel function evaluated at pairs of data points within the sample coming from the unknown distribution 𝒬, and K𝒫𝒬 is the kernel function evaluated at pairs of data points between the two samples. Note K𝒫𝒬 = K𝒬𝒫ᵀ.
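Assembling the block-structured kernel matrix of Eq (3.2) amounts to evaluating the kernel on all pairs of the pooled sample. A minimal sketch follows; the toy kernel on adjacency matrices is a placeholder for any of the graph kernels introduced later:

```python
import numpy as np

def block_kernel_matrix(sample1, sample2, kernel):
    """Build the ordered kernel matrix of Eq (3.2): the first len(sample1)
    rows/columns correspond to sample 1, the rest to sample 2."""
    graphs = sample1 + sample2
    N = len(graphs)
    K = np.empty((N, N))
    for i in range(N):
        for j in range(i, N):
            K[i, j] = K[j, i] = kernel(graphs[i], graphs[j])  # symmetry
    return K

# Toy kernel on adjacency matrices: product of total edge weights
# (purely illustrative; not a kernel used in the study).
toy_kernel = lambda A, B: A.sum() * B.sum()

sample1 = [np.ones((2, 2)), 2 * np.ones((2, 2))]
sample2 = [3 * np.ones((2, 2))]
K = block_kernel_matrix(sample1, sample2, toy_kernel)
```

The off-diagonal block of `K` is exactly the cross-sample block K𝒫𝒬 and equals the transpose of K𝒬𝒫.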

Given this brief introduction to kernels, we can now introduce the kernel mean embedding. Assume that the sample {G1, …, Gn} was generated from a probability distribution 𝒫, meaning that each Gi is a realization of a random graph with law 𝒫. The notion of feature maps is extended from data points to probability distributions via the so-called embedding of a probability distribution [73], defined as follows:

Definition 3.13 (The Mean Embedding). The mean embedding of the random variable k(G, ⋅), where G is a random graph with the law 𝒫, is defined as μ𝒫 = E𝒫[k(G, ⋅)], satisfying:

⟨f, μ𝒫⟩ℋ = E𝒫[f(G)] for all f ∈ ℋ.

The following lemma gives conditions under which the mean embedding exists, see [49]:

Lemma 3.1. If k(⋅, ⋅) is measurable and E𝒫[√k(G, G)] < ∞, then μ𝒫 exists.

If μ𝒫 exists then we have, by the reproducing property and the definition of μ𝒫, that E𝒫[f(G)] = ⟨f, μ𝒫⟩ℋ. If 𝒫 is known then it is sometimes possible to find the mean embedding μ𝒫 in closed form. As an example consider the kernel:

k(G, G′) = (aᵀAb)(a′ᵀA′b′),

where the random variable is the graph G, so the adjacency matrix A is random, while a, a′, b, and b′ are fixed (non-random) vectors. The kernel is called a 1-step random walk kernel and will be covered more in-depth in a later section. If we use the binomial graph measure, in which each possible edge occurs independently with a fixed probability p,

𝒫(G) = p^|E| (1 − p)^(M − |E|), with M the total number of possible edges,

then the mean embedding will be:

μ𝒫(G′) = (aᵀĀb)(a′ᵀA′b′),

where Ā is a matrix in which each entry is p except on the diagonal, where it is 0.

Remark. Note that in this simple scenario there is a one-to-one correspondence between a graph and its adjacency matrix.
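This closed-form mean embedding can be checked numerically: the 1-step random walk kernel is linear in the adjacency matrix, so replacing A by its expectation gives the exact answer. The sketch below enumerates all 2^3 graphs on 3 nodes under the binomial measure; the vectors a and b are arbitrary illustrative choices:

```python
import itertools
import numpy as np

p = 0.3                        # fixed edge probability of the binomial measure
a = np.array([1.0, 2.0, 0.5])  # fixed (non-random) kernel vectors
b = np.array([0.5, 1.0, 1.0])

pairs = [(0, 1), (0, 2), (1, 2)]  # the 3 possible undirected edges on 3 nodes

# Exact expectation E[a^T A b] by summing over all 8 graphs.
expectation = 0.0
for bits in itertools.product([0, 1], repeat=3):
    A = np.zeros((3, 3))
    for on, (i, j) in zip(bits, pairs):
        A[i, j] = A[j, i] = on
    prob = p ** sum(bits) * (1 - p) ** (3 - sum(bits))
    expectation += prob * (a @ A @ b)

# Closed form: replace A by its mean, which is p everywhere off the diagonal.
A_bar = p * (np.ones((3, 3)) - np.eye(3))
closed_form = a @ A_bar @ b

print(abs(expectation - closed_form) < 1e-12)  # True: the kernel is linear in A
```

The agreement is exact here because expectation commutes with the linear map A ↦ aᵀAb.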

3.4 Graph two sample test statistic: Maximum mean discrepancy

The maximum mean discrepancy (MMD) is a general class of operators that measure equivalence between population distributions when they are embedded into an RKHS, which is equivalent to measuring the supremum of the difference between the means of functions in the RKHS with respect to each population distribution. Its properties also allow for an empirical formulation in which one replaces population distributions with empirical measure estimators obtained from samples drawn from each population distribution. This can then be used as the basis of a hypothesis-testing framework, as outlined below.

In order to explain this in detail we first present the population distribution definition of the MMD statistic. Consider some function class ℱ (which will be the unit ball in an RKHS in our case); then the MMD is given in Definition 3.14.

Definition 3.14 (MMD). Let ℱ be a class of functions f: Ω ↦ ℝ. The maximum mean discrepancy (MMD) is defined as: (3.3)

MMD[ℱ, 𝒫, 𝒬] = sup_{f∈ℱ} ( E𝒫[f(G)] − E𝒬[f(G′)] ).

If we let the function class ℱ be the unit ball in a reproducing kernel Hilbert space ℋ characterized by reproducing kernel k(⋅, ⋅) (in our case between pairs of graph valued data points), then it is possible to derive an attractive test statistic based on kernel evaluations of elements in the two samples, given that the mean embedding exists [49]: (3.4)

MMD²[ℱ, 𝒫, 𝒬] = E[k(G, G̃)] − 2E[k(G, G′)] + E[k(G′, G̃′)],

where G, G̃ are independent random graphs with law 𝒫 and G′, G̃′ are independent random graphs with law 𝒬. For the MMD to be a metric, some assumptions need to hold, see Proposition 3.2.

Proposition 3.2. Let ℱ be the unit ball in a universal RKHS ℋ, defined on the compact metric space Ω, with associated continuous kernel k. Then MMD[ℱ, 𝒫, 𝒬] = 0 if and only if 𝒫 = 𝒬.

This verifies that one has a reference value of zero for the population-based metric under a null statement that both population graph measures are equivalent. Hence, if one can then find a means to estimate this MMD metric using samples from the population distributions, this could then form the basis of a test statistic for two sample testing. Fortunately, the next result demonstrates that such a sample-based estimator can be considered when estimating Eq (3.4), which is provably unbiased and given for two sets of n and n′ samples of graphs by: (3.5)

MMD²_u = 1/(n(n−1)) Σ_{i≠j} k(Gi, Gj) + 1/(n′(n′−1)) Σ_{i≠j} k(G′i, G′j) − 2/(n n′) Σ_{i,j} k(Gi, G′j).

The computational complexity of evaluating this statistic on two population samples of size n and n′ respectively is O((n + n′)²). This would be expensive if one had a large collection of graph-valued data samples, and even worse if each sample comprised graphs with very large vertex set cardinality. However, there exists a linear time statistic, denoted MMD²_l, which can be computed in O(n) time. Assume that n = n′ and define n2 = ⌊n/2⌋, where ⌊⋅⌋ is the floor function; then the linear estimate is computed as:

MMD²_l = 1/n2 Σ_{i=1}^{n2} [ k(G_{2i−1}, G_{2i}) + k(G′_{2i−1}, G′_{2i}) − k(G_{2i−1}, G′_{2i}) − k(G_{2i}, G′_{2i−1}) ].

While MMD²_l has higher variance than MMD²_u, it is computationally much more appealing.
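A minimal implementation of the unbiased estimator in Eq (3.5), assuming the kernel matrix has already been computed and ordered as in Eq (3.2); the constant-kernel sanity check at the end is an illustrative choice of ours:

```python
import numpy as np

def mmd2_unbiased(K, n):
    """Unbiased MMD^2 estimate from a block kernel matrix K of shape
    (n + m, n + m): the first n rows/columns are sample 1, the rest sample 2."""
    m = K.shape[0] - n
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    # The diagonal terms k(G_i, G_i) are excluded to remove the bias.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    term_xy = 2 * Kxy.sum() / (n * m)
    return term_x + term_y - term_xy

# Sanity check: with a constant kernel k = 1 the statistic is exactly 0,
# since all within- and between-sample similarities coincide.
K = np.ones((7, 7))
print(mmd2_unbiased(K, 4))  # 0.0
```

The O((n + n′)²) cost quoted above is visible directly: every entry of the kernel matrix enters the sum once.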

3.5 Defining extreme outlier samples for graph valued data and robust MMD test statistics

Before we detail how to develop a robust version of the MMD statistic estimator, it is worth putting some consideration towards what exactly constitutes an outlier or extreme sample when considering graph-valued sample data. In other words, what is meant by an outlier in a graph-valued data setting? In essence, there can be more than one type of outlier when it comes to structured data, as can be seen in Figs 2 and 3. In Fig 2 we can see that the graphs have an average degree of around 2; however, graph no. 3 is a complete graph and is clearly different from the rest of the graphs in the sample. This example is a realistic scenario in the context of the financial markets: in times of crisis, most financial assets show strong correlations, while in a normal state the correlations are usually weaker and the resulting graphs have fewer connections. In Fig 3, we show another type of outlier, namely in the node attributes. Here the outlier is again graph 3, as its node attributes are significantly higher, even though the graph topology in terms of edge relations to vertices is common between all samples. This is the usual notion of an outlier in numerical data. Other types of outliers may occur as well, such as in the node label distribution, edge weights, or the number of triangles. If an outlier is present in the sample it can severely affect estimates that are not robust to outliers, like the unbiased MMD estimate. Therefore, a robust version that replaces the expectation with the median of averages taken over non-overlapping blocks of the data has been developed [30].

thumbnail
Fig 2. Outlier graph.

A graph sample with an outlier graph (red) with respect to degree.

https://doi.org/10.1371/journal.pone.0301804.g002

thumbnail
Fig 3. Outlier graph.

A graph sample with an outlier graph (red) with respect to attributes.

https://doi.org/10.1371/journal.pone.0301804.g003

In order to robustify the MMD statistic estimator, we may proceed with the classical robust estimator one would adopt in Euclidean sample space settings, modified for the MMD setting for graph-valued data as follows. Consider a partition {Sq}q∈1:Q of the N samples into Q blocks of size |Sq| = N/Q. The empirical measure of block Sq is 𝒫̂q = (1/|Sq|) Σ_{i∈Sq} δGi, giving rise to the block expectation E𝒫̂q[f(G)] = (1/|Sq|) Σ_{i∈Sq} f(Gi). The Median Of meaNs (MON) is defined as:

MON_{q∈1:Q}[f(G)] = median{ E𝒫̂1[f(G)], …, E𝒫̂Q[f(G)] },

implying that we are taking the median over the blockwise means, equivalently over the mean embeddings of each block. The MON-based MMD estimator associated to kernel k (MONK) is then defined by replacing the population expectations in the MMD with MON estimates:

MMD_Q[ℱ, 𝒫, 𝒬] = sup_{f∈ℱ} ( MON_q[f(G)] − MON_q[f(G′)] ).

Replacing the population expectation with the empirical block expectations in this way yields a finite-dimensional optimization problem.

By the representer theorem [74] we can express the optimal f as f = Σi ci k(Gi, ⋅), where ci ∈ ℝ are some constants. We next observe that, denoting c = (c1, …, cN)ᵀ, we have f(Gi) = (Kc)i. We can therefore rewrite the MON-based MMD estimator as a matrix optimization problem over c in which each block mean takes the form (Q/N) bqᵀ K c, where bq is an indicator vector of block q. Note that we have cᵀK c ≤ 1, as we are searching for functions within the unit ball of ℋ. This matrix objective function is then solved to find the best MON-based estimate. The key takeaway is that the number of corrupted samples Nc can be almost half the number of blocks Q. That is, there exists δ ∈ (0, 1/2] such that Nc ≤ Q(1/2 − δ).

3.6 Hypothesis decision making

Given our MMD statistic on probability distributions, we would like to infer whether two sets of samples come from the same underlying distribution by determining whether the observed empirical MMD is within a reasonable decision region. Under the null H0, the MMD statistic converges asymptotically to a distribution that depends on the unknown distribution [49]. This unfortunately means that we cannot evaluate a closed-form decision threshold cα, rejecting H0 if the statistic exceeds cα, where α is the level of significance of this test, representing the probability of a false rejection, i.e. a Type I error corresponding to a false positive. Instead, we estimate a data-dependent threshold by using a permutation procedure that replaces the unknown population distribution with a sampled empirical equivalent. The exact algorithm can be found in the accompanying technical S1 Appendix.

To understand the permutation test it may be beneficial to visualize the kernel matrix; such a visualization is possible because the elements of the kernel matrix Kij = k(Gi, Gj) measure the similarity between Gi and Gj according to the RKHS of k. We calculate the kernel matrix for two cases, one where the null hypothesis is true and one where the alternative is true. The kernel matrix is then structured as demonstrated in Eq (3.2). The two cases are demonstrated in Fig 4, which displays the heatmaps when the null hypothesis is true (left) and when the alternative hypothesis is true (right). When H0 is true we can see that the kernel matrix appears homogeneous, giving us a reason not to reject the null hypothesis. Practically, if we perform a permutation test, which is essentially the same as shuffling the rows and columns of the kernel matrix and recalculating the MMD, we would see values that are very close to the original sample estimate. Conversely, when H1 is true the kernel matrix is heterogeneous and a clear block structure appears within it. This gives us a reason to reject the null hypothesis. Now the permutation of the kernel matrix and recalculation of the MMD would give lower estimates and therefore a low p-value. Furthermore, note that in both cases the diagonal blocks have higher numerical values than the off-diagonal blocks. The reason is that objects are more similar to themselves than to other objects. This fact can lead to a problem called the diagonal dominance problem if the kernel used to assess the similarity between collections of graphs is too specific in how it assesses structural similarity. This also motivates why it is important to study and consider the testing framework using various types of graph kernels. We emphasize that although a visualization can be a good aid, it does not replace the p-value and should solely be used for empirical diagnostic purposes.

thumbnail
Fig 4. Kernel matrix visualization using heatmaps.

The left figure shows a heatmap of the kernel matrix when the null hypothesis is true and the right figure shows a heatmap when the alternative hypothesis is true.

https://doi.org/10.1371/journal.pone.0301804.g004
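The permutation procedure can be sketched as follows. This is an illustrative implementation operating directly on a precomputed kernel matrix, not the exact algorithm of the S1 Appendix (which should be consulted for details such as the number of permutations):

```python
import numpy as np

def permutation_pvalue(K, n, n_perms=500, seed=0):
    """Permutation test for the MMD: permuting the pooled sample is the
    same as shuffling the rows and columns of the kernel matrix; the
    p-value is the fraction of permuted statistics at least as large as
    the observed one."""
    rng = np.random.default_rng(seed)
    N = K.shape[0]
    m = N - n

    def mmd2(idx):
        x, y = idx[:n], idx[n:]
        Kxx, Kyy, Kxy = K[np.ix_(x, x)], K[np.ix_(y, y)], K[np.ix_(x, y)]
        return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
                + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
                - 2 * Kxy.sum() / (n * m))

    observed = mmd2(np.arange(N))
    exceed = sum(mmd2(rng.permutation(N)) >= observed for _ in range(n_perms))
    return (exceed + 1) / (n_perms + 1)  # add-one correction avoids p = 0

# Toy usage on a valid (PSD) kernel matrix built from random features;
# both "samples" come from the same source, i.e. H0 holds.
rng = np.random.default_rng(1)
Z = rng.normal(size=(10, 3))
pval = permutation_pvalue(Z @ Z.T, n=5)
```

Under H0, as in the toy usage, the p-value is approximately uniform over repeated experiments, so a small value is evidence against the null.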

3.7 Kernel closure properties

As we will later see, most graph kernels cannot take negative weights explicitly into account. However, to incorporate this possible information we will make use of the following closure properties of kernels [71].

  • If k1 and k2 are kernels, then k(G, G′) = k1(G, G′) + k2(G, G′) is a kernel,
  • If k1 and k2 are kernels, then k(G, G′) = k1(G, G′) ⋅ k2(G, G′) is a kernel.

Therefore, to allow negative weights on the graph edges, we can split each graph G into G(−) and G(+), where G(−) is the subgraph containing all nodes but only the negative edges, and G(+) is defined analogously. Then we perform the MMD test on the tensor-product kernel k(G, G′) = k((G(+), G(−)), (G′(+), G′(−))) = k(G(+), G′(+))k(G(−), G′(−)). We could also perform two separate MMD tests, one for two samples containing only the positive graphs and the other on two samples containing only the negative graphs, and then report the Bonferroni adjusted p-value when the joint decision on both sets of graphs is combined into a decision rule. However, the Bonferroni adjustment is conservative, and a set of synthetic experiments performed in this manner showed that the tensor-product method results in a higher power for the resulting test.
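A sketch of the positive/negative split and the tensor-product construction follows. The base kernel used here is a deliberately simple 1-step walk kernel chosen for illustration only; any of the graph kernels of Section 4 could be substituted:

```python
import numpy as np

def split_signed(A):
    """Split a signed weighted adjacency matrix into positive- and
    negative-edge subgraphs on the same node set (the negative part
    stores edge-weight magnitudes)."""
    return np.where(A > 0, A, 0.0), np.where(A < 0, -A, 0.0)

def base_kernel(A1, A2):
    # Illustrative 1-step walk kernel with all-ones start/stop vectors;
    # a placeholder for any positive definite graph kernel.
    one1, one2 = np.ones(len(A1)), np.ones(len(A2))
    return (one1 @ A1 @ one1) * (one2 @ A2 @ one2)

def signed_kernel(A, B):
    """Tensor-product kernel: product of the sub-kernels on the
    positive-edge and negative-edge subgraphs."""
    Ap, An = split_signed(A)
    Bp, Bn = split_signed(B)
    return base_kernel(Ap, Bp) * base_kernel(An, Bn)

# Toy signed graph: one positive edge (0-1) and one negative edge (0-2).
A = np.array([[0., 1., -1.], [1., 0., 0.], [-1., 0., 0.]])
print(signed_kernel(A, A))  # 16.0
```

One caveat of the product form is that a graph with no negative edges makes the negative-part factor vanish, so in practice the base kernel should assign nonzero similarity to empty subgraphs.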

4 Graph kernels

In order to undertake the graph testing framework, one must determine the space upon which the kernel similarity will be measured and what family of kernels will be utilized. There exist many families of similarity measures and kernel mappings between graphs that are based on different transformations of the underlying graph structures. In this context, when dealing with graph kernels we usually have to define the feature vector explicitly, which means that computing graph kernels is often time-consuming. For example, at first glance it might seem a good idea to define a kernel between graphs that counts all common subgraphs. However, computing the all-subgraph kernel is NP-hard, as it is essentially the same decision problem as finding Hamiltonian paths [64, 65].

There is an additional problem when dealing with graph kernels, namely finding an injective graph kernel that separates different graphs completely. This problem is also known to be NP-hard, see [64]. To elaborate, we need the notions of complete graph kernels and isomorphic graphs.

Definition 4.1 (Isomorphic Graphs). Two graphs G and G′ are isomorphic if there is a bijection ψ: V(G) ↦ V(G′) such that (u, v) ∈ E(G) ⇔ (ψ(u), ψ(v)) ∈ E(G′) for all u, v ∈ V(G).

We denote that G and G′ are isomorphic by G ≃ G′.

Definition 4.2 (Complete Graph Kernel). Let Ω be a set that consists of graphs and let ϕ: Ω ↦ ℋ be a mapping. Furthermore, let k be such that k(G, G′) = ⟨ϕ(G), ϕ(G′)⟩ℋ. If ϕ is injective then k is called a complete graph kernel.

Computing any complete graph kernel is at least as hard as deciding whether two graphs are isomorphic. This can be seen as follows. If ϕ is injective then k(G, G) − 2k(G, G′) + k(G′, G′) = ‖ϕ(G) − ϕ(G′)‖²ℋ = 0 if and only if G ≃ G′, [64]. Therefore, we have a trade-off between complexity and expressiveness, and one concludes from this that it is intractable to compute complete graph kernels when performing graph testing as undertaken in this work. However, as will be illustrated in the remainder of this section, there do exist tractable graph kernels which have affordable complexity and are expressive. Here we will introduce the graph kernels used in this study. For a survey on graph kernels see [57, 58].

Many graph kernels assume node-labeled graphs, but the graphs at hand may not be labeled. To deal with this problem, one can label the nodes according to their degrees. If the graph is attributed one can also try to bin the attributes to create labels. Graph kernels that assume node-labeled graphs can also be used on edge-labeled graphs; this can be done by labeling each node by concatenating the labels of its incident edges in alphabetical order.

In the remainder of this section we will introduce the following families of graph kernels: Random walk kernels on graphs (with various sub-families as special cases); shortest path kernels on graphs; Weisfeiler-Lehman graph kernels; optimal assignment graph kernels; pyramid matching graph kernels; propagation graph kernels and Wasserstein Weisfeiler-Lehman graph kernels. Each family of kernels will be defined formally and described with regard to the interpretation of what the resulting kernel proximity measures are seeking to evaluate when comparing graph samples from two population samples, as required in this graph testing context.

4.1 Convolution kernel

The convolution kernel forms a basis for most graph kernels and is thus worth mentioning. We can think of structured data (for example graphs) as objects that are composed of subobjects or substructures. For example, a string is composed of smaller strings. Using this property, the idea is to compute the product of subkernels and sum over the set of allowed decompositions. Formally, the convolution kernel is defined in the following way [57, 59]:

Definition 4.3 (Convolution Kernel). Let 𝒳 = 𝒳1 × ⋯ × 𝒳d denote the space of components such that a graph G ∈ Ω decomposes into elements of 𝒳. Furthermore, let R ⊆ 𝒳 × Ω denote the relation from components to objects, such that (x, G) ∈ R if and only if the components x = (x1, …, xd) make up the graph G ∈ Ω, and let R⁻¹(G) = {x : (x, G) ∈ R}. Then, the R-convolution kernel is:

k(G, G′) = Σ_{x ∈ R⁻¹(G)} Σ_{x′ ∈ R⁻¹(G′)} Π_{i=1}^d ki(xi, x′i),

where ki is a kernel on 𝒳i for i ∈ {1, …, d}.

In other words, R defines a relation on the set 𝒳 × Ω.

4.2 Random walk kernel on graphs

One way to reduce the complexity of graph kernels is to consider walks instead of paths. The idea is similar, but the main difference is that walks can visit nodes multiple times while paths visit each node exactly once. Random walk kernels originate from walk kernels. Let 𝒲n(G) be the set of walks with n vertices and 𝒲(G) = ∪n 𝒲n(G) be the set of all walks. The walk kernel may then be defined as in Definition 4.4.

Definition 4.4 (Walk Kernel on Graphs). Let Sn denote the set of all possible label sequences of walks of length n, and S = ∪n Sn. For any graph G let a weight λG(w) be associated with each walk w ∈ 𝒲(G). Let the feature vector ϕ(G) = (ϕs(G))_{s∈S} be defined by:

ϕs(G) = Σ_{w ∈ 𝒲(G)} λG(w) 1(l(w) = s),

where l(w) is the label sequence of walk w. Then the walk kernel is defined by:

k(G, G′) = ⟨ϕ(G), ϕ(G′)⟩.

One can then make a particular selection for the weight function in the kernel to produce what is known as the random walk kernel on a graph. The random walk kernel is obtained with λG(w) = PG(w), where PG(w) is the probability of observing the walk w in G (usually a Markov random walk). In that case we have [62]:

k(G, G′) = Σ_{s ∈ S} P(l(w) = s | G) P(l(w′) = s | G′).

Remark. In order to define such a probability on walks, define the normalized adjacency matrix P = D⁻¹A, where A is the adjacency matrix and D is the diagonal matrix containing the degrees. Note that the rows of P sum to 1, so P can be thought of as the transition matrix of a Markov chain. Using this construction, consider a Markov process that generates a sequence of vertices according to P; we can read (Pᵗ)ij as the probability of transitioning from vi to vj in t steps. We note, however, that a valid kernel is still obtained even if no normalization of the adjacency matrix is done; one simply has to make sure that the sum converges, that is, the feature vector should be square summable.

In [64] it was shown that performing a simultaneous random walk on G and G′ is equivalent to performing a random walk on the direct product graph G×, which is defined as follows. Given two graphs G(V, E) and G′(V′, E′), their direct product G× is a graph with vertex set V× = {(v, v′) : v ∈ V, v′ ∈ V′} and edge set E× = {((u, u′), (v, v′)) : (u, v) ∈ E and (u′, v′) ∈ E′}.
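Since the adjacency matrix of the direct product graph of two unlabeled graphs is the Kronecker product of the two adjacency matrices, it can be formed in one line with numpy; the two toy graphs below are arbitrary illustrative choices:

```python
import numpy as np

# Adjacency matrices of two small graphs: a single edge and a triangle.
A = np.array([[0, 1],
              [1, 0]])
B = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])

# Direct product graph: ((u, u'), (v, v')) is an edge iff
# (u, v) is an edge of the first graph and (u', v') of the second.
A_prod = np.kron(A, B)

print(A_prod.shape)  # (6, 6): one product vertex per pair (v, v')
```

A simultaneous walk on the two graphs corresponds exactly to a walk on this 6-vertex product graph.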

In [63] a generalized framework for random walk graph kernels is proposed. Let p and p′ denote the initial probability distributions over the vertices of G and G′; then the corresponding initial probability distribution on the direct product graph is p× = p ⊗ p′. Similarly, if q and q′ are stopping probabilities then the stopping probability on the direct product graph is q× = q ⊗ q′. The starting and stopping probabilities allow us to put prior information into the kernel design. If G and G′ are edge-labeled graphs then we can define a weight matrix on G× as W× = Φ(X) ⊗ Φ(X′), where X is the matrix of edge labels taking values in a set of labels Σ (including a label for no label) and Φ maps each edge label to a feature vector. The product is defined entry-wise as:

(W×)_{(i,i′),(j,j′)} = ⟨Φ(X)ij, Φ(X′)i′j′⟩.

As a consequence of this definition, the entries of W× are non-zero only if the corresponding edge exists in the direct product graph G×. For unlabeled graphs we can simply take Φ(X) = A, which gives the weight matrix W× = A ⊗ A′. The kernel is finally defined as outlined in Definition 4.5.

Definition 4.5 (Random Walk Kernel on Graphs). Let G = (V, E) and G′ = (V′, E′) be two graphs. The random walk kernel is defined as: (4.1)

k(G, G′) = Σ_{s=0}^∞ μ(s) q×ᵀ W×ˢ p×,

where μ(s) is a weight function to ensure convergence of the sum and p×, q×, and W× are defined as above.

Remark. A valid kernel is still obtained even if p× and q× are not probabilities in Eq (4.1). One only has to make sure that the sum converges for all graphs.

The RW kernel is indeed positive semi-definite, as it is possible to write each term in an inner product form, as shown in [63]; in the unlabeled case, for example, q×ᵀW×ˢp× = (qᵀAˢp)(q′ᵀA′ˢp′), so each term is a product of a feature of G and the same feature of G′. Furthermore, as p.s.d. kernels are closed under convex combinations, the random walk kernel is a valid p.s.d. kernel.

By taking μ(s) = λˢ, where λ < 1/λ× and λ× is the largest eigenvalue of W×, we get the so-called geometric random walk kernel k(G, G′) = q×ᵀ(I − λW×)⁻¹p×. Another choice is μ(s) = λˢ/s!, which gives the exponential random walk kernel k(G, G′) = q×ᵀe^{λW×}p×. It is also possible to consider a finite sum in Definition 4.5. Taking only s = 0 would only take the prior information of p× and q× into account, s = 1 would perform one transition update, etc. Using too small an s does, in some sense, not take enough moments into account, as we will see later in this subsection. The naive way of calculating the geometric random walk kernel involves inverting a |V|² × |V|² matrix, which has a time complexity of O(|V|⁶). Luckily, using a low-rank approximation of W× and the Sherman-Morrison-Woodbury lemma [75] it is possible to calculate the kernel in O((|E| + |V|)r + r²) time, where r is the number of eigenvalues used in the approximation [29]. The fast random walk kernel is given in Definition 4.6.
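The geometric random walk kernel can be computed by solving a linear system rather than forming the matrix inverse. The following sketch (with uniform start vector and all-ones stop vector, an illustrative choice) checks the result against a truncated version of the series in Definition 4.5:

```python
import numpy as np

def geometric_rw_kernel(A1, A2, lam):
    """Geometric random walk kernel q_x^T (I - lam W_x)^{-1} p_x with
    W_x = A1 (x) A2; lam must be below 1 / (largest eigenvalue of W_x)
    for the geometric series to converge."""
    W = np.kron(A1, A2)
    n = W.shape[0]
    p = np.ones(n) / n        # uniform start distribution
    q = np.ones(n)            # all-ones stop vector
    # Solve (I - lam W) x = p instead of inverting the matrix.
    x = np.linalg.solve(np.eye(n) - lam * W, p)
    return q @ x

# Cross-check against the truncated power series sum_s lam^s q^T W^s p.
A1 = np.array([[0, 1], [1, 0]])
A2 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
lam = 0.05
W = np.kron(A1, A2)
p, q = np.ones(6) / 6, np.ones(6)
series = sum(lam ** s * q @ np.linalg.matrix_power(W, s) @ p
             for s in range(30))
print(abs(geometric_rw_kernel(A1, A2, lam) - series) < 1e-10)  # True
```

For graphs of realistic size one would replace the dense solve with the low-rank scheme of Definition 4.6.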

Definition 4.6 (r-Approximate Random Walk Kernel, ARKU_plus). Let G = (V, E) and G′ = (V′, E′) be two graphs. The r-approximate random walk kernel is defined as: (4.2)

k_r(G, G′) = q×ᵀ(I − c Ŵ×)⁻¹p×,

where Ŵ× is a rank-r approximation of W× and c is a constant to ensure invertibility.

Remark. There exist well-known methods to make reasonable low-rank approximations of such matrices such as the Nystrom method, see [76].

Remark. In this study, we will mainly use the r-Approximate Random Walk kernel and it will be the kernel we are referring to when we mention the RW kernel.

Note that the speed-up scheme can allow for directed, node-labeled, and edge-labeled graphs; it all depends on how one defines the matrix W×. For example, in the case of edge labels one can define W× = Σ_{l=1}^L A_l ⊗ A′_l, where (A_l)ij = Aij if the edge (vi, vj) is labeled l and 0 otherwise, and L is the total number of labels [63]. The speed-up scheme for this case can be found in the S1 Appendix. For node labels one can use W× = S(A ⊗ A′), where S is a diagonal matrix whose (i, i) entry is 0 if the i-th row of (A ⊗ A′) corresponds to a label inconsistency, and 1 otherwise [29].

Additionally, it is possible to allow node attributes using the random walk kernel [77]. Node attributes encode valuable information such as observations of graph signals. Node signals are, for example, encountered in portfolio design problems, where stock returns are connected through a latent graph structure [33], or in insurance pricing of products with geographic risk [78]. These signals can additionally depend on some external features. Let Zi denote a matrix containing the vertex attributes of a graph Gi and let βi be a (trainable) vector such that yi = Zi βi, where ni and nj are the number of nodes in graphs Gi and Gj. We can then write the resulting kernel as follows [77]: (4.3)

It is also possible to use a kernelized version: let Ky be the kernel matrix containing ⟨ϕ(yi), ϕ(yj)⟩. The graph kernel matrix then contains elements given by: (4.4) It is possible to interpret the RW kernel in terms of distribution embeddings; namely, the random walk kernel is connected to the average degree. In fact, by taking a special case of the weight vector such that μ(s) = 1 if s = 1 and μ(s) = 0 otherwise, with Φ(X) = A, q = 1 and p = 1/|V|, one may write the mean embedding as follows: where K is the degree regarded as a random variable. If we take 2 steps and define μ(0) = 0, μ(1) = 1 and μ(2) = 1, and assume that G is an undirected graph, then it is possible to show that: (4.5)

Unfortunately, we do not get the third moment if we take 3 steps. This can be seen by inspecting k3 of a 4-node graph: expanding the cubic form yields the term A_{i2}A_{i2}A_{i2} 3 times, but it only appears 2 times in the matrix multiplication. However, the third step starts counting triangles among other walks, which means that additional steps compare more complex structures and essentially allow for comparing higher-order moments. The density of an undirected graph is defined as ρ = 2|E|/(|V|(|V| − 1)) = ∑i di/(|V|(|V| − 1)). Thus, taking q = 1/(|V| − 1) gives a mean embedding of the density of the graph as the feature of similarity between the graphs being compared.
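The average-degree interpretation above can be checked numerically: with μ(1) = 1 as the only nonzero weight, per-graph start distribution p = 1/|V| and stopping weights q = 1, the one-step kernel factorizes into the product of the two average degrees, i.e. the mean embedding of each graph is its average degree. A small sketch (function names are ours):

```python
import numpy as np

def one_step_rw_kernel(A1, A2):
    """Random walk kernel with mu(1) = 1 and mu(s) = 0 otherwise,
    start distribution p = 1/|V| per graph and stopping weights q = 1."""
    n1, n2 = A1.shape[0], A2.shape[0]
    Wx = np.kron(A1, A2)
    p = np.kron(np.full(n1, 1 / n1), np.full(n2, 1 / n2))
    q = np.ones(n1 * n2)
    return q @ Wx @ p

def average_degree(A):
    # sum of all entries = sum of degrees for an undirected adjacency matrix
    return A.sum() / A.shape[0]
```

The identity q^T (A₁ ⊗ A₂)(p₁ ⊗ p₂) = (1^T A₁ p₁)(1^T A₂ p₂) makes the factorization exact.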

In summary, we have demonstrated how to construct kernels for measuring similarity between graphs, their feasible computational complexity, and their explicit interpretations when measuring similarity structures between two graph samples. The focus on the random walk kernel was due to its versatility: it can be used to perform inference on weighted, directed, or undirected graphs, possibly with node labels or edge labels. The naive time complexity of calculating K(G, G′) is O(|V|⁶). Luckily, there exist speed-up methods [29] with schemes based on the geometric random walk kernel of time complexity O((|E| + |V|)r + r²) for undirected graphs and O(|V|²r⁴ + |E|r + r⁶) for directed graphs, where r is the number of eigenvalues used in the SVD decomposition of the adjacency matrices. A fast approximation for edge-labeled graphs can be found in the S1 Appendix. In short, the random walk kernel is an R-convolution kernel that can be used on weighted, directed, undirected, and bipartite graphs with node labels, node attributes, edge labels, and edge attributes. We do not compute the explicit feature mapping, although it can sometimes be found, for example, according to Definition 4.4.

4.3 Shortest-Path kernels on graphs

As previously observed, explicitly computing path kernels is an NP-hard problem. However, finding shortest paths can be done in polynomial time, for example via Dijkstra's algorithm [79] or the Floyd-Warshall algorithm [80]. For precisely this reason, [65] suggested a kernel based on the shortest paths in a graph. The first step is to transform the original graphs into shortest-path graphs. The resulting graph has an edge between all pairs of nodes, and each edge is labeled by the shortest distance between its endpoints, as can be seen in Fig 5. The transformation is called the Floyd transformation.

Fig 5. Floyd-transformation of two graphs.

The transformation puts an edge between all nodes and labels it by the shortest distance.

https://doi.org/10.1371/journal.pone.0301804.g005

The kernel is then defined in the following manner described in Definition 4.7.

Definition 4.7 (Shortest-Path Graph Kernel). Let G1 and G2 be two graphs that are Floyd-transformed into S1 and S2. The shortest path kernel on S1 = (V1, E1) and S2 = (V2, E2) is defined as (4.6) where is a kernel on edge walks of length 1.

Let ei = (vi, ui) and ej = (vj, uj); then the base kernel is usually defined as (4.7) where kv is a kernel comparing vertex labels and ke is a kernel comparing shortest path lengths. Vertex labels are usually compared via a Dirac kernel, while shortest path lengths may also be compared via a Dirac kernel or another kernel such as the Brownian bridge kernel. The time complexity of the shortest path kernel is O(|V|⁴) [65]. The SP kernel is an R-convolution kernel that can be used on weighted undirected and directed graphs with possible node and/or edge labels. A time complexity of O(|V|⁴) is very limiting, but when the base kernel is a Dirac delta kernel a significant speed-up is possible. Therefore, in the case of weighted and/or node-attributed graphs, one has to rely on binning strategies in order to compute the kernel efficiently.
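A minimal sketch of the shortest-path kernel for unlabeled, unweighted graphs with a Dirac base kernel on path lengths, using SciPy's Floyd-Warshall routine for the transformation (the binning of continuous lengths discussed above is omitted, and the function name is ours):

```python
import numpy as np
from collections import Counter
from scipy.sparse.csgraph import floyd_warshall

def sp_kernel(A1, A2):
    """Shortest-path kernel with a Dirac base kernel on path lengths:
    counts pairs of node pairs whose shortest-path distances agree."""
    def sp_counts(A):
        D = floyd_warshall(A, unweighted=True)       # Floyd transformation
        iu = np.triu_indices_from(D, k=1)            # each pair once
        finite = np.isfinite(D[iu])                  # drop disconnected pairs
        return Counter(D[iu][finite].astype(int))
    c1, c2 = sp_counts(A1), sp_counts(A2)
    return sum(c1[d] * c2[d] for d in c1)            # Dirac match on lengths
```

For the 3-node path graph, the distance multiset is {1, 1, 2}, so the kernel with itself is 2·2 + 1·1 = 5.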

4.4 The Weisfeiler-Lehman framework

In this section, we will discuss the popular Weisfeiler-Lehman framework for graph kernel construction. The Weisfeiler-Lehman framework (WL) is first and foremost a label enrichment strategy inspired by the Weisfeiler-Lehman test of graph isomorphism [61]. The Weisfeiler-Lehman iteration algorithm takes an input graph G = G0 and labelling function l0 = l and returns a sequence {G0, …, Gh} = {(V, E, l0), …, (V, E, lh)}. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. Let k0 be any graph kernel, which we will call the base kernel. The general Weisfeiler-Lehman kernel with h iterations and base kernel k0 is defined as (4.8) The most commonly used WL kernel is the Weisfeiler-Lehman subtree kernel [60, 61], provided in Definition 4.8.

Definition 4.8 (Weisfeiler-Lehman Subtree Kernel). Let G and G′ be two graphs. The Weisfeiler-Lehman subtree kernel is defined as the vector inner product where σij is letter number j in the alphabet Σi after i WL iterations. That is, we obtain a new alphabet after each WL iteration. ci(G, σij) is the count of occurrences of the letter σij in the graph G.

By looking at the definition we can see that the graph features considered by the WL subtree kernel essentially count different rooted subtrees in the graph. For the Weisfeiler-Lehman subtree kernel, the feature mapping is computed explicitly with a runtime complexity of O(h|E|) [60]. The WL framework kernel can be used on undirected and directed node-labeled graphs.
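The explicit feature-map computation can be sketched as follows for graphs given as adjacency lists. This is a simplified relabeling that uses Python tuples in place of a compressed alphabet; the function names are ours.

```python
from collections import Counter

def wl_subtree_kernel(adj1, labels1, adj2, labels2, h=2):
    """Weisfeiler-Lehman subtree kernel: relabel nodes h times by combining
    (own label, sorted multiset of neighbour labels), then take the inner
    product of the label-count vectors accumulated over all iterations."""
    def wl_counts(adj, labels, h):
        labels = list(labels)
        counts = Counter(labels)                 # iteration-0 label counts
        for _ in range(h):
            labels = [
                (labels[v],) + tuple(sorted(labels[u] for u in adj[v]))
                for v in range(len(adj))
            ]
            counts.update(labels)                # accumulate new alphabet
        return counts
    c1 = wl_counts(adj1, labels1, h)
    c2 = wl_counts(adj2, labels2, h)
    return sum(c1[s] * c2[s] for s in c1)
```

In a production setting the tuples would be compressed into fresh integer labels after each iteration, which is what gives the O(h|E|) runtime.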

4.5 Optimal assignment kernels

Optimal assignment kernels assign the parts of one object to the parts of another, for example matching the vertices of two graphs, such that the similarity is maximized. However, the optimal assignment procedure does not always give rise to a positive semi-definite kernel [81]. [69] consider a particular class of base kernels that does give rise to positive semi-definite kernels. Let be a set of possible components that can be extracted from the graphs in Ω, Πn be the set of all permutations of [1, …, n], and and be sets of components from the graphs G and G′, for example, the nodes. Then the optimal assignment kernel is defined as: where k0 is called the base kernel. If the objects have different cardinalities then we may fill the smaller set with new objects z with . [69] show that the WL algorithm defines a strong hierarchy on the set of all vertices, and thus the resulting optimal assignment kernel, called the Weisfeiler-Lehman Optimal Assignment (WLOA) kernel, is positive semi-definite, as detailed in Definition 4.9.

Definition 4.9 (Weisfeiler-Lehman Optimal Assignment Kernel). Let G = (V, E) and G′ = (V′, E′) be two graphs. The Weisfeiler-Lehman optimal assignment kernel is defined as where k0 is the following base kernel: where li(v) is the label of node v at the end of the i-th iteration of the Weisfeiler-Lehman relabeling procedure.

The WLOA kernel can be used on node-labeled graphs and has a computational cost of O(h|E|).

4.5.1 Pyramid match graph kernel.

The pyramid match graph kernel [68], based on the pyramid match kernel of [67], first embeds the vertices of each graph into a low-dimensional vector space using the eigenvectors corresponding to the largest-in-magnitude eigenvalues of the adjacency matrix of the graph. The procedure then partitions the feature space into regions of increasingly larger size, and a weighted sum of the matches that occur at each level is taken. Two points are matched if they fall into the same region. Given a sequence of levels from 0 to L, at level l the d-dimensional unit hypercube has 2^l cells along each dimension and D = 2^{ld} cells in total. Let and denote the histograms of the graphs G and G′ at level l and , , the number of vertices of G, G′ that lie in the i-th cell. The pyramid match kernel is then given in Definition 4.10.

Definition 4.10 (Pyramid Match Graph Kernel). Let G and G′ be two graphs. The pyramid match graph kernel is defined as: (4.9) where

The kernel can be performed on labeled graphs by matching only vertices that share the same labels. The emerging kernel for labeled graphs corresponds to the sum of the separate kernels: (4.10) where c is the number of distinct labels and ki(G1, G2) is the pyramid match kernel between the sets of vertices of the two graphs which are assigned the label i. The pyramid match kernel can be performed on weighted undirected graphs with or without node labels. The time complexity of the pyramid kernel is O(d|V|L).
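A compact sketch of the unlabeled pyramid match kernel under the construction above. The embedding uses absolute eigenvector entries rescaled into the unit hypercube; details such as the rescaling and tie-breaking are our own simplifying choices, not necessarily those of [68].

```python
import numpy as np

def pyramid_match_kernel(A1, A2, d=4, L=3):
    """Pyramid match graph kernel (unlabeled sketch): embed nodes via the
    d largest-in-magnitude eigenvectors of the adjacency matrix, histogram
    the points at levels 0..L, and take the weighted sum of *new* matches
    at each level, with weight 1/2^(L-l)."""
    def embed(A):
        w, v = np.linalg.eigh(A)
        idx = np.argsort(-np.abs(w))[:d]       # largest |eigenvalue| first
        U = np.abs(v[:, idx])
        return U / max(U.max(), 1e-12)         # rescale into the unit cube
    def histogram(U, l):
        cells = np.minimum((U * 2 ** l).astype(int), 2 ** l - 1)
        keys, counts = np.unique(cells, axis=0, return_counts=True)
        return {tuple(k): c for k, c in zip(keys, counts)}
    U1, U2 = embed(A1), embed(A2)
    matches = []
    for l in range(L + 1):
        h1, h2 = histogram(U1, l), histogram(U2, l)
        matches.append(sum(min(c, h2.get(k, 0)) for k, c in h1.items()))
    k = matches[L]                             # matches at the finest level
    for l in range(L):                         # new matches at coarser levels
        k += (matches[l] - matches[l + 1]) / 2 ** (L - l)
    return k
```

For two identical graphs every vertex matches at the finest level, so the kernel equals |V|.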

4.5.2 Propagation kernel.

Propagation kernels are based on monitoring how information spreads through a set of given graphs, see [66]. We start by considering (partially) labeled graphs without attributes, where V = V_L ∪ V_U is the union of labeled and unlabeled nodes. We monitor the distribution of labels encountered by random walks. Let be the prior label distribution of all nodes in V, where the i-th row corresponds to the prior label distribution of node vi. Note that there are k labels. If node vi ∈ V_L is observed then . If the nodes are attributed then the “prior” is set as where xi is the attribute vector of node i. The label diffusion process is where T = AD⁻¹ is the transition matrix and row i of Pt is the distribution of labels at iteration t for node vi. If we consider an absorbing node set S ⊆ V, then the absorbing random walk is defined as:

The label propagation process is

If S = VL then we have the label propagation algorithm for node classification in [82].

Note that other propagation schemes may be defined. The kernel proceeds by comparing the label/attribute propagation matrix Pt of graph G to that of graph G′ and sums each comparison for t = 1, …, T, where T is the maximum number of propagation steps. In order to make the kernel more efficient, the propagation kernel uses a hash function that maps the rows of the propagation matrices (label distributions or propagated attributes) to integer-valued bins. [66] suggest early stopping rather than running to the steady state, as the intermediate distributions obtained by the diffusion during the convergence process provide useful insights about the graph structure.

Definition 4.11 (Propagation Kernel). Let G and G′ be two node-labeled or node-attributed graphs. Define bt as the number of integer bins (or the number of labels) occupied by the nodes of G and G′ after applying the hashing function to the node attributes at the t-th iteration of the algorithm. Let also ct(G, i) be the number of nodes of G placed into bin i at the t-th iteration of the algorithm. Then, the propagation kernel with tmax iterations between graphs G and G′ is defined as: where

The kernel is very general as it can be used to construct kernels for many graph types, including node-labeled, partially labeled, unlabeled, directed, and node-attributed graphs. Computing the kernel has a time complexity of O(tmax|V|). The total running time of one pair of graphs is O((tmax − 1)|E| + tmax|V|).
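A sketch of the propagation kernel for node-attributed graphs, with a simple fixed-grid hash of width w standing in for the locality-sensitive hashing of [66] (function and parameter names are ours):

```python
import numpy as np
from collections import Counter

def propagation_kernel(A1, X1, A2, X2, t_max=3, w=1e-3):
    """Propagation kernel sketch: diffuse node attributes with T = A D^-1,
    hash the attribute rows into bins of width w, and sum the inner
    products of bin-count vectors over iterations 0..t_max-1."""
    def diffuse(A, X):
        deg = A.sum(axis=0)
        T = A / np.where(deg > 0, deg, 1.0)   # column-normalized: T = A D^-1
        P = X.astype(float)
        for _ in range(t_max):
            yield P
            P = T @ P
    k = 0.0
    for P1, P2 in zip(diffuse(A1, X1), diffuse(A2, X2)):
        c1 = Counter(tuple(r) for r in np.floor(P1 / w).astype(int))
        c2 = Counter(tuple(r) for r in np.floor(P2 / w).astype(int))
        k += sum(c1[b] * c2[b] for b in c1)   # bin-count inner product
    return k
```

The bin width w plays exactly the role discussed in the synthetic experiments below: smaller bins increase discriminative power but make the kernel matrix more diagonally dominant.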

4.5.3 Wasserstein Weisfeiler-Lehman graph kernels.

The Wasserstein Weisfeiler-Lehman graph kernel (WWL) was developed in [28] and can be applied to attributed, labeled, and weighted graphs. The idea is to calculate a Weisfeiler–Lehman-inspired embedding scheme that works for both categorically labeled and continuously attributed graphs which are then coupled with a graph Wasserstein distance.

Given two sets of matrices and , the Wasserstein distance between them is defined as: where M is the distance matrix between each pair of elements of X and X′, is a transport/joint probability matrix, and 〈⋅, ⋅〉 is the Frobenius dot product. The transport matrix contains the fractions that indicate how to transport mass from X to X′ in the most efficient way.

Definition 4.12 (Wasserstein Weisfeiler-Lehman Graph Kernel). Consider two labeled graphs G = (V, E) and G′ = (V′, E′). The Wasserstein Weisfeiler-Lehman graph kernel is defined as: where λ > 0 and D_W is the graph Wasserstein distance defined as: where is an embedding scheme defined as where li(v) is the label of node v after the i-th iteration of the WL relabeling procedure and h is the total number of WL iterations.

The ground distance matrix M is the Hamming distance for node-labeled graphs. This graph kernel can be used on undirected graphs with node labels, and on node-attributed graphs with the right embedding scheme as well. The node feature vector is not computed explicitly. The Wasserstein distance is the computational bottleneck, with a complexity of O(|V|³ log |V|) naively, but there exist speed-up tricks [28].
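For two graphs with the same number of nodes and uniform node weights, the graph Wasserstein distance reduces to an optimal assignment problem (by Birkhoff's theorem), which permits a compact sketch of the WWL kernel. Here the per-node embeddings are the sequences of WL labels and the ground distance is the normalized Hamming distance; the equal-size restriction and function names are our simplifications.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wwl_kernel(emb1, emb2, lam=0.1):
    """Wasserstein Weisfeiler-Lehman kernel sketch for two graphs with the
    SAME number of nodes: rows of emb1/emb2 are per-node WL label
    sequences; with uniform weights the Wasserstein distance under the
    Hamming ground distance is an optimal assignment."""
    n = len(emb1)
    M = np.array([[np.mean([a != b for a, b in zip(u, v)]) for v in emb2]
                  for u in emb1])                  # Hamming ground distances
    r, c = linear_sum_assignment(M)                # optimal transport plan
    d_w = M[r, c].sum() / n                        # graph Wasserstein distance
    return np.exp(-lam * d_w)                      # Laplacian-type kernel
```

Identical embeddings give distance 0 and hence kernel value 1.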

4.6 Enhancing graph kernels

We briefly mention that new kernels can be constructed by manipulating and combining kernels, as explained in [71]. These include: (4.11) where k1 and k2 are graph kernels and k3 is a kernel over , where d is the dimension of the feature vector ϕ(G). These properties can then be used to show that the following kernel manipulations are also possible: (4.12) where p is a polynomial, || ⋅ || is some norm, and σ² > 0 is some constant. We saw in the previous subsections that many graph kernels explicitly compute feature vectors, meaning that they essentially transform the graph data into vector data. The final step is often to apply a linear kernel to the constructed vector data [69]. However, it is well known that better results are often obtained when a non-linear kernel such as the polynomial or RBF kernel is used. Therefore, it may be beneficial to apply one of the kernel construction methods in Eq (4.12) to enhance the predictive performance, and indeed, this was suggested in [69]. As noted, the third equation in (4.12) shows that any norm can be used. In recent years, several developments in p-adic number theory have enhanced the understanding of complex networks, see for instance [83]. The p-adic numbers give an extension of the ordinary arithmetic of rational numbers. The p-adic numbers, where p is any prime number, come from an alternate way of defining the distance between two rational numbers. The p-adic field is a complete metric space like the real number field, but unlike the reals the p-adics form an ultrametric space, leading to a number of fascinating but often counterintuitive results. A p-adic number is a series of the form: (4.13) with x_k ≠ 0, where the x_j coefficients are the p-adic digits, i.e. numbers in the set {0, 1, …, p − 1}. The natural norm of p-adic numbers is defined as ||x||_p = p^{−k}, and the ultrametric property refers to the fact that ||x − y||_p ≤ max{||x − z||_p, ||z − y||_p}.
The use of p-adic numbers has recently been suggested in the machine learning literature, for example in [84, 85], who develop so-called p-adic cellular neural networks where the weights of a neural network are p-adic numbers, and in [86], who investigate the correspondence between neural networks and p-adic statistical field theories. In the case of kernel learning, one may construct a p-adic kernel by combining the p-adic norm and the kernel algebra above as follows: (4.14) where , [ϕ(G)]i is the i-th element of ϕ(G), and are some constants for each i.
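A hypothetical instance of such a construction: the p-adic norm of integer differences of feature-vector entries (e.g. subtree counts) plugged into an RBF-type kernel via the closure rules of Eq (4.12). The coordinate-wise form and the function names are our illustrative choices, not a construction from the cited works.

```python
import numpy as np

def padic_norm(x, p=2):
    """p-adic absolute value of an integer: p^(-k), where p^k is the
    largest power of p dividing x; |0|_p = 0 by convention."""
    if x == 0:
        return 0.0
    k = 0
    while x % p == 0:
        x //= p
        k += 1
    return float(p) ** (-k)

def padic_rbf_kernel(u, v, p=2, sigma2=1.0):
    """RBF-type kernel built from the p-adic norm applied coordinate-wise
    to integer-valued graph features (hypothetical construction)."""
    d2 = sum(padic_norm(int(a) - int(b), p) ** 2 for a, b in zip(u, v))
    return np.exp(-d2 / (2 * sigma2))
```

Note the counterintuitive ultrametric behaviour: 12 = 2²·3 is 2-adically *small* (norm 1/4), while the odd number 3 has norm 1.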

5 Synthetic experiments exploring the influence of the graph kernel on the power of two sample graph testing

We performed multiple synthetic studies of graph two-sample hypothesis testing, assuming various graph distributions. We considered graph distributions that mimic real-world graphs, such as binomial graphs, scale-free graphs, and stochastic block models, with and without node attributes, edge labels, node labels, and outlier graphs [87]. We used the Area Under the Curve (AUC) metric to assess the performance of the kernels. Here we report the main findings and refer to the S1 Appendix for the complete study.

5.1 Positively and negatively weighted graphs with attributes

We include an experiment in which the graphs can have both negative and positive edge weights as well as node attributes. This experiment is designed to mimic a comparison between portfolios, where the node attributes are average returns and the graph itself is the precision matrix (minus the diagonal).

The samples were generated in the following manner: first, generate a sample of binomial graphs with degree ki. Next, for each edge present, draw an edge weight from an exponential distribution with scale parameter λi. The sign of each edge weight is flipped with probability pi (note the graph is symmetric, so we flip the sign of both edges eij and eji). Finally, for each node generate a normal random variable Xi with mean mi and standard deviation si as its attribute.
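The generation recipe above can be sketched as follows. We interpret λ as a rate (edge weights with mean 1/λ), which matches the small partial-correlation magnitudes implied by λ₁ = 3000 below; this interpretation and the function name are our assumptions.

```python
import numpy as np

def sample_portfolio_graph(n=20, k=4, lam=3000.0, p_neg=0.35,
                           m=0.00038, s=0.01, rng=None):
    """One synthetic 'portfolio' graph: binomial (Erdos-Renyi) topology
    with expected degree k, exponential edge weights with mean 1/lam,
    sign flipped with probability p_neg (applied symmetrically), and
    i.i.d. normal node attributes (returns)."""
    rng = np.random.default_rng() if rng is None else rng
    edge_prob = k / (n - 1)                        # expected degree k
    upper = np.triu(rng.random((n, n)) < edge_prob, 1)
    weights = rng.exponential(scale=1.0 / lam, size=upper.sum())
    signs = np.where(rng.random(upper.sum()) < p_neg, -1.0, 1.0)
    W = np.zeros((n, n))
    W[upper] = signs * weights
    W = W + W.T                                    # symmetric signed weights
    attrs = rng.normal(m, s, size=n)               # node attributes
    return W, attrs
```

Each call returns one weighted signed graph plus its node attributes; repeating the call 20 times produces one sample for the test.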

In the experiment we construct 20 binomial graphs with 20 nodes in each sample, and the attributes are set to the following values: m1 = 0.00038, s1 = 0.01, k1 = 4, λ1 = 3000, p1 = 0.35. We changed one parameter at a time for sample 2, and the results can be seen in Fig 6. The MMDu estimate of the MMD was used. To incorporate negative weights for the WL, WWL, WLOA, and propagation kernels, we tried both a tensor product kernel, as explained in Section 3.7, and a kernel using the absolute values of the edge weights. We found that the results only differ when the sole difference between the graphs was the edge weight sign distribution; therefore we only report the results from the tensor product kernels.

Fig 6. ROC curves.

The power of various kernels in the graph two-sample testing problem.

https://doi.org/10.1371/journal.pone.0301804.g006

Fig 6 illustrates the results of the experiments performed. Each row shows the performance in different scenarios for a fixed basket of kernels. The plots show Receiver Operating Characteristic (ROC) curves, i.e. Type I error versus the power of the test, for a variety of experimental settings and kernel choices.

  • The WL, WLOA, and WWL kernels utilize the Weisfeiler-Lehman iteration scheme on node labels. In our experiment the nodes were labeled according to their degrees. It can be seen that these kernels perform well when the average degree differs between the two samples and when the probabilities of a negative edge differ. The first case is in concordance with other experiments (the label distributions are different). The latter can be explained as follows: take the pair and , where i comes from sample 1 and j comes from sample 2. If the two samples have the same probability of negative signs and the same average degree, then we would not expect the test to reject; however, if the two samples have different sign probabilities but the same average degree, then depending on the magnitude of pi (the probability of a negative edge), will be more sparse or dense than , and the WL iterations will differ even though the label distribution is the same. Not surprisingly, we see that the test has 0 AUC when only the attributes are different. Overall, we do not see AUC differences when a different number of WL iterations is used, nor from the WWL discount parameter.
  • The number of walks tmax in the propagation kernel was held constant at 6 in this experiment. The propagation kernel gives non-zero AUC in all cases. We can see that the lower the bin width, the higher the power, which makes sense due to the relatively low magnitude of the attributes (±2 · 0.01). However, the lower the bin width, the more diagonally dominant the kernel matrix becomes. It might be surprising that the AUC is not zero when p1 = 0.35 and p2 = 0.3 while all other attributes remain the same, as the propagation kernel simply normalizes the adjacency matrix and essentially calculates the average node attribute of the neighborhood of each node repeatedly. However, as the probability of a negative sign differs between the two samples, will be more sparse or dense than . This means that if the bin width of the propagation kernel is smaller than the standard deviation of the kernel then the test will detect differences between the two samples.
  • The depth L of the pyramid match kernel was held constant at L = 6 in this experiment. The pyramid match kernel has a non-zero AUC when the sign count differs between the two samples. This is not surprising, as the sparseness differs between and when j and i belong to different samples; the same is true for the positive graphs. Interestingly, we see that including labels gives better performance even though the average degree is the same between the two samples. We can also see that including more eigenvectors is preferable. Not surprisingly, we have zero AUC when only the attributes differ between the two samples, as the pyramid match kernel does not take attributes into account. Lastly, we see that the AUC increases as the two samples become more different.
  • We tested multiple versions of the RW kernel. For the SP kernel, we used a discretization strategy where the attributes were divided by the maximum attribute and then rounded to the first non-zero digit, and the weights followed a similar discretization strategy but were rounded to the second non-zero digit. The different versions of the RW and SP kernels all give good AUC except when only the sign probability differs. Interestingly, when attributes are ignored, the RW kernel performs well when the sign probabilities differ between the two samples. All types of random walk kernels seem to perform very well, while the SP kernel behaves similarly to the propagation kernel. We remark that the reason the binary RW kernel does a good job is that the attributes are different; if only the edge weights differ then the binary RW kernel gives zero AUC.
  • We also tried a different labeling scheme for the WL-type kernels, with labels set by binning the attributes. The attributes were divided by the largest attribute observed in each graph and the result was rounded to 1 or 2 digits. The test had positive AUC, and rounding to 1 or 2 digits does not matter in this case. However, rounding to 3 digits gave 0 AUC.

We also tested the MONK estimator of the MMD in the presence of outliers, both when the topological structure contained outliers and when the node attributes contained outliers, as in Fig 3. We found empirically that when the two samples are generated from the same distribution but one sample contains 5% outliers, the MONK estimator has a significantly lower type I error. We also performed MMD tests on other graph topologies, which gave similar results, although we only tested binary weights. The main difference is that the shortest path kernel did not manage to discriminate two populations of scale-free graphs with different scaling parameters. The reason is that the shortest path length in scale-free graphs is independent of the scaling parameter [88].

The second main synthetic study, of direct relevance to the ESG portfolio analysis undertaken in the real data case studies in this manuscript, pertains to the regularisation of the graph sample estimation and is shown below.

5.2 Effects of sparse graph sample construction via lasso regularization and its influence on graph testing

Another experiment vital to the application undertaken in this manuscript, the use of two-sample graph testing in portfolio diversification comparisons, concerns the effect of regularization in the graph construction/estimation. Consider two samples and such that xi ∼ N(0, Σx) and yi ∼ N(0, Σy) with unknown. It is known that random fluctuations in a sample can lead to spurious correlations in the empirical covariance matrix. The graphical lasso (see Section 6.2) has been proposed for filtering out spurious correlations, either by regularizing the covariance matrix [39] or the precision matrix [89]. Let Θ = Σ⁻¹ be the precision matrix. In this example we generate 100 samples from and 100 samples from and estimate Θx and Θy using the graphical lasso. This was performed 50 times, giving two samples of graphs and . We then calculate the p-value of the MMD test statistic. The experiment is performed 1,000 times for different regularization parameters, both when Θx = Θy and when Θx ≠ Θy. Two cases are considered: 1) we ignore the weights and signs and only consider binary graphs, Aij = 1 if |Σij| > 0 and 0 otherwise, and use the SP kernel; 2) we consider the precision matrix as-is and use the r-approximate RW kernel. Fig 7 shows a very interesting result: when H1 is true, the power of the test is low with no regularization, while it increases when ρ is near its optimal value. This clearly shows the significance of using sparsity-inducing estimation methods like the glasso. We also see that when H0 is true, the proportion rejected stays within the type I error α. These observations hold for both the binary graph experiment and the weighted graph experiment, but note that in the weighted graph experiment the power starts to fall before the estimation error reaches its minimum value.
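The graph-sample construction step of this experiment can be sketched with scikit-learn's graphical lasso. The function name and defaults are ours, and the sign convention (negated precision with zeroed diagonal) mirrors the portfolio graph construction described in Section 6; the paper's exact glasso settings may differ.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def precision_graph_sample(Sigma, n_obs=100, n_graphs=50, rho=0.1, rng=None):
    """Generate a sample of precision-matrix graphs: draw n_obs points
    from N(0, Sigma), fit the graphical lasso with regularization rho,
    zero the diagonal (no self-loops) and negate (partial covariance)."""
    rng = np.random.default_rng() if rng is None else rng
    graphs = []
    for _ in range(n_graphs):
        X = rng.multivariate_normal(np.zeros(len(Sigma)), Sigma, size=n_obs)
        Theta = GraphicalLasso(alpha=rho).fit(X).precision_
        np.fill_diagonal(Theta, 0.0)
        graphs.append(-Theta)
    return graphs
```

Sweeping rho over a grid and feeding the resulting graph samples to the MMD test reproduces the regularization study of Fig 7.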

Fig 7. The effect of regularization in graph learning for the kernel two-sample problem.

https://doi.org/10.1371/journal.pone.0301804.g007

6 Real data experimental design: Assessing diversification in ESG screened equity portfolios

In this case study, the focus is on the exploration of the screening of assets for inclusion/exclusion in a wealth management portfolio, such as those found in the emerging, lucrative green finance investing universe of ETFs, mutual funds, and 401k superannuation managed accounts constructed by investment managers and self-managed super investors. This market increasingly makes up the bulk of equity investments, and as such it is important to better understand the role that ESG ratings may have in affecting portfolio diversification if screening by ESG ratings is practiced. We demonstrate the methodology on assets within the S&P500 index over the years 2016–2022. The historical prices and ESG scores of each asset were obtained using Yahoo's publicly available APIs, which are intended for research and educational purposes. The data was used for modelling and demonstration purposes only, and the data collection and analysis methods complied with the terms and conditions of the Yahoo developer API terms of use (the Yahoo terms can be found at Yahoo terms). ESG scores were obtained from ESG scores, where the ticker has to be changed for each asset. The historical prices were obtained from prices, where the ticker has to be set for each asset.

There are four stages of the experimental design and case study undertaken. Namely, smoothing of the ESG scores, graph estimation, portfolio construction, and portfolio valuation. We start by describing the steps taken to perform the study.

Stage 1: Smoothly Stochastically Interpolating Equity ESG Scores

The first step is to smooth the ESG scores using a Gaussian Process which is a vital part of the study to give a more accurate screening procedure. Since ESG scores for each of the companies in listed stock exchanges are infrequently produced (scored), the screening procedure adopted will benefit from an interpolation of the time series of ESG scores per company. Furthermore, since ESG scores are typically noisy and may contain outliers, it is beneficial to first smoothly interpolate the scores so that any portfolio screening rules will not be affected adversely by rapid changes in the inclusion/exclusion of assets on a monthly screening and rebalance schedule of a portfolio, simply due to noisy scoring methods.

Hence, in stage 1 of the process, the framework proposed for this case study was a procedure that interpolates and essentially generates more observations which in turn gives more rejection decision data points from the two-sample tests. This in turn allows for a better comparison between the classical portfolio performance metrics and the inter-asset differences measured by the MMD.

Stage 2: Dynamic ESG Screening of Assets into Various ESG Ranked Asset Universes for Portfolio Inclusion

The obtained smoothed ESG scores were then used as screening criteria to construct two dynamic portfolios. Let P1,t and P2,t be two portfolios containing assets i ∈ I at rebalancing step t, where I is the set of assets considered for inclusion in the portfolio manager's investment thesis/theme. In our study, we considered multiple investment sets: assets belonging to one particular sector within the S&P500 index and all assets in the S&P500 index. An important note is that the asset set was constructed such that |I| mod 3 = 0. This was achieved by randomly removing assets from the original asset set throughout the experiment such that the remaining number of assets was divisible by 3, so that the number of assets in each portfolio was equal. Denote ESGi,t as the ESG score observed for asset i at time t.

At each rebalancing time t, the assets were ordered according to their current ESG score and split into three groups. The first group contained assets with the best (lowest) ESG risk scores, the second group contained assets with medium ESG scores, and the last group contained assets with poor ESG scores. The portfolios were then updated such that P1,t contained non-zero position weights only for the best ESG assets from group 1, and P2,t contained assets with the poorest ESG ratings from group 3 (the medium ESG assets were not used at the particular time t). We can define the sets more rigorously as follows: where I is the asset set and Qt,j% is the j-th quantile of the smoothed ESG scores at time t. To emphasize, by construction we have:
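The tercile screening rule can be sketched as follows, assuming (as above) that a lower ESG risk score is better and that |I| is divisible by 3; the function name is ours.

```python
def esg_terciles(esg_scores):
    """Split an asset universe (|I| divisible by 3) into best / medium /
    worst thirds by ESG risk score (lower score = better)."""
    tickers = sorted(esg_scores, key=esg_scores.get)  # ascending risk score
    n = len(tickers) // 3
    return tickers[:n], tickers[n:2 * n], tickers[2 * n:]
```

At each rebalancing step t, P1,t would be built from the first returned group and P2,t from the last.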

Stage 3: Construction of Two Sets of Graph Samples via Screened Assets by ESG Rankings from Stage 2 Portfolios

With the screened asset sets, the historical log-returns were extracted for the screened assets in portfolio 1 and portfolio 2, where n is the number of days considered and p = |I|/3 is the number of assets in each group. A graph representation of inter-asset relations was then estimated for each asset set using the graphical lasso [89].

The graphical lasso estimates the precision matrices Θ(j,t), j ∈ {1, 2}, using the historical log returns (or their covariance) as input. The estimated precision matrices were then used to construct the adjacency matrices of the two portfolio graphs G1,t and G2,t at time t. We assumed that vertices cannot be connected to themselves, so the diagonal entries were ignored/set to zero. The edge weights were defined to be the off-diagonal elements of −Θ, justified by the fact that the negative precision encodes the partial covariance. In addition to the graph estimation, we used 3 portfolio construction techniques for the assets in P1,t and P2,t and analyzed their performances using various portfolio performance metrics. The graph estimation step was performed in a rolling window fashion using every other time point, i.e. the rebalancing was performed on the set t ∈ {t2i : i ∈ {w, w + 1, …, ⌊n/2⌋}}, where w < n is the graph estimation window size.

Furthermore, for each of these screened portfolios, the optimal portfolio weights for P1,t and P2,t were determined according to the mean-variance criterion to produce various optimal portfolio strategies: Global Minimum Variance and Maximum Sharpe Ratio.

Stage 4: Two-Sample Graph Testing Inference to Assess Influence of ESG Screening on Portfolio Diversification

The final step involved applying the MMD testing procedure in a rolling window fashion to the two graph samples, where M is the total number of graphs in each sample. That is, at each time step t a kernel two-sample test was performed using the two graph sub-samples, where s is the size of the graph testing rolling window (not the same as the graph estimation rolling window w). In our experiment, we used s = 20. The overall graph estimation and graph testing procedure is written in pseudo-code in the S1 Appendix.
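The test statistic underlying this procedure is the (unbiased) squared MMD. A minimal sketch, using an RBF kernel on toy feature vectors as a stand-in for a graph kernel evaluated on the graph samples:

```python
import numpy as np

def mmd2_unbiased(K, m):
    """Unbiased MMD^2 from a joint (2m x 2m) kernel matrix whose first m
    rows/columns index sample X and last m index sample Y."""
    Kxx, Kyy, Kxy = K[:m, :m], K[m:, m:], K[:m, m:]
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (20, 3))   # toy vectors standing in for graphs
Y = rng.normal(0, 1, (20, 3))   # drawn from the same distribution
Z = np.vstack([X, Y])

sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)           # RBF kernel in place of a graph kernel
stat = mmd2_unbiased(K, 20)     # near zero under the null
```

In the actual pipeline K would be a graph-kernel Gram matrix computed over the 2s graphs in the testing window, with the rejection threshold obtained by permutation or the bounds discussed earlier in the paper.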

6.1 Smoothly stochastically interpolating equity ESG scores

The ESG data from 2016–2022 was obtained from Yahoo Finance and consists of monthly ESG scores of companies within the S&P 500 index. Not all companies within the S&P 500 have ESG scores, and some have only very few observations, around 10–20. Companies with 20 or fewer observations were discarded, leaving 429 companies to analyze. Fig 8 shows the ESG time series of 3 companies. First, we can see that the ESG scores are piecewise constant functions. Second, ESG scores are usually observed monthly, but during and after COVID the observations become less frequent. Finally, there is a level shift around 2020. The shift does not occur in the same month for all series; as we can see in the figure, BA and GILD shift before ADI.

Fig 8. ESG scores for 3 tickers within the S&P 500 asset universe.

https://doi.org/10.1371/journal.pone.0301804.g008

We remove the shift by taking the mean before and after the shift and subtracting the difference from the original series for company i before the shift time k: where T is the number of observations. We are interested in studying the effect of ESG scores on portfolios in a rolling window manner. However, the characteristics of the ESG series pose some difficulties, as we only have a maximum of 75 observations for an individual series compared to 2681 observations for financial log-returns. To generate daily estimates we fit a Gaussian process (GP) smoother to each ESG index [90]. Formally, GPs are defined as:

Definition 6.1 (Gaussian Process). Denote a stochastic process parametrised with state-space , where . The random function f(x) is a Gaussian process if all its finite-dimensional distributions are Gaussian, i.e., for any finite collection of points x1, …, xn, the random vector [f(x1), f(x2), …, f(xn)] is jointly normally distributed.

We can therefore interpret a GP as equivalently characterized by the following class of random functions: with and such that:

This means that, since a Gaussian process is characterized entirely by Gaussian finite-dimensional distributions, we can completely specify it using the mean function μ and a covariance function k. Assume that we observe noisy observations y (the ESG score): where . Let the subscript * denote unobserved values; then we can write the joint distribution of y and f* as: where k(X, X) is the matrix obtained by applying k pairwise to each data point in X and μ(X) is the mean vector obtained by applying the mean function row-wise to X. Deriving the conditional distribution of f*, we arrive at the key predictive equations for Gaussian process regression.

There are usually three sets of parameters that need to be estimated when fitting a Gaussian process: θμ, the hyperparameters of the mean function; θk, the parameters of the covariance function; and σ2, the noise level of the observations. In our case, we assume that the mean function is zero, leaving only θk and σ2 to be estimated. In this manuscript we are more interested in smoothing than forecasting, so we use stratified cross-validation to choose the best σ2 while θk is estimated by maximizing the log-likelihood for a given value of σ2. The metric used in each fold is defined as: where are the predictions within the test set of the GP and is the mean of the sample. We used the Sklearn [91] library to fit the Gaussian processes. The overall smoothing algorithm is explained in the S1 Appendix. We use a Matérn kernel with ν = 3/2 to smooth the ESG scores (using the time index as a feature). It gave the maximum log-likelihood compared to the other kernels tested, such as other Matérn kernels, the RBF kernel, and the rational quadratic kernel. Fig 9 shows the GP smoothed ESG scores of an asset.
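A minimal sketch of the level-shift removal and GP smoothing pipeline (the series and shift index k are hypothetical; the Matérn ν = 3/2 kernel follows the text, and the `alpha` argument stands in for the cross-validated noise level σ2):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(2)

# Hypothetical monthly ESG series with a level shift at index k = 40.
t = np.arange(75, dtype=float)
esg = 25 + 2 * np.sin(t / 12) + rng.normal(0, 0.3, 75)
esg[:40] += 10.0                               # level shift before k

# Remove the shift: subtract the difference of the means before/after k.
k = 40
esg_adj = esg.copy()
esg_adj[:k] -= esg[:k].mean() - esg[k:].mean()

# GP smoother with a Matern nu = 3/2 kernel; alpha acts as the noise
# variance sigma^2 (chosen by cross-validation in the paper).
gp = GaussianProcessRegressor(kernel=Matern(nu=1.5), alpha=0.09,
                              normalize_y=True).fit(t[:, None], esg_adj)
t_daily = np.linspace(0, 74, 500)[:, None]     # denser "daily" grid
esg_smooth = gp.predict(t_daily)
```

Predicting on the denser grid is what turns the sparse monthly scores into the daily estimates used at the rebalancing dates.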

6.2 Construction of two sets of graph samples via screened assets by ESG rankings from stage 2 portfolios

In this section, we explain how one can take the multivariate time series of returns for the screened assets for Portfolio 1 and Portfolio 2 and construct a time series of regularised graph samples that will form the input to the graph testing framework. Furthermore, since the graphs should capture aspects of the structure between the time series of asset returns in each ESG score-screened sub-population, it is important to base the graph estimation on a measure of portfolio diversification. The most natural of these is the correlation structure between the assets that will be included in each portfolio.

Such samples of graphs constructed from financial returns should contain the most important connections within the ensemble of assets being investigated. These connections are, however, by no means obvious, and unraveling them all is a very complicated task. Here we will apply the most widely used graph estimation method to estimate the latent financial sample graphs which are then to be used in the portfolio optimization procedure and graph testing framework.

The graphical lasso [39, 89] has been proposed for filtering out spurious correlations. It can be performed as a regularization of either the covariance matrix or the precision matrix, but the objective function is non-convex for the covariance matrix, making the precision matrix problem easier to work with. The estimated graph will be sparse because of the imposed l1 regularization.

At time t we estimate two graphs/precision matrices, one that encodes financial covariances between assets with good ESG scores (those screened to be included in portfolio 1) and one that encodes financial covariances between assets with poor ESG scores (those screened to be included in portfolio 2). Further, assume that we use the past n observations to do so, and let Θ(1,t) = Σ−1,(1,t) and Θ(2,t) = Σ−1,(2,t) be the precision matrices at time t for the good ESG portfolio and the poor ESG portfolio, respectively. The graphical lasso is: where d is the number of features, μ is the mean of the log-returns, and ri are realisations of the log-returns. The first 3 terms are the Gaussian log-likelihood function and the last term is the lasso penalty term. For simplicity we often use the same penalty for each edge, that is, ρ = ρij. The objective is concave and can thus be solved using a convex optimization procedure such as a coordinate-wise descent algorithm [89, 92, 93]. Fig 10 shows an example of a graph obtained from an estimated precision matrix.
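A sketch of the estimation step using scikit-learn's GraphicalLasso (the paper ultimately uses the R Huge package, which penalizes the diagonal slightly differently; the data here are simulated):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(3)
# Simulated log-returns for 5 assets over 250 days with common correlation.
cov = 0.6 * np.eye(5) + 0.4
r = rng.multivariate_normal(np.zeros(5), cov, size=250)

# alpha plays the role of the l1 penalty rho above.
gl = GraphicalLasso(alpha=0.05).fit(r)
theta = gl.precision_          # sparse estimate of the precision matrix
```

The off-diagonal entries of `theta` (negated) then supply the edge weights of the portfolio graph, as described in Stage 3.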

Fig 10. Example of a learnt graph.

Example of a top ESG Portfolio Graph for the Industrial sector at a time t. Each vertex represents an asset, and each vertex has furthermore the mean return over the graph construction period as an attribute. The mean return has been multiplied by 1000.

https://doi.org/10.1371/journal.pone.0301804.g010

Remark. The edge weights and signs are ignored for the WWL, WL, and WLOA kernels. The SP and Propagation kernel use absolute values of edge weights as those kernels can not use negative weights. The reason why we choose the absolute value instead of a tensor product is that very few negative edge weights were observed after the graph estimation was performed. The SP kernel further bins the node attributes and edge weights to speed up the kernel calculation.

We tested two graph estimation packages: the Huge package [31] in R and the Sklearn package in Python [91]. The two packages differ slightly, as the Huge package penalizes the diagonal of Θ whilst the Sklearn package does not. We chose the Huge package as it was easier to use for our purposes.

The (extended) BIC [94] approach is to maximize the posterior probability of a model Mi given the data y:

We use Laplace’s method to approximate the integral; this only works if p(y|θi)p(θi) has a global maximum and decays rapidly to zero away from the maximum. We expand log (p(y|θi)p(θi)) about the mode, obtaining: where is the Hessian. Let ; since Q attains its maximum at , we have: where m is the number of parameters, or the number of edges in our case. For a large number of samples n we have: where F is the Fisher information. Taking the log of this asymptotic relation gives us the following equation:

If we assume an uninformative flat prior, and furthermore drop the Fisher matrix and the m log 2π term, then the above formula reduces to the BIC, where we have multiplied the above by 2:

In a high-dimensional setting such as graph/covariance estimation, where m ≫ n, the BIC often prefers large models and may then select spurious covariates. To see this, say that there are 1000 covariates. The set of one-covariate models has cardinality 1000, while the set of two-covariate models has cardinality 1000·999/2, meaning that a two-covariate model is 999/2 times more likely to be selected. This means that we should not use an uninformative flat prior in a high-dimensional setting. A prior that has been suggested is [94] with 0 < β < 1, where d is the dimension of each ri, resulting in the so-called extended BIC or EBIC:

We can see that the higher the β, the sparser the optimal model is according to EBIC.
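As a sketch, one common form of the EBIC penalty for graphical models (the Foygel–Drton form, with β playing the role of the extra sparsity parameter; the likelihood values and edge counts below are hypothetical):

```python
import numpy as np

def ebic(loglik, n_edges, n_obs, d, beta):
    """EBIC = -2*loglik + |E|*log(n) + 4*beta*|E|*log(d)
    (Foygel-Drton form for Gaussian graphical models)."""
    return (-2.0 * loglik + n_edges * np.log(n_obs)
            + 4.0 * beta * n_edges * np.log(d))

# Hypothetical fits: a sparser graph with slightly worse likelihood
# versus a denser graph with slightly better likelihood.
sparse = ebic(loglik=-510.0, n_edges=8, n_obs=250, d=20, beta=0.5)
dense = ebic(loglik=-500.0, n_edges=40, n_obs=250, d=20, beta=0.5)
```

With β > 0 the extra log d penalty per edge grows with the dimension, so the sparser model wins unless the denser one improves the likelihood substantially.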

6.3 Monotone transformation

The graphical lasso relies heavily on the assumption of normality. Unfortunately, this assumption is usually violated even after log transformation of the returns. We consider a transformation suggested by [95], called the nonparanormal, which transforms the log-returns via smooth functions; that is, we transform r using f(r) = (f1(r1), …, fd(rd)) such that f(r) is multivariate Gaussian. The random vector r is said to be nonparanormal if there exist functions such that f(r) ∼ N(μ, Σ), and we write:

It can be seen that the NPN is simply a Gaussian copula when the fi are monotone and differentiable, and that the joint probability density of r is given by:

The density is not identifiable, so it is assumed that the fj preserve means and variances:

This all implies that , where Fj is the marginal distribution of rj and Φ is the cumulative distribution function of the standard normal. We use the Winsorized empirical distribution function to estimate Fj, as suggested by [95]: where is the empirical distribution and .
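A sketch of the nonparanormal transform with the Winsorized empirical CDF (the truncation level δn = 1/(4 n^{1/4} √(π log n)) follows [95]; SciPy is assumed available):

```python
import numpy as np
from scipy.stats import norm, rankdata

def npn_transform(x):
    """Nonparanormal transform with a Winsorized empirical CDF [95]."""
    n = len(x)
    delta = 1.0 / (4.0 * n**0.25 * np.sqrt(np.pi * np.log(n)))
    F = rankdata(x) / n                  # empirical CDF at the data points
    F = np.clip(F, delta, 1.0 - delta)   # Winsorization
    # fj preserves the mean and standard deviation of the original series.
    return x.mean() + x.std() * norm.ppf(F)

rng = np.random.default_rng(4)
r = rng.standard_t(df=3, size=500)       # heavy-tailed stand-in for log-returns
z = npn_transform(r)
```

The Winsorization keeps Φ−1 away from 0 and 1, so extreme returns do not map to infinite Gaussian quantiles.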

Remark. In addition to the nonparanormal transform, we also tried standard scaling. Scaling is important if the features are on different scales, which is generally not the case for log returns. We compared the results when the log returns were given as-is, scaled, and transformed nonparanormally. There were some discrepancies between the estimated graphs, as expected, and in the end we chose the nonparanormal transform.

6.4 Portfolio construction and metrics

To maintain comparability between the graphs and the portfolios, we used the estimated graph/precision matrix, as explained in subsection 6.2, to obtain an estimate of the covariance matrix by setting Σ(j,t) = (Θ(j,t))−1.

This is justified as the graphical lasso can be seen as providing a robust estimate of Θ−1,(j,t). To assess portfolio performance under ESG screening we consider three portfolio construction methodologies. The Passive Equal-Weighted Portfolio (PEW) is a buy-and-hold portfolio that serves as a long-term investment strategy with minimal interaction with the market. The weights of all assets are set to be equal, wi = 1/d for asset i, with d being the number of assets. PEW prohibits short positions. The Global Minimum Variance (GMV) portfolio seeks the portfolio with the lowest volatility [96, 97]. The optimal weights are allocated by an optimization problem that minimizes the variance of the portfolio return. The GMV optimization is given by:

The analytic solution is w = Θ1/(1TΘ1), where 1 is a vector of ones. The Maximum Sharpe Ratio Portfolio (MS) serves as a benchmark for achieving the highest return per unit of risk. The weights are allocated by the optimization problem that maximizes the Sharpe ratio of the portfolio:

The analytic solution is w = Θμ/(1TΘμ).
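Both closed forms can be sketched directly (these are the standard mean-variance results with a zero risk-free rate; the covariance matrix and expected returns below are hypothetical):

```python
import numpy as np

def gmv_weights(Sigma):
    """Global Minimum Variance: w = Theta 1 / (1^T Theta 1), Theta = Sigma^{-1}."""
    ones = np.ones(Sigma.shape[0])
    w = np.linalg.solve(Sigma, ones)
    return w / w.sum()

def max_sharpe_weights(Sigma, mu):
    """Maximum Sharpe (zero risk-free rate): w = Theta mu / (1^T Theta mu)."""
    w = np.linalg.solve(Sigma, mu)
    return w / w.sum()

Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])   # hypothetical 2-asset covariance
mu = np.array([0.05, 0.08])        # hypothetical expected returns

w_gmv = gmv_weights(Sigma)
w_ms = max_sharpe_weights(Sigma, mu)
```

Using `np.linalg.solve` rather than explicitly inverting Σ is numerically preferable; in the paper's setting Σ is itself obtained by inverting the graphical lasso precision estimate.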

Note that the optimization objectives allow long and short positions for the GMV and MS portfolios. This can lead to extreme asset allocations that result in an excessive concentration of positions in a few assets. The Federal Reserve usually limits the short-selling positions within a portfolio to 50% of the portfolio weight, i.e., a 150–50 fund investment strategy. In practice, the short-selling ratio ranges from 120–20 to 150–50, with 130–30 being the most common [98]. Consequently, we limit the short positions to no more than 30% of all positions. We approximate the box constraints by renormalizing the asset weights, rather than adding them to the optimization constraints, in the following manner: where S ⊆ {1, …, d} are the indices of the assets being shorted. Similarly, for the long positions, where L ⊆ {1, …, d} are the indices of the assets held long.
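One possible reading of this renormalization, scaling the shorts to sum to −0.3 and the longs to sum to 1.3 (a 130–30 style book), might look like the following; the exact formula in the text is elided, so this interpretation is an assumption:

```python
import numpy as np

def cap_shorts(w, short_cap=0.3):
    """Renormalize weights so shorts sum to -short_cap and longs to
    1 + short_cap (a 130-30 style book); one reading of the text's rule."""
    w = np.asarray(w, dtype=float)
    shorts, longs = w < 0, w >= 0
    out = w.copy()
    if shorts.any() and -w[shorts].sum() > short_cap:
        out[shorts] = w[shorts] * (short_cap / -w[shorts].sum())
        out[longs] = w[longs] * ((1 + short_cap) / w[longs].sum())
    return out

w = cap_shorts([1.6, 0.4, -0.5, -0.5])   # hypothetical unconstrained weights
```

The rescaled book still sums to one, so the portfolio remains fully invested after the cap is applied.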

We can visualize the geometry of the different portfolios available in the top ESG universe and the low ESG universe by looking at the efficient frontier at different times t. The efficient frontier can be found by minimizing the following objective for different portfolio returns , (6.1) which is similar to the MS objective. The efficient frontier can be traced by setting some predefined values for and plotting the outcome using the following equation for the standard deviation: (6.2) where a = 1TΘ1, b = 1TΘμ, and c = μTΘμ.

We consider multiple portfolio metrics to assess the performance of the various ESG-screened portfolios. The diversification ratio is defined as: where σ2 = diag(Σ) is the vector of variances of the individual assets. We also look at the Value-at-Risk (VaR) diversification measure: where q is the quantile and Rp and Ri are random variables for the return of the portfolio and asset i, respectively. We estimate VaR using the empirical quantile. We additionally consider return metrics such as the Sharpe, Treynor, and Sortino ratios and the omega function evaluated at zero, a measure that incorporates higher-moment information in the returns distribution. We exclude the exact formulas here and defer them to the technical S1 Appendix.
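The diversification ratio and an empirical VaR-based measure might be sketched as follows (the DVaR convention used here, portfolio VaR over the weighted sum of stand-alone VaRs, is one common choice and is an assumption, as the text's exact formula is elided):

```python
import numpy as np

def div_ratio(w, Sigma):
    """Diversification ratio: weighted average volatility over portfolio volatility."""
    sigma = np.sqrt(np.diag(Sigma))
    return (w @ sigma) / np.sqrt(w @ Sigma @ w)

rng = np.random.default_rng(5)
# Simulated returns for 2 assets with mild positive correlation.
R = rng.multivariate_normal([0, 0], [[1.0, 0.2], [0.2, 1.0]], size=5000)
w = np.array([0.5, 0.5])
dr = div_ratio(w, np.cov(R.T))

# Empirical VaR diversification: portfolio VaR relative to the
# weighted sum of stand-alone VaRs at quantile q.
q = 0.05
var_p = -np.quantile(R @ w, q)
var_i = -np.quantile(R, q, axis=0)
dvar = var_p / (w @ var_i)
```

A diversification ratio above 1 and a DVaR below 1 both indicate that combining the assets reduces risk relative to holding them in isolation.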

Remark. Each of these metrics, along with the average log returns, was calculated over the same period as was used to estimate the graphical lasso. That is, at time t we use the historical data with index i ∈ [t − n, t].

7 ESG ratings and diversification benefits

Typically, portfolio assessment concentrates on the net portfolio risk and return profiles. We take a novel approach and study and quantify differences in inter-asset relations and graph structures, which in turn affect diversification differences between two portfolios selected according to a dynamic screening criterion ranking assets for inclusion/exclusion by their ESG ranking, as outlined in the previous section: the so-called good and poor ESG portfolios. The main objective is to analyze the rejection rates which occur when one tests for differences in inter-asset relations in two portfolios. We further want to test whether the rejection rates of graph kernels translate into differences in diversification and risk-adjusted return profiles, and whether the rejections can be explained by differences in the ESG profiles.

To answer these questions we will look at the sensitivity and conservatism of different graph kernels. The kernel should be neither too conservative nor too sensitive, meaning it should neither fail to reject all of the time nor reject all of the time. After identifying the right kernels, we count how often the good ESG portfolio has better diversification and risk-adjusted return metrics, while also reporting the rejection decision. A visual example is then presented to showcase which differences the kernel is quantifying. We further analyze whether an MMD test rejection translates directly into portfolio performance, which would mean that inter-asset relations affect portfolio risk and return profiles. To further assess the relationship between portfolio metrics and MMD rejection rates, we perform PCA and classification analyses to try to identify which portfolio metrics most affect a rejection.

Remark. Portfolio testing does not have to be a cross-sectional study of two portfolios; it can also be a temporal study in which changes in inter-asset relationships are tested before and after some event, such as interest rate hikes, ESG scoring changes, or legislative changes.

Different combinations of assets in the S&P 500 were used to construct 2 portfolios in a rolling-type fashion for different scenarios, as explained in section 6. The 2 dynamic portfolios represent good ESG companies and low ESG companies, respectively, in multiple scenarios: when the portfolios are only allowed to invest in single sectors, and when the portfolios are allowed to select all assets in the S&P 500 index. Table 1 lists how many assets belonged to each portfolio, i.e. |I|/3, for the different asset universes (sectors and global). It can be seen that some portfolios contained only a few assets (Basic Materials, Energy, and Communication) while others had more (Global, Consumer Cyclical, Industrials, Healthcare, and Financial).

Table 1. Number of assets in each portfolio, the good and poor ESG, for the different asset universes (sectors and global).

https://doi.org/10.1371/journal.pone.0301804.t001

We add a few more comments on the portfolio structure. Fig 11 counts the number of sectors contained in the global good ESG portfolio and poor ESG portfolio. The top ESG portfolio mainly consists of assets that belong to sectors with better overall ESG scores, such as the Healthcare, Real Estate, Technology, and Financial sectors, while the poor ESG portfolio has assets that belong to poor ESG sectors such as Materials, Utilities, Energy, and Industrials. We can also see in Fig 12 that the membership of assets is fairly consistent; this feature was also observed for portfolios in other sectors. That is, the assets in portfolios P1,t and P2,t are mostly stable, with only a few assets jumping between portfolios at each rebalancing step.

Fig 11. Asset ESG group membership.

The figure shows how many assets belong to each sector at different time points for the two global portfolios (top and bottom).

https://doi.org/10.1371/journal.pone.0301804.g011

Fig 12. Changes in portfolio membership for the global assets.

Each column is an asset and the row is time. Each cell indicates which portfolio the asset (column) belonged to at each rebalance date (row). Time flows from the top to the bottom. Color intensity 0 indicates that the asset belonged to the top ESG portfolio, 1 the medium ESG portfolio (not used in this study), and 2 the low ESG portfolio.

https://doi.org/10.1371/journal.pone.0301804.g012

Table 2 gives an overview of the relation between the E score and CO2 emissions. It can be seen that the Utilities, Energy, Materials, and Industrials sectors have by far the most CO2 emissions. We keep this table in mind and study whether there are similarities in the inter-asset relations of those sectors.

Table 2. Average environmental scores and amount of carbon emissions of some companies in the S&P 500, separated by Global Industry Classification Standard (GICS) sector and ranked by direct CO2 emissions. The table also reports the number of assets available for calculating the carbon emissions and the total number of assets in the sector.

The table was taken from [98] with permission.

https://doi.org/10.1371/journal.pone.0301804.t002

In Table 3 we report the rejection rate, i.e., the number of rejections divided by the total number of tests performed, for different graph kernels using α = 0.01. It can be seen that the pyramid kernel and the SP kernel with binned attributes are too sensitive, as they reject almost all of the time. The rejection rates do depend on the asset universe; for example, the RW kernel rejects more often for the Materials, Energy, Consumer Cyclical, and Real Estate sectors. The results for each asset universe are mostly heterogeneous across different kernels. Furthermore, it can be seen that the MONK estimator is usually more conservative than the MMDu and MMDl estimators, as expected.

Table 3. The number of rejections for different graph kernels assuming a cut-off of α = 0.01.

The RW kernel discount was set to c ≈ 10−9 (varied a little between tests). The propagation used w = 0.0001 and tmax = 6. The number of WL iterations was set to h = 2. The WWL used a discount of λ = 0.1. The pyramid kernel used L = 10 and d = 6. Note that other hyperparameters were tested as well, giving very similar results. The SP kernel used a discretization of the continuous edge weights and node attributed, using rounding of 3 digits and 1 digit respectively. The MONK estimator used Q = 5 partitions.

https://doi.org/10.1371/journal.pone.0301804.t003

We can visualize the reasons for rejection by looking at two graphs, one good ESG and one poor ESG, constructed from the Utilities sector (whose portfolio graphs had only 9 nodes), as shown in Fig 13. The main difference, in this particular case, is that the good ESG portfolio graph was denser and had bigger weights. Also, the good ESG portfolio had 2 negative edges while the poor ESG portfolio had none. The average log return is higher for the poor ESG portfolio: 0.00068 (good) vs 0.00082 (poor).

Fig 13. Comparison of two portfolio graphs.

One graph comes from the good ESG sample and the other from the poor ESG sample, on 2019-03-15, when the test was rejected. The average log returns have been multiplied by 1000.

https://doi.org/10.1371/journal.pone.0301804.g013

Furthermore, Fig 14 shows the average weighted degree for the negative weights and the positive weights, and the density, for the graphs of the Utilities sector.

Fig 14. The average weighted degree of the graphs within the two portfolios.

A separate degree is calculated for the negative and positive weights.

https://doi.org/10.1371/journal.pone.0301804.g014

From the figure, we can see that there are very few negative edges and that the weighted average degrees, defined as: where wij is the weight of the edge (vi, vj), are very different between the two portfolios. Interestingly, the density seen in Fig 15 is much more similar in value between the graphs in the two samples, and the two time series often cross. This is reflected in the fact that the RW kernel has more rejections, as it corresponds to testing for weighted edges, compared to the WL kernel which considers only binary edges.
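Splitting a weighted adjacency matrix into its positive and negative parts and averaging the weighted degrees might look like the following (the matrix is hypothetical, and since the exact degree formula in the text is elided, the per-node sum of incident weights used here is an assumption):

```python
import numpy as np

# Hypothetical symmetric weighted adjacency matrix (no self-loops).
W = np.array([[ 0.0, 0.8, -0.1],
              [ 0.8, 0.0,  0.5],
              [-0.1, 0.5,  0.0]])

# Per-node weighted degrees for the positive and negative parts.
pos_deg = np.clip(W, 0, None).sum(axis=1)    # sum of positive incident weights
neg_deg = np.clip(-W, 0, None).sum(axis=1)   # |sum| of negative incident weights

avg_pos, avg_neg = pos_deg.mean(), neg_deg.mean()
```

Comparing `avg_pos` and `avg_neg` across the two portfolio graph samples reproduces the kind of summary shown in Fig 14.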

Fig 15. The density of the graph sample ignoring the signs and assuming a binary structure.

The density is shown for the top ESG portfolio and the low ESG portfolio.

https://doi.org/10.1371/journal.pone.0301804.g015

A natural question is whether the rejection rates are purely based on inter-asset relations or whether they are related to portfolio ESG scores or portfolio performance metrics. We now focus mostly on the RW kernel, as it gives good power while also explicitly handling real-weighted graphs (positive and negative).

Remark. The negative weights in Fig 14 are very small in absolute value compared to the positive weights, and the number of negative edges is small. This is true for all sectors. Although negative weights are observed in sector portfolios, we ignore the signs, as the ratio of negative weights is very low and often 0 at different time points t. This fact can lead to problems, as the negative G(−) graphs will be empty, leading to a kernel evaluation of zero in some cases, which will ruin the inference procedure if too many elements of K are zero. Therefore, we decided to ignore signs in kernels that do not deal explicitly with negative weights (all but the RW kernels) by taking the absolute value. We should also note that the attributes are of order 10^-4 while the edge weights are of order 10^3. For the RW kernel with attributes, we multiplied the attributes by 1000 to ensure the positive semi-definiteness of the kernel. The SP kernels use a binning strategy with the identity kernel as a base kernel to speed up the kernel calculations. Each edge weight has been divided by the maximum absolute value of the edge weights in the sample, and each node attribute has been divided by the maximum absolute value of the node attributes in the sample. We did not find much difference between binning the attributes for the WL-type kernel and using the degree as a label, but the binning strategy was usually more sensitive and rejected the test more often.

To analyze whether ESG scores affect inter-asset relations we define the following feature: where is the average over the time points observed in [ts, …, ts+L], i is the asset universe (sectors or global), is the average ESG score of the assets that make up the good portfolio, and is the standard deviation. and are defined similarly. zi,t can be seen as the difference between the 75% quantile of the ESG score of the good ESG portfolio and the 25% quantile of the poor ESG portfolio. From Fig 16 we can observe that for some sectors the difference between the quantiles of the good and poor portfolios is negative, meaning that the ESG scores of the assets within the good and poor portfolios are similar; this is, for example, true for the Financial, Communication Services, Technology, and Materials sectors. For other asset universes the ESG profiles show differences, such as the Global, Industrials, and Energy universes. From Fig 17 we can see that there is no evident relationship between rejection rates and ESG differences.

Fig 16. Difference of ESG scores for each sector.

The figure shows zt,i, the difference between the 75% quantile of the ESG score of the good ESG portfolio and the 25% quantile of the poor ESG portfolio.

https://doi.org/10.1371/journal.pone.0301804.g016

Fig 17. Rejection rates as a function of , the average quantile difference of the good and poor ESG portfolios.

The rejection rates are the mean observed rejections in the interval [ts, …, ts+L].

https://doi.org/10.1371/journal.pone.0301804.g017

Table 4 gives the occurrence ratio with which the good ESG portfolio gave better performance, for each portfolio type and each test decision. The exact formula is: where R is the set containing the times when the test was rejected using the MONK estimator, M = {DRatio, DVar, S, MDD, Ω, ST, TR} is the set of portfolio performance metrics, is the observed metric value at time t for the good ESG portfolio, and is the observed value at time t for the poor ESG portfolio. Pnot rejected is defined analogously but considers the set containing the times when the test was not rejected using the MONK estimator. Prejected,m considers a specific portfolio metric. We present the results for the RW kernel, as the values are consistent over different kernels. The full table can be seen in the S1 Appendix.

Table 4. The occurrence ratio when the good ESG portfolio gave better performances for each portfolio type for each test decision.

The results are using the MONK estimator from the RW kernel.

https://doi.org/10.1371/journal.pone.0301804.t004

From Table 4 we first see that there is no apparent difference between the ratios for the rejected and non-rejected cases. For the Global portfolio, we can see that the good ESG companies are preferred for index investing, namely the PEW portfolio, but not for the MS-optimized portfolio, while the good ESG GMV portfolio is slightly preferred over the poor ESG GMV portfolio. Looking at the sectors, we can see that the good ESG portfolios, independent of the construction method, are most often preferred for the Materials, Industrials, Energy, Technology, and Financial sectors, while the poor ESG portfolios are more often preferred for Consumer Cyclical, Real Estate, and Healthcare. For the other sectors, the good ESG portfolio is usually slightly preferred according to the aggregated proportion metric Prejected. Looking at each portfolio construction: 1) for index investing (PEW), the Consumer Cyclical, Real Estate, and Healthcare sectors show a preference for the poor ESG portfolio; 2) the poor ESG portfolios are preferred for Consumer Cyclical and Real Estate using the GMV portfolio construction; and 3) for the MS construction type, the poor ESG portfolios consisting of Global, Consumer Cyclical, Real Estate, and Healthcare assets outperformed the good ESG portfolio most of the time. One reason why the poor ESG Real Estate portfolio outperforms its good ESG counterpart is that the ESG scores within the Real Estate sector are very similar, although this is not the case for the Consumer Cyclical sector. The ESG scores of the two portfolios over time can be found in the S1 Appendix. We can break down the analysis further by looking at Prejected,m. In general, the risk-adjusted returns (Sharpe, Treynor, and Sortino) and omega usually agree on the ordering. Interestingly, if DRatio is better for the good ESG portfolio, then DVaR is not necessarily better as well; this is, for example, the case for the global portfolio comparison.
A table for Prejected,m can be found in the S1 Appendix.

We have identified structural differences between the two portfolios (the ratios in Table 4 are not all 0.5), so the natural question is whether these differences are manifested in portfolio-level performance. We start by looking at PCA biplots and then move on to logistic regression to try to identify which metrics show relations with the inter-asset connections. We construct the data as follows: where t is a timepoint where a rejection decision is made and is the portfolio metric value for one particular portfolio construction (PEW, GMV, or MS) for the good ESG portfolio. is defined analogously but for the poor ESG portfolio. We then scale each feature and create , where and are the estimated mean and standard deviation of xi, respectively. The label yt is the rejection decision according to the MONK estimator at time t: 0 if no rejection and 1 otherwise.

Fig 18 shows a PCA biplot using the RW kernel for a few sectors. A figure for all sectors can be found in the S1 Appendix. The non-rejections tend to occur in the neighborhood of other rejections, although the two classes are not linearly separable. We now perform a logistic regression whose objective is to classify a MONK rejection when the input data are portfolio metrics. We used a weighted logistic regression to adjust for the class imbalance defined as: where w0 and w1 are the class weights for non-rejection and rejection labels, and using x as features and β is the vector of regression coefficients.yk is the rejection decision in test nr. k. n is the number of tests performed and p is the number of features in the regression (number of portfolio constructions times the number of portfolio metrics). The regularization parameter λ is found by using a 3-fold CV. The data matrix was shuffled randomly before the estimation procedure began to evenly spread the non-rejections over each fold. The weights were found by using where ni is the number of observations in class i. Table 5 lists the area under the curve (AUC) for each fold and its corresponding estimated 95% confidence interval (CI). The column all metrics means that x contains all metrics for each portfolio optimization method 3*7 = 21 in total, while the columns PEW, GMV, and MS metrics mean that the logistic regression input x contained only metrics from the specific portfolio optimization mentioned in the column header. The table demonstrates that it is possible to relate, in all cases, the outcome of the test related to structural changes between portfolios with high vs. low ESG scores and financial performance. The table shows that there is a relationship between performance and ESG scores independent of the strategy of type. The key finding is that whatever strategy is used, there will be a difference by considering ESG. 
A difference in portfolio performance thus implies a difference in the inter-asset relations measured between high- and low-ESG portfolios. Furthermore, since the AUC is consistent across sectors, portfolio construction methods, and kernel types, we deduce that there are no special cases in which ESG does not matter when one looks at inter-asset relations. We further confirmed these results by performing a support vector machine (SVM) classification to capture non-linear relations. The AUC increases in all cases, quite significantly for some, indicating that there are also further non-linear relations. The table for the SVM can be found in the S1 Appendix.
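The SVM follow-up can be sketched as below. This is again a hedged illustration, not the authors' code: the features, the non-linear toy boundary, and the dimensions are invented, and an RBF kernel is assumed since the text only says the SVM adds non-linear relations.

```python
# RBF-kernel SVM on metric-difference features, scored by cross-validated AUC.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 6))                            # low-dimensional toy features
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) > 2.0).astype(int)    # non-linear toy boundary

svm = SVC(kernel="rbf", class_weight="balanced")         # AUC scorer uses decision_function
cv = StratifiedKFold(3, shuffle=True, random_state=0)
auc = cross_val_score(svm, X, y, cv=cv, scoring="roc_auc").mean()
```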

Fig 18. PCA biplots for a few asset universes.

The orange dots indicate time points at which a rejection was made, while the other dots indicate no rejection.

https://doi.org/10.1371/journal.pone.0301804.g018

Table 5. AUC of the logistic regression trying to classify a MONK rejection when the input data were portfolio metrics.

The AUC and its 95% confidence interval were estimated using a 3-fold CV. The column all metrics means that x contains all metrics for each portfolio optimization while the columns PEW, GMV, and MS metrics mean that the logistic regression input x contained only metrics from the specific portfolio optimization mentioned in the column header.

https://doi.org/10.1371/journal.pone.0301804.t005

These findings were analyzed further by examining which features are affected by the difference in inter-asset correlations between high- and low-ESG portfolios. Table 6 shows which metrics remained in the lasso logistic regression using the 1 standard deviation rule [99]. First, we can see that some features are almost always selected: the MDD, DRatio, and Treynor metrics, independent of the portfolio construction, although the PEW features are observed most often.
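The 1 standard deviation (one-standard-error) rule [99] picks the sparsest model whose cross-validated score is within one standard error of the best. A sketch under synthetic data follows; the grid, helper names, and signal structure are our own assumptions, not the paper's.

```python
# One-standard-error rule for the lasso logistic regression penalty:
# among candidate C values, take the strongest penalty whose CV AUC is
# within one standard error of the best.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 10))
y = (X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=150) > 0).astype(int)

Cs = np.logspace(-2, 1, 12)                      # inverse penalty strengths
cv = StratifiedKFold(3, shuffle=True, random_state=0)
scores = np.array([cross_val_score(
    LogisticRegression(penalty="l1", C=C, solver="liblinear",
                       class_weight="balanced"),
    X, y, cv=cv, scoring="roc_auc") for C in Cs])
mean = scores.mean(axis=1)
se = scores.std(axis=1) / np.sqrt(scores.shape[1])

best = mean.argmax()
# 1-SD rule: smallest C (strongest penalty) still within one SE of the best AUC
ok = np.where(mean >= mean[best] - se[best])[0]
C_1sd = Cs[ok.min()]

final = LogisticRegression(penalty="l1", C=C_1sd, solver="liblinear",
                           class_weight="balanced").fit(X, y)
selected = np.flatnonzero(final.coef_)           # indices of retained metrics
```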

Table 6. The table shows which metrics remained in the lasso logistic regression model using all metrics available.

TR = Treynor, S = Sharpe, ST = Sortino, MDD = maximum drawdown, the subscript indicates which portfolio construction the metric came from.

https://doi.org/10.1371/journal.pone.0301804.t006

Fig 19 shows the normalized Hamming distance between the feature sets for each sector under the RW kernel labels. We want to see whether similar sectors cluster together. Some sectors can indeed be clustered, such as Energy with Communication Services and Real Estate with Basic Materials, although the reason for their similarity is not obvious.
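The distance underlying Fig 19 treats each sector's selected-metric set as a binary mask over the metrics; the normalized Hamming distance is then the fraction of positions where two masks disagree. A short sketch, with hypothetical selection masks:

```python
# Normalized Hamming distance between binary feature-selection vectors.
import numpy as np

def normalized_hamming(a, b):
    """Fraction of coordinates where two binary selection vectors differ."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(a != b))

energy = np.array([1, 0, 1, 1, 0, 0, 1])   # hypothetical selected-metric masks
comms  = np.array([1, 0, 1, 0, 0, 0, 1])
d = normalized_hamming(energy, comms)       # masks differ in 1 of 7 positions
```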

Fig 19. The normalized Hamming distance between the feature sets for each sector on the RW kernel labels.

https://doi.org/10.1371/journal.pone.0301804.g019

8 Conclusion

In this manuscript, we took a novel approach to studying and quantifying differences in inter-asset relations and graph structures between portfolios with different ESG profiles. We used graph kernels to quantify these differences and identified differences in the analysis introduced by different graph kernels, both in a simulation study and in a real portfolio optimization scenario. We further introduced a robust estimator for such tests, which is of great importance for the financial literature. We showed empirically that investing in a good ESG investment fund (PEW portfolio) does outperform its poor ESG counterpart for a global asset universe and for most sectors. However, once a max Sharpe optimization is allowed, a poor ESG portfolio can outperform its good ESG counterpart; this is, for example, the case for the global asset universe. We note, however, that this is the result of an aggregated study over a 5-year history. Finally, we showed that, given the kernel MMD rejections, it is possible to use the difference between portfolio performance metrics to predict the rejection decision at a given time with high accuracy. Although this relationship is not obvious, a lasso regression suggested that the MDD, DRatio, and Treynor metrics are the most informative, and a non-linear classification gave an even higher AUC.

Supporting information

S1 Appendix. File containing pseudo code of algorithms, additional information on the labelled random walk kernel, and synthetic experiments.

https://doi.org/10.1371/journal.pone.0301804.s001

(PDF)

Acknowledgments

We acknowledge the support of Professor Dimitris Christopoulos and Dr. George Tzougas. This work was supported by Edinburgh Business School, Heriot-Watt University.

References

  1. Clément A, Robinot É, Trespeuch L. Improving ESG Scores with Sustainability Concepts. Sustainability. 2022;14(20).
  2. Kotsantonis S, Serafeim G. Four Things No One Will Tell You About ESG Data. Journal of Applied Corporate Finance. 2019;31(2):50–58.
  3. Abhayawansa S, Adams CA, Neesham C. Accountability and governance in pursuit of Sustainable Development Goals: conceptualising how governments create value. Accounting, Auditing & Accountability Journal. 2021;34(4):923–945.
  4. Duque-Grisales E, Aguilera-Caracuel J. Environmental, Social and Governance (ESG) Scores and Financial Performance of Multilatinas: Moderating Effects of Geographic International Diversification and Financial Slack. Journal of Business Ethics. 2021;168:315–334.
  5. Pagano MS, Sinclair G, Yang T. Chapter 18: Understanding ESG ratings and ESG indexes. Cheltenham, UK: Edward Elgar Publishing; 2018. Available from: https://www.elgaronline.com/view/edcoll/9781786432629/9781786432629.00027.xml.
  6. Cornell B. ESG preferences, risk and return. European Financial Management. 2021;27(1):12–19.
  7. Giese G, Lee LE, Melas D, Nagy Z, Nishikawa L. Foundations of ESG Investing: How ESG Affects Equity Valuation, Risk, and Performance. The Journal of Portfolio Management. 2019;45(5):69–83.
  8. Lioui A, Tarelli A. Chasing the ESG factor. Journal of Banking & Finance. 2022;139:106498.
  9. Gregory A, Tharyan R, Whittaker J. Corporate Social Responsibility and Firm Value: Disaggregating the Effects on Cash Flow, Risk and Growth. Journal of Business Ethics. 2014;124:633–657.
  10. Godfrey PC, Merrill CB, Hansen JM. The relationship between corporate social responsibility and shareholder value: an empirical test of the risk management hypothesis. Strategic Management Journal. 2009;30(4):425–445.
  11. Jo H, Na H. Does CSR Reduce Firm Risk? Evidence from Controversial Industry Sectors. Journal of Business Ethics. 2012;110:441–456.
  12. Oikonomou I, Brooks C, Pavelin S. The Impact of Corporate Social Performance on Financial Risk and Utility: A Longitudinal Analysis. Financial Management. 2012;41(2):483–515.
  13. Hong H, Kacperczyk M. The price of sin: The effects of social norms on markets. Journal of Financial Economics. 2009;93(1):15–36.
  14. Sautner Z, Starks LT. ESG and Downside Risks: Implications for Pension Funds. Wharton Pension Research Council Working Paper. 2021;(2021-10).
  15. El Ghoul S, Guedhami O, Kwok CC, Mishra DR. Does corporate social responsibility affect the cost of capital? Journal of Banking & Finance. 2011;35(9):2388–2406.
  16. Dimson E, Marsh P, Staunton M. Divergent ESG Ratings. The Journal of Portfolio Management. 2020;47(1):75–87.
  17. Harvey CR, Liu Y, Zhu H. … and the cross-section of expected returns. The Review of Financial Studies. 2016;29(1):5–68.
  18. Gibson Brandon R, Krueger P, Schmidt PS. ESG rating disagreement and stock returns. Financial Analysts Journal. 2021;77(4):104–127.
  19. Shafer M, Szado E. Environmental, social, and governance practices and perceived tail risk. Accounting and Finance. 2019;60(4):4195–4224.
  20. Bax K, Sahin Ö, Czado C, Paterlini S. ESG, Risk, and (tail) dependence. arXiv preprint arXiv:2105.07248. 2021.
  21. Djoutsa Wamba L, Sahut JM, Braune E, Teulon F. Does the optimization of a company’s environmental performance reduce its systematic risk? New evidence from European listed companies. Corporate Social Responsibility and Environmental Management. 2020;27(4):1677–1694.
  22. Chan PT, Walter T. Investment performance of “environmentally-friendly” firms and their initial public offers and seasoned equity offers. Journal of Banking & Finance. 2014;44:177–188.
  23. Görgen M, Jacob A, Nerlinger M. Get Green or Die Trying? Carbon Risk Integration into Portfolio Management. The Journal of Portfolio Management. 2021;47(3):77–93.
  24. Roncalli T, Guenedal TL, Lepetit F, Roncalli T, Sekine T. The Market Measure of Carbon Risk and its Impact on the Minimum Variance Portfolio. The Journal of Portfolio Management. 2021;47(9):54–68.
  25. Lucia CD, Pazienza P, Bartlett M. Does Good ESG Lead to Better Financial Performances by Firms? Machine Learning and Logistic Regression Models of Public Enterprises in Europe. Sustainability. 2020;12(13):5317.
  26. Friede G. Why don’t we see more action? A metasynthesis of the investor impediments to integrate environmental, social, and governance factors. Business Strategy and the Environment. 2019;28(6):1260–1282.
  27. Siglidis G, Nikolentzos G, Limnios S, Giatsidis C, Skianis K, Vazirgiannis M. GraKeL: A Graph Kernel Library in Python. Journal of Machine Learning Research. 2020;21(54):1–5.
  28. Togninalli M, Ghisu E, Llinares-López F, Rieck B, Borgwardt K. Wasserstein Weisfeiler-Lehman Graph Kernels. In: NIPS’19. Red Hook, NY, USA: Curran Associates Inc.; 2019. Available from: https://proceedings.neurips.cc/paper/2019/file/73fed7fd472e502d8908794430511f4d-Paper.pdf.
  29. Kang U, Tong H, Sun J. Fast Random Walk Graph Kernel. In: Proceedings of the 2012 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics; 2012. Available from: https://doi.org/10.1137/1.9781611972825.71.
  30. Lerasle M, Szabo Z, Mathieu T, Lecue G. MONK: Outlier-Robust Mean Embedding Estimation by Median-of-Means. In: Chaudhuri K, Salakhutdinov R, editors. Proceedings of the 36th International Conference on Machine Learning. vol. 97. PMLR; 2019. p. 3782–3793. Available from: https://proceedings.mlr.press/v97/lerasle19a.html.
  31. Zhao T, Liu H, Roeder K, Lafferty J, Wasserman L. The huge Package for High-dimensional Undirected Graph Estimation in R. Journal of Machine Learning Research. 2012;13(37):1059–1062. pmid:26834510
  32. Ferreira E, Orbe S, Ascorbebeitia J, Álvarez Pereira B, Estrada E. Loss of structural balance in stock markets. Scientific Reports. 2021;11:12230. pmid:34108544
  33. de Miranda Cardoso JV, Palomar DP. Learning Undirected Graphs in Financial Markets. In: 2020 54th Asilomar Conference on Signals, Systems, and Computers. IEEE; 2020. Available from: https://doi.org/10.1109/ieeeconf51394.2020.9443573.
  34. Gouvêa AM, Vega-Oliveros DA, Cotacallapa M, Ferreira LN, Macau EE, Quiles MG. Dynamic community detection into analyzing of wildfires events. In: International Conference on Computational Science and Its Applications. Springer; 2020. p. 1032–1047. Available from: https://link.springer.com/chapter/10.1007/978-3-030-58799-4_74.
  35. Dhiman A, Jain SK. Optimizing Frequent Subgraph Mining for Single Large Graph. Procedia Computer Science. 2016;89:378–385.
  36. Kumar S, Ying J, de M Cardoso JV, Palomar DP. A Unified Framework for Structured Graph Learning via Spectral Constraints. Journal of Machine Learning Research. 2020;21(22):1–60.
  37. Hallac D, Park Y, Boyd S, Leskovec J. Network Inference via the Time-Varying Graphical Lasso. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD’17. ACM; 2017. p. 205–213. Available from: https://doi.org/10.1145/3097983.3098037.
  38. Finegold MA, Drton M. Robust Graphical Modeling with T-Distributions. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. UAI’09. AUAI Press; 2009. p. 169–176. Available from: http://www.auai.org/uai2009/papers/UAI2009_0120_91e7a49300db94dabcef290d622ebdb2.pdf.
  39. Bien J, Tibshirani RJ. Sparse estimation of a covariance matrix. Biometrika. 2011;98(4):807–820. pmid:23049130
  40. Kojaku S, Masuda N. Constructing networks by filtering correlation matrices: a null model approach. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2019;475(2231):20190578. pmid:31824228
  41. de Miranda Cardoso JV, Ying J, Palomar D. Graphical Models in Heavy-Tailed Markets. In: Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Vaughan JW, editors. Advances in Neural Information Processing Systems. vol. 34. Curran Associates, Inc.; 2021. p. 19989–20001. Available from: https://proceedings.neurips.cc/paper/2021/file/a64a034c3cb8eac64eb46ea474902797-Paper.pdf.
  42. Kenett DY, Tumminello M, Madi A, Gur-Gershgoren G, Mantegna RN, Ben-Jacob E. Dominating Clasp of the Financial Sector Revealed by Partial Correlation Analysis of the Stock Market. PLoS ONE. 2010;5(12):e15032. pmid:21188140
  43. Fallani FDV, Latora V, Chavez M. A Topological Criterion for Filtering Information in Complex Brain Networks. PLOS Computational Biology. 2017;13(1):e1005305.
  44. Ghoshdastidar D, von Luxburg U. Practical Methods for Graph Two-Sample Testing. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, editors. Advances in Neural Information Processing Systems. vol. 31. Curran Associates, Inc.; 2018. Available from: https://proceedings.neurips.cc/paper/2018/file/dfa92d8f817e5b08fcaafb50d03763cf-Paper.pdf.
  45. Lovato I, Pini A, Stamm A, Vantini S. Model-Free Two-Sample Test for Network-Valued Data. Computational Statistics & Data Analysis. 2020;144(C).
  46. Ghoshdastidar D, Gutzeit M, Carpentier A, von Luxburg U. Two-sample tests for large random graphs using network statistics. In: Kale S, Shamir O, editors. Conference on Learning Theory. vol. 64. PMLR; 2017. p. 954–977. Available from: https://www.semanticscholar.org/paper/Two-Sample-Tests-for-Large-Random-Graphs-Using-Ghoshdastidar-Gutzeit/8a154c1188a73accd2944ee6da73d98c99d2929c.
  47. Yuan M, Wen Q. A practical two-sample test for weighted random graphs. Journal of Applied Statistics. 2021;p. 1–17. pmid:36819081
  48. Chen H, Friedman JH. A New Graph-Based Two-Sample Test for Multivariate and Object Data. Journal of the American Statistical Association. 2017;112(517):397–409.
  49. Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola A. A Kernel Two-Sample Test. Journal of Machine Learning Research. 2012;13(25):723–773.
  50. Chwialkowski K, Sejdinovic D, Gretton A. A Wild Bootstrap for Degenerate Kernel Tests. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2. NIPS’14. Cambridge, MA, USA: MIT Press; 2014. p. 3608–3616. Available from: https://dl.acm.org/doi/abs/10.5555/2969033.2969229.
  51. Gretton A, Fukumizu K, Harchaoui Z, Sriperumbudur BK. A Fast, Consistent Kernel Two-Sample Test. In: Bengio Y, Schuurmans D, Lafferty J, Williams C, Culotta A, editors. Advances in Neural Information Processing Systems. vol. 22. Curran Associates, Inc.; 2009. Available from: https://proceedings.neurips.cc/paper/2009/file/9246444d94f081e3549803b928260f56-Paper.pdf.
  52. Sutherland DJ, Tung H, Strathmann H, De S, Ramdas A, Smola AJ, et al. Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net; 2017. Available from: https://openreview.net/forum?id=HJWHIKqgl.
  53. Lloyd JR, Ghahramani Z. Statistical Model Criticism using Kernel Two Sample Tests. In: Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R, editors. Advances in Neural Information Processing Systems. vol. 28. Curran Associates, Inc.; 2015. Available from: https://proceedings.neurips.cc/paper/2015/file/0fcbc61acd0479dc77e3cccc0f5ffca7-Paper.pdf.
  54. Laumann F, von Kügelgen J, Barahona M. Kernel Two-Sample and Independence Tests for Nonstationary Random Processes. In: The 7th International Conference on Time Series and Forecasting. MDPI; 2021. Available from: https://doi.org/10.3390/engproc2021005031.
  55. Gretton A, Herbrich R, Smola A, Bousquet O, Schölkopf B. Kernel Methods for Measuring Independence. Journal of Machine Learning Research. 2005;6(70):2075–2129.
  56. Fukumizu K, Gretton A, Sun X, Schölkopf B. Kernel Measures of Conditional Dependence. In: Platt J, Koller D, Singer Y, Roweis S, editors. Advances in Neural Information Processing Systems. vol. 20. Curran Associates, Inc.; 2007. Available from: https://proceedings.neurips.cc/paper/2007/file/3a0772443a0739141292a5429b952fe6-Paper.pdf.
  57. Kriege NM, Johansson FD, Morris C. A survey on graph kernels. vol. 5; 2020. Available from: https://doi.org/10.1007/s41109-019-0195-3.
  58. Nikolentzos G, Siglidis G, Vazirgiannis M. Graph Kernels: A Survey. J Artif Int Res. 2022;72:943–1027.
  59. Haussler D. Convolution kernels on discrete structures. Computer Science Dept., UC Santa Cruz; 1999. UCSC-CRL-99-10.
  60. Shervashidze N, Borgwardt K. Fast subtree kernels on graphs. In: Bengio Y, Schuurmans D, Lafferty J, Williams C, Culotta A, editors. Advances in Neural Information Processing Systems. vol. 22. Curran Associates, Inc.; 2009. Available from: https://proceedings.neurips.cc/paper/2009/file/0a49e3c3a03ebde64f85c0bacd8a08e2-Paper.pdf.
  61. Shervashidze N, Schweitzer P, van Leeuwen EJ, Mehlhorn K, Borgwardt KM. Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research. 2011;12(77):2539–2561.
  62. Kashima H, Tsuda K, Inokuchi A. Marginalized Kernels between Labeled Graphs. In: Proceedings of the Twentieth International Conference on Machine Learning. ICML’03. AAAI Press; 2003. p. 321–328. Available from: https://www.aaai.org/Papers/ICML/2003/ICML03-044.pdf.
  63. Vishwanathan SVN, Schraudolph NN, Kondor R, Borgwardt KM. Graph Kernels. Journal of Machine Learning Research. 2010;11(40):1201–1242.
  64. Gärtner T, Flach P, Wrobel S. On Graph Kernels: Hardness Results and Efficient Alternatives. In: Schölkopf B, Warmuth MK, editors. Learning Theory and Kernel Machines. Berlin, Heidelberg: Springer Berlin Heidelberg; 2003. p. 129–143. Available from: https://doi.org/10.1007/978-3-540-45167-9_11.
  65. Borgwardt KM, Kriegel H. Shortest-Path Kernels on Graphs. In: Fifth IEEE International Conference on Data Mining. ICDM’05. IEEE; 2005. p. 74–81. Available from: https://doi.org/10.1109/icdm.2005.132.
  66. Neumann M, Garnett R, Bauckhage C, Kersting K. Propagation kernels: efficient graph kernels from propagated information. Machine Learning. 2016;102(2):209–245.
  67. Grauman K, Darrell T. The Pyramid Match Kernel: Efficient Learning with Sets of Features. Journal of Machine Learning Research. 2007;8(26):725–760.
  68. Nikolentzos G, Meladianos P, Vazirgiannis M. Matching Node Embeddings for Graph Similarity. Proceedings of the AAAI Conference on Artificial Intelligence. 2017;31(1).
  69. Kriege NM, Giscard PL, Wilson RC. On Valid Optimal Assignment Kernels and Applications to Graph Classification. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS’16. Red Hook, NY, USA: Curran Associates Inc.; 2016. p. 1623–1631. Available from: https://proceedings.neurips.cc/paper/2016/file/0efe32849d230d7f53049ddc4a4b0c60-Paper.pdf.
  70. Hofmann T, Schölkopf B, Smola AJ. Kernel methods in machine learning. The Annals of Statistics. 2008;36(3):1171–1220.
  71. Shawe-Taylor J, Cristianini N. Kernel Methods for Pattern Analysis. Cambridge: Cambridge University Press; 2004. Available from: https://doi.org/10.1017/CBO9780511809682.
  72. Kreyszig E. Introductory functional analysis with applications. vol. 17. John Wiley & Sons; 1991.
  73. Smola A, Gretton A, Song L, Schölkopf B. A Hilbert Space Embedding for Distributions. In: Hutter M, Servedio RA, Takimoto E, editors. Algorithmic Learning Theory. Berlin, Heidelberg: Springer Berlin Heidelberg; 2007. p. 13–31. Available from: https://doi.org/10.1007/978-3-540-75225-7_5.
  74. Schölkopf B, Herbrich R, Smola AJ. A Generalized Representer Theorem. In: Helmbold D, Williamson B, editors. Computational Learning Theory. Berlin, Heidelberg: Springer Berlin Heidelberg; 2001. p. 416–426. Available from: https://doi.org/10.1007/3-540-44581-1_27.
  75. Piegorsch WW, Casella G. Erratum: Inverting a Sum of Matrices. SIAM Review. 1990;32(3):470.
  76. Drineas P, Mahoney MW. On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. Journal of Machine Learning Research. 2005;6(72):2153–2175.
  77. Nikolentzos G, Vazirgiannis M. Random Walk Graph Neural Networks. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, editors. Advances in Neural Information Processing Systems. vol. 33. Curran Associates, Inc.; 2020. p. 16211–16222. Available from: https://proceedings.neurips.cc/paper/2020/file/ba95d78a7c942571185308775a97a3a0-Paper.pdf.
  78. Tufvesson O, Lindström J, Lindström E. Spatial statistical modelling of insurance risk: a spatial epidemiological approach to car insurance. Scandinavian Actuarial Journal. 2019;2019(6):508–522.
  79. Dijkstra EW. A note on two problems in connexion with graphs. Numerische Mathematik. 1959;1(1):269–271.
  80. Floyd RW. Algorithm 97: Shortest path. Communications of the ACM. 1962;5(6):345.
  81. Vert JP. The optimal assignment kernel is not positive definite; 2008. Available from: https://hal.archives-ouvertes.fr/hal-00218278.
  82. Zhu X, Ghahramani Z. Learning from labeled and unlabeled data with label propagation. Carnegie Mellon University; 2002. CMU-CALD-02-107. Available from: http://mlg.eng.cam.ac.uk/zoubin/papers/CMU-CALD-02-107.pdf.
  83. Hua H, Hovestadt L. p-adic numbers encode complex networks. Scientific Reports. 2021;11(1). pmid:33420128
  84. Zambrano-Luna BA, Zúñiga-Galindo WA. p-adic Cellular Neural Networks. Journal of Nonlinear Mathematical Physics. 2022;30(1):34–70.
  85. Khrennikov AY, Nilson M. P-Adic Neural Networks. Springer Netherlands; 2004. p. 123–153. Available from: http://dx.doi.org/10.1007/978-1-4020-2660-7_8.
  86. Zúñiga-Galindo WA, He C, Zambrano-Luna BA. p-Adic statistical field theory and convolutional deep Boltzmann machines. Progress of Theoretical and Experimental Physics. 2023;2023(6).
  87. Newman M. Networks. Oxford University Press; 2018. Available from: https://doi.org/10.1093/oso/9780198805090.001.0001.
  88. Cohen R, Havlin S. Scale-Free Networks Are Ultrasmall. Phys Rev Lett. 2003;90:058701. pmid:12633404
  89. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2007;9(3):432–441. pmid:18079126
  90. Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. The MIT Press; 2005. Available from: https://doi.org/10.7551/mitpress/3206.001.0001.
  91. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.
  92. Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise coordinate optimization. The Annals of Applied Statistics. 2007;1(2):302–332.
  93. Banerjee O, El Ghaoui L, d’Aspremont A. Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. J Mach Learn Res. 2008;9:485–516.
  94. Orzechowski P, Moore JH. EBIC: A Scalable Biclustering Method for Large Scale Data Analysis. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion. GECCO’19. New York, NY, USA: Association for Computing Machinery; 2019. p. 31–32. Available from: https://doi.org/10.1145/3319619.3326762.
  95. Liu H, Lafferty J, Wasserman L. The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs. Journal of Machine Learning Research. 2009;10(80):2295–2328.
  96. Markowitz HM. Portfolio Selection: Efficient Diversification of Investments. Yale University Press; 1959. Available from: http://www.jstor.org/stable/j.ctt1bh4c8h.
  97. Markowitz HM. Portfolio Selection. The Journal of Finance. 1952;7(1):77–91.
  98. Marupanthorn P, Sklibosios Nikitopoulos C, Ofosu-Hene E, Peters G, Richards KA. Mechanisms to Incentivise Fossil Fuel Divestment and Implications on Portfolio Risk and Returns. 2022.
  99. Murphy KP. Machine learning: a probabilistic perspective. Cambridge, MA: MIT Press; 2013.