Predicting corporate credit risk: Network contagion via trade credit

Trade credit is a payment extension granted by a selling firm to its customer. Companies typically respond to late payments from their customers by delaying payments to their own suppliers, thus generating a ripple through the transaction network. Therefore, trade credit is a potential vehicle for the propagation of losses in case of default events. The goal of this work is to leverage information on trade credit among connected firms to predict imminent firm defaults. We use a unique dataset of client firms of a major Italian bank to investigate firm bankruptcy between October 2016 and March 2018. We develop a model that captures network spillover effects originating from the supply chain on each firm's probability of default via a sequential approach: the output of a first model component based on single-firm features is used in a subsequent model that captures network spillovers. While the first component follows the standard econometric approach to predicting such dynamics, the network module represents an innovative way to look at the effect of trade credit on default probability. This module examines the transaction network of the firm, as inferred from the payments transiting via the bank, in order to identify the firm's trade partners. By using several features extracted from the network of transactions, this model is able to predict a large fraction of the defaults, thus showing the value hidden in the network information. Finally, we merge firm and network features with a machine learning model to create a ‘hybrid’ model, which improves the recall for the task by almost 20 percentage points over the baseline.


Jaccard similarity
As previously stated, we generate the network at time t by aggregating the transactions between each pair of firms over the previous 12 months, i.e., from t* = t − 12 to t. However, during the exploration phase of the dataset we tested different aggregation windows of 3, 6, and 12 months, and generated the corresponding transaction networks. Our findings reveal that the difference between temporal aggregations is negligible in both the structure of the networks and the dynamics of the label propagation. To quantify the similarity between the aggregated networks, we report the Jaccard similarity index between them (Fig S1). The Jaccard index is a statistic used for comparing the similarity of two sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

J(A, B) = |A ∩ B| / |A ∪ B|.

We report the distribution of the average value of the Jaccard index over the edges of networks aggregated over 3-, 6-, and 12-month windows. In the first scenario (Fig S1, left) we performed a 1-month sliding-window aggregation: for instance, we aggregated the transactions for the months of Jul-Aug-Sep and compared them with the network of aggregated transactions for Aug-Sep-Oct, and so on. In the second scenario (Fig S1, right) we used a tumbling window: in this case we compared the months of Jul-Aug-Sep with the months of Oct-Nov-Dec. The results suggest that different temporal aggregations do not significantly affect the similarity of the networks.
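As a minimal sketch, the edge-set comparison underlying Fig S1 can be written as follows (the firm identifiers and window contents below are illustrative, not taken from the dataset):

```python
# Jaccard similarity between the edge sets of two aggregated
# transaction networks; firm pairs below are illustrative.

def jaccard(edges_a, edges_b):
    """J(A, B) = |A intersect B| / |A union B| over edge sets."""
    if not edges_a and not edges_b:
        return 1.0
    return len(edges_a & edges_b) / len(edges_a | edges_b)

# e.g., a Jul-Aug-Sep window vs. the Aug-Sep-Oct window shifted by one month
net_t1 = {("a", "b"), ("b", "c"), ("a", "c")}
net_t2 = {("a", "b"), ("b", "c"), ("c", "d")}
print(jaccard(net_t1, net_t2))  # 2 shared edges out of 4 -> 0.5
```

The same function applies unchanged to the tumbling-window comparison; only the pair of windows fed to it differs.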

Spatial autocorrelation
We also analyzed the spatial autocorrelation of some variables using different temporal aggregations. Spatial autocorrelation pertains to the non-random pattern of attribute values over a set of spatial units (the network, in our case). This pattern can take two general forms: positive autocorrelation, which reflects value similarity in space, and negative autocorrelation, or value dissimilarity in space. In either case, autocorrelation arises when the observed spatial pattern differs from what would be expected under a random process operating in space. Spatial autocorrelation can be analyzed from two different perspectives. Global autocorrelation analysis involves the study of the entire map pattern and generally asks whether the pattern displays clustering. Local autocorrelation, on the other hand, shifts the focus within the global pattern to identify clusters, or so-called 'hot spots', that may either drive the overall clustering pattern or reflect heterogeneities that depart from global patterns.
In what follows, we measure global spatial autocorrelation using Moran's I and Geary's C, as implemented in the PySAL library.

Moran's I. Moran's I measures the global spatial autocorrelation in an attribute y measured over n spatial units and is given as

I = (n / S_0) · (Σ_i Σ_j w_{i,j} z_i z_j) / (Σ_i z_i²),

where w_{i,j} is a spatial weight, z_i = y_i − ȳ, and S_0 = Σ_i Σ_j w_{i,j}.
Geary's C. Geary's C is a similar measure, defined as

C = ((n − 1) Σ_i Σ_j w_{i,j} (y_i − y_j)²) / (2 S_0 Σ_i z_i²).

We consider networks aggregated over different time windows (3, 6, and 12 months) and different metrics to analyze the spatial autocorrelation: deviance residuals (Rd) and Pearson residuals (Rp). In Fig S2 we show an example of the p-values of Moran's I for the deviance and Pearson residuals, for networks whose edge weights correspond to the fraction of revenues between adjacent nodes. The most relevant result is that the temporal aggregation does not have a relevant effect on the spatial autocorrelation.
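Both statistics can be computed directly from their definitions. The sketch below uses a dense NumPy weight matrix in place of PySAL's spatial-weights object, and the 4-node path graph with an alternating attribute is an illustrative example, not data from the paper:

```python
import numpy as np

# Direct implementations of the two global autocorrelation statistics
# from their definitions (dense weight matrix; illustrative only).

def morans_i(y, w):
    """I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, z_i = y_i - mean(y)."""
    y = np.asarray(y, dtype=float)
    z = y - y.mean()
    n, s0 = len(y), w.sum()
    return (n / s0) * (z @ w @ z) / (z @ z)

def gearys_c(y, w):
    """C = (n - 1) * sum_ij w_ij (y_i - y_j)^2 / (2 S0 sum_i z_i^2)."""
    y = np.asarray(y, dtype=float)
    z = y - y.mean()
    n, s0 = len(y), w.sum()
    diff2 = (y[:, None] - y[None, :]) ** 2
    return (n - 1) * (w * diff2).sum() / (2 * s0 * (z @ z))

# A perfectly alternating attribute on a 4-node path graph: strong
# negative autocorrelation (I = -1), dissimilar neighbors (C > 1).
w_path = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
                   [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
print(morans_i([1, 0, 1, 0], w_path))  # -1.0
print(gearys_c([1, 0, 1, 0], w_path))  # 1.5
```

In practice PySAL's implementations also provide p-values (as used for Fig S2) via analytical moments or permutation tests, which this sketch omits.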

Cascade analysis
For this analysis, we draw inspiration from the seminal work on information cascades by Kempe et al. [1] to model the default contagion process in our inter-firm network. In particular, we assume the contagion follows an independent cascade process on the transaction network. When a node u becomes active (i.e., experiences a default), it has a single chance of activating each inactive neighbor v, with some probability p_{uv}. Given a set of activations, the inference problem requires reconstructing the most likely diffusion path for the cascade. However, here we are interested in analyzing all possible diffusion paths, which represent an upper bound on the information content of the network. For example, consider a simple network composed of three nodes u, v, w, as in Fig S3. Both u and v activate at time t_1, while w activates at time t_2, with t_1 < t_2. Node w has incoming edges from both u and v. According to the independent cascade model, both u and v may be responsible for the activation of w. Thus, we define a potential cascade (henceforth simply referred to as a cascade) as the union of all possible diffusion paths in the graph. Each edge in the cascade is a possible diffusion vector for the default contagion process. Note that each cascade is a directed acyclic graph (DAG), as the arrow of time allows us to break possible cycles. We also assume that if two firms activate at the same timestamp, they do not influence each other; that is, diffusion can only happen from an activation strictly in the past. Given that the temporal dynamic of the process is not known in advance, we do not impose any limit on the time delay between two activations (i.e., once a node is active, it remains contagious for the rest of the timeline). We find 595 distinct (disconnected) cascades by this process, for a total of 4340 nodes and 4499 edges. Figures S4 and S5 show the distributions of the number of nodes and edges for the set of cascades inferred from the dataset.
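The edge-filtering rule that defines a potential cascade can be sketched as follows (node names, edges, and timestamps are illustrative; the three-node example mirrors the Fig S3 discussion):

```python
# Sketch: keep a transaction edge (u, v) as a possible diffusion vector
# iff both firms activated and u activated strictly before v. The
# strict inequality breaks cycles (the result is a DAG) and discards
# same-timestamp pairs, which cannot influence each other.

def build_cascade_edges(edges, activation):
    """edges: directed (u, v) pairs; activation: node -> activation time."""
    return [
        (u, v) for (u, v) in edges
        if u in activation and v in activation
        and activation[u] < activation[v]
    ]

edges = [("u", "w"), ("v", "w"), ("w", "u")]
activation = {"u": 1, "v": 1, "w": 2}  # u, v at t1; w at t2 > t1
print(build_cascade_edges(edges, activation))
# -> [('u', 'w'), ('v', 'w')]; ('w', 'u') is dropped by the arrow of time
```

Connected components of the resulting edge set would then give the 595 distinct cascades reported above.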
The distributions are extremely skewed, with the largest cascade dwarfing all the others; therefore, we also show the same graphs with the largest cascade excluded. Most of the cascades are rather small, with the bulk consisting of only two nodes and a single edge.
We define the depth of a cascade as the maximum shortest-path distance between the root node of the DAG and the rest of the nodes. The depth of a cascade gives an indication of how far (i.e., how many hops in the graph) the diffusion process can reach. The distribution of the depth of the inferred cascades is shown in Fig S6a and summarized in Fig S6b. Apart from the largest cascade, which has depth 9, all the other cascades have a maximum depth of 3, with most having a depth of 1. Therefore, the diffusion process as modeled seems to be fairly localized. We now look at the time difference between two activations in the cascades (δt) to get an idea of the possible temporal dynamics of the diffusion process. Fig S7 shows the distribution of the difference between the activation times for each pair of nodes connected by an edge in the cascades. We observe a mostly linear decay over the ≈2-year observation window: it would seem that most contagions happen within a short time interval. However, upon closer inspection, the linear decay in probability is actually compatible with a window effect. To observe an edge in the cascade, both activations must fall within the observation window, and the larger the time delta between the two, the lower the likelihood that both fall within the window (p ∝ (W − δt)/W, where W is the size of the observation window). Thus, we do not see any decaying effect of time on the contagion probability in the dataset under study; the horizon of the temporal dynamic of the process is possibly much larger than 2 years. As we show in the next section, experiments with a prediction feature based on this analysis confirm the absence of linear temporal decay in the contagion process.
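The window effect can be checked with a quick null-model simulation (the 24-month window and sample size are illustrative): if activation times are drawn independently and uniformly within an observation window W, with no temporal decay in the process at all, the observed δt distribution already decays linearly, since p ∝ (W − δt)/W.

```python
import random

# Null model for the delta-t distribution: two activation times drawn
# independently and uniformly in a window W. There is NO temporal decay
# in this process, yet the gap distribution decays linearly in delta-t.
random.seed(0)
W = 24  # months, roughly matching the ~2-year observation window
gaps = [abs(random.uniform(0, W) - random.uniform(0, W))
        for _ in range(100_000)]

# Under this null model, P(gap < W/2) = 1 - (1/2)^2 = 0.75.
frac_small = sum(g < W / 2 for g in gaps) / len(gaps)
print(round(frac_small, 2))  # close to 0.75
```

A δt histogram matching this null model, as in Fig S7, is therefore uninformative about any intrinsic temporal decay of the contagion.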
Finally, we wish to understand the upper bounds of a cascade-like model in a prediction setting. To do so, we build a simple prediction oracle. The oracle works as follows: for each firm in the dataset, if at timestamp t the firm has target variable Y = 1 and is present in a cascade with in-degree larger than zero, then the oracle predicts Y = 1; otherwise, it predicts Y = 0. This condition captures the fact that the network-based oracle can foresee with certainty the contagions before they happen, but cannot foresee the first activation. Fig S8a shows the performance of the oracle compared with a simple random guess, which just sorts the firms randomly. The average recall at K (R@K, K = 5%) for the random baseline is around 4%, while the oracle achieves an average of 14%. This marked improvement represents an upper bound on how much we can expect from a purely contagion-based model. Fig S8b shows the results for a similar setting, where the baseline is a logistic regression (LR) model trained only on firm features, and the oracle enhances the predictions for the firms present in a cascade with in-degree larger than zero. The average R@K for the baseline LR model is 53%, while the oracle averages 59%. This result confirms that the network information has the potential to improve performance on the default prediction task.
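A toy sketch of the R@K evaluation used in the oracle comparison (the firm identifiers, labels, cascade membership, and K below are illustrative values, not the paper's data):

```python
# Toy recall-at-K evaluation: rank firms by score, take the top K
# fraction, and measure the share of all true defaults recovered.

def recall_at_k(scores, labels, k_frac=0.05):
    """Fraction of all positives found among the top-K scored firms."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    k = max(1, int(len(scores) * k_frac))
    hits = sum(labels[i] for i in order[:k])
    positives = sum(labels)
    return hits / positives if positives else 0.0

# Oracle scoring: 1 for firms present in a cascade with in-degree > 0.
in_cascade_with_indegree = {"f1", "f3"}
firms = ["f1", "f2", "f3", "f4"]
labels = [1, 0, 1, 0]  # Y = 1 marks a default at timestamp t
oracle_scores = [1.0 if f in in_cascade_with_indegree else 0.0 for f in firms]
print(recall_at_k(oracle_scores, labels, k_frac=0.5))  # -> 1.0
```

The random baseline corresponds to scoring firms with random values, and the LR variant to replacing the 0/1 oracle scores with model probabilities boosted for firms in a cascade.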

Financial statement items
Tables S1 and S2 report a list of the financial statement items available in the dataset during the development of our models.