Mutual information: Measuring nonlinear dependence in longitudinal epidemiological data

Given a large clinical database of longitudinal patient information including many covariates, it is computationally prohibitive to consider all types of interdependence between patient variables of interest. This challenge motivates the use of mutual information (MI), a statistical summary of data interdependence with appealing properties that make it a suitable alternative or addition to correlation for identifying relationships in data. MI: (i) captures all types of dependence, both linear and nonlinear, (ii) is zero only when random variables are independent, (iii) serves as a measure of relationship strength (similar to but more general than $R^2$), and (iv) is interpreted the same way for numerical and categorical data. Unfortunately, MI typically receives little to no attention in introductory statistics courses and is more difficult than correlation to estimate from data. In this article, we motivate the use of MI in the analysis of epidemiologic data, while providing a general introduction to estimation and interpretation. We illustrate its utility through a retrospective study relating intraoperative heart rate (HR) and mean arterial pressure (MAP). We: (i) show postoperative mortality is associated with decreased MI between HR and MAP and (ii) improve existing postoperative mortality risk assessment by including MI and additional hemodynamic statistics.

Table 1. Probability table for the mutual information example with discrete distributions.

             X = -1   X = 0   X = 1   P(Y = k)
  Y = 0        0       1/3      0       1/3
  Y = 1       1/3       0      1/3      2/3
  P(X = j)    1/3      1/3     1/3       1

When the natural logarithm is used in place of the base-2 logarithm, information is measured in units of nats (rather than bits). The change of base formula for logs enables a straightforward rescaling of information estimated in nats to its corresponding value in bits. We define the entropy of X to be the mean information content
$$H(X) = -\sum_j p_j \log_2(p_j).$$
Similar definitions hold for $H(Y)$ and $H(X,Y)$. The mutual information of X and Y is
$$I(X,Y) = \sum_{j,k} p_{jk} \log_2\!\left(\frac{p_{jk}}{p_j p_k}\right) = H(X) + H(Y) - H(X,Y). \qquad (1)$$
One can show $I(X,Y) \geq 0$ and that $I(X,Y) = 0$ if and only if $p_{jk} = p_j p_k$ for all $j,k$, which is to say when X and Y are independent. The equivalence of independence and zero mutual information is a key property illustrated in the example below.
The average amount of shared information cannot exceed the average amount of information contained in observations of X or Y, so $I(X,Y)$ can be no greater than either $H(X)$ or $H(Y)$. This natural restriction allows one to define the unitless uncertainty coefficients
$$C_{XY} = \frac{I(X,Y)}{H(Y)}, \qquad C_{YX} = \frac{I(X,Y)}{H(X)}, \qquad (2)$$
where $C_{XY}$ ($C_{YX}$) indicates the proportion of information of Y (X) which is shared with X (Y), so that $C_{XY}$ ($C_{YX}$) is zero if and only if X and Y are independent and one if and only if Y (X) is a deterministic function of X (Y).

Example
Let X be a uniformly distributed random variable with $P(X = j) = 1/3$ for $j = -1, 0, 1$, and let $Y = X^2$. The joint and marginal distributions of X and Y are shown in Table 1.
It is straightforward to show that $E[XY] = E[X^3] = 0$ and $\mathrm{Corr}(X,Y) = 0$. Conversely, the joint and marginal entropies of X and Y are
$$H(X) = \log_2 3, \qquad H(Y) = \log_2 3 - \tfrac{2}{3}, \qquad H(X,Y) = \log_2 3,$$
so that $I(X,Y) = H(X) + H(Y) - H(X,Y) = \log_2 3 - 2/3 \approx 0.92$ bits. Note that $H(X)$ and $H(X,Y)$ are equal.
Since Y is completely determined by X, an observation of (X, Y) provides no new information beyond the observation of X alone, which is reflected in the uncertainty coefficients:
$$C_{XY} = \frac{I(X,Y)}{H(Y)} = 1, \qquad C_{YX} = \frac{I(X,Y)}{H(X)} = 1 - \frac{2}{3\log_2 3} \approx 0.58.$$
The value $C_{XY} = 1$ indicates the deterministic dependence of Y on X.
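This example can be checked in a few lines of base R; the following is a minimal sketch, with the pmf entered directly from Table 1:

```r
# Entropies, MI, and uncertainty coefficients from the pmf in Table 1
p <- matrix(c(0,   1/3, 0,     # row Y = 0
              1/3, 0,   1/3),  # row Y = 1
            nrow = 2, byrow = TRUE)
H <- function(q) -sum(q[q > 0] * log2(q[q > 0]))  # entropy in bits
Hx <- H(colSums(p)); Hy <- H(rowSums(p)); Hxy <- H(as.vector(p))
MI <- Hx + Hy - Hxy
c(MI = MI, Cxy = MI / Hy, Cyx = MI / Hx)  # ~0.92 bits, 1, ~0.58
```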

Continuous Distributions
When switching from discrete to continuous distributions, it is natural to replace pmfs with pdfs and sums with integrals. Thus, if X and Y are continuous with joint and marginal probability densities $f(x,y)$, $f_x(x)$, and $f_y(y)$, the mutual information is
$$I(X,Y) = \int\!\!\int f(x,y)\, \log_2\!\left(\frac{f(x,y)}{f_x(x) f_y(y)}\right) dx\, dy, \qquad (3)$$
where $0 \log_2 0 \equiv 0$ if $f(x,y) = 0$. Unfortunately, the differential entropies, $h(X)$, $h(Y)$, and $h(X,Y)$, do not carry the same interpretation as their discrete counterparts. An odd but unavoidable challenge with a continuous random variable X is that a realization of X will be an irrational number with probability one; as such, one would require an infinite amount of information to record it exactly. A more detailed discussion of differential entropy can be found in [2]. Despite this limitation, the interpretation of MI remains the same for continuous data: it is a measure of shared information. Importantly, the nonnegativity of MI and its equality to zero if and only if X and Y are independent are both preserved.
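As a concrete check of Equation (3), the following base R sketch evaluates the double integral on a Riemann grid for a bivariate Gaussian pair with correlation $\rho = 0.5$, for which the true MI is known to be $-\tfrac{1}{2}\log_2(1-\rho^2) \approx 0.2075$ bits; the grid limits and resolution are our own illustrative choices:

```r
# Riemann-sum evaluation of Equation (3) for a bivariate Gaussian, rho = 0.5;
# truth is -0.5 * log2(1 - rho^2) ~ 0.2075 bits. Grid choices are illustrative.
rho <- 0.5
f <- function(x, y)               # joint bivariate Gaussian pdf
  exp(-(x^2 - 2*rho*x*y + y^2) / (2*(1 - rho^2))) / (2*pi*sqrt(1 - rho^2))
g <- Vectorize(function(x, y) {   # integrand of Equation (3)
  fxy <- f(x, y)
  if (fxy == 0) 0 else fxy * log2(fxy / (dnorm(x) * dnorm(y)))
})
xs <- seq(-6, 6, length.out = 400)
h  <- xs[2] - xs[1]               # grid spacing
sum(outer(xs, xs, g)) * h^2       # ~ 0.2075 bits
```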
Without meaningful definitions of h(X) and h(Y), it is not possible to construct uncertainty coefficients akin to those in the discrete case. This raises the natural question of how to interpret the units of MI and when a value should be considered large. There are no established cut-offs for weak, moderate, or strong relationships. Rather, we propose comparing the MI for a pair of covariates with the MIs obtained for all other pairs of variables collected in the same study population. This approach is illustrated in the intraoperative HR/MAP application considered in the manuscript. We reiterate, however, that the interpretation of mutual information as shared information between variables is unchanged. In fact, this interpretation extends to mixed data as well.

Example
Let X have a standard Gaussian distribution and, given X, suppose that Y is uniformly distributed on the interval $(-|X| - 1, |X| - 1)$. Similar to the discrete example, one can show that $E[XY] = -E[X] = 0$ and $\mathrm{Corr}(X,Y) = 0$ (conditional on X, Y has mean $-1$, so $E[XY] = E[X \cdot E[Y \mid X]] = -E[X]$). Equation (3) cannot be evaluated directly in this case, but a numerical approximation indicates that $I(X,Y) \approx 2.83$ bits.

Mutual Information: Estimation
In principle, one could combine estimates of the joint and marginal entropies to estimate the mutual information. Unfortunately, the pmfs and pdfs involved in these calculations are generally unknown and must be estimated from data. In the case of discrete data, it is natural to replace the pmfs in Equation (1) with observed frequencies. This approach is known to give biased results, which has motivated improved estimators [5]. We omit those details and instead focus on the continuous case, which is less straightforward but of greater interest for our case study of intraoperative hemodynamic data.
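For concreteness, a minimal base R sketch of the plug-in approach follows; the function name plugin_mi is our own, and the data are simulated from the distribution in Table 1:

```r
# A minimal sketch of the plug-in estimate: the pmfs in Equation (1) are
# replaced with observed relative frequencies. plugin_mi() is our own name.
plugin_mi <- function(x, y) {
  pxy <- table(x, y) / length(x)  # empirical joint pmf
  px  <- rowSums(pxy)             # empirical marginal pmf of x
  py  <- colSums(pxy)             # empirical marginal pmf of y
  terms <- pxy * log2(pxy / outer(px, py))
  sum(terms[pxy > 0])             # apply the 0 * log2(0) = 0 convention
}

# Simulated draws from the distribution in Table 1
x <- sample(c(-1, 0, 1), 5000, replace = TRUE)
y <- x^2
plugin_mi(x, y)  # near log2(3) - 2/3, with a small upward finite-sample bias
```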
There are a number of density estimation techniques for independent, continuous data, such as binning and kernel density estimators. Instead, we focus on a method adapted from a k-nearest-neighbor entropy estimation algorithm and developed specifically for mutual information [1]. Important properties of k-nearest-neighbor methods have been shown to hold for time series and longitudinal data, which is of central interest in the study of intraoperative hemodynamic data (Young and Dunson 2019, arXiv:1904.05850).
Consider a collection of N regularly sampled HR and MAP measurements, $(HR_i, MAP_i)$ for $i = 1, \ldots, N$, and fix a positive integer $k < N$. Let $r(i,k)$ be the absolute value of the difference between $HR_i$ and its kth closest HR measurement from the set of all other HR measurements. To estimate the HR density at $HR_i$, we make the following approximations based on Riemann sums:
$$\frac{k}{N} \approx \int_{HR_i - r(i,k)}^{HR_i + r(i,k)} f(u)\, du \approx 2\, r(i,k)\, f(HR_i), \qquad \hat{f}(HR_i) = \frac{e^{\psi(k)}}{2N\, r(i,k)}.$$
Here $\psi(k)$ is the digamma function: the factor $e^{\psi(k)}$ replaces k in the numerator and is incorporated to reduce bias for small values of k. Importantly, the digamma function assumes use of the natural log, resulting in MI and differential entropy estimates in nats, which we assume to be the case hereafter. By the change of base formula, one may simply divide an MI estimate in nats by $\ln(2)$ to obtain the estimate in bits.
For the differential entropy, one can then obtain the empirical estimate
$$\hat{h}(HR) = -\psi(k) + \ln(2N) + \frac{1}{N}\sum_{i=1}^{N} \ln r(i,k),$$
and a similar estimate for $\hat{h}(MAP)$ [3,4]. The estimate for the joint entropy is
$$\hat{h}(HR, MAP) = -\psi(k) + \ln(4N) + \frac{2}{N}\sum_{i=1}^{N} \ln r_{(HR,MAP)}(i,k),$$
where the joint nearest-neighbor distance $r_{(HR,MAP)}(i,k)$ is obtained by using the maximum absolute difference in either coordinate to determine the ordering of the nearest neighbors of $(HR_i, MAP_i)$. The additional factors of two in the $4N$ and $2/N$ terms appear because $(X,Y)$ is two-dimensional. Combining these estimates gives the following MI estimator:
$$\hat{I}(HR, MAP) = \hat{h}(HR) + \hat{h}(MAP) - \hat{h}(HR, MAP) = \ln N - \psi(k) + \frac{1}{N}\sum_{i=1}^{N} \ln\!\left(\frac{r_{HR}(i,k)\, r_{MAP}(i,k)}{r_{(HR,MAP)}(i,k)^2}\right),$$
where $r_{HR}(i,k)$ and $r_{MAP}(i,k)$ denote the marginal kth nearest-neighbor distances for HR and MAP, respectively.
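To make the estimator concrete, the following base R sketch implements the three entropy estimates above and combines them. The function name knn_mi and the O(N^2) distance computations are our own illustrative choices rather than code from [1]:

```r
# A base R sketch of the nearest-neighbor MI estimator described above.
# Duplicate measurements (zero distances) would need extra care in practice,
# e.g., a small amount of jitter.
knn_mi <- function(x, y, k = 20) {
  N <- length(x)
  # r(i, k): distance from observation i to its kth nearest neighbor,
  # skipping the zero self-distance (hence the k + 1 after sorting)
  kth_dist <- function(d) apply(d, 1, function(row) sort(row)[k + 1])
  dx  <- abs(outer(x, x, "-"))  # pairwise distances for x
  dy  <- abs(outer(y, y, "-"))  # pairwise distances for y
  dxy <- pmax(dx, dy)           # joint distances under the max-norm
  rx <- kth_dist(dx); ry <- kth_dist(dy); rxy <- kth_dist(dxy)
  hx  <- -digamma(k) + log(2 * N) + mean(log(rx))       # h_hat(X) in nats
  hy  <- -digamma(k) + log(2 * N) + mean(log(ry))       # h_hat(Y) in nats
  hxy <- -digamma(k) + log(4 * N) + 2 * mean(log(rxy))  # h_hat(X, Y) in nats
  (hx + hy - hxy) / log(2)      # change of base: nats to bits
}
```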

Sampling Error Comparison
Herein, we provide a study of the convergence behavior of three statistics measuring dependence: the Maximal Information Coefficient (MIC), estimated using the minerva package [7]; distance correlation (dCor), estimated using the energy package [6]; and MI estimated via nearest neighbors, hereafter referred to as the KSG (Kraskov–Stögbauer–Grassberger) method [1]. All experiments were conducted in R version 4.2.1. For each sampling distribution considered below, the mean squared error of each method is shown as a function of sample size. Across all joint distributions of (X, Y) and sample sizes, the KSG estimator of MI (black) has an MSE approximately two orders of magnitude smaller than that of MIC (red) or dCor (blue).

Independent Data
Closed form expressions for MI, MIC, and dCor under a given joint distribution are, in general, intractable, with a few exceptions such as the bivariate Gaussian distribution in the cases of MI and dCor. Empirical studies of the mean squared error of statistical estimators of these quantities are therefore somewhat limited. We begin with the case of independent variables, for which each measure is zero. We consider three cases for iid random variables X and Y: N(0, 1), Unif[0, 1], and Cauchy. In each case, 400 independent trials of N iid draws of (X, Y) are generated. For each trial, MIC, dCor, and MI are estimated from the N samples of (X, Y). The mean squared error (MSE) from zero for the MI, MIC, and dCor methods is then estimated from these results. The process is repeated for N = 100, 200, . . . , 2000 samples. Figure 1 provides graphs of the MSE as a function of the sample size N.
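A sketch of one such experiment (the independent standard Gaussian case) follows, reusing the knn_mi() sketch above together with the cited minerva and energy packages; the helper mse_at_n is our own:

```r
library(minerva)  # mine() for MIC [7]
library(energy)   # dcor() for distance correlation [6]

mse_at_n <- function(N, trials = 400, k = 20) {
  est <- replicate(trials, {
    x <- rnorm(N); y <- rnorm(N)  # independent, so every true value is 0
    c(mi = knn_mi(x, y, k), mic = mine(x, y)$MIC, dcor = dcor(x, y))
  })
  rowMeans(est^2)                 # MSE about the true value of zero
}
mse_at_n(100)  # one point on the curves in Figure 1
```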
In each case, the KSG estimator has an MSE approximately two orders of magnitude smaller than that of MIC or dCor across all sample sizes considered. Notably, KSG outperforms both methods over the range of HR and MAP sample sizes considered in the main article, thereby supporting the choice of MI as a measure of statistical dependence.

Dependent Data
It is natural to wonder whether the results of the preceding example hold in the case of dependent data. Herein, we consider the case where the random variables X, Y follow a bivariate Gaussian distribution with mean $\mu = (0, 0)$ and covariance matrix
$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$
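For reference, the mutual information of a bivariate Gaussian pair has the well-known closed form
$$I(X,Y) = -\frac{1}{2}\log_2\!\left(1 - \rho^2\right) \text{ bits},$$
so that $\rho = 0.2, 0.5, 0.8$ correspond to roughly 0.03, 0.21, and 0.74 bits, respectively; this supplies the ground truth used for the bias calculations below.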
MIC, by contrast, does not have a known population value in this setting. We consider three cases: ρ = 0.2, 0.5, 0.8. For each value of ρ, we generate N samples of (X, Y) and estimate MI, MIC, and dCor. Using 400 independent realizations of this process, empirical estimates of the mean and variance of each dependence measure are calculated. This process is repeated for sample sizes N = 100, 200, . . . , 500. The results are shown in Figure 2.
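A companion sketch for the dependent case, again reusing knn_mi() and the closed form above (ρ = 0.5 and N = 300 chosen for illustration):

```r
rho <- 0.5; N <- 300
truth <- -0.5 * log2(1 - rho^2)               # true MI in bits
est <- replicate(400, {
  x <- rnorm(N)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(N)   # bivariate Gaussian pair
  c(mi = knn_mi(x, y, 20), mic = mine(x, y)$MIC, dcor = dcor(x, y))
})
rowMeans(est)              # sample mean of each dependence measure
mean(est["mi", ]) - truth  # empirical bias of the KSG estimate (bits)
```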
All methods exhibit comparable variances across each distribution and sample size; one exception is KSG for ρ = 0.2, which has a much lower variance for small sample sizes. However, the results pertaining to the mean (and thus the sample bias) are striking. dCor and KSG have comparable bias across all sample sizes and correlations. Notably, the bias is negligible when compared with the true value for N ≥ 300. However, the sample mean of MIC is clearly decreasing for all values of N, suggesting that the finite sample bias of MIC decays much more slowly than that of KSG or dCor. Based on the preceding results, one may expect that KSG will have lower mean squared error (MSE) than MIC and dCor in this setting as well. However, lower MSE may be an artifact of smaller overall values of an estimand. For comparison, we also provide plots of the MSE and the relative error, measured as the ratio of the bias to the root mean squared error (RMSE), for each method and correlation. These summaries are shown in Fig. 3. When estimating the bias, true values for KSG and dCor were available; MIC has no known ground truth, and its limiting value was instead estimated using the mean results at the largest sample sizes. Interestingly, in all cases the relative error was an increasing function of sample size, which is a result of the bias decreasing at a slower rate than the RMSE. Importantly, KSG has the lowest relative error across all correlations and sample sizes considered.
Collectively, these results indicate that MI, as estimated using KSG, outperforms the other methods considered across a range of test scenarios in which robust estimation under moderate sample sizes is the primary concern. As such, MI is a suitable measure of dependence for the primary study of HR and MAP.

Effect of k on MI Estimation
Optimal tuning of the number of nearest neighbors for the KSG estimator is unknown. Below, we show the MI estimates for the gapminder data using k = 10 to demonstrate the impact of this choice. For reference, the k = 20 case considered in the paper suggests SBP and cholesterol have the strongest association, followed by BMI and cholesterol, and then SBP paired with alcohol consumption and with BMI. For k = 10, the four strongest pairwise associations are the same, but cholesterol and BMI is the weakest of the group.
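Such a sensitivity check is a one-liner given the knn_mi() sketch above; in the snippet below, the data frame dat and its column names are illustrative placeholders, not the study data:

```r
# Sensitivity of the MI estimate to the choice of k, reusing knn_mi().
# 'dat' and its columns stand in for the actual covariates.
set.seed(1)
dat <- data.frame(sbp = rnorm(500, mean = 125, sd = 15))
dat$cholesterol <- 180 + 0.4 * dat$sbp + rnorm(500, sd = 20)
sapply(c(5, 10, 20, 40), function(k) knn_mi(dat$sbp, dat$cholesterol, k))
```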