
A novel piecewise-linear method for detecting associations between variables

Abstract

Detecting the association between two variables is necessary and meaningful in the era of big data. Many measures exist to detect such associations: some detect linear association, e.g., the simple and fast Pearson correlation coefficient, while others detect nonlinear association, e.g., the computationally expensive and imprecise maximal information coefficient (MIC). In this study, we propose a novel maximal association coefficient (MAC) based on the idea that any nonlinear association can be considered to be composed of piecewise-linear ones; it detects linear or nonlinear association between two variables through the Pearson coefficient. We conduct experiments on simulation data, and the results show that the MAC has both generality and equitability. In addition, we apply the MAC method to two real datasets, the major-league baseball dataset from Baseball Prospectus and a dataset of credit card clients’ default, to detect the association strength of pairs of variables in each. The experimental results show that the MAC can be used to detect the association between two variables, and that it is less computationally expensive and more precise than MIC, which may be important for follow-up data analysis and the conclusions drawn from it.

1 Introduction

There are various linear or nonlinear associations [1–7] between two variables in the big data era. Detecting the association strength between them is necessary and meaningful for future data analysis [8–11]. Linear association between two variables can be detected through existing methods; however, nonlinear association cannot be detected well by these methods. How to accurately detect the association between two variables is thus an urgent problem.

The key indicators used to detect the association between two variables are the Pearson coefficient, Spearman coefficient, Kendall coefficient, mutual information, and the distance correlation coefficient. They can detect the association strength between variables, but each has limitations. Galton [12] first proposed the concept of regression and used the letter “r” to express the degree of correlation; however, he did not recognize the concept of negative correlation. Subsequently, Pearson [13] proposed the Pearson linear coefficient, the covariance of two variables divided by the product of their standard deviations. The Pearson coefficient can be used to detect the association between two variables, but only when that association is statistically linear. Spearman [14] therefore proposed the Spearman coefficient on the basis of the Pearson coefficient, which can detect linear or nonlinear associations between two variables, provided these associations are monotonic. Over time, more and more methods were proposed. Kendall introduced the Kendall coefficient [15], also called the harmony coefficient, but the data must first be ranked. Afterwards, Shannon [16] proposed mutual information [17, 18], which is difficult to calculate because it involves probability densities. In 2007, Székely [19] proposed a new statistical correlation method, the distance correlation coefficient, which improves on the Pearson coefficient’s shortcoming: if there is a nonlinear association between two variables, a Pearson coefficient of 0 does not allow us to conclude that no association exists, whereas a distance correlation coefficient of 0 does indicate no association without further analysis.
Broadly speaking, all of these indicators, the Pearson coefficient, Spearman coefficient, Kendall coefficient, mutual information, and distance correlation, can be used to detect the association between two variables. However, they have shortcomings: the Pearson coefficient only detects linear association, the Spearman coefficient has low precision, the Kendall coefficient requires ordered variables, mutual information is difficult to calculate, and the sample distance correlation coefficient is not necessarily 0 when variables are independent.

There are various associations between two variables; some are complex nonlinear associations that may not even be expressible by mathematical functions. Many modern measures have been proposed to detect them. In 2011, Wang et al. proposed a new measure, the R correlation coefficient, to detect linear or simple nonlinear relationships between two variables [20]. The R correlation coefficient is grounded in mathematical statistics, but only one simple example was used to support it, so it lacks experimental validation. Meanwhile, Reshef et al. [21] proposed a widely used measure, the maximal information coefficient (MIC), which can detect extensive relationships such as linear, exponential, and periodic ones, and even superpositions of functions that are not well modeled by a single function, but it has high computational complexity. In 2016, Wijayatunga proposed a generalized Pearson coefficient [22] and argued that it can detect any nonlinear dependence if a suitable distance metric is used and all possible maximal dependences are considered; however, this method mainly focuses on discrete variables. Soon afterwards, a new measure, G-squared [23], was proposed in 2017 based on a piecewise-linear regression method; it tests whether two univariate random variables are independent and measures their association strength. Nevertheless, G-squared is hard to estimate and requires certain regularity conditions. Moreover, a G-squared value of zero does not mean that the two variables are independent.
Although the above measures can be used to detect the association strength between two variables, they have the following limitations: the R coefficient only detects simple nonlinear associations and lacks experimental proof, MIC has high computational complexity, the generalized Pearson coefficient mainly focuses on discrete variables, and a G-squared value of 0 does not mean that the variables are independent.

In this paper, we propose a new measure, the maximal association coefficient (MAC), to detect the association strength between two variables. The MAC uses a piecewise-linear idea to detect linear or nonlinear associations between two variables via the Pearson coefficient. The remainder of the paper is organized as follows. In section 2, the MAC method is described in detail. In section 3, the generality and equitability of the MAC are verified through two simulation experiments. In section 4, the MAC method is applied to two real datasets to further illustrate that it can detect the association strength between two variables, and the results show that its performance is comparable to or better than that of MIC. Finally, discussions and conclusions are given in sections 5 and 6, respectively.

2 Proposed MAC measure

Since the association between two variables can be nonlinear, dividing the nonlinear association into piecewise-linear ones is a way to use a linear correlation coefficient to detect it. In this way, the problem of detecting a nonlinear association between two variables is transformed into the problem of detecting multiple simple linear associations. The piecewise-linear decomposition can be achieved by a partitioning method. The maximal association coefficient is thus based on partitioning and a linear correlation coefficient, as summarized in Fig 1. Two problems remain: 1) How should the variables be divided? 2) How should linear association be measured?

Fig 1. The overview of the proposed MAC measure.

https://doi.org/10.1371/journal.pone.0290280.g001

Any nonlinear association between two variables can be considered to be composed of piecewise-linear ones. However, no one knows where the breakpoints connecting two piecewise-linear segments are, which is why random partitioning is suggested. The next problem is that there are infinitely many partitions, which makes it impossible to examine them all when detecting the association strength between two variables. Clustering techniques become one of the options for achieving the partition. Any clustering technique can be used to divide the variables; the simple and effective K-means clustering is used here. The K-means method divides each variable’s space into different bins, where the K value determines the number of bins per variable, and the grid partition between the two variables is then obtained. All data can be divided into different grids; a schematic diagram is shown in Fig 2. When the data is divided into different grids, the association coefficients of the data in some grids are likely to be larger than that of the whole data. Nevertheless, we employ the combination of the association coefficients from all grids, instead of one grid, to reveal the association of the whole data. If only a minority of grids have large association coefficients, this is unlikely to produce a MAC higher than the real association coefficient.

Fig 2. A schematic diagram of grid partition between two variables.

(a) A two-dimensional scatter plot of two variables. (b) Several different grid partitions between them.

https://doi.org/10.1371/journal.pone.0290280.g002

Any measure can be used to detect the linear association strength of the data in each grid obtained by dividing the two variables. Among these measures, the Pearson coefficient was proposed early, is easy to compute, and remains the most widely used correlation index; it is therefore used to detect the association strength of the data in each grid after obtaining the grid partition between two variables. The Pearson coefficient lies in [−1, 1] and can detect whether the association between two variables is a positive correlation, no correlation, or a negative correlation. However, a weighted sum that directly uses the signed per-grid Pearson coefficients would let positive and negative values cancel, so it could not reflect the association strength between the two variables. Since we aim to detect whether variables are correlated and to what degree, we take the absolute value of the Pearson coefficient: a coefficient of 0 indicates no correlation, 1 indicates perfect correlation, and values between 0 and 1 indicate different degrees of correlation. As a result, the weighted sum of the absolute values of the Pearson coefficients reflects the association between two variables well.

For the grid partition between two variables, it is necessary to set a maximum number of grids (MG) to avoid the infinite number of grids the partitioning method could otherwise produce. The maximum number of grids is MG = max{4, n^α}, where n is the data size and α is a hyper-parameter in [0, 1]. Variables x and y are divided into s and t bins, respectively, where t = MG/s. The data is divided into an s-by-t grid, where s ranges over [2, MG/2], so many different grid partitions between the two variables can be obtained. Under each grid partition, an association coefficient (AC) between variables x and y is calculated based on Eq 1.

AC = ∑i wi·|pi|  (1)

In Eq 1, i indexes the grids containing data, wi is the weight of the i-th grid, and |pi| is the absolute value of the Pearson coefficient of the data in the i-th grid. The weight wi is obtained according to Eq 2.

wi = si / ∑jsj  (2)

In Eq 2, si is the area of the i-th grid containing data, and ∑jsj is the sum of the areas of all such grids. The weight wi is thus the normalized area, so that the weights sum to 1, i.e., ∑iwi = 1.
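As a small sketch of Eqs 1 and 2 in Python, the weighted sum of absolute per-grid Pearson coefficients can be computed as follows. The `grids` structure (a list of per-grid areas and data) and the function names are our own illustrative choices, not from the paper:

```python
import math

def pearson(xs, ys):
    # Plain sample Pearson correlation; returns 0.0 for degenerate
    # (constant or single-point) data so |p| stays well defined.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def association_coefficient(grids):
    # grids: list of (area, xs, ys) tuples, one per grid cell containing data.
    # Eq 2: w_i = s_i / sum_j s_j;  Eq 1: AC = sum_i w_i * |p_i|.
    total_area = sum(area for area, _, _ in grids)
    return sum((area / total_area) * abs(pearson(xs, ys))
               for area, xs, ys in grids)
```

Because the weights are normalized areas and each |pi| is at most 1, the AC always lies in [0, 1].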

The association coefficient under each grid partition can be obtained by Eq 1, in which the maximum value of these association coefficients is the maximal association coefficient (MAC), as shown in Eq 3.

MAC = max{AC}  (3)

The MAC algorithm is described as follows. Its inputs are a dataset D with data size n and a hyper-parameter α, which determines the maximum number of grids (MG). Its output is the maximal association coefficient (MAC) between the two variables.

The calculation steps of MAC are shown as follows:

Step 1. Label the two variables x and y. Use the K-means method to divide the x-values into s bins and the y-values into t bins; in this way, an s-by-t grid partition is obtained.

Step 2. For each grid containing data, calculate its area and the Pearson coefficient of the data within it. The ratio of each such grid’s area to the sum of the areas of all these grids is used as the weight of that grid.

Step 3. The weighted sum of the absolute values of the per-grid Pearson coefficients, using the weights from step 2, is the association coefficient between the two variables under this partition.

Step 4. Repeat steps 1 to 3 to obtain many association coefficients under the constraint of the maximum number of grids (MG). The maximum of these association coefficients is the maximal association coefficient between the two variables.
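The four steps above can be sketched end to end as follows. This is a simplified reading of the method, not the authors’ implementation: it uses a minimal 1-D Lloyd’s k-means, and it approximates each grid cell’s area by the spread of the points it contains, a detail the paper does not fully specify here. All function names are ours:

```python
import math
import random

def pearson(xs, ys):
    # Sample Pearson correlation; 0.0 for degenerate data.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def kmeans_1d(values, k, iters=20, seed=0):
    # Minimal 1-D Lloyd's k-means; returns a bin label for every value.
    rng = random.Random(seed)
    centers = rng.sample(sorted(set(values)), k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(v - centers[j]))
                  for v in values]
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

def mac(x, y, alpha=0.6):
    # Steps 1-4: try many s-by-t partitions, keep the best AC (Eq 3).
    n = len(x)
    mg = max(4, int(n ** alpha))              # MG = max{4, n^alpha}
    best = 0.0
    for s in range(2, mg // 2 + 1):           # s bins on x, t = MG/s on y
        t = max(2, mg // s)
        if s > len(set(x)) or t > len(set(y)):
            continue
        lx, ly = kmeans_1d(x, s), kmeans_1d(y, t, seed=1)
        cells = {}
        for xi, yi, bx, by in zip(x, y, lx, ly):
            cells.setdefault((bx, by), ([], []))
            cells[(bx, by)][0].append(xi)
            cells[(bx, by)][1].append(yi)
        grids = []
        for xs, ys in cells.values():
            # Cell "area" approximated by the spread of its points (an
            # assumption; degenerate cells get a tiny positive area).
            area = (max(max(xs) - min(xs), 1e-9)
                    * max(max(ys) - min(ys), 1e-9))
            grids.append((area, xs, ys))
        total = sum(a for a, _, _ in grids)
        ac = sum((a / total) * abs(pearson(xs, ys))   # Eqs 1 and 2
                 for a, xs, ys in grids)
        best = max(best, ac)                  # Eq 3: max over partitions
    return best
```

On perfectly linear data, every cell’s |Pearson| is 1, so the sketch returns a value near 1.0, as the paper’s generality experiments predict.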

3 Simulation experiment and result analysis

In section 2, the MAC method was proposed and described in detail. If a measure can detect the association strength of extensive relationship types given enough samples, the measure has generality; if a measure gives similar scores to different relationship types with the same noise, the measure has equitability. We expect the MAC to have both generality and equitability, and in this section we verify this through a series of simulation experiments.

In this section, the hardware used is Intel (R) Core (TM) i7-10510U CPU @ 1.80GHz 2.30GHz, 8.00GB RAM and the software is Anaconda Python 3.7.

3.1 Data type

The 10 relationship types used in this section are shown in Table 1, together with their relational expressions. The data xi (i = 1, 2, …, n) of variable X is generated in the domain [0, 1], and the data yi (i = 1, 2, …, n) of variable Y is obtained according to the relational formula given in Table 1. These relationship types are used to verify the generality and equitability of the MAC.

Table 1. Multiple different relationship types between two variables.

https://doi.org/10.1371/journal.pone.0290280.t002
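The data-generation step can be sketched as below. The exact relational expressions live in Table 1, which is not reproduced in the text, so the formulas here are illustrative stand-ins, not the paper’s exact definitions:

```python
import math
import random

def generate(relationship, n, seed=0):
    # x_i drawn uniformly from [0, 1]; y_i from an assumed relational
    # expression (hypothetical stand-ins for Table 1's formulas).
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    forms = {
        "linear":    lambda v: v,
        "parabolic": lambda v: 4 * (v - 0.5) ** 2,
        "sine":      lambda v: math.sin(4 * math.pi * v),
        "random":    lambda v: rng.random(),   # y independent of x
    }
    return x, [forms[relationship](v) for v in x]
```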

3.2 Generality

Under each relationship type in Table 1, datasets with data sizes n of 10, 20, 40, 80, 100, 200, 400, 800, 1000, 2000, 4000, 8000, and 10000 are generated, totaling 13 datasets. For each dataset, the values of variable X are randomly generated in the domain [0, 1], and the corresponding values of variable Y are generated according to the relational expression in Table 1. The first nine types in Table 1 are non-random relationship types, and the tenth is a random relationship type. The data (xi, yi), i = 1, 2, …, n, is generated at different data scales under each relationship type, and the MAC between the two variables is calculated. We apply the MAC and MIC methods to each dataset under each relationship type, with the hyper-parameter α empirically set to 0.75 for the first nine relationship types and 0.45 for the random relationship type. We repeat the MAC and MIC computations 50 times and take the averages of the 50 MACs and MICs as the MAC and MIC of each dataset, to reduce the influence of randomness on the experiments. The MAC and MIC between two variables under each relationship type, as the data scale increases, are shown in Fig 3A and 3B. The MACs and MICs of different relationship types are represented by differently colored polylines, as shown in the legend of Fig 3.

Fig 3. The MACs and MICs between two variables with different data scales under each relationship type.

(a) The MAC between variables. (b) The MIC between variables.

https://doi.org/10.1371/journal.pone.0290280.g003

As can be seen from Fig 3A and 3B, for the first four simple relationship types, the association can be accurately detected even when the data scale is small, that is, the MAC and MIC tend to or equal 1.0. For the latter five complex relationship types, the association strength can be accurately detected at a large data scale, where the MAC and MIC approach 1.0. Note that a strong association still exists for these five relationship types even at a small data scale, so we vary the hyper-parameter α to observe whether the MAC and MIC methods can detect it. We set α in both methods to 1.0 and detect the association at small data scales of 10, 20, 40, 100, and 200 points; the resulting MACs and MICs are shown in the inset figures of Fig 3A and 3B. The MAC method can fully detect the association even with only 10 data points, whereas the MIC method requires at least 40 data points, except for the parabolic relationship type. Our MAC method is thus more robust than the MIC method at small data scales. For relationship types of different complexity, α can be set to different values when detecting association strength; additionally, when the data scale is large (or small), α can be small (or large).

When the relationship between two variables is random, the MAC and MIC generally decrease with increasing data scale until they tend to zero. Interestingly, for the random relationship type, the MAC is large when the data scale is small. We attribute this to an apparent relationship between the variables at small data scales, and we explain this phenomenon through an experiment shown in Fig 4, where n (n = 10, 20, 40) is the data size. Mathematically, any function can be approximated in polynomial form, so we employ a polynomial function to fit the randomly generated data points. When only 10 data points are generated, a polynomial can fit them perfectly, as shown in Fig 4A and 4D. The MACs in Fig 4A and 4D are both 1.0, but the MIC is 1.0 in Fig 4A and not in Fig 4D. This may be because the MAC method is more robust than the MIC method at a small data scale, as mentioned above. As the data scale increases, the randomness between the variables also increases and the MAC and MIC gradually decrease, as shown in Fig 4B, 4C, 4E and 4F. In summary, the MAC and MIC may be large at small data scales even for the random relationship type.

Fig 4. The MACs and MICs between two variables under the random relationship type.

The text at the top of each panel gives the data scale and the MAC and MIC between the variables.

https://doi.org/10.1371/journal.pone.0290280.g004

Additionally, Fig 3A shows that the MAC of each deterministic relationship type increases as the data scale increases, while the MAC of the random relationship type decreases. When the data size reaches 10000, the MAC between two variables approaches 1.0 for all but the random relationship type. The MAC can therefore be used to detect the association strength of multiple complex relationship types, showing that it has good generality.

3.3 Equitability

The datasets generated by the first nine relationship types in Table 1 are used to verify the equitability of the MAC. The data of variable X is uniformly generated in the domain [0, 1], and the data of variable Y is obtained according to the relational expression of each relationship type in Table 1, with the data size n set to 4000. In this way, 9 noise-free datasets are obtained. Uniform vertical noise means that noise is added to the values of variable Y. R² is the squared Pearson coefficient between the perturbed y-values and the true y-values; in other words, 1 − R² is the noise level added to the true y-values. Noise levels of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100% were then added to each noise-free dataset to generate noisy datasets. Counting the noise-free dataset, a total of 11 datasets with different noise levels are obtained under each relationship type. The hyper-parameter α determines the maximum number of grids (MG) described in section 2. We employ the MAC and MIC methods to detect the association between variables. For the MIC method, α is set to 0.5 for all relationship types. As seen in section 3.2, the hyper-parameter α in the MAC method can differ across relationship types when detecting association strength: the simpler the relationship, the smaller the value of α, and the more complex the relationship, the larger the value of α. Thus, for the linear, exponential, and triangle composite function types, which are simple, α is set to 0.2; for the parabolic and sine types, which are slightly more complex, α is set to 0.3; and for the Periodic plus Linear, Sinusoidal (Fourier Frequency), Sinusoidal (non-Fourier Frequency), and Sinusoidal (Varying Frequency) types, which are complex, α is set to 0.5.
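The noise model described above (uniform vertical noise scaled so that 1 − R² hits a target level) can be sketched as follows. The bisection on the noise amplitude is our own assumption about how to reach a given level; the paper does not spell out this procedure:

```python
import math
import random

def pearson(xs, ys):
    # Sample Pearson correlation; 0.0 for degenerate data.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def add_vertical_noise(y, noise_level, seed=0):
    # Bisect a uniform-noise amplitude a so that 1 - R^2 between the
    # perturbed and true y-values is approximately `noise_level`.
    rng = random.Random(seed)
    eps = [rng.uniform(-1, 1) for _ in y]

    def perturb(a):
        return [v + a * e for v, e in zip(y, eps)]

    lo, hi = 0.0, 1.0
    while hi < 1e6 and 1 - pearson(y, perturb(hi)) ** 2 < noise_level:
        hi *= 2                      # grow until enough noise is injected
    for _ in range(60):              # bisection on the amplitude
        mid = (lo + hi) / 2
        if 1 - pearson(y, perturb(mid)) ** 2 < noise_level:
            lo = mid
        else:
            hi = mid
    return perturb(hi)
```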

The MAC and MIC methods are applied to each dataset with a different noise level under these nine relationship types, and the MAC and MIC between the two variables are obtained. The MAC and MIC calculations are repeated 50 times, and the averages of the 50 MACs and MICs are taken as the MAC and MIC between the two variables, as shown in Fig 5; averaging reduces the influence of randomness on the experiments. The MACs and MICs of different relationship types are represented by differently colored polylines, as shown in the legend of Fig 5. As Fig 5 shows, the MAC and MIC between two variables gradually decrease under the same relationship type as the noise level increases. At the same noise level, the MACs and MICs under different relationship types are similar. Therefore, the MAC has good equitability. Because there is no baseline association coefficient for noisy data, it is hard to determine which of the MAC and MIC methods is more accurate.

Fig 5. The MACs and MICs between two variables with different noise levels under each relationship type.

(a) The MAC between variables. (b) The MIC between variables.

https://doi.org/10.1371/journal.pone.0290280.g005

In section 3.2, by exploring the MAC between variables at different data scales under various relationship types, it was found that the MAC between two deterministically related variables gradually increases to, or tends toward, 1 as the data scale increases, indicating that the MAC has good generality. In section 3.3, by exploring the MAC between variables at different noise levels under various relationship types, it was found that the MAC decreases as the noise level increases, and that the MACs under different relationship types with the same noise level are essentially consistent, indicating that the MAC has good equitability. Therefore, the MAC has both good generality and good equitability.

4 Experiments and result analysis on real data

There are various real datasets [24–29], such as the performance statistics dataset for the 2008 Major League Baseball season (MLB2008 dataset) [24, 25] and the dataset of credit card clients’ default [26]. In this section, the MAC method is applied to these two real datasets to verify that it can accurately detect the association strength between two variables. For a baseball statistics glossary, see: http://www.baseballprospectus.com/glossary/.

4.1 Major-league baseball dataset

In this section, the MAC method is used to calculate the maximal association coefficient (MAC) between a player’s salary and each of 50 variables in the MLB2008 dataset, which contains 337 instances, to reveal the association strengths between them. The hyper-parameter α in the MAC method is empirically set to 0.45; a detailed introduction to α is given in section 2. To demonstrate the MAC’s ability to detect association strength, the MIC and Pearson coefficient were used to detect the association of these 50 pairs of variables in two comparative experiments. The MAC, MIC, and Pearson coefficient between a player’s salary and each of the 50 variables are shown in Fig 6, where each MAC is the average over 50 runs of the MAC algorithm and the MIC values are taken from previous work [21]. As shown in Fig 6, the MACs of the 50 pairs of variables are given in the MAC column, the MICs in the MIC column, and the Pearson coefficients in the Pearson column.

Fig 6. The MAC and MIC and Pearson coefficient of 50 pairs of variables in the MLB2008 dataset.

https://doi.org/10.1371/journal.pone.0290280.g006

The evaluation criteria of regression analysis are used to evaluate the difference between the MAC and MIC, as shown in Table 2. These criteria are the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE); the larger the value, the greater the error. From Fig 6 and Table 2, the difference between the MAC and MIC is small; therefore, the MAC can be used to detect the association strength between two variables in place of the computationally complicated MIC.

Table 2. Differences between the MACs and MICs of 50 pairs of variables in the MLB2008 dataset.

https://doi.org/10.1371/journal.pone.0290280.t003

4.2 Dataset of credit card clients’ default

To show that the MAC method can also be used on a large-scale dataset, it is applied in this section to the dataset of credit card clients’ default, which contains 30000 instances, to detect the association between variables. We also apply the MIC method to this dataset to compare the performance of the two algorithms. Both methods are applied to all pairs of variables from the dataset, and six key associations are chosen and shown in Fig 7. In Fig 7, the green dots represent the data points and the red line represents the possible association between the variables. The text at the top of each panel shows the MAC and MIC between the variables and the two algorithms’ running times, averaged over 50 runs of each algorithm. Fig 7A and 7B show linear associations: the MAC and MIC are similar, but the running time of the MAC is lower than that of the MIC. In addition, more noise is present in the linear relationship in Fig 7A than in Fig 7B, so the MAC and MIC in Fig 7A are lower than in Fig 7B, consistent with the equitability of the two measures. Fig 7C and 7D show similar associations: on the one hand, the MAC and MIC and their running times are close; on the other hand, the MAC is more precise than the MIC and the MAC algorithm runs faster. These phenomena also appear in Fig 7E and 7F. The experimental results show that, compared with the MIC method, the MAC method is both computationally inexpensive and precise in detecting association strength, which makes it suitable for big data. Accurately detecting the association between variables is potentially important, for example for feature selection, and may lead to better conclusions from data analysis.

Fig 7. Application of MAC to pairs of variables from the dataset of credit card clients’ default.

There are scatter plots of 6 pairs of variables, with the possible associations between variables also shown. The text at the top of each panel gives the coefficients obtained by the MAC and MIC algorithms and the running times of the two algorithms.

https://doi.org/10.1371/journal.pone.0290280.g007

5 Discussions

Both the MAC and MIC methods can be used to detect the association strength between two variables, but there are differences between them. The MAC method is based on a piecewise-linear idea: it employs the Pearson coefficient to detect the association strength of the data in each piecewise-linear segment and thereby reveals the association strength between the two variables, with K-means used to divide the data into different grid partitions to realize this idea. In contrast, the MIC method detects the association strength based on mutual information, which is calculated approximately by way of a grid partition. The MAC is more precise than the MIC, possibly because the MAC involves a single approximation (the maximal number of grids is set manually), while the MIC involves two (the mutual information is approximated, and the maximal number of grids is set manually). The MAC method is also more robust than the MIC method, as verified in section 3.2. Additionally, there is no ground-truth association coefficient for noisy data, which makes it difficult to determine which of the MAC and MIC methods is more accurate; this is an unavoidable problem.

The computational complexity of the MAC and MIC methods is as follows. In the MAC method, K-means is used to divide each variable; its complexity is O(nkt), where n is the number of samples, k the number of clusters, and t the number of iterations, with k and t generally treated as constants. The complexity of the partition step is O(n). To obtain the MAC between variables, this calculation is invoked MG times, where MG = n^α. Therefore, the complexity of the MAC method is O(n)·MG = O(n^(1+α)); with the default α = 0.6, this is O(n^1.6). In the MIC method [21], the complexity of the sub-procedure OptimizeXAxis(D, Q, x) is O(k²xy), where k = cx and xy < B = n^α, so O(k²xy) = O(x²B). Since x ranges from 2 to B/2, the complexity of the whole MIC method is O(x²B)·O(x) = O(B⁴) = O(n^(4α)); with the default α = 0.6, this is O(n^2.4).

6 Conclusions

In this paper, a maximal association coefficient (MAC) is proposed to detect the association strength between two variables. The idea of the MAC method is to detect a nonlinear association by dividing it into multiple piecewise-linear ones, so that the computationally inexpensive Pearson coefficient can be used to detect the nonlinear association. Since no one knows how many linear segments a nonlinear association contains, or where the breakpoints between two consecutive segments lie, a partitioning method is adopted to divide the data into many different grid partitions. Under these grid partitions, many association coefficients are obtained, and the maximum of them is the maximal association coefficient.

The generality and equitability of the MAC have been verified by two simulation experiments, which indicates that it is reasonable to use the MAC method to detect the association strength between two variables. The results on real data further show that the MAC method is applicable in practice, and the associations between variables mined by the MAC method may help downstream analysis tasks in the future. The MAC between two variables can also be extended to the MAC between univariate and multivariate data to explore associations among multiple variables.

References

  1. Liu ZM, Rios C, Zhang NY, Yang L, Chen W, He B. Linear and nonlinear relationships between visual stimuli, EEG and BOLD fMRI signals. NeuroImage. 2010 Jan 15; 50(3): 1054–1066. pmid:20079854
  2. Coselli JS. Composition of the surgical team in aortic arch surgery-a risk factor analysis. Eur J Cardiothorac Surg. 2022 Apr 19; 62(3): ezac243. pmid:35437586
  3. Morrow RL, Mintzes B, Souverein PC, Bruin MD, Roughead EE, Lexchin J, et al. Influence of drug safety advisories on drug utilisation: an international interrupted time series and meta-analysis. BMJ Qual Saf. 2022 Jan 20; 31(3): 179–190. pmid:35058332
  4. Baldanzi G, Hammar U, Fall T, Lindberg E, Lind L, Elmståhl S, et al. Evening chronotype is associated with elevated biomarkers of cardiometabolic risk in the EpiHealth cohort: a cross-sectional study. Sleep. 2022 Feb 14; 45(2): zsab226. pmid:34480568
  5. Bonnell LN, Troy AR, Littenberg B. Nonlinear relationship between nonresidential destinations and body mass index across a wide range of development. Prev Med. 2021 Aug 24; 153: 106775. pmid:34437875
  6. Sun GY, Khaskheli A, Raza SA, Shah N. Analyzing the association between the foreign direct investment and carbon emissions in MENA countries: a pathway to sustainable development. Environ Dev Sustain. 2022 Jul 3; 24(3): 4226–4243.
  7. Schober P, Boer C, Schwarte LA. Correlation coefficients: appropriate use and interpretation. Anesth Analg. 2018 Jan 11; 126(5): 1763–1768. pmid:29481436
  8. Abe H, Tsumoto S. Analyzing behavior of objective rule evaluation indices based on Pearson product-moment correlation coefficient. ISMIS 2008: Foundations of Intelligent Systems. 2008. pp. 84–89.
  9. Atanu B. Distance correlation coefficient: an application with Bayesian approach in clinical data analysis. J Mod Appl Stat Meth. 2014 May 1; 13(1): 354–366.
  10. Strickert M, Schleif FM, Villmann T, Seiffert U. Unleashing Pearson correlation for faithful analysis of biomedical data. Berlin: Springer; 2009.
  11. Yan M, Liu J, Lei Z, Meng Y. Classification of unknown mobile web traffic based on correlation coefficient measurement. 2014 International Symposium on Wireless Personal Multimedia Communications (WPMC). 2014. pp. 6–11.
  12. Galton F. Regression towards mediocrity in hereditary stature. J Anthropol Inst G B Irel. 1886; 15: 246–263.
  13. Pearson K. Contributions to the mathematical theory of evolution. J R Stat Soc. 1893 Dec; 56(4): 675–679.
  14. Spearman C. The proof and measurement of association between two things. Am J Psychol. 1987; 100(3/4): 441–471.
  15. Kendall MG. A new measure of rank correlation. Biometrika. 1938 Jun 1; 30(1/2): 81–93.
  16. Shannon CE. A mathematical theory of communication. Bell Labs Tech J. 1948 Jul; 27(3): 379–423.
  17. Cover TM, Thomas JA. Elements of information theory. 2nd ed. New York: Wiley-Interscience; 2006.
  18. Kraskov A, Stögbauer H, Grassberger P. Estimating mutual information. Phys Rev E. 2004 Jun 23; 69(6): 066138. pmid:15244698
  19. Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation of distances. Ann Stat. 2007 Dec; 35(6): 2769–2794.
  20. Wang T, Zhang SQ. Study on linear correlation coefficient and nonlinear correlation coefficient in mathematical statistics. Stud Math Sci. 2011; 3(1): 58–63.
  21. Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, et al. Detecting novel associations in large data sets. Science. 2011 Dec 16; 334(6062): 1518–1524. pmid:22174245
  22. Wijayatunga P. A geometric view on Pearson’s correlation coefficient and a generalization of it to non-linear dependencies. Rat Math. 2016; 30(1): 3–21.
  23. Wang X, Jiang B, Liu JS. Generalized R-squared for detecting dependence. Biometrika. 2017 Feb 22; 104(1): 129–139. pmid:29430028
  24. Baseball Prospectus Statistics Reports (2009) [Internet]. Available from: www.baseballprospectus.com/sortable/.
  25. Lahman S. The Baseball Archive (2009) [Internet]. Available from: baseball1.com/statistics/.
  26. Yeh IC, Lien CH. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Syst Appl. 2009 Mar; 36(2-Part 1): 2473–2480.
  27. World Health Organization Statistical Information Systems (WHOSIS) [Internet]. Available from: https://www.who.int/data/.
  28. Rosling H. Database: Gapminder [Internet]. Available from: https://www.gapminder.org/.
  29. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998 Dec; 9(12): 3273–3297. pmid:9843569