Asymptotic Properties of Pearson's Rank-Variate Correlation Coefficient under Contaminated Gaussian Model

  • Rubao Ma,

    Affiliation Department of Automatic Control, School of Automation, Guangdong University of Technology, Guangzhou, Guangdong, China

  • Weichao Xu ,

    wcxu@gdut.edu.cn

    Affiliation Department of Automatic Control, School of Automation, Guangdong University of Technology, Guangzhou, Guangdong, China

  • Yun Zhang,

    Affiliation Department of Automatic Control, School of Automation, Guangdong University of Technology, Guangzhou, Guangdong, China

  • Zhongfu Ye

    Affiliation Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, Anhui, China


Abstract

This paper investigates the robustness properties of Pearson's rank-variate correlation coefficient (PRVCC) in scenarios where one channel is corrupted by impulsive noise and the other is impulsive noise-free. As shown in our previous work, these scenarios, which are frequently encountered in radar and/or sonar, can be well emulated by a particular bivariate contaminated Gaussian model (CGM). Under this CGM, we establish the asymptotic closed forms of the expectation and variance of PRVCC by means of the well-known delta method. To gain a deeper understanding, we also compare PRVCC with two other classical correlation coefficients, i.e., Spearman's rho (SR) and Kendall's tau (KT), in terms of the root mean squared error (RMSE). Monte Carlo simulations not only verify our theoretical findings, but also reveal the advantage of PRVCC by an example of estimating the time delay in the particular impulsive noise environment.

Introduction

Correlation coefficients are indices that depict the strength of the statistical relationship between two random variables obeying a joint probability distribution [1]. In general, a correlation coefficient should be large and positive if there is a high probability that large (small) values of one variable occur in conjunction with large (small) values of another, and it should be large and negative if the direction reverses [2]. Due to their theoretical and algorithmic advantages, correlation coefficients have been widely used in many sub-areas of signal processing [3]–[18]. Among the many methods of correlation analysis in practice, Pearson's product moment correlation coefficient (PPMCC), Kendall's tau (KT) and Spearman's rho (SR) are perhaps the most prevalent ones [19].

Each of these three classical coefficients has advantages and disadvantages. PPMCC is optimal under the bivariate normal model (BNM), and is appropriate mainly for characterizing linear correlations. However, it will output misleading results if nonlinearity is involved in the data. On the other hand, the two rank-based coefficients, SR and KT, are not as powerful as PPMCC when the data follow bivariate normal distributions. Nevertheless, they are invariant under increasing monotone transformations, which makes them more suitable for many nonlinear cases in practice. Moreover, theoretical and empirical results indicate that SR and KT surpass PPMCC when data are corrupted by impulsive noise [12], [20]. Besides these three classical coefficients, some methods such as Pearson's rank-variate correlation coefficient (PRVCC) [21], Gini correlation (GC) [22], and order statistics correlation coefficient (OSCC) [23]–[25], among others [26]–[29], have also been proposed in the literature.

It has long been known that PPMCC is an optimal estimator of the population correlation coefficient, in the sense of unbiasedness and approaching the Cramer-Rao lower bound for large samples under bivariate normal models [30]. Despite these desirable properties, PPMCC might not be applicable when the data are corrupted by impulsive noise, that is, when the distribution of the data deviates from the BNM. Consider the following scenario that frequently occurs in radar, sonar or communication. We have a prescribed signal, whether deterministic or stochastic, whose statistical properties are known a priori. Our purpose is to estimate the correlation between this “clean” signal and the associated distorted version from the receiver that might be corrupted by a tiny fraction of impulsive noise (outliers with very large variance [31]–[33]). To deal with such a case, one might adopt the conventional strategy, that is, ranking the cardinal variable(s) and resorting afterwards to SR or KT [22], which are robust against both nonlinearity and impulsive noise [20]. However, using only the ranks of the two variables, we unavoidably lose useful information embedded in the variates of the “clean” variable. A better strategy would be to rely on coefficients, such as PRVCC [21], that accommodate both ordinal and cardinal information contained in the samples. Our purpose in this work is thus to investigate the properties of the historical PRVCC by both theoretical and empirical means.

The contribution of this work is twofold. Firstly, we establish the asymptotic closed forms of the expectation and variance of PRVCC under a contaminated Gaussian model that emulates a scenario frequently encountered in practice. Secondly, we demonstrate the superiority of PRVCC over PPMCC, SR and KT, in terms of the root mean squared error, by an example of estimating the time delay in the particular impulsive noise environment. These theoretical and empirical findings might help to rejuvenate the historical PRVCC, which has long been forgotten in the literature due to insufficient understanding of its theoretical properties.

For convenience of later discussion, we employ symbols , , and to denote the mean, variance, covariance and correlation of (between) random variables, respectively. Univariate and bivariate normal distributions are denoted by and , respectively. The sign reads “is approximately equal to”, whereas the sign stands for “is defined as”. The notation denotes that as [34]. The symbol stands for the product . Other notations will be defined where they first enter the text.

Methods

This section presents the definitions of PRVCC as well as a particular CGM simulating the impulsive noise environment mentioned in the previous section. Moreover, some auxiliary results are also established for further theoretical analysis.

1 Definitions of PRVCC

Let and be two random variables following a continuous bivariate distribution. Denote by and the marginal distributions of and , respectively. Then, according to a historical paper of Pearson [21], one of the population versions of PRVCC can be defined, in modern notation, by

(1)

Exchanging the roles of and in (1) yields the other version

Let be independent and identically distributed (i.i.d.) data pairs drawn from a continuous bivariate population. After rearranging in ascending order, we get a new sequence , which is termed the order statistics of [35]–[37]. Suppose that is at the th position in the sorted sequence. The integer is named the rank of and is denoted by . Let represent the arithmetic mean of data points . Then, based on (1), the sample version with respect to can be constructed as

(2)

where is the empirical cdf of . Substituting the relationship into (2) along with some simplifications leads to

(3)

Exchanging the roles of and in (3) gives another sample version with respect to . Note that in general . The choice between and depends on the different roles played by and in the scenario mentioned in the previous section. To avoid redundancy, we will focus on the properties of which is abbreviated as in the sequel unless ambiguity occurs.
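The idea behind the sample version, namely correlating the raw values (variates) of one variable with the ranks of the other, can be illustrated with a minimal Python sketch. It assumes that the statistic is simply the product-moment correlation between the values of x and the ranks of y; the exact normalisation used in (3) is not reproduced here, and the function names `ranks`, `pearson` and `rank_variate_corr` are hypothetical.

```python
import math


def ranks(values):
    """Return 1-based ranks of `values` (assumes no ties, as for continuous data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def pearson(a, b):
    """Ordinary product-moment correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def rank_variate_corr(x, y):
    """Correlate the variates of x with the ranks of y.

    Assumption: illustrative normalisation only, not necessarily that of (3).
    """
    return pearson(x, ranks(y))
```

Because the variates enter only through deviations from their mean, the value returned by this sketch does not change when a constant is added to every element of x. Swapping the arguments ranks x instead of y, which in general gives a different value.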

2 Contaminated Gaussian Model

To simulate the specific circumstance described in the Introduction, throughout we utilize the following CGM representing the joint probability density function (pdf) of two random variables and [20]

(4)

where , , , and . Under this specific CGM, it is obvious that the marginal distribution of is , whereas the marginal distribution of is . In other words, under Model (4), stands for a “clean” normal variable while stands for a “dirty” variable corrupted by a tiny fraction of a Gaussian component with very large variance (possibly tending to infinity). In this model, the parameter , which is considered of interest, is what we aim to estimate as accurately as possible, while and are interferences we seek to suppress. For the reason why Model (4) can arise in practice, please see Appendix A in our previous work [20].
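A two-component model of this kind can be simulated with a short Python sketch. The parameterisation below is an assumption made for illustration: zero means, unit variances in the clean component, a contamination probability `eps`, a scale factor `lam` applied to the second variable in the contaminating component, and a separate correlation `rho_prime` in that component. These names and choices are hypothetical and may differ from the exact form of (4).

```python
import math
import random


def sample_cgm(n, rho, eps, lam, rho_prime, rng):
    """Draw n pairs from an assumed two-component bivariate Gaussian mixture.

    With probability 1 - eps a pair comes from a standard bivariate normal with
    correlation rho; with probability eps the second variable has standard
    deviation lam and correlation rho_prime with the first. The first variable
    is standard normal in both components, so it stays "clean".
    """
    xs, ys, flags = [], [], []
    for _ in range(n):
        contaminated = rng.random() < eps
        r = rho_prime if contaminated else rho
        scale = lam if contaminated else 1.0
        x = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)
        xs.append(x)
        # y has standard deviation `scale` and correlation r with x
        ys.append(scale * (r * x + math.sqrt(1.0 - r * r) * z))
        flags.append(contaminated)
    return xs, ys, flags
```

For example, `sample_cgm(1000, 0.5, 0.05, 100.0, 0.0, random.Random(0))` returns the two sample sequences together with a flag marking which pairs came from the contaminating component; the flags are useful for checking a simulation but would not be available to an estimator.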

3 Auxiliary Results

To establish our major results in Theorem 1 in the next subsection, the auxiliary results summarized in Lemma 1 below are required.

Lemma 1.

Assume that the random vector follows a quadrivariate normal distribution with , , and for . Write for and for . Then

(5)

(6)

(7)

and

(8)

Proof. Write for . Then

(9)

(10)

(11)

(12)

The results of (5) and (7) follow readily by substituting the results in [39] into the right sides of (9) and (11), respectively. Next we show that (6) and (8) also hold true. Let . Then, according to [39], it follows that

(13)

where

(14)

(15)

and

(16)

It is seen that , and are all subscripts of the -terms in (14). Since, by (16), only is non-null, (14) and hence (13) are non-null only when the following conditions are satisfied, i.e.,

(17)

It is easy to verify that there are only four solutions to (17), namely

(18)

Substituting (18) into (13) and using (15) and (16) thereafter produces

(19)

which along with (10) leads to (6). By a similar argument, we have

(20)

which together with (12) yields (8). This completes the proof of the lemma. □

4 Asymptotic Mean and Variance of PRVCC Under CGM (4)

By applying the delta method [38] with Lemma 1, we are ready to establish the closed forms of the mean and variance of PRVCC for samples generated by CGM (4).
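As a reminder of the tool invoked here, the delta method for a smooth function of two statistics can be written generically as below. This is the textbook form only: the symbols g, T_1, T_2, \theta_1 and \theta_2, as well as the operator letters E, V and C for mean, variance and covariance, are placeholder names introduced for illustration and are not the specific quantities or expansions used in the proof of Theorem 1.

```latex
% Generic delta-method approximations for g(T_1, T_2), expanded about
% (\theta_1, \theta_2) = (E(T_1), E(T_2)); all partial derivatives of g
% are evaluated at that point.
\begin{align*}
E\{g(T_1,T_2)\} &\approx g(\theta_1,\theta_2)
  + \tfrac{1}{2}\,\frac{\partial^2 g}{\partial \theta_1^2}\,V(T_1)
  + \frac{\partial^2 g}{\partial \theta_1\,\partial \theta_2}\,C(T_1,T_2)
  + \tfrac{1}{2}\,\frac{\partial^2 g}{\partial \theta_2^2}\,V(T_2), \\
V\{g(T_1,T_2)\} &\approx
  \left(\frac{\partial g}{\partial \theta_1}\right)^{2} V(T_1)
  + 2\,\frac{\partial g}{\partial \theta_1}\,\frac{\partial g}{\partial \theta_2}\,C(T_1,T_2)
  + \left(\frac{\partial g}{\partial \theta_2}\right)^{2} V(T_2).
\end{align*}
```

In this generic form, only the means, variances and covariance of the two underlying statistics are needed to approximate the mean and variance of the function.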

Theorem 1.

Let be i.i.d. data pairs drawn from a bivariate normal population and be i.i.d. data pairs drawn from another bivariate normal population . Assume that and are mutually independent. Write and . Denote by the union of and . Then, as large, small, , and , the expectation and variance of PRVCC defined in (3) are

(21)

and

(22)

Proof. From (3), it is easy to verify that is shift invariant. Therefore, we lose no generality by assuming that hereafter. For convenience, write

(23)

(24)

Then, by the well-known delta method [38], it follows that

(25)

and

(26)

Obviously, we only need to evaluate , , , and in order to work out (25) and (26). From the theorem assumption, both and follow the same normal distribution , so obeys a distribution, which means that

(27)

(28)

Using Lemma 1 with some tedious algebra, we can obtain , and .

We first derive . From the definition (23) and the relationships [40],

(29)

Expanding and recalling that , it follows that

(30)

Since, by definition, is a mixture of and , (30) can be expanded as

(31)

which becomes (32) after some straightforward algebra along with the assistance of (8) in Lemma 1.

(32)

Next we evaluate , which can be written as

(33)

where

(34)

(35)

(36)

and

(37)

The expression of is easily obtained by substituting (31) into (34).

For convenience, denote by and the two triple summations of in (36). Then it follows that is decomposable into eight sub-triple summations which can be further partitioned into disjoint and exhaustive subsets that are listed in Table 1. An application of (7) to Table 1 leads directly to

Similarly we also have

Thus

The quadruple summation in (37) can be decomposed as

(38)

Expanding (38) according to different suffixes of and , we obtain sub-quadruple summations which can be further partitioned into disjoint and exhaustive subsets. In other words, is a summation of integrals of the form , i.e., the -terms, weighted by corresponding subset cardinality, i.e., the -terms. By substituting into (5) the corresponding parameters tabulated in Table 2 as well as exploiting the symmetry of (38), we obtain the expression of . Substituting the expressions of , , and into (33) and tidying up leads to the expression of in (39).

(39)

Finally we deal with . Write

(40)

and

(41)

Then . Now

(42)

Since we have in (32) and in (27), the second term in (42) can be easily obtained, as

(43)

whereas the first term, , can be written as

(44)

where

(45)

It is evident that , , and can be regarded as -terms in Lemma 1, i.e.,

As mentioned above, is a union of and , and is a union of and , so the triple summation in (45) can be split into eight terms as

(46)

Using (6) in Lemma 1 along with the corresponding parameters tabulated in Table 3, we can work out each sub-triple summation in (46). Some straightforward algebra leads readily to (47).

(47)

Substituting (27), (28), (32), (39) and (47) into (25) and (26), respectively, letting and , and omitting terms thereafter, we finally arrive at (21) and (22), respectively. The theorem thus follows. □

Remark 1.

Letting in (21), simplifies to a neater form of

(48)

which manifests how and affect the accuracy of the estimate of by PRVCC. Moreover, as , (21) and (22) reduce respectively to

(49)

and

(50)

which are the contamination-free versions of PRVCC.

Results and Discussion

In this section we verify the correctness of Theorem 1 by Monte Carlo simulations. To gain further insight into PRVCC, we also compare it with the two classical correlation coefficients, i.e., SR and KT, in terms of RMSE. Finally, we provide an example of time-delay estimation under CGM (4). Since the theoretical results in Theorem 1 only hold true for large sample size and small , in this section we set the sample size and . All samples are generated by suitable functions in the Matlab environment. For the sake of accuracy, the number of Monte Carlo trials is set to be unless otherwise stated.

1 Verification of Theorem 1

Figure 1 verifies the correctness of the mean of PRVCC under CGM (4) for large samples and small . Specifically, in Figure 1 we plot the simulation results (circles), the theoretical results of (21) (solid lines), and the contamination-free version (49) (dashed lines) under different combinations of and . Good agreement is observed between the simulation results and their theoretical counterparts. It can also be observed that the larger the contamination fraction and difference between and , the bigger the bias between and the ideal dashed curve corresponding to .

Figure 1. The numerical verification of (21), the expectation of in Theorem 1.

The number of samples is chosen as . In the vertically up direction, is decreasing following respectively; whereas corresponds to an increasing trend in the horizontally right direction, following respectively. There is good agreement between the simulation results (circles) and the theoretical computation (solid lines) in each subplot. As a reference, the contamination-free version (49) is also plotted (see dashed curves).

https://doi.org/10.1371/journal.pone.0112215.g001

Figure 2 verifies the correctness of the variance of PRVCC, by plotting the simulation results (circles) and the theoretical results of (22) (solid lines) concerning in the same scenarios as in Figure 1. For the purpose of comparison, the contamination-free version (50) (dashed lines) is also included in each subplot to highlight the effects of and . Note that we have multiplied by for a better visual effect. This figure shows good agreement between the simulation results and the corresponding theoretical ones. Moreover, it is seen that when , the curves are symmetric and the magnitude of increases with , especially for large. On the other hand, when , the curves are no longer symmetric. Specifically, for large, increases if and have opposite signs; and it decreases if and have the same sign. When is fixed, is the reversal of .

Figure 2. The numerical verification of (22), the variance of in Theorem 1.

The number of samples is chosen as . In the vertically up direction, is decreasing following respectively; whereas corresponds to an increasing trend in the horizontally right direction, following respectively. There is good agreement between the simulation results (circles) and the theoretical computation (solid lines) in each subplot. As a reference, the contamination-free version (50) is also plotted (see dashed curves).

https://doi.org/10.1371/journal.pone.0112215.g002

From these two figures, it follows that, although derived based on the assumptions and , our theoretical results established in Theorem 1 are sufficiently accurate for as small as and as large as . This means that PRVCC is applicable to many situations in practice, such as radar and biomedical engineering, where the sample size is much larger than and the fraction of impulsive interference is much lower than .
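The kind of Monte Carlo check described in this subsection can be sketched in Python as follows: draw many independent samples, compute the statistic on each, and take the empirical mean and variance over the trials for comparison with theoretical values. The sketch rests on assumptions that are not taken from the paper: the statistic is the plain correlation between the values of x and the ranks of y (not necessarily the normalisation of (3)), the mixture has zero means and unit variances in its clean component, and all function and parameter names are hypothetical. The original simulations were run in Matlab.

```python
import math
import random


def rank_variate_corr(x, y):
    """Correlation between the values of x and the ranks of y (assumed form)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: y[i])
    rank = [0] * n
    for position, index in enumerate(order, start=1):
        rank[index] = position
    mean_x = sum(x) / n
    mean_r = (n + 1) / 2.0                    # mean of the ranks 1..n
    cov = sum((a - mean_x) * (q - mean_r) for a, q in zip(x, rank))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_r = n * (n * n - 1) / 12.0            # sum of squared rank deviations
    return cov / math.sqrt(var_x * var_r)


def draw_pair(rho, eps, lam, rho_prime, rng):
    """One pair from an assumed two-component bivariate Gaussian mixture."""
    contaminated = rng.random() < eps
    r = rho_prime if contaminated else rho
    scale = lam if contaminated else 1.0
    x = rng.gauss(0.0, 1.0)
    y = scale * (r * x + math.sqrt(1.0 - r * r) * rng.gauss(0.0, 1.0))
    return x, y


def monte_carlo_mean_var(n, rho, eps, lam, rho_prime, trials, seed=0):
    """Empirical mean and (unbiased) variance of the statistic over `trials` runs."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        pairs = [draw_pair(rho, eps, lam, rho_prime, rng) for _ in range(n)]
        x = [p[0] for p in pairs]
        y = [p[1] for p in pairs]
        values.append(rank_variate_corr(x, y))
    mean = sum(values) / trials
    var = sum((v - mean) ** 2 for v in values) / (trials - 1)
    return mean, var
```

Running the function over a grid of correlation values, once with `eps` set to zero and once with a small positive `eps`, gives empirical curves of the kind that Figures 1 and 2 compare against the closed forms.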

2 RMSE Comparison of PRVCC with SR and KT

To deepen the understanding of PRVCC, in this subsection we compare in terms of RMSE the performance of PRVCC with SR and KT, which are also robust under the CGM (4) as shown in our previous work [20].

For fairness of comparison, some calibrations are necessary. From (21), it follows that

(51)

by which we can define an asymptotically unbiased estimator of the population correlation , as

The other two asymptotically unbiased estimators based on KT and SR are defined as [20]

Given the above definitions, we can then compare their performance in terms of the popular RMSE defined by

The CGM based on (4) is set to be

where , and increases from to by a step of . The RMSEs are listed in Table 4, where the minima with respect to , and are highlighted in bold font in a rowwise manner within each of the eight blocks. It appears that 1) all RMSEs are quite small, meaning that these three estimators perform similarly well under CGM (4); 2) outperforms and in the cases that takes values from to medium magnitudes; 3) outperforms for around ; 4) plays an intermediate role between and . Note that the RMSE values for are not shown in Table 4 due to symmetry.
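For reference, the RMSE of a set of Monte Carlo estimates against a known true value can be computed as in the short Python sketch below. It uses the standard definition, the square root of the average squared deviation from the true value; the function name `rmse` and its arguments are hypothetical.

```python
import math


def rmse(estimates, true_value):
    """Root mean squared error of `estimates` around the known `true_value`."""
    squared_errors = [(e - true_value) ** 2 for e in estimates]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Since the RMSE combines bias and variance in a single figure, it only gives a fair comparison once each coefficient has been calibrated to target the same population correlation, which is the purpose of the calibrations above.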

3 Example of Time-Delay Estimation

As remarked in the Introduction, in radar, sonar or communication we often need to estimate the correlation between a prescribed “clean” signal and a distorted version corrupted by impulsive noise. We now provide an example of time-delay estimation which is similar to this situation. In this example, the prescribed clean signal is a segment of sinusoidal wave

whereas the corrupted signal is with being a white contaminated Gaussian noise following the distribution of

(52)

where and . The time-delay is set to be ms. Our purpose is to estimate as accurately as possible under various signal-to-noise ratios . As illustrated in Figure 3, the procedure of estimating includes two steps. The first is to construct a correlation function that corresponds to by each of , and with respect to and . The second is to locate the time-shift corresponding to the maximum of the correlation function. The value of is considered to be an estimate of and stored for further analysis. Note that the number of Monte Carlo trials in this study is set to be .
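The two-step procedure described above (build a correlation function over candidate time shifts, then take the shift at its maximum) can be sketched in Python as follows. Everything specific in the sketch is an assumption made for illustration: the statistic is the plain correlation between the clean template values and the ranks of the received samples, the noise is a zero-mean two-component Gaussian mixture with hypothetical parameters `sigma`, `eps` and `lam`, delays are expressed in whole samples, and the search range is kept shorter than half a period of the sinusoid so that the maximum is unique.

```python
import math
import random


def rank_variate_corr(x, y):
    """Correlation between the values of x and the ranks of y (assumed form)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: y[i])
    rank = [0] * n
    for position, index in enumerate(order, start=1):
        rank[index] = position
    mean_x = sum(x) / n
    mean_r = (n + 1) / 2.0
    cov = sum((a - mean_x) * (q - mean_r) for a, q in zip(x, rank))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_r = n * (n * n - 1) / 12.0
    return cov / math.sqrt(var_x * var_r)


def sinusoid(n, freq_hz, fs_hz, delay_samples=0):
    """n samples of a unit-amplitude sine wave delayed by `delay_samples`."""
    return [math.sin(2.0 * math.pi * freq_hz * (t - delay_samples) / fs_hz)
            for t in range(n)]


def contaminated_noise(n, sigma, eps, lam, rng):
    """White noise: N(0, sigma^2) with prob. 1 - eps, N(0, (lam*sigma)^2) otherwise."""
    return [rng.gauss(0.0, lam * sigma if rng.random() < eps else sigma)
            for _ in range(n)]


def estimate_delay(received, freq_hz, fs_hz, max_lag):
    """Return the lag (in samples) that maximises the correlation function."""
    n = len(received)
    # Step 1: correlation function between the shifted clean template and the data.
    scores = [rank_variate_corr(sinusoid(n, freq_hz, fs_hz, lag), received)
              for lag in range(max_lag + 1)]
    # Step 2: locate the time shift at which the correlation function peaks.
    return max(range(max_lag + 1), key=lambda lag: scores[lag])
```

A typical use is to generate `sinusoid(n, f, fs, true_delay)`, add `contaminated_noise(...)` sample by sample, and pass the result to `estimate_delay`. Repeating this over many noise realisations gives the bias and standard deviation of the delay estimate.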

Figure 3. Schematic illustration of estimating the time-delay in Model (4).

The time-shift with respect to the maximum of the correlation function in the bottom panel is considered as an estimate of the true time-delay .

https://doi.org/10.1371/journal.pone.0112215.g003

Table 5 summarizes the estimates of from , and . It is observed that all three methods produce acceptable estimates of , for an SNR even as low as dB. However, PRVCC is slightly better than the other two, in the sense of giving smaller biases and standard deviations in most cases.

Table 5. Performance comparison of , and for being a segment of sin wave.

https://doi.org/10.1371/journal.pone.0112215.t005

Conclusions

This paper systematically investigates the statistical properties of the historical PRVCC under a particular contaminated Gaussian model. As shown in our previous work [20], this model reasonably simulates some frequently encountered scenarios where one variable is clean and the other is corrupted by a tiny fraction of impulsive noise with very large variance. Under this model, we establish the asymptotic closed forms of the expectation and variance of PRVCC by means of the well-known delta method. To gain further insight into PRVCC, we also compare it with two other classical correlation coefficients, i.e., SR and KT, in terms of the popular RMSE. Monte Carlo simulations not only verify our theoretical findings, but also reveal the strengths and weaknesses of PRVCC in various situations. The theoretical and empirical findings in this work are believed to add new knowledge to the area of correlation analysis that prevails in many branches of science and engineering.

Acknowledgments

The authors would like to thank the Academic Editor Prof. Haipeng Peng and the three anonymous reviewers for their insightful and constructive suggestions.

Author Contributions

Conceived and designed the experiments: RM WX. Performed the experiments: RM WX. Analyzed the data: RM WX. Contributed reagents/materials/analysis tools: RM WX YZ ZY. Wrote the paper: RM WX.

References

  1. Kendall M, Gibbons JD (1990) Rank Correlation Methods. New York: Oxford University Press, 5th edition.
  2. Gibbons JD, Chakraborti S (1992) Nonparametric Statistical Inference. New York: M. Dekker, 3rd edition.
  3. Jacovitti G, Cusani R (1992) Performance of normalized correlation estimators for complex processes. IEEE Transactions on Signal Processing 40: 114–128.
  4. Delmas J, Abeida H (2009) Asymptotic distribution of circularity coefficients estimate of complex random variables. Signal Processing 89: 2670–2675.
  5. Chorti A, Hristopulos DT (2008) Nonparametric identification of anisotropic (elliptic) correlations in spatially distributed data sets. IEEE Transactions on Signal Processing 56: 4738–4751.
  6. Chaux C, Duval L, Benazza-Benyahia A, Pesquet JC (2008) A nonlinear stein-based estimator for multichannel image denoising. IEEE Transactions on Signal Processing 56: 3855–3870.
  7. Tao R, Zhang F, Wang Y (2008) Fractional power spectrum. IEEE Transactions on Signal Processing 56: 4199–4206.
  8. Xu W, Zhao C, Ding Z (2009) Limited feedback multiuser scheduling of spatially correlated broadcast channels. Vehicular Technology, IEEE Transactions on 58: 4406–4418.
  9. Girault J, Kouamé D, Ouahabi A (2010) Analytical formulation of the fractal dimension of filtered stochastic signals. Signal Processing 90: 2690–2697.
  10. Lian J, Garner G, Muessig D, Lang V (2010) A simple method to quantify the morphological similarity between signals. Signal Processing 90: 684–688.
  11. Li J, Chen X, He Z (2013) Adaptive stochastic resonance method for impact signal detection based on sliding window. Mechanical Systems and Signal Processing 36: 240–255.
  12. Xu W, Hou Y, Hung Y, Zou Y (2013) A comparative analysis of spearman's rho and kendall's tau in normal and contaminated normal models. Signal Processing 93: 261–276.
  13. Axehill D, Gunnarsson F, Hansson A (2008) A low-complexity high-performance preprocessing algorithm for multiuser detection using gold sequences. IEEE Transactions on Signal Processing 56: 4377–4385.
  14. Beko M, Xavier J, Barroso V (2008) Further results on the capacity and error probability analysis of noncoherent mimo systems in the low snr regime. IEEE Transactions on Signal Processing 56: 2915–2930.
  15. Greco M, Gini F, Farina A (2008) Radar detection and classification of jamming signals belonging to a cone class. IEEE Transactions on Signal Processing 56: 1984–1993.
  16. Liu L, Amin M (2008) Performance analysis of gps receivers in non-gaussian noise incorporating precorrelation filter and sampling rate. IEEE Transactions on Signal Processing 56: 990–1004.
  17. Gini F (1998) A radar application of a modified Cramer-Rao bound: parameter estimation in non-Gaussian clutter. IEEE Transactions on Signal Processing 46: 1945–1953.
  18. Gini F, Michels J (1999) Performance analysis of two covariance matrix estimators in compound-Gaussian clutter. IEE Proceedings Part-F 146: 133–140.
  19. Mari DD, Kotz S (2001) Correlation and Dependence. London: Imperial College Press.
  20. Ma R, Xu W, Wang Q, Chen W (2014) Robustness analysis of three classical correlation coefficients under contaminated gaussian model. Signal Processing 104: 51–58.
  21. Pearson K (1914) On an extension of the method of correlation by grades or ranks. Biometrika 10: 416–418.
  22. Schechtman E, Yitzhaki S (1987) A measure of association based on gini's mean difference. Commun Statist-Theor Meth 16: 207–231.
  23. Xu W, Chang C, Hung YS, Kwan SK, Fung PCW (2006) Order statistic correlation coefficient and its application to association measurement of biosignals. ICASSP 2006 Proceedings 2: II–1068-1071.
  24. Xu W, Chang C, Hung YS, Kwan SK, Fung PCW (2007) Order statistics correlation coefficient as a novel association measurement with applications to biosignal analysis. IEEE Transactions on Signal Processing 55: 5552–5563.
  25. Xu W, Chang C, Hung Y, Fung P (2008) Asymptotic properties of order statistics correlation coefficient in the normal cases. IEEE Transactions on Signal Processing 56: 2239–2248.
  26. Gao ZK, Zhang XW, Jin ND, Donner R, Marwan N, et al. (Sept. 2013) Recurrence networks from multivariate signals for uncovering dynamic transitions of horizontal oil-water stratified flows. Europhysics Letters 103: 50004 (6 pp.).
  27. Gao ZK, Zhang XW, Jin ND, Marwan N, Kurths J (2013) Multivariate recurrence network analysis for characterizing horizontal oil-water two-phase flow. Phys Rev E 88: 032910.
  28. Su Z, Li L, Peng H, Kurths J, Xiao J, et al. (2014) Robustness of interrelated traffic networks to cascading failures. Scientific Reports 4: 5413.
  29. Li L, Peng H, Kurths J, Yang Y, Schellnhuber HJ (2014) Chaos–order transition in foraging behavior of ants. Proceedings of the National Academy of Sciences 111: 8392–8397.
  30. Fisher RA (1921) On the ‘probable error’ of a coefficient of correlation deduced from a small sample. Metron 1: 3–32.
  31. Stein D (1995) Detection of random signals in gaussian mixture noise. IEEE Trans Inf Theory 41: 1788–1801.
  32. Reznic Z, Zamir R, Feder M (2002) Joint source-channel coding of a gaussian mixture source over the gaussian broadcast channel. IEEE Trans Inf Theory 48: 776–781.
  33. Chen R, Wang X, Liu J (2000) Adaptive joint detection and decoding in flat-fading channels via mixture Kalman filtering. IEEE Trans Inf Theory 46: 2079–2094.
  34. Serfling RJ (2002) Approximation Theorems of Mathematical Statistics. Wiley series in probability and mathematical statistics. New York: Wiley.
  35. David H, Nagaraja H (2003) Order Statistics. Hoboken: Wiley-Interscience, 3rd edition.
  36. Balakrishnan N, Rao CR (1998) Order Statistics: Applications. Handbook of statistics; v. 17. New York: Elsevier.
  37. Balakrishnan N, Rao CR (1998) Order Statistics: Theory & Methods. Handbook of statistics; v. 16. New York: Elsevier.
  38. Stuart A, Ord JK (1994) Kendall's Advanced Theory of Statistics: Volume 1 Distribution Theory. London: Edward Arnold, 6th edition.
  39. Xu W, Hung YS, Niranjan M, Shen M (2010) Asymptotic mean and variance of Gini correlation for bivariate normal samples. IEEE Trans Signal Process 58: 522–534.
  40. Moran PAP (1948) Rank correlation and product-moment correlation. Biometrika 35: 203–206.