Abstract
This paper investigates the robustness properties of Pearson's rank-variate correlation coefficient (PRVCC) in scenarios where one channel is corrupted by impulsive noise and the other is free of impulsive noise. As shown in our previous work, these scenarios, which are frequently encountered in radar and/or sonar, can be well emulated by a particular bivariate contaminated Gaussian model (CGM). Under this CGM, we establish the asymptotic closed forms of the expectation and variance of PRVCC by means of the well-known delta method. To gain a deeper understanding, we also compare PRVCC with two other classical correlation coefficients, i.e., Spearman's rho (SR) and Kendall's tau (KT), in terms of the root mean squared error (RMSE). Monte Carlo simulations not only verify our theoretical findings, but also reveal the advantage of PRVCC through an example of estimating the time delay in this particular impulsive noise environment.
Citation: Ma R, Xu W, Zhang Y, Ye Z (2014) Asymptotic Properties of Pearson's Rank-Variate Correlation Coefficient under Contaminated Gaussian Model. PLoS ONE 9(11): e112215. https://doi.org/10.1371/journal.pone.0112215
Editor: Haipeng Peng, Beijing University, China
Received: June 9, 2014; Accepted: October 9, 2014; Published: November 13, 2014
Copyright: © 2014 Ma et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. All relevant data are within the paper and its Supporting Information files.
Funding: This work was supported in part by National Natural Science Foundation of China (Project 61271380), in part by Guangdong Natural Science Foundation (Project S2012010009870), in part by 100-Talents Scheme Funding from Guangdong University of Technology (Grant 112418006), in part by the Talent Introduction Special Funds from Guangdong Province (Grant 2050205), in part by a team project from Guangdong University of Technology (Grant GDUT2011-07), and Project Program of Key Laboratory of Guangdong Higher Education Institutes of China (Grant 2013CXZDA015). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Correlation coefficients are indices that depict the strength of the statistical relationship between two random variables obeying a joint probability distribution [1]. In general, a correlation coefficient should be large and positive if there is a high probability that large (small) values of one variable occur in conjunction with large (small) values of the other, and it should be large and negative if the direction reverses [2]. Due to their theoretical and algorithmic advantages, correlation coefficients have been widely used in many sub-areas of signal processing [3]–[18]. Among the many methods of correlation analysis used in practice, Pearson's product moment correlation coefficient (PPMCC), Kendall's tau (KT) and Spearman's rho (SR) are perhaps the most prevalent [19].
Each of these three classical coefficients has its advantages and disadvantages. PPMCC is optimal under the bivariate normal model (BNM), and is appropriate mainly for characterizing linear correlations. However, it will output misleading results if nonlinearity is involved in the data. On the other hand, the two rank-based coefficients, SR and KT, are not as powerful as PPMCC when the data follow bivariate normal distributions. Nevertheless, they are invariant under increasing monotone transformations, which makes them more suitable for many nonlinear cases in practice. Moreover, theoretical and empirical results indicate that SR and KT surpass PPMCC when data are corrupted by impulsive noise [12], [20]. Besides these three classical coefficients, other methods, such as Pearson's rank-variate correlation coefficient (PRVCC) [21], Gini correlation (GC) [22], and the order statistics correlation coefficient (OSCC) [23]–[25], among others [26]–[29], have also been proposed in the literature.
It has long been known that PPMCC is an optimal estimator of the population correlation coefficient, in the sense of unbiasedness and approaching the Cramer-Rao lower bound for large samples under bivariate normal models [30]. Despite these desirable properties, PPMCC might not be applicable when the data are corrupted by impulsive noise, that is, when the distribution of the data deviates from the BNM. Consider the following scenario, which frequently occurs in radar, sonar or communication. We have a prescribed signal, whether deterministic or stochastic, whose statistical properties are known a priori. Our purpose is to estimate the correlation between this “clean” signal and the associated distorted version from the receiver, which might be corrupted by a tiny fraction of impulsive noise (outliers with very large variance [31]–[33]). To deal with such a case, one might adopt the conventional strategy of ranking the cardinal variable(s) and then resorting to SR or KT [22], which are robust against both nonlinearity and impulsive noise [20]. However, by using only the ranks of the two variables, we unavoidably lose useful information embedded in the variates of the “clean” variable. A better strategy would be to rely on coefficients, such as PRVCC [21], that accommodate both the ordinal and the cardinal information contained in the samples. Our purpose in this work is thus to investigate the properties of the historical PRVCC by both theoretical and empirical means.
The contribution of this work is twofold. Firstly, we establish the asymptotic closed forms of the expectation and variance of PRVCC under a contaminated Gaussian model that emulates a scenario frequently encountered in practice. Secondly, we demonstrate the superiority of PRVCC over PPMCC, SR and KT, in terms of the root mean squared error, through an example of estimating the time delay in this particular impulsive noise environment. These theoretical and empirical findings might help to rejuvenate the historical PRVCC, which has long been forgotten in the literature due to insufficient understanding of its theoretical properties.
For convenience of later discussion, we employ symbols ,
,
and
to denote the mean, variance, covariance and correlation of (between) random variables, respectively. Univariate and bivariate normal distributions are denoted by
and
, respectively. The sign
reads “is approximately equal to”, whereas the sign
stands for “is defined as”. The notation
denotes that
as
[34]. The symbol
stands for the product
. Other notation will be defined where it first appears in the text.
Methods
This section presents the definitions of PRVCC as well as a particular CGM simulating the impulsive noise environment mentioned in the previous section. Some auxiliary results are also established for the subsequent theoretical analysis.
1 Definitions of PRVCC
Let and
be two random variables following a continuous bivariate distribution. Denote by
and
the marginal distributions of
and
, respectively. Then, according to a historical paper of Pearson [21], one of the population versions of PRVCC can be defined, in modern notation, by
(1)
Exchanging the roles of
and
in (1) yields the other version
Let
be
independent and identically distributed (i.i.d.) data pairs drawn from a continuous bivariate population. After rearranging
in ascending order, we get a new sequence
, which is termed the order statistics of
[35]–[37]. Suppose that
is at the
th position in the sorted sequence. The integer
is named the rank of
and is denoted by
. Let
represent the arithmetic mean of
data points
. Then, based on (1), the sample version with respect to
can be constructed as
(2)
where
is the empirical cdf of
. Substituting the relationship
into (2) along with some simplifications leads to
(3)
Exchanging the roles of
and
in (3) gives another sample version
with respect to
. Note that in general
. The choice between
and
depends on different roles played by
and
in the scenario mentioned in the previous section. To avoid redundancy, we will focus on the properties of
which is abbreviated as
in the sequel unless ambiguity occurs.
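The sample version can be illustrated with a minimal sketch. It assumes, in line with the description above, that the sample PRVCC is the ordinary Pearson correlation between the raw values (variates) of one variable and the ranks of the other. The function names `ranks` and `prvcc`, and the choice of which argument is ranked, are illustrative and not taken from the source.

```python
import numpy as np

def ranks(values):
    """Ranks 1..n of a 1-D array (assumes no ties, as for continuous data)."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(1, len(values) + 1)
    return r

def prvcc(x, y):
    """Pearson correlation between the variates of x and the ranks of y.

    Assumption: x plays the role of the variable whose values are kept,
    and y the role of the variable that is replaced by its ranks.
    """
    x = np.asarray(x, dtype=float)
    q = ranks(np.asarray(y, dtype=float))
    xc = x - x.mean()
    qc = q - q.mean()
    return float(np.sum(xc * qc) / np.sqrt(np.sum(xc**2) * np.sum(qc**2)))
```

Because only the ranks of the second argument enter the computation, the value is unchanged by any increasing transformation of that argument, whereas it still depends on the actual values of the first. For untied ranks the sum of squared centred ranks equals n(n² − 1)/12, which is a constant for a given sample size.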
2 Contaminated Gaussian Model
To simulate the specific circumstance described in the Introduction, we use throughout the following CGM, which represents the joint probability density function (pdf) of two random variables and
[20]
(4)
where
,
,
,
and
. Under this specific CGM, it is obvious that the marginal distribution of
is
, whereas the marginal distribution of
is
. In other words, under Model (4),
stands for a “clean” normal variable while
stands for a “dirty” variable corrupted by a tiny fraction of a Gaussian component with very large variance (possibly tending to infinity). In this model, the parameter
, which is the quantity of interest, is what we aim to estimate as accurately as possible, while
and
are interference parameters whose influence we seek to suppress. For the reasons why Model (4) arises in practice, see Appendix A in our previous work [20].
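Drawing samples from a model of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: zero means, a clean variable with the same normal marginal in both mixture components, and a dirty variable whose standard deviation is inflated by a factor in the contaminating component. The parameter names `rho`, `rho_c`, `lam` and `eps` are hypothetical labels for the correlation of interest, the correlation of the contaminating component, the inflation factor and the contamination fraction.

```python
import numpy as np

def sample_cgm(n, rho, rho_c, lam, eps, sigma_x=1.0, sigma_y=1.0, rng=None):
    """Draw n pairs from a two-component bivariate Gaussian mixture.

    With probability 1 - eps a pair has correlation rho and std sigma_y in y;
    with probability eps it has correlation rho_c and std lam * sigma_y in y.
    The marginal of x is N(0, sigma_x^2) in both components (assumed zero mean).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = sigma_x * rng.standard_normal(n)
    w = rng.standard_normal(n)                      # independent innovation for y
    contaminated = rng.random(n) < eps              # component label per pair
    r = np.where(contaminated, rho_c, rho)
    scale = np.where(contaminated, lam * sigma_y, sigma_y)
    y = scale * (r * x / sigma_x + np.sqrt(1.0 - r**2) * w)
    return x, y, contaminated
```

The component labels are returned only for inspection; an estimator would see `x` and `y` alone. Note that the labels here are drawn at random, whereas Theorem 1 works with fixed numbers of pairs from each component.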
3 Auxiliary Results
To establish our major results in Theorem 1 in the next subsection, we need some auxiliary results, which are summarized in Lemma 1 below.
Lemma 1.
Assume that the random vector follows a quadrivariate normal distribution with
,
, and
for
. Write
for
and
for
. Then
(5)
(7)
and
(8)
Proof. Write
for
. Then
(9)
(12)
The results of (5) and (7) follow readily by substituting the results in [39] into the right-hand sides of (9) and (11), respectively. Next we show that (6) and (8) also hold true. Let
. Then, according to [39], it follows that
(13)
where
(14)
(15)
and
(16)
It is seen that
,
and
are all subscripts of the
-terms in (14). Since, by (16), only
is non-null, then (14) and hence (13) are non-null only when the following conditions are satisfied, i.e.,
(17)
It is easy to verify that there are only four solutions to (17), namely
(18)
Substituting (18) into (13) and then using (15) and (16) produces
(19)
which, along with (10), leads to (6). By a similar argument, we have
(20)
which, together with (12), yields (8). This completes the proof of the lemma. □
4 Asymptotic Mean and Variance of PRVCC Under CGM (4)
By applying the delta method [38] with Lemma 1, we are ready to establish the closed forms of the mean and variance of PRVCC for samples generated by CGM (4).
Theorem 1.
Let be
i.i.d. data pairs drawn from a bivariate normal population
and
be
i.i.d. data pairs drawn from another bivariate normal population
. Assume that
and
are mutually independent. Write
and
. Denote by
the union of
and
. Then, as
large,
small,
,
and
, the expectation and variance of PRVCC defined in (3) are
(21)
and
(22)
Proof. From (3), it is easy to verify that
is shift invariant. Therefore, we lose no generality by assuming that
hereafter. For convenience, write
(23)
(24)
Then, by the well-known delta method [38], it follows that
(25)
and
(26)
Obviously, we only need to evaluate
,
,
,
and
in order to work out (25) and (26). By the assumptions of the theorem, both
and
follow the same normal distribution
; hence
obeys a
distribution, which means that
(27)
(28)
Using Lemma 1 with some tedious algebra, we can obtain
,
and
.
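The delta-method step can be illustrated generically. The sketch below assumes a statistic of the form A/√B, a form suggested by the five moments listed above (two means, two variances and one covariance) but not confirmed by the source, whose expressions (25) and (26) are not reproduced here. It applies a second-order Taylor expansion for the mean and a first-order expansion for the variance; the function name and argument names are hypothetical.

```python
import math

def delta_ratio_sqrt(mean_a, mean_b, var_a, var_b, cov_ab):
    """Delta-method approximations to E[A / sqrt(B)] and V[A / sqrt(B)].

    Taylor expansion of g(a, b) = a * b**-0.5 around (mean_a, mean_b):
      g_b  = -a / (2 b^1.5),  g_ab = -1 / (2 b^1.5),  g_bb = 3 a / (4 b^2.5)
    Requires mean_b > 0.
    """
    a, b = mean_a, mean_b
    # Mean: g + g_ab * Cov(A, B) + 0.5 * g_bb * Var(B)   (g_aa is zero)
    mean = a / math.sqrt(b) - cov_ab / (2.0 * b**1.5) + 3.0 * a * var_b / (8.0 * b**2.5)
    # Variance: g_a^2 Var(A) + 2 g_a g_b Cov(A, B) + g_b^2 Var(B)
    var = var_a / b - a * cov_ab / b**2 + a**2 * var_b / (4.0 * b**3)
    return mean, var
```

For instance, `delta_ratio_sqrt(2.0, 4.0, 0.5, 0.0, 0.0)` returns `(1.0, 0.125)`: with a non-random denominator the approximation reduces to the exact mean and variance of A divided by 2.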
We first derive . From the definition (23) and the relationships
[40],
(29)
Expanding and recalling that
, it follows that
(30)
Since, by definition,
is a mixture of
and
, (30) can be expanded as
(31)
which becomes (32) after some straightforward algebra with the help of (8) in Lemma 1.
(32)
Next we evaluate , which can be written as
(33)
where
(34)
(36)
and
(37)
The expression of
is easily obtained by substituting (31) into (34).
For convenience, denote by and
the two triple summations of
in (36). Then it follows that
is decomposable into eight sub-triple summations which can be further partitioned into
disjoint and exhaustive subsets that are listed in Table 1. An application of (7) to Table 1 leads directly to
Similarly, we also have
Thus
The quadruple summation in (37) can be decomposed as
(38)
Expanding (38) according to different suffixes of
and
, we obtain
sub-quadruple summations which can be further partitioned into
disjoint and exhaustive subsets. In other words,
is a summation of
integrals of the form
, i.e., the
-terms, weighted by corresponding subset cardinality, i.e., the
-terms. By substituting into (5) the corresponding parameters tabulated in Table 2 as well as exploiting the symmetry of (38), we obtain the expression of
. Substituting the expressions of
,
,
and
into (33) and tidying up lead to the expression of
in (39).
(39)
Finally we deal with . Write
(40)
and
(41)
Then
. Now
(42)
Since we have
in (32) and
in (27), the second term in (42) can be easily obtained, as
(43)
whereas the first term,
, can be written as
(44)
where
(45)
It is evident that
,
,
and
can be regarded as
-terms in Lemma 1, i.e.,
Since, as mentioned above,
is a union of
and
, and
is a union of
and
, the triple summation in (45) can be split into eight terms as
(46)
Using (6) in Lemma 1 along with the corresponding parameters tabulated in Table 3, we can work out each sub-triple summation in (46). Some straightforward algebra then leads to (47).
(47)
Substituting (27), (28), (32), (39) and (47) into (25) and (26), respectively, letting and
, and omitting
terms thereafter, we finally arrive at (21) and (22), respectively. The theorem thus follows.□
Results and Discussion
In this section we verify the correctness of Theorem 1 by Monte Carlo simulations. To gain further insight into PRVCC, we also compare it with the two classical correlation coefficients, i.e., SR and KT, in terms of RMSE. Finally, we provide an example of time-delay estimation under CGM (4). Since the theoretical results in Theorem 1 only hold true for large sample size and small
, in this section we set the sample size
and
. All samples are generated by suitable functions in the MATLAB environment. For the sake of accuracy, the number of Monte Carlo trials is set to
unless otherwise stated.
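The simulation procedure can be sketched as follows. The source used MATLAB; Python is used here purely for illustration. The sketch assumes a zero-mean two-component mixture with unit variances in the clean component, and that PRVCC correlates the variates of the clean variable with the ranks of the dirty one. All function and parameter names are hypothetical, and the sample size and number of trials must be supplied by the user, since the values used in the paper are not reproduced here.

```python
import numpy as np

def prvcc(x, y):
    """Pearson correlation between the values of x and the ranks of y."""
    q = np.argsort(np.argsort(y)).astype(float)   # ranks 0..n-1; the offset is irrelevant
    xc, qc = x - x.mean(), q - q.mean()
    return np.sum(xc * qc) / np.sqrt(np.sum(xc**2) * np.sum(qc**2))

def sample_cgm(n, rho, rho_c, lam, eps, rng):
    """Clean x ~ N(0, 1); y from a two-component mixture (assumed zero mean)."""
    x = rng.standard_normal(n)
    w = rng.standard_normal(n)
    bad = rng.random(n) < eps                     # contaminated pairs
    r = np.where(bad, rho_c, rho)
    y = np.where(bad, lam, 1.0) * (r * x + np.sqrt(1.0 - r**2) * w)
    return x, y

def monte_carlo(n, rho, rho_c, lam, eps, trials, seed=0):
    """Empirical mean and variance of PRVCC over independent trials."""
    rng = np.random.default_rng(seed)
    values = np.array([prvcc(*sample_cgm(n, rho, rho_c, lam, eps, rng))
                       for _ in range(trials)])
    return values.mean(), values.var(ddof=1)
```

As a sanity check under these assumptions, with no contamination (`eps = 0`) the empirical mean should be close to √(3/π)·ρ, the population correlation between a normal variate and the probability-integral transform of a correlated normal variate. The empirical means and variances from `monte_carlo` are the quantities one would compare with theoretical curves.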
1 Verification of Theorem 1
Figure 1 verifies the correctness of the mean of PRVCC under CGM (4) for large samples and small . Specifically, in Figure 1 we plot the simulation results (circles), the theoretical results of (21) (solid lines), and the contamination-free version (49) (dashed lines) under different combinations of
and
. Good agreement is observed between the simulation results and their theoretical counterparts. It can also be observed that the larger the contamination fraction
and the difference between
and
, the bigger the bias between
and the ideal dashed curve corresponding to
.
The number of samples is chosen as . In the upward vertical direction,
is decreasing following
respectively; whereas
corresponds to an increasing trend in the rightward horizontal direction, following
respectively. Good agreement is seen between the simulation results (circles) and the theoretical computation (solid lines) in each subplot. As a reference, the contamination-free version (49) is also plotted (see dashed curves).
Figure 2 verifies the correctness of the variance of PRVCC by plotting the simulation results (circles) and the theoretical results of (22) (solid lines) in the same scenarios as in Figure 1. For comparison, the contamination-free version (50) (dashed lines) is also included in each subplot to highlight the effects of
and
. Note that we have multiplied
by
for a better visual effect. This figure shows good agreement between the simulation results and the corresponding theoretical ones. Moreover, it is seen that when
, the curves are symmetric and the magnitude of
increases with
, especially for
large. On the other hand, when
, the curves are no longer symmetric. Specifically, for
large,
increases if
and
have opposite signs; and it decreases if
and
have the same sign. When
is fixed,
is the reversal of
.
The number of samples is chosen as . In the upward vertical direction,
is decreasing following
respectively; whereas
corresponds to an increasing trend in the rightward horizontal direction, following
respectively. Good agreement is seen between the simulation results (circles) and the theoretical computation (solid lines) in each subplot. As a reference, the contamination-free version (50) is also plotted (see dashed curves).
From these two figures, it follows that, although derived under the assumptions and
, our theoretical results established in Theorem 1 are sufficiently accurate for
as small as
and
as large as
. This means that PRVCC is applicable to many situations in practice, such as radar and biomedical engineering, where the sample size
is much larger than
and the fraction of impulsive interference
is much lower than
.
2 RMSE Comparison of PRVCC with SR and KT
To deepen the understanding of PRVCC, in this subsection we compare its performance with that of SR and KT in terms of RMSE. SR and KT are also robust under CGM (4), as shown in our previous work [20].
For fairness of comparison, some calibrations are necessary. From (21), it follows that
(51)
by which we can define an asymptotically unbiased estimator of the population correlation
, as
The other two asymptotically unbiased estimators based on KT and SR are defined as [20]
Given definitions of the
above, we can then compare their performance in terms of the popular RMSE defined by
The CGM based on (4) is set to be
where
,
and
increases from
to
by a step of
. The RMSEs are listed in Table 4, where the minima with respect to
,
and
are highlighted in bold font in a rowwise manner within each of the eight blocks. It appears that 1) all RMSEs are quite small, meaning that these three estimators perform similarly well under CGM (4); 2)
outperforms
and
in the cases that
takes values from
to medium magnitudes; 3)
outperforms
for
around
; 4)
plays an intermediate role between
and
. Note that the RMSE values for
are not shown in Table 4 due to symmetry.
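An RMSE comparison of this kind can be sketched as follows. The calibrations used here are assumptions: the classical bivariate-normal relations ρ = 2 sin(π ρ_S / 6) for SR and ρ = sin(π τ / 2) for KT, and the contamination-free relation ρ = √(π/3) · r for PRVCC. The source's own calibrated estimators, including (51), are not reproduced and may differ. RMSE is taken in its usual sense, the square root of the mean squared deviation of the estimates from the true ρ. All names and parameter values are hypothetical.

```python
import numpy as np

def ranks(v):
    return np.argsort(np.argsort(v)).astype(float)

def pearson(a, b):
    ac, bc = a - a.mean(), b - b.mean()
    return np.sum(ac * bc) / np.sqrt(np.sum(ac**2) * np.sum(bc**2))

def kendall(x, y):
    """Kendall's tau from all ordered pairs, O(n^2) time and memory (no ties assumed)."""
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    n = len(x)
    return np.sum(sx * sy) / (n * (n - 1))

def rmse(estimates, truth):
    e = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((e - truth) ** 2)))

def sample_cgm(n, rho, rho_c, lam, eps, rng):
    """Clean x ~ N(0, 1); y from a two-component mixture (assumed zero mean)."""
    x, w = rng.standard_normal(n), rng.standard_normal(n)
    bad = rng.random(n) < eps
    r = np.where(bad, rho_c, rho)
    return x, np.where(bad, lam, 1.0) * (r * x + np.sqrt(1.0 - r**2) * w)

def compare(n, rho, rho_c, lam, eps, trials, seed=0):
    """RMSE of three calibrated estimators of rho (normal-theory calibrations)."""
    rng = np.random.default_rng(seed)
    est = {"PRVCC": [], "SR": [], "KT": []}
    for _ in range(trials):
        x, y = sample_cgm(n, rho, rho_c, lam, eps, rng)
        est["PRVCC"].append(np.sqrt(np.pi / 3.0) * pearson(x, ranks(y)))
        est["SR"].append(2.0 * np.sin(np.pi * pearson(ranks(x), ranks(y)) / 6.0))
        est["KT"].append(np.sin(np.pi * kendall(x, y) / 2.0))
    return {name: rmse(vals, rho) for name, vals in est.items()}
```

A call such as `compare(200, 0.5, 0.0, 100.0, 0.02, 500)` returns a dictionary of three RMSE values for one parameter setting; sweeping `rho` over a grid would give one row of a table per setting. Because these calibrations ignore contamination, the resulting numbers are not expected to match Table 4.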
3 Example of Time-Delay Estimation
As remarked in the Introduction, in radar, sonar or communication we often need to estimate the correlation between a prescribed “clean” signal and a distorted version corrupted by impulsive noise. We now provide an example of time-delay estimation that resembles this situation. In this example, the prescribed clean signal is a segment of a sinusoidal wave
whereas the corrupted signal is
with
being a white contaminated Gaussian noise following the distribution of
(52)where
and
. The time-delay
is set to be
ms. Our purpose is to estimate
as accurately as possible under various signal-to-noise ratios
. As illustrated in Figure 3, the procedure of estimating
includes two steps. The first one is to construct a correlation function that corresponds to
by each of
,
and
with respect to
and
. The second one is to locate time-shift
corresponding to the maximum of the correlation function. The value of
is considered to be an estimate of
and stored for further analysis. Note that the number of Monte Carlo trials in this study is set to
.
The time-shift with respect to the maximum of the correlation function in the bottom panel is considered as an estimate of the true time-delay
.
Table 5 summarizes the estimates of from
,
and
. It is observed that all three methods produce acceptable estimates of
, for an SNR even as low as
dB. However, PRVCC is slightly better than the other two, in the sense of giving smaller biases and standard deviations in most cases.
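The two-step procedure can be sketched for PRVCC as follows. All numerical settings (sampling rate, sinusoid frequency, segment length, true delay, noise parameters) are hypothetical placeholders, since the values used in the paper are not reproduced here. The noise is drawn from a two-component zero-mean Gaussian mixture, which is an assumed form of the contaminated Gaussian noise.

```python
import numpy as np

def prvcc(x, y):
    """Pearson correlation between the values of x and the ranks of y."""
    q = np.argsort(np.argsort(y)).astype(float)
    xc, qc = x - x.mean(), q - q.mean()
    return np.sum(xc * qc) / np.sqrt(np.sum(xc**2) * np.sum(qc**2))

def contaminated_noise(n, sigma, lam, eps, rng):
    """White noise: N(0, sigma^2) with prob. 1 - eps, N(0, (lam*sigma)^2) with prob. eps."""
    impulsive = rng.random(n) < eps
    return np.where(impulsive, lam * sigma, sigma) * rng.standard_normal(n)

def estimate_delay(template, received, max_lag):
    """Step 1: correlation function over candidate lags. Step 2: lag of its maximum."""
    n = len(template)
    corr = np.array([prvcc(template, received[k:k + n]) for k in range(max_lag + 1)])
    return int(np.argmax(corr)), corr

def run_trial(sigma, lam, eps, seed, fs=1000.0, f0=37.0, n=400, true_lag=8, max_lag=20):
    """One trial with hypothetical settings; lags are in samples (1 sample = 1/fs s)."""
    rng = np.random.default_rng(seed)
    template = np.sin(2.0 * np.pi * f0 * np.arange(n) / fs)   # clean sinusoidal segment
    received = contaminated_noise(n + max_lag, sigma, lam, eps, rng)
    received[true_lag:true_lag + n] += template               # delayed copy in noise
    return estimate_delay(template, received, max_lag)
```

The clean template keeps its values while each received window is reduced to ranks, so isolated impulses in the received signal can only move a few ranks rather than dominate the sum. The search range `max_lag` is kept below one period of the sinusoid in this sketch to avoid ambiguous peaks. Repeating `run_trial` over many seeds and summarizing the estimated lags would give the bias and standard deviation of the delay estimate.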
Conclusions
This paper systematically investigates the statistical properties of the historical PRVCC under a particular contaminated Gaussian model. As shown in our previous work [20], this model reasonably simulates some frequently encountered scenarios where one variable is clean and the other is corrupted by a tiny fraction of impulsive noise with very large variance. Under this model, we establish the asymptotic closed forms of the expectation and variance of PRVCC by means of the well-known delta method. To gain further insight into PRVCC, we also compare it with two other classical correlation coefficients, i.e., SR and KT, in terms of the popular RMSE. Monte Carlo simulations not only verify our theoretical findings, but also reveal the strengths and weaknesses of PRVCC in various situations. The theoretical and empirical findings in this work are believed to add new knowledge to the area of correlation analysis, which prevails in many branches of science and engineering.
Acknowledgments
The authors would like to thank the Academic Editor Prof. Haipeng Peng and the three anonymous reviewers for their insightful and constructive suggestions.
Author Contributions
Conceived and designed the experiments: RM WX. Performed the experiments: RM WX. Analyzed the data: RM WX. Contributed reagents/materials/analysis tools: RM WX YZ ZY. Wrote the paper: RM WX.
References
- 1. Kendall M, Gibbons JD (1990) Rank Correlation Methods. New York: Oxford University Press, 5th edition.
- 2. Gibbons JD, Chakraborti S (1992) Nonparametric Statistical Inference. New York: M. Dekker, 3rd edition.
- 3. Jacovitti G, Cusani R (1992) Performance of normalized correlation estimators for complex processes. IEEE Transactions on Signal Processing 40: 114–128.
- 4. Delmas J, Abeida H (2009) Asymptotic distribution of circularity coefficients estimate of complex random variables. Signal Processing 89: 2670–2675.
- 5. Chorti A, Hristopulos DT (2008) Nonparametric identification of anisotropic (elliptic) correlations in spatially distributed data sets. IEEE Transactions on Signal Processing 56: 4738–4751.
- 6. Chaux C, Duval L, Benazza-Benyahia A, Pesquet JC (2008) A nonlinear Stein-based estimator for multichannel image denoising. IEEE Transactions on Signal Processing 56: 3855–3870.
- 7. Tao R, Zhang F, Wang Y (2008) Fractional power spectrum. IEEE Transactions on Signal Processing 56: 4199–4206.
- 8. Xu W, Zhao C, Ding Z (2009) Limited feedback multiuser scheduling of spatially correlated broadcast channels. Vehicular Technology, IEEE Transactions on 58: 4406–4418.
- 9. Girault J, Kouamé D, Ouahabi A (2010) Analytical formulation of the fractal dimension of filtered stochastic signals. Signal Processing 90: 2690–2697.
- 10. Lian J, Garner G, Muessig D, Lang V (2010) A simple method to quantify the morphological similarity between signals. Signal Processing 90: 684–688.
- 11. Li J, Chen X, He Z (2013) Adaptive stochastic resonance method for impact signal detection based on sliding window. Mechanical Systems and Signal Processing 36: 240–255.
- 12. Xu W, Hou Y, Hung Y, Zou Y (2013) A comparative analysis of Spearman's rho and Kendall's tau in normal and contaminated normal models. Signal Processing 93: 261–276.
- 13. Axehill D, Gunnarsson F, Hansson A (2008) A low-complexity high-performance preprocessing algorithm for multiuser detection using gold sequences. IEEE Transactions on Signal Processing 56: 4377–4385.
- 14. Beko M, Xavier J, Barroso V (2008) Further results on the capacity and error probability analysis of noncoherent MIMO systems in the low SNR regime. IEEE Transactions on Signal Processing 56: 2915–2930.
- 15. Greco M, Gini F, Farina A (2008) Radar detection and classification of jamming signals belonging to a cone class. IEEE Transactions on Signal Processing 56: 1984–1993.
- 16. Liu L, Amin M (2008) Performance analysis of GPS receivers in non-Gaussian noise incorporating precorrelation filter and sampling rate. IEEE Transactions on Signal Processing 56: 990–1004.
- 17. Gini F (1998) A radar application of a modified Cramer-Rao bound: parameter estimation in non-Gaussian clutter. IEEE Transactions on Signal Processing 46: 1945–1953.
- 18. Gini F, Michels J (1999) Performance analysis of two covariance matrix estimators in compound-Gaussian clutter. IEE Proceedings Part-F 146: 133–140.
- 19. Mari DD, Kotz S (2001) Correlation and Dependence. London: Imperial College Press.
- 20. Ma R, Xu W, Wang Q, Chen W (2014) Robustness analysis of three classical correlation coefficients under contaminated Gaussian model. Signal Processing 104: 51–58.
- 21. Pearson K (1914) On an extension of the method of correlation by grades or ranks. Biometrika 10: 416–418.
- 22. Schechtman E, Yitzhaki S (1987) A measure of association based on Gini's mean difference. Commun Statist-Theor Meth 16: 207–231.
- 23. Xu W, Chang C, Hung YS, Kwan SK, Fung PCW (2006) Order statistic correlation coefficient and its application to association measurement of biosignals. ICASSP 2006 Proceedings 2: II–1068-1071.
- 24. Xu W, Chang C, Hung YS, Kwan SK, Fung PCW (2007) Order statistics correlation coefficient as a novel association measurement with applications to biosignal analysis. IEEE Transactions on Signal Processing 55: 5552–5563.
- 25. Xu W, Chang C, Hung Y, Fung P (2008) Asymptotic properties of order statistics correlation coefficient in the normal cases. IEEE Transactions on Signal Processing 56: 2239–2248.
- 26. Gao ZK, Zhang XW, Jin ND, Donner R, Marwan N, et al. (2013) Recurrence networks from multivariate signals for uncovering dynamic transitions of horizontal oil-water stratified flows. Europhysics Letters 103: 50004.
- 27. Gao ZK, Zhang XW, Jin ND, Marwan N, Kurths J (2013) Multivariate recurrence network analysis for characterizing horizontal oil-water two-phase flow. Phys Rev E 88: 032910.
- 28. Su Z, Li L, Peng H, Kurths J, Xiao J, et al. (2014) Robustness of interrelated traffic networks to cascading failures. Scientific Reports 4: 5413.
- 29. Li L, Peng H, Kurths J, Yang Y, Schellnhuber HJ (2014) Chaos–order transition in foraging behavior of ants. Proceedings of the National Academy of Sciences 111: 8392–8397.
- 30. Fisher RA (1921) On the ‘probable error’ of a coefficient of correlation deduced from a small sample. Metron 1: 3–32.
- 31. Stein D (1995) Detection of random signals in Gaussian mixture noise. IEEE Trans Inf Theory 41: 1788–1801.
- 32. Reznic Z, Zamir R, Feder M (2002) Joint source-channel coding of a Gaussian mixture source over the Gaussian broadcast channel. IEEE Trans Inf Theory 48: 776–781.
- 33. Chen R, Wang X, Liu J (2000) Adaptive joint detection and decoding in flat-fading channels via mixture Kalman filtering. IEEE Trans Inf Theory 46: 2079–2094.
- 34. Serfling RJ (2002) Approximation Theorems of Mathematical Statistics. Wiley series in probability and mathematical statistics. New York: Wiley.
- 35. David H, Nagaraja H (2003) Order Statistics. Hoboken: Wiley-Interscience, 3rd edition.
- 36. Balakrishnan N, Rao CR (1998) Order Statistics: Applications. Handbook of statistics; v. 17. New York: Elsevier.
- 37. Balakrishnan N, Rao CR (1998) Order Statistics: Theory & Methods. Handbook of statistics; v. 16. New York: Elsevier.
- 38. Stuart A, Ord JK (1994) Kendall's Advanced Theory of Statistics: Volume 1 Distribution Theory. London: Edward Arnold, 6th edition.
- 39. Xu W, Hung YS, Niranjan M, Shen M (2010) Asymptotic mean and variance of Gini correlation for bivariate normal samples. IEEE Trans Signal Process 58: 522–534.
- 40. Moran PAP (1948) Rank correlation and product-moment correlation. Biometrika 35: 203–206.