Asymptotic Properties of Spearman’s Rank Correlation for Variables with Finite Support

The asymptotic variance and distribution of Spearman's rank correlation have previously been known only under independence. For variables with finite support, the population version of Spearman's rank correlation has been derived. Using this result, we show convergence to a normal distribution irrespective of the dependence structure, and derive the asymptotic variance. A small simulation study indicates that the asymptotic properties are of practical importance.


Introduction
A common question when looking at new data is "Does Y tend to increase when X increases?" When X and Y are ordinal, the nonparametric Spearman's sample rank correlation, r_s, is frequently used to measure the association.
Spearman originally considered the situation where a small group of individuals is rated on two separate tasks [1]. His question was whether there existed an association between an individual's two ratings. As r_s is defined as the sample correlation of the ranks of two variables, this question translates to whether r_s is significantly different from zero. In cases with no ties, r_s follows a normal distribution under independence [2]. In practice, r_s is often used not for ratings but for Likert-type survey variables that take only a few values. When both variables are discrete with only a few categories, the bias from not taking ties into account can become considerable with increasing sample size. In addition, the question of interest often concerns not only whether an association exists but also its size. For example, the association between smoking and lung function has been heavily researched during the last half century. Both smoking and lung function are typically measured in categories, and the question of interest has over time shifted from whether smoking decreases lung function to the extent of the impact. In such cases, when ties cannot be disregarded or the research question is not posed against independence, an asymptotic distribution has been lacking ([3], p. 7904).
The focus of this paper is on the properties of r_s when used as a measure of association between variables with finite support. [4] constructed a population version of Spearman's rho for discrete variables, ρ_s. In this article, we apply Nešlehová's results to the sample version of Spearman's rank correlation, deriving its asymptotic properties and demonstrating the importance of Nešlehová's work to statistics.
In the next section we introduce ρ_s and r_s for discrete variables with finite support. In Section 3 we derive the asymptotic properties of r_s. Section 4 presents simulation results and some empirical examples. A conclusion ends the paper.

Definitions
We are interested in the case where X and Y are discrete random variables with probability mass functions p_i = P(X = i) and q_j = P(Y = j), with finite support i \in \{1, \ldots, I\} and j \in \{1, \ldots, J\}, I, J \in [2, \infty). Spearman's sample rank correlation is typically seen in the following form,

r_s = \frac{\sum_{i=1}^{n} (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum_{i=1}^{n} (R_i - \bar{R})^2 \sum_{i=1}^{n} (S_i - \bar{S})^2}},    (1)

where n denotes the sample size, R_i = rank X_i, S_i = rank Y_i, and \bar{R} and \bar{S} denote the corresponding mean ranks.

Prior to Nešlehová's work, Spearman's sample correlation did not have a population version. In this section we present Nešlehová's population version of Spearman's rank correlation for variables that take a finite number of values [4]. In such cases, the relation between X and Y can be represented in a contingency table, and ρ_s can be written as a function of the cell probabilities. We denote the joint probability mass function h_{ij} = P(X = i \cap Y = j). Then p_i = \sum_{j=1}^{J} h_{ij} and q_j = \sum_{i=1}^{I} h_{ij}. The cumulative marginal distribution functions are F_i = \sum_{k=1}^{i} p_k and G_j = \sum_{k=1}^{j} q_k, respectively. Following ([5], p. 94-95), the Spearman rank correlation for such variables is

\rho_s = \frac{3 \sum_{i=1}^{I} \sum_{j=1}^{J} h_{ij} (F_i + F_{i-1} - 1)(G_j + G_{j-1} - 1)}{\sqrt{\left(1 - \sum_{i=1}^{I} p_i^3\right)\left(1 - \sum_{j=1}^{J} q_j^3\right)}},    (2)

with F_0 = G_0 = 0. ρ_s is defined for cases with at least some variation in both X and Y, so that \sum_{j=1}^{J} q_j^3 < 1 and \sum_{i=1}^{I} p_i^3 < 1. We denote the empirical marginal distribution functions by \hat{F} and \hat{G}, the estimated cell proportion in cell (i, j) by \hat{h}_{ij}, and let \hat{p}_i = \sum_{j=1}^{J} \hat{h}_{ij} and \hat{q}_j = \sum_{i=1}^{I} \hat{h}_{ij}. It turns out that the sample version of ρ_s, obtained by replacing the probabilities in Eq (2) with their estimated counterparts, equals the standard Spearman's sample correlation. We thus have a second available expression of r_s ([4], p. 564),

r_s = \frac{3 \sum_{i=1}^{I} \sum_{j=1}^{J} \hat{h}_{ij} (\hat{F}_i + \hat{F}_{i-1} - 1)(\hat{G}_j + \hat{G}_{j-1} - 1)}{\sqrt{\left(1 - \sum_{i=1}^{I} \hat{p}_i^3\right)\left(1 - \sum_{j=1}^{J} \hat{q}_j^3\right)}}.    (3)

Asymptotic properties of r_s

In this section we use the definitions presented above and apply the delta theorem to derive consistency, asymptotic unbiasedness, and asymptotic normality of r_s between variables with finite support.
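The population version of Spearman's rank correlation is a simple function of the contingency-table probabilities and is straightforward to compute numerically. The sketch below assumes the grade-correlation form of Nešlehová's population version for finite support; the function name `spearman_population` is ours, and the code is an illustration rather than the paper's implementation.

```python
import numpy as np

def spearman_population(h):
    """Population Spearman's rho for an I x J joint pmf h.

    Assumes the grade-correlation form of the population version:
    rho_s = 3 * sum_ij h_ij (F_i + F_{i-1} - 1)(G_j + G_{j-1} - 1)
            / sqrt((1 - sum_i p_i^3) * (1 - sum_j q_j^3)).
    """
    h = np.asarray(h, dtype=float)
    p = h.sum(axis=1)                                # marginal pmf of X
    q = h.sum(axis=0)                                # marginal pmf of Y
    F, G = np.cumsum(p), np.cumsum(q)                # marginal cdfs
    u = F + np.concatenate(([0.0], F[:-1])) - 1.0    # F_i + F_{i-1} - 1
    v = G + np.concatenate(([0.0], G[:-1])) - 1.0    # G_j + G_{j-1} - 1
    num = 3.0 * np.sum(h * np.outer(u, v))
    den = np.sqrt((1.0 - np.sum(p**3)) * (1.0 - np.sum(q**3)))
    return num / den
```

Under independence (h equal to the outer product of its marginals) the numerator vanishes and ρ_s = 0, while a diagonal table with equal masses gives ρ_s = 1.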
As \sum_{i=1}^{I} \sum_{j=1}^{J} h_{ij} = 1, there are only IJ - 1 unique probabilities, and we write h for the vector of the first IJ - 1 cell probabilities.

Theorem. If X and Y are discrete random variables with finite support, ρ_s is as defined in Eq (2), the gradient of ρ_s with respect to h is denoted by \dot{\rho}_s, and the covariance matrix of h is denoted by Σ, then

\sqrt{N}(r_s - \rho_s) \xrightarrow{d} N(0, \dot{\rho}_s^T \Sigma \dot{\rho}_s).    (4)

Proof. As shown by ([6], p. 419), \sqrt{N}(\hat{h}_{IJ} - h_{IJ}), where h_{IJ} denotes the full vector of all IJ cell probabilities, converges in distribution to a singular multivariate normal distribution with mean zero, covariance matrix diag(h_{IJ}) - h_{IJ} h_{IJ}^T, and rank IJ - 1. It follows that \hat{h} converges in probability to h. This implies that \sqrt{N}(\hat{h} - h) converges in distribution to a nondegenerate multivariate normal distribution with mean zero and covariance matrix \Sigma = diag(h) - h h^T. As all terms in Eq (2) are functions of h, ρ_s can be consistently estimated from the cell proportions.
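The delta-theorem variance can be evaluated numerically once ρ_s is available as a function of the free cell probabilities. The sketch below is our own illustration, not the paper's code: it assumes the grade-correlation form of the population ρ_s and uses a finite-difference gradient in place of the analytic one.

```python
import numpy as np

def rho_s(h):
    # population Spearman's rho from a joint pmf (grade-correlation form; assumed)
    p, q = h.sum(axis=1), h.sum(axis=0)
    F, G = np.cumsum(p), np.cumsum(q)
    u = F + np.concatenate(([0.0], F[:-1])) - 1.0
    v = G + np.concatenate(([0.0], G[:-1])) - 1.0
    return 3.0 * np.sum(h * np.outer(u, v)) / np.sqrt(
        (1.0 - np.sum(p**3)) * (1.0 - np.sum(q**3)))

def delta_variance(h, eps=1e-6):
    """Asymptotic variance of sqrt(N)(r_s - rho_s) via the delta theorem.

    h is the I x J joint pmf; cell (I, J) is treated as dependent
    (one minus the others), so the gradient is taken over the IJ - 1
    free cells and Sigma = diag(h) - h h^T is nondegenerate.
    """
    h = np.asarray(h, dtype=float)
    shape = h.shape
    free = h.ravel()[:-1]                      # the IJ - 1 free probabilities
    def f(x):
        return rho_s(np.append(x, 1.0 - x.sum()).reshape(shape))
    grad = np.empty_like(free)
    for k in range(free.size):                 # central finite differences
        e = np.zeros_like(free)
        e[k] = eps
        grad[k] = (f(free + e) - f(free - e)) / (2.0 * eps)
    Sigma = np.diag(free) - np.outer(free, free)   # multinomial covariance
    return float(grad @ Sigma @ grad)
```

For an estimated variance of r_s itself, divide by the sample size N and plug in the observed cell proportions ĥ.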
Next, we show that ρ_s is continuous with continuous first partial derivatives. Denote the separate terms of ρ_s as follows:

A = 3 \sum_{i=1}^{I} \sum_{j=1}^{J} h_{ij} (F_i + F_{i-1} - 1)(G_j + G_{j-1} - 1),
B_1 = 1 - \sum_{i=1}^{I} p_i^3,    B_2 = 1 - \sum_{j=1}^{J} q_j^3.

Then ρ_s = A (B_1 B_2)^{-1/2}. Since \sum_{j=1}^{J} q_j^3 < 1 and \sum_{i=1}^{I} p_i^3 < 1, we have that 0 < B_k < 1 for all k. A and the B_k are simple functions of h, involving no division. Therefore, r_s is smooth with respect to \hat{h}, implying that application of the delta theorem to r_s is straightforward. We thus conclude that r_s converges to the distribution given in Eq (4).
For construction of the asymptotic covariance matrix, the gradient \dot{\rho}_s is given below,
\dot{\rho}_s = \dot{A} (B_1 B_2)^{-1/2} - \frac{A}{2} (B_1 B_2)^{-3/2} \dot{B},

where \dot{A} = \partial A / \partial h^T, \dot{B} = \partial B / \partial h^T with B = B_1 B_2, and, for all (r, s) \neq (I, J),

\partial B / \partial h_{rs} = 3 B_2 (p_I^2 - p_r^2) + 3 B_1 (q_J^2 - q_s^2).

Simulation results and empirical examples

In Table 1 the results from the Monte Carlo simulation are shown. In addition, we ran the simulation generating data from a bivariate normal distribution with correlation 0.95; the results from this simulation are consistent with those presented. Columns one and two show the bias and mean square error of r_s. From a practical perspective the bias is very close to zero. As the bias is close to zero, the MSE is essentially the variance, and, as could be expected, the MSE is halved when the sample size is doubled. One way to analyze the normality of a statistic is to perform a simple z-test at, e.g., the 5% level. If the normality assumption is true, we would expect the rejection rate to be 5%. A 95% confidence interval for a proportion of 0.05 is 0.047-0.053 for 20000 replicates, so observed proportions outside this interval would indicate a departure from normality. In this part of the simulation we compare the asymptotic estimator with two other estimation strategies: the large-sample approximation suggested by [7], available through, e.g., MATLAB's function corr, and the empirical bootstrap. As the corr function returns the p-value rather than the variance, the variance is solved for from the formula of the z-statistic. The comparison with MATLAB's built-in function is chosen because it is easily available and therefore commonly used. However, this approximation disregards ties and is valid only under independence. We also analyzed other approximations from the literature; they all rely on the independence assumption as well as the assumption of continuous distributions, and they perform similarly to each other. Therefore, only the results from MATLAB's built-in function are shown.
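The rejection-rate check described above can be sketched as follows. This is our own minimal recreation, not the paper's simulation code: the design (three categories per variable, underlying normal correlation 0.5, n = 200, 2000 replicates) and all names are illustrative, and r_s is computed with midranks so that ties are handled.

```python
import numpy as np

def midranks(x):
    # average ranks for ties, via cumulative counts of sorted unique values
    _, inv, cnt = np.unique(x, return_inverse=True, return_counts=True)
    upper = np.cumsum(cnt)                 # highest rank within each tie group
    return (upper - (cnt - 1) / 2.0)[inv]  # midrank of each tie group

def spearman_r(x, y):
    # Spearman's r_s = Pearson correlation of the (mid)ranks
    rx, ry = midranks(x), midranks(y)
    rx, ry = rx - rx.mean(), ry - ry.mean()
    return (rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry))

rng = np.random.default_rng(1)
n, reps, rho = 200, 2000, 0.5
cuts = [-0.5, 0.5]                         # discretize into 3 categories
stats = np.empty(reps)
for r in range(reps):
    z1 = rng.standard_normal(n)
    z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    stats[r] = spearman_r(np.digitize(z1, cuts), np.digitize(z2, cuts))

# standardize by the Monte Carlo mean and sd; under approximate normality
# the two-sided 5% z-test should reject close to 5% of the time
z = (stats - stats.mean()) / stats.std(ddof=1)
rate = np.mean(np.abs(z) > 1.96)
```

A rejection rate close to 0.05 is consistent with an approximately normal small-sample distribution of r_s.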
The bootstrap comparison is chosen because the bootstrap tends to perform well and, although somewhat more complicated and computationally demanding, is typically a good choice when a closed form for the variance is lacking.
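A minimal sketch of the empirical bootstrap variance for r_s: resample (X, Y) pairs with replacement and take the variance of the replicated statistics. The function names and the choice of B = 1000 replicates are ours.

```python
import numpy as np

def midranks(x):
    # average ranks within tie groups
    _, inv, cnt = np.unique(x, return_inverse=True, return_counts=True)
    upper = np.cumsum(cnt)
    return (upper - (cnt - 1) / 2.0)[inv]

def spearman_r(x, y):
    rx, ry = midranks(x), midranks(y)
    rx, ry = rx - rx.mean(), ry - ry.mean()
    return (rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry))

def bootstrap_variance(x, y, B=1000, seed=0):
    """Empirical-bootstrap variance of r_s: resample pairs with replacement."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)
        reps[b] = spearman_r(x[idx], y[idx])
    return reps.var(ddof=1)
```

A degenerate resample (all observations in a single category of one margin) would leave r_s undefined; for the table sizes and sample sizes considered here this is only a practical concern at very small n.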
From row three in Table 1 we see that the rejection rate based on the asymptotic variance is within the interval, with good margin, for sample sizes larger than 400, indicating that normality, while an asymptotic property, is a good approximation for r_s from moderate sample sizes. The variance estimators used for comparison all relate to the identical point estimate. From row four we see that violating the assumptions of independent and continuous observations has a severe impact on the results: MATLAB's built-in function performs poorly and does not improve with increasing sample size. The results from the bootstrap estimator (row five) are within the desired range by sample size 100, indicating that for small sample sizes the bootstrap seems to be the best choice of variance estimator. A kernel density estimate of the small-sample distribution for sample size 50 is shown in Fig 1, with a standard normal distribution as reference. The distribution of r_s seems to be fairly well approximated by the normal distribution, although the empirical distribution has a slight negative skew. This deviation from normality is much smaller for n = 100, and the distributions for larger samples are very well approximated by the normal distribution. Due to space limitations, only n = 50 is displayed.
In the next step of the simulation study, we compare the power of the estimators. Variables are generated with the same characteristics as previously, but the correlation of the underlying continuous variables is now set to 0.55 and 0.65, yielding population rank correlations ρ_s of 0.4695 and 0.5608, respectively. The results are shown in Table 2. When the true rank correlation is 0.4695, no estimator exceeds a power of 0.36, even with a sample size of 800. When the true rank correlation is 0.5608, a larger difference from the null, the asymptotic estimator has a power of about 0.5 with a sample size of 100 and 0.95 with a sample size of 400. The asymptotic estimator consistently outperforms the bootstrap, but the difference is small and at least partly due to the bootstrap estimator's somewhat lower rejection rates. Turning to MATLAB's built-in function, the results in Table 2 underscore those in Table 1 in showing that this type of estimator should not be used for purposes other than testing against ρ_s = 0.
We illustrate the performance of the three different types of estimators with empirical examples taken from [8]. The results are shown in Table 3. The purpose is to give examples of the practical implications of the asymptotic variance derived above (V_A), the bootstrap (V_B), and MATLAB's built-in approximation (V_M). I and J represent the number of values that X and Y can take, respectively, and n gives the sample size. The sizes of the contingency tables and the sample sizes are what is commonly encountered in empirical applications, and the examples are from various fields: 2.4) income and job satisfaction, 2.11) inheritance of political views, 3.2) primary and secondary pneumonia infection in calves, 8.10) smoking and lung function. The most striking result is that the asymptotic variance and the bootstrap estimates perform similarly, while V_M differs considerably. Returning to the correlation between smoking and decreased lung function (8.10), in the chosen example we have a point estimate of 0.24. Using our derived variance, the 95 percent confidence interval is (0.18; 0.30). The bootstrap estimate similarly returns a confidence interval of (0.18; 0.30), while the approximation assuming independence and no ties returns the wider interval (0.16; 0.32). One could imagine a policy assigning regulations to substances depending on their established correlation with lung disease. For this, a hypothesis test with a null hypothesis corresponding to the relevant threshold would be needed. In this case, the use of a biased variance estimator would lead to an overestimation of uncertainty, with delayed health regulation as a potential consequence.
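The intervals above are ordinary Wald intervals built from a point estimate and an estimated variance. A minimal sketch, where `avar` denotes the estimated variance of \sqrt{n}(r_s - \rho_s); the numeric values of `avar` and `n` below are made-up placeholders for illustration, not the values behind Table 3.

```python
def wald_ci(r, avar, n, z=1.96):
    """95% Wald interval for rho_s, given the estimated asymptotic
    variance avar of sqrt(n) * (r_s - rho_s)."""
    se = (avar / n) ** 0.5
    return r - z * se, r + z * se

# point estimate 0.24 as in example 8.10; avar and n are hypothetical
lo, hi = wald_ci(r=0.24, avar=0.29, n=300)
```

Replacing `avar` with the bootstrap or any other variance estimate gives the corresponding interval, which is how the three estimators in Table 3 are compared.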

Conclusion
Using Nešlehová's population version of Spearman's rho, we have shown that Spearman's sample correlation has desirable asymptotic properties when applied to discrete variables. In particular, we have shown that r_s is consistent and asymptotically normal, and we have derived the asymptotic variance. Simulation results on both rejection rates and power indicate that the asymptotic variance performs as well as the bootstrap for sample sizes from 400, allowing for easy construction of confidence intervals when Spearman's correlation is used. For moderate to large sample sizes, the derived asymptotic variance combines the ease of a closed-form statistic with performance on par with the bootstrap. In addition, the existence of a closed-form asymptotic variance suitable for practical applications means that the potential uses of Spearman's rank correlation in the construction of other estimators have increased.

Acknowledgments
We would like to thank the referees for valuable comments.