A Novel Test for Independence Derived from an Exact Distribution of ith Nearest Neighbours

Dependence measures and tests for independence have recently attracted a lot of attention, because they are the cornerstone of algorithms for network inference in probabilistic graphical models. Pearson's product moment correlation coefficient is still by far the most widely used statistic yet it is largely constrained to detecting linear relationships. In this work we provide an exact formula for the th nearest neighbor distance distribution of rank-transformed data. Based on that, we propose two novel tests for independence. An implementation of these tests, together with a general benchmark framework for independence testing, are freely available as a CRAN software package (http://cran.r-project.org/web/packages/knnIndep). In this paper we have benchmarked Pearson's correlation, Hoeffding's , dcor, Kraskov's estimator for mutual information, maximal information criterion and our two tests. We conclude that no particular method is generally superior to all other methods. However, dcor and Hoeffding's are the most powerful tests for many different types of dependence.


Introduction
Dependence measures and tests for independence have recently attracted a lot of attention, because they are the cornerstone of algorithms for network inference in probabilistic graphical models.Pearson's product moment correlation coefficient is still by far the most widely used statistic in areas such as economy, biology and the social sciences.Yet Pearson's correlation is largely constrained to detecting linear relationships.Spearman [1] and Kendall [2] extended Pearson's work to monotonic dependencies.In 1948, Hoeffding [3] proposed a non parametric test for independence that is suited for many different functional relationships.Sze ´kely et al. [4] introduced the distance correlation (dcor) as a generalization of Pearson's correlation.
Other approaches build on mutual information (MI).MI characterizes independence in the sense that the MI of a joint distribution of two variables is zero if and only if these variables are independent.However, MI is difficult to estimate from finite samples.Kraskov et al. [5] proposed an accurate MI estimator derived from nearest neighbor distances.Reshef et al. [6] presented the maximal information coefficient (MIC), a measure of dependence for two-variable relationships which was heavily advertised [7] but lacks any statistical motivation.
dcor and Kraskov's estimator use the pair-wise distances of the points in a sample as a sufficient statistic.In this work we provide an exact formula for the ith nearest neighbor distance distribution of rank-transformed data (i~1,2,:::).Based on that, we propose two novel tests for independence.An implementation of these tests, together with a general benchmark framework for independence testing, are freely available as a CRAN software package (http://cran.r-project.org/web/packages/knnIndep).In this paper we have benchmarked Pearson's correlation, Hoeffding's D, dcor, Kraskov's estimator for MI, MIC and our two tests.We conclude that no particular method is generally superior to all other methods.However, dcor and Hoeffding's D are the most powerful tests for many different types of dependence.Circular dependencies are best recognized by our tests.This type of dependence is fairly common, e.g., if two dependent periodic processes are monitored.An example from biology is the expression of a transcription factor and one of its target genes during the cell cycle [8].

Exact distribution of the th nearest neighbour distances
Consider a set of N §4 points that are distributed 'randomly' on a surface.In what follows, we derive the distribution (conditional distribution) of the (iz1)th nearest neighbor of a point (given the distance to its previous neighbors).We assume the points drawn from the following model: Let X ~(x j ) j~1,:::,N and Y ~(y j ) j~1,:::,N be permutations of the numbers 0,:::,N{1 that are drawn uniformly from the set of all permutations of f0,:::,N{1g.The i i points z j ~(x j ,y j ), j~1,:::,N, lie on a torus of size N which is endowed with the maximum distance as a metric.I.e., the distance between two points is given by dist(z 1 ,z 2 )~max ( min (jx 1 {x 2 j,N{jx 1 {x 2 j), min (jy 1 {y 2 j,N{jy 1 {y 2 j)) Fix a reference point z 1 .Let d i , i~1,:::,N{1 denote the distance of the i-th nearest neighbor of z 1 to z 1 and D i the random variable associated with it.Since this distance measure is translation invariant, let without loss z 1 ~(x 1 ,y 1 )~(0,0).Importantly, all points z j have pairwise different x j and y j .A point at distance d~dist(z,(0,0)) to the origin must have at least one of its coordinates equal to d or N{d.This implies that there are at most 4 points exactly at distance d to the origin.Our target is the calculation of the joint probability of observing the whole sequence of nearest neighbor distances P(D 0 , D 1 , . . ., D N{1 ), of the conditional probability P(D iz1 D D i , . . ., D 0 ) and the marginal P(D i ).The main work will be the calculation of the probability P(D iz1 §c, D i ~a, . . ., D i{kz1 ~a, D i{k va) for given values k, a and c.Once this is done, P(D 0 , D 1 , . . ., D N{1 ), P(D i ) and P(D iz1 D D i , . . ., D 0 ) can be derived by elementary calculations (section S1 in Methods S1).
First we determine P(D iz1 §c, D i ƒa) by counting the number of admissible point configurations and dividing through (N{1)!, the number of all possible point configurations with z 1 ~(0,0) fixed.When counting configurations, we repeatedly exploit the fact that each horizontal and each vertical grid line contains exactly one point from the sample.In case of cwa, we split the torus into 3 regions (Figure 1).Region I is a square of side length 2az1.It contains z 1 and i additional points at arbitrary positions.The number of possibilities to draw an i-tuple from 2a positions (recall that one position is already taken by z 1 ) without replacement is (2a)!(2a{i)! .Thus, there are (2a)!(2a{i)! 2 i-tuples describing an admissible configuration in region I.However, each configuration is counted i! times, since the order of the points does not matter.Hence, the number of unique configurations in region I For the second region we have N{2cz1 possible y-coordinates and 2c{1{(iz1)~2c{i{2 columns to be filled with sample points (note that the columns {c and c belong to region III and that iz1 columns are already taken by points in region I).This yields (N{2cz1)!(N{4c{iz3)!unique configurations for region II.There are N{2cz1 points remaining which can be placed freely in the remaining N{2c columns/rows, yielding (N{2cz1)!possibilities.Together we obtain: In the case of c~a there is one more complication, because we have a region R of points exactly at distance c, containing at least the i-th and (iz1)-th neighbor of z 1 , where the region I overlaps with regions IIa and IIb (Figure 1).Let r[f2,3,4g be the number of points in region R and i 0 the number of points strictly inside the square of distance c.We derive a general formula for all admissible configurations in the case of c~a, P(D i 0 zrz1 wc, D i 0 zr Á Á Á ~Di 0 z1 ~c, D i 0 vc).Denote by k(r,i 0 ,c) the number of admissible point configurations in region R (see section S2 in Methods S1 for a derivation of k(r,i 0 ,c)).Table 1 lists all possible admissible combinations of points in region R. Counting the admissible configurations strictly inside regions I, IIa, IIb and III is similar to the above cases (Equation 1).This leads to the following general formula for all admissible configurations: The sum over all possible tuple (r,i 0 ) in Table 1 gives the probability P(d iz1 ~c, d i ~c) in the general case: The above calculations only hold if region R is a genuine square, for large values of c R degenerates to a pair of lines (one horizontal and one vertical line).These cases are covered in the extended formula Analogously we can count the number of possible configurations where D iz1 wc, some k points D i , . . ., D i{kz1 ~a and all other points D i{k va and deduce the following probability: Since the above formulas involve tedious calculations, we validated the formulas for N~7 and N~8 by counting the occurrence of each possible configuration among all N! configurations.Additionally, we checked the validity of our formula for larger N (N~20) by taking 10 6 random configurations and comparing the empirical frequency h(d i ) with P(d i ) (section S3 in Methods S1).
Figure 2 shows the distribution of P(d iz1 Dd i ) and P(d i ).The conditional distribution is shown for i~50.The marginal distribution is highly peaked with a low variance that decreases with increasing i (and reaches 0 for i~N).
The formulas have been implemented in the statistical language R [9] with emphasis on a numerically stable implementation as we deal with small numbers.The implementation is vectorized for speed.Still there is a computational penalty through the many factorials and logarithms that have to be calculated.For a sample of size 320, calculating all P(d iz1 Dd i ) takes 4.1 seconds on a single workstation (single thread, Intel Core i5-2500 CPU @ 3.30GHz).Runtime for larger samples is shown in Figure S1 in File S1 and Table 1.Counts for points lying exactly on the border region R.
For each possible number of points r~2,3,4 on the border region R and each possible number of points i 0 strictly inside of region I, we give the the number of admissible combinations of points in region R.The derivations of the number of admissible combinations is shown in the section S2 in Methods S1. doi:10.1371/journal.pone.0107955.t001 indicates a practical limit on the sample size of Nv3000 (which takes up to 3 minutes) and a complexity of O(N 2 ).
For practical reasons, we assumed that the points lie on a torus (distances on the torus are translation-invariant and therefore our formulas for P(d iz1 Dd i ) and P(d i ) hold for all points in the sample).This will bias results when applied to points on a plane, as points on the border will have different nearest neighbors when projected on the torus.The bias is less pronounced for close neighbors (i small), thus we limit our statistics to i max ~N=2.We do not expect to lose statistical power, since the information content of P(d i ) for large i approaches zero (see Figure 2).

Tests based on the th nearest neighbour distribution
It has been shown that the distance of the ith nearest neighbour of some point z can be used to estimate the local (log) density at z [5].Our idea is to use the full sequence of nearest neighbour distances for assessing local density.For a sample point z, let ) the sequence of neighbour distances.If z lies in a dense region, we expect this sequence to increase slower than in a region with lower density.

Distributional tests
The sequence of nearest neighbor distances of a point z, ) is a 4th order Markov chain, i.e., That way, taking z as the center point, the distances d z iz1 , given the four previously observed distances (d z i , d z i{1 , d z i{2 , d z i{3 ), are pairwise independent for all i.On the other hand this is not true for the distances d z1 iz1 and d z2 iz1 (not even if we condition the four previously observed distances).This follows from the triangle inequality in metric spaces, dist(z 1 ,x)ƒdist(z 2 ,x)zdist(z 1 ,z 2 ), which implies that d z1 iz1 ƒd z2 iz1 zdist(z 1 ,z 2 ).Let the random variable C i be defined by the process of drawing a point Z uniformly from 1,:::,N and then drawing C i according to the distribution P(D i DD i{1 ~dz i{1 , . . ., D 0 ~dz 0 ).Let f i denote the probability function of C i , it is given by We consider the observed values d z i , z~1,:::, N, as (not necessarily independent) realizations of D i .Their empirical frequency e i is where I½: denotes the indicator function with values in f0,1g.
Pearson's x 2 test [10] can be used to test for the fit of f i to e i : X i is a x 2 -distributed test statistic with w i {1 degrees of freedom where w i is the number distances c with f i (c) strictly positive.Our final test statistic is: Alternatively the empirical and theoretical cumulative distributions corresponding to e i and f i can be compared by an Anderson-Darling [11] or a Crame ´r-von Mises test, which proved inferior to Pearson's x 2 test (section S4 in Methods S1).

Test for location
We have the idea to compare the distribution of the ith neighbour distances observed in a sample with a suitable null distribution by means of their location.The most robust measures of location are mean or median, however in our studies of samples taken from joint distributions with low mutual information, we realized that many points do not show exceptional nearest neighbour distances.The difference to a sample drawn from independent X and Y distributions was made up by few points that had extreme nearest neighbour distances.This lead us to use extreme values as a test for location.The pvalue of a two-sided test based on P(D z i Dd z i{1 , . . . ) is p z i ~2 min (u z i , 1{u z i ), with u z i ~P(D z i ƒd z i Dd z i{1 , . . .).We summarize, for all ith neighbours, the 2-sided pvalues by their minimum Our test statistic V is obtained by aggregating the V i values:

Construction of a benchmark set
Benchmarking was done on distributions (X ,Y ) given by X *U½0,1, and Y *f (X )zN (0,s 2 ).Here, U½0,1 denotes a uniform distribution on the interval ½0,1, and N (0,s 2 ) denotes a Gaussian distribution with mean 0 and variance s 2 .The function f was chosen as one of the following: linear, quadratic, cubic, sine with period 0.5, circular, f (x)~x 1=4 and a step function (see Figure S2 in File S1).This choice was inspired by a comment by Simon & Tibshirani (http://statweb.stanford.edu/tibs/reshef/script.R, [12]) to the publication of the method MIC by Reshef et al. [6].The noise parameter s 2 determines the degree of dependence between X and Y , i.e., the mutual information i MI(X ,Y ; f ,s 2 ).The latter was estimated using an approximation q XY (X ,Y ) to the density p(X ,Y ) for which the mutual information can easily be calculated.We make q XY a piecewise-constant density on a sufficiently fine quadratic grid f(Ex,Ey , )Dx,y[Zg with q XY (x,y)~p(Et x E z0:5s,Et y E z0:5s).In our case, E~0:01 yielded sufficient precision.It is elementary to calculate the mutual information of q by MI~E 2 : X x,y[Z q XY (Ex,Ey) : log q XY (Ex,Ey) q X (Ex)q Y (Ey) Here, q X and q Y denote the marginal densities with respect to x and y.
To make the results comparable for different f , we fixed an MI value M and chose s 2 f ,M such that MI(X ,Y ; f ,s 2 f ,M )~M.This was done for 20 MI values, M ranging from 0.01 to 0.5.The noise levels s 2 f ,M are listed in section S5 in Methods S1.Samples from all dependencies f with M~0:5 is shown in Figure S2 in File S1.
So far performance evaluation of measure of dependence was only done on functional dependencies.Here we introduce ''patchwork copulas'' as a new non-functional dependence of x and y.Fix a grid size B, say B~10.Our density q will be a piecewise constant function defined on a rectangular 2D grid on the unit square (with uneven grid line spacing) such that its marginal distributions are uniform (i.e., we will define a copula).The parameters of our distribution are the values p ij , i,j~1,:::,B, with P B i,j~1 p ij ~1.Let p iÃ ~PB j~1 p ij and p Ãj ~PB i~1 p ij .Let (I,J) be a random variable which selects the grid rectangle (i,j) with probability p ij , i.e., P((I,J)~(i,j))~p ij , i,j~1,:::,B.Our distribution (X ,Y ) is then defined by X * P I{1 i~1 p iÃ zU I , U I *U½0,p I , and Y * P J{1 j~1 p Ãj zV J , V J *U½0,p J .The density in the grid rectangle (i,j) can be computed as q ij ~pij piÃpÃj .It is elementary to verify that the marginals of q are uniform and that the mutual To generate samples with a desired MI value, we choose suitable values for a and b.We draw i.i.d.samples p ij * Beta (a,b), i,j~1,:::,B, and then rescale the p ij by dividing them by their sum.This process is repeated with different a, b until MI(X ,Y ; (p ij )) is close enough to the desired MI value.The resulting dependence resembles a patchwork quilt of dense and spread out point clouds (Figure S3 in File S1).
Typically the points are considered embedded in Euclidean spaces [5], however the distance function can easily be adapted to model the geometry of a torus.We benchmarked some methods on both geometries (Euclidean plane and torus) and found that all methods were sensitive to changes of geometry.
We made the benchmark framework publicly available under a GPL3.0+license.It is implemented in R [9] and contains code for generating the dependence structures as well as plotting the results.An example is given in section S6 in Methods S1.

Comparison of methods
We compared both our tests (based on x 2 and extreme paths) to Pearson's product moment correlation coefficient, distance correlation (dcor, [4]), Hoeffding's D [3], Kraskov's estimator for mutual information [5] and MIC [6].For each type of dependence and each given value of MI, we generated a test set of 500 samples each consisting of 320 points from the respective dependence type.Test statistics were calculated for each sample.Additionally we generated a reference set of 500 samples with x and y values drawn independently which is used to calculate the cutoff value corresponding to a significance level of 5%.The power of each With increasing i, the distribution becomes narrower and the entropy tends towards 0, as the number of possible distances to the ith nearest neighbor decrease.The non-monotonic behavior of the entropy for large values of i is due to downstream constraints imposed by the maximal distance N 2 .For testing independence, we advise using all P(D iz1 DD i ) until the value of i where the entropy starts increasing again (i~9 in this example).doi:10.1371/journal.pone.0107955.g002method was estimated as the fraction of samples that were called significant according to the cutoff.Results are shown in Figure 3. Additionally we generated receiver operating curves (ROC) for each type of dependence and MI value (Figures S4-S9 in File S1).
The method of Hoeffding and dcor perform well throughout all types of dependence considered except for the circular dependence.Our methods have a performance that places them after dcor and Hoeffding's method and before MIC.In the case of the circular dependence, our methods perform best, achieving maximum power at mutual information of 0.03.We suspect that is due to the fact that a circle geometrically resembles two crossing lines when projected onto a torus (Figure S11 in File S1).To test this hypothesis we projected all types of dependence onto the torus and reran the whole benchmark (Figure S10 in File S1).We observe that the cubic, sine and step functions are not detected by any method, even at the same MI.The scaling of the plots in Figure 3 to the MI of the underlying joint distribution, enables the direct visual comparison of different dependence types.On the one hand this reveals that some types of dependence seem to be more difficult to detect for all methods (step function, sine curve and the ''patchwork copula").On the other hand each method performs best on different types of dependence.
We compared method of Hoeffding, dcor and our test based on extreme paths on a dataset from the World Health Organization and partner organizations.This dataset is available at http:// www.exploredata.net/ftp/WHO.csv.We ran the methods on all pairwise comparisons that have a squared Pearson's product moment correlation coefficient lower than 0.001 to exclude any linear relationships.As most method cannot handle missing values, we further restricted the comparisons to have at least 81 pairwise complete observations.This leads to 2971 pairwise comparison shown in Figure 4.All test statistics are uncorrelated for the pairs in which no linear dependency was detected leading again to the conclusion that no method is uniformly more powerful.

Discussion
We have derived an exact formula for the distribution of the distances of the i th nearest neighbour of a given point.This distribution assumes rank transformed bivariate data from two independent variables.While this result is of independent interest, we used it to construct two non-parametric tests of independence for bivariate data.Similar to Kraskov's estimator, our test statistic is purely based on nearest neighbour distances.In contrast to Kraskov's estimator which requires an arbitrarily fixed i, we simultaneously take into account the whole sequence of ith nearest neighbours (i~1,2,:::).This improves on Kraskov's estimator, if used as a score for independence testing.Our tests use rank transformed data, because this is a prerequisite for applying the exact nearest neighbour distributions derived in this paper.The rank transformation is often used as a primary step to estimating mutual information, therefore we consider it an uncritical step in our procedure.Our tests perform almost as well as the best competitors dcor and Hoeffding's D and they perform better than the recently proposed MIC statistic.We believe that the power of our method could be further improved in the Euclidean plane if our ith neighbour statistic would be adapted to account for boundary effects in the Euclidean plane.Although our methods try to account for the dependence of the variables D z i , z~1,:::,N, we necessarily lose power because their exact dependence structure is not known.Alternatively we propose to take all distances d z i for a point z and apply a sequential testing approach for calling points that are located in dense regions.The number of these points could serve as a test statistic.The rationale is that under the null hypothesis of independence there should be fewer points z considered significant in the sequential test than for dependent samples.
Next we reviewed competing methods and presented a benchmark framework for performance testing on different types of dependence structures and topologies (Euclidean and toroidal).The benchmark framework and our novel tests for independence are publicly available as an R [9] package on CRAN (http://cran.r-project.org/web/packages/knnIndep).By scaling each type of dependence to a common set of mutual information values we allow comparison between all dependence types.Remarkably, when benchmarked on patchwork copulas, all methods fail.This is particularly intriguing for MIC as by design it should detect the grid structure of the data.In the case of the circular dependence, our methods perform best, while the method of Hoeffding and dcor perform well throughout all types of dependence considered.This in turn shows, that all tests we investigated are biased towards the detection of certain types of dependence structures.

Figure 1 .
Figure 1.Diagrams explaining Equations 1 and 2 for N~7, a~1 and c~2 (panel A) and a~c~2 (panel B) with the reference point z 1 at coordinates (0,0).A: We define 3 regions I, II and III (black, red and blue points respectively).Region I has the least number of constraints and the number of admissible configurations is the number of possibilities to draw i points from 2a positions without replacement nor ordering: 2a i 2

Figure 2 .
Figure 2. A: Conditional distribution p c ~P(D iz1 ~diz1 DD i ~di ) for i~2, N~21 (top) and the entropy { P t N 2 s diz1~1 p c log p c (bottom).The probability p c of observing large (d iz1 ,d i ) is zero for distances larger than (6,6) when i~2.The lower triangle is empty because d iz1 §d i and the entropy is constantly decreasing for increasing values of d i because the possible (d iz1 ,d i ) decrease towards (6,6).B: Marginal distribution P(d i ) for N~21 (top) and entropy { P t N 2 s di ~1 P(d i ) log P(d i ) (bottom).With increasing i, the distribution becomes narrower and the entropy tends towards 0, as the number of possible distances to the ith nearest neighbor decrease.The non-monotonic behavior of the entropy for large values of i is due to downstream constraints imposed by the maximal distance N 2 .For testing independence, we advise using all P(D iz1 DD i ) until the value of i where the entropy starts increasing again (i~9 in this example).doi:10.1371/journal.pone.0107955.g002