A new method for computing the projection median, its influence curve and techniques for the production of projected quantile plots

  • Fan Chen

    Contributed equally to this work with: Fan Chen, Guy Nason

    Roles Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation School of Mathematics, University of Bristol, Fry Building, Woodland Road, Bristol, England, United Kingdom

  • Guy Nason

    Contributed equally to this work with: Fan Chen, Guy Nason

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    g.nason@imperial.ac.uk

    Affiliation Dept. Mathematics, Imperial College, London, England, United Kingdom

Abstract

This article introduces a new formulation of, and method of computation for, the projection median. Additionally, we explore its behaviour on a specific bivariate setup, providing the first theoretical result on the form of the influence curve for the projection median, accompanied by numerical simulations. Via new simulations we comprehensively compare the performance of our method with an established method for computing the projection median, as well as with other existing multivariate medians. We focus on answering questions about accuracy and computational speed, whilst taking into account the underlying dimensionality. Such considerations are vitally important in situations where the data set is large, or where the operations have to be repeated many times and some well-known techniques are extremely computationally expensive. We briefly describe our associated R package that includes our new methods and novel functionality to produce animated multidimensional projection quantile plots, and also exhibit its use on some high-dimensional data examples.

1 Introduction: Overview of multivariate medians

The median is an estimator of location that is robust, i.e. not heavily influenced by outlying values, which are, loosely speaking, points that are far from the main body of the data. Let x = (x1, …, xk)T be a mutually independent and identically distributed (i.i.d.) sample of length k from a univariate distribution with distribution function F. The univariate population median functional M(F) is (1)

There are several equivalent definitions of the univariate median that all yield the same unique value of the true median μ for a distribution F with a bounded and continuous density f(μ) at μ.

For multivariate data there is no natural ordering of the data to enable the choice of the middle observation in the same way as for one-dimensional data. However, several different multivariate median concepts have been developed that retain some characteristics of the univariate median. For example, an early multivariate extension of the median was suggested by Hayford [1], which is simply the component-wise median, also known as the vector of marginal medians. The spatial median, also known as the L1 median [2, 3], and Tukey’s median [4] are two other popular variants. Oja’s median [5] provides an alternative to the spatial median, but it is known to be more computationally expensive than other choices. These, and others, are reviewed in [6–8]. We briefly review some of them next, not least as we use them later in our simulation study.

1.1 Component-wise median

Let X = (x1, …, xk)T be an n-dimensional i.i.d. sample with distribution function F. We assume that the n marginal distributions have bounded densities f1(μ1), …, fn(μn) at the uniquely defined marginal medians μ = (μ1, …, μn). The component-wise median, also known as the marginal sample median, minimises (2) the sum of component-wise distances over m ∈ ℝⁿ, where m = (m1, …, mn). The corresponding population functional, MC(F), for the vector of population medians minimises (3)
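As a minimal sketch of the sample version, assuming the data are stored as a k × n matrix with one observation per row (the function name componentwise_median is ours and is not part of any package):

```r
# Component-wise (marginal) sample median.
# Assumption: X is a k x n numeric matrix, one observation per row.
componentwise_median <- function(X) {
  apply(X, 2, median)   # univariate median of each column
}

# Toy data with one gross outlier in the first coordinate.
X <- rbind(c(1, 10), c(2, 30), c(100, 20))
componentwise_median(X)   # 2 20
colMeans(X)               # the mean is dragged towards the outlier
```

Each coordinate is handled separately, which is why this median is cheap to compute but ignores any dependence between coordinates.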

1.2 Spatial median

The spatial median MS(X), also known as the L1 median, minimises (4) over m ∈ ℝⁿ, where ‖·‖ is the Euclidean norm. The corresponding functional spatial median, MS(F), minimises (5)
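The minimisation can be illustrated with a plain Weiszfeld iteration. This is only a sketch under our own assumptions: it is not the Vardi and Zhang algorithm nor the l1median routine used later in the article, the function name is hypothetical, and the guard against zero distances is deliberately crude.

```r
# Spatial (L1) median by a basic Weiszfeld iteration (illustrative only).
# Assumption: X is a k x n numeric matrix, one observation per row.
spatial_median_weiszfeld <- function(X, tol = 1e-8, max_iter = 1000) {
  m <- colMeans(X)                     # start from the sample mean
  m_new <- m
  for (iter in seq_len(max_iter)) {
    diffs <- sweep(X, 2, m)            # rows x_i - m
    d <- sqrt(rowSums(diffs^2))        # Euclidean distances to the current m
    d <- pmax(d, 1e-12)                # crude guard against division by zero
    w <- 1 / d
    m_new <- colSums(X * w) / sum(w)   # distance-weighted mean of the rows
    if (sqrt(sum((m_new - m)^2)) < tol) break
    m <- m_new
  }
  m_new
}

# Sum of Euclidean distances, the criterion being minimised.
sum_dist <- function(X, m) sum(sqrt(rowSums(sweep(X, 2, m)^2)))
```

A quick sanity check is to compare sum_dist at the returned point with its value at the sample mean; the iteration should never do worse than its starting point.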

1.3 Oja’s median

Let X = (x1, …, xk)T be an i.i.d. sample in ℝⁿ with distribution function F. The volume of the n-variate simplex determined by the n + 1 vertices (m1, …, mn+1) is (6)

The Oja median, MO(X), minimises (7) over m ∈ ℝⁿ. The corresponding functional MO(F) minimises (8)

1.4 Tukey’s median

Let X = (x1, …, xk)T be an i.i.d. sample of size k in ℝⁿ with distribution function F. Let ℋ be the class of all closed half spaces in ℝⁿ. For each H ∈ ℋ, define the empirical distribution (9) where I(·) is the usual indicator function. Then, define the depth, D(μ), of a point μ ∈ ℝⁿ within the dataset, to be the infimum of this empirical distribution, taken over all closed half spaces H for which μ ∈ H. Tukey’s median is defined as the set of points μ of maximal depth.

2 The projection median

This section introduces our new method for computing the projection median, yamm. We prove that yamm is equivalent to the projection median, as defined by Durocher and Kirkpatrick [9] in ℝ² and then generalised to higher dimensions by Basu et al. [10]. We also explore, theoretically and numerically, the statistical behaviour of yamm using a mixture of two bivariate normal distributions.

2.1 Review of the projection median

2.1.1 Projection median in ℝ².

Let X be a multiset of points in ℝ² and θ ∈ [0, 2π) be an angle. Let Xθ denote the multiset defined by the projection of X onto the unit vector uθ = (cos θ, sin θ), so (10) where ⟨·, ·⟩ denotes the usual inner product.

The projection median of a non-empty finite set X with points in ℝ² is (11) where med(Xθ) is the median of the projection of X onto the line through the origin, parallel to uθ.

2.1.2 Generalisation of the projection median.

Given a fixed positive integer, n ≥ 2, and a finite set of points X in ℝⁿ, the n-dimensional projection median of X is (12) where Xn−1 is the unit n-dimensional hypersphere, med(Xa) is the median of the projection of X onto the line through the origin parallel to a, and f is the normalised uniform measure over Xn−1. Hence, for a point x = (x1, x2, …, xn) ∈ Xn−1, the n-dimensional spherical coordinates are given by (13) where each angle θ1, θ2, …, θn−2 has a range of π and θn−1 has a range of 2π. Also, the normalised uniform measure f over Xn−1 is given by (14) where is the volume element of the (n − 1)-sphere.

Basu et al. [10] proved that the projection median has a breakdown point of 1/2 for all n ≥ 2.

2.2 Yet another multivariate median (Yamm)

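Let X be a random sample of size k in ℝⁿ, arranged as a k × n matrix. Let a be an n × 1 projection vector of unit length, 1k be the k × 1 vector of ones and μ a shift vector of length n. Let y be the projection of X onto a after X has been shifted by μ: $y = (X - 1_k \mu^{T})\, a$ (15) where y is a vector of length k. The univariate median m of the projected points y is $m = m_X(\mu, a) = \mathrm{med}(y)$ (16)

Now define the integral of the squared projected median over all unit-length directions a: $M_{X,m}(\mu) = \int_{\|a\| = 1} m_X(\mu, a)^2 \, da$ (17)

The yamm estimator of location for X is $\mathrm{yamm}(X) = \arg\min_{\mu \in \mathbb{R}^n} M_{X,m}(\mu)$ (18)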

Eqs (17) and (18) illustrate the rationale behind yamm. Intuitively, if the shift vector μ is far away from the true ‘middle’ of the dataset, then the magnitude of mX(μ, a), as well as the integral MX, m(μ), will be large. By contrast, mX(μ, a) becomes smaller as μ moves closer to the true ‘middle’ of the data set.

Instead of computing the squared value of mX(μ, a) for the integral, we also considered the absolute value as an alternative. However, this leads to similar numerical results.

Example. We now generate two polar plots of the absolute value of mX(μ, a), one with μ close to the true median and one with μ further away. A random two-dimensional dataset with k = 100 points was generated, whose Tukey median was computed as (2.78, 8.16). Here, the Tukey median is to be interpreted as a ‘sensible’ middle of the data set. The shift vector μ is set to (2.2, 8) and (2, 7.5), respectively, and for each plot two thousand random projections were used to calculate the univariate median mX(μ, a), using methods to be explained in Section 2.4. Fig 1 shows that when μ is near the Tukey median, the magnitude of each mX(μ, a) is less than 0.65, while a wider range of values, from 0 to 1.2, appears when μ is further from the median. Overall, the integrated quantity is smaller for the μ that is closer to the Tukey median.

Fig 1. Polar plot (in radians) of the magnitude of mX(μ, a).

Grey line: μ = (2.2, 8) and Blue line: μ = (2, 7.5).

https://doi.org/10.1371/journal.pone.0229845.g001
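The comparison in the example can be sketched in base R. The data here are hypothetical (not the dataset behind Fig 1), an equally spaced grid of directions replaces the random projections, and the component-wise median stands in for the ‘sensible’ middle; the helper name abs_proj_median is ours.

```r
set.seed(1)
k <- 100
X <- cbind(rnorm(k, mean = 3), rnorm(k, mean = 8))   # hypothetical data

# Unit vectors on an equally spaced grid of 2000 angles in [0, 2*pi).
theta <- 2 * pi * (0:1999) / 2000
U <- cbind(cos(theta), sin(theta))

# |m_X(mu, a)| for every direction a: the univariate median of the
# projections of the data after shifting by mu.
abs_proj_median <- function(X, mu, U) {
  abs(apply(sweep(X, 2, mu) %*% t(U), 2, median))
}

centre <- apply(X, 2, median)            # stand-in for a 'sensible' middle
near <- abs_proj_median(X, centre + c(0.1, 0.1), U)
far  <- abs_proj_median(X, centre + c(1.0, 1.0), U)

# Averaged squared magnitudes: the shift further from the centre gives the larger value.
c(near = mean(near^2), far = mean(far^2))
```

Plotting near and far against theta in polar coordinates would give a picture of the same kind as Fig 1.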

The projection median and yamm definitions seem similar, as both project the multiset onto the line passing through the origin, and then take the median. However, the projection median integrates med(Xa) directly over the unit hypersphere in ℝⁿ, whereas yamm minimises the objective function over the shift vector μ. Despite these differences, the following theorem shows that the projection median and yamm are identical.

Theorem. For any finite multiset X in ℝⁿ with n ≥ 2, yamm is equivalent to the projection median.

For the proof of the theorem, see S1 Appendix.

2.3 Yamm behaviour on a bivariate normal mixture

To gain insight about the theoretical behaviour of yamm we study the case of yamm applied to a mixture of two bivariate normals, where one is thought of as the bulk and the other as the outlier of the distribution. Such a setup enables us to evaluate the robustness of yamm. We numerically and theoretically assess the influence curve when moving the outlier far from the bulk.

2.3.1 Bivariate mixture setup.

Let X1 ∼ N2(ν1, Σ1) and X2 ∼ N2(ν2, Σ2) be independent bivariate normal random variables, where X1 = (X11, X12)T, X2 = (X21, X22)T with mean vectors ν1 = (ν11, ν12)T and ν2 = (ν21, ν22)T. Let R(θ) be a rotation matrix with angle θ given by $R(\theta) = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}$ (19)

We are interested in the first row of this matrix, which describes the projection onto direction θ. Let Yi = (Yi1, Yi2)T = R(θ)(Xi − μ) for i = 1, 2 respectively, where μ = (μ1, μ2)T is a shift vector mentioned in (15). Basic multivariate theory shows that (20)

Denote Yi = (Yi1, Yi2)T, Yi1 is the first entry of Yi for i = 1, 2. Then, it is immediate that , where (21) (22)

The mixture distribution that we study is (23) where is the density of Xi, and ϵ ∈ [0, 1], is typically small. Here, is considered to be the bulk of the distribution and the outlier.

2.3.2 Projected distribution.

Based on the bivariate setup above, the projected distribution is (24) where are as above and ϕ is the standard normal density.

The distribution function of the projected Y(θ) is (25) where Φ is the standard normal distribution function. We require the median of the projected distribution, i.e. find (26)

Finding an analytic exact solution for ym is difficult. Hence, we will simplify the problem and assume that Σ1 = Σ2 = I2, the identity matrix. Since R(θ) is an orthogonal matrix, this means that and Eq (25) becomes (27)

For small ϵ, we know that the median should be close to the median of the bulk, so the median of FY should be close to s1, the median of the first component of the mixture in Eq (27).

2.3.3 Theoretical approximation of yamm on the mixture.

We derive a theoretically based approximation to the empirical influence function. We proceed by using a Taylor series expansion of FY(y) around s1, the quantity we know is close to our median: (28) where When y is close to s1, Eq (28) is approximately equal to 1/2 when ϵ is small, which is the behaviour we expect.

To find an approximation to the median we solve FY{ym(θ)} = 1/2. Ignoring remainders, subtracting 1/2 off both sides of Eq (28) gives (29) and then (30)

Now using (31) and , we can write (32)

For small ϵ the denominator is close to 1. From Eqs (21) and (22), we can write: (33) where δ1 = ν21 − ν11 and δ2 = ν22 − ν12. Thus (34)

According to Eq (17), our job is to find the optimal μ*, which minimises (35)

The integrand involves the standard normal distribution function, which is tricky to handle analytically. Hence, we use the approximation ϕ(z) ≈ (1 + cos z)/(2π), for −π < z < π, for the standard normal density [11], which enables the following proposition.
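The quality of this cosine approximation can be checked numerically. The sketch below is ours; setting the approximation to zero outside (−π, π) is an assumption made only so that the function is defined everywhere.

```r
# Cosine approximation to the standard normal density on (-pi, pi).
# Assumption: the approximation is taken to be 0 outside that interval.
phi_cos <- function(z) ifelse(abs(z) < pi, (1 + cos(z)) / (2 * pi), 0)

z <- seq(-pi, pi, length.out = 201)
gap <- abs(dnorm(z) - phi_cos(z))

max(gap)                              # largest pointwise error on the grid
z[which.max(gap)]                     # where it occurs
integrate(phi_cos, -pi, pi)$value     # the approximation is itself a density
```

The largest discrepancy sits at the mode, where dnorm(0) is about 0.399 while the cosine form gives 1/π, about 0.318.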

Proposition. Let X1 = (X11, X12)T and X2 = (X21, X22)T. Suppose that and independently, where ν1 = (ν11, ν12)T and ν2 = (ν21, ν22)T, respectively. Let the mixture, W, of X1 and X2 be where ϵ ∈ [0, 1] is considered small.

An approximation of the yamm estimator, μ*, is (36) where , δ1 = ν21 − ν11, δ2 = ν22 − ν12 and α = arctan(δ2/δ1). The approximation we use is valid whenever , where θ is the projection direction when computing yamm. This inequality is true for all θ whenever .

Intuitively, the approximation in the Proposition works whenever the two cluster means are close enough together, i.e. when .

In particular, when ν11 = ν21 or ν12 = ν22 (i.e. when one of the δi = 0, i = 1, 2), we can form a more accurate approximation. This is because the approximation for the standard normal density, ϕ(z) ≈ (1 + cos z)/(2π), is no longer required to find the optimal μ* minimising Eq (35). Without loss of generality, letting ν1 = (ν11, ν12)T = (0, 0)T and ν2 = (ν21, ν22)T = (0, d)T, we obtain the yamm estimator as follows (37) where BesselI[n, z] is the modified Bessel function of the first kind, sometimes denoted In(z). For the proof of the proposition, see S2 Appendix.

2.3.4 The yamm influence curve on the mixture.

This section numerically computes and plots yamm for the case where ϵ = 0.05, and , with ν1 = (0, 0)T and ν2 = (0, d)T for . We explore how yamm varies as d increases from 0 to 10 in steps of 0.2. If yamm is robust, then it should increase with d, but plateau beyond a certain point.

For each value d we estimate yamm as the mean over five hundred bivariate mixture realizations, with two thousand projections involved for each yamm computation, using methods described below in Section 2.4. The numerically computed crosses in Fig 2 show that, for this setup, yamm plateaus somewhere between d = 2 and d = 4.

Fig 2. Yamm computed on simulated setup, increasing the distance between two bivariate normals.

Crosses: numerically computed values; Solid blue line: approximation computed for general ν1 and ν2; Solid red line: approximation computed when ν1 = (0, 0)T and ν2 = (0, d)T.

https://doi.org/10.1371/journal.pone.0229845.g002

The solid red line in Fig 2 shows our theoretical approximation of the yamm influence curve with the more specific setup, where μ* follows Eq (37). Under this approximation, the influence curve closely follows the numerically computed crosses. On the other hand, the solid blue line is the approximation of yamm under the more general setting of Eq (36). It performs reasonably well when the inter-cluster mean distance satisfies 0 < d < 4.5, but it approximates poorly for d > 4.5 and does not plateau.

This is because, in the setup, δ1 = d, δ2 = 0, and d > 4.5 implies . However, the specific setup approximation of yamm obviously does not work for arbitrary values of ν1 and ν2, whereas the general approximation gives a good theoretical idea of the yamm influence curve when the two means of the clusters are close enough together.

2.4 Projection median and yamm computation

2.4.1 Projection median computation.

A simple Monte Carlo integration [12] can be used to compute an approximation of the projection median by (38) where J represents the number of projections used, and a1, …, aJ is a set of random, independently-drawn, unit length n-vectors over Xn−1.

Calculating the approximation in Eq (38) is relatively straightforward, but a large value of J is required to ensure accuracy. Another approach computes the projection median directly from the definition in Eq (12), using the spherical coordinates illustrated in Eq (13), where the integral can be obtained by the trapezoidal rule. For example, in the two-dimensional case, we apply the trapezoidal rule once on Eq (11). In the three-dimensional case, we have to apply the trapezoidal rule twice for the double integral, and so on. This direct approach is easy to implement when our dataset has a low dimension, but the work becomes excessive even in moderately high dimensions, e.g. n = 10.
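Both approaches can be sketched in a few lines of base R. These are not the package's PmedTrapz or PmedMCInt functions; the names are ours. The normalising constants (1/π for the integral over [0, 2π) in ℝ², and the factor n for the Monte Carlo average in ℝⁿ) are our assumption, chosen so that a dataset consisting of one repeated point is mapped to that point.

```r
# Trapezoidal-rule projection median in R^2 (illustrative sketch).
# For a periodic integrand the trapezoidal rule is an equally weighted sum.
pmed_trapz_2d <- function(X, n_sub = 36) {
  theta <- 2 * pi * (seq_len(n_sub) - 1) / n_sub   # grid on [0, 2*pi)
  U <- cbind(cos(theta), sin(theta))               # one unit vector per row
  med <- apply(X %*% t(U), 2, median)              # median of each projection
  (1 / pi) * (2 * pi / n_sub) * colSums(U * med)   # assumed 1/pi normalisation
}

# Monte Carlo projection median in R^n (illustrative sketch).
pmed_mc <- function(X, nprojs = 20000) {
  n <- ncol(X)
  A <- matrix(rnorm(nprojs * n), nprojs, n)
  A <- A / sqrt(rowSums(A^2))          # normalised Gaussians: uniform on the sphere
  med <- apply(X %*% t(A), 2, median)
  n * colMeans(A * med)                # assumed factor n, since E[a a^T] = I / n
}

set.seed(7)
X <- matrix(rnorm(100), 50, 2)
pmed_trapz_2d(X, n_sub = 360)
pmed_mc(X, nprojs = 20000)
```

The trapezoidal version needs only a few hundred medians in ℝ², whereas the Monte Carlo version trades many more projections for the ability to work in any dimension.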

2.4.2 Computing yamm.

To compute an approximation to yamm, we can also use Monte Carlo integration together with an optimiser. Let J be the number of projections and a1, …, aJ be a set of independent random unit length n-vectors; an estimator for MX, m(μ) is given by (39)

We then numerically minimise this estimator over μ to obtain our estimated location measure, using the BFGS optimization method [13–16]. BFGS is a quasi-Newton algorithm searching for a stationary point of a function via local quadratic approximation. Parallel versions such as optimParallel exist as easy to use packages in R.

With reasonable starting values, such as the mean or other multivariate medians, yamm typically provides accurate results with a considerably smaller number of projections than used by the Monte Carlo projection median method mentioned above.
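A rough sketch of this procedure using only base R's optim is given below. It is not the package's yamm or yamm.obj code: the function names are hypothetical, the objective is taken to be the average squared (or absolute) projected median, and the component-wise median replaces the spatial median as the default starting value to avoid extra dependencies.

```r
# Monte Carlo objective: average squared (or absolute) univariate median of
# the projections of the shifted data. A holds one unit vector per row.
yamm_obj_sketch <- function(mu, X, A, doabs = FALSE) {
  m <- apply(sweep(X, 2, mu) %*% t(A), 2, median)
  if (doabs) mean(abs(m)) else mean(m^2)
}

# J random directions, uniformly distributed on the unit sphere in R^n.
random_unit_vectors <- function(J, n) {
  A <- matrix(rnorm(J * n), J, n)
  A / sqrt(rowSums(A^2))
}

# Minimise the objective over the shift vector with BFGS.
yamm_sketch <- function(X, nprojs = 2000, reltol = 1e-6,
                        xstart = apply(X, 2, median),
                        A = random_unit_vectors(nprojs, ncol(X))) {
  fit <- optim(par = xstart, fn = yamm_obj_sketch, X = X, A = A,
               method = "BFGS", control = list(reltol = reltol))
  fit$par
}

set.seed(3)
X <- matrix(rnorm(120), 40, 3) + 5
yamm_sketch(X, nprojs = 500)
```

Because the univariate median is translation equivariant, the projected median of the shifted data equals med(Xa) − ⟨μ, a⟩, so with the squared criterion this Monte Carlo objective is a least-squares problem in μ. That may help explain why the optimiser needs relatively few projections and iterations.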

In conclusion, projection median computation via the trapezoidal rule is fast and accurate in low dimensions, but increasingly onerous in higher dimensions, as progressively more multidimensional integration is required. For higher dimensions, we prefer the Monte Carlo method and prefer yamm over the projection median as it does not require such a large number of projections, particularly if the optimiser is given a good starting solution.

Overall, approximating the projection median by the trapezoidal rule is a good choice in ℝ² and ℝ³, and either of the other two methods can be used in higher dimensions.

3 Empirical performance for different medians

This section reviews the theoretical computational complexity for a variety of medians and computes some running times for real implementations of several medians computed in R. We then present some results for accuracy of estimation for these medians.

3.1 Computational complexity and empirical speed

For a dataset in ℝⁿ with k observations, the computational complexity for the spatial median is O(nk) [17], which is the same as for the exact computation of the component-wise median. The projection median can be obtained in O(k^{4/3} log^{1+ϵ} k) time in ℝ² [9], and O(k^{5/2+ϵ}) time in ℝ³ [10]. In ℝⁿ, with n > 3, Basu et al. showed that O[k^{n{1−δn/(n+1)}+ϵ}] time is required to compute the projection median, where δn = (4n − 3)^{−n} and ϵ is a fixed small constant. Several algorithms for other multivariate medians have been developed for the bivariate case. The current best algorithms for Oja’s and Liu’s medians require O(k log³ k) and O(k⁴) time, respectively [18], whereas that for the fastest bivariate Tukey median is O(k log³ k) [19]. The calculation of these three multivariate medians in higher dimensions is more complicated and approximate computation is often preferred/required.

To provide empirical assessment of the real computation speed, we apply several R software medians to simulated data. There are several R functions using different algorithms to compute one median. For example, spatial.median from the library ICSNP estimates the median with the algorithm developed by Vardi and Zhang [20], while Gmedian developed by Cardot et al. [21] is faster but, perhaps, less accurate. In addition, l1median [22] from library pcaPP and med from depth also provide opportunities to compute the spatial median. Hence, after some experiments, we choose the best function (evaluated in terms of speed and accuracy) for each multivariate median in and shown in Table 1. Much of the software for multivariate medians in R only works in low numbers of dimensions.

Table 1. R functions used for analysing different multivariate medians.

https://doi.org/10.1371/journal.pone.0229845.t001

The med function can only calculate the bivariate Liu’s median, which is considerably more challenging in higher dimensions. The calculation of Tukey’s median is exact in one and two dimensions, and approximate in higher dimensions. We use the approximate Tukey’s median computation in the med function, due to numerical errors that sometimes surface when using the exact algorithm. For Oja’s median, the approximate method (evolutionary algorithm) is used instead of the exact one, as it is faster and can deal with high dimensions.

Table 2 displays mean computation times and their standard deviations across 1000 simulated datasets from the two-dimensional Laplace distribution with different numbers of observations (k) for each set. The results are produced by running R on a single core of an Intel i7-8750H processor with a 2.20 GHz base clock and 16 GB of RAM. For small k, Liu’s median is fastest, but it is slower than the others for larger k. In this experiment, Oja’s median is the slowest for small k values, but its speed does not appear to be particularly sensitive to k. Hence, it is faster than Tukey’s median when k = 200. The projection median is one of the quickest when k is below 100, while for large k values, the component-wise median and the spatial median are faster.

Table 2. Mean and standard deviation (s.d.) of the operation time (×10⁻⁵ seconds) for data in ℝ².

https://doi.org/10.1371/journal.pone.0229845.t002

The results in Table 2 are produced by only one possible R function for one median. However, other functions can be used. For example, the med function from the depth package can also be used to calculate the spatial median and provides accurate answers. It is extremely fast for small k and lower dimensions, but it becomes slower than l1median for larger k. Hence, we use l1median to compute the spatial median, whose performance for small k is also good.

3.2 Mean squared error for some medians

We assess the accuracy of some of the medians via empirical mean squared error. If μ̂ is an estimator in ℝⁿ of the unknown parameter μ ∈ ℝⁿ, then the mean squared error is (40) where the distance term represents the squared Euclidean distance between μ̂ and μ, normalized by the vector length. Smaller values are better.
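An empirical version of this criterion can be sketched as follows. The function name is ours, and the simulated data assume independent standard Laplace marginals (generated as a difference of two exponentials), which may differ from the two-dimensional Laplace distribution used for the tables.

```r
# Empirical mean squared error, normalised by the vector length.
# est: N x n matrix with one estimate per row; mu: true location (length n).
mse_normalised <- function(est, mu) {
  mean(rowSums(sweep(est, 2, mu)^2) / length(mu))
}

# Assumption: independent standard Laplace marginals.
rlaplace <- function(k) rexp(k) - rexp(k)

set.seed(10)
n_rep <- 200
k <- 50
samples <- replicate(n_rep, cbind(rlaplace(k), rlaplace(k)), simplify = FALSE)

means <- t(sapply(samples, colMeans))                               # n_rep x 2
meds  <- t(sapply(samples, function(S) apply(S, 2, median)))        # n_rep x 2

c(mean = mse_normalised(means, c(0, 0)),
  cw_median = mse_normalised(meds, c(0, 0)))
```

Any other location estimator can be slotted in by replacing the function passed to sapply.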

Table 3 shows MSE results based on the same simulations as used for Table 2. Not surprisingly, for this long-tailed data, all medians perform better than the sample mean. The spatial median and the projection median have smaller mean squared error, the latter performing better for small k values. On the other hand, Liu’s median always produces a very high mean squared error.

Conclusion. Based on these simulations, for the R functions listed in Table 1, the spatial and projection medians always have the lowest mean squared error, and also fast running speeds. Although Liu’s median has the shortest computation time for small k, it is the most inaccurate, and its computation time becomes long for large datasets. Similarly, the component-wise median is fast, even when k increases, but it has a large mean squared error. Hence, the spatial and projection medians are good choices when computing two-dimensional robust measures of location in this case, and the latter is preferred for small datasets. The computational results for high-dimensional simulations (n = 3, 5, 10) can be found in S1 Table.

3.3 2D projection median computation functions

The R package DurocherProjectionMedian can be downloaded from Github at https://github.com/12ramsake/DurocherProjectionMedian.

The DurocherProjectionMedian package provides functions to compute the projection median via the Monte Carlo integration method using projectionMedianMC [27] and an exact method for two dimensions proposed by Ramsay [28] using projectionMedian2D. Tables 4 and 5 show the performance of the different functions computing the two-dimensional projection median of 1000 simulated datasets from the Laplace distribution with different k.

Table 4. Mean and standard deviation (s.d.) of the operation time (×10⁻⁵ seconds) for different R functions to produce the projection median.

https://doi.org/10.1371/journal.pone.0229845.t004

Table 5. Mean squared error (×10⁻³) for 1000 sets of data in ℝ² generated from the Laplace distribution.

https://doi.org/10.1371/journal.pone.0229845.t005

For the Monte Carlo integration method, when k is small (e.g. under 150 in ℝ²), the computation time of projectionMedianMC is longer than that of our PmedMCInt under the same number of projections, both in ℝ² and in higher dimensions, whereas both implementations have almost the same MSE.

Although projectionMedian2D provides a slightly smaller MSE, it is slow. Our PmedTrapz is faster and its MSE performance is comparable to that of projectionMedian2D; hence, PmedTrapz might be recommended as the best choice for ℝ².

4 The yamm R package

Our Yamm R package provides users with functions to compute the projection median according to the different methods mentioned in section 2.4. PmedMCInt computes the projection median using the Monte Carlo approximation; PmedTrapz uses the trapezoidal rule and is currently only valid in two and three dimensions; yamm computes the projection median using the Monte Carlo approximation to find the shift vector μ minimising our objective function yamm.obj. The package also includes the functions Plot2dMedian and Plot3dMedian to plot different multivariate medians for data in ℝ² and ℝ³, respectively. Most functions in our package are implemented internally using C code. This section provides some brief illustrations of the use of Yamm.

4.1 Yamm projection medians

The function PmedMCInt computes the projection median for any multivariate data, x, by invoking

PmedMCInt(x, nprojs = 20000)

Since this function uses Monte Carlo integration, we need to choose the number of projections J, which has a default value of 20000. Typically, a large J is required to obtain a stable answer, which means the result will not change much if recomputed under the same conditions. This function returns the projection median estimate vector.

The function PmedTrapz computes the projection median in ℝ² and ℝ³ and is invoked by

PmedTrapz(x, no.subinterval)

PmedTrapz applies the trapezoidal rule once in ℝ² and twice in ℝ³ on each entry of the vector-valued integrand mentioned in section 2.1.2, and returns a vector of the projection median estimate.

The argument no.subinterval determines the number of subintervals for the trapezoidal rule. For the bivariate case the no.subinterval argument is a single number that controls the number of subdivisions for the one-dimensional integration; for the trivariate case the argument is a vector of length two that controls the number of subdivisions for the two integrals. In general, it is better to use at least 36 subintervals, which typically produces accurate results without excessive running time.

More subintervals may be appropriate for more complex datasets. For some unusual data sets it would be ideal to have a high resolution of the interval of integration in one particular region, and a relatively low resolution elsewhere, but this is beyond the scope of the current research. A small number of partitions, e.g. below 15, is not recommended for reasons of accuracy.

The yamm function is valid for data of any dimension. It uses an optimiser to provide another method to compute the projection median. The arguments are

yamm(x, nprojs = 2000, reltol = 1e-06,
     xstart = l1median(x), opt.method = "BFGS",
     doabs = 0, full.results = FALSE)

The yamm function is a wrapper to minimise the objective function yamm.obj, which uses the Monte Carlo method to approximate the squared or absolute value of the univariate median of the projection of the shifted data matrix. The nprojs argument controls the number of projections in the Monte Carlo approximation and doabs is an indicator, where 1 uses the absolute value of the univariate median and 0 forces the use of the squared value. The arguments reltol, xstart, opt.method are supplied directly to the R optimisation function optim: reltol is the tolerance for the optimiser, with default value of 10⁻⁶. Usually, we set a larger value (e.g. 10⁻³) for this argument, which reduces the running time whilst maintaining accuracy. The argument opt.method controls the selection of optimisation methods, which can be chosen from any of the five options, "BFGS", "Nelder-Mead" [29], "CG" [30], "L-BFGS-B" [31], and "SANN" [32]. The default choice "BFGS" is relatively fast and stable in our case. See the help page of the function optim in R for further details about the different optimisation methods. The xstart argument provides the initial value for the parameters to optimise over, which plays an important role in the function yamm. A good starting point will reduce the running time and provide a more accurate result, so we use the spatial median as the default value. Other multivariate medians could be used, but they need to be fast. If full.results = TRUE, the output of this function is a list with components obtained from the optim function; otherwise, it returns a vector containing the multivariate median estimate.

4.2 Some real examples

We now exhibit results for the projection medians applied to a real dataset and to two simulated datasets. Our plots show different multivariate medians and the sample mean value for the two simulated datasets in ℝ² and ℝ³, respectively, allowing the methods to be compared.

4.2.1 Beetle data.

The famous beetle data [33] takes six measurements on 74 flea-beetles, with each belonging to one of three different species. We apply yamm and obtain the following output:

yamm(beetle, nprojs = 1000, reltol = 1e-3, doabs = 0,
     full.results = TRUE)
$par
[1] 180.19194 123.73920 49.97819 135.87913 13.62603 95.49062
$value
[1] 5.585139
$counts
function gradient
      90        4
$convergence
[1] 0
$message
NULL

The yamm results show that the optimiser executed 90 calls to the objective function yamm.obj and constructed 4 gradients. The par component contains the estimate of the yamm for the beetle data. These results are not that different from the output generated by PmedMCInt, which is

PmedMCInt(beetle, nprojs = 100000)

[1] 179.54428 124.72128 50.56934 137.47363 13.23372 94.80188

For the beetle data, we chose the number of projections in yamm to be 1000, while many more projections (e.g. 100000) were required in PmedMCInt to obtain a similar and consistent result, although yamm additionally requires optimisation. Fewer projections for the function PmedMCInt may lead to inaccurate results for some components of the multivariate median. PmedTrapz is not valid in this six-dimensional case, but we will show that it has a similar output when computing the projection median in two and three dimensions.

4.2.2 Simulated data in ℝ² with three clusters.

We now use the function Plot2dMedian in the package Yamm to generate and display different multivariate medians for the simulated data set clusters2d. This set contains three clusters, which are generated randomly from different independent normal distributions, and two outliers.

Here, we display the three different estimates of the projection median. When computing other multivariate medians, we use functions from the R packages listed in section 3.1. The actual data points are plotted with grey dots. The first plot in Fig 3 is produced excluding the two outliers, whilst the second one includes them. The projection medians produced with different estimators are very close to each other, and also not far from the other median estimators. Fig 3 also shows that the multivariate medians are not particularly affected by the outliers, whilst the mean value is.

Fig 3. Bivariate medians and mean for three cluster two-dimensional set.

Top: without outliers; Bottom: with outliers (out of plot area).

https://doi.org/10.1371/journal.pone.0229845.g003

4.2.3 Simulated data in with four clusters.

The function Plot3dMedian in Yamm plots the three-dimensional medians. The dataset clusters3d has four clusters, each generated from different independent normal distributions, as well as five outliers. Fig 4 is produced with the dataset clusters3d, whose outliers have been removed. It shows that apart from the Oja’s median, the other medians are located close to each other. Again, the three approximations of the projection median almost coincide in every component.

Fig 4. Trivariate medians & mean for four cluster three-dimensional set.

https://doi.org/10.1371/journal.pone.0229845.g004

4.3 The muqie plot and some examples

As well as obtaining a robust location measure, we can use projections to provide information on the spread and configuration of the data. Obtaining true multivariate quantiles can be computationally challenging, and what we produce are not true multivariate quantiles, but they do enable us to gain useful understanding about multivariate data. The muqie (MUltivariate QuantIlE) plots are constructed as follows.

First choose a unit-length direction vector, u. Then project our yamm-centred multivariate data onto u to obtain a univariate set. The muqie point, Q(α, u), is merely the vector u rescaled to have length equal to the α-quantile of the univariate set. A muqie plot is the collection of all muqie points, Q(α, u), over all unit-length direction vectors u. In practice, we construct our plot by choosing a number of directions and joining the points. The basic concept, and plots, are not new: Section 2 of Fraiman and Pateiro-Lopez [34] introduces the concept based on mean-centred data, and it is related to ideas in [35]. Our main additions to this body of work are (i) centring using yamm, or another robust median, and (ii) presenting the muqie plots as dynamic videos of increasing α.
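
The construction can be sketched in base R for two-dimensional data. This is not the muqie() function from the package: the name muqie_points_sketch and its arguments are hypothetical, the data are simulated, and the coordinate-wise median is used as the default centre purely as a stand-in for yamm.

```r
# Sketch of muqie points in two dimensions (hypothetical helper, not muqie()).
# `centre` defaults to the coordinate-wise median as a stand-in; the article
# centres the data with yamm instead.
muqie_points_sketch <- function(x, alpha, centre = apply(x, 2, median),
                                ndir = 360) {
  x <- as.matrix(x)
  stopifnot(ncol(x) == 2, alpha > 0, alpha < 1)
  xc <- sweep(x, 2, centre, "-")                 # centre the data
  theta <- 2 * pi * (seq_len(ndir) - 1) / ndir   # equally spaced angles
  u <- cbind(cos(theta), sin(theta))             # unit direction vectors
  proj <- xc %*% t(u)                            # column j: data projected onto u_j
  q <- apply(proj, 2, quantile, probs = alpha, names = FALSE)
  # Rescale each direction by its alpha-quantile. A negative quantile places
  # the point on the opposite side of the centre. Coordinates are relative
  # to `centre`.
  u * q
}

set.seed(3)
# Three made-up clusters, for illustration only.
x <- rbind(cbind(rnorm(40, 0), rnorm(40, 0)),
           cbind(rnorm(40, 6), rnorm(40, 0)),
           cbind(rnorm(40, 3), rnorm(40, 5)))
pts04 <- muqie_points_sketch(x, alpha = 0.4)
pts08 <- muqie_points_sketch(x, alpha = 0.8)
# To draw one curve, join the points and close the loop, e.g.
# plot(rbind(pts08, pts08[1, ]), type = "l", asp = 1)
```

Because the α-quantile in direction u is the negative of the (1 − α)-quantile in direction −u, the curves for α and 1 − α trace the same set of points in this sketch, which is why values of α below 0.5 can appear on the far side of the centre.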

Fig 5 shows two muqie plots for α = 0.4 and α = 0.8. The latter indicates the three-cluster nature of the data. Surprisingly, this also shows up clearly in the α = 0.4 plot, with the 0.4 quantile for, e.g., the bottom-left cluster appearing in a “north-easterly” direction and coloured red in our plot. The movie Animation shows an animated plot, which includes both plots in Fig 5 and many others for increasing values of α.

Fig 5. Muqie plot for the three cluster two-dimensional data set without outliers.

The figures are produced for different values of pseudo-quantile α. The centre point (in blue) in each plot is the yamm median. Left: α = 0.4, Right: α = 0.8.

https://doi.org/10.1371/journal.pone.0229845.g005

These plots were produced by the muqie() function in the Yamm package. For the animated plot, the package includes the makeplot() function, which calls muqie() for multiple values of α. Then we use the CRAN package animation to produce an animated GIF using

saveGIF(makeplot(clusters2d[-c(102,103),], nprojs = 4000),
        diff.col = 3, interval = 0.1, width = 500, height = 500)

The movie beetle shows a three-dimensional muqie plot using three variables from the beetle data. The R commands used were:

saveGIF(makeplot3D(beetle, dm = c(1,3,6)), diff.col = 3,
        interval = 0.2, width = 500, height = 500)

5 Conclusions and discussions

We have introduced a new method, yamm, to compute the projection median for data in ℝⁿ with n ≥ 2. We have proved the theoretical equivalence of yamm and the projection median. Through theoretical and numerical investigations, we demonstrated the robustness of yamm on a simple, but illuminating, bivariate setup.

Then, we illustrated three computation methods for the projection median, each best deployed in different situations. Approximating the projection median by the Monte Carlo method is valid in any dimension but requires a large number of projections to ensure accuracy. The trapezoidal rule is computationally fast and accurate in two and three dimensions, but in higher dimensions it requires integration over more components of the projection vector, which rapidly becomes more complex. The yamm approximation can also compute the median in any dimension. It is not as quick as the other two under the same conditions (e.g. the same number of projections). However, thanks to the optimiser, a small number of projections can suffice to obtain an accurate median from a reasonable starting point (e.g. another multivariate median or the mean), which can be a distinct advantage.
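
The relationship between the integration and optimisation routes can be illustrated in two dimensions with base R. Both functions below are hypothetical sketches, not PmedTrapz or yamm. The first applies the trapezoidal rule to an assumed form of the bivariate projection median, (1/π) ∫ u(θ) med(u(θ)ᵀX) dθ over [0, 2π). The second minimises a mean squared median of the centred projections with optim; whether this matches the objective used in the package is an assumption.

```r
# Equally spaced unit directions on the circle.
directions_2d <- function(k) {
  theta <- 2 * pi * (seq_len(k) - 1) / k
  cbind(cos(theta), sin(theta))
}

# Trapezoidal rule for the assumed form
#   PM(X) = (1/pi) * integral over [0, 2*pi) of u(theta) * med(u(theta)'X).
# For a periodic integrand on an even grid the rule is an equal-weight average.
pmed_trapz_sketch <- function(x, k = 360) {
  u <- directions_2d(k)
  med <- apply(as.matrix(x) %*% t(u), 2, median)
  2 * colMeans(u * med)
}

# Optimisation route (assumed objective): choose mu to minimise the mean
# squared median of the centred projections, using
#   med(u'(X - mu)) = med(u'X) - u'mu.
pmed_optim_sketch <- function(x, k = 360, start = colMeans(x)) {
  u <- directions_2d(k)
  med <- apply(as.matrix(x) %*% t(u), 2, median)
  objective <- function(mu) mean((med - drop(u %*% mu))^2)
  optim(start, objective, method = "BFGS")$par
}

set.seed(2)
x <- cbind(rexp(200), rnorm(200, mean = 3))   # simulated, skewed in one coordinate
pm_trapz <- pmed_trapz_sketch(x)
pm_optim <- pmed_optim_sketch(x)
rbind(trapezoid = pm_trapz, optimiser = pm_optim)
```

On an even grid of directions the least-squares minimiser has the same closed form as the trapezoidal average, so the two rows should agree up to the optimiser's tolerance. The optimisation route pays for the extra iterations but starts from a sensible point such as the mean.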

Our research also documents the simulated empirical performance of different medians in terms of computation time and mean squared error. Using different R functions to calculate different multivariate medians, we find that the spatial median and the projection median are always accurate and relatively fast with the existing R functions. The other multivariate medians exhibit either slow speed or large mean squared error.

Finally, we introduced our R package, Yamm, which contains our three methods to compute the projection median. We showed that our methods coincide with each other in ℝ² and ℝ³, and that the multivariate medians are not particularly affected by the outliers in the dataset, whereas the location of the mean varies considerably. Currently, the function PmedTrapz in the R package is only valid in ℝ² and ℝ³; further investigation could extend this function to higher dimensions.

The Yamm package also introduces our muqie plots, which are capable of producing animated plots of the projected quantiles of two- and three-dimensional sets. The animated ‘growth’ of these “quantile” plots gives a vivid picture of the extent, spread and configuration of data in the sets.

The Yamm package is available on the CRAN archive.

References

  1. Hayford J. W. (1902). What is the center of an area or the center of a population. Journal of the American Statistical Association, 8(58):47–58.
  2. Weber A. (1909). Über den Standort der Industrien. Mohr.
  3. Weber A. (1929). Theory of the Location of Industries. The University of Chicago Press.
  4. Tukey J. W. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, 2:523–531.
  5. Oja H. (1983). Descriptive statistics for multivariate distributions. Statistics & Probability Letters, 1:327–332.
  6. Small C. G. (1990). A survey of multidimensional medians. International Statistical Review, 58(3):263–277.
  7. Chaudhuri P. and Sengupta D. (1993). Sign tests in multidimension: Inference based on the geometry of data cloud. Journal of the American Statistical Association, 88:1363–1370.
  8. Oja H. (2013). Multivariate median. In Becker C., Fried R., and Kuhnt S., editors, Robustness and Complex Data Structures, chapter 1, pages 3–16. Springer, Berlin.
  9. Durocher S. and Kirkpatrick D. G. (2005). The projection median of a set of points in ℝ². Journal of Computational Geometry, 42:364–375.
  10. Basu R., Bhattacharya B. B., and Talukdar T. (2012). The projection median of a set of points in ℝᵈ. Discrete and Computational Geometry, 47(2):329–346.
  11. Johnson N. L., Kotz S., and Balakrishnan N. (1995). Continuous univariate distributions. Number v.2 in Wiley series in probability and mathematical statistics: Applied probability and statistics. Wiley & Sons.
  12. Robert C. P. and Casella G. (2005). Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag, Berlin, Heidelberg.
  13. Broyden C. G. (1970). The convergence of a class of double-rank minimization algorithms. The Institute of Mathematics and Its Applications, 6:76–90.
  14. Fletcher R. (1970). A new approach to variable metric algorithms. Computer Journal, 13(3):317–322.
  15. Goldfarb D. (1970). A family of variable metric updates derived by variational means. Journal of the Mathematics of Computation, 24(109):123–126.
  16. Shanno D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Journal of the Mathematics of Computation, 24(111):647–656.
  17. Bose P., Maheshwari A., and Morin P. (2003). Fast approximations for sums of distances, clustering and the Fermat-Weber problem. Computational Geometry: Theory and Applications, 24(3):135–146.
  18. Aloupis G., Langerman S., Soss M., and Toussaint G. T. (2003). Algorithms for bivariate medians and a Fermat-Torricelli problem for lines. Computational Geometry, 26:69–79.
  19. Langerman S. and Steiger W. (2003). Optimization in Arrangements. Springer Berlin Heidelberg, Berlin, Heidelberg.
  20. Vardi Y. and Zhang C. (2000). The multivariate L1-median and associated data depth. Proceedings of the National Academy of Sciences, 97(4):1423–1426.
  21. Cardot H., Cénac P., and Zitt P. A. (2013). Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient algorithm. Bernoulli Society for Mathematical Statistics and Probability, 19(1):18–43.
  22. Croux C., Filzmoser P., and Oliveira M. R. (2006). Algorithms for projection-pursuit robust principal component analysis. KU Leuven Working Paper No. KBI 0624, 19(1):18–43.
  23. Rousseeuw P. J. and Ruts I. (1996). Algorithm AS 307: Bivariate location depth. Journal of the Royal Statistical Society. Series C (Applied Statistics), 45(4):516–526.
  24. Rousseeuw P. J., Ruts I., and Tukey J. W. (1999). The bagplot: A bivariate boxplot. The American Statistician, 53(4):382–387.
  25. Struyf A. and Rousseeuw P. J. (2000). High-dimensional computation of the deepest location. Computational Statistics & Data Analysis, 34(4):415–426.
  26. Fischer D., Mosler K., Möttönen J., Nordhausen K., Pokotylo O., and Vogel D. (2016). Computing the Oja median in R: The package OjaNP. ArXiv, pages 1–36.
  27. Durocher S., Leblanc A., and Skala M. (2017). The projection median as a weighted average. Journal of Computational Geometry, 8:78–104.
  28. Ramsay K. (2017). Computable, robust multivariate location using integrated univariate ranks.
  29. Nelder J. A. and Mead R. (1965). A simplex algorithm for function minimization. Computer Journal, 7:308–313.
  30. Fletcher R. and Reeves C. M. (1964). Function minimization by conjugate gradients. Computer Journal, 7:148–154.
  31. Nocedal J. and Wright S. J. (1999). Numerical Optimization. Springer, first edition.
  32. Byrd R. H., Lu P., Nocedal J., and Zhu C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16:1190–1208.
  33. Lubischew A. A. (1962). On the use of discriminant functions in taxonomy. Biometrics, 18:455–477.
  34. Fraiman R. and Pateiro-Lopez B. (2012). Quantiles for finite and infinite dimensional data. Journal of Multivariate Analysis, 108:1–14.
  35. Kong L. and Mizera I. (2012). Quantile tomography: using quantiles with multivariate data. Statistica Sinica, 22:1589–1610.