A new method for computing the projection median, its influence curve and techniques for the production of projected quantile plots

Fan Chen; Guy Nason

doi:10.1371/journal.pone.0229845

Abstract

This article introduces a new formulation of, and method of computation for, the projection median. Additionally, we explore its behaviour on a specific bivariate setup, providing the first theoretical result on the form of the influence curve for the projection median, accompanied by numerical simulations. Via new simulations we comprehensively compare our performance with an established method for computing the projection median, as well as other existing multivariate medians. We focus on answering questions about accuracy and computational speed, whilst taking into account the underlying dimensionality. Such considerations are vitally important in situations where the data set is large, or where the operations have to be repeated many times and some well-known techniques are extremely computationally expensive. We briefly describe our associated R package that includes our new methods and novel functionality to produce animated multidimensional projection quantile plots, and also exhibit its use on some high-dimensional data examples.

Citation: Chen F, Nason G (2020) A new method for computing the projection median, its influence curve and techniques for the production of projected quantile plots. PLoS ONE 15(5): e0229845. https://doi.org/10.1371/journal.pone.0229845

Editor: Chengming Huang, Huazhong University of Science and Technology, CHINA

Received: August 6, 2019; Accepted: February 15, 2020; Published: May 7, 2020

Copyright: © 2020 Chen, Nason. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant beetle data is already published and available in the article "Lubischew, A. A. (1962). On the use of discriminant functions in taxonomy. Biometrics, 18:455–477." All other data is available in an R package via CRAN at https://cran.r-project.org/package=Yamm and also available as a Supporting Information file.

Funding: GN: supported by UK Engineering and Physical Sciences Research Council EP/K020951/1 (http://www.epsrc.ac.uk). FC: supported by a University of Bristol/China Scholarship Council batch award; there is no specific grant number (https://www.chinesescholarshipcouncil.com, http://www.bristol.ac.uk). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction: Overview of multivariate medians

The median is an estimator of location that is robust, i.e. not heavily influenced by outlying values, which are, loosely speaking, points that are far from the main body of the data. Let x = (x₁, …, x_k)^T be a mutually independent and identically distributed (i.i.d.) sample of length k from a univariate distribution with distribution function F. The univariate population median functional M(F) is $$M(F) = F^{-1}(1/2) = \inf\{x : F(x) \geq 1/2\}. \tag{1}$$

There are several equivalent definitions of the univariate median that all yield the same unique value of the true median μ for a distribution F with a bounded and continuous density f(μ) at μ.

For multivariate data there is no natural ordering of the data to enable the choice of the middle observation in the same way as for one-dimensional data. However, several different multivariate median concepts have been developed that retain some characteristics of the univariate median. For example, an early extension of the median to multivariate data was suggested by Hayford [1], which is simply the component-wise median, also known as the vector of marginal medians. The spatial median, also known as the L₁ median [2, 3], and Tukey’s median [4] are two other popular variants. Oja’s median [5] provides an alternative to the spatial median, but it is known to be more computationally expensive than other choices. These, and others, are reviewed in [6–8]. We briefly review some of them next, not least because we use them later in our simulation study.

1.1 Component-wise median

Let X = (x₁, …, x_k)^T be an n-dimensional i.i.d. sample with distribution function F. We assume that the n marginal distributions have bounded densities f₁(μ₁), …, f_n(μ_n) at the uniquely defined marginal medians μ = (μ₁, …, μ_n). The component-wise median, also known as the marginal sample median, minimises $$\sum_{i=1}^{k} \sum_{j=1}^{n} |x_{ij} - m_j|, \tag{2}$$ the sum of component-wise distances, over m ∈ ℝⁿ, where m = (m₁, …, m_n) and x_ij is the jth component of x_i. The corresponding population functional, M_C(F), for the vector of population medians minimises $$E_F\left( \sum_{j=1}^{n} |x_{1j} - m_j| \right). \tag{3}$$

1.2 Spatial median

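The spatial median M_S(X), also known as the L₁ median, minimises $$\sum_{i=1}^{k} \| x_i - m \| \tag{4}$$ over m ∈ ℝⁿ, where ‖m‖² = m₁² + ⋯ + m_n² is the (squared) Euclidean norm. The corresponding functional spatial median, M_S(F), minimises $$E_F\left( \| x_1 - m \| \right). \tag{5}$$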
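The two medians defined so far are simple enough to sketch in base R. The code below is an illustration only, not the implementation used later in the article: the function names are hypothetical, and the spatial median is computed by plain Weiszfeld iterations, a textbook scheme swapped in here for brevity.

```r
# Component-wise median: the univariate median of each column of the k x n data matrix.
componentwise_median <- function(X) apply(X, 2, median)

# Spatial (L1) median by Weiszfeld iterations (illustrative, not the article's method).
spatial_median <- function(X, tol = 1e-8, max_iter = 500) {
  m <- colMeans(X)                              # start from the sample mean
  for (iter in seq_len(max_iter)) {
    d <- sqrt(rowSums(sweep(X, 2, m)^2))        # Euclidean distances ||x_i - m||
    d <- pmax(d, 1e-12)                         # guard against division by zero
    m_new <- colSums(X / d) / sum(1 / d)        # inverse-distance weighted mean
    if (sqrt(sum((m_new - m)^2)) < tol) return(m_new)
    m <- m_new
  }
  m
}

set.seed(1)
X <- rbind(matrix(rnorm(200), ncol = 2), c(50, 50))  # 100 points plus one gross outlier
rbind(mean = colMeans(X),
      componentwise = componentwise_median(X),
      spatial = spatial_median(X))
```

On data like this the sample mean is dragged towards the outlier, while both medians should stay close to the origin.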

1.3 Oja’s median

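Let X = (x₁, …, x_k)^T be an i.i.d. sample in ℝⁿ with distribution function F. The volume of the n-variate simplex determined by the n + 1 vertices (m₁, …, m_{n+1}) is $$V(m_1, \ldots, m_{n+1}) = \frac{1}{n!} \left| \det \begin{pmatrix} 1 & \cdots & 1 \\ m_1 & \cdots & m_{n+1} \end{pmatrix} \right|. \tag{6}$$

The Oja median, M_O(X), minimises $$\sum_{1 \leq i_1 < \cdots < i_n \leq k} V(x_{i_1}, \ldots, x_{i_n}, m) \tag{7}$$ over m ∈ ℝⁿ. The corresponding functional M_O(F) minimises $$E_F\{ V(x_1, \ldots, x_n, m) \}. \tag{8}$$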

1.4 Tukey’s median

Let X = (x₁, …, x_k)^T be an i.i.d. sample of size k in ℝⁿ with distribution function F. Let ℋ be the class of all closed half spaces in ℝⁿ. For each H ∈ ℋ, define the empirical distribution $$F_k(H) = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}(x_i \in H), \tag{9}$$ where 𝟙(·) is the usual indicator function. Then, define the depth, D(μ), of a point μ ∈ ℝⁿ within the dataset, to be the infimum of F_k(H), taken over all closed half spaces H for which μ ∈ H. Tukey’s median is defined as the set of points μ of maximal depth.

2 The projection median

This section introduces our new method for computing the projection median, yamm. We prove that yamm is equivalent to the projection median, as defined by Durocher and Kirkpatrick [9] in ℝ² and then generalised to higher dimensions by Basu et al. [10]. We also explore, theoretically and numerically, the statistical behaviour of yamm using a mixture of two bivariate normal distributions.

2.1 Review of the projection median

2.1.1 Projection median in ℝ².

Let X be a multiset of points in ℝ² and θ ∈ [0, 2π) be an angle. Let X_θ denote the multiset defined by the projection of X onto the unit vector u_θ = (cos θ, sin θ), so $$X_\theta = \{ \langle x, u_\theta \rangle u_\theta : x \in X \}, \tag{10}$$ where 〈⋅, ⋅〉 denotes the usual inner product.

The projection median of a non-empty finite set X with points in ℝ² is $$M_P(X) = \frac{1}{\pi} \int_0^{2\pi} \mathrm{med}(X_\theta)\, d\theta, \tag{11}$$ where med(X_θ) is the median of the projection of X onto the line through the origin, parallel to u_θ.
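The bivariate definition can be evaluated directly by numerical integration over the angle. The following is a minimal sketch in base R, not the package implementation: the function name proj_median_2d and the argument n_sub are hypothetical, and the normalisation is assumed to be (1/π) times the integral over [0, 2π), which makes a single point its own projection median.

```r
# Projection median in the plane by the trapezoidal rule on [0, 2*pi).
# Assumed normalisation: (1/pi) * integral of med(X_theta) d(theta).
proj_median_2d <- function(X, n_sub = 360) {
  theta <- 2 * pi * (seq_len(n_sub) - 1) / n_sub   # periodic grid, endpoint omitted
  U <- cbind(cos(theta), sin(theta))               # one unit vector u_theta per row
  med <- apply(X %*% t(U), 2, median)              # median of <x, u_theta> for each angle
  # For a periodic integrand the trapezoidal rule is an equally weighted sum.
  (2 * pi / n_sub) * colSums(U * med) / pi
}

set.seed(2)
X <- matrix(rnorm(200, mean = 3), ncol = 2)
proj_median_2d(X)
```

Because the integrand is periodic in θ, the two end points of the trapezoidal rule coincide, which is why the grid omits 2π.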

2.1.2 Generalisation of the projection median.

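Given a fixed positive integer, n ≥ 2, and a finite set of points X in ℝⁿ, the n-dimensional projection median of X is $$M_P(X) = n \int_{S^{n-1}} \mathrm{med}(X_a)\, df(a), \tag{12}$$ where Sⁿ⁻¹ is the unit n-dimensional hypersphere, med(X_a) is the median of the projection of X onto the line through the origin parallel to a, and f is the normalised uniform measure over Sⁿ⁻¹. Hence, for a point x = (x₁, x₂, …, x_n) ∈ Sⁿ⁻¹, the n-dimensional spherical coordinates are given by $$x_1 = \cos\theta_1,\quad x_2 = \sin\theta_1 \cos\theta_2,\quad \ldots,\quad x_{n-1} = \sin\theta_1 \cdots \sin\theta_{n-2} \cos\theta_{n-1},\quad x_n = \sin\theta_1 \cdots \sin\theta_{n-2} \sin\theta_{n-1}, \tag{13}$$ where each angle θ₁, θ₂, …, θ_{n−2} has a range of π and θ_{n−1} has a range of 2π. Also, the normalised uniform measure f over Sⁿ⁻¹ is given by $$df = \frac{d_{S^{n-1}}V}{\int_{S^{n-1}} d_{S^{n-1}}V}, \tag{14}$$ where $d_{S^{n-1}}V = \sin^{n-2}\theta_1 \sin^{n-3}\theta_2 \cdots \sin\theta_{n-2}\, d\theta_1 \cdots d\theta_{n-1}$ is the volume element of the (n − 1)-sphere.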

Basu et al. [10] proved that the projection median has a breakdown point of 1/2 for all n ≥ 2.

2.2 Yet another multivariate median (Yamm)

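Let X = (x₁, …, x_k)^T be a random sample of size k, with x_i ∈ ℝⁿ. Let a be an n × 1 projection vector of unit length, 1_k be the k × 1 vector of ones and μ a shift vector of length n. Let y be the projection of X onto a after X has been shifted by μ: $$y = (X - 1_k \mu^T) a, \tag{15}$$ where y ∈ ℝᵏ. The univariate median m of the projected points y is $$m_X(\mu, a) = \mathrm{med}(y) = \mathrm{med}\{(X - 1_k \mu^T) a\}. \tag{16}$$

Now define the integral $$M_{X,m}(\mu) = \int_{S^{n-1}} m_X(\mu, a)^2\, da. \tag{17}$$

The yamm estimator of location for X is $$\mu^* = \arg\min_{\mu \in \mathbb{R}^n} M_{X,m}(\mu). \tag{18}$$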

Eqs (17) and (18) illustrate the rationale behind yamm. Intuitively, if the shift vector μ is far away from the true ‘middle’ of the dataset, then the magnitude of m_X(μ, a), as well as the integral M_{X, m}(μ), will be large. By contrast, m_X(μ, a) becomes smaller as μ moves closer to the true ‘middle’ of the data set.

Instead of computing the squared value of m_X(μ, a) for the integral, we also considered the absolute value as an alternative. However, this leads to similar numerical results.

Example. We now generate two polar plots of the absolute value of m_X(μ, a), one with μ close to the true median and one with μ further away. A random two-dimensional dataset with k = 100 points was generated, whose Tukey median was computed as (2.78, 8.16). Here, the Tukey median is to be interpreted as a ‘sensible’ middle of the data set. The shift vector μ is set to be (2.2, 8) and (2, 7.5) respectively, and for each plot, two thousand random projections were used to calculate the univariate median m_X(μ, a), using methods to be explained in Section 2.4. Fig 1 shows that when μ is near the Tukey median, the magnitude of each m_X(μ, a) is less than 0.65, while larger values, ranging from 0 to 1.2, appear when μ is further from the median. Overall, the integrated quantity is smaller when μ is closer to the Tukey median.

Fig 1. Polar plot (in radians) of the magnitude of m_X(μ, a).

Grey line: μ = (2.2, 8) and Blue line: μ = (2, 7.5).

https://doi.org/10.1371/journal.pone.0229845.g001

The projection median and yamm definitions seem similar, as both project the multiset onto the line passing through the origin, and then take the median. However, the projection median integrates med(X_a) directly over the unit hypersphere in ℝⁿ, whereas yamm minimises the objective function over the shift vector μ. Despite these differences, the following theorem shows that the projection median and yamm are identical.

Theorem. For any finite multiset X ⊂ ℝⁿ with n ≥ 2, yamm is equivalent to the projection median.

For the proof of the theorem, see S1 Appendix.

2.3 Yamm behaviour on a bivariate normal mixture

To gain insight about the theoretical behaviour of yamm we study the case of yamm applied to a mixture of two bivariate normals, where one is thought of as the bulk and the other as the outlier of the distribution. Such a setup enables us to evaluate the robustness of yamm. We numerically and theoretically assess the influence curve when moving the outlier far from the bulk.

2.3.1 Bivariate mixture setup.

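Let X₁ ∼ N₂(ν₁, Σ₁) and X₂ ∼ N₂(ν₂, Σ₂) be independent bivariate normal random variables, where X₁ = (X₁₁, X₁₂)^T, X₂ = (X₂₁, X₂₂)^T with mean vectors ν₁ = (ν₁₁, ν₁₂)^T and ν₂ = (ν₂₁, ν₂₂)^T. Let R(θ) be a rotation matrix with angle θ given by $$R(\theta) = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}. \tag{19}$$

We are interested in the first row of this matrix, which describes the projection onto direction θ. Let Y_i = (Y_i1, Y_i2)^T = R(X_i − μ) for i = 1, 2 respectively, where μ = (μ₁, μ₂)^T is the shift vector introduced in (15). Basic multivariate theory shows that $$Y_i \sim N_2\{ R(\nu_i - \mu),\; R \Sigma_i R^T \}. \tag{20}$$

Let Y_i1 be the first entry of Y_i = (Y_i1, Y_i2)^T for i = 1, 2. Then, it is immediate that Y_i1 ∼ N(s_i, σ_i²), where σ_i² is the first diagonal entry of RΣ_iR^T and $$s_1 = (\nu_{11} - \mu_1)\cos\theta + (\nu_{12} - \mu_2)\sin\theta, \tag{21}$$ $$s_2 = (\nu_{21} - \mu_1)\cos\theta + (\nu_{22} - \mu_2)\sin\theta. \tag{22}$$

The mixture distribution that we study is $$f_W(x) = (1 - \epsilon) f_{X_1}(x) + \epsilon f_{X_2}(x), \tag{23}$$ where f_{X_i} is the density of X_i, and ϵ ∈ [0, 1] is typically small. Here, X₁ is considered to be the bulk of the distribution and X₂ the outlier.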

2.3.2 Projected distribution.

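Based on the bivariate setup above, the projected distribution is $$f_Y(y) = \frac{1 - \epsilon}{\sigma_1} \phi\left( \frac{y - s_1}{\sigma_1} \right) + \frac{\epsilon}{\sigma_2} \phi\left( \frac{y - s_2}{\sigma_2} \right), \tag{24}$$ where s_i, σ_i are as above and ϕ is the standard normal density.

The distribution function of the projected Y(θ) is $$F_Y(y) = (1 - \epsilon) \Phi\left( \frac{y - s_1}{\sigma_1} \right) + \epsilon \Phi\left( \frac{y - s_2}{\sigma_2} \right), \tag{25}$$ where Φ is the standard normal distribution function. We require the median of the projected distribution, i.e. find y_m such that $$F_Y(y_m) = 1/2. \tag{26}$$

Finding an exact analytic solution for y_m is difficult. Hence, we will simplify the problem and assume that Σ₁ = Σ₂ = I₂, the identity matrix. Since R(θ) is an orthogonal matrix, this means that RΣ_iR^T = I₂, so σ₁ = σ₂ = 1, and Eq (25) becomes $$F_Y(y) = (1 - \epsilon) \Phi(y - s_1) + \epsilon \Phi(y - s_2). \tag{27}$$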

For small ϵ, we know that the median should be close to the median of the bulk, so the median of F_Y should be close to s₁, the median of the first component of the mixture in Eq (27).

2.3.3 Theoretical approximation of yamm on the mixture.

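We derive a theoretically based approximation to the empirical influence function. We proceed by using a Taylor series expansion of F_Y(y) around s₁, the quantity we know is close to our median: $$F_Y(y) = F_Y(s_1) + (y - s_1) f_Y(s_1) + R_1(y), \tag{28}$$ where R₁(y) is the remainder, F_Y(s₁) = (1 − ϵ)/2 + ϵΦ(s₁ − s₂) and f_Y(s₁) = (1 − ϵ)ϕ(0) + ϵϕ(s₁ − s₂). When y is close to s₁, Eq (28) is approximately equal to 1/2 when ϵ is small, which is the behaviour we expect.

To find an approximation to the median we solve F_Y{y_m(θ)} = 1/2. Ignoring remainders, subtracting 1/2 off both sides of Eq (28) gives $$0 \approx \epsilon\{ \Phi(s_1 - s_2) - 1/2 \} + \{ y_m(\theta) - s_1 \} f_Y(s_1), \tag{29}$$ and then $$y_m(\theta) \approx s_1 + \frac{\epsilon\{ 1/2 - \Phi(s_1 - s_2) \}}{(1 - \epsilon)\phi(0) + \epsilon\phi(s_1 - s_2)}. \tag{30}$$

Now using $$\phi(0) = (2\pi)^{-1/2} \tag{31}$$ and Φ(s₁ − s₂) = 1 − Φ(s₂ − s₁), we can write $$y_m(\theta) \approx s_1 + \frac{\epsilon\sqrt{2\pi}\{ \Phi(s_2 - s_1) - 1/2 \}}{1 - \epsilon + \epsilon\sqrt{2\pi}\,\phi(s_2 - s_1)}. \tag{32}$$

For small ϵ the denominator is close to 1. From Eqs (21) and (22), we can write: $$s_2 - s_1 = \delta_1 \cos\theta + \delta_2 \sin\theta, \tag{33}$$ where δ₁ = ν₂₁ − ν₁₁ and δ₂ = ν₂₂ − ν₁₂. Thus $$y_m(\theta) \approx s_1 + \epsilon\sqrt{2\pi}\{ \Phi(\delta_1 \cos\theta + \delta_2 \sin\theta) - 1/2 \}. \tag{34}$$

According to Eq (17), our job is to find the optimal μ* = (μ₁*, μ₂*)^T, which minimises $$\int_0^{2\pi} y_m(\theta)^2\, d\theta, \tag{35}$$ where y_m(θ) depends on μ through s₁.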

The integrand involves the standard normal distribution function, which is tricky to handle analytically. Hence, we use the approximation ϕ(z) ≈ (1 + cos z)/(2π), for −π < z < π, for the standard normal density [11], which enables the following proposition.

Proposition. Let X₁ = (X₁₁, X₁₂)^T and X₂ = (X₂₁, X₂₂)^T. Suppose that X₁ ∼ N₂(ν₁, I₂) and X₂ ∼ N₂(ν₂, I₂) independently, where ν₁ = (ν₁₁, ν₁₂)^T and ν₂ = (ν₂₁, ν₂₂)^T, respectively. Let the mixture, W, of X₁ and X₂ have density (1 − ϵ)f_{X₁} + ϵf_{X₂}, where ϵ ∈ [0, 1] is considered small.

An approximation of the yamm estimator, μ*, is (36) where δ₁ = ν₂₁ − ν₁₁, δ₂ = ν₂₂ − ν₁₂ and α = arctan(δ₂/δ₁). The approximation we use is valid whenever |δ₁ cos θ + δ₂ sin θ| < π, where θ is the projection direction when computing yamm. This inequality is true for all θ whenever √(δ₁² + δ₂²) < π.

Intuitively, the approximation in the Proposition works whenever the two cluster means are close enough together, i.e. when √(δ₁² + δ₂²) < π.

In particular, when ν₁₁ = ν₂₁ or ν₁₂ = ν₂₂ (i.e. when one of the δ_i = 0, i = 1, 2), we can form a more accurate approximation. This is because the approximation for the standard normal density, ϕ(z) ≈ (1 + cos z)/(2π), is no longer required to find the optimal μ minimising Eq (35). Without loss of generality, let ν₁ = (ν₁₁, ν₁₂)^T = (0, 0)^T and ν₂ = (ν₂₁, ν₂₂)^T = (0, d)^T; we obtain the yamm estimator as follows (37) where BesselI[n, z] is the modified Bessel function of the first kind, sometimes denoted I_n(z). For the proof of the proposition, see S2 Appendix.
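The minimisation behind the Proposition can be sketched as follows. This is a reconstruction under stated assumptions, not a reproduction of S2 Appendix, and the resulting expressions may be arranged differently from Eqs (36) and (37). It assumes the small-ϵ median approximation y_m(θ) ≈ s₁ + ϵ√(2π){Φ(δ₁ cos θ + δ₂ sin θ) − 1/2}, and introduces the notation r = √(δ₁² + δ₂²) and u_θ = (cos θ, sin θ)^T for convenience.

```latex
% Objective as a function of c = \nu_1 - \mu, with
% g(\theta) = \epsilon\sqrt{2\pi}\{\Phi(\delta_1\cos\theta + \delta_2\sin\theta) - 1/2\}:
\begin{align*}
M(\mu) &= \int_0^{2\pi} \{ c^T u_\theta + g(\theta) \}^2 \, d\theta . \\
% First-order condition, using \int_0^{2\pi} u_\theta u_\theta^T \, d\theta = \pi I_2:
0 &= \pi c + \int_0^{2\pi} u_\theta \, g(\theta) \, d\theta
  \quad\Longrightarrow\quad
  \mu^* = \nu_1 + \frac{1}{\pi} \int_0^{2\pi} u_\theta \, g(\theta) \, d\theta . \\
% Symmetry removes the component orthogonal to (\delta_1, \delta_2); integration by parts gives
\mu^* &= \nu_1 + \epsilon \sqrt{2/\pi}
  \left\{ \int_0^{2\pi} \sin^2 t \; \phi(r \cos t) \, dt \right\}
  \begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix} . \\
% With the exact normal density the integral is a Bessel expression:
\mu^* &= \nu_1 + \epsilon \, e^{-r^2/4}
  \left\{ I_0(r^2/4) + I_1(r^2/4) \right\}
  \begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix} . \\
% With \phi(z) \approx (1 + \cos z)/(2\pi) it becomes instead
\mu^* &\approx \nu_1 + \epsilon \sqrt{2/\pi}
  \left\{ \tfrac{1}{2} + J_1(r)/r \right\}
  \begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix} .
\end{align*}
```

Under these assumptions the exact-density form has norm ϵ r e^{−r²/4}{I₀(r²/4) + I₁(r²/4)}, which tends to 4ϵ/√(2π) as r grows and therefore plateaus, whereas the cosine-based form grows roughly linearly in r. Here J₁ denotes the Bessel function of the first kind.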

2.3.4 The yamm influence curve on the mixture.

This section numerically computes and plots yamm for the case where ϵ = 0.05 and Σ₁ = Σ₂ = I₂, with ν₁ = (0, 0)^T and ν₂ = (0, d)^T for d ∈ [0, 10]. We explore how yamm varies as d increases from 0 to 10 in steps of 0.2. If yamm is robust, then it should increase with d, but plateau beyond a certain point.

For each value d we estimate yamm as the mean over five hundred bivariate mixture realizations, with two thousand projections involved for each yamm computation, using methods described below in Section 2.4. The numerically computed crosses in Fig 2 show that, for this setup, yamm plateaus somewhere between d = 2 and d = 4.
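A population-level version of this experiment can be sketched without simulating data. The code below is an illustration under stated assumptions, not the computation behind Fig 2: it assumes identity covariances, so that the projected mixture has distribution function (1 − ϵ)Φ(y − s₁) + ϵΦ(y − s₂), finds its median by root search, and uses the fact that shifting by μ changes the projected median by −(μ₁ cos θ + μ₂ sin θ). The function names mix_median and yamm_pop are hypothetical.

```r
# Median of the projected mixture (1 - eps) N(s1, 1) + eps N(s2, 1), by root search.
mix_median <- function(s1, s2, eps) {
  g <- function(y) (1 - eps) * pnorm(y - s1) + eps * pnorm(y - s2) - 0.5
  uniroot(g, interval = range(s1, s2) + c(-1, 1), tol = 1e-10)$root
}

# Population yamm for nu1 = (0, 0), nu2 = (0, d): minimise the mean squared
# projected median over the shift mu, on an equally spaced grid of angles.
yamm_pop <- function(d, eps = 0.05, n_theta = 360) {
  theta <- 2 * pi * (seq_len(n_theta) - 1) / n_theta
  s1 <- rep(0, n_theta)                       # projected bulk mean at mu = 0
  s2 <- d * sin(theta)                        # projected outlier mean at mu = 0
  m0 <- mapply(mix_median, s1, s2, MoreArgs = list(eps = eps))
  obj <- function(mu) mean((m0 - mu[1] * cos(theta) - mu[2] * sin(theta))^2)
  optim(c(0, 0), obj, method = "BFGS")$par
}

d_grid <- c(0, 1, 2, 4, 8, 10)
cbind(d = d_grid, yamm2 = sapply(d_grid, function(d) yamm_pop(d)[2]))
```

The second component should rise with d and then flatten, which is the bounded-influence behaviour described above.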

Fig 2. Yamm computed on simulated setup, increasing the distance between two bivariate normals.

Crosses: numerically computed values; Solid blue line: approximation computed for general ν₁ and ν₂; Solid red line: approximation computed when ν₁ = (0, 0)^T and ν₂ = (0, d)^T.

https://doi.org/10.1371/journal.pone.0229845.g002

The solid red line in Fig 2 shows our theoretical approximation of the yamm influence curve with the more specific setup, where μ* follows Eq (37). Under this approximation, the influence curve closely follows the numerically computed crosses. On the other hand, the solid blue line is the approximation of yamm under the more general setting of Eq (36). It performs reasonably well when the inter-cluster mean distance satisfies 0 < d < 4.5, but approximates poorly for d > 4.5 and does not plateau.

This is because, in the setup, δ₁ = 0, δ₂ = d, and d > 4.5 implies √(δ₁² + δ₂²) > π, so the validity condition of the Proposition no longer holds. However, the specific setup approximation of yamm obviously does not work for arbitrary values of ν₁ and ν₂, whereas the general approximation gives a good theoretical idea of the yamm influence curve when the two means of the clusters are close enough together.

2.4 Projection median and yamm computation

2.4.1 Projection median computation.

A simple Monte Carlo integration [12] can be used to compute an approximation of the projection median by $$M_P(X) \approx \frac{n}{J} \sum_{j=1}^{J} \mathrm{med}(X_{a_j}), \tag{38}$$ where J represents the number of projections used, and {a_j}, j = 1, …, J, is a set of random, independently-drawn, unit length n-vectors over Sⁿ⁻¹.

Calculating the approximation in Eq (38) is relatively straightforward, but a large value of J is required to ensure accuracy. Another approach computes the projection median directly from the definition in Eq (12), using the spherical coordinates illustrated in Eq (13), where the integral can be obtained by the trapezoidal rule. For example, in the two-dimensional case, we apply the trapezoidal rule once on Eq (11). In the three-dimensional case, we have to apply the trapezoidal rule twice for the double integral, and so on. This direct approach is easy to implement when our dataset has a low dimension, but the work becomes excessive even in moderately high dimensions, e.g. n = 10.
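The Monte Carlo approach can be sketched in a few lines of base R. This is an illustration, not the PmedMCInt implementation: the function names are hypothetical, directions are drawn uniformly on the sphere by normalising Gaussian vectors, and the estimate is assumed to be n times the average of the projected medians, each placed back along its direction.

```r
# J directions drawn uniformly on the unit sphere in R^n (normalised Gaussian vectors).
random_directions <- function(J, n) {
  A <- matrix(rnorm(J * n), nrow = J)
  A / sqrt(rowSums(A^2))
}

# Monte Carlo projection median: n times the average of med(<x, a_j>) * a_j.
proj_median_mc <- function(X, J = 20000) {
  n <- ncol(X)
  A <- random_directions(J, n)
  med <- apply(X %*% t(A), 2, median)   # one univariate median per direction
  n * colMeans(A * med)
}

set.seed(3)
X <- matrix(rnorm(300, mean = 1), ncol = 3)
proj_median_mc(X, J = 5000)
```

Repeating the call with a different seed gives a feel for how much the estimate moves with J, which is the stability issue noted above.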

2.4.2 Computing yamm.

To compute an approximation to yamm, we can also use Monte Carlo integration together with an optimiser. Let J be the number of projections and {a_j}, j = 1, …, J, be a set of independent random unit length n-vectors. An estimator for M_{X, m}(μ) is given by $$\hat{M}_{X,m}(\mu) = \frac{1}{J} \sum_{j=1}^{J} m_X(\mu, a_j)^2. \tag{39}$$

We then numerically minimise Eq (39) over μ to obtain our estimated location measure, using the BFGS optimization method [13–16]. BFGS is a quasi-Newton algorithm searching for a stationary point of a function via local quadratic approximation. Parallel versions such as optimParallel exist as easy-to-use packages in R.

With reasonable starting values, such as the mean or other multivariate medians, yamm typically provides accurate results with a considerably smaller number of projections than used by the Monte Carlo projection median method mentioned above.
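The optimisation route can be sketched with base R's optim. This is an illustration, not the package's yamm function: the name yamm_sketch, the default starting value (the component-wise median) and the averaging over directions are assumptions made here. The sketch precomputes the projected medians at μ = 0 and uses the identity that shifting the data by μ lowers the projected median along a by 〈μ, a〉, so each objective evaluation is cheap.

```r
yamm_sketch <- function(X, J = 2000, start = apply(X, 2, median), doabs = FALSE) {
  n <- ncol(X)
  A <- matrix(rnorm(J * n), nrow = J)
  A <- A / sqrt(rowSums(A^2))                 # fixed set of J unit directions
  m0 <- apply(X %*% t(A), 2, median)          # projected medians of the unshifted data
  obj <- function(mu) {
    m <- m0 - as.vector(A %*% mu)             # projected medians after shifting by mu
    if (doabs) mean(abs(m)) else mean(m^2)    # absolute or squared criterion
  }
  optim(start, obj, method = "BFGS", control = list(reltol = 1e-6))
}

set.seed(4)
X <- matrix(rnorm(200, mean = 2), ncol = 2)
yamm_sketch(X)$par
```

Keeping the directions fixed across evaluations makes the objective a deterministic function of μ, which a quasi-Newton optimiser needs.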

In conclusion, projection median computation via the trapezoidal rule is fast and accurate in low dimensions, but increasingly onerous in higher dimensions, as progressively more multidimensional integration is required. For higher dimensions, we prefer the Monte Carlo method and prefer yamm over the projection median as it does not require such a large number of projections, particularly if the optimiser is given a good starting solution.

Overall, approximating the projection median by the trapezoidal rule is a good choice in ℝ² and ℝ³, and either of the other two methods can be used in higher dimensions.

3 Empirical performance for different medians

This section reviews the theoretical computational complexity for a variety of medians and computes some running times for real implementations of several medians computed in R. We then present some results for accuracy of estimation for these medians.

3.1 Computational complexity and empirical speed

For a dataset in ℝⁿ with k observations, the computational complexity for the spatial median is O(nk) [17], which is the same for the exact computation of the component-wise median. The projection median can be obtained in O(k^{4/3} log^{1+ϵ} k) time in ℝ² [9], and O(k^{5/2+ϵ}) time in ℝ³ [10]. In ℝⁿ, with n > 3, Basu et al. showed that O[k^{n{1−δ_n/(n+1)}+ϵ}] time is required to compute the projection median, where δ_n = (4n − 3)⁻ⁿ and ϵ is a fixed small constant. Several algorithms for other multivariate medians have been developed for the bivariate case. The current best algorithms for Oja’s and Liu’s medians require O(k log³ k) and O(k⁴) time, respectively [18], whereas that for the fastest bivariate Tukey median is O(k log³ k) [19]. The calculation of these three multivariate medians in higher dimensions is more complicated and approximate computation is often preferred or required.

To provide empirical assessment of the real computation speed, we apply several R software medians to simulated data. There are often several R functions, using different algorithms, that compute the same median. For example, spatial.median from the library ICSNP estimates the median with the algorithm developed by Vardi and Zhang [20], while Gmedian developed by Cardot et al. [21] is faster but, perhaps, less accurate. In addition, l1median [22] from library pcaPP and med from depth also provide opportunities to compute the spatial median. Hence, after some experiments, we choose the best function (evaluated in terms of speed and accuracy) for each multivariate median in ℝ², as shown in Table 1. Much of the software for multivariate medians in R only works in low numbers of dimensions.

Table 1. R functions used for analysing different multivariate medians.

https://doi.org/10.1371/journal.pone.0229845.t001

The med function can only calculate the bivariate Liu’s median, which is considerably more challenging in higher dimensions. The calculation of Tukey’s median is exact in one and two dimensions, and approximate in higher dimensions. We use the approximate Tukey’s median computation in the med function, due to numerical errors that sometimes surface when using the exact algorithm. For Oja’s median, the approximate method (evolutionary algorithm) is used instead of the exact one, as it is faster and can deal with high dimensions.

Table 2 displays mean computation times and their standard deviations across 1000 simulated datasets from the two-dimensional Laplace distribution with different numbers of observations (k) for each set. The results are produced by running R on a single core of an Intel i7-8750H processor with 2.20 GHz base clock using 16 GB RAM. For small k, Liu’s median is fastest, but it is slower than the others for larger k. In this experiment, Oja’s median is the slowest for small k values, but its speed does not appear to be particularly sensitive to k. Hence, it is faster than Tukey’s median when k = 200. The projection median is one of the quickest when k is below 100, while for large k values, the component-wise median and the spatial median are faster.

Table 2. Mean and standard deviation (s.d.) of the operation time (×10⁻⁵) in seconds for data in ℝ².

https://doi.org/10.1371/journal.pone.0229845.t002

The results in Table 2 are produced by only one possible R function for one median. However, other functions can be used. For example, the med function from the depth package can also be used to calculate the spatial median and provides accurate answers. It is extremely fast for small k and lower dimensions, but it becomes slower than l1median for larger k. Hence, we use l1median to compute the spatial median, whose performance for small k is also good.

3.2 Mean squared error for some medians

We assess the accuracy of some of the medians via empirical mean squared error. If μ̂ is an estimator in ℝⁿ with respect to the unknown parameter μ ∈ ℝⁿ, then the mean squared error is $$\mathrm{MSE}(\hat{\mu}) = E\left\{ \frac{1}{n} \| \hat{\mu} - \mu \|^2 \right\}, \tag{40}$$ where ‖μ̂ − μ‖²/n represents the squared Euclidean distance between μ̂ and μ, normalized by the vector length. Smaller values are better.
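The empirical version of this criterion averages the normalised squared error over simulated datasets. The sketch below is illustrative only: the function name empirical_mse is hypothetical, and the bivariate Laplace data are assumed to have independent double-exponential margins centred at zero, which may differ from the distribution used for the tables.

```r
# Empirical MSE: squared Euclidean error divided by the dimension, averaged over datasets.
# estimates: one row per simulated dataset, one column per coordinate.
empirical_mse <- function(estimates, mu) {
  mean(rowSums(sweep(estimates, 2, mu)^2) / length(mu))
}

set.seed(5)
sims <- replicate(200, {
  # k = 50 points with independent Laplace(0, 1) margins (an assumption of this sketch)
  X <- matrix(rexp(100) - rexp(100), ncol = 2)
  c(colMeans(X), apply(X, 2, median))
})
# sims is 4 x 200: rows 1-2 hold the sample mean, rows 3-4 the component-wise median
mse_mean <- empirical_mse(t(sims[1:2, ]), c(0, 0))
mse_median <- empirical_mse(t(sims[3:4, ]), c(0, 0))
c(mean = mse_mean, componentwise = mse_median)
```

For long-tailed data of this kind the median-based estimate would be expected to show the smaller error.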

Table 3 shows MSE results based on the same simulations as used for Table 2. Not surprisingly, for this long-tailed data, all medians perform better than the sample mean. The spatial median and the projection median have smaller mean squared error, the latter performing better for small k values. On the other hand, Liu’s median always produces a very high mean squared error.

Table 3. Mean squared error (×10⁻²) for data as in Table 2.

https://doi.org/10.1371/journal.pone.0229845.t003

Conclusion. Based on these simulations, for the R functions listed in Table 1, the spatial and projection medians consistently have the lowest mean squared error, together with fast running speeds. Although Liu’s median has the shortest computation time for small k, it is the most inaccurate, and its computation time becomes long for large datasets. Similarly, the component-wise median is fast, even when k increases, but it has a large mean squared error. Hence, the spatial and projection medians are good choices when computing two-dimensional robust measures of location in this case, and the latter is preferred for small datasets. The computational results for high-dimensional simulations (n = 3, 5, 10) can be found in S1 Table.

3.3 2D projection median computation functions

The R package DurocherProjectionMedian can be downloaded from Github at https://github.com/12ramsake/DurocherProjectionMedian.

The DurocherProjectionMedian package provides functions to compute the projection median via the Monte Carlo integration method (using projectionMedianMC) [27] and an exact method for two dimensions proposed by Ramsay [28] (using projectionMedian2D). Tables 4 and 5 show the performance of the different functions computing the two-dimensional projection median of 1000 simulated datasets from the Laplace distribution with different k.

Table 4. Mean and standard deviation (s.d.) of the operation time (×10⁻⁵) in seconds for different R functions to produce the projection median.

https://doi.org/10.1371/journal.pone.0229845.t004

Table 5. Mean squared error (×10⁻³) for 1000 sets of data in ℝ² generated from the Laplace distribution.

https://doi.org/10.1371/journal.pone.0229845.t005

For the Monte Carlo integration method, when k is small (e.g. under 150 in ℝ²), the computation time of projectionMedianMC is longer than that of our PmedMCInt under the same number of projections in both ℝ² and higher dimensions, whereas both implementations have almost the same MSE.

Although projectionMedian2D provides a slightly smaller MSE, it is slow. Our PmedTrapz is faster and its MSE performance is comparable to projectionMedian2D, and, hence, PmedTrapz might be recommended as the best choice for ℝ².

4 The yamm R package

Our Yamm R package provides users with functions to compute the projection median according to the different methods mentioned in section 2.4. PmedMCInt computes the projection median using the Monte Carlo approximation; PmedTrapz uses the trapezoidal rule and is currently only valid in two and three dimensions; yamm computes the projection median using the Monte Carlo approximation to find the shift vector μ minimising our objective function yamm.obj. The package also includes functions Plot2dMedian and Plot3dMedian to plot different multivariate medians for data in ℝ² and ℝ³, respectively. Most functions in our package are implemented internally using C code. This section provides some brief illustrations of the use of Yamm.

4.1 Yamm projection medians

The function PmedMCInt computes the projection median for any multivariate data, x, by invoking

PmedMCInt(x, nprojs = 20000)

Since this function uses Monte Carlo integration, we need to choose the number of projections J, which has a default value of 20000. Typically, a large J is required to obtain a stable answer, which means the result will not change much if recomputed under the same conditions. This function returns the projection median estimate vector.

The function PmedTrapz computes the projection median in ℝ² and ℝ³ and is invoked by

PmedTrapz(x, no.subinterval)

PmedTrapz applies the trapezoidal rule once in ℝ² and twice in ℝ³ on each entry of the vector med(X_a), mentioned in section 2.1.2, and returns a vector of the projection median estimate.

The argument no.subinterval determines the number of subintervals for the trapezoidal rule. For the bivariate case the no.subinterval argument is a single number that controls the number of subdivisions for the one-dimensional integration; for the trivariate case the argument is a vector of length two that controls the number of subdivisions for the two integrals. In general, it is better to use at least 36 subintervals, which typically produces accurate results without excessive running time.

More subintervals may be appropriate for more complex datasets. For some unusual data sets it would be ideal to have a high resolution of the interval of integration in one particular region, and a relatively low resolution elsewhere, but this is beyond the scope of the current research. A small number of partitions, e.g. below 15, is not recommended for reasons of accuracy.

The yamm function is valid for data of any dimension. It uses an optimiser to provide another method to compute the projection median. The arguments are

yamm(x, nprojs = 2000, reltol = 1e-06,
     xstart = l1median(x), opt.method = "BFGS",
     doabs = 0, full.results = FALSE)

The yamm function is a wrapper to minimise the objective function yamm.obj, which uses the Monte Carlo method to approximate the squared or absolute value of the univariate median of the projection of the shifted data matrix. The nprojs argument controls the number of projections in the Monte Carlo approximation and doabs is an indicator, where 1 uses the absolute value of the univariate median and 0 forces the use of the squared value. The arguments reltol, xstart, opt.method are supplied directly to the R optimisation function optim: reltol is the tolerance for the optimiser, with default value of 10⁻⁶. Usually, we set a larger value (e.g. 10⁻³) for this argument, which reduces the running time whilst maintaining accuracy. The argument opt.method controls the selection of optimisation method, which can be any of the five options “BFGS”, “Nelder-Mead” [29], “CG” [30], “L-BFGS-B” [31], and “SANN” [32]. The default choice “BFGS” is relatively fast and stable in our case. See the help page of the function optim in R for further details about the different optimisation methods. The xstart argument provides the initial value for the parameters to optimise over, which plays an important role in the function yamm. A good starting point will reduce the running time and provide a more accurate result, so we use the spatial median as the default value. Other multivariate medians could be used, but they need to be fast. If full.results = TRUE, the function returns a list with the components obtained from the optim function; otherwise, it returns a vector containing the multivariate median estimate.

4.2 Some real examples

We now exhibit results for the projection medians applied to some real datasets. Our plots show different multivariate medians and the sample mean value for two simulated datasets in ℝ² and ℝ³, respectively, allowing the methods to be compared.

4.2.1 Beetle data.

The famous beetle data [33] takes six measurements on 74 flea-beetles, with each belonging to one of three different species. We apply yamm and obtain the following output:

yamm(beetle, nprojs = 1000, reltol = 1e-3, doabs = 0,
     full.results = TRUE)
$par
[1] 180.19194 123.73920 49.97819 135.87913 13.62603 95.49062
$value
[1] 5.585139
$counts
function gradient
      90        4
$convergence
[1] 0
$message
NULL

The yamm results show that the optimiser executed 90 calls to the objective function yamm.obj and constructed 4 gradients. The par component contains the estimate of the yamm for the beetle data. These results are not that different from the output generated by PmedMCInt, which is

PmedMCInt(beetle, nprojs = 100000)

[1] 179.54428 124.72128 50.56934 137.47363 13.23372 94.80188

For the beetle data, we chose the number of projections in yamm to be 1000, while many more projections (e.g. 100000) were required in PmedMCInt to obtain a similar and consistent result, although yamm also requires optimisation. Fewer projections for the function PmedMCInt may lead to inaccurate results for some components of the multivariate median. PmedTrapz is not valid in this six-dimensional case, but we will show that it has a similar output when computing the projection median in two and three dimensions.

4.2.2 Simulated data in ℝ² with three clusters.

We now use the function Plot2dMedian in the package Yamm to generate and display different multivariate medians for the simulated data set clusters2d. This set contains three clusters, which are generated randomly from different independent normal distributions, and two outliers.

Here, we display the three different estimates of the projection median. When computing other multivariate medians, we use functions from R packages listed in section 3.1. The actual data points are plotted with grey dots. The first plot in Fig 3 is produced excluding the two outliers, whilst the second one includes them. The projection medians produced with different estimators are very close to each other, and also not far from the other median estimators. Fig 3 also shows that the multivariate medians are not particularly affected by the outliers, whilst the mean value is.


Fig 3. Bivariate medians and mean for three cluster two-dimensional set.

Top: without outliers; Bottom: with outliers (out of plot area).

https://doi.org/10.1371/journal.pone.0229845.g003

4.2.3 Simulated data in $\mathbb{R}^3$ with four clusters.

The function Plot3dMedian in Yamm plots three-dimensional medians. The dataset clusters3d has four clusters, each generated from a different independent normal distribution, as well as five outliers. Fig 4 is produced from clusters3d after its outliers have been removed. It shows that, apart from Oja’s median, the medians are located close to each other. Again, the three approximations of the projection median almost coincide in every component.


Fig 4. Trivariate medians & mean for four cluster three-dimensional set.

https://doi.org/10.1371/journal.pone.0229845.g004

4.3 The muqie plot and some examples

As well as obtaining a robust location measure, we can use projections to provide information on the spread and configuration of the data. Obtaining true multivariate quantiles can be computationally challenging, and what we produce are not true multivariate quantiles, but they do enable us to gain useful understanding about multivariate data. The muqie (MUltivariate QuantIlE) plots are constructed as follows.

First, choose a unit-length direction vector, u. Then project the yamm-centred multivariate data onto u to obtain a univariate set. The muqie point, Q(α, u), is simply the vector u rescaled to have length equal to the α-quantile of the univariate set. A muqie plot is the collection of all muqie points Q(α, u) over all unit-length direction vectors u. In practice, we construct the plot by choosing a finite number of directions and joining the resulting points. The basic concept and plots are not new: Section 2 of Fraiman and Pateiro-Lopez [34] introduces the concept based on mean-centred data, and it is related to ideas in [35]. Our main additions to this body of work are (i) centring using yamm, or another robust median, and (ii) presenting the muqie plots as dynamic videos of increasing α.
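The construction just described can be sketched in base R for two-dimensional data. This is an illustrative reading of the definition, not the muqie() function from the Yamm package: the name muqie_points_sketch and its arguments are hypothetical, the centre is supplied by the caller (the paper uses yamm), and the points are returned in centred coordinates. Where the α-quantile of a projection is negative, the sketch simply scales u by that negative value, which is an assumption.

```r
# Hypothetical sketch of muqie points in two dimensions (not Yamm's muqie()).
muqie_points_sketch <- function(x, centre, alpha, ndirs = 180) {
  x <- sweep(as.matrix(x), 2, centre, "-")          # centre the data
  # Equally spaced angles on [0, 2*pi), one unit-length direction per angle.
  theta <- seq(0, 2 * pi, length.out = ndirs + 1)[-(ndirs + 1)]
  u <- cbind(cos(theta), sin(theta))
  # alpha-quantile of the data projected onto each direction.
  q <- apply(x %*% t(u), 2, quantile, probs = alpha, names = FALSE)
  u * q                                             # row i is direction i scaled by q[i]
}

set.seed(7)
toy <- matrix(rnorm(2000 * 2), ncol = 2)            # toy data: standard bivariate normal
ctr <- apply(toy, 2, median)                        # stand-in robust centre
pts <- muqie_points_sketch(toy, ctr, alpha = 0.8)
radii <- sqrt(rowSums(pts^2))
print(range(radii))
```

Joining the rows of pts in order, for example with lines(), would give one frame of a muqie-style plot; repeating this for increasing α gives the frames of an animation.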

Fig 5 shows two muqie plots for α = 0.4 and α = 0.8. The latter indicates the three cluster nature. Surprisingly, this also shows up clearly in the α = 0.4 plot with the 0.4 quantile for, e.g. the bottom-left cluster appearing in a “north-easterly” direction and coloured red in our plot. The movie Animation shows an animated plot, which includes both the plots in Fig 5 and many of the others for increasing values of α.


Fig 5. Muqie plot for the three cluster two-dimensional data set without outliers.

The figures are produced for different values of pseudo-quantile α. The centre point (in blue) in each plot is the yamm median. Left: α = 0.4, Right: α = 0.8.

https://doi.org/10.1371/journal.pone.0229845.g005

These plots were produced by the muqie() function in the Yamm package. For the animated plot, the package includes the makeplot() function, which calls muqie() for multiple values of α. Then we use the CRAN package animation to produce an animated GIF using

saveGIF(makeplot(clusters2d[,-c(102,103)], nprojs = 4000),
        diff.col = 3, interval = 0.1, width = 500, height = 500)

The movie beetle shows a three-dimensional Muqie plot using three variables from the beetle data. The R commands used were:

saveGIF(makeplot3D(beetle, dm = c(1,3,6)), diff.col = 3,
        interval = 0.2, width = 500, height = 500)

5 Conclusions and discussions

We have introduced a new method, yamm, to compute the projection median for data in $\mathbb{R}^n$ with n ≥ 2. We have proved the theoretical equivalence of yamm and the projection median. Through theoretical and numerical investigations, we have demonstrated the robustness of yamm on a simple, but illuminating, bivariate setup.

Then, we illustrated three computation methods for the projection median, each of which is best deployed in different situations. Approximating the projection median by the Monte Carlo method is valid in any dimension but requires a large number of projections to ensure accuracy. The trapezoidal rule is computationally fast and accurate in two and three dimensions, but in higher dimensions it requires more integrations over the projection vector, which rapidly becomes more complex. The yamm approximation can also compute the median in any dimension. It is not as fast as the other two methods under the same conditions (e.g. the same number of projections). However, thanks to the optimiser, a small number of projections is enough to obtain an accurate median from a reasonable starting point (e.g. another multivariate median or the mean), which can be a distinct advantage.
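For the two-dimensional case, the trapezoidal idea can be sketched in base R as follows. This is a hedged illustration rather than the package's PmedTrapz: the function name pmed_trapz2d_sketch is hypothetical, and it assumes the projection median in the plane equals 2 times the average over angles of (projected median × direction).

```r
# Hypothetical 2D sketch (not Yamm's PmedTrapz).
pmed_trapz2d_sketch <- function(x, nangles = 360) {
  x <- as.matrix(x)
  stopifnot(ncol(x) == 2)
  # Equally spaced angles on [0, 2*pi), one unit direction per angle.
  theta <- seq(0, 2 * pi, length.out = nangles + 1)[-(nangles + 1)]
  u <- cbind(cos(theta), sin(theta))
  meds <- apply(x %*% t(u), 2, median)   # univariate median per direction
  # For a periodic integrand on an equally spaced grid, the trapezoidal rule
  # reduces to a plain average; the factor 2 is the dimension.
  2 * colMeans(u * meds)
}

set.seed(3)
z <- matrix(rnorm(50 * 2), ncol = 2)
centre2d <- c(5, -3)
x2d <- sweep(rbind(z, -z), 2, centre2d, "+")  # toy data symmetric about centre2d
est2d <- pmed_trapz2d_sketch(x2d)
print(rbind(centre2d, est2d))
```

Because the directions are placed deterministically rather than sampled, the estimate does not vary from run to run. The same gridding in higher dimensions needs one grid per angular coordinate, which is where the cost grows.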

Our research also documents the simulated empirical performance for different medians in terms of the computation time and the mean squared error. Using different R functions to calculate different multivariate medians, we find that the spatial median and the projection median are always accurate with relatively fast speed using the existing R functions. The performance of other multivariate medians either exhibits slow speed or large mean squared error.

Finally, we introduce our R package, Yamm, which contains our three methods to compute the projection median. We show that our methods coincide with each other in $\mathbb{R}^2$ and $\mathbb{R}^3$, and that none of the multivariate medians is particularly affected by the outliers in the dataset, whereas the location of the mean varies considerably. Currently, the function PmedTrapz in the R package is only valid in $\mathbb{R}^2$ and $\mathbb{R}^3$; further investigation could extend this function to higher dimensions.

The Yamm package also introduces our muqie plots, which can produce animated plots of the projected quantiles of two- and three-dimensional sets. The animated “growth” of these “quantile” plots gives a vivid picture of the extent, spread and configuration of the data in the sets.

The Yamm package is available on the CRAN archive.

Supporting information

S1 Appendix.

https://doi.org/10.1371/journal.pone.0229845.s001

(PDF)

S2 Appendix.

https://doi.org/10.1371/journal.pone.0229845.s002

(PDF)

S1 Table. Simulation performance for high-dimensional medians.

https://doi.org/10.1371/journal.pone.0229845.s003

(PDF)

References

1. Hayford J. W. (1902). What is the center of an area or the center of a population. Journal of the American Statistical Association, 8(58):47–58.
2. Weber A. (1909). Über den Standort der Industrien. Mohr.
3. Weber A. (1929). Theory of the Location of Industries. The University of Chicago Press.
4. Tukey J. W. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, 2:523–531.
5. Oja H. (1983). Descriptive statistics for multivariate distributions. Statistics & Probability Letters, 1:327–332.
6. Small C. G. (1990). A survey of multidimensional medians. International Statistical Review, 58(3):263–277.
7. Chaudhuri P. and Sengupta D. (1993). Sign tests in multidimension: Inference based on the geometry of data cloud. Journal of the American Statistical Association, 88:1363–1370.
8. Oja H. (2013). Multivariate median. In Becker C., Fried R., and Kuhnt S., editors, Robustness and Complex Data Structures, chapter 1, pages 3–16. Springer, Berlin.
9. Durocher S. and Kirkpatrick D. G. (2005). The projection median of a set of points in $\mathbb{R}^2$. Journal of Computational Geometry, 42:364–375.
10. Basu R., Bhattacharya B. B., and Talukdar T. (2012). The projection median of a set of points in $\mathbb{R}^d$. Discrete and Computational Geometry, 47(2):329–346.
11. Johnson N. L., Kotz S., and Balakrishnan N. (1995). Continuous univariate distributions. Number v.2 in Wiley series in probability and mathematical statistics: Applied probability and statistics. Wiley & Sons.
12. Robert C. P. and Casella G. (2005). Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag, Berlin, Heidelberg.
13. Broyden C. G. (1970). The convergence of a class of double-rank minimization algorithms. The Institute of Mathematics and Its Applications, 6:76–90.
14. Fletcher R. (1970). A new approach to variable metric algorithms. Computer Journal, 13(3):317–322.
15. Goldfarb D. (1970). A family of variable metric updates derived by variational means. Journal of the Mathematics of Computation, 24(109):123–126.
16. Shanno D. F. (1970). Conditioning of quasi-newton methods for function minimization. Journal of the Mathematics of Computation, 24(111):647–656.
17. Bose P., Maheshwari A., and Morin P. (2003). Fast approximations for sums of distances, clustering and the Fermat-Weber problem. Computational Geometry: Theory and Applications, 24(3):135–146.
18. Aloupis G., Langerman S., Soss M., and Toussaint G. T. (2003). Algorithms for bivariate medians and a Fermat-Torricelli problem for lines. Computational Geometry, 26:69–79.
19. Langerman S. and Steiger W. (2003). Optimization in Arrangements. Springer Berlin Heidelberg, Berlin, Heidelberg.
20. Vardi Y. and Zhang C. (2000). The multivariate l1-median and associated data depth. Proceedings of the National Academy of Sciences, 97(4):1423–1426.
21. Cardot H., Cénac P., and Zitt P. A. (2013). Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient algorithm. Bernoulli Society for Mathematical Statistics and Probability, 19(1):18–43.
22. Croux C., Filzmoser P., and Oliveira M. R. (2006). Algorithms for projection-pursuit robust principal component analysis. KU Leuven Working Paper No. KBI 0624, 19(1):18–43.
23. Rousseeuw P. J. and Ruts I. (1996). Algorithm AS 307: Bivariate location depth. Journal of the Royal Statistical Society. Series C (Applied Statistics), 45(4):516–526.
24. Rousseeuw P. J., Ruts I., and Tukey J. W. (1999). The bagplot: A bivariate boxplot. The American Statistician, 53(4):382–387.
25. Struyf A. and Rousseeuw P. J. (2000). High-dimensional computation of the deepest location. Computational Statistics & Data Analysis, 34(4):415–426.
26. Fischer D., Mosler K., Möttönen J., Nordhausen K., Pokotylo O., and Vogel D. (2016). Computing the Oja median in R: The package OjaNP. ArXiv, pages 1–36.
27. Durocher S., Leblanc A., and Skala M. (2017). The projection median as a weighted average. Journal of Computational Geometry, 8:78–104.
28. Ramsay K. (2017). Computable, robust multivariate location using integrated univariate ranks.
29. Nelder J. A. and Mead R. (1965). A simplex algorithm for function minimization. Computer Journal, 7:308–313.
30. Fletcher R. and Reeves C. M. (1964). Function minimization by conjugate gradients. Computer Journal, 7:148–154.
31. Nocedal J. and Wright S. J. (1999). Numerical Optimization. Springer, first edition.
32. Byrd R. H., Lu P., Nocedal J., and Zhu C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16:1190–1208.
33. Lubischew A. A. (1962). On the use of discriminant functions in taxonomy. Biometrics, 18:455–477.
34. Fraiman R. and Pateiro-Lopez B. (2012). Quantiles for finite and infinite dimensional data. Journal of Multivariate Analysis, 108:1–14.
35. Kong L. and Mizera I. (2012). Quantile tomography: using quantiles with multivariate data. Statistica Sinica, 22:1589–1610.

