A new outlier detection method for spherical data

Adzhar Rambli; Ibrahim Bin Mohamed; Abdul Ghapor Hussin

doi:10.1371/journal.pone.0273144

Abstract

In this study, we propose a new method to detect outlying observations in spherical data. The method is based on the k-nearest neighbours distance theory. The proposed method is a good alternative to the existing tests of discordancy for detecting outliers in spherical data. In addition, the new method can be generalized to identify a patch of outliers in the data. We obtain the cut-off points and investigate the performance of the test statistic via simulation. The proposed test performs well in detecting a single and a patch of outliers in spherical data. As an illustration, we apply the method on an eye data set.

Citation: Rambli A, Mohamed IB, Hussin AG (2022) A new outlier detection method for spherical data. PLoS ONE 17(8): e0273144. https://doi.org/10.1371/journal.pone.0273144

Editor: Lei Shi, Yunnan University of Finance and Economics, CHINA

Received: January 1, 2021; Accepted: August 4, 2022; Published: August 24, 2022

Copyright: © 2022 Rambli et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Data has been provided in the manuscript.

Funding: We would like to state that Universiti Teknologi MARA Research Grant (600-IRMI/FRGS 5/3 (353/2019)) is funding this research by paying the publication fee and UM IIRG Research Grant (IIRG002A-19FNW) is funding this research by paying for proofreading the manuscript. Both funds had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Spherical data are concerned with directions in three dimensions. They arise in many areas of scientific experimentation such as the biological, geological and environmental sciences. For example, the wind directions measured by two different instruments (see [1]) or the altitudes of the moon and the sun observed at the beginning of the lunar month (see [2]) form spherical data. The analysis of spherical data generally concentrates on the directional vector of the object and, in most cases, ignores the distance effects. Under this assumption, the representation of the data reduces to a more tractable two-dimensional spherical display, namely the latitude θ and longitude φ. While the normal distribution is common for linear data, the von Mises-Fisher distribution is regularly considered for spherical data. The distribution is also known as the Fisher distribution and assumes the data to be rotationally symmetric [3, 4].

Outliers are observations that differ in some way from the rest. For example, a wind direction on one particular day that is opposite to the directions observed on other days in the same monsoon season is a candidate outlier. The existence of outliers in circular data has been shown to affect parameter estimation and weaken the accuracy of forecasts (see, for example, [5, 6]) and warrants proper treatment in the early stage of data analysis. At present, several discordancy tests have been developed to detect outliers in 2-dimensional directional data, including [7–10]. Fewer similar studies have been conducted for spherical data. [11] used probability plots as part of a preliminary examination of a given spherical data set to detect outliers. On the other hand, [4] proposed formal tests of discordancy by extending the idea used in [5] for circular data. In this paper, we propose a new outlier detection method for spherical data using the k-nearest neighbours distance on a unit sphere. The distance between two points on the surface of a sphere is measured using the law of cosines. The proposed method can detect not only single and multiple outliers but also a patch of outliers.

This paper is organized as follows: Section 2 reviews two existing tests of discordancy in the Fisher distribution. Section 3 shows the distance between two unit vectors. Section 4 reviews the definition of the k-nearest neighbours distance. Section 5 presents a new test of discordancy for a patch of outliers. Through simulations, we obtain the percentage points of the test statistic and study its performance in Section 6. For illustration, an application of the methods to a real data set is presented in Section 7.

Tests of discordancy in the Fisher distribution

The Fisher distribution is a common unimodal distribution considered for spherical data. The probability density function of a Fisher distribution for a given random vector (Θ, Φ) is given by

f(θ, φ) = {κ / (4π sinh κ)} exp[κ{cos θ cos α + sin θ sin α cos(φ − β)}] sin θ (1)

where 0≤θ, α<π; 0≤φ, β<2π; κ>0, (α, β) is the mean direction, and κ is a measure of the concentration about the mean direction.

Let (θ₁, φ₁),…,(θ_n, φ_n) be a random sample from a Fisher distribution with mean direction (α, β). Let (α̂, β̂) be the sample mean direction and R be the sample resultant length given by R² = (Σx_i)² + (Σy_i)² + (Σz_i)², where x_i = sin θ_i cos φ_i, y_i = sin θ_i sin φ_i, z_i = cos θ_i, and let R̄ = R/n be the mean resultant length. Note that (x_i, y_i, z_i) are the direction cosines of the i-th observation. Further, R_(−i) and R̄_(−i) denote the values of the resultant length and mean resultant length, respectively, with the observation (θ_i, φ_i) omitted from the data set. [4] recommended two test statistics, the C^k and E^k statistics. The analogue of Collett’s C statistic is defined as (2)

While the analogue of Collett’s M statistic is (3)

[4] noted that the C^k statistic is a good statistic when κ is known or a good estimate of it is available. In addition, the E^k statistic is developed from both intuitive and formal likelihood-ratio considerations; its distribution is available in compact form and is independent of the value of κ. The E^k statistic is based on a generalized likelihood-ratio test against the alternative hypothesis that one observation is drawn from a Fisher distribution with a different mean direction but the same concentration parameter. Both test statistics can detect a single outlier and several outliers (see [4]).
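The quantities used by these tests can be computed directly from the angles. The following is a minimal sketch, assuming θ is the colatitude and φ the longitude in radians; the function names are illustrative and are not taken from the paper.

```python
import math

def to_unit_vector(theta, phi):
    """Direction cosines (x, y, z) for colatitude theta and longitude phi."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def resultant_summary(angles):
    """Return (R, R_bar, (alpha_hat, beta_hat)) for a list of (theta, phi) pairs.

    Assumes the resultant is non-zero, so the mean direction is defined.
    """
    vecs = [to_unit_vector(t, p) for t, p in angles]
    sx = sum(v[0] for v in vecs)
    sy = sum(v[1] for v in vecs)
    sz = sum(v[2] for v in vecs)
    r = math.sqrt(sx * sx + sy * sy + sz * sz)
    # Clamp guards against rounding pushing the ratio just outside [-1, 1].
    alpha_hat = math.acos(max(-1.0, min(1.0, sz / r)))
    beta_hat = math.atan2(sy, sx) % (2.0 * math.pi)
    return r, r / len(vecs), (alpha_hat, beta_hat)

def leave_one_out_resultants(angles):
    """Resultant length with each observation omitted in turn."""
    return [resultant_summary(angles[:i] + angles[i + 1:])[0]
            for i in range(len(angles))]
```

An observation whose removal increases the resultant length markedly is pulling the sample away from a common direction, which is the intuition behind the resultant-based statistics.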

The distance on a sphere

For any three-dimensional data set, the distance between two points on a sphere can be found by calculating the distance between the corresponding vectors. The distance between two unit vectors x₁ and x₂ (both of unit length) can be calculated using the law of cosines. Let θ₁₂ be the angle between the unit vectors x₁ = (x₁, y₁, z₁) and x₂ = (x₂, y₂, z₂). We can obtain the distance between the two points on a sphere by (4)

Therefore Eq (4) can be simplified to where 0≤θ₁₂≤π. It is known that x₁•x₂ = ‖x₁‖‖x₂‖ cos θ₁₂. Then

Given that x₁ and x₂ are two unit vectors, it must be that ‖x₁‖ = ‖x₂‖ = 1. In general, the spherical distance between two unit vectors is given by

For simplicity, we may remove the constant giving (5)
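The derivation in this section can be sketched as follows. This is one reading consistent with the text, assuming the law of cosines is applied to the chord joining the two points and that the constant removed is the factor 2; the symbol d is illustrative notation and may differ from the authors' own.

```latex
\begin{align*}
d_{12}^{2} &= \lVert \mathbf{x}_1 \rVert^{2} + \lVert \mathbf{x}_2 \rVert^{2}
              - 2 \lVert \mathbf{x}_1 \rVert \lVert \mathbf{x}_2 \rVert \cos\theta_{12},
              \qquad 0 \le \theta_{12} \le \pi, \\
\cos\theta_{12} &= \frac{\mathbf{x}_1 \cdot \mathbf{x}_2}{\lVert \mathbf{x}_1 \rVert \lVert \mathbf{x}_2 \rVert}
              = x_1 x_2 + y_1 y_2 + z_1 z_2
              \quad \text{when } \lVert \mathbf{x}_1 \rVert = \lVert \mathbf{x}_2 \rVert = 1, \\
d_{12}^{2} &= 2\,(1 - \cos\theta_{12}), \\
d(\mathbf{x}_1, \mathbf{x}_2) &= 1 - \cos\theta_{12} = 1 - \mathbf{x}_1 \cdot \mathbf{x}_2 .
\end{align*}
```

Under this reading the distance ranges from 0 for coincident points to 2 for antipodal points and increases monotonically with the angle θ₁₂.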

The k-nearest neighbours distance

If k = 1, we consider the distance to the first nearest neighbour of a given point, say x_i. First, we denote by d_1i(x_i, x_j), for j = 1,2,…,n, i≠j, the distances between the i-th observation and the rest of the observations, and by d_(1i)(x_i, x_j) the corresponding ordered distances. The first-nearest distance for the i-th observation is then given by (6)

Note that the statistic (6) gives a sequence of distances between successive observations on the p-dimensional surface. The statistic (6) can be generalized to detect a patch of outliers in spherical data by calculating the k-nearest distance for the i-th observation. For that, we define the k-nearest neighbours distance for the i-th ordered observation, k = 1,2,3,… and i = 1,2,…,n, such that (7)

We will use the statistic (7) in the development of a new method for detecting a single, multiple as well as a patch of outliers in the following section.
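The k-nearest neighbours distance can be illustrated with a short sketch. It assumes the distance between unit vectors is 1 minus their dot product, and that the k-nearest distance of an observation is the k-th smallest of its distances to all other observations; both are readings of Eqs (5) to (7), and the function names are hypothetical.

```python
def spherical_distance(u, v):
    """1 - cos(angle) between two unit vectors (assumed form of the distance)."""
    return 1.0 - (u[0] * v[0] + u[1] * v[1] + u[2] * v[2])

def knn_distances(vectors, k):
    """For each observation, the k-th smallest distance to the other observations."""
    result = []
    for i, u in enumerate(vectors):
        ordered = sorted(spherical_distance(u, v)
                         for j, v in enumerate(vectors) if j != i)
        result.append(ordered[k - 1])
    return result
```

With k = 1 a pair of outliers lying close together hide each other, because each has a small first-nearest distance. With k = 2 both have a large second-nearest distance, which is why increasing k exposes a patch.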

A new method of outlier detection for spherical data

In this section, we use the k-nearest neighbours distance as the basis for a new method to detect possible outliers in spherical data, denoted by Q^k. Suppose x₁, x₂,…,x_n are i.i.d. spherical observations from a Fisher distribution, where n is the sample size. The i-th observation is given by the vector x_i = (x_i, y_i, z_i). The procedure to detect outliers using the Q^k statistic is as follows:

Step 1 Start with k = 1. Calculate the k-nearest neighbours distance given by Eq (7) for each observation i = 1,2,…,n.
Step 2 If the value of the statistic exceeds a pre-determined cut-off point, say C_Q, then the i-th observation corresponding to that value is identified as an outlier and the process is stopped. Otherwise, proceed to the next step.
Step 3 Increase k by one, that is, k = 2. Calculate the k-nearest neighbours distance for each observation i = 1,2,…,n.
Step 4 If the value of the statistic exceeds a pre-determined cut-off point, say C_Q, then the observations corresponding to the exceeding values are identified as a patch of two outliers and the process is stopped. Otherwise, the process continues by increasing the value of k by one at a time in the subsequent steps.
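The stepwise procedure above can be sketched as a loop over k. This is an illustration under stated assumptions rather than the authors' implementation: the distance is taken as 1 minus the dot product, every observation whose k-nearest distance exceeds the cut-off is flagged, and the `cutoffs` mapping from k to C_Q is supplied by the caller (in practice from Tables 1–3).

```python
def knn_distances(vectors, k):
    """For each unit vector, the k-th smallest value of 1 - dot product to the others."""
    result = []
    for i, u in enumerate(vectors):
        ordered = sorted(1.0 - (u[0] * v[0] + u[1] * v[1] + u[2] * v[2])
                         for j, v in enumerate(vectors) if j != i)
        result.append(ordered[k - 1])
    return result

def detect_outliers(vectors, cutoffs):
    """Run the stepwise search for k = 1, 2, ... in increasing order.

    `cutoffs` maps k to its cut-off point C_Q. Returns (k, flagged_indices)
    at the first k where some distance exceeds the cut-off, or None if no
    k in `cutoffs` flags anything.
    """
    for k in sorted(cutoffs):
        distances = knn_distances(vectors, k)
        flagged = [i for i, d in enumerate(distances) if d > cutoffs[k]]
        if flagged:
            return k, flagged
    return None
```

Stopping at the first k that flags anything mirrors Steps 2 and 4: a single outlier is reported at k = 1, and a patch of two is reported only once k reaches 2.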

First, we need to obtain the cut-off points C_Q for the Q^k statistic. We design a simulation study using the R software to find the percentage points under the null hypothesis of no outliers in the spherical data set. Note that the parameters α and β are spherical location parameters while κ is a concentration parameter. We found that the distances between observations generated from a Fisher distribution depend on n and κ but not on α and β (the details are not given here). For each combination of n and κ, we generate a sample from the Fisher distribution with both location parameters fixed (α = 0, β = 0) and calculate the Q^k statistic. Then, we repeat the process 3000 times and estimate the percentage points of the Q^k statistic at the 10%, 5% and 1% upper percentiles when no outlier is present in the sample. Selected cut-off points C_Q for the Q^k statistic are tabulated in Tables 1–3 for k = 1, 2 and 3 respectively.
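The simulation of cut-off points can be sketched as follows. The paper used R; this Python version is only an illustration. It assumes the statistic is the largest k-nearest distance in the sample, draws Fisher samples about the north pole by inverting the distribution function of cos θ, and uses a simple empirical quantile. All function names are hypothetical.

```python
import math
import random

def sample_fisher_pole(n, kappa, rng):
    """Draw n unit vectors from a Fisher distribution with mean direction at the pole.

    Uses the inverse distribution function of w = cos(theta); very large kappa
    (several hundred) would underflow exp(-2 * kappa) and is not handled here.
    """
    points = []
    for _ in range(n):
        u = rng.random()
        w = 1.0 + math.log(u + (1.0 - u) * math.exp(-2.0 * kappa)) / kappa
        s = math.sqrt(max(0.0, 1.0 - w * w))
        phi = 2.0 * math.pi * rng.random()
        points.append((s * math.cos(phi), s * math.sin(phi), w))
    return points

def max_knn_distance(vectors, k):
    """Largest k-th nearest distance in the sample (assumed form of the statistic)."""
    best = 0.0
    for i, u in enumerate(vectors):
        ordered = sorted(1.0 - (u[0] * v[0] + u[1] * v[1] + u[2] * v[2])
                         for j, v in enumerate(vectors) if j != i)
        best = max(best, ordered[k - 1])
    return best

def simulate_cutoffs(n, kappa, k, reps=3000, levels=(0.10, 0.05, 0.01), seed=0):
    """Empirical upper percentage points of the statistic under no outliers."""
    rng = random.Random(seed)
    stats = sorted(max_knn_distance(sample_fisher_pole(n, kappa, rng), k)
                   for _ in range(reps))
    cutoffs = {}
    for p in levels:
        index = min(reps - 1, math.ceil((1.0 - p) * reps) - 1)
        cutoffs[p] = stats[index]
    return cutoffs
```

For example, `simulate_cutoffs(10, 3.0, 1)` returns a dictionary keyed by the upper-tail level. The values depend on the seed and on the assumptions above, so they need not reproduce the published tables.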

Table 1. Cut-off points, C_Q for Q¹ statistic.

https://doi.org/10.1371/journal.pone.0273144.t001

Table 2. Cut-off points, C_Q for Q² statistic.

https://doi.org/10.1371/journal.pone.0273144.t002

Table 3. Cut-off points, C_Q for Q³ statistic.

https://doi.org/10.1371/journal.pone.0273144.t003

For most combinations of the concentration parameter κ and percentile level, the cut-off point decreases as the sample size increases. It can also be seen that, for small sample sizes, the cut-off points are a decreasing function of κ. For larger sample sizes, the cut-off points peak at around κ = 30. The results indicate that the proposed statistic depends on n and κ of the underlying assumed model. As one might expect, the cut-off point also increases as k increases, due to the larger distances on the sphere between the points of interest.

The performance of the Q^k statistic

Let P5 be the probability that the contaminant point is an outlying point and is identified as discordant. [12, p.185] and [13, p.64-68] stated that a good test is expected to have a high P5. [4] investigated the performance of several methods to detect a single outlier in a spherical distribution. Therefore, we compare the performance of the Q^k statistic with the existing E^k and C^k statistics in detecting a single outlier and a patch of outliers for various values of sample size and concentration parameter.

To study the performance of the Q^k, E^k and C^k statistics in detecting a single outlier, we first generate samples for two cases, a) n = 10, κ = 3 and b) n = 30, κ = 50. The samples are generated in such a way that n−1 of the observations come from a Fisher distribution with α = 0, β = 0 while one observation (the outlier) comes from a Fisher distribution with α = λπ, β = 0, κ = 30, 0≤λ≤1. If the values of the Q^k, E^k and C^k statistics are greater than the corresponding cut-off points and the i-th observation is located at the outlying value, then we have correctly detected an outlier. We repeat the simulation 3000 times and obtain the value of P5, also known as the probability of correct detection of an outlier that has been introduced into the samples. Note that the cut-off points for the E^k and C^k statistics are obtained from Monte Carlo simulation following the same procedure used to obtain the cut-off points for the Q^k statistic.
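The P5 simulation for the nearest-neighbour statistic can be sketched along the same lines. This is an illustration only: it assumes the statistic is the largest k-nearest distance, places the contaminant by rotating a pole-centred Fisher draw about the y-axis to colatitude λπ, and takes the cut-off as an argument. The function names and the cut-off value used in any call are hypothetical.

```python
import math
import random

def sample_fisher(n, kappa, colatitude, rng):
    """n unit vectors from a Fisher distribution with mean at (colatitude, longitude 0)."""
    ca, sa = math.cos(colatitude), math.sin(colatitude)
    out = []
    for _ in range(n):
        u = rng.random()
        w = 1.0 + math.log(u + (1.0 - u) * math.exp(-2.0 * kappa)) / kappa
        s = math.sqrt(max(0.0, 1.0 - w * w))
        phi = 2.0 * math.pi * rng.random()
        x, y, z = s * math.cos(phi), s * math.sin(phi), w
        # Rotate about the y-axis so the north pole moves to the target colatitude.
        out.append((x * ca + z * sa, y, -x * sa + z * ca))
    return out

def knn_distances(vectors, k):
    """For each unit vector, the k-th smallest value of 1 - dot product to the others."""
    result = []
    for i, u in enumerate(vectors):
        ordered = sorted(1.0 - (u[0] * v[0] + u[1] * v[1] + u[2] * v[2])
                         for j, v in enumerate(vectors) if j != i)
        result.append(ordered[k - 1])
    return result

def estimate_p5(n, kappa, lam, cutoff, k=1, reps=3000, seed=0,
                contaminant_kappa=30.0):
    """Share of runs where the contaminant has the largest distance and it exceeds cutoff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = sample_fisher(n - 1, kappa, 0.0, rng)
        sample += sample_fisher(1, contaminant_kappa, lam * math.pi, rng)
        distances = knn_distances(sample, k)
        top = max(range(n), key=distances.__getitem__)
        if top == n - 1 and distances[top] > cutoff:
            hits += 1
    return hits / reps
```

Evaluating `estimate_p5` over a grid of λ between 0 and 1 gives a performance curve of the kind plotted in Figs 1 and 2, with detection becoming easier as the contaminant moves away from the bulk of the data.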

Fig 1 plots the performance of the Q^k, E^k and C^k statistics in detecting a single outlier for a small sample size and a small concentration parameter value. Generally, for small sample sizes, the Q^k statistic outperforms only the E^k statistic. However, the performance of the three statistics is almost identical when a larger sample size and larger concentration parameter values are considered, as shown in Fig 2.

Fig 1. The performance of the Q^k, E^k and C^k statistics for n = 10 and κ = 3 for a single outlier.

https://doi.org/10.1371/journal.pone.0273144.g001

Fig 2. The performance of the Q^k, E^k and C^k statistics for n = 30 and κ = 50 for a single outlier.

https://doi.org/10.1371/journal.pone.0273144.g002

We also investigate the performance to detect a patch of outliers for the three statistics. For small sample size and small concentration parameter value, the performance of the Q^k statistic is comparable to the E^k and C^k statistics as shown in Fig 3. A much closer result is observed for larger sample size and larger concentration values as shown in Fig 4. The trend is observed for other combinations of sample size and concentration parameter values. This suggests that the Q^k statistic can be a good alternative outlier detection method for spherical data.

Fig 3. The performance of the Q^k, E^k and C^k statistics for n = 10 and κ = 3, for a patch of three outliers.

https://doi.org/10.1371/journal.pone.0273144.g003

Fig 4. The performance of the Q^k, E^k and C^k statistics for n = 30 and κ = 50, for a patch of three outliers.

https://doi.org/10.1371/journal.pone.0273144.g004

Practical example

For illustration, we now apply the proposed and existing spherical discordancy tests to a set of eye data. We consider the eye data of 23 patients (in radians) recorded using optical coherence tomography (OCT) at the University Malaya Medical Centre (UMMC). OCT technology was originally used in ophthalmology to image the posterior segment and has also been used to image anterior segment structures such as the cornea. The angle images of the anterior segment of the UMMC patients’ eyes were obtained with Anterior Segment OCT (AS-OCT). The measurements selected are the angle of the posterior corneal curvature, φ, and the angle of the eye (between the posterior corneal curvature and the iris), θ. We are keen to identify possible outliers in this data set, which is given in Table 4.

Table 4. The bivariate eye data.

https://doi.org/10.1371/journal.pone.0273144.t004

The summary statistics for the given spherical data set are calculated; the sample mean direction is given in a longitude and latitude expression, with the concentration parameter . The spherical plot of the data is given in Fig 5. The samples are located around the north pole. This indicates that both variables, namely the posterior corneal curvature and the angle of the eye, record these 23 observations to be in the same direction. However, there is one observation lying further away from the rest.

Fig 5. Spherical plot of eye data.

https://doi.org/10.1371/journal.pone.0273144.g005

It is known that Q-Q plots and probability plots are commonly used to investigate the goodness of fit of linear, circular and spherical data samples (see for example [14–16]). They are used to visualize the goodness of fit and to identify the presence of outlier(s) at an early stage. [11] provided procedures for plotting ordered values for spherical data assumed to follow a Fisher distribution. They proposed three types of procedures for a Fisher model, namely the colatitude, longitude and two-variable plotting procedures. The procedures consider three ordered-value plots. Two of them examine the marginal distributions of the two variables and one examines the association between these two variables. The details of the procedures can be found in [11]. Note that the quantile of the unit exponential distribution is denoted by e, that of the uniform model by u and that of the N(0,1) distribution by q.
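The first two plots can be sketched from the marginal behaviour of a Fisher sample about its mean direction: 1 − cos θ is approximately exponential with mean 1/κ and φ is uniform on [0, 2π). The following is a minimal sketch, assuming the angles have already been rotated so that the sample mean direction is the pole and using plotting positions (i − 0.5)/n; the exact conventions in [11] may differ, and the two-variable plot is omitted.

```python
import math

def colatitude_plot_points(thetas):
    """Pairs (exponential quantile e, ordered 1 - cos(theta)).

    A roughly straight line through the origin with slope near 1/kappa
    is consistent with a Fisher model.
    """
    n = len(thetas)
    ordered = sorted(1.0 - math.cos(t) for t in thetas)
    quantiles = [-math.log(1.0 - (i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))

def longitude_plot_points(phis):
    """Pairs (uniform quantile u, ordered phi / (2 * pi)).

    Points near the 45 degree line through the origin are consistent with
    rotational symmetry about the mean direction.
    """
    n = len(phis)
    two_pi = 2.0 * math.pi
    ordered = sorted((p % two_pi) / two_pi for p in phis)
    quantiles = [(i - 0.5) / n for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))
```

A point lying well above the fitted line at the upper end of the colatitude plot corresponds to an observation unusually far from the mean direction, which is how these plots hint at a possible outlier before a formal test is applied.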

The colatitude plot of the eye data in Fig 6(A) indicates that the data follow a Fisher distribution, as the plot gives almost a straight line through the origin. This is further supported by the longitude plot in Fig 6(B), which gives an approximately straight line with a slope close to 45° passing through the origin. From Fig 6(C), we can clearly see one observation that lies far from the rest, indicating the existence of one possible outlier. Therefore, we apply the proposed discordancy test to the data. Applying the maximum likelihood estimation method, we obtain the estimated parameters of the Fisher distribution. The values of the parameters are and .

Fig 6. Plots of eye data, (a) Colatitude plot, (b) Longitude plot, (c) Two-variable plot.

https://doi.org/10.1371/journal.pone.0273144.g006

Based on the estimated parameters, we obtain the critical values of three test statistics using the R statistical software. The values are shown in Table 5. We apply the discordancy tests including our proposed test statistic and obtain their test statistic values. The values of the test statistics which correspond to observation number 17 are , , , and . Table 5 shows the cut-off points for the three methods at 10% significance levels. Based on Table 5, only Q² and Q³ statistics can detect observation 17 as an outlier at 10% upper level. This observation corresponds to a patient with small values of angle of the posterior corneal curvature compared to other patients and thus may warrant further investigation.

Table 5. The (10% upper level) critical values of discordancy tests for n = 23 and κ = 17.9100.

https://doi.org/10.1371/journal.pone.0273144.t005

Next, we demonstrate the application of the tests to detect a patch of outliers. Observation 10 is chosen and relocated close to observation 17 so that a patch of two outliers exists in the data. The new coordinate for observation 10 is θ = 0.9599, φ = 0.6109.

From Fig 7, it can be seen clearly that both observations (observations 10 and 17) are located far from the rest. Upon applying descriptive statistics, the value of the sample mean direction is given in a longitude and latitude expression, and the concentration parameter . The values of the test statistics and the cut-off points for the three methods at 10% significance levels are given in Table 6. As a result, the Q² and Q³ statistics successfully detected observations 10 and 17 as a patch of two outliers at 10% upper level while the other test statistics failed.

Fig 7. Spherical plot of eye data (a patch of two outliers).

https://doi.org/10.1371/journal.pone.0273144.g007

Table 6. The test statistics values and the (10% upper level) critical values of discordancy tests for n = 23 and κ = 16.5789.

https://doi.org/10.1371/journal.pone.0273144.t006

Conclusion

In this paper, we proposed a new discordancy test for detecting outliers in spherical data based on the k-nearest neighbours distance. We further demonstrated the applicability of the proposed Q^k statistic on the eye data set by successfully identifying a single outlier and a patch of outliers in the data. A novel aspect of this method is in its ability to detect a patch of outliers which can be enhanced for cluster analysis in spherical data. The proposed procedure should work for other spherical distributions.

References

1. Hussin A. G. (1997). Pseudo-replication in functional relationships with environmental applications. Ph.D. thesis, University of Sheffield.
2. Ahmad N., Nawawi M.S.A.M., Zainuddin M.Z., Nasir Z.M., Yunus R.M. and Mohamed I. (2020). A new crescent moon visibility criteria using circular regression model: A case study of Teluk Kemang, Malaysia, Sains Malaysiana, 49(4):859–870.
3. Mardia K. V. (1972). Statistics of directional data. Academic Press, London.
4. Fisher N. I., Lewis T. and Willcox M. E. (1981). Tests of discordancy for samples from Fisher’s distribution on the sphere. Journal of Applied Statistics 30, 230–237.
5. Collett D. (1980). Outliers in circular data. Applied Statistics 29(1):50–57.
6. Abuzaid A. H., Mohamed I. B. and Hussin A. G. (2009). A new test of discordancy in circular data. Communication in Statistics-Simulation and Computation 38(4): 682–691.
7. Rambli A., Ibrahim S., Abdullah M. I., Hussin A. G. and Mohamed I. (2012). On discordance test for the wrapped normal data. Sains Malaysiana, 41 (6): 769–778.
8. Rambli A., Abuzaid A. H., Mohamed I. B. and Hussin A. G. (2016). Procedure for detecting outliers in a circular regression model, PLOS ONE, 11 (4): e0153074. pmid:27064566
9. Mohamed I.B., Rambli A., Khaliddin N. and Hussin A.G. (2016). A new discordancy test in circular data using spacings theory, Communication in Statistics—Simulation and Computation, 45 (5), 2904–2916.
10. Mahmood E. A., Midi H., Rana S., and Hussin A. G. (2017). Detecting of outliers in univariate circular data using robust circular distance, Journal of Modern Applied Statistical Methods. 16(2), 418–438.
11. Lewis T. and Fisher N. I. (1982). Graphical methods for investigating the fit of a Fisher distribution to spherical data. Geophysics Journal Royal Astronomical Society 69, 1–13.
12. David H. A. (1970). Order statistics. New York and London, Wiley.
13. Barnett V. and Lewis T. (1984). Outliers in statistical data. New York: John Wiley & Sons.
14. Best D. J. and Fisher N. I. (1986). Goodness-of-fit and discordancy tests for samples from the Watson distribution on the sphere. Australian Journal of Statistics, 28, 13–31.
15. Collett D. and Lewis T. (1981). Discriminating between the von Mises and wrapped normal distributions. Australian Journal of Statistics, 23 (1): 73–79.
16. Fisher N. I., Lewis T. and Embleton B. J. J. (1987). Statistical analysis of spherical data, New York: Cambridge University Press.
