Abstract
In this study, we propose a new method to detect outlying observations in spherical data. The method is based on the k-nearest neighbours distance theory. The proposed method is a good alternative to the existing tests of discordancy for detecting outliers in spherical data. In addition, the new method can be generalized to identify a patch of outliers in the data. We obtain the cut-off points and investigate the performance of the test statistic via simulation. The proposed test performs well in detecting a single outlier and a patch of outliers in spherical data. As an illustration, we apply the method to an eye data set.
Citation: Rambli A, Mohamed IB, Hussin AG (2022) A new outlier detection method for spherical data. PLoS ONE 17(8): e0273144. https://doi.org/10.1371/journal.pone.0273144
Editor: Lei Shi, Yunnan University of Finance and Economics, CHINA
Received: January 1, 2021; Accepted: August 4, 2022; Published: August 24, 2022
Copyright: © 2022 Rambli et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data has been provided in the manuscript.
Funding: We would like to state that Universiti Teknologi MARA Research Grant (600-IRMI/FRGS 5/3 (353/2019)) is funding this research by paying the publication fee and UM IIRG Research Grant (IIRG002A-19FNW) is funding this research by paying for proofreading the manuscript. Both funds had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Spherical data are concerned with directions in three dimensions. They arise in many areas of scientific experimentation such as the biological, geological and environmental sciences. For example, the wind direction measured by two different instruments (see [1]) or the altitudes of the moon and the sun observed at the beginning of the lunar month (see [2]) form spherical data. The analysis of spherical data generally concentrates on the directional vector of the auditory object and, in most cases, ignores the distance effects. Under this assumption, the representation of the data reduces to a more tractable two-dimensional spherical display of the data, namely latitude θ and longitude φ. While the normal distribution is common for linear data, the von Mises-Fisher distribution is regularly considered for spherical data. The distribution is also known as the Fisher distribution and assumes the data to be rotationally symmetric [3, 4].
Outliers are observations that are different in some way from the rest. For example, a wind direction on one particular day that is opposite to the direction observed on other days in the same monsoon season is a candidate outlier. The existence of outliers in circular data has been shown to affect parameter estimation and weaken the accuracy of forecasts (see for example [5, 6]) and warrants proper treatment in the early stage of data analysis. At present, several discordancy tests have been developed to detect outliers in 2-dimensional directional data, including [7–10]. Fewer similar studies have been conducted for spherical data. [11] used probability plots as part of a preliminary examination of a given spherical data set to detect outliers. On the other hand, [4] proposed formal tests of discordancy by extending the idea used in [5] for circular data. In this paper, we propose a new outlier detection method for spherical data using the k-nearest neighbours distance on a unit sphere. The distance between two points on the surface of a sphere is measured using the law of cosines. The proposed method can detect not only single and multiple outliers but also a patch of outliers.
This paper is organized as follows: Section 2 reviews two existing tests of discordancy in the Fisher distribution. Section 3 shows the distance between two unit vectors. Section 4 reviews the definition of the k-nearest neighbours distance. Section 5 presents a new test of discordancy for a patch of outliers. Through simulations, we obtain the percentage points of the test statistic and study its performance in Section 6. For illustration, an application of the methods to a real data set is presented in Section 7.
Tests of discordancy in the Fisher distribution
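The Fisher distribution is a common unimodal distribution considered for spherical data. The probability density function of a Fisher distribution for a given random vector (Θ, Φ) is given by
$$f(\theta, \phi) = \frac{\kappa}{4\pi \sinh \kappa} \exp\left[\kappa\left\{\cos\theta \cos\alpha + \sin\theta \sin\alpha \cos(\phi - \beta)\right\}\right] \sin\theta \tag{1}$$
where 0≤θ, α<π; 0≤φ, β<2π; κ>0, (α, β) is the mean direction, and κ is a measure of the concentration about the mean direction.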
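As a quick numerical check of the density, the following Python sketch evaluates it and integrates it over the sphere with a midpoint rule. It assumes the standard form of the Fisher density, κ/(4π sinh κ) · exp[κ{cos θ cos α + sin θ sin α cos(φ − β)}] · sin θ; the function names `fisher_density` and `integrate_density` are illustrative only.

```python
import math


def fisher_density(theta, phi, alpha, beta, kappa):
    """Fisher density at colatitude theta and longitude phi (standard form, assumed)."""
    # Cosine of the angle between (theta, phi) and the mean direction (alpha, beta).
    cos_angle = (math.cos(theta) * math.cos(alpha)
                 + math.sin(theta) * math.sin(alpha) * math.cos(phi - beta))
    norm = kappa / (4.0 * math.pi * math.sinh(kappa))
    return norm * math.exp(kappa * cos_angle) * math.sin(theta)


def integrate_density(alpha, beta, kappa, steps=200):
    """Midpoint-rule integral over 0 <= theta < pi, 0 <= phi < 2*pi."""
    d_theta = math.pi / steps
    d_phi = 2.0 * math.pi / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * d_theta
        for j in range(steps):
            phi = (j + 0.5) * d_phi
            total += fisher_density(theta, phi, alpha, beta, kappa)
    return total * d_theta * d_phi
```

For a valid density the integral should be close to one for any mean direction, for example `integrate_density(0.3, 1.0, 3.0)`.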
Let (θ1, φ1),…,(θn, φn) be a random sample from a Fisher distribution with mean direction (α, β). Consider the sample mean direction, the sample resultant length R given by
$$R^2 = \left(\sum_{i=1}^{n} x_i\right)^2 + \left(\sum_{i=1}^{n} y_i\right)^2 + \left(\sum_{i=1}^{n} z_i\right)^2,$$
where xi = sin θi cos φi, yi = sin θi sin φi, zi = cos θi, and the mean resultant length $\bar{R} = R/n$. Note that (xi, yi, zi) are the direction cosines of the i-th observation. Further, the resultant length and mean resultant length are also computed with the observation (θi, φi) omitted from the data set. [4] recommended two test statistics, the Ck and Ek statistics. The analogue of Collett’s C statistic is defined as
(2)
While the analogue of Collett’s M statistic is
(3)
[4] noted that the Ck statistic is a good statistic when κ is known or when a good estimate of it is available. The Ek statistic, in turn, is developed from intuitive and formal likelihood-ratio considerations; its distribution is available in compact form and is independent of the value of κ. The Ek statistic is based on a generalized likelihood-ratio test against the alternative hypothesis that one observation is drawn from a Fisher distribution with a different mean direction but the same concentration parameter. Both test statistics can detect a single outlier as well as several outliers (see [4]).
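The resultant-length quantities used in this section can be sketched in Python as follows. The sketch only computes the resultant length, the mean resultant length and their leave-one-out versions from (θ, φ) pairs; it does not attempt the Ck or Ek statistics themselves, and the function names are illustrative.

```python
import math


def to_cartesian(theta, phi):
    """Direction cosines (x, y, z) of colatitude theta and longitude phi."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))


def resultant_length(sample):
    """Resultant length R of a list of (theta, phi) pairs."""
    sx = sy = sz = 0.0
    for theta, phi in sample:
        x, y, z = to_cartesian(theta, phi)
        sx += x
        sy += y
        sz += z
    return math.sqrt(sx * sx + sy * sy + sz * sz)


def mean_resultant_length(sample):
    """Mean resultant length R / n."""
    return resultant_length(sample) / len(sample)


def leave_one_out_resultants(sample):
    """Resultant length with each observation omitted in turn."""
    return [resultant_length(sample[:i] + sample[i + 1:])
            for i in range(len(sample))]
```

Omitting an observation that lies far from the bulk of the data leaves a more concentrated sample, so the leave-one-out resultant length is largest for that observation.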
The distance on a sphere
For any 3-dimensional data set, we can find the distance between any two points on a sphere by calculating the distance between the two corresponding vectors. The distance between two unit vectors x1 and x2 (both of unit length) can be calculated using the law of cosines. Let θ12 be the angle between the unit vectors x1 = (x1, y1, z1) and x2 = (x2, y2, z2). We can obtain the distance between the two points on a sphere by
(4)
Therefore Eq (4) can be simplified to
where 0≤θ12≤π. It is known that x1•x2 = ‖x1‖‖x2‖ cos θ12. Then
Given that x1 and x2 are two unit vectors, it must be that ‖x1‖ = ‖x2‖ = 1, so that cos θ12 = x1•x2 = x1x2 + y1y2 + z1z2. In general, the spherical distance between two unit vectors is therefore determined by their dot product.
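A minimal Python sketch of the distance between two unit vectors follows. It is an illustration under assumptions: it returns both the angle θ12 = arccos(x1•x2) and the chord length obtained from the law of cosines with unit sides, sqrt(2 − 2 cos θ12). The chord length increases with the angle, so either choice ranks neighbours identically. The function names are illustrative.

```python
import math


def unit_vector(theta, phi):
    """Unit vector for colatitude theta and longitude phi."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))


def cos_angle(u, v):
    """Dot product of two unit vectors, clipped to [-1, 1] against rounding."""
    dot = u[0] * v[0] + u[1] * v[1] + u[2] * v[2]
    return max(-1.0, min(1.0, dot))


def angular_distance(u, v):
    """Angle theta_12 between two unit vectors, in [0, pi]."""
    return math.acos(cos_angle(u, v))


def chord_distance(u, v):
    """Straight-line distance from the law of cosines with unit sides."""
    return math.sqrt(2.0 - 2.0 * cos_angle(u, v))
```

For the north pole and a point on the equator, the angle is π/2 and the chord length is sqrt(2).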
The k-nearest neighbours distance
If k = 1, we consider the distance to the first nearest neighbour of a given point, say xi. First, we denote by d1i(xi, xj), for j = 1,2,…,n, i≠j, the distances between the i-th observation and the rest of the observations, and by d(1i)(xi, xj) the corresponding ordered distances. The first-nearest distance for the i-th observation is then given by
(6)
Note that gives a sequence of distances between successive observations on the p-dimensional surface. The statistic (6) can be generalized to detect a patch of outliers in spherical data by calculating the k-nearest distance for the i-th observation. For that, we define
as the k-nearest neighbours distance for the ith ordered observation, k = 1,2,3,… and i = 1,2,…,n such that
(7)
We will use the statistic (7) in the development of a new method for detecting a single, multiple as well as a patch of outliers in the following section.
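The k-nearest neighbours distance can be sketched in Python as below. This is one reading of the definition, stated as an assumption: for each observation, all distances to the other observations are sorted and the k-th smallest is kept. The angular distance arccos(xi•xj) is used as the distance on the sphere, and the names `knn_distances` and `unit_vector` are illustrative.

```python
import math


def unit_vector(theta, phi):
    """Unit vector for colatitude theta and longitude phi."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))


def angular_distance(u, v):
    """Angle between two unit vectors, in [0, pi]."""
    dot = u[0] * v[0] + u[1] * v[1] + u[2] * v[2]
    return math.acos(max(-1.0, min(1.0, dot)))


def knn_distances(points, k):
    """For each point, the k-th smallest distance to the other points (assumed reading)."""
    if not 1 <= k <= len(points) - 1:
        raise ValueError("k must be between 1 and n - 1")
    result = []
    for i, p in enumerate(points):
        ordered = sorted(angular_distance(p, q)
                         for j, q in enumerate(points) if j != i)
        result.append(ordered[k - 1])
    return result
```

For a tight cluster plus one distant point, the distant point has the largest first-nearest distance, and for every observation the distance is non-decreasing in k.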
A new method of outlier detection for spherical data
In this section, we use the k-nearest neighbours distance as the basis for a new statistic, denoted by Qk, to detect possible outliers in spherical data. Suppose x1, x2,…,xn are i.i.d. spherical observations from a Fisher distribution, where n is the sample size. Each observation in the sample is the unit vector xi = (xi, yi, zi). The outlier detection procedure using the Qk statistic is as follows:
- Step 1 Start with k = 1. Calculate
, i = 1,2,…,n as given by Eq (7).
- Step 2 If the value of
exceeds a pre-determined cut-off point, say CQ, then the i-th observation corresponding to
is identified as an outlier and the process is stopped. Otherwise, proceed to the next step.
- Step 3 Increase k by one, that is, k = 2. Calculate
, i = 1,2,…,n.
- Step 4 If the value of
exceeds a pre-determined cut-off point, say CQ, then the observations corresponding to
are identified as a patch of two outliers and the process is stopped. Otherwise, the process continues by increasing the value of k by one at a time in the subsequent steps.
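The stepwise procedure above can be sketched in Python as follows. Several details are assumptions made for illustration: the per-observation statistic is taken to be the k-th smallest angular distance to the other observations, every observation whose value exceeds the cut-off is flagged, and the cut-off values passed in `cutoffs` are hypothetical placeholders rather than entries from Tables 1–3.

```python
import math


def unit_vector(theta, phi):
    """Unit vector for colatitude theta and longitude phi."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))


def angular_distance(u, v):
    """Angle between two unit vectors, in [0, pi]."""
    dot = u[0] * v[0] + u[1] * v[1] + u[2] * v[2]
    return math.acos(max(-1.0, min(1.0, dot)))


def knn_distances(points, k):
    """k-th smallest distance from each point to the others (assumed reading)."""
    out = []
    for i, p in enumerate(points):
        ordered = sorted(angular_distance(p, q)
                         for j, q in enumerate(points) if j != i)
        out.append(ordered[k - 1])
    return out


def detect_outliers(points, cutoffs):
    """Try k = 1, 2, ... in turn; stop at the first k that flags observations.

    cutoffs maps k to its cut-off point C_Q (hypothetical values here).
    Returns (k, flagged_indices), or (None, []) if nothing is flagged.
    """
    for k in sorted(cutoffs):
        distances = knn_distances(points, k)
        flagged = [i for i, d in enumerate(distances) if d > cutoffs[k]]
        if flagged:
            return k, flagged
    return None, []
```

With a patch of two outliers lying close to each other, each is the other's first nearest neighbour, so k = 1 flags nothing, while k = 2 flags both.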
First, we need to obtain the cut-off points CQ for the Qk statistic. We design a simulation study using the R software to find the percentage points under the null hypothesis of no outliers in the spherical data set. Note that the parameters α and β are spherical location parameters while κ is a concentration parameter. We found that the distances between observations generated from a Fisher distribution depend on n and κ but not on α and β (the details are not given here). For each combination of n and κ, we generate a sample from the Fisher distribution with both location parameters fixed (α = 0, β = 0) and calculate the Qk statistic. We repeat the process 3000 times and estimate the percentage points of the Qk statistic at the 10%, 5% and 1% upper percentiles when no outlier is present in the sample. Selected cut-off points CQ for the Qk statistic are tabulated in Tables 1–3 for k = 1, 2 and 3 respectively.
For most combinations of the concentration parameter κ and percentile level, the cut-off point decreases as the sample size increases. It can also be seen that, for small sample sizes, the cut-off points are a decreasing function of κ. For larger sample sizes, the cut-off points have a peak value at around κ = 30. The results indicate that the proposed statistic depends on n and κ of the underlying assumed model. As one might expect, the cut-off point also increases with k, because the k-nearest distances on the sphere become larger.
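A minimal Python sketch of such a Monte Carlo cut-off simulation follows (the study itself used R). It rests on stated assumptions: the Fisher sample is drawn by inverting the distribution function of cos θ about a mean direction at the north pole, the sample statistic is taken to be the largest k-th nearest angular distance in the sample, and the cut-off is the empirical upper percentile. All function names are illustrative.

```python
import math
import random


def sample_fisher(n, kappa, rng):
    """n unit vectors from a Fisher distribution centred on the north pole."""
    points = []
    for _ in range(n):
        u = 1.0 - rng.random()  # uniform on (0, 1]
        # Inverse distribution function of w = cos(theta), density prop. to exp(kappa*w).
        w = 1.0 + math.log(u + (1.0 - u) * math.exp(-2.0 * kappa)) / kappa
        w = max(-1.0, min(1.0, w))
        phi = 2.0 * math.pi * rng.random()
        s = math.sqrt(1.0 - w * w)
        points.append((s * math.cos(phi), s * math.sin(phi), w))
    return points


def q_statistic(points, k):
    """Largest k-th nearest angular distance in the sample (assumed form of Qk)."""
    largest = 0.0
    for i, p in enumerate(points):
        ordered = sorted(
            math.acos(max(-1.0, min(1.0, p[0] * q[0] + p[1] * q[1] + p[2] * q[2])))
            for j, q in enumerate(points) if j != i)
        largest = max(largest, ordered[k - 1])
    return largest


def simulate_cutoff(n, kappa, k, level, reps=3000, seed=0):
    """Empirical upper `level` percentage point of the statistic with no outliers."""
    rng = random.Random(seed)
    values = sorted(q_statistic(sample_fisher(n, kappa, rng), k)
                    for _ in range(reps))
    index = min(reps - 1, math.ceil((1.0 - level) * reps) - 1)
    return values[index]
```

For example, `simulate_cutoff(10, 3.0, 1, 0.05)` estimates the 5% cut-off for n = 10, κ = 3 and k = 1 under these assumptions; the result need not match the tabulated values if the actual statistic is defined differently.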
The performance of the Qk statistic
Let P5 be the probability that the contaminant point is an outlying point and is identified as discordant. [12, p.185] and [13, p.64-68] stated that a good test is expected to have a high P5. [4] investigated the performance of several methods to detect a single outlier in a spherical distribution. Therefore, we compare the performance of the Qk statistic with the existing Ek and Ck statistics in detecting a single outlier and a patch of outliers for various values of sample size and concentration parameter.
To study the performance of the Qk, Ek and Ck statistics in detecting a single outlier, we first generate samples for two cases, a) n = 10, κ = 3 and b) n = 30, κ = 50. The samples are generated in such a way that n−1 of the observations come from a Fisher distribution with α = 0, β = 0 while one observation (the outlier) comes from a Fisher distribution with α = λπ, β = 0, κ = 30, 0≤λ≤1. If the values of the Qk, Ek and Ck statistics are greater than the corresponding cut-off points and the i-th observation is located at the outlying value, then we have correctly detected an outlier. We repeat the simulation 3000 times and obtain the value of P5, also known as the probability of correct detection of the outlier that has been introduced into the sample. Note that the cut-off points for the Ek and Ck statistics are obtained from Monte Carlo simulation following the same procedure used to obtain the cut-off points for the Qk statistic.
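The P5 simulation for a single outlier can be sketched in Python along the following lines, for the k = 1 nearest-neighbour statistic only. The sketch is an illustration under assumptions: the statistic is the largest first-nearest angular distance, the contaminant is generated at the north pole and rotated about the y-axis to colatitude λπ, and the null cut-off is estimated inside the same script. The Ek and Ck statistics are not implemented, and all names are illustrative.

```python
import math
import random


def sample_fisher(n, kappa, alpha, rng):
    """n unit vectors from a Fisher distribution with mean colatitude alpha, longitude 0."""
    points = []
    for _ in range(n):
        u = 1.0 - rng.random()  # uniform on (0, 1]
        w = 1.0 + math.log(u + (1.0 - u) * math.exp(-2.0 * kappa)) / kappa
        w = max(-1.0, min(1.0, w))
        phi = 2.0 * math.pi * rng.random()
        s = math.sqrt(1.0 - w * w)
        x, y, z = s * math.cos(phi), s * math.sin(phi), w
        # Rotate about the y-axis so that the north pole maps to colatitude alpha.
        points.append((x * math.cos(alpha) + z * math.sin(alpha), y,
                       -x * math.sin(alpha) + z * math.cos(alpha)))
    return points


def nearest_distances(points):
    """First-nearest angular distance for every point."""
    out = []
    for i, p in enumerate(points):
        best = math.pi
        for j, q in enumerate(points):
            if j != i:
                dot = p[0] * q[0] + p[1] * q[1] + p[2] * q[2]
                best = min(best, math.acos(max(-1.0, min(1.0, dot))))
        out.append(best)
    return out


def null_cutoff(n, kappa, level, reps, rng):
    """Upper percentage point of the largest first-nearest distance, no outliers."""
    values = sorted(max(nearest_distances(sample_fisher(n, kappa, 0.0, rng)))
                    for _ in range(reps))
    return values[min(reps - 1, math.ceil((1.0 - level) * reps) - 1)]


def estimate_p5(n, kappa, lam, kappa_out, cutoff, reps, rng):
    """Share of runs where the planted contaminant is the flagged observation."""
    hits = 0
    for _ in range(reps):
        points = sample_fisher(n - 1, kappa, 0.0, rng)
        points += sample_fisher(1, kappa_out, lam * math.pi, rng)  # contaminant is last
        d = nearest_distances(points)
        worst = max(range(n), key=d.__getitem__)
        if d[worst] > cutoff and worst == n - 1:
            hits += 1
    return hits / reps
```

Evaluating `estimate_p5` over a grid of λ values between 0 and 1 gives a performance curve of the kind plotted in the figures; the estimate should rise towards one as the contaminant moves away from the mean direction.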
Fig 1 plots the performance of the Qk, Ek and Ck statistics to detect a single outlier for small sample size and small concentration parameter value. Generally, for small sample size, the Qk statistic performs better than the Ek statistic only. However, the performance is almost identical when larger sample size and larger concentration parameter values are considered as shown in Fig 2.
We also investigate the performance to detect a patch of outliers for the three statistics. For small sample size and small concentration parameter value, the performance of the Qk statistic is comparable to the Ek and Ck statistics as shown in Fig 3. A much closer result is observed for larger sample size and larger concentration values as shown in Fig 4. The trend is observed for other combinations of sample size and concentration parameter values. This suggests that the Qk statistic can be a good alternative outlier detection method for spherical data.
Fig 3. The performance of the Qk, Ek and Ck statistics for n = 10 and κ = 3, for a patch of three outliers.
Fig 4. The performance of the Qk, Ek and Ck statistics for n = 30 and κ = 50, for a patch of three outliers.
Practical example
For illustration, we now apply the proposed and existing spherical discordancy tests to a set of eye data. We consider eye data from 23 patients (in radians) recorded using optical coherence tomography (OCT) at the University Malaya Medical Centre (UMMC). OCT technology was originally used in ophthalmology to image the posterior segment and has also been used to image anterior segment structures such as the cornea. The angle images of the anterior segment of the UMMC patients’ eyes were obtained with Anterior Segment OCT (AS-OCT). The measurements selected are the angle of the posterior corneal curvature, φ, and the angle of the eye (between the posterior corneal curvature and the iris), θ. We are keen to identify possible outliers in this data set, which is given in Table 4.
The summary statistics for the given spherical data set, namely the sample mean direction (expressed in longitude and latitude) and the concentration parameter, are calculated. The spherical plot of the data is given in Fig 5. The observations are located around the north pole. This indicates that, for these 23 observations, both variables, namely the angle of the posterior corneal curvature and the angle of the eye, point in a similar direction. However, there is one observation lying further away from the rest.
It is known that Q-Q plots and probability plots are commonly used to investigate the goodness of fit of linear, circular and spherical data samples (see for example [14–16]). They are used to visualize the goodness of fit and to identify the presence of outlier(s) at an early stage. [11] provided procedures for plotting ordered values of spherical data assumed to follow a Fisher distribution. They proposed three types of procedures for a Fisher model, namely the colatitude, longitude and two-variable plotting procedures. The procedures consider three ordered-value plots. Two of them examine the marginal distributions of the two variables and the third examines the association between these two variables. The details of the procedures can be found in [11]. Note that the quantile of the unit exponential distribution is denoted by e, that of the uniform model by u and that of the N(0,1) distribution by q.
The colatitude plot of the eye data in Fig 6(A) indicates that the data follow a Fisher distribution, as the plot gives almost a straight line through the origin. This is further supported by the longitude plot in Fig 6(B), which gives an approximately linear plot with a slope close to 45° passing through the origin. From Fig 6(C), we can clearly see one observation that lies far from the rest, indicating the existence of one possible outlier. Therefore, we apply the proposed discordancy test to the data. Upon applying the maximum likelihood estimation method, we obtain the estimated parameters of the Fisher distribution.
Fig 6. Plots of eye data: (a) colatitude plot, (b) longitude plot, (c) two-variable plot.
Based on the estimated parameters, we obtain the critical values of the three test statistics using the R statistical software. We then apply the discordancy tests, including our proposed test statistic, and obtain the test statistic values corresponding to observation number 17. Table 5 shows the cut-off points for the three methods at the 10% significance level. Based on Table 5, only the Q2 and Q3 statistics detect observation 17 as an outlier at the 10% upper level. This observation corresponds to a patient with a small value of the angle of the posterior corneal curvature compared to other patients and thus may warrant further investigation.
Next, we demonstrate the application of the tests to detect a patch of outliers. Observation 10 is chosen and relocated close to observation 17 so that a patch of two outliers exists in the data. The new coordinate for observation 10 is θ = 0.9599, φ = 0.6109.
From Fig 7, it can be seen clearly that both observations (observations 10 and 17) are located far from the rest. The sample mean direction, expressed in longitude and latitude, and the concentration parameter are recomputed for the modified data. The values of the test statistics and the cut-off points for the three methods at the 10% significance level are given in Table 6. As a result, the Q2 and Q3 statistics successfully detect observations 10 and 17 as a patch of two outliers at the 10% upper level while the other test statistics fail.
Conclusion
In this paper, we proposed a new discordancy test for detecting outliers in spherical data based on the k-nearest neighbours distance. We further demonstrated the applicability of the proposed Qk statistic on the eye data set by successfully identifying a single outlier and a patch of outliers in the data. A novel aspect of this method is its ability to detect a patch of outliers, which can be extended to cluster analysis in spherical data. The proposed procedure should also work for other spherical distributions.
References
- 1. Hussin A. G. (1997). Pseudo-replication in functional relationships with environmental applications. Ph.D. thesis, University of Sheffield.
- 2. Ahmad N., Nawawi M.S.A.M., Zainuddin M.Z., Nasir Z.M., Yunus R.M. and Mohamed I. (2020). A new crescent moon visibility criteria using circular regression model: A case study of Teluk Kemang, Malaysia, Sains Malaysiana, 49(4):859–870.
- 3. Mardia K. V. (1972). Statistics of directional data. Academic Press, London.
- 4. Fisher N. I., Lewis T. and Willcox M. E. (1981). Tests of discordancy for samples from Fisher’s distribution on the sphere. Applied Statistics 30, 230–237.
- 5. Collett D. (1980). Outliers in circular data. Applied Statistics 29(1):50–57.
- 6. Abuzaid A. H., Mohamed I. B. and Hussin A. G. (2009). A new test of discordancy in circular data. Communication in Statistics-Simulation and Computation 38(4): 682–691.
- 7. Rambli A., Ibrahim S., Abdullah M. I., Hussin A. G. and Mohamed I. (2012). On discordance test for the wrapped normal data. Sains Malaysiana, 41 (6): 769–778.
- 8. Rambli A., Abuzaid A. H., Mohamed I. B. and Hussin A. G. (2016). Procedure for detecting outliers in a circular regression model, PLOS ONE, 11 (4): e0153074. pmid:27064566
- 9. Mohamed I.B., Rambli A., Khaliddin N. and Hussin A.G. (2016). A new discordancy test in circular data using spacings theory, Communication in Statistics—Simulation and Computation, 45 (5), 2904–2916.
- 10. Mahmood E. A., Midi H., Rana S., and Hussin A. G. (2017). Detecting of outliers in univariate circular data using robust circular distance, Journal of Modern Applied Statistical Methods. 16(2), 418–438.
- 11. Lewis T. and Fisher N. I. (1982). Graphical methods for investigating the fit of a Fisher distribution to spherical data. Geophysical Journal of the Royal Astronomical Society 69, 1–13.
- 12. David H. A. (1970). Order statistics. New York and London, Wiley.
- 13. Barnett V. and Lewis T. (1984). Outliers in statistical data. New York: John Wiley & Sons.
- 14. Best D. J. and Fisher N. I. (1986). Goodness-of-fit and discordancy tests for samples from the Watson distribution on the sphere. Australian Journal of Statistics, 28, 13–31.
- 15. Collett D. and Lewis T. (1981). Discriminating between the von Mises and wrapped normal distributions. Australian Journal of Statistics, 23 (1): 73–79.
- 16. Fisher N. I., Lewis T. and Embleton B. J. J. (1987). Statistical analysis of spherical data, New York: Cambridge University Press.