A method for detecting outliers in linear-circular non-parametric regression

Abstract

This study proposes a robust outlier detection method based on the circular median for non-parametric linear-circular regression when the response variable includes outlier(s) and the residuals are Wrapped-Cauchy distributed. Nadaraya-Watson and local linear regression methods were employed to obtain non-parametric regression fits. The proposed method’s performance was investigated by using a real dataset and a comprehensive simulation study with different sample sizes, contamination, and heterogeneity degrees. The method performs quite well at medium and higher contamination degrees, and its performance increases as the sample size and the homogeneity of the data increase. In addition, when the response variable of linear-circular regression contains outliers, the Local Linear Estimation method fits the data set better than the Nadaraya-Watson method.

1. Introduction

Circular (directional) data are measurements of directions, angles, and rotations, and are primarily used in engineering, meteorology, ocean science, geography, geology, medicine, and neuroscience ([1–3]). Such data consist of angles measured from a specific reference point and are assumed to lie on the unit circle. Due to this geometric structure, conventional statistical methods cannot be applied to circular data, so dedicated methods are needed to analyze them ([1]).

Non-parametric circular regression is a popular research topic. The first implementation of non-parametric circular regression was made by Di Marzio et al. [4], who defined circular kernel functions and extended the least squares method to the local polynomial regression model. Di Marzio et al. [5] defined Nadaraya-Watson (NW) and local linear (LL) non-parametric regression models and kernel weight functions for the case of a circular response variable, examining the methods separately for models with linear and with circular explanatory variables. Oliveira et al. [6] compared the local trigonometric and Nadaraya-Watson estimators when the explanatory and response variables were circular and linear, respectively. They also obtained the optimal bandwidth value through leave-one-out cross-validation (CV). Besides, Oliveira et al. [7] developed an R package named NPCirc for non-parametric density estimation and regression analysis and extended it in [8]. Xu [9] developed a non-parametric smoothing method to estimate the periodic functions of both circular density estimation and linear-circular non-parametric regression. Sikaroudi and Park [10] presented a mixture of linear-linear regression models as an alternative for parametric and non-parametric linear-circular regression. Alonso-Pena et al. [11] developed different non-parametric tests to examine the equality and parallelism of the non-parametric regression curves across various groups, including linear-circular non-parametric regression cases. Meilán-Vila et al. [12] introduced local linear estimators for non-parametric multiple regression when the response variable was circular. Meilán-Vila et al. [13] proposed circular trend surface estimators considering a spatial linear-circular non-parametric regression model. Recently, Di Marzio et al. [14] addressed the problem of estimating the kernel regression function in the presence of measurement errors when the predictor and/or response variable is circular.

In statistics, outliers are observations that lie at a significant distance from the other observations. They are among the most important problems encountered in modelling and forecasting because of their undesirable effects on estimation. Since circular data differ from linear data in their geometric structure, detecting outliers in them requires special investigation.

The studies on detecting outliers for circular regression are generally based on simple circular regression. Abuzaid and Hussin [15] developed numerical and graphical methods using circular residuals for detecting a single outlier in circular regression. Abuzaid et al. [16] proposed a Mean Circular Error (MCE) statistic based on row deletion from the data to detect outliers. Rana et al. [17] developed an outlier detection method for the case in which both explanatory and response variables contain outliers. Mahmood et al. [18] suggested a robust approach (RCDU) based on the circular median and generated cut-off points for different parameters of the Von Mises (VM) distribution for univariate circular data. In another study, Mahmood et al. [19] proposed a robust method for detecting outliers in the simple circular regression model (RCDy) when both explanatory and response variables were circular. Alkasadi et al. [20] developed an outlier detection procedure for multiple circular regression models. Their procedure uses the DFFITc statistic, and they showed that it performs well in detecting outliers in multiple circular regression models.

All the abovementioned methods have been proposed for the cases when the residuals come from a well-defined VM distribution. In addition, the Wrapped-Cauchy (WC) error was assumed by Kato et al. [21], and its impact on the performance of simple circular regression was discussed by Abuzaid and Allahham [22].

The current study proposes a new method to detect outliers in linear-circular non-parametric regression when the circular residuals come from the WC distribution. The paper is organized as follows. Section 2 introduces the concept of a circular outlier, circular distance, and the proposed method. A comprehensive simulation study in Section 3 investigates the performance of the proposed method. In Section 4, a real data example is presented, and the NW and LL estimators are compared in the presence of outlier(s). Finally, the results are interpreted in Section 5.

2. Materials and method

2.1 Circular outlier and circular distance

The distance and outlier concepts for circular data differ from linear data due to their geometric structure. In circular (angular) data, a circular observation that is far from the main mass of the data (e.g., mean direction) can be referred to as an outlier ([23]).

Let θi and θj, i, j = 1, 2, …, n, be random circular observations on the unit circle. The circular distance between the angular observations θi and θj is given in Eq (1). Note that this distance cannot be greater than π ([2]).

$$d(\theta_i, \theta_j) = \pi - \left|\pi - \left|\theta_i - \theta_j\right|\right| \tag{1}$$
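As an illustration, the sketch below assumes that Eq (1) is the usual arc-length distance π − |π − |θi − θj|| described in [2]. The function name `circ_dist` is ours and does not come from the paper.

```python
import numpy as np

def circ_dist(theta_i, theta_j):
    """Shorter arc length between two angles (radians); the result lies in [0, pi]."""
    # Reduce the raw difference to [0, 2*pi) so inputs need not be pre-wrapped.
    diff = np.abs(np.asarray(theta_i) - np.asarray(theta_j)) % (2 * np.pi)
    # Fold differences larger than pi back onto the shorter arc.
    return np.pi - np.abs(np.pi - diff)

# Two angles on either side of the zero direction are close on the circle,
# even though their raw difference is almost 2*pi.
print(circ_dist(0.1, 2 * np.pi - 0.1))  # approximately 0.2
```

The folding step is what keeps the distance bounded by π, as noted in the text.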

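The mean direction is used as a measure of location for circular data and is estimated using Eq (2),

$$\bar{\theta} = \operatorname{atan2}(S, C) \tag{2}$$

where $C = \sum_{i=1}^{n} \cos\theta_i$ and $S = \sum_{i=1}^{n} \sin\theta_i$ ([1]).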

Like the arithmetic mean on which its calculation relies, the mean direction is not robust to outliers. Otieno and Anderson-Cook [24] stated that the circular median displayed more robust behaviour than the mean direction. He and Simpson [25] suggested using the circular median instead of the circular mean, especially when the dataset does not follow the VM distribution. The circular median is the angle θ that minimizes Eq (3) ([1]).

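$$d(\theta) = \pi - \frac{1}{n}\sum_{i=1}^{n}\left|\pi - \left|\theta_i - \theta\right|\right| \tag{3}$$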
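The two location measures can be compared with a short sketch. It assumes the textbook definitions from [1]: the mean direction is atan2 of the sine and cosine sums, and the median minimizes the mean arc distance to the data. As a simplification, the median search is restricted to the observed angles. The function names are ours.

```python
import numpy as np

def mean_direction(theta):
    """Mean direction in [0, 2*pi), computed from the sine and cosine sums."""
    theta = np.asarray(theta, dtype=float)
    return np.arctan2(np.sin(theta).sum(), np.cos(theta).sum()) % (2 * np.pi)

def circ_median(theta):
    """Observed angle minimising d(phi) = pi - mean(|pi - |theta_i - phi||)."""
    theta = np.asarray(theta, dtype=float) % (2 * np.pi)
    # Pairwise raw differences between every observation and every candidate.
    diff = np.abs(theta[:, None] - theta[None, :])
    # Mean arc distance from the sample to each candidate angle.
    d = np.pi - np.abs(np.pi - diff).mean(axis=0)
    return theta[np.argmin(d)]

# Four clustered angles plus one far-away observation.
sample = [0.1, 0.2, 0.3, 0.25, 3.0]
print(mean_direction(sample))  # pulled towards the far observation
print(circ_median(sample))     # stays inside the cluster
```

In this small sample the mean direction moves past 0.3 while the median remains at 0.25, which is the robustness property the method relies on.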

2.2 Method

This study proposes an outlier detection method based on the distances of linear-circular regression residuals from their median value for WC distributed data. Circular distributions such as VM or WC are symmetric; therefore, the circular mean accurately represents the centre of the data. However, as He and Simpson [25] stated, the circular median is more robust than the circular mean when the data distribution is not symmetric. When a WC distributed data set includes outliers, distances from the circular mean may not work well for outlier detection, since the outliers make the distribution deviate from symmetry to a certain degree. Therefore, we have based our method on median distances.

Our method follows a two-stage procedure. In the first stage, the cut-off points used to decide whether an observation is an outlier are calculated for the combinations of sample size and concentration parameters; in the second stage, the observations exceeding the corresponding cut-off value are defined as outliers. The procedure can be summarised in steps as follows.

  1. Step 1. Calculate the absolute value of circular residuals from the fitted regression model.
  2. Step 2. Compute the absolute value of the circular residuals’ (ei) distances from their circular median (disti).
  3. Step 3. Calculate the 90%, 95%, and 99% quantiles for the distances disti.
  4. Step 4. Repeat Step 3 2000 times and set the mean quantiles as cut-off values.
  5. Step 5. Flag the observations with disti > cutoff as outliers.
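The steps above can be sketched in code under stated simplifications. Residuals are drawn directly from a wrapped Cauchy distribution with mean direction 0 instead of being taken from a fitted NW or LL model, so the resulting cut-offs will not reproduce the tabulated values. The circular median is searched over the observed values only. All function names are ours.

```python
import numpy as np

def circ_dist(a, b):
    """Shorter arc length between angles, in [0, pi]."""
    diff = np.abs(np.asarray(a) - np.asarray(b)) % (2 * np.pi)
    return np.pi - np.abs(np.pi - diff)

def circ_median(theta):
    """Observed angle with the smallest mean arc distance to the sample."""
    theta = np.asarray(theta, dtype=float) % (2 * np.pi)
    return theta[np.argmin(circ_dist(theta[:, None], theta[None, :]).mean(axis=0))]

def median_distances(resid):
    # Step 1: wrap residuals to [-pi, pi) and take absolute values.
    abs_res = np.abs((np.asarray(resid, dtype=float) + np.pi) % (2 * np.pi) - np.pi)
    # Step 2: circular distance of each absolute residual from their median.
    return circ_dist(abs_res, circ_median(abs_res))

def simulate_cutoff(n, rho, q=0.95, reps=2000, seed=0):
    """Steps 3 and 4: mean of the q-quantile of the distances over `reps` samples."""
    rng = np.random.default_rng(seed)
    scale = -np.log(rho)  # a Cauchy(0, scale) draw wrapped on the circle is WC(0, rho)
    quantiles = [
        np.quantile(median_distances(scale * rng.standard_cauchy(n)), q)
        for _ in range(reps)
    ]
    return float(np.mean(quantiles))

def flag_outliers(resid, cutoff):
    """Step 5: indices of observations whose distance exceeds the cut-off."""
    return np.flatnonzero(median_distances(resid) > cutoff)

cutoff = simulate_cutoff(n=50, rho=0.9, q=0.95)
residuals = np.r_[np.linspace(-0.05, 0.05, 49), 3.0]  # one gross residual at index 49
print(cutoff, flag_outliers(residuals, cutoff))
```

Replacing the simulated residuals with residuals from an actual kernel fit would bring the sketch closer to the procedure described in the text.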

Cut-off points were produced for the NW and LL methods and are given in Tables 1–6.

Table 1. Cut-off points for NW under different concentration parameters and sample sizes, q = 0.90.

https://doi.org/10.1371/journal.pone.0286448.t001

Table 2. Cut-off points for LL under different concentration parameters and sample sizes, q = 0.90.

https://doi.org/10.1371/journal.pone.0286448.t002

Table 3. Cut-off points for NW under different concentration parameters and sample sizes, q = 0.95.

https://doi.org/10.1371/journal.pone.0286448.t003

Table 4. Cut-off points for LL under different concentration parameters and sample sizes, q = 0.95.

https://doi.org/10.1371/journal.pone.0286448.t004

Table 5. Cut-off points for NW under different concentration parameters and sample sizes, q = 0.99.

https://doi.org/10.1371/journal.pone.0286448.t005

Table 6. Cut-off points for LL under different concentration parameters and sample sizes, q = 0.99.

https://doi.org/10.1371/journal.pone.0286448.t006

The performance of the proposed method is assessed with three different measures, as given below ([18]).

  1. Masking (M): Rate of outliers detected as inliers.
    (4)
  2. Swamping (S): Rate of inlier observations detected as outliers.
    (5)
  3. True Detection Rate (TDR): Rate of correctly detected outliers,
    where n denotes the sample size.
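The three measures can be illustrated with a small helper. The denominators used here are assumptions based on the verbal definitions, since Eqs (4) and (5) are not reproduced above: masking and TDR are taken relative to the number of true outliers, and swamping relative to the number of inliers. The function name `detection_rates` is ours.

```python
import numpy as np

def detection_rates(true_outliers, flagged, n):
    """Masking, swamping and true detection rate for one sample of size n.

    Assumed definitions (not taken from the paper's equations):
      masking  = true outliers left unflagged / number of true outliers
      swamping = inliers flagged as outliers  / number of inliers
      tdr      = true outliers flagged        / number of true outliers
    Requires at least one true outlier and at least one inlier.
    """
    truth = np.zeros(n, dtype=bool)
    truth[np.asarray(list(true_outliers), dtype=int)] = True
    flag = np.zeros(n, dtype=bool)
    flag[np.asarray(list(flagged), dtype=int)] = True
    n_out = int(truth.sum())
    n_in = n - n_out
    return {
        "masking": float((truth & ~flag).sum() / n_out),
        "swamping": float((~truth & flag).sum() / n_in),
        "tdr": float((truth & flag).sum() / n_out),
    }

# Ten observations, true outliers at 2 and 7, detector flags 2 and 5.
print(detection_rates([2, 7], [2, 5], n=10))
```

Under these assumed definitions masking and TDR sum to one, which agrees with the later remark that masking behaves in the opposite way to TDR.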

3. Simulation study

A simulation study was performed to evaluate the outlier detection performance of the proposed method, with five factors: sample size, contamination degree, concentration values, regression estimation procedure, and the number of outliers. Simulations were conducted via a crossover design with the factor levels:

Note that since some percentage values of n = 20, 40 and 50 are less than 1, the outlier number was rounded to 1 in these cases.

The data for the explanatory variable X of the linear-circular regression model come from X ~ N (3, 0.25). Then the circular response yi is generated through Eq (6).

(6)

NW and LL methods with Gaussian kernel are used to obtain non-parametric regression fits as defined in Di Marzio et al. [5]. In addition, the leave-one-out Cross Validation (CV) is used to obtain bandwidths to estimate regressions and is given in Eq (7).

(7)

The optimal bandwidth is the value minimizing Eq (7). The leave-one-out term in Eq (7) denotes the value estimated after excluding the pair of observations (Xi, Θi) from the whole data set. The procedure is iterated, excluding only one pair of observations from the data set at a time. The contaminated yk is generated as suggested by Abuzaid et al. [16] and Mahmood et al. [19]:

(8)

where γ refers to the contamination degree.
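A minimal sketch of the NW fit, the bandwidth search, and the contamination step follows. Several parts are assumptions rather than the paper's exact formulas: the CV loss is taken to be the sum of 1 − cos of the leave-one-out residuals, the contamination rule is taken to be yk + γπ (mod 2π) in the form used by Abuzaid et al. [16], the regression curve 0.5x is hypothetical because Eq (6) is not reproduced here, and N(3, 0.25) is read as having variance 0.25. The LL estimator is omitted for brevity.

```python
import numpy as np

def gauss_weights(x0, x, h):
    """Gaussian kernel weights between evaluation points x0 and data x."""
    return np.exp(-0.5 * ((x0[:, None] - x[None, :]) / h) ** 2)

def nw_fit(x0, x, theta, h):
    """NW estimate of the circular response at x0: atan2 of kernel-weighted sums."""
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    x, theta = np.asarray(x, dtype=float), np.asarray(theta, dtype=float)
    w = gauss_weights(x0, x, h)
    return np.arctan2(w @ np.sin(theta), w @ np.cos(theta))

def loo_cv(h, x, theta):
    """Leave-one-out CV score with an assumed cosine loss."""
    x, theta = np.asarray(x, dtype=float), np.asarray(theta, dtype=float)
    w = gauss_weights(x, x, h)
    np.fill_diagonal(w, 0.0)  # drop each pair (X_i, Theta_i) from its own fit
    fit = np.arctan2(w @ np.sin(theta), w @ np.cos(theta))
    return float(np.sum(1.0 - np.cos(theta - fit)))

def select_bandwidth(x, theta, grid):
    """Bandwidth from `grid` with the smallest leave-one-out CV score."""
    return grid[int(np.argmin([loo_cv(h, x, theta) for h in grid]))]

def contaminate(theta, idx, gamma):
    """Shift the selected responses by gamma * pi (assumed contamination rule)."""
    out = np.array(theta, dtype=float)
    out[idx] = (out[idx] + gamma * np.pi) % (2 * np.pi)
    return out

rng = np.random.default_rng(42)
x = rng.normal(3.0, 0.5, size=100)                # sd 0.5, i.e. variance 0.25 (assumed)
noise = -np.log(0.9) * rng.standard_cauchy(100)   # wrapped Cauchy errors, rho = 0.9
theta = (0.5 * x + noise) % (2 * np.pi)           # hypothetical regression curve
theta_c = contaminate(theta, [0], gamma=0.85)     # contaminate the first response
h = select_bandwidth(x, theta_c, grid=[0.05, 0.1, 0.2, 0.5, 1.0])
residuals = theta_c - nw_fit(x, x, theta_c, h)    # circular residuals before wrapping
print(h, residuals[:3])
```

A grid search is used here only for transparency; any one-dimensional optimizer over the bandwidth would serve the same purpose.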

All the outputs of the designed simulation experiment were obtained, yet only some are included in the text for space and simplicity. Since the results of the performance indicators are consistent across all sample sizes, only the results for the concentration parameters ρ = 0.70, ρ = 0.90 and q = 0.95 when n = 40, 100 and 200 are included in the paper and given in Figs 1–14.

Fig 1. TDR, M, and S of NW for different contamination degrees with percentage of contamination 1%, ρ = 0.70, q = 0.95.

https://doi.org/10.1371/journal.pone.0286448.g001

Fig 2. TDR, M, and S of LL for different contamination degrees with percentage of contamination 1%, ρ = 0.70, q = 0.95.

https://doi.org/10.1371/journal.pone.0286448.g002

Fig 3. TDR, M, and S of NW for different concentration parameters with percentage of contamination 1%, γ = 0.85, q = 0.95.

https://doi.org/10.1371/journal.pone.0286448.g003

Fig 4. TDR, M, and S of LL for different concentration parameters with percentage of contamination 1%, γ = 0.85, q = 0.95.

https://doi.org/10.1371/journal.pone.0286448.g004

Fig 5. TDR, M, S and MCE values of NW and LL for different contamination degrees with percentage of contamination 1%, ρ = 0.70, q = 0.95, n = 40.

https://doi.org/10.1371/journal.pone.0286448.g005

Fig 6. TDR, M, S and MCE values of NW and LL for different contamination degrees with percentage of contamination 1%, ρ = 0.70, q = 0.95, n = 200.

https://doi.org/10.1371/journal.pone.0286448.g006

Fig 7. TDR, M, S and MCE values of NW and LL, for different concentration parameters with percentage of contamination 1%, γ = 0.85, q = 0.95, n = 40.

https://doi.org/10.1371/journal.pone.0286448.g007

Fig 8. TDR, M, S and MCE values of NW and LL for different concentration parameters with percentage of contamination 1%, γ = 0.85, q = 0.95, n = 200.

https://doi.org/10.1371/journal.pone.0286448.g008

Fig 9. TDR, M, S and MCE values of NW and LL for different percentages of contamination with ρ = 0.70, γ = 0.85, q = 0.95, n = 50.

https://doi.org/10.1371/journal.pone.0286448.g009

Fig 10. TDR, M, S and MCE values of NW and LL for different percentages of contamination with ρ = 0.70, γ = 0.85, q = 0.95, n = 100.

https://doi.org/10.1371/journal.pone.0286448.g010

Fig 11. TDR, M, S and MCE values of NW and LL for different percentages of contamination with ρ = 0.70, γ = 0.85, q = 0.95, n = 200.

https://doi.org/10.1371/journal.pone.0286448.g011

Fig 12. TDR, M, S and MCE values of NW and LL for different percentages of contamination with ρ = 0.90, γ = 0.85, q = 0.95, n = 50.

https://doi.org/10.1371/journal.pone.0286448.g012

Fig 13. TDR, M, S and MCE values of NW and LL for different percentages of contamination with ρ = 0.90, γ = 0.85, q = 0.95, n = 100.

https://doi.org/10.1371/journal.pone.0286448.g013

Fig 14. TDR, M, S and MCE values of NW and LL for different percentages of contamination with ρ = 0.90, γ = 0.85, q = 0.95, n = 200.

https://doi.org/10.1371/journal.pone.0286448.g014

The outputs of the simulation showed that all the performance criteria, TDR, masking, and swamping rates, are sensitive for specific ranges at all levels of simulation design factors. On the other hand, the increasing value of the concentration parameter has a positive effect on TDR. In most cases, just as the value of the concentration parameter increases, so does TDR, and this trend can be observed more regularly at greater contamination degrees. Furthermore, the TDR improves much faster when the concentration parameter is greater than 0.70. It would be wrong to assume a significant increase occurs in TDR as the contamination degree becomes larger; however, it could be stated that the TDR is higher at large contamination degrees than at small contamination degrees. The opposite interpretation can be made for masking.

TDR performance varies depending on the concentration parameter at different sample sizes. For the 0.20–0.80 range of the concentration parameter, the TDR performs better as the sample size gets smaller, while outside this range, it is not affected by the sample size. At 0.85 and higher values of the concentration parameter, the TDR approaches 1 in all sample sizes.

The swamping rate is affected mainly by sample size since the swamping rate decreases as the sample size increases. While the percentage of contamination is effective at the smaller values of the concentration parameter on performance criteria, both percentage of contamination and sample size lose their effect as the concentration parameter becomes larger. NW and LL show performances close to one another for almost all performance evaluation criteria. However, the increase in percentage of contamination causes the LL method to yield slightly better performance values than NW.

Only some of the simulation outputs are presented here; the rest have similar characteristics and are given in the S1–S5 Files.

4. Real data example

The proposed method was applied to the 2018 GEFC Wind Turbine Scada Dataset, which includes wind speed (m/s) and wind direction (°) measurements taken from the Scada system of a wind turbine operating and generating electricity in Turkey ([26]). To model the relationship between wind direction and wind speed, the 10-minute measurements recorded between 05.01.2018, 20:50 and 06.01.2018, 05:50 were considered. Since “wind speed” (explanatory variable) is linear and “wind direction” (response variable) is circular, a linear-circular kernel regression was fitted with both the NW and LL methods (Fig 15). The circular plots and rose diagrams of the estimated circular residuals for both methods are given in Fig 16.

Fig 15. NW and LL fits for the Wind Turbine Scada dataset.

https://doi.org/10.1371/journal.pone.0286448.g015

Fig 16.

The circular plot and the rose diagram of the estimated circular residuals when NW is applied to the Wind Turbine Scada dataset (left), and the circular plot and the rose diagram of the estimated circular residuals when LL is applied to the Wind Turbine Scada dataset (right).

https://doi.org/10.1371/journal.pone.0286448.g016

Before fitting the regression models, the distribution of the data was investigated. The Watson U2 test was employed to test the null hypothesis that the distribution is WC. Because the U2 test for the WC distribution is not available in any software, the authors produced asymptotic critical values with the bootstrap method following Sun [27]. The results confirmed that the WC distribution appeared to be a good fit for the Wind Turbine Scada dataset, with mean direction 1.074 and concentration parameter 0.1831.

The circular distances between the absolute residuals of NW and LL and their circular medians were calculated and compared against the cut-off values 1.1989, 1.7149, and 2.2787 of NW and 1.1914, 1.7161, and 2.2930 of LL for the quantiles 0.90, 0.95 and 0.99, respectively. The outlier investigation flagged the 27th, 35th, 36th, 37th and 55th observations at q = 0.95, and the 35th, 36th, and 55th observations at q = 0.99. The cut-off points and the distances are illustrated in Fig 17, where the straight lines indicate the cut-off points.

Fig 17.

The distances and the cut-off points for the Wind Turbine Scada dataset, using NW fits (left) and using LL fits (right).

https://doi.org/10.1371/journal.pone.0286448.g017

5. Conclusion

The present study deals with the problem of detecting outliers in non-parametric linear-circular regression. An outlier detection method based on the distances of linear-circular regression residuals from their circular median has been proposed for WC distributed errors. The corresponding cut-off points were identified via simulations. In addition, a comprehensive simulation study was carried out to evaluate the performance of the proposed procedure in terms of true detection, masking, and swamping rates. The results showed that the proposed method performs well for medium and higher contamination degrees. It was also observed that the method’s performance increases as the sample size and the homogeneity of the data increase. The findings were illustrated and supported through a real data set example. NW and LL methods with Gaussian kernel were used to obtain non-parametric regression fits. The results indicate that when the response variable of linear-circular regression contains outliers, the Local Linear Estimation method is preferable to the Nadaraya-Watson method.

It should be noted that although the proposed method is quite satisfactory for outlier detection in a linear-circular non-parametric regression model, the method and, therefore, the generated cut-off values are model specific. Thus, the use of calculated cut-off values is limited only to the linear-circular non-parametric regression and estimation methods used in this study. Further studies are planned to address these issues and develop outlier detection methods for circular-linear and circular-circular non-parametric regression models.

Supporting information

References

  1. Fisher NI. Statistical analysis of circular data. Cambridge University Press; 1995 Oct 12.
  2. Jammalamadaka SR, SenGupta A. Topics in circular statistics. World Scientific; 2001.
  3. Mardia KV, Jupp PE. Directional statistics. Chichester: Wiley; 2000 Jan.
  4. Di Marzio M, Panzera A, Taylor CC. Local polynomial regression for circular predictors. Statistics & Probability Letters. 2009 Oct 1;79(19):2066–75.
  5. Di Marzio M, Panzera A, Taylor CC. Non‐parametric regression for circular responses. Scandinavian Journal of Statistics. 2013 Jun;40(2):238–55.
  6. Oliveira M, Crujeiras RM, Rodríguez-Casal A. Nonparametric circular methods for exploring environmental data. Environmental and Ecological Statistics. 2013 Mar;20:1–7.
  7. Oliveira M, Crujeiras RM, Rodríguez-Casal A. NPCirc: An R package for nonparametric circular methods. Journal of Statistical Software. 2014 Nov 13;61:1–26.
  8. Alonso-Pena M, Oliveira M, Ameijeiras-Alonso J, Crujeiras RM, Gijbels I, Rodriguez-Casal A, et al. Package ‘NPCirc’.
  9. Xu Z. An alternative circular smoothing method to nonparametric estimation of periodic functions. Journal of Applied Statistics. 2016 Jul 3;43(9):1649–72.
  10. Sikaroudi AE, Park C. A mixture of linear-linear regression models for a linear-circular regression. Statistical Modelling. 2021 Jun;21(3):220–43.
  11. Alonso-Pena M, Ameijeiras-Alonso J, Crujeiras RM. Nonparametric tests for circular regression. Journal of Statistical Computation and Simulation. 2021 Feb 11;91(3):477–500.
  12. Meilán-Vila A, Francisco-Fernández M, Crujeiras RM, Panzera A. Nonparametric multiple regression estimation for circular response. TEST. 2021 Sep;30(3):650–72.
  13. Meilán-Vila A, Crujeiras RM, Francisco-Fernández M. Nonparametric estimation of circular trend surfaces with application to wave directions. Stochastic Environmental Research and Risk Assessment. 2021 Apr;35(4):923–39.
  14. Di Marzio M, Fensore S, Taylor CC. Kernel regression for errors-in-variables problems in the circular domain. Statistical Methods & Applications. 2023 Mar 30:1–21.
  15. Abuzaid A, Hussin AG. Identifying single outlier in linear circular regression model based on circular distance. Journal of Applied Probability and Statistics. 2009;3(1):107–117.
  16. Abuzaid AH, Hussin AG, Mohamed IB. Detection of outliers in simple circular regression models using the mean circular error statistic. Journal of Statistical Computation and Simulation. 2013 Feb 1;83(2):269–77.
  17. Rana S, Mahmood EA, Midi H, Hussin AG. Robust detection of outliers in both response and explanatory variables of the simple circular regression model. Malaysian Journal of Mathematical Sciences. 2016 Sep 30;10(3):399–414.
  18. Mahmood EA, Rana S, Midi H, Hussin AG. Detection of outliers in univariate circular data using robust circular distance. Journal of Modern Applied Statistical Methods. 2017;16(2):22.
  19. Mahmood EA, Midi H, Rana S, Hussin AG. Robust circular distance and its application in the identification of outliers in the simple circular regression model. Asian Journal of Applied Sciences. 2017;10:126–133.
  20. Alkasadi N, Ibrahim S, Abuzaid A, Yusoff MI. Outlier detection in multiple circular regression model using DFFITc statistic. Sains Malaysiana. 2019;47(7):399–414.
  21. Kato S, Shimizu K, Shieh GS. A circular–circular regression model. Statistica Sinica. 2008 Apr 1:633–45.
  22. Abuzaid AH, Allahham NR. Simple circular regression model assuming wrapped Cauchy error. Pak. J. Statist. 2015;31(4):385–98.
  23. Collett D. Outliers in circular data. Journal of the Royal Statistical Society: Series C (Applied Statistics). 1980 Mar;29(1):50–7.
  24. Otieno BS, Anderson-Cook CM. Measures of preferred direction for environmental and ecological circular data. Environmental and Ecological Statistics. 2006 Sep;13:311–24.
  25. He X, Simpson DG. Robust direction estimation. The Annals of Statistics. 1992 Mar;20(1):351–69.
  26. 2018 GEFC Wind Turbine Scada Dataset [dataset]. Available from: https://www.kaggle.com/datasets/berkerisen/wind-turbine-scada-dataset
  27. Sun Z. Comparing measures of fit for circular distributions (Doctoral dissertation).