Procedure for Detecting Outliers in a Circular Regression Model

A number of circular regression models have been proposed in the literature. In recent years, there has been strong interest in outlier detection in circular regression. An outlier detection procedure can be developed by defining a new statistic in terms of the circular residuals. In this paper, we propose a new measure which transforms the circular residuals into linear measures using a trigonometric function. We then employ the row deletion approach to identify the observations that affect the measure the most, which are candidate outliers. The corresponding cut-off points and the performance of the detection procedure when applied to Downs and Mardia's model are studied via simulation. For illustration, we apply the procedure to circadian data.


Introduction
The occurrence of outliers in a data set has been widely discussed in the literature. Outliers may be due to error, or they may be part of the phenomenon under study. Either way, it is important to identify them so that further investigation can be conducted. In linear regression, extensive studies of the problem of outliers and leverage points can be found in the literature (e.g. [1,2,3]). Many statistical software packages provide tools to identify outliers in linear regression models. However, such studies are rare for circular regression models, where both the dependent and independent variables are circular.
Circular variables are commonly found in many scientific fields such as meteorology. Such a variable takes values in the range [0, 2π) radians. The existence of outliers in circular data may affect the estimation of the parameters and weaken the accuracy of forecasting. Thus, it is of interest to develop suitable methods of identifying outliers in circular problems. We focus on developing such a method for a circular regression model.
The regression of a circular dependent variable on a set of linear variables was first discussed by Gould [4]. The model follows the linear regression form closely, and an iterative method was used to estimate the parameters by maximizing the likelihood function, with further improvements made by Fisher and Lee [5] and Johnson and Wehrly [6]. The first attempt to fit a circular regression model of two circular variables u and v was made by Laycock [7] using complex linear regression, where the model can be expressed as a conventional linear model with complex entries. Rivest [8] proposed another regression model with a specific application in predicting the direction of earthquake displacement. Jammalamadaka and Sarma [9] expressed a circular-circular model in terms of Fourier series expansions, while Hussin et al. [10] assumed that the two circular variables are linearly related. In this paper, we consider the circular regression model proposed by Downs and Mardia [11], which we refer to as the "DM circular regression model" for the rest of the paper.
Although the first discussion of circular regression goes back to Gould [4], few published works address the identification of outliers in circular regression. Abuzaid et al. [12] and Ibrahim et al. [13] explored the problem for two types of circular regression models by observing the effect of removing one observation on the covariance matrix. Further, Abuzaid et al. [14] proposed a residual measure using a cosine function to detect outliers in a linear circular regression model, where the relationship between the dependent and independent variables is strictly linear (see [10]). In this paper, we propose a new summary measure for detecting outliers in the DM circular regression model, based on a simple measure of circular distance. Because circular variables take values in a closed, bounded range, the effect of the masking problem is expected to be minimal.
With that view in mind, this paper is organized as follows. Firstly, we review the theory of the DM circular regression model. Secondly, we present the proposed statistic for identifying influential observations in the DM circular regression model. Thirdly, we conduct simulation studies to investigate the sampling behavior of the statistic and the performance of the procedure in detecting influential observations. Finally, we apply the procedure to the circadian data given in Downs and Mardia [11].

DM Circular Regression Model
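Assume that (u, v) is a pair of independent and dependent random angles with angular location parameters α and β, respectively, and that ω is a slope parameter in the closed interval [−1, 1]. Downs and Mardia [11] proposed the DM circular regression model given by

$$\tan\tfrac{1}{2}(v - \beta) = \omega \tan\tfrac{1}{2}(u - \alpha). \tag{1}$$

The model ensures a one-to-one relationship between u and v when ω ≠ 0. The relationship can be described by a continuous closed curve winding around a toroidal surface. The model has a unique solution given by

$$v = \beta + 2\tan^{-1}\left\{\omega \tan\tfrac{1}{2}(u - \alpha)\right\}. \tag{2}$$

Suppose that v in Eq (2) is replaced by μ, the mean direction of v given u. The resulting DM circular regression model is given by

$$\tan\tfrac{1}{2}(\mu - \beta) = \omega \tan\tfrac{1}{2}(u - \alpha), \tag{3}$$

which has the unique solution

$$\mu = \beta + 2\tan^{-1}\left\{\omega \tan\tfrac{1}{2}(u - \alpha)\right\}. \tag{4}$$

As can be seen, the model has three functionally independent parameters α, β and ω. It can be shown that the log-likelihood function for a random sample of n pairs (u_j, v_j), j = 1, 2, ..., n, is

$$\log L(\alpha, \beta, \omega, \kappa) = -n\log(2\pi) - n\log I_0(\kappa) + \kappa\sum_{j=1}^{n}\cos\left\{v_j - \beta - v(u_j - \alpha; \omega)\right\}, \tag{5}$$

where κ is the concentration parameter, $I_0(\kappa) = \sum_{j=0}^{\infty}\left((\kappa/2)^j / j!\right)^2$ is the modified Bessel function of the first kind of order zero, and $v(u_j - \alpha; \omega) = 2\tan^{-1}\left\{\omega\tan\tfrac{1}{2}(u_j - \alpha)\right\}$. We may define explicitly the maximum likelihood estimator ρ̂ of the precision parameter ρ by

$$\hat{\rho} = \frac{1}{n}\sum_{j=1}^{n}\cos\left\{v_j - \hat{\beta} - v(u_j - \hat{\alpha}; \hat{\omega})\right\}. \tag{6}$$

Hence, the log-likelihood function of Eq (5) and the maximum likelihood estimator ρ̂ of Eq (6) are changed accordingly.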
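As a minimal sketch of how the quantities in this section could be evaluated numerically, the Python code below computes the DM mean direction β + 2 tan⁻¹{ω tan((u − α)/2)}, the power series for I_0(κ), a von Mises log-likelihood with that mean direction, and the sample precision (the mean cosine of the angular errors). The function names and the truncation of the series at 60 terms are assumptions made for illustration; they do not come from the paper or from any package.

```python
import math

def dm_mean(u, alpha, beta, omega):
    """Mean direction of v given u under the DM model, wrapped to [0, 2*pi)."""
    mu = beta + 2.0 * math.atan(omega * math.tan(0.5 * (u - alpha)))
    return mu % (2.0 * math.pi)

def bessel_i0(kappa, terms=60):
    """I_0(kappa) from the series sum_j ((kappa/2)^j / j!)^2 (truncated)."""
    total, term = 0.0, 1.0          # term holds (kappa/2)^j / j!
    for j in range(terms):
        total += term * term
        term *= (kappa / 2.0) / (j + 1)
    return total

def precision(u, v, alpha, beta, omega):
    """Mean cosine of the angular errors v_j - mu_j."""
    n = len(u)
    return sum(math.cos(vj - dm_mean(uj, alpha, beta, omega))
               for uj, vj in zip(u, v)) / n

def log_likelihood(u, v, alpha, beta, omega, kappa):
    """Von Mises log-likelihood with DM mean direction."""
    n = len(u)
    cos_sum = n * precision(u, v, alpha, beta, omega)
    return (-n * math.log(2.0 * math.pi)
            - n * math.log(bessel_i0(kappa))
            + kappa * cos_sum)
```

For data generated exactly from the model, `precision` evaluated at the true parameters equals 1, its largest possible value, and the log-likelihood decreases when any location parameter is moved away from its true value.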
We employ an iterative method to obtain the estimates of (α, β, ω), say (α̂, β̂, ω̂), which maximize Eq (5). This can be done using the MS function available in the S-Plus software. The function requires initial values α_0, β_0 and ω_0. These can be taken as the values that give the maximum precision parameter ρ̂ in Eq (6) over all triples (α, β, ω) in pre-specified sets. In our case, the following sets of parameter values are considered: α ∈ [−π, π], β ∈ [−π, π] and ω ∈ [−1, 1]. Using those initial values, we then obtain the estimates of the three model parameters iteratively.
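The search for initial values described above might be sketched as follows. The code evaluates the sample precision (the mean cosine of the angular errors) at every triple on a regular grid and keeps the best one. The grid resolution (24 angles and 11 slopes) and the function names are assumptions made for illustration; this is not the S-Plus routine used in the paper, and the subsequent iterative maximization is not shown.

```python
import math

def dm_mean(u, alpha, beta, omega):
    """Fitted direction beta + 2*atan(omega * tan((u - alpha) / 2))."""
    return beta + 2.0 * math.atan(omega * math.tan(0.5 * (u - alpha)))

def precision(u, v, alpha, beta, omega):
    """Sample precision: mean cosine of the angular errors."""
    return sum(math.cos(vj - dm_mean(uj, alpha, beta, omega))
               for uj, vj in zip(u, v)) / len(u)

def grid_initial_values(u, v, n_angle=24, n_omega=11):
    """Return (rho, alpha0, beta0, omega0) with the largest precision on the grid."""
    # Angles cover [-pi, pi); the endpoint pi is the same direction as -pi.
    angles = [-math.pi + 2.0 * math.pi * i / n_angle for i in range(n_angle)]
    omegas = [-1.0 + 2.0 * k / (n_omega - 1) for k in range(n_omega)]
    best = None
    for alpha in angles:
        for beta in angles:
            for omega in omegas:
                rho = precision(u, v, alpha, beta, omega)
                if best is None or rho > best[0]:
                    best = (rho, alpha, beta, omega)
    return best
```

A finer grid gives better starting values at a cost that grows with the product of the three grid sizes, so a coarse grid followed by iterative refinement is the more economical choice.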

Definition of a New Statistic
Upon fitting the DM circular regression model to the bivariate circular data (u_j, v_j), j = 1, 2, ..., n, we obtain the fitted values of v_j, say v̂_j. It is then useful to utilize the fitted values in evaluating the goodness-of-fit of the DM circular regression model in terms of circular errors. One useful measure is the circular distance between two circular observations, say ϕ and θ, as given by Jammalamadaka and SenGupta [15]. It is defined as d(ϕ, θ) = π − |π − |ϕ − θ||, d ∈ [0, π]. In our case, the difference between v_j and v̂_j is then given by

$$d_j = \pi - \left|\pi - |v_j - \hat{v}_j|\right|,$$

which can be treated as a circular error of the model. Downs and Mardia [11, Section 2.3] showed that the angular errors follow a von Mises distribution, denoted VM, with mean direction μ = 0 and concentration parameter κ. In measuring the overall goodness-of-fit of the model, we may define a summary measure of errors called the mean circular error (MCEs) as

$$\mathrm{MCEs} = \frac{1}{n}\sum_{j=1}^{n}\sin\left(\frac{d_j}{2}\right), \tag{7}$$

where n is the sample size and MCEs ∈ [0, 1]. We intend to use a row deletion method to investigate the effect of removing an observation from the data set on the value of MCEs. The effect can be measured by the maximum absolute difference between the values of the statistic for the full and reduced data sets, denoted by DMCEs, such that

$$\mathrm{DMCEs} = \max_{j}\left|\mathrm{MCEs} - \mathrm{MCEs}_{(-j)}\right|, \tag{8}$$

where MCEs and MCEs_(−j) are the values of Eq (7) for the full data set and for the data set with the jth observation removed, respectively. An observation is identified as an outlier if the corresponding value of DMCEs exceeds a pre-specified cut-off point.
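The row deletion procedure can be sketched in Python as below. Two points are assumptions rather than statements from the paper: the trigonometric transform is taken to be the sine of half the circular distance, which maps [0, π] onto [0, 1], and the model is refitted after each deletion. The fitting step is abstracted as a callable `fit(u, v)` that returns a prediction function; `shift_fit` is a deliberately simple hypothetical stand-in (a pure rotation), not the DM model, included only so that the example runs.

```python
import math

TWO_PI = 2.0 * math.pi

def circ_dist(phi, theta):
    """Circular distance pi - |pi - |phi - theta||, after wrapping to [0, 2*pi)."""
    return math.pi - abs(math.pi - abs(phi % TWO_PI - theta % TWO_PI))

def mces(v, v_hat):
    """Mean circular error: average of sin(d_j / 2) (assumed form of Eq (7))."""
    return sum(math.sin(0.5 * circ_dist(a, b)) for a, b in zip(v, v_hat)) / len(v)

def dmces(u, v, fit):
    """Return (DMCEs, index of the deleted row attaining the maximum)."""
    predict = fit(u, v)
    full = mces(v, [predict(x) for x in u])
    diffs = []
    for j in range(len(u)):
        u_red = u[:j] + u[j + 1:]          # drop the j-th observation
        v_red = v[:j] + v[j + 1:]
        predict_j = fit(u_red, v_red)      # refit on the reduced data
        reduced = mces(v_red, [predict_j(x) for x in u_red])
        diffs.append(abs(full - reduced))
    j_max = max(range(len(diffs)), key=diffs.__getitem__)
    return diffs[j_max], j_max

def shift_fit(u, v):
    """Hypothetical toy model v = u + shift, fitted by the mean direction of v - u."""
    s = sum(math.sin(b - a) for a, b in zip(u, v))
    c = sum(math.cos(b - a) for a, b in zip(u, v))
    shift = math.atan2(s, c)
    return lambda x: (x + shift) % TWO_PI
```

In practice `fit` would wrap the maximum likelihood fit of the DM circular regression model, and the returned value would be compared with the cut-off point for the relevant n and κ.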

Sampling Behavior of the DMCEs Statistic
We perform a simulation study to investigate the sampling behavior of the DMCEs statistic. Sets of circular random errors of sizes n = 10, 20, 30, 50, 70, 100 and 150 are generated from a VM distribution with mean direction μ = 0 and concentration parameter κ = 5, 10 and 20. We also generate n values of the independent circular variable u from VM(π/2, 3). The resulting cut-off points are given in Table 1.
In general, for these particular choices of parameter values, the cut-off point decreases as the concentration parameter κ increases, for all n and percentile levels. Similarly, as the sample size increases, the cut-off points decrease for all percentile levels and concentration parameters. The cut-off points may differ for other combinations of parameter values and are available upon request from the authors. Alternatively, the relevant program to obtain the cut-off points can be found at http://cran.r-project.org/.
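A simulation of this kind might be organized as in the following Python sketch, which returns an empirical percentile of DMCEs over repeated samples. Several choices here are assumptions and not taken from the paper: the true parameters (α, β, ω) = (0, 0, 0.5), the number of replicates, the sine of half the circular distance as the error transform, and above all the fitting step, which is a coarse grid over (α, ω) with β obtained in closed form instead of the iterative maximum likelihood routine. The numbers it produces are therefore not expected to reproduce Table 1.

```python
import math
import random

TWO_PI = 2.0 * math.pi

def nu(u, alpha, omega):
    """2 * atan(omega * tan((u - alpha) / 2))."""
    return 2.0 * math.atan(omega * math.tan(0.5 * (u - alpha)))

def fit_dm(u, v, n_alpha=24, n_omega=11):
    """Coarse stand-in for ML fitting: grid over (alpha, omega), best beta in closed form."""
    best = None
    for i in range(n_alpha):
        alpha = -math.pi + TWO_PI * i / n_alpha
        for k in range(n_omega):
            omega = -1.0 + 2.0 * k / (n_omega - 1)
            s = c = 0.0
            for uj, vj in zip(u, v):
                e = vj - nu(uj, alpha, omega)
                s += math.sin(e)
                c += math.cos(e)
            r = math.hypot(s, c)           # n times the precision at the best beta
            if best is None or r > best[0]:
                best = (r, alpha, math.atan2(s, c), omega)
    return best[1:]

def mces(u, v, params):
    """Average of sin(d_j / 2), d_j being the circular distance between v_j and its fit."""
    alpha, beta, omega = params
    total = 0.0
    for uj, vj in zip(u, v):
        diff = (vj - beta - nu(uj, alpha, omega)) % TWO_PI
        total += math.sin(0.5 * (math.pi - abs(math.pi - diff)))
    return total / len(u)

def dmces(u, v):
    """Largest absolute change in MCEs when one row is deleted and the model refitted."""
    full = mces(u, v, fit_dm(u, v))
    worst = 0.0
    for j in range(len(u)):
        ur, vr = u[:j] + u[j + 1:], v[:j] + v[j + 1:]
        worst = max(worst, abs(full - mces(ur, vr, fit_dm(ur, vr))))
    return worst

def simulate_cutoff(n, kappa, level=0.95, reps=200,
                    true_params=(0.0, 0.0, 0.5), seed=1):
    """Empirical `level` percentile of DMCEs for samples of size n."""
    rng = random.Random(seed)
    alpha, beta, omega = true_params       # assumed values, for illustration only
    stats = []
    for _ in range(reps):
        u = [rng.vonmisesvariate(math.pi / 2.0, 3.0) for _ in range(n)]
        v = [(beta + nu(uj, alpha, omega) + rng.vonmisesvariate(0.0, kappa)) % TWO_PI
             for uj in u]
        stats.append(dmces(u, v))
    stats.sort()
    return stats[min(reps - 1, math.ceil(level * reps) - 1)]
```

A call such as `simulate_cutoff(10, 10.0, level=0.95)` gives one cut-off value; repeating it over the grid of n and κ values would fill a table of the same layout as Table 1.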

Power of Performance of the DMCEs Statistic
It is of interest to investigate the performance of the DMCEs statistic via a simulation study. A scheme similar to that used in Section 4 is employed here. We introduce an outlier into the simulated data at position d such that

$$v_d^{*} = v_d + \lambda\pi \pmod{2\pi}, \tag{9}$$

where v*_d is the contaminated observation at position d and λ is the degree of contamination, 0 ≤ λ ≤ 1.
When λ = 0, there is no contamination at position d, whereas when λ = 1, the observation v*_d is located at the anti-mode of its initial location. The generated data are fitted using Eq (2) and consequently we obtain the fitted values v̂. Then, we calculate the value of DMCEs for each simulated data set. The statistic has good power if the proportion of simulated data sets in which the outlier at position d is correctly detected is close to 1. Fig 1 shows the performance of DMCEs for n = 70 and various values of κ. For the larger values of κ the performance is similar, and clearly better than that for small κ. On the other hand, Fig 2 gives the plot of the power of the DMCEs statistic for κ = 10 and various sample sizes. We observe that the power is an increasing function of the sample size n; that is, the DMCEs statistic performs better for larger samples. Similar results are observed for the other cases.
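The contamination step and the power calculation can be illustrated with the short Python sketch below. It assumes that the contaminated value is the original observation rotated by λπ (so that λ = 1 gives the anti-mode), and that a detection counts as correct when the statistic exceeds the cut-off and its maximum occurs at position d. Both function names are hypothetical.

```python
import math

def contaminate(v, d, lam):
    """Return a copy of v with v[d] rotated by lam * pi (mod 2*pi)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    out = list(v)
    out[d] = (out[d] + lam * math.pi) % (2.0 * math.pi)
    return out

def detection_rate(results, d, cutoff):
    """Fraction of replicates in which position d is correctly flagged.

    results: one (dmces_value, index_of_max) pair per simulated data set.
    """
    hits = sum(1 for value, j in results if value > cutoff and j == d)
    return hits / len(results)
```

Plotting `detection_rate` against λ for fixed n and κ gives a power curve of the kind summarized in Fig 1 and Fig 2.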

Background
Here we consider a real data set to illustrate the estimation of the DM circular regression model by the MLE method and the application of the DMCEs statistic, using the circadian data provided by Downs and Mardia [11]. The data were obtained from 10 medical students in Austria. The students were measured several times daily over a period of several weeks. The study period was split into two prime time periods as part of the study, and the peak time of systolic blood pressure (in degrees) was estimated separately for each student and each period, giving the values S1 and S2. The two blood pressure peak times should be equivalent if circumstances are the same in the two periods. The data are given in Table 2. The summary statistics of S1 and S2 are similar, including the concentration parameter, which exceeds two.
In addition, Fig 5 shows the spoke plot of the data. Taking the horizontal axis pointing to the right as 0°, the inner ring shows the observations of S1 and the outer ring those of S2. Each line connects the points on the inner and outer rings that correspond to the observed values of S1 and S2, respectively, for the same student. It can be observed that one line, corresponding to student number 8 on the left-hand side of the plot, lies some distance away from the others.

Parameter Estimation
Using the circadian data set, we calculate the precision parameter over the pre-specified sets as described in Section 2. The resulting plot of ρ̂ versus index is given in Fig 6. The initial value of each parameter corresponds to the highest point observed in the plot, giving α_0 = 18°, β_0 = 9° and ω_0 = 0.70. Using these initial values, the final parameter estimates are obtained by maximizing the log-likelihood function given by Eq (5).

Outlier Detection
We now apply the outlier detection procedure based on the DMCEs statistic to the data. Student number 8 is flagged as a candidate outlier. Because the DMCEs statistic uses the row deletion approach, such an outlier is also known as an influential observation. The data set is of size n = 10 with a maximum likelihood estimate of the concentration parameter of κ̂ = 17.64, giving a cut-off point of 0.07. Upon calculating the statistic for the data, we obtain DMCEs = 0.09, which is greater than the cut-off point, and we conclude that student number 8 is an influential observation. Further, we investigate the effect of this observation on the parameter estimates. After removing student number 8 from the data set, the values of α̂ and β̂ increase by a large amount in degrees, while ω̂ changes from 0.669 to 0.820, as shown in Table 3. Further investigation should then be carried out, as the identification of this outlier might lead to a useful understanding of the data.

Conclusions
In this paper, we consider the problem of detecting outliers in Downs and Mardia's circular regression model based on the DMCEs statistic. The sampling behaviour and the performance of the procedure are investigated via simulation. We illustrate the use of the new procedure on the circadian data set. In future work, we intend to introduce a more robust approach to identifying outliers by extending methods used in the linear case to the circular case.