A simple approximation algorithm for the diameter of a set of points in an Euclidean plane

Approximation algorithms with linear complexities are required in the treatments of big data, however, present algorithms cannot output the diameter of a set of points with arbitrary accuracy and near-linear complexity. By introducing the partition technique, we introduce a very simple approximation algorithm with arbitrary accuracy ε and a complexity of O(N + ε−1 log ε−1) for the cases that all points are located in an Euclidean plane. The error bounds are proved strictly, and are verified by numerical tests. This complexity is better than existing algorithms, and the present algorithm is also very simple to be implemented in applications.


Introduction
Given a finite set of points T in a 2D Euclidean plane R 2 , its diameter, denoted by d T , is defined as the maximum distance between two points of T. Computing the diameter of a point set is a fundamental problem in computer science. It has been proved that in an Euclidean plane, finding the accurate diameter of a set of N points can be reduced to formulating the convex hull of them, with a lower bound of complexity O(N log N) [1][2][3][4].
In the science of big data, this classical problem encounters new challenges. For big data, the number of points N can be huge, and one usually expects linear or sub-linear algorithms to replace the O(N log N) complexity. Clearly, as O(N log N) is the lower bound, they will certainly be approximate algorithms. In the present paper we only consider the algorithms without pre-processing. For these cases, no sub-linear algorithm can guarantee the accuracy of the approximate diameter, as listing the points will require a minimum O(N) complexity. Therefore, if we want to obtain an estimable approximate diameter, a linear complexity should be the lower bound.
As an introduction, we here show an easiest approximate algorithm. Given an arbitrary point p i , this algorithm simply reports its maximum distance to other points, i.e, max i;j2⟦1;N⟧;j6 ¼i T i T j with T i T j the distance between points T i ,T j 2 T, as the approximate diameter d a . It is simply to show that 1 � d T /d a � 2, implying a very low accuracy of the approximation. There exist two references for improving this approximation to higher accuracy. Egecioglu and Kalantari designed an algorithm that in m iterations the reported approximate diameter d m satisfies that 1 � d T =d m � ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 5 À 2 ffi ffi ffi 3 p p � 1:24 [5]. Recently, Alipour et al. improved this algorithm to allow fewer iterations, however, the accuracy of the approximation did not change [6].
Another type of approximation problems allows introducing an arbitrary positive number 0 < ε < 1 aims at outputting an approximate diameter d o in linear time, such that Note that this is equivalent to the description that which is formally consistent to literature. These problems are usually named as (1 + ε)-approximations. In two dimensions, lots of approximation algorithms with various near-linear complexities have been developed. Table 1 gives a comparion of different (1 + ε)-approximation algorithms to compute the diameter of T, in chronological order. However, we remark that most of these algorithms are difficult to be implemented in practices since they require complicated calculations and designs in computational geometry. In this paper, we will introduce a very simple algorithm to approach a near-linear complexity of O(N + ε −1 log ε −1 ) which is much simpler to be implemented in applications.

Approximation algorithm for the diameter in an Euclidean plane
For the finite set of points T in R 2 , we choose a point O 2 T arbitrarily as the origin, and then divide the plane into 6n same regions with n 2 Z � þ . In each region S i (i = 1, . . ., 6n), we can find a farthest point from the origin, and let r i denote the distance between the farthest point of S i and the origin. By using the origin O as the the center of a circle and r i as the radius, we can obtain 6n sector regions, as Fig 1. We remark that the number of the regions, 6n, will allow in the following parts to provide analytical error estimation. Let p i (i = 1, . . ., 6n) be the midpoint of the arc of each sector region, and compute the largest distance d p of these 6n midpoints p i s. Then we propose the following main theorem of this paper which shows the relationship between the diameter d T of the point set T and the largest distance d p . Here we note that the virtual points p i s can be different with the real points in T. 1 ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi In the following we will give the proof of Theorem 1 in two parts: to prove the upper and lower bounds of d T d p respectively.

Lower bound of d T d p
Without loss of generality, suppose that an endpoint of the line segment d p is in the region S, and then we denote the opposite angle region by S 0 and denote the other regions clockwise by S 1 , . . ., S 6n−1 . Note that in this way the region S is exactly the region S 3n . Let the line passing through the origin O and the midpoint of the arc of the region S be the x-axis, then we can set up the Cartesian coordinate system in the plane, as Fig 2. The coordinate of the midpoint p i of the arc in each region S i is À r i cos ip 3n ; r i sin ip 3n À � , where i = 1, . . ., 6n − 1, and r i is the radius of the sector region S i . Before giving the proof on the lower bound of d T d p , we bring out the following lemmas. Lemma 1. If an endpoint of the line segment d p is in the region S (i.e., S 3n ) as we supposed above, then the other endpoint of d p cannot be obtained in the regions S 2n+1 , . . ., S 3n−1 .
Proof. Denote R = max(r 0 , . . ., r 6n−1 ), then the relationship R � d p is obvious. For point , and the point p 3n (−r, 0) in the region S, we can compute the distance between these two points: Obviously, p i p 3n When we can obtain the maximum of g(r i ): g(r i ) max = g(0) or g(R), and g(0) = g(R) = R 2 . Thus g(r i ) � R 2 . Therefore, p i p 3n 2 < r 2 þ r 2 i À rr i � R 2 , and the equality arrives when (r = R, r i = 0) or (r = R, r i = R) or (r i = R, r = 0). In this way, we can prove that p i p 3n 2 < R 2 � d 2 p . That means the distance p i p 3n ði ¼ 2n þ 1; . . . 3n À 1Þ, is always less than d p .
According to the symmetry of the regions, the following lemma can be easily obtained. Lemma 2. If an endpoint of the line segment d p is in the region S (i.e., S 3n ), then the other endpoint of d p cannot be obtained in the regions S 3n+1 , . . ., S 4n−1 .
Moreover, the cases for an endpoint of d p in the regions S 0 , S 6n−1 , . . ., S 4n are equivalent to those in the regions S 0 , S 1 , . . ., S 2n . Therefore, if we suppose that an endpoint of the line segment d p is in the region S, then from Lemma 1 and 2, we only need to consider the 2n + 1 cases where the other endpoint of d p is in the regions S 0 , . . ., S 2n . In what follows we will consider two cases in order to compute the lower bound of d T d p . Case I: i 2 ⟦0, 2n − 1⟧ As we supposed above, an endpoint of d p is in the region S, then there certainly exists a point q 1 on the arc of the region S. The coordinate of q 1 is (r cos θ, r sin θ), where y 2 À p 6n ; p 6n � � and r is the radius of the sector region S. If the other endpoint of d p is in the region S i (i = 0, 1, . . ., 2n − 1), then there certainly exists a point q 2 on the arc of the region S i and the coordi- and r i is the radius of the sector region S i .
The distance between the two points q 1 and q 2 can be computed as where From the definition of d T , we know that ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi Since d p ¼ p i p 3n ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi q , the following relationship can be obtained.

d T d p �
ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi From ( Since the denominator of f 0 (x) is always greater than 0, we only consider the sign of its numerator. Let g(x) be the numerator of f 0 (x) divided by π/(3n).
All this leads up to the following inequality: Case II: i = 2n In this case d p ¼ p 2n p 3n ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi r 2 þ r 2 i À r i r p , and from the proof of Lemma 1 we know that ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi r 2 þ r 2 i À r i r

Upper bound of d T d p
In this subsection, we will prove the upper bound of d T d p . Similar to the approach for the proof of the lower bound, supposing that an endpoint of the line segment d T is in the region S, and then we only need to consider the cases that the other endpoint of d T is in the region S i for i 2 ⟦0, 2n⟧.
Case I: i 2 ⟦1, 2n⟧ As we supposed above, an endpoint of d T is in the region S and the other endpoint is in the region S i , which denoted by m 1 and m 2 respectively. The coordinates of m 1 and m 2 are (Àr cos y;r sin y) and (Àr i cos y i ;r i sin y i ) respectively, wherer 2 ½0; r�,r i 2 ½0; r i �, y 2 À Let hðrÞ ¼r 2 þr 2 i þ 2arr i , where a ¼ cos ðiÀ 1Þp 3n 2 À 1 2 ; 1 À � .
1. In the case a ¼ cos ðiÀ 1Þp 3n 2 ½0; 1�, hðrÞ ¼r 2 þr 2 i þ 2arr i � r 2 þ r 2 i þ 2arr i . And since d p � p i p 3n ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi r 2 þ r 2 i þ 2rr i cos ðiÀ 1Þp 3n q ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 2. In the case a ¼ cos ðiÀ 1Þp 3n 2 À 1 2 ; 0 À � , we can compute that h 0 ðrÞ ¼ 2ðr þ ar i Þ.
a. Ifr i � 2r, then h 0 ðrÞ � 0 whenr 2 ½0; r�. And we can get the maximum of hðrÞ: Thus d T ¼ m 1 m 2 � R, and as we mentioned above, m 1 m 2 � R, those lead that In this moment, ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi Therefore we have ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi r 2 þ r 2 i þ 2rr i cos ðiÀ 1Þp 3n q ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi b. If 0 �r i < 2r, then h 0 ðrÞ � 0 whenr 2 ½0; À ar i � and h 0 ðrÞ � 0 whenr 2 ½À ar i ; r�. In this case, hðrÞ max ¼ hð0Þ or h(r).
If hðrÞ max ¼ hð0Þ, same as the case (a), In this way, we have proved the Theorem 1, and the relationship in this theorem can be also written as 1 ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi This theorem therefore provides a fast approximation for the diameter of the point set T.

Remarks
We remark that the complexities of visiting all N points in set T, calculating their polar coordinates and renewing the values of r i are all linearly O(N). The complexity of calculating the diameter of 6n virtual points is O(n log n) as introduced in section 1. Indeed, even if we compare all pairs among these points via brute force, the complexity will be at most O(n 2 ), which will be negligible by comparing to O(N) if N is huge. Therefore, we conclude that our approximation algorithm has a complexity of O(N + n log n), which is deterministically linear complexity when N � n.
In addition, recalling the problem descriptions (1) and (2), we can also formulate the outputted diameter and calculate the necessary number of regions n for an arbitrary accuracy 0 < ε < 1. It is easy to verify that when n > 2p 2p À 3 cos À 1 ð À 1þ4εþ2ε 2 2ðεÀ 1Þ 2 ÞÞ Eqs (1) and (2) are both satisfied. For small ε Taylor expansion shows that which indicates that and the complexity writes We therefore remark that all values of d p , d o and d � o can be used as approximate diameters, depending on the accuracy interval one requires.

Numerical tests
As illustrated in the previous section, expressions (3), (1) and (2) describe the error bounds of d p , d o and d � o respectively. However, in practice these upper and lower bounds correspond to the worst cases, while for most situations the error will be even smaller. In this section we show by three different point sets this error distribution, respectively. In the first point set T (1) , the Cartesian coordinate of each point is (x, y), where x and y are independent random variables uniformly distributed in [0, 1), leading to a diameter close to the diagonal; in the second point set T (2) , the polar coordinate of each point is (r, θ), where θ is a random variable homogeneously distributed in [0, π), and r is a random variable with Gaussian distribution N(0, 1) [11]; the third point set T (3) is chosen from a real database on the positions of fluid particles [12]. Both T (1) and T (2) have 1 × 10 6 discrete points, while T (3) have 4 × 10 5 discrete points. We simply use O(n 2 ) brute force method to calculate the diameter of the 6n virtual points. Indeed, this does not yield any inconvenience in the calculation, while the computational time of the case n = 100 is only 1.01 times of that of the case n = 1. Therefore we can conclude that the calculations are of near-linear complexity.
Without loss of generality, here we use our algorithm to output the values of d p , and compare them with the theoretical error ranges (3), as shown in Fig 3. We randomly select 100 different origin points for each value of n respectively. Clearly, for all cases, most calculated diameters d p are quite close to the real value d T , which are even quite better than the theoretically worst bounds (shown as dash-dotted lines in Fig 3). These results then show the effectiveness of the present algorithm.
We also present the CPU time in Fig 4. Points are generated similarly to the T (1) case, i.e., point coordinates are independent random variables uniformly distributed in [0, 1). The partition parameter n is fixed as 2, 12, 100 and 300 respectively. Calculations are performed via single thread at Intel Core i5-6200U CPU 2.30GHz, interpreted by Python 2.5.1 in the IDLE Approximation for the diameter of a set of points software. Fig 4 shows that the CPU time is linear to the value of N, illustrating that the present algorithm is of nearly linear complexity. In addition, although no optimization is implemented to accelerate the calculations, the real performance is acceptable since calculating the approximate diameter of 2 × 10 6 points with n = 100 (corresponding to relative error ε = 9 × 10 −7 ) only costs about 3 seconds. These evidences suggest the implementation of the present algorithm in real applications.

Conclusion
As a fundamental problem of big data, linear approximation algorithms for the diameter of a set of points will be potentially useful. By introducing the partition technique, we introduce an approximation algorithm with arbitrary accuracy and deterministically linear complexity. The implementation of this algorithm is very simple and does not require any complicated data structure. Note that the lower bound of the proposed algorithm is O(N + n log n) with n of the order of ε −1 , while a brute force visiting algorithm for virtual points will increase this to O(N + n 2 ). In practice n will be much smaller than N, therefore O(n 2 ) will be negligible by comparing to O(N). In addition, increasing the number of partition n does not increase any multiple coefficient to O(N), which indicates the robustness of the near-linear complexity of our algorithm. Comparing to existing approximation algorithms, the present algorithm shows a lowest complexity O(N + ε −1 log ε −1 ). Also, another advantage of the present algorithm is that it is very simple to be implemented, which does not require any complicated data structure or geometry calculation. The present contribution is a preliminary attempt in 2D plane. For higher dimensional cases, this method might also be extended, but a division of hyper-sphere [13][14][15] will be required. In those situations, other partition schemes will be more efficient. For example, one may use high-dimensional Cartesian coordinates instead of the division of hyper-sphere. The related accuracy will also be more complicated, and is expected to be investigated in our future work.