A simple approximation algorithm for the diameter of a set of points in an Euclidean plane

  • Jieying Hong,

    Roles Conceptualization, Investigation, Methodology, Project administration

    Affiliation ESSEC Asia-Pacific, ESSEC Business School, Singapore, Singapore

  • Zhipeng Wang,

    Roles Formal analysis, Validation

    Affiliation École Centrale Pékin, Beihang University, Beijing, China

  • Wei Niu

    Roles Conceptualization, Data curation, Formal analysis, Writing – original draft

    wei.niu@buaa.edu.cn

    Affiliations École Centrale Pékin, Beihang University, Beijing, China, Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China

Abstract

Approximation algorithms with linear complexity are required for processing big data. However, existing algorithms cannot output the diameter of a set of points with both arbitrary accuracy and near-linear complexity. Using a partition technique, we introduce a very simple approximation algorithm with arbitrary accuracy ε and a complexity of O(N + ε⁻¹ log ε⁻¹) for the case in which all points are located in a Euclidean plane. The error bounds are rigorously proved and are verified by numerical tests. This complexity is better than that of existing algorithms, and the present algorithm is also very simple to implement in applications.

Introduction

Given a finite set of points T in a 2D Euclidean plane ℝ², its diameter, denoted by dT, is defined as the maximum distance between two points of T. Computing the diameter of a point set is a fundamental problem in computer science. It has been proved that, in a Euclidean plane, finding the exact diameter of a set of N points can be reduced to constructing their convex hull, with a complexity lower bound of O(N log N) [1–4].

In the science of big data, this classical problem encounters new challenges. For big data, the number of points N can be huge, and one usually expects linear or sub-linear algorithms to replace the O(N log N) complexity. Since O(N log N) is the lower bound for the exact problem, such algorithms are necessarily approximate. In the present paper we only consider algorithms without pre-processing. In this setting, no sub-linear algorithm can guarantee the accuracy of the approximate diameter, as merely visiting all the points already requires O(N) operations. Therefore, if we want an approximate diameter with a guaranteed error bound, linear complexity is the lower bound.

As an introduction, we first show the simplest approximation algorithm. Given an arbitrary point Ti ∈ T, this algorithm simply reports its maximum distance to the other points, i.e., da = max_j |TiTj|, with |TiTj| the distance between points Ti, Tj ∈ T, as the approximate diameter da. It is simple to show that 1 ≤ dT/da ≤ 2, implying a very low accuracy of the approximation.
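The single-pass estimate described above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper; the function name `two_approx_diameter` and its `index` argument are hypothetical. The bound 1 ≤ dT/da ≤ 2 follows from the triangle inequality: if A and B are the endpoints of the true diameter, then |AB| ≤ |ATi| + |TiB| ≤ 2da.

```python
import math


def two_approx_diameter(points, index=0):
    """Return da, the largest distance from points[index] to any point of the set.

    `points` is a sequence of (x, y) tuples; `index` selects the arbitrary
    reference point Ti (a hypothetical parameter name).
    """
    px, py = points[index]
    # One pass over the N points, hence O(N) operations.
    return max(math.hypot(x - px, y - py) for x, y in points)
```

Calling `two_approx_diameter(points)` on any non-empty list of coordinate pairs returns a value between dT/2 and dT, whichever reference point is chosen.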

Two references improve this approximation to higher accuracy. Eğecioğlu and Kalantari designed an algorithm such that in m iterations the reported approximate diameter dm satisfies that [5]. Recently, Alipour et al. improved this algorithm to allow fewer iterations; however, the accuracy of the approximation did not change [6].

Another type of approximation problem introduces an arbitrary positive number 0 < ε < 1 and aims at outputting an approximate diameter do in linear time, such that (1) Note that this is equivalent to the description that , indicating that if Eq (1) is satisfied, we can output instead of to satisfy that (2) which is formally consistent with the literature. These problems are usually called (1 + ε)-approximations. In two dimensions, many approximation algorithms with various near-linear complexities have been developed. Table 1 gives a comparison of different (1 + ε)-approximation algorithms to compute the diameter of T, in chronological order. However, we remark that most of these algorithms are difficult to implement in practice, since they require complicated calculations and constructions from computational geometry. In this paper, we introduce a very simple algorithm that achieves a near-linear complexity of O(N + ε⁻¹ log ε⁻¹) and is much simpler to implement in applications.

Table 1. Comparison of different (1 + ε)-approximation algorithms to compute the diameter of T.

https://doi.org/10.1371/journal.pone.0211201.t001

Approximation algorithm for the diameter in an Euclidean plane

For the finite set of points T in ℝ², we choose a point O ∈ T arbitrarily as the origin, and then divide the plane into 6n equal angular regions, each with central angle π/(3n). In each region Si (i = 1, …, 6n), we find the point farthest from the origin, and let ri denote the distance between this farthest point of Si and the origin. Using the origin O as the center of a circle and ri as the radius, we obtain 6n sector regions, as shown in Fig 1. We remark that the number of regions, 6n, is chosen so that the analytical error estimation in the following parts is possible. Let pi (i = 1, …, 6n) be the midpoint of the arc of each sector region, and compute the largest distance dp among these 6n midpoints pis. We then propose the main theorem of this paper, which gives the relationship between the diameter dT of the point set T and the largest distance dp. Note that the virtual points pis can differ from the real points in T.

Fig 1. Diagram of the partition for the point set.

Solid points are real points in the set T, while empty points are virtual points in the set of pis.

https://doi.org/10.1371/journal.pone.0211201.g001
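The partition step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `approx_diameter_dp` is hypothetical, the sectors are taken as [kδ, (k + 1)δ) with δ = π/(3n) measured from the positive x-axis (the text does not fix an orientation), an empty sector is assigned ri = 0 so that its virtual point coincides with the origin, and the virtual points are compared by brute force.

```python
import math
from itertools import combinations


def approx_diameter_dp(points, n, origin=None):
    """Return dp, the largest distance among the 6n virtual arc midpoints.

    Assumptions (not fixed by the text): sector k covers the angles
    [k*delta, (k+1)*delta) with delta = pi/(3n); an empty sector gets radius 0.
    """
    if origin is None:
        origin = points[0]          # the origin O is taken from T itself
    ox, oy = origin
    m = 6 * n
    delta = 2.0 * math.pi / m       # central angle pi/(3n)
    radii = [0.0] * m

    # Single O(N) pass: keep the largest radius seen in each sector.
    for x, y in points:
        dx, dy = x - ox, y - oy
        theta = math.atan2(dy, dx) % (2.0 * math.pi)
        k = int(theta / delta) % m  # the modulo guards against rounding at 2*pi
        r = math.hypot(dx, dy)
        if r > radii[k]:
            radii[k] = r

    # Virtual points: midpoint of each sector's arc, relative to the origin.
    virtual = [
        (r * math.cos((k + 0.5) * delta), r * math.sin((k + 0.5) * delta))
        for k, r in enumerate(radii)
    ]

    # Brute-force O(n^2) comparison of the 6n virtual points.
    return max(
        math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in combinations(virtual, 2)
    )
```

The sketch stores only the 6n radii, so memory use is independent of N. Any other orientation of the sectors would serve equally well, since the choice only rotates the partition.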

Theorem 1. Let dT be the diameter of a finite set of points T in ℝ², and let dp be the largest distance among the 6n virtual midpoints pis as defined above. Then the following statement holds: (3)

In the following we give the proof of Theorem 1 in two parts, proving the lower and upper bounds of dT/dp respectively.

Lower bound of dT/dp

Without loss of generality, suppose that an endpoint of the line segment dp is in the region S; we then denote the opposite region by S0 and the other regions clockwise by S1, …, S6n−1. Note that in this way the region S is exactly the region S3n. Let the line passing through the origin O and the midpoint of the arc of the region S be the x-axis; we can then set up the Cartesian coordinate system in the plane, as shown in Fig 2. The coordinate of the midpoint pi of the arc in each region Si is , where i = 1, …, 6n − 1, and ri is the radius of the sector region Si. Before giving the proof of the lower bound of dT/dp, we state the following lemmas.

Fig 2. Diagram of the 6n regions for the point set and the Cartesian coordinate system.

https://doi.org/10.1371/journal.pone.0211201.g002

Lemma 1. If an endpoint of the line segment dp is in the region S (i.e., S3n) as we supposed above, then the other endpoint of dp cannot be obtained in the regions S2n+1, …, S3n−1.

Proof. Denote R = max(r0, …, r6n−1); then the relationship R ≤ dp is obvious. For the point pi in one of the regions S2n+1, …, S3n−1 (i.e., i ∈ ⟦2n + 1, 3n − 1⟧), and the point p3n(−r, 0) in the region S, we can compute the distance between these two points: (4) Obviously, .

Let f(r) = r² − r·ri + ri²; then f′(r) = 2r − ri is easily obtained. Since r, ri ∈ [0, R], we have f′(r) < 0 when r ∈ [0, ri/2), and f′(r) > 0 when r ∈ (ri/2, R]. Thus f(r)max = f(0) or f(R).

When r = 0, f(0) = ri² ≤ R²; and when r = R, f(R) = R² − R·ri + ri². Let g(ri) = R² − R·ri + ri²; since ri ∈ [0, R], the maximum of g(ri) is g(ri)max = g(0) or g(R), and g(0) = g(R) = R². Thus g(ri) ≤ R².

Therefore, f(r) ≤ R², and equality holds when (r = R, ri = 0) or (r = R, ri = R) or (ri = R, r = 0). In this way, we can prove that . That means the distance , is always less than dp.

According to the symmetry of the regions, the following lemma can be easily obtained.

Lemma 2. If an endpoint of the line segment dp is in the region S (i.e., S3n), then the other endpoint of dp cannot be obtained in the regions S3n+1, …, S4n−1.

Moreover, the cases for an endpoint of dp in the regions S0, S6n−1, …, S4n are equivalent to those in the regions S0, S1, …, S2n. Therefore, if we suppose that an endpoint of the line segment dp is in the region S, then from Lemmas 1 and 2 we only need to consider the 2n + 1 cases where the other endpoint of dp is in the regions S0, …, S2n. In what follows we consider two cases in order to compute the lower bound of dT/dp.

Case I: i ∈ ⟦0, 2n − 1⟧

As we supposed above, an endpoint of dp is in the region S, then there certainly exists a point q1 on the arc of the region S. The coordinate of q1 is (r cos θ, r sin θ), where and r is the radius of the sector region S. If the other endpoint of dp is in the region Si (i = 0, 1, …, 2n − 1), then there certainly exists a point q2 on the arc of the region Si and the coordinate of q2 is (−ri cos θi, ri sin θi), where and ri is the radius of the sector region Si.

The distance between the two points q1 and q2 can be computed as (5) where , for i = 0, 1, …, 2n − 1. Since cos x is monotone decreasing in , we have . Thus (6) From the definition of dT, we know that (7)

Since , the following relationship can be obtained. (8)

From (8), we have (9) When x ∈ [0, 2n − 1], the function is monotone decreasing, thus . In addition, , Then we can obtain that (10)

Let and x ∈ [0, 2n − 1]. We consider the monotonicity of f(x) and compute its derivative for this purpose. (11) Since the denominator of f′(x) is always greater than 0, we only consider the sign of its numerator. Let g(x) be the numerator of f′(x) divided by π/(3n). (12)

The derivative of g(x) is . In the case x ∈ [0, 2n − 1], it is obvious that g′(x) > 0, and we obtain g(0) < 0 and g(2n − 1) < 0 by computation. Thus for any x ∈ [0, 2n − 1], g(x) < 0 and also f′(x) < 0. Therefore f(x) is monotone decreasing in the interval [0, 2n − 1].

All this leads up to the following inequality: (13)

Case II: i = 2n

In this case , and from the proof of Lemma 1 we know that . Moreover, since dT ≥ R, we have (14)

Combining the two cases, we obtain the lower bound of dT/dp: (15)

Upper bound of dT/dp

In this subsection, we prove the upper bound of dT/dp. Similarly to the proof of the lower bound, we suppose that an endpoint of the line segment dT is in the region S, and then we only need to consider the cases in which the other endpoint of dT is in the region Si for i ∈ ⟦0, 2n⟧.

Case I: i ∈ ⟦1, 2n⟧

As we supposed above, an endpoint of dT is in the region S and the other endpoint is in the region Si; they are denoted by m1 and m2 respectively. The coordinates of m1 and m2 are () and () respectively, where , , and . The distance between the two points , is exactly dT and thus .

Furthermore, (16) where , for i = 1, …, 2n. Since cos x is monotone decreasing in [0, π], we have . Thus (17)

Let , where .

  1. In the case , . And since , we know that (18)
  2. In the case , we can compute that .
    1. If , then when . And we can get the maximum of : . Thus , and as we mentioned above, , those lead that (19) In this moment, (20) Therefore we have (21)
    2. If , then when and when . In this case, or h(r).
      If , same as the case (a), (22) If , then let , where . The derivative of is . If , then , and . In this case, , and then similar to the case (a), (23) If , then , and . Here we have , and thus (24)

Since , we can summarize all the cases in Case I and get (25) Then, by using an approach similar to that used for the lower bound of dT/dp, where i ∈ ⟦1, 2n⟧, we can deduce the following inequality (26)

Case II: i = 0

In this case, . Moreover, (27)

Therefore, (28)

From Cases I and II, the upper bound of dT/dp can be concluded: (29)

In this way, we have proved Theorem 1, and the relationship in this theorem can also be written as (30) This theorem therefore provides a fast approximation for the diameter of the point set T.

Remarks

We remark that visiting all N points in the set T, calculating their polar coordinates and updating the values of ri all have linear complexity O(N). The complexity of calculating the diameter of the 6n virtual points is O(n log n), as noted in the Introduction. Indeed, even if we compare all pairs among these points via brute force, the complexity will be at most O(n²), which is negligible compared to O(N) if N is huge. Therefore, we conclude that our approximation algorithm has a complexity of O(N + n log n), which is deterministically linear when N ≫ n.
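The O(n log n) step for the virtual points can be illustrated with a standard convex-hull approach. The sketch below is an assumption about how this step might be realized, not the method used in the paper's tests (which use brute force): it builds the hull with Andrew's monotone chain and then scans antipodal pairs with rotating calipers. The names `cross`, `convex_hull` and `hull_diameter` are hypothetical.

```python
import math


def cross(o, a, b):
    """Twice the signed area of triangle (o, a, b)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_diameter(points):
    """Exact diameter of a planar point set in O(n log n) time."""
    hull = convex_hull(points)
    m = len(hull)
    if m < 2:
        return 0.0
    if m == 2:
        return math.hypot(hull[0][0] - hull[1][0], hull[0][1] - hull[1][1])
    best = 0.0
    j = 1
    for i in range(m):
        a, b = hull[i], hull[(i + 1) % m]
        # Advance j while the next vertex is farther from the edge (a, b).
        while cross(a, b, hull[(j + 1) % m]) > cross(a, b, hull[j]):
            j = (j + 1) % m
        c = hull[j]
        best = max(
            best,
            math.hypot(a[0] - c[0], a[1] - c[1]),
            math.hypot(b[0] - c[0], b[1] - c[1]),
        )
    return best
```

Passing the 6n virtual points to `hull_diameter` gives the same dp as the brute-force comparison. For the small values of n used in practice, the brute-force loop is simpler and its cost is already negligible next to the O(N) pass.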

In addition, recalling the problem descriptions (1) and (2), we can also formulate the outputted diameter and calculate the necessary number of regions n for an arbitrary accuracy 0 < ε < 1. It is easy to verify that when (31) Eqs (1) and (2) are both satisfied. For small ε, a Taylor expansion shows that (32) which indicates that (33) and the complexity reads (34)

We therefore remark that all values of dp, do and can be used as approximate diameters, depending on the accuracy interval one requires.

Numerical tests

As illustrated in the previous section, expressions (3), (1) and (2) describe the error bounds of dp, do and respectively. However, these upper and lower bounds correspond to the worst cases, while in most practical situations the error is even smaller. In this section we show this error distribution for three different point sets. In the first point set T(1), the Cartesian coordinates of each point are (x, y), where x and y are independent random variables uniformly distributed in [0, 1), leading to a diameter close to the diagonal; in the second point set T(2), the polar coordinates of each point are (r, θ), where θ is a random variable uniformly distributed in [0, π), and r is a random variable with Gaussian distribution N(0, 1) [11]; the third point set T(3) is chosen from a real database on the positions of fluid particles [12]. Both T(1) and T(2) have 1 × 10⁶ discrete points, while T(3) has 4 × 10⁵ discrete points. We simply use the O(n²) brute force method to calculate the diameter of the 6n virtual points. This does not cause any inconvenience in the calculation: the computational time of the case n = 100 is only 1.01 times that of the case n = 1. Therefore we can conclude that the calculations are of near-linear complexity.

Without loss of generality, here we use our algorithm to output the values of dp, and compare them with the theoretical error ranges (3), as shown in Fig 3. We randomly select 100 different origin points for each value of n. Clearly, in all cases, most calculated diameters dp are quite close to the real value dT, and are considerably better than the theoretical worst-case bounds (shown as dash-dotted lines in Fig 3). These results show the effectiveness of the present algorithm.

Fig 3. Numerical results of dT/dp.

(a) T(1) case; (b) T(2) case; (c) T(3) case. The theoretical error bounds are shown as dash-dotted lines.

https://doi.org/10.1371/journal.pone.0211201.g003
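A small-scale version of this experiment can be sketched as follows. It is an illustrative harness under several assumptions, not the authors' test code: the point sets are far smaller than 10⁶ so that the exact diameter can be found by brute force, origins are drawn from T itself, empty sectors are given radius 0, and all function names (`sector_dp`, `make_t1`, `make_t2`, `ratio_range`) are hypothetical. The compact `sector_dp` restates the partition step only so that the block runs on its own.

```python
import math
import random


def sector_dp(points, n, origin):
    """Compact sector estimate dp (assumed conventions: sectors start at angle 0,
    empty sectors have radius 0, virtual points compared by brute force)."""
    m = 6 * n
    delta = 2.0 * math.pi / m
    radii = [0.0] * m
    ox, oy = origin
    for x, y in points:
        dx, dy = x - ox, y - oy
        k = int((math.atan2(dy, dx) % (2.0 * math.pi)) / delta) % m
        radii[k] = max(radii[k], math.hypot(dx, dy))
    mids = [(r * math.cos((k + 0.5) * delta), r * math.sin((k + 0.5) * delta))
            for k, r in enumerate(radii)]
    return max(math.hypot(p[0] - q[0], p[1] - q[1]) for p in mids for q in mids)


def exact_diameter(points):
    """Brute-force dT; only feasible for small test sets."""
    return max(math.hypot(a[0] - b[0], a[1] - b[1]) for a in points for b in points)


def make_t1(size, rng):
    """T(1)-like set: x and y uniform in [0, 1)."""
    return [(rng.random(), rng.random()) for _ in range(size)]


def make_t2(size, rng):
    """T(2)-like set: theta uniform in [0, pi), r drawn from N(0, 1)."""
    pts = []
    for _ in range(size):
        theta = rng.random() * math.pi
        r = rng.gauss(0.0, 1.0)
        pts.append((r * math.cos(theta), r * math.sin(theta)))
    return pts


def ratio_range(points, n, trials, rng):
    """Smallest and largest dT/dp over `trials` random origins taken from T."""
    d_true = exact_diameter(points)
    ratios = [d_true / sector_dp(points, n, rng.choice(points))
              for _ in range(trials)]
    return min(ratios), max(ratios)
```

For example, `ratio_range(make_t1(500, random.Random(0)), 10, 100, random.Random(1))` returns the spread of dT/dp for n = 10 over 100 random origins, which is the quantity plotted in Fig 3.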

We also present the CPU time in Fig 4. Points are generated similarly to the T(1) case, i.e., point coordinates are independent random variables uniformly distributed in [0, 1). The partition parameter n is fixed as 2, 12, 100 and 300 respectively. Calculations are performed in a single thread on an Intel Core i5-6200U CPU at 2.30 GHz, interpreted by Python 2.5.1 in the IDLE software. Fig 4 shows that the CPU time is linear in N, illustrating that the present algorithm is of nearly linear complexity. In addition, although no optimization is implemented to accelerate the calculations, the real performance is acceptable, since calculating the approximate diameter of 2 × 10⁶ points with n = 100 (corresponding to relative error ε = 9 × 10⁻⁷) only costs about 3 seconds. This evidence supports the use of the present algorithm in real applications.

Fig 4. CPU time with different N values.

Points are generated similarly to the T(1) case. n = 2, 12, 100 and 300 respectively.

https://doi.org/10.1371/journal.pone.0211201.g004

Conclusion

As a fundamental problem of big data, linear approximation algorithms for the diameter of a set of points are potentially useful. Using a partition technique, we introduce an approximation algorithm with arbitrary accuracy and deterministically linear complexity. The implementation of this algorithm is very simple and does not require any complicated data structure. Note that the complexity of the proposed algorithm is O(N + n log n) with n of the order of ε⁻¹, while a brute force comparison of the virtual points increases this to O(N + n²). In practice n will be much smaller than N, and therefore O(n²) will be negligible compared to O(N). In addition, increasing the number of partitions n does not increase the multiplicative coefficient of the O(N) term, which indicates the robustness of the near-linear complexity of our algorithm. Compared to existing approximation algorithms, the present algorithm has the lowest complexity, O(N + ε⁻¹ log ε⁻¹). Another advantage of the present algorithm is that it is very simple to implement, requiring no complicated geometric calculation.

The present contribution is a preliminary attempt in the 2D plane. This method might also be extended to higher-dimensional cases, but a division of the hyper-sphere [13–15] will be required. In those situations, other partition schemes may be more efficient. For example, one may use high-dimensional Cartesian coordinates instead of the division of the hyper-sphere. The related accuracy analysis will also be more complicated, and is expected to be investigated in our future work.

Acknowledgments

This work has been supported by the NSFC project 11601023.

References

  1. Yao ACC. On constructing minimum spanning trees in k-dimensional spaces and related problems. SIAM J Comput. 1982;11:721–736.
  2. Preparata FP, Shamos MI. Computational geometry: an introduction; 1985.
  3. Bentley JL, Preparata FP, Faust GM. Approximation algorithms for convex hulls. Comm ACM. 1982;25:64–68.
  4. Malandain G, Boissonnat JD. Computing the Diameter of a Point Set. Int J Comput Geom Appl. 2002;12(06):489–509.
  5. Eğecioğlu O, Kalantari B. Approximating the diameter of a set of points in the Euclidean space. Inform Process Lett. 1989;32(4):205–211.
  6. Alipour S, Kalantari B, Homapour H. Fast approximation and randomized algorithms for diameter. arXiv preprint arXiv:1410.2195. 2014.
  7. Agarwal PK, Matoušek J, Suri S. Farthest neighbors, maximum spanning trees and related problems in higher dimensions. Comput Geom. 1992;1(4):189–201.
  8. Barequet G, Har-peled S. Efficiently approximating the minimum-volume bounding box of a point set in three dimensions. J Algorithms. 1999;38:82–91.
  9. Chan TM. Approximating the diameter, width, smallest enclosing cylinder, and minimum-width annulus. In: Proceedings of the sixteenth annual symposium on Computational geometry. ACM; 2000. p. 300–309.
  10. Arya S, Chan TM. Better ε-Dependencies for Offline Approximate Nearest Neighbor Search, Euclidean Minimum Spanning Trees, and ε-Kernels. In: Proceedings of the thirtieth annual symposium on Computational geometry. ACM; 2014. p. 416.
  11. Knuth D. The Art of Computer Programming 2: Seminumerical Algorithms. MA: Addison-Wesley; 1968.
  12. Fang L, Bos WJT, Jin GD. Short-time evolution of Lagrangian velocity gradient correlations in isotropic turbulence. Phys Fluids. 2015;27:125102.
  13. Cooper PW. The hypersphere in pattern recognition. Inf Control. 1962;5(4):324–346.
  14. Katsuki S, Frangopol DM. Hyperspace Division Method for Structural Reliability. J Eng Mech. 1994;120(11):2405–2427.
  15. Sato K, Yamaji A. Uniform distribution of points on a hypersphere for improving the resolution of stress tensor inversion. J Struct Geol. 2006;28(6):972–979.