Evaluation of the Gini Coefficient in Spatial Scan Statistics for Detecting Irregularly Shaped Clusters

  • Jiyu Kim,

    Affiliation Department of Biostatistics and Medical Informatics, Yonsei University College of Medicine, Seoul, Korea

  • Inkyung Jung

    ijung@yuhs.ac

    Affiliation Department of Biostatistics and Medical Informatics, Yonsei University College of Medicine, Seoul, Korea

Abstract

Spatial scan statistics with circular or elliptic scanning windows are commonly used for cluster detection in various applications, such as the identification of geographical disease clusters from epidemiological data. It has been pointed out that the method may have difficulty in correctly identifying non-compact, arbitrarily shaped clusters. In this paper, we evaluated the Gini coefficient for detecting irregularly shaped clusters through a simulation study. The Gini coefficient, the use of which in spatial scan statistics was recently proposed, is a criterion measure for optimizing the maximum reported cluster size. Our simulation study results showed that using the Gini coefficient works better than the original spatial scan statistic for identifying irregularly shaped clusters, by reporting an optimized and refined collection of clusters rather than a single larger cluster. We have provided a real data example that seems to support the simulation results. We think that using the Gini coefficient in spatial scan statistics can be helpful for the detection of irregularly shaped clusters.

Introduction

Among various statistical methods for spatial cluster detection, the spatial scan statistics [1] have been extensively used in numerous applications including not only geographical disease surveillance but also architecture [2], forestry [3,4], astronomy [5], and criminology [6,7]. The method, based on a likelihood ratio test statistic, evaluates a large number of different and overlapping scanning windows. The test statistic is formulated based on a probability model depending on the data type, such as the Poisson model for count data [1] and the ordinal model for ordered categorical data [8]. Scanning windows are constructed with variable sizes at each location on a study area, up to a certain maximum limit. For each scanning window, a likelihood ratio test statistic for comparing inside versus outside the window is calculated and the scanning window with the maximum likelihood ratio is defined as the most likely cluster. The procedure of finding significant spatial clusters using the spatial scan statistics can be performed with the freely available software SaTScan™ [9].

An important issue regarding spatial scan statistics is the scanning window shape. The first proposed spatial scan statistic used circular scanning windows. The circular spatial scan statistic works well for compact clusters, but it may have difficulty correctly identifying non-circular clusters. Tango [10] and Tango and Takahashi [11] have mentioned that the original spatial scan statistic using circular windows tends to detect a larger cluster than the true cluster by swallowing neighboring areas with non-elevated risk. This phenomenon may occur more easily when the true cluster is non-circular. Other shapes of scanning windows have also been proposed, such as elliptic [12] and irregular [11,13–17] shapes. Several studies [15–19] have shown that the methods using irregularly shaped scanning windows have better power for detecting irregularly shaped clusters, as expected.

To apply the spatial scan statistics, one should determine the maximum scanning window size (MSWS) in advance. The MSWS is usually chosen in terms of the percentage of the total population for the study area, and an MSWS value of 50% of the total population is commonly used as the default setting for SaTScan™. However, users may choose an arbitrary MSWS and the results can be affected by the chosen MSWS. Ribeiro and Costa [20] examined the effect of different values of MSWS via a simulation study and found that the performance of spatial scan statistics can be sensitive to the choice of MSWS. Their findings do not imply that one may run the analysis multiple times with different values of MSWS to optimize the cluster detection results, as discussed by Han et al. [21]. In that case, the results will suffer from the multiple testing problem. Han et al. [21] proposed a method using a Gini coefficient to optimize the maximum reported cluster size (MRCS). It is statistically valid to rerun the analysis to report clusters of a certain maximum size while keeping the MSWS fixed at a larger value. Han et al. [21] mentioned that setting the MRCS at 50% often results in unnecessarily large and less informative clusters, and the authors concluded that the Gini coefficient can identify a more refined collection of non-overlapping clusters. This method has been implemented in SaTScan™ version 9.3.

In this paper, we have evaluated the use of the Gini coefficient in the spatial scan statistics for detecting irregularly shaped clusters. From our experience, we also found that using the Gini coefficient in SaTScan™ tends to result in the identification of multiple smaller clusters rather than a single larger cluster. The smaller clusters were often connected and located contiguously, in which case we may consider the clusters as a single cluster in a possibly irregular shape. We think that using the Gini coefficient improves the detection of irregularly shaped clusters. We do not expect that the use of the Gini coefficient outperforms other cluster detection methods specifically using irregularly shaped windows for the detection of irregular clusters. The Gini coefficient was developed as an optimizing criterion for MRCS, and if it has an ability to better detect irregularly shaped clusters than the original method, it certainly offers an advantage.

In the next section, we briefly review the spatial scan statistic for count data and the Gini coefficient in the Poisson-based scan statistic. Through a simulation study, we evaluate the performance of the Gini coefficient for detecting irregularly shaped clusters, compared with the original circular and elliptic scan statistics. Methods that were developed specifically for detecting irregular clusters, such as the flexible spatial scan statistic [11], the circular spatial scan statistic with a restricted likelihood ratio [22], and the flexible spatial scan statistic with a restricted likelihood ratio [23], are also included in the simulation study for comparison. We illustrate the different methods using a real data set of liver cancer mortality for males in Seoul and Gyeonggi province in Korea. Finally, we discuss our findings with some concluding remarks.

Methods

Spatial scan statistic for count data

When we want to detect a cluster of cases compared against the underlying population at risk, for example, using disease mortality data, we can use the Poisson-based spatial scan statistic. Given a collection of scanning windows Z, the spatial scan statistic for count data is defined as the maximum of the likelihood ratio test statistics over Z for the following hypotheses:

\[ H_0: p = q \quad \text{vs.} \quad H_1: p > q \ (\text{or } p < q), \]

where p and q are the event rates inside and outside the scanning window z, respectively. The null hypothesis indicates no clustering and the alternative can be specified to search for clusters with high (or low) rates. The Poisson-based spatial scan statistic λ is expressed as

\[ \lambda = \max_{z \in Z} \frac{\left(\frac{c_z}{n_z}\right)^{c_z} \left(\frac{C - c_z}{N - n_z}\right)^{C - c_z}}{\left(\frac{C}{N}\right)^{C}} \, I(\cdot), \]

where c_z and n_z denote the observed number of cases and the population within z, respectively, and C and N are the total number of observed cases and the total population over the whole area, respectively. I(·) is the indicator function to indicate a high or low rate; when searching for high rates, it equals 1 if c_z/n_z > (C − c_z)/(N − n_z) and 0 otherwise. Because the denominator in the above formula does not depend on z, the term (C/N)^C is often dropped from the test statistic.
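As an illustration only, the following Python sketch evaluates the logarithm of the likelihood ratio for a single window, assuming the ratio has the form (c_z/n_z)^{c_z} ((C − c_z)/(N − n_z))^{C − c_z} / (C/N)^C, and then takes the maximum over a list of candidate windows. The function names, the representation of windows as lists of district indices, and the convention of assigning a log ratio of 0 to windows that fail the indicator are assumptions of this sketch, not the SaTScan™ implementation.

```python
import math


def _xlogy(x, y):
    # x * log(y), with the usual convention that 0 * log(0) = 0.
    return 0.0 if x == 0 else x * math.log(y)


def poisson_llr(c_z, n_z, C, N, high=True):
    """Log likelihood ratio for one window with c_z cases and population n_z."""
    if n_z <= 0 or n_z >= N:
        return 0.0  # no inside/outside comparison is possible
    rate_in = c_z / n_z
    rate_out = (C - c_z) / (N - n_z)
    # Indicator I(): keep only windows with a higher (or lower) rate inside.
    # Assumed convention: failing windows get log ratio 0 so they never win.
    if high and rate_in <= rate_out:
        return 0.0
    if not high and rate_in >= rate_out:
        return 0.0
    return _xlogy(c_z, rate_in) + _xlogy(C - c_z, rate_out) - _xlogy(C, C / N)


def most_likely_cluster(windows, cases, pop):
    """Return (window, log likelihood ratio) maximizing the statistic.

    windows: list of lists of district indices (hypothetical input format).
    """
    C, N = sum(cases), sum(pop)
    scored = [
        (poisson_llr(sum(cases[i] for i in w), sum(pop[i] for i in w), C, N), w)
        for w in windows
    ]
    best_llr, best_window = max(scored, key=lambda t: t[0])
    return best_window, best_llr
```

Working on the log scale avoids overflow for large counts and does not change which window attains the maximum.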

The most likely cluster is defined as the scanning window that attains the maximum, λ. The statistical significance of the most likely cluster is often assessed using Monte Carlo hypothesis testing, by generating random data sets under the null hypothesis and comparing the test statistic from the original data set with those from the randomly generated data sets. Alternatively, one may use Gumbel-based p-values by approximating the distribution of the test statistic with an extreme value distribution [24,25]. Both methods are available in SaTScan™.
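The Monte Carlo procedure can be sketched as follows. The sketch assumes that null data sets are generated by holding the total number of cases fixed and allocating each case to a district with probability proportional to its population, and that the p-value is the rank-based quantity (1 + number of simulated statistics at least as large as the observed one) / (1 + number of replications). The function name and the `statistic` callback are hypothetical; any scan statistic taking case counts and populations could be passed in.

```python
import random


def monte_carlo_p_value(cases, pop, statistic, n_rep=999, seed=0):
    """Rank-based Monte Carlo p-value for statistic(cases, pop)."""
    rng = random.Random(seed)
    total_cases = sum(cases)
    observed = statistic(cases, pop)
    districts = range(len(pop))
    exceed = 0
    for _ in range(n_rep):
        # Null hypothesis: each case falls in a district with probability
        # proportional to that district's population.
        sim = [0] * len(pop)
        for i in rng.choices(districts, weights=pop, k=total_cases):
            sim[i] += 1
        if statistic(sim, pop) >= observed:
            exceed += 1
    return (exceed + 1) / (n_rep + 1)
```

With 999 replications the smallest attainable p-value is 0.001, which is why replication counts of the form 10^k − 1 are convenient.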

Besides the most likely cluster, it can be informative to report secondary clusters with high likelihood ratios. The statistical significance of the secondary clusters is evaluated in the same way as for the most likely cluster. As thoroughly explained in the paper by Han et al. [21], an earlier version of SaTScan™ reported secondary clusters without overlapping with more significant clusters as a default option, which could result in a large most likely cluster hiding several smaller distinct clusters. They proposed to apply the Gini coefficient as an intuitive and systematic way to determine the best collection of clusters to report by optimizing the MRCS. Here we briefly describe the method. The Gini coefficient for a set of clusters is calculated as two times the area between the reference line of y = x and the Lorenz curve. The Lorenz curve for a set of clusters is constructed using the cumulative percentages of observed cases and expected cases on the x- and y-axes, respectively. When there is a single significant cluster, as the number of observed cases in the cluster gets higher, which means more cases are concentrated, the Lorenz curve gets further away from the reference line and the Gini coefficient value gets higher. When comparing several competing collections of non-overlapping clusters, the one with the highest Gini coefficient value should be chosen as the cluster collection to report [21]. Through a simulation study, Han et al. [21] showed that the method identified the correct clusters and performed well. For more detailed information on the use of the Gini coefficient in the spatial scan statistic, refer to the paper by Han et al. [21]. The method has been implemented in SaTScan™ and is available for the Poisson and Bernoulli models only.
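A minimal sketch of the Gini calculation described above is given below. It places cumulative shares of observed cases on the x-axis and of expected cases on the y-axis, closes the Lorenz curve at (1, 1) so that the rest of the map forms the last segment, and returns two times the area between the curve and the line y = x. Ordering clusters by decreasing observed-to-expected ratio, the function names, and the dictionary keyed by candidate MRCS are assumptions of this sketch; Han et al. [21] should be consulted for the exact procedure.

```python
def gini_coefficient(clusters, total_observed, total_expected):
    """Gini coefficient for non-overlapping clusters.

    clusters: list of (observed, expected) pairs, each with expected > 0.
    """
    # Assumed ordering: most concentrated cluster first.
    ordered = sorted(clusters, key=lambda oe: oe[0] / oe[1], reverse=True)
    xs, ys = [0.0], [0.0]
    cum_obs = cum_exp = 0.0
    for obs, exp in ordered:
        cum_obs += obs
        cum_exp += exp
        xs.append(cum_obs / total_observed)  # cumulative share of observed
        ys.append(cum_exp / total_expected)  # cumulative share of expected
    xs.append(1.0)
    ys.append(1.0)
    # Trapezoidal area under the Lorenz curve.
    area = sum(
        (xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2.0
        for i in range(len(xs) - 1)
    )
    # Two times the area between y = x (area 0.5) and the curve.
    return 1.0 - 2.0 * area


def best_mrcs(collections, total_observed, total_expected):
    """Pick the candidate MRCS whose cluster collection has the highest Gini.

    collections: dict mapping a candidate MRCS to its list of
    (observed, expected) pairs for the significant clusters reported.
    """
    return max(
        collections,
        key=lambda m: gini_coefficient(
            collections[m], total_observed, total_expected
        ),
    )
```

For a single cluster holding half of the observed cases, shrinking its expected share from 25% to 10% raises the coefficient from 0.25 to 0.40, matching the statement that more concentrated clusters give higher values.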

Although Han et al. [21] conducted a simulation study and showed a good performance of the Gini coefficient, they only considered compact clusters. Here we want to evaluate the Gini coefficient for detecting irregularly shaped clusters. As previously mentioned, if multiple small clusters of circular or elliptic shapes are found and they are located contiguously, they can be regarded as a single and possibly irregularly shaped cluster. We presumed that using the Gini coefficient can more precisely identify irregular clusters by reporting several smaller clusters connected to one another.

Simulation study

We conducted an extensive simulation study to evaluate the performance of the Gini coefficient in the Poisson-based spatial scan statistic for detecting irregularly shaped clusters. We created 7 different cluster models with different shapes and at different locations on a real geographical map of Seoul and Gyeonggi province in South Korea. The area consists of 69 districts with mixed urban and rural regions. Seoul is the capital city of South Korea with a highly dense population and Gyeonggi province is composed of districts in relatively larger sizes with small populations. Table 1 and Fig 1 show the locations and information of the 7 simulated cluster models. We tried to create various types of cluster models in irregular shapes and in different locations and sizes. We also included a cluster model of a compact shape.

Table 1. Number of clusters and districts in the clusters of simulated cluster models A–G.

https://doi.org/10.1371/journal.pone.0170736.t001

We generated 1,000 random data sets for each cluster model with relative risks (RRs) of 1.3, 1.5, and 2 for the clusters of high rates. For the population for the study area, we used a half of the real population for each district of Seoul and Gyeonggi province in 2010 provided by Statistics Korea. For each randomly generated data set and each cluster model, we conducted a spatial cluster detection analysis using 7 different methods: the circular and elliptic spatial scan statistics with and without the Gini coefficient (denoted by CS, ES, GCS, and GES), the original flexible spatial scan statistic (OF), the circular spatial scan statistic with a restricted likelihood ratio (RC), and the flexible spatial scan statistic with a restricted likelihood ratio (RF). Analyses using CS, ES, GCS, and GES were conducted using SaTScan™ version 9.3 [9] and those using OF, RC, and RF were conducted using FleXScan version 3.1 [26].
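One way to generate such data sets is sketched below. The text does not state how the total number of cases was fixed or which random mechanism was used, so the sketch makes two labeled assumptions: the total number of cases is a user-supplied value, and each case is allocated to a district with probability proportional to population times relative risk (RR inside the cluster, 1 elsewhere). All names are hypothetical.

```python
import random


def simulate_cases(pop, cluster, rr, total_cases, rng):
    """Allocate total_cases to districts under a cluster model.

    pop: list of district populations.
    cluster: set of district indices with elevated risk.
    rr: relative risk inside the cluster (1.0 outside).
    """
    # Assumed model: allocation probability proportional to population * RR.
    weights = [p * (rr if i in cluster else 1.0) for i, p in enumerate(pop)]
    counts = [0] * len(pop)
    for i in rng.choices(range(len(pop)), weights=weights, k=total_cases):
        counts[i] += 1
    return counts


def simulate_many(pop, cluster, rr, total_cases, n_sets=1000, seed=1):
    """Generate n_sets independent data sets with a reproducible seed."""
    rng = random.Random(seed)
    return [
        simulate_cases(pop, cluster, rr, total_cases, rng)
        for _ in range(n_sets)
    ]
```

Setting `rr=1.0` reduces the generator to the null model of no clustering.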

We identified all significant clusters by each of the 7 methods for each simulation and calculated 3 performance measures, namely the usual power, sensitivity, and positive predictive value (PPV). The usual power indicates the power to reject the null hypothesis of no clustering (in any way) and was estimated by the proportion of rejections among the 1,000 replicate simulations. Tango and Takahashi [11] used the expression "usual power" when they proposed a bivariate power to better reflect the accuracy of detecting true clusters. Sensitivity is defined as the number of districts correctly detected divided by the number of districts in the true cluster, and PPV is defined as the number of districts correctly detected divided by the number of detected districts. Sensitivity and PPV were estimated as the averages of sensitivity and PPV over the data sets for which the null hypothesis was rejected at the 0.05 significance level.
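The three measures can be computed as in the following sketch, which represents each replicate as a pair of a p-value and the set of detected districts. The data layout, the function names, and the treatment of a p-value exactly equal to 0.05 as a rejection are assumptions of this sketch.

```python
def sensitivity_ppv(detected, true_cluster):
    """Sensitivity and PPV for one replicate, given sets of district ids."""
    detected, true_cluster = set(detected), set(true_cluster)
    correct = len(detected & true_cluster)
    sensitivity = correct / len(true_cluster)
    ppv = correct / len(detected) if detected else float("nan")
    return sensitivity, ppv


def summarize(results, true_cluster, alpha=0.05):
    """Usual power, mean sensitivity, and mean PPV over replicates.

    results: list of (p_value, detected_districts) pairs, one per replicate.
    Sensitivity and PPV are averaged over rejected replicates only.
    """
    rejected = [d for p, d in results if p <= alpha]  # boundary rule assumed
    power = len(rejected) / len(results)
    if not rejected:
        return power, float("nan"), float("nan")
    pairs = [sensitivity_ppv(d, true_cluster) for d in rejected]
    mean_sens = sum(s for s, _ in pairs) / len(pairs)
    mean_ppv = sum(v for _, v in pairs) / len(pairs)
    return power, mean_sens, mean_ppv
```

A method that over-detects keeps sensitivity high but lowers PPV, while a method that under-detects does the opposite, which is why the two are reported together.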

We also estimated the bivariate power distribution proposed by Tango and Takahashi [11]. While the usual power, sensitivity, and PPV are useful for showing the performance as averaged measures, the bivariate power can reveal more detailed information on the accuracy of identifying the true cluster. The bivariate power distribution P(l,s) is defined with 2 parameters of length l, the number of regions of the detected cluster, and s, the number of regions identified correctly in the true cluster. The usual power can be obtained by summing up the bivariate power over all possible values of l and s. The bivariate power can indicate the probabilities of exact detection, under-detection, and over-detection.
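As a sketch of how the bivariate power distribution can be tabulated from simulation output, the code below counts, among rejected replicates, how often a detected cluster has length l and contains s districts of the true cluster, and divides by the number of replicates. The input layout and names are hypothetical, and the rejection rule at exactly 0.05 is an assumption.

```python
from collections import Counter


def bivariate_power(results, true_cluster, alpha=0.05):
    """Estimate P(l, s) as a dict keyed by (l, s).

    results: list of (p_value, detected_districts) pairs, one per replicate.
    l is the number of detected districts; s is how many of them lie in
    the true cluster.
    """
    true_cluster = set(true_cluster)
    counts = Counter()
    for p_value, detected in results:
        if p_value <= alpha:  # boundary rule assumed
            d = set(detected)
            counts[(len(d), len(d & true_cluster))] += 1
    n = len(results)
    return {ls: k / n for ls, k in counts.items()}


def usual_power(bp):
    """Summing P(l, s) over all (l, s) recovers the usual power."""
    return sum(bp.values())


def exact_detection(bp, true_cluster):
    """Probability that exactly the true cluster is detected."""
    k = len(set(true_cluster))
    return bp.get((k, k), 0.0)
```

Entries with l greater than the true cluster size indicate over-detection, and entries with l smaller than it indicate under-detection.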

Korean male liver cancer mortality data

We analyzed Korean male liver cancer mortality data for 2010–2013 obtained from Statistics Korea. We used the aggregated mortality data at the “Si-Gun-Gu” (district) level and searched for clusters with high mortality rates in Seoul and Gyeonggi province using the 7 different methods. For the population, we used the 2010 Population and Housing Census data from Statistics Korea. The population and mortality data were grouped into 5-year age intervals and the analyses were adjusted for the age group.
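The text does not spell out how the age adjustment was carried out. A common approach for Poisson-based scan statistics is indirect standardization, in which the expected number of cases in each district is obtained by applying the overall age-specific rates to the district's age-specific population. The sketch below illustrates that approach only; it is an assumption and not a description of the authors' procedure, and the names and nested-list layout are hypothetical.

```python
def expected_cases(pop_by_age, cases_by_age):
    """Age-adjusted expected cases per district by indirect standardization.

    pop_by_age[i][a]: population of district i in age group a.
    cases_by_age[i][a]: observed cases of district i in age group a.
    """
    n_age = len(pop_by_age[0])
    # Overall rate in each age group across the whole study area.
    rates = []
    for a in range(n_age):
        total_cases_a = sum(row[a] for row in cases_by_age)
        total_pop_a = sum(row[a] for row in pop_by_age)
        rates.append(total_cases_a / total_pop_a)
    # Expected cases: district age-specific population times overall rate.
    return [sum(p * r for p, r in zip(row, rates)) for row in pop_by_age]
```

By construction the expected counts sum to the total number of observed cases, so they can stand in for the population term when comparing inside versus outside a window.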

Results

Simulation results

Tables 2–4 show the estimated usual power, sensitivity, and PPV for each method under cluster models A–G with the RRs of 1.3, 1.5, and 2, respectively. In most cases, the usual power was estimated as 1 or very close to 1. We report the usual power only when it is not exactly equal to 1. Although none of the methods performed best across all scenarios, the RF method showed the highest values of sensitivity and PPV in many cases. The OF method performed relatively well overall. On the other hand, the CS and ES methods had poor performance in all scenarios except for scenario G of a compact cluster. ES performed very well and even better than the other methods for scenario G. The RC method showed very low values of sensitivity, especially when the relative risk was 1.3. The GES method performed reasonably well in general and better than CS, ES, and GCS did. The GCS method performed better than CS and ES overall, but ES seemed to perform better than GCS under some scenarios with RR = 1.3. The performance of GES was comparable to those of RF and OF, with even higher values of sensitivity or PPV in some scenarios. For a compact cluster, both the sensitivity and PPV of GES were higher than those of RF when the RR = 1.3, although PPV was somewhat lower when the RR = 1.5 or 2.

The estimated bivariate power distribution for cluster model A with the RR of 1.5 for each method is shown in Table 5. Cluster model A represents a single irregularly shaped cluster composed of 11 districts. The usual power for each method was 1000/1000 as indicated in Table 3. The estimated probability of exact detection P(11,11) was highest for RF and RC. GES also had a relatively high value of exact detection probability compared to CS, ES, GCS, and OF. We observed that CS and GCS tended to over-detect (as represented by the large numbers presented in the rows greater than length l = 11), which led to a relatively low PPV (Table 3). RC seemed not to over-detect, but rather seemed to under-detect (as represented by the large numbers presented in the rows less than length l = 11), which led to a relatively low sensitivity. ES in this scenario also tended to under-detect. For OF and RF, larger numbers were distributed around the point of exact detection P(11,11). The bivariate power distribution for GES was comparable to that for OF or RF. Because the results of the bivariate power distribution for each method under all scenarios would take up too much space, here we only present 1 case as an example. Sensitivity and PPV provide enough information to compare the overall performance of the 7 methods. The results for the bivariate power distribution under all other scenarios can be found in S1 File.

Table 5. Estimated bivariate power distributions P(l,s) × 1,000 of the 7 methods for cluster model A (RR = 1.5).

https://doi.org/10.1371/journal.pone.0170736.t005

Analysis results for Korean male liver cancer mortality data

Fig 2 shows the detected clusters with high rates of Korean male liver cancer mortality in Seoul and Gyeonggi province for 2010–2013, using the 7 different methods. Table 6 includes information on the RR, p-value, and number of districts of the detected clusters. Overall, the results were similar for each method. On closer examination, however, the clusters detected by GES were almost identical to those detected by RF. The optimal MRCS for GES was found to be as small as 3%, and still ES (using 50% for MRCS) detected exactly the same clusters as GES. We think that this was due to the detected clusters having very high RRs and low populations. GES and ES omitted only 2 small districts from the regions in the clusters detected by RF, while CS, GCS, OF, and RC identified more districts as significant clusters than RF. Although we do not know the true clusters in the real data, we assumed that the clusters detected by RF would be close to the true ones because the simulation studies in this paper and the paper by Tango and Takahashi [23] showed that RF has a very good performance for accurately identifying clusters.

Fig 2. Spatial clusters with high mortality rates of male liver cancer in Seoul and Gyeonggi province in Korea for 2010–2013, detected by the 7 methods.

https://doi.org/10.1371/journal.pone.0170736.g002

Table 6. Most likely and secondary clusters of high rates of male liver cancer mortality in Seoul and Gyeonggi province in Korea for 2010–2013, detected by the 7 methods.

https://doi.org/10.1371/journal.pone.0170736.t006

Discussion

In this paper, we evaluated the use of the Gini coefficient in the Poisson-based spatial scan statistic for detecting irregularly shaped clusters. The simulation study showed that using the Gini coefficient in the elliptic spatial scan statistic had a reasonably good performance compared to the other methods for detecting irregular clusters. We think that the analysis results for Korean male liver cancer mortality data also support that the elliptic spatial scan statistic using the Gini coefficient might work well for detecting irregularly shaped clusters. Despite their popular usage in various applications, it has been pointed out that the spatial scan statistics with circular and elliptic scanning windows may have difficulty in correctly identifying non-compact, arbitrarily shaped spatial clusters [13–19,22,23,27]. However, based on our simulation study, using the Gini coefficient in the elliptic spatial scan statistic can resolve the issue to a certain extent. We do not claim that the Gini coefficient can work better than other spatial cluster detection methods specifically using irregularly shaped windows for detecting arbitrarily shaped clusters. By reporting an optimized and refined collection of clusters, using the Gini coefficient can better identify irregularly shaped clusters than the original spatial scan statistic without using it. Also, its performance can be almost as good as the flexible spatial scan statistic with a restricted likelihood ratio. A major advantage of using the Gini coefficient over the flexible spatial scan statistic is efficiency in computation time. We found that running FleXScan with the RF method took two to three times longer than running SaTScan™ for our simulation study. Also, it could be a very tedious job to create a matrix definition file representing adjacency for each location, which is additionally required for FleXScan, for a data set having a very large number of locations.

The Gini coefficient has been already implemented in SaTScan™ for the Poisson and Bernoulli models. Spatial scan statistics are available for other probability models such as ordinal [8,28], multinomial [29], normal [30], and exponential [31]. It will be very useful to develop the Gini coefficient or another criterion for optimizing the MRCS for such models as well. While it is expected that such criterion measures may work well for detecting irregular clusters, a careful evaluation will be needed.

Supporting Information

S1 File. Results for bivariate power distributions.

https://doi.org/10.1371/journal.pone.0170736.s001

(PDF)

Author Contributions

  1. Conceptualization: IJ.
  2. Data curation: JK.
  3. Formal analysis: JK.
  4. Funding acquisition: IJ.
  5. Investigation: JK.
  6. Methodology: IJ.
  7. Project administration: IJ.
  8. Resources: IJ.
  9. Software: JK.
  10. Supervision: IJ.
  11. Visualization: JK.
  12. Writing – original draft: JK, IJ.
  13. Writing – review & editing: IJ.

References

  1. Kulldorff M. A spatial scan statistic. Communications in Statistics - Theory and Methods. 1997;26(6):1481–96.
  2. Kaza N, Lester TW, Rodriguez DA. The spatio-temporal clustering of green buildings in the United States. Urban Studies. 2013;50:3262–82.
  3. Fei S. Applying hotspot detection methods in forestry: A case study of Chestnut Oak regeneration. International Journal of Forestry Research. 2010;815292.
  4. Vega Orozco C, Tonini M, Conedera M, Kanevski M. Cluster recognition in spatial-temporal sequences: the case of forest fires. Geoinformatica. 2012;16:653–73.
  5. Bidin CM, Marcos RD, Marcos CD, Carraro G. Not an open cluster after all: the NGC 6863 asterism in Aquila. Astronomy and Astrophysics. 2010;510:A44.
  6. Minamisava R, Nouer SS, de Morais Neto OL, Melo LK, Andrade ALS. Spatial clusters of violent deaths in a newly urbanized region of Brazil: Highlighting the social disparities. International Journal of Health Geographics. 2009;8:66. pmid:19943931
  7. Leitner M, Helbich M. The Impact of Hurricanes on Crime: A Spatio-temporal Analysis in the City of Houston, TX. Cartography and Geographic Information Science. 2011;37:214–22.
  8. Jung I, Kulldorff M, Klassen AC. A spatial scan statistic for ordinal data. Statistics in Medicine. 2007;26(7):1594–607. pmid:16795130
  9. Kulldorff M. and Information Management Services, Inc. SaTScan™ v9.3: Software for the spatial and space-time scan statistics. http://www.satscan.org/, 2016.
  10. Tango T. A test for spatial disease clustering adjusted for multiple testing. Statistics in Medicine. 2000;19:191–204. pmid:10641024
  11. Tango T, Takahashi K. A flexibly shaped spatial scan statistic for detecting clusters. International Journal of Health Geographics. 2005;4(1):1.
  12. Kulldorff M, Huang L, Pickle L, Duczmal L. An elliptic spatial scan statistic. Statistics in Medicine. 2006;25(22):3929–43. pmid:16435334
  13. Duczmal L, Assunção R. A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Computational Statistics and Data Analysis. 2004;45:269–86.
  14. Patil GP, Taillie C. Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics. 2004;11:183–97.
  15. Assunção R, Costa M, Tavares A, Ferreira S. Fast detection of arbitrarily shaped disease clusters. Statistics in Medicine. 2006;25(5):723–42. pmid:16453376
  16. Duczmal L, Cançado ALF, Takahashi RHC, Bessegato LF. A genetic algorithm for irregularly shaped spatial scan statistics. Computational Statistics and Data Analysis. 2007;52(1):43–52.
  17. Costa MA, Assunção R, Kulldorff M. Constrained spanning tree algorithms for irregularly-shaped spatial clustering. Computational Statistics and Data Analysis. 2012;56(6):1771–83.
  18. Takahashi K, Tango T. An extended power of cluster detection tests. Statistics in Medicine. 2006;25(5):841–52. pmid:16453379
  19. Duczmal L, Kulldorff M, Huang L. Evaluation of spatial scan statistics for irregularly shaped clusters. Journal of Computational and Graphical Statistics. 2006;15(2):428–42.
  20. Ribeiro SHR, Costa MA. Optimal selection of the spatial scan parameters for cluster detection: a simulation study. Spatial and Spatio-temporal Epidemiology. 2012;3(2):107–20. pmid:22682437
  21. Han J, Zhu L, Kulldorff M, Hostovich S, Stinchcomb DG, Tatalovich Z, et al. Using Gini coefficient to determining optimal cluster reporting sizes for spatial scan statistics. International Journal of Health Geographics. 2016;15(1):27. pmid:27488416
  22. Tango T. A spatial scan statistic with a restricted likelihood ratio. Japanese Journal of Biometrics. 2008;29(2):75–95.
  23. Tango T, Takahashi K. A flexible spatial scan statistic with a restricted likelihood ratio for detecting disease clusters. Statistics in Medicine. 2012;31(30):4207–18. pmid:22807146
  24. Abrams AM, Kleinman K, Kulldorff M. Gumbel based p-value approximations for spatial scan statistics. International Journal of Health Geographics. 2010;9(1):1.
  25. Jung I, Park G. p-value approximations for spatial scan statistics using extreme value distributions. Statistics in Medicine. 2015;34(3):504–14. pmid:25345856
  26. Takahashi K, Yokoyama T, Tango T. FleXScan v3.1: Software for the flexible spatial scan statistic. National Institute of Public Health, Japan. 2010.
  27. Cançado ALF, Duarte AR, Duczmal LH, Ferreira SJ, Fonseca CM, Gontijo ECDM. Penalized likelihood and multi-objective spatial scans for the detection and inference of irregular clusters. International Journal of Health Geographics. 2010;9:55. pmid:21034451
  28. Jung I, Lee H. Spatial cluster detection for ordinal outcome data. Statistics in Medicine. 2012;31:4040–8. pmid:22807106
  29. Jung I, Kulldorff M, Richard OJ. A spatial scan statistic for multinomial data. Statistics in Medicine. 2010;29(18):1910–18. pmid:20680984
  30. Kulldorff M, Huang L, Konty K. A scan statistic for continuous data based on the normal probability model. International Journal of Health Geographics. 2009;8(1):1.
  31. Huang L, Kulldorff M, Gregorio D. A spatial scan statistic for survival data. Biometrics. 2007;63(1):109–18. pmid:17447935