A Scan Statistic for Binary Outcome Based on Hypergeometric Probability Model, with an Application to Detecting Spatial Clusters of Japanese Encephalitis

As a useful tool for geographical cluster detection of events, the spatial scan statistic is widely applied in many fields and plays an increasingly important role. The classic version of the spatial scan statistic for the binary outcome is developed by Kulldorff, based on the Bernoulli or the Poisson probability model. In this paper, we apply the Hypergeometric probability model to construct the likelihood function under the null hypothesis. Compared with existing methods, the likelihood function under the null hypothesis is an alternative and indirect method to identify the potential cluster, and the test statistic is the extreme value of the likelihood function. Similar with Kulldorff’s methods, we adopt Monte Carlo test for the test of significance. Both methods are applied for detecting spatial clusters of Japanese encephalitis in Sichuan province, China, in 2009, and the detected clusters are identical. Through a simulation to independent benchmark data, it is indicated that the test statistic based on the Hypergeometric model outweighs Kulldorff’s statistics for clusters of high population density or large size; otherwise Kulldorff’s statistics are superior.


Introduction
In epidemiological studies, it is often important to evaluate whether the occurrence of a disease is randomly distributed or tends to occur as clusters over time and/or space after adjusting for known confounding factors, which may provide clues to the etiology of the disease [1]. In addition, outbreak detection becomes possible thanks to growing geographically referenced health-related data, such as data from sales of Over-the-Counter (OTC) Healthcare Products.
Likelihood ratio based spatial scan statistic is a cluster detection test. Because of its ability of both identifying localized clusters and evaluating their significance, the spatial scan statistic becomes more popular relative to other statistical methods for disease clustering. In order to investigate excessive risk of disease after adjusting for the unevenly distributed population, Kulldorff proposed the spatial scan statistic for the binary outcome in his seminal papers [2,3]. Those included the Bernoulli and the Poisson models. The spatial scan statistic quickly has become a popular research field and various new methods have been developed, which could roughly be divided into two classes: those extending the shape of the scanning window to detect irregularly shaped clusters [4][5][6][7][8][9][10][11], and those modifying the test statistic to exploit more information [12][13][14][15][16][17] or to accommodate more complex data structures, such as multivariate data [18][19][20], ordinal data [21], survival data [22,23], normally distributed data [24,25] and multinomial data [26]. For all methods, the significance of the test statistic is evaluated using the Monte Carlo test [27].
While a great variety of methods have been developed for diverse purposes, the two classic models for the binary outcome play a central role in the spatial scan statistic due to their wide application. The Bernoulli model is more appropriate for casecontrol data, while the Poisson model is more appropriate for casepopulation data. When the number of cases is small compared to the target population, the two models approximate each other. Both of them are based on the likelihood ratio test theory and use the likelihood ratio (LR), a generalized measuring index of clustering for each window, thereby making windows of different size comparable. In a word, the larger the LR of a window is, the more likely it is a true cluster. The window with the largest LR is called the most likely cluster (MLC) and the test statistic is the largest LR. In general, as summarized by Kulldorff [28], a test for spatial randomness adjusted for an inhomogeneity comprises 7 steps, of which the 3rd step is to construct a measuring index for each defined area and the 5th step is to define a summary quantification for all defined areas. The LR and the largest LR correspond with the 3rd and 5th steps, repsectively. Neill [29] presents a heuristic interpretation that the LR is a sort of distance away from the null hypothesis of no clustering. To the best of our knowledge, quite on the contrary, the LR is an index of measuring the closeness to the alternative hypothesis of an existing window of clustering. However, it is interesting to study whether there exists an index of measuring the departure from the null hypothesis of no clustering and, if so, how the new measuring index performs compared with the existing methods.
In this paper, we apply the Hypergeometric probability model to construct a likelihood function under the null hypothesis, which sometimes is complementary to the alternative hypothesis. The idea originates from the email exchange with Dr. Kulldorff and another method from Dr.Wong et al for syndrome surveillance, which is named WSARE [30]. By analogy with our situation, WSARE employs the P value of testing heterogeneity of proportions inside and outside each window as a measure of the departure from the null hypothesis. That is, the window with the minimum P value is the 'most strange window' (in Dr.Wong's words) and the minimum P value is the test statistic. The proposed and existing methods are both applied to the real data of Japanese Encephalitis (JE) in Sichuan province, China. A simulation between the proposed test statistic and the Kulldorff's statistics is carried out using a set of independent benchmark data.

JE Data
In 2003, there was a outbreak of SARS in China, and this exposed the underdevelopment of the public health system in handling public health emergencies in China. The China Central government requested to strengthen the construction of an infectious disease and public health emergency system, with focus on promoting the timeliness, sensitivity and accuracy of reporting. The Chinese Center for Disease Control and Prevention (CCDC) made the construction of a new operation model, Chinese Information System for Infectious Diseases Control and Prevention (CISIDCP). CISIDCP was established on the basis of individual cases and public health emergencies. A Virtual Private Network (VPN) has been constructed using the information safety technology, and information of individual cases is directly reported to the national database through the internet. This system will report 39 notifiable infectious diseases to CCDC within 24 hours. However, the management is classified into national, provincial, prefecture and county levels. CISIDCP makes feedback with health authority departments at every level. In 2005, CISIDCP had covered at least 93.3% of medical units at the county and above.
JE is among the 39 notifiable infectious diseases and, therefore JE case will be reported routinely by CISIDCP. JE is a vectorborne viral disease with a high mortality rate and a high percentage of neuropsychiatric sequelae. The JE virus is spread by marsh birds and intensified by pigs, mainly transmitted via the bite of infected Culex mosquito. Humans are dead-end hosts [31]. Many of the ecological, environmental, climatic and human behavioral factors are involved in the spread of the JE virus [32]. Contextual determinants of JE include irrigated rice farming, pig rearing and the rural population. Sichuan is a province in Southwest China. It is one one of the major agricultural production bases of China, including rice and pork production. Hence, Sichuan province is a high-incidence region for JE with the incidence of JE ranked the 5th in 2009 among 31 provinces in Mainland China. As a subordinate unit of CCDC, Sichuan Center for Disease Control and Prevention (SCDC) has the permission to access the data of Sichuan from CISIDCP. It is interesting for SCDC to investigate the geographic distribution pattern of JE. This analysis may help further learn the disease cluster areas and influencing factors of JE, finally assisting health officials in allocating the health resources.

Notation and Kulldorff's Test Statistics for the Binary Outcome
N G: the whole study region divided into many counties N Z: a circular window in the study region N V: all overlapped windows formed by circles of arbitrary radius r centered in each one of counties, V~fZg N n Z : the number of population within window Z N c Z : the observed number of cases within window Z N n G : the total population in the study region N c G : total number of cases in the study region N z Ã : one cluster such that all individuals within z Ã have probability p in be a case and p out is the same probability for individuals outside z Ã When scanning for the high rates only, such as identifying areas with high rate of leukemia [33] or Breast cancer [34], Kulldorff's statement on the hypotheses is as following: H 0 : p in~pout , H 1 : p in wp out for z Ã . Under the null hypothesis, the expected number of cases within Z, e Z , is calculated as follows: In the framework of likelihood ratio ratio test, the final test statistic for the Poisson model is the likelihood ratio maximized over Z: where The search space for the candidate cluster is ristricted by equation 3, where the first inequality specifies the maximum spatial cluster size, and the second inequality determines the aim of scanning for high risk area. Mathematical details have been provided by Kulldorff [3]. Sometimes it is interesting to identify areas with low rates only, such as detecting low clusters of sex ratio [35]. We just need to change the direction of the second inequality. The window Z Ã attaining the maximum is defined as the Most Likely Cluster (MLC). The MLC is least likely to be a chance occurrence under the null hypothesis. And the test statistic is the l in equation 2. The significance of the test statistic is evaluated by Monte Carlo test.

A Test Statistic based on the Hypergeometric Probability Model
As with the Poisson-and Bernoulli-based statistics, the Hypergeometric-based statistic has two aims: detecting the potential clusters; and evaluating the significance of the detected clusters.
Detecting. The null hypothesis signifies complete spatial randomness with each individual in the study region, implying that each person in the study region are equally likely to become the case. There are n G individuals of which c G are cases in total. Under null hypothesis, we can think of that all n G individuals are equally likely to be 'labeled'. For a window Z with n Z people, the probability of c Z 'labeled' individuals has probability as: This is the classic application of Hypergeometric distribution, sampling from a finite population without replacement. The window Z with the minimum probability is least likely occur under the null hypothesis. This can be written: Where the Z is the same as Kulldorff's method, which is defined by equation 3. The window Z ÃÃ attaining the minimum probability is least likely occur under the null hypothesis. We may call this window the most strange window, as it have the minimum probability of satisfying the null hypothesis that p in~pout .
Test of significance. The test statistic is the l Ã in equation 5. To evaluate whether the identified cluster is statistically significant or just can be explained by random noise, the P value is obtained through Monte Carlo test, by comparing the rank of the test statistic from the real data set with those from the random replications. The test statistic is calculated for each random replication as well as for the real data set, and if the latter is among the 5 percent lowest, then the test is significant at the 0.05 level. When there are multiple clusters in the data set, many methods can deal with this problem [3,36,37].

Detection of Clusters of JE in Sichuan Province
We applied both the proposed and existing methods to analyze data on JE from 2009 in Sichuan province, China. Our analysis used both methods to investigate whether the high JE incidence is evenly spread over Sichuan province. This would examine whether any observed clusters of JE cases could be explained by chance alone, or whether there were clusters of statistical significance.
In 2009, Sichuan province had a population of 81,379,919 and consisted of 181 counties. It had a total number of 598 cases, in which 2 cases lost the geographical information. The JE cases data in Sichuan province came from CISIDCP and this data was not publicly available. The population data came from the Nation Bureau of Statistics of China (http://www.stats.gov.cn/tjsj/ndsj/). To eliminate the random noise in the incidence map, we utilized a method of empirical Bayes estimate to visualize the spatial distribution of JE in Sichuan (Figure 1). It seems that there is a high risk in the east borderline region, especially in the northeast and southeast areas.
Because of the case-population data structure, we used the Poisson model representing Kulldorff's method. The 9,999 random data were generated under the null hypothesis to evaluate the significance of the detected clusters. On a significance level of 0.05, the two methods obtained almost the same results ( Figure 2). They detected two significant clusters, the most likely cluster in the southeast with 18 counties and one secondary cluster in the northeast with 12 counties. Both clusters had a P value of 0.0001. In addtion, the two methods detected identical areas as a third likely cluster, but with different P values: 0.0909 for the existing method and 0.1334 for the new method. This area is not colored in the figure. Overall, the detected clusters account for 53.2% (317/596) of the JE cases in Sichuan province. Cao [38] had analyzed the influencing factors of JE in the southwest of China on the Prefecture-city level, an administrative division below a province and above a county in China's administrative structure, but found no statistically significant factors. As Cao presented, this is probably due to the little spatial variation of the influencing factors in the southwest of China. On the other hand, the financial support may, on a large extent, determine the spatial variation of JE in Sichuan. We ranked the counties by GDP per capita in a descending order. It turned out that 86.7% (26/30) of the counties in the significant clusters ranked in the last two thirds of the 181 counties, and 33.3% (10/30) ranked in the last one third parts. As pointed by Zhen [39], the less-developed region of Sichuan province was often short of funds for JE control, especially the remote rural areas. These constitute the high risk areas in Sichuan, which indicates that more financial and policy support is required to control JE in these areas.

Simulation Study
Benchmark data. To carry out the comparison study between the two kinds of methods, we use public domain benchmark data sets. The benchmark data sets are based on the 1990 female population in the 245 counties and county equivalents in the Northeastern United States, consisting of the states of Maine, New Hampshire, Vermont, Massachusetts, Rhode Island, Connecticut, New York, New Jersey, Pennsylvania, Delaware, Maryland and the District of Columbia. Each county is represented by a centroid coordinate. The benchmark data was created by Kulldorff, M., T. Tango and P.J. Park (2003) to investigate the performance of different statistical methods for clustering. They have been used in many research studies [40][41][42][43]. It is available at 'http://www.satscan.org/datasets.html'. The benchmark data and how it was generated has been described in detail elsewhere [40]. We provide a brief summary here.
Under the null hypothesis of no clustering, 99,999 random data sets were generated by randomly allocating 600 cases to the various counties, with probabilities proportional to the county population. The null data is used to estimate the critical values, which is the cut-off point for the significance. Hot-spot clusters were generated by setting the relative risk in some counties to be larger than 1. Different real hot-spot clusters corresponds to different combinations of population density, number of counties. For each kind of hot-spot clusters, 10,000 random data sets were generated using a multinomial probability distribution with the relative risks such that if the exact location of the real cluster was known in advance, the power to detect it should be 0.999. In order to clearly examine the performance of these methods when applied in high and low population density, we focused on the relatively extreme combinations. urban area, rural area, rural & mixed area and urban & mixed area.
Evaluation criteria. The sensitivity (SEN) and the positive predictive value (PPV) were estimated to examine the performance of different methods. The SEN and PPV of the spatial scan statistic were introduced by Huang et al [22], and can be defined in terms of either the number of regions or the population. First, we define the SEN as the probability of detecting the regions that actually constitute the cluster, i.e, proportion of the number of regions correctly detected from the true clusters.
We can also weight each region with its population, and hence obtain population based SEN and PPV.
For the SEN and PPV, the larger the better they are, with 100% being the optimal. Here, the detected clusters are defined as follows: a critical value corresponding to a 0.05 significance level was computed by identifying the 5000th highest maximum index from among the 99,999 random data sets generated under the null model. For each kind of hot spot cluster, all windows with index exceeding the critical value are candidate clusters. The window with the maximum index is reported as the most likely cluster. Then, we eliminate the remaining candidate clusters that overlap with most likely cluster, and report the one having the largest index as the second cluster. We then repeated this procedure for the third and the fourth clusters and so on. All reported clusters are detected clusters.
Simulation results. We obtain the estimate of SEN and PPV for the test statistics based on the Bernoulli, the Poisson and the Hypergeometric models, respectively, and find that the results of the two test statistics proposed by Kulldorff are exactly the same except that a few values differ slightly. Thus, we take the Poisson model as an example for Kulldorff's methods (Table 1). There are several common features between Hypergeometric and Poisson models: 1) In general, the two test statistics perform very similarly; 2) The SEN decreases as the number/size of hot spot clusters increases; 3) The PPV does not show the 2nd feature and always maintains a high level; 4) In general, the SEN based on population is greater than that based on counties, likewise for the PPV.
The SEN and PPV represent two sides of a test statistic for cluster detection. If the SEN and PPV based on one probability model both are greater than the corresponding values based on the other model, the test statistic based on the former model outweighs the latter under that scenario. From Table 1 it appears that: 1) the Hypergeometric model outweighs the Poisson model when the hot-spot cluster is in urban area, whereas the Poisson model

Discussion
First, we would like to review the motivation of this study. Our initial test statistic was the P value, but it failed in the simulation study. There is a lot of literature relating to the role of the P value and its origin, in which a great part are concerned with what is evidence in statistics and the debate over Fisher's test of significance and Neyman-Pearson's hypothesis testing. That is beyond the scope of this paper, and plenty of literature has interestingly discussed those questions [44][45][46][47][48][49][50].
The key assumption to the two kinds of test statistics is that there is no positive spatial autocorrelation, which implies that pairs of observations taken nearby are more similar than those taken far apart. As summarized by Tango [1], the positive spatial autocorrelation will be a key issue in statistical modeling of spatial epidemiology. For instance, when spatial regression is performed to determine what covariates contribute to a higher risk for a disease under study, it is critical to adjust for the spatial autocorrelation in the data. Otherwise, the risk will be overestimated, with biased p-values that are too small, providing ''statistically significant'' results when none exist. However, for detecting disease clustering or disease clusters, we should not adjust for the spatial autocorrelation since we are interested in detecting clusters due to such autocorrelation and, if they are adjusted away, important clusters might go undetected. The test statistic based on the Hypergeometric model is more similar with the one based on Poisson model in terms of the data structure, both for case-population data.
As the simulation study shows, with the identical assumption, the two kinds of methods perform similarly. The SEN and PPV based on population usually are greater than that based on counties, indicating that the cluster detection methods often ''capture'' the hot spot areas of high population, which would benefit more overall. Furthermore, the test statistic based on the Hypergeometric model performs better when used with densely populated or largely sized hot spot clusters; otherwise Kulldorff's test statistics perform better. Although it appears relatively small, given the scarce resources available to most local health departments, a greater improvement would reduce the cost to investigate potential disease outbreaks. In the application of JE in Sichuan province, the two kinds of methods identified the same cluster, which may greatly help the health department allocate relevant resources to these areas for JE prevention.
Further refinements of the new test statistic may include clusters that are not circular, but instead irregularly shaped ones. Spacetime clusters extensions to the proposed method are also straightforward.