Open data and injuries in urban areas—A spatial analytical framework of Toronto using machine learning and spatial regressions

Injuries have become devastating and often under-recognized public health concerns. In Canada, injuries are the leading cause of potential years of life lost before the age of 65. The geographical patterns of injury, however, are evident both over space and time, suggesting the possibility of spatial optimization of policies at the neighborhood scale to mitigate injury risk and to foster prevention and control within metropolitan regions. In this paper, Canada’s National Ambulatory Care Reporting System is used to assess unintentional and intentional injuries for Toronto between 2004 and 2010, exploring the spatial relations of injury throughout the city together with Wellbeing Toronto data. To corroborate these findings, spatial autocorrelation at global and local levels is assessed for the more than 1.7 million reported injuries. The sub-categorization by Toronto neighborhood further distills the most vulnerable communities throughout the city, registering a robust spatial profile throughout. Individual neighborhoods call for distinct policy profiles for injury prevention, which is one of the main novelties of this contribution. A comparison of three regression models is carried out. The findings suggest that the performance of spatial regression models is significantly stronger, showing evidence that spatial regressions should be used for injury research. Wellbeing Toronto data performs reasonably well in assessing unintentional injuries, morbidity, and falls, but less so in explaining the dynamics of intentional injuries. The results enable a framework for tailor-made injury prevention initiatives at the neighborhood level as a vital source for planning and participatory decision making in the medical field in developed cities such as Toronto.


The injury landscape
Injury is one of the leading causes of death and disability in the United States of America [1]. In Canada alone, an estimated 4. 27  injury between 2009-2010 [2]. The growing number of traumas in urban areas has become a significant public health concern [3] and fostered a negative perception of health and subjective wellbeing [4]. It is projected that by 2020, injuries will be the third foremost cause of death and disability worldwide [5]. Additionally, the repercussion of injuries from traumatic events has a temporal lag on the psychological and social adjustment of victims, jeopardizing wellbeing in general and leading to depression [6]. Injuries can be divided into two significant groups generating distinct demographic profiles with leading causes and complex characteristics of epidemiological concern [7]. On one side, unintentional injuries [8] form a leading cause of death in the population between the ages of 1 and 39. Intentional injuries, on the other hand, including assaults and suicides, rank as the second leading cause of death in people aged 15 to 39. Injuries, therefore, have direct consequences on the active population of Canadians, where three individuals die from injury-related causes every day. Further to these deaths, fifty Canadians are hospitalized due to injuries [9], which poses a severe strain on the Canadian economy and workplace [10]. Injuries currently represent over seven percent of all hospitalizations [11]. Non-fatal injuries place an additional burden on society, as many of these injuries affect the brain or spinal cord, leaving a substantial incidence of permanent disability. The cost to the health-care system in terms of waiting times is evident, given the burden on the carrying capacity of hospital systems. Geographical and temporal knowledge of injury events may help in optimizing strategies for prevention, control, and efficient monitoring.
While until recently the focus was predominantly on the individual characteristics of the injured person, advances in spatial computation and data science promote new and integrative roles for the spatial aspects of what may lie within the injury landscape at the regional level [12][13][14]. The injury landscape resonates with the concept of regional intelligence [15], where cities may have a proactive role in mitigating injury risk through ubiquitous data integration. By injury landscape, we mean the geographical topology of spatially explicit interactions of injury, where different types of injury occur with particular spatial attributes throughout a given geographic territory. This paper has the following structure. The next section, Section 1, offers a literature review of the paradigm of injury and the importance of novel approaches for injury prevention. Section 2 presents the methodology, a systematic framework of the different tools and techniques, and the data preparation steps that enable the statistical and geostatistical analytics. Section 3 discusses the results of the implemented approach for the three regions, and Section 4 offers some concluding remarks and summarizes potential future work.

Literature review
Spatial understanding of the geography of metropolitan areas is of emerging importance in regions that have witnessed rapid urbanization [16], and where the incidence of injuries is positively correlated [17,18]. Geographic Information Systems (GIS), spatial analysis, and geostatistics allow addressing regional phenomena of health-related concerns in a spatially explicit context [19]. Several studies have analyzed the integration of geographical aspects of public health. For example, Kivell and Mason (1999) used GIS to place thirty trauma centers across the United Kingdom [20]. Several authors have also used GIS to predict pedestrian injuries [21][22][23]. Research on traffic-accident information systems has optimized the capacity to assess the risk of different types of traffic collisions [24][25][26]. Specifically, in the City of Toronto, researchers have explored the spatial patterns of motor vehicle collisions leading to pedestrian injury based on the pedestrian injury type, age and location within the city [27]. Other studies have examined the relationship between crime and geographic location [28,29], child maltreatment and geographic location [30], frequency and type of drug use, which influenced the location of drug and HIV-prevention activities [31], and the likelihood of increased risk of violent injury based on racial segregation [32].
An additional aspect of the spatial patterns of injury that has been explored is the comparison of injury by type in rural versus urban areas. These studies discuss how physical space and subsequent infrastructure (i.e., access and distance to hospitals) link to injury severity and morbidity. Additionally, these studies highlight the importance of understanding the spatial nature of injury by type so that injury prevention strategies may be more accurately targeted [33][34][35].
Like this study, other studies have explored the spatial nature of injuries with ambulance datasets, including a 2010 study that used ambulance data from the City of Toronto to explore the spatial and temporal patterns of violent injury [36] and a 2012 study that used ambulance data to conduct an analysis of outdoor falls based on temporal, spatial and demographic distribution in Laval and Montréal, Canada [37].
The relationship between the spatial distribution of injuries and demographic composition of injured individuals has also been explored. For example, a 2016 study explored the cultural, social and geographic components leading to higher injury risk for Aboriginal peoples in British Columbia, Canada [38]. Another 2016 study explored injury burden caused by accidental venomous bites based on national geography and demographics in Australia [39] and a 2017 study explored the socio-and geo-demographics linked to firearm injuries in Miami-Dade County, Florida [40].
Analyses of the outcomes of injuries and how they are linked to geographic location and demographics have also been conducted in several studies, including a 2019 study that examined the association between injury mortality, geography and sex as it related to youth suicide, senior falls and transport injuries [41]. Furthermore, a study conducted by Keeves and others (2019) used electronic databases of various studies to investigate the outcomes of traumatic injury and their geographic variations globally. This study found that urban pre-hospital patients have a lower risk of mortality compared to rural patients. This research concludes that there are currently gaps in the literature with regard to determining the link between injury outcome and geography and recommends the use of geographic information systems in future studies related to the spatial distribution of injuries [42].
Despite the many contributions, limits on computational power and data availability have, in recent decades, hindered the opportunity of examining large geographical extents or comparing multiple regions simultaneously. Such studies are particularly important to proactively support regional decision-making for injury prevention and to determine key characteristics of injury distributions within urban cores [43][44][45]. Concise multi-temporal datasets for extensive studies on the injury landscape are rarely available. This study addresses this gap by assessing the complete injury landscape of Toronto. A spatial-analytical framework allows the critical characteristics of different injury types to be identified, leading to an integrative vision of the consequences and the underlying patterns of injuries in Toronto while benefiting from the open data initiatives the city has available. The integration of open data such as Wellbeing Toronto (WT) is addressed at the neighborhood level, offering insights on the potential participatory role of public health initiatives for injury prevention.

Data
2.1.1. Injury data.
The National Ambulatory Care Reporting System (NACRS) is a comprehensive database that contains demographic, diagnostic, and procedural information on all injury-related occurrences where an ambulance has been dispatched. ICD-10 codes were selected for unintentional injuries: (i) injuries resulting from external causes (ICD-10 codes S00 to T14), (ii) external morbidity and mortality (ICD-10 codes V01 to V99), and (iii) falls (ICD-10 codes W00 to W19). For intentional self-harm, the ICD-10 codes from X60 to X84 were used. Data cleaning was carried out after importing the data from its original SAS format. Counts of fewer than five events were discarded and treated as 0. A total of 1,714,512 injuries (intentional and unintentional) were registered and georeferenced by converting postal codes to latitude and longitude coordinates for the period 2004 to 2010 (Table 1).
The majority of injuries resulted from external causes of which: (i) injuries to the wrist and hand, (ii) injuries to the head, and (iii) injuries to the knee and lower leg were the most significant cause of ambulance dispatch.
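The code ranges and the small-count rule described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the category labels, the function names `classify` and `suppress_small_counts`, and the example inputs are hypothetical and are not taken from the NACRS processing itself.

```python
import re

# ICD-10 ranges quoted in the text; the category labels are illustrative names.
CATEGORIES = [
    ("external_causes", "S00", "T14"),
    ("morbidity_mortality", "V01", "V99"),
    ("falls", "W00", "W19"),
    ("intentional_self_harm", "X60", "X84"),
]


def classify(icd10_code):
    """Return the study category for an ICD-10 code, or None if out of scope."""
    match = re.match(r"^([A-Z])(\d{2})", icd10_code.strip().upper())
    if match is None:
        return None
    key = match.group(1) + match.group(2)  # e.g. "S72.0" -> "S72"
    for name, low, high in CATEGORIES:
        # A letter followed by two digits compares correctly as a plain string.
        if low <= key <= high:
            return name
    return None


def suppress_small_counts(counts, threshold=5):
    """Treat any count below the threshold as 0, mirroring the cleaning rule."""
    return {area: (n if n >= threshold else 0) for area, n in counts.items()}


print(classify("S72.0"), classify("W19"), classify("W20"))
print(suppress_small_counts({"tract_a": 4, "tract_b": 5}))
```

Matching on the first three characters keeps the lookup independent of any subdivision after the decimal point, which is a design choice of this sketch rather than a documented step of the study.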

Socio-economic data.
Wellbeing Toronto (WT) data was used to assess critical variables at the neighborhood level for Toronto (Fig 1). WT corresponds to an integrative and open approach for visualization of Toronto's 140 neighborhoods [46]. As an open data concept, it hosts a significant amount of data over three reference periods (2008, 2011, and 2014), that include crucial variables encouraging citizen participation, government accountability, and data transparency. For health analytics, these are vital requisites for successful policy implementation. The Table below shows the variables that were selected from the WT portal (Table 2).

Preliminary data organization.
The data were georeferenced using the existing postal code attribute and projected as point features, one for every incident, onto WGS84. For privacy reasons, the data were handled on a secured server, and incidents were counted by location within the nearest census tract. This resulted in a generalized geometry dataset. The generalized count polygons per category of injury were then further aggregated to the neighborhood level and projected into NAD83 UTM zone 17N. The compiled data from WT were added to the data set for further geostatistical analysis.

Global spatial autocorrelation.
Global spatial autocorrelation was tested by computing Moran's I for each injury category. The statistic was used to test the null hypothesis (H0) of no spatial clustering of injuries in Toronto (α = 0.05) (Eq 1):

$$ I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i} (x_i - \bar{x})^2} \qquad (1) $$

where n is the number of spatial units and $w_{ij}$ is an element of a binary contiguity weight matrix, with $w_{ij} = 1$ when locations i and j are adjacent and $w_{ij} = 0$ otherwise. The value $x_i$ is the observation at location i, which enters the statistic as its deviation from the mean $\bar{x}$. The statistic therefore assesses the entire spatial distribution of adjacency formed for the city of Toronto. The null hypothesis was rejected in all categories, suggesting a high spatial autocorrelation for all the injury categories in Toronto.
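As a minimal sketch of how Moran's I behaves with a binary contiguity matrix, the following Python example computes the statistic on a toy chain of six spatial units. The chain adjacency, the function names and the sample values are assumptions made purely for illustration; the sketch omits the significance test used in the study.

```python
def morans_i(values, neighbors):
    """Global Moran's I with binary weights.

    values: list of observations x_i.
    neighbors: dict mapping index i to the set of indices j with w_ij = 1.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    s0 = sum(len(neighbors[i]) for i in range(n))  # sum of all weights
    cross = sum(dev[i] * dev[j] for i in range(n) for j in neighbors[i])
    return (n / s0) * cross / sum(d * d for d in dev)


def chain_neighbors(n):
    """Hypothetical toy adjacency: unit i touches units i-1 and i+1."""
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}


w = chain_neighbors(6)
clustered = [1, 1, 1, 5, 5, 5]     # similar values sit next to each other
alternating = [1, 5, 1, 5, 1, 5]   # every neighbor is dissimilar

print(morans_i(clustered, w))      # positive value
print(morans_i(alternating, w))    # negative value
```

Values near zero would indicate a pattern compatible with spatial randomness, which is the null hypothesis tested in the text.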

Local spatial autocorrelation.
The local $G_i^*$ statistic was calculated by first determining the injury density. While several approaches allow for spatial density estimation, we considered that the importance of neighborhood demographics should hold. Thus, the neighborhood injury density was defined as the ratio of the injuries found in a neighborhood to the population count of that neighborhood. While greater spatial detail could have helped the accuracy of the assessment, one should note that the objective is related to the potential of participatory interaction of injury with available open data. In this sense, neighborhoods are the ideal geographic boundary for governance and city planning. This approach allowed for the seamless definition of injury density at a spatial level and computation of the statistic, determining the locational aggregation of injury hotspots and cold spots [47]. The calculation of the local $G_i^*$ statistic is as follows (Eq 2):

$$ G_i^* = \frac{\sum_{j=1}^{n} w_{ij}(d)\,x_j - \bar{x}\sum_{j=1}^{n} w_{ij}(d)}{S\,\sqrt{\dfrac{n\sum_{j=1}^{n} w_{ij}(d)^2 - \left(\sum_{j=1}^{n} w_{ij}(d)\right)^2}{n-1}}} \qquad (2) $$

where $x_j$ is the injury density of neighborhood j, n is the number of neighborhoods, $\bar{x}$ and S are the mean and standard deviation of the densities, and $w_{ij}$ is the spatial weight matrix following a 1 km distance (d), with $w_{ij}(d)$ assumed to be 1. The maps show densities of injury patient residences as hot spots and cold spots, with red representing the highest concentrations of injury and blue the lowest. The selection of regional socio-demographic characteristics for this analysis was guided by previous research and availability of Wellbeing Toronto data.
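The density ratio and the local statistic can be sketched in a few lines of Python. This is a minimal illustration, assuming binary weights, a window that includes the neighborhood itself, and a hypothetical chain of seven neighborhoods with made-up injury and population counts; it is not the software used in the study.

```python
import math


def injury_density(injuries, population):
    """Injuries per resident for each neighborhood."""
    return [i / p for i, p in zip(injuries, population)]


def getis_ord_gi_star(values, neighbors):
    """Return the G_i^* z-score for each location (binary weights, self included)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(x * x for x in values) / n - mean * mean)
    scores = []
    for i in range(n):
        window = set(neighbors[i]) | {i}   # G_i^* counts location i itself
        w_sum = len(window)                # sum of w_ij, equal to sum of w_ij^2 when binary
        local_sum = sum(values[j] for j in window)
        denom = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores.append((local_sum - mean * w_sum) / denom)
    return scores


# Hypothetical toy data: high densities at one end of a chain, low at the other.
injuries = [90, 80, 80, 30, 10, 10, 10]
population = [1000] * 7
neighbors = {i: {j for j in (i - 1, i + 1) if 0 <= j < 7} for i in range(7)}

density = injury_density(injuries, population)
z = getis_ord_gi_star(density, neighbors)
print([round(v, 2) for v in z])  # positive scores mark hot spots, negative scores cold spots
```

The score is undefined when a window covers every location, because the denominator becomes zero; the toy chain avoids that case.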

Regression framework.
Key demographic variables available from Wellbeing Toronto were screened by means of a stepwise regression through backward elimination. This allowed for a successful preliminary selection of variables that were applied to three distinct regression frameworks: (i) a spatial lag model, (ii) a spatial error model, and (iii) a nonspatial ordinary least squares model used to compare performance. The spatial lag model (SL) (Eq 3) captures spatial dependency by adding a spatially lagged dependent variable to the regression:

$$ y = \rho W y + X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I) \qquad (3) $$

where y is the vector of observations of the dependent variable, W is the spatial weight matrix, ρ is the spatial autoregressive coefficient, X is the matrix of explanatory variables with coefficients β, I represents an identity matrix, and N(0, σ²I) indicates that the errors follow a normal distribution with mean equal to zero and constant variance. When ρ is zero, the lag-dependent term is canceled out, leaving the model in the Ordinary Least Squares (OLS) form. When ρ is not zero, spatial dependency exists and non-random spatial interactions are present [48]. As for the spatial error model (Eq 4), the spatial dependency ξ is accounted for within the error term ε, assuming the errors of the model to be spatially correlated [49].
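For orientation, the two spatial specifications can be written side by side in one common textbook notation. This is a sketch only: the symbols λ and u are assumptions introduced here and are not taken from Eq 4, and the reduced form assumes that the matrix I − ρW is invertible.

```latex
% Spatial lag model and its reduced form
y = \rho W y + X\beta + \varepsilon
\;\Longrightarrow\;
y = (I - \rho W)^{-1}\,(X\beta + \varepsilon)

% Setting \rho = 0 removes the lag term and recovers OLS
y = X\beta + \varepsilon

% Spatial error model: the dependence enters through the disturbance u
y = X\beta + u, \qquad u = \lambda W u + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I)
```

The reduced form makes explicit why a non-zero ρ matters: each observation then depends on the explanatory variables and errors of its neighbors, and of their neighbors in turn, through the inverse term.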

Exploratory data analysis
The Figure below exemplifies the categorization of injuries based on external causes (Fig 2).
Concerning unintentional injuries in Toronto between 2004 and 2010, a total of 1,602,996 events were recorded for the category of external causes. For external morbidity and mortality, a total of 22,888 events were registered, and for falls, a total of 86,751 events led to ambulance dispatch. These categories constituted the bulk of the data, as intentional injuries corresponded to only 1,877 events, short of 0.12 percent of the total data set.
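The category counts reported above can be checked against the overall total with a few lines of Python. The dictionary keys are illustrative labels; the figures are those given in the text.

```python
# Counts per injury category as reported in the text.
counts = {
    "external causes": 1602996,
    "morbidity and mortality": 22888,
    "falls": 86751,
    "intentional self-harm": 1877,
}

total = sum(counts.values())
shares = {name: 100 * n / total for name, n in counts.items()}

print(total)  # matches the 1,714,512 injuries reported for 2004 to 2010
for name, pct in shares.items():
    print(f"{name}: {pct:.2f} percent")
```

The four categories add up exactly to the reported total, and intentional self-harm accounts for roughly 0.11 percent of all events.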

Spatial autocorrelation
3.2.1. Global spatial autocorrelation.
Testing for spatial autocorrelation through Moran's I statistic for each event type provided evidence of significant spatial autocorrelation for all injury categories at the global level (Table 3). Despite regional differences in the rates of unintentional and intentional injuries, the spatial patterns of the residences of those injured by unintentional or intentional mechanisms were found to be highly spatially autocorrelated (p < 0.01 for each injury type), indicating that the residence locations of those injured by each of these mechanisms were not randomly distributed in the city of Toronto. This suggests a high spatial clustering that justified further local exploration.
Highest Moran's I values were registered for (a) Injuries to unspecified parts of the trunk, limb or body region, (b) Injuries involving multiple body regions, and (c) Injuries to the neck. While (a) and (b) suggest anatomically more extensive regions, injuries to the neck are quite specific and may become a cause for serious concern given the propensity for physical disability, recovery time, and additional cost to care. The spatial aspects of this injury analysis overall lead to a pressing conclusion that there are clearly geographical determinants that should be assessed to understand the landscape of injury (Table 4).
As expected, all indices remained high, with falls showing very strong spatial autocorrelation, followed by injuries and injuries leading to mortality. Intentional injuries had the lowest Moran's I, although the value still indicates very strong spatial autocorrelation. Local spatial autocorrelation allows us to assess the neighborhoods at a local scale through the integration of hotspots.

Local spatial autocorrelation.
The calculation of the local $G_i^*$ statistic allowed for the exploration of the spatial distributions of hotspots and their significance levels for the categories of: (i) unintentional injury (external causes), (ii) unintentional injury (resulting in morbidity and mortality), (iii) unintentional injury (due to falls), and (iv) intentional injury (self-harm). A first-order queen contiguity weight matrix was generated for the 140 neighborhoods, with a minimum of 3 and a maximum of 11 neighbors (Fig 4). The mean and median numbers of neighbors were 5.96 and 6.00, respectively, and 4.26 percent of the weight matrix entries were non-zero. Fig 3 depicts the queen contiguity map for neighborhoods in Toronto. The most intriguing aspect of these distributions, besides the clear evidence of hotspots and cold spots, was the unique spatial profile of each injury type (Fig 4A-4D). Red represents hotspots, or areas with high injury density, and blue represents cold spots, or areas of low or no density of injury. All injury types depict a distinctive pattern.
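The summary statistics of a queen contiguity matrix can be illustrated on a regular grid. The sketch below is an assumption-laden toy: real neighborhoods are irregular polygons, and the function names `queen_neighbors` and `summarize` are hypothetical. It does show how the share of non-zero entries follows from the mean number of neighbors divided by the number of units, which for the reported figures gives 5.96 / 140, or about 4.26 percent.

```python
from statistics import mean, median


def queen_neighbors(rows, cols):
    """First-order queen contiguity on a grid: cells sharing an edge or a corner."""
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            neighbors[r * cols + c] = {
                rr * cols + cc
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c)
            }
    return neighbors


def summarize(neighbors):
    """Neighbor-count summary of a binary weight matrix stored as neighbor sets."""
    sizes = [len(v) for v in neighbors.values()]
    n = len(sizes)
    return {
        "min": min(sizes),
        "max": max(sizes),
        "mean": mean(sizes),
        "median": median(sizes),
        "pct_nonzero": 100 * sum(sizes) / (n * n),  # non-zero share of the n x n matrix
    }


summary = summarize(queen_neighbors(4, 4))
print(summary)

# Consistency check on the figures reported for Toronto's 140 neighborhoods.
print(round(100 * 5.96 / 140, 2))
```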

Regression results
The table below (Table 5) compares the three distinct models. Three of the four injury categories showed moderate performance, suggesting that the data available at Wellbeing Toronto may well support decision making and community participation at the neighborhood level for injury analysis and integration. In all cases, the spatial regressions outperformed ordinary least squares, with significant improvements in the r² statistic throughout. The intentional injury model, however, showed a low r², suggesting that demographic data do not sufficiently explain the reasons for self-harm. Finally, it is important to note that injury categories have different variables for each explanatory model, suggesting that there should be different policies and preparedness integration within the city's public health decisions. The following variables were selected through the initial backward elimination as consistent for the models: i.

SOM cluster results
Analysis of health geography is highly important as it aids in providing evidence of possibly unknown risk factors that may be quantified and better understood only if they are explored spatially [50]. In addition to the regression models (discussed in section 3.3), self-organizing maps (SOM) were built based on the regressors (variables) included in the regression models.

PLOS ONE

In the evaluation of health geography, SOM is a highly useful tool that is used to identify outliers in a dataset [51]. In this analysis, SOM has been used to identify which variables (the attributes or characteristics) are most correlated to injury by type in the City of Toronto neighbourhoods. SOM clusters were generated for each type of injury in the regression model, including unintentional injuries, morbidity, falls and intentional injuries. The SOM built for unintentional injuries included four clusters. In cluster 1, total population and traffic collisions were the variables that were most strongly correlated with unintentional injuries. This cluster was spatially located in the northeast and northwest peripheries as well as the south-central neighbourhoods in Toronto. In cluster 2, seniors living at home, social housing, and population density were the variables that were most strongly correlated with unintentional injuries. This cluster was spatially distributed throughout Toronto and was less prevalent in south-central Toronto. In cluster 3, traffic collisions, social housing, seniors living at home, total population, population density (ordered from most to least correlated) were the variables that were most strongly correlated with unintentional injuries. This cluster was also spatially distributed throughout Toronto but was more prevalent in south-central Toronto. In cluster 4, population density was most strongly correlated with unintentional injuries. This cluster was represented in a single neighbourhood, located in Toronto's city center. The results of the heatmaps for the regressors (for unintentional injuries) have been summarized in Table 6, which shows a breakdown of the cluster that each variable is highly correlated with.
These results show that clusters 1 and 2 have the highest number of variables correlated with unintentional injuries, whereas cluster 4 has fewer variables correlated with unintentional injuries and cluster 3 has variables that are only moderately correlated with unintentional injuries.
The SOM built for morbidity also included four clusters. In cluster 1, seniors living at home and social housing were the variables that were most strongly correlated with morbidity. This cluster was spatially distributed throughout Toronto but was less prevalent in the northeast part of the city. In cluster 2, seniors living at home, traffic collisions, total population, and visible minorities (ordered from most to least correlated) were the variables that were most strongly correlated with morbidity. This cluster was spatially distributed throughout Toronto but was more prevalent in the north, south, central and southwest. In cluster 3, total population and visible minorities were the variables most strongly correlated with morbidity; however, seniors living at home was also strongly correlated with morbidity. This cluster was represented in only two neighbourhoods, one in south-central and the other in east Toronto. In cluster 4, traffic collisions and area were most strongly correlated with morbidity. This cluster was also only represented in two neighbourhoods, one in northeast and the other in northwest Toronto. The results of the heatmaps for the regressors (for morbidity) have been summarized in Table 6, which shows a breakdown of the cluster that each variable is highly correlated with. These results show that clusters 3 and 4 have the highest number of variables correlated with morbidity, whereas clusters 1 and 2 have fewer variables correlated with morbidity.
The SOM built for falls, again, included four clusters. In cluster 1, seniors living alone, social assistance recipients, social housing, traffic collisions, total population, and visible minorities (ordered from most to least correlated) were the variables most strongly correlated with falls. This cluster was distributed throughout Toronto and was the dominating cluster, representing the majority of the city. In cluster 2, social housing was most strongly correlated with falls. This cluster was only represented in two neighbourhoods, both located in south-central Toronto. In cluster 3, total population and visible minorities were most strongly correlated with falls. This cluster was also only represented in two neighbourhoods, one located in south-central and the other located in east Toronto. In cluster 4, traffic collisions and area were the variables most strongly correlated with falls. Like clusters 2 and 3, this cluster was also only represented in two neighbourhoods, one in northeast and the other in northwest Toronto. The results of the heatmaps for the regressors (for falls) have been summarized in Table 7, which shows a breakdown of the cluster that each variable is highly correlated with. These results show that clusters 1, 3 and 4 have the highest number of variables correlated with falls, whereas cluster 2 has fewer variables correlated with falls.
The SOM built for intentional injuries only included three clusters (Table 8). In cluster 1, low-income population, low-income family, total population, rented dwelling, and population density (ordered from most to least correlated) were strongly correlated with intentional injuries. This cluster was distributed throughout Toronto and was the dominating cluster, representing most neighbourhoods in the city. In cluster 2, rented dwelling, robberies, assaults, low-income population, low-income family, and total population (ordered from most to least correlated) were the variables strongly correlated with intentional injuries. This cluster was represented in several neighbourhoods, all spatially located in south-central Toronto. In cluster 3, rented dwelling and population density were strongly correlated with intentional injuries. This cluster was only represented in two neighbourhoods, both located in central Toronto. The results of the heatmaps for the regressors (for intentional injuries) have been summarized in Table 9, which shows a breakdown of the cluster that each variable is highly correlated with. These results show that cluster 2 has the highest number of variables correlated with intentional injuries, whereas clusters 1 and 3 have fewer variables correlated with intentional injuries.
Seniors living alone and traffic collisions were strongly correlated with the majority of clusters for unintentional injuries, morbidity, and falls, indicating that these variables may be more likely to contribute to these types of injuries than the other variables included in this analysis. Rented dwelling and low-income population were strongly correlated with the clusters for intentional injuries, indicating that intentional injuries are more likely to occur in poorer (or low-income) Toronto neighbourhoods. Overall, clusters that were represented by a larger number of neighbourhoods tended to have a higher number of variables correlated with each injury type, while smaller clusters tended to have fewer (or more specific) variables associated with injury type. Population density and rented dwellings were variables that tended to be associated with locations in central Toronto (i.e., the neighbourhoods that have higher population density compared to the city's peripheries).
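To make the clustering step more concrete, the following Python sketch trains a very small one-dimensional self-organizing map and assigns each observation to its best matching node. It is a simplified stand-in, not the SOM configuration used for the results above: the map size, learning schedule, initialization and the two made-up variables (a population count and a collision count) are all assumptions for illustration.

```python
import math
import random


def standardize(rows):
    """Z-score each column so that no variable dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def best_matching_unit(nodes, x):
    """Index of the node closest to observation x (squared Euclidean distance)."""
    return min(range(len(nodes)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(nodes[k], x)))


def train_som(data, n_nodes, epochs=200, lr0=0.5, sigma0=1.0, seed=0):
    """Train a 1-D SOM with linearly decaying learning rate and neighborhood width."""
    rng = random.Random(seed)
    # Initialise nodes from evenly spaced observations (an arbitrary, deterministic choice).
    step = (len(data) - 1) / max(n_nodes - 1, 1)
    nodes = [list(data[round(k * step)]) for k in range(n_nodes)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)
        sigma = sigma0 * (1 - frac) + 0.01
        order = list(range(len(data)))
        rng.shuffle(order)
        for idx in order:
            x = data[idx]
            bmu = best_matching_unit(nodes, x)
            for k, node in enumerate(nodes):
                # Nodes close to the winner on the map are pulled along with it.
                h = math.exp(-((k - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(len(x)):
                    node[d] += lr * h * (x[d] - node[d])
    return nodes


# Hypothetical neighbourhood profiles: [total population, traffic collisions].
raw = [[1000, 2], [1100, 3], [900, 2], [5000, 20], [5200, 22], [4800, 19]]
data = standardize(raw)
nodes = train_som(data, n_nodes=2)
labels = [best_matching_unit(nodes, x) for x in data]
print(labels)  # the first three and the last three profiles fall into separate clusters
```

Inspecting the trained node vectors variable by variable is what the heatmaps in the text summarize: a node with a high value on a variable marks a cluster in which that variable is prominent.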

Conclusions
Recent advances in geocomputational methods, as well as spatial analysis, have brought new techniques that better enable the understanding of spatial characteristics of cities and regions [52]. It is of utmost importance to understand regional patterns of epidemiologic concern, to better optimize public health efficiency in rapidly changing regions [53]. In this sense, geocomputational methods, when combined with large spatially explicit data, allow for significant contributions to the regional understanding of injury dynamics. Supported by data availability, open data at the city level may have a profound impact on the assessment and resulting community and policy intervention strategies for neighborhoods. The application of geocomputational techniques to the National Ambulatory Care Reporting System has shown that the pattern of the residence locations of injured persons is not spatially random but clearly spatially dependent. There is some disagreement in the literature regarding the effects of immigration status on health and violence. A number of authors have shown that population health determinants such as income and social status, education, employment or working conditions, social and physical environments, personal health practices, healthy child development, biologic and genetic endowment, health services, sex, and culture have a relationship with injury patterns [54][55][56]. Others have argued that the distinction between intentional and unintentional injury is arbitrary [57,58] and that the risk factors associated with intentional and unintentional injury overlap [59][60][61]. Based on these lines of previous work, we would have expected the spatial distributions of the residences of those injured by these disparate mechanisms to overlap.
However, ours is the first study to demonstrate that the spatial distributions of residence locations were similar regardless of whether the mechanism of injury was intentional or unintentional. This finding was consistently seen in the choice of selected variables, despite marked differences in size, economy, and cultural composition. The slightly larger areas of hotspots of home locations of those injured unintentionally may either reflect a difference of the aforementioned factors or simply be related to the larger number of persons injured unintentionally. Finally, the most resounding conclusion is that injury can greatly benefit from tailor-made injury prevention initiatives that address the specificities of neighborhoods and types of injury to guarantee a successful mechanism of injury prevention at the local level.