Applying Machine Learning and Geolocation Techniques to Social Media Data (Twitter) to Develop a Resource for Urban Planning

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.


Introduction
The World Bank has declared that data are the next deprivation to end, arguing that the lack of data causes many of the world's poorest populations to be overlooked when resources are allocated to address their essential needs [1]. Data deprivation is a pressing challenge: as many as 74% of the global and 97% of the Sub-Saharan African population live in countries without adequate vital registration [2]; one-third of countries lack any poverty statistics [1]; and only 17% of the estimated road traffic deaths are reported in official figures of low-income countries [3]. Without data to inform national and urban policies, the gap between low- and high-income countries will worsen [4]. However, while official statistics are poor, data in the hands of private providers are plentiful, populated by the rapid expansion of mobile phones and social media.
In this project we test whether privately maintained data can be transformed into a resource to better understand development challenges. Private data have been used to characterize populations, from determining poverty to understanding public emotions [12][13][14][15][16][17]. Here, we use private data to describe the urban environment that affects those populations, specifically analyzing events reported on social media that affect people's safety, such as road traffic crashes, crime or floods. We focus on road traffic crashes (RTCs). Although RTCs are the number one cause of death for children and young adults aged 5-29 years, the lack of adequate data on RTCs is a recognized and unmet challenge [18]. The objective is to improve RTC data for urban planners so they can contribute to addressing the high toll of road deaths, estimated globally at 1.35 million a year [3]. Our case study is Kenya, a country with high road mortality, where the official figures are said to underestimate the number of fatalities by a factor of 4.5 [3].
The United Nations' Sustainable Development Goal (SDG) 3 sets a target to halve road mortality by 2020; progress has been slow, and the target was moved to 2030. The Stockholm Declaration by the Third Global Ministerial Conference on Road Safety "Achieving Global Goals 2030" reiterated the call for country investments in road safety, ranging from legislation and regulation, safe urban and transport design, and safe modes of transport and vehicles to modern technologies for crash prevention, trauma care, and urban management. However, resource constraints make it unlikely that countries will be able to meet all of these goals. Instead, countries should strategically invest for the greatest impact. This requires knowing where and when crashes happen, so that resources can be targeted to risky locations and times.
Social media data, with all their biases, can contribute to filling some of the data gaps for urban analysis, planning and management [19]. In this study, we create an algorithm that classifies transport-related tweets into geolocated RTCs for Nairobi. This is done by building on existing literature to test two natural language processing algorithms to identify crash reports [20,21], developing an improved geoparsing algorithm to extract data on crash time and location [22][23][24][25][26][27][28], and ground truthing the results. The paper also contributes to a broader literature that uses machine learning methods for road safety analysis [29][30][31].
This study innovates on three fronts and demonstrates the value of using social media to expand data availability. (1) Geospatial Twitter data analysis usually uses the approximately 1% of tweets that have a geolocation tag [32][33][34]; we improve on this by using a machine learning geoparsing algorithm to leverage the 99% of tweets that do not contain a geotag. (2) To our knowledge there are no other studies that physically validate the locational accuracy of tweets in real time. Among verified tweets, 92% were found to be valid crashes, demonstrating the validity of crowdsourced crash data. (3) The work created an essential resource by generating one of the first real-time maps of RTCs in an African city (Nairobi). We identify 52,228 crash reports and geolocate those with enough information provided in the text (32,991 of them). In a context where there is no systematic georeferenced data on crashes to support policy planning, the process outlined here could be used to capture these data for cities all over the world that need this essential resource.
Overall, the method expands the coverage of road crashes that can be used to analyze road safety and to prioritize policy action around the locations where crashes occur more often. This is especially useful in country contexts where the only data available for analysis are aggregated statistics on total fatalities in the country, with no detailed breakdown of location or time. Crowdsourced data can serve as an additional input that helps policymakers better understand the situation. By using a clustering algorithm to identify and rank crash locations, we find that the top 15% of crash clusters (66 of 435) account for half of all crashes. Knowing that a small portion (<1%) of the road network hosts 50% of RTCs in the crowdsourced data can help reduce an intractable problem to a more manageable one. This analysis shows the potential for using these data to complement road safety diagnostics and to guide investments and planning in road safety in Kenya and in other contexts, especially those with similar data deficiencies and with sufficient social media density, like India and the Philippines [35].
The approach can be extended to other events reported on social media, whether related to disaster relief, crime, personal safety, urban mobility, or road maintenance. The work on disaster relief and response makes prominent use of geoparsing of tweets [36][37][38][39][40][41][42][43]. Geoparsing of tweets that lack geolocation information could enable more comprehensive crime analytics [44][45][46]. Improved algorithms can lead to faster and better geolocation of events, which would help urban planners and policy makers improve responses and better target interventions.

Method
The goals of this analysis are to create data on road crashes with times and locations and to understand how these incidents cluster in the city, which allows for the spatial prioritization of urban investments in road safety. The technical challenges this study addresses are to: i) improve the protocols for geolocation, ii) apply AI to classify tweets reporting crashes and identify their location from multiple geographical references, and iii) cluster the crashes geographically and identify areas with many crashes. See the Supplemental Information (SI) for the detailed methodology. The components are as follows:
1. Scrape data. We scrape 874,588 tweets posted by Ma3Route, an existing urban mobility platform with 1.1 million followers, from its inception in May 2012 through July 2020 (see SI for examples of tweets and for a figure of the daily number of tweets across time).
2. Develop and augment a gazetteer. We build a gazetteer of landmarks for the five counties that constitute the Nairobi metro area using OpenStreetMap, Geonames and Google Places. The gazetteer includes the landmark name, geocoordinates and type of landmark (e.g., school, bus stop). We augment the gazetteer with consecutive combinations of 2 and 3 words (known as n-grams) and skip-grams of landmark names, alternate spellings and abbreviations, and landmark names split on select punctuation (e.g., slashes, parentheses, commas). We innovate by developing alternate names that exclude the landmark type from the name (e.g., excluding "Hotel" from the name).
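As an illustration only, the augmentation step could be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the study: the function names, the `TYPE_WORDS` list and the example landmark names are hypothetical, and alternate spellings and abbreviations are not covered.

```python
import re
from itertools import combinations

# Hypothetical list of landmark-type words; the study's actual list is not given here.
TYPE_WORDS = {"hotel", "school", "mall"}

def ngrams(words, n):
    """Consecutive n-word combinations of a landmark name."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def skipgrams(words):
    """Ordered word pairs that skip at least one word in between."""
    return [f"{words[i]} {words[j]}"
            for i, j in combinations(range(len(words)), 2) if j - i > 1]

def augment(name):
    """Return the set of alternate names generated for one gazetteer entry."""
    variants = set()
    # Split the name on select punctuation (slashes, parentheses, commas).
    parts = [p.strip() for p in re.split(r"[/(),]", name.lower()) if p.strip()]
    for part in parts:
        words = part.split()
        variants.add(" ".join(words))
        for n in (2, 3):
            variants.update(ngrams(words, n))
        variants.update(skipgrams(words))
        # Alternate name that excludes the landmark type (e.g., "hotel").
        stripped = [w for w in words if w not in TYPE_WORDS]
        if stripped and len(stripped) < len(words):
            variants.add(" ".join(stripped))
    return variants
```

For a hypothetical entry "Example Hotel (Annex)", the sketch yields "example hotel", "example" and "annex", so a tweet that drops the word "hotel" can still be matched.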
3. Develop a truth dataset. We develop a truth dataset to train the algorithm. Taking all tweets for July 2017-July 2018, we restrict tweets to the ones most likely related to a crash based on a broad list of words and their variations. Each tweet is manually coded, indicating (1) if the tweet reported a crash and (2) the approximate latitude and longitude of any reported crash whenever enough information is provided. A total of 9,480 tweets were coded, of which 69% (6,602) reported a crash and of these, 63% (4,192) identified an approximate location of the crash. On average, users posted to Twitter 10 crash reports per day that could be geolocated.
4. Identify RTCs and their location. We use a three-step process to convert unstructured crowdsourced text into a dataset. The first is to identify relevant reports from hundreds of thousands of reports. The second is to extract necessary information from the relevant reports. The third is to consolidate unique record information from multiple reports of the same event. In Figure 1, we illustrate how the algorithm works to classify and geolocate RTCs. We use the tweet "Bad accident on Waiyaki Way next to Kianda heading towards ABC Place."
We restrict the analysis to tweets that contain keywords from a broad list of English and Kiswahili road safety terms such as "accident" or "overturn." This approach follows previous research and allows for misspellings [20]. We use natural language processing to classify and exclude tweets that contain road safety keywords but discuss road safety conditions rather than specific crash events (e.g., "terrible drivers keep causing crashes"). We test two approaches that analyze the combination of words in a tweet: Naive Bayes and support vector machines (SVM).
(b) Geolocate reports. We extract all landmarks and roads that have an exact match between the gazetteer and the tweet. In Figure 1, "kianda" and "abc place" match several entries in the gazetteer. We also extract misspelled matches based on a Levenshtein distance threshold that varies with the length of the n-gram, matches based on the word following a preposition, and matches based on intersections between multiple roads.
Existing geoparsers extract all possible location references without identifying the unique location that makes the data useful. We resolve two technical challenges to find the location of the crash:
i. When multiple locations are mentioned in the tweets, we use prepositions to sort locations into tiers, based on the probability of a location being correct after a particular preposition.
For example, in Figure 1, "next to" is ranked as tier 1 while "toward" is ranked as tier 6, resulting in the correct geolocation for the crash at "kianda" and not "abc place".
ii. When a name refers to multiple landmarks, we adopt a toponym resolution approach. In Figure 1, more than 6 landmarks across Nairobi have "kianda" in the name. We resolve the toponym in three steps: (1) we search for landmarks that are within 500 m of a road if it is mentioned, (2) we use the centroid of the clustered location if 90% or more of the landmarks are in a 500 m radius, or (3) we rank the landmarks by the probability of being correct using the landmark type in the truth data (see SI for statistics on location type). In the example, we use "Waiyaki Way" to narrow down the "kianda" landmarks to those within a 500 m radius (from 6 to 3) and then use the centroid as the crash location.
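The two resolution steps above can be sketched as follows. This is a simplified illustration, not the study's code. It assumes coordinates already projected to meters, fills in only the two preposition tiers named in the text (treating "towards" like "toward"), measures the 500 m radius around the centroid of the candidates, and omits step (3), the ranking by landmark type. All function and variable names are hypothetical.

```python
from math import dist

# Only the tiers stated in the text are included; "towards" is assumed to
# share the tier of "toward". The remaining tiers are not reproduced here.
PREPOSITION_TIER = {"next to": 1, "toward": 6, "towards": 6}

def pick_by_tier(mentions):
    """mentions: list of (preceding_phrase, landmark_name).
    Return the name whose preceding phrase has the lowest tier, or None."""
    ranked = [(PREPOSITION_TIER[p], name) for p, name in mentions
              if p in PREPOSITION_TIER]
    return min(ranked)[1] if ranked else None

def resolve_toponym(candidates, road_points=(), radius_m=500.0, share=0.9):
    """candidates: (x, y) positions in meters of landmarks sharing one name.
    road_points: (x, y) points sampled along a road mentioned in the tweet."""
    # Step 1: keep only landmarks within radius_m of the mentioned road.
    if road_points:
        near = [c for c in candidates
                if any(dist(c, r) <= radius_m for r in road_points)]
        if near:
            candidates = near
    if not candidates:
        return None
    # Step 2: use the centroid if at least `share` of the landmarks
    # fall within radius_m of it.
    cx = sum(x for x, _ in candidates) / len(candidates)
    cy = sum(y for _, y in candidates) / len(candidates)
    inside = sum(dist(c, (cx, cy)) <= radius_m for c in candidates)
    if inside / len(candidates) >= share:
        return (cx, cy)
    # Step 3 (ranking by landmark type) is not sketched here.
    return None
```

For the example tweet, `pick_by_tier([("next to", "kianda"), ("towards", "abc place")])` selects "kianda", after which `resolve_toponym` narrows the landmarks sharing that name using points along the mentioned road.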
We define a correct geoparse as one located within 500 m of the coordinates in the truth dataset.As a benchmark, we compare our algorithm to the Location Name Extraction tool (LNEx), which was shown to have better accuracy than other geoparsers [40].As LNEx and other geoparsers are not designed to extract one unique location from text [26,40,47], we first judge performance by examining whether any location references are near the true coordinates.
Next, we define the crash location as determined by LNEx to be the centroid of all locations it finds in the tweet and compare this with the unique location identified by our algorithm.
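The evaluation rule described above can be illustrated with a short sketch. The 500 m threshold comes from the text; the haversine formula, the function names and the example coordinates are assumptions introduced here for illustration, and the simple mean of latitudes and longitudes is only a reasonable centroid approximation over short distances.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters

def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))

def is_correct(predicted, truth, threshold_m=500):
    """A geoparse counts as correct if it lies within threshold_m of the truth."""
    return haversine_m(predicted, truth) <= threshold_m

def any_near(locations, truth, threshold_m=500):
    """First benchmark: is any extracted location reference near the truth?"""
    return any(is_correct(loc, truth, threshold_m) for loc in locations)

def centroid(locations):
    """Second benchmark: a single location taken as the centroid of all references."""
    lats, lons = zip(*locations)
    return (sum(lats) / len(lats), sum(lons) / len(lons))

# Hypothetical coordinates, for illustration only.
truth = (-1.2600, 36.7800)
extracted = [(-1.2610, 36.7800), (-1.3000, 36.9000)]
print(any_near(extracted, truth))              # one reference is about 111 m away
print(is_correct(centroid(extracted), truth))  # the centroid is far from the truth
```

The example shows why the two benchmarks can differ: a geoparser may capture a reference near the true coordinates and still miss the crash location once all references are averaged.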
(c) Identify unique reports.To avoid over-counting, we develop a clustering algorithm that uses time and location to identify which tweets refer to the same crash.In Figure 1, five tweets report a crash within two hours of each other, referencing different landmarks that are all close together.
To develop reasonable parameters for clustering, we manually identify tweets that report the same crash in the truth dataset based on the time, location and crash characteristics. The 4,192 crash reports are clustered into 2,648 unique crashes. For unique crash clusters, 97% of tweets reported landmarks within 500 m and within 4 hours of each other (see additional details in SI for how parameters were chosen).
Table 1 (table body not recovered apart from the values 0.656 and 0.774). Notes: 'N Crashes' refers to the number of correctly identified crashes. 'Raw Gaz' refers to the raw gazetteer (i.e., dictionary of landmarks with original names) and 'Aug Gaz' refers to the augmented gazetteer. We use our raw gazetteer as an input into LNEx, which implements its own augmentation process. For LNEx, the crash location is determined by taking the centroid of all locations captured by the algorithm. Locations are considered close if they are within 500 meters of each other.
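A minimal sketch of how reports could be grouped with the 500 m and 4 hour parameters is given below. The greedy, single-linkage grouping is an assumption made for illustration and is not necessarily the clustering procedure used in the study. Coordinates are assumed to be projected to meters, and the names are hypothetical.

```python
from datetime import datetime, timedelta
from math import dist

def cluster_reports(reports, max_m=500.0, max_gap=timedelta(hours=4)):
    """reports: list of (timestamp, (x_m, y_m)).
    Returns a list of clusters, each a list of reports of one crash."""
    clusters = []
    for when, xy in sorted(reports):
        for cluster in clusters:
            # Join the first cluster that has a member close in space and time.
            if any(abs(when - t) <= max_gap and dist(xy, p) <= max_m
                   for t, p in cluster):
                cluster.append((when, xy))
                break
        else:
            clusters.append([(when, xy)])
    return clusters

# Hypothetical reports: three chained reports, one far away, one much later.
t0 = datetime(2018, 1, 1, 8, 0)
reports = [
    (t0, (0, 0)),
    (t0 + timedelta(hours=1), (300, 0)),
    (t0 + timedelta(hours=2), (600, 0)),
    (t0 + timedelta(hours=1), (5000, 0)),
    (t0 + timedelta(hours=10), (0, 0)),
]
print([len(c) for c in cluster_reports(reports)])
```

Because the grouping chains through intermediate reports, the report at 600 m joins the first cluster via the report at 300 m even though it is more than 500 m from the first report.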
(d) Ground truth. To ensure that the crowdsourced data are reliable and provide correct information, we conduct a ground-truthing exercise to validate the quality of the data and the performance of the underlying algorithm. We processed tweets in real time and dispatched a motorcycle delivery service (Sendy) to the site of the crash within minutes. The Sendy driver was tasked with verifying and reporting whether a crash actually happened in that location. If a driver could not see the crash, they were instructed to ask a bystander whether a crash had occurred but was cleared or whether a crash occurred nearby. Drivers were able to arrive at the crash location quickly; the median time between being alerted of a crash and arriving at the scene was 26 minutes.

Results
The methods laid out here created a georeferenced RTC dataset that was previously unattainable and produced one of the first real-time maps of RTCs in Nairobi. We classify 52,228 tweets as crash-related out of a universe of 874,588 tweets during 2012-2020 (Panel A of Figure 2). This is based on the SVM algorithm, which we find performs better than the Naive Bayes algorithm according to the F1 statistic (see Table S4 in the SI). We geolocate 32,991 time-stamped crash tweets from August 2012 to July 2020 and cluster them into 22,872 unique geolocated crashes (panels B and C of Figure 2 show the unique crashes generated by Twitter daily using the algorithm and clustering). In our truth dataset, where we manually coded each crash-related tweet, we found that 63% of tweets contain enough information to be geolocated. Assuming the same proportion of tweets contain enough information to be geolocated in the full dataset, we would expect 32,903 tweets with enough location information. This aligns almost perfectly with the number of tweets that the algorithm is able to geolocate.
The ground-truthing exercise confirms the validity of the crowdsourced data. We find that of the 73 crash-related tweets physically verified, 92% correctly corresponded to a crash near the estimated location: in 32.8% of cases the driver witnessed the crash scene, in 57.5% the driver did not see the crash but was told by a bystander that a crash occurred and was recently cleared, and in 1.4% the driver reported that the crash did not occur at the specified location but nearby.
Furthermore, using our truth dataset to benchmark shows that our algorithm performs significantly better than the current geoparsing standard. Our algorithm's recall rate of 65% is a five-fold improvement in performance compared to the LNEx algorithm (13% recall) in identifying the unique location of a crash (Table 1). This is largely because LNEx is not designed to identify a unique location when multiple locations are mentioned. Our algorithm performs 25% better than LNEx even when comparing whether any location extracted from the tweet is near the true location.
Analyzing the crash data produced using our algorithm and focusing on the truth dataset within the city limits of Nairobi, we find that all crashes from July 2017 to July 2018 can be found in 435 clusters, each with a maximum diameter of 300 m. Of these clusters, 67% have two or more crashes and there are 56 clusters with 10 or more crashes. Additionally, 66 crash clusters represent over 50% of all the crashes. When looking at the 7.5 years of crowdsourced data for the city of Nairobi, the number of crash clusters does not grow linearly, implying that the locations where crashes occur and are reported in Twitter are consistent across years. Only 14% of crash locations have only a single crash, and there are 443 crash clusters with 10 or more crashes. We see the concentration of crashes even more when we note that only 9% of crash clusters (133 of 1,375) represent 50% of the crashes reported (Figure 3 shows crash heatmaps for the truth dataset from July 2017 to July 2018 and for 2012-2020).
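The concentration statistic reported above (for example, 66 of 435 clusters covering over half of the crashes) can be computed from per-cluster crash counts as in the sketch below. The function name and the example counts are hypothetical; the counts are not data from the study.

```python
def clusters_covering(counts, share=0.5):
    """Smallest number of clusters, taken largest first, whose crashes
    add up to at least `share` of all crashes."""
    total = sum(counts)
    running = 0
    for k, count in enumerate(sorted(counts, reverse=True), start=1):
        running += count
        if running >= share * total:
            return k
    return len(counts)

# Hypothetical crash counts per cluster.
counts = [40, 30, 10, 10, 5, 5]
k = clusters_covering(counts)
print(k, k / len(counts))  # number and fraction of clusters covering half the crashes
```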

Discussion
Cities are constantly evolving, and understanding urban mobility is critical to creating urban designs that help to manage risks for pedestrians and vehicles. Severe data limitations hinder the development of policy interventions needed to manage risks, especially in low- and middle-income resource-constrained countries.
Closing the data deprivation gap can help avert divergence in socioeconomic conditions between data-poor and data-rich countries. By focusing on RTCs, the number one cause of death among young people, we demonstrate that social media could be an inexpensive way to produce RTC data that do not otherwise exist in resource-poor contexts, and that these data can support government analyses of road safety and potentially inform policy. This tool could be especially powerful when combined with investments in building a digital administrative dataset that would provide information on the crashes attended by police. The answer to the seemingly simple question of where and when crashes occur has profound implications for public policy response that can save lives. And while official data deprivation can be an impediment to economic development, data generated by private operators can be transformed and placed in the hands of policy makers as a resource for policy making. By expanding the amount of data, we can generate more input to help resource-constrained countries prioritize policy action where it is most needed. This example of geolocating crash data by mining Twitter data can help to guide infrastructure redesign or enforcement policies to reduce RTCs. Nairobi comprises an extensive road network of 6,200 km; with the city's limited resources, addressing road safety across the whole network is difficult. By using this type of geolocated data, urban planners and policy makers can narrow down the problem to the areas with the highest number of crashes. This has been proven to work in developed countries where targeting risky locations led to reductions in the concentration of crashes [48]. As shown in the results, crashes reported on Twitter are highly concentrated, with the top 15% of locations spread across 20 km of road having 50% of the crashes reported on Twitter.
It should be noted that there are some limitations to the approach. The data generated are limited by the coverage of the crowdsourced data. Users are more active on social media at particular times, and it is necessary to possess a smartphone and have access to the internet to be able to use the service. This can lead to bias in the reports generated via the crowdsourced data. Only 7.5% of tweets are sent between the hours of 9 p.m. and 6 a.m., and as a result only 12% of the crash reports from Twitter are during this time. There could also be geographic bias if there are areas of the city where people with smartphones are more likely to be present or passing by, and therefore more likely to report. Our real-time motorcycle validation exercise demonstrates the internal validity of the crowdsourced data and the improved algorithm. External validity is more difficult to assess because we do not know what the universe of crashes is. Additionally, we do not know the severity of the crashes reported on Twitter. Therefore, we have no way of knowing if the areas where crashes happen are the most dangerous, which is what policy makers likely would want to target. These caveats should be considered by policy makers when using crowdsourced data to inform policies and targeting.
Despite the limitations, the improved geoparsing algorithm discussed in this paper can begin filling some of the gaps in data in low-capacity and data-scarce settings. And while the crash cluster areas identified by the algorithm may not be the most dangerous or may not represent all crash areas, they nevertheless highlight problem areas. All crashes, minor or severe, have important economic consequences in terms of property damage and lost time and productivity due to the traffic generated (which is one of the reasons the crash is likely reported on Twitter). Therefore, these data can be used to target areas for design solutions where we are seeing high numbers of crashes consistently over time. In settings where there are limited or non-existent administrative records and, therefore, a lack of any geolocated data, this tool can produce information in real time for one of the most pressing challenges in developing countries. Furthermore, by developing tools that generate time-stamped geolocated data and statistics from crowdsourcing on different "events" that are reported on social media, we can hope to expand data availability across other contexts and across issues beyond RTCs. For example, real-time traffic applications like RIDLR in India can be used to expand data on road safety. These improved tools can also help geolocate victims during a natural disaster or alert disaster management teams to the location of unsafe buildings or areas needing immediate attention. They can support law enforcement or communities to locate and respond to crimes, cases of violence against women, or police violence. Improved identification of the time and location of events can help to automate and accelerate policy response across a wide set of issues, potentially leading to better policy outcomes.
app and posts the report to Twitter. We scraped all tweets posted by Ma3Route from May 2012, when the Twitter feed was started, onward. Figure S1 shows the number of tweets across time. The full dataset of tweets that we use consists of 874,588 tweets scraped between May 2012 and July 2020. See Table S1 for examples of tweets.
Table S1: Examples of tweets (earlier rows and the number of the first row shown were not recovered)
there is an accident at the pangani underpass heading to either muranga road or forest road involving two personal cars and a matatu mini bus this is causing a bit of snurl up cc
6 jogoo road traffic small accident just before donholm
7 a heavy truck has rolled at karai naivasha loaded with what seems to be bags of maize such trucks are supposed to use mai mahiu route how did it end up there
8 prepare for snurl up jogoo road just a minor incident apo hamza
9 bad accident involving 6 matatus and a lorry on thika road near till station
10 an accident has occurred kenyatta road involving a lorry that has overturned and several vehicles
User mentions have been removed.
We explored additional Twitter handles that focus on traffic and road safety in Kenya. These include Twitter handles such as RoadAlertsKE, KenyanTraffic and ThikaTowntoday. The majority of tweets from these other handles are already tweeted out by Ma3Route; therefore, including these additional handles does not produce many new tweets to incorporate into the dataset. An additional source of data is tweets that mention Ma3Route but are not necessarily posted by Ma3Route. While these tweets are not included in the current analysis, they can be easily incorporated to expand the data set that is used to generate additional crash reports. We have already done this for the data set of crashes that we are producing for the Government of Kenya.

Building a Truth Data set of Tweets
We build a truth data set of Ma3Route tweets where tweets are labeled as to whether they refer to a specific traffic crash and, if they do, are geocoded. We code all potentially crash-related tweets from July 2017 to July 2018. We define a tweet as potentially crash-related if one of the following words appeared in the tweet: accident, accidents, ajali, axident, collision, crash, crashes, crashs, crush, crushed, damage, disaster, emergency, fatal, fatality, fender bender, fender-bender, hazard, hit, hit-and-run, incident, incidents, injuries, injury, magari zmegongana, mishap, overturn, overturned, ovrturn, ovrturned, pileup, rammed, read end, rear ended, roll, rolled, smash, smashed, wreck, wreckage, zilicrash, zimecrash. To account for misspellings of select words, we also include tweets if they contained a word that had a Levenshtein distance of two or less to "accident" or "incident" or a Levenshtein distance of one to "crash" or "crashed".
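The keyword filter with its misspelling tolerance can be sketched as follows. The distance thresholds are those stated above; the `KEYWORDS` set is only a subset of the full list, matching is done word by word (so multi-word terms such as "fender bender" are not handled), and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Subset of the keyword list, for illustration.
KEYWORDS = {"accident", "ajali", "collision", "crash", "overturn"}
# Maximum Levenshtein distance allowed for select words.
FUZZY = {"accident": 2, "incident": 2, "crash": 1, "crashed": 1}

def potentially_crash_related(tweet):
    words = tweet.lower().split()
    if any(w in KEYWORDS for w in words):
        return True
    return any(levenshtein(w, target) <= max_d
               for w in words for target, max_d in FUZZY.items())

print(potentially_crash_related("bad acident on thika road"))
print(potentially_crash_related("traffic is flowing well"))
```

A filter of this kind is deliberately broad: a word such as "cash" is within distance one of "crash" and would pass, which is why the manual coding and the classifier that follow are needed.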
Six coders were trained to process the 9,480 tweets defined as potentially crash-related. Coders were instructed to label a tweet as reporting a crash if the tweet referred to one or more specific crashes; general comments about crashes were labeled as not reporting a crash. If the coder labeled the tweet as reporting a crash, they were instructed to geocode the location of the crash based on the tweet text if they were able. Coders were instructed to record the street names and landmarks used to geocode the crash. In addition, they provided the approximate coordinates of the crashes.
Each tweet was labeled and geocoded by two coders; differences were resolved by one of the authors. (We consider geocodes different if they were more than 100 m apart.) Of the 9,480 tweets, 6,602 (69%) reported a crash and of these, 4,192 (63%) identified an approximate location of the crash.

Augmenting a Gazetteer
We test two methods for determining whether a tweet reports a crash: Naive Bayes and support vector machines. Both techniques are commonly used in text classification for their ability to handle high dimensionality, e.g., when the number of features is greater than the number of observations [7,8]. The Naive Bayes model is estimated as:
$$\hat{y} = \arg\max_{y} P(y) \prod_{i} P(x_i \mid y)$$
where $y$ is whether the tweet is classified as crash related or not and $x_i$ are all the n-grams that occur in a tweet.
The linear SVM solves a minimization problem of the standard form:
$$\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert_2^2 + C \sum_{j} \max\left(0,\; 1 - y_j \left(w^\top \mathbf{x}_j + b\right)\right)^2$$
where $C$ is a regularization parameter, $\lVert w \rVert_2^2$ is a penalty function, and $\mathbf{x}_j$ is the feature vector of tweet $j$. Here, $y_j$ equals 1 when the tweet references a crash and -1 when it does not. We use a squared hinge loss function (L2).
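To make the Naive Bayes decision rule concrete, the sketch below implements it over 2-grams in plain Python. It is an illustration under stated assumptions rather than the model estimated in the study: add-one (Laplace) smoothing, the class name and the toy training tweets are all introduced here, and log probabilities are summed instead of multiplying probabilities to avoid numerical underflow.

```python
from collections import Counter
from math import log

def bigrams(text):
    """Consecutive 2-word combinations in a tweet."""
    words = text.lower().split()
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]

class NaiveBayes:
    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        # Class priors P(y).
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        # Bigram counts per class, used to estimate P(x_i | y).
        self.counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.counts[label].update(bigrams(text))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, text):
        def score(c):
            # log P(y) + sum_i log P(x_i | y), with add-one smoothing (assumption).
            denom = sum(self.counts[c].values()) + len(self.vocab)
            return log(self.prior[c]) + sum(
                log((self.counts[c][g] + 1) / denom) for g in bigrams(text))
        return max(self.classes, key=score)

# Toy training data: 1 = reports a crash, -1 = does not.
texts = ["accident on thika road", "bad accident on mombasa road",
         "drivers keep causing crashes", "terrible drivers keep speeding"]
labels = [1, 1, -1, -1]
model = NaiveBayes().fit(texts, labels)
print(model.predict("accident on jogoo road"))
print(model.predict("drivers keep causing trouble"))
```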
We implement 4-fold cross-validation, training the model on 75% of the truth data and testing on 25% of the data within each fold. Table S4 shows results for select parameters.
While the Naive Bayes algorithm performs slightly better based on precision, the SVM has higher recall and generally performs better for 2- and 3-grams. Overall, the F1 statistic, which provides a balance between precision and recall, is best for SVM at 0.95 using 2- and 3-grams. Given that the overarching goal is to produce a data set of geolocated crashes based on the tweets, better recall is more important than higher precision. The reason for this is that even if a larger set of tweets is misclassified as crash related, it is more likely that these general tweets will not be geolocated at the second stage since they are not discussing a particular crash with a given location. We therefore want to capture as many of the tweets that are reporting crashes as possible at this stage, even if it means capturing slightly more tweets that are not reporting a crash. The SVM algorithm also has a very high accuracy of 0.93. The table shows best results for both SVM and Naive Bayes. For these results, both models use the original tweet and no features are removed. The Naive Bayes models do not use TF-IDF, while the SVM models do.
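For reference, the precision, recall and F1 statistics discussed above can be computed from true and predicted labels as in this short sketch. The function name and the example labels are hypothetical; labels follow the 1 and -1 coding used for the SVM.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (crash) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels: 1 = reports a crash, -1 = does not.
y_true = [1, 1, 1, -1, -1]
y_pred = [1, 1, -1, 1, -1]
print(precision_recall_f1(y_true, y_pred))
```

Recall penalizes crash tweets that the classifier misses, which is the error the text argues is more costly at this stage.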

Preparation for Geolocation
Prior to being able to use the geolocation algorithm, two additional pieces need to be prepared. One relates to identifying types of landmarks that are more commonly mentioned as the location of a crash in a tweet. In the situation where there might be multiple landmarks with the same name, the more likely landmark for a crash is the one that should be chosen for the location. The second relates to identifying the correct location when multiple locations are mentioned in the tweet. We can use the typical grammatical structure of a tweet to identify prepositions that are used prior to the correct location of a crash compared to ones that are more likely to be used with locations that are not close to the crash. Ranking prepositions based on these probabilities makes it possible to choose the correct location from the possible locations mentioned.

Determining Landmark Types More Commonly Used as the Crash Location
When a landmark name is mapped to multiple locations, the algorithm gives preference to certain landmark types. To determine which landmark types to prefer, we examine which landmark types are more commonly associated with the correct location. We consider cases where (1) one landmark is used to identify the crash location and (2) the landmark name is mapped to locations both near and far from the crash location. We compute the proportion of times a type is near and far from a crash location and divide the proportion near by the proportion far to measure how much more likely a landmark of that type is to be near the crash location than far from it.
Figure S2 shows results. Among tweets considered, a landmark location that is a bus stop is near the correct location 17% of the time and is far from the correct location less than 1% of the time, leading to a bus stop being close to the correct location 22 times more frequently than far from the correct location.
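One reading of this computation is sketched below: for each landmark type, its share among candidate locations near the crash is divided by its share among candidate locations far from the crash. The function name, the input format and the example observations are hypothetical and are not data from the study.

```python
from collections import Counter

def type_ratios(observations):
    """observations: list of (landmark_type, is_near) for candidate locations.
    Returns {type: share among near locations / share among far locations}."""
    near = Counter(t for t, is_near in observations if is_near)
    far = Counter(t for t, is_near in observations if not is_near)
    n_near, n_far = sum(near.values()), sum(far.values())
    ratios = {}
    for landmark_type in near:
        if far[landmark_type]:  # ratio is undefined if the type is never far
            ratios[landmark_type] = ((near[landmark_type] / n_near)
                                     / (far[landmark_type] / n_far))
    return ratios

# Hypothetical observations, for illustration only.
observations = [
    ("bus stop", True), ("bus stop", True), ("school", True), ("school", True),
    ("bus stop", False), ("school", False), ("school", False), ("school", False),
]
print(type_ratios(observations))
```

A ratio above 1 indicates a type that is more often near the crash than far from it; in the text, types with a ratio of 2.5 or more are the ones given preference.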
In the algorithm, we use the top 6 landmark types (all being 2.5 or more times as likely to be near the correct location) to give preference to landmarks: bus stop, parking, mall, cafe, transit station and bus station.
The truth dataset indicates the landmark used to geocode the crash. We examine the phrases that precede the landmark. Figure S3 shows the top phrases. The phrase "at" precedes the correct landmark in 42% of tweets and in roughly half these cases "accident at" precedes the landmark.
We examine the phrases that precede the landmark to guide decision making when more than one landmark is mentioned. For this, we take all phrases that precede the correct landmark at least 20 times. We then identify cases where two of these phrases appear in a tweet and one of the phrases precedes the correct landmark; we then calculate the proportion of times each phrase precedes the correct landmark when the other phrase is also in the tweet. Figure S4 shows the results. While 'at' is the most common word that precedes a landmark, other phrases that precede landmarks are more predictive of the correct landmark. For example, when both 'at' and 'near' appear in the tweet (and one of them precedes the correct landmark), the landmark is preceded by 'at' only 6% of the time. We use information from these phrase pairings to divide phrases into "tiers"; if two landmarks are found in a tweet, the landmark whose preceding phrase is from the lower tier is used. We develop 6 tiers:

Repeat until doing so would remove all landmarks considered. 11
11 For example, in the tweet "accident at garden city toward town", the algorithm searches for landmarks after 'at'. It first finds all landmarks that contain 'garden', then it narrows down these landmarks to those with both 'garden' and 'city'. No landmark contains 'garden', 'city' and 'toward', so the algorithm stops and considers landmarks with 'garden' and 'city'.
(c) Among extracted landmarks, determine which landmark has the smallest number of words and keep only landmarks with that number of words. 12
12 For example, if 'garden city', 'garden mall', 'garden city mall' and 'airtel money agent rock city gardens' were extracted, the algorithm keeps 'garden city' and 'garden mall'.
C. Extract point locations from roads
1. For each road found, check if the length of the diagonal of the bounding box is less than 500 m; if it is, take the centroid and consider this location to be a landmark. 13
(b) If an area is mentioned (e.g., a neighborhood), for each landmark, follow the same steps as above.
(c) If a landmark is mentioned after a tier 1 preposition (e.g., "next to", "just after"), for each other landmark, follow the same steps as above, checking the distance from the other landmarks to the landmark locations after tier 1 prepositions. 15
15 Helpful in case the landmark near a tier 1 preposition doesn't form a dominant cluster, but a dominant cluster is formed from another landmark mentioned.
I. Final checks to determine whether location should be used

1. If a road is mentioned and the chosen location is more than 500 m from any mentioned road, the algorithm outputs no location.
2. If multiple landmarks are mentioned, the closest landmark to the crash word is used 18 and the landmark is more than two words away from the crash word, the algorithm outputs no location.
3. If multiple landmarks are mentioned, a tier 5 or 6 phrase precedes the chosen landmark and the landmark is more than two words away from the crash word, the algorithm outputs no location.
18 This would happen when no tier 1-6 phrase precedes a landmark.

Choosing Parameters for Clustering Crash Reports into Unique Crashes
Multiple people often tweet about the same crash. To cluster crash reports into unique crashes, we cluster by the distance (in kilometers) and the time between reports. To determine the optimal distance and time parameters, a team manually determined which crash reports refer to the same crash. The dataset was double coded by different team members, resulting in two "truth" datasets.
To judge whether crash reports refer to the same crash, the team used the location of the crash, the time of the tweet and details about the crash in the tweet itself (e.g., extent of injuries, types and numbers of vehicles).
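One way to cluster reports by distance and time is sketched below. It assumes single-linkage grouping: two reports are linked when they fall within both thresholds, and linked reports form one crash. The text does not specify the linkage rule, so this is an assumption, as are the function names, the great-circle distance and the sample coordinates; the default thresholds are just one of the tested parameter combinations:

```python
import math
from datetime import datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def cluster_reports(reports, max_km=0.5, max_hours=2):
    """Label each (lat, lon, datetime) report with a crash id (assumed single linkage)."""
    parent = list(range(len(reports)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(reports)):
        for j in range(i + 1, len(reports)):
            lat1, lon1, t1 = reports[i]
            lat2, lon2, t2 = reports[j]
            hours = abs((t1 - t2).total_seconds()) / 3600
            if hours <= max_hours and haversine_km(lat1, lon1, lat2, lon2) <= max_km:
                parent[find(i)] = find(j)  # merge the two groups

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(reports))]


# Made-up reports: two close in space and time, one about 10 km away.
reports = [
    (-1.2921, 36.8219, datetime(2018, 1, 5, 8, 0)),
    (-1.2925, 36.8222, datetime(2018, 1, 5, 8, 30)),
    (-1.2000, 36.9000, datetime(2018, 1, 5, 8, 10)),
]
print(cluster_reports(reports))
```

The pairwise loop is quadratic in the number of reports, which is acceptable for a sketch but would need spatial or temporal indexing at scale.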
The table below shows summary statistics of the maximum distance and time between any two crash reports in the same clustered or individual crash. Before calculating the statistics, outliers were removed (we define an outlier as a crash cluster whose reports are more than 24 hours or more than 5 km apart). Across both truth datasets, around 52% of tweets were clustered with another tweet, meaning that 48% of tweets are the only tweet reporting a given crash.
We examine two common metrics for evaluating clustering performance: the adjusted Rand index and the Jaccard coefficient [9]. When using our algorithm to cluster crash reports, we test all combinations of 0.1, 0.5, 1, 2 and 3 kilometers and 1, 2, 4, 12 and 24 hours. For truth dataset 1, both the Rand index and the Jaccard coefficient show that 12 hours and 500 m lead to the best results, while truth dataset 2 shows 2 hours and 500 m (see figure ). The difference in results between the truth datasets likely arises because the exercise is partially subjective, particularly when limited or no

Figure 1. Illustration of the classification and geolocation algorithm developed for extracting data from crowdsourced information.

Figure 2. Crowdsourced crash reports from Twitter data that our algorithm has geolocated and clustered into unique crashes for the city of Nairobi between 2012 and 2020. Road data come from OpenStreetMap.

Figure 3. Heatmap of crashes. Data in panel a are from July 2017 to July 2018, where we use the manually coded Twitter dataset. Data in panel b are for August 2012 to July 2020. Road data come from OpenStreetMap.

Figure S2: Landmark types typically near or far from the crash location when a landmark name is mapped to multiple locations.

Figure S3: Top words that precede the landmark that correctly identifies the crash location.

Figure S4: Likelihood of different words preceding the correct landmark.

Table 1. Geolocation Algorithm Results

The primary goal of the algorithm to augment the gazetteer is to generate alternate names of landmarks that users may use instead of the original name in the gazetteer. Alternate names are generated in three steps: (1) splitting landmark names at certain punctuation (e.g., slashes), (2) creating n-grams and skip-grams of landmarks and (3) in select cases, removing the landmark type from the end of the name (e.g., removing 'restaurant' from 'McDonald's restaurant'). The algorithm also removes landmark names that are common words that may often be used in a context that does not refer to a landmark. In addition, the algorithm removes landmarks that do not refer to a specific location, such as roads.
Remove one-word landmarks that are also English words (spelled correctly according to an English spell checker) 4 but are not nouns 5 or categorized as a bus/
C. Remove certain landmarks
1. Remove landmarks that are just one character in length
2. Remove landmarks that have certain types (e.g., where the type indicates that the landmark actually represents a large area). We remove landmarks with the type: route, road, political, locality or neighborhood except if the landmark also contains "flyover" or "roundabout" in the name 1
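The three steps for generating alternate names can be illustrated with a short sketch. The function names, the n-gram sizes (2 to 3 words), the two-word skip-grams and the `type_words` list are assumptions made for illustration; the text does not specify these details:

```python
import re
from itertools import combinations


def ngrams(words, n):
    """Contiguous runs of n words."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def skipgrams(words, n=2):
    """Ordered combinations of n words that skip at least one word."""
    out = []
    for idx in combinations(range(len(words)), n):
        if idx[-1] - idx[0] > n - 1:  # non-contiguous only
            out.append(" ".join(words[i] for i in idx))
    return out


def alternate_names(name, type_words=("restaurant", "mall"), max_n=3):
    """Generate candidate alternate names for one gazetteer entry (illustrative)."""
    names = set()
    # Step 1: split at slashes.
    for part in re.split(r"\s*/\s*", name.lower()):
        words = part.split()
        if not words:
            continue
        names.add(part)
        # Step 2: n-grams (assumed sizes 2..max_n) and two-word skip-grams.
        for n in range(2, min(max_n, len(words)) + 1):
            names.update(ngrams(words, n))
        names.update(skipgrams(words, 2))
        # Step 3: drop a trailing landmark type word.
        if len(words) > 1 and words[-1] in type_words:
            names.add(" ".join(words[:-1]))
    names.discard(name.lower())  # keep only names that differ from the original
    return names


print(sorted(alternate_names("McDonald's restaurant")))
print(sorted(alternate_names("Garden City Mall/Garden City")))
```

Names produced this way would still need to pass the removal rules described above, for example dropping candidates that are common English words.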

Table S4
If multiple landmark names were selected 17
ii. Choose landmark closest to the event word (could still result in multiple!)
(b) If landmark name mapped to multiple locations
i. Select locations within 500 m of mentioned road; if none near road, don't subset
G. [If landmark location is not near any mentioned road] Broaden search to find similarly named landmarks near the road
1. Start with all landmarks that are near any mentioned road and subset to those that contain the landmark name. Take the next word in the tweet and subset landmarks that contain this word. Repeat process until doing so would cause no landmarks to be found. Among these locations:
(a) If a dominant cluster exists, use this location.
(b) If no dominant cluster exists, further subset locations to those where the landmark word in the tweet is at the beginning of the landmarks found. If a dominant cluster is found, use this location.
i. If no location is found in the previous step, repeat, but check words in the tweet preceding the landmark name.
H. Snap to Road
1. If a road is mentioned, snap location to road
2. If no road is mentioned, snap to nearest road if road within 500 m.
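The word-by-word narrowing in step G, which follows the same pattern as the 'garden city' search example, can be sketched as below, together with the rule of keeping only the landmarks with the fewest words. Whole-word matching, the function names and the small gazetteer are assumptions for illustration; the algorithm may match names differently:

```python
def narrow_landmarks(tweet_words, landmark_names):
    """Add tweet words one at a time, stopping before no landmark would remain."""
    candidates = []
    for k, word in enumerate(tweet_words):
        pool = landmark_names if k == 0 else candidates
        # Assumption: a landmark matches if the word appears as a whole word in its name.
        narrowed = [name for name in pool if word in name.split()]
        if not narrowed:
            break  # adding this word would remove every landmark
        candidates = narrowed
    return candidates


def keep_shortest(names):
    """Keep only the landmark names with the smallest number of words."""
    if not names:
        return []
    fewest = min(len(name.split()) for name in names)
    return [name for name in names if len(name.split()) == fewest]


# Hypothetical gazetteer entries; the words follow 'at' in
# "accident at garden city toward town".
gazetteer = ["garden city", "garden mall", "garden city mall", "town hall"]
found = narrow_landmarks(["garden", "city", "toward", "town"], gazetteer)
print(found)
print(keep_shortest(found))
```

In step G, the starting list would be restricted to landmarks near a mentioned road before this narrowing is applied.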

Table S6: Clustered Tweets Truth Data Summary Statistics