A scalable machine learning approach for measuring violent and peaceful forms of political protest participation with social media data

In this paper, we introduce a scalable machine learning approach, accompanied by open-source software, for identifying violent and peaceful forms of political protest participation using social media data. While violent political protests are statistically rare events, they often shape public perceptions of political and social movements. This is due, in part, to the extensive and disproportionate media coverage that violent protest participation receives relative to peaceful protest participation. In the past, when a small number of media conglomerates served as the primary information source for learning about political and social movements, viewership and advertiser demands encouraged news organizations to focus on violent forms of political protest participation. Consequently, much of our knowledge about political protest participation is derived from data collected about violent protests, while less is known about peaceful forms of protest. Since the early 2000s, the digital revolution has shifted attention away from traditional news sources and toward social media as a primary source of information about current events. This shift, along with developments in machine learning that allow us to collect and analyze data relevant to political participation, presents a unique opportunity to expand our knowledge of peaceful and violent forms of political protest participation through social media data.

that the photo was taken, country, city and state (where applicable) was available in the metadata.
The Python code which was used to extract and format the image metadata is shown below. Please note that, for purposes of exposition, all files in the code below refer to machine-readable .json versions of the .csv data files included as part of this submission. After Code A was run, the collected image metadata held within events.json was edited by hand to associate the "city" field with the appropriate geographic units in the geo-coded tweet database, which is annotated with codes for the GADM global administrative boundaries. See http://www.gadm.org/version2 for more information. For convenience, the contents of the resulting GADM-associated database of image metadata were dumped into CSV format under the file APprotestevents-metadata.csv.
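The extraction and formatting step can be sketched as follows. This is an illustration based on the description above, not the original Code A listing: the field names ("date", "country", "city", "state") and the helper names are assumptions.

```python
import json
import csv

def extract_metadata(image_records):
    """Keep only the metadata fields needed for geographic and
    temporal matching (assumed field names)."""
    events = []
    for rec in image_records:
        events.append({
            "date": rec.get("date"),
            "country": rec.get("country"),
            "city": rec.get("city"),
            "state": rec.get("state"),
        })
    return events

def write_outputs(events, json_path="events.json",
                  csv_path="APprotestevents-metadata.csv"):
    """Dump the extracted events to JSON (for hand-editing) and CSV."""
    with open(json_path, "w") as f:
        json.dump(events, f, indent=2)
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["date", "country", "city", "state"])
        writer.writeheader()
        writer.writerows(events)
```

In this sketch, fields not needed for matching (e.g., photographer credits) are simply dropped before the hand-editing step.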

Extracting samples of geo-coded tweets using AP metadata
Using the GADM-associated AP image metadata database events.json discussed above, the geo-coded Twitter database was queried and tweets were extracted using geographic and time parameter ranges corresponding to the event data listed in APprotestevents-metadata.csv. While temporal alignment of the extracted data to the geo-coded Twitter database was straightforward (times were simply converted from GMT to EST), geographic alignment was somewhat more complicated. Generally, the AP images were tagged with city-level spatial information, while the geo-coded Twitter database was tagged with US county-level equivalents. Each city was therefore coded by hand for overlapping counties, which resulted in different levels of refinement for different cities. For example, while San Francisco was mapped to its name-identical county, New York City was identified with the five boroughs, and Oakland and Berkeley were each associated with their superset, Alameda County. Ultimately, the effects of this imperfect mapping were most significant when the filtered Twitter data were coded for social actions. In particular, for small cities that are subsets of large, populous counties, such as Ferguson, MO, the filtering resulted in samples that were more "watered down" in terms of social action, since more unrelated tweets were included from the surrounding areas. After extraction, tweets were reformatted into a series of JSON files, which were then distributed to each of the five U.C. Berkeley undergraduate students, who were subsequently instructed how to code their assigned tweets according to the four social action categories described in the paper. Following this, the student-coded tweets were comprehensively audited for consistency and then merged to form the final coded data set.
A sample of the completed tweet database as viewed by the coders is presented in Fig A. The Python code which was used to query the tweet database and format relevant tweets for human coding is shown below. Code B Python code using AP image metadata to query the geo-coded tweet database.
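The core of such a query can be sketched as below. The record layouts, the field names (time_est, county, counties), and the symmetric 24-hour window are illustrative assumptions, not the schema of the original Code B.

```python
from datetime import datetime, timedelta

def tweets_for_event(tweets, event, window_hours=24):
    """Return tweets whose county code matches the event's hand-coded
    county set and whose timestamp falls within +/- window_hours of the
    event time (all times assumed already converted to EST)."""
    t0 = datetime.strptime(event["time_est"], "%Y-%m-%d %H:%M")
    lo = t0 - timedelta(hours=window_hours)
    hi = t0 + timedelta(hours=window_hours)
    out = []
    for tw in tweets:
        ts = datetime.strptime(tw["time_est"], "%Y-%m-%d %H:%M")
        if tw["county"] in event["counties"] and lo <= ts <= hi:
            out.append(tw)
    return out
```

Because an event maps to a *set* of counties, this sketch also reproduces the "watered down" effect described above: a small city inside a large county pulls in unrelated tweets from the whole county.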
The Python code which was used to prepare the tweets for human coding is shown below.
import re, json
import random as ra

Code C Python code used to prepare tweets for human coding.
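The preparation step that Code C performs, shuffling the extracted tweets and splitting them into one JSON file per coder, can be sketched as follows; the five-coder split, the fixed seed, and the file naming are assumptions based on the description above.

```python
import json
import random as ra

def batch_for_coders(tweets, n_coders=5, seed=0):
    """Shuffle tweets (reproducibly) and deal them out into
    n_coders roughly equal batches."""
    ra.seed(seed)
    shuffled = tweets[:]
    ra.shuffle(shuffled)
    return [shuffled[i::n_coders] for i in range(n_coders)]

def write_batches(batches, prefix="coder"):
    """Write each coder's batch to its own JSON file."""
    for i, batch in enumerate(batches, start=1):
        with open("%s-%d.json" % (prefix, i), "w") as f:
            json.dump(batch, f, indent=2)
```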

Building the labeled tweet database for the adept Bayes classifier
After coding, the relevant tweets were compiled into a labeled tweet database containing 22,626 tweets, which was used to train the adept Bayes classifier and assess its performance. The labeled tweet data used to train and test the classifier are included as part of the supplementary materials in this submission under the filename Coded-Tweets-training.csv.

Building and training the adept Bayes classifier
In addition to the substantive contribution of this paper, we provide a methodological innovation which improves upon a well-known machine learning classifier, naïve Bayes. While the naïve Bayes classifier is known to provide strong performance and scalability in the context of text classification, a major drawback of the algorithm is its conditional independence assumption, which treats the words within a document as independent of one another. Although the naïve Bayes classifier performs very well on text analysis tasks despite this assumption, many have theorized that relaxing it could improve performance even further [2].
Here, we relax the conditional independence assumption by incorporating an NLP phrase chunking method [3] that allows us to better meet the independence assumptions of the naïve Bayes classifier, resulting in an enhanced naïve Bayes classifier which we refer to as "adept" Bayes.
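One simple way to approximate this idea is to merge frequently co-occurring word pairs into single multi-word tokens before counting features, so that each chunk is closer to being an independent unit. The sketch below illustrates that general strategy only; it is not the chunker of [3], and the bigram threshold is an assumption.

```python
from collections import Counter

def frequent_bigrams(documents, min_count=2):
    """Count adjacent word pairs across tokenized documents and keep
    those occurring at least min_count times."""
    counts = Counter()
    for doc in documents:
        counts.update(zip(doc, doc[1:]))
    return {bg for bg, c in counts.items() if c >= min_count}

def chunk(doc, bigrams):
    """Greedily merge known bigrams into single underscore-joined
    tokens, so each chunk is treated as one feature downstream."""
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in bigrams:
            out.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out
```

Under this scheme a phrase like "tear gas" becomes the single feature "tear_gas", rather than two spuriously independent features.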
Code for implementing the adept Bayes classifier is shown below. Code D Functional components of the adept Bayes classifier. A complete pipeline uses all but the last function for training, and the final function, processTweet(), for classification, as will be seen in the subsequent cross-validation code.
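A minimal sketch of the two functional components described for Code D is given below: a trainer that accumulates per-class token counts, and a classification function that scores a new tweet under add-one smoothing. The name processTweet() follows the text; the internals here are illustrative assumptions, not the original implementation.

```python
import math
from collections import Counter, defaultdict

def train(labeled_tweets):
    """labeled_tweets: iterable of (tokens, label) pairs, label in {0, 1}.
    Returns (per-class word counts, class counts, vocabulary)."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for tokens, label in labeled_tweets:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def processTweet(tokens, model):
    """Return the most probable class for a tokenized tweet,
    using log probabilities with add-one (Laplace) smoothing."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            lp += math.log((word_counts[label][tok] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

The only difference between this and a standard naïve Bayes sketch is that, in the adept setting, the tokens would first be passed through the phrase chunker described above.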
Using the coded data from Twitter, the model was trained as a parallel series of binary classifiers. In other words, a separate adept Bayes classifier was trained for each of the four types of social action (e.g., collective force, singular peace, etc.), in addition to the collapsed categories (i.e., collective, singular, force, peace, or any). This means that the processing of the coded data resulted in the training of 9 binary classifiers that can separately assess the presence of each type of social action, e.g., application of the classifier would predict that a tweet either is a representation of collective force, or is not a representation of collective force, while simultaneously predicting if the same tweet is a representation of singular peace, or is not a representation of singular peace, et cetera.
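The derivation of the nine binary target labels from a tweet's base categories can be sketched as follows. The four base category names are taken from the text; the exact derivation rules for the collapsed categories are assumptions.

```python
# The four base social action categories named in the text.
BASE = ["collective force", "collective peace", "singular force", "singular peace"]

def binary_labels(tweet_categories):
    """tweet_categories: set of base categories present in one tweet.
    Returns a dict mapping each of the 9 binary classifier targets
    (4 base + collective/singular/force/peace/any) to 0 or 1."""
    labels = {c: int(c in tweet_categories) for c in BASE}
    labels["collective"] = int(any(c.startswith("collective") for c in tweet_categories))
    labels["singular"] = int(any(c.startswith("singular") for c in tweet_categories))
    labels["force"] = int(any(c.endswith("force") for c in tweet_categories))
    labels["peace"] = int(any(c.endswith("peace") for c in tweet_categories))
    labels["any"] = int(bool(tweet_categories))
    return labels
```

Each of the nine labelings then trains its own independent binary classifier, so a single tweet can receive multiple positive predictions simultaneously.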

Assessing classifier performance
After training the adept Bayes classifier, we assessed precision, recall, and F1 statistics using tenfold cross-validation on the trained model.
The code used to produce the trained model and the classifier statistics is shown below. Code E Python code used to produce tenfold cross-validation statistics for the adept Bayes classifier.
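The fold logic and statistics for one binary classifier can be sketched as below. Here fit and predict stand in for the adept Bayes training and classification functions; the strided fold assignment is an assumption, while the precision/recall/F1 definitions are standard.

```python
def cross_validate(data, fit, predict, k=10):
    """data: list of (features, label) pairs with label in {0, 1}.
    Runs k-fold cross-validation and returns pooled
    (precision, recall, f1) over all held-out folds."""
    tp = fp = fn = 0
    for i in range(k):
        test = data[i::k]                                  # fold i held out
        train_split = [d for j, d in enumerate(data) if j % k != i]
        model = fit(train_split)
        for features, label in test:
            pred = predict(features, model)
            if pred == 1 and label == 1: tp += 1
            elif pred == 1 and label == 0: fp += 1
            elif pred == 0 and label == 1: fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Pooling counts across folds (rather than averaging per-fold scores) keeps the statistics well-defined even when a fold contains few positive examples.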
In addition, we sought to assess the out-of-domain performance of the classifier for each type of action. To accomplish this, we trained the adept Bayes classifier on all of the coded data except those drawn from Hong Kong, which were held out for testing. Under this setup, the classifier had no in-domain knowledge of the Hong Kong Democracy protests on which it was tested.
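Concretely, such a held-out-region split might look like the following sketch; the (tokens, label, region) record layout is an assumption.

```python
def out_of_domain_split(records, held_out_region="Hong Kong"):
    """Split (tokens, label, region) records into a training set drawn
    from all regions except held_out_region, and a test set drawn
    only from held_out_region."""
    train = [(t, y) for t, y, region in records if region != held_out_region]
    test = [(t, y) for t, y, region in records if region == held_out_region]
    return train, test
```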
The Python code for the execution of the out-of-domain test is shown below.

Exploring social actions during New York City's climate change protest
As part of this submission, we also include a movie file which demonstrates how our trained adept Bayes classifier can track social actions as they unfold in real time on the ground. This was accomplished by using the classifier to classify tweets during a multi-day climate change protest in New York City which began on September 21st, 2014. The movie file included as part of this submission is a QuickTime .mov file entitled NYC-ClimateChangeProtests-Med.mov. The code and classified tweets used to produce this movie file are included as part of this submission. The classified tweets used to create this movie file and S2