Look who’s talking: Two-mode networks as representations of a topic model of New Zealand parliamentary speeches

Quantitative methods to describe the participation in debate of Members of Parliament and the parties they belong to are lacking. Here we propose a new approach that combines topic modeling with complex-network techniques, and use it to characterize the political discourse in the New Zealand Parliament. We implement a Latent Dirichlet Allocation model to discover the thematic structure of the government's digital database of parliamentary speeches, and from it construct two-mode networks linking Members of Parliament to the topics they discuss. Our results show how topic popularity changes over time and allow us to relate the trends followed by political parties in their discourse to specific social, economic and legislative events. Moreover, community analysis of the two-mode network projections reveals which parties dominate the political debate, as well as the extent to which they specialize in a small or large number of topics. Our work demonstrates the benefits of performing quantitative analysis in a domain normally reserved for qualitative approaches, providing an efficient way to measure political activity.


Introduction
Topic models have received widespread attention in recent years [1][2][3][4], as they have proven to be useful tools for dealing with the vast amount of semantic information that is becoming available. Topic modeling is a set of machine learning techniques that take a collection of documents as input and attempt to discern the themes that pervade them [5]. However, the methods that topic models use to search, summarize and understand large electronic archives have rarely been applied to political texts.
The New Zealand government has been making parliamentary transcripts ('Hansard') available in digital format since 2003. Suitable annotation of these transcripts allows them to be used as a corpus for the development of topic models.
This comprehensive corpus of political text can then be examined through a number of lenses. Topic models allow us to monitor the ebb and flow of themes that are discussed in parliament over multiple years and associate particular themes with individual Members of Parliament (MPs). This allows the identification of trends of topics that particular parties follow. That is, we may observe which issues are discussed repeatedly with great interest and which cease to be mentioned.
In the four parliamentary terms analyzed there was a transition of power from the 5th Labour government (1999-2008) to the 5th National government (2008-). The left-leaning Labour Party and right-leaning National Party have been the two parties sharing power for most of the 20th century. In 1996, the method of electing MPs was changed to a mixed-member proportional (MMP) system, and the two major parties were joined by a number of smaller parties. These smaller parties have sometimes held the balance of power, with the left-wing Green Party the largest of them.
A number of textual analyses of political speeches are concerned with finding where on the political spectrum a speaker falls (e.g. [6][7][8]). Topic modeling as applied in our analyses cannot determine the sentiment of a statement or speech. Nevertheless, multiple aspects of politicians' policy interests can be unraveled with further statistical analysis.
Here we construct bipartite networks [9][10][11][12], whereby sets of MPs are linked to a set of topics, with each link representing a topic that is of clear interest to a particular MP, based on the content of their parliamentary speeches. We can then decompose such bipartite networks into their two projections: the MP-projection and the topic-projection. The former represents a network where the links between MPs indicate the existence of a mutual interest, and the latter represents a network where links represent topics that co-occur as interests of a particular MP. In this study, we make use only of MP-projections. Measuring properties such as the node degree (i.e. the number of links that connect it to other nodes), homophily [13,14] and clustering and community structure [15,16] of these networks provides information about their underlying topology. For instance, one can discover whether or not the typical range of interests of an MP is changing, as well as patterns in this behavior over time.
Moreover, we apply community-detection methods [17,18] in order to find clusters of politicians that share interests, and investigate the partisan composition of these communities.
This work is of an exploratory nature, in that our goal is twofold: to present a novel quantitative approach of measuring political activity and to demonstrate the benefits of performing quantitative analysis in a domain normally reserved for qualitative approaches, by using a combination of machine learning and complex networks techniques.
The remainder of this paper is organized as follows: the Methods and Data sections 2.2 and 2.3 introduce fundamental aspects of topic modeling and bipartite networks respectively and outline the preparation and organization of our data; Sections 3 and 4 present the results of our analyses alongside a discussion and our conclusions.

Data
The semantic data we use for our analyses are extracted from the New Zealand Hansard database [19]. Hansard presents records of what is said in the debating chamber as debates (a collection of speeches on a particular topic), speeches (individual statements by MPs) or dailies and volumes (collections of speeches over different time periods). By considering only those documents labeled in that database as a 'speech', we were able to determine which topics specific MPs were engaging with.
This makes it possible to associate speeches and by extension MPs with topics of interest over time.
Once these data are obtained, we observe that many speeches are rather short and contain little topical content. An example is given below, which comes from a committee discussion on the Shop Trading Hours Amendment Bill and was published in Hansard Volume 716 on the 17th of August 2016 [19]: "CHAIRPERSON (Lindsay Tisch): Just a point: this debate concludes at quarter past and to whoever is speaking at the time, I will be stopping it at that point." In an attempt to remove these non-topical speeches, we have removed from our database speeches of 150 words or fewer, which constitute about 20% of the database. This cut-off is shown in Fig. 1, which presents the distribution of word counts per speech. This decision is informed by observations of the insufficient topical content of speeches below this threshold.
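A minimal sketch of this length filter, assuming the speeches are available as plain strings (the helper name `filter_speeches` is our own; the 150-word cut-off is the one stated above):

```python
# Keep only speeches with more than MIN_WORDS words; shorter ones
# (about 20% of the database) carry too little topical content.
MIN_WORDS = 150

def filter_speeches(speeches):
    """Return the speeches strictly longer than MIN_WORDS words."""
    return [s for s in speeches if len(s.split()) > MIN_WORDS]

speeches = [
    "Just a point: this debate concludes at quarter past.",  # short, procedural
    " ".join(["word"] * 200),                                # long enough to keep
]
kept = filter_speeches(speeches)
```

In practice the word counts would be computed after the same tokenization used for the topic model, but a whitespace split suffices to illustrate the cut-off.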

Topic Models
The process of topic modeling involves utilizing a set of algorithms that have been developed to understand the underlying thematic structure of a corpus. The simplest and most commonly used topic model is Latent Dirichlet Allocation (LDA) [20].
Fig. 1 caption: In blue, the speech sizes used in our analysis. Speeches shorter than 150 words are omitted from our analysis due to lack of topical content; this threshold is indicated by the dashed vertical line.

Within the framework of LDA, each document is a mixture of corpus-wide topics, and each topic can be understood as a distribution over keywords. The total number of documents comprising the corpus is denoted as D and the total number of topics as K.
Additionally, the order of the words that comprise a document is not considered, only the frequency with which words appear. Inference then amounts to computing the posterior distribution of the hidden topic structure given the observed documents. Unfortunately, computing this posterior exactly is computationally infeasible, and it must be approximated by an inference algorithm. Such algorithms are commonly classified as sampling-based algorithms or variational algorithms [21].

In this work, we used the MALLET software package [22] for the topic modeling component of our analysis. MALLET implements Gibbs sampling [23], which constructs a sequence of random variables in a Markov chain, where each variable is dependent on the previous. The algorithm then assumes that the true posterior distribution is the limiting distribution of this sequence, and obtains an approximation to this posterior using these samples. For a full mathematical description of LDA and a further discussion of the methods used to estimate a posterior, see [21].
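To give some intuition for the sampling approach, the following is a toy collapsed Gibbs sampler for LDA in pure Python. This is a minimal sketch only: MALLET's implementation is far more optimized, and the function name `gibbs_lda` and the hyperparameter values are our own choices, not those of the paper.

```python
import random
from collections import defaultdict

def gibbs_lda(docs, K, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents.
    Returns per-document topic counts and per-topic word counts."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})        # vocabulary size
    ndk = [[0] * K for _ in docs]                # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]   # topic-word counts
    nk = [0] * K                                 # tokens per topic
    z = []                                       # topic assignment per token
    for di, d in enumerate(docs):                # random initialization
        zd = []
        for w in d:
            t = rng.randrange(K)
            zd.append(t)
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):                       # resample each token's topic
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                t = z[di][wi]                    # remove current assignment
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # full conditional: p(t) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ)
                weights = [(ndk[di][k] + alpha) * (nkw[k][w] + beta)
                           / (nk[k] + V * beta) for k in range(K)]
                t = rng.choices(range(K), weights=weights)[0]
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return ndk, nkw

docs = [["tax", "budget", "tax"], ["health", "hospital", "health"],
        ["tax", "budget"], ["hospital", "health"]]
ndk, nkw = gibbs_lda(docs, K=2)
```

Each sweep resamples every token's topic from its full conditional given all other assignments; the sequence of states forms the Markov chain described above, whose limiting distribution approximates the true posterior.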
LDA assumes that the topics are the same for all documents, and that only the topic proportions vary. MALLET therefore requires as input the number of topics to be discovered. Choosing this number is critical to the success of a topic model: too few topics may merge distinct themes, while too many may introduce "themes" consisting of vocabularies that appear to have nothing in common, or even split topics that were identifiable at smaller input values.
For our analysis it is important that the topics are easily identifiable and distinct from one another. We found that 30 topics satisfy these requirements. Identified topics and their corresponding keywords can be found in Table A.2. It is worth noting, however, that some topics (nine of them, corresponding to about 36% of the corpus) appeared to consist mainly of terms that were primarily procedural or general rhetoric, such as "proud", "hope" or "nation". As this language reveals little in the way of substantive interactions, such topics were omitted from our subsequent analyses after the networks had been inferred. Fig. 2 shows the remaining topics with their rescaled proportions.

The weighting in the MP-projections offers a way to eliminate edges that represent tenuous links. This is important because every topic generates a complete subgraph of MPs (one in which every node is connected to every other node in the subgraph): every MP that speaks about a popular topic is connected to every other MP that speaks about that same topic. The existence of popular topics can therefore make analyses such as community detection challenging in the absence of weighting.
In order to build bipartite networks connecting MPs to topics, we examined the corpus of each MP's speeches in more detail. We considered an MP to be connected to a topic when at least 6.7% of the MP's speeches over the course of a year were assigned to that topic by MALLET; this corresponds to an MP talking about a topic twice as much as would be expected if they spoke about all 30 topics equally within a year. This method removes topics that MPs only touch on briefly or in passing, which does not indicate engagement with the topic. Finally, MPs who had spoken fewer than 10^4 words in the entire term were removed from the network, as this volume of words is too small to be significant.
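The construction above can be sketched as follows, assuming per-MP yearly topic proportions are already available from the topic model (the helper names `mp_topic_edges` and `mp_projection` are hypothetical):

```python
from itertools import combinations
from collections import defaultdict

# 6.7% threshold: twice the uniform share over the 30 topics.
THRESHOLD = 2 / 30

def mp_topic_edges(proportions, threshold=THRESHOLD):
    """proportions: {mp: {topic: share of that MP's yearly speech}}.
    Returns the bipartite edge set linking MPs to topics of clear interest."""
    return {(mp, t) for mp, shares in proportions.items()
            for t, p in shares.items() if p >= threshold}

def mp_projection(edges):
    """Weighted MP-projection: edge weight = number of shared topics."""
    by_topic = defaultdict(set)
    for mp, t in edges:
        by_topic[t].add(mp)
    weights = defaultdict(int)
    for mps in by_topic.values():
        for a, b in combinations(sorted(mps), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Toy example with invented MPs, topics and proportions.
props = {"A": {"health": 0.30, "tax": 0.05},
         "B": {"health": 0.20, "tax": 0.10},
         "C": {"tax": 0.50}}
edges = mp_topic_edges(props)          # A-tax falls below the threshold
weights = mp_projection(edges)         # {("A","B"): 1, ("B","C"): 1}
```

Note how the projection weight counts shared interests, which is exactly what allows tenuous links (a single shared popular topic) to be distinguished from strong ones.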

Words spoken
Despite having fewer MPs, opposition parties tend to have a greater total word count than the governing party. Figure 4 shows the total word count for each of the three largest parties (as of the 50th parliament) over the course of four parliaments. In each parliament, the total word count for the opposition parties exceeds that of the governing party. The increase in words spoken does not appear to be driven by any particular

Time Series of Topic Popularity
To allow a decomposition by party, we ran a topic model on data concatenated by MP and year. The topic proportions obtained over a total of 30 topics are normalized for each year so that they are comparable across the 14-year span.
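The per-year normalization can be sketched as follows, assuming raw per-topic weights have been aggregated by year (`normalize_by_year` is a hypothetical helper, and the year/topic values below are invented):

```python
def normalize_by_year(counts):
    """counts: {year: {topic: raw weight}}.
    Rescales each year's topic weights to proportions summing to 1,
    making topic popularity comparable across years."""
    out = {}
    for year, topic_counts in counts.items():
        total = sum(topic_counts.values())
        out[year] = {t: c / total for t, c in topic_counts.items()}
    return out

props = normalize_by_year({2008: {"tax": 6.0, "health": 2.0}})
# props[2008] == {"tax": 0.75, "health": 0.25}
```

This rescaling removes year-to-year differences in the total volume of speech, so that a topic's share reflects its relative popularity within each year.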
Proceeding this way, we can reproduce the evolution of topic popularity over time in Parliament and its decomposition for each of the three most represented parties. Clear trends and differences across parties are visible in Fig. 5.

The Parliamentary Speech Network
The MP-projected networks for the 47th to 50th parliaments resulting from the process described above are shown in Fig. 7. The community structure [18] in these networks is visible, as is the party make-up of these communities. Table 1 shows the number of MPs per party present in each of these four networks.
The degree distributions in Fig. 9 also tell an interesting story. From Fig. 9(a) we see that topics are attracting the interest of more MPs over time. Associated with that is Fig. 9

Discussion
Topic models provide a way to parse human speech and extract themes from large bodies of text that are often difficult and time-consuming to analyze manually. In few cases is it more important to gather and process this information than in the speeches of those who control the legislative and political direction of a country. Topic modeling is unlikely to replace traditional media analysis of political speech; however,

here we have shown that it is a useful tool in examining larger themes and trends in political discourse. We were able to use topic modeling to track changes in the content of parliamentary speeches across time, and identify features in these time-series that correspond to particular issues or events.