
Quantifying gender bias towards politicians in cross-lingual language models

  • Karolina Stańczak ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    ks@di.ku.dk

    Affiliation Department of Computer Science, University of Copenhagen, Copenhagen, Denmark

  • Sagnik Ray Choudhury,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Computer Science, University of Copenhagen, Copenhagen, Denmark

  • Tiago Pimentel,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom

  • Ryan Cotterell,

    Roles Conceptualization, Funding acquisition, Supervision

    Affiliation Department of Computer Science, ETH Zürich, Zürich, Switzerland

  • Isabelle Augenstein

    Roles Conceptualization, Funding acquisition, Supervision

    Affiliation Department of Computer Science, University of Copenhagen, Copenhagen, Denmark

Abstract

Recent research has demonstrated that large pre-trained language models reflect societal biases expressed in natural language. The present paper introduces a simple method for probing language models to conduct a multilingual study of gender bias towards politicians. We quantify the usage of adjectives and verbs generated by language models surrounding the names of politicians as a function of their gender. To this end, we curate a dataset of 250k politicians worldwide, including their names and gender. Our study is conducted in seven languages across six different language modeling architectures. The results demonstrate that pre-trained language models’ stance towards politicians varies strongly across the analyzed languages. We find that while some words, such as dead and designated, are associated with both male and female politicians, a few specific words, such as beautiful and divorced, are predominantly associated with female politicians. Finally, and contrary to previous findings, our study suggests that larger language models do not tend to be significantly more gender-biased than smaller ones.

1 Introduction

In recent decades, digital media has become a primary source of information about political discourse [1], with a dominant share of discussions occurring online [2]. The Internet, and social media in particular, can shape public sentiment towards politicians [3], which, in extreme cases, can influence election results [4] and, thus, the composition of a country’s government [5]. However, information presented online is subjective, biased, and potentially harmful, as it may disseminate misinformation and toxicity. For instance, Prabhakaran et al. [6] show that online comments about politicians, in particular, tend to be more toxic than comments about people in other occupations.

Relatedly, natural language processing (NLP) models are increasingly being used across various domains of the Internet (e.g., in search) [7] and social media (e.g., to translate posts) [8]. These models, however, are typically trained on subjective and imbalanced data. Thus, while they appear to successfully learn general formal properties of the language (e.g., syntax, semantics [9, 10]), they are also susceptible to learning potentially harmful associations [6]. In particular, pre-trained language models are shown to perpetuate and amplify societal biases found in their training data [11]. For instance, Shwartz et al. [12] showed that pre-trained language models associated negativity with a certain name if the name corresponded to an entity who was frequently mentioned in negative contexts (e.g., Donald for Donald Trump). This strongly suggests a risk of harm when employing language models on downstream tasks such as search or translation.

One such harm that a language model could propagate is that of gender bias [13]. In fact, pre-trained language models have been reported to encode gender bias and stereotypes [11, 14, 15]. Most previous work examining gender bias in language models has focused on English [16], with only a few notable exceptions in recent years [17–20]. The approaches taken in prior work have relied on a range of methods including causal analysis [21], statistical measures such as association tests [14, 22], and correlations [23]. Their findings indicate that gender biases that exist in natural language corpora are also reflected in the text generated by language models.

Gender bias has been examined in stance analysis approaches, but with most investigations focusing on natural language corpora as opposed to language models. For instance, Ahmad et al. [24] and Voigt et al. [25] explicitly controlled for gender bias in two small-scale natural language corpora that focused on politicians within a single country. Specifically, according to Ahmad et al. [24], the media coverage given to male and female candidates in Irish elections did not correspond to the ratio of male to female contestants, with male candidates receiving more coverage. Perhaps surprisingly, Voigt et al. [25] found that the difference in the sentiment of responses written to male and female politicians is smaller than the corresponding difference for other public figures. However, it is unclear whether these findings would generalize when tested at scale (i.e., examining political figures from around the world) and in text generated by language models.

In this paper, we present a large-scale study on quantifying gender bias in language models with a focus on stance towards politicians. To this end, we generate a dataset for analyzing stance towards politicians encoded in a language model, where stance is inferred from simple grammatical constructs (e.g., “〈blank〉 person” where 〈blank〉 is an adjective or a verb). Moreover, we make use of a statistical method to measure gender bias—namely, a latent-variable model—and adapt this to language models. Further, while prior work has focused on monolingual language models [14, 23], we present a fine-grained study of gender bias in six multilingual language models across seven languages, considering 250k politicians from the majority of the world’s countries.

In our experiments, we find that, for both male and female politicians, the stance (whether the generated text is written in favor of, against, or neutral) towards politicians in pre-trained language models is highly dependent on the language under consideration. For instance, we show that, while male politicians are associated with more negative sentiment in English, the opposite is true for most other languages analyzed. However, we find no patterns for non-binary politicians (potentially due to data scarcity). Moreover, we find that, on the one hand, words associated with male politicians are also used to describe female politicians; but on the other hand, there are specific words of all sentiments that are predominantly associated with female politicians, such as divorced, maternal, and beautiful. Finally, and perhaps surprisingly, we do not find any significant evidence that larger language models tend to be more gender-biased than smaller ones, contradicting previous studies [14].

2 Background

2.1 Gender bias in pre-trained language models

Pre-trained language models have been shown to achieve state-of-the-art performance on many downstream NLP tasks [26–30]. During their pre-training, such models can partially learn a language’s syntactic and semantic structure [31, 32]. However, alongside capturing linguistic properties, such as morphology, syntax, and semantics, they also perpetuate and even potentially amplify biases [11]. Consequently, understanding and guarding against gender bias in pre-trained language models has garnered an increasing amount of research attention [16], which has created a need for datasets suitable for evaluating the extent to which biases occur in such models. Prior datasets for bias evaluation in language models have mainly focused on English, and many revolve around mutating templated sentences’ noun phrases, e.g., “This is a(n) 〈blank〉 person.” or “Person is 〈blank〉.”, where 〈blank〉 refers to an attribute such as an adjective or occupation [21–23]. Nadeem et al. [14] and Nangia et al. [15] present an alternative approach to gathering data for analyzing biases in language models. In this approach, crowd workers are tasked with producing variations of sentences that exhibit different levels of stereotyping, i.e., given a sentence that stereotypes a particular demographic, a minimally edited sentence that is less stereotyping, expresses an anti-stereotype, or has unrelated associations. While the template approach suffers from the artificial context of simply structured sentences [33], the second (i.e., crowdsourced annotations) may convey subjective opinions and is cost-intensive if employed for multiple languages. Moreover, while a fixed structure such as “Person is 〈blank〉.” may be appropriate for English, this template can introduce bias for other languages. Spanish, for instance, distinguishes between an ephemeral and a continuous sense of the verb “to be”, i.e., estar and ser, respectively. As such, a structure such as “Person está 〈blank〉.” biases the adjectives studied towards ephemeral characteristics. For example, the sentence “Obama está bueno (Obama is [now] good)” implies that Obama is good-looking as opposed to having the quality of being good. The lexical and syntactic choices in templated sentences may therefore be problematic in a cross-linguistic analysis of bias.

2.2 Stance towards politicians

Stance detection is the task of automatically determining if the author of an analyzed text is in favor of, against, or neutral towards a target [34]. Notably, Mohammad et al. [35] observed that a person may demonstrate the same stance towards a target using either negatively or positively sentimented language, since stance detection determines the favorability towards a given (pre-chosen) target of interest rather than the mere sentiment of the text. Thus, stance detection is generally considered a more complex task than sentiment classification. Previous work on stance towards politicians investigated biases extant in natural language corpora as opposed to biases in text generated by language models. Moreover, these works mostly targeted specific entities in a single country’s political context. Ahmad et al. [24], for instance, analyzed samples of national and regional news by Irish media discussing politicians running in general elections, with the goal of predicting election results. More recently, Voigt et al. [25] collected responses to Facebook posts for 412 members of the U.S. House and Senate from their public Facebook pages, while Padó et al. [36] created a dataset consisting of 959 articles with a total of 1841 claims, where each claim is associated with an entity. In this study, we curated a dataset to examine stance towards politicians worldwide in pre-trained language models.

3 Dataset generation

The present study introduces a novel approach to generating a multilingual dataset for identifying gender biases in language models. In our approach, we rely on a simple template “〈blank〉 person” that allows language models to generate words directly next to entity names. In this case, 〈blank〉 corresponds to a variable word, i.e., a mask in language modeling terms. This approach imposes no sentence structure and does not suffer from bias introduced by the lexical or syntactic choice of a templated sentence structure (e.g., [21–23]). We argue that this bottom-up approach can unveil associations encoded in language models between their representations of named entities (NEs) and words describing them. To the best of our knowledge, this method enables the first multilingual analysis of gender bias in language models, which is applicable to any language and with any choice of gendered entity, provided that a list of such entities with their gender is available.

Our approach therefore allows us to examine how the nature of gender bias exhibited in models might differ not only by model size and training data but also by the language under consideration. For instance, in a language such as Spanish, in which adjectives are gendered according to the noun they refer to, grammatical gender might become a highly predictive feature on which the model can rely to make predictions during its pre-training. On the other hand, since inanimate objects are gendered, they might take on adjectives that are not stereotypically associated with their grammatical gender, e.g., “la espada fuerte (the strong [feminine] sword)”, potentially mitigating the effects of harmful bias in these models.

Given the language independence of our methodology, we conducted analyses on two sets of language models: a monolingual English set and a multilingual set. Overall, our analysis covers seven typologically diverse languages: Arabic, Chinese, English, French, Hindi, Russian, and Spanish. These languages are all included in the training datasets of several well-known multilingual language models (m-BERT [26], XLM [37], and XLM-RoBERTa [38]), and happen to cover a culturally diverse choice of speaker populations.

As shown in Fig 1, our procedure is implemented in three steps. First, we queried the Wikidata knowledge base [39] to obtain politician names in the seven languages under consideration (Section 3.1). Next, using six language models (three monolingual English and three multilingual), we generated adjectives and verbs associated with those politician names (Section 3.2). Finally, we collected sentiment lexica for the analyzed languages to study differences in sentiment for generated words (Section 3.3). We make our dataset publicly available for use in future studies (https://github.com/copenlu/llm-gender-bias-polit.git).

Fig 1. The three-part dataset generation procedure.

Part 1 depicts politician names and their gender in the seven analyzed languages. Part 2 depicts the adjectives and verbs associated with the names that are generated by the language model. Part 3 depicts the sentiment lexica with associated values for each word.

https://doi.org/10.1371/journal.pone.0277640.g001

3.1 Politician names and gender

In the first step of our data generation pipeline (Part 1 in Fig 1), we curated a dataset of politician names and corresponding genders as reported in Wikidata entries for political figures. We restricted ourselves to politicians with a reported date of birth before 2001 and who had information regarding their gender on Wikidata. We note that politicians whose gender information was unavailable account for <3% of the entities for all languages. We also note that not all names were available on Wikidata in all languages, causing deviations in the counts for different languages (with a largely consistent set of non-binary politicians). Wikidata distinguishes between 12 gender identities: cisgender female, female, female organism, non-binary, genderfluid, genderqueer, male, male organism, third gender, transfeminine, transgender female, and transgender male. This information is maintained by the community and regularly updated. We discuss this further in Section 6. We decided to exclude female and male organisms from our dataset, as they refer to animals that (for one reason or another) were running for elections. Further, we replaced the cisgender female label with the female label. Finally, we created a non-binary gender category, which includes all politicians not identified as male or female (due to the small number of politicians for each of these genders; see S1 Text. Politician Gender in the Appendix). Table 1 presents the counts of politicians grouped by their gender (female, male, non-binary) for each language. (See S1 Text. Politician Gender in the Appendix for the detailed counts across all gender categories.) On average, the male-to-female gender ratio is 4:1 across the languages and there are very few names for the non-binary gender category.
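For reproducibility, the sketch below shows one way such a query could be issued against the Wikidata SPARQL endpoint. The property and item identifiers (P106 = occupation, Q82955 = politician, P21 = sex or gender, P569 = date of birth) are standard Wikidata conventions, but the query, language handling, and post-processing here are illustrative rather than the exact pipeline used in this study.

```python
import requests

# Illustrative SPARQL query for politicians with a recorded gender and a date of birth
# before 2001; LIMIT keeps the example small (a full extraction would paginate).
QUERY = """
SELECT ?politician ?politicianLabel ?genderLabel WHERE {
  ?politician wdt:P106 wd:Q82955 ;   # occupation: politician
              wdt:P21  ?gender ;     # sex or gender
              wdt:P569 ?dob .        # date of birth
  FILTER(YEAR(?dob) < 2001)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "gender-bias-study-example/0.1"},
)
for row in response.json()["results"]["bindings"]:
    print(row["politicianLabel"]["value"], "->", row["genderLabel"]["value"])
```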

Table 1. Counts of politicians grouped by their gender according to Wikidata (female, male, non-binary) for each language.

https://doi.org/10.1371/journal.pone.0277640.t001

3.2 Language generation

In the second step of the data generation process (Part 2 in Fig 1), we employed language models to generate adjectives and verbs associated with the politician’s name. Metaphorically, this language generation process can be thought of as a word association questionnaire. We provide the language model with a politician’s name and prompt it to generate a token (verb or adjective) with the strongest association to the name. We could take several approaches towards that goal. One possibility is to analyze a sentence generated by the language model which contains the name in question. However, the bidirectional language models under consideration are not trained with language generation in mind and hence do not explicitly define a distribution over language [40]—their pre-training consists of predicting masked tokens in already existing sentences [10]. Goyal et al. [41] proposed generation using a sampler based on the Metropolis-Hastings algorithm [42] to draw samples from non-probabilistic masked language models. However, the sentence length has to be provided in advance, and generated sentences often lack diversity, particularly when the process is constrained by specifying the names. Another possible approach would be to follow Amini et al. [33] and compute the average treatment effect of a politician’s name on the adjective (verb) choice given a dependency parsed sentence. In particular, Amini et al. [33] derived counterfactuals from the dependency structure of the sentence and then intervened on a specific linguistic property of interest, such as the gender of a noun. This method, while effective, becomes computationally prohibitive when handling a large number of entities.

In this work, we simplify this problem. We query each language model by providing it with either a “〈blank〉 person” input or its inverse “person 〈blank〉”, depending on the grammar formalisms of the language under consideration. (See S1 Table. Word Orderings in the Appendix for the word orderings used for each language.) Our approach returns a ranked list of words (with their probabilities) that the model associates with the name. The ranked list of words included a wide variety of part-of-speech (POS) categories; however, not all POS categories necessarily lend themselves to analyzing sentiment with respect to an associated name. We therefore filtered the data to just the adjectives and verbs, as these have been shown to capture sentiment about a name [43]. To filter these words, we used the Universal Dependencies [44] treebanks, and only kept adjectives and verbs that were present in any of the language-specific treebanks. We then lemmatized the generated words to avoid recovering the trivial grammatical gender agreement between a politician’s name and the gendered form of the associated adjective or verb.
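As an illustration of this querying step, the sketch below uses the HuggingFace transformers fill-mask pipeline; the checkpoint, the example name, and the top-k value are assumptions for the example rather than the study’s exact configuration, and the adjective/verb filtering and lemmatization described above would be applied to the returned candidates afterwards.

```python
from transformers import pipeline

# Query a masked language model with the "<blank> person" template: the mask token plays
# the role of <blank> and the person slot is filled with a politician's name.
unmasker = pipeline("fill-mask", model="bert-base-cased", top_k=100)

name = "Angela Merkel"  # any name from the Wikidata list (illustrative)
candidates = unmasker(f"{unmasker.tokenizer.mask_token} {name}")

# Each candidate carries the predicted token and its probability; downstream, only
# adjectives and verbs found in the Universal Dependencies treebanks are kept.
for candidate in candidates[:10]:
    print(candidate["token_str"], round(candidate["score"], 4))
```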

A final issue is that all our models use subword tokenizers and, therefore, a politician’s name is often split into multiple subword units rather than appearing as a single vocabulary item. For example, the name “Narendra Modi” is tokenized as [“na”, “##ren”, “##dra”, “mod”, “##i”] by the WordPiece tokenizer [45] in BERT [26]. This presents a challenge in ascertaining from the vocabulary alone whether a name was present in the model’s training data. However, all politicians whose names were processed have a Wikipedia page in at least one of the analyzed languages. As Wikipedia is a subset of the data on which these models were trained (except BERTweet, which is trained on a large collection of 855M English tweets), we assume that the named entities occurred in the language models’ training data, and therefore, that the predicted words for the 〈blank〉 token provide insight into the values reflected by these models.
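The subword splitting mentioned above can be inspected directly; the checkpoint below is an assumption, and the exact split may differ slightly across BERT variants.

```python
from transformers import AutoTokenizer

# Names are split into subword units, so a name's presence in the training data cannot be
# read off the model vocabulary directly.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Narendra Modi"))  # e.g., ['na', '##ren', '##dra', 'mod', '##i'] as reported above
```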

In total, we queried six language models for the word association task across two setups: a monolingual and a multilingual setup. In the monolingual setup, we used the following English language models: BERT [base and large; 26], BERTweet [46], RoBERTa [base and large; 27], ALBERT [base, large, xlarge and xxlarge; 47], and XLNet [base and large; 28]. In the multilingual setup, we used the following multilingual language models: m-BERT [26], XLM [base and large; 37] and XLM-RoBERTa [base and large; 38]. The pre-training of these models included data for each of the seven languages under consideration. Each language model, together with its corresponding features, is listed in Table 2. For each language, we entered the politicians’ names as written in that particular language.

3.3 Sentiment data

Previously, it has been shown that words used to describe entities differ based on the target’s gender and that these discrepancies can be used as a proxy to quantify gender bias [43, 48]. In light of this, we categorized words generated by the language model into positive, negative, and neutral sentiments (Part 3 in Fig 1).

To accomplish this task, we required a lexicon specific to each analyzed language. For English, we used the existing sentiment lexicon of Hoyle et al. [49]. This lexicon is a combination of multiple smaller lexica that has been shown to outperform the individual lexica, as well as their straightforward combination, when applied to a text classification task involving sentiment analysis. However, such a comprehensive lexicon was only available for English; we therefore collected various publicly available sentiment lexica for the remaining languages, which we combined into one comprehensive lexicon per language using SentiVAE [49]—a variational autoencoder model (VAE; [50]). The VAE allows for unifying labels from multiple lexica with disparate scales (binary, categorical, or continuous). In SentiVAE, the sentiment values for each word from different lexica are ‘encoded’ into three-dimensional vectors, which are summed to form the parameters of a Dirichlet distribution over the latent representation of the word’s polarity value. From this procedure, we obtained the final lexicon for each language—a list of words present in at least one of the individual lexica and three-dimensional representations of the words’ sentiments (positive, negative, and neutral). Through this approach, we aimed to cover more words and create a more robust sentiment lexicon while retaining scale coherence.
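As a rough illustration of this combination step (not the learned SentiVAE encoders themselves), the per-lexicon three-dimensional encodings of a word can be summed into Dirichlet concentration parameters whose mean serves as the word’s polarity distribution; the flat prior added below is an assumption of this simplified sketch.

```python
import numpy as np

def combine_word(lexicon_vectors):
    """Combine per-lexicon (positive, negative, neutral) encodings of one word.

    Simplified sketch in the spirit of SentiVAE: the summed vectors act as Dirichlet
    concentration parameters, and the Dirichlet mean is returned as the word's
    three-dimensional polarity distribution.
    """
    alpha = np.sum(lexicon_vectors, axis=0) + 1.0  # +1 acts as a flat prior (assumption)
    return alpha / alpha.sum()

# Example: one word rated positive by two lexica and neutral by a third.
print(combine_word([[1, 0, 0], [1, 0, 0], [0, 0, 1]]))
```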

Following the results presented in [49], we hypothesized that combining a larger number of individual lexica with SentiVAE leads to more reliable results. We confirmed this assumption for all languages but Hindi. We combined three multilingual sentiment lexica for all remaining languages: the sentiment lexicon by Chen and Skiena [51], BabelSenticNet [52] and UniSent [53]. Due to the poor evaluation performance, we decided to exclude BabelSenticNet and UniSent lexica for Hindi. Instead, we combined the sentiment lexica curated by Chen and Skiena [51], Desai [54], and Sharan [55]. Additionally, we incorporated monolingual sentiment lexica for Arabic [56], Chinese [57, 58], French [59, 60], Russian [61] and Spanish [6264].

Following Hoyle et al. [49], we evaluated the lexica resulting from the VAE approach on a sentiment classification task for each language. Namely, we used the resulting lexica to automatically label utterances (sentences and paragraphs) for their sentiment, based on the average sentiment of the words in each utterance. The results are shown in the Appendix in S2 Table. Sentiment Analysis Evaluation, where we also include the best performance achieved by a supervised model (as reported in the original dataset’s paper) as a point of reference. In general, the sentiment lexicon approach achieves performance comparable to the respective supervised model for most of the analyzed languages. We observed the greatest drop in performance for French, but a performance decrease was also visible for Hindi and Chinese. However, the results in S2 Table. Sentiment Analysis Evaluation in the Appendix are based on the sentiment classification of utterances rather than single words, as in our setup. We therefore treat these results as a lower bound on performance in our single-word scenario.
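A minimal sketch of this utterance-level evaluation, assuming a lexicon that maps lemmata to three-dimensional (positive, negative, neutral) vectors as produced above:

```python
import numpy as np

def classify_utterance(lemmata, lexicon):
    """Label an utterance by averaging the sentiment vectors of its in-lexicon words."""
    vectors = [lexicon[lemma] for lemma in lemmata if lemma in lexicon]
    if not vectors:
        return "neutral"  # fall-back when no word is covered by the lexicon
    mean = np.mean(vectors, axis=0)
    return ("positive", "negative", "neutral")[int(np.argmax(mean))]
```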

4 Method

Our aim is to quantify the usage of words around the names of politicians as a function of their gender. Formally, let 𝒢 be the set of genders, as discussed in Section 3.1; we denote elements of 𝒢 as g. Further, let 𝒩 be the set of politicians’ names found in our dataset; we denote elements of 𝒩 as n. With w we denote a lemmatized word in a language-specific vocabulary 𝒲. Finally, let G, W and N be, respectively, gender-, word- and name-valued random variables, which are jointly distributed according to a probability distribution p(W = w, G = g, N = n). We shall write p(w, g, n), omitting random variables, when clear from the context. Assuming we know the true distribution p, there is a straightforward metric for how much the word w is associated with the gender g—the point-wise mutual information (PMI) between w and g:
\[
\mathrm{PMI}(w, g) = \log \frac{p(w, g)}{p(w)\, p(g)} \tag{1}
\]
Much like mutual information (MI), PMI quantifies the amount of information we can learn about a specific variable from another, but, in contrast to MI, it is restricted to a single gender–word pair. In particular, as evinced in Eq (1), PMI measures the (log) probability of co-occurrence scaled by the product of the marginal occurrences. If a word is more often associated with a particular gender, its PMI will be positive. For example, we would expect a high value for PMI(female, pregnant) because the joint probability of these two words is higher than the marginal probabilities of female and pregnant multiplied together. Accordingly, in an ideal unbiased world, we would expect words such as successful or intelligent to have a PMI of approximately zero with all genders.

Above, we consider the true distribution p to be known, while, in fact, we solely observe samples from p. In the following, we assume that we only have access to an empirical distribution p̃ derived from samples from the true distribution p:
\[
\tilde{p}(w, g, n) = \frac{1}{I} \sum_{i=1}^{I} \mathbb{1}\{w_i = w,\, g_i = g,\, n_i = n\} \tag{2}
\]
where we assume a dataset \(\{(w_i, g_i, n_i)\}_{i=1}^{I}\) is composed of I independent samples from the distribution p. With a simple plug-in estimator, we can then estimate the PMI above using this p̃, as opposed to p. The plug-in estimator, however, may produce biased PMI estimates; these biases are in general positive, as shown by [65, 66].
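For concreteness, the plug-in estimate of Eq (1) can be computed directly from observed (word, gender) samples; the sketch below uses natural logarithms and, as noted above, inherits the positive bias of the plug-in estimator.

```python
import math
from collections import Counter

def plugin_pmi(samples):
    """Plug-in PMI estimates from (word, gender) samples, e.g. [("beautiful", "female"), ...]."""
    joint = Counter(samples)
    word_counts = Counter(w for w, _ in samples)
    gender_counts = Counter(g for _, g in samples)
    total = len(samples)
    pmi = {}
    for (w, g), count in joint.items():
        # PMI(w, g) = log p(w, g) / (p(w) p(g)), with probabilities replaced by relative frequencies
        pmi[(w, g)] = math.log((count / total) / ((word_counts[w] / total) * (gender_counts[g] / total)))
    return pmi
```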

To get a better approximation of p, we estimate a model pθ to generalize from the observed samples, with the hope that we will be able to better infer the relationship between G and W. We estimate pθ by minimizing the cross-entropy given below:
\[
\mathcal{L}_{\text{xent}}(\theta) = -\sum_{n \in \mathcal{N}} \sum_{w \in \mathcal{W}} \tilde{p}(w, n) \log p_\theta(w, g_n) \tag{3}
\]
where g_n is the gender of the politician with name n. Then, we consider a regularized estimator of pointwise mutual information. We factorize pθ(w, g) = pη(w | g) pϕ(g). We first define
\[
p_\eta(w \mid g) \propto \exp\!\left(m_w + \mathbf{d}_w^{\top} \mathbf{g}\right) \tag{4}
\]
where \(\mathbf{g} \in \{0, 1\}^{|\mathcal{G}|}\) is a one-hot gender representation, and both \(m \in \mathbb{R}^{|\mathcal{W}|}\) and \(D \in \mathbb{R}^{|\mathcal{W}| \times |\mathcal{G}|}\) are model parameters, which we index as m_w and d_w; these parameters induce a prior distribution over words pθ(w) ∝ exp(m_w) and word-specific deviations, respectively. Second, we define
\[
p_\phi(g) \propto \exp(\phi_g) \tag{5}
\]
where \(\phi \in \mathbb{R}^{|\mathcal{G}|}\) are model parameters, which we index as ϕ_g.

Assuming that pη(w | g) ≈ p(w | g), i.e., that our model learns the true distribution p, we have that the deviation d_{w,g} will be equivalent (up to an additive term that is constant on the word) to the PMI in Eq (1):
\[
\mathrm{PMI}(w, g) = \log \frac{p(w \mid g)}{p(w)} \tag{6}
\]
\[
\approx \log \frac{p_\eta(w \mid g)}{p_\theta(w)} \tag{7}
\]
\[
= \left(m_w + d_{w,g} - \log Z_g\right) - \left(m_w - \log Z\right) \tag{8}
\]
\[
= d_{w,g} + \text{const} \tag{9}
\]
where Z_g and Z are normalization constants that do not depend on the word. If we estimate the model without any regularization or latent sentiment, then ranking the words by their deviation scores from the prior distribution is equivalent to ranking them by their PMI. However, we are not merely interested in quantifying the usage of words around the entities but are also interested in analyzing those words’ sentiments. Thus, let 𝒮 be a set of sentiments; we denote elements of 𝒮 as s. More formally, the extended model jointly represents adjective (or verb) choice (w) with its sentiment (s), given a politician’s gender (g), as follows:
\[
p_\theta(w, s, g) = p_\eta(w \mid s, g)\; p_\sigma(s \mid g)\; p_\phi(g) \tag{10}
\]
We compute the first factor in Eq (10) by plugging in Eq (4), albeit with a small modification to condition it on the latent sentiment:
\[
p_\eta(w \mid s, g) \propto \exp\!\left(m_w + \mathbf{d}_{w,s}^{\top} \mathbf{g}\right) \tag{11}
\]
The second factor in Eq (10) is defined as pσ(s | g) ∝ exp(σ_{s,g}), and the third factor is defined as before, i.e., pϕ(g) ∝ exp(ϕ_g), where \(\sigma \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{G}|}\) and \(\phi \in \mathbb{R}^{|\mathcal{G}|}\) are learned. Thus, the model pθ is parametrized by θ = {m, D, σ, ϕ}, with \(D \in \mathbb{R}^{|\mathcal{W}| \times |\mathcal{S}| \times |\mathcal{G}|}\) denoting the word- and sentiment-specific deviation. As we do not have access to explicit sentiment information (it is encoded as a latent variable), we marginalize it in Eq (10) to construct a latent-variable model
\[
p_\theta(w, g) = \sum_{s \in \mathcal{S}} p_\eta(w \mid s, g)\; p_\sigma(s \mid g)\; p_\phi(g) \tag{12}
\]
whose marginal likelihood we maximize to find good parameters θ. This model enables us to analyze how the choice of a generated word depends not only on a politician’s gender but also on a sentiment via jointly modeling gender, sentiment, and generated words, as depicted in Fig 2. Through the distribution pη(w | s, g), this model enables us to extract ranked lists of adjectives (or verbs), grouped by gender and sentiment, that were generated by a language model to describe politicians.

Fig 2. Graphical model depicting the relations among politician’s gender (g), generated word’s sentiment (s), and the generated word (w).

https://doi.org/10.1371/journal.pone.0277640.g002

We additionally apply posterior regularization [67] to guarantee that our latent variable corresponds to sentiments. This regularization is taken as the Kullback–Leibler (KL) divergence between our estimate of pθ(s | w) and q(s | w), where q is a target posterior that we obtain from the sentiment lexicon described in detail in Section 3.3. Further, we also use L1-regularization to account for sparsity. Combining the cross-entropy term with the KL and L1 regularizers, we arrive at the loss function:
\[
\mathcal{L}(\theta) = \mathcal{L}_{\text{xent}}(\theta) + \beta \sum_{w \in \mathcal{W}} \mathrm{KL}\!\left(p_\theta(s \mid w) \,\|\, q(s \mid w)\right) + \alpha \lVert D \rVert_1 \tag{13}
\]
with hyperparameters α, β ≥ 0. This objective is minimized with the Adam optimizer [68]. We then validate the method through an inspection of the posterior regularizer values; values close to zero indicate the validity of the approach, as a low KL divergence implies our latent distribution pθ(s | w) closely represents the lexicon’s sentiment.
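A minimal PyTorch sketch of the latent-variable model and the loss in Eq (13) is given below. The parameter names mirror Eqs (4)–(13); the tensor layout, the KL direction (which follows the reconstruction above and can be swapped), and the omission of the training loop and of per-name structure are simplifications of this sketch rather than the study’s implementation.

```python
import torch
import torch.nn.functional as F

class GenderSentimentModel(torch.nn.Module):
    """Latent-variable model of Eqs (10)-(12): p(w, s, g) = p(w | s, g) p(s | g) p(g)."""

    def __init__(self, V: int, S: int = 3, G: int = 2):
        super().__init__()
        self.m = torch.nn.Parameter(torch.zeros(V))         # word prior scores m_w, Eq (4)
        self.d = torch.nn.Parameter(torch.zeros(V, S, G))   # word- and sentiment-specific deviations, Eq (11)
        self.sigma = torch.nn.Parameter(torch.zeros(S, G))  # sentiment-given-gender scores
        self.phi = torch.nn.Parameter(torch.zeros(G))       # gender prior scores

    def log_joint(self) -> torch.Tensor:
        log_p_w = F.log_softmax(self.m[:, None, None] + self.d, dim=0)  # log p(w | s, g), shape (V, S, G)
        log_p_s = F.log_softmax(self.sigma, dim=0)                      # log p(s | g), shape (S, G)
        log_p_g = F.log_softmax(self.phi, dim=0)                        # log p(g), shape (G,)
        return log_p_w + log_p_s[None] + log_p_g[None, None]            # log p(w, s, g)

    def loss(self, w_idx, g_idx, q, alpha: float, beta: float):
        """Eq (13): cross-entropy + beta * KL posterior regularizer + alpha * L1 on the deviations.

        w_idx, g_idx are LongTensors of observed (word, gender) pairs and q is the
        lexicon-derived target posterior q(s | w) of shape (V, S).
        """
        lj = self.log_joint()
        log_p_wg = torch.logsumexp(lj, dim=1)                 # log p(w, g), marginalizing s as in Eq (12)
        xent = -log_p_wg[w_idx, g_idx].mean()                 # cross-entropy over observed samples
        log_p_ws = torch.logsumexp(lj, dim=2)                 # log p(w, s)
        log_post = log_p_ws - torch.logsumexp(log_p_ws, dim=1, keepdim=True)  # log p_theta(s | w)
        post = log_post.exp()
        # KL(p_theta(s | w) || q(s | w)), averaged over words; swap the arguments for the other direction.
        kl = (post * (log_post - q.clamp_min(1e-12).log())).sum(dim=1).mean()
        l1 = self.d.abs().sum()
        return xent + beta * kl + alpha * l1
```

In practice, the parameters θ would then be fit with the Adam optimizer over the observed (word, gender) pairs, and the per-sentiment deviations in D would be read off to produce the ranked word lists analyzed below.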

Finally, we note that due to the relatively small number of politicians identified in the non-binary gender group, we restrict ourselves to two binary genders in the generative latent-variable setting of the extended model. In Section 6, we discuss the limitations of this modeling decision.

5 Experiments and results

We applied the methods defined in Section 4 to study the presence of gender bias in the dataset described in Section 3. We hypothesized that the generated vocabulary for English would be much more varied than for the other languages. Therefore, in order to decrease computational costs and maintain similar vocabulary sizes across languages, we decided to further limit the number of generated words for English. We used the 20 highest-probability adjectives and verbs generated for each politician’s name in English, both in the mono- and multilingual setups. For the other languages, the top 100 adjectives and top 20 verbs were used. Detailed counts of generated adjectives are presented in S3 Table. Generated Words in the Appendix. We confirmed our hypothesis that the vocabulary generated for English is broader: including only the top 20 adjectives and verbs for English already results in a vocabulary set (unique lemmata generated by each of the language models) similar to or larger than that for Spanish, which has the largest vocabulary of all the remaining languages.

First, using the English portion of the dataset, we analyzed estimated PMI values to look for the words whose association with a specific gender differs the most across the three gender categories. Then, we followed an experimental setup virtually identical to that presented in [43] for our dataset. In particular, we tested whether adjectives and verbs generated by language models unveil the same patterns as discovered in natural corpora and whether they confirm previous findings about the stance towards politicians. To this end, we employed PMI and the latent-variable model on our dataset and qualitatively evaluated the results. We analyzed generated adjectives and verbs in terms of their alignment with supersenses—a set of pre-defined semantic word categories.

Next, we conducted a multilingual analysis for the seven selected languages via PMI and the latent-variable model to inspect both qualitative and quantitative differences in words generated by six cross-lingual language models. Further, we performed a cluster analysis of the generated words based on their word representations extracted from the last hidden state of each language model for all analyzed languages. In additional experiments in Appendix S3 Text. Additional Experiments, we examined gender bias towards the most popular politicians. Then, for each language, we studied gender bias towards politicians whose country of origin (i.e., their nationality) uses the respective language as an official language. Finally, we investigated gender bias towards politicians born before and after the Baby Boom to control for temporal changes. However, we did not find any significant patterns.

Following Hoyle et al. [43], our reported results were an average over hyperparameters: for the L1 penalty α ∈ {0, 10−5, 10−4, 0.001, 0.01} and for the posterior regularization β ∈ {10−5, 10−4, 0.001, 0.01, 0.1, 1, 10, 100}.

5.1 Monolingual setup

5.1.1 PMI and latent-variable model.

In the following, we report the PMI values calculated based on words generated by all the monolingual English language models under consideration. From the PMI values for words associated with politicians of male, female, or non-binary genders, it is apparent that words associated with the female gender are often connected to weaknesses such as hysterical and fragile or to their appearance (blonde), while adjectives generated for male politicians tend to describe their political beliefs (fascist and bolshevik). There is no such distinguishable pattern for the non-binary gender, most likely due to an insufficient amount of data. See Table 3 for details.

Table 3. Top 15 adjectives with the biggest difference in PMI for male and female (left); top 5 adjectives (top right) and bottom 5 verbs (bottom right) PMI for non-binary gender.

Based on words generated by all monolingual language models for English.

https://doi.org/10.1371/journal.pone.0277640.t003

The results for the latent-variable model are similar to those for the PMI analysis. Adjectives associated with appearance are more often generated for female politicians. Additionally, words describing marital status (divorced and unmarried) are more often generated for female politicians. On the other hand, positive adjectives that describe men often relate to their character values such as bold and independent. Further examples are available in Table 4.

Table 4. The top 10 adjectives, for female and male politicians, that have the largest average deviation for each sentiment, extracted from all monolingual English models.

https://doi.org/10.1371/journal.pone.0277640.t004

Following Hoyle et al. [43], we used two existing semantic resources based on the WordNet database [69] to quantify the patterns revealed above. We grouped adjectives into 13 supersense classes using classes defined by Tsvetkov et al. [70]; similarly, we grouped verbs into 15 supersenses according to the database presented in Miller et al. [71]. We list the defined groups together with their respective example words in the Appendix S2 Text. Supersenses.

We performed an unpaired permutation test [72] considering the 100 largest-deviation words and found that male politicians are more often described negatively with adjectives related to their emotions (e.g., angry) and more positively with adjectives related to their minds (e.g., intelligent), as presented in Fig 3. These results differ from the findings of Hoyle et al. [43], where no significant evidence of these tendencies was found.
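The test itself is a standard unpaired permutation test of a difference in means; the sketch below is generic, and the inputs (e.g., 0/1 indicators of supersense membership for the 100 largest-deviation adjectives per gender) are assumptions based on the description above.

```python
import numpy as np

def permutation_test(x, y, n_perm=10_000, seed=0):
    """Two-sided unpaired permutation test for a difference in means between samples x and y."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        hits += abs(perm[: len(x)].mean() - perm[len(x):].mean()) >= observed
    return (hits + 1) / (n_perm + 1)  # smoothed p-value
```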

Fig 3. The frequency with which the 100 largest-deviation adjectives for male and female gender correspond to the supersense “feeling” for the negative sentiment and the supersense “mind” for the positive sentiment.

Results presented for language models with significant differences (p < 0.004) between male and female politicians after Bonferroni correction for the number of supersenses (here, 13).

https://doi.org/10.1371/journal.pone.0277640.g003

5.1.2 Sentiment analysis.

We report the results in Fig 4. We found that words more commonly generated by language models to describe male rather than female politicians are also more often negative, and that this pattern holds across most language models. However, based on the results of the qualitative study (see details in Table 4), we assume this is due to several strongly positive words, such as beloved and marry, which are highly associated with female politicians. We note that the deviation scores for words associated with male politicians are relatively low compared to the scores for adjectives and verbs associated with female politicians, which also introduces more neutral words into the list of negative-sentiment words. Ultimately, this suggests that words of negative and neutral sentiment are more equally distributed across genders, with few words being used particularly often in association with a specific gender. Conversely, positive words generated around male and female genders differ more substantially.

Fig 4. Mean frequency with which the 100 largest-deviation adjectives for male and female genders correspond to positive or negative sentiment in English.

Each point denotes a language model. Significant differences (p < 0.05) are represented with ‘x’ markers.

https://doi.org/10.1371/journal.pone.0277640.g004

To investigate whether there were significant differences across language models based on their size and architecture, we performed a two-way analysis of variance (ANOVA). Language, model size, and architecture were the independent variables and sentiment values were the target variables. We then analyzed the differences in the mean frequency with which the 100 largest deviation words (adjectives and verbs) correspond to each sentiment for the male and female genders. The results presented in Table 5 indicate significant differences in negative sentiment in the descriptions of male politicians generated by models of different architectures. We note that since we are not able to separate the effects of model design and training data, the term architecture encompasses both aspects of pre-trained language models. In particular, XLNet tends to generate more words of negative sentiment compared to other models examined. Surprisingly, larger models tend to exhibit similar gender biases to smaller ones.
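Such a two-way ANOVA can be run, for example, with statsmodels; the column names and the numbers below are purely illustrative placeholders, not the study’s data.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per language model: mean frequency of negative-sentiment words among the 100
# largest-deviation words for one gender (response), with architecture and size as factors.
df = pd.DataFrame({
    "architecture": ["bert", "bert", "xlnet", "xlnet", "roberta", "roberta"],
    "size":         ["base", "large", "base", "large", "base", "large"],
    "neg_freq":     [0.21, 0.20, 0.27, 0.26, 0.19, 0.22],   # made-up values for illustration
})
fit = smf.ols("neg_freq ~ C(architecture) + C(size)", data=df).fit()
print(sm.stats.anova_lm(fit, typ=2))
```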

Table 5. ANOVA computed group mean sentiment for male and female genders for adjectives generated by monolingual language models.

Significant differences (p < 0.05) are indicated in bold and the dashes denote a baseline group for the analyzed parameter.

https://doi.org/10.1371/journal.pone.0277640.t005

5.2 Multilingual setup

5.2.1 PMI and latent-variable model.

For PMI scores, a pattern similar to the monolingual setup holds. Words associated with female politicians often relate to their appearance and social characteristics, such as beautiful and sweet (prevalent for English, French, and Chinese) or attentive (in Russian), whereas male politicians are described as knowledgeable, serious, or (in Arabic) prophetic. Again, we were not able to detect any patterns in words generated around politicians of non-binary gender, for whom generated words range from similar and common (as in French and Russian) to angry and unique (as in Chinese).

The results of the latent-variable model confirm the previous findings (for an example, see Table 6 for Spanish). Some of the more male-skewed words, such as dead and designated, are still often associated with female politicians, given the relatively low deviation scores. Words of positive sentiment used to describe male politicians are often successful (Arabic) or rich (Arabic, Russian). In a negative context, male politicians are described as difficult (Chinese and Russian) or serious (prevalent in French and Hindi), and the associated verbs are sentence (in Chinese) and arrest (in Russian). Notably, words generated in Russian have a strong negative connotation, such as criminal and evil. Positive words associated with female politicians are mostly related to their appearance, while there is no such pattern for words of negative sentiment.

Table 6. The top 10 adjectives, for female and male politicians, that have the largest average deviation for each sentiment, extracted from all multilingual models for Spanish.

https://doi.org/10.1371/journal.pone.0277640.t006

Unlike for English, we did not have access to pre-defined lists of supersenses in the multilingual scenario. We therefore analyzed word representations of the generated words and resorted to cluster analysis to identify semantic groups among the generated adjectives. For each of the generated words, we extracted their word representations using the respective language model. We then performed a cluster analysis for each of the languages and language models analyzed, using the k-means clustering algorithm on the extracted word representations. We conducted this analysis separately for each gender to analyze differences in clusters generated for different genders. For each gender–language pair, there are clusters with words describing nationalities, such as basque and arabe in French (see Table 7 and S4 Table. Cluster Analysis in the Appendix). Furthermore, regardless of language, there are clusters of words typically associated with the female gender, such as beautiful. The distribution of genders for which the words were generated is relatively equal across all clusters. Fig 5 shows the distribution of genders for which the words were generated for Arabic with the XLM-base model. These results hold for all languages and language models. However, consistent with our earlier latent-variable model results, words associated with male politicians are also often used to describe female politicians. The same is not true for female-biased words, which do not appear as often when describing male politicians.
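A minimal sketch of this clustering step follows; the multilingual checkpoint, the word list, and the number of clusters are illustrative assumptions.

```python
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(word: str) -> torch.Tensor:
    """Mean-pool the last hidden state over the word's subword tokens."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

words = ["basque", "arabe", "belle", "fort", "riche"]  # illustrative French adjectives
X = torch.stack([embed(w) for w in words]).numpy()
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(words, labels)))
```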

Fig 5. Distribution of genders in each cluster identified within word representations generated for Arabic with the XLM-base language model.

https://doi.org/10.1371/journal.pone.0277640.g005

Table 7. Results of the cluster analysis for French for words generated with m-BERT in association with male politicians.

We list 5 words from every cluster.

https://doi.org/10.1371/journal.pone.0277640.t007

5.2.2 Sentiment analysis.

We additionally analyzed the overall sentiment of the six cross-lingual language models towards male and female politicians for the selected languages. Our analysis suggests that sentiment towards politicians varies depending on the language used. For English, female politicians tend to be described more positively, whereas the opposite holds for Arabic, French, Hindi, and Spanish. For Russian, words associated with female politicians are more polarized, having both more positive and more negative sentiment. No significant patterns were detected for Chinese. See Fig 6 for details.

Fig 6. Mean frequency with which the top 100 adjectives (those most strongly associated with either the male or female gender) correspond to negative (top) and positive (bottom) sentiment.

Significant differences (p < 0.05) are represented with ‘x’ markers.

https://doi.org/10.1371/journal.pone.0277640.g006

Finally, analogously to the monolingual setup, we investigated whether there were any significant differences in sentiment dependent on the target language, language model sizes, and architectures; see ANOVA analysis in Table 8. Both XLM and XLM-RoBERTa generated fewer negative and more positive words than BERT multilingual, e.g., the mean frequency with which the 100 largest deviation adjectives for the male gender correspond to negative sentiment is lower by 2.00% and 5.73% for XLM and XLM-R, respectively. Indeed, as suggested above, we found that language was a highly significant factor for bias in cross-lingual language models, along with model architecture. For English and French, e.g., generated words were often more negative when used to describe male politicians. Surprisingly, we did not observe a significant influence of model size on the encoded bias.

Table 8. ANOVA computed group mean sentiment values for male and female genders for adjectives generated by cross-lingual language models.

Significant differences (p < 0.05) are in bold and the dashes denote a baseline group for the analyzed parameter.

https://doi.org/10.1371/journal.pone.0277640.t008

6 Limitations

6.1 Potential harms in using gender-biased language models

Prior research has unveiled the prevalence of gender bias in political discourse, which can be picked up by NLP systems if they are trained on such texts. Gender bias encoded in large language models is particularly problematic, as these models are used as the building blocks of most modern NLP systems. Biases in such language models can lead to gender-biased predictions, and thus reinforce harmful stereotypes extant in natural language when these models are deployed. However, it is important to clarify that, by our definition, bias does not have to be harmful (e.g., female and pregnant will naturally have a high PMI score) [73], although it can be in several instances (e.g., a positive PMI between female and fragile).

6.2 Quality of collaborative knowledge bases

For the purpose of this research, it is imperative to acknowledge the presence of gender bias in Wikipedia, which is characterized by a clear disparity in the number of female editors [74], a smaller percentage of notable women having their own Wikipedia page, and these pages being less extensive [75]. Indeed, we observe this disparity in the gender distribution in Table 1. We gathered information on politicians from the open knowledge base Wikidata, which claims to do gender modeling at scale, globally, for every language and culture, with more data and coverage than any other resource [76]. It is a collaboratively edited data source, and so, in theory, anyone could make changes to an entry (including the person the entry is about), which poses a potential source of bias. Since we are only interested in overall gender bias trends as opposed to results for individual entities, we can tolerate a small amount of noise.

6.3 Gender selection

In our analysis, we aimed to incorporate genders beyond male and female while maintaining statistical significance. However, politicians of non-binary gender cover only 0.025% of the collected entities. Further, politicians with no explicit gender annotation were not considered in our analysis, and it is plausible that this excluded set is skewed towards politicians of non-binary gender. This restricts the possible analyses for politicians of non-binary gender and risks drawing wrong conclusions. Although our method can be applied to any named entities of non-binary gender to analyze the stance towards them, we hope future work will obtain more data on politicians of non-binary gender to avoid this limitation and to enable a fine-grained study of gender bias towards diverse gender identities, departing from the categorical view on gender.

6.4 Beyond English

We explored gender bias encoded in cross-lingual language models in seven typologically distinct languages. We acknowledge that the selection of these languages may introduce additional biases to our study. Further, the words generated by a language model can also simply reflect how particular politicians are perceived in these languages, and how much they are discussed in general, rather than a more pervasive gender bias against them. However, considering our results in aggregate, it is likely that the findings capture general trends of gender bias. Finally, a potential bias in our study may be associated with racial biases that are reflected by a language model, as names often carry information about a politician’s country of origin and ethnic background.

7 Conclusions

In this paper, we have presented the largest study to date of quantifying gender bias towards politicians in language models, considering a total of 250k politicians. We established a novel method to generate a multilingual dataset to measure gender bias towards entities. We studied the qualitative differences in language models’ word choices and analyzed the sentiments of generated words in conjunction with gender using a latent-variable model. Our results demonstrate that the stance towards politicians in pre-trained models is highly dependent on the language used. Finally, contrary to previous findings [14], our study suggests that larger language models do not tend to be significantly more gender-biased than smaller ones.

While we restricted our analysis to seven typologically diverse languages, as well as to politicians, our method can be employed to analyze gender bias towards any NEs and in any language, provided that gender information for those entities is available. Future work will focus on extending this analysis to investigate gender bias in a wider number of languages and will study this bias’ societal implications from the perspective of political science.

Acknowledgments

The authors would like to thank Eleanor Chodroff, Clara Meister, and Zeerak Talat for their feedback on the manuscript.

References

  1. 1. Kleinberg MS, Lau RR. The Importance of Political Knowledge for Effective Citizenship: Differences Between the Broadcast and Internet Generations. Public Opinion Quarterly. 2019;83(2):338–362. Available from:
  2. 2. Hampton KN, Shin I, Lu W. Social media and political discussion: when online presence silences offline conversation. Information, Communication & Society. 2017;20(7):1090–1107. Available from:
  3. 3. Zhuravskaya E, Petrova M, Enikolopov R. Political Effects of the Internet and Social Media. Annual Review of Economics. 2020;12(1):415–438. Available from:
  4. 4. Mohammad SM, Zhu X, Kiritchenko S, Martin J. Sentiment, Emotion, Purpose, and Style in Electoral Tweets. Information Processing and Management. 2015;51(4):480–499. Available from:
  5. 5. Metaxas PT, Mustafaraj E. Social Media and the Elections. Science. 2012;338(6106):472–473. Available from: https://www.science.org/doi/abs/10.1126/science.1230456. pmid:23112315
  6. 6. Prabhakaran V, Hutchinson B, Mitchell M. Perturbation Sensitivity Analysis to Detect Unintended Model Biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Hong Kong, China: Association for Computational Linguistics. 2019;5740–5745. Available from: https://www.aclweb.org/anthology/D19-1578.
  7. 7. Huang PS, He X, Gao J, Deng L, Acero A, Heck L. Learning deep structured semantic models for web search using clickthrough data Proceedings of the 22nd ACM international conference on Information & Knowledge Management. 2013. Available from: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf.
  8. 8. Gotti F, Langlais P, Farzindar A. Translating Government Agencies’ Tweet Feeds: Specificities, Problems and (a few) Solutions In Proceedings of the Workshop on Language Analysis in Social Media. Atlanta, Georgia: Association for Computational Linguistics. 2013;80–89. Available from: https://aclanthology.org/W13-1109.
  9. 9. Liu NF, Gardner M, Belinkov Y, Peters ME, Smith Noah A. Linguistic Knowledge and Transferability of Contextual Representations In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019;1073–1094. Available from: https://aclanthology.org/N19-1112.
  10. 10. Rogers A, Kovaleva O, Rumshisky A. A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics. 2020;8:842–866. Available from: https://aclanthology.org/2020.tacl-1.54
  11. 11. Bender EM, Gebru T, McMillan-Major A, Shmitchell S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 Association for Computing Machinery Conference on Fairness, Accountability, and Transparency. New York, NY, USA: Association for Computing Machinery. 2021;610–623.
  12. 12. Shwartz V, Rudinger R, Tafjord O. “You are grounded!”: Latent Name Artifacts in Pre-trained Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Online: Association for Computational Linguistics. 2020;6850–6861. Available from: https://www.aclweb.org/anthology/2020.emnlp-main.556.
  13. 13. Basta C, Costa-jussà MR, Casas N. Evaluating the Underlying Gender Bias in Contextualized Word Embeddings. In: Proceedings of the First Workshop on Gender Bias in Natural Language Processing; August 2019; Florence, Italy. Association for Computational Linguistics; 2019. p. 33–39. Available from: https://www.aclweb.org/anthology/W19-3805.
  14. 14. Nadeem M, Bethke A, Reddy S. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Online: Association for Computational Linguistics. 2021;5356–5371. Available from: https://aclanthology.org/2021.acl-long.416.
  15. 15. Nangia N, Vania C, Bhalerao R, Bowman SR. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Online: Association for Computational Linguistics. 2020;1953–1967. Available from: https://www.aclweb.org/anthology/2020.emnlp-main.154.
  16. 16. Stańczak K, Augenstein I. A Survey on Gender Bias in Natural Language Processing. arXiv preprint arXiv:211214168. 2021;abs/2112.14168. Available from: https://arxiv.org/abs/2112.14168.
  17. 17. Liang S, Dufter P, Schütze H. Monolingual and Multilingual Reduction of Gender Bias in Contextualized Representations. In Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics. 2020;5082–5093. Available from: https://www.aclweb.org/anthology/2020.coling-main.446.
  18. 18. Kaneko M, Imankulova A, Bollegala D, Okazaki N. Gender Bias in Masked Language Models for Multiple Languages. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle, United States: Association for Computational Linguistics. 2022; 2740–2750. Available from: https://aclanthology.org/2022.naacl-main.197.
  19. 19. Névéol A, Dupont Y, Bezançon J, Fort K. French CrowS-Pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics. 2022;8521–8531. Available from: https://aclanthology.org/2022.acl-long.583.
  20. 20. Martinková S, Stańczak Karolina, Augenstein I. Measuring Gender Bias in West Slavic Language Models. In Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023). Dubrovnik, Croatia. 2023; 146–154. Available from: https://aclanthology.org/2023.bsnlp-1.17.
  21. Vig J, Gehrmann S, Belinkov Y, Qian S, Nevo D, Singer Y, et al. Investigating Gender Bias in Language Models Using Causal Mediation Analysis. In Advances in Neural Information Processing Systems. Curran Associates, Inc. 2020;33:12388–12401. Available from: https://proceedings.neurips.cc/paper/2020/file/92650b2e92217715fe312e6fa7b90d82-Paper.pdf.
  22. May C, Wang A, Bordia S, Bowman SR, Rudinger R. On Measuring Social Biases in Sentence Encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, Minnesota: Association for Computational Linguistics. 2019;622–628. Available from: https://www.aclweb.org/anthology/N19-1063.
  23. Webster K, Wang X, Tenney I, Beutel A, Pitler E, Pavlick E, et al. Measuring and Reducing Gendered Correlations in Pre-trained Models. arXiv preprint arXiv:2010.06032. 2020. Available from: https://arxiv.org/abs/2010.06032.
  24. Ahmad K, Daly N, Liston V. What is new? News media, General Elections, Sentiment, and Named Entities. In Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology. Chiang Mai, Thailand: Asian Federation of Natural Language Processing. 2011;80–88. Available from: https://www.aclweb.org/anthology/W11-3712.
  25. Voigt R, Jurgens D, Prabhakaran V, Jurafsky D, Tsvetkov Y. RtGender: A Corpus for Studying Differential Responses to Gender. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation. Miyazaki, Japan: European Language Resources Association. 2018. Available from: https://www.aclweb.org/anthology/L18-1445.
  26. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, Minnesota: Association for Computational Linguistics. 2019;4171–4186. Available from: https://www.aclweb.org/anthology/N19-1423.
  27. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692. 2019. Available from: https://arxiv.org/abs/1907.11692.
  28. Yang Z, Dai Z, Yang Y, Carbonell JG, Salakhutdinov R, Le QV. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, December 8-14, 2019, Vancouver, BC, Canada. 2019;5754–5764. Available from: https://proceedings.neurips.cc/paper/2019/hash/dc6a7e655d7e5840e66733e9ee67cc69-Abstract.html.
  29. Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, Roberts A, et al. PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311. 2022. Available from: https://arxiv.org/abs/2204.02311.
  30. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, December 6-12, 2020, virtual. 2020;1877–1901. Available from: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
  31. Hewitt J, Manning CD. A Structural Probe for Finding Syntax in Word Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, Minnesota: Association for Computational Linguistics. 2019;4129–4138. Available from: https://www.aclweb.org/anthology/N19-1419.
  32. Tenney I, Xia P, Chen B, Wang A, Poliak A, McCoy TR, et al. What do you learn from context? Probing for sentence structure in contextualized word representations. In 7th International Conference on Learning Representations 2019, New Orleans, USA, May 6-9, 2019. 2019. Available from: https://openreview.net/forum?id=SJzSgnRcKX.
  33. Amini A, Pimentel T, Meister C, Cotterell R. Naturalistic Causal Probing for Morpho-Syntax. Transactions of the Association for Computational Linguistics. 2023. Available from: https://arxiv.org/abs/2205.07043.
  34. Mohammad SM, Kiritchenko S, Sobhani P, Zhu X, Cherry C. A Dataset for Detecting Stance in Tweets. In Proceedings of the Tenth International Conference on Language Resources and Evaluation. Portorož, Slovenia: European Language Resources Association. 2016;3945–3952. Available from: https://www.aclweb.org/anthology/L16-1623.
  35. Mohammad SM, Sobhani P, Kiritchenko S. Stance and Sentiment in Tweets. Association for Computing Machinery Transactions on Internet Technology. 2017;17:1–23.
  36. Padó S, Blessing A, Blokker N, Dayanik E, Haunss S, Kuhn J. Who Sides with Whom? Towards Computational Construction of Discourse Networks for Political Debates. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics. 2019;2841–2847. Available from: https://www.aclweb.org/anthology/P19-1273.
  37. Conneau A, Lample G. Cross-lingual Language Model Pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, December 8-14, 2019, Vancouver, BC, Canada. 2019;7057–7067. Available from: https://proceedings.neurips.cc/paper/2019/hash/c04c19c2c2474dbf5f7ac4372c5b9af1-Abstract.html.
  38. Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, et al. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics. 2020;8440–8451. Available from: https://www.aclweb.org/anthology/2020.acl-main.747.
  39. Vrandečić D, Krötzsch M. Wikidata: A Free Collaborative Knowledgebase. Communications of the Association for Computing Machinery. 2014;57(10):78–85.
  40. Hennigen LT, Kim Y. Deriving Language Models from Masked Language Models. arXiv preprint arXiv:2305.15501. 2023. Available from: https://arxiv.org/abs/2305.15501.
  41. Goyal K, Dyer C, Berg-Kirkpatrick T. Exposing the Implicit Energy Networks behind Masked Language Models via Metropolis–Hastings. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Available from: https://openreview.net/forum?id=6PvWo1kEvlT.
  42. Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;57(1):97–109.
  43. Hoyle AM, Wolf-Sonkin L, Wallach H, Augenstein I, Cotterell R. Unsupervised Discovery of Gendered Language through Latent-Variable Modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics. 2019;1706–1716. Available from: https://www.aclweb.org/anthology/P19-1167.
  44. Zeman D, Nivre J, Abrams M, Ackermann E, Aepli N, et al. Universal Dependencies 2.6. 2020. Available from: http://hdl.handle.net/11234/1-3226.
  45. Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144. 2016. Available from: https://arxiv.org/abs/1609.08144.
  46. Nguyen DQ, Vu T, Tuan Nguyen A. BERTweet: A pre-trained language model for English Tweets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics. 2020;9–14. Available from: https://www.aclweb.org/anthology/2020.emnlp-demos.2.
  47. Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In 8th International Conference on Learning Representations 2020, Addis Ababa, Ethiopia, April 26-30, 2020. 2020. Available from: https://openreview.net/forum?id=H1eA7AEtvS.
  48. Dinan E, Fan A, Wu L, Westin J, Kiela D, Williams A. Multi-Dimensional Gender Bias Classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Online: Association for Computational Linguistics. 2020;314–331. Available from: https://www.aclweb.org/anthology/2020.emnlp-main.23.
  49. Hoyle AM, Wolf-Sonkin L, Wallach H, Cotterell R, Augenstein I. Combining Sentiment Lexica with a Multi-View Variational Autoencoder. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, Minnesota: Association for Computational Linguistics. 2019;635–640. Available from: https://www.aclweb.org/anthology/N19-1065.
  50. Kingma DP, Welling M. Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114. 2013. Available from: https://arxiv.org/abs/1312.6114.
  51. Chen Y, Skiena S. Building Sentiment Lexicons for All Major Languages. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, Maryland: Association for Computational Linguistics. 2014;383–389. Available from: https://www.aclweb.org/anthology/P14-2063.
  52. Vilares D, Peng H, Satapathy R, Cambria E. BabelSenticNet: A Commonsense Reasoning Framework for Multilingual Sentiment Analysis. In 2018 Institute of Electrical and Electronics Engineers Symposium Series on Computational Intelligence. 2018;1292–1298. Available from: https://ieeexplore.ieee.org/document/8628718.
  53. Asgari E, Braune F, Roth B, Ringlstetter C, Mofrad M. UniSent: Universal Adaptable Sentiment Lexica for 1000+ Languages. In Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association. 2020;4113–4120. Available from: https://www.aclweb.org/anthology/2020.lrec-1.506.
  54. Desai N. Hindi language—bag of words—sentiment analysis. 2016. Available from: https://data.mendeley.com/datasets/mnt3zwxmyn/2.
  55. Sharan M. Sarcasm Detector; 2016. Available from: https://github.com/smadha/SarcasmDetector.
  56. Elsahar H. Large Arabic Multidomain Lexicon; 2015. Available from: https://github.com/hadyelsahar/large-arabic-multidomain-lexicon.
  57. Xinfan M. Chinese Sentiment Lexicon; 2012. Available from: https://github.com/fannix/Chinese-Sentiment-Lexicon.
  58. Chen CC, Huang HH, Chen HH. NTUSD-Fin: A Market Sentiment Dictionary for Financial Social Media Data Applications. In El-Haj M, Rayson P, Moore A, editors. Proceedings of the Eleventh International Conference on Language Resources and Evaluation. Paris, France: European Language Resources Association. 2018. Available from: http://lrec-conf.org/workshops/lrec2018/W27/pdf/1_W27.
  59. Abdaoui A, Azé J, Bringay S, Poncelet P. FEEL: A French Expanded Emotion Lexicon. Language Resources and Evaluation. 2017;51(3):833–855.
  60. Fabelier. Tom de Smedt; 2012. Available from: https://github.com/fabelier/tomdesmedt.
  61. Loukachevitch N, Levchik A. Creating a General Russian Sentiment Lexicon. In Proceedings of the Tenth International Conference on Language Resources and Evaluation. Portorož, Slovenia: European Language Resources Association. 2016;1171–1176. Available from: https://www.aclweb.org/anthology/L16-1186.
  62. Dolores Molina-González M, Martínez-Cámara E, Teresa Martín-Valdivia M, Alfonso Ureña López L. A Spanish Semantic Orientation Approach to Domain Adaptation for Polarity Classification. Information Processing & Management. 2015;51(4):520–531. Available from: https://www.sciencedirect.com/science/article/pii/S0306457314000910.
  63. Bravo-Marquez F. EarthQuakeMonitor; 2013. Available from: https://github.com/felipebravom/EarthQuakeMonitor.
  64. Figueroa JC. Sentiment Analysis Spanish; 2015. Available from: https://github.com/JoseCardonaFigueroa/sentiment-analysis-spanish.
  65. Treves A, Panzeri S. The Upward Bias in Measures of Information Derived from Limited Data Samples. Neural Computation. 1995;7:399–407.
  66. Paninski L. Estimation of Entropy and Mutual Information. Neural Computation. 2003;15:1191–1253.
  67. Ganchev K, Graça J, Gillenwater J, Taskar B. Posterior Regularization for Structured Latent Variable Models. Journal of Machine Learning Research. 2010;11:2001–2049. Available from: http://jmlr.org/papers/v11/ganchev10a.html.
  68. Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. In Bengio Y, LeCun Y, editors. 3rd International Conference on Learning Representations 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. 2015. Available from: http://arxiv.org/abs/1412.6980.
  69. Fellbaum C. WordNet: An Electronic Lexical Database. Bradford Books; 1998. Available from: https://mitpress.mit.edu/9780262561167.
  70. Tsvetkov Y, Schneider N, Hovy D, Bhatia A, Faruqui M, Dyer C. Augmenting English Adjective Senses with Supersenses. In Proceedings of the Ninth International Conference on Language Resources and Evaluation. Reykjavik, Iceland: European Language Resources Association. 2014;4359–4365. Available from: http://www.lrec-conf.org/proceedings/lrec2014/pdf/1096_Paper.pdf.
  71. Miller GA, Leacock C, Tengi R, Bunker RT. A Semantic Concordance. In Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993. 1993. Available from: https://www.aclweb.org/anthology/H93-1061.
  72. Good PI. Permutation, Parametric, and Bootstrap Tests of Hypotheses. Berlin, Heidelberg: Springer-Verlag; 2004. Available from: https://link.springer.com/book/10.1007%2Fb138696.
  73. Blodgett SL, Barocas S, Daumé III H, Wallach H. Language (Technology) is Power: A Critical Survey of “Bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics. 2020;5454–5476. Available from: https://www.aclweb.org/anthology/2020.acl-main.485.
  74. Collier B, Bear J. Conflict, Criticism, or Confidence: An Empirical Examination of the Gender Gap in Wikipedia Contributions. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work. Seattle, Washington, USA: Association for Computing Machinery. 2012;383–392. Available from: https://doi.org/10.1145/2145204.2145265.
  75. Wagner C, Garcia D, Jadidi M, Strohmaier M. It’s a Man’s Wikipedia? Assessing Gender Inequality in an Online Encyclopedia. Proceedings of the International AAAI Conference on Web and Social Media. 2015;9(1):454–463.
  76. Wikidata. Wikidata:WikiProject LGBT/gender. Available from: https://www.wikidata.org/wiki/Wikidata:WikiProject_LGBT/gender.