Quantifying gender bias towards politicians in cross-lingual language models

Recent research has demonstrated that large pre-trained language models reflect societal biases expressed in natural language. The present paper introduces a simple method for probing language models to conduct a multilingual study of gender bias towards politicians. We quantify the usage of adjectives and verbs generated by language models surrounding the names of politicians as a function of their gender. To this end, we curate a dataset of 250k politicians worldwide, including their names and genders. Our study is conducted in seven languages across six different language modeling architectures. The results demonstrate that pre-trained language models' stance towards politicians varies strongly across the analyzed languages. We find that while some words, such as dead and designated, are associated with both male and female politicians, a few specific words, such as beautiful and divorced, are predominantly associated with female politicians. Finally, and contrary to previous findings, our study suggests that larger language models do not tend to be significantly more gender-biased than smaller ones.


Introduction
Recent large language models have been shown to achieve state-of-the-art performance on many downstream tasks using only a few target-task examples (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019). During their pre-training, these models are known to partially learn a language's syntactic and semantic structure (Hewitt and Manning, 2019; Tenney et al., 2019). However, alongside capturing the linguistic nuances of a language, they also perpetuate (and potentially amplify) biases (Zhao et al., 2017; Bender et al., 2021).
Thus far, approaches to analyzing (gender) bias in pre-trained language models have primarily used basic statistical measures such as association tests (May et al., 2019; Nadeem et al., 2020), correlations (Webster et al., 2020) and causal analysis (Vig et al., 2020). Further, most previous work on gender bias in language models has focused on English (Nadeem et al., 2020; Vig et al., 2020), with the notable exception of Liang et al. (2020), who examined biases for both English and Chinese.

Figure 1: Mean frequency with which the top 100 adjectives (the most strongly associated with either masculine or feminine gender) correspond to negative sentiment for each language. Each point denotes a language model. Significant differences between masculine and feminine gender are denoted with 'x' markers.
In parallel, another stream of work has concentrated on analyzing stance towards politicians. Prabhakaran et al. (2019) discover that leaders (i.e. politicians) tend to have higher toxicity associations and argue that if an entity is often mentioned in negative linguistic contexts, a pre-trained language model might learn to associate negativity with its name. Similarly, Shwartz et al. (2020) show that some names tend to be grounded to specific entities and observe the most striking effects on politicians.
The contributions of this work are as follows. Cross-lingual study: Prior work has almost exclusively investigated bias in monolingual language models. We present a fine-grained study of gender bias in 6 cross-lingual language models. Stronger statistical methods: Prior work has mostly used less expressive statistical methods to measure gender bias in language models. We employ an unsupervised method using a latent-variable model. Largest coverage: We present the largest study quantifying gender bias towards politicians to date, considering 250k politicians from most of the world's countries and measuring gender bias towards them in 7 languages.
New findings: We find that, for both male and female politicians, stance towards politicians in pre-trained language models is highly dependent on the analyzed language, with no significant findings for other-gendered politicians. Figure 1, for instance, shows that while in English, male politicians are associated with a more negative sentiment, the opposite is true for most other analyzed languages. Perhaps surprisingly, we do not find any significant evidence that larger language models tend to be more gender-biased than smaller ones, which contradicts previous findings (Nadeem et al., 2020).
Finally, our methodology can easily be extended to study gender biases towards other types of entities in other languages.

Related Work
Gender bias in pre-trained language models Research on gender bias in pre-trained language models has gained an increasing amount of attention. Prior datasets for bias evaluation in language models have mainly been built for English and with sentences of basic structure, e.g. "This is a(n) blank person." or "Person is blank.", where blank refers to an attribute such as an adjective or occupation (May et al., 2019; Webster et al., 2020; Vig et al., 2020). Another approach to gathering data is that of Nadeem et al. (2020), where crowdworkers provided stereotypical, anti-stereotypical and unrelated associations. This method does not suffer from the artificial context of simply structured sentences; however, crowdsourced annotations on stereotypes may convey subjective opinions and are cost-intensive if employed for multiple languages.
We suggest a simple approach where we generate tokens in neighborhoods of named entities (NEs) expecting to obtain words that language models associate with a given name.
Stance towards politicians Stance detection is the task of automatically determining whether the author of an analyzed text is in favor of, against, or neutral towards a target (Mohammad et al., 2016). Notably, Mohammad et al. (2017) observe that a person may demonstrate the same stance towards a target by using negative or positive language. Thus, stance detection is a more complex challenge than sentiment classification.
Here, we generate a dataset for analyzing stances towards politicians encoded in a language model, where this stance is inferred from simple grammatical constructs (e.g. ADJ-entity).
Previous work on stance towards politicians is based on datasets with more complex grammatical structures. However, these have mostly targeted specific entities in a single country's political context. Ahmad et al. (2011) analyze samples of national and regional news by Irish media on politicians running in general elections with the goal of predicting election results. More recently, Voigt et al. (2018) collected responses to Facebook posts for 412 members of the U.S. House and Senate from their public Facebook pages, while Padó et al. (2019) created a dataset consisting of 959 articles with a total of 1841 claims, where each claim is associated with an entity. It is readily apparent that each of these resources includes information on only a limited number of entities.
Geared towards analyzing stance towards politicians in a broader context, we construct a multilingual dataset of 250k politicians from most of the world's countries.

Gender Bias Detection
Our primary aim is to quantify the gender bias pervading Transformer-based language models, which lets us assess potential representational harms arising when employing these models on downstream tasks. Unlike prior work, we do not merely want to analyze monolingual English language models, but to design and conduct a study of gender bias in large cross-lingual language models. Thus, we are able to examine whether the nature of the gender bias exhibited in models differs depending not only on model size and training data, but also on a language and its typology.
For instance, in a language such as Spanish, in which adjectives are gendered according to the noun they refer to, grammatical gender might become a highly predictive feature the model can rely on during its pre-training. On the other hand, since inanimate objects are gendered, they might take on adjectives which are usually not associated with their grammatical gender, e.g. "la espada fuerte (the strong [feminine] sword)", potentially mitigating the effects of harmful bias in these models.
We analyze gender bias towards NEs, i.e. politicians. Prior research has shed light on the information incorporated in NE representations; e.g. Prabhakaran et al. (2019) test NLP models for unwanted biases, discovering that sentiment towards NEs is encoded in their representations and that politicians in particular tend to have high toxicity associations.
To conduct this analysis, we first require a multilingual dataset suitable for quantifying gender bias towards politicians. To the best of our knowledge, currently available datasets are not broad enough to cover an extensive number of politician names in multiple languages. Second, we need a suitable methodology for measuring gender bias.
Considering the above, we establish a method to generate a multilingual dataset for quantifying gender bias towards politicians in cross-lingual language models (§4). With this in mind, we let language models generate words around entity names (here, politicians) while imposing no sentence structure, relying on a [MASK] <person name> query, as we explain later. We assume that this bottom-up approach can unveil associations encoded in language models between their representations of NEs and words describing them. To the best of our knowledge, this method enables the first multilingual analysis of gender bias in language models, being applicable to generate datasets in any language and with any choice of gendered entity, provided that a list of such entities with their gender is available.

1 An example is the sentence "Obama está bueno (Obama is [now] good)", which sounds unusual to Spanish speakers.
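As a concrete illustration, the [MASK] <person name> probe could be sketched as follows. The helper below builds both query orders (the mask preceding or following the name, to accommodate cross-lingual word-order differences); the model name and the Hugging Face fill-mask pipeline call are illustrative and left commented out, since running them downloads model weights.

```python
def build_probes(name, mask_token="[MASK]"):
    """Build the two query orders: the mask either precedes or
    follows the entity name."""
    return [f"{mask_token} {name}", f"{name} {mask_token}"]

# With the Hugging Face transformers library, the probes could be
# scored roughly like this (not executed here; model name illustrative):
#   from transformers import pipeline
#   fill = pipeline("fill-mask", model="bert-base-multilingual-cased")
#   mask = fill.tokenizer.mask_token
#   predictions = fill(build_probes("Angela Merkel", mask)[0], top_k=100)

probes = build_probes("Angela Merkel")
# probes == ["[MASK] Angela Merkel", "Angela Merkel [MASK]"]
```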
We restrict our analysis to 7 typologically diverse languages: Arabic, Chinese, English, French, Hindi, Russian and Spanish. This enables an analysis of several multilingual language models, while covering a culturally diverse choice of speaker populations. While this framework is applicable to any type of NE, we study stance towards politicians since this topic has recently gathered attention, being of both academic and social importance (Fan et al., 2019), and has not yet been analyzed in a broad multilingual setup.
To generate the dataset, we first query the Wikidata knowledge base (Vrandečić and Krötzsch, 2014) to obtain politician names in the 7 analyzed languages. Next, using 6 language models, we generate adjectives and verbs associated with those politician names. Finally, we collect sentiment lexica for the analyzed languages to study differences in sentiment for the generated words.

Politician names
We obtain politician names and their genders from Wikidata, retrieving their names in the 7 analyzed languages (not all names are available in all languages). Wikidata distinguishes between 12 gender identities, from which we exclude female and male organisms, as they refer to animals which (for one reason or another) were running for election. We also replace the cisgender female label with female, and exclude any politician for whom gender information is unavailable. Finally, we create a general gender category other, which includes all politicians not identified as male or female (due to the small number of politicians for each of these genders; see Table 5 in the appendix). Tables 3 and 4 (appendix) present the counts of politicians grouped by their gender (male, female, other) for each language. On average, the male-to-female gender ratio is 4:1 across the languages, and there are very few names for the gender category 'other'.
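A Wikidata query of this kind could look roughly like the following SPARQL sketch, using the standard Wikidata identifiers (P106 for occupation, Q82955 for politician, P21 for sex or gender). The network call to the public endpoint is only indicated in comments and not executed here.

```python
# Sketch of a Wikidata SPARQL query for politicians and their genders.
# P106 = occupation, Q82955 = politician, P21 = sex or gender.
SPARQL = """
SELECT ?politician ?name ?genderLabel WHERE {
  ?politician wdt:P106 wd:Q82955 ;
              wdt:P21  ?gender ;
              rdfs:label ?name .
  FILTER(LANG(?name) = "en")
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

def fetch(query, endpoint="https://query.wikidata.org/sparql"):
    """Send the query to the public endpoint (network call omitted here)."""
    # import requests
    # r = requests.get(endpoint, params={"query": query, "format": "json"})
    # return r.json()["results"]["bindings"]
    raise NotImplementedError("network call omitted in this sketch")
```

In practice, one such query per language (varying the `LANG` filter) would yield the per-language name lists described above.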

Language generation
The language generation process can be thought of as a word association questionnaire. We give the language model a politician's name and ask it to generate a token (verb or adjective) which it thinks is most appropriate for the name. We could take several approaches towards that goal. One possibility would be to let the language model generate a sentence in which the name must appear and analyze it. However, most recent large language models are not trained with language generation in mind: their pre-training consists of predicting masked tokens in already existing sentences (Rogers et al., 2021). Another possible approach would be to use a dependency-parsed sentence and, whenever a person name (p) is encountered, replace a verb (v) or adjective (a) in it with a [MASK] token such that relation(v, p) = nsubj or relation(p, a) = amod. Then p could be replaced with a politician's name and the model could be used to predict the [MASK] token. However, the language model's prediction would be biased by the other tokens in the analyzed sentences.
Our solution is to simplify this problem. Across the languages we study, there is no common structure: while in English, adjectives usually precede nouns, the opposite is true for both Spanish and French (Dryer, 2013a). Furthermore, while English has subject-verb-object (SVO) ordering, verbs usually precede subjects in the modern standard variant of Arabic (it has VSO ordering, see Dryer 2013b). We thus query each analyzed language model by feeding it either a [MASK] <person name> input, or its inverse <person name> [MASK]. This returns a ranked list of words (with their probabilities) which the model associates with the name. These returned words, though, will not be restricted to adjectives or verbs, the part-of-speech (POS) categories we want to analyze, as they capture the sentiment about the name (Hoyle et al., 2019a). Furthermore, POS tagging is a sequence prediction problem, and off-the-shelf POS taggers cannot be reliably used to predict the POS tag of individually predicted tokens. Therefore, we rely on the Universal Dependencies treebanks (Zeman et al., 2020) to filter these words, keeping only the ones which are present in any of the language-specific treebanks and marked as an adjective or verb.
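A minimal sketch of this filtering step, with a toy stand-in for the treebank-derived lexicon (the real lexicon would map each word attested in the UD treebanks to its observed UPOS tags):

```python
# Hypothetical UD-derived lexicon: word -> set of UPOS tags seen in treebanks.
UD_LEXICON = {
    "strong": {"ADJ"},
    "elected": {"VERB", "ADJ"},
    "quickly": {"ADV"},
    "president": {"NOUN"},
}

def keep_adj_verb(predicted, lexicon=UD_LEXICON):
    """Keep only predictions attested as ADJ or VERB in the treebanks,
    preserving the model's ranking and probabilities."""
    return [(w, p) for w, p in predicted
            if lexicon.get(w, set()) & {"ADJ", "VERB"}]

# A toy ranked list, as returned by a masked-LM query.
ranked = [("strong", 0.12), ("president", 0.10),
          ("elected", 0.07), ("quickly", 0.02)]
filtered = keep_adj_verb(ranked)
# filtered == [("strong", 0.12), ("elected", 0.07)]
```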
A final issue is that all our models use subword tokenizers and, therefore, a politician's name is often not tokenized simply by whitespace. For example, the name "Narendra Modi" is tokenized as ["na", "##ren", "##dra", "mod", "##i"] by the BERT uncased tokenizer. It is difficult to ascertain solely from a model's vocabulary whether the model indeed saw a name during training. However, all politicians whose names we analyze have Wikipedia pages, and Wikipedia is a subset of the data on which these models were trained (except BERTweet, which is trained on a large collection of 855M English tweets). Therefore, we assume that the models did see these names during training and that the words predicted for the [MASK] tokens are truly reflective of the models' sentiment.

Sentiment data
Following Hoyle et al. (2019a), we use sentiment as a proxy to quantify bias, which requires a sentiment lexicon for each analyzed language. We use the combined sentiment lexicon of Hoyle et al. (2019b) for English words, which was shown to outperform a number of individual sentiment lexica and their straightforward combination on a text classification task involving sentiment analysis. Unfortunately, this lexicon is only available for English.
For the remaining six languages we use SentiVAE (Hoyle et al., 2019b), a multi-view variational autoencoder, to combine existing sentiment lexica; this is the same method Hoyle et al. (2019a) use to generate the English sentiment lexicon. In particular, SentiVAE combines lexica with disparate scales into a common latent representation, where the output represents the strength of each word's sentiment (positive, negative and neutral) in the form of a three-dimensional Dirichlet distribution. Hence, even polysemous and rare words are accounted for. We combine 3 multilingual sentiment lexica for all analyzed languages except Hindi: BabelSenticNet (Vilares et al., 2018), the sentiment lexicon by Chen and Skiena (2014) and UniSent (Asgari et al., 2020). Additionally, we incorporate monolingual sentiment lexica for Arabic (Elsahar, 2015), Chinese (Xinfan, 2012; Chen et al., 2018), French (Abdaoui et al., 2017; Fabelier, 2012), Russian (Loukachevitch and Levchik, 2016) and Spanish (Dolores Molina-González et al., 2015; Bravo-Marquez, 2013; Figueroa, 2015). For Hindi, we combine the sentiment lexica from Chen and Skiena (2014), Desai (2016) and Sharan (2016). Following Hoyle et al. (2019b), we evaluate the resulting lexica in a text classification task on a selected dataset for each of the languages. Namely, we use the resulting lexica to automatically label instances with their sentiment, based on the average sentiment of the words in each sentence. The results presented in Table 6 show that there is no major performance difference between using the sentiment lexica for labelling instances and using a supervised model trained on the training split of each dataset. For the latter, we include the best performance reported in the paper presenting the respective dataset as a point of reference.
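The lexicon-based labelling used in this evaluation reduces to averaging per-word sentiment strengths and taking the argmax. A toy sketch, with invented lexicon values standing in for SentiVAE's Dirichlet-based outputs:

```python
# Toy lexicon: word -> (pos, neg, neu) strengths. Values are invented,
# standing in for the normalized SentiVAE output per word.
LEXICON = {
    "great":    (0.80, 0.10, 0.10),
    "terrible": (0.05, 0.90, 0.05),
    "meeting":  (0.20, 0.20, 0.60),
}

def label_sentence(tokens, lexicon=LEXICON):
    """Average the sentiment of in-lexicon words; argmax gives the label."""
    hits = [lexicon[t] for t in tokens if t in lexicon]
    if not hits:
        return "neu"  # no lexicon coverage: fall back to neutral
    avg = [sum(dim) / len(hits) for dim in zip(*hits)]
    return ["pos", "neg", "neu"][avg.index(max(avg))]

label_sentence(["a", "great", "meeting"])     # "pos"
label_sentence(["a", "terrible", "meeting"])  # "neg"
```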

Methods
We use two methods to quantify gender biases in large LMs. First, we use point-wise mutual information (PMI) as a measure of association between a generated word and gender. This method, though less accurate, is unsupervised and can be applied even to small data samples. Second, we adopt the methodology of Hoyle et al. (2019a), which extends the concept of PMI with latent sentiment and regularization. Originally, this method uses the corpus of Goldberg and Orwant (2013) and is provided with a pre-defined list of animate nouns to be analyzed. Therefore, it could not be directly used to detect gender bias towards entities such as politicians. However, having annotated NEs in our dataset, we are able to circumvent this requirement. Further, the method of Hoyle et al. (2019a) was developed in a monolingual setting. We adapt it to be suitable for a broader study in a multilingual setup. To do so, we account for the grammatical patterns which the analyzed languages follow (i.e. sentence structure, POS tagging, lemmatization). Finally, we collect sentiment lexica in the analyzed languages. Note that due to the relatively small number of politicians identified in the other gender group, we restrict ourselves to the two binary genders in this generative latent-variable setting.

Point-wise mutual information
Point-wise mutual information (PMI) is a measure of association that examines co-occurrences of two random variables and quantifies the amount of information we can learn about one variable from the other. In our initial approach, we treat the generated words as bags of words and analyze the PMI between gender g ∈ G = {MALE, FEMALE, OTHER} and word w as:

PMI(g, w) = log [ p(g, w) / (p(g) p(w)) ]    (1)

In particular, PMI measures the difference between the co-occurrence probability of a word and gender versus their joint probability if they were independent. As such, if a word appears as often with one gender as it does with the others, its PMI will be zero. On the other hand, if a word is more (less) often associated with a gender, its PMI will be positive (negative). For example, we would expect a high value for PMI(FEMALE, pregnant), because their co-occurrence probability is higher than the product of the marginal probabilities of female and pregnant. Accordingly, in an ideal unbiased world, we would expect words such as successful or intelligent to have a PMI of approximately zero for all genders.
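A small sketch of this PMI computation over toy co-occurrence counts (all counts invented for illustration): a word seen only with one gender gets a positive score for that gender, while a word under-represented for a gender gets a negative one.

```python
import math
from collections import Counter

def pmi(pair_counts, gender_counts, word_counts, total):
    """PMI(g, w) = log [ p(g, w) / (p(g) * p(w)) ]."""
    scores = {}
    for (g, w), c in pair_counts.items():
        p_gw = c / total
        p_g = gender_counts[g] / total
        p_w = word_counts[w] / total
        scores[(g, w)] = math.log(p_gw / (p_g * p_w))
    return scores

# Toy counts: "pregnant" co-occurs only with FEMALE, "elected" with both.
pairs = Counter({("FEMALE", "pregnant"): 4,
                 ("FEMALE", "elected"): 4,
                 ("MALE", "elected"): 4})
genders = Counter({"FEMALE": 8, "MALE": 4})
words = Counter({"pregnant": 4, "elected": 8})

scores = pmi(pairs, genders, words, total=12)
# scores[("FEMALE", "pregnant")] > 0, scores[("FEMALE", "elected")] < 0
```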

Generative latent-variable model
In order to analyze stance towards male and female politicians and to quantify the degree to which the language generated to describe them differs, we apply the generative latent-variable model presented in Hoyle et al. (2019a). This model jointly represents adjective (or verb) choice (w) and its sentiment (s), given a politician's gender (g).
Following Hoyle et al. (2019a), we define G = {MALE, FEMALE} to be the set of genders in our dataset, whose instances are represented by one-hot vectors g ∈ G ≡ {0, 1}^T. We also define w to be a word (the lemma of an adjective or a verb) in a language-specific vocabulary w ∈ V.
Finally, let S = {POS, NEG, NEU} be a set of sentiments, with s ∈ S being one such sentiment.We further define n ∈ N to be a politician name, where N is the set of analyzed names (we denote g n as the gender of a politician with name n).
Since we do not have access to explicit sentiment information (it is encoded as a latent, unobserved variable), we marginalize it out of the joint model p(w, s | g) = p(w | s, g) p(s | g) while training. This yields the objective

min − Σ_{n ∈ N} Σ_{w ∈ V} p(w | n) log Σ_{s ∈ S} p(w | s, g_n) p(s | g_n),

where p(w | n) is defined as the probability of a word given a politician name, taken from the analyzed language model. As in Hoyle et al. (2019a), we apply posterior regularization (Ganchev et al., 2010) to guarantee that our latent variable corresponds to sentiment. This regularization is taken as the Kullback-Leibler (KL) divergence between our estimate of p(s | w) (again obtained by marginalization from the joint model) and a ground-truth q(s | w), which we get from the sentiment lexica described in detail in Section 4.3. We further rely on L1 regularization to account for sparsity. This objective is minimized with the Adam optimizer (Kingma and Ba, 2015).
Through the distribution p(w | s, g), this model enables us to extract ranked lists of adjectives (or verbs), grouped by both gender and sentiment, that were generated by a language model to describe politicians. This distribution is defined as

p(w | s, g) ∝ exp( m_w + f_g^T η(w, s) ),

where m_w ∈ R acts as a word-specific bias, f_g ∈ {0, 1}^T is the one-hot gender representation and η(w, s) ∈ R^T is a word-sentiment-specific deviation. During evaluation, we remove the word-specific bias and analyze only the gender-specific sentiment deviation. To get the words most strongly associated with FEMALE and POS, for instance, we rank words by f_FEMALE^T η(w, POS). We refer the reader to Hoyle et al. (2019a) for a more detailed description of the model.
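Numerically, this distribution is a softmax over the vocabulary. A toy numpy sketch (dimensions and random parameter values are illustrative only, not trained values):

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, S = 5, 2, 3                 # vocab size, #genders, #sentiments

m = rng.normal(size=V)            # word-specific bias m_w
eta = rng.normal(size=(V, S, T))  # word-sentiment deviations eta(w, s)

def p_w_given_s_g(s, f_g):
    """p(w | s, g): softmax over the vocabulary of m_w + f_g^T eta(w, s)."""
    logits = m + eta[:, s, :] @ f_g
    e = np.exp(logits - logits.max())  # stabilized softmax
    return e / e.sum()

f_female = np.array([0.0, 1.0])   # one-hot gender vector
probs = p_w_given_s_g(0, f_female)

# During evaluation the bias m_w is dropped: words are ranked by the
# gender-specific deviation alone, here for (FEMALE, sentiment index 0).
ranking = np.argsort(-(eta[:, 0, :] @ f_female))
```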

Experiments and Results
We apply the methods defined in Section 5 (PMI and the generative latent-variable model) to study the presence of gender bias in the dataset described in Section 4. First, using the English portion of the dataset, we analyze the PMI values to find the words diverging the most for each of the three gender categories. Then, we conduct the same analysis as presented in Hoyle et al. (2019a) on our dataset to test whether the adjectives and verbs generated by language models follow the same patterns as discovered in natural corpora and whether they confirm previous findings about stance towards politicians. Next, we conduct a multilingual analysis for the seven selected languages via PMI and the latent-variable model to unveil both qualitative and quantitative differences in the words generated by 6 cross-lingual language models.

Monolingual setup

PMI and latent-variable model
From the PMI values for words associated with politicians of the male, female or other genders, it is apparent that words associated with the female gender are often connected to weaknesses, such as hysterical and fragile, or to appearance (blonde and lovely), while adjectives generated for male politicians tend to describe their political beliefs, such as fascist and bolshevik. There is no such distinguishable pattern for the gender category 'other', most likely due to insufficient amounts of data. For details, see Table 7 (in the appendix).
The results for the latent-variable model are similar to those for the PMI approach. Adjectives associated with appearance are more often generated for female politicians. Additionally, words describing marital status, such as divorced and unmarried, are also significantly more often generated for female politicians. On the other hand, positive adjectives describing men often relate to their character, such as bold and independent. More examples can be seen in Table 8 in the appendix.
Following Hoyle et al. (2019a), we use two existing semantic resources to quantify the patterns revealed above. We group adjectives and verbs into small numbers of coarse classes called supersenses, using the resources of Tsvetkov et al. (2014) and Miller et al. (1993), respectively. We find that male politicians are more often described negatively with adjectives related to their emotions (e.g. anger) and more positively with adjectives related to their minds (e.g. intelligent) (details in Figure 4, appendix). These results differ from the findings of Hoyle et al. (2019a), where no significant evidence of these tendencies was found.

Sentiment analysis
Next, we examine the overall sentiment of the language models towards male and female politicians. We evaluate differences in the frequency of words of positive and negative sentiment for each gender and report the results in Figure 2. We find that the language models significantly more often generate words with negative sentiment to describe male rather than female politicians, and that this pattern holds across most language models. However, based on the results of the qualitative study (see details in Table 8, appendix), we assume this is due to a number of strongly positive words, such as beloved and marry, which are highly associated with female politicians.
To investigate whether there are any significant differences across language models based on their size and architecture, we carry out a two-way analysis of variance (ANOVA). We take language, model size and architecture as independent variables and sentiment values as target variables; we then analyze differences in the mean frequency with which the 100 largest-deviation words (adjectives and verbs) correspond to each sentiment for the male and female genders.
The results presented in Table 1 indicate significant differences in negative sentiment in descriptions of male politicians generated by models of different architectures. Notably, XLNet tends to generate more words of negative sentiment compared to the other models examined. Also, larger models tend to exhibit gender biases similar to those of smaller ones.

Multilingual setup

Point-wise mutual information and latent-variable model
For PMI scores, a pattern similar to the monolingual setup holds for the multilingual language models. Words associated with female politicians often relate to their appearance and social characteristics, such as beautiful and sweet (prevalent for English, French and Chinese) or attentive (in Russian).
On the other hand, male politicians are described as knowledgeable, serious, or (in Arabic) prophetic. Again, we are not able to detect any patterns in the words generated around politicians of the gender category 'other', where generated words vary from similar and common (as in French and Russian) to angry and unique (as in Chinese).
The results of the latent-variable model confirm the previous findings and are consistent with the results for the monolingual models (for an example, see Table 2 for Spanish). Words of positive sentiment used to describe male politicians include successful (Arabic) and rich (Arabic, Russian). In a negative context, male politicians are described as difficult (Chinese and Russian) or serious (prevalent in French and Hindi), and the associated verbs are sentence (in Chinese) and arrest (in Russian). Notably, words generated in Russian have a strongly negative connotation, such as criminal and evil. Positive words associated with female politicians are mostly related to their appearance, while there is no such pattern for words of negative sentiment.

Sentiment analysis
Further, we analyze the overall sentiment of the 6 cross-lingual language models towards male and female politicians for the selected languages. Our analysis suggests that sentiment towards politicians varies depending on the language used. For English, female politicians tend to be described more positively, as opposed to in Arabic, French, Hindi and Spanish. For Russian, words associated with female politicians tend to be more polarized, having both more positive and more negative sentiment. No significant patterns can be detected for Chinese. For details, see Figures 1 and 3.

Table 2: Ranked list of 10 adjectives with the largest average deviation over all models to describe male and female politicians in Spanish.
Finally, analogously to the monolingual setup, we investigate whether there are any significant differences in sentiment depending on the target language, language model size and architecture (see the ANOVA analysis in Table 1). Both XLM and XLM-RoBERTa generate fewer negative and more positive words than multilingual BERT (e.g. the mean frequency with which the 100 largest-deviation adjectives for the male gender correspond to negative sentiment is lower by 2.00% and 5.73% for XLM and XLM-R, respectively). Indeed, as suggested above, we find that language is a highly significant factor for bias in cross-lingual language models, alongside model architecture. For English and French, e.g., generated words are often more negative when used to describe male politicians. Surprisingly, we do not observe a significant influence of model size on the encoded bias.

Conclusions
In this paper, we present the largest study quantifying gender bias towards politicians to date, considering a total of 250k politicians. To this end, we establish a novel method to generate a multilingual dataset for measuring gender bias towards entities. We study the qualitative differences in language models' word choices and analyze the sentiment of generated words in conjunction with gender using a latent-variable model. Our results demonstrate that stance towards politicians in pre-trained models is highly dependent on the language used. Finally, contrary to previous findings (Nadeem et al., 2020), our study suggests that larger Transformer models do not tend to be significantly more gender-biased than smaller ones.
While we restrict our analysis to seven selected, typologically diverse languages, as well as to politicians, our method can be employed to analyze gender bias towards any named entities and in any language, given that gender information for those entities is available. Future work will thus focus on extending this analysis to investigate gender bias in a wider range of languages and will study the wider societal implications of our results.

Ethical Considerations
Potential harms in using gender-biased language models Prior research has unveiled the prevalence of gender bias in political discourse, which can be picked up by NLP systems trained on such texts. Gender bias encoded in Transformer-based language models is particularly problematic, as these models are used as the building blocks of most modern NLP systems. Biases in such language models can lead to gender-biased predictions, and thus reinforce harmful stereotypes extant in natural language when these models are deployed. However, it is important to clarify that, by our definition, bias does not have to be harmful (Blodgett et al., 2020); it suffices to show that there is a significant difference in a system's output that can be characterized by a social construct.

Quality of collaborative knowledge bases
For the purpose of this research, we gathered information on politicians from the open knowledge base Wikidata, which claims to do gender modeling at scale, globally, for every language and culture, with more data and distribution than any other resource (Wikidata). It is a collaboratively edited data source, so, in theory, anyone could make changes to an entry (including the person the entry is about), which poses a potential source of bias. Since we are only interested in overall gender bias trends, as opposed to results for individual entities, we can tolerate a small amount of noise.
Gender selection In our analysis, we aim to incorporate genders beyond male and female while maintaining statistical significance.However, politicians of other genders cover only 0.025% of collected entities.This restricts possible analyses for politicians of other genders, and risks drawing wrong conclusions.Although our method can be applied to any named entities of other genders to analyze stance towards them, we hope future work will obtain more data on politicians of non-binary genders to avoid this limitation.
Beyond English We explored gender bias encoded in cross-lingual language models in 7 typologically different languages. We acknowledge that the selection of these languages may introduce additional biases into our study. The words generated by a language model may also be more a reflection of how politicians are perceived in these languages, and of how much they are discussed in general, than of how much gender bias exists towards them. However, abstracting away from individual politicians, it is much more likely that our findings capture general gender bias trends. Finally, a potential bias in our study may be associated with racial biases reflected by a language model, as names often carry information about a politician's country of origin.

Figure 4: The frequency with which the 100 largest-deviation adjectives for male and female gender correspond to the supersense FEELING for the negative sentiment (left) and the supersense MIND for the positive sentiment (right). Results presented for language models with significant differences between male and female politicians after Bonferroni correction.

Table 8: Ranked list of the top 10 adjectives with the largest average deviation for each sentiment, extracted over all monolingual models for English, to describe male and female politicians.

Figure 2: Mean frequency with which the 100 largest-deviation adjectives for male and female genders correspond to a positive or negative sentiment in English. Significant differences (p < 0.05/3 under an unpaired Bonferroni correction) represented with 'x' markers.

Table 1: Group mean sentiment values for negative and positive sentiment for masculine and feminine gender for adjectives generated by monolingual language models (left) and cross-lingual language models (right), calculated via ANOVA. Significant differences (p < 0.05) between negative and positive mean sentiment values are in bold.

Table 4: Counts of politicians grouped by gender for each language, based on queried Wikidata information. Numbers across languages differ because politician data is not available in all languages.

Table 5: List of genders grouped together as 'other'.

Table 6: Classification performance for the self-supervised approach using SentiVAE sentiment lexica vs. the best reported result in the paper presenting the respective dataset for each language. The performance metric is given in brackets.

Table 7: Top 15 adjectives with the largest difference in PMI between male and female (left), and top 5 (top right) and bottom 5 (bottom right) adjectives by PMI for other genders, based on words generated by all monolingual language models for English.