Automated stance detection in complex topics and small languages: The challenging case of immigration in polarizing news media

Automated stance detection and related machine learning methods can provide useful insights for media monitoring and academic research. Many of these approaches require annotated training datasets, which limits their applicability for languages where such data may not be readily available. This paper explores the applicability of large language models for automated stance detection in a challenging scenario involving a morphologically complex, lower-resource language and a socio-culturally complex topic, immigration. If the approach works in this case, it can be expected to perform as well or better in less demanding scenarios. We annotate a large set of pro- and anti-immigration examples to train and compare the performance of multiple language models. We also probe the usability of GPT-3.5, the model that powers ChatGPT, as an instructable zero-shot classifier for the same task. The supervised models achieve acceptable performance, and GPT-3.5 yields similar accuracy. As the latter does not require tuning with annotated data, it constitutes a potentially simpler and cheaper alternative for text classification tasks, including in lower-resource languages. We further use the best-performing supervised model to investigate diachronic trends over seven years in two corpora of Estonian mainstream and right-wing populist news sources, demonstrating the applicability of automated stance detection for news analytics and media monitoring even in lower-resource scenarios, and discuss correspondences between stance changes and real-world events.


Introduction
Understanding complex socio-political and cultural issues, such as polarization and news bias, requires a comprehensive perception of cultural systems. Computational and data-driven research can offer valuable insights, provided we acknowledge the limitations of computational methods and scrutinize findings. Advances in natural language processing, such as pretrained large language models (LLMs), have enabled the analysis of large volumes of data, but these methods may have limited applicability in smaller languages with limited training data and NLP resources. This is especially true for politically charged issues that involve diverse linguistic expressions and cultural perspectives. Nevertheless, quantifying the reporting of different arguments or stances towards various issues can help scholars better understand media ecosystems and study the political positions of media groups or specific outlets. It can also aid industry, including media organizations looking to balance their reporting and avoid bias.
We report on an experiment in automatically classifying topic-specific political stance in news media texts written in a low- to medium-resource, morphologically complex language, Estonian, spoken natively by about 1.1 million people, primarily in the European state of Estonia. While we use one small language as the example, we argue below that our results have implications for the applicability of automated stance detection and media monitoring more broadly. The topic in question is immigration, globally much disputed and often polarizing. Our corpus consists of news articles published in 2015-2022 by one mainstream media group (Ekspress Grupp) and one right-wing populist online news and opinion portal (Uued Uudised, or the "new news"). The study is based on an academia-industry collaboration project with the Ekspress media group, which provided data from its in-house publishing database (but did not influence the design of the study or its conclusions). Their interest was to assess the neutrality of their content. The goal of this study was to determine the feasibility and accuracy of automated stance detection for linguistically and culturally complex issues (in this case, immigration) in a lower-resource language, and to apply it to mapping stance in a large corpus of news dealing with the topic. This could be used to assess the balance of different views in news reporting as well as to foster discussions about bothsidesism. For that purpose we compare sources likely to have contrasting views on the politically charged topic of immigration. We focus on testing a supervised learning approach: annotating a set of training data, tuning a number of different LLMs on the training examples, and testing them on a holdout test set. The best-performing model is then applied to the larger corpus to estimate the balance of different stances towards immigration in the news.
Our experiment design follows a fairly standard annotate-train-test procedure. We first extracted 8000 sentences from the joint corpus using a lexicon of topic-relevant keywords and word stems (such as migrant, immigration, asylum seeker), carried out manual stance annotation, and fine-tuned a number of pre-trained LLMs, including multilingual and Estonian-specific ones, on this dataset for text classification. We also experimented with zero-shot classification by instructing ChatGPT 3.5 to classify sentences according to guidelines similar to those given to the human annotators. All LLM classifiers achieve reasonably good test set accuracy, including the zero-shot variant, which performs almost on par with the best annotation-tuned model. Our work has three main contributions. First, we demonstrate the feasibility and example accuracy of what amounts to a proof of concept for an automated political stance media monitoring engine, and compare it to cheaper approaches: bootstrapping a general sentiment analysis classifier to estimate stance, and zero-shot learning. While not perfect, we argue the approach can yield useful results if applied critically, keeping the error rates in mind. We have chosen a socio-politically complex example topic and a lower-resource language for this exercise. Consequently, it is reasonable to expect higher accuracy when following analogous procedures where one or more of the variables are more favorable: the target language has larger pre-trained models available, the topic is less complex, or larger quantities of training data are annotated. We are making our annotated dataset of 7345 sentences public, which we foresee could be of interest to the Estonian NLP community and to media and communications studies, as well as to applications of multilingual NLP and cross-lingual transfer learning. We also contrast the more traditional annotation-based training approach to zero-shot classification using an instructable LLM, ChatGPT, and offer a perspective on such an approach's future importance in academia and beyond. While first attempts at testing ChatGPT as a zero-shot classifier have focused on large languages like English, we provide insight into its performance on a lower-resource language.
Secondly, we carry out a qualitative analysis of the annotation procedure and model results, highlighting and discussing difficulties for both the human annotators and the classifier when it comes to complex political opinion, dog-whistles, sarcasm and other types of expression that require contextual and cultural background knowledge to interpret. Lessons learned here can be used to improve future annotation procedures.
Finally, we show how the approach could be used in practice by media and communications scholars or analytics teams at news organizations, by applying the trained model to the rest of the corpus to estimate stances towards immigration, and their balance, in the two news sources over a seven-year period. This contributes to understanding immigration discourse, media polarization and radical-right leaning media, using Estonia as the example. The topic is also of real-world commercial interest: our industry partner seeks to keep its reporting of different stances balanced. We find and discuss qualitative correspondences between stance changes and events such as the Estonian parliamentary elections of 2019 and the start of the Russian invasion of Ukraine in 2022.

Analytic approach
We approach stance detection as determining favorability toward a given (pre-chosen) target of interest (Mohammad et al., 2017) through computational means. Stance detection (or stance classification, identification or prediction) is a large field of study, partially overlapping with opinion mining, sentiment analysis, aspect-based sentiment analysis, debate-side and debate stance classification, emotion recognition, perspective identification, sarcasm/irony detection, controversy detection, argument mining, and biased language detection (ALDayel & Magdy, 2021; Küçük & Can, 2020). Stance detection is used in natural language processing, the social sciences and beyond to understand subjectivity and affectivity in the form of opinions, evaluations, emotions and speculations (Pang & Lee, 2008). Compared to sentiment analysis, which generally distinguishes between positivity and negativity, stance detection is a more topic-dependent task that requires a specific target (Mohammad et al., 2017) or a set of targets (Sobhani et al., 2017; Vamvas & Sennrich, 2020). The distinction is of course not clearly categorical, with e.g. aspect-based sentiment analysis, commonly applied to product reviews, being applicable to multiple targets (Do et al., 2019). We chose to assess stance towards one target, immigration, and contrast our results with using an existing Estonian sentiment analysis dataset to fine-tune a classifier based on the best-performing LLM.
Both sentiment analysis and stance detection are classification tasks with multiple possible implementations. Earlier approaches were based on dictionaries of e.g. positive and negative words, and texts would be classified by simply counting the words, using categorization rules, or various statistical models. We employ the method of tuning large pretrained language models like BERT (Devlin et al., 2019) as supervised text classifiers. Such context-sensitive language models have been shown to work well across various NLP tasks and typically outperform earlier methods (Devlin et al., 2019; Ghosh et al., 2019). Reports on using LLMs for stance detection in lower-resource languages are relatively limited in the literature. However, their usefulness is clear in scenarios where language-specific NLP tools and resources such as labeled training sets may be lacking, but there is enough unlabeled data, such as free-running text, to train an LLM or include the language in a multilingual model (Hedderich et al., 2021; Magueresse et al., 2020). Resources relevant for NLP include both available methods and datasets, among other factors (cf. Batanović et al., 2020).
Automated stance detection has also been relevant in studies on immigration and related topics. The data used is most often textual, ranging from the often-studied Twitter (ALDayel & Magdy, 2021; Khatua & Nejdl, 2022) to online discussion forums (Yantseva & Kucher, 2021) and comments on online news (Allaway & McKeown, 2020). In the context of news media, the immigration topic is also relevant in hate-speech detection, which applies similar methods (Khatua & Nejdl, 2022). These studies use a variety of methods for stance detection, including LLMs. They include single-shot studies (e.g. Card et al., 2022), where training set topics match the predicted topics; multi-shot approaches, which offer partial transferability; and zero-shot approaches (Allaway & McKeown, 2020; Vamvas & Sennrich, 2020), which aim to predict topics not contained in the training set. Automated stance detection has been used to study immigration topics in under-resourced languages (Yantseva & Kucher, 2021), and cross-topic (zero-shot) and multilingual approaches using LLMs have been shown to work across languages other than English, such as Italian, French and German (Vamvas & Sennrich, 2020).

Object of analysis
Immigration has received increased attention in European media and politics since the 2015 European migrant crisis, but is also relevant globally. Analysis of media representations of immigration is crucial, as these representations can shape stances towards immigration (Burscher et al., 2015; Meltzer & Schemer, 2021), such as the perception of its actual magnitude. In turn, exposure to immigration-related news can have an impact on voting patterns (Burscher et al., 2015). The topic is also central in populist radical right rhetoric (Mudde, 2007; Rooduijn et al., 2014). Social media has been argued to be one of the means for achieving populist goals (Engesser et al., 2017: 1122). In the Estonian context, most of the radical right content circulating in Estonian-language social media has been reported to consist of references to articles from the news and opinion portal Uued Uudised (Kasekamp et al., 2019), making it a relevant source for understanding the radical right populists' perspective on immigration.
Focus on immigration fits the populist rhetoric of distancing the "us" from the strange or the "other". In the case of the radical right, this other is often chosen based on race or ethnicity (Abts & Rummens, 2007: 419). Such exclusionism of immigrants and ethnic minorities can be present in radical right populism to the extent that it becomes its central feature (Mudde, 2007; Rooduijn et al., 2014). Who that minority group is varies and may change over time. For example, before 2015, Central and Eastern European (CEE) populist radical right parties mainly targeted national minorities, whilst in Western Europe the target was more often immigrants. After the 2015 immigration crisis, immigrants became the main target in the CEE countries as well (Kasekamp et al., 2019).
The same applies to Estonia, where immigration has been one of the topics used by radical right parties to grow their political impact, especially since 2015 (Kasekamp et al., 2019). 2015 also marked the emergence of many anti-immigrant social media groups, blogs and online news and opinion portals, which have gained popularity since then. This includes the radical right online news portal Uued Uudised, a channel whose news is often ideologically in line with, and gives voice to, the political party EKRE (Conservative People's Party of Estonia). In the academic literature EKRE has often been classified as a radical right populist party (Auers, 2018; Kasekamp et al., 2019; Koppel & Jakobson, 2023; Madisson & Ventsel, 2018; Petsinis, 2019), whilst the party describes itself as national conservative (Madisson & Ventsel, 2018; Saarts et al., 2021). Uued Uudised has been described both as alternative (Kasekamp et al., 2019) and as hyper-partisan media (Saarts et al., 2021). It was established in 2015 during the EU immigration crisis. The content of Estonian radical right media discourse often follows provocative and controversial argumentation (Kasekamp et al., 2019). Immigrants are often constructed as an antithetical enemy, where the Other is portrayed as a mirror image of the Self; the Other may first be given negative characteristics that are then perceived as nonexistent in one's own group (Kasekamp et al., 2019; cf. Lotman et al., 1978; Madisson & Ventsel, 2016). Such othering can also be noticed in the topics discussed in the media more broadly, such as framing immigration in the context of criminal activity (Kaal & Renser, 2019; Koppel & Jakobson, 2023).
While our study has a methodological focus, we have chosen an example that also contributes to a better understanding of immigration, media polarization and radical-right discourse in our example country of Estonia. Radical-right discourse has been under-researched in the Estonian context (Kasekamp et al., 2019). Political science has focused on the communication of the parties themselves (Braghiroli & Makarychev, 2023; Petsinis, 2019; Saarts et al., 2021), while textual analyses have often focused on social media (Kasekamp et al., 2019; Madisson & Ventsel, 2016, 2018). These qualitative studies can benefit from a large-scale quantitative approach: automated stance and sentiment detection offers a complementary perspective.

Dataset
We chose the data based on accessibility and to contrast two sources expected to have different stances on immigration. The corpus consists of articles from 2015 to the beginning of April 2022. The mainstream news comes from Ekspress Grupp, one of the largest media groups in the Baltics. Our data covers one dominant online news platform, Delfi, across the whole time period, and a sample from multiple other daily and weekly newspapers and smaller magazines. The populist radical-right leaning media is represented by the above-mentioned online news portal Uued Uudised.
We acquired the Ekspress Grupp data directly from the group and scraped Uued Uudised from its web portal. Both datasets were cleaned of tags and non-text elements. We included Estonian-language content only (the official language of the country is Estonian, but there is a sizable Russian-speaking minority, and both news sources include Russian-language sections). Our dataset consists of 21 667 articles from Uued Uudised (April 2015 to April 2022) and 244 961 from Ekspress Grupp (January 2015 to March 2022). The Ekspress Grupp data we received was incomplete, with a gap in October-December 2019. The data from 2020 onwards contains multiple times more content from other periodicals besides Delfi.

We chose the sentence as the unit of analysis, instead of e.g. the paragraph or article, for three reasons. First, the length of articles varies greatly, as does the length of paragraphs across articles, and some articles lack paragraph splits. Secondly, longer text sequences may include multiple stances, which may confuse both human annotators and machine classifiers. Thirdly, the computational model, BERT, has an optimal input length limit below the length of many longer paragraphs. The hope was that a sentence would be a small enough unit to represent a single stance on average, yet provide enough context to inform the model. Admittedly, sentence-level analysis does have the limitation of missing potentially important contextual information across sentences, as we discuss further in our annotation and classification-error analysis. It is often hard to deduce an opinion from a single sentence alone (cf. Mohammad et al., 2017), but we do expect the sentence to be a suitable unit of analysis for indicating changes in rhetoric and large-scale changes across time.
We extracted immigration-related sentences using a dictionary of keywords covering different aspects of immigration, implemented as regular expressions (also to account for the morphological complexity of Estonian and match all possible case forms). Previous research on immigration has approached sampling by choosing topic-specific datasets, like immigration-related discussion forums (Yantseva & Kucher, 2021), or by using dictionary-based approaches, like Card et al. (2022). We found predefined keywords simple and efficient enough for our task. Text embeddings can provide a good alternative if keywords are harder to delimit or have many synonyms (Du et al., 2017). We created a list of keywords sorted into groups representing various aspects of migration as well as other closely related topics: migration, refugees, foreign workers, foreign students, non-citizens, race, nationality, and terms related to the radical right and its liberal opposition (e.g. "multiculturalism") (cf. Appendix for more information on keywords and Fig. S2 for the distribution of keyword groups). This plurality of topics (e.g. also covering "digital nomads") made the task much more challenging, but at the same time allowed us to grasp more nuances of the migration discourse at large. The resulting sentences included both opinions and factual descriptions and were therefore stylistically varied. In addition to searching for relevant keywords, we used a negative filter to exclude unrelated topics, like bird migration.
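The keyword filter can be illustrated with a minimal sketch. The stems and exclusion patterns below are simplified placeholders, not the study's actual lexicon; matching each stem as a word-initial prefix lets a single pattern cover the many case forms of an Estonian noun.

```python
import re

# Illustrative stems only (not the study's full lexicon): matching each stem
# as a word-initial prefix captures all Estonian case forms, e.g.
# "pagulane", "pagulased", "pagulastele" all match the stem "pagula".
INCLUDE = re.compile(r"\b(immigra|migrant|pagula|asüül|ränne|rände)\w*",
                     re.IGNORECASE)
# Negative filter for unrelated senses of the keywords, e.g. bird migration.
EXCLUDE = re.compile(r"\blindude ränne|\blinnuränne", re.IGNORECASE)

def topic_sentences(sentences):
    """Keep sentences that match a topic stem and no exclusion pattern."""
    return [s for s in sentences
            if INCLUDE.search(s) and not EXCLUDE.search(s)]

sents = [
    "Massiline immigratsioon on Euroopa jaoks probleem.",  # kept
    "Lindude ränne algab septembris.",                     # excluded (birds)
    "Ilm oli täna ilus.",                                  # no keyword
]
print(topic_sentences(sents))
```

The same pattern structure extends to the full keyword groups; only the stem lists grow.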

Annotations
We assigned two Estonian-speaking graduate students to annotate a total of 8000 sentences for supervised training. The annotators were compensated monetarily. The sample was balanced by keyword prevalence and publisher (Uued Uudised and Ekspress Grupp), but not by time or source article. Based on annotator feedback, we removed very long, repetitive or list-like and non-topical sentences, leaving 7345 sentences. The sentences were annotated on a 1-5 point scale between pro- and anti-immigration, with the option to mark a sentence as ambiguous instead. Ratings were later reduced to four classes: Against (1-2), Neutral (3), Supportive (4-5), and Ambiguous. The latter was meant for sentences that we assumed to be unhelpful for model tuning, and these were therefore excluded; they include sentences that were unintelligible, non-topical or expressed multiple stances at once. While some sentences are straightforward to interpret, others can pose a challenge for annotation due to complex metaphorical usage or references requiring additional knowledge. Below are some examples, translated into English (original Estonian in the Appendix).

1) "Mass immigration would be disastrous for Europe and it would not solve anything in the world." (Against)
2) "The process to get a residence permit here was not very complicated." (Supportive)
3) "Migration issues must definitely be analyzed, including the aspect of international obligations and their binding nature, and various steps should be considered." (Neutral)
4) "One can only wonder - when do Libyans quit and follow the flow of things when Europe is just talking about controlling the migrant crisis but itself just pours oil on fire." (Ambiguous)
5) "It is not worth mentioning that the person in question is thoroughly Europhile and globalist." (Against, because the manner presumes that it is said from the perspective of someone who may be against immigration)
6) "They criticize racism, homophobia, xenophobia and what they see as outdated nationalism." (Supportive, but refers to a third party and may thus also be taken as Against)

The sentiment analysis classification used for comparison differed from stance detection only in terms of the annotations used for fine-tuning. We used a publicly available Estonian-language dataset of short paragraphs labeled for sentiment as Negative, Neutral or Positive, and excluded the Mixed class (Pajupuu et al., 2016).
A third annotator (the first author) later annotated a subset of sentences from each of the previous annotators to estimate inter-rater agreement. There was substantial agreement on the Supportive, Against and Neutral classes (κ = 0.69 and 0.66 between the third and each of the other annotators; see Appendix for details). There was very strong agreement when considering only the Supportive and Against classes (κ = 0.97 for both), indicating that most of the disagreement was between one extreme and Neutral.
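For reference, Cohen's κ compares observed agreement with the agreement expected by chance from each annotator's marginal label distribution. A minimal implementation follows; the label sequences are toy data, not our annotations.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from the two marginal label distributions.
    expected = sum(ca[lab] * cb[lab] for lab in ca.keys() | cb.keys()) / n**2
    return (observed - expected) / (1 - expected)

# Toy ratings for illustration only:
ann1 = ["Against", "Against", "Neutral", "Supportive"]
ann2 = ["Against", "Neutral", "Neutral", "Supportive"]
print(round(cohens_kappa(ann1, ann2), 2))
```

κ = 1 indicates perfect agreement and 0 agreement no better than chance, which is why it is preferred over raw percentage agreement for skewed class distributions such as ours.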
As the training and test data were unbalanced in terms of class sizes, our training took into account the weights (relative sizes) of the classes.
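A common way to implement this (a sketch of the general technique, not necessarily the exact scheme we used) is to weight each class inversely to its frequency, so that a rare class such as Supportive contributes as much to the loss as a frequent one:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights, n / (k * count): the weighted sample
    count per class becomes equal across all k classes."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Hypothetical class distribution for illustration:
labels = ["Against"] * 60 + ["Neutral"] * 30 + ["Supportive"] * 10
weights = class_weights(labels)
print(weights)
```

Such per-class weights can then be passed to a weighted cross-entropy loss during fine-tuning.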
In addition, we compared the results with those of the GPT-3.5-based ChatGPT, a more recent large language model specifically trained for dialogue (we conducted our experiments on March 3, 2023, using the February 13 version of ChatGPT 3.5).
The new approach of using (even larger) generative LLMs as zero-shot classifiers (also known as prompt-based learning, cf. Liu et al., 2023) has opened up new avenues of cheap and efficient text classification, as it requires no fine-tuning and the model can simply be instructed in natural language.
There has been a surge of research on ChatGPT performance on different NLP tasks, but it has mostly focused on English. It has been shown that ChatGPT can achieve similar or better results in English than comparable supervised and other zero-shot models, including in stance detection (Zhang et al., 2023). On a wider array of tasks, ChatGPT has been shown to be a good generalist model, albeit performing worse than models fine-tuned for a specific task (Qin et al., 2023). The model is problematic in terms of evaluation and replicability due to its ongoing development and closed nature (Aiyappa et al., 2023). Our goal is to estimate its potential relevance for future studies by comparing it with the established pipeline of supervised tuning of pretrained LLMs for classification tasks.
We created a prompt that included optimized classification instructions and input sentences, in batches of 10. Responses not falling into the Against, Neutral or Supportive classes were requested again until only labels belonging to this set were returned. Likewise, if a wrong number of tags was returned, the sentences were requested again. An example input would look as follows:

Input: Stance detection. Tag the following numbered sentences as being either "supportive", "against" or "neutral" towards the topic of immigration. "Supportive" means: "supports immigration, friendly to foreigners, wants to help refugees and asylum seekers". "Against" means: "against immigration, dislikes foreigners, dislikes refugees and asylum seekers, dislikes people who help immigrants". "Neutral" means: "neutral stance, neutral facts about immigration, neutral reporting about foreigners, refugees, asylum seekers". Don't explain, output only sentence number and stance tag. 1. Unfortunately, by now the violence has seeped from immigrant communities to all of the society.
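The re-requesting loop can be sketched as follows. The `ask` callable stands in for the actual API call, which we do not reproduce here; the stub below simulates a model whose first reply contains an invalid tag, triggering one retry.

```python
import re

VALID = {"supportive", "against", "neutral"}

# Abbreviated instructions for the sketch; the full prompt is given above.
INSTRUCTIONS = ("Stance detection. Tag the following numbered sentences as "
                'being either "supportive", "against" or "neutral" towards '
                "the topic of immigration. Don't explain, output only "
                "sentence number and stance tag.\n")

def parse(reply, n):
    """Return the n labels from a numbered reply, or None if malformed."""
    labels = [m.lower() for m in re.findall(r"\d+\.\s*(\w+)", reply)]
    if len(labels) != n or any(lab not in VALID for lab in labels):
        return None
    return labels

def classify_batch(sentences, ask, max_tries=5):
    """Request labels for a batch, retrying until the reply contains
    exactly one valid stance tag per sentence."""
    prompt = INSTRUCTIONS + "\n".join(
        f"{i + 1}. {s}" for i, s in enumerate(sentences))
    for _ in range(max_tries):
        labels = parse(ask(prompt), len(sentences))
        if labels is not None:
            return labels
    raise RuntimeError("no well-formed reply after retries")

# Stub model: the first reply uses an invalid tag, the retry succeeds.
replies = iter(["1. positive\n2. neutral", "1. against\n2. neutral"])
result = classify_batch(["Sentence A.", "Sentence B."],
                        lambda _prompt: next(replies))
print(result)  # ['against', 'neutral']
```

In the actual experiments `ask` would wrap the ChatGPT interface; the validation and retry logic is independent of how the model is called.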

Results
The best-performing fine-tuned model was based on Est-RoBERTa, achieving an acceptable F1 macro score of 0.66 (precision 0.65 and recall 0.68; see Table 1). The difference from the other monolingual model, EstBERT (0.64), and the multilingual XLM-RoBERTa (0.64) was minimal. All of the fine-tuned models performed better at classifying Against than Supportive stances. The Est-RoBERTa model achieved an F1 of 0.74 for the Against, 0.69 for the Neutral and 0.55 for the Supportive class. Misclassification occurred mostly between Neutral and one of the extremes (see Fig. 2), similarly to e.g. Card et al. (2022). We regard this as preferable to confusing the two extremes. The results are comparable to similar studies, and there is little difference between the models. The classifier trained on the existing sentiment dataset with Est-RoBERTa achieved the worst score, but performed better than expected. With the sentiment analysis training set there were more mistakes between the two extremes (see Annex for the sentiment confusion matrix). We further assessed the sentiment-trained classifier by comparing the sentiment and stance predictions for all of the immigration-related sentences, resulting in fair agreement (κ = 0.29). This demonstrates the complexity of our task, which included features of stance as well as sentiment. Finally, the comparable performance of zero-shot ChatGPT with the best model shows that it could serve as a viable but cheaper alternative to fine-tuned models. We further assessed the mistakes made by the best-performing classifier. We looked at the mistaken predictions between the Against and Supportive classes in the evaluation set and observed at least four types of interpretable mistakes.
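For readers unfamiliar with the metric: the macro F1 reported here is the unweighted mean of per-class F1 scores, so each stance class counts equally regardless of its frequency. A minimal computation on toy labels (not the study's predictions):

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes present."""
    classes = sorted(set(gold) | set(pred))
    f1s = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example:
gold = ["Against", "Against", "Neutral", "Supportive"]
pred = ["Against", "Neutral", "Neutral", "Supportive"]
print(round(macro_f1(gold, pred), 2))
```

Because each class is weighted equally, a low score on the rare Supportive class pulls the macro average down even when overall accuracy is high.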
1) Mistaken human annotations. These may be hard to fully exclude when using human annotations, but could be reduced with better instructions.
2) Sarcasm, a well-known challenge in NLP.
3) Ambiguous and context-dependent sentences. These may be generally more complicated to classify.
4) Sentences that refer to a third party. These are tricky, as referencing someone else's opinion may implicitly signal agreement or opposition, which is highly context-dependent; the task is not easy, but it is simpler for humans than for classifiers. Such mistakes may relate to our chosen unit of analysis; paragraphs might perform better.

Limitations
The limitations of classifier performance are at least partly rooted in the human annotations. Some of these shortcomings were reported by the annotators themselves. The distinction between the Neutral and Ambiguous classes was also problematic; clearer instructions might have helped. Confusion between these classes is not expected to have a strong effect on the results, but may have limited the size of our training set, as neutral sentences classified as ambiguous were excluded.
Annotations are also dependent on the annotators' prior knowledge and biases.
Annotators were instructed to rate only the sentence itself, but we expect that they also relied on personal contextual knowledge (cf. Batanović et al., 2020 for more aspects for annotators to consider in future studies). Yet LLMs are not impervious to (e.g. training-set-induced) biases either, which may explain why in some cases smaller and more specific models perform better (Bender et al., 2021).
We suspect that some limitations to classifier accuracy arose from the dataset itself.
The texts contained opinions and descriptive sentences, as well as quotations in both indirect and direct speech. This was discussed with the annotators before and during the annotation process, as it was reported to have created some confusion. In the case of opinions, explicit expressions were easily distinguishable, but in many cases the opinions were implicit. Quotations were also problematic, as they could easily be misinterpreted without the proper context that a paragraph might provide. Sarcasm and metaphoric speech are also among the challenges that automatic classifiers face, e.g. "The protests were but shouts in the desert because the wheel of racial equality had already been set on its way." (Kuid protestid jäid hüüdjaks hääleks kõrbes, sest rassilise võrdsuse ratas oli juba hooga veerema lükatud.). We also included keywords often used by the radical right to refer negatively to liberals, like "multiculturalists" and "globalists", which may be difficult to spot as negative without context or prior knowledge. Annotators also reported pro-immigration stances as harder to identify. This may be because anti-immigration rhetoric is more systematic and less fragmented, whilst pro-immigration rhetoric depends more on the specific sub-topic.

Exemplary analysis
Lastly, we conducted an exemplary diachronic analysis of the change of stances towards immigration across time. This tests the applicability of our method and demonstrates some of its possible uses. In the following, we visualize and analyze the larger changes in the stance trends in relation to media events, and look at the related media polarization and general similarities based on text embeddings.
The relative amount of immigration-related articles across time and publishers (see Fig. 3) provides an understanding of immigration-related media events and their importance for each of the publishers. Based on keyword prevalence, Uued Uudised clearly focuses more on the immigration topic than Ekspress Grupp. Uued Uudised also reacts more strongly to immigration-related media events, such as the European migration crisis of 2015-2016, the UN migration pact at the end of 2018, and the Russian invasion of Ukraine from February 2022 onwards, which caused an increase in refugees. These findings confirm what is known about radical-right media in general and provide novel insight into the Estonian context. We used the best-performing model, based on Est-RoBERTa, to predict the stances of all sentences in the corpus containing relevant keywords (n = 106 539). We focus on monthly trends, as a tradeoff between detail and the amount of available data per unit of time.
The findings, as seen in Figs. 4 and 5, confirm and expand previous assumptions and findings within media studies on the roles of the respective publishers (Kasekamp et al., 2019) and on radical right populism in general (cf. Mudde, 2007; Rooduijn et al., 2014). We found trends that showed polarization and indicated changes of stance corresponding to the UN migration pact, the elections, and the Ukraine war. The stance of Uued Uudised was generally against immigration rather than neutral or supportive. Ekspress Grupp, on the other hand, had a dominantly neutral stance over time and remained generally more stable than Uued Uudised. The relative stance differed noticeably per keyword group, with words related to multiculturalism, xenophobia and race having the highest percentage of sentences labeled as Against migration (cf. Figs. S6-S9 for stances per keyword group).
A clear change takes place around 2018-2019, during the UN migration pact discussions (with the most heated debates in Estonian media around November 2018) and the general elections (March 3, 2019). Uued Uudised contained more sentences classified as Against migrants than before or right after that period. The share of the Against stance increases with the UN migration pact discussions but decreases soon after the elections in March 2019. The Against stance increased in these years for all keyword groups. A change is also noticeable in Ekspress Grupp, where the share of the Against stance increases during the same period. This suggests a connection between the politicization of the migration topic and the elections, which could be further investigated in future research.
From March 2020, when Covid-19 became the dominant media event, the stances seem to change again. This may be due to a shift of focus to other topics, such as Covid, where the radical right moved from anti-immigration to anti-governmental rhetoric. Lastly, the Russian invasion of Ukraine in 2022 corresponds to a small increase in supportive stance in Uued Uudised and a much larger increase in supportive stance towards immigrants in Ekspress Grupp. Whilst the Supportive stance increased in almost all keyword groups for Ekspress Grupp, there was more variability for Uued Uudised. This ambiguity of Uued Uudised may reflect the continued anti-immigration rhetoric of the related right-wing political party EKRE. To understand the changes taking place within and between the publishers, we calculated the cosine similarities between sentences from the different publishers with Sentence-BERT (Reimers & Gurevych, 2020). Figure 6 shows how the cosine similarity spiked at the end of 2018 and in 2022. We interpret the change as a possible increase in the similarity of topics or rhetoric towards immigration. The latter change, connected to the Ukrainian war, differs from the one connected to the 2018 UN migration pact, as the similarity increased to a similar degree for all stances. Analysis of similarities within publisher sources (cf. Fig. S12) showed similar trends relating to those events, meaning that both publishers possibly focused more on the same media events or used similar rhetoric during these months.
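The cross-publisher comparison rests on cosine similarity between sentence embeddings. A minimal sketch with toy four-dimensional vectors (in the study, the embeddings come from Sentence-BERT and have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors:
    1.0 for identical direction, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "sentence embeddings", one per publisher (hypothetical values)
sent_ekspress = [0.2, 0.8, 0.1, 0.4]
sent_uued = [0.3, 0.7, 0.0, 0.5]
similarity = cosine_similarity(sent_ekspress, sent_uued)
```

Averaging such pairwise similarities per month and per stance yields trend lines like those in Fig. 6.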

Discussion
Our study shows that automated stance detection is feasible and provides useful insights for media monitoring and analytics purposes, also beyond large languages like English or German. The accuracy of the classifiers was satisfactory, achieving a macro F1 of 0.66 with Est-RoBERTa. Zero-shot ChatGPT achieved a similar result of 0.65. We expect zero-shot accuracy to improve as generative AI models are further developed. Classification of Against stances was noticeably more accurate than of Supportive stances. As expected, radical-right news media indeed generally holds a more anti-immigration stance than more mainstream news. We also provided insights into stance change over time, relating it to known local and world events and identifying increased interest in these topics during the 2015-2016 migration crisis, the 2018-2019 UN migration pact and local elections, and the 2022 Russian invasion of Ukraine. These findings, approximated by applying an automated classifier, can serve as a basis for further, more in-depth research in Estonian-specific or areal media and politics studies.
However, there are also limitations. Fine-tuning pretrained LLMs as classifiers requires annotated training data, which may not be available for specific topics or in lower-resource languages. We discussed issues with annotation, pointing out that linguistically and socio-politically complex topics such as this one are difficult both for human annotators and for formalizing the task. There is also the question of the unit of analysis: shorter units like sentences are fast to annotate but may not contain enough contextual information; longer units like paragraphs do, but may contain multiple stances, which complicates the task for both humans and machines.

Future research
Whilst supervised stance detection can provide acceptable results, the need for annotated training data makes it time-consuming and expensive, while being applicable to only one topic at a time. One option is to use a generic sentiment classifier instead. However, we showed that this does not work well for complex topics such as immigration, where support may be expressed in sentences with a negative overall tone, and vice versa. Using new-generation generative LLMs like ChatGPT may provide a solution, as they are easy to instruct in natural language and applicable across languages, tasks and topics. This makes them particularly attractive for smaller languages with fewer resources and fewer existing annotated datasets.
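Instructing such a model amounts to composing a natural-language prompt. A minimal sketch of prompt construction (the wording and label set are hypothetical examples, not the exact prompt used in our experiments, and the API call itself is omitted):

```python
def build_stance_prompt(sentence, topic="immigration"):
    """Compose an illustrative zero-shot stance-classification instruction.

    The phrasing here is a hypothetical example; real prompts need
    careful engineering and evaluation, as discussed in the text.
    """
    labels = ["Against", "Neutral", "Supportive"]
    return (
        f"Classify the stance of the following sentence towards {topic}. "
        f"Answer with exactly one of: {', '.join(labels)}.\n\n"
        f"Sentence: {sentence}"
    )

prompt = build_stance_prompt("The new arrivals enrich our culture.")
```

The same template works across topics and languages by swapping the `topic` argument and the sentence, which is precisely what makes the zero-shot approach attractive for lower-resource settings.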
These models could also be used to annotate data in tandem with human annotators, or to augment existing annotations (Gilardi et al., 2023). Accuracy and model bias should still be evaluated. For example, in our case such a model could have been used to further classify sentences as expressing opinions, factual descriptions, or direct quotes. This can create a feedback loop that yields better datasets, more accurate models, and a better understanding of how the model functions through assessing its classification errors.
This new approach has already been explored in preliminary experiments, which our research complements. The accuracy of zero-shot learning should still be evaluated rather than taken for granted. For this, annotated datasets, such as the one we make available, remain useful and necessary. This is especially relevant for smaller languages, for which models like ChatGPT have likely seen less initial training data. While annotating new datasets requires instructing human annotators, using generative models requires careful prompt engineering. This also complicates the replication of results, as slightly different instructions can lead to differences in classification performance, in addition to the inherently stochastic nature of generative AI (Reiss, 2023). Moreover, results are difficult to replicate if a cloud-based, frequently updated model like ChatGPT is used. Then again, these models may also improve accessibility, making deep learning feasible for non-computer scientists and researchers with limited access to large computer clusters, although this will depend on the companies offering such services and their pricing policies. We therefore expect generative-AI-driven analytics to become more widespread across disciplines, together with the affordances of cloud-based computing. This also calls for more critical studies and thorough analyses of these methods' applications, to better understand the biases related to specific LLMs and cloud-based services.

Conclusions
We demonstrated the applicability of automated stance detection using pretrained LLMs for socio-politically complex topics in smaller languages, on the example of Estonian news media coverage of immigration discourse. We compared several popular models and also release the stance-annotated dataset. Our experiments with ChatGPT as an instructable zero-shot classifier are promising, and if applied carefully, this approach could obviate the need for topic-specific annotation and expedite media analytics and monitoring tasks, especially in languages where such resources are limited. As a proof of concept, we also applied one classifier to the larger corpus to provide an overview of changes in immigration discourse in Estonian news media in 2015-2022, covering one mainstream and one radical-right news source, finding support for discussions in previous literature as well as providing new insights.

Declarations
Annotation and data acquisition, together with the preliminary analysis, were conducted with co-funding from Ekspress Grupp, which influenced neither the design of the study nor its conclusions. M.M., A.K., M.S., and I.I. are supported by the CUDAN ERA Chair project for Cultural Data Analytics at Tallinn University, funded through the European Union Horizon 2020 research and innovation program (Project No. 810961).

Text embeddings explanation
We provide a short explanation of LLMs and the classification process. LLMs are based on text embeddings: the text is first transformed into numerical form based on word co-occurrences. This rests on the distributional hypothesis, which states that the meaning of a word can be inferred from its lexical context, that is, from the words around it (Sahlgren, 2008). The goal of the model is therefore to predict the probabilities of words occurring among other words. The same logic applies to ChatGPT. Next, the text undergoes dimension reduction and is represented in a multidimensional space, where the number of dimensions depends on the specific model (e.g. 768 for BERT). Every dimension represents some abstract feature of the word, like "blueness" or "wealth", although in practice these dimensions are hard to identify. The resulting embedding space allows us to rely on a spatial interpretation of textual relations, e.g. how far the terms "immigration" and "refugees" are from each other relative to other terms, or, as in our case, the comparison of distances between whole sentences.

Classification explanation
The classification model uses the given parameters to learn which sentence corresponds to which category through trial and error on the labeled training data. Pre-training uses a process called masking, where the model tries to guess either which word fits into an empty slot given the surrounding words, or which words could surround a specific word. The resulting model is expected to generalize to other similar data, which is measured on a held-out portion of the annotated data not used for training. BERT has been reported to often outperform earlier embedding models such as Word2Vec or GloVe (Devlin et al., 2019), although a more complicated combination of BERT with other methods may yield somewhat better results (Li et al., 2019), not to mention larger models like the GPTs. BERT follows models like Word2Vec, GloVe and ELMo and is superseded by similar but even larger language models, like RoBERTa. All of them are pre-trained on large text corpora and are typically further fine-tuned, as in our case, to fit a more specific task. They differ in the size of the pre-training corpus, but all use datasets of millions of words. What sets them apart is that Word2Vec and GloVe give only one vector per word, whereas ELMo and BERT are context sensitive: the word "bank" in the phrase "the river bank erodes" gets a different numerical representation than "bank" in "the bank refinances". Compared to ELMo, BERT handles this better by simultaneously considering the text both before and after the word "bank".
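The static-versus-contextual distinction can be illustrated with a toy model. The vectors and the mixing rule below are hypothetical stand-ins, not how BERT actually computes representations: a static model returns the same vector for "bank" everywhere, while a toy "contextual" model blends the word's vector with its surrounding words, so the two uses of "bank" diverge.

```python
# Toy static word vectors (hypothetical 3-dimensional values)
static = {
    "bank": [0.5, 0.5, 0.0],
    "river": [0.9, 0.1, 0.0],
    "erodes": [0.8, 0.0, 0.1],
    "financial": [0.0, 0.9, 0.1],
    "refinances": [0.1, 0.8, 0.2],
}

def static_vector(word, context):
    """A Word2Vec-style lookup: one vector per word, context ignored."""
    return static[word]

def contextual_vector(word, context):
    """A toy stand-in for BERT-style contextualization: average the
    word's vector with the mean of its context words' vectors."""
    ctx = [static[w] for w in context]
    mean = [sum(dim) / len(ctx) for dim in zip(*ctx)]
    return [(w + c) / 2 for w, c in zip(static[word], mean)]

# The static model cannot tell these two uses of "bank" apart;
# the contextual toy model produces two different vectors.
v_river = contextual_vector("bank", ["river", "erodes"])
v_finance = contextual_vector("bank", ["financial", "refinances"])
```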

Stance trends with threshold
Figures S10 and S11. Relative amount of stances, plus a group of stances that were less certain. The uncertain class contains sentences with less than 70% probability of fitting into any specific class. E.g. a sentence under the threshold may have a 65% probability of being anti-immigration, with the other 35% shared between the neutral and pro-immigration classes. The plots show that using thresholds does not have a large impact on the general trends. The share of below-threshold sentences is larger in Ekspress Grupp than in Uued Uudised, but the overall difference is 5-10% and varies relatively little across months.
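The thresholding rule described above is straightforward to implement; a minimal sketch (label names and the example probabilities are illustrative):

```python
def apply_threshold(probs, labels=("Against", "Neutral", "Supportive"),
                    threshold=0.7):
    """Keep a stance label only if its top-class probability reaches
    the threshold; otherwise mark the sentence as uncertain."""
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        return labels[best]
    return "Uncertain"

# The 65%-Against example from the caption falls below the 70% threshold
label = apply_threshold([0.65, 0.20, 0.15])
# A more confident prediction keeps its class label
confident = apply_threshold([0.85, 0.10, 0.05])
```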
Sentence embeddings trends
(Cf. Fig. S1 for the detailed distribution of Ekspress data and Fig. 1 for the distribution of filtered sentences.)

Figure 1 .
Figure 1. Monthly distribution of immigration-related sentences. The red line represents Ekspress Grupp and the blue line Uued Uudised. There is no data for Ekspress Grupp at the end of 2019, where the count is 0. The change in relevant sentences in Ekspress Grupp after that reflects a difference in the dataset, which became larger and more varied in terms of specific periodicals (cf. Figs. S1 and S3 for the distribution of articles per Ekspress Grupp periodical and Fig. S4 for a similar distribution of immigration-related sentences per week).

Figure 2 .
Figure 2. Confusion matrix of stance detection, based on one fold from the best-performing model. Percentages show the overlap between true (annotated) and predicted classes. An ideal but unrealistic classification would show 100% along the diagonal from bottom left to top right. We regard the small values in the top left and bottom right as a good sign, showing that most mistakes were between Supportive or Against and Neutral, not between the two extremes (cf. Fig. S5 for a comparison with sentiment analysis).

Figure 3 .
Figure 3. Percentage of articles mentioning immigration. Top plots show the counts of articles mentioning immigration, i.e. containing at least one immigration-related keyword. The higher percentage for the populist radical-right source (blue) confirms that the outlet is more focused on immigration. The fluctuations in Uued Uudised are due to the smaller amount of data in absolute terms, especially in 2015. The relatively lower amount of immigration-related articles in the Ekspress Grupp data since 2020 is likely connected to the significantly increased amount of content from a larger variety of specific journals, indicating that the amount of immigration-related content depends somewhat on the specific journals of Ekspress Grupp (see the appendix on Ekspress Grupp distribution details).

Figures 4 and 5 .
Figures 4 and 5. Stances of immigration-related sentences, showing the relative percentage of each stance per month for both publishers. Barplots show the number of immigration sentences per month for comparison. In 2022, at the beginning of the Ukraine war, there was a noticeable increase in Supportive stances towards migration in Ekspress Grupp, with a much smaller increase in Uued Uudised (cf. Figs. S10 and S11 for trends with thresholds for classification certainty).

Fig 6 .
Fig 6. Comparison of sentence cosine similarities between the publishers. Similarities are calculated separately per stance. The larger spikes in cosine similarity at the end of 2018 and in 2022 may indicate that the outlets paid attention to similar events with a similar stance.

Fig S1 .
Fig S1. Distribution of all articles in our dataset per periodical. The area chart distinguishes the biggest periodicals of Ekspress Grupp per month, and the blue trend compares them to Uued Uudised. Ekspress Grupp articles from 2015 to 2020 mostly originate from Delfi, a fully

Figure S2 .
Figure S2. Distribution of keyword groups by the number of sentences mentioning keywords relevant to the group. Refugee- and migration-related keywords make up most of the dataset, whilst there are relatively few sentences about foreign students. The two outlets show some differences: e.g. Uued Uudised has double the number of sentences on refugee and foreign workforce topics, while xenophobia- and multiculturalism-related keywords are used more in Ekspress Grupp, although there are differences for specific keywords. This indicates that the publishers focus on different immigration-related topics.

Figure S3 .
Figure S3. Distribution of immigration-related articles per periodical. Shows the articles with immigration-related keywords for the largest periodicals of Ekspress Grupp in our dataset. Stacked colors represent different publications in Ekspress Grupp; the main source is the online platform Delfi (purplish). Compared to Uued Uudised, marked with a blue line.

Figure S4 .
Figure S4. Weekly distribution of immigration-related sentences. A more detailed view of immigration-related events on a weekly scale, provided as a comparison to the monthly trends.

Figures
Figures S6 and S7. Changes in the Against stance per keyword group. Shows the yearly changes per keyword group. We found the Against stance most informative for analyzing changes dependent on specific keyword groups. Note the different scales of the y-axes.

Figures
Figures S8 and S9. Changes in the Against stance per keyword group. Depicts yearly changes. For Ekspress Grupp in 2022, all topics except foreign students become more positive, especially xenophobia/multiculturalism-related keywords and the large refugee keyword group. Interestingly, there is a slow increase in supportive stances towards the foreign workforce across time. The changes in 2022 are much smaller for Uued Uudised, but show more differentiation between topics.

Figure S12 .
Figure S12. Monthly average cosine similarities of sentences within the same publisher. Similarities are calculated separately per stance, e.g. comparing neutral sentences to other neutral sentences.

Table 1 .
Comparison of classification models. F1 scores from different models for each class and across all classes. Bold indicates the best result, achieved with Est-RoBERTa. We used 5-fold cross-validation, holding out 20% of the data per fold, for all models.

Table S4 .
Categorization of misclassified sentences. Sentences wrongly classified by the model, confusing the Supportive and Against classes. Examples are taken from shorter sentences; original punctuation and spelling are retained. The Probabilities column gives the classification probabilities for each example, with the numbers corresponding, in order, to the Against, Neutral and Supportive labels.