Machine-learning media bias

We present an automated method for measuring media bias. Inferring which newspaper published a given article, based only on the frequencies with which it uses different phrases, leads to a conditional probability distribution whose analysis lets us automatically map newspapers and phrases into a bias space. By analyzing roughly a million articles from roughly a hundred newspapers for bias in dozens of news topics, our method maps newspapers into a two-dimensional bias landscape that agrees well with previous bias classifications based on human judgement. One dimension can be interpreted as traditional left-right bias, the other as establishment bias. This means that although news bias is inherently political, its measurement need not be.


Introduction
Political polarization has increased in recent years, both in the United States and internationally [1], with pernicious consequences for democracy and its ability to solve pressing problems [2]. It is often argued that such polarization is stoked by the media ecosystem, with machine-learning-fueled filter bubbles [3] increasing the demand for and supply of more biased media. Media bias is defined by [4] as favoring, disfavoring, emphasizing or ignoring certain political actors, policies, events, or topics in a way that is deceptive toward the reader, and can be accomplished through many different techniques.
In response, there have been significant efforts to protect democracy by studying and flagging media bias. However, there is a widespread perception that fact-checkers and bias-checkers can themselves be biased and lack transparency [5]. It is therefore of great interest to develop objective and transparent measures of bias that are based on data rather than subjective human judgement calls. Early work in this area is reviewed in [6].

[FIG. 1 caption: Generalized principal components for articles about BLM. The colors and sizes of the dots were predetermined by external assessments and thus in no way influenced by our data. The positions of the dots thus suggest that the horizontal axis can be interpreted as the traditional left-right bias axis, here automatically rediscovered by our algorithm directly from the data.]

Our goal is to make the bias-detection algorithm as automated, transparent and scalable as possible, so that the biases of phrases and newspapers are machine-learned rather than input by human experts. For example, the horizontal positions of phrases and newspapers in FIG. 1, which can be interpreted in terms of left-right bias, were computed directly from our data, without using any human input as to how various phrases or media sources may be biased.

The rest of this paper is organized as follows. The Methods section describes our algorithm for automatically learning media bias from an article database, including a generalization of principal component analysis tailored for phrase-frequency modeling. The Results section shows our findings for the most biased topics, and identifies a two-dimensional bias landscape that emerges from how bias correlates across topics, with left-right stance and establishment stance as its two bias axes. The Conclusions section summarizes and discusses our findings.

Methods
In this section, we present our method for automated bias detection. We first describe how we automatically map both phrases (meaning monograms, bigrams, or trigrams) and newspapers into a d-dimensional bias space using phrase statistics alone, then present our method for phrase selection.

Generalized SVD-modeling of phrase statistics
Given a set of articles from n different media sources, we begin by counting occurrences of m phrases (say "fetus", "unborn baby", etc.). We arrange these counts into an m × n matrix N of natural numbers N_ij ≥ 0 encoding how many times the i-th phrase occurs in the j-th media source. We model N_ij as a random variable drawn from a Poisson distribution whose mean N̄_ij (the average number of times the phrase occurs) is non-negative and depends both on the phrase i and the media source j:

    P(N_ij) = e^{-N̄_ij} N̄_ij^{N_ij} / N_ij!    (1)

Our goal is to accurately model this matrix N in terms of biases that link phrases and newspapers. Specifically, we wish to approximate either N̄ (or, alternatively, its logarithm) as a low-rank matrix, as in Singular-Value Decomposition (SVD) [17]:

    N̄ = U W V^t,  i.e.,  N̄_ij = Σ_{k=1}^{r} w_k U_ik V_jk,    (2)

where the rank r < min(m, n). Without loss of generality, we can choose U and V to have orthonormal columns (U^t U = I, V^t V = I) and w_k > 0. Singular-value decomposition (SVD) corresponds to minimizing the mean-squared-error loss function L_SVD = ||N − N̄||²_2. Although SVD is easy to compute and interpret mathematically, it is poorly matched to our media-bias modeling problem for two reasons. First, it will in some cases predict negative phrase counts N̄_ij, which of course makes no sense as a language model. Second, it implicitly gives equal weight to fitting every single number N̄_ij, even though some are measured much more accurately than others from the data (the Poisson error bar is √N̄_ij, and phrase counts can differ from one another by orders of magnitude). To avoid these shortcomings, we choose not to minimize the SVD loss, but to instead maximize the Poisson likelihood

    L_Poisson = Π_ij e^{-N̄_ij} N̄_ij^{N_ij} / N_ij!,    (3)

i.e., the likelihood that our model produces the observed phrase counts N. Numerically, it is more convenient to maximize its logarithm

    ln L_Poisson = Σ_ij [N_ij ln N̄_ij − N̄_ij − ln(N_ij!)] ≈ Σ_ij [N_ij ln(N̄_ij/N_ij) + N_ij − N̄_ij].    (4)

The approximation in the last step uses Stirling's approximation ln(k!) ≈ k ln(k/e), and we use it for numerical speedup only when N_ij > 50. To avoid the aforementioned problems with forbidden negative N̄-values, we try two separate fits and select the one that fits the data better (gives a higher Poisson likelihood):

    N̄ = ReLU(U W V^t)   or   N̄ = exp(U W V^t)   (element-wise),    (5)

where ReLU(x) = x if x ≥ 0, vanishing otherwise. In our numerical calculations in the Results section, we find that the second fit performs better most of the time, but not always.
We determine the best fit by selecting the desired rank r (typically r = 3) and numerically minimizing the loss function L ≡ −ln L_Poisson over the fitting parameters w_k, U_ik and V_jk. We do this using the gradient-descent method implemented in scipy.optimize [18], which is greatly accelerated by the following exact formulas for ∇L that follow from equations (3) and (4):

    ∇_U L = −D V W,   ∇_V L = −D^t U W,   ∂L/∂w_k = −(U^t D V)_kk,   with D_ij ≡ (N_ij/N̄_ij − 1) θ((U W V^t)_ij),

where W is the diagonal matrix with W_kk = w_k, and θ is the Heaviside step function defined by θ(x) = 1 if x > 0, vanishing otherwise. For the exponential parametrization of equation (5), these formulas are identical except that D = N − N̄. Once the numerical optimization has converged and determined N̄, we use the aforementioned freedom to ensure that U and V have orthonormal columns and w_k ≥ 0.
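The fit described above can be sketched in a few lines; this is a minimal illustration with our own variable names (for brevity it uses the ReLU parametrization only and absorbs the weights w_k into U rather than fitting them separately), not the authors' actual code:

```python
# Minimal sketch of the rank-r Poisson factorization: fit Nbar = ReLU(U V^t)
# by minimizing the negative Poisson log-likelihood with its exact gradient.
import numpy as np
from scipy.optimize import minimize

def fit_poisson_lowrank(N, r=3, seed=0):
    m, n = N.shape
    rng = np.random.default_rng(seed)

    def unpack(x):
        U = x[:m * r].reshape(m, r)
        V = x[m * r:].reshape(n, r)
        return U, V

    def loss_and_grad(x):
        U, V = unpack(x)
        A = U @ V.T                      # w_k absorbed into U here for brevity
        Nbar = np.maximum(A, 1e-12)      # ReLU, floored to keep the log finite
        # L = sum(Nbar - N*log(Nbar)), dropping the N-independent ln(N!) term
        L = np.sum(Nbar - N * np.log(Nbar))
        D = (N / Nbar - 1.0) * (A > 0)   # D_ij = (N_ij/Nbar_ij - 1) theta(A_ij)
        gU = -D @ V                      # exact gradient formulas from the text
        gV = -D.T @ U
        return L, np.concatenate([gU.ravel(), gV.ravel()])

    x0 = rng.uniform(0.5, 1.5, size=(m + n) * r)
    res = minimize(loss_and_grad, x0, jac=True, method="L-BFGS-B")
    U, V = unpack(res.x)
    return np.maximum(U @ V.T, 0.0)

# Toy example: an exactly rank-1 count matrix should be recovered closely.
N = np.outer([1.0, 2.0, 3.0], [4.0, 5.0])
Nbar = fit_poisson_lowrank(N, r=1)
```

On exactly low-rank data the Poisson maximum-likelihood fit coincides with the data itself, which makes the toy example easy to sanity-check.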

Data
Using the open-source Newspaper3k software [19], we scraped and downloaded a total of 3,078,624 articles published between January 2019 and December 2020 from 100 media sources chosen to include the largest US newspapers as well as a broad diversity of political stances. The 83 newspapers appearing in our generalized SVD bias figures below are listed in FIG. 4, and the correlation analysis at the end also includes articles from Defense One and Science.
The downloaded article text was minimally pre-processed before analysis. All text in "direct quotes" was removed from the articles, since we are interested in biased phrase use by journalists, not by their quoted sources. We replaced British spellings of common words (e.g., favourite, flavour) with American spellings (favorite, flavor) to erase spelling-based clues as to which newspaper an article is from. Non-ASCII characters were replaced by their closest ASCII equivalents. Text was stripped of all punctuation marks except periods, which were removed only when they did not indicate end-of-sentence: for example, "M.I.T." would become "MIT". End-of-sentence periods were replaced by "PERIOD" to avoid creating false bigrams and trigrams containing words not in the same sentence. Numerals were removed unless they were ordinals (1st, 17th), in which case they were replaced with equivalent text (first, seventeenth). The first letter of each sentence was lower-cased, but all other capitalization was retained. We discarded any articles containing fewer than ten words after the aforementioned preprocessing.
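The preprocessing steps above can be sketched as follows; the regular expressions and the spelling table are illustrative assumptions (ordinal conversion and sentence-initial lower-casing are omitted for brevity):

```python
# Illustrative sketch of the preprocessing pipeline, not the authors' code.
import re
import unicodedata

BRITISH_TO_AMERICAN = {"favourite": "favorite", "flavour": "flavor"}  # excerpt

def preprocess(text):
    # 1. Remove direct quotes, since we want journalists' own phrasing.
    text = re.sub(r'"[^"]*"', " ", text)
    # 2. Normalize British spellings (the full table would be much longer).
    for brit, amer in BRITISH_TO_AMERICAN.items():
        text = re.sub(rf"\b{brit}\b", amer, text)
    # 3. Replace non-ASCII characters with their closest ASCII equivalents.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    # 4. Drop abbreviation periods ("M.I.T." -> "MIT"), keep sentence-final ones.
    text = re.sub(r"(?<=[A-Za-z])\.(?=[A-Za-z])", "", text)
    # 5. Mark end-of-sentence periods so n-grams never cross sentences.
    text = re.sub(r"\.(\s|$)", " PERIOD ", text)
    # 6. Strip remaining punctuation and collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, `preprocess('He said "nope" at M.I.T. Next.')` drops the quoted word, joins the abbreviation, and marks both sentence boundaries with PERIOD.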

Extraction of discriminative phrases
We auto-classified the articles by topic using the open-source MITNewsClassify package from [20]. For each of the topics mentioned below (covered in 779,174 articles), we extracted discriminative phrases by first extracting the 10,000 most common phrases, then ranking, purging and merging this phrase list as described below.

Automatic purge
To avoid duplication, we deleted subsumed monograms and bigrams from our phrase list: we deleted all monograms that appeared in a particular bigram more than 70% of the time and all bigrams that appeared in a particular trigram more than 70% of the time. For the BLM topic, for example, "tear" was deleted because it appeared in "tear gas" 87% of the time.
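A minimal sketch of this subsumption purge (the 70% threshold is from the text; the data layout is our own assumption):

```python
# Delete n-grams that are subsumed by a longer (n+1)-gram more than 70%
# of the time. `counts` maps a space-separated phrase to its total count.
def purge_subsumed(counts, threshold=0.7):
    def contains(longer, shorter):
        lw, sw = longer.split(), shorter.split()
        k = len(sw)
        return any(lw[i:i + k] == sw for i in range(len(lw) - k + 1))

    doomed = set()
    for longer, c_long in counts.items():
        for shorter, c_short in counts.items():
            if len(shorter.split()) == len(longer.split()) - 1 \
                    and contains(longer, shorter) \
                    and c_long > threshold * c_short:
                doomed.add(shorter)   # e.g., "tear" subsumed by "tear gas"
    return {p: c for p, c in counts.items() if p not in doomed}
```

With the BLM example from the text, `purge_subsumed({"tear": 100, "tear gas": 87, "gas": 500})` drops "tear" (87% subsumption) but keeps "gas".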
Next, all phrases were sorted in order of decreasing information score

    I_i ≡ Σ_j P_ij log₂ [ P_ij / (P_i• P_•j) ],

where P_ij ≡ N_ij / N_•• is the aforementioned N-matrix rescaled as a joint probability distribution over phrases i and newspapers j, and replacing an index by a dot denotes that the index is summed over; for example, N_•• is the total number of phrases in all the articles considered. The mutual information between phrases and articles is Σ_i I_i, which can be interpreted as how many bits of information we learn about which newspaper an article is from by looking at one of its phrases. The information scores I_i can thus be interpreted as how much of this information the i-th phrase contributes.
Phrases are more informative both if they are more common and if their use frequency varies more between newspapers. We remove all phrases where more than 90% of all occurrences of the phrase are from a single newspaper. These "too good" phrases commonly reference journalist names or other things unique to a newspaper but not indicative of political bias. For example, CNBC typically labels its morning news and talk program Squawk Box, making the phrase Squawk Box useful for predicting that an article is from CNBC but not useful for learning about media bias. To further mitigate this problem, we created a blacklist of newspaper names, journalist names, other phrases uniquely attributable to a single newspaper, and generic phrases that had little stand-alone meaning in our context (such as "article republished"). Phrases from this list were discarded for all topics. Phrases that contained PERIOD were also removed from consideration. Just as we discarded direct quotes above, we also removed all phrases that contained "said" or "told", because they generally involved an indirect quote. Once this automatic purge was complete, the 1,000 remaining candidate phrases with the highest information scores were selected for manual screening as described in the next section.
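The information-score ranking and the single-newspaper filter above can be sketched as follows (array names are ours; the 90% cutoff is from the text):

```python
# Information score I_i and the "too good" single-source filter.
import numpy as np

def information_scores(N):
    """N: (phrases x newspapers) count matrix. Returns I_i in bits."""
    P = N / N.sum()
    Pi = P.sum(axis=1, keepdims=True)   # P_{i.}
    Pj = P.sum(axis=0, keepdims=True)   # P_{.j}
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(P > 0, P * np.log2(P / (Pi * Pj)), 0.0)
    return terms.sum(axis=1)            # summing over i gives the mutual info

def drop_single_source(N, frac=0.9):
    """Keep only phrases with <= 90% of occurrences in one newspaper."""
    return N.max(axis=1) <= frac * N.sum(axis=1)

# Toy matrix: phrase 0 appears only in newspaper 0; phrase 1 is uniform.
N = np.array([[10.0, 0.0], [5.0, 5.0]])
```

Here phrase 0 carries more information about the newspaper than the evenly spread phrase 1, but is also exactly the kind of single-source phrase the filter removes.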

Manual purge and merge
To be included in our bias analysis, phrases must meet the following criteria:

1. Relevance:
• In order to be relevant to a topic, a phrase must not be a very common one that has ambiguous stand-alone meaning. For example, the phrase "social media" could be promoting social media pages, as in "Follow us on social media", or referencing a social media site. For simplicity, such common phrases with multiple meanings were excluded. Note that longer phrases (bigrams or trigrams) that contained such shorter phrases (monograms or bigrams) could still be included, such as "social media giants" in the tech censorship topic.
• A phrase is allowed to occur in multiple topics (for example, "socialism" is relevant to both the Venezuela and Cuba topics), but a sub-topic is not. For example, phrases related to the sub-topic tech censorship in China were excluded from both the tech censorship and China topics because they were relevant to both.
2. Uniqueness: Since there was minimal pre-processing, many phrases appear with different capitalizations or conjugations. In some cases, only one of the phrase variations was included and the others were discarded. In other cases, all variations were included because they represented a meaningful difference. These choices were made on a case-by-case basis, with a few general rules.
If both a singular and a plural version of a word were present, only the more frequent variant was kept. If phrases were differentially capitalized (for example, "big tech" and "Big Tech"), we kept both if they landed more than two standard deviations apart in the generalized principal component plot; otherwise we kept only the more frequent variant. If phrases were a continuation of one another, such as "Mayor Bill de" and "Bill de Blasio", the more general phrase was included: in this case "Bill de Blasio", because it does not contain an identifier. If there was no identifier, the more informative phrase was kept: for example, discarding "the Green New" while keeping "Green New Deal".
3. Specificity: Phrases must be specific enough to stand alone. A phrase was deemed specific if it could be interpreted without context or was overwhelmingly likely to pertain to the relevant topic. This rules out phrases with only filler words (e.g., "would like", "must have") and phrases that are too general (e.g., "politics").
4. Organize subtopics (if needed): Some topics were far larger and broader than others. For example, finance contained many natural subtopics, including private finance and public finance. If natural subtopics appeared during the above process, the parent topic was split into subtopics. If topics were small and specific, such as guns, no such additional manual processing was performed.

5. Edge cases: There were about a dozen cases on the edge of exclusion based on the above criteria, for which the include/exclude decision was based on a closer look.
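The two-standard-deviation capitalization rule from the uniqueness criterion can be sketched as a toy function (the data layout, and the choice to compare against the larger of the two error bars, are our assumptions):

```python
# Keep both capitalization variants only if they land more than two error
# bars apart in the generalized principal component plot.
import numpy as np

def merge_variants(v1, v2):
    """Each variant: dict with 'count', 'pos' (its coordinates in the
    component plot), and 'err' (a 1-sigma error bar)."""
    dist = np.linalg.norm(np.asarray(v1["pos"]) - np.asarray(v2["pos"]))
    if dist > 2 * max(v1["err"], v2["err"]):
        return [v1, v2]            # usage differs meaningfully: keep both
    return [v1] if v1["count"] >= v2["count"] else [v2]   # keep most frequent

big = {"count": 120, "pos": (0.8, 0.1), "err": 0.1}
Big = {"count": 60, "pos": (-0.5, 0.0), "err": 0.1}
```

With these toy numbers the two variants sit far apart and both survive; a variant within two error bars of `big` would be merged into it.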

Results
In this section, we present the results of applying our method to the aforementioned 779,174-article dataset. We first explore how the well-known left-right media bias axis can be auto-discovered. We then identify a second bias axis related to establishment stance, and conclude this section by investigating how bias correlates across topics.

Left-Right Media Bias
We begin by investigating the Black Lives Matter (BLM) topic, because it is so timely.
The BLM Movement swept across the USA in the summer of 2020, prompting media coverage from newspapers of varied size and political stance. We first compute the aforementioned N-matrix describing phrase statistics; N_ij is how many times the i-th phrase was mentioned in the j-th newspaper. We have made this and all the other N-matrices computed in this paper available online¹. Table 1 shows a sample, rescaled to show the number of occurrences per article, revealing that the frequency of certain phrases varies dramatically between media sources. For example, we see that "riots" is used about 60 times more frequently in PJ Media than in the NY Times, which prefers using "protests". As described in the previous section, our generalized principal component analysis attempts to model this N-matrix in terms of biases that link phrases and newspapers. The first component (which we refer to as component 0) tends to model the obvious fact that some phrases are more popular in general and some newspapers publish more articles than others, so we plot only the next two components (which we refer to as 1 and 2) below. BLM components 1 and 2 are shown in FIG. 1, corresponding to the horizontal and vertical axes: the phrase panel (left) plots U_i1 against U_i2 for each phrase i and the media panel (right) plots V_j1 against V_j2 for each media source j. The bars represent 1-standard-deviation error bars computed using the Fisher information matrix method. To avoid clutter, we only show phrases occurring at least 200 times and newspapers with at least 200 occurrences of our discriminative phrases; for topics with fewer than 15,000 articles, we drop the phrase threshold from 200 to 100. In the media panel, the dots representing newspapers are colored based on external left-right ratings and scaled based on external pro-/critical-establishment ratings (which crudely correlate with newspaper size)². It is important to note that the colors and sizes of the dots were predetermined by external assessments and thus in no way influenced by the N-matrices that form the basis of our analysis in this paper. It is therefore remarkable that FIG. 1 reveals a clear horizontal color separation, suggesting that the first BLM component (corresponding to the horizontal axis) can be interpreted as the well-known left-right political spectrum.

Phrase bias and valent synonyms
As described in the Methods section, the phrases appearing in FIG. 1 (left panel) were selected by our algorithm as the ones that best discriminated between different newspapers. We see that they typically carry implicit positive or negative valence. Looking at how these phrases are used in context reveals that some of them form groups of phrases that can be used rather interchangeably, e.g., "protests" and "riots". For example, a June 8, 2020 New York Times article reads "Floyd's death triggered major protests in Minneapolis and sparked rage across the country" [24], while a June 10, 2020 Fox News article mentions "The death of George Floyd in police custody last month and a series of riots that followed in cities across the nation" [25]. The x-axis in FIG. 1 is seen to automatically separate this pair, with "protests" on the left and "riots" on the right, and newspapers (say, NY Times and PJ Media) similarly being left-right separated in the right panel according to their relative preference for these two phrases. FIG. 2 shows many such groups of emotionally loaded near-synonyms for both BLM and other topics. In many cases, we see that such a phrase group can be viewed as falling on a linguistic valence spectrum from positive (euphemism) to neutral (orthophemism) to negative (dysphemism).

The nutpicking challenge
FIG. 1 is seen to reveal a clean, statistically significant split between almost all left-leaning and right-leaning newspapers. The one noticeable exception is Counterpunch, whose horizontal placement shows it breaking from its left-leaning peers on BLM coverage. A closer look at the phrase observations reveals that this interpretation is misleading, an artifact of some newspapers placing the same phrase in contexts where it has opposite valence. For example, a Counterpunch article treats the phrase "defund the police" as having positive valence by writing "the advocates of defund the police aren't fools. They understand that the police will be with us but that their role and their functions need to be dramatically rethought" [26]. In contrast, right-leaning PJ Media treats "defund the police" as having negative valence in this example: "If you're a liberal, what's not to like about the slogan defund the police? It's meaningless, it's stupid, it's dangerous, and it makes you feel good if you mindlessly repeat it" [27]. This tactic is known as nutpicking: picking out and showcasing what your readership perceives as the nuttiest statements of an opposition group as representative of that group.
In other words, whereas most discriminative phrases discovered by our algorithm have a context-independent valence ("infanticide" always being negative, say), some phrases are bi-valent in the sense that their valence depends on how they are used and by whom. We will encounter this challenge in many of the news topics that we analyze; for example, most U.S. newspapers treat "socialism" as having negative valence, and as a result, the arguably most socialist-leaning newspaper in our study, Socialist Alternative, gets misclassified as right-leaning because of its frequent use of "socialism" with positive connotations. For example, for the Venezuela topic, Socialist Project uses the term "socialist" as follows: "Notably, Chavismo is a consciously socialist-feminist practice throughout all of Venezuela. Many communities that before were denied their dignity, have collectively altered their country based on principles of social equity and egalitarianism." [28]. In contrast, Red State uses "socialist" in a nutpicking way in this example: "conservative pundits and politicians have painted a devastatingly accurate picture of what happens when a country embraces socialism. Pointing out the dire situation facing the people of Venezuela provided the public with a concrete example of how socialist policies destroy nations." [29].
[FIG. 5 caption: BLM bias (the x-axis in FIG. 1) and abortion bias (the x-axis in FIG. 3) are seen to be highly correlated. Each dot corresponds to a newspaper (see legend in FIG. 4).]

Correlated left-right controversies
Our algorithm auto-discovers bias axes for all the topics we study and, unsurprisingly, many of them reflect a traditional left-right split similar to that revealed by our BLM analysis. For example, FIG. 3 shows that the first principal component (the x-axis) for articles on the abortion topic effectively separates newspapers along the left-right axis, exploiting relative preferences for terms such as "fetus"/"unborn babies", "termination"/"infanticide" and "anti choice"/"pro life". In addition to valent synonyms, we see that our algorithm detects additional bias through differential use of certain phrases lacking obvious counterparts, e.g., "reproductive rights" versus "religious liberty". FIG. 5 shows that the correlation between BLM bias and abortion bias is very high (correlation coefficient r ≈ 0.90). Since these two topics are arguably rather unrelated from a purely intellectual standpoint, their high correlation reflects the well-known bundling of issues in the political system.
A simple way to auto-identify topics with common bias is to rank topic pairs by their correlation coefficients. In this spirit, Table 2 shows the ten topics whose bias is most strongly correlated with BLM bias, together with the corresponding Pearson correlation coefficient r and its standard error Δr ≡ √[(1 − r²)/(n − 2)], where n is the number of newspapers included in its calculation. The results for three of the most timely top-ranked issues (tech censorship, guns, and US immigration) are shown in FIG. 6, again revealing a left-right spectrum of media bias for these topics.
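This ranking step can be sketched as follows, using Δr = √[(1 − r²)/(n − 2)]; the toy arrays below stand in for per-newspaper bias coordinates of two topics:

```python
# Pearson correlation between two topics' bias axes, with standard error.
import numpy as np

def bias_correlation(x, y):
    """x, y: per-newspaper bias coordinates for two topics."""
    n = len(x)
    r = float(np.corrcoef(x, y)[0, 1])
    dr = float(np.sqrt(max(1.0 - r ** 2, 0.0) / (n - 2)))
    return r, dr

# Toy data: two nearly identical left-right orderings of five newspapers.
blm = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
abortion = 0.9 * blm + np.array([0.1, -0.1, 0.0, 0.1, -0.1])
r, dr = bias_correlation(blm, abortion)
```

Ranking all topic pairs by `r` then reproduces the spirit of Table 2.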

Establishment bias
The figures above show that although the left-right media axis explains some of the variation among newspapers, it does not explain everything. Figure 7 shows a striking example of this for the military spending topic. As opposed to the previous bias plots, the dots are no longer clearly separated by color (corresponding to left-right stance). Indeed, left-leaning CNN (18) is seen right next to right-leaning National Review (53) and Fox News (36). Instead, the dots are seen to be vertically separated by size, corresponding to establishment stance. In other words, we have auto-identified a second bias dimension, here ranging vertically from establishment-critical (bottom) to pro-establishment (top) bias. Just like left-right bias, establishment bias manifests as differential phrase use. For example, as seen in Table 3, the phrase "military industrial complex" is used more frequently in newspapers classified as establishment-critical, such as Canary and American Conservative, but is rarely, if ever, used by mainstream, pro-establishment outlets such as Fox or CNN, which instead prefer phrases such as "defense industry".
We find that the military spending topic, much like the BLM topic, is highly correlated with other topics included in the study. This is clearly seen in FIG. 8, which plots the pro-/critical-establishment generalized principal components of the military spending topic and the Venezuela topic. A closer look at the Venezuela topic in FIG. 9 reveals an establishment bias similar to that seen in FIG. 7. We see that, while establishment-critical papers frequently use phrases such as "imperialism" and "regime change", pro-establishment newspapers prefer phrases such as "socialism" and "interim president". This figure reveals that the Venezuela topic engenders both establishment bias (the vertical axis) and a smaller but non-negligible left-right bias (the horizontal axis).

Establishment-Critical
To identify additional topics with establishment bias, we again compute correlation coefficients between generalized principal components, this time with the vertical component for military spending. Table 4 shows the ten most correlated topics, revealing a list quite different from the left-right-biased topics of Table 2. Nuclear weapons, Yemen, and police, three timely examples from this list, are shown in Figure 10. Here the left panels illustrate how usage of certain phrases reflects establishment-bias separation. In articles about nuclear weapons, the terms "nuclear arms race" and "nuclear war" are seen to appear preferentially in establishment-critical newspapers, while "nuclear test" and "nuclear deterrent" are preferred by pro-establishment papers. In articles about Yemen, the phrase "war on Yemen", suggesting a clear cause, is seen to signal an establishment-critical stance, while "humanitarian crisis", not implying a cause, signals a pro-establishment stance. For articles about police, grammatical choices in the coverage of police shootings are seen to be highly predictive of establishment stance: establishment-critical papers use the passive voice (e.g., "was shot dead") less than pro-establishment papers, and when they do, they prefer the verb "killed" over "shot". Such news bias through use of the passive voice was explored in detail in [30]. FIG. 11 illustrates such use of the passive voice and valent synonyms across establishment topics.

Machine learning the media bias landscape
Throughout this paper, we have aspired to measure media bias in a purely data-driven way, so that the data can speak for itself without human interpretation. In this spirit, we will now eliminate the manual elements from our above bias-landscape exploration (our selection of the two rather uncorrelated topics BLM and military spending and of the topics most correlated with them). Our starting point is the 56 × 56 correlation matrix R for the generalized principal components of all our analyzed topics, shown in FIG. 12. Notation such as "BLM 1" and "BLM 2" reflects the fact that we have two generalized principal components corresponding to each topic (the two axes of the right panel of FIG. 1, say). Our core idea is to use the standard technique of spectral clustering [31] to identify which topics exhibit similar bias, using their bias correlation from FIG. 12 as a measure of similarity. We start by performing an eigendecomposition of the correlation matrix R, where λ_i are the eigenvalues and the columns of the matrix E are the eigenvectors. FIG. 13 illustrates the first two eigencomponents, with the point corresponding to the k-th topic plotted at coordinates (E_1k, E_2k). To reduce clutter, we show the ten components with the largest |E_1k| and the ten with the largest |E_2k|, retaining only the largest component for each topic. For better intuition, the figure has been rotated by 45°, since if two internally correlated clusters are also correlated with each other, this will tend to line up the clusters with the coordinate axes. If needed, we also flip the sign of any axis whose data is mainly on the negative side and flip the 1/2 numbering to reflect cluster membership as described in the Supporting Information.
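The spectral-clustering step above can be sketched on a toy correlation matrix (a 4 × 4 stand-in for the paper's 56 × 56 matrix R; eigenvector signs are arbitrary, so only the resulting partition of topics is meaningful):

```python
# Eigendecompose the topic-bias correlation matrix, keep the top two
# eigenvectors, rotate by 45 degrees, and assign each topic to the axis
# it lies closest to.
import numpy as np

def cluster_topics(R):
    vals, vecs = np.linalg.eigh(R)              # eigenvalues in ascending order
    E = vecs[:, ::-1][:, :2].T                  # rows 0,1: top-two eigenvectors
    c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
    E = np.array([[c, -s], [s, c]]) @ E         # the 45-degree rotation
    # topic k joins cluster 1 if |E_1k| > |E_2k|, else cluster 2
    return np.where(np.abs(E[0]) > np.abs(E[1]), 1, 2), E

# Toy R: topics {0,1} and {2,3} form two internally correlated clusters
# that are also weakly correlated with each other.
R = np.array([[1.0, 0.9, 0.3, 0.3],
              [0.9, 1.0, 0.3, 0.3],
              [0.3, 0.3, 1.0, 0.8],
              [0.3, 0.3, 0.8, 1.0]])
labels, E = cluster_topics(R)
```

Because the two blocks are cross-correlated, the top eigenvector mixes both clusters and the 45° rotation is what lines each cluster up with its own axis.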
We can think of FIG. 13 as mapping all topics into a two-dimensional media bias landscape. The figure reveals a clear separation of the topics into two clusters based on their media-bias characteristics. A closer look at the membership of these two clusters suggests interpreting the x-axis as left-right bias and the y-axis as establishment bias. We therefore auto-assign each topic to one of the two clusters based on whether it falls closer to the x-axis or the y-axis (based on whether |E_1k| > |E_2k| or not, in our case corresponding to which side of the dashed diagonal line the topic falls). We then sort the topics on a spectrum from most left-right-biased to most establishment-biased: the left-right topics are sorted by decreasing x-coordinate and followed by the establishment topics sorted by increasing y-coordinate. When ordered like this, the two topic clusters become visually evident even in the correlation matrix R upon which our clustering analysis was based: FIG. 12 shows two clearly visible blocks of highly correlated topics, both the left-right block in the upper left corner and the establishment block in the lower right.

Above, the newspapers were mapped onto a separate bias plane for each of many different topics. We normalize each such media plot, e.g., the right panel of FIG. 1, such that the dots have zero mean and unit variance both horizontally and vertically. We then unify all these plots into a single media-bias-landscape plot in FIG. 14 by taking weighted averages of these many topic plots, weighting both by topic relevance and by inverse variance. Specifically, for each topic bias, we assign two relevance weights corresponding to the absolute value of its x- and y-coordinates in FIG. 13, reflecting its relevance to left-right and establishment bias, respectively. These weights can be found in the Supporting Information. For example, to compute the x-coordinate of a newspaper in FIG. 14, we simply take a weighted average of its generalized principal components for all topics, weighted both by the left-right relevance of that topic and by the inverse square of the error bar.
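This weighted averaging can be sketched as follows (all names and arrays are illustrative stand-ins):

```python
# A newspaper's landscape coordinate is an average of its per-topic
# components, weighted by topic relevance and by inverse variance.
import numpy as np

def landscape_coordinate(components, errors, relevance):
    """components: per-topic generalized principal components of one
    newspaper (already normalized per topic); errors: their 1-sigma error
    bars; relevance: the topics' relevance weights from the clustering."""
    w = relevance / errors ** 2           # relevance times inverse variance
    return float(np.sum(w * components) / np.sum(w))

# A topic with zero relevance contributes nothing to the coordinate:
x = landscape_coordinate(np.array([1.0, -1.0]),
                         np.array([1.0, 1.0]),
                         np.array([1.0, 0.0]))   # -> 1.0
```

Using the left-right relevance weights gives the x-coordinate in FIG. 14; the establishment weights give the y-coordinate.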
FIG. 14 can be viewed as the capstone plot for this paper, unifying information from all our topic-specific bias analyses. It reveals fairly good agreement with the external human-judgement-based bias classifications reflected by the colors and sizes of the dots: it shows a separation between bluish dots on the left and reddish ones on the right, as well as a separation between larger (pro-establishment) dots toward the top and smaller (establishment-critical) ones toward the bottom.
Closer inspection of FIG. 14 also reveals some notable exceptions that deserve further scrutiny. As mentioned, nutpicking poses a challenge for our method. An obvious example is Jacobin Magazine, a self-proclaimed socialist newspaper [32] that FIG. 14 classifies as right-leaning because of its heavy approving use of the phrase "socialism", which is mainly used pejoratively by right-leaning media. Nutpicking may also help explain why FIG. 14 shows some more extreme newspapers closer to the center than more moderate ones (according to the human-judgement-based classification from AllSides [21]). For example, AllSides rates Breitbart as further right than Fox, yet Breitbart uses the phrase "defund the police" more often than Fox, presumably to criticize or mock it, and thus gets pulled to the left in FIG. 14 toward left-leaning newspapers that use the phrase approvingly. One might expect nutpicking to be more common on the extremes of the political spectrum, in which case our method would push these newspapers toward the center. FIG. 14 also shows examples where our method might be outperforming the human-judgement-based classification from AllSides [21]. For example, [21] labels Anti War as "right" while our method classifies it as left, in better agreement with its online mission statement.
Our analysis also offers more nuance than a single left-right bias score: for example, our preceding plots show that American Conservative is clearly right-leaning on social issues such as abortion and immigration, while clearly left on issues involving foreign intervention, averaging out to a rather neutral placement in FIG. 14.

Conclusions
We have presented an automated method for measuring media bias. It first auto-discovers the phrases whose frequencies contain the most information about which newspapers published them, and then uses the observed frequencies of these phrases to map newspapers into a two-dimensional media bias landscape. We have analyzed roughly a million articles from about a hundred newspapers for bias in dozens of news topics, producing a data-driven bias classification in good agreement with prior classifications based on human judgement. One dimension can be interpreted as traditional left-right bias, the other as establishment bias.
Our method leaves much room for improvement, and we will now mention three examples. First, we saw how the popular practice of nutpicking can cause problems for our analysis, because the same phrase can be used with positive or negative connotations depending on context. This could be mitigated by excluding such bi-valent phrases from the analysis, either manually or with better machine learning. Second, topic bias can cause challenges for our method, by separating newspapers by their topic focus (say, business versus sports) in a way that obscures political bias. As described above, we attempted to minimize this problem by splitting overly broad topics into narrower ones, but this process should be improved and ideally automated.
Third, although our method is almost fully automated, a manual screening step remains whereby auto-selected phrases are discarded if they lack sufficient relevance, uniqueness or specificity. Although this involves only the selection of phrases (machine-learning features), not their interpretation, it is worth exploring whether this screening can be further (or completely) automated, ideally making our method entirely free of manual steps and the associated potential for human error.
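One way such automated screening might look is sketched below. The rules and thresholds are hypothetical (the paper does not specify them): phrases used by too few newspapers or too rarely overall are dropped, loosely mirroring the "relevance" and "uniqueness" criteria of the manual step.

```python
import numpy as np

def screen_phrases(counts, min_papers=2, min_total=5.0):
    """Hypothetical automated phrase screening.

    counts: (n_papers, n_phrases) matrix of average occurrences per article.
    Returns a boolean mask of phrases to keep. Thresholds are invented.
    """
    used_by = (counts > 0).sum(axis=0)   # how many newspapers use each phrase
    total = counts.sum(axis=0)           # overall usage across newspapers
    # Drop single-newspaper quirks (low uniqueness value) and rare phrases
    # (low relevance); both rules are simplifications for illustration.
    return (used_by >= min_papers) & (total >= min_total)

counts = np.array([
    [3.0, 0.0, 2.0],
    [2.5, 0.0, 1.5],
    [0.0, 0.1, 2.0],
])
print(screen_phrases(counts))  # middle phrase fails both rules
```

A learned screen (e.g. a classifier trained on the manually kept/discarded phrases) could replace these hand-set thresholds, which is the direction the text suggests.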
As datasets and analysis methods continue to improve, the quality of automated news bias classification should keep getting better, enabling more level-headed scientific discussion of this important phenomenon. We therefore hope that automated news bias detection can help make discussions of media bias less politicized than the media being discussed.

When we performed the generalized singular value decompositions for each topic, we had the freedom to choose both the sign of each plotted component and whether to number it 1 or 2. To eliminate these ambiguities and standardize the components, we automatically flipped signs such that all topics k in Cluster 1 have E_1k > 0 and all topics in Cluster 2 have E_2k > 0, and numbered the two components as follows. Each component has a relevance weight, as shown in Table 5: the one with the larger relevance weight is numbered "1" if it is a left-right component and "2" otherwise; the second component gets the opposite number.
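The sign-flip and numbering convention described above can be sketched as follows. This is a minimal illustration, not the authors' code: E holds the two component loadings across topics (rows = components, columns = topics), the cluster index lists and relevance weights are invented, and the mean-loading sign test is a simplification of the rule in the text.

```python
import numpy as np

def standardize_components(E, cluster1, cluster2, weights,
                           heavier_is_left_right):
    """Standardize signs and numbering of two per-topic components."""
    E = E.copy()
    # Flip signs so Cluster-1 topics load positively on component 1
    # and Cluster-2 topics load positively on component 2.
    if E[0, cluster1].mean() < 0:
        E[0] *= -1
    if E[1, cluster2].mean() < 0:
        E[1] *= -1
    # Number the components: the one with the larger relevance weight
    # is "1" if it is the left-right component, "2" otherwise; the
    # other component gets the opposite number.
    heavier = int(np.argmax(weights))
    labels = [0, 0]
    labels[heavier] = 1 if heavier_is_left_right else 2
    labels[1 - heavier] = 3 - labels[heavier]
    return E, labels

E = np.array([[-0.6, -0.7, 0.2],
              [ 0.1,  0.2, -0.8]])
E_std, labels = standardize_components(
    E, cluster1=[0, 1], cluster2=[2], weights=[0.7, 0.3],
    heavier_is_left_right=True)
print(labels)  # [1, 2]
```

After standardization, both rows have been flipped so their cluster loadings are positive, and the heavier (left-right) component is labelled "1", matching the convention in the text.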

Fig 2. Valent synonyms reflecting left-right bias: each row shows phrases that can be used rather interchangeably, with horizontal positions reflecting where our automated algorithm placed them on the left-right bias axis.

Fig 13. Spectral clustering of topics by their media bias characteristics, as explained in the text. The bars represent 1 standard deviation jackknife error bars.

Fig 14. Media bias landscape: our method places newspapers in this two-dimensional media bias landscape based only on how frequently they use certain discriminative phrases, with no human input regarding what constitutes bias. The colors and sizes of the dots were predetermined by external assessments and thus in no way influenced by our data. The positions of the dots thus suggest that the two dimensions can be interpreted as the traditional left-right bias axis and establishment bias, respectively.

Fig 17. Public finance bias

Fig 20. Private finance bias

Table 1. BLM phrase bias: the average number of occurrences per article of certain phrases is seen to vary strongly between media sources, reflecting the underlying data and the phrase error bars emerging from the principal component analysis. Most of these phrases were excluded for occurring only in a single newspaper, or for stylistic reasons. When necessary, we examined the use of a phrase in context by reading a random sample of 10 articles in our database containing the phrase.

Table 2. BLM correlation coefficients: topics most correlated with the BLM topic

Table 4. Topics whose bias is most correlated with military spending bias

Table 5. Topic relevance weights for the left-right and establishment topic clusters