Positivity of the English Language

Over the last million years, human language has emerged and evolved as a fundamental instrument of social communication and semiotic representation. People use language in part to convey emotional information, leading to the central and contingent questions: (1) What is the emotional spectrum of natural language? and (2) Are natural languages neutrally, positively, or negatively biased? Here, we report that the human-perceived positivity of over 10,000 of the most frequently used English words exhibits a clear positive bias. More deeply, we characterize and quantify distributions of word positivity for four large and distinct corpora, demonstrating that their form is broadly invariant with respect to frequency of word use.


Introduction
While we regard ourselves as social animals, we have a history of actions running from selfless benevolence to extreme violence at all scales of society, and we remain scientifically and philosophically unsure as to what degree any individual or group is or should be cooperative and pro-social. Traditional economic theory of human behavior, for example, assumes that people are inherently and rationally selfish (a core attribute of homo economicus), with the emergence of global cooperation thus rendered a profound mystery [1,2]. Yet everyday experience and many findings of psychology, behavioral economics, and neuroscience indicate people favour seemingly irrational heuristics [3,4] over strict rationality as exemplified in loss aversion [5], confirmation bias [6], and altruistic punishment [7]. Religions and philosophies similarly run the gamut in prescribing the right way for individuals to behave, from the universal non-harming advocated by Jainism, Gandhi's call for non-violent collective resistance, and exhortations toward altruistic behavior in all major religions, to arguments for the necessity of a Monarch [8], the strongest forms of libertarianism, and the "rational self-interest" of Ayn Rand's Objectivism [9].
In taking the view that humans are in part storytellers (homo narrativus), we can look to language itself for quantifiable evidence of our social nature. How is the structure of the emotional content rendered in our stories, fact or fiction, and social interactions reflected in the collective, evolutionary construction of human language? Previous findings are mixed: suggestive evidence of a positive bias has been found in small samples of  the present work (the full data set is provided as Supplementary Information for [20]). Tabs. S1, S2, and S3 respectively give the top 50 words according to positivity, negativity, and standard deviation of happiness scores.

Results and Discussion
In Fig. 1, we show distributions of average word happiness h_avg for our four corpora. We first discuss the overall distributions, i.e., those corresponding to the most frequent 5000 words combined in each corpus (black curves), and then examine the robustness of their forms with respect to frequency range. The distributions as shown were formed using 35 equal-sized bins; the number of bins does not change the visual form of the distributions appreciably, and an odd number ensures that the neutral score of 5 is a bin center. We employed binning only for visual display, using the raw data for all statistical analysis.
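The binning choice can be illustrated with a minimal sketch. It assumes that scores lie on a 1-to-9 scale, consistent with a neutral score of 5; the function name `bin_scores` and its parameters are hypothetical and are not taken from the analysis code behind Fig. 1.

```python
def bin_scores(scores, n_bins=35, lo=1.0, hi=9.0):
    """Return (centers, counts) for equal-width bins spanning [lo, hi].

    Assumption: the 1-to-9 range is inferred from the neutral score of 5.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in scores:
        # Clamp the index so that a score equal to `hi` lands in the last bin.
        idx = min(int((s - lo) / width), n_bins - 1)
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    return centers, counts


centers, counts = bin_scores([1.0, 5.0, 5.1, 9.0])
# With an odd number of bins, the middle bin (index 17 of 0..34)
# is centered on the neutral score.
print(round(centers[17], 6))
```

With an even number of bins, the neutral score would instead fall on the edge between two bins, so near-neutral words would be split between them in the display.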
We see each distribution is unimodal and strongly positively skewed, with a clear abundance of positive words (h_avg > 5, yellow shade) over negative ones (h_avg < 5, gray shade). In order, the percentages of positive words are 72.00% (TW), 78.80% (GB), 78.38% (NYT), and 64.14% (ML). Equivalently, and as further supported by Fig. 1's upper inset plots of percentile location, we see the percentile corresponding to the neutral score of 5 is well below the median. The lower inset plots show how the numbers of positive and negative words increase as we cumulate moving away from the neutral score of 5; positive words are always more abundant, further illustrating the positive bias. The mode of average word happiness is either above neutral (TW, GB, and NYT) or located there (ML). Combining words across corpora, we also see the same overall positivity bias for parts of speech, e.g., nouns and verbs (not shown), in agreement with previous work [10].
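The two summary quantities used above, the percentage of positive words and the percentile at which the neutral score sits, can be sketched as follows. The function name and the toy scores are invented for illustration, and the percentile convention (share of words strictly below neutral) is an assumption.

```python
def positivity_summary(h_avg, neutral=5.0):
    """Percentages of positive and negative words, plus the percentile
    of the neutral score (share of words scoring strictly below it)."""
    n = len(h_avg)
    positive = sum(1 for h in h_avg if h > neutral)
    negative = sum(1 for h in h_avg if h < neutral)
    return {
        "pct_positive": 100.0 * positive / n,
        "pct_negative": 100.0 * negative / n,
        "neutral_percentile": 100.0 * negative / n,
    }


# Toy scores (invented), not values from any of the four corpora.
toy = [3.0, 4.5, 5.0, 6.0, 7.0, 7.5, 8.0, 6.5, 5.5, 2.0]
summary = positivity_summary(toy)
print(summary)
```

A neutral percentile below 50 is the same statement as the neutral score lying below the median.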
While these overall distributions do not match in detail across corpora, we do find they have an unexpected and striking internal consistency with respect to usage frequency. We provide a series of increasingly refined and nuanced observations regarding this emotional and linguistic phenomenon of scale invariance.
First, along with the overall distribution in each plot in Fig. 1, we also show distributions for subsets of 1000 words (symbols), ordered by frequency rank r (1-1000, 1001-2000, etc.). The similarity of these distributions suggests to the eye that common and rare words are similarly distributed in their perceived degree of positivity.
In Fig. S1, we provide statistical support via p-values from Kolmogorov-Smirnov tests for each pairing of distributions. Here, a p-value is the probability of observing a difference at least as large as the one measured if the two samples were drawn from the same underlying distribution, so that larger p-values indicate greater compatibility. The three corpora NYT, ML, and GB show the most internal agreement, and we see in all corpora that neighboring ranges of 1000 frequencies could likely match in distribution. Of the 40 pair-wise comparisons across the four corpora (10 pairings of the five frequency ranges in each corpus), 29 are statistically consistent with a match (p > 10^−2).
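As a rough illustration of such pairwise comparisons, the following standard-library sketch computes the two-sample Kolmogorov-Smirnov statistic and an asymptotic p-value for every pairing of frequency bands. It is not the code used for Fig. S1: the function names are hypothetical, and the p-value uses the large-sample Kolmogorov series rather than an exact calculation (a statistics package such as SciPy, via `scipy.stats.ks_2samp`, would be the usual choice in practice).

```python
import math
from itertools import combinations


def ks_statistic(a, b):
    """Largest gap between the two empirical cumulative distributions."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every observation equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value (an approximation, assumed adequate
    for samples of about 1000 words each)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the series is numerically equal to 1 in this regime
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def pairwise_ks(bands):
    """Map each pair of band indices (i, j) to its KS p-value."""
    return {
        (i, j): ks_pvalue(ks_statistic(bands[i], bands[j]),
                          len(bands[i]), len(bands[j]))
        for i, j in combinations(range(len(bands)), 2)
    }
```

Passing the five frequency bands of one corpus to `pairwise_ks` yields 10 p-values; repeating this for the four corpora gives the 40 comparisons counted above.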
In any study of texts based on word counts, the words themselves need to be presented in some form as commonsense checks on abstracted measurements. To provide further insight into how word happiness behaves as a function of usage frequency rank, we plot a subsample of words for the New York Times in Fig. 2. We present analogous examples for the other three corpora in Figs. S2, S3, and S4. In these plots, usage frequency rank increases from bottom to top with average happiness along the bottom axis. To make clear the connection with Fig. 1, we include the overall distribution for the top 5000 words at the top of each plot. Each word is centered at the location of its values of h_avg and usage frequency rank. The alternating colors are used for visual clarity only, as are the random angles. Underlying the words, the light gray points indicate the locations of all of the most frequently used 5000 words.
For the New York Times example, we find that the word pattern for average happiness and usage frequency rank is indeed reasonable. Down the right hand side of Fig. 2, we see highly positive words of decreasing usage frequency, such as 'love', 'win', 'comedy', 'celebration', and 'pleasure'. Similarly, down the left hand side, we find 'war', 'cancer', 'murder', 'terrorist', and 'rape'. Words of flat affect such as 'the', 'something', 'issued', and 'administrator' run down the middle of the happiness spectrum. For words with usage frequency rank near 2500, moving left to right in the plot, we find the sequence of increasingly positive words 'jail', 'arrest', 'inflation', 'fee', 'ends', 'advisor', 'taught', 'india', 'truly', and 'perfect'. Moving through the space represented in other directions gives further reassurance of the general trends we observe here. Note that the random sampling of words used to generate these figures samples the word distributions much more coarsely for neutral or medium levels of happiness.
While the four corpora share common words in their most frequent 5000, numerous words appear in only one corpus. For example, 'rainbows' and 'kissing' make the top 5000 only for Music Lyrics, and 'punishment' the same for the Google Books corpus (see Tabs. S1 and S2). Moreover, the usage frequency rankings change strongly, as a visual comparison of Fig. 2 with Figs. S2, S3, and S4 reveals. Further detailed comparisons can be made directly from the labMT 1.0 data set [20].
To bolster our observations quantitatively, we first compute a linear regression and a Spearman correlation coefficient ρ_s and associated p-value (two-sided) for h_avg as a function of usage frequency rank, r. We record the results for each corpus in Tab. II.
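The two statistics can be sketched with the standard library alone. The helper names below are hypothetical, ties are assigned their mean rank (an assumption about how ties are handled), and the associated p-value is omitted; a statistics package such as SciPy (`scipy.stats.spearmanr`) reports both the coefficient and its p-value.

```python
import math


def linear_fit(x, y):
    """Ordinary least-squares slope and intercept of y against x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)
```

Here `x` would hold the usage frequency ranks r = 1, ..., 5000 and `y` the corresponding h_avg values for one corpus.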
The slopes of the linear fits are all negative but extremely small, ranging from −3.04×10^−5 (GB) to −7.78×10^−5 (TW). All corpora also present a weak negative correlation, ranging from ρ_s = −0.013 (GB) to −0.103 (TW). The correlation for the Google Books corpus is not statistically significant (p = 0.35), while it is for the other three, and especially so for TW and ML (p = 2.3×10^−13 and
We next move to a more detailed quantitative view of the word happiness distribution as a function of word usage frequency. In Fig. 3, we show how deciles behave as a function of usage frequency rank. Using a sliding window containing 500 words, we compute deciles moving down the usage frequency rank axis. Using these 'jellyfish plots', we see that apart from the lowest decile (which is universally uneven), GB and NYT are very stable while a slight negative trend is perceptible for TW and ML. We can now with some confidence state that the measured, edited writing of the New York Times and the Google Books corpus possesses a remarkable scale invariance in emotion with respect to word usage frequency.
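A sliding-window decile computation of the kind behind the jellyfish plots might look like the sketch below. The window of 500 words follows the text; the step between successive windows and the quantile interpolation method are not specified in the text and are assumptions here, as is the function name.

```python
from statistics import quantiles


def sliding_deciles(h_by_rank, window=500, step=100):
    """Deciles of scores inside a window that slides down the rank axis.

    h_by_rank: scores ordered by usage frequency rank (rank 1 first).
    Returns a list of (first_rank_in_window, [d1, ..., d9]) pairs.
    Assumption: `step` and the "inclusive" method are illustrative choices.
    """
    out = []
    for start in range(0, len(h_by_rank) - window + 1, step):
        chunk = h_by_rank[start:start + window]
        out.append((start + 1, quantiles(chunk, n=10, method="inclusive")))
    return out


# Toy input: a steadily increasing sequence, so every decile drifts upward.
toy = [float(k) for k in range(1, 1001)]
for first_rank, deciles in sliding_deciles(toy, window=500, step=250):
    print(first_rank, deciles[4])  # deciles[4] is the window median
```

Plotting each of the nine decile series against the window position gives one curve per decile; flat curves correspond to a distribution that does not change with usage frequency rank.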
The emotional content of words on Twitter and in music lyrics, while still roughly similar across usage frequency ranks, shows a small bias towards common words being disproportionately positive in comparison with increasingly rare ones. The bias is sufficiently small as to be likely indiscernible by an individual familiar with these corpora; moreover, cognitive biases regarding the salience of information would presumably render such detection impossible [32].
We have thus far considered distributions of average happiness values for words. Each word's estimate comes from a distribution of assessment scores, and a useful, simple investigation can be carried out on the standard deviation of individual word happiness, h_σ.
A range of word and concept categories yielded high h_σ in our study, the top 50 of which are shown in Tab. S3. At the top of the list, we observe words that are or relate to profanities, alcohol and tobacco, religion, both capitalism and socialism, sex, marriage, fast foods, climate, and cultural phenomena such as the Beatles, the iPhone, and zombies. As a result of variation in the raters' preferences, perhaps due to inherent controversy or cultural and demographic variation, these terms all elicited diverse responses.
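For concreteness, a per-word mean and standard deviation, followed by a ranking on the latter, can be sketched as below. The rater scores are invented, the function names are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption.

```python
from statistics import mean, stdev


def word_stats(ratings):
    """Map each word to (h_avg, h_sigma) computed from its rater scores."""
    return {w: (mean(s), stdev(s)) for w, s in ratings.items()}


def most_divisive(ratings, top=50):
    """Words sorted by decreasing h_sigma, truncated to `top` entries."""
    stats = word_stats(ratings)
    return sorted(stats, key=lambda w: stats[w][1], reverse=True)[:top]


# Invented scores for illustration only.
toy_ratings = {
    "calm": [5, 5, 5, 5],
    "zombies": [1, 9, 1, 9],
    "seventh": [5, 5, 6, 4],
}
print(most_divisive(toy_ratings, top=2))
```

Two words can share the same h_avg (here 'calm' and 'zombies' both average 5) while differing sharply in h_σ, which is why the standard deviation carries information that the mean alone does not.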
We repeat our analyses of h_avg for h_σ by first considering a sample of words for the Google Books corpus. In Fig. 4, we show example words from the Google Books corpus as a function of word usage frequency rank and standard deviation (Figs. S6, S7, and S8 show the same for TW, NYT, and ML). The right hand side of Fig. 4 shows example words with high h_σ and increasing usage frequency rank including 'work', 'pay', 'summer', 'churches', 'mortality', and 'capitalism'. For low h_σ (the left hand side of Fig. 4), we see basic, neutral words such as 'these', 'types', 'inch', and 'seventh'. While this word diagram is primarily intended for qualitative purposes, we see that for h_σ, the overall trend for Google Books is a gradual increase as a function of usage frequency rank. In other words, relatively rarer words have higher standard deviations in comparison with relatively more common ones. This is confirmed visually in Fig. 5, where we present jellyfish plots showing deciles for all four corpora. The Music Lyrics corpus shows a similar increase in h_σ with usage frequency rank as GB, whereas the TW and NYT corpora exhibit no obvious linear variation. These observations are supported by the linear fits and Spearman correlation coefficients recorded in Tab. III, where we consider h_σ as a function of usage frequency rank. All linear approximations yield a very small positive growth, with both the TW and NYT corpora clearly smaller than the other two, particularly TW. The corresponding Spearman correlation coefficients indicate statistically significant monotonic growth in h_σ for GB, ML, and NYT, particularly the first two, and indicate no evidence of growth for TW.
All told, we find slight deviation from an exact scaling independence of h_avg and h_σ in terms of usage frequency rank, but it is highly constrained and corpus specific. In particular, the corpora that show a slight negative correlation between h_avg and usage frequency rank, TW and ML, do not match those showing a positive correlation between h_σ and usage frequency rank, GB and ML.

Concluding remarks
Our findings are that positive words strongly outnumber negative words overall, and that there is a very limited, corpus-specific tendency for high frequency words to be more positive than low frequency words. These two aspects of positivity and usage frequency can only be separated with the kind of data we study here. Previous claims that positive words are used more frequently [10,11,12] suffered from insufficient, non-representative data. For example, Rozin et al. recently compared usage frequencies for just seven adjective pairs of positive-negative opposites [11]. Augustine et al. showed that average happiness and usage frequencies for 1034 words [14] were more positively correlated than we observe here [10]; however, since these words were chosen for their meaningful nature [14,33,34] rather than by their rate of occurrence, their findings are naturally tempered. A positivity bias is also not inconsistent with many observations that negative emotions in isolation are more potent and diverse than positive words [32].
In sum, our findings for these diverse English language corpora suggest that a positivity bias is universal, that the emotional spectrum of language is very close to self-similar with respect to frequency, and that in our stories and writings we tend toward prosocial communication. Our work calls for similar studies of other languages and dialects, examinations of corpora factoring in popularity (e.g., of books or articles), as well as investigations of other more specific emotional dimensions. Related work would explore changes in positivity bias over time, and correlations with quantifiable aspects of societal organization and function such as wealth, cultural norms, and political structures. Analyses of the emotional content of phrases and sentences in large-scale texts would also be a natural next, more complicated stage of research. Promisingly, we have shown elsewhere for Twitter that the average happiness of individual words correlates well with that of surrounding words in status updates [20].
The authors are indebted to conversations with B. Tivnan, N. Johnson, and A. Reece. The authors are grateful for the computational resources provided by the Vermont Advanced Computing Center which is supported by NASA (NNX 08A096G). KDH was supported by VT-NASA EPSCoR. PSD was supported by NSF CAREER Award # 0846668.