In linguistic studies, the academic level of the vocabulary in a text can be described in terms of statistical physics by using a “temperature” concept related to the text's word-frequency distribution. We propose a “comparative thermo-linguistic” technique to analyze the vocabulary of a text to determine its academic level and its target readership in any given language. We apply this technique to a large number of books by several authors and examine how the vocabulary of a text changes when it is translated from one language to another. Unlike the uniform results produced using the Zipf law, using our “word energy” distribution technique we find variations in the power-law behavior. We also examine some common features that span across languages and identify some intriguing questions concerning how to determine when a text is suitable for its intended readership.
Citation: Rêgo HHA, Braunstein LA, D′Agostino G, Stanley HE, Miyazima S (2014) When a Text Is Translated Does the Complexity of Its Vocabulary Change? Translations and Target Readerships. PLoS ONE 9(10): e110213. https://doi.org/10.1371/journal.pone.0110213
Editor: Matjaz Perc, University of Maribor, Slovenia
Received: June 3, 2014; Accepted: September 18, 2014; Published: October 29, 2014
This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. Our work considered books in their regular commercial versions. All versions of the books in their original formats and instructions for obtaining them are available at http://books.google.com and (commercially) at such sites as http://www.amazon.com.
Funding: The authors thank the Boston University Center for Polymer Studies and Department of Physics where this research was developed and carried out. The Boston University work was supported by ONR Grant N00014-14-1-0738, DTRA Grant HDTRA1-14-1-0017, and NSF Grant CMMI 1125290. This work was partially supported by the CAPES Foundation and the Ministry of Education of Brazil, Brasília/DF (Proc. No. BEX 18007/12-0). LAB also acknowledges UNMdP and grant PICT-2013-0429 for financial support. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Scaling laws have been an important topic in the physics community across a wide range of fields. The dynamics of several complex systems in biology, economics, and natural phenomena have been described with relative success using scaling laws. Scaling phenomena also emerge in the analysis of data associated with human behavior, especially those containing a statistically distributed component, such as the number of links in the World Wide Web or the size of cities. In current research, the analysis of scaling in data continues to produce new and interesting findings in a variety of scientific fields.
In linguistics, Zipf described another typical example of a power law in data on human behavior. He proposed that the distribution of the effort of both speakers and listeners as they attempt to optimize their communication produces a distinctive distribution, the now well-known Zipf law. Recent research has analyzed how the Zipf scaling of the word frequency distribution changes over the centuries, and how this change is affected by both social and natural phenomena. As is the case for many other scaling laws, the Zipf law can also be used in the statistical analysis of huge data sets from other systems, e.g., the distribution of wealth and income in a given population or the distribution of family names.
By examining the word frequency in a given corpus of a natural language, Zipf found that a word's frequency $f$ is inversely proportional to its rank $r$ in the frequency table, i.e., $f(r) = C/r$, where $C$ is a constant for the corpus being analyzed. A log-log plot of the frequency distribution for the first 1000–2000 words in the Brown corpus of the English language, for example, yields a straight line with slope approximately $-1$. More recently, Petersen et al. found that the Zipf scaling law for word distribution reveals a significant difference between high-frequency words and low-frequency words, and that this behavior seems to be independent of the language considered, i.e., in each regime all languages show the same slope.
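The rank-frequency relation described above can be checked numerically. The following is a minimal sketch, not the procedure used in the studies cited: the function names `rank_frequency` and `loglog_slope`, the cutoff `max_rank`, and the synthetic toy corpus are all assumptions introduced for illustration. It ranks words by frequency and fits a least-squares line to log frequency versus log rank.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return relative word frequencies sorted from most to least frequent."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [c / total for _, c in counts.most_common()]

def loglog_slope(freqs, max_rank=1000):
    """Least-squares slope of log(frequency) versus log(rank) for the top ranks."""
    pts = [(math.log(r), math.log(f))
           for r, f in enumerate(freqs[:max_rank], start=1)]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx

# Synthetic toy corpus (an assumption, not real data): the word of rank r
# occurs about 1000 / r times, so the fitted slope should be close to -1.
tokens = [f"w{r}" for r in range(1, 51) for _ in range(round(1000 / r))]
slope = loglog_slope(rank_frequency(tokens))
print(f"fitted log-log slope: {slope:.3f}")
```

On a real corpus the fitted slope depends on the range of ranks included, which is why the cutoff is exposed as a parameter.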
In another recent publication, by assuming that the Zipf law is also controlled by the Maxwell-Boltzmann (M-B) distribution associated with the physical world, Miyazima et al. were able to determine a book's linguistic “temperature” value, and used this concept to compare the “temperatures” of educational textbooks in the English language. They found that the higher the vocabulary grade level of a textbook, the lower its temperature. They found, for example, that the temperature of English textbooks in the US educational system decreases with increasing grade level from 1.48 K to 0.87 K when the 1.00 K temperature of the American National Corpus (ANC) is used as a standard. In the same analysis they found that the temperature of Einstein's The Theory of Relativity was approximately 0.65 K.
If the temperature measurement of a text in a textbook allows us to determine its academic level from its vocabulary, the next step is to determine whether that temperature value can serve as a measurement of the vocabulary complexity of books in general. We propose a technique based on the temperature concept that allows us to analyze texts in their various translations and determine how vocabulary features change across languages. We examine a group of popular books in six different languages and find some intriguing patterns in their translated versions. By improving our comparative analysis, we are able to measure a text's suitability for its intended readership and thus to determine which vocabulary standards better fit a particular text.
Word Energy and Measurement of Temperature
Through the use of some basic concepts, we can define the key quantities. In thermodynamics, the probability function for the energy states in a substance follows the Maxwell-Boltzmann distribution. In general,
$$P(E) \propto e^{-\beta E}, \tag{1}$$
where $\beta = 1/(k_B T)$, $k_B$ is the Boltzmann constant ($1.38 \times 10^{-23}$ J/K), and $T$ the absolute temperature. Here, as a convenience, we consider $k_B = 1$ irrespective of unit.
We assume that each word corresponds to an energy value $E$ in the M-B distribution in Eq. (1). Although we can only calculate $E/T$ for each word and not $E$ itself, if we assume a 1 K temperature for the corpus considered (e.g., the Brown corpus), we can determine the specific energy for each word.
When we count each word in the vocabulary of an English text, e.g., a journal, a novel, or a school textbook, we expect to find a word distribution that deviates from the distribution of the vocabulary in the Brown English corpus. We use this deviation to determine the temperature of the text in its English version. Fitting our “word-energy” frequency distribution, $P$ versus $E$, to the Brown corpus, we find a straight line with slope $-1$ in a semi-log plot, reflecting the standard M-B distribution. Fitting the same “word-energy” distribution for any other text on the same scale, we find a slope slightly higher or lower than the standard. Since this slope represents the term $-1/T$ in Eq. (1), we can easily calculate the corresponding temperature for this particular text. In other words, we fit the distribution of the text to the Maxwell-Boltzmann form with the temperature as the free parameter, and the fitted value is the temperature of the text.
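The measurement described above can be sketched in a few lines. This is an illustrative reading of the text, not the authors' implementation: it assumes $k_B = 1$ and a corpus temperature of 1, so that the energy of a word is $E = -\ln p_{\text{corpus}}$, and it takes the temperature of a text as $-1/\text{slope}$ of a least-squares fit of $\ln P_{\text{text}}$ versus $E$. The function names, the toy counts, and the default cutoff `max_energy=7.0` (suggested by the caption of Figure 1) are assumptions.

```python
import math
from collections import Counter

def word_energies(corpus_counts):
    """Energy of each corpus word, assuming k_B = 1 and T = 1: E = -ln p."""
    total = sum(corpus_counts.values())
    return {w: -math.log(c / total) for w, c in corpus_counts.items()}

def text_temperature(text_tokens, energies, max_energy=7.0):
    """Fit ln P_text versus E by least squares and return T = -1 / slope."""
    counts = Counter(t for t in text_tokens if t in energies)
    total = sum(counts.values())
    pts = [(energies[w], math.log(c / total))
           for w, c in counts.items() if energies[w] < max_energy]
    n = len(pts)
    mean_e = sum(e for e, _ in pts) / n
    mean_p = sum(p for _, p in pts) / n
    see = sum((e - mean_e) ** 2 for e, _ in pts)
    sep = sum((e - mean_e) * (p - mean_p) for e, p in pts)
    return -1.0 / (sep / see)

# Toy reference corpus (hypothetical counts, not taken from the Brown corpus).
corpus_counts = {"the": 1024, "of": 256, "code": 64, "symbol": 16, "cryptex": 4}
energies = word_energies(corpus_counts)

# A text with exactly the corpus distribution, and one whose counts are the
# square roots of the corpus counts, i.e. a flatter distribution.
same_text = [w for w, c in corpus_counts.items() for _ in range(c)]
flatter_text = (["the"] * 32 + ["of"] * 16 + ["code"] * 8
                + ["symbol"] * 4 + ["cryptex"] * 2)

t_same = text_temperature(same_text, energies)
t_flatter = text_temperature(flatter_text, energies)
print(f"T(same) = {t_same:.3f}, T(flatter) = {t_flatter:.3f}")
```

In this toy setting the text that reproduces the corpus distribution sits at the reference temperature, and the flatter text comes out at twice that value, since its log-probabilities fall off half as fast with energy.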
Figure 1(a) shows the probability distribution $P$ of the word energies for the vocabulary of the book The da Vinci Code by Dan Brown in English where, e.g., the probabilities of “the” and “of” in the book are plotted against the word energies of “the” and “of.” This plot also presents the comparative standard distribution, in this case the energies associated with the words in the Brown corpus. Figure 1(b) shows that it is easier to plot $\ln P$ against $E$ and fit it using a straight line. Note that the “word-energy” distribution for the Brown corpus has the expected slope $-1$, but that the slope for the book differs from this value, which corresponds to a temperature different from that of the corpus. This temperature varies greatly when other books and their translations are considered.
(a) Plot of the probability distribution versus the “energy” associated with a given word in the vocabulary of the book The da Vinci Code by Dan Brown in its English version, compared with the standard curve for the Brown corpus; and (b) the corresponding semi-log plot. This book contains 99,673 distinct words, labeled “items” in the plot, which shows only 5,529 different words. To calculate the fit (green line), we considered only the points in red, for which we chose the maximum energy to be lower than 7. Increasing the number of points up to 1,000 (the interval in which the Zipf law is still valid) does not change the result significantly.
The Comparative Thermo-Linguistics Technique
The main component of our technique, “comparative thermo-linguistic analysis,” assumes that every readership (e.g., a geographic community or a group of people with common interests) has its own vocabulary. For example, the way in which a newspaper reports an event such as a soccer game is strongly influenced by the frame of reference of its reading public. This goes beyond simply hometown papers supporting the home team. The reading level and interests of those reading the sports page in an up-scale broadsheet will differ from those reading the same in a tabloid, for example.
Our comparative technique for text analysis is as follows:
- Define the target readership.
- Determine the standard vocabulary for the target readership, i.e., locate a literary “corpus” that adequately represents its vocabulary. Miyazima et al. considered the corpus of the entire English language as a general standard for the analysis of English textbooks. Their choice was useful, but only in a limited way.
- Calculate the corresponding “energy” for each word in the corpus in order to determine the standard distribution of word energy for the target readership.
- Use this energy distribution to determine the “relative temperature” of each text to be examined.
- Compare the relative temperature of the texts examined with the standard vocabulary exhibited by the literary corpus being used as a reference.
Similar to what we have found for grade levels, we expect the relative temperature of each text to be closely related to the reading effort required of someone in the target readership. When the relative temperature of a text is higher (lower) than that of the standard corpus, the complexity of its vocabulary will be lower (higher) than that of the standard. If the temperatures are approximately the same, the text being examined is deemed highly appropriate for the target readership [see Fig. 1(b)].
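The interpretation rule above can be written as a small helper. This is only a sketch: the function name `vocabulary_verdict` and the tolerance of 0.1 around the standard temperature are assumptions, since the text does not specify how close to 1 a temperature must be to count as "approximately the same." The example temperatures are the values quoted earlier for English textbooks and for The Theory of Relativity.

```python
def vocabulary_verdict(relative_temperature, tolerance=0.1):
    """Classify a text against a standard corpus whose temperature is 1.

    The tolerance is a hypothetical threshold chosen for illustration.
    """
    if relative_temperature > 1.0 + tolerance:
        return "simpler than standard"
    if relative_temperature < 1.0 - tolerance:
        return "more complex than standard"
    return "matches standard"

examples = [
    ("lowest-grade textbook", 1.48),
    ("highest-grade textbook", 0.87),
    ("The Theory of Relativity", 0.65),
]
for label, temperature in examples:
    print(f"{label}: T = {temperature} -> {vocabulary_verdict(temperature)}")
```

A fixed threshold is the simplest possible choice; in practice the acceptable band would presumably depend on the uncertainty of the fitted slope.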
Results and Discussion
Books and their translations
We next examine how the vocabulary of a text changes when it is translated into another language. To minimize bias, we consider 30 different books and their respective translations (versions) in six different languages. The books include a variety of different authors, release dates, and original languages.
Figure 2(a) shows a log-log plot, similar to that of Petersen et al., that compares the distribution of the probabilities of occurrence of the 1024 most frequent words in the “Project Gutenberg” corpora of the English, French, German, Portuguese, Spanish, and Italian languages. Although all of the curves are approximately identical, the rank of a given word (and its corresponding translation) changes when other languages are taken into consideration.
Language comparison: (a) a log-log plot exhibiting the Zipf law (all curves have similar slopes) in the probability distribution of occurrence for the 1024 most frequent words in each corpus according to “Project Gutenberg”; and (b) a semi-log plot of the probability distribution of the word energies in the book The da Vinci Code by Dan Brown, exhibiting different slopes and therefore different temperatures. (Note that the curves in both plots are shifted along the y axis for better visualization.)
Using our comparative thermo-linguistic analysis we find that the rank position of a word usually differs between languages. Although the Zipf distribution does not change when different languages are considered, when a text is translated the energy distribution does change (see Fig. 2b).
Figure 2(b) shows a plot of the energy distribution of words for several translated versions of The da Vinci Code and their respective temperatures, each calculated from the slope of the corresponding fit. To allow a comparison between languages, we use $T = 1$ as the “standard temperature” for each corpus.
We repeat this same procedure for 30 books and their translations into six languages (see Fig. 3). Table 1 shows the numerical results generated by this new technique of “comparative thermo-linguistic analysis.” Figure 3 shows the average temperature, with basic books to the left, medium-level books in the center, and advanced books to the right. Within each of the three regions the arrangement is random.
Plot of the characteristic temperature as a function of language for several books.
For these results we used frequency lists of up to 10,000 words drawn from a variety of sources and without specific requirements, e.g., how the list was assembled (see Table 2). Figure 3 uses the Corpus Brasileiro - PUC/SP as the Portuguese language standard, and the Brown Corpus as the English language standard.
Table 1 shows that, for a given book, the temperature of its Portuguese version is almost always lower than the temperature of the other versions. Exceptions to this include The Bible, The New Testament, and Pinocchio [see Fig. 4(a)]. Similar behavior can be observed in the temperature of the German versions of these same books [see Fig. 4(b)]. The English and Spanish language versions, on the other hand, are consistently higher [see Figs. 4(c) and 4(d)], while the French and Italian versions [see Figs. 4(e) and 4(f)] exhibit intermediate temperatures. This result seems to be unaffected by the original language of the book.
What makes the difference?
Figure 2(b) shows results that imply that the temperature of a book always changes when it is translated. To investigate this we consider books written by bilingual authors who do their own translations, assuming that the vocabulary preference and literary style of an author will remain constant across translations.
Table 3 shows the temperature values for 16 different books, each written and translated into two languages (A and B) by a single bilingual author. Note that in each case the English version of a book tends to have a higher temperature value than the same book in any other language (with only one exception). This result is consistent with the results we obtained when we analyzed the books listed in Table 1, i.e., the translation process itself does not significantly affect the change in the complexity of the vocabulary when a book is translated and does not cause the change in temperature.
Examining again the reference corpus of each language, we calculate the temperature of all the books shown in Fig. 3, choosing each corpus irrespective of how it was compiled or assembled. Because the vocabulary of a language is strongly influenced by social and cultural forces, a text written for a target readership will be strongly influenced by that readership. Thus changes in temperature will occur when there is a change in the standard vocabulary that we use when we do our comparative analysis.
Figure 5(a) shows the same set of books shown in Fig. 3 in their English versions. Each curve corresponds to a different corpus. Figure 5(b) shows the Portuguese versions of the same books. We compiled our own corpus using all the words contained in the books in English and Portuguese and used this as our standard corpus in the analysis in both languages–see the curves “All Books EN” and “All Books PT” in Figs. 5(a) and 5(b), respectively.
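Compiling a standard corpus from the books themselves amounts to pooling their word counts. The sketch below shows one way this could be done; the tokenizer (lowercasing and keeping runs of Unicode letters), the function names, and the two sample sentences are assumptions made for illustration and do not reproduce the preprocessing used in the study.

```python
import re
from collections import Counter

# Runs of letters only (no digits or underscores); accented letters are kept.
WORD = re.compile(r"[^\W\d_]+")

def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return WORD.findall(text.lower())

def pooled_corpus(books):
    """Merge the word counts of several books into one reference corpus."""
    pooled = Counter()
    for text in books:
        pooled.update(tokenize(text))
    return pooled

# Two hypothetical one-sentence "books" in Portuguese.
books_pt = ["O livro é bom.", "O código é um livro."]
corpus = pooled_corpus(books_pt)
print(corpus.most_common(3))
```

The resulting counter can serve as the frequency list from which word energies are derived, in place of an external corpus.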
Plots of the characteristic temperature for books, in increasing order, for different corpora in (a) English and (b) Portuguese.
These figures show that the temperature of the books approaches a value of 1 when we take into consideration our compiled corpus, independent of language, i.e., they are becoming increasingly similar to the baseline corpus. As the vocabulary of a book increasingly deviates from that of the corpus, the temperature deviates from 1.
Figure 6 shows a comparison of book temperatures in all six languages using our own compiled corpus. All the temperature values approach 1 and the differences among the languages sharply decrease.
Plots of the characteristic temperature for books considering a self-made corpus with all the books used for each language.
If we were to use a baseline corpus compiled from the words of one book in an analysis of the same book, the temperature would invariably be one. Thus our comparative thermo-analysis technique can be used to measure how appropriate the vocabulary used in a text is to its target readership.
The temperature of a book is strongly related to its vocabulary level. Basic level textbooks use as many common words as possible and their temperature is higher than the temperature of more advanced books. By performing cross-language comparisons with our comparative thermo-analysis, we find that this tendency is independent of language, but that the effort required to read or write a given text differs among languages. Figure 3 shows that the temperature of a book in Portuguese is usually lower than the temperature of the same text in English. This indicates that the book requires more effort of a Brazilian reader than of an English reader.
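The statement that a book measured against itself has temperature one follows directly from the definitions. The short derivation below is a sketch under the conventions used earlier ($k_B = 1$, corpus temperature 1); the symbols $p_{\text{corpus}}$, $p_{\text{text}}$, and $E_w$ are notation introduced here for illustration.

```latex
% Word energy defined from the reference corpus at T = 1 (with k_B = 1):
E_w = -\ln p_{\text{corpus}}(w)

% Temperature of a text defined through the fitted slope:
\ln p_{\text{text}}(w) = -\frac{E_w}{T} + \text{const}

% If the corpus is built from the text itself, p_text = p_corpus, so
\ln p_{\text{text}}(w) = \ln p_{\text{corpus}}(w) = -E_w
\quad\Longrightarrow\quad -\frac{1}{T} = -1
\quad\Longrightarrow\quad T = 1
```

Any deviation of the fitted temperature from one therefore reflects a difference between the word distribution of the text and that of the chosen reference corpus.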
Figure 6 shows that changing the corpus used as standard will change the effort required of the reader. It also shows that the reading effort never reaches zero, and that books in English always have a higher temperature than books in Portuguese. It is possible that English has a high temperature because there are many synonyms that express a similar content in the English Language, and that Portuguese has a low temperature because it requires more words to express different meanings.
In seeking to understand why the temperature of a book, and thus the complexity of its vocabulary, changes when it appears in a different language, we have eliminated the factors of the original language of the author and the influence of the translation process itself. The change that occurs is thus related to the syntax and other grammar features of each language.
Irrespective of cause, a significant factor in solving this puzzle is how well a particular text uses the vocabulary of its target readership. In this way, our comparative thermo-analysis also allows us to determine quantitatively whether the vocabulary of a book, either in its original language or in translation, achieves that goal–how well it reaches its readership.
Conceived and designed the experiments: HHAR SM. Performed the experiments: HHAR SM. Analyzed the data: HHAR LB GD HES SM. Contributed reagents/materials/analysis tools: HHAR SM. Wrote the paper: HHAR LB GD HES SM.
- 1. Bak P, Christensen K, Danon L, Scanlon T (2002) Unified Scaling Law for Earthquakes. Phys Rev Lett 88: 2714.
- 2. Barabási AL, Albert R (1999) Emergence of Scaling in Random Networks. Science 286: (5439): 509–512.
- 3. Vallianatos F, Sammonds P (2013) Evidence of non-extensive statistical physics of the lithospheric instability approaching the 2004 Sumatran Andaman and 2011 Honshu mega-earthquakes. Tectonophysics 590: 52–58.
- 4. Viswanathan GM, Buldyrev SV, Havlin S, da Luz MGE, Raposo E, et al. (1999) Optimizing the success of random searches. Nature 401: (6756): 911.
- 5. Viswanathan GM, da Luz MGE, Raposo EP, Stanley HE (2011) The Physics of Foraging. Cambridge University Press, Cambridge.
- 6. Barabási AL, Buldyrev SV, Stanley HE, Suki B (1996) Avalanches in the Lung: A Statistical Mechanical Model. Phys Rev Lett 76: 2192.
- 7. Papa ARR, da Silva L (1997) Earthquakes in the Brain. Theory in Biosciences 116: 321–327.
- 8. Gabaix X, Gopikrishnan P, Plerou V, Stanley HE (2003) A theory of Power-Law Distributions in financial market fluctuation. Nature 423: (6937): 267–270.
- 9. Mantegna RN, Stanley HE (1995) Scaling behaviour in the dynamics of an economic index. Nature 376: (6535): 46–49.
- 10. Sornette D, Knopoff L, Kagan Y, Vanneste C (1996) Rank-ordering statistics of extreme events: application to the distribution of large earthquakes. Journal of Geophysical Research 101: (B6): 13883–13894.
- 11. Adamic LA, Huberman BA (2000) Power-Law Distribution of the World Wide Web. Science 287: (5461): 2115.
- 12. Cordoba JC (2008) On the distribution of city sizes. J Urban Econ 63: 177–197.
- 13. Perc M (2013) Self-Organization of Progress Across the Century of Physics. Sci Rep 3: 1720.
- 14. Perc M (2014) The Matthew effect in empirical data. J R Soc Interface 11: 20140378.
- 15. Perc M (2012) Evolution of the most common English words and phrases over the centuries. J R Soc Interface 9: 3323–3328.
- 16. Gao J, Hu J, Mao M, Perc M (2012) Culturomics meets random fractal theory: Insights into long-range correlations of social and natural phenomena over the past two centuries. J R Soc Interface 9: 1956–1964.
- 17. Zipf G (1932) Selected Studies of the Principle of Relative Frequency in Language. Harvard University Press, Cambridge.
- 18. Zipf G (1949) Human Behavior and the Principle of Least Effort. Addison-Wesley, Reading.
- 19. Furusawa C, Kaneko K (2003) Zipf's Law in Gene Expression. Phys Rev Lett 90: 088102.
- 20. Redner S (1998) How popular is your paper? An empirical study of the citation distribution. Eur Phys J B 4: 131–134.
- 21. Champernowne D (1953) A model of income distribution. Economic Journal 63: 318–351.
- 22. Miyazima S, Lee Y, Nagamine T, Miyajima H (2000) Power-law distribution of family names in Japanese societies. Physica A 278: 282–288.
- 23. Cancho RF, Sole RV (2003) Least effort and the origins of scaling in human language. PNAS 100: 788–91.
- 24. Francis WN, Kucera H (1964) A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. Brown University, Providence.
- 25. Petersen AM, Tenenbaum JN, Havlin S, Stanley HE, Perc M (2012) Languages cool as they expand: Allometric scaling and the decreasing need for new words. Scientific Reports 2: 313.
- 26. Miyazima S, Yamamoto K (2008) Measuring the Temperature of Texts. Fractals 16: 25.
- 27. “Project Gutenberg”. Available: http://www.gutenberg.org/
- 28. Sardinha TB (2004) Linguística de Corpus. Manole, São Paulo.
- 29. Grosjean F (2010) Bilingual: Life and Reality. Harvard University Press, Cambridge.
- 30. Camargo DC (2011) Uma investigação de aspectos de normalização na autotradução An Invincible Memory. TradTerm 16: 217–230.
- 31. De Beaugrande R (1998) Language and Society: The Real and the Ideal in Linguistics, Sociolinguistics, and Corpus Linguistics. Journal of Sociolinguistics 3: (1): 128–139.