When a Text Is Translated Does the Complexity of Its Vocabulary Change? Translations and Target Readerships

In linguistic studies, the academic level of the vocabulary in a text can be described in terms of statistical physics by using a “temperature” concept related to the text's word-frequency distribution. We propose a “comparative thermo-linguistic” technique to analyze the vocabulary of a text to determine its academic level and its target readership in any given language. We apply this technique to a large number of books by several authors and examine how the vocabulary of a text changes when it is translated from one language to another. Unlike the uniform results produced using the Zipf law, using our “word energy” distribution technique we find variations in the power-law behavior. We also examine some common features that span across languages and identify some intriguing questions concerning how to determine when a text is suitable for its intended readership.


Introduction
Scaling laws have been an important topic in the physics community across a wide range of fields [1][2][3]. The dynamics of several complex systems in biology [4][5][6][7], economics [8,9], and natural phenomena [3,10] have been described with relative success using scaling laws. Scaling phenomena also emerge in the analysis of data associated with human behavior, especially those containing a statistically distributed component, such as the number of links in the World Wide Web or the size of cities [11,12]. In current research, the analysis of scaling in data continues to produce new and interesting findings in a variety of scientific fields [13,14].
In linguistics, Zipf [17] described another typical example of a power law in data on human behavior. He proposed that the distribution of the effort of both speakers and listeners as they attempt to optimize their communication produces a distinctive distribution, the now well-known Zipf law. Recent research has analyzed how the Zipf scaling of the word frequency distribution changes over the centuries [15], and how this change is affected by both social and natural phenomena [16]. As is the case for many other scaling laws, the Zipf law can also be used in the statistical analysis of huge data sets from other systems [12,18,19,20], e.g., the distribution of wealth and income in a given population [21] or the distribution of family names [22].
By examining the word frequency in a given corpus of a natural language, Zipf found that a word's frequency $f(r)$ is inversely proportional to its rank $r$ in the frequency table [17,23], i.e., $f(r) \sim r^{-\alpha}$, where $\alpha$ is a constant for the corpus being analyzed. A log-log plot of the frequency distribution for the first 1000-2000 words in the Brown corpus of the English language [24], for example, yields a straight line whose slope corresponds to $\alpha \approx 1$ [17]. More recently, Petersen et al. found that the Zipf scaling law for word distribution reveals a significant difference between high-frequency words and low-frequency words, and that this behavior seems to be independent of the language considered [25], i.e., in each regime all languages show the same slope.
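The rank-frequency fit described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the cited works: the function name `zipf_exponent`, the `max_rank` cutoff, and the synthetic token list are all assumptions introduced here, and the fit is an ordinary unweighted least-squares line on the log-log plot.

```python
import math
from collections import Counter

def zipf_exponent(tokens, max_rank=1000):
    """Estimate alpha in f(r) ~ r**(-alpha) from a list of word tokens.

    Hypothetical helper: fits a straight line to log f(r) versus log r
    for the max_rank most frequent words and returns minus its slope.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)[:max_rank]
    if len(counts) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# Synthetic tokens (illustrative only): word i occurs about 1000/i times,
# so the estimated exponent should come out close to 1.
tokens = []
for i in range(1, 51):
    tokens += [f"w{i}"] * round(1000 / i)

alpha = zipf_exponent(tokens)
print(f"estimated alpha = {alpha:.3f}")
```

On real text the estimate depends on the rank range chosen, which is consistent with the different high-frequency and low-frequency regimes reported in [25].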
In another recent publication, by assuming that the Zipf law is also controlled by the Maxwell-Boltzmann (M-B) distribution associated with the physical world, Miyazima et al. were able to determine a book's linguistic “temperature” value [26], and used this concept to compare the “temperatures” of educational textbooks in the English language. They found that the higher the vocabulary grade level of a textbook, the lower its temperature. They found, for example, that the temperature of English textbooks for grades K 1 through K 12 in the US educational system decreases from 1.48 K to 0.87 K when the 1.00 K temperature of the American National Corpus (ANC) is used as a standard. In the same analysis they found that the temperature of Einstein's The Theory of Relativity was approximately 0.65 K [26].
If the temperature measurement of a text in a textbook allows us to determine its academic level from its vocabulary, the next step is to determine whether that temperature value can serve as a measurement of the vocabulary complexity of books in general. We propose a technique based on the temperature concept that allows us to analyze texts in their various translations and determine how vocabulary features change across languages. We examine a group of popular books in six different languages and find some intriguing patterns in their translated versions. By improving our comparative analysis, we are able to measure a text's suitability for its intended readership and thus to determine which vocabulary standards better fit a particular text.

Word Energy and Measurement of Temperature
Through the use of some basic concepts, we can define the key quantities. In thermodynamics, the probability function for the energy states in a substance follows the Maxwell-Boltzmann distribution. In general,
$$p(E) \propto e^{-\beta E}, \tag{1}$$
where $\beta = 1/kT$, $k$ is the Boltzmann constant ($1.38 \times 10^{-23}$ J/K), and $T$ is the absolute temperature. Here, as a convenience, we consider $k = 1$ irrespective of unit. We assume that each word corresponds to an energy value in the M-B distribution in Eq. (1). Although we can only calculate $\beta E$ for each word and not $E$ itself, if we assume a 1 K temperature for the corpus considered (e.g., the Brown corpus), we can determine the specific energy for each word.
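As a rough illustration of how a word energy could be assigned, the sketch below assumes the M-B form $p(E) \propto e^{-\beta E}$ with $k = 1$ and a reference temperature of 1, so that $\beta = 1$ and the energy of a word is $E(w) = -\ln p(w)$, where $p(w)$ is its relative frequency in the reference corpus. This definition (including dropping the normalization constant), the function name `word_energies`, and the toy corpus are assumptions made here for illustration and are not taken from the text.

```python
import math
from collections import Counter

def word_energies(corpus_tokens):
    """Map each word to E(w) = -ln p(w), assuming beta = 1 (k = 1, T = 1).

    p(w) is the relative frequency of w in the reference corpus, so
    frequent words get low energies and rare words get high energies.
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {word: -math.log(count / total) for word, count in counts.items()}

# Toy reference corpus (illustrative only).
corpus = "the cat and the dog and the bird".split()
energies = word_energies(corpus)

for word in sorted(energies, key=energies.get):
    print(f"{word:>5s}  E = {energies[word]:.3f}")
```

With this convention the reference corpus plotted against its own energies gives a semi-log slope of exactly $-1$, which is the standard the text compares other books against.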
When we count each word in the vocabulary of a volume of an English text, e.g., a journal, a novel, or a school textbook, we assume that we will find a word distribution that deviates from the distribution of the vocabulary in the Brown English corpus. We use this deviation to determine the temperature of the text in its English version. Fitting our “word-energy” frequency distribution, $p(E)$ versus $E$, to the Brown corpus, we find a straight line with slope $-1.0$ in a semi-log plot, reflecting the standard M-B distribution. Fitting the same “word-energy” distribution to any other text on the same scale, we find a slope slightly higher or lower than the standard. Since this slope reflects the exponent $-\beta E$ in Eq. (1), we can easily calculate the corresponding temperature for this particular text. We fit the distribution to the Maxwell-Boltzmann distribution, vary the temperature, and so obtain the temperature of the text. Figure 1(a) shows the probability distribution $P(\beta E)$ for the vocabulary of the book The Da Vinci Code by Dan Brown in English where, e.g., $P[\beta E(\text{the})]$ and $P[\beta E(\text{of})]$ are plotted against the word energies of “the” and “of.” This plot also presents the comparative standard distribution, in this case the energies associated with the words in the Brown corpus. Figure 1(b) shows that it is easier to plot $\log P[\beta E]$ against $E$ and fit it using a straight line. Note that the “word-energy” distribution for the Brown corpus has the expected slope $-1.0$, but that the slope for the book is $-0.9952$, which corresponds to a temperature $T \approx 1$. This temperature varies greatly when other books and their translations are considered.
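The semi-log fit can be sketched as follows. This is a minimal illustration under stated assumptions: word energies are supplied as a dictionary (hypothetical values here), the fit is an unweighted least-squares line through $\ln P_{\text{text}}(w)$ versus $E(w)$ over the words shared by text and corpus, and the temperature is read off as $T = -1/\text{slope}$ with $k = 1$. The name `fit_temperature` and the synthetic data are introduced for illustration; the original fitting procedure may differ, for example by binning or weighting.

```python
import math

def fit_temperature(energies, text_probs):
    """Return (temperature, slope) from a line fit of ln P_text(w) vs E(w).

    energies:   dict word -> energy taken from the reference corpus
    text_probs: dict word -> relative frequency of the word in the text
    """
    shared = [w for w, p in text_probs.items() if p > 0 and w in energies]
    if len(shared) < 2:
        raise ValueError("need at least two shared words")
    xs = [energies[w] for w in shared]
    ys = [math.log(text_probs[w]) for w in shared]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or sxy == 0:
        raise ValueError("slope is undefined or zero")
    slope = sxy / sxx
    return -1.0 / slope, slope

# Synthetic check (illustrative only): build a text whose word
# probabilities follow exp(-E / T) with T = 0.8, then recover T.
energies = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
true_T = 0.8
weights = {w: math.exp(-e / true_T) for w, e in energies.items()}
z = sum(weights.values())
text_probs = {w: v / z for w, v in weights.items()}

temperature, slope = fit_temperature(energies, text_probs)
print(f"slope = {slope:.4f}, T = {temperature:.4f}")
```

A slope steeper than $-1$ gives $T < 1$, and a shallower slope gives $T > 1$.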

The Comparative Thermo-Linguistics Technique
The main component of our technique, “comparative thermo-linguistic analysis,” assumes that every readership (e.g., a geographic community or a group of people with common interests) has its own vocabulary. For example, the way in which a newspaper reports an event such as a soccer game is strongly influenced by the frame of reference of its reading public. This goes beyond hometown papers simply supporting the home team. The reading level and interests of those reading the sports page in an upscale broadsheet will differ from those of readers of the same page in a tabloid, for example.
Our comparative technique for text analysis is as follows:
1. Define the target readership.
2. Determine the standard vocabulary for the target readership, i.e., locate a literary “corpus” that adequately represents its vocabulary. Miyazima et al. [26] considered the corpus of the entire English language as a general standard for the analysis of English textbooks. Their choice was useful, but only in a limited way.
3. Calculate the corresponding “energy” for each word in the corpus in order to determine the standard distribution of word energy for the target readership.
4. Use this energy distribution to determine the “relative temperature” of each text to be examined.
5. Compare the relative temperature of the texts examined with the standard vocabulary exhibited by the literary corpus being used as a reference.
Similar to what we have found for grade levels, we expect the relative temperature of each text to be closely related to the reading effort required of someone in the target readership. When the relative temperature of a text is higher (lower) than that of the standard corpus, the complexity of its vocabulary will be lower (higher) than that of the standard. If the temperatures are approximately the same, the text being examined is deemed highly appropriate for the target readership [see Fig. 1(b)].
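The comparison in the final step can be expressed as a small decision rule. The sketch below is illustrative only: the function name `compare_to_standard` and the `tolerance` of 0.05 that defines "approximately the same" are assumptions introduced here, since the text does not state a numerical threshold. The example temperatures are the values quoted earlier from [26] and from the English version of The Da Vinci Code.

```python
def compare_to_standard(relative_temperature, standard_temperature=1.0,
                        tolerance=0.05):
    """Interpret a relative temperature against the standard corpus.

    Higher temperature -> simpler vocabulary than the standard;
    lower temperature  -> more complex vocabulary than the standard.
    The tolerance is a hypothetical choice, not a value from the text.
    """
    difference = relative_temperature - standard_temperature
    if abs(difference) <= tolerance:
        return "appropriate for the target readership"
    if difference > 0:
        return "vocabulary simpler than the standard"
    return "vocabulary more complex than the standard"

examples = {
    "grade-1 textbook (T = 1.48)": 1.48,
    "The Theory of Relativity (T = 0.65)": 0.65,
    "The Da Vinci Code, English (slope -0.9952)": 1 / 0.9952,
}
for label, temperature in examples.items():
    print(f"{label}: {compare_to_standard(temperature)}")
```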

Books and their translations
We next examine how the vocabulary of a text changes when it is translated into another language. To minimize bias, we consider 30 different books and their respective translations (versions) in six languages. Figure 2(a) shows a log-log plot, similar to that of Petersen et al., that compares the distribution of the probabilities of occurrence $P(r)$ of the 1024 most frequent words indicated in the “Project Gutenberg” corpora of the English, French, German, Portuguese, Spanish, and Italian languages [25,27]. Although all of the curves are approximately identical, the rank of a given word (and its corresponding translation) changes when other languages are taken into consideration.
Using our comparative thermo-linguistic analysis we find that the rank position of a word usually differs between languages. Although the Zipf distribution does not change when different languages are considered, when a text is translated the energy distribution does change [see Fig. 2(b)]. Figure 2(b) shows a plot of the energy distribution of words for several translated versions of The Da Vinci Code and their respective temperatures calculated from their slopes (e.g., the slope for the Portuguese translation is approximately $-1.26$, corresponding to $T_{pt} \approx 0.79$, and so on). To allow a comparison between languages, we use $T_{st} = 1.00$ as the “standard temperature” for each corpus.
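The conversion from slope to temperature can be written out explicitly. The worked step below assumes, as in the M-B form with $k = 1$, that the semi-log slope equals $-\beta = -1/T$; the numbers are the Portuguese slope quoted above and the English slope of $-0.9952$ reported for the same book.

```latex
\[
  \text{slope} = -\beta = -\frac{1}{T}
  \quad\Longrightarrow\quad
  T = -\frac{1}{\text{slope}},
\]
\[
  T_{pt} = \frac{1}{1.26} \approx 0.79,
  \qquad
  T_{en} = \frac{1}{0.9952} \approx 1.005 .
\]
```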
We repeat this same procedure for 30 books and their translations into six languages (see Fig. 3). Table 1 shows the numerical results generated by this new technique of “comparative thermo-linguistic analysis.” Figure 3 shows the average temperature, with basic books to the left, medium-level books in the center, and advanced books to the right. Within each of the three regions the arrangement is random. For these results we used frequency lists of up to 10,000 words drawn from a variety of sources and without specific requirements, e.g., how the list was assembled (see Table 2). Figure 3 uses the Corpus Brasileiro - PUC/SP [28] as the Portuguese language standard, and the Brown Corpus [24] as the English language standard. Table 1 shows that, for a given book, the temperature of its Portuguese version is almost always lower than the temperature of the other versions. Exceptions to this include The Bible, The New Testament, and Pinocchio [see Fig. 4(a)]. Similar behavior can be observed in the temperature of the German versions of these same books [see Fig. 4].

What makes the difference?
Figure 2(b) shows results that imply that the temperature of a book always changes when it is translated. To investigate this we consider books written by bilingual authors who do their own translations, assuming that the vocabulary preference and literary style of an author will remain constant across translations [29,30]. Table 3 shows the temperature values for 16 different books, each written and translated into two languages (A and B) by a single bilingual author. Note that, with only one exception, the English version of a book has a higher temperature value than the same book in any other language.
This result is consistent with the results we obtained when we analyzed the books listed in Table 1, i.e., the translation process itself is not what changes the complexity of the vocabulary when a book is translated, and so it does not cause the change in temperature.
Examining again the reference corpus of each language, we calculate the temperature of all the books shown in Fig. 3, choosing each corpus irrespective of how it was compiled or assembled. Because the vocabulary of a language is strongly influenced by social and cultural forces [31], a text written for a target readership will be strongly influenced by that readership. Thus changes in temperature will occur when there is a change in the standard vocabulary that we use when we do our comparative analysis. Figure 5(a) shows the same set of books shown in Fig. 3 in their English versions. Each curve corresponds to a different corpus. Figure 5(b) shows the Portuguese versions of the same books. We compiled our own corpus using all the words contained in the books in English and Portuguese and used this as our standard corpus in the analysis in both languages (see the curves “All Books EN” and “All Books PT” in Figs. 5(a) and 5(b), respectively).
These figures show that the temperature of the books approaches a value of 1 when we take into consideration our compiled corpus, independent of language, i.e., they are becoming increasingly similar to the baseline corpus. As the vocabulary of a book increasingly deviates from that of the corpus, the temperature deviates from 1. Figure 6 shows a comparison of book temperatures in all six languages using our own compiled corpus. All the temperature values approach 1 and the differences among the languages sharply decrease.
If we were to use a baseline corpus compiled from the words of one book in an analysis of the same book, the temperature would invariably be one. Thus our comparative thermo-analysis technique can be used to measure how appropriate the vocabulary used in a text is to its target readership.
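The self-reference case can be checked in a few lines. The sketch below assumes that word energies are defined from the reference corpus as $E(w) = -\ln P_{\mathrm{ref}}(w)$ (with $k = 1$ and $T_{st} = 1$); this definition is an assumption made here for illustration and is not stated explicitly in the text.

```latex
\begin{gather*}
  E(w) = -\ln P_{\mathrm{ref}}(w)
    \qquad \text{(assumed definition, } k = 1,\ T_{st} = 1\text{)} \\
  P_{\mathrm{text}}(w) = P_{\mathrm{ref}}(w)
    \;\Longrightarrow\;
  \ln P_{\mathrm{text}}(w) = -E(w) \\
  \text{slope} = -\beta = -1
    \;\Longrightarrow\;
  T = \frac{1}{\beta} = 1
\end{gather*}
```

The same reasoning suggests why temperatures move toward 1 when the reference corpus is compiled from the books being analyzed: the closer the word frequencies of a text are to those of the reference, the closer the fitted slope is to $-1$.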

Conclusions
The temperature of a book is strongly related to its vocabulary level. Basic level textbooks use as many common words as possible and their temperature is higher than the temperature of more advanced books. By performing cross-language comparisons with our comparative thermo-analysis, we find that this tendency is independent of language, but that the effort required to read or write a given text differs among languages. Figure 3 shows that the temperature of a book in Portuguese is usually lower than the temperature of the same text in English. This indicates that the book requires more effort of a Brazilian reader than of an English reader. Figure 6 shows that changing the corpus used as standard will change the effort required of the reader. It also shows that the reading effort never reaches zero, and that books in English always have a higher temperature than books in Portuguese. It is possible that English has a high temperature because there are many synonyms that express a similar content in the English language, and that Portuguese has a low temperature because it requires more words to express different meanings.
In seeking to understand why the temperature of a book, and thus the complexity of its vocabulary, changes when it appears in a different language, we have eliminated the factors of the original language of the author and the influence of the translation process itself. The change that occurs is thus related to the syntax and other grammar features of each language.
Irrespective of cause, a significant factor in solving this puzzle is how well a particular text uses the vocabulary of its target readership. In this way, our comparative thermo-analysis also allows us to determine quantitatively whether the vocabulary of a book, either in its original language or in translation, achieves that goal, i.e., how well it reaches its readership.