The authors have declared that no competing interests exist.
Conceived and designed the experiments: Suin Kim SP Sooyoung Kim JB AHO. Performed the experiments: Suin Kim SP Sooyoung Kim JB. Analyzed the data: Suin Kim SP. Wrote the paper: Suin Kim SP SAH AHO.
Multilingualism is common offline, but we have a more limited understanding of the ways multilingualism is displayed online and the roles that multilinguals play in the spread of content between speakers of different languages. We take a computational approach to studying multilingualism using one of the largest user-generated content platforms, Wikipedia. We study multilingualism by collecting and analyzing a large dataset of the content written by multilingual editors of the English, German, and Spanish editions of Wikipedia. This dataset contains over two million paragraphs edited by over 15,000 multilingual users from July 8 to August 9, 2013. We analyze these multilingual editors in terms of their engagement, interests, and language proficiency in their primary and non-primary (secondary) languages and find that the English edition of Wikipedia displays different dynamics from the Spanish and German editions. Users primarily editing the Spanish and German editions make more complex edits than users who edit these editions as a second language. In contrast, users editing the English edition as a second language make edits that are just as complex as the edits by users who primarily edit the English edition. In this way, English serves a special role bringing together content written by multilinguals from many language editions. Nonetheless, language remains a formidable hurdle to the spread of content: we find evidence for a complexity barrier whereby editors are less likely to edit complex content in a second language. In addition, we find that multilinguals are less engaged and show lower levels of language proficiency in their second languages. We also examine the topical interests of multilingual editors and find that there is no significant difference between primary and non-primary editors in each language.
Wikipedia is the world’s largest general reference work, and it depends on active editors to generate and maintain up-to-date and accurate information. Wikipedia is also one of the top ten websites in terms of traffic volume, and its articles are often among the top results for many search queries on Google [
There are 288 language editions of Wikipedia hosted by the Wikimedia Foundation, providing easy access to information for many Internet users globally, but there are high levels of inequality and asymmetry in the information available in the different language editions.
The English edition contains more than 4.9 million articles as of June 2015, which is 13.8% of all the articles in the 288 language editions [
There are several reasons for this information inequality and asymmetry in Wikipedia. First, Wikipedia was only available in English when it started in January 2001. The German and the Catalan editions were added two months later, and other language editions followed after a few years, but English has always remained the largest edition [
As awareness of this information inequality and asymmetry has increased, there have been efforts to grow or distribute more information for several language editions. In 2010, Google sponsored a contest to encourage students in Tanzania and Kenya to contribute to the Swahili edition of Wikipedia [
Previous research has shown that multilingual users play a key role in information diffusion across languages in social media [
Approximately 15% of active Wikipedia editors are multilingual, contributing content to multiple language editions of the encyclopedia [
Unlike the studies on monolingual Wikipedia editors [
- Do primary and non-primary editors show different levels of engagement?
- Do primary and non-primary editors show different levels of topical interest?
- Do primary and non-primary editors show different levels of language proficiency?
Each editor’s engagement is quantitatively measured by their text contributions and the time spent revising articles. The interests of editors are identified by the topics of the articles they edited. Lastly, the language proficiencies of editors are measured by various language complexity measures applied to their contributions.
Our results can be summarized in three parts. First, there are significant differences in the levels of engagement and language proficiency in the German and Spanish editions: multilingual users who primarily edit either of these editions show higher levels of engagement and language proficiency than multilingual users who edit these editions as non-primary languages. Second, there was no notable difference between the degrees of engagement and proficiency of editors who edited the English edition as a primary or non-primary language, supporting the common assumption of English as the lingua franca of the Web. Third, primary and non-primary editors show similar levels of interest for most topics, with the exception that primary editors are significantly more interested in local topics and non-primary editors in global topics.
Our contributions to the field of Web-based study of multilingualism are as follows:
- We construct an extensive dataset of multilingual Wikipedia edit history, comprising more than 5 million edits, and make it publicly available for future research.
- We define and analyze three relevant aspects of multilingual editors’ behavior: engagement, interest, and language complexity.
- We define and validate several language-independent measures for quantifying the language complexity of edits.
- We show that multilingual editors indeed have the potential to help mitigate the inequality and asymmetry in the information available in different languages.
In this section, we describe the data collection and analysis methods. To analyze the editors’ engagement, interests, and language proficiency levels, we first start with the edit metadata from the English, German, and Spanish editions for one month and construct article edit sessions to identify consecutive contributions to articles by the same editor. Based on these article edit sessions, we extract multilingual editors and their contributions. By analyzing their contributions, we (1) measure their degrees of
To extract only the contributions of multilingual editors to articles from the unstructured edit history data, we run a data processing pipeline consisting of the following steps. We start with the metadata of the English, German, and Spanish language editions from July 8 to August 9, 2013. We then construct article edit sessions by grouping together consecutive edits to the same article by the same editor. Based on the identified sessions, we define multilingual editors as those who are involved in article edit sessions for two or more language editions. Finally, we download the actual revision text for the article edit sessions of those multilingual editors. We describe these steps in more detail below.
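As an illustration of the session-construction and editor-selection steps, the following is a minimal sketch rather than the pipeline actually used. It assumes that "consecutive" means a run of edits to one article by one editor that is not interrupted by another editor's edit, and it applies no time-gap cutoff. The `Edit` record and both function names are hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Dict, List, Set


@dataclass(frozen=True)
class Edit:
    edition: str      # language edition code, e.g. "en", "de", "es"
    article: str      # article title
    editor: str       # editor user name
    timestamp: float  # seconds since epoch (hypothetical representation)


def build_sessions(edits: List[Edit]) -> List[List[Edit]]:
    """Group consecutive edits to the same article by the same editor."""
    sessions: List[List[Edit]] = []
    # Order edits by article, then chronologically within each article.
    ordered = sorted(edits, key=lambda e: (e.edition, e.article, e.timestamp))
    for _, history in groupby(ordered, key=lambda e: (e.edition, e.article)):
        # Assumption: an uninterrupted run by one editor forms one session.
        for _, run in groupby(history, key=lambda e: e.editor):
            sessions.append(list(run))
    return sessions


def multilingual_editors(sessions: List[List[Edit]]) -> Set[str]:
    """Editors with article edit sessions in two or more language editions."""
    editions: Dict[str, Set[str]] = {}
    for session in sessions:
        first = session[0]
        editions.setdefault(first.editor, set()).add(first.edition)
    return {editor for editor, langs in editions.items() if len(langs) >= 2}
```

Grouping on the sorted history keeps the sketch to a single pass per article; a time threshold between edits, if one is part of the session definition, would have to be added inside the inner loop.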
We begin with the edit metadata collected by Hale [
Even though the easiest way to measure the edit activity on Wikipedia is simply counting each edit when changes are submitted, this does not accurately reflect editors’ behavior because of individual differences in activity patterns. For example, some editors may submit a few large edits while others may make a series of smaller edits saving the pages more frequently as they work. To account for these individual differences, we adopt the idea of
In other words, from the collected dataset, we define an
Next, we identify multilingual editors from the metadata and retrieve the content of all the edits made by multilingual editors from Wikipedia using the Wikipedia API. We define a multilingual editor as an editor who edits two or more language editions. Using this definition, we identified 12,577 multilingual editors with 427,529 total article edit sessions composed of 622,766 distinct edits. Following Hale [ We identify the We identify the For each language edition X, we define
Thus, an editor can be a primary editor in only one language edition, but can be a non-primary editor in multiple language editions.
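The split into primary and non-primary editors can be sketched as follows. This is an illustrative reading of the definition, assuming that an editor's primary edition is the one with the most article edit sessions; the alphabetical tie-break and the function names are hypothetical choices made only for this example.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def primary_edition(session_counts: Dict[str, int]) -> str:
    """Edition with the most sessions; ties broken alphabetically (assumption)."""
    return min(session_counts, key=lambda lang: (-session_counts[lang], lang))


def split_editors(
    sessions: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, Set[str]]]:
    """Map each edition to its primary ("p") and non-primary ("np") editors.

    `sessions` holds one (editor, edition) pair per article edit session.
    """
    counts: Dict[str, Counter] = {}
    for editor, edition in sessions:
        counts.setdefault(editor, Counter())[edition] += 1

    groups: Dict[str, Dict[str, Set[str]]] = {}
    for editor, per_edition in counts.items():
        if len(per_edition) < 2:
            continue  # not a multilingual editor
        primary = primary_edition(per_edition)
        for edition in per_edition:
            label = "p" if edition == primary else "np"
            groups.setdefault(edition, {"p": set(), "np": set()})[label].add(editor)
    return groups
```

Under this reading each multilingual editor lands in exactly one "p" set and in one "np" set for every other edition they touched.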
There is more activity in the English edition (en) than in either the German (de) or Spanish (es) edition. In all three language editions there are more primary editors (p) than non-primary editors (np) and primary editors are more active than non-primary editors.
Measure | en-p | en-np | de-p | de-np | es-p | es-np |
---|---|---|---|---|---|---|
# Multilingual Editors | 3,832 | 7,784 | 1,631 | 1,640 | 996 | 1,510 |
# Article Edit Sessions | 200,883 | 36,959 | 112,788 | 7,334 | 63,947 | 5,609 |
# Edits | 298,868 | 51,665 | 151,014 | 9,111 | 104,341 | 7,757 |
# Edited paragraphs | 1,447,692 | 230,893 | 816,647 | 27,656 | 554,762 | 25,340 |
Wikipedia provides the previous and current versions of each article. To look at the actual edited text, we download the “diffs,” which are the parts of the text that have changed from the previous version of the article for each article edit session we identified.
(a) Edit paragraph containing a visible edit. (b) Edit paragraph containing a non-visible edit. We define an edit paragraph as one line of Wikipedia markup, utilizing the diff from Wikipedia.
In this way, we retain only the visible text from edits, discarding all non-visible and non-text information including URL, multimedia, metadata, and document structure. We regard an edit as
We examine the distribution of multilingual editors’ contributions by language and the number of article edit sessions.
We display the number of article edit sessions up to 100. Dots in the same color denote the same language edition. The plot, on a log-log scale, shows that the distributions for primary and non-primary users in all three editions are heavy-tailed: most users perform only a few edits, while a few users perform many edits.
Our approach is to analyze and compare the behavior of primary and non-primary multilingual editors for the English, German, and Spanish editions in terms of
We define an edit paragraph as a line of Wiki markup in an article, utilizing the difference of the text before and after the edit from Wikipedia as shown in
Rather than comparing each editor’s behavior across languages, we compare the editing behavior of primary versus non-primary editors within each language, as defined in the previous section. For example, we take all editors in our dataset who contributed to any article in the English edition and divide them into two groups: (1) those who contributed to the English edition more than to any other language edition, and (2) those who contributed to some other language edition more than to the English edition. We then compare the behaviors of those two groups. This way, we do not need to account for inherent differences among languages (e.g., German tends to have longer words than English).
Since we categorized multilingual editors into two groups (primary and non-primary) on the basis of the number of article edit sessions they had across language editions, we measure the level of quantitative engagement of editors within an article edit session, such that the measure becomes independent of the number of article edit sessions.
We measure the amount of engagement based on four metrics. First, we count the number of revisions committed within an article edit session, only including the sessions containing more than 1 edit. Second, we measure the session length in minutes by the difference of timestamps between the first and the last revision in each article edit session. Third, we measure the amount of text added and deleted in terms of characters, words, and sentences for all the
- Number of edits |
- Session length in minutes |
- Number of edited characters, words, sentences (for |
- Fraction of non-visible article edit sessions. |
- Normalized frequency of edits for each topic |
- Lexical diversity measures (for |
- Entropy of unigram, bigram, trigram frequencies |
- Syntactic complexity measures (for |
- Entropy of POS unigram, bigram, trigram frequencies |
- Article usage measures (for |
- Fraction of articles in added tokens |
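The entropy measures listed above can be illustrated with a short sketch. It computes the Shannon entropy of the empirical n-gram distribution of a token sequence. The same function applies to word tokens or to POS tags. The base-2 logarithm and the function name are assumptions, since the base is not specified here.

```python
import math
from collections import Counter
from typing import Sequence


def ngram_entropy(tokens: Sequence[str], n: int) -> float:
    """Shannon entropy (in bits) of the n-gram frequency distribution."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # sequence shorter than n: no n-grams to measure
    counts = Counter(ngrams)
    total = len(ngrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For example, `ngram_entropy(pos_tags, 3)` would give a POS trigram entropy for a tagged paragraph, where `pos_tags` is a hypothetical list of tag strings produced by a tagger.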
For each metric, we compute the mean over all article edit sessions for each editor, and then compute the mean over primary and over non-primary editors to represent the level of engagement of each group. We perform a two-tailed independent-samples t-test to examine whether the differences between the means of the two groups are significant.
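The two-level averaging and the group comparison can be sketched as below. The sketch computes only the t statistic and its degrees of freedom, using the pooled-variance (Student) form. Whether the pooled or the unequal-variance form was used is not stated, so that choice is an assumption, as are the function names. A two-tailed p-value would then be read from a t distribution with the returned degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance
from typing import Dict, List, Sequence, Tuple


def editor_means(session_values: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean of a metric over each editor's article edit sessions."""
    return {editor: mean(vals) for editor, vals in session_values.items() if vals}


def student_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Pooled-variance t statistic and degrees of freedom for two samples."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df
```

Averaging within editors first keeps a few very active editors from dominating the group mean, which matters given the heavy-tailed activity distribution.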
Assuming that the topics of the edited articles represent editors’ interests in specific fields, we measure the interests of multilingual editors using a Bayesian topic model. We first determine the main topic category for each article a multilingual user edited and assign each article a single topic label. We then compute the proportion of interest from primary and non-primary editors for each topic. By comparing the distributions, we can compare the interests of primary and non-primary editors.
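Once each article carries a single topic label, the per-group interest distribution reduces to a normalized frequency count. The sketch below covers only that step and assumes the topic labels come from a topic model that is not shown. The function name and input layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def topic_proportions(
    edited_articles: Iterable[str], topic_of: Dict[str, str]
) -> Dict[str, float]:
    """Share of a group's edits that falls into each topic.

    `edited_articles` has one article title per edit (or per session),
    and `topic_of` maps each article title to its single topic label.
    """
    counts = Counter(topic_of[title] for title in edited_articles)
    total = sum(counts.values())
    return {topic: c / total for topic, c in counts.items()}
```

Calling the function once for primary editors and once for non-primary editors gives two distributions over the same topics that can be compared topic by topic.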
In order to measure the language proficiency of multilingual editors, we focus on three aspects of their edits: (1)
For automatic POS tagging on edits, we employ a maximum entropy POS tagger [
The difficulty comes from the complex usage of articles, i.e., there is not a one-to-one correspondence between languages. This difficulty poses a challenge for second-language learners [
Since the maximum value may be affected merely by the number of edits, we fix the number of edits considered per editor at three. We uniformly sample three edits from all the edits an editor made, repeat this 100 times per editor, and average the resulting 100 maximum values.
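The sampling procedure for the maximum can be sketched as follows. It assumes that each edit already has a numeric complexity score and that the three edits are drawn without replacement. The function name, the seeded generator, and the error for editors with fewer than three edits are choices made only for this example.

```python
import random
from statistics import mean
from typing import Optional, Sequence


def sampled_max(
    scores: Sequence[float],
    k: int = 3,
    repeats: int = 100,
    rng: Optional[random.Random] = None,
) -> float:
    """Average, over `repeats` draws, of the maximum of `k` sampled scores."""
    if len(scores) < k:
        raise ValueError("editor has fewer than k scored edits")
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    # Each draw samples k edits without replacement and keeps their maximum.
    return mean(max(rng.sample(list(scores), k)) for _ in range(repeats))
```

Fixing the sample size makes the maximum comparable between an editor with three edits and one with three hundred, since the raw maximum tends to grow with the number of edits.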
Article type | English | German | Spanish |
---|---|---|---|
Definite | the | des, die, den, der, dem, das | el, la, los, las |
Indefinite | a, an | eine, eines, einer, einem, einen, ein | un, una, unos, unas |
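Using the word lists in the table above, the fraction of articles among added tokens can be sketched as a simple lookup. This is an illustration only: it lowercases tokens before matching (an assumption), and it cannot tell an article from a homographic pronoun such as German "die". The names `ARTICLES` and `article_fraction` are hypothetical.

```python
from typing import Dict, Sequence, Set

# Definite and indefinite articles per edition, taken from the table above.
ARTICLES: Dict[str, Set[str]] = {
    "en": {"the", "a", "an"},
    "de": {"des", "die", "den", "der", "dem", "das",
           "eine", "eines", "einer", "einem", "einen", "ein"},
    "es": {"el", "la", "los", "las", "un", "una", "unos", "unas"},
}


def article_fraction(added_tokens: Sequence[str], edition: str) -> float:
    """Fraction of added tokens that are articles in the given edition."""
    if not added_tokens:
        return 0.0
    articles = ARTICLES[edition]
    hits = sum(1 for token in added_tokens if token.lower() in articles)
    return hits / len(added_tokens)
```

For instance, `article_fraction("The cat sat on a mat".split(), "en")` counts "The" and "a" among six tokens.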
All of the engagement, interest, and proficiency measures described in this section are summarized in
In this section, we report our findings on engagement, interests, and language proficiency of primary and non-primary multilingual editors for the English, German, and Spanish editions of Wikipedia.
We show in
For all metrics and languages, primary and non-primary editors show significantly different behavior: primary editors tend to be more engaged than non-primary editors. (*
We also find that the number of tokens added per edit session by primary editors is higher than the number added by non-primary editors in all three language editions. This holds whether tokens are measured as characters, words, or sentences. These findings on the amount of edited content align with the previous findings on the number of edits and the overall time spent, indicating that primary editors are in general more engaged in revising the text of articles than non-primary editors.
Finally, we find that non-primary editors make more non-visible edits, such as adding/removing hyperlinks or applying stylistic changes, in all three languages. This tendency indicates that editors may be making different types of edits in their primary and non-primary languages. Similar results have been shown in qualitative research (e.g., [
The twenty topics discovered for the articles edited by multilingual users in the English edition are as follows:
Articles related to Descriptive and History make up a large proportion of all articles while Middle East Geography, American Sports, and Transportation articles are relatively small in proportion.
Similarly, the twenty topics discovered for Spanish articles are as follows:
Finally, the twenty topics discovered for German articles are as follows:
The overall difference of interest between primary and non-primary editors is not significant (
However, when looking at each topic, we observe notable differences in the level of interest between primary and non-primary editors for a few topics in each language. In the English edition, we observe two topics with significantly different levels of interest between primary and non-primary editors. Primary editors show a significantly higher level of interest in the topic of
In the German and the Spanish editions, we also observe an interesting pattern where topics with more interest from primary users have higher syntactic complexity compared to topics with more interest from non-primary users (see
The number next to each topic represents the average entropy of part-of-speech trigrams for the articles within that topic.
Editor group | English | German | Spanish |
---|---|---|---|
Primary | | | |
Non-Primary | | | |
Prior to computing the language complexity measures for edits, we control for the topics of the articles presented in the previous section, as prior research found that the language used in conceptual articles tends to be more complex than the language used in biographical and factual articles [
Language complexity as measured by the entropy of POS trigrams varies by topic. Thus, we control for topic in order to measure language complexity more accurately.
We first examine the complexity of
Moreover, we observe that the entropy of part-of-speech unigrams, bigrams, and trigrams is significantly higher for edits by primary editors in English and Spanish. Likewise, the POS entropy of edits is always higher for primary than non-primary editors in the German edition. The difference for bigrams and trigrams is significant while the difference in unigram POS entropy is not. This indicates that syntactic complexity also differs between primary and non-primary editors in all three languages in general. Thus, we conclude that primary editors edit more complex parts of articles compared to non-primary editors. In addition, the result implies that multilingual editors writing in a non-primary language may face a complexity barrier whereby they shy away from editing more complex sections of articles.
Delta complexity measures the difference in complexity before and after each article edit session and thus provides a measure of how the edits by the multilingual user changed the complexity of the article.
These results imply that primary editors in the German and Spanish editions possess higher linguistic proficiency, at least in terms of writing ability, than non-primary editors. This is consistent with the findings from the analysis of user interest. However, this does not apply to English, where there is no evidence that primary and non-primary editors possess different levels of linguistic proficiency.
We looked at the editing behaviors of multilingual editors in Wikipedia, the world’s largest general reference work. Other recent papers have sourced Wikipedia as a unique dataset of revision history that can be used to predict various collective opinions and actions [
This paper offers a first in-depth look at the behaviors of multilingual Wikipedia editors who are invaluable in delivering information across the language barrier. Previous research shows that multilingual editors’ cultural background and language influence their perspective—how they view, interpret, and document world events [
With respect to linguistic complexity, we found that primary and non-primary editors differ in their behavior. Within the Spanish and German editions, we found that (i) primary users choose to edit more complex text than non-primary users (
These findings reinforce how strong a barrier language is on online collaboration platforms. Even multilingual users who edit multiple editions of Wikipedia devote most of their efforts to editing the edition of their primary language, making more edits and spending more time within article edit sessions in their primary languages than in their non-primary languages. When multilingual users do edit in their non-primary languages, they often face a language complexity barrier. Users editing a non-primary language edition targeted their edits toward the grammatically simpler parts of the articles. In addition, with the exception of English, the edits that they made did not raise the linguistic complexity of the articles as much as the edits by multilingual users who primarily edited the language. This accords with linguistics research showing that multilinguals have differing levels of competency in their languages and that such differences are often related to how much they use each language [
A limitation of our study is that the metrics we used are language dependent (e.g., linguistic complexity based on POS tag entropy). Given this limitation, we compared primary and non-primary editors within each language, rather than across languages. If we can find better, language-independent metrics, it will be possible to compare editors across languages as well. Further, because our study relies on NLP tools, we were limited to the three languages with the most accurate NLP tools: English, German, and Spanish. With advances in and wider availability of NLP tools for other languages, it will be possible to expand this study and examine a larger variety of languages in the future. Another limitation of this study is that a multilingual editor’s primary language is not always their native language. To quantify the degree of discrepancy, we checked the alignment between editors’ primary languages and their self-declared native languages for the English edition by constructing another dataset of 221,162 editors who contributed to one or more articles in the English edition of Wikipedia appearing within the category “Wikipedia Controversial Topics.” Among these editors, there are 18,962 users who disclosed their native languages on their user pages. Since the editors are crawled from the edit histories of the English edition of Wikipedia, we retain only the editors whose primary language or disclosed native language is English. We find that 94.4% of these editors (1,604 of 1,699) have the same primary language as the disclosed native language, indicating that the primary language (i.e., the language of the most edited edition) of a multilingual Wikipedia editor is a good proxy for the user’s native language.
There remains a range of questions to tackle. There are many other online collaboration platforms beyond Wikipedia. In addition to the specific editing behaviors we studied in this paper, there are additional behavioral patterns that could be examined to further study the nature of cross-lingual knowledge diffusion and the contributions of multilingual users. Furthermore, to fully understand the behavior and the roles of multilingual users, the contributions of multilingual users should also be compared with those of monolingual users.
Examples of English Wikipedia article titles for discovered topics.
(PDF)
Examples of Spanish Wikipedia article titles for discovered topics (presented in English).
(PDF)
Examples of German Wikipedia article titles for discovered topics (presented in English).
(PDF)
Syntactic complexity for 20 English, Spanish and German topics and independent t-test results comparing contributions by primary and non-primary editors.
(PDF)
We thank the Wikimedia Foundation for providing access to the database on the Wikimedia Toolserver. We also thank the PLOS ONE reviewers and the associate editor for their detailed and insightful comments.