"What else are you worried about?" – Integrating textual responses into quantitative social science research

Abstract

Open-ended questions have routinely been included in large-scale survey and panel studies, yet there is some perplexity about how to actually incorporate the answers to such questions into quantitative social science research. Tools developed recently in the domain of natural language processing offer a wide range of options for the automated analysis of such textual data, but their implementation has lagged behind. In this study, we demonstrate straightforward procedures that can be applied to process and analyze textual data for the purposes of quantitative social science research. Using more than 35,000 textual answers to the question “What else are you worried about?” from participants of the German Socio-economic Panel Study (SOEP), we (1) analyzed characteristics of respondents that determined whether they answered the open-ended question, (2) used the textual data to detect relevant topics that were reported by the respondents, and (3) linked the features of the respondents to the worries they reported in their textual data. The potential uses as well as the limitations of the automated analysis of textual data are discussed.

Introduction

Open-ended questions such as “Would you like to add anything?”, “Is there anything else you would like to tell us?”, “Please tell us anything you think is important” are commonly used as complements in surveys that otherwise rely heavily on closed-ended questions [1]. However, to this day–decades after the collection of such textual answers began–routines have yet to be established for analyzing the unstandardized textual answers–so-called free texts–and for integrating them into quantitative social science research.

On the other hand, the in-depth analysis of textual data such as diaries, discourses, or transcripts of interviews is an established part of qualitative research. The so-called Qualitative Content Analysis (QCA, see [2] for a brief overview and examples of its application) offers a range of techniques to approach the content of a text on different levels, from the gist of the text to subtle references that can be understood only in the broader context of current events and public discourses. Content analysis itself covers a wide range of different strategies. For example, researchers can derive categories of interest from the data itself, from theory, or from prior research. Researchers are also able to focus on the keywords that are identified from the underlying context [3].

While these analytical strategies might aim to fulfill quality criteria that are well known to quantitative researchers, such as reliability and validity [2], their utility is limited when considering the types of answers submitted to open-ended questions on surveys. Open-ended questions on surveys typically generate a large number of short texts, in contrast to the small numbers of long and comprehensive texts that are routinely analyzed in QCA. Respondents in large surveys often provide only one or a handful of words in their answers because of (1) the narrow phrasing of the questions, (2) a lack of motivation to answer exhaustively, (3) space limitations on the questionnaire, or (4) time pressure in the interview situation due to the interviewer's interest in completing the interview quickly. This brevity might make the work that is necessary to apply a thorough QCA appear excessive: The answers are short, but the number of respondents is very high.

O’Cathain and Thomas [1] fittingly characterized the data generated by open-ended questions as neither strictly qualitative nor strictly quantitative, a status the two authors describe as “uncomfortable.” As a reaction to this intermediate state, quantitative researchers who decide to use such textual data often use a strategy of low-key “quantitizing” to solve the issue in the manner to which they are accustomed: Texts are manually replaced by numeric codes representing categories that are supposed to be relevant [4]. Quantitizing is also used in qualitative and mixed methods research and is guided by a number of assumptions and judgments that might be glossed over in the process [4] or, in the case of quantitative research, not even stated explicitly.

But even low-key quantitizing is labor intensive when the number of respondents is very high, as it is when large-sample surveys are used. In this case, automated text analysis offers an effective way to tackle the data. Natural language processing (NLP) is a multidisciplinary field in which the interaction between human language and computers is explored. Tools that have been developed in this domain allow several steps of text analysis to be automated and offer new models for quantifying and investigating the textual data.

Existing approaches

Strategies that have been employed for the automated analysis of free texts in the social sciences can be classified into two broad categories. First, there are strategies that follow a top-down logic, which might also be referred to as deductive methods or closed-vocabulary approaches. Second, there are strategies that follow a bottom-up logic, and these are primarily data driven (i.e. inductive methods, open-vocabulary approaches).

Top-down approaches rely on existing word lists, called dictionaries, that organize certain words, parts of speech (e.g. pronouns), or other textual properties (grammatical structure, punctuation) into specific categories. They are employed in, for example, the field of sentiment analysis, which is the attempt to use NLP to recognize emotions, opinions, and attitudes toward entities in textual data [5]. Sentiment analysis is applied broadly, ranging from observations of the public’s attitudes toward political movements to movie sales predictions (see [5, 6] for various applications), and sophisticated technological approaches have been developed for these purposes (see [6] for a comprehensive survey). Nevertheless, simple keyword spotting (e.g. classifying texts on the basis of unambiguous affect words) is still popular because it is accessible and economical [7].

A specific tool that was created with a top-down approach is the LIWC [8, 9]. This software counts words according to categories that are considered to be relevant for psychological research and resulted from exhaustive preliminary studies. The authors of the software distinguish between content words that convey the content of the communication (e.g. “write,” “scientific,” “paper”) and style words that are needed to form phrases (e.g. “I,” “a,” “and”). The latter make up large parts of written and vocal speech. Tausczik and Pennebaker suggested that style words are more closely linked to people’s social and psychological worlds [8], citing studies that link, for example, pronoun use to relationship quality. The LIWC is a popular and well-established tool amongst psychologists, and its dictionary has been translated into more than 10 languages; the tool has also been praised for its user friendliness [10].

The most evident disadvantage of software that is based on predefined dictionaries is a lack of flexibility. The categories that are employed–no matter how well-validated they are within a specific context–might not cover the aspects of interest in the respective study, might not apply to the specific type of text used, or might miss important information that the researcher is not aware of when pre-specifying the categories of interest.

Bottom-up approaches avoid these problems by deriving relevant categories or properties from textual data itself instead of relying on predefined dictionaries. For example, Schwartz et al. [11] introduced the term open-vocabulary technique to describe their bottom-up analysis of Facebook messages. They analyzed words and phrases consisting of two to three consecutive words (so-called n-grams, e.g. “my children”), as well as topics derived from the texts through Latent Dirichlet Allocation (LDA). LDA is a generative topic model in which texts are assumed to share a certain number of underlying topics, which explain co-occurrences of words within texts [12]. Using this technology, Schwartz et al. found for example that one topic in Facebook status updates was family (“son,” “daughter,” “father,” etc.), and this topic occurred more frequently in the status updates of older users. By contrast, the topic studies (“classes,” “semester,” “college,” etc.) was more relevant to younger users. The study furthermore introduced the term differential language analysis (DLA) to describe how simple ordinary least squares regressions could be used to link word use with characteristics of the author of the text.
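
As an illustration of the open-vocabulary logic, the following minimal Python sketch (not taken from Schwartz et al.) extracts 1- to 3-grams with scikit-learn and relates the use of one n-gram to an author attribute via ordinary least squares. The toy texts, ages, and the chosen n-gram are purely illustrative.

```python
# Open-vocabulary sketch: extract n-grams and run a differential language analysis
import numpy as np
import statsmodels.api as sm
from sklearn.feature_extraction.text import CountVectorizer

texts = ["my children started school", "classes this semester are hard",
         "college exams next week", "my son and daughter visited me"]
ages = np.array([52, 21, 19, 60])

vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
X = vectorizer.fit_transform(texts)                              # documents x n-grams
use = X[:, vectorizer.vocabulary_["my children"]].toarray().ravel()

# Differential language analysis: regress the author attribute on n-gram use
fit = sm.OLS(ages, sm.add_constant(use)).fit()
print(fit.params)   # positive slope here: "my children" indicates older authors
```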

Besides topic modeling, cluster analysis is a second popular bottom-up approach. Whereas topic models attempt to identify topics that underlie the text documents (e.g. topic “family” vs. topic “work”), cluster analysis attempts to sort the text documents into meaningful categories (e.g. documents that fall into the category “family” vs. documents that fall into the category “work”). Thus, both approaches can be applied to find meaningful units in a number of text documents but differ regarding the statistical model and can lead–depending on the features of the text documents–to either similar or diverging results.
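
The following minimal sketch illustrates this contrast on toy documents: k-means assigns each document to exactly one cluster, whereas a topic model would describe each document as a mixture of topics. The documents and the choice of two clusters are illustrative assumptions.

```python
# Cluster analysis sketch: sort documents into categories via k-means on TF-IDF vectors
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["worried about my children and family",
        "my daughter, my son, the whole family",
        "unemployment and finding work",
        "losing my job, no work to find"]

X = TfidfVectorizer().fit_transform(docs)                          # documents x words
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # each document is assigned to exactly one of the two clusters
```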

As an example from political science, Grimmer and King [13] applied not only one but all published cluster analysis methods to find meaningful partitions in press releases of US Senator Frank Lautenberg’s Senate Office, George W. Bush’s 2002 State of the Union address, and randomly drawn Reuters news stories. Their unique method revealed that different algorithms lead to clusterings that can be organized in a two-dimensional space. On the basis of this space of clusterings, they discovered that Lautenberg’s press releases could be organized into four clusters: Credit Claiming, Advertising, and Position Taking–traditionally considered to be the three basic kinds of activities that congressmen engage in [14]–and a new fourth cluster that they labeled Partisan Taunting.

For the implementation of bottom-up approaches, several software packages such as Leximancer [15] and SPSS Text Analytics for Surveys [16] offer social scientists the means to process textual data and run procedures such as topic detection. However, these programs are rather expensive, hinder replication, and lower transparency due to the “black box” character of proprietary software. Noncommercial and open alternatives that offer a wide range of similar functions include the R package tm [17] and collections of tools such as the Apache OpenNLP library [18], the Natural Language Toolkit for Python [19], and the ASV Toolbox [20].

Not all approaches to automated text analysis fall neatly into the idealized categories top-down versus bottom-up. In supervised learning, part of the data needs to be labeled. For example, a certain number of texts are classified by trained human coders. In the next step, an algorithm is trained on the basis of these labeled data to infer the assigned class from the texts, and this trained classifier can then be used to classify new texts automatically. Supervised learning could be considered top-down because the relevant categories are imposed in the first step. However, it derives the relevant features for distinguishing the categories from the text itself in a bottom-up manner. Research on authorship attribution has been drawing on such methods since the 19th century: A number of texts for which the author is known (labeled data) can be used to identify the features that distinguish between multiple candidate authors, with the potential to identify the authorship of documents of uncertain origin. Applications range from the identification of Shakespeare plays to the verification of suicide notes (see [21] for a survey of modern authorship attribution methods).
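
A minimal sketch of this workflow, using scikit-learn on toy data rather than any of the corpora discussed here: a classifier is trained on hand-labeled texts and then applied to an unlabeled one, deriving its features bottom-up from the training texts. The labels and texts are hypothetical.

```python
# Supervised learning sketch: train on hand-labeled texts, then classify new texts
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled_texts = ["worried about my pension", "prices keep rising",
                 "war in the middle east", "terrorism and violence everywhere"]
labels = ["economy", "economy", "conflict", "conflict"]   # assigned by human coders

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(labeled_texts, labels)              # features are learned from the texts
print(classifier.predict(["afraid of another war"]))   # classify an unlabeled answer
```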

As a potential application of supervised learning in the social sciences, Hopkins and King [22] used automated content analysis to investigate opinions about the 2008 US presidential candidates. A few hundred blog posts were hand-coded, ranging from extremely negative to extremely positive. Automated analysis then allowed the authors to estimate opinions in a large corpus of blog posts on the basis of the training data. For example, results revealed a sharp increase in negative opinions about John Kerry following his botched joke that was perceived as an insult to the troops in Iraq in October 2006 (“If you make the most of it and you study hard and you do your homework and you make an effort to be smart, you can do well. If you don't, you get stuck in Iraq”). The authors of this paper wrote an R-package to make the method they developed accessible [23].

Challenges of the analysis of textual data

Despite readily available software solutions and the advances of NLP technologies, core features of human languages still render automated analyses difficult. By combining words, an unlimited number of utterances can be produced, which can be–even if the combination is novel to the listener or reader–easily understood by other humans. This feature has been labeled “productivity” [24]. Productivity is even reflected in object naming tasks: Participants tend to generate variable and sometimes quite inventive answers, a phenomenon that has been described as exuberant responding [25], which can cause issues in research on speech production in standardized experiments. It seems plausible that variability in answers increases even more when proceeding from simple tasks to questions addressing a respondent’s social life, problems, interests, living conditions, and so forth. To the human recipient, this does not cause any issues in most cases. For example, one can easily see that the phrases “My wife doesn’t let me meet my friends” and “Spouse’s impact on friendships” refer to a similar problem. However, it is a major challenge for automated analyses to detect semantic similarities between two such answers that share hardly any common substrings.

Data pre-processing can tackle variability in human languages to a certain extent. For example, we might be able to reduce “friends” and “friendships” in the two phrases to a common word stem to gain the insight that both strings have something to do with “friend.” In an even more sophisticated approach, we might be able to automatically look up “spouse” and “wife” in a dictionary–in this case, more specifically in a WordNet that groups words and labels their semantic relations–and figure out that they refer to similar concepts. Another source of variability that can be reduced through NLP is flawed data input. Respondents might write down their answers themselves or dictate them to the interviewer. In both cases, spell-checking can become a necessary data pre-processing step because pairs of non-matching strings such as “mispelling” and “misspelling” cannot be mapped onto each other. Increasing homogeneity through pre-processing with these kinds of steps improves simple analyses such as word counts because words carrying similar or identical semantics can be identified as such even when the strings are not strictly identical.
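
The following minimal sketch illustrates these normalization steps with English examples and the NLTK library (the study itself worked with German texts and different tooling); the correction table is a hypothetical stand-in for a spell-checking step, and the WordNet lookup stands in for the dictionary-based mapping of related concepts.

```python
# Normalization sketch: stemming, a small correction table, and a WordNet lookup
import nltk
from nltk.corpus import wordnet
from nltk.stem.snowball import SnowballStemmer

nltk.download("wordnet", quiet=True)   # lexical database used below

# Stemming maps inflected forms onto a shared stem
stemmer = SnowballStemmer("english")
print(stemmer.stem("worries"), stemmer.stem("worried"))    # -> worri worri

# A small ad hoc correction table for known misspellings
corrections = {"mispelling": "misspelling"}
print(corrections.get("mispelling", "mispelling"))

# WordNet can reveal that "wife" and "spouse" refer to related concepts
wife, spouse = wordnet.synsets("wife")[0], wordnet.synsets("spouse")[0]
print(wife.path_similarity(spouse))    # larger values indicate semantically closer concepts
```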

However, this “normalization” might also lead to a loss in the richness of the original answers and their individual style. There is no convention that governs the extent to which free texts should be altered, and the steps of pre-processing largely depend on the aim of the analysis. For example, so-called stop words such as “the” and “and,” which are typically the most frequent words in any language, are often dropped before the automated classification of textual data because they are not supposed to carry any significant content [26]. Yet these words strongly overlap with the concept of style words that are considered of special importance from the psychological perspective of the LIWC [8] and have been found to be the best features for discriminating between authors in the context of authorship attribution [27]. A one-size-fits-all solution for automated text analysis is currently neither available nor attainable: A suitable analysis strategy always depends on the properties of the text and the aim of the analysis.

Aim of this paper

The aim of this study is to demonstrate pragmatic analytical strategies for free texts generated by open-ended questions as such questions are frequently included in large social science surveys. We offer guidelines on how to make use of such data in the context of quantitative social science research. The analyses follow a data-driven approach so that the results do not rely on pre-defined categories of interest. They draw on statistical procedures that are well-known in the social sciences (e.g. regression models such as OLS and logit models) and existing tools used by the NLP community. Grammar and word order are ignored in all analyses of textual data, following a so-called bag-of-words approach. Instead of pinning down certain software solutions, we will emphasize the concepts and the rationale behind the steps used in the analysis of textual data.

To illustrate this process, we analyzed textual answers to the question “What else are you worried about?” from a large-scale longitudinal study. The open-ended question followed a block of closed-ended questions about worries. The free texts were typically short, ranging from single words to brief lists or simple sentences. Fig 1 presents an overview of the steps used in data pre-processing and the analyses that were applied in this study. Analysis code can be accessed via the Open Science Framework (https://osf.io/aj3bn/). We attempted to answer three broad research questions that are of general interest when analyzing textual data from survey studies:

  1. Which respondents make use of the open-ended questions?
  2. Which topics can be found in the answers to these questions?
  3. How are free texts linked to respondents’ characteristics?

These questions can be rephrased more specifically given the data at hand: Who reports worries in a textual format? What are respondents worried about? And who worries about what?

Method

Data

The data came from the Socio-Economic Panel (SOEP), a representative prospective multi-cohort study of people living in private households in Germany [28]. The SOEP, a research infrastructure unit of the Leibniz Association (http://www.leibniz-soep.de), is located at the German Institute of Economic Research (DIW Berlin), and the data are collected by the commercial fieldwork organization TNS Infratest Sozialforschung (Munich). SOEP data have been gathered annually since 1984, and the sample has been refreshed several times to ensure representativity. In this study, we used data collected from the years 2000 to 2011, yielding a total of 261,894 records (i.e. completed questionnaires) from 44,506 individuals. On average, there were 21,800 records per year, with a minimum of 19,127 in 2010 and a maximum of 24,576 in 2000.

Variables

Mode.

The SOEP employs different modes of interviewing, mainly CAPI (Computer Assisted Personal Interview, 30.25% of all of our observations), oral interview (25.17%), written questionnaire in the presence of an interviewer (24.85%), and written questionnaire sent via mail (12.67%). The interviewing mode was coded as either (1) oral (oral interview, CAPI, phone interview), (2) written (questionnaire with or without the presence of an interviewer, written, sent via mail), or (3) mixed/other (with interviewer assistance, oral and written, proxy).

Gender and age.

Respondents’ gender was assessed via self-report and coded 0 if female and 1 if male. Age was derived from respondents’ year of birth and the survey year. To disentangle between-subjects age effects from within-subjects age effects, we generated two variables: The mean age of the person across all of his or her records to capture between-subjects variability and the difference between a person’s current age and his or her mean age to capture within-subjects variability [29].
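
A minimal pandas sketch of this person-mean centering; the column names (pid, age) are hypothetical, not the original SOEP variable names.

```python
# Within/between decomposition: person mean plus yearly deviation from that mean
import pandas as pd

df = pd.DataFrame({"pid": [1, 1, 1, 2, 2],
                   "age": [40, 41, 42, 60, 61]})

df["age_between"] = df.groupby("pid")["age"].transform("mean")   # person's mean age
df["age_within"] = df["age"] - df["age_between"]                 # deviation in a given year
print(df)
```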

Region.

The variable “sample region” was used to differentiate between individuals living in East and West Germany. The variable was entered into the analyses as a binary variable, representing the region in which the respondent had spent the majority of the survey years.

Level of education.

Information on the highest level of education was available for multiple years. The most recent information was used as the indicator of the participant’s level of education. In Germany, children are separated into one of multiple tracks of secondary education after elementary school, which leads to a number of different school leaving qualifications. In the multilevel analysis of selection effects, education was included as a categorical variable with the levels (1) “no degree,” (2) lower secondary education (final examination after Grade 9, Hauptschulabschluss), (3) middle secondary education (final examination after Grade 10, Mittlere Reife/Realschulabschluss), (4) intermediate higher secondary education (final examination after Grade 11 or 12; entitled to study at a University of Applied Sciences, Fachhochschulreife), and (5) higher secondary education (final examination after Grade 12 or 13; entitled to study at all types of universities, Abitur), plus the categories “no degree yet” and “other degree.” For correlational analyses, individuals with “no degree yet” or “other degree” were dropped, and the variable was treated as ordinal, with level of education increasing from (1) to (5).

Immigration background.

Respondents’ history of migration was originally reported in four categories: “no migration background,” “direct migration background” (born outside of Germany), “indirect migration background” (born in Germany, at least one parent born outside of Germany), and “migration background, not further differentiated” if there was no information on whether the respondent was born in Germany or not. The last category accounted for only 0.06% of the observations, which were recoded into “indirect migration background” to simplify the analyses.

Personality.

Personality was assessed in 2005 and 2009 with a brief personality questionnaire (BFI-S [30]). The BFI-S consists of 15 items answered on a 7-point scale to capture the five broad personality dimensions extraversion, emotional stability, agreeableness, conscientiousness, and openness to experience. Scale reliability ranged from α = .50 (agreeableness) to α = .68 (openness to experience). Due to the brevity of the BFI-S, scales were computed only when all 15 items had been answered; for the 0.5% of observations in which respondents had answered only parts of the BFI-S, no personality scores were calculated. If self-reported personality was available for both points of measurement, we averaged the scores.

Life satisfaction.

Life satisfaction was assessed every year with a single item on an 11-point scale. As described for age, within-subjects centering was used to derive two variables, the mean level of a person’s life satisfaction (ignoring missing records) and the deviation from that mean level in a specific year.

Closed-ended worry items.

Respondents answered a number of closed-ended questions regarding their worries about various subjects on a 3-point scale (“not worried at all,” “somewhat worried,” and “very worried”). Nine items were included in all surveys from 2000 to 2011 (worries about the general economic situation, personal financial situation, personal health, job security, protection of the environment, peace, development of criminality in Germany, immigration to Germany, and hostility toward foreigners in Germany). Additional items addressing current issues were included intermittently over the course of the study (e.g. worries about the introduction of the Euro, global terrorism, the stability of the financial markets, and the security of nuclear power plants).

We averaged eight of the nine items that were asked on every survey to form a score of reported worries that was comparable across survey waves. The item regarding worries about job security was excluded because it applied only to the subset of individuals who were employed. Scores were computed if at least seven of the eight items were answered; no score was computed for the 0.47% of observations in which only one to six items were answered or for the 1% of observations in which none of the eight worry items were answered. The resulting scale had an acceptable reliability coefficient (α = .73, pooled across all waves). Again, within-subjects centering was used to derive two variables, a person’s mean level of reported worries and the person’s deviation from that level in a specific year.
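
A minimal pandas sketch of the scoring rule, with hypothetical item names w1 to w8 and made-up item responses.

```python
# Worry score sketch: mean of eight items, computed only if at least seven were answered
import numpy as np
import pandas as pd

items = [f"w{i}" for i in range(1, 9)]            # hypothetical names of the eight items
df = pd.DataFrame([[1, 2, 3, 1, 2, 1, 1, 2],
                   [3, 3, np.nan, 2, 3, 3, 2, 3],
                   [1, np.nan, np.nan, 1, 1, 1, 1, 1]], columns=items)

answered = df[items].notna().sum(axis=1)
df["worries"] = df[items].mean(axis=1).where(answered >= 7)   # missing if < 7 items answered
print(df["worries"])
```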

Textual data

After receiving the block of worry items, participants were asked whether they had any other worries. Answers to this open-ended question (written by participants or transcribed by interviewers) were cleaned and prepared for analysis in Java. The response language was German, which has more inflections than English (e.g. words are modified according to case and gender) and uses more suffixes, and therefore poses particular challenges.

Texts were set to lowercase because capitalization had been used inconsistently by respondents and interviewers, and information about part of speech (which is related to capitalization in German) was not considered in further analyses. Texts were then tokenized (i.e. broken down into single words) by applying the OpenNLP tokenizer [18]. Character encoding varied between different waves of the panel, and thus, encoding needed to be unified to prevent special characters from being misrepresented. Furthermore, respondents and interviewers used abbreviations because of the space limitations on the questionnaire. We thus assembled an ad hoc list of common abbreviations (e.g. “soz.” to “sozial,” social; “dtl” to “Deutschland,” Germany; see S1 List) and the most conspicuous spelling errors (e.g. “standart” to “standard,” standard) and replaced the strings accordingly.

In the next step, stop words were removed from the data. Note that there is no such thing as a universal or official stop word list. In this study, we used a German stop word list that was based on the Leipzig Corpora Collection [31]. Finally, words were reduced to their word stem (e.g. “politischen,” political, and “politiker,” politicians, to “polit”; “kind,” child, and “kinder,” children, to “kind”) by applying the German Snowball stemmer [32, 33] but were then re-expanded (e.g. “polit” to “politik,” politics, and “kind” to “kinder,” children) by applying a custom list to improve readability (see S2 List).
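
A minimal Python sketch of this pipeline (the original pre-processing was implemented in Java with the OpenNLP tokenizer); the abbreviation table and the regex-based tokenizer are small stand-ins for S1 List and the OpenNLP tokenizer, and the NLTK stop word list stands in for the Leipzig-based list.

```python
# Pre-processing sketch: lowercasing, tokenizing, abbreviation expansion,
# stop word removal, and German Snowball stemming
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download("stopwords", quiet=True)

abbreviations = {"soz.": "sozial", "dtl": "deutschland"}   # stand-in for S1 List
german_stops = set(stopwords.words("german"))              # stand-in for the Leipzig-based list
stemmer = SnowballStemmer("german")

def preprocess(answer: str) -> list[str]:
    text = answer.lower()                                  # unify capitalization
    tokens = re.findall(r"[a-zäöüß.]+", text)              # simple stand-in tokenizer
    tokens = [abbreviations.get(t, t).strip(".") for t in tokens]
    tokens = [t for t in tokens if t and t not in german_stops]
    return [stemmer.stem(t) for t in tokens]               # e.g. "politischen" -> "polit"

print(preprocess("Soz. Ungerechtigkeit und die Politiker in Dtl"))
```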

Fig 2 shows a word cloud of the tokenized but not further edited texts and visually represents the “raw” textual data. Note that two definite articles, “der” and “die,” are the most frequent words, followed by several other stop words such as “in” (in) and “und” (and). Furthermore, multiple wrongly encoded special characters, displayed as question marks within words, are visible. Fig 3 shows a word cloud that represents the prepared data that were used in further analyses. Note that at this point, meaningful words dominated the cloud, e.g. “kinder” (children), “zukunft” (future), and “politik” (politics). The contrast between these two word clouds illustrates how the pre-processing steps eliminated irrelevant words and flawed strings.

Fig 2. Word cloud of “raw” texts, tokenized but not otherwise processed.

https://doi.org/10.1371/journal.pone.0182156.g002

Fig 3. Word cloud of free texts after data pre-processing.

https://doi.org/10.1371/journal.pone.0182156.g003

Words were translated into English just before the visual representations were created so that the translation process had no impact on the results of any analysis. We used a manually compiled ad hoc list (see S3 List), created with the help of several online German-English dictionaries. The translation is not a one-to-one mapping and can thus lead to a number of peculiarities. For example, German compound words (e.g. “lehrstelle,” apprenticeship position) translate into multiple words that would have been split by tokenization if the analyses had been run on English textual data, and in some cases, multiple German words were mapped onto the same English word (e.g. “lehrstelle” and “ausbildungsplatz,” both apprenticeship position). In addition, the different levels of inflection and other specifics of the translation from German to English rendered the process somewhat imprecise. However, we considered this step necessary to illustrate the results for a broader audience.

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation is a general probabilistic model that can be used for collections of texts [12]. Every document (i.e. textual answer) is modeled as a combination of topics, and topics are modeled as distributions of words. LDA requires a number of parameters to be predefined, that is, α, the Dirichlet prior on the per-document topic distributions; β, the Dirichlet prior on the per-topic word distribution; and the number of topics to be modeled. The Dirichlet priors influence whether a document may contain more or fewer topics and whether topics tend to be distinct or contain a mixture of most words. Typical quantitative measures of model fit that help to determine the adequate number of topics are based on likelihood. However, human ratings of the coherence of the derived topics do not necessarily line up with these statistics [34]. In this study, multiple analyses were run, modeling one up to 100 topics. Priors were fixed to α = 50/No. of topics and β = 0.1 as suggested by Griffiths and Steyvers [35].
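
A minimal sketch of fitting such a model with the priors used here (α = 50/number of topics, β = 0.1), using the gensim library rather than the original tooling; the token lists are illustrative stand-ins for the pre-processed answers.

```python
# LDA sketch: fixed Dirichlet priors and a pre-specified number of topics
from gensim import corpora, models

token_lists = [["kinder", "zukunft"], ["politik", "rente"], ["krieg", "terror"]]
dictionary = corpora.Dictionary(token_lists)
corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]

n_topics = 15
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=n_topics,
                      alpha=50 / n_topics, eta=0.1,    # Dirichlet priors
                      passes=10, random_state=0)

print(lda.show_topic(0, topn=10))                 # per-topic word distribution
print(lda.get_document_topics(corpus[0]))         # per-document topic distribution
```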

Correlational analyses

Schwartz et al. [11] suggested correlational analyses to identify words that potentially distinguish between respondents with different features (e.g. words that indicate older age, low extraversion, etc.). Their model is a simple ordinary least squares regression with the variable of interest (e.g. life satisfaction) as the dependent variable and a binary indicator of whether a specific word (e.g. “job”) was used by the respondent as the independent variable. Furthermore, age and gender are included as independent variables to control for potential confounding. We followed their approach but chose the statistical model according to the level of measurement of the dependent variable: ordinary least squares regressions for the continuous dependent variables age, life satisfaction, extraversion, emotional stability, agreeableness, conscientiousness, and openness to experience; logit regressions for the binary dependent variables gender and sample region; and an ordered logistic regression for the ordinal dependent variable education. Two control variables, gender and age, were included unless the respective variable was the dependent variable in the analysis. The outcome of interest was the standardized regression coefficient of the word use variable or the semi-standardized regression coefficient in the case of logistic regression.

We ran analyses for all 243 words that appeared at least 50 times in the cleaned textual data. All analyses were run twice: one time on the subsample of people who answered the open-ended question and a second time on the full sample, thus including individuals who did not answer (coded as 0 for the use of every word that was analyzed).
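
A minimal sketch of this word-level procedure, assuming a data frame with a column of pre-processed token lists and hypothetical covariate names; unlike the reported analyses, the sketch returns unstandardized coefficients and covers only one continuous dependent variable.

```python
# Word-level correlational analysis sketch: frequency filter, then OLS per word
import pandas as pd
import statsmodels.formula.api as smf

def word_level_models(df, min_count=50):
    """df: one row per record with a 'tokens' list plus the covariates
    life_satisfaction, age, and gender (hypothetical column names)."""
    counts = pd.Series([w for toks in df["tokens"] for w in toks]).value_counts()
    coefs = {}
    for word in counts[counts >= min_count].index:
        df = df.assign(uses_word=df["tokens"].apply(lambda t, w=word: int(w in t)))
        fit = smf.ols("life_satisfaction ~ uses_word + age + gender", data=df).fit()
        coefs[word] = fit.params["uses_word"]   # the study reports standardized coefficients
    return pd.Series(coefs)
```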

Furthermore, we ran correlational analyses with topic occurrence as modeled by the LDA as an independent variable. LDA returns a distribution of topics for every textual answer; the probabilities of topic occurrence sum to 1. To simplify the analyses, occurrence of a specific topic within a text was dichotomized on the basis of the probability of topic occurrence. Values equal to or larger than .3 were coded as topic occurrence (1) and values below .3 as no topic occurrence (0). Preserving the original outcome variables (the probabilities for each topic within each text) led to virtually the same results in all of the following analyses. However, the results were easier to interpret when they were based on dichotomous topic occurrence variables instead of continuous topic occurrence probabilities. Apart from the different nature of the independent variable (topic occurrence instead of word occurrence), the analyses of topics paralleled the correlational analyses of single words with regard to the statistical model and control variables.
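
A minimal sketch of the dichotomization step, assuming a records-by-topics matrix of LDA probabilities (illustrative values).

```python
# Dichotomize per-document topic probabilities at the .3 threshold
import numpy as np

doc_topics = np.array([[0.70, 0.20, 0.10],
                       [0.25, 0.45, 0.30]])          # rows sum to 1
topic_occurrence = (doc_topics >= 0.3).astype(int)   # 1 = topic counted as occurring
print(topic_occurrence)   # [[1 0 0]
                          #  [0 1 1]]
```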

Results

Selection effects

Which respondents made use of the open-ended question, or more specifically: Which variables determined whether a respondent answered or ignored the question? We ran multilevel binary logistic regressions predicting the binary outcome answer/no answer to investigate selection effects (see Table 1). All analyses of selection effects were run on the sample of 221,881 records of 25,952 respondents in which all variables included in the final model were non-missing.

Table 1. Results of binary logistic multilevel regressions predicting responses to the open-ended question, including 222,165 records from 25,978 individuals.

https://doi.org/10.1371/journal.pone.0182156.t001
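
A minimal sketch of a random-intercept logistic regression of this kind, fit on simulated data with hypothetical variable names via statsmodels; the original models were not fit in Python, and the sketch includes only a subset of the predictors listed above.

```python
# Multilevel (random-intercept) logistic regression sketch: records nested in persons,
# predicting whether the open-ended question was answered
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({"answered": rng.integers(0, 2, n),   # 1 = answered the open-ended question
                   "male": rng.integers(0, 2, n),
                   "age_within": rng.normal(0, 3, n),
                   "east": rng.integers(0, 2, n),
                   "pid": rng.integers(0, 80, n)})       # person identifier

model = BinomialBayesMixedGLM.from_formula(
    "answered ~ male + age_within + east",
    {"person": "0 + C(pid)"},                            # random intercept per person
    df)
print(model.fit_vb().summary())                          # variational Bayes fit
```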

The first model included basic demographic variables (gender, age, sample region, education) and the mode of data assessment (survey mode). Men were less likely to answer the question than women (OR = 0.87, p < .001). Regarding age, the between- and within-subjects effects showed different trends: Although older respondents were more likely to answer than younger respondents (OR = 1.01 per year, p < .001), an individual became less likely to answer over time (OR = 0.98 per year, p < .001). Respondents from East Germany had remarkably higher odds of answering the question (OR = 1.36, p < .001). An overall test revealed that level of education had a significant effect on answering behavior, Χ2(6) = 335.45, p < .001, and the pattern of odds ratios relative to the comparison group with the lowest secondary school leaving qualification showed that respondents with higher levels of education were more likely to answer. Individuals who had immigrated to Germany were less likely to answer the question (OR = 0.79, p < .001), yet the children of immigrants were–if at all–slightly more likely to answer the question (OR = 1.13, p = .060). Last, answering behavior varied with survey mode: Respondents were more likely to answer the question in a verbal interview than on a written questionnaire (OR = 0.90, p < .001) or in a mixed survey mode (OR = 0.86, p = .001).

The second model additionally incorporated life satisfaction to test whether this subjective indicator predicted answering behavior above and beyond the objective variables entered in the first model. Life satisfaction had effects on answering behavior beyond the variables included in the first model, Χ2(2) = 412.90, p < .001. Both the between- and within-subjects effects were significant. Respondents who were on average more satisfied with their lives than other respondents were less likely to provide an answer to the open-ended question about worries (OR = 0.85 per each point of life satisfaction, p < .001); moreover, respondents became less likely to answer the question when they became more satisfied over the years (OR = 0.91 per each point of life satisfaction, p < .001).

Last, we tested whether personality variables that are not directly linked to worries additionally influenced individuals’ answering behavior. The third model added the Big Five personality traits extraversion, emotional stability, agreeableness, conscientiousness, and openness to experience. Including the personality traits significantly improved the model over the previous version (Χ2(5) = 529.75, p < .001). Most remarkably, individuals who had reported higher levels of openness to experience were more likely to provide a textual answer (OR = 1.33 per SD, p < .001). Furthermore, individuals who were more emotionally stable were less likely to answer (OR = 0.88 per SD, p < .001). A comparatively small effect was found for extraversion: More extraverted respondents were slightly more likely to answer (OR = 1.07 per SD, p < .001).

Relationship between open- and closed-ended questions.

In an additional analysis, we tested whether a higher worry score from the closed-ended questions was associated with answering the open-ended question. In the analysis including only the worry score as a predictor, both the between- and within-subjects effects were statistically significant. Respondents who endorsed more worry items were also more likely to answer the open-ended question about worries: A difference of 1 point (e.g. answering all worry items with “somewhat worried” vs. answering all items with “not worried at all”) doubled the odds of answering the open-ended question (OR = 2.01, p < .001). Likewise, an individual who ticked more worry items than he or she had ticked in other years was also more likely to answer the open-ended question in the respective “worrisome” year (OR = 1.86, p < .001). Including socio-demographic variables, life satisfaction, and the Big Five personality traits (i.e. all variables included in the third model described above) decreased the between-subjects effect of the worry score (OR = 1.57) but left the within-subjects effect nearly unchanged (OR = 1.83). Both effects remained significant (ps < .001). Thus, reports of worries across different answer formats shared common variance between subjects as well as within subjects, a finding that persisted after accounting for several socio-demographic and personality variables.

Topic detection

Topic model.

What else were the people worried about? To answer this question, we first had to decide how many topics should be modeled. Fig 4 displays the log likelihood of the data given the topic model: From left to right, the model fit improved, but parsimony decreased as the number of topics increased. On the basis of these numbers, we estimated that at least 10 topics were necessary to cover the steepest increase in model fit. We then individually examined the per-topic word distributions and stopped at the model consisting of 15 topics, which yielded the desired degree of abstraction. Jacobi et al. [36] compared the decision about the number of topics to the decision about the number of factors in a factor analysis: The goal is to reduce the number of dimensions effectively (i.e. to find a parsimonious model) but also to lose as little information as possible (i.e. to achieve high model fit). A more objective approach for determining the adequate number of topics might be desirable, but, as mentioned before, statistical approaches do not necessarily lead to results that are aligned with human judgments of coherence [34].
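
A minimal sketch of this model comparison, using the gensim library on illustrative token lists; the variational bound serves as the likelihood-based fit criterion, and in practice the numeric comparison was complemented by inspecting the per-topic word distributions by eye.

```python
# Compare LDA models with different numbers of topics via their likelihood bound
from gensim import corpora, models

token_lists = [["kinder", "zukunft"], ["politik", "rente"], ["krieg", "terror"],
               ["arbeit", "arbeitslosigkeit"], ["gesundheit", "familie"]]
dictionary = corpora.Dictionary(token_lists)
corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]

for k in (1, 5, 10, 15, 25, 50, 100):
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=k,
                          alpha=50 / k, eta=0.1, random_state=0)
    print(k, lda.bound(corpus))    # higher bound = better fit but less parsimony
```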

Fig 4. Log likelihood of LDA models, depending on the number of topics chosen.

https://doi.org/10.1371/journal.pone.0182156.g004

Fig 5 shows a selection of the resulting topics represented as word clouds. The size of the word represents the probability of a word appearing within the respective topic (i.e. the per-topic word distribution). We chose to present Topics 13 and 14 because they illustrate how topics that are centered around the same word (politics) can capture very different sentiments; we will later discuss Topics 4 and 15 with respect to their trends over time. Word clouds for all 15 topics can be found on the OSF project page (https://osf.io/aj3bn/). Topic labels were derived from the most frequently occurring words within a topic. Table 2 contains the 10 most relevant terms for each topic and the proportion of texts in which the topic occurred, according to the final LDA model.

Fig 5. Four of the 15 topics derived through LDA topic modeling.

https://doi.org/10.1371/journal.pone.0182156.g005

Table 2. Labels of the topics derived via LDA topic modeling, proportion of texts in which the topic occurred, and the most relevant terms.

https://doi.org/10.1371/journal.pone.0182156.t002

Topics over time.

The occurrence of the derived topics changed over time. We calculated the measure of variability among proportions proposed by Coffey et al. [37]–the so-called Coffey-Feingold-Bromberg (CFB) measure [38]–to identify which topics fluctuated most strongly across the waves. A CFB of 0 indicates that there is no variability across the set of proportions (i.e. equal distribution), whereas a CFB of 1 indicates the maximum variability possible given the mean proportion. The CFB across survey years for each topic is displayed in Fig 6.

Fig 6. Variabilities (Coffey-Feingold-Bromberg measure) of the occurrence of topics across the years.

https://doi.org/10.1371/journal.pone.0182156.g006

We identified the two topics with the highest CFB, Topic 15 (“War and terrorism,” CFB = .121) and Topic 4 (“Rising prices,” CFB = .089), and plotted their occurrence over time (see Fig 7). “War and terrorism” apparently peaked three times: in 2003 (coinciding with the onset of the Iraq War), in 2006 (coinciding with, e.g., the Lebanon War), and again at the end of the interval we investigated, in 2011, when the Syrian Civil War started. Worries about “Rising prices” peaked in 2008, coinciding with the so-called Great Recession that followed the 2007 Financial Crisis.

Fig 7. Time course of the two topics with the highest variabilities across survey years, Topic 15 (War and terrorism) and Topic 4 (Rising prices).

https://doi.org/10.1371/journal.pone.0182156.g007

Topics and closed-ended worry items.

How were worry topics that were reported as free text related to worries that were reported in closed-ended questions? Considering that both the closed-ended items and the open-ended question asked about worries, one would expect a certain overlap between worry topics and worry items. To analyze this, we calculated the relative risk of topic occurrence in the textual answer by comparing people who answered a closed-ended question as being “very worried” with people who reported being only “somewhat worried” or “not worried at all” about that subject. The analyses included only respondents who answered the open-ended question. Fig 8 shows the results. In many cases, the worry items were most strongly associated with similar or obviously related worry topics. For example, individuals who reported being very worried about their financial situation were 2.26 times more likely to write about the topic “Finding employment” in their textual answer. Similarly, respondents who were very worried about immigration to Germany were 2.40 times more likely to write about the topic “Foreigners in Germany,” and respondents who were very worried about peace were 1.58 times more likely to write about the topic “War and terrorism.”
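
A minimal sketch of the relative risk computation on toy data with hypothetical column names.

```python
# Relative risk of topic occurrence: "very worried" respondents vs. all others
import pandas as pd

# very_worried: 1 = "very worried" on the closed-ended item; topic: 1 = topic occurred in the text
df = pd.DataFrame({"very_worried": [1, 1, 1, 0, 0, 0, 0, 0],
                   "topic":        [1, 1, 0, 1, 0, 0, 0, 0]})

risk_very = df.loc[df["very_worried"] == 1, "topic"].mean()   # 2/3
risk_rest = df.loc[df["very_worried"] == 0, "topic"].mean()   # 1/5
print(risk_very / risk_rest)    # relative risk of topic occurrence, here about 3.3
```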

Fig 8. Relationships between reports of worries in closed-ended questions regarding various subjects (Panels A to I) and topic occurrence in free texts (Topics 1 to 15).

Topics with the highest and lowest relative risk are labeled for each item.

https://doi.org/10.1371/journal.pone.0182156.g008

The influence of parameter values.

In addition to testing the topic model we reported above, we ran a number of alternative specifications to probe the influence of the Dirichlet priors and the number of topics. Assuming only two topics led to a very robust solution that was hardly influenced by the choice of the Dirichlet priors: One topic most visibly centered around future and children, and a second topic centered around politics. Increasing the number of topics to 100 still yielded multiple topics with high face validity, such as a topic centered specifically around the introduction of the euro and the subsequent currency devaluation (summarized by the portmanteau “Teuro,” a contraction of “teuer,” expensive, and euro). However, many of the topics also seemed to capture idiosyncrasies of the respondents and the survey years, which were not of interest in the present study. For example, one topic was defined by various epidemics and maladies in combination with the words Jesus and Christ. The model with 100 topics also seemed most sensitive to changes in the Dirichlet priors.

When we held the number of topics constant at 15, the choice of different Dirichlet priors had a visible but weak influence on the resulting topics. As an arbitrary example, changing α from 3.33 to 0.01 and β from 0.1 to 0.80 resulted in one rather peculiar topic described by questionnaire, data privacy, work, bird flu, and wife, but at least 12 of the topics could still be mapped onto the original solution by visual inspection. This indicates that, given the chosen number of topics, priors that fell within a reasonable range did not substantially affect the conclusions.

Correlational analyses

Word-level analyses.

How are the free-text answers linked to respondents’ features such as socio-demographic variables and personality? Results of correlational analyses between the use of single words and features of the respondent are displayed in word clouds in which size reflects the frequency of the word, and horizontal position and color both reflect the strength and direction of the association (see Figs 9 and 10). Words are displayed if the correlation between word use and the respective dependent variable met the statistical significance threshold of p < .05 (Bonferroni-corrected; accounting for the number of words, N = 243, tested per dependent variable). Word size increases with frequency in a monotonic but non-linear fashion to keep the figures readable despite the large range of word frequencies.

Fig 9. Results of word-level correlational analyses linking the use of single words with features of the respondents within the subsample who answered the open-ended question.

Size reflects the frequency of the word across all answers; horizontal position and color reflect both the strength and direction of the associations; all displayed words are significant at p < .05 (Bonferroni-corrected). (A) Results of linear regressions predicting age. (B) Results of binary-logistic regressions predicting gender. (C) Results of binary-logistic regressions predicting sample region. (D) Results of ordered logistic regressions predicting education.

https://doi.org/10.1371/journal.pone.0182156.g009

Fig 10.

Results of word-level correlational analyses linking the use of single words with life satisfaction (A) in the subsample that answered the open-ended question and (B) relative to the full sample. Size reflects the frequency of the word across all answers; horizontal position and color reflect both the strength and direction of the associations; all displayed words are significant at p < .05 (Bonferroni-corrected).

https://doi.org/10.1371/journal.pone.0182156.g010

Consider, for example, the words that were correlated with age (within the subsample of respondents who provided a textual answer), Fig 9A. The words that showed the strongest positive association with age–grandchild (“Enkel”) and grandchildren (“Enkelkinder”), close to the right edge of the figure–seem highly plausible given that age increases the chances of having grandchildren. However, we can also see that a wide range of political and societal worries (e.g. regarding politics, unemployment, government, youth as in youth unemployment, moral, and dishonesty) were associated with older age. On the left side of the plot, we see, by contrast, words that were negatively correlated with age (i.e. that were more typical of younger respondents). The younger respondents in the sample seemed to be more concerned about their employment and future prospects (find, job, apprenticeship place, tuition fee) and also about their children and family.

Other interesting patterns emerged, for example, between sample region and worry words. Whereas only a handful of words mostly referring to the private sphere (e.g. health, future, parents) were indicative of West German respondents, East German respondents spoke or wrote more about structural issues such as unemployment, emigration, and development. The words east and west showed the highest association with the sample region East Germany, and a cursory string search revealed that these two words often occurred together in textual answers that addressed disparities between East and West Germany.

Thus far, we have presented the results of analyses that included only respondents who answered the open-ended question. However, in the presence of selection effects, the strength of correlations between words and other variables sometimes differed when individuals who did not answer the question were included. To demonstrate such discrepancies, Fig 10 shows the results of correlational analyses of life satisfaction (A) in the subsample that answered the question and (B) in the full sample, thus including respondents who filled out the survey but did not answer the open-ended question.

The analyses in the subsample that answered the question (Fig 10A) show that roughly equal numbers of worry words were negatively and positively correlated with life satisfaction. Words that express worries about the circumstances of living and financial security were negatively associated with life satisfaction (e.g. work, unemployment, job, financial, dwelling). By contrast, the use of the word children or of words expressing worries about values (decline in values, moral, values, egoism) seemed to be indicative of higher life satisfaction. The same analysis run on the full sample (i.e. including respondents who did not answer the open-ended question) offered quite a different picture (Fig 10B). Here, the words that were negatively correlated with life satisfaction vastly outnumbered the words that were positively correlated. This discrepancy can be attributed to the selection effects that led to differences between individuals who answered the open-ended question and individuals who did not. The individuals who used the word children might have been more satisfied with their lives in comparison with the people who answered the question but wrote about something else. However, they were not more satisfied with their lives than the average of the complete sample, which included the large number of people who did not provide any answer to the question and who were, on average, more satisfied than the people who did answer. Such predictable discrepancies arose whenever the dependent variable of interest was significantly associated with response behavior (i.e. significant effects in the analysis of selection effects).

Correlational analyses between the use of single words and the Big Five personality traits yielded only a few significant results except for openness to experience, where we observed positive correlations with words regarding worries about politics and society (e.g. politics, people, decline in values, intolerance) and negative correlations with words regarding, for example, the employment situation (unemployment, work) and family members (wife, husband, children). A visual representation of the word-level correlational analyses of the personality traits can be found in S1 Fig.

Numerical results of all correlational analyses on the word level (including results of analyses on the full sample not pictured here) can be found in S1 Table.

Topic-level analyses.

Figs 11 and 12 show the results (i.e. the regression coefficients) of the topic-level correlational analyses. As described earlier, these analyses included topic occurrence as the independent variable but otherwise used the same model as the word-level analyses. Coefficients are highlighted in color if they were significant at p < .05 (Bonferroni-corrected; accounting for the number of topics, N = 15, tested per dependent variable); the topics with the highest and lowest coefficients are labeled by dependent variable.

Fig 11. Results of correlational analyses linking features of the respondent and topic occurrence in the textual answer within the subsample that answered the open-ended question.

Topics with the highest and lowest coefficients are labeled for each feature; red bars indicate significant results at p < .05 (Bonferroni-corrected). (A) Results of linear regressions predicting age. (B) Results of binary-logistic regressions predicting gender. (C) Results of binary-logistic regressions predicting sample region. (D) Results of ordered logistic regressions predicting education.

https://doi.org/10.1371/journal.pone.0182156.g011

Fig 12. Results of correlational analyses linking the personality of the respondent and topic occurrence in the textual answer.

Topics with the highest and lowest coefficients are labeled. Red bars indicate significant results at p < .05 (Bonferroni-corrected). (A) Results of the linear regressions predicting life satisfaction in the subsample that answered the open-ended question. (B) Results of the linear regressions predicting life satisfaction in the full sample, including respondents who did not answer the open-ended question. (C)-(G) Results of the linear regressions predicting the Big Five personality traits in the subsample that answered the open-ended question.

https://doi.org/10.1371/journal.pone.0182156.g012

Regarding relationships between socio-demographic variables and worries reported in the textual answers on a topic level (Fig 11), in general, the results confirmed the patterns from the word-level analyses yet offered a higher degree of abstraction. For example, we again found that worries about employment (Topic 8) were more prevalent in the younger respondents (Fig 11A), whereas worries about politics (Topic 13, Topic 14) were associated with older ages. The analysis of sample region and worry topics (Fig 11C) showed that worries about unemployment (Topic 7) were more typical of respondents from East Germany, whereas the more private worries about the health of one’s family (Topic 3) were more typical of respondents from West Germany.

Not only did the results of analyses predicting life satisfaction (Fig 12A and 12B) show the same trends as the word-level correlational analyses, but they again demonstrated selection effects. In the subsample that answered the open-ended question, we again found that worries about employment (Topic 8) and worries about pension and financial security (Topic 9) were distinctly negatively related to life satisfaction, whereas topics about moral decay (Topic 12) showed positive associations with life satisfaction. However, when we extended the analyses to include respondents who did not answer the open-ended question, the positive associations with life satisfaction disappeared or became strongly attenuated. Certain results also went beyond the word-level analyses. We found that worries about war and terrorism (Topic 15) were positively correlated with life satisfaction (in the subsample as well as in the full sample) even though none of the most important words of this topic were visible in the word-level analysis of life satisfaction (Fig 10), thus demonstrating that different degrees of abstraction can lead to the discovery of different associations.

While word-level analyses of the Big Five personality traits led to rather sparse results (S1 Fig), topic-level correlations revealed a multitude of statistically significant associations (Fig 12C–12G). For example, we found that worries about the future of children (Topic 1) and the health of one’s family (Topic 3) were associated with lower levels of emotional stability (even after accounting for confounding with gender), whereas Topics 10 to 15 –all revolving around political or societal issues–were associated with higher levels of emotional stability.

Discussion

Selection effects

The analysis of selection effects revealed that whether or not participants responded to the open-ended question depended on the survey mode and multiple characteristics of the respondents. From a conceptual point of view, we can distinguish three sources that contribute to whether or not a respondent will answer an open-ended question: The question, the respondent, and the interplay between the respondent and the question.

Because we analyzed only one question, we could not determine whether the content of the question had an influence on whether or not people would answer it during the course of a social survey. However, we were able to analyze effects of the mode of the question and found that respondents were most likely to answer in the oral mode, indicating that people are less likely to skip an open-ended question in a face-to-face interview situation, whereas they might ignore a small field for entering text at the very bottom of a questionnaire when they fill out a survey by themselves.

The open-ended question was answered more frequently by individuals with a higher level of education, an effect that was also found in the Australian Longitudinal Study on Women’s Health [39] and in the Audit Commission Study of Recent Mothers [40]. This finding might thus reflect that more educated respondents are more likely to answer open-ended questions, regardless of the content of the specific question. Respondents with a lower level of education might not deem their answers important enough to be of interest to the researchers, or they might simply be less cooperative in the survey situation. Beyond the effect of education, higher scores on openness to experience were also associated with a higher probability of answering. Both effects are compatible with Garcia et al.’s speculation that respondents might be more or less articulate, resulting in a higher or lower preference for the open-ended format [40]. Furthermore, first-generation immigrants answered less often, probably because of language barriers; second-generation immigrants, however, did not differ from respondents without an immigration background in their response behavior. A small positive association was found between extraversion and providing a free-text answer. People who describe themselves as talkative (a subdimension of extraversion) might also feel a greater need to communicate in an interview situation or perhaps even on a questionnaire.

Furthermore, there were a number of associations between the characteristics of the respondent and their answering behavior that we attribute to the specific question asked in this study. In our sample, individuals who were less satisfied with their lives or less emotionally stable most likely had more worries to report in the first place or a greater need to report their worries, leading to higher response rates. The negative association between emotional stability and reporting worries can also account for the gender differences in the probability of answering the open-ended question: Women, on average less emotionally stable, were more likely to respond to the open-ended question, but the effect disappeared in the model that also accounted for the Big Five personality traits.

An interesting finding is that within- and between-subjects effects of respondents’ age differed in their direction. One explanation could be that older individuals had a preference for reporting their worries in an open-ended question (because they generally preferred the open-ended format or because they had more worries to report), but the repetitive nature of the survey–being asked the same questions year after year–might have decreased their motivation to answer. One woman, for example, earnestly answered the open-ended question 6 years in a row. In the seventh year, however, she stated: “After all, nobody is interested in my everyday worries.”

Putting these findings together, open-ended questions are likely to be answered by a non-representative subsample, a circumstance that has been considered an inherent limitation of free-text comments in surveys [40]. However, this does not render the data useless. Qualitative research normally does not aim for large random samples; rather, the aim is to judge–on the basis of contextual background variables–whether the hypotheses in question can be applied in other contexts [41]. More important, large representative surveys and panels allow for the explicit analysis of selection effects, an advantage over, for example, the analysis of online posts and comments where the part of the population that does not provide texts remains invisible.

In the case of the data we analyzed, one should keep in mind that the occurrence of a certain topic such as politics does not necessarily reflect the importance of this domain in the general population but might be an overestimation because respondents with a higher level of education are more likely to answer the open-ended question in the first place. Furthermore, correlations between word use and individual features might differ for the unobserved part of the sample that did not answer the open-ended question. Hypotheses derived from the free-text comments should thus be carefully re-examined and additionally validated in different research designs.

Topic detection

What else are respondents worried about? The topics that emerged ranged from the future of children to rising prices, unemployment, the gap between rich and poor, and the state of society, including a decline in values. Some of these topics had not been covered by a corresponding closed-ended question in our sample (e.g. “Future of children,” “Gap between rich and poor”), thus demonstrating one of the advantages of bottom-up data analysis: The results can provide hints about topics that are important to respondents but were not included in the questionnaire.

Other topics substantially overlapped with the worries already covered with closed-ended questions. For example, the topic “Rising prices” overlapped with worries about one’s own financial situation and worries about the economy in general, both already included in the questionnaire; the topic “Foreigners in Germany” overlapped with the item asking for worries about immigration to Germany. This overlap allowed us to investigate whether answers to the open-ended question converged with answers to the closed-ended questions, and this can be seen as a test of convergent validity. For example, the strongest relationship between worries in the two modalities can be found between the already mentioned topic “Foreigners in Germany” and the item “immigration to Germany,” and other pairs of topics and items showed equally plausible relationships.

However, even in the case of overlap, the derived topics provided further information beyond the closed-ended questions. Respondents might choose to answer an open-ended question to emphasize a point that is of greater importance to them (e.g. rising prices rather than economic growth in general); they might also want to further elaborate on the answers they provided in the closed-ended questions and add a certain sentiment. For example, we found two topics that centered on politics: “Politics,” which included names of politicians and parties, and “Politics: corruption and inability.” Adding a single item asking for worries about politics would not have captured the wealth of derogatory terms found in the latter topic, which respondents used to describe politics (e.g. corruption, inability, and dishonesty were among the top 10 words defining the topic, followed by many more) and which expressed a very strong sentiment toward politics.

It is interesting that one of the topics turned out to capture words that referred to catastrophes in the broadest sense, including war, George W. Bush, Kosovo, bird flu, BSE, and Islamism, as well as nuclear power. This might be an artifact caused by the co-occurrence of certain world events within the same year, but it might also reflect a tendency of certain individuals to list multiple current events that share a certain feature such as a high level of perceived threat. The prevalence of this topic corresponded to large international conflicts, whereas the prevalence of the topic “Rising prices” peaked after the financial crisis. These plausible results suggest that the automatically derived topics indeed capture worry complexes that are associated with actual world events.

All in all, topic modeling led to coherent topics that highlighted worry domains that had not been considered in the choice of items and provided additional insight into what respondents considered worth mentioning. Topic occurrence was linked to both worries reported in closed-ended questions and major world events, which hints at the validity of the derived topics.

Alternative approaches for detecting the underlying structure of textual data.

Regarding the topic model used in this study, several alternative approaches would have been justifiable. For example, in an attempt to sort respondents’ answers into disjoint partitions, we could have chosen a clustering approach instead (sketched below). We considered the topic model better suited to the nature of the answers (i.e. many respondents reported more than just one issue in their answers), but this decision certainly depends on the specific situation in question.
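As a minimal sketch of what such a clustering alternative could look like (not the procedure used in this study), the answers could be represented as TF-IDF vectors and partitioned with k-means; the toy answers, the toolkit (scikit-learn), and the parameter settings below are illustrative assumptions.

    # Hypothetical sketch of a hard-clustering alternative; all data and settings are illustrative.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    answers = [
        "worried about finding a job",
        "worried about losing my job",
        "rising prices and my pension",
        "pension and financial security",
        "corruption in politics",
        "politicians and corruption",
    ]

    X = TfidfVectorizer().fit_transform(answers)                            # TF-IDF term-document matrix
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # exactly one cluster label per answer, in contrast to the topic mixtures of LDA

The hard assignment of exactly one label per answer is what made such a partition less attractive for answers that mention several issues at once.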

Furthermore, we could have considered different topic modeling algorithms instead of LDA. Latent Semantic Analysis (LSA; based on either singular value decomposition, SVD [42], or non-negative matrix factorization, NMF [43]) can be employed to identify latent concepts for document categorization on the basis of a decomposition of the term-document matrix. In general, the different models deliver similar results, but one may be more defensible in certain situations. A comparison of LDA with LSA by SVD or NMF [44] found that LDA generally produces more coherent topics than the other algorithms and therefore “warrants strong consideration in applications in which human end-users interact with learned topics,” whereas SVD is preferable for document classification.
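To make the contrast between these decompositions concrete, the following sketch (using scikit-learn as an assumed toolkit; the toy corpus and the number of components are purely illustrative, not the pipeline of this study) applies LSA via truncated SVD, NMF, and LDA to the same answers.

    # Illustrative comparison of the decompositions discussed above.
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.decomposition import TruncatedSVD, NMF, LatentDirichletAllocation

    answers = [
        "worried about unemployment and finding work",
        "unemployment and the job market",
        "pension and financial security in old age",
        "rising prices and my pension",
        "corruption and dishonesty in politics",
        "politicians, parties, and corruption",
    ]

    tfidf = TfidfVectorizer().fit_transform(answers)   # weighted term-document matrix
    counts = CountVectorizer().fit_transform(answers)  # raw counts, as required by LDA

    lsa = TruncatedSVD(n_components=3).fit_transform(tfidf)                 # LSA via singular value decomposition
    nmf = NMF(n_components=3, init="nndsvd").fit_transform(tfidf)           # non-negative matrix factorization
    lda = LatentDirichletAllocation(n_components=3,
                                    random_state=0).fit_transform(counts)   # per-answer topic probabilities

    print(lsa.shape, nmf.shape, lda.shape)  # each answer is mapped onto 3 latent dimensions/topics

Unlike the SVD scores, which can be negative and are therefore harder to read as topic proportions, the NMF and LDA representations are non-negative, which eases their interpretation as the share of an answer devoted to a topic.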

Potential for further validation.

In studies in which the main objective is to detect the underlying structure of the documents–either by topic modeling or by clustering–more systematic approaches to validation are desirable to ensure that the results are useful for future research. Such a validation should be problem specific, and it will often rely on human ratings [45].

In one paradigm that has been used to evaluate the quality of topic models [34], human raters were presented with a list of words from one topic and had to detect a randomly added unrelated word (word intrusion). In a second task, they were presented with a document and a list of topics and had to detect a randomly added unrelated topic (topic intrusion). The participants’ performance in such tasks can be used to estimate how well the derived topic model matches human understanding.
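To make the paradigm concrete, the following sketch shows how a single, deliberately simplified word-intrusion item could be assembled from the top words of derived topics; the topic word lists and the function are hypothetical, and, unlike the original paradigm, the intruder is simply drawn from another topic’s top words.

    # Hypothetical, simplified sketch of building one word-intrusion item.
    import random

    topic_top_words = {
        "pension":  ["pension", "retirement", "old age", "savings", "security"],
        "politics": ["politics", "government", "parties", "corruption", "election"],
    }

    def word_intrusion_item(topic, topics=topic_top_words, rng=random):
        """Return the top words of `topic` plus one intruder word from another topic, shuffled."""
        other = rng.choice([name for name in topics if name != topic])
        intruder = rng.choice(topics[other])
        words = topics[topic] + [intruder]
        rng.shuffle(words)
        return words, intruder

    words, intruder = word_intrusion_item("pension")
    # Raters who reliably pick out `intruder` from `words` indicate a semantically coherent topic.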

Grimmer and King [13] used a different paradigm that also relies on human ratings to evaluate the results of their clustering algorithm. Raters were presented with documents drawn from the same cluster or from different clusters and had to rate the similarity of the documents. These similarity ratings were then used to derive a numerical measure of cluster quality. Furthermore, the authors attempted to evaluate the usefulness of different clustering methods with respect to their potential for new discoveries. For this task, they relied on experts who performed pairwise comparisons of different clusterings to evaluate their informativeness.

Correlational analyses

Linking features of the respondents to words and topics in the textual answer led to a wealth of results. In the following, we will discuss some of the central findings.

Socio-demographic variables and worries.

Worry words and topics were related to socio-demographic variables such as age, gender, and education. This is highly plausible because, after all, worries capture what is relevant in respondents’ lives: Whereas younger people worry about finding a job and about their children, older people are worried about their pension and their grandchildren. On a topic level, younger people are concerned about “Future of children,” “Children, youth, school,” as well as “Finding employment,” whereas “Pension and financial security” is in the domain of older respondents. The results also offer insight into differences between the two sample regions of East and West Germany: While the cloud of words characteristic of West German respondents included worries that were more closely related to private life (e.g. children, health, parents, husband, and wife), East Germans wrote more about political and social issues (e.g. unemployment, emigration, youth, health care system). In the closed-ended item “Worries about immigration to Germany,” respondents from East Germany reported slightly but significantly more worries (Cohen’s d = 0.13, p < .001). By contrast, in the open-ended question, worries about Islam and foreign domination were more characteristic of West German respondents, and the topic “Foreigners in Germany” indicated no differences between the two parts of Germany. Thus, while East German respondents were more likely to confirm worries about immigration when asked about them, they were not more likely to write about immigration, foreigners, or Islam when asked about “any other worries.” One possible interpretation is that the topic was simply less salient in East Germany during the survey years we analyzed. For example, in 2011, foreigners made up less than 3% of the population in the East German federal states, whereas the share in the West German federal states ranged from 5 to more than 10 percent, according to Destatis, the Federal Statistical Office of Germany [46].

Life satisfaction and worries.

People reporting worries about work or unemployment were less satisfied with their lives on average. This finding is in line with previous findings that unemployment has a lasting impact on life satisfaction [47] but also with the notion that the fear of losing one’s job is burdensome. At the other extreme, certain worries were positively related to life satisfaction, such as worries about the topic “moral decay.” One (admittedly trenchant) interpretation could be that people worrying about abstract concepts such as the decline of society probably lack more menacing issues to worry about; however, other confounding variables might also explain these findings. Positive correlations with life satisfaction in the full sample were less pronounced or disappeared completely because the full sample was on average more satisfied than the subsample that answered the open-ended question.

Big Five personality traits.

Correlations between personality traits and worries were less pronounced overall than correlations with socio-demographic variables or life satisfaction. We observed only a few significant correlations in the word-level analyses; however, more effects reached significance in the topic-level analyses. For example, worries about politics and societal issues such as moral decay were consistently associated with higher levels of emotional stability, whereas lower levels of emotional stability were associated with worries about one’s family. One possible interpretation is that low emotional stability is associated not only with more worries but also with worries that are “closer to home” and less abstract than worries about the current state of society or politics in general. However, other correlations are less easily interpreted, such as the relationship between higher levels of agreeableness and worries about employment or between higher levels of conscientiousness and greater worries about foreigners in Germany; we also did not test whether third variables (including social desirability, as personality was measured via self-report) could account for any of these correlations.

Word-level analyses versus topic-level analyses.

Which type of analysis is better suited to investigating links between free texts and other variables: a word-level analysis or a topic-level analysis? Word-level correlations led to a large number of strongly compartmentalized analyses, which raises the issue of alpha inflation and also makes the results harder to survey and integrate. However, linking manifest answering behavior to other variables of interest is straightforward in word-level analyses: The coefficients simply reflect the difference in the dependent variable between people who used a given word and people who did not. This connection is less obvious for topic-level analyses, where the coefficients do not directly reflect manifest answering behavior but the underlying topic assumed in the textual answer. A valid interpretation of a correlation between a topic and a variable of interest thus depends on the properties of the modeled topics (e.g., their validity and their homogeneity). However, topic correlations also offer a higher level of abstraction, which might be desirable for many research questions. Furthermore, our results regarding worries and the Big Five personality traits suggest that topic-level analyses might achieve greater statistical power because they bundle multiple related terms.
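The difference between the two kinds of coefficients can be illustrated with a minimal sketch (hypothetical toy data; statsmodels as an assumed toolkit, not the exact models of this study): the word-level coefficient is a mean difference between users and non-users of a word, whereas the topic-level coefficient refers to the estimated topic share of an answer.

    # Hypothetical toy data illustrating the two kinds of coefficients.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "life_satisfaction":      [7, 5, 8, 4, 6, 9, 3, 7],
        "used_word_unemployment": [0, 1, 0, 1, 0, 0, 1, 0],                             # manifest answering behavior
        "topic8_share":           [0.05, 0.60, 0.10, 0.75, 0.20, 0.02, 0.80, 0.15],     # latent topic estimate
    })

    # Word level: the coefficient is the mean difference in life satisfaction between
    # respondents who used the word and respondents who did not.
    word_model = smf.ols("life_satisfaction ~ used_word_unemployment", data=df).fit()

    # Topic level: the coefficient refers to the topic share assigned by the topic model,
    # so its interpretation rests on the validity and homogeneity of that topic.
    topic_model = smf.ols("life_satisfaction ~ topic8_share", data=df).fit()

    print(word_model.params)
    print(topic_model.params)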

We thus recommend that both types of analyses be applied because they allow the data to be explored from different perspectives. If word-level results and topic-level results converge, the trust in the underlying pattern increases. If the analyses do not mutually support each other or even lead to contradictory conclusions, researchers should carefully consider various explanations for this finding. For example, word-level analyses can lead to “spurious” correlations that are driven by idiolects, whereas topic-level correlations become hard to interpret when a derived topic lumps together multiple issues.

General discussion

Can we generalize results of automated text analysis?

An important limitation of the analysis of free-text comments is that we do not yet know whether and to what extent the results allow conclusions to be drawn that apply to the whole sample. In our case, the subsample that answered the open-ended question was not representative of the sample that took the survey; it differed on variables such as education, which might lead to a gross misrepresentation of worry topics. In the worst case, certain worries are highly specific to a part of the population that simply does not answer open-ended questions at all, and we would miss those topics no matter how large our sample was originally. But even if representativeness were given, it would remain impossible to estimate the prevalence of a certain worry. The occurrence or absence of a topic is not as easily interpreted as the ticking of a specific box in answer to a closed-ended question. This has been described as the “present-absent problem” [4], and it is also relevant in, for example, analyses of interview transcripts. The presence or absence of a word or a topic in a textual answer can have multiple meanings. An automated text analysis might rule out some of them (e.g. it is unlikely that an automated analysis will miss a keyword that was actually used in the text), but it also introduces new ambiguities (e.g. a word might be considered absent because respondents chose a synonym that is not frequent enough to be visible in the analysis). Moreover, it cannot decide whether a topic was forgotten or actually not very relevant to the respondent, or whether a topic was genuinely important to the respondent or simply very salient at that moment.

While differences in topic salience can be informative and can yield new research questions, they can also be introduced by the context of the survey. For example, a quick keyword search revealed at least 30 instances of people worrying about the survey itself, including worries about the questionnaire, its length, what will become of the responses, and even worries about “whoever came up with these questions.” One person even worried that she was not able to provide exact answers to the interviewer. These method artifacts might tell us something about the survey process, but they certainly do not provide unbiased insights into what people worry about on the days on which they are not surveyed. Furthermore, space limitations prevented respondents from reporting multiple answers, even if they had wanted to address multiple issues. However, one might also view this as an advantage over closed-ended questions because open-ended questions with space limitations force respondents to focus on what they consider most important.

The impact of data pre-processing and the objectivity of automated text analysis.

Another important issue is that data pre-processing affects the data input and the analyses and could thus impact the results and the corresponding conclusions. For example, we decided to stem words, a step that inevitably entails a loss of information about the parts of speech used; this step is, however, appropriate and necessary when one is not interested in linguistic features and instead focuses on semantics (i.e. content) in a highly inflected language such as German. On the other hand, we decided not to split compound words because, in the context of worries, the compound carries a meaning that goes beyond the isolated lexemes. Worries about “Jugendarbeitslosigkeit” (youth unemployment) were thus not split into worries about youth and worries about unemployment; worries about “Gesundheitswesen” (health care system) were not decomposed into worries about health (“Gesundheit”) and worries about “Wesen” (meaning business or system when used in a compound word). This approach, however, also led to some undesirable results, e.g. “Ausbildungsplatz” (literally apprenticeship + place) and “Ausbildungsstelle” (literally apprenticeship + position) turning up as distinct words even though they could be considered synonyms. One approach would be to adjust the decomposition of compounds manually by setting up lists of words that should or should not be split given the research question at hand. However, one could argue that such a procedure in turn decreases objectivity and can become highly effortful, thus partly undermining the idea of an automated analysis. When analyzing English textual data, the problem takes on another form: Compounds that one might want to analyze as units, such as youth unemployment, get split in the process of tokenization; in this case, one would prefer to analyze not only single words (so-called unigrams) but also sequences of, for example, two to three words (bigrams and trigrams).

Analyzing textual data thus requires researchers to make multiple decisions about how to pre-process and analyze the data. Some decisions might depend on the response language and the features of the texts to be analyzed; some might depend on the research question of interest, that is, the aim of the analyses; and, last but not least, some might depend on the presuppositions and judgments of the researcher. Even the decision not to pre-process the data and simply to assemble a list of words to be counted carries the implicit presupposition that the words on the list are the words of interest for the research question. An automated text analysis thus comes with a large number of so-called researcher degrees of freedom [48]. Accordingly, it is susceptible to well-known scientific nuisances: Researchers can cherry-pick the most appealing results or adjust the pre-processing and the parameters of the analyses until the results fit the desired outcome. We therefore consider it crucial that researchers disclose all the steps used in the data pre-processing and the analyses and provide the rationale behind them. Greater transparency also allows researchers to understand that statistical methods developed by, for example, computer scientists are not black boxes that are fed data without any additional assumptions and magically return objective knowledge. Rather, they are algorithms that often require adjustment to yield the most insightful results given the data. Full transparency about the research process allows other researchers to understand and, if necessary, criticize the decisions made in the process of data analysis.
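As a concrete (and deliberately minimal) illustration of two of the pre-processing decisions discussed above, the following sketch applies the German Snowball stemmer [32] via NLTK and, for English text, counts unigrams and bigrams with scikit-learn; the example words, toolkits, and settings are illustrative assumptions, not the exact pipeline of this study.

    # Illustrative pre-processing sketch; example words and settings only.
    from nltk.stem.snowball import SnowballStemmer
    from sklearn.feature_extraction.text import CountVectorizer

    # German stemming collapses inflected forms but leaves compounds intact:
    stemmer = SnowballStemmer("german")
    print(stemmer.stem("Jugendarbeitslosigkeit"))  # the compound is stemmed as one unit, not split

    # For English text, multi-word units can be preserved by counting bigrams in addition to unigrams:
    texts = ["worried about youth unemployment", "youth unemployment keeps rising"]
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    vectorizer.fit(texts)
    print(vectorizer.get_feature_names_out())  # includes 'youth unemployment' as a single feature

Making such steps explicit, ideally as shared analysis code, is one straightforward way to provide the transparency called for above.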

Depth of automated text analysis.

The automated textual analyses we ran remained shallow in some respects, especially in comparison with intellectual analyses of qualitative interviews. This is due partly to the very nature of the data provided in this study, which were typical of open-ended questions included in larger social surveys (very brief statements, often only single phrases or words), and partly to the analytical approach (e.g. the bag-of-words approach in the current study). There can be a profound loss of information when a context-insensitive analysis is used instead of an intellectual analysis. For example, an answer such as “my wife does not want me to write about my soccer club” (which actually appeared in our textual data) provides information that cannot be reflected by a simple count of the words “wife,” “want,” “write,” “soccer,” and “club.”
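For illustration, the bag-of-words view of that answer is nothing more than a table of word frequencies; the sketch below uses plain whitespace tokenization, a simplification of the pre-processing described earlier.

    # Minimal bag-of-words view of the example answer; whitespace tokenization for illustration only.
    from collections import Counter

    answer = "my wife does not want me to write about my soccer club"
    print(Counter(answer.split()))
    # Counter({'my': 2, 'wife': 1, 'does': 1, 'not': 1, ...}) -- word order and negation are lost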

In a review of natural language processing research, Cambria and White compare understanding a text through word-level processing to “attempting to understand a picture by analyzing it at pixel-level” and claim that even sophisticated frameworks, such as IBM’s question-answering system Watson, have no “understanding” of what they are doing [49]. The two authors envision the evolution from natural language processing to natural language understanding as a shift from the currently predominant word-based techniques to concept-based techniques and finally narrative-based techniques.

Current approaches to lexical semantics by means of word vector representations [50, 51] show promising results in this direction. Word vectors produced by word2vec [50] can be used to compute the semantic similarity between words. For example, it has been shown that in the resulting vector space, certain word relationships can be identified, such as queen ≈ king − man + woman [52]. One drawback of these models is the generally large amount of training data needed to achieve meaningful results. The usual training corpora include millions of sentences and billions of words. Thus, typical social science data sets are not large enough to train state-of-the-art word2vec domain models on them. However, models trained on existing corpora (e.g. news corpora, Wikipedia text) may be used to introduce an external source of world knowledge. Word2vec models excel in the domain of word similarities and analogies. They are thus particularly suited for research questions that start from particular seed words of interest whose relationships to other words are to be investigated.
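As a sketch of how such pre-trained vectors could be queried (gensim is an assumed toolkit here, and the model file name is a placeholder for any word2vec model trained on a large external corpus):

    # Hypothetical sketch of querying a pre-trained word2vec model with gensim;
    # 'pretrained_vectors.bin' is a placeholder for an externally trained model file.
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)

    # Semantic similarity between two seed words of interest:
    print(vectors.similarity("worry", "fear"))

    # The analogy mentioned above: queen is close to king - man + woman
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))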

While automated text analysis might eventually reach a level of natural language understanding, the technology is currently far from being suited to replace qualitative (i.e. intellectual, human-based) analyses of texts in a large number of applications.

Potential.

Despite the technical obstacles to automated text analysis, and despite the methodological issues and current limitations, an automated analysis of free texts offers an attractive addition to quantitative research.

First, textual data are readily available in huge quantities–for example, from online sources such as Facebook, Twitter, and blogs, and, as in this study, from open-ended questions that have already been included in large-scale surveys and other studies but that have often not been analyzed to date. Data from social media have their own advantages (e.g. the sheer number of users, the interactive character of text messages, and the informal context) and have become increasingly popular in research; see [53] for a recent summary on how to integrate these data into psychological research. In comparison, data from representative social surveys and panel studies come with a large number of variables describing, for example, socio-economic circumstances in great detail and can be used to analyze who provided the textual data–and who did not. In both cases, the sheer sample size will often lead to a high level of statistical power. Although the analyses we presented in this paper were exploratory in nature, automated text analysis could also be used for more focused analyses and to test specific hypotheses in confirmatory investigations [54].

Second, the automated analysis of textual data opens up new sources of information. According to Mehl and Gill [10], free texts share zero method variance with the commonly used self-report rating scales. The demands that open-ended questions put on respondents go beyond simply marking an answer option, and thus, they prompt different cognitive processes and a different type of response behavior. Respondents are no longer confronted with a set of given answers and thus do not depend on what the designer of the questionnaire considers to be an exhaustive list of answer options. The open character of the question might motivate respondents to reveal more about what they consider important regarding a certain issue; results drawn from analyses of open-ended questions might thus lead to insights that go beyond the findings that come from closed-ended questions.

Supporting information

S1 List. List of common abbreviations used in open-ended text answers.

https://doi.org/10.1371/journal.pone.0182156.s001

(CSV)

S3 List. German to English translation of words displayed in figures.

https://doi.org/10.1371/journal.pone.0182156.s003

(CSV)

S1 Fig. Visual representation of word-level correlational analyses of the Big Five personality traits.

https://doi.org/10.1371/journal.pone.0182156.s004

(TIF)

S1 Table. Numerical results of the correlational analyses.

https://doi.org/10.1371/journal.pone.0182156.s005

(PDF)

Acknowledgments

The data used in this publication were made available by the German Socio-Economic Panel Study (SOEP) at the German Institute for Economic Research (DIW Berlin). We acknowledge support from the German Research Foundation (DFG) and the University of Leipzig within the Open Access Publishing program.

Author Contributions

  1. Conceptualization: JMR MB SCS GGW.
  2. Data curation: JG.
  3. Formal analysis: JMR MB.
  4. Methodology: JMR MB SCS.
  5. Resources: GGW JG.
  6. Software: MB.
  7. Supervision: SCS GGW.
  8. Visualization: JMR MB.
  9. Writing – original draft: JMR SCS.
  10. Writing – review & editing: JMR SCS MB JG GGW.

References

  1. O’Cathain A, Thomas KJ. “Any other comments?” Open questions on questionnaires—a bane or a bonus to research? BMC Med Res Methodol. 2004;4:25. pmid:15533249
  2. Mayring P. Qualitative Content Analysis. Forum: Qualitative Social Research. 2000;1(2). Available from: http://www.qualitative-research.net/index.php/fqs/article/view/1089.
  3. Hsieh H, Shannon SE. Three approaches to qualitative content analysis. Qual Health Res. 2005;15(9):1277–88. pmid:16204405
  4. Sandelowski M, Voils CI, Knafl G. On Quantitizing. J Mix Methods Res. 2009;3(3):208–22. pmid:19865603
  5. Ravi K, Ravi V. A survey on opinion mining and sentiment analysis: Tasks, approaches and applications. Knowledge-Based Systems. 2015;89:14–46.
  6. Pang B, Lee L. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval. 2008;2(1–2):1–135.
  7. Cambria E, Schuller B, Xia Y, Havasi C. New Avenues in Opinion Mining and Sentiment Analysis. IEEE Intell. Syst. 2013;28(2):15–21.
  8. Tausczik YR, Pennebaker JW. The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. Journal of Language and Social Psychology. 2010;29(1):24–54.
  9. LIWC2015. Austin, TX: Pennebaker Conglomerates; 2015.
  10. Mehl MR, Gill AJ. Automatic Text Analysis. In: Gosling S, Johnson JA, editors. Advanced methods for conducting online behavioral research. 1st ed. Washington, D.C.: American Psychological Association; 2010.
  11. Schwartz HA, Eichstaedt JC, Kern ML, Dziurzynski L, Ramones SM, Agrawal M, et al. Personality, gender, and age in the language of social media: the open-vocabulary approach. PLOS ONE. 2013;8(9):e73791. pmid:24086296
  12. Blei DM, Ng AY, Jordan MI. Latent Dirichlet Allocation. Journal of Machine Learning Research. 2003;3:993–1022.
  13. Grimmer J, King G. General purpose computer-assisted clustering and conceptualization. Proc Natl Acad Sci U S A. 2011. pmid:21292983
  14. Mayhew DR. Congress: The electoral connection. New Haven: Yale University Press; 1974.
  15. Leximancer. Version 4.0. Brisbane, Australia: Leximancer Pty Ltd; 2011.
  16. SPSS Text Analytics for Surveys. Version 4.0.1. Armonk, NY: IBM Corp; 2011.
  17. Feinerer I. Introduction to the tm Package: Text Mining in R; 2015. Available from: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf.
  18. Apache OpenNLP Development Community. Apache OpenNLP Developer Documentation [cited 2016 Feb 23]. Available from: https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html.
  19. Bird S, Klein E, Loper E. Natural language processing with Python. 1st ed. Beijing, Cambridge [Mass.]: O’Reilly; 2009.
  20. Biemann C, Quasthoff U, Heyer G, Holz F. ASV Toolbox: A Modular Collection of Language Exploration Tools. Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC); 2008 May 28–30; Marrakech, Morocco. European Language Resources Association; 2008. Available from: http://www.lrec-conf.org/proceedings/lrec2008/pdf/447_paper.pdf.
  21. Stamatatos E. A survey of modern authorship attribution methods. J Am Soc Inf Sci Technol. 2009.
  22. Hopkins D, King G. Extracting systematic social science meaning from text; 2007. Available from: http://www.afsp.msh-paris.fr/congres2007/tablesrondes/textes/tr1sess3kinghopkins.pdf. Cited 20 November 2016.
  23. Hopkins D, King G, Knowles M, Melendez S. ReadMe: Software for automated content analysis; 2010. Available from: http://gking.harvard.edu/files/gking/files/readme.pdf. Cited 20 November 2016.
  24. Hockett CF. The origin of speech. Scientific American. 1960;203:88–96.
  25. Bock K. Language production: Methods and methodologies. Psychon Bull Rev. 1996;3(4):395–421. pmid:24213975
  26. Rajaraman A, Ullman JD. Data Mining. In: Rajaraman A, Ullman JD, editors. Mining of Massive Datasets. Cambridge: Cambridge University Press; 2011. p. 1–17.
  27. Argamon S, Levitan S. Measuring the usefulness of function words for authorship attribution. In: Proceedings of the Joint Conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing (ACH/ALLC); 2005; Victoria, Canada. 1–3.
  28. Wagner GG, Frick JR, Schupp J. The German Socio-Economic Panel Study (SOEP)—Evolution, Scope and Enhancements. Schmollers Jahrbuch. 2007;127:139–169.
  29. van de Pol M, Wright J. A simple method for distinguishing within- versus between-subject effects using mixed models. Animal Behaviour. 2009;77(3):753–8.
  30. Lang FR, John D, Lüdtke O, Schupp J, Wagner GG. Short assessment of the Big Five: robust across survey methods except telephone interviewing. Behav Res Methods. 2011;43(2):548–67. pmid:21424189
  31. Goldhahn D, Eckart T, Quasthoff U. Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC); 2012 May 21–27; Istanbul, Turkey. European Language Resources Association; 2012. Available from: http://www.lrec-conf.org/proceedings/lrec2012/pdf/327_Paper.pdf.
  32. German stemming algorithm. Available from: http://snowball.tartarus.org/algorithms/german/stemmer.html.
  33. van Rijsbergen CJ, Robertson SE, Porter MF. New models in probabilistic information retrieval. Computer Laboratory, University of Cambridge; 1980.
  34. Chang J, Boyd-Graber J, Wang C, Gerrish S, Blei DM. Reading Tea Leaves: How Humans Interpret Topic Models. In: Neural Information Processing Systems; 2009.
  35. Griffiths TL, Steyvers M. Finding scientific topics. Proc Natl Acad Sci USA. 2004;101 Suppl 1:5228–35. pmid:14872004
  36. Jacobi C, van Atteveldt W, Welbers K. Quantitative analysis of large amounts of journalistic texts using topic modelling. Digital Journalism. 2015;4(1):89–106.
  37. Coffey MP, Feingold M, Bromberg J. A normed measure of variability among proportions. Computational Statistics & Data Analysis. 1988;7(2):127–41.
  38. Bovens L, Chatkupt C, Smead L. Measuring common standards and equal responsibility-sharing in EU asylum outcome data. European Union Politics. 2012;13(1):70–93.
  39. Rich JL, Chojenta C, Loxton D. Quality, rigour and usefulness of free-text comments collected by a large population based longitudinal study—ALSWH. PLOS ONE. 2013;8(7):e68832. pmid:23874784
  40. Garcia J, Evans J, Reshaw M. “Is There Anything Else You Would Like to Tell Us”–Methodological Issues in the Use of Free-Text Comments from Postal Surveys. Quality & Quantity. 2004;38(2):113–25.
  41. Malterud K. Qualitative research: standards, challenges, and guidelines. Lancet. 2001;358(9280):483–8. pmid:11513933
  42. Landauer TK, Dumais ST. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol Rev. 1997.
  43. Lee DD, Seung HS. Algorithms for non-negative matrix factorization. Adv Neural Inf Process Syst. 2001;13:556–562.
  44. Stevens K, Kegelmeyer P, Andrzejewski D, Buttler D. Exploring topic coherence over many models and many topics. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL); 2012; Jeju Island, Korea. 952–961.
  45. Grimmer J, Stewart BM. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Polit Anal. 2013.
  46. Statistisches Bundesamt [Internet]. Bonn, Germany: Bundeszentrale für politische Bildung; 2012. Ausländische Bevölkerung nach Ländern. Available from: http://www.bpb.de/nachschlagen/zahlen-und-fakten/soziale-situation-in-deutschland/61625/auslaendische-bevoelkerung-nach-laendern
  47. Lucas RE, Clark AE, Georgellis Y, Diener E. Unemployment Alters the Set Point for Life Satisfaction. Psychol Sci. 2004;15(1):8–13. pmid:14717825
  48. Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci. 2011;22(11):1359–66. pmid:22006061
  49. Cambria E, White B. Jumping NLP Curves: A Review of Natural Language Processing Research. IEEE Comput. Intell. Mag. 2014;9(2):48–57.
  50. Mikolov T, Chen K, Corrado G, Dean J. Efficient Estimation of Word Representations in Vector Space. In: Proceedings of the International Conference on Learning Representations (ICLR); 2013; Scottsdale, USA. Available from: arXiv:1301.3781v3.
  51. Levy O, Goldberg Y. Linguistic Regularities in Sparse and Explicit Word Representations. In: Proceedings of the Eighteenth Conference on Computational Language Learning (CoNLL); 2014; Baltimore, USA. 171–180.
  52. Mikolov T, Yih WT, Zweig G. Linguistic Regularities in Continuous Space Word Representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT); 2013; Atlanta, USA. 746–751.
  53. Kern ML, Park G, Eichstaedt JC, Schwartz HA, Sap M, Smith LK, et al. Gaining Insights From Social Media Language: Methodologies and Challenges. Psychol Methods. 2016. pmid:27505683
  54. Sakaluk JK. Exploring Small, Confirming Big: An alternative system to The New Statistics for advancing cumulative and replicable psychological research. J Exp Soc Psychol. 2016.