The literature of social class and inequality is not only diverse and rich in insight, but also complex and fragmented in structure. This article seeks to map the topic landscape of the field and identify salient development trajectories over time. We apply the Latent Dirichlet Allocation topic modeling technique to extract 25 distinct topics from 14,038 SSCI articles published between 1956 and 2017. We classified three topics as “hot”, eight as “stable” and 14 as “cold”, based on each topic’s idiosyncratic temporal trajectory. We also listed the three most cited references and the three most popular journal outlets per topic. Our research suggests that future effort may be devoted to the topics “urban inequalities, corporate social responsibility and public policy in connected capitalism”, “education and social inequality”, “community health intervention and social inequality in multicultural contexts” and “income inequality, labor market reform and industrial relations”.
Citation: Guo L, Li S, Lu R, Yin L, Gorson-Deruel A, King L (2018) The research topic landscape in the literature of social class and inequality. PLoS ONE 13(7): e0199510. https://doi.org/10.1371/journal.pone.0199510
Editor: C. Mary Schooling, CUNY, UNITED STATES
Received: October 25, 2017; Accepted: June 9, 2018; Published: July 2, 2018
Copyright: © 2018 Guo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data used in this article can be found in the Core Collection of Web of Science—Clarivate (http://apps.webofknowledge.com/) by executing the following advanced search command: (TS="Social Class" OR TS="Social Classes" OR TS="Social Stratification" OR TS="Social Stratifications" OR TS="Social Inequality" OR TS="Social Inequalities") AND LANGUAGE: (English) AND DOCUMENT TYPES: (Article) Indexes=SSCI Timespan=1956-2017. More information can be found in the section of "Description of the Sample" in the article.
Funding: Liang Guo is supported by the Qilu Project of Shandong University, China. Ruodan Lu is supported by the British EPSRC DTA fund (DTA2014). Ariane Gorson-Deruel receives salary from Kantar TNS. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.
Competing interests: AGD is an employee of and receives salary from Kantar TNS, a marketing research firm. This does not alter our adherence to PLOS ONE policies on sharing data and materials.
Social stratification or social class refers to visible societal layers or classes of differing wealth, income, race, education or power. Social stratification, social class and social inequality (hereafter social class and inequality) are often used interchangeably, all of which are the products of an unequally structured society in which identities are socially produced on a large scale. As societies evolve, the number of layers can change, and the boundaries between them move. Mobility within and between classes and their persistence from one generation to another influence a society’s governance, customs, culture, identity, and perception of social inequality. Recent so-called “black swan events” (i.e. Donald Trump’s victory in the American election and the Brexit referendum) and the growth of populism in Europe are vivid examples of how human society is transformed by the struggle between different social classes.
Social scientists have studied social class and inequality at length. In the 19th century, Marxian theories of stratification considered social inequality as crucial to understanding human society. The struggle between the exploited and exploiting classes would eventually lead to a political revolution, which would replace private monopolies by total equality (e.g. the Soviet Union and Communist China). In the early 20th century, Max Weber proposed the three-component theory of stratification, in which class, status and power are distinct ideal types and social class manifests itself as unequal access to economic resources. In the late 20th century, Lenski developed the theory of social stratification, further arguing that the accumulation of information, especially technological information, is the most basic and powerful factor in the evolution of human societies. Technological advances laid the foundations for social inequality in terms of power and wealth distribution.
Based on classic social theories, many studies have empirically examined the determinants and consequences of social class and inequality. Multidisciplinary knowledge in the field is not only diverse and insightful, but also fragmented and multifaceted. There is a pressing need for clear mapping of this ever more complex landscape to help researchers and students to conduct efficient, effective literature reviews. A comprehensive mapping of the field will help by providing an understanding of how it has evolved over time, shedding light on the points of consensus and divergences among scholars, while revealing research gaps in the intellectual structure of the field.
This study comprises a computer-based overview of the social class and inequality literature over the period of 1956–2017. First, we mapped out the topic landscape, and then attempted to anticipate hot topics that will generate seminal research in the future. As far as we know, this is the first systematic review of the field across many disciplines over seven decades and the first attempt to forecast topic prevalence in this literature. Our first contribution lies in uncovering a hidden structure of 25 distinct topics and development trajectories in a corpus comprising the abstracts of 14,038 scholarly articles. This study draws on an unprecedentedly large text corpus that includes a broad range of author backgrounds, disciplinary influences and research focuses. Our study will enable researchers to explore not only topic development paths within the overall literature, but also the most salient articles in each individual topic. Our second contribution lies in forecasting the popularities of these 25 topics, based on each topic’s temporal idiosyncrasies which will help both researchers and journal editors to select promising research topics. In the next section, we briefly introduce topic modeling techniques and applications in modeling scientific literature. Then we describe our analyses and results. And finally, we discuss the implications of our work for scholars, journal editors, and practitioners.
Topic modeling methodology
A document can be represented as a vector of word term weights (i.e. features) from a set of terms (i.e. dictionary), and the topic of a document is made up of a joint membership of terms that share a pattern of occurrence. Early document clustering techniques employed the vector space model, which can calculate the similarity between two documents. This technique fails to deal with the issues caused by synonymy (i.e. different words with similar or identical meanings) and polysemy (i.e. words with different meanings in different contexts). Later, Latent Semantic Analysis (LSA) was developed in an effort to improve classification performance in document retrieval. Like most topic modeling techniques, LSA starts from a pre-processing step, which cleans the corpus of text documents and builds a document-term matrix for subsequent modeling. The cleaning procedures include tokenization (i.e. partitioning a text document into a list of tokens), stop-word removal (i.e. removing words that are extremely common but of little value in classifying documents, such as this, it, is), stemming and lemmatization (i.e. removing the ends of conjugated verbs or plural nouns while keeping the lemma, base or root form), and compound-word handling (i.e. concatenating hyphenated words that describe one concept). The remaining words are used to construct a document-term matrix (DTM). The DTM is a matrix where each row represents a document, each column represents a unique word, and each cell denotes the number of times a given word appears in a given document. Then, LSA reduces the DTM into a filtered DTM through singular value decomposition (SVD). Finally, LSA computes the similarity between text documents to pick out the most closely related words. While computationally efficient, LSA fails to identify and distinguish between different contexts of word usage without recourse to a dictionary or thesaurus.
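The pre-processing steps described above can be illustrated with a minimal sketch that builds a document-term matrix from raw text. The stop-word list, the function names and the two example sentences are hypothetical, and stemming and compound-word handling are left out for brevity; this is not the pipeline used in the study.

```python
import re
from collections import Counter

# Hypothetical miniature stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"this", "it", "is", "the", "a", "of", "and", "in"}

def tokenize(text):
    """Lower-case the text, split it into alphabetic tokens, and drop stop-words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_dtm(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts word j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix

# Two made-up documents, for illustration only.
docs = [
    "Social class is a social division.",
    "Inequality in health and class.",
]
vocab, dtm = build_dtm(docs)
for row in dtm:
    print(dict(zip(vocab, row)))
```

Each row of the resulting matrix corresponds to one document and each column to one vocabulary word, which is the input format that both LSA and LDA start from.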
Backed by Bayesian statistics, Latent Dirichlet Allocation (LDA) was developed to apply a probabilistic model to analyze word distributions in text documents and uncover topics in an automated fashion [7,11]. This generative modeling technique does not require prior categorization, labelling or annotation of the texts but reveals the invisible, latent topic structure through statistical procedures. It follows the “bag-of-words” assumption and treats a document as a vector containing the count of each word type, regardless of the order in which the words appear. In a nutshell, LDA assumes that each document can be modelled as a mixture of topics, and each topic is a discrete probability distribution that defines how likely each word is to appear in a given topic. A document is then represented by a distribution of topic probabilities. LDA estimates the parameters of the word and topic distributions with Markov chain Monte Carlo (MCMC) simulations. It then assigns topics to each document through a Dirichlet distribution of topics. Given a specific number of topics in a collection of text documents, the extent to which each topic (and its associated words) is represented in a specific document can be modelled by a latent variable model, where latent variables represent the topics and how each document in the collection manifests them [7,13]. In short, LDA discovers patterns of word use and connects patterns of similar use to estimate the posterior distribution of hidden variables, which represents the topic structure of the collection [12,13].
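The generative assumption behind LDA can be made concrete with a small simulation: draw topic proportions for a document from a Dirichlet distribution, then draw a topic and a word for each token. The sketch below uses only the Python standard library; the vocabulary, the two topic-word distributions and the Dirichlet parameter are invented for illustration, and the code simulates documents rather than fitting a model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocabulary, alpha, length, rng):
    """Generate a bag of words: topic proportions first, then a topic and a word per token."""
    theta = sample_dirichlet(alpha, rng)  # document-topic proportions
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocabulary, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words

rng = random.Random(0)
# Hypothetical vocabulary and topic-word distributions (each row sums to one).
vocabulary = ["class", "income", "health", "mortality"]
topic_word = [
    [0.45, 0.45, 0.05, 0.05],  # an "economic" topic
    [0.05, 0.05, 0.45, 0.45],  # a "health" topic
]
theta, words = generate_document(topic_word, vocabulary, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, words)
```

Fitting LDA reverses this process: given only the observed words, it infers the topic proportions of each document and the word distribution of each topic.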
Recently, some LDA-based techniques have been proposed. For example, the Correlated Topic Model (CTM) uses a logistic normal distribution to create relations among topics. Supervised LDA can introduce known label information into the topic discovery process. Labeled LDA (LLDA) allows multiple labels per document and maps labels to topics one-to-one. Partially labeled LDA (PLLDA) further extends LLDA to allow latent topics that are missing from the given document labels.
LDA has been widely used to process otherwise unmanageably large volumes of text, identify the most salient topic in a single document, investigate similarities between documents, and uncover topic prevalence over time [11,13,17]. We summarize some recent applications of LDA in scientific topic discovery in Table 1.
Description of the sample
We extracted article abstracts from the core collection of the Web of Science (WoS) database using the following criteria: articles published in English, whose topic terms (i.e. titles, abstracts and keywords) included “social stratification(s)”, “social class(es)” or “social inequality(ies)” in SSCI indexed journals over the period of 1956 to December 2017. The search found 15,057 articles. We deleted those without keywords and abstracts, leaving 14,038 articles in the collection. Among these articles, 67.11% belong to “social class(es)” alone, 23.60% to “social inequality(ies)” alone and 6.71% to “social stratification(s)” alone. A further 1.74% of articles belong to both “social class(es)” and “social inequality(ies)”; 0.52% to both “social class(es)” and “social stratification(s)”; and 0.26% to both “social inequality(ies)” and “social stratification(s)”. Only 0.04% of articles belong to all three topic terms.
In addition, we built three time series of annual article counts, one for each of these three terms. The correlation coefficient between the “social class(es)” and “social inequality(ies)” series is 0.87, between the “social class(es)” and “social stratification(s)” series is 0.86, and between the “social inequality(ies)” and “social stratification(s)” series is 0.97. These statistics confirm that the three topic terms are highly similar. They all reflect the types of social divisions envisaged by Marx and refer to groups defined by their relationship to ownership and control over the means of production, of labor and of distribution. We did not include the term “social status” because it emphasizes the social distinctions caused not only by economic factors but also by cultural ones, which include denotative (what is), normative (what should be), and stylistic (how done) beliefs, shared by a group of individuals who have undergone a common historical experience and participate in an interrelated set of social structures.
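As a minimal sketch of how such a coefficient is obtained, the function below computes the Pearson correlation between two annual count series. The text does not state which correlation coefficient was used, so Pearson is an assumption, and the two short series are made up for illustration; they are not the study's data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up annual article counts, for illustration only.
class_counts = [10, 12, 15, 20, 28]
inequality_counts = [2, 3, 3, 6, 9]
r = pearson(class_counts, inequality_counts)
print(round(r, 2))
```

A coefficient close to one means that the two series rise and fall together across years, which is the sense in which the search terms are treated as highly similar.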
Analyses and results
Fig 1 depicts the yearly distribution of articles in terms of annual article counts and the percentage of our sample article counts to the total number of SSCI articles per year (hereafter, publication percentage). The field has grown substantially over the last seven decades. There were only 12 articles (0.04%) published in 1956, but this figure rose to 1,001 (0.31%) in 2017. The average annual growth rate in the field reached 5.99%. A systematic change in both the article count series and the publication percentage series can be identified over time. The year 1991 is a change point in the field, as the growth rate jumped from 16.71% in the previous year to 166.98% in that year. From 1991 onward, the publication percentage (mean = 0.24%, std. = 0.06%) was much higher than in previous years (mean = 0.05%, std. = 0.02%).
The authors of these articles are from 128 countries, especially USA (36.69%), UK (25.64%) and Canada (5.96%). The ten most frequent organizations in the sample are University College London (2.89%), Harvard University (2.05%), University of Michigan (1.91%), University of Helsinki (1.79%), University of Edinburgh (1.55%), University of Bristol (1.44%), University of Toronto (1.33%), Karolinska Institute (1.29%), University of Cambridge (1.28%), and University of Copenhagen (1.22%).
The articles span 112 WoS research areas. Table 2 summarizes the top 10 research areas, which account for around 93.33% of the sample articles. These articles were published in 2,495 journals, among which Social Science & Medicine, Journal of Epidemiology and Community Health, and European Journal of Public Health are the three most frequent outlets in the field (see Table 3).
Grid search of the optimal number of topics
We first built a corpus containing the titles, keywords, and abstracts of all sample articles. All texts were converted to lower case. We removed stop-words as well as punctuation based on the standard NLTK list and reduced the remaining words to their stems. We then used an algorithm developed by Wang, McCallum, & Wei  to replace n-grams with compound words in the text documents. To speed up the modelling process, we followed Blei and Lafferty , Hornik and Grun , and Antons et al  in including only the terms in a topic model whose term-frequency-inverse-document-frequency (tf-idf) values are just above the median of all tf-idf values of the entire vocabulary. These preprocessing procedures resulted in a DTM for further analyses.
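The vocabulary-pruning step can be sketched as follows. The sketch scores each term by its mean relative frequency in the documents that contain it, multiplied by the base-2 logarithm of the inverse document frequency, and keeps the terms that score above the median. This particular tf-idf formula is one common definition and is an assumption here, since the text does not spell out the exact formula; the small matrix and the function names are hypothetical.

```python
from math import log2
from statistics import median

def term_tfidf(dtm):
    """Mean tf-idf per term for a document-term matrix given as a list of count rows."""
    n_docs = len(dtm)
    n_terms = len(dtm[0])
    doc_lengths = [sum(row) for row in dtm]
    scores = []
    for j in range(n_terms):
        # Relative frequencies of term j in the documents that contain it.
        rel_freqs = [row[j] / doc_lengths[i] for i, row in enumerate(dtm) if row[j] > 0]
        doc_freq = len(rel_freqs)
        if doc_freq == 0:
            scores.append(0.0)
            continue
        mean_tf = sum(rel_freqs) / doc_freq
        scores.append(mean_tf * log2(n_docs / doc_freq))
    return scores

def keep_above_median(vocabulary, dtm):
    """Keep only the terms whose tf-idf score is strictly above the median score."""
    scores = term_tfidf(dtm)
    cutoff = median(scores)
    keep = [j for j, s in enumerate(scores) if s > cutoff]
    return [vocabulary[j] for j in keep], [[row[j] for j in keep] for row in dtm]

# Hypothetical vocabulary and counts: "class" and "social" occur in every document.
vocabulary = ["class", "health", "income", "social"]
dtm = [
    [1, 2, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 3, 1],
]
kept_vocab, kept_dtm = keep_above_median(vocabulary, dtm)
print(kept_vocab)
```

Terms that appear in every document receive a score of zero and are dropped, which is the intended effect: words that cannot discriminate between documents contribute little to topic discovery.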
We conducted LDA topic modeling analysis with the Gensim package. The first step was to perform a two-stage grid-search procedure to find the optimal number of topics in our collection. We computed a model set of 3–103 topics in steps of 10 (i.e. 3, 13, 23, …, 103), each of which was repeated 30 times to circumvent the impact of random resampling within LDA. Each model was evaluated by the semantic coherence score with the algorithms of Newman, Lau, Grieser, & Baldwin and of Mimno, Wallach, Talley, Leenders, & McCallum. A good topic model with the optimal number of topics should make the semantic coherence score as large as possible. The first-stage grid search suggested that the semantic coherence score was the largest (-61.91) when the number of topics k was three and the second largest (-99.81) when k was 33. Given that a large collection of articles like ours is unlikely to fall into just three topics, we set the optimal number of topics from the first-stage grid search to k_first-stage = 33. We then conducted the second-stage grid search by computing a model set of k_first-stage ± 10 in steps of one (i.e. 23, 24, 25, …, 42, 43). The second-stage procedure suggested that the topic coherence score reaches its maximum when the number of topics is 25. Next, we repeated the two-stage grid-search procedure with Latent Semantic Analysis (LSA) as a robustness check. The topic coherence scores of LSA are also shown in Fig 2, in which the best topic number seems to be 23. These results suggested that our collection of articles could be modelled with more than 20 but fewer than 30 topics. Note that LDA has been shown to be more accurate and robust than LSA. Therefore, we chose the result obtained from the LDA grid-search analysis (25).
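The two-stage search logic can be sketched independently of any topic-modeling library. In the sketch below, `score` is a caller-supplied function that maps a topic count to a coherence score (in practice it would fit the model several times and average the scores); `toy_score` is a hypothetical stand-in whose values are invented so that the example runs, and the `exclude` argument mirrors the manual decision to rule out k = 3.

```python
def two_stage_grid_search(score, coarse, exclude=(), radius=10):
    """Pick the best k on a coarse grid, then refine in steps of one around it.

    `score` maps a topic count k to a coherence score (larger is better);
    `exclude` lists coarse candidates that are ruled out by hand.
    """
    candidates = [k for k in coarse if k not in exclude]
    k_first = max(candidates, key=score)
    fine = range(max(2, k_first - radius), k_first + radius + 1)
    k_best = max(fine, key=score)
    return k_first, k_best

def toy_score(k):
    """Hypothetical coherence curve: a spike at k = 3 and a peak at k = 25."""
    if k == 3:
        return -60.0
    if k < 25:
        return -90.0 - 5 * (25 - k)
    return -90.0 - (k - 25)

coarse_grid = range(3, 104, 10)  # 3, 13, 23, ..., 103
k_first, k_best = two_stage_grid_search(toy_score, coarse_grid, exclude={3})
print(k_first, k_best)
```

The coarse pass keeps the number of fitted models manageable, and the fine pass only explores the neighborhood where the coarse pass found its best candidate.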
We assessed topic modeling quality in the following ways. Firstly, we plotted the distances of 25 topics in Fig 3 with the multidimensional scaling (MDS) method. Fig 3 confirms the high quality of the 25-topic model, as topics do not cluster but spread evenly through unit spaces.
Then, we computed the likelihood of each article covering each of the 25 topics with LDA. Note that LDA is a mixed-membership model, which means that each document is represented as a mixture of a set of topics and each topic is regarded as a distribution over the words in the vocabulary. We assigned each article to the dominant topic, that is, the topic with the highest loading. We present the topic modeling results in Table 4. The highest topic loadings of these articles range from 0.11 to 0.96 (mean = 0.56, std. = 0.14). Antons et al argue that an article does not contain a meaningful topic if the loading to this topic is smaller than 0.10. Therefore, the highest topic loadings of all articles were valid.
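The assignment rule can be written in a few lines. The function below returns the index of the highest-loading topic and applies the 0.10 validity threshold mentioned in the text; the function name and the example loadings are hypothetical, and topic indices are zero-based.

```python
def dominant_topic(loadings, threshold=0.10):
    """Return (topic_index, loading) for the highest loading, or None if it is below threshold."""
    best = max(range(len(loadings)), key=lambda t: loadings[t])
    if loadings[best] < threshold:
        return None
    return best, loadings[best]

# Hypothetical topic loadings for two articles over four topics.
article_a = [0.05, 0.70, 0.15, 0.10]
article_b = [0.25, 0.25, 0.30, 0.20]
print(dominant_topic(article_a))
print(dominant_topic(article_b))
```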
Finally, we evaluated the level of topic diversity with the Herfindahl-Hirschman Index (HHI), which is a commonly accepted measure of market or portfolio diversification. As a rule of thumb, a market with an HHI of less than 0.10 is a competitive or diverse marketplace, an HHI of 0.10 to 0.25 is a moderately concentrated marketplace, and an HHI of 0.25 or greater is a highly concentrated or monopolistic marketplace. By analogy, for each article, we squared the topic loading of each topic and then summed the resulting numbers, which can range from close to zero to one. We followed the same logic as market competition analysis and defined an article as containing diverse topics if its HHI is smaller than 0.10, a few important topics if its HHI is between 0.10 and 0.18, and a salient topic if its HHI is 0.18 or greater. If there are many articles with diverse topics, then the number of topics chosen may be problematic, as LDA fails to extract dominant topics that are distinct from other topics. We found that 57.71% of the articles have a salient topic, 38.60% have a few important topics, and only 3.69% have diverse topics. The MDS plot and the analyses of topic loadings and of topic diversity provide solid support for the conclusion that our LDA topic model with 25 topics is of high quality, as the significant topics hidden in each article have been successfully retrieved.
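The per-article HHI and the three-way classification follow directly from the definitions above. The sketch below uses the 0.10 and 0.18 thresholds stated in the text; the function names and the three loading vectors are hypothetical examples.

```python
def hhi(loadings):
    """Herfindahl-Hirschman Index: the sum of squared topic loadings."""
    return sum(p * p for p in loadings)

def diversity_class(loadings):
    """Classify an article with the thresholds stated in the text (0.10 and 0.18)."""
    index = hhi(loadings)
    if index < 0.10:
        return "diverse"
    if index < 0.18:
        return "important"
    return "salient"

# Hypothetical loading vectors, each summing to one.
focused = [0.70, 0.10, 0.10, 0.10]        # one topic dominates
spread = [0.04] * 25                      # loadings spread evenly over 25 topics
mixed = [0.30, 0.20] + [0.5 / 23] * 23    # two moderately strong topics

for name, vec in [("focused", focused), ("spread", spread), ("mixed", mixed)]:
    print(name, round(hhi(vec), 4), diversity_class(vec))
```

With 25 topics, the smallest possible HHI is 1/25 = 0.04, reached when every topic has the same loading, so values just above that floor indicate that the model found no dominant topic for the article.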
We manually labeled each topic in the following manner. Firstly, we downloaded the full texts of the 20 articles whose loadings were the highest within each topic and invited 50 graduate students to read them carefully. That is, each student read 20 randomly-chosen articles and each article was read by two students. Each student proposed a preliminary label for each topic. At the same time, the author team read the abstracts of the 50 highest loading articles per topic. Finally, the author team organized several workshops with the students to finalize the labels. For 21 of the 25 topics, the students suggested labels that were identical or highly similar to those generated by the author team. We discussed the four topics for which the labels assigned by the students and the author team differed significantly to reach a consensus on the most appropriate topic labels.
The number of articles per topic ranges from 252 to 1,172 (mean = 562.2, std. = 249.00). The three most prevalent topics are “globalization, modernization and social class evolution” (Topic 5), “education and social inequality” (Topic 9) and “urban inequality, corporate social responsibility and public policy in connected capitalism” (Topic 22), each of which contains more than 1,000 articles. The three least prevalent topics are “preventive health inequality” (Topic 4), “criminal justice, terrorism, lifestyle exposure and victimization in different social classes” (Topic 10), and “sociolinguistics and social inequality” (Topic 15), each of which contains around 300 articles or fewer. In addition, “urban inequality, corporate social responsibility and public policy in connected capitalism” (Topic 22), “mortality and social inequality” (Topic 13), and “cancer and social inequality” (Topic 8) exhibit the three highest average loadings (>0.42), indicating that the articles covering these topics tend to be more similar than those covering relatively low-loading ones, for example, “social class schema and theoretical debates” (Topic 3, average loading = 0.26), “discrimination, social value, and gender and racial inequality” (Topic 7, average loading = 0.29) and “pathways of social inequality and psychosocial health” (Topic 25, average loading = 0.28).
Finally, we listed the three most cited references and the three most frequent outlets per topic in Tables 5 and 6. These cited references and outlets can be regarded as the field’s principal knowledge sources. In general, Krieger, Williams, & Moss  has been cited in 12 topics, and Liberatos, Link, & Kelsey  in nine. Pierre Bourdieu’s work [30,31] is also extensively and widely cited in many topics. In addition, Social Science & Medicine is one of Top 3 outlets in 16 topics, Journal of Epidemiology and Community Health in 10 topics, and American Journal of Public Health in five topics.
Given that the field in general has experienced substantial growth after 1991, we examined the temporal dynamics of each topic in two periods (i.e. 1956–1990 and 1991–2017). We constructed 26 time series (i.e. the field and the 25 topics, shown in Fig 1 and S1 Fig). The publication percentage of the field has grown significantly in both the pre-1991 (mean = 3.03%) and post-1991 periods (mean = 9.12%). Sixteen topics experienced a decline before 1991, but all of them rebounded strongly after 1991. For example, the publication percentage of “cancer and social inequality” (Topic 8) shrank (on average -26.11% per year) before 1991 but expanded (on average 6.71% per year) in the second period. None of the 25 topics declined in the post-1991 period. In particular, “smoking, diet and active health promotion activities in different social classes” (Topic 20) increased on average 54.94% per year, “heart disease, work environment and social inequality” (Topic 6) increased on average 39.61% and “education and social inequality” (Topic 9) increased on average 26.05%.
Some topics, such as “smoking, diet and active health promotion activities in different social classes” (Topic 20), “childhood social class and adulthood health” (Topic 21), and “preventive health inequality” (Topic 4), did not appear in the 1950s and 1960s. It was not until the 1990s that all 25 topics were present. “Social class schema and theoretical debates” (Topic 3) was prevalent in the 1960s and 1970s but suddenly became much less popular in the following decades.
Then, we sought to identify the trends in the field as a whole and in each topic using a time series forecasting technique. We did not follow conventional trend analysis, which applies linear and quadratic time trend regressions to the series of article counts, for two reasons. On the one hand, article count series usually exhibit strong autocorrelation, which manifests in correlated residuals after a regression model has been fitted. The autocorrelation violates the standard assumption of independent errors. On the other hand, article counts do not take the consistent growth in all SSCI publications over time into account, which makes the results obtained by regressions spurious. Therefore, we chose the Autoregressive Integrated Moving Average (ARIMA) technique. The AR part can be conceived as a linear regression on previous time series values, and the MA part is conceptually regarded as a linear regression of the current value of the series against prior random shocks. The I (for “integrated”) part indicates that the data values have been replaced with the differences between their values and one or several previous values, which allows non-stationary series to be modeled. Explicitly catering to a suite of standard structures in time series data, ARIMA provides a simple yet powerful method for making skillful time series forecasts.
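For reference, the three parts described above are commonly combined in a single equation. The notation below is standard textbook notation rather than the article's own: B is the backshift operator (B y_t = y_{t-1}), the φ_i are the p autoregressive coefficients, the θ_j are the q moving-average coefficients, d is the order of differencing, c is a constant, and ε_t is a white-noise shock.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^{i}\right) (1 - B)^{d}\, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^{j}\right) \varepsilon_t
```

With d = 0 the equation reduces to an ARMA(p, q) model on the original series; with d = 1 the same model is applied to the year-to-year changes y_t - y_{t-1}.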
We constructed 26 time series and identified the appropriate ARIMA terms following the conventional Box-Jenkins Methodology :
Firstly, we split each series into a training part (80%, i.e. 1956–2005) and a test part (20%, i.e. 2006–2017). We used the Augmented Dickey–Fuller test to identify the appropriate order of differencing (i.e. the d parameter) for the training series. Secondly, we specified the AR order with the partial autocorrelation function (PACF) plot of the training series. The PACF displays the autocorrelation of each lag of a series after controlling for the autocorrelation caused by all preceding lags. If there is a sharp drop in the PACF of a series after p lags, then an ARIMA model should include p autoregressive terms, as the previous p values of the series are responsible for the autocorrelation in the series. Thirdly, we specified the number of MA terms by plotting the ACF of the training series. If the ACF is non-zero for the first q lags and then drops toward zero, then an ARIMA model should include q MA terms. Fourthly, we fitted an ARIMA model with the identified order parameters (i.e. p, d, q) to the training series. To verify the quality of this model, we plotted its residuals to see whether they appear as entirely random white noise and conducted the Ljung-Box test to formally check whether the errors are uncorrelated across many lags [36,37]. If the residuals did not pass these checks, we improved the model by removing any remaining trend. Finally, we tested the improved model with the test series and computed the RMSE, AIC and BIC scores.
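Some building blocks of this workflow can be sketched with the Python standard library alone: the train/test split, differencing, the sample autocorrelation function, and the Ljung-Box Q statistic. The function names and the synthetic series are hypothetical, and the sketch omits the Augmented Dickey–Fuller test, the PACF and the ARIMA fit itself, for which a statistics package would normally be used.

```python
def split_series(series, train_share=0.8):
    """Split a series chronologically into a training part and a test part."""
    cut = round(len(series) * train_share)
    return series[:cut], series[cut:]

def difference(series, d=1):
    """Apply d rounds of first differencing."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(v * v for v in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]

def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic, to be compared against a chi-square reference distribution."""
    n = len(residuals)
    rho = acf(residuals, max_lag)
    return n * (n + 2) * sum(rho[k] ** 2 / (n - k) for k in range(1, max_lag + 1))

years = list(range(1956, 2018))  # 62 annual observations
# Synthetic series with a linear trend and an alternating wiggle, for illustration only.
toy_series = [0.05 + 0.004 * i + (0.01 if i % 2 else -0.01) for i in range(len(years))]
train, test = split_series(toy_series)
stationary = difference(train, d=1)
q_stat = ljung_box_q(stationary, max_lag=5)
print(len(train), len(test), round(q_stat, 2))
```

With 62 annual observations, an 80% share rounds to 50 training years, which matches the 1956–2005 training window described above.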
To check the robustness of our ARIMA order specifications, we conducted a grid-search by estimating 1,125 ARIMA models with different combinations of orders (i.e. d = [0,5], p = [0,15], q = [0,15]). By comparing these models with the manually specified optimal model in terms of the Ljung-Box test of residuals, AIC and BIC, the ARIMA grid-search results confirm that our order specifications were indeed optimal (i.e. the Ljung-Box test is statistically insignificant and the values of RMSE, AIC and BIC are minimum). Results were summarized in Table 7 and S1 Fig.
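A grid search over ARIMA orders amounts to enumerating (p, d, q) combinations and keeping the one with the best criterion value. Note that 1,125 equals 5 × 15 × 15, which is consistent with reading the stated ranges as half-open (d from 0 to 4, p and q from 0 to 14); that reading is an assumption made for this sketch. The `toy_criterion` function is a hypothetical stand-in for the AIC of a fitted model.

```python
from itertools import product

def order_grid(max_p=15, max_d=5, max_q=15):
    """Enumerate (p, d, q) orders over half-open ranges, as Python's range() does."""
    return list(product(range(max_p), range(max_d), range(max_q)))

def best_order(orders, criterion):
    """Return the order that minimizes a caller-supplied criterion (e.g. the AIC of a fit)."""
    return min(orders, key=criterion)

def toy_criterion(order):
    """Hypothetical stand-in for an information criterion; smallest at (2, 1, 1)."""
    p, d, q = order
    return (p - 2) ** 2 + (d - 1) ** 2 + (q - 1) ** 2

orders = order_grid()
print(len(orders), best_order(orders, toy_criterion))
```

In a real run, the criterion function would fit an ARIMA model for the given order on the training series and return its AIC or BIC, skipping orders for which estimation fails.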
We employed the optimized ARIMA models to forecast the publication percentages of the field and of each topic for the next ten years (i.e. 2018–2027) respectively. The forecast average annual growth rate was used as the indicator of future topic prevalence (see Table 7). The field may continue to expand in the next decade, as its annual growth rate will be 2.51%, suggesting that the field of social class and inequality will consistently attract significant attention in multidisciplinary research communities. We classified the 25 topics into three categories using the following criteria: hot topics for those whose forecast annual growth rates are higher than or equal to the one of the field (i.e. 2.51%), stable topics for those whose rates are positive or equal to zero but smaller than the one of the field, and cold topics for those whose rates are negative. There are three hot topics, eight stable topics and 14 cold topics. We discussed these findings in the next section.
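The classification rule is simple enough to state as code. The sketch below applies the three criteria to a few forecast growth rates quoted in the Discussion section (Topics 14, 9, 16 and 15), using the field's rate of 2.51% as the benchmark; the function name is hypothetical.

```python
def classify_topic(growth_rate, field_rate=2.51):
    """Classify a topic by its forecast average annual growth rate (in percent)."""
    if growth_rate >= field_rate:
        return "hot"
    if growth_rate >= 0:
        return "stable"
    return "cold"

# Forecast annual growth rates (percent) quoted in the article's discussion.
forecasts = {"Topic 14": 8.53, "Topic 9": 3.69, "Topic 16": 1.63, "Topic 15": -20.01}
labels = {topic: classify_topic(rate) for topic, rate in forecasts.items()}
print(labels)
```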
Discussion and conclusions
The aim of this study is to provide a systematic review of social class and inequality research over the last seven decades: its evolution, topic landscape, and dynamics. Our topic modelling analyses considerably enhance understanding of the hidden structure of 25 distinct topics covering the overall development in the field. In addition, our analysis of topic dynamics reveals the highly fluctuating nature of the field’s content structure. Our forecasting results suggest that while, in general, the field will continue to attract more attention, 14 topics may lose popularity. In particular, “skeletal, dental and cranial anthropology and social stratification throughout history” (Topic 2) is forecast to shrink dramatically (-241.18%), followed by “sociolinguistics and social inequality” (Topic 15, -20.01%) and “preventive health inequality” (Topic 4, -6.50%). These findings seem reasonable, given that the three topics are not mainstream in the field: each of them accounts for less than 2.5% of the articles.
In addition, the 25 topics can be roughly divided into two categories. The 15 medicine-related research topics dominate the field, comprising 54.86% of the articles. This is not surprising, given that healthcare, the sociology of illness, and the social organization of medicine are among the fastest growing areas of modern research. Studies in these topics use core principles and concepts of medical sociology to elucidate the determinants and consequences of various types of illness and wellness (e.g. oral health, prenatal care and psychology). These articles have extensively examined the socioeconomic risk factors of health and their iatrogenic repercussions. Such research contributes to the field of social class and inequality by exploring the social meaning of illness, by examining the issue of care-taking as well as care-giving actions related to familial, community and governmental responsibilities, and by deconstructing health inequalities grounded in social stratifications. Our research suggests that, in general, the research in these topics has grown substantially and matured, as the forecast annual growth rates of many medicine-related research topics are either negative or close to zero. That is probably because many studies have reached a consensus that the problems of access to health care, inequality in medical coverage, and the influence of oppressive social structures make ‘health’ impossible for many people confined in an unfavorable class position. Future efforts may be devoted to “community health, intervention and social inequality in multicultural contexts” (Topic 14), whose forecast annual growth rate is 8.53%.
The second category of work in our collection is social sciences-oriented, focusing on topics related to education inequality, social structure evolution, the impact of globalization, business development and public policies. There may be research gaps in “education and social inequality” (Topic 9, whose forecast annual growth rate will be 3.69%) and “income inequality, labor market reform and industrial relations” (Topic 16, whose forecast annual growth rate will be 1.63%). Growing inequality is regarded as one of the most important developments in today’s industrial relations. This phenomenon has been most pronounced in the West, where rising support for populism has disrupted politics and challenged corporate capitalism in many countries . Future research may give special attention to emerging forms of organizational restructuring and labor market institutions, such as trade union power, wage regulations and the influence of the Artificial Intelligence-based fourth industrial revolution.
In conclusion, this study applies LDA topic modeling to structure a large text corpus effectively. By doing so, we enable researchers to examine the detailed profile of each topic and estimate its relative salience. By describing the whole body of knowledge at a relatively granular level, we contribute to a rich understanding of the field’s topic landscape. As such, researchers can appreciate the full range of topics and select those they wish to examine in depth. In addition, our topic landscape informs social class and inequality teaching and course design. Instructors can identify important topics to cover in a course, and include relevant articles associated with each topic. Our study also helps postgraduate students and junior researchers identify which research topics to examine. Finally, our findings have meaningful implications for journal editors. They can compare the field’s current topic landscape against their journal’s editorial priorities, and thus choose promising topics to be reflected in the composition of the editorial board or promoted through special issues.
However, our study has some limitations. Our sample articles were collected from WoS. Although it is probably the single most authoritative source for “high-impact” publications and has better coverage of social sciences and arts/humanities than other academic databases, WoS focuses mainly on mainstream journals and articles, especially those in English. As a result, our analyses excluded articles published in emerging journals, articles in non-English languages and other types of publications (e.g. books, conference papers, technical reports, theses and dissertations). Future studies may collect publication records from Google Scholar, as it covers book contents along with other freely accessible online publications. In addition, we did not take the correlations between topics into account, and therefore cannot forecast how the values of one topic will be correlated with those of other topics. Future work may employ multivariate time series methods to capture the associations between topic time series. Finally, we did not include in our forecasting models any external bibliometric factors that may correlate with the growth or decline of a topic time series. Future work should investigate bibliometric determinants of topic dynamics.
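As a purely descriptive first step toward the multivariate analysis suggested above, one could inspect pairwise correlations between the year-over-year changes of topic time series. The sketch below is an illustration under stated assumptions, not the method used in this study: the topic names and yearly shares are hypothetical, and a Pearson correlation of first differences is a much simpler tool than a full multivariate time series model such as a vector autoregression.

```python
from itertools import combinations
from math import sqrt


def first_differences(series):
    """Year-over-year changes, used to remove a shared trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def topic_correlations(topic_series):
    """Correlate the differenced series of every pair of topics."""
    diffs = {name: first_differences(v) for name, v in topic_series.items()}
    return {
        (a, b): pearson(diffs[a], diffs[b])
        for a, b in combinations(sorted(diffs), 2)
    }


# Hypothetical yearly topic shares; not data from the article.
topic_series = {
    "topic_a": [0.10, 0.12, 0.11, 0.15, 0.14],
    "topic_b": [0.20, 0.18, 0.19, 0.15, 0.16],
    "topic_c": [0.05, 0.06, 0.08, 0.09, 0.12],
}
correlations = topic_correlations(topic_series)
for pair, value in correlations.items():
    print(pair, round(value, 3))
```

Differencing before correlating is a deliberate choice here: two topics that both trend upward would otherwise appear strongly correlated even if their yearly fluctuations are unrelated.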
S1 Fig. The temporal trajectories of 25 topics.
Korgen K. The Cambridge handbook of sociology. Cambridge: Cambridge University Press; 2017.
Keister L, Southgate D. Inequality: A contemporary approach to race, class and gender. Cambridge: Cambridge University Press; 2012.
- 3. Erola J, Moisio P. Social mobility over three generations in Finland, 1950–2000. Eur Sociol Rev. 2007;23: 169–183.
Avineri S. The social and political thought of Karl Marx. Cambridge: Cambridge University Press; 1968.
Giddens A. The Relations of Production and Class Structure. Cambridge: Cambridge University Press; 1971.
Lenski G. Power and privilege. New York (USA): McGraw-Hill; 1966.
- 7. Blei D, Ng A, Jordan M. Latent Dirichlet Allocation. J Mach Learn Res. 2003;3: 993–1022.
- 8. Salton G, Allan J, Singhal A. Automatic text decomposition and structuring. Inf Process Manag. 1996;32: 127–138.
- 9. Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R. Indexing by latent semantic analysis. J Am Soc Inf Sci. 1990;41: 6–391.
- 10. Larson R. Introduction to information retrieval. J Am Soc Inf Sci Technol. 2010;61: 852–853.
- 11. Blei D. Introduction to probabilistic topic modeling. Commun ACM. 2012;55: 77–84.
- 12. Antons D, Kleer R, Salge T. Mapping the topic landscape of JPIM, 1984–2013: in search of hidden structures and development trajectories. J Prod Innov Manag. 2016;33: 726–749.
- 13. Blei D, Lafferty J. Topic models. Text Min Classif Clust Appl. 2009;1: 71–89.
- 14. Mcauliffe J, Blei D. Supervised topic models. Adv Neural Inf Process Syst. 2008;1: 121–128.
Ramage D, Hall D, Nallapati R, Manning C. A supervised topic model for credit attribution in multi-labeled corpora. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. 2009. pp. 248–256.
Ramage D, Manning C, Dumais S. Partially labeled topic models for interpretable text mining. Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. 2011. pp. 457–465.
- 17. Griffiths T, Steyvers M. Finding scientific topics. Proc Natl Acad Sci. 2004;101: 5228–5235. pmid:14872004
Kohn M, Slomczynski K. Social structure and self-direction: A comparative analysis of the United States and Poland. Basil: Blackwell; 1990.
- 19. Schooler C. A working conceptualization of social structure: Mertonian roots and psychological and sociocultural relationships. Soc Psychol Q. 1994;57: 262–273.
Wang X, McCallum A, Wei X. Topical N-grams: Phrase and topic discovery, with an application to information retrieval. Proceedings of IEEE International Conference on Data Mining. 2007. pp. 697–702.
- 21. Hornik K, Grün B. topicmodels: An R package for fitting topic models. J Stat Softw. 2011;40: 1–30.
Rehurek R, Sojka P. Software framework for topic modelling with large corpora. Proceedings of The LREC 2010 Workshop on New Challenges for NLP Frameworks. 2010. pp. 45–50.
Newman D, Lau J, Grieser K, Baldwin T. Automatic evaluation of topic coherence. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. 2010. pp. 100–108.
Mimno D, Wallach H, Talley E, Leenders M, McCallum A. Optimizing semantic coherence in topic models. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. 2011. pp. 262–272.
Syed S, Spruit M. Full-text or abstract? Examining topic coherence scores using latent dirichlet allocation. The 4th IEEE International Conference on Data Science and Advanced Analytics. 2017. pp. 165–174.
- 26. Lucas C, Nielsen R, Roberts M, Stewart B, Storer A, Tingley D. Computer-assisted text analysis for comparative politics. Polit Anal. 2015;23: 254–277.
Hill C, Jones G, Schilling M. Strategic management: theory: An integrated approach. New York (USA): Cengage Learning; 2014.
- 28. Krieger N, Williams D, Moss N. Measuring social class in US public health research: Concepts, methodologies, and guidelines. Annu Rev Public Health. 1997;18: 341–378. pmid:9143723
- 29. Liberatos P, Link B, Kelsey J. The measurement of social-class in epidemiology. Epidemiol Rev. 1988;10: 87–121. pmid:3066632
Bourdieu P, Passeron J. Reproduction in education, society and culture. London: Sage; 1977.
Bourdieu P. The forms of capital. In: Richardson J, editor. Handbook of Theory and Research for the Sociology of Education. New York (USA): Greenwood; 1986. pp. 241–258.
Hyndman R, Athanasopoulos G. Forecasting: principles and practice. Free eBook: OTexts; 2014.
Box G, Jenkins G, Reinsel G, Ljung G. Time series analysis: Forecasting and control. New York (USA): John Wiley & Sons; 2015.
McCleary R, Hay R, Meidinger E, McDowall D. Applied time series analysis for the social sciences. Beverly Hills (USA): Sage; 1980.
Cowpertwait P, Metcalfe A. Introductory time series with R. New York (USA): Springer-Verlag; 2009.
Cryer J, Chan K. Time series analysis: With applications in R. New York (USA): Springer; 2008.
- 37. Ljung G, Box G. On a measure of lack of fit in time series models. Biometrika. 1978;65: 297–303.
- 38. Langenberg C, Hardy R, Kuh D, Brunner E, Wadsworth M. Central and total obesity in middle aged men and women in relation to lifetime socioeconomic status: Evidence from a national birth cohort. J Epidemiol Community Health. 2003;57: 816–822. pmid:14573589
Inglehart R, Norris P. Trump, Brexit, and the rise of populism: Economic have-nots and cultural backlash. Cambridge: Harvard Kennedy School; 2016.
- 40. Heo G, Kang K, Song M, Lee J. Analyzing the field of bioinformatics with the multi-faceted topic modeling technique. BMC Bioinformatics. 2017;18: 975–1014.
- 41. Karami A, Gangopadhyay A, Zhou B, Kharrazi H. Fuzzy approach topic discovery in health and medical corpora. Int J Fuzzy Syst. 2018;20: 1334–1345.
- 42. Figuerola C, Marco F, Pinto M. Mapping the evolution of library and information science (1978–2014) using topic modeling on LISA. Scientometrics. 2017;112: 1507–1535.
- 43. Yau C, Porter A, Newman N, Suominen A. Clustering scientific documents with topic modeling. Scientometrics. 2014;100: 767–786.
- 44. Hu Z, Fang S, Liang T. Empirical study of constructing a knowledge organization system of patent documents using topic modeling. Scientometrics. 2014;100: 787–799.
- 45. Das S, Sun X, Dutta A. Text mining and topic modeling of compendiums of papers from transportation research board annual meetings. Transp Res Rec. 2016;20: 48–56.
- 46. Westgate M, Barton P, Pierson J, Lindenmayer D. Text analysis tools for identification of emerging topics and research gaps in conservation science. Conserv Biol. 2015;29: 1606–1614. pmid:26271213
- 47. Tvinnereim E, Fløttum K. Explaining topic prevalence in answers to open-ended survey questions about climate change. Nat Clim Chang. 2015;5: 744.
- 48. Carnerud D. Exploring research on quality and reliability management through text mining methodology. Int J Qual Reliab Manag. 2017;34: 975–1014.
- 49. Farrell J. Corporate funding and ideological polarization about climate change. Proc Natl Acad Sci. 2016;113: 92–97. pmid:26598653
- 50. Bittermann A, Fischer A. How to identify hot topics in psychology using topic modeling. Zeitschrift fur Psychol Psychol. 2018;226: 3–13.
- 51. Oh J, Stewart A, Phelps R. Topics in the journal of counseling psychology, 1963–2015. J Couns Psychol. 2017;64: 604–615. pmid:29154573
- 52. Wang S, Ding Y, Zhao W, Huang Y, Perkins R, Zou W, et al. Text mining for identifying topics in the literatures about adolescent substance use and depression. BMC Public Health. 2016;16: 975–1014.
- 53. Sun L, Yin Y. Discovering themes and trends in transportation research using topic modeling. Transp Res Part C-Emerging Technol. 2017;77: 49–66.
- 54. Muntaner C, Eaton W, Diala C, Kessler R, Sorlie P. Social class, assets, organizational control and the prevalence of common groups of psychiatric disorders. Soc Sci Med. 1998;47: 2043–2053. pmid:10075245
- 55. Hollingshead A. Four Factor Index of Social Status. 1975.
Ambrose S. Isotopic analysis of paleodiets: methodological and interpretative considerations. In Sandford M. K. (Ed.), Investigations of Ancient Human Tissue. Chemical Analyses in Anthropology. 1993.
- 57. Phenice T. A newly developed visual method of sexing the os pubis. Am J Phys Anthropol. 1969;30: 297–301. pmid:5772048
- 58. Hayden B. Pathways to power: Principles for creating socioeconomic inequalities. Found Soc Inequal. 1995; 15–86.
- 59. Goldthorpe J. Women and class analysis: In defense of the conventional view. Sociology. 1983;17: 465–488.
- 60. Stanworth M. Women and class analysis: A reply to john goldthorpe. Sociology. 1984;18: 159–170.
Dahrendorf R. Class and Class Conflict in Industrial Society. Stanford: Stanford University Press; 1959.
- 62. Marmot M, Smith G. Health inequalities among British civil servants: The Whitehall ii study. Lancet. 1991;337: 1387–1394. pmid:1674771
- 63. Davis P. Office encounters in general practice in the Hamilton Health District. I. Social class patterns among employed males, 15–64. N Z Med J. 1985;98: 789–792. pmid:3865076
- 64. Smaje C, Le Grand J. Ethnicity, equity and the use of health services in the British NHS. Soc Sci Med. 1997;45: 485–496. pmid:9232742
- 65. Reay D. Beyond consciousness? The psychic landscape of social class. Sociol J Br Sociol Assoc. 2005;39: 911–928.
- 66. Peterson R, Kern R. Changing highbrow taste: From snob to omnivore. Am Sociol Rev. 1996;61: 900.
- 67. Rosengren A, Wedel H, Wilhelmsen L. Coronary heart disease and mortality in middle aged men from different occupational classes in Sweden. Br Med J. 1988;297: 1497–1500.
- 68. Marmot M, Rose G, Shipley M, Hamilton P. Employment grade and coronary heart disease in British civil servants. J Epidemiol Community Heal. 1978;32: 244–249.
- 69. Karasek R. Job Demands, job decision latitude, and mental strain: implications for job redesign. Adm Sci Q. 1979;24: 285–308.
- 70. Kessler R, Mickelson K, Williams D. The prevalence, distribution, and mental health correlates of perceived discrimination in the United States. J Heal Soc Behav. 1999;40: 208–230.
- 71. Karlsen S, Nazroo J. Relation between racial discrimination, social class, and health among ethnic minority groups. Am J Public Health. 2002;92: 624–631. pmid:11919063
- 72. Williams D, Neighbors H, Jackson J. Racial/Ethnic discrimination and health: Findings from community studies. Am J Public Health. 2008;98: S29—37. pmid:18687616
- 73. Farley T, Flannery J. Late-stage diagnosis of breast cancer in women of lower socioeconomic status: Public health implications. Am J Public Health. 1989;79: 1508–1512. pmid:2817162
- 74. Krieger N, Chen J, Waterman P, Soobader M, Subramanian S, Carson R. Geocoding and monitoring of US socioeconomic inequalities in mortality and cancer incidence: Does the choice of area-based measure and geographic level matter? The public health disparities geocoding project. Am J Epidemiol. 2002;156: 471–482. pmid:12196317
- 75. Clegg L, Reichman M, Miller B, Hankey B, Singh G, Lin Y, et al. Impact of socioeconomic status on cancer incidence and stage at diagnosis: Selected findings from the surveillance, epidemiology, and end results of the National Longitudinal Mortality Study. Cancer Causes Control. 2009;20: 417–435. pmid:19002764
- 76. Raftery A, Hout M. Maximally maintained inequality: Expansion, reform, and opportunity in Irish education, 1921–75. Sociol Educ. 1993;66: 41.
- 77. Erikson R, Goldthorpe J. The constant flux: A study of class mobility in industrial societies. Contemporary Sociology. 1992.
- 78. Mare R. Social background and school continuation decisions. J Am Stat Assoc. 1980;75: 295–305.
- 79. Steensland B, Park J, Regnerus M, Robinson L, Wilcox W, Woodberry R. The measure of American religion: Toward improving the state of the art. Soc Forces. 2000;79: 291–318.
- 80. Wright B, Caspi A, Moffitt T, Miech R, Silva P. Reconsidering the relationship between SES and delinquency: Causation but not correlation. Criminology. 1999;37: 175–194.
Hindelang M, Hirschi T, Weis J. Measuring Delinquency. Measuring Delinquency. Beverly Hills (USA): Sage; 1981.
- 82. Whalley L, Deary I. Longitudinal cohort study of childhood IQ and survival up to age 76. Br Med J. 2001;322: 819–822.
Hollingshead A, Redlich F. Social Class and Mental Illness: A Community Study. New York (USA): Wiley; 1958.
- 84. Brayne C, Calloway P. The association of education and socioeconomic status with the mini mental state examination and the clinical diagnosis of dementia in elderly people. Age Ageing. 1990;19: 91–96. pmid:2337015
- 85. Kraus M, Keltner D. Signs of socioeconomic status: A thin-slicing approach. Psychol Sci. 2009;20: 99–106. pmid:19076316
- 86. Pratto F, Sidanius J, Stallworth L, Malle B. Social dominance orientation: A personality variable predicting social and political attitudes. J Pers Soc Psychol. 1994;67: 741–763.
- 87. Tajfel H, Turner J. An integrative theory of intergroup conflict. Soc Psychol Intergr Relations. 1979;81: 33–47.
- 88. Huisman M, Kunst A, Bopp M, Borgan J-K, Borrell C, Costa G, et al. Educational inequalities in cause-specific mortality in middle-aged and older men and women in eight western European populations. Lancet. 2005;365: 493–500. pmid:15705459
- 89. Marmot M, Mcdowall M. Mortality decline and widening social inequalities. Lancet. 1986;328: 274–276.
- 90. Kunst A, Groenhof F, Mackenbach J, Hlth E. Occupational class and cause specific mortality in middle aged men in 11 European countries: comparison of population based studies. Br Med J. 1998;316: 1636–1641.
- 91. Bronfenbrenner U. The Ecology of Human Development: Experiments by Nature and Design. Am Psychol. 1977;32: 513–531.
- 92. Liu W, Soleck G, Hopps J, Dunston K, Pickett T. A new framework to understand social class in counseling: The social class worldview model and modern classism theory. J Multicult Couns Devel. 2004;32: 95–122.
- 93. Adler N, Epel E, Castellazzo G, Ickovics J. Relationship of subjective and objective social status with psychological and physiological functioning: Preliminary data in healthy white women. Heal Psychol. 2000;19: 586–592.
American Psychiatric Association. Diagnostic and Statistical Manual of Mental Health Disorders (DSM-III-R). Arlington; 1987.
Trudgill P. The social differentiation of English in Norwich. In: Coupland N, Jaworski A, editors. Sociolinguistics. Modern Lin. London: Palgrave; 1997.
- 96. Labov W. The intersection of sex and social class in the course of linguistic change. Lang Var Change. 1990;2: 205.
- 97. Erikson R, Goldthorpe J, Portocarero L. Intergenerational class mobility in three western European societies: England, France and Sweden. Br J Sociol. 1979;30: 415–441.
- 98. Sorenson A. Toward a sounder basis for class analysis. Am J Sociol. 2000;105: 1523–1558.
- 99. Shavit Y, Blossfeld H-P. ersistent inequality: Changing educational attainment in thirteen countries. social inequality series. Br J Educ Stud. 1993; 408.
- 100. Brooke O, Anderson H, Bland J, Peacock J, Stewart C. Effects on birth weight of smoking, alcohol, caffeine, socioeconomic factors, and psychosocial stress. Br Med J (Clin Res Ed). 1989;298: 795–801.
- 101. Pattenden S, Dolk H, Vrijheid M. Inequalities in low birth weight: parental social class, area deprivation, and lone mother status. J Epidemiol Community Health. 1999;53: 355–358. pmid:10396482
- 102. Lynch J. Income inequality and mortality: Importance to health of individual income, psychosocial environment, or material conditions. Br Med J. 2000;320: 1200–1204.
Evans G. The end of class politics? Class voting in comparative context. Oxford: Oxford University Press. 1999.
Inglehart R. Culture shift in advanced industrial society. rinceton: Princeton University Press. 1990.
- 105. Hout M, Brooks C, Manza J. The democratic class struggle in the United States, 1948–1992. Am Sociol Rev. 1995;60: 805–828.
- 106. Smith G, Hart C, Watt G, Hole D, Hawthorne V. Individual social class, area-based deprivation, cardiovascular disease risk factors, and mortality: The Renfrew and Paisley study. J Epidemiol Community Health. 1998;52: 399–405. pmid:9764262
- 107. OCampo P, Xue X, Wang M, Caughy M. Neighborhood risk factors for low birthweight in Baltimore: A multilevel analysis. Am J Public Health. 1997;87: 1113–1118. pmid:9240099
- 108. Galobardes B, Shaw M, Lawlor D, Lynch J, Smith G. Indicators of socioeconomic position (part 1). J Epidemiol Community Health. 2006;60: 7–12.
- 109. Marshall S, Jones D, Ainsworth B, Reis J, Levy S, Macera C. Race/ethnicity, social class, and leisure-time physical inactivity. Med Sci Sports Exerc. 2007;39: 44–51. pmid:17218883
- 110. Lynch J, Kaplan G, Salonen J. Why do poor people behave poorly? Variation in adult health behaviours and psychosocial characteristics by stages of the socioeconomic lifecourse. Soc Sci Med. 1997;44: 809–819. pmid:9080564
- 111. Poulton R, Caspi A, Milne B, Thomson W, Taylor A, Sears M, et al. Association between children’s experience of socioeconomic disadvantage and adult health: A life-course study. Lancet. 2002;360: 1640–1645. pmid:12457787
- 112. Krieger N, Okamoto A, Selby J. Adult female twins’ recall of childhood social class and father’s education: A validation study for public health research. Am J Epidemiol. 1998;147: 704–708. pmid:9554610
Harvey D. NeoLiberalism: A brief history. Oxford: Oxford University Press. 2005.
Bian Y. Work and inequality in urban China. New York: SUNY Press. 1994.
Townsend P, Nick D. Inequalities in Health. New York (USA): Penguin; 1990.
- 116. Ware J, Sherbourne C. The MOS 36-item short-form health survey (SF-36). I. Conceptual framework and item selection. Med Care. 1992;30: 473–483. pmid:1593914
- 117. Adler N, Boyce T, Chesney M, Cohen S, Folkman S, Kahn R, et al. Socioeconomic status and health: The challenge of the gradient. Am Psychol. 1994;49: 15–24. pmid:8122813
- 118. Burkam D, Ready D, Lee V, LoGerfo L. Social-class differences in summer learning between kindergarten and first grade: Model specification and estimation. Sociol Educ. 2004;77: 1–31.
Wilkinson R. Unhealthy Societies. London: Routledge. 1996.
Kitagawa E, Hauser P. Differential Mortality in the United States. Cambridge MA: Harvard University Press. 1973.
- 121. Radloff L. The CES-D scale: A self-report depression scale for research in the general population. Appl Psychol Meas. 1977;1: 385–401.