Has sentiment returned to the pre-pandemic level? A sentiment analysis using U.S. college subreddit data from 2019 to 2022

Background As the impact of the COVID-19 pandemic winds down, both individuals and society are gradually returning to pre-pandemic life and activities. This study aims to explore how people's emotions changed from the pre-pandemic period through the pandemic to the post-emergency period, and whether current sentiment levels have returned to the pre-pandemic level. Method We collected Reddit social media data in 2019 (pre-pandemic), 2020 (peak period of the pandemic), and 2021 and 2022 (late stages of the pandemic, transitioning to the post-emergency period) from the subreddit communities of 128 universities/colleges in the U.S., along with a set of school-level baseline characteristics such as location, enrollment, graduation rate, and selectivity. We predicted two sets of sentiments, one from a pre-trained Robustly Optimized BERT Pretraining Approach (RoBERTa) model and one from a graph attention network (GAT) that leverages both the rich semantic information and the relational information among posted messages, and then applied model stacking to obtain the final sentiment classification. After obtaining the sentiment label for each message, we employed a generalized linear mixed-effects model to estimate the temporal trend in sentiment from 2019 to 2022 and how school-level factors may affect sentiment. Results Compared to the year 2019, the odds of negative sentiment in 2020, 2021, and 2022 are 25%, 7.3%, and 6.3% higher, respectively, all statistically significant at the 5% significance level based on the multiplicity-adjusted p-values. Conclusions Our study findings suggest a partial recovery in the sentiment composition (negative vs. non-negative) in the post-pandemic-emergency era. The results align with common expectations and provide a detailed quantification of how sentiments evolved from 2019 to 2022 in the sub-population represented by the sample examined in this study.


Introduction

Background
While COVID-19 remains a public health priority, many governments have transitioned away from the emergency phase that gripped the globe in 2020 and 2021. With a variety of effective strategies implemented to combat the COVID-19 pandemic, including vaccination, quarantine measures, and the adoption of remote work and study routines, the impact of the pandemic on society has gradually subsided since the second half of 2021. In the U.S., nearly all state-level mask mandates had been lifted by April 2022; many educational institutions, from elementary schools to higher education institutes, have returned to the pre-pandemic in-person learning mode; and social gatherings, conferences, sports, and entertainment events have welcomed back participants and fans at full capacity, among others. However, the post-pandemic world does not mirror the pre-pandemic era in many aspects, including the psychological and emotional ones. Several studies have analyzed sentiments and attitudes on various topics after the COVID-19 pandemic. [1] studied Twitter users' sentiment change toward COVID-19 vaccination after the first COVID-19 vaccination was administered in the U.S. They used negative binomial regression and linear regression and found that public sentiment towards vaccination became more positive after the first dose. Through frequency verification between public opinions and sentiments and analysis of the influence mechanism, [2] reported positive public opinion and sentiment regarding ports and corporate import/export choices after the pandemic. [3] studied sentiment and topic trends related to patient experience before, during, and after the pandemic in the framework of topic modeling with latent Dirichlet allocation. [4] studied the socioeconomic factors that may affect people's attitudes toward reopening the economy post-COVID-19 using Twitter data, socioeconomic data, environmental data, and COVID-19 case counts, and applied logistic regression to identify important factors. They concluded that people with low education levels, low income, in the labor force, and with higher residential rents are more interested in reopening the economy. [5] studied public well-being and sentiment toward education post-COVID-19. They curated Twitter data relevant to the education sector and used aspect-based sentiment analysis and machine learning techniques to identify sentiment and emotional triggers, concluding that safety is a top concern for students, parents, and educators. [6] studied the Jordanian community's attitude towards online and in-person hybrid learning by analyzing post-pandemic Twitter data. They used long short-term memory (LSTM) networks to classify tweet emotions and found that 18.75% of the samples fall within the category of Dissatisfied Anger and Hate, 21.25% Sad, 13% Happy, and 24.5% Neutral. [7] studied the motivation and inclination to travel in 2021 using thematic analysis, sentiment classification, and word clouds, concluding that nature-based travel has become the first choice of travel after 2020. [8] used Twitter data to study people's attitudes towards remote working in the post-pandemic era. They used TextBlob, the latent Dirichlet allocation model, and multiple machine learning models, and found that the topics "work-life balance", "less stress", "future", and "engagement" are positive; negative topics include "virtual health", "privacy concerns", and "stress"; and neutral topics involve "new technologies", "sustainability", and "technology issues". [9] analyzed Twitter data to study the sentiment distribution in India after the second wave of COVID-19 using VADER, LSTM, and convolutional neural networks and found that the majority of sentiments are either neutral or positive.
In summary, all of the above works used social media data to study sentiment or attitudes at a single time point in 2021 or 2022, after the first wave of the pandemic in 2020, with many focusing on a specific domain such as travel, import/export, remote working, or education.

Study Objective and Overview
This study differs from the works summarized in Section 1.1. First, it investigates general sentiment rather than sentiment in a specific domain or attitude toward a specific topic; in addition, it examines the temporal trend of sentiment from 2019 to 2022, representing the pre-pandemic baseline and several phases during the pandemic, rather than a snapshot in time.
This study is a follow-up to [10], which examined sentiment during the early phase of the pandemic (2020) as opposed to the pre-pandemic period (2019) in 8 higher-education institutes (HEIs) in the U.S. using Reddit data. In this study, we collected Reddit data from 128 HEIs in the U.S., including the 8 schools in [10], over a 4-year period (August to December in 2019, 2020, 2021, and 2022), where 2021 and 2022 can be regarded as later stages of the pandemic. In other words, the scope of this study is much broader, with a longer study period and many more schools covering all four regions of the U.S.; the number of messages also increases from 165,570 in [10] to 4,129,170 in this study.
The primary goal of this study is to examine the sentiment shift from 2019 to 2022 and whether and when the level of negative sentiment returned to the pre-pandemic (2019) level. As secondary objectives, we also examine how other factors may affect sentiment based on the collected Reddit data, such as region, school type and classification, and enrollment.
To analyze the data, we adopted a similar approach to [10] by first predicting the sentiment of each collected message using machine learning. The technique employs advanced natural language processing (NLP) techniques, specifically the Robustly Optimized BERT Pretraining Approach (RoBERTa), in conjunction with graph neural networks (GNNs) that leverage the inter-message relations among the Reddit messages. Upon assigning a sentiment class (negative or non-negative) to each message, we employed a generalized linear mixed-effects model (GLMM) to examine the effect of year on sentiment and to identify relevant covariates that may have significant associations with sentiment.
The remainder of the paper is structured as follows. In Section 2, we describe the data collection for this study. In Section 3, we introduce the machine learning and statistical procedures used to make sense of the data. The study results are presented in Section 4. The study limitations and future work are discussed in Section 5, and the main study conclusions are described in Section 6.

Data Collection
Our study focuses on a sample of Higher Education Institutions (HEIs) in the U.S. with Reddit data. When selecting schools, we aimed for representativeness and diversity. We first compiled a list of HEIs with subreddits, yielding more than 400 institutions. We then dropped schools that do not have enough messages in their subreddits. Specifically, if a subreddit has < 20 messages in each of the four years from 2019 to 2022 or has at least two years with < 10 messages, we dropped the school from the sample. In addition, due to storage and computational constraints (see Section 3), we subsetted the schools, eventually leading to a total of 128 schools, as listed in the Appendix. When subsetting the schools, we used criteria such as school diversity in terms of geographical regions within the U.S. and HEI types (i.e., research universities, liberal arts colleges, and institutions specializing in particular fields such as the Naval Academy). The schools also have a wide range of rankings, from more prestigious institutions to colleges/universities ranked beyond 300 per the U.S. News rankings. Nevertheless, it is important to acknowledge that the selection process, to some extent, involved subjective judgment influenced by the authors' knowledge, despite our best efforts to maintain objectivity.
The data collection process started with the retrieval of all textual messages from the subreddits associated with each HEI. The time frame spanned from August to November in each year from 2019 to 2022. We supplemented this textual corpus with additional attributes specific to each HEI, such as region, Carnegie classification of HEI (CCHEI), enrollment, graduation rate, and faculty headcount.

Reddit data
The Reddit data collection and how the data are used in this study are in accordance with Reddit's Terms and Conditions on data collection and usage. We also consulted the research compliance program at the University of Notre Dame, and no IRB approval was needed given that the collected data are publicly accessible on Reddit; we did not collect private identifiable data, nor did we interact with the Reddit users. More information regarding privacy compliance is provided in the Appendix.
To examine the sentiment trend from 2019 to 2022, we downloaded the data from August to November in the years 2019, 2020, 2021, and 2022 from the subreddit communities of the 128 schools. The year 2019 is regarded as the pre-pandemic baseline, 2020 falls within the peak of the pandemic, and 2021 and 2022 represent the transition to the post-emergency period. We used the Pushshift API (https://github.com/pushshift/api) to download the comment data but excluded the submission data because the submission data were unavailable when the study was conducted.
The messages in each school form a graph. In the graph, each message represents a node, and if one message replies to another, they are direct neighbors and we draw a directed edge between the two. Due to computational constraints, we limited the size of each graph. For those schools with over 30,000 nodes, we obtained a subgraph with 30,000 nodes using the following sampling approach. We first randomly select 50 nodes to start and then add the nodes that are connected to at least one of the 50 nodes, one by one, to form the subgraph. If there are no direct neighbors of any of the 50 nodes, we randomly select another 50 nodes to add to the subgraph. When the number of subgraph nodes gets close to 30,000 in the node-adding process, we only add the first several neighbor nodes or the first several randomly selected nodes to make it exactly 30,000. The order in which neighbor nodes are added is determined by a combination of the Breadth First Search algorithm [11] and the sequence in which the neighbor messages appear in our data (an index that is independent of the messages themselves). We repeat this process until the subgraph reaches the threshold of 30,000 nodes.
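The seed-and-expand sampling described above can be sketched in a few lines of pure Python; the function and parameter names below are our own illustration, not the study's actual code, and a toy cap is used in place of 30,000:

```python
import random
from collections import deque

def sample_subgraph(adjacency, cap=30000, seed_batch=50, rng=None):
    """Seed-and-expand sampling: repeatedly draw a batch of random seed
    nodes, then add their neighbors in BFS order until `cap` nodes are
    collected. `adjacency` maps each node to its neighbors in data order."""
    rng = rng or random.Random(0)
    nodes = list(adjacency)
    selected, queue = set(), deque()
    while len(selected) < cap:
        # Draw a fresh batch of random seeds from the not-yet-selected nodes.
        remaining = [n for n in nodes if n not in selected]
        if not remaining:
            break
        seeds = rng.sample(remaining, min(seed_batch, len(remaining)))
        for s in seeds:
            if len(selected) >= cap:
                return selected
            selected.add(s)
            queue.append(s)
        # Expand in BFS order, adding neighbors one by one up to the cap.
        while queue and len(selected) < cap:
            node = queue.popleft()
            for nb in adjacency.get(node, ()):  # neighbors in data order
                if nb not in selected:
                    selected.add(nb)
                    queue.append(nb)
                    if len(selected) >= cap:
                        return selected
    return selected
```

For example, on a small chain graph with `cap=4`, the function returns a 4-node subgraph, stopping mid-expansion exactly at the cap as described above.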

Baseline school-level data
We considered a set of variables at the school level that might impact the sentiment change from 2019 to 2022. The baseline data were collected from multiple sources: the 2020 United States census [12], the Carnegie Classification of Institutions of Higher Education (CCIHE) [13], and the Integrated Postsecondary Education Data System (IPEDS) [14]. The variables are listed in Table 1. For the data collected from IPEDS, if there are multiple years of data, we average them across all available years to obtain the final value for these variables.

Methods
We apply several machine learning and statistical analysis approaches to explore the data and address the goal of this study.

Sentiment Prediction
We apply the same ensemble of a graph neural network and a pre-trained RoBERTa model as in [10] to predict the sentiment class (negative vs. non-negative) for each message in the downloaded Reddit dataset.
RoBERTa [15] is an improved version of the BERT (Bidirectional Encoder Representations from Transformers) [16] model, a pretraining framework based on the attention mechanism [17]. The original RoBERTa was trained on a dataset of over 160 GB of uncompressed text, including BookCorpus plus English Wikipedia (16 GB), CC-News (76 GB), OpenWebText (38 GB), and Stories (31 GB). However, the Reddit data we used have different properties than the original training data: they contain emojis, non-standard spelling, Internet slang, and other features of Internet language. We therefore chose a RoBERTa model [22] that was trained on 58 million messages from Twitter and fine-tuned for sentiment analysis, which is more suitable for our application. The Python code for the RoBERTa framework is adapted from [22] (see the Appendix for the link to the code).
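Twitter-trained sentiment models of this kind typically expect user handles and links to be masked before tokenization, so inputs are normalized first. Below is a minimal sketch of such a preprocessing step; the `@user`/`http` placeholders follow the usage example that accompanies the model in [22] and should be treated as an assumption here, not a detail confirmed by this paper:

```python
def preprocess(text: str) -> str:
    """Mask user mentions and URLs so that Reddit messages better match
    the masking convention used in the model's Twitter training data."""
    tokens = []
    for tok in text.split(" "):
        if tok.startswith("@") and len(tok) > 1:
            tok = "@user"   # mask any user mention
        elif tok.startswith("http"):
            tok = "http"    # mask any link
        tokens.append(tok)
    return " ".join(tokens)
```

For example, `preprocess("thanks @prof see http://example.com")` yields `"thanks @user see http"`, which is then passed to the tokenizer.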
We obtained embeddings for the Reddit messages from the RoBERTa model, which are used in two downstream learning tasks. First, the embeddings are fed to a feed-forward neural network with softmax as the last layer to output the sentiment probabilities for the messages. Second, to better utilize the relational information among messages, we employ the Graph Attention Network (GAT) model trained in [10], with the embeddings and the adjacency matrices among the messages as input, to output a second set of predicted sentiment probabilities for the messages. GAT is a type of GNN model that incorporates the attention mechanism into the graph. In our case, we treat the messages in each school as an independent graph. When one message replies to another, an edge goes from the first message to the second but not the other way around; in other words, the adjacency matrix is asymmetric. The GAT model updates the hidden state of each node given the initial states of the node and its neighbors.
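The neighbor aggregation at the heart of a GAT layer can be illustrated with a stripped-down sketch: score each (node, neighbor) pair, softmax the scores into attention weights, and average the neighbor states under those weights. The `score` function below is a stand-in for GAT's learned attention (which uses a shared linear map and a LeakyReLU), so this illustrates the mechanism only, not the trained model from [10]:

```python
import math

def gat_update(h_node, h_neighbors, score):
    """One attention-style update for a single node: score each
    (node, state) pair, softmax the scores into attention weights,
    and return the weighted average of the states (self included)."""
    states = [h_node] + h_neighbors          # include a self-loop
    raw = [score(h_node, h) for h in states]
    m = max(raw)
    exp = [math.exp(r - m) for r in raw]     # numerically stable softmax
    z = sum(exp)
    alpha = [e / z for e in exp]             # attention weights, sum to 1
    dim = len(h_node)
    return [sum(a * h[d] for a, h in zip(alpha, states)) for d in range(dim)]
```

Because edges are directed (reply to replied-to), each node aggregates only over its out-neighbors in the asymmetric adjacency structure, plus itself.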
[10] found that GAT and RoBERTa can be inconsistent in their sentiment predictions: GAT tends to be more accurate in predicting negative messages, and RoBERTa tends to be more accurate for non-negative messages. To obtain more accurate sentiment predictions, we applied the stacking method in [10] and formulated a logistic model to combine the sentiment probabilities from GAT and RoBERTa into the final sentiment classification for each message.
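Such stacking can be sketched as a logistic model over the two base predictions. Combining the probabilities on the logit scale, and the particular parameterization below, are illustrative assumptions; the actual stacking model follows [10]:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)  # clip away from 0/1 for stability
    return math.log(p / (1.0 - p))

def stacked_negative_prob(p_gat, p_roberta, b0, b1, b2):
    """Stacking sketch: a logistic model combines the two base models'
    negative-sentiment probabilities (on the logit scale) into a final
    probability; b0, b1, b2 would be fit on labeled held-out data."""
    return sigmoid(b0 + b1 * logit(p_gat) + b2 * logit(p_roberta))
```

The fitted weights let the ensemble lean on GAT where it is stronger (negative messages) and on RoBERTa where it is stronger (non-negative messages); a message is classified negative when the combined probability exceeds the chosen threshold.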
Regarding the computational cost of running the prediction models, it took about one week to run RoBERTa and one day to run GAT across all the messages in the 128 schools on an x64 machine with an Intel(R) Xeon(R) L5520 CPU @ 2.27 GHz and 72.0 GB of RAM. A total of 98.7 GB was used to store all unprocessed and processed data.

Statistical Analysis of Sentiment Trend from 2019 to 2022
After obtaining the sentiment classifications for the 4,129,170 messages, we fitted a generalized linear mixed-effects model (GLMM) to examine how sentiment changed from 2019 to 2022.
The GLMM is

log[ Pr(y_ik is negative) / (1 − Pr(y_ik is negative)) ] = β_0 + Σ_{j=1}^{p} β_j x_ijk + z_k,

where the sentiment label y_ik (negative vs. non-negative) of message i in school k is the binary response; year (categorical) and the set of variables in Table 1 are fixed-effect predictors coded in x_ijk for j = 1, . . ., p (p is the number of regression coefficients associated with the covariates X). Because messages from the same school are correlated, z_k ∼ N(0, σ²) is included as a random effect to account for the within-school dependency.
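Under this logit link, a fixed-effect coefficient β_j translates into an odds ratio exp(β_j) for negative sentiment. The sketch below inverts the link and recovers the odds ratio numerically; the intercept and coefficient values are illustrative stand-ins, not the fitted estimates:

```python
import math

def odds(prob):
    return prob / (1.0 - prob)

def prob_negative(beta0, effects, z_school=0.0):
    """Invert the GLMM logit link: linear predictor -> Pr(negative)."""
    eta = beta0 + sum(effects) + z_school
    return 1.0 / (1.0 + math.exp(-eta))

# A fixed-effect coefficient beta maps to an odds ratio exp(beta):
# e.g. beta = log(1.25) corresponds to 25% higher odds vs. the reference year.
beta_year = math.log(1.25)        # illustrative, not a fitted value
p_ref = prob_negative(-1.0, [])           # reference year (2019)
p_year = prob_negative(-1.0, [beta_year]) # comparison year
ratio = odds(p_year) / odds(p_ref)        # recovers exp(beta_year) = 1.25
```

This is why the results sections report "X% higher odds" for each year and covariate level: each reported percentage is exp(β) − 1 for the corresponding fixed effect.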

School-level Characteristics
The baseline characteristics of the school-level data are summarized in Table 1. For categorical variables, the frequency and percentage of each category are provided; for continuous variables, the mean, standard deviation, minimum, and maximum are provided. In each year, the number of messages varies by school, but most schools have fewer than 30,000 messages across all 4 years. Due to computational constraints, for schools with more than 30,000 messages, we sampled a subgraph with 30,000 messages (nodes) using the methods described in Section 2.1. This leads to a total of 4,129,170 messages whose sentiments were predicted.

Temporal sentiment trend from 2019 to 2022 and school-level covariate effects on sentiment
The GLMM was run on complete records only (a total of 4,129,170 messages). There is a high imbalance in the covariate CCHEI, with only one school classified as "Master's Colleges & Universities: Larger Programs" and seven schools as "Baccalaureate: arts & sciences focus", which could lead to potential computational and inferential problems in the GLMM estimation. We thus combined the two categories into one, referred to as "Baccalaureate/Master's Colleges/Universities". The inferential results from the GLMM are presented in Table 2 and Fig 4. We used the glmer function in the R package lme4 to run the GLMM and the p.adjust function in the R package stats to obtain FDR-adjusted p-values.
Using < 0.05 as the threshold for statistical significance on the adjusted p-values, Year is significantly associated with Sentiment. The odds of negative sentiment in 2020, 2021, and 2022 are 24%, 4.3%, and 10.3% higher, respectively, than in 2019. This suggests that, based on the messages posted on Reddit, the likelihood of negative sentiment increased significantly during the pandemic compared to before the pandemic, that the negative sentiment proportion in the latter half of 2021 almost returned to the pre-pandemic level, and that it increased slightly during the second half of 2022. Overall, we may conclude that the level of negative sentiment went down post-pandemic compared to during the pandemic, but is still higher than pre-pandemic. Enrollment is also statistically significantly associated with sentiment. For every one-SD (18,075) increase in enrollment, the odds of negative sentiment go up by 11.9%, implying that larger enrollment tends to have a negative impact on Sentiment. Compared to Master's/Baccalaureate universities/colleges, the odds of negative sentiment for doctoral schools with very high research activity are 30.8% higher. Though the odds for doctoral schools with high research activity are also higher (26.5%) than for Master's/Baccalaureate universities/colleges, the increase is not statistically significant at the 5% level based on the adjusted p-value. These observations are plausible: there is constant pressure on both students and faculty members in HEIs with requirements for research productivity and excellence, which is likely linked to the higher negative sentiment in those schools and their associated communities.
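When interpreting these results, note that a percentage increase in odds is not a percentage-point increase in the negative-sentiment rate; the implied change in the rate depends on the baseline rate. A small sketch of the conversion (the 20% baseline rate is hypothetical, chosen only for illustration):

```python
def prob_from_odds_ratio(p_base, odds_ratio):
    """Translate an odds ratio into a new probability, given a baseline
    probability: new_odds = OR * base_odds, then invert odds -> prob."""
    base_odds = p_base / (1.0 - p_base)
    new_odds = odds_ratio * base_odds
    return new_odds / (1.0 + new_odds)

# With a hypothetical 20% baseline negative rate in 2019, 24% higher odds
# in 2020 corresponds to roughly a 23.7% negative rate, i.e. an increase
# of under 4 percentage points, not 24.
p_2020 = prob_from_odds_ratio(0.20, 1.24)
```

The same conversion applies to the enrollment and school-classification odds ratios reported above.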
Private schools tend to have lower odds of negative sentiment (12.4% lower) than public schools, though the difference is not statistically significant based on the adjusted p-value.
The rest of the examined covariates, such as region, D1 status, having a medical school, and selectivity, do not have a pronounced effect on Sentiment.
The CI widths for the odds ratios of the year comparisons are much smaller than those associated with the other factors. This is because Year is the only within-cluster factor whereas the others are between-cluster factors, where a cluster here refers to a School in the mixed-effects model. The variance of the estimated effect of a between-cluster factor contains both the between-cluster variance (variance across schools) and the sampling variability, whereas that for a within-cluster factor contains only the latter and is thus smaller. The very precise estimates for the year comparisons benefit from the huge number of messages, whereas the precision of the estimated effects of the between-school factors is determined more by the number of schools, which is 128.
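The variance decomposition above can be written informally as follows, with K schools and n_k messages in school k; this is a rough approximation for intuition only, not a result derived in the paper:

```latex
% Between-cluster (school-level) covariate: both variance components enter.
\mathrm{Var}\!\left(\hat{\beta}_{\text{between}}\right)
  \;\approx\; \underbrace{\frac{\sigma^{2}_{\text{school}}}{K}}_{\text{between-school variance}}
  \;+\; \underbrace{\frac{\sigma^{2}_{\varepsilon}}{\sum_{k} n_{k}}}_{\text{sampling variability}},
\qquad
% Within-cluster covariate (Year): the school-level term drops out.
\mathrm{Var}\!\left(\hat{\beta}_{\text{within}}\right)
  \;\approx\; \frac{\sigma^{2}_{\varepsilon}}{\sum_{k} n_{k}}.
```

With Σ_k n_k ≈ 4.1 million messages but K = 128 schools, the first term dominates for between-school factors, which is why their CIs are much wider.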

Discussion
In this study, we collected subreddit data from 128 universities and colleges in the U.S., along with school-level baseline covariates, to study sentiment change from 2019 to 2022, covering the pre-pandemic period through several stages of the COVID-19 pandemic. While we aimed for school representativeness and diversity by considering factors such as school ranking, location, size, and school type, the schools included in this study are not an unbiased sample of all HEIs in the U.S. Since we used subreddit data, only schools with active subreddits from 2019 to 2022 are eligible, which are the schools that are relatively well-known, larger, and have active online communities on social media. For this reason, while the study results can be generalized to the sub-population the data represent and reflect the sentiment changes from 2019 to 2022 in that group, they would not immediately generalize to the general population without understanding the demographics of the individuals (on which we do not have data) who posted the messages on Reddit. (Notes for Table 2: † The multiplicity-corrected/adjusted p-values were calculated using the method in [18]. ‡ For a numerical variable, the odds ratio is associated with a one-SD increase. The rows with "-" as entries are the reference categories for the categorical covariates. The bold rows are the covariates/levels that are statistically significant if < 0.05 is used for the adjusted p-values.)
To reduce the labor cost of sentiment labeling, we opted to employ the sentiment classification model proposed in [10], which was trained on data from 2019 and 2020 in eight schools. While there is some overlap between the training data in [10] and the data employed in the current study, since both are extracted from HEI subreddits, the data in this study are much broader and more comprehensive. Therefore, the classifier may not be the most accurate for predicting messages that are outside the range of messages in the training data. For future work, we intend to retrain the model using more training data to improve the classification accuracy. (Caption for Figure 4: Forest plot of estimated odds ratios of negative sentiment with 95% confidence intervals. An asterisk (*) indicates a statistically significant odds ratio per the adjusted p-value (Table 2) for the corresponding covariate vs. its reference level or per 1-SD increase in the variable.)
Since the collected subreddit data do not contain individual-level demographic information about the individuals who posted the messages, which can be highly sensitive or pose privacy risks of re-identification, the covariates examined in the GLMM include only school-level public information. The current GLMM does not examine time-varying covariates other than year itself. A potentially interesting extension of the current study is to include time-varying covariates, such as the unemployment rate and the inflation rate, in the model. While the GLMM itself is not capable of drawing causal relations, it suggests a significant drop in negative sentiment in 2021 compared to 2020, likely due to the availability of vaccines and more effective treatments for COVID-19, giving people hope and a positive outlook that things would return to normal. Though the negative sentiment level in 2022 is still lower than in 2020, it is higher than in 2021, which may be within the normal fluctuation of sentiment or may indeed reflect a slight rise in negative sentiment after the transient large drop in 2021, due to other factors that negatively affect sentiment, such as inflation in 2022. However, this is just speculation, and the findings should be regarded as hypothesis-generating, to be confirmed by a rigorous study with a proper set of data focused on understanding the reasons behind the emotion shift.

Conclusion
In this study, we gathered subreddit messages from 128 HEIs in the U.S., covering the pre-pandemic period (2019) through various stages of the pandemic (2020, 2021, and 2022). The sentiments of the messages were predicted using the machine learning procedure in [10]. Adjusting for the school-level covariates, the GLMM analysis suggests a near-full recovery in the sentiment composition (negative vs. non-negative) in 2021 relative to the pre-pandemic era; the negative sentiment level rose slightly in 2022 but was still notably lower than in 2020. The results are expected but quantify the sentiment shift from 2019 to 2022. The results also suggest that larger enrollment tends to be associated with a statistically significantly higher level of negative sentiment, and that schools with very high research activity also exhibit more negative sentiment in comparison to schools classified as baccalaureate or Master's colleges/universities.

Fig 1
Fig 1 depicts the distributions of the number of messages across the 128 schools by year.

Figure 1 :
Figure 1: Histograms of number of messages across schools by year (blue and red lines represent median and mean, respectively)

Figure 2 :
Figure 2: Heatmaps of negative sentiment percentage in all schools by year. Each circle represents a school. The right column shows the within-school differences in 2020 to 2022 vs. 2019 (pre-pandemic). The crosses represent difference values outside the [−20, 20]% range (28.04% for the University of Notre Dame in 2020; 20.01% for Bowling Green State University-Main Campus, 32.35% for Boise State University, and −27.83% for the University of Idaho in 2021; and 44.30% for the University of Maine, 26.18% for the University of New Mexico-Main Campus, and 32.34% for the Tulane University of Louisiana in 2022)

Figure 3 :
Figure 3: Percentage of negative sentiment distribution for all schools in each year (blue and red lines represent the median and mean, respectively)

Table 1 :
Descriptive Statistics of School-level Baseline Characteristics

Table 2 :
Estimated effects of covariates on the odds of negative sentiment