Representation of professions in entertainment media: Insights into frequency and sentiment trends through computational text analysis

Societal ideas and trends dictate media narratives and cinematic depictions, which in turn influence people's beliefs and perceptions of the real world. Media portrayals of culture, education, government, religion, and family affect their function and evolution over time as people interpret and perceive these representations and incorporate them into their beliefs and actions. It is important to study media depictions of these social structures so that they do not propagate or reinforce negative stereotypes, or discriminate against any demographic section. In this work, we examine media representation of professions and provide computational insights into their incidence, and sentiment expressed, in entertainment media content. We create a searchable taxonomy of professional groups and titles to facilitate their retrieval from speaker-agnostic text passages like movie and television (TV) show subtitles. We leverage this taxonomy and relevant natural language processing (NLP) models to create a corpus of professional mentions in media content, spanning more than 136,000 IMDb titles over seven decades (1950-2017). We analyze the frequency and sentiment trends of different occupations, study the effect of media attributes like genre, country of production, and title type on these trends, and investigate whether the incidence of professions in media subtitles correlates with their real-world employment statistics. We observe increased media mentions of STEM, arts, sports, and entertainment occupations in the analyzed subtitles, and a decreased frequency of manual labor jobs and military occupations. The sentiment expressed toward lawyers, police, and doctors is becoming negative over time, whereas astronauts, musicians, singers, and engineers are mentioned favorably. Professions that employ more people have increased media frequency, supporting our hypothesis that media acts as a mirror to society.

Several studies have examined the nature of the portrayal of different professions in popular media, like lawyers (Asimow, 1999), accountants (Dimnik and Felton, 2006), physicians (Flores, 2002), and cops (Pautz, 2016). While these past studies closely examined the personality of the character portrayed in the profession of interest, their methods are not scalable because the authors manually viewed the movie or read the transcripts to infer the character's personality. Such studies cannot examine more than a few hundred movies or TV shows at a time. The set of personality attributes also varied between different works. Therefore, there is a need to conduct such media studies of professions more systematically and computationally.
Our objective is to conduct a computational study of the representation of professions in media content, spanning a large set of movies and TV shows over a time period. We rely on textual data from media subtitles for this study. Specifically, in this work, we use professional mentions as a proxy for the representation of professions in movies. Professional mentions are job titles (doctor, engineer, cop, lawyer, etc.) used to indicate a profession within an utterance. Word mentions have been previously used to study trends of different societal functions, for example, education, culture, and language use patterns (Moskovkin et al., 2019; Younes and Reips, 2018; Basile et al., 2016). Michel et al. (2011) introduced the Google Books corpus, which contains digitized copies of more than five million books. They used n-gram frequencies to track the size of the English lexicon, word usage, grammatical structures, popularity index of individuals, etc., over time. Brysbaert and New (2009) argued that word frequency measures of media content were better than those calculated from written sources for psycholinguistic research. Inspired by these works, we use mentions of professional words to study the representation of professions in media content. The following are the contributions of our work:
4. We analyze the frequency and contextual sentiment trend of professional mentions in media over time. We also investigate the presence of any correlation between the incidence of professional mentions and the genre, title type, and country of production of the movie or TV show.
The rest of the paper is organized as follows: Sec 2 recounts some past social studies about media representation of professions. It also gives the technical background of the computational models and knowledge bases we use in our work. Sec 3 describes the profession gazetteers and the subtitles dataset we use to search professional mentions in media content. The Methodology section (Sec 4) is divided into two parts: Taxonomy Creation and Profession Search. The first subsection explains how we create a searchable taxonomy of job titles. The second subsection describes how we use this taxonomy to search and annotate professional mentions in media subtitles. The Analysis section (Sec 5) analyzes the media frequency and sentiment trends of different professions, and their correlation with the real-world employment figures. We conclude with a discussion of the results of our analysis and opportunities for further research.

Related Work
We summarize some past social studies on media representation of professions. These studies examined a small set of movies, and their methods cannot be scaled to other occupations. We build upon various well-established natural language processing (NLP) methods and lexical resources to address these limitations. We leverage job title gazetteers and WordNet synsets to create a searchable taxonomy of professions. We prune non-professional mentions of job titles using word sense disambiguation and named entity recognition, and find the targeted sentiment of the remaining mentions. We briefly explain the theoretical background of each method and highlight relevant work.

Social Scientific Studies
Several past works have studied the portrayal of professions in entertainment media. Asimow (1999) studied the representation of lawyers in 284 films and found most portrayals to be negative. Dimnik and Felton (2006) examined the representation of accountants in 121 movies and extracted six character stereotypes of the accountant personality. Flores (2002) investigated physicians' image in 131 films and found that they were mostly depicted as greedy and uncaring. Pautz (2016) used a sample of 34 films containing more than 200 police (cop) characters and found that most cops were shown as good, hard-working, and competent law-enforcement officers. Kalisch and Kalisch (1986) analyzed 670 nurse and 466 physician characters in novels, movies and television series, and concluded that compared to physicians, media nurses were consistently less central to the plot, less intelligent and rational, and less likely to exercise clinical judgement. Smith et al. (2012) investigated gender representation of occupations in films, prime-time programs, and children's TV shows, and found that females are grossly underrepresented compared to males in science, technology, engineering and math (STEM) jobs. These works involved extensive human coding and profession-specific analysis, which is not reproducible on a large scale. The present study offers a complementary view in that it trades a smaller-scale character-centric study for a larger-scale lexical analysis, focusing on character utterances instead of personality traits. The remaining sections describe the computational methods used to achieve this.

Named Entity Recognition
Named entity recognition (NER) is a classic NLP task of finding entity mentions in text and classifying their type. Traditional NER primarily targets person, organization, and location entity types. Most NER datasets only contain labels for these three types of entities (Tjong Kim Sang and De Meulder, 2003). The OntoNotes 5 dataset increased the number of entity types to 18 by including nationalities, products, events, and numeric values (Pradhan et al., 2013). However, it does not contain any professional titles. Fine-grained NER extends traditional entity recognition by expanding the entity set to include hundreds of named categories. Ling and Weld (2012) created a benchmark dataset for fine-grained NER that labels 112 different entity types but it only contains a few professional titles. Sekine (2008) built an ontology for named entities and defined professional titles as vocational attributes for the person-entity type. Mai et al. (2018) used this entity hierarchy to annotate English and Japanese sentences and evaluated different fine-grained NER models. However, they did not include the professional attributes in their labeling set. The TAC KBP track of entity discovery and linking introduced job titles as entity types to model the person:title relationship (Ellis et al., 2015). The Stanford CoreNLP NER model used the KBP 2017 dataset to create regular expression-based rules for finding professional titles in text (Manning et al., 2014). However, we observed that it missed many of the professional mentions in media text.
In the absence of labeled data, entity gazetteers are often used to find candidate spans for named entities. Gazetteers are curated lists of entities that improve NER performance when combined with supervised models. For example, gazetteers have been used to identify text subsequences for a region-based encoder, improving the state-of-the-art NER performance on the ACE2005 benchmark. Related work on gazetteers includes Nadeau et al. (2006). In this work, we use the SOC taxonomy to get the initial list of job titles and professional groups.

WordNet
WordNet is a widely-used lexical resource in English (Miller, 1995). It groups words into synonym sets called synsets. Synonymy is the semantic relationship between words of similar meaning. Polysemous words have multiple meanings and belong to more than one synset. For example, the word "conductor" is present in three synsets: conductor.n.01 (the person who leads a musical group), conductor.n.02 (a substance that readily conducts electricity and heat), and conductor.n.03 (the person who collects fares on a public conveyance). The name of a synset is composed of three dot-separated literals. The first literal is the lemmatized form of the main word of the synset, the second literal is the part-of-speech, and the third literal is the sense index. Synsets are tagged by semantic classes (the complete tagset can be found at wordnet.princeton.edu). Synset A is a hyponym of synset B if it is a more specific form of B. For example, allergist.n.01 (doctor specialized in the treatment of allergies), surgeon.n.01 (doctor specialized in surgery) and veterinarian.n.01 (doctor practicing veterinary medicine) are all hyponyms of the more general synset, doctor.n.01 (a licensed medical practitioner).
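The synset naming convention and the hyponym relation described above can be illustrated with a small sketch. It does not query WordNet itself: the hyponym table is a toy dictionary built only from the examples in the preceding paragraph, and the names SynsetName, parse_synset_name, and is_hyponym are hypothetical helpers introduced for illustration.

```python
from typing import NamedTuple


class SynsetName(NamedTuple):
    lemma: str        # lemmatized form of the synset's main word
    pos: str          # part-of-speech literal, e.g. "n" for noun
    sense_index: int  # sense number of the word


def parse_synset_name(name: str) -> SynsetName:
    """Split a dot-separated synset name such as 'conductor.n.01'."""
    # Split from the right so a lemma that itself contains dots stays intact.
    lemma, pos, sense = name.rsplit(".", 2)
    return SynsetName(lemma, pos, int(sense))


# Toy hyponym table (assumption: only the examples quoted in the text).
HYPONYMS = {
    "doctor.n.01": ["allergist.n.01", "surgeon.n.01", "veterinarian.n.01"],
}


def is_hyponym(child: str, parent: str) -> bool:
    """Return True if `child` is listed as a more specific form of `parent`."""
    return child in HYPONYMS.get(parent, [])
```

In practice the same information would come from a WordNet interface rather than a hand-written table; the sketch only makes the three-part naming scheme concrete.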
WordNet has been used to construct entity gazetteers. Toral and Muñoz (2006) leveraged WordNet's noun hierarchy to build person and location gazetteers. Magnini et al. (2002) used WordNet to identify trigger words and gazetteer terms for English NER. Boteanu et al. (2018) expanded a shopping taxonomy for efficient product search by matching the product names to WordNet synsets. In this work, we use WordNet synsets to extend the SOC taxonomy and create a searchable dictionary of professions. WordNet synsets are also the target labels for the word sense disambiguation task, an integral part of our search pipeline.

Word Sense Disambiguation
Word sense disambiguation (WSD) is the task of assigning words in context to their most appropriate sense. WordNet usually serves as the sense inventory that provides the target senses. The same word can express different meanings depending upon its context. Consider the word "conductor" in the following two sentences: "Conductors communicate with the musicians through hand gestures" and "Metals are good heat conductors". The former refers to a person directing the music of an orchestra, denoted by the synset conductor.n.01. The latter means a heat-conducting substance, denoted by the synset conductor.n.02 (see Sec 2.3). Many NLP tasks such as machine translation, information retrieval and question answering use WSD in their text-processing pipeline (Neale et al., 2016; Pu et al., 2018; Zhong and Ng, 2012; Ramakrishnan et al., 2003).
Both knowledge-based and supervised approaches have been applied to WSD, with the latter usually outperforming the former. Raganato et al. (2017a) standardized the evaluation framework for WSD and used a combination of SenseEval (Edmonds and Cotton, 2001; Snyder and Palmer, 2004) and SemEval (Pradhan et al., 2007; Navigli et al., 2013; Moro and Navigli, 2015) datasets to compare the performance of different WSD models. Raganato et al. (2017b) treated WSD as a sequence labeling task and used an LSTM with attention layers to find the sense of all sentence words jointly. Context-gloss pairs have also been constructed to convert WSD into a sentence pair classification task. Kumar et al. (2019) produced gloss embeddings using the WordNet graph and combined them with contextual vectors to find the word sense. Bevilacqua and Navigli (2020) extended this method by adding hypernym and hyponym relational knowledge to construct the synset vectors and achieved state-of-the-art WSD performance. In this work, we use WSD to remove non-professional mentions of job titles in media subtitles.

Sentiment Analysis
Sentiment analysis or opinion mining is the task of finding the sentiments, opinions, attitudes, appraisals and emotions towards entities or their attributes expressed or implied in text (Liu, 2015). Sentiment is always targeted at some entity or towards some attribute of the entity. The target entity or attribute can be a person, organization, issue, product, service, topic or event. Such target-oriented opinion mining is called aspect-based sentiment analysis (ABSA).
We use professional mentions as opinion targets for sentiment analysis to find how positively or negatively different professions are talked about in media stories. The task of profession ABSA is to find the sentiment orientation of the opinion expressed towards the person or group of persons referred to by their job title. If the job title does not refer to any profession, or people employed in the profession, the sentiment is neutral. The following example sentences show the sentiment label of the job title word, marked in bold.

1. Harry Floyd was a great actor. (POSITIVE) Explanation: Actor refers to Harry Floyd, who is described as great.
2. But that damn vet kept ordering test after test after test! (NEGATIVE) Explanation: The speaker uses a swear term, damn, to address the veterinarian and criticizes his or her action.
3. Fine, then we get the armor and reverse engineer it. (NEUTRAL) Explanation: The word engineer refers to an action, not a profession.
4. You're going to be a lousy architect. (NEUTRAL) Explanation: The person towards whom the negative sentiment is expressed is not an architect.
Benchmark ABSA datasets exist for several domains like question answering forums, customer reviews and tweets (Saeidi et al., 2016; Pontiki et al., 2014; Dong et al., 2014). Dong et al. (2014) proposed an adaptive recursive neural network for target-dependent Twitter sentiment classification that propagated the sentiments of words to the target depending upon the context and syntactic relations. Tang et al. (2016) used LSTMs to model the left and right context of the target entity for Twitter sentiment classification. Memory networks and graph convolutional neural models have also been proposed (Wang et al., 2018; Zhang et al., 2019). Recently, several transformer networks have been introduced for the ABSA task and have achieved state-of-the-art performance (Vaswani et al., 2017). Zeng et al. (2019) proposed dynamically weighting or masking the attention weights of the sentence words depending upon their distance from the aspect expression. In this work, we use ABSA models to find the sentiment expressed towards professions in media content.

Data
We search for mentions of job titles in entertainment media content to study the representation of professions. We use the Standard Occupational Classification (SOC) system (Bureau of Labor Statistics, U.S. Department of Labor, 2018) to create a searchable taxonomy of professional titles. We apply this taxonomy to find professional mentions in media (movie) subtitles, for which we use the OpenSubtitles corpus (Lison et al., 2018). This section describes the SOC taxonomy and the OpenSubtitles dataset.

Standard Occupational Classification taxonomy
The Standard Occupational Classification (SOC) system is a profession taxonomy maintained by the US Bureau of Labor Statistics (Bureau of Labor Statistics, U.S. Department of Labor, 2018). It arranges professions in four tiers: major, minor, broad, and detailed. The detailed tier contains a set of professions closely related by work. Fig 1 lists all 23 major SOC groups and shows a portion of the taxonomy's subtree rooted at the Management Occupations major SOC group. As shown in the figure, the profession Governor belongs to the hierarchy rooted at the Management Occupations major group. The SOC taxonomy contains 6,520 unique professions.

OpenSubtitles
We used the English subset of the OpenSubtitles dataset (Lison et al., 2018), which contains 135,998 subtitle files corresponding to a variety of media content from the years 1950 to 2017. Each subtitle file is mapped to a unique IMDb title. More than 94% of the IMDb titles are movies or TV show episodes, and the rest are made up of video games, TV shorts, TV mini-series, etc. Subtitle files for IMDb titles released before 1950 are available, but we excluded them because there were very few titles for each year (fewer than 100), which would not have been a representative sample of that period's media content. The subtitle files of our dataset contain around 126 million sentences and 942 million words. An IMDb title can have multiple genres. Drama and Comedy are the two most common genres, covering more than 80% of the IMDb titles. The third panel shows the distribution of the ten most common countries where the production company is based. About 68% of the time, the production company was based in the US or the UK.

Methodology
We create a corpus of professional mentions by searching job titles in the OpenSubtitles dataset. We use this corpus to study the relationship between media portrayal of professions and real-world employment trends. However, we cannot use the SOC taxonomy directly to find professional mentions in media content for the following reasons:
1. Most of the SOC job titles are very specific multi-word phrases, for example, Department Store Manager, Registered Occupational Therapist, Television News Video Editor, etc. Such detailed phrases are rarely spoken in everyday conversations (including those captured in the subtitle transcripts of media considered here). Conversations instead include simpler unigram professional words like Manager, Therapist, Editor, etc. Less than 7% of the SOC job titles are unigrams.
2. The mere occurrence of a job title in text does not mean it refers to a profession. For example, consider the sentence "I made a peach cobbler for the party." Cobbler is a job title, but here it refers to a type of food. Both the lexical form and the context decide whether the word is a professional mention or not.
Therefore, in order to make the SOC taxonomy searchable, we need to extend its list of job titles to include simpler, more common words, and have a disambiguation model to filter non-professional usages of job titles. Fig 3 outlines the complete pipeline of expanding the SOC taxonomy, creating the corpus of professional mentions, and analyzing its frequency and sentiment trends. The figure uses the cobbler profession to exemplify the corpus creation method. As shown in the figure, the Taxonomy Creation (sec 4.1) section describes how we expand the SOC system and create a searchable profession taxonomy. The Profession Search (sec 4.2) section explains the NLP techniques we apply to find the professional mentions and their targeted sentiment.

Taxonomy Creation
We use WordNet synsets to create a searchable profession taxonomy.  We use the following method to expand the SOC taxonomy.
Find Substrings: Given a SOC job title, we split it into words and join them cumulatively from the end to find candidate job titles. For example, given the job title Chief Executive Officer, we find the new candidate titles Officer and Executive Officer. We do not split titles that contain conjunctions, prepositions, or punctuation, those that are abbreviations (all letters are in upper case), or those that contain more than five words. In Fig 4, Conductor is a new job title we find from the SOC titles Orchestra Conductor and Train Conductor.
Find WordNet synsets: We find WordNet synsets of the candidate job titles. We retain only those synsets whose semantic class is noun.person or noun.group. We add the hyponym synsets of the retained synsets.
Remove Non-Professional Synsets: We manually check the list of synsets and remove those that do not refer to any profession. As shown in Fig 4, conductor.n.02 denotes some heat or electricity conducting substance, which is not a profession and therefore, it is removed. The final curated list contains 1615 professional synsets.
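The Find Substrings step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the conjunction and preposition lists are short placeholder sets chosen for the example, and the function name candidate_titles is hypothetical.

```python
import string

# Placeholder stop lists (assumption: the paper does not specify its lists).
CONJUNCTIONS = {"and", "or", "nor", "but"}
PREPOSITIONS = {"of", "in", "for", "to", "at", "on", "by", "with", "from"}
MAX_WORDS = 5  # titles with more than five words are not split


def candidate_titles(job_title: str) -> list[str]:
    """Return the proper suffixes of a job title, shortest first."""
    words = job_title.split()
    lowered = {w.lower() for w in words}
    if len(words) > MAX_WORDS:
        return []
    if lowered & (CONJUNCTIONS | PREPOSITIONS):
        return []
    if any(ch in string.punctuation for ch in job_title):
        return []
    if job_title.isupper():  # treated as an abbreviation
        return []
    # Join words cumulatively from the end: "Officer", "Executive Officer", ...
    return [" ".join(words[-k:]) for k in range(1, len(words))]
```

Each candidate would then be looked up in WordNet and filtered by semantic class, as described in the next two steps.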

Synonym Expansion: We collect all synonyms of the professional synsets. These are the new job titles, which we add to the SOC taxonomy.
SOC Mapping: We map the job titles in the expanded taxonomy to SOC major groups. This allows us to compare the employment in different SOC professional groups with their frequency in media content. We create the mapping through a semi-automatic process. Given a job title x, we first find the SOC major groups that contain x or have some job title that contains x as a substring. If there exists exactly one such SOC major group, we map x to it. Otherwise, we examine the professional synsets of x and find the mapping manually. Often, different synsets of a job title map to different SOC groups. For example, from Fig 4, we observe that the two synsets of Conductor, conductor.n.01 and conductor.n.03, map to two different SOC major groups, depending upon their respective definitions. We only perform the mapping for the 500 most frequent job titles in media subtitles. These job titles belong to 562 professional synsets and cover more than 94% of the professional mentions.
Table 1 shows the number of job titles and synsets in the different profession taxonomies. The SOC taxonomy does not contain any synsets and therefore is not searchable. The expanded taxonomy adds professional synsets and quadruples the number of unigram job titles, making it searchable. The SOC-mapped taxonomy is a subset of the expanded taxonomy, which we use to study the relationship of media frequency of professions with their employment trend. Fig 5 shows the structure of the SOC-mapped profession taxonomy. It contains three tiers: SOC major groups, WordNet synsets and job titles. The figure only shows five SOC major groups of the complete taxonomy.
Table 1: We expand the SOC taxonomy to create the Expanded taxonomy. The SOC-mapped taxonomy is a subset of the Expanded taxonomy that has been mapped to SOC major groups.
Figure 5: Searchable profession taxonomy. The SOC-mapped expanded profession taxonomy contains 3 tiers: SOC major groups, WordNet synsets, and job titles. The synsets and unigram job titles make the taxonomy searchable. This figure shows a few nodes of the SOC-mapped taxonomy, which contains 500 job titles and 562 synsets.
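The automatic part of the SOC mapping step can be sketched as follows. The sketch assumes the SOC taxonomy is available as a dictionary from major-group name to its job titles; the function name map_to_soc_group and the toy group assignments in the usage note are illustrative, not taken from the paper.

```python
from typing import Optional


def map_to_soc_group(job_title: str,
                     soc_titles_by_group: dict[str, list[str]]) -> Optional[str]:
    """Return the unique SOC major group matching `job_title`, else None.

    A group matches if any of its titles equals the job title or contains it
    as a substring. None signals that manual, synset-level review is needed.
    """
    needle = job_title.lower()
    matches = {
        group
        for group, titles in soc_titles_by_group.items()
        if any(needle in title.lower() for title in titles)
    }
    if len(matches) == 1:
        return next(iter(matches))
    return None
```

With a toy input in which Orchestra Conductor and Train Conductor sit in different major groups, the title Conductor matches two groups and is therefore deferred to manual review, which mirrors the conductor example above.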

Profession Search
We search for mentions of the expanded taxonomy's job titles in the OpenSubtitles corpus. We apply NER and state-of-the-art WSD techniques to prune non-professional mentions. Finally, we train an ABSA model on sentiment-annotated subtitle sentences, and use it to tag professional mentions with their sentiment polarities.

Mention Search
We search the subtitle sentences to find mentions of job titles. We create a word-document search index using the Whoosh Python package for quick retrieval of mentions. We also search for the plural form of the job title while finding its mentions. Parenthesized mentions, speaker references (Referee: The match will begin shortly!) and lyrical mentions are removed.
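The filtering logic of the mention search can be sketched with regular expressions. This sketch replaces the Whoosh index with a plain per-sentence scan, so it illustrates the filters rather than the retrieval machinery. The plural rule (appending "s" or "es"), the speaker-label pattern, and the use of a musical note symbol to detect lyrics are simplifying assumptions, and find_mentions is a hypothetical name.

```python
import re

# Leading speaker label: one to three capitalized words followed by a colon.
SPEAKER_LABEL = re.compile(r"^\s*[A-Z][\w.'-]*(?:\s[A-Z][\w.'-]*){0,2}:\s+")
# Parenthesized or bracketed spans such as "(doctor sighs)".
BRACKETED = re.compile(r"\([^)]*\)|\[[^\]]*\]")


def find_mentions(sentence: str, job_title: str) -> list[str]:
    """Return surface forms of `job_title` (singular or plural) in a sentence."""
    # Assumption: lyric lines are marked with a musical note symbol.
    if "♪" in sentence:
        return []
    text = BRACKETED.sub(" ", sentence)
    text = SPEAKER_LABEL.sub("", text)
    # Naive pluralization: optional "s" or "es" after the title.
    pattern = re.compile(rf"\b{re.escape(job_title)}(?:e?s)?\b", re.IGNORECASE)
    return pattern.findall(text)
```

Irregular plurals and unmarked lyrics would need additional handling that this sketch does not attempt.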

Removing Non-Professional Mentions
As discussed in Sec 4, not all job title mentions refer to a profession. We remove non-professional mentions using WSD and NER methods. We apply the EWISER (Enhanced WSD Integrating Synset Embeddings and Relations) WSD model to find the mention's sense (Bevilacqua and Navigli, 2020). The EWISER model achieved state-of-the-art performance on the WSD benchmark dataset (Raganato et al., 2017a), reporting an overall F1 of 80.1. We apply the Stanford CoreNLP NER model (Manning et al., 2014) to find the named entity tag of words. We use the following rule to find professional mentions. A job title mention refers to a profession if 1) the predicted sense belongs to the set of professional WordNet synsets of the expanded taxonomy (see sec 4.1), and 2) it is not the name of an organization, or of a person who is cast in the corresponding IMDb title. The mentions that satisfy this rule form our corpus of professional mentions. Fig 3 shows the steps to remove non-professional mentions for the cobbler job title.
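The two-part rule above can be written as a small predicate. The sketch assumes the WSD and NER predictions are already available as strings; the tag values "ORGANIZATION" and "PERSON", the representation of the cast as a set of lowercase names, and the function name is_professional_mention are assumptions made for illustration.

```python
def is_professional_mention(predicted_synset: str,
                            ner_tag: str,
                            mention_text: str,
                            professional_synsets: set[str],
                            cast_names: set[str]) -> bool:
    """Apply the rule: professional sense, and not an organization or cast name."""
    # Condition 1: the predicted sense must be a curated professional synset.
    if predicted_synset not in professional_synsets:
        return False
    # Condition 2a: the mention must not be the name of an organization.
    if ner_tag == "ORGANIZATION":
        return False
    # Condition 2b: the mention must not name a person cast in the title.
    if ner_tag == "PERSON" and mention_text.lower() in cast_names:
        return False
    return True
```

For instance, a mention of conductor whose predicted sense is conductor.n.02 (the substance) fails the first condition regardless of its entity tag.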
To evaluate our rule-based model of finding professional mentions, we randomly sample 200 job title mentions and manually annotate their professional label. The test set contained 123 professional mentions and 77 non-professional mentions. Our model correctly predicted the professional label for 83.5% of the mentions, with 94.12% precision and 78.05% recall. Therefore, an estimated 5.88% of the mentions in our corpus are false positives (that is, the false discovery rate, one minus precision, is 5.88%).
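The reported figures can be checked with a short worked calculation. The confusion-matrix counts below are not stated in the text; they are the unique integer counts consistent with the reported test-set sizes (123 professional, 77 non-professional), precision, and recall, and are therefore a derived assumption.

```python
# Derived counts (assumption): the only integers matching the reported metrics.
tp, fp, fn, tn = 96, 6, 27, 71

total = tp + fp + fn + tn
accuracy = (tp + tn) / total        # share of correctly labelled mentions
precision = tp / (tp + fp)          # share of predicted professionals that are correct
recall = tp / (tp + fn)             # share of true professionals that are found
false_discovery_rate = fp / (tp + fp)  # equals 1 - precision
false_positive_rate = fp / (fp + tn)   # share of non-professionals wrongly kept

print(f"accuracy={accuracy:.2%} precision={precision:.2%} recall={recall:.2%}")
print(f"FDR={false_discovery_rate:.2%} FPR={false_positive_rate:.2%}")
```

The 5.88% figure corresponds to the false discovery rate (6 of the 102 mentions predicted as professional), which is the quantity that describes contamination of the resulting corpus.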

Determining expressed sentiment
We tag each professional mention with the sentiment (positive, negative, or neutral) expressed towards it in the subtitle sentence (see sec 2.5). We apply the LCF (Local Context Focus) BERT model to find the targeted sentiment (Zeng et al., 2019). The LCF model defines the semantic relative distance (SRD) between a context token and the target mention as the absolute difference between their token indices. The model has two variants: CDM (context dynamic masking) and CDW (context dynamic weighting). The CDM architecture masks context tokens that are far from the target (those whose SRD exceeds a threshold), whereas the CDW architecture down-weights them dynamically. The LCF model achieved state-of-the-art accuracy on the Twitter ABSA task (Dong et al., 2014), recording an F1 score of 75.78. It also obtained high scores on the customer reviews dataset (Pontiki et al., 2014) of laptops (79.59 F1) and restaurants (81.74 F1). We train the LCF model using sentiment-annotated professional mentions.
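The SRD-based masking and weighting can be illustrated on plain token positions. This is a sketch under stated assumptions: SRD follows the simplified index-difference definition given above, the linear decay used for CDW is one plausible weighting scheme rather than a formula taken from this paper, and in the actual model these factors would scale BERT hidden-state vectors rather than scalars.

```python
def srd(num_tokens: int, target_index: int) -> list[int]:
    """Semantic relative distance of every token position to the target."""
    return [abs(i - target_index) for i in range(num_tokens)]


def cdm_mask(distances: list[int], threshold: int) -> list[float]:
    """CDM: keep tokens within the threshold, zero out the rest."""
    return [1.0 if d <= threshold else 0.0 for d in distances]


def cdw_weights(distances: list[int], threshold: int) -> list[float]:
    """CDW: keep near tokens at full weight, linearly decay distant ones."""
    n = len(distances)
    return [1.0 if d <= threshold else 1.0 - (d - threshold) / n
            for d in distances]


tokens = "But that damn vet kept ordering test after test".split()
distances = srd(len(tokens), tokens.index("vet"))
mask = cdm_mask(distances, threshold=2)
weights = cdw_weights(distances, threshold=2)
```

The threshold is the SRD hyperparameter; it controls how much of the sentence counts as local context around the profession mention.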
We crowdsourced sentiment annotations using Amazon Mechanical Turk. We trained Turkers using expert-labelled examples, and then asked them to annotate the sentiment of five sentences. We selected only those annotators who correctly annotated all five sentences. Fifty-two annotators passed this qualification test. These annotators then labelled the sentiment of 15,000 professional mentions, with two annotations per mention. We retained only those mentions with identical annotations. We were left with 9,613 sentiment-annotated professional mentions: 3,316 positive, 1,683 negative, and 4,614 neutral. The dataset contains mentions of 107 professions. To train the LCF model, we divide the dataset into train, validation and test sets. Professions whose mentions occur in the training set do not appear in the validation and test sets. This prevents the model from overfitting to the target professions of the training set, encouraging it to learn the sentiment from the context and handle mentions of unseen professions. Table 2 shows the distribution of the sentiment classes and the number of professions in each set.
Table 2: The annotations are crowdsourced from Amazon Mechanical Turk. The train, validation and test sets do not share professions.
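The agreement filter and the profession-disjoint split can be sketched as follows. The split fractions and the random seed are assumptions, since the text does not state the proportions used, and the function names keep_agreed and split_by_profession are hypothetical.

```python
import random
from collections import defaultdict


def keep_agreed(annotated):
    """Keep (mention, label) pairs where both annotators gave the same label."""
    return [(m, a) for m, a, b in annotated if a == b]


def split_by_profession(mentions, val_frac=0.1, test_frac=0.1, seed=0):
    """Split (profession, sentence, label) rows so that no profession is shared.

    Whole professions, not individual mentions, are assigned to a set.
    """
    by_profession = defaultdict(list)
    for row in mentions:
        by_profession[row[0]].append(row)

    professions = sorted(by_profession)
    random.Random(seed).shuffle(professions)

    n_test = max(1, round(len(professions) * test_frac))
    n_val = max(1, round(len(professions) * val_frac))
    test_p = professions[:n_test]
    val_p = professions[n_test:n_test + n_val]
    train_p = professions[n_test + n_val:]

    def rows(profs):
        return [r for p in profs for r in by_profession[p]]

    return rows(train_p), rows(val_p), rows(test_p)
```

Splitting at the profession level means set sizes in mentions depend on how frequent the sampled professions are, so the realized proportions can differ from the requested fractions.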
We tune the following hyperparameters of the LCF model on the validation set: SRD, architecture type (CDM or CDW), L2-regularization, dropout, embedding dimension, and hidden dimension. The model achieved 87.76% accuracy and 83.22 F1 on the test set. We apply the trained model to find the targeted sentiment of each professional mention of our corpus.
In total, our corpus of professional mentions contains 3,657,827 mentions, covering 4,073 professions. The corpus contains mentions from 133,133 IMDb titles released from 1950 to 2017. The 500 most frequent professions, which have been mapped to SOC major groups (see sec 4.1), cover more than 94% of the mentions.

Analysis
We study profession representation in media according to the frequency of their mentions and the sentiment expressed towards them in subtitles. We also analyze the effect of media attributes like genre, location of the production company, and title type, on the incidence and sentiment of different professions. Lastly, we investigate the relationship between the trends of media frequency and the real-world employment statistics of professions.

Profession Frequency
We calculate the media frequency of a profession as the total number of professional mentions (both singular and plural form, for example, advocate and advocates) divided by the total number of n-grams in the subtitles. Here, n equals the number of words in the profession phrase, for example, doctor is a 1-gram, chief executive officer is a 3-gram, etc. We calculate the frequency of a SOC major group by adding the frequencies of the professions mapped to it (see sec 4.1). This frequency measure is motivated by the Google n-grams study (Michel et al., 2011). We calculate the trend of a profession or SOC major group as the Spearman's rank correlation coefficient (Spearman, 1904) of its media frequency against time. A significant positive correlation denotes an increasing trend, and a significant negative correlation implies a decreasing trend over time (α = 0.05).
Table 3 lists the SOC major groups and their frequency trends. Only 3 SOC groups showed an increasing frequency trend in mentions over time, while 16 SOC groups decreased in frequency over time. The rank correlation was not significant for the remaining 4 SOC groups. These SOC groups contain 500 professions (see sec 4.1). We analyze the trend of some professions belonging to these SOC groups.
Fig 7 a) shows the frequency trend of two professions: hacker and programmer. Both occupations showed a positive frequency trend, but mentions of hackers increased more than programmers.
Life, Physical, and Social Science Occupations: This SOC group includes archaeologists, astronauts, biologists, chemists, geologists, etc. Almost all its professions showed an increasing frequency trend over time. Fig 7 b) shows the trend of four occupations: geologist, biologist, anthropologist, and economist. Mentions of geologists and biologists are consistently more frequent than economists and anthropologists.
Mentions of actors outnumber mentions of actresses in media content. The word actor can be used as a gender-neutral term, explaining part of this trend. The frequency of sports-related professions dipped in the 1960s but has increased since then. Pianist mentions decreased, whereas mentions of bass players, guitarists, and drummers increased over time. Mentions of journalists and reporters are more frequent than correspondents and columnists.
Legal Occupations: Legal occupations include lawyers, judges, attorneys, prosecutors, etc. Fig 7 g) shows the frequency trend of defense attorneys and prosecutors. Mentions of prosecutors (a lawyer who conducts a case against a defendant) are more frequent than defense attorneys (a lawyer who defends the client against criminal charges).

Food Preparation and Serving Related Occupations: This SOC group includes professions related to food serving and food preparation. Figs 7 h), i), and j) show the frequency trends of some of its occupations. Mentions of waiters are more frequent than waitresses overall. Similar to actors, waiters can refer to either gender, so this trend is not surprising. However, mentions of waiters have decreased over time, whereas mentions of waitresses have increased. Gendered professional mentions like barman and barmaid are less frequent than the gender-neutral term: bartender. Mentions of cooks were more common than chefs, but the latter became more frequent in media content in the 2010s.
Healthcare Support Occupations: This SOC group includes healthcare assistants, nursing aides, massage therapists, etc. Fig 7 k) shows the frequency trend of two gendered professions: masseur (male) and masseuse (female). The frequency of masseurs has significantly decreased over time after it peaked around 1970. Mentions of masseuses have become more common than masseurs. The frequency of the gender-neutral term, massage therapist, has increased over time.
Business and Financial Operations Occupations: This SOC group includes accountants, contractors, auditors, etc. Fig 7 l) shows the frequency trend of accountants and auditors. Their frequencies have mostly remained steady over time. Accountants are more frequent than auditors.
Personal Care and Service Occupations: This SOC group includes barbers, valets, nannies, ushers, etc. Fig 8 a) shows the frequency trend of three child-care-related professions: nanny, babysitter, and governess.
Installation, Maintenance, and Repair Occupations: This SOC group includes mechanics, electricians, locksmiths, etc. Fig 8 b) shows the frequency trend of electricians and mechanics. The mentions of both these professions have decreased in media subtitles.
Farming, Fishing, and Forestry Occupations: This SOC group includes farmers, shepherds, herders, fishermen, etc. Fig 8 c) shows the frequency trend of farmers, fishermen, and hunters. Mentions of farmers and fishermen have decreased, whereas mentions of hunters have increased over time.
Architecture and Engineering Occupations: This SOC group includes architects, designers, engineers, surveyors, etc. Fig 8 d) shows the frequency trend of engineers and architects. The frequency of architect mentions has remained steady over time, whereas mentions of engineers have diminished.

Sales and Related Occupations: This SOC group includes sales and real-estate-related professions. Figs 8 e) and f) show the frequency trends of some of its occupations. Mentions of retailers, distributors, vendors, and brokers have increased over time. The increase in frequency is even more pronounced for real-estate jobs: estate agent, realtor, and real estate agent.
Management Occupations: This SOC group includes administrative professions and occupations related to the management of educational institutions. Figs 9 g) and h) show the frequency trend of some management occupations. Mentions of both congressman and congresswoman have increased over time. Although congressmen are mentioned more than congresswomen, the frequency of congresswomen has increased at a higher rate (not discernible from the graph). Mentions of senators, governors, and mayors have decreased over time. The frequency of headmistress mentions has increased, whereas headmaster mentions have decreased in media content. The frequency of the gender-neutral term, principal, has increased.
Educational Instruction and Library Occupations: This SOC group includes teachers, professors, librarians, etc.
Office and Administrative Support Occupations: This SOC group includes clerks, receptionists, tellers, notaries, etc. Fig 9 i) shows the frequency trend of some of its professions: clerk, secretary, and bookkeeper. The frequency of all three professions decreased over time.
Building, Grounds Cleaning and Maintenance Occupations: This SOC group includes construction workers, maids, chamberlains, janitors, etc. Fig 9 j) shows the frequency trend of some of its professions: gardener, sweeper, and caretaker. Mentions of all three professions decreased over time in media content.
Military Specific Occupations: Fig 9 k) shows the frequency trend of some professions in this SOC group. Mentions of field marshal (army) and admiral (navy) decreased over time. The frequency of air marshal mentions increased (not discernible from the graph).
The frequency trend of an SOC group does not necessarily reflect the frequency trend of each of its professions. For example, Figs 8 e) and f) show increasing trends for many sales-related professions, but the overall frequency of the Sales Occupations SOC group decreased. Bankers and cashiers account for a large proportion of Sales Occupations mentions, and their frequencies have decreased.

Profession Sentiment
We find the sentiment expressed toward each professional mention in our dataset using the LCF model (see sec 4.2). The computed sentiment can take the following values: positive, negative, or neutral. We call all mentions tagged with a non-neutral sentiment opinionated mentions. We represent the sentiment expressed towards a profession or a major SOC group as the number of positive sentiment mentions divided by the total number of opinionated mentions. We find the sentiment trend of professions by calculating the Spearman's rank correlation (Spearman, 1904) between the proportion of positive sentiment mentions and time.
The sentiment expressed towards therapists, spies, rangers, and detectives in media content trends toward becoming more positive over time, whereas the sentiment expressed toward doctors, lawyers, professors, and scientists trends more negative. Lawyer mentions have the second-lowest sentiment correlation over time, behind doctors. This negative trend agrees with the work of Asimow (1999). Astronauts not only have the highest proportion of positive sentiment mentions overall (see Fig 11), but also one of the most positive sentiment trends.
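The sentiment measure defined above (positive mentions divided by opinionated mentions) can be sketched as follows. This is an illustrative sketch only: the labels are hypothetical stand-ins for the sentiment classifier's output, and returning `None` for a year with no opinionated mentions is an assumption of this example, not a choice stated in the text.

```python
from collections import Counter


def positive_proportion(labels):
    """Share of positive labels among opinionated (non-neutral) labels.

    Returns None when there are no opinionated mentions, since the
    proportion is undefined in that case (an assumption of this sketch).
    """
    counts = Counter(labels)
    opinionated = counts["positive"] + counts["negative"]
    if opinionated == 0:
        return None
    return counts["positive"] / opinionated


# Hypothetical sentiment labels for mentions of one profession, grouped by year.
labels_by_year = {
    1990: ["positive", "neutral", "negative", "positive"],
    2000: ["negative", "neutral", "negative", "positive"],
    2010: ["neutral", "neutral"],
}

proportion_by_year = {
    year: positive_proportion(labels) for year, labels in labels_by_year.items()
}
for year, proportion in sorted(proportion_by_year.items()):
    print(year, proportion)
```

The sentiment trend would then be the rank correlation of the defined yearly proportions against the year.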

Media Attributes
We have observed that both the frequency of professional mentions and the sentiment expressed towards them change over time. In this section, we study the relationship of the following media attributes: year, genre, title type, and country of production (see sec 3.2) with the observed media frequency and sentiment trends of professions and SOC major groups.
We perform a regression analysis to analyze the effect of media attributes. The response is the frequency or sentiment of the profession or the SOC major group. The predictors are year, genre, title type, and country of production. All predictors are categorical variables except for year, which is numeric. Genre and country are multi-valued. An IMDb title can belong to multiple genres, and its production can take place in multiple countries. Therefore, we create a categorical variable for each possible genre and country, taking values 0 or 1. We set it to 1 if the IMDb title belongs to the corresponding genre or its production takes place in the corresponding country. We ignore media attribute configurations for which the number of IMDb titles is less than 30. We use logistic regression because both frequency and sentiment are proportions, bounded between 0 and 1 (defined in secs 5.1 and 5.2). We use a generalized linear model with the binomial family and logit link. We provide the total number of n-grams and opinionated mentions (defined in sec 5.2) as prior weights to the model to specify the number of trials when the response is frequency or sentiment, respectively.
In the corresponding coefficient heatmap, the color denotes the sign of the coefficient (blue = +, red = −) and the intensity of the color denotes its magnitude. The white cells mean that the predictor is not significant. We observe some interesting relationships between the frequency of professional mentions and media attributes. The frequency of actors increases, but the frequency of actresses decreases when the genre is adventure, documentary, or thriller. The reverse is true when the genre is romance. Mentions of lawyers increase in crime, drama, and mystery genre media content. The frequency of lawyers and attorneys increases, and the frequency of prosecutors decreases when the country of production is the United States. United States-produced movies and TV shows mention cops and sheriffs more than inspectors and police. The opposite is true for United Kingdom-produced titles.
The frequency of detectives and spies increases in mystery genre titles. Comedy, reality-TV, and music genres increase the frequency of dancers, singers, and artists. The frequency of doctors, nurses, and surgeons in movies is higher than in TV shows. In science-fiction and family media titles, the frequency of doctors increases, and the frequency of nurses and surgeons decreases. Documentary titles increase the frequency of reporters and journalists. Mentions of engineers and scientists increase when the genre is science-fiction or documentary and decrease in comedy and fantasy genres. The frequency of teachers and professors decreases in action and adventure genres. Movies mention teachers more than TV shows, but the opposite holds for professors. Mentions of senators, mayors, and presidents increase in news and thriller genres. The frequency of lieutenants and soldiers increases in action and war media titles and decreases when the genre is fantasy or romance.
Figure 14: Heatmap of coefficient values of media attributes in predicting SOC group frequency. The color denotes the sign of the coefficient (blue = +, red = −) and the intensity of the color is proportional to the magnitude of the coefficient. The blank cells indicate that the media attribute is not a significant predictor for the frequency of the corresponding SOC group.
Fig 14 shows the coefficient heatmap of media attributes when the response is the frequency of SOC groups. We highlight some relationships between SOC frequency and media attributes. The frequency of Management, Business, and Financial Operations occupations increases in biography and news genre media titles. Documentaries and science-fiction movies and TV shows frequently mention STEM professions like Computer, Mathematical, Architecture, Engineering, Life, Physical, and Social Science occupations. Mentions of Community and Social Service occupations decrease in reality TV shows, sports, and family movies.
The frequency of Legal occupations increases in news genre titles. The frequency of Arts, Design, Entertainment, Sports and Media occupations increases in music, sports, game show, and biography genre titles. Mentions of Healthcare Practitioners increase in news and drama genres and decrease in musicals. Food Preparation and Serving related professions occur frequently in reality TV shows and less often in music and adventure movies. The frequency of manual labor jobs like Construction, Extraction, Production, Building, Grounds [...]
The proportion of positive sentiments expressed towards marshals, mayors, professors, and prosecutors increases when the genre is mystery. The proportion of negative sentiment increases for congressmen, priests, and prosecutors in crime genre media titles. The average sentiment of SOC groups containing STEM occupations like Computer, Mathematical, Life, Physical, and Social Science occupations becomes more positive in documentaries. Movies express more negative sentiment than TV shows towards Legal occupations.
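The regression setup used in this section (a binomial generalized linear model with a logit link, where the number of trials is the n-gram or opinionated-mention count) can be illustrated with a small self-contained sketch. It is not the authors' code: the Newton-Raphson fitting routine, the single 0/1 genre indicator, and all counts are assumptions chosen for illustration, and significance testing of the coefficients is left out. Passing successes and trials directly is equivalent to modelling the proportion with the trial counts as prior weights.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_binomial_logit(X, successes, trials, iterations=50):
    """Fit a binomial GLM with logit link by Newton-Raphson.

    X holds one predictor row per observation (leading 1.0 = intercept);
    successes[i] out of trials[i] is the observed count for row i.
    """
    dim = len(X[0])
    beta = [0.0] * dim
    for _ in range(iterations):
        grad = [0.0] * dim
        hess = [[0.0] * dim for _ in range(dim)]
        for x, k, n in zip(X, successes, trials):
            eta = sum(b * xi for b, xi in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = n * p * (1.0 - p)
            for i in range(dim):
                grad[i] += x[i] * (k - n * p)  # score of the log-likelihood
                for j in range(dim):
                    hess[i][j] += w * x[i] * x[j]  # negative Hessian (Fisher info)
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


# Hypothetical data: each row is one media-attribute configuration.
# Columns: intercept, 0/1 indicator for a genre (e.g. "crime").
X = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
mentions = [20, 10, 45, 15]       # mentions of the profession (successes)
ngrams = [1000, 500, 900, 300]    # total n-grams (trials)

beta = fit_binomial_logit(X, mentions, ngrams)
print("intercept:", beta[0], "genre coefficient:", beta[1])
```

A positive genre coefficient corresponds to a blue cell in the heatmaps and a negative one to a red cell; with multi-valued genre and country attributes, each genre and country would contribute its own 0/1 column to `X`.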

Employment
We have analyzed the frequency and sentiment trends of different professions and SOC groups in media content. We also studied the effect of media attributes on these representations. However, our analysis has been limited largely to the entertainment media domain, by using NLP on the subtitles of media content. In this section, we test the hypothesis that media stories reflect real-world events. We do so by studying the relationship between the media frequency of SOC groups and their real-world employment trends. We obtained the employment data of SOC major groups from the Occupational Employment Statistics survey (OES). The survey does not provide employment numbers for individual professions. Therefore, we only conduct our analysis on SOC groups. We calculate the Spearman's rank correlation (Spearman, 1904) between the media frequency of the SOC group and the proportion of the working population employed in any of the professions of the SOC group. We compute the correlation for the period 1999-2017. The employment data for the earlier years is not available.
Figure 15: Spearman's rank correlation coefficient of the media frequency of SOC groups with employment. The red-colored bars have positive correlation: media frequency and employment have the same trend. The blue-colored bars have negative correlation: media frequency and employment have opposite trends. The grey-colored bars mean that the correlation is not statistically significant (α = 0.05).
Fig 15 shows the correlation between media frequency and employment for the SOC groups, from most positive to most negative. The correlation is positive for 14 out of the 22 SOC groups (64%) and negative for the rest. Therefore, the trend of media frequency mirrors the employment trend for most SOC groups. The number of media mentions of the SOC group increases as more people are employed in its professions, and decreases as people move away to other occupations. The correlation is significantly negative for only two SOC groups: Healthcare Practitioners and Social Service occupations.

Discussion
Professions vary in the frequency and sentiment expressed towards them in media content. We observed that gender-neutral terms like massage therapists and flight attendants are becoming more frequent than their gendered counterparts. The frequency of some female job titles like waitresses, congresswomen, and policewomen has either increased or remained steady relative to the corresponding male job titles (waiters, congressmen, and policemen), which have decreased in frequency. However, the overall incidence of most male job titles exceeds that of their female counterparts. The frequency of STEM, sports, arts, and design occupations has increased, whereas the frequency of construction, farming, and manual labor jobs has decreased. The frequency of specialized professions like cardiologists, gynecologists, and neurologists has increased, but the frequency of generic terms like doctors and nurses has decreased. The frequency trend of individual professions does not always reflect the overall trend of the subsuming SOC group. Sales-related occupations like retailers, vendors, brokers, and realtors are increasing in frequency, but the aggregate frequency of the SOC group shows a decreasing trend.
STEM occupations are favorably mentioned in media subtitles, but the sentiment expressed towards blue-collar jobs is largely negative. Police, monks and nuns are mentioned negatively, whereas musicians and engineers are portrayed positively. The sentiment expressed towards therapists, astronauts, and detectives is becoming more positive over time, and the sentiment expressed towards doctors, lawyers, professors, and presidents is becoming more negative. The decreasing trend of positive sentiment towards lawyers agrees with the work of Asimow (1999).
Media attributes like genre, country of production, and title type affect the frequency and sentiment trends of professions. Genre especially seems to be a good predictor of the profession types mentioned in the subtitles. For example, science-fiction movies mention engineers and scientists, action and war genres mention lieutenants and soldiers, and mystery titles contain detectives and spies. Adventure and thriller genre titles contain more actor references than actresses, but the opposite is true for romantic movies, suggesting the prevalence of some form of gender bias. Sheriffs are mentioned more than inspectors in titles produced in the US, and the converse holds for the titles produced in the UK, reflecting the different social structures of the two countries. Movies and TV show subtitles did not differ significantly in their professional mentions.
Lastly, we observed that media frequency correlates with the employment trend of most SOC groups. Professions that employed more people were also more frequently mentioned in media content. This supports our hypothesis that media mirrors society and plays a role in our professional choices.

Conclusion
In this work, we have created a searchable taxonomy of professions to facilitate job title search in short context documents like media subtitles. We used WordNet synsets and word sense disambiguation methods to retrieve professional mentions in movie and TV show subtitles. We classified the sentiment (positive, negative, or neutral) expressed towards these professional mentions in the subtitle sentence. We analyzed the frequency and sentiment trends of professions and SOC groups, the effect of media attributes on these trends, and verified the hypothesis that media frequency of professions correlates with their employment statistics. Future work entails extending our analysis to include industries and businesses and exploring other media domains like news and social media. The profession taxonomy and sentiment-annotated subtitle corpus are publicly available at https://github.com/usc-sail/mica-profession/tree/main/datasets.