Boys don’t cry (or kiss or dance): A computational linguistic lens into gendered actions in film

Victor R. Martinez; Krishna Somandepalli; Shrikanth Narayanan

doi:10.1371/journal.pone.0278604

Abstract

Contemporary media is full of images that reflect traditional gender notions and stereotypes, some of which may perpetuate harmful gender representations. In an effort to highlight the occurrence of these adverse portrayals, researchers have proposed machine-learning methods to identify stereotypes in the language patterns found in character dialogues. However, not all of the harmful stereotypes are communicated just through dialogue. As a complementary approach, we present a large-scale machine-learning framework that automatically identifies character’s actions from scene descriptions found in movie scripts. For this work, we collected 1.2+ million scene descriptions from 912 movie scripts, with more than 50 thousand actions and 20 thousand movie characters. Our framework allow us to study systematic gender differences in movie portrayals at a scale. We show this through a series of statistical analyses that highlight differences in gender portrayals. Our findings provide further evidence to claims from prior media studies including: (i) male characters display higher agency than female characters; (ii) female actors are more frequently the subject of gaze, and (iii) male characters are less likely to display affection. We hope that these data resources and findings help raise awareness on portrayals of character actions that reflect harmful gender stereotypes, and demonstrate novel possibilities for computational approaches in media analysis.

Citation: Martinez VR, Somandepalli K, Narayanan S (2022) Boys don’t cry (or kiss or dance): A computational linguistic lens into gendered actions in film. PLoS ONE 17(12): e0278604. https://doi.org/10.1371/journal.pone.0278604

Editor: Natalia Grabar, STL UMR8163 CNRS, FRANCE

Received: February 10, 2022; Accepted: November 21, 2022; Published: December 21, 2022

Copyright: © 2022 Martinez et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: We have made the dataset and labeling models available for download at https://sail.usc.edu/~ccmi/actions-agents-and-patients/.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

TV and film media often reflect the views of a society with a considerable impact in how gender stereotypes are created or reinforced [1]. Through assumed behaviors and social roles, media representation and portrayals are a major influence in the way we construct our beliefs and ideas around gender-appropriate behaviors and norms, particularly during the formative years of childhood and youth [2–5]. Even throughout our adulthood, these portrayals can still guide the way we think [6], how we create our worldview and perceptions of others [7], our fashion choices [8], and the perception of our self-identity [9].

Right now there is a clear disparity in the way characters are portrayed in TV and film media, particularly with respect to the characters’ assumed or perceived gender expression. For example, over the past two decades, the media industry has been highlighted as one where women tend to be both underrepresented [10, 11] and depicted in a stereotypical manner [12, 13]. The former being so gravely unbalanced that male leads outnumber female leads two-to-one, with male characters speaking and appearing on the screen twice as often than their female counterparts [11]. Female characters are often presented in decorative (e.g., for their body and beauty), family-oriented, and demure roles [14–18]; male characters are typically shown as independent, authoritarian and professional agents and, unlike their female counterparts, these representations do not depend on the male actor’s age or physical appearance [12, 19].

One of the main limitations in most of these efforts is their largely qualitative nature that require immense manual work with human annotations and/or surveys [20]. These cannot match the scale at which media content is currently being produced or consumed; in fact, these efforts have been unable to produce systematic data for both science and media scholarship at scale. To provide supporting evidence for systematic differences in gender portrayals at a larger scale, researchers have recently turned to machine learning models. These applications range from works that automatically detect actors’ faces and voices in TV and film [21–24] to works on film narrative understanding through the analysis of character dialogues [25, 26]. A number of these are applied directly to movie scripts to gain insight into the early stages of content creation, and where the suggested modifications could be implemented at lower cost. For example, a linguistic analysis of gender ladenness (i.e., the degree of association in which language may be perceived as feminine or masculine [27]) in movie scripts found that romantic movies tend to include language with a higher degree of feminine association, whereas action movies tend to include language with higher degree of masculine association [28]. With respect to characters’ dialogues, linguistic analysis studies have found that male characters are associated with a higher number of words related to achievement, whereas female characters are usually written with more positive language, lower agency, and less power than their male counterparts [26, 29]. Other types of studies center around the social network inferred from scene sharing among characters. These studies demonstrate that with a few exceptions, men play almost all central roles across all genres, and for every three characters interacting in a movie, at least two are men [25, 26].

These automatic approaches come with its own limitations. For example, even when there is a general concensus among researchers that gender lies on a spectrum, most of the automated media content tools are limited to identifying only the female–male dyad—a reduction that some might argue to be overly simplistic [30]. Another limiting factor is that gender stereotypes are not bound to dialogue or scene co-appearance, but also can be communicated through the actions and behaviors of the characters [31]. For instance, consider the following three common stereotypes. First, the tomboy, a stereotype embodied by girls who are interested in science, vehicle mechanics, sports or other gender non-conforming behaviors or appearances. Second, that of the spinster or crazy cat lady, an unmarried women of a certain age whose sole narrative arc centers around their quest to find a partner. Finally, the scary angry man, which often portrays persons of color as innately savage, animalistic, destructive, and criminal. These stereotypes are rarely described explicitly through the character’s dialogues, and similarly one would be hard pressed to infer them through the character’s scene co-appearance networks. Instead, audiences tend to infer the implied stereotypes through a complex understanding of the character’s appearance, as well as their actions and behaviors [32, 33]. With this in mind, we believe there to be an open opportunity for a complementary analysis based on the character’s actions to better understand the pervasiveness of harmful gender representations in media. This work, to best of our knowledge, is the first to address this gap.

Current research

In this work, we present a large-scale media content analysis framework to uncover differences in how characters engage in actions depending on their their assumed gender expression. To this end, we perform three steps: (i) data collection and annotation; (ii) machine-learning modeling, and (iii) statistical analysis. Fig 1 presents an overview of the complete process. In the following we will briefly describe each of these steps.

Download:

Fig 1. Computational linguistic lens into gendered actions in film.

Our framework starts with a dataset of over 1.2 million movie descriptors from 912 movies and then implements three steps: first, the annotation process, where we collect manual annotations for 9,613 descriptions and over 1.5 million gender expression labels for characters. In the second step, we develop a machine learning model to identify actions, agents and patients from the natural language found in the movie scripts. Our model—trained on a large-set of general-domain documents and fine-tuned on a manually labeled description set—identifies actions, agents and patients from the natural language descriptions found in movie scripts. We use this model to automatically label the complete dataset for analysis. In the final third step, we perform a series of statistical analysis to uncover portrayal differences along characters’ portrayed attributes.

https://doi.org/10.1371/journal.pone.0278604.g001

Data collection and annotation

We start from a collection of 912 movie scripts from which we extracted 131,954 scenes with 1,242,107 action descriptors (i.e., sentences that describe an action occurring during a scene). We design a task to collect manual annotations for a small sample of these action descriptions. This provides us with a manually labeled dataset for actions and their constituents (i.e., the agents and patients engaged in the action). Additionally, to identify the assumed gender expressions for each character in the movie scripts, we design a heuristic-based method based on identifying usage of proper names, gender pronouns and manually labeled examples.

Machine-learning modeling

While the previous step can provide reliable labels, it can only do so for a small subset of the action descriptions. To scale our approach, we develop a state-of-the-art machine-learning model that identifies characters engaged in actions from the natural language descriptors found in the movie scripts. Our approach leverages transformers, particularly the recent developments in large-scale contextual language models (BERT) [34]. We fine-tune the pre-trained general-knowledge BERT model to allow the model to learn the different ways in which scriptwriters describe character actions and behaviors. We use the model to automatically label the remainder of the descriptions for actions, as well as the characters acting as agents and patients of the action. This process yields a set of 1.2M+ automatically labeled descriptors, containing over 50,000 different actions with over 20,000 different participants.

Statistical analysis

As a final part of this work, we show how this framework can uncover portrayal differences along character attributes such as gender. To this end, we perform a series of statistical analyses over the 1.2+ million action descriptions to estimate the frequency of portrayal as a function of role and gender. As part of our results, we provide insights into some of the nuanced aspects in which certain stereotypes are being communicated through character actions and behaviors. Based on previous literature, we categorize our findings on the differences in portrayals into three main groups: (i) in how female characters are often depicted with less agency than male characters [29]; (ii) on the emphasis placed on the female appearance and sexual objectification of women actors [35, 36]; and (iii) on how gender plays a role in the frequency of affective portrayals, either by frequently casting women into overly emotional roles [12, 37, 38] or by a clear absence of male portrayals of affection [39, 40].

These analyses, and the insights obtained, exemplify how the computational framework can be beneficial for film makers, scriptwriters, and the broader society in terms of creating awareness about media portrayals from an equity lens.

In summary, the contributions of this work include:

the design, construction and public sharing of an action description dataset with over 1.2 million descriptions obtained from 912 movie scripts
the proposal for a machine learning method to identify actions, agents and patients from linguistic cues found in the action descriptions of a screenplay, and
Implementation of a statistical framework with the 1.2 million action descriptions to highlight biases in the gendered portrayals of actions, agents and patients in media

To provide other researchers with the opportunity to explore additional hypotheses, we have made the dataset and labeling models available for download at https://sail.usc.edu/~ccmi/actions-agents-and-patients/.

Data collection and annotation

In this section we describe the process used to obtain manual annotations for actions and the characters engaged in these actions as agents or patients. Furthermore, we describe the heuristics used to identify the character’s assumed gender expressions from the language used throughout the movie script.

Our analysis is performed over a dataset of 1.2 million action descriptions obtained from a publicly available collection of 912 movie scripts covering over 31 genres and 104 years of movie productions (1909–2013) [41]. This collection of movie scripts has been widely accepted and validated by the research community, particularly in the analysis of character portrayals [29, 42]. Thus, we believe this corpus provides a representative sample of the typical character actions and behaviors on screen. Moreover, it is one of the few resources publicly available to provide human-validated named entity recognition, co-reference resolution, and gender information for each of the main characters in each movie script.

Action description dataset

From each movie script, we collect all of its action descriptions. Action descriptions are paragraphs within a scene that tells the reader what is going to happen on the screen and describes the characters and their actions. We focus on these descriptions as a complementary approach to prior works based on character dialogue [25, 26, 29]. We split each action paragraph into sentences and identify the actions (verbs) using Spacy [43]. This process yields a total of 1, 242, 107 sentences (μ = 1363.45, σ = 560.03, M = 1313 per description), 1, 634, 230 predicates and 84, 513 unique actions. Table 1 presents descriptive statistics of the constructed dataset.

Download:

Table 1. Descriptive statistics for the action description dataset.

https://doi.org/10.1371/journal.pone.0278604.t001

Manual annotation

We select a sample of 12, 500 sentences to be coded by human annotators. This sample is constructed by rejection sampling, where each sentence has to have at least one verb (action). Our annotation procedure consisted of two tasks: labeling and verification. The following sections describe each task in detail.

A total of 981 annotators were hired through Mechanical Turk. From these, 602 were assigned to the labelling task, and 379 to the verification tasks. On average, each annotator took around a minute to complete an annotation task (μ = 50.21s; σ = 173.69), while only requiring half the time to verify an annotated result (μ = 28.49s; σ = 30.78). Remuneration scheme was devised to ensure that the annotators receive at least an hourly minimum wage in the U.S. This was calculated by diving the current (2021) minimum wage per hour ($7.25) by the expected time it would take to complete a single annotation (∼1 minute).

Labeling.

For the labeling task, we present non-expert annotators with a sentence and ask them to identify the agents (or patients) for a particular action (verb). To ensure that annotators label only characters (and not inanimate objects), our annotation style follows the work of [44] where the syntactic heads of the constituents are annotated instead of their full extent. Moreover, we simplify the annotator’s task by providing annotators with two pieces of information: a pre-selected action, and a list of possible entities (see Fig 2). The former is obtained from the part-of-speech tags provided by the corpus. For each verb, we created a separate annotation task where annotators identify the agents and patients for that particular action. For the latter, however, we were not able to provide a list of all possible characters, as accurately identifying the literary figures in a text is still an open research problem [45]. Instead, we provided our best attempt at a reduced list of possible characters by identifying entities that are more likely to be used as character names. To construct this list, we start by filtering out words that are not pronouns, proper nouns, nouns, or noun phrases. From the remainder, we remove most of the common words since these are not normally used to refer to a character (e.g., door, eyes, room, hand, car, head, floor, etc…). For the special case of honorifics (such as Mr. or Miss), we follow [45] in considering these tokens as part of an entire maximal span (e.g., [Mr. Collins] or [Miss Havisham]). We include this maximal span as a single entry in our select box. For cases where the agent (or patient) is not explicitly stated in the sentence, annotators have the option to check in the box for ‘Does not say’.

Download:

Fig 2. Labeling task.

Annotators are presented with a sentence and an action. They are asked to either select the agent (source) and patient (target) of the action. For cases where one of these is missing, the annotator has the option to check the ‘Does not say’ box.

https://doi.org/10.1371/journal.pone.0278604.g002

Each sentence was annotated by 3 non-experts and their agreement is used as the presumptive label for the next stage. If there was no agreement between the annotators, we discarded the sentence from the sample.

Verification.

In this stage, we ask another annotator to verify if these labels are correct or not. We present a single sentence, the pre-selected action, and the presumptive labels for agents and patients when applicable or a ‘Does not say’ string for when not. The annotator gets prompted with a single question: “Is [AGENT] doing the [ACTION] to [PATIENT]?” and radio buttons for Yes, No or Does not say. If the annotator does not agree with the result, or if they cannot say whether it is correct or not, we discard the particular sentence from consideration.

As a final quality control step, one of the authors checked all the sentences and verified the labels for consistency. We decided to follow this multi-tiered approach to ensure a high level of annotation quality, one that makes sure there is a marked distinction between subjects and patients. This distinction becomes paramount if one considers that the models we will be working with do not have an inherent notion of what a character is. Hence, if the annotation results in low-quality labels, we run the risk of having a model that picks any word from the predicate as the patient for a given action. As we will discuss in further sections, our two-step verification was needed to ensure that most of the labels were correct and identify unreliable annotators.

Inferring assumed gender for characters

To obtain a character’s assumed gender expression, we follow a hierarchical heuristic approach. This approach is heavily informed by prior work on the same domain [25, 26, 29]. Our gender estimation method proceeds as follows: first, for movie scripts of an already produced film, we obtain the character’s gender from the casting of that role from IMDb http://imdb.com. For the remainder of the characters, we rely on the following heuristics:

For proper names, we estimate gender using historical U.S. census data [46].
We use gendered pronouns as markers of that character’s gender. The set of pronouns was selected from Twenge et al. [47], and includes the following words: female pronouns (e.g., she, hers, her, and herself); male pronouns (e.g., he, his, him, and himself), and neutral (or plural) pronouns (e.g., we, they, them).
Lookup over a manually collected word list containing the 3, 000 most frequent words and their gender. This list includes gendered words for family and relationships (e.g., uncle, aunt, wife, husband), common gendered nouns and other words where gender is presumed evident (e.g., boy, gal, policeman, congresswoman).

From the total of n = 2, 136, 207 character instances, our hierarchical heuristic method is able to infer the gender for 71.64% of instances (n = 1, 530, 587). From these, 917, 114 correspond to agents and 613, 473 to patients (see Table 2). In line with previous research, the sample of genders contains male characters in a 2-to-1 proportion to female characters [10, 48]. We manually identify 49, 885 cases of character references which should have a gender, but our heuristics were not able to determine which gender it should be. A majority of these cases (56.4%) are unresolved co-references (e.g., I, you, me), and could be addressed in future work when appropriate literary co-reference systems become widely available.

Download:

Table 2. Gender distribution of agents and patients.

Our sample displays a 2-to-1 ratio of male to female characters. Proportion tests reveal an unequal proportion of Male agents to Female agents, and Female patients to Male patients (p < 0.0001).

https://doi.org/10.1371/journal.pone.0278604.t002

Gender estimation performance.

We estimate the performance of our gender estimation through a manual verification process. This process involved a manual inspection of the dataset to collect 400 character names alongside their gender (i.e., non-gendered, male, female, and neutral). We then verify that our heuristics are able to infer the correct gender for each of these instances as a classification task. We report classification performance as part of our results.

Machine-learning model

In this section we provide an overview of the computational model that identifies the action and its constituents (agents and patients). This model is based on the current state-of-the-art BERT-based models for semantic role labeling (SRL) [49]. Additionally, we present the steps performed for domain-adaptation, which results in a significant improvement in performance over competitive baselines.

Automatic identification of character actions

We frame the problem of automatically identifying the set of actions and its participating characters as a Semantic Role Labeling task (SRL), with a few differences. Given a sentence, the SRL task consists of analyzing the propositions expressed by some target verbs of the sentence. In particular, for each target verb, all constituents in the sentence which fill a semantic role of the verb have to be recognized. Typical Semantic arguments include Agent, Patient, Instrument, etc. and also adjuncts such as Locative, Temporal, Manner, Cause, etc. [50].

According to Shi et al. [49], a typical formulation of the SRL task is split into four subtasks: predicate detection, predicate sense disambiguation, argument identification, and argument classification. We start from the assumption that there are models that support the first two tasks in a reliable manner. Specifically, in our experiments, we use Spacy [43] for predicate identification, which allows us to focus entirely on the argument identification and classification sub-tasks. Furthermore, in contrast to the traditional SRL task [50], we are only interested in the characters performing the action, and those that are the object of an action. Hence, our label set can be restricted to actions, agents, and patients only. Another point of contrast is that we explicitly make the distinction between objects (inanimate) and patients (characters).

Proposed model

We follow Shi et al. [49] in applying a simple yet powerful recipe for SRL: obtain word vector representations from a pretrained BERT-based architecture to train a Recurrent Neural Network for sequence labeling. Our proposed model (see Fig 3) learns to map the sequence of tokens from an input sentence to a sequence of labels for actions, agents, and patients. Inputs to our model are sentences, which are tokenized and fed into a BERT model to obtain highly contextualized word representations. These representations are then used as input to a Recurrent Neural Network to produce a sequence of token-level labels.

Download:

Fig 3. Proposed SRL system.

Starting at the bottom, the input to the system is an action description in natural language. The output, shown at the top of the figure, is a sequence of labels (one per word). Labels indicate whether this word depicts an action, agent, patient or none. From its inputs, our model obtains a highly-contextualized representation for each word using the BERT transformer [53]. Each representation corresponds to a high dimensional dense vector that encodes the semantics of that word and the context it plays within the sentence. The sequence of vector representations is then fed into a recurrent neural network and a softmax layer for sequence labeling. As a post-processing step, a set of heuristics aggregate conjunctions to handle the case of groups of agents or patients.

https://doi.org/10.1371/journal.pone.0278604.g003

In contrast to Shi et al. [49], who inputs the predicate as an additional feature to the model, our current setup restricts role labeling to a single predicate per sentence. If a sentence has more than one predicate, we create a separate copy for each predicate; this same setting was applied by Daza and Frank [51] and Zhou et al. [52]. In the following sections, we provide further details into the steps taken by our model.

Input representation.

We selected a BERT model for our input representation because of its remarkable success on a variety of NLP tasks, such as question answering, dialogue systems, and information extraction [53]. In this work, we start from the original BERT model (https://github.com/google-research/bert) trained for the general-domain on a large unlabeled plain text corpus—that is, the complete English Wikipedia and BookCorpus.

We follow the traditional format used for sentence encoding in the BERT transformer [53]. To obtain an input representation, we feed an action description into the model as an input sequence. This sequence starts with a sentence delimiter ([CLS]) and ends in a separator delimiter ([SEP]), as follows:

Our first step is to tokenize the sentence elements using WordPiece [54]. The WordPiece tokens are fed into pre-trained BERT models from which we obtain one vector representation for each of the tokens. Formally, the sentence representation step maps a delimited sequence of k words into n WordPiece tokens, as follows (1)

Note that the length of these sequences might not be the same as the tokenizer might split a word into multiple sub-word tokens. The token sequence is then mapped into the representation sequence by the BERT model. Let H denote the sequence of BERT representations, given by (2)

The dimension of each h_i is given by the BERT model, and corresponds to 768 and 1024 for the bert-base and bert-large models, respectively. The sequence of highly-contextualized vector representations is then fed into a RNN for token-level prediction.

Token-level prediction.

We use a bidirectional recurrent neural network (RNN) to obtain the semantic role labels for each word in a sentence. RNNs are a class of neural networks specialized in processing sequences of inputs. With each input, the RNN updates its internal state (memory) and produces a probability distribution over the labels for that input. Here, we feed the output of BERT (Eq 2) into the RNN layer as sequence of tokens, for which we obtain a sequence of probability distributions. To obtain the sequence of SRL labels (i.e., Verb, Agent, and Patient), each token gets assigned to the label with the maximum (posterior-)probability. For the RNN, we explore two popular configurations: Long short-term memory cells (LSTM) [55] and Gated Recurrent Units (GRU) [56].

Formally, the sequence obtained from BERT, H, is then fed into a bidirectional RNN to learn a mapping between the tokens and the SRL labels of interest. The RNN takes the sequence of representations H and outputs a sequence of n hidden vectors {v_[CLS], v₁, …, v_n, v_[SEP]}. Each v_i is constructed as the concatenation of the left-side and right-side context, . To predict a label, we use a fully connected dense layer and a softmax function over all labels: (3) where ϕ is the activation function. In our experiments, this function corresponds to a linear activation function ϕ(x) = xA^⊺ + b where A and b are learn-able parameters of the model. Finally, the complete model is trained using a weighted cross entropy loss where w_C is the weighted associated to the class C

Post-processing.

From the SRL system, we collect two outputs: the subword WordPiece token representation of the sentence, and the sequence of token-level labels. To recover the word-level sentence, we post-process the outputs of the RNN (Eq 3) by removing the special tokens introduced by BERT (i.e., [CLS], [SEP], [PAD] and [MASK]) and merging back WordPiece tokens into words. The word-level label is calculated from its corresponding tokens as the mode of the token’s label. Additionally, we postprocess the data further to accommodate for cases where agents and patients are composed of more than one word. Examples of this include the case of honorifics (e.g., ‘Mr. Anderson’, ‘Captain Crunch’) or expressions with more than one character (e.g., ‘Mr. Smith and his wife’). We restructure our word sequence by (i) concatenating consecutive words with the same label into a single term, and (ii) merging consecutive terms that end up in a conjunction. The labels for word groups is assigned to be equal to the label of the left-most word. For example, after applying our model and post-processing procedure to the sentence presented in Fig 3 we are able to discern that “Boromir” is the agent of the action “look”, and that the patients correspond to “Elron and Galdalf”.

Fine-tuning

Fine-tuning aims to adjust the BERT language model and its vocabulary for the way language is used in a particular domain, without offsetting the generalization power of the original model. This procedure typically results in models that achieve state-of-the-art performance for domain-related tasks. In this work, we perform fine-tuning of BERT language models by continuing the train of the model end-to-end on the action description dataset described in the previous section. In end-to-end training, we back-propagate the errors from the sequence labeling task back through the network, all through the BERT transformer layers (yellow box in Fig 3). This results in the update of the BERT layer parameters, which adapts them to the patterns in the language of the movie scripts.

Model performance comparison.

Model performance was estimated over the manual annotated dataset of n = 9, 613 action descriptions. For training, we used 75% of the available data (n = 7, 209). Performance was estimated on 15% of the data, held back as a test set (n = 1, 419). A development set (10%, n = 985) was used for parameter optimization and early stopping. Our experiments measure the model’s ability to correctly categorize each token by its corresponding semantic role label. Hence, this can be seen as a 4-way classification task (i.e., Action, Agent, Patient, and None). We report the average accuracy and micro-average F-score for this classification task.

Baselines.

We compare the performance of our proposed approach to 3 baseline models, which were selected informed by previous literature in the SRL task. Our first baseline corresponds to a named-entity based approach proposed by Sap et al. [29] where we leverage part-of-speech tags and syntactic dependency tress to identify actions, Agents and Patients. The second and third baselines follow state-of-the-art BERT-based models for semantic role labeling (SRL): SimpleBERT [49] and AllenNLP SRL [57]. Even though our architectures are quite similar, there are a few differences between our approaches, particularly with respect to the inputs and the sequence labeling layer. A first difference is that AllenNLP uses a time-distributed linear dense layer to classify the sequence of outputs from the BERT system into the sequence of semantic role labels. In contrast, SimpleBERT and our method use a single RNN layer, as this might be better adapted to handling sequence data. Second, both SimpleBERT and AllenNLP extend the sentence representation of BERT to include the current predicate, as a way of informing the model which action to attend to. Instead, we restrict our inputs to only a single predicate per sentence. Finally, we note that SimpleBERT and AllenNLP are trained on non-literary data. Hence, these models serve as a direct comparison to the performance of out-of-the-box state-of-the-art SRL systems when applied to a novel real-world application.

We control for this possible source of noise by presenting two versions of each baseline. In the first version, the models are only provided with the action description and have to perform predicate identification and semantic role labeling. In a second version, we provide the sentence and the gold standard predicate, so the models only need to do the semantic role labeling task. We refer to the second version as the oracle predicate version.

Statistical analysis

We propose a statistical model to identify significant differences in the frequency of the action portrayals due to the role and gender of its participants. We do this through a series of four studies. These studies explain the response variable (action frequency) as a function of different independent variables. The first and second studies examine the relation between action frequency and agent’s or patient’s gender independently. We denote the results from this studies by (S1: agent-only) and (S2: patient-only), respectively. The third study investigates the effects of the agent’s and patient’s gender simultaneously by including an interaction term as part of the co-variates. This model highlights actions that occur more frequently only when agents’ and patients’ have a particular gender each. We denote this model by (S3: agent & patient). Finally, our fourth model studies the dependency between action portrayals dependent on film trends over the years. We denote this study by (S4: agent & patient + time).

In each study, a generalized mixed-effects linear regression model is used to relate the genders and roles to the frequency of portrayals. We describe our statistical formulation in the following sections.

Regression models

We part from the assumption that if there is no difference between the number of times an action is portrayed with respect to any particular gender, then this action does not reflect a gender stereotype. This gives us a natural framework for posing the problem of identifying stereotypes as a regression over the frequency of actions and the gender of its participants.

To uncover this relation, we use a Poisson-regression generalized linear mixed model(GLMM; [58]). GLMMs are an extension of generalized linear models (e.g., logistic regression) to include both fixed and random effects. A fixed effects approach was chosen over a random-effects approach because our data contain repeated measurements (i.e., a single character can participate in multiple actions, multiple times). Furthermore, we use a Poisson regression as it is particularly useful for response variables that represent counts or frequencies [59].

The formal specification of the GLMM is as follows. Let Y = [y₁, y₂, …, y_N] be our response variable. Each y_i corresponds to the number of times we see action i in our dataset. We assume that Y follows a multi-variate Poisson distribution, and we model the expected value as a linear combination of unknown parameters, (4) where X and β are the fixed effects design matrix, and fixed effects; Z and γ are the random effects design matrix and random effects. The link function g corresponds to the log-link function log. We vary our predictor variables X and random effects matrix Z according to the each study as we describe in the forthcoming sections.

Parameter estimation.

Regression parameters are estimated by deviance minimization, a generalization of the idea of using the sum of squares of residuals in ordinary least squares to cases where model-fitting is achieved by maximum likelihood [58]. Additionally, we identify statistically significant coefficients through a series of hypothesis tests on the coefficients. For each coefficient, we perform a Z-test and correct for multiple comparisons using Holm-Bonferroni method.

Model validation.

For each of our presented studies, we validate its corresponding regression models by comparing the performance with two reduced models: null and the no-interaction models. In the null model, no additional explanatory variables are used. This model seeks to explain frequency of portrayal as a function solely of the control variables Z. In the no-interaction model, we remove the random effect–cross–gender interaction variable. By comparing against these two reduced models, we can provide statistical evidence for the existence of a relation between the gender of the characters (X) and the frequency of portrayals (Y). Furthermore, by comparing against the no-interaction model, we provide evidence for the hypothesis that the interaction between the agent’s (or patient’s) gender and actions makes for a better predictor than just the action and gender on their own. All model comparisons were done using a likelihood ratio test (χ² test) at a significance level of α = 0.05.

Studies

Study 1: Agent-only.

For our first study, the fixed effect matrix, X, corresponds to an indicator variable of the gender of the agent. Genders are encoded as M, F and N for male, female and neutral, respectively.

Additionally, we control for two sources of variability as part of the random effect matrix, Z. The first effect controls for the distribution of actions. That is, the fact that naturally some actions will occur more often than others. For this, Z incorporates actions as a categorical co-variate (). The second effect we control for is the movie genre. This comes from the fact that certain actions are more common than others for a particular type of movie. For example, we can expect the action ‘run’ to be more common in action movies than in dramas and romantic comedies. We obtain the genres for all of the movies in our data set from IMDb, and transform them into a binary matrix, m_g ∈ {0, 1}^N×G, where each entry indicates whether that movie has that particular genre. We include m_g as an additional co-variate in our model. Formally, Z is given by (5)

To fit this study’s regression models, we sub-sample the action description dataset to include only those records with a known agent. We observe that the gender distributions of this sub-sample still follows the gender distribution of the full dataset. Moreover, it also follows previous reports in the literature, that is, it remains as a 2-to-1 male to female ratio [10, 48].

Study 2: Patient-only.

Similarly to our first study, in the second study, the fixed effect matrix, X, encodes an indicator variable of the gender of the patient. It also uses the same random effect matrix as the previous study (Eq 5)

To fit the regression model, we sub-sample the action description dataset to include only those records with a known patient. This sub-sample also follows the gender distribution of the full dataset and the 2-to-1 male to female ratio.

Study 3: Agent & patient.

For the third study, the fixed effect matrix, X, is designed to reflect the gender group of agents and patients. This gender group is constructed as a categorical value for the combinations of male-to-male, male-to-female, female-to-male, and female-to-female. We use the same random effect matrix, Z, as in the first study (Eq 5).

For this study, the regression model is fitted to those records for which we know the gender of both agent and patient.

Study 4: Agent & patient + time.

It could be argued that the frequency of actions can be explained by changes in film trends over the years. Since our sample already provides a resource that covers a large span of production years (1909–2013), in our final study, we investigate if the frequency of portrayals can be explained as a function of both time and character demographics. To this end, we include an additional control variable m_y ∈ {1909, 2013}^N, the year the movie was released, as part of the control variables Z. By following this approach (as opposed to just including the variable as an additional co-factor), we ensure that the model learns that different years might follow different trends. The final control effect variable is given by

Results

Annotation verification

Manual annotations.

From the 12, 500 sampled sentences, we found 14, 344 actions to be annotated. Annotators agreed with both the agents and patients for a large majority of the cases n = 11, 775(82.09%). Although, in one out of five, the verification annotation considered their answers to be incorrect (n = 2, 162(18.36%)). A posterior analysis on the samples that were deemed incorrect reveals that most (90%) of the errors were caused by a small subset of annotators. We believe this annotators skipped the task completely by marking the ‘Does not say’ box and submitting. After discarding the incorrectly labeled samples, we are left with a dataset of 9, 613 sentences with at least one action.

Character’s gender estimation performance.

Table 2 presents the gender distribution of our dataset. Previous works argue that male characters are generally given more agency than female characters [11, 29]. We are able to corroborate this finding through proportion tests. Our results reveal that the proportion of male agents is significantly higher than that of female agents (Z = 294.12, p < 0.0001). Additionally, the proportion of female patients is larger than that of male patients (Z = 294.12, p < 0.0001).

The results of the gender estimation heuristics performance on a set of 400 manually selected samples is shown in Table 3. This table reports precision, recall and F1 for each one of the gender groups, as well as average metrics across all groups. Our method achieves 75% accuracy and macro-F1 score of 77.75%, with individual F1 scores rating from 72.0% (non-gendered) to 87.0% (neutral). From their class-level results, we observe that our gender classification method is fairly precise when labeling female and male characters as well as gender-neutral terms (e.g., they, them).

Download:

Table 3. Gender classification report for the proposed heuristics over 400 manually annotated samples.

https://doi.org/10.1371/journal.pone.0278604.t003

Machine-learning model performance

The complete system-level performance results are presented as part of the S1 Table. With respect to identifying agents and patients, the baseline models achieve high precision but suffer from low recall. While a high precision is important, we must consider that this model is to be used in the context of a large-scale analysis of characters and their actions. To obtain the most representative sample for such analysis, we would want our model to retrieve as many instances of actions and their participants. Hence, models with a higher recall ought to be preferred.

Even though there is a clear domain mismatch in the way the baselines were trained (news wires vs. movie scripts), both baselines can still recover some of the signal present in the dataset. In contrast, our naïve approach of relying on a pre-trained model BERT-base (uncased) resulted in no patient label being produced, and thus a 0.00% F1 score for that category. This suggests that movie scripts, and character names specifically, always follow a proper grammar.

Our results show that the domain adaptation of BERT language model resulted in the biggest improvement overall. For example, domain adapting SimpleBERT [49] resulted in a 6 percent increment in action identification, and about 3 to 4% (absolute) gains in agent and patient classification. Furthermore, our proposed model, trained end-to-end, achieved over 30% (absolute) above the baseline performance. The best performing model was our proposed conjunction of transformer and RNN (GRU), where the transformer was initialized with a BERT-base cased pre-trained model, and the complete set of parameters was updated end-to-end. This model achieves F1 scores of 96.80, 89.78 and 73.00 percent for action, agent and patient respectively. Compared to the baseline models, the difference in these performances was found to be significant (permutation test, n = 10⁵, all p < 0.05). Surprisingly, even though we did not precondition BERT with the current predicate (as both SimpleBERT and AllenNLP do), our model was able to correctly infer the action for most of the sentences.

Finally, we investigated changes in the performance of the proposed model due to different RNN dimensions (see S2 Table). The model seems to get saturated around a dimension of 300, with higher dimensions not performing particularly differently from the current size. This result also suggests that the poor performance of the SRL baselines could be due to their larger size.

Statistical models results

Across all four studies, models that include gender predictors performed significantly better than the null models (χ² tests, all p < 0.0001). Including interactions between gender and action improved the performance of all models (χ² tests, all p < 0.0001). These results support our assumption that gender plays an important role in defining the frequency of portrayals for a particular set of actions.