
Auditory word recognition of verbs: Effects of verb argument structure on referent identification

  • Mònica Sanz-Torrent,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Supervision, Validation, Writing – original draft, Writing – review & editing

    Affiliation Grup de Recerca Cognitiva i Llenguatge (GRECIL), Departament de Cognició, Desenvolupament i Psicologia de l’Educació, Universitat de Barcelona, Barcelona, Spain

  • Llorenç Andreu ,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    landreub@uoc.edu

    Affiliation Grup de Recerca Cognitiva i Llenguatge (GRECIL), Estudis de Psicologia i Ciències de l’Educació, Universitat Oberta de Catalunya, Barcelona, Spain

  • Javier Rodriguez Ferreiro,

    Roles Data curation, Formal analysis, Methodology, Validation, Writing – original draft, Writing – review & editing

    Affiliation Grup de Recerca Cognitiva i Llenguatge (GRECIL), Departament de Cognició, Desenvolupament i Psicologia de l’Educació, Universitat de Barcelona, Barcelona, Spain

  • Marta Coll-Florit,

    Roles Data curation, Methodology, Validation, Writing – original draft, Writing – review & editing

    Affiliation Estudis d’Arts i Humanitats, Universitat Oberta de Catalunya, Barcelona, Spain

  • John C. Trueswell

    Roles Data curation, Formal analysis, Investigation, Methodology, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Psychology, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America

Abstract

Word recognition includes the activation of a range of syntactic and semantic knowledge that is relevant to language interpretation and reference. Here we explored whether the number of arguments a verb takes impinges negatively on verb processing time. In this study, three experiments compared the dynamics of spoken word recognition for verbs with different preferred argument structures. Listeners’ eye movements were recorded as they searched an array of pictures in response to hearing a verb. Results were similar across the experiments. The time to identify the referent increased as a function of the number of arguments, above and beyond any effects of label appropriateness (and other controlled variables, such as letter, phoneme and syllable length, phonological neighborhood, oral and written lexical frequencies, imageability and rated age of acquisition). The findings indicate that the number of arguments a verb takes influences referent identification during spoken word recognition. Representational complexity and the amount of information generated by the lexical item to aid target identification are discussed as possible sources of this finding.

Introduction

A great deal of experimental evidence supports the idea that word recognition includes the activation of a range of syntactic and semantic knowledge that is relevant to language interpretation and reference (e.g., [1–4]). This information is believed to include syntactic category information (e.g., noun, verb, adjective), combinatory syntactic information (e.g., the number and types of syntactic complements the word assigns), non-combinatory semantic information (animacy, etc.) and combinatory semantic information (the number and types of semantic entities, or arguments/roles). It is also believed that this information is activated in real time, as a function of the frequency and contextual relevance of the word (e.g., [5–9]). Moreover, neuroscientific evidence supports the idea that this information is functionally organized in the brain and activated in response to the recognition of lexical items, with the evidence coming from event-related potentials (e.g., [10–13]), functional magnetic resonance imaging (fMRI) [14–15] and lexical processing dissociations related to brain damage (e.g., [16–29]).

Given this evidence, one might expect that the sheer amount of information that a word activates would impinge negatively on its processing time. Yet this expectation is not always confirmed in studies of language processing. For example, some reading time data suggest that the processing time of a verb is not a function of its argument structure or semantic complexity (e.g., [30–32]), whereas other reading data suggest that lexical semantic complexity does influence reading times (e.g., [33–34]). In particular, Inhoff [30] found that factive and nonfactive verbs did not receive different fixation times. Rayner and Duffy [31] showed that verb complexity does not affect lexical access time, in that causative, factive and negative verbs did not influence fixation times. Schmauder [32] was unable to find evidence that the number of semantic arguments influences the ease of processing during language comprehension. In contrast, Gennari and Poeppel [33] found that eventive verbs showed longer processing times than stative verbs. Moreover, McElree et al. [34] showed that reading times were longer for complements that required type-shifting than for complements that directly matched the semantic restrictions of the matrix verb. Likewise, some cross-modal lexical decision tasks designed to tap processing difficulty have found positive effects of verb argument complexity (e.g., [35–37]), whereas others have not [38–39]. In particular, Shapiro and collaborators [35–37] found in different studies that verbs’ representational complexity (syntactic subcategorization and argument structure) affects real-time sentence processing. On the other hand, Schmauder and collaborators [38–39] did not find this effect in cross-modal lexical decision and monosyllabic secondary lexical decision tasks.

Perhaps these inconsistent results relating lexical processing time to representational complexity, in the face of overwhelming evidence that such lexical information is indeed computed in response to encountering a word, are evidence favouring cost-free parallel activation of lexical information (e.g., [40–43]). From this perspective, lexical processing time is not related to representational complexity per se but may instead be related to how well that information informs the task at hand. For example, verb information would have a greater impact on reading times in a sentence if it generated structural ambiguity related to the meaning. On the other hand, other theories appeal to semantic and syntactic complexity on their own to explain processing times (e.g., [33, 44]). This account assumes that the more complex a word is, the more processing time it requires. For example, Gennari and Poeppel [33] used a lexical decision task and a self-paced reading study to analyze the processing times of eventive verbs, which denote causally structured events, and stative verbs, which denote facts without causal structure. As they expected, the conceptually more complex eventive verbs took longer to process than stative verbs in both tasks.

In our study, we seek to analyse in greater detail whether lexical processing time increases as a function of representational complexity. Concretely, we aim to study whether hearing a verb in isolation and looking for its possible referent is influenced by verb argument complexity. We operationalize “verb argument complexity” as the number of arguments that a verb takes. Jackendoff [45] states that the verb defines the number of semantic arguments and determines which of these arguments have to be expressed in the syntactic structure of the sentence in which the verb is embedded. For example, the verb “to hit” must have two arguments: the “agent” (the hitter), who executes the action, and the “patient” (the person or thing being hit), who suffers the action. In contrast, the verb “to give” has three arguments (“agent”, “theme” and “recipient”). Verbs can thus take one, two or three arguments, and these three types of verbs show incremental semantic and syntactic complexity. These differences allow us to analyse whether lexical processing time also increases as a function of the number of verb arguments. Because the number of syntactic complements and the number of semantic arguments a verb can take are highly related, we cannot in this study distinguish between syntactic and semantic complexity, and so we simply use the term verb argument complexity to refer to both.

The experiments below used the visual world paradigm [46, 47] to study this issue. In this paradigm, participants’ eye movements are recorded as they hear words that refer to visually present referents. For example, Allopenna et al. [5] recorded listeners’ eye movements to a visual display containing a target object (e.g., a beaker), a cohort competitor (e.g., a beetle), a rhyme competitor (e.g., a speaker), and an unrelated competitor (e.g., a carriage). Participants followed a spoken instruction to move one of the objects with a mouse. Initial fixations were equally likely to be directed to the target and the cohort competitor, but after the disambiguating point, looks went towards the target. Moreover, after this point, the rhyme competitor received more fixations than unrelated items. Subsequent work revealed that word recognition is affected by frequency [48], neighborhood density [49] and coarticulatory mismatch [50]. These findings showed that eye movements in this task can provide a real-time window on the dynamics of lexical activation.

To date, only one visual world study has examined how word recognition is related to verb argument complexity, and it was conducted only with children (see [51]). Andreu, Sanz-Torrent and Guàrdia-Olmos [51] compared the dynamics of spoken word recognition for nouns and verbs with different argument structure preferences in Spanish-speaking children with and without Specific Language Impairment (SLI). All the groups recognized nouns faster than verbs and recognized one-argument verbs faster than two- and three-argument verbs, although all effects occurred later, after word offset. It was also observed that children with SLI were slower than their controls, especially in the recognition of three-argument verbs.

Using the same method, the present study sought to examine whether adults’ speed in identifying words varies with verb argument complexity in auditory single word recognition. This should provide more accurate results than previous methods such as the cross-modal lexical decision task or self-paced reading, because the visual world paradigm provides not only a measure of the time participants take to activate a word but also the moment-by-moment temporal dynamics associated with the activation of lexical semantic information.

The studies reviewed above showed inconsistent results relating lexical processing time to verb argument complexity (e.g., [30–34]). On the one hand, a group of studies found that processing time is not related to the number of verb arguments (e.g., [30–32, 38–39]). If this is the case, we will not find differences in word recognition as the number of verb arguments increases. On the other hand, if processing time is affected by argument complexity (e.g., [33–37]), latencies should grow longer as the number of verb arguments increases.

In this work, we report three experiments comparing the dynamics of spoken word recognition for verbs with different argument complexity. In experiment 1, we recorded listeners’ eye movements as they searched an array of pictures in response to hearing a verb. The target verbs differed in the number of arguments they take (one, two or three). In experiment 2, participants had the same task but also had to indicate via a button press exactly when they had heard the target word in the input stream. Finally, in experiment 3, we controlled additional variables that could in principle affect auditory word recognition, extended the sample and increased the number of verb stimuli.

Experiment 1

Method

This study was approved by the Ethics Committee of the Universitat Oberta de Catalunya.

Participants

Thirty-one native Spanish speakers participated in the experiment (16 females and 15 males). All participants were born in Spain and attended primary and secondary school in Spain. They were students or junior faculty at various universities in the Barcelona area. All participants either had normal uncorrected vision or wore soft contact lenses or eyeglasses. They gave their written informed consent for participation in this study.

Stimuli

We used the same stimuli as in Andreu, Sanz-Torrent and Guàrdia-Olmos [51]. Eighteen verbs (six one-argument, six two-argument and six three-argument verbs) were used as target words. Moreover, 18 nouns were used as targets for filler stimuli (see S1 Appendix). Both sets of words (verbs and nouns) were selected following the same criteria: they had to be very common words and easily recognizable from visual stimuli (for example, the events denoted by the verbs walk, open or tie are easily recognizable, but those denoted by verbs like think or love are not). A preliminary list of words was first created and then only those words that received relatively high imageability ratings were selected. All the words were matched for number of syllables such that each word class contained the same number of monosyllabic (one), disyllabic (thirteen) and trisyllabic (four) words. Verbs with different argument structures and the nouns had the same mean syllable length of 2.16. In addition, for both verbs and nouns we controlled written Spanish frequency (using the LEXESP corpus [52]). At the time experiment 1 was conducted we had no access to Spanish oral lexical frequency data, but given that Spanish has a very shallow orthographic system, we used written frequency. We also controlled imageability, using published rating norms [53]; label appropriateness, using ratings (on a 1–7 scale) from a separate group of 32 adults; and mean age of first production, using the FREQ program of CLAN (CHILDES project [54]) on the Serra-Solé and Vila corpus and the authors’ own database, which includes monthly speech transcriptions of 13 children from approximately ages 1 to 4.

Table 1 shows that the three verb subsets did not differ significantly on any of the variables, although there was a marginally significant uncorrected pairwise comparison in imageability between one-argument and two-argument verbs [t(10) = 2.07, p = 0.07]. Table 2 shows that, as expected, verbs did not differ from filler nouns in any way except for imageability and label appropriateness (see, e.g., Gillette, Gleitman, Gleitman and Lederer [55] for a similar effect in English).

Table 1. Mean properties of verb classes, with SDs and ranges in parentheses.

F-Ratios reflect effect of verb class.

https://doi.org/10.1371/journal.pone.0188728.t001

Table 2. Mean properties of target verbs and filler nouns, with SDs and ranges in parentheses.

F-Ratios reflect effect of syntactic category.

https://doi.org/10.1371/journal.pone.0188728.t002

In sum, within the set of verbs, verbs with one, two or three arguments did not differ from each other in terms of how imageable they were or how appropriate they were as labels for their pictures. On the other hand, filler nouns differed from target verbs in that they were more imageable and more appropriate as labels for their corresponding pictures.

As described in Andreu, Sanz-Torrent and Guàrdia-Olmos [51], the 36 words (18 verbs and 18 nouns) were each paired with a picture depicting the action or object. Each target picture was then paired with three additional pictures such that the resulting set of four images always included two event images and two object images. Target verbs had one event competitor and two object distracters, and filler stimuli had one object competitor and two event distracters. The preferred names for competitor and distracter pictures were similar in frequency to the target names (using the LEXESP corpus [52]). In addition, the onset phoneme of each target word always differed from the onset phoneme of the words for the competitor picture and the two distracter pictures, so as to avoid auditory cohort competitor effects (see Allopenna et al. [5]).

Target words were recorded by a male native Spanish speaker and sampled at 44,100 Hz. Each trial image consisted of four pictures, each placed within one of four quadrants on the computer screen (see Fig 1). The background was white, with two black lines, one vertical and one horizontal, dividing it into the four quadrants. The positions of the target picture, the competitor and the distracters were randomized across these four quadrants. Moreover, the number of arguments involved in target and competitor event pictures was balanced across conditions. In particular, of the six target items within each condition, two always appeared with a one-argument competitor, two with a two-argument competitor and two with a three-argument competitor. Finally, we carefully selected distracter pictures so that they did not resemble the targets in appearance, form, function or color.

Fig 1. Stimuli example.

(A) Target verb: To lick (one-argument verb); competitor: To launch (three-argument verb); distracters: plane and cauldron. (B) Filler stimulus: Target Noun: cake; competitor: window; distracters: To ride (two-argument verb) and to paint (two-argument verb).

https://doi.org/10.1371/journal.pone.0188728.g001

The audio and the visual image for each item were merged into a video file lasting 4000 ms, using VirtualDubMod software. In each video, the onset of the spoken word coincided with the onset of the visual stimuli. The spoken word finished around 1000 ms after image onset.

Procedure

Participants were seated approximately 22 inches in front of a Tobii T120 eye tracker with an integrated 17-inch TFT monitor. Tobii Studio software was used to present the stimuli and collect the eye tracking data. Stimulus videos were 800 x 600 pixels in size and centered on the screen, which was set to 1024 x 768 pixels. The visual angle of each object subtended approximately 13 degrees, well above the 0.5-degree accuracy of the eye tracker. All audio was played over a mono channel split to two loudspeakers positioned on either side of the viewing monitor. Eye position was sampled at 120 Hz (i.e., at 8.333 ms intervals).

A nine-point calibration procedure was carried out at the beginning of the experiment. Tobii Studio automatically validated calibrations, and the experimenter could, if required, repeat the calibration process when validation was poor. Calibration took approximately 20 s. Participants were instructed that on each trial they would see a set of four pictures and hear a single word spoken aloud. Their task was to find the picture mentioned and then continue looking at it until the video disappeared. There were two practice trials before the experimental task (one with a verb target and one with a noun target) to acquaint the participant with the flow of events. The test videos were presented in random order in two blocks. Each block contained eighteen different words (nine target verbs, three of each verb type, and nine fillers in which the noun was the target). All the participants were given both blocks. Between trials, participants were presented with a crosshair centered in the middle of the screen (which they had been instructed to fixate). This position was equidistant from each quadrant and corresponded to the intersection of the two lines that divided the four quadrants. The crosshair was displayed for 2000 ms.

Analysis

For each target picture, there was a pre-defined area of interest consisting of a rectangle surrounding the picture (see Fig 1). The horizontal and vertical eye position data were then used to determine looks to the target picture (see S2 Appendix). A value of one was given to every eye-tracking sample that fell within the target region; otherwise the sample was given a zero (looks to other areas, looks off the screen, or track loss). Then, for each participant on each trial, the proportion of looks to the target was calculated during two time windows, following Andreu, Sanz-Torrent and Guàrdia-Olmos [51]. The first window began 200 ms after the onset of the spoken word (and video) and lasted until the end of the word, which always occurred at 1000 ms. A 200 ms offset was used because the minimum latency to plan and launch a saccade is estimated to be between 150 and 180 ms in simple tasks [56–58]. As such, 200 ms after word onset is approximately the earliest point at which one expects to see looks driven by the acoustic information. The second time window corresponded to 1000–2000 ms, the one-second interval after the word was uttered.

Finally, trials with more than 33% track loss were excluded. The mean percentage of track loss was 2.7%, resulting in the exclusion of three trials.
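To make the windowing concrete, here is a minimal R sketch of the trial-level coding just described. The data frame `samples` and its columns (`subject`, `item`, `time_ms`, a 0/1 `on_target` code in which track-loss samples are already zeros, and a logical `lost` flag) are hypothetical names, not the authors’ actual scripts:

```r
library(dplyr)

# Average the 1/0 target-region coding within each analysis window for
# every subject x item trial, then drop trials with >33% track loss.
window_props <- samples %>%
  mutate(window = case_when(
    time_ms >= 200  & time_ms < 1000 ~ "during_word",  # word onset + 200 ms to offset
    time_ms >= 1000 & time_ms < 2000 ~ "after_word",   # one second after word offset
    TRUE ~ NA_character_)) %>%
  filter(!is.na(window)) %>%
  group_by(subject, item, window) %>%
  summarise(prop_target = mean(on_target),  # lost samples count as 0, as in the text
            track_loss  = mean(lost),
            .groups = "drop") %>%
  filter(track_loss <= 1/3)
```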

Results

Fig 2a presents the proportion of looks to the target referent over time. The three black vertical lines mark the boundaries of the two analysis windows. Fig 2b presents the same data binned into the two time windows, as defined in the analysis section above.

Fig 2.

A) Proportion of looks to the filler nouns (n0), one-argument (v1), two-argument (v2) and three-argument target verbs (v3) from image and word onset. B) The same data binned into the two time windows (average of subject means).

https://doi.org/10.1371/journal.pone.0188728.g002

As shown in Fig 2b, differences between the word types emerge during the second time window. There appear to be effects of verb argument number, such that one-argument verbs attract more target looks than two-argument verbs, which in turn attract more looks than three-argument verbs. Moreover, as expected, participants were better at finding the nouns than the verbs. As shown in the proportion curves in Fig 2a, these differences between conditions emerge toward the end of the first time window and reflect the speed at which participants can locate the target picture (i.e., they reflect how quickly the proportion curves reached their asymptote of approximately 0.95).

Which of these differences in the second window can be explained as arising from differences in imageability, and which can be associated with verb argument complexity? Recall from the stimuli section that the three verb subsets (1, 2 vs. 3 arguments) did not differ from one another on the imageability/label-appropriateness dimensions, yet target looking times do differ between these verb types. This pattern in the norms suggests, albeit indirectly, that differences among verbs may reflect something about the complexity of the semantics associated with these verbs, rather than imageability/label-appropriateness. However, the filler nouns were rated as more imageable than the verbs, and were also rated as more appropriate labels for their pictures. Thus, it is possible that the faster identification of nouns over verbs is related either to the syntactic category of the labels or to the fact that the nouns were more imageable and better labels of their pictures than the verbs.

The mean label-appropriateness ratings for each word (regardless of whether it was a verb or a noun) were found to be highly correlated with each item’s mean proportion of target looks during time window 2 (see Fig 3a); people are better able to locate the target picture if the word being uttered is a highly appropriate label for that picture (imageability correlated with label appropriateness, R2 = 0.324, p<0.001, and generated similar results when related to looking times). We therefore focus our discussion on label appropriateness. One can partial out the variance associated with label appropriateness by transforming target item means into residuals, i.e., positive and negative deviations from the fitted line in Fig 3a. Fig 3b plots residualized item means by condition. As can be seen in Fig 3, nouns were no longer different from verbs as a whole, but verbs show an effect of argument number: one-argument events are located faster than predicted on the basis of label appropriateness, whereas three-argument events are located more slowly than expected. Simple transitive (two-argument) events and nouns are located just as fast as predicted by label appropriateness alone.

Fig 3.

A) Correlation between the proportion of time on target (verbs and filler nouns) and label appropriateness ratings. B) Residualized item means by condition (error bars = 1 S.E.).

https://doi.org/10.1371/journal.pone.0188728.g003
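The residualization just described amounts to regressing the item means on label appropriateness and keeping the deviations from the fitted line. A minimal R sketch, assuming a hypothetical data frame `item_means` with columns `prop_target`, `label_approp` and `condition`:

```r
# Regress item-level target looks on label appropriateness; the residuals
# are positive for items located faster than their label appropriateness
# predicts and negative for items located more slowly than predicted.
fit <- lm(prop_target ~ label_approp, data = item_means)
item_means$resid_looks <- resid(fit)

# Condition means of these residuals correspond to the pattern in Fig 3b.
aggregate(resid_looks ~ condition, data = item_means, FUN = mean)
```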

These observations are, however, based on aggregated data (i.e., item means), and as such may fail to capture relevant variation [59]. A better way to analyze these data is via multi-level mixed linear modeling of non-aggregated trial-level observations for the verb data only. In particular, the E-logit-transformed proportion of target looks for each trial was modeled using the lmer function in R, with crossed random intercepts supplied for each Subject and Item. We can enter both argument number (1, 2 vs. 3 arguments) and the label appropriateness norms (a continuous variable) as predictors, to see how much variance is accounted for by the two variables separately and simultaneously. The best fitting model is one that includes a reliable effect of argument number and label appropriateness, both as continuous variables (see Table 3). Thus, argument number and label appropriateness account for different aspects of the variance within verbs.

Table 3. Fixed effects from best fitting multi-level linear model of the proportion of target looks, E-logit transformed, time window 2 (Experiment 1).

https://doi.org/10.1371/journal.pone.0188728.t003
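A sketch of this model in lme4 is given below. The empirical-logit transform shown is the standard formula, with y target samples out of N total window samples per trial; all variable and data frame names are illustrative rather than taken from the authors’ analysis:

```r
library(lme4)

# Empirical-logit transform of each trial's proportion of target looks:
# y = target samples in the window, N = total samples in the window.
trials$elogit <- log((trials$y + 0.5) / (trials$N - trials$y + 0.5))

# Crossed random intercepts for subjects and items; argument number and
# label appropriateness both entered as continuous fixed effects.
m <- lmer(elogit ~ arg_num + label_approp + (1 | subject) + (1 | item),
          data = trials)
summary(m)
```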

In a separate analysis that included both the noun and the verb data, we entered syntactic category (nouns vs. verbs) and the label appropriateness norms (a continuous variable) as predictors, to see how much variance is accounted for by the two variables separately and simultaneously. Although a model that contained only syntactic category (noun vs. verb) showed a significant effect of this factor (beta estimate = -24.4, t(1) = -4.71, p<0.01), the best fitting model was one that used both label appropriateness and syntactic category as predictors, in which label appropriateness was the only significant predictor (see Table 4). This implies that the variance is better explained by label appropriateness than by the syntactic category of the label.

Table 4. Fixed effects from best fitting multi-level linear model of the proportion of target looks, E-logit transformed, time window 2 (Experiment 1).

https://doi.org/10.1371/journal.pone.0188728.t004
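One way to carry out this kind of model comparison is a likelihood-ratio test on nested fits; the sketch below uses the same illustrative names as above and is not necessarily the authors’ exact selection procedure:

```r
# Fit nested models by maximum likelihood (REML = FALSE) so that their
# likelihoods are comparable, then test whether adding label
# appropriateness improves on syntactic category alone.
m_cat  <- lmer(elogit ~ category + (1 | subject) + (1 | item),
               data = all_trials, REML = FALSE)
m_both <- lmer(elogit ~ category + label_approp + (1 | subject) + (1 | item),
               data = all_trials, REML = FALSE)
anova(m_cat, m_both)  # chi-square likelihood-ratio test on the added term
```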

Discussion

In this experiment, we observed a reliable linear effect of argument number above any effect of label appropriateness. One-argument (intransitive) events were faster to locate than would be expected given their label appropriateness, whereas three-argument (ditransitive) events were slower to locate than would be expected given their label appropriateness; two-argument (transitive) events fell in between and were located at the rate expected given their label appropriateness alone. Explaining this effect as being related to verb argument complexity runs into trouble, however, because intransitive verbs were also found to be processed more quickly than simple nouns (after factoring out label appropriateness), requiring one to conclude that intransitive verbs are, for some unknown reason, representationally simpler than nouns.

Moreover, we observed that nouns are faster to identify than verbs. However, this effect is carried entirely by the degree to which nouns are better labels for pictures than verbs: nouns are more imageable than verbs. This relationship has been observed before in a very different experimental setting, in which participants were asked to learn the meanings of nouns and verbs directly from visual observation of the world (see [55]). In Gillette et al. [55], although nouns were found to be learned more easily than verbs by adults (an effect also observed in infants learning their first language), the effect was attributable solely to the imageability of the words, not their syntactic status as nouns or verbs. They concluded that verbs are more difficult to learn from direct observation of the world because verbs are more likely to label aspects of the world that are difficult to see. The present finding offers support for this conclusion: it is harder to locate pictures labeled by verbs than pictures labeled by nouns because verbs are less imageable than nouns.

Experiment 1 has shown that eye movements reveal the time course of the dynamics of lexical activation, improving on the measures of lexical processing used in previous studies. As can be seen in Fig 2, the slope of each curve reflects the speed at which participants could locate the target picture, and the point at which a curve reached its asymptote marks the moment by which the vast majority of participants had decided which picture was the target. However, in the present experiment participants were not asked to indicate exactly when they had located the target (e.g., by pressing a button). Instead, they were asked to hold their gaze on the target picture. In Experiment 2, we collected button-press data as an explicit indication of the timing of participants’ decision making.

Experiment 2

In this second experiment, we sought to replicate the observed effects while making alterations to the experiment that might improve our measure of lexical processing time. We used reaction time data collected from button presses. Eye movements were also collected, to examine how this process unfolds over time.

Method

This study was approved by the Ethics Committee of the Universitat Oberta de Catalunya.

Participants

Fourteen native Spanish speakers participated in the experiment (8 females and 6 males). All participants were born in Spain and attended primary and secondary school in Spain. They were students or junior faculty at various universities in the Philadelphia area. All gave their written informed consent for participation in this study. They either had normal uncorrected vision or wore soft contact lenses or eyeglasses.

Stimuli

The stimuli were the same as those used in experiment 1.

Procedure

The procedure was the same as in experiment 1 except for minor modifications. In particular, participants were instructed to press the spacebar on the computer keyboard as soon as they found the target picture. Pressing the spacebar ended the presentation of the image and triggered the presentation of the crosshair for the next trial. Response time from image and word onset was calculated. The equipment and software also differed from experiment 1: here a Tobii 1750 eye tracker (50 Hz sampling rate) was used, and E-Prime software was used to present the stimuli and collect the data. This procedure was approved by the Human Subjects panel of the Institutional Review Board (IRB) at the University of Pennsylvania.

Results

Fig 4A presents the proportion of looks over time to the target referent, plotted sample by sample. The single solid vertical line indicates the offset of the word (around 1 second). The four dotted vertical lines indicate the mean response time for each word type. As can be seen in Fig 4A and 4B, mean response times conform to what was observed in the eye movements of experiment 1: on average, there was a systematic delay in response time as a function of number of arguments, and nouns were identified more quickly than verbs. Eye movement proportions do not converge to 1.0 because, as expected from a normal distribution of response times, approximately half of the participants had yet to locate the target by the time the mean response time had been reached.

Fig 4.

A) Eye tracking data. Proportion of looks to nouns (n0), one-argument (v1), two-argument (v2) and three-argument verbs (v3) from image and word onset. Dotted vertical lines indicate the mean response time (spacebar press) for each word type. B) Reaction time data. Mean response time for each word type to press the spacebar.

https://doi.org/10.1371/journal.pone.0188728.g004

As in experiment 1, our primary dependent measure (here, reaction time) correlated with label appropriateness (see Fig 5A). In terms of item means, words that were rated as better labels for their pictures yielded faster response times. Following the data analysis of experiment 1, we can partial out the variance due to label appropriateness by plotting reaction time in terms of residual response times. This is presented in Fig 5B, for each of the four word types.

Fig 5.

A) Correlation between response times for target verbs and filler nouns and label appropriateness ratings. B) Residualized item means by condition (error bars = 1 S.E.).

https://doi.org/10.1371/journal.pone.0188728.g005

Following the approach of experiment 1, we modeled response time using multi-level mixed linear modeling of non-aggregated trial-level observations. Response time (in ms) for each trial was modeled using the lmer function in R, with crossed random intercepts supplied for each subject and item. Argument number (1, 2 vs. 3 arguments) and the label appropriateness norms (a continuous variable) were used as predictors. The best fitting model included reliable effects of argument number and label appropriateness, both as continuous variables (see Table 5). Thus, as in experiment 1, argument number and label appropriateness accounted for different aspects of the variance within verbs.

Table 5. Fixed effects from best fitting multi-level linear model of the response time in milliseconds (Experiment 2).

https://doi.org/10.1371/journal.pone.0188728.t005

Moreover, as in experiment 1, we also used syntactic category (nouns vs. verbs) and the label appropriateness norms (a continuous variable) as predictors. Although a model that contained only syntactic category (noun vs. verb) showed a significant effect of this factor (B = -24.4, t(1) = -4.71, p<0.01), the best fitting model was one that also used label appropriateness as a predictor, leaving no reliable effect of syntactic category (see Table 6). This implies that the variance is better explained by label appropriateness than by the syntactic category of the label.

Table 6. Fixed effects from best fitting multi-level linear model of the response time in milliseconds (Experiment 2).

https://doi.org/10.1371/journal.pone.0188728.t006

Discussion

The eye movement effects observed in experiment 1 were also observed here using a different dependent measure: response time. There was a reliable linear effect of argument number above any effect of label appropriateness, just as in experiment 1. One-argument (intransitive) events were faster to locate than would be expected given their label appropriateness, whereas three-argument (ditransitive) events were slower to locate than would be expected given their label appropriateness; two-argument (transitive) events fell in between and were located at the rate expected given their label appropriateness. Moreover, as expected, nouns were faster to identify than verbs, and this effect was carried by the degree to which nouns were better labels for pictures than verbs.

However, one limitation of both experiments is that we did not control the complexity of the images. Visual complexity contributes to difficulty in picture decoding [61] and can therefore affect the time to recognition. Moreover, there are other variables that can affect auditory word recognition that we did not control, such as oral lexical frequency and phonological neighborhood.

Experiment 3

Based on the limitations identified in the discussion of experiment 2, we ran a third experiment in which we controlled more variables in the stimuli selection. We controlled visual complexity and other variables that affect auditory word recognition (phoneme length, oral lexical frequency, phonological neighborhood, etc.). Moreover, we extended the sample and the number of verb stimuli. In experiments 1 and 2, we had only six verbs of each type (one-, two- and three-argument). Here we selected eighteen of each verb type and increased the number of participants in order to obtain more robust results.

Method

This study was approved by the Ethics Committee of the Universitat Oberta de Catalunya.

Participants

Ninety-five participants took part in the experiment. All participants were born in Spain and attended primary and secondary school in Spain. They were all native Spanish speakers studying for a degree in Psychology at the University of Barcelona, with normal or corrected-to-normal vision. They participated in the experiment in exchange for course credit. All the participants gave their written informed consent for participation in this study.

Stimuli

The stimuli included 54 verbs: 18 one-argument, 18 two-argument and 18 three-argument verbs (see S1 Appendix). A list of 18 nouns was selected as fillers. Argument structure was determined by a search in a syntactic database of Spanish usage [62]. The three subsets of verbs, plus the nouns, were matched on letter, phoneme and syllable length, as well as phonological neighborhood [63] and oral [64] and written [52] lexical frequencies. Stimulus words were also controlled for imageability [53] and rated age of acquisition [53]. When values of either of these two variables were missing in the databases for any of the stimuli, specific surveys were carried out following the same guidelines used in the original studies. Groups of 25 raters, different from the volunteers participating in the experiments, answered the surveys.

Each word stimulus was paired with a picture showing the intended action or object. As in the previous experiments, name-image appropriateness data were gathered using the same procedure. In this experiment, however, a name agreement survey was also included: a group of 20 volunteers were asked to name the experimental pictures, and percentages of agreement were then calculated. The three verb subsets were matched on both label appropriateness and name agreement (see Table 7).

Table 7. Summary of stimuli characteristics for experiment 3, with SDs and ranges in parentheses.

https://doi.org/10.1371/journal.pone.0188728.t007

Following Rodríguez-Ferreiro et al. [65], values for the visual complexity of the pictures were obtained using the JPEG compression method described in Bates et al. [66]. All the images were compressed in the Joint Photographic Experts Group (JPEG) format, and the size of the digitized picture file was used as the measure of visual complexity. This method provides an objective measure of how complex a picture is, avoiding the confounds with other variables that arise when subjective ratings are used [67, 68]. For example, Székely and Bates [68] have shown that subjective ratings of complexity are confounded with subjective judgments of familiarity. Intransitive verbs and filler nouns tended to be associated with less visually complex scenarios. In order to eliminate significant differences between these and the transitive and ditransitive items, background textures were added to some of their corresponding pictures until equal visual complexity values were obtained across all the stimulus sets.
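A rough R sketch of this objective complexity measure, assuming the `jpeg` package and a hypothetical vector `picture_files` of image paths (Bates et al. [66] describe the original procedure):

```r
library(jpeg)

# Re-encode every picture as a JPEG at a fixed quality setting and take
# the compressed file size (bytes) as its visual complexity score:
# images with more detail compress less well and so yield larger files.
complexity <- sapply(picture_files, function(f) {
  img <- readJPEG(f)                           # decode to a raster array
  tmp <- tempfile(fileext = ".jpg")
  writeJPEG(img, target = tmp, quality = 0.9)  # constant quality for comparability
  file.size(tmp)
})
```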

Although the distracter pictures were never referred to by name during the experiment, we made sure that what we considered to be the common names for these distracter pictures were similar in written Spanish frequency to the target names, using the LEXESP corpus [52]. In addition, the onset phoneme of each target word always differed from the onset phonemes of the common names for the three distracters, so as to avoid auditory cohort competitor effects (see [5]).

A presentation list was created such that each verb target picture was paired with three other pictures, resulting in a set of four images that always included two object images and two event images. Targets thus had one event competitor and two object distracters. Second and third lists were generated from the first by changing the competitor event images, so that across lists each target verb appeared with a competitor of each verb type (one-, two- and three-argument).

The visual and auditory stimuli were created by the same procedure as in experiments 1 and 2.

Procedure

The procedure was the same as in experiment 2, except that the equipment and software were the same as in experiment 1.

Results

Fig 6A presents the proportion of looks over time to the target referent, plotted sample by sample. The dotted vertical lines indicate the mean response time for each word type. As can be seen in Fig 6A and 6B, mean response times differ somewhat from the previous experiments. On average, response time increased as a function of the number of arguments.

Fig 6.

A) Eye tracking data. Proportion of looks to nouns (n0), one-argument (v1), two-argument (v2) and three-argument verbs (v3) from image and word onset. Dotted vertical lines indicate the mean response time (spacebar press) for each word type. B) Reaction time data. Mean response time for each word type to press the spacebar.

https://doi.org/10.1371/journal.pone.0188728.g006

Following the analysis procedure used in experiment 2, we used multi-level mixed linear modeling of non-aggregated trial-level observations to model participants’ reaction times. Latency data were first log (base 10) transformed to prevent spurious influence of the marked skew associated with chronometric data [69]. We report Markov chain Monte Carlo (MCMC)-derived p values for effects, following Baayen [70]. A first model including verb type as the independent variable showed an effect of the continuous variable argument number (t(1) = 3.21, pMCMC<.001). Post-hoc Tukey HSD contrasts showed significant differences in the pairwise comparisons between transitive and intransitive verbs, transitive and ditransitive verbs, and intransitive and ditransitive verbs (all ps<.001). The effect of argument number remained significant (t(1) = 2.91, pMCMC<.001) when we tested a model that also included label appropriateness (t(1) = -3.25, pMCMC<.001).

Moreover, using syntactic category (noun vs. verb) as the independent variable, we found a significant effect of this factor (t(1) = 1.72, pMCMC = 0.02). However, this effect disappeared (t(1) = 0.10, pMCMC = 0.87) when we introduced the variable label appropriateness, which was significant (t(1) = -3.32, pMCMC<.001).
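A sketch of this pipeline is given below, assuming the languageR package, whose pvals.fnc() supplied MCMC-derived p values for lmer fits in the lme4 versions current at the time [70]; data frame and variable names are again illustrative:

```r
library(lme4)
library(languageR)

# Log10-transform latencies to tame the right skew typical of RT data.
trials$log_rt <- log10(trials$rt)

m <- lmer(log_rt ~ arg_num + label_approp + (1 | subject) + (1 | item),
          data = trials)

# Markov chain Monte Carlo sampling of the posterior yields the pMCMC
# values reported above.
pvals.fnc(m, nsim = 10000)
```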

Discussion

In this experiment we controlled additional variables beyond those controlled in the previous experiments (number of letters, phonemes and syllables; phonological neighborhood; written frequency; oral frequency; age of acquisition; imageability; and visual complexity). We increased the number of stimuli (54 verbs: 18 one-argument, 18 two-argument and 18 three-argument, plus 18 filler nouns) and the number of participants (ninety-five adults). The response time effects observed in experiment 2 were also observed here: there was a reliable linear effect of argument number above any effect of label appropriateness. Nouns were faster to identify than verbs, and this effect was carried entirely by the degree to which nouns are better labels for pictures than verbs.

In this experiment we controlled visual complexity. It should be noted, however, that in this and the previous experiments the number of people depicted in each picture varied systematically with verb type. It is therefore possible that, regardless of the visual complexity of the scenes, listeners take longer to infer the relationships between two event participants and an object (ditransitive verbs) than when a single event participant is acting alone. Including bystanders in the scenes might help to alleviate this concern in future work.

General discussion

In three experiments of picture identification in response to a spoken word, it was observed that the number of arguments a verb takes impacted negatively on target identification times: the greater the number of arguments, the slower the response. The timing of eye movements to the target image also supported this conclusion.

Previous work has found inconsistent results relating lexical processing time to verb argument complexity. On the one hand, Shapiro and collaborators [35–37] showed that verbs’ representational complexity (syntactic subcategorization and argument structure) affects real-time sentence processing. On the other hand, Schmauder and collaborators [38–39] did not find this effect in cross-modal lexical decision and monosyllabic secondary lexical decision tasks. Our results replicate the finding of Shapiro and collaborators [35–37], but using the visual world paradigm. This paradigm allowed us to analyse the recognition time that participants need after listening to an isolated auditory verb. Our results cannot fully explain the inconsistent findings in the past literature on this topic. It is notable, though, that much of the past work used subjects’ performance in an unrelated secondary task as a measure of processing difficulty with the primary task of language comprehension. It is possible that such indirect measures, especially in a dual-task paradigm, produce inconsistent results. Our own work used what is arguably a more direct measure of processing difficulty (response time), embedded within a relatively natural task in which subjects attempt to link speech to a co-present referent world. This paradigm may make it easier to identify consistent effects of verb argument complexity.

As we see it, there are two explanations of our verb-argument findings. One possibility is that differences in sheer representational complexity between these three classes of verbs explain the effect on response time. Previous studies have proposed this hypothesis (e.g., [33, 34, 44]). However, the results from our noun items draw this conclusion into question. Although we found that nouns are recognized more quickly than verbs, this difference was attributable to imageability and the degree to which the word was a good label for the target picture. After removing imageability factors, nouns were no easier to process than verbs, despite the fact that (most) verbs have more complex representations. This finding suggests that the hypothesized greater complexity of the information associated with verbs as compared to nouns does not negatively impact lexical processing, in line with processing theories that do not link the sheer amount of information activated to the amount of processing time needed (e.g., [40–42]).

Another possibility is that these differences between verbs are related to how much the information generated by the lexical item aids the task given to the participant: target image identification. For example, each argument of a verb has a range of possible participants associated with it. “Sleeping” (an intransitive event) can be done by a human or by an animal, such as a dog or a cat. Likewise, “licking” (a transitive event) can be done by a human or an animal, but now what is licked can also vary (a lollipop, a popsicle, etc.). Ditransitive events, such as “throwing”, have the same uncertainties as transitive events, but in addition the recipient is uncertain (for instance, one can throw something to a human or to an animal). If listeners in this task compute a mental image based on the semantics activated by the lexical item and then seek out an actual image that matches that mental image (an assumption that has some experimental support; see Dahan and Tanenhaus [71]), then one would expect the increases observed here within verbs: each additional argument increases the chances of image mismatches. However, like the first account, this account would need to explain why nouns do not show a significant advantage over verbs.

Future research is needed to analyse in greater detail whether the recognition time differences between the verb types are due to verb argument complexity or to the knowledge that is activated to accomplish the task. The present work therefore offers an additional step toward understanding the role of representational complexity in language comprehension. We believe that direct comparison of methodologies and tasks will likely offer further illumination of this issue.

Supporting information

S1 Appendix. List of verbs used as stimuli for experiments 1, 2 and 3.

https://doi.org/10.1371/journal.pone.0188728.s001

(DOCX)

S2 Appendix. Eye tracking data from the 3 experiments.

https://doi.org/10.1371/journal.pone.0188728.s002

(XLSX)

References

  1. Carlson G, Tanenhaus M. Thematic Roles and Language Comprehension. In: Wilkins W, editor. Syntax and Semantics, Vol. 21: Thematic Relations. New York: Academic Press; 1988. pp. 263–289.
  2. Marslen-Wilson WD. Morphological processes in language comprehension. In: Gaskell G, editor. Oxford Handbook of Psycholinguistics. Oxford: Oxford University Press; 2007. pp. 175–193.
  3. Moss HE, Tyler LK, Taylor KI. Conceptual Structure. In: Gaskell G, editor. Oxford Handbook of Psycholinguistics. Oxford: Oxford University Press; 2007. pp. 217–234.
  4. Seidenberg MS. Connectionist models of reading. In: Gaskell G, editor. Oxford Handbook of Psycholinguistics. Oxford: Oxford University Press; 2007. pp. 235–250.
  5. Allopenna PD, Magnuson JS, Tanenhaus MK. Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language. 1998; 38: 419–439.
  6. Altmann GTM, Kamide Y. Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition. 1999; 73: 247–264. pmid:10585516
  7. Melinger A, Koenig JP. Part-of-speech persistence: Part-of-speech category information as an organizing principle in the mental lexicon. Journal of Memory and Language. 2007; 56: 472–489.
  8. Trueswell JC, Kim AE. How to prune a garden-path by nipping it in the bud: Fast-priming of verb argument structures. Journal of Memory and Language. 1998; 39: 102–123.
  9. Novick JM, Kim A, Trueswell JC. Studying the grammatical aspects of word recognition: Lexical priming, parsing and syntactic ambiguity resolution. Journal of Psycholinguistic Research. 2003; 32(1): 57–75. pmid:12647563
  10. Assadollahi R, Rockstroh BS. Representation of the verb's argument structure in the human brain. BMC Neuroscience. 2008; 21: 1–9.
  11. Atchley RA, Rice ML, Betz SK, Kwasney KM, Sereno JA, Jongman A. A comparison of semantic and syntactic event related potentials generated by children and adults. Brain and Language. 2006; 99: 236–246. pmid:16226804
  12. Balconi M, Pozzoli U. Comprehending semantic and grammatical violations in Italian. N400 and P600 comparison with visual and auditory stimuli. Journal of Psycholinguistic Research. 2005; 34: 71–98. pmid:15968921
  13. Dobel C, Pulvermüller F, Härle M, Cohen R, Köbbel P, Schönle PW, Rockstroh B. Syntactic and semantic processing in the healthy and aphasic human brain. Experimental Brain Research. 2001; 140: 77–85. pmid:11500800
  14. Bedny M, Caramazza A, Grossman E, Pascual-Leone A, Saxe R. Concepts are more than percepts: the case of action verbs. Journal of Neuroscience. 2008; 28(44): 11347–11353. pmid:18971476
  15. Bedny M, Thompson-Schill SL. Neuroanatomically separable effects of imageability and grammatical class during single word comprehension. Brain and Language. 2006; 98: 127–139. pmid:16716387
  16. Bastiaanse R, Jonkers R. Verb retrieval in action naming and spontaneous speech in agrammatic and anomic aphasia. Aphasiology. 1998; 12: 951–969.
  17. Berndt RS, Burton MW, Haendiges AN, Mitchum CC. Production of nouns and verbs in aphasia: effects of elicitation context. Aphasiology. 2002; 16: 83–106.
  18. Breedin SD, Martin R. Patterns of verb impairment in aphasia: an analysis of four cases. Cognitive Neuropsychology. 1996; 13: 51–91. pmid:28532314
  19. Breedin SD, Saffran EM, Schwartz MF. Semantic factors in verb retrieval: An effect of complexity. Brain and Language. 1998; 63(1): 1–31. pmid:9642018
  20. Caramazza A, Hillis A. Lexical organisation of nouns and verbs in the brain. Nature. 1991; 349: 788–790. pmid:2000148
  21. De Bleser R, Kauschke C. Acquisition and loss of nouns and verbs: parallel or divergent patterns? Journal of Neurolinguistics. 2003; 16: 213–229.
  22. Hillis AE, Caramazza A. The representation of grammatical categories of words in the brain. Journal of Cognitive Neuroscience. 1995; 7: 396–407. pmid:23961868
  23. Jonkers R, Bastiaanse R. The influence of instrumentality and transitivity on action naming in Broca's and anomic aphasia. Brain and Language. 1996; 55: 33–39.
  24. Jonkers R, Bastiaanse R. How selective are selective word class deficits? Two case studies of action and object naming. Aphasiology. 1998; 12: 193–206.
  25. Kim M, Thompson CK. Patterns of comprehension and production of nouns and verbs in agrammatism: implications for lexical organization. Brain and Language. 2000; 74: 1–25. pmid:10924214
  26. Kohn S, Lorch M, Pearson D. Verb finding in aphasia. Cortex. 1989; 25: 57–69. pmid:2707005
  27. Shapiro K, Caramazza A. Grammatical processing of nouns and verbs in left frontal cortex? Neuropsychologia. 2003a; 41: 1189–1198.
  28. Shapiro K, Caramazza A. Looming a loom: evidence for independent access to grammatical and phonological properties in verb retrieval. Journal of Neurolinguistics. 2003b; 16: 85–111.
  29. Williams SE, Canter GJ. Action naming performance in four syndromes of aphasia. Brain and Language. 1987; 32: 124–136. pmid:3651804
  30. Inhoff AW. The effect of factivity on lexical retrieval and postlexical processes during eye fixations in reading. Journal of Psycholinguistic Research. 1985; 14(1): 45–56. pmid:3973836
  31. Rayner K, Duffy SA. Lexical complexity and fixation times in reading: Effects of word frequency, verb complexity, and lexical ambiguity. Memory and Cognition. 1986; 14: 191–201. pmid:3736392
  32. Schmauder AR. Argument structure frames: A lexical complexity metric? Journal of Experimental Psychology: Learning, Memory, and Cognition. 1991; 17(1): 49–65. pmid:1826732
  33. Gennari SP, Poeppel D. Processing correlates of lexical semantic complexity. Cognition. 2003; 89(1): B27–B41. pmid:12893127
  34. McElree B, Traxler M, Pickering M, Seely R, Jackendoff R. Reading time evidence for enriched composition. Cognition. 2001; 78: 17–25.
  35. Shapiro LP, Levine BA. Verb processing during sentence comprehension in aphasia. Brain and Language. 1990; 38: 21–47. pmid:2302544
  36. Shapiro LP, Zurif E, Grimshaw J. Sentence processing and the mental representation of verbs. Cognition. 1987; 27: 219–246. pmid:3691026
  37. Shapiro LP, Zurif EB, Grimshaw J. Verb processing during sentence comprehension: Contextual impenetrability. Journal of Psycholinguistic Research. 1989; 18(2): 223–243. pmid:2738859
  38. Schmauder AR. Argument structure frames: A lexical complexity metric? Journal of Experimental Psychology: Learning, Memory, and Cognition. 1991; 17: 49–65. pmid:1826732
  39. Schmauder AR, Kennison SM, Clifton C. On the conditions necessary for obtaining argument structure complexity effects. Journal of Experimental Psychology: Learning, Memory, and Cognition. 1991; 17: 1188–1192. pmid:1838389
  40. McRae K, Spivey-Knowlton MJ, Tanenhaus MK. Modeling the influence of thematic fit (and other constraints) in on-line sentence comprehension. Journal of Memory and Language. 1998; 38: 283–312.
  41. Seidenberg MS, McClelland JL. A distributed, developmental model of visual word recognition and naming. Psychological Review. 1989; 96: 523–568. pmid:2798649
  42. Seidenberg MS, Petersen A, MacDonald MC, Plaut DC. Pseudohomophone effects and models of word recognition. Journal of Experimental Psychology: Learning, Memory and Cognition. 1996; 22: 48–62.
  43. Gennari SP, MacDonald MC. Linking production and comprehension processes: The case of relative clauses. Cognition. 2009; 111: 1–23. pmid:19215912
  44. Kauschke C, Stenneken P. Differences in noun and verb processing in lexical decision cannot be attributed to word form and morphological complexity alone. Journal of Psycholinguistic Research. 2008; 37: 443–452. pmid:18452060
  45. Jackendoff R. Foundations of language: brain, meaning, grammar, evolution. Oxford, UK: Oxford University Press; 2002.
  46. Cooper RM. The control of eye fixation by the meaning of spoken language: A new methodology for the real-time investigation of speech perception, memory and language processing. Cognitive Psychology. 1974; 6: 84–107.
  47. Tanenhaus MK, Spivey-Knowlton MJ, Eberhard KM, Sedivy JC. Integration of visual and linguistic information in spoken language comprehension. Science. 1995; 268: 1632–1634. pmid:7777863
  48. Dahan D, Magnuson JS, Tanenhaus MK. Time course of frequency effects in spoken-word recognition: Evidence from eye movements. Cognitive Psychology. 2001; 42: 317–367. pmid:11368527
  49. Magnuson JS, Tanenhaus MK, Aslin RN, Dahan D. The time course of spoken word learning and recognition: studies with artificial lexicons. Journal of Experimental Psychology: General. 2003; 132: 202–227.
  50. Dahan D, Magnuson JS, Tanenhaus MK, Hogan EM. Subcategorical mismatches and the time course of lexical access: Evidence for lexical competition. Language and Cognitive Processes. 2001; 16: 507–534.
  51. Andreu L, Sanz-Torrent M, Guàrdia-Olmos J. Effect of verb argument structure on picture naming in children with and without specific language impairment (SLI). International Journal of Language & Communication Disorders. 2012; 47(6): 637–653.
  52. Sebastián N, Martí MA, Carreiras M, Cuetos F. LEXESP. Léxico informatizado del español. Barcelona: Edicions Universitat de Barcelona; 2000.
  53. Valle-Arroyo F. Normas de Imaginabilidad. Oviedo (Spain): Universidad de Oviedo, Servicio de Publicaciones; 1999.
  54. MacWhinney B. The CHILDES project: tools for analyzing talk. Hillsdale, NJ: Lawrence Erlbaum; 2000.
  55. Gillette J, Gleitman L, Gleitman H, Lederer A. Human simulation of vocabulary learning. Cognition. 1999; 73(2): 135–176. pmid:10580161
  56. Fischer B. Saccadic reaction time: Implications for reading, dyslexia and visual cognition. In: Rayner K, editor. Eye Movements and Visual Cognition: Scene Perception and Reading. New York: Springer-Verlag; 1992. pp. 31–45.
  57. Martin E, Shao KC, Boff KR. Saccadic overhead: Information processing time with and without saccades. Perception & Psychophysics. 1993; 53(4): 372–380.
  58. Saslow MG. Latency for saccadic eye movement. Journal of the Optical Society of America. 1967; 57: 1030–1033. pmid:6035297
  59. Gelman A, Hill J. Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press; 2007.
  60. Steiger JH, Shapiro A, Browne MW. On the multivariate asymptotic distribution of sequential chi-square statistics. Psychometrika. 1985; 50: 253–263.
  61. Johnson CJ, Paivio A, Clark JM. Cognitive components of picture naming. Psychological Bulletin. 1996; 120: 113–139. pmid:8711012
  62. Grupo de Sintaxis del Español. Base de datos sintácticos del español actual. Santiago de Compostela: Universidade de Santiago de Compostela; 2012.
  63. Duchon A, Perea M, Sebastián-Gallés N, Martí A, Carreiras M. EsPal: One-stop shopping for Spanish word properties. Behavior Research Methods. 2013; 45: 1246–1258. pmid:23468181
  64. Cuetos F, González-Nosti M, Barbón A, Brysbaert M. SUBTLEX-ESP: Spanish word frequencies based on film subtitles. Psicológica. 2011; 32: 133–143.
  65. Rodríguez-Ferreiro J, Davies RAI, González-Nosti M, Barbón A, Cuetos F. Name agreement, frequency and age of acquisition, but not grammatical class, affect object and action naming in Spanish speaking participants with Alzheimer's disease. Journal of Neurolinguistics. 2009; 22(1): 37–54.
  66. Bates E, D'Amico S, Jacobsen T, Székely A, Andonova E, Devescovi A, et al. Timed picture naming in seven languages. Psychonomic Bulletin and Review. 2003; 10(2): 344–380. pmid:12921412
  67. Forsythe A, Mulhern G, Sawey M. Confounds in pictorial sets: the role of complexity and familiarity in basic-level picture processing. Behavior Research Methods. 2008; 40(1): 116–129. pmid:18411534
  68. Székely A, Bates E. Objective visual complexity as a variable in studies of picture naming. Center for Research in Language Newsletter (Vol. 12, No. 2). La Jolla: University of California, San Diego; 2000.
  69. Baayen RH, Feldman LB, Schreuder R. Morphological influences on the recognition of monosyllabic monomorphemic words. Journal of Memory and Language. 2006; 55: 290–313.
  70. Baayen RH. Analyzing linguistic data: A practical introduction to statistics. Cambridge: Cambridge University Press; 2008.
  71. Dahan D, Tanenhaus MK. Looking at the rope when looking for the snake: Conceptually mediated eye movements during spoken-word recognition. Psychonomic Bulletin & Review. 2005; 12: 453–459.