
How many categories are there in crossmodal correspondences? A study based on exploratory factor analysis

Abstract

Humans naturally associate stimulus features of one sensory modality with those of other modalities, such as associating bright light with high-pitched tones. This phenomenon, called crossmodal correspondence, is found between various stimulus features and has been suggested to fall into several categories. However, it is not yet clear whether the underlying mechanisms differ between the different kinds of correspondences. This study used exploratory factor analysis to address this question. Through an online experiment platform, we asked Japanese adult participants (Experiment 1: N = 178; Experiment 2: N = 160) to rate the degree of correspondence between two auditory and five visual features. The results of the two experiments revealed that two factors underlie subjective judgments of audiovisual crossmodal correspondences: one factor was composed of correspondences whose auditory and visual features can be expressed in common Japanese terms, such as the loudness–size and pitch–vertical position correspondences, and the other was composed of correspondences whose features have no linguistic similarities, such as the pitch–brightness and pitch–shape correspondences. These results confirm that there are at least two types of crossmodal correspondences, which likely differ in terms of language mediation.

Introduction

People receive a variety of sensory information simultaneously in the natural environment, such as hearing a person’s voice while observing his or her mouth movements during a conversation. This kind of multisensory information is processed independently through specialized sensory systems and then integrated to form a coherent percept [1]. Specific features of a unimodal stimulus can be associated with different features of other sensory modalities; these associations are referred to as crossmodal correspondences (for a review, see Spence [2]). For example, bright and dark appearances are often associated with high- and low-pitched sounds, respectively. Unlike synesthesia, crossmodal correspondences are a common phenomenon experienced by many people and have been observed between stimulus features across various sensory modalities, including touch, smell, and taste [3–5]. In the present study, we focused on crossmodal correspondences between auditory and visual stimulus features, which have been studied extensively.

Crossmodal correspondence has been demonstrated using various tasks. Mudd [6] investigated the extent to which four sound dimensions (frequency, intensity, duration, and direction) were associated with horizontal/vertical spatial positions by asking participants to plug a peg into a position on a panel that they thought the sound stimulus represented. The study revealed that higher frequencies and louder sounds tend to be plotted higher on the vertical axis, which was later interpreted as a correspondence between pitch/loudness and spatial position. Marks [7] examined the correspondence between sound and visual brightness by asking participants to manipulate the pitch/loudness of pure tones until they felt that the tone matched a gray square of varying luminance. The participants matched the higher-pitched and louder tones to a brighter square. This finding is supported by a later study by Marks, Hammel, and Bornstein [8], who used a two-alternative forced-choice procedure and revealed that almost all participants over the age of four years match a high-pitched tone to bright light and a low-pitched tone to dim light. In addition to these direct-matching tasks, the speeded classification paradigm has also been widely used in recent years to examine crossmodal correspondences (see [2]). In this paradigm, participants are asked to judge a feature of a target stimulus as quickly as possible while ignoring an irrelevant stimulus presented simultaneously. Reaction time is shorter when the feature of the irrelevant stimulus matches that of the target stimulus in terms of crossmodal correspondences than when they are inconsistent [7, 9–11]. These findings suggest that crossmodal correspondences, to some extent, operate automatically.

Crossmodal correspondences have also been studied in the context of child development and cross-linguistic comparison. Evidence has suggested that preverbal infants show the same crossmodal correspondences as adults [12–15]. Although these results imply that crossmodal correspondences may not be acquired through learning or experience, Haryu and Kajikawa [16] reported that 10-month-old children do not show the pitch–size correspondence, although they can associate pitch with brightness. Thus, not all kinds of correspondences are present in early development. It has also been suggested that the strength of the space–pitch association may be susceptible to language use [17–19]. For example, adult Dutch speakers show a stronger height–pitch association than a thickness–pitch association, whereas adult Turkish speakers, whose language uses a thickness metaphor for pitch, show the opposite tendency [17]. These findings indicate that language plays an important role in some kinds of crossmodal correspondences.

Spence [2] summarized a variety of crossmodal correspondences between visual and auditory features and their influence on information processing and categorized them into three principal types: structural, statistical, and semantic correspondences. Structural correspondences arise from neural connections between sensory systems or from common processing systems or mechanisms. The correspondences that occur between related features in the magnitude domain may be included in this type—they are represented by a generalized system in the brain [20]. Statistical correspondences are based on statistical regularities or co-occurrences between stimulus features in the environment [21]. For example, larger objects tend to produce lower-pitch sounds than smaller ones [22, 23], and such natural correlations may be learned to form the correspondence between pitch and size. Semantic correspondences are also acquired through learning but are primarily mediated by language (see also [24]). Typically, the features described with the same adjectives, such as “high” and “low,” may be associated through their linguistic consistency.

These distinctions are important for understanding the mechanisms of audiovisual crossmodal correspondences. However, some limitations remain. First, the three types of crossmodal correspondences are not necessarily exclusive, and some correspondences may belong to more than one type. This makes it difficult to determine which type a given pair of auditory and visual features falls under. For example, the correspondence between pitch and elevation could be explained by the internalization of natural statistics, by the use of the same words, or by both [2]. Second, the categorization is based on differences in how each crossmodal correspondence can arise, with little consideration of the direct relations between them. Thus, the distinction may change if a new possible mechanism is found. Indeed, it has been suggested that some correspondences between complex stimuli, such as music and color, are mediated by emotion [25, 26], which has been referred to as a fourth mechanism for audiovisual correspondences [27]. These limitations may be due, at least in part, to the fact that each kind of crossmodal correspondence has been examined separately, leaving their commonality or consistency ambiguous.

We addressed these questions by using exploratory factor analysis to examine the underlying structure of subjective ratings for different kinds of crossmodal correspondences. Factor analysis is a statistical method used to explore the underlying structure of a set of variables and to identify the latent factors that explain the patterns of correlation or variation among them. Although crossmodal correspondences are a common phenomenon that many people experience, previous studies have reported differences across cultures and individuals in the patterns or strengths of correspondences [17, 19, 28, 29]. We used factor analysis to identify the underlying mechanisms of crossmodal correspondences based on such variation. The present study included some correspondences that were not listed in Spence [2] or were not classified as a single type in his list. Therefore, we used exploratory rather than confirmatory factor analysis to explore the latent factors. Specifically, we focused on correspondences between basic stimulus dimensions in the auditory and visual domains, which have been well demonstrated in the literature, to test whether there is a clear distinction between crossmodal correspondences, as suggested by Spence [2]. In Experiment 1, using a pair of visual stimuli and one auditory stimulus, we asked participants to rate which visual stimulus matched the auditory stimulus better. Experiment 2 presented a pair of auditory stimuli and one visual stimulus to the participants, who rated which auditory stimulus matched the visual stimulus better.

Experiment 1

Methods

Participants.

We used the Yahoo! JAPAN Crowdsourcing website to recruit participants. Although there are no strict rules regarding the appropriate sample size in exploratory factor analysis, it is generally suggested that a participant-to-variable ratio of 5:1 to 10:1 is required [e.g., 30]. Thus, we recruited 200 participants, allowing for the possibility of excluding some samples and items during the analysis process. Participants were not screened at recruitment, but check questions were asked after the main task to ensure that they had performed the task correctly. In total, 178 participants (140 male; mean age = 49.0 ± 10.6 years) completed the online experiment. An additional 21 participants completed the experiment but were excluded from the analysis because they did not click the play buttons to listen to the auditory stimuli in the previewing phases (N = 13), could not distinguish the auditory features of the stimuli in the check questions (N = 5), or both (N = 3). Participants received 30 Japanese yen worth of shopping points for completing the experiment.

Ethics statement.

The experiment was conducted in April 2021, with special care taken to ensure the anonymity of the data. The authors did not have access to any information that could identify individual participants during or after data collection, either at the recruitment site or at the online experimental platform. Informed consent was obtained carefully. In particular, the first screen of the experiment provided an overview of the experiment and the anonymity of the data and stated that no information that could identify individuals would be included in the data analysis and publication. Only participants who checked the box at the bottom of the screen indicating that they fully understood the experiment and agreed to participate in the study proceeded to the experiment. Our study was approved by the ethical committee of the Faculty of Human-Environment Studies, Kyushu University (No. 2020–027).

Apparatus and stimuli.

We used the Gorilla Experiment Builder (www.gorilla.sc; [31]) to create and host the experiment. We asked the participants to use their own computers and web browsers to access the platform and perform the online experiment. Images and tones were presented as stimuli during the experimental task. Participants were instructed to use earphones or headphones to listen to the auditory stimuli. No instructions regarding the viewing environment were given. When the size of a participant’s browser window was smaller than the image resolution, the image was maximized within the window while maintaining the aspect ratio.

The auditory stimuli were two pairs of pure tones, one with different loudness levels (loud vs. soft) and the other with different pitch levels (high vs. low pitch). These stimuli were created by modulating the amplitude or frequency of a 625-Hz standard tone, which was used for volume control at the beginning of the experiment. Specifically, the amplitudes of the loud and soft stimuli were 1.5 and 0.5 times as great as that of the standard tone, respectively. The frequencies of the high-pitched and low-pitched stimuli were 750 Hz and 500 Hz, respectively. These frequencies were chosen for their ease of discrimination and low level of discomfort. The duration of each tone was 1,000 ms, including 20 ms linear ramps at on- and off-set. The participants played the auditory stimuli by clicking the play button presented on their screens.
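The tone construction described above can be sketched in code. The paper does not report its synthesis code or sample rate, so the Python sketch below assumes a 44.1-kHz sample rate; the function name and the use of the standard tone's amplitude as 1.0 are illustrative assumptions:

```python
import math

def pure_tone(freq_hz, amp, dur_ms=1000, ramp_ms=20, sr=44100):
    """Generate a pure tone with linear onset/offset ramps as a list of samples.

    sr = 44100 is an assumed sample rate; the paper does not report one.
    """
    n = int(sr * dur_ms / 1000)        # 1,000 ms duration
    n_ramp = int(sr * ramp_ms / 1000)  # 20 ms linear ramps
    samples = []
    for i in range(n):
        s = amp * math.sin(2 * math.pi * freq_hz * i / sr)
        if i < n_ramp:                 # onset ramp
            s *= i / n_ramp
        elif i >= n - n_ramp:          # offset ramp
            s *= (n - 1 - i) / n_ramp
        samples.append(s)
    return samples

# The 625-Hz standard tone (amplitude 1.0) and the four stimuli derived from it:
standard = pure_tone(625, 1.0)
loud, soft = pure_tone(625, 1.5), pure_tone(625, 0.5)
high_pitch, low_pitch = pure_tone(750, 1.0), pure_tone(500, 1.0)
```

Note that an amplitude ratio of 1.5 relative to the standard corresponds to a level difference of roughly 3.5 dB.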

The visual stimuli were five pairs of objects, each pair of which differed in terms of brightness (bright vs. dark), vertical position (high vs. low position), size (large vs. small), shape (rounded vs. angular), or spatial frequency (high SF vs. low SF). Each stimulus pair was placed side by side on a white background to form a single image (881 pix × 493 pix). The stimuli were circular in shape except for the rounded and angular stimuli. The diameters were 240 pix for the large stimulus, 120 pix for the small stimulus, and 180 pix for the other stimuli. The high and low SF stimuli consisted of sinusoidal gratings with spatial frequencies of 0.1 cycle/pix and 0.033 cycle/pix, respectively, and were oriented 45° to the left. The other stimuli were uniform in luminance: white (with black contour) for the bright stimulus, black for the dark stimulus, and gray for all other stimuli. Each stimulus pair was aligned horizontally (470 pix from center to center), except for the pair with different positions, in which the stimuli were placed diagonally with a vertical distance of 180 pix. Following the findings that crossmodal correspondences are based on the relative difference between the stimuli [32, 33], these values of the stimulus features were chosen to ensure the relative differences within the pairs for both the visual and auditory stimuli.
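For the grating stimuli, the luminance at each pixel follows a sinusoid along a modulation axis tilted 45°. A minimal sketch of this construction (the 0–1 luminance scale and function name are assumptions for illustration, not the authors' stimulus code):

```python
import math

def grating_luminance(x, y, cycles_per_pix, theta_deg=45.0):
    """Luminance (0 = black, 1 = white) of a sinusoidal grating at pixel (x, y),
    with the modulation axis tilted theta_deg from horizontal."""
    th = math.radians(theta_deg)
    u = x * math.cos(th) + y * math.sin(th)  # position along the modulation axis
    return 0.5 + 0.5 * math.sin(2 * math.pi * cycles_per_pix * u)

# High-SF stimulus: 0.1 cycle/pix (10-pixel period); low-SF: 0.033 cycle/pix.
```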

Procedure.

Participants were directed from the crowdsourcing website to the Gorilla experimental platform. After receiving a detailed explanation of the experiment and providing informed consent, they were presented with a three-minute instructional video for performing the task. The standard tone was then presented, and the participants were instructed to adjust the volume of their computer to a comfortable level and to make no change to the volume during the experiment.

In the main task, the correspondences between the two auditory and five visual features were measured using a Likert-type rating scale. Each trial consisted of a previewing phase followed by a rating phase. In the previewing phase, the participants were instructed to listen to two tones paired in terms of different loudness or pitch levels and to imagine the designated visual features (i.e., either the brightness, vertical position, size, shape, or pattern) associated with each of them. The tones could be replayed by clicking the play buttons placed side by side on the center of the screen. Once the imagination component was done, the participants clicked a continue button to move on to the rating phase. In the rating phase, the tone button placed on the left side was presented along with the paired visual stimuli and the rating scale. The participants were asked to rate which side of the visual stimulus was better matched to the presented tone by clicking on the seven-point scale ranging from “extremely left” to “extremely right.” After the first rating, the participants rated the other tone of the pair again in the same manner, followed by the next trial. Fig 1 illustrates examples of the displays presented in the previewing and rating phases.

Fig 1.

Examples of the displays presented in the previewing phase (left) and rating phase (right) in Experiment 1. The instructions and Likert scale were presented in Japanese.

https://doi.org/10.1371/journal.pone.0294141.g001

Each participant completed 10 trials, combining the two pairs of tones and five pairs of visual stimuli. The trial order was randomized for each participant. There were four possible combinations of the spatial arrangements of the tones (i.e., play buttons) and visual stimuli. For the tones, the loud and high-pitched tone buttons were placed on the left side, whereas the soft and low-pitched tones were placed on the right side, or vice versa. For the visual stimuli, the dark, high position, large, rounded, and high SF stimuli were placed on the left side, whereas the bright, low position, small, angular, and low SF stimuli were placed on the right side, or vice versa. The participants were randomly assigned to one of four spatial arrangements.

After the main task, the participants were asked two questions to confirm whether they could distinguish the auditory features of the tones. For each question, the tone pair with different loudness or pitch levels was presented side by side in the same spatial arrangement as the main task. The participants judged which tone was louder or higher in pitch by clicking one of the buttons placed under each tone. The experiment took about ten minutes to complete.

Results and discussion

Fig 2 shows the results of the rating scales. The loud tone was judged more often to match the large, angular, high position, bright, and low SF stimuli, whereas the soft tone was judged more often to match the small, rounded, low position, dark, and high SF stimuli. Moreover, the high-pitched tone was judged more often to match the high position, bright, angular, high SF, and large stimuli, whereas the low-pitched tone was judged more often to match the low position, dark, rounded, low SF, and small stimuli.

Fig 2.

Results of the rating scales for crossmodal correspondences between loudness and visual features (A) and pitch and visual features (B) in Experiment 1. The numbers represent the percentages of participants who judged that either (or neither) of the paired visual stimuli matched the auditory feature at least slightly. This figure was created using the likert package in R [34].

https://doi.org/10.1371/journal.pone.0294141.g002

To examine the degree of the correspondences, we scored the rating scales from -3 to 3, where positive and negative values indicated the identical and opposite directions of the correspondences described above, respectively. Table 1 shows the means and standard deviations of the rating scores and the t values obtained from two-tailed one-sample t-tests. The t-tests with Bonferroni correction revealed that the rating scores were significantly higher than zero, except for the combination of pitch and size, indicating significant correspondences between various auditory and visual features. These results are consistent with those of previous studies that found crossmodal correspondences between various stimulus features using matching or speeded classification tasks. However, the correspondence between pitch and size reported in previous studies [8, 9] was not significant; rather, high (or low) pitch was matched better to a large (or small) appearance, the opposite of the previously reported trend.
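The scoring and testing procedure can be illustrated with a minimal standard-library sketch (the ratings below are toy values, not the study's data; the analysis software actually used is not reported):

```python
import math
from statistics import mean, stdev

def one_sample_t(scores, mu=0.0):
    """Two-tailed one-sample t statistic and degrees of freedom against mean mu."""
    n = len(scores)
    t = (mean(scores) - mu) / (stdev(scores) / math.sqrt(n))
    return t, n - 1

# Ratings scored from -3 to 3; positive = hypothesized direction (toy data).
ratings = [3, 2, 2, 1, 0, 3, 2, 1, 2, 3]
t, df = one_sample_t(ratings)

# Bonferroni correction over the 10 audiovisual feature pairs tested:
alpha_adjusted = 0.05 / 10  # each test's p-value is compared against 0.005
```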

Table 1. Mean rating scores and results of one-sample t-tests for audiovisual correspondences.

https://doi.org/10.1371/journal.pone.0294141.t001

To identify the potential factors, including their structure, underlying these correspondences, we conducted exploratory factor analysis using the rating scale data. Two factors were extracted based on the minimum average partial (MAP; [35]) criterion, and a scree plot supported the extraction. The results of the factor analysis using weighted least squares estimation and Oblimin rotation showed that the correspondence between loudness and spatial frequency (i.e., between loud tone and low SF and between soft tone and high SF) failed to reach loadings of 0.3 or higher, and the correspondence between pitch and spatial frequency (i.e., between high pitch and low SF and between low pitch and high SF) showed negative loadings, below -0.3. Thus, the same analysis was performed again without these items. The results are summarized in Table 2.
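The full pipeline (weighted least squares estimation, the MAP criterion, and Oblimin rotation) is normally run in dedicated software such as R's psych package rather than written by hand. As a rough standard-library illustration of where loadings come from, the sketch below extracts the dominant eigenvector of the inter-item correlation matrix by power iteration; its entries play the role of unrotated first-factor loadings (toy data, not the study's):

```python
import math
from statistics import mean

def corr(xs, ys):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

def first_principal_axis(items):
    """Dominant eigenvector of the item correlation matrix via power iteration."""
    k = len(items)
    R = [[corr(items[i], items[j]) for j in range(k)] for i in range(k)]
    v = [1.0] * k
    for _ in range(200):
        w = [sum(R[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy ratings: items 1 and 2 covary; item 3 varies in the opposite direction.
items = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [5, 4, 3, 2, 1]]
loadings = first_principal_axis(items)
```

As in the analysis above, items whose absolute loadings fall below 0.3 would be dropped before re-running the analysis.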

Table 2. Factor structures of crossmodal correspondences obtained in Experiment 1.

https://doi.org/10.1371/journal.pone.0294141.t002

Factor 1 consisted of eight items related to the correspondences between loudness and size, pitch and vertical position, loudness and vertical position, and pitch and size. The Japanese language uses the words “large” and “small” to describe the difference in loudness as well as the difference in size, and “high” and “low” to describe the difference in pitch and vertical position. It also has linguistic similarities between loudness and vertical positions, as seen in expressions such as turning “up” and “down” the volume. Therefore, we considered that these correspondences were mainly mediated by language and named the factor “semantic correspondences,” following the definition of Spence [2].

Factor 2 consisted of ten items related to the correspondences between pitch and brightness, loudness and brightness, pitch and shape, loudness and shape, and pitch and spatial frequency. These correspondences were unlikely to be mediated by language. Therefore, we named the factor “sensory correspondence.” The internal consistencies assessed by the Cronbach’s alpha coefficients were .80 and .78 for semantic and sensory correspondences, respectively, indicating acceptable reliabilities.
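The internal-consistency check can be reproduced with a direct implementation of Cronbach's alpha (a minimal sketch with toy scores; the coefficients reported above come from the actual rating data):

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of item score lists (one inner list per item):
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).
    Population variance is used throughout; the n/(n-1) factors cancel in the ratio."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_var = sum(pvariance(scores) for scores in items)
    return k / (k - 1) * (1 - sum_item_var / pvariance(totals))

# Two toy items rated by four participants:
alpha = cronbach_alpha([[3, 2, 3, 1], [3, 1, 3, 2]])
```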

The results of the exploratory factor analysis indicated that two factors underlie the various kinds of correspondences between auditory and visual features. These factors were interpreted as linguistically mediated and sensory-based correspondences. The correspondence between pitch and size was grouped into semantic correspondence, although these features have no clear linguistic similarity in Japanese. Moreover, the rating scores showed the opposite trend from the previous studies, suggesting that some unexpected influence during the experiment may have induced a semantic interaction between pitch and size.

In this experiment, the degree of correspondence was judged based on the auditory stimuli. That is, in the rating phase, the tones were presented individually so that the participants could rate which of the visual stimuli was better matched to each tone. This led participants to make two ratings for each correspondence, such as matching a loud or soft tone with a pair of different visual sizes. These two items of the same correspondence were grouped into the same factor and loaded on it to a similar extent (Table 2), suggesting that participants rated the features of the same correspondence on a similar basis. Meanwhile, the participants’ judgments could have been biased toward the auditory stimuli themselves, which might have affected the results. Indeed, when the visual features to be matched were the same, the items were grouped into the same factor regardless of whether the targeted auditory feature was loudness or pitch, indicating that visual features had a greater impact on the emergence of the correspondences. To clarify this influence, we conducted Experiment 2, in which the degree of correspondence was judged based on the visual stimuli.

Experiment 2

Methods

Participants.

As in Experiment 1, we used the Yahoo! JAPAN Crowdsourcing website to recruit participants. A total of 160 participants (48 female, 111 male, 1 other; mean age = 47.7 ± 11.1 years) completed the online experiment. An additional 41 participants completed the experiment but were excluded from the analysis because they did not click the play buttons to listen to the auditory stimuli in the rating phases (N = 29), could not distinguish the auditory features of the stimuli in the check questions (N = 3), or both (N = 9). We conducted the experiment in July 2021, taking special care to ensure the anonymity of the data and to obtain informed consent, as we did in Experiment 1.

Apparatus, stimuli, and procedure.

The apparatus, stimuli, and procedure were the same as in Experiment 1, except for the following changes. In the previewing phase, one of the visual stimulus pairs was presented on the screen, and the participants were instructed to imagine the designated auditory features (i.e., loudness or pitch) associated with each of them. In the rating phase, the visual stimulus that had been placed on the left side was presented along with the tone pair and rating scale. Each tone was presented above each end of the scale. The participants were asked to rate which side of the tone was better matched to the presented visual stimulus by clicking on the seven-point scale ranging from “extremely left” to “extremely right.” After the first rating, the visual stimulus was replaced by the other one of the pair, and the rating was performed again in the same manner. Fig 3 gives examples of the displays presented in the previewing and rating phases.

Fig 3.

Examples of the displays presented in the previewing phase (left) and rating phase (right) in Experiment 2. The instruction and Likert scale were presented in Japanese.

https://doi.org/10.1371/journal.pone.0294141.g003

Results and discussion

The results of the rating scales are shown in Fig 4. The large, angular, low SF, high position, and bright stimuli were judged more often to match the loud tone, whereas the small, rounded, high SF, low position, and dark stimuli were judged more often to match the soft tone. Moreover, the angular, bright, high position, small, and high SF stimuli were judged more often to match the high-pitched tone, whereas the dark, large, low position, rounded, and low SF stimuli were judged more often to match the low-pitched tone. These trends were the same as those in Experiment 1, except for the correspondence between pitch and size.

Fig 4. Results of the rating scales for crossmodal correspondences between loudness and visual features and between pitch and visual features in Experiment 2.

The numbers represent the percentages of participants who judged that either (or neither) of the paired auditory stimuli matched the visual feature at least slightly.

https://doi.org/10.1371/journal.pone.0294141.g004

The rating scales were then scored from -3 to 3 to submit to two-tailed one-sample t-tests. The means and standard deviations of the rating scores and t values are shown in Table 3. The t-tests with Bonferroni correction revealed that the rating scores were significantly higher than zero for the combinations of all visual features and auditory pitch, whereas the rating scores were not significantly different from zero for the combinations of several visual features (i.e., bright, dark, low position, and high SF stimuli) and loudness. This finding was different from the results of Experiment 1, which showed significant correspondences between all visual features and loudness. Furthermore, unlike Experiment 1, the significant correspondence between size and pitch was shown in the same direction as in previous studies, where large (or small) appearance was matched better to low (or high) pitch. This difference means that judgments of crossmodal correspondences can be influenced by experimental procedures, such as the difference in the modality of the stimuli presented during the previewing phase.

Table 3. Mean rating scores and results of one-sample t-tests for audiovisual correspondences.

https://doi.org/10.1371/journal.pone.0294141.t003

We examined the relations among the correspondences using exploratory factor analysis. Two factors were extracted based on the MAP criterion, and a scree plot supported the extraction. The analysis using weighted least squares estimation and Oblimin rotation revealed that the correspondence between loudness and shape (i.e., between loud tone and angular shape and between soft tone and rounded shape) and a part of the correspondence between loudness and vertical position (i.e., between soft tone and low position) failed to reach loadings of 0.3 or higher. Moreover, the correspondence between loudness and brightness (i.e., between loud tone and bright stimulus and between soft tone and dark stimulus) showed negative loadings below -0.3. Thus, we performed the analysis again with those items removed. The results are summarized in Table 4.

Table 4. Factor structures of crossmodal correspondences obtained in Experiment 2.

https://doi.org/10.1371/journal.pone.0294141.t004

Factor 1 consisted of nine items, including the loudness–size and pitch–vertical position correspondences, whereas factor 2 consisted of eight items, including the pitch–brightness and pitch–shape correspondences. Thus, we named the factors semantic and sensory correspondences, respectively, as in Experiment 1. Their Cronbach’s alpha coefficients were .73 and .73, respectively, indicating acceptable reliabilities.

Consistent with Experiment 1, the results of Experiment 2 indicated that the ten sets of audiovisual correspondences used in this experiment were consistently organized into semantic and sensory correspondences. Despite the similarities in how the items were grouped, we found some differences between the experiments. In particular, the correspondence between pitch and size, which loaded on semantic correspondence in Experiment 1, loaded on sensory correspondence in this experiment. The pitch–size correspondence was also judged in the opposite direction from Experiment 1, which suggests that participants associate stimulus features differently depending on whether they judge them based on visual or auditory stimuli. This may also explain why the correspondences between some visual features and loudness observed in Experiment 1 were not significant and were not grouped into either semantic or sensory correspondences in Experiment 2.

General discussion

This study examined the different kinds of crossmodal correspondences in the audiovisual domain to find the distinctions among them. The results of the two online experiments revealed that two factors underlie the correspondence judgments between the two auditory and five visual features used. The correlation between the two factors was relatively low in both experiments, indicating that the two factors are relatively independent or distinct from each other. Although the components of each factor were slightly different between the experiments, the correspondences between features expressed in common Japanese linguistic terms, such as the loudness–size and pitch–vertical position correspondences, were consistently grouped into the same factor. These results suggest the existence of at least two types of crossmodal correspondences, which differ in terms of their semantic mediation. Moreover, some correspondences differed in terms of direction and classification type, depending on the procedure of correspondence judgments. Our results provide insights into the fundamental nature of crossmodal correspondences.

Our findings confirm that there are two kinds of crossmodal correspondences. Whereas Spence [2] had categorized crossmodal correspondence into the three principal types of structural, statistical, and semantic correspondences, our results showed a distinction only between semantic and other correspondences. This indicates the absence of a clear distinction between structural and statistical correspondences, at least in terms of the subjective evaluation of crossmodal correspondences. However, our results do not rule out the possibility that neural connections or learning of natural regularities underlie some crossmodal correspondences. Rather, these mechanisms may not be completely independent, even though they are triggered differently, as sensory experience may influence the development of neural structures [36, 37]. This may be the reason why the distinction between structural and statistical correspondences was not observed in the present study. Another possibility is the incorporation of what initially had been structural or statistical correspondences into semantic correspondence during language acquisition. Future studies are needed to determine how the distinctions of crossmodal correspondences change throughout language acquisition.

Our study also provides some implications for the mechanism of each kind of crossmodal correspondence. In both Experiments 1 and 2, we found that the loudness–size, pitch–vertical position, and loudness–vertical position correspondences were grouped into Factor 1, whereas the pitch–brightness and pitch–shape correspondences were grouped into Factor 2. Based on the linguistic terms used to describe the stimuli, we considered the correspondences in Factor 1, but not those in Factor 2, to be mediated by language and named them semantic and sensory correspondences, respectively. This is consistent with previous findings that pitch–brightness and pitch–shape correspondences can be observed even in preverbal infants [13, 16]. However, in Experiment 2, the correspondence between loudness and spatial frequency, which does not seem to be mediated by language, was categorized as a semantic correspondence, whereas the correspondence between pitch and spatial frequency, whose stimuli can be described as “high” and “low,” was categorized as a sensory correspondence. This may be because the words “thin” and “thick” are more commonly used in Japanese than “high” and “low” to describe stripe patterns that differ in spatial frequency [38]. Since thickness represents the distance between two sides, spatial frequency may have been semantically associated with loudness, for which the words “small” and “large” are used. Indeed, in both experiments, the high and low spatial frequency stimuli were judged more often to match the soft and loud tones, respectively. At the same time, the high and low spatial frequency stimuli were judged more often to match the high- and low-pitched tones, respectively, as previously reported. Thus, the correspondence between pitch and spatial frequency may not be primarily mediated by language.

Meanwhile, the correspondence between pitch and size was grouped into different factors across the experiments: it was categorized as a semantic correspondence in Experiment 1 and as a sensory correspondence in Experiment 2. This might suggest that the mechanisms driving crossmodal correspondences vary depending on how the correspondence is judged; however, the result may instead be explained by differences in how participants interpreted the stimulus features. Krugliak and Noppeney [39] reported that, contrary to the previously reported pattern of the correspondence between pitch and size, low pitch was associated with small size and high pitch with large size. They suggested that observers are likely to have interpreted the stimuli of different sizes as being at different distances and to have associated the change in distance across trials with the change in pitch based on the Doppler effect. In our study, we found the opposite pattern of the pitch–size correspondence only in Experiment 1, where the tones of different pitch or loudness levels were presented in pairs in the previewing phase. Because the loudness threshold of a sound varies with its frequency [40], the participants might have interpreted the tones of different pitches as differing in loudness by comparing them during the previewing phase. This would have been particularly likely when the participants were instructed to imagine the size of the visual objects associated with each of the tones, since size is expressed as “ookisa” in Japanese and the same word is used to describe the loudness of a tone. As a result, both the higher-pitched and the louder tones may have been judged to match the larger stimuli in Experiment 1. This may also explain why the pitch–size correspondence was grouped into the same factor as the loudness–size correspondence in Experiment 1. More attention should be paid to the perceived loudness of auditory stimuli when examining correspondences between pitch and visual features.

In the experiments of this study, the rating phase was preceded by the previewing phase, in which a pair of tones (a pair of visual stimuli in Experiment 2) was presented side by side. This phase was included because crossmodal correspondences are supposed to be based on the relative rather than absolute differences between auditory and visual features [32, 33], and thus the presentation of only one of a pair of tones during the rating phase is insufficient for participants to match the auditory feature with the visual feature. To facilitate the correspondence judgments, participants were asked to imagine the designated visual (or auditory) features associated with each of a pair of tones (or visual stimuli) presented during that phase. Although the left-right arrangement of the pair of tones was randomized across participants, the loud and high-pitched (or soft and low-pitched) tones were placed on the same side throughout the experiment. This may have led participants to make an association between loudness and pitch, which may have influenced the judgment of crossmodal correspondences. Further research is needed to determine how the method of stimulus presentation affects the pattern of crossmodal correspondences.

The experimental method used in this study was novel in that participants rated several kinds of crossmodal correspondences within the same experiment. In previous studies, different kinds of crossmodal correspondence have been examined separately in different experiments, making their relations difficult to ascertain. In contrast, by asking the same participants to judge the correspondences between five visual and two auditory features, we were able to uncover new aspects of audiovisual crossmodal correspondence, such as differences in their strength and in their susceptibility to experimental procedures. This approach is consistent with Parise’s [41] suggestion that it is important to examine individual differences in crossmodal correspondences by assessing whether different correspondences are similar within participants. Our method would be applicable for evaluating the (dis)similarity and relatedness between various kinds of crossmodal correspondences.

Nonetheless, the present study has several limitations. The first relates to the generalizability of the results. Given that linguistic terms play a crucial role in crossmodal correspondences, the categorization of each correspondence may differ depending on the language used. Whether our results can be applied to participants from other language areas needs to be verified. Second, crossmodal correspondences were examined by asking participants to explicitly rate which of the paired stimuli better matched the stimulus in another modality. Thus, our results could have mainly reflected crossmodal correspondences that occur at the decision level. Previous studies have suggested, based on the results of speeded and unspeeded tasks, that some crossmodal correspondences occur at the perceptual level [2, 10, 42–46]. Although the correspondences that occur at the perceptual level are likely to form the basis of explicit judgments, different factor structures might be observed when using such tasks. To clarify this, controlled perceptual experiments on different kinds of crossmodal correspondences would need to be conducted with the same participants. Furthermore, it is also unclear whether the different types of crossmodal correspondences have different effects on performance in such behavioral tasks. Parise and Spence [42] found that the strength of the effect of crossmodal correspondences on the implicit association task was comparable across the five audiovisual correspondences they used. However, Barbosa Escobar, Velasco, Byrne, and Wang [47] showed that some correspondences between visual textures and temperatures observed in an explicit test were not found in the expected direction in an implicit association test, suggesting that explicit and implicit tasks may tap different aspects of crossmodal correspondences. Future studies should also address these issues.

Supporting information

S1 Fig. Examples of the visual stimuli used in Experiments 1 and 2.

https://doi.org/10.1371/journal.pone.0294141.s001

(PDF)

References

  1. Calvert G, Spence C, Stein BE, editors. The handbook of multisensory processes. MIT Press; 2004.
  2. Spence C. Crossmodal correspondences: a tutorial review. Atten Percept Psychophys. 2011;73(4):971–95. pmid:21264748
  3. Martino G, Marks LE. Cross-modal interaction between vision and touch: the role of synesthetic correspondence. Perception. 2000;29(6):745–54. pmid:11040956
  4. Belkin K, Martin R, Kemp SE, Gilbert AN. Auditory pitch as a perceptual analogue to odor quality. Psychol Sci. 1997;8(4):340–2.
  5. Spence C, Levitan CA, Shankar MU, Zampini M. Does food color influence taste and flavor perception in humans? Chemosens Percept. 2010;3:68–84.
  6. Mudd SA. Spatial stereotypes of four dimensions of pure tone. J Exp Psychol. 1963;66(4):347–52. pmid:14051851
  7. Marks LE. On associations of light and sound: the mediation of brightness, pitch, and loudness. Am J Psychol. 1974;87(1–2):173–88. pmid:4451203
  8. Marks LE, Hammeal RJ, Bornstein MH. Perceiving similarity and comprehending metaphor. Monogr Soc Res Child Dev. 1987;52(1):i–100. pmid:3431563
  9. Gallace A, Spence C. Multisensory synesthetic interactions in the speeded classification of visual size. Percept Psychophys. 2006;68(7):1191–203. pmid:17355042
  10. Evans KK, Treisman A. Natural cross-modal mappings between visual and auditory features. J Vis. 2010;10(1):6.1–12. pmid:20143899
  11. Getz LM, Kubovy M. Questioning the automaticity of audiovisual correspondences. Cognition. 2018;175:101–8. pmid:29486377
  12. Lewkowicz DJ, Turkewitz G. Cross-modal equivalence in early infancy: Auditory-visual intensity matching. Dev Psychol. 1980;16(6):597–607.
  13. Walker P, Bremner JG, Mason U, Spring J, Mattock K, Slater A, et al. Preverbal infants’ sensitivity to synaesthetic cross-modality correspondences. Psychol Sci. 2010;21(1):21–5. pmid:20424017
  14. Dolscheid S, Hunnius S, Casasanto D, Majid A. Prelinguistic infants are sensitive to space-pitch associations found across cultures. Psychol Sci. 2014;25(6):1256–61. pmid:24899170
  15. Fernández-Prieto I, Navarra J, Pons F. How big is this sound? Crossmodal association between pitch and size in infants. Infant Behav Dev. 2015;38:77–81. pmid:25617593
  16. Haryu E, Kajikawa S. Are higher-frequency sounds brighter in color and smaller in size? Auditory-visual correspondences in 10-month-old infants. Infant Behav Dev. 2012;35(4):727–32. pmid:22982272
  17. Dolscheid S, Çelik S, Erkan H, Küntay A, Majid A. Space-pitch associations differ in their susceptibility to language. Cognition. 2020;196:104073. pmid:31810048
  18. Dolscheid S, Shayan S, Majid A, Casasanto D. The thickness of musical pitch: psychophysical evidence for linguistic relativity. Psychol Sci. 2013;24(5):613–21. pmid:23538914
  19. Fernandez-Prieto I, Spence C, Pons F, Navarra J. Does language influence the vertical representation of auditory pitch and loudness? Iperception. 2017;8(3):2041669517716183. pmid:28694959
  20. Walsh V. A theory of magnitude: common cortical metrics of time, space and quantity. Trends Cogn Sci. 2003;7(11):483–8. pmid:14585444
  21. Parise CV, Knorre K, Ernst MO. Natural auditory scene statistics shapes human spatial hearing. Proc Natl Acad Sci U S A. 2014;111(16):6104–8. pmid:24711409
  22. Gaver WW. What in the world do we hear?: An ecological approach to auditory event perception. Ecol Psychol. 1993;5(1):1–29.
  23. McMahon TA, Bonner JT. On size and life. New York: Scientific American Library; 1983.
  24. Martino G, Marks LE. Perceptual and linguistic interactions in speeded classification: tests of the semantic coding hypothesis. Perception. 1999;28(7):903–23. pmid:10664781
  25. Barbiere JM, Vidal A, Zellner DA. The color of music: Correspondence through emotion. Empirical Studies of the Arts. 2007;25(2):193–208.
  26. Palmer SE, Schloss KB, Xu Z, Prado-León LR. Music-color associations are mediated by emotion. Proc Natl Acad Sci U S A. 2013;110(22):8836–41. pmid:23671106
  27. Spence C, Sathian K. Audiovisual crossmodal correspondences: Behavioural consequences & neural underpinnings. In: Sathian K, Ramachandran VS, editors. Multisensory perception: From laboratory to clinic. San Diego, CA: Elsevier; 2020. pp. 239–58.
  28. Occelli V, Esposito G, Venuti P, Arduino GM, Zampini M. The Takete-Maluma phenomenon in autism spectrum disorders. Perception. 2013;42(2):233–41. pmid:23700961
  29. Rusconi E, Kwan B, Giordano BL, Umiltà C, Butterworth B. Spatial representation of pitch height: the SMARC effect. Cognition. 2006;99(2):113–29. pmid:15925355
  30. Hair JF, Black WC, Babin BJ, Anderson RE. Multivariate data analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall; 2010.
  31. Anwyl-Irvine AL, Massonnié J, Flitton A, Kirkham N, Evershed JK. Gorilla in our midst: An online behavioral experiment builder. Behav Res Methods. 2020;52(1):388–407. pmid:31016684
  32. Brunetti R, Indraccolo A, Del Gatto C, Spence C, Santangelo V. Are crossmodal correspondences relative or absolute? Sequential effects on speeded classification. Atten Percept Psychophys. 2018;80(2):527–34. pmid:29116614
  33. Spence C. On the relative nature of (pitch-based) crossmodal correspondences. Multisens Res. 2019;32(3):235–65. pmid:31059485
  34. Bryer J. likert: Analysis and visualization of Likert items. 2022 [cited 2023 Feb 11]. Available from: http://jason.bryer.org/likert, http://github.com/jbryer/likert.
  35. Velicer WF. Determining the number of components from the matrix of partial correlations. Psychometrika. 1976;41:321–7.
  36. Zhang LI, Poo MM. Electrical activity and development of neural circuits. Nat Neurosci. 2001;4 Suppl:1207–14. pmid:11687831
  37. Westermann G, Mareschal D, Johnson MH, Sirois S, Spratling MW, Thomas MS. Neuroconstructivism. Dev Sci. 2007;10(1):75–83. pmid:17181703
  38. University of Tsukuba, National Institute for Japanese Language and Linguistics, Lago Institute of Language. NINJAL-LWP for TWC. [Cited 2023 Sep 2]. Available from: https://tsukubawebcorpus.jp
  39. Krugliak A, Noppeney U. Synaesthetic interactions across vision and audition. Neuropsychologia. 2016;88:65–73. pmid:26427739
  40. Suzuki Y, Takeshima H. Equal-loudness-level contours for pure tones. J Acoust Soc Am. 2004;116(2):918–33. pmid:15376658
  41. Parise CV. Crossmodal correspondences: Standing issues and experimental guidelines. Multisens Res. 2016;29(1–3):7–28. pmid:27311289
  42. Parise CV, Spence C. Audiovisual crossmodal correspondences and sound symbolism: a study using the implicit association test. Exp Brain Res. 2012;220:319–33. pmid:22706551
  43. Anikin A, Johansson N. Implicit associations between individual properties of color and sound. Atten Percept Psychophys. 2019;81(3):764–77. pmid:30547381
  44. Parise CV, Spence C. ’When birds of a feather flock together’: synesthetic correspondences modulate audiovisual integration in non-synesthetes. PLoS One. 2009;4(5):e5664. pmid:19471644
  45. Zeljko M, Kritikos A, Grove PM. Lightness/pitch and elevation/pitch crossmodal correspondences are low-level sensory effects. Atten Percept Psychophys. 2019;81(5):1609–23. pmid:30697648
  46. Sadaghiani S, Maier JX, Noppeney U. Natural, metaphoric, and linguistic auditory direction signals have distinct influences on visual motion processing. J Neurosci. 2009;29(20):6490–9. pmid:19458220
  47. Barbosa Escobar F, Velasco C, Byrne DV, Wang QJ. Crossmodal associations between visual textures and temperature concepts. Q J Exp Psychol (Hove). 2023;76(4):731–61. pmid:35414309