
Beyond traditional stimuli: Validating AI-generated images for eliciting negative emotions in affect research

  • Hey Tou Chiu ,

    Contributed equally to this work with: Hey Tou Chiu, Hoi In Sou

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Supervision, Writing – review & editing


    Affiliations Department of Educational Psychology, The Chinese University of Hong Kong, Hong Kong, Laboratory for Brain and Education, The Chinese University of Hong Kong, Hong Kong

  • Hoi In Sou ,

    Contributed equally to this work with: Hey Tou Chiu, Hoi In Sou

    Roles Formal analysis, Investigation, Project administration, Writing – original draft


    Affiliations Department of Educational Psychology, The Chinese University of Hong Kong, Hong Kong, Laboratory for Brain and Education, The Chinese University of Hong Kong, Hong Kong

  • Yuen Wing Lam,

    Roles Data curation, Formal analysis, Investigation, Methodology, Writing – review & editing

    Affiliations Department of Educational Psychology, The Chinese University of Hong Kong, Hong Kong, Laboratory for Brain and Education, The Chinese University of Hong Kong, Hong Kong

  • Clayton Siu Fung Ng,

    Roles Formal analysis, Investigation, Methodology, Writing – review & editing

    Affiliation Laboratory for Brain and Education, The Chinese University of Hong Kong, Hong Kong

  • Savio W.H. Wong

    Roles Conceptualization, Funding acquisition, Methodology, Resources, Supervision, Writing – review & editing

    savio.wong@gmail.com

    Affiliations Department of Educational Psychology, The Chinese University of Hong Kong, Hong Kong, Laboratory for Brain and Education, The Chinese University of Hong Kong, Hong Kong

Abstract

Studies of emotion often rely on standardized stimulus sets to elicit affective responses. Although established databases provide images with normative valence and arousal ratings, selecting suitable stimuli can be difficult when experiments require specific thematic or content constraints. This challenge is especially pronounced for negative stimuli, which are central to research on maladaptive emotions and behaviors in clinical contexts but are often scarce in necessary quantity or specificity. The present study evaluated the feasibility of using generative AI, specifically text-to-image generators, to create tailored negative and neutral affective stimuli. To assess whether these images can serve as alternatives to traditional stimuli, we compared their affective properties to those reported in standardized image databases. Across two studies, participants rated the valence and arousal of 160 and 200 AI-generated images. Our findings revealed that AI-generated negative and neutral images reproduced the characteristic inverse association between valence and arousal observed in standardized databases, with moderate to strong correlations between these dimensions. These results highlight the potential of generative AI as a practical methodological tool for creating customized affective stimuli aligned with specific research objectives and experimental designs.

Introduction

A central aspect in the study of emotions involves examining individuals’ responses to controlled stimuli designed to elicit specific emotional responses [1]. The development and utilization of emotionally salient materials have become essential for accurately measuring behavioral responses (e.g., reaction time, accuracy) and physiological reactions associated with specific emotions [2]. Researchers employ various modalities of stimuli, including visual, lexical, and auditory, to induce emotional responses [3,4]. Among these, visual stimuli (images) are the most widely used in behavioral and neuroimaging research because they require minimal linguistic knowledge and semantic processing, making them intuitive and particularly suitable for cognitive research on affective processing compared with textual or auditory stimuli [5–7]. However, before visual stimuli can be used as reliable emotional elicitors, researchers must carefully control stimulus content and physical properties (e.g., size, brightness, and color) tailored to specific research questions. Consequently, identifying appropriate visual stimuli aligned with particular experimental requirements remains challenging, as consistently noted across previous studies [2,8,9].

Visual stimuli in affective research mostly come from existing standardized image databases such as the International Affective Picture System (IAPS) [10], the Nencki Affective Picture System (NAPS) [9], and the Open Affective Standardized Image Set (OASIS) [11]. These databases offer images with a wide range of themes and content, along with normative ratings on affective dimensions. Valence captures the positive or negative nature of an affective experience, contrasting states of pleasure with displeasure; arousal captures the level of excitement induced [12]. Self-reported ratings on these affective dimensions are usually collected using the Self-Assessment Manikin (SAM) [13], in which valence and arousal are represented by pictorial figures arranged along a 9-point scale. The SAM is a culturally and linguistically universal instrument that targets the affective responses associated with the stimuli (“How do you feel while viewing the picture?”) rather than semantic knowledge (“Are the objects or situations depicted good or bad?”). This technique has facilitated replicability through validation of affective images across languages and cultures (e.g., [1,5,14,15]) and is now widely adopted as a standardized procedure for collecting normative ratings in affective databases. Importantly, the stimuli in these databases have been extensively used in various experimental paradigms in both behavioral and neuroimaging research (e.g., [16–19]).

Several limitations of existing standardized stimuli databases have been identified. First, the availability of stimuli within specific categories is often constrained [1,9]. For research requiring a high frequency of stimuli aligned with particular themes or content categories, broad-topic databases, such as IAPS, may fall short in providing sufficient suitable options for task-specific purposes [2]. This is particularly the case in studies of negative affect processing, an area that has historically dominated affective research because of the greater motivational and clinical relevance of negative emotions to behavior [20–22]. Suitable negative images are often difficult to obtain in large quantities and may require researchers to select specific stimuli and combine them as a set from multiple databases [23,24]. Additionally, as stimuli from these databases are predominantly natural photographs, inconsistencies in image quality and perceptual characteristics (e.g., color, size, brightness) can complicate the process of maintaining experimental control over visual stimuli. Furthermore, certain images in databases like the IAPS may feel outdated or contextually irrelevant, as the database was originally developed in a pre-internet era [25]. Thus, when research designs require specific valence, image content, precise control over perceptual attributes or image styles, and when these images are needed in a substantial quantity, it becomes imperative to explore novel methodologies for generating stimuli tailored to affective research. Such approaches not only address the inherent limitations of existing databases but also reduce the time-intensive process of searching for suitable stimuli.

Current uses of artificial intelligence in stimuli development

Recent technological advancements have spurred the rapid growth of artificial intelligence (AI). In particular, generative AI enables the automatic creation of diverse content, encompassing text, images, and videos, in response to user-provided prompts [26,27]. These innovations have encouraged researchers from diverse disciplines to leverage AI for generating materials tailored to their specific research paradigms. To date, generative AI has been employed to create visual and linguistic stimuli across various disciplines, ranging from the arts and linguistics to psychological research [28–30]. For example, Alzahrani et al. [30] examined the feasibility of using AI to generate auditory and written sentence stimuli and evaluated its acceptability and validity across three psycholinguistic experimental designs. Using Lovo AI, a text-to-speech tool, and ChatGPT-3 to generate sentences, Alzahrani et al. [30] showed that the quality of AI-generated psycholinguistic stimuli in English was perceived as comparable or superior to that of stimuli developed by experienced researchers. Although these stimuli did not consistently replicate established psycholinguistic effects, the study provided evidence of high acceptability, indicating that the stimuli were perceived as human-like. For other types of stimuli, such as AI-generated faces, studies have shown that they are remarkably difficult to distinguish from real faces (e.g., [31,32]). AI-generation techniques have also been applied to introduce subtle changes in facial expressions to examine their impact on participants’ aesthetic ratings [33]. More recently, Tassinari [34] used Dall-E 2 [35] to generate stimuli tailored to the study of weight bias, creating average-weight and overweight versions of facial stimuli intended for use in the Implicit Association Test (IAT). These studies demonstrate that AI shows promise in developing stimuli comparable to those produced by humans and is increasingly adopted as a tool to modify or create specific stimuli for studying psychological processes, making it a potentially valuable tool for experimental research.

Recent studies have further investigated the potential of generative AI to generate emotionally charged materials [28,29,36]. For instance, Demmer et al. [28] created visually abstract artworks using a random noise generator (RNG) and compared them to artworks created by human artists. Participants were asked to report the extent to which they experienced emotions while viewing both types of artwork. The results revealed that participants reported feeling emotions and ascribing intentions to the artworks, regardless of whether they were created by AI or human artists. This suggests that AI-generated artworks are capable of eliciting emotional responses in viewers. In another study, Azuaje et al. [36] developed a therapeutic writing tool incorporating text-to-image AI to generate artwork intended to positively distract users from negative emotions. The results indicated that while the tool contributed to improvements in some emotional outcomes, such as reductions in anger and sadness, it was less effective in addressing other emotions, such as anxiety or stress. Moreover, the intended positive distraction of the AI-generated images was inconsistent; some participants found the images negative and unsettling [36]. Although the inclusion of AI-generated artwork in the writing tool did not consistently help participants downregulate their negative emotions, the study demonstrated that AI-generated images can effectively evoke both positive and negative emotional responses in viewers.

Research gap

Despite the increasing use of AI in stimuli generation, it is unclear whether generative AI is suitable for creating standardized emotionally provoking stimuli specifically tailored for experimental designs in affective research. While AI-generated artworks appear able to readily evoke emotional responses in participants [28,29], no study has yet systematically examined the affective dimensional properties of AI-generated images, particularly for naturalistic scenes. To establish that AI-generated emotional images can be used as a valid tool for emotion research, it is essential to investigate whether visual stimuli created through generative AI can reproduce the normative valence and arousal patterns observed in standardized affective stimuli [10]. Standardized affective stimuli typically exhibit a “boomerang” or “U-shaped” distribution, where positive and negative stimuli are rated higher on arousal, while neutral stimuli tend to be rated lower [5,37,38]. If AI-generated affective stimuli demonstrate similar properties in these dimensions, AI could emerge as a viable tool for affective researchers, particularly for sourcing additional or highly specific stimuli. With the growing demand for large quantities of themed visual stimuli, and for stimuli tailored to specific experimental designs, exploring the potential of generative AI to complement existing standardized image databases is critical. Moreover, findings from this exploration will provide insights into the broader applicability and limitations of AI-generated stimuli within future affective research.

Present study

The present study investigates the potential of generative AI for creating static negative and neutral visual stimuli for affective research. To our knowledge, this is the first study to utilize text-to-image generative AI to develop naturalistic scene stimuli tailored to specific experimental designs. Beyond generation, we sought to establish normative ratings using standardized validation procedures. The primary novelty of this study lies in its demonstration that AI tools can produce tailored emotional scenes that yield replicable and consistent affective ratings across independent samples. While previous studies have examined emotional responses to AI-generated artwork [28], the inter-associations of the affective dimensions (valence and arousal) in AI-generated naturalistic scenes remain unexplored. The scope of this study was intentionally limited to the negative-to-neutral valence spectrum for practical reasons and clinical relevance. Methodologically, incorporating all three image categories (positive, neutral, and negative) for within-subject ratings would substantially increase the number of trials, which could lead to participant fatigue and habituation effects. Furthermore, given that negative affect is central to understanding psychopathology such as PTSD [39], depression [40,41], and anxiety [42], prioritizing this spectrum allows for a more focused contribution to the dominant literature on maladaptive emotional processing [20–22]. Therefore, we prioritized data quality and clinical relevance over a full-spectrum valence investigation.

We conducted two image rating studies using AI-generated stimuli specifically developed for an executive function task [43]. In Study 1, we collected valence and arousal ratings from participants who had just completed a behavioral task using these images. This design aligns with standard practice, in which post-task ratings serve as a manipulation check to verify that stimuli elicit the intended affect within a specific experimental context (e.g., [44–46]). However, because prior exposure can introduce habituation effects, we conducted Study 2 with two independent groups of exposure-naïve participants. This second study provides a cleaner, normative set of ratings unconfounded by task demands. By reporting findings from both studies, we offer a comprehensive validation: Study 1 demonstrates the effectiveness of the stimuli following task engagement, while Study 2 establishes a generalizable benchmark for future research. All stimuli and rating datasets are available for research use upon request.

Study 1

Method

Development of AI-generated stimuli.

A set of 160 images (80 negative and 80 neutral images) was developed for an executive function task as part of a larger study [43]. Each image was designed to feature a combination of two of four specific content categories: animal [A], people [P], tree [T] or vehicle [V], yielding six possible combinations: people-tree (PT), people-animal (PA), people-vehicle (PV), tree-animal (TA), tree-vehicle (TV) or vehicle-animal (VA). The distribution of images across these combinations is detailed in Table 1.

We used three text-to-image generation models to generate these images: 1) Stable Diffusion, 2) Adobe Firefly, and 3) Leonardo.Ai. Stable Diffusion, released in 2022, uses latent diffusion, a deep learning technique, to generate images from text inputs [47]. Stable Diffusion can be implemented through front-end platforms such as DreamStudio, which offer additional processing functions that let users mask a specific image area and fill it with content from further text prompts (i.e., “inpainting”). Users can also extend an image beyond its original dimensions through additional prompts to Stable Diffusion, generating new content (i.e., “outpainting”). In this study, we used Stable Diffusion v1.6 within DreamStudio.

Adobe Firefly operates on a generative AI model trained on licensed content, including Adobe Stock and public domain images where copyright has expired [48]. Users can access Firefly through a web browser with an Adobe Creative Cloud account and can utilize a range of features (e.g., text-to-image generation, generative recoloring, and generative fill). This study used the Firefly Image 2 Model. Leonardo.Ai is an AI image generator that offers a range of fine-tuned models, two of which are Leonardo Diffusion XL and Leonardo Kino XL. These models were built on top of existing sophisticated models to improve the quality of generated images and to tailor the output toward specific styles. Both used Stable Diffusion XL 1.0 as their base model [49] and can be accessed by creating a free account on the web platform. Among the three AI models, the majority of our negative images were generated using Stable Diffusion and Leonardo.Ai, while our neutral images were mostly generated using Adobe Firefly.

Images were generated using text prompts that specified negative and neutral scenes, each incorporating two specific content categories. Both inclusive and exclusive prompts were used: inclusive prompts specified two of the four categories (A, P, T, V) while exclusive prompts omitted the remaining two categories to ensure that the image content aligned with intended criteria. In addition to excluding specific categories, prompts were refined to control various aspects of the image output, such as tone (e.g., “with grey skies or drizzles” to create a cooler tone and a more negative mood), and the background (e.g., “no tall buildings behind” to minimize background distractions). The resulting scenes varied based on the category combinations. For example, negative images depicted scenarios such as a car hitting a person on the road, causing injury or blood, or a fallen tree trapping an animal, resulting in death or distorted figures. Negative prompts were generally centered around themes of accidents, injuries, violence, disasters and catastrophes, which are similar to those found in standardized databases like IAPS and NAPS. Neutral images, on the other hand, typically depicted a person or an animal in a natural setting with a tree or a car in the background, such as a man sleeping under a large tree. Fig 1 provides examples of text prompts for both a negative and a neutral image, illustrating how these prompts were used to generate scenes corresponding to specific content category combinations. Additional examples of text prompts and corresponding generation parameters are provided in S1 Table.
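As an illustration, the inclusive/exclusive prompt structure described above can be sketched in a few lines of Python. The four category codes follow the paper; the mood phrases and prompt wording are hypothetical placeholders, not the study’s actual prompts.

```python
from itertools import combinations

# The four content categories and their single-letter codes used in the study.
CATEGORIES = {"A": "animal", "P": "people", "T": "tree", "V": "vehicle"}

def build_prompt(include, valence):
    """Compose an inclusive/exclusive text prompt for one category pair.

    `include` is a pair of category codes (e.g. ("P", "A")); the remaining
    two categories are listed as exclusions. The mood phrases below are
    illustrative only, not the study's exact prompts.
    """
    included = " and ".join(CATEGORIES[c] for c in include)
    excluded = ", ".join(CATEGORIES[c] for c in sorted(set(CATEGORIES) - set(include)))
    mood = ("a distressing accident scene, grey skies, drizzle"
            if valence == "negative"
            else "a calm everyday scene, soft natural light")
    return f"A photo of {included}, {mood}; no {excluded}, no tall buildings behind"

# Six possible pairings of the four categories, as in the study.
pairs = list(combinations("APTV", 2))
example = build_prompt(("P", "A"), "negative")
```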

Fig 1. Sample AI-generated images for negative and neutral emotion categories.

Left = people-animal (PA) as a content category. Right = tree-vehicle (TV) as a content category.

https://doi.org/10.1371/journal.pone.0342434.g001

Generating suitable images often involved iterative adjustments and repeated refinements of the text prompts, as initial prompts did not always result in images that fully matched our expectations. Distortions were particularly common in scenes depicting human and animal faces or limbs. Additionally, the backgrounds of some scenes occasionally appeared overly stylized or unrealistic. To address these issues during the process of image generation, we utilized features within the AI tools themselves. For example, Adobe Firefly allowed us to regenerate specific areas of an image with targeted prompts, while in Stable Diffusion, adjusting parameters such as prompt strength enabled greater control over how closely the output adhered to the original prompt.

However, our primary goal was not to create images indistinguishable from real-life photographs but to elicit the intended emotional response (negative or neutral). Consequently, we accepted generated images even when they appeared distinctly “AI-like”. For example, some generated images showed backgrounds that lacked detail or appeared blurred compared to actual photographs, while others depicted target objects disproportionate to their backgrounds, or placed in unusual positions. Since these images were presented briefly (around 3–4 seconds) during experimental tasks, neither the level of realism nor participants’ recognition of images as AI-generated was considered critical. Following generation, a post-processing workflow was applied. When necessary, we used traditional image editing software (e.g., Adobe Photoshop) to adjust object proportions and placements and to refine color, saturation, and brightness. All stimuli were then cropped to 1080 x 1080 pixels and Adobe Lightroom was used to apply a uniform color tone filter across the entire set of 160 images.

Participants.

Seventy-four participants (64 females, Mage = 20.6, SD = 1.91) were recruited using convenience sampling via mass email within the university community. Participants were screened online based on the following inclusion criteria: (1) aged between 18–25, (2) fluent in Cantonese, (3) able to read Traditional Chinese, and (4) normal or corrected-to-normal vision. Given the recruitment method, most participants were undergraduate students (85.1%), followed by postgraduate students (13.5%) and university staff (1.4%). Written informed consent was obtained from all participants. The experimental protocol was approved by the ethics committee of the Chinese University of Hong Kong (SBRE-22–0675).

Stimulus presentation and rating scales.

The 160 AI-generated images (80 negative and 80 neutral) were divided evenly into two sets of 80. To minimize fatigue, each participant rated the images across two separate runs, with one set assigned to each run. For attentional checks, eight positive images sourced from Google Images were randomly inserted into each set, bringing the total to 88 images per run. These images were pseudo-randomized into four blocks of 22, with the constraint that no more than two stimuli from the same emotional valence (negative, neutral or positive) or image category appeared consecutively. The presentation order of blocks was counterbalanced across participants using a Latin square design.
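The run-length constraint described above (no more than two consecutive stimuli sharing a valence or category) can be enforced in several ways; below is a minimal greedy sketch in Python. The algorithm itself is an assumption for illustration — the paper does not specify how the constraint was implemented.

```python
import random
from collections import defaultdict

def constrained_order(items, key, max_run=2, seed=0):
    """Pseudo-randomize `items` so that no more than `max_run` consecutive
    items share the same `key` value (e.g. valence or content category).

    Greedy sketch: repeatedly draw from the largest remaining pool among
    those that would not extend the current run past `max_run`. This is
    one common approach, not necessarily the study's exact procedure.
    """
    rng = random.Random(seed)
    pools = defaultdict(list)
    for it in items:
        pools[key(it)].append(it)
    for p in pools.values():
        rng.shuffle(p)
    order, run_key, run_len = [], None, 0
    while any(pools.values()):
        allowed = [k for k, p in pools.items()
                   if p and (k != run_key or run_len < max_run)]
        if not allowed:
            raise RuntimeError("dead end; retry with another seed")
        # Prefer the largest remaining pool to avoid dead ends (random tie-break).
        k = max(allowed, key=lambda c: (len(pools[c]), rng.random()))
        order.append(pools[k].pop())
        run_len = run_len + 1 if k == run_key else 1
        run_key = k
    return order

# One run in Study 1: 40 negative, 40 neutral, 8 positive attention checks.
trials = [("negative", i) for i in range(40)] \
       + [("neutral", i) for i in range(40)] \
       + [("positive", i) for i in range(8)]
order = constrained_order(trials, key=lambda t: t[0])
```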

Participants received instructions through a recorded PowerPoint presentation, which explained the 9-point Self-Assessment Manikin (SAM) [13] rating scales for valence and arousal in Cantonese (see Fig 2). These instructions were adapted from the IAPS technical manual. For valence ratings, participants responded to the prompt, “This image is…” on a scale ranging from 1 (“very negative”) to 9 (“very positive”). For arousal ratings, participants responded to the prompt, “My reaction to this image is…” on a scale ranging from 1 (“weakly aroused”) to 9 (“highly aroused”). Before the main task, participants familiarized themselves with the procedure by completing three practice trials using images not included in the study.

Fig 2. Display of the SAM scale for Valence and Arousal.

Rating scale presentation. Left = valence, Right = arousal.

https://doi.org/10.1371/journal.pone.0342434.g002

The trial sequence for the image rating task is illustrated in Fig 3. Each trial began with a white fixation cross displayed for 2 seconds to orient participants’ attention, followed by the target image displayed for 2 seconds (700 x 700 pixels). Immediately afterward, a smaller version of the same image (500 x 500 pixels) appeared above the valence rating scale. After participants submitted their valence rating, the scale was replaced by the arousal rating scale for the second rating. Both valence and arousal ratings were self-paced and entered using the number keys on the upper-left corner of the keyboard. A 2-second blue fixation cross then appeared, signaling the end of the trial and preparing the participant for the next image. The whole procedure consisted of two runs, with each run comprising four blocks of trials. To minimize fatigue, participants were offered a self-paced break of at least one minute after completing each block. The study was conducted on standard PCs with 24-inch monitors and stimuli were presented using PsychoPy [50].

Fig 3. Trial Sequence of the Rating Procedure.

Ratings were provided for valence first, then arousal. Duration is shown in seconds. *** = self-paced duration.

https://doi.org/10.1371/journal.pone.0342434.g003

Procedure.

Participants visited the laboratory individually or in pairs. They first completed a behavioral task as part of the larger study and were then given the instructions for the image rating task. They were informed that there were no right or wrong answers and were encouraged to provide their honest responses when viewing the images. Upon completion of ratings for the first image set, participants took a mandatory 1-minute break before proceeding to rate the second image set. After completing the entire experiment, each participant received HKD $60 as compensation for their time.

Statistical analyses.

Descriptive statistics, including means and standard deviations, were calculated for each image. Inter-rater reliability of the ratings was assessed using the intra-class correlation coefficient (ICC). Scatterplots were generated to illustrate the relationships between valence and arousal ratings, allowing visualization of the bidimensional affective space and comparison of the current sample’s rating distributions with those of previous studies. Independent-samples t-tests were conducted to evaluate differences in valence and arousal ratings between neutral and negative stimuli. Pearson’s correlation coefficients (r) were calculated separately for neutral and negative images to clarify the relationship between valence and arousal ratings. Additionally, linear and quadratic regressions were performed to further investigate valence as a predictor of arousal. All statistical analyses were conducted using SPSS 21 and JASP version 0.18.3, and scatterplots were generated using RStudio.

Results

This study examined affective ratings for a total of 160 AI-generated images, comprising 80 neutral and 80 negative stimuli. Detailed ratings for all images are provided in S2 Table.

Data cleaning.

Given the self-paced nature of the rating task, responses with reaction times (RT) shorter than 200 ms were removed, as such brief response times indicate insufficient evaluation of the stimuli. Additionally, data from seven participants were excluded due to technical errors with the experimental software. Data from two additional participants were excluded because their average rating durations were excessively long (more than 3 SD above the group mean). These exclusions ensured that the analyzed sample had comparable exposure durations to the AI-generated images. Therefore, the final analyzed sample consisted of 65 participants (Female = 57, Mage = 20.7, SD = 1.94).
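For readers implementing a similar pipeline, the two exclusion rules can be sketched as follows. The data layout (participant id mapped to a list of (RT, rating) pairs) and the function name are illustrative, not the study’s actual code.

```python
import statistics

def clean_ratings(responses, min_rt=0.2, sd_cutoff=3.0):
    """Apply the two exclusion rules described above: drop individual
    responses faster than 200 ms, then drop participants whose mean
    rating duration exceeds the group mean by more than 3 SDs.

    `responses` maps participant id -> list of (rt_seconds, rating).
    A sketch only; field names and layout are assumptions.
    """
    # Rule 1: remove implausibly fast responses (< 200 ms).
    kept = {pid: [(rt, r) for rt, r in resp if rt >= min_rt]
            for pid, resp in responses.items()}
    # Rule 2: exclude participants with extreme mean rating durations.
    means = {pid: statistics.mean(rt for rt, _ in resp)
             for pid, resp in kept.items() if resp}
    grand = statistics.mean(means.values())
    sd = statistics.stdev(means.values())
    return {pid: resp for pid, resp in kept.items()
            if pid in means and means[pid] <= grand + sd_cutoff * sd}
```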

Rating reliability.

Inter-rater reliability for both valence and arousal ratings was assessed by computing ICC and their 95% confidence intervals, using a two-way mixed-effects model with consistency-agreement for multiple raters (ICC 3, k) [51]. The ICC values indicated excellent reliability for both valence (ICC = 0.993, 95% CI [0.991, 0.995]) and arousal (ICC = 0.953, 95% CI [0.942, 0.963]) ratings.
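ICC(3, k) can be computed directly from the two-way ANOVA mean squares of the images-by-raters rating matrix (Shrout and Fleiss convention). A dependency-free sketch for readers without SPSS or JASP; the confidence intervals reported above are not reproduced here.

```python
def icc_3k(x):
    """ICC(3, k): two-way mixed-effects, consistency, average of k raters.

    `x` is a list of rows, one per image (target), each containing one
    rating per rater. Plain-Python sketch of the standard formula
    ICC(3,k) = (MS_targets - MS_error) / MS_targets.
    """
    n, k = len(x), len(x[0])
    grand = sum(map(sum, x)) / (n * k)
    row_means = [sum(r) / k for r in x]
    col_means = [sum(x[i][j] for i in range(n)) / n for j in range(k)]
    ss_total = sum((v - grand) ** 2 for r in x for v in r)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows
```

With perfectly consistent raters (identical rank ordering, possibly offset), the consistency ICC equals 1 even though absolute ratings differ.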

Rating distribution.

Descriptive statistics for valence and arousal ratings across neutral and negative images are shown in Table 2. For neutral images, the mean valence rating was 5.81 (SD = 0.74, range: 4.02–7.42) and the mean arousal rating was 3.78 (SD = 0.64, range: 2.63–4.95). For negative images, the mean valence rating was 2.54 (SD = 0.58, range: 1.46–4.36) and the mean arousal rating was 5.14 (SD = 0.63, range: 3.67–6.77). Overall, valence ratings ranged from 1.46 to 7.42, indicating that some neutral images were perceived as relatively positive, though their overall mean valence (5.81) remained close to the midpoint of the scale. In contrast, arousal ratings had a narrower range, from 2.32 to 6.77.

Table 2. Descriptive statistics for valence and arousal ratings of images in Study 1 (n = 65).

https://doi.org/10.1371/journal.pone.0342434.t002

Relationship between valence and arousal.

Independent-samples t-tests were conducted to evaluate the differences in valence and arousal ratings between negative and neutral images. Degrees of freedom were adjusted when Levene’s test indicated unequal variances. Results revealed significant differences between neutral and negative stimuli in both valence ratings, t(149) = −30.94, p < .001, 95% CI [−3.47, −3.05] and arousal ratings, t(158) = 13.56, p < .001, 95% CI [1.16, 1.56]. Both differences indicated medium effect sizes (Cohen’s d = 0.66 for valence, d = 0.63 for arousal). These findings indicate that the AI-generated affective stimuli successfully elicited distinct responses in valence and arousal.

Pearson’s correlation coefficients were computed to further examine associations between valence and arousal ratings separately for both negative and neutral images. Among negative images, valence ratings correlated negatively with arousal (r = −.72, p < .001), indicating that images rated as more negative elicited higher arousal ratings. Conversely, for neutral images, valence ratings correlated positively with arousal (r = .65, p < .001), indicating that images rated as relatively more positive elicited higher arousal ratings.
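For reference, the per-category correlations above can be reproduced with a plain implementation of Pearson’s r over the per-image mean valence and arousal ratings (80 image pairs per category). A dependency-free sketch:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length
    sequences, e.g. per-image mean valence vs. mean arousal ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```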

Fig 4 shows the scatterplot of arousal versus valence ratings. The highest arousal ratings were associated with the lowest valence ratings (the most negative images) and with neutral images rated more positively. Given that our study did not include AI-generated positive stimuli, the upward trend on the positive side of the valence scale was less prominent. To statistically confirm this quadratic relationship, linear and quadratic regression analyses were performed, with mean valence scores and squared mean valence entered as predictors of arousal. Model comparisons indicated that the quadratic regression (R2 = 0.769) provided a substantially better fit than the linear regression model (R2 = 0.443).
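The linear-versus-quadratic comparison amounts to fitting degree-1 and degree-2 polynomials of mean valence to mean arousal and comparing R². A dependency-free sketch via the normal equations, shown on synthetic U-shaped data; in practice numpy.polyfit or the SPSS/JASP regression modules would be used.

```python
def polyfit_r2(xs, ys, degree):
    """Ordinary least-squares polynomial fit via the normal equations,
    returning R^2. Dependency-free sketch of the linear-vs-quadratic
    model comparison; not the study's actual analysis code.
    """
    n, m = len(xs), degree + 1
    # Normal equations A b = c for the Vandermonde system.
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    c = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for j in range(col, m):
                A[r][j] -= f * A[col][j]
            c[r] -= f * c[col]
    b = [0.0] * m
    for i in range(m - 1, -1, -1):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, m))) / A[i][i]
    preds = [sum(b[i] * x ** i for i in range(m)) for x in xs]
    ybar = sum(ys) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - ybar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Synthetic U-shaped data: arousal rises toward both valence extremes.
valence = list(range(1, 10))
arousal = [(v - 5) ** 2 for v in valence]
```

On such U-shaped data the quadratic model fits almost perfectly while the linear model explains essentially nothing, mirroring the R² gap reported above.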

thumbnail
Fig 4. Scatterplot illustrating the relationship between valence and arousal ratings for the 160 AI-generated images.

https://doi.org/10.1371/journal.pone.0342434.g004

Study 2

Method

As mentioned above, Study 1 participants completed the image ratings after performing a behavioral task that exposed them to the AI-generated images beforehand. To examine whether image ratings were consistent across independent samples, in addition to ruling out the potential influence of habituation effects, Study 2 recruited a new group of participants who had no prior exposure to the stimuli. Study 2 used a similar design and procedure as Study 1, with an additional set of 40 new AI-generated images that followed the same content combinations described in Study 1. Therefore, Study 2 included a total of 200 images (100 negative, 100 neutral). Table 3 presents the number of images included in each category.

Table 3. Number of images rated per category in Study 2.

https://doi.org/10.1371/journal.pone.0342434.t003

Participants.

A new sample of 87 participants (49 females; Mage = 20.9, SD = 1.89) was recruited using convenience sampling via university mass mail. Inclusion criteria matched Study 1 exactly: (1) aged between 18–25 years, (2) Cantonese-speaking, (3) able to read Traditional Chinese, and (4) normal or corrected-to-normal vision. Most participants were undergraduate students (83.9%), followed by postgraduate students (12.6%) and university staff members (3.5%). Three female participants were excluded because they had difficulty understanding the Cantonese instructions during the task, despite self-reporting as Cantonese speakers. The final sample therefore consisted of 84 participants.

To mitigate the risk of non-compliance (e.g., participants providing identical ratings across all images) and to reduce potential fatigue from rating a large number of images, participants were divided into two groups. Each group rated a subset of the images (Group 1: n = 43; 24 females, Mage = 21.1, SD = 1.67; Group 2: n = 41; 22 females, Mage = 20.6, SD = 1.84). Written informed consent was obtained from all participants. The experimental protocol was approved by the ethics committee of the Chinese University of Hong Kong (SBRE-22–0675).

Stimulus presentation and rating scales.

Study 2 included a total of 200 AI-generated images (100 negative and 100 neutral), which were divided into two sets (Set 1 and Set 2) of 100 images each (50 negative and 50 neutral). Group 1 rated the images in Set 1, and Group 2 rated the images in Set 2. As in Study 1, eight positive images were randomly interspersed within each set as attentional checks, resulting in 108 images per set. These images were pseudo-randomized into four blocks (27 images per block), ensuring that images from the same emotional valence or content category did not appear more than twice consecutively. Block presentation order was counterbalanced across participants using a Latin square design. Task instructions, practice trials, trial sequence and rating scales were identical to those in Study 1. The experiment concluded after participants completed ratings for all four image blocks. Stimuli were presented on standard PCs with 24-inch monitors using PsychoPy [50].
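The pseudo-randomization constraint (no more than two same-category images in a row) can be implemented by simple rejection sampling, as in the sketch below. The function, item format, and category labels are illustrative assumptions, not the authors' actual presentation script:

```python
import random

def pseudo_randomize(items, key, max_run=2, seed=0, max_tries=10_000):
    """Shuffle `items` until no more than `max_run` consecutive items
    share the category returned by `key`. Rejection-sampling sketch."""
    rng = random.Random(seed)
    order = list(items)
    for _ in range(max_tries):
        rng.shuffle(order)
        # Every window of max_run + 1 items must contain >1 category
        runs_ok = all(
            len({key(x) for x in order[i:i + max_run + 1]}) > 1
            for i in range(len(order) - max_run)
        )
        if runs_ok:
            return order
    raise RuntimeError("no valid order found within max_tries")

# Hypothetical block of 27 stimuli as (image_id, valence) tuples
stimuli = [(i, "negative" if i % 2 else "neutral") for i in range(27)]
block = pseudo_randomize(stimuli, key=lambda s: s[1])
```

Rejection sampling is adequate for short blocks like these; for stricter constraints (e.g., limits on both valence and content category at once) a constructive or backtracking scheme may be needed.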

Procedure.

The procedure was identical to Study 1, except for two key differences: (1) participants completed the rating task in a single run with 108 images, and (2) participants did not perform any other behavioral tasks prior to image rating. Participants received HKD $60 upon completion as compensation for their time.

Statistical analyses.

The statistical analyses were identical to those performed in Study 1, conducted separately for each participant group.

Results

This study examined affective ratings for a total of 200 AI-generated images, comprising 100 neutral and 100 negative stimuli. Detailed ratings for all images are provided in S3 Table.

Data cleaning.

Similar to Study 1, rating responses with RTs shorter than 200 ms were excluded, as these indicated inadequate time for proper judgment. Additionally, data from three participants were excluded because their average response times exceeded 3 SD above the group mean. The final analyzed samples were therefore N = 43 for Group 1 (24 females, Mage = 21.2, SD = 1.68) and N = 38 for Group 2 (22 females, Mage = 20.6, SD = 1.84).
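The two exclusion rules can be expressed as a short pandas routine. The column names (`rt` in seconds, `participant`) are hypothetical, and this is a sketch of the rules as described, not the authors' cleaning script:

```python
import pandas as pd

def clean_ratings(df, rt_col="rt", id_col="participant"):
    """Apply two exclusion rules: (1) drop trials with RT < 0.2 s;
    (2) drop participants whose mean RT exceeds the group mean of
    per-participant mean RTs by more than 3 SD."""
    df = df[df[rt_col] >= 0.2].copy()        # rule 1: too-fast trials
    means = df.groupby(id_col)[rt_col].mean()
    cutoff = means.mean() + 3 * means.std()  # rule 2: slow outliers
    keep = means[means <= cutoff].index
    return df[df[id_col].isin(keep)]
```

Note that with a single extreme outlier in a small sample, the outlier inflates the SD and can mask itself; robust cutoffs (e.g., median absolute deviation) are a common alternative.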

Rating reliability.

Inter-rater reliability was assessed separately for each group by computing ICCs and their 95% confidence intervals using a two-way mixed-effects model based on consistency for multiple raters (ICC 3, k) [51]. For Group 1, ICC values indicated excellent reliability for valence (ICC = 0.988, 95% CI [0.984, 0.991]) and arousal (ICC = 0.935, 95% CI [0.916, 0.952]). Similarly, Group 2 demonstrated excellent reliability for valence (ICC = 0.989, 95% CI [0.985, 0.992]) and arousal (ICC = 0.960, 95% CI [0.948, 0.970]).
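ICC(3, k) can be computed directly from the two-way ANOVA mean squares of an images × raters matrix, following Shrout and Fleiss [51]. The implementation below is an illustrative sketch (point estimate only, no confidence interval), not the analysis script used in the study:

```python
import numpy as np

def icc3k(ratings):
    """ICC(3, k) for consistency under a two-way mixed model
    (Shrout & Fleiss, 1979): (MS_targets - MS_error) / MS_targets.
    `ratings` is an (n_targets, k_raters) array with no missing data."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-image means
    col_means = ratings.mean(axis=0)   # per-rater means
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows
```

Because ICC(3, k) measures consistency, a constant offset between raters (e.g., one rater systematically scoring one point higher) does not reduce the coefficient.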

Rating distribution.

Tables 4 and 5 show descriptive statistics for valence and arousal ratings collected from Group 1 and 2, respectively. For Group 1, negative images had a mean valence rating of 2.73 (SD = 0.60, range: 1.55–4.49) and a mean arousal rating of 5.10 (SD = 0.75, range: 3.80–7.12). Neutral images had a mean valence rating of 5.71 (SD = 0.66, range: 4.56–7.12) and mean arousal rating of 3.65 (SD = 0.61, range: 2.54–5.08).

Table 4. Descriptive statistics for valence and arousal ratings of images evaluated by Group 1 (n = 43) in Study 2.

https://doi.org/10.1371/journal.pone.0342434.t004

Table 5. Descriptive statistics for valence and arousal ratings of images evaluated by Group 2 (n = 38) in Study 2.

https://doi.org/10.1371/journal.pone.0342434.t005

For Group 2, negative images had a mean valence rating of 2.65 (SD = 0.56, range: 1.76–4.16) and mean arousal rating of 5.24 (SD = 0.85, range: 3.50–7.18). The neutral images had a mean valence rating of 5.68 (SD = 0.81, range: 3.61–7.76) and mean arousal rating of 3.16 (SD = 0.76, range: 2.09–4.74).

Overall, valence ratings from both groups indicated that some neutral images received ratings toward the positive end of the scale (Group 1 max = 7.12, Group 2 max = 7.76). However, the mean valence for neutral images (Group 1, M = 5.71, Group 2, M = 5.68) remained near the midpoint of the scale. Arousal ratings showed somewhat narrower ranges, spanning from 2.54 to 7.12 for Group 1 and from 2.09 to 7.18 for Group 2.

Relationship between valence and arousal.

Independent-samples t-tests were conducted to evaluate the differences in valence and arousal ratings between negative and neutral images. Degrees of freedom were adjusted when Levene’s test indicated unequal variances. Results revealed significant differences between negative and neutral images for both valence ratings (Group 1: t(98) = −23.60, p < .001, 95% CI [−3.22, −2.72], Cohen’s d = 0.63; Group 2: t(87.6) = −21.74, p < .001, 95% CI [−3.30, −2.75], Cohen’s d = 0.70) and arousal ratings (Group 1: t(98) = 10.59, p < .001, 95% CI [1.18, 1.72], Cohen’s d = 0.68; Group 2: t(98) = 12.94, p < .001, 95% CI [1.76, 2.40], Cohen’s d = 0.80). Effect sizes of these comparisons ranged from medium to large. Thus, these results indicate that the AI-generated affective stimuli successfully elicited distinct differences in valence and arousal ratings across two independent participant samples.
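The testing procedure above can be sketched with SciPy: Levene's test chooses between Student's and Welch's t-test, and Cohen's d is computed with the pooled SD (one common convention). Function and variable names are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def compare_groups(neg, neu, alpha=0.05):
    """Independent-samples t-test, switching to the Welch correction
    when Levene's test indicates unequal variances; returns
    (t, p, Cohen's d). Sketch of the analysis described above."""
    neg, neu = np.asarray(neg, float), np.asarray(neu, float)
    _, p_levene = stats.levene(neg, neu)
    equal_var = p_levene >= alpha          # adjust df only if unequal
    t, p = stats.ttest_ind(neg, neu, equal_var=equal_var)
    # Cohen's d with pooled standard deviation
    n1, n2 = len(neg), len(neu)
    pooled = np.sqrt(((n1 - 1) * neg.var(ddof=1) +
                      (n2 - 1) * neu.var(ddof=1)) / (n1 + n2 - 2))
    d = (neg.mean() - neu.mean()) / pooled
    return t, p, d
```

With per-image mean ratings as the unit of analysis (100 negative vs. 100 neutral images), `equal_var=True` reproduces the t(98) results, while the Welch branch yields fractional degrees of freedom such as t(87.6).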

Pearson’s correlations were calculated separately for negative and neutral images within each group. For Group 1, neutral images showed a significant positive correlation between valence and arousal ratings (r = .64, p < .001), indicating that neutral images perceived as more positive were also more arousing. Conversely, negative images showed a significant negative correlation between valence and arousal ratings (r = −.70, p < .001), indicating that images perceived as more negative were more arousing.

Group 2 revealed a similar pattern: neutral images showed a significant positive correlation between valence and arousal ratings (r = .53, p < .001), indicating that neutral images perceived as more positive were also more arousing. Negative images showed a strong negative correlation (r = −.87, p < .001), suggesting that more negative images were perceived as more arousing.

Scatterplots depicting the relationship between valence and arousal ratings for both groups are presented in Fig 5. As expected, both groups showed the highest arousal ratings for images at the lowest valence end (the most negative images) and for neutral images rated toward the positive end of the valence scale.

Fig 5. Scatterplots illustrating the relationship between valence and arousal ratings for the 200 AI-generated images.

Each group rated a separate set of 100 images (50 negative and 50 neutral). (A) Ratings from Group 1 (n = 43); (B) Ratings from Group 2 (n = 38).

https://doi.org/10.1371/journal.pone.0342434.g005

To statistically confirm the quadratic relationship between valence and arousal, linear and quadratic regression analyses were performed separately for each group, with mean valence and squared mean valence entered as predictors of arousal. Results indicated a better model fit for quadratic regression compared to linear regression in both groups: Group 1 (quadratic R2 = 0.748 vs. linear R2 = 0.477) and Group 2 (quadratic R2 = 0.865 vs. linear R2 = 0.552).

Discussion

This study aimed to assess the feasibility and effectiveness of using AI-generated negative and neutral naturalistic scene stimuli in affective research. Using three text-to-image AI generation models, we developed a set of 200 images (100 negative and 100 neutral), carefully controlling the combination of two content categories to meet the specific experimental criteria. We collected valence and arousal ratings from multiple participant samples to explore three key questions: (1) whether AI-generated images effectively evoke emotional responses, (2) whether these images show affective rating patterns similar to those of previously validated standardized databases, and (3) whether AI image generation can be tailored effectively to specific experimental designs. We discuss these points below.

The findings from both Study 1 and Study 2 indicate that AI-generated negative and neutral images elicited distinct ratings for valence and arousal. Consistently across both studies and all participant samples, negative images were rated significantly lower in valence (more negative) and higher in arousal compared to neutral images. Moreover, valence and arousal ratings exhibited differing correlations depending on image category: negative images showed a negative correlation (lower valence associated with higher arousal), whereas neutral images demonstrated a positive correlation (higher valence associated with higher arousal). Despite the potential influence of habituation in Study 1, the consistency of the results from Study 1 and Study 2 demonstrates the robustness of the affective images generated by AI. Importantly, the relationship between valence and arousal ratings in both studies conformed to the expected quadratic distribution, specifically within the range of negative to neutral valence. These findings align closely with previously reported patterns for negative and neutral stimuli from normative affective image databases (e.g., [2,10,11]) and in rating studies using standardized stimuli (e.g., [1]).

The mean valence and arousal ratings of our AI-generated negative and neutral images were comparable to those reported in existing affective databases (e.g., [9,37,52]). Specifically, across our three participant samples, mean valence ratings for negative images ranged from 2.54 to 2.73, and mean arousal ratings ranged from 5.10 to 5.24. Neutral images generally clustered around the midpoint of the valence scale; however, some neutral images received higher-than-expected valence ratings (up to 7.76 in Study 2). We suspect that these elevated ratings reflect participants’ relative judgments, influenced by the substantial proportion of negative stimuli and the very limited number of positive images (included only as attentional checks). Participants were broadly informed that the images could elicit various emotions, which may have encouraged them to use the full range of the SAM rating scales, resulting in some neutral images being perceived as slightly positive. Nevertheless, the overall mean valence and arousal ratings of the negative and neutral images closely align with normative data from established affective databases [10,37]. These findings support and extend recent research demonstrating that AI-generated stimuli can indeed elicit distinct emotional responses (e.g., [28,36]).

Our findings suggest that AI text-to-image generators can be effectively employed to develop stimuli that precisely match specific experimental requirements and content constraints. In the present study, we targeted six combinations of relatively common content categories (animals, people, vehicles and trees). However, it remains unclear whether AI generators alone can reliably produce large and sufficiently varied sets of images depicting highly specialized or emotionally explicit content (e.g., war, violence) that may be necessary for certain research contexts (e.g., PTSD in war veterans). Additionally, some freely accessible AI platforms restrict generation of explicit negative content, such as violence, gore, or abuse. While offline or specialized AI models could potentially bypass such limitations, the extent to which these models can generate varied and usable content remains an area requiring further exploration.

While AI-generated images can be created rapidly (often in less than a minute), the initial outputs frequently require refinement, as generated images may not align perfectly with the intended emotional or content specifications. Iterative refinement of text prompts or manual editing with conventional image-editing software can be time-consuming. Moreover, reproducing an image from an identical prompt is difficult on free generative AI platforms, which offer minimal control over generation parameters or random seed values; repeating the same prompt will not yield an identical image. We acknowledge that this is not ideal for researchers who wish to share their text prompts so that others can reproduce the same images. However, the text prompts should still be useful for generating similarly themed images.

As mentioned earlier, our AI-generated images were intended solely to elicit emotion in standard affective research tasks. Considering that the presentation times are often fairly brief (e.g., a few seconds as an emotional distractor), we did not extensively edit the images to enhance realism. With the rapid advancement of AI-generation models, we expect that the translation of the prompts to the generated image will become increasingly accurate and show significant improvements in realism over time, reducing the efforts required for additional manual editing. Therefore, despite certain disadvantages in reproducibility and the realism of generated outputs, AI text-to-image generators remain an attractive alternative for researchers seeking novel or additional stimuli, particularly when existing databases lack suitable content or when new stimuli are needed to mitigate habituation effects.

Several limitations of the current study should be noted. First, due to practical constraints, positive images were not generated or rated. Consequently, our evaluation was restricted to the negative and neutral spectrum, preventing us from determining whether AI-generated positive images would replicate the typical U-shaped distribution observed in traditional affective databases. Second, we did not systematically assess whether participants were aware of the AI-generated nature of the stimuli. Although we lacked a formal protocol, anecdotal reports indicated that a minority of participants (Study 1: n = 11; Study 2: n = 19) identified the images as AI-generated. We have included a descriptive comparison of valence and arousal ratings between these participants and the remaining sample in the Supporting Information (see S4 Table). While these groups appear to show similar ratings, we refrained from formal statistical analysis due to the unsystematic data collection and small sample size of the “aware” group. The question of how knowledge of AI origins influences emotional response remains an open area of inquiry. Recent literature suggests that while awareness of AI origins may create an implicit bias, such as altered fixation durations [53] or physiological responses [54], these effects do not necessarily extend to explicit subjective ratings of valence and arousal. Although our preliminary observations align with this literature, suggesting that subjective ratings remain robust despite AI awareness, future research should systematically investigate how explicit knowledge of image origin influences emotional processing.

It is important to note that AI-generated images are not intended to replace the value of standardized affective stimuli databases, which provide large-scale normative rating data. Rather, they serve as a powerful complementary tool. Because normative data for AI-generated stimuli are not yet widely available, we recommend that researchers conduct pilot ratings with an independent sample and collect ratings from the experimental sample itself to confirm that the stimuli elicit the intended emotional responses.

In summary, this study demonstrates the substantial potential of AI text-to-image generation for stimulus development in affective research. Our findings provide an initial validation, indicating that AI-generated negative and neutral images elicit emotional responses and exhibit valence-arousal patterns that closely resemble those from standardized databases. As generative AI technologies advance, they will likely facilitate the efficient creation of tailored, high-quality stimuli not only in visual domains but also across audio, video and textual modalities. Ultimately, the adoption of AI-generated stimuli can substantially streamline stimuli development, reduce the burden of stimulus selection and enhance methodological flexibility in affective research.

Supporting information

S1 Table. Example prompts and generated outputs from Adobe Firefly, Stable Diffusion and Leonardo.ai.

Example prompts were used in the three AI platforms in early 2024. Newer AI-generative models will produce different outputs when using the above parameters and text prompts.

https://doi.org/10.1371/journal.pone.0342434.s001

(DOCX)

S2 Table. Valence and arousal ratings of images in Study 1 (n = 65).

PT = people-tree, PA = people-animal, PV = people-vehicle, TA = tree-animal, TV = tree-vehicle, VA = vehicle-animal.

https://doi.org/10.1371/journal.pone.0342434.s002

(XLSX)

S3 Table. Valence and arousal ratings of images in Study 2 (set 1, n = 43; set 2, n = 38).

PT = people-tree, PA = people-animal, PV = people-vehicle, TA = tree-animal, TV = tree-vehicle, VA = vehicle-animal.

https://doi.org/10.1371/journal.pone.0342434.s003

(XLSX)

S4 Table. Descriptive statistics for valence and arousal ratings for participants who noticed the AI-generated images (“Aware”) versus those who did not (“Unaware”) in Study 1 and Study 2.

M and SD = mean and standard deviation, respectively.

https://doi.org/10.1371/journal.pone.0342434.s004

(DOCX)

References

  1. 1. Blekić W, Kandana Arachchige K, Wauthia E, Simoes Loureiro I, Lefebvre L, Rossignol M. Affective Ratings of Pictures Related to Interpersonal Situations. Front Psychol. 2021;12:627849. pmid:33613402
  2. 2. Dan-Glauser ES, Scherer KR. The Geneva affective picture database (GAPED): a new 730-picture database focusing on valence and normative significance. Behav Res Methods. 2011;43(2):468–77. pmid:21431997
  3. 3. Zupan B, Babbage DR. Film clips and narrative text as subjective emotion elicitation techniques. J Soc Psychol. 2017;157(2):194–210. pmid:27385591
  4. 4. Coan JA, Allen JJB. Handbook of emotion elicitation and assessment. Oxford University Press. 2007.
  5. 5. Soares AP, Pinheiro AP, Costa A, Frade CS, Comesaña M, Pureza R. Adaptation of the International Affective Picture System (IAPS) for European Portuguese. Behav Res Methods. 2015;47(4):1159–77. pmid:25381023
  6. 6. Hinojosa JA, Carretié L, Valcárcel MA, Méndez-Bértolo C, Pozo MA. Electrophysiological differences in the processing of affective information in words and pictures. Cogn Affect Behav Neurosci. 2009;9(2):173–89. pmid:19403894
  7. 7. Li Q, Zhao Y, Gong B, Li R, Wang Y, Yan X, et al. Visual Affective Stimulus Database: A Validated Set of Short Videos. Behav Sci (Basel). 2022;12(5):137. pmid:35621434
  8. 8. Grimaldos J, Duque A, Palau-Batet M, Pastor MC, Bretón-López J, Quero S. Cockroaches are scarier than snakes and spiders: Validation of an affective standardized set of animal images (ASSAI). Behav Res Methods. 2021;53(6):2338–50. pmid:33826093
  9. 9. Marchewka A, Zurawski Ł, Jednoróg K, Grabowska A. The Nencki Affective Picture System (NAPS): introduction to a novel, standardized, wide-range, high-quality, realistic picture database. Behav Res Methods. 2014;46(2):596–610. pmid:23996831
  10. 10. Lang PJ, Bradley MM, Cuthbert BN. International Affective Picture System (IAPS): Instruction manual and affective ratings. A-8. The Center for Research in Psychophysiology, University of Florida. 2008.
  11. 11. Kurdi B, Lozano S, Banaji MR. Introducing the Open Affective Standardized Image Set (OASIS). Behav Res Methods. 2017;49(2):457–70. pmid:26907748
  12. 12. Mauss IB, Robinson MD. Measures of emotion: A review. Cogn Emot. 2009;23(2):209–37. pmid:19809584
  13. 13. Bradley MM, Lang PJ. Measuring emotion: the Self-Assessment Manikin and the Semantic Differential. J Behav Ther Exp Psychiatry. 1994;25(1):49–59. pmid:7962581
  14. 14. Barke A, Stahl J, Kröner-Herwig B. Identifying a subset of fear-evoking pictures from the IAPS on the basis of dimensional and categorical ratings for a German sample. J Behav Ther Exp Psychiatry. 2012;43(1):565–72. pmid:21839700
  15. 15. Lodha S, Gupta R. IAPS in India: A Cross-cultural Validation Study of Highly Arousing Emotional Pictures. Psychology and Developing Societies. 2024;36(1):52–78.
  16. 16. Zhao D, Lin H, Xie S, Liu Z. Emotional arousal elicited by irrelevant stimuli affects event-related potentials (ERPs) during response inhibition. Physiol Behav. 2019;206:134–42. pmid:30954488
  17. 17. Tae J, Nam Y-E, Lee Y, Weldon RB, Sohn M-H. Neutral but not in the middle: cross-cultural comparisons of negative bias of “neutral” emotional stimuli. Cogn Emot. 2020;34(6):1171–82. pmid:32102595
  18. 18. Meesters A, Vancleef LMG, Peters ML. The role of cognitive and affective flexibility in individual differences in the experience of experimentally induced heat pain. Journal of Experimental Psychopathology. 2021;12(2).
  19. 19. Raschle NM, Fehlbaum LV, Menks WM, Euler F, Sterzer P, Stadler C. Investigating the Neural Correlates of Emotion-Cognition Interaction Using an Affective Stroop Task. Front Psychol. 2017;8:1489. pmid:28919871
  20. 20. Baumeister RF, Bratslavsky E, Finkenauer C, Vohs KD. Bad is Stronger than Good. Review of General Psychology. 2001;5(4):323–70.
  21. 21. Rozin P, Royzman EB. Negativity Bias, Negativity Dominance, and Contagion. Pers Soc Psychol Rev. 2001;5(4):296–320.
  22. 22. Vaish A, Grossmann T, Woodward A. Not all emotions are created equal: the negativity bias in social-emotional development. Psychol Bull. 2008;134(3):383–403. pmid:18444702
  23. 23. Quiñones-Camacho LE, Wu R, Davis EL. Motivated attention to fear-related stimuli: Evidence for the enhanced processing of fear in the late positive potential. Motiv Emot. 2018;42(2):299–308.
  24. 24. Dudarev V, Wardell V, Enns JT, Kerns CM, Palombo DJ. Social cues tip the scales in emotional processing of complex pictures. Psychol Res. 2024;88(8):2221–33. pmid:39167127
  25. 25. Bradley MM, Hamby S, Löw A, Lang PJ. Brain potentials in perception: picture complexity and emotional arousal. Psychophysiology. 2007;44(3):364–73. pmid:17433095
  26. 26. Miao F, Holmes W. Guidance for generative AI in education and research. Paris: UNESCO. 2023.
  27. 27. Dehouche N, Dehouche K. What’s in a text-to-image prompt? The potential of stable diffusion in visual arts education. Heliyon. 2023;9(6):e16757. pmid:37292268
  28. 28. Demmer TR, Kühnapfel C, Fingerhut J, Pelowski M. Does an emotional connection to art really require a human artist? Emotion and intentionality responses to AI- versus human-created art and impact on aesthetic experience. Computers in Human Behavior. 2023;148:107875.
  29. 29. Wang Y, Sun Y. The Relevance of Emotional AI-Generated Painting to the Painting Subject and Main Colors. Lecture Notes in Computer Science. Springer Nature Switzerland. 2023. p. 390–9. https://doi.org/10.1007/978-3-031-48044-7_28
  30. 30. Alzahrani A. The acceptability and validity of AI-generated psycholinguistic stimuli. Heliyon. 2025;11(2):e42083. pmid:39906842
  31. 31. Nightingale SJ, Farid H. AI-synthesized faces are indistinguishable from real faces and more trustworthy. Proc Natl Acad Sci U S A. 2022;119(8):e2120481119. pmid:35165187
  32. 32. Miller EJ, Steward BA, Witkower Z, Sutherland CAM, Krumhuber EG, Dawel A. AI Hyperrealism: Why AI Faces Are Perceived as More Real Than Human Ones. Psychol Sci. 2023;34(12):1390–403. pmid:37955384
  33. 33. Valuch C, Pelowski M, Peltoketo V-T, Hakala J, Leder H. Let’s put a smile on that face-A positive facial expression improves aesthetics of portrait photographs. R Soc Open Sci. 2023;10(10):230413. pmid:37885994
  34. 34. Tassinari M. Validating AI-Generated Stimuli for Assessing Implicit Weight Bias. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering. Springer Nature Switzerland. 2025. p. 191–201. https://doi.org/10.1007/978-3-031-97254-6_14
  35. 35. OpenAI. DALL-E (Version 2). https://openai.com/dall-e-2. 2023.
  36. 36. Azuaje G, Liew K, Buening R, She WJ, Siriaraya P, Wakamiya S, et al. Exploring the use of AI text-to-image generation to downregulate negative emotions in an expressive writing application. R Soc Open Sci. 2023;10(1):220238. pmid:36636309
  37. 37. Lang PJ, Bradley MM, Fitzsimmons JR, Cuthbert BN, Scott JD, Moulder B, et al. Emotional arousal and activation of the visual cortex: An fMRI analysis. Psychophysiology. 1998;35(2):199–210.
  38. 38. Bradley MM, Codispoti M, Cuthbert BN, Lang PJ. Emotion and motivation I: Defensive and appetitive reactions in picture processing. Emotion. 2001;1(3):276–98.
  39. 39. Mazza M, Tempesta D, Pino MC, Catalucci A, Gallucci M, Ferrara M. Regional cerebral changes and functional connectivity during the observation of negative emotional stimuli in subjects with post-traumatic stress disorder. Eur Arch Psychiatry Clin Neurosci. 2013;263(7):575–83. pmid:23385487
  40. 40. Dillon DG, Pizzagalli DA. Evidence of successful modulation of brain activation and subjective experience during reappraisal of negative emotion in unmedicated depression. Psychiatry Res. 2013;212(2):99–107. pmid:23570916
  41. 41. Trettin M, Dvořák J, Hilke M, Wenzler S, Hagen M, Ghirmai N, et al. Neuronal response to high negative affective stimuli in major depressive disorder: An fMRI study. J Affect Disord. 2022;298(Pt A):239–47. pmid:34728281
  42. 42. Hamrick HC, Hager NM, Middlebrooks MS, Mach RJ, Abid A, Allan NP, et al. Social concerns about anxious arousal explain the association between neural responses to anxious arousal pictures and social anxiety. Biol Psychol. 2024;185:108718. pmid:37951347
  43. 43. Chiu HT, Lam YW, Ng CSF, Sou HI, Wong SWH. Testing the association of affective flexibility with autistic traits and emotion regulation using an adapted Flexible Item Selection Task. Center for Open Science. 2024. https://doi.org/10.31219/osf.io/fd5kv
  44. 44. Katahira K, Fujimura T, Okanoya K, Okada M. Decision-making based on emotional images. Front Psychol. 2011;2:311. pmid:22059086
  45. 45. Hess TM, Popham LE, Growney CM. Age-Related Effects on Memory for Social Stimuli: The Role of Valence, Arousal, and Emotional Responses. Exp Aging Res. 2017;43(2):105–23. pmid:28230420
  46. 46. Conzelmann A, McGregor V, Pauli P. Emotion regulation of the affect-modulated startle reflex during different picture categories. Psychophysiology. 2015;52(9):1257–62. pmid:26061976
  47. 47. Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv. 2022.
  48. 48. Adobe Firefly. https://www.adobe.com/products/firefly.html 2025. 2024 July 22.
  49. 49. Leonardo Interactive Pty Ltd. Leonardo.Ai. https://docs.leonardo.ai/docs/elements-and-model-compatibility 2025 May 6.
  50. 50. Peirce JW. PsychoPy--Psychophysics software in Python. J Neurosci Methods. 2007;162(1–2):8–13. pmid:17254636
  51. 51. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86(2):420–8. pmid:18839484
  52. 52. Grühn D, Scheibe S. Age-related differences in valence and arousal ratings of pictures from the International Affective Picture System (IAPS): do ratings become more extreme with age?. Behav Res Methods. 2008;40(2):512–21. pmid:18522062
  53. 53. Zhou Y, Kawabata H. Eyes can tell: Assessment of implicit attitudes toward AI art. Iperception. 2023;14(5):20416695231209846. pmid:38022746
  54. 54. Bilucaglia M, Casiraghi C, Bruno A, Chiarelli S, Fici A, Russo V, et al. Emotional Reactions To AI-Generated Images: A Pilot Study Using Neurophysiological Measures. Lecture Notes in Computer Science. Springer Nature Switzerland. 2025. p. 147–61. https://doi.org/10.1007/978-3-031-82487-6_11