Validation of the C.A.R.E. stimulus set of 640 animal pictures: Name agreement and quality ratings

Stimulus sets are valuable tools that can facilitate the work of researchers designing experiments. Images of faces and line drawings of objects have been developed and validated; however, pictures of animals that do not contain backgrounds have not been made available. Here we present image agreement and quality ratings for a set of 640 color images of animals on a transparent background, across 60 different basic categories (e.g. cat, dog, frog, bird), some with few and others with many exemplars. These images were normed on 302 participants. Image agreement was measured both with respect to the proportion of participants that provided the same name and with the H-statistic for each image. Image quality was measured both overall and with respect to the accuracy of participants’ naming of the basic category. Word frequency of each basic and superordinate category, based on the Kucera & Francis (1976) and HAL frequency norms available through the English Lexicon Project (Balota et al., 2007), is provided, as are Age of Acquisition (Kuperman, Stadthagen-Gonzalez, & Brysbaert, 2012) data.


Introduction
The development of sets of stimulus images for experimental research has often been a lab-by-lab endeavor. When these stimulus sets are standardized, they can be a powerful resource for other scientists to use in experimental settings. Stimulus sets of line drawings [1][2][3], faces [4,5] and objects [6][7][8], and corpora of images (e.g. SUN database and ImageNet), have proven extremely useful to the scientific community. These have been used in studies ranging from examinations of amygdala responses to medication in depressed patients [9] to studies of object categorization in children [10] and adults [11], validations of memory models [12], psycholinguistic studies [13], and studies examining computer vision [14,15] and machine learning [16]. These normed and validated stimulus sets often garner thousands of citations, suggesting that the scientific community values the sharing of stimulus resources. The stimulus sets that are available are generally composed of faces, line drawings, or images of objects that represent animate human and inanimate categories. Although images or line drawings of animals are sometimes included in databases, the number of exemplars is generally very small. In looking through different lists of stimulus databases (e.g. cogsci.nl; cs.cmu.edu), only 4 of more than 40 stimulus databases contained images of animals. Three of these databases contained multiple exemplars of one animal (e.g. 10,000 cats, 619 butterflies and 600 birds in natural backgrounds), while the other contained approximately 30 images of animals and insects, with only one exemplar of each animal. A database composed only of animal pictures that contains multiple exemplars across multiple categories can be useful across a variety of experimental settings. In our own research, this database was developed to test the accuracy of temporal visual detection for specific exemplars within categories (e.g. find this cat among cats), as well as outside of different types of categories (e.g. find a cat among other four-legged animals). This database could be relevant to the study of memory, attention and categorization, and offers many opportunities for developmental studies of various kinds (e.g. across memory, perception, attention and language), as animals are tokens that are often in the vocabulary of very young children. For example, the MacArthur-Bates Communicative Development Inventories [17] for words and gestures and for words and sentences assess the receptive and expressive vocabulary of young children (between 8 and 18 months of age and between 16 and 30 months, respectively). Both of these inventories contain many words (400 for words and gestures, 680 for words and sentences) that children are likely to understand and say, and approximately 10% of these lists represent animal names and sounds, suggesting that they make up a fair portion of typically developing infants' and toddlers' vocabularies. Thus, a stimulus set of multiple animal images has broad applicability both across scientific disciplines and across developmental ages.
Here we present normative data on a large set of animal pictures (N = 640) across more than 50 basic and eight different superordinate categories. Fig 1 presents a sample of images that were all used and modified with permission. The pictures were culled from multiple databases, photoshopped onto white backgrounds and can be manipulated for size, making them flexible for use in multiple experimental settings.

Approaches to validation
To validate these stimuli, we used approaches similar to [1,18] and tested a sample of approximately 300 undergraduate participants. Name agreement was defined as the degree to which participants agreed on the name of the pictured animal [1,18]. It was measured by assessing the proportion of participants who provided the same name for each image, and was computed for the modal name as well as for the next two most common referents. Name agreement affects naming independently of other attributes, including word frequency [19,20]. Further, name agreement is a strong predictor of naming difficulty, as images that elicit multiple names are more difficult to identify than those for which there is greater naming consensus [1,13,18]. Research has shown that images with high name agreement are identified more quickly and accurately than those with low name agreement [21,22].
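The proportion-based measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function name `name_agreement` and the example responses are hypothetical, and the sketch assumes responses have already been cleaned (for instance, trimmed and lowercased) before counting.

```python
from collections import Counter

def name_agreement(responses, top_n=3):
    """Return the top_n most common names for one image, each paired with
    the proportion of responses that used that name (modal name first)."""
    counts = Counter(responses)
    total = len(responses)
    return [(name, count / total) for name, count in counts.most_common(top_n)]

# Hypothetical responses for a single image (not taken from the published data).
example = ["cat"] * 7 + ["kitten"] * 2 + ["tabby"]
print(name_agreement(example))
```

The first tuple gives the modal name and its agreement proportion; the next two correspond to the second and third most common referents.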
In addition to naming agreement, we also provide the H-statistic for each image, which is considered a more sensitive measure of name agreement than percentages because it provides more information about the distribution of names across participants [1]. That is, if participants provide the dominant name for two different images 70% of the time, but one image also elicits 3 other names while the other image elicits only 1 other name, then the percent agreements will be equivalent, but the H-statistic will be higher in the former than in the latter. Values close to or at 0 reflect high name agreement, whereas higher values reflect poorer name agreement.
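As an illustrative calculation, take H to be the sum over names of p log2(1/p), where p is the proportion of participants giving each name. The splits among the alternative names are assumed purely for this example and do not come from the data: suppose the three alternative names for the first image are each given by 10% of participants, and the single alternative name for the second image is given by the remaining 30%.

```latex
\begin{align*}
H_{\text{four names}} &= 0.7\log_2\tfrac{1}{0.7} + 3 \times 0.1\log_2\tfrac{1}{0.1} \approx 0.36 + 1.00 = 1.36 \\
H_{\text{two names}}  &= 0.7\log_2\tfrac{1}{0.7} + 0.3\log_2\tfrac{1}{0.3} \approx 0.36 + 0.52 = 0.88
\end{align*}
```

Both hypothetical images have 70% agreement on the dominant name, yet the image whose remaining responses are spread over more names receives the larger H value.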
Providing stimuli with both high and low naming agreement offers researchers the opportunity to modulate complexity either between or within tasks using a uniform set of pictures. To further aid in the selection of images appropriate for a variety of research purposes, we also report word frequency for both the basic and subordinate categories from both the Kucera & Francis (1976) and HAL norms, obtained through the English Lexicon Project [23]. These values can help researchers select either common or infrequent pictures based on their specific research questions. Word frequency has been linked by some to both visual orienting to images [24] and naming speed ([19,21,25], but see [26,27]). Additionally, we provide age of acquisition (AoA) information, which represents the age at which adults think they learned a word [28]. AoA has been related to the actual ages of acquisition of words and has been shown to correlate highly with lexical decision time [28].
Participants were also asked to rate each image with respect to its fitness as an exemplar of the word they used to describe the image. This provides a metric of the relationship between an image and its mental representation and is intended as a proxy for quality. Images with high ratings correspond closely to an individual's mental image of what a particular stimulus looks like, which might affect ease of recognition [21]. We did not ask participants to imagine the animal before rating its representativeness as an exemplar, but rather asked participants whether the image they saw was a good representation of the name they gave that animal. The rating data are presented both overall, that is, irrespective of whether participants correctly identified the basic category, and for correct responses only.
Finally, we provide the size of each image as an objective measure of visual complexity [29]. This measure has been shown to correlate with subjective measures of complexity and to affect picture naming accuracy, while being uncorrelated with RT, word frequency and age of acquisition.

Materials and methods
Participants and procedure
A total of 302 healthy college students (197 females) participated in the study. Participants were on average 19.4 years old (SD = 3.16) and predominantly right-handed (297/302). All participants were over 18 years of age and provided written consent once they had read the description of the study and had their questions answered. The consent form and project had been approved by the ethics review board at Syracuse University. Participants were seated approximately 2 feet away from a Dell computer screen (1920 × 1080 pixels, 60 Hz refresh rate), and the task was run on a Mac Mini (2.4 GHz). The task was presented via MATLAB [30] and programmed using Stream [31], an interface that uses the Psychophysics Toolbox [32]. Images were resized to be no greater than 500 × 500 pixels and were presented at the center of the screen one at a time in a self-paced manner, with breaks every 25 trials. Each participant viewed 325 distinct images (a subset of the original 650 total images), yielding approximately 150 ratings per image (+/-2).
Images were presented one at a time, and participants were asked to type the name of the animal they saw (name agreement task) or to type 'don't know' or 'not sure' (0.23% of all observations). They were not directed to the basic, superordinate or subordinate category, and thus their responses reflect the word they most closely associated with that image. After participants finished typing their answer, they pressed enter and a second screen appeared on which they were asked to rate how good an example the image was of the animal they had typed. These ratings were made on a 7-point Likert scale where 1 represented 'very bad', 5 represented 'good' and 7 represented 'very good'. The anchors for each rating were provided and remained on the screen until the participant responded. Each trial began with a fixation cross, which was replaced 500 ms later with a screen asking participants to press enter when ready. The image was then presented until the participant's first keystroke. If participants did not respond after 10 seconds, they were prompted to respond.

Stimuli
The stimulus pictures were 650 color images (497 horizontal) of animals culled from multiple open access databases and the public domain, as well as copyrighted images for which permission to use, modify and distribute was provided. In particular, the Washington State University veterinary image set makes up a large proportion of the images. The images include multiple examples across different basic categories of familiar animals (e.g. cats, dogs, birds) and a smaller set of examples of less familiar animals (e.g. llama, manatee). The numbers of images of each basic category are sorted by superordinate category and are presented in Table 1.

Image and stimulus set analysis
The stimuli, raw data, and an Excel spreadsheet containing specific details related to each stimulus are available on the Open Science Framework: http://doi.org/10.17605/OSF.IO/5DQU8. These details include the superordinate, basic and subordinate category of each image, the filename, image orientation, and size in pixels for each image, the number of ratings each image received, and their frequency across two corpora from the English Lexicon Project. In addition, we provide the strict and relaxed (see below) ratings of the following: the most frequent name associated with each image, the proportion of participants who provided the most frequent name, the means and standard deviations for name agreement and the quality ratings, alternative names provided in decreasing order, the proportion of responses fitting those alternative names, and the H-statistic of each image. In contrast to others (e.g. [1]), incorrect spellings were not included as correct, because in some cases there was no way to be certain whether participants had misspelled the word or were not completing the task with full attention. Ten images had to be removed because they could not be located in their original form, and thus permission could not be requested.
Accuracy ratings. Two types of accuracy were calculated. Strict accuracy ratings reflect the accuracy with which participants reported the basic category name as an exact match, and relaxed accuracy ratings reflect whether a participant's response contained the target word (e.g. if participants typed tabby cat for a cat image, this would be counted as correct under the relaxed criterion). Spelling errors or typos were counted as incorrect, and as such, these ratings are likely a slight underestimate of actual accuracy. Mean ratings by category, both overall and for accurate responses only, are presented in Table 1, while the ratings for each image (accurate only) can be found on our OSF page (http://doi.org/10.17605/OSF.IO/5DQU8).
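The strict and relaxed criteria can be expressed as a small scoring function. The following is a sketch under stated assumptions rather than the scoring script actually used: the function name `score_response` is hypothetical, and lowercasing, whitespace trimming, and whole-word matching for the relaxed criterion are assumptions that the text does not specify.

```python
import re

def score_response(response, target):
    """Return (strict, relaxed) correctness flags for one typed response.

    strict:  the cleaned response is exactly the basic category name.
    relaxed: the cleaned response contains the target as a whole word.
    Misspellings match neither criterion, as in the text.
    """
    # Assumed normalization: lowercase and collapse runs of whitespace.
    cleaned = " ".join(response.lower().split())
    target = " ".join(target.lower().split())
    strict = cleaned == target
    relaxed = re.search(rf"\b{re.escape(target)}\b", cleaned) is not None
    return strict, relaxed

print(score_response("tabby cat", "cat"))   # relaxed only
print(score_response("poodle", "dog"))      # neither criterion
```

Under this sketch a breed name such as poodle fails both criteria for the target dog, which mirrors the pattern the authors describe for the dog category.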
Naming agreement. The overall H-statistics and percent agreement ratings across all images for both strict and relaxed criteria are presented in Table 2. The H-statistic was calculated using the formula $H = \sum_{i=1}^{k} p_i \log_2(1/p_i)$, where $k$ refers to the total number of names provided for that image and $p_i$ is the proportion of participants providing that specific name.
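For readers who want to recompute the statistic from the raw responses, a minimal Python sketch of the formula follows. The function name `h_statistic` is hypothetical, and whether 'don't know' responses are kept in or removed from the list before the call is left to the caller, since that choice is an assumption not settled here.

```python
import math
from collections import Counter

def h_statistic(responses):
    """Compute H = sum_i p_i * log2(1 / p_i) over the distinct names
    given for one image, where p_i is the proportion of responses
    that used name i."""
    counts = Counter(responses)
    total = len(responses)
    # total / count equals 1 / p_i, so this matches the formula term by term.
    return sum((count / total) * math.log2(total / count)
               for count in counts.values())

# Hypothetical response lists (not taken from the published data).
print(h_statistic(["cat"] * 10))                   # one name only
print(h_statistic(["cat"] * 5 + ["kitten"] * 5))   # two equally common names
```

An image that elicits a single name yields H = 0, and each doubling of the number of equally common names adds 1 to H.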
We excluded any incorrect spellings and, as such, the H-statistics are slightly higher than have been reported in other studies. The first set of statistics is based on a strict exact match (e.g. if the image was a bear and the participant typed polar bear, this was considered incorrect; see columns H-P). The second set of statistics used a slightly looser criterion such that, if participants used the target word in their response (e.g. tabby cat for cat), this was included as correct (see columns Q through W). For both the strict and relaxed criteria, we also include the overall mean, standard deviation, median, range, 1st and 3rd quartiles (Q1 and Q3 respectively), and skew (measured as (Q3-mdn)/(mdn-Q1), with values >1 indicating positive skew) to facilitate the selection of concepts at the center (or extremes) of the distribution (Table 2). The degree to which participants agreed on the identity of the images was approximately 80% across all images and ranged from 12 to 100%. As percentage agreement varies with complexity, this stimulus set provides a range of easy to difficult images from which researchers can select on the basis of their experimental questions. For the category of dog, the H-statistic was slightly higher than would be expected for such a commonly encountered animal. This was related to the fact that participants often tried (correctly or incorrectly) to name the breed, or subordinate category, of the animal (e.g. poodle), which does not contain the target word that represents the basic category (dog). This suggests that for this animal (and to a lesser extent the bird images), the subordinate category was more accessible than the basic category. It also provides researchers with important information that can be used to select a subset of images that suit their needs (e.g. using only animals that are frequently referred to at the basic category level).
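The quartile-based skew measure, (Q3-mdn)/(mdn-Q1), can be reproduced with the Python standard library. This is a sketch only: the function name `quartile_skew` is hypothetical, and the interpolation method used for the quartiles ("inclusive" here) is an assumption, so values may differ slightly from those in Table 2 if another convention was used.

```python
import statistics

def quartile_skew(values):
    """Return (Q3 - median) / (median - Q1) for a list of numbers.

    Values above 1 indicate positive skew and values below 1 indicate
    negative skew. Raises ZeroDivisionError when the median equals Q1.
    """
    # n=4 yields the three cut points Q1, median and Q3.
    q1, mdn, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - mdn) / (mdn - q1)

# Hypothetical per-image values (not taken from the published data).
print(quartile_skew([1, 2, 3, 4, 5]))    # symmetric distribution
print(quartile_skew([1, 2, 3, 6, 10]))   # long upper tail
```

Applied to the per-image H-statistics or percent agreement values, the measure indicates whether most images cluster near the low or the high end of the range.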

Discussion
There are some limitations of the stimulus set and the norming procedure. First, although the maximum size of the images presented was 500 × 500 pixels, some of the images were smaller than this during presentation. Images were all clearly visible, but the effect of size on agreement is unclear. Second, the animals were not limited in their orientation, in that some of them were presented in top view while others were presented in profile. The orientation of each image is made available with the stimulus set, but the relationship between stimulus orientation and ratings is unknown.
The goal of developing this stimulus set was to provide researchers with a normed set of color images of animals across broad superordinate, basic and subordinate categories that can be used in a variety of experimental settings. Because the set was normed on untrained participants who were not asked to name a particular level (e.g. basic or subordinate) of image identity, the norms provide information regarding the natural manner in which naïve observers tend to respond. We hope that this will be a useful resource for researchers to answer their questions of interest.