
Random forests, sound symbolism and Pokémon evolution

  • Alexander James Kilpatrick ,

    Contributed equally to this work with: Alexander James Kilpatrick, Aleksandra Ćwiek

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    alexander_Kilpatrick@nucba.ac.jp

    Affiliation International Communication, Nagoya University of Commerce and Business, Nagoya, Aichi, Japan

  • Aleksandra Ćwiek ,

    Contributed equally to this work with: Alexander James Kilpatrick, Aleksandra Ćwiek

    Roles Writing – original draft, Writing – review & editing

Affiliation Leibniz-Zentrum Allgemeine Sprachwissenschaft, Berlin, Germany

  • Shigeto Kawahara

    Roles Conceptualization, Writing – review & editing

    Affiliation Institute of Cultural and Linguistic Studies, Keio University, Tokyo, Japan

Abstract

This study constructs machine learning algorithms that are trained to classify samples using sound symbolism, and then reports on an experiment designed to measure their performance against that of human participants. Random forests are trained using the names of Pokémon, fictional video game characters, and their evolutionary status. Pokémon undergo evolution when certain in-game conditions are met. Evolution changes the appearance, abilities, and names of Pokémon. In Experiment 1, we train three random forests using the sounds that make up the names of Japanese, Chinese, and Korean Pokémon to classify Pokémon into pre-evolution and post-evolution categories. We then train a fourth random forest using the results of an elicitation experiment whereby Japanese participants named previously unseen Pokémon. In Experiment 2, we reproduce those random forests with name length as a feature and compare the performance of the random forests against humans in a classification experiment whereby Japanese participants classified the names elicited in Experiment 1 into pre- and post-evolution categories. Experiment 2 reveals an issue pertaining to overfitting in Experiment 1, which we resolve using a novel cross-validation method. The results show that the random forests are efficient learners of systematic sound-meaning correspondence patterns and can classify samples with greater accuracy than the human participants.

Introduction

Natural language processing (NLP) is a field of study that combines computational linguistics and artificial intelligence and is concerned with giving computers the ability to understand language in much the same way humans can. The present study tests whether an NLP algorithm can classify samples using sound symbolism, a feature of human language that has been largely overlooked in NLP. While in modern linguistics the relationship between sound and meaning is generally assumed to be arbitrary [1], a growing number of studies have revealed systematic relationships between sounds and meanings, some of which hold cross-linguistically. For example, speakers of many languages tend to associate words containing [i] with small objects, while words containing [a] are typically associated with larger objects [2–5]. Humans understand certain sound symbolic associations in infancy, and these associations are said to scaffold language development and facilitate word learning [6–9]. It is therefore important for any NLP algorithm to understand sound symbolism if its goal is to understand language in the same way that humans can. This study is concerned with the random forest algorithm (hereafter RF; [10]), an ensemble machine learning method typically applied to classification and regression tasks. It builds upon recent research by Winter and Perlman ([11]; see also [12]), who used RFs to show that there is a systematic sound-symbolic relationship between size and phonemes in English words.

In the following, we construct and test RFs using the fictional names of characters known as Pokémon. Initially released in 1996 as a video game, Pokémon is an incredibly popular mixed-media franchise, particularly in its country of origin, Japan [13]. The present study measures the classification accuracy of RFs against that of Japanese university students. The RFs are trained to classify Pokémon into pre-evolution and post-evolution categories using only the sounds that make up their names. In Experiment 1, three RFs are constructed using the sounds that make up the names of Japanese, Mandarin Chinese (hereafter: Chinese), and South Korean (hereafter: Korean) Pokémon. These RFs are trained using a subset of each dataset and then tested on the remaining data. While all RFs classify Pokémon at a rate better than chance, the Japanese RF was found to perform the best; hence, the remaining experiments are conducted on Japanese participants and Japanese Pokémon names only. The Japanese RF is then tested using the results of an elicitation experiment in which Japanese participants were asked to name previously unseen Pokémon, each presented next to its pre- or post-evolution counterpart. A further RF is constructed using the responses from the elicitation experiment and tested both on the elicitation responses and the official Japanese names. In Experiment 2, we retrain the RFs presented in Experiment 1 to include name length. These retrained RFs uncover an issue of overfitting caused by a lack of variability in decision trees. We resolve this issue through cross-validation by constructing multiple random forests (MRFs) with different starting values for the randomization that splits the data into training and testing subsets. The mean accuracy of the RFs in the Japanese MRF is then compared against the results of a classification experiment in which Japanese participants were asked to classify the elicited responses from Experiment 1 into pre- and post-evolution categories. To summarize, Experiment 1 tests whether RFs can learn to make classification decisions using the sounds that make up names and whether this learning is applicable to elicited samples, and Experiment 2 measures the performance of MRFs against humans.

Sound symbolism

One of the standard assumptions of modern linguistic theory is that the relationship between sound and meaning is arbitrary [1,14]. While language is undoubtedly capable of associating sounds and meanings in arbitrary ways, the last few decades have seen a growing number of studies that reveal systematic relationships between sounds and meanings [15–17]. One well-known example is the takete-maluma effect [18], the observation that words containing voiceless obstruents are typically associated with jagged-shaped objects, while words containing sonorant sounds are more often associated with round-shaped objects. This effect has been shown to hold cross-linguistically [19–23]. While relationships between sound and meaning can be systematic, they are typically stochastic in nature [24]; that is, sound-meaning relationships manifest themselves as probability distributions that show statistical skews but may not hold in all lexical items. For example, English adjectives like tiny, mini, and itsy bitsy adhere to the “high front vowel equates to smallness” pattern discussed above, while the English adjective small is a clear exception to this generalization [11]. Sound symbolism is demonstrably important for language acquisition processes [21,25]; sound-symbolic words are more common in both child-directed speech and early infant speech [26,27], and indeed, research has shown that infants are sensitive to sound symbolism [6–9].

Pokémonastics is a relatively new subfield of sound symbolism that examines sound-symbolic relationships between the names of video game characters known as Pokémon and their attributes. In the video games, the player character collects Pokémon, which they use to battle other players. As Pokémon earn experience, many have the option to evolve. Pokémon evolution permanently changes the Pokémon: they typically grow larger and stronger, and their names change. Pokémonastic studies have shown that Pokémon evolution status can be signaled sound-symbolically in English and Japanese by increased name length, increased use of voiced obstruents, and vowel quality, where the high front vowel [i] is typically associated with pre-evolution Pokémon [28–31]. Based on these established relationships and the likelihood that the participants would be familiar with the subject, Pokémon evolution was determined to be a suitable test case for measuring the ability of RFs against humans in understanding sound symbolism (see also [11]).

Random forests

RFs, first introduced by Breiman [10], are ensemble machine learning algorithms that are typically applied to classification and regression tasks. Since their inception, RFs have been a popular tool in machine learning, and several recent review articles attest to their efficacy [32–34]. Typically, RFs work by constructing many decision trees using a two-thirds subset of the data; the trees are then tested on the remaining data. Decision trees themselves are non-parametric supervised machine learning algorithms that resemble flow charts, where each internal node represents a test of features. The decision tree splits at each node based on how important each feature is in the task. Splits eventually lead to a terminal node in the decision tree, which depicts the outcome of the decision-making process. Decision trees can be extremely useful; they are scale-invariant, robust to irrelevant features, and inherently interpretable. However, decision trees are sensitive to noise and outliers, and are thus prone to overfitting the data, which limits their ability to generalize to unobserved samples [35,36]. Overfitting is a modelling error that occurs when a function is too closely aligned to a limited set of data points. This results in a model that performs well on the training dataset but may not generalize well to other datasets. To address the issue of overfitting, RFs use bootstrap aggregating (bagging: [37]) and the random subspace method [38]. Bagging improves the stability and accuracy of the algorithm by combining many decision trees through majority voting (in classification) or averaging of the output (in regression). In bagging, samples are randomly allocated to trees, typically with replacement, so the same samples can recur across trees and the trees can become correlated. The random subspace method counteracts this by randomly selecting a subset of features at each internal node, which allows the model to generalize better by introducing variability into the decision trees. In other words, bagging randomly selects samples while the subspace method randomly selects features. By randomizing the decision trees across both dimensions, random forests resolve the issue of overfitting inherent in decision trees.
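To make the two randomization dimensions concrete, the following minimal sketch in R uses the ranger package employed later in this study; the data frame toy and its column names are invented for illustration, and the hyperparameter values are arbitrary.

# Two randomization dimensions in a random forest, illustrated with ranger.
library(ranger)

set.seed(1)
toy <- data.frame(
  a = rpois(200, 1), i = rpois(200, 1), g = rpois(200, 1),  # toy sound counts
  evolution = factor(sample(c("pre", "post"), 200, replace = TRUE))
)

rf <- ranger(
  evolution ~ ., data = toy,
  num.trees = 500,  # bagging: many trees, each grown on a bootstrap sample
  replace   = TRUE, # samples are drawn with replacement (duplicates possible)
  mtry      = 2,    # random subspace: features considered at each split
  seed      = 1
)
rf$prediction.error # OOB error: the share of misclassified out-of-bag samples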

Experiment 1: Elicitation

Material and methods

The data, an explanation of the data, and a detailed annotated script for the following algorithms are available in the OSF repository.

Official Pokémon name data.

All data were obtained from Bulbapedia ([39], last accessed in June 2022). As of June 2022, Bulbapedia has completed (“mainspaced” in the parlance of the website) lists for Japanese, Chinese, Korean, English, German, and French Pokémon. Japanese, Chinese, and Korean names were selected for this experiment on the basis that Japanese katakana, Chinese pinyin, and the Korean McCune-Reischauer romanization are reasonably phonetic scripts. An algorithm was created for each language to count the number of times each sound occurs in each name. The algorithms and a detailed explanation of their implementation are included in the above OSF repository. This resulted in an almost entirely phonemic analysis, except in the case of tones in Chinese, which are counted as separate features, and voicing on plosives in Korean. In Korean [40] and Chinese [41], there is no phonological opposition between voiced and voiceless plosives. However, Korean plosives are systematically voiced when they occur intervocalically [40], and this is reflected in the McCune-Reischauer romanization of Korean. Given that voiced plosives have been shown to carry information pertaining to Pokémon evolution in other languages [29,31], intervocalic plosives were counted separately in Korean.
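As a rough illustration of this counting step, the following R sketch tallies sound occurrences per name; the segmentation format (space-separated sounds) and the six-sound inventory are hypothetical stand-ins for the actual scripts in the OSF repository.

# Count how many times each sound in the inventory occurs in each name.
count_sounds <- function(names_seg, inventory) {
  segs   <- strsplit(names_seg, " ")
  counts <- sapply(inventory, function(s)
    vapply(segs, function(x) sum(x == s), integer(1)))
  as.data.frame(counts)
}

inventory <- c("p", "i", "k", "a", "ch", "u")  # toy inventory, not the real one
count_sounds(c("p i k a ch u u", "g a a d i"), inventory)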

As of June 2022, there are 905 Pokémon spanning eight generations. This study only examines the names of pre-evolution and post-evolution Pokémon. Some Pokémon do not evolve and are therefore not included in the current study. The sixth generation of the core video game series saw the introduction of a mechanic known as Mega Evolution that temporarily transforms certain Pokémon. Mega Evolution is not considered in the present study because it is a temporary transformation that has little effect on Pokémon names other than the addition of prefixes like mega. Other Pokémon that were excluded from the analysis are mid-stage evolutionary variants. An example of a mid-stage Pokémon is Electabuzz, which was introduced in the first generation of the video game series. Its pre-evolution variant, Elekid, was introduced in the second generation, and its post-evolution variant, Electivire, was introduced in the fourth generation. In the present study, we exclude Electabuzz from the analysis because it is considered the mid-stage variant, despite the other Pokémon being added to the evolutionary family retroactively. Kawahara and Kumagai [28] analyzed the relationships between the sounds in the names of Pokémon and Pokémon evolution without excluding mid-stage Pokémon; to achieve this, they used four categories based on evolution level rather than binary pre- and post-evolution categories. RFs are capable of multiclass classification; however, we opted for binary classification for the current analysis because, while the data are technically count data, they are almost entirely binary (e.g., 96.7% of all data points in the Japanese dataset are either 0 or 1), and the sound-symbolic patterns are likely scalar across mid- and final-stage categories. The removal of mid-stage Pokémon and Pokémon with no evolutionary family resulted in 628 unique Pokémon names, 303 of which are pre-evolution and 325 of which are post-evolution. The distribution is skewed because certain pre-evolution Pokémon may evolve into multiple post-evolution variants.

Elicitation experiment.

This experiment received ethics approval from the Nagoya University of Commerce and Business (ID number 21048).

The elicitation experiment has two main goals. The first is to determine whether an RF constructed using the official Pokémon name data can be used to classify names elicited from participants and vice versa; in other words, is there enough overlap between the official names and the names provided by participants for each model to be useful in classifying Pokémon from the alternate dataset? The second goal is to provide stimuli for a categorization experiment (Experiment 2) designed to measure the performance of human participants against the machine learning algorithms. To obtain a fair measurement of classification accuracy, it was important to test both the humans and the machine learning algorithms on data that they had not previously been exposed to, hence the need for elicited samples.

The elicitation experiment was conducted using Google Forms. Each Google Form consisted of a short instructional paragraph, followed by twenty Pokémon-like images. Following the method outlined in Kawahara and Kumagai [28], these images were not of existing Pokémon and had likely not been previously viewed by the participants. The instructions noted that only native Japanese speakers were to take the survey. Participants were informed that they were to name twenty new Pokémon. It was made clear to participants that they would be shown images of pre- and post-evolution Pokémon. Participants were asked to provide names for the Pokémon in katakana, which is the script used for Pokémon names and nonce words in Japanese. Participants were instructed not to use existing words (Japanese or otherwise) to name the Pokémon and were given no further instructions (such as length limitations) regarding the names. Participants were not asked if they were familiar with the Pokémon franchise prior to completing the survey. All instructions were written in Japanese. Participants were informed that their participation was entirely voluntary and that they could quit the survey at any time. Consent was obtained verbally, and it was explained to participants that completing the survey also constituted consent. No personal data were collected other than student email addresses, which were collected to ensure that students did not complete the survey twice; these were discarded prior to the analysis.

Each image contained a pre-evolution and a post-evolution Pokémon presented side by side. The pre-evolution Pokémon was always located to the left of its post-evolution counterpart and was always presented as substantially smaller (see Fig 1). In each image, an arrow pointed to the Pokémon that was to be named. Trials were not randomized: each image with an arrow pointing to the pre-evolution Pokémon was immediately followed by an identical image with the arrow pointing to the post-evolution Pokémon. The images were created by a semi-professional artist (DeviantArt user: Involuntary-Twitch), and samples are presented in Fig 1. The images very closely resemble the pixelated sprites used to represent Pokémon in the earlier generations of Pokémon games.

Fig 1. Sample stimulus pairs of pre- and post-evolution Pokémon characters used in Experiment 1.

These images are reproduced with the permission of the artist.

https://doi.org/10.1371/journal.pone.0279350.g001

Participants were recruited from the Nagoya University of Commerce and Business via a post on the student bulletin board. Students were not compensated for their time monetarily or otherwise. The human participants needed to be somewhat familiar with the subject matter because sound-symbolic relationships in fictional names may not adhere to those found in natural languages. Given the popularity of Pokémon in Japan and that the participants were Japanese university students, Pokémon was determined to be a good test case for assessing the accuracy of RFs against that of humans. Forty-nine students responded to the survey. In total, 980 responses were recorded; however, some responses were blank and other responses contained duplicate names, the distribution of which suggested that participants had possibly conferred while taking the survey. These were discarded, resulting in 967 unique names (482 pre-evolution; 485 post-evolution). Elicited names were transcribed using the same algorithm used for the official Japanese Pokémon names. None of the names collected in the elicitation experiment were names of existing Pokémon.

Random forests.

Random forests were constructed and tested using the ranger package 0.13.1 [42]. The number of trees included in each RF was manually tuned by constructing nine RFs at different tree-number values with different starting points for randomization (set.seed). Optimal values were determined by examining mean out-of-bag (OOB) accuracy and its standard deviation. OOB error refers to the proportion of out-of-bag samples that are incorrectly classified. For all RFs, 20,000 trees were determined to be a suitable size because we observed no reduction in OOB error with additional trees, and because calculating feature importance using the Altmann method [43] at 20,000 trees approached the processing capability of the computer on which the RFs were constructed. Hyperparameters pertaining to the number of features examined at each node, the sample fraction, and node size were tuned using the tuneRanger package 0.5 [44]. Essentially, the tuning process determines how much variability there is between trees. Highly variable trees will produce highly variable results but might encounter issues with datasets that contain many unimportant features or null values. Low variability in decision trees results in more stable algorithms but may mask the importance of weaker features because they will often be paired with strong features.

The accuracy of the RF is determined by feeding the testing data into the model and assessing the OOB error. The OOB error gives an overall representation of the accuracy of the algorithm but does not communicate which features are important in classification, which is instead determined by feature importance. There are several ways to calculate feature importance; the present study uses permutation. In permutation, each feature is randomized individually, and then the algorithm is reconstructed with all other features remaining the same. Feature importance is calculated from the increase in OOB error due to randomization. One issue with the interpretability of RFs is that feature importance does not communicate directionality. Sounds that are important to classification may be considered as “pulling” each sample into one category or the other; feature importance communicates the strength of the “pull”, but it does not communicate whether that “pull” is in the direction of the pre- or post-evolution category. In the present study, we report on the distribution of speech sounds across pre- and post-evolution categories to indicate directionality, though it should be noted that distribution and importance are not the same measure.
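The tuning step described above can be sketched as follows, assuming a hypothetical data frame jp of per-name sound counts with a binary factor evolution as the outcome; the exact settings used in the study are documented in the OSF repository.

# Tune mtry, node size, and sample fraction with tuneRanger (sketch).
library(mlr)
library(tuneRanger)

task  <- makeClassifTask(data = jp, target = "evolution")
tuned <- tuneRanger(
  task,
  num.trees       = 20000,
  tune.parameters = c("mtry", "min.node.size", "sample.fraction")
)
tuned$recommended.pars  # hyperparameters carried over to the final forest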

In total, six RFs were constructed for Experiment 1. The first three RFs presented in the results section were trained using a randomly sampled subset consisting of two-thirds of the Japanese, Chinese, and Korean Pokémon names. The fourth RF was trained using two-thirds of the results of the elicitation experiment. All four RFs were then tested using the remaining one-third subset of each dataset. We then calculate feature importance for each RF to examine potential cross-linguistic patterns, and patterns between the Japanese Pokémon data and the elicited data. The remaining two RFs were constructed using the entirety of the official Japanese Pokémon names and the entirety of the samples collected in the elicitation experiment. These two RFs were then tested using the alternate dataset: one RF was constructed using all the official names and tested on the elicited responses, while the other was constructed using all the elicited samples and tested on the official names. This was done to determine whether there is enough overlap between the two datasets for each algorithm to be useful in classifying the opposite dataset.
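The train/test procedure can be sketched as follows, again with jp standing in for the 628 official Japanese names (one row per name); the seed value is illustrative.

# A two-thirds training split with a fixed starting value for randomization.
set.seed(1)
idx   <- sample(seq_len(nrow(jp)), size = floor(2 * nrow(jp) / 3))
train <- jp[idx, ]
test  <- jp[-idx, ]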

Results

The three RFs trained and tested on the official Pokémon names all classified Pokémon at a rate better than chance. Given that there is an uneven distribution of pre- and post-evolution Pokémon, any model that naïvely classified to the majority category would achieve an accuracy of around 52% (OOB error 48%), depending on the split of the training and testing subsets. The Japanese RF was the most accurate (OOB error 29.05%), followed by the Chinese RF (OOB error 39.05%) and, finally, the Korean RF (OOB error 40.95%). A confusion matrix for the Japanese RF is presented in Table 1, and feature importance for the Japanese RF is presented in Table 2. Note here that in Experiment 2, we report on the results of MRFs with different starting values for the randomization of both the splitting of the data into training and testing subsets and the RFs themselves. The results of the MRFs (OOB error: M = 34.07%, SD = 2.48%) suggest that this result was an outlier caused by a particularly advantageous split between training and testing subsets. This process was conducted for the Chinese (OOB error: M = 40.85%, SD = 3.35%) and Korean (OOB error: M = 43.28%, SD = 3.09%) datasets as well.

The RF trained and tested on the elicited names (Elicited RF) classified samples at a rate better than chance. As with the official datasets, there was an uneven distribution across categories; a naïve model would accurately classify samples in the elicited data 50.16% of the time (OOB error 49.84%). The Elicited RF achieved an OOB error of 30.96%.

Feature importance was calculated for each model to determine which sounds contributed to classification. Feature importance and its significance are calculated using the Altmann [43] permutation method on the training subsets. Permutation involves randomizing features individually; the random forest is then reconstructed for each feature, and feature importance is the increase in OOB error when that feature is randomized. The Altmann permutation method involves running multiple permutations to estimate more precise p values: significance is calculated by normalizing the biased importance measure based on a permutation test, which returns a significance result for each feature, not for the random forest itself [43]. All RFs in the present study use the Altmann permutation method with the number of iterations set at 100. Directionality was determined by the distribution of features in the training subsets of the data. The distribution of features in the Japanese training subset is presented in Fig 2. In the Japanese RF, the most important features were the bilabial nasal (/m/), the coda nasal (/ɴ/), long vowels (/:/), and the voiced velar plosive (/g/). Of these features, only /m/ occurs more frequently in the pre-evolution samples.
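The Altmann procedure described above is available in ranger as importance_pvalues; a hedged sketch, reusing the hypothetical train subset from the methods section:

# Permutation importance with Altmann p values (100 permutations).
library(ranger)

rf_jp <- ranger(evolution ~ ., data = train, num.trees = 20000,
                importance = "permutation", seed = 1)
imp   <- importance_pvalues(rf_jp, method = "altmann",
                            num.permutations = 100,
                            formula = evolution ~ ., data = train)
head(imp)  # one importance score and one p value per feature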

Fig 2. Distribution of features to pre- and post-evolution categories in the Japanese training subset.

Features appear in order of importance from left to right. Asterisks denote significant features.

https://doi.org/10.1371/journal.pone.0279350.g002

Table 2. Feature importance (Importance) and p values for features that achieved a feature importance greater than 0.1% in the Japanese RF.

https://doi.org/10.1371/journal.pone.0279350.t002

As with the Japanese RF, the distribution of most features that were important in the Chinese RF skewed towards the post-evolution category. A confusion matrix for the Chinese RF is presented in Table 3, feature importance scores are presented in Table 4, and the distribution is presented in Fig 3. Tones are important features in the RF: the falling tone occurs more frequently in the post-evolution samples, while the neutral tone occurs more frequently in the pre-evolution samples. The velar nasal (/ŋ/) was also found to be an important feature in the Chinese RF.

Fig 3. Distribution of features to pre- and post-evolution categories in the Chinese training subset.

Features appear in order of importance from left to right. Asterisks denote significant features.

https://doi.org/10.1371/journal.pone.0279350.g003

Table 3. Confusion matrix for the Chinese RF.

https://doi.org/10.1371/journal.pone.0279350.t003

Table 4. Feature importance (Importance) and p values for features that achieved a feature importance greater than 0.1% in the Chinese RF.

https://doi.org/10.1371/journal.pone.0279350.t004

In the Korean RF, the vowels /ɯ/, /a/, /ʌ/, and /u/ were important, as was the voiced labial-velar approximant /w/. Interestingly, while the close back unrounded vowel /ɯ/ was present more often in post-evolution samples, the close back rounded vowel /u/ was present more often in pre-evolution samples. Table 5 presents a confusion matrix for the Korean RF, Table 6 presents the feature importance and p values, and Fig 4 presents the distribution.

Fig 4. Distribution of features to pre- and post-evolution categories in the Korean training subset.

Features appear in order of importance from left to right. Asterisks denote significant features.

https://doi.org/10.1371/journal.pone.0279350.g004

Table 6. Feature importance (Imp.) and p values (p) for features that achieved a feature importance greater than 0.1% in the Korean RF.

https://doi.org/10.1371/journal.pone.0279350.t006

Most of the features that were important in the Japanese RF were also important in the Elicited RF. These include the voiced plosives (/g/ and /d/), the open front unrounded vowel (/a/), the coda nasal (/ɴ/), and long vowels (/:/). Interestingly, all the features that achieved a feature importance greater than 0.1% in the Elicited RF occurred more frequently in post-evolution Pokémon. The confusion matrix for the RF constructed and tested on the data from the elicitation experiment is presented in Table 7, Table 8 presents the feature importance scores, and Fig 5 presents the distribution chart.

Fig 5. Distribution of features to pre- and post-evolution categories in the elicited training subset.

Features appear in order of importance from left to right. Asterisks denote significant features.

https://doi.org/10.1371/journal.pone.0279350.g005

Table 8. Feature importance (Imp.) and p values (p) for features that achieved a feature importance greater than 0.1% in the Elicited RF.

https://doi.org/10.1371/journal.pone.0279350.t008

Given that the feature importance patterns of the Japanese RF and the Elicited RF are reasonably similar, we wanted to test whether these RFs would be able to accurately classify samples from their opposite dataset. Important features that are shared between the two models are non-labial voiced obstruents such as /d/ and /g/, coda nasals, long vowels, and the low front vowel /a/. Interestingly, the distributional skew for all of these features is towards the post-evolution category. We tested each existing RF on the entirety of its opposite dataset (not just the test subsets). The Japanese RF was able to accurately classify the elicited samples 61.43% of the time (OOB error 38.57%), and the Elicited RF was able to accurately classify the official Japanese Pokémon name samples 66.72% of the time (OOB error 33.28%), where naïve models would be expected to achieve accuracies of 52% and 50.16%, respectively. The confusion matrix for the RF trained using the official Japanese Pokémon names and tested using the elicited samples is shown in Table 9. Table 10 shows the confusion matrix for the RF trained using the elicited samples and tested using the official Japanese Pokémon names.
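The cross-dataset test amounts to predicting one full dataset with a forest trained on the other; a minimal sketch, where rf_official and the elicited data frame are hypothetical objects following the naming in the text:

# Test a forest trained on the official names against all elicited samples.
pred <- predict(rf_official, data = elicited)
mean(pred$predictions == elicited$evolution)  # accuracy (1 minus the error rate)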

Table 9. Confusion matrix for the RF trained using the official Japanese Pokémon names and tested using the elicited samples.

https://doi.org/10.1371/journal.pone.0279350.t009

Table 10. Confusion matrix for the RF trained using the elicited samples and tested using the official Japanese Pokémon names.

https://doi.org/10.1371/journal.pone.0279350.t010

Discussion

All the RFs presented above performed better than a naïve algorithm would. For the Japanese, Chinese, and Korean RFs, a naïve algorithm would be expected to achieve an OOB error of 48%. While the Japanese RF was shown to be the most accurate (OOB error 29.05%), the Chinese (OOB error 39.05%) and Korean (OOB error 40.95%) error rates were well below 48%. The Elicited RF, for which a naïve algorithm would be expected to achieve an OOB error of 50%, achieved an OOB error of 30.96%. It is important to remember here that the RFs were trained on only two-thirds of the data, so the RFs were efficient learners, given that they had only 419 samples to learn from. The RF trained on the official Japanese data and tested on the elicited data was trained on all 628 official Japanese samples and tested on all 967 elicited responses. Despite having more samples from which to learn, the RF trained on the official names and tested on the elicited responses (OOB error 38.57%) was less accurate than the RF trained and tested on the official names (OOB error 29.05%). Similarly, the RF trained on the entirety of the elicited responses and tested on the official names (OOB error 33.28%) was less accurate than the RF trained and tested on the elicited responses (OOB error 30.96%). Despite these differences, the official names and the elicited responses are similar enough for the models to perform better than naïve algorithms.

The feature importance scores of the RFs reveal interesting relationships between Pokémon evolution status and the sounds that make up their names, some of which hold across languages. While high front vowels did not achieve a feature importance greater than 0.1% in any of the RFs, the low front vowel /a/ and the high back unrounded vowel /ɯ/ were important in the Japanese, Korean, and Elicited RFs and were distributionally skewed towards post-evolution in all cases. The result for the phoneme /a/ as representing post-evolution Pokémon is in line with the well-known observation that nonce words containing [a] denote larger objects than those containing [i] [2,45,46], given that post-evolution Pokémon are typically larger than their pre-evolution counterparts. Interestingly, the high back rounded vowel /u/ was important in the Chinese model, but it skewed towards the pre-evolution category. Vowels were found to be important in the Korean model, particularly /ɯ/, /a/, /ʌ/, and /u/. Korean vowels have been found to hold sound-symbolic correspondences between “light” and “dark” vowels [47], and these correspondences run counter to cross-linguistic patterns: light vowels are defined as being low vowels and are said to reflect small, fast-moving entities, while dark (or high) vowels are said to reflect larger, slow-moving entities [48]. Our findings do not support this observation. While the distributions of the dark vowels /ɯ/ and /ʌ/ skew towards the post-evolution category, the light vowel /a/ also skews towards the post-evolution category, and the dark vowel /u/ skews towards the pre-evolution category. The finding that /a/ is important to the Korean model and skews towards the post-evolution category is in line with [5], who found that Korean listeners judge nonce words to be larger when they contain [a]. Long vowels were important in both the Japanese and the Elicited RFs, and they skewed towards post-evolution in both cases. This finding is reflected in previous Pokémon studies [29,30], which also show that long vowels are associated with increased size. These studies tend to explain this through the iconicity of quantity, the finding that larger objects are typically associated with longer names [49]; this is explored further in Experiment 2. Lastly, tones were important in the Chinese RF. The falling tone had the highest feature importance and skewed toward the post-evolution category, while the neutral tone skewed toward the pre-evolution category. In a similar Pokémonastic study, Shih et al. [50] found that the falling tone seems to be associated with increased power, evolution stage, and a skew towards male gender. This is more complex than what Ohala's Frequency Code hypothesis [51] would predict, as that hypothesis simply states that low tones should reflect largeness and high tones smallness; it makes no prediction regarding pitch contour. Shih et al. [50] propose that the falling tone has the steepest pitch contour of all Chinese tones, and that this may explain why it is iconically linked to largeness in Chinese.

The Japanese nasal /ɴ/ and the Chinese nasal /ŋ/ were important in the Japanese, Chinese, and Elicited RFs and skewed towards post-evolution in all cases. This is an interesting finding given that both consonants can only occur in the coda position, although the coda nasal /ŋ/ in Korean did not achieve a feature importance greater than 0.1%. Cross-linguistically, nasal consonants are generally associated with large entities [2,52], likely due to their low-frequency energy [2]; the bilabial nasal /m/, however, was important in the Japanese RF and skewed towards the pre-evolution category. In Japanese, bilabial consonants have been found to be associated with images of cuteness and softness [53], which may explain this skew. High back vowels in the Korean model present an interesting case study when examined through the lens of the relationship between cuteness and labiality in Japanese. In the Korean model, both high back vowels were found to be important: while the high back rounded vowel /u/ skewed towards the pre-evolution category, the high back unrounded vowel /ɯ/ skewed towards the post-evolution category. This result suggests that the association between cuteness and labiality may be a cross-linguistic one; however, this suggestion is tentative given that the Korean labial-velar approximant /w/ both skewed towards post-evolution and was important in the model. In line with Shih et al. [31], who found that voiced plosives were reflective of size in Japanese and English Pokémon names, the voiced plosives /d/ and /g/ were important in the Japanese and Elicited RFs. Intervocalic plosives in Korean were counted separately because they are systematically voiced in that position; however, they did not achieve a feature importance greater than 0.1%.

Experiment 2: Categorization

Random forests

The RFs presented in Experiment 1 were constructed using only the sounds that make up the names of Pokémon. In Experiment 2, we reconstruct those RFs with name length as an additional feature. Length was not included in the previous RFs because previous studies suggest that it is likely a highly important feature [29,31], the inclusion of which would likely mask the importance of other features. In all other respects, the RFs presented in Experiment 2 follow the same method as those in Experiment 1, except that multiple random forests (MRFs) are constructed independently of each other. In MRFs, the starting value for the randomization of the split of data into training and testing subsets, as well as the starting value for the randomization of each RF, was set to the position of the RF in the sequence: the first RF in each MRF had a set.seed value of 1, while the ninth had a set.seed value of 9. For MRFs, tuning was conducted on the first RF only, and those hyperparameter settings were applied to all nine RFs in each MRF. This is because, as far as we can tell, there is no way to make the tuning process replicable, and the data for individual RFs come from the same source. We test each MRF nine times using the testing subset from each random split. In those instances where the entirety of one dataset is tested against the entirety of a different dataset, as in the case of testing the accuracy of the official Japanese MRF against the elicited Japanese MRF, nine iterations of each test were run with different starting values for the randomization of the MRFs.
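A sketch of one such MRF is given below, with jp again standing in for a dataset and the tuned hyperparameters omitted for brevity:

# Nine independent RFs; seed s drives both the subset split and the forest.
library(ranger)

errors <- vapply(1:9, function(s) {
  set.seed(s)                                   # randomizes the subset split
  idx  <- sample(seq_len(nrow(jp)), floor(2 * nrow(jp) / 3))
  rf   <- ranger(evolution ~ ., data = jp[idx, ],
                 num.trees = 20000, seed = s)   # randomizes the forest
  pred <- predict(rf, data = jp[-idx, ])
  mean(pred$predictions != jp$evolution[-idx])  # error on the testing subset
}, numeric(1))

c(mean = mean(errors), sd = sd(errors))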

Categorization experiment

This experiment received ethics approval from the Nagoya University of Commerce and Business (ID number 21057).

In the categorization experiment, we took the elicited responses from Experiment 1 and asked Japanese participants to classify them as either pre-evolution or post-evolution Pokémon. One hundred samples were selected randomly from the 967 elicited responses. There was no control for distribution in the random sampling process because there is no distribution control in the splitting of subsets for RFs. These samples were used to populate five Google Forms that held twenty elicited names each. The surveys were designed in this manner, rather than randomly sampling twenty names from the entire dataset, to simulate the voting process that decision trees undertake in RFs; this is discussed further in the results section below. The forms explained (in Japanese) to the participants that they were to assign new Pokémon to either pre- or post-evolution categories. These choices were presented as buttons labelled 進化前 [pre-evolution] and 進化後 [post-evolution]. Participants were not asked if they were familiar with the Pokémon franchise prior to completing the survey. QR codes were generated for each of the five Google Forms. The codes were printed on handouts and distributed to Japanese university students at the Aichi Prefectural University and the Nagoya University of Commerce and Business. Other than the QR code, there was no other information on the handout except for the heading ポケモンクイズ [Pokémon Quiz]. Handouts were distributed to students prior to club activities and scheduled classes. Students were not given any time in class to complete the survey. In total, 119 participants responded to the survey, and there were 10 instances where participants failed to designate a category, resulting in 2,370 responses. As with Experiment 1, participants were informed that their participation was entirely voluntary and that they could quit the survey at any time. Consent was obtained verbally. No personal data were collected other than student email addresses, which were collected to ensure that students did not complete the survey twice; these were discarded prior to the analysis. Students who had taken part in Experiment 1 were requested to refrain from taking part in Experiment 2.

Results

The aim of Experiment 2 is to compare the performance of RFs against that of humans in classifying Pokémon into pre- and post-evolution categories. In the categorization experiment, human participants had access to name length, so length was included in the algorithms to give the RFs access to this information. In the following, the distribution of length across pre- and post-evolution categories is examined in all four datasets. Then, all previous RFs are reconstructed to include length to ascertain its effect on OOB error. We also calculate the feature importance of length to determine how much it contributes to OOB error. Finally, we report on the results of the categorization experiment and compare the accuracy of the human participants against that of the machine learning algorithms.

Length was calculated as the sum of all sound counts in each dataset, except for Chinese tones. In an exploration of sound-symbolic relationships in Pokémon names, Kawahara and Kumagai [28] calculated name length as the number of moras in Japanese names because the mora is the most psycholinguistically salient prosodic unit [54]. Although decision trees are scale-invariant, we calculated length as the number of sounds to bring the Japanese length parameter in line with the Chinese and Korean parameters. Chinese tones were excluded from the length calculation because tones are a measure of pitch contour and do not contribute to the overall length of a name in the way that other speech sounds do. Despite this, Chinese names were longer than those in all other datasets, with both pre-evolution (M = 8.76, SD = 2.09) and post-evolution (M = 9.31, SD = 2.09) Pokémon names consisting of a median of nine sounds. Length in the Japanese, Korean, and elicited datasets was similar, with all pre-evolution names consisting of a median of seven sounds and all post-evolution names consisting of a median of eight sounds. The difference between mean pre- and post-evolution length was greatest in the Elicited dataset (1.52), followed by the Japanese (0.9), Korean (0.73), and, finally, Chinese (0.55) datasets. Mean, median, and standard deviation for length across datasets are presented in Table 11. Fig 6 presents a boxplot of length in Chinese, Japanese, and Korean by evolution status.
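The length feature reduces to a row sum over the sound-count columns; a sketch, where the column names (including the tone columns in the Chinese data frame cn) are assumed for illustration:

# Length as the sum of all sound counts, excluding Chinese tone columns.
sound_cols <- setdiff(names(jp), "evolution")
jp$length  <- rowSums(jp[sound_cols])

tone_cols  <- grep("^tone", names(cn), value = TRUE)  # assumed naming scheme
cn$length  <- rowSums(cn[setdiff(names(cn), c("evolution", tone_cols))])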

Fig 6. Boxplot of length for pre- and post-evolution Pokémon across the three languages.

https://doi.org/10.1371/journal.pone.0279350.g006

Table 11. Mean, median, and standard deviation of name length across the datasets.

https://doi.org/10.1371/journal.pone.0279350.t011

The distributions of length in the official Japanese Pokémon dataset and the elicited dataset were extremely similar. While there were more outliers in the elicited dataset, the median, upper and lower quartiles, and minimum and maximum scores (excluding outliers) were almost identical. Fig 7 presents boxplots of the official Japanese Pokémon name lengths and the elicited name lengths side by side to illustrate these similarities.

Fig 7. Official Japanese Pokémon name length (Official) and elicited name length (Elicited) presented side by side in a boxplot.

https://doi.org/10.1371/journal.pone.0279350.g007

Length was excluded from the RFs in the previous section because it was clear from previous studies that it would be an important feature [29,31] and would likely mask the importance of speech sounds. Its inclusion should therefore increase the accuracy of the models (i.e., reduce OOB error). However, this was not the case. All previous RFs were reconstructed to include the length feature, undergoing the same tuning process outlined in Experiment 1. The feature importance of length is presented in Table 12, which also presents the OOB error rates for the RFs constructed with (+L) and without (-L) length. The inclusion of length increased the OOB error of all but the Chinese RF, which should be the RF least affected by length because the difference in average length between pre- and post-evolution Pokémon was smallest in the Chinese dataset. Confoundingly, length was shown to be an important feature in the RFs, yet its effects were not reflected in the difference in OOB error between the +L and -L RFs.

Table 12. OOB error rates for the RFs constructed in Experiment 1 (-L OOB), the OOB error for RFs constructed using length (+L OOB), and the feature importance of length in those RFs (L Imp).

https://doi.org/10.1371/journal.pone.0279350.t012

To explore a potential explanation for this, we examined the randomization processes used in the construction of the RFs. For all RFs until this point, we used the same set.seed value, except for those used to tune the number of trees in each forest (num.trees). This value was used as the starting number to generate the randomization for both the splits between training and testing data and the RFs themselves. In the method section of Experiment 1, we tuned num.trees by running nine iterations of each num.trees value with different set.seed values for the randomization of the RFs. We applied this method to the randomization of the splits between training and testing subsets and found a substantial amount of variation in OOB error. We ran nine iterations of each of the RFs presented in this study; here, however, we adjusted the set.seed values for both the subset splits and the RFs. The set.seed values ranged from 1–9 for both +L and -L RFs, resulting in the same nine subset splits. The results are displayed in Table 13. There is no way to control for randomization in the tuning process: each time it is conducted, it returns different hyperparameter values, even on the same data. Given that the nature of the data remained the same, the hyperparameter values used for the MRFs were taken from the previous RFs.

Table 13. Results of multiple random forests.

The mean OOB error for RFs constructed without length (-L OOBM) and their standard deviation (-L OOBSD), the mean OOB error for the RFs constructed with length (+L OOBM) and their standard deviation (+L OOBSD), and the mean feature importance of length (L ImpM) in the +L MRFs.

https://doi.org/10.1371/journal.pone.0279350.t013

To understand why the Japanese RF in Experiment 1 achieved such a low OOB error, we examined the mean feature importance values for features in the -L Japanese MRF and compared them to the feature importance of features in the -L Japanese RF. Table 14 shows the confusion matrix for the Japanese MRF. Table 15 presents the feature importance values in the Japanese RF and the mean feature importance values in the Japanese MRF. Here we see that the Japanese RF outlined in Experiment 1 was overemphasizing the importance of the features /m/, /ɸ/, and /Q/, and underemphasizing the importance of /ɴ/, /:/, /ɾ/, /d/, /ɯ/, /o/, and /s/. The latter three were not included in earlier tables and charts because they did not achieve a feature importance greater than 0.1% in the Japanese RF.

Table 14. Confusion matrix for the Japanese MRF trained and tested on multiple subsets from the official Pokémon names that include length as a feature.

https://doi.org/10.1371/journal.pone.0279350.t014

Table 15. Feature importance of sounds for Japanese RF trained on official Pokémon names (RF Imp), mean feature importance of sounds for Japanese MRF trained on official names (MRF ImpM), and mean standard deviation for the Japanese MRF (MRF ImpSD).

Asterisks reflect a mean p value of less than 0.05.

https://doi.org/10.1371/journal.pone.0279350.t015

Given that the randomization of subsets has such a large impact on OOB error, we repeated the tests of the Japanese RF on the elicited data and of the Elicited RF on the official Japanese Pokémon data, using both the -L and +L datasets. We tested the entirety of the Japanese and elicited datasets on the Elicited and Japanese MRFs, respectively. Table 16 presents the confusion matrix for the MRF trained on the official Japanese Pokémon names and tested on the elicited samples. Table 17 shows the confusion matrix for the MRF trained using the elicited samples and tested on the official names.

Table 16. Confusion matrix for the MRF trained on all official Japanese Pokémon names and tested on all elicited samples.

https://doi.org/10.1371/journal.pone.0279350.t016

Table 17. Confusion matrix for the MRF trained on all elicited samples and tested on all official Japanese Pokémon names.

https://doi.org/10.1371/journal.pone.0279350.t017

To explore the issue of overfitting further, we reconstructed the -L Japanese MRF; this time, however, we skipped the tuning process and used the default hyperparameter settings of the ranger package. This was done because we considered that the most likely reason for overfitting was low variability in decision trees due to the hyperparameter settings suggested by the tuning process. The untuned -L Japanese MRF (OOB error M = 35.94%, SD = 1.75%) was less accurate than the tuned MRF (OOB error M = 34.07%, SD = 2.48%), but its standard deviation was lower, suggesting that overfitting was less prevalent in the individual RFs. We then recreated the untuned MRF using the entirety of the official Japanese names and tested it on the entirety of the elicited names, and found the same pattern, whereby the untuned MRF (OOB error M = 38.24%, SD = 0.42%) was less accurate but more stable than the tuned MRF (OOB error M = 37.31%, SD = 1.26%) presented in Table 18.

Table 18. Results of testing the Japanese MRF on the elicited data from Experiment 1 and the Elicited MRF on the Japanese data.

This includes both MRFs that do not contain length as a feature (-L OOBM) and those that do (+L OOBM) as well as their standard deviation (-L OOBSD, +L OOBSD).

https://doi.org/10.1371/journal.pone.0279350.t018

We considered a potential alternative explanation for the high standard deviation in the tuned MRFs: that the variability caused by the randomization of subset splits may be explained by an over- or under-representation of pre-/post-evolution Pokémon in the testing/training subsets. A simple regression model was constructed to predict the effect of an increased share of post-evolution Pokémon in the testing subset on OOB error for all the Japanese, Chinese, and Korean RFs taken from the MRFs. The elicited data were not included because the distribution of samples across pre-/post-evolution categories is different in the elicited responses. No correlation between subset distribution and OOB error was observed, F(1,25) = 0.31, p = 0.581, R2 = 0.01. We must therefore conclude that the variability in accuracy when randomizing the subsets is most likely due to overfitting resulting from low variability in decision trees.
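This check corresponds to a one-predictor linear model; a sketch, where mrf_results is a hypothetical data frame with one row per RF drawn from the MRFs:

# Regress OOB error on the share of post-evolution samples in the test subset.
fit <- lm(oob_error ~ post_share, data = mrf_results)
summary(fit)  # the text reports F(1,25) = 0.31, p = 0.581, R2 = 0.01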

In the classification experiment, 119 Japanese participants each classified twenty names into either pre- or post-evolution categories. The twenty names were taken from 100 randomly selected samples from the results of Experiment 1. The participants were reasonably accurate (M = 61.58%, SD = 17.84%) at assigning the elicited Pokémon names to pre- and post-evolution categories. This assessment was based on the mean accuracy of individual responses. It is arguably an unfair assessment of human ability, given that sound-symbolic associations are decided upon by speech communities, not individual speakers. In RFs constructed for classification tasks, each decision tree votes for the classification of samples, and the RF chooses the classification based on majority voting. To apply this method to the results of the classification experiment, we treated each response as a vote and examined the results of a majority-vote analysis; put simply, we examined the mode rather than the mean for each sample. Using majority voting, the participants in the classification experiment were able to accurately classify 71% of the samples. The same 100 samples were then tested using each RF in the MRF constructed with the official Japanese names. The MRF classified the samples far more accurately than the humans, correctly classifying them 75.88% of the time (OOB error M = 24.12%, SD = 1.61%).
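The majority-vote scoring can be sketched as follows, where responses is a hypothetical long-format data frame with columns sample_id, choice, and truth:

# Treat the modal category per sample as the group's single vote.
library(dplyr)

votes <- responses |>
  group_by(sample_id) |>
  summarise(vote  = names(which.max(table(choice))),
            truth = first(truth))

mean(votes$vote == votes$truth)  # majority-vote accuracy (71% in the text)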

Discussion

Experiment 2 was designed to test whether machine learning algorithms perform on par with humans, though it may not be immediately clear which of the MRFs presented in Table 18 should be used as a fair yardstick for the accuracy of the algorithms. We consider the results of the MRF trained using the length and sounds of all of the official Japanese Pokémon names and tested on the 100 samples used in Experiment 2 (OOB error M = 24.12%) to be the fairest measure of the performance of the algorithms, because it was trained on the maximum amount of information that was also available to the human participants and tested on the same samples used in Experiment 2. Converting the responses of the human participants to OOB error shows that the human participants (OOB error M = 38.42%, SD = 17.84%) were far less accurate than the algorithm (OOB error M = 24.12%, SD = 1.61%), even when using the majority-vote method (OOB error = 29%).

The finding that the algorithms were more accurate than individual participants at classifying Pokémon is unintuitive, particularly given the limited data upon which the MRFs were trained. One interpretation of this finding is that human participants do not always give their best effort, while machine learning algorithms do. This lack of effort may come down to a lack of motivation, not taking the survey seriously, or any number of other factors that are simply impossible to take into account. However, we contend that this does not account for the entirety of the difference in classification accuracy, for the following reasons. Firstly, the categorization experiment was voluntary; participants were not rewarded monetarily or otherwise for their participation. While the printed handouts were distributed prior to classes, the students were not given any time in class to complete the experiment; it was done entirely in their own time. Additionally, the task was brief, taking around 2–3 minutes to complete. Lastly, the subject matter was specifically chosen because it was appealing and familiar to the population sample. Based on these factors, we expect that participant interest would have been high and that many participants would have been invested in the experiment.

Therefore, we believe that another interpretation may better explain the difference between participant and algorithm accuracy: humans are susceptible to cognitive biases, while machine learning algorithms are not. For example, humans will often apply oversimplified images or ideas to types of people or things; this is known as stereotyping. Through the lens of RFs, stereotyping is the overapplication of a feature to a category. Research on other cognitive biases suggests that humans do not intuitively understand probabilities, which is important given that sound symbolism is stochastic, not deterministic [24]. These biases include the recency bias (also known as the availability bias), the expectation that events that have occurred recently will reoccur regardless of their probability, and the conjunction fallacy, the assumption that a specific condition is more probable than a general one even when the specific condition includes the general condition [55]. Indeed, other studies have shown that machine learning algorithms can outperform humans (see [56] for a recent review). For example, McKinney et al. [57] presented a machine learning algorithm that outperformed six expert readers of mammograms in breast cancer prediction. Compared to the expert radiologists, the algorithm showed an absolute reduction in both false positives and false negatives. Given the nature of the task and the expertise of the human participants, we can reasonably assume that the difference in performance was not based on disinterest or lack of motivation on the part of the radiologists. We must therefore consider that the OOB error difference between the human participants and the algorithm in this study is potentially due to a difference in learning efficiency and the application of that learning.

Length was omitted from the RFs presented in Experiment 1 because we wanted to isolate the feature importance of speech sounds, and the descriptive statistics suggested that Length would be an important feature that might mask the importance of weaker features. Indeed, Length was found to be important in all the MRFs. It was most important in the Elicited (6.99%) and Japanese (5.47%) MRFs, which suggests that word length carries a considerable amount of sound symbolic information in Japanese. It was less important in the Korean (2.44%) and Chinese (0.51%) MRFs. Isolating Length from the other features in Experiment 1 and introducing it into the RFs in Experiment 2 uncovered the issue of overfitting that led to the use of MRFs. The -L Japanese RF in Experiment 1 (OOB 29.05%) performed better than the +L Japanese RF in Experiment 2 (OOB 30.95%), despite Length being a highly important feature (4.54%) in the +L Japanese RF. The most likely explanation for this overfitting is that there was little variability among the decision trees. We tested this hypothesis by recreating the Japanese RFs using the default hyperparameter settings in the Ranger package. Running the untuned MRFs resulted in more stable RFs that were only slightly less accurate than their tuned counterparts. This finding supports our hypothesis that overfitting in Experiment 1 was the result of a lack of variability among the decision trees. That lack of variability is likely due to the high number of features examined at each node (mtry), which the tuning process favoured because of the large percentage of null values in the dataset (82.26%).
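The contrast between the tuned and untuned forests can be illustrated with a minimal ranger sketch. This is not our exact script: the data frame pokemon and its outcome column evolved are assumed stand-ins for the feature tables described above.

    library(ranger)

    set.seed(1)
    train_idx <- sample(nrow(pokemon), size = floor(2/3 * nrow(pokemon)))
    train <- pokemon[train_idx, ]

    # Tuned forest: a large mtry examines most features at every node,
    # so individual trees end up very similar to one another.
    rf_tuned <- ranger(evolved ~ ., data = train, num.trees = 500,
                       mtry = ncol(train) - 1,
                       importance = "permutation", seed = 1)

    # Untuned forest: mtry falls back to floor(sqrt(p)), so each tree
    # considers a different random subset of features and trees vary more.
    rf_default <- ranger(evolved ~ ., data = train,
                         importance = "permutation", seed = 1)

    rf_tuned$prediction.error    # OOB error, tuned hyperparameters
    rf_default$prediction.error  # OOB error, default hyperparameters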

A potential solution to this issue was explored, which involved constructing each RF using the default hyperparameter settings; however, this increased the OOB error in all cases. Another potential solution would be to reduce the number of null values by reporting on phonological features rather than the sounds themselves; this would reduce both the number of null values and the number of features, at the cost of a less fine-grained data resolution. Instead, we constructed MRFs made up of independent RFs with different starting values for the randomisation of both the splitting of data into subsets and the RFs themselves, as sketched below. At first glance, MRFs may appear to be stacked RFs (SRFs: [58]), but this is not the case. Stacking [37] is a method of improving algorithm accuracy by combining weaker models into a super learner [59]. For example, Hänsch [58] sequentially adds RFs to SRFs, using the estimates of earlier RFs to improve the accuracy of the final model. Our method is closer to k-fold cross-validation, which involves randomly dividing the data into k groups, or folds, and then recombining the data by way of a partial Latin square to create multiple training/testing subsets, which are then used to construct and test multiple iterations of the algorithm [60]. K-fold cross-validation was not used in the present study because a user who adheres to the two-thirds training subset rule is limited in their choice for the number of iterations.
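In outline, the MRF procedure amounts to the loop below: k independent forests, each with its own random training/testing split and its own forest-level seed, summarised across iterations. The names pokemon and evolved, and the setting k = 30, are assumptions for illustration, not our verbatim settings.

    library(ranger)

    k <- 30                      # number of independent RFs in the MRF
    err <- numeric(k)
    for (i in seq_len(k)) {
      set.seed(i)                # fresh starting value for the data split
      idx <- sample(nrow(pokemon), size = floor(2/3 * nrow(pokemon)))
      rf <- ranger(evolved ~ ., data = pokemon[idx, ],
                   num.trees = 500, seed = i)      # fresh seed for the forest
      pred <- predict(rf, data = pokemon[-idx, ])  # classify the held-out third
      err[i] <- mean(pred$predictions != pokemon$evolved[-idx])
    }
    mean(err)                    # MRF accuracy: mean error across iterations
    sd(err)                      # stability of that accuracy

Because each iteration draws its own two-thirds training subset, the number of iterations can be chosen freely, which is the flexibility that k-fold cross-validation lacks under the two-thirds rule.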

Conclusion

The present study builds and tests machine learning algorithms using the names of Pokémon. Those algorithms are constructed to classify Pokémon into pre- and post-evolution categories. In Experiment 1, the algorithms are constructed using the speech sounds that make up Japanese, Chinese, and Korean Pokémon names. The feature importance calculations of these algorithms show that while some sound-symbolic patterns hold across languages, many important features are unique to each language. Experiment 1 also includes an elicitation experiment whereby Japanese participants named previously unseen Pokémon. We then construct RFs using the entirety of the official Japanese Pokémon name data and the elicited responses, and test each on the opposite dataset. The OOB error of these tests shows that the sound symbolic patterns in the two datasets are reasonably similar, suggesting either that those sound symbolic patterns already exist in the Japanese language or that the participants are familiar with Pokémon naming conventions. Previous studies have shown no correlation between Pokémon familiarity and sound symbolism effect size in nonce-word Pokémonastic experiments [61,62], suggesting that their results were not driven by existing knowledge of Pokémon names. In Experiment 2, all algorithms are reconstructed to include name length as a feature. This uncovers an issue of overfitting, which we resolve using MRFs. The performance of the MRFs is then measured against the performance of Japanese participants. The MRFs are shown to classify more accurately than the humans.

RFs are said to be appropriate for "small N, high p" datasets [63], such as those found in the present study. However, Experiment 2 uncovers a clear case of overfitting in Experiment 1. The RFs constructed with length as a feature showed that length was important, yet this importance was not always reflected in OOB error. For example, the Japanese +L MRF (OOB M = 31.69%) performed worse than the -L RF in Experiment 1 (OOB = 29.05%). Given that length was found to be important in the MRFs, this suggests that the individual RFs were overfitting because of the lack of variability among their decision trees. Further evidence for this can be found in the difference between the accuracy of the RF trained on official Japanese Pokémon names and tested on Elicited names (OOB = 38.57%) and the -L MRF trained on official Japanese Pokémon names and tested on Elicited names (OOB M = 37.31%). In other words, the Japanese RF in Experiment 1 was more accurate than the Japanese MRF at classifying its own testing subset but less accurate at classifying the elicited samples, because its function was too closely aligned to the initial dataset, resulting in a reduced capacity to classify external samples.

Sound symbolism is the study of systematic relationships between sounds and meanings. These relationships are not deterministic but stochastic, so they need to be observed through statistical analysis. This paper details random forest algorithms that learn these stochastic relationships and apply that learning to a classification task: the classification of Pokémon into pre- and post-evolution categories. The success of the algorithms at this task has important implications for the field of Natural Language Processing, adding to the findings of Winter and Perlman [11] by showing that machine learning algorithms can make classification decisions driven (at least mostly) by sound symbolic principles, and should do so if the goal of an algorithm is to understand and use language the same way that humans do. The algorithms show how they make their classification decisions through feature importance, which is a useful metric for measuring the sound symbolic qualities of specific linguistic features; this is particularly useful when assessing universal sound-symbolic patterns. The present paper also exposes an issue of overfitting inherent in random forests constructed from decision trees with low variability. This issue is resolved by randomising the training and testing subset splits across multiple random forests. The machine learning algorithms are shown to be efficient learners in this task, achieving higher classification accuracy than the human participants despite having access to a limited number of samples from which to learn.

References

  1. De Saussure F. Cours de linguistique générale. Paris: Payot; 1916.
  2. Berlin B. The first congress of ethnozoological nomenclature. Journal of the Royal Anthropological Institute. 2006; 12:S23–44.
  3. Blasi DE, Wichmann S, Hammarström H, Stadler PF, Christiansen MH. Sound–meaning association biases evidenced across thousands of languages. Proceedings of the National Academy of Sciences. 2016; 113(39):10818–23. pmid:27621455
  4. Newman SS. Further experiments in phonetic symbolism. The American Journal of Psychology. 1933; 45(1):53–75.
  5. Shinohara K, Kawahara S. A cross-linguistic study of sound symbolism: The images of size. In: Annual Meeting of the Berkeley Linguistics Society. 2010; 36(1):396–410.
  6. Imai M, Kita S. The sound symbolism bootstrapping hypothesis for language acquisition and language evolution. Philosophical Transactions of the Royal Society B: Biological Sciences. 2014; 369(1651):20130298. pmid:25092666
  7. Ozturk O, Krehm M, Vouloumanos A. Sound symbolism in infancy: Evidence for sound–shape cross-modal correspondences in 4-month-olds. Journal of Experimental Child Psychology. 2013; 114(2):173–86. pmid:22960203
  8. Perniss P, Vigliocco G. The bridge of iconicity: from a world of experience to the experience of language. Philosophical Transactions of the Royal Society B: Biological Sciences. 2014; 369(1651):20130300.
  9. Sidhu DM, Williamson J, Slavova V, Pexman PM. An investigation of iconic language development in four datasets. Journal of Child Language. 2022; 49(2):382–96. pmid:34176538
  10. Breiman L. Random forests. Machine Learning. 2001; 45(1):5–32.
  11. Winter B, Perlman M. Size sound symbolism in the English lexicon. Glossa: a journal of general linguistics. 2021; 6(1).
  12. Miura T, Murata M, Yasuda S, Miyabe M, Aramaki E. 音象徴の機械学習による再現: 最強のポケモンの生成 [Machine learning reproduction of sound symbolism: Generating the strongest Pokémon]. 言語処理学会第18回年次大会発表論文集 [Proceedings of the 18th Annual Meeting of the Association for Natural Language Processing]. 2012; 18:65–8.
  13. Bainbridge J. 'It is a Pokémon world': The Pokémon franchise and the environment. International Journal of Cultural Studies. 2014; 17(4):399–414.
  14. Hockett CD. The origin of speech. Scientific American. 1960; 203(3):88–97. pmid:14402211
  15. Kawahara S. Sound symbolism and theoretical phonology. Language and Linguistics Compass. 2020; 14(8):e12372.
  16. Sidhu DM, Pexman PM. Five mechanisms of sound symbolic association. Psychonomic Bulletin & Review. 2018; 25(5):1619–43. pmid:28840520
  17. Spence C. Crossmodal correspondences: A tutorial review. Attention, Perception, & Psychophysics. 2011; 73(4):971–95. pmid:21264748
  18. Köhler W. Gestalt psychology. New York: Liveright; 1947.
  19. Bremner AJ, Caparos S, Davidoff J, de Fockert J, Linnell KJ, Spence C. "Bouba" and "Kiki" in Namibia? A remote culture make similar shape–sound matches, but different shape–taste matches to Westerners. Cognition. 2013; 126(2):165–72. pmid:23121711
  20. Chen YC, Huang PC, Woods A, Spence C. When "Bouba" equals "Kiki": Cultural commonalities and cultural differences in sound–shape correspondences. Scientific Reports. 2016; 6(1):1–9.
  21. Ćwiek A, Fuchs S, Draxler C, Asu EL, Dediu D, Hiovain K, et al. The bouba/kiki effect is robust across cultures and writing systems. Philosophical Transactions of the Royal Society B. 2022; 377(1841):20200390. pmid:34775818
  22. Rogers SK, Ross AS. A cross-cultural test of the Maluma-Takete phenomenon. Perception. 1975; 4(1):105–6. pmid:1161435
  23. Styles SJ, Gawne L. When does maluma/takete fail? Two key failures and a meta-analysis suggest that phonology and phonotactics matter. i-Perception. 2017; 8(4):2041669517724807. pmid:28890777
  24. Kawahara S, Katsuda H, Kumagai G. Accounting for the stochastic nature of sound symbolism using Maximum Entropy model. Open Linguistics. 2019; 5(1):109–20.
  25. Maurer D, Pathman T, Mondloch CJ. The shape of boubas: Sound–shape correspondences in toddlers and adults. Developmental Science. 2006; 9(3):316–22. pmid:16669803
  26. Perry LK, Perlman M, Lupyan G. Iconicity in English and Spanish and its relation to lexical category and age of acquisition. PLoS ONE. 2015; 10(9):e0137147. pmid:26340349
  27. Perry LK, Perlman M, Winter B, Massaro DW, Lupyan G. Iconicity in the speech of children and adults. Developmental Science. 2018; 21(3):e12572. pmid:28523758
  28. Kawahara S, Kumagai G. Expressing evolution in Pokémon names: Experimental explorations. Journal of Japanese Linguistics. 2019; 35(1):3–8.
  29. Kawahara S, Moore J. How to express evolution in English Pokémon names. Linguistics. 2021; 59(3):577–607.
  30. Kawahara S, Noto A, Kumagai G. Sound symbolic patterns in Pokémon names. Phonetica. 2018; 75(3):219–44.
  31. Shih SS, Ackerman J, Hermalin N, Inkelas S, Kavitskaya D. Pokémonikers: A study of sound symbolism and Pokémon names. Proceedings of the Linguistic Society of America. 2018; 3(1):42:1–6.
  32. Biau G, Scornet E. A random forest guided tour. Test. 2016; 25(2):197–227.
  33. Belgiu M, Drăguţ L. Random forest in remote sensing: A review of applications and future directions. ISPRS Journal of Photogrammetry and Remote Sensing. 2016; 114:24–31.
  34. Ziegler A, König IR. Mining data with random forests: current options for real-world applications. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2014; 4(1):55–63.
  35. Al-Akhras M, El Hindi K, Habib M, Shawar BA. Instance reduction for avoiding overfitting in decision trees. Journal of Intelligent Systems. 2021; 30(1):438–59.
  36. Kotsiantis SB. Decision trees: a recent overview. Artificial Intelligence Review. 2013; 39(4):261–83.
  37. Breiman L. Bagging predictors. Machine Learning. 1996; 24(2):123–40.
  38. Ho TK. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1998; 20(8):832–44.
  39. Bulbagarden. Bulbapedia: The community-driven Pokémon encyclopedia [internet]. 2004 [updated 2022 Apr 22]. Available from: https://bulbapedia.bulbagarden.net/wiki/Main_Page.
  40. Sohn HM. Korean. Routledge; 2019.
  41. Duanmu S. The phonology of standard Chinese. Oxford: Oxford University Press; 2007.
  42. Wright MN, Ziegler A. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software. 2017; 77(1):1–17.
  43. Altmann A, Toloşi L, Sander O, Lengauer T. Permutation importance: a corrected feature importance measure. Bioinformatics. 2010; 26(10):1340–7. pmid:20385727
  44. Probst P, Wright MN, Boulesteix AL. Hyperparameters and tuning strategies for random forest. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2019; 9(3):e1301.
  45. Sapir E. A study in phonetic symbolism. Journal of Experimental Psychology. 1929; 12(3):225.
  46. Ultan R. Size-sound symbolism. Universals of Human Language. 1978; 2:525–68.
  47. Cho YM, Gregory KI. Korean phonology in the late twentieth century. Language Research. 1997; 33(4):687–735.
  48. Kim KO. Sound symbolism in Korean. Journal of Linguistics. 1977; 13(1):67–75.
  49. Haiman J. The iconicity of grammar: Isomorphism and motivation. Language. 1980; 56(3):515–40.
  50. Shih SS, Ackerman J, Hermalin N, Inkelas S, Jang H, Johnson J, et al. Cross-linguistic and language-specific sound symbolism: Pokémonastics. Ms. University of Southern California, University of California, Merced, University of California, Berkeley, Keio University, National University of Singapore and University of Chicago. 2019.
  51. Ohala JJ. The frequency code hypothesis underlies the sound symbolic use of voice pitch. Sound Symbolism. 1994; 2:325–47.
  52. LaPolla RJ. An experimental investigation into phonetic symbolism as it relates to Mandarin Chinese. In: Hinton L, Nichols J, Ohala JJ, editors. Sound symbolism. Cambridge: Cambridge University Press; 2006. p. 130–47.
  53. Kumagai G. The pluripotentiality of bilabial consonants: The images of softness and cuteness in Japanese and English. Open Linguistics. 2020; 6(1):693–707.
  54. Otake T, Hatano G, Cutler A, Mehler J. Mora or syllable? Speech segmentation in Japanese. Journal of Memory and Language. 1993; 32(2):258–78.
  55. Tversky A, Kahneman D. Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review. 1983; 90(4):293–315.
  56. Webb ME, Fluck A, Magenheim J, Malyn-Smith J, Waters J, Deschênes M, et al. Machine learning for human learners: opportunities, issues, tensions and threats. Educational Technology Research and Development. 2021; 69(4):2109–30.
  57. McKinney SM, Sieniek M, Godbole V, Godwin J, Antropova N, Ashrafian H, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020; 577(7788):89–94. pmid:31894144
  58. Hänsch R. Stacked Random Forests: More accurate and better calibrated. In: IGARSS 2020–2020 IEEE International Geoscience and Remote Sensing Symposium. 2020; p. 1751–4.
  59. Van der Laan MJ, Polley EC, Hubbard AE. Super learner. Statistical Applications in Genetics and Molecular Biology. 2007; 6(1). pmid:17910531
  60. James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning: with applications in R. New York: Springer; 2013.
  61. Kawahara S, Kumagai G. Inferring Pokémon types using sound symbolism: The effects of voicing and labiality. 音声研究 [Speech Research]. 2019; 23:111–6.
  62. Godoy MC, de Souza Filho NS, de Souza JG, França HA, Kawahara S. Gotta name 'em all: An experimental study on the sound symbolism of Pokémon names in Brazilian Portuguese. Journal of Psycholinguistic Research. 2020; 49(5):717–40.
  63. Strobl C, Malley J, Tutz G. An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods. 2009; 14(4):323–48. pmid:19968396