Abstract
Modern artificial intelligence (AI) technology is capable of generating human-sounding voices that could be used to deceive recipients in various contexts (e.g., deep fakes). Given the increasing accessibility of this technology and its potential societal implications, the present study conducted online experiments using original data to investigate the validity of AI-based voice similarity measures and their impact on trustworthiness and likability. Correlation analyses revealed that voiceprints – numerical representations of voices derived from a speaker verification system – can be used to approximate human (dis)similarity ratings. With regard to cognitive evaluations, we observed that voices similar to one’s own voice increased trustworthiness and likability, whereas average voices did not elicit such effects. These findings suggest a preference for self-similar voices and underscore the risks associated with the misuse of AI in generating persuasive artificial voices from brief voice samples.
Citation: Jaggy O, Schwan S, Meyerhoff HS (2025) AI-determined similarity increases likability and trustworthiness of human voices. PLoS ONE 20(3): e0318890. https://doi.org/10.1371/journal.pone.0318890
Editor: Ying Shen, Tongji University, CHINA
Received: May 20, 2024; Accepted: January 23, 2025; Published: March 5, 2025
Copyright: © 2025 Jaggy et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All anonymized data files are available from the figshare database https://doi.org/10.6084/m9.figshare.21022081.v2
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Artificial intelligence (AI) has become integral to modern life and is revolutionizing how people interact with technology and process information. From autonomous vehicles to personalized recommendation systems, AI’s ability to analyze and replicate human-like behaviors profoundly impacts numerous industries. One particularly relevant application is in the field of speech technology, in which AI systems not only recognize and synthesize speech but also simulate individual voice characteristics. This capability opens new avenues for personalized interactions, such as matching voice assistants to a user’s voice profile or augmenting that profile, illustrating the interplay between technology, identity, and human perception.
The human voice remains a remarkable signature of individuality, transcending mere communication to embed rich layers of information about the speaker. Beyond the conveyance of words, each voice carries unique timbre, tone, and other acoustic information that hints at the speaker’s identity. Like fingerprints, the human voice can be used to distinguish individuals from one another with a high degree of accuracy [1,2] and gives insights into the speaker’s emotions and physical attributes. Speech data can be used, for example, to recognize stress [3], emotions [4–7], the level of interest [8], age and sex [9,10], and personality traits [11,12] – for a review on speech analysis for health, see [13].
Voice assistants such as Alexa or Siri attempt to mimic the human voice with pleasant and recognizably individualized speech characteristics. So far, most voice assistants implement only one synthetic voice and thus follow an approach in which one voice fits all users [14]. However, voice assistants may also compute the voiceprint (see below) of the customer and utilize this information to modify the synthetic voice to make it similar to the customer’s voice.
This raises the question of how listeners evaluate voices similar to their own and whether they prefer average voices over more distinct voices. The present paper addresses this question in five experiments. As a first step, we show that similarity judgements of two voices by AI-based speaker recognition systems and human listeners significantly correspond (Experiments 1 and 2), which is a necessary precondition for AI-based cloning of individual voices. As a second step, we show that this correspondence also holds if one of the voices is the listener’s own voice (Experiment 3). As a third step, we show that average voices are not preferred over distinct voices (Experiment 4). We finally demonstrate that listeners judge voices similar to their own voice (according to the AI-based speaker recognition system) to be more likable and trustworthy than dissimilar voices (Experiment 5).
Characterizing individual human voices through AI-based d-vectors
The complexity of human speech poses a significant challenge: how can one distill and encode these sophisticated vocal characteristics into a form that captures the essence of individual identity? Modern speaker recognition systems use d-vectors, or similar kinds of speaker embeddings (such as x-vectors, r-vectors, or ECAPA-TDNN), derived from a deep neural net [15,16]. While there are only minor differences in performance among these speaker embeddings [17], d-vectors have the advantage that the speaker encoder that generates the embeddings is a lightweight model, widely used in the open-source community, and relatively easy to implement.
The starting point is a set of short audio samples from human speakers. The audio samples are non-linearly transformed on the frequency scale to emphasize distances between frequencies for which the human ear is most sensitive. Next, these transformations, called mel-spectrograms, are used to train deep neural networks. D-vectors, then, are the averaged activations of the final hidden layer of a deep neural network that is trained on a speaker verification or identification task. As a result, they are abstract representations of audio called “voiceprints”, which contain compressed information about the audio signal’s unique characteristics, such as timbre and tone, in a multidimensional space.
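To make this concrete, the following sketch shows how such a voiceprint can be computed with Resemblyzer, an open-source implementation of a d-vector speaker encoder of this kind (used here purely for illustration; the encoder employed in the present study is a retrained variant with larger LSTM layers, and the file name is hypothetical).

```python
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()                       # pretrained three-layer LSTM speaker encoder
wav = preprocess_wav("speaker_utterance.wav")  # hypothetical file; resampled and trimmed of silence
voiceprint = encoder.embed_utterance(wav)      # 256-dimensional d-vector of the utterance
print(voiceprint.shape)                        # (256,)
```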
However, such voiceprints may not only be used for speaker identification. Deep learning methods [18,19] also enable software to clone the voice of a real person. Cloning a voice traditionally involves training a Text-to-Speech (TTS) system on audio samples from the target individual. However, this requires a large number of audio samples from the individual (often unavailable) and the training of an entire TTS system, which is both time-intensive and computationally demanding. Consequently, cloning someone’s voice was usually either impossible or prohibitively expensive. Yet, providing voiceprints as additional information when training a TTS system makes it possible to clone a voice with only a few seconds of audio material and without the need to train a new system [20–22]. Even if the results are not yet as convincing as those of previously used techniques, the essential prerequisites have been met to convert any given text into speech and to determine the voice used by providing a voiceprint.
Comparing human to d-vector based voice similarity judgments
Yet, there is little research on how voiceprints are related to human perception. Since voiceprints are new in the field, we needed to establish their validity for (human) similarity judgments, which is a prerequisite for studying the cognitive consequences of voice similarity thereafter. Besides research on performance differences between humans and speaker recognition systems [23], to the best of our knowledge, there is only one study that has investigated the relationship between voice similarity estimates by humans and an automatic speaker recognition system [24]. The study by [24] showed a positive relationship between participants’ similarity judgments and comparison scores from a speaker recognition system [25]. In contrast to this study, we are interested in voiceprints from a speaker recognition system based on d-vectors, which are obtained by training a deep neural net [26] and can be used to clone a voice. Although our study does not employ cloned voices, the application of d-vectors for generating speech that resembles a target speaker’s voice makes them the ideal candidate for examining similarity effects.
The significance of likeability and trustworthiness in social interactions
Likeability and trustworthiness are foundational attributes that significantly influence social interactions and relationships. Research has demonstrated that individuals judged more likable by others are more persuasive, often receiving preferential treatment and social support [27–30], and that similar others are also perceived as more likeable [31]. Similarly, trustworthiness is central in fostering long-term (business) relationships and ensuring effective collaboration, as it mitigates uncertainty, reduces the perceived risk in interactions and increases predictability [32–36]. From an evolutionary perspective, trustworthiness likely signals an individual’s reliability and cooperative intent, which are essential for fostering social cohesion and reciprocal behaviors within groups. Similarly, likeability facilitates social bonding by eliciting positive affect and reducing interpersonal tension, enhancing collaboration and mutual support. Thus, since these constructs are integral to social evaluation processes, using likability and trustworthiness as dependent variables is critical to understanding the impact of voice similarity.
Voice typicality and its influence on trustworthiness and likability
Previous research in other perceptual domains has consistently demonstrated a beauty-in-averageness effect, where average or prototypical faces and objects are perceived as more attractive than those that deviate from the norm [37–39]. This phenomenon extends beyond mere aesthetic preference, reflecting a broader cognitive tendency to favor typical over unusual stimuli, which may be rooted in the ease of processing more familiar or expected patterns [40]. Additionally, familiarity has been shown to enhance social evaluations, such as perceived trustworthiness, particularly in the context of faces [41]. These findings suggest that perceptual and cognitive processes prioritize typicality and familiarity, potentially because they signal safety, reliability, or group affiliation.
Building on this framework, Experiment 4 sought to explore whether a similar effect is observable in the auditory domain, specifically for voices. In this context, typicality was operationalized as the mean cosine similarity between a given speaker’s voiceprint – a numerical representation of their vocal characteristics – and the voiceprints of all other speakers in our dataset. By examining whether voices with higher typicality are associated with greater trustworthiness and likability, we aimed to extend the beauty-in-averageness principle to auditory stimuli.
Similarity attraction and the possible effects of voices resembling listeners’ own voices
Cloned voices are an essential component of deep fakes, which are primarily used for entertainment purposes, such as showing Elon Musk performing a belly dance or Barack Obama mocking Donald Trump. However, there are also malicious use cases [42], and deep fakes have been used to spread fake news and propaganda [43]. Beyond these, there are more subtle possibilities for using manipulated audio, particularly in the field of voice assistants, with potentially significant effects on their users. According to the similarity attraction hypothesis [44,45], people like other people more if they behave, appear or think similarly to them – for a meta-analysis, see [46]. A possible explanation for similarity attraction is linked to a phenomenon called implicit egotism: People tend to evaluate themselves positively, and if they associate other people with themselves, the positive self-evaluation may influence their evaluation [47–49]. Building on this concept, it has been proposed that similarity influences attraction by shaping the perceived valence and significance of inferred traits [50–52]. Specifically, individuals may derive positive or negative evaluations of others based on shared or divergent attitudes, personality traits, or other attributes. According to [51], similar attitudes do not directly lead to attraction but foster expectations of additional positive qualities in the similar individual, driven by the individual’s own favorable self-assessment [50,53]. Moreover, the inclination to interpret superficial similarities as indicative of deeper shared traits can result in an overestimated sense of alignment. For instance, individuals might presume that an advisor who shares surface-level characteristics also holds similar preferences, thereby perceiving their advice as more applicable or insightful [54]. However, the effects of similarity may also stem from humans’ inherently social nature. Social Identity Theory [55] suggests that individuals exhibit a preference for and more positive behaviors toward those they perceive as members of their own group.
Biological explanations further contribute to our understanding of similarity effects. In the context of social networks, this mechanism is often referred to as homophily, which describes the tendency to form connections with similar others [56]. From an evolutionary perspective, it has been argued that altruistic behaviors typically come at a cost, except when directed toward genetically related individuals [57,58]. Since genetic similarity often correlates with phenotypic resemblance, evolutionary pressures may have favored prosocial behaviors toward those perceived as similar, enhancing cooperation and cohesion within social groups.
However, similarity effects do not only occur between human agents. Research on human-machine communication has shown that humans exhibit social responses to computers just as they do to humans. Consequently, similarity attraction might also arise in human-computer interactions involving artificial voices [59,60]. Indeed, general (i.e., non-adaptive) alignment of acoustic-prosodic features, such as speech rate, intensity, pitch, volume, and prosody, can lead to similarity attraction towards synthetic voices [61], which can positively influence learning [62], engagement [63], and enjoyment [64].
Based on the above considerations, we investigated
- Whether the cosine similarity derived from the trained neural network correlates with human similarity judgments (Exp 1-3).
- Whether speakers with prototypical voices are judged as more likeable and trustworthy (Exp 4).
- Whether speakers with similar voiceprints to the corresponding participants are perceived as more likable and trustworthy (Exp 5).
The relation between AI and human similarity judgments
In the first experiment, we investigated the validity of cosine similarity as a measure of perceived voice similarity by probing whether the cosine similarity of two voiceprints predicts human similarity judgments.
Method
Ethics statement.
The studies reported were approved by the ethics committee of the Leibniz-Institut für Wissensmedien, Tübingen (approval number LEK 2020/061 and LEK 2021/123). All participants provided written informed consent through the online platform qualtrics.com, and all experiments included in this study were preregistered (Exp. 1: https://osf.io/kxwsv; Exp. 2: https://osf.io/8c7xw; Exp. 3: https://osf.io/yt3b7; Exp. 4: https://osf.io/q59da; Exp. 5: https://osf.io/cv5g9).
Encoder and data.
For the first as well as the following experiments, we used an open-source encoder [65] based on research conducted by [22,66,67]. In contrast to the model described in [65], our model consists of three recurrent neural networks (RNN) of the long short-term memory type (LSTM layers) with 768 nodes, followed by a fully connected projection layer with 256 nodes and a tanh activation function.
The encoder is trained on a speaker verification task in which it learns to embed utterances from the same speaker close together in the embedding space and utterances from different speakers farther apart. This reduces intra-speaker variation and enhances inter-speaker discrimination. For each utterance, a 256-dimensional feature vector is created, in which each dimension can encode certain voice characteristics. These characteristics are specific to the speaker, and the vector can be understood as a numerical representation of the voice, a voiceprint. The similarity of two voices can be compared by calculating the cosine similarity of two feature vectors, yielding values ranging from -1 to 1.
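A minimal sketch of this comparison step, assuming two 256-dimensional voiceprints are available as NumPy arrays:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two voiceprints; values range from -1 to 1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical example with two random 256-dimensional vectors
rng = np.random.default_rng(0)
v1, v2 = rng.normal(size=256), rng.normal(size=256)
print(cosine_similarity(v1, v2))
```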
To train the network, we used the German subset of the Common-Voice dataset (https://github.com/mozilla/common-voice), Distant-Speech (https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/acoustic-models.html), LibriVoxDeEn [68], and the German subset of the VoxForge dataset (http://www.voxforge.org/home). These datasets contain linguistically diverse material. The linguistic content ranges from simple phrases to complex sentences, covering a broad spectrum of phonetic, lexical, and syntactic structures in German. The combined datasets consist of approximately 1,000 hours of spoken audiobooks and Wikipedia articles read aloud by about 10,000 non-professional speakers. Audio files not already cut at sentence boundaries were cut at the appropriate points.
Participants.
We chose a sample size of 100 participants for our first experiment as a practical starting point. This sample size provided a balance between feasibility and statistical power, allowing us to evaluate the relationship between cosine similarity and human similarity judgments while accounting for individual variability. Therefore, we recruited 50 male and 50 female German participants via Prolific (https://www.prolific.com), which was the recruitment platform used for all experiments. Basic demographic information was collected via Qualtrics (https://qualtrics.com) in all experiments.
Six of the participants were excluded because they failed in more than one control trial. The mean age was M = 32.01 (SD = 11.26). Forty-seven of the 94 participants were female, one diverse, and three refused to answer. Recruitment occurred from March 3, 2021, to March 5, 2021. Participants received £3.45 for their participation in the study.
Materials, stimuli and procedure.
Since our dataset included more male than female speakers (approximate ratio of 3:1), and because this was our first experiment using this type of data, we aimed to achieve a wide range of cosine similarity values with high granularity. We used only male speakers in this study to simplify the experimental design and ensure consistent conditions.
For each male speaker in our dataset, we calculated the cosine similarity of the voice embedding with each other speaker in the dataset. Since those cosine similarities are approximately normally distributed, randomly drawing from these pairs would result in too few examples from the edge categories. Therefore, we subdivided the cosine values into ten categories, using the lowest and highest cosine value between speaker pairs as reference points with equal cosine value differences between the breakpoints. We subsequently drew speaker pairs based on the categories, which should ensure an even distribution of cosine values and, therefore, the greatest possible variance in the stimulus material. For each drawn speaker, we randomly picked one audio sample from our dataset, trimmed it to a maximum length of 5 s, and normalized the volume. We drew 50 sets of 100 male speaker pairs and presented each set to one female and one male participant in a random order.
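The following sketch illustrates the binning logic described above, assuming the pairwise cosine similarities have already been computed; the handling of sparsely populated categories is simplified relative to the actual sampling procedure.

```python
import numpy as np

def sample_pairs_by_category(pairs, cosines, n_categories=10, per_category=10, seed=0):
    """Draw speaker pairs evenly across equal-width cosine-similarity categories."""
    rng = np.random.default_rng(seed)
    cosines = np.asarray(cosines)
    # Breakpoints with equal cosine-value differences between the lowest and highest value
    edges = np.linspace(cosines.min(), cosines.max(), n_categories + 1)
    categories = np.digitize(cosines, edges[1:-1])   # category labels 0 .. n_categories-1
    drawn = []
    for c in range(n_categories):
        idx = np.flatnonzero(categories == c)
        take = rng.choice(idx, size=min(per_category, idx.size), replace=False)
        drawn.extend(pairs[i] for i in take)
    return drawn
```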
Since our experiments were conducted online on pavlovia.org (https://pavlovia.org/) using PsychoPy [69], we checked whether participants had a working audio setup at the beginning of the experiment: We presented a short text instructing them to count sine tones. After four sine tones were played with an inter-stimulus interval of 1 s, participants indicated on a slider (ticks at 0, 1, 2, 3, 4, and 5) how many tones they had heard. If they failed this test, the experiment was concluded. If they passed the detection task, three introductory trials were presented that had the same structure as the regular trials: audio recordings from two different male speakers were presented sequentially, with an inter-stimulus interval of 1 s. While the audio was playing, a headphone icon was depicted. After hearing both voices once, participants rated the dissimilarity by adjusting a slider on an unmarked continuous rating scale (range: little dissimilarity – great dissimilarity). They were allowed to take as much time as needed for this rating, with no imposed time constraints. While piloting our study, we found it much more challenging to rate similarity than dissimilarity. Accordingly, we asked participants to rate dissimilarity rather than similarity and inverted the responses afterward. Participants could skip a rating but were informed that they should only choose this option if they could not hear one of the samples properly. Participants who skipped more than ten trials were excluded. To check participants’ attention, every 30th trial presented two different audio samples from the same speaker. Participants who rated the dissimilarity higher than 0.2 in more than one of the three control trials were excluded.
Results
To analyze the data in this study, we used the software R [70], the R package lme4 [71], and the R package MuMIn [72]. We used mixed models with participants as random effects, the raw cosine values as the independent variable, and the inverted slider responses as the dependent variable (R code for data processing is publicly available, see below). To test whether encoder ratings can predict how similarly humans judge different voices, we compared an intercept-only model, a linear model, and a quadratic model. We included a quadratic model because human judgments, particularly those based on perceptual features like voice similarity, often show non-linear trends [73]. This approach accounts for potential non-linear relationships between the cosine similarity of voice embeddings and human similarity judgments. For instance, individuals may perceive two voices as more similar up to a certain point, but after that, additional increases in cosine similarity might not yield proportional increases in perceived similarity. This suggests diminishing or varying returns on perceived similarity as cosine similarity increases.
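The model comparison can be sketched as a Python analogue of the R/lme4 models described above (the original analysis was run in R; the input file and column names are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("exp1_trials.csv")   # hypothetical long format: participant, cosine, similarity

formulas = {
    "intercept-only": "similarity ~ 1",
    "linear":         "similarity ~ cosine",
    "quadratic":      "similarity ~ cosine + I(cosine**2)",
}
fits = {name: smf.mixedlm(f, df, groups=df["participant"]).fit(reml=False)
        for name, f in formulas.items()}

for name, res in fits.items():
    print(name, round(res.llf, 1))    # higher log-likelihood = better fit; AIC = 2k - 2*llf
print(fits["quadratic"].summary())
```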
Since previous research found evidence for an own-gender bias in voice identification ability [74] and gender differences in voice processing [75,76], we included participants' gender as an additional factor. Weighting using Akaike information criterion (AIC) scores (see Table 1) showed a clear quadratic relationship between the calculated cosine similarity of the encoder and participants’ ratings (intercept: 0.36, 95% CI [0.34, 0.38], t(102.3) = 33.05, p < .001; cosine: 0.10, 95% CI [0.07, 0.13], t(9215) = 7.68, p < .001; cosine²: 0.47, 95% CI [0.42, 0.52], t(9215) = 18.51, p < .001). The median of the individual Spearman Correlations between the encoder’s cosine similarity values and participants' similarity ratings was Mdn rs = 0.37 (Q1 = 0.28, Q3 = 0.43), indicating a moderate relationship. The model explained 27% of the variance (Rc² = 0.27). Numerically, female participants rated the similarity slightly higher; however, the inclusion of gender as an additional factor is not justified given the AIC values.
The relationship between cosine values and the participants’ similarity ratings seems stronger for higher cosine similarity values. The quadratic relationship indicates that the cosine values derived from the deep neural network are associated with human similarity judgments and highlights a stronger association at more extreme cosine similarity values. Participants skipped, on average, M = 0.86 trials (SD = 1.16) and needed, on average, Mdn = 28.98 minutes to complete the experiment.
Given the large variance observed among participants’ similarity ratings and the use of a vast amount of different stimulus pairs, we performed additional analyses with aggregated data to gain more insights into the relationship between cosine values and similarity judgments. Rather than employing the encoder’s raw cosine values for each similarity judgment, we utilized the predefined similarity categories used in the sampling process as predictors. The response variable was the mean similarity judgment corresponding to each category. Since the above analysis revealed a quadratic relationship, we compared a linear model with a quadratic regression model using an analysis of variance (ANOVA). The results strongly favored the quadratic model over the linear model, F(1,7) = 240.98, p < .001. The analysis of the quadratic model itself revealed a significant quadratic relationship between the similarity category and the mean similarity rating, F(2,7) = 670.5, p < .001 (intercept: 0.37, 95% CI [0.36, 0.39], t(7) = 50.51, p < .001; category: -0.02, 95% CI [-0.03, -0.01], t(7) = -5.97, p < .001; category²: 0.006, 95% CI [0.005, 0.007], t(7) = 15.52, p < .001). The model explained most of the variance in the mean similarity rating, R² = 0.995 (Fig 1). Even though mixed effects models account for random variation and lead to shrinkage, the analysis with more aggregated data further reduces variation and leads to a more pronounced relation.
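A sketch of this aggregated analysis, again as a Python analogue of the original R code and with assumed column names:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("exp1_trials.csv")              # hypothetical: participant, category, similarity
agg = df.groupby("category", as_index=False)["similarity"].mean()

linear    = smf.ols("similarity ~ category", agg).fit()
quadratic = smf.ols("similarity ~ category + I(category**2)", agg).fit()

print(anova_lm(linear, quadratic))               # F-test for adding the quadratic term
print(round(quadratic.rsquared, 3))              # variance in mean ratings explained
```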
Depicted is the quadratic relationship between cosine similarity categories and the participants’ mean similarity ratings as well as the 95% confidence interval of the regression line.
While the association between similarity values and similarity judgments appears modest in the analysis with mixed models, the analysis with aggregated data suggests that the relationship warrants attention. This finding is noteworthy given the potential limitations of using stimuli derived from open-source datasets. Factors such as varying audio quality, speakers’ articulation proficiency, and the semantic content of audio clips could have influenced evaluations. At the same time, the diversity of the stimuli may have contributed to the ecological validity of incidental similarity evaluations.
It is also important to consider that assessing the similarity of two voices based on just 5 seconds of random samples is inherently challenging. The quadratic relationship observed in the data indicates that these challenges were particularly pronounced when participants evaluated speaker pairs with moderate to low cosine similarity values. These complexities and their potential implications for interpreting the results will be explored further in the General discussion.
Our findings gain context when compared with those of MOSNet [77]. MOSNet demonstrated a capacity to predict human-perceived similarity judgments, more precisely termed identity ratings, with Spearman Rank Correlation coefficients ranging between 0.292 and 0.455. The median correlation coefficient of 0.37 in our experiment aligns with the midpoint of MOSNet’s observed correlations, but it is based on raw similarity judgments – which we consider more valuable for research on the influence of voice similarity on cognitive processes.
Overall, the results confirm the validity of cosine similarity as a measure of perceived voice similarity. The quadratic relationship suggests that participants were disproportionately sensitive to very dissimilar and very similar voices but less capable of differentiating at intermediate similarity levels. This may reflect a natural limit in human voice discrimination abilities, particularly for voices that are neither too distinct nor too similar. These findings support the utility of AI-generated cosine similarity for approximating human voice similarity judgments.
The reliability of human similarity judgments
In order to interpret the magnitude of the correlation between raw cosine values and similarity judgments observed in Experiment 1, we needed to assess the reliability of human similarity judgments, which limits the maximum observable correlation [78].
Method
Participants.
G*Power [79] was used to calculate the necessary sample size for the Correlation: Bivariate normal model test as an approximation for the non-parametric Spearman rank correlation test that was used to calculate the test-retest reliability. The analysis aimed to detect a medium to large effect size with alpha = 0.05 and power = 0.80. The power analysis revealed a minimum sample size of 46 participants. We therefore recruited 50 new participants via Prolific. Five were excluded because they failed in more than one control trial. Eighteen of the remaining 45 participants were female; two did not indicate their sex. The mean age was M = 30.31 (SD = 10.91). Recruitment took place on April 19, 2021. Participants received £4.36 for their participation in the study.
Materials, stimuli and procedure.
Besides some minor changes, the material and procedure were identical to Experiment 1. In contrast to Experiment 1, we sampled 50 instead of 100 speaker pairs and used only one set of speakers for all participants. After each of the 50 speaker pairs had been presented once, participants were asked to rate their similarity again. The order of the speaker pairs was altered in the second part of the experiment but was the same for all participants. We aimed for this uniformity of the experimental conditions to avoid variance from individual randomizations of the trial order, since our approach mainly focused on correlations, which require reliable estimates of person parameters rather than of experimental conditions (for which randomization would be necessary).
Results
As in Experiment 1, we observed a correlation between cosine similarity and similarity judgments. Using AIC values from Table 2, we identified a quadratic relationship between cosine similarity (generated by the encoder) and participants’ ratings. The model explained 19% of the variance in similarity judgments (intercept: 0.43, 95% CI [0.40, 0.46], t(47.05) = 25.75, p < .001; cosine: 0.09, 95% CI [0.05, 0.13], t(4249) = 4.55, p < .001; cosine²: 0.24, 95% CI [0.16, 0.31], t(4249) = 5.97, p < .001, Rc² = 0.19). The median of the individual Spearman Correlation between cosine similarities and similarity ratings was Mdn rs = 0.23 (Q1 = 0.19, Q3 = 0.26). Therefore, we were able to replicate our results of Experiment 1, which showed a quadratic relation between cosine similarities and similarity judgments. The relationship, however, was slightly less pronounced than in Experiment 1. This may be explained by the limited stimulus material required to measure the reliability scores as well as the reduced number of trials to keep the experiment within reasonable boundaries.
As in the first experiment, we conducted additional regression analyses using the similarity category as the predictor and the mean similarity judgments as the response variable. An ANOVA comparing the quadratic and the linear model found no significant increase in fit, F(1,7) = 0.47, p = .52. The analysis of the linear model revealed a significant relationship between the similarity category and the mean similarity rating, F(1,8) = 5.52, p = .047 (intercept: 0.39, 95% CI [0.28, 0.50], t(8) = 8.48, p < .001; category: 0.02, 95% CI [0.00, 0.04], t(8) = 2.35, p = .047). The model explained 40.8% of the variance (R² = 0.408). Again, the lack of a quadratic effect in this aggregated dataset compared to the first experiment likely stems from the diminished variance due to the smaller number of speaker pairs per category (10 vs. 500) and the reduced sample size (50 vs. 100). Despite these limitations, the findings suggest a consistent monotonic increase in similarity judgments for speaker pairs in higher similarity categories.
Reliability and attenuation correction.
The test-retest reliability, indexed by the median of the individual Spearman Correlation between the first and the second similarity rating, was Mdn rs = 0.57 (Q1 = 0.44, Q3 = 0.65), which can be considered fair test-retest reliability [80]. Since the cosine similarity values derived from the encoder are consistent, a single attenuation correction was performed to estimate the true correlation. Using the obtained reliability value yielded a correlation between the cosine values of the encoder and the participants’ similarity ratings of Mdn rs = 0.48 for the first experiment and Mdn rs = 0.31 for the second experiment. Exploratory analyses showed a polynomial relationship between the first and the second rating (Fig 2). This indicates a stronger correlation for extreme (dis-)similarities.
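The attenuation correction amounts to dividing the observed correlation by the square root of the test-retest reliability, treating the encoder’s cosine values as error-free; a minimal sketch that reproduces the reported values up to rounding:

```python
import math

def disattenuate(r_observed: float, reliability: float) -> float:
    # r_true = r_observed / sqrt(r_xx * r_yy); the encoder side (r_yy) is treated as 1
    return r_observed / math.sqrt(reliability)

print(round(disattenuate(0.37, 0.57), 2))  # Experiment 1: ~0.49 (reported: 0.48, presumably from unrounded inputs)
print(round(disattenuate(0.23, 0.57), 2))  # Experiment 2: ~0.30 (reported: 0.31)
```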
The scatterplot shows the polynomial relationship between the first and the second similarity judgment and the 95% confidence interval of the regression line.
The observed reliability is crucial in understanding the correlation between AI-generated cosine values and human similarity judgments. Any error or inconsistency in human judgments (as suggested by a reliability value of less than 1) can attenuate or reduce the observed correlation. This means that the true correlation would likely be higher in the absence of such errors. Therefore, the obtained correlation between AI ratings and human judgments underestimates the actual strength of this relationship due to the influence of measurement error inherent in human judgments. Considering this reliability, our attenuation correction suggests that the true correlation is stronger than what is directly observed – even though the correlation values calculated by the attenuation correction are to be regarded as upper bounds.
The more pronounced correlations at extreme values of similarity, as indicated by the polynomial relationship, support this view and demonstrate that people struggle to make reliable and consistent judgments for speaker pairs of average similarity. This finding does not contradict the outcomes of Experiment 1, where a quadratic relationship suggested difficulties in making nuanced ratings for more dissimilar speaker pairs. Moreover, the results added evidence to the notion that judging similarity is inherently challenging, with more reliable assessments typically occurring at the extremes of the similarity spectrum.
Consistency across raters.
Since we used a fixed set of stimuli for all participants, we also assessed the consistency of similarity judgments across different raters by calculating the Intraclass Correlation Coefficient (ICC) with the R package irr [81]. We employed a two-way model to evaluate the level of agreement on similarity judgments among the 44 raters across the 50 speaker pairs. Since participants rated each speaker pair twice, we analyzed only the ratings from the first 50 trials. Where participants skipped the assessment of a speaker pair, we used the median of the other participants to replace the missing values – which was necessary in 17 of the 2200 cases. The results indicated a small to moderate level of agreement among the raters, ICC(A,1) = 0.31 (95% CI [0.23, 0.42], F(49, 722) = 25.8, p < .001). Although these results suggest a consistent assessment of similarity across raters within the context of our study, there are also substantial individual differences in judging the similarity of speaker pairs, highlighting the difficulty of evaluating the similarity of two voices.
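As a sketch, the same agreement index can be computed with pingouin, a Python analogue of the irr package used here; ICC(A,1) corresponds to the “ICC2” row of its output (the input file and column names are assumptions):

```python
import pandas as pd
import pingouin as pg

long = pd.read_csv("exp2_first_ratings.csv")   # hypothetical long format: rater, speaker_pair, rating
icc = pg.intraclass_corr(data=long, targets="speaker_pair",
                         raters="rater", ratings="rating")
print(icc[icc["Type"] == "ICC2"])              # two-way model, absolute agreement, single rater
```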
General observations.
Participants skipped, on average, M = 2.30 trials (SD = 1.79) and needed, on average, Mdn = 26.21 minutes to complete the experiment.
The findings confirm that AI-derived cosine values are predictive of human similarity judgments, though their strength varies depending on dataset restrictions and individual variability. Stronger correlations at extreme similarity values underscore participants’ difficulties in making reliable judgments for speaker pairs of average similarity. This pattern complements the quadratic relationship observed in Experiment 1, where the dissimilarity of speaker pairs posed challenges for nuanced ratings. Overall, the results reinforce that similarity judgments are inherently challenging and prone to individual differences, particularly in moderate similarity categories.
Similarity judgments in relation to the own voice
The first two experiments demonstrated the encoder’s ability to predict similarity judgments when the voices are those of other people. In the third experiment, we investigated whether this holds true if one of the voices is one’s own voice. Since the perception of one’s own voice depends on whether we are speaking or just listening to an audio sample of our voice [82], we manipulated this as a between-subjects factor. In the internal group, we only presented an audio sample of a speaker and asked participants to compare this sample with their own (internal) voice. In the external group, we presented an audio sample and additionally a sample of the participant, which was recorded prior to the experiment.
Method
Participants.
To ensure statistical consistency across experiments and facilitate meaningful comparisons, we used the same sample size of 100 participants in Experiment 3 (and the remaining experiments) as in Experiment 1. This approach minimizes potential discrepancies arising from differences in statistical power. Therefore, we recruited 100 new German participants via Prolific. Ten were excluded because they detected fewer than two control trials. The mean age was M = 25.76 (SD = 6.81). Fifty-one participants were female, two diverse, and three did not specify their sex. Recruitment occurred from August 14, 2021, to August 30, 2021. Participants received £5.00 for their participation in the study.
Materials, stimuli and procedure.
In the first session, each participant recorded five sentences. These recordings were used to compute the feature vector of their voice. We then calculated the cosine similarities of the participants’ voice embeddings with all speakers of the same gender in our dataset. In order to achieve the highest possible variance in cosine similarities, we assigned each raw cosine similarity value to one of ten similarity categories – where the category boundaries from the first two experiments were used. Ten speakers were randomly selected from each category, resulting in a total of 100 speakers. Since the cosine similarity values are approximately normally distributed, the extreme categories would be underrepresented otherwise. If there were not enough speakers in the more extreme category, a speaker was chosen from the category closer to the mean. We picked one audio sample from our dataset for each speaker, trimmed it to a maximum length of 5 s, and normalized the volume.
In the second session, after checking the audio setup, in 100 trials the participants were asked to rate the dissimilarity of the presented voice in comparison to their own voice. If they were assigned to the external representation group, on each trial, participants were randomly presented with one of their five audio recordings, followed by another person’s audio sample – without them having been instructed to listen to their recordings beforehand. In the internal representation group, only the speaker from our dataset was presented. In both groups, we simply asked the subjects to rate the similarity of the other person’s voice to their own voice and, therefore, did not mention that their own voice might have an internal representation.
After presenting three introductory trials, every 30th trial contained recordings of two different speakers, none of which came from the participant. Participants were asked to detect these pairs by clicking a red button below the rating scale. The participants who caught fewer than two control trials were excluded.
Results
We used mixed models with participants as random effects, the raw cosine values as the independent variable, and judged similarity as the dependent variable. We compared an intercept-only model, a linear model, and a quadratic model. Additionally, we included the between-subjects factor as an interaction term. Weighting using the AIC scores in Table 3 showed a linear relationship (Fig 3) between cosine similarity values and participants’ ratings. The model explained 21% of the variance in similarity ratings (intercept: 0.34, 95% CI [0.32, 0.37], t(101.1) = 26.32, p < .001; cosine: 0.15, 95% CI [0.12, 0.18], t(8853) = 9.80, p < .001; Rc² = 0.21). Whether one’s voice was externally presented or not had no significant effect. The median of the individual Spearman Correlation between cosine similarities and similarity ratings was Mdn rs = 0.11 (Q1 = 0.01, Q3 = 0.23). Performing an attenuation correction yielded a Spearman Correlation of Mdn rs = 0.15. This reflects modest predictive power at the individual level. Participants skipped on average M = 2.11 trials (SD = 2.56) and needed, on average, Mdn = 25.07 minutes to complete the experiment.
Depicted is the linear relationship between cosine similarity categories and the participants’ mean similarity ratings as well as the 95% confidence interval of the regression line.
We conducted additional regression analyses, employing the similarity category as the predictor and the average similarity judgments as the dependent variable. An ANOVA contrasting the quadratic with the linear model just missed the threshold of significance, F(1,7) = 4.27, p = .078. The analysis of the linear model revealed a significant effect of the similarity category on the mean similarity ratings, F(1,8) = 194.1, p < .001 (intercept: 0.34, 95% CI [0.33, 0.343], t(7) = 98.34, p < .001; category: 0.009, 95% CI [0.007, 0.010], t(7) = 13.93, p < .001). This model accounted for a substantial proportion of the variance in mean similarity ratings, as indicated by R² = 0.96.
As we noticed an overall decrease in similarity ratings when participants compared voices to their own voice, we conducted post-hoc Tukey-Kramer tests to investigate differences across the three experiments. Significant differences emerged between the average slider responses: M2 - M1 = 0.03, t = 5.49, p < .001; M3 - M1 = -0.08, t = -18.80, p < .001; M3 – M2 = -0.11, t = -20.44, p < .001. These findings suggest a consistent bias where participants are less likely to judge voices as similar to their own.
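A sketch of such a post-hoc comparison with statsmodels, as a Python analogue of the Tukey-Kramer test (the combined data layout and column names are assumptions):

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

combined = pd.read_csv("exp1_to_exp3_ratings.csv")    # hypothetical: experiment, similarity
result = pairwise_tukeyhsd(endog=combined["similarity"],
                           groups=combined["experiment"],
                           alpha=0.05)                 # handles unequal group sizes (Tukey-Kramer)
print(result.summary())
```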
Experiment 3 showed the encoder’s ability to partially predict similarity judgments even when one of the voices is one’s own voice. Unlike Experiments 1 and 2, Experiment 3 revealed a linear relationship between raw cosine similarity values and judged similarity, likely reflecting a general bias against perceiving self-voice similarities. The lower similarity ratings may stem from a Need for Uniqueness [83], where participants hesitate to identify other voices as similar to their own. Alternatively, heightened familiarity with one’s own voice may increase sensitivity to subtle differences, leading to underestimation of similarity. Surprisingly, whether participants compared the presented voices to an internal mental representation or an external recording of their own voice had no significant effect. This suggests that the internal representation of one’s voice may serve as the dominant reference point.
The beauty in average voices
Because the previous experiments indicated that cosine similarity is a valid proxy for perceived similarity, we investigated the cognitive consequences of voice similarity in the remaining experiments. In this experiment we focused particularly on the likability and trustworthiness of average voices. Our investigation was motivated by the concept of the beauty-in-averageness effect [39,40], which suggests that average features are often perceived as more attractive – even though they may not be optimally attractive [84]. This effect, well-documented in studies focusing on facial stimuli [85], may extend to auditory perceptions. By examining whether voices with average characteristics (determined by mean cosine similarity across our speaker dataset) are perceived as more likable and trustworthy, we wanted to explore whether this phenomenon transcends visual stimuli and applies to auditory perceptions as well.
Method
Participants.
We recruited 100 new German participants via Prolific. Two of them were excluded because they had fewer than 90 submitted ratings. The mean age was M = 30.32 (SD = 11.64). Forty-three participants were female, and two were diverse. Recruitment occurred from October 21, 2021, to October 26, 2021. Participants received £2.82 for their participation in the study.
Materials, stimuli and procedure.
We computed the cosine similarities of each male speaker with each other male speaker. We used the mean cosine similarity of a speaker as a measure of typicality. To obtain the highest possible variance in typicality, we assigned each mean cosine similarity value to one of ten similarity categories and randomly selected ten speakers from each category, resulting in a total of 100 speakers. We picked one audio sample from our dataset for each speaker, trimmed it to a maximum length of 5s, and normalized the volume. While the overall design of the experiment mirrored that of the second session in the third experiment – detecting sine tones to verify audio settings, performing three introductory trials, and judging 100 speakers – there were key differences. Instead of rating similarity, participants were asked to assess the likability and trustworthiness of the speakers using two continuous rating scales (ranging from “not at all” to “very”). Additionally, for the control trials, participants were required to detect audio samples from two female speakers.
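The typicality measure can be sketched as follows, assuming the voiceprints of all male speakers are stacked in a single array (variable names are assumptions):

```python
import numpy as np

def typicality_scores(voiceprints: np.ndarray) -> np.ndarray:
    """For voiceprints of shape (n_speakers, 256), return each speaker's mean
    cosine similarity with all other speakers as a measure of typicality."""
    normed = voiceprints / np.linalg.norm(voiceprints, axis=1, keepdims=True)
    sims = normed @ normed.T              # pairwise cosine similarities
    np.fill_diagonal(sims, np.nan)        # exclude each speaker's self-similarity
    return np.nanmean(sims, axis=1)
```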
Results
Considering the AIC scores in Table 4, there was no significant effect of mean cosine similarity on likeability ratings (Fig 4). The model comparison for the trustworthiness ratings (see Table 5) revealed a quadratic relationship (Fig 5; intercept: 0.52, 95% CI [0.49, 0.54], t(178.19) = 39.35, p < .001; mean cosine: -0.18, 95% CI [-0.38, 0.02], t(9627.08) = -1.88, p = .06; mean cosine²: 0.74, 95% CI [0.18, 1.30], t(9627.08) = 2.57, p = .01; Rc² = 0.21, indicating that the model explained 21% of the variance in trustworthiness ratings). The median of the individual Spearman Correlation between mean cosine similarities and trustworthiness ratings was Mdn rs = 0.04 (Q1 = -0.03, Q3 = 0.09), reflecting weak associations. Participants skipped, on average, M = 2.74 trials (SD = 1.12) and needed, on average, Mdn = 20.29 minutes to complete the experiment.
Depicted is the quadratic relationship between cosine similarity categories and the participants’ mean likeability ratings as well as the 95% confidence interval of the regression line.
Depicted is the quadratic relationship between cosine similarity categories and the participants’ mean trustworthiness ratings as well as the 95% confidence interval of the regression line.
Analyzing averaged data revealed no significant effects of mean cosine similarity on likeability or trustworthiness ratings (all p > .05).
These results suggest that a voice’s typicality (as captured by mean cosine similarity) does not affect likeability and has only a negligible influence on perceived trust. However, it should be noted that this could also be due to an insufficient number of particularly typical and atypical speakers or that the semantic content could have biased the evaluation. Previous research also pointed out that averageness has positive effects only in some dimensions but not in others [86].
The attraction toward similar voices
In the final experiment, we investigated whether speakers with similar voices to one’s own voice are perceived as more likable and trustworthy.
Method
Participants.
We recruited 100 new German participants via Prolific. Seven were excluded because they had fewer than 90 submitted ratings. The mean age was M = 31.04 (SD = 11.28). Forty-five participants were female. Recruitment occurred from November 11, 2021, to November 29, 2021. Participants received £3.76 for their participation in the study.
Results
We compared an intercept-only model, a linear model, and a quadratic model for both the likeability ratings and the trustworthiness ratings. Weighting using the AIC values in Table 6 showed a quadratic relationship between voice similarity and likeability ratings (intercept: 0.47, 95% CI [0.45, 0.49], t(111) = 46.70, p < .001; cosine: 0.11, 95% CI [0.05, 0.18], t(8863) = 3.55, p < .001; cosine²: 0.17, 95% CI [0.04, 0.30], t(8864) = 2.59, p = .009; Rc² = 0.19, indicating that 19% of the variance in likeability ratings was explained by the model). The median of the individual Spearman Correlation between cosine similarities and likeability ratings was Mdn rs = 0.15 (Q1 = 0.05, Q3 = 0.25), which, while modest, demonstrates a consistent positive relationship.
Weighting using the AIC values in Table 7 also showed a quadratic relation between voice similarity and trustworthiness ratings (intercept: 0.50, 95% CI [0.48, 0.52], t(111.2) = 47.55, p < .001; cosine: 0.06, 95% CI [-0.01, 0.12], t(8863) = 1.74, p = .08; cosine²: 0.31, 95% CI [0.17, 0.45], t(8865) = 4.45, p < .001; Rc² = 0.19, indicating that 19% of the variance in trustworthiness ratings was explained by the model). The median of the individual Spearman Correlation between cosine similarities and trustworthiness ratings was Mdn rs = 0.16 (Q1 = 0.06, Q3 = 0.25), which, while once again modest, demonstrates a consistent positive relationship. Participants skipped, on average, M = 5.95 trials (SD = 3.12) and needed, on average, Mdn = 19.06 minutes to complete the experiment.
To investigate the relationship further, we employed the similarity category as the predictor, with the average likeability judgments serving as the dependent variable. The ANOVA comparison between the quadratic and the linear model demonstrated an improved fit for the quadratic model, F(1,7) = 6.57, p = .04. The quadratic model’s analysis revealed a significant influence of the similarity category on the average likeability ratings, F(2,7) = 134.5, p < .001 (intercept: 0.467, 95% CI [0.454, 0.480], t(7) = 86.4, p < .001; category: 0.005, 95% CI [-0.001, 0.012], t(7) = 1.916, p = .097; category²: 0.0008, 95% CI [0.0001, 0.0015], t(7) = 2.564, p = .04). With R² = 0.97, this model accounted for a substantial portion of the variance.
The ANOVA comparison between the quadratic and linear models using the similarity categories as predictors and the corresponding mean trustworthiness ratings as the response variable indicated a more favorable fit for the quadratic model, F(1,8) = 8.31, p = .02. The quadratic model’s analysis revealed a significant effect, F(2,7) = 73.95, p < .001 (intercept: 0.50, 95% CI [0.48, 0.51], t(7) = 65.141, p < .001; category: 0.002, 95% CI [-0.008, 0.011], t(7) = 0.422, p = .686; category²: 0.001, 95% CI [0.0002, 0.0022], t(7) = 2.883, p = .02). With R² = 0.955, this model accounted for a substantial proportion of the variance.
Taken together, these results demonstrate that the quadratic relationship between voice similarity and ratings of likeability and trustworthiness accounts for a substantial proportion of the variance. These findings suggest that while individual effect sizes are modest, the overall fit of the models underscores the practical relevance of voice similarity in shaping social perceptions. Indeed, these findings support the similarity-attraction hypothesis [44,45], in that voices similar to one’s own are perceived as more likable and trustworthy. The quadratic relationship suggests that the effect is stronger for higher levels of voice similarity. This effect likely stems from implicit egotism [47–49], suggesting that individuals evaluate self-associated traits positively, or from social identity processes, in which perceived similarity fosters a sense of connection or group affiliation [55]. These results highlight the potential for AI systems to exploit similarity effects in personalized technologies, such as voice assistants, to influence user perceptions and behavior.
Discussion
Speaker verification systems can compute numerical representations of human voices, so-called voiceprints. Whereas traditionally, speaker verification systems served, for example, as a forensic toolkit or as a biometric security feature in highly secured areas, the spread of deep learning technologies has also increased the possibilities of utilizing voiceprints. Most importantly with regard to the present study, they could be used to design and shape the voice features of artificial voices. Despite the emerging importance of voiceprints in TTS systems, little research has been conducted on the potential cognitive influences (including manipulations) of variations in the voice features of digital assistants. In the present set of experiments, we present first evidence of this kind, including the methodological prerequisites for such an investigation.
The results of our first experiment indicated that the cosine similarity between voiceprints can predict voice similarity judgments (i.e., validity). The resulting quadratic relationship is likely due to disproportionate sensitivity to dissimilar voices. Since voices of close relatives are often similar and vary within a speaker depending on the time of day, physical condition, and context [87–89], the ability to discriminate between similar voices is a necessary skill for humans to learn. Below a certain threshold at which it is obvious that voices stem from different speakers, we are not aware of any reasonable explanation as to why it might be valuable to further differentiate between different levels of dissimilarity. Since the trained speaker verification system is agnostic regarding ecological advantages, it can differentiate even between dissimilar voices. Experiment 2 replicated these results. Moreover, the results revealed a relatively fair test-retest reliability of human similarity judgments. On the one hand, this implies that, in principle, correlations between similarity judgments and other cognitive variables should be observable. On the other hand, however, should such correlations arise, the numerical values most likely underestimate the true correlations since the reliability was far from perfect.
The results of Experiment 3 revealed that the AI is also capable of partially predicting the perceived similarity between voices if one of the voices is one’s own. Although the correlation was less pronounced, we found a linear relationship between cosine similarities and participant ratings. The most relevant difference in the experiment involving one’s own voice, relative to judging the similarity between two unrelated voices (Experiments 1 and 2), was that similarity ratings were generally lower. One possible explanation for this is people’s Need for Uniqueness [83], which could make them more hesitant to classify a voice as similar to their own voice. Another interpretation could be that familiar voices are processed differently from unfamiliar voices [90,91]. The familiarity with one’s own voice could thus increase the sensitivity for differences, which, in turn, might lead to an underestimation of the similarity. Interestingly, whether participants were asked to compare a voice with their internal representation of their own voice or with an external recording of their voice had no substantial effect on the observed similarity judgments. This is surprising, as one’s own voice is typically transmitted not only via air but also via bone conduction. As there was no difference, we consider it likely that the internal representation of one’s own voice is the dominant reference point from which similarity judgments are made.
After establishing these essential methodological prerequisites for studying correlational relationships, we strove to investigate how voice features might affect basic cognitive evaluations. Following previous research on the beauty-in-averageness effect, we first investigated whether speakers are perceived as more likable and trustworthy if they have a more average voice. In contrast to other studies, we observed no evidence for a correlation between typicality and likeability and only a marginal effect of typicality on trustworthiness. However, there are substantial differences between previous studies and our experimental approach. Previous studies generated an average voice by creating composites, either by statistically averaging speakers [92] or by auditory morphing [93,94]. However, [95] argue that such composites lead to artifacts, especially an increased harmonics-to-noise ratio [96], which has been reported to decrease with age [97,98], in stressful situations [99], and in cases of hoarseness [100]. Therefore, it is questionable whether the similarity or the more favorable change in the harmonics-to-noise ratio is the reason for the obtained results. In sum, we thus tend to conclude that there is little evidence for a beauty-in-averageness effect for voices.
With regard to similarity to one’s own voice, however, our final experiment revealed a different pattern of results. When one’s own voice serves as the reference point, similar voices are perceived as more likable and trustworthy. These findings match previous studies [101,102]. Again, however, the evidence from previous studies was rather weak, as they either did not manipulate similarity directly [101] or merely adjusted pitch (±20 Hz) or loudness (±10 dB) [102]. Participants rated samples altered in loudness as more favorable than samples shifted in pitch, and the authors concluded that this pattern reflects the higher similarity of the loudness-manipulated samples to the original recording. However, we are not convinced that participants perceive a recording of their voice as a recording from another person if it is merely louder or quieter [103]. The recordings altered in pitch may therefore have been compared not to a similar voice but to one’s own voice. Since people overestimate the attractiveness of their own voice, the results could be a consequence of this vocal implicit egotism [47]. A more consistent approach would have been to manipulate similarity by shifting the pitch to varying degrees. Apart from that, shifting the pitch of a recording introduces far more noise than altering the loudness, making the stimulus more artificial and possibly unpleasant.
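To illustrate why the two manipulations are not comparable, the sketch below contrasts a pitch shift with a loudness change (using librosa; the file name and baseline f0 are hypothetical, and this is not the procedure of [102]): the pitch shift requires resynthesis and therefore introduces processing artifacts, whereas the loudness change is a simple amplitude scaling that leaves the signal otherwise intact.

```python
import numpy as np
import librosa

def shift_pitch_hz(y: np.ndarray, sr: int, delta_hz: float, baseline_f0: float = 200.0) -> np.ndarray:
    """Approximate a +/- delta_hz pitch shift around an assumed baseline f0,
    converted to semitones for librosa's resynthesis-based pitch shifter."""
    n_steps = 12.0 * np.log2((baseline_f0 + delta_hz) / baseline_f0)
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def change_loudness_db(y: np.ndarray, delta_db: float) -> np.ndarray:
    """Change loudness by +/- delta_db via amplitude scaling only."""
    return y * (10.0 ** (delta_db / 20.0))

# Hypothetical usage on a recording of one's own voice
y, sr = librosa.load("own_voice.wav", sr=None)
pitch_shifted = shift_pitch_hz(y, sr, delta_hz=20.0)   # resynthesized, audibly processed
louder = change_loudness_db(y, delta_db=10.0)          # same signal, only scaled
```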
Our study’s results underscore the potential for even brief voice recordings to be misused in shaping artificial voices in a way that influences people. The observed correlations between voiceprint similarity and human judgments of likability and trustworthiness (as seen in our final experiment) highlight a vulnerability in human perception. This could be exploited in TTS systems, where slight alterations of voice features aligned with a user’s voiceprint could subtly sway their perceptions and behavior. Thus, while our research contributes to understanding the cognitive impact of voice similarity, it also opens discussions about ethical implications in the context of TTS technologies and voice assistants, where personalized voices might be used to manipulate user responses.
Even though the effects in our experiments were relatively small, the widespread use of voice assistants and the more elaborate methods available to large technology companies mean that the impact could be tremendous in absolute terms. This applies not only to the use of user-adapted voices to make interactions with assistants more attractive but also to their impact on advertising messages and political propaganda.
General limitations
While our study provides valuable insights into the relationship between voice similarity and cognitive evaluations, several limitations must be acknowledged.
A major limitation of this study arises from the wide range of voice similarity values we investigated. Including pairs of voices from across the similarity spectrum ensured a solid understanding of the overall pattern; however, this broad spectrum may have diluted specific effects that are particularly pronounced in highly similar voices. This reflects a trade-off between experimental control and ecological validity: while our design provided valuable insights into general trends, future studies that focus on speakers with high cosine similarity values may capture the nuances of judgments in this critical range more precisely and yield more application-oriented results.
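As a starting point for such follow-up work, candidate stimulus pairs could be pre-selected from the voiceprint similarity matrix. The following minimal sketch (the threshold and data are hypothetical) returns all speaker pairs above a chosen cosine similarity:

```python
import numpy as np

def highly_similar_pairs(voiceprints: np.ndarray, threshold: float = 0.75) -> list[tuple[int, int]]:
    """Return index pairs of speakers whose voiceprints exceed a cosine
    similarity threshold, e.g., to sample stimuli from the high-similarity range."""
    normed = voiceprints / np.linalg.norm(voiceprints, axis=1, keepdims=True)
    similarity = normed @ normed.T
    i, j = np.triu_indices(len(voiceprints), k=1)  # unique unordered pairs
    selected = similarity[i, j] >= threshold
    return list(zip(i[selected].tolist(), j[selected].tolist()))
```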
The use of open-source datasets, while providing a wide range of speaker voices, also introduced variability in audio quality, articulation, and linguistic content. These factors may not only have influenced participants’ judgments but also have reduced the internal validity of the experiments. Future work should employ more controlled datasets or systematic manipulations of stimulus properties to reduce potential biases.
Our study emphasized static voice similarity judgments based on pre-recorded audio. Dynamic aspects of speech, such as conversational context, prosody, or situational factors, were not considered. These elements are likely to influence perceptions and warrant exploration in future research.
Finally, the online nature of the experiments presents challenges such as variable listening environments and participant compliance. Although attention checks and control trials were implemented, these measures cannot fully account for potential distractions or technical issues encountered by participants during the study.
Conclusion
Our findings demonstrate that AI-derived cosine similarity measures effectively predict human voice similarity judgments and influence social evaluations. Across the first three experiments, we found significant relationships between the cosine similarity of voice embeddings and participants’ similarity ratings, with a quadratic pattern emerging for judgments of other voices. This suggests that participants were particularly sensitive to highly similar and dissimilar voices, while intermediate similarity was more challenging to evaluate.
In Experiment 3, we extended this analysis to self-voice comparisons, revealing a general bias against perceiving other voices as similar to one’s own. This bias likely stems from an increased sensitivity to subtle differences from one’s own voice or from a Need for Uniqueness [83]. Comparisons to an internal mental representation and to external audio recordings of one’s own voice yielded similar results, suggesting that the internal representation serves as the dominant reference.
Experiments 4 and 5 explored how voice similarity influences social perceptions such as likability and trustworthiness. Contrary to the beauty-in-averageness effect found in visual stimuli, we found no evidence that average voices were perceived as more likable and only weak effects on trustworthiness. However, voices similar to one’s own were judged as both more likable and trustworthy, supporting the similarity-attraction hypothesis and the influence of implicit egotism.
Our results highlight the potential of AI-derived cosine similarity as a tool for understanding voice perception. While individual effects were modest, the consistency of the findings underscores their practical relevance for voice-based technologies such as personalized voice assistants and synthetic speech systems. Future research should focus on refining models for highly similar voices, exploring cross-linguistic generalizability, and addressing the ethical implications of voice similarity manipulations. By bridging human perception and AI-driven voice representations, this study advances our understanding of the role of voice similarity in cognition and social interaction.
References
- 1. Doddington GR. Speaker recognition—identifying people by their voices. Proc IEEE. 1985;73(11):1651–64.
- 2. Li H, Xu C, Rathore AS, Li Z, Zhang H, Song C, et al. VocalPrint: exploring a resilient and secure voice authentication via mmWave biometric interrogation. Proceedings of the 18th Conference on Embedded Networked Sensor Systems. Virtual Event, Japan: ACM; 2020. p. 312–25.
- 3. Van Puyvelde M, Neyt X, McGlone F, Pattyn N. Voice stress analysis: a new framework for voice and effort in human performance. Front Psychol. 2018;9:1994. pmid:30515113
- 4. Kaya H, Karpov AA. Efficient and effective strategies for cross-corpus acoustic emotion recognition. Neurocomputing. 2018;275:1028–34.
- 5. Lee C-C, Mower E, Busso C, Lee S, Narayanan S. Emotion recognition using a hierarchical binary decision tree approach. Speech Commun. 2011;53(9-10):1162–71.
- 6. Grágeda N, Alvarado E, Mahu R, Busso C, Becerra Yoma N. Distant speech emotion recognition in an indoor human-robot interaction scenario. INTERSPEECH 2023. ISCA; 2023. p. 3657–3661.
- 7. Grágeda N, Busso C, Alvarado E, García R, Mahu R, Huenupan F, et al. Speech emotion recognition in real static and dynamic human-robot interaction scenarios. Comput Speech Lang. 2025;89:101666.
- 8. Jeon JH, Xia R, Liu Y. Level of interest sensing in spoken dialog using multi-level fusion of acoustic and lexical evidence. Interspeech 2010;2010:2802–5.
- 9. Li M, Han K, Narayanan S. Automatic speaker age and gender recognition using acoustic and prosodic level information fusion. Comput Speech Lang. 2012;27:151–67.
- 10. Meinedo H, Trancoso I. Age and gender classification using fusion of acoustic and prosodic features. Interspeech 2010;2010:2818–2821.
- 11. Carbonneau M-A, Granger E, Attabi Y, Gagnon G. Feature learning from spectrograms for assessment of personality traits. IEEE Trans Affective Comput. 2017;11(1):25–31.
- 12. Mohammadi G, Vinciarelli A. Automatic personality perception: prediction of trait attribution based on prosodic features. 2015 International Conference on Affective Computing and Intelligent Interaction (ACII). 2015. p. 484–490.
- 13. Cummins N, Baird A, Schuller BW. Speech analysis for health: current state-of-the-art and the increasing impact of deep learning. Methods. 2018;151:41–54. pmid:30099083
- 14. Cambre J, Kulkarni C. One voice fits all?: Social Implications and research challenges of designing voices for smart devices. Proc ACM Hum-Comput Interact. 2019;3(CSCW):1–19.
- 15. Hinton G, Deng L, Yu D, Dahl G, Mohamed A, Jaitly N, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process Mag. 2012;29(6):82–97.
- 16. Ohi AQ, Mridha MF, Hamid MA, Monowar MM. Deep speaker recognition: process, progress, and challenges. IEEE Access. 2021;9:89619–43.
- 17. Zhao Z, Pan D, Peng J, Gu R. Probing deep speaker embeddings for speaker-related tasks. arXiv; 2022. Available from: http://arxiv.org/abs/2212.07068
- 18. Oord A van den, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, et al. WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. 2016.
- 19. Shen J, Pang R, Weiss RJ, Schuster M, Jaitly N, Yang Z, et al. Natural TTS synthesis by conditioning WaveNet on MEL spectrogram predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, AB: IEEE; 2018. p. 4779–4783. https://doi.org/10.1109/ICASSP.2018.8461368
- 20. Arik S, Diamos G, Gibiansky A, Miller J, Peng K, Ping W, et al. Deep voice 2: multi-speaker neural text-to-speech. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017. p. 2966–2974. Available from: http://arxiv.org/abs/1705.08947
- 21. Cooper E, Lai C-I, Yasuda Y, Fang F, Wang X, Chen N, et al. Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE; 2020. p. 6184–6188. https://doi.org/10.1109/ICASSP40776.2020.9054535
- 22. Jia Y, Zhang Y, Weiss RJ, Wang Q, Shen J, Ren F, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. arXiv:1806.04558 [cs, eess]. 2018.
- 23. González Hautamäki R, Kinnunen T, Hautamäki V, Laukkanen A-M. Automatic versus human speaker verification: the case of voice mimicry. Speech Commun. 2015;72:13–31.
- 24. Gerlach L, McDougall K, Kelly F, Alexander A, Nolan F. Exploring the relationship between voice similarity estimates by listeners and by an automatic speaker recognition system incorporating phonetic features. Speech Commun. 2020;124:85–95.
- 25. Dehak N, Kenny PJ, Dehak R, Dumouchel P, Ouellet P. Front-end factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process. 2011;19(4):788–98.
- 26. Variani E, Lei X, McDermott E, Moreno IL, Gonzalez-Dominguez J. Deep neural networks for small footprint text-dependent speaker verification. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence, Italy: IEEE; 2014. p. 4052–4056.
- 27. Gommans R, Sandstrom MJ, Stevens GWJM, ter Bogt TFM, Cillessen AHN. Popularity, likeability, and peer conformity: four field experiments. J Exp Soc Psychol. 2017;73:279–89.
- 28. Younan M, Martire KA. Likeability and expert persuasion: dislikeability reduces the perceived persuasiveness of expert evidence. Front Psychol. 2021;12:785677. pmid:35002877
- 29. Brodsky SL, Neal TMS, Cramer RJ, Ziemke MH. Credibility in the courtroom: how likeable should an expert witness be? J Am Acad Psychiatry Law. 2009;37(4):525–32. pmid:20019000
- 30. Clayson D. The student evaluation of teaching and likability: what the evaluations actually measure. Assess Eval High Educ. 2022;47(2):313–26.
- 31. Moreland RL, Zajonc RB. Exposure effects in person perception: Familiarity, similarity, and attraction. J Exp Soc Psychol. 1982;18(5):395–415.
- 32. Dyer JH, Chu W. The role of trustworthiness in reducing transaction costs and improving performance: empirical evidence from the United States, Japan, and Korea. Organ Sci. 2003;14(1):57–68.
- 33. Evans AM, Krueger JI. The psychology (and economics) of trust. Social Pers Psych. 2009;3:1003–17.
- 34. Kumar N. The power of trust in manufacturer-retailer relationships. Harv Bus Rev. 1996;74:92–106.
- 35. Rempel JK, Holmes JG, Zanna MP. Trust in close relationships. J Pers Soc Psychol. 1985;49:95–112.
- 36. Simpson JA. Psychological foundations of trust. Curr Dir Psychol Sci. 2007;16(5):264–8.
- 37. Halberstadt J. The generality and ultimate origins of the attractiveness of prototypes. Pers Soc Psychol Rev. 2006;10(2):166–83. pmid:16768653
- 38. Holzleitner IJ, Lee AJ, Hahn AC, Kandrik M, Bovet J, Renoult JP, et al. Comparing theory-driven and data-driven attractiveness models using images of real women’s faces. J Exp Psychol Hum Percept Perform. 2019;45(12):1589–95. pmid:31556686
- 39. Langlois JH, Roggman LA. Attractive faces are only average. Psychol Sci. 1990;1(2):115–21.
- 40. Winkielman P, Halberstadt J, Fazendeiro T, Catty S. Prototypes are attractive because they are easy on the mind. Psychol Sci. 2006;17(9):799–806. pmid:16984298
- 41. Sofer C, Dotsch R, Wigboldus DHJ, Todorov A. What is typical is good: the influence of face typicality on perceived trustworthiness. Psychol Sci. 2015;26(1):39–47. pmid:25512052
- 42. Brewster T. Fraudsters cloned company director’s voice in $35 million bank heist, police find. In: Forbes [Internet]. 14 Oct 2021 [cited 24 Mar 2022]. Available from: https://www.forbes.com/sites/thomasbrewster/2021/10/14/huge-bank-fraud-uses-deep-fake-voice-tech-to-steal-millions/
- 43. Burgess S. Ukraine war: Deepfake video of Zelenskyy telling Ukrainians to “lay down arms” debunked. In: Sky News [Internet]. 17 Mar 2022 [cited 30 Mar 2022]. Available from: https://news.sky.com/story/ukraine-war-deepfake-video-of-zelenskyy-telling-ukrainians-to-lay-down-arms-debunked-12567789
- 44. Byrne D. Interpersonal attraction and attitude similarity. J Abnormal Soc Psychol. 1961;62:713–5. pmid:13875334
- 45. Byrne D, Griffitt W, Stefaniak D. Attraction and similarity of personality characteristics. J Pers Soc Psychol. 1967;5(1):82–90. pmid:4382219
- 46. Montoya RM, Horton RS, Kirchner J. Is actual similarity necessary for attraction? A meta-analysis of actual and perceived similarity. J Soc Pers Relat. 2008;25(6):889–922.
- 47. Hughes SM, Harrison MA. I like my voice better: self-enhancement bias in perceptions of voice attractiveness. Perception. 2013;42(9):941–9. pmid:24386714
- 48. Jones JT, Pelham BW, Carvallo M, Mirenberg MC. How do i love thee? Let me count the JS: implicit egotism and interpersonal attraction. J Pers Soc Psychol. 2004;87(5):665–83. pmid:15535778
- 49. Peng Z, Hu Z, Wang X, Liu H. Mechanism underlying the self-enhancement effect of voice attractiveness evaluation: self-positivity bias and familiarity effect. Scand J Psychol. 2020;61(5):690–7. pmid:32395824
- 50. Ajzen I. Effects of information on interpersonal attraction: similarity versus affective value. J Pers Soc Psychol. 1974;29(3):374–80. pmid:4814127
- 51. Kaplan MF, Anderson NH. Information integration theory and reinforcement theory as approaches to interpersonal attraction. J Pers Soc Psychol. 1973;28(3):301–12.
- 52. Montoya RM, Horton RS. A meta-analytic investigation of the processes underlying the similarity-attraction effect. J Soc Pers Relations. 2013;30(1):64–94.
- 53. Stalling RB. Personality similarity and evaluative meaning as conditioners of attraction. J Pers Soc Psychol. 1970;14(1):77–82. pmid:5435539
- 54. Hovland CI, Janis IL, Kelley HH. Communication and persuasion: psychological studies of opinion change. New Haven, CT, US: Yale University Press; 1953. p. xii, 315.
- 55. Tajfel H, Turner JC. The social identity theory of intergroup behavior. In: Worchel S, Austin WG, editors. Psychology of Intergroup Relations. Chicago, IL: Nelson-Hall; 1986. p. 7–24.
- 56. McPherson M, Smith-Lovin L, Cook JM. Birds of a feather: homophily in social networks. Annu Rev Sociol. 2001;27(1):415–44.
- 57. Burnstein E, Crandall C, Kitayama S. Some neo-Darwinian decision rules for altruism: Weighing cues for inclusive fitness as a function of the biological importance of the decision. J Pers Soc Psychol. 1994;67(5):773–89.
- 58. Hamilton WD. The genetical evolution of social behaviour. II. J Theor Biol. 1964;7(1):17–52. pmid:5875340
- 59. Nass C, Steuer J, Tauber ER. Computers are social actors. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems; 1994. p. 72–78.
- 60. Nass C, Moon Y. Machines and mindlessness: social responses to computers. J Soc Issues. 2000;56(1):81–103.
- 61. Nass C, Lee KM. Does computer-synthesized speech manifest personality? Experimental tests of recognition, similarity-attraction, and consistency-attraction. Journal of Experimental Psychology: Applied. 2001;7:171–181.
- 62. Lubold N, Walker E, Pon-Barry H, Ogan A. Automated pitch convergence improves learning in a social, teachable robot for middle school mathematics. International Conference on Artificial Intelligence in Education. Cham: Springer; 2018. p. 282–296.
- 63. Chaspari T, Lehman JF. An Acoustic Analysis of Child-Child and Child-Robot Interactions for Understanding Engagement during Speech-Controlled Computer Games. Interspeech. 2016;2016:595–9.
- 64. Sadoughi N, Pereira A, Jain R, Leite I, Lehman JF. Creating prosodic synchrony for a robot co-player in a speech-controlled game for children. Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction. Vienna, Austria: ACM; 2017. p. 91–99.
- 65. Jemine C. Real-time voice cloning. Université de Liège; 2019.
- 66. Heigold G, Moreno I, Bengio S, Shazeer N. End-to-end text-dependent speaker verification. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Shanghai: IEEE; 2016. p. 5115–5119.
- 67. Wang Y, Skerry-Ryan RJ, Stanton D, Wu Y, Weiss RJ, Jaitly N, et al. Tacotron: towards end-to-end speech synthesis. arXiv:1703.10135 [cs]. 2017.
- 68. Beilharz B, Sun X, Karimova S, Riezler S. LibriVoxDeEn: a corpus for German-to-English speech translation and German speech recognition. Proceedings of The 12th Language Resources and Evaluation Conference. 2020. p. 3590–3594.
- 69. Peirce JW. PsychoPy—Psychophysics software in Python. J Neurosci Methods. 2007;162(1-2):8–13. pmid:17254636
- 70.
R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2019. Available from: https://www.R-project.org/
- 71. Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. arXiv:14065823 [stat]. 2014 [cited 21 Jul 2021. ]. Available from: http://arxiv.org/abs/1406.5823
- 72.
Barton K. MuMIn: multi-model inference. 2020. Available from: https://CRAN.R-project.org/package=MuMIn
- 73. Perrachione TK, Furbeck KT, Thurston EJ. Acoustic and linguistic factors affecting perceptual dissimilarity judgments of voices. J Acoust Soc Am. 2019;146(5):3384–99. pmid:31795676
- 74. Skuk VG, Schweinberger SR. Gender differences in familiar voice identification. Hear Res. 2013;296:131–40. pmid:23168357
- 75. Ahrens M-M, Awwad Shiekh Hasan B, Giordano BL, Belin P. Gender differences in the temporal voice areas. Front Neurosci. 2014;8.
- 76. Junger J, Pauly K, Bröhr S, Birkholz P, Neuschaefer-Rube C, Kohler C, et al. Sex matters: Neural correlates of voice gender perception. Neuroimage. 2013;79:275–87. pmid:23660030
- 77. Lo C-C, Fu S-W, Huang W-C, Wang X, Yamagishi J, Tsao Y, et al. MOSNet: deep learning based objective assessment for voice conversion. Interspeech. 2019;2019:1541–5.
- 78. Hedge C, Powell G, Sumner P. The reliability paradox: why robust cognitive tasks do not produce reliable individual differences. Behav Res. 2018;50:1166–86.
- 79. Faul F, Erdfelder E, Lang A-G, Buchner A. G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods. 2007;39(2):175–91. pmid:17695343
- 80. Cicchetti DV. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol Assess. 1994;6(4):284–90.
- 81. Gamer M, Lemon J, Fellows I, Singh P. irr: Various coefficients of interrater reliability and agreement. 2012. Available from: https://CRAN.R-project.org/package=irr
- 82. Pörschmann C. Influences of bone conduction and air conduction on the sound of one’s own voice. Acta Acust United With Acust. 2000;86:1038–45.
- 83. Snyder CR, Fromkin HL. Abnormality as a positive characteristic: The development and validation of a scale measuring need for uniqueness. J Abnorm Psychol. 1977;86(5):518–27.
- 84. Perrett DI, May KA, Yoshikawa S. Facial shape and judgements of female attractiveness. Nature. 1994;368(6468):239–42. pmid:8145822
- 85. Rhodes G, Yoshikawa S, Clark A, Lee K, McKay R, Akamatsu S. Attractiveness of facial averageness and symmetry in non-western cultures: in search of biologically based standards of beauty. Perception. 2001;30(5):611–25. pmid:11430245
- 86. Said CP, Todorov A. A statistical model of facial attractiveness. Psychol Sci. 2011;22(9):1183–90. pmid:21852448
- 87. Kreiman J, Park SJ, Keating PA, Alwan A. The relationship between acoustic and perceived intraspeaker variability in voice quality. Interspeech. 2015;2015:2357–60.
- 88. Lavan N, Burton AM, Scott SK, McGettigan C. Flexible voices: identity perception from variable vocal signals. Psychon Bull Rev. 2019;26(1):90–102. pmid:29943171
- 89. Lee Y, Kreiman J. Within and between speaker variation in voices. In: Calhoun S, Escudero P, Tabain M, Warren P, editors. Proceedings of the 19th International Congress of Phonetic Sciences. Melbourne, Australia; 2019. p. 1460–1464.
- 90. Sidtis D, Kreiman J. In the beginning was the familiar voice: personally familiar voices in the evolutionary and contemporary biology of communication. Integr Psych Behav Sci. 2012;46(2):146–59. pmid:21710374
- 91. Stevenage SV. Drawing a distinction between familiar and unfamiliar voice processing: A review of neuropsychological, clinical and empirical findings. Neuropsychologia. 2018;116(Pt B):162–78. pmid:28694095
- 92. Andraszewicz R, Yamagishi J, King S. Vocal attractiveness of statistical speech synthesisers. Proc ICASSP 2011; 2011. p. 5368–5371.
- 93. Belin P. On voice averaging and attractiveness. In: Weiss B, Trouvain J, Barkat-Defradas M, Ohala JJ, editors. Voice attractiveness. Singapore: Springer Singapore; 2021. p. 139–149.
- 94. Bruckert L, Bestelmeyer P, Latinus M, Rouger J, Charest I, Rousselet GA, et al. Vocal attractiveness increases by averaging. Curr Biol. 2010;20(2):116–20. pmid:20129047
- 95. Zäske R, Skuk VG, Schweinberger SR. Attractiveness and distinctiveness between speakers’ voices in naturalistic speech and their faces are uncorrelated. R Soc Open Sci. 2020;7(12):201244.
- 96. Hillenbrand J. A Methodological study of perturbation and additive noise in synthetically generated voice signals. J Speech Hear Res. 1987;30(4):448–61. pmid:2961932
- 97. Ferrand CT. Harmonics-to-Noise Ratio: An Index of Vocal Aging. J Voice. 2002;16(4):480–7. pmid:12512635
- 98. Stathopoulos E, Huber J, Sussman J. Changes in acoustic characteristics of the voice across the life span: measures from individuals 4-93 years of age. JSLHR. 2011;54:1011–21.
- 99. Kappen M, Hoorelbeke K, Madhu N, Demuynck K, Vanderhasselt M-A. Speech as an indicator for psychosocial stress: a network analytic approach. Behav Res Methods. 2022;54(2):910–21. pmid:34357541
- 100. Yumoto E, Gould WJ, Baer T. Harmonics‐to‐noise ratio as an index of the degree of hoarseness. J Acoust Soc Am. 1982;71(6):1544–9. pmid:7108029
- 101. Miyake K, Zuckerman M. Beyond personality impressions: effects of physical and vocal attractiveness on false consensus, social comparison, affiliation, and assumed and perceived similarity. J Pers. 1993;61(3):411–37. pmid:8246108
- 102. Peng Z, Wang Y, Meng L, Liu H, Hu Z. One’s own and similar voices are more attractive than other voices. Aust J Psychol. 2019;71(3):212–22.
- 103. Jaggy O, Schwan S, Meyerhoff HS. Do not trust your ears: AI-determined similarity increases likability and trustworthiness of human voices. Figshare; 2022.