
Multilingual capabilities of GPT: A study of structural ambiguity

  • Myung Hye Yoo,

    Roles Conceptualization, Data curation, Writing – original draft

    Affiliation Department of English Language and Literature, Chungnam National University, Daejeon, South Korea

  • Joungmin Kim,

    Roles Conceptualization, Data curation, Writing – review & editing

    Affiliation Department of Japanese Language and Literature, Korea University, Seoul, South Korea

  • Sanghoun Song

    Roles Conceptualization, Data curation, Funding acquisition, Supervision, Writing – review & editing

    sanghoun@korea.ac.kr

    Affiliation Department of Linguistics, Korea University, Seoul, South Korea

Abstract

This study examines the multilingual capabilities of GPT, focusing on its handling of syntactic ambiguity across English, Korean, and Japanese. We investigate whether GPT can capture language-specific attachment preferences or whether it relies primarily on English-centric training patterns. Using ambiguous relative clauses as a testing ground, we assess GPT’s interpretation tendencies across language contexts. Our findings reveal that, while the performance of GPT models (GPT-3.5-turbo, GPT-4-turbo, GPT-4o) aligns with native English speakers’ preferred interpretations, they overgeneralize this interpretation in Korean and lack clear preferences in Japanese, despite distinct attachment biases among native speakers of these languages. The newer, smaller-scale models, o1-mini and o3-mini, further reinforce this trend by closely mirroring English attachment patterns in both Korean and Japanese. Overall, the results suggest that GPT’s multilingual proficiency is limited, likely reflecting a bias toward high-resource languages like English, although differences in model size and tuning strategies may partially mitigate the extent of English-centric generalization. While GPT models demonstrate aspects of human-like language processing, our findings underscore the need for further refinement to achieve a more nuanced engagement with linguistic diversity across languages.

Introduction

In recent years, large language models (LLMs) have demonstrated remarkable abilities across a range of natural language processing tasks, including chain-of-thought reasoning [1–3], in-context learning [4,5], word-in-context semantic judgment [6], and even professional exams [7–9]. In particular, GPT models, developed by OpenAI, play a dominant role in both academic research and real-world AI applications, powering tools like ChatGPT and numerous commercial and educational platforms. This widespread adoption positions GPT as a compelling and practical foundation for investigating how LLMs handle complex tasks.

ChatGPT leverages supervised learning through human-annotated examples to produce targeted outputs for given prompts. Additionally, human rankings of generated outputs are used to develop a reward model that optimizes ChatGPT’s performance through reinforcement learning. As one of the leading LLMs, ChatGPT rapidly captured widespread attention after its release on November 30, 2022, amassing 100 million users within its first two months [10]. It has gained significant attention for its coherent responses across diverse tasks, including creative writing [11], image generation [12], computer programming or coding [13,14], sentiment analysis [15,16], and various annotation tasks [17–19].

Further reports highlight GPT models’ success in handling complex, domain-specific tasks in areas such as medicine [7,8,20,21], law [22–24], public health [25], economics [26], and programming [27,28]. This has sparked enthusiasm across fields for potential applications of ChatGPT as a professional assistant [29–32]. Such findings suggest that LLMs may significantly impact various industries and applications in the near future.

With the global spread of LLM-based services like ChatGPT, their use has extended to non-English-speaking communities. As ChatGPT has become widely adopted in English-speaking contexts across multiple domains, assessing its performance in specialized fields for non-English languages is increasingly essential. However, most tasks are predominantly focused on English, and much of ChatGPT’s performance has been assessed primarily in English. Hence, even though ChatGPT is trained on extensive datasets from diverse sources, it often shows a proficiency bias toward high-resource languages [33]. This raises the possibility of issues such as replicating information from high-resource languages without proper attribution, or applying English-trained knowledge to non-English contexts. It also raises questions about whether multilingual LLMs are truly versatile across languages; specifically, whether GPT models can be effectively used in other languages or whether more language-specific technologies are needed. To investigate GPT’s multilingual capabilities, recent studies have begun assessing its performance across various tasks in non-English languages. These comprehensive evaluations cover areas such as reasoning abilities, language identification, and machine translation across diverse languages [34,35], as well as specific applications like medical licensing exams in Japanese [36] and Chinese [37] and tasks like grammatical error correction and cross-lingual summarization in languages such as Chinese and German [38,39]. Overall, studies indicate that GPT performs reasonably well in non-English languages, albeit with some inaccuracies. For instance, while its grammatical error correction is generally effective at the sentence level, it falls short at the discourse level in languages like German and Chinese [38]. Bang et al.
[34], conducting a series of multilingual, multimodal tasks spanning reasoning, hallucination, and interactivity, found GPT to be unreliable. Furthermore, [35,40] showed that GPT tends to perform better with English prompts, even when tasks and inputs are designed for other languages, demonstrating its English-language bias.

While ChatGPT has been evaluated across many domains, relatively few studies examine its multilingual proficiency in linguistics. Our study addresses this gap by evaluating GPT’s multilingual ability within a specific linguistic domain: the resolution of ambiguity. As in other fields, LLMs exhibit behavior resembling human linguistic processing [41–43]. Ambridge and Blything [44] have even demonstrated that LLMs outperform traditional theoretical linguists in predicting human judgments for certain structure types. Meanwhile, much of this research has been conducted primarily on English [45].

To offer deeper insights into multilingual applications of GPT, we focus specifically on its interpretation of ambiguous sentences across languages. As GPT’s capabilities evolve, it is expected to weigh multiple valid interpretations in a human-like manner [46]. Effectively managing ambiguity is a crucial aspect of human language comprehension, allowing speakers to anticipate misunderstandings and listeners to adjust interpretations for smooth communication. For language models like GPT, mastering this skill is critical, not only for effectiveness in conversational interfaces and writing aids but also for achieving communication skills that resemble human interaction. A particularly insightful way to assess GPT’s multilinguality and conversational sophistication is to examine its handling of ambiguous sentences, where human preferences for resolution vary significantly across languages. For instance, Japanese and Korean speakers often show tendencies opposite to those of English speakers. Our study investigates GPT’s ability to reflect language-specific interpretations in English, Korean, and Japanese while resolving structural ambiguities. We assess whether it can apply language-specific linguistic knowledge or whether it primarily relies on English-based training data. By examining GPT’s handling of ambiguous sentences across these languages, we aim to gain crucial insights into its capacity for genuinely multilingual, human-like communication. In addition, we compare multiple versions of GPT, from GPT-3.5-turbo to the recent o1-mini and o3-mini models, to examine how variations in tuning techniques, model scale, and optimization goals influence interpretive behavior across languages. Such comparisons allow us to trace the developmental trajectory of GPT models’ multilingual capabilities and to identify key factors underlying language-specific parsing behavior.
In this context, we aim to gain insight not only into the practical implications of GPT model selection but also into systematic strategies for enhancing linguistic adaptability across diverse languages and language families.

Previous studies on attachment ambiguities

Relative clause attachment ambiguity in psycholinguistics

Sentences are not always clearly written and may remain ambiguous even after they are complete. For example, sentences containing a complex noun phrase modified by a relative clause, a configuration known as relative clause attachment ambiguity, are typically perceived as ambiguous because the relative clause can modify either the first noun phrase (NP) or the second NP.

In sentence (1) above, it is not clear who exactly was on the balcony: the servant or the actress. For the given complex NP (i.e., the servant of the actress who was on the balcony), the relative clause is grammatically correct whether it attaches to the first NP (NP1, the servant) or the second NP (NP2, the actress). In English, NP1 is located structurally higher than NP2. We use “high attachment (HA)” to refer to attachment to the structurally higher NP, that is, NP1 in English, whereas “low attachment (LA)” refers to attachment to the structurally lower NP, that is, NP2 in English.

Although the structural ambiguity in sentence (1) is never resolved and has no correct answer, readers often tend to select one NP over the other, and this preference differs across languages. LA is preferred by speakers of English [48–51], Chinese [52], Romanian [53], and Arabic [54], who favor attaching the clause to the lower NP (i.e., the actress). HA is preferred by speakers of Spanish [47–49,55], Brazilian and European Portuguese [56–58], Korean [59–63], Japanese [64], Dutch [65,66], French [67,68], German [69], and Greek [70], who prefer attaching the clause to the higher NP (i.e., the servant).

As noted, while English native speakers prefer LA interpretation, Korean and Japanese native speakers have demonstrated HA preferences.

For instance, Japanese and Korean native speakers prefer HA interpretations, attaching relative clauses to the higher NP, that is, the servant in (2). As head-final languages, Japanese and Korean order the head noun and the relative clause in the opposite way from English, so the lower NP precedes the higher NP (i.e., “actress” + “servant”).

While interpretation preferences are often assessed through offline tasks that require judgments after reading entire sentences, psycholinguistic research has also examined interpretation preferences in real-time processing to assess initial interpretations. For English, numerous studies have consistently reported an LA preference in both online (real-time) and offline language processing, indicating a stable parsing strategy [48,71]. In Korean, an HA preference persists across both the initial parsing and final interpretation stages, demonstrating stability throughout processing [59,61]. Japanese, on the other hand, is known to deviate from these stable patterns, exhibiting a shift in attachment preference between processing stages: speakers initially favor an LA interpretation during online processing but ultimately resolve the ambiguity with an HA interpretation by the time of offline judgment [64]. A similar pattern has been reported for Spanish, where an initial LA preference often shifts to HA in final interpretations [48,51]. However, this early LA preference is not always robust: studies in Spanish that manipulate linguistic factors such as animacy have shown that HA preferences can also emerge during real-time processing [72]. Thus, early parsing strategies may vary based on both language-specific tendencies and structural or semantic cues [70]. Table 1 summarizes the general attachment preferences in four languages.

Table 1. Summary of previous studies testing relative clause attachment ambiguity.

https://doi.org/10.1371/journal.pone.0326943.t001

Ambiguity in large language models

The performance of language models (LMs) on ambiguous sentences remains an underexplored area. Some studies have examined encoder models, primarily with English data. Recently, [73] reported that LMs’ predictability, based on surprisal, aligns with human processing difficulty for a variety of syntactically complex English sentences, including attachment ambiguities. Wiki-LSTM and GPT-2 generally predicted the direction of human processing difficulty, but did not consistently predict its magnitude across sentence types. For sentences with attachment ambiguities, they compared processing difficulty between HA and LA interpretations at the word following the critical verb region. The models accurately predicted where difficulty would arise, showing increased processing difficulty for HA interpretations but not for LA interpretations.

Davis (2022) [74] further analyzed attachment ambiguity by examining Long Short-Term Memory (LSTM) models and various transformer models (GPT-2 XL, BERT, and RoBERTa) in English and Spanish. Overall, these models showed a preference for LA interpretations in English (LSTMs, GPT-2 XL, and BERT), though RoBERTa did not exhibit a clear preference. Recall that native Spanish speakers have a different preference from English speakers, favoring HA. However, Spanish versions of GPT-2 and BERT tended to prefer LA attachment, similar to the English models, while a subset of models, such as RuPERTa and the LSTMs, showed no preference. The preference for LA interpretation in Spanish may reflect the online processing behavior of Spanish speakers. However, Davis suggested an alternative explanation, proposing that the models might be capturing production trends observed in the Spanish training corpus [75] rather than true comprehension tendencies.

While prior research has advanced our understanding of attachment ambiguities in both human cognition and language model performance, significant gaps remain. Most studies focus on high-resource languages such as English or Spanish, often neglecting typologically distinct languages like Korean and Japanese. Additionally, much of the research on LMs emphasizes their performance in isolated linguistic contexts without systematically exploring their adaptability to cross-linguistic variation in attachment preferences. Given the limited exploration of LMs’ performance on resolving attachment ambiguities, we go one step further by employing a unified framework to assess both language-specific preferences and the broader question of whether LMs can meaningfully adapt to linguistic diversity.

The present work

The present work uses attachment ambiguity as a testing ground to evaluate GPT’s ability to capture language-specific interpretation preferences, focusing on GPT models’ performance in three languages: English, Korean, and Japanese. Specifically, we examine whether the previous findings in the Spanish context simply reflected LA preferences inherited from English-language training, rather than genuinely capturing the online processing preferences of Spanish speakers. Investigating GPT in Korean and Japanese allows us to further advance our understanding of its multilinguality. Recall that Japanese speakers initially prefer LA during online processing but shift toward HA in offline tasks, mirroring Spanish speakers when linguistic factors such as animacy or syntactic complexity are not involved. Conversely, Korean speakers consistently maintain an HA interpretation from the initial stages of parsing through to their final interpretations [59]. The divergence in initial parsing strategies between Japanese and Korean, despite their syntactic similarities, may reflect differences in parsing heuristics employed by speakers of these languages. For Korean, a persistent HA preference may arise from parsing mechanisms such as strong sensitivity to syntactic cues, specifically predicate proximity [76], promoting attachment to structurally higher nouns even during the initial stages of sentence processing [59,61]. In contrast, Japanese parsing initially appears driven by a locality-based strategy, attaching the relative clause to the closest available noun phrase, resulting in early LA interpretations that later shift toward HA preferences [64]. We reason that if both Korean and Japanese exhibit LA preferences, as in the English and (partial) Spanish cases, this indicates a limitation of LMs in handling multiple languages and a bias toward English.
However, if the LA preference reported for Spanish genuinely captures online processing, we expect only the Japanese data to exhibit an LA preference, while the Korean data should not, because only Japanese exhibited initial LA preferences [64].

Materials and methods

For the main dataset, the experimental materials consisted of 210 items, each presented in six versions per language (1,260 queries in total). Some items were modified from stimuli used in [47] and [73]. We consistently used the pronoun ‘I’ as the main subject instead of person names, because person names vary across languages and we wanted to avoid any impact of name familiarity on interpretations. For Japanese, we consistently used watasi, which is compatible with both male and female speakers, instead of boku or ore, which are typically male-specific. To maximize control across the three languages, we imposed several restrictions. First, terms describing kinship relations, like sister or father, were used only for the high-attached nouns (NP1 in English, NP2 in Korean and Japanese) for naturalness in English, as in the sister of the teacher, not the teacher of the sister. If such terms appeared in the low-attached position in English (NP2), it would be more natural to use a possessive like my sister, as in the teacher of my sister, to refer to the sister of the main subject, I (without context). Otherwise, we used words for professions (e.g., teacher, doctor). Secondly, we ensured that both noun phrases (NPs) shared the same animacy to preserve ambiguity. In Japanese, there are two types of ‘to be’ verbs that differ based on animacy: aru, which indicates the existence of inanimate objects, and iru, which indicates the existence of animate beings. By maintaining the same animacy for both NPs, we avoided providing clues as to which noun the verb modifies. A sample item for each language is given in (3).

Previous studies have highlighted that LLMs are sensitive to question formatting, particularly in multiple-choice settings, where their responses can be influenced by the structure of the prompt and the order in which answer choices are presented [77–81]. To avoid these effects, we created three question templates (A/B, Compare, and Repeat) with two different orders of answer choices or question foci, as presented in (4), following [82], which used ambiguous scenarios involving moral belief choices. For Question types A and C, we systematically varied the order in which the two noun phrases were presented: in one condition, NP1 appeared as the first option, while in the other condition, NP2 appeared first. In the yes/no format used for Question type B, we similarly manipulated the focus of the query by explicitly asking not only whether NP1 was the modified noun phrase but also whether NP2 was the modified entity. Therefore, each of the 210 item sets was asked twice by manipulating the order of answers; each question type was thus asked 420 times, for a total of 1,260 queries. (4) presents the set of sample prompt templates in English.
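The counterbalancing scheme described above (three templates, each with two orders or query foci) can be sketched as follows. This is a minimal illustration with a hypothetical item and placeholder question wordings, not the authors’ actual stimuli or prompts:

```python
# Generate the six queries (3 templates x 2 orders/foci) for one item.
# Item content and question wordings are illustrative placeholders.

ITEM = {
    "sentence": "I saw the servant of the actress who was on the balcony.",
    "np1": "the servant",   # structurally higher NP (HA target in English)
    "np2": "the actress",   # structurally lower NP (LA target in English)
}

def build_queries(item):
    queries = []
    templates = {
        "A": "Who was on the balcony? (a) {x} (b) {y}",
        "C": "Repeat the full phrase that the relative clause modifies: {x} or {y}.",
    }
    # Types A and C: counterbalance which NP is listed first.
    for qtype, tpl in templates.items():
        for x, y in [(item["np1"], item["np2"]), (item["np2"], item["np1"])]:
            queries.append((qtype, item["sentence"] + " " + tpl.format(x=x, y=y)))
    # Type B (yes/no): counterbalance which NP is explicitly queried.
    for np in (item["np1"], item["np2"]):
        queries.append(("B", f"{item['sentence']} Was {np} on the balcony? Answer yes or no."))
    return queries

queries = build_queries(ITEM)  # 6 queries; 210 items x 6 = 1,260 per language
```

Applying the same generator to all 210 items in each language yields the 1,260 queries per language reported above.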

Question type A asks the model to select the modified NP; Question type B directly asks, in yes/no form, whether NP1 or NP2 is the one performing the action or in the state described by the relative clause; and Question type C asks the model to repeat the full phrase, that is, the modified NP together with its relative clause. We conducted this experiment through the OpenAI application programming interface (API). We also provided instructions in each language to encourage the GPT models to process each language natively, because previous studies found that GPT is better at analyzing tasks with English prompts [35]. In all responses, we additionally asked the GPT models to state the reason why they selected a specific answer; this aspect is discussed in the Discussion section.
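A minimal sketch of how such queries might be sent through the OpenAI Python SDK, with the instruction given in the target language as described above. The instruction texts and helper names here are our illustrative assumptions, not the authors’ actual prompts or code:

```python
# Hypothetical per-language instructions (illustrative wording only).
INSTRUCTIONS = {
    "en": "Read the sentence, answer the question, and explain your reason.",
    "ko": "문장을 읽고 질문에 답한 뒤, 그 이유를 설명하세요.",
    "ja": "文を読んで質問に答え、その理由を説明してください。",
}

def build_messages(lang, query):
    """Pair the language-specific instruction with one query."""
    return [
        {"role": "system", "content": INSTRUCTIONS[lang]},
        {"role": "user", "content": query},
    ]

def ask_model(client, model, lang, query):
    """client is an openai.OpenAI instance (requires OPENAI_API_KEY)."""
    resp = client.chat.completions.create(
        model=model,  # e.g., "gpt-4o"
        messages=build_messages(lang, query),
    )
    return resp.choices[0].message.content
```

Keeping message construction separate from the network call makes the prompt-building logic testable without contacting the API.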

We examined five GPT models: GPT-3.5-turbo, GPT-4-turbo, GPT-4o, o1-mini, and o3-mini. GPT-3.5-turbo serves as a foundational model commonly used for benchmarking due to its computational efficiency and robust general performance across various language-processing tasks [83]. GPT-4-turbo is an advanced successor, notable for improved reasoning, enhanced contextual understanding, and greater accuracy, especially in complex linguistic contexts [84]. GPT-4o, representing the newer generation, incorporates multimodal processing and improved multilingual functionality, specifically aiming to mitigate language-specific biases and enhance adaptability to diverse linguistic contexts [85]. In addition to these mainstream models, we included two recently released lightweight variants, o1-mini and o3-mini, developed to provide faster and more cost-efficient alternatives while maintaining competitive performance in reasoning tasks. The o1-mini model, released in September 2024, prioritizes enhanced reasoning by incorporating a chain-of-thought approach, facilitating better handling of complex, logic-intensive tasks. The o3-mini model, introduced in January 2025, is streamlined to deliver prompt, accurate responses tailored for instructional and interactive real-time applications. Their inclusion provides novel insights, as performance on complex linguistic phenomena like structural ambiguity remains underexplored in these recent models. Furthermore, evaluating o1-mini and o3-mini enables us to explore how variations in training techniques and model scale affect syntactic ambiguity resolution and multilingual processing, as detailed in the Discussion section.
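Putting the pieces together, a driver loop over the five models and three languages might look like the following sketch. Here `respond()` stands in for the API call and `classify()` for mapping a free-text answer onto LA/HA; both are hypothetical placeholders, not the authors’ code:

```python
MODELS = ["gpt-3.5-turbo", "gpt-4-turbo", "gpt-4o", "o1-mini", "o3-mini"]
LANGUAGES = ["en", "ko", "ja"]

def run_experiment(queries_by_lang, respond, classify):
    """Tally LA/HA choices for every model x language cell.

    queries_by_lang: {lang: [query, ...]}
    respond(model, lang, query) -> free-text answer (e.g., an API call)
    classify(answer) -> "LA", "HA", or None for unparseable replies
    """
    counts = {(m, l): {"LA": 0, "HA": 0} for m in MODELS for l in LANGUAGES}
    for model in MODELS:
        for lang in LANGUAGES:
            for query in queries_by_lang[lang]:
                label = classify(respond(model, lang, query))
                if label in ("LA", "HA"):  # skip unparseable replies
                    counts[(model, lang)][label] += 1
    return counts
```

Injecting `respond` and `classify` as parameters lets the loop be dry-run with stubs before spending API calls.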

Analysis and results

To analyze whether one interpretation (LA or HA) was significantly preferred over the other, we fit logistic mixed-effects regression models using the glmer function from the lme4 package [86] in R [87], with the interpretation choice (binary: LA or HA) as the dependent variable. To examine interpretation preferences across all prompt templates, we first ran a combined logistic mixed-effects regression model incorporating all templates. This model included a random intercept for prompt template, alongside random intercepts for item and for the order of answer choices (Question types A and C) or the noun being questioned (Question type B). This allowed us to control for variability across individual prompt templates, the order of options, and the items presented. Next, we conducted a sub-analysis to determine whether one interpretation was significantly preferred over the other for each prompt template individually. For this, we ran separate logistic mixed-effects regression models, each including two random intercepts to account for variability associated with item and order of choices. The results of the models are presented in Table 2.
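The paper’s inference relies on logistic mixed-effects models fit with lme4::glmer in R. As a rough, random-effects-free sanity check of whether LA responses in one model-language cell depart from the 50% chance level, an exact two-sided binomial test can be computed with the Python standard library alone (a simplified illustration under our own assumptions, not the authors’ analysis):

```python
from math import comb

def exact_binom_p(k, n, p=0.5):
    """Two-sided exact binomial p-value: sum the probabilities of all
    outcomes at most as likely as the observed count k."""
    def pmf(i):
        return comb(n, i) * p**i * (1 - p)**(n - i)
    observed = pmf(k)
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= observed * (1 + 1e-9))

def la_preference(responses):
    """responses: list of 'LA'/'HA' labels for one model-language cell.
    Returns (proportion of LA, two-sided p-value against 50%)."""
    n_la = sum(r == "LA" for r in responses)
    return n_la / len(responses), exact_binom_p(n_la, len(responses))

# e.g., 300 LA vs. 120 HA responses out of 420 queries
prop, pval = la_preference(["LA"] * 300 + ["HA"] * 120)
```

Unlike the mixed-effects models, this check ignores item- and template-level variability, so it should only be read as a first-pass screen.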

Table 2. Mean proportions of LA interpretations.

https://doi.org/10.1371/journal.pone.0326943.t002

The proportion of LA preferences by model for each language is plotted in Fig 1. P values were obtained using a Satterthwaite approximation as implemented in the lmerTest package [88].

Fig 1. Mean proportion of the LA interpretation across three languages: English, Korean, and Japanese.

Error bars indicate the standard error of the mean.

https://doi.org/10.1371/journal.pone.0326943.g001

We report pooled results across all question types for each language, followed by separate analyses for each type. In English, all GPT models from GPT-3.5-turbo to o3-mini consistently showed a general preference for LA interpretations across question types, aligning with native English speakers’ tendency to attach to the lower noun. These trends reached statistical significance. In the subset analysis, LA preferences were significant across question types, with one exception: Question type B in GPT-4-turbo exhibited the same trend, favoring LA interpretations, but did not reach statistical significance (z = −0.83, p = 0.40).

One plausible explanation for this lack of statistical significance is variability arising from the random effect of question type. Specifically, GPT-4-turbo exhibited a strong bias related to which noun phrase was queried: when explicitly asked whether NP1 was modified, the model selected NP2 in over 90% of cases, whereas when NP2 was the explicit query, NP1 was preferred in over 65% of responses. This pattern was further supported by statistical analysis, which revealed a significant effect when the random intercept for the noun being questioned was excluded from the model (z = −5.49, p < 0.01). This suggests that GPT-4-turbo’s attachment interpretations were significantly influenced by the particular noun phrase queried. One possible account for the increased variability and lack of statistical significance observed in the yes/no question format (Question type B) may lie in the inherent properties of this prompt structure. Unlike Question types A and C, which present contrasting alternatives and require the model to resolve ambiguity by selecting between interpretations, yes/no prompts require the model to evaluate a single interpretation in isolation. This lack of direct comparison may result in weaker commitment to a specific interpretation, producing more variable responses and reducing the robustness of attachment preferences. Prior studies have shown that LLMs are sensitive to subtle variations in phrasing or prompt structure in yes/no tasks, which can further exacerbate inconsistencies and limit interpretive clarity [81,89].

For Korean, all GPT versions similarly showed a preference for LA interpretations, with over 60% of responses favoring this interpretation, mirroring the English data. Although the overall LA interpretation rate of o1-mini did not reach 60%, it was close (58.49%) and still showed a numerical trend toward LA preference. Statistical analysis confirmed a significant preference for LA over HA in Korean, with marginal significance in GPT-3.5-turbo (z = −1.63, p = 0.10). The sub-analysis for each question type reached statistical significance for GPT-4o and GPT-4-turbo. Although Question types A and B in GPT-3.5-turbo did not achieve statistical significance, LA preferences were still evident, with over 75% favoring LA interpretations for Question type A and 60% for Question type B. The interpretation for Question type B in o1-mini was also not statistically significant, which likely contributed to the lack of significance in the overall results. However, when the random intercept for question type was excluded from the model, the overall interpretation became significant (z = 3.03, p = 0.002), mirroring the pattern observed with Question type B in GPT-4-turbo.

In Japanese, GPT models from GPT-3.5-turbo to GPT-4o generally showed no strong preference for either LA or HA interpretations, with overall differences remaining insignificant. However, in the sub-analyses, only Question type C in GPT-3.5-turbo displayed a significant preference for LA (over 75%), reflecting trends similar to those observed in English and Korean. Interestingly, GPT-4o even showed an overall significant preference for HA interpretations, the opposite of the trends in English and Korean. Additionally, with the release of newer GPT models, we observed a gradual decrease in the proportion of LA interpretations, particularly evident in GPT-4o. This shift suggests that GPT-4o’s intended focus on enhanced multilingual capability has been realized, as it begins to reflect human-like interpretation patterns in Japanese. However, the more recent smaller-scale models, o1-mini and o3-mini, exhibited statistically significant preferences for LA interpretations, mirroring the patterns observed in English.

Discussion

Cross-linguistic model comparisons

A fundamental question of the current study was whether GPT models can be effectively applied across languages or whether language-specific technologies are needed for non-English languages. To address this, we examined GPT models’ cross-linguistic performance in a linguistic domain that requires a nuanced understanding of human language processing. Our findings revealed that GPT models display human-like behavior in English, showing a preference for LA interpretations. Despite the cross-linguistic differences evident among Korean native speakers, we observed that GPT models mirrored English native speaker preferences, favoring LA interpretations, a tendency not shared by Korean speakers.

This result raises important questions about whether LMs reflect online processing or if their responses are influenced mainly by training data in English. As discussed in the introduction, we anticipated that if the LA preference in Spanish, noted by Davis (2022) [74], genuinely reflected initial parsing, only Japanese data would demonstrate an LA preference, while Korean data would not, given that Korean speakers exhibit HA preferences in both online and offline settings. However, our observation of consistent LA preferences in Korean suggests that this tendency in Spanish ambiguous sentences may not reflect online processing. Instead, the LA preferences in Korean suggest that GPT’s interpretation of Korean is heavily influenced by its English-based training, possibly affecting Spanish as well. To validate this conclusion further, future research could directly compare Spanish data with Korean or other languages with similar HA patterns to Korean. Additionally, as prior studies and the current one have evaluated distinct architectures—encoder and decoder models—further research would benefit from a comprehensive examination of both. It is also noteworthy that our study evaluated more recent models compared to Davis’ (2022) work [74] with the GPT-2 XL model.

Interestingly, the Japanese data did not show a strong preference for either interpretation in GPT-3.5-turbo and GPT-4-turbo. This could be due to several factors. One possibility is resource availability. According to the Common Crawl corpus and its snapshots, a large-scale, non-curated dataset of multilingual webpages used to pre-train models like GPT-3, Japanese is a higher-resource language than Korean [34,35,90]. As Japanese falls into a higher-resource category, the availability of more data likely helps models suppress a direct reliance on English-based training; Korean, in contrast, may be more susceptible to mirroring English speakers’ interpretation preferences due to resource limitations. The gradual increase in HA interpretations as GPT models advance, from 38.42% to 54.21%, with significance in two question types in the latest model, GPT-4o, may indicate a progressive development of human-like interpretive tendencies, possibly influenced by the availability of relatively rich resources. This shift suggests that GPT-4o’s intended enhancements in multilingual capability have been realized, enabling it to more closely reflect native Japanese interpretation patterns and highlighting the benefits associated with high resources in this model. Another possible explanation is that certain Japanese sentences could have alternative interpretations. For example, nouns indicating relational roles, like son in phrases such as the son of the architect, could create ambiguities in which architect is interpreted as an appositive describing the son’s profession. These potentially confounding sentences comprised 35.23% of the total items, or 74 of the 210 items in our stimuli. To address this, we conducted an exploratory analysis of the Japanese data, excluding these potentially confounding sentences.
The results confirmed the initial trend, with no significant preference for either interpretation across all GPT models (all ps > 0.36). Therefore, the lack of strong preferences in Japanese cannot be entirely explained by these sentence types.

This absence of clear attachment preferences in GPT-3.5-turbo and GPT-4-turbo for Japanese may reflect the influence of relatively higher resource availability for Japanese compared to Korean during training. This broader exposure may have mitigated the tendency to generalize syntactic patterns learned from English, leading to more balanced behavior. In this context, GPT-4o, which was explicitly trained to improve cross-linguistic generalization and reduce English-language bias, appears better suited for capturing language-specific patterns of attachment ambiguity in relatively resource-rich languages like Japanese.

The robust LA preferences observed in o1-mini and o3-mini for Japanese as well as Korean—mirroring patterns found in English—further suggest the importance of training data scale and techniques in shaping interpretive behavior. These smaller-scale models, optimized for speed and cost-efficiency, contain significantly fewer parameters and are likely exposed to a narrower range of non-English linguistic input during training. Consequently, they appear to rely more heavily on generalized heuristics learned from high-resource languages like English, consistently favoring LA interpretations across typologically diverse languages. This behavior reflects a trade-off, where deep contextual reasoning is sacrificed in favor of faster inference, often prioritizing surface-level generalizations and high-probability completions. These findings align with prior research showing that model performance improves with increased size and data exposure [91,92], suggesting that smaller models may be less equipped to capture language-specific syntactic nuances in multilingual contexts.

In terms of tuning techniques, the reduced multilingual capability of o1-mini and o3-mini can be attributed to the techniques used for model alignment. While GPT-3.5-turbo, GPT-4-turbo, and GPT-4o were trained using Reinforcement Learning from Human Feedback (RLHF) as well as instruction-tuning, o1-mini and o3-mini are mainly instruction-tuned [93]. RLHF allows models to be further refined using human feedback, enhancing their ability to produce responses that align with human preferences—particularly in cases of ambiguity or underspecified input. This additional layer of alignment improves flexible reasoning, contextual coherence, and interpretive sensitivity. In the absence of RLHF, o1-mini and o3-mini may rely more heavily on surface-level heuristics or high-frequency patterns seen during pretraining, limiting their ability to capture subtle syntactic and semantic distinctions across languages and reinforcing English-based interpretations.

Model reasoning and future directions

As noted earlier, we asked GPT models to provide the reasoning behind each answer choice, enabling us to examine how GPT approaches the resolution of attachment ambiguity. For English, GPT predominantly selects the LA interpretation because NP2 (i.e., the low-attachment site) is positioned nearer to the relative clause, with the general explanation that "relative clauses typically modify the nearest preceding noun." This reasoning suggests that GPT identifies a pattern of LA preferences based on the linear distance between the relative clause and a noun, as learned from English data. The same reasoning appears in Korean and Japanese interpretations, where GPT selects the LA interpretation by identifying the closest noun. Despite the differences in word order from English, the relative clauses in both Korean and Japanese are likewise positioned closer to the noun associated with low attachment (i.e., NP1). Notably, even when GPT occasionally chooses the high-attachment noun, it still offers the same proximity-based explanation, which is plainly incorrect in those cases. Thus, while Japanese interpretations may appear less influenced by English data in our binary task, GPT's reasoning process is not entirely language-independent.
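The proximity-based rationale GPT verbalizes can be caricatured as a purely linear heuristic: attach the relative clause to whichever noun phrase is nearest in the word string. The sketch below is an illustrative toy (word positions and labels are hypothetical), not a model of GPT's internal computation:

```python
def proximity_attachment(np_positions: dict, rc_position: int) -> str:
    """Toy 'nearest noun' heuristic: attach the relative clause (RC)
    to whichever noun phrase lies closest to it in linear order."""
    return min(np_positions, key=lambda np: abs(np_positions[np] - rc_position))

# English order: "the servant(1) of the actress(3) who ...(4)"
# -> the RC is adjacent to NP2, the low-attachment noun.
print(proximity_attachment({"NP1": 1, "NP2": 3}, 4))

# Head-final Korean/Japanese order: "[RC](0) NP1(1)-GEN NP2(2)"
# -> the RC precedes the complex NP, so the nearest noun is NP1,
#    which is the low-attachment noun in those languages.
print(proximity_attachment({"NP1": 1, "NP2": 2}, 0))
```

Because the heuristic tracks only linear distance, it yields the low-attachment noun in all three languages, matching the explanation GPT gives, even on trials where its actual choice was the high-attachment noun.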

Another possibility is that GPT’s provided rationales may not reflect the actual decision-making criteria employed during ambiguity resolution, but rather constitute post-hoc rationalizations disconnected from its underlying interpretive processes. In other words, GPT might simply generate justifications based on generalized heuristics (such as predicate proximity-based attachment), independent of whether these heuristics genuinely guided its specific attachment choices. This possibility suggests that GPT’s explanations might not offer reliable insights into its decision reasoning, but instead reflect plausible, learned patterns drawn from training data regarding typical human-like linguistic reasoning. Such a scenario aligns with recent concerns expressed by [94], who caution against conflating LLMs’ apparent linguistic competence with genuine reasoning capacities. Thus, while GPT’s explicit reasoning offers some insight into its linguistic behavior, it may not directly reflect the underlying processes guiding its decisions. Future research could further investigate whether GPT’s provided rationales are genuinely connected to its interpretive decisions or merely represent surface-level justifications, by employing more sophisticated experimental tasks and analytical measures.

Our findings suggest that GPT models’ processing aligns with the Late Closure strategy [71], one of the well-known psycholinguistic theories for resolving attachment ambiguity. According to this principle, if grammatically permissible, incoming words are likely attached to the most recent phrase or clause being processed, rather than an earlier one in the sentence. Thus, when encountering structural ambiguity, readers are inclined to connect the relative clause to the nearest noun phrase (NP) rather than one farther back. While this processing strategy initially aimed to explain attachment preferences universally, it encountered limitations as studies revealed cross-linguistic differences. For GPT models, it seems that the Late Closure strategy is overgeneralized across languages, without incorporating language-specific patterns. Even GPT-4, one of the latest models, did not fully adopt language-specific interpretations, indicating that it may not be entirely multilingual.

Although multilingual data enables GPT to handle inputs and produce responses across different languages, our findings highlight the gap between GPT's current capabilities and the nuanced adaptability required for genuine conversational sophistication. Addressing these limitations is critical for advancing GPT's multilingual functionality and its potential to achieve human-like communication skills. There is a clear need within the community for a comprehensive, public, and independent evaluation of GPT across diverse non-English languages and NLP tasks to inform future research and applications effectively. We propose several directions for future research. First, as our current study mainly evaluates GPT, subsequent studies could extend this analysis to other recent multilingual models, such as LLaMA, BLOOM, or Claude. In fact, we conducted an exploratory evaluation using Anthropic's Claude 3.5 Sonnet, which revealed a similar trend across all three languages, each showing a consistent preference for LA interpretations exceeding 60%. As these findings extend beyond the primary focus of the present study, they are not included in the main manuscript but are provided in the Supporting Information for reference. This observed convergence across model families suggests that the LA preference may reflect a broader characteristic of LLMs, rather than being specific to the GPT architecture. Future research could systematically investigate whether this pattern holds across a wider range of LLM architectures, under varied prompting conditions or with other measures, to assess the extent to which current language models possess genuine multilingual capabilities.
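Such cross-model comparisons are straightforward to set up because the task reduces to a forced-choice question in the chat-message format shared by most LLM APIs. A minimal sketch follows; the wording and the example stimulus are entirely hypothetical and are not the instructions or items used in this study:

```python
def build_attachment_prompt(sentence: str, np1: str, np2: str) -> list:
    """Assemble a forced-choice attachment question as a list of
    chat messages. Wording is illustrative only."""
    return [
        {"role": "system",
         "content": "Choose exactly one option, then briefly explain why."},
        {"role": "user",
         "content": (f"Sentence: {sentence}\n"
                     "Question: Who does the relative clause describe?\n"
                     f"(a) {np1}\n(b) {np2}")},
    ]

# Classic ambiguous example (hypothetical stimulus):
messages = build_attachment_prompt(
    "Someone shot the servant of the actress who was on the balcony.",
    "the servant", "the actress")
print(messages[1]["content"])
```

Because the message list is model-agnostic, the same items can be submitted to any provider's chat endpoint, with only the client call differing across GPT, Claude, LLaMA, or BLOOM deployments.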

Another research direction is conducting more in-depth cross-linguistic comparisons based on language families. For example, [95] recently highlighted the performance limitations of model editing techniques under a cross-lingual model editing paradigm, particularly when comparing languages from different language families, such as Latin and Indic. Building on this, future research could encompass comparisons with languages both within the same family as English, like French or Dutch, and those from distinctly different language families. Additionally, examining corpus data or human production tasks in Korean and Japanese could help clarify whether LA preferences observed in GPT align with production trends or comprehension tendencies in these languages. Such analyses would ensure a deeper understanding of interpretation trends and their alignment with human linguistic behavior. These broader approaches would enable a more comprehensive understanding of the impact of typological factors on model performance across languages.

Conclusion

This study underscores the limitations of GPT's multilingual capabilities, particularly in resolving attachment ambiguities in non-English languages, where it tends to overgeneralize English-based processing strategies. While GPT can respond across languages, it lacks the language-specific nuance observed in human processing, as evidenced by its consistent LA preferences even in typologically distinct languages like Korean, and by its lack of clear preferences in Japanese. These findings suggest that language-specific technologies may be essential for building truly multilingual models, equipping them to reflect the structural nuances of each language for more accurate applications.

References

  1. Huang H, Tang T, Zhang D, Zhao WX, Song T, Xia Y, et al. Not all languages are created equal in LLMs: improving multilingual capability by cross-lingual-thought prompting. arXiv:230507004 [Preprint]. 2023.
  2. Qin L, Chen Q, Wei F, Huang S, Che W. Cross-lingual prompting: improving zero-shot chain-of-thought reasoning across languages. arXiv:231014799 [Preprint]. 2023.
  3. Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, et al. Chain-of-thought prompting elicits reasoning in large language models. Adv Neural Inf Process Syst. 2022;35:24824–37.
  4. Wei J, Wei J, Tay Y, Tran D, Webson A, Lu Y, et al. Larger language models do in-context learning differently. arXiv:230303846 [Preprint]. 2023.
  5. Min S, Lyu X, Holtzman A, et al. Rethinking the role of demonstrations: What makes in-context learning work? arXiv:220212837 [Preprint]. 2022.
  6. Shi P, Zhang R, Bai H, Lin J. XRICL: cross-lingual retrieval-augmented in-context learning for cross-lingual text-to-SQL semantic parsing. arXiv:221013693 [Preprint]. 2022.
  7. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. pmid:36812645
  8. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on medical challenge problems. arXiv:230313375 [Preprint]. 2023.
  9. Terwiesch C. Would Chat GPT3 get a Wharton MBA? A prediction based on its performance in the operations management course. 2023.
  10. Milmo D. Guardian confirms it was hit by ransomware attack. The Guardian. 2023;11(1):23.
  11. Kirmani AR. Artificial intelligence-enabled science poetry. ACS Energy Lett. 2022;8(1):574–6.
  12. Bubeck S, Chandrasekaran V, Eldan R, Gehrke J, Horvitz E, Kamar E, et al. Sparks of artificial general intelligence: early experiments with GPT-4. arXiv:230312712 [Preprint]. 2023.
  13. Davis JC, Lu YH, Thiruvathukal GK. Conversations with ChatGPT about C programming: an ongoing study. 2023.
  14. Poldrack RA, Lu T, Beguš G. AI-assisted coding: experiments with GPT-4. arXiv:230413187 [Preprint]. 2023.
  15. Esh M. Sentiment analysis in ChatGPT interactions: unraveling emotional dynamics, model evaluation, and user engagement insights. Tech Serv Q. 2024;41(2):160–74.
  16. Tabone W, de Winter J. Using ChatGPT for human-computer interaction research: a primer. R Soc Open Sci. 2023;10(9):231053. pmid:37711151
  17. Ding B, Qin C, Liu L, Chia YK, Joty S, Li B, et al. Is GPT-3 a good data annotator? arXiv:221210450 [Preprint]. 2022.
  18. Huang F, Kwak H, An J, editors. Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech. Companion proceedings of the ACM web conference 2023; 2023.
  19. Kuzman T, Ljubešić N, Mozetič I. ChatGPT: beginning of an end of manual annotation? Use case of automatic genre identification. arXiv:230303953 [Preprint]. 2023.
  20. Johnson D, Goodman R, Patrinely J, Stone C, Zimmerman E, Donald R, et al. Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the Chat-GPT model. Res Sq. 2023.
  21. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388(13):1233–9. pmid:36988602
  22. Choi JH, Hickman KE, Monahan AB, Schwarcz D. ChatGPT goes to law school. J Legal Educ. 2021;71:387.
  23. Choi JH, Schwarcz D. AI assistance in legal analysis: an empirical study. 2023. Available from: SSRN 4539836.
  24. Hargreaves S. Words are flowing out like endless rain into a paper cup: ChatGPT & law school assessments. Legal Educ Rev. 2023;33:69.
  25. Biswas SS. Role of Chat GPT in public health. Ann Biomed Eng. 2023;51(5):868–9. pmid:36920578
  26. Geerling W, Mateer GD, Wooten J, Damodaran N. ChatGPT has aced the test of understanding in college economics: now what? Am Econ. 2023;68(2):233–45.
  27. Buchberger B. Automated programming, symbolic computation, machine learning: my personal view. Ann Math Artif Intell. 2023;91(5):569–89.
  28. Surameery NMS, Shakor MY. Use Chat GPT to solve programming bugs. IJITCE. 2023;31:17–22.
  29. Ali H. The potential of GPT-4 as a personalized virtual assistant for bariatric surgery patients. Obes Surg. 2023;33(5):1605. pmid:37009970
  30. Anand Y, Nussbaum Z, Duderstadt B, Schmidt B, Mulyar A. GPT4All: training an assistant-style chatbot with large scale data distillation from GPT-3.5-turbo. GitHub. 2023.
  31. Cheng K, Li Z, Li C, Xie R, Guo Q, He Y, et al. The potential of GPT-4 as an AI-powered virtual assistant for surgeons specialized in joint arthroplasty. Ann Biomed Eng. 2023;51(7):1366–70. pmid:37071279
  32. Zhang J, Sun K, Jagadeesh A, Ghahfarokhi M, Gupta D, Gupta A, et al. The potential and pitfalls of using a large language model such as ChatGPT or GPT-4 as a clinical assistant. arXiv:230708152 [Preprint]. 2023.
  33. Ray PP. ChatGPT: a comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. IOT & CPS. 2023;3:121–54.
  34. Bang Y, Cahyawijaya S, Lee N, Dai W, Su D, Wilie B, et al. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv:230204023 [Preprint]. 2023.
  35. Lai VD, Ngo NT, Veyseh APB, Man H, Dernoncourt F, Bui T, et al. ChatGPT beyond English: towards a comprehensive evaluation of large language models in multilingual learning. arXiv:230405613 [Preprint]. 2023.
  36. Kasai J, Kasai Y, Sakaguchi K, Yamada Y, Radev D. Evaluating GPT-4 and ChatGPT on Japanese medical licensing examinations. arXiv:230318027 [Preprint]. 2023.
  37. Fang C, Wu Y, Fu W, Ling J, Wang Y, Liu X, et al. How does ChatGPT-4 preform on non-English national medical licensing examination? An evaluation in Chinese language. PLOS Digit Health. 2023;2(12):e0000397. pmid:38039286
  38. Fang T, Yang S, Lan K, Wong DF, Hu J, Chao LS, et al. Is ChatGPT a highly fluent grammatical error correction system? A comprehensive evaluation. arXiv:230401746 [Preprint]. 2023.
  39. Wang J, Liang Y, Meng F, Li Z, Qu J, Zhou J. Cross-lingual summarization via ChatGPT. arXiv:230214229 [Preprint]. 2023.
  40. Zhang X, Li S, Hauer B, Shi N, Kondrak G. Don't trust ChatGPT when your question is not in English: a study of multilingual abilities and types of LLMs. arXiv:230516339 [Preprint]. 2023.
  41. Marvin R. Targeted syntactic evaluation of language models. arXiv:180809031 [Preprint]. 2018.
  42. Warstadt A, Parrish A, Liu H, Mohananey A, Peng W, Wang SF. BLiMP: the benchmark of linguistic minimal pairs for English. Trans Assoc Comput Linguist. 2020;8:377–92.
  43. Wilcox EG. On the predictive power of neural language models for human real-time comprehension behavior. arXiv:200601912 [Preprint]. 2020.
  44. Ambridge B, Blything L. Large language models are better than theoretical linguists at theoretical linguistics. Theor Linguist. 2024;50(1–2):33–48.
  45. Bender EM, editor. Linguistically naïve != language independent: why NLP needs linguistic typology. Proceedings of the EACL 2009 workshop on the interaction between linguistics and computational linguistics: virtuous, vicious or vacuous? 2009.
  46. Lau JH, Clark A, Lappin S. Grammaticality, acceptability, and probability: a probabilistic view of linguistic knowledge. Cogn Sci. 2017;41(5):1202–41. pmid:27732744
  47. Cuetos F, Mitchell DC. Cross-linguistic differences in parsing: restrictions on the use of the Late Closure strategy in Spanish. Cognition. 1988;30(1):73–105. pmid:3180704
  48. Carreiras M, Clifton C Jr. Another word on parsing relative clauses: eyetracking evidence from Spanish and English. Mem Cognit. 1999;27(5):826–33. pmid:10540811
  49. Carreiras M, Clifton C Jr. Relative clause interpretation preferences in Spanish and English. Lang Speech. 1993;36(4):353–72.
  50. Dussias PE. Bilingual sentence parsing. In: One mind, two languages: bilingual sentence processing; 2001. p. 159–76.
  51. Fernández E, Bradley D, Fodor J, editors. Relative clause attachment in English and Spanish: cross-linguistic similarities. Proceedings of the thirteenth annual CUNY conference on sentence processing, California; 2000.
  52. Shen X. Late assignment of syntax theory: evidence from Chinese and English. United Kingdom: University of Exeter; 2006.
  53. Ehrlich K, editor. Low attachment of relative clauses: new data from Swedish, Norwegian and Romanian. The 12th annual CUNY conference on human sentence processing, New York, NY; 1999.
  54. Quinn D, Abdelghany H, Fodor JD, editors. More evidence of implicit prosody in silent reading: French, English and Arabic relative clauses. Poster presented at the 13th annual CUNY conference, La Jolla, CA; 2000.
  55. Dussias PE, Sagarra N. The effect of exposure on syntactic parsing in Spanish–English bilinguals. Biling Lang Cogn. 2007;10(1):101–16.
  56. Maia M, Fernández EM, Costa A, do Carmo Lourenço-Gomes M. Early and late preferences in relative clause attachment in Portuguese and Spanish. J Port Linguist. 2007;6(1).
  57. Soares AP, Fraga I, Comesaña M, Piñeiro A. El papel de la animacidad en la resolución de ambigüedades sintácticas en portugués europeo: evidencia en tareas de producción y comprensión [The role of animacy in the resolution of syntactic ambiguities in European Portuguese: evidence from production and comprehension tasks]. Psicothema. 2010:691–6.
  58. Soares AP, Oliveira H, Ferreira M, Comesaña M, Macedo AF, Ferré P, et al. Lexico-syntactic interactions during the processing of temporally ambiguous L2 relative clauses: an eye-tracking study with intermediate and advanced Portuguese-English bilinguals. PLoS One. 2019;14(5):e0216779. pmid:31141531
  59. Lee D, Kweon SO. A sentence processing study of relative clauses in Korean with two attachment sites. 담화와인지. 2004;11(2):125–40.
  60. Lee SY. Processing of relative clause attachment ambiguity in Korean. Poster presented at the 33rd annual CUNY conference on human sentence processing. Amherst: University of Massachusetts; 2020.
  61. Lim N. Processing of relative clauses in Korean: high vs. low attachment. 언어. 2012;37(3):719–36.
  62. Lim NS. Korean-English bilinguals' sentence processing of relative clauses in Korean. 언어과학연구. 2012;60:169–90.
  63. Moon N, Yun H. The role of honorific agreement in the resolution of relative clause attachment ambiguity. 영어영문학. 2021;26(4):23–53.
  64. Kamide Y, Mitchell DC. Relative clause attachment: nondeterminism in Japanese parsing. J Psycholinguist Res. 1997;26:247–54.
  65. Brysbaert M. Modifier attachment in sentence parsing: evidence from Dutch. Q J Exp Psychol Sect A. 1996;49(3):664–95.
  66. Mitchell DC, Brysbaert M, Grondelaers S, Swanepoel P. Modifier attachment in Dutch: testing aspects of construal theory. In: Reading as a perceptual process. Elsevier; 2000. p. 493–516.
  67. Colonna S, Pynte J, Mitchell DC, editors. Relative clause attachment in French: the role of constituent length. 13th CUNY conference on human sentence processing; 2000.
  68. Zagar D, Pynte J, Rativeau S. Evidence for early closure attachment on first pass reading times in French. Q J Exp Psychol Sect A. 1997;50(2):421–38.
  69. Hemforth B, Konieczny L, Scheepers C, Strube G. Syntactic ambiguity resolution in German. In: Sentence processing: a crosslinguistic perspective. Brill; 1998. p. 293–312.
  70. Papadopoulou D, Clahsen H. Parsing strategies in L1 and L2 sentence processing: a study of relative clause attachment in Greek. SSLA. 2003;25(4):501–28.
  71. Frazier L, Fodor JD. The sausage machine: a new two-stage parsing model. Cognition. 1978;6(4):291–325.
  72. Acuña-Fariña C, Fraga I, García-Orza J, Piñeiro A. Animacy in the adjunction of Spanish RCs to complex NPs. Eur J Cogn Psychol. 2009;21(8):1137–65.
  73. Huang K-J, Arehalli S, Kugemoto M, Muxica C, Prasad G, Dillon B. Large-scale benchmark yields no evidence that language model surprisal explains syntactic disambiguation difficulty. J Mem Lang. 2024;137:104510.
  74. Davis FL. On the limitations of data: mismatches between neural models of language and humans. Cornell University; 2022.
  75. Taulé M, Martí MA, Recasens M. AnCora: multilevel annotated corpora for Catalan and Spanish. LREC; 2008.
  76. Gibson E, Pearlmutter N, Canseco-Gonzalez E, Hickok G. Recency preference in the human sentence processing mechanism. Cognition. 1996;59(1):23–59. pmid:8857470
  77. Efrat A, Levy O. The Turking test: can language models understand instructions? arXiv:201011982 [Preprint]. 2020.
  78. Jang M, Kwon DS, Lukasiewicz T, editors. BECEL: benchmark for consistency evaluation of language models. Proceedings of the 29th international conference on computational linguistics; 2022.
  79. Jiang Z, Xu FF, Araki J, Neubig G. How can we know what language models know? Trans Assoc Comput Linguist. 2020;8:423–38.
  80. Reynolds L, McDonell K, editors. Prompt programming for large language models: beyond the few-shot paradigm. Extended abstracts of the 2021 CHI conference on human factors in computing systems; 2021.
  81. Webson A, Pavlick E. Do prompt-based models really understand the meaning of their prompts? arXiv:210901247 [Preprint]. 2021.
  82. Scherrer N, Shi C, Feder A, Blei D. Evaluating the moral beliefs encoded in LLMs. Adv Neural Inf Process Syst. 2024;36.
  83. Ye J, Chen X, Xu N, Zu C, Shao Z, Liu S, et al. A comprehensive capability analysis of GPT-3 and GPT-3.5 series models. arXiv:230310420 [Preprint]. 2023.
  84. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, et al. GPT-4 technical report. arXiv:230308774 [Preprint]. 2023.
  85. Shahandashti KK, Sivakumar M, Mohajer MM, Belle AB, Wang S, Lethbridge TC. Evaluating the effectiveness of GPT-4 turbo in creating defeaters for assurance cases. arXiv:240117991 [Preprint]. 2024.
  86. Bates D, Maechler M, Bolker B, Walker S, Christensen RHB, Singmann H. Package 'lme4'. Convergence. 2015;12(1).
  87. R Development Core Team. R: a language and environment for statistical computing; 2019.
  88. Kuznetsova A, Brockhoff PB, Christensen RHB. lmerTest package: tests in linear mixed effects models. J Stat Softw. 2017;82(13).
  89. Zhao Z, Wallace E, Feng S, Klein D, Singh S, editors. Calibrate before use: improving few-shot performance of language models. International conference on machine learning. PMLR; 2021.
  90. Wenzek G, Lachaux M-A, Conneau A, Chaudhary V, Guzmán F, Joulin A, et al. CCNet: extracting high quality monolingual datasets from web crawl data. arXiv:191100359 [Preprint]. 2019.
  91. Hestness J, Narang S, Ardalani N, Diamos G, Jun H, Kianinejad H, et al. Deep learning scaling is predictable, empirically. arXiv:171200409 [Preprint]. 2017.
  92. Kaplan J, McCandlish S, Henighan T, Brown TB, Chess B, Child R, et al. Scaling laws for neural language models. arXiv:200108361 [Preprint]. 2020.
  93. OpenAI. [cited 2025 Apr]. Available from: https://platform.openai.com/docs/models
  94. Mahowald K, Ivanova AA, Blank IA, Kanwisher N, Tenenbaum JB, Fedorenko E. Dissociating language and thought in large language models. Trends Cogn Sci. 2024.
  95. Beniwal H, Singh M. Cross-lingual editing in multilingual language models. arXiv:240110521 [Preprint]. 2024.