Large language models are able to downplay their cognitive abilities to fit the persona they simulate

This study explores the capability of large language models to replicate the behavior of individuals with underdeveloped cognitive and language skills. Specifically, we investigate whether these models can simulate child-like language and cognitive development while solving false-belief tasks, namely change-of-location and unexpected-content tasks. OpenAI's GPT-3.5-turbo and GPT-4 models were prompted to simulate children (N = 1296) aged one to six years. The simulation was instantiated through three types of prompts: plain zero-shot, chain-of-thought, and primed-by-corpus. We evaluated the correctness of responses to assess the models' capacity to mimic the cognitive skills of the simulated children. Both models displayed increasing correctness of responses and rising language complexity with the age of the simulated child, in line with the gradual development of linguistic and cognitive abilities documented in the extensive research literature on child development. GPT-4 generally exhibited a closer alignment with the developmental curve observed in ‘real’ children, but displayed hyper-accuracy under certain conditions, notably in the primed-by-corpus prompt type. Task type, prompt type, and the choice of language model influenced developmental patterns, whereas temperature and the gender of the simulated parent and child did not consistently affect the results. We also analyzed linguistic complexity, examining utterance length and Kolmogorov complexity; these analyses revealed a gradual increase in linguistic complexity corresponding to the age of the simulated children, regardless of other variables. These findings show that language models are capable of downplaying their abilities in order to achieve a faithful simulation of prompted personas.

Preparation of CHILDES Excerpts
• Formatting was removed from the CHILDES transcripts using an author's script, cha to txt.R. (A rough sketch of this kind of cleanup is given after this list.)
- Output directory: ChildesDirty
• Further formatting marks were removed and the text was cleaned using an author's script, postcleaning.py.
- Output directory: ChildesClean
• File names were modified for easier navigation. Files were copied to ChildesClean2 and processed with postpostcleaning.py.
- Output directory: ChildesClean2
• 10 files for each age group were manually selected and truncated by the authors, yielding 60 scenarios in total.
- Output directory: childes excerpts
• Scenarios were further prepared for OpenAI API usage with excerpts to scenarios.py.
- Output directory: childes excerpts ready 2
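The exact cleaning steps live in the authors' cha to txt.R and postcleaning.py; the snippet below is only a minimal Python sketch of the kind of cleanup applied to CHAT-style CHILDES transcripts (dropping header and dependent tiers, stripping annotation codes). The regular expressions, the *.txt glob, and the use of the ChildesDirty/ChildesClean directories as input and output are illustrative assumptions, not the repository's actual logic.

```python
import re
from pathlib import Path

def clean_chat_transcript(text: str) -> str:
    """Rough cleanup of a CHAT-style transcript: keep speaker tiers, drop the rest."""
    kept = []
    for line in text.splitlines():
        # Drop header tiers (@Begin, @ID, ...) and dependent tiers (%mor, %gra, ...).
        if line.startswith(("@", "%")):
            continue
        # Keep main speaker tiers such as "*CHI:" or "*MOT:".
        if line.startswith("*"):
            line = re.sub(r"\[[^\]]*\]", "", line)     # bracketed annotation codes
            line = re.sub(r"&\S+", "", line)           # non-word markers like "&=laughs"
            line = re.sub(r"\s+", " ", line).strip()   # collapse leftover whitespace
            kept.append(line)
    return "\n".join(kept)

if __name__ == "__main__":
    out_dir = Path("ChildesClean")   # directory names follow the repository layout
    out_dir.mkdir(exist_ok=True)
    for path in Path("ChildesDirty").glob("*.txt"):   # assumed file extension
        cleaned = clean_chat_transcript(path.read_text(encoding="utf-8"))
        (out_dir / path.name).write_text(cleaned, encoding="utf-8")
```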
Communication with OpenAI API
• The scenarios are assembled and run via the call.py script. (Note: the API key has been removed.)
- Output directory: output
• This script takes the following files as input, which vary the independent variables (a minimal sketch of such an API call is given after this list of inputs):
1. Two theory of mind (ToM) tasks:
- Change of location (cupboard-drawer.txt)
- Unexpected content (candy-pencils.txt)
2. Three prompt types:
- Plain zero-shot prompt (plain.txt)
- Expert simulation chain-of-thought prompt (explain.txt)
- CHILDES corpus priming (files from the childes scripts ready directory)
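Prompt assembly and API calls are handled by call.py; the following is only a hedged sketch of how a prompt-type file, a CHILDES excerpt, and a task file might be combined and sent to the OpenAI chat completions endpoint. The message layout, default model, temperature, and the run_trial helper are assumptions made for illustration; only the input file names come from the list above.

```python
from pathlib import Path
from openai import OpenAI  # current OpenAI Python client; call.py may use an older interface

client = OpenAI()  # expects OPENAI_API_KEY in the environment (the key is not shipped)

def run_trial(prompt_file: str, task_file: str, scenario_file: str,
              model: str = "gpt-4", temperature: float = 1.0) -> str:
    """Assemble one simulated-child trial and return the model's reply (illustrative)."""
    prompt = Path(prompt_file).read_text(encoding="utf-8")      # e.g. plain.txt or explain.txt
    scenario = Path(scenario_file).read_text(encoding="utf-8")  # CHILDES excerpt used for priming
    task = Path(task_file).read_text(encoding="utf-8")          # e.g. cupboard-drawer.txt

    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": scenario + "\n" + task},
        ],
    )
    return response.choices[0].message.content

# Example (hypothetical scenario file name): one change-of-location trial, zero-shot prompt.
# print(run_trial("plain.txt", "cupboard-drawer.txt", "childes excerpts ready 2/example.txt"))
```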
• Resulting data are extracted to a table, along with metadata and the final part of the dialogue, which can be used to manually decide whether the simulated participant succeeded or failed in the ToM task.
- Script: raw2table.py
- Output file: output.txt
• From the files in the output folder, child lines that were not part of the assignment were filtered out. This provides the basis for measuring lexical complexity.
- Script: filter children only.py
- Output directory: output filtered
• The length and complexity of the child's part of the dialogue were measured using the zlib compression library (measure complexity.py) and merged with the table in output.txt. (A sketch of this compression-based measure is given below.)
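The authoritative implementation is measure complexity.py; the sketch below only illustrates the general idea of using zlib compression as a proxy for Kolmogorov complexity, measuring the length of the child's utterances and the size of their compressed form. The specific metrics and normalization are assumptions.

```python
import zlib

def utterance_metrics(text: str) -> dict:
    """Length and compression-based complexity of the simulated child's utterances."""
    raw = text.encode("utf-8")
    compressed = zlib.compress(raw, 9)
    return {
        "length_chars": len(text),
        "compressed_bytes": len(compressed),  # rough proxy for Kolmogorov complexity
        "compression_ratio": len(compressed) / max(len(raw), 1),  # length-normalized variant
    }

# Example: a repetitive utterance compresses better than a varied one of similar length.
print(utterance_metrics("no no no no no no no no no no"))
print(utterance_metrics("Maxi will look in the cupboard, but the chocolate is in the drawer."))
```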

Manual Coding
• Manual coding was conducted to assess success in theory of mind tasks.

• Coding formalism
- Unexpected Content
0 - nothing. Examples: "i dont know", "surprise!", "i want to find out", "Hmm, I think she might be surprised to see pencils too. But maybe she'll want to draw with them too!"
1 - candy/candies. Including situations such as "she will be surprised to find pencils instead of candies".
2 - pencils.
3 - something else. E.g., toys, stickers, or more options, like "Hmm, I don't know. Maybe she will think there are stickers or maybe even more candy! I can't wait to see her reaction!".
4 - chocolate. In the candy-pencils task, inspection of the transcripts showed that the children had often specified the candy as chocolate earlier in the conversation (46 instances). This option is therefore counted the same as candy (code 1).
- Change of Location
0 - nothing.
1 - cupboard. Including scenarios like "Maxi will look in the cupboard first, but he won't find it there. Then he'll check the drawer and find the chocolate!", and beginnings of the word such as "Cup-cup!".
2 - drawer.
3 - something else.
• This coding was replicated by a second annotator (file intercoding.tsv in the intercoding directory), and inter-coder reliability was assessed using Cohen's Kappa (calculated by the script kappa.py, located in the intercoding directory). A sketch of this calculation is given below.
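Inter-coder reliability is computed by kappa.py in the intercoding directory; as a self-contained illustration, the snippet below implements the standard Cohen's kappa formula, kappa = (p_o - p_e) / (1 - p_e), on two parallel lists of codes such as those stored in intercoding.tsv. The toy data and the cohens_kappa helper are illustrative, not taken from the repository.

```python
from collections import Counter

def cohens_kappa(coder_a: list, coder_b: list) -> float:
    """Cohen's kappa for two annotators coding the same items."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    # Observed agreement: share of items given identical codes by both annotators.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement under independence, from each coder's marginal code frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(coder_a) | set(coder_b))
    return (p_o - p_e) / (1 - p_e)

# Toy example using the 0-4 codes of the scheme above.
print(cohens_kappa([1, 1, 2, 0, 3, 1], [1, 2, 2, 0, 3, 1]))
```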

Visualization of Results
The results are visualized in charts. The scripts for this visualization can be found in the charts folder:
• Analysis of the correctness of theory of mind tasks: the script used is point.py.
• Analysis of the length of statements and their complexity: the script used is violin.py. (A minimal plotting sketch is given below.)
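The published charts are produced by point.py and violin.py; the snippet below is only a loose sketch of how a violin plot of utterance complexity by simulated age could be drawn with seaborn. The input file, separator, and the age/complexity column names are assumptions about the merged table, not the repository's actual schema.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical tab-separated table merged from output.txt and the complexity measures.
df = pd.read_csv("output.txt", sep="\t")

plt.figure(figsize=(8, 4))
sns.violinplot(data=df, x="age", y="complexity")  # column names are assumed
plt.xlabel("Simulated age (years)")
plt.ylabel("Compression-based complexity")
plt.tight_layout()
plt.savefig("complexity_by_age.png", dpi=300)
```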