
Exploring the role of meaning in non-Māori speakers’ ‘proto-lexicon’

  • Wakayo Mattingley ,

    Roles Data curation, Formal analysis, Investigation, Methodology, Project administration, Visualization, Writing – original draft, Writing – review & editing

    wakayo.mattingley@canterbury.ac.nz

    Affiliation New Zealand Institute of Language, Brain and Behaviour, University of Canterbury, Christchurch, New Zealand

  • Forrest Panther,

    Roles Data curation, Investigation, Methodology, Writing – review & editing

    Affiliations New Zealand Institute of Language, Brain and Behaviour, University of Canterbury, Christchurch, New Zealand, Department of Linguistics, University of Canterbury, Christchurch, New Zealand

  • Simon Todd,

    Roles Formal analysis, Resources, Writing – review & editing

    Affiliations New Zealand Institute of Language, Brain and Behaviour, University of Canterbury, Christchurch, New Zealand, Department of Linguistics, University of California, Santa Barbara, California, United States of America

  • Jennifer Hay,

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Writing – review & editing

    Affiliations New Zealand Institute of Language, Brain and Behaviour, University of Canterbury, Christchurch, New Zealand, Department of Linguistics, University of Canterbury, Christchurch, New Zealand

  • Jeanette King,

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Writing – review & editing

    Affiliations New Zealand Institute of Language, Brain and Behaviour, University of Canterbury, Christchurch, New Zealand, Aotahi: School of Māori and Indigenous Studies, University of Canterbury, Christchurch, New Zealand

  • Peter J. Keegan

    Roles Conceptualization, Funding acquisition, Writing – review & editing

    Affiliation Te Puna Wānanga, Faculty of Arts and Education, University of Auckland, Auckland, New Zealand

Abstract

Previous work has demonstrated that New Zealanders who do not speak Māori but are regularly exposed to the language develop implicit knowledge of it. The core of this knowledge, it has been argued, is the ‘proto-lexicon’—a set of stored word-forms, without associated meaning, which yields subsequent Māori phonotactic and morphological knowledge. Previous research shows that having a proto-lexicon gives learners a head start in learning Māori word meanings in formal education. We investigate experimentally whether the proto-lexicon confers an advantage for attaching meanings to words. In Experiment 1, non-Māori-speaking New Zealanders were tested on their ability to identify meanings of Māori words in a forced-choice definition task, and they did this relatively well. Then, words with low accuracy were selected for Experiment 2, where non-Māori-speaking New Zealanders and non-New Zealanders were asked to learn meanings for Māori words and nonwords. New Zealanders performed better, indicating that familiarity with Māori word shapes confers an advantage. However, they showed no greater advantage for real words over nonwords. If these words are definitely in the proto-lexicon, then this would suggest that knowledge of individual word-forms does not, in fact, confer an advantage. In Experiment 3, we therefore explore whether the words in Experiment 2 are actually robustly in the participants’ proto-lexicon, by running a word identification task with the same participants. These words were not robustly distinguished from nonwords. By selecting words for their lack of semantic knowledge, we also inadvertently selected words that do not appear to be in the proto-lexicon. Together, our results indicate that different levels of semantic knowledge exist for different words, even when we consider only words that cannot confidently be said to be in a full lexicon. The results suggest that the claim of previous studies that the proto-lexicon is ‘without semantics’ may be oversimplified.

Introduction

Word learning is a fundamental component of language learning and happens across life [1–4]. Lexical knowledge is a complex construct comprising different levels of linguistic representation [5,6]. For example, in our mental lexicon, word-forms (sequences of phonemes) have associated meanings (semantics), orthographic representations, and both morphological and syntactic features. In this study, we focus on a ‘proto-lexicon’—a memory store of word-forms without accompanying meanings [7–10]—which is an important part of the language learning process. We are interested in how the proto-lexicon assists the learning of word meanings in an experimental setting. While past work has examined the learning of new meanings for words that are already stored in a mental lexicon with different meanings (e.g., [11–13] for children, [14,15] for adults), there has been little discussion of the learning of meanings for words in a proto-lexicon.

Previous studies show that non-Māori-speaking New Zealanders and non-Spanish-speaking Californians and Texans have impressive implicit proto-lexical knowledge of their respective ambient languages, acquired through regular passive exposure [16–18]. For example, incidental exposure to Māori allows non-Māori-speaking New Zealanders (NMS) to develop a proto-lexicon containing over a thousand words or word parts [16], while their active vocabulary consists of only around 70 words [19]. This proto-lexicon provides phonotactic and morphological knowledge [17,20]. Furthermore, Mattingley et al. [21] showed that having a proto-lexicon gives learners a head-start in explicit learning of word meanings in an introductory language course. However, it is not yet clear how this proto-lexicon can be beneficial for learning word meanings in an experimental setting without any contextual cues, nor what role knowledge contained in the adult proto-lexicon plays in learning word meanings. We expect that learning the form-meaning pairing of a word would be easier for someone who already has knowledge of the form (via a proto-lexicon) and needs only to attach meaning to it than for someone who has knowledge of neither form nor meaning and needs to learn them both.

This study primarily focuses on proto-lexical knowledge attained in a natural language situation. It tests the extent to which a Māori proto-lexicon facilitates learning Māori words, by teaching New Zealanders and non-New Zealanders Māori words paired with pseudo-meanings in short-term word learning tasks. Studying the role of proto-lexical knowledge provides important insights for theories of word learning and for the very earliest stages of language acquisition in adults.

Background

Te reo Māori

Te reo Māori is a Polynesian language and is the language of the indigenous people of New Zealand. Although the language has become endangered as a result of colonization, there are substantial revitalization initiatives. Māori is an official language and New Zealanders encounter Māori words and phrases regularly in media and public ceremonies. However, Māori is not a compulsory subject at school and only 4.3% of the population are able to hold a conversation about everyday things in Māori [22]. English is the most common language, but most New Zealanders are familiar with a small number of basic Māori vocabulary items that have been increasingly integrated into New Zealand English over time [19,23]. A more recent, second wave of borrowing from Māori into English has brought in more words from the domains of Māori society and culture [24,25]. Although the Māori vocabulary knowledge of NMS varies depending on how a word definition task is administered, the average NMS is likely to have an active Māori vocabulary of fewer than 100 words [19,23].

Māori has a relatively small phonology and a transparent spelling system [26]. It has 10 consonantal phonemes /p, t, k, m, n, ŋ, w, f, r, h/, which are written <p, t, k, m, n, ng, w, wh, r, h>. The five vowels /i, e, a, o, u/ correspond to the letters <i, e, a, o, u>. These five vowels also have long forms, which are usually indicated orthographically with a macron over the vowel. With regard to categorical Māori phonotactics, all syllables are open and onsets are empty or consist of any one of the consonants [27].

Māori, like other Polynesian languages, has limited inflectional morphology aside from the passive construction [26]. Derivation is a key morphological process, particularly through reduplication, compounding, and affixation, with several regular affixes used to modify word meaning [26].

Early stages of word learning

Lexical knowledge generally involves sound structure, word-form and word meaning [5,6]. The proto-lexicon can be understood as an initial, form-based repository of word-like units that infants acquire before these forms are meaningfully linked to semantics, representing a foundational stage within the continuum of lexical knowledge [28,29]. This proto-lexical stage supports later mapping between word-form and meaning, acting as a crucial bridge from pre-lexical representations to fully specified lexical entries.

By around 9 months, infants demonstrate sensitivity to permissible sound sequences in their native languages and the frequency of occurrence of statistically recurrent sound sequences in speech [30–35]. This pre-lexical knowledge enables infants to segment the speech stream and gradually build a proto-lexicon of remembered word-forms without detailed semantic content [33,36,37]. By 11 months, infants have mental representations of familiar word-forms acquired from everyday experience, allowing recognition of these forms even when pronunciation varies (e.g., mis-stressing) [38]. Importantly, this proto-lexicon develops implicitly and receptively, prior to explicit learning of word meanings or orthographic knowledge.

Research suggests that statistical learning of sound sequences is crucial in early word learning. Exposure to sound sequences with strong internal structures (i.e., high transitional probabilities) facilitates word learning by mapping meanings to segmented words [39,40]. Adults are also sensitive to such conditional statistics in languages [41], and this sensitivity is stable across the lifespan [42–44].

Phonotactic probability also influences lexical acquisition, with studies showing that school-aged children learn nonword and picture pairings better when the pairs include nonwords with high phonotactic probability [45]. Taken together, these findings emphasize the importance of statistical learning and phonotactic regularity in early language acquisition.

Previous work on building a Māori proto-lexicon

People who grow up in New Zealand are exposed to Māori throughout their lives. Computational modeling in a previous study demonstrated that NMS have a Māori proto-lexicon of more than 1500 words or word-parts [16]. This proto-lexicon enables NMS to distinguish real Māori words from phonotactically matched Māori nonwords, and to accurately rate the gradient wellformedness of Māori nonwords [16,17]. In addition, NMS use their phonotactic knowledge as a proxy for rating how likely an item is to be a real Māori word if they do not already know it, with the effect being most pronounced for nonwords and lower-frequency real words.

Proto-lexical representations refer to stored word-forms without associated meanings, while phonotactic knowledge consists of gradient phonotactic probabilities in permissible sound sequences rather than categorical phonotactic variations (i.e., phonology). The proto-lexicon and phonotactic knowledge represent distinct types of linguistic information, yet they are best understood as interconnected. Phonotactic knowledge is often assumed to arise from generalizations over stored words in the mental lexicon [46], and, similarly, early phonotactic knowledge in language acquisition is thought to emerge from generalizations over forms in the proto-lexicon [8,10].

Panther et al. [17] demonstrated that phonotactic generalizations emerge from the structure and distribution of stored forms in the proto-lexicon. Participants with a larger proto-lexicon—as indicated by their ability to distinguish real Māori words from similar nonwords—exhibited greater sensitivity to phonotactic patterns when judging the wellformedness of Māori nonwords. Together, these findings support the view that proto-lexical representations and phonotactic knowledge are not completely separate systems, but rather two interrelated aspects. Phonotactic knowledge is a generalization over (proto)lexical forms.

Crucially, this construct differs from a general familiarity with a language’s phonological grammar, which is typically conceptualized as an abstract rule-based system acquired through broad linguistic exposure (e.g., [47–49]). In contrast, proto-lexically derived phonotactic knowledge is statistical knowledge about the probabilities of sound sequences, emerging from exposure to specific word-forms, even in the absence of semantic content. This aligns with statistical learning accounts, which propose that linguistic knowledge arises from tracking the frequency and distribution of patterns in the input, rather than from abstract rule induction alone.

Subsequent studies have demonstrated that NMS have further well-developed implicit knowledge of Māori; NMS have some syntactic knowledge [50] and they are also able to morphologically segment Māori words in a similar way to fluent Māori speakers [20,51]. Further studies have revealed that incidental exposure to Māori continues to drive implicit learning and growth of the proto-lexicon throughout the adult lifespan; NMS who have more exposure during adulthood have more extensive knowledge [52].

Although NMS have more than 1500 words or word-parts stored in their proto-lexicon [16], they can explicitly define the meanings of only around 70 words, indicating that most of their knowledge is limited to form without accessible semantic content [19]. Nonetheless, they demonstrate sensitivity to various aspects of Māori knowledge such as morphological segmentation [20,51], which may reflect proto-lexical knowledge, and syntactic regularities [50], which likely emerge from broader implicit exposure. It is very likely that NMS are unaware of the meaning of most words in their proto-lexicon. However, a proto-lexicon may aid in the learning of form-meaning pairs, by providing forms to which meanings can be attached. Mattingley et al. [21] investigated the role of the proto-lexicon in a real-life language learning study, showing that the proto-lexicon can be activated to facilitate overt language learning. Adult students with larger proto-lexicons had a learning advantage when acquiring the meanings of Māori words in a formal education environment.

While proto-lexical representations are primarily form-based, emerging evidence suggests that they may nonetheless engage in graded or partial semantic associations. This aligns with research showing that sublexical or distributional features can influence semantic processing, even for unfamiliar word-forms or nonwords (e.g., [53,54]). Such findings suggest that the boundary between form-based proto-lexical knowledge and semantic knowledge may be more permeable than previously assumed—implying a continuum, rather than a strict dichotomy.

The present study

Taken together, the above studies show that a proto-lexicon can be built through both childhood and adulthood exposure to a target language in a natural language situation. This knowledge remains largely implicit, as NMS typically possess only a small explicit Māori vocabulary. Nevertheless, having a proto-lexicon appears to benefit learners when acquiring Māori word meanings in formal education settings.

However, we do not yet know whether there is a relationship between a proto-lexicon and word learning in controlled experimental settings. In Mattingley et al. [21], students’ word learning ability in Māori was assessed using words from their course materials. Some words were explicitly learned as required words in the course, while others were encountered incidentally through course exposure. Therefore, we do not know how robustly a proto-lexicon supports word learning in a controlled environment.

Additionally, it is possible that the Māori proto-lexicon comprises multiple knowledge phases or stages of lexical knowledge. For example, Dale [55] proposes four developmental stages of word knowledge. Each stage is characterized by learners’ comprehension and use of words. The four stages are: (1) having never seen the word before, (2) knowing there is such a word but not knowing what it means, (3) having a vague contextual placing of the word, (4) knowing the word and remembering it. In this framework, the second and third stages are most relevant to the present study. Specifically, the second stage can be seen as equivalent to the concept of a proto-lexicon.

Building on this framework, the present study asks: to what degree is the Māori word learning ability of NMS, who have a Māori proto-lexicon (i.e., the second stage), greater than that of people who do not have a Māori proto-lexicon (i.e., the first stage)? When NMS are compared to people who do not have a Māori proto-lexicon, we would expect the implicit knowledge of NMS to give them an advantage in attaching meaning to words. Based on the developmental stages of word knowledge [55], we hypothesize that most words in the Māori proto-lexicon are at the second stage and would therefore show an advantage over nonwords (at the first stage) when NMS learn meanings for them. On the other hand, non-New Zealanders would show no differences in the way they learn the meanings of real words and nonwords. Accordingly, the present study set out to address the following research question:

Primary RQ:

Is there a relationship between NMS’ proto-lexical knowledge and word learning in experimental settings? That is, does a proto-lexicon facilitate the attachment of meanings to word-forms?

To address this question, we also examine a range of related questions. One is the degree to which a forced-choice definition task gives the same answer about the degree of semantic knowledge as a free-response definition task. Another is the degree to which words that are not well-defined in a forced-choice task are actually robustly present in the proto-lexicon. We refer to both the forced-choice and free-response formats as word definition tasks throughout.

To address our primary research question, we conducted two separate web-based experiments. First, in Experiment 1, we ran a forced-choice definition task to identify Māori words with no robust associated meanings among NMS. These words served as stimuli for Experiment 2, where we directly answered our primary research question. In Experiment 2, we conducted short-term word learning tasks to test whether a non-semantic Māori proto-lexicon benefits learning word meanings. We taught pseudo-meanings for the selected words to NMS and non-New Zealanders living in the USA. By comparing performance across these groups, we explored how a proto-lexicon influences word learning in an experimental context.

Our findings show that NMS perform better at the task than non-New Zealanders, but that there is no difference in their performance on nonwords and real words. To explore a potential explanation for this, in Experiment 3, we conducted web-based word identification and wellformedness rating tasks with the same NMS participants from Experiment 2. This experiment assesses the extent to which these participants definitely have a Māori proto-lexicon and related implicit knowledge, while also exploring whether this proto-lexicon includes the specific Māori words examined in Experiment 2.

General methods

This study is based on three web-based experiments that we conducted, as described in the previous section, alongside distinct datasets from previous studies used for comparison. All our experiments in this article were conducted individually via the web, at a time convenient for participants and in a location of their choice. Participants were first presented with instructions and consent forms. All experiments were carried out with full ethical clearance from the Human Research Ethics Committee at the University of Canterbury (2017/90, 2022/10/LR-PS). Participation was entirely voluntary. Participants could choose to receive a $10 online gift voucher in Experiment 1. In Experiments 2 and 3, compensation was provided according to Prolific regulations.

Although the stimuli in Experiment 3 were presented in written form, the transparent orthographic-to-phonological mapping in Māori ensures that the written stimuli reflect the language’s phonological structure. This allowed us to assess participants’ sensitivity to gradient phonotactic probabilities—probabilistic patterns of permissible sound sequences in Māori. These patterns reflect implicit statistical learning rather than categorical phonological rules. Osborne et al. [Unpublished] demonstrated that NMS exhibit phonotactic sensitivity across both spoken and written modalities, supporting the cross-modal nature of this gradient phonotactic knowledge.

Participant exclusion criteria, including non-qualifying demographics, and statistical models were determined prior to data analysis, based on established procedures from previous studies (see Technical Supplement in S1 File for details).

Table 1 summarizes our experiments and the datasets used in this study. In the case of Panther et al. (2023), stimuli from both the wellformedness rating and word identification tasks were used in our experiments; however, only the word identification data were analysed in relation to Experiment 2.

Experiment 1: Definition task

In Experiment 1, we conducted a two-task, web-based experiment through a custom browser interface between January 22-26, 2022. The session consisted of a forced-choice definition task and a word splitting task. In this paper, we focus on the forced-choice definition task. Although participants also completed the word splitting task as part of the study design, it is not reported here because it addresses a different research question.

The definition task had two purposes. The primary purpose was to identify Māori words for which NMS cannot readily recognize the definitions, so that these words could be used as stimuli in the word learning task of Experiment 2. In our word learning task, we wanted to use words for which participants did not have robust existing meanings. We were also interested in whether different formats of definition task might reveal different degrees of strength of knowledge in relation to the various developmental stages. Thus, the secondary purpose was to compare our findings with a previous free-response definition task assessing the size of the active vocabulary [19].

Participants

Participants, recruited via paid Facebook advertisements, were 68 adults. Among those who completed the experiment, we excluded 13 participants with non-qualifying demographics (see Technical Supplement in S1 File for details). After excluding these participants, 55 adult native speakers of New Zealand English remained (45 female). All participants were aged between 18 and 60 years and did not have sufficient proficiency in Māori to hold a basic conversation, in line with the inclusion criteria. According to the demographic survey, 87.3% were monolingual, while 12.7% reported being able to speak one additional language, such as Japanese, French, or Spanish.

Materials

Stimuli in this task were real Māori words, drawn from the material in Panther et al. [17]. Panther et al.’s original material consisted of 521 Māori words and 521 Māori-like nonwords. All stimuli consisted of three to six phonemes and did not contain long vowels. Words were categorized according to frequency in spoken Māori (high, mid, low) based on the MAONZE corpus [56] and the Māori Broadcast Corpus [Boyce, Unpublished].

Oh et al. [16] demonstrated that the proto-lexicon of NMS is composed primarily of morphs, which are frequent phonological segments derived from ambient speech, rather than full words or semantically meaningful units. To measure participants’ phonotactic knowledge, we utilized the SRI Language Modeling Toolkit (SRILM) [57] to assign phonotactic scores to each word. These scores were based on morphs identified from fluent speakers’ segmentation of dictionary lemmas (see [16] and their Detailed Materials and Methods Supplement for further details). The morph dataset consists of unique types only, with each morph represented once. The phonotactic scores reflect length-normalized log-probabilities (base 10) calculated from the frequencies of phoneme trigrams. The scores ranged from –1.2 to –0.60, corresponding to average conditional probabilities of the third phoneme in a triphone (given the preceding two phonemes as context) that range from 0.063 (= 10^−1.2) to 0.251 (= 10^−0.6). A high phonotactic score indicates that the local sound sequences that make up the stimulus are, on average, quite typical or common in Māori words.
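For illustration, the following is a minimal R sketch of how a length-normalized trigram score of the kind described here could be computed. The triphone probabilities are hypothetical placeholders, not values from the study’s SRILM-trained morph model.

```r
# Minimal sketch (not the study's SRILM model): a length-normalized log10
# phonotactic score computed from hypothetical triphone probabilities, each
# P(phoneme_3 | phoneme_1, phoneme_2).

phonotactic_score <- function(triphone_probs) {
  mean(log10(triphone_probs))  # average log10 conditional probability
}

# Hypothetical probabilities for the triphones of a five-phoneme word
# (padding conventions at word edges are ignored for simplicity):
probs <- c(0.20, 0.15, 0.10)
score <- phonotactic_score(probs)
score       # about -0.84, within the reported range of -1.2 to -0.60
10^score    # back-transformed average conditional probability, about 0.14
```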

Each word was also assigned a neighborhood occupancy rate: the proportion of its potential phonological neighbors (i.e., phonotactically legal forms that are a Levenshtein edit distance of one from it) that correspond to a real Māori word. The neighborhood occupancy rate can be understood as a normalized version of phonological neighborhood density (i.e., the number of real words whose phonological forms are a Levenshtein edit distance of one from the stimulus), reflecting the probability that distortion in the perception of a single phoneme would result in a real word. In general, the longer a stimulus, the lower its neighborhood occupancy rate. A high neighborhood occupancy rate indicates that the stimulus globally resembles many Māori words. See Panther et al. [17] for further details.
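As an illustration of this measure, the R sketch below computes a neighborhood occupancy rate over character strings using base R’s adist(). The word lists are hypothetical, and the study’s calculation operates over phoneme sequences (treating digraphs such as <ng> and <wh> as single phonemes) and the full set of phonotactically legal forms.

```r
# Sketch of a neighborhood occupancy rate: the proportion of legal forms at
# Levenshtein distance 1 from a stimulus that are real Māori words.
# Hypothetical lists; the real computation uses phoneme sequences, not letters.

neighborhood_occupancy <- function(stimulus, legal_forms, real_words) {
  neighbors <- legal_forms[adist(stimulus, legal_forms) == 1]  # one edit away
  if (length(neighbors) == 0) return(NA_real_)
  mean(neighbors %in% real_words)  # proportion of neighbors that are real words
}

legal_forms <- c("kaha", "kihu", "kahu", "koha", "kuha", "puhu")  # hypothetical
real_words  <- c("kaha", "kahu", "koha")                          # hypothetical
neighborhood_occupancy("kuhu", legal_forms, real_words)           # 0.25 here
```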

Among the 521 Māori words, we selected a fixed set of 146 words from the high- and mid-frequency categories (high: n = 53; mid: n = 93) and obtained a definition for each word from a dictionary. For each participant, all 146 words were paired with all 146 definitions, such that a random subset of 73 words were paired with their correct (matched) definition and the remaining 73 words were paired with an incorrect (mismatched) definition corresponding to a different word. Stimulus characteristics were as follows: lengths (M = 4.35, SD = 0.92), phonotactic scores (M = −0.81, SD = 0.1), and neighborhood occupancy rates (M = 0.1, SD = 0.06).

Procedure

In this experiment, we tested adult NMS’ Māori word knowledge by a web-based binary forced-choice task. For each trial, participants saw a stimulus word and a definition in the middle of the screen. The definition could be either a correct (matched) definition or an incorrect (mismatched) definition. Participants were asked to answer whether the presented word-definition pair was ‘good’ or ‘not good’. After the participant clicked one of the options and the ‘Next’ button, the next stimulus was presented.

Before starting the task, participants were instructed to provide their best guess without relying on a dictionary or help from others, in order to obtain each individual’s actual knowledge of Māori words. After the task, participants also completed a post-questionnaire which included 27 demographic questions. The experimental session lasted less than 30 minutes in total.

To evaluate whether participants relied on external aids, we examined their total task duration and median reaction times. If a participant’s median reaction time exceeded two standard deviations above the group median, we also analysed their accuracy. Two participants exceeded this threshold, and their accuracy was near chance level (51% and 56%), suggesting they were not using external aids.
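A minimal R sketch of this screening step is given below, assuming a long-format data frame of trial-level responses with hypothetical column names (participant_id, rt, correct).

```r
# Flag participants whose median reaction time exceeds the group median by
# more than two standard deviations, then inspect their accuracy.
library(dplyr)

by_participant <- responses |>      # 'responses' is a hypothetical data frame
  group_by(participant_id) |>
  summarise(median_rt = median(rt),
            accuracy  = mean(correct))

cutoff  <- median(by_participant$median_rt) + 2 * sd(by_participant$median_rt)
flagged <- filter(by_participant, median_rt > cutoff)
flagged  # near-chance accuracy for flagged participants suggests no aid use
```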

Results

The word list for Experiment 2.

To determine the overall accuracy of definitions for each participant and each word, we calculated a d-prime score [58], a measure of participants’ sensitivity to the correctness of pairings between words and definitions. Participants with higher values of d-prime discriminated more accurately between matched and mismatched word-definition pairs, and words with higher values of d-prime were more accurately identified by participants as matched or mismatched to their paired definitions.
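A standard d-prime calculation of this kind can be sketched in R as follows; the counts and the log-linear correction for extreme rates are illustrative assumptions, not necessarily the study’s exact choices.

```r
# d-prime for the forced-choice definition task: 'hits' are "good" responses
# to matched pairs; 'false alarms' are "good" responses to mismatched pairs.

dprime <- function(hits, n_matched, false_alarms, n_mismatched) {
  # Log-linear correction keeps rates away from 0 and 1 (avoids infinite z)
  hit_rate <- (hits + 0.5) / (n_matched + 1)
  fa_rate  <- (false_alarms + 0.5) / (n_mismatched + 1)
  qnorm(hit_rate) - qnorm(fa_rate)
}

# Hypothetical counts for one participant (73 matched, 73 mismatched trials):
dprime(hits = 50, n_matched = 73, false_alarms = 30, n_mismatched = 73)
```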

In terms of stimulus words, the rate of correct responses varied among words, ranging between 33% for kuhu ‘to enter’ and 93% for whenua ‘land’. Participants selected the correct option more than 50% of the time for 109 of the 146 words (75%), which indicates that they can identify the meanings of a substantial subset of Māori words.

D-prime scores for individual stimuli ranged between –0.82 and 3.46. After initial data exploration, we selected stimuli by setting cutoff points for maximum d-prime and minimum phonotactic score. Stimuli were required to have a d-prime less than or equal to 0.3, indicating that participants were not consistently able to match them to their definitions, and a phonotactic score greater than or equal to –0.89 (against an overall range of –1.2 to –0.60), indicating that they were not composed of atypical phoneme sequences. Based on these criteria, we selected 48 words for the word learning task (Experiment 2). Because of their low d-prime scores, these words are less likely to be associated with semantic knowledge, and because their phonotactic scores are drawn from a high and narrow range (–0.868 to –0.645), their forms are all similarly typical for Māori, meaning there should not be asymmetries in the effect of phonotactic regularity on attaching meanings to word-forms.
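Applied to a per-word summary table, these selection criteria amount to a simple filter; the sketch below assumes a hypothetical data frame word_stats with columns word, dprime, and phono_score.

```r
# Selecting Experiment 2 stimuli: low d-prime (no robust semantic knowledge)
# but typical phonotactics (phonotactic score at or above -0.89).
library(dplyr)

experiment2_words <- word_stats |>
  filter(dprime <= 0.3,          # not reliably matched to their definitions
         phono_score >= -0.89)   # not composed of atypical phoneme sequences

nrow(experiment2_words)          # 48 words met these criteria in the study
```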

When creating the word list, we did not remove any participants on the basis of their d-prime scores, because the list was intended to reflect how widely the words were known regardless of individual participants’ ability.

Comparing results in two different definition tasks (Descriptive statistics).

For the secondary purpose of this study, we directly compare our findings with a previous definition task assessing the size of the active vocabulary [19]. Oh et al. [19] used a free-response definition task to evaluate the explicit knowledge of 132 Māori words by 123 NMS, finding that the average New Zealander can define about 70 words. We focus on the comparison between the free-response definition task and the forced-choice definition task. Both definition tasks measure NMS’ receptive knowledge of form and meaning and the varying degrees to which each word is known, but they tap different degrees of strength of knowledge [59].

To begin, five participants who responded to more than 95% of trials in the same way (e.g., “not good”) in our forced-choice definition task dataset were removed from the data analysis. After this removal, analyses were carried out on 7300 observations of stimulus words from 50 participants. The mean accuracy was nearly identical when summarized by item and by participant (M = 0.59 for both). However, the variability differed: accuracy varied more across items than across participants. This suggests that item difficulty contributed more to performance differences than individual differences among participants, who were relatively consistent in their accuracy.

For the 12 words overlapping between the current study and Oh et al.’s [19], we report the extent to which the mean accuracy of each word in one task correlates with its mean accuracy in the other. In order to compare the two definition tasks directly, we adjusted the accuracy score of each item in the free-response definition task as 0.5 × free-response accuracy + 0.5, because participants would be expected to respond correctly 50% of the time by random guessing in the forced-choice definition task.
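The adjustment and the item-level comparison can be sketched in R as follows; the accuracy values here are hypothetical placeholders standing in for the 12 overlapping words.

```r
# Put free-response accuracy on the forced-choice scale, where random
# guessing yields 50% correct, then compare the two tasks per item.

free_response_acc <- c(0.80, 0.45, 0.92)   # hypothetical per-word accuracies
forced_choice_acc <- c(0.88, 0.70, 0.95)   # hypothetical per-word accuracies

adjusted_free <- 0.5 * free_response_acc + 0.5

cor(adjusted_free, forced_choice_acc)      # item-level correlation across tasks
```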

Fig 1 shows the proportion of correct responses per word in both definition tasks. Half of the words have higher accuracy in the forced-choice definition task, by up to 10 percentage points compared with the free-response definition accuracy. In a few cases, the two tasks give similar accuracy. Three words (hongi ‘pressing noses in greeting’, taniwha ‘water spirit’, whare ‘house, building’) that were defined accurately more than 90% of the time in both tasks are strongly related to Māori culture. This might be attributed to a ‘second wave’ of borrowings from te reo Māori, driven by Māori speakers, in which more Māori words in the domains of Māori society and culture have been integrated into New Zealand English [24,25].

Fig 1. The relationship between two tasks for the proportion of correct responses for each word.

Definitions of words are given in brackets.

https://doi.org/10.1371/journal.pone.0339325.g001

On the other hand, the word whenua ‘land’, one of the most frequent Māori words in New Zealand contexts, shows near-perfect accuracy in the forced-choice task but about 15% lower accuracy in the free-response definition task. We can see a similar trend for the word ora ‘to be alive’, which appears in Kia ora, a Māori-language greeting that has been integrated into New Zealand English. These words are more likely to be at stage 3 (i.e., a vague contextual placing of the word) in Dale’s [55] four developmental stages of word knowledge. In the free-response definition task [19], even if participants provided the meanings of partially known words (for example, knowing that iwa was a number, but not that it meant ‘nine’), these responses were regarded as incorrect definitions. It seems that presenting a definition alongside a word helps participants retrieve the word from memory or infer the definition with less mental effort; without any cues, participants must recall what they know about a word with more mental effort.

The forced-choice accuracy is generally higher. However, the difference between the two definition tasks is not large once the accuracy scores of each item in the free-response definition task are adjusted to account for random guessing in the forced-choice definition task.

Summary of Experiment 1

In Experiment 1, participants could identify the meanings of a substantial subset of Māori words. Recalling Dale’s [55] developmental stages of word knowledge introduced in The Present Study section, our findings relate most closely to the second and third stages. Although it has been assumed that most words in the proto-lexicon are at the second stage (i.e., knowing there is such a word but not knowing what it means), it is clear that for NMS, some words are actually at the third stage (i.e., a vague contextual placing of the word).

In terms of definitions, most words had higher accuracy in the forced-choice task than in a previous study using a free-response definition task, whereas words related to Māori culture were associated with their meanings more than 90% of the time in both definition tasks. The findings are consistent with previous corpus studies, which show that in the second wave of borrowing from Māori into English, words related to Māori society and culture have increased in token frequency [60] and Pākehā (European or white New Zealanders) are more likely to use loanwords that refer to Māori culture [61].

Experiment 2: Learning meanings of words

Experiment 2 is designed to address our primary research question: whether there is a relationship between NMS’ proto-lexical knowledge and word learning in experimental settings, and whether a proto-lexicon facilitates the attachment of meanings to word-forms in Māori. We conducted a web-based experiment through a custom browser interface via Prolific (www.prolific.com) on September 18–19, 2022 for non-New Zealanders and from November 28, 2022 to January 9, 2023 for NMS. Experiment 2 was a word learning experiment divided into three phases. In the learning phase, participants were asked to try their best to learn new words. In the two test phases, participants were tested on their word learning. Participants completed a post-questionnaire after the tasks. The experimental session lasted less than 20 minutes in total. The details of each task are given below.

Participants

Experiment 2 involved two groups of adult participants: (a) NMS and (b) non-New Zealanders living in the USA. Among people who completed the experiment, we excluded participants who did not pay attention to the tasks or did not fulfill our demographic conditions (see Technical Supplement in S1 File for details). After removing unusable participants, there were 137 participants in total: 70 NMS (36 female) and 67 US participants (31 female). All participants were native speakers of English, aged between 18 and 60, and had never studied linguistics at a college, university, or community college. They were not able to hold a basic conversation in te reo Māori and had not lived outside their country for longer than a year since the age of seven. According to the demographic survey, among NMS participants, 81.4% were monolingual, while 2.9% reported speaking two additional languages and 15.7% one additional language. Among US participants, 82.1% were monolingual, 4.5% reported speaking two additional languages, and 13.4% one additional language. Examples of additional languages include Japanese, French, and Spanish.

Materials

Word stimuli.

As mentioned in the Materials section of Experiment 1, we selected 146 words from high and mid frequency categories from an original stimulus set [17] and conducted a definition experiment with participants. We filtered out words that participants were able to identify easily in the definition experiment. For the word learning experiment, we selected 48 words with low d-prime scores from the definition experiment that spanned a narrow range of phonotactic scores, along with their phonotactically matched nonwords (words/nonwords: phonotactic score range = -0.87 to -0.65). We focused on this narrow range of phonotactic scores because phonotactic regularity promotes attaching meanings to word-forms (e.g., [45]), as discussed earlier. This approach allowed us to disentangle effects of form-based (proto-lexical) knowledge on word learning from phonotactic knowledge. The word and nonword stimuli were closely matched on key lexical characteristics. Both stimulus types had identical average lengths (M = 4.54, SD = 0.74), phonotactic scores (M = −0.77, SD = 0.06), and neighborhood occupancy rates (M = 0.09, SD = 0.06). This ensured that any observed effects could not be attributed to differences in these properties. Table 2 presents example word/nonword pairs with their phonotactic scores.

Table 2. Example word–nonword pairs from Experiment 2 with phonotactic scores.

https://doi.org/10.1371/journal.pone.0339325.t002

Picture stimuli.

In order to obtain a range of objects for which NMS are unlikely to know the true Māori term, we selected 16 objects from the online database of the International Picture Naming Project [62]. Several criteria were used to control the effects of the picture stimuli. The database was searched using English as the language for picture items. All of the pictures depicted objects from a range of semantic categories, including animals, small and large artifacts, body parts, vehicles, clothing, objects in nature, and people. All of the corresponding English words consisted of one syllable (e.g., bat, bridge), and the number of alternative English names for the object was three or fewer. Pictures were black line-drawings.

Stimulus set.

In the experiment, there were three experimental tasks (learning phase, post-test phase 1, post-test phase 2), each with three repetitions. There were 16 picture stimuli, 48 real words, and 48 nonwords, as described in the previous section. Each participant had to attach meanings to 16 word-form stimuli. We therefore prepared 36 experimental configurations pairing stimuli with pictures: six random orders of the pictures crossed with six random samples of eight words and eight nonwords. Each stimulus appears only once across the six word stimulus samples (i.e., there are no repeat stimuli across the samples).

For the purpose of matching stimuli for the experimental tasks, a procedure was followed to identify: (i) another stimulus providing an ‘incorrect’ image for the learning phase (task 1); (ii) an image serving as the incorrect option for a ‘word’ in post-test phase 1 (task 2); and (iii) an ‘incorrect’ ‘word’-image pairing for post-test phase 2 (task 3). For task 1, stimuli were randomly paired together, with each stimulus uniquely matched to another, for each experimental configuration. For task 2, the same procedure was followed, except that each stimulus’s task 1 pair was excluded. For task 3, the stimuli were split in half: half retained their ‘correct’ ‘word’-image pairing, while the other half were matched with another stimulus within that group (which could not also be the task 1 or task 2 pairing) and had their ‘word’-image pairings swapped; these formed the ‘incorrect’ matching group. The correct response could appear on either the left or the right side of the screen.
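The pairing logic can be approximated by the R sketch below (hypothetical item names): task 1 forms random mutual pairs, and task 2 re-draws assignments until no stimulus is matched to itself or to its task 1 partner. The study’s actual implementation may differ in detail.

```r
set.seed(1)
stimuli <- paste0("item", sprintf("%02d", 1:16))

# Task 1: 8 random mutual pairs; each stimulus supplies the 'incorrect'
# image for its partner.
idx <- matrix(sample(16), ncol = 2, byrow = TRUE)
task1_pair <- setNames(character(16), stimuli)
task1_pair[stimuli[idx[, 1]]] <- stimuli[idx[, 2]]
task1_pair[stimuli[idx[, 2]]] <- stimuli[idx[, 1]]

# Task 2: rejection sampling until the new pairing avoids both the stimulus
# itself and its task 1 partner.
repeat {
  cand <- setNames(sample(stimuli), stimuli)
  if (all(cand != stimuli & cand != task1_pair)) break
}
task2_pair <- cand
```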

Procedure

Participants completed three tasks that were implemented and presented to participants using OpenSesame and OSweb [63,64]. Before starting the experiment, all participants were presented with a page containing information about the experiment, and in the study description we mentioned that we would be using an indigenous language of New Zealand, Māori, as the basis of this research.

Participants were presented with instructions on the computer screen prior to each task. At first, participants were instructed: “You are stranded on an island and you don’t know the language spoken on this island. Your task is to learn the names of objects in the language of the island.” Participants were told that if their decision took too long, a new item would be shown. They were instructed not to worry about making mistakes and simply to try to learn as much as they could throughout the task. After reading each instruction, participants needed to press a specific key to start the task.

The experiment was divided into three tasks. Each task had three repetitions in different stimulus order. The purpose of task 1 (learning phase) was to learn the names of objects in the language of the experimental setting. Each trial started with a fixation dot presented in the center of the screen for 500 ms (the same applied to the following tasks). Participants then saw a picture in the middle of the screen. After a further 500 ms delay, two words appeared, one of which was the correct name of the pictured object. One word was presented on the left side of the picture and the other on the right side. Participants were instructed to press the ‘z’ key if they thought the picture matched the word on the left side of the screen and to press the ‘m’ key if they thought the picture matched the word on the right side of the screen. If a participant responded incorrectly, they would see a red cross in the middle of the screen; if they responded correctly, they would see a green tick. On each trial, the stimuli were presented together for (up to) 5000 ms; if a participant did not respond within this time limit, a new item would be shown without any feedback.

In task 2 (post-test phase 1), on each trial, participants saw a word in the middle of the screen, then after 500 ms of delay, two pictures appeared. One picture was presented on the left side of the word and the other on the right side. The task was to decide which picture matched the word. If participants thought the left picture matched the word, then they pressed the ‘z’ key. If they thought the right picture matched the word, then they pressed the ‘m’ key. No feedback was given for this phase. On each trial, the stimuli were presented together for 5000 ms or until the participant responded, at which point the fixation for the next trial began.

In task 3 (post-test phase 2), on each trial, participants saw a word above a picture. The word may or may not have been the correct name for the picture. After a 500 ms delay, two response options appeared: ‘correct’ and ‘incorrect’. The word “incorrect” was presented on the left side of the stimuli and the word “correct” on the right side. The task was to decide whether the word and the picture formed a correct or an incorrect pair. If participants thought the pair was correct, they pressed the ‘m’ key. If participants thought the pair was incorrect, they pressed the ‘z’ key. No feedback was given for this phase. On each trial, the stimuli were presented together for 5000 ms or until the participant responded, at which point the fixation for the next trial began.

After the third task, participants answered a demographic questionnaire using a customized version of the open-source survey software LimeSurvey [65]. The experiment took less than 20 minutes to complete. At the end of the experiment, a debriefing statement appeared, explaining that some of the words participants had learnt were not real Māori words and that the meanings they had learnt for the real Māori words were not their real meanings.

Results

In order to assess the ability of NMS and non-New Zealanders to learn the word meanings, responses were coded as accurate or inaccurate. For tasks 1 and 2, responses were coded as accurate if the participant chose the correct word/picture; for task 3, responses were coded as accurate if the participant correctly indicated whether or not the word-picture pair was a match. Analyses were carried out on 19,728 observations of stimulus words (48 stimuli × 3 tasks × 137 participants) with mixed-effects logistic regression, using the glmer function in the lme4 library of R [66,67] with the bobyqa optimizer.

Table 3 presents the mean accuracy (in percentages) and standard deviations for each combination of participant group (NZ, US), stimulus type (word, nonword), and task (1–3). Sample sizes (N) indicate the total number of observations included in each condition. Accuracy rates were generally consistent across groups, with some variability observed across tasks and stimulus types. Notably, except for Task 1, the NZ group tended to perform better than the US group. This summary provides a descriptive context for interpreting subsequent statistical analyses.

Table 3. Mean accuracy (percentage) with standard deviation and sample size (N) by language group, stimulus type, and task.

https://doi.org/10.1371/journal.pone.0339325.t003

The dependent variable was accuracy, defined above. The fixed effects considered as test predictors were participant group (sum-coded, US[1], NZ[–1]), word type (treatment-coded, reference: nonword), and experimental task (Helmert-coded). We added education (treatment-coded, reference: high school) as a control predictor, to account for the possibility that individuals with different educational backgrounds might differ in general learning capability. Our primary interest is in the interactions between the test predictors. Our consideration of interactions with experimental task is guided by the two comparisons formed by the Helmert coding: the first comparison (task 1 vs. tasks 2-3) indicates the difference between the learning phase and the two test phases, and the second comparison (task 2 vs. task 3) indicates the difference between the two tests.

We hypothesized that because the real words in the experiment are part of NMS’ proto-lexicon, NMS would learn the meanings of the real words more accurately than non-New Zealanders. We also expected that NMS would learn the meanings of the real words more accurately than the meanings of the nonwords, but that non-New Zealanders would show no differences in the way they learn the meanings of real words and nonwords. We started the stepwise model fitting process with interactions between all test predictors, as in the previous analysis (see [21]) described in detail in the supplementary materials. An initial model contained a three-way interaction between group, type, and task, with education as a simple effect: glmer(Accuracy ~ Group * Type * Task + Education + (1|PID) + (1|Word)). The initial fitting procedure applies to fixed effects only, and later stages add random slopes. In the final model, stimulus type was added as a by-participant slope, and group and task were added as by-word slopes. For the interpretation of significant interactions, we conducted post-hoc estimated marginal means (EMM) tests using the R package emmeans [68]. The resulting model is shown in Table 4.
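As a sketch of the model structure just described (not the authors’ actual code), the R code below sets up the contrast coding, fits the final model with the reported random-effects structure, and runs the post-hoc group-by-task comparison. The data frame d and its column names are hypothetical, and all predictors are assumed to be factors.

```r
library(lme4)
library(emmeans)

# Contrast coding as described: Group sum-coded, Type treatment-coded
# (reference: first level, assumed to be 'nonword'), Task Helmert-coded
# (task 1 vs. tasks 2-3, then task 2 vs. task 3).
contrasts(d$Group) <- contr.sum(2)
contrasts(d$Type)  <- contr.treatment(2)
contrasts(d$Task)  <- cbind(t1_vs_23 = c(-2/3, 1/3, 1/3),
                            t2_vs_3  = c(0, -1/2, 1/2))

# Final model: by-participant slope for stimulus type, by-word slopes for
# group and task, fitted with the bobyqa optimizer.
final_model <- glmer(
  Accuracy ~ Group * Type * Task + Education +
    (1 + Type | PID) + (1 + Group + Task | Word),
  data = d, family = binomial,
  control = glmerControl(optimizer = "bobyqa")
)

# Post-hoc estimated marginal means: group comparison within each task
emmeans(final_model, pairwise ~ Group | Task)
```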

Table 4. Model results for Experiment 2 (learning accuracy).

https://doi.org/10.1371/journal.pone.0339325.t004

We did not observe a significant interaction between stimulus type and group. The model showed a significant main effect of group, indicating that non-New Zealanders’ performance was lower than the grand mean of the two groups. The effect of task for the first comparison, between the learning phase and the test phases, was also detected, indicating that participants gave lower-accuracy responses in the learning task, when they were first presented with words and images and had to learn a mapping between them by making guesses and incorporating feedback, than in the test tasks, when they already had (partial) knowledge of the mapping based on feedback from the learning task. There was also a significant interaction between group and task for the first comparison (task 1 vs. tasks 2 and 3), as suggested by Fig 2. An EMM test examining the effect of group within each task confirms that the two groups performed similarly in task 1 (task = 1: NZ - US; 0.049, z = 0.428, p = .669), but NMS learned meanings of words better in tasks 2 and 3 (task = 2: NZ - US; 0.464, z = 3.938, p < .001; task = 3: NZ - US; 0.314, z = 2.619, p < .01).

Fig 2. Interaction plot between group and task from the final model.

Estimated 95% confidence intervals on predictors. Bias correction was applied for this plot. The x-axis shows each task in the experiment. The y-axis shows the predicted proportion of accurate responses.

https://doi.org/10.1371/journal.pone.0339325.g002

Summary of Experiment 2

Contrary to our prediction, we did not observe an interaction between stimulus type and group. That is, although NMS generally learned new meanings more accurately than non-New Zealanders, this advantage was not focused on their learning of meanings for words that were assumed to already exist in their proto-lexicon. Rather, it appears that each group learned the meanings for nonwords as well as they did for words, and NMS were better than non-New Zealanders at learning the meanings for both. Consequently, our prediction that NMS would learn the meanings of real words more easily than those of nonwords was not supported by the experimental evidence.

It does, however, appear that NMS have an overall advantage in this task. Education level is controlled in the model, so group differences on this variable cannot explain why NMS did better. The most likely explanation is that all of the words are very Māori-like, and so their forms are statistically more expected for NZ-based participants than for non-NZ participants. We know that phonotactic knowledge can facilitate word learning [45], and we know that New Zealanders have extremely good phonotactic knowledge of Māori [16,17]. The results are consistent with the interpretation that this phonotactic knowledge facilitates word learning for Māori-like words.

Based on these results, we considered whether there was detectable variation in the stimuli that might cause NMS’ performance to be better than that of non-New Zealanders. We therefore examined phonotactic and neighborhood effects of the stimuli; however, exploratory analyses showed no group differences in task performance related to these variables (see Technical Supplement 3.7.1 in S1 File for details). This is likely due to the narrow range of Māori words and phonotactic scores in the stimuli.

So while there is an advantage for New Zealanders in this task, there is no evidence that this advantage is lexical. There are three potential interpretations of this lack of a lexical effect. One of them is that having a form in the proto-lexicon does not, in fact, confer an advantage for learning its meaning. This would need to be reconciled with the classroom results reported by [21]. A second possibility is that there is something about our task which obscures this advantage. For example, we are not teaching the real meanings of the words, and so it is possible that there is latent conflicting semantic knowledge which interferes with the ability to assign our pseudo-meaning to the words. Additionally, we did not very explicitly orient people to the idea that they were being presented with Māori words. The information sheet contained the phrase “we will be using an indigenous language of New Zealand, Māori, as the basis of this research” in the research description, but we are not confident that all participants read the full information sheet carefully. The separately presented instructions (which they are more likely to have read, and which were presented later), said simply “You are stranded on an island and you don’t know the language spoken on this island. Your task is to learn the names of objects in the language of the island.” It is thus very likely that they were not particularly oriented to the Māori language. The belief that they are learning some other language may thus suppress any lexical advantage that the Māori proto-lexicon might otherwise have conferred.

A final possibility is that, while past work has shown that New Zealanders have a Māori proto-lexicon, that lexicon does not happen to contain the specific words that are in our learning experiment. This is a possibility, given the fact that we explicitly selected words in our experiment which have low accuracy in a definition task.

In order to explore this final possibility, we returned to our real word stimuli and compared them with Panther et al.’s [17] identification task. Fig 3 shows the proportion of correct definitions for each word in our definition task, plotted against NMS participants’ mean wordhood confidence ratings (on a 1-to-5 scale) in Panther et al.’s [17] identification task. The higher the wordhood confidence rating, the more confident NMS in [17] were that the stimulus was a real word. We can see that there is a positive correlation: words that were well-defined in our experiment were also well-identified in the word identification task.

Fig 3. The relationship between proportion of correct definitions for each word and their mean wordhood confidence ratings from Panther et al.’s [17] identification task.

https://doi.org/10.1371/journal.pone.0339325.g003

The words that were ultimately selected as stimuli for Experiment 2 are highlighted in light blue. These words were accurately defined between 40% and 60% of the time in the definition task. While we selected words as stimuli for Experiment 2 from a narrow band of accuracy rates, these words spread across the identification ratings in Panther et al.’s [17] data. Our stimuli contain some words that were not defined with greater-than-chance accuracy and were rated lower than three. Although we assumed that all of our experimental words were in the proto-lexicon of NMS, this may not actually be the case.

In order to determine whether our experiment participants had a proto-lexicon and—particularly—whether they had these words in their proto-lexicon, we invited our Experiment 2 NMS participants back for a further experiment. We investigate the degree to which participants in Experiment 2 can accurately discriminate words from nonwords, using stimuli that include the words and nonwords used in that experiment.

Experiment 3: Proto-lexicon in NMS

In order to assess NMS’ proto-lexical knowledge, we conducted a two-task web-based experiment through a custom in-browser interface via Prolific (www.prolific.com) from March 1, 2023 to August 2, 2023. The identification task asks participants to distinguish real words from phonotactically matched nonwords. The wellformedness rating task is designed to assess NMS’ sensitivity to gradient phonotactics. The details of each task are given below.

Participants

Participants were 65 NMS who had taken part in Experiment 2, recruited via Prolific. Of those who completed the experiment, we excluded one participant whose responses showed little variation, leaving 64 NMS.

Materials

This experiment draws on a pre-existing set of stimulus materials used in [17,52], which consisted of Māori word and Māori-like nonword pairs, as described in the Materials section of Experiment 1.

For the wellformedness rating task, the pre-existing stimulus set includes 60 nonwords that were sampled from each of three phonotactic bins. We used precisely these stimuli, with no additions or removals. Stimulus characteristics were as follows: lengths (M = 4.67, SD = 0.99), phonotactic scores (M = −0.82, SD = 0.15), and neighborhood occupancy rates (M = 0.07, SD = 0.05).

For the word identification task, the pre-existing stimulus set includes 30 high-frequency and 30 mid-frequency words, each paired with a nonword of the same length and similar phonotactic score, for a total of 120 stimuli. To this set, we added the words used in Experiment 2 (if they were not already present), together with their phonotactically matched nonwords. This yielded a total of 182 stimuli (91 word-nonword pairs). Words and nonwords were matched in average length (M = 4.38, SD = 0.84), phonotactic score (M = −0.79; nonwords: SD = 0.09, words: SD = 0.08), and neighborhood occupancy rate (M = 0.10, SD = 0.06).
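A minimal sketch of how such stimulus matching could be checked descriptively (assuming a hypothetical data frame stimuli with one row per stimulus; the column names are illustrative assumptions):

```r
library(dplyr)

# Hypothetical data frame `stimuli`: one row per stimulus, with columns
# type ("word" or "nonword"), n_phonemes, phonotactic_score, neighborhood_rate
stimuli %>%
  group_by(type) %>%
  summarise(
    n       = n(),
    len_m   = mean(n_phonemes),        len_sd  = sd(n_phonemes),
    phon_m  = mean(phonotactic_score), phon_sd = sd(phonotactic_score),
    nbhd_m  = mean(neighborhood_rate), nbhd_sd = sd(neighborhood_rate)
  )
```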

For each task, each participant responded to all stimuli in random order. No stimuli were shared between the two tasks.

Procedure

Participants completed the wellformedness task first, followed by the identification task. Task instructions were presented prior to each task. In both tasks, participants responded to orthographic stimuli, presented one at a time. In the wellformedness task, on each trial participants were presented with a nonword and asked to rate how Māori-like it was on a scale from 1 (non-Māori-like nonword) to 5 (highly Māori-like nonword). After the participant chose a rating and clicked ‘Next’, the next stimulus was presented. No training was given, but the instructions included example ratings for a highly Māori-like nonword and a non-Māori-like nonword, neither of which was an actual stimulus item.

In the identification task, participants were presented with either a word or nonword from a set of real Māori words and phonotactically matched Māori nonwords. Participants were asked to rate how confident they were that the stimulus was a real word using a scale ranging from 1 (confident that it is NOT a Māori word) to 5 (confident that it IS a Māori word). Participants were instructed to give their best guess without relying on other tools (e.g., a dictionary) or help from others. After the participant chose a rating and clicked ‘Next’, the next stimulus was presented. The total experiment took less than 30 minutes.

Given the absence of time constraints in this task, we applied the same measure used in Experiment 1 to screen for potential use of external aids. Two participants exceeded the median reaction-time threshold. For these participants, we compared median reaction times between real-word and nonword trials: for one, the difference was negligible (0.023 seconds), and for the other it was modest (0.509 seconds). Given these minimal differences and the robustness of the median to outliers, there is no evidence of systematic aid usage.
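A minimal sketch of this screening step, assuming a hypothetical trial-level data frame trials with columns participant, type ("word"/"nonword"), and rt (in seconds), and a placeholder rt_threshold standing in for the Experiment 1 cutoff:

```r
library(dplyr)
library(tidyr)

# Overall median reaction time per participant, compared against the threshold
slow <- trials %>%
  group_by(participant) %>%
  summarise(median_rt = median(rt)) %>%
  filter(median_rt > rt_threshold)

# For flagged participants, compare median RTs on real-word vs. nonword trials;
# near-zero differences argue against systematic look-up of real words
trials %>%
  filter(participant %in% slow$participant) %>%
  group_by(participant, type) %>%
  summarise(median_rt = median(rt), .groups = "drop") %>%
  pivot_wider(names_from = type, values_from = median_rt) %>%
  mutate(diff = word - nonword)
```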

Results

Consistent with previous studies [17,21], we analysed the data with mixed-effects ordinal regression using the ordinal package [69] in R [67]. For both tasks, the dependent variables were rated on a Likert scale (in the wellformedness task: wellformedness judgment ratings, in the identification task: wordhood confidence ratings). Categorical predictors were treatment-coded, and numerical predictors were centered. Random intercepts for participant and word were included in all models, with by-participant random slopes for linguistic predictors and by-stimulus random slopes for nonlinguistic predictors.
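The general model setup can be sketched in R as follows, using the identification task as an example (the data frame and column names are hypothetical; the reported analyses and final model structures are given in S1 File):

```r
library(ordinal)  # clmm() for mixed-effects ordinal regression [69]

d <- identification_data  # hypothetical trial-level data frame

# Likert response as an ordered factor
d$rating <- factor(d$rating, levels = 1:5, ordered = TRUE)

# Treatment coding with the reference levels described in the text
d$type       <- relevel(factor(d$type), ref = "nonword")
d$frequency  <- relevel(factor(d$frequency), ref = "mid")
d$experiment <- relevel(factor(d$experiment), ref = "Exp1")

# Center numeric predictors
d$phonotactic_score <- d$phonotactic_score - mean(d$phonotactic_score)
d$n_phonemes        <- d$n_phonemes - mean(d$n_phonemes)
```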

For the wellformedness rating task, the fixed effects considered were phonotactic score as a test predictor and neighborhood occupancy rate as a control predictor. A high phonotactic score indicates that the stimulus contains local sound sequences found in many Māori words, and a high neighborhood occupancy rate indicates that it globally resembles many Māori words. The initial model included the interaction between the two predictors. After a stepwise model-fitting procedure (see the Technical Supplement in S1 File for details), the final model in Table 5 shows that ratings correlate significantly and positively with phonotactic score, indicating that participants have phonotactic intuitions consistent with previous studies. There was no effect of neighborhood occupancy rate, suggesting that stimuli with denser phonological neighborhoods do not receive higher wellformedness ratings.
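A sketch of the initial wellformedness model under these assumptions (variable names are hypothetical, and the random-effects structure shown is illustrative rather than the exact structure reported in S1 File):

```r
library(ordinal)

# Initial wellformedness model: phonotactic score x neighborhood occupancy rate,
# random intercepts for participant and stimulus, by-participant slopes for the
# linguistic predictors; the final model in Table 5 retains only phonotactic score
# after stepwise simplification (rating is an ordered factor, as above)
m_wf <- clmm(
  rating ~ phonotactic_score * neighborhood_rate +
    (1 + phonotactic_score + neighborhood_rate | participant) +
    (1 | stimulus),
  data = wellformedness_data
)
summary(m_wf)
```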

Table 5. Model results for Experiment 3 (wellformedness judgment ratings).

https://doi.org/10.1371/journal.pone.0339325.t005

For the identification task, stimuli were grouped into three categories based on our experiments: real words that were defined more accurately in Experiment 1 (n = 32 pairs) and were therefore not included in Experiment 2, words that were used in Experiment 2 (n = 48 pairs), and words that appeared only in Experiment 3 (n = 11 pairs). By comparing these categories, we sought to examine whether the real words used in each experiment showed signs of being present in participants’ proto-lexicons. The fixed effects considered were phonotactic score, word type (reference: nonword), frequency category (reference: mid), and experiment category (reference: Exp1) as test predictors, and number of phonemes (continuous) as a control predictor. The initial model included all predictors as well as two three-way interactions: one between phonotactic score, word type, and frequency category, as in previous analyses [17,52], and one between experiment category, word type, and phonotactic score, motivated by the results of Experiment 2. Results of the final model are given in Table 6, and Fig 4 shows the relationship between mean wordhood confidence rating and stimulus type (word vs. nonword) in the raw experimental results, for each category.
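A corresponding sketch of the initial identification model, again with hypothetical variable names and a simplified random-effects structure (the reported models are detailed in S1 File):

```r
library(ordinal)

# Initial identification model with the two three-way interactions described above
# and number of phonemes as a control predictor
m_id <- clmm(
  rating ~ phonotactic_score * type * frequency +
           phonotactic_score * type * experiment +
           n_phonemes +
    (1 + type | participant) + (1 | stimulus),
  data = identification_data
)
summary(m_id)
```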

Fig 4. NMS’s mean wordhood confidence ratings for each stimulus type across experiments in the raw experimental results.

https://doi.org/10.1371/journal.pone.0339325.g004

Table 6. Model results for Experiment 3 (wordhood confidence ratings).

https://doi.org/10.1371/journal.pone.0339325.t006

Across the three categories, real words received numerically higher ratings than nonwords. There is a significant effect of phonotactic score, which interacts marginally with word frequency. These results indicate that, in line with previous studies, participants gave higher ratings to stimuli with higher phonotactic scores, and that this effect is stronger for mid-frequency stimuli than for high-frequency stimuli. An EMM test confirms that the effect of phonotactic score is significant in each frequency category (mid: estimate = 6.821, z = 4.423, p < .0001; high: estimate = 3.321, z = 2.095, p = .036). A positive effect of number of phonemes was also detected, indicating that participants are sensitive to length differences among stimuli, with longer stimuli rated as more word-like. The results also show that participants gave significantly higher ratings to words than to nonwords, and that this effect interacts with experiment category. An EMM test confirms that the effect of word type is significant for stimuli used in Experiments 1 and 3, but not for stimuli used in Experiment 2 (Exp1: word − nonword = 1.087, z = 4.080, p < .001; Exp2: word − nonword = 0.303, z = 1.385, p = .166; Exp3: word − nonword = 2.307, z = 5.109, p < .001). These results show that, while participants have proto-lexical knowledge of Māori, they cannot distinguish the words used in Experiment 2 from phonotactically matched nonwords.
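The EMM comparisons reported above can be sketched with the emmeans package [68], assuming a fitted model object like the hypothetical m_id above:

```r
library(emmeans)

# Simple slopes of phonotactic score within each frequency category
test(emtrends(m_id, ~ frequency, var = "phonotactic_score"))

# Word vs. nonword contrasts within each experiment category
pairs(emmeans(m_id, ~ type | experiment))
```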

We also explored whether differences between participants or between words in this experiment could predict differences in a potential lexical effect in Experiment 2, but found no evidence to support this (see the Technical Supplement in S1 File, Section 4.5.6, for details). This is likely because variability across words and participants is not particularly high, and the total number of words investigated in Experiment 2 is small.

Summary of Experiment 3

It is clear that words that were used in Experiment 1 (the forced-choice definition task) but not in Experiment 2 (the word learning task) were adequately distinguished from phonotactically matched nonwords. Since these are the words for which participants were generally better able to discriminate correct from incorrect definitions, it is possible that some participants had explicit knowledge of these words and others had at least a vague sense of their meaning, which likely helped them to be recognized as real words. By contrast, the words from Experiment 1 that were also used in Experiment 2 were not well distinguished from phonotactically matched nonwords. By design, these are the words that participants were less able to associate with a correct definition. Participants therefore had little to no explicit knowledge of these words and only a limited sense of their meaning, and the results of Experiment 3 suggest that they may not even have had an implicit awareness that they are real words. Furthermore, these words were deliberately chosen from a narrow range of phonotactic scores, so participants could not use phonotactic wellformedness as a proxy for wordhood. As a result, these specific words may not be present in the proto-lexicon of a number of NMS participants, in which case they would not be expected to facilitate the learning of meaning. Together, the findings suggest that it is plausible that NMS were unable to learn the meanings of real words more accurately than those of nonwords in Experiment 2 because they lack implicit form-based (i.e., proto-lexical) knowledge of those words; that is, NMS did not distinguish between word types for the Experiment 2 stimuli.

Previous studies have shown that NMS have implicit proto-lexical knowledge of Māori words but minimal explicit semantic knowledge. However, our combined results suggest that their proto-lexical knowledge may be associated with vague semantic knowledge, which is different from explicit semantic knowledge. We discuss this idea in the next section.

Discussion

Across three experiments, we examined how a Māori proto-lexicon supports adults in learning word meanings. First, we used a forced-choice definition task to identify words for which participants had unclear semantic knowledge, to be tested further. According to our descriptive statistics, participants’ ability to define Māori words varied with the task format. Participants generally performed better on the forced-choice definition task, which likely engages both implicit and explicit knowledge, than on the free-response task, which relies solely on explicit recall. However, this difference may also reflect task demands: the free-response task applied strict scoring criteria that may have penalized partial semantic knowledge, while accuracy on the forced-choice task could have been aided by the similarity of foil definitions, allowing participants with partial knowledge to select the correct option. Therefore, failure to produce an accurate definition in a free-response task does not necessarily indicate an absence of semantic knowledge within the Māori proto-lexicon.

Next, we examined whether a proto-lexicon facilitates learning meanings by comparing NMS and non-New Zealanders. NMS were significantly better at learning Māori-like word-forms, likely due to their strong phonotactic knowledge. Although all stimuli were phonotactically controlled, the word shapes were generally familiar to NMS but not to non-New Zealanders. This effect likely reflects their implicit sensitivity to the phonotactic probabilities of Māori word-forms. While other group differences could contribute, education level and initial task accuracy were similar, suggesting that NMS’s advantage stems primarily from their familiarity with these phonotactic patterns.

Contrary to our prediction, NMS did not learn real words better than matched nonwords; their pattern mirrored that of non-New Zealanders. While previous work shows that the proto-lexicon acquired by adult NMS aids Māori word learning in educational settings [21], our results did not replicate this finding. We suggested three possible reasons: (1) the proto-lexicon may not facilitate learning in controlled experiments; (2) the task design may have obscured the effect, for example because participants were not explicitly told that the stimuli were Māori; or (3) the words used in the learning task were not represented in participants’ proto-lexicons. The second possibility was not directly tested in the current study, but it could be examined in future work by manipulating how explicitly participants are cued to treat the input as Māori.

To investigate the third possibility, we conducted the word identification task (Experiment 3) using the same participants. We found that while words that were used in Experiment 2 (i.e., the word learning task) were not distinguished from phonotactically matched nonwords, words that were defined more accurately in Experiment 1 were more confidently distinguished from matched nonwords. There is thus no evidence that our participants had proto-lexical knowledge of the words that were used in Experiment 2. Additionally, the experiment confirmed that participants have robust phonotactic knowledge, consistent with previous studies [16,17,21,52], supporting the view that this phonotactic knowledge underlies their overall advantage in the learning task.

We cannot rule out the possibility that, had we given participants instructions leading them to believe that they would be seeing Māori words, forms from their proto-lexicon would have been activated more readily and a lexical effect could have emerged. However, given the results of Experiment 3, the most likely explanation for the lack of a lexical effect in our word learning experiment is that we inadvertently chose items that are not robustly represented in the proto-lexicon. In Mattingley et al. [21], participants attached real definitions to words that they had been exposed to. If we had specifically selected items for which we have good evidence of a proto-lexical form, our results might have been different. However, this becomes methodologically complex, because one potential interpretation of our result is that words in the proto-lexicon have some implicit semantic information associated with them, and that by selecting for words with no semantics, we inadvertently selected words that are not in the proto-lexicon. If this is the case, our design would be problematic for testing whether these words give an advantage for learning meaning, as participants may struggle to associate our fake definitions with words for which they have some competing semantic knowledge.

In general terms, building a proto-lexicon is a process of word learning that precedes attaching meanings to word-forms [8]. The NZ Māori proto-lexicon studies cited in the background section propose that, while NMS have implicit form-based (proto-lexical) knowledge of many Māori words, they have explicit semantic (lexical) knowledge of very few words [19]. This rather simplified picture suggests that words fall into one of three categories: not known, proto-lexical (without any meaning), or fully known.

However, based on the results of our three experiments, the Māori proto-lexicon appears to be more gradient, consisting of multiple phases of knowledge. Some words in the proto-lexicon actually appear to be at the third stage of word knowledge development [55], in which individuals have vague knowledge of the word’s meaning, what Dale referred to as a “twilight zone” [55] (p. 898). On the other hand, some reasonably frequent words do not seem to be reliably in the proto-lexicon at all, placing them somewhere between the first stage (having never seen the word before) and the second stage (knowing there is such a word but not knowing what it means). Previous work has argued that NMS possess a proto-lexicon of more than a thousand Māori words or word-parts [16]. It appears that some, or possibly even all, of these words may actually have some latent semantic knowledge attached.

This interpretation aligns with emerging evidence that proto-lexical representations, although primarily form-based, can nonetheless engage in subtle and graded semantic interactions. For example, Hendrix and Sun [54] demonstrated that participants’ responses to nonwords in lexical decision tasks were influenced by the semantic properties of their closest real-word neighbors. This suggests that sublexical or distributional features can activate partial semantic associations, even without explicit word knowledge. Such findings challenge a strict dichotomy between form and meaning, supporting the idea that proto-lexical knowledge exists along a continuum from purely phonological representations to those with partial semantic content. This may also explain the result in Experiment 2, as nonwords closely resembling real words could activate proto-lexical and semantic networks to some degree, potentially obscuring clear lexical advantages during learning.

Implicit knowledge of Māori in NMS expands and is reinforced with ongoing exposure to Māori across the lifespan [52]. This raises questions about the exact nature of the proto-lexicon underpinning the results of past work on NMS. It also raises a number of questions about the role of these stored forms in generalization. What degree of ‘embeddedness’ is required, for example, before a word-form can start to facilitate explicit acquisition of its meaning? How can we design tasks that robustly distinguish these different degrees and types of knowledge? Do all stored word-forms feed phonotactic generalizations to the same degree, or do more robust and richer word-forms play a larger role? Future studies should explore such effects in learning the meanings of words.

Conclusion

This study focuses on the ‘proto-lexicon’, a set of stored word-forms, and on how the proto-lexicon assists the learning of word meanings in an experimental setting. The results of three experiments indicate that there are different levels of semantic knowledge for different words, even when we consider only words that cannot confidently be said to be in a full lexicon. These findings suggest that the claim of previous studies [16,17] that the proto-lexicon is ‘without semantics’ may be oversimplified.

Supporting information

S1 File. Technical Supplement: Exploring the role of meaning in non-Māori speakers’ ‘proto-lexicon’.

https://doi.org/10.1371/journal.pone.0339325.s001

(HTML)

Acknowledgments

We would like to thank Chun-Liang Chan and the Speech Communication Research Group at Northwestern University for the original development of the software underpinning our experiments. We acknowledge Robert Fromont for his technical support and help during the development of the online experiments, and audiences at NZILBB for their input at various stages of this study. We also thank all the participants in the study and our research assistant, Allie Osborne.

References

1. Hartshorne JK, Germine LT. When does cognitive functioning peak? The asynchronous rise and fall of different cognitive abilities across the life span. Psychol Sci. 2015;26(4):433–43. pmid:25770099
2. Keuleers E, Stevens M, Mandera P, Brysbaert M. Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment. Q J Exp Psychol (Hove). 2015;68(8):1665–92. pmid:25715025
3. Nippold MA. Lexical learning in school-age children, adolescents, and adults: a process where language and literacy converge. J Child Lang. 2002;29(2):474–8; discussion 489–94. pmid:12109383
4. Ramscar M, Hendrix P, Love B, Baayen HR. Learning is not decline: The mental lexicon as a window into cognition across the lifespan. The Mental Lexicon. 2013;8(3):450–81.
5. Ellis N. Consciousness in second language learning: psychological perspectives on the role of conscious processes in vocabulary acquisition. AILA Review. 1994;11:37–56.
6. Nation IP. Learning vocabulary in another language. Cambridge: Cambridge University Press; 2001.
7. Hallé PA, de Boysson-Bardies B. The format of representation of recognized words in infants’ early receptive lexicon. Infant Behavior and Development. 1996;19(4):463–81.
8. Johnson EK. Constructing a proto-lexicon: an integrative view of infant language development. Annu Rev Linguist. 2016;2(1):391–412.
9. Martin A, Peperkamp S, Dupoux E. Learning phonemes with a proto-lexicon. Cogn Sci. 2013;37(1):103–24. pmid:22985465
10. Ngon C, Martin A, Dupoux E, Cabrol D, Dutat M, Peperkamp S. (Non)words, (non)words, (non)words: evidence for a protolexicon during the first year of life. Dev Sci. 2013;16(1):24–34. pmid:23278924
11. Casenhiser DM. Children’s resistance to homonymy: an experimental study of pseudohomonyms. J Child Lang. 2005;32(2):319–43. pmid:16045253
12. Dautriche I, Fibla L, Fievet A-C, Christophe A. Learning homophones in context: Easy cases are favored in the lexicon of natural languages. Cogn Psychol. 2018;104:83–105. pmid:29778004
13. Storkel HL, Maekawa J, Aschenbrenner AJ. The effect of homonymy on learning correctly articulated versus misarticulated words. J Speech Lang Hear Res. 2013;56(2):694–707. pmid:23275395
14. Fang X, Perfetti C, Stafura J. Learning new meanings for known words: Biphasic effects of prior knowledge. Lang Cogn Neurosci. 2017;32(5):637–49. pmid:29399593
15. Maciejewski G, Rodd JM, Mon-Williams M, Klepousniotou E. The cost of learning new meanings for familiar words. Language, Cognition and Neuroscience. 2019;35(2):188–210.
16. Oh Y, Todd S, Beckner C, Hay J, King J, Needle J. Non-Māori-speaking New Zealanders have a Māori proto-lexicon. Sci Rep. 2020;10(1):22318. pmid:33339844
17. Panther FA, Mattingley W, Todd S, Hay J, King J. Proto-lexicon size and phonotactic knowledge are linked in non-Māori speaking New Zealand adults. Laboratory Phonology. 2023;14(1).
18. Todd S, Ben Youssef C, Vásquez-Aguilar A. Language structure, attitudes, and learning from ambient exposure: Lexical and phonotactic knowledge of Spanish among non-Spanish-speaking Californians and Texans. PLoS One. 2023;18(4):e0284919. pmid:37104290
19. Oh YM, Todd S, Beckner C, Hay J, King J. Assessing the size of non-Māori-speakers’ active Māori lexicon. PLoS One. 2023;18(8):e0289669. pmid:37611026
20. Panther F, Mattingley W, Hay J, Todd S, King J, Keegan PJ. Morphological segmentations of non-Māori speaking New Zealanders match proficient speakers. Bilingualism. 2023;27(1):1–15.
21. Mattingley W, Panther F, Todd S, King J, Hay J, Keegan PJ. Awakening the proto-lexicon: a proto-lexicon gives learning advantages for intentionally learning a language. Language Learning. 2024;74(3):744–76.
22. Statistics New Zealand. Census results reflect Aotearoa New Zealand’s diversity. 2024. https://www.stats.govt.nz/news/census-results-reflect-aotearoa-new-zealands-diversity/
23. Macalister J. English Aotearoa. 2004;52:69–73.
24. Macalister J. The Māori presence in the New Zealand English lexicon 1850–2000: evidence from a corpus-based study. English World-Wide. 2006;27(1):1–24.
25. Macalister J. Revisiting weka and waiata: familiarity with Māori words among older speakers of New Zealand English. New Zealand English Journal. 2007;21:41–50.
26. Harlow R. A Māori reference grammar. Auckland: Longman; 2001.
27. Harlow R. Māori: A linguistic introduction. New York: Cambridge University Press; 2007.
28. Mattys SL, Jusczyk PW. Phonotactic cues for segmentation of fluent speech by infants. Cognition. 2001;78(2):91–121. pmid:11074247
29. Zamuner TS. The emergence of lexical competition in 18-month-olds: Evidence from a visual world paradigm. Journal of Experimental Child Psychology. 2009;103(4):389–408.
30. Hochmann J-R, Endress AD, Mehler J. Word frequency as a cue for identifying function words in infancy. Cognition. 2010;115(3):444–57. pmid:20338552
31. Jusczyk PW, Luce PA, Charles-Luce J. Infants’ sensitivity to phonotactic patterns in the native language. Journal of Memory and Language. 1994;33(5):630–45.
32. Mattys SL, Jusczyk PW, Luce PA, Morgan JL. Phonotactic and prosodic effects on word segmentation in infants. Cogn Psychol. 1999;38(4):465–94. pmid:10334878
33. Mattys SL, Jusczyk PW. Do infants segment words or recurring contiguous patterns? Journal of Experimental Psychology: Human Perception and Performance. 2001;27(3):644.
34. Shi R, Cutler A, Werker J, Cruickshank M. Frequency and form as determinants of functor sensitivity in English-acquiring infants. J Acoust Soc Am. 2006;119(6):EL61–7. pmid:16838552
35. Saffran JR, Aslin RN, Newport EL. Statistical learning by 8-month-old infants. Science. 1996;274(5294):1926–8. pmid:8943209
36. Archer SL, Curtin S. Nine-month-olds use frequency of onset clusters to segment novel words. J Exp Child Psychol. 2016;148:131–41. pmid:27181298
37. Zamuner TS. Phonotactic probabilities at the onset of language development: speech production and word position. J Speech Lang Hear Res. 2009;52(1):49–60. pmid:18723600
38. Vihman MM, Nakai S, DePaolis RA, Hallé P. The role of accentual pattern in early lexical representation. Journal of Memory and Language. 2004;50(3):336–53.
39. Graf Estes K, Evans JL, Alibali MW, Saffran JR. Can infants map meaning to newly segmented words? Statistical segmentation and word learning. Psychol Sci. 2007;18(3):254–60. pmid:17444923
40. Hay JF, Pelucchi B, Graf Estes K, Saffran JR. Linking sounds to meanings: infant statistical learning in a natural language. Cogn Psychol. 2011;63(2):93–106. pmid:21762650
41. Saffran JR. Absolute pitch in infancy and adulthood: the role of tonal structure. Developmental Science. 2003;6(1):35–43.
42. Moreau CN, Joanisse MF, Mulgrew J, Batterink LJ. No statistical learning advantage in children over adults: Evidence from behaviour and neural entrainment. Dev Cogn Neurosci. 2022;57:101154. pmid:36155415
43. Saffran JR, Newport EL, Aslin RN, Tunick RA, Barrueco S. Incidental language learning: listening (and learning) out of the corner of your ear. Psychol Sci. 1997;8(2):101–5.
44. Thiessen ED, Girard S, Erickson LC. Statistical learning and the critical period: how a continuous learning mechanism can give rise to discontinuous learning. Wiley Interdiscip Rev Cogn Sci. 2016;7(4):276–88. pmid:27239798
45. Storkel HL, Roger MA. The effect of probabilistic phonotactics on lexical acquisition. Clinical Linguistics & Phonetics. 2000;14(6):407–25.
46. Frisch S. Emergent phonotactic generalizations in English and Arabic. In: Bybee J, Hopper P, editors. Frequency and the emergence of linguistic structure. Amsterdam: John Benjamins; 2001. p. 159–79.
47. Chomsky N, Halle M. The sound pattern of English. New York, NY: Harper & Row; 1968.
48. Hayes B. Phonological acquisition in optimality theory: the early stages. In: Constraints in phonological acquisition. Cambridge: Cambridge University Press; 2004. p. 158–203. https://doi.org/10.1017/cbo9780511486418.006
49. Berent I, Steriade D, Lennertz T, Vaknin V. What we know about what we have never heard: evidence from perceptual illusions. Cognition. 2007;104(3):591–630. pmid:16934244
50. Hay J, King J, Todd S, Panther F, Mattingley W, Oh YM, et al. Ko te mōhiotanga huna o te hunga kore kōrero i te reo Māori. Te Reo. 2022;65(1):42–59.
51. Varatharaj A, Todd S. More than just statistical recurrence: human and machine unsupervised learning of Māori word segmentation across morphological processes. In: Proceedings of the 21st SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology; 2024. p. 20–31.
52. Mattingley W, Hay J, Todd S, Panther F, King J, Keegan PJ. Ongoing exposure to an ambient language continues to build implicit knowledge across the lifespan. Linguistics Vanguard. 2024;10(1):345–55.
53. Cassani G, Chuang Y-Y, Baayen RH. On the semantics of nonwords and their lexical category. J Exp Psychol Learn Mem Cogn. 2020;46(4):621–37. pmid:31318232
54. Hendrix P, Sun CC. A word or two about nonwords: Frequency, semantic neighborhood density, and orthography-to-semantics consistency effects for nonwords in the lexical decision task. J Exp Psychol Learn Mem Cogn. 2021;47(1):157–83. pmid:31999159
55. Dale E. Vocabulary measurement: techniques and major findings. Elementary English. 1965;42(8):895–948.
56. King J, Maclagan M, Harlow R, Keegan P, Watson C. The MAONZE corpus: establishing a corpus of Māori speech. New Zealand Studies in Applied Linguistics. 2010;16(2):1–16.
57. Stolcke A. SRILM – an extensible language modeling toolkit. In: Proceedings of the Seventh International Conference on Spoken Language Processing (ICSLP 2002); 2002. p. 901–4.
58. Macmillan NA, Creelman CD. Detection theory: A user’s guide. 2nd ed. Mahwah, NJ: Lawrence Erlbaum Associates; 2005.
59. Nation ISP, Webb SA. Researching and analyzing vocabulary. Boston, MA: Heinle, Cengage Learning; 2011.
60. Davies C, Maclagan M. Māori words – read all about it: testing the presence of 13 Māori words in four New Zealand newspapers from 1997 to 2004. Te Reo. 2006;49:73–99.
61. Calude AS, Miller S, Pagel M. Modelling loanword success – a sociolinguistic quantitative study of Māori loanwords in New Zealand English. Corpus Linguistics and Linguistic Theory. 2020;16(1):29–66.
62. Szekely A, Jacobsen T, D’Amico S, Devescovi A, Andonova E, Herron D, et al. A new on-line resource for psycholinguistic studies. J Mem Lang. 2004;51(2):247–50. pmid:23002322
63. Mathôt S, March J. Conducting linguistic experiments online with OpenSesame and OSWeb. Language Learning. 2022;72(4):1017–48.
64. Mathôt S, Schreij D, Theeuwes J. OpenSesame: an open-source, graphical experiment builder for the social sciences. Behav Res Methods. 2012;44(2):314–24. pmid:22083660
65. LimeSurvey GmbH. LimeSurvey: an open source survey tool. Hamburg, Germany: LimeSurvey GmbH; 2012.
66. Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. J Stat Softw. 2015;67(1).
67. R Core Team. R: A language and environment for statistical computing. 2022.
68. Lenth R, Singmann H, Love J, Buerkner P, Herve M. emmeans: Estimated marginal means, aka least-squares means. 2022.
69. Christensen RHB. ordinal: Regression models for ordinal data [R package version 2022.11-16]. 2022.