Examining the potential influence of crosslinguistic lexical similarity on word-choice transfer in L2 English

We examined whether and how L1-L2 crosslinguistic formal lexical similarity influences L2 word choice. Our sample included two learner subcorpora, containing 8,500 and 6,390 English texts written in an educational setting by speakers of diverse L1s in the A1–B2 CEFR range of L2 proficiency. We quantified similarity based on phonological overlap between L1 words and their L2 (English) translations. This similarity relates to psycholinguistic cognancy, which occurs when words and their translations share a high level of formal similarity, often due to historical cognancy from shared etymology or language contact. We then used mixed-effects statistical models to examine how this similarity influences the rate of use of the L2 words; essentially, we checked whether L2 words that are more similar to their L1 translations are used more often. We also controlled for potential confounds, including the baseline L1 frequency of the English words. The type of crosslinguistic similarity that we examined did not influence learners' choice of L2 words in their writing in the present sample, which represents a type of educational setting that many learners encounter. This suggests that the influence of such similarity is constrained, and that communicative needs can override transfer from learners' L1 to their L2, which raises questions regarding when and how else situational factors can influence transfer.


TABLE OF CONTENTS
1 APPENDIX S1: LEXICAL DISTANCE

The term "lexical distance"
There is no universal distinction between the terms lexical distance and lexical similarity, which are often used interchangeably. 1 In the present study, we use lexical distance to refer to the formal distance between individual L1-L2 words that are translations of one another. This distance is based on objective phonological distance (specifically, normalized Levenshtein distance, or LDN), which serves as a proxy for the subjective similarity between the words that speakers are expected to perceive, as supported by the studies outlined in the next subsection (§1.2).
The reason we use the term lexical distance in particular is to distinguish it from other types of language distances, such as morphological distance, in line with prior studies (Bakker et al., 2009; Brown et al., 2008; Gooskens, 2006; Holman et al., 2008b; Schepens et al., 2016; Schepens, van der Slik, et al., 2013b). Here, it is worth noting that lexical distance can serve as a proxy of overall language distance, which is sometimes also referred to as linguistic distance or typological distance (e.g., Ecke, 2015; Llach, 2010), 2 but we do not use it in this sense here, since in our study we only consider the distance of individual L1-L2 word pairs directly, rather than the distance between languages as a whole. Nevertheless, note that we are using a specific type of lexical distance (phonological LDN) as a proxy for overall lexical distance, which can include other factors, such as orthography. 3 In addition, note that other less-common terms are sometimes used for lexical distances that are similar to the one that we use here, such as phonological overlap (Carrasco-Ortiz et al., 2021) or cognate linguistic distance (van der Slik, 2010). 4 Finally, it is important to emphasize that although we use the term "lexical distance" in the present study, our analyses focus on a specific facet of such distance: phonologically based formal distance. However, lexical distance may also encompass other aspects of crosslinguistic similarity, such as semantic and morphological similarity. Nevertheless, given that many studies found a cognate facilitation effect when focusing on formal similarity in a similar way, as shown in the next section, it is unlikely that focusing on such similarity would prevent us from finding a cognate facilitation effect in our own analyses.

1 Though increased distance denotes decreased similarity and vice versa, so lexical distance is technically more closely associated with lexical dissimilarity.

2 Though one issue with the term "typological distance" is that it is not always used to refer to overall language distance. Rather, it is sometimes used to refer to distance that is based on grammatical features, such as those that are available in the World Atlas of Language Structures (WALS), in order to draw a distinction between it and other types of distance, such as lexical distance that is based on Levenshtein distance in Swadesh lists (Bakker et al., 2009).

3 Though note that phonological and orthographic similarity tend to be highly correlated. For example, in a recent study on the English and French vocabulary of Dutch-speaking children, De Wilde et al. (2021), who also used normalized Levenshtein distance, included only phonological similarity in their analyses, and omitted orthographic similarity, since the two variables were highly correlated and could therefore lead to issues with collinearity. This is also something that these researchers did in another associated study (De Wilde et al., 2020), and is an issue that was raised by other researchers, such as Carrasco-Ortiz et al. (2021), who found a correlation of r = .782 between orthographic and phonological distance in their dataset of English and Spanish words. Here too, there were similarly strong correlations between phonological and orthographic distance in the parallel dictionaries, where all the L1s share English's Latin script, both in the case of LDN (r = .68, 95% CI = [.67, .70], p < .001) and in the case of LD (r = .73, 95% CI = [.71, .74], p < .001). We do not include orthographic distance in our analyses both because of its substantial overlap with phonological similarity in the case of L1s that share English's script, and because we wanted to use consistent analyses for all the L1s in the sample; orthographic distance is largely meaningless across languages that use different scripts (which includes several of the L1s in the Swadesh-based sample).
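The correlations in this appendix are reported with 95% confidence intervals (e.g., r = .68, 95% CI = [.67, .70] for LDN above). For reference, such an interval can be computed with the standard Fisher z-transformation; the sketch below is generic, and the sample size in the usage line is an arbitrary placeholder rather than the study's actual n:

```python
import math

def pearson_ci(r: float, n: int, z_crit: float = 1.959964) -> tuple:
    """95% CI for a Pearson correlation via the Fisher z-transformation."""
    z = math.atanh(r)                  # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)        # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the r scale

# n = 1000 is an arbitrary example value, not the study's sample size.
lo, hi = pearson_ci(r=0.68, n=1000)
print(round(lo, 2), round(hi, 2))
```

Note that the interval narrows as n grows, which is why the large parallel-dictionary samples yield tight intervals around the reported correlations.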

Validation of Levenshtein distance
Here, we outline the extensive use and validation that Levenshtein distance (LD) and its normalized form (LDN) have received in prior research across various fields, including typology, psycholinguistics, and SLA. 5 First, we open with a notable study by Schepens et al. (2012), which is often cited by other researchers in this context (e.g., Blom et al., 2020; Cenoz et al., 2021; Cop et al., 2017; De Wilde et al., 2020; Otwinowska & Szewczyk, 2019; Silveira & Leussen, 2015). Specifically, in their study, Schepens et al. (2012) conclude the following:

It is possible to automatically identify large distributions of cognates with respect to form-similarity in various European languages by means of a formalized form-similarity metric such as normalized Levenshtein distance. Applying this metric to a professional translation database, similarity norms were obtained that are comparable to experimentally acquired orthographic similarity ratings (Dijkstra et al., 2010; Tokowicz et al., 2002), and lead to high correlations (around .90) and a large proportion of correctly classified stimuli (over 90%). The obtained distributions were also compared to an account of cross-language similarity based on Gray and Atkinson (r = .72). A common pattern in the degree of orthographic similarity of these distributions was observed within languages of the same family. In our analysis, English showed characteristics of multiple language families (Germanic, Romance). Cognate distributions were computed here using semi-complete lexicons, whereas Gray and Atkinson used only a small set of high frequency words.

In all, our study demonstrated the feasibility and advantages of applying techniques from artificial intelligence to psycholinguistic and linguistic research involving multiple languages. First, the application of the normalized Levenshtein distance function resulted in an automatized selection of more and better stimulus materials for cognate studies on bilingual word processing. Second, the Levenshtein distance function yielded accurate and detailed cross-language similarity distributions for multiple languages, thus allowing a comparison to language family trees. As such, the present study has shown that the Levenshtein distance function can compete with existing similarity measures (such as those proposed by Coltheart, Davelaar, Jonasson & Besner, 1977, and Van Orden, 1987) and can be considered as a new formal and computational model of orthographic similarity, useful for future empirical studies in monolingual and bilingual domains as diverse as those dealing with neighborhood effects, spelling systems, and dyslexia. (p. 165)

In addition, further support for LD(N) as a measure of lexical distance comes from many other studies.

4 Though the term "cognate lexical distance" is not appropriate to use here, since it refers to the overall distance between languages as calculated based on the proportion of cognates, rather than to distances that are calculated between individual word pairs.

5 Note that LDN is sometimes also referred to in the literature using similar terms, especially NLD and nLD.
First, there is substantial support for this measure based on its extensive use in studies pertaining to language classification. For example, in a study that examined lexical distance between 35 Indo-European L1s and Dutch, Schepens et al. (2013b) found a very high correlation (r = .90) between this measure as determined based on the ASJP's Swadesh lists, and distances that are based on shared cognates as determined by Gray and Atkinson (2003) on historical-comparative grounds. Furthermore, Schepens et al. (2013a) found that this measure correlates strongly with crosslinguistic morphological similarity (r = -.65), as determined based on morphological features in the World Atlas of Language Structures. In addition, based on comparisons with other data sources, such as established dialect boundaries, using LD between phonetic strings has been shown to be effective for assessing dialects, for example when it comes to Gaelic (Kessler, 1995) and Dutch (Gooskens & Heeringa, 2004;Nerbonne & Heeringa, 2001). Finally, other studies have found that this measure leads to accurate language classification as determined based on measures such as expert classification, when it comes to many other languages and dialects (Schepens et al., 2012;Serva & Petroni, 2008;Wichmann et al., 2010).
There is also substantial support for LD(N) based on the high correlation between it and various psycholinguistic measures (Heeringa & Prokić, 2018). 6 For example, Beijering et al. (2008) found a strong correlation between LDN-based distances and intelligibility scores (r = -.86) and perceived linguistic distances (r = .52) in their study of Standard Danish and 17 other Scandinavian language varieties. 7 Similarly, Gooskens (2006) found a correlation of r = -.82 between phonetic LD and intelligibility scores among students from schools in Denmark, Norway, Sweden, and Finland. Furthermore, Gooskens and Heeringa (2004), who examined 15 Norwegian dialects as judged by Norwegian listeners, found a strong correlation between LD and perceptual distance (r = .62 in an experiment where monotonized recordings were used, and r = .67 in an experiment where non-manipulated recordings were used), leading the researchers to conclude that:

This shows that dialect distances calculated with Levenshtein distance approximate perceptual distances rather well. We see this as a confirmation of the usefulness of the Levenshtein method, as has been shown before for Dutch dialects. Now we know that the method is also applicable in a language area with a less simple geographic situation than the Dutch one. (p. 205)

Furthermore, this measure has also been extensively used and validated in the context of second language acquisition (SLA) research, which involved analyses similar to those in the present study. This includes the following:
− Otwinowska et al. (2020) used LDN to quantify L1-L2 orthographic similarity between words, in their study on the influence of cross-linguistic lexical similarity on the learning of cognates and non-cognates among Polish learners of English. Specifically, they used this measure to show that the cognates and false cognates that they examined were comparable in terms of their L1-L2 orthographic similarity, and this measure has been used in similar ways in associated studies (e.g., Marecka et al., 2021; Otwinowska & Szewczyk, 2019). 8
− Many studies used this measure to assess cognancy. This includes using LD to determine cognancy based on phonological (Sadat et al., 2016) or orthographic transcriptions (Bultena et al., 2020; Y. Zhu & Mok, 2020), using LD to compare cognates and non-cognates based on both phonological and orthographic transcriptions (Carrasco-Ortiz et al., 2021), using LDN to determine cognancy based on orthographic transcriptions (Casaponsa et al., 2015), and using LDN to determine cognancy based on both phonological and orthographic transcriptions (De Wilde et al., 2020).

6 This is important, since LD/LDN are objective measures of language distance, which often serve (including in the present research) as proxies for the subjective language distance that learners perceive (i.e., their psychotypology), which is the main driver behind the crosslinguistic influence that they experience (Jarvis & Pavlenko, 2008; Kellerman, 1983; Ringbom, 2007; Xia, 2017).

7 They also found similar correlations when it comes to non-normalized LD (r = -.79 for intelligibility and r = .62 for perceived distance).
− In addition, LD/LDN were also used in other studies to assess crosslinguistic similarity of words and its influence on L2 acquisition (De Wilde et al., 2022;van de Ven et al., 2019), to quantify crosslinguistic orthographic overlap of non-identical cognates (Vanlangendonck et al., 2020), and to serve various similar purposes (Cenoz et al., 2021), as have other closely related measures of lexical distance (Dijkstra et al., 2010;Schepens, van der Slik, et al., 2013a).
Finally, in the case of the present study, the classification of L1s based on their lexical distance from English aligns with what we expect based on general language classification.
Specifically, based on the distances per L1, which are shown in Table 1, the Germanic and Romance L1s are the lexically closest to English, and all the Indo-European L1s are closer to English than all the non-Indo-European L1s (Eberhard et al., 2021).

Note. These values are calculated using English-based tables, where distances are calculated from each English word in the dataset to its closest L1 synonym. It is also possible to calculate these distances using L1-based tables, where distances are calculated from each L1 word to its closest English synonym. However, the distances are quite similar regardless of which option is used (Spearman's ρ = 0.97, p < .001); the key differences are that when L1-based tables are used, the Spanish-English distance increases to make it more distant than French, and the Russian-English distance increases to make it more distant than Portuguese. a Language classifications are based on Eberhard et al. (2021).
The fact that the Indo-European L1s were found to be lexically closer to English also aligns with our expectations based on the measure of linguistic distance proposed by Chiswick and Miller (2005). Specifically, this measure is based on the difficulty that English speakers have acquiring other languages, and has been shown by Chiswick and Miller to predict the difficulty that speakers of those languages will have when acquiring English as an L2. Similarly to our measure of distance, their measure also suggests that all the Indo-European L1s that are included here are closer to English than the non-Indo-European L1s. 9

9 Their measure ranks languages on a scale of 1-3, where 1 marks the hardest languages to learn (i.e., the most distant) and 3 marks the easiest languages to learn (i.e., the least distant). Out of the L1s included in the present sample, French, Italian, and Portuguese have a ranking of 2.5; German, Spanish, and Russian have a ranking of 2.25; Arabic and Mandarin have a ranking of 1.5; and Japanese has a ranking of 1. This roughly corresponds to the ranking found here, whereby all the Indo-European L1s are closer to English than the non-Indo-European L1s.
The imperfect correlation between their measure and ours is expected, since, as they note, their measure includes various aspects of the language beyond vocabulary, such as syntax.
Furthermore, in this regard, the use of our measure of lexical distance is further supported by Schepens et al. (2013a), who calculated lexical distance between 49 L1s and Dutch in a similar manner as us, and found that increased distance is strongly correlated (r = -.80) with broad L2 proficiency in Dutch. 10 This suggests that distances that are based on this measure strongly predict L2 learnability, in a similar manner as proposed by Chiswick and Miller. In summary, there is extensive support for our use of LDN as a measure of lexical distance here, including in terms of construct validity. This includes:
− Many studies that validated it by comparing it to other measures of language classification, such as expert cognancy judgments (Brown et al., 2008; Gooskens & Heeringa, 2004; Holman et al., 2008b; Kessler, 1995; Nerbonne & Heeringa, 2001; Schepens et al., 2012; Schepens, van der Slik, et al., 2013a, 2013b; Serva & Petroni, 2008; Wichmann et al., 2010).
− The alignment of the overall crosslinguistic lexical distances in our samples with what is expected based on general language classification.
That said, this measure, like all linguistic measures, is imperfect, and we recommend that future work replicate our analyses using other distance measures, 11 as we do ourselves using feature edit distance. Furthermore, it is important to remember that the validation of this measure is itself imperfect, in the sense that the studies that validated it likely had their own limitations and shortcomings, and their methodologies and goals do not always align with our own. Nevertheless, given all the support for this measure outlined above, we believe that its use here is reasonable, and that the outcomes based on it are reasonably reliable and generalizable.

10 Schepens et al. base this on distances as calculated using Swadesh lists in the ASJP, similarly to us, though they use LDND rather than LDN; this is a closely associated variant of Levenshtein distance, which is discussed in detail in the next sub-section.

11 For more information on the issues with this measure, see the "Limitations of LDN" sub-section in the paper's methodology. Also, additional criticism of this measure, primarily in the context of language classification, can be found in Greenhill (2011).

Limitations of LDN
LDN is limited in several key ways.
First, it treats all character transformations as equal. For example, this means that the English word "fish" /fɪʃ/ has an equal and maximal LDN of 1 from both the corresponding Spanish word ("pez" /pes/) and the Hebrew one ("דג" /dag/), even though the English word is closer phonologically and etymologically to the Spanish word than to the Hebrew one, and could be considered a cognate of the first but not the second.
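The example above can be verified with a few lines of code. The following is a generic sketch of LD and LDN over IPA strings, not the study's actual implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (all edits cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ldn(a: str, b: str) -> float:
    """Normalized Levenshtein distance: LD divided by the longer string's length."""
    return levenshtein(a, b) / max(len(a), len(b))

# "fish" is maximally distant from both translations under LDN,
# even though it is phonologically closer to the Spanish word.
print(ldn("fɪʃ", "pes"))  # → 1.0
print(ldn("fɪʃ", "dag"))  # → 1.0
```

Because no segment of /fɪʃ/ aligns with any segment of /pes/ or /dag/, both pairs require three substitutions over three segments, which is exactly the blindness to graded phonological similarity described above.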
To partially address this issue, we replicated our analyses using feature edit distance (or phonological edit distance), and the results of these models replicated our results when using LDN as the measure of distance, as shown and explained in detail in Appendix S2.
Briefly, this distance, which has less validation and standardization than LDN, attempts to account for the phonological similarity across segmental units, by assigning different costs to the transformation of different units, based on their phonological features. For example, in the case of "fish" considered above, substituting /ʃ/ with /z/ generally incurs a lower cost than substituting /ʃ/ with /g/, since /ʃ/ and /z/ share more phonological features (e.g., being coronal), so they are more similar to each other from a phonological perspective.
Another limitation of our use of LDN as a measure of lexical distance is that it only looks at one aspect of formal similarity across words (phonological overlap). However, other factors, including both formal ones, such as orthographic depth, and non-formal ones, such as semantic and pragmatic similarities, may also affect crosslinguistic influence. For example, it may be that there is an interaction between orthographic depth and the effects of phonological distance, or that the use of a different script across L1s from different language families moderates the effects of phonological similarity.
Nevertheless, past studies (e.g., Sadat et al., 2016) found a facilitative effect of formal similarity even without considering such factors, as did Rabinovich et al. (2018), who did not investigate the influence of these factors. Furthermore, we addressed this limitation in our research in two ways. First, as shown in the paper's "Data analysis" section, we used mixed-effects models to control for some of these potential effects through random effects for word and L1. Second, we replicated our analyses on a sub-sample containing only German speakers (see "German-only models" in Appendix S5), which minimizes some of these issues (e.g., variation in the effects of similarity across language families), and found that these analyses replicated our key findings. However, it will still be beneficial for some future analyses to assess the role of these factors directly.
Finally, note that LDN does not assess cognancy directly, which we use in the psycholinguistic sense of words that have similar meaning and pronunciation/spelling across languages. Rather, it only quantifies the formal similarity between words that are generally similar in terms of meaning. Most notably, this means that there are cases where a large distance does not indicate lack of cognancy, as in the "fish" example above. Nevertheless, as noted in the previous sub-section ("Validation of Levenshtein distance"), LDN is strongly correlated with cognancy (e.g., Schepens et al., 2012), and has been used to estimate cognancy directly in SLA studies that then used it to successfully predict L2 outcomes (Carrasco-Ortiz et al., 2021; Sadat et al., 2016), so we expect it to be a reasonable approximation in the context of the present large-scale analyses. 12

It is important to keep these limitations in mind when interpreting the findings of the study. Nevertheless, as noted in the previous sub-section ("Validation of Levenshtein distance"), this distance has been extensively validated through research in various fields, such as SLA, psycholinguistics, and language typology. This validation includes, most notably, strong correlations with other measures of distance, such as expert cognancy judgments and perceived language distance (Beijering et al., 2008; Schepens et al., 2012), and the use of this measure in SLA to successfully predict many L2 outcomes at the word level, including in the context of the cognate facilitation effect, such as word recognition and retrieval, in a similar manner as in the present study (Carrasco-Ortiz et al., 2021; Sadat et al., 2016). As such, we believe that the use of LDN is reasonable in the present study. Most importantly, even if it is unable to perfectly capture all of the effects of crosslinguistic similarity, it should be able to successfully capture some of them, as it did in many past SLA studies.

LDN vs. LDND
As noted in the body of the paper, LDN is the normalized version of LD, which accounts for variations in word length by dividing the LD between a pair of words by the length of the longer word.
LDN can be further normalized into LDND, by dividing it by the mean LDN of all N(N-1)/2 pairings of words with different meanings, to control for shared phonotactic preferences or overlap in phoneme inventories (Bakker et al., 2009, p. 171). However, while the first normalization of LD is usually seen as crucial, the second normalization is controversial and rare (Petroni & Serva, 2010; Wichmann et al., 2010), and none of the SLA or psycholinguistic studies outlined in the previous sub-section (§1.2) used it. Furthermore, the use of LDND can lead to two notable issues. First, unlike LDN, it is not sample-independent: the LDND between two words varies based on which other words from the same languages are included in the analysis. Second, it minimizes similarity due to shared phonotactic preferences or overlap in phoneme inventories, which should be taken into account when assessing lexical distance in the present context, since similarity driven by these causes can influence the perceived similarity of words across languages.
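The two-step normalization can be sketched as follows. The word lists are hypothetical placeholders, and the code follows one straightforward reading of the LDND definition above (not the ASJP's exact implementation):

```python
from itertools import product

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (all edits cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def ldn(a: str, b: str) -> float:
    return levenshtein(a, b) / max(len(a), len(b))

def ldnd(words_l1: list, words_l2: list) -> list:
    """Divide each translation pair's LDN by the mean LDN of all
    cross-meaning pairs (words at different positions in the aligned lists)."""
    cross = [ldn(w1, w2)
             for (i, w1), (j, w2) in product(enumerate(words_l1),
                                             enumerate(words_l2))
             if i != j]
    baseline = sum(cross) / len(cross)
    return [ldn(w1, w2) / baseline for w1, w2 in zip(words_l1, words_l2)]
```

Note how the result for any single pair depends on the baseline computed over the whole sample, which is exactly the sample-dependence issue raised above.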
As such, in the present study we use LDN, rather than LDND. Nevertheless, these two measures are generally strongly correlated (Holman et al., 2008a;Pompei et al., 2011;Wichmann et al., 2010), so the impact of using one over the other is likely minor.

Rationale for feature edit distance
As noted in our discussion of Levenshtein distance in the paper, a notable issue with this measure is that it treats all character transformations as equal, even though this does not accurately represent differences in distances as perceived by learners. For example, this means that the English word "fish" /fɪʃ/ has an equal and maximal LDN of 1 from both the corresponding Spanish word ("pez" /pes/) and the Hebrew one ("דג" /dag/), even though the English word is closer phonologically and etymologically to the Spanish word than to the Hebrew one, and could be considered a cognate of the first but not the second.
A potential way to mitigate this issue is to assign different weights to different character transformations, based on the phonological features of the associated segmental units. The resulting measure, which can be viewed as a modified form of Levenshtein distance, is referred to as phonological edit distance, feature edit distance, or feature distance (FD) (Allen & Becker, 2015; Eden, 2018; Fontan et al., 2016; Hall et al., 2017; Kondrak, 2000; Manurung et al., 2008; McCoy & Frank, 2018; Mortensen et al., 2016; Sanders & Chin, 2009; Schepens, Dijkstra, et al., 2013; L. Zhang, 2018). For example, when using FD, substituting /ʃ/ with /z/ would generally incur a lower penalty than substituting it with /g/, since /ʃ/ and /z/ share the same value on more phonological features, such as being coronal, so they can be considered more similar to each other from a phonological perspective.
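As an illustration of this weighting idea, the toy sketch below compares segments on a hypothetical three-feature inventory; the feature values are illustrative only, not an actual feature chart:

```python
# Toy articulatory-feature vectors: (coronal, voiced, continuant).
# Illustrative values only; a real chart (e.g., PanPhon's) has many more features.
FEATURES = {
    "ʃ": (+1, -1, +1),
    "z": (+1, +1, +1),
    "g": (-1, +1, -1),
}

def substitution_cost(seg_a: str, seg_b: str) -> float:
    """Fraction of features on which the two segments disagree."""
    fa, fb = FEATURES[seg_a], FEATURES[seg_b]
    return sum(a != b for a, b in zip(fa, fb)) / len(fa)

print(substitution_cost("ʃ", "z"))  # differs only in voicing → ~0.333
print(substitution_cost("ʃ", "g"))  # differs in all three features → 1.0
```

Under this scheme, the /ʃ/→/z/ substitution is penalized less than /ʃ/→/g/, matching the intuition described above, whereas plain LD would charge both the same cost of 1.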

Limitations of feature edit distance
Though FD might be able to capture phonological similarity more accurately than LD, we decided to use LD(N) as the key measure of similarity in our study, for two main reasons.
First, while there is extensive validation for the use of LD based on research in several fields (as shown in "Validation of Levenshtein distance" in Appendix S1), there is little validation of FD in similar contexts. As such, while LD might potentially be less linguistically motivated than FD, we do know based on prior research that it is able to predict linguistic outcomes fairly well, including when used to predict the influence of crosslinguistic similarity on L2 outcomes, whereas we do not yet know the same for FD. In fact, the limited research that did investigate the use of FD and similar measures did not find that they are necessarily better predictors of linguistic outcomes than simple Levenshtein distance (Wieling et al., 2007). For example, as Wieling et al. (2007, p. 93) state:

It was found that generally speaking the binary versions approximate perceptual distances better than the feature-based and acoustic-based versions. The fact that segments differ appears to be more important in the perception of speakers than the degree to which segments differ. Therefore we will use the binary version of Levenshtein distance in this article…

Second, the simplicity of LD (compared to FD) presents advantages for the replication of analyses, the generalizability of findings, the comparison of findings across studies, and the minimization of researcher degrees of freedom. Specifically, while LD is generally implemented in a consistent manner across the various software packages that offer it, which means that calculating LD using different packages/software will lead to the same results, this is not the case for FD, which depends heavily on factors such as:
− Which phonological features are taken into account (Gooskens & Heeringa, 2004; Nerbonne & Heeringa, 1997). 13
− What weights should be assigned to differences in feature values, and how substitutions should be weighted compared to insertions/deletions.
− Whether different weights should be assigned to different features, and if so, then what weights. This is compounded by the fact that different features could potentially be weighted differently for different populations (e.g., speakers of different L1s, who perceive the different features differently) and in different contexts (e.g., when it comes to assessing perceived distance vs. intelligibility).
− How this distance should be normalized. 14

Furthermore, in this regard, there is also the question of whether to use FD in particular, or a similar measure that attempts to capture crosslinguistic similarities, such as pointwise mutual information (PMI) or naive discriminative learning (NDL).
In summary, although FD might be more linguistically motivated than LD, it is not clear that FD is a better predictor of linguistic outcomes. Furthermore, much methodological work needs to be done on FD to validate and standardize its use before it can be used with confidence by researchers.

Our technical approach
We built models that use FD as a predictor, to supplement our main models (which use LD).
However, these models should be interpreted with caution, given the limitations of FD that we discussed above.
To calculate FD for our models, we used PanPhon, a Python package that relates IPA segments, both simple (e.g., /t/) and complex (e.g., /t͡sː/), to their definitions in terms of articulatory features. 15 These features include, among others:
− ant [±anterior]: Is a constriction made in the front of the vocal tract?
− cor [±coronal]: Is the tip or blade of the tongue used to make a constriction?
− lab [±labial]: Does the segment involve constrictions with or of the lips?
− hi [±high]: Is the segment produced with the tongue body raised?
− lo [±low]: Is the segment produced with the tongue body lowered?
− back [±back]: Is the segment produced with the tongue body in a posterior position?
− round [±round]: Is the segment produced with the lips rounded?
− tense [±tense]: Is the segment produced with an advanced tongue root?
The feature values are taken directly from the Nov 11, 2019 release of PanPhon. Some feature names are trimmed here due to space constraints.
Specifically, we used the partial_hamming_feature_edit_distance function, 16 which calculates FD in the following manner:
− An edit that involves an insertion or a deletion incurs a cost of 1.
− An edit that involves going from a certain feature value to an opposite feature value incurs a cost of 1/22 for that particular feature edit.
− An edit that involves going from a certain feature value to an identical feature value incurs no cost. For example, if a segment that is [+back] is substituted with a segment that is also [+back], no cost is incurred for that particular feature edit.
The resulting FD was normalized into FDN by dividing it by the length of the longer string in the pair, based on the number of segmental units (e.g., /t͡sː/), since FD focuses on segmental units rather than characters.
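The cost scheme above can be sketched as a small weighted edit distance. This is an illustrative reimplementation, not PanPhon's code: the feature vectors are a toy three-feature inventory, and the per-feature substitution cost is 1/N_FEATURES rather than PanPhon's 1/22:

```python
from functools import lru_cache

# Toy feature vectors: (coronal, voiced, continuant). Illustrative only.
FEATURES = {
    "f": (-1, -1, +1), "ɪ": (-1, +1, +1), "ʃ": (+1, -1, +1),
    "p": (-1, -1, -1), "e": (-1, +1, +1), "s": (+1, -1, +1),
}
N_FEATURES = 3  # PanPhon uses a much larger feature inventory

def sub_cost(a: str, b: str) -> float:
    """Per-feature cost for differing values; identical values cost nothing."""
    return sum(x != y for x, y in zip(FEATURES[a], FEATURES[b])) / N_FEATURES

def fd(src: str, tgt: str) -> float:
    """Weighted edit distance: indels cost 1, substitutions cost sub_cost."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return float(j)
        if j == 0:
            return float(i)
        return min(d(i - 1, j) + 1,                       # deletion
                   d(i, j - 1) + 1,                       # insertion
                   d(i - 1, j - 1) + sub_cost(src[i - 1], tgt[j - 1]))
    return d(len(src), len(tgt))

def fdn(src: str, tgt: str) -> float:
    """FD normalized by the number of segments in the longer word."""
    return fd(src, tgt) / max(len(src), len(tgt))

# Under this FDN, "fish" /fɪʃ/ comes out much closer to Spanish /pes/
# than the uniform maximal distance that LDN assigns.
print(round(fdn("fɪʃ", "pes"), 3))  # → 0.111
```

The key design point is that the substitution cost is graded by feature overlap, so pairs like /fɪʃ/–/pes/ are no longer maximally distant the way they are under plain LDN.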
Note that whereas LD is standardized, FD is not, as mentioned in the previous section.
As such, the FD that we calculated here should be viewed as only one type of FD, and other types of FD are calculated differently and may lead to different outcomes.

Descriptive statistics for FDN values
There was a moderate-to-strong correlation between FDN and LDN in both the Swadesh lists (r = .40, 95% CI = [.28, .50], p < .001) and the parallel dictionaries (r = .47, 95% CI = [.45, .49], p < .001). 17 This suggests that although these two measures have a strong association, as can be expected, they capture substantially different aspects of crosslinguistic distance, and the use of one rather than the other might influence the results of analyses, at least to some degree.
16 Alternative functions are available for this purpose in PanPhon. We selected this function because it offered a balance between the two other main functions: feature_edit_distance, where insertions/deletions are treated the same as substitutions, and so generally incur a cost <1 (due to the presence of unspecified features), and hamming_feature_edit_distance, where transformations from specified feature values to unspecified ones (and vice versa) incur a cost of 1/22, similarly to transformations to opposite feature values. It is not clear that the specific distance that we used is the best one (i.e., the one that best predicts the perceived similarity between words), which highlights the need for validation and standardization of this measure. Nevertheless, this is not crucial for the present research, as the differences between the distances that these measures lead to are small enough that they do not influence our findings.
17 A similar correlation is found when the full available samples of the Swadesh lists and parallel dictionaries are used (as opposed to the samples that were trimmed for the present study, primarily in terms of focusing only on single-word entries); specifically, this correlation is r = .39 (95% CI = [.30, .48], p < .001) in the Swadesh lists.
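Correlation coefficients and CIs of the kind reported here can be obtained with the standard Fisher z-transformation. A minimal sketch follows (illustrative only; our actual analyses were run in R, and the function name is ours):

```python
import math

def pearson_r_with_ci(xs, ys, z_crit=1.959964):
    """Pearson's r with an approximate CI via the Fisher z-transform.

    z_crit is the normal quantile for the desired confidence level
    (about 1.96 for a 95% CI).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    z = math.atanh(r)              # Fisher transform of r
    se = 1 / math.sqrt(n - 3)      # approximate SE of z
    lo = math.tanh(z - z_crit * se)
    hi = math.tanh(z + z_crit * se)
    return r, (lo, hi)
```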
Figure 1 and Table 3 present these distances. 18
18 Although we do not expect it to change the null findings in the present study, both because past studies found an effect of crosslinguistic similarity while using LD(N), and because the correlation between LDN and FDN means that we would expect to find at least some effect of similarity in our sample, which is not the case.
Figure 1. Lexical distance between L1 words and English, per L1 in each dataset. The distance is equal to the phonological FDN between L1 words and their most lexically similar English counterpart. Within the boxplots, the line inside the box indicates the median, the lower and upper hinges indicate the 1st and 3rd quartiles, the whiskers indicate 1.5 interquartile ranges (IQR) past the hinges, and the dots indicate outliers beyond that. The violin plots indicate an estimate of the probability density of lexical distance for each L1, which can be viewed as the likelihood that a word in each L1 will have a certain lexical distance, where increased width indicates greater likelihood. Data is based on 25 words per L1 in the Swadesh lists and 1,103 words per L1 in the parallel dictionaries (i.e., after the removal of multi-word entries).
The distance here is the phonological FDN from the closest synonym, calculated for the single-word entries in each dataset. There were 225 entries in the Swadesh lists (i.e., rows with an English word and all its corresponding counterparts in a certain L1), with 25 entries for each of the 9 L1s in the dataset. There were 5,515 entries in the parallel dictionaries, with 1,103 for each of the 5 L1s. All counts are after the removal of multi-word entries.
Several key observations can be made about these distances.
First, FDN is much more evenly distributed within each L1 than LDN, primarily due to the absence of the ceiling effect present in LDN (i.e., the tendency of words to have the maximal possible LDN of 1). This can likely facilitate analyses using this distance, but it does not necessarily mean that FDN more accurately represents the distance between words as perceived by learners.
Second, there are some similarities and differences in the per-L1 differences here compared to those based on LDN, as shown in Table 4 below. Specifically, the similarities are that German is ranked as the closest L1 to English, and that all the Romance L1s (French, Italian, Spanish, and Portuguese) are ranked as closer than all the non-Indo-European L1s (Arabic, Japanese, and Mandarin). The differences are that the ranking differs within the Romance L1s, the Indo-European L1s, and the non-Indo-European L1s, and that there are also several differences across these groups, including, most notably, that in FDN Russian is ranked as substantially closer to English than Portuguese and Mandarin, and that French is ranked as being practically as distant from English as Mandarin. These distances are not directly reflective of those between the languages, since they include only single-word entries (as discussed in more detail in the "Validation of Levenshtein distance" section of Appendix S1). Nevertheless, as shown in the aforementioned section, these distances are expected to be close to the "real" distances between these languages, and as such the results for FDN are highly unexpected, especially in the case of French. This suggests that the present FDN measure is not better than LDN at quantifying crosslinguistic distance.

FDN-based models
As with our main models, we used the normalized version of this distance (FDN), which we scaled (by multiplying it by 10) and centered.
We initially built these models using the same fixed and random effects as in our main models. However, the Swadesh-based models in both subcorpora had issues with singular convergence (due to the intercepts and slopes of the L1 random effect), and the parallel-based models did not converge at all. 19 As such, below (in Tables 5 and 6) we present the results for FDN-based models without the L1 random effect. However, this does not substantially influence our findings, since this effect was very weak in the FDN-based models that contained it and did converge, and the results of the models were functionally identical regardless of the inclusion of this effect, as was the case for the LDN-based models (see the "Models without the L1 random effect" section in Appendix S5).
These tables show that the FDN-based models replicate the key findings of the LDN-based models, with a similar null effect of distance and of its interaction with proficiency (B = 0.00-0.01, corresponding to IRR = 1.00-1.01), together with strong task effects.
In addition, we also built FDN-based models using only data from German speakers.
We did this both to replicate the associated LDN-based models, and because German is the L1 closest to English, and its FDN-based results were consistent with the LDN-based results and with what is expected based on general language classification (as discussed in the "Validation of Levenshtein distance" section of Appendix S1).
The results of these models are shown in Tables 7 and 8. As with the German-based models that used LDN as the measure of distance, these models replicate the key findings of the main models, in terms of the lack of a substantial effect of distance or of its interaction with proficiency, and in terms of the strong task effects. 20
Overall, the results from the FDN-based models complement those of the LDN-based main models, and suggest that the null effect in the main models should not be attributed to LDN failing to fully capture the phonological overlap between words, something that is also supported by past validation of Levenshtein distance. However, given the limitations of FD that were discussed above, both in general and within this sample, more work on validating and standardizing FD and similar measures is needed before a conclusive statement can be made on the influence of its use in this context.
Table 5. Results of the mixed models with FDN as the distance measure, for the Swadesh-based samples. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, distance is the phonological FDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 represents the SD of the associated random intercepts.
Table 6. Results of the mixed models with FDN as the distance measure, for the parallel-based samples. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, distance is the phonological FDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 represents the SD of the associated random intercepts.
Table 7. Results of the mixed models with FDN as the distance measure, for the Swadesh-based samples, using only data from German speakers. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, distance is the phonological FDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 represents the SD of the associated random intercepts.
Table 8. Results of the mixed models with FDN as the distance measure, for the parallel-based samples, using only data from German speakers. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, distance is the phonological FDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 represents the SD of the associated random intercepts.

Comparison of our approach with that of Rabinovich et al.
Our approach differs from that of Rabinovich et al. (2018) in several key ways.
First, in terms of sample, the L2 writings that they examined were produced in a relatively spontaneous setting (on social media), whereas our L2 material was produced in a relatively constrained task-based setting, and within an educational environment. In addition, they examined highly proficient L2 learners, whereas we examine a range of beginner-to-intermediate L2 learners. Also, they covered more L1s in their analyses, but only included Indo-European L1s, whereas we also include a few non-Indo-European L1s (Arabic, Japanese, and Mandarin) in some analyses.
Second, in terms of controlling for background effects, Rabinovich et al. focused on pre-processing the material to minimize background effects (e.g., by randomly shuffling texts by all authors), whereas we control for these effects using mixed-effects models. A notable benefit of our approach is that it allows us to estimate the magnitude of the associated task effects, which are important to SLA research and practices.
Another difference is that Rabinovich et al. analyzed the effects of etymological cognancy as a proxy of formal similarity, whereas we analyze the effects of phonological similarity (a key aspect of formal similarity). While there is a strong association between cognancy and phonological similarity, since cognates are generally more similar phonologically across languages (as discussed in Appendix S1, under "Validation of Levenshtein distance"), these are two different measures, which may lead to different outcomes in certain cases. For example, increased crosslinguistic similarity might facilitate the use of an L2 word even if this word is not cognate with its L1 equivalent (e.g., if the words share a single but salient consonant due to chance). Also, cognancy involves additional forms of similarity beyond phonological overlap (e.g., pragmatic similarities), which may also influence L2 word choice.
Finally, Rabinovich et al. also examined this effect within synonym sets, whereas we examine this effect in L2 words that are not always parts of synonym sets (as discussed in Appendix S3, under "Analysis of synonym sets"). Furthermore, they focused on synonym sets that contain at least two different etymological paths in particular. A key goal of theirs in doing this was to use L2 word choice to identify speakers' L1, so it made sense for them to focus on a subset of L2 words that are more likely to involve cognate facilitation. Conversely, our key goal is to understand the influence of learners' L1 on their L2 word choice as it manifests during the SLA process, so we focus on a diverse sample of L2 words, which should be more representative of the words that learners encounter during SLA. Nevertheless, as shown later, our sample contains many crosslinguistically similar L1-L2 word pairs, so we expect to be able to detect associated crosslinguistic influence if it exists in our sample, and this is supported by the precise coefficient estimates that we found for the effects of similarity in our models.

Comparison of baseline word frequencies
The mean Zipf frequency in the Swadesh lists (N = 25 words per language) was 5.24 (SD = 0.72, median = 5.14, range = 4.15-7.11), and the mean Zipf frequency in the parallel dictionaries (N = 1,103 words per language) was 4.35 (SD = 0.83, median = 4.32, range = 1.87-7.41). Accordingly, the distribution of the Zipf frequencies in our parallel-dictionaries sample-based on the magnitude of the mean, SD, and range-is similar to that of other studies that found a cognate facilitation effect, and our sample also contains substantially more (1,103) words, so this should not be an issue for our analyses.
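For reference, the Zipf scale is conventionally defined as log10 of a word's frequency per billion tokens (equivalently, log10 of its frequency per million words, plus 3). A minimal sketch of the conversion (illustrative only; in our study the values came from an existing frequency resource rather than being computed this way):

```python
import math

def zipf_frequency(count, corpus_size):
    """Zipf frequency: log10 of occurrences per billion words.

    E.g., a word occurring once per million words has a Zipf value of 3,
    and one occurring 1,000 times per million words has a Zipf value of 6.
    """
    if count == 0:
        return 0.0  # convention for words unseen in the corpus
    return math.log10(count / corpus_size * 1e9)
```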

Correlations of distance, frequency, and word use
Figure 2 contains basic scatterplots with the usage of the target English words in relation to their lexical distance from the corresponding L1 words. These plots show that the datasets contain words with a broad range of lexical distances, and a broad range of rates of usage. In addition, there appears to be a weak positive association between lexical distance and word usage, since the words with the higher rates of usage are almost exclusively located on the right. This is contrary to the negative correlation that we expect, whereby higher distance is associated with reduced usage. However, this could be due to confounds such as the baseline frequency of the English words, which our mixed models address.
Table 9 shows the raw correlations between lexical distance, baseline frequency of the words in English, and the rate of usage of the L2 English words in the present sample. The table shows that, in both lexical-distance datasets, there is a significant and substantial positive correlation between the baseline frequency of words and their rate of use in the learner sample, though this correlation is stronger for the words in the Swadesh lists (r = .39-.41) than in the parallel dictionaries (r = .17-.18). In addition, in the Swadesh lists there is also a significant and substantial positive correlation (r = .18) between the lexical distance of words and their frequency, meaning that more distant words are more frequent, but this correlation is not substantial (r = .03) in the parallel dictionaries.
In addition, there is a weak positive correlation between distance and usage for the Swadesh-based samples (r = .10-.11), which might be attributable to the distance-frequency and frequency-usage correlations. This is opposite to the association that we would expect between distance and usage if there was a cognate facilitation effect (assuming no other factors played a role), since decreased crosslinguistic distance (i.e., increased similarity/cognancy) should lead to increased word use. In the case of the parallel dictionaries, there is functionally no correlation between distance and usage (r = .01), which is expected given the almost null correlation between distance and frequency in this dataset, together with the weaker correlation between frequency and word use.
The difference in correlations between the Swadesh lists and the parallel dictionaries can be attributed, in part, to the fact that the parallel dictionaries contain a broader range of words in terms of their baseline English frequencies, including ones that are lower-frequency than in the Swadesh lists (Zipf frequency range of 1.87-7.41 in the parallel dictionaries, compared to 4.15-7.11 in the Swadesh lists). However, as shown in Table 10, when this difference is largely eliminated, by selecting a subset of the parallel dictionaries containing only words with a Zipf frequency of 4.15 and above (as in the Swadesh lists), 21 the distance-frequency and frequency-usage correlations increase but remain weaker than in the Swadesh lists (respectively, r = .07 and r = .23-.25), and the distance-usage correlation remains functionally zero (r = .01 in both subcorpora).
One possibility that was raised, based on the findings of the mixed models in the paper, is that the cognate facilitation effect does not exist, and was found in other studies due to the confounding influence of factors such as frequency, which we controlled for in the models.
While this would be a novel finding in its own right, we do not believe that this is the case.
This is because past studies have found evidence of the cognate facilitation effect even when frequency is controlled for, so we would expect to find this effect here too. These studies include ones where task effects, as conceptualized in the present study, do not play a role, since they were focused primarily on experiment-based investigation of language processing, so it does not appear that our controlling for task effects could explain the lack of cognate facilitation either. In addition, the correlations that we found here do not lead to a cognate facilitation effect, even without controlling for relevant background factors. Specifically, in the case of the Swadesh lists, based on the positive distance-frequency and frequency-usage correlations, we would expect to find an effect opposite to cognate facilitation, in the sense that increased distance (i.e., reduced similarity) will correlate with increased word use, which is in fact what we find for the distance-usage correlation. Furthermore, in the case of the parallel dictionaries, we would not expect to find a similar effect at all, since the correlation between distance and frequency is functionally zero.
Overall, the extensive evidence from past studies shows that the cognate facilitation effect exists even when frequency and other factors are controlled for. Furthermore, the raw correlations between the key variables in our study (lexical distance, baseline frequency, and L2 word usage) show that, when background factors are not properly controlled for, we would expect to find either a null effect or an effect opposite to cognate facilitation. As such, the absence of the cognate facilitation effect in our main models is a novel theoretical finding that is not merely attributable to the fact that we control for frequency.
Table 11 contains descriptive statistics regarding the frequency ratio of the words in the samples, as visualized in Figure 3 of the paper (in the beginning of the Results section). It shows that, on average, target English words were used at equal rates in the sample as in baseline English (i.e., had a frequency ratio near 1). However, all samples contained a range of words with different frequency ratios (total range 0.70-1.58), and this range was greater in the parallel-based samples, likely due to the inclusion of very low-frequency words. In addition, this inclusion is likely also the reason why more of the words from the parallel dictionaries did not appear in the parallel-based samples at all, as indicated by the substantially higher rate of words with a frequency of 0 in the parallel dictionaries.
Table 11. Descriptive statistics regarding frequency ratio, which is the frequency of a word in a given sample divided by its baseline frequency in English. The baseline frequency in English is based on the same frequency measure that we use throughout the paper, as discussed in the "Baseline word frequency" section of the paper. The frequency of use per sample is calculated separately for each combination of a target word and a specific L1, since different L1s can have different distances from English for any given word.
a Words is equal to the number of L1s in the distance dataset (9 in the Swadesh lists, 5 in the parallel dictionaries), times the number of words per L1 (25 in the Swadesh lists, 1,103 in the parallel dictionaries). b Words that did not appear in the sample were assigned a Zipf frequency of 0, in line with Speer (2020), and consequently have a frequency ratio of 0 here. n represents the number of such words in the sample, and the % represents the percent of such words out of the total words in the sample. c All the frequency ratio statistics were calculated while excluding cases with a frequency of zero. A ratio of 1 indicates that a word is used in an equal rate in the sample and in baseline English, whereas a ratio >1 indicates that the word is used more frequently in the sample, and a ratio <1 indicates the opposite.
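The frequency ratio described above can be sketched as follows (an illustrative reading, with the function name ours; words absent from the sample are assigned a ratio of 0, per the table notes):

```python
def frequency_ratio(sample_freq, baseline_freq):
    """Frequency of a word in a sample divided by its baseline English frequency.

    A ratio of 1 means the word is used at the same rate as in baseline
    English; >1 indicates overuse in the sample, <1 indicates underuse.
    Words absent from the sample (frequency 0) get a ratio of 0.
    """
    if sample_freq == 0:
        return 0.0
    return sample_freq / baseline_freq
```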

Analysis of synonym sets
For the cognate facilitation effect to be found in the manner of Rabinovich et al. (2018), several conditions must be met:
− There must be a communicative need or reasonable opportunity to convey the relevant meaning. They characterize their sample as involving "spontaneous productions", so in their case it is likely that learners had more opportunities for choosing which meanings to convey than in more constrained task-based settings.
− The relevant meaning must be able to be conveyed using a synset. This is because the cognate facilitation effect, as found by them, is based on the contrast in usage between synonyms within a synset.
− The synonyms must be easily interchangeable. This is because otherwise, the effects of cognancy may be obscured by other factors that play a role in the choice of specific synonyms out of the synset, and especially frequency effects. In their study, they operationalized this concept by avoiding synsets that were dominated by a single synonym (i.e., where a single synonym accounted for 90% or more of the usage of that synset in their dataset). This means, for example, that a synset such as {kiss, buss, osculation} was excluded, whereas a synset such as {divide, split} was retained. 22
− There must be a mix of cognates and non-cognates in the synset. Specifically, there must be at least one cognate for the cognate facilitation effect to occur, but there must also be at least one non-cognate against which the cognate stands out. 23 Note that this criterion is L1-dependent, since cognancy of an L2 word is defined based on its relation to an L1 word.
22 While this is a reasonable operational definition from a practical perspective, especially when working with large-scale datasets, it is important to note that there are various issues with it. For example, some synonyms might not be easily interchangeable due to connotations that they carry, even if they have a similar rate of usage. In addition, the reliance on a strict 90% threshold can lead to issues, such as in a case where a single synonym accounts for 85% of the uses in a corpus, meaning that it is still fairly dominant over the others. Similarly, there can be a difference between a synset with two synonyms that each account for 50% of uses, and a synset with 3 synonyms that has a usage distribution of 50%-49%-1% or 50%-25%-25%. Finally, if a certain L2 word is a cognate in many languages, it might become a highly dominant synonym, and therefore be omitted from the sample even though it displays a strong cognate facilitation effect.
We briefly analyzed our samples to determine to what degree these conditions occur there.
In the Swadesh lists, none of the English words were listed as being a part of a synset.
In the parallel dictionaries, out of 1,103 English words that were included in our analyses, 751 (68.09%) were listed as having no synonyms, and 352 (31.91%) were part of a synset. Of those with a synonym, 21 (5.97%) of the entries that originally had two synonyms in the dataset appeared by themselves in the final dataset, due to the removal of the other synonym during the data preparation. 24 Of the 331 entries that were part of a synset in the present dataset, 304 (91.84%) were part of a synonym pair (i.e., a synset with 2 synonyms), and 27 (8.16%) were part of a synonym triplet (i.e., a synset with 3 synonyms). As such, there were a total of 161 synsets in our parallel-dictionaries dataset.
When considering how many of these were easily interchangeable, we based our criterion on a similar one to that of Rabinovich et al., and defined an easily interchangeable synset as one where the difference in Zipf frequency between the synonyms is no greater than 1 (i.e., where no synonym is 10 or more times as common as the others, since Zipf frequency is on a logarithmic scale). 110 (68.32%) of the synsets (corresponding to 223 entries) fulfilled this criterion, with Zipf frequency differences ranging from 0.00 to 0.99.
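This criterion can be expressed directly in code; a minimal sketch (the function name is ours, not from the original analysis):

```python
def easily_interchangeable(zipf_values):
    """True if no synonym in the synset is 10x (or more) as frequent as another.

    Since Zipf frequency is log10-scaled, a difference of 1 between two
    Zipf values corresponds to a 10-fold difference in raw frequency.
    """
    return max(zipf_values) - min(zipf_values) <= 1.0
```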
Next, there was the question of which of these synsets contain a difference in lexical similarity that could be characterized as corresponding to cognancy/non-cognancy, since we use a continuous measure of lexical similarity, rather than something that clearly delineates whether a pair of words are cognates or not. As a rough measure, we categorized a synset as fulfilling this criterion if at least one of the synonyms had an LDN ≤ .60 and at least one had an LDN ≥ .80. 25 Unlike the previous criteria, which were L1-independent, this one was L1-dependent, so there were 550 relevant synset combinations (110 synsets for each of the 5 L1s in the parallel dictionaries). Of these, 93 (16.91%), which contain 189 synonyms, fulfilled the cognancy criterion.
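The rough cognancy criterion can similarly be sketched as follows (names and thresholds as parameters are ours; the .60/.80 cutoffs are the ones stated above):

```python
def has_cognancy_contrast(ldn_values, cognate_max=0.60, non_cognate_min=0.80):
    """True if a synset contains both a likely cognate (LDN <= .60)
    and a likely non-cognate (LDN >= .80) relative to a given L1."""
    return (any(d <= cognate_max for d in ldn_values)
            and any(d >= non_cognate_min for d in ldn_values))
```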
Finally, there was the question of whether there was a communicative need for the underlying meanings represented by these synsets. This was determined based on whether at least one of the synonyms in the relevant synsets appeared at least once in a text:
− In the first subcorpus, there were 179,439 rows which represent a combination of one of the above synonyms with a text (while taking learners' L1 into account). Of these, 710 (0.4%) rows had a count > 0 for the target word, meaning that it was used at least once. 26 These represented the use of 63 synsets (67.74% of the original synsets).
26 See the paper for more information on this rate of usage.

Spelling correction
We calculated the counts of words in the datasets using a spelling-corrected version of each text, which comes built in as part of the EFCAMDAT Cleaned Subcorpus, and which was generated using the autocorrect library (McCallum, 2019) in Python. We did this because we are interested in how often learners attempt to use target words, and misspellings could obscure those patterns. Nevertheless, this does not appear to make a practical difference to our analyses, as the correlations between the corrected and uncorrected counts were extremely high (Pearson's r = .9954-.9998 for all datasets, with p < .001 in all cases, and the 95% CIs falling no more than .0001 from the estimates; Spearman's ρ had similar values, from .9918 to .9982, all with p < .001).

Task random effect
Our models contained a random effect of task, to control for all the aspects of each writing task that can influence word choice, such as its prompt, with the exception of the task's associated L2 proficiency level, which we control for using the relevant predictor. This approach accounts for all aspects of task effects in aggregate, and does not disentangle the different aspects. 28 The use of mixed-effects models allows us to assess such task effects despite the fact that each task is associated with only a single proficiency level (Hox et al., 2018; Winter, 2019), and this type of effect structure, where each group in a random grouping variable always takes the same potentially unique value along a continuous predictor, is conventional in both corpus linguistics (e.g., Levshina, 2018) and psycholinguistics (Baayen et al., 2007; Vandenberghe et al., 2021), including in studies on the cognate facilitation effect (e.g., De Wilde et al., 2021). 29

Incidence rate ratio (IRR)
As noted in the body of the paper, we exponentiated the coefficient estimates in the mixed models to derive an incidence rate ratio (IRR), in order to facilitate the interpretation of the results, and the standard errors (SEs) of the coefficients were then scaled by multiplying them by the exponentiated coefficient estimates (Hox et al., 2018; Sedgwick, 2010).
The IRR itself can be interpreted as the expected change in the rate of the response variable as a factor of a 1-unit increase in the predictor. For example, an IRR of 2 means that a 1-unit increase in the predictor doubles the rate of response (i.e., doubles the rate of use of the target word), while an IRR of 0.5 means that a 1-unit increase in the predictor halves it.
An IRR of 1 corresponds to a coefficient estimate (B) of 0, as there is no expected change in the response variable as a result of a change in the predictor.
It is important to note that when combining multiple coefficients, you should not add the exponentiated coefficients, but rather multiply them, which is equivalent to exponentiating the sum of the coefficients. For example, consider a situation where you are predicting the IRR of a word that is 1 unit more frequent than some baseline level, in a learner whose proficiency is 1 unit higher than some baseline level. If the raw coefficient of frequency is 0.5 and that of proficiency is 0.3, then the IRR will be e^(0.5) × e^(0.3) = e^(0.8) ≈ 2.23. In addition, if you want to predict the IRR of a word that is 1 unit less frequent, then you need to take the inverse of the IRR of a word that is 1 unit more frequent, since this is equivalent to exponentiating the negative of the associated coefficient. For example, if the coefficient is 0.5, then the IRR of a word that is 1 unit less frequent than some baseline level is e^(−0.5) = 1/e^(0.5) ≈ 0.61 (Winter, 2019, pp. 109-110).
The reliance on visual checks is particularly important given the large sample sizes in the present study, which can lead to statistically significant but meaningless deviations from model assumptions (Hartig, 2020).
28 This operationalization of task is distinct from most notions of task within task-based learning and teaching approaches, and we make no claim regarding the impact of any specific aspect of tasks, such as their genre or cognitive complexity (Alexopoulou et al., 2017).
29 Levshina (2018)
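The arithmetic of combining coefficients can be sketched as follows (an illustrative helper, not part of our analysis code):

```python
import math

def combined_irr(*coefficients):
    """IRR for a combination of raw model coefficients.

    Exponentiating the sum of coefficients is equivalent to multiplying
    their individual IRRs; negating a coefficient inverts its IRR.
    """
    return math.exp(sum(coefficients))
```

For the frequency/proficiency example above, combined_irr(0.5, 0.3) gives e^0.8 ≈ 2.23, and combined_irr(-0.5) gives ≈ 0.61.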

Technical details
All analyses were conducted using R. The models were built using the glmmTMB package, which was developed for fitting generalized linear mixed models (GLMMs) (Brooks et al., 2017). 30 Analysis of residuals for the model diagnostics was performed using the DHARMa package (Hartig, 2021a). This package was chosen because it is dedicated to residual diagnostics for the type of models used in the present study (GLMMs), it is used in the glmmTMB documentation as the package of choice for this purpose (Bolker, 2020), and it is also widely used by others for this purpose (e.g., Brooks et al., 2019; Gries, 2021).
DHARMa uses an approach to residual diagnostics that addresses common issues with such diagnostics. Full details for the package's approach to diagnostics, and for the rationale behind this approach, can be found in the package's documentation (Hartig, 2021b).
However, the key points regarding this approach are the following: DHARMa aims at solving these problems by creating readily interpretable residuals for generalized linear (mixed) models that are standardized to values between 0 and 1, and that can be interpreted as intuitively as residuals for the linear model. This is achieved by a simulation-based approach, similar to the Bayesian p-value or the parametric bootstrap, that transforms the residuals to a standardized scale. The basic steps are: 1. Simulate new data from the fitted model for each observation.

…
The key advantage of this definition is that the so-defined residuals always have the same, known distribution, independent of the model that is fit, if the model is correctly specified. To see this, note that, if the observed data was created from the same data-generating process that we simulate from, all values of the cumulative distribution should appear with equal probability. That means we expect the distribution of the residuals to be flat, regardless of the model structure (Poisson, binomial, random effects and so on). (Hartig, 2021b)
Specifically, for each model, we ran the four main diagnostic functions that are available in DHARMa. These are explained in detail in the DHARMa documentation (Hartig, 2021b), but we can briefly say the following regarding them and regarding their interpretation:
A. plotQQunif: this produces a uniform quantile-quantile plot, to detect deviations from the expected distribution for the model. In a well-specified model, the residuals (black dots) should fall along the straight red line.
B. plotResiduals: this plots the scaled residuals against the (rank-transformed) predicted values,31 to detect patterns such as heteroskedasticity or non-linearity. In a well-specified model, the scaled residuals should be uniformly distributed at each predicted value.32
C. testDispersion: this tests whether the observed data is more or less dispersed than expected under the fitted model, by comparing the variance of the observed residuals against the variance of the simulated residuals. The key outcome of this test is the ratio between the two, where a ratio < 1 indicates underdispersion, while a ratio > 1 indicates overdispersion.
30 We chose glmmTMB for several reasons: it is designed with GLMMs in mind; it supports variants of Poisson models that we used or expected to potentially need (e.g., Conway-Maxwell Poisson); it is substantially faster than many competing packages for the type of models that we built (Brooks et al., 2017); it is well-documented; it interfaces well with other relevant packages (e.g., broom.mixed); and it uses similar syntax to lme4.
31 The predicted values are rank-transformed by default, since this makes patterns easier to spot visually, especially if the distribution of predictors is skewed, as noted in the DHARMa documentation (http://web.archive.org/web/20210803085455/https://rdrr.io/cran/DHARMa/man/plotResiduals.html).
32 Note that "a scaled residual value of 0.5 means that half of the simulated data are higher than the observed value, and half of them lower. A value of 0.99 would mean that nearly all simulated data are lower than the observed value. The minimum/maximum values for the residuals are 0 and 1." (Hartig, 2021b). Furthermore, due to the way that residuals are transformed in DHARMa, the scaled residuals in a properly fitted model are expected to have a uniform (rather than normal) distribution.
D. testZeroInflation: this compares the observed number of zeros with the zeros expected from simulations. The key outcome of this test is the ratio between the two, where a ratio < 1 indicates that the observed data has fewer zeros than expected, while a ratio > 1 indicates that it has more zeros than expected (i.e., zero-inflation).
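To make the logic of these checks concrete, here is a minimal sketch of the quantities involved (our diagnostics used DHARMa in R; this illustrative Python approximation uses made-up data and is not DHARMa's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical, correctly specified Poisson setting (all values made up):
# fitted means, "observed" counts, and datasets simulated from the fit.
mu = rng.uniform(0.5, 5.0, size=2000)          # fitted means per observation
observed = rng.poisson(mu)                     # observed counts
simulated = rng.poisson(mu, size=(500, 2000))  # 500 simulated datasets

# Scaled residuals: each observation's position within its own simulated
# distribution, with ties broken at random; ~Uniform(0, 1) if the model fits.
below = (simulated < observed).mean(axis=0)
ties = (simulated == observed).mean(axis=0)
residuals = below + rng.uniform(size=2000) * ties

# Dispersion ratio (cf. testDispersion): observed vs. simulated variance.
dispersion_ratio = observed.var() / simulated.var(axis=1).mean()

# Zero-count ratio (cf. testZeroInflation): observed vs. simulated zeros.
zero_ratio = (observed == 0).sum() / (simulated == 0).sum(axis=1).mean()

# In this well-specified example, the residuals stay in [0, 1] and
# both ratios come out close to 1.
print(residuals.min() >= 0 and residuals.max() <= 1)
print(round(dispersion_ratio, 2), round(zero_ratio, 2))
```

In a mis-specified model, the same quantities would deviate: residuals would pile up away from uniformity, and the two ratios would move away from 1 in the directions described above.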
The results of the diagnostic tests for each model will be presented in their own figure in the next sub-section, in the form of a panel with 4 tests, each represented by a dedicated plot.
Within each figure, plot (A) will correspond to the results from the plotQQunif function, plot (B) to plotResiduals, plot (C) to testDispersion, and plot (D) to testZeroInflation.
Note that, as mentioned in the DHARMa documentation, some minor deviations from perfect patterns (e.g., in the residual plots) can occur due to chance, even in well-specified models. Furthermore, when assessing deviations, it is important to consider the magnitude of the deviation in addition to its significance, as even negligible deviations can be significant in large samples.

Diagnostic plots
The diagnostic plots for the Swadesh-lists models appear in Figures 3 and 4. In the case of the parallel-dictionaries models, we were unable to run the full diagnostics on the full models, since the large size of the models necessitated memory allocation for the diagnostics that exceeded our available computational resources. To address this, we built new models using sub-samples from the original samples (separately for each subcorpus), containing 2,500,000 randomly selected observations each, and used these for the diagnostics.33 The results of these models, which are shown in Table 12, are similar to those of the main models, which supports their use for diagnostic purposes, though the model for the first subcorpus is slightly less well-specified than the associated main model.34 The results of the associated diagnostic checks, which appear in Figures 5 and 6, are similar to those of the Swadesh-based models, and suggest that the models are fairly well-specified, though they also show some underdispersion.
33 The size of 2,500,000 observations was chosen since with a 3,000,000-observation sub-sample we still hit the memory-allocation limit for the dispersion and zero-inflation tests.
34 There are two key differences between the subsample-based model for the first subcorpus and the associated main model. First, this (subsample-based) model had a "singular convergence" warning, likely due to the random intercept for L1 and the associated random slope of distance for L1, though the associated effect sizes were very similar to those in the main models (i.e., functionally 0). Second, the frequency predictor in the subsample model is underestimated, as it has a smaller IRR (and SE) than in the main model, though the frequency predictor is still substantial.
It is important to keep these differences in mind when it comes to the diagnostics, but they are nevertheless minor enough that this model is reasonable to use for diagnostic purposes, especially given that it is slightly less well-specified than the main model, which makes using it more conservative. In addition, note that, as expected, the differences between the subsample-based model and the main model generally become smaller as the size of the sub-sample increases, and the residual plots also become even closer to what is expected in a well-specified model. For example, when the sub-sample is increased to 3,000,000 observations, though there is still a "singular convergence" warning, the IRR and SE of frequency both become more similar to those of the associated main model (specifically, the IRR becomes 13.23 and the SE becomes 0.80), and the residual plot becomes even closer to what is expected for a well-specified model (i.e., the slight uptick at the right side of the plot flattens).
Table 12. Results of the mixed-effects models, for the parallel-based samples, using the 2,500,000-observation subsamples that were selected for diagnostics. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, distance is the phonological LDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 and τ11 respectively represent the SD of the associated random intercepts and slopes, and ρ01 represents the correlation between random intercepts and associated random slopes (here, distance for L1). Note. L1_ρ01 = 1 in the first model due to the "singular convergence" issue discussed earlier. This is not an issue in the corresponding main model, which has more data.
Figure 5. Diagnostics for the parallel-dictionaries models (first subcorpus). The zero-inflation test (5D) looks different here than in the other models, because it involves both a larger sample and a ratio of observed vs. simulated zeros that is slightly farther from 1, so overall there is a larger difference between the observed vs. simulated zeros. This shows there is no zero-inflation here, since there are fewer observed zeros than expected (the ratio of observed vs. simulated zeros is < 1), and the ratio is close enough to 1 (0.994) that this is not an issue for the model.
Figure 6. Diagnostics for the parallel-dictionaries models (second subcorpus).
Overall, the diagnostics for the models suggest that the models are fairly well-specified.
Specifically, as shown in the graphs in this section, an explanation for which can be found in the previous section (§4.5.1.2) and in Hartig (2021a), there do not appear to be substantial deviations from the expected distribution for the model (including no zero-inflation), or any substantial deviations from uniformity (i.e., there does not appear to be heteroskedasticity or non-linearity).35 The diagnostics do show that some of the models have some underdispersion, which can lead to overestimated SEs, and consequently to overestimated p-values (Brooks et al., 2017, 2019; Dean & Lundy, 2016; Forthmann & Doebler, 2021; Harris et al., 2012; Hartig, 2021b; Sellers & Morris, 2017). However, this underdispersion does not invalidate the present findings, given the robust effect sizes that were found across all samples (IRRs very close to 1, with SEs ≤ 0.01): even if these SEs are overestimated, the key patterns of results are still the same, in terms of the lack of effect of distance and of its interaction with proficiency. Essentially, if these SEs should be smaller than they are, this would only reinforce our certainty regarding the estimated IRRs, and show that they are functionally equivalent to 1, which corresponds to a coefficient estimate of 0 and means that there is no effect. This is further supported by the supplementary models in the next section, which replicate our findings while accounting for underdispersion. In sum, these diagnostics suggest that these models are fairly well-specified, and that they allow us to reliably answer our key research questions.

Supplementary models (generalized Poisson)
To account for any underdispersion in the main models, we built supplementary generalized Poisson models, which can handle both underdispersion and overdispersion (Brooks et al., 2019; Harris et al., 2012; Sellers & Morris, 2017; F. Zhu, 2012).36 As shown below, these models suffered from various convergence issues, so they are not a viable option to use as the main models, and we do not compare them directly to the main models here in terms of performance (e.g., based on AIC/BIC). Nevertheless, these models had very similar results to the main models, which provides support for the key findings.
Specifically, Table 13 contains these models for the Swadesh-based samples. Both models had results that are extremely similar to those of the main models, particularly in the case of the key variables that the study focuses on (distance and the distance:proficiency interaction). The model for the first subcorpus converged with a "NA/NaN function evaluation" warning.37
Table 13. Results of the generalized Poisson models, for the Swadesh-based samples. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, distance is the phonological LDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 and τ11 respectively represent the SD of the associated random intercepts and slopes, and ρ01 represents the correlation between random intercepts and associated random slopes (here, distance for L1).
Table 14 contains the generalized Poisson models for the parallel-based samples. There were more convergence issues here, as the model for the first subcorpus did not converge at all (it had a "gradient function must return a numeric vector of length 13" error, as well as a "NA/NaN function evaluation" warning), and the model for the second subcorpus converged with two warnings ("singular convergence" and a "non-positive-definite Hessian matrix").38 Nevertheless, the findings of the model that did converge, albeit with warnings, are very similar to those of the associated main model.
Table 14. Results of the generalized Poisson models, for the parallel-based samples. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, distance is the phonological LDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 and τ11 respectively represent the SD of the associated random intercepts and slopes, and ρ01 represents the correlation between random intercepts and associated random slopes (here, distance for L1).
In summary, we attempted to build models that use variants of the Poisson distribution that can handle both underdispersion and overdispersion (namely, generalized Poisson models).
35 Also, note that many past studies on the facilitative effect of crosslinguistic similarity found this effect using similar linear models (e.g., Casaponsa et al., 2015; De Wilde et al., 2020; Sadat et al., 2016), so we would expect our own linear models to capture such an effect too.
36 In addition, we also attempted to build Conway-Maxwell-Poisson models, which can also handle both underdispersion and overdispersion (Brooks et al., 2017, 2019; Forthmann & Doebler, 2021; Lynch et al., 2014; Sellers & Morris, 2017). The reason for this attempt was that these models might be less prone to convergence problems, though they are also much more computationally intensive (Brooks et al., 2019). Unfortunately, they also had convergence warnings for the Swadesh-based model in the second subcorpus, similarly to the generalized Poisson models, so they were not helpful in this regard, and furthermore, due to their high computational costs, we were unable to get them to converge for the parallel-based samples. Nevertheless, this is not crucial, as the results for these models in the case of the Swadesh-based samples where they did converge were very close to those of the generalized Poisson models, and functionally equivalent when it comes to the key variables under consideration (i.e., an IRR of 0.99-1 and an SE ≤ .01 for distance and the distance:proficiency interaction).
37 See the glmmTMB documentation for a description and discussion of all the convergence warnings and errors mentioned here: http://web.archive.org/web/20210516105444/https://cran.rproject.org/web/packages/glmmTMB/vignettes/troubleshooting.html

The resulting models had a number of convergence issues, errors, and warnings, which supports the use of the regular Poisson models as the main models in the study. Nevertheless, the findings of the models that did converge, including those that converged with no warnings (i.e., the Swadesh-based model in the first subcorpus), mirror the findings of the main models, especially with regard to the key variables in the study (the distance predictor and the distance:proficiency interaction). This was expected, since the main issue with underdispersion is overestimated SEs (Brooks et al., 2017, 2019; Dean & Lundy, 2016; Forthmann & Doebler, 2021; Harris et al., 2012; Hartig, 2021b; Sellers & Morris, 2017), and this is not a problem here, given the very small SEs that were found across all samples. As such, these models provide support for the findings of the main models, and suggest that any potential underdispersion in the data does not substantially change our key findings.

Collinearity
In addition to residual plots, we checked for potential collinearity using the performance package in R (Lüdecke et al., 2021).39 The results appear in Figure 7, which contains the variance inflation factor (VIF) for the predictors in each model. In all cases, the VIF was minimal (i.e., equal to or very close to 1), which indicates that collinearity was not an issue for the present analyses, especially given the large sample sizes (Morrissey & Ruxton, 2018; O'Brien, 2007; Winter, 2019).
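The idea behind the VIF can be sketched as follows (we used the performance package in R for the actual checks; this is an illustrative Python computation on made-up, nearly uncorrelated predictors):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix: three independent standard-normal predictors.
X = rng.normal(size=(1000, 3))

def vif(X, j):
    """VIF for column j: 1 / (1 - R^2) from regressing it on the others."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(X.shape[0]), others])  # add intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

# With uncorrelated predictors, R^2 is near 0, so every VIF is near 1,
# the pattern reported in Figure 7.
print([round(vif(X, j), 3) for j in range(3)])
```

Strongly correlated predictors would instead drive the relevant R² toward 1 and inflate the corresponding VIF, which is the pattern this check is meant to rule out.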

Software used in the analyses
All analyses were performed in R (R Core Team, 2021).40 All tests of statistical significance throughout the study were two-tailed. To list the specific packages that were loaded throughout the analyses, we used the report function from the report library (Makowski & Lüdecke, 2019), applied to sessionInfo(). This generates an automated output based on the citation information associated with the metadata of each package, which may be incomplete or formatted differently than APA style. We kept it as is, to preserve the original output, and also separated the associated references from the other references used in this document, which appear in the final section of this document.
---End of report(sessionInfo()) output above---

Random slopes
Initially, we tested several potential mixed-effects models, with random slopes of lexical distance for the learner, L1, task, and word random effects (separately for each one). For the models based on the parallel dictionaries, only the model with random slopes for L1 converged properly; the other models either had problems with singular convergence or did not converge at all, even when each random slope was tested on its own (i.e., as a single random slope, before combining multiple ones).
Given this, and given that the goal was to use a consistent random-effects structure across all models, we included only random slopes of distance for L1 in these models.
However, as shown in the results section of the main paper, this does not appear to be an issue given our particular findings, since the main concern with omitting random slopes is an increased rate of Type I error (Matuschek et al., 2017;Winter, 2019), but our key findings provide support for the null hypothesis.

Random intercepts by text
We considered adding to the models a random effect (random intercepts) for each text in the sample. However, there is substantial overlap between this and the learner random effect, since, as noted in the paper, most learners only had a single text in the sample. 41 In addition, we also had the task random effect, which accounts for further variance that may be associated with specific texts (each learner had only a single text per task).
When we attempted to build models that included the text random effect in addition to learner, in the case of the parallel-based models, the model did not converge for the first subcorpus, and had convergence warnings for the second subcorpus.42 Given this, and given that the goal was to use a consistent random-effects structure across all models, we did not include this random effect in our final models.
41 The mean number of texts per learner was 1.36 in the first subcorpus and 1.41 in the second. For more details on this, see the "Sample information" document in the study's OSF repository.
42 In the first subcorpus, we had a "gradient function must return a numeric vector of length 8" error, as well as "NA/NaN function evaluation" and "restarting interrupted promise evaluation" warnings. In the second subcorpus, we had the same warnings as in the first subcorpus, but not the error.
Table 15. Results of the mixed-models with text as an additional random effect, for the Swadesh-based samples. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, distance is the phonological LDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 and τ11 respectively represent the SD of the associated random intercepts and slopes, and ρ01 represents the correlation between random intercepts and associated random slopes (here, distance for L1).
Table 16. Results of the mixed-models with text as an additional random effect, for the parallel-based samples. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, distance is the phonological LDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 and τ11 respectively represent the SD of the associated random intercepts and slopes, and ρ01 represents the correlation between random intercepts and associated random slopes (here, distance for L1).

Models without the L1 random effect
Table 17. Results of the mixed-models without the L1 random effect, for the Swadesh-based samples. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, distance is the phonological LDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 represents the SD of the associated random intercepts.
Table 18. Results of the mixed-models without the L1 random effect, for the parallel-based samples. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, distance is the phonological LDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 represents the SD of the associated random intercepts.

Baseline models (without distance)
The baseline models were models that did not include lexical distance at all (MODELS_baseline). We compared these models to the main models that were used in the study (MODELS_main), where lexical distance was included as a predictor, as part of an interaction with L2 proficiency, and as random slopes of distance for L1. In addition, to better understand how the removal of lexical distance from the models influences them,44 we also compared the baseline and main models with models that contained distance as a predictor/interaction but without random slopes (MODELS_no_slope), and with models that had distance only as a predictor, with no random slopes or interaction (MODELS_only_predictor).
Specifically, we compared the models' AIC and BIC; the results are shown in Table 19. Both measures were used, as suggested in Kuha (2004). The AIC and BIC of each model were extracted directly from each model object in R using the summary function.
All comparisons were between models that used the same set of data (i.e., between models that use the same learner sample and lexical-distance dataset), as required when using these measures (Fabozzi et al., 2014; Kuha, 2004).
Note. ΔAIC is calculated by subtracting the AIC of the model with the minimal AIC for a given combination of subcorpus (i.e., first/second) and lexical-distance dataset (i.e., Swadesh/parallel) from the AIC of each other model, since comparisons can only be made between models that are based on the same data (Fabozzi et al., 2014; Kuha, 2004). Accordingly, no ΔAIC is listed for the model with the minimal AIC for a certain combination (e.g., Swadesh lists in the first subcorpus). The same is the case for ΔBIC.
Interpretations of the differences in AIC/BIC are based on Fabozzi et al. (2014). In terms of BIC, there was very strong support for the simplest (baseline) model in all 4 cases, as it had the minimal BIC, with ΔBIC either slightly below 10 or far above it. In terms of AIC, the picture was less clear. Specifically, in the case of the parallel dictionaries in the first subcorpus, the baseline model was strongly supported (ΔAIC > 40). However, in the case of the Swadesh lists in the first subcorpus, there was only weak support for the baseline and predictor-only models over the main model (ΔAIC ~2-3), and in the case of the second subcorpus (both Swadesh and parallel), there was moderate support (ΔAIC ~5-6) for the main models over the other models (though the main models were ranked the worst in all cases based on BIC). This difference between AIC and BIC can be attributed to the greater penalty that BIC imposes for the number of parameters in the model (Fabozzi et al., 2014).
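The divergence between the two criteria follows directly from their penalty terms (AIC penalizes each parameter by 2, BIC by ln n, which is large in samples of this size). A hypothetical numeric sketch (all values are invented for illustration and are not taken from Table 19):

```python
import math

# Invented log-likelihoods and parameter counts for two nested models.
n = 2_000_000                                # observations (large sample)
loglik_baseline, k_baseline = -1_000_000.0, 8
loglik_main, k_main = -999_995.0, 12         # better fit, 4 extra parameters

def aic(loglik, k):
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    return k * math.log(n) - 2 * loglik

# BIC's per-parameter penalty here is ln(2e6) ~ 14.5 vs. AIC's 2, so BIC
# can prefer the simpler model even when AIC prefers the richer one.
print(aic(loglik_main, k_main) < aic(loglik_baseline, k_baseline))        # True
print(bic(loglik_main, k_main, n) > bic(loglik_baseline, k_baseline, n))  # True
```

In this invented example the richer model wins on AIC by only ~2 (weak support) but loses on BIC by ~48 (very strong support for the simpler model), mirroring the kind of AIC/BIC disagreement described above.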
When the patterns of the two measures are considered, together with the estimates for the associated predictors, it appears that the AIC comparisons sometimes recommend the use of an overfitted model here.
Overall, the comparison between the models did not consistently support the inclusion of lexical distance as a predictor based on AIC, and consistently supported its exclusion based on BIC. It is, therefore, reasonable to conclude that the effect of distance is at best unclear in our dataset. This is strongly supported by the findings for the main models that are shown in the paper, where the distance predictor and the interaction had IRRs very close to 1 (corresponding to a coefficient estimate of 0) and very small SEs, and where the SDs of the random slopes of distance were also very close to 1 (i.e., to a coefficient estimate of 0).
The results for the baseline models are shown in Tables 20 and 21.
Table 20. Results of the baseline mixed-effects models, for the Swadesh-based samples. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 represents the standard deviation (SD) of the associated random intercepts.
Table 21. Results of the baseline mixed-effects models, for the parallel-based samples. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 represents the standard deviation (SD) of the associated random intercepts.

German-only models
We built models using data from only German learners.45 In summary, the samples using only data from German speakers largely replicated our key findings, which indicates that our findings hold even when focusing on this key L1.
Table 22. Results of the mixed-models without the L1 random effect, for the Swadesh-based samples, using only data from German speakers. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, distance is the phonological LDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 represents the SD of the associated random intercepts.
Table 23. Results of the mixed-models without the L1 random effect, for the parallel-based samples, using only data from German speakers. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, distance is the phonological LDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 represents the SD of the associated random intercepts.

Binary-response models
Since the underlying response variable that we examined is a count, we used Poisson models for our main analyses (Green, 2021;Hox et al., 2018;Winter, 2019). Nevertheless, since lexical distance might affect whether or not an L2 word is used at all, rather than how often it is used, we also built supplementary logistic-regression models with a binary response variable.
We derived the binary response variable by converting the original count of the number of times that each target English word was used in a text into a binary variable (i.e., a count greater than 0 was converted to a '1' in the response variable, and a count of 0 was kept as a '0'). To model this response variable, we used logistic regression (i.e., models with the binomial family and canonical logit link). The total wordcount of texts was included in the models as a direct predictor, similarly to the offset in the Poisson models. We also report exponentiated coefficients, which in this case are called odds ratios (ORs) rather than incidence rate ratios as in the Poisson models, though the two are similar conceptually (e.g., an OR of 2 means that a 1-unit increase in the predictor doubles the odds that the target word will be used in a text).
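The conversion and the OR interpretation can be sketched as follows (the counts below are made up for illustration; the actual models were fit in R):

```python
import math

# Converting hypothetical per-text counts of a target word into the
# binary response (1 = the word appeared in the text at least once).
counts = [0, 2, 1, 0, 0, 5]
binary = [1 if c > 0 else 0 for c in counts]
print(binary)  # [0, 1, 1, 0, 0, 1]

# An exponentiated logistic-regression coefficient is an odds ratio:
# OR = exp(b) is the multiplicative change in the odds of the word
# appearing, per 1-unit increase in the predictor.
b = math.log(2)               # a coefficient of ~0.69 on the logit scale
print(round(math.exp(b), 1))  # 2.0, i.e., the odds double per unit
```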
There was an issue with singular convergence in the Swadesh-based sample in the first subcorpus, due to the L1 random effect, and the parallel-based models did not converge at all,47 so we omitted this random effect from these models. However, given the correspondence between these models and the Poisson models (as shown below), and given that removing this random effect from the Poisson models did not change the key findings (see the "Models without the L1 random effect" section in the Supporting Information), this should not substantially influence our key findings. Indeed, for the Swadesh-based models that did converge (albeit with singular convergence), the results were functionally identical to those of the Swadesh-based models without the L1 random effect.
The results of the binary-response (i.e., logistic-regression) models appear in Tables 24 and 25. Despite reducing the complexity of the models by removing the L1 random effect, the model still did not converge for the parallel-based sample in the first subcorpus, which supports our preference for using the Poisson models as the main models. Nevertheless, the three models that did converge without an issue replicate the results of our main models, as there was a functionally null effect of distance and of its interaction with proficiency (B = -0.02-0.00, corresponding to OR = 0.98-1.00), as well as strong effects of task, word, and especially the task:word interaction. This means that looking at a binary response variable (whether a word was or was not used in a text), rather than a count response variable (the number of times a word was used in a text), does not change our findings substantially.
Table 24. Results of the mixed-models with a binary response variable (i.e., whether a target English word did or did not appear in the text), for the Swadesh-based samples. Under fixed effects, distance is the phonological LDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 represents the SD of the associated random intercepts.
Table 25. Results of the mixed-models with a binary response variable (i.e., whether a target English word did or did not appear in the text), for the parallel-based samples.
Under fixed effects, distance is the phonological LDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 represents the SD of the associated random intercepts.

Added-interactions models
Our main models included an interaction between distance and L2 proficiency, since that was the key interaction that we expected to find based on the literature (specifically, we expected that the effects of distance would weaken as L2 proficiency increases). To expand these models, we created supplementary models with three additional interactions:
− The first interaction was between distance and frequency, primarily to test whether the effects of distance are stronger for lower-frequency words (e.g., because those words are more difficult, so learners rely more on the facilitative effects of similarity, or because these words are more likely to be part of an interchangeable synonym set).
− The second interaction was between proficiency and frequency (e.g., in case proficiency effects are stronger for lower-frequency words, which learners are less likely to know).
− The third interaction was a three-way interaction between distance, proficiency, and frequency. Such a three-way interaction is more difficult to interpret, but one way to think of it is that the interaction between distance and proficiency, if it exists, may itself be moderated by word frequency (e.g., the effect of distance weakens as L2 proficiency increases, but only for high-frequency words).
These interactions were inserted into the models in R by specifying:
ldn_phono_closest_scaled_centered * proficiency_level_centered * frequency_zipf_centered
To accommodate the extra complexity of these models, we removed the L1 random effect, which was the key cause of convergence issues in the other models, and whose removal did not substantially influence the findings (see the "Models without the L1 random effect" section in this Appendix).
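In R model syntax, the `*` operator between three terms expands into all main effects, all two-way interactions, and the three-way interaction (seven terms in total). A small Python sketch (function and predictor names are ours, shortened from the centered variables above for readability) makes the expansion explicit:

```python
from itertools import combinations

def expand_r_interaction(terms):
    """Expand an R-style formula term like a*b*c into its constituent
    main effects and interactions, using R's ':' interaction notation."""
    expanded = []
    for k in range(1, len(terms) + 1):
        for combo in combinations(terms, k):
            expanded.append(":".join(combo))
    return expanded

# Shorthand names for the three centered predictors in the models:
terms = ["distance", "proficiency", "frequency"]
print(expand_r_interaction(terms))
# ['distance', 'proficiency', 'frequency',
#  'distance:proficiency', 'distance:frequency', 'proficiency:frequency',
#  'distance:proficiency:frequency']
```

This is why a single `*` specification adds three new interaction terms (distance:frequency, proficiency:frequency, and the three-way term) on top of the main effects and the distance:proficiency interaction already in the main models.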
The results of these models are shown in Tables 26 and 27. The models converged with no issues, except for the parallel-based sample in the first corpus, where the model did not converge (specifically, the fit produced a "gradient function must return a numeric vector of length 4" error and an "NA/NaN function evaluation" warning).
The models that converged all replicated the key findings of the main models: a null effect of the distance predictor, a null interaction between distance and proficiency, and strong effects of task, word, and the task:word interaction.
In addition, the added two-way interaction between distance and frequency and the added three-way interaction between distance, proficiency, and frequency were consistently and robustly null across all the models (B = -0.02 to 0.01 and SE(B) ≤ 0.01, corresponding to IRR = 0.98-1.01 and SE(IRR) ≤ 0.01). Conversely, there was a consistent, positive, but weak interaction between proficiency and frequency (B = 0.05-0.07, corresponding to IRR = 1.05-1.07), which suggests that as learners' L2 proficiency increases, the effect of word frequency on their rate of use of the L2 words also increases. However, this small interaction is irrelevant to the present research question, and as the models show, it does not change the key findings.
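As with the odds ratios, the incidence-rate ratios (IRRs) reported for the Poisson models are exponentiated coefficients. A brief sketch (function name is ours) confirms the correspondence between the proficiency:frequency interaction coefficients and the reported IRR range:

```python
import math

def poisson_coef_to_irr(b: float) -> float:
    """Convert a Poisson-regression coefficient (log-rate scale)
    to an incidence-rate ratio (IRR = exp(B))."""
    return math.exp(b)

# Interaction coefficients of B = 0.05 to 0.07 correspond to
# IRRs of roughly 1.05 to 1.07, i.e., a weak positive interaction:
print(round(poisson_coef_to_irr(0.05), 2))  # 1.05
print(round(poisson_coef_to_irr(0.07), 2))  # 1.07
```

An IRR of 1.05-1.07 for the interaction means the multiplicative effect of frequency on the rate of word use grows by about 5-7% per unit increase in proficiency, a small effect relative to the strong task and word effects.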
In summary, we found no substantial interaction between distance and the other predictors (i.e., no distance:proficiency, distance:frequency, or distance:proficiency:frequency interaction). In addition, our key findings replicated when the new interactions were added to the models, although adding them did cause a convergence issue in one model.
This does not suggest that these interactions cannot occur in other contexts; for example, there may indeed be distance:proficiency and distance:frequency interactions in more spontaneous L2 settings, where there is a stronger effect of distance. Rather, it merely indicates that these interactions did not occur in the present sample, likely due to the general lack of an effect of distance.

Table 26. Results of the mixed models with the added interactions, for the Swadesh-based samples. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, distance is the phonological LDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 represents the SD of the associated random intercepts.

Table 27. Results of the mixed models with the added interactions, for the parallel-based samples. The response variable was the rate of use of the target L2 English words (i.e., their count offset by the total number of words in each text). Under fixed effects, distance is the phonological LDN between each L2 word and its most lexically similar L1 counterpart (originally 0-1, scaled to 0-10), proficiency is the EFCAMDAT L2 proficiency level at which the text was written (1-12, corresponding to CEFR A1-B2), and frequency is the baseline Zipf frequency of the target word in English (~1-7.5). Under random effects, τ00 represents the SD of the associated random intercepts.
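The "count offset by the total number of words in each text" response in Tables 26 and 27 reflects a standard Poisson device: including log(total words) as an offset makes the model estimate a per-word rate rather than a raw count. A minimal sketch under hypothetical coefficient values (b0 and b_dist below are illustration values, not the paper's estimates) shows the mechanics:

```python
import math

def expected_count(b0, b_dist, distance, total_words):
    """Expected count of a target word in a text under a Poisson model
    with a log(total_words) offset. The offset means the linear predictor
    models the log rate per word; multiplying by text length (adding its
    log) scales the rate back up to an expected count."""
    log_rate_per_word = b0 + b_dist * distance  # linear predictor on log scale
    return math.exp(log_rate_per_word + math.log(total_words))

# With the same per-word rate, a text twice as long yields twice the
# expected count, so text length is controlled for by construction:
short = expected_count(b0=-6.0, b_dist=0.0, distance=5.0, total_words=100)
long_ = expected_count(b0=-6.0, b_dist=0.0, distance=5.0, total_words=200)
print(round(long_ / short, 2))  # 2.0
```

This is why the count models and the binary-response models in Tables 24-25 can be compared directly: both normalize, in different ways, for the fact that longer texts offer more opportunities to use a target word.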