Internal and External Dynamics in Language: Evidence from Verb Regularity in a Historical Corpus of English

Human languages are rule governed, but almost invariably these rules have exceptions in the form of irregularities. Since rules in language are efficient and productive, the persistence of irregularity is an anomaly. How does irregularity linger in the face of internal (endogenous) and external (exogenous) pressures to conform to a rule? Here we address this problem by taking a detailed look at simple past tense verbs in the Corpus of Historical American English. The data show that the language is open, with many new verbs entering. At the same time, existing verbs might tend to regularize or irregularize as a consequence of internal dynamics, but overall, the amount of irregularity sustained by the language stays roughly constant over time. Despite continuous vocabulary growth, and presumably, an attendant increase in expressive power, there is no corresponding growth in irregularity. We analyze the set of irregulars, showing they may adhere to a set of minority rules, allowing for increased stability of irregularity over time. These findings contribute to the debate on how language systems become rule governed, and how and why they sustain exceptions to rules, providing insight into the interplay between the emergence and maintenance of rules and exceptions in language.

• Type: All token instances of a given verb lemma, e.g., in "I walked, she walks and he talks", there are two types, walk and talk. Walk has two tokens (walked, walks), while talk has one (talks). The number of tokens for a given type determines its frequency.
• Verb Type: All separate verb lemmas, regardless of shared basic root verb. For example, do and undo constitute two separate verb types with different token counts.
• Root Type: Lemmas collapsed by basic root verb. When referring to root types, tokens of do and undo both contribute to the frequency of the root type do.
• f: The basic usage frequency of a type, calculated as the number of tokens (from all tenses) divided by the total number of tokens in a decade (this number is held constant across decades at 2,177,456, the size of the first decade, after confining; see text).
• f_past: The frequency of usage for a lemma in the past tense, defined as the total number of past tense tokens for the lemma divided by the size of the decade (overall number of tokens in the decade) prior to confining (shown in Table S1).
• I: The proportion of irregularity for a given type or class, defined as the number of irregular (non-ed) past tense tokens for the type divided by the total number of past tense tokens for the type. I can be defined only for types which exhibit a sufficient frequency (see Undefined). Each type may have a different I in each decade.
• Undefined: A verb or root is considered undefined in a given decade if its f_past < 2.75 · 10⁻⁶. In other words, a type's regularity is undefined if it has extremely low f_past.
• Extended Vocabulary: Entire set of verb or root types present after confining the corpus size to control for the increase in available text over time. This set includes verbs which enter and/or exit the vocabulary at any point during the 160 year time period.
• Core Vocabulary: The set of verb and root types present in all sixteen decades considered. Note that verbs or roots with very low f, though part of the core vocabulary, may have undefined I in some decades.
• Mostly Regular: A type is classified as mostly regular if its I < 0.5.
• Mostly Irregular: A type is classified as mostly irregular if its I ≥ 0.5.
• ε: The error threshold for considering a verb or root to be regular or irregular (see below), set at ε = 0.01. In other words, if the I of a type is within ε of 0 or 1, it is considered regular or irregular (respectively). Note that a verb which is regular (irregular) in a decade may not stay regular (irregular) in other decades.
• Stable Regular: A verb or root is considered a stable regular if it respects I ≤ 0 + ε in all 16 decades.
• Stable Irregular: A verb or root is considered a stable irregular if it respects I ≥ 1 − ε in all 16 decades.
• Active: A verb or root is considered active if its I exhibits a value respecting 0 + ε < I < 1 − ε in at least one decade.
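The definitions above can be read as a small decision procedure. The following is a minimal sketch, not the authors' code: the function names are hypothetical, an undefined decade is represented as `None`, and a type that has undefined decades or jumps between regular and irregular without an active decade is given the placeholder label "unclassified", which is an assumption of this sketch.

```python
from typing import Optional, Sequence

EPSILON = 0.01  # error threshold from the glossary


def mostly(i: float) -> str:
    """Broad two-way split: mostly irregular if I >= 0.5, else mostly regular."""
    return "mostly irregular" if i >= 0.5 else "mostly regular"


def classify_decade(i: Optional[float], eps: float = EPSILON) -> str:
    """Label a single decade's proportion of irregularity I (None = undefined)."""
    if i is None:
        return "undefined"
    if i <= 0 + eps:
        return "regular"
    if i >= 1 - eps:
        return "irregular"
    return "active"


def classify_type(series: Sequence[Optional[float]], eps: float = EPSILON) -> str:
    """Label a type across all decades, given its I value in each decade."""
    labels = [classify_decade(i, eps) for i in series]
    if "active" in labels:
        return "active"  # active in at least one decade
    if all(label == "regular" for label in labels):
        return "stable regular"
    if all(label == "irregular" for label in labels):
        return "stable irregular"
    # Assumption of this sketch: anything else is left unlabelled.
    return "unclassified"
```

For instance, a series with I = 0.04 in one decade and I = 0 in the other fifteen is labelled active, while sixteen decades at I = 0 give a stable regular.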

S3: Corpus Preparation
This section provides basic information about the corpus, and details regarding our methods in the preliminary stages of preparing the corpus for analysis. The Corpus of Historical American English contains over 400 million words of written English from 1810-2009. Each decade is genre balanced to contain roughly equal representation of fiction and non-fiction sources [2]. CoHA provides tagged frequency lists of words for each decade.
Although CoHA spans 1810-2009, we used only the period between 1830 and 1989. The final two decades were removed (1990-2009) due to the fact that they are duplicated in the larger Corpus of Contemporary American English (CoCA; used for a separate investigation not reported here). The first two decades were removed as they displayed rather extreme growth in the database size. The number of verb tokens in the first decade (1810-1819) is approximately 20% of the size of the second (1820-1829), and the second decade is still only about half the size of the third (1830-1839). Thereafter, growth levels off, with more moderate increases in the number of tokens between decades (see Figure S3, Table S2). We thus discarded the first two decades due to their small relative size; further potential effects of increasing database size were addressed by removing extremely low frequency items and by confining the size of each decade according to the 1830-1839 decade. This is explained in greater detail below.

Lemmatisation, Removals & Confining
Prior to analysis, the corpus was confined only to verbs and hand lemmatised by the authors (i.e., the words walked, walking and walks were all tagged as tokens of the lemma walk). In the process of lemmatisation, several types of removals were made. First, all modal auxiliary verbs in all tenses (e.g., can, could, may, must) were removed as they are considered function words rather than lexical verbs (and are excluded from earlier studies of regularity, e.g., [5]). Tagging errors, spelling and OCR errors, and items which occurred at extremely low frequency in very few decades were also removed.
Hand lemmatisation allowed the coders to check many obvious tagging errors against their context in the corpus and remove such errors altogether. For example, chung is tagged as a past tense verb, but invariably occurs as a proper noun (e.g., in a context such as "I told her very briefly Chung Bong's story" [2]). Note that not all tagging errors were removed; only those which were unknown English words and which had low enough frequency to allow checking against the corpus itself to verify the error.¹ Where a past tense verb had the potential for error (e.g., abandoned can be either an adjective or verb, and adjective tokens were sometimes incorrectly tagged as verbs as in "this dear abandoned innocent" [2]), but this error was not uniform (i.e., most tokens of abandoned tagged as the past tense were correctly tagged), all tokens were included. This is because manually checking all contexts of a particular word form is infeasible in CoHA (due to copyright restrictions at the time of analysis) for words which do not have a low token count.
Spelling or optical character recognition (OCR) errors were corrected where possible; for example, abandonedthe was coded as a past tense token of abandon, and aaserting is a clear misspelling of asserting. Where these types of errors could not be straightforwardly interpreted (e.g., for stranded bound morphemes like -ify, where each occurrence was an error of the end of a different word, such as qual-ify, sign-ify), they were removed entirely. Verbs were also removed if they did not occur with a frequency of more than 10⁻⁸ in at least three decades. In other words, as our analysis aimed to observe changes over time, we discarded verbs with both extremely low frequency and a very short lifespan (<30 years total, though not necessarily consecutive) in the corpus.
These three criteria for removal led to the removal of 9-12% of the verb tokens in each decade (see Table S2). Figure S3 contrasts the token count of all verbs in the corpus (Davies, personal communication) versus the token count after our removals from each decade. The removal of modals accounted for the loss of 6-8% of tokens (55-80% of all removals), depending on the decade. This percentage drops over time, indicating that modals make up a shrinking proportion of all verbs. This is likely due to the growth in the number of new lexical verbs evident even with the corpus size constrained (see main text, Figure 1). Additionally, the percentage of removal due to other criteria also increases over time (see Table S2); this is due to an increase in tokens removed as extremely low frequency items which occur for very short periods of time. This is consistent with an increase in the database size (more tokens can lead to the introduction of more low frequency types [3]), rather than any major variations in tagging errors, estimated to be stable at around 1-2% overall [2]. Moreover, the direction of these removals is conservative with respect to our results; in other words, even though a larger proportion of verb tokens were removed from the later decades, we still observed a growth in the number of verbs over time (see main text).
An increase in database size has the potential to affect the number of types observed [3]. Despite having removed very briefly or sporadically occurring low frequency items in the coding process, the growth in the number of tokens over time is still considerable (see Figure S3). To consider genuine vocabulary growth (rather than the growth in text available for digitized corpora over time), we confined the number of verbs considered in each decade to the number of tokens in the smallest decade (1830-1839; 2,177,456 tokens). This involved recreating a random sequence of all observed verb tokens in each decade. We then drew 2,177,456 tokens from each randomly sequenced set of verbs (with the exception of the first decade considered, 1830-1839, which remained intact). Verb types which remained after the set was confined constitute the extended vocabulary considered in our final analysis. Frequencies (f) were calculated based on the number of tokens per lemma divided by the size of the confined set.
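The confining step amounts to drawing a fixed number of tokens at random, without replacement, from each decade's verb tokens. The sketch below illustrates one way this could be done from per-lemma token counts; it is an illustration under stated assumptions rather than the procedure actually used. The names `confine` and `frequencies`, the fixed seed, and the toy counts are hypothetical, and the `counts=` argument of `random.sample` requires Python 3.9 or later.

```python
import random
from collections import Counter
from typing import Dict

TARGET_SIZE = 2_177_456  # tokens in the smallest decade (1830-1839)


def confine(counts: Dict[str, int], target: int = TARGET_SIZE, seed: int = 0) -> Counter:
    """Draw `target` tokens without replacement from a decade's lemma counts.

    A decade that is already at or below the target size is left intact.
    """
    if sum(counts.values()) <= target:
        return Counter(counts)
    rng = random.Random(seed)
    lemmas = list(counts)
    # Each lemma is treated as if repeated counts[lemma] times in the population.
    drawn = rng.sample(lemmas, k=target, counts=[counts[lemma] for lemma in lemmas])
    return Counter(drawn)


def frequencies(confined: Counter) -> Dict[str, float]:
    """f = tokens per lemma divided by the size of the confined set."""
    size = sum(confined.values())
    return {lemma: n / size for lemma, n in confined.items()}


if __name__ == "__main__":
    toy_decade = {"walk": 50, "talk": 30, "be": 20}  # hypothetical counts
    confined = confine(toy_decade, target=40)
    print(confined, frequencies(confined))
```

Lemmas that receive no tokens in the draw drop out of the confined set, which is how low frequency types can exit the extended vocabulary under this scheme.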

S4: Data Preparation
After confining the size of each decade, we analysed several fundamental properties of the data: entrances and exits, frequency, root types, and regularity. Entrances and exits were considered in terms of single-decade intervals, meaning every entrance and exit was counted, even if the same verb entered and then exited in consecutive decades. In other words, the verbs entering between 1840 and 1850 are verbs appearing in 1850 which did not appear in 1840, regardless of whether they appeared in 1830. Overall, however, more verbs enter and stay (or enter more times than they exit), making the net result a growth in the number of verbs (see Figure 1E, main text).
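Counting entrances and exits over single-decade intervals reduces to set differences between consecutive decades. A minimal sketch, assuming each decade's vocabulary is available as a set of lemmas keyed by the decade's starting year (the function name and the toy data are hypothetical):

```python
from typing import Dict, Set, Tuple


def entrances_and_exits(
    vocab_by_decade: Dict[int, Set[str]]
) -> Dict[Tuple[int, int], Tuple[Set[str], Set[str]]]:
    """Map each consecutive decade pair to (entered, exited) sets of verbs."""
    decades = sorted(vocab_by_decade)
    result = {}
    for prev, cur in zip(decades, decades[1:]):
        entered = vocab_by_decade[cur] - vocab_by_decade[prev]
        exited = vocab_by_decade[prev] - vocab_by_decade[cur]
        result[(prev, cur)] = (entered, exited)
    return result


if __name__ == "__main__":
    # Hypothetical toy vocabulary: "zap" exits and then re-enters.
    toy = {1830: {"walk", "zap"}, 1840: {"walk"}, 1850: {"walk", "zap"}}
    for interval, (entered, exited) in entrances_and_exits(toy).items():
        print(interval, "entered:", entered, "exited:", exited)
```

Because only adjacent decades are compared, a verb that was present in 1830, absent in 1840 and present again in 1850 counts as one exit and one entrance.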

Regularity
Types in the extended vocabulary were categorized according to their proportion of irregularity (I), i.e., the fraction of irregular simple past tense occurrences over the total number of past tense tokens. The I was calculated from our lemmatized version of the verb set, prior to confining decade size. Since the purpose of confining was to control for frequency effects related to database size, the process of confining did not recreate a tagged database of verb tokens; information regarding past tense regularity is only available from the original lemmatized version. The I was calculated using only the simple past tense; irregularity in the past participle was not considered, such that e.g., prove is entirely regular (e.g., I proved her wrong) although it has an irregular past participle (e.g., It has proven difficult) in common usage. Irregular spellings such as paid for the past tense of pay (as opposed to payed, which also occurs in the corpus) were considered regular past tense tokens, since spelling irregularity and variation were not considered in our analysis. Figure S5 shows the proportion of irregular tokens overall, which indicates that 65-70% of all past tense utterances are irregular. Even with the removal of the highest frequency irregulars, be and have (which also have the potential to be function verbs), around 50% of all past tense verb tokens are irregular. This indicates that while regularity dominates types, irregularity dominates tokens.
We considered regularity undefined if past tense usage was so infrequent (or non-existent) that the regularity of a verb could not be determined without the potential for error. Therefore, in order to have defined regularity, a verb had to have past tense usage greater than or equivalent to a frequency of 2.75 · 10⁻⁶ in a given decade. Because past tense usage and irregularity are based on the unconfined corpus, past tense usage frequency was calculated according to the original lemmatized corpus size for each decade. This frequency threshold is equivalent to at least 6 past tense tokens for the first decade (1830-1839), but the number of past tense tokens required to reach this threshold scales with the increase in corpus size (such that e.g., at least 14 past tense tokens are required to pass the threshold in the final decade). Frequency of usage in the past tense scales with overall frequency, such that low frequency items are much more likely to be undefined.
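As a worked check of the threshold, the first decade contains 2,177,456 tokens, so 2.75 · 10⁻⁶ × 2,177,456 ≈ 5.99, which rounds up to the 6 past tense tokens quoted above. The sketch below expresses this and the resulting definition of I; the function names are hypothetical and the token counts passed in are illustrative only.

```python
import math
from typing import Optional

F_PAST_THRESHOLD = 2.75e-6  # minimum past tense frequency for I to be defined


def min_past_tokens(decade_size: int) -> int:
    """Smallest number of past tense tokens that reaches the threshold."""
    return math.ceil(F_PAST_THRESHOLD * decade_size)


def irregularity(irregular_past: int, total_past: int, decade_size: int) -> Optional[float]:
    """Proportion of irregularity I, or None when f_past is below the threshold.

    decade_size is the unconfined (original lemmatized) size of the decade.
    """
    f_past = total_past / decade_size
    if f_past < F_PAST_THRESHOLD:
        return None  # regularity is undefined in this decade
    return irregular_past / total_past


if __name__ == "__main__":
    first_decade = 2_177_456
    print(min_past_tokens(first_decade))      # tokens needed in 1830-1839
    print(irregularity(1, 5, first_decade))   # too rare, so undefined
    print(irregularity(3, 6, first_decade))   # defined
```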
For broad contrasts in the extended vocabulary, all verbs were classified as mostly regular or mostly irregular (I < 0.5 and I ≥ 0.5, respectively). However, the remainder of the analysis leveraged the availability of a scalar I by contrasting regular and irregular roots with active roots in the core vocabulary. Root types with I ≤ 0 + ε were labelled as regulars, and root types with I ≥ 1 − ε as irregulars. When considering decades separately, active types are only considered active in decades where 0 + ε < I < 1 − ε, and are considered (ir)regular elsewhere. However, across the entire time period, stable regulars are types with I ≤ 0 + ε in every decade, and stable irregulars are types with I ≥ 1 − ε in every decade. Consequently, types with 0 + ε < I < 1 − ε in at least one decade were labeled as active across the time period.
The extended vocabulary presented with 6885 unique verb types. In order to examine the contribution of genuinely new verbs, exclusive of the contribution of new verbs which used an existing verb root productively, verbs were collapsed by their roots. This was particularly important since the use of existing irregular verb roots contributed in part to the introduction of "new" irregular types in the period. To this end, each of the 6885 verb types was assigned a root. The vast majority of verbs were monomorphemic and thus identical to their roots; e.g., the verb usher is identical to the root usher in all decades. Many multimorphemic verbs were also identical to their roots, since words were only classed by free verb roots, as irregularity can only be "inherited" from a verb root [7]. For example, the verbs slave and enslave constitute separate verbs as well as separate roots, since they derive from the noun slave, and thus do not share a verb root. Collapsing verbs by their roots resulted in a reduction in the number of unique types in the extended vocabulary, to 5791. Tables S3 and S4 summarize types by decade in terms of verbs and roots, respectively. The values of f and I for roots were re-calculated with the number of root tokens over decade size (for f) and the total number of irregular past tense tokens over the total number of past tense tokens for the entire root (I).²
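The re-calculation of f and I for roots can be sketched as a simple aggregation over the verbs that share a root. This is an illustrative simplification, not the authors' pipeline: it uses a single decade size for both quantities (whereas f is based on the confined set and I on the unconfined one), it ignores the undefined threshold, and the verb-to-root mapping and all counts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

# verb -> (all tokens, irregular past tense tokens, all past tense tokens)
VerbStats = Dict[str, Tuple[int, int, int]]


def collapse_by_root(
    verb_stats: VerbStats, root_of: Dict[str, str], decade_size: int
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Return root -> (f, I), pooling the tokens of all verbs sharing that root."""
    totals = defaultdict(lambda: [0, 0, 0])
    for verb, (tokens, irregular_past, total_past) in verb_stats.items():
        root = root_of.get(verb, verb)  # verbs without an entry are their own root
        totals[root][0] += tokens
        totals[root][1] += irregular_past
        totals[root][2] += total_past
    result = {}
    for root, (tokens, irregular_past, total_past) in totals.items():
        f = tokens / decade_size
        i = irregular_past / total_past if total_past else None
        result[root] = (f, i)
    return result


if __name__ == "__main__":
    stats = {"do": (100, 20, 20), "undo": (10, 2, 2),
             "slave": (5, 0, 1), "enslave": (5, 0, 1)}
    roots = {"undo": "do"}  # slave and enslave remain separate roots
    print(collapse_by_root(stats, roots, decade_size=1000))
```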
Of 200 unique irregular verb types (151 root types) in the corpus, 18 (9%) of these appear after 1840, while 22 (11%) are lost, making for a small net decrease in the number of irregular verb types (the remaining 80% are in the core vocabulary). In the case of regulars, the birth rate observed is not only much greater than that of irregulars, but it dwarfs the death rate; 26.3% of all regular verb types in the corpus are born after 1840, while only 14.3% of verbs are lost.
Calculation of root types shows that the 18 entrances of irregular verbs occur either because of definition or root proliferation (i.e., the productivity of an existing irregular root, as in do-undo). Definition occurs when a verb acquires sufficient frequency of usage in the past tense for its regularity to be reliably defined; i.e., it moves out of the undefined category. Four of the 18 entering irregulars (approximately 22%) are undefined in the early decades, but enter as mostly irregular (I ≥ 0.5) at their first occurrence and remain mostly irregular throughout their lifetime. The largest percentage of irregular verb birth is accounted for by the proliferation of irregular roots; 11 of the 18 nascent irregulars (just over 60%) are multi-morphemic verbs using an existing irregular root, such as outdo and override. The remaining three new irregular verbs (constituting 17%) are not entrances; rather, they are instances of irregularization: verbs which at their first occurrence were regular, but by the final decade have become irregular. If verbs are collapsed by their roots (eliminating the process of root proliferation as a mechanism of birth), this leaves only 7 new irregular roots (of 151 unique root types total): 3 irregularizations and 4 definitions.
Unlike for irregulars, root proliferation is not a major contributing force behind new regular types, since collapsing regular types into roots has little effect on the observed rates of birth and death (adjusting them to 26% and 13.8%, respectively). Definition accounts for a large proportion of new regular verb types, with 78.7% of new regulars becoming defined as regular sometime after 1840 (although they occur with some f in early decades). Verbs entering the system form the second largest source of new regulars, accounting for almost 21% of new regulars. Lastly, regularization accounts for a small minority of new regulars, with only three verbs regularizing completely, constituting less than 0.5% of regular verb growth. In other words, defining and entering verbs skew drastically towards being regular, and a growth of the number of types over time is the primary force driving an overall increase in regularity in the language system.

S5: Phonological classes
All verbs which had a non-zero I at any time during the 160 years examined were classed according to the change from the infinitive form to the irregular past tense form. A full list of all 52 classes and their members is provided in Table S5. Verbs which exhibited multiple irregular forms were phonologically classified based on their most frequent irregular form (i.e., swing was classed with ring instead of string, although the form swung did occur in a minority). Suppletive forms such as go and be were in their own class, and forms such as slay/slew and lay/lay did not class identically with any other verbs, and were thus also classed alone. Each class has an f defined as the sum of the frequencies of its root members, while the I in each class is calculated by dividing the sum of irregular past tense tokens in the class by the sum of all past tense tokens.
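The class-level quantities described here are pooled over member roots: f is a sum of member frequencies, and I is a ratio of summed token counts (not an average of member I values). A minimal sketch with hypothetical names and made-up counts:

```python
from typing import Iterable, Optional, Tuple

# One member root: (f, irregular past tense tokens, all past tense tokens)
Member = Tuple[float, int, int]


def class_stats(members: Iterable[Member]) -> Tuple[float, Optional[float]]:
    """Return (f, I) for a phonological class in one decade."""
    members = list(members)
    f_class = sum(f for f, _, _ in members)
    irregular = sum(irr for _, irr, _ in members)
    past = sum(total for _, _, total in members)
    i_class = irregular / past if past else None
    return f_class, i_class


if __name__ == "__main__":
    # Hypothetical two-member class: one fully irregular root and one root
    # that is irregular in only 4 of its 100 past tense tokens.
    print(class_stats([(1e-4, 50, 50), (2e-4, 4, 100)]))
```

Because the ratio is taken over pooled tokens, a high frequency member weighs more heavily on the class I than a rare one.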
These classes may not be optimal, and could in fact be more or less fine grained. For example, this metric is coarse in that it does not take into account the presence of several complex onsets in irregulars (e.g., spr-, str-), which may affect irregularisation (in other words, it does not use overall phonological distance; see [1]). On the other hand, most systems of classification are much broader, with as few as 6-8 classes encompassing all irregular verbs [4,6].
Because an irregular form is required for class membership, class sizes are not fixed over time. In other words, while irregulars have the potential to contribute to classes, regulars do not (such that, e.g., there is no subset of regular classes with phonologically similar members). When a particular root type regularizes completely, it leaves its irregular class entirely; in other words, the class can be said to be losing a member to regularization. For example, the verb work occurs early on in the corpus with an I of 0.04 as there is still some usage of the irregular form wrought. While work has this positive I, it belongs to the teach class. However, by 1930, work has an I of 0, and has therefore left the teach class. Thus, in 1930, the teach class shrinks in overall size, and work no longer contributes to either the class's f or its I. Likewise, each class also has the potential to gain members as new irregular forms emerge (or new verbs enter as irregular, although this is rare in our data). For example, the verb ruin occurs only with the form ruined until 1880, at which point the form ruint emerges. As ruin now has an I of 0.02, it enters the burn class, contributing to both its f and I.
[Figure caption, beginning missing] ... Figure 1E in the main text, which depicts mostly regular and mostly irregular root types). This shows that the growth in number of types is mainly a consequence of entering regular types, many of which were simply previously undefined. The starting point of the first frequency bin for each decade is indicated.
Legend: active, IR, R, undef.
The variance of σ over time versus σ̄ for all classes. Only classes with a variance over time in σ > 0.05 are labelled. In each decade, for a given class with size s, we can define its variance, σ, as:
$$\sigma = \frac{1}{s} \sum_{r=1}^{s} (I_r - I_c)^2$$
where I_r is the proportion of irregularity of each member root of the class, and I_c is the overall I of the class. This figure shows a plot of the variance of σ over time against the average σ (σ̄) for each class. Classes with higher σ̄ are in the process of losing a member throughout the time period, while higher variance in σ indicates the loss of a member. For example, the mean class is slowly losing dream and lean, which have an I of 0.108 and 0.007 respectively by the final decade. The choose and strike classes both lose members (behoove and climb, respectively) in the time period. Classes with low variance in σ over time and low σ̄ are highly stable (and/or have only a single member, e.g., the be class).
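The two quantities plotted for each class can be computed as follows. This is a hedged sketch: it assumes that σ is the mean squared deviation of the member values I_r from the class value I_c, normalized by the class size s, and that the variance of σ over time is likewise a population variance. Neither normalization is confirmed by the text, and the function names and numbers are hypothetical.

```python
from typing import Sequence, Tuple


def class_sigma(member_i: Sequence[float], class_i: float) -> float:
    """Within-class variance for one decade (assumed 1/s normalization)."""
    s = len(member_i)
    return sum((i_r - class_i) ** 2 for i_r in member_i) / s


def mean_and_variance(sigmas: Sequence[float]) -> Tuple[float, float]:
    """Average sigma over decades and its variance over time (population form)."""
    n = len(sigmas)
    mean = sum(sigmas) / n
    variance = sum((x - mean) ** 2 for x in sigmas) / n
    return mean, variance


if __name__ == "__main__":
    # Hypothetical class observed in two decades: both members fully irregular
    # at first, then one member regularizes while the pooled class I is 0.5.
    sigma_early = class_sigma([1.0, 1.0], class_i=1.0)
    sigma_late = class_sigma([1.0, 0.0], class_i=0.5)
    print(mean_and_variance([sigma_early, sigma_late]))
```

Note that I_c is passed in separately rather than computed as the mean of the member values, since the class I is defined from pooled token counts.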