Hahahahaha, Duuuuude, Yeeessss!: A two-parameter characterization of stretchable words and the dynamics of mistypings and misspellings

Stretched words like ‘heellllp’ or ‘heyyyyy’ are a regular feature of spoken language, often used to emphasize or exaggerate the underlying meaning of the root word. While stretched words are rarely found in formal written language and dictionaries, they are prevalent within social media. In this paper, we examine the frequency distributions of ‘stretchable words’ found in roughly 100 billion tweets authored over an 8 year period. We introduce two central parameters, ‘balance’ and ‘stretch’, that capture their main characteristics, and explore their dynamics by creating visual tools we call ‘balance plots’ and ‘spelling trees’. We discuss how the tools and methods we develop here could be used to study the statistical patterns of mistypings and misspellings and be used as a basis for other linguistic research involving stretchable words, along with the potential applications in augmenting dictionaries, improving language processing, and in any area where sequence construction matters, such as genetics.

As a comparison to our normalized entropy measure for balance discussed in Sec. III B, we also compute an alternate normalized entropy measure, H_alt, that measures balance from a different view.
To compute H_alt, we first calculate the overall average stretch for each character as before, but now do so across all tokens at once. Then, we subtract one from each of these values and normalize them so they sum to 1 and can be treated as probabilities. We then compute the normalized entropy, H_alt, of these values as a measure of overall balance. H_alt is similar to H in that if each character stretches the same amount on average, the normalized entropy is 1, and if only one character in the kernel stretches, the normalized entropy is 0. Again, higher entropy corresponds to more balanced words.
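As an illustrative sketch (not the code used in our analysis), the following Python function computes H_alt under the simplifying assumptions that a character's stretch in a token is its total number of occurrences and that the token list and kernel characters are supplied directly:

import math
from collections import Counter

def h_alt(tokens, kernel_chars):
    # tokens: stretched versions of one kernel, e.g. ["haaa", "hhaa"]
    # kernel_chars: the kernel's stretchable characters, e.g. ["h", "a"]
    if len(kernel_chars) < 2:
        return 0.0
    totals = Counter()
    for token in tokens:
        counts = Counter(token)
        for ch in kernel_chars:
            totals[ch] += counts[ch]
    # Average stretch per character across all tokens at once,
    # minus 1 for the single unstretched copy of each character.
    excess = [totals[ch] / len(tokens) - 1.0 for ch in kernel_chars]
    z = sum(excess)
    if z == 0:  # nothing stretches at all
        return 0.0
    p = [e / z for e in excess]  # probability-like weights
    # Normalized Shannon entropy (log base = number of kernel characters).
    h = -sum(q * math.log(q) for q in p if q > 0)
    return h / math.log(len(kernel_chars))

print(h_alt(["haaa", "haaaa", "hhaa"], ["h", "a"]))  # ~0.59; 'a' stretches more than 'h'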
The difference is the view, and what is meant by 'on average'. For H_alt, each token is weighted equally when calculating balance. Thus, this measure corresponds to the view where one randomly samples tokens and looks at how balanced they are on average. By contrast, for H, as calculated in Sec. III B, tokens are grouped by length, and each group gets an equal weight regardless of its size. This view looks at how well balance is sustained across lengths: it corresponds to sampling tokens by first randomly picking a length, then randomly picking a token from all tokens of that length, and looking at how balanced the sampled tokens are on average. For example, for the kernel (pa), H_alt = 1.00000, signifying nearly perfect balance. However, looking at the balance plot for (pa) in Fig. A1, we see that perfect balance is not sustained across lengths. Because most of the tokens are short, and short stretched versions of (pa) are well balanced, nearly all of the weight falls on the well-balanced short versions when randomly picking tokens. However, as people create longer stretched versions of (pa), they tend to use more 'a's than 'p's, and near perfect balance is not maintained. This is better captured by the measure H = 0.80982. As our main measure of balance, we chose the view that better represents how well balanced tokens are as they are stretched, weighting lengths equally. This does have the limitation that groups of tokens with different lengths have different sizes, and some may contain only a single token, possibly increasing the variance of the measure. This could potentially be improved in the future by only including lengths that have a certain number of examples, or by creating larger bins of lengths for the longer tokens, as we do in the balance plots.
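For comparison, the sketch below illustrates the length-weighted view under the same simplifying assumptions: per-length average stretches are combined with equal weight per length group before the same shift, normalization, and entropy steps. It is meant only to make the difference in weighting concrete, not to reproduce the exact computation of Sec. III B.

import math
from collections import defaultdict

def h_length_weighted(tokens, kernel_chars):
    if len(kernel_chars) < 2:
        return 0.0
    # Group tokens by length; each length group gets equal weight.
    groups = defaultdict(list)
    for token in tokens:
        groups[len(token)].append(token)
    avg = {ch: 0.0 for ch in kernel_chars}
    for group in groups.values():
        for ch in kernel_chars:
            group_avg = sum(t.count(ch) for t in group) / len(group)
            avg[ch] += group_avg / len(groups)
    excess = [avg[ch] - 1.0 for ch in kernel_chars]
    z = sum(excess)
    if z == 0:
        return 0.0
    p = [e / z for e in excess]
    h = -sum(q * math.log(q) for q in p if q > 0)
    return h / math.log(len(kernel_chars))

# Five short, balanced tokens and one long, unbalanced one:
tokens = ["ppaa"] * 5 + ["p" + "a" * 14]
print(h_length_weighted(tokens, ["p", "a"]))  # ~0.35, versus h_alt(tokens, ["p", "a"]) ~0.76

The long, unbalanced token is diluted by the many short tokens in the token-weighted view, but receives half of the total weight in the length-weighted view, pulling the entropy down.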
We include the same plots and tables for H_alt as we did for H, and many of the observations are similar. Fig. A2 shows the two jellyfish plots for H_alt. As before, Fig. A2A is the version containing all words, and for Fig. A2B we remove the words that have an entropy of 0. The top of each plot in Fig. A2 shows the corresponding frequency histogram. As before, after removing kernels with an entropy of 0, we see a small left-shift in the highest ranked kernels, and then the distribution largely stabilizes. Again, the highest ranked kernels tend to be more equally balanced, and kernels that stretch only a single character tend to be lower ranked. Table A1 shows the kernels with the ten largest entropies and Table A2 shows those with the ten smallest nonzero entropies as measured in this alternate way. We observe that the kernels with the largest entropies are all of the form (l_1 l_2) and are almost perfectly balanced under the view of equally weighting all tokens. The kernels with the lowest entropies all expand to regular words that, when spelled in the standard way, contain a repeated letter, and these kernels also allow other letters to stretch.
Finally, Fig. A3 shows the scatter plot of each kernel, where the horizontal coordinate is given by this alternate measure of balance, H_alt, and the vertical coordinate is again given by the kernel's measure of stretch, the Gini coefficient, G. We again see that the kernels span the two-dimensional space.
We still get the same kind of rough vertical banding that we saw in Fig. 9, and for the same reason, but we also see a curved dense band at lower entropy values, which seems to mostly contain kernels whose base word is spelled with a double letter, like 'summer' (with kernel [s][u][m][e][r]).
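For reference, the stretch coordinate can be sketched with the common rank-based form of the Gini coefficient; the example counts below are illustrative, not taken from the dataset.

def gini(counts):
    # Standard Gini coefficient of nonnegative counts; in our setting the
    # input would be a kernel's token count distribution.
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # With 1-based ranks i over sorted values x_i:
    # G = 2 * sum_i(i * x_i) / (n * total) - (n + 1) / n
    ranked_sum = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * ranked_sum / (n * total) - (n + 1.0) / n

print(gini([1000, 120, 30, 5, 1]))  # heavily skewed counts give a large G, ~0.73 here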
FIG. A3. Kernels plotted in Balance-Stretch parameter space using an alternate measure of normalized entropy for balance. Each kernel is plotted horizontally by the value of its balance parameter, given by the alternate normalized entropy, H_alt, and vertically (on a logarithmic scale) by its stretch parameter, given by the Gini coefficient, G, of its token count distribution. Larger entropy implies greater balance, and a larger Gini coefficient implies greater stretch.