A dataset for the study of identity at scale: Annual Prevalence of American Twitter Users with specified Token in their Profile Bio 2015–2020

doi:10.1371/journal.pone.0260185

Fig 1.

Process for creating datasets.

This flowchart describes the process of creating the cross-sectional and longitudinal datasets.

Table 1.

Counts of unique US users per sample and year.

Table 2.

Example rows of annual token data.

Fig 2.

Token prevalence distributions per year for the longitudinal sample.

Note that the x-axis is on a log scale. There are a small number of high-prevalence tokens and large numbers of low-prevalence tokens. This accords with general expectations of word usage.

Table 3.

The 20 most surprisingly common tokens from the longitudinal sample in 2020.

Fig 3.

Distribution of estimated annual change in prevalence for all unique tokens within the longitudinal sample.

Prevalence is stable (i.e. zero change) for many tokens. Note from the min and max annotations that extreme values are present in the data but not pictured here. See Tables 4 and 5 for illustration.

Table 4.

Top 20 winner tokens in the longitudinal sample.

Table 5.

Top 20 loser tokens in the longitudinal sample.

Fig 4.

Distribution of estimated annual change in prevalence for all unique tokens within the cross-sectional sample.

Prevalence is stable (i.e. zero change) for many tokens. Note from the min and max annotations that extreme values are present in the data but not pictured here. See Tables 6 and 7 for illustration.
