ReportAGE: Automatically extracting the exact age of Twitter users based on self-reports in tweets

Advancing the utility of social media data for research applications requires methods for automatically detecting demographic information about social media study populations, including users’ age. The objective of this study was to develop and evaluate a method that automatically identifies the exact age of users based on self-reports in their tweets. Our end-to-end automatic natural language processing (NLP) pipeline, ReportAGE, includes query patterns to retrieve tweets that potentially mention an age, a classifier to distinguish retrieved tweets that self-report the user’s exact age (“age” tweets) and those that do not (“no age” tweets), and rule-based extraction to identify the age. To develop and evaluate ReportAGE, we manually annotated 11,000 tweets that matched the query patterns. Based on 1000 tweets that were annotated by all five annotators, inter-annotator agreement (Fleiss’ kappa) was 0.80 for distinguishing “age” and “no age” tweets, and 0.95 for identifying the exact age among the “age” tweets on which the annotators agreed. A deep neural network classifier, based on a RoBERTa-Large pretrained transformer model, achieved the highest F1-score of 0.914 (precision = 0.905, recall = 0.942) for the “age” class. When the age extraction was evaluated using the classifier’s predictions, it achieved an F1-score of 0.855 (precision = 0.805, recall = 0.914) for the “age” class. When it was evaluated directly on the held-out test set, it achieved an F1-score of 0.931 (precision = 0.873, recall = 0.998) for the “age” class. We deployed ReportAGE on a collection of more than 1.2 billion tweets, posted by 245,927 users, and predicted ages for 132,637 (54%) of them. Scaling the detection of exact age to this large number of users can advance the utility of social media data for research applications that do not align with the predefined age groupings of extant binary or multi-class classification approaches.


Introduction
Considering that 72% of adults in the United States use social media [1], it has been widely utilized as a source of data in a variety of research applications. However, a common limitation of this research is that users of particular platforms are not representative of the general population [2]. Thus, advancing the utility of social media data requires methods for automatically detecting demographic information about social media study populations, including users' age. Most studies have approached the automatic detection of age as binary classification [3,4] or multi-class classification [5][6][7][8][9] of predefined age groups. These studies first identify the age of users based on their or other users' posts, their profile metadata, or external information, and then evaluate the prediction of the users' age group based on modeling a large collection of their posts [3][4][5][6], their profile metadata [7], a combination of their posts and profile metadata [8], or their followers and followers' friends [9]. While the automatic classification of age groups may be a suitable approach for specific demographic inquiries about social media users, the fact that the number and range of the age groups vary across studies suggests that this approach is not generalizable to all applications.
Automatically identifying the exact age of social media users, rather than their age groups, would enable the large-scale use of social media data for applications that do not align with the predefined groupings of extant binary or multi-class models, such as identifying specific age-related risk factors for observational studies [10], or selecting age-based study populations [11].
Nguyen et al. [5] have developed a regression model for automatically identifying the exact age of Dutch Twitter users, but their evaluation was based in part on annotations of perceived age, which may be influenced by humans' systematic biases and, thus, differ from the users' actual age [12]. For this reason, the annotators in Nguyen et al.'s [5] study were asked to assess how confident they were in identifying exact age, with a margin of error of up to ten years. Sloan et al. [13] have developed rules to automatically identify self-reports of Twitter users' actual age, but their high-precision approach extracts age only from users' profile metadata, and was able to automatically detect age for only 1,470 (0.37%) of 398,452 users.
The objective of this study was to develop and evaluate a method that automatically identifies the exact age of users based on self-reports in their tweets. Individual tweets have been used to manually verify self-reports of age for evaluating prediction models, but, to the best of our knowledge, methods to extract self-reports in tweets have not been scaled for automatically identifying users' age. For example, Al Zamal et al. [3] have identified users' age by searching for self-reported birthday announcements in tweets (e.g., happy ##(st|nd|rd|th) birthday to me), but then used the ages to evaluate a binary model that infers an age group from 1,000 of the users' most recent tweets, based on linguistic differences associated with age. Similarly, Morgan-Lopez et al. [8] have identified users' age by searching for self-reported birthday announcements in tweets, but then used the ages to evaluate a multi-class model that infers an age group from 200 of the users' most recent tweets and their profile metadata. These high-recall approaches can potentially infer an age group for any user, since they do not rely on explicit reports of age. For this same reason, however, they are not designed for identifying users' exact age, which limits their application beyond predefined groupings.
In this paper, we present ReportAGE (Recall-Enhanced Pipeline for Obtaining Reports of Tweeters' Ages Given Exactly), situated in the gap between rule-based (high-precision) and predictive modeling (high-recall) approaches. ReportAGE utilizes individual tweets as a resource to identify exact age for a large number of users, overcoming the sparse reports of age in users' profiles. A tweet-based approach, however, does present challenges in natural language processing (NLP). Query patterns that have high precision within the constraints of a user's profile (e.g., years old) would return significantly more noise from tweets, and, while a small number of patterns may capture most of the ways in which users express their age in a profile, tweets afford a wider range of expressions. These expressions may require deriving the user's age from references to the past or future, whereas ages in profiles are likely to refer to the present. To address these challenges, we have designed ReportAGE as an end-to-end NLP pipeline that includes high-recall query patterns, a deep neural network classifier, and rule-based extraction, which we describe in the next section.

Methods
The Institutional Review Board (IRB) of the University of Pennsylvania reviewed this study and deemed it to be exempt human subjects research under Category (4) of Paragraph (b) of the U.S. Code of Federal Regulations, Title 45, Section 46.101, for publicly available data sources (45 CFR §46.101(b)(4)).

Data collection
In previous work [10], we manually annotated more than 100,000 tweets in approximately 200 users' timelines, including reports of age. In the present study, we leveraged these annotations to develop handwritten, high-recall regular expressions (search patterns designed to automatically match text strings) to retrieve tweets that potentially mention a user's age between 10 and 99.
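The paper does not publish its 26 query patterns, so the following is a minimal, hypothetical sketch of what such high-recall patterns might look like; the specific expressions and the `AGE_QUERY_PATTERNS` and `matches_age_query` names are illustrative, not the authors'.

```python
import re

# Illustrative, hypothetical stand-ins for the paper's 26 high-recall
# query patterns; each matches a possible mention of an age from 10 to 99.
AGE_QUERY_PATTERNS = [
    re.compile(r"\bi(?:'| a)?m\s+([1-9][0-9])\b", re.IGNORECASE),            # "I'm 25", "I am 25"
    re.compile(r"\bturn(?:ing|ed|s)?\s+([1-9][0-9])\b", re.IGNORECASE),      # "turning 30"
    re.compile(r"\b([1-9][0-9])(?:st|nd|rd|th)\s+birthday\b", re.IGNORECASE) # "21st birthday"
]

def matches_age_query(tweet: str) -> bool:
    """Return True if the tweet potentially mentions a user's age (10-99)."""
    return any(p.search(tweet) for p in AGE_QUERY_PATTERNS)
```

Because these patterns are tuned for recall, they deliberately over-match (e.g., "turning 30 mph"); the downstream classifier is what filters out the noise.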
We deployed 26 regular expressions on two collections of public tweets: (1) more than 1.1 billion tweets (413,435,160 users) from the 1% Twitter Sample Application Programming Interface (API), collected between March 2015 and September 2019, and (2) more than 1.2 billion tweets posted by 245,927 users who have announced their pregnancy on Twitter [14].
On the one hand, the Twitter Sample API allows us to model the general detection of exact age based on the demographics of Twitter users. On the other hand, detecting the exact age of users who have announced their pregnancy on Twitter represents challenges that may be posed by more specific applications, for example, disambiguating the age of the user from the gestational age of the baby. After automatically ignoring retweets and removing "reported speech" (e.g., quotations, news headlines) [15], the regular expressions matched 1,340,015 tweets from the Twitter Sample API, and 997,486 tweets from the pregnancy collection.

Annotation
To train and evaluate supervised machine learning algorithms, annotation guidelines were developed to help five annotators distinguish tweets that self-report a user's exact age ("age" tweets) from those that do not ("no age" tweets). For tweets that were annotated as "age," the annotators also identified the user's exact age that the tweet explicitly or implicitly reports. The annotators independently annotated a random sample of 11,000 of the 2,337,501 matching tweets, 5,500 posted by unique users in each of the two collections. Among the 11,000 tweets, 10,000 were dual annotated, and 1,000 were annotated by all five annotators. Based on the 1,000 tweets that were annotated by all five annotators, the inter-annotator agreement for distinguishing "age" and "no age" tweets was 0.80 (Fleiss' kappa). For the "age" tweets on which the annotators agreed, the inter-annotator agreement for identifying the user's age was 0.95 (Fleiss' kappa). The first author of this paper resolved the class and age disagreements among the 11,000 tweets. Upon resolving the disagreements, 3,543 (32%) of the tweets were annotated as "age," and 7,457 (68%) as "no age." Table 1 illustrates some of the challenges of training machine learning algorithms to automatically distinguish "age" and "no age" tweets. Tweet 3 does not specify when the user will be 21, but it would be annotated as "age" under the assumption that the tweet is referring to the user's next birthday. Tweet 4, however, would be annotated as "no age" because it is ambiguous whether the user was 21 when the tweet was posted or whether the user is referring to a future age. Tweet 5 also would be annotated as "no age" because it is ambiguous whether the user was 18 when the tweet was posted or whether the user is referring to an age further in the past. Of course, tweets also would be annotated as "no age" if they obviously do not refer to the user or an age.

Classification
We used the 11,000 annotated tweets in experiments to train and evaluate supervised machine learning algorithms for binary classification of "age" and "no age" tweets. For the classifiers, we used the WLSVM Weka integration of the LibSVM [16] implementation of Support Vector Machine (SVM), and two deep neural network classifiers based on bidirectional encoder representations from transformers (BERT): the BERT-Base-Uncased [17] and RoBERTa-Large [18] pretrained transformer models in the Flair Python library. We split the tweets into 80% (training) and 20% (test) random sets, stratified based on the distribution of "age" and "no age" tweets.
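The stratified 80/20 split can be sketched as follows; this uses scikit-learn rather than the authors' exact tooling, and the toy tweets and labels are illustrative, not from the annotated corpus.

```python
# A minimal sketch of an 80/20 split stratified on the "age"/"no age" labels.
from sklearn.model_selection import train_test_split

tweets = [
    "i'm 25 today", "turning 30 next month", "just turned 18",
    "my 21st birthday", "i am 40 now",                      # "age" examples
    "my dog is 3", "room 12 is open", "bus 42 was late",
    "see you at 5", "chapter 19 is long",                   # "no age" examples
]
labels = ["age"] * 5 + ["no age"] * 5  # toy annotations

train_x, test_x, train_y, test_y = train_test_split(
    tweets, labels,
    test_size=0.20,     # 20% held-out test set
    stratify=labels,    # preserve the "age"/"no age" class distribution
    random_state=42,
)
```

Stratification matters here because the classes are imbalanced in the real data (32% "age" vs. 68% "no age"); a plain random split could skew the test-set class distribution.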
For the SVM classifier, we first preprocessed the tweets by normalizing URLs, usernames, and digits.

Extraction
We used the 2,834 "age" tweets in the training set to develop a rule-based module that automatically extracts the exact age from "age" tweets. First, the module preprocesses the "age" tweets, including replacing spelled-out numbers with digits, removing URLs and usernames, which may contain digits, and removing spaces and other non-alphanumeric characters between digits (e.g., the big 3-0). Then, the module uses an optimized sequence of 87 handwritten regular expressions to match linguistic patterns containing two consecutive digits. Finally, the module applies a simple mathematical operation to the digits in the matching pattern, based on the regular expression that the tweet matches. If a tweet does not match any of these patterns, the module simply extracts the first two-digit group from the tweet. Table 2 also illustrates the importance of optimizing the order in which the tweets match the query patterns. For example, Tweet 7 and Tweet 8 both report that the user turned an age, but if the pattern in Tweet 8 were applied before the pattern in Tweet 7, the age would be incorrectly extracted from Tweet 7 as 21. In addition, Tweet 7 illustrates that some patterns define the time period (three times) as the second group of digits (3), rather than the first (21), a distinction that is especially important for age extraction rules that are based on subtraction, as in Tweet 9. In Tweet 9, however, the subtracted unit of time is weeks, rather than years, so, as Fig 1 illustrates, the time period would be converted to years by dividing the second group of digits (3) by 52, then rounding up to the nearest integer (1). While turning in Tweet 9 refers to a future age (18), it refers to a present age in Tweet 10 (21), in which case the age is extracted simply as the matching group of digits.

Results and discussion
For 61 (84%) of the 73 false positive tweets that were automatically classified correctly but from which the age was extracted incorrectly, the extracted age was only one year different from the annotated age; in particular, it was one year less for 43 (70%) of these 61 false positives. Among these, many of the tweets refer to a birthday (19 false positives), as in Tweet 1 in Table 3, or to turning an age (12 false positives), as in Tweet 2. Among the 68 false positives that were automatically classified incorrectly as "age," the digits in 58 (85%) of them do refer to an age. Among these 58 false positives, 21 (36%) self-report an age that the annotators had determined was temporally ambiguous, as in Tweet 3, and 14 (24%) that had been determined not to be self-reports do include a personal reference to the user elsewhere, as in Tweet 4. In contrast, 20 (37%) of the 54 false negatives that were automatically classified incorrectly as "no age" do not explicitly refer to the user, as in Tweet 5.
We deployed ReportAGE end to end on the 1.2 billion tweets in our pregnancy collection [14].
In contrast to the Twitter Sample API, our pregnancy collection contains users' timelines (i.e., all of their tweets posted over time), which enables us to estimate the proportion of users for whom an exact age can be detected; ReportAGE predicted an age for 132,637 (54%) of the 245,927 users. The performance (F1-score = 0.855) and coverage (54% of users) of ReportAGE suggest that our tweet-based approach can detect exact age for many more users than an approach based on extracting self-reports of age from their profiles [13]. Because ReportAGE deploys a classifier on only tweets that match regular expressions, it can scale to this large number of users without modeling hundreds or thousands of each user's posts [3][4][5][6][8]. The regular expressions also help address the selection bias noted by Morgan-Lopez et al. [8] as a limitation of their study, that is, evaluating a model for detecting any user's age based only on a population of users who announce their birthday in tweets. Because the regular expressions in ReportAGE are the same ones we used to collect the tweets in our annotated training and test sets, ReportAGE extracts ages only for users who have posted tweets that match patterns represented in our evaluation of performance.
Our evaluation of coverage, however, may reflect a selection bias towards users who are in an age group associated with pregnancy, that is, if users who are younger report their age on Twitter more often than users who are older do. Deploying ReportAGE on our pregnancy collection also exemplifies a limitation of extracting age from tweets in users' timelines.
Whereas deploying ReportAGE in real time, for example, directly on the tweets collected from the Twitter Streaming API, would extract a user's present age, deploying ReportAGE on the tweets in a user's timeline may extract a past age. Normalizing a past age to the user's exact age in the present or at another point in time, which is beyond the scope of this study, would be limited for users whose "age" tweets do not specify their birthday; however, if the past age were correctly extracted, the margin of error would be only one year, and this same limitation would apply to normalizing ages in profiles to points in the past. Directions for future work include normalizing past ages to the user's age at present or at another point in time.
We further preprocessed the tweets for the SVM classifier by removing non-alphanumeric characters (e.g., punctuation) and extra spaces, and by lowercasing and stemming [19] the text. Following preprocessing, we used Weka's default NGram Tokenizer to extract word n-grams (n = 1-3) as features in a bag-of-words representation. During training, each tweet was converted to a vector representing the numeric occurrence of n-grams among the n-grams in the training data. We used the radial basis function (RBF) kernel and set the cost at c = 32 and the class weights at w = 1 for the "no age" class and w = 2 for the "age" class, based on iterating over a range of values to optimize performance using 10-fold cross-validation over the training set. We scaled the feature vectors before applying the SVM for classification. For the BERT-based classifiers, we preprocessed the tweets by normalizing URLs and usernames, and lowercasing the text. After assigning vector representations to the tweet tokens based on the pretrained BERT model, the encoded representation is passed to a dropout layer (drop rate of 0.5), followed by a softmax layer that predicts the class for each tweet. For training, we used Adam optimization, 10 epochs, and a learning rate of 0.0001. During training, we fine-tuned all layers of the transformer model with our annotated tweets. To optimize performance, the model was evaluated after each epoch on a 5% split of the training set.
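The SVM configuration above can be approximated in scikit-learn as follows; this is a sketch of an analogue to the Weka/LibSVM setup (word 1-3-grams, scaled features, RBF kernel, C = 32, double weight on the "age" class), not the authors' exact pipeline, and the toy training data is illustrative.

```python
# A scikit-learn analogue of the described Weka/LibSVM configuration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 3), lowercase=True),   # word n-grams, n = 1-3
    MaxAbsScaler(),                                        # scale feature vectors (sparse-safe)
    SVC(kernel="rbf", C=32, class_weight={"no age": 1, "age": 2}),
)

# Toy training data for illustration only
train_x = ["i am 25 today", "turning 30 tomorrow", "my cat is 3", "room 42 is free"]
train_y = ["age", "age", "no age", "no age"]
clf.fit(train_x, train_y)
preds = clf.predict(["i am 25 today", "room 42 is free"])
```

Doubling the weight of the "age" class penalizes misclassifying the minority class more heavily, which mirrors the class imbalance (32% "age") in the annotated data.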

Fig 1. Sample Python code for extracting age from patterns with a unit of time.
Fig 2 illustrates our end-to-end pipeline.

Fig 2. ReportAGE: an automatic natural language processing (NLP) pipeline for extracting the exact age of users based on self-reports in their tweets.

We evaluated an SVM classifier and two deep neural network classifiers on a held-out test set of 2,200 annotated tweets. For the "age" class, the SVM classifier achieved an F1-score of 0.772 (precision = 0.734, recall = 0.814); the classifier based on the BERT-Base-Uncased pretrained model achieved an F1-score of 0.879 (precision = 0.826, recall = 0.941); and the classifier based on the RoBERTa-Large pretrained model achieved an F1-score of 0.914 (precision = 0.905, recall = 0.942), where F1-score = (2 × precision × recall) / (precision + recall).
Tweets were annotated as "age" if the user's exact age could be determined, from the tweet, at the time the tweet was posted. In Table 1, Tweet 1 is a straightforward example of an "age" tweet, in which the user's exact age is explicitly stated. Although Tweet 2 does not explicitly state the user's age, it can be inferred from the fact that the user reports turning 20 tomorrow.

Table 2 presents examples of matching patterns (bold) in "age" tweets and their associated age extraction rules.

The users refer to my birthday in Tweet 1 (my 21st), Tweet 2 (my 18th), and Tweet 3 (my 18th), but the age extraction rule is different for each of these tweets based on the linguistic context in which this reference occurs. After preprocessing, the context in Tweet 1 includes an additional group of digits referring to a future time period (two more years until). Because the unit of time is years, the age is defined by simply subtracting the first group of digits (2) from the second (21). Fig 1 provides the Python code illustrating how our extraction module automatically identifies the age for Tweet 1. Tweet 4 includes an additional group of digits referring to a past time period (20 yrs ago), so the age is defined by adding the first group of digits (20) to the second (19). Tweet 5 also includes an additional group of digits referring to the past (28), but the reference is to a past age (at 28), rather than a time period, so the age is defined by the greater of the two groups of digits (35). Tweet 4, Tweet 5, and Tweet 6 illustrate how the age extraction rules vary depending on the specific pattern in which at (the age of) occurs.

Table 3 presents examples of false positives and false negatives.