Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach

doi:10.1371/journal.pone.0073791

Figure 1.

The infrastructure of our differential language analysis.

1) Feature Extraction. Language use features include: (a) words and phrases: sequences of 1 to 3 words found using an emoticon-aware tokenizer and a collocation filter (24,530 features); (b) topics: automatically derived groups of words, each representing a single topic, found using the Latent Dirichlet Allocation technique [72], [75] (500 features). 2) Correlational Analysis. We find the correlation (the coefficient of an ordinary least squares linear regression) between each language feature and each demographic or psychometric outcome. All relationships presented in this work are significant at a Bonferroni-corrected threshold or stricter [76]. 3) Visualization. Graphical representation of the correlational analysis output.
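
The first two steps of this caption can be illustrated with a minimal sketch. It is not the authors' implementation: the regular expression, the function names, and the 0.05 starting alpha are all hypothetical, and the collocation filter and the LDA topics are left out. The only figures taken from the caption are the feature counts (24,530 and 500). With a single predictor and no adjustment variables, the standardized ordinary least squares slope equals the Pearson correlation, which is what `standardized_beta` computes.

```python
import math
import re

# Hypothetical emoticon-aware pattern: a few common emoticons are kept as
# single tokens; everything else is split into lowercase word tokens.
TOKEN_RE = re.compile(r"[:;=][\-']?[)(dp/\\|]|<3|[a-z0-9']+")


def tokenize(text):
    """Split text into word and emoticon tokens (input is lowercased first)."""
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens, max_n=3):
    """Return all 1- to max_n-word sequences, joined with underscores."""
    return [
        "_".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def standardized_beta(x, y):
    """Standardized OLS slope of y on x (equal to Pearson's r for one predictor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Feature counts come from the caption; alpha = 0.05 is an assumed example value.
n_features = 24530 + 500
threshold = bonferroni_threshold(0.05, n_features)

features = ngrams(tokenize("So happy :) <3"))
```

Here `features` holds the unigrams, bigrams, and trigrams of one message, with the emoticons kept as tokens of their own, and `threshold` is the p-value each individual feature test would have to beat under the assumed alpha.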

Figure 2.

Correlation values of LIWC categories with gender, age, and the five factor model of personality.

[34] Sources of the values shown: effect sizes as Cohen's d values from Newman et al.'s recent study of gender (positive is female) [30]; standardized linear regression coefficients adjusted for sex, writing/talking, and experimental condition from Pennebaker and Stone's study of age [27]; Spearman correlation values from Yarkoni's recent study of personality; and, for the current study over Facebook, standardized multivariate regression coefficients adjusted for gender and age. Values that are not significant are marked as such for each source (Bonferroni-corrected for the current study).
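
A standardized multivariate regression coefficient "adjusted for" other variables can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the names `zscore`, `solve`, and `adjusted_beta` are hypothetical, the data are made up, and the caption does not say which variable is treated as the response, so that choice is left to the caller. All variables are z-scored, so no intercept is needed, and the coefficients are obtained from the normal equations.

```python
import math


def zscore(values):
    """Rescale values to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def adjusted_beta(predictor, response, covariates):
    """Standardized coefficient on `predictor`, holding `covariates` fixed."""
    cols = [zscore(c) for c in [predictor] + covariates]
    y = zscore(response)
    k = len(cols)
    # Normal equations (X'X) beta = X'y on the standardized columns.
    xtx = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(cols[i], y)) for i in range(k)]
    return solve(xtx, xty)[0]


# Made-up example data: five people, one predictor, two adjustment variables.
trait = [1.0, 2.0, 3.0, 4.0, 5.0]
age = [3.0, 1.0, 4.0, 1.0, 5.0]
gender = [0.0, 1.0, 0.0, 1.0, 1.0]
word_use = [0.2, 0.1, 0.5, 0.4, 0.9]

beta = adjusted_beta(trait, word_use, [age, gender])
```

The coefficient returned is the one attached to the first column only; the coefficients on age and gender are computed but discarded, which is what adjusting for them amounts to in this setup.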

Figure 3.

Words, phrases, and topics most highly distinguishing females and males.

Female language features are shown on top and male features below. The size of a word indicates the strength of the correlation; its color indicates relative frequency of usage. Underscores (_) connect the words of multiword phrases. Words and phrases are in the center; topics, represented as the 15 most prevalent words, surround them. (Correlations are adjusted for age and Bonferroni-corrected.)

Table 1.

Summary statistics for gender, age, and the five factor model of personality.

Figure 4.

Words, phrases, and topics most distinguishing subjects aged 13 to 18, 19 to 22, 23 to 29, and 30 to 65.

Ordered from top to bottom: 13 to 18, 19 to 22, 23 to 29, and 30 to 65. Words and phrases are in the center; topics, represented as the 15 most prevalent words, surround them. (Correlations are adjusted for gender and Bonferroni-corrected.)

Figure 5.

Standardized frequency of topics and words across age.

A. Standardized frequency for the best topic for each of the 4 age groups. Grey vertical lines divide the groups: 13 to 18 (black), 19 to 22 (green), 23 to 29 (blue), and 30+ (red). Lines are fit with first-order LOESS regression [81], controlled for gender. B. Standardized frequency of social topic use across age. C. Standardized ‘I’ and ‘we’ frequencies across age.
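
For readers unfamiliar with the smoothing used in these panels, the following is a minimal sketch of first-order (locally linear) LOESS applied to standardized frequencies. It is an illustration only: the tricube weighting, the `frac` span parameter, and the function names are assumptions and not details given in the caption, the data are synthetic, and the control for gender is not reproduced.

```python
import math


def standardize(values):
    """Convert raw frequencies to z-scores (mean 0, standard deviation 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def loess_first_order(xs, ys, x0, frac=0.5):
    """Fit a tricube-weighted straight line around x0 and evaluate it at x0."""
    n = len(xs)
    k = max(2, math.ceil(frac * n))  # number of neighbours in the local window
    dists = sorted(abs(x - x0) for x in xs)
    # Bandwidth: distance to the k-th nearest point, nudged up so that the
    # window never collapses to all-zero weights.
    h = dists[k - 1] * (1 + 1e-9) + 1e-12
    w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(w)
    mean_x = sum(wi * x for wi, x in zip(w, xs)) / sw
    mean_y = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mean_x) ** 2 for wi, x in zip(w, xs))
    if sxx == 0:
        return mean_y  # all weighted points share one x; fall back to the mean
    sxy = sum(wi * (x - mean_x) * (y - mean_y) for wi, x, y in zip(w, xs, ys))
    return mean_y + (sxy / sxx) * (x0 - mean_x)


# Synthetic example: one frequency value per age from 13 to 65.
ages = list(range(13, 66))
raw_freq = [0.01 + 0.0002 * (a - 13) for a in ages]
z_freq = standardize(raw_freq)

smoothed = [loess_first_order(ages, z_freq, a) for a in ages]
```

Because each local fit is a straight line, data that are already linear in age pass through unchanged; curvature in real data is captured by letting the local line change from one age to the next. A smaller `frac` follows the data more closely, a larger one gives a smoother curve.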

Figure 6.

Words, phrases, and topics most distinguishing extraversion from introversion and neuroticism from emotional stability.

A. Language of extraversion (left, e.g., ‘party’) and introversion (right, e.g., ‘computer’). B. Language distinguishing neuroticism (left, e.g., ‘hate’) from emotional stability (right, e.g., ‘blessed’). Results are adjusted for age and gender and Bonferroni-corrected. Figure S8 contains results for openness, conscientiousness, and agreeableness.

Table 2.

Comparison of LIWC and open-vocabulary features within predictive models of gender, age, and personality.
