Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach
1) Feature Extraction. Language use features include: (a) words and phrases: sequences of 1 to 3 words found using an emoticon-aware tokenizer and a collocation filter (24,530 features); (b) topics: automatically derived groups of words, each representing a single topic, found using the Latent Dirichlet Allocation technique (500 features).
2) Correlational Analysis. We find the correlation (the coefficient of an ordinary least squares linear regression) between each language feature and each demographic or psychometric outcome. All relationships presented in this work are significant at least at a Bonferroni-corrected threshold.
3) Visualization. Graphical representation of the correlational analysis output.
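Steps 1 and 2 can be illustrated with a minimal sketch, which is not the authors' implementation. The token pattern, the function names, and the normalization by token count are all assumptions made for illustration. The collocation filter and the topic features are omitted. The sketch relies on the fact that, for a single standardized predictor, the ordinary least squares slope equals the Pearson correlation; any control variables in the regression are not modeled here.

```python
import math
import re
from collections import Counter

# Hypothetical emoticon-aware pattern (an assumption, not the paper's tokenizer):
# a few common emoticons and "<3" are kept as single tokens, then plain words.
TOKEN_RE = re.compile(r"[:;=][-']?[)(dp]|<3|\w+(?:'\w+)?")


def tokenize(text):
    """Lowercase the text and split it into word and emoticon tokens."""
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens, max_n=3):
    """Return every contiguous sequence of 1..max_n tokens as a string."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def relative_frequencies(text):
    """Map each 1-3 gram to its count divided by the number of tokens.

    Dividing by the token count is an illustrative normalization choice.
    """
    tokens = tokenize(text)
    if not tokens:
        return {}
    counts = Counter(ngrams(tokens))
    return {gram: c / len(tokens) for gram, c in counts.items()}


def standardized_beta(x, y):
    """OLS slope of y on x after standardizing both (equals Pearson r)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no linear signal
    return sxy / math.sqrt(sxx * syy)


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests
```

In use, one would build a vector of relative frequencies for a given feature across all users, pass it with the outcome vector (for example, age) to `standardized_beta`, and compare the resulting p-value against `bonferroni_alpha(alpha, n_tests)`, where `n_tests` is the number of features tested. The p-value computation itself is left out of this sketch.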