Am I who I say I am? Unobtrusive self-representation and personality recognition on Facebook

  • Margeret Hall,

    Contributed equally to this work with: Margeret Hall, Simon Caton

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation School of Interdisciplinary Informatics, University of Nebraska at Omaha, Omaha, United States of America

  • Simon Caton

    Contributed equally to this work with: Margeret Hall, Simon Caton

    Roles Conceptualization, Funding acquisition, Methodology, Resources, Software, Supervision, Writing – original draft, Writing – review & editing

    Affiliation School of Computing, National College of Ireland, Dublin, Ireland


Across social media platforms users (sub)consciously represent themselves in a way which is appropriate for their intended audience. This has unknown impacts on studies with unobtrusive designs based on digital (social) platforms, and studies of contemporary social phenomena in online settings. A lack of appropriate methods to identify, control for, and mitigate the effects of self-representation, the propensity to express socially responding characteristics or self-censorship in digital settings, hinders the ability of researchers to confidently interpret and generalize their findings. This article proposes applying boosted regression modelling to fill this research gap. A case study of paid Amazon Mechanical Turk workers (n = 509) is presented where workers completed psychometric surveys and provided anonymized access to their Facebook timelines. Our research finds indicators of self-representation on Facebook, facilitating suggestions for its mitigation. We validate the use of LIWC for Facebook personality studies, as well as find discrepancies with extant literature about the use of LIWC-only approaches in unobtrusive designs. Using survey data and LIWC sentiment categories as predictors, the boosted regression model classified the Five Factor personality model with an average accuracy of 74.6%. The contribution of this work is an accurate prediction of psychometric information based on short, informal text.


Across platforms like Facebook, LinkedIn, Twitter, and blogging services, users (sub)consciously represent themselves in a way which is appropriate for their intended audience [1–5]. However, researchers have not yet adequately addressed controlling for self-representation, the propensity to display socially responding characteristics or effects of self-censorship in online settings [2,6], including on online social network platforms. The trove of potential online social media data is vast, but the ability of researchers to identify ground truth models, and thus to verify its authenticity, is low. This can result in misleading or wrong analyses [7–10]. As such, researchers on these platforms risk working with ‘gamified,’ or socially responding personas that go beyond efforts to contain Common Method Biases (CMB) in research design [11,12]. This leaves open the question of how well unobtrusively gathered online data align with self-reported data. In this paper, we focus on the alignment of survey methods with unobtrusive methods of gathering data from online social media.

This article has two aims:

  • To explore the relationship between offline and online personalities via survey responses and self-produced text such that;
  • Participant-influenced biases in publicly sourced data can be mitigated.

In response to these research aims, we hypothesize that self-representation can be identified by text-based attributes (Section 2) and describe a mechanism to do so in the context of Facebook studies. For this, we employed the popular crowdwork platform Amazon Mechanical Turk, receiving survey responses and anonymous Facebook Timeline data from 509 workers (Section 3). Following on from the identification of self-representation, we discuss how it can be controlled for in broad social models (Section 4). Section 5 then discusses the implications of this work, summarizes the contribution and limitations, and points out areas for future work (Section 6).

Conceptual background

Self-representation has been discussed in several works for online and offline fora. These studies discuss that one's tendency to truthfully disclose or censor personal information emanates from an associated intrinsic value [13–18]. Many methods including surveys, interviews, and (n)ethnographic research can identify self-representation from the first-person perspective. Sentiment analysis is a promising research design for the unobtrusive identification and mitigation of self-representation bias in data at a lower overall cost [19–22]. Whilst the phenomenon of representation of self occurs across all social media, Facebook lends itself well to conducting such analyses as it is larger and has a higher upper bound of characters per post than its major competitor Twitter [23], and Facebook generally has set audience boundaries [24].

Presentation of self in online social networks.

We define self-representation in accordance with Goffman [25] as controlling or guiding the impression others could make by altering the posters’ settings, appearance and manner. Goffman’s work was extended for digital fora by [3,5]. Both Hogan [3] and boyd and colleagues [5] contend self-representation is an increasingly frequent strategy in online participation and communication. In the view of Goffman and Hogan self-representation is the display of the scenario-based ideal self, rather than a pattern of deception. This view was extended by [4], who finds that self-representation online can be for expressive, communicative, or promotional purposes. However, in contrast to the work by Van Dijck [4], we define self-representation as distinct from the concept of identity contingencies [26], where self-representation is the presentation of a scenario-based idealized self and identity contingencies is the staging of a social identity marker (e.g., being a computer scientist, being from the United States) in order to highlight communal (dis)similarities. Online self-representation can be employed on social media with text typed, photos posted, emojis used, and presence/absence of group identities (among other displayed attributes).

Self-representation is also bound to time and place. In real life one must immediately respond to an interlocutor or opponent. In social networks, one has the option not to act immediately. This is even true in the case of messaging platforms using delivered/read notifications (e.g., Facebook, WhatsApp). Even these types of sites deliver notifications of messages to the front page or screen of the interface, thus allowing the user to opt to respond at a time of their choice. Local binding is functionally eliminated with online social networks [3,25]. In real life direct communication is often the social norm [27] whereas in social networks communication is more indirect. Status updates, uploaded pictures, or information inserted in the "About Me" section are not directed to anyone specifically. Although one approximately knows who may be reached, it is not known who will respond [4].

Individuals self-represent due to an increase in intrinsic value [16,25]. Across studies, honesty in online representation is valued but the ability and application of self-representation online have attractive socially reinforced benefits. Qualitative interviews (n = 100) on internet dating found that the potential for self-representation is an attractive attribute of online activities [14]. A contradicting study by [13] considered an online dating environment in order to determine the extent of self-representation by users. Results of their interviews (n = 34) indicate that the users who are more ‘honest’ in self-presentation have more success in dating. Nonetheless, all interviewees noted that in their online dating profiles they attempt to reveal themselves particularly positively, and have the same impression of the profile construction of other users. In their 116-person study, [28] describe self-representation as self-monitoring, defined as the construction of a publicly presented self for social interactions. [28] define high self-monitors as those who carefully curate their self-presentation and low self-monitors as those who are less guarded by portraying their ‘real’ selves. They find that high self-monitors are more likely to occupy preferential positions and have higher social network density than low self-monitors, measures of the relative success of a self-representation strategy and popularity situating [29].

There is still open debate on the extent of self-representation online. For example, online self-representation was challenged by [17], who find that posters describe extensions of their actual lives in their survey and nethnography of 133 Americans and 103 Germans. In a literature review, [30] argue that self-representation is contextual. Most people use Facebook to stay in touch with people met offline, so they cannot completely detach their true identity [2,31]. Utz and colleagues established in their twinned studies of 255 and 198 Dutch participants that users shorten self-descriptions to make themselves seem more interesting. When the audience is likely to be unknown, users try to present a socially aspired self-image to be ‘popular’ [29].

Emotional disclosure on Facebook.

Studies show that honest self-disclosure is generally more emphasized in real life and is different online [1,13,18]. [1] measured 185 then 37 participants in two studies, discovering that users communicate their positive emotions online more frequently via social posturing, and that negative emotions on Facebook are hardly communicated. When negative (and positive) emotions are used, they tend to cluster around user groups [32,33]. The intensity of positive emotion disclosure is often linked to one’s extraversion or neuroticism levels as measured on the Five Factor personality model of [34]. Extraverts have been found to express significantly higher frequencies of positive emotions [35–37].

Facebook’s study on self-disclosure, the typing then editing, deleting, or posting of statuses and comments from 3.9 million Facebook users, found that 71% of users self-censor in some way. Males censor more than females, and Facebook posts are more frequently regulated than comments. They find that those with higher boundaries (estimated by the amount of regulations on visibility in place for a given audience member of the posting person) self-censor more, and theorize that lack of control drives self-censorship. Given that perceived lack of control is a characteristic of neurotic personalities [38–40], active self-censoring can be understood as an expression of neuroticism on social media.

Linguistic Inquiry and Word Count (LIWC).

This section concentrates on the properties and related findings of the text analysis package Linguistic Inquiry and Word Count (LIWC). This review is not extensive, and does not cover the multiple non-LIWC tools available to measure computational affect, psychometrics, and sentiment analysis. LIWC’s premise is that it is the function and not the context of the word that matters. Latent emotional and psychological states are revealed by word function more than the words actually in use. Function words comprise approximately 55% of a given language and are difficult and expensive to manipulate [41]. Function words can detect emotional states [42–45], predict psychometrics [35,46,47], as well as gender and age [48]. LIWC has been applied to predict deception [49,50], and its output has proven to outperform humans when detecting dishonest writing samples [50]. LIWC has shown excellent precision and recall capacities with high but not overfitting correlations in the analysis of latent sentiment [51,52]. A number of studies discuss correlations between LIWC and personality as well as attempt prediction tasks based on the same [35,41,48,53–55]. Until now it has been found that machine learning approaches often perform better than LIWC-only approaches in prediction tasks [55,56].

Recent criticisms of LIWC’s fundamental approach suggest two problems: LIWC has yet to be thoroughly validated for different mediums of online social media data [10,30], and emerging studies report low correlation strength between existing scales or survey responses and online social media data [57]. Comparison studies by [58,59] found that LIWC and LIWC-based dictionaries (e.g., SentiStrength) had high levels of precision, word recognition, and agreement as well as good prediction accuracy [58]. In general, these studies reported that LIWC was among the top of the ranks of all tools tested for the metrics named above. This is likely due to LIWC’s focus on latent sentiment: It is more difficult to manipulate the latent emotional function and state of a word than actual word use [41,60].

Benchmark studies on personality and Facebook using LIWC.

Two studies closely match the approach of this work and are elucidated here. The initial study applying LIWC to assess personality traits from online discourse is the work of [35]. Yarkoni evaluated word usage and personality traits of 694 bloggers using LIWC 2001’s 66 categories (linguistic categories minus non-semantic categories). He employed a correlation analysis of all LIWC categories and the Five Factor Model with a False Discovery Rate criterion of 0.05. This work found strong correlations across and between LIWC and the Five Factor Model. His work reports a full feature vector of each LIWC correlation with the respective personality trait.

The work [48] also considers the interaction between personality as displayed by Facebook writing samples and LIWC. Schwartz and colleagues extracted Facebook data of 75,000 participants, analysing a corpus of 700 million words. They employed three techniques to predict the gender, age, and personality of participants. Firstly, they employed LIWC as a stand-alone tool. They compared this to the open vocabulary approach (a combination of words and n-grams) and a topics-based approach. Each technique was combined to evaluate the predictive power, using a Bonferroni correction in their evaluations. Schwartz and colleagues reported on a word and phrase basis the indicators of personality, age and gender as compared to Yarkoni, who reported LIWC categories. They report that gender can be predicted with between 78.4–91.9% accuracy. They report the explained variance but not prediction accuracy of the Five Factor Model.

This work differs in several aspects. In contrast to the Yarkoni work, we report regression models rather than correlations. Our models are also built to respect the high variable-to-observation ratio and therefore use boosted models (see Statistical Modelling for more details), which differs from the approach of Schwartz and colleagues.

We report word categories in the style of [35] rather than individual words in comparison to [48]. We argue that by following the dictionary-label approach we aid replicability of the study. LIWC’s dictionaries are curated and updated fairly regularly, meaning that words falling into these dictionaries will generally be recognized. By using only words and not classifiers, researchers run the risk of particular words or phrases falling out of usage in online language. In this case, the word-based approach would no longer be replicable.

We note that many studies exist in the literature that are not analysed in depth here (see, e.g., [61–64]). These studies generally employ open language approaches [61] as opposed to our concentration on the LIWC package, or employ regression modelling without enhancements from the machine learning domain as are employed in this work [38,48,61,62].

Materials and methods

Personality as a tool to detect self-representation on Facebook

Given the status of the literature, an interesting question is raised on the unknown interaction between personality types, posting on Facebook, and propensity for self-representation. A link between online self-representation and real-life personality has neither been definitively addressed in cyberpsychology nor sentiment analytics literature [1,47,64] on Facebook.

Personality is a good basis for the identification of self-representation due to its known relationships in on- and offline fora [48,64,65] and its stability [34,53,54]. Based on the findings of [35,48,53,54,62] we assume that personality is identifiable from online social media data, and that these traits can be isolated with the LIWC package. H1 and H2 support that, and serve as the expected literature-based benchmarks. H1a/b and H2a/b consider the current literature-based discussions and further hypothesize that:

  1. H1 Self-representation is characterized by withdrawing or enhancing psychometric characteristics on Facebook.
  2. H1a Positivity bias (enhanced positivity and withdrawn negativity) is a characteristic of self-representation on Facebook.
  3. H1b Enhanced confidence is a characteristic of self-representation on Facebook.
  4. H2 Personality is detectable and is not mitigated via self-representation.
  5. H2a Online self-representation cannot distort digital traces of personality to the extent that they become undetectable.
  6. H2b LIWC features detect the attributes of personality on Facebook.

Research design

This study design was reviewed by the National College of Ireland’s ethics committee and approved following a full review. The data anonymized are available under To facilitate our study, 509 Amazon Mechanical Turk (AMT) workers completed psychometric surveys via a Facebook application. Personality was measured with the Big Five Inventory introduced by [34], human flourishing with the scale presented by [66], and online social media usage with the survey of [67], modified to be used for Facebook. The modified mechanisms of [66,67] can be found in the Online Appendix (S1 Text. Online Appendix to: Am I Who I Say I Am?), and are represented as [SM#] and [HF#] henceforth. We recognize that many psychometrics exist that could be indicative of self-representation, but the ones in use are thoroughly researched and have strong literature-based benchmarks, and thus are the most appropriate for this analysis.

AMT has proven a reliable platform for conducting online behavioural experiments [68–71]. AMT has been found to be more representative of diversity than standard samples, and is similar to the standard Facebook population [30]. AMT has also been used in similar research designs where psychometrics and Facebook are simultaneously investigated [72].

An initial screening question based on the Instructional Manipulation Check was employed to minimize ‘click-through’ behaviour [30,68] and thereby increase the reliability of the results. Payments of US$ 0.74 were issued at the end of the survey, equating to 1 cent per question. Regardless of whether users’ privacy settings allowed timeline extraction, all 509 workers were paid upon survey completion. The study was launched over a 24-hour period to accommodate differences in time zones.

Participants’ data including IDs were automatically one-way hashed for user privacy, with timeline, survey, and worker payment being tied to the hashed ID. This is established as a best practice in [73]. Text-based data was automatically fed into the LIWC processing tool. A summarized privacy statement and informed consent document were presented on the entry page of the AMT HIT (Human Intelligence Task). A full privacy statement was available, detailing the uses of data and steps taken to guarantee privacy in line with [71]. At no point was identifying information available to the research team, only post-processed aggregated data [71,73,74]. After the analysis for this paper was conducted all data was destroyed to completely mitigate all possibilities of de-anonymization similar to that reported in [75] and to also ensure that the terms and conditions of the MTurk platform were not compromised.
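The one-way hashing step described above can be illustrated with a minimal sketch. The source does not state which hash function or keying scheme was used, so HMAC-SHA256, the key, the example IDs, and the function name `pseudonymize` are all assumptions made purely for illustration.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Return a keyed one-way hash (HMAC-SHA256) of a platform user ID.

    Assumption: the study's actual hash function is not reported;
    HMAC-SHA256 is used here only as a plausible stand-in.
    """
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key, IDs, and payloads, for illustration only.
key = b"study-specific-secret"
records = {}
for raw_id, kind, payload in [
    ("1000123", "survey", {"bfi_extraversion": 3.4}),
    ("1000123", "timeline", ["status update text"]),
]:
    # Survey and timeline data are tied to the hashed ID, never the raw one.
    records.setdefault(pseudonymize(raw_id, key), {})[kind] = payload
```

Because the same raw ID always maps to the same digest, survey and timeline records can be joined without the raw ID ever being stored.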

As participants completed the survey, a PHP-based Facebook application simultaneously accessed and hashed their unique Facebook ID, and via Facebook’s Open Graph API (application programming interface) accessed participants’ Facebook timelines for offline analysis (Fig 1). Workers were given an option to opt out of the HIT at the stage where it linked to their Facebook profile or abandon the HIT at any other point. Privacy-aware users were able to hide their activities from the app.

Fig 1. Workflow illustrating the steps to acquire, analyse, and interpret text data.

A Facebook popup screen detailed the types of data requested by the app. The app extracted only posts, i.e., status updates, participants made to their timelines. Other post types such as comments, shares, profile data and updates, etc. are excluded as they are not fully self-produced texts or could be excessively identifying. While this type of constraint can create researcher bias by potentially culling messages from the list of retrieved posts [76], we are considering the online presentation of self. Text produced by other users or the platform does not serve the same purpose. It is also an ethical grey zone to harvest the comments of participants’ friends without their direct consent [30].

Statistical modelling

We investigate the (dis)similarity between commonly applied methods for psychometric analysis (specifically the Five Factor personality model) with a profile constructed by applying LIWC to text data sourced from the social network platform Facebook (see Fig 2). In juxtaposing these two profiles, we statistically analyse whether there are any relationships (latent or otherwise) and/or predictive capabilities in the text-based profiles. Restating the general hypothesis for this work, we expect any deviations in these profiles to be indicative of self-representation (H1). Correspondingly, as we have a psychometric inventory for each participant to hand (via the Five Factor personality model) we can statistically assess which components of our higher dimensional text-based profile account for these differences (H2b). Thus, we provide researchers with a preliminary model to redact the effects of self-representation in online platforms; specifically, Facebook.

Two statistical procedures are heavily utilized in this work, namely Spearman’s ρ and Automatic Linear Modelling (SPSS Statistics version 24). In addition, a one-way ANOVA was performed to assess mean differences for one case and binomial regression was employed in the case of discrete choice variables. While linear relationships exist in the data, some cases are non-normally distributed. [77] notes that Spearman’s ρ outperforms other correlation methods in cases of contaminated normal distributions, and is robust to Type III errors (correctly rejecting the null hypothesis for the wrong reason(s)). This justifies the use of ρ rather than Pearson’s r, in spite of the fact that r is computed on true values whereas ρ is computed on ranks (and thus captures monotonic relationships).
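To make the rank-based nature of Spearman’s ρ concrete, the following sketch computes it as Pearson’s r applied to average ranks. It is a plain standard-library illustration, not the SPSS routine used in the study; the function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))
```

A monotonic but non-linear relationship, such as x against x squared for positive x, yields ρ = 1 even though Pearson’s r on the raw values would be below 1.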

Automatic Linear Modelling is a machine learning extension of regression modelling and is employed for personality detection. Our analysis utilizes the boosted, best-subset model using adjusted R² as the model evaluation criterion. This is consistent with data mining approaches as suggested in [78–81]. Regression in SPSS version 24 is ruled out as it is limited to step-wise methods only, cannot conduct an all-possible subset analysis (used here for exploratory reasons), does not automatically identify and handle outliers, and cannot accommodate a model with a high variable-to-observation ratio [82]. Automatic linear modelling is more robust against Type I and II errors in comparison [82]. 10-fold cross-validation is automatically employed by the model [82,83]. It is important to note that SPSS uses cross-validation as a part of the model building phase; the individual folds therefore have no meaning, as cross-fold validation is used as the optimisation component in boosting. This is standard in boosted processes, as the weak learners are progressively compiled [83,84].
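As a reminder of why adjusted R² is a sensible criterion when many predictors compete for few observations, the sketch below implements the standard textbook definition. Whether SPSS applies exactly this form internally is an assumption; the function name is hypothetical.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Textbook adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    Assumption: this is the standard definition, not a verified
    reproduction of the SPSS Automatic Linear Modelling internals.
    """
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof


# With the raw R^2 held fixed, adding predictors lowers the adjusted value.
few = adjusted_r2(0.50, n_obs=283, n_predictors=5)
many = adjusted_r2(0.50, n_obs=283, n_predictors=60)
```

The penalty grows with the number of predictors, so a best-subset search scored this way does not automatically favour the largest model.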

A boosted model iteratively learns weak classifiers with respect to a distribution and adds them to a final strong classifier [85]. When weak classifiers are added, they are typically weighted in some way that is usually related to their accuracy [86]. After a weak learner is added, the data is reweighted. This forces misclassified predictors to gain weight and predictors that are classified correctly to lose weight. Thus, future weak learners focus more on the predictors that previous weak learners misclassified [78,83]. This is supported by expanding the model to a best-subset approach. The more common stepwise approach economizes on computational effort by exploring only a certain part of the model space. The all-possible-subsets approach, while computationally more intensive, conducts an intensive search of a much larger model space by considering all possible regression models from the pool of potential predictors [82]. This aids prediction accuracy. Pseudocode for the AdaBoost algorithm employed can be found in [83,84,87]. Outliers with a Cook’s Distance smaller than one were retained when they were observed to not have an undue influence on the data [88].
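The reweighting loop described above can be sketched with the textbook discrete AdaBoost algorithm using one-feature threshold rules (decision stumps) as weak learners. This is an illustrative classification sketch under stated assumptions, not the SPSS implementation used in the study; in the textbook formulation the weights attach to training observations. All function names and the toy data are hypothetical.

```python
import math


def fit_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[f] <= thr else -sign for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best


def stump_predict(stump, row):
    _, f, thr, sign = stump
    return sign if row[f] <= thr else -sign


def adaboost(X, y, rounds=10):
    """Discrete AdaBoost for labels in {-1, +1}; returns [(alpha, stump), ...]."""
    n = len(X)
    w = [1.0 / n] * n  # start from uniform observation weights
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # more accurate => larger vote
        ensemble.append((alpha, stump))
        # Misclassified observations gain weight; correct ones lose weight.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can fit: labels are +, +, -, -, +, +.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
```

On this toy pattern every single stump misclassifies at least two points, while the weighted vote of three stumps fits all six, which is the sense in which weak learners combine into a stronger one.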

Boosted models are popular machine learning extensions to standard regression models, and can be employed in high-dimensional data scenarios [89]. The process of splitting the data into training and testing sets and cross-validating it tends to guard against overfitting [78]. Boosted models return strong empirical results [87] for relatively small increases in computational complexity. Most importantly, given the approach’s weight on the previous fold’s misclassified results, and assessing many weak predictors in classifying results (see above paragraph), it is expected to return highly accurate predictions [78,85,87]. As an additional step, nested 10-fold cross-validation was employed as a mechanism to evaluate the overarching model. Although Automated Linear Modelling employs cross-validation in boosted model training, concerns about potentially overfitting the data can still exist. Thus, by employing nested cross-validation (cross-validating a model built using cross-validation) additional insight into the quality and performance of the resultant model is provided. The reported error estimations are less prone to overfitting and therefore are more adequate for model evaluation. This procedure additionally required SPSS Modeller (version 18) as SPSS Statistics cannot accommodate nested cross-validation.
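The structure of nested cross-validation can be sketched as index bookkeeping: an outer loop holds out a test fold for evaluation, and an inner loop re-splits only the remaining training data for model building. The sketch below generates such splits with the standard library; it is a hypothetical illustration and does not reproduce the SPSS Modeller procedure.

```python
import random


def k_fold_indices(indices, k, seed=0):
    """Shuffle the indices and deal them into k nearly equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n, outer_k=10, inner_k=10, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for n observations.

    inner_splits is a list of (inner_train, inner_validation) pairs drawn
    only from outer_train, so the outer test fold never informs model building.
    """
    outer_folds = k_fold_indices(range(n), outer_k, seed)
    for o, outer_test in enumerate(outer_folds):
        outer_train = [i for f, fold in enumerate(outer_folds) if f != o for i in fold]
        inner_folds = k_fold_indices(outer_train, inner_k, seed + o + 1)
        inner_splits = []
        for v, inner_val in enumerate(inner_folds):
            inner_train = [i for f, fold in enumerate(inner_folds) if f != v for i in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


# Example with the 283 English profiles used for the LIWC analyses.
splits = list(nested_cv_splits(283))
```

Each outer test fold is scored by a model that was tuned without seeing it, which is why the resulting error estimate is less optimistic than the one produced inside the model-building loop.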


In order to provide context, noteworthy descriptive statistics of the data across demographic dimensions are provided first, and key data cleaning and transformation processes are then outlined. Subsequently, descriptives of each profile type, namely survey-based and text-based via LIWC, are presented and discussed before being compared with each other as well as with the findings of [35,48]. Finally, a predictive model is proposed where key LIWC categories indicative of self-representation are discussed as a mechanism to control for self-representation.

Descriptive attributes of the population

Following standard online survey guidelines [90,91], participants who completed in less than nine minutes were excluded from the analysis, as well as those with unit or item non-responses (n = 40, or 7.9% of the sample population). Participants were nearly evenly split between the United States and India. The largest language group was English with 285 timelines predominately using English. 73% of participants self-reported to be aged 35 or younger. Gender of the participants is evenly split between women and men, with one non-disclosure and one choice of ‘Other.’ 37% reported being unemployed and 57% completed at least a bachelor’s degree. While this does not reflect a normalized population, a younger sample with higher educational achievements is close to the Facebook population [23].

Of the 285 English profiles, 283 have profiles with 50 or more words over the lifetime of the profiles. Sensitivity analyses indicated that the 50 word threshold was the lower limit for robust results, which is 20 words shorter than the next lowest benchmark found in IBM’s Personality Insights program with its 70-word cutoff [92]. Only the 283 English profiles with more than 50 words are used for LIWC analyses unless otherwise noted. Table 1 illustrates some descriptive categories considering the mean, standard deviation, and median of the profiles, as well as the frequency of words with more than six letters and words per sentence, all measures of linguistic maturity. The average word count per worker is 9,379, just slightly over the average of [48], at 9,333 words per participant.
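The profile-level descriptives mentioned above (word count, share of words longer than six letters, words per sentence, and the 50-word eligibility threshold) can be approximated with a short sketch. The tokenisation rules here are simplifying assumptions and will not match LIWC’s own counts exactly; the function names are hypothetical.

```python
import re

MIN_WORDS = 50  # eligibility threshold reported in the text


def tokenize(text):
    """Split text into alphabetic word tokens (simplified; not LIWC's tokenizer)."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def profile_summary(posts):
    """Summarise one participant's status updates (a list of strings)."""
    text = " ".join(posts)
    words = tokenize(text)
    # A sentence is any run of text between ., ! or ? that contains a word.
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    n_words = len(words)
    long_words = sum(sum(ch.isalpha() for ch in w) > 6 for w in words)
    return {
        "word_count": n_words,
        "pct_six_letter": 100.0 * long_words / n_words if n_words else 0.0,
        "words_per_sentence": n_words / len(sentences) if sentences else 0.0,
        "eligible": n_words >= MIN_WORDS,
    }


summary = profile_summary(["Hello wonderful world. Short one!"])
```

Profiles whose `eligible` flag is false would be dropped before any category-level analysis in this sketch.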

Table 1. Average and Standard Deviation per profile (n = 283).

Self-reported attributes of self-representation

There are some generally interesting results dealing with self-reported contact patterns and motivation of use outside of self-representation issues revealed by the Spearman’s ρ and binomial regression analyses. Participants who use Facebook frequently also update their profiles frequently (rs(337) = .292, p < .005) [SM 1/2], though those with a higher number of friends have a negative relationship with the frequency of logins (rs(337) = -.314, p < .005) [SM 1/3]. A negative relationship also exists between number of Facebook friends and the number of updates (rs(337) = -.252, p < .005) [SM 2/3].

Family, and on- and offline friends are major interest areas in this sample. Participants who use Facebook to show what they know and can do are less interested in contacting family than all other groups (on- and offline friends, unknown people) (Exp(B) = 0.5, p = 0.071) [SM 9H/SM4]. Those who mainly like status updates are most likely to contact family members (Exp(B) = 2.320, p = 0.006) [SM 1D/SM4]. Participants who use Facebook in order to be recognized by others are half as likely to have offline friends on Facebook as the rest of the population (Exp(B) = 0.550, p = 0.085), and are twice as likely to be interested in contacting family members on Facebook (Exp(B) = 1.989, p = 0.067) [SM 9C/4]. An exception here is those who want recognition and support from other users: they are half as likely to contact family members (Exp(B) = 0.406, p = 0.011) [SM 9E/4]. Men are less interested in maintaining contact with family on Facebook than women (Exp(B) = 0.393, p = 0.001) [SM4], and those who frequently like videos are twice as likely to use Facebook for contacting their family (Exp(B) = 2.502, p = 0.004) [SM5/4]. Participants whose profile picture does not show their face are half as likely to want to contact offline friends and are more interested in finding unknown online friends (Exp(B) = 0.413, p = 0.007) [SM 11F/4], as are participants who agree with the statement ‘I can determine myself what I do or do not show others’ (Exp(B) = 1.344, p = 0.033) [SM14B/4].
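For readers less familiar with binomial regression output, Exp(B) is the exponentiated coefficient, i.e., an odds ratio. The sketch below converts between the coefficient, the odds ratio, and the percentage change in odds, using the reported Exp(B) = 0.393 as input. The helper names are hypothetical, and the interpretation concerns odds rather than probabilities.

```python
import math


def odds_ratio(b: float) -> float:
    """Exp(B): multiplicative change in the odds per unit change in the predictor."""
    return math.exp(b)


def log_odds(exp_b: float) -> float:
    """Recover the regression coefficient B from a reported Exp(B)."""
    return math.log(exp_b)


def percent_change_in_odds(exp_b: float) -> float:
    """Express an odds ratio as a percentage change in the odds."""
    return (exp_b - 1.0) * 100.0


# Reported value for men vs. women contacting family: Exp(B) = 0.393.
b_men = log_odds(0.393)                   # negative coefficient
change = percent_change_in_odds(0.393)    # about -60.7% in the odds
```

An Exp(B) below 1 therefore indicates lower odds, and a value near 2 indicates roughly doubled odds, which is how phrases such as ‘half as likely’ and ‘twice as likely’ in the text should be read.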

Written attributes of self-representation on Facebook

As seen in Fig 3, participants frequently communicate positive emotions (an average of 6.16% of each timeline), whereas negative emotions are hardly communicated on Facebook (2.06% of all data). This is encouraging, as it is in line with LIWC standards as established by [60,93]. It is also in line with the work of [1], who describe this positivity bias as social posturing. It must be noted that a contributing factor to this difference could be that LIWC has been found to generally have positive polarity in its algorithm [58]. However, 60% more words in the LIWC dictionaries are associated with negative sentiment than with positive sentiment. Given that difference, it is likely that the positivity bias in this dataset is in fact a display of social posturing: people represent themselves as more positive and less negative on their Facebook profiles, an affirmation of H1a. We note that this could also be a contributing factor to the findings of [44].
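The percentages above are dictionary-based shares: the proportion of a timeline's words that fall into a sentiment category. A minimal sketch of that calculation follows. The word lists are tiny placeholders and are not the LIWC dictionaries, which are far larger and also match word stems; the function name is hypothetical.

```python
import re

# Hypothetical mini word lists for illustration only; they are NOT the LIWC dictionaries.
POSITIVE = {"happy", "love", "great", "good", "nice"}
NEGATIVE = {"sad", "hate", "awful", "bad", "angry"}


def category_share(text, category_words):
    """Percentage of tokens in `text` that belong to `category_words`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in category_words)
    return 100.0 * hits / len(tokens)


timeline = "Had a great day with people I love. Traffic was bad, but dinner was nice."
print(f"positive: {category_share(timeline, POSITIVE):.2f}%")
print(f"negative: {category_share(timeline, NEGATIVE):.2f}%")
```

Computing the share per timeline and then averaging across participants gives a figure comparable in form to the per-timeline average reported above.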

Fig 3. Positive and negative sentiment usage across the sample population (logarithmic scale).

The analysis also looked at expressed confidence as a measure of self-representation (H1b). This is measured by the frequency of use of first person singular and first person plural pronouns, where people who are more confident use ‘I’ words less than ‘We’ words [47,93]. We tested the demographic groups established in the survey with an ANOVA (Fig 4) and found a significant difference by gender (Gender F(2,279) = 11.893, p < .0005; Wilks' Λ = .921; partial η² = .079). Males use more first person singular terms. We find no significant difference in first person plural usage between men and women (First Person Plural (We) F(1,280) = .643, p = .423; partial η² = .002), whereas first person singular usage differs significantly by gender (First Person Singular (I) F(1,280) = 23.405, p < .0005; partial η² = .077). There was homogeneity of variance-covariance matrices, as assessed by Box's test of equality of covariance matrices (p = .002). Males are significantly more likely to present their confidence by use of ‘I’ words in their online personas. Based on the findings of [6,48], this is an unexpected and contradictory finding. It supports emerging findings that women express less confidence than men do, and thereby does not support overt self-representation specific to online social networks (H1b).
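The reported effect sizes can be checked against the F statistics. As a short worked example, partial η² is recoverable from F and its degrees of freedom as (F × df_effect) / (F × df_effect + df_error), and for a two-group multivariate test it equals 1 − Wilks' Λ. The function name below is hypothetical; the inputs are the values reported above.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Univariate tests reported above.
eta_i = partial_eta_squared(23.405, 1, 280)   # first person singular (I)
eta_we = partial_eta_squared(0.643, 1, 280)   # first person plural (We)

# For a two-group multivariate test, partial eta squared equals 1 - Wilks' lambda.
eta_multivariate = 1 - 0.921

print(round(eta_i, 3), round(eta_we, 3), round(eta_multivariate, 3))
```

The recomputed values agree with the reported .077, .002, and .079 to three decimals.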

Fig 4. Gendered usage of confidence-expressing statements on Facebook profiles.

Detecting personality from online responses and online discourse

In order to mitigate self-representation, the attributes indicating personality must first be addressed. This section discusses the predictors with the strongest coefficients from the full list of 136 possible variables (survey items and LIWC categories), and also introduces models that use only data available from Facebook profiles (the LIWC categories) to define the relationships between LIWC and psychometrics (Tables 2–6) (H2b). Applying the data mining technique referred to in the Methodology section (refer to Fig 2 for the model representation), we regress each of the five personality traits of the Five Factor model on the 136 survey and LIWC variables, and then on the 80 LIWC variables alone. It is worth noting that the same process was completed for the prediction of human flourishing. The correlations of extraversion and neuroticism with well-being are strong enough ([rs(282) = .357, p < .0005] / [rs(282) = -.263, p < .0005]) that further analyses are precluded. We introduce these attributes as personality vectors (Tables 2–6). Tables 7 and 8 report the prediction accuracy and explained variance of the five traits considering all 136 variables, as well as the nested cross-validation of these values.
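To illustrate the general idea of boosting a linear regression over many candidate predictors, the following is a minimal sketch of component-wise L2 boosting in the spirit of [89]. It is not the SPSS procedure used in this study: the data are synthetic, and the function names, number of rounds, and learning rate are assumptions made for illustration.

```python
from statistics import mean


def fit_componentwise_l2_boost(X, y, rounds=300, learning_rate=0.1):
    """Component-wise L2 boosting: each round fits the residuals with the
    single best (centered) predictor and adds a shrunken copy of that fit."""
    n_features = len(X[0])
    col_means = [mean(row[j] for row in X) for j in range(n_features)]
    centered = [[row[j] - col_means[j] for j in range(n_features)] for row in X]
    intercept = mean(y)
    coefs = [0.0] * n_features
    residuals = [target - intercept for target in y]

    for _ in range(rounds):
        best_j, best_beta, best_gain = None, 0.0, 0.0
        for j in range(n_features):
            sxx = sum(row[j] ** 2 for row in centered)
            if sxx == 0:
                continue  # constant column carries no information
            sxr = sum(row[j] * r for row, r in zip(centered, residuals))
            gain = sxr ** 2 / sxx  # reduction in residual sum of squares
            if gain > best_gain:
                best_j, best_beta, best_gain = j, sxr / sxx, gain
        if best_j is None:
            break  # residuals are orthogonal to every predictor
        coefs[best_j] += learning_rate * best_beta
        residuals = [r - learning_rate * best_beta * row[best_j]
                     for row, r in zip(centered, residuals)]

    # Fold the centering back into the intercept.
    intercept -= sum(c * m for c, m in zip(coefs, col_means))
    return intercept, coefs


def predict(intercept, coefs, row):
    """Linear prediction for one row of raw (uncentered) feature values."""
    return intercept + sum(c * v for c, v in zip(coefs, row))


def r_squared(y_true, y_pred):
    """Explained variance of the predictions relative to the mean model."""
    y_bar = mean(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - y_bar) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot


# Synthetic, noise-free data (NOT study data): the "trait" depends on two of three features.
X = [[float(i % 7), float((i * 3) % 5), float((i * 5) % 11)] for i in range(40)]
y = [2.0 + 0.5 * row[0] - 0.8 * row[2] for row in X]

intercept, coefs = fit_componentwise_l2_boost(X, y)
fitted = [predict(intercept, coefs, row) for row in X]
print([round(c, 2) for c in coefs], round(r_squared(y, fitted), 3))
```

Because each round updates only the most useful predictor, uninformative variables tend to keep coefficients near zero, which is one reason boosted models can remain compact when many candidate predictors are available.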

Table 2. LIWC dictionary attributes significantly predicting the trait openness.

Table 3. LIWC dictionary attributes significantly predicting the trait conscientiousness.

Table 4. LIWC dictionary attributes significantly predicting the trait extraversion.

Table 5. LIWC dictionary attributes significantly predicting the trait agreeableness.

Table 6. LIWC dictionary attributes significantly predicting the trait neuroticism.

Table 7. Prediction accuracy, explained variance, and nested cross-validation values of the five factor personality traits compared to the accuracy and explained variance of [48].

Table 8. Performance comparison of standard ALM results and 10-fold cross-validated (CV) ALM results.


Openness has a prediction accuracy of 65.0% and an explained variance of 47.2%. Significant at the 0.001 level for openness are the survey categories meaning [HF 4], self-esteem [HF 9], engagement [HF 3], competence [HF 1], optimism [HF 5], positive emotion [HF 6], and resilience [HF 9]; the country of origin of the worker; and the LIWC category ‘feelings’.

Table 2 illustrates the relationships between the trait and LIWC categories. Anger, Abbreviations, Dashes, Recognized by Dictionary, and Words per Sentence positively predict openness; Inclusion, Apostrophes, Discrepancies, Humans, Motion, and Semicolons are negative predictors.


With a prediction accuracy of 66.7% and an R² (explained variance) of 43.3%, conscientiousness is described by the largest collection of LIWC categories of all five traits (Table 3). This could be an indication of the nuance of this particular trait’s expression in online dialogue. Perhaps unsurprisingly, the strongest predictor of this trait is the LIWC category Assent.

The most relevant predictors are the LIWC categories ‘friends’, ‘down’, and ‘fillers’; the survey responses ‘a profile picture that is not obviously me’ [SM 11F], number of friends [SM 3], ‘I understand quickly how others perceive me’ [SM 14A], assent to ‘People should present themselves on online social networks as the same person as they are offline’ [SM 8], and using Facebook to give and get information [SM 9K]; and the survey measures resilience [HF 9] and positive relationships [HF 7].


Extraversion, with 77.9% accuracy and an R² of 56.1%, is related to the survey items competence [HF 1], self-esteem [HF 9], meaning [HF 4], optimism [HF 5], positive emotion [HF 6], vitality [HF 10], and resilience [HF 9]; country of origin; and the survey responses ‘I understand quickly how I am perceived by others’ [SM 14A] and managing Facebook profiles with displays of albums [SM 11G].

Interestingly, those scoring high in Extraversion have a positive usage of words displaying Anger but withdrawn usage of words conveying Negative Emotions. Extroverts also use ‘We’ words (first person plural) more than the other traits, which could be a display of withdrawn confidence as expressed online (Table 4).


Agreeableness has an accuracy of 63.5% and an explained variance of 46.3%, indicating high reliability. Highly significant are the survey items resilience [HF 8], meaning [HF 4], self-esteem [HF 9], and competence [HF 1]; country of origin; the LIWC categories ‘friends’, ‘inhibition’, ‘feelings’, and ‘assent’; and disagreement with ‘I can be who or what I want on my Profile page’ [SM 14D]. Unexpectedly, those scoring high on this trait show withdrawn usage of Positive Emotion (Table 5). They score highest of all traits in attributes capturing linguistic maturity (Unique Words, Words per Sentence).


Neuroticism performs well (70.8% accuracy) with a reasonable R² (49.9%). The most significant survey items are resilience [HF 8], self-esteem [HF 9], emotional stability [HF 2], vitality [HF 10], and optimism [HF 5]; using Facebook to spy on others [SM 9D], managing presentation of self with pictures not of them [SM 11F], using Facebook to observe other people [SM 9F], and liking videos on Facebook [SM 5]. Finally, the LIWC category ‘feelings’ is highly significant. Table 6 shows positive usage of personal achievement alongside withdrawn usage of References to Others; this could indicate that the discourse of those high in neuroticism tends towards the self-centred.

Model performance considering benchmark works and implications

Workers’ self-produced text is indicative of self-representation when compared to their responses to the Five Factor model (H2). The Automatic Linear Modelling (ALM) approach in SPSS creates meritorious model fits, averaging 68.8% reference model accuracy and 48.6% explained variance as seen in Table 7, without overt signs of overfitting (H2a).
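The averages quoted here follow directly from the per-trait figures given in the preceding paragraphs. As a short arithmetic check (the dictionary layout and variable names are only for illustration):

```python
from statistics import mean

# Per-trait values as reported in the text above (accuracy %, explained variance %).
reported = {
    "openness": (65.0, 47.2),
    "conscientiousness": (66.7, 43.3),
    "extraversion": (77.9, 56.1),
    "agreeableness": (63.5, 46.3),
    "neuroticism": (70.8, 49.9),
}

mean_accuracy = mean(acc for acc, _ in reported.values())
mean_r2 = mean(r2 for _, r2 in reported.values())
print(f"mean accuracy: {mean_accuracy:.1f}%, mean explained variance: {mean_r2:.1f}%")
```

Rounded to one decimal, the means reproduce the 68.8% accuracy and 48.6% explained variance stated above.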

Considering sizeable correlations between predictor groups, the unique variance explained by each of the variables indexed by the squared semipartial correlations is low. In no case was there an instance of Cook’s Distance larger than one, so all outliers were handled within the data rather than trimmed [88]. The multivariate models are statistically significant for each personality trait (p < .05).

When nested cross-validation is additionally performed, we see an average result of 0.67 (Table 7, Table 8). While the average of the model is nearly the same, indicating the soundness of the approach, fluctuations are found in the individual constructs (Minimum and Maximum columns, Table 8). The fluctuations are assumed to be a function of the program in use: when a linear model in SPSS encounters a testing instance with a value it had not anticipated (e.g., an attribute value outside the range of the training data provided), SPSS generally predicts $null$. Table 8 compares the minimum, maximum, and average performance of nested cross-validation across the five constructs with the results of Table 7. Per-fold results are included as Supporting Information (S1 Table. Supporting information per-fold performance testing).
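The situation described above, in which a held-out instance carries an attribute value outside the range seen in training, can be made concrete with a small sketch. It uses plain 10-fold splitting rather than the nested scheme, synthetic data, and hypothetical function names; it only counts the out-of-range test rows per fold and does not reproduce SPSS behaviour.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]


def out_of_range_rows(train_values, test_values):
    """Positions of test values that fall outside the range seen in training."""
    low, high = min(train_values), max(train_values)
    return [i for i, v in enumerate(test_values) if v < low or v > high]


# Hypothetical single attribute for 50 participants (NOT study data).
rng = random.Random(42)
attribute = [rng.gauss(5.0, 2.0) for _ in range(50)]

folds = k_fold_indices(len(attribute), k=10)
for fold_id, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train = [attribute[i] for i in range(len(attribute)) if i not in held_out]
    test = [attribute[i] for i in test_idx]
    flagged = out_of_range_rows(train, test)
    print(f"fold {fold_id}: {len(test)} test rows, {len(flagged)} outside training range")
```

Whichever fold holds out the sample's most extreme value will contain at least one such row, which is one way per-fold results can fluctuate even when the average is stable.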

Our models differ from the works of [35,48] in three major ways. First, we find fewer categories per personality trait that are significant at the 0.05 level or better (see Tables 2–6) than [35] does. We see the reduced dimensionality as a strength of our approach. It indicates that the representation of the five traits is more compact than in the benchmark works [35,48], and is likely more generalizable. Second, the strength of the coefficients in our model is considerably higher than in the LIWC-only results reported in [48]. This implies that our method has competitive prediction accuracy while utilizing fewer features with stronger statistical power. Finally, an advantage of our approach (boosted, best-subset regression modelling) is its superior performance in terms of explained variance. The reported explained variance of the LIWC-only approach in [48] with a standard regression model reached an average of 26%, and 35% when combined with other features. Our approach achieves an explained variance between 43% and 56% across the five traits. Given this work’s near-replication of the psychometric instruments as well as known relationships between them (e.g., well-being, extroversion, and neuroticism), our reported difference is unlikely to be solely due to differences in sample size. This suggests that while other approaches (e.g., latent semantic analysis [94], open-word approaches [48,61], or correlation studies [35]) are meritorious, LIWC-only approaches when combined with machine learning extensions are also appropriate for the task. Indeed, the performance increase in comparison with standard linear models and other linguistic approaches suggests that future research should consider employing such (relatively light) machine learning approaches for more accurate, reliable results.

Implication: Personality is a tool for mitigating self-representation

Having established a compact representation of the five personality traits of [34] detected from LIWC data as it represents Facebook data, researchers can use the results reported in this work as personality vectors. Personality vectors in this case are the collated LIWC categories reported in Tables 2–6. Researchers may apply the vectors to Facebook-based data when investigating psychometrics in order to represent a more realistic view of the subject. This contributes a method for social researchers to verify psychometric baselines of subjects. Having done this, researchers are able to mitigate the effects of socially responding personas in online social media data. This delivers a closer representation of the subject's real-life personality than is currently available.
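One possible way to apply such a vector is sketched below, under stated assumptions: a profile's LIWC category values are standardized and combined as a weighted sum. The signs follow the openness predictors named in the Results; the magnitudes are placeholders rather than the coefficients of Table 2, and the category keys, function name, and profile values are hypothetical.

```python
# Signs follow the openness predictors named in the text; the magnitudes (all 1.0)
# are placeholders, NOT the coefficients from Table 2.
OPENNESS_VECTOR = {
    "anger": +1.0,
    "abbreviations": +1.0,
    "dashes": +1.0,
    "dictionary_words": +1.0,
    "words_per_sentence": +1.0,
    "inclusion": -1.0,
    "apostrophes": -1.0,
    "discrepancies": -1.0,
    "humans": -1.0,
    "motion": -1.0,
    "semicolons": -1.0,
}


def trait_score(liwc_profile, trait_vector):
    """Weighted sum of standardized LIWC category values; missing categories count as 0."""
    return sum(weight * liwc_profile.get(category, 0.0)
               for category, weight in trait_vector.items())


# Hypothetical z-scored LIWC output for one profile.
profile = {"anger": 0.4, "dashes": 1.2, "inclusion": -0.3, "motion": 0.9}
print(round(trait_score(profile, OPENNESS_VECTOR), 2))
```

A researcher would substitute the reported coefficients for the placeholder weights and compare the resulting score with a survey-based baseline where one is available.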

Discussion and conclusion

The key findings of this research are that self-representation in online social media is an identifiable phenomenon, that self-representation can be isolated, and that a smaller number of indicators than previously reported can be used to do so. Moreover, it opens an interesting discussion on the impact of self-representation on social media analyses, both from the perspective of the researcher validating social models, and of the subject with respect to the intent of such behaviours. To our knowledge this is the first work that validates applying LIWC to Facebook data as a stand-alone tool for the identification of personality traits and self-representation. Similar studies have validated other text inputs (e.g., [35]), or have approached feature creation on an individual word basis [48,64]. Finally, the accuracy of our results was aided by employing a machine learning extension to the regression model (boosted regression modelling), increasing accuracy dramatically.

Self-representation was identified in a number of indicators. Positive affectivity and withdrawn negative emotions are identifiable across the workers’ profiles. Withdrawn negative affect is particularly indicative of self-representation (H1a). However, confidence follows expected patterns across genders (H1b). Male participants appear more confident in their written profiles than females. As this is a finding in emergent literature, it cannot be understood as an overt measure of self-representation. Personality is still detectable even when self-representation is present (H2a), and LIWC-only features have meritorious performance in comparison to latent semantic methods like the open vocabulary approach of [48] (H2b). Our reported accuracies were enabled by creating a fitting model for personality prediction as opposed to using off-the-shelf prediction models.

The stated aims of this research are twofold: establishing the relationship between offline and online personalities, and using that relationship to mitigate self-representation biases in publicly sourced data. In accomplishing these goals, this research creates a generally applicable method for the design of cross-disciplinary methods and the analysis of social media data. Such a method is impactful in both research arenas and commercial domains, in that it allows the study designer to approximate participant baselines without highly intrusive mechanisms. A strength of this study is its consideration and application of the findings from recent cyberpsychology literature.

In a systematic manner, this research detailed the experimental design, data collection, and analysis. Common method biases are addressed and appropriately eliminated when identified. The method allows for replication by careful detailing of the steps, (pre)processing of data, and models built. A major contribution is addressing method biases in the harvesting and analysis of social media data. This research utilizes the entire data stream as posted by the individual per profile, mitigating sampling errors. It also names common markers of the phenomenon of self-representation based on simple LIWC categories and psychometrics that allow researchers to mitigate its effects in future research. With personality and mood validated and a sentiment analysis performed on the lifespan of a user’s Facebook timeline, we can now measure the propensity of a user to portray themselves in opposition to their truthful, psychological baseline.

We propose that researchers can apply this method of personality isolation to their analyses of publicly sourced data in order to mitigate the effects of self-representation. This supports the goal of (Big) data-driven personality research being both precise and accurate. Such an approach has diverse applications in that it allows for a new personality-based estimator from which to generalize from publicly accessible text to the general population. With self-representation identified and removed, a valid measurement of psychometrics is created without necessitating expensive surveys or interviews.

Limitations, future work

A limitation is the sample size, which disallows larger statements about linguistic subgroups; the non-English samples are too small for meaningful statistics. While larger than similar cyberpsychology studies found in the related work in terms of both participant number and volume of text, the study is still smaller than the largest Facebook studies to date [6,44,48]. Another drawback is that the results are tailored to Facebook: the findings of this study are unlikely to generalize to professional networking, microblogs, or visual media sites. A concluding remark on limitations is related to privacy. While the study obtained informed consent from its workers, it remains an open question whether workers truly understood the amount of information that was being given in the task.

Extensions of this research are closely linked to its limitations. Cross-platform analysis of the same user across their various public profiles would give future work a more nuanced view of the ways that social media users self-represent to different audiences. Such a work would fill research gaps in ‘best’ platform usage for information disbursement, creation, and influence, as well as impact for a given network. A network analysis of users and the resulting textured understanding of how users cluster and complement within a network would be a good area of future research. Such an approach would also support answering the questions of why social media users self-represent in the way they do, given a particular site.

Supporting information

S1 Text. Online appendix to: Am I Who I Say I Am?


S1 Table. Supporting information per-fold performance testing.

Model ID O: Openness; Model ID C: Conscientiousness; Model ID E: Extraversion; Model ID A: Agreeableness; Model ID N: Neuroticism



  1. Qiu L, Lin H, Leung AK, Tov W. Putting their best foot forward: emotional disclosure on Facebook. Cyberpsychol Behav Soc Netw. 2012;15: 569–72. pmid:22924675
  2. Zhao S, Grasmuck S, Martin J. Identity construction on Facebook: Digital empowerment in anchored relationships. Comput Human Behav. 2008;24: 1816–1836.
  3. Hogan B. The Presentation of Self in the Age of Social Media: Distinguishing Performances and Exhibitions Online. Bull Sci Technol Soc. 2010;30: 377–386.
  4. van Dijck J. “You have one identity”: performing the self on Facebook and LinkedIn. Media, Cult Soc. 2013;35: 199–215.
  5. Boyd D, Chang M, Goodman E. Representations of Digital Identity. CSCW’04. 2004;6: 6–10.
  6. Das S, Kramer A. Self-Censorship on Facebook. Seventh International AAAI Conference on Weblogs and Social Media. Cambridge, USA; 2013. pp. 120–127.
  7. Jungherr A, Jürgens P, Schön H. Why the Pirate Party Won the German Election of 2009 or The Trouble With Predictions. Soc Sci Comput Rev. 2011;30: 229–234.
  8. Rost M, Barkhuus L, Cramer H, Brown B. Representation and Communication: Challenges in Interpreting Large Social Media Datasets. CSCW’13. San Antonio, TX: ACM Press; 2013. pp. 357–362.
  9. Chung J, Mustafaraj E. Can collective sentiment expressed on twitter predict political elections? Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence. San Francisco, CA; 2011. pp. 1770–1771.
  10. Boyd RL, Pennebaker JW. Language-based personality: a new approach to personality in a digital world. Curr Opin Behav Sci. 2017;18: 63–68.
  11. Podsakoff PM, MacKenzie SB, Podsakoff NP. Sources of method bias in social science research and recommendations on how to control it. Annu Rev Psychol. 2012;63: 539–69. pmid:21838546
  12. Podsakoff PM, Mackenzie SB, Lee J, Podsakoff NP. Common Method Biases in Behavioral Research: A Critical Review of the Literature and Recommended Remedies. J Appl Psychol. 2003;88: 879–903. pmid:14516251
  13. Ellison N, Heino R, Gibbs J. Managing Impressions Online: Self-Presentation Processes in the Online Dating Environment. J Comput Commun. 2006;11: 415–441.
  14. Lawson HM, Leck K. Dynamics of Internet Dating. Soc Sci Comput Rev. 2006;24: 189–208.
  15. Lingel J, Naaman M, boyd danah. City, self, network: transnational migrants and online identity work. CSCW’14. 2014. pp. 1502–1510.
  16. Tamir DI, Mitchell JP. Disclosing information about the self is intrinsically rewarding. Proc Natl Acad Sci U S A. 2012;109: 8038–43. pmid:22566617
  17. Back MD, Stopfer JM, Vazire S, Gaddis S, Schmukle SC, Egloff B, et al. Facebook profiles reflect actual personality, not self-idealization. Psychol Sci. 2010;21: 372–4. pmid:20424071
  18. Hilsen AI, Helvik T. The construction of self in social medias, such as Facebook. AI Soc. 2012;29: 3–10.
  19. Lin H, Qiu L. Two sites, two voices: Linguistic differences between facebook status updates and tweets. Rau PLP, editor. Lect Notes Comput Sci (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics). LNCS 8024; 2013;8024 LNCS: 432–440.
  20. Pennebaker J, King L. Linguistic Styles: Language Use as an Individual Difference. J Pers Soc Psychol. 1999;77: 1296–1312. pmid:10626371
  21. Gonzales AL, Hancock JT, Pennebaker J. Language Style Matching as a Predictor of Social Dynamics in Small Groups. Communic Res. 2010;37: 3–19.
  22. Groom CJ, Pennebaker J. Words. J Res Pers. 2002;36: 615–621.
  23. Duggan M, Ellison N, Lampe C, Lenhart A, Madden M. Pew Social Media Report 2015 [Internet]. 2014.
  24. Wilson RE, Gosling SD, Graham LT. A Review of Facebook Research in the Social Sciences. Perspect Psychol Sci. 2012;7: 203–220. pmid:26168459
  25. Goffman E. The Presentation of Self In Everyday Life. 1st ed. New York, New York, USA: Anchor; 1959.
  26. Purdie-Vaughns V, Steele CM, Davies PG, Ditlmann R, Crosby JR. Social identity contingencies: how diversity cues signal threat or safety for African Americans in mainstream institutions. J Pers Soc Psychol. 2008;94: 615–30. pmid:18361675
  27. Hoever A. Strategien und Konzepte der Selbstdarstellung auf Social Network Services am Beispiel Facebook. Berlin: Berliner Methodentreffen Qualitative Forschung; 2010.
  28. Mehra A, Kilduff M, Brass DJ. The social networks of high and low self-monitors: Implications for workplace performance. Adm Sci Q. 2001;46: 121–146.
  29. Utz S, Tanis M, Vermeulen I. It is all about being popular: the effects of need for popularity on social network site use. Cyberpsychol Behav Soc Netw. 2012;15: 37–42. pmid:21988765
  30. Gosling SD, Mason W. Internet Research in Psychology. Annu Rev Psychol. 2015;66: 877–902. pmid:25251483
  31. Bazarova N, Taft J, Choi YyH, Cosley D. Managing Impressions and Relationships on Facebook: Self-Presentational and Relational Concerns Revealed Through the Analysis of Language Style. J Lang Soc Psychol. 2012;
  32. Bollen J, Gonçalves B, Ruan G, Mao H. Happiness is assortative in online social networks. Artif Life. 2011;17: 237–51. pmid:21554117
  33. Fowler J, Christakis N. Dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the Framingham Heart Study. BMJ. 2008;337: a2338. pmid:19056788
  34. John OP, Donahue EM, Kentle RL. The big five inventory—versions 4a and 54. Berkeley, USA; 1991.
  35. Yarkoni T. Personality in 100,000 words: A large-scale analysis of personality and word use among bloggers. J Res Pers. 2010;44: 363–373. pmid:20563301
  36. Hall M, Kimbrough SO, Haas C, Weinhardt C, Caton S. Towards the gamification of well-being measures. 2012 IEEE 8th International Conference on E-Science, e-Science 2012. IEEE; 2012. pp. 1–8.
  37. Hall M, Caton S, Weinhardt C. Well-being’s Predictive Value. In: Ozok AA, Zaphiris P, editors. Proceedings of the 15th International Conference on Human-Computer Interaction (HCII). Berlin: LNCS, Springer Verlag; 2013. pp. 13–22.
  38. DeNeve KM, Cooper H. The happy personality: a meta-analysis of 137 personality traits and subjective well-being. Psychol Bull. 1998;124: 197–229. pmid:9747186
  39. Warshaw J, Matthews T, Whittaker S, Kau C, Bengualid M, Smith B a. Can an Algorithm Know the “Real You”? Understanding People’s Reactions to Hyper-personal Analytics Systems. Proc 33rd Annu ACM Conf Hum Factors Comput Syst. 2015; 797–806.
  40. John O, Naumann L, Soto C. Paradigm Shift to the Integrative Big Five Trait Taxonomy. Handbook of Personality. 2008. pp. 114–158.
  41. Tausczik Y, Pennebaker J. The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. J Lang Soc Psychol. 2010;29: 24–54.
  42. Kramer A. An Unobtrusive Behavioral Model of “Gross National Happiness.” Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Atlanta, USA; 2010. pp. 287–290.
  43. Kramer A. The spread of emotion via facebook. Proceedings of the 2012 ACM annual conference on Human Factors in Computing Systems—CHI ‘12. New York, New York, USA: ACM Press; 2012. pp. 767–770.
  44. Kramer A, Guillory JE, Hancock J. Experimental evidence of massive-scale emotional contagion through social networks. Proc Natl Acad Sci. 2014;111: 8788–8790. pmid:24889601
  45. Lindner A, Hall M, Niemeyer C, Caton S. BeWell: A Sentiment Aggregator for Proactive Community Management. CHI’15 Extended Abstracts. Seoul, Korea: ACM Press; 2015. pp. 1055–1060.
  46. Chung C, Pennebaker J. Counting little words in Big Data: The Psychology of Communities, Culture, and History. In: Forgas J, Vincze O, Laszlo J, editors. Social Cognition and Communication. New York, New York, USA: Psychology Press; 2014. pp. 25–42.
  47. Campbell RS, Pennebaker J. The secret life of pronouns: Flexibility in writing style and physical health. Psychol Sci. 2003;14: 60–65. pmid:12564755
  48. Schwartz HA, Eichstaedt J, Kern M, Dziurzynski L, Ramones S, Agrawal M, et al. Personality, gender, and age in the language of social media: the open-vocabulary approach. PLoS One. 2013;8: e73791. pmid:24086296
  49. Ott M, Choi Y, Cardie C, Hancock J. Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. 2011. pp. 309–319.
  50. Newman M, Pennebaker J, Berry D, Richards J. Lying Words: Predicting Deception From Linguistic Styles. Personal Soc Psychol Bull. 2003;29: 665–675.
  51. Salas-Zárate M del P, López-López E, Valencia-García R, Aussenac-gilles N, Almela Á, Alor-Hernández G. A study on LIWC categories for opinion mining in Spanish reviews. J Inf Sci. 2014;1: 1–13.
  52. Mahmud J. Why Do You Write This? Prediction of Influencers from Word Use Psycholinguistic Analysis from text. ICWSM. Ann Arbor, USA; 2014. pp. 603–606.
  53. Markovikj D, Gievska S. Mining Facebook Data for Predictive Personality Modeling. Proc of WCPR13, in …. 2013. pp. 23–26.
  54. Farnadi G, Zoghbi S, Moens M, Cock M De. Recognising Personality Traits Using Facebook Status Updates. Work Comput Personal Recognit Int AAAI Conf weblogs Soc media. 2013; 14–18.
  55. Komisin M, Guinn C. Identifying Personality Types Using Document Classification Methods. Proceedings of the Twenty-Fifth International Florida Artificial Intelligence Research Society Conference. Palo Alto, USA; 2012. pp. 232–237.
  56. Balahur A, Hermida JM. Extending the EmotiNet Knowledge Base to Improve the Automatic Detection of Implicitly Expressed Emotions from Text. LREC. Istanbul, Turkey; 2012. pp. 1207–1214.
  57. Beasley A, Mason W. Emotional States vs. Emotional Words in Social Media. Proceedings of ACM WebSci’15. Oxford, England: ACM Press; 2015.
  58. Gonçalves P, Araújo M, Benevenuto F, Cha M. Comparing and combining sentiment analysis methods. Proc first ACM Conf Online Soc networks—COSN ‘13. 2013; 27–38.
  59. Araújo M, Gonçalves P, Cha M, Benevenuto F. iFeel: A Web System that Compares and Combines Sentiment Analysis Methods. International World Wide Web Conference Committee (IW3C2). 2014.
  60. Caton S, Hall M, Weinhardt C. How do politicians use Facebook? An applied Social Observatory. Big Data Soc. SAGE Publications; 2015;2: 2053951715612822.
  61. Park G, Schwartz HA, Eichstaedt JC, Kern ML, Kosinski M, Stillwell DJ, et al. Automatic personality assessment through social media language. J Pers Soc Psychol. 2015;108: 1–25.
  62. Lambiotte R, Kosinski M. Tracking the Digital Footprints of Personality. Proc IEEE. 2014;102: 1934–1939.
  63. Wang N, Kosinski M, Stillwell D, Rust J. Can Happiness be Measured using Facebook status updates? 2010;
  64. Youyou W, Kosinski M, Stillwell D. Computer-based personality judgments are more accurate than those made by humans. Proc Natl Acad Sci. 2015;
  65. Hall M, Glanz S, Caton S, Weinhardt C. Measuring Your Best You: A Gamification Framework for Well-being Measurement. Third International Conference on Social Computing and its Applications. Karlsruhe, Germany: IEEE; 2013. pp. 277–282.
  66. Huppert F, So TTC. Flourishing Across Europe: Application of a New Conceptual Framework for Defining Well-Being. Soc Indic Res. 2013;110: 837–861. pmid:23329863
  67. Ewig C. Social Media: Theorie und Praxis digitaler Sozialität / Social media: theory and practice of digital sociality. In: Anastasiadis M, Thimm C, editors. Social Media: Theorie und Praxis digitaler Sozialität. Frankfurt am Main: Peter Lang Internationaler Verlag der Wissenschaften; 2011.
  68. Berinsky AJ, Huber G, Lenz GS. Evaluating Online Labor Markets for Experimental Research: Amazon.com’s Mechanical Turk. Polit Anal. 2012;20: 351–368.
  69. Paolacci G, Chandler J, Ipeirotis P. Running experiments on Amazon Mechanical Turk. Judgm Decis Mak. 2010;5: 411–419.
  70. Ross J, Zaldivar A, Irani L, Tomlinson B. Who are the Turkers? Worker Demographics in Amazon Mechanical Turk. CHI 2010. 2010. pp. 2863–2872.
  71. Mason W, Suri S. Conducting behavioral research on Amazon’s Mechanical Turk. Behav Res Methods. 2012;44: 1–23. pmid:21717266
  72. Yearwood MH, Cuddy A, Lamba N, Youyou W, van der Lowe I, Piff PK, et al. On wealth and the diversity of friendships: High social class people around the world have fewer international friends. Pers Individ Dif. 2015;87: 224–229.
  73. Lease M, Hullman J, Bigham JP, Bernstein MS, Kim J, Lasecki W, et al. Mechanical turk is not anonymous. Soc Sci Res Network. 2013; 15.
  74. Clifford S, Jewell RM, Waggoner PD. Are samples drawn from Mechanical Turk valid for research on political ideology? Res Polit. 2015;2: 1–9.
  75. Zimmer M. “But the data is already public”: on the ethics of research in Facebook. Ethics Inf Technol. 2010;12: 313–325.
  76. González-Bailón S, Wang N, Rivero A, Borge-Holthoefer J. Assessing the bias in samples of large online networks. Soc Networks. Elsevier B.V.; 2014;38: 16–27.
  77. Fowler RL. Power and Robustness in Product-Moment Correlation. Appl Psychol Meas. 1987;11: 419–428.
  78. Schonlau M. Boosted regression (boosting): An introductory tutorial and a Stata plugin. Stata J. 2005;5: 330–354.
  79. Li Q, Racine JS. Cross-Validation Local Linear Nonparametric Regression. Stat Sin. 2004;14: 485–512.
  80. Hurvich CM, Simonoff JS, Tsai C. Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. J R Stat Soc Ser B. 1998;60: 271–293.
  81. Cleveland WS, Devlin SJ. Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting. J Am Stat Assoc. 1988;83: 596–610.
  82. Yang H. The Case for Being Automatic: Introducing the Automatic Linear Modeling (LINEAR) Procedure in SPSS Statistics. Mult Linear Regres Viewpoints. 2013;39: 27–37.
  83. IBM. IBM SPSS Advanced Statistics 22. 2011.
  84. IBM. IBM SPSS Regression 22. 2011.
  85. Fernández-Delgado M, Cernadas E, Barro S, Amorim D. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? J Mach Learn Res. 2014;15: 3133–3181.
  86. Tulyakov S, Jaeger S, Govindaraju V, Doermann D. Review of Classifier Combination Methods. Rev Classif Comb Methods. 2007;90: 361–386.
  87. Schapire RE. The Boosting Approach to Machine Learning: An Overview. Nonlinear Estimation and Classification. 2003. pp. 149–171.
  88. Cook RD, Weisberg S. Residuals and Influence in Regression. 1982.
  89. Bühlmann P. Boosting for high-dimensional linear models. Ann Stat. 2006;34: 559–583.
  90. Bosnjak M, Tuten TL. Classifying Response Behaviors in Web-based Surveys. J Comput Commun. 2001;6: 14.
  91. Galesic M, Bosnjak M. Effects of Questionnaire Length on Participation and Indicators of Response Quality in a Web Survey. Public Opin Q. 2009;73: 349–360.
  92. Mahmud J. IBM Watson Personality Insights: The science behind the service [Internet]. Almaden, USA: IBM; 2015.
  93. Pennebaker J, Mehl MR, Niederhoffer KG. Psychological aspects of natural language use: our words, our selves. Annu Rev Psychol. 2003;54: 547–77. pmid:12185209
  94. Deerwester S, Dumais ST, Furnas GW, Landauer TK. Indexing by Latent Semantic Analysis. J Am Soc Inf Sci. 1998;41: 391–407.