Fig 1.
We extract a patient language encoding from the words and phrases within an individual’s Facebook status updates. The three word clouds shown represent the words most prevalent in three example dimensions of the encoding. We then learn predictive models and identify predictive markers for the medical condition categories in the medical records.
Table 1.
Medical condition prevalence and participant characteristics.
Fig 2.
A. Diagnoses Prediction Strength of Demographics and Facebook. This panel shows the overall accuracy of the Facebook and demographic models at predicting diagnoses. Accuracy was measured using the area under the receiver operating characteristic curve (AUC), a measure of discrimination. The category “Facebook alone” represents predictions based only on Facebook language. “Demographics alone” represents predictions from age, sex, and race. “Demographics & Facebook” represents predictions based on a combination of demographics and Facebook posts. Diagnoses are ordered by the difference in AUC between Facebook alone and demographics alone. For the top 10 categories, Facebook predictions are significantly more accurate than those from demographics (p < .05), and for the top 17 plus iron deficiency anemia, Facebook & demographics are significantly more accurate than Facebook alone (p < .05). * Pregnancy analyses only included females.
B. Markers (most predictive topics) per diagnosis. This panel illustrates the top markers (clusters of similar words from social media language) most predictive of selected diagnosis categories. Word size within a topic represents rank-order prevalence in the topic. Expletives were edited and are represented by stars (i.e., *). All topics shown, except for those with digestive abdominal symptoms, were individually predictive beyond the demographics (multi-test corrected p < .05). (Full results in supplement [S2 Table]).
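To make the AUC in panel A concrete: it can be read as the probability that a randomly chosen patient with the diagnosis receives a higher model score than a randomly chosen patient without it. The following is a minimal sketch of that pairwise definition, not the study's evaluation code. The function name `auc` and all labels and scores are hypothetical and chosen only for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # patient with the diagnosis is ranked higher
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical data (not from the study): 1 = has the diagnosis, 0 = does not.
labels = [1, 0, 1, 0, 0, 1]
facebook_scores = [0.9, 0.2, 0.7, 0.4, 0.6, 0.5]
demographic_scores = [0.6, 0.5, 0.4, 0.3, 0.7, 0.5]

auc_facebook = auc(labels, facebook_scores)
auc_demographics = auc(labels, demographic_scores)
print(f"Facebook alone: {auc_facebook:.3f}")
print(f"Demographics alone: {auc_demographics:.3f}")
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation. Ordering diagnoses by the difference between the two values mirrors the ordering described for panel A.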
Fig 3.
Differential expression of topics across medical conditions within the social mediome.
Analogous to studying the differential expression of a genome, topics of the social mediome can be explored differentially across diagnoses. The 21 rows represent all medical condition categories of the study, ordered using hierarchical clustering, while the 200 columns indicate the predictive strength[24] (measured by area under the ROC curve) of each potential language marker (topic). Blue topics were more likely to be used by patients with the given medical condition, and orange topics were less likely to be mentioned. Each medical condition category has a unique pattern of markers. These encodings allow for the prediction of diagnoses and the identification of diagnoses with similar patterns of markers.
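One way to see how such a matrix can reveal diagnoses with similar marker patterns is to compare rows directly. The sketch below uses plain pairwise Pearson correlation between rows, which is a simpler stand-in for the hierarchical clustering named in the caption. The condition names, the four-topic matrix, and the helpers `pearson` and `direction` are all hypothetical. Treating 0.5 as the boundary between "more likely" and "less likely" topics is an assumption based on the usual reading of AUC.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def direction(auc_value):
    """Assumed reading: AUC above 0.5 means the topic is used more by patients."""
    if auc_value > 0.5:
        return "more likely"
    if auc_value < 0.5:
        return "less likely"
    return "neutral"


# Hypothetical per-topic AUCs: rows are conditions, columns are 4 topics
# (the figure has 21 rows and 200 columns).
topic_aucs = {
    "condition_a": [0.62, 0.45, 0.55, 0.40],
    "condition_b": [0.60, 0.47, 0.56, 0.42],
    "condition_c": [0.41, 0.58, 0.48, 0.63],
}

# Correlate every pair of rows and pick the most similar pair of conditions.
similarity = {
    (a, b): pearson(topic_aucs[a], topic_aucs[b])
    for a, b in combinations(topic_aucs, 2)
}
most_similar = max(similarity, key=similarity.get)
print(most_similar, round(similarity[most_similar], 3))
```

Rows with highly correlated AUC profiles would sit next to each other after hierarchical clustering, which is why neighbouring rows in the figure tend to share blue and orange regions.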