Reader Comments

Absolute vs. Relative Data Collection

Posted by fhuszar on 03 Jan 2013 at 21:21 GMT

Thank you for the insightful analysis. My comment addresses the data collection strategy employed by the authors.

Users have valuable knowledge about the strength of their connections to others. The purpose of this study was to design an experimental mechanism to elicit this knowledge and then use it as ground truth for the analysis of social tie strength in online social networks. I believe the proposed mechanism for eliciting users' knowledge about tie strengths is not the most effective one, mainly because it discretises the continuous tie strength into essentially binary values: 'strong' and 'weak' connections. There are a few variables and subject-level differences that this procedure does not control for, namely:

#1) absolute activity of users: some people may not communicate actively on Facebook, while others use it with higher frequency.

#2) total number of friends: the "top 5 closest friends on Facebook" means something different for someone with 20 connections than for someone with 300 or more.

#3) distribution of connection strengths within each user's network: even in real life, some people may have 10 very strong connections, while others may have no strong connections at all. Furthermore, the threshold of connection strength above which you call someone a close friend may vary from person to person.

Any of these 3 effects may interfere with the data collection and analysis strategy employed in the paper.

An alternative data collection strategy could have used a series of two-alternative forced choice (2AFC) trials: ask the user a series of questions of the form "Who is a closer friend to you, Alice or Bob?". This concept for eliciting user opinions has been used in the past, although in a slightly different context, for example in this experiment:

Binary preference (2AFC) data has several nice properties and advantages for inference. There are several methods that would allow you to infer real-valued connection strength from this data, including methods directly analogous to logistic regression or support vector machines; see for example (Fürnkranz & Hüllermeier, 2011, Preference Learning, Springer).
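
To make the logistic-regression analogy concrete, here is a minimal sketch of one such method, assuming a Bradley-Terry style model in which the probability that friend i is chosen over friend j is sigmoid(s[i] - s[j]). The function name fit_strengths, the toy answers, and all parameter values are hypothetical and for illustration only; this is not the method used in the article.

```python
import math

def fit_strengths(n_items, comparisons, steps=500, lr=0.1, l2=0.01):
    """Fit one real-valued score per friend from (winner, loser) pairs.

    Assumed model: P(i chosen over j) = sigmoid(s[i] - s[j]).
    Plain gradient ascent on the L2-penalised log-likelihood.
    """
    s = [0.0] * n_items
    for _ in range(steps):
        grad = [-l2 * v for v in s]          # gradient of the L2 penalty
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(-(s[winner] - s[loser])))
            grad[winner] += 1.0 - p_win      # push the chosen friend up
            grad[loser] -= 1.0 - p_win       # push the other friend down
        s = [v + lr * g for v, g in zip(s, grad)]
    return s

# Hypothetical 2AFC answers as (chosen, not chosen) pairs, one of them noisy.
answers = [(0, 1), (0, 2), (1, 2), (0, 1), (1, 0), (1, 2)]
scores = fit_strengths(3, answers)
ranking = sorted(range(3), key=lambda i: -scores[i])
print(ranking)
```

The fitted scores are only identified up to an additive constant; the small L2 penalty pins them down and keeps them finite when a friend wins every comparison.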

Most importantly, designing the data collection procedure this way eliminates the effect of issues #2 and #3 above. The data collection strategy remains meaningful regardless of the total number and strength of connections in the user's network. Issue #1 can probably be dealt with by normalising engagement levels by the total activity of the user, or by introducing further hidden variables to the model.
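
As a rough illustration of the normalisation idea for issue #1, the sketch below divides each friend's interaction count by the user's total activity. The function name and the interaction counts are made up for the example.

```python
def normalise_engagement(interaction_counts):
    """Turn raw per-friend interaction counts into shares of total activity."""
    total = sum(interaction_counts.values())
    if total == 0:
        # A user with no recorded activity carries no information here.
        return {friend: 0.0 for friend in interaction_counts}
    return {friend: count / total for friend, count in interaction_counts.items()}

# Hypothetical users with the same relative pattern but very different volume.
light_user = normalise_engagement({"alice": 3, "bob": 1})
heavy_user = normalise_engagement({"alice": 300, "bob": 100})
print(light_user, heavy_user)
```

After normalisation the two users look identical, which is the point: the feature describes how a user splits attention among friends rather than how much the user posts overall.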

What do the authors think about inferring connection strength on the basis of binary preferential ground truth data? Are there perhaps drawbacks to the 2AFC procedure that I did not consider? The currently available ground truth data could in fact be converted into binary preference data, and I am curious what additional insights re-analysing the data this way could yield.

Competing interests declared: The author of this comment is an employee of PeerIndex, a company engaged in research on social network analysis.

RE: Absolute vs. Relative Data Collection

jhfowler replied to fhuszar on 04 Jan 2013 at 17:15 GMT

Thanks for the thoughtful comment!

I agree that all three of the issues raised are important for improving the precision of the inference about which friend is the closest friend for a given individual. For this reason, we used a percentile rank method in our paper:

"A 61-Million-Person Experiment in Social Influence and Political Mobilization"
Nature 489: 295–298 (13 September 2012)

The percentile rank method takes into account absolute activity of users (Issue #1), since it is a within-user measure. It also allows people with more connections to have more friends (Issues #2 and #3). But it implicitly assumes proportionality -- someone with twice as many Facebook friends will have twice as many close friends. An advantage of absolute measurement is that it more effectively allows the data to reveal which individuals have many close ties and which have few.

We think the idea of two-alternative forced choice is a good one from the point of view of data analysis, but it is extremely demanding of subject time. A user with n friends would need to report on n*(n-1)/2 choices. For the average user with 150 Facebook friends, that's 11,175 responses!
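
The count above can be checked in a couple of lines; n_pairwise is a name chosen for this example.

```python
import math

def n_pairwise(n_friends):
    # Number of unordered friend pairs: n * (n - 1) / 2.
    return math.comb(n_friends, 2)

full_design = n_pairwise(150)
print(full_design)  # 11175
```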

We could in principle turn the best friend into n-1 of these choices, since he or she was chosen over all others. But analysis of those dyads is essentially equivalent to what we have already done here.
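
A small sketch of that conversion, assuming the only label available is a single "closest friend" per user. The function name and the friend list are hypothetical.

```python
def top_choice_to_pairs(friend_ids, best_friend):
    """Expand one 'closest friend' label into (chosen, not chosen) pairs."""
    if best_friend not in friend_ids:
        raise ValueError("best_friend must be one of friend_ids")
    return [(best_friend, other) for other in friend_ids if other != best_friend]

friends = ["alice", "bob", "carol", "dave"]   # hypothetical friend list
pairs = top_choice_to_pairs(friends, "bob")
print(pairs)
```

Note that every resulting pair has the same winner, so the label says nothing about how the remaining friends rank against each other.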

In sum, we hope others will follow up with novel alterations to our method like the one proposed. It will be good to better understand the utility of absolute vs. relative measures, and we are hopeful that the algorithms developed as a consequence of these efforts will become even better at inferring tie strength between real world friends.

No competing interests declared.

RE: RE: Absolute vs. Relative Data Collection

fhuszar replied to jhfowler on 04 Jan 2013 at 18:39 GMT

Thank you for the detailed response. I am following up to clarify a few points around 2AFC.

With the 2AFC procedure, one does not have to obtain full matrix data, that is, all n*(n-1)/2 possible pairwise comparisons. Partial data (for example, a fixed number of randomly chosen friend pairs per user) is sufficient to infer the underlying function with high accuracy using statistical machine learning techniques such as logistic regression.
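
The simulation below is one way to sketch the partial-data idea under stated assumptions: latent tie strengths are drawn at random, answers follow a logistic choice model, and only a fixed budget of randomly chosen pairs is asked. All sizes, the seed, and the function names are hypothetical, and how well the ordering is recovered will depend on the noise level and the budget.

```python
import math
import random
from itertools import combinations

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_scores(n, comparisons, steps=400, lr=0.05, l2=0.01):
    """Gradient ascent for the assumed model P(i over j) = sigmoid(s[i] - s[j])."""
    s = [0.0] * n
    for _ in range(steps):
        grad = [-l2 * v for v in s]
        for winner, loser in comparisons:
            miss = 1.0 - sigmoid(s[winner] - s[loser])
            grad[winner] += miss
            grad[loser] -= miss
        s = [v + lr * g for v, g in zip(s, grad)]
    return s

rng = random.Random(0)                       # fixed seed, illustration only
n_friends, budget = 30, 200                  # hypothetical sizes
true_strength = [rng.gauss(0.0, 2.0) for _ in range(n_friends)]

all_pairs = list(combinations(range(n_friends), 2))   # every possible question
asked = rng.sample(all_pairs, budget)                 # ask only a random subset

# Simulate noisy answers: the stronger tie usually, but not always, is chosen.
answers = []
for i, j in asked:
    if rng.random() < sigmoid(true_strength[i] - true_strength[j]):
        answers.append((i, j))
    else:
        answers.append((j, i))

estimated = fit_scores(n_friends, answers)

# Fraction of ALL pairs (asked or not) whose order the estimate gets right.
agree = sum(
    (estimated[i] > estimated[j]) == (true_strength[i] > true_strength[j])
    for i, j in all_pairs
)
print(f"asked {budget} of {len(all_pairs)} pairs; "
      f"order agreement {agree / len(all_pairs):.2f}")
```
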
2AFC is successfully used this way in cognitive psychology to uncover mental representations, see for example

Nevertheless, if the length of the experiment is prohibitive, one could further optimise the procedure by using sequential optimal experiment design, also known as active learning. In this scenario the questions presented to the user would be optimised, so they provide maximal information about the underlying function one tries to learn. It's like playing 20 questions against nature: trying to recover the underlying patterns with as few questions as possible.
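
A very simple stand-in for this idea is uncertainty sampling: ask next about the pair whose outcome the current model is least sure of. This is only a heuristic sketch, not a full optimal experiment design, and the scores and names below are hypothetical.

```python
import math
from itertools import combinations

def most_uncertain_pair(scores, asked):
    """Return the not-yet-asked pair whose predicted outcome is closest to 50/50."""
    best_pair, best_gap = None, float("inf")
    for i, j in combinations(range(len(scores)), 2):
        if (i, j) in asked:
            continue
        # Predicted probability that friend i is chosen over friend j.
        p = 1.0 / (1.0 + math.exp(-(scores[i] - scores[j])))
        gap = abs(p - 0.5)
        if gap < best_gap:
            best_pair, best_gap = (i, j), gap
    return best_pair

current_scores = [2.0, 0.3, 0.1, -1.5]   # hypothetical current estimates
already_asked = {(0, 1)}                 # pairs stored with the smaller index first
next_question = most_uncertain_pair(current_scores, already_asked)
print(next_question)  # (1, 2)
```

In a real session one would refit the scores after each answer and repeat, stopping when the budget of questions is used up.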

I disagree that turning existing data into pairwise preference data and then applying a preference learning model would be essentially equivalent to your approach using binary classifiers/regression. Mathematically, that procedure is more similar to introducing additional latent variables that the classification/regression-based analysis did not have.

Thanks again for the detailed response to my comment.

No competing interests declared.

RE: RE: RE: Absolute vs. Relative Data Collection

jhfowler replied to fhuszar on 05 Jan 2013 at 12:21 GMT

Can you show an example where using 2AFC on all n-1 choices, given information about which friend is the "best friend", yields a substantively different result than using logistic regression, support vector machines, or random forests (the three techniques we test separately in this article)?

I have to say, I would be very surprised if that were true, so I'd love to see a demonstration using simulated data on a sample of the same size as the one we report.

I suspect there may be some slight improvement in AUC, but not enough to matter at this resolution.

No competing interests declared.