Fig 1.

Example application of a real-time community detection system.

Diageo wants to explore the market (competitors, customers, associations etc.) around their brand. They feed in information about themselves (“seeds”). In this example the seeds are the company itself (Diageo) and some of their major brands (Smirnoff, Baileys and Guinness). Our system finds accounts that are similar or related to the seeds and then structures those accounts into communities.

Table 1.

Comparison of related work.

SCM indicates that the system runs on a Single Commodity Machine.

Fig 2.

Illustration of LSH applied to minhash signatures.

Each row represents a signature. The signatures have been banded up, so that each band contains two hashes. Accounts A1 and A2 will be grouped as similar candidates since they have identical signatures in Band 1.
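
The banding step can be illustrated with a minimal sketch. The signature values, the account A3 and the function name `lsh_candidate_pairs` are hypothetical and chosen only for illustration; they are not taken from the figure. Two accounts become candidates as soon as they agree on every hash in at least one band.

```python
from collections import defaultdict


def lsh_candidate_pairs(signatures, rows_per_band):
    """Return pairs of accounts whose signatures agree on at least one band."""
    buckets = defaultdict(set)
    for account, sig in signatures.items():
        for start in range(0, len(sig), rows_per_band):
            band = tuple(sig[start:start + rows_per_band])
            # Key on the band index too, so only matching bands are compared.
            buckets[(start // rows_per_band, band)].add(account)

    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


# Hypothetical signatures of six hashes, split into three bands of two.
signatures = {
    "A1": [3, 7, 12, 5, 9, 1],
    "A2": [3, 7, 40, 8, 2, 6],
    "A3": [11, 4, 13, 2, 9, 7],
}
pairs = lsh_candidate_pairs(signatures, rows_per_band=2)
print(pairs)
```

Here A1 and A2 share the values (3, 7) in the first band, so they are the only candidate pair; A3 matches neither account on a complete band.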

Table 2.

Typical runtimes and space requirements for systems performing local community detection on the Twitter Follower network of 700 million vertices and 20 billion edges and producing 100 vertex output communities.

Fig 3.

Sorting similarities of LSH candidates.

The diagram shows a set of seed accounts (X) bounded by an ellipse in Jaccard space. Outside of the ellipse are a set of LSH candidate accounts (+). At each iteration the candidate account (A*) closest (according to the Jaccard distance) to the center (X) of the seeds is added to the list of returned values.
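
A rough sketch of this selection loop follows. It makes two assumptions that the caption does not confirm: the distance to the "center" of the seeds is taken as the mean Jaccard distance to each seed, and exact follower sets are used instead of minhash estimates. The names `rank_candidates`, `seeds` and `candidates` are illustrative.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def rank_candidates(seed_sets, candidate_sets, k):
    """Repeatedly pick the remaining candidate closest to the seeds."""

    def distance_to_seeds(followers):
        # Assumption: "center" distance = mean distance to the seeds.
        total = sum(jaccard_distance(followers, s) for s in seed_sets)
        return total / len(seed_sets)

    remaining = dict(candidate_sets)
    ranked = []
    while remaining and len(ranked) < k:
        # Ties are broken by name so the result is deterministic.
        best = min(remaining, key=lambda n: (distance_to_seeds(remaining[n]), n))
        ranked.append(best)
        del remaining[best]
    return ranked


seeds = [{1, 2, 3, 4}, {2, 3, 4, 5}]
candidates = {"c1": {2, 3, 4}, "c2": {1, 9}, "c3": {7, 8}}
ranked = rank_candidates(seeds, candidates, k=2)
print(ranked)
```

With a fixed set of seeds this loop is equivalent to sorting the candidates once by distance; it is written iteratively only to mirror the figure.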

Fig 4.

Visualizing the intersection graph generation.

Interesting vertices are depicted as larger red nodes, and the neighbors as smaller, more numerous gray nodes. A shows a complete social network. B depicts the overlapping bipartite neighborhood graphs of the three interesting vertices in A. C summarizes the social network in A by an inferred network using the Jaccard similarity measure of the set of neighboring vertices as edge weights. Vertices connected by high weights are more likely to be in the same community.
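
The weighting in panel C can be sketched as follows, assuming each interesting vertex is described by the exact set of its neighbors. The vertex names and neighbor sets are hypothetical, and pairs with no shared neighbors are simply left out of the inferred network.

```python
from itertools import combinations


def jaccard_edge_weights(neighbors):
    """Weight each pair of vertices by the Jaccard similarity of their neighbor sets."""
    weights = {}
    for u, v in combinations(sorted(neighbors), 2):
        union = neighbors[u] | neighbors[v]
        if not union:
            continue
        weight = len(neighbors[u] & neighbors[v]) / len(union)
        if weight > 0:
            # Keep only pairs that share at least one neighbor.
            weights[(u, v)] = weight
    return weights


# Hypothetical neighbor sets for three interesting vertices.
neighbors = {"X": {1, 2, 3, 4}, "Y": {3, 4, 5, 6}, "Z": {7, 8}}
weights = jaccard_edge_weights(neighbors)
print(weights)
```

X and Y share two of six distinct neighbors, giving an edge weight of 1/3, while Z shares none and stays disconnected.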

Fig 5.

The full process diagram.

A set of seeds is queried using LSH and Minhash Similarity. The weighted adjacency matrix for the top 100 results is estimated using minhash signatures. The WALKTRAP community detection algorithm is applied to the weighted adjacency matrix, and the results are visualized.

Table 3.

Properties of ground-truth communities sorted by edge density.

CR stands for Conductance Ratio. High values of clustering, density and separability and low values of CR, conductance and cohesiveness indicate good communities.

Fig 6.

Mixed martial arts (MMA) community.

The MMA community is relatively homogeneous and densely interconnected with high clustering and good separability from the rest of the network. The only disconnected region is the yellow region, which has been magnified to show that it is made up of Olympic judo competitors. This community is well detected by all methods.

Fig 7.

Basketball community.

The basketball community has attributes similar to the baseball and American football communities: all are densely connected and well separated from the rest of the network. The individual team structure is not apparent in the graph. Instead, the two large clusters show teams from the Eastern and Western Conferences. The small peripheral clusters are mostly major college teams. We have magnified an area showing players of the Women's National Basketball Association.

Fig 8.

Alcohol community.

This is a low-density community with poor clustering. It is divided into broad classes of drinks such as beer, spirits and wine. We have magnified an area of the cider sub-community.

Fig 9.

Hotels network.

The hotels community has low conductance, indicating that it is not well separated from the rest of the network. It also has high cohesiveness, indicating that it contains components that appear to be the true modular units. The two clearly visible subcomponents are the Four Seasons brand, in blue to the left, and the hotels of Las Vegas, which are magnified.

Fig 10.

Dendrograms showing the strength of interconnection within communities.

The vertical axes show the Jaccard distances. Blue areas are weakly connected. In each colored region, no two nodes are separated by a Jaccard distance greater than 0.85. The dendrograms are agglomerative: all accounts with a Jaccard distance less than the y-value are fused together into a super-node. The fusing process is sequential, and the x-axis indicates the order of fusing, with the first nodes to agglomerate at the right. The bottom-right subfigure shows team sports (here: Basketball), which contain many highly connected sub-groups. The bottom-left subfigure shows the most clearly defined communities (here: Mixed Martial Arts), containing sub-communities mostly due to nationality. The top-right subfigure shows industrial groups (here: Alcohol) with limited interactions. The top-left subfigure shows industrial groups (here: Hotels) with small, highly connected groups due to sub-brands.

Fig 11.

Expected error from Jaccard estimation using minhash signatures as a function of the number of hashes used in the signature.

The error bars show twice the standard error using 400,000 data points.
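
Why the error shrinks with longer signatures can be seen in a small sketch. It estimates the Jaccard coefficient as the fraction of signature positions on which two accounts agree, which is the standard minhash estimator; whether this matches the article's Eq (3) exactly is an assumption. The salted BLAKE2b hash family and the example sets are illustrative choices, not the authors' implementation. If the hashes are independent, the standard error of the estimate is sqrt(J(1 - J) / k) for k hashes.

```python
import hashlib


def hash_value(item, index):
    """One member of a hash family: BLAKE2b salted with the hash index."""
    digest = hashlib.blake2b(
        str(item).encode(), digest_size=8, salt=index.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes):
    """Signature = minimum hash value of the set under each hash function."""
    return [min(hash_value(x, i) for x in items) for i in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Two hypothetical follower sets with a true Jaccard coefficient of 1/3.
a = set(range(0, 300))
b = set(range(150, 450))
true_j = len(a & b) / len(a | b)

sig_a = minhash_signature(a, 200)
sig_b = minhash_signature(b, 200)
estimate = estimate_jaccard(sig_a, sig_b)
print(true_j, estimate)
```

With 200 hashes the expected standard error here is about 0.033; quadrupling the signature length would roughly halve it.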

Table 4.

Twitter accounts with the highest Jaccard similarities to @Nike.

J and R give the true Jaccard coefficient and Rank, respectively. The remaining columns give approximations using Eq (3), where the superscript determines the number of hashes used. Signatures of length 1,000 largely recover the true Rank.

Table 5.

Twitter dataset area under the recall curves (Fig 12).

Bold entries indicate the best performing method. Minhash similarity (MS) is the best method in 8 cases, Agglomerative Clustering (AC) in 8 cases and Personalized PageRank (PPR) in none. A perfect community detector would score 0.5.
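
The perfect score of 0.5 can be checked with a short sketch. It assumes the recall curve is plotted against the number of additions divided by the ground-truth community size, with that fraction running from 0 to 1, and that the area is computed with the trapezoidal rule; the function name and inputs are illustrative. A perfect detector adds a true member at every step, so recall equals the x-value and the area is that of a triangle with unit base and unit height.

```python
def area_under_recall(is_member, community_size):
    """Trapezoidal area under recall versus additions / community_size."""
    hits = 0
    recalls = [0.0]
    for flag in is_member:
        hits += 1 if flag else 0
        recalls.append(hits / community_size)
    step = 1.0 / community_size
    return sum(
        (recalls[i] + recalls[i + 1]) / 2 * step for i in range(len(recalls) - 1)
    )


# A perfect detector: every one of the 100 additions is a true member.
perfect = area_under_recall([True] * 100, community_size=100)
# A weaker detector: only every second addition is a true member.
weaker = area_under_recall([True, False] * 50, community_size=100)
print(perfect, weaker)
```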

Fig 12.

Twitter dataset average recall (with standard errors) of Agglomerative Clustering (yellow), Personalized PageRank (red) and Minhash Similarity (blue) against the number of additions to the community expressed as a fraction of the size of the ground-truth communities given in Table 3.

The tight error bars indicate that the methods are robust to the choice of seeds.

Table 6.

Email dataset area under the recall curves (Fig 13).

Bold entries indicate the best performing method. Minhash similarity (MS) is the best method in 10 cases, Agglomerative Clustering (AC) in 5 cases and Personalized PageRank (PPR) in none. A perfect community detector would score 0.5.

Fig 13.

Email dataset average recall (with standard errors) of Agglomerative Clustering (yellow), Personalized PageRank (red) and Minhash Similarity (blue) against the number of additions to the community expressed as a fraction of the size of the ground-truth communities given in Table 3.

The tight error bars indicate that the methods are robust to the choice of seeds.

Table 7.

Clustering runtimes averaged over communities.

Fig 14.

Communities around seeds from the US Republican Party in December 2015.

Seeds are “Donald Trump”, “Marco Rubio”, “Ted Cruz”, “Ben Carson” and “Jeb Bush”. The vertex size depicts degree of similarity to the seeds. Edge widths show pairwise similarities. Colors are used to show different communities.

Fig 15.

Visualization of the Twitter Follower graph around global pop music.

Seeds are “Justin Bieber”, “Lady Gaga” and “Katy Perry”. Vertex size depicts degree of similarity to the seeds. Edge widths show pairwise similarities. Colors represent different communities.

Fig 16.

The major social networks.

Seeds are Twitter, Facebook, YouTube and Instagram.

Fig 17.

The many faces of RedBull.

Fig 18.

Visualization of the Twitter graph around European sport brands.

Vertex size depicts degree of similarity to the seeds. Edge widths show pairwise similarities. Colors represent different communities. Seeds are Adidas and Puma.

Fig 19.

US sports brands.

Seeds are Nike, Reebok, UnderArmour and Dicks.
