Dimensionality of Social Networks Using Motifs and Eigenvalues

We consider the dimensionality of social networks, and develop experiments aimed at predicting that dimension. We find that a social network model with nodes and links sampled from an m-dimensional metric space with power-law distributed influence regions best fits samples from real-world networks when m scales logarithmically with the number of nodes of the network. This supports a logarithmic dimension hypothesis, and we provide evidence with two different social networks, Facebook and LinkedIn. Further, we employ two different methods for confirming the hypothesis: the first uses the distribution of motif counts, and the second exploits the eigenvalue distribution.


INTRODUCTION
Empirical studies of on-line social networks as undirected graphs suggest these graphs have several intrinsic properties: highly skewed or even power-law degree distributions [3,12], large local clustering [36], constant [36] or even shrinking diameter with network size [23], densification [23], and localized information flow bottlenecks [11,24]. Many existing models of social network connections and growth have trouble capturing all of these properties simultaneously [14,16,17]. One that does is the geometric protean model (GEO-P) [5]. It differs from other network models [3,21-23] because all links in geometric protean networks arise based on an underlying metric space. This metric space mirrors a construction in the social sciences called Blau space [25]. In Blau space, agents in the social network correspond to points in a metric space, and the relative position of nodes follows the principle of homophily [26]: nodes with similar socio-demographics are closer together in the space.
In order to accurately capture the observed properties of social networks, in particular constant or shrinking diameters, the dimension of the underlying metric space in the GEO-P model must grow logarithmically with the number of nodes. A logarithmically scaled dimension is a property that occurs frequently in network models that incorporate geometry, such as multiplicative attribute graphs [16] and random Apollonian networks [39]. Because of its prevalence in these models, the logarithmic relationship between the dimension of the metric space and the number of nodes has been called the logarithmic dimension hypothesis [5]. This hypothesis generalizes previous analysis which shows that individuals in a social network can be identified with relatively little information. For instance, Sweeney found that 87% of the U.S. population had reported attributes that likely made them unique using only zip code, gender, and date of birth, and concluded that few attributes were needed to uniquely identify a person in the U.S. population [33]. In the following study, we find evidence of the log-dimension property in real-world social networks.
We emphasize that the present paper is the first study that we are aware of which attempts to quantify the dimensionality of social networks and Blau space. While we do not claim to prove conclusively the logarithmic dimension hypothesis for such networks, our experiments, like those of [33], suggest a dimension much smaller than the overall size of the networks. Interestingly, speculation on the low dimensionality of social networks arose independently from theoretical analysis of mathematical models of social networks in [5,16,39].

MGEO-P
The particular network model we study is a simple variation on the GEO-P model that we name the memoryless geometric protean model (MGEO-P), since it enables us to approximate a GEO-P network without using a costly sampling procedure. The MGEO-P model depends on five parameters: n, the total number of nodes; m, the dimension of the metric space; 0 < α < 1, the attachment strength parameter; 0 < β < 1 − α, the density parameter; and 0 < p ≤ 1, the connection probability.
The nodes and edges of the network arise from the following process. Initially the network is empty. At each of n steps, a new node v arrives and is assigned both a random position p_v in R^m within the unit hypercube [0, 1]^m and a random rank r_v from those unused ranks remaining in the set 1 to n. The influence radius of any node is computed based on the formula I(r) = (1/2)(r^−α n^−β)^(1/m). With probability p, the node v forms an undirected connection to any preexisting node u where D(v, u) ≤ I(r_v), where the distances are computed with respect to the torus metric D(x, y) = min { ‖x − y − u‖∞ : u ∈ {−1, 0, 1}^m }, and where ‖·‖∞ is the infinity norm. We note that this implies that the geometric space is symmetric in any point, as the metric "wraps" around like on a torus. The volume of space influenced by the node is r_v^−α n^−β. Then the next node arrives and repeats the process until all n nodes have been placed.
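The generative process just described can be sketched directly. The following is a minimal Python sketch (our own, not the authors' code), using the torus infinity-norm metric and the influence radius I(r) = (1/2)(r^−α n^−β)^(1/m); function and variable names are ours.

```python
import numpy as np

def mgeo_p(n, m, alpha, beta, p, seed=0):
    """Sample an MGEO-P graph; returns an edge list on nodes 0..n-1."""
    rng = np.random.default_rng(seed)
    pos = rng.random((n, m))        # uniform positions in the unit hypercube
    rank = rng.permutation(n) + 1   # ranks 1..n, assigned uniformly at random
    edges = []
    for v in range(n):
        # influence radius of the arriving node v: an infinity-norm ball
        # of volume rank^-alpha * n^-beta on the torus
        radius = 0.5 * (rank[v] ** -alpha * n ** -beta) ** (1.0 / m)
        # torus ("wrap-around") infinity-norm distance to all earlier nodes
        diff = np.abs(pos[:v] - pos[v])
        dist = np.max(np.minimum(diff, 1.0 - diff), axis=1)
        for u in np.flatnonzero(dist <= radius):
            if rng.random() < p:    # each candidate link kept with prob. p
                edges.append((int(u), v))
    return edges
```

Each edge (u, v) satisfies u < v, i.e., u was already present when v arrived.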
Figure 1 illustrates two features of the model. First, after a few steps, only a few nodes exist and even a large influence region will only produce a few links. Second, when the number of steps approaches n, a large influence region will produce many links. The idea behind the model is a simple abstraction of the growth of an on-line social network. When the network is first growing (few steps), even influential members will only know a few other members who have also joined. But after the network has been around for a while (many steps), influential members will begin with many friends.
We formally prove that the MGEO-P model has the following properties. Let α ∈ (0, 1), β ∈ (0, 1 − α), p ∈ (0, 1], and let m be a positive integer. The following statements hold with probability tending to 1 as n tends to ∞:
1. Let v be a node of MGEO-P(n, m, α, β, p) with rank R that arrived at step t. Then deg(v) = (1 + o(1)) p n^−β ((n − t) R^−α + t n^−α/(1 − α)). This result implies that the degree distribution follows a power law with exponent η = 1 + 1/α.
2. The average degree of a node of MGEO-P(n, m, α, β, p) is (1 + o(1)) p n^(1−α−β)/(1 − α).

3. The diameter of MGEO-P(n, m, α, β, p) is n^Θ(1/m).
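A short sketch of where the exponent η = 1 + 1/α noted above comes from, assuming (as the appendix derivation suggests) that a node of rank R has degree roughly proportional to R^−α:

```latex
% Ranks are uniform on \{1,\dots,n\}. If a rank-$R$ node has degree
% $\deg(v) \approx C\,R^{-\alpha}$ for some scale $C = C(n, p, \beta)$,
% then the tail of the degree distribution is
\Pr[\deg(v) \ge k]
  \;=\; \Pr\!\left[ R \le (k/C)^{-1/\alpha} \right]
  \;\approx\; \frac{1}{n}\left(\frac{k}{C}\right)^{-1/\alpha}
  \;\propto\; k^{-1/\alpha},
% so the number of nodes of degree exactly $k$ scales like the
% difference of consecutive tails,
N_k \;\propto\; k^{-(1 + 1/\alpha)},
% giving the stated power-law exponent $\eta = 1 + 1/\alpha$.
```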

Figure 2: At left and center, we show the steps involved in fitting via graphlets; at right and center, we show the steps involved in fitting via the spectral histogram. Throughout, red lines denote the flow of features for the MGEO-P networks, whereas blue lines denote the flow of features for the original networks. At the bottom, we show an enlarged representation of the 8 graphlets we use.
This last property suggests that, ignoring constants, for a network with n nodes and diameter D, the expected dimension based on the MGEO-P model is m = log(n)/log(D). Thus, like some network models that incorporate geometry [16,39], in the MGEO-P model the dimension m must scale logarithmically in order for the diameter to remain constant as n increases.

Experimental Design and Graph Summaries
Both graph motifs and spectral densities are numeric summaries of a graph that abstract the details of a network into a small set of values that are independent of the particular nodes of a network. These summaries have the property that isomorphic graphs have the same values, although there may exist non-isomorphic graphs that also have the same values; for instance, co-spectral graphs have the same spectral densities. We will use these summaries to determine the dimension of the metric space that best matches the Facebook and LinkedIn networks, as illustrated in Figure 2. Graph motifs, graphlets, or graph moments are the frequencies or abundances of specific small subgraphs in a large network. We study undirected, connected subgraphs on up to four nodes as our graph motifs. This is a set of 8 graphs shown at the bottom of Figure 2, along with the single two-node graph of an edge. The spectral density of a graph is the statistical distribution of the eigenvalues of the normalized Laplacian matrix, as indicated in the upper right of that figure. These eigenvalues indicate and summarize many network properties, including the behavior of a uniform random walk, the number of connected components, an approximate connectivity measure, and many other features [2,6]. Thus, the spectral density of the normalized Laplacian is a particularly helpful characterization that captures many such separate network properties.
We study dimensional scaling in social networks by comparing samples of MGEO-P networks of varying dimensions with samples of social network data from Facebook and LinkedIn. We pay particular attention to the relationship between the number of nodes n of the network and the dimension m of the best-fit MGEO-P network. In order to determine what underlying dimension for MGEO-P best fits a given graph, we employ two distinct methods. For one experiment, we use features known as graph motifs, graphlets, or graph moments in concert with a support vector machine (SVM) classifier. This approach has been used successfully to determine the best generative mechanism of a network [27] and to select parameters of complicated network models to fit real-world data [14,28]. In a second experiment, we use spectral densities of the normalized Laplacian matrix of a graph and a KL-divergence similarity measure, which has been used to match protein networks between species [1,31]. We find evidence of the logarithmic dimension hypothesis in both cases.

Figure 3: The scale of the network data involved in our study varies over three orders of magnitude. We see similar scaling for both types of networks, but with slightly different offsets. For Facebook, log10(edges) = 1.06 log10(nodes) + 1.35 with R² = 0.945; for LinkedIn, log10(edges) = 1.07 log10(nodes) + 0.56 with R² > 0.999. The regularity in the LinkedIn sizes is due to our construction of those networks.

The data
Facebook distributed 100 samples of social networks from universities within the United States measured as of September 2005 [34], which range in size from 700 nodes to 42,000 nodes. We call these networks the Facebook samples. The LinkedIn samples were created from the LinkedIn connection network together with the creation time of each connection from May 2003 to October 2006. To perform our experiments on networks of different sizes, we built snapshots of the LinkedIn network at various timestamps. We then extracted a dense subset of the graph at various time points that is representative of active users; we used the 5-core of the network for this purpose [32]. See Figure 3 and the appendix for additional properties of these networks. In both networks, the number of edges per node grows at essentially the same rate.
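The 5-core used above is the largest subgraph in which every node has degree at least 5; it can be extracted with any k-core routine. A minimal sketch using networkx (an assumption on our part; the text does not say which tool was used):

```python
import networkx as nx

def five_core_snapshot(edges):
    """Return the 5-core: the maximal subgraph with all degrees >= 5."""
    G = nx.Graph(edges)
    G.remove_edges_from(nx.selfloop_edges(G))  # k_core forbids self-loops
    return nx.k_core(G, k=5)
```

Nodes of degree below 5 are peeled off repeatedly until every remaining node has degree at least 5, which removes inactive or peripheral users.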

RESULTS
The results of our dimensional fitting for graphlets are shown in Figure 4, and the results of the fitting using spectral densities are in Figure 5. For both datasets and both types of statistics, the best-fit dimension scales logarithmically with the number of nodes and closely tracks a simple model prediction based on the diameter D of the network (the model curve plots m = log(n)/log(D)). These experiments corroborate the logarithmic dimension hypothesis, although the precise fits differ. Using graphlets, for the Facebook data, we find that the dimension m = 2.06 log(n)/log(10) − 3.00, with 95% confidence intervals of (1.851, 2.264) and (−3.821, −2.182), respectively. For the LinkedIn data, we find that m = 0.98 log(n)/log(10) + 1.01, with 95% confidence intervals of (0.786, 1.178) and (0.1591, 1.87). Using spectral densities, for the Facebook networks, we find that d = 1.21 log(n)/log(10) + 1.65 is the best-fit line, with 95% confidence intervals for the coefficients of (0.9782, 1.446) and (0.7242, 2.578). For the LinkedIn networks, we find d = 0.77 log(n)/log(10) + 1.1; the 95% confidence intervals for these coefficients are, respectively, (0.56, 0.99) and (0.23, 1.95).

Figure 4: Facebook dimension at top, LinkedIn dimension at bottom, computed via graphlet features and a support vector machine classifier to select the dimension. For the Facebook data, we find that m = 2.06 log(n)/log(10) − 3.00. For the LinkedIn data, we find that m = 0.7333 log(n)/log(10) + 1. In the left figure, we show the variance in the fitted dimension as a box-plot. We estimate the variance by using only 20% of the original training data and repeating over 50 trials. There are only a few outliers for small dimensions.
Figure 5: At top, Facebook data; at bottom, LinkedIn data. We show the fitted dimensions based on the minimum KL-divergence between the spectral densities. The dimensions shift modestly higher for Facebook and remain almost unchanged for LinkedIn. Both still closely correlate with the theoretical prediction based on the model.

Sensitivity
We investigate the sensitivity of the graphlet results in two settings. If we reduce the training set size of the SVM classifier by using a random subset of 20% of the input training data and then rerun the training and classification procedure 50 times, then we find a distribution over dimensions that we report as a box-plot, shown in Figure 4. In the appendix, we further study perturbation results that argue against these results occurring due to chance. In particular, we find that these dimensions are robust to moderate changes to the network structure, and we find that our methodology does not predict useful dimensions for Erdős-Rényi random graphs or random graphs with the same degree distribution. We do not report a precise p-value as there are no widely accepted null models for network data. We study the sensitivity of the spectral densities by looking for matches that are within 105% of the true minimum divergence. This defines a dimension interval around each match that is small for all of our examples.

Discussion
There is a growing body of evidence that argues for some type of geometric structure in social and information networks. An important study in this direction views networks as samples of geometric graphs within a hyperbolic space [18-20]. Recent work has further shown that hyperbolic embeddings reproduce shortest-path metrics in real-world networks [40]. In both MGEO-P and hyperbolic random geometric networks, highly skewed or power-law degree distributions are imposed, either directly as in MGEO-P, or implicitly as in the hyperbolic space scaling. These results further support hidden metric structures in networks by empirically confirming a prediction about the dimension of the metric space made by one particular model.

Figure 6: For three of the Facebook networks, we show the eigenvalue histogram in red, the eigenvalue histogram from the best-fit MGEO-P network in blue, and the eigenvalue histograms for samples from the other dimensions in grey. The MGEO-P model correctly captures the peak of the distribution around 1, but fails to completely capture the tail between 1 and 2. Thus, we see meaningful differences between these profiles and hence do not suggest that MGEO-P captures all of the properties of real-world social networks.
Note that these results do not conclusively argue that MGEO-P is a perfectly accurate model for social networks; there are meaningful differences between the spectral histograms from MGEO-P and real social networks, see Figure 6. There are also similar differences in the graphlet counts. Our results support a different hypothesis: the closest MGEO-P network to a given social network has a metric space whose dimension scales logarithmically with the number of nodes. In the appendix material we have determined that this property is not due to either the edge density or the degree distribution; thus, our findings appear to reflect a new intrinsic property of social networks.

EXPERIMENTAL DESIGN
Given a graph G = (V, E), we employ the following methods to determine the dimension m of the MGEO-P models. The first approach employs a complex statistical technique, the support vector machine classifier, to determine nonlinear predictive correlations between the graphlet counts and the dimension. This sophistication renders the method opaque: the precise similarity mechanism is difficult to interpret. The second approach is simple and still illustrates the dimensional scaling, although the precise dimensions differ, which indicates that it is matching the network in a different way.

Estimating dimensions using graphlets and support vector machines
The relationship between the dimension of a graph and its graphlets is highly nonlinear, and so we used a multi-class support-vector machine (SVM) based classification tool from Weka to predict this relationship. In this case, each dimension is a class, but as an SVM can only make a binary decision, we train the SVM using a dimension-vs-dimension classification. That is, we build a classifier to predict dimension 5-vs-dimension 3, dimension 5-vs-dimension 4, etc., so there are 66 = "12-choose-2" SVMs trained. The dimension picked most often among these classifiers is the predicted class; this is the standard behavior of the sequential minimal optimization (SMO) classifier used in Weka. The dimension of a real-world network is then predicted by running this classifier on the graphlet counts of the network. An alternative methodology (which has had some previous success) would be to train the classifier using alternating decision trees; however, this training methodology significantly restricts the behavior of the classifier and produces inconsistent results.
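The pairwise voting scheme can be reproduced outside Weka. The sketch below uses scikit-learn's SVC in one-vs-one mode as a stand-in for the SMO classifier, on synthetic stand-in features (the feature values, class counts, and names here are illustrative, not the paper's data); with 12 candidate dimensions it fits exactly C(12, 2) = 66 pairwise classifiers.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Toy stand-in for log-transformed graphlet counts: one 8-dimensional
# feature vector per sampled MGEO-P graph, labelled by dimension m = 1..12.
dims = np.repeat(np.arange(1, 13), 30)
X = rng.normal(size=(dims.size, 8)) + dims[:, None] * 0.5  # synthetic

# One-vs-one: a binary SVM per unordered pair of classes, 66 in total.
clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, dims)
pairwise = clf.decision_function(X[:1])   # one value per pairwise SVM
pred = clf.predict(X[:1])                 # class winning the most votes
```

`predict` returns the class picked most often among the pairwise decisions, mirroring the voting behavior described above.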

Comparing spectral densities
Given the eigenvalues of the normalized Laplacian, we compute a spectral density by taking a 201-bin histogram of these eigenvalues. We then use the KL-divergence between these histograms, as used in Banerjee and Jost (2009), as a measure of similarity. If P_A and P_B are the histograms of networks A and B normalized to probabilities, then for our 201-bin histograms we have that KL(P_A ‖ P_B) = Σ_{i=1}^{201} P_A(i) log(P_A(i)/P_B(i)). We select the single best dimension based on the value of m that minimizes the KL divergence KL(S, G_m), where S is the sample of either Facebook or LinkedIn and G_m is a sample of an MGEO-P network with dimension m. We add 1 to all of the eigenvalue counts in the histogram as a form of smoothing for the probabilities. We define a dimension interval by looking at the maximum interval such that the extreme points are within 105% of the true minimum.
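A minimal sketch of this comparison, with the 201 bins over [0, 2], the +1 smoothing, and the KL sum as described (function and variable names are ours):

```python
import numpy as np

def spectral_kl(eigs_a, eigs_b, bins=201, smooth=1.0):
    """KL(P_A || P_B) between histograms of normalized-Laplacian eigenvalues.

    Eigenvalues of the normalized Laplacian lie in [0, 2]; a pseudo-count
    of 1 is added to every bin before normalizing, as in the text.
    """
    edges = np.linspace(0.0, 2.0, bins + 1)
    pa = np.histogram(eigs_a, bins=edges)[0] + smooth
    pb = np.histogram(eigs_b, bins=edges)[0] + smooth
    pa = pa / pa.sum()
    pb = pb / pb.sum()
    return float(np.sum(pa * np.log(pa / pb)))
```

The smoothing keeps every bin strictly positive, so the divergence is always finite; the best-fit dimension is the m whose MGEO-P sample minimizes this value.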
Diameters The MGEO-P model of a network predicts that the dimension m should approximate log(n)/log(D), where D is the diameter. However, as D is sensitive to outliers, we use the 99% effective diameter computed via an asymptotically accurate approximation scheme [30] as implemented in the SNAP library on 2011-12-31. The effective diameter of all Facebook networks ranges between 3.5 and 4.6, with a mean of 4.1. For the LinkedIn data, the effective diameter ranges between 4.3 and 5.9, with a mean of 5.4. In both networks, larger graphs have bigger effective diameters, although the differences are slight; the full data is available in the appendix material.
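A brute-force stand-in for this computation (the paper uses SNAP's asymptotically accurate approximation, which interpolates to fractional values such as 3.5; this integer-valued BFS version on an adjacency dict of our own devising is only a sketch):

```python
import collections
import math
import random

def effective_diameter(adj, q=0.99, samples=None, seed=0):
    """Smallest d such that at least a fraction q of reachable node
    pairs are within graph distance d (BFS from all or sampled sources)."""
    rng = random.Random(seed)
    nodes = list(adj)
    sources = nodes if samples is None else rng.sample(nodes, samples)
    dist_counts = collections.Counter()
    for s in sources:
        dist, frontier = {s: 0}, [s]
        while frontier:                      # plain breadth-first search
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        nxt.append(v)
            frontier = nxt
        dist_counts.update(d for d in dist.values() if d > 0)
    total, running = sum(dist_counts.values()), 0
    for d in sorted(dist_counts):
        running += dist_counts[d]
        if running >= q * total:
            return d
    return 0

def predicted_dimension(n, eff_diam):
    """The model prediction m = log(n) / log(D)."""
    return math.log(n) / math.log(eff_diam)
```

On the full networks one would pass `samples` to bound the BFS cost; here exactness is the point of the sketch.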
Graphlets To compute graphlets, we employ the rand-esu sampling algorithm [37] as implemented in the igraph library [8]. This algorithm approximates the count of each subgraph via a stochastic search that depends on the probability of continuing to search. Thus, if the probability is near 1, then the scores are nearly exact but very expensive to compute, and small probabilities truncate the search early to produce fast estimates.
The value we use is 10/n. We use log-transformed output from this procedure in order to capture the dynamic range of the resulting values.

Spectral densities
We approximate the spectral density via a 201-bin histogram of the eigenvalues of the normalized Laplacian, which all fall between 0 and 2. (The choice of 201 was based on prior experience with the spectral histograms of networks.) To compute eigenvalues of a network, we employ the recently developed ScaLAPACK routine using the MRRR algorithm [9,10,35].
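For small graphs, a dense eigensolver is enough to reproduce the histogram input; the sketch below forms L = I − D^(−1/2) A D^(−1/2) with numpy rather than the ScaLAPACK/MRRR routine needed at full scale (isolated nodes are handled by zeroing their scaling):

```python
import numpy as np

def normalized_laplacian_eigs(adj):
    """Eigenvalues of L = I - D^{-1/2} A D^{-1/2}; they lie in [0, 2]."""
    A = np.asarray(adj, dtype=float)
    d = A.sum(axis=1)                                   # node degrees
    dinv = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)       # D^{-1/2} diagonal
    L = np.eye(len(A)) - dinv[:, None] * A * dinv[None, :]
    return np.linalg.eigvalsh(L)                        # symmetric solver
```

For a path on three nodes this yields the eigenvalues 0, 1, and 2, the extremes of the admissible range.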
SVM We used a multi-class support-vector machine (SVM) based classification tool from Weka [38] to predict the relationship between the graphlets and the dimension.
Setting MGEO-P Parameters Consider a graph G = (V, E) that we wish to compare to an MGEO-P sample. The MGEO-P model depends on four parameters: n, m, α, and β. The choice of n is straightforward, as we use the number of nodes of the original graph. Both α and β can be chosen independently of the dimension m. Specifically, α and β determine the average degree of the network and the exponent of the power law in the degree distribution, up to lower-order terms, as shown by property 1 and property 2. By computing just these two simple statistics of a network, the exponent of the power law and the average degree, we can invert these relationships and choose these parameters. Let η be the power-law exponent and ρ be the average degree. Then α = 1/(η − 1), and β follows from the average-degree relationship. We use the following treatment of the probability p. Suppose that the original network has E = nρ/2 edges. Given the output of an MGEO-P network, we randomly delete edges until the output has exactly the same number of edges as the input network. This step can be interpreted as using the value of p necessary to get the same edge count as the original graph. In the case where there are insufficient edges, we leave the output from the MGEO-P generator untouched.
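The inversion can be sketched as follows, under our reading of the two properties: η = 1 + 1/α gives α, and an average degree of roughly p n^(1−α−β)/(1 − α) (lower-order terms ignored, p = 1 at generation time per the edge-deletion treatment) gives β. The formulas below are our own derivation from those assumptions, not quoted from the paper:

```python
import math

def invert_parameters(eta, avg_deg, n, p=1.0):
    """Recover (alpha, beta) from the power-law exponent eta and the
    average degree, assuming eta = 1 + 1/alpha and
    avg_deg ~ p * n**(1 - alpha - beta) / (1 - alpha)."""
    alpha = 1.0 / (eta - 1.0)
    beta = 1.0 - alpha - math.log(avg_deg * (1.0 - alpha) / p) / math.log(n)
    return alpha, beta
```

Round-tripping the assumed forward formulas (e.g., α = 0.5, β = 0.3, n = 10^5 gives η = 3 and average degree 20) recovers the original parameters.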

A MEMORYLESS GEO-P MODEL
A.1 Review of GEO-P
The geometric protean (GEO-P) model is a model for on-line social networks which incorporates geometric and ranking information into an evolving network structure. More specifically, the GEO-P model, as defined by Bonato, Janssen, and Prałat [5], defines a sequence of graphs {G_t : t ≥ 0} on n nodes, where G_t = (V_t, E_t), based on four parameters: the attachment strength α ∈ (0, 1), the density parameter β ∈ (0, 1 − α), the dimension m ∈ N, and the link probability p ∈ (0, 1]. Each node v ∈ V_t has a unique rank r(v, t) ∈ [n], where [n] = {1, 2, . . ., n}; we explicitly write r(v, t) to emphasize that the rank may change with time. In order to stay consistent with the standard usage, the highest rank is 1 and the lowest rank is n. Additionally, each node has a geometric location in [0, 1]^m under the torus metric d(·, ·). That is, for any two points x, y ∈ [0, 1]^m, d(x, y) is defined to be min { ‖x − y − u‖∞ : u ∈ {−1, 0, 1}^m }. We note that this implies that the geometric space is symmetric in any point, as the metric "wraps" around. For any node v, we define its influence region at time t ≥ 0, written R(v, t), to be the ball of Euclidean volume r(v, t)^−α n^−β centered at v. Notice that, since we are in the torus metric, this is a cube measuring r(v, t)^(−α/m) n^(−β/m) on a side.
Note. All asymptotic results in this paper are with respect to n. We say that a statement holds with extremely high probability if it holds with probability at least 1 − n^(−ω(n)) for some function ω(n) with ω(n) → ∞ as n → ∞. In particular, if there are a polynomial number of events, each of which holds with extremely high probability, then all of them hold with extremely high probability.
Let G_0 be any graph. In order to form G_t from G_{t−1}, first choose a node w uniformly at random from V_{t−1} and remove it. The remaining nodes are re-ranked; that is, all nodes with lower ranks than w decrease their ranks by 1. Then place a node v uniformly at random in [0, 1]^m, generate uniformly at random a rank for v, and re-rank the remaining nodes again. Finally, for every node u such that v is in the influence region of u, add the edge {u, v} with probability p. It is clear that this process depends only on the current state of G_t, and so forms an ergodic Markov chain with a limiting distribution π. A random instance of GEO-P is then defined to be a sample from this limiting distribution.
It is clear that the distributions of edges of G_t are determined by the relative rank histories of all the nodes at the times the other nodes entered. More specifically, if we order the nodes of G_t according to their age, with node 1 being the oldest, then for any i > j the probability of the edge {i, j} being present is determined by their respective geometric locations and the rank of node j when node i arrives. Thus, in order to sample from the limiting distribution π, it suffices to sample from the distributions of node histories, then randomly assign locations to the nodes, and determine if the edges are present. We note that according to the distribution π, the final permutation between ages and ranks is uniformly distributed over all permutations. Since there are n! permutations of nodes and at most n² different permutations reachable from a given state, it takes at least log_{n²}(n!) = (n/2)(1 − o(1)) iterations to reach the stationary distribution. Standard results on the mixing rates of random graphs suggest that in order to assure that a sample is close to the stationary distribution, at least Ω(log(n!)) = Ω(n log(n)) iterations are required. In fact, it is easy to see that the stationary distribution is reached at the time when the last node from the initial graph G_0 is removed, which happens with probability 1 + o(1) after (1 + o(1)) n log n steps, by the coupon collector problem.

A.2 Introducing MGEO-P
For large n this number of iterations is a significant computational roadblock, so we introduce here a variant of the GEO-P model which we call the memoryless geometric protean graph (MGEO-P). In essence, this model is the GEO-P model where each node has forgotten its history of ranks. More specifically, a permutation σ on [n] is chosen uniformly at random, and σ(i) represents the rank of the i-th oldest node. Thus, for each pair i > j, the edge {i, j} is potentially present if and only if the node j is in the ball of volume σ(i)^−α n^−β centered around node i. It is worth noting that, as shown in Lemma 5.2 of Bonato et al. [5], if a node in the GEO-P model receives an initial rank R ≥ √n log² n, then its rank remains (1 + o(1))R for its entire lifetime with extremely high probability. Thus, if we imagine coupling the MGEO-P model in the natural way to GEO-P, and assuming that ranks do not change much as mentioned above, we have that for all but a vanishing fraction of the edges, the probability that a given edge is present in one model but not the other is O(p n^(−(α+2β)/2) log^((1−4α)/2)(n)). Hence, we would intuitively expect that the MGEO-P model does not differ too much from the GEO-P model. In order to confirm this, we prove that the parameters we are interested in do not differ by much from the proven parameters of the GEO-P model. Specifically, we look at the average degree, the degree distribution, and the diameter.

A.3 An equivalent description of the MGEO-P model
We now describe a model that is equivalent to the MGEO-P model just introduced, but that we found useful for our analysis; it has a different interpretation. The key change is that we reverse the way links are formed: when a node i arrives in the network, all existing nodes j form links to i if i is within the influence region of j. Intuitively, this models how links may arise in a citation network: a new paper links to those that are topically related (that is, nearby in the metric space) or highly influential. In the language used above, this process is: fix a permutation σ on [n] chosen uniformly at random, where σ(i) represents the rank of the i-th oldest node. Then, for each pair i > j, the edge {i, j} is potentially present if and only if the node i is in the ball of volume σ(j)^−α n^−β centered around node j. The two descriptions are equivalent, as we can simply reverse the order of vertex arrivals; thus, they induce the same distribution over graphs because the order is a uniform random choice.

A.4 The average degree
In order to consider the degrees, we first need the following standard result on the tails of the hypergeometric distribution; see for instance Janson et al. [15].
Lemma 1. Let X be the number of red balls in a set of t balls chosen at random from a set of n balls containing m red balls. Then E[X] = tm/n, and for any ε > 0, Pr[X ≥ E[X] + εt] ≤ exp(−2ε²t). Further, for any ε ∈ (0, 1), Pr[X ≤ (1 − ε)E[X]] ≤ exp(−ε²E[X]/2).
Theorem 1. Let v be a node of MGEO-P(n, m, α, β, p) with rank R and age i. Then deg(v) = (1 + o(1)) p n^−β ((n − i) R^−α + i n^−α/(1 − α)) with extremely high probability.
Proof. Let deg+(v) denote the number of older neighbors of v and let deg−(v) denote the number of younger neighbors of v. In order to determine deg+(v), we consider connecting v to nodes of all ranks other than R and keeping i − 1 of those uniformly at random. The expected degree of v before the edge deletion is (1 + o(1)) p n^(1−α−β)/(1 − α) with extremely high probability, and thus Lemma 1 gives that deg+(v) concentrates around an (i − 1)/(n − 1) fraction of this value with extremely high probability as well. Additionally, if i ≥ log³(n) n^(α+β), then equality holds.
Since the edge probability between v and the younger nodes does not depend on the rank of the younger neighbors, deg−(v) can be expressed as a sum of independent random variables with expectation (n − i) p R^−α n^−β. Hence, by Chernoff bounds, it follows that deg−(v) = (1 + o(1))(n − i) p R^−α n^−β with extremely high probability. In order to express the error as a multiplicative factor, we note that for the entire range of i both of the error terms are individually dominated by one of the primary terms, and hence the claimed bound holds with extremely high probability. Noting the cases where i ≤ log³(n) n^(α+β), and where n − i ≥ log³(n) R^α n^β, completes the proof.
Theorem 2. Let α ∈ (0, 1), β ∈ (0, 1 − α), n ∈ N, m ∈ N, and p ∈ (0, 1]. Then with extremely high probability the average degree of a node of MGEO-P(n, m, α, β, p) is (1 + o(1)) p n^(1−α−β)/(1 − α).
Proof. From the proof of Theorem 1 we have, with extremely high probability for a node v with age i, the stated bound on deg−(v), with equality if i ≥ log³(n) n^(α+β). Now since every edge is counted exactly once in deg−(u) for some node u, the average degree is twice the average of deg−(v) over all nodes, which is (1 + o(1)) p n^(1−α−β)/(1 − α) with extremely high probability. In a similar manner, we find the matching lower bound, completing the proof.

A.5 The degree distribution
Let N_j be the number of nodes in MGEO-P(n, m, α, β, p) with degree precisely j, and let N_{≥k} = Σ_{j≥k} N_j be the number of nodes in MGEO-P(n, m, α, β, p) with degree at least k. We will show that, similarly to the geometric protean graphs, N_{≥k} ∝ k^(−1/α) for a significant range of k, and thus MGEO-P(n, m, α, β, p) exhibits a power-law degree distribution over that range with power-law exponent 1 + 1/α. Following prior work [5], we will characterize the pairs (i, R) of ages and ranks which assure that the degree of a node is at least k, and show that this value concentrates about its expectation using the following specialization of the Azuma-Hoeffding inequality.
Moreover, it will be convenient not to worry about less significant factors, so we will use Õ(f(n)) to denote any function which is at most f(n) times a polylogarithmic factor. Let k and ε be such that the stated conditions hold for some c > 0; then, with extremely high probability, MGEO-P(n, m, α, β, p) satisfies the claimed bound. Proof. We first note that if the age-rank pair (i, R) for a node v satisfies the first condition, then by Theorem 1, with extremely high probability, deg(v) ≥ k; and if it satisfies the second condition, then with extremely high probability deg(v) < k. Let X_i be the event that the node with age i has rank R satisfying the first condition, and let Y_i be the event that the node with age i has rank R satisfying the second.
We recall that the age-rank pairs can be represented by a permutation σ chosen uniformly at random from the symmetric group, and thus it can be generated by a sequence of transpositions (1, a_1)(2, a_2) ⋯ (n, a_n), where each a_i is chosen independently and uniformly at random from {i, i + 1, . . . , n}. Thus, X (and Y) may be viewed as a function of independent random variables, and so Theorem 3 applies. Furthermore, the change of any particular variable impacts the value of X by at most 2. Hence, with extremely high probability, X (and likewise Y) concentrates about its expectation, and the desired result follows.
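The bounded-differences inequality that Theorem 3 specializes, stated here in its generic McDiarmid form (in the application above each c_i = 2; the exact specialization used in the original may differ in constants):

```latex
\Pr\bigl[\,|f(Z_1,\dots,Z_n) - \mathbb{E}\,f(Z_1,\dots,Z_n)| \ge t\,\bigr]
  \;\le\; 2\exp\!\left(-\frac{2t^{2}}{\sum_{i=1}^{n} c_i^{2}}\right),
```

where the Z_i are independent and changing any single coordinate Z_i changes the value of f by at most c_i.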
We note that by choosing ε = log^{−1/3}(n), we can easily obtain the same type of degree distribution result for MGEO-P that exists for the original GEO-P [5]: with extremely high probability, MGEO-P(n, m, α, β, p) satisfies the corresponding bound, where N_{≥k} is the number of nodes of degree at least k.
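The power-law claim N_{≥k} ∝ k^{−1/α} can be checked empirically on a degree sequence with a log-log fit of the complementary cumulative counts. A minimal sketch (the function name and fitting range are our own choices, and a least-squares fit on log-log axes is only a rough exponent estimator):

```python
import math

def ccdf_slope(degrees, kmin, kmax):
    """Estimate the exponent of N_{>=k} by least squares on log-log
    axes over k in [kmin, kmax]; for a power law N_{>=k} ~ k^(-1/alpha)
    the fitted slope should be close to -1/alpha."""
    xs, ys = [], []
    for k in range(kmin, kmax + 1):
        n_ge_k = sum(1 for d in degrees if d >= k)  # complementary cumulative count
        if n_ge_k > 0:
            xs.append(math.log(k))
            ys.append(math.log(n_ge_k))
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

On a degree sequence with N_{≥k} ∝ k^{−2} (i.e., α = 1/2), the fitted slope comes out near −2.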
A.6 The diameter
Proof. We first establish the upper bound on the diameter. To this end, let t be as chosen above and divide [0, 1)^m into t^m uniform subcubes with side-lengths 1/t in the natural way. Now, as β/(1−α) < 1, by Chernoff bounds there are Θ(n/t^m) nodes in each of the subcubes with extremely high probability. Thus, in order to establish the upper bound, it suffices to show that for any two nodes u and v at ℓ∞-distance at most 2/t, the graph distance between the two nodes is at most some fixed constant. Now consider an arbitrary node v. By Chernoff bounds, with extremely high probability there are Ω(n^{1−α−β}) nodes at small ℓ∞-distance from v, and among them v has neighbors at ℓ∞-distance at most 1/(2t), with rank at most n^{β/(1−α)} ln^{2/(1−α)}(n), and age rank at most n/2 (that is, old nodes). Note that each node with rank at most n^{β/(1−α)} ln^{2/(1−α)}(n) has radius of influence at least 4/t. Combining these two observations, we have that, with extremely high probability, every node v is within graph-distance two and ℓ∞-distance n^{−(α+β)/m} + 1/(2t) of a set X_v of Θ(ln^2(n)) nodes with rank at most n^{β/(1−α)} ln^{2/(1−α)}(n). Thus, if u and v are at ℓ∞-distance at most 2/t, the distance between elements of X_u and X_v is at most 2/t + 2n^{−(α+β)/m} + 1/t ≤ 4/t. On the other hand, as we already mentioned, the radius of influence of each node in X_u or X_v is at least 4/t. Thus, with extremely high probability, some member of X_u is adjacent to some member of X_v, and hence u and v are within graph distance 5, completing the proof of the upper bound.
For the lower bound, let us take some node v and consider distances to other nodes. With probability 1 − 2^{−m}, some other node is at ℓ∞-distance at least 1/4 from v. Hence, by Chernoff bounds, with extremely high probability there exist two nodes at ℓ∞-distance at least 1/4. As the diameter of every influence region is at most n^{−β/m}, this gives that the diameter of the graph is Ω(n^{β/m}).

B SENSITIVITY STUDIES
In the following sections, we study how the predicted dimension changes under large-scale structural changes to the graph. We focus our efforts on the Facebook samples, as the LinkedIn samples are highly correlated due to the temporal nature of their construction. Our results show that:
1. Erdős–Rényi random graphs have no apparent dimension.
2. The graphlet fitting methodology is influenced by the degree distribution in a way that generates high variance in the predicted dimension, though a logarithmic trend may still exist. This effect is not present in the spectral histograms.
3. The graphlet fitting methodology is robust to changing 10% of the edges of the network via a random percolation process.

B.1 Dimensions of Erdős–Rényi random graphs
In our first experiment to verify the relevance of our dimensionality fits, we attempt to fit the dimension of Erdős–Rényi random graphs with the same expected number of edges. That is, for each of the samples of the Facebook network, we run the SVM dimension classifier we constructed on the graphlet counts of 50 separate Erdős–Rényi random graph samples, where the edge probability is chosen to yield, in expectation, the number of edges of the original network. In all but 3 of the 5000 examples (50 samples for each of the 100 graphs), the predicted dimension is the maximum, 12. In the three cases where the prediction was not the maximum, it was 11. When we repeated this with training dimensions only up to 10, the Erdős–Rényi random graphs fit to dimension 10; thus, we expect these graphs to be predicted at the highest dimension in the training set. We see this as evidence that our graphlet methodology is sensitive to clearly erroneous graphs.
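The Erdős–Rényi baseline described above can be generated as in the following sketch; the function name and the 500-node example are our own (in the study, the edge count is matched to each of the 100 Facebook samples):

```python
import random

def sample_er_matching_edges(n, target_edges, seed=None):
    """Sample an Erdos-Renyi graph G(n, p) whose expected edge count
    matches target_edges, i.e. p = target_edges / (n choose 2)."""
    rng = random.Random(seed)
    p = target_edges / (n * (n - 1) / 2.0)
    # include each of the C(n, 2) possible edges independently with probability p
    return {(u, v) for u in range(n) for v in range(u + 1, n)
            if rng.random() < p}

# One baseline sample matched to a hypothetical 500-node, 5000-edge
# network; the study draws 50 such samples per graph and classifies
# each sample's graphlet counts.
er_edges = sample_er_matching_edges(500, 5000, seed=0)
```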

B.2 Dimensions of random graphs with the same degree distribution
In our second experiment to verify the relevance of our dimension fits, we attempt to fit the dimension of a graph with the same degree distribution as one of the Facebook networks but with edges randomly drawn. To generate these graphs, we use the Bayati–Kim–Saberi procedure [4] as implemented in the bisquik library [13]. This method terminated for 92 of the 100 graphs. (The process did not terminate in the other 8 cases, which is a limitation of this particular sampling scheme.) The dimensional fits for these 92 resampled networks are shown in Figure 7. The eigenvalue fits show no logarithmic scaling in the dimension, whereas the graphlet fits do. However, the variance in the predicted dimensions based on graphlets is substantially higher for these random samples than for the original networks (see Figure 4 in the main text). The evidence from graphlets alone is, then, possibly biased by the degree distribution. However, the agreement among the spectral histograms, the graphlets, and the predicted dimension from the model itself encourages us to be more optimistic.

B.3 Dimension variance with random percolation
In our final experiment, we study how the predicted dimension of the Facebook networks varies under random percolation. In a random percolation process, we randomly sample an edge from the network, delete it, replace it with an edge between two randomly drawn nodes, and continue until we have repeated this procedure k times. We study how the predicted dimension varies as we change 1%, 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50% of the total edges of a network. For each of the 100 Facebook networks and each percentage of total edges, we repeat the percolation process 10 times. This generates 110 total networks for each Facebook network. Figure 8 shows a box-plot of how the predicted dimension varies for each perturbation level over all 1100 total graphs. This plot suggests that the dimension is essentially unchanged until more than 15% of the edges have been percolated, further illustrating that the predicted dimension is a stable quantity that is not overly sensitive to small perturbations.
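The rewiring step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; in particular, we assume simple graphs, so self-loops and already-present edges are redrawn:

```python
import random

def percolate(edges, fraction, n, seed=None):
    """Rewire a given fraction of the edges: repeatedly pick a random
    edge, delete it, and replace it with an edge between two randomly
    drawn nodes (redrawing self-loops and duplicate edges)."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    k = int(fraction * len(edge_list))
    for _ in range(k):
        i = rng.randrange(len(edge_list))      # uniformly chosen edge to rewire
        edge_set.discard(edge_list[i])
        while True:                            # draw a fresh replacement edge
            u, v = rng.randrange(n), rng.randrange(n)
            if u != v and (min(u, v), max(u, v)) not in edge_set:
                break
        edge_list[i] = (min(u, v), max(u, v))
        edge_set.add(edge_list[i])
    return edge_set
```

The dimension prediction is then re-run on each percolated graph.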

Figure 1 .
Figure 1. An example of the MGEO-P process on a graph with 250 nodes in the unit square with the torus metric, where α = 0.9, β = 0.04, and p = 1. Each panel shows the graph "replicated" in grey on all sides in order to illustrate the torus metric. Links are drawn to the closest replicated neighbor. The blue square indicates the region [0, 1]^2. Top row (left to right): the MGEO-P process begins with relatively few nodes, and thus nodes must have large influence radii (red squares) to link anywhere. As more nodes arrive, large radii result in many connections, modeling influential users, and small radii result in a few connections, modeling standard users. Bottom row: the final constructed graph.
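The generative process illustrated in the figure can be sketched as follows. This is a minimal reading of the MGEO-P model, not the authors' code: positions are uniform on the m-torus, ranks come from a uniform random permutation, and a younger node links to an older node with probability p whenever it falls inside the older node's influence region, taken here as an ℓ∞ ball of volume R^{−α} n^{−β}:

```python
import random

def mgeo_p(n, m, alpha, beta, p, seed=None):
    """Sketch of an MGEO-P(n, m, alpha, beta, p) sample (assumed
    reading of the model; O(n^2), for illustration only)."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(m)] for _ in range(n)]
    rank = list(range(1, n + 1))
    rng.shuffle(rank)                      # rank[i] = rank of node with age i
    def torus_dist(x, y):                  # l_inf distance on the unit torus
        return max(min(abs(a - b), 1.0 - abs(a - b)) for a, b in zip(x, y))
    edges = set()
    for old in range(n):
        # half the side length of a cube of volume R^(-alpha) * n^(-beta)
        radius = 0.5 * (rank[old] ** (-alpha) * n ** (-beta)) ** (1.0 / m)
        for young in range(old + 1, n):
            if torus_dist(pos[old], pos[young]) <= radius and rng.random() < p:
                edges.add((old, young))
    return edges
```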

Figure 3 .
Figure 3. The scale of the network data involved in our study varies over three orders of magnitude. We see similar scaling for both types of networks, but with slightly different offsets. For Facebook, log10(edges) = 1.06 log10(nodes) + 1.35 with R^2 = 0.945; for LinkedIn, log10(edges) = 1.07 log10(nodes) + 0.56 with R^2 > 0.999. The regularity in the LinkedIn sizes is due to our construction of those networks.

Figure 4
Figure 4. Facebook dimension at top, LinkedIn dimension at bottom, computed via graphlet features and a support vector machine classifier to select the dimension. For the Facebook data, we find that m = 2.06 log(n)/log(10) − 3.00. For the LinkedIn data, we find that m = 0.7333 log(n)/log(10) + 1. In the left figure, we show the variance in the fitted dimension as a box-plot. We estimate the variance by using only 20% of the original training data and repeating over 50 trials. There are only a few outliers, for small dimensions.

Figure 5
Figure 5. At top, Facebook data; at bottom, LinkedIn data. We show the fitted dimensions based on the minimum KL-divergence between the spectral densities. The dimensions shift modestly higher for Facebook and remain almost unchanged for LinkedIn. Both remain closely correlated with the theoretical prediction based on the model.
Figure 6 .
Figure 6. For three of the Facebook networks, we show the eigenvalue histogram in red, the eigenvalue histogram from the best-fit MGEO-P network in blue, and the eigenvalue histograms for samples from the other dimensions in grey. The MGEO-P model correctly captures the peak of the distribution around 1, but fails to completely capture the tail between 1 and 2. Thus, we see a meaningful difference between these profiles and hence do not suggest that MGEO-P captures all of the properties of real-world social networks.

Experiment 1
1. Set n to the number of nodes. Determine values of α and β independently of m (see the appendix of the original paper).
2. Simulate 50 samples of an MGEO-P network with m varying between 1 and 12.
3. Compute the graphlet counts for each sample of MGEO-P and train an SVM classifier to predict the dimension of the network given the samples.
4. Compute the graphlet counts for the graph G and use the output from the classifier as the dimension m of the network.
Experiment 2
1 & 2. As in Experiment 1.
3. Compute the spectral density of one sample of MGEO-P for each m between 1 and 12 (only one MGEO-P sample is used to get the density).†
4. Compute the spectral density of the graph G and find the value of m that minimizes the KL-divergence between the density from the graph and the MGEO-P samples.
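The selection step of Experiment 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names and the histogram-smoothing constant eps are our own choices:

```python
import math

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) between two histograms over the same bins, with a
    small additive smoothing term so empty bins do not blow up."""
    p_tot = sum(p_counts) + eps * len(p_counts)
    q_tot = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_tot
        q = (qc + eps) / q_tot
        kl += p * math.log(p / q)
    return kl

def best_fit_dimension(graph_hist, model_hists):
    """model_hists maps a candidate dimension m to the spectral
    histogram of an MGEO-P sample; return the m whose histogram
    minimizes KL(graph || model)."""
    return min(model_hists,
               key=lambda m: kl_divergence(graph_hist, model_hists[m]))
```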

Figure 7 .
Figure 7. Predicted dimensions of random graphs with the same degree distribution. At left, the predicted dimension using our SVM-graphlet methodology; at right, the predicted dimension using our spectral histogram methodology.

Figure 8 .
Figure 8. The change in the predicted dimension based on the graphlets methodology as we randomly percolate small or large fractions of the total edges in the network. Each box-plot represents the results over all 100 Facebook networks. The label 0.05 corresponds to randomly altering 5% of the total edges in a network. The line tracks the mean over all the samples.