One-step estimation of networked population size: Respondent-driven capture-recapture with anonymity

Size estimation is particularly important for populations whose members experience disproportionate health issues or pose elevated health risks to the ambient social structures in which they are embedded. Efforts to derive size estimates are often frustrated when the population is hidden or hard-to-reach in ways that preclude conventional survey strategies, as is the case when social stigma is associated with group membership or when group members are involved in illegal activities. This paper extends prior research on the problem of network population size estimation, building on established survey/sampling methodologies commonly used with hard-to-reach groups. Three novel one-step, network-based population size estimators are presented, for use in the context of uniform random sampling, respondent-driven sampling, and when networks exhibit significant clustering effects. We give provably sufficient conditions for the consistency of these estimators in large configuration networks. Simulation experiments across a wide range of synthetic network topologies validate the performance of the estimators, which also perform well on a real-world location-based social networking data set with significant clustering. Finally, the proposed schemes are extended to allow them to be used in settings where participant anonymity is required. Systematic experiments show favorable tradeoffs between anonymity guarantees and estimator performance. Taken together, we demonstrate that reasonable population size estimates are derived from anonymous respondent driven samples of 250-750 individuals, within ambient populations of 5,000-40,000. The method thus represents a novel and cost-effective means for health planners and those agencies concerned with health and disease surveillance to estimate the size of hidden populations. We discuss limitations and future work in the concluding section.


Introduction
Estimating the size of hidden and hard-to-reach populations is of critical importance to health officials seeking to mitigate the extent of health problems that may be concentrated within such populations [1], or when "reservoirs" of infection among a hidden population pose a methods, two assumptions must generally be met: (i) the sample is representative of the hidden population at large, and (ii) everyone in the hidden population is equally likely to be "captured" in the official statistics being used [43]. While representativeness can sometimes be assumed (as in the case of RDS), it is often difficult to establish the uniformity of the capture statistics. Frankly stated, police arrests and hospital admissions can seldom be assumed to draw randomly from the hidden population. Further, capture-recapture/multiplier methods often require that the sample be identifiable in the institutional record, implying that expectations of anonymity on the part of sample respondents be abandoned. When working with hidden and highly stigmatized populations, such a sacrifice can be highly detrimental to both recruitment and informant reliability [44]. Network scale-up methods are also used to establish the size of hidden populations, though work in this area remains at an early stage. Here members of the entire population (ambient plus hidden) are asked to report on the number of known associates who fit the hidden population criteria [45,46]. This approach has the advantage of being employable under ordinary random sampling conditions that can make use of known sampling frames (i.e. mail surveys and/or random digit dialing) [47]. However, the technique requires that ordinary people know whom among their associates fit the criteria for inclusion in the hidden category [48,49]. Such an assumption faces objections in many of the situations in which we might wish to apply the technique, as when we seek to estimate the size of populations of PWID or sex workers. In these types of settings, individuals from the hidden population may go to great lengths to hide their membership status from friends and associates. Such effects inject "transmission error" into NSUM calculations, a quantity that is difficult to both detect and measure.
In previous work, we presented a novel capture-recapture methodology for estimating the size of a hidden population from an RDS sample [50], referred to there as the "telefunken" method. The method could be easily integrated into a conventional RDS framework, allowing researchers to continue to take advantage of the wide body of work on RDS and its ability to yield reliable prevalence estimates. The method was adopted experimentally in the context of efforts to collect data on commercially sexually exploited children [51] and, later, users of methamphetamine [52]. Both these studies made use of RDS and took place in New York City. Subsequent implementations of the technique provided further evidence of its effectiveness and ease of implementation [34]. The telefunken method was so named because its application entailed asking each RDS respondent to report on others in the population known to them by providing an encoding of their associates' telephone number and demographic features (note that the technique is in no way related to the German apparatus company, Telefunken). In taking this approach, the method avoided reliance on official statistics (as needed in scale-up methods), and the requirement of drawing two independent samples (as needed by capturerecapture methods). Each individual's code was created by considering a protocol-specified number of digits of their phone number, in order from last to first, and encoding each digit as 0/1 based on whether it was even or odd, and again 0/1 based on whether it was low (0-4) or high (5)(6)(7)(8)(9); in this manner, each subject and associate was "identified" by means of a multibit binary code. This many-to-one encoding allowed for ongoing anonymity for both respondents and their reported associates, while enabling the matching of contacts across numerous respondent interviews. In essence, the telefunken method represents a "one-step" approach which lifts many assumptions normally associated with other capture-recapture methods, and can be achieved using a single RDS sample from the hidden population. If shown to be effective, such an approach lends simplicity and greater cost-effectiveness to the size estimation procedure, potentially allowing for widespread application.
Concerning the issue of anonymity, independently and in roughly the same time period, Fellows put forward a general framework of Privatized Network Sampling (PNS) design [53].
PNS addresses two of the major concerns with regard to RDS data, namely the assumption that coupons are passed at random among alters, and that subjects can accurately report the number of alters that they have. As PNS is closely related to RDS, the standard RDS estimators may be used on data collected with a PNS design.
Given the growing interest in telefunken and PNS-like techniques [26,34,54], this paper aims to provide a systematic exposition of its strategy for one-step, anonymity preserving, network-based population size estimation. In what follows we formally describe the technique, analyze its mathematical properties, and validate its performance through simulations under a variety of implementation conditions. The simulations show considerable promise for the technique in scenarios normally associated with research among "hidden populations". Limitations and next steps toward validation/extension are discussed at the end of the paper.

Background
Current network size estimation methods are based on quantifying the "repetition" or overlap observed across multiple samples [55]-where the category of objects sampled may be nodes, edges, distances, paths, motifs, or substructures [56,57], depending on the specific approach in question.
• Node sampling methods often begin by taking independent uniform random samples of the population. In interpreting the overlap between samples [58,59], these methods are based on the same principle as the well-studied "Coupon collector's problem" from probability theory, for which maximum likelihood estimators and conservative confidence intervals are well known [60]. This classic method considers two uniform independent random samples [61]; in ecology, the method is often referred to as the "mark and recapture" protocol. Within a population V, the protocol first selects a uniform random "capture" sample S V, and then a second (and independent) uniform random "recapture" sample R V. From independence assumptions one infers that and hence The right-hand-side expression in (2) is known as the Lincoln-Peterson estimator [62,63]. Many extensions and improvements to this classical technique have been developed, such as those making use of weighted sampling techniques [64], or sampling that is biased by the degree distribution of network nodes [65].
• Edge sampling approaches to population size estimation have also been developed [66][67][68]. These methods not only consider a sampled set of nodes, but also elicit a sample of their network neighbors. While edge sampling encounters problems associated with a bias toward high degree nodes, these methods offer potential gains in efficiency in dense graphs and where independent random sampling of nodes is restricted.
• Lastly, sampling via random walks represents a practical approach that is commonly used in estimating the size of social networks. Random walk methods start from an arbitrary node, then move to a neighboring node uniformly at random, and iterate. A typical random walk visits every node with a frequency proportional to its degree, but this bias can be quantified by Markov Chain analysis, and corrected to enable the derivation of an estimate of graph size from the frequency with which sampled nodes appear (and reappear) during the walk process. Random walk methods have largely used a sampling with replacement model, which may, in theory, introduce bias in estimates when the (fractional) size of the sample is large [24,69]; however, there is some recent experimental evidence that such concerns may be overstated [70]. These methods are widely used to measure the size of online social networks, and are frequently employed in conjunction with a variety of web crawler data [71][72][73][74][75].
The approach developed here is inspired by and builds on several of the above strategies, including random walks and edge elicitation. An outline of this paper follows: In Section 3.1, we present a population estimator for uniform random samples. This estimator is extended for respondent-driven samples in Section 3.2. The two estimators are evaluated over a broad range of graph families (see Subsection 4.1) using a general experimental framework (see Subsection 4.2). The experimental results are presented in Sections 4.3 and 4.4. In Section 4.5, we adapt the estimators for use in networks with clustering, showing in Section 4.6 that the revised schemes continue to perform well on synthetic networks. In Section 5, we extend the network size estimation schemes to allow for protection of subject privacy. These anonymitypreserving extensions are evaluated through simulation experiments in Sections 5.2 and 5.3. The impact of non-uniformities is assessed in Section 6, with special consideration of degree bias in RDS seed selection, and bottlenecking due to community structure. The performance of the proposed estimators is evaluated on a real-world network in Section 7. Finally, discussion and limitations are presented in Section 8.

New population size estimators
We seek to generalize the Lincoln-Peterson framework of overlapping capture and recapture sets (2) to the context of networked populations, and describe it formally in the language of graphs. The following definition provides graph-theoretic notations which will be necessary in order to precisely define the proposed sampling and estimation processes.
In the arguments that follow, graph-theoretic quantities (such as those formalized in Definition 1) will sometimes be considered simultaneously in the context of more than one graph-e.g. G 1 = (V 1 , E 1 ), and G 2 = (V 2 , E 2 ). To avoid ambiguity in such settings, we will make the context clear by appending the graph as a parameter-e.g. the arithmetic mean degree of vertices in G 1 is denoted "

Population size from a uniform random sample
With the formalisms of Definition 1 in place, we can define the estimator n 1 , which, given a uniform random subset of vertices T V, yields an estimate of |V|. Definition 3. Given a graph G = (V, E) and T V, define Lemma 1 shows that as the sample size grows, n 1 converges to |V|. Lemma 1. Let G = (V, E) be a graph and let T 1 T 2 T 3 . . . V be an ascending chain converging to . . are ascending chains of multisets, and M i R i (i = 1, 2,. . .). Suppose u 2 Δ i and w R i ðuÞ ¼ a; clearly 0 < a d (u). Then since the ascending chain (T i ) i = 1, 2,. . . converges to V, there exists a least j 0 > i for which w M j ðuÞ ¼ dðuÞ and therefore w D j ðuÞ ¼ 0 for all j ! j 0 . It follows that where multiset intersection and difference are as described in Definition 2, and thus The next proposition gives sufficient conditions under which uniform random samples T V produce consistent estimates n 1 (T) * |V| when |V| is large. Concrete realizations of these conditions are presented in Corollary 1.
Proposition 1. For n = 1, 2,. . . let G n = (V n , E n ) be a graph on |V n | = f(n) vertices, where f(n) grows unboundedly. Let c n 2 (0, 1] and take T n V n to be a subset of size |T n | = bc n Á f(n)c selected using uniform random sampling in V n . If c n Á f(n) diverges as n goes to infinity while for some finite constant Θ 1 > 0, then n 1 ðT n Þ jV n j necessarily converges to 1. Proof. Define random variables " M n ≔ 1 f ðnÞ hMðT n ; ;Þi: ð11Þ For uniform random u 2 V n , E½dðuÞ ¼ " dðV n Þ. Since |T n | = bc n Á f(n)c diverges, the law of large numbers and linearity of expectation imply that as n tends to infinity and thus Again, by the law of large numbers and linearity of expectation, as n tends to infinity Considering (13) and (14) as preconditions of Slutsky's theorem [76], we conclude: The correspondence between Eq (8) in Definition 3 and our previous telefunken estimator is clear [77]. In addition, Eq (8) demonstrates a parallel structure with the Lincoln-Peterson estimator shown in expression (2): T represents the first assay (set); R(T, ;) stands for the second assay (a multiset); the multiset M(T, ;) is the subpopulation of the first assay that is recaptured by the second assay. Of course, in the present setting, the second assay R(T, ;) is far from independent of the first assay T, since the two sets are intrinsically linked through the network geometry of G. Nevertheless, the fact that T is a random subset of V is enough to neutralize the impact of this non-independence and enable consistent estimation of population size. Corollary 1. Several special cases of Proposition 1 are of interest. In each of these cases, it is straightforward to verify that as n goes to infinity, c n Á f(n) diverges, while c 2 n Á " dðV n Þ tends to some finite strictly positive constant: • When f(n) = O(n), c n = O(1) is a constant, and " dðV n Þ ¼ Oð1Þ is a constant. In this case, we have a family of graphs of increasing size and constant average degree, in which we are taking uniform random samples whose size is a constant proportion of the entire population.
• When f(n) = O(n), c n = O(g(n)/n), and " dðV n Þ ¼ Oðn 1À =gðnÞ 2 Þ, where g(n) is a function which diverges, and > 0 is a constant. For example, if we take g(n) = n , then c n = O(1/n 1− ), and " dðV n Þ ¼ Oðn 1À 3 Þ. As tends to 0, we approach a family of graphs of increasing size and linear average degree, in which we are taking uniform random samples of an absolute constant size. This special limit case is manifested by Erdős-Rényi graphs [78].

Population size from a respondent-driven sample
Although the n 1 estimator shows robust performance under uniform random sampling (see Section 4.3), random sampling is seldom a feasible strategy with hidden populations. As discussed above, sampling hard-to-reach populations presents considerable practical challenges [55], and many current surveys of hidden populations have come to depend on a tracked "peer referral" process known as respondent driven sampling [21]. For purposes of estimation, we consider a respondent-driven sample to be a random variable based on several parameters: an underlying networked population G = (V, E), a specified number of seeds |D|, the number of recruiting coupons c to be given to each subject, and the target sample size r. In our simulation experiments, the sampling procedure begins by randomly choosing |D| initial "seed" subjects in the network. For most of this paper, seeds are selected uniformly at random, though later, in Section 6, we will report on the differential impact of non-uniform RDS seed selection-specifically, seed selection that is biased by ego network size or restricted by the presence of community structures. Each seed subject is given c recruiting coupons and asked to participate in a "referral" process by distributing these among their study-eligible peers. Each subject v succeeds in recruiting between 0 and min{c, d (v)} individuals from their ego network, with the precise number being determined stochastically according to a specified distribution δ R on {0, 1, . . ., c}. Each referred peer is assumed to come in for their interview at a time that is offset from their recruiter's interview by an amount that random and exponentially distributed with rate λ W . When one or more of the recruited peers come in for interview with the coupon given to them by their recruiter, they too are given c coupons and asked to participate in the referral process. The scheme proceeds recursively in this manner using a finite number of 3r depletable coupons, until all r individuals have been recruited and interviewed. If (and whenever) the referral process stalls before r subjects have been interviewed, a new seed is recruited. Participation incentives are arranged to ensure that no subject will be the recipient of more than one coupon, and thus the process results in a collection of disjoint directed trees rooted at the seeds [79]. The precise values of the RDS parameters |D|, c, r and implementation parameters δ R , λ W for our simulation experiments are detailed in Assumption 2; the stochastic process used to generate the underlying synthetic networks G(V, E) on which this RDS operates is described in Section 4.1.
Given the tendency of RDS to oversample high degree nodes, issues arise when estimation techniques attempt to make use of the degree statistics of a respondent driven sample. Special steps must be taken to account for differences between the average degree of an RDS sample and the average degree of the population from which the RDS sample is drawn. The simplifying assumption below is needed for our formal proofs of the proposed estimators' performance. We emphasize that this assumption is not enforced (and is often violated) within the synthetic networks we used in our simulations, through which the proposed estimators' performance was experimentally evaluated. Assumption 1. Whenever we are considering H = (S, F) to be a subgraph on S V obtained through an RDS process inside graph G = (V, E), we will assumedðSÞ $ " dðVÞ. This assumption is justified in prior work [20,22], is provably true for configuration graphs [24], and is reflective of the basic fact that the harmonic mean is robust against the presence of high-degree outliers, as we may expect to face when S is obtained via a non-uniform sampling process like RDS.
The next estimator n 2 , provides an estimate |V| from a respondent driven sample S V.

Definition 4. Given a graph G = (V, E), a set S V, and H = (S, F) a subgraph with edge set
The next proposition gives sufficient conditions under which respondent-driven samples S V produce consistent estimates n 2 (T) * |V| when |V| is large.
Proposition 2. For n = 1, 2, . . . let G n = (V n , E n ) be a graph obtained by configuration graph sampling via degree distribution D n , where the vertex set size |V n | = f(n) grows unboundedly. Let c n 2 (0, 1], and take S n V n to be a subset of size |S n | = bc n Á f(n)c selected using RDS sampling in G n from |D n | seeds chosen uniformly at random. Define the random variable Accepting Assumption 1, if c n Á f(n)/D n diverges as n goes to infinity, while for some finite constant Θ 2 > 0, then n 2 ðS n Þ jV n j necessarily converges to 1. Proof. Let (S n , F n ) be a subgraph produced by an RDS sampling process in G n , and let T n V n be an equal-sized set of vertices chosen by uniform random sampling, i.e. |T n | = |S n |. For random u 2 S n and v 2 T n , as n tends to infinity jNðu; ;Þj " dðS n Þ À jNðv; ;Þj " dðT n Þ ¼ jNðu; ;Þj " dðS n Þ À jNðv; ;Þj " dðV n Þ ¼ jNðu; ;Þj " dðS n Þ À jNðv; ;Þj where the first equality stems from the law of large numbers, and the second from Assumption 1. Now S n is an RDS sample and hence is the disjoint union of D n many trees. It follows that jF n j jS n j ¼ 1 À jD n j bc n Á f ðnÞc : We note that |N(u, F n )| |N(u, ;)|, and incorporating (18) back into the final expression in (17), we deduce Definition 1's Eq (6) and linearity of expectation then imply that as n tends to infinity The configuration graph sampling process dictates that as n tends to infinity, for uniformly random u 2 S n Definition 1's Eq (7), expression (20), the law of large numbers, and linearity of expectation, together imply that as n tends to infinity Define the following random variables, closely related to (10) and (11) of Proposition 1: From our assumptions on the convergence of D 2 n Á c 2 n Á " dðV n Þ, we see that as n tends to infinity Considering (24) and (25) as preconditions of Slutsky's theorem [76], we conclude:

Evaluating the n 1 and n 2 estimators
To evaluate the proposed estimators n 1 (8) and n 2 (15), we conducted simulation experiments on samples drawn from synthetic networks using uniform and respondent-driven sampling, respectively. Underlying networks were selected by configuration sampling techniques [80][81][82] relative to Lognormal, Poisson, and Exponential distributions. We also considered Barabási-Albert graphs [83], and Erdős-Rényi graphs [78].

Synthetic networks
The tendency of RDS to over-recruit high degree nodes is well known, and readily evidenced in experiments on idealized topologies. Attempts to model peer-referral or "snowball" recruitment processes point to the fact that the degree distribution of nodes can influence the performance of mean estimators [84], suggesting Bayesian approaches which make use of degree distribution data in the derivation of population size estimates [35,85]. To validate the n 1 and n 2 estimators against a wide range of possible topologies, five idealized families of random graphs were used to perform initial experiments. In later sections, we take up the issue of clustering (Section 4.5), anonymity (Section 5), non-uniformity in the seed selection (Section 6), and performance on a real-world network (Section 7).
In what follows, configuration graphs were sampled (relative to a specified degree distribution) by first attaching the prescribed number of free half-edges to each node. Pairs of free half-edges were then chosen uniformly at random and bound together to form an edge, repeatedly, until no free half-edges remain. Note that this sampling process may yield graphs that have multiple parallel edges and self loops.
Definition 5. Given a set V with |V| = n, for each l 2 R, λ > 1, let distributions D LðlÞ , D PðlÞ , D X ðlÞ , and D RðlÞ : V ! N be defined such that for each v 2 V: • D Lðl;nÞ ðvÞ ¼ 1 þ X where X is a Lognormal random variable with mean λ − 1 and standard deviation 1.
Corresponding to each of the three distributions above, let Lðl; nÞ, Pðl; nÞ, X ðl; nÞ, Rðl; nÞ be the sample spaces of configuration graphs G = (V, E) where |V| = n. Note that a random graph drawn from these sample spaces will have expected mean vertex degree E½ " dðVÞ ¼ l. Definition 6. For each l 2 R, λ > 1, let Bðl; nÞ be the sample space of n-vertex Barabási-Albert graphs G = (V, E). Each such graph is the final output of a process which produces a sequence of graphs with probability 1 þ blc À l 1 þ bl=2c otherwise: ( of pre-existing nodes fp i;1 ; . . . p i;D i g V iÀ 1 . This set is constructed by sequential sampling without replacement, i.e. as l = 1, . . ., with a probability that reflects degree-biased preferential attachment . The final member of the resulting sequence G n = (V n , E n ) is output as the sampled graph. Note that if n ) λ, the process above results in a graph G = (V, E), sampled from Bðl; nÞ, and having expected mean vertex degree E½ " dðVÞ $ l. Definition 7. For each l 2 R, λ > 0, let Eðl; nÞ be the sample space of n-vertex Erdős-Rényi graphs G = (V, E), where E V × V is a random subset constructed uniformly at random by taking: Note that a random graph G = (V, E) drawn from Eðl; nÞ will have expected mean vertex degree E½ " dðVÞ $ l.

Evaluating n 1 on synthetic networks
The experiments here follow the framework described in Section 4.2 and use uniform random samples. The 12 graphs in Fig 1 present the performance of the n 1 estimator as the true population size n is varied from 5 Á 10 3 to 40 Á 10 3 (vertical axis of the grid) and the size of the uniform sample is varied from 250 to 750 (horizontal axis of the grid). In each of the 12 graphs, the xaxis varies the average degree λ from 3 to 10. For each choice of λ, the medians and quartile ranges of n 1 are given for each of the 5 graph families. Each of these is determined by 900 simulations (30 graphs times 30 uniformly drawn samples in each graph). Fig 1 shows that as sample size increases, the medians of n 1 converge to the true population size. For example, when n = 5 Á 10 3 and r = 250, Exponential degree distribution graphs with λ = 3 have a median n 1 value of 5663 (a 13.3% offset from the true value of n = 5 Á 10 3 ). In comparison, when r = 750, the median for this family of graphs is 5204 (just 4.1% offset from the true value). As the sample size increases from r = 250 to r = 750, the error in the median estimate decreases by 9.2%. The benefit of increasing sample size diminishes as networks grow larger, however. For example, for a network of size n = 40 Á 10 3 , increasing the sample size from r = 250 to r = 750 causes the error in the median n 1 estimate to undergo only a 2% change.
In addition, Fig 1 shows that as sample size increases, the interquartile range (IQR) of the estimates decreases. For example, when n = 5 Á 10 3 and r = 250, Lognormal degree distribution graphs with λ = 10 experience an interquartile range of 1950 in their n 1 estimates (35.9% of the median). In comparison, when r = 750, the interquartile range for this family of graphs decreases to 1425 (a 26.9% reduction). The magnitude of this effect increases as networks grow larger. For example, for a network of size n = 40 Á 10 3 , increasing the sample size from r = 250 to r = 750 causes the interquartile range of the n 1 estimate to undergo a 48.6% decrease.

Evaluating n 2 on synthetic networks
The experiments in this and all subsequent sections use respondent-driven samples. The precise values of the RDS parameters |D|, c, r and implementation parameters δ R , λ W are given below.
Assumption 2. In all our experiments where RDS is used to generate samples, we take |D| = 7 random seeds drawn uniformly at random from V. Each subject was given c = 3 coupons. Depending on the experiment, the sample size r was either 250, 500, or 750. Reflecting our experiences in the field [86], we took the recruiting success distribution δ R such that each subject had a 90% chance of recruiting 2 subjects randomly from their ego network, and a 10% chance of recruiting just 1. [Individuals with an ego network of size 1 were assumed to recruit that one individual with 100% probability, while individuals with an ego network of size 0 recruited no one]. The delay between recruiter and recruited subjects' interview times were assumed to be exponentially distributed with rate λ W = 1.
The 12 graphs in Fig 2 present the performance of the n 2 estimator as the true population size n is varied from 5 Á 10 3 to 40 Á 10 3 (vertical axis of the grid) and the size of the RDS sample is varied from 250 to 750 (horizontal axis of the grid). In each of the 12 graphs, the x-axis varies the average degree λ from 3 to 10. For each choice of λ, the medians and quartile ranges of n 2 are given for each of the 5 graph families. Each of these is determined by 900 simulations (30 graphs times 30 uniformly drawn samples in each graph). Fig 2 shows that the median of n 2 converges to the true population size across a range of topologies, RDS sample sizes, and overall populations. In addition, Fig 2 shows that as sample size increases, the interquartile difference decreases. For example, when n = 5 Á 10 3 and r = 250, Poisson degree distribution graphs with λ = 3 experience an interquartile range of 1676 in their n 2 estimates (33.8% of the median). In comparison, when r = 750, the interquartile range for this family of graphs decreases to 524 (a 68.7% reduction). The magnitude of this effect decreases as networks grow larger, such that, for a network of size n = 40 Á 10 3 , increasing the sample size from r = 250 to r = 750 causes the interquartile range of the n 2 estimate to undergo a 60.8% decrease. However, the total range of estimates as a proportion of the median decreases as sample size increases, indicating decreasing sample-based variance (a key concern in RDS sampling [28]).

Population size estimation in the presence of clustering
Beyond the oversampling of high degree nodes, RDS faces challenges when used in networks where network clustering is pronounced [49,87]. While methods are available to assess the presence of clustering [25], and recent work has proposed new techniques to estimate and account for clustering from a single RDS sample [88], the effects of this phenomenon on population size estimation from RDS samples is seldom discussed. The root of the problem lies in the fact that RDS walks necessarily sample network neighborhoods. Where neighbors show high levels of network transitivity, counts of common edges will produce high numbers of "matches" that appear in the denominator of both n 1 and n 2 . This will bias the estimates of overall population size derived from these estimators toward underestimation of the total network size.
In the context of random walk techniques, one approach to this problem is to only consider collisions among nodes that are far away from each other in the sampling chain when inferring a population size estimate [75]. A similar approach is taken here by considering neighbor overlap among respondents whose path distances in the RDS chains are above a specific threshold. For simplicity, here we take this threshold to be infinity, leaving the consideration of finite thresholds for consideration in future research. In short, we consider a modification of n 2 that discounts matched free ends within a single RDS sampling tree and, for purposes of estimation, only counts those matches that occur across distinct RDS trees. The next Definition introduces formalisms necessary to make this precise. Definition 8. Let G = (V, E), take S V, and let H = (S, F) be a subgraph on S V with edge set F E \ (S × S) obtained by respondent driven sampling from a set of seeds D S where |D| > 1. Define the function γ: S ! D associating each u 2 S with the unique seed γ(u) 2 D from which u was discovered through a sequence of referrals. For each u 2 S, the component of u is denoted The next estimator n 3 , provides a revised estimate |V| from a respondent driven sample S V, discounting matches that occur within the same RDS component.
The next proposition gives sufficient conditions under which respondent-driven samples S V produce consistent estimates n 3 (T) * |V| when |V| is large.
Proposition 3. For n = 1, 2, . . ., let G n = (V n , E n ) be a graph on |V n | = f(n) vertices obtained by configuration graph sampling via degree distribution D n , where f(n) grows unboundedly. Let c n 2 (0, 1], and take S n V n to be a subset of size |S n | = bc n Á f(n)c selected using RDS sampling in G n from |D n | > 1 seeds. Define the random variable Accepting Assumption 1, if c n Á f(n)/D n diverges as n goes to infinity, while for some finite constant Θ 3 > 0, then n 3 ðS n ;F n ;D n ;gÞ f ðnÞ necessarily converges to 1.
One-step estimation of networked population size: Respondent-driven capture-recapture with anonymity Proof. Since each seed s 2 D n is chosen uniformly at random, and RDS recruits from all seeds concurrently, and |S n | = bc n Á f(n)c diverges, for random s 2 D n , we know that jC g ðsÞj ! p jD n j À 1 jD n j Á jS n j ¼ jD n j À 1 jD n j Á c n Á f ðnÞ ð31Þ Combining (30) and (32), we conclude Sufficient reasoning about the configuration graph construction process tells us hXðs; F n ; gÞi ! p 1 jD n j Á hMðS n ; F n Þi Á jD n j À 1 jD n j : Define the following random variables, closely related to (22) and (23)  As n tends to infinity R n ! p " dðS n Þ À 1 dðSÞ jD n j À 1 jD n j Á c n Á f ðnÞ Á R Ã n ðS n ; F n Þ M n ! p jD n j À 1 jD n j Á M Ã n ðS n ; F n Þ: where R Ã n ðS n ; F n Þ ! p D n Á c n Á " dðV n Þ as noted in (22), while as noted in (23). Thus By Slutsky's theorem [76], it follows that n 3 ðS n ; F n ; D n ; gÞ f ðnÞ ¼

Evaluating n 3 on synthetic networks
Prior to examining the performance of n 3 on empirical networks, we first look at its performance on the synthetic networks used to evaluate n 1 and n 2 . The experiments shown in Fig 3 follow the framework described in Section 4.2 and use respondent driven samples, each obtained via an RDS process operating as specified in Assumption 2. The 12 graphs in Fig 3 present the performance of the n 3 estimator as the true population size n is varied from 5 Á 10 3 to 40 Á 10 3 (vertical axis of the grid) and the size of the RDS sample is varied from 250 to 750 (horizontal axis of the grid). In each of the 12 graphs, the x-axis varies the average degree λ from 3 to 10. For each choice of λ, the medians and quartile ranges of n 3 are given for each of the 5 graph families. Each of these is determined by 900 simulations (30 graphs times 30 uniformly drawn samples in each graph). Fig 3 shows that the median of n 3 converge to the true population size, much like the performance of the n 2 estimator. In all the networks, the medians of n 3 estimates are all very close to the their true network populations, regardless the sample size, population size, and type of network topology. In addition, Fig 3 shows that as sample size increases, the interquartile range of the estimates decreases. For example, when n = 5 Á 10 3 and r = 250, Lognormal degree distribution graphs with λ = 3 experience a interquartile range of 1915 in their n 3 estimates (39.1% of the median). In comparison, when r = 750, the interquartile range for this family of graphs decreases to 604 (a 68.5% reduction). The magnitude of this effect decreases as networks grow larger. For example, in a network of size n = 40 Á 10 3 , increasing the sample size from r = 250 to r = 750 causes the interquartile range of the n 3 estimate to undergo a (still sizable) 55.0% decrease.

Subject privacy through hashing
Significant obstacles arise in the direct application of estimators n 1 , n 2 , n 3 (see (8), (15), and (28), respectively). In many circumstances where RDS is used, researchers are often required to measure the sizes of stigmatized networked populations (e.g. people who inject drugs, sex workers, individuals engaged in specific types of illegal activity, etc.) and within social communities that naturally seek to remain "unidentified". In these circumstances, the membership of sets S and R(S, F) is often not explicitly knowable because individuals are reluctant to unambiguously identify themselves or their social network peers.
To formalize and accommodate notions of privacy required under such circumstances within the estimation procedures described above, we assume that each individual in V = {v 1 , v 2 , . . ., v |V| } has a unique ID; for simplicity we take the ID of v i 2 V to be the integer i (for i = 1, . . ., |V|). Towards ensuring anonymity, we imagine a hashing [89] function ψ: V ! O that assigns each individual's ID to a code in O. We thus follow the general framework of Privatized Network Sampling (PNS) design [53], mimicking the hash functions of telefunkentype [50].
By taking ψ to be a random (not necessarily 1-to-1) function that is difficult to invert, subjects are convinced that disclosing the hash code of an individual does not unambiguously identify the individual themselves, and so preserves their privacy.

Assumption 3. Suppose V is a set of individuals obtained via RDS referral tree F. While each v i 2 V is unwilling to disclose their own ID i, and is secretive about the IDs of their peers {j|v j 2 N (v i , ;)}, they are readily willing to reveal (a) the own hash code ψ(v i ); (b) the (multiset of) hash codes of their peers (outside the referral tree F):
and (c) their own network size dðv i Þ ¼ hN c u ðS; FÞi, excluding the referral tree F. Assumption 4. To simplify our analysis, throughout what follows, we will assume ψ is a function chosen uniformly at random from the space of all functions from V ! O. We will refer to such a ψ as a "random hash function" from V to O. The action of ψ on the V is illustrated in Fig  4. In Section 6.3, we describe ways to translate the results of this paper to settings where ψ is not a uniformly random hashing function.
In practice, ψ(v) might be an obtained by amalgamating a well-defined tuple of characteristics of v which are known to v's friends (e.g. v's gender, phone number, hair color, approximate age, racial category, etc.) and then encoding this using a cryptographic function. A related coding technique was used in our earlier work on estimating the size of the methamphetamine using population in New York City, where it was referred to as the telefunken code [50].
identifiable subjects anonymized subjects hash function One-step estimation of networked population size: Respondent-driven capture-recapture with anonymity

Revised estimators incorporating hashing
We begin by "lifting" the terms introduced in the earlier Definition 1, to the hashing or PNS framework [53].

Definition 10. Let G = (V, E) be a graph, and ψ: V ! O a random hash function. Let H = (S, F) be a subgraph on S V with edge set F E \ (S × S). The (multiset of) hash codes of the subjects is
The  (7). The next Lemma is foundational and justifies the proposed revised estimates n c 1 , n c 2 , and n c 3 , which will be presented subsequently.

Suppose u 2 S reports its own code x ≔ ψ(u), the code y ≔ ψ(v) of one of its neighbors v 2
N u (S, F). If w 2 ψ −1 (y) \ S is selected uniformly at random, and w has degree d(w), then The expected total number of free ends incident to some vertex in the set ψ −1 (y)\{w} is ðn 0 À 1Þð1 À cÞ jOj ÁdðSÞ þ ðn 0 À 1Þc jOj ÁdðSÞ À 1 and since w 2 S, the expected number of free ends incident to w is d(w) − 1. So  The reader may wish to compare expressions (41) and (42) with the non-hashed analogues in Definition 8's expressions (26) and (27). We also definẽ xðy; s; g; n 0 Þ ≔

Evaluating n c 2 on synthetic networks
The experiments discussed here follow the framework used in prior experiments described above. Samples are derived using the RDS process operating as specified in Assumption 2. The hash space size used for the encoding of each agent's identity was varied from |O| = 2 Á 10 3 to 256 Á 10 3 . The 12 graphs in Fig 5 present the performance of the n c 2 estimator as the true population size n is varied from 5 Á 10 3 to 40 Á 10 3 (vertical axis of the grid), the sample size is fixed to r = 500 and the hash space size was varied from |O| = 2 Á 10 3 to 256 Á 10 3 (horizontal axis of the grid). In each of the 12 graphs, the x-axis varies the average degree λ from 3 to 10. For each choice of λ, the medians and quartile ranges of n c 2 are given for each of the 5 graph families. Each of these is determined by 900 simulations (30 graphs times 30 uniformly drawn samples in each graph). Fig 5 shows that as hash space size increases, the medians of n c 2 converge to the true population size. For example, when n = 5 Á 10 3 and |O| = 2 Á 10 3 , Lognormal degree distribution graphs with λ = 3 have a median n c 2 value of 4705 (a 5.9% offset from the true value of n = 5 Á 10 3 ). In comparison, when |O| = 256 Á 10 3 , the median value for this family of graphs is 4901 (just 2.0% offset from the true value). As the hash space size increases from |O| = 2 Á 10 3 to |O| = 256 Á 10 3 , the error in the median estimate decreases by 3.9%. The magnitude of this phenomenon increases as networks grow larger. For example for a network of size n = 40 Á 10 3 , increasing the hash space size from |O| = 2 Á 10 3 to |O| = 256 Á 10 3 causes the error in the median n c 2 estimate to undergo a 33.9% change. In addition, Fig 5 shows that as hash space size increases, the interquartile range of the estimates decreases. For example, when n = 5 Á 10 3 and |O| = 2 Á 10 3 , Poisson degree distribution graphs with λ = 3 experience a interquartile range of 1522 in their n c 2 estimates (32.0% of the median). In comparison, when |O| = 256 Á 10 3 , the interquartile range for this family of graphs decreases to 793 (a 47.9% reduction). The magnitude of this effect increases as networks grow larger. For example for a network of size n = 40 Á 10 3 , increasing the hash space size from |O| = 2 Á 10 3 to |O| = 256 Á 10 3 causes the interquartile range of the n c 2 estimate to undergo a 42.1% decrease.

Evaluating n c 3 on synthetic networks
A second set of experiments shows the performance of the n c 3 performance under identical hashing conditions used to test n c 2 . These experiments also follow the framework described in Section 4.2 and use samples derived from an RDS process operating as specified in Assumption 2. The hash space size was varied from |O| = 2 Á 10 3 to 256 Á 10 3 .
The 12 graphs in Fig 6 present the performance of the n c 3 estimator as the true population size n is varied from 5 Á 10 3 to 40 Á 10 3 (vertical axis of the grid), the sample size is fixed to r = 500 and the hash space size was varied from |O| = 2 Á 10 3 to 256 Á 10 3 (horizontal axis of the grid). In each of the 12 graphs, the x-axis varies the average degree λ from 3 to 10. For each choice of λ, the medians and quartile ranges of n c 3 are given for each of the 5 graph families. Each of these is determined by 900 simulations (30 graphs times 30 uniformly drawn samples in each graph). Fig 6 shows that as hash space size increases, the medians of n c 3 converge to the true population size. For example, when n = 5 Á 10 3 and |O| = 2 Á 10 3 , Lognormal degree distribution graphs with λ = 3 have a median n c 3 value of 4667 (a 6.7% offset from the true value of n = 5 Á 10 3 ). In comparison, when |O| = 256 Á 10 3 , the median for this family of graphs is 4865 (just 2.7% offset from the true value). As the hash space size increases from |O| = 2 Á 10 3 to |O| = 256 Á 10 3 , the error in the median estimate decreases by 4.0%. The magnitude of this phenomenon increases as networks grow larger. For example for a network of size n = 40 Á 10 3 , increasing the hash space size from |O| = 2 Á 10 3 to |O| = 256 Á 10 3 causes the error in the median n c 3 estimate to undergo a 38.4% change.
In addition, Fig 6 shows that as hash space size increases, the interquartile range of the estimates decreases. For example, when n = 5 Á 10 3 and |O| = 2 Á 10 3 , Exponential degree distribution graphs with λ = 3 experience a interquartile range of 1491 in their n c 3 estimates (31.0% of the median). In comparison, when |O| = 256 Á 10 3 , the interquartile range for this family of graphs decreases to 905 (a 39.3% reduction). The magnitude of this effect increases as networks grow larger. For example for a network of size n = 40 Á 10 3 , increasing the hash space size from |O| = 2 Á 10 3 to |O| = 256 Á 10 3 causes the interquartile range of the n c 3 estimate to undergo a 43.0% decrease.

Impacts of non-uniformity
The experiments described in previous sections of this paper assumed an RDS process that begins with a set of seeds D V sampled uniformly at random without replacement. More precisely, D = X |D| is the last entry in sequence X 0 , X 1 , . . .X |D| , where X 0 = ; and X i = X i−1 [ {u i } with One-step estimation of networked population size: Respondent-driven capture-recapture with anonymity for each u 2 V. While the uniform model allowed formal analysis of the estimators' properties to be tractable, many researchers have noted that practical deployments of RDS often exhibit bias in seed selection [90][91][92]. This bias originates in local features of the network topology (e.g. variation in node degrees) as well as global properties (e.g. the presence of community structures).

Degree-biased selection of RDS seeds
We begin by describing experimental findings on the differential impacts of degree-based bias in initial seed selection on the performance of the n c 3 estimator. Towards this, we define a new model of seed selection in which a real-valued parameter r 2 R controls degree-based bias. In particular, expression (44) is generalized to for each u 2 V. Note that when ρ = 0 expression (45) reduces to the uniform random selection of seeds prescribed in (44). When ρ > 0, seed selection is biased towards the network's high degree vertices; when ρ < 0, low degree vertices are favored. The first segment of Table 1 shows that as ρ is varied between -1 and +1, non-uniform seed selection has no discernable negative differential impact on the performance of RDS estimator One-step estimation of networked population size: Respondent-driven capture-recapture with anonymity n c 3 . While the data in the first segment of Table 1 are based on 30 RDS samples (r = 500) on each of 30 graphs from Lðl ¼ 5; n ¼ 10 4 Þ, i.e. graphs with 10K nodes and a Lognormal degree distribution as described in Section 4.1, the conclusion for the other 5 graph families is similar.

Community structures
Next we consider the impact of community structures which can potentially create bottlenecks for RDS and restrict the reach of subject's self-reported ego networks [91,92]. We quantify the impacts of such structures on the n c 3 estimator through simulation experiments, and towards this, extend each of the 5 families defined in Section 4.1 to support the controlled presence of community effects. Two new parameters are introduced: the number of communities K, and the cross-community connection probability μ. The space Lðl; nÞ, for example, is thus extended to a space Lðl; n; K; mÞ consisting of graphs of size K Á bn/Kc, i.e. approximately n, which is sampled from as follows: Lðl; nÞ as defined in Section 4.1.
Define V ≔ S K i¼1 V i to be the vertex set of our sampled graph. Take E ¼ S K i¼1 E i to be our initial approximation of the edge set of our sampled graph, to be updated according to the rewiring process below. 3. Completion of step (2) yields the sampled graph (V, E) on K Á bn/Kc vertices, having K communities each coming from family Lðl; nÞ and wired together so that roughly μ fraction of each community's members has a connection to some member of a different community (and the degree distribution of the graph as a whole is consistent with the bias of family L).
The families Pðl; n; K; mÞ; X ðl; n; K; mÞ; Bðl; n; K; mÞ, and Eðl; n; K; mÞ, are defined analogously. When μ * 1 or K * 1, community effects are insignificant. As μ ! 0 + or K ) 1, the population consists of many effectively isolated communities. Whenever a set of seeds are to be selected from the network (e.g. to obtain a respondent driven sample), all seeds are chosen (uniformly at random) from community 1.
The second segment of Table 1 shows that as K is increased from 1 to 16 (while μ is held fixed at 0.5), increasing the number of communities causes n c 3 to slightly underestimate population size. For example, when the network consists of K = 8 communities, a median estimate falls short of the true value by 7%; for K = 16 communities the deficit becomes 16%. The third and final segment of Table 1 shows that as μ is decreased from 0.5 to 0.1 (while K is held fixed at 8), increasing community isolation causes n c 3 to significantly underestimate population size. For example, when the inter-community connection probability μ = 0.4 the deficit is 14%, but when μ = 0.2 the estimate produced is roughly 45% of the true value. While the data in the second and third segments of Table 1 are based on 30 trials on each of 30 graphs from Lðl; nÞ, i.e. graphs with Lognormal degree distribution as described in Section 4.1, the results for the other 5 graph families are quite similar.

Non-uniform hash functions
The experiments and analyses so far have considered a uniform random hashing function ψ, and have shown that the size of the hash space |O| has a significant impact on estimator variance. The uniform hashing assumption is reasonable when each individual's anonymity-preserving code is based on attributes that have been uniformly randomly assigned across the population. For example, it is reasonable to expect that a telephone company will assign numbers to customers randomly, and thus a code that is built from the parity and scale of the final 4 digits of each individual's phone number would constitute a uniform random hash function.
In this section, we describe how to translate the conclusions of previous experiments and analyses to settings where the hashing function is not uniformly random. This would likely be the case if ψ were built from each individual's demographic characteristics (e.g. age, height, hair color, and race) that are known to vary non-uniformly across the population. For example, if subjects and reports were encoded using 4 categories for age, 3 categories for height, 3 for hair color, and 5 categories for race, one could only say that the hash space size was 4 × 3 × 3 × 5 = 180 if all combinations of these attributes were equally likely to appear. Researchers employing such non-uniform hashing functions may want to know the equivalent uniform hash space size |O|, so as to correctly translate the results of previous sections into reasonable expectations for the non-uniform situation at hand. The following Lemma will assist in defining this translation: Lemma 3. Let A, B be finite sets, and ψ: A ! B be a uniformly random function. Then Proof. We seek the expected number of distinct items obtained in sampling |A| elements from B with replacement. Consider x 2 B, then The result then follows by linearity of expectation. Proposition 4. Let A, B be finite sets, and ψ: A ! B a uniformly random function. Suppose |A| = x and |ψ(A)| = y, where x, y ) 0, then the maximum likelihood estimator of |B| is given by where Lambert's W function is the inverse function of f(W) = We W . Proof. Applying Lemma 3, the maximum likelihood estimator is obtained by solving for |B|. Since |B| ! y ) 0, we may approximate log jBj À 1 jBj when |B| is large. Then we have Eq (47) now becomes which when solved for |B| yields expression (46) above. Proposition 4 tells us that the image of a set of size x is expected to have size y, provided the function is a uniform random map into a set whose size is given by expression (46). Such a combinatorial result can be used to compute the equivalent uniform hash space size in settings where the hash function is non-uniform. In particular, if we have x = |A| subjects, who provide us with exactly y = |ψ(A)| distinct codes (y x), then the equivalent uniform hash space size |O| is given by expression (46) above.

Evaluating estimators on real networks
While a range of degree distributions and randomly occurring clusterings can be expected in idealized topologies, the performance of RDS-based estimators n c 2 and n c 3 on organically arising, natural human networks may vary. To test this possibility, we perform a number of random-start, RDS-based estimation experiments on the Brightkite data set. Brightkite was once a location-based social networking service provider where users shared their locations by checking-in. The friendship network was collected using their public API, and consists of |V| = 58,228 nodes and |E| = 214,078 edges [93]. Though originally a directed graph, we symmetrized the edges for the purposes of these experiments. Since not all users made a public checkin during the data collection period, the population we used here consists of 51,406 people. The average clustering coefficient in the network was 0.1723, while the fraction of closed triangles is 0.03979. The diameter (longest-shortest path in the symmetrized network) is 16, though the 90-percentile effective diameter is 6.
For purposes of the experiment we generated 900 respondent-driven samples of size r = 250, 500, 750 and hash space size from |O| = 2 Á 10 3 to |O| = 256 Á 10 3 within the Brightkite network, each obtained via an RDS process operating as specified in Assumption 2. The boxplot graphs in Fig 7(a)-7(c) show that estimator n c 2 -where no accommodation is made for the tendency of RDS to oversample tightly clustered network neighborhoods-underestimates the true population size of 51,406 in every case. Given the high clustering coefficient of the network (17.2%), it seems likely that, for a given sampling tree, the peer-discovery process necessarily walks across close pairs of nodes that shared one or more common vertices. Of note is that increasing the sample size and hash space size does little to correct for these effects.
Graphs (d-f) in Fig 7 present the boxplots of Brightkite population estimates using estimator n c 3 . As above, we generated 900 respondent-driven samples of size r = 250, 500, 750 and hash space size from |O| = 2 Á 10 3 to |O| = 256 Á 10 3 within the Brightkite network. We see that the three different hash space sizes show similar results, while increasing the sample size r from 250 to 500 and 750 improves the accuracy of the median estimate. Unlike the case in Fig  7(a)-7(c), we don't see a consistent pattern of underestimation, indicating that the cross-seed estimator n c 3 was successful in compensating for the clustering found in the network. As above, the overall size of the hash space has minimal effect on the accuracy of the median estimate, but we note that an increase in the RDS sample size improves the accuracy of the median estimate and produces smaller interquartile ranges.

Discussion
The results shown here indicate that size estimates for hidden and hard-to-reach populations can be derived from RDS samples across a range of topologies, and in the presence of significant network clustering. As important, this is accomplished under conditions of anonymity by way of identity hashing, e.g. using telefunken codes [50] or a Privatized Network Sampling (PNS) design [53]. The n c 3 estimator joins other successful, RDS-based population estimation procedures, such as those by Handcock and Gile [85], and Crawford, Wu, and Heimer [35]. Like Crawford et al, we make use of half-edge counts. However, our estimator invokes a different strategy-beginning with the original capture-recapture concept-and is shown to be robust across a wide range of topologies and assumptions.
A notable feature of the n c 3 estimator is that a lower level of variance can be expected at conventional RDS sample sizes. For r = 500 to 750, interquartile ranges were low relative to both the median estimate and true population size (See segments 1 and 2 of Table 2 which summarize a slice of the data in Fig 6).
Additionally, when hashing was employed towards ensuring subject anonymity, sufficiently large hash spaces (32 Á 10 3 or larger) and samples sizes (500 or above) produced a narrow One-step estimation of networked population size: Respondent-driven capture-recapture with anonymity range of estimates (See segment 3 of Table 2 which summarizes a slice of the data in Fig 6). Given concerns about RDS sample variance generally [28], these results indicate robustness against the faults of a single sample.
Another consistent feature observed in these experiments is the performance of the n c 3 estimator as graph density increases (See segment 4 of Table 2 which summarizes a slice of the data in Fig 6). In terms of the interquartile ranges, the estimator exhibits worse performance in sparse (i.e. " dðVÞ ¼ 3) as opposed to dense networks (i.e. " dðVÞ ¼ 10). Given the edge-sampling focus of our approach, this is not surprising. Fewer total edges suggest fewer total "matches" to discover, leading to greater variability depending on stochastic factors likely associated with the selection of RDS seeds and the random walk features of the RDS sampling process. These results suggest limits on the implementation of n c 3 estimator in sparse graphs. As researchers increasingly turn to RDS methods for sampling hard-to-reach populations, these results should be of considerable interest to those concerned with what is often referred to as "the denominator problem". Where agencies and government administrations seek to understand both the scope of public health challenges and to measure the effectiveness of their intervention and promotion efforts, the ability to estimate population size (and with this, population prevalence) is widely needed. The results presented here indicate that "one step" methods are capable of providing such estimates. Along with the methods mentioned above, this work has the potential to provide public health officials and planners with means to more effectively promote the health of hidden populations-and thus the health of the larger populations in which they are embedded.

Limitations
In using uniform random samples to estimate population size, it is possible for the proposed n 1 estimator to "fail" if one finds that hM(T, ;)i = 0 in Definition 3. This happens with greater frequency as the sample size r ( n the population size. Fig 8(a) shows the mean failure rate (the fraction of the 13,500 trials where n 1 failed to produce a population estimate), for each choice of population size n (ranging from 5 Á 10 3 to 40 Á 10 3 ), and uniform sample size r (chosen to be 250, 500 or 750). We see from Fig 8(a) that the failure rate is non-linear in both r and n. For small uniform samples r = 250, the failure rate of n 1 is *0 when n = 10 Á 10 3 , but undergoes an inflection at n = 20 Á 10 3 , and rises to 3.9% when the population size again doubles to n = 40 Á 10 3 . Note that we considered each of 5 families Lðl; nÞ; Pðl; nÞ; X ðl; nÞ; Bðl; nÞ, and Eðl; nÞ defined in Section 4.1, and each λ = 3, 5, 10; from each of these 15 concrete sample spaces, we used configuration graph sampling to select 30 random graphs of size n. In each of these 5 × 3 × 30 = 450 graphs, we generated 30 uniform samples (for n 1 ). In this manner, a total of 450 × 30 = 13,500 simulations were conducted. Similarly, in using respondent-driven sampling to estimate population size, it is possible for the proposed n 2 (resp. n 3 ) estimators to "fail" if one finds that hM(S, F)i = 0 in Definition 4 One-step estimation of networked population size: Respondent-driven capture-recapture with anonymity (resp. ∑ s2D hX(s, F, γ)i = 0 in Definition 9) . Fig 8(b) shows the mean failure rate (the fraction of the 13,500 trials where n 2 failed to produce a population estimate), for each choice of population size n (ranging from 5 Á 10 3 to 40 Á 10 3 ), and RDS sample size r (chosen to be 250, 500 or 750). RDS samples of size r = 250 exhibit an n 2 failure rate of *0 when n = 5 Á 10 3 , but undergo an inflection at n = 10 Á 10 3 ; the mean failure rate rises to 6% when the population size again doubles to n = 40 Á 10 3 . In examining the n 3 estimator , Fig 8(c) shows us that when it is used with RDS samples of size r = 250, it exhibits a failure rate of *0 when n = 5 Á 10 3 , but the failure rate undergoes an inflection at n = 10 Á 10 3 , rising to 8.8% when the population size again doubles to n = 40 Á 10 3 . For sample sizes that are 2X and 3X as large (i.e. r = 500 and r = 750) the inflection point is not yet reached at n = 40 Á 10 3 and mean failure rates remain below 0.1%. This indicates that our estimators based on RDS are more robust against failure than the n 1 uniform sampling estimator, and at typical RDS sample sizes (500 r 750), they are robust enough to be used in settings where the population size is expected to be on the order of n * 40 Á 10 3 . Fig 8(d)-8(e) explore the impact of hash space size on the mean failure rate. Here we consider a fixed sample size r = 500 and vary the size of hash space |O| between 2 Á 10 3 and 256 Á 10 3 . We observe that the mean failure rates of n c 2 and n c 3 (again taken across 13,500 experiments) grow linearly as n increases, but that the rate of growth depends on |O|. In particular, when |O| is too small (in this case 2 Á 10 3 or smaller), the mean failure rate is seen to grow steeply, even for small networks. The two graphs (d-e) make evident the tradeoff between subject anonymity/privacy and the failure rates of the estimator. When the hash space size is sufficiently large (32 Á 10 3 −256 Á 10 3 ), failure rates remain low, but smaller hash spaces (which provide for greater anonymity) may produce greater instability in the estimators. Finally, the three heatmaps in Fig 8(f) show how the failure rate of n c 3 rises whenever the hash space size or sample size decreases.
Although 32 Á 10 3 − 256 Á 10 3 may appear to be a very large hash space size, we note 10 4 32 Á 10 3 10 5 256 Á 10 3 10 6 : Thus, asking research subjects for the last 5 or 6 digits of their own telephone number and those digits of their friends' phone numbers would be sufficient to provide an accurate estimate (assuming that numerical digits are randomly assigned by phone service providers). In the event that research subjects remain reluctant to reveal precise digits of their own or their alter's phone numbers, a telefunken code could be constructed [50] or a Privatized Network Sampling (PNS) design [53] employed.