A review of reported network degree and recruitment characteristics in respondent driven sampling: implications for applied researchers and methodologists

Objective
Respondent driven sampling (RDS) is an important tool for measuring disease prevalence in populations with no sampling frame. We aim to describe key properties of these samples to guide those using this method and to inform methodological research.

Methods
In 2019, authors who published respondent driven sampling studies were contacted with a request to share reported degree and network information. Of 59 author groups identified, 15 (25%) agreed to share data, representing 53 distinct study samples containing 36,547 participants across 12 countries and several target populations including migrants, sex workers and men who have sex with men. The distribution of reported network degree was described for each sample, and characteristics of recruitment chains, and their relationship to coupons, were reported.

Results
Reported network degree is severely skewed and is best represented by a log normal distribution. For participants connected to more than 15 other people, reported degree is imprecise and frequently rounded to the nearest five or ten. Our results indicate that many samples contain highly connected individuals, who may be connected to at least 1000 other people.

Conclusion
Because very large reported degrees are common, we caution against treating these reports as outliers. The imprecise and skewed distribution of the reported degree should be incorporated into future RDS methodological studies to better capture real-world performance. Previous results indicating poor performance of regression estimators using RDS weights may be widely generalizable. Fewer recruitment coupons may be associated with longer recruitment chains.



Introduction
Since its development in 1997, respondent driven sampling (RDS) has become increasingly popular for measuring disease prevalence and correlates of disease in hidden populations [1]. In RDS, social connections among members of these hard-to-reach target populations are used to propagate recruitment, similar to snowball sampling. However, RDS differs from snowball sampling in two important ways: it requires the collection of additional information, including network size, and it creates long (as opposed to wide) recruitment chains. Through the use of coupons with unique codes, the number of people a participant can recruit is restricted. This produces long recruitment chains to ensure that the final sample is independent of the initial recruits. It also allows researchers to trace the recruitment process and collect information on who recruited whom. In addition, as a proxy for their sampling probability, participants are asked about their number of connections in the target population. This additional information enables better estimation of disease prevalence by adjusting for non-random selection probability. Several prevalence estimators which account for the RDS design have been developed; the most commonly reported are the Volz-Heckathorn RDS-II estimator [2] and the Gile successive sampling estimator (SS) [3]. Gile et al. [4] have recently reviewed the statistical advances in RDS and give a thorough overview of the available estimators. A number of studies have evaluated the performance of estimators and found that none are uniformly superior [5][6][7][8][9][10]. The accuracy of the variance estimates is still unclear [2,7,11], and depends on network and sampling conditions. Reported network degree (henceforth referred to simply as degree) is an important variable in RDS studies. Degree refers to the number of ties an individual has to others in the target population.
Individuals with more network connections are more likely to be recruited into an RDS study, so participants' reported degree is a proxy measure of sampling probability. Details on the distribution of degrees in RDS simulation studies are scarce, but degree has recently been modelled as a Poisson process [10], or with the more flexible Conway-Maxwell-Poisson distribution [12]. Empirical research presented by Killworth et al. [13] suggests that social networks have a right-skewed distribution and their histograms of network degree suggest a log-normal distribution. Our recent finding [14] that weighted regression methods performed poorly when the reported network degree was highly skewed raised the question of whether those data were unique or if highly skewed degree distributions are common. Preliminary analyses suggested that, if degree is normally distributed, weighted regression may perform much better. Therefore, the question of how degrees are distributed is of great practical importance: if skewed distributions are common, then our recommendation for regression analyses remains not to weight observations; otherwise, more work is necessary to determine appropriate regression strategies. Our previous regression work was motivated by an RDS sample of Indigenous people living in Toronto, Canada. These participants reported degrees that were extremely skewed and appeared log normally distributed. This distribution resulted in participants with low reported degree being assigned very high weights. These acted as leverage points in the regression analysis, and resulted in poor regression parameter coverage rates [14].
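The way a skewed degree distribution concentrates weight on low-degree participants can be illustrated with a minimal sketch. It assumes weights proportional to the inverse of reported degree, as in RDS-II (Volz-Heckathorn) weighting; the function name and the degree values are hypothetical and are not taken from any of the samples reviewed here.

```python
def inverse_degree_weights(degrees):
    """Weights proportional to 1/degree, normalised to sum to 1."""
    inverse = [1.0 / d for d in degrees]
    total = sum(inverse)
    return [w / total for w in inverse]

# Hypothetical, right-skewed reported degrees (illustration only).
degrees = [1, 2, 5, 10, 10, 20, 30, 50, 100, 1000]
weights = inverse_degree_weights(degrees)

# Share of the total weight carried by the one participant reporting degree 1.
share_lowest = weights[0]
print(f"weight share of the degree-1 participant: {share_lowest:.2f}")
```

In this toy sample a single participant with a reported degree of 1 carries close to half of the total weight, which is the kind of leverage point described above.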
To continue to make improvements to the quality of inferences for RDS samples, it is necessary to understand the real-world samples to which these methods are applied. Much work has been dedicated to evaluating RDS estimates by simulation, which requires some assumption regarding degree distribution. The objectives of this study were two-fold: 1) to describe the distribution of reported degree in real-world samples from a variety of geographies and 2) to better inform RDS methodology researchers on how to model degree distributions for methodological studies.

Search strategy
Authors of recently published papers were contacted and asked to share study data on reported degree and recruitment chains. The PubMed database was searched for papers using RDS published in English, between 1 January 2019 and 31 August 2019 using the following search term: (("respondent driven sampling"[Title/Abstract]) AND ("2019/01/01"[Date-Publication]: "2019/08/31"[Date-Publication])) AND "english" [Language]. One hundred six results were returned; there were three additional manuscripts in the author's reference database published in this period, so 109 manuscripts were examined for eligibility. Eight were excluded: one duplicate manuscript, two protocol studies, three studies employing non-traditional RDS techniques without degree estimates, a methods-based manuscript with no sample and one study with a sample size too small (n = 36) to examine degree distribution. From the remaining 101 manuscripts, 59 unique author groups were identified and contacted; 15 (25%) agreed to share data on 53 distinct RDS samples. Research ethics approval was not obtained because all data were anonymised prior to being shared. Data on the chain linkages and personal network sizes were sought; no demographic data were collected. Details of the data available from these studies are presented in Table 1.

Analysis
For each sample, a number of distributions were investigated: the Poisson, geometric, negative binomial, normal and log normal distributions using the fitdistrplus package in R [15] and the discrete q-exponential, Poisson-lognormal, Conway-Maxwell-Poisson (CMP), Yule and Waring distributions using the degreenet package [16]. Fellows [10] used a Poisson distribution to simulate degree distribution; the Poisson-lognormal, CMP, geometric and negative binomial, discrete q-exponential, Yule and Waring are all more general discrete distributions that allow for extra dispersion. McCreesh et al. [9] reported a network distribution that was 'approximately normal with a slight positive skew', and raw plots suggested a skewed distribution, so the continuous normal distribution was fit to both the raw and log-transformed reported degree. Fit was assessed using the Bayesian information criterion (BIC), with smaller values indicating better fit, and by visual inspection of the raw data and fitted curves. Data from studies collected across multiple sites or years were left disaggregated. Participants whose reported network degree was missing were removed from the analysis. Those who reported a network degree of zero were recoded to 1, since in order to be recruited into the study, they needed to know at least one other member of the population. For each sample and participant, the wave that the participant was recruited into and the identifier of the seed the participant was recruited from were determined. These data were used to examine the distribution of waves across studies and to determine if the reported degree of each seed was correlated with the total number of participants in that seed's cluster. To give an indication of how effective most RDS studies may be in achieving samples independent of the initial seeds, recruits were ordered by wave and the wave of the median recruit was determined for each sample. This indicates the minimum distance from seeds for at least 50% of the sample.
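The model comparison described above can be sketched in a few lines. This is a simplified illustration, not the fitdistrplus or degreenet workflow used in the study: it fits only a Poisson and a log normal distribution by maximum likelihood and compares their BIC values. The degree values are hypothetical, and comparing a discrete probability mass with a continuous density is an approximation that is assumed here for illustration.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; smaller values indicate better fit."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def poisson_loglik(x):
    lam = sum(x) / len(x)  # maximum likelihood estimate of the Poisson mean
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in x)

def lognormal_loglik(x):
    logs = [math.log(v) for v in x]
    n = len(logs)
    mu = sum(logs) / n
    sigma2 = sum((v - mu) ** 2 for v in logs) / n  # maximum likelihood variance
    # Log-likelihood of the log normal density evaluated at the MLE.
    return -sum(logs) - 0.5 * n * math.log(2 * math.pi * sigma2) - 0.5 * n

# Hypothetical reported degrees; None marks a missing report.
raw = [0, 1, 2, 3, 5, 5, 8, 10, 10, None, 15, 20, 20, 30, 50, 100, 200, 1000]

# Drop missing values and recode zeros to 1, as described in the text.
degrees = [max(d, 1) for d in raw if d is not None]

n = len(degrees)
bic_poisson = bic(poisson_loglik(degrees), n_params=1, n_obs=n)
bic_lognormal = bic(lognormal_loglik(degrees), n_params=2, n_obs=n)
print(f"BIC Poisson: {bic_poisson:.1f}, BIC log normal: {bic_lognormal:.1f}")
```

For a sample this skewed, the Poisson fit is penalised heavily by the largest reported degrees, so the log normal BIC is much smaller.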
The ease with which RDS chains propagated was investigated by calculating the number of waves recruited for each seed and the number of recruits for every participant, across all studies.
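The recruitment quantities described above (wave, seed, chain length per seed, recruits per participant and the wave of the median recruit) can all be derived from a table of who recruited whom. The sketch below assumes a hypothetical mapping from participant identifier to recruiter identifier, with seeds marked by None, and assumes that seeds are counted as wave 0 and included when locating the median recruit.

```python
import statistics
from collections import Counter

def recruitment_waves(recruiter_of):
    """Return {participant: (wave, seed)}; seeds have recruiter None and wave 0."""
    resolved = {}

    def resolve(pid):
        if pid not in resolved:
            parent = recruiter_of[pid]
            if parent is None:
                resolved[pid] = (0, pid)
            else:
                wave, seed = resolve(parent)
                resolved[pid] = (wave + 1, seed)
        return resolved[pid]

    for pid in recruiter_of:
        resolve(pid)
    return resolved

# Hypothetical sample: two seeds, one of which did not recruit.
recruiter_of = {"s1": None, "s2": None, "a": "s1", "b": "s1",
                "c": "a", "d": "c", "e": "d"}
info = recruitment_waves(recruiter_of)

# Wave of the median recruit (participants ordered by wave).
median_wave = statistics.median_low(wave for wave, _ in info.values())

# Length of the longest chain started by each seed.
longest_chain = {}
for wave, seed in info.values():
    longest_chain[seed] = max(longest_chain.get(seed, 0), wave)

# Number of people recruited by each participant (0 if none).
counts = Counter(r for r in recruiter_of.values() if r is not None)
recruits_per_participant = {pid: counts.get(pid, 0) for pid in recruiter_of}
```

The recursive lookup is adequate for chains of the lengths seen in practice; an iterative traversal would be a reasonable alternative for very deep chains.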
To determine how frequently the population of available recruits is substantially depleted by the sampling process, we used a method similar to that reported by Gile et al. [17] and Crawford et al. [18]. These authors regressed 1:n (with n representing the sample size) against the time-ordered reports of network degree to determine if reported degree decreases monotonically with time. Such a decline would be expected if the available recruits were indeed being depleted by the sampling process. Our approach was similar: the natural logarithm of the reported degree was regressed on the wave number (as a proxy for recruitment order).
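A minimal sketch of this trend check is given below. It assumes an ordinary least squares fit with wave as the predictor and the natural logarithm of reported degree as the response, so that the slope is expressed in log(degree) per wave; the data and function name are hypothetical.

```python
import math

def ols_slope(x, y):
    """Slope of the least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical recruitment waves and reported degrees (illustration only).
waves = [0, 0, 1, 1, 2, 2, 3, 3]
degrees = [20, 30, 20, 25, 15, 20, 10, 20]

slope = ols_slope(waves, [math.log(d) for d in degrees])
print(f"change in log(degree) per wave: {slope:.3f}")
```

A clearly negative slope would be consistent with depletion of available recruits, while a slope near zero would not.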
Participating studies were asked to report the question(s) that were used to elicit the number of ties to others in the population (their reported degree). Samples were classified into two groups: those that used a single question to define ties and those that used at least two nested questions (see Table 2 for examples). Because the data are observational, formal statistical tests were not applied, but the reported degree was plotted for different populations for studies with single vs. nested questions to explore the effect of tie definition.

Results
Data from 15 groups, containing 53 distinct RDS samples from North and Central America, Europe, Africa and Asia were collected. These samples mainly targeted four types of populations: men who have sex with men, drug users, female sex workers and migrants. In addition, there were samples of transgender women, Indigenous people, youth of colour and one general population sample. The shape of the reported degree distributions was remarkably similar across population type and geography. Table 1 details the location and timing of the studies as well as the questions asked to elicit reported degree.

Distribution of reported degree
Under the criterion of minimising the Bayesian information criterion (BIC), the log-normal distribution was the best fit to the data for all samples. The Conway-Maxwell-Poisson and Poisson models were consistently a poor fit for the network degree data. Table 3 describes the distribution of the raw and log-transformed degrees and Table 4 indicates the rank of the fit of each candidate distribution within each sample, ordered by the BIC statistic. Fig 3 shows the relative frequency of reported degrees across various population types, aggregated across samples, for degrees up to 100. Reported degree, when greater than fifteen, is commonly reported in multiples of five or ten. Of the 12,492 reported degrees greater than fifteen, 81.8% were rounded to the nearest ten, 11.6% were rounded to the nearest five and only 6.6% ended in neither a zero nor a five. This rounding is evident in Fig 3: the general shape and spread of the observed degrees follow a log normal distribution, but degrees ending in 0 are reported much more frequently than expected.
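The rounding summary above can be reproduced for any sample with a short tabulation. The sketch below assumes that a degree "rounded to the nearest ten" is one ending in 0 and a degree "rounded to the nearest five" is one ending in 5; the function name and example values are hypothetical.

```python
def heaping_summary(degrees, threshold=15):
    """Proportions of degrees above the threshold ending in 0, in 5, or in neither."""
    large = [d for d in degrees if d > threshold]
    n = len(large)
    if n == 0:
        return {"ends_in_0": 0.0, "ends_in_5": 0.0, "other": 0.0}
    ends_in_0 = sum(1 for d in large if d % 10 == 0)
    ends_in_5 = sum(1 for d in large if d % 10 == 5)
    return {
        "ends_in_0": ends_in_0 / n,
        "ends_in_5": ends_in_5 / n,
        "other": (n - ends_in_0 - ends_in_5) / n,
    }

# Hypothetical reported degrees (illustration only).
summary = heaping_summary([3, 10, 15, 20, 25, 30, 40, 42, 50, 100])
print(summary)
```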

Recruitment characteristics
The number of waves recruited by each seed, corresponding to the length of recruitment chains, was calculated across all samples. Approximately one-third of seeds were unsuccessful in recruiting participants into the study, one-third of seeds produced recruitment chains of between one and three waves and the final third produced chains four waves or longer. Fig 4 examines the relationship between the number of waves in the longest recruitment chain and the wave of the median recruit. Fig 5 illustrates the distribution of recruitment chain length for all seeds; over 30% of seeds did not recruit. Fig 6 plots seed degree against both chain length and number recruited and indicates seed degree is not correlated with recruitment success. Recruitment per person (including seeds) was summarised across all studies; of the 36,547 participants, 47% did not recruit, 15% recruited one person, 34% recruited two people and 4% recruited three or more.

Table 2. Examples of different strategies for defining ties among population members used to elicit reported network degree from respondents.
Defining ties using a single question:
1. How many other drug users do you know in your community?
Defining ties using nested questions:
1. How many migrant sex workers who are over 18 and are currently or recently working in your job from (county) do you know?
2. Of these people from above, how many know you? Of these people who know you, how many did you see in the past week?
3. Of those people you saw, how many did you speak to in the past week?
https://doi.org/10.1371/journal.pone.0249074.t002

Reported degree and tie definition
Samples were classified into two groups based on how network ties were defined: those that used a single question and those that used two or more nested questions (as in Table 2).

Trend in degree over time
A reduction in reported degree as study recruitment progresses can indicate that a large fraction of the target population has been recruited. Fig 8 illustrates the log of the reported network degrees, as a function of recruitment wave, for the samples experiencing the greatest increase and decrease in reported degree over successive waves. The slopes of the linear regression of the log of degree on wave are shown in the bottom panel of Fig 8, plotted against study sample size, for every study. For most studies, there is little change in reported degree over wave; the study with the largest decline had a rate of change of -0.14 log(degree)/wave, amounting to twenty fewer network connections over ten waves of recruitment.
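Because the slope is on the log scale, its effect is multiplicative. The short calculation below shows what a slope of -0.14 log(degree) per wave implies over ten waves. The starting degree it derives is an inference from the figures quoted above, assuming the fitted decline is exactly twenty connections; it is not a value reported by the study.

```python
import math

slope_per_wave = -0.14   # log(degree) per wave, from the text
n_waves = 10

# Multiplicative change in fitted degree after ten waves.
factor = math.exp(slope_per_wave * n_waves)

# Fitted starting degree that would correspond to a drop of twenty connections
# (an assumption made for illustration, not a reported value).
implied_start = 20 / (1 - factor)

print(f"fitted degree after {n_waves} waves is {factor:.2f} times the starting value")
print(f"implied starting degree: {implied_start:.1f}")
```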

Discussion
Details about RDS recruitment chains and reported degrees were collected and summarised for 53 samples, encompassing 36,547 participants representing several target populations in 12 countries. To our knowledge, this is the first study to examine the distribution of RDS-reported degrees across several countries and target populations. Our findings have implications for applied researchers using RDS and for statisticians working to improve estimates arising from these data. Reported degree is a discrete variable, but is best described by a normal distribution on the log-transformed degrees, and reports of very high degrees are common. Instead of interpreting individuals with large reported degrees as outliers, the possibility of individuals acting as 'super nodes' arises and thoughtful consideration needs to be given before modifying or removing these values. In their work estimating population size from RDS data, Crawford et al. [18] report removing a subject with reported degree of 200 who was considered an outlier. Our results indicate that this is likely too low a limit for truncation and we suggest caution in modifying or removing data. Sensitivity analysis, in which results with all reported degrees are compared to results with a truncated upper limit on degree, may be useful to inform researchers about the effects of highly connected participants on RDS estimates of prevalence.
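A sensitivity analysis of this kind might be sketched as follows. The example assumes the RDS-II (Volz-Heckathorn) form of the prevalence estimate, an inverse-degree weighted mean of a binary outcome, and compares the estimate using all reported degrees with estimates after capping degree at several upper limits. The data, the caps and the function names are hypothetical.

```python
def rds2_prevalence(outcomes, degrees):
    """Inverse-degree weighted mean of a 0/1 outcome (RDS-II style estimate)."""
    numerator = sum(y / d for y, d in zip(outcomes, degrees))
    denominator = sum(1.0 / d for d in degrees)
    return numerator / denominator

def truncate(degrees, cap):
    """Cap reported degrees at an upper limit; None leaves them unchanged."""
    if cap is None:
        return list(degrees)
    return [min(d, cap) for d in degrees]

# Hypothetical outcomes (1 = condition present) and reported degrees.
outcomes = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1]
degrees = [2, 3, 5, 8, 10, 20, 40, 150, 400, 1200]

estimates = {cap: rds2_prevalence(outcomes, truncate(degrees, cap))
             for cap in (None, 200, 1000)}
for cap, estimate in estimates.items():
    print(f"cap={cap}: estimated prevalence {estimate:.3f}")
```

If the estimates differ little across caps, highly connected participants have limited influence on the prevalence estimate in that sample; larger differences would call for closer inspection before any value is modified or removed.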
These findings indicate that reported degree is not a precise measure of actual degree; nearly all degrees greater than ten are reported to the nearest five or ten people. This does not suggest that reported degree is inaccurate, only that it is imprecise. Although reports of degree greater than 1000 may initially seem unlikely, for an individual who has been closely involved with community members over several years 1000 connections is not unreasonable. The fact that 14 of the 53 samples included reported degrees in excess of 1000 suggests that these large network connections are real. We cannot comment further based on the data collected, but future work may focus more on the accuracy of reported degree and its importance in analysing RDS data. For statisticians developing RDS estimators, the distribution of reported degrees may have implications for estimator accuracy and precision. Statistical tests of the appropriateness of different distributions, such as the Shapiro-Wilk test for normality, are not useful for RDS data because of the tendency for reported degree to be reported to the nearest multiple of 5 or 10. For this reason, we compared the likelihood of several reasonable candidate distributions: the Poisson, geometric, negative binomial, discrete q-exponential, Poisson-lognormal, Conway-Maxwell-Poisson, Yule and Waring, as well as the continuous normal distribution on both raw and log-transformed degree. Although a Poisson counting process may seem like a natural mechanism for modelling connectedness, our results indicate that this is the least appropriate distribution for describing degree. Results based on Poisson-simulated degree, such as Fellows' [10] simulations of the performance of the homophily configuration graph for prevalence estimation, may need to be re-examined in light of what we are now learning about degree distributions in practice.
In our earlier work [14], we showed that weighted regression methods are not suitable for RDS data when the degree distribution is highly skewed, and so we reiterate the need for caution when applying weighted regression methods to RDS data. Extremely skewed degree data results in people with very few reported connections receiving high Volz-Heckathorn weights. These individuals then act as leverage points, and can either nullify true relationships or introduce relationships where none exist. While alternative weighting strategies could be employed, our words of warning stem from the common use of the RDS-II (Volz-Heckathorn) weights in regression analyses of RDS samples. Long recruitment chains are desirable to minimise the impact of the initial seeds on the final sample. Simulation studies have shown that even with heavily biased seeds, only four or five waves are necessary to ensure unbiased prevalence estimates [1]. Papers often report the length of the longest recruitment chain, but we have not found information regarding wave distribution in the literature. In the samples observed here, all studies that reported maximum chain length of at least eight waves had five or more waves for the median recruit (Fig 5). Studies offering only two coupons per participant achieved the longest chains, while those offering five or more were the least likely to have a median chain length of at least five waves. We encourage authors to report the wave of the median participant, ordered by recruitment timing, as a measure of the potential dependency of the sample on the initial seeds.
There is some evidence that the definition of network ties dampens the degree reports, but the shape of the distribution and the overall spread appear less affected (Fig 7). The implications of tie definition for RDS estimators are beyond the scope of the current study, but require further investigation.
We found no evidence that the pool of available recruits was depleted by the recruitment process for the samples investigated here. Although Fig 8 indicates that the reported degree often decreases as recruitment progresses (negative values indicate an inverse relationship between degree and wave), the magnitude of the decline is minimal, and there were a number of studies with sample size near 1,000 where the reported degree actually increased with successive waves.
A limitation inherent in any survey is response bias and this survey is no exception. We cannot evaluate the distribution of the reported degree for the samples that were not shared. However, we feel that the breadth of sample types (MSM, FSW, PWID, people of colour, migrant workers) and the geographical coverage of the samples (North America, Europe, Africa, Asia and Central America), coupled with the fact that all samples were best described by a normal distribution on log-transformed degree, is sufficient to be confident that our results are generalizable. Future work will evaluate the impact of extremely skewed degree on the accuracy of RDS prevalence estimators.
A valid RDS study will achieve a representative sample of the target population, and accurately estimate the disease burden in that community. For researchers employing RDS methods we have two key findings: 1) fewer coupons per participant may be useful in achieving longer recruitment chains and 2) reports of very high network degrees are relatively common, and what constitutes an outlier is unclear. For researchers investigating RDS estimator performance, we recommend using log normal distributions for reported degrees, and recognising that degree is likely to be imprecisely reported by participants. Methodological work on appropriate methods for RDS data will be most informative if validation is undertaken using data that reflects what is observed in practice.
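For simulation studies, the recommendation above might be implemented along the following lines. This is a sketch under stated assumptions rather than a validated generator: degrees are drawn from a log normal distribution, rounded to a whole number of at least 1, and values above 15 are then heaped to a multiple of ten or five. The parameters mu and sigma, the 0.8 probability of rounding to ten and the function name are all hypothetical choices made for illustration.

```python
import random

def simulate_reported_degree(n, mu, sigma, threshold=15, p_ten=0.8, seed=0):
    """Draw log normal degrees and mimic imprecise reporting above the threshold."""
    rng = random.Random(seed)
    reported = []
    for _ in range(n):
        degree = max(1, round(rng.lognormvariate(mu, sigma)))
        if degree > threshold:
            # Heap large degrees to a multiple of 10 (or of 5 otherwise).
            base = 10 if rng.random() < p_ten else 5
            degree = max(base, base * round(degree / base))
        reported.append(degree)
    return reported

# Hypothetical parameters on the log scale (illustration only).
sample = simulate_reported_degree(n=500, mu=2.5, sigma=1.2)
print(sorted(sample)[-5:])  # the largest simulated reports
```

In practice, mu and sigma would be chosen to match the log-transformed degrees of the population of interest, and the heaping probabilities could be tuned to the rounding pattern observed in comparable samples.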