Reply & Supply: Efficient crowdsourcing when workers do more than answer questions

Crowdsourcing works by distributing many small tasks to large numbers of workers, yet the true potential of crowdsourcing lies in workers doing more than performing simple tasks—they can apply their experience and creativity to provide new and unexpected information to the crowdsourcer. One such case is when workers not only answer a crowdsourcer’s questions but also contribute new questions for subsequent crowd analysis, leading to a growing set of questions. This growth creates an inherent bias for early questions since a question introduced earlier by a worker can be answered by more subsequent workers than a question introduced later. Here we study how to perform efficient crowdsourcing with such growing question sets. By modeling question sets as networks of interrelated questions, we introduce algorithms to help curtail the growth bias by efficiently distributing workers between exploring new questions and addressing current questions. Experiments and simulations demonstrate that these algorithms can efficiently explore an unbounded set of questions without losing confidence in crowd answers.


Introduction
Crowdsourcing has emerged as a powerful new paradigm for accomplishing work by using modern communications technology to direct large numbers of people who are available to complete tasks (workers) to others who need large amounts of work to be completed (crowdsourcers) [1][2][3][4]. Crowdsourcing often focuses on tasks that are easy for humans to solve, but may be difficult for a computer. For example, parsing human written text can be a difficult task and optical character recognition systems may be unable to identify all scanned words [5][6][7]. To address this, the reCAPTCHA [8] system takes scanned images of text which were difficult for computers to recognize and hands them off to Internet workers for recognition. By having many people individually solve quick and easy tasks, reCAPTCHA is able over time to transcribe massive quantities of text. Crowdsourcing in general is especially important as a new vehicle for addressing problems of social good [9][10][11].
Deciding on an optimal way to assign particular tasks to workers, and in what order, remains an active area of research. For many problems, multiple worker responses to a task must be aggregated to determine a final answer [4] but often, a budget limits the total crowdsourcing resources available [12][13][14][15], either due to financial limits when workers are compensated or time constraints where the speed or size of the crowd limits the number of tasks to be performed or questions to be answered. Most previous work on optimal task assignment takes the form of a Markov Decision Process (MDP) [12,16]. MDP provides a rigorous mathematical framework to test policies for allocating tasks to workers [17]. Using MDP and other strategies, such as Thompson sampling [18], methods have been introduced to efficiently aggregate responses from workers, including consideration of which workers are most likely to be well suited for a given question based on their past performance on related questions [19][20][21][22][23][24][25][26][27].
However, to the best of our knowledge, past research has been limited to the case where a fixed set of tasks need to be accomplished, and the response of a worker to a task is only ever to complete the assigned task. In contrast, consider a crowdsourcing problem where workers are able to do more than perform tasks-they may be allowed to propose new questions as well as answers to a given question. The truest expression of crowdsourcing must incorporate the intuition and experience of workers, who are potentially capable of providing the crowdsourcer with far more actionable information for many problem domains [28][29][30]. While MDP made significant contributions to the design of question assignment algorithms, when the question set is growing due to the crowd, MDP does not naturally account for the hidden state transitions needed to represent newly contributed questions. The lack of research on algorithms accounting for growing question sets reveals a gap in our abilities to efficiently assign questions to workers.
To this end, we study a type of crowdsourcing problem we term Reply & Supply. As workers answer a given question (Reply), they are given the opportunity to propose a related question (Supply). Example applications of Reply & Supply include: • Exploring social networks ("Are Alice and Bob friends?" "Who else is friends with Alice?" "With Bob?") • Product recommendations ("Have you bought a camera and laptop together?" "What else would someone buy when buying a camera?") • Image classification ("Does this photo contain a horse and a mountain?" "What else does it contain?") • Causal attribution ("Do you think 'hot weather' causes 'violent crime' ?" "What causes 'violent crime' ?") • Health informatics: Crowdsourcing patient reports to find connections between co-occurring symptoms, new drug interactions, etc. ("Do you suffer from symptom X?" "What other symptoms do you have?" "Do you take drug Y ?" "What other drugs do you take?") In all these examples, new questions can be built by combining crowd-suggested responses with the components of the original question, leading to the creation of a network structure of interrelated questions. To explore a social network, for example, if a worker responds that Alice and Bob are friends (Reply) and also proposes that Alice and Carol are friends (Supply), then a new question ("Are Alice and Carol friends?") is formed that other workers can consider and that links to other questions related to Alice. Further, we will show that this network representation naturally generalizes to non-network question sets, and the methods we develop here are fully applicable to both question sets and question nets. Question networks can be studied with tools from network science that consider the statistical properties governing how theoretical and real-world networks grow and behave [31][32][33][34][35][36][37][38]. One property, the scale-free or heavy-tailed degree distribution [33], where most nodes in the network have low degree but some very high-degree nodes do exist, holds in many real-world networks. How a scale-free network grows over time introduces biases ('first-mover advantage') that are also inherent in a growing crowdsourced experiment.
In brief, this manuscript makes the following contributions: 1. The introduction of a growing network of linked questions with an accompanying theoretical analysis; 2. The use of Thompson sampling to develop crowd-steering algorithms that enable efficient exploration of an evolving set of tasks or questions without losing confidence in answers; 3. Simulations and real-world crowdsourcing experiments that validate the efficiency and, to some extent, the accuracy of the crowdsourcing performed under the crowd-steering algorithm.
The rest of this paper is organized as follows: Section 2 poses the generic crowdsourcing problem we focus on, analyzes a simple graphical model of how a growing question net is built by a crowd, and uses this model to motivate methods for efficiently assigning questions to workers as the question net grows. Section 3 describes experiments and evaluation metrics to test the proposed theory and methods with both simulated and real-world crowdsourcing tasks. Section 4 presents the results of these experiments and Sec. 5 concludes with a discussion of these results and future work.

Methods
Here we introduce a graphical model of a growing question network where questions consider the presence or absence of a relationship between two items (Sec. 2.1). We study the network's properties under a null condition where the crowdsourcer assigns questions to workers randomly without use of a "steering" algorithm to provide guidance (Sec. 2.2). We then use these properties to develop a probability matching algorithm which provides said guidance to the crowdsourcer (Sec. 2.3).

Crowdsourcing growing question networks
We model a growing set of questions (or tasks) as a graphs where nodes are items and edges or links represent questions relating pairs of items. A question network G = (V, E) is composed of a set of nodes V and a set of edges E, where |V | = N and |E| = M . Edge attributes record the answers given by workers, i.e., associated with each edge is a categorical variable storing the counts of worker responses. Those workers may also propose new questions (i.e., new combinations of new or existing items), leading to new nodes and edges. This network model also accommodates non-network question sets, for example by considering each question as a disjoint edge..
As an example of such a network, consider a synonym proposal task (SPT) where workers are asked if two words u and v are synonyms. The question is the link (u, v) between two items u and v representing those words. After replying to the question, the worker may also supply another word w which is a synonym for u, for v, or for both words. This grows the question network by introducing new questions linking items (u, w), or items (v, w), or both (u, w) and (v, w). The degree k i of item i counts the number of questions linking item i to other items.
We focus on cases, such as the SPT, where questions have binary answers, e.g, when workers are asked whether or not a link between two items should exist. Edge attributes on links capture the number of 'yes' and 'no' answers given by workers. However, this graph representation is flexible enough to allow edge attributes to contain any number of dimensions and there are no restrictions imposed on how workers propose questions. Moreover, this graphical model is capable of representing growing question sets without such relations, for example, a collection of N disjoint questions always containing the response items 'True' and 'False' only may be a two node, N multi-edge graph. While not a particularly meaningful representation, it demonstrates that the algorithms we develop are applicable to general crowdsourcing tasks without modification. Lastly, one can also extend this model to non-binary, multiple choice questions in several ways, including representing questions as hyperedges in a hypergraph.

Null model
We propose a generative null model for a growing question network [39,40]. Beginning from a network with one question, a crowdsourcer randomly chooses existing questions to send to workers also chosen at random. Those workers answer the questions and then with some probability also propose new questions. We study the properties of the network under these assumptions to motivate the development of a probability matching algorithm that can allow a crowdsourcer to efficiently explore the growing question network.
The network begins (at time t = 0) with two nodes and one undirected link connecting those nodes, representing a single question considering two items. Under the null model, every link (i, j) has an associated innovation rate ρ ij . The innovation rate for (i, j) defines the probability a random worker will introduce a new question into the network when presented with question (i, j). If she chooses to innovate, the new question may relate to either or both of the items i and j of the original question the worker was given.
Specifically, suppose a random worker is given question (u, v) relating items u and v. Under the null model: 1. The worker answers question (u, v) with probability 1.
2. The worker proposes a new item w to study with probability ρ uv : (a) w is linked to one of the items of the original question with probability γ uv .
A single new question, either (u, w) or (v, w) chosen uniformly at random, is introduced; (b) otherwise, w links to both items of the original question with probability 1 − γ uv . Two new questions, (u, w) and (v, w), are introduced.
3. Repeat from (1) with another sampled question and worker until termination.
This model is tractable but quite basic and does not consider many potential details. For example, it assumes that while questions may have different innovation rates, workers do not. However, for sufficiently large numbers of workers, the average response is always going to be the primary concern, particularly in most crowdsourcing tasks which need to aggregate multiple worker responses to decide upon a final answer for a question. If it is necessary, a crowdsourcer interested in accounting for variation between workers can propose a statistical model for their features, and then use statistical inference to estimate these worker parameters during crowdsourcing (see also the Discussion).
We now prove several average properties of this null model. Studying the characteristics of the randomly growing, uncontrolled network informs policies that a crowdsourcer may use to manipulate the network (such as the algorithm we develop in Sec. 2.3). Many of these results are also informative for non-network growing question sets.
The first theorem describes question growth in the random uncontrolled network.
Theorem 1 (Rate of question growth). The total number of links M (t) as a function of time t is, on average, M (t) = ηt + 1 where η = ρ (2 − γ ) is termed the exploration rate.
Proof. For the network to grow, a worker must suggest an additional question, which occurs with probability on average ρ (average of ρ ij ). Once the worker commits to a suggestion, one question is added with probability on average γ or two questions are added with probability on average 1 − γ . Combining these two possibilities, the total number of questions grows on average over one timestep according to with initial condition M (0) = 1 representing the single seed question of the network. Making a continuum approximation, this difference equation becomes where the exploration rate η ≡ ρ (2 − γ ) plays an important role in the overall network growth.
The number of links grows linearly with a rate η that combines the average rates ρ and γ . Intuitively, the network grows faster if questions are more likely to be innovative (larger ρ ), and/or the worker is able to suggest a question for both items at the same time (smaller γ ).
The solution to the rate equation for question growth can be used to compute the mean number of worker answers per question: Theorem 2 (Mean answer density). The mean answer density (number of answers per

Proof. The mean number of answers per question is
A = total number of answers total number of questions .
At every time step a question in the network accumulates a single answer from a worker. The denominator of (2) is the solution (1), and so the average density of answers per question is The mean answer density correlates with the overall uncertainty in the crowdsourcing since there is generally more certainty (but not necessarily correctness) in crowd responses when more workers on average have independently answered questions. Controlling the answer density, and therefore the certainty, now boils down to controlling the exploration rate η. The mean answer density's dependence on η also encapsulates an 'exploration-exploitation' tradeoff: lower η leads to higher answer density, but at the cost of less exploration in the network; higher η increases the exploration but lowers answer density and makes more uncertainty in the network. In this null model, the crowdsourcer does not make choices that can exploit this, but tuning between these poles is a key component of the probability matching algorithm we introduce in Sec. 2.3.
The previous two theorems govern global properties of random question networks. We now turn to properties of individual items within the network to explain the unequal distribution of questions attached to items:

5/20
Theorem 3 (Rich-get-richer mechanism). A node i entering the network at time t i will gain degree, on average, as k i (t) = η ρ 1+ηt 1+ηti where H is the Heaviside function.
Proof. An existing item i only gains a question when the crowdsourcer chooses a question attached to i and the worker answering that question proposes a new question involving i. A question (i, j) associated with item i is selected by the crowdsourcer with probability After the worker answers question (i, j) she must innovate (probability ρ ) with an item w that is not already a neighbor of i (and w = i) and the new question must be (w, i) (probability γ /2) or it must be two questions (w, i) and (w, j) (probability 1 − γ ). If the worker introduces question (w, j) only (probability γ /2) then i does not gain a new question and so this possibility does not contribute to k i (t). Combining these possibilities together, k i (t) evolves on average according to We approximate and simplify this difference equation as before: where k i (t i ) is the initial degree when item i was introduced at some time t i . Solving Eq. (4) results in We see from this derivation that the rich-get-richer, preferential attachment mechanism [33] is automatic when questions are chosen at random: an item i is more likely to appear in a sampled question the more questions it has, and therefore items with more questions are more likely to gain further questions than other items. Further, the degree of an item depends critically on two quantities. The first, the ratio of exploration rate η to ρ , equally affects all items in the network. The second, the time of entry t i , dampens the growth of items that enter the network late and increases the growth of earlier items. This phenomena is often called the 'first mover's advantage', and in the context of crowdsourcing a growing network, items entered earlier in the system accrue more questions than later items.
Using the local estimate of item degree to derive the global degree distribution of the network, we find: Theorem 4 (Degree Distribution). The degree distribution of the growing question network as t → ∞.
Proof. Following [39], begin with the cumulative probability distribution of item i's degree: Meanwhile, the entry times t i of items into the network follow a distribution proportional to ρ uniformly through time: and, after normalizing, we discover the time of entry follows a uniform distribution. Referring back to (7) and using the integral definition of a cumulative distribution, Lastly, differentiating (8) with respect to k gives the degree distribution: as t → ∞.
Our theoretical analysis is supported by simulations of growing question networks (Fig 1). We conducted 5, 000 simulations and recorded the degree distribution P (k) and degree k of items across different values of exploration rate η and time of item entry t i .  Fig 1(a) validates the slower rate of question accrual for late arriving items, and Fig 1(b) shows the degree distribution's match to theory by the collapse of each curve over multiple values of η.

Probability matching algorithm for growing question sets and nets
Most algorithms for steering workers towards questions choose questions by defining a metric that captures important characteristics in the system. For example, algorithms stressing accuracy often build metrics that reward higher numbers of answers for questions, achieving a p-value below a pre-defined threshold, or diminishing the variance of questions. The framework of probability matching, specifically Thompson sampling [18] (TS), is one of the most powerful ways to efficiently choose from a set of dynamic "options" when choices must be made with limited information. Unlike greedy algorithms, one of the strengths of TS is that its stochastic nature prevents choosing locally optimal questions only.
To Thompson sample from a set of options, one assumes a random variable X which follows a distribution ϕ(x | θ i (t)), where θ i (t) is a set of parameters specific to i at time t. One draws an x i (t) for each option i and selects the option j with the smallest x (or largest x, depending on what x represents), j = arg min i x i (t). After option j is played (in our case, the worker's answer is received), the parameters for option j are updated. Often x is a Bernoulli random variable and it is natural for ϕ to be the conjugate Beta distribution with parameters α, β which are updated depending on whether x = 0 or x = 1.
For specific problems, TS depends on an appropriate reward function. In the context of crowdsourcing, one generally cannot verify the accuracy of crowd answers, so the best choice is to reward certainty or consensus. If the crowd is consistent in their responses for a given question, then that implies the question is being answered as well as possible under current conditions. Thus, in contrast to the Bernoulli Bandit problems typically studied with TS, we do not want to reward 'yes' answers over 'no' answers only. Instead, we want to reward choices that lower the crowdsourcer's measure of uncertainty for questions.
A natural measure of uncertainty for a categorical random variable is the Shannon entropy. However, efficiency is also important to a crowdsourcer. A yes/no question that has 200 responses which are evenly split is very different than a question with 2 responses which is also evenly split, despite having the same entropy. Generally, the crowdsourcer would prefer to assign a worker to the latter question, as there is greater hope of lowering its uncertainty.
This argument guides us to choosing a metric involving both the total number of answers to a question and how evenly distributed those answers were over the categories of that question. We introduce a metric called link bias (d) that is sensitive to the uncertainty of a question, but unlike entropy, also accounts for the total number of answers. To begin, the multinomial distribution, with C − 1 parameters, naturally models the distribution of a categorical question's total number of answers T across C possible answers, and the Dirichlet distribution, conjugate to the multinomial, can estimate the parameters of the multinomial. Since we expect no available prior information, a non-informative prior can be used. In the case of two categories, which we focus on, the Dirichlet distribution reduces to the Beta distribution (B(α, β)).
To define question uncertainty, we need a reference point. At a question's peak uncertainty, workers have answered evenly among the question's (C) categories causing an equal proportion of answers per category. In our binary case (C = 2), this corresponds to a proportion of 1/C = 1/2. The link bias d transforms the proportion of answers for question (i, j) to the distance from maximum uncertainty with (1) is the fraction of '1' or 'yes' or 'true' answers. When p ij ∼ B(α, β), the probability density of d becomes where for simplicity the dependence of α, β on (i, j) has been suppressed. Intuitively, a low link bias (d ≈ 0) occurs when the crowd is evenly split among possible answers, while a high link bias (at most d = 1/2) tells us the crowd converged on a single category. However, the link bias alone may not sufficiently steer the crowdsourcer to choose questions with a lower number of answers. If needed, we can combine a preference for sampling questions with few answers, with a preference for questions that are uncertain, by weighting (10) where N ij is the total number of answers to question (i, j) at the time of sampling. Thompson sampling of questions via ϕ or via ϕ N defines the two probability matching algorithms we propose. These algorithms handle growing networks of questions automatically and are fully applicable to problems without graphical relations between questions. We will conduct experiments on growing question networks testing the relative performance of both algorithms, and comparing them to other null or control baseline strategies, such as randomly choosing questions.

Experiments
We conducted two experiments to test the theoretical analysis and the sampling methods. For the first experiment, we simulated crowdsourcing of a growing question network with a commonly used benchmarking dataset by superimposing two distinct network structures onto a previously conducted crowdsourcing task [12], where questions have been time-ordered to mimic a growing question network, and used this to test three different question sampling algorithms. For the second experiment, we conducted real-world crowdsourcing using the Mechanical Turk crowdsourcing platform [2].

Experiment 1
To determine the effectiveness of choosing questions based on link bias, we first performed a five-armed experiment using the Recognizing Textual Entailment (RTE) dataset [12], a set of 8,000 binary answers (0 or 1) to 800 unique questions.
For simulating question growth, we superimposed graph structures onto the question set to link the 800 questions together. As mentioned in the introduction, many crowdsourcing problems naturally possess a network structure; here we imposed a structure on the RTE dataset only because it allows us to use the same benchmark dataset that many other researchers have studied. We built 5,000 Erdős-Rényi (ER) and Barabási-Albert (BA) networks [41]. These two options represent two extremes of network structure, and were chosen to test question sampling algorithms over different classes of networks. Briefly, an ER network [31] (specifically the G(n, m) formulation) starts with a set of N nodes and 0 links; a pre-specified number of links M are placed in the network choosing randomly without replacement from all possible N 2 pairs of nodes. In contrast, the BA network [33] starts with 2 nodes joined by a single link, nodes are added one at a time until all N nodes are placed, and each new node attaches to m 0 existing nodes in the network. New nodes attach to an existing node i with probability k i / n∈N k n , a mechanism that is often called preferential attachment.
For simulation purposes, each ER network realization must contain exactly 400 nodes, 800 links, and be connected. BA networks are connected by design; we still enforced the same number of nodes and links as the ER networks. Each simulated crowdsourcing was initialized with one question (a link in the network connecting two corresponding item) chosen at random from the underlying network. During the simulated crowdsourcing, workers answer a question with a 1 with probability equal to the proportion of 1's observed in the original RTE dataset for that question, otherwise the worker answers 0. Next, and with probability ρ , a new node (item) is introduced into the network by selecting randomly from the unseen neighbors of either i or j within that simulation's graph. (This differs slightly from the analytic null model because there is no γ . Instead, two links are formed automatically if the newly introduced item is linked to both i and j in the superimposed network.) If there are no new items to add corresponding to the selected question, this iteration is undone and the algorithm continues. All simulations were run with ρ = 0.20 for 6, 000 time steps.
Simulations were performed independently for each of five arms. The condition of each arm governs how questions are selected by the simulated crowdsourcer: Random: The first arm of the experiment had a condition where questions (links) were chosen randomly from the pool of already visited links.
Looping: The second arm used a looping question sampling algorithm. The first link that entered the system is answered by a worker, then the second link in the system is given to a worker, then the third link and so on. When the algorithm reaches the most recent link within the system it starts again from the oldest link.
Binomial sampling: This strategy selects questions (i, j) based on p-values for a two-sided binomial test that the proportion p ij (1) is significantly different from 1/2. If the p-value of this exact test is small, then it is likely the crowd has already reach consensus on that question and it is not worthwhile to sample that question further. The sampled question was chosen randomly from the set of questions which have a p-value > 0.2 and which have received fewer than 10 answers (at the time of sampling) Thompson sampling with ϕ: The fourth arm uses Thompson sampling to select links based on link bias (ϕ).
Thompson sampling with ϕ N : As in the fourth arm but links are Thompson sampled with ϕ N instead of ϕ.
This experiment can demonstrate the strengths and weaknesses of selecting links based on these different sampling strategies, and, because it is synthetic, many trials can be conducted while avoiding the costs associated with a new crowdsourcing experiment.
Results of Experiment 1 are presented in Sec. 4.

Experiment 2: Synonym Proposal Task
This three-armed experiment created new question networks grown from a single seed question (link), and evaluated the ϕ N -based Thompson sampling and Binomial sampling versus Random sampling. We paid US-based workers on Amazon's Mechanical Turk crowdsourcing platform [2,42] to participate in a synonym validation and proposal experiment. Synonymy proposal is a good test application for the question sampling algorithms we study because workers can easily understand the question and are capable of proposing new questions (by suggesting new synonyms). Of course, data on synonymy relations are available in lexical resources such as WordNet [43], which we used in this specific task for assessing the accuracy of proposed synonyms (see below), but our primary goal with this experiment is not crowdsourcing a new thesaurus but testing the different question sampling strategies. In Experiment 2, each worker completes synonymy tasks at a compensation of $0.08 USD per task. Each synonymy task gives a pair of words to a worker and asks whether or not they are synonyms. After a worker answer either 'yes' or 'no', we allow the worker to suggest additional synonyms for each word of the given pair, or a single synonym associated with the combined word pair. A screenshot of the web form used for this task is shown in Fig 2. Three independent crowdsource networks were built, one for each arm. All three networks began with the same seed question (the word pair patriotic, person). All other word pairs were proposed by the crowd. The question sampling algorithms draw from all previous worker answers and suggested questions within their respective arms to deliver a question to the next queued worker. The first arm (Random sampling) chooses links using the same methodology as the random arm from Experiment 1, which also closely matches the null model we studied (Sec. 2.2). The second arm (Binomial sampling) selects links according to the Binomial sampling algorithm introduced in Experiment 1. Lastly, the third arm (Thompson sampling) selects links according to Thompson sampling of ϕ N . Results for Experiment 2 are presented in Sec. 4.

Evaluation metrics
For the first experiment, we measure five attributes across the simulated crowdsourcings to compare the different question sampling algorithms. At each time step t, for each simulated network we record network properties f nodes , the fraction of items, and f edges , the fraction of questions: where V (t) is the set of items at time t, V (∞) is the set of all items at the end of the experiment, E(t) is the set of questions at time t, and E(∞) is the set of all questions at the end of the experiment. Next, we record the entropy S and link bias d, averaged over all currently visible questions, to quantify uncertainty in the network: (13) and where p ij (x) is the (Laplace-smoothed) fraction of binary answers of x for question (i, j) (at time t).
The final evaluation metric, mean answer density, measures how many answers are given per question in a particular network (see also Thm. 2): where the N ij (x) represents the count of answer x for question (i, j) (at time t).

Validating proposed synonyms
A factor that motivated us to choose the synonym proposal task as our crowdsourcing example is that synonym proposal can, in principle, be validated. Therefore, we will measure both crowd consensus (measured by S or d ) and, as best we can, if Experiment 2's crowdsourcing algorithms lead to different quality rates of synonyms-are we trading off quality for efficiency? However, measuring synonymy from natural language text is challenging. In principle, all that is needed is a complete thesaurus, meaning a complete lookup table of all words and all their synonyms, perhaps with weights denoting the degree of relatedness between a word and its synonym and accounting for all possible contexts in which those words may appear. However, without such an exhaustive resource, it can be challenging to determine synonyms, especially when workers may introduce typos, may propose different forms (runs, running, ran) of the same root lemma (run), or they may propose a multi-word phrase (MWP) which may have a synonymous meaning but where such a meaning is difficult to determine computationally.
Given the challenges of measuring synonymy, we applied two measures to the synonym word pairs (u, v) proposed by workers during the crowdsourcing experiments: Shared WordNet lemmas The first measure starts by determining for each word w the set of all forms of all its synonym lemmas as encoded in WordNet [43]: where synsets(w) is the set of all synonym forms stored in WordNet (we merge sets across parts-of-speech and take synsets(w) = ∅ if w is not present in WordNet). We then say that the two words u and v are synonyms if they share at least one lemma, i.e. that |L(u) ∩ L(v)| > 0, otherwise they are not synonyms. This is a relatively strict test, and fails to account for many MWPs and natural language concerns such as misspellings, so we expect many (u, v) pairs that workers deem synonyms to be missed by this measure and therefore the actual proportion of synonymous word pairs may be much higher.
Word vector similarity The second measure we employ is based on the meanings encoded by the "word2vec" word embedding algorithm [44]. Word2vec uses a neural network model to learn low-dimensional vector representations of words based on their contextual co-occurrence patterns over a very large text corpus. Supported by the distributional hypothesis [45], the contexts encoded in these vectors are then considered to capture to some extent the meanings and relationships of these words such as, for example, analogous relationships (Berlin is to Germany as Paris is to France). Given a pre-trained set of 300-dimensional vectors trained on a 100B word corpus taken from Google News, we define the similarity between two words (or MWPs, if the MWPs are present in the vector data) u and v as their cosine similarity: where w represents the associated word vector for word or MWP w. If either u or v is not present in the word2vec vector data, we exclude that pair from our analysis (this occurred in Experiment 2 for approximately 17.9% of crowd-proposed word pairs for the Random sampling experiment, 19.8% for Binomial sampling, and 10.9% for Thompson sampling).

Experiment 1
Fig 3 displays the five evaluation metrics associated with Experiment 1, averaged over the 5, 000 ER and BA networks. (For simulated Binomial sampling only, note that we required questions to have fewer than 30 answers at the time of sampling, not 10 as discussed previously, to provide more simulation statistics.) The Binomial sampling and ϕ N Weighted Thompson sampling algorithms outperformed all others in exploration metrics across ER and BA networks. Both methods explored more of the network, and faster, than other methods, as evidenced by f edges and f nodes . Weighted Thompson sampling performed best at minimizing the uncertainty of answers, as measured by lower entropy S and higher link bias d . In contrast, Binomial and Thompson sampling ϕ were inconsistent for these two metrics. Lastly, Binomial and ϕ N Weighted Thompson sampling also required fewer answers than other algorithms (lower A ). Binomial sampling slightly outperformed ϕ N Weighted Thompson sampling in many metrics. However, Binomial sampling has a distinct drawback: the thresholds used to sample questions may lead to a situation where no questions meet its sampling criteria. This is visible in the simulation curves, which are quite noisy due to individual simulations which terminated too early. Of course, this can be fixed by any of several means, such as falling back to random sampling when no questions meet the criteria, or tuning the cutoffs used in Binomial sampling. But Thompson sampling avoids these complexities entirely.
The overall performance of Binomial sampling and ϕ N -based Thompson sampling in simulated crowdsourcing nominates them as candidate algorithms for Experiment 2's real crowdsourcing.

Experiment 2
Fig 4 shows the constructed networks for each arm of the Synonym Proposal Task (the task is described in Fig 2). Qualitatively, all three networks appeared similar. Quantitatively, (Tab. 1) both Binomial and Thompson sampling were able to explore   more of the network (discovering more items and questions) than Random sampling with more efficiency (lower mean number of answers A ). The explored networks appeared similar by a number of network metrics, although the network generated via Binomial sampling has a lower average degree and higher average shortest path length. Lastly, Binomial and Thompson sampling were comparable to Random sampling in crowd consensus on individual answers, having similar levels of entropy and link bias. Both of these statistics measured how skewed the worker answers were in favor of 'yes, they are synonyms' or 'no, they are not synonyms'. Taken together, both Binomial and Thompson sampling maintained a comparable level of certainty (measured by consensus or consistency in worker responses) in the network with fewer answers needed on average than Random sampling.  To further understand the answer density of the different sampling methods, we computed the distribution of the number of answers N ij to question (i, j) in Fig 5. Here Random sampling clearly separated from the other two sampling strategies, and Random sampling ended with more questions with more answers than the other sampling strategies. We also note that all three arms finished with many questions with few answers: approximately 50% of questions at the end of the experiment had a single answer. We discuss this further in Sec. 5.
Next, we examined the synonym "quality" of the SPTs, using the synonymy measures introduced in Sec. 3.3. We limited these calculations to proposed word pairs examined by at least three crowd workers to ensure sufficient answers from the crowd. Measures of synonymy for Experiment 2's crowdsourced word pairs. Synonymy for proposed word pairs was estimated using (a) shared WordNet lemmas, (b) cosine similarity between word2vec word embedding vectors (see Sec. 3.3). Approximately 11-12% of crowdsourced word pairs share one or more WordNet lemmas (a strict measure of synonymy), and Binomial and Thompson sampling achieved slightly higher rates than Random sampling. As a control, the word pairs proposed by the crowd were randomized, and the proportion of word pairs with shared lemmas dropped significantly. Meanwhile, regardless of sampling algorithm, the word pairs proposed by the crowd also had significantly higher word vector similarity when at least one member of the crowd agreed that the pair were synonymous (n ij (1) > 0), as opposed to no members agreeing the pair were synonymous (n ij (1) = 0). This further underscores the estimated quality of the proposed questions and answers and that Binomial and Thompson sampling methods do not appear to trade off quality for efficiency. (To avoid ambiguous answers, we considered word pairs that received at least three answers from workers in these calculations, and panel (a) considers those word pairs with n ij (1) > 0.) was not lost when using a more efficient sampling strategy. Of course, 11-12% of word pairs sharing a lemma seems low, but recall that shared lemmas is a very strict measure that is likely to miss many synonymous word pairs and so we do not conclude that the majority of the crowd answers are "wrong." Furthermore, to better understand the shared lemma proportion, we constructed a randomized control by shuffling the word pairs (preserving the total frequencies of individual words) and re-measured the proportion of shared lemmas. We found a significant drop in the proportion to approximately 1% (the error bars on these proportions are shown in Fig 6(a) but are quite small).
Likewise, Fig 6(b) shows word vector similarities for the three sampling methods, decomposed into word pairs where at least one worker agreed they were synonyms versus no workers agreeing they were synonyms. The crowd-proposed word pairs flagged as synonyms had similarities significantly higher than those not flagged as synonyms.
There is a small drop in vector similarity for Binomial and Thompson sampling compared with Random sampling, likely balancing out the small increase in WordNet shared lemma proportion shown in Fig 6(a). We conclude that overall there is no loss in quality, at least as indicated by these measures, when using more efficient sampling algorithms.
Taken together, while we only have one crowdsourcing realization for each arm, it is reasonable to conclude from Experiment 2 that both Binomial sampling and Thompson sampling achieved much higher rates of exploration (more items) and greater efficiency (fewer answers per question) than Random sampling without losing confidence or accuracy in question responses.

Discussion
We studied the problem of efficient assignment of crowdsourcing tasks to workers when those workers are also able to propose tasks themselves. Using workers to contribute new tasks and not merely perform predetermined tasks helps unlock the true potential of crowdsourcing. We formulated a growing question network model for this problem, prove theoretical properties of this system, and developed and validated sampling algorithms that can guide workers to grow the network efficiently, while only sacrificing at most minimal confidence in their responses.
Modeling the evolution of the uncontrolled question network teaches us how to better design crowdsourcing policies. For example, by monitoring the innovation rate (ρ) and exploration rate (η) of the growing question network, a crowdsourcer may be able to better and more efficiently control the question network as it grows. At the same time, the rich-get-richer growth of items (older items are attached to a larger fraction of questions), implies that crowdsourcers should pay special attention to the newest items entering the network, to balance out the inherent bias in favor of older items.
Thompson sampling is fast, easy to implement, and flexible enough to capture the preferences of different crowdsourcers, but it is only one potential policy for question selection. More rigorous question selection techniques can be implemented which may outperform the proposed techniques, but with potentially more restrictions. The Thompson sampling algorithms proposed here work for both question nets but also non-network question sets, and can naturally accommodate both growing and static questions sets and nets. Further, statistical inference of question parameters and worker features [15], based on extensions of the null model analyzed in Sec. 2.3, can be used by the crowdsourcer to better pair workers with questions.
There remains considerable room for improvement. For example, in Fig 5, approximately 50% of questions in Experiment 2 received a single answer, regardless of arm. This means that even with the current algorithms the crowd is still supplying an inordinate amount of questions that are being left mostly unconsidered. Of course, some of this may be unavoidable; if there is too much Supply, then the crowd will invariably fall behind. This is further compounded by the inherent bias in favor of older questions. Thompson and Binomial sampling helped curtail this "first-mover-advantage" bias in the growing network but did not necessarily eliminate it. This is the fundamental challenge (and appeal) of this crowdsourcing problem, and more work focused on these issues is needed.
In the future, we will address more detailed schemes for question selection. Questions that contain more than a binary (true/false) response should be further investigated, although the only adaptation of the Thompson sampling algorithm is in the choice of metric to Thompson sample from. Different network structures may arise for different crowdsourcing problems, and assessing the accuracy of the network inferred by the crowdsourcing, and not necessarily the accuracy of individual links, will also be investigated. These and many other interesting and important questions remain in the new problem of crowdsourcing with growing question nets and sets.

Acknowledgments
We thank M. Wagy and J. Bongard for useful comments and gratefully acknowledge the resources provided by the Vermont Advanced Computing Core. This material is based upon work supported by the National Science Foundation under Grant No. IIS-1447634.