Conceived and designed the experiments: A. Kawrykow GR LS MB JW. Performed the experiments: A. Kawrykow GR A. Kam DK CL CW EZ. Analyzed the data: A. Kawrykow GR LS MB JW. Contributed reagents/materials/analysis tools: A. Kawrykow GR A. Kam DK CL CW EZ LS MB JW. Wrote the paper: MB JW.
Luis Sarmenta is an employee of Nokia and Jerome Waldispuhl received a donation from Nokia. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials.
Comparative genomics, or the study of the relationships of genome structure and function across different species, offers a powerful tool for studying evolution, annotating genomes, and understanding the causes of various genetic disorders. However, aligning multiple sequences of DNA, an essential intermediate step for most types of analyses, is a difficult computational task. In parallel, citizen science, an approach that takes advantage of the fact that the human brain is exquisitely tuned to solving specific types of problems, is becoming increasingly popular. There, instances of hard computational problems are dispatched to a crowd of non-expert human game players and solutions are sent back to a central server.
We introduce Phylo, a human-based computing framework applying “crowd sourcing” techniques to solve the Multiple Sequence Alignment (MSA) problem. The key idea of Phylo is to convert the MSA problem into a casual game that can be played by ordinary web users with minimal prior knowledge of the biological context. We applied this strategy to improve the alignment of the promoters of disease-related genes from up to 44 vertebrate species. Since the launch in November 2010, we have received more than 350,000 solutions submitted by more than 12,000 registered users. Our results show that the submitted solutions contributed to improving the accuracy of up to 70% of the alignment blocks considered.
We demonstrate that, combined with classical algorithms, crowd computing techniques can be successfully used to help improve the accuracy of MSA. More importantly, we show that an NP-hard computational problem can be embedded in a casual game that can be easily played by people without significant scientific training. This suggests that citizen science approaches can be used to exploit the billions of “human-brain peta-flops” of computation that are spent every day playing games. Phylo is available at:
The problem of optimally aligning a set of biological sequences (multiple sequence alignment (MSA)) is one of the most fundamental questions in computational biology, with the first problem formulations and accompanying algorithms dating back to the early 1970's
Most mathematical formulations of MSA aim at identifying a maximum-scoring alignment, given a set of sequences. Although the sum-of-pairs score (which is defined as the sum, over all pairs of species, of the scores of the pairwise alignments induced by the MSA) has been heavily used in early studies, more phylogenetically-aware scoring schemes are now preferred
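The sum-of-pairs definition given above can be illustrated with a minimal Python sketch. The match, mismatch and gap values are hypothetical, chosen only for illustration, and gaps are charged per column (a linear rather than affine gap cost), which is a simplifying assumption.

```python
from itertools import combinations

# Hypothetical scoring parameters, chosen only for illustration.
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a, b):
    """Score the pairwise alignment induced by two rows of an MSA."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # a column that is a gap in both rows is ignored
        if x == "-" or y == "-":
            total += GAP
        elif x == y:
            total += MATCH
        else:
            total += MISMATCH
    return total


def sum_of_pairs(alignment):
    """Sum the induced pairwise scores over all pairs of rows."""
    return sum(pair_score(a, b) for a, b in combinations(alignment, 2))


if __name__ == "__main__":
    print(sum_of_pairs(["AC-T", "ACGT", "A-GT"]))
```

Because every pair of rows contributes equally, this score ignores how the species are related, which is one reason phylogenetically-aware schemes are preferred.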
To produce accurate alignments using a classical computational framework, exact and computationally intensive algorithms are required. Unfortunately, their usage on genome-scale problems clearly exceeds the capacity of even the most powerful computer clusters. In recent years, outsourcing has become a common strategy to address these computational limitations. Connecting thousands of individual computers through the internet has made it possible to build giant virtual clusters with unmatched computing power, at a minimal cost. In 1999, the SETI@home project
In classical outsourcing methods such as SETI@home and Folding@home, the bottleneck is twofold. First, the objective function can be hard to formalize. For instance, this is the case when the goal is to identify objects inside images. Next, even when fully defined, the objective function may not allow an efficient computing schema and thus require an exhaustive enumeration of the solution landscape. When the number of candidate solutions grows exponentially with the size of the input, this leads to computationally prohibitive algorithms. It turns out that these features characterize many real world problems. Interestingly, the human brain developed capabilities to efficiently address some of these problems. In particular, humans excel at visual pattern recognition. In such cases, the assistance of humans appears to be a reasonable option. This observation motivated the development of methods for harnessing these human abilities and has been embedded in a concept called
Historically, the first attempt to apply citizen science principles was made by the Audubon Society's Christmas Bird Count, which started in 1900. However, the emergence of computers and of the internet greatly expanded the range of applications and the potential of this approach. Indeed, by developing human-computer interfaces that enable users to assist a computer program in solving a problem, and distributing this interface through the web, we can easily gather a large community of volunteers to help solve a given problem. In 2006, Stardust@home
In this paper, we introduce Phylo, a citizen science framework to solve MSA problems. More specifically, Phylo aims to compute high-quality alignments of a set of orthologous promoter regions from different vertebrate species. Unlike previous citizen science applications, Phylo intentionally hides much of the science behind it. A central idea of our contribution is to reduce the human computing part to a casual game, a puzzle, in order to broaden the spectrum of participants and collect the computing power generated by regular, non-scientist gamers. This approach expands to natural sciences the concepts of re-usability previously introduced by Luis von Ahn and co-workers in the ESP game
The paper is organized as follows. In the Results section, we describe Phylo and the set of alignments considered and show evidence of the effectiveness of our approach at improving alignments. In the Methods section, we detail the game mechanism and explain how the data are validated and re-inserted into the original MSA. Finally, we conclude by discussing the perspectives offered by citizen science approaches.
Phylo is a citizen computing framework for local improvement of multiple sequence alignment.
Phylo aims to convert an MSA problem into a puzzle game that can be easily understood and played by web users through a Flash or JavaScript interface (
We score the puzzles using a simple but realistic, easy-to-understand maximum parsimony algorithm that predicts ancestral sequences from the given alignment and sums the scores of the induced pairwise alignments, over all branches of the tree (see
Several mechanisms have been added to increase the entertainment value of the game while helping players achieve good solutions. First, the sequences are progressively added. The game starts with two sequences, and to be allowed to proceed to the next stage the player must find an alignment with a score that is at least as good as the score of the original alignment (i.e. the alignment that has been pre-calculated by Multiz). We call the score of this original alignment the “par”. Then, another sequence is added to the puzzle and the process is iterated until all sequences have been added. Note that contrary to the classical progressive alignment approach, players are allowed to revise any part of the alignment at any point. The second feature we added is a timer. Each stage must be completed within a certain time limit. In addition, we have also implemented a ranking system that records the number of puzzles solved by each registered user, and displays the list of the top 20 contributors. Together, these features aim to stimulate competition between players. Finally, we have implemented multiple mechanisms of puzzle selection. Players can choose a puzzle either by its difficulty level or by the type of disease the corresponding gene is associated with.
To evaluate the effectiveness of crowd computing at multiple sequence alignment, we selected a set of human promoters associated with genes with known implications in various diseases from the OMIM database
For each selected promoter (1 kb region upstream of the annotated transcription start site), we then extracted the corresponding sections of a 44-species multiple genome alignment
Phylo was officially released on November 29, 2010
The top figure shows the number of puzzles played by registered and anonymous players during the first seven months of Phylo. The bottom figure shows the number of registered players w.r.t. the number of puzzles they solved.
Phylo is not equally attractive to all players. Registered users completed an average of
In
(a) Average Phylo score of original alignments (red) and average best score obtained (yellow). (b) Success rate per level: Average number of times a puzzle has been played (red), and average number of times a player reaches the final stage of a puzzle (yellow).
Puzzle solutions with score better than the par were sent back to our database. Each solution was re-inserted into the original alignment block and sequences that had been left out from the puzzle were re-inserted into the alignment (see
For each alignment block, the Phylo-based alignment was built from each of the different puzzle solutions, irrespective of the Phylo score they obtained. Each completed alignment was scored and the highest-scoring alignment was retained. Original and improved alignment blocks are available in Supplementary Material. Overall, the best Phylo-based alignment outscored the original Multiz alignments for 70% of the alignment blocks. In fact, even the score of
Recall that the puzzle score shown to the user only measures the quality of the solution to the alignment puzzle itself, outside of its alignment block context and with only a subset of the species present in the full alignment block. An interesting question is whether this score correlates with the final score of the alignment after its completion and reinsertion into the full alignment block. This correlation is weak, with only 55% of the puzzles played at least 5 times showing a positive correlation between Phylo score and final alignment score. Note however that puzzle solutions are only returned to our server if they achieve a score at least as good as the “par”, which means that only “good” solutions are considered. This suggests that the Phylo puzzle solutions form a good pool of initial solutions based on which improved multiple alignments can be obtained, but that the Phylo scores themselves (or at least those beating the par) are not very indicative of the quality of the alignment when placed in its context and extended to the full set of sequences. In that case, puzzles played a large number of times would have a better chance of producing improved alignments. Indeed, this is the case: 77% of the puzzles with at least 5 different Phylo solutions yield an improvement over the original alignment, whereas this fraction drops to 53% for puzzles with at most two different solutions.
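The per-puzzle comparison described above can be sketched as follows. The text does not say which correlation coefficient was used, so Pearson's coefficient is an assumption here, as are the function names and the layout of the input (a dictionary mapping a puzzle identifier to a list of `(phylo_score, final_score)` pairs).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # The coefficient is undefined for constant samples;
        # this sketch treats that case as "no correlation".
        return 0.0
    return cov / sqrt(var_x * var_y)


def fraction_positive(puzzles, min_plays=5):
    """Fraction of puzzles with at least `min_plays` solutions whose
    puzzle score correlates positively with the final alignment score."""
    eligible = [sols for sols in puzzles.values() if len(sols) >= min_plays]
    if not eligible:
        return 0.0
    positive = sum(
        1
        for sols in eligible
        if pearson([s for s, _ in sols], [f for _, f in sols]) > 0
    )
    return positive / len(eligible)
```

A rank-based coefficient such as Spearman's would be an equally plausible choice, since only the sign of the relationship is used.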
Unsurprisingly, the number of species in the puzzle has an impact on the quality of the completed alignments that can be obtained from them. Small puzzles (3 or 4 sequences) result in improved alignments less than 63% of the time, while this percentage goes up to 73% for larger ones (size 7 and above). This is despite the fact that small puzzles are played significantly more often than larger ones (2-fold difference in number of different solutions).
In this paper, we showed that a citizen science approach can be applied to improve the accuracy of multiple sequence alignments. More importantly, we demonstrated that we can turn this problem into an intuitive and entertaining computer game suited for casual gamers without any scientific background. Contrary to existing alignment editing tools such as JalView
In this work, we applied our methodology on a 44-way Multiz MSA from the UCSC genome browser
The clarity and the simplicity of the design that characterize Phylo are an important asset to ensure the popularity of our game. In particular, we abstract the nucleotides to coloured blocks and develop an intuitive yet realistic scoring scheme that is well supported visually by various aspects of the game interface. This allows players to solve MSA puzzles with up to
An interesting related question is how best to harness crowd computing for improving alignments: one wants the player community to work on as many regions of the alignment as possible, but also to do as good a job as possible at improving each of them. As discussed previously, the more often a puzzle is played, the better the chances of producing good alignments. However, the value of additional solutions diminishes as the number of available solutions increases. While our current dispatching system assigns puzzles to players in a random manner (subject to a user's preferences about problem size and disease associations), a better approach would be an adaptive one in which we monitor, for each puzzle, the number of different solutions obtained to date and the number of people who played it. Puzzles whose solution space seems to saturate (i.e. the same solutions are found over and over again by the players) should be considered solved and stop being fed to players. Similarly, puzzles that are rarely completed by the players may have properties that make them boring or too challenging and should stop being sent to users. Adaptive puzzle dispatching may even go further and detect a specific player's preferences or skills via the set of solutions produced to date and select new puzzles on that basis.
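One possible form of such adaptive dispatching is sketched below in Python. This is not the current Phylo dispatcher, which assigns puzzles at random; the class, the function and all thresholds are hypothetical and would need to be tuned on real play data.

```python
from dataclasses import dataclass, field


@dataclass
class PuzzleStats:
    plays: int = 0               # how many times the puzzle was started
    completions: int = 0         # how many plays returned a solution
    solutions: set = field(default_factory=set)  # distinct solutions seen
    repeats_since_new: int = 0   # completed plays since the last new solution

    def record(self, solution=None):
        """Record one play; `solution` is None if the player did not finish."""
        self.plays += 1
        if solution is None:
            return
        self.completions += 1
        if solution in self.solutions:
            self.repeats_since_new += 1
        else:
            self.solutions.add(solution)
            self.repeats_since_new = 0


def should_retire(stats, saturation=10, min_plays=20, min_completion=0.1):
    """Decide whether a puzzle should stop being sent to players."""
    if stats.repeats_since_new >= saturation:
        return True   # solution space appears saturated
    if stats.plays >= min_plays and stats.completions / stats.plays < min_completion:
        return True   # rarely completed: likely boring or too hard
    return False
```

Counting repeats since the last new solution, rather than the total number of plays, keeps a puzzle in circulation for as long as players continue to find alignments that have not been seen before.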
Finally, we conclude this paper by discussing the validity and the scientific impact of citizen science frameworks. Above everything, the question of the computational tractability of the problem addressed is fundamental. Indeed, to be scientifically justified, this strategy must demonstrate that human expertise is necessary and that computer programs cannot perform better. We believe that any citizen science approach applied to well-defined scientific problems must satisfy these three criteria: (i) Computational difficulty of the problem, (ii) range of application of exact methods, and (iii) comparison with heuristic methods. Here, we stress that the MSA problem using a maximum parsimony score has been shown
However, computational considerations are not the only ones of interest. Another fundamental aspect of this game is its role in educating people about the challenges encountered in computational biology and discrete optimization in general, as well as hinting at some important concepts of evolutionary biology and genetics. Although Phylo intentionally abstracts the scientific context of MSA's to an intuitive casual game, it also offers a portal for anyone looking for information on the subject. More precisely, two game menu sections titled “About” and “FAQ” describe the biological motivations, the scoring algorithm and how the puzzle solutions are used. Moreover, over the last year, our interface has already been used in several classes and public demonstrations around the world to illustrate genomic research.
Billions of “human-brain peta-flops” of computation are wasted daily playing games that do not contribute to advancing knowledge. While only a small fraction of important computational problems are amenable to crowd computing, and translating those that are into fun, intuitive games can be challenging, the reward of a well-designed framework for human computation, combined with a wide user-base, is access to a huge basin of computing power.
The interface of Phylo displays a simplified and entertaining representation of an MSA instance with its associated phylogenetic tree. Each nucleotide is represented by a block whose colour corresponds to its type (i.e. Adenine, Cytosine, Guanine and Thymine). The scoring scheme for a given puzzle alignment must be evolutionarily realistic while being intuitive and fast to compute (as it is recomputed in real time every time the player modifies the alignment). To evaluate a given alignment, the game starts by inferring ancestral nucleotides or gaps at each ancestral node of the phylogenetic tree using a maximum parsimony approach (Fitch algorithm
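The bottom-up pass of the Fitch algorithm can be illustrated with a minimal Python sketch. It treats the gap character as a fifth state, consistent with inferring "nucleotides or gaps" at ancestral nodes, and it only counts the minimum number of changes per column; the match, mismatch and gap values that Phylo applies along each branch are not reproduced here. The nested-tuple tree encoding and the function names are assumptions made for this example.

```python
def fitch(tree, column):
    """Bottom-up Fitch pass for one alignment column.

    `tree` is a rooted binary tree written as nested tuples of species
    names; `column` maps each species to 'A', 'C', 'G', 'T' or '-'.
    Returns (candidate states at the root, minimum number of changes).
    """
    if isinstance(tree, str):           # leaf: the observed character
        return {column[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch(left, column)
    right_states, right_cost = fitch(right, column)
    shared = left_states & right_states
    if shared:                          # children agree: no extra change
        return shared, left_cost + right_cost
    # children disagree: keep both options and count one change
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_cost(tree, alignment):
    """Sum the per-column Fitch costs over all columns of an alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {sp: seq[i] for sp, seq in alignment.items()})[1]
        for i in range(length)
    )
```

A top-down pass that picks one state per ancestral node from the candidate sets would then give the ancestral sequences from which branch-wise pairwise scores can be computed.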
Our puzzle database is based on a multiple sequence alignment of
Interesting blocks are typically longer and contain more species than our game can accommodate. The number of species in the alignment is first reduced to at most 8, by keeping the first 8 species according to our species ranking list, which aims at selecting a set of species as phylogenetically diverse as possible (i.e. whenever possible, select distantly related species). We then scan each reduced “interesting” block with a sliding window of size
Recall that a Phylo puzzle consists of a slice of 24 columns taken from the original UCSC 44-way multiple alignment, and then reduced to a set of at most 8 species. To be useful, a solution to the puzzle must be reinserted into the original alignment block and completed by adding any left-out species to the alignment. This is performed as follows. Consider an alignment block
For the purposes of comparing the accuracy of the various alignment strategies proposed, maximum likelihood ancestral sequences for a given alignment block are first inferred using the Ancestors 1.0 program
The original client interface was implemented in Adobe Flash ActionScript 3.0. More recently, we released a JavaScript interface to improve the portability of our system and enable users to play Phylo on tablets and other mobile devices. The server is implemented in Java. The client connects to the server via XMLSocket and the server listens through SocketServer. The communication between the server and the MySQL database is supported by JDBC drivers. Finally, the passwords of registered users are hashed with SHA-512 and salted to protect user privacy.
Phylo is open, free of charge, to academic users who are willing to use it to improve their MSA data. Interested users are invited to send us data in the MAF format at phylo-submit@cs.mcgill.ca. Sequences should be pre-aligned using a computer program, preferably together with a confidence score. Once submitted, the data will be scanned in order to create new puzzles. Once every puzzle has been played a predefined number of times (by default
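As an illustration of salted SHA-512 password storage, here is a minimal Python sketch. The Phylo server itself is written in Java and its code is not shown in this paper, so the function names, the 16-byte salt and the choice to prepend the salt to the password are all assumptions made for this example.

```python
import hashlib
import hmac
import secrets


def hash_password(password, salt=None):
    """Return (salt, digest) for a password using salted SHA-512."""
    if salt is None:
        salt = secrets.token_bytes(16)   # fresh random salt per user (assumed size)
    digest = hashlib.sha512(salt + password.encode("utf-8")).digest()
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


if __name__ == "__main__":
    salt, digest = hash_password("correct horse")
    print(verify_password("correct horse", salt, digest))
```

A per-user salt prevents identical passwords from producing identical stored values. A single SHA-512 pass is fast to compute, so a deliberately slow key-derivation function such as PBKDF2 (available in Python as `hashlib.pbkdf2_hmac`) is generally preferred for new systems.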
We thank the more than 12,000 players who have contributed to this project by playing Phylo, and the dozens of users who have made important suggestions to improve the game.