Advancing computational biology and bioinformatics research through open innovation competitions

Open data science and algorithm development competitions offer a unique avenue for rapid discovery of better computational strategies. We highlight three examples in computational biology and bioinformatics research in which the use of competitions has yielded significant performance gains over established algorithms. These include algorithms for antibody clustering, imputing gene expression data, and querying the Connectivity Map (CMap). Performance gains are evaluated quantitatively using realistic, albeit sanitized, data sets. The solutions produced through these competitions are then examined with respect to their utility and the prospects for implementation in the field. We present the decision process and competition design considerations that led to these successful outcomes as a model for researchers who want to use competitions and non-domain crowds as collaborators to further their research.


Introduction
Crowdsourcing enables large communities of individuals to collectively address a common challenge or problem. Mechanisms for crowdsourcing include tools to encourage voluntary work in data collection and curation, gamification of scientific problems, and the use of open innovation competitions. Researchers in computational biology increasingly rely on the first two mechanisms to tackle simple yet labor-intensive tasks, such as annotating images [1] or predicting aspects of protein structure [2]. Additionally, researchers use open innovation competitions to benchmark their solutions to a particular computational problem, such as the algorithm MegaBLAST for comparing DNA sequences [3], or to generalize their methodologies to instances of the problem for which a solution is unknown, such as inferring molecular networks [4].
Past examples demonstrate that open innovation competitions possess considerable potential in addressing biology problems [5]. However, most applications are intended for a "crowd" of researchers in communities that are most directly connected to the scientific problem at hand, such as the community of computational biologists to predict drug sensitivity [6]. Less is known about the potential of innovation competitions in biology when they are open to a crowd of non-experts, although there have been promising examples in other domains [7,8]. This suggests a need for methodological developments on the use of competitions when the community of experts is small or nonexistent, or when it lacks the necessary knowledge to solve the problem (e.g., rapidly evolving or emerging fields that depend heavily on large amounts of data and computational resources but are deficient in experts in scientific computing, data science or machine learning).
In this study, we focus on contests that involve participants outside the community of researchers connected to the scientific problem, as in [3]. Rather than the prospect of advancing research in the field or achieving a scientific publication, researchers have to incentivize participation with opportunities to win cash prizes. Furthermore, they must articulate the problem in clear and easily digestible language and construct metrics to evaluate solutions that provide live feedback to participants. Although competitors in these contests typically lack domain-specific expertise, they bring a breadth of knowledge across multiple disciplines, which creates the opportunity for exploration and cross-fertilization of ideas.
Open innovation competitions of this nature allow members of the public to contribute meaningfully to academic fields by providing incentives and easy access to problems and data that are otherwise inaccessible to them. solutions, since continuous feedback of performance in the form of a real-time leaderboard raises the prospect of model overfitting on the validation set.

Motivation and Objective
In the process of immunotherapy and vaccine development, next-generation sequencing of antibody repertoires allows researchers to profile the large pools of antibodies that comprise the immune memory formed following pathogen exposure or immunization. During the formation of immune memory, antibodies mature in an iterative process of clonal expansion, accumulation of somatic mutations, and selection of mutations that improve binding affinity. Thus, the memory response to a given pathogen or vaccine consists of "lineages" of antibodies which can trace each of their origins to a single precursor antibody sequence. Detailed study of such lineages can provide insight into the development of protective antibodies and the efficacy of vaccination strategies. After the initial alignment and assembly of reads, the antibody sequences can be clustered based on their similarity to the known gene segments encoding heavy or light antibody chains. Clustering these antibody sequences allows researchers to understand the lineage structure of all antibodies produced in individual B cells [9,10,11].
The number of antibody sequences from a single sample can easily reach into the millions, posing a major computational challenge for clustering at such a large scale. The bottleneck lies in computing a pairwise distance matrix and subsequently performing hierarchical clustering of the sequences. The former task scales as $O(N^2)$ in both computational and space complexity, where $N$ is the total number of input sequences. The latter task, assuming a typical agglomerative hierarchical clustering algorithm, has a computational complexity that scales as $O(N^2 \log N)$. By comparison, the file I/O is expected to scale as $O(N)$.
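To make the quadratic cost concrete, the following minimal Python sketch builds a full pairwise distance matrix. It assumes Levenshtein (edit) distance as the sequence metric; the function names `levenshtein` and `pairwise_distances` and the toy sequences are illustrative and are not taken from the benchmark code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (classic two-row dynamic program)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pairwise_distances(seqs):
    """Symmetric N x N matrix: N*(N-1)/2 distance evaluations, N^2 storage."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = levenshtein(seqs[i], seqs[j])
    return dist


toy = ["CARDYW", "CARDFW", "CTRGYW", "CARDYYW"]  # hypothetical sequences
matrix = pairwise_distances(toy)
```

Both the number of distance evaluations and the size of `matrix` grow with the square of the number of sequences, which is the bottleneck described above.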

Benchmarks and Methodologies Prior to the Competition
Prior to the competition, a Python implementation of the clustering algorithm was developed, utilizing numpy, fastcluster, and scipy.cluster.hierarchy.fcluster modules (Online Methods, Sec. 6.1). The clustering was performed using the average linkage criterion with a maximum threshold distance imposed.
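As an illustration of average-linkage clustering with a distance threshold, here is a deliberately naive pure-Python sketch. It is not the benchmark implementation: the function name `average_linkage_clusters` and the toy distance matrix are assumptions made for this example. In SciPy, a comparable result would typically be obtained with `scipy.cluster.hierarchy.linkage(..., method="average")` followed by `fcluster(..., criterion="distance")`.

```python
def average_linkage_clusters(dist, threshold):
    """Naive agglomerative clustering with average linkage.

    Repeatedly merges the two clusters with the smallest mean pairwise
    distance and stops once that distance exceeds `threshold`.
    `dist` is a symmetric matrix given as a list of lists.
    """
    clusters = [[i] for i in range(len(dist))]

    def avg(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = avg(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # no remaining pair is close enough to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy distance matrix: items 0-2 are close, item 3 is far from everything.
dist = [
    [0, 1, 2, 9],
    [1, 0, 1, 9],
    [2, 1, 0, 8],
    [9, 9, 8, 0],
]
clusters = average_linkage_clusters(dist, threshold=3)
```

With the threshold set to 3, items 0, 1 and 2 end up in one cluster and item 3 remains a singleton.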
The implementation used Python's built-in multiprocessing module to parallelize the computations across multiple cores. For a dataset containing 100K input antibody sequences, it required approximately 54.4 core-hours and 80GB of memory (computations were performed on a 32-core machine with 250GB RAM). Empirically, the primary bottlenecks for this implementation and dataset were the computation time and storage of the distance matrix (note that the embarrassingly parallel nature of this task implies a trivial cost conversion between single- and multi-core computations). Full hierarchical clustering scales comparably in terms of computational complexity; with the threshold imposed, however, its cost was approximately a factor of 50 lower than that of constructing the distance matrix. Although I/O was included in the timing estimates, its contribution in this case was negligible.
Extrapolating the computation to a typical antibody profiling sample containing one million input sequences, the required computational time and storage is expected to reach around 5440 core-hours and 8TB, respectively. Given its poor scalability and efficiency limitations, this implementation is inadequate for large-scale profiling, which for a small clinical vaccine evaluation may consist of dozens of subjects with several longitudinal samples per subject. The goal of the challenge was to optimize and improve the algorithm and its implementation, so that routine data analysis of a large-scale antibody profiling study would become feasible given modest computational resources.
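The extrapolation follows from the quadratic scaling of the distance matrix: a tenfold increase in input size multiplies both cost and storage by one hundred. A short sketch of the arithmetic, with the helper name `extrapolate_quadratic` introduced here for illustration only:

```python
def extrapolate_quadratic(value_at_ref, n_ref, n_target):
    """Scale a cost measured at n_ref inputs to n_target inputs, assuming O(N^2)."""
    return value_at_ref * (n_target / n_ref) ** 2


# Measured at 100K sequences: 54.4 core-hours and 80 GB of memory.
core_hours_1m = extrapolate_quadratic(54.4, 100_000, 1_000_000)  # about 5440 core-hours
memory_gb_1m = extrapolate_quadratic(80, 100_000, 1_000_000)     # about 8000 GB, i.e. 8 TB
```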

Problem Abstraction and Available Data Sets
10K input sequences, whereas the testing datasets comprised four 10K and six 100K input sequences.
For the given threshold distance, the maximum cluster size for each dataset ranged from 1.6% to 4% of the total number of antibody sequences in the set. The most probable cluster size for each set was unity, however.

Competition Outcome
The competition lasted for 10 days, involved 34 participants, and averaged 6.91 submissions per participant. All contest submissions were evaluated on a 32-core server with 64GB memory, although the top four solutions used only a single core. For the given dataset, a majority of participants submitted solutions that were significantly more computationally efficient than the A2 benchmark, with the winning solutions being orders of magnitude more efficient. sequences in approximately 1.8s on a single core, implying a total computational cost reduction (speedup) of 108,800 when compared to the A1 benchmark. Interestingly, the primary bottleneck for this implementation was no longer clustering but rather file I/O. Neglecting file I/O, which accounted for 85% of the computation time, the effective speedup achieved for clustering over the A1 benchmark was approximately 777,000 for 100K sequences. For N up to 1M input sequences, the four solutions required less than approximately 17GB memory, with the winning solution requiring only 0.7GB memory.
The winning solution achieved significant improvements in computational efficiency by abandoning generic implementations of hierarchical clustering, which require the full (or half) distance matrix as input, in favor of a problem-specific strategy. The solution exploited several important properties of the data set to dramatically reduce the number of Levenshtein distances computed, noting that:
1. Some families of antibodies are never clustered together, implying a partitioning of the dataset.
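The effect of partitioning can be illustrated with a small sketch that counts how many pairwise comparisons remain when distances are computed only within a family. This is not the winning code: the family labels, the toy records, and the helper names `partition_by_family` and `count_pairs` are assumptions made for the example.

```python
from collections import defaultdict


def partition_by_family(records):
    """Group (family, sequence) records so that only same-family pairs are compared."""
    groups = defaultdict(list)
    for family, seq in records:
        groups[family].append(seq)
    return dict(groups)


def count_pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


# Hypothetical records: (family label, sequence)
records = [("F1", "CARDYW"), ("F1", "CARDFW"), ("F2", "CTRGYW"),
           ("F2", "CTRGFW"), ("F3", "CAKDYW"), ("F1", "CARDYYW")]
groups = partition_by_family(records)
pairs_full = count_pairs(len(records))                                 # all-vs-all
pairs_partitioned = sum(count_pairs(len(g)) for g in groups.values())  # within families
```

In this toy case the number of distance evaluations drops from 15 to 4; the saving grows as the families become more numerous and more evenly sized.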
Contest participants were tasked with exploring alternative approaches, using an eight-times-larger dataset for their inferential models (Online Methods, Sec. 6.2), with the goal of predicting 11,350 non-landmark genes from 970 L1000 landmarks. Their models were then evaluated using RNA-Seq data profiled on the same samples as ground truth.

Competition Outcome
Eighty-eight competitors submitted their predictions for evaluation; fifty-five (62%) achieved an average performance higher than the MLR benchmark, with improvements as high as 50%, and with little difference between provisional and system evaluations (Online Methods, Fig. 5), indicating no overfitting.
For the top five submissions, performance improved in both absolute and relative accuracy (Fig. 2A-D). The median gene-level rank correlation between the predicted values and the ground truth was 70% higher than the benchmark. The gene-level relative accuracy (i.e., the rank of self-correlation compared to correlations with other genes) was distributed with a lower dispersion (an interquartile range 40-90% lower depending on the submission), resulting in a higher recall. Inspection of the methods used by the top five submissions reveals that four of the five converged to a K-Nearest Neighbors (KNN) regression, and only one used a Neural Network approach. KNN regression is a non-parametric regression approach that makes predictions for each case by aggregating the responses of its K most similar training examples. Compared to MLR, KNN regression makes weaker assumptions about the shape of the "true" regression function, thereby allowing a more flexible representation of the relationship between landmark and non-landmark genes.
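A minimal sketch of plain KNN regression may help fix ideas. It assumes Euclidean distance between landmark profiles and an unweighted mean over neighbors; the function name `knn_predict` and the toy data are illustrative, and the sketch does not reproduce any contestant's method.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict a target vector as the mean target of the k nearest training samples.

    train_x: list of landmark profiles (one list of floats per sample)
    train_y: list of non-landmark profiles for the same samples
    query:   landmark profile of the sample to impute
    """
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    n_targets = len(train_y[0])
    return [sum(train_y[i][t] for i in nearest) / len(nearest) for t in range(n_targets)]


# Toy data: two landmark genes, one non-landmark gene.
train_x = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]]
train_y = [[1.0], [1.2], [0.8], [10.0], [11.0]]
prediction = knn_predict(train_x, train_y, [0.05, 0.05], k=3)
```

No functional form links landmarks to non-landmarks here: the prediction is simply the average response of the three closest training samples.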
Compared to a standard KNN-based imputation model, such as [15], the winning submission presents a few innovations. First, it combines multiple predictions obtained by iteratively applying the same regression function to different data normalizations, such as scale, quantile, rank, and ComBat batch normalization [16]. Second, within each iteration, it identifies and combines multiple sets of K nearest neighbors using different similarity measures. These modifications present some potential advantages over a standard KNN method. One potential advantage is to alleviate batch effects, thereby controlling for a source of non-biological variation in the data. Another advantage is the identification of the training samples that are most biologically relevant for predicting a given test sample (e.g., same tissue). The net outcome is that the winning approach achieves >5% improvement in absolute accuracy and >2.6% improvement in recall over the other top KNN approaches.
We further examined key differences between the winning method and the MLR benchmark by comparing performance on large (100,000) and small (12,000) sample size training datasets. We found that the winning KNN approach outperforms the MLR, although the performance is sensitive to the sample size. When trained on the smaller dataset, the overall performance of the winning KNN approach was higher than the benchmark (10% higher absolute accuracy), but the recall was 5% lower.
Clustering analysis of the gene-level scores (Fig. 2-E) suggested potential complementarities among the top submissions. We tested this hypothesis by combining the predictions of the top five submissions into an ensemble approach. We used the training dataset to automatically select the best predicting method for each particular gene (the one with the highest combined score). By doing so, we found a strong complementarity between the winning KNN and the Neural Network approach, which were equally selected for over two-thirds of the genes (Fig. 2-F). We then evaluated the resulting predictions on the test set, showing a 2% improved performance over the top performing submission (Fig. 2).
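The per-gene selection step can be sketched as follows. The method names, scores, and prediction values are hypothetical, and the helpers `select_best_method` and `ensemble_predict` are introduced only for this example; ties are resolved in favor of the method listed first, which is an arbitrary choice.

```python
def select_best_method(train_scores):
    """For each gene, return the name of the method with the highest training score.

    train_scores maps method name -> list of per-gene scores (same gene order).
    """
    methods = list(train_scores)
    n_genes = len(train_scores[methods[0]])
    return [max(methods, key=lambda m: train_scores[m][g]) for g in range(n_genes)]


def ensemble_predict(choice, test_predictions):
    """Assemble per-gene predictions by taking each gene from its selected method.

    test_predictions maps method name -> list of per-gene prediction vectors.
    """
    return [test_predictions[m][g] for g, m in enumerate(choice)]


# Hypothetical scores for three genes and two methods.
train_scores = {"knn": [0.9, 0.4, 0.7], "nn": [0.8, 0.6, 0.7]}
test_predictions = {"knn": [[1.0], [2.0], [3.0]], "nn": [[1.5], [2.5], [3.5]]}

choice = select_best_method(train_scores)
combined = ensemble_predict(choice, test_predictions)
```

Because the selection uses training scores only, the test set stays untouched until the combined predictions are evaluated.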
We then examined to what extent a better inference model affects the ability to recover expected connections using CMap data [see 13, for a formal definition of connections]. To test this hypothesis, the winning approach was used to impute non-landmark genes on a dataset of L1000 profiles from about 46,000 samples of multiple perturbagens (Online Methods, Sec. 6.2). We then processed these KNN-imputed data through the standard CMap pre-processing pipeline [13] and queried the resulting signatures, and their MLR equivalents, with a collection of annotated pathway gene sets. Based on the literature, each gene set was expected to connect to at least one of the perturbagens. We compared the distribution of the connectivity scores and corresponding ranks generated using the predictions made by the KNN approach and those made by the benchmark MLR approach. Results show no significant difference in the distributions of these connectivity measures (according to a Kolmogorov-Smirnov test).
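For readers unfamiliar with the two-sample Kolmogorov-Smirnov comparison, the statistic is the largest vertical gap between the two empirical cumulative distribution functions. The sketch below computes only the statistic, not a p-value, and the function name `ks_statistic` and the toy samples are assumptions; in practice a library routine such as `scipy.stats.ks_2samp` would normally be used.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect.bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


identical = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
```

A statistic near zero, as for `identical`, is consistent with the two sets of connectivity scores sharing one distribution.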
In conclusion, the top submission achieved substantial improvements in performance over the MLR benchmark and thus succeeded in improving the comparability of CMap inferred data with external data. To facilitate the use of the winning submission for this purpose, the winning code was deployed in the R package "cmapR" and is currently available at github.com/cmap. At the same time, we found no evidence that the better inference would translate into a more accurate downstream connectivity analysis (higher ability to recover the expected connections), contrary to the initial hypothesis. However, limitations in the contest configuration (scores were not directly based on connectivity) preclude a conclusive statement at the moment, and further investigation is ongoing.

Motivation and Objective
Researchers often use the Connectivity Map to compare specific patterns of up- and down-regulated genes for similarity to expression signatures of multiple perturbagens (e.g., compounds and genetic reagents) in order to develop functional hypotheses about cell states [13]. These hypotheses can be used to inform areas of research ranging from elucidating gene function to identifying the protein target of a small molecule to nominating candidate therapies for disease.
Given the high potential of this approach, many algorithms that assess transcriptional signatures for similarity have been developed over the years. These algorithms are generally computationally expensive, which may limit their use to relatively small data sets. The principal problem is that computationally efficient methods, such as cosine similarity, may be inadequate for interpreting gene expression data in general. Moreover, more powerful methods, such as Gene Set Enrichment Analysis [17], need to perform computationally expensive tasks, such as ranking genes in each signature by their expression levels to compare gene rank positions individually across signatures. Given that the Connectivity Map has recently expanded to over one million profiles [13], this limitation is particularly problematic.
To address this problem, the CMap group developed a fast query-processing algorithm called SigQuery. This tool was implemented in MATLAB, incorporating a range of optimization techniques to speed up queries on the Connectivity Map, and was available on the online portal CLUE.io. Overall, the algorithm achieves a good level of performance (in a preliminary analysis, it took about 120 minutes to process 1000 queries with gene sets of size 100 against a signature matrix of 470,000 signatures). Even so, execution time and memory requirements are still a potential barrier to adoption for the Connectivity Map.
To further the development of query algorithms for CMap data, the "CMap Query Speedup Challenge" solicited code submissions for fast implementations of the present CMap query methodology.

Problem Abstraction and Available Data Sets
Following the query methodology described in Sec. 6, one bottleneck lies at rank-ordering all signature values and subsequently walking down the entire list of G genes in a signature to compute the running sum statistics over all possible pairs of S signatures and queries. Using the quick-sort algorithm, this task has a computational complexity of O(S × G log(G)) in the average case. To save time, however, results can be stored on disk, as in the current implementation, with adjustments to try to minimize the cost of access to disk memory. The other very burdensome task involves computing the running sums for all genes in each signature, which has a computational complexity upper bound that scales as O(S × G) per query.
For this contest, participants had to address these problems and the performance of their methods was evaluated on 1000 queries to be run on the whole CMap signature matrix, which has expression values for > 470k signatures and > 10k genes (Online Methods, Sec. 6.3).

Competition Outcome
The competition resulted in 33 participants making 168 code submissions. All final submissions were evaluated on the holdout query dataset on a server with 16 cores. Results showed significant speed improvements over the benchmark: the median speedup was 23×, with at least two submissions achieving speedups beyond 60×.
Comparison of performance between 16-core and single-core evaluations showed that multithreading alone accounted for a large fraction of the gains in performance over the benchmark (Fig. 4-A): the median speedup difference between single and multi core was 18×, accounting for 70% of the final median speedup over the benchmark.
Beyond multithreading, analysis of the scaling properties of the winning submission showed a computational time complexity that scales as the square root of the number of queries (Figure 4-B), which represents a major improvement over the benchmark's linear scaling.
The winning submission also showed substantial performance improvements in the reading time, i.e., the time to load into memory the absolute values and rank-ordered positions of genes in the CMap signature matrix: the overall reading time of the winning submission was just below 10 seconds, which represents a 50× speedup compared to the 500 seconds used by the benchmark (Online Methods, Fig. 6).
Close examination of the code of the top submission revealed multiple optimizations that can account for all these improvements. These optimizations often overlap with those of the other submissions, making any clear-cut categorization difficult. Consequently, we report instead the areas of optimization that we believe are responsible for most of the performance improvement over the benchmark (beyond multithreading and the recourse to low-level programming languages, such as C++, instead of MATLAB). Essentially, these are:
1. Efficient data storage techniques in order to maximize the available cache memory for each thread.
2. Streaming SIMD Extensions (SSE) technology to execute multiple identical operations simultaneously for each thread.
One of the ways in which the winning submission achieved its exceptional performance was by loading the entire signature matrix in the cache memory. Cache memory is the fastest memory in the computer, albeit of very limited capacity. To minimize memory usage, the winning contestant stored the CMap signature matrix at a lower precision than the benchmark (32-bit single-precision floating point for the scores and 16-bit integers for the ranks), with essentially no loss in accuracy. Precision reduction alone, however, was insufficient to fit level 1 cache memory (the fastest cache memory available) due to the large number of queries and gene sets to be processed per signature. The contestant therefore developed a system of matrices to efficiently store the indexes and partial sums for each gene in a query. The resulting algorithm made much more efficient use of memory than the benchmark.
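A back-of-the-envelope calculation shows why precision reduction helps but does not suffice on its own. The sketch below makes several assumptions that are not stated in the text: 10,000 genes per signature (the lower bound of "> 10k"), a 64-bit baseline for both scores and ranks, and a hypothetical 32 KB level 1 data cache.

```python
from array import array

N_GENES = 10_000            # assumption: lower bound of "> 10k genes" quoted in the text
L1_CACHE_BYTES = 32 * 1024  # assumption: a hypothetical 32 KB L1 data cache


def bytes_per_signature(score_bytes, rank_bytes, n_genes=N_GENES):
    """Bytes needed to hold the scores and ranks of one signature."""
    return n_genes * (score_bytes + rank_bytes)


float32_bytes = array("f").itemsize  # 4 bytes on common platforms
int16_bytes = array("h").itemsize    # 2 bytes on common platforms
int16_max = 2 ** 15 - 1              # largest rank a signed 16-bit integer can hold

reduced = bytes_per_signature(float32_bytes, int16_bytes)
baseline = bytes_per_signature(8, 8)  # assumption: 64-bit scores and ranks
fits_in_l1 = reduced <= L1_CACHE_BYTES
ranks_fit_int16 = N_GENES <= int16_max
```

Under these assumptions a single signature shrinks from 160,000 to 60,000 bytes, which still exceeds the assumed cache size, so a more compact layout of the per-query data is needed as well. The check on `int16_max` confirms that rank positions for this many genes fit in 16 bits.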
The other major improvement is related to SSE, which is a set of instructions that allows the processor to compute the same operation on multiple data points simultaneously [18]. The winning submission used SSE to form batches of 4 genes and simultaneously compute the rank positions of these genes, thus reducing the time of each query to approximately one-fourth.
As a result, the winning code submission for the CMap Query Speedup Challenge was deployed in the online portal CLUE.io and is currently available as an option to users in the Query App (Online Methods, Fig. 7). This improved algorithm has also enabled CLUE to support batch queries, allowing users to execute multiple queries in a single job, all via the CLUE user interface.

Discussion
This work demonstrates how researchers in computational biology and bioinformatics have utilized open innovation competitions to make inroads on a variety of computational roadblocks in their work. While participants in these competitions may not possess the domain knowledge to solve every research problem faced by scientists, their unique skills can be leveraged to attack well-defined tasks (e.g., algorithm improvement to maximize computational speed) resolving specific issues or bottlenecks to the research process.
We highlight a few key advantages over traditional approaches. First, competitions enable rapid yet broad exploration of the solution space of the problem. This broad exploration opens the possibility of discovering high-performing solutions, which may lead to breakthroughs in the field. Second, competitions give researchers access to multiple complementary solutions. Thus, they create opportunities to boost performance even further with ensemble techniques whereby different approaches are combined based upon their strengths in different regimes or for different subsets of data.
Our results indicate that these gains arise from the efforts of out-of-the-field participants rather than the community of practitioners and domain experts. Thus, on the one hand, we know open innovation competitions are a great tool to solicit community-based effort (e.g., DREAM Challenges); on the other, we show that the potential for their use in biology and other life sciences goes beyond the size and availability of the community of researchers connected to the problem.
By defining appropriate objective functions to guide the competitors' efforts, researchers indeed have the flexibility to pursue vastly different problems or even decide to tackle multiple aspects of the same problem concurrently via a composite objective function. This kind of problem definition, however, can be difficult. It requires knowledge of the aspects that are critical to the problem and for which improvements are quantifiable and achievable, understanding of ways to trade off improvements in one dimension for another, and the ability to abstract the original problem from its domain to attract broad participation (knowing that domain-specific information hurts participation but also offers competitors insights on how to solve the problem).
We have shown some of the challenges encountered in addressing these issues for two very general types of computational problems: "code development" and "machine learning." Code development problems typically boil down to improving existing computational algorithms that have well-defined inputs and outputs under a variety of constraints (e.g., speeding up the computation, without exceeding memory limits). Here, although performance improvements are quantifiable and can be checked by test cases, other relevant aspects (e.g., robustness) are less so. This restriction forces researchers to take additional steps after the contest to validate methodologies and integrity of solutions beyond the limited test cases considered during the competition. These steps typically include ensuring security (understanding what the code does and ensuring it is not malicious) and legality of the produced codes (that they are original or properly licensed) before integration and deployment.
Machine learning problems focus on more exploratory questions, such as producing predictive models that describe an existing data set, yet are generalizable to new data. These problems are typically multi-dimensional, given the wide range of potential applications, and are sometimes hard to quantify (e.g., measurements that offer only a partial picture of biological states to model). As a result, competitions addressing these problems typically involve interpretation and further evaluation of methodologies, exploration of possible complementarities, and understanding strengths and pitfalls of solutions in comparison to known methodologies along dimensions not considered within the competition.
Our study suggests that although competitions offer the flexibility to address both kinds of problems, the associated post-competition efforts can be quite different. In both cases, handling these additional tasks requires expertise that itself may not be available to the end-user, although they may be addressed in part by subsequent competitions; thus, keeping a modular design strategy appears beneficial. For ML problems, further efforts (comparable to those devoted to replicating the methods used in another study) are often needed to evaluate and assimilate the new knowledge produced. Further research to examine the outer limits of scientific problem solving through contests is necessary.
We conclude by mentioning a few additional empirical contexts in which open innovation competitions seem promising. One is the assignment of functional attributes of small molecules and genes, such as predicting the mechanism of action of a compound. While many algorithms have been independently developed for this purpose, a broad exploration is often out of reach to individual laboratories. On the development side, research in biology is often impeded by bottlenecks in the memory storage and transmission of genetic data, which are critical and quantifiable, thus an open innovation competition seems an effective way to overcome these bottlenecks.

6 Online Methods

6.1 Antibody Clustering Challenge (Scripps)

6.1.1 Competition Evaluation
Evaluating a multi-task challenge such as this, which aims to yield performance gains in speed and memory use while maintaining a high degree of accuracy, requires some care because a scalar metric unavoidably couples these performance characteristics. We therefore developed a weighted scoring function to evaluate the contestants' solutions against the gold standards. For a given dataset, scores were assigned to each algorithm based on a function of $t$ and ACC, where $t$ is the computation time required for a given test set on a 32-core server and ACC is the associated accuracy of the computation compared to the gold standards. The function also involves a constant equal to $10^{-2}$ and a reference time $t_0 = 100$ ms. Solutions that exceeded a memory use threshold of 3GB received a score of zero. The accuracy (ACC) was determined from the similarity between two clusterings based on an adaptation of the Rand index [19]. The final evaluation was based on the average performance on multiple test sets, after normalizing the scores for each test set by the top-performing solution. Fig. 1 illustrates the average performance of all final submissions as a function of computational cost and accuracy for 10K hold-out datasets, as compared to the A1 and A2 benchmarks.
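For reference, the standard (unadapted) Rand index measures the fraction of item pairs on which two clusterings agree. The sketch below implements that standard definition only; the adaptation used in the competition is not specified here, and the function name `rand_index` is an assumption.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs treated alike by both clusterings.

    A pair counts as an agreement when both clusterings put the two items
    together, or both put them apart.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


same_partition = rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # labels differ, grouping identical
crossed_partition = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index depends only on which items are grouped together, not on the cluster labels themselves, which is why relabeled but identical partitions score 1.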

Data Access
The training, validation and test data sets, as well as the Python and C++ benchmark clustering algorithms, are available at https://github.com/SuLab/Antibody-Clustering-Challenge.

6.2.1 Competition Evaluation
To assess imputation accuracy, we created an evaluation metric that balanced both absolute and relative measures of accuracy. Let $P_{ij}$ represent the predicted expression levels for a set of samples, indexed by $i = 1, \ldots, N$, and non-landmark gene labels, indexed by $j = 1, \ldots, M$ (for this study, $M = 11350$). Similarly, let $G_{ik}$ represent the true (measured) expression levels for the same set of samples and non-landmark genes, indexed by $k = 1, \ldots, M$. For each pair of gene labels $j, k$, construct the Spearman rank correlation matrix elements $\rho_{jk}(P, G)$ between prediction and truth data across all the samples. For a given gene label $j$, let $R_j(P, G)$ be the relative rank of the correlation $\rho_{jk}(P, G)$ with $k = j$ with respect to the remaining correlations with $k \neq j$. The score attributed to each gene-level prediction is given by an equally-weighted average of the Spearman correlation between prediction and truth data, and the relative rank of the correlation when compared with the correlations associated with the remaining genes:

$$\mathrm{score}_j = \frac{1}{2}\left[\rho_{jj}(P, G) + R_j(P, G)\right]$$
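The gene-level score can be sketched in a few lines of Python. Two points are assumptions rather than statements from the text: the relative rank is normalized here as the fraction of other genes whose correlation with the prediction is lower than the self-correlation, and the data layout (one list of values per gene) is chosen for convenience. All function names are illustrative.

```python
def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def gene_scores(pred, truth):
    """pred[j] and truth[k] each hold one gene's values across the same samples."""
    m = len(pred)
    scores = []
    for j in range(m):
        rho = [spearman(pred[j], truth[k]) for k in range(m)]
        others = [rho[k] for k in range(m) if k != j]
        rel_rank = sum(r < rho[j] for r in others) / len(others)  # assumed normalization
        scores.append(0.5 * (rho[j] + rel_rank))
    return scores


# Toy example: three genes measured on four samples.
truth = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 2, 4]]
pred = [[1.1, 2.0, 2.9, 4.2], [3.9, 3.1, 2.2, 0.8], [1, 2, 3, 4]]
scores = gene_scores(pred, truth)
```

In the toy data the third prediction correlates better with the first true gene than with its own target, so its relative rank, and hence its score, is penalized even though its self-correlation is high.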

Data Access
All the datasets related to this challenge can be found at the Gene Expression Omnibus (GEO) data repository, www.ncbi.nlm.nih.gov/geo/download/?acc=GSE92743. Test samples were generated in collaboration with the Genotype Tissue Expression (GTEx) project (www.gtexportal.org/home/).
A brief description of the main files is as follows:
• Training: Affymetrix data of 12,320 gene expressions for 100,000 samples used by contestants for building their models, ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE92nnn/GSE92743/suppl/GSE92743_Broad_Affymetrix_training_Level3_Q2NORM_n100000x12320.gctx.gz.
3. For each gene set $G_j$, compute a running sum statistic $rss_{g,j}$ that walks down the rank-ordered signature $s_r$ and, at each rank position $r_g$ corresponding to a gene $g$, is incremented by a factor $r_g / s_{\mathrm{sum},j}$ when $g \in G_j$, and is otherwise reduced by $1/(G - |G_j|)$;
4. Compute the maximum deviation of the running statistic $rss_{g_{\max},j}$ from zero for both gene sets $j = \mathrm{down}, \mathrm{up}$;
5. Finally, if the deviations corresponding to each gene set have a different sign, return the average absolute deviation; otherwise return zero (i.e., no similarity is found).
Extending this procedure to a database with S signatures and Q queries is straightforward and it consists of iteratively applying the above procedure to all pairs of signatures and queries.
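The running-sum steps above can be sketched for a single signature and a single query as follows. Since the earlier steps of the procedure are not shown here, the sketch assumes that the increment for a gene in the set is its absolute signature score divided by the sum of absolute scores over the set; the function names and toy data are likewise illustrative.

```python
def max_deviation(ranked_genes, scores, gene_set):
    """Signed maximum deviation from zero of the running-sum statistic.

    ranked_genes: gene identifiers ordered by descending signature score
    scores:       mapping gene -> signature score
    gene_set:     query genes (the up set or the down set)
    """
    members = set(gene_set) & set(ranked_genes)
    hit_norm = sum(abs(scores[g]) for g in members)  # assumed meaning of s_sum,j
    miss_step = 1.0 / (len(ranked_genes) - len(members))
    running, extreme = 0.0, 0.0
    for g in ranked_genes:
        if g in members:
            running += abs(scores[g]) / hit_norm
        else:
            running -= miss_step
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


def connectivity(scores, up_set, down_set):
    """Average absolute deviation if the up and down deviations differ in sign, else zero."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    dev_up = max_deviation(ranked, scores, up_set)
    dev_down = max_deviation(ranked, scores, down_set)
    if dev_up * dev_down < 0:  # opposite signs
        return (abs(dev_up) + abs(dev_down)) / 2
    return 0.0


# Toy signature with six genes.
signature = {"g1": 3.0, "g2": 2.0, "g3": 0.5, "g4": -0.5, "g5": -2.0, "g6": -3.0}
concordant = connectivity(signature, up_set=["g1", "g2"], down_set=["g5", "g6"])
unrelated = connectivity(signature, up_set=["g1"], down_set=["g2"])
```

Each query walks the full ranked gene list once per gene set, which is the per-query cost proportional to the number of genes discussed earlier.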

Competition Evaluation
For a given set of queries, let $sp = b/t$ be the speedup over the benchmark, defined as the ratio between $b$, the runtime (in seconds) of the existing SigQuery tool, and $t$, the time (in seconds) a participant's submission takes to complete the same queries. All code submissions were timed on the same server and rank-ordered based on the score $\mathrm{score} = (1 + 4/sp)^{-1}$.
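To show how the score responds to speed, here is a small sketch of the scoring function. It takes the speedup as the benchmark runtime divided by the submission runtime, so that larger values mean faster code; this direction of the ratio is an assumption chosen so that the score increases with speed. The function names are illustrative.

```python
def speedup(benchmark_seconds, submission_seconds):
    """Benchmark runtime divided by submission runtime (larger means faster)."""
    return benchmark_seconds / submission_seconds


def score(sp):
    """score = (1 + 4/sp)^(-1), which saturates towards 1 as the speedup grows."""
    return 1.0 / (1.0 + 4.0 / sp)


parity = score(speedup(7200, 7200))  # same runtime as the benchmark
median_like = score(23)              # the reported median speedup
top_like = score(60)                 # the order of the fastest submissions
```

The score is 0.2 at parity with the benchmark and 0.5 at a fourfold speedup, after which additional gains yield progressively smaller increments.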
The choice of using the runtime for scoring, instead of the process time (e.g., CPU time), deserves further comment. If the scoring function was based on the process (or CPU) time, participants would have had an incentive not to use multithreading techniques in their submissions, given the additional inter-processor communication overheads. By contrast, a scoring function based on runtime speedups would encourage competitors to use multithreading, which was available to everyone in the final evaluation process (i.e., all codes were evaluated on the same machine with 16 cores). So, the choice of using runtime for scoring was essentially to encourage implementations of multithreaded solutions.
In addition to speed, submissions had to be considered sufficiently accurate to be eligible for prizes.
We measured accuracy as the lowest absolute deviation in the Kolmogorov-Smirnov statistics between those obtained by the competitor and those computed by the SigQuery tool. Submissions were thus required to have their lowest absolute deviation below a given threshold (i.e., 0.0001). Note that we focused on the lowest absolute deviation of all the Kolmogorov-Smirnov statistics that were computed on each set of up- and down-regulated genes (i.e., before the possible normalization to zero, as described above) separately. This choice reflects considerations about the possibility that strategies for optimizing data storage may result in small losses in precision that may, in turn, cause random changes in the sign of the Kolmogorov-Smirnov statistics, when the absolute value of the statistic is low. An additional requirement was, hence, to allow a maximum of 1000 such differences.

Data Access
The gene sets used for the queries in this challenge were obtained by downloading public Affymetrix data from the National Center for Biotechnology Information's GEO repository and performing comparative marker selection between case and control samples in order to identify differentially expressed genes [20]. For the convenience of analysis, we fixed the total size of the gene sets to 100, a number which was intended to mimic the typical size of queries.
As with the other challenges, the dataset of queries was randomly split into training, validation, and test sets of size 250, 250, and 500. Queries in the training and validation sets were provided as CSV files, each containing gene set identifiers (rows) and a list of identifiers of individual genes contained in the gene set (columns). The test dataset was withheld and used to validate the final submissions.
Competitors had access to the whole CMap signature matrix, which was stored and distributed as a series of CSV files containing the matrix of differential gene expression values and the rank-ordered matrix that was pre-computed by sorting the signature matrix in descending order.

[Figure: legend entries "benchmark (mlr)", "final", "provisional"]

Challenge in the online portal CLUE.io, where the code is currently available as an option to users in the Query App ("compute with sig fastquery tool").