Indexes of Large Genome Collections on a PC

The availability of thousands of individual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key operation useful in such analyses is aligning sequencing reads against a collection of genomes, which is costly with the use of existing algorithms due to their large memory requirements. We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in exact and approximate matching model, against a collection of thousand(s) genomes. Its unique feature is the small index size, which is customisable. It fits in a standard computer with 16–32 GB, or even 8 GB, of RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, the exact matching queries (of average length 150 bp) are handled in average time of 39 µs and with up to 3 mismatches in 373 µs on the test PC with the index size of 13.4 GB. For a smaller index, occupying 7.4 GB in memory, the respective times grow to 76 µs and 917 µs. Software is available at http://sun.aei.polsl.pl/mugi under a free license. Data S1 is available at PLOS One online.


Introduction
About a decade ago, thanks to breakthrough ideas in succinct indexing data structures, it was made clear that a full mammaliansized genome can be stored and used in indexed form in main memory of a commodity workstation (equipped with, e.g., 4 GB of RAM). Probably the earliest such attempt, by Sadakane and Shibuya [1], resulted in approximately 2 GB sized compressed suffix array built for the April 2001 draft assembly by Human Genome Project at UCSC. (Obtaining low construction space, however, was more challenging, although later more memory frugal, or disk-based, algorithms for building compressed indexes appeared, see, e.g., [2] and references therein.). Yet around 2008, only a few sequenced human genomes were available, so the possibility to look for exact or approximate occurrences of a given DNA string in a (single) genome was clearly useful. Nowadays, when repositories with a thousand or more genomes are easily available, the life scientists' goals are also more ambitious, and it is desirable to search for patterns in large genomic collections. One application of such a solution could be simultaneous alignment of sequencing reads against multiple genomes [3]. Other applications are discussed in the last section.
Interestingly, this is a largely unexplored area yet. On one hand, toward the end of the previous decade it was noticed that the ''standard'' compressed indexes (surveyed in [4]), e.g. from the FM or CSA family, are rather inappropriate to handle large collections of genomes of the same species, because they cannot exploit well the specific repetitiveness. On a related note, standard compres-sion methods were inefficient for a simpler problem of merely compressing multiple genomes. Since around 2009 we can observe a surge of interest in practical, multi-sequence oriented DNA compressors [5][6][7][8][9][10][11][12][13][14][15], often coupled with random access capabilities and sometimes also offering indexed search. The first algorithms from 2009 were soon followed by more mature proposals, which will be presented below, focusing on their indexing capabilities. More information on genome data compressors and indexes can be found in the recent surveys [16][17][18].
Mä kinen et al. [19] added index functionalities to compressed DNA sequences: display (which can also be called the random access functionality) returning the substring specified by its start and end position, count telling the number of times the given pattern occurs in the text, and locate listing the positions of the pattern in the text. Although those operations are not new in fulltext indexes (possibly also compressed), the authors noticed that the existing general solutions, paying no attention to long repeats in the input, are not very effective here and they proposed novel self-indexes for the considered problem.
Claude et al. [7] pointed out that the full-text indexes from [19], albeit fast in counting, are rather slow in extracting the match locations, a feature shared by all compressed indexes based on the Burrows-Wheeler transform (BWT) [4]. They proposed two schemes, one basically an inverted index on q-grams, the other being a grammar-based self-index. The inverted index offers interesting space-time tradeoffs (on real data, not in the worst case), but can basically work with substrings of fixed length q. The grammar-based index is more elegant and can work with any substring length, but uses significantly more space, is slower and needs a large amount of RAM in the index build phase. None of these solutions can scale to large collections of mammalian-sized genomes, since even for 37 sequences of S. cerevisiae totaling 428 Mbases the index construction space is at least a few gigabytes.
While a few more indexes for repetitive data were proposed in recent years (e.g., [20][21][22][23]), theoretically superior to the ones presented above and often handling approximate matches, none of them can be considered a breakthrough, at least for bioinformatics, since none of them was demonstrated to run on multigigabyte genomic data.
A more ambitious goal, of indexing 1092 human genomes, was set by Wandelt et al. [24]. They obtained a data structure of size 115.7 GB, spending 54 hours on a powerful laptop. The index (loaded to RAM for a single chromosome at a time), called RCSI, allows to answer exact matching queries in about 250 ms, and in up to 2 orders of magnitude longer time for k-approximate matching queries, depending on the choice of k (up to 5).
Sirén et al. [25] extended the BWT transform of strings to acyclic directed labeled graphs, to support path queries as an extension to substring searching. This allows, e.g., for read alignment on an extended BWT index of a graph representing a pan-genome, i.e., reference genome and known variants of it. The authors built an index over a reference genome and a subset of variants from the dbSNP database, of size less than 4 GB and allowing to match reads in less than 1 ms in the exact matching mode. The structure, called GCSA, was built in chromosome-bychromosome manner, but unfortunately, they were unable to finish the construction for a few ''hard'' chromosomes even in 1 TB of RAM! We also note that a pan-genome contains less information than a collection of genomes, since the knowledge about variant occurrences in individual genomes is lost.
A somewhat related work, by Huang et al. [26], presents an alignment tool, BWBBLE, working with a multi-genome (which is basically synonymous with pan-genome in the terminology of [25]). BWBBLE follows a more heuristic approach than GCSA and can be constructed using much more humble resources. Its memory use, however, is over 16n log 2 n bits, where n is the multi-genome length. This translates to more than 200 GB of memory needed to build a multi-genome for a collection of 1092 human genomes. Both BWBBLE and GCSA need at least 10 ms to find matches with up to 3 errors.
The recently proposed journaled string tree (JST) [27] takes a different approach, providing an online scan over the reference sequence, but also keeping track of coverages of variants falling into the current window over the reference. Each individual is represented as a journal string, that is, a referentially compressed version of the original sequence; segments of journal strings, together with helper data, are stored in a journal string tree. The JST approach allows to generically speed up many sequential pattern matching algorithms (for exact or approximate search) when working on a collection of similar sequences. A drawback of this approach is that search times are never better than of an online scan over a single (reference) sequence.
Also recently, Durbin [28] presented an interesting data structure dubbed Positional Burrows-Wheeler Transform (PBWT), to find long matches between sequences within a given collection, or between a new test sequence and sequences from the collection. PBWT provides very compact representation of the dataset being searched, yet its application is different to ours: only binary information about variant occurrences are kept (not even their position in a reference sequence), which means that handling standard locate queries (given a string, report all its match positions in the relevant sequences in the indexed collection) is impossible in this way.
Aligning sequencing reads to a genome with possible variants was also recently considered in theoretical works, under the problem name of indexing text with wildcard positions [29,30], where the wildcards represent SNPs. No experimental validation of the results was presented in the cited papers.
Most of the listed approaches are traditional string data structures, in the sense that they can work with arbitrary input sequences. The nowadays practice, however, is to represent multigenome collections in repositories as basically a single reference genome, plus a database of possible variants (e.g., SNPs), plus information on which of the variants from the database actually occur in each of the individual genomes. The popular VCF (Variant Call Format) format allows to keep more information about a sequenced genome than listed here, but this minimal collection representation is enough to export each genome to its FASTA form. Dealing with input stored in such compact form should allow to build efficient indexes much more easily than following the standard ''universal'' way, not to say about tremendous resource savings in the index construction.
This modern approach was initiated in compression-only oriented works [5,13,14], and now we propose to adapt it in construction of a succinct and efficient index. According to our knowledge, this is the first full-text index capable to work on a scale of thousand(s) of human genomes on a PC, that is, a small workstation equipped with 16-32 GB of RAM. What is more, for a price of some slow-down the index can be used even on an 8 GB machine. No matter the end of the space-time tradeoff we are, the index is capable of handling also approximate matching queries, that is, reporting patterns locations in particular genomes from the collection with tolerance for up to 5 mismatches. As said, the index is not only compact, but also fast. For example, if up to 3 errors are allowed, the queries are handled in average time of 373 ms on the test PC and the index takes 13.4 GB of memory, or in 917 ms when the index is of size 7.4 GB. The current version of our index requires more resources (from 38 GB to 47 GB of RAM, depending on the index settings) in the construction phase; a drawback which may be eliminated in a future work, as discussed in the last section of this paper.

Datasets
We are indexing large collection of genomes of the same species, which are represented as the reference genome in FASTA format together with the VCF [31] file, describing all possible reference sequence variations and the genotype information for each of the genome in the dataset. We are only interested in details allowing for the recovery of the DNA sequences, all non-essential fields are ignored. Therefore, the data included in the VCFmin format, used in [14], are sufficient. Each line describes a possible variant that may be a single nucleotide polymorphism (SNP), a deletion (DEL), an insertion (INS) or a structural variation (SV), which is typically a combination of a very long deletion and an insertion. The genotype of each genome is specified in one designated column with information if each of the variant is found in this genome. In case of diploid and phased genotypes this information concerns two basic, haploid chromosome sets for each genome and treats them independently. Thus for any phased diploid genome, its DNA sequence is twice the size the reference sequence.
In our experiments we used the data available from Phase 1 of the 1000 Genomes Project [32] describing the collection of 1092 phased human genomes. We concatenated the available 24 VCF files (one for each chromosome), to get one combined VCF file, which-together with the reference sequence-is the input of our algorithm building the index.

The general idea
Our tool, Multiple Genome Index (MuGI), performs fast approximate search for input patterns in an indexed collection of genomes of the same species. The searched patterns can be provided in a text file (one pattern per line), or in FASTA or FASTQ format. The index is built based on the reference genome and the VCF file describing the set. The search answers the locate query-the result consists of all positions of the pattern with respect to the reference genome along with the list of all individuals in which it can be found.
The basic search regime is exact matching. Its enhanced version allows for searching with mismatches. Both modes use the seedand-extend scheme. The general mechanism is to quickly find a substring of the pattern and then extend this seed to verify if it answers the query.
The index has one construction-time parameter, k, which is the maximum possible length of the seed. The match can be found directly in the reference genome and/or in its modified form, with some of the variations introduced. To find the seed we build an array of all possible k-length sequences (k-mers) occurring in all genome sequences. In the space-efficient version only a part of the array is kept. The extension step is done using the reference and the available database of variants, checking which combination of possible variations introduced, if any, allows to find the full pattern.
To know individuals in which the match can be found, we have to identify all variants whose occurrence, or absence of, have impact on the match, and list only the genomes with such combination of variants.

Building the index
To build the index, we process the input data to create the following main substructures, described in detail in the successive paragraphs: VD contains details about all possible variations. For each variant, the following items are stored: type (1 byte), preceding position (4 bytes) and alternative information (4 bytes). (Note that we keep the preceding positions to be able to manage the variants INSs, DELs and SVs, as this convention conforms to their description in VCF files.) The last item indicates alternative character in case of SNP, length of the deletion in case of DEL and position in the additional arrays of bytes (VD-aux) in case of INS and SV. VD-aux holds insertion length (4 bytes) and all inserted characters (1 byte each), if any, for every INS and SV. For SV it also stores length of the deletion (4 bytes). The variants are ordered by the preceding position and a lookup table is created to accelerate search for a variant by its location. VD together with REF can be used to decode the modified sequence from some given position to the right, by introducing certain variants. To be able to decode the sequence to the left, an additional list of all deletions (SVs and DELs), ordered by the resulting position, is created. The list, VD-invDel, stores for each variant its number in the main VD (4 bytes) and the resulting position, that is, the position in the reference after the deletion (4 bytes).
There is one BV for each variant, each of size of the number of genomes in the collection (2 times the number of genomes for diploid organisms). Value 1 at some j th position in this vector means that the current variant is found in the j th haploid genome. To reduce the required size, while preserving random access, we keep the collection of these vectors in compressed form, making use of the fact that spatially close variant configurations are often shared across different individuals. The compression algorithm makes use of a dictionary of all possible unique 192-bit chunks (the size chosen experimentally). Each BV is thus represented as a concatenation of qno haploid genomes=192r 4-byte tokens (vocabulary IDs).
kMA keeps information about each k-length sequence (k-mer) occurring in the whole collection of genomes. The k-mer sequence itself is not kept. Instead, only the minimum information needed to retrieve it with help of REF and VD is stored. Based on the amount of details necessary to keep, we partition k-mers into four groups, each stored in one of the four subarrays of kMA: kMA 0 , kMA 1 , kMA 2 or kMA 3 . The entries in each subarray are sorted according to the lexicographical order of k-mers they represent. All k-mers beginning with the unknown character (i.e., N or n) are filtered out.
All k-mers found in REF are kept in kMA 0 . Only the preceding position Spos ref T (4 bytes) is stored for each such k-mer, as it is enough (using REF) to retrieve its sequence. These k-mers are present in all genomes with no variants introduced in the corresponding segment.
The k-mers that are obtained by applying some variant to the reference sequence are stored in kMA 1 /kMA 2 /kMA 3 . They are produced with going through the reference genome and checking for each position p if there is any possible variant with the preceding position in the range from p to pzk{1. If the check is positive, we decode the k-mer. The decoding process takes into account all possible paths. By path we understand any combination of occurrence of subsequent variants, influencing the decoded sequence. For example, if SNP is possible at current position (i.e., it is listed in VD), two paths are considered: when it is found and when it is absent, resulting in two decoded sequences, differing in the last inspected character. Thus, starting from a single preceding position, many resulting sequences may be obtained. To decode most k-mers, it is enough to store the preceding position plus flags about the presence/absence of following variants. This evidence list (evList) is stored as a bit vector, where 1 means that the corresponding variant is present. For any k-mer starting inside an insertion (INS or SV) it is also necessary to store the gap from the beginning of the inserted string to the first character of the k-mer.
The k-mer with no gap and at most 32 evidences about consecutive variants from VD in the evList is stored in kMA 1 , where each entry is defined as Spos ref, evListT (4z4 bytes). If there is also a gap involved, such k-mer goes to kMA 2 , defining each entry as Spos ref, gap, evListT (4z4z4 bytes). All k-mers with more than 32 evidences in the evList or with evidences about nonconsecutive (with respect to VD) variants are kept in kMA 3 , where each k-mer is represented by four fields: Spos ref, gap, evSize, evListT (4z4z4zevSize|4 bytes). The representative example of the latter case is a k-mer with SV introduced and many variants in VD placed within the deleted region. Keeping track of these variants, not altering the resulting sequence, is pointless.
Any k-mer is kept in kMA only if there is at least one haploid genome that includes it, that is, has the same combination of occurring variants. It is checked with help of BV. Recall that the kmers in each subarray kMA i , i[f0,1,2,3g, are sorted lexicographically. To speed up the binary search (by narrowing down the initial search interval), a lookup table, taking into account the first 12 characters, is created for each subarray.

The basic search algorithm
The pseudocode of the basic search algorithm is presented as Algorithm 1 in Table 1. It looks for all exact occurrences of the pattern P in the compressed collection, using the seed-and-extend scheme. The undetermined nucleotides (i.e., N or n) occurring in P are encoded differently than in REF, so they never match any character in the collection. The seed S is chosen to be a substring of P, precisely its first k characters, or the full pattern, if DPDvk (lines 1-2).
The first step is to scan kMA for all k-mers whose prefixes (or simply full sequences, if DSD~k) match S. It is done with binary search in each subarray kMA i , i[f0,1,2,3g, separately (lines 3-4). Next, each found seed is partly decoded and then extended (lines 5-7). The partial decoding, done by the partDecode function, starts from pos ref of the current k-mer and move p~DSD characters forward, according to the k-mer's details (i.e., there may be a need to introduce some found variant). Character-bycharacter matching is not performed, as it is already known that DSD-length prefix of the k-mer matches S. Function partDecode returns the seed's succeeding position (pos curr) and variant (vt curr) in the reference, along with the list of encountered variants (vtList) and the list of evidences about their presence or absence (evList). The latter is a vector of 0 s in case of kMA 0 and a copy of k-mer's evList (or its part) for other subarrays. The first variant (the one with preceding position greater than or equal to the preceding position of the k-mer) is found with binary search in VD. It is not shown in the pseudocode, but for each seed also the preceding SVs and DELs are taken into account when creating the initial vtList and evList.
The seed S is recursively extended according to all possible combinations of variants, that is, as long as succeeding characters match the characters in P and found occurrences of P are reported (line 7). The pseudocode of the algorithm extending the seed and reporting the results is presented as Algorithm 2 in Table 2 Table 2. Pseudocode of the algorithm extending the found seed in the basic search.  When the extension reaches the end of the pattern P, it is checked in which individuals, if in any, the relevant combination of variants (track kept in vtList) is found (lines 20-25). The bit vector R is initialized to be the size of the number of haploid genomes. The value 1 at j th position means that j th haploid genome contains the found sequence. The vector R is set to all 1 s at the beginning, because if vtList is empty, the sequence is present in all genomes. To check which genomes have the appropriate combination of variants, the bitwise AND operations are performed between all BVs related to variants from the vtList, negating all BVs with 0s at the corresponding position in the evList. If R contains any 1 s, pattern P is reported to be found with the preceding position pre (in relation to the reference genome) and vector R specifies genomes containing such sequence.

The space-efficient version
To reduce the required space, while still being able to find all occurrences of the pattern, we make use of the idea of sparse suffix array [33]. This data structure stores only the suffixes with preceding position being a multiple of s (sw1 is a constructiontime parameter). In our scheme, the two largest subarrays, kMA 0 and kMA 1 , are kept in sparse form, based on preceding positions of k-mers. For kMA 1 , it is also necessary to keep all k-mers that begin with deletion or insertion (the first variant has the same preceding position as the k-mer).
The search algorithm has to be slightly modified. Apart from looking for the k-length prefix of the pattern (i.e., P½0 . . . k{1) in kMA, also k-length substrings starting at positions 1 . . . s{1 must be looked for in kMA 0 , kMA 1 , and kMA 3 (as some specific seeds may be present only in kMA 3 ). The substrings, if found in one of mentioned subarrays, must be then decoded to the left, to check if their prefix (from 1 to s{1 characters, depending on the starting position) matches the pattern P. The VD-invDel substructure is   All times in ms. We do not provide times for large sparsities and more errors than 3, since in such cases the internal queries would be for very short sequences and in turn result in numerous matches and significant times; thus, we do not recommend to use MuGI in such parameter configurations. doi:10.1371/journal.pone.0109384.t004 used for the process. The rest of search is the same as in the basic search algorithm.

The approximate search algorithm
The approximate search algorithm looks for all occurrences of the given pattern with some maximum allowed number of mismatches. According to the well-known property, for any sequence of length ' with m mismatches at least one of the consecutive substrings of length q~t ' mz1 s is the same as in the original sequence. Therefore, the approximate search begins with dividing the string to mz1 substrings of length q. Next, the exact search algorithm is used to look for each of the substrings. If a substring is found in the collection, it is further decoded to the right and to the left, similarly as in the exact search, but allowing for at most m differences between the decoded sequence and the searched sequence. Expanding to the left is done with aid of the same auxiliary substructure as in the space-efficient version (VD-invDel). The list of genomes in which the found sequences are present is obtained in the same way as in the exact searching.

Test data
To evaluate the algorithm, we first used a similar methodology as the one in [24]. To this end, we generated a file with 100K queries, where each pattern is a modified excerpt of length '~120 . . . 170 (uniformly random value) from a randomly selected genome from the collection, starting at a randomly selected position. Excerpts containing undetermined nucleotide (i.e., N) were rejected. The modifications consisted in introducing random nucleotides in place of x existing nucleotides, where x is a randomly selected integer number from the ½0,0:05|') range.
Additionally, we use real reads from the 1000GP repository. There are 140K reads chosen randomly in such a way that each of 14 human populations is represented with 10K reads. Their length varies between 100 and 120 bp. Both data sets are available at project home page.
As the index construction costs are not that small (as mentioned earlier), we provide an exemplary index over the 1000GP data at our software page.

Results
All experiments were performed on a PC with Intel Core i7 4770 3.4 GHz CPU (4 cores with hyperthreading), equipped with 32 GB of RAM, running Windows 7 OS. The C++ sources were compiled using GCC 4.7.1 compiler.
The index was built on another machine (2.4 GHz Quad-Core AMD Opteron CPU with 128 GB RAM running Red Hat 4.1.2-46) and required more RAM: from 38 GB (for k~25) to 47 GB (for k~45). The corresponding build times were 15 hours and 72 hours, respectively. The index build phase was based on parallel sort (using Intel TBB and OpenMP libraries), while all the queries in our experiments were single-threaded. The correctness of obtained query results, in exact and approximate matching mode, was experimentally verified with a set of patterns, for which a naïve (sequential) scan over all the sequences was run.
From Table 3 we can see that the fastest index version (i.e., with sparsity 1, which translates to standard k-mer arrays) may work on the test machine even for the seed maximum length of 40 symbols. Significant savings in the index size are however possible if sparsity of 3 or more is set, making the index possible to operate on a commodity PC with 16 GB of RAM. If one (e.g., a laptop user) requires even less memory, then the sparsity set to 16 makes it possible to run the index even in 8 GB of RAM. Naturally, using   Table 6. Query times for various variants of indexes for real data. All times in ms. We do not provide times for large sparsities and more errors than 3, since in such cases the internal queries would be for very short sequences and in turn result in numerous matches and significant times; thus, we do not recommend to use MuGI in such parameter configurations. doi:10.1371/journal.pone.0109384.t006 larger sparsities comes at a price of slower searches; in Fig. 1, each series of results for a given value of k corresponds to sparsities from f1,2, . . . ,8,10,12,14,16g (sparsities of 1 correspond to the rightmost points, with the exception of the case of k~45, for which the sparsities start from 2). Still, this tradeoff is not very painful: even the largest allowed sparsity value (16) slows down the fastest (for sparsity of 1) queries by factor about 2 on average, in most cases. Costlier, in terms of query times, is handling mismatches. In particular, allowing 4 or 5 mismatches in the pattern requires at least an order of magnitude longer query times than in the exact matching mode. Yet, even for 5 allowed errors the average query time was below 10 ms in all tests. This translates, for example, to 224 mapped reads per second allowing up to 5 mismatches and 10,593 mapped reads per second with up to 1 mismatch, at index size of 11.4 GB (k~40, sparsity of 4, simulated reads; cf. Table 4).
Apart from the average case, one is often interested also in the pessimistic scenario. Our search algorithms do not have interesting worst-case time complexities, but fortunately pathological cases are rather rare. To measure this, for each test scenario a histogram of query times over 100K patterns was gathered, and the time percentiles are shown in Fig. 2. Note that the easy cases dominate: for all maximum errors allowed, for 90% test patterns the query time is below the average. Yet, there are a few percent of test patterns for which the times are several times longer, and even a fraction of a percent of patterns with query times exceeding 100 ms (at least for approximate matching). More details exposing the same phenomenon are presented in Table 5.
While we cannot directly compare our solution to RCSI by Wandelt et al. [24], as their software is not public, we can show some comparison. Their index was built over twice less data (haploid human genomes vs. diploid genomes in our data). We handle exact matches much faster (over 6 times shorter reported average times, but considering the difference in test computers this probably translates to factor about 4). Roughly similar differences can be observed for the approximate matching scenario, but RCSI handles the Levenshtein distance, while our scheme handles (so far) only mismatches. Finally, and perhaps most importantly, our index over 1092 diploid human genomes can be run on a standard PC, equipped with 32 or 16 GB of RAM (or even 8 GB, for the price of more slow-down), while RCSI requires a machine with 128 GB (unless searches are limited to one chromosome, when a portion of the index may be loaded into memory).
We were are not able to run GCSA [25] or BWBBLE [26], due to their large memory requirements in the construction phase.
We did, however, ran a preliminary comparison of MuGI against GEM [34], one of the fastest single genome read mappers. We ran it on 1 CPU core, for mismatches only, in the all-strata mode, in which all matches with 0,1, . . . ,max mismatches errors are reported, in arbitrary order. Table 4 contains a detailed rundown of the results on simulated reads. For example, we can see that GEM performed exact matching in 24.0 ms, found matches with up to 1 mismatch in 50.6 ms, matches with up to 3 mismatches in 86.4 ms, and matches with up to 5 mismatches in 217.3 ms. The memory use was 5.0 GB. This means that, depending on chosen options of our solution, GEM was only about twice faster in the exact matching mode and 15-20 times faster when 5 mismatches were allowed. On real reads (Table 6) GEM is about 1.5-5 times faster with exact matching and about 10-20 times faster with 3 allowed mismatches. The major scenario difference is however that GEM performs mapping to a single (i.e., our reference) genome, so to obtain the same mapping results GEM would have to be run 2|1092 times, once per haploid genome. We thus consider these preliminary comparative results very promising. Table 7. Comparison of MuGI and JST on simulated and real data, both over 1092 individual sequences of chr1.  Finally, in Table 7 we compare MuGI against a recent tool JST by Rahn et al. [27]. As we can see, MuGI is usually 5-6 orders of magnitude faster at somewhat less memory consumption. This huge gap in performance can be explained with two different search ''philosophies'': sequential scan over the reference sequence in JST vs. fully indexed search in MuGI. As in this test we used only chr1 data (1092 sequences), the performance gap would probably be larger with the full human collection. On the other hand, we admit that JST performance with growing k (the maximum allowed number of errors) remains unchanged (which is a property of Myers' algorithm), therefore this scheme might be a satisfactory choice for a collection of short and highly-varied genomes.

Discussion
We presented an efficient index for exact and approximate searching over large repetitive genomic collections, in particular: multiple genomes of the same species. This has a natural application in aligning sequencing reads against a collection of genomes, with expected benefits for, e.g., personalized medicine and deeper understanding of the interaction between genotype and phenotype. Experiments show that the index built over a collection of 2|1092 human genomes fits a PC machine with 16 GB of RAM, or even half less, for the price of some slow-down. According to our knowledge, this is the first feat of this kind. The obtained solution is capable of finding all pattern occurrences in the collection in much below 1 ms in most use scenarios.
We point out that representing a ''true'' genome as a linear sequence over the ACGT(N) alphabet is inherently imperfect, since our knowledge about these sequences is (and will likely remain in the near future) limited. Every sequencing technology introduces its errors, therefore storing qualities (i.e., estimated correctness probabilities) together with the DNA symbols would convey more information useful for read mapping, yet we are unable to imagine an index over large collections based on such information not requiring huge amount of resources (especially main memory) in its runtime and construction stages. Moreover, large discrepancies between the reference and a given genome, e.g., long indels, result in reads that cannot be usually mapped, which implies incomplete variant information in the built VCF. Basically for those reasons the application of MuGI (and related software, like RCSI or BWBBLE) for mapping sequencing reads trades some accuracy for performance and reasonable memory use, yet with improving sequencing technologies the obtained mapping results should also be more valuable.
On the other hand, we should stress that MuGI is an index rather than a full-fledged read mapper. Aligning reads to multiple genomes is one of its possible applications. Another example could be searching for nullomers, that is, k-mers with no occurrences in a given genome (or, in our scenario, genome collection). To apply MuGI here, we may generate random strings of specified length (e.g., 20) in a loop and check if they have any occurrence; we may also force the mimimum distance to any 20-mer in the genome to be 2 or 3, with running the MuGI engine in the approximate matching mode, to minimize the impact of noisy data in a genome, at still acceptable search speed. Also a closely related problem of finding the minimal absent word was investigated in the literature, and it can be solved with MuGI with a systematic scan over its component structures. Nullomers/minimal absent words can be used for studies of population genetics, drug discovery and development, evolution studies, design of molecular barcodes or specific adaptors for PCR primers [35,36]. Other (or more general) areas for application of our algorithm may include comparative genomics and personalized medicine.
Several aspects of the presented index require further development. The current approximate matching model comprises mismatches only; it is desirable to extend it to edit distance. The pathological query times could be improved with extra heuristics (even if it is almost irrelevant for large bulk queries). A more practical speedup idea is to enhance the implementation with multi-threading. Some tradeoffs in component data structures (cf. Table 8) may be explored, e.g., the reference genome may be encoded more compactly but at a cost of somewhat slower access. A soft spot of the current implementation is the index construction phase, which is rather naïve and can be optimized especially towards reduced memory requirements. We believe that existing disk-based suffix array creation algorithms (e.g., [37]) can be adapted for this purpose. Alternatively, we could build our indexing data structure separately for each chromosome (with memory use for the construction reduced by an order of magnitude) and then merge those substructures, onto disk, using little memory. The sparse suffix array may be replaced with a sampled suffix array variant [38], for a hopefully faster search at a similar space consumption. Finally, experiments on other collections should be interesting, particularly on highly-polymorphic ones.

Supporting Information
Data S1 Supplementary material. (PDF)