TagGD: Fast and Accurate Software for DNA Tag Generation and Demultiplexing

Multiplexing is of vital importance for utilizing the full potential of next generation sequencing technologies. We here report TagGD (DNA-based Tag Generator and Demultiplexor), a fully-customisable, fast and accurate software package that can generate thousands of barcodes satisfying user-defined constraints and can guarantee full demultiplexing accuracy. The barcodes are designed to minimise their interference with the experiment. Insertion, deletion and substitution events are considered when designing and demultiplexing barcodes. 20,000 barcodes of length 18 were designed in 5 minutes and 2 million barcoded Illumina HiSeq-like reads generated with an error rate of 2% were demultiplexed with full accuracy in 5 minutes. We believe that our software meets a central demand in the current high-throughput biology and can be utilised in any field with ample sample abundance. The software is available on GitHub (https://github.com/pelinakan/UBD.git).


Introduction
Parallel processing of samples is a powerful method to perform multiple experiments in time and cost-effective manner. Next generation sequencing technologies provide the necessary platform to achieve this [1,2]. Given the current developments in sequencing platforms, the scale of multiplexing will grow requiring larger numbers of barcodes due to high sequence throughput [3].
Generating unique barcodes is challenging in a number of ways. Firstly, barcodes should conform to particular constraints such as length, GC-content, cross-hybridisation with experimental sequences, absence of particular restriction enzymes to minimise their interference with the actual experiment. Secondly, they should be unique enough to allow correct assignment of sequences to their original samples. This can be problematic since there can be errors introduced in the barcodes during both their synthesis and sequencing, resulting in non-unique barcodes. Using longer barcodes could partially solve this issue however they could hamper the sequencing space for the actual sample. Therefore it is imperative to design sufficient number of barcodes with optimal length and maximal distance to each other.
A framework developed by Xu. al. (DeLOB) can generate large numbers of barcodes that can then be printed on microarrays. However uniqueness of barcodes is based on their crosshybridization and no demultiplexing step is implemented [4]. There are a number of available software tools to design and demultiplex DNA barcodes but generation of longer ones (.15mer) is not feasible using these tools [5,6]. A recent publication provides a DNA barcode design framework based on Hamming codes, however the design is limited since the sequence uniqueness is determined only by base-changes (Hamming distance) and does not take into account insertion and deletion events [7]. It is possible to design barcodes that are capable of detecting and correcting errors for optimal demultiplexing [5]. A resource is available to generate error-correcting binary codes up to length 30 (corresponding to 15mer DNA barcodes) after which it is challenging to discover such optimal codes [8]. Such methods may not provide sufficient number of barcodes especially after applying user-specified constraints.
We here implemented a software package for designing large numbers of barcodes of virtually any length that can be demultiplexed with full accuracy. We employed Levenshtein distance to discriminate between the barcodes [9] since insertions and deletions also occur in DNA synthesis and sequencing steps [10,11].
TagGD is easy-to-use, fully customizable, memory-efficient, fast software package and can also be run on a desktop computer. The user needs to specify the number and length of the barcode and the uniqueness based on the combined estimated error rate of the platforms that synthesise and sequence the barcodes including the error modality of the platform. Then the software provides the user the unique barcodes that will give the maximal accuracy within the allowed error rate. If the number of barcodes is not sufficient, the user can then either increase the length of the barcode or decrease the error rate. The user can also impose constraints on the barcode sequences using the barcode configuration file. In the demultiplexing step, the software takes FASTQ files and outputs demultiplexed reads in a FASTQ file with the optional third line containing the barcode of the reads.

Design
First, a random barcode of the desired length and GC content is generated, then it is checked for mono, di or tri-mer repeats, complexity, self-hybridisation, presence of restriction enzyme cut motifs (if required) and edit distance to its reverse complement. It is also hybridised with primers/adapters used in the experiment to prevent its interference. Gibb's free energy and melting temperature of the self-hybridised barcodes were computed using the DNA-fold algorithm [12] and sequences with strong secondary structures are discarded. If the barcode passes these filters, it is added to a barcode pool, and its uniqueness is checked. We use Levenshtein distance to measure the distance of two barcodes in the sequence space. We also allow wildcard position at the beginning and end of the barcode to exclude barcodes that cannot be discriminated as a result of an insertion/deletion. This option is called ''padding'', when it is set to 0, the set will be resistant to substitution errors but not to insertions/deletions to the same extent. The user is recommended to use the same padding option for demultiplexing that s/he used for barcode design. The generated barcodes are printed in a text file, one line containing each barcode. The edit distance distribution of pairs of barcodes is also written in another file. Almost all parameters of TagGD can be adjusted in the configuration file; an example is included within the software package.
For demultiplexing, a hash map is built, containing all the kmers of a given length appearing in all barcodes. This map is filled by searching against with overlapping k-mers of the putative barcode sequence so that each barcode is related to the k-mers it contains. This results in reducing the search space and thus greatly improving the time necessary to identify the best hit. All barcodes that have been found through the map search will be compared to the putative sequence through a semi-global alignment that ensures full overlap with the entire barcode. Also here we employ a parallel strategy due to each barcode mapping being independent. A given number of threads are created which consume the input, and output the computed best mapping. Note that as a result of this, the order of entries is not preserved between input and output. The input and output of the demultiplexer is in FASTQ format, the details of the best-hit barcode are added to the optional third line of the output FASTQ file. TagGD supports paired-end sequencing data.

Implementation
TagGD is implemented using C++ and parallelised using POSIX Threads and OpenMP [13,14]. First, random DNA sequences of desired length and GC content are generated. Sequences are then filtered as specified in the previous section to ensure least interference with each other and the experiment.
Each barcode has to be unique. Because of possible experimental and sequencing errors, barcodes should be far away from each other in sequence space to prevent their false classification. Operations can be substitutions, insertions or deletions. The aim is to find a set of sequences where each has an edit distance greater than the desired threshold to the entire set. Thus, each new sequence that passes all the above filters is checked against the solution set and is accepted only if the edit distance to all  previously accepted elements is greater than the imposed limit. As the two operations of filtering and insertion are independent, a worker thread is created to handle the generation of barcodes and push those to a common pool as shown in Figure 1. The insertion operation imposes a growing cost, scaling with the size of the solution set. Thus a parallelizing strategy is also employed here, where all the available threads are used to concomitantly search through non-overlapping subsets of the solution space. Table 1 lists the running times for generating different numbers of barcodes with and without padding option on.

Results
Assuring the uniqueness of the barcodes Accuracy of the demultiplexing step depends on the combined error rate of the platform processing the barcodes as well as the nature of the error. We implemented TagGD to generate required number of barcodes within reasonable run times while ensuring maximal accuracy depending on the nature and the rate of the errors. An error introduced within a barcode may cause a situation that the error-containing barcode could have the same edit distance to two or more barcodes in the set, hence cannot be classified (Figure 2A). Worse case scenario is that if the error containing barcode has a smaller edit distance to another barcode than to its original, which leads to its misclassification ( Figure 2B). In our experience most of the wrongly classified barcodes were due to insertion and deletion errors. Therefore we developed a strategy that deals with the problem of shifting of the barcode within the read due to insertions or deletions. To this end, we included wild card positions (padding) at the start and end of the barcode, resulting in a semi-global alignment strategy when calculating its distance to another barcode. This step ensures exclusion of barcodes that can be converted to another barcode in the set due to positional errors, insertions and deletions. Additionally, to eliminate cases where an error-containing barcode lie in equal distance to two or more barcodes, we recommend setting the minimal edit distance at least more than twice the number of bases expected to be erroneous based on the estimated error rate (Equation 1).

Minimum Edit Distance
If most of the errors are expected to be due to substitutions then there is no need to design the barcodes with the padding option on so this can be set to zero. This will ensure to give the user the maximal number of barcodes within the shortest running time with full accuracy as long as equation 1 is satisfied. In other words, the barcodes can have errors as half the number of minimum edit distance in the set and can still be demultiplexed with 100% accuracy ( Table 2).  Table 2. 20,000 unique barcodes of length 18 were generated with no padding option. However if the user expects insertion and deletion events, then the padding option must be used and the padding should be equal to the expected number of insertion or deletion events within the read. To ensure full accuracy in the demultiplexing step, equation 1 must still hold but this time errors can be substitutions, insertions or deletions.

Demultiplexing of the barcodes
Systematic errors, in either the experiment or the sequencing can cause substitutions, insertions or deletions in the sequence. This necessitates an extra step for correctly identifying the sequenced barcode. We have implemented a semi-global aligner, in line with the edit distance definition used in generating the barcodes. It ensures the best match of a given sequence to the set of barcodes. The alignment is quality aware and will discriminate between two equally good mappings (from an edit distance perspective) based on the quality of the bases.
False negative rate of the demultiplexing depends on the k-mer length. To achieve full accuracy, the k-mer size must be set such that there will be at least one window of size k not containing an erroneous base (Equation 2). However given the low probability of having all the expected erroneous bases within the barcode, the kmer size can be increased at a low cost to the accuracy while reducing the running time. To assess the effect of the k-mer length on accuracy, we have designed 10,000 and 20,000 barcodes of length 18 (with or without padding respectively) with a minimum edit distance of 4. We used the default settings (GC content between 45-65%, homopolymer limit 4, self hybridisation temperature 50uC). On these we simulated 200,000 reads with an error rate of 2%. Tables 2 and 3 shows the running times for demultiplexing as well as the specificity relation to k-mer length. The running times scales linearly with the number of reads demultiplexed.
The user must provide the starting position of the barcode within the read. We allow for some variation of this site (positional error) considering that insertions or deletions may occur before the barcode location. When setting the positional error during demultiplexing, the user should consider the time cost of this (Table 3) as well as the design step padding. For example, designing with padding zero results in a fully global alignment, which over-estimates the edit distances between barcodes. We suggest the user to set the positional inaccuracy to the padding used in design stage for maximal demultiplexing accuracy. Also note that TagGD supports demultiplexing reads in which the position of the barcode is unknown, by setting the positional error to -1. This will be considerably slower and will result in less accuracy.

Discussion
TagGD is can be downloaded from Github repository and it can be run on Linux and Windows operating systems. We provide user instructions and sample input files to ease to usage of the package (Supplementary Information). We also provide a set of scripts to test the accuracy and speed of the TagGD.
TagGD can generate large numbers of unique DNA sequences that can be used for various biological applications. We believe that TagGD can be used for any given application that requires indexing of sequences and it is implemented in a flexible way so that it can be adapted for various lengths, sequence content as well as their interference with sequences used in the experiments can be also checked against the index sequences. One may consider sequencing a large number of single cells where each cell is barcoded to provide the necessary transcriptome resolution or generating barcodes for the recently developed massively parallel gene reporter assay [15]. We also provide the algorithm to map back the index sequences with full accuracy as long as the actual error rate of the experiment does not exceed the estimated error rate used for designing the barcodes.
Barcode design unit of TagGD can be adapted to design multiplex PCR primers or microarray probes for a given genome. This would require replacing the random generation of barcodes step with given sequences from the genome of interest. Additionally demultiplexing unit of TagGD can be adapted to retrieve barcodes for multiplexed sequencing experiments to improve accuracy and speed of sample identity recovery.