Illumina TruSeq Synthetic Long-Reads Empower De Novo Assembly and Resolve Complex, Highly-Repetitive Transposable Elements

High-throughput DNA sequencing technologies have revolutionized genomic analysis, including the de novo assembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, in part due to the presence of dispersed repeats which introduce ambiguity during genome reconstruction. Transposable elements (TEs) can be particularly problematic, especially for TE families exhibiting high sequence identity, high copy number, or complex genomic arrangements. While TEs strongly affect genome function and evolution, most current de novo assembly approaches cannot resolve long, identical, and abundant families of TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly-parallel library preparation and local assembly of short read data and which achieve lengths of 1.5–18.5 Kbp with an extremely low error rate (0.03% per base). To test the utility of this technology, we sequenced and assembled the genome of the model organism Drosophila melanogaster (reference genome strain y; cn, bw, sp) achieving an N50 contig size of 69.7 Kbp and covering 96.9% of the euchromatic chromosome arms of the current reference genome. TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recovered and accurately placed 4,229 (77.8%) of the 5,434 annotated transposable elements with perfect identity to the current reference genome. As TEs are ubiquitous features of genomes of many species, TruSeq synthetic long-reads, and likely other methods that generate long-reads, offer a powerful approach to improve de novo assemblies of whole genomes.

In brief, scaffolding is accomplished by re-aligning the input short reads to the contigs using BWA aligner 47 [3], and using the paired-end alignments to infer scaffold structure. The link between two contigs is made 48 when two or more paired reads map such that read 1 from a read pair maps to one contig and read 2 from 49 the same read pair maps to the other. The orientation of the contigs relative to one another is also inferred 50 from the orientation of the read pairs. In addition, the end-marker sequences are used to help guide and 51 constrain the construction of our scaffold graph 52 Gap Filling 53 The next step of this module is to fill in scaffold gaps where possible in order to resolve repeats. In this 54 step, we use the input short reads, making use of the FM index computed during the contig assembly. We 55 begin by finding the highest-scoring read which matches the end of one of the contigs, and continue to chain 56 together reads iteratively. If a chain is found that overlaps another contig in the same scaffold, the consensus 57 is retained and the gap filled with this sequence. 58 Assembly QC and Correction 59 The final stage of the analysis pipeline involves verification of the scaffolds and error correction. The short 60 read data is again aligned against the scaffolds generated in the previous step using BWA aligner [3]. Based 61 on the alignments, the scaffolds are corrected for single-nucleotide errors and broken into smaller scaffolds 62 should there be only partial alignment support. Quality scores for the final long reads are also estimated 63 from the alignments. 64 Breaking Scaffolds 65 The short reads used during the synthetic long-read assembly are aligned to the scaffolds. The alignments are 66 searched for read pairs in which one read aligns and the other one does not. Unaligned reads are re-aligned, 67 and reads that are overlapping or running into scaffold gaps are counted and computed. In order to determine 68 whether or not to break a scaffold gap, Illumina computes the following formula: If this ratio is smaller than 0.1, the gap is left as is. If it is larger, the scaffold is broken at this gap. If 73 there are only few reads or none, the scaffold for the region is left as is.

Q-scores
From the alignments of short reads to the scaffolds, a pileup file is generated which provides the base quality 76 scores of the aligned reads at each position in a scaffold. The quality score at each scaffold position is then 77 estimated from the read base qualities as follows:

78
• Remove 'N's and indels from the pileup.

79
• If coverage > 5× and all nucleotides at this position agree, set Q-score to max of pileup.

81
• If all of the above steps fail, look at the most frequently-occurring nucleotide in the pileup as well as  including acetic acid bacteria from the genera Gluconacetobacter, Gluconobacter, and Acetobacter (Table S2   99 in Supporting File S1). Because contamination was extremely rare and because we could not exclude that 100 sequences with no BLAST hits may correspond to fly-derived sequences not previously assembled in the 101 reference genome, we included all sequences in downstream analyses.

Genome assembly from TruSeq synthetic long-reads
Assembly with the Celera Assembler

104
The following Celera Assembler parameters are roughly based on those recommended for PacBio consensus-105 corrected reads: http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=PBcR#Assembly_ 106 of_Corrected_Sequences. Based on our goal of assembling separate copies of TEs, however, we elected to 107 use a greater k-mer size and k-mer threshold to increase specificity and reduce the number of false joins 108 (which could generate chimeric sequences). The bogart unitigger, which is recommended for Illumina data or Illumina data in combination with other 122 data types, and is also employed in the PacBio corrected read assembly pipeline. We required overlap of 123 at least 800 bp in order to merge across reads, a parameter that further increases overlap specificity. Error 124 rates are set substantially lower than the default options, given the low observed rate of mismatches to the 125 reference genome in the TruSeq synthetic long reads as well as the fact that we sequenced a highly inbred 126 strain of D. melanogaster. These parameters are intentionally conservative to avoid the erroneous merging of 127 contigs at identical repeats. Modifications to these parameters may increase overlap sensitivity and achieve 128 greater contig lengths, but likely at the expense of mis-assembly. Assembly for species with higher rates of 129 polymorphism would require error rates to be set higher to avoid separate assembly of individual haplotypes.

Contig merging with Minimus2
NUCmer [5, 6] alignment to the reference genome revealed that in some cases, the Celera Assembler produced 132 contigs with ends with long stretches (>1 Kbp) of perfect sequence identity. As we demonstrated in the main 133 text, many of these cases represent regions of low coverage in synthetic long reads, where data were insufficient 134 to support a join. We therefore used the simple overlap-based assembler Minimus2 to generate supercontigs 135 from the contigs output by Celera. The parameters used for this assembly were: The parameter REFCOUNT=0 means that the assembler performs all vs. all alignment of the contigs, 143 rather than merging two separate assemblies (a common application of Minimus2 The same alignment file (.delta) is also analyzed to define the search space for TEs and genes: https: 178 //github.com/rmccoy7541/assess-assembly. The steps in the pipeline are as follows:

179
• Map contigs to the reference genome with NUCmer, extracting only the optimal mapping of each contig 180 to one position in the reference.

181
• Check whether both the start and end boundary of the gene or TE fall within the same aligned contig.

182
• If so, perform local alignment between the reference sequence of the gene or TE and the corresponding 183 aligned sequence.

184
• Calculate the percent identity and the proportion of the gene or TE's length that was assembled and    A.

B.
Supplemental Tables  Table S1:     , we report the average length of TE copies within each family, the average divergence between each copy and the canonical sequence, and the number of elements that comprise each family. We then report the number of elements of each family entirely recovered in our assembly with perfect identity to the reference genome, as well as the number that are partially recovered, mis-assembled, or contain mismatches relative to the reference. Finally, we report the number of elements from each family that are entirely absent from the assembly (i.e., both start and end coordinates lie within alignment gaps).  , we report the average length of TE copies within each family, the average divergence between each copy and the canonical sequence, and the number of elements that comprise each family. We then report the number of elements of each family entirely recovered in our assembly with perfect identity to the reference genome, as well as the number that are partially recovered, mis-assembled, or contain mismatches relative to the reference. Finally, we report the number of elements from each family that are entirely absent from the assembly (i.e., both start and end coordinates lie within alignment gaps).    ctg100000966696  ctg100000966814  ctg100000966837  ctg100000967379  ctg100000967449  ctg100000967457  ctg100000967511  ctg100000967560  ctg100000967605  ctg100000967626  ctg100000967687  ctg100000967750  ctg100000967783  ctg100000967784  ctg100000967787  ctg100000967852  ctg100000967896  ctg100000967928  ctg100000967969  ctg100000968010  ctg100000968064  ctg100000968094  ctg100000968196  ctg100000968200  ctg100000968250  ctg100000968272  ctg100000968281