Fig 1.
(A) Overview of the SALSA2 scaffolding algorithm. (B) Linkage information obtained from the alignment of Hi-C reads to the assembly. Arrows denote contigs, and arcs between arrows denote the linking information inferred from Hi-C reads. The thickness of an arc denotes the weight of the Hi-C edge: a thicker arc indicates a higher edge weight implied by the Hi-C reads. (C) Assembly graph obtained from the assembler, where arrows are contigs and arcs denote overlaps between contigs. (D) Hybrid scaffold graph constructed from the links obtained from the Hi-C read alignments and the overlap graph. Solid edges indicate the linkages between different contigs, and dotted edges indicate the links between the ends of the same contig. B and E denote the start and end of contigs, respectively. The E-E edge between the blue and red contigs is dashed because this particular orientation is not supported by the assembly graph; the B-E edge is supported instead. We ignore this dashed edge while computing the maximal matching. (E) Maximal weighted matching obtained from the graph using a greedy weighted maximum matching algorithm. The numbering of the edges indicates the order in which they were added to the graph. No more solid edges can be added to the matching, as doing so would assign more than one edge to already matched nodes. (F) Edges between the ends of the same contigs are added back to the matching to obtain the final scaffolds.
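Panels (D) to (F) can be illustrated with a minimal Python sketch. It is not the SALSA2 implementation: the function names `greedy_matching` and `build_scaffolds`, the `(contig, end)` node encoding, and the example weights are assumptions made for illustration. Edges rejected by the assembly graph (the dashed edge in panel D) are assumed to have been removed from the input beforehand.

```python
def greedy_matching(edges):
    """Greedy weighted matching over contig ends.

    edges: iterable of (u, v, weight), where u and v are contig ends
    encoded as (contig_name, "B") or (contig_name, "E").
    Returns the selected edges in the order they were added.
    """
    matched = set()
    chosen = []
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        if u[0] == v[0]:
            continue  # links within one contig are not Hi-C joins
        if u in matched or v in matched:
            continue  # an end may take part in at most one join
        matched.update((u, v))
        chosen.append((u, v, w))
    return chosen


def build_scaffolds(contigs, chosen):
    """Add back the edges between the two ends of each contig and walk
    the resulting paths. Returns lists of (contig, orientation)."""
    partner = {}
    for u, v, _ in chosen:
        partner[u] = v
        partner[v] = u
    other = {"B": "E", "E": "B"}
    seen = set()
    scaffolds = []
    for contig in contigs:
        if contig in seen:
            continue
        free = [end for end in ("B", "E") if (contig, end) not in partner]
        if not free:
            continue  # interior contig, or part of a cycle (not handled here)
        current, entry = contig, free[0]
        path = []
        while True:
            seen.add(current)
            path.append((current, "+" if entry == "B" else "-"))
            exit_node = (current, other[entry])
            if exit_node not in partner:
                break
            current, entry = partner[exit_node]
        scaffolds.append(path)
    return scaffolds


# Hypothetical Hi-C edge weights between the ends of three contigs.
example_edges = [
    (("a", "E"), ("b", "B"), 10),
    (("b", "E"), ("c", "E"), 7),
    (("a", "E"), ("c", "B"), 5),
    (("b", "E"), ("c", "B"), 3),
]
example_matching = greedy_matching(example_edges)
example_scaffolds = build_scaffolds(["a", "b", "c"], example_matching)
```

In this toy input the two heaviest edges are selected and the remaining two are rejected because one of their ends is already matched, which mirrors the stopping condition described for panel (E).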
Fig 2.
Example of the mis-assembly detection algorithm in SALSA2.
The plot shows the position on the x-axis and the physical coverage on the y-axis. The dotted horizontal lines show the different thresholds tested to find low physical coverage intervals. The lines at the bottom show the suspicious intervals identified by the algorithm. The dotted line through the intervals shows the maximal clique. The smallest interval (purple) in the clique is identified as a mis-assembly, and the contig is broken into three parts at its boundaries.
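The procedure in this caption can be sketched in Python under simplifying assumptions: coverage is given per position as a list, intervals are half-open, only one clique is considered, and low coverage at contig ends receives no special treatment. The function names and the example values are hypothetical and are not taken from SALSA2.

```python
def low_coverage_intervals(coverage, threshold):
    """Return half-open (start, end) runs where coverage < threshold."""
    intervals, start = [], None
    for pos, cov in enumerate(coverage):
        if cov < threshold:
            if start is None:
                start = pos
        elif start is not None:
            intervals.append((start, pos))
            start = None
    if start is not None:
        intervals.append((start, len(coverage)))
    return intervals


def suspicious_interval(coverage, thresholds):
    """Collect low-coverage intervals for every threshold, find the
    position shared by the most intervals (a maximum clique of the
    interval graph), and return the smallest interval in that clique."""
    intervals = [iv for t in thresholds
                 for iv in low_coverage_intervals(coverage, t)]
    if not intervals:
        return None
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()  # at equal positions, ends (-1) sort before starts (+1)
    depth = best_depth = 0
    best_pos = None
    for pos, delta in events:
        depth += delta
        if depth > best_depth:
            best_depth, best_pos = depth, pos
    clique = [(s, e) for s, e in intervals if s <= best_pos < e]
    return min(clique, key=lambda iv: iv[1] - iv[0])


def break_contig(sequence, interval):
    """Split a contig into three parts at the interval boundaries."""
    start, end = interval
    return [sequence[:start], sequence[start:end], sequence[end:]]


example_coverage = [10, 9, 8, 4, 1, 0, 1, 5, 9, 10]
example_interval = suspicious_interval(example_coverage, thresholds=[2, 5, 9])
example_parts = break_contig("ACGTACGTAC", example_interval)
```

Because the intervals found at stricter thresholds nest inside those found at looser ones, the smallest interval in the clique corresponds to the deepest part of the coverage dip.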
Table 1.
Hi-C library statistics for different datasets used in this paper.
Mapped read pairs denote the total number of Hi-C read pairs aligned before mapping quality filtering. Intra-contig read pairs are those in which both reads align to the same contig, and inter-contig read pairs are those in which the two reads align to different contigs.
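The two categories can be tallied as in the following small Python sketch, which assumes each mapped read pair has already been reduced to the names of the two contigs it aligns to. The function name and the input representation are assumptions for illustration.

```python
from collections import Counter


def count_pair_types(pairs):
    """pairs: iterable of (contig_of_read1, contig_of_read2)."""
    counts = Counter()
    for contig1, contig2 in pairs:
        counts["intra" if contig1 == contig2 else "inter"] += 1
    return counts


example_pairs = [("c1", "c1"), ("c1", "c2"), ("c3", "c3"), ("c2", "c4")]
example_counts = count_pair_types(example_pairs)
```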
Fig 3.
Comparison of orientation, ordering, and chimeric errors in the scaffolds produced by SALSA2 and 3D-DNA on the simulated data.
As expected, the number of errors of every type decreases with increasing input contig size. Incorporating the assembly graph reduces errors across all categories and most assembly sizes, with the largest decrease seen in orientation errors. SALSA2 utilizing the graph has 2- to 4-fold fewer errors than 3D-DNA.
Fig 4.
(A) NGA50 statistic for different input contig sizes and (B) the length of longest error-free block for different input contig sizes.
Once again, the assembly graph typically increases both the NGA50 and the largest correct block.
Table 2.
Scaffold and correctness statistics for NA12878 assemblies scaffolded with different Hi-C libraries.
“True links” is an idealized case where the Hi-C links have been filtered in advance. The NG50 of the human reference GRCh38 is 145 Mbp. The ratio between NG50 and NGA50 indicates how strongly erroneous joins affect the large scaffolds in the assembly. The bigger the difference between these values, the more aggressive the scaffolding was at the expense of accuracy. Longest chunk represents the longest error-free portion of the scaffolds. We observed that the 3D-DNA mis-assembly detection was overly aggressive in some cases, so we ran some assemblies both with and without this feature. With the Illumina assembly as input, 3D-DNA with correction did not finish within two weeks and is omitted. An evaluation of a previously published [20] 3D-DNA assembly from short-read contigs is included in S3 Table; it did not exceed SALSA2’s NGA50.
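For readers unfamiliar with these statistics, the following Python sketch computes NGx under the usual definition: the largest length L such that sequences of length at least L cover x% of the genome size. NGA50 is taken here as the same calculation applied to the aligned blocks obtained after breaking scaffolds at mis-assemblies. The function name and the toy lengths are hypothetical and are not values from this table.

```python
def ngx(lengths, genome_size, x=50):
    """Return the NGx value, or 0 if the sequences never reach x% of the genome."""
    target = genome_size * x / 100
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if total >= target:
            return length
    return 0


# Toy example: scaffold lengths, and block lengths after splitting at errors.
example_scaffolds = [50, 30, 20, 10]
example_blocks = [25, 25, 30, 20, 10]
example_ng50 = ngx(example_scaffolds, genome_size=120)
example_nga50 = ngx(example_blocks, genome_size=120)
example_ratio = example_ng50 / example_nga50
```

A ratio close to 1 means that breaking scaffolds at errors barely changes contiguity; larger ratios indicate that long scaffolds contain erroneous joins.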
Fig 5.
Feature response curves for (A) assemblies obtained from contigs as input, (B) assemblies obtained from mitotic Hi-C data, and (C) assemblies obtained using Dovetail Chicago data.
The best assemblies lie near the top left of the plot, with the largest area under the curve.
Fig 6.
Chromosome ideogram generated using the coloredChromosomes [39] package.
Each color switch denotes a change in the aligned sequence, due to either a large structural error or the end of a contig/scaffold. Left: input contigs aligned to the GRCh38 reference genome. Right: SALSA2 scaffolds aligned to the GRCh38 reference genome. More than ten chromosomes are in a single scaffold. Chromosomes 16 and 19 are more fragmented due to scaffolding errors that break the alignment.
Fig 7.
Contiguity plots for scaffolds generated with (A) standard Arima-HiC data, (B) mitotic Hi-C data, and (C) Chicago data.
The X-axis denotes the NGAX statistic and the Y-axis denotes the corrected block length to reach the NGAX value. SALSA2 results were generated using the assembly graph, unless otherwise noted.
Fig 8.
Contact map of Hi-C interactions on chromosome 3 generated by the Juicebox software [41].
The maps are from (A) cells sequenced under normal conditions, (B) cells sequenced during mitosis, and (C) Dovetail Chicago data.