Pairtools: From sequencing data to chromosome contacts

doi:10.1371/journal.pcbi.1012164

Fig 1.

Processing 3C+ data using pairtools.

a. Outline of 3C+ data processing leveraging pairtools. First, a sequenced DNA library is mapped to the reference genome with sequence alignment software, typically using bwa mem for local alignment. Next, pairtools extracts contacts from the alignments in.sam/.bam format. Pairtools outputs a tab-separated.pairs file that records each contact with additional information about alignments. A.pairs file can be saved as a binned contact matrix of counts with other software, such as cooler. The top row describes the steps of the procedure; the middle row describes the software and chain of files; the bottom row depicts an example of each file type. b. Three main steps of contact extraction by pairtools: parse, sort, and dedup. Parse takes alignments of reads as input and extracts the pairs of contacts. In the illustration, alignments are represented as triangles pointing in the direction of read mapping to the reference genome; each row is a pair extracted from one read. The color represents the genomic position of the alignment with the smallest coordinate, from the leftmost coordinate on the chromosome (orange) to the rightmost coordinate on the chromosome (violet). Sort orders mapped pairs by their position in the reference genome. Before sorting, pairs are ordered by the reads from which they were extracted. After sorting, pairs are ordered by chromosome and genomic coordinate. Dedup removes duplicates (pairs with the same or very close positions of mapping). The bracket represents two orange pairs with very close positions of mapping that are deduplicated by dedup.

More »

Expand

Fig 2.

Auxiliary tools for building feature-rich pipelines.

a. Header verifies and modifies the.pairs format. b-d. Flip, select, and sample are for pairs manipulation. e-f. Scaling and stats are used for quality control. For scaling, we report scalings for all pairs orientations (+-, -+, ++, —) as well as average trans contact frequency. Orientation convergence distance is calculated as the last rightmost genomic separation that does not have similar values for scalings at different orientations. g-h. Restrict and phase are protocol-specific tools that extend pairtools usage for multiple 3C+ variants.

More »

Expand

Fig 3.

Benchmark of different Hi-C mapping tools for one mln reads in 5 iterations (data from [64]).

a. Runtime per tool and number of cores. The labels at each bar of the time plot indicate the slowdown relative to Chromap [58] with the same number of cores. b. Maximum resident set size for each tool and number of cores. c. Runtime per tool and number of cores compared to the runtime of the corresponding mapper (gray shaded areas). Labels at the bars reflect the percentage of time used by the mapper versus the time used by the pair parsing tool. d. Maximum resident set size for each tool and number of cores compared with that of the corresponding mapper. To make the comparison possible, the analysis for each tool starts with.fastq files, and the time includes both read alignment and pairs parsing. For pairtools, we tested the performance with regular bwa mem [34] and bwa mem2 [35], which is ~2x faster but consumes more memory. Note that for HiC-Pro, we benchmark the original version and not the recently-rewritten nextflow [65] version that is part of nf-core [66]. FANC, in contrast to other modular 3C+ pairs processing tools, requires an additional step to sort.bam files before parsing pairs that we include in the benchmark. For Juicer, we use the “early” mode. Chromap is not included in this comparison because it is an integrated mapper [58].

More »

Expand

Table 1.

Qualitative comparison of the tools for pairs extraction from 3C+ sequencing data.

We consider methods modular if they have multiple tools that can be used separately or combined in a custom order. HiCExplorer is modular, but its tool for contact extraction is not (indicated with *). We consider methods flexible if they allow parameterization of data processing (e.g., restriction enzyme). We do not consider control only over technical parameters, like the number of cores, to be flexible. For restriction sites, we consider whether a method can either annotate or filter by restriction site.

More »

Expand