MAGERI: Computational pipeline for molecular-barcoded targeted resequencing

doi:10.1371/journal.pcbi.1005480

Fig 1.

MAGERI pipeline.

The figure describes four steps implemented in the MAGERI pipeline. The pipeline starts with raw FASTQ files (either single- or paired-end), UMI tagging information (such as primer and adapter sequences containing random N bases, or the coordinates of N bases in raw reads) and reference information (FASTA file, BED file with genomic coordinates and contig information). UMIs are extracted from raw reads and used to group reads into molecular identifier groups (MIGs), which are then assembled into consensus sequences. Consensus sequences are then mapped to the corresponding references, variant calling is performed, and MAGERI Q scores are computed for substitutions using a Beta-Binomial model. The model accounts for PCR errors introduced during the UMI tagging step when UMIs are attached using PCR or RT-PCR, or for 1st cycle PCR errors when UMIs are attached using ligation.
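The UMI extraction, MIG grouping and consensus assembly steps described above can be illustrated with a minimal Python sketch. This is not MAGERI's implementation: the function names, the fixed UMI coordinates (`umi_start`, `umi_len`), the `min_mig_size` cutoff and the per-position majority vote over unaligned reads are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def extract_umi(read, umi_start, umi_len):
    """Split a read into (UMI, remaining sequence) using hypothetical fixed coordinates."""
    umi = read[umi_start:umi_start + umi_len]
    rest = read[:umi_start] + read[umi_start + umi_len:]
    return umi, rest


def group_by_umi(reads, umi_start=0, umi_len=4):
    """Group reads that share a UMI into molecular identifier groups (MIGs)."""
    migs = defaultdict(list)
    for read in reads:
        umi, rest = extract_umi(read, umi_start, umi_len)
        migs[umi].append(rest)
    return dict(migs)


def consensus(seqs):
    """Per-position majority vote; assumes reads are already in register (no alignment)."""
    length = min(len(s) for s in seqs)
    return "".join(
        Counter(s[i] for s in seqs).most_common(1)[0][0] for i in range(length)
    )


def assemble(reads, umi_start=0, umi_len=4, min_mig_size=2):
    """Build one consensus per MIG, dropping MIGs with too few reads (assumed cutoff)."""
    migs = group_by_umi(reads, umi_start, umi_len)
    return {
        umi: consensus(seqs)
        for umi, seqs in migs.items()
        if len(seqs) >= min_mig_size
    }


reads = ["AAAAACGT", "AAAAACGT", "AAAAACCT", "CCCCTTTT"]
consensuses = assemble(reads)
```

In this toy input, the single mismatching read in the `AAAA` group is outvoted by the other two, which is the intuition behind UMI-based error correction; the `CCCC` group is dropped because it contains only one read.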


Table 1.

Datasets used for MAGERI benchmark.


Fig 2.

MAGERI software benchmark using Tru-Q 7 reference standard and control donor DNA.

a Number of detected variants for each variant frequency tier across two independent experiments with the reference standard. Shaded areas show the 95% confidence intervals for the expected fraction of recovered variants, i.e. binomial proportion confidence intervals built using known variant frequency and template coverage. b Frequency distribution of known Tru-Q 7 variants coming from each frequency tier and errors in the control donor DNA. c MAGERI Q score and the empirical P-values of erroneous variants detected in control donor DNA. d Comparison of the Q score distribution of erroneous variants and variants of each frequency tier. Dotted and dashed lines show P < 0.05 and P < 0.01 thresholds respectively. e Receiver operating characteristic (ROC) curve comparing the sensitivity and specificity of MAGERI Q scores (blue line) and frequency-based thresholding (red line) in the task of classifying errors and 0.1% tier variants.
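To make the comparison in panel e concrete, the following Python sketch shows how ROC points can be computed for any score that ranks variants above errors, whether a Q score or a raw frequency. The function names and the toy scores are hypothetical and are not taken from the benchmark data; the sketch assumes at least one true variant and one error are present.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    scores: higher values mean "more likely a true variant"
    labels: True for known variants, False for errors
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    # Sweep thresholds from strictest to most permissive.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= thr and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= thr and not l)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy example: two true variants score above two errors, so separation is perfect.
toy_points = roc_points([30, 25, 10, 5], [True, True, False, False])
toy_auc = auc(toy_points)
```

Running the same function once with Q scores and once with variant frequencies as `scores` would give the two curves being compared.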


Fig 3.

Detection of BRAF gene variants in tumor and plasma samples from two cancer patients.

Each point represents a variant and is colored according to its MAGERI Q score; the upper panel of each plot shows reference (top) and variant (bottom) bases. Variants passing the Q 20 threshold (P < 0.01) are shown with bold circles. Chromosome positions are given in hg19 assembly coordinates.
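The correspondence between Q 20 and P < 0.01 is what a Phred-like scaling, Q = -10 log10(P), would give. The short Python sketch below assumes that scaling; the helper names are hypothetical.

```python
import math


def p_to_q(p):
    """Convert a P-value to a Phred-like Q score (assumed scaling: Q = -10 * log10(P))."""
    return -10.0 * math.log10(p)


def q_to_p(q):
    """Inverse conversion from a Q score back to a P-value."""
    return 10.0 ** (-q / 10.0)


q_for_p01 = p_to_q(0.01)  # the Q 20 threshold
q_for_p05 = p_to_q(0.05)  # roughly Q 13
```

Under this assumption, a P < 0.05 threshold corresponds to a Q score of about 13.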


Fig 4.

MAGERI performance on different types of UMI-tagged data.

a. Analysis of single-strand consensuses from duplex sequencing data. Q scores of detected variants are plotted against empirical P-values. A smoothed fit is shown with a red line, and the ABL variant known to be present in the sample at ~1% frequency is shown with a black dot. b. Analysis of UMI-tagged HIV cDNA sequencing data. MAGERI Q scores are plotted against empirical P-values for a control unmutated HIV cDNA from the 8E5 cell line (red) and an HIV+ donor plasma sample (blue). c. Indel variants detected in the Tru-Q 7 reference standard and PBMC DNA of a healthy donor. Indel frequency is plotted against indel size (number of added/deleted nucleotides). The figure shows the known EGFR deletion (ΔE746 − A750) in two independent experiments with a known frequency of 1% (original Tru-Q 7 reference standard) and 0.1% (Tru-Q 7 reference standard diluted in 1:9 ratio with healthy donor DNA). Erroneous variants present in healthy donor DNA are shown with empty circles.
