Fig 1.
Barplots and boxplots showing variations in basic sequencing metrics between replicated samples and batches.
Barplots of (A) the total number of reads (i.e., the number of F3- and F5-tagged paired reads) and (B) the percentage of mapped reads in the target region for each replicate pair. The boxplots show the batch-to-batch differences in (C) the total number of reads, (D) the percentage of mapped reads in the target region, (E) the average coverage, and (F) the percentage of nucleotides with ≥20x coverage within the target region.
Table 1.
The percentage of target base pairs with at least 1x or 20x coverage for each replicated sample.
Table 2.
Concordance rates between pairs of replicated samples for all unambiguous nucleotide calls over the entire sequenced regions and for SNV calls.
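As an illustration of how a pairwise concordance rate like those in Table 2 could be computed, here is a minimal Python sketch. It assumes that concordance is the fraction of positions called in both replicates at which the two calls are identical; the study may define it differently. The function name `concordance_rate`, the dictionary layout, and the toy genotypes are hypothetical and not taken from the study.

```python
def concordance_rate(calls_a, calls_b):
    """Fraction of positions called in both replicates whose calls agree.

    calls_a, calls_b: dicts mapping (chromosome, position) -> call string.
    Assumption: only positions present in both replicates are compared.
    """
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        raise ValueError("no positions called in both replicates")
    matches = sum(1 for pos in shared if calls_a[pos] == calls_b[pos])
    return matches / len(shared)


# Toy data (hypothetical): three shared positions, one discordant call.
rep1 = {("chr1", 100): "A/G", ("chr1", 250): "C/C", ("chr2", 40): "T/T"}
rep2 = {("chr1", 100): "A/G", ("chr1", 250): "C/T",
        ("chr2", 40): "T/T", ("chr2", 90): "G/G"}

rate = concordance_rate(rep1, rep2)
print(f"concordance = {rate:.3f}")
```

Restricting the comparison to shared positions keeps missing calls from being counted as disagreements; a union-based denominator would be a stricter alternative.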
Fig 2.
Boxplots comparing the values of different factors between the pooled reproduced and not-reproduced SNV calls.
The factors include coverage, variant allele count, variant allele frequency, variant allele quality, and p-value of SNV calls. The numbers beside each boxplot indicate the mean ± standard deviation of the factor values in the group of SNV calls. t-test p-values are shown.
Fig 3.
Boxplots of SNV call concordance between replicated samples by different factors, including (A) nucleotide substitution type (gray-shaded boxes = transversions, open boxes = transitions), (B) genome annotation type, (C) coverage, (D) variant allele count, (E) variant allele frequency, (F) variant allele quality, and (G) SNV call p-value.
The numbers in parentheses (m) represent the median number of SNVs per replicated sample counted in a given category. ANOVA p-values are shown. The coefficient of determination (R²), indicating the proportion of the total variation in concordance rate that is explained by the factor alone, is shown for each factor.
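To make the R² description concrete, the sketch below computes the proportion of variation explained by a single categorical factor as the between-group sum of squares divided by the total sum of squares, as in a one-way ANOVA. This is one common way to obtain such a value and is assumed here rather than confirmed as the procedure used in the study; the function name `r_squared_by_factor` and the concordance values are hypothetical.

```python
def r_squared_by_factor(groups):
    """Proportion of total variation explained by one categorical factor.

    groups: dict mapping factor level -> list of concordance rates.
    Computed as SS_between / SS_total (one-way ANOVA decomposition).
    """
    values = [v for level_values in groups.values() for v in level_values]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        raise ValueError("no variation in the response")
    ss_between = sum(
        len(level_values) * (sum(level_values) / len(level_values) - grand_mean) ** 2
        for level_values in groups.values()
    )
    return ss_between / ss_total


# Toy data (hypothetical): concordance rates in two coverage bins.
toy = {"low_coverage": [0.5, 0.7], "high_coverage": [0.9, 0.9]}
r2 = r_squared_by_factor(toy)
print(f"R^2 = {r2:.3f}")
```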
Fig 4.
Comparison of the relative importance of the five variables in determining the reproducibility of SNV calls.
Importance was assessed using mutual information (A), the Akaike information criterion (B), and Lasso regression (C, D). In panels C and D, the y-axis indicates whether a factor is in the model (y = 1) or not (y = 0). VAC = variant allele count, VAF = variant allele frequency, VAQ = variant allele quality, and p-value = SNP call p-value generated by BioScope.
Table 3.
Concordance rates between replicated samples of SNVs called by the VarScan2 program, and after removal of duplicated reads.