Table 1.
Summary of the dataset.
Fig 1.
Post-filtering data quality control.
(A, B) Distribution of per-position nucleotide quality scores across reads for the MGISEQ-2000 (A) and HiSeq 2500 (B) platforms, shown for forward (R1) and reverse (R2) reads. For each position in the reads, the quality scores of all reads were used to calculate the mean, median, and quantile values displayed in the box plots. (C) Overall quality score distribution for MGISEQ-2000 and HiSeq 2500 data. (D) Distribution of GC content in the data generated by MGISEQ-2000 and HiSeq 2500. FastQC [15] was used for the analysis.
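The per-position summary described in this caption can be sketched as follows. This is a minimal illustration, not FastQC's implementation: the function names, the Phred+33 offset, and the choice of the "inclusive" quantile method are assumptions made for the example.

```python
import statistics


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    The offset of 33 (Sanger/Illumina 1.8+ encoding) is an assumption.
    """
    return [ord(ch) - offset for ch in quality_string]


def per_position_summary(quality_strings, offset=33):
    """Summarise quality scores column by column (one column per read position)."""
    columns = {}
    for quality_string in quality_strings:
        for pos, score in enumerate(phred_scores(quality_string, offset)):
            columns.setdefault(pos, []).append(score)

    summary = []
    for pos in sorted(columns):
        scores = columns[pos]
        if len(scores) >= 2:
            # Quartiles; the interpolation method is an illustrative choice.
            q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
        else:
            q1 = median = q3 = float(scores[0])
        summary.append({
            "position": pos + 1,  # 1-based read position
            "mean": statistics.mean(scores),
            "median": median,
            "q1": q1,
            "q3": q3,
        })
    return summary
```

Each dictionary in the returned list holds the values needed to draw one box of a per-position box plot.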
Fig 2.
Analysis of the coverage distribution for MGISEQ-2000 and HiSeq 2500 using the E704 sample.
(A) Fraction of the genome covered exactly the corresponding number of times. (B) Fraction of the genome covered at least the corresponding number of times. The analysis was performed using the R [17] and BEDtools [18] software packages.
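The two quantities plotted in panels (A) and (B) are related by a cumulative sum over a depth histogram. The sketch below assumes such a histogram is already available as a mapping from depth to number of bases (for example, derived from a coverage tool's output); the function name and input layout are hypothetical, and the authors' actual R and BEDtools workflow is not described in the caption.

```python
def coverage_fractions(depth_histogram):
    """Return (exact, at_least) coverage fractions from a depth histogram.

    depth_histogram maps a coverage depth to the number of genome bases
    observed at exactly that depth (hypothetical input layout).
    """
    total_bases = sum(depth_histogram.values())
    if total_bases == 0:
        raise ValueError("histogram contains no bases")

    # Panel (A): fraction of the genome covered exactly d times.
    exact = {d: n / total_bases for d, n in sorted(depth_histogram.items())}

    # Panel (B): fraction covered at least d times, accumulated from the
    # deepest bin down to the shallowest.
    at_least = {}
    running = 0
    for d in sorted(depth_histogram, reverse=True):
        running += depth_histogram[d]
        at_least[d] = running / total_bases
    return exact, at_least
```

By construction, the "at least" curve equals 1.0 at the smallest depth in the histogram and never increases as depth grows.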
Fig 3.
The results of the QC analysis of read alignment to the reference genome.
(A) Distribution of insert lengths for read pairs of the E704-I library (blue line) and the E704-M library (red line). (B) The number of random errors for HiSeq 2500 (blue line) and MGISEQ-2000 (red line). Reads were aligned with BWA-MEM [19]. QC analysis was performed using bamstats [20, 21].
Table 2.
Mapping statistics for the datasets.
Fig 4.
The total number of “errors” (the sum of “FP” and “FN”) in the detection of SNPs (“total SNP error”) and indels (“total indel error”) arising from the comparison of genomic variants for E704-M (A) and E704-I (B). Four software packages were used for variant calling: Samtools, Strelka2, Sentieon, and GATK. Baseline data is shown in the S2 File.
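The "total error" defined in this caption (FP plus FN, reported separately for SNPs and indels) can be illustrated with a simple set comparison. This is a sketch under stated assumptions: the representation of a variant as a `(chrom, pos, ref, alt)` tuple, the function names, and the rule that every non-SNP counts as an indel are all hypothetical simplifications, and the caption does not specify how the authors performed the comparison.

```python
def is_snp(variant):
    """Treat a variant as a SNP when both alleles are single bases."""
    _chrom, _pos, ref, alt = variant
    return len(ref) == 1 and len(alt) == 1


def total_errors(reference_set, call_set):
    """Count FP, FN and their sum for SNPs and indels.

    Both arguments are sets of (chrom, pos, ref, alt) tuples. Anything that
    is not a SNP is counted as an indel (a simplification for illustration).
    """
    result = {}
    for label, keep in (("SNP", is_snp), ("indel", lambda v: not is_snp(v))):
        expected = {v for v in reference_set if keep(v)}
        called = {v for v in call_set if keep(v)}
        fp = len(called - expected)   # called but absent from the reference set
        fn = len(expected - called)   # in the reference set but not called
        result[label] = {"FP": fp, "FN": fn, "total": fp + fn}
    return result
```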
Table 3.
Variant calling statistics for the datasets.
Table 4.
Variant calling for E704-M versus E704-I.