Comparative analysis of novel MGISEQ-2000 sequencing platform vs Illumina HiSeq 2500 for whole-genome sequencing

The MGISEQ-2000 developed by MGI Tech Co. Ltd. (a subsidiary of the BGI Group) is a new competitor of such next-generation sequencing platforms as NovaSeq and HiSeq (Illumina). Its sequencing principle is based on the DNB and the cPAS technologies, which were also used in the previous version of the BGISEQ-500 device. However, the reagents for MGISEQ-2000 have been refined and the platform utilizes updated software. The cPAS method is an advanced technology based on the cPAL previously created by Complete Genomics. In this paper, the authors compare the results of the whole-genome sequencing of a DNA sample from a Russian female donor performed on MGISEQ-2000 and Illumina HiSeq 2500 (both PE150). Two platforms were compared in terms of sequencing quality, number of errors and performance. Additionally, we performed variant calling using four different software packages: Samtools mpileaup, Strelka2, Sentieon, and GATK. The accuracy of SNP detection was similar in the data generated by MGISEQ-2000 and HiSeq 2500, which was used as a reference. At the same time, a separate indel analysis of the overall error rate revealed similar FPR values and lower sensitivity. It may be concluded with confidence that the data generated by the analyzed sequencing systems is characterized by comparable magnitudes of error and that MGISEQ-2000 and HiSeq 2500 can be used interchangeably for similar tasks like whole genome sequencing.


Background
The combinatorial Probe-Anchor Ligation (cPAL) sequencing technology developed by Complete Genomics was first featured in a paper in 2009 [1]. In 2013, Complete Genomics was acquired by BGI (the Beijing Genomic Institute), and the technology has been subsequently refined [2]. In 2015, a new commercially available second-generation genome analyzer BGI-SEQ-500 was first announced [3]. Since then, the cPAL technology has undergone serious modifications.
The combinatorial Probe-Anchor Synthesis (cPAS) method was an important milestone in the evolution of this technology. The method utilizes fluorescently labeled terminated substrates. In the cPAS method, sequencing occurs as the DNA polymerase begins working with a primer (anchor) complementary to the single DNA strand [4]. DNA nanoballs (DNB) are 160,000 to 200,000-bp-long single-stranded DNA fragments made of replicated butt-joined copies of one of the original library DNA molecules, used for signal amplification. The copies are created in the process of rolling circle amplification of DNA rings, forming a library. Each DNB rests in a separate section of the patterned flow cell, which is ensured by its non-covalent binding to a charged substrate. The flow cell is a silicon wafer coated with silicon dioxide, titanium, hexamethyldisilazane, and a photoresistant material. DNBs are added to the flow cell and selectively binded to positively-charged aminosilanes in a highly-ordered pattern, allowing for the sequencing of a very high density of DNA nanoballs [1], [5]. The sequencing process itself consists of several steps, including the addition of a fluorescently labeled terminated nucleotide (sequencing by synthesis), the cleavage of a terminator during the synthesis process and the detection of the produced fluorescent signal [6], [7], [8].
We would like to emphasize that we were unable to find a detailed description of cPAS-based sequencing in the literature, nor were we able to figure out how it is implemented in MGI-SEQ-2000. However, a patent is available in the public domain that describes the application of the cPAS approach. In this patent, the sequencing process is described as using fluorescently labeled monoclonal antibodies that recognize unique chemical modifications of one of the four terminated dNTPs [9]. In any case, it is not currently possible to obtain full information on MGISEQ-2000 sequencing.
A paper was published two years ago, in which researchers used a reference genomic dataset obtained from GIAB to demonstrate that the BGISEQ-500 platform showed similar accuracy of SNP detection and slightly lower accuracy of indel detection compared to HiSeq 2500, [3]. Several recent studies have compared the performance of these two platforms in ancient DNA [10], metagenome [11] and microRNA [4] sequencing. In general, the quality of the data generated by BGISEQ-500 has proved to be satisfactory, although several of its characteristics were slightly worse than those of Illumina HiSeq 2500.
The Genome in a Bottle Consortium provides reference genomes for benchmarking [12]. By comparing the obtained genomic variants to a reference sequence, one can assess the accuracy/ sensitivity of the tested instrument and the corresponding bioinformatics pipeline for data analysis. In our study, we tested the suitability of the MGISEQ-2000 platform for the assessment of the mutational variability of embryonic cells. To do this, we used the genome of a Russian female egg donor and conducted a genome-wide analysis using two platforms: Illumina HiSeq 2500 and MGISEQ-2000. As HiSeq 2500 is a popular and well-described platform for genomic research, we decided to evaluate the overall error rate in order to understand whether we can use MGISEQ-2000 for the execution of our utilitarian tasks.

Ethics approval and consent to participate
The research was carried out according to The Code of Ethics of the World Medical Association (Declaration of Helsinki). Written informed consent to participate and to publish these case details was obtained from the patient, and the study was approved by the Ethical Committee of Pirogov Russian National Research Medical University, Moscow, Russia.

DNA preparation
A sample of genomic DNA was isolated from whole blood using phenol-chloroform extraction. Quality control was performed using agarose gel electrophoresis (degradation level) and the Qubit dsDNA BR Assay Kit (concentration measurement). The donor was a female resident of the Russian Federation, age 25, her health condition meets all the strong criteria for the procedure of egg donation. MGISEQ-2000. The circularization procedure is essentially the denaturation and renaturation of the DNA library in the presence of excess amounts of a splint oligo (dephosphorylated at the 5'-end) that consists of inverted complementary sequences of adapters ligated to the library. In the process of renaturation with the splint oligo, an annular molecule is formed with a double-stranded structure in the adapter region containing a nick. The nick is sealed by a DNA ligase. Linear DNA library molecules are disposed of at the digestion stage using a mixture of nucleases that cleave linear molecules. A useful scheme was prepared by the MGI's team [13].

Preparation of a library for sequencing
The isothermal synthesis of nanoballs is carried out using the rolling circle amplification (RCA) mechanism and is initiated by the splint oligo. As a result, RCA forms a linear singlestranded DNA consisting of 300-500 repeats. A nanoball is a molecule compactly packed into a coil-like form 200-220 nm in diameter.
The procedure of loading of the nanoballs in the flow cell is simplified and automated: the flow cell has a patterned array structure that facilitates efficient loading (85.5% in our case), which does not depend on the accuracy of library dilution in the case of unordered cells (similar to, for example, Illumina MiSeq or HiSeq 2500). The nanoballs are loaded using a DNB Loader, a device similar to cBot (Illumina). The instrument and the reagents are prepared for sequencing in a way similar to that used for Illumina. Water and maintenance washes must be performed for MGISEQ-2000. The ready-to-use reagents are delivered in a cartridge that needs to be thawed prior to use. A flow cell for MGISEQ-2000 has four separate lanes and one surface, on which DNBs are immobilized.
We used MGIEasy Universal DNA Library Prep Set. 1000 ng of genomic DNA was fragmented using a Covaris ultrasonicator to achieve a length distribution of 100-700 bp with a peak at 350 bp. Size selection was performed using Ampure XP (Beckman). Library concentrations were measured using a Qubit; the amount of DNA used was 289 ng (procedure efficiency 29%). After that, an aliquot of 50 ng of the fragmentation product was transferred to a separate tube for end-repair and A-tailing. For ligation, the equimolarly mixed set of Barcode Adapters 501-508 was used. The ligation product was washed with Ampure XP, and seven PCR cycles were performed after that using primers complementary to the ligated adapters. After the washing of the library with Ampure XP, its concentration was measured using a Qubit. Before the annealing and circularization with splint oligo, the library was normalized to 330 ng in a volume of 60 μL. After linear DNA was digested, the concentration of ring DNA (0.997 ng/ μL) was measured using Qubit with the use of the ssDNA kit.
After RCA and formation of DNBs, the end product was measured using Qubit with the use of the ssDNA kit. The typical range of nanoball concentrations suitable for loading is 8-40 ng/ μL. In our case, the concentration was 20 ng/ μL. Nanoball loading was assisted by a DNB manual loader.
Illumina 2500. 500 ng of genomic DNA was enzymatically fragmented by dsDNA Fragmentase (NEB). The library was prepared using the NEBNext Ultra II kit and indexes from the Dual Index Primers Set 2 (all New England Biolabs) according to the manufacturer's instructions; amplification at the last sample preparation stage was performed in three PCR cycles.
MPS was carried out using the Illumina HiSeq 2500 in the Rapid Run mode (paired-end 150 bp dual indexing) with the use of the 500-cycle v2 reagent kit according to the manufacturer's instructions.

Sequencing
Preparation of genomic libraries and sequencing using MGISEQ-2000 were carried out by our research group at the facilities of MGI Tech. in Shenzhen. Fastq files were generated as described previously using the zebracallV2 software provided by the manufacturer [3].
Library preparation and sequencing on HiSeq 2500 were carried out at the Center for Genome Technologies of Russian National Research Medical University. Fastq files were generated using the Basespace cloud software offered by the manufacturer (https://basespace. illumina.com/analyses/140691740/files/logs).

Data analysis
The detailed description of the sequencing process and the protocols are provided in the S1 File.

Sequencing data summary
In this research, we analyzed two whole-genome datasets obtained by the sequencing of gDNA from a Russian female donor (hereinafter, we will call the sample E704). The donor's genome was sequenced using two platforms: HiSeq 2500 by Illumina and new MGISEQ-2000 by BGI Complete Genomics that have similar performance characteristics. In the case of MGISEQ-2000, DNA was applied onto a separate lane of the flow cell. Sequencing was performed in a paired-end 150 bp mode. We recorded the amount of data generated by MGISEQ-2000 and calculated the average coverage. After that, we sequenced the donor's genome using Illumina HiSeq 2500 in order to obtain a similar amount of data. General sequencing characteristics are presented in Table 1. The detailed description of library preparation is provided in the Materials and Methods section. We would like to note that we used different methods of DNA fragmentation for library preparation: fragmentation by ultrasound for E704-M and enzymatic fragmentation (dsDNA fragmentase) for E704-I. This fact is important for the interpretation of our results.
As shown in Table 1, the size of the obtained dataset, as well as the characteristics of sequencing quality indicated that the datasets could be analyzed and compared. The comparison of the two datasets was unlikely to be skewed by the fact that different fragmentation methods were used [14].

FastQC analysis
The next step in the comparison of the two datasets was to assess the quality of FastQ files using FastQC [15]. We also analyzed all individual FastQ files generated by paired-end sequencing (see Materials and Methods).
FastQC source file analysis demonstrated that the quality of data was acceptable and comparable for both platforms. K-mers were found at the start of the reads in the fastq files generated by MGISEQ-2000-based sequencing and at the end of the reads in the files generated by HiSeq 2500-based sequencing. A deviation from the normal GC-content was observed at the start of the reads in the HiSeq 2500 fastq files. Unremoved adapter sequences in both cases might explain the presence of K-mers. The abnormal GC-content could be a result of enzymatic fragmentation, which apparently causes a deviation from the random distribution pattern. Bearing that in mind, we decided to remove ten nucleotides from 5'-ends of each read in both MGISEQ-2000 and HiSeq 2500 fastq files. Further manipulations were carried out with 130-nucleotide-long fragmented reads. We also trimmed the adapter and other technical sequences (S1 File), which allowed us to save more data and work with a higher average read length. This, however, was not crucial for our purposes, and we proceeded to the next steps of the comparative analysis. We merged all obtained fastq library files containing different barcodes so that each platform was represented by only a pair of fastq files with forward (R1) and reverse (R2) reads, respectively. After merging the fastq files, we repeated the quality assessment procedure using the FastQC service and found that the total data generated by both platforms was of acceptable quality and could be safely compared. Fig 1 shows the assessment of quality of sequencing data by the FastQC service [15]. Data quality was acceptable for each of the nucleotide positions within a read for both MGISEQ-2000 and HiSeq 2500. However, the quality of data representing each position in the MGI-SEQ-2000 fastq file was slightly lower than in the HiSeq 2500 file and tended to gradually deteriorate towards the end of the read (although it was not lower than Q20). For HiSeq 2500-generated data, drops in quality below Q20 were observed only towards the very end of the read. For each nucleotide, the quality of MGISEQ-2000-based sequencing data gradually decreased after 50-60 cycles. In contrast, the total number of high-quality nucleotides was higher for HiSeq 2500 and remained on the same level until the last cycle. A similar picture can be seen in the graphs demonstrating the distribution of reads quality (Fig 1C). The distribution was more uniform for Illumina, meaning that the average quality was higher. The quality of reads generated by the MGISEQ-2000-based sequencing was acceptable, as95% of all reads were above Q30. The GC-content was similar for both platforms (Fig 1D); the distribution graphs are practically identical.

Reads mapping/ alignment and QC
The average coverage is an important characteristic of whole-genome sequencing, as are its distribution and variability. Fig 2 compares the average coverage distribution for MGISEQ-2000 and HiSeq 2500. The figure shows a slightly higher average coverage for MGISEQ-2000 (32.75X for MGISEQ-2000 versus 30.48X for HiSeq 2500). At the same time, the overall coverage distribution is highly uniform for both datasets (Inter-Quartile Range (IQR = 6)), suggesting good sequencing quality [16].
The data presented in Fig 2 was obtained after the FastQC had been performed during the reads alignment. Therefore, the input data was similar in terms of the coverage distribution and the total reads number.
The filtered and trimmed reads were aligned to the reference genome, which was necessary to convert fastq files to BAM files. This was carried out using Burrows-Wheeler Aligner (BWA-MEM) with default settings recommended for the analysis of genomes sequenced on Illumina systems [19]. The quality of read alignment was assessed using the SAMtools software package and the bamstats software module [20,21].
The quality of read alignment was acceptable for both platforms. The insert size for pairedend libraries corresponded to the theoretical size specified in the manufacturer's protocol: 250 bp for Illumina HiSeq 2500 and 400 bp for MGISEQ-2000. The proportion of aligned reads was 99.9% for both BAM files.

Variation calling and false positive/ negative ratio estimation
In order to further assess the quality of MGISEQ-2000 sequencing, as well as to understand the aspects of its potential use, the generated data was subjected to variant calling. After the data was aligned to the reference genome using BWA-MEM [19], the BAM file was modified using four different pipelines: Samtools [20,21], Strelka2 [22], Sentieon [23], and GATK [24].
All software packages used to process the datasets generated by Illumina and MGI demonstrated similar performance in terms of computation speed, which is consistent with the results obtained for BGISEQ [25].
Alignment results are provided in Table 2, it shows that both sequencing platforms performed similarly well. The duplication rate for E704-I was higher than for E704-M, amounting to 12.26%. This value, however, was calculated after we merged the fastq files with different barcodes and obtained from different lanes. In each individual fastq file, the duplication rate did not exceed 5-6% for both instruments (see S2 File). Using Illumina HiSeq 2500, 16 separate fastq files (8 for + 8 rev) were generated. The number of fastq files obtained using MGI-SEQ-2000 was also 16, however, they represented a single flow cell, whereas Illumina's files came from two different flow cells. Therefore, a higher duplication rate recorded for Illumina resulted from the use of two cells. Most likely, the probability of obtaining repeated reads from two independent flow cells is higher than from one cell. As the information in fastq files was summed up, it resulted in an additional 3-4% of duplicates for Illumina-generated data, compared to MGISEQ-2000.
In the absence of a GIAB, we conditionally considered Illumina data to be standard and calculated "error rates" ("False Positive", "False Negative", etc.) in the E704-M dataset using E704-I as a reference. This approach cannot be used to assess the accuracy of the MGISEQ technology, however, it did allow us to conclude that the two compared technologies can be used interchangeably for similar tasks without significant loss of accuracy.  In total, over 3.7 million SNPs were detected in the datasets generated by each of the tested platforms. The E704-M sample contained 3,730,684 SNPs; the number of detected SNPs in the E704-I sample was comparable (3,719,768 SNPs). This data is shown in Table 3. In addition, we detected a similar Ti/ Tv ratio, which may indirectly indicate the sequencing accuracy.
The MGISEQ-2000 was able to detect slightly more indels (803,736) than HiSeq 2500 (770,193; see Table 3). Generally, HiSeq 2500 performance was characterized by a slightly lower average coverage, which partly explains its indel detection rate. However, given that the dbSNP indel rate recorded by HiSeq 2500 was slightly higher (92.1% in E704-I versus 90.86% in E704-M), this may indicate a lower accuracy of indel detection by the MGISEQ-2000 platform. These observations are consistent with the previous findings for BGISEQ-500 [3].
To assess the accuracy of the detection of certain genomic variants, we chose the E704-I dataset as a reference for the E704-M sample. As a large number of such studies had been   [19]. QC analysis was performed using bamstats [20,21]. https://doi.org/10.1371/journal.pone.0230301.g003 carried out for HiSeq 2500, we decided to determine the level of differences for a single genome. Sequencing using two different instruments allowed us to estimate their interchangeability/similarity. We compared tested platforms using the HiSeq 2500 data as a reference, given that the permissible error rates for this technology have already been established by the Consortium. Further research using sequencing data from GIAB reference sample [12] to directly measure true error rates for the detection of various mutations is needed. We estimated the magnitude of various "errors" and calculated "the F1-metric" using vcfcompare (vcftools [26]) and snpeff [27]) for all detected SNPs. Table 4 compares the variants obtained by variant calling using Strelka2; data generated by other software packages is presented in the S2 File. As a result, using the "accessible genome" matrix, we discovered that the sensitivity of SNPs determination in the E704-M sample was 99.51% relative to the E704-I sample, with an "FPR" (false positive rate) value of 0.000254% (F1 metrics = 99.65%). For indels, the sensitivity was 98.84% (F1 metrics = 98.81%). It should be noted that although we did not perform a comparison with the reference sequence, the level of convergence of genotypes for MGISEQ-2000 and Illumina Hiseq2500 was high enough for both the accessible genome and the complete sequence of the read genome. This demonstrated that the MGISEQ-2000 sequencing had higher accuracy compared to previously obtained data for BGISEQ-500 [3]. This data is shown in Table 4.

Discussion
We compared two genomic datasets generated by Illumina HiSeq 2500 and MGISEQ-2000-based sequencing. As part of our study, we aimed to understand whether MGISEQ-2000 could be used for the whole-genome sequencing of embryos, SNP detection and other tasks that our laboratory performs. Our study demonstrated that MGISEQ-2000 provided datasets possessing characteristics similar to the data generated by the "gold standard" of the NGS analysis-the Illumina platform. Given a comparable amount of output data (101.37Gb for MGISEQ and 94.37Gb for Illumina), the average coverage for the two sets was comparable: 32.75X for MGISEQ-2000 versus 30.48X for HiSeq250; the coverage distribution patterns were almost identical (Fig 1).
The analysis demonstrated that the studied instruments provide similar sequencing quality. The existing differences can be explained by the specifics of the preliminary steps of library preparation and are not the result of the features of the sequencing techniques themselves.
Four different pipelines were used to perform variant calling. The detection rate of genomic variants in the two datasets was similar. The computational time required to process the obtained data was comparable for all software packages and all datasets used. The performance of Strelka2 was characterized by the lowest number of errors (Fig 4).
The quality of data obtained with MGISEQ-2000 was inferior in several respects to that generated by Illumina HiSeq 2500. Specifically, the frequency of random sequencing errors, the percentage of quality reads, and the accuracy of indel detection were higher for HiSeq 2500. However, the magnitude of those differences is small and insignificant for most research tasks. Last but not least, sequencing costs are an important factor for the laboratories. To our knowledge, the MGISEQ-2000 platform is comparable to NovaSeq in terms of costs, however, it requires a smaller number of samples per run.

Conclusions
The newly-developed sequencer MGISEQ-2000 from BGI Group can be used as a fully-featured alternative to Illumina sequencers in a whole-genome surveys (variant calling, indels detection). Raw data quality had equal metrics. Differences between two platforms that we found in the processes of variant calling and indel detection were negligible.