Molecular Characterization of Transgenic Events Using Next Generation Sequencing Approach

Demand for the commercial use of genetically modified (GM) crops has been increasing in light of the projected growth of the world population to nine billion by 2050. A prerequisite of paramount importance for regulatory submissions is the rigorous safety assessment of GM crops. One component of safety assessment is molecular characterization at the DNA level, which helps to determine the copy number, integrity and stability of a transgene; characterize the integration site within a host genome; and confirm the absence of vector DNA. Historically, molecular characterization has been carried out using Southern blot analysis coupled with Sanger sequencing. While this is a robust approach to characterizing transgenic crops, it is both time- and resource-consuming. The emergence of next-generation sequencing (NGS) technologies has provided a highly sensitive, cost- and labor-effective alternative to traditional Southern blot analysis for molecular characterization. Herein, we demonstrate the successful application of both whole genome sequencing and target capture sequencing approaches for the characterization of single and stacked transgenic events, and compare the results and inferences with the traditional method with respect to key criteria required for regulatory submissions.


Introduction
Commercialization of transgenic crops can be achieved only after regulatory approval which requires rigorous assessment of their safety [1,2]. Molecular characterization of transgenic events is an important analysis towards this goal and is conducted at two stages: first, for the selection of desirable events and later for the characterization of selected lead event(s) to support regulatory submissions. A thorough molecular characterization of the transgene locus, determining its sequence, integrity and its location in the genome, is a critical step in the safety assessment process. This characterization also addresses mandatory analysis that determines whether the transgene expression cassette is inserted into the host genome as a single copy, is intact across generations, has made any unintended alterations to the host genome due to insertion, and whether it lacks the backbone sequences derived from the plasmid vector used for the transgenesis. Furthermore, using a segregating population, it has to be proven that the inserted transgene behaves as a Mendelian locus.
A key technique that is widely utilized in molecular characterization is Southern blot (SB) analysis [3]. Although SB, along with polymerase chain reaction (PCR) and Sanger sequencing, is a universally accepted technique for event sorting and molecular characterization studies for regulatory submissions, it is a very time- and labor-intensive and relatively expensive procedure. Moreover, although SB is a robust technique that has been used successfully for the molecular characterization of inserted DNA in regulatory studies for many years, it is not sensitive enough to detect individual nucleotide substitutions and small insertions/deletions that might occur within a transfer DNA (T-DNA) or around a transgene insertion site [4]. Although these disadvantages of SB can be addressed by Sanger sequencing, that technique struggles to accurately sequence complex regions of genomes [5].
The emergence and rapid evolution of next-generation sequencing (NGS) technologies over the past few years have offered novel, rapid, and cost-effective options for molecular characterization of transgenic crops. As NGS has been widely used for the detection of structural variations [6], this technology can also be applied to molecular characterization of transgenic events. The application of NGS for event characterization has been extensively reported in animal biotechnology. For instance, this technology was successfully applied to characterize transgenic events in cattle and mouse [7][8][9]. In contrast to conventional PCR and SB methods, NGS has proven to be very sensitive in detecting incomplete and multiple integration events [7]. This technology was also used to characterize transgene insertion sites that were located in complex regions of a genome [8,9]. In plant biotechnology, the number of publications reporting NGS-based molecular characterization of transgenic events is very limited. Yang et al. [4] confirmed that paired-end re-sequencing was more sensitive than PCR and SB analysis for molecular characterization of transgenic events, as it revealed additional unintended insertions in a transgenic rice event. Kovalic et al. [10] successfully demonstrated that transgenic events can be characterized by combining NGS with Sanger sequencing, which consequently can be used as an alternative to the SB method. They applied a whole genome sequencing approach to determine transgene copy number in maize by re-sequencing the junction regions between the transgene and the flanking border genomic sequences. However, Sanger sequencing was still used for assessing the integrity and stability of the T-DNA across generations [10]. Using a combination of high coverage whole genome sequencing and bioinformatics analysis, T-DNA insertion and copy number were previously demonstrated in papaya by the assembly of a draft genome [11].
A novel hybrid NGS x PCR-based method was developed for high-throughput zygosity detection in transgenic maize. However, the application of this method requires prior knowledge of the exact integration site, the adjacent genomic sequences and the transgene copy number [12]. Most recently, targeted sequence capture coupled with NGS was successfully applied for event sorting [13]. Thus, there is sufficient evidence that NGS can be used for the characterization of plant and animal transgenic events. In this paper, we present and compare the results of molecular characterization of two transgenic soybean events, Transgenic Event 1 (TE1) and Transgenic Event 2 (TE2), and their breeding stack (TE1 x TE2) using traditional (SB analyses coupled with Sanger sequencing) and advanced (NGS) methods (Fig 1). In particular, for the first time, we demonstrate the use of both whole genome sequencing (WGS) and target capture sequencing (TCS) approaches for the characterization of both single and stacked events, and compare the results and inferences with the traditional method with respect to key criteria required for regulatory submissions.

Results
Molecular characterization of two soybean events, TE1 and TE2, and their breeding stack using the traditional approach
Southern blot analysis. To determine transgene copy number, we performed SB analysis using probes designed to hybridize to the inserted DNA (S1 and S2 Figs). The genomic DNA of the two single events (TE1 and TE2) was digested with restriction enzymes that cut the inserted DNA to generate distinct patterns and sizes. Although blots were hybridized with different probes covering the entire T-DNA, only the data related to the gene-of-interest (GOI) probes are shown in this paper. For all three generations tested, SB analysis showed identical patterns and band sizes for all restriction digests on both single events (TE1 and TE2): the probe combinations and hybridizing band patterns indicated that each of these events harbors a single copy of the transgene (Fig A-C in S3 and S4 Figs).
When probes from the plasmid backbone region were used for SB analysis, the expected bands were detected in the respective positive (plasmid) controls, whereas no hybridization signal was detected in the transgenic events, suggesting the absence of the backbone regions in the transgenic events. Using the same probe sets and multiple restriction enzymes as used for the analysis of single events, SB analyses were carried out with the soybean breeding stack, TE1 x TE2 (Fig A-C in S7 Fig). A comparison of Southern band patterns from the breeding stack with those of the single TE1 and TE2 events revealed no differences, indicating that the breeding process did not affect the integrity and copy number of the T-DNA of the corresponding single events comprising the stack (Fig A-C in S7 Fig).
Sanger sequencing. Sanger sequencing has been employed to determine the structural integrity and location of the transgene insert in the genome as well as to identify any rearrangements associated with transgene insertion and any unanticipated changes that may have occurred in the stack compared to single events. The results of Sanger sequencing demonstrated that the T-DNAs within the TE1 and TE2 events were intact at the nucleotide level compared to their counterparts in the plasmids used for transformation (Fig A and B

Molecular characterization of the soybean events and their breeding stack using NGS
In parallel to traditional molecular characterization, the transgenic single events, TE1 and TE2, and their breeding stack, TE1 x TE2, were also characterized using Illumina paired-end (PE) sequencing, both whole genome sequencing (WGS) and targeted capture sequencing (TCS). PE sequencing generates read pairs from both ends of a sheared DNA fragment. When mapped to the combination of the plant reference genome and the plasmid sequence, these read pairs can answer several key questions pertaining to transgenic event characterization, such as the location of the insert, copy number, transgene integrity, stability and the lack of vector backbone.
Location of the insert and copy number. To determine the number of copies of the transgene inserted in the genome and their locations, genomic DNA of the transgenic plant was randomly sheared and sequenced. Such random shearing produces a mixture of three types of fragments: those derived solely from the plant genome, those derived solely from the transgene, and those derived from regions spanning the transgene integration site and thus consisting of both transgene and host DNA. When mapped back to the reference genome and transgene sequences, the PE reads generated from this third type of fragment will have one read of a pair mapped to the transgene and its mate mapped to the plant genome. This class of PE reads will subsequently be referred to as 'junction pairs'. In a subset of these junction pairs, one read of the pair will span the junction, with one portion of the read derived from the transgene and the other portion derived from the genome. These will be referred to as 'junction reads'. The combination of junction pairs and junction reads helps to identify the transgene integration site(s) in the genome. In our WGS experiments, the number of junction pairs varied from event to event. For instance, in the TE1 event, we obtained an average of 72 junction pairs (average of F2 and T3 generations) that had one read mapped within the T-DNA and its mate mapped to the plant genome. Out of these, seven junction reads spanned the 5' junction region and five reads spanned the 3' junction region (Table 1). All 5' and 3' junction reads mapped to a single genomic location, which suggested a single transgene integration site for TE1. If there had been multiple insertions of T-DNA within the host genome, WGS would have yielded a heterogeneous population of junction reads pointing to multiple locations in the genome.
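The distinction between junction pairs and junction reads can be illustrated with a minimal sketch. It is not the custom script used in this study: the `Read` record, the contig name `T-DNA`, and the `split_ref` field (standing in for a clipped or supplementary alignment segment) are all hypothetical simplifications of what an aligner would report.

```python
from dataclasses import dataclass
from typing import Optional

TRANSGENE = "T-DNA"  # hypothetical contig name for the transgene in the combined reference


@dataclass
class Read:
    ref: str                         # contig the main part of the read maps to
    pos: int                         # 1-based leftmost mapped position
    split_ref: Optional[str] = None  # contig of a second (clipped) segment, if any


def is_junction_pair(r1: Read, r2: Read) -> bool:
    """True when exactly one mate maps to the transgene and the other to the genome."""
    return (r1.ref == TRANSGENE) != (r2.ref == TRANSGENE)


def is_junction_read(r: Read) -> bool:
    """True when one read carries both a transgene segment and a genome segment."""
    if r.split_ref is None:
        return False
    return (r.ref == TRANSGENE) != (r.split_ref == TRANSGENE)


def summarize(pairs):
    """Return (number of junction pairs, number of junction reads within those pairs)."""
    junction_pairs = [p for p in pairs if is_junction_pair(*p)]
    junction_reads = sum(is_junction_read(r) for p in junction_pairs for r in p)
    return len(junction_pairs), junction_reads


# Toy data with made-up coordinates: one pair of each fragment type,
# plus one junction pair that also contains a junction read.
pairs = [
    (Read("Gm06", 100), Read("Gm06", 400)),                      # genome only
    (Read("T-DNA", 50), Read("T-DNA", 300)),                     # transgene only
    (Read("T-DNA", 10), Read("Gm06", 5000)),                     # junction pair
    (Read("T-DNA", 1, split_ref="Gm06"), Read("Gm06", 4900)),    # junction pair + junction read
]
print(summarize(pairs))
```

In practice the same classification would be driven by the alignment records of a mapped BAM file rather than hand-built tuples.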
The copy number was estimated by comparing the coverage of the reference genome to the coverage of the transgene. For TE1, the reference coverage and transgene coverage were 10x and 9x, respectively, suggesting a single transgene copy. In the case of multiple copy insertions, the coverage of the transgene would have exceeded the coverage of the genome, with the fold increase reflecting the increased copy number. In both F2 and T3 generations, junction reads mapped to a single locus on soybean chromosome 6 (Tables 1 and 2 and Fig 2E) and identified the integration site of the T-DNA very precisely.
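The coverage comparison amounts to a simple ratio. The sketch below is illustrative only; the function name is hypothetical, and rounding the ratio to the nearest integer is an assumption made here for readability, not a rule stated in this study.

```python
def coverage_ratio(transgene_cov: float, genome_cov: float) -> float:
    """Ratio of mean transgene coverage to mean genome coverage."""
    if genome_cov <= 0:
        raise ValueError("genome coverage must be positive")
    return transgene_cov / genome_cov


# TE1 values quoted in the text: 9x across the transgene, 10x across the genome.
te1_ratio = coverage_ratio(9, 10)
print(f"TE1 ratio: {te1_ratio:.2f}, nearest integer copy number: {round(te1_ratio)}")
```

A ratio close to 1 is consistent with a single copy in a homozygous plant, whereas a ratio near 2 or above would point towards additional copies.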
With respect to TE2, one of the samples tested ('F2-10') was a hemizygous plant, and this was reflected in the coverage data. The hemizygous sample (F2-10) had about half the transgene coverage (~4x) compared to both the reference genome and the homozygous sample (T3-1), which had 9x coverage (Table 1). Despite the relatively lower coverage, in both F2 and T3 generations, junction reads mapped to a single locus on chromosome 2 (Table 1 and Fig 2E), confirming a single-locus integration of TE2.
A similar analysis of the TE1 x TE2 breeding stack also confirmed a single copy (14x genome coverage vs. 11x T-DNA coverage, Table 1) and a single integration of each of TE1 and TE2, with an average of five junction reads supporting the 5' and 3' junctions (Table 1 and Fig 3). However, we did observe that the junction pairs from the 3' end of TE1 mapped to the genome integration sites of both TE1 and TE2, and a similar pattern of mapping to two integration sites was also observed for the junction pairs from the 3' end of TE2 (Fig 3F). As both TE1 and TE2 share a common terminator in the 3' region (T1_E10 = T2_E9; Fig 3F), the reads obtained from this part of the transgene resulted in non-specific mapping of some TE1 pairs to the TE2 location and vice versa. Results from the sequencing data concurred with the corresponding Southern blot results and confirmed the single-copy, single-integration status of each event in the TE1 x TE2 stack. These results suggest that the breeding process did not change the copy number and insertion site of the T-DNA of the single events.
Although the results obtained through WGS related to the copy number and insertion site characterization are consistent with SB analysis and Sanger sequencing, WGS generates relatively low coverage of junction reads, which are crucial in defining the copy number of the T-DNA. A further increase in sequencing depth will increase the coverage of junction reads, but will also increase the cost of the experiments. To address this issue, we explored an alternative sequencing approach, namely target capture sequencing (TCS), which was expected to increase the coverage within the T-DNA and junction regions without generating a large amount of host genomic sequence. TCS uses a WGS library of fragments as described above and a collection of bait probes designed against a desired target sequence. The baits are used as hybridization probes to capture, and thus increase the relative abundance of, fragments from the targeted region. The results of TCS demonstrated a remarkable increase (several thousand-fold) in coverage at junction regions (Table 2). Importantly, TCS data confirmed the single copy status of the T-DNA within both single and stacked events. As the sequence capture method focuses on capturing only the transgene sequence and a small portion of the genomic DNA at the integration site, it lacks broad coverage across the whole genome. The single copy nature of the transgene was therefore determined by the homogeneity of junction reads and not by the comparison of coverage across transgene and genome, as was done for WGS. However, if the copy number of the T-DNA were to be defined by comparing the coverage of the host genome and the transgene, any single copy housekeeping gene could be included in the sequence capture and could serve as a control for copy number estimations.
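Judging the homogeneity of junction reads comes down to asking how many distinct genomic locations the junction reads point to. The following sketch groups junction positions into candidate integration sites. The 1,000 bp grouping window and the coordinates are assumptions chosen for illustration, not parameters from this study.

```python
from collections import defaultdict


def integration_sites(junctions, window=1000):
    """Group (chromosome, position) junction calls into candidate integration sites.

    Positions on the same chromosome that lie within `window` bp of the
    previous position are merged. Returns (chromosome, start, end, n_reads) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in junctions:
        by_chrom[chrom].append(pos)

    sites = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev <= window:
                prev = pos
                count += 1
            else:
                sites.append((chrom, start, prev, count))
                start = prev = pos
                count = 1
        sites.append((chrom, start, prev, count))
    return sites


# Made-up coordinates: all junction reads of a single event fall in one cluster.
single_event = [("Gm06", 1000), ("Gm06", 1005), ("Gm06", 1012)]
print(integration_sites(single_event))
```

A single cluster is what a homogeneous population of junction reads looks like; a second insertion elsewhere would appear as an additional tuple in the returned list.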
The TCS results agree with the WGS results, confirm the single-copy, single-insertion status of TE1, TE2 and their breeding stack TE1 x TE2, and provide a significantly higher number of junction reads.
Transgene integrity. The sequence coverage and PE mapping information were used to assess the integrity of the T-DNA. The WGS datasets contained an average coverage of 9x and 8x across the T-DNA for the TE1 and TE2 events, respectively, and 11x coverage for the breeding stack (Table 1). The sequence coverage indicated the presence of all the T-DNA elements, and the sequence of the T-DNA from the events exactly matched the reference sequence, suggesting the absence of any variations in the T-DNA. The TCS datasets had significantly higher coverage (>2800x) than the WGS datasets (Tables 3 and 4) and were in agreement with the results from the WGS data.
Analysis of relative spacing and orientation of mapped PE reads can indicate evidence of rearrangements. In Fig 4, we show how PE reads would be mapped in three distinct types of rearrangements. No such anomalous pairs were observed during the sequencing of the three events indicating the absence of any rearrangements within the T-DNA confirming the integrity of the T-DNA. WGS and TCS data were in concordance with SB and Sanger sequencing and confirmed the integrity of T-DNA inserts within single and stack events at the nucleotide level.
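The idea of screening for anomalous pairs can be sketched as follows. The categories and their usual interpretations (noted in the comments) follow general paired-end conventions and are not taken from Fig 4; the insert size limits, read length, and the `Pair` record are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    ref1: str
    pos1: int        # 1-based leftmost position of mate 1
    reverse1: bool   # True if mate 1 maps to the reverse strand
    ref2: str
    pos2: int
    reverse2: bool


def classify_pair(p: Pair, min_span=100, max_span=800, read_len=100) -> str:
    """Label a read pair by the spacing and orientation of its two mates."""
    if p.ref1 != p.ref2:
        return "cross_reference"      # e.g. junction pairs, handled separately
    left, right = sorted([(p.pos1, p.reverse1), (p.pos2, p.reverse2)])
    if left[1] == right[1]:
        return "same_strand"          # often associated with inversions
    if left[1] and not right[1]:
        return "everted"              # mates point away from each other
    span = right[0] + read_len - left[0]
    if span > max_span:
        return "too_far"              # often associated with deletions
    if span < min_span:
        return "too_close"            # often associated with insertions
    return "proper"


print(classify_pair(Pair("T-DNA", 100, False, "T-DNA", 400, True)))
```

An intact T-DNA would be expected to yield only `proper` pairs (plus junction pairs at its ends), which is the pattern reported for the three events here.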
TCS experiments gave us much higher coverage than WGS, which in turn gives us much higher confidence that the T-DNA inserts in all three events were intact. Compared to Sanger sequencing, where the DNA fragment is sequenced from both sides and, on average, achieves 2x coverage at the overlapping and 1x at the non-overlapping portions, NGS (especially TCS) provides much more robust data to determine the integrity of the T-DNA at the nucleotide level. TCS generates a higher level of coverage for a given cost because it sequences only the captured fragments of DNA.
Presence/absence of vector backbone. Sequencing data were also used to determine whether any portion of the vector backbone is present in the transgenic plant. As WGS generates sequences from the entire genome, any vector backbone segments that are present in the transgenic plant will be sequenced and these reads will map back to the plasmid reference sequence. In this study, we did not see any reads mapping to the vector backbone of the construct, suggesting a clean integration of the T-DNA and the absence of any vector backbone in the transgenic plant. These results were consistent for TE1 and TE2 across all generations and for their breeding stack, and were further confirmed by the TCS experiments (Figs 2C2 and 3C2).
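A backbone check of this kind can be sketched as a search for reads that map to the plasmid outside the T-DNA. The example assumes linear plasmid coordinates with the T-DNA as one contiguous interval; the coordinates and the 20-base threshold are illustrative assumptions rather than values from this study.

```python
def backbone_bases(start: int, end: int, tdna_start: int, tdna_end: int) -> int:
    """Number of bases of a mapped read (1-based, inclusive) lying outside the T-DNA."""
    read_len = end - start + 1
    inside = max(0, min(end, tdna_end) - max(start, tdna_start) + 1)
    return read_len - inside


def find_backbone_reads(reads, tdna_start, tdna_end, min_bases=20):
    """Return reads with at least `min_bases` bases mapped to the vector backbone."""
    return [
        (s, e) for s, e in reads
        if backbone_bases(s, e, tdna_start, tdna_end) >= min_bases
    ]


# Hypothetical plasmid with the T-DNA at positions 2000-7000.
clean_event = [(2100, 2200), (4500, 4600), (6890, 6990)]
print(find_backbone_reads(clean_event, 2000, 7000))
```

An empty result corresponds to the outcome described above, where no reads mapped to the backbone.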
Stability. Using both WGS and TCS approaches, we generated the same information from TE1 and TE2 single event samples representing two generations, F2 and T3 (Tables 1-4). No differences in copy number or integrity of the T-DNA were observed between the two generations, suggesting that the T-DNA is stably inherited across generations. These results correspond with the results of the SB analysis.

Discussion
The results reported here highlight the application of NGS to molecular characterization of transgenic events. We show that NGS offers an effective, robust, and sensitive method to identify the transgene insert location, copy number, integrity, and stability.
Table 3. Molecular characterization of soybean single events, TE1 and TE2, and their breeding stack TE1 x TE2 using the whole genome sequencing approach. Each element of the T-DNA is represented by the "X" amount of coverage depth. Both T-DNA inserts within TE1 and TE2 events share several identical elements, such as T1_E5 = T2_E1 (promoters), T1_E7 = T2_E3 = T2_E6 (terminators), T1_E8 = T2_E4 = T2_E7 (promoters), T1_E9 = T2_E8 (gene of interest, GOI), and T1_E10 = T2_E9 (terminators).
There are several advantages of NGS over traditional SB + Sanger sequencing analysis. One of them is the sensitivity of the technology. Owing to the nature of the chemistry, particularly PE sequencing, the technology makes it possible to detect small DNA re-arrangements (insertions/deletions and inversions) within the T-DNA. Although no re-arrangements within the T-DNA were found in this study, in Fig 4 we illustrate several possible scenarios in which PE sequencing could detect small DNA aberrations. Additionally, the high level of coverage makes base calling more reliable and robust. One can rightly argue that Sanger sequencing is also capable of detecting those mutations. However, this is true only if the complexity of the event allows the generation of PCR amplicons covering the T-DNA, junction regions and flanking borders, which can then be Sanger-sequenced. When a transgene lands in a highly repetitive region of a genome, PCR amplification of the region spanning borders and junctions becomes a very cumbersome task. This is also true for events that were created by particle bombardment, as the latter can generate deletions and scramble inserted and chromosomal DNA [14].
In the above-mentioned circumstances, deep sequencing offers much higher sensitivity and the ability to resolve complex issues. NGS technologies offer strategies superior to PCR-based methods for resolving problems caused by the integration of a transgene in a repetitive region of genomic DNA. Recently emerged single-molecule-based NGS technologies generate longer reads (2,000-5,000 bp) at increased coverage depth. The latter is particularly important in resolving the challenges in the analysis of repetitive and low-complexity regions of a genome [15,16].
Table 4. Molecular characterization of soybean single events, TE1 and TE2, and their breeding stack TE1 x TE2 using the target capture sequencing approach. Each element of the T-DNA is represented by the "X" amount of coverage depth. Both T-DNA inserts within TE1 and TE2 events share several identical elements, such as T1_E5 = T2_E1 (promoters), T1_E7 = T2_E3 = T2_E6 (terminators), T1_E8 = T2_E4 = T2_E7 (promoters), T1_E9 = T2_E8 (gene of interest, GOI), and T1_E10 = T2_E9 (terminators).
Another area where NGS was reported to be more advantageous is the cost of experiments and the amount of labor spent on molecular characterization. In particular, Kovalic et al. [10] reported that conducting the molecular characterization using NGS approaches reduces the cost and labor by 50% compared to the Southern blot-based approach. DNA sample preparation for NGS requires shearing of the genomic DNA, using starting material in much lower quantities than needed for SB (more than 10-fold less in some cases). The subsequent size selection of sheared DNA fragments and the construction of sequencing libraries can be performed with commercially available kits in a very high-throughput manner. Although we did not make a direct per-sample cost comparison, we also observed that NGS-based molecular characterization was time- and labor-effective.
However, we need to state that in our experiments, SB analysis was conducted under a Good Laboratory Practice (GLP) quality management system while Sanger and NGS were not. Any experiments that are done under GLP or ISO quality management systems involve a substantial amount of time spent on quality assurance, which significantly slows down the process. Cost savings could also be relative and depend on the nature of the event. In the case of "difficult" events, NGS might require higher coverage and more time spent on data processing, which would dramatically increase the cost of the experiments. In our WGS experiments, we achieved slightly lower coverage within junction regions (~7x) than across the transgene (9x). At this point, the literature lacks sufficient information on a "gold standard" for the level of coverage within junction regions needed to make a solid determination of transgene copy number. Recently, Kovalic et al. [10] reported ~70x coverage within junction regions. While it is possible to expand the coverage of the genome in WGS experiments, this increases the cost of molecular characterization for regulatory submissions. Also, the application of WGS for molecular characterization of transgenic crops may be less affordable for companies or institutions with modest budgets due to the high cost of experiments, the resources needed to conduct extensive bioinformatics data processing, and the purchase and maintenance of storage space for the enormous amount of sequencing data [17]. Recently, Zastrow-Hayes et al. [13] demonstrated the use of the TCS method coupled with NGS technology for high-throughput event sorting during the trait development process. In this study, we have demonstrated that TCS technology can answer all key questions pertaining to molecular characterization of transgenic crops posed by regulatory agencies.
In comparison to WGS, TCS achieved very high coverage within junction regions, which boosted our confidence in characterizing the insertion site and defining the copy number of a transgene. As TCS focuses on the target region (e.g. T-DNA) only, the technology generates much less sequencing data and, consequently, requires less storage space and fewer resources to complete the bioinformatics analysis. Defining a "gold standard" for the level of coverage within junction regions will ultimately depend on the nature, and specifically the complexity, of the junction regions. Junction regions with complex DNA re-arrangements occurring during transformation might require much higher coverage to increase the confidence level in decision making. In contrast, "clean" junction regions might not need such a high level of coverage to define the copy number. For high quality data analysis and assembly, sufficiently high coverage is required, which can increase the cost of sequencing. Therefore, the depth of coverage should be set on a case-by-case basis, and a balance between the requirement, cost and coverage should be struck [18]. NGS-based molecular characterization of transgenic events is a promising new trend in regulatory sciences. Thus, it is crucially important for regulators to understand the similarities between NGS-based and SB-based molecular characterization of transgenic events, which will be very helpful during the review of regulatory dossiers. In Table 5, we draw parallels between TCS-based, WGS-based and SB-based molecular characterization. The process comparison clearly demonstrates that the principle of the TCS method closely mirrors the probe versus genomic DNA hybridization principle of SB analysis (Table 5).
Although we have demonstrated here that reads generated by both WGS and TCS technologies can be successfully employed for characterizing the entire T-DNA, the short reads may pose some challenges to resolve junction regions within the complex repeat-rich regions of the genome or with repetitive regions within the T-DNA [19]. However, the combination of paired-end sequencing with larger read lengths and insert sizes and the advancements in sequencing platforms that enable longer read lengths can mitigate these disadvantages [20].
Table 5. Comparison of concepts between Southern blot analysis, Target capture sequencing and Whole Genome Sequencing. (Column headings: Step; Southern Blot analysis; Target Capture Sequencing; Whole Genome Sequencing.)

Conclusion
Molecular characterization of transgenic events using NGS technology, namely whole genome sequencing and targeted capture sequencing, can successfully answer all major regulatory questions related to transgene copy number, T-DNA integrity, stability of the T-DNA insert across different generations, and the presence/absence of plasmid backbone sequence. Unlike SB analysis, where the decision on the status of the transgene is made based on banding pattern, the outcome of NGS-based molecular characterization is an actual sequence that is confirmed at several-fold coverage. In terms of coverage, TCS looks more attractive than WGS as it is capable of providing ultra-high coverage within the T-DNA and junction regions at a reasonable cost. Owing to the paired-end chemistry, both WGS and TCS possess much higher sensitivity than Southern blot analysis in detecting small DNA re-arrangements within the T-DNA and junction regions. Although fairly short read lengths could cause problems in resolving complex junction regions or insertions within repetitive sequences, this could be overcome by increasing the coverage within the troublesome region using the TCS approach. Overall, NGS-based molecular characterization is a robust and reliable approach and, with further chemistry improvements, in particular an increase in read length, it can replace labor- and time-consuming Southern blot analysis in molecular characterization of transgenic crops for regulatory submissions.

Plant materials
Soybean transgenic single events, TE1 and TE2, were generated by Agrobacterium transformation. The soybean breeding stack, TE1 x TE2, was developed by conventional breeding of TE1 and TE2. Non-transgenic soybean control plants were the conventional soybean varieties with the genetic background of the single and stacked events. For Southern blot analysis of the TE1 and TE2 single events, four homozygous generations and one segregating generation, with three replications for each generation, were grown in greenhouse conditions. For Southern blot analysis of the TE1 x TE2 stack, a single homozygous line with three replications was used. For next-generation sequencing analysis, one homozygous line and one heterozygous line, with one replication each, were used for each of the TE1 and TE2 single events, and one homozygous line was used for the TE1 x TE2 stack.

Genomic DNA extraction
Genomic DNA from frozen soybean leaf tissue of the single events, the breeding stack event, and non-transgenic control plants was extracted following a modified CTAB method [21]. Following extraction, the DNA was quantified spectrofluorometrically using PicoGreen reagent (Invitrogen). The DNA was then visualized on an agarose gel to check genomic DNA quality. Genomic DNA was used for Southern blot, whole genome sequencing and target capture sequencing analysis.

Southern blot analysis
Ten micrograms of genomic DNA from transgenic single and stacked events, the non-transgenic control, and the non-transgenic control spiked with plasmid were digested with the required restriction enzymes. Multiple restriction enzymes were selected to determine the copy number and integrity of the inserted T-DNA as well as the absence of transformation plasmid backbone in the transgenic events. For the single transgenic events (TE1 and TE2), probes specific to the TE1 and TE2 events and their backbone regions were labeled with DIG-dUTP using a PCR DIG Probe Synthesis Kit (Roche Diagnostics, Indianapolis, IN). Southern blot analysis was performed essentially as described by Memelink et al. [22]. Hybridization and detection were completed according to the manufacturer's instructions (Roche Diagnostics, Indianapolis, IN). For the stacked event, probes were radioactively labeled with [α-32P]dCTP using the Prime-It RmT Random Primer Labeling Kit (Agilent Technologies, Santa Clara, CA) and purified using ProbeQuant G-50 Micro Columns (GE Healthcare, Chalfont St Giles, Buckinghamshire). Hybridization was conducted with Perfect Hyb Plus hybridization buffer, and the membranes were then exposed to X-ray film sandwiched between two intensifying screens for one to three days in a -80°C freezer. The film was then developed with an All-pro imaging film developer (ALLPRO Imaging, Melville, NY).

Sanger sequencing
Sanger sequencing was applied to determine the intactness of the T-DNA at the nucleotide level and to characterize the insertion site in the parental locus. The entire length of the T-DNA insert and approximately 1 kb fragments of the 5' and 3' flanking border regions within TE1 and TE2 were sequenced. Furthermore, parental loci within the isogenic non-transgenic lines representing the genetic background of TE1 and TE2 were sequenced to identify whether any sequence re-arrangements took place at the insertion site during transformation. The entire T-DNA of each single event constituent of the breeding stack, TE1 x TE2, was also re-sequenced. The purpose was to identify whether the breeding process incurs any changes to the T-DNAs of the TE1 and TE2 single events when they are brought together into one background. The T-DNA inserts of the single events in the TE1 x TE2 stack and the parental locus for each trait were PCR-amplified in overlapping fragments. The fragments were cloned and sequenced by traditional Sanger sequencing. Sequencing was followed by assembly and generation of a consensus sequence spanning the entire locus and flanking border sequences. The resulting consensus sequence was aligned to the previously determined sequence of the transformation plasmid for each trait.
Library preparation for Whole Genome Sequencing

Sequencing quality and assembly
Initial quality control of the sequenced reads was done using the CASAVA software (Illumina, Inc., San Diego, CA). Following that, the reads were trimmed for adapter sequences, and all reads with Phred quality scores below 30 (Q30) were discarded. The trimmed reads were mapped to the soybean Williams 82 reference genome sequence using software packages including Burrows-Wheeler Aligner [23] and Samtools [24]. Sequence coverage for the genome and the transgene was obtained using BEDTools [25]. Custom scripts were used to extract the junction reads. The figures presented here were made using Circos [26].
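The quality filtering step can be illustrated with a short sketch. It assumes that the Q30 cut-off is applied to the mean Phred score of each read and that qualities use the Phred+33 encoding; the text does not state either detail, and the function names are hypothetical.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(records, min_quality: float = 30):
    """Keep (name, sequence, quality) records whose mean Phred score is >= min_quality."""
    return [rec for rec in records if rec[2] and mean_phred(rec[2]) >= min_quality]


# Toy records: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
records = [
    ("read1", "ACGT", "IIII"),
    ("read2", "ACGT", "####"),
]
print([name for name, _, _ in filter_reads(records)])
```

If the cut-off were instead applied per base, the same helper could be adapted to test the minimum score of each read rather than the mean.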