Exon sequencing mutation detection algorithm based on PCR matching

Background With the development of second-generation sequencing technology, more and more DNA sequence variations have been detected. Exon sequencing is the first choice for sequencing many cancer genes, and it can be better used to identify disease status by detecting gene variants. PCR sequence is an effective method to capture that sequence of an exon in the process of sequencing. Exon sequencing sequence contains PCR primer sequence, the correct position of the sequence can be determined by PCR primer sequence, which can be found in SNP, Indel mutation point by comparing the sequence of PCR primer sequence. Results In this paper, a matching algorithm based on the PCR primer sequence is proposed, which can effectively sequence the position of PCR primer sequence and find out the key position sequence. Then the sequencing sequence is sorted and the number of the same sequence is counted to reduce the matching times. Then, the sequenced sequence was matched with PCR primer sequence, so that the DNA position could be accurately matched and the variation in the sequenced sequence could be found more quickly. Conclusions Compared with the traditional sequence matching method, PCR primer sequence matching method can match many sequences and find more variation. It also showed a high recall rate in the recall rate.


Introduction
With the gradual maturation of the second-generation sequencing technologies [1], higher and higher sequencing throughputs and cheaper and cheaper sequencing [2], sequencing technologies are increasingly used in the diagnosis of disease genetics [3]. Many disease-related SNPs [4] have been studied more and more by sequencing techniques. In clinical medicine, the diagnosis of certain diseases and the search of target sites for the action of certain drugs involve only part of the genetic locus [5]. Although whole genome re-sequencing can also achieve these functions [6], Genome re-sequencing will detect those unrelated gene sequences together, which will result in a lot of financial and time waste, but also increase the difficulty of post-processing sequencing data. Under the premise that related reference genes corresponding to some traits of a species are known, some samples do not require genome-wide sequencing when performing sequence analysis on certain specific fragments in the gene. Only by sequencing the target region of specific interest in the sample gene will it be possible to know whether the sample has the potential to express the gene at the gene level. This alternative to genome-wide re-sequencing, which only targets the segments, not only greatly reduces sequencing costs, but also reduces subsequent data processing efforts. Targeted sequencing [7] is a technique for sequencing common exons [8] in specific regions of genes in DNA sequences. This sequencing method is based on the susceptibility genes and can be used to determine whether a certain disease Disease), the region of the relevant gene is detected to detect whether there is a malignant mutation, and if so, the disease is determined.
Due to the short sequence length measured by the sequencer, in order to be able to completely sequenced the DNA sequences in the class, the DNA genome was randomly interrupted by ultrasonic waves into small fragments, and the fragments were added to both ends of the small fragments by PCR By technique [9], these small fragments are sequenced by the sequencer. However, this sequencing requires a certain depth of sequencing in order to meet the corresponding coverage. The deeper the DNA sequence, the more complete the DNA sequence and the higher the sequencing cost. Finally, DNA sequences are assembled through enough small fragments [10]. The primary task of sequencing data is to align the sequences and align the corresponding fragments to the DNA sequence. However, there are a large number of repeat fragments in the DNA sequence. It is estimated that the repeat length of more than 1.5% of the human genome sequence is 1000 bp The above [11], the current second-generation sequencing sequencing data length of 150pb. Traditional SNP and indel methods are based on a combination of tools such as BWA [12] + SAMTOOLS [13] + BCFTOOLS [14], BWA + PICARD + GATK [15], BWA + SAMTOOLS + Freebayes [16] The VCF file is generated at the mutation point. In this study, we did not consider filtering normal human SNP and Indel [17]. In the follow-up research, we will give full consideration to this work.

DNA targeted sequencing data description
Targeted sequencing is a method of sequencing specific exons of a gene, which has the advantages of being targeted, reducing costs, and different sequencing targets for different cancertargeted sequencing. For example, in the study of breast cancer high-throughput sequencing data, a total of 703 exons of 20 genes were sequenced at a depth of 1000.
DNA target sequencing data The original data needs to be cleaned before it can be processed further. The structure of the library under test and the index & R formula of the sample file are shown in Fig 1:  Fig 1 shows sequencing data in two directions of the sequencing data, there are two files R1. fq and R2.fq, respectively, in the exon region is relatively short, the two sequences will have overlapping sub-division. Index & R fine structure (Note: PE1.0 to PE2.0 direction, from 5 'to 3' end, the black sequence for the index sequence, Index structure for the 5-8 base +19 base fragment, the sample is single End index, in the index were named index & R close to PE2.0. Which PE1.0, PE2.0 for a fixed length, you can directly clean the raw data, the raw data after cleaning only Target DNA and index data.

PCR primer sequence introduced
As can be seen from Fig 1, the PCR primer sequences are located at both the beginning and the end of the sequencing sequence (the beginning and the end of the two-terminal sequencing). The length between the two primer sequences is less than 200, and the length of the exon region is generally designed to be less than 200 This value ensures that both sides of the sequencing overlap in the design.
The flow of the method in this article is as follows: Step1: Data cleaning part (index cleaning and public sequence cleaning) to ensure sequencing are the target sequence; Step2: Studying PCR primer sequences to find the optimal eigenvalue sequence and establishing quad tree data index structure; Step3: Using the local optimal contrast algorithm sequence comparison with the target sequence to find the optimal sequence; According to the optimal sequence structure, we calculated the SNP and InDel information of each locus.

Public sequence cleaning
The public sequence is used to capture the DNA sequence. If the DNA sequence has not been successfully captured, the public sequence will be considered as the DNA sequence. If the capture is successful, no common sequence will appear in the original sequence. Public sequence captures DNA sequence shown in Fig 2: As can be seen from Fig 2A, the DNA sequence has been successfully captured, and the sequence of Target DNA do not include the Public sequence, with a length of 150 bp. In Fig  2B, DNA sequences are captured and tagged with a common sequence. If the sequence contains a common sequence, this sequence does not capture the target sequence. When double-end sequencing of the data is possible, two common sequence cross-trapping occurs, It is possible that a segment contains the target sequence and the other segment contains a common sequence, as shown in Fig 2C. Public sequence cleaning is to obtain a better sequencing sequence to improve the efficiency of alignment. For the above situation to be cleared, the principle of liquidation is as follows: 1. If the sequences on both ends contain a common sequence, then both ends of the sequencing data should be cleared; 2. If one end sequence contains a common sequence and the other end contains a target sequence, both ends of the sequencing data are still to be cleared; The cleaning algorithm is as follows: Algorithm 1 Input: Input the fastq file:R1,R2 Output: Read1,Read2 without the public data sequence ReadR1 = readFastq(" ���� R1.fq") ReadR2 = readFastq(" ���� R2.fq") PubClean<-function(ReadR1,ReadR2){ Pubseq ="public data sequence" revseq = as.character(reverseComplement(DNAString(pubseq))) t1 = grep(pubseq,sread(ReadR1)) t2 = grep(revseq,sread(ReadR2)) sta = sort(c(setdiff(t1,t2),t2)) FLA =! sread(ReadR1) %in% sread(readt10) [

Index cleaning
Index data sequence in another reverse sequencing file for bidirectional sequencing, all cleanup at the beginning of the index sequence is taken into account, as well as the mismatch. DNA target sequencing, the sequencing company sequencing samples based on different sequences, when the target sequence is ideal, the linker sequence will not affect the target sequence; if the target sequence is relatively short, the other sequence may be sequenced to the index, so Resulting in sequence S2 sequence data in the 3 'end coincide with index end sequence. As shown in Fig 3: The SNP detection method mentioned in this article, you must clear this part of the sequence, or Part index data will generate more SNP points. To determine if Part Index data needs to be cleaned, compare the length of the Targe DNA to less than 150, and if it is less than 150, clean the excess Part Index.
Cleaning algorithm is as follows: Output: R1,R2 sequence data without the index sequence Cleanindex<-function Clean(Read1,Read2){ #Clean index in Read2 consider the mismatch situation for(i in 1: Sequencing data processing. For sequencing, the amount of data is large, there are tens of millions of data, but there is a large number of repetitive sequences in these data. In the perfect case, the number of non-repetitive data in sequencing is equivalent to the number of target sequences. Due to system parameter settings, environmental factors, and sequencer errors, a large number of sequencing errors exist in the sequencing sequence. Therefore, in order to effectively compare sequences and reduce the number of repetitive sequences, we deal with the original data in the following way: Step1: The original data R1, R2 data splicing to ensure that the same target sequencing Pair spliced together at both ends of the sequence; Step2: Statistics mosaic sequence frequency appears; as shown: Read from the Fig 4 above Read1, Read2 sequences directly connected together, due to the length of the target sequence exists Read2 and Read1, after the length of the sequence is different. Under normal circumstances, most of the target sequences are less than 200 in length, Overlap exists at the first end of the pair of target sequences, and two Overlap sequences exist in the spliced sequence. The length of the sequence after splicing varies. After statistics of the spliced sequence frequency, most sequences may have several bp differences, which is caused by the error of the sequencer.
Sequencing sequence processing algorithm is as follows:  The above code times = tablet1t2 $ Freq> = F means that for the frequency of occurrence is F, when F = 1, for all sequences to be counted, properly adjusted F, to obtain the sequence to meet the relevant frequency, saving a lot of comparison time. The new table data frame is for the frequency greater than or equal to F of the sequence.

Feature sequence extraction method
In the PCR primer sequence [18], there is a unique DNA sequence, the primer sequence is at both ends of the target sequence, and the distance between the two ends of the DNA sequence is unique. If the sequencing exon is relatively small, the sequencing of the double-ended sequencing data samples by PCR primer sequences allows for direct discovery of SNPs and InDels by matching successful sequences to targets. In order to improve the efficiency of comparison, primer sequences in this article are different in length, and the characterizes are extracted from the primer sequences, and the number of sequences can be reduced by extracting the characteristic sequences.
Characteristic sequence: the characteristic sequence between two sequences r and s (note FS (r, s)), take the same character in r, s instead of the eigenvalue.
For example, if the primer sequence R = "ACGTGC" and S = "ACGCGC" then FS (r, s) = {4: TC}, then the DNA sequence to find the one that belongs to that primer only needs to take the fourth character of the primer sequence, If the fourth character is T, the sequencing data belongs to R; if the fourth position is C, the sequencing data belongs to S.
Generally, two sequences, if there are multiple locations, need to find many different locations can be. When multiple primer sequences are searched, eigenvalues of multiple primer sequences need to be extracted.
The extraction process is as follows: Primer sequence R = {r 1 ,r 2 ,r 3 ,. . .,r n }, r i and r j find the smallest FS(r i ,r j ), re-merge the same location information FS(r i ,r j ), the final location information extracted location.

Location-based PCR sequence comparison algorithm
The foregoing method for determining the location eigenvalue can effectively reduce the number of matching times and quickly and accurately find the location of the eigenvalue so as to make a comparison between the target sequences. The PCR sequence comparison model is shown in Fig 5 below: The location-based PCR algorithm is as follows: In Comseq, all sequences that match the Pcr signature sequence are combined, and there may be a single point variation in these signature sequences, where a single point mutation is included in the sequence.

Local optimal solution algorithm
Let the length of the sequence S, T be m and n, respectively, using the score matrix D of data structure (m + 1) � (n + 1), the value of each element in the matrix represents the sequence 0: S: i�m) the best alignment of a suffix with a suffix of sequence 0: T: j (0�j�n).
Except for the prefix score in the local alignment, the initial score matrix is as follows: Since 0: s: i and 0: t: j always have an empty suffix score of 0, all elements in matrix D are greater than or equal to zero. Thus, the scoring matrix of any element is calculated as follows: In which, pðs i ; A threshold of 0 means that the 0 element distribution area of the matrix corresponds to a dissimilar sequencing, while the positive number area is a locally similar area. Finally, find the maximum value in the matrix, which is the optimal local alignment score, and the corresponding point is the unmatched point of the local alignment of the sequence. Then, the previous optimal path is reversely deduced until the local ratio The starting point.
Since the sequence determined by PCR has a great similarity with the target sequence in principle, if the similarity is not high, the sequence is deleted directly. Therefore, when comparing, only a few items need to be compared, and others need not be aligned. The calculation formula of any element in the scoring matrix is as follows: ε is an integer, when ε = 0, said S is completely T sequencing, ε = 1 indicates that there is a position deviation. In this paper ε = | m-n | +1 said The local optimum contrast process is as follows: Step1: Based on the initialized matrix D, pass the d 0,j = 0 of the first row of the score matrix and initialize the first column of the matrix d i,0 = 0.
Step2: Each element in D is calculated, and each element in the score matrix D is sequentially calculated starting from d11 according to the formula (3) and the score function formula (1) until d n+1,m+1.
Step3: Find the optimal path to determine the maximum dij position. Reverse the forward, find the optimal path.
Step4: According to the optimal path, the optimal local alignment of bar sequences is obtained.
Example 2: Local alignment of sequences S = "CGTGAGCTG" and T = "CGTCGAGCTGA" by local alignment method, n = 11, m = 9, then ε = 3 The results are shown in Fig 6 below. In the above figure, the optimal path isd 11,10 -> d 10,9 -> d 9,8 -> d 8,7 -> d 7,6 -> d 6,5 -> d 5,5 -> d 4,4 -> d 3,3 -> d 2,2 -> d 1,1 -> d 0,0 . The result is shown in Fig 6. Find the optimal local alignment from the above path, as shown in Fig 7. According to the above scoring matrix, the optimal path is derived in the backward direction. The Algorithm 6 is as follows: The above algorithm obtains the optimal path, and after determining the optimal path, returns the score. The score must satisfy the requirement and can be considered as the local optimum of S and T.
K denotes the number of unmatched points between S and T. When K = 0, it indicates that S and T are complete subset relations. Generally, k is considered as 2, k is too large, and there is no local maximum between S and T. excellent. The smaller k indicates the local optimum, and the minimum is 0.

SNP, Indel discovery algorithm
Based on the local optimal alignment algorithm, the local optimal method was used to find the sequence SNP, Indel. Firstly, the eigenvalues are extracted from the PCR sequence to find the feature location information, and the location data is used to search the sequencing data for the sequencing data in a specific area. After determining the sequencing data for a particular region, compare the DNA sequences for that particular region, and find the SNP, Indel information in that region. The process is shown in Fig 8 below: Through the PCR sequence to find SNP, Indel method, the reference sequence alignment to achieve, when the reference sequence length is a certain value, there is overlap between the sequencing data at both ends, if the two ends of the sequencing data were performed with reference to the sequence For local comparisons, Indel is present if the score is different from the length of the single-end sequencing data. Indel is not present if the two are the same length, allowing for the presence of an SNP. In front of the sequence to find SNP, InDel information, because by PCR to solve the sequence is spliced through R1, R2, so to spliced the sequence split into R1, R2, and then through R1, R2 and the target sequence Local comparison. Calculate the number of A, C, G, T and N for each position of Target and calculate the SNP and InDel information finally. The analysis model is shown in Fig 9 below: By using P i1 ,P i2 ,P i3 andP i4 , if the position of i in the reference sequence is A, the ratios at i position are: If the value of P (A i ) is 100%, there is no SNP at position i, and if P (A i ) is less than 90%, there is a variation. If the Indel phenomenon is represented by K, when k = 0, there is no Indel When K 6 ¼ 0, the size of K is analyzed. If K is particularly small, it may be SNP or Del phenomenon, and the length of the sequence after sequencing becomes longer, indicating that there is a Del phenomenon, and if there is no change, an SNP phenomenon exists. If K is particularly likely to be In, there is no match for the sequencing sequence to start the match backward and backward to find the reference sequence.

Discussion
The experimental environment adopted in this paper is Inter (R) Core (TM) I7-6700, memory 8G, and R language. The comparison targets are BWA + SAMTOOLS + BCFTOOLS (BSB), BWA + PICARD + GATK (BPG), BWA + SAMTOOLS + Freebayes (BSF). Three main methods of finding SNPs are compared with those presented in this paper based on PCR methods.
Using Wgsim software [19] in this study, the software is better than other software in terms of controllability, which can limit the error rate of sequencing, sequencing length, and sequencing type. Today's high-throughput sequencing market is dominated by Illumina's double-ended sequencing technology [20], and Wgsim's mock-sequencing software is well-suited for double-ended mimic sequencing. Pre-filtration through the sequence of low-quality sequence of cleaning, cleaning effect as shown in Table 1 below, through the control of low quality, clean-up before and after the effect, we can see that after the treatment of the average mass ratio increased.
The sequencing data was sequenced at a depth of 50, 100, 150, and a length of 50, 100, 150. With the increase of sequencing depth, the accuracy of sequencing has been improved to a certain extent. When the length of sequencing data is 150, the effect is the best, and the correct rate can be achieved. Mainly due to the increased sequencing depth, the number of sequencing in the same section increased, the accuracy rate increased, and if the length of the sequencing data was increased, the correct rate in the DNA sequence was also continuously improved.  SVsim tools [21] were used to randomly generate sequences of insertions, deletions, repeats, inversions, and translocations in the sequence to generate sequencing data, and the selected sequencing depth was 50, 100, 150. The above several software tests, from the software to see the actual situation of various software, statistical analysis of SNP and Indel, as shown in  In the analysis of Indel for comparison, due to stools in the real independence Indel, the effect is relatively poor. With the continuous increase of testing depth, the software identifies SNP. Indel effect is obviously improved. Increasing the depth will increase the number of comparisons, and can obviously increase the number of regional sequences and improve the recognition rate.
The above data does not filter the SNP, so find a lot of SNP points, this difference is set by the tool's own algorithm, and data differences also lead to the occurrence of this result. As shown in Figs 10 and 11, there are differences between SNP and Indel in the software. Because there is a small number of BSF in Indel, this article does not participate in the comparison.
In the process of generating simulated sequencing data, we analyzed the BAM files with the above four kinds of software to study the recall of the four kinds of software to estimate the sensitivity of the mutation sites.
Three thousand SNP sites were inserted into the sequencing sequence, with 2000 Indel sites (2-10 bp deletion and 2-10 bp increase). The 25 sets of sequencing data were simulated by ILLUMINA company. The sequencing fragment length was 150 bp, the standard error of sequencing was 0, and the sequencing error rate was 0. That is to say, only 300bp insertions 1bp, in this way, four kinds of analysis software (PCR, BSB, BPG, and BSF) can be used to detect the recall of simulated tumor data with deletion mutation by the single-factor controlled variable method. The following values are mean values of the test samples. The effect is shown in Figs 12 and 13.
As can be seen from the figure above, as the depth of sequencing is found to increase in terms of SNP and Indel, it indicates that there is sufficient sequencing depth in sequencing data to ensure SNP and Indel correctness. In the sequencing process can have enough sequencing depth, can have high accuracy. In order to better express the effect of PCR detection, true positive rate, false positive rate and accuracy rate were used to evaluate. The equation is as follows: The effectiveness of the four algorithms was evaluated by the true positive rate, the false positive rate and the accuracy rate, and the classification performance under different sequencing depths was shown in Tables 5-7.
Comparison of the performance of sequencing data in experiment When using PCR, BSB, BPG and BSF to predict, the accuracy of the original dataset processed by sampling technique was good. When depth = 150, the accuracy of PCR was 97.76%, BSB was 65.54%, BPG was 93.65% and BSF was 88.53%. The reason of low accuracy of BSB may be that the effect of mailing InDel is not ideal and the accuracy of BSB is low. As that sequence depth increases, the PCR algorithm achieve the highest level of accuracy, but it can be seen that the accuracy of the other three methods is also rising. The increase of sequencing length can improve the efficiency of BWA alignment, and the accuracy of BPG is also increased significantly, which is also close to PCR method. Similarly, there is a similar situation in the TPR and FPR indicators, which is not explained here. In the run time, BSB, BPG, BSF need BWA sequence alignment in the initial run time, which will take up a lot of time, while PCR directly sequencing and quantitative statistics of the initial sequence, which takes up less time. The running time of BPG is longer than that of GATK except in BWA. But the effect of GATK is also ideal. As that length of the sequence increase, the running time of PCR decrease, and as the length of the sequence increases, the comparison time also increases, so the running time also increase.

Conclusions
With the development of sequencing technology, sequence can be analyzed directly from exon sequencing, and SNP and InDel in sequence can be obtained quickly, which is the  fundamental goal of sequencing. In this paper, PCR primers are included in sequencing sequence, and the comparison between PCR primer sequence and sequencing data can quickly achieve this goal. In this way, the sequencing data can also be analyzed under the general condition of computer hardware. Compared with other traditional methods, the performance of the proposed method is much better, and the accuracy and time of the proposed method are also very high. In the future, we will focus on how to obtain the sequence of mutation quickly under the condition of specific gene mutation. Especially, SNP, InDel or CNV mutate specific sequences, and use PCR primers or set specific sequences to find the mutated sequences in the sequence data.