Breaking Lander-Waterman’s Coverage Bound


  • Damoun Nashta-ali, 
  • Seyed Abolfazl Motahari, 
  • Babak Hosseinkhalaj

Abstract

Lander-Waterman’s coverage bound establishes the total number of reads required to cover a whole genome of size G bases. The bound is a direct consequence of the well-known solution to the coupon collector’s problem, which shows that for such a genome the total number of bases to be sequenced should be O(G ln G). Although the bound is tight, it rests on the tacit assumption that the reads are first collected through a sequencing process and only then processed computationally, i.e., there are two separate machines: one for sequencing and one for processing. In this paper, we present a significant improvement over Lander-Waterman’s result and prove that, by combining the sequencing and computing processes, one can re-sequence the whole genome with as few as O(G) sequenced bases in total. Our approach also dramatically reduces the computational power required by the combined process. Simulations are performed on real genomes with different sequencing error rates. The results support our theory, which predicts a log G improvement on the coverage bound and a corresponding reduction in the total number of bases that must be sequenced.

Introduction

Data generated by DNA sequencing machines are growing at an unprecedented rate. Extracting knowledge from these data is extremely tedious and usually requires very powerful computing machines. The main reason is that the data generated for an experiment are usually highly redundant, and one pays the price of extracting the useful information and discarding the redundant part at the processing step. For example, in whole genome sequencing of the Human genome at 100x coverage, each base is present in 100 sequencing reads on average, which means that 99 percent of the data is redundant. The first question that comes to mind is whether the volume of data generated by the sequencing machines can be reduced without affecting the overall performance. In this paper, we focus on the whole genome sequencing problem and seek fundamental results on the redundancy level required to obtain the desired result.

The first fundamental result in this area is due to Lander and Waterman [1], who present a lower bound on the number of reads, N, required to assemble the whole genome. We refer to this bound as the coverage bound. The coverage bound states that for a genome of size G and reads of length L, at least N_cov reads are needed so that the whole genome is covered with probability no less than 1 − ϵ [2]. Therefore, we should have N ≥ N_cov.

For the aforementioned scenario, the total number of bases sequenced by the sequencing machine is NL, which, by Lander-Waterman’s result, must be of the order of G log G. Consequently, the irreducible redundancy level in such a setup is of the order of log G. However, this result rests on the underlying assumption that the sequencing and computing steps are performed independently, i.e., one machine samples the genome and sequences as many reads as are required to cover it, and then another machine processes the reads to assemble the genome. The overall architecture of this approach to whole genome sequencing is shown in Fig 1(a) and consists of a cascade of two blocks: one for sampling and sequencing and one for assembly.
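To make the G log G scaling concrete, the following minimal sketch uses the standard Poisson approximation, under which a given base is missed by every read with probability about exp(−NL/G). The function names and the choice of a natural logarithm are illustrative assumptions, not notation from the paper, and the criterion used (expected number of uncovered bases at most ϵ) is only an approximation of the coverage bound.

```python
import math

def uncovered_probability(coverage: float) -> float:
    """Poisson approximation: chance that a given base is hit by no read
    when the reads supply `coverage` = N*L/G sequenced bases per genome base."""
    return math.exp(-coverage)

def required_coverage(genome_size: int, eps: float) -> float:
    """Smallest N*L/G for which the expected number of uncovered bases,
    G * exp(-N*L/G), is at most eps (an assumed, simplified criterion)."""
    return math.log(genome_size / eps)

if __name__ == "__main__":
    G = 3_000_000_000          # roughly the size of the Human genome
    c = required_coverage(G, eps=0.01)
    print(f"redundancy N*L/G needed: {c:.1f}")
    print(f"total sequenced bases N*L: {c * G:.3e}")
```

The redundancy returned by `required_coverage` grows like ln G, which is the log G factor that the rest of the paper sets out to remove.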

Fig 1. Platforms for genome sequencing.

Whole genome shotgun sequencing (a) and our method platforms (b). In the classic method (a), the processing starts after termination of random sampling and sequencing. In our proposed sequencing platform (b), sampling and processing machines work cooperatively. The results of alignment are fed back to preempt reading redundant data in the sequencing machine.

https://doi.org/10.1371/journal.pone.0164888.g001

Although such a separation between sequencing and processing has traditionally been assumed in the literature, one can ask whether it is fundamentally optimal with respect to the amount of sequenced data that is generated and subsequently processed. In other words, can we improve the overall performance of the system by merging the two components? To assess the improvement achieved by such integration, one needs to answer two key questions. First, does merging the two functions lower the redundancy of the generated data? Second, is it physically possible to build a machine that performs both functions simultaneously? In the rest of this paper, we answer the first question by proving that one can break the coverage bound of Lander and Waterman and reduce the number of sequenced bases to as low as O(G). It is worth mentioning that even without access to a machine that efficiently combines the sampling and computing units, the approach proposed in this paper can still reduce the computational power required to assemble the genome. This is because the processing unit processes the data on the fly and, if it detects that the remaining part of a piece of data is redundant, it stops further processing of that piece. As we will show, such early termination can significantly reduce the computational complexity of the whole process. Note that the proposed method can control the depth of sequencing (to obtain the desired accuracy) by sequencing only informative bases and reducing over-read bases. Using this capability, O(G) read bases can also be obtained for noisy reads. In addition, in the worst case, our method attains the same accuracy as Lander-Waterman’s method.

Answering the second question is beyond the scope of this paper. However, our approach clearly shows that if a sequencing machine can be built that can preempt sequencing at the instant the computation part sees best fit, a much more efficient sequencing machine may actually be obtained.

Our strategy to merge the two functions is shown in Fig 1(b) where the processing machine controls the sequencing machine by blocking sequencing of redundant bases. Hence, we assume that the sequencing machine sequences the DNA fragments base by base and it will stop sequencing a fragment once a blocking command from the processing machine is initiated for that fragment.

In this paper, we focus only on the re-sequencing problem in the processing machine. We first present theoretical results for i.i.d. genomes and for real genomes with given repeat structures, with both noiseless and noisy reads. Our simulations are performed on chr19 of the Human genome hg19 for both noiseless and noisy reads, and show that the method achieves a significant improvement on the coverage bound for a real genome. In the processing machine, the Meta-aligner scheme [3] is used to align reads to the reference genome.

The longer the reads, the greater the reduction in the number of read bases achieved by our method. A number of Next Generation Sequencing (NGS) methods [4], such as PacBio and Nanopore [5–7], already provide reads several thousand bases long and are suitable candidates for such analysis. Fig 2 shows the read length distribution of PacBio technology for the first two read archives of NCBI GenBank SRX533609 (http://www.ncbi.nlm.nih.gov/sra?term=SRX533609). It can be envisioned that other sequencing methods will also provide longer reads as sequencing technology advances in that direction in the coming years.

Fig 2. PacBio read length distribution.

Almost 81% of PacBio reads have length of at least 2000 bps.

https://doi.org/10.1371/journal.pone.0164888.g002

Materials and Methods

Conventionally, in the re-sequencing problem, a reference genome and a set of reads from a target genome are available at the processing step. In this framework, the sequencing machine produces reads of length L from N DNA fragments. Due to Lander-Waterman’s coverage bound, NL, the total number of bases read by the machine, is required to be O(G log G).

Our method changes this bound by assuming that sequencing can be controlled by a processing machine such that it can terminate sequencing at any base. In other words, the sequencer starts reading bases of one side of a DNA fragment one by one and it will stop as soon as a command is initiated by the processing machine. In order to explain the basic ideas and to prove that sequencing can be performed efficiently, we first analyse our proposed methods on i.i.d. genomes. We extend the results to real genomes where repeats play an important role in their structure.

i.i.d. genomes

In this part, we assume that the reference and target genomes are i.i.d. random sequences over {A, C, G, T} with uniform base probabilities. We also assume that reads are sampled uniformly and independently from the target genome, and that reads are noiseless. The key strategy that we use to terminate the sequencing process in a controlled manner is as follows. We divide L by some integer K ∈ {1, …, L} and, without loss of generality, assume that ℓ = L/K is also an integer. We allow the reading machine to read the first ℓ bases of all the DNA fragments. Let {R_1(ℓ), …, R_N(ℓ)} denote the set of starting bases of all the fragments, where R_i(ℓ) is the first ℓ bases of the ith fragment.

After generating this set, all reads are mapped to the reference genome. Some of the reads can be mapped uniquely to a location on the genome. We call such reads anchored. More precisely, a read R is anchored if there is only one location on the genome within Hamming distance α|R| of it, where |R| is the read length and α is some fixed constant.
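The anchoring test can be illustrated with a brute-force sketch. The helper names `hamming` and `anchor_position` are hypothetical, and the linear scan over the reference merely stands in for a real aligner such as Meta-aligner; it is only meant to show the uniqueness criterion.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def anchor_position(read: str, reference: str, alpha: float):
    """Return the single reference position whose window lies within Hamming
    distance floor(alpha * |read|) of the read, or None when no position or
    more than one position qualifies (the read is then not anchored)."""
    max_dist = int(alpha * len(read))
    hits = [p for p in range(len(reference) - len(read) + 1)
            if hamming(read, reference[p:p + len(read)]) <= max_dist]
    return hits[0] if len(hits) == 1 else None

if __name__ == "__main__":
    ref = "ACGTACGGTT"
    print(anchor_position("CGG", ref, alpha=0.0))  # occurs once in ref
    print(anchor_position("ACG", ref, alpha=0.0))  # occurs twice in ref
```

A read that matches nowhere and a read that matches in several places are treated alike here: neither is anchored, so both would keep being extended.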

After mapping, we partition the set of reads into three disjoint sets: (i) the reads that are anchored to some location on the genome and whose extension does not increase the coverage, (ii) the reads that are anchored to some location on the genome and whose extension will increase the coverage, and (iii) the reads that are not anchored in the first step. For each read R_i(ℓ) in the first set, a termination command is initiated to stop further reading of the ith fragment. The union of the second and third sets is the set of fragments on which the reading process continues.

Subsequently, the next base of every fragment in the continuing set is read, and the same procedure is used for mapping and termination. At the end of this step, we therefore have a set of anchored and terminated fragments of length ℓ + 1 and a set that is used for extension in the next step. Proceeding in this way, one reaches step L − ℓ + 1, where all remaining fragments are extended to the maximum length L.

The output of the procedure is the set of reads that are uniquely mapped by the algorithm. Our proposed algorithm is detailed in Algorithm 1.

Algorithm 1

Input: N fragments of sizes L_i from a target genome, plus a reference genome of length G.

Output: A set of reads (the output set), mapped to the reference genome.

Initiate:

Let L = max_i L_i, and start with an empty output set. Fix ℓ (the sub-fragment’s length) and α ∈ [0, 1]. Set the active set to be the set of all fragments.

1: for k = 1 to L − ℓ + 1 do

2:  if k = 1 then

3:   Sequence the first ℓ bases of all reads in the active set.

4:  else

5:   Sequence the (ℓ + k − 1)-th base of all reads in the active set.

6:  end if

7:  Map all reads in the active set to the reference genome using their last ℓ bases and Hamming distance d = ⌊αℓ⌋.

8:  Add the uniquely mapped reads of the active set to the anchored set. Put the rest of the reads in the unanchored set.

9:  Move reads from the anchored set to the output set, and stop sequencing them, if further extension would not cover a new base on the reference genome.

10:  Keep reads from the anchored set in the active set if further extension would cover new bases on the reference genome; together with the unanchored reads, they form the active set of the next step.

11: end for
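The loop above can be sketched as a toy simulation. This is a minimal illustration under several assumptions that are not part of the paper: the target equals the reference (no variation, no sequencing errors), mapping is an exact ℓ-mer lookup, a read stays anchored once it has been anchored, and "further extension covers no new base" is approximated by checking only the next base. The function name `simulate_algorithm1` and all parameter values are hypothetical.

```python
import random
from collections import Counter

def simulate_algorithm1(G=5000, N=300, L=200, ell=14, seed=1):
    """Toy run of the terminate-or-extend loop on a random genome."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(G))
    # An ell-mer anchors a read only if it occurs exactly once in the reference.
    kmer_count = Counter(genome[p:p + ell] for p in range(G - ell + 1))

    starts = [rng.randrange(G - L + 1) for _ in range(N)]
    length = [ell] * N            # bases sequenced so far, per fragment
    anchored = [False] * N
    covered = [False] * G         # reference bases covered by anchored reads
    active = set(range(N))

    while active:
        # Mapping step: try to anchor each active read with its last ell bases.
        for i in active:
            end = starts[i] + length[i]
            if anchored[i]:
                covered[end - 1] = True      # only the newest base is new
            elif kmer_count[genome[end - ell:end]] == 1:
                anchored[i] = True
                for q in range(starts[i], end):
                    covered[q] = True
        # Control step: block reads whose next base is already covered.
        still_active = set()
        for i in active:
            end = starts[i] + length[i]
            if length[i] == L or (anchored[i] and covered[end]):
                continue                     # terminated
            length[i] += 1                   # sequence one more base
            still_active.add(i)
        active = still_active

    full = [False] * G            # baseline: every fragment read to length L
    for s in starts:
        for q in range(s, s + L):
            full[q] = True
    return {"sequenced": sum(length), "baseline": N * L,
            "covered": sum(covered), "covered_full": sum(full)}

if __name__ == "__main__":
    print(simulate_algorithm1())
```

In this toy setting the interesting comparison is between `sequenced` and `baseline` (bases actually read versus N·L), while `covered` and `covered_full` show whether early termination gave up any coverage relative to reading every fragment in full.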

The two parameters of the algorithm, ℓ and α, should be specified based on the structure of the reference genome as well as on G, N and L. One can choose ℓ to be as low as 1. However, the chance of finding uniquely mapped reads is very small for short reads, which results in a much higher processing time. On the other hand, if ℓ is chosen to be large, many reads will overlap after the first step of the algorithm and many redundant bases will be sampled. Therefore, an optimal choice of ℓ is desired. The optimal value of ℓ depends on the size and repeat structure of the genome as well as on the statistics of the variations between the target and reference genomes. In the following, we describe the selection of these parameters for noiseless reads.

It can be shown that for a random DNA sequence of length G, the probability of observing two exact copies of a substring of length ℓ is lower than G²4^(−ℓ) [2]. Hence, a noiseless read of length ℓ > log G from the target genome can almost certainly be mapped uniquely to the reference genome. Conversely, a noiseless read of length ℓ < log G from the target genome will, with high probability, be mapped to at least two locations on the reference genome. Therefore, for noiseless reads with no variation between the target and reference genomes, we can choose ℓ = log G.
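As a quick numerical check of this threshold, the sketch below evaluates the union bound G²·4^(−ℓ) and the smallest ℓ that pushes it below a tolerance δ. The function names and the tolerance parameter are illustrative assumptions; note that with a base-2 logarithm the bound equals exactly 1 at ℓ = log G and decays geometrically beyond it.

```python
import math

def repeat_probability_bound(genome_size: int, ell: int) -> float:
    """Union bound G^2 * 4^(-ell) on seeing two identical ell-mers in an
    i.i.d. uniform genome of length G, capped at 1."""
    return min(1.0, genome_size ** 2 * 4.0 ** (-ell))

def min_unique_length(genome_size: int, delta: float) -> int:
    """Smallest ell for which G^2 * 4^(-ell) <= delta."""
    return math.ceil(math.log(genome_size ** 2 / delta, 4))

if __name__ == "__main__":
    G = 3_000_000_000
    for ell in (25, 30, 35, 40):
        print(ell, repeat_probability_bound(G, ell))
    print("ell needed for delta = 0.01:", min_unique_length(G, 0.01))
```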

The choice of α depends on the mapper quality and allowable Hamming distance between reads and the reference genome. If we assume a perfect mapper, the Hamming distance between a read and its true location is ν|R| where ν is the variation rate between target and reference genomes. Therefore, it suffices to set α = ν. In this scenario, we consider that variation between reference and target genome is negligible.

In order to evaluate the performance of the algorithm, we need to prove that the genome can be completely covered by the reads in the output set and that the number of bases read more than once is small. After the first step of the algorithm, i.e., sequencing ℓ bases of all reads, all reads are anchored to their correct locations on the genome with maximum Hamming distance d = 0. This follows from the preceding discussion: in random genomes, duplicate segments are scarce. We distinguish between two cases:

  1. Two subsequent reads have common bases, such as ith read and (i + 1)th read in Fig 3.
  2. Two subsequent reads do not have a common base, such as jth read and (j + 1)th read in Fig 3.

Fig 3. Two possible cases for two subsequent reads.

They are either disjoint or have some common bases.

https://doi.org/10.1371/journal.pone.0164888.g003

In the first case, we encounter reads whose extension does not increase the coverage and whose sequencing should therefore be terminated. Reads of this kind belong to the terminated set. We call the bases common to the ith and (i + 1)th reads in this case over-read bases. In the second case, we encounter reads whose extension will increase the coverage. We put reads of this kind in the set to be extended in the subsequent steps of the algorithm. Consequently, sequencing of an aligned read in the extension set continues until it becomes a member of the terminated set.

For further analysis, we assume that the starting points of the reads form a Poisson point process with rate λ = N/G. Hence, the distances between consecutive starting points are independent and exponentially distributed.

Let O_i be a random variable representing the number of extra bases read by the sequencing machine for the ith read. If O denotes the total number of over-read bases, then O = Σ_i O_i. Therefore, E[O] = N E[O_i].

To compute E[O_j], note that the read following the jth read starts x bases after the jth read. If x ≤ ℓ, then O_j = ℓ − x. Otherwise, O_j = 0, as shown in Fig 4. Since x has an exponential distribution, we obtain

$$E[O_j] = \int_0^{\ell} (\ell - x)\,\lambda e^{-\lambda x}\,dx = \ell - \frac{1 - e^{-\lambda \ell}}{\lambda}. \tag{1}$$

Fig 4. A terminated read and its adjacent read.

The starting point of the second read is x bases after the starting point of the overlapped read from the right hand side (assuming that the two reads are of equal length ℓ).

https://doi.org/10.1371/journal.pone.0164888.g004

In the case of i.i.d. genomes, we set ℓ = log G. Hence, the average number of over-read bases over all reads, E[O], is

$$E[O] = N\left(\ell - \frac{1 - e^{-\lambda \ell}}{\lambda}\right) = G\left(\kappa - 1 + e^{-\kappa}\right), \tag{2}$$

where λ = N/G is the rate of read starting points and κ = λℓ = N log G/G. For a constant value of κ, E[O] is O(G) and the number of reads becomes N = κG/log G. To be able to cover the genome with N reads, we should be able to close gaps of at least G log G/N bases (from the coverage bound). Therefore, the read length becomes L = (log G)²/κ.

We observe that by tuning κ, one can control the maximum read length as well as the total number of bases read by the machine. For instance, by choosing κ = 1, we obtain E[O] = G/e ≈ 0.37G and the maximum read length becomes L ≈ 1000 bps.
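The trade-off controlled by κ can be checked numerically. The sketch below assumes the model described above: Poisson read starts with rate λ = N/G, an over-read of ℓ − x bases when the next read starts x ≤ ℓ bases later, κ = λℓ, and a base-2 logarithm for ℓ = log G. The function names are illustrative, and the Monte Carlo routine is only a sanity check of the closed form.

```python
import math
import random

def expected_overread_per_read(lam: float, ell: float) -> float:
    """E[max(0, ell - x)] for x ~ Exponential(lam)."""
    return ell - (1.0 - math.exp(-lam * ell)) / lam

def expected_total_overread(G: float, kappa: float) -> float:
    """N * E[over-read] with N = kappa * G / ell; ell cancels out."""
    return G * (kappa - 1.0 + math.exp(-kappa))

def max_read_length(G: float, kappa: float) -> float:
    """L = (log2 G)^2 / kappa, from L = G log G / N and N = kappa G / log G."""
    return math.log2(G) ** 2 / kappa

def monte_carlo_overread(lam, ell, trials=200_000, seed=0):
    """Empirical mean of max(0, ell - x) over exponential gaps."""
    rng = random.Random(seed)
    return sum(max(0.0, ell - rng.expovariate(lam)) for _ in range(trials)) / trials

if __name__ == "__main__":
    G = 3e9
    for kappa in (0.5, 1.0, 2.0):
        print(kappa, expected_total_overread(G, kappa) / G, max_read_length(G, kappa))
```

Smaller κ means fewer, longer reads and fewer over-read bases; larger κ shortens the reads at the cost of more overlap between neighbouring ℓ-base prefixes.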

Next, we consider noisy reads with the error rate ϵ. Moreover, we assume that the target and reference genomes vary in their sequences with the variation rate ν. Before presenting our algorithm for noisy reads, we set a parameter denoted by Cϵ for the coverage depth. The coverage depth helps in removing sequencing errors by averaging over several reads. The coverage depth is chosen such that if Cϵ reads cover a base then it is possible to correctly recover that base with a given probability Pϵ.

To compute Cϵ, we choose a base as the target base if it achieves the highest number of votes over reads covering that location. Let denote the error event of incorrect calling of the ith base. If the random variable Ci denotes the coverage of the ith base, then error occurs if the corresponding base in more reads are incorrect. Thus, the probability of error for the ith base is, (3)

Hence,

For example, if Pϵ = 10⁻⁴ and ϵ = 0.07, Cϵ becomes 6. To obtain a coverage depth of Cϵ, it suffices to change the kth step of Algorithm 1 as follows: an anchored read is kept for further extension when its extension will cover a base c times on the reference genome, where c ≤ Cϵ. The details of the proposed algorithm for noisy reads are presented in Algorithm 2.

Algorithm 2

Input: N fragments of sizes L_i from a target genome with G bases, plus a reference genome of the same length.

Output: A set of reads (the output set), mapped to the reference genome.

Initiate:

Let L = max_i L_i, and start with an empty output set. Fix ℓ (the sub-fragment’s length), α ∈ [0, 1] and dmax. Fix Cϵ. Set the active set to be the set of all fragments.

1: for k = 1 to L − ℓ + 1 do

2:  if k = 1 then

3:   Sequence the first ℓ bases of all the reads in the active set.

4:  else

5:   Sequence the (ℓ + k − 1)-th base of all the reads in the active set.

6:  end if

7:  Map all the reads in the active set to the reference genome using their last ℓ bases and Hamming distance dmax.

8:  Add the uniquely mapped reads of the active set to the anchored set. Put the rest of the reads in the unanchored set.

9:  Move reads from the anchored set to the output set, and stop sequencing them, if further extension would cover a base more than Cϵ times on the reference genome.

10:  Keep reads from the anchored set in the active set if further extension would cover a base c times on the reference genome, where c ≤ Cϵ; together with the unanchored reads, they form the active set of the next step.

11: end for
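The only change relative to the noiseless procedure is the depth-aware rule in steps 9 and 10. A minimal sketch of that rule follows, assuming the decision is made one base at a time from a per-position depth counter; the names `keep_sequencing`, `record_base` and `depth` are hypothetical and do not come from the paper.

```python
def keep_sequencing(depth, next_base, c_eps):
    """Return True to extend an anchored read by one base, False to block it.

    depth[p] counts the anchored reads already covering reference position p,
    and next_base is the position the read's next base would land on.
    Extending is allowed only while the resulting depth stays at or below c_eps.
    """
    return depth[next_base] < c_eps

def record_base(depth, next_base):
    """Update the depth counter after one more base has been sequenced."""
    depth[next_base] += 1

if __name__ == "__main__":
    depth = [6, 5, 0]
    for pos in range(3):
        print(pos, keep_sequencing(depth, pos, c_eps=6))
```

Setting `c_eps = 1` recovers the noiseless rule, where a read is blocked as soon as its next base is already covered once.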

Because of sequencing errors, mapping fragments of length with maximum Hamming distance dmax to the reference genome is error prone and we first analyze the performance of the alignment procedure. Let and denote the true and false alignment events for the ith read, respectively. The ith read of length with a maximum Hamming distance of dmax is mapped to its true location with probability (4) and is mapped to a false location on the genome with maximum Hamming distance dmax with probability (using the union bound), (5) where Pw() represents the probability of incorrect alignment of a fragment of length to another position on the reference genome. Denote, as the false alignment event. Hence,

Therefore, if Pw() scales as , tends to zero and all reads are mapped to their true location with ’s. In the worst case, if any read is extracted from each base of the reference genome, i.e. N = G, then Pw() scales as . It can be easily verified that for ϵ = 0.05, = 2log G and dmax = 8, tends to zero and . Therefore for ϵ ≤ 0.05, all noisy fragments of length are uniquely mapped to their correct locations on the reference genome.

In order to compute the number of extra read bases, we use a similar argument as the one used in the noiseless case. To obtain for the jth read in this case, assume that the read after this read starts x bases further. Again . Since, x has an Erlang distribution, we obtain (6) where γ(s, x) is the lower incomplete gamma function and for is equal to

Thus, the average number of over-read bases for all reads in this step is, (7) where . Therefore, for any constant κn, becomes O(G). The coverage bound when each base of the genome is covered by at least Cϵ reads can be determined as follows, (8) where and are error events that at least one base and the ith base of the genome are not covered by at least Cϵ reads, respectively. Therefore, if (9) then each base of the genome is covered by at least Cϵ reads almost surely. Subsequently, for a fix and constant value of κn, number of reads and read length from Eq (9) become and , respectively.
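The noisy-case over-read can be evaluated in closed form under one explicit assumption that is ours rather than the paper's stated formula: x is taken to be the distance to the start of the Cϵ-th following read, so that x is Erlang(Cϵ, λ), and the over-read is again max(0, ℓ − x). With F_k the Erlang(k, λ) distribution function, the identity x·f_k(x) = (k/λ)·f_{k+1}(x) gives E[max(0, ℓ − x)] = ℓ·F_k(ℓ) − (k/λ)·F_{k+1}(ℓ). The function names below are illustrative.

```python
import math

def erlang_cdf(k: int, lam: float, x: float) -> float:
    """P(X <= x) for X ~ Erlang(k, lam), i.e. the regularized lower
    incomplete gamma function for an integer shape k."""
    y = lam * x
    return 1.0 - math.exp(-y) * sum(y ** j / math.factorial(j) for j in range(k))

def expected_overread_noisy(lam: float, ell: float, c_eps: int) -> float:
    """E[max(0, ell - x)] for x ~ Erlang(c_eps, lam)."""
    return (ell * erlang_cdf(c_eps, lam, ell)
            - (c_eps / lam) * erlang_cdf(c_eps + 1, lam, ell))

if __name__ == "__main__":
    lam, ell = 0.05, 20.0
    for c_eps in (1, 4, 6, 8):
        print(c_eps, expected_overread_noisy(lam, ell, c_eps))
```

For `c_eps = 1` the expression reduces to the exponential (noiseless) case, and it shrinks as the required depth grows, since more reads must pile up within ℓ bases before any base counts as over-read.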

Again, there exists a trade-off between the read length and the number of read bases, which can be controlled by κn. Fig 5 shows the normalized average number of over-read bases in Eq (7) for sequencing error rates ϵ = {0, 0.02, 0.05} and Cϵ = 6. Also, Fig 6 shows the total number of read bases as a function of read length for sequencing error rates ϵ = {0, 0.02, 0.05, 0.1}. We set Pϵ = 10⁻⁴ and, therefore, for ϵ = {0, 0.02, 0.05, 0.1} we use Cϵ = {1, 4, 6, 8} from Eq (3) and ℓ = {1, 1.5, 2, 3} × log G with dmax = {0, 4, 8, 15}, respectively, which satisfy the alignment constraints. These figures show that when ϵ = 5%, approximately 6.08G bases are read for a read length of L ≈ 1000 bps, leading to only 0.08G over-read bases.

Fig 5. The normalized average number of over-read bases for Cϵ = 6.

The normalized average number of over-read bases and read lengths for different sequencing error rates and Cϵ = 6.

https://doi.org/10.1371/journal.pone.0164888.g005

Fig 6. The normalized average number of over-read bases with Pϵ = 10⁻⁴.

The normalized total number of read bases as a function of read length for different sequencing error rates and Pϵ = 10⁻⁴. Note that the total number of read bases at each error rate tends to its corresponding Cϵ.

https://doi.org/10.1371/journal.pone.0164888.g006

Real Genomes

In this section, we consider DNA sequencing of real genomes, where many repeats are dispersed across the genome. First, we assume that reads are noiseless. Note that if all the ℓ-mers of the genome are repetitive elements, then Algorithm 1 fails to anchor reads correctly to the reference genome, and therefore reading O(G log G) bases is unavoidable. However, as we will show, the repeat patterns in real genomes allow successful coverage of bases with only O(G) read bases.

A mosaic model for capturing the repeat structure of the genome is presented in [3]. In this model, the reference genome consists of two types of intervals: repeat and random intervals. These types are defined based on two parameters, ℓ and d, representing the fragment length and mismatch factor, respectively. Repeat (random) intervals are runs of consecutive bases such that any fragment of length ℓ starting from a base within them can be aligned to some other location(s) (only one location) of the genome with maximum Hamming distance d. For the sake of simplicity, we consider only d = 0.
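For d = 0 the classification into repeat and random intervals reduces to counting exact ℓ-mers. The following sketch labels each starting position and collapses the labels into runs; the function names `repeat_mask` and `intervals` are hypothetical, and the in-memory counter is only practical for small examples, not for a genome the size of hg19.

```python
from collections import Counter

def repeat_mask(genome: str, ell: int):
    """mask[p] is True when the ell-mer starting at p also occurs at some
    other position of the genome (exact match, d = 0)."""
    n = len(genome) - ell + 1
    counts = Counter(genome[p:p + ell] for p in range(n))
    return [counts[genome[p:p + ell]] > 1 for p in range(n)]

def intervals(mask):
    """Collapse the mask into (kind, start, length) runs."""
    runs, start = [], 0
    for p in range(1, len(mask) + 1):
        if p == len(mask) or mask[p] != mask[start]:
            runs.append(("repeat" if mask[start] else "random", start, p - start))
            start = p
    return runs

if __name__ == "__main__":
    print(intervals(repeat_mask("ACGTTTACGAAC", 3)))
```

The lengths of the "repeat" runs produced this way are the kind of data from which an empirical repeat length distribution could be tabulated.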

Let us denote the set of all exact repeat intervals of the reference genome as . Also, assume that a repeat has length R and repeat lengths have the distribution f. We need to treat reads starting from repeat and random intervals, differently. For this purpose, we consider three starting regions for starting point of a given read, as S1, S2, and S3. We can determine the average number of total over-read bases (i.e. ) as follows, (10) where is a random variable that denotes the starting region of the ith read. Hence, (11)

In the following, we determine each term of Eq (11). Region S1 consists of random intervals such that reads starting from random intervals can be anchored to their true locations based on their first bases. Therefore, we can readily compute the average number of over-read bases in random intervals using Eq (1). More precisely, where the total average number of these reads is

On the other hand, a read starting from a repeat interval cannot be anchored unless it contains an ℓ-mer that resides in a random interval. Using this fact (cf. Fig 7), each repeat interval of length R can be partitioned into two disjoint intervals: 1) the mappable zone, consisting of the last min{L − ℓ, R} bases, and 2) the un-mappable zone, consisting of the first max{0, R − L + ℓ} bases. Regions S2 and S3 are the mappable and un-mappable zones, respectively.
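The partition of a repeat interval can be written as a two-line helper. It assumes that the mappable zone is the last min{L − ℓ, R} bases, which is the choice that makes the two zones add up to R: a read of length L starting inside the repeat contains an ℓ-mer beginning after the repeat ends only if it starts within L − ℓ bases of that end. The function name `repeat_zones` is illustrative.

```python
def repeat_zones(R: int, L: int, ell: int):
    """Split a repeat interval of length R into (mappable, unmappable) sizes
    for reads of maximum length L and sub-fragment length ell."""
    mappable = min(L - ell, R)
    unmappable = max(0, R - L + ell)
    return mappable, unmappable

if __name__ == "__main__":
    for R in (100, 970, 5000):
        print(R, repeat_zones(R, L=1000, ell=30))
```

Repeats shorter than L − ℓ are entirely mappable, so only repeats longer than the read length minus one sub-fragment force fragments to be read to full length.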

Fig 7. Repeat intervals regions.

The first region (S1) consists of random intervals. The second region (S2) consists of the last L − ℓ bases of repeat intervals. The third region (S3) consists of the other bases of repeat intervals.

https://doi.org/10.1371/journal.pone.0164888.g007

Clearly, reads from un-mappable zones cannot be anchored and therefore need to be read up to length L. Using Eq (1), we compute the average number of over-read bases for each read in un-mappable zones as: where the total average number of these reads is

The next step is to compute the number of over-read bases for reads from mappable zones. Consider all mappable regions smS2 of length lm, for . If the mth repeat interval has length m, then lm = min{L, m}. At least lm + bases of each read within sm are being sequenced. Consider the ith read has distance li from the end of its mappable zone. The average number of over-read bases for the ith read in mappable zones can be determined as:

Thus, the average number of total over-read bases for all reads in mappable zones becomes, (12)

Define,

Therefore, the average number of over-read bases by considering repeat structure in Eq (10) becomes, (13)

Given the repeat length distribution (i.e. f_R) of any real genome, we can determine the average number of over-read bases for that genome. The distribution of log f_R for the Human genome hg19 is illustrated in Fig 8. Also, Fig 9 shows the average number of over-read bases for the whole genome of hg19 and for an i.i.d. genome. In this simulation, we used ℓ = log G ≈ 30. The results confirm that, for a real data set, reading only O(G) bases suffices to assemble the genome.

Fig 8. Repeat length distribution for Human genome.

The distribution of the logarithm of repeat lengths (i.e. log f_R) for the Human genome hg19, based on the Meta-aligner model with (ℓ, d) = (30, 0) defined in [3].

https://doi.org/10.1371/journal.pone.0164888.g008

Fig 9. Comparison between i.i.d. and Human genomes in noiseless case.

The average number of over-read bases for an i.i.d. genome and the Human genome hg19 with ℓ ≈ 30 and N = G log G/L.

https://doi.org/10.1371/journal.pone.0164888.g009

When reads are contaminated with sequencing errors of rate ϵ, we use a value of ℓ such that a fragment of length ℓ is aligned to its correct location with probability close to one. Also, consider that the coverage depth, Cϵ, is given by Eq (3). We model the reference genome with the Meta-aligner (ℓ, d = 0)-model. Thus, using the same argument as for noiseless reads, the average number of over-read bases can be determined similarly to Eq (13) by incorporating VCϵ(.) in Eq (6) instead of V1(.). Therefore, the average number of over-read bases for real genomes in the presence of noise becomes, (14) where VCϵ(L) ≈ λLCϵ. Fig 10 shows the normalized average number of over-read bases for the whole genome of hg19 with ϵ = {0, 0.05}. We use ℓ = log G ≈ 30 and ℓ = 2log G ≈ 60 for ϵ = 0 and ϵ = 0.05, respectively.

Fig 10. Comparison between i.i.d. and Human genomes in noisy case.

The normalized average number of over-read bases for an i.i.d. genome and the Human genome hg19 with ℓ ≈ {30, 60} and N = G log G/L for ϵ = {0, 0.05}, respectively.

https://doi.org/10.1371/journal.pone.0164888.g010

Algorithm Coverage Analysis

In this section, we compute the number of bases of the reference genome that are left uncovered (gaps) when the proposed algorithms are used. First, consider i.i.d. genomes. We show that the whole genome is in fact covered by the noiseless reads when using Algorithm 1. Suppose that all reads of length ℓ are aligned to the reference genome. Let E denote the error event that some base is not covered by our algorithm, and let E_i denote the event that the ith base is not covered. Thus,

$$P(E) = P\left(\bigcup_{i=1}^{G} E_i\right). \tag{15}$$

From the union bound, we have

$$P(E) \le G\,P(E_i) \tag{16}$$

for arbitrary i ∈ {1, ⋯, G}. Define S_i as the set of starting points of reads that are aligned to the reference genome less than L bases before the ith base location. Since the read in S_i nearest to the ith base does not overlap with other reads on its right hand side, this read will be extended in subsequent steps of the algorithm and, in the end, the ith base will be covered by it. Hence, for E_i to occur, S_i must be empty, i.e., no read’s starting point is located less than L bases before the ith base. This condition is the same as the coverage bound condition. Thus,

$$P(E) \le G\,P(S_i = \emptyset) \approx G\,e^{-NL/G}. \tag{17}$$

Consequently, if the number of reads N and the read length L satisfy the coverage condition in [1] (i.e. NL ≥ G log G), the sequence is completely covered by the reads in our method for noiseless reads and i.i.d. genomes.

For noisy reads, we show that the reference genome is perfectly covered when noisy reads are used in Algorithm 2, using the same argument as for noiseless reads. Let E denote the error event that some base is not covered by at least Cϵ reads in our algorithm, and let E_i denote the corresponding event for the ith base. Then, by the union bound, P(E) ≤ G P(E_i), and E_i occurs when fewer than Cϵ reads’ starting points are located within L bases before the ith base. This condition is the same as the coverage bound condition with a given Cϵ in Eq (8). Thus, if the number of reads N and the read length L satisfy the coverage condition for noisy reads in Eq (9), the sequence is completely covered by at least Cϵ noisy reads in our method as well.

Now consider a real genome. We must determine how many bases are covered with Algorithm 1 or Algorithm 2 (based on noiseless or noisy reads). For this purpose, let denote the error event that the kth base is not covered by reads with the proposed algorithms. We only consider coverage in this section, therefore, we use Cϵ = 1 for noisy reads. Based on the base location and its neighboring repeat intervals within the genome, this base is classified into two different classes as shown in Fig 11. Using these two classes, different sub-classes for locating random and repeat intervals can be modelled. Note that similar to i.i.d. genome, locating one read within distance of L bases before a given base is sufficient for covering that base. In the following, we determine probability of the for each class. In the proposed analysis, we assume that each fragment of length is mapped uniquely to the reference genome with probability pt. Also, we denote the number of reads within a random interval of length l as .

  1. Class A: Assume that d1 and d2 bases within a distance of L bases before and after the kth base, respectively, are in a random interval. If a read has a fragment within a random interval, it can be mapped to the reference genome uniquely. If d1 + d2 ≥ L, we divide the interval of length L before the kth base into three parts: 1) all bases with distance [L, d1] from base k, 2) all bases with distance [d1, L − d2] from base k, and 3) the remaining bases of the random interval with distance [L − d2, 0] from base k.
    Divide the first part to disjoint sub-intervals of length . If any read starts within the jth sub-interval, the number of fragments of that read within the random interval is . If any read starts within the second part, the number of fragments of that read within the random interval is . In addition, divide the third part to sub-intervals of length such that if any read starts within the jth sub-interval, the number of fragments of that read within the random interval is . Thus, (18)
    The same result is obtained when d1 + d2 < L, i.e., (19)
  2. Class B: Assume that d1 and d2 bases within a distance of L bases before and after the kth base, respectively, are in a repeat interval. If d1 + d2 ≥ L, we divide the interval of length L before the kth base into three parts, as in class A. Thus, (20)
    The same result is obtained when d1 + d2 < L, i.e., (21)

Fig 11. Classification of reference genome bases.

Classification of each base of the reference genome based on random and repeat intervals of the genome. By considering special cases for these two classes, six sub-classes are created.

https://doi.org/10.1371/journal.pone.0164888.g011

We are interested in error probabilities of some special cases. These cases of interest are illustrated as sub-classes in Fig 11. The error probability of each sub-class is determined in the following.

  • Sub-class (I): This sub-class can be modelled with the first class with d1 = d2 = L. Thus, (22)
  • Sub-class (II): Consider that d bases within a distance of L bases after the kth base are in a random interval. This sub-class can be modelled with the first class with d1 = L and d = d2 < L. Thus, (23)
  • Sub-class (III): Consider that d bases within a distance of L bases before the kth base are in a random interval. This sub-class can be modelled with the first class with d2 = L and d = d1 < L. Thus, the error probability of this sub-class is the same as that of sub-class (II), except that L − d bases within a distance of L before the kth base lie in a random interval.
  • Sub-class (IV): Consider that d bases within a distance of L bases before the kth base are in a repeat interval. This sub-class can be modelled with the second class with d2 = L and d = d1 < L. Thus, (24)
  • Sub-class (V): Consider that d bases within a distance of L bases after the kth base are in a repeat interval. This sub-class can be modelled with the second class with d1 = L and d = d2 < L. Thus, the error probability of this sub-class is the same as that of sub-class (IV), except that L − d bases within a distance of L before the kth base lie in a repeat interval.
  • Sub-class (VI): This sub-class can be modelled with the second class with d1 = d2 = L. Thus, the error probability of this sub-class is 1.

Thus, by considering the repeat structure of the genome, we can determine the probability of coverage for the genome. We consider the Meta-aligner (30, 0)-model for the reference genome. We classify the bases of Human genome hg19 and determine the probability of gap for each class using Eqs (18)–(21). The average probabilities of gap for different coverage depths and read lengths (L) are shown in Figs 12 and 13. These probabilities are shown for pt = 1 and pt = 0.7 (dotted line). Note that the coverage bound shows that, using reads of length L ≥ log G and a sufficiently large coverage depth, all bases of an i.i.d. genome are covered with probability almost one.

Fig 12. The average probability of gap for Human for different read lengths.

The average probability of gap within Human genome hg19 versus coverage depth for different read lengths L = {500, 1000, 2000, 3000} bps with pt = 1 and pt = 0.7 (dotted line). The Meta-aligner (30, 0)-model is used for Human genome.

https://doi.org/10.1371/journal.pone.0164888.g012

Fig 13. The average probability of gap for Human for different coverage depths.

The average probability of gap within Human genome hg19 versus read length L for two values of coverage depth with pt = 1 and pt = 0.7 (dotted line). The Meta-aligner (30, 0)-model is used for Human genome.

https://doi.org/10.1371/journal.pone.0164888.g013

Results and Discussion

In this section, we present simulation results for two cases: simulated reads from chr19 of Human genome hg19, and real reads from Human genome hg19 obtained with 454 technology.

Benchmark

In the simulated reads case, chr19 of Human genome hg19 is used as the reference; noisy reads are extracted uniformly from the reference genome and mapped back to it. We consider sequencing error rates of ϵ = {0, 5, 10}% with 90% mismatches and 10% indels. Errors are added in an i.i.d. manner. In both cases, we use Meta-aligner [3] to align reads using two fixed-length fragments of each read rather than all of its bases. We consider only the first stage of Meta-aligner. Since a mismatch percentage of α alters, on average, a fraction α of the bases in each aligned fragment, we allow Meta-aligner to align reads to the reference genome within a distance of ⌈α × (fragment length)⌉. For both cases, with a reference of G bases, we set this fragment length to log G ≈ 30. Therefore, N = (30 + 4(Cϵ − 1))G/L reads are randomly generated from each reference genome for any read length L and Cϵ. We use C0 = 1, C5 = 6, and C10 = 8.
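As a rough illustration of the arithmetic in this setup, the following Python sketch computes the number of generated reads N = (30 + 4(Cϵ − 1))G/L and the allowed alignment distance for a fragment. The function names, the rounding up to whole reads, the use of integer percentages, and the reference size G = 59,000,000 (a round figure of the order of chr19) are assumptions made for illustration and are not taken from the simulations.

```python
def num_reads(G, L, c_eps, base_depth=30):
    """Number of reads N = (base_depth + 4*(c_eps - 1)) * G / L.

    Rounded up to a whole read using integer arithmetic (an assumption;
    the text does not state how fractional counts are handled).
    """
    depth = base_depth + 4 * (c_eps - 1)
    return (depth * G + L - 1) // L


def max_distance(alpha_percent, frag_len):
    """Allowed alignment distance ceil(alpha * frag_len).

    alpha_percent is the mismatch percentage as an integer (e.g. 5 for 5%),
    which keeps the ceiling exact and avoids floating-point rounding.
    """
    return (alpha_percent * frag_len + 99) // 100


if __name__ == "__main__":
    G = 59_000_000  # illustrative reference size, not a value from the paper
    C = {0: 1, 5: 6, 10: 8}  # C0, C5, C10 as given in the text
    for eps, c_eps in C.items():
        for L in (500, 1000, 2000, 4000):
            print(eps, L, num_reads(G, L, c_eps), max_distance(eps, 30))
```

Note that N × L / G equals 30 + 4(Cϵ − 1) regardless of the read length, so the coverage depth in this setup depends only on the error rate through Cϵ.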

Simulated Reads

In each simulation of the simulated reads case, we report the number of reads aligned by Meta-aligner. The Meta-aligner reports show its robustness to sequencing errors: it aligns many reads almost correctly at its first stage. It should be noted that two fragments of each read must be read at the first step of Meta-aligner, which increases the number of over-read bases. However, most of the mapped reads are located on the genome correctly.

We first determine the number of mapped reads at the end of the first stage of Meta-aligner. Fig 14 shows the fraction of mapped reads for ϵ = {0, 0.05, 0.1}. The results show that most reads are mapped uniquely to the reference genome.

Fig 14. Fraction of mapping.

Fraction of mapped reads for different read lengths and sequencing error rates after the first stage of Meta-aligner.

https://doi.org/10.1371/journal.pone.0164888.g014

Fig 15 presents the fraction of bases not covered by mapped reads (also known as genome gaps) for various sequencing error rates after the first stage of Meta-aligner. As the read length increases, a higher fraction of repeats is bridged by reads and the gap fraction decreases. Fig 16 shows the gap fraction of the genome at each step of the first stage of Meta-aligner, for read length L = 1000 bps and different sequencing error rates. This figure shows that, using the proposed method, the reference genome is gradually covered by the reads. Consequently, since the remaining bases of the genome are located within long repeat regions, they will be covered by the reads.
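The gap fraction discussed here can be computed from the positions of the mapped reads. The sketch below shows one minimal way to do so, assuming each mapped read is given as a hypothetical (start, length) pair on a 0-based linear genome. It is an illustration only and is not the code used to produce the figures.

```python
def gap_fraction(genome_len, mapped_reads):
    """Fraction of genome positions not covered by any mapped read.

    mapped_reads: iterable of (start, length) pairs, 0-based starts.
    Uses a difference array so the cost is O(genome_len + number of reads).
    """
    diff = [0] * (genome_len + 1)
    for start, length in mapped_reads:
        lo = max(start, 0)
        hi = min(start + length, genome_len)  # clip reads at the genome end
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1

    uncovered = 0
    depth = 0
    for pos in range(genome_len):
        depth += diff[pos]  # running sum gives the depth at this position
        if depth == 0:
            uncovered += 1
    return uncovered / genome_len


if __name__ == "__main__":
    # Toy example: three reads on a 10-base genome leave positions 6 and 7 open.
    print(gap_fraction(10, [(0, 3), (2, 4), (8, 5)]))
```

Calling the function after each mapping step with the reads mapped so far would give a step-by-step gap curve of the kind described for Fig 16.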

Fig 15. Fraction of gap.

Fraction of gap within chr19 for different read lengths and different sequencing error rates after the first stage of Meta-aligner.

https://doi.org/10.1371/journal.pone.0164888.g015

Fig 16. Fraction of gap at each step of Meta-aligner.

Step-by-step fraction of gap within chr19 for different sequencing error rates and a read length of L = 1000 bps.

https://doi.org/10.1371/journal.pone.0164888.g016

Fig 17 illustrates the normalized total number of read bases for various sequencing error rates. Note that, for a large enough read length, the number of over-read bases tends to zero and only O(G) bases need to be sequenced by the sequencing machine. For ϵ = {0, 0.05, 0.1}, approximately {1.2, 6.4, 8.7} × G bases are read, respectively, using Meta-aligner with a read length of L = 4000 bps. Fig 18 shows the normalized total number of read bases at each step of the first stage of Meta-aligner, for read length L = 1000 bps and different sequencing error rates. This figure demonstrates that, after each step, some un-mapped reads overlap with mapped reads and the total number of read bases increases.

Fig 17. Total number of read bases.

Normalized total number of read bases for various sequencing error rates and read lengths.

https://doi.org/10.1371/journal.pone.0164888.g017

Fig 18. Total number of read bases at each step of Meta-aligner.

Step-by-step normalized total number of read bases for various sequencing error rates and read length of L = 1000 bps.

https://doi.org/10.1371/journal.pone.0164888.g018

Real Reads

In this section, we obtain experimental results for the total number of read bases required to cover the whole human genome using real reads produced with Roche 454 technology. The data-set can be downloaded from NCBI with accession number SRR003161 (http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR003161). This data-set consists of N = 31,243,790 reads with an average length of 577 bps from the Human genome hg19 (this amounts to approximately 6× coverage). We use the first step of Meta-aligner to map this read-set to the Human genome hg19. Using a fragment length of 30 and d = 1, 25,999,865 reads (≈83.22%) are handled; the total number of read bases and the fraction of gaps are illustrated in Fig 19. The results show that only 0.6 × G bases are over-read for this read-set. Following our strategy, the remaining reads are completely sequenced by the sequencing machine. However, these reads cannot be mapped uniquely to the reference genome, which implies that they either come from repeat regions of the genome or are contaminated by many errors. We use Bowtie2 to map these reads and so decrease the gap fraction. Bowtie2 in default mode handles 1,671,647 reads (≈5.35%) and, as a result, the number of gap bases decreases from 0.13 × G to 0.089 × G. The main reason for the remaining gap is that 6× coverage is not enough to cover the whole genome. A small fraction of the gap is also due to un-mappable reads.
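The percentages and the coverage depth quoted for this read-set can be checked directly from the read counts. The sketch below does this arithmetic; the genome size of 3.1 × 10^9 bases is an assumed round figure for the human genome, used only to estimate the coverage depth, and the constant names are hypothetical.

```python
N_READS = 31_243_790          # reads in SRR003161, as stated in the text
AVG_READ_LEN = 577            # average read length in bps, as stated
MAPPED_META_ALIGNER = 25_999_865  # reads handled by the first step of Meta-aligner
MAPPED_BOWTIE2 = 1_671_647        # additional reads handled by Bowtie2
G_ASSUMED = 3_100_000_000     # assumed genome size, not given in the text


def coverage_depth(n_reads, avg_len, genome_len):
    """Average coverage depth N * L / G."""
    return n_reads * avg_len / genome_len


def fraction(part, total):
    """Share of reads in a subset."""
    return part / total


if __name__ == "__main__":
    print("coverage depth:", coverage_depth(N_READS, AVG_READ_LEN, G_ASSUMED))
    print("Meta-aligner share:", fraction(MAPPED_META_ALIGNER, N_READS))
    print("Bowtie2 share:", fraction(MAPPED_BOWTIE2, N_READS))
```

Under the assumed genome size, the coverage depth comes out slightly below 6, in line with the approximate 6× coverage stated for this data-set.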

Fig 19. Total number of read bases and gap fraction for 454 reads.

Step-by-step normalized gap fraction and total number of read bases for 454 reads.

https://doi.org/10.1371/journal.pone.0164888.g019

As these simulation results show, increasing the sequencing error rate leads to an increase in the number of over-read bases. This is because a read that cannot be aligned to the genome at a shorter length has its length increased iteratively. In addition, it should be noted that genomes with a larger percentage of repeat patterns naturally lead to a greater level of over-read bases (compare chr19 with an i.i.d. genome). In such scenarios, fewer reads are uniquely aligned to the genome due to the ambiguity caused by repeating patterns.

Conclusion

Lander and Waterman presented the coverage bound based on random sampling of an i.i.d. DNA sequence. After sampling, read fragments are sent to the processing part. Under this model, the coverage bound shows that the minimum number of reads required to cover the whole genome is N ≈ G log G/L. Equivalently, NL ≈ G log G bases are required to cover the whole genome. In our method, sequencing and processing are combined: first, only an initial fragment of each read is sequenced, up to a properly chosen length, and the processor then maps the fragments that can be mapped uniquely to the reference genome. Un-mapped reads, and reads that are not overlapped from the right-hand side at the first step, are sent back to the sequencer for extension by the next block of bases. This procedure is repeated until the process reaches the maximum read length L. As shown in the paper, with this approach the number of bases read in the sequencing part reduces to O(G), a reduction by a factor of log G in comparison with the Lander-Waterman coverage bound.
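To make the log G factor concrete, the following sketch evaluates the coupon-collector style estimate under a simplified model: reads of length L start uniformly at random on a circular genome of G bases, so a given base is missed by all N reads with probability (1 − L/G)^N. The natural logarithm, the circular-genome model, and the function names are assumptions of this sketch; it illustrates the classical bound only, not the proposed combined sequencing and processing scheme.

```python
import math


def lander_waterman_reads(G, L):
    """Read count N = G ln G / L, rounded up to a whole read."""
    return math.ceil(G * math.log(G) / L)


def expected_uncovered_bases(G, L, N):
    """Expected number of bases missed by N uniformly placed reads.

    A base is uncovered if none of the N reads starts in the L positions
    that would cover it, which has probability (1 - L/G)**N.
    """
    return G * (1.0 - L / G) ** N


def log_factor(G):
    """Ratio between G ln G sequenced bases and G sequenced bases."""
    return math.log(G)


if __name__ == "__main__":
    G, L = 10**6, 1000
    N = lander_waterman_reads(G, L)
    print("reads:", N)
    print("expected uncovered bases:", expected_uncovered_bases(G, L, N))
    print("log factor for a 3e9-base genome:", log_factor(3_000_000_000))
```

With N chosen as G ln G / L, the expected number of uncovered bases is close to one, which is why roughly G ln G sequenced bases are needed in the classical setting, while a scheme that needs only O(G) bases saves a factor of about ln G.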

We present theoretical results for i.i.d. and real genomes with noiseless and noisy reads. Also, we have simulated our method for chr19 of Human genome hg19 with different sequencing error rates. Simulation results support the validity of the proposed algorithm and demonstrate our improvement on coverage bound for real genomes.

For future work, one may extend our algorithm to derive more efficient alignment algorithms in terms of complexity and precision. This method may also be extended to de novo sequencing.

Author Contributions

  1. Conceptualization: DN SAM.
  2. Data curation: DN.
  3. Formal analysis: DN SAM.
  4. Investigation: DN.
  5. Methodology: DN SAM.
  6. Project administration: SAM BH.
  7. Resources: DN.
  8. Software: DN.
  9. Supervision: SAM BH.
  10. Validation: DN.
  11. Visualization: DN.
  12. Writing – original draft: DN.
  13. Writing – review & editing: SAM BH.

References

  1. Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics. 1988 Apr 1;2(3):231–9. pmid:3294162
  2. Motahari AS, Bresler G, Tse DNC. Information theory of DNA shotgun sequencing. IEEE Trans Inf Theory. 2013 Oct;59(10):6273–89.
  3. Nashta-ali D, Aliyari A, Edrisi MA, Moghadam AA, Motahari SA, Khalaj BH. Meta-aligner: Long-read alignment based on genome statistics. bioRxiv. 2016 Jan 1:060129.
  4. Metzker ML. Sequencing technologies-the next generation. Nat Rev Genet. 2010 Jan 1;11(1):31–46. pmid:19997069
  5. Meller A, Branton D. Single molecule measurements of DNA transport through a nanopore. Electrophoresis. 2002 Aug 1;23(16):2583–91. pmid:12210161
  6. Branton D, Deamer DW, Marziali A, Bayley H, Benner SA, Butler T, Di Ventra M, Garaj S, Hibbs A, Huang X, Jovanovich SB. The potential and challenges of nanopore sequencing. Nat Biotech. 2008 Oct 1;26(10):1146–53. pmid:18846088
  7. Ku CS, Roukos DH. From next-generation sequencing to nanopore sequencing technology: paving the way to personalized genomic medicine. Expert review of medical devices. 2013 Jan 1;10(1):1–6. pmid:23278216