HELIOS: High-speed sequence alignment in optics

Ehsan Maleki; Saeedeh Akbari Rokn Abadi; Somayyeh Koohi

doi:10.1371/journal.pcbi.1010665

Abstract

Current sequence alignment methods are limited by the inherent serialism of their underlying electrical systems, and a few optical approaches for biological data comparison have recently been proposed in response. However, these optical approaches perform poorly because of their inefficient coding schemes. This paper therefore presents a novel all-optical high-throughput method for aligning DNA, RNA, and protein sequences, named HELIOS. The HELIOS method locates character matches, single or multiple mutations, and single or multiple indels within various biological sequences. The HELIOS optical architecture, in turn, exploits the high-speed processing and operational parallelism of optics by coding data in the wavelength and polarization of optical beams. The functionality and accuracy of the HELIOS method are verified through behavioral and optical simulation studies, while its complexity and performance are estimated through analytical computation. The accuracy evaluations indicate that the HELIOS method achieves a precise pairwise alignment of two sequences, highly similar to those of Smith-Waterman, Needleman-Wunsch, BLAST, MUSCLE, ClustalW, ClustalΩ, T-Coffee, Kalign, and MAFFT. According to our performance evaluations, the HELIOS optical architecture outperforms all alternative electrical and optical algorithms in terms of processing time and memory requirement. Moreover, the compact coding scheme considerably increases the number of supported input characters, and hence offers reduced time and space complexities compared to the electrical and optical alternatives. These properties make the HELIOS method and optical architecture highly applicable to biomedical applications.

Author summary

The character-by-character alignment of two long biological sequences, i.e., DNA, RNA, and protein, is a tedious task, but essential for recognizing homologies, relationships, and variations. Every alteration, including mutations (substitutions) and indels (insertions or deletions), is vital for many biological developments such as diagnosis, medicine, and vaccination. However, the applicability of current sequence alignment methods is limited, specifically in processing time and memory usage, by the inherent serialism and imperfections of electrical systems, as well as by the inefficient coding schemes of optical approaches. These limitations lead to approximately quadratic run-time and space requirements in terms of the input sequence lengths, which makes the real-time alignment of large datasets expensive and laborious. Hence, proposing a superior alignment method in terms of accuracy, performance, and applicability can promote biological research and development. Here, we show that long-standing problems in sequence alignment can be overcome by exploiting optics as a novel computing technology. We propose a novel method and its optical architecture for the alignment of DNA, RNA, and protein sequences that exploit high-speed processing and operational parallelism in optics. As our simulation studies confirm, it provides an accurate sequence alignment while outperforming the most widely used electrical and optical alternatives in terms of processing time and memory requirements.

Citation: Maleki E, Akbari Rokn Abadi S, Koohi S (2022) HELIOS: High-speed sequence alignment in optics. PLoS Comput Biol 18(11): e1010665. https://doi.org/10.1371/journal.pcbi.1010665

Editor: Eli Zunder, University of Virginia, UNITED STATES

Received: January 1, 2022; Accepted: October 18, 2022; Published: November 21, 2022

Copyright: © 2022 Maleki et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All the datasets analyzed in this study are publicly available and are cited in the Reference section of the manuscript and its Supporting information files. Moreover, all data, source code, simulation results, and statistical analyses, presented in this manuscript and its Supporting information files are uploaded to Zenodo repository with DOI 10.5281/zenodo.7254829 and can be reached with the following link: https://doi.org/10.5281/zenodo.7254829.

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

This is a PLOS Computational Biology Methods paper.

Introduction

Bioinformatics develops computation-intensive techniques to enhance theoretical and practical biological studies [1]. Pairwise sequence alignment as one of the key operations of bioinformatics compares two DNA, RNA, or protein sequences to recognize homology, similarity, and variation [2]. The character-by-character alignment of two long biological sequences is a tedious task, but essential to locate character matches, mutations (i.e., substitution), and indels (i.e., insertion or deletion) in favor of many biological developments [3].

Many existing sequence alignment methods consume considerable resources to perform an accurate sequence alignment [4]. For instance, Smith-Waterman [5] and Needleman-Wunsch [6] are based on dynamic programming (DP); BLAST [7], ClustalW [8], ClustalΩ [9], T-Coffee [10], and Kalign [11] utilize a heuristic search; MUSCLE [12] and MAFFT [13] are iterative methods which perform an FFT-based cross-correlation; MUMmer [14] relies on suffix trees; and HMM-based methods [15] are built upon a probabilistic model. Despite their accurate sequence alignment, their resource demands, specifically in time and space, originate from their sequential nature [16]. Moreover, these methods suffer from various problems due to the imperfections of electrical systems, such as high computational time and space, high power consumption, heat generation, and slow response [17]. Furthermore, the rapid growth of biological datasets and the advancement of bioscience challenge them more than ever [18]. Although various parallel and distributed optimization methods [19] could moderate some of these problems, the electrical implementation of these algorithms enforces inherently serial computation and high memory requirements [20]. Specifically, these methods lead to high time and space requirements in terms of the input sequence lengths [21, 22], which severely limits their applicability. Hence, proposing a superior method in terms of speed, accuracy, and applicability is crucial for the real-time processing of large biological data.

Fortunately, the inherent benefits of optics and photonics [23], as a novel computing technology, provide high-speed operational parallelism and avoid the imperfections of electrical systems [24]. Accordingly, biophotonics develops optical techniques for biological developments [25]. Several such methods have been proposed recently, including correlation-based methods [26, 27], Fourier-Transform-based algorithms [28, 29], HAWPOD [30], Moiré Technique [31–33], OptCAM [34], GAC [35], and SPOMF [36]. Some of them [26–29] address optical similarity measurement algorithms for sequence alignment by taking advantage of optical correlation and the Fourier Transform. Despite their high-speed processing, these methods only measure the similarities and differences between the input sequences within specific zones, regardless of the exact location of the variations and their importance to biological developments [1]. On the other hand, some studies [30–35] have achieved high-speed optical approaches for pairwise DNA alignment, which are capable of locating the variations. However, their sequence coding assumptions limit the number of input characters to that of DNA sequences, which makes them incapable of aligning RNA and protein sequences and forgoes the specific outcomes of such alignments in diagnosis, medicine, and vaccination [3]. Finally, it should be noted that these methods need an efficient encoding of biological data by an optical modulator to avoid utilizing a large number of pixels per character. Accordingly, proposing a comprehensive pairwise alignment method can promote biological research through character-by-character alignment of various kinds of biological sequences in a fast, accurate process.

We are therefore motivated to propose an advanced ultra-fast all-optical method to accurately align any pair of biological sequences. The proposed method is named HELIOS, abbreviating High-speed sEquence aLIgnment in OpticS. By exploiting high-speed processing and operational parallelism in optics [24], the HELIOS method avoids the problems of current sequence alignment methods, as well as the imperfections of their electrical implementations. In addition, by adopting an efficient optical encoding of biological data, the HELIOS method outperforms the alternative optical methods in terms of time and space requirements. Although the proposed method is discussed in two separate sections for clarity, i.e., the HELIOS method and the HELIOS optical architecture, each is designed to enhance the other, and both form a single coherent system. The basic block illustrations of the HELIOS method and the HELIOS optical architecture are presented in Fig 1A and 1B, respectively. Given its interdisciplinary nature, the HELIOS method can outperform many well-known alignment algorithms in terms of processing time with comparable accuracy, as verified by our comprehensive simulation studies and analytical computations. Finally, the main contributions of this paper are as follows:

Proposing an accurate pairwise sequence alignment method for DNAs, RNAs, and proteins.
Designing an ultra-fast all-optical high-throughput architecture for the proposed method.
Proposing an optical coding scheme utilizing wavelength and polarization of optical beams.

Fig 1. Block diagram illustration of the (A) HELIOS method and (B) HELIOS optical architecture.

(A) the HELIOS method aligns two input sequences by performing the coding and alignment procedures to exactly locate character matches and variations; and (B) the HELIOS optical architecture executes the HELIOS method by performing the optical beam provision unit, the optical modulation and mechanism unit, and output capturing unit, utilizing inherent parallelism and high-speed processing in optics.

https://doi.org/10.1371/journal.pcbi.1010665.g001

The organization of this manuscript is as follows. The Method section establishes the general concept of the HELIOS method, while its optical architecture is elaborated in the Optical Architecture section. Afterward, the Discussion and Results section discusses the functionality, accuracy, performance, and applicability of the HELIOS method and its optical architecture. Finally, the paper is concluded with future directions in the Conclusion and Future Perspective section.

Method

Principally, the HELIOS method is composed of two main procedures to perform a parallel accurate pairwise sequence alignment, as illustrated in Fig 1A. It includes a) Coding procedure to code each character of input sequences with two parameters, and b) Alignment procedure to align two coded input sequences by performing two distinct operations in parallel. The detailed descriptions of the procedures are presented as follows.

Coding procedure

Generally, the adopted coding scheme considerably affects the efficiency of sequence alignment methods [2]. Thus, the coding procedure of the HELIOS method adopts a highly compact, distinct coding pattern to maximize parallelism and to achieve noise reduction. For this purpose, it codes every character of the input sequence according to two coding strategies: a) Self-label coding to code every character based on the character itself, and b) Nearby-label coding to provide a unique code for every character based on its nearby character.

First, the self-label coding provides a distinct code for every character within the input sequence based on the character itself. The required number of distinct codes therefore equals the number of nucleotides (four) in DNA and RNA or the number of amino acids (twenty) in protein sequences. Second, the nearby-label coding provides a distinct code for every character based on its nearby character. Specifically, it provides a unique code for each character based on its k^th previous character. So, by traversing the sequence, as the k^th previous character changes, the code assigned to the current character varies as well. By preserving locality information, it prevents data interference through the alignment procedure, as discussed in the Alignment procedure subsection. As with the self-label coding, the required number of distinct codes is four, four, and twenty for coding DNA, RNA, and protein sequences, respectively.

Finally, to code the input sequence, it is traversed character-by-character, and every character is coded to an entry with two parameters according to both the self-label and nearby-label coding schemes described above. By putting together all coded entries, every sequence is represented as a one-dimensional (1D) vector with a size of 1 × length of the sequence (N). Here, we summarize the proposed coding procedure using Eq 1, formulating the coded pattern, and Eq 2, representing the adopted self-label and nearby-label coding schemes, as follows:

$$\mathrm{Code} = \left[ \left( C_{\mathrm{self},i},\; C_{\mathrm{nearby},i-k} \right) \right]_{i=1}^{N} \tag{1}$$

$$C_{\mathrm{Scheme},j} = \mathrm{Offset}_{\mathrm{Scheme}} + \mathrm{Step}_{\mathrm{Scheme}} \times V_{Ch_j} \tag{2}$$

where vector Code represents the coded pattern for each sequence; parameters C_self,i and C_nearby,i−k stand for the self-label and nearby-label coded values of the i^th character, based on the character itself and its k^th previous character, respectively. Moreover, the variable N represents the length of the sequence. Furthermore, variable C_Scheme,j calculates the coded value of the character in position j within the sequence, according to coding strategy Scheme, which is either the self-label or the nearby-label coding. In addition, the parameter Offset_Scheme defines the smallest value assumed for character coding, and Step_Scheme is the difference between two consecutive code values. It is worth noting that the values of Offset and Step are chosen independently for the self-label and nearby-label coding schemes and differ between them. Finally, the variable Ch_j stands for the character in position j within the sequence, while the parameter V_Chj represents a preset value for every character within the range of 0 to the number of bases minus 1. For instance, it equals 0, 1, 2, and 3 for A, T, G, and C in DNA sequences, respectively.

According to the above discussions, the size of the coded pattern represents the length of the sequence. On the other hand, the value of each entry of the code vector indicates the corresponding character within the input sequence. These features enable random access to each character within the code vector, and hence prevent information loss with no restriction on the length of the input sequence. Moreover, proposing a one-dimensional (1D) coding scheme enables a two-dimensional (2D) arrangement of various coded patterns for further parallel processing. Furthermore, by considering the k^th previous character instead of the adjacent one, the proposed nearby-label coding scheme preserves the uniqueness of the code vector in the case of identical consecutive characters or pairs, like “AAAAAA” or “ACACAC”. Specifically, as the k^th previous character changes in traversing the input sequence, the code assigned to the identical consecutive characters varies as well. This prevents false character matches through the alignment procedure. As parameter k can be set from 1 to the length of the input sequence (N), the input sequence is treated as a circular sequence. Hence, the nearby-label coding scheme can wrap around from one end to the other, in the case of large k, to code each character based on any desired nearby character. In this case study and without loss of generality, we assume k equals R + 1, in which R is the number of the sequence shifts, discussed in the Alignment procedure subsection. Some examples of the proposed procedure for coding protein, DNA, and RNA sequences are presented in Fig 2. It should be noted that the presented values in Fig 2 are chosen according to the optical features and implementation choices, as discussed in the Optical architecture section.

Fig 2. An example of the proposed coding scheme for DNA, RNA, and protein sequences.

In this example, short DNA, RNA, and protein sequences are coded based on the self-label and nearby-label coding schemes with preset values as follows: Offset_self = 450, Step_self = 10, Offset_nearby = 0, Step_nearby = 9, k = 2, and R = 1. The parameter Ch_i stands for the character positioned in location i as the current character in the self-label coding, and the k^th previous character in the nearby-label coding scheme. The parameter V represents a preset value between 0 and 19 for amino acids in the protein sequence and between 0 and 3 for nucleotides in the DNA and RNA sequences. Every character is coded with two values determined by the self-label and nearby-label coding schemes, as represented in its corresponding white block. For characters positioned at the beginning of the sequence, the nearby-label coding wraps around the sequence and considers the desired nearby character at the end of the sequence.

https://doi.org/10.1371/journal.pcbi.1010665.g002
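The coding procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `encode` and the dictionary `DNA_VALUES` are hypothetical, the default offsets and steps are the preset values quoted in the Fig 2 caption, and the preset character values follow the DNA example given in the text (A, T, G, C mapped to 0, 1, 2, 3).

```python
from typing import Dict, List, Tuple

# Preset values V for DNA, following the example in the text (A=0, T=1, G=2, C=3).
DNA_VALUES: Dict[str, int] = {"A": 0, "T": 1, "G": 2, "C": 3}


def encode(seq: str, values: Dict[str, int], k: int = 2,
           offset_self: int = 450, step_self: int = 10,
           offset_nearby: int = 0, step_nearby: int = 9) -> List[Tuple[int, int]]:
    """Return one (self-label, nearby-label) pair per character of seq."""
    n = len(seq)
    coded = []
    for i, ch in enumerate(seq):
        # Self-label code: Offset + Step * V of the character itself.
        c_self = offset_self + step_self * values[ch]
        # Nearby-label code: Offset + Step * V of the k-th previous character.
        # The sequence is treated as circular, so index i - k wraps around.
        prev = seq[(i - k) % n]
        c_nearby = offset_nearby + step_nearby * values[prev]
        coded.append((c_self, c_nearby))
    return coded


print(encode("ACGT", DNA_VALUES))
# [(450, 18), (480, 9), (470, 0), (460, 27)]
```

The first two entries take their nearby-label codes from the last two characters of the sequence, which is the wrap-around behavior described for characters at the beginning of the sequence.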

Alignment procedure

Once the input sequences are coded, the alignment procedure aligns two coded input sequences to determine their similarities and differences by locating character matches, mutations, and single or multiple indels. For this purpose, it performs two operations in parallel: a) S₁-align operation to determine the state of characters (i.e., character matching, substitution, insertion, or deletion) within the first sequence, and b) S₂-align operation to determine the state of characters within the second sequence. For simplicity, the first and second sequences are called S₁ and S₂ in the following, respectively.

S₁-align operation.

To specify the state of characters within S₁, the S₁-align operation determines whether every character in S₁ corresponds to an identical character in S₂; in the case of character mutations (i.e., substitutions) or indels (i.e., insertions or deletions), this correspondence does not exist. Moreover, while mutations only substitute the character itself, indels also cause right-shifting or left-shifting of the rest of the sequence [2]. So, character substitution should be addressed in the case of mutations, and character-shifting in the case of indels. To this end, the S₁-align operation shifts the coded S₂ vector one to R times in the horizontal direction towards the left and right of the main S₂ vector, as depicted in Fig 3B. Afterward, the main S₂ vector and all its shifts are compared to the non-shifted S₁ vector correspondingly. In this comparison, as shown in Fig 3C, a nonzero entry appears in the comparison results if the corresponding self-label and nearby-label codes of the input sequences are identical. Otherwise, in the case of non-identical characters, the corresponding entry of the comparison results remains zero. It is worth noting that while the comparison of S₁ and S₂ enables the detection of matched characters and mutations, single or multiple indels are detected by comparing the S₁ vector to the shifted S₂ vectors. Finally, each entry of the comparison outcome vector is formed by aggregating the corresponding entries of all vectors of the comparison results, which result from comparing the non-shifted S₁ vector with the main S₂ vector and all its shifts. So, the comparison outcome vector is represented in a row, as depicted in Fig 3D. As a key advantage, the distinct code assignment to similar characters within the input sequences by the nearby-label coding scheme prevents data interference through the horizontal shifting and comparing processes.

Fig 3. Step-by-step progress of the S₁-align operation of the HELIOS method for optical sequence alignment.

(A) Two input sequences, i.e., S₁ and S₂, are given to the HELIOS method, assuming the third character (i.e., ‘A’ in S₁ and ‘E’ in S₂) is mutated. (B) S₁ and S₂ are coded based on the proposed coding procedure, assuming k = 2 and R = 1. While the self-label codes of the third character are different for S₁ and S₂, the nearby-label codes of the fifth characters (i.e., ‘S’) are different as well, due to the mutated character, assuming k = 2. Afterward, the coded S₂ is shifted one time horizontally towards the left and right of the coded S₂, assuming R = 1. Then, the main S₁ is compared with the main S₂ and all its shifts, and hence, (C) the comparison results are presented for each comparison. (D) Next, the comparison outcome vector is formed by aggregating all the comparison results, where the matched characters result in nonzero entries. As represented with the zero entry, the mutated character in position 3 within the input sequences is successfully located due to the different self-label codes, while the 5^th character is falsely mismatched due to the different nearby-label codes. (E) To compensate for this false mismatch, the i^th entry of the output is determined by aggregating the i^th and the (i + k)^th entries of the comparison outcome vector. Hence, the 5^th entry is recovered by the corresponding nonzero value at the 7^th entry, while proper detection of the character mutation at the 3^rd entry is not affected.

https://doi.org/10.1371/journal.pcbi.1010665.g003

Summarizing the above discussion, we can conclude that the S₁-align operation successfully locates character substitutions (both single and multiple) and character insertions (both single and multiple) in S₁ (i.e., character deletions from S₂). However, locating character deletions from S₁ requires a further comparative operation, named the S₂-align operation, as follows.
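The shift-and-compare step of the S₁-align operation can be illustrated with a short Python sketch. It is only a sequential stand-in for what the text describes as a parallel optical operation. The names `encode` and `compare_with_shifts` are hypothetical, the two DNA sequences are invented for illustration, and the sketch assumes that shifted entries falling off either end of the vector are simply ignored.

```python
from typing import Dict, List, Tuple

Code = Tuple[int, int]  # (self-label, nearby-label) pair for one character

# Preset values for DNA as given in the text (A=0, T=1, G=2, C=3).
DNA_VALUES: Dict[str, int] = {"A": 0, "T": 1, "G": 2, "C": 3}


def encode(seq: str, k: int) -> List[Code]:
    """Compact coder using the Fig 2 presets and a circular k-th previous character."""
    n = len(seq)
    return [(450 + 10 * DNA_VALUES[seq[i]], 9 * DNA_VALUES[seq[(i - k) % n]])
            for i in range(n)]


def compare_with_shifts(fixed: List[Code], other: List[Code], r: int) -> List[int]:
    """Out[i] is 1 if fixed[i] equals other[j] for some j in [i - r, i + r]."""
    out = [0] * len(fixed)
    for i, code in enumerate(fixed):
        for j in range(i - r, i + r + 1):
            # Assumption: shifted entries that fall off either end are ignored.
            if 0 <= j < len(other) and code == other[j]:
                out[i] = 1
                break
    return out


# Hypothetical sequences with one substitution at position 3 (1-based).
s1, s2 = "GATTACAG", "GACTACAG"
out_s1 = compare_with_shifts(encode(s1, k=2), encode(s2, k=2), r=1)
print(out_s1)  # [1, 1, 0, 1, 0, 1, 1, 1]
```

With k = 2 and R = 1, the zero at position 3 marks the substitution itself, while the zero at position 5 is the false mismatch caused by the changed nearby-label code, mirroring the situation shown in Fig 3D.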

S₂-align operation.

As a complementary comparative operation, the S₂-align operation determines the state of every character in S₂ by finding its corresponding character in S₁. For this purpose, the S₂-align operation repeats the comparative S₁-align operation, except that it shifts the S₁ pattern (instead of S₂) one to R times in the horizontal direction towards the left and right of the main S₁ vector, as shown in Fig 4. Afterward, the main S₁ vector and all its shifts are compared to the non-shifted S₂ vector correspondingly. Thus, this operation successfully locates character mutations (both single and multiple), as well as character insertions (both single and multiple) in S₂ (i.e. character deletions (both single and multiple) from S₁). Specifically, character mutations and insertions are represented with zero entries in the comparison outcome vector.

Fig 4. Side-by-side representation of the S₁-align and S₂-align operations of the HELIOS method.

(A) As an overall view, the S₁-align operation locates character substitutions, as well as character insertions in S₁ (or character deletions from S₂). For this purpose, it compares the main and all shifted S₂ vectors with the S₁ vector. Afterward, to produce the 1D output vector, the i^th entry of the output is determined according to the i^th and (i + k)^th entries of the comparison outcome vector. (B) Similarly, the S₂-align operation compares the main and all shifted S₁ vectors with the S₂ vector to locate character substitutions, as well as character insertions in S₂ (i.e. character deletions from S₁).

https://doi.org/10.1371/journal.pcbi.1010665.g004

It is worth noting that the number of consecutive indels (i.e. consecutive insertions or consecutive deletions) is assumed to not be larger than R, and hence, the value of R should be large enough to support all probable variations between two sequences. However, small values of R can be chosen in the case of aligning two similar sequences. Moreover, for aligning two input sequences with different lengths, the shorter one should slide all over the longer one to determine every probable variation, which results in a large value of R. As each sequence shifts in the horizontal direction towards the left and right of the other sequence, the parameter R varies in the range of [1, ]. However, the various choices of R (from 1 to ) do not affect the processing time and the speed, as discussed in more detail in the Optical architecture section.

Output vector production.

Once the S₁-align and S₂-align operations are performed, every entry of the comparison outcome vector can be determined accordingly, as depicted in Fig 3. Specifically, as shown in Fig 3A and 3D, a mutation or an indel within the input sequence results in a zero value at the corresponding entry of the comparison outcome vector. For example, the 3^rd character (i.e. character ‘A’) within S₁ is mutated against the character ‘E’ in S₂. However, regarding the nearby-label coding scheme, the code of the k^th next character is also affected. Hence, assuming k = 2, the 5^th characters of the input sequences (i.e. characters ‘S’ of S₁ and S₂) are nearby-label coded differently as shown in Fig 3B. Consequently, this variation causes a false mismatch, as well as a false zero value at the 5^th entry of the comparison outcome vector as shown in Fig 3C and 3D, respectively.

To compensate for a falsely mismatched character, the method involves the character whose nearby-label code is determined by that falsely mismatched character, which is the (i + k)^th character (i.e., character ‘L’ at the 7^th entry of S₁ and S₂ in Fig 3A and 3D). Accordingly, to produce the final output, the i^th entry of the output is determined by aggregating the i^th and (i + k)^th entries of the comparison outcome vector. For example, as depicted in Fig 3D and 3E, the false mismatch at the 5^th entry is recovered by the corresponding nonzero value at the 7^th entry, while proper detection of the character mutation at the 3^rd entry is not affected.

As a final word, it should be noted that all aforementioned steps to produce the final output, i.e. shifting, comparing, aggregating, etc., are done in parallel with no hardware complexity, taking advantage of the inherent parallelism in optics.
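The compensation step can be sketched as follows. This is a minimal illustration, assuming that aggregation is a logical OR (as written in Algorithm 1) and that an entry i + k beyond the end of the vector contributes zero; the function name `compensate` and the sample vector are hypothetical.

```python
from typing import List


def compensate(out: List[int], k: int) -> List[int]:
    """Output[i] = Out[i] OR Out[i + k]; entries past the end count as 0 (assumption)."""
    n = len(out)
    return [out[i] | (out[i + k] if i + k < n else 0) for i in range(n)]


# Comparison outcome with a true mismatch at position 3 and a false one at position 5.
out = [1, 1, 0, 1, 0, 1, 1, 1]
print(compensate(out, k=2))  # [1, 1, 0, 1, 1, 1, 1, 1]
```

The zero at position 5 is recovered from the nonzero entry at position 7. The zero at position 3 survives because the entry at position 5 is also zero, so a true mutation is not masked.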

Analysis and review of the output.

As follows, we summarize the proposed alignment procedure: a) the S₁-align operation compares the main and all shifted S₂ vectors with the S₁ vector to locate character substitutions, as well as character insertions in S₁ (or deletions from S₂), as depicted in Fig 4A; and b) the S₂-align operation compares the main and all shifted S₁ vectors with the S₂ vector to locate character substitutions, as well as character insertions in S₂ (or deletions from S₁), as depicted in Fig 4B. To produce the output, the i^th entry of the output is determined according to the i^th and (i + k)^th entries of the comparison outcome vector. As each comparison, performed by the S₁-align or S₂-align operation, produces a 1D vector, the output can be arranged as a two-row matrix, as formulated in Eq 3, while Eq 4 calculates every entry of the output, as follows: (3) (4) where the output vector represents the output of the sequence alignment in two rows, the parameter Out_row,i represents its i^th entry as a result of comparing the i^th character of the input sequences, while Out_row,i+k compensates for a probable false mismatch. Parameter row stands for the index of the output row, resulting from the S₁-align or S₂-align operation. Moreover, variable Out_row,j represents the comparison outcome for the j^th character by aggregating all corresponding entries, comparing the non-shifted vector of one sequence with the main and all shifted vectors of the other one, as discussed in the S₁-align and S₂-align operation subsections. For this purpose, variable A_{row,j,x} represents the comparison result of two coded characters: Code_{row,j} as the j^th coded character within the non-shifted sequence, and Code_{2/row,x} as the x^th coded character within the main or shifted vectors of the other sequence. For instance, for calculating output [1, 3], as the state of the 3^rd entry of S₁, the S₁-align operation aggregates out_1,3 and out_1,5 based on Eq 3, assuming k = 2 and R = 1, while Eq 4 calculates out_1,3, assuming row = 1 and i = 3, by accumulating the results of comparing the code of the 3^rd entry of S₁ (i.e., Code_1,3) with the 2^nd, 3^rd, and 4^th entries of S₂ (i.e., Code_2,2, Code_2,3, and Code_2,4, respectively). A similar operation calculates out_1,5. Finally, the pseudo-code of the coding and alignment procedures is given in Algorithm 1.
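One way to write the relations described in this paragraph, consistent with the definitions above and with Algorithm 1, is sketched below. The use of logical OR as the aggregation operator and the indicator form of A are assumptions made for illustration; the source may use a different operator, such as a sum of intensities.

```latex
% Two-row output; each entry combines positions i and i + k (cf. Eq 3)
\mathrm{Output} =
\begin{bmatrix}
\mathrm{Output}_{1,1} & \cdots & \mathrm{Output}_{1,N} \\
\mathrm{Output}_{2,1} & \cdots & \mathrm{Output}_{2,N}
\end{bmatrix},
\qquad
\mathrm{Output}_{row,i} = \mathrm{Out}_{row,i} \lor \mathrm{Out}_{row,i+k}

% Comparison outcome aggregated over the main vector and its R shifts (cf. Eq 4)
\mathrm{Out}_{row,j} = \bigvee_{x=j-R}^{j+R} A_{row,j,x},
\qquad
A_{row,j,x} =
\begin{cases}
1 & \text{if } \mathrm{Code}_{row,j} = \mathrm{Code}_{2/row,x} \\
0 & \text{otherwise}
\end{cases}

% Worked instance from the text, with k = 2 and R = 1
\mathrm{Output}_{1,3} = \mathrm{Out}_{1,3} \lor \mathrm{Out}_{1,5},
\qquad
\mathrm{Out}_{1,3} = A_{1,3,2} \lor A_{1,3,3} \lor A_{1,3,4}
```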

Algorithm 1 Pseudo-code of the HELIOS method, including the coding and alignment procedures.

Require: S₁ ∧ S₂ ∧ R ≥ 0 ∧ k > 0
for each input sequence called S_input (input = 1 → 2) do
    for character i = 1 → N do
        S_input.Code[i] ⇐ C_self ∪ C_nearby
    end for
end for
for each operation called row = 1 → 2 do
    for entry i = 1 → N do
        Out[row, i] ⇐ 0
        for entry j = (i − R) → (i + R) do
            if S_row.Code[i] = S_2/row.Code[j] then
                Out[row, i] ⇐ 1
            end if
        end for
    end for
    for entry i = 1 → N do
        Output[row, i] ⇐ Out[row, i] ∨ Out[row, i + k]
    end for
end for

After performing the HELIOS method, the output appears as a two-row matrix, in which each entry represents the alignment output of two characters at the corresponding position within the input sequences, as shown in Fig 5. By traversing the output from left to right, nonzero entries in both rows depict identical characters, i.e., character matching, at the corresponding position of the input sequences. On the other hand, zero entries in both rows depict a character mutation at the corresponding position of the input sequences. Finally, a zero entry in one row along with a nonzero entry in the other indicates an indel, i.e., a character insertion or deletion, and can be represented by a gap at the corresponding position of the sequence containing the nonzero entry, as shown in Fig 5.

Fig 5. Output explanation of the HELIOS method.

The output of the HELIOS method is represented with a two-row matrix, while the first and second rows are produced by the S₁-align and S₂-align operations, respectively. Moreover, each entry represents the alignment output of the characters at the corresponding position in the input sequences. By traversing the output from left to right, nonzero entries in both rows depict identical characters, i.e. character matching; while zero entries in both rows depict character mutation. Finally, a zero entry of a row along with a nonzero entry of the other one indicates indel, i.e. character insertion or deletion, and can be represented by a gap at the corresponding position of the sequence, containing the nonzero entry.

https://doi.org/10.1371/journal.pcbi.1010665.g005
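The reading rule for the two-row output can be expressed as a small Python sketch. The function name `classify` and the label strings are hypothetical, and the sketch only applies the column-wise rule stated in the text: the gap is attributed to the sequence whose row holds the nonzero entry.

```python
from typing import List


def classify(row1: List[int], row2: List[int]) -> List[str]:
    """Label each column of the two-row output (row1: S1-align, row2: S2-align)."""
    labels = []
    for a, b in zip(row1, row2):
        if a and b:
            labels.append("match")
        elif not a and not b:
            labels.append("mutation")
        elif a:
            # Nonzero in row 1 only: gap placed in S1, per the rule in the text.
            labels.append("indel (gap in S1)")
        else:
            # Nonzero in row 2 only: gap placed in S2.
            labels.append("indel (gap in S2)")
    return labels


print(classify([1, 0, 1, 0], [1, 0, 0, 1]))
# ['match', 'mutation', 'indel (gap in S1)', 'indel (gap in S2)']
```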

In summary, the HELIOS method exactly locates character matches, mutations, and single or multiple indels through the alignment procedure, while the coding procedure provides distinct coding patterns for the input sequences and reduces noise in the output vector, which is represented in a convenient form.

Optical architecture

Equivalent to the HELIOS method, the HELIOS optical architecture is developed to exploit the inherent parallelism and ultra-fast processing capabilities of optics, as depicted in Figs 1B and 6. The HELIOS optical architecture consists of three main units: a) Optical beam provision unit to prepare collimated beam to feed the proposed optical architecture, b) Optical modulation and mechanism unit to accomplish the coding and alignment procedures of the HELIOS method, and c) Output capturing unit to capture the final output of HELIOS optical architecture. The units are explained in more detail as follows.

Fig 6. Schematic illustration of the HELIOS optical architecture.

(A) The optical beam provision unit provides a collimated beam to feed the whole system. In this manner, the wideband laser beam, produced by a laser source, passes through the laser line bandpass filter and the pinhole to be cleaned. Afterward, the clean beam is diverged and collimated by passing through the objective and imaging lenses, respectively. Finally, the collimated beam is directed to the optical modulation and mechanism unit. (B) In the optical modulation and mechanism unit, passing the collimated beam through WSF #1 modulates the wavelength of the optical beam based on the self-label coding of S₂ and S₁ on the first and second rows of a 2 × N pixels image, respectively; while PSF #1 performs their polarization selection based on their nearby-label coding scheme. Afterward, the objective and imaging lens arrays diverge and recollimate the optical beam along the horizontal direction to perform the shifting process of the alignment procedure. Moreover, WSF #2 and PSF #2 code S₁ and S₂ on the first and second rows of a 2 × N pixels image, respectively. By passing the expanded beams through WSF #2 and PSF #2, the proposed architecture compares the shifted coded S₂ with S₁ at the first row, implementing the S₁-align operation, and compares the shifted coded S₁ with S₂ at the second row, implementing the S₂-align operation. Finally, each pixel is directed to two distinct pixels via a chiral medium to compensate for false mismatches. (C) In the output capturing unit, an optical thresholder eliminates wavelength cross-talks and speckle noises of the output before capturing. Afterward, the output is captured by a bi-convex lens and a charge-coupled device (CCD) camera.

https://doi.org/10.1371/journal.pcbi.1010665.g006

Optical beam provision unit

The optical beam provision unit provides a collimated optical beam to feed the whole optical system. In this manner, it employs a wideband unpolarized laser source [37], a laser line bandpass filter [38], a pinhole, and two lenses, as depicted in Fig 6A. For this purpose, the wideband laser generates an intense coherent monochromatic light beam in a wide spectral range; while the laser line bandpass filter transmits laser light while suppressing ambient light as well as lower-intensity secondary laser lines. It improves contrast by only transmitting light within a specific wavelength range, e.g. 450 to 650 nanometers in this case study. Moreover, the thermal load on the blocking glass and the epoxy is minimized by facing the highly reflective side of the filter toward the laser source. Afterward, the Galilean beam expander model [39] is employed to provide the collimated beam. It utilizes a pinhole and two lenses, including a) an objective lens, which is a bi-concave lens with a negative focal length (−f1) to diverge the beam, and b) an imaging lens, which is a plano-convex lens with a positive focal length (+f2) to collimate the diverged beam; while c) a pinhole, placed at the focal point of the lenses, spatially filters the beam to reduce its high pulse energy density and to prevent arcing in the air. The absence of a focal point between the lenses, because of the different signs of the focal lengths (−f1 + f2), avoids high energy density between the lenses, and results in a compact design, an erect output, and elimination of the correction lens.

Summarizing the above discussion, the wideband laser beam passes through the laser line bandpass filter and the pinhole to be cleaned. Afterward, the clean beam is diverged and collimated by passing through the objective and imaging lenses, respectively. Finally, the collimated beam is directed to the optical modulation and mechanism unit to fill the aperture of the modulator cells with a proper amplitude, wavelength, and polarization of the optical beams.
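As a rough illustration of the Galilean expander geometry, the thin-lens relations (magnification f2/f1 and lens separation f2 − f1, with f1 and f2 taken as magnitudes) can be evaluated directly. The focal lengths and input beam diameter below are made-up values chosen only for the example; the paper does not state them in this section.

```python
def galilean_expander(f1_mm, f2_mm, beam_in_mm):
    """Thin-lens Galilean expander: diverging lens (-f1) followed by converging lens (+f2).

    f1_mm and f2_mm are the magnitudes of the two focal lengths.
    Returns (magnification, lens separation in mm, output beam diameter in mm).
    """
    if f1_mm <= 0 or f2_mm <= f1_mm:
        raise ValueError("expected 0 < f1 < f2 for a beam expander")
    magnification = f2_mm / f1_mm
    separation_mm = f2_mm - f1_mm  # algebraic sum of the focal lengths (-f1 + f2)
    return magnification, separation_mm, magnification * beam_in_mm


# Hypothetical values: f1 = 25 mm, f2 = 100 mm, 2 mm input beam
print(galilean_expander(25.0, 100.0, 2.0))  # (4.0, 75.0, 8.0)
```

Because the separation is f2 − f1 rather than f1 + f2, the layout is shorter than a Keplerian expander with the same magnification, which is the compactness the text refers to.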

Optical modulation and mechanism unit

In the HELIOS optical architecture, the coding and alignment procedures of the HELIOS method are performed by transmitting collimated beams through the optical modulation and mechanism unit, as depicted in Fig 6B. In this unit, the self-label and nearby-label coding schemes are performed by modulating the wavelength and polarization of the optical beams, respectively. Specifically, it is performed by passing the collimated beam through electrically controlled spatial filters [40, 41]; while the S₁-align and S₂-align operations of the alignment procedure are simultaneously performed by expanding and overlapping the modulated beams.

Modulation approach.

To perform the self-label coding, a wavelength selection approach is employed to modulate every character of the input sequences at a distinct wavelength. In this manner, a recently developed electrically controllable wavelength selective filter (WSF) is adopted [40], which is built upon a liquid crystal. The filter covers the spectral band in the range of [450–1000] nanometers with a bandwidth of less than 10 nanometers and a throughput of more than 80 percent. Employing an electronically controlled liquid crystal, the filter transmits only a selected wavelength of light and excludes others at each pixel.

Besides, the nearby-label coding is implemented utilizing the polarization of the optical beams. In this manner, a proposed polarization-based spatial filter (PSF) is employed [41], which is built upon an S-waveplate. It modulates every character of the input sequence with a unique linear polarization. This filter operates by transmitting a specific polarization along an azimuth angle θ in the range of [0, 180] degrees; while rejecting other polarizations. As the S-waveplate is a polarization-sensitive element, the transmittance of the S-waveplate at each pixel can be controlled by adjusting a bias voltage on the waveplate. Hereupon, it passes the incident beam at a specific polarization at each pixel. So, this property enables us to electrically modulate the polarization of the optical beams.

Therefore, the proposed architecture modulates specific wavelengths and polarizations of the unpolarized light beams, transmitted by the optical beam provision unit. Moreover, where a modulated beam crosses wavelength-selective and polarization-based spatial filters, it can only pass through the filters in the case of identical wavelengths and polarizations. Hence, it enables the comparison of two coded patterns in the optical architecture. Although the filters can provide many distinct codes, in this study we assume modulation wavelengths in the range of [450, 650] nanometers with 10 nanometers channel spacing, and linear polarization selection in the range of [0, 180] degrees with an angle variation of 9 degrees. This assumption provides the required orthogonal code sets for the self-label and nearby-label codings of DNA, RNA, and protein sequences, as depicted in Fig 7.
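To make the size of the code space concrete, the following sketch builds one wavelength table and one polarization table from the stated grid (10 nm channel spacing starting at 450 nm, 9-degree angle steps starting at 0 degrees). The order in which symbols are assigned to channels is an arbitrary assumption made here for illustration, and the tables only map symbols to codes; how the nearby-label code is attached to each position is defined by the coding procedure, not by this sketch.

```python
DNA = "ACGT"
PROTEIN = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def code_tables(alphabet, wl_start=450, wl_step=10, pol_step=9):
    """Assign each symbol one wavelength channel (nm) and one polarization angle (degrees).

    The symbol-to-channel order is an illustrative assumption.
    """
    if len(alphabet) > 20:
        raise ValueError("the assumed grid offers 20 codes per dimension")
    wavelengths = {ch: wl_start + i * wl_step for i, ch in enumerate(alphabet)}
    angles = {ch: i * pol_step for i, ch in enumerate(alphabet)}
    return wavelengths, angles


wl, pol = code_tables(PROTEIN)
print(wl["A"], wl["Y"], pol["A"], pol["Y"])  # 450 640 0 171
```

Twenty symbols occupy the channels 450 to 640 nm and the angles 0 to 171 degrees; a linear polarization at 180 degrees is physically the same as one at 0 degrees, so it is not used as a separate code here.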

Fig 7. Modulation approach of the HELIOS Optical architecture, utilizing the wavelength and polarization of the optical beams.

To implement the self-label coding through the wavelength modulation approach, every character of the input sequence is modulated with a distinct wavelength, within the spectral range of [450–650] nanometers with bandwidth of 10 nanometers. On the other hand, to implement the nearby-label coding through the polarization selection approach, every character of the input sequence is assigned to a specific polarization along a 9-degree azimuth angle in the range of 0 to 180 degrees. Each approach provides twenty distinct codes for protein sequences, while only four of them are employed for coding DNA and RNA sequences.

https://doi.org/10.1371/journal.pcbi.1010665.g007

Mechanism of the unit.

In the mechanism of the unit, at first, a space of 2 × N pixels is reserved on the WSF #1 and PSF #1, as depicted in Fig 6D and 6E, respectively. While the first row modulates all characters within S₂ for performing the S₁-align operation, the second row modulates all characters within S₁ for performing the S₂-align operation. Hence, transmitting collimated beam through WSF #1 and PSF #1 modulates wavelength and polarization of optical beams based on the self-label and nearby-label coding of the input sequences, respectively.

To realize the shifting process of the modulated beams along the horizontal direction, as discussed in the alignment procedure subsection, every modulated beam is expanded to multiple horizontal beams by two microlens arrays, as depicted in Fig 6B and 6F: a) an objective microlens array, composed of 2 × N bi-concave lenses with a negative focal length to diverge the modulated beams in the horizontal direction, and b) an imaging microlens array, composed of 2 × N bi-convex lenses with a positive focal length to recollimate the diverged beams. It is worth noting that, to implement various numbers of sequence shifts, objective and imaging lens arrays with focal lengths of different signs can be adopted to expand every modulated beam to the required range of horizontal pixels. In this process, the number of shifts, i.e. the value of parameter R, does not affect the performance of the optical system, since expanding and recollimating the optical beams are performed in parallel.

Afterward, the recollimated modulated S₂ and S₁ beams (produced by WSF #1, PSF #1, and microlens arrays) are fed to WSF #2 and PSF #2. The WSF #2 and PSF #2 modulate the self-label and nearby-label codes of S₁ and S₂ on the first and the second rows of reserved 2 × N pixels, respectively, as depicted in Fig 6G and 6H. Hence, by passing the recollimated modulated beams through WSF #2 and PSF #2, the proposed architecture compares the shifted coded S₂ with S₁ at the first row, implementing the S₁-align operation. Concurrently, it compares the shifted coded S₁ with S₂ at the second row, implementing the S₂-align operation. Specifically, in the case that the crossed beam (modulated by PSF #1 and WSF #1) and the pixel on PSF #2 and WSF #2 have identical wavelength and polarization, respectively, the beam passes through the filters, indicating a non-blocking state, and so, a nonzero amplitude pixel appears at the comparison outcome. Otherwise, the optical beam fails to pass through WSF #2 or PSF #2, indicating a blocking state, and so, a zero amplitude pixel appears at the comparison outcome vector. In this manner, the presence of a pixel with nonzero amplitude at the output clarifies a matching state through the alignment procedure; while a zero amplitude pixel clarifies a mismatch (i.e. mutation or indel) correspondingly.
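The blocking and non-blocking states can be mimicked with an idealized, binary model: a beam reaches the output only if both its wavelength and its polarization equal those selected at the second-stage pixel. The sketch below assumes the incoming row has already been shifted and ignores partial transmission, cross-talk, and noise; the tuple values are illustrative.

```python
def transmit(beams, filters):
    """Return 1 where a beam's (wavelength, polarization) equals the filter pixel's, else 0."""
    return [1 if beam == pixel else 0 for beam, pixel in zip(beams, filters)]


# (wavelength in nm, polarization in degrees) per pixel; values are illustrative
incoming = [(450, 0), (460, 9), (470, 18), (480, 27)]      # after WSF #1 / PSF #1 and shifting
second_stage = [(450, 0), (460, 18), (480, 18), (480, 27)]  # set on WSF #2 / PSF #2
print(transmit(incoming, second_stage))  # [1, 0, 0, 1]
```

The second pixel is blocked because only the wavelength agrees, and the third because only the polarization agrees, which is why both filters are needed to declare a match.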

As discussed in the Output vector production subsection, false mismatches on the k^th next pixels to the right of mutated characters lead to false zero amplitude pixels at the output. Hence, the value of these pixels should be recovered to produce the proper output. In this manner, the property of double refraction of optical beams in optically active media is employed [42]. When a linearly-polarized beam of light enters an optically active medium, like a chiral liquid crystal, it is split into two separate beams of opposite circular polarizations, traveling at different speeds through the medium. Hence, the beams are refracted and diverge by an angle determined by the different propagation speeds of the right-circularly and left-circularly polarized light beams [42]. So, every beam entering the optically active medium leaves it at two different locations, a specific distance apart. While the light speed determines the angle of refraction, the desired distance between the two exit points can be set by the properties and geometry of the optically active medium. Considering the double refraction property, every beam representing the i^th pixel is split into two beams at the i^th and the (i − k)^th pixels, as depicted in Fig 6I.
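The splitting geometry alone (pixel i feeding pixels i and i − k) can be sketched as follows. This models only where the light lands, using unit contributions for both exit beams; it does not model the power division between the two circular polarizations, and how true mismatches remain distinguishable depends on the coding details given in the Output vector production subsection, which are not reproduced here.

```python
def split_and_sum(amplitudes, k):
    """Each input pixel i contributes to output pixels i and i - k (when i - k >= 0)."""
    out = [0] * len(amplitudes)
    for i, a in enumerate(amplitudes):
        out[i] += a
        if i - k >= 0:
            out[i - k] += a
    return out


# Hypothetical comparison outcome with zero pixels at positions 1 and 2, and k = 2
print(split_and_sum([1, 0, 0, 1, 1, 1], k=2))  # [1, 1, 1, 2, 1, 1]
```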

Summarizing the above discussion, the S₁-align and S₂-align operations of the alignment procedure are completely and concurrently performed by passing the collimated beams through the spaces of 2 × N pixels of the WSFs, PSFs, microlens arrays, and optically active media. It is worth noting that the inherent parallelism of optics enables multiple input sequences to be arranged on the aperture of the modulator cells and to be aligned through the HELIOS optical architecture simultaneously. Therefore, efficient use of coding space, as well as considerable speed-up can be achieved by the proposed optical architecture.

Output capturing unit

Finally, optical thresholding is performed by an optical thresholder, shown in Fig 6C, to provide a clean output. In this manner, it eliminates wavelength cross-talks and speckle noises of the output before capturing. At last, the output is converged to a proper aperture by a bi-convex lens to be captured by a charge-coupled device (CCD) camera [43], as depicted in Fig 6C. The resultant output, represented as a 2 × N pixels image, includes nonzero and zero amplitudes, while each pixel represents the state of the corresponding character within the input sequences.
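The effect of thresholding on a captured row can be illustrated with a minimal sketch. The normalized intensities and the threshold level of 0.5 are assumptions made for the example; the paper does not specify a threshold value in this section.

```python
def threshold(intensities, level):
    """Map each normalized intensity to 1 (kept) or 0 (suppressed as cross-talk or speckle)."""
    return [1 if x >= level else 0 for x in intensities]


# Hypothetical normalized pixel intensities for one output row
print(threshold([0.02, 0.91, 0.08, 0.77], level=0.5))  # [0, 1, 0, 1]
```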

Discussion and results

Various simulation approaches and numerical analyses are used to comprehensively assess the HELIOS method and its optical architecture. At first, the functionality of the HELIOS method and its optical architecture is validated by investigating various simulation outputs. Next, the accuracy of the HELIOS method is inspected with comprehensive simulation studies and statistical analyses for various datasets. Afterward, the time and space complexities and the performance of the HELIOS optical architecture are estimated by analytical computation. For a comparative study, we consider various well-known algorithms, including BLAST [7], ClustalW [8], ClustalΩ [9], MUSCLE [12], T-Coffee [10], Kalign [11], MAFFT [13], Smith-Waterman [5], Needleman-Wunsch [6], Nucmer4 [14], BLASR [44], BWA-MEM [45], Bowtie2 [46], Mauve [47], LASTZ [48], Moiré Technique [31], HAWPOD [30], and OptCAM [34]. Finally, some well-known applications are presented that can potentially benefit from the HELIOS method.

Functional validation

In order to evaluate the functionality of the HELIOS method and its optical architecture, the alignment outputs of numerous DNA, RNA, and protein sequences are investigated. In this manner, the HELIOS method and its optical architecture are simulated in the MATLAB simulation tool and the COMSOL Multiphysics software, respectively. As a case study, the “Severe acute respiratory syndrome coronavirus 2” sequences [49] are aligned and represented in the form of protein, RNA, and DNA in Fig 8A–8C, respectively; while some single and multiple mutations/indels are manually imposed on the sequences with varying distributions. For more clarity, only a small portion of each full-length alignment, including 60 characters, is represented in Fig 8. As shown in Fig 8, two input sequences are successfully aligned in a two-line output by performing the two consecutive procedures of the HELIOS method, as well as by passing optical beams through two units of the HELIOS optical architecture. Moreover, wavelength modulation within the range of 450 to 650 nanometers with 10 nanometers spacing, accompanied by polarization selection in the range of 0 to 180 degrees with 9-degree angle variation, is performed for modulating optical beams in the HELIOS optical architecture. It is noteworthy that the crosstalk between neighboring pixels due to electric field leakage, produced by filtering a specific wavelength and polarization, is negligible and does not affect our results [50].

Fig 8. Simulation outputs of the HELIOS method and its optical architecture.

In this case study, the “Severe acute respiratory syndrome coronavirus 2” sequences [49] are aligned in the form of (A) Protein, (B) RNA, and (C) DNA; while some single and multiple mutations/indels are manually imposed on the sequences with varying distributions. For more clarity, only a small portion of each full-length alignment, including 60 characters, is shown in this figure, beginning at (A) character 240, (B) character 721, and (C) character 1921. In the coding and the alignment procedures, the parameters are set to R = 4 and k = 5. By investigating the outputs, two input sequences are successfully aligned in a two-line output by performing the two consecutive procedures of the HELIOS method, as well as by passing optical beams through two units of the HELIOS optical architecture. As a result, the matches, mutations, and indels are detected and located accurately.

https://doi.org/10.1371/journal.pcbi.1010665.g008

Analyzing the simulation outputs, all the character matches, mutations, and indels are accurately detected and located within the protein, RNA, and DNA sequences. As depicted in Fig 8, the character matches are presented with the nonzero entries and high amplitude pixels within the outputs of the HELIOS method and its optical architecture, respectively; while the zero entries and zero-amplitude pixels represent mutations or indels. Eventually, investigating the simulation outputs verifies the accurate functionality of the HELIOS method in both levels of method and optical architecture.

Accuracy evaluation

In order to comprehensively assess the accuracy of the HELIOS method, two statistical analyses are performed through simulations of various datasets: 1) Quantitative measurement of homology [51], and 2) Accuracy measurement of classification output [52], compared to the well-known algorithms.

Quantitative measurement of homology.

To perform quantitative measurement of homology [51], the parameters Identity, Similarity, and Alignment Score of the HELIOS outputs are calculated through simulation studies, as reported in Tables 1–3, respectively, assuming the “Nine ND5 protein sequences dataset” [53]. While the Identity reports the number of exactly matched characters of two sequences (in percentage), the Similarity measures the resemblance of two compared sequences. Specifically, regarding the physicochemical properties, the amino acids are categorized into six groups with different similarity values, including GAVLI, FYW, STCM, KRH, DENQ, and P. As the third metric, the BLOSUM62 substitution scoring matrix [54] is adopted to calculate the Alignment Score, with gap opening and extension penalties equal to -10 and -0.5, respectively.
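A minimal sketch of how Identity and Similarity could be computed from a gapped pairwise alignment is given below. Several choices are assumptions made here rather than taken from the paper: the alignment length is used as the denominator, two different residues count as similar whenever they fall in the same group (the group-specific similarity values mentioned above are not modeled), and the affine gap cost follows the convention of charging the opening penalty for the first gap position and the extension penalty for each further position.

```python
GROUPS = ["GAVLI", "FYW", "STCM", "KRH", "DENQ", "P"]
GROUP_OF = {aa: idx for idx, grp in enumerate(GROUPS) for aa in grp}


def identity_and_similarity(a, b, gap="-"):
    """Return (Identity %, Similarity %) of two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # gap columns count toward the length only
        if x == y:
            identical += 1
            similar += 1
        elif x in GROUP_OF and GROUP_OF.get(x) == GROUP_OF.get(y):
            similar += 1
    n = len(a)
    return 100.0 * identical / n, 100.0 * similar / n


def gap_penalty(length, gap_open=-10.0, gap_extend=-0.5):
    """Affine gap cost: opening penalty once, extension penalty for each further position."""
    return gap_open + gap_extend * (length - 1)


ident, sim = identity_and_similarity("GAV-KP", "GLVSRP")
print(ident, round(sim, 2), gap_penalty(3))  # 50.0 83.33 -11.0
```

In the example, three of six columns are identical, and A/L and K/R add two same-group columns, giving 5/6 for Similarity.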

Table 1. The parameter Identity of the HELIOS method in the quantitative measurement of homology.

The Identity reports the number of exactly matched characters of two compared sequences (in percentage) aligned by the HELIOS method, assuming the “Nine ND5 protein sequences dataset” [53].

https://doi.org/10.1371/journal.pcbi.1010665.t001

Table 2. The parameter Similarity of the HELIOS method in the quantitative measurement of homology.

The Similarity measures the resemblance of two compared sequences (in percentage) aligned by the HELIOS method. Various amino acids are categorized into six groups based on their physicochemical properties; including GAVLI, FYW, STCM, KRH, DENQ, and P. Moreover, the “Nine ND5 protein sequences dataset” [53] is assumed in this study.

https://doi.org/10.1371/journal.pcbi.1010665.t002

Table 3. The parameter Alignment Score of the HELIOS method in the quantitative measurement of homology.

To calculate the Alignment Score of two compared sequences aligned by the HELIOS method, the BLOSUM62 substitution scoring matrix is adopted with gap opening and extension penalties equal to -10 and -0.5, respectively. Moreover, the “Nine ND5 protein sequences dataset” [53] is assumed in this study.

https://doi.org/10.1371/journal.pcbi.1010665.t003

For a comparative study, the quantitative measurement of homology is performed by the HELIOS method and is compared to various well-known algorithms, including BLAST [7], ClustalW [8], ClustalΩ [9], MUSCLE [12], MAFFT [13], Kalign [11], T-Coffee [10], Smith-Waterman (SW) [5], and Needleman-Wunsch (NW) [6]. As an instance, the values of Identity, Similarity, and Alignment Score of the Smith-Waterman algorithm are reported in detail in Tables 4–6, respectively, to be compared to those of the HELIOS method. Additionally, we consider twelve different datasets for this evaluation, represented in S1 Text–S12 Text; while the input sequences of each dataset are represented in Table A3 in its corresponding file. Moreover, the quantitative measurements of homology of all aforementioned algorithms are reported in Tables A4-A33 in S1 Text–S12 Text for the twelve different datasets [53, 55–61]. As a brief report, the average values of each parameter achieved by the aforementioned algorithms are reported in Table 7 for the twelve datasets.

Table 4. The parameter Identity of the Smith-Waterman algorithm in the quantitative measurement of homology.

The Identity reports the number of exactly matched characters of two compared sequences (in percentage) aligned by the Smith-Waterman algorithm, assuming the “Nine ND5 protein sequences dataset” [53].

https://doi.org/10.1371/journal.pcbi.1010665.t004

Table 5. The parameter Similarity of the Smith-Waterman algorithm in the quantitative measurement of homology.

The Similarity measures the resemblance of two compared sequences (in percentage) aligned by the Smith-Waterman algorithm. Various amino acids are categorized into six groups based on their physicochemical properties, including GAVLI, FYW, STCM, KRH, DENQ, and P. Moreover, the “Nine ND5 protein sequences dataset” [53] is assumed in this study.

https://doi.org/10.1371/journal.pcbi.1010665.t005

Table 6. The parameter Alignment Score of the Smith-Waterman algorithm in the quantitative measurement of homology.

To calculate the Alignment Score of two compared sequences aligned by the Smith-Waterman algorithm, the BLOSUM62 substitution scoring matrix is adopted with gap opening and extension penalties equal to -10 and -0.5, respectively. Moreover, the “Nine ND5 protein sequences dataset” [53] is assumed in this study.

https://doi.org/10.1371/journal.pcbi.1010665.t006

Table 7. A brief report of the quantitative measurement of homology of the HELIOS method, compared to nine well-known algorithms, including SW, NW, BLAST, ClustalW, ClustalΩ, MUSCLE, T-Coffee, Kalign, and MAFFT. In this manner, the parameters Identity, Similarity, and Alignment Score are reported.

The Identity reports the number of exactly matched characters of two compared sequences (in percentage), aligned by each algorithm. Moreover, the Similarity measures the resemblance of two compared sequences (in percentage), aligned by every aforementioned algorithm. Various amino acids are categorized into six groups based on their physicochemical properties, including GAVLI, FYW, STCM, KRH, DENQ, and P. Finally, to calculate the Alignment Score of two compared sequences, aligned by every aforementioned algorithm, the BLOSUM62 substitution scoring matrix is adopted with gap opening and extension penalties equal to -10 and -0.5, respectively. Twelve different datasets are considered for this study [53, 55–61].

https://doi.org/10.1371/journal.pcbi.1010665.t007

Analyzing the data reported in Tables 1, 4, and 7, we can confirm that the HELIOS method detects and locates slightly more identical characters among the input sequences in most of the given datasets, and hence, it leads to higher values of the Identity compared to the SW and other well-known algorithms (as reported in S1 Text–S12 Text). The main reason is that, since the HELIOS method inserts a limited number of consecutive gaps freely (at most R gaps), it detects and locates more identical characters between two input sequences. Moreover, comparing Table 2 with Table 5 and considering Table 7, we can conclude that the Similarity values achieved by the HELIOS method are approximately equal to those of the alternative algorithms for most of the given datasets, while in the worst case the value is smaller by only 2.28%, compared to T-Coffee for the ND6 (NADH dehydrogenase subunit 6) protein of eight species dataset [55]. On the other hand, as reported in Tables 3, 6, and 7, the Alignment Scores calculated by the HELIOS method closely approach those of the alternative algorithms in our case studies. Therefore, we can conclude that the HELIOS method performs an alignment of comparable accuracy to the alternative algorithms.

Accuracy measurement of classification output.

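The accuracy measurement of the classification output [52] of the HELIOS method is addressed by calculating the values of Sensitivity (SEN), Specificity (Spec), Accuracy (ACC), Positive Predictive Value (PPV), Negative Predictive Value (NPV), Matthews Correlation Coefficient (MCC), and Test’s Accuracy (F-Score) in the simulation studies, according to Eq 5 to Eq 11, respectively.

$$\mathrm{SEN} = \frac{TP}{TP + FN} \tag{5}$$

$$\mathrm{Spec} = \frac{TN}{TN + FP} \tag{6}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

$$\mathrm{PPV} = \frac{TP}{TP + FP} \tag{8}$$

$$\mathrm{NPV} = \frac{TN}{TN + FN} \tag{9}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{10}$$

$$\text{F-Score} = \frac{2\,TP}{2\,TP + FP + FN} \tag{11}$$

where the parameters TP, TN, FP, and FN stand for True Positive, True Negative, False Positive, and False Negative, respectively.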
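For readers who want to reproduce these metrics from confusion-matrix counts, the sketch below uses the standard textbook definitions of the seven quantities. The function name and the example counts are illustrative assumptions; how TP, TN, FP, and FN are counted against a reference alignment is not specified in this passage, and zero denominators are not handled.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics; assumes no denominator is zero."""
    return {
        "SEN": tp / (tp + fn),
        "Spec": tn / (tn + fp),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "MCC": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "F-Score": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts, chosen only to exercise the formulas
for name, value in classification_metrics(tp=8, tn=6, fp=2, fn=4).items():
    print(f"{name}: {value:.4f}")
```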

As a comparative study, the accuracy measurement of the classification output of the HELIOS method is accomplished, and the corresponding metrics (as formulated by Eqs 5 to 11) are calculated by considering the Smith-Waterman [5], Needleman-Wunsch [6], ClustalW [8], ClustalΩ [9], BLAST [7], MUSCLE [12], T-Coffee [10], Kalign [11], and MAFFT [13] algorithms as the reference algorithms. As an instance, the values of SEN, Spec, ACC, PPV, NPV, MCC, and F-Score of the HELIOS method are reported, considering SW as the reference algorithm, in Tables 8–14, respectively. It should be noted that the values of these metrics with each of the other mentioned algorithms as the reference algorithm are reported for the various datasets in Tables A34-A96 in S1 Text–S12 Text. As a brief report, the average value of each parameter, considering the mentioned reference algorithms, is reported in Table 15 for the same twelve datasets addressed in Table 7.

Table 8. The parameter Sensitivity (SEN) of the HELIOS method with referencing the Smith-Waterman algorithm in the accuracy measurement of classification output.