
Pipelined and conflict-free number theoretic transform accelerator for CRYSTALS-Kyber on FPGA

  • Ayesha Waris ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft

    ayesha.waris@pnec.nust.edu.pk

    Affiliation National University of Sciences and Technology (NUST), Karachi, Pakistan

  • Arshad Aziz,

    Roles Project administration, Resources, Supervision, Writing – review & editing

    Affiliation National University of Sciences and Technology (NUST), Karachi, Pakistan

  • Bilal Muhammad Khan

    Roles Supervision, Writing – review & editing

    Affiliation National University of Sciences and Technology (NUST), Karachi, Pakistan

Abstract

Post-quantum cryptographic (PQC) algorithms are essential because quantum computers threaten the security of currently deployed cryptosystems. CRYSTALS-Kyber, a lattice-based scheme, has been standardized as the Public-Key Encryption and Key-Establishment Mechanism algorithm by the National Institute of Standards and Technology (NIST). Efficient hardware acceleration of CRYSTALS-Kyber relies on optimizing the computationally intensive Number Theoretic Transform (NTT) unit used for polynomial multiplication. This work presents an FPGA implementation of a conflict-free, pipelined, single-path delay feedback based NTT core for Kyber that employs architectural optimizations, including pipelining and resource sharing, and algorithmic optimizations such as a multiplier-less Montgomery reduction algorithm. As a result, our design achieves a 7.8% reduction in resources and a 49.6% improvement in Area-Time Product (ATP) compared to state-of-the-art designs. The presented architectures are coded in Verilog HDL and implemented on Xilinx Artix-7 XC7A100T-3 and Virtex-7 XC7VX485T-3 devices using Vivado Design Suite 2022.2.

1. Introduction

Quantum computing is a fast-emerging technology that exploits properties of quantum mechanics to perform certain calculations much faster than conventional computers. While the high computation capability of quantum computers is extremely advantageous, it also poses a serious threat to the security of currently deployed public-key cryptosystems such as RSA [1] and Elliptic Curve Cryptography [2]. Consequently, to protect the confidentiality of classified information, it is important to develop quantum-resistant cryptosystems.

In 2016, the National Institute of Standards and Technology (NIST) initiated a Post Quantum Cryptography (PQC) competition [3] to standardize cryptographic algorithms that can resist quantum attacks, and substantial research has been done in this field since then. After a six-year competition, NIST selected CRYSTALS-Kyber [4] for standardization in the Public-Key Encryption (PKE) and Key Encapsulation Mechanism (KEM) category in July 2022. CRYSTALS-Kyber is a lattice-based cryptographic (LBC) scheme [5], and its security is based on the hardness of solving the Module-Learning with Errors (M-LWE) problem. Also known as Kyber, this KEM provides a good trade-off between security, performance, message sizes and flexibility compared to its counterparts in the NIST competition.

Polynomial multiplication [6,7] is the most computationally expensive operation in LBC schemes. Considerable research has been done to reduce the complexity of polynomial multiplication, resulting in several multipliers: schoolbook, Karatsuba-Ofman, Toom-Cook and NTT. The last one, the Number Theoretic Transform (NTT) [8], is a variant of the Fast Fourier Transform (FFT) that reduces the multiplication complexity from $O(n^2)$ to $O(n \log n)$ and is used by Kyber for polynomial multiplications. Hence, to make Kyber efficient, acceleration of the NTT is imperative.

To reduce the performance bottleneck, the NTT has been implemented and investigated on various platforms. While software implementations [9] provide programming flexibility, hardware platforms such as ASICs and FPGAs [10–12] can be accelerated using optimization strategies such as parallelization and resource sharing. Hardware/software co-design platforms are also very popular and implement a software-based processor and a hardware-based design on a single chip. The reconfigurability and flexibility provided by FPGAs make them an ideal choice for implementing computationally intensive algorithms such as the NTT. However, providing an optimal trade-off between resource utilization, processing delay and energy consumption is the major challenge for hardware designers.

In this work, we propose various hardware and algorithmic optimizations and design an FPGA-based NTT polynomial multiplier to accelerate the performance of CRYSTALS-Kyber. The research flow, scheme and methodology of this work are presented in Fig 1, and the block diagram of the overall architecture is given in Fig 2.

Fig 1. Schematic and sequence of the overall Research procedure for Pipelined Number Theoretic Transform Accelerator for CRYSTALS-Kyber on FPGA.

https://doi.org/10.1371/journal.pone.0333301.g001

Fig 2. Block diagram of the proposed overall architecture.

https://doi.org/10.1371/journal.pone.0333301.g002

Main contributions are as follows:

  • This work first implements conventional single-path delay feedback-based NTT (SDFNTT) architecture for Kyber on FPGA. The paper highlights that incorporating pipelining to increase the maximum frequency of the SDFNTT leads to data conflicts. A novel pipelined SDF-based NTT architecture is then proposed that resolves the data conflicts by introducing an alternative data flow.
  • By using only a small amount of additional hardware in each stage, our pipelined SDFNTT achieves a clock frequency that is approximately 40% higher on the Artix-7 and 32% higher on the Virtex-7 device than that of the non-pipelined SDFNTT implementation.
  • The pipelined SDFNTT incorporates 14 pipeline stages. To further increase the pipelining, configurable FIFO depths are presented, allowing the depth to be adjusted according to the number of pipeline stages to avoid data conflict.
  • Multiple architectural optimizations, such as a multiplier-less modular multiplier, a unified butterfly unit and distributed ROM-based memories for twiddle factor storage, are adopted to minimize the resource consumption of the proposed design, resulting in a 49.6% improvement in ATP compared to prior works.

The paper is structured as follows: the literature review is presented in section 2, preliminaries are discussed in section 3, section 4 presents the proposed SDFNTT architecture, and section 5 gives the results, discussion and comparisons.

2. Literature review

Hardware implementations of CRYSTALS-Kyber mainly focus on the NTT, as it is the most computationally intensive and resource-hungry part of the design. The complex memory access pattern and large memory requirement are the bottleneck for accelerating the NTT unit. There has been considerable research on the NTT for Fully Homomorphic Encryption (FHE) and Ring Learning with Errors (RLWE) schemes, but in this paper we limit our discussion to the NTT unit of PQC schemes in the NIST competition, mainly Kyber.

The NTT unit is implemented using software, hardware/software co-design or pure hardware approaches. Abdulrahman et al. [9] optimized and implemented the CRYSTALS-Kyber KEM and Dilithium signature schemes on the Cortex-M4. The authors of [13] implemented three PQC algorithms, FrodoKEM, NewHope and Kyber, on a GPU to accelerate their operations. Botros et al. [14] proposed a memory-efficient Cortex-M4 implementation of Kyber. An ARMv8 microprocessor-based optimized implementation was presented by Nguyen et al. for Kyber, NTRU and Saber in [15]. Interested readers can find a survey of software implementations of the NTT (ARM Cortex, GPU) in [16]. Nguyen et al. [17] and Dang et al. [18] used a hardware/software co-design approach and introduced an HLS-based NTT architecture for various round 2 PQC algorithms. The execution time for encapsulation and decapsulation was improved in these studies compared to pure software designs.

The inherent parallelism [19] of the NTT structure makes pure hardware the best choice for its implementation. In recent years, the NTT has been widely implemented on hardware platforms such as ASICs and FPGAs. In [20], Bisheh et al. provided an instruction set architecture for Kyber and implemented it on an ASIC. In this architecture, the execution time of the NTT is reduced by merging pre-processing into the NTT algorithm. Banerjee et al. [21] proposed a power-efficient RISC-V based crypto-processor on an ASIC for several PQC algorithms such as Frodo, qTesla, Kyber, Dilithium and NewHope. Due to its reconfigurability and flexibility, the FPGA is a popular platform for implementation, testing and result validation of PQC, FHE and RLWE algorithms. According to the literature, FPGA implementations of Kyber's NTT can be classified into two categories, iterative and pipelined. Table 1 summarizes the techniques adopted by a few previous works implementing Kyber's NTT core on FPGA, and details are presented in the ensuing paragraphs.

Table 1. Techniques adopted in former works for implementing NTT/INTT Kyber core.

https://doi.org/10.1371/journal.pone.0333301.t001

Huang et al. [30] proposed an iterative CT/GS butterfly for NTT/INTT computation which relies heavily on Block RAM units. In [19], the authors implemented an iterative NTT with eight CT- and GS-based butterfly units, which reduces the NTT iteration time. Chen et al. [31] designed an optimized NTT by modifying the CT/GS butterfly and modular arithmetic units. Zhang et al. proposed a ping-pong memory access scheme and a fast modular reduction based on bitwise add-shift operations along with a look-up table in [29]. Yaman et al. [32] implemented a unified butterfly structure and three different architectures (lightweight, balanced, high-performance) for the Kyber-specific NTT, using 1, 4 and 16 butterfly units, respectively. A butterfly unit based NTT architecture along with the K2RED algorithm is presented in [28], which improves the computation time. Xing et al. [33] implemented a Kyber processor on the Artix-7 platform. The architecture used two butterfly units to process the even and odd input coefficients. In [34], Gao et al. proposed a conflict-free memory access pattern and performed NTT, INTT and PWM calculations. Recently, Gao et al. presented a memory mapping and access scheme for Kyber-optimized NTT and PWM, together with a mixed radix-2/4 approach [24]. Itabashi et al. proposed three NTT architectures for Kyber, radix-2, 2-parallel radix-2 and radix-4, in [26]. An optimized version of the modular reduction, Exact-KRED, for the Kyber NTT is proposed in [11], along with a butterfly unit NTT structure. To make memory management compact and simpler, Ni et al. [35] proposed an iterative NTT/INTT architecture with point-wise multiplication and replaced all the BRAM units with three FIFOs. Sun et al. [22] designed a radix-4 NTT/INTT core along with a conflict-free memory access scheme.

Pipelined NTT architectures are gaining popularity among researchers because memory management in these architectures is less complex than in the iterative approach. In [36], Nguyen et al. introduced a dual-path delay feedback (DDF) based NTT for Kyber. The input coefficients of the NTT are divided into two parts for processing, which increases the execution time of the NTT. A pipelined NTT architecture for various schemes with different parameters is proposed in [27], and Kyber's parameters are also covered in that paper. Ni et al. [25] presented a radix-2 based multi-path delay commutator (MDC) NTT/INTT architecture using resource sharing. In [23], Nguyen et al. adopted the MDC approach and proposed radix-2 and radix-4 unified NTT architectures for Kyber and Dilithium. Two MDC-based NTT designs are proposed in [37]. These unified 2-parallel and 4-parallel architectures are reconfigurable and support NTT, INTT and PWM operations.

3. Preliminaries

3.1. Classical NTT versus NTT in Kyber

The NTT is a variant of the Fast Fourier Transform (FFT) over a finite field. Compared to the traditional schoolbook algorithm, NTT-based multiplication reduces the complexity from $O(n^2)$ to $O(n \log n)$ [8]. NTT-based polynomial multiplication can be represented by $c = \mathrm{INTT}(\mathrm{NTT}(a) \circ \mathrm{NTT}(b))$, where $a$ and $b$ are the input polynomials and $c$ is the output polynomial. Inputs $a$ and $b$ are first transformed into the NTT domain, followed by a point-wise multiplication operation (PWM, represented by $\circ$). To complete the multiplication and transform the result back into the original domain, the inverse NTT (INTT) is applied to the result of the PWM.

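The NTT [38] is performed on the polynomial quotient ring $\mathbb{Z}_q[x]/\phi(x)$, where $\mathbb{Z}_q$ is the ring of integers modulo $q$, $\phi(x)$ is the irreducible polynomial, $n$ is a power of 2 and $q$ is a prime modulus satisfying $q \equiv 1 \pmod{n}$. For an $n$-point input polynomial $a$, the output polynomial $\hat{a}$ in the NTT domain is given by $\hat{a}_i = \sum_{j=0}^{n-1} a_j \omega^{ij} \bmod q$, where $0 \le i < n$. The primitive $n$-th root of unity $\omega$ is the smallest integer in the ring that satisfies $\omega^n \equiv 1 \pmod{q}$. The INTT is calculated by multiplying by the scaling factor $n^{-1}$ and using negative powers of $\omega$, and is represented as $a_i = n^{-1} \sum_{j=0}^{n-1} \hat{a}_j \omega^{-ij} \bmod q$.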

The NTT is performed by appending zeros at the end of the input polynomials, expanding them from $n$ to $2n$ coefficients. This results in an increase in resources as well as computation time. The Negative-Wrapped Convolution (NWC) optimization requires only $n$-point input polynomials and can be applied only when a $2n$-th root of unity $\psi$ exists, where $\psi^2 \equiv \omega \pmod{q}$ and $q \equiv 1 \pmod{2n}$. Input polynomials in the NTT operation are multiplied by powers of $\psi$ and output polynomials are multiplied by powers of $\psi^{-1}$, known as the pre-processing and post-processing steps, respectively. The NWC-based NTT and INTT can now be represented as $\hat{a}_i = \sum_{j=0}^{n-1} \psi^{j} a_j \omega^{ij} \bmod q$ and $a_i = n^{-1} \psi^{-i} \sum_{j=0}^{n-1} \hat{a}_j \omega^{-ij} \bmod q$ [38].
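The pre-processing and post-processing steps of the NWC can be checked numerically. The following is a minimal Python sketch, not the hardware design: it uses toy parameters ($q = 17$, $n = 8$, $\psi = 3$) chosen here only so that a $2n$-th root of unity exists, evaluates the transform directly in $O(n^2)$, and compares the result with a schoolbook product reduced modulo $x^n + 1$. All function names are illustrative.

```python
# Toy parameters assumed for illustration only (not Kyber's): q = 17, n = 8.
Q, N = 17, 8
PSI = 3                  # 3 has multiplicative order 16 = 2n modulo 17
OMEGA = PSI * PSI % Q    # 9 has order 8 = n modulo 17


def transform(a, root):
    """Direct O(n^2) evaluation of sum_j a_j * root^(i*j) mod q."""
    return [sum(a[j] * pow(root, i * j, Q) for j in range(N)) % Q
            for i in range(N)]


def nwc_multiply(a, b):
    """Multiply a and b modulo (x^n + 1, q) using n-point transforms only."""
    # Pre-processing: scale coefficient j by psi^j.
    a_s = [a[j] * pow(PSI, j, Q) % Q for j in range(N)]
    b_s = [b[j] * pow(PSI, j, Q) % Q for j in range(N)]
    # Forward transforms followed by point-wise multiplication.
    c_hat = [x * y % Q for x, y in zip(transform(a_s, OMEGA),
                                       transform(b_s, OMEGA))]
    # Inverse transform: negative powers of omega and scaling by n^-1.
    n_inv = pow(N, -1, Q)          # modular inverse, Python 3.8+
    c_s = [v * n_inv % Q for v in transform(c_hat, pow(OMEGA, -1, Q))]
    # Post-processing: scale coefficient j by psi^-j.
    psi_inv = pow(PSI, -1, Q)
    return [c_s[j] * pow(psi_inv, j, Q) % Q for j in range(N)]


def schoolbook_negacyclic(a, b):
    """Reference product modulo (x^n + 1, q): wrapped terms change sign."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * b[j]) % Q
            else:
                c[k - N] = (c[k - N] - a[i] * b[j]) % Q
    return c
```

No zero padding to $2n$ points is needed in `nwc_multiply`, which is the saving the NWC provides.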

Decimation in Time (DIT) and Decimation in Frequency (DIF) are two approaches to implementing the NTT, which use the Cooley-Tukey (CT) [39] and Gentleman-Sande (GS) [40] butterfly structures, respectively. Both CT and GS butterflies take two inputs and give two outputs, but the outputs differ because of the placement of the multiplier. The outputs of the CT butterfly are $a + \omega b$ and $a - \omega b$, and those of the GS butterfly are $a + b$ and $(a - b)\omega$, where $a$ and $b$ are the input coefficients and $\omega$ is the twiddle factor. The structures of the radix-2 CT and GS butterflies are shown in Fig 3a and 3b.

Fig 3. Radix-2 butterfly structure.

(a) Cooley-Tukey (b) Gentleman-Sande.

https://doi.org/10.1371/journal.pone.0333301.g003
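The difference in multiplier placement between the two butterflies can be sketched in a few lines of Python. This is a behavioural model for illustration with Kyber's modulus; the function names are hypothetical and do not come from the design.

```python
Q = 3329  # Kyber's prime modulus


def ct_butterfly(a, b, w):
    """Cooley-Tukey: multiply b by the twiddle first, then add and subtract."""
    t = w * b % Q
    return (a + t) % Q, (a - t) % Q


def gs_butterfly(a, b, w):
    """Gentleman-Sande: add and subtract first, then multiply the difference."""
    return (a + b) % Q, (a - b) * w % Q
```

Applying the GS butterfly with the inverse twiddle to the outputs of a CT butterfly returns $2a$ and $2b$, which is why an INTT built from GS butterflies needs a final scaling.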

CRYSTALS-Kyber uses modulus $q = 3329$ and $n = 256$, which is not NWC-NTT friendly as it does not satisfy $q \equiv 1 \pmod{2n}$: since $3329 = 13 \cdot 256 + 1$, $q - 1$ is divisible by 256 but not by 512. Calculating the NTT of Kyber requires a special trick called Truncated-NTT, which can be achieved by employing an incomplete FFT trick or by splitting the polynomial ring as mentioned in [38]. By splitting the polynomial ring in Kyber, we divide the NTT calculation into even and odd parts [41]. With $n = 128$ and $q = 3329$, the parameters now fulfill the NWC-NTT condition and the NTT can be applied to two smaller polynomial rings.

In this paper, we have adopted the incomplete FFT trick, which requires the last stage in the Kyber NTT to be cropped. The output of the NTT will now be degree-1 polynomials, unlike the linear terms obtained while using the full NTT. The point-wise multiplication operation in Kyber is therefore a polynomial multiplication of two degree-1 polynomials, which consists of five multiplications and two additions, different from the original PWM operation, which only requires coefficient-wise multiplication of the terms obtained after the full NTT.
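The count of five multiplications and two additions can be made concrete with a short Python sketch. It assumes, as in the Kyber specification, that each pair of coefficients is multiplied modulo $x^2 - \zeta$ for a stage-dependent constant $\zeta$; the function names and the sample value of $\zeta$ are illustrative only.

```python
Q = 3329  # Kyber's prime modulus


def base_multiply(a0, a1, b0, b1, zeta):
    """(a0 + a1*x) * (b0 + b1*x) mod (x^2 - zeta, q).

    Multiplications: a0*b0, a1*b1, (a1*b1)*zeta, a0*b1, a1*b0  -> five
    Additions:       one per output coefficient                -> two
    """
    c0 = (a0 * b0 + (a1 * b1 % Q) * zeta) % Q
    c1 = (a0 * b1 + a1 * b0) % Q
    return c0, c1


def reference_multiply(a0, a1, b0, b1, zeta):
    """Schoolbook product followed by the substitution x^2 -> zeta."""
    d0, d1, d2 = a0 * b0, a0 * b1 + a1 * b0, a1 * b1
    return (d0 + d2 * zeta) % Q, d1 % Q
```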

3.2. Iterative and pipelined architectures

Like the FFT, the NTT can be designed using iterative or pipelined architectures. An iterative NTT in its simplest form incorporates a butterfly unit and a memory unit. The NTT is calculated by loading the data from the memory into the butterfly unit and then storing the outputs back into the memory. This is done iteratively until the butterfly operations in all stages of the NTT are computed. Different iterative implementations of Kyber are reported in the literature [11,19,20,22,24,26,28–35], using single or multiple butterfly and memory units together with optimized data flow and memory access techniques. These architectures suffer from high memory utilization and high implementation complexity.

Unlike iterative designs, pipelined architectures process data in a continuous flow, and the critical paths of the architecture can be shortened by adding registers, which results in a high frequency while using few area resources. An $n$-point pipelined NTT architecture consists of $\log_2 n$ stages, where each stage has one butterfly unit and performs $n/2$ butterfly operations. A few configurations of pipelined architectures for Kyber are available in the literature [23,25,27,36], with Single-path delay feedback (SDF) and Multi-path delay commutator (MDC) being the most popular ones.

4. Proposed architecture

Polynomial multiplications via the NTT unit are the most significant, complex and time-consuming operations in the computation of Kyber, as they occupy a large share of the computation time [28]. To improve the performance of Kyber, an efficient NTT core is required which provides high throughput and uses few hardware resources.

In this work, we have implemented two SDF-based NTT architectures, non-pipelined and pipelined, for CRYSTALS-Kyber. The conventional single-path delay feedback (SDF) architecture is a simple, non-pipelined, BRAM-free design with a straightforward memory access pattern that continuously streams input data and performs NTT computations across successive stages. However, its maximum operating frequency is limited because introducing pipelining leads to data conflicts. This work presents a novel pipelined SDF-based NTT architecture that addresses these data conflict issues by introducing an alternative data flow and varying the depth of integrated FIFOs. As a result, the proposed design achieves significantly higher frequency compared to the conventional SDF-based NTT. Furthermore, to enhance pipelining flexibility, configurable FIFO depths are provided for each stage.

High performance of NTT is achieved using various optimizations on architectural and algorithmic level. We have accelerated NTT by employing techniques such as pipelining, using LUT-based Multiplication for integer multiplication unit, LUT-based FIFOs for data buffering, utilizing distributed ROM-based memories, replacing constant multiplications with lightweight operations in modular reduction unit and designing a reconfigurable butterfly core using resource sharing. As NTT and INTT operations are not performed simultaneously, we have used the same architecture to perform both operations. Different data paths have been provided to alternate the data flow between NTT and INTT computations.

In the first few sections low-level arithmetic modules such as modular adder, modular subtractor, modular multiplier and unified butterfly structure are discussed. Then our non-pipelined SDFNTT, which is a conventional SDF architecture, along with its main components and dataflow is explained. Subsequently, issues arising by adopting pipelining in SDF architectures are elaborated and then the proposed pipelined SDFNTT is presented which resolves the data conflict issues in the formerly implemented architecture. At the end, we discussed our unified NTT/INTT SDF architecture.

4.1. Modular addition and subtraction

The modular adder and modular subtractor, presented in Fig 4a and 4b, are optimized for CRYSTALS-Kyber using Kyber's modulus $q = 3329$. A single select bit chooses the output of the multiplexer in both units.

Fig 4. Arithmetic modules.

(a) Modular adder (b) Modular subtractor.

https://doi.org/10.1371/journal.pone.0333301.g004
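A behavioural sketch of the two arithmetic modules is given below in Python. It assumes the usual structure in which both candidate results are computed and the sign of a trial subtraction acts as the multiplexer select; this select condition is an assumption made for illustration and is not taken from Fig 4.

```python
Q = 3329  # Kyber's prime modulus


def mod_add(a, b):
    """a + b mod q for 0 <= a, b < q using one adder and one subtractor."""
    s = a + b           # plain sum, at most 2q - 2
    d = s - Q           # trial subtraction of the modulus
    return s if d < 0 else d   # sign of d plays the role of the mux select


def mod_sub(a, b):
    """a - b mod q for 0 <= a, b < q using one subtractor and one adder."""
    d = a - b           # plain difference, may be negative
    return d + Q if d < 0 else d   # sign of d plays the role of the mux select
```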

4.2. Modular multiplier

An efficient and resource-saving modular multiplication unit is significant in designing a compact and fast NTT. Our modular multiplier is divided into two parts (a) Integer multiplier and (b) Montgomery reduction unit.

4.2.1. Integer multiplier.

The integer multiplier unit of Kyber's modular multiplier takes two 12-bit inputs and gives a 24-bit output. As the coefficients are 12 bits long, using the DSP48E1 block in 7-series Xilinx FPGAs, which supports $25 \times 18$ multiplication [42], would leave the block only partially used. Our SDFNTT has seven stages of butterfly units, so using seven DSP48E1 block-based modular multiplication units would result in a considerable increase in hardware resources.

We have adopted the shift-and-add technique [43] and modified it to design our integer multiplier. Our experiments showed that the proposed multiplier, shown in Fig 5, is more resource-efficient than other multiplier architectures (Wallace, Booth, Dadda). The presented multiplier takes two 12-bit inputs and produces a 24-bit output. Six bitwise AND operations are performed between the six bits of the multiplicand and each bit of the multiplier. The generated partial products are then added using a chain of parallel ripple carry adders. We have reduced the critical path delay by exploiting the parallelism of FPGAs and have implemented four $6 \times 6$ multiplier units, one for each pairing of the 6-bit halves of the two operands. The outputs of the multiplier units are then accumulated using the Vedic tree adder, as shown in Fig 6.
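The decomposition into four $6 \times 6$ units can be modelled in software as follows. This Python sketch only mirrors the arithmetic (AND-generated partial products, shift-and-add accumulation, recombination of the four partial results); it does not model the adder tree of Fig 6, and the function names are hypothetical.

```python
def mul6x6(x, y):
    """6x6 shift-and-add: AND the multiplicand with each multiplier bit."""
    acc = 0
    for i in range(6):
        if (y >> i) & 1:      # partial product i is x when bit i of y is set
            acc += x << i     # weight the partial product by 2^i
    return acc


def mul12x12(a, b):
    """12x12 product assembled from four independent 6x6 products."""
    a_lo, a_hi = a & 0x3F, a >> 6
    b_lo, b_hi = b & 0x3F, b >> 6
    ll = mul6x6(a_lo, b_lo)
    lh = mul6x6(a_lo, b_hi)
    hl = mul6x6(a_hi, b_lo)
    hh = mul6x6(a_hi, b_hi)
    # Recombine: low*low + (cross terms) * 2^6 + high*high * 2^12.
    return ll + ((lh + hl) << 6) + (hh << 12)
```

The four calls to `mul6x6` have no data dependencies between them, which is what allows them to run in parallel in hardware.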

4.2.2. Modular reduction unit.

In literature, there are two main approaches for implementing modular reduction: Barrett [44] and Montgomery Reduction [45]. The classical Barrett and Montgomery algorithms avoid costly division operations by replacing them with multiplication and shifting operations. Over the period of years, different variations [46,47] of both algorithms have been proposed to accelerate LBC and PQC.

Barrett reduction has been widely employed for designing reduction units in Kyber [19,20,33,48]. SAMS2 is a variation of the Barrett algorithm [47] in which simple, inexpensive operations such as bit shifts, additions and subtractions are used instead of constant multiplications. It has been applied in [49,50] for a specific modulus. In [51], Longa et al. proposed the K-RED reduction technique, which is based on the characteristics of Proth prime numbers of the form $q = k \cdot 2^m + 1$ and has contributed to the implementation of efficient reduction units for Kyber in [11,28]. The reduction units in [32,34,52] utilize the property $2^{12} \equiv 2^9 + 2^8 - 1 \pmod{3329}$ recursively until the higher-order bits are reduced into lower-order bits. The partial results are then added using a chain of adders.

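Algorithm 1. Montgomery reduction algorithm, pre-computed Montgomery constant $\mu = -q^{-1} \bmod R$

Input: $T$ such that $0 \le T < qR$

Output: $C$ such that $C \equiv T R^{-1} \pmod{q}$, $0 \le C < q$

 1: $m \leftarrow (T \bmod R) \cdot \mu \bmod R$

 2: $C \leftarrow (T + m \cdot q) / R$

 3: if $C \ge q$ then

 4: $C \leftarrow C - q$

 5: end if

 6: return C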

For our NTT design, a custom-built reduction unit based on the Montgomery reduction algorithm has been implemented for Kyber (Fig 7). All the multiplications in the traditional Montgomery algorithm (Algorithm 1) are replaced with lightweight and inexpensive shift, addition and subtraction operations. The optimized Montgomery algorithm is presented in Algorithm 2. The dedicated modular reduction unit is designed for Kyber's modulus q = 3329 and exploits the characteristic structure of this prime number. We have adopted the approach in [21], in which the modulus is written as a short sum of signed powers of two, so that the constant multiplications in the Barrett algorithm are replaced by bit shifts, additions and subtractions.

The input to our Montgomery reduction unit is a 24-bit value $T$. The module implements $C = T R^{-1} \bmod q$ using bit shifts, subtraction and addition operations and calculates a 12-bit output. $R$ is selected to be $2^{12}$, where $R > q$. By selecting $R$ as a power of 2, all the mod operations in Algorithm 1 are replaced by hardware-friendly operations. In step 1 of Algorithm 1, the input is 24 bits wide and $T \bmod R$ is the selection of the rightmost 12 bits of $T$, that is $T[11:0]$. In step 2, division by $R$ is replaced by shifting the bits of the dividend to the right by 12 positions. All these algorithmic optimizations are reflected in Algorithm 2.

Algorithm 2. Optimized Montgomery reduction for CRYSTALS-Kyber.

Input: such that

Output: such that

 1:

 2:

 3:

 4:

 5:

 6:

 7:

 8:

 9: if then

 10:

 11: else

 12:

 13: end if

 14: return C

We have pre-computed the Montgomery constant $\mu$ using (1). $\mu$ is a Solinas prime. These generalized Mersenne primes were suggested by Solinas in [53]. Using Montgomery-friendly primes helps to accelerate Montgomery multiplications on software and hardware platforms [21,54].

(1)

We have two constant multiplications in Algorithm 1: in step 1, the multiplication by $\mu$, and in step 2, the multiplication by the modulus $q$. These two primes can be represented as $769 = 2^{9} + 2^{8} + 1$ (the magnitude of $\mu$, that is $q^{-1} \bmod 2^{12}$) and $q = 3329 = 2^{11} + 2^{10} + 2^{8} + 1$. By using this technique, the constant multiplications are replaced by left bit shifts, additions and subtractions, as shown in step 2 and step 5 of Algorithm 2. We take the 2's complement in step 3 to incorporate the negative sign of $\mu$.
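The shift-and-add replacement of the two constant multiplications can be sketched in Python as below. This is a behavioural model under stated assumptions, not a transcription of Algorithm 2: it assumes $R = 2^{12}$, uses the identities $769 = 2^9 + 2^8 + 1$ and $3329 = 2^{11} + 2^{10} + 2^8 + 1$, and the intermediate names are illustrative.

```python
Q = 3329        # Kyber modulus, 2^11 + 2^10 + 2^8 + 1
R_BITS = 12     # assumed Montgomery radix R = 2^12
R = 1 << R_BITS
MASK = R - 1


def mont_reduce(t):
    """Return t * R^-1 mod q for 0 <= t < q*R without any multiplier."""
    t_low = t & MASK                               # t mod R: lowest 12 bits
    # t_low * 769, where 769 = q^-1 mod 2^12, as shifts and additions.
    prod = (t_low << 9) + (t_low << 8) + t_low
    # Two's complement, truncated to 12 bits: m = -t_low * q^-1 mod R.
    m = (-prod) & MASK
    # m * 3329 as shifts and additions.
    mq = (m << 11) + (m << 10) + (m << 8) + m
    # t + m*q is divisible by R, so the right shift is an exact division.
    c = (t + mq) >> R_BITS
    # The intermediate result is below 2q: one conditional subtraction.
    return c - Q if c >= Q else c
```

A quick check is that `mont_reduce(t) * R % Q` equals `t % Q` for any product `t` of two reduced coefficients.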

The drawback of Montgomery multiplication is that the calculations are done in the Montgomery domain, which means that conversions into and out of the Montgomery domain are required. The output of Algorithm 2 is $T R^{-1} \bmod q$, which implies that the result must be multiplied by $R$ to obtain $T \bmod q$, and this constant multiplication would require more hardware resources. One of the inputs to the integer multiplier is the precomputed twiddle factor $w$, which is stored in the memory. In our design, we multiply $w$ by $R$ and store $wR \bmod q$ in the memory instead of $w$. With this pre-computation, the output of our reduction unit is the product with $w$ reduced modulo $q$, with no residual factor of $R^{-1}$.
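The effect of storing pre-scaled twiddle factors can be verified with a small Python model. The reduction here is modelled by its defining formula rather than by the hardware data path, $R = 2^{12}$ is assumed, and the helper names are hypothetical.

```python
Q = 3329
R = 1 << 12                 # assumed Montgomery radix
R_INV = pow(R, -1, Q)       # modular inverse, Python 3.8+


def mont_reduce_model(t):
    """Functional model of the reduction unit: t * R^-1 mod q."""
    return t * R_INV % Q


def to_stored_twiddle(w):
    """Value written to the twiddle ROM in place of w: w * R mod q."""
    return w * R % Q


def twiddle_multiply(a, stored_w):
    """Integer multiply followed by Montgomery reduction."""
    return mont_reduce_model(a * stored_w)
```

Because $a \cdot (wR) \cdot R^{-1} \equiv a w \pmod{q}$, `twiddle_multiply(a, to_stored_twiddle(w))` returns the ordinary modular product, so no separate conversion step is needed at run time.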

4.3. Unified butterfly unit

NTT-based polynomial multiplier requires input and output coefficients to be in natural order and implement both, forward NTT and inverse NTT transformations. DIT-based NTT transformation takes input in natural order and gives output in bit-reversed order while DIF-based NTT does the opposite. Using only one approach for both NTT and INTT will result in an additional cost of implementing a bit-reversal unit. Different configurations to implement NTT and INTT are given in [55].

In this design, the bit-reversal operation has been avoided by designing a unified butterfly architecture for NTT/INTT. The DIT-based NTT receives inputs in natural order and produces bit-reversed output. This output is given as input to the DIF-based INTT, which takes bit-reversed input coefficients and produces natural-order output. The data flow for NTT and INTT is illustrated in Figs 8 and 9, which label both the normal-order and the bit-reversed-order coefficients.

The presented butterfly unit (BU) takes three 12-bit inputs, two coefficients and a twiddle factor, and produces two output coefficients. The BU comprises one modular adder, one modular subtractor and one modular multiplier. The control signal given to the muxes in the design (Fig 10) selects the CT-based butterfly structure for the NTT operation or the GS-based butterfly for the INTT operation.

The reduction unit becomes the critical path of the whole architecture due to the LUT-based shift and addition operations. To reduce the critical path and obtain better performance, we have adopted pipelining and added a two-level pipeline in the Montgomery reduction unit. Pipeline registers are also added in our reconfigurable butterfly architecture to synchronize the output coefficients.

If more pipeline registers are added in the reduction unit, the same number of registers must also be added, to balance the timing of the output coefficients, in the path of the coefficient that bypasses the multiplier in CT mode and in the sum path in GS mode. In CT mode of our butterfly unit, the modular multiplication of the second input coefficient by the twiddle factor takes two cycles. We have inserted two pipeline stages in the path of the first input coefficient so that both operands are synchronized for modular addition and modular subtraction. In the GS configuration, two pipeline registers are added in the sum path, as the multiplication of the difference by the twiddle factor has a delay of two cycles. Hence, our fully pipelined BU has a latency of two clock cycles (cc) whether working in CT or GS mode.

For the INTT operation, a final scaling by $n^{-1}$ is required at the output of the GS butterfly. To implement it, Zhang et al. [56] proposed inserting a "$/2$" operation to obtain $(a + b)/2$ and $(a - b)\omega/2$ at the output. This can be easily achieved by shifting in hardware instead of using extra resources for multiplication by $n^{-1}$. In our unified BU, we have inserted a 1-bit right shift block ("$\gg 1$") to incorporate the INTT operation.
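Halving modulo an odd prime needs one extra term when the value is odd, which the following Python sketch makes explicit. It is a generic model of the halving idea; the handling of odd values shown here is the standard one and is assumed rather than taken from the unified BU.

```python
Q = 3329
HALF_Q_PLUS_1 = (Q + 1) >> 1   # 1665, the value of 2^-1 mod q


def half_mod_q(x):
    """x / 2 mod q for 0 <= x < q: right shift, plus (q+1)/2 if x is odd."""
    return (x >> 1) + (x & 1) * HALF_Q_PLUS_1


def scale_by_halvings(x, stages=7):
    """One halving per butterfly stage accumulates a factor of 2^-stages."""
    for _ in range(stages):
        x = half_mod_q(x)
    return x
```

With seven stages the accumulated factor is $2^{-7} = 128^{-1} \bmod q$, so no separate multiplication by the scaling constant is needed after the last stage.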

4.4. Non-pipelined SDFNTT

4.4.1. Dataflow in SDF unit.

In our first design, the SDF unit in Fig 11, consists of a radix-2 butterfly unit designed in section 4.3 (excluding pipelining), a FIFO unit for data buffering in the feedback with the BU and a TWROM for storing twiddle factors. The SDF unit will accept one input coefficient and will result in one output coefficient per clock cycle.

Fig 12 represents the data flow in the SDF unit for an 8-point ($n = 8$) input. The butterfly is coupled with a delay of 4 ($n/2$) and a TWROM for twiddle factor storage. For the first four clock cycles, the 0 data path of the mux/demux is selected (shown with green lines) and the incoming coefficients are loaded into the FIFO to provide a delay of 4 cycles. The flow of coefficients in the FIFO register for each clock cycle is shown in Fig 13. For the next four clock cycles, the 1 data path is selected (shown with brown lines), and the coefficients from the input and from the delay unit arrive at the two butterfly inputs and are computed by the butterfly unit. As the inputs and outputs of the SDF unit are limited to one coefficient at a time, the first butterfly output is sent to the output of the SDF unit and the second butterfly output is sent back to the FIFO on each clock cycle, until the FIFO is full. After four clock cycles, the 0 data path is selected again, and the data in the FIFO is sent to the output. The mux/demux are controlled by the Muxsel signal, which is generated in our design using counters. The TWROM, FIFO and counters of each stage are dedicated to that stage and independent of the other stages.
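The alternation between the two data paths can be simulated cycle by cycle. The Python sketch below models one non-pipelined SDF stage under simplifying assumptions: a single constant twiddle factor, a toy modulus of 17, a CT-style butterfly, and `None` used as the placeholder for an empty FIFO slot. The names are illustrative and are not taken from the Verilog design.

```python
from collections import deque

Q = 17  # toy modulus, assumed for illustration only


def sdf_stage(stream, depth, w=1):
    """Simulate one SDF stage: one coefficient in, one coefficient out per cycle.

    stream: input coefficients followed by `depth` flush values (None).
    depth:  FIFO depth, i.e. half the block size handled by this stage.
    """
    fifo = deque([None] * depth)
    out = []
    for cycle, x in enumerate(stream):
        head = fifo.popleft()              # value leaving the delay line
        if (cycle // depth) % 2 == 0:
            # Path 0: store the input; drain what the FIFO held before.
            fifo.append(x)
            out.append(head)
        else:
            # Path 1: butterfly on (delayed, current) pair.
            t = w * x % Q
            out.append((head + t) % Q)     # first output leaves the stage
            fifo.append((head - t) % Q)    # second output is fed back
    return out
```

For an 8-point block and `depth = 4`, the first four outputs are empty, the next four are the sums and the last four are the differences drained from the FIFO.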

4.4.2. Kyber non-pipelined SDFNTT.

As $n$ is 256 in Kyber and due to the truncated NTT [38] used in Kyber, explained in section 3.1, our SDFNTT consists of seven stages ($\log_2 n - 1 = 7$). Seven SDF processing units (PU) are cascaded as shown in Fig 14. Each PU consists of a radix-2 BU, a TWROM and a delay unit.

The input and output of our processing unit (PU1) is one coefficient per cycle. Each PU represents one stage of the NTT data flow. The data flow for the first two stages is given in Fig 15. For the computation of the radix-2 BU, the coefficients at the input are required to differ by a specific offset for a particular stage. In the case of Kyber, the input is a polynomial of 256 coefficients. These coefficients arrive at the input of PU1 one per clock cycle, and the first 128 coefficients are loaded into the FIFO to provide a delay of 128 cycles. The next set of 128 coefficients is directed to the butterfly input using the alternate mux data path, so that a delayed coefficient (the output of the FIFO) and the coefficient 128 positions after it arrive at the two butterfly inputs at the same time. One output coefficient is sent as input to PU2, and the second coefficient is loaded again into the FIFO of PU1 to provide a delay until the output path is available. Other input pairs separated by an offset of 128 continue to arrive at the input at every clock cycle.

The next stage, PU2, processes the input in two sequences: the first sequence comes from the output of the BU of the 1st stage and the next sequence comes from the FIFO of PU1. For the first 64 cycles, the input coefficients go to the FIFO, and after 64 cycles p’65, p’66, ….., p’128 are sent to the butterfly input. In this way, input coefficient pairs arrive at the butterfly unit with an offset of 64. Subsequently, the second input sequence is computed by PU2, and the pattern repeats for the remaining stages.

4.4.3. FIFO.

The FIFO is used in the feedback path for temporarily storing the input/output coefficients. The depth of the FIFO for each stage is $N/2$, where $N$ is the size of the input sequence to the respective stage. For Kyber, the first stage has an input sequence of 256 coefficients, hence a FIFO depth of 128. The input coefficient chunk for the second stage is 128 coefficients, so the FIFO coupled with the second stage has a depth of 64. This pattern continues for the rest of the stages.
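The halving pattern of the FIFO depths can be written out for all seven stages. The short Python sketch below only restates the rule above (depth equals half the input sequence length of the stage); the function name is illustrative.

```python
N = 256      # Kyber polynomial length
STAGES = 7   # truncated NTT: log2(256) - 1 stages


def fifo_depths(n=N, stages=STAGES):
    """FIFO depth of each stage: half of that stage's input sequence length."""
    return [(n >> s) // 2 for s in range(stages)]


def total_fifo_words(n=N, stages=STAGES):
    """Total number of coefficients buffered across all stages."""
    return sum(fifo_depths(n, stages))
```

With these parameters the depths are 128, 64, 32, 16, 8, 4 and 2, so the delay lines together hold 254 coefficients.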

4.4.4. Twiddle factors storage and management.

An in-place NTT implementation stores the coefficients in BRAMs or registers and updates them directly at each stage of the butterfly computation. It implies, instead of reading from one memory array and writing results to another, the same memory location is reused. The proposed NTT design is based on SDF architecture, which differs from a classical in-place implementation. In our design, the polynomial coefficients are streamed through seven butterfly stages. Each stage uses dedicated FIFOs for memory handling and data alignment, and intermediate values are not stored in the same memory locations. Hence, this architecture does not follow the traditional in-place pattern but rather adopts an out-of place NTT approach with stage-wise dataflow for efficient pipelining and resource utilization. Moreover, our architecture is designed to be BRAM-free and instead uses FIFO buffers to stream polynomial coefficients. The design handles one polynomial at a time, and intermediate values are passed between the stages using dedicated FIFOs associated with each butterfly unit. This approach allows for a compact and efficient dataflow without the need for block RAM storage.

Twiddle factors are pre-computed values that are input to the BU during modular multiplication. These values can be stored in the FPGA using embedded ROMs or LUT-based distributed ROMs. BROMs have limited width-depth configurations and would be under-utilized if used for storing the twiddle factors in our design, as a polynomial in Kyber has coefficients of 12 bits each. The proposed architecture has seven stages, so instantiating a separate BROM for each of the seven stages would be costly in terms of hardware resources. LUT-based distributed memories are resource-friendly and can be placed close to the rest of the logic in the FPGA design, which also minimizes latency as the butterfly unit continuously loads data from the memory.

For the NTT operation in Kyber, 128 (n/2) different twiddle factors are required. As the INTT is symmetric, another 128 (n/2) twiddle factors are needed for the INTT operation. Hence, if a unified NTT/INTT is designed (section 4.7), the Kyber NTT core must be able to store 256 (n) twiddle factors. The 1st stage of the DIT NTT uses only the first value from the twiddle factor array to compute all butterfly operations, the 2nd stage uses the second and third values, and so on. For DIF NTT, the 1st stage uses the first values, the 2nd stage uses the next values, and the pattern continues for the remaining stages. This distribution of twiddle factors over the different stages in the NTT and INTT operations is shown in Fig 16, where blue boxes represent the number of twiddle factors for the NTT operation and grey boxes represent the twiddle factors for the INTT operation.
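The per-stage distribution can be made concrete with a short sketch. It assumes the table layout of the Kyber specification (FIPS 203), where entry i holds 17^BitRev7(i) mod 3329. The paper does not state the ordering of its TWROM, so the index ranges below are illustrative and the helper names are hypothetical.

```python
Q = 3329    # Kyber modulus
ROOT = 17   # primitive 256th root of unity mod Q used by the Kyber specification


def bitrev7(i):
    """Reverse the 7-bit binary representation of i."""
    return int(format(i, "07b")[::-1], 2)


# 128-entry twiddle table in bit-reversed order (assumed layout, see lead-in).
ZETAS = [pow(ROOT, bitrev7(i), Q) for i in range(128)]


def dit_stage_twiddles(stage):
    """Table indices read by DIT (CT) stage 1..7: 1, 2, 4, ..., 64 entries."""
    lo = 1 << (stage - 1)
    return range(lo, 2 * lo)


def dif_stage_twiddles(stage):
    """Table indices read by DIF (GS) stage 1..7: 64, 32, ..., 1 entries."""
    return reversed(dit_stage_twiddles(8 - stage))


dit_counts = [len(dit_stage_twiddles(s)) for s in range(1, 8)]
dif_counts = [len(list(dif_stage_twiddles(s))) for s in range(1, 8)]
```

In this layout the seven stages read 127 entries in total; index 0 of the 128-entry table holds the value 1 and is not read by the forward transform.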

Fig 16. Twiddle factors for NTT/INTT for different stages.

https://doi.org/10.1371/journal.pone.0333301.g016

In series FPGA, input LUTS are used to store up to bits and implement a bit ROM [57]. Each stage in our SDFNTT has a different requirement of twiddle factors hence we have a different ROM capacity for all stages. For example, the first stage, when working in NTT mode needs TW and when working in INTT mode requires TW. The total number of twiddle factors for this stage is and the width of each coefficient is -bits, which are stored in a bit ROM using LUTS. The second stage requires TW for NTT and TW for INTT mode and are stored in a bit ROM, implemented using LUTS.

4.5. Data-conflict in pipelined SDF

Incorporating pipeline registers increases the maximum frequency of the overall architecture. However, a data conflict arises when pipelining is adopted in the conventional SDF architecture. The timing diagrams for a non-pipelined and a pipelined butterfly unit in SDFNTT are given in Figs 17 and 18, respectively. We have used a BU with a computation latency of clock cycles (section 4.3) and discuss the data collision for an -coefficient input sequence.

Fig 17. Timing diagram of non-pipelined SDF (no data conflict).

https://doi.org/10.1371/journal.pone.0333301.g017

Fig 18. Timing diagram of pipelined SDF (with data conflict).

https://doi.org/10.1371/journal.pone.0333301.g018

Coefficients in Fig 18 are sent to the FIFO which provides a delay of clock cycles to provide the required offset for butterfly computation between pairs . The butterfly output goes to output while the coefficients are again sent to the FIFO until the output port is available. The second set of coefficients (shown by orange boxes), also continue to arrive at the input port. This causes data collision in the FIFO (yellow boxes). represents the collision in the FIFO between data output from () and second data input sequence (). The delay between input and corresponding output coefficients is clock cycles, where is the delay of the SDF architecture and is the computational latency of the pipelined BU.

4.6. Pipelined SDFNTT

4.6.1. Dataflow in pipelined SDF unit.

To provide a solution to the data collision in the FIFO register we have proposed SDF architecture with an alternate data flow which is configurable using multiplexers. Our new architecture of PU (Fig 19), employs four FIFO units, FIFO1 depth, , FIFO2 depth, , FIFO3 depth, and FIFO4 depth, . Different depths of the FIFOS in different paths are to control the data flow and to align the correct coefficient pairs at the input of the butterfly unit for computation.

Fig 20 shows the depth of FIFO registers for the first two stages of Kyber. For 1st stage and input sequence , and . Similarly for 2nd stage and input sequence , and . All seven PUs for Kyber NTT are connected and shown in Fig 21. The muxes select the data path and are configured by signal, generated by counters in our design which are independent for each PU. Delay elements in our pipelined PU are unlike our previous architecture with the delay element of . It implies that each PU of pipelined SDFNTT requires additional hardware of only delay unit and results in acceleration of the SDFNTT computation which is discussed in section 5.

The timing diagram of a -coefficient input for pipelined SDFNTT providing a computation latency of cc is shown in Fig 22. The first half of input coefficients are sent to the FIFO1 to provide a delay of clock cycles. The data is then directed to FIFO2 and delayed by another clock cycles. The output coefficients of FIFO2 are now aligned with the next half of the input coefficients arriving at input. The butterfly unit gives output and after a latency of cc. is directed to the output port, and is delayed till the output port is occupied. We have designed the depths of FIFO and to control the data flow of such that there is no collision in any of the FIFO units. At first, is delayed by FIFO3 by cc. The output of FIFO3 is sent to FIFO2 which is now unoccupied and delays the coefficients by cc. Final delay of cc is provided by FIFO4. The output coefficients are now directed to the PU output port which is now available.

Fig 22. Timing Diagram of Pipelined SDFNTT for 16 input coefficients.

https://doi.org/10.1371/journal.pone.0333301.g022

4.6.2. Increasing pipelining.

The depths of FIFO3 and FIFO4 are proportional to the computation latency of the butterfly unit. If pipelining is increased by , the depths of FIFOs are and . If it is increased by then and . This pattern can be continued to add further pipeline stages, but increasing pipelining also increases the delay of the overall architecture. Our analysis and tests conclude that the optimum tradeoff between architecture delay and can be achieved if BU latency is cc. Fig 23a and 23b present the architectures of the PU when the BU latency is increased to 3 cc and 4 cc, respectively.

Fig 23. Depth of FIFOs when pipelining of BU is (a) 3 cc (b) 4 cc.

https://doi.org/10.1371/journal.pone.0333301.g023

4.7. Unified SDF NTT/INTT architecture for Kyber

In section 4.3, a reconfigurable butterfly unit employing the CT and GS butterfly structures was proposed. The TWROM, which stores the twiddle factors for the seven stages of the NTT and INTT operations, was discussed in section 4.4.4. To design a unified NTT and INTT architecture, we adopt the resource sharing technique and use the same BU and TWROM in both computations. The only difference is in the FIFO unit: the input order in DIT NTT and DIF INTT is different, so different FIFO depths are required. The unified NTT/INTT architecture for the first two stages of Kyber is shown in Fig 24. The brown mux/demux selects the path and the FIFO units configured in the PU. The selection signal for the mux/demux is provided by the control signal, which is also given to the BUs for CT and GS mode selection. If signal is , CT structure is configured in BU and black data path is selected for the FIFO units. If signal is , BU becomes a GS butterfly structure for INTT operation and green data paths in PU are selected to provide delay according to the input sequence for INTT. As NTT and INTT are symmetric, the last stage of PU in NTT becomes the first stage of PU in INTT operation.
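A software reference model helps to see how one butterfly with a mode input can serve both transforms. The sketch below follows the seven-layer NTT and inverse NTT of the Kyber specification (FIPS 203) rather than the streaming hardware described here. It is an in-place, non-pipelined model, it reuses a single 128-entry twiddle table for both directions (the paper stores separate NTT and INTT twiddles), and the names `butterfly`, `transform` and the `inverse` flag are hypothetical.

```python
Q = 3329
N = 256
INV_128 = 3303  # 128^-1 mod Q, final scaling of the inverse transform


def bitrev7(i):
    return int(format(i, "07b")[::-1], 2)


ZETAS = [pow(17, bitrev7(i), Q) for i in range(128)]


def butterfly(a, b, zeta, inverse):
    """One butterfly with a mode select: CT for NTT, GS for INTT."""
    if not inverse:
        t = zeta * b % Q                      # Cooley-Tukey: multiply first
        return (a + t) % Q, (a - t) % Q
    return (a + b) % Q, zeta * (b - a) % Q    # Gentleman-Sande: multiply last


def transform(f, inverse=False):
    """Seven-layer NTT (inverse=False) or INTT (inverse=True) on 256 coefficients."""
    f = list(f)
    lengths = [2, 4, 8, 16, 32, 64, 128] if inverse else [128, 64, 32, 16, 8, 4, 2]
    i, step = (127, -1) if inverse else (1, 1)
    for length in lengths:
        for start in range(0, N, 2 * length):
            zeta = ZETAS[i]
            i += step
            for j in range(start, start + length):
                f[j], f[j + length] = butterfly(f[j], f[j + length], zeta, inverse)
    if inverse:
        f = [x * INV_128 % Q for x in f]
    return f


poly = [(7 * k + 3) % Q for k in range(N)]
roundtrip = transform(transform(poly), inverse=True)
```

The layer order of the inverse transform is the reverse of the forward one, which is the software counterpart of the last NTT stage becoming the first INTT stage.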

Fig 24. First two stages of pipelined SDFNTT/INTT Architecture.

https://doi.org/10.1371/journal.pone.0333301.g024

4.8. SDFNTT for Dilithium

Kyber and Dilithium are both lattice-based schemes, where Kyber is a key encapsulation mechanism (KEM) and Dilithium is a digital signature scheme. Both rely on the NTT for polynomial multiplication. The key difference lies in their modulus sizes: Kyber uses a 12-bit modulus (q = 3329), while Dilithium employs a 23-bit modulus (q = 8380417). This implies that Kyber requires narrower datapaths, whereas Dilithium demands significantly larger bit-widths when implemented in hardware.

The proposed SDF-based NTT architecture for Kyber can be adapted to support the Dilithium scheme by designing a butterfly architecture with wider bit-widths and integrating a modular multiplier for the larger modulus q = 8380417. This demonstrates the scalability of our architecture across different lattice-based cryptographic schemes. Reconfigurability and scalability of proposed SDFNTT can be achieved by developing a unified butterfly architecture capable of handling both q = 3329 and q = 8380417. Through such unified butterfly units, the proposed SDFNTT architectures can efficiently perform polynomial multiplications for both Kyber and Dilithium.

Kyber is not NTT-friendly for negative wrapped convolution (NWC), as its modulus q = 3329 does not support a primitive 2n-th root of unity. Therefore, Kyber adopts a truncated NTT approach, which requires only 7 stages of butterfly units for n = 256. In contrast, Dilithium’s modulus q = 8380417 is NTT-friendly, enabling a full NWC NTT. Consequently, a Dilithium SDFNTT implementation will require the complete 8 stages of butterfly units.
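The stage counts follow from a divisibility condition: for a prime q, a primitive m-th root of unity exists modulo q exactly when m divides q - 1. The short check below applies this to both moduli. The function names are hypothetical, and the stage formula assumes radix-2 butterflies.

```python
def has_primitive_root_of_order(q, m):
    """For prime q, a primitive m-th root of unity exists iff m divides q - 1."""
    return (q - 1) % m == 0


def radix2_stages(q, n=256):
    """Number of radix-2 butterfly stages available for a length-n negacyclic NTT.

    A full NWC NTT needs a primitive 2n-th root (log2(n) stages). If only a
    primitive n-th root exists, the transform stops one layer early.
    """
    full = n.bit_length() - 1          # log2(n) for a power-of-two n
    if has_primitive_root_of_order(q, 2 * n):
        return full
    if has_primitive_root_of_order(q, n):
        return full - 1
    raise ValueError("modulus supports neither a 2n-th nor an n-th root")


KYBER_Q = 3329         # q - 1 = 2^8 * 13
DILITHIUM_Q = 8380417  # q - 1 = 2^13 * 1023
```

With n = 256, this gives 7 stages for the Kyber modulus and 8 stages for the Dilithium modulus, matching the counts stated above.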

5. Result analysis

In this work, we have developed two NTT architectures for CRYSTALS-Kyber. The comparison of our designs with state-of-the-art pure NTT/INTT implementations in terms of consumed resources, latency and Area-Time Product (ATP) is shown in Table 2. The achieved ATP is also compared with prior works in Fig 25. We have implemented our designs in Verilog on Artix-7 (XC7A100T-3) and Virtex-7 (XC7VX485T-3) FPGAs. Both synthesis and implementation were performed under the default Vivado settings. Each design was tested and verified through functional and post-place-and-route (PAR) simulations in Xilinx Vivado Design Suite 2022.2.

Table 2. Comparison of CRYSTALS-Kyber NTT/INTT to state-of-the-art designs.

https://doi.org/10.1371/journal.pone.0333301.t002

Fig 25. Area-Time Product comparison of proposed designs with previous work.

https://doi.org/10.1371/journal.pone.0333301.g025

To provide fair comparisons, we normalized the LUT, DSP, and BRAM usage to the total area utilized in slices. The DSP is approximately equal to slices and BRAM is approximately equal to slices in Artix-7 according to [35]. In cases where ENS (equivalent number of slices) is not reported, we have performed the conversion ourselves for comparison, as slice consists of LUTs and flip-flops (FFs) in 7-series FPGA [35]. Since some approaches focus on enhancing speed while others prioritize minimizing area, we have selected the ATP as a balanced metric to better represent overall performance. A smaller ATP value indicates a more efficient design. ATP is calculated by multiplying the ENS by the execution time required to perform the NTT/INTT operation.
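The metric can be written out as a small calculation. In the sketch below, execution time is taken as latency in cycles divided by clock frequency in MHz, which yields microseconds. The relative improvement is assumed to be measured against the competing design, since the exact definition used for the percentages in Table 2 is not spelled out. All numbers are hypothetical placeholders, not values from Table 2.

```python
def execution_time_us(latency_cycles, fmax_mhz):
    """Execution time in microseconds: cycles / MHz."""
    return latency_cycles / fmax_mhz


def area_time_product(ens, latency_cycles, fmax_mhz):
    """ATP = equivalent number of slices x execution time (slices x us)."""
    return ens * execution_time_us(latency_cycles, fmax_mhz)


def improvement_percent(ours, theirs):
    """Relative reduction of a lower-is-better metric (assumed definition)."""
    return 100.0 * (theirs - ours) / theirs


# Hypothetical placeholder figures, chosen only to exercise the formulas.
atp_ours = area_time_product(ens=1000, latency_cycles=500, fmax_mhz=250)
atp_other = area_time_product(ens=1600, latency_cycles=1000, fmax_mhz=200)
gain = improvement_percent(atp_ours, atp_other)
```

Because ATP multiplies area by time, a design can improve it either by using fewer slices or by finishing sooner, which is why it is used here as the balanced metric.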

5.1. NTT comparison

Table 2 indicates that our designs achieve the lowest ATP when compared to state-of-the-art designs. The low ATP of the non-pipelined designs (a, b) is attributed to the low ENS obtained by employing resource sharing techniques and implementing a BRAM-free and DSP-free architecture. In [11,22,26–29,58,59], DSP units are used for multiplication in the modular multiplication unit and BRAMs are used for data management and storage. As we have seven butterfly units in our design, utilizing seven BRAMs for storing pre-computed values and seven DSPs for modular multiplication would have significantly increased our resources. Our approach of using optimized, custom-built distributed LUT-based BROMs for each stage and LUT-based integer multiplier units resulted in a compact architecture.

The lowest ATP is achieved by the pipelined designs (c, d) due to the inclusion of pipeline stages, which reduce the critical path delay and increase the maximum frequency of the architecture. The increased frequency results in a decreased execution time, as Time (µs) = Latency (cycles) / Frequency (MHz). The sequential and straightforward data flow of the input into the pipelined SDFNTT architecture consequently leads to the lowest NTT/INTT cycle count when compared to [11,22,26–29,58,59], which employ complex memory access patterns. These factors, along with architectural optimizations, result in the lowest ATP for the pipelined NTT/INTT designs.

Our first architecture is designed using the conventional SDF architecture, without pipelining. Analysis of the post-PAR utilization report indicates that our implementation takes LUTs and FFs for Kyber, achieving a frequency of MHz on Artix-7 and MHz on Virtex-7 devices. To improve the frequency of SDFNTT and to reduce the critical path delay, we added pipeline stages in our butterfly structure, which resulted in a total of clock cycles latency in our design. The implementation of the pipelined design utilizes LUTs and FFs, achieving a clock frequency of MHz on Artix-7 and MHz on Virtex-7. Our pipelined designs achieve a higher clock frequency, approximately on Artix-7 and on Virtex-7, when compared to the non-pipelined designs. Designs a, b, c and d in Table 2 are the non-pipelined (Artix-7), non-pipelined (Virtex-7), pipelined (Artix-7) and pipelined (Virtex-7) implementations, respectively.

Our non-pipelined SDFNTT takes clock cycles to compute the first output of NTT operation. The proposed pipelined SDFNTT employs a butterfly unit with computational latency of clock cycles. Seven pipelined butterfly units connected in cascade will result in latency of cc and delay of SDFNTT is clock cycles for both NTT and INTT operations. In [11] Nguyen et al. have presented an NTT/INTT architecture with butterfly unit configuration. Our designs report reduction in ENS when compared to [11] and consequently implementation a achieves and c achieves improved ATP.

In [27], Ye et al. have implemented an NTT core for various parameter sets. We have compared our work with the design utilizing the parameter set of Kyber (). When compared to the proposed designs b and d, resource utilization is reduced by , which indicates our increased hardware efficiency. ATP is improved by when compared to b and when compared to d. Zhang et al. in [29] have proposed a ping-pong access scheme for memory management in an iterative NTT/INTT design. In comparison with [29], our implementation achieves a reduction in resource consumption and , improved ATP when compared to a and c. K-RED based modular reduction algorithm and butterfly configuration based NTT/INTT design is presented in [28] by Bisheh-Niasar et al. ENS in our implementation is reduced by and ATP in designs a and c improves by and when compared to [28].

Itabashi et al. in [26] have implemented radix-2, 2-parallel radix-2 and radix-4 architectures. We have compared our design with the radix-2 implementation comprising one BU, two BRAMs and a TWROM unit. Results exhibit a reduction in ENS when [26] is compared to our implementation. ATP is also improved by and in designs a and c. Rashid et al. [58] have utilized extensive pipelining and resource re-use techniques and implemented a unified NTT/INTT core on a Virtex-7 device. When comparing [58] to our designs, a 53.2% reduction in resources is achieved along with 81.9%/83.5% and 86.9%/88% improved ATP for NTT/INTT operations when compared to b and d, respectively.

In [59], the authors have designed a unified NTT/INTT architecture by using register banks for memory storage and implemented their designs on both Artix-7 and Virtex-7 FPGA platforms. For comparison, we evaluate our designs a and c against [59]-a, while our designs b and d are compared with [59]-b. Our analysis shows a 64% reduction in ENS and 79.3%/81.8% and 86.85%/88.5% improved ATP for NTT/INTT operations when a and c are compared to [59]-a. Designs b and d are 66.9% more resource efficient and achieve an improved ATP of 85.2%/87.09% and 89.2%/90.6% for NTT/INTT operations when compared to [59]-b.

In [22], Sun et al. have introduced a modified radix-4 butterfly unit and an enhanced K2-RED modular reduction for faster and more efficient NTT computation. The work is implemented on Artix-7 ([22]-a) and Virtex-7 ([22]-b) FPGA devices. The resource efficiency is increased by 84.9% along with an ATP improvement of 61.9% and 75.8% when a and c are compared with [22]-a. Comparison of designs b and d with [22]-b reveals an 85.3% reduction in resources with a 73% and 80.5% improved ATP. The work in [60] presents three highly parallel designs on Artix-7 and uses interleaved multiplication for modular multiplication. We have compared the design with parameter K = 22 with our designs a and c. This results in a 26.3% reduction in ENS and 42.4%/42.4% and 63.4%/63.4% improved ATP for NTT/INTT operations when a and c are compared to [60].

Throughput per slice (TP/slice) values for the NTT operation are also presented in Table 2. TP/slice is determined by first calculating the throughput (TP) as (clock frequency × number of bits) / latency, and then dividing it by the ENS. A few designs listed in Table 2 report different cycle counts for the NTT and INTT operations, so the throughput of the two operations also differs. It should be noted that TP and TP/slice were not directly reported in the referenced works but were instead calculated using the reported clock frequency, latency and number of input bits processed per cycle in each design.
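The TP/slice formula can be sketched in the same way. The interpretation of "number of bits" is an assumption here: the example takes it as the size of one Kyber polynomial (256 coefficients of 12 bits), and the frequency, latency and ENS values are hypothetical placeholders rather than entries from Table 2.

```python
def throughput_mbps(fmax_mhz, bits, latency_cycles):
    """TP = (clock frequency x number of bits) / latency, in Mbit/s for MHz input."""
    return fmax_mhz * bits / latency_cycles


def throughput_per_slice(fmax_mhz, bits, latency_cycles, ens):
    """TP/slice = TP divided by the equivalent number of slices."""
    return throughput_mbps(fmax_mhz, bits, latency_cycles) / ens


POLY_BITS = 256 * 12  # assumed: one polynomial of 256 twelve-bit coefficients

# Hypothetical placeholder figures, chosen only to exercise the formulas.
tp = throughput_mbps(fmax_mhz=250, bits=POLY_BITS, latency_cycles=500)
tp_slice = throughput_per_slice(250, POLY_BITS, 500, ens=1000)
```

Unlike ATP, this metric is higher-is-better, so a design with more parallel butterflies can score well on TP/slice while still having a larger area-time product.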

This section compares only the TP/A of the pipelined architectures (c, d) with the works listed in Table 2. Design c (Artix-7) achieves a TP/A that is higher by 73.5%, 25% and 3.1% when compared to [59]-a, [22]-a and [26], and lower by 28%, 40.8%, 51.9% and 65.8% when compared to [11], [28], [29] and [60], respectively. Design d (Virtex-7) obtains a higher TP/A by 73.7%, 78.6%, 21.8% and 77.6% when compared to [58], [59]-b, [22]-b and [27], respectively.

While a few designs report higher TP/A, this comes at the cost of significantly increased area [11,28,29] or execution cycles [60]. In contrast, our approach enables efficient resource utilization and prioritizes overall system efficiency, achieving a better area-time product, which is especially beneficial for deployment on cost-sensitive or resource-constrained platforms. Furthermore, our serial architecture can be parallelized, offering a configurable trade-off between area and throughput.

5.2. Butterfly unit comparison

The low ENS in our designs is attributed to the use of a DSP-free optimized butterfly unit. A comparison of the resource utilization, ATP and throughput/slice of our butterfly unit with state-of-the-art designs from the literature is presented in Table 3. The butterfly unit in [32] uses a unified architecture and performs CT/GS butterfly operations for the NTT/INTT. Our DSP-free butterfly unit results in a 51.3% improved ATP and 86.7% higher throughput when compared with [32]. Three different configurations of butterfly units are presented in [26] to design NTT architectures. Compared with the single radix-2 butterfly unit in [26], the proposed design achieves a 90.8% improved ATP and 90.8% improved throughput.

Table 3. Comparison of Butterfly unit with state-of-the-art designs.

https://doi.org/10.1371/journal.pone.0333301.t003

The authors of [61] have designed optimized butterfly units for CRYSTALS-Kyber using different modular reduction techniques. The maximum operating frequency and latency are not given in [61], hence we have only compared the area utilization of our proposed butterfly unit with [61]. The butterfly units in [61] employing the LUT4/K-RED and K1.5-RED modular reduction units have 66.8% and 64.8% greater area when compared to the proposed butterfly unit. Nguyen et al. in [23] have proposed a pipelined butterfly unit to design a unified Kyber/Dilithium NTT architecture. The butterfly unit is DSP-free, but its high latency results in an 80.5% increased ATP and 81.3% reduced TP/slice when compared to our proposed design. An optimized DSP-based butterfly unit is presented in [62]. The DSP-free optimized butterfly unit proposed in this paper achieves a 72.6% improved ATP and 72.8% higher throughput when compared to [62].

5.3. Modular multiplier comparison

To achieve a compact and resource-efficient butterfly unit, an area-time efficient modular multiplier is essential. In this work, we have employed Montgomery reduction with lightweight operations, along with a LUT-based integer multiplier. A comparison of the ENS, ATP and throughput of our modular multiplier with state-of-the-art designs from the literature is presented in Table 4. An interleaved multiplication approach is adopted in [63] for modular multiplication. The architecture is DSP-free but has high latency. Designs [63]-a (PHIM approach) and [63]-b (PLIM approach) have 76.7% and 64.9% increased ATP and 76.7% and 65% lower throughput when compared to our modular multiplier.

Table 4. Comparison of Modular Multiplier with state-of-the-art designs.

https://doi.org/10.1371/journal.pone.0333301.t004

A DSP-free modular multiplication is implemented in [64], but the Vedic multiplier and K-RED based modular multiplier consume more LUTs/FFs than our implemented design. Our proposed modular multiplier achieves a 75% improved ATP and 75.2% increased throughput when compared to [64]. An optimized Barrett modular reduction and DSP-based integer multiplication approach is adopted in [62]. The DSP-free modular multiplier in this work achieves a 74.7% improved ATP and 74.8% higher throughput when compared to [62]. We have compared two modular reduction techniques in [61] with our work, LUT4/K-RED ([61]-a) and K1.5-RED ([61]-b). LUT4/K-RED uses both a lookup table and K-RED for reduction. K1.5-RED is an improved variant of K-RED. Both designs perform integer multiplication using DSPs, resulting in high ENS. When compared to our proposed modular multiplier, [61]-a and [61]-b have 66.6% and 64.6% increased ENS. An ATP and TP/slice comparison is not possible, as the maximum frequency and clock cycles are not given in [61].

In [65], Shah et al. present PM (a 5 × 5 multiplier without reduction) and PFFM (a 5 × 5 multiplier with reduction) for the post-quantum digital signature scheme MAYO with q = 31. Our proposed multiplier is a 12 × 12 modular multiplier for CRYSTALS-Kyber (q = 3329). To facilitate a fair comparison of our core with [65], we re-instantiated our modular multiplier with 5-bit operands and modulus q = 3329. Our 5 × 5 multiplier with reduction logic occupies 32 LUTs with a critical path delay (CPD) of 2.17 ns, as opposed to the 26 LUTs of PFFM in [65], with a CPD of 2.28 ns.

This modest increase is expected because [65] exploits the modulus q = 31, which allows a very lightweight reduction, whereas our design supports modulus q = 3329 and therefore incurs slightly higher logic cost even at smaller widths. Importantly, our design remains competitive in terms of CPD and it scales efficiently to the 12 × 12 setting required for Kyber. This comparison demonstrates the scalability and robustness of our architecture across different operand sizes.

5.4. Resource utilization analysis

The targeted Artix-7 platform has LUTs and FFs, and the Virtex-7 device has LUTs and FFs. Our design utilizes only of the resources on the smallest Artix-7 device and of the resources on Virtex-7. This shows that our lightweight NTT implementation is suitable for use in resource-constrained IoT devices. The resource utilization breakdown of the different modules in our implementation , is given in Figs 26 and 27. Moreover, the pipelined SDFNTT consumes a total on-chip power of 712 mW. Energy is obtained by multiplying the total power estimated in Vivado by the execution time of the design, which for design c is calculated as 1.25 µJ.

Fig 26. Resource utilization and breakdown of implemented NTT/INTT core.

https://doi.org/10.1371/journal.pone.0333301.g026

Fig 27. Resource utilization and breakdown of implemented butterfly unit.

https://doi.org/10.1371/journal.pone.0333301.g027

6. Conclusion

CRYSTALS-Kyber has been standardized by NIST as the public-key encryption and key encapsulation mechanism algorithm. In this research, we have accelerated the polynomial multiplication NTT unit of Kyber. A BRAM-free and DSP-free, single-path delay feedback (SDF) based NTT architecture for CRYSTALS-Kyber is presented. The performance of our hardware is attributed to adopting the resource sharing technique, using distributed LUT-based BROMs for storing pre-computed values, using a LUT-based integer multiplier unit for coefficient multiplication and employing a compact multiplier-less reduction unit. Our proposed pipelined design achieves 49.6% better ATP and utilizes the least hardware resources when compared to prior work in the literature.

Additionally, our architecture executes operations in constant time and is therefore resistant to timing-based side-channel attacks (SCAs). However, other types of side-channel leakage, such as power analysis or fault-injection-based attacks, remain possible in hardware designs. The current work does not incorporate dedicated countermeasures such as masking, hiding or noise injection, as our primary focus was on achieving area-time efficient implementations; these countermeasures are identified as future work below.

7. Future work

The NTT is employed by CRYSTALS-Kyber to efficiently perform polynomial multiplication. In this paper, we implement a single-path delay feedback NTT architecture tailored for CRYSTALS-Kyber. To enhance the operating frequency of the overall design, pipelining is incorporated by addressing the data conflict issues inherent in conventional SDF NTT structures. Our pipelined implementations, realized on Artix-7 and Virtex-7 FPGAs, demonstrate superior performance compared to state-of-the-art architectures, as detailed in Section 5. The designed SDF architecture can be extended to a multi-path delay feedback (MDF) design, enabling the simultaneous processing of multiple NTT coefficients and increasing the overall throughput of the system. As a future direction, we aim to develop a unified butterfly architecture that supports both Kyber and Dilithium parameters while incorporating point-wise multiplication operations, thereby improving the scalability and adaptability of the design. Future work will also focus on integrating countermeasures against power-based and fault-based SCAs to strengthen robustness while maintaining efficiency for lightweight PQC implementations.

References

  1. 1. Rivest RL, Shamir A, Adleman L. A method for obtaining digital signatures and public-key cryptosystems. Commun ACM. 1978;21(2):120–6.
  2. 2. Miller VS. Use of Elliptic Curves in Cryptography. In: Williams HC, editor. Advances in Cryptology — CRYPTO’85 Proceedings [Internet]. Berlin, Heidelberg: Springer Berlin Heidelberg; 1986 [cited 2024 Jul 8]. p. 417–26. (Lecture Notes in Computer Science; vol. 218). Available from: http://link.springer.com/10.1007/3-540-39799-X_31
  3. 3. Xu G, Mao J, Sakk E, Wang SP. An Overview of Quantum-Safe Approaches: Quantum Key Distribution and Post-Quantum Cryptography. In: 2023 57th Annual Conference on Information Sciences and Systems (CISS) [Internet]. 2023 [cited 2024 Jul 8]. p. 1–6. Available from: https://ieeexplore.ieee.org/abstract/document/10089619
  4. 4. Bos J, Ducas L, Kiltz E, Lepoint T, Lyubashevsky V, Schanck JM, et al. CRYSTALS - Kyber: A CCA-Secure Module-Lattice-Based KEM. In: 2018 IEEE European Symposium on Security and Privacy (EuroS&P) [Internet]. 2018 [cited 2024 Jul 11]. p. 353–67. Available from: https://ieeexplore.ieee.org/abstract/document/8406610
  5. 5. Wang X, Xu G, Yu Y. Lattice-Based Cryptography: A Survey. Chinese Annals of Mathematics Series B. 2023;44(6):945–60.
  6. 6. Zeng C, He D, Feng Q, Peng C, Luo M. The implementation of polynomial multiplication for lattice-based cryptography: A survey. J Inf Secur Appl. 2024;83:103782.
  7. 7. Nejatollahi H, Dutt N, Ray S, Regazzoni F, Banerjee I, Cammarota R. Post-Quantum Lattice-Based Cryptography Implementations: A Survey. ACM Comput Surv. 2019;51(6):129:1-129:41.
  8. 8. An Extensive Study of Flexible Design Methods for the Number Theoretic Transform | IEEE Journals & Magazine | IEEE Xplore [Internet]. [cited 2024 Jul 11]. Available from: https://ieeexplore.ieee.org/abstract/document/9171507
  9. 9. Abdulrahman A, Hwang V, Kannwischer MJ, Sprenkels A. Faster Kyber and Dilithium on the Cortex-M4. In: Ateniese G, Venturi D, editors. Applied Cryptography and Network Security. Cham: Springer International Publishing; 2022. p. 853–71.
  10. 10. Salarifard R, Soleimany H. An efficient hardware accelerator for NTT-based polynomial multiplication using FPGA. J Cryptogr Eng. 2024;14(2):415–26.
  11. 11. Nguyen H, Tran L. Design of Polynomial NTT and INTT Accelerator for Post-Quantum Cryptography CRYSTALS-Kyber. Arab J Sci Eng. 2022;48(2):1527–36.
  12. 12. Hardware Acceleration of NTT-Based Polynomial Multiplication in CRYSTALS-Kyber | SpringerLink [Internet]. [cited 2024 Jul 8]. Available from: https://link.springer.com/chapter/10.1007/978-981-97-0945-8_7
  13. 13. Gupta N, Jati A, Chauhan AK, Chattopadhyay A. PQC Acceleration Using GPUs: FrodoKEM, NewHope, and Kyber. IEEE Trans Parallel Distrib Syst. 2021;32(3):575–86.
  14. 14. Memory-Efficient High-Speed Implementation of Kyber on Cortex-M4 | SpringerLink [Internet]. [cited 2024 Jul 11]. Available from: https://link.springer.com/chapter/10.1007/978-3-030-23696-0_11
  15. 15. Nguyen DT. Optimized Software Implementations Using NEON-Based Special Instructions. 2021.
  16. 16. A Survey of Software Implementations for the Number Theoretic Transform | SpringerLink [Internet]. [cited 2024 Jul 11]. Available from: https://link.springer.com/chapter/10.1007/978-3-031-46077-7_22
  17. 17. Nguyen DT, Dang VB, Gaj K. A High-Level Synthesis Approach to the Software/Hardware Codesign of NTT-Based Post-Quantum Cryptography Algorithms. In: 2019 International Conference on Field-Programmable Technology (ICFPT) [Internet]. 2019 [cited 2024 Jul 11. ]. p. 371–4. Available from: https://ieeexplore.ieee.org/abstract/document/8977896
  18. 18. Dang V, Farahmand F, Andrzejczak M, Mohajerani K, Nguyen DT, Gaj K. Implementation and Benchmarking of Round 2 Candidates in the NIST Post-Quantum Cryptography Standardization Process Using Hardware and Software/Hardware Co-design Approaches.
  19. 19. Ma L, Wu X, Bai G. Parallel polynomial multiplication optimized scheme for CRYSTALS-KYBER Post-Quantum Cryptosystem based on FPGA. In: 2021 International Conference on Communications, Information System and Computer Engineering (CISCE) [Internet]. 2021 [cited 2024 Jul 11. ]. p. 361–5. Available from: https://ieeexplore.ieee.org/abstract/document/9445987
  20. 20. Bisheh-Niasar M, Azarderakhsh R, Mozaffari-Kermani M. Instruction-Set Accelerated Implementation of CRYSTALS-Kyber. IEEE Trans Circuits Syst I. 2021;68(11):4648–59.
  21. 21. Banerjee U, Ukyab TS, Chandrakasan AP. Sapphire: A configurable crypto-processor for post-quantum lattice-based protocols. IACR Transactions on Cryptographic Hardware and Embedded Systems. 2019:17–61.
  22. Sun J, Bai X. A High-Speed Hardware Architecture of an NTT Accelerator for CRYSTALS-Kyber. Integr Circuits Syst. 2024;1(2):92–102.
  23. High-Speed NTT Accelerator for CRYSTAL-Kyber and CRYSTAL-Dilithium | IEEE Journals & Magazine | IEEE Xplore [Internet]. [cited 2024 Jul 11]. Available from: https://ieeexplore.ieee.org/abstract/document/10453519
  24. Guo W, Li S. Highly-Efficient Hardware Architecture for CRYSTALS-Kyber With a Novel Conflict-Free Memory Access Pattern. IEEE Trans Circuits Syst I. 2023;70(11):4505–15.
  25. Ni Z, Khalid A, Kundi D-S, O’Neill M, Liu W. HPKA: A High-Performance CRYSTALS-Kyber Accelerator Exploring Efficient Pipelining. IEEE Trans Comput. 2023;72(12):3340–53.
  26. Itabashi Y, Ueno R, Homma N. Efficient Modular Polynomial Multiplier for NTT Accelerator of Crystals-Kyber. In: 2022 25th Euromicro Conference on Digital System Design (DSD) [Internet]. 2022 [cited 2024 Jul 11]. p. 528–33. Available from: https://ieeexplore.ieee.org/abstract/document/9996868/
  27. Ye Z, Cheung RCC, Huang K. PipeNTT: A Pipelined Number Theoretic Transform Architecture. IEEE Trans Circuits Syst II. 2022;69(10):4068–72.
  28. Bisheh-Niasar M, Azarderakhsh R, Mozaffari-Kermani M. High-Speed NTT-based Polynomial Multiplication Accelerator for Post-Quantum Cryptography. In: 2021 IEEE 28th Symposium on Computer Arithmetic (ARITH) [Internet]. 2021 [cited 2024 Jul 11]. p. 94–101. Available from: https://ieeexplore.ieee.org/abstract/document/9603378
  29. Zhang C, Liu D, Liu X, Zou X, Niu G, Liu B, et al. Towards Efficient Hardware Implementation of NTT for Kyber on FPGAs. In: 2021 IEEE International Symposium on Circuits and Systems (ISCAS) [Internet]. 2021 [cited 2024 Jul 11]. p. 1–5. Available from: https://ieeexplore.ieee.org/abstract/document/9401170
  30. Huang Y, Huang M, Lei Z, Wu J. A pure hardware implementation of CRYSTALS-KYBER PQC algorithm through resource reuse. IEICE Electron Express. 2020;17(17):20200234.
  31. Chen Z, Ma Y, Chen T, Lin J, Jing J. High-performance area-efficient polynomial ring processor for CRYSTALS-Kyber on FPGAs. Integration. 2021;78:25–35.
  32. Yaman F, Mert AC, Öztürk E, Savaş E. A Hardware Accelerator for Polynomial Multiplication Operation of CRYSTALS-KYBER PQC Scheme. In: 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE) [Internet]. 2021 [cited 2024 Jul 11]. p. 1020–5. Available from: https://ieeexplore.ieee.org/abstract/document/9474139
  33. Xing Y, Li S. A compact hardware implementation of CCA-secure key exchange mechanism CRYSTALS-KYBER on FPGA. IACR Trans Cryptogr Hardw Embed Syst. 2021:328–56.
  34. Guo W, Li S, Kong L. An Efficient Implementation of KYBER. IEEE Trans Circuits Syst II. 2022;69(3):1562–6.
  35. Ni Z, Khalid A, Liu W, O’Neill M. Towards a lightweight CRYSTALS-Kyber in FPGAs: an ultra-lightweight BRAM-free NTT core. In: 2023 IEEE International Symposium on Circuits and Systems (ISCAS). 2023 Jul 21.
  36. Nguyen TT, Kim S, Eom Y, Lee H. Area-time efficient hardware architecture for CRYSTALS-Kyber. Appl Sci. 2022;12(11):5305.
  37. Waris A, Aziz A, Khan BM. Area-time efficient pipelined number theoretic transform for CRYSTALS-Kyber. PLoS One. 2025;20(5):e0323224. pmid:40367126
  38. Liang Z, Zhao Y. Number Theoretic Transform and Its Applications in Lattice-based Cryptosystems: A Survey [Internet]. arXiv; 2022 [cited 2024 Jul 11]. Available from: http://arxiv.org/abs/2211.13546
  39. Cooley JW, Tukey JW. An algorithm for the machine calculation of complex Fourier series. Math Comput. 1965;19(90):297–301.
  40. Gentleman WM, Sande G. Fast Fourier Transforms: for fun and profit. In: Proceedings of the November 7-10, 1966, fall joint computer conference [Internet]. New York, NY, USA: Association for Computing Machinery; 1966 [cited 2024 Jul 11]. p. 563–78. (AFIPS’66 (Fall)). Available from:
  41. Satriawan A, Syafalni I, Mareta R, Anshori I, Shalannanda W, Barra A. Conceptual review on number theoretic transform and comprehensive review on its implementations. IEEE Access. 2023;11:70288–316.
  42. 7 Series DSP48E1 Slice User Guide (UG479). 2018;v1.10. Available from: https://docs.amd.com/v/u/en-US/ug479_7Series_DSP48E1
  43. A Review on Comparative Analysis of Add-Shift Multiplier and Array Multiplier Performance Parameters | SpringerLink [Internet]. [cited 2024 Jul 11]. Available from: https://link.springer.com/chapter/10.1007/978-981-15-9293-5_41
  44. Barrett P. Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor. In: Advances in Cryptology – CRYPTO ’86 [Internet]. Berlin, Heidelberg: Springer; 1987 [cited 2024 Jul 11]. p. 311–23. Available from: https://link.springer.com/chapter/10.1007/3-540-47721-7_24
  45. Montgomery PL. Modular multiplication without trial division. Math Comput. 1985;44(170):519–21.
  46. Kaya Koc C, Acar T, Kaliski BS. Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro. 1996;16(3):26–33.
  47. Liu Z, Seo H, Sinha Roy S, Großschädl J, Kim H, Verbauwhede I. Efficient Ring-LWE Encryption on 8-Bit AVR Processors. In: Cryptographic Hardware and Embedded Systems -- CHES 2015 [Internet]. Berlin, Heidelberg: Springer; 2015 [cited 2024 Jul 11]. p. 663–82. Available from: https://link.springer.com/chapter/10.1007/978-3-662-48324-4_33
  48. Kim DW, Maulana DI, Jung W. Kyber Accelerator on FPGA Using Energy-Efficient LUT-Based Barrett Reduction. In: 2022 19th International SoC Design Conference (ISOCC) [Internet]. 2022 [cited 2024 Jul 11]. p. 83–4. Available from: https://ieeexplore.ieee.org/abstract/document/10031533
  49. Rentería-Mejía CP, Velasco-Medina J. High-Throughput Ring-LWE Cryptoprocessors. IEEE Trans Very Large Scale Integr (VLSI) Syst. 2017;25(8):2332–45.
  50. Kundi D-S, Zhang Y, Wang C, Khalid A, O’Neill M, Liu W. Ultra High-Speed Polynomial Multiplications for Lattice-Based Cryptography on FPGAs. IEEE Trans Emerg Topics Comput. 2022;10(4):1993–2005.
  51. Longa P, Naehrig M. Speeding up the Number Theoretic Transform for Faster Ideal Lattice-Based Cryptography. In: Foresti S, Persiano G, editors. Cryptology and Network Security. Cham: Springer International Publishing; 2016. p. 124–39.
  52. Aikata A, Mert AC, Imran M, Pagliarini S, Roy SS. KaLi: A Crystal for Post-Quantum Security Using Kyber and Dilithium. IEEE Trans Circuits Syst I. 2023;70(2):747–58.
  53. Solinas JA. Generalized Mersenne Numbers. Citeseer. 1999.
  54. Bos JW, Montgomery PL. Montgomery Arithmetic from a Software Perspective [Internet]. 2017 [cited 2024 Jul 11]. Available from: https://eprint.iacr.org/2017/1057
  55. Hirner F, Mert A, Roy S. PROTEUS: A Tool to Generate Pipelined Number Theoretic Transform Architectures for FHE and ZKP Applications. IACR Cryptol ePrint Arch. 2023.
  56. Zhang N, Yang B, Chen C, Yin S, Wei S, Liu L. Highly Efficient Architecture of NewHope-NIST on FPGA using Low-Complexity NTT/INTT. IACR Trans Cryptogr Hardw Embed Syst. 2020:49–72.
  57. 7 Series FPGAs Configurable Logic Block User Guide (UG474). 2016;v1.8. Available from: https://docs.amd.com/v/u/en-US/ug474_7Series_CLB
  58. Rashid M, Sonbul OS, Jamal SS, Jaffar AY, Kakhorov A. A Pipelined Hardware Design of FNTT and INTT of CRYSTALS-Kyber PQC Algorithm. Information. 2025;16(1):17.
  59. Yahya Hummdi A, Aljaedi A, Bassfar Z, Shaukat Jamal S, Mazyad Hazzazi M, Rehman MU. Unif-NTT: A Unified Hardware Design of Forward and Inverse NTT for PQC Algorithms. IEEE Access. 2024;12:94793–804.
  60. Javeed K, Gregg D. Efficient Number Theoretic Transform Architecture for CRYSTALS-Kyber. IEEE Trans Circuits Syst II. 2025;72(1):263–7.
  61. Bertels J, Norga Q, Verbauwhede I. A Better Kyber Butterfly for FPGAs. In: 2024 34th International Conference on Field-Programmable Logic and Applications (FPL) [Internet]. 2024 [cited 2025 Mar 27]. p. 171–7. Available from: https://ieeexplore.ieee.org/abstract/document/10705545/authors#authors
  62. Waris A, Aziz A, Muhammad Khan B. Unified Butterfly for NTT in Post-Quantum Cryptography Algorithm CRYSTALs-Kyber. In: 2024 4th International Conference on Innovations in Computer Science (ICONICS) [Internet]. 2024 [cited 2025 Aug 11]. p. 1–6. Available from: https://ieeexplore.ieee.org/document/10824499
  63. Javeed K, Rubab S, Gregg D. Efficient Reconfigurable Modular Multipliers for Post-Quantum Digital Signatures. In: 2024 31st IEEE International Conference on Electronics, Circuits and Systems (ICECS) [Internet]. 2024 [cited 2025 Aug 11]. p. 1–4. Available from: https://ieeexplore.ieee.org/abstract/document/10849189
  64. Nguyen TH, Pham CK, Hoang TT. A High-Efficiency Modular Multiplication Digital Signal Processing for Lattice-Based Post-Quantum Cryptography. Cryptography. 2023;7(4):46.
  65. Shah YA, Rafferty C, Khalid A, Khan S, Javeed K, O’Neill M. Efficient Soft Core Multiplier for Post Quantum Digital Signatures. In: 2024 IEEE International Symposium on Circuits and Systems (ISCAS) [Internet]. 2024 [cited 2025 Aug 11]. p. 1–5. Available from: https://ieeexplore.ieee.org/abstract/document/10558234