FPGA VERSUS ASIC IMPLEMENTATION OF RADIX-8 SCALABLE MONTGOMERY MODULAR MULTIPLIER

Traditional ASIC implementations have the well-known drawback of reduced flexibility compared to software implementations. Since modern security protocols are increasingly defined to be algorithm independent, a high degree of flexibility with respect to the cryptographic algorithms is desirable. A promising solution that combines high flexibility with the speed and physical security of traditional hardware is the implementation of cryptographic algorithms on reconfigurable devices such as FPGAs. In this paper we compare, in terms of area and speed, an FPGA implementation of a radix-8 scalable Montgomery modular multiplier using a retiming technique with an ASIC implementation for different operand word sizes. The simulation data were generated using Mentor Graphics CAD tools.


INTRODUCTION
Modular multiplication is a widely used operation in cryptography. Several well-known applications, such as the decipherment operation of the RSA algorithm [1] and the Diffie-Hellman key exchange algorithm [2], as well as some applications currently under development, such as the Digital Signature Standard [3] and elliptic curve cryptography [4], all use modular multiplication and modular exponentiation. The latter operation is often implemented by a series of multiplications and additions [5,6,7,8].
Given the increasing demand for secure communications, cryptographic algorithms will be embedded in almost every application involving the exchange of information. Some of these applications, such as smart cards [9] and handheld devices, require hardware restricted in area and power resources [10].
An efficient algorithm for implementing modular multiplication is the Montgomery Multiplication algorithm [11]. It has many advantages over ordinary modular multiplication algorithms. The main advantage is that the division step in taking the modulus is replaced by shift operations, which are easy to implement in hardware [10].
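To illustrate how the division is replaced by shifts, the following is a minimal bit-serial (radix-2) sketch in Python. It is not the scalable multiple-word design compared in this paper; the function name `montgomery_multiply` and its parameter names are chosen here for illustration, and the modulus is assumed to be odd.

```python
def montgomery_multiply(x: int, y: int, m: int, n: int) -> int:
    """Return x * y * 2**(-n) mod m, for odd m and 0 <= x, y < m < 2**n.

    Illustrative radix-2 sketch: each iteration adds a multiple of y,
    makes the partial sum even by conditionally adding m, then shifts.
    """
    if m % 2 == 0:
        raise ValueError("modulus must be odd")
    s = 0
    for i in range(n):
        x_i = (x >> i) & 1      # scan one bit of the multiplier
        s += x_i * y            # add the multiplicand if the bit is set
        if s & 1:               # make the least-significant bit zero
            s += m
        s >>= 1                 # exact division by 2: no bits are lost
    if s >= m:                  # the loop keeps s below 2*m
        s -= m
    return s
```

The result carries an extra factor of 2**(-n) modulo m, so operands are normally converted to and from this representation around a sequence of multiplications.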
One aspect of cryptographic applications is that very large numbers are used. The precision varies from 128 and 256 bits for elliptic curve cryptography to 1024 and 2048 bits for applications based on exponentiation [12]. Most hardware designs for modular multiplication are fixed-precision solutions; that is, the operands can only be of a fixed bit-size. Designs that can take operands of arbitrary precision have been researched in both the ASIC [13] and the FPGA [8] realms.
A scalable (variable-precision) Montgomery multiplier design methodology was introduced in [13] in order to obtain hardware implementations. This design methodology allows a fixed-area modular multiplication circuit to be used for multiplying operands of unlimited precision. The design tradeoffs for best performance in a limited chip area were also analyzed in [13]. An extension of this design methodology to higher radices was introduced in [14].
Traditional ASIC implementations, however, have the well-known drawback of reduced flexibility compared to software implementations. Since modern security protocols are increasingly defined to be algorithm independent, a high degree of flexibility with respect to the cryptographic algorithms is desirable. A promising solution that combines high flexibility with the speed and physical security of traditional hardware is the implementation of cryptographic algorithms on reconfigurable devices such as FPGAs.
In this paper we compare, in terms of area and speed, an FPGA implementation of a radix-8 scalable Montgomery modular multiplier using a retiming technique [14] with an ASIC implementation for different operand word sizes. The simulation data were generated using Mentor Graphics CAD tools.
This contribution is structured as follows. In Section 2 we present the radix-8 Montgomery Modular Multiplication algorithm (R8MM). Section 3 presents the overall organization of the modular multiplier that implements the R8MM. Section 4 shows the simulation results, generated using Mentor Graphics CAD tools. Section 5 concludes the work.

R8MM ALGORITHM
The notation used in the presented multiple-word Radix-8 Montgomery Multiplication algorithm (R8MM) is shown below (Fig. 1).
Fig. 2 shows the R8MM algorithm, which is an extension of the Multiple-Word High-Radix ($R2^k$) Montgomery Multiplication algorithm (MWR$2^k$MM) presented and proved to be correct in [14].
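As a rough guide to what one radix-8 iteration does, the following Python sketch processes three bits of X per iteration on full-precision integers. It is a simplified illustration, not the multiple-word algorithm of Fig. 2: it omits Booth recoding, word-by-word processing, and retiming, and it selects the quotient digit from the three least-significant bits of S. The names `radix8_montgomery`, `neg_m_inv`, and `qm_table` are hypothetical, and the step numbers quoted in the text refer to Fig. 2, not to this sketch.

```python
def radix8_montgomery(x: int, y: int, m: int, num_digits: int) -> int:
    """Return x * y * 8**(-num_digits) mod m.

    Assumes m is odd and 0 <= x, y < m < 8**num_digits.
    """
    if m % 2 == 0:
        raise ValueError("modulus must be odd")
    # -M^(-1) mod 8, found by search: the q with (q * m + 1) % 8 == 0.
    neg_m_inv = next(q for q in range(8) if (q * m + 1) % 8 == 0)
    # Illustrative lookup table: quotient digit for each value of the
    # three least-significant bits of the partial product.
    qm_table = [(low * neg_m_inv) % 8 for low in range(8)]
    s = 0
    for j in range(num_digits):
        x_j = (x >> (3 * j)) & 0b111   # one radix-8 digit of X
        s += x_j * y                   # add the multiple of Y
        qm_j = qm_table[s & 0b111]     # quotient digit qM_j
        s += qm_j * m                  # three LSBs of s are now zero
        s >>= 3                        # exact division by 8
    if s >= m:                         # the loop keeps s below 2*m
        s -= m
    return s
```

Adding `qm_j * m` never changes the value of S modulo M, which is why the right shift by three bits can be performed without losing information.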
In order to make the three least-significant bits of the partial product S all zeros, a multiple of the modulus M, namely $qM_j \cdot M$, is added to the partial product. This step is required to ensure that no significant bits are lost in the right-shift operation performed in step 10. To compute the digit $qM_j$, we need to examine bits 5 to 3 of the partial product S generated in step 5 of the R8MM algorithm. The operand Y is represented as multiple words.
The first difficulty in this design comes from the fact that Z and qM can take values that are not powers of 2. For example, the bit-vector 2Y can be produced from Y by left-shifting Y by one bit, whereas the bit-vector 3Y is produced by adding Y and 2Y. The latter requires far more time than simple bit-shifting.
For Z the difficult values are 3 and -3, and for qM the difficult values are 3, 5 and 7. One way of implementing the coefficients is to split Z and qM into at least two values each. In this case, summing or subtracting the two bit-vectors to obtain the bit-vector for 3Y would be overkill in terms of computational speed. A better approach is to use two bit-vectors for each such multiple. The same logic applies to $qM_j$.
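The idea of keeping a multiple as two bit-vectors can be sketched in Python as follows. The particular split in `SPLIT` (for example 3 = 2 + 1 and 7 = 8 - 1) is an assumption made here for illustration; the source does not list the split that the design actually uses. Each part is zero or plus or minus a power of two, so it needs only a shift and an optional negation, and the two parts can be fed to a carry-save adder together with the partial product instead of being summed by a carry-propagate adder first.

```python
# Hypothetical split of each coefficient into two parts, each of which is
# 0 or +/- a power of two (so each part is just a shift, maybe negated).
SPLIT = {
    -4: (-4, 0), -3: (-2, -1), -2: (-2, 0), -1: (-1, 0),
    0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (2, 1),
    4: (4, 0), 5: (4, 1), 6: (4, 2), 7: (8, -1),
}


def shift_multiple(v: int, part: int) -> int:
    """Return part * v, where part is 0 or +/- a power of two."""
    if part == 0:
        return 0
    shifted = v << (abs(part).bit_length() - 1)
    return shifted if part > 0 else -shifted


def two_vectors(coeff: int, v: int):
    """Represent coeff * v as two shifted vectors, left unsummed."""
    a, b = SPLIT[coeff]
    return shift_multiple(v, a), shift_multiple(v, b)


def carry_save_add(a: int, b: int, c: int):
    """3:2 compressor: return (sum, carry) with sum + carry == a + b + c."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry
```

For instance, `two_vectors(3, y)` returns `2*y` and `y` as separate vectors, and `carry_save_add(s, 2*y, y)` absorbs both into the partial product without a full-width carry chain.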

OVERALL ORGANIZATION
The architecture of the modular multiplier that implements the R8MM consists of three main blocks: the Datapath (or Kernel), IO & Memory, and the Control block. The computation shown in the R8MM algorithm takes place in the kernel [10].
The kernel is organized as a pipeline of Processing Elements (PEs) [10], separated by registers. Each PE implements one iteration of the R8MM algorithm (steps 3 to 12).

Radix-8 Processing Element
The radix-8 PE is organized as shown in Fig. 3. The main functional blocks in the PE are: Booth recoding, multiple generation (Mult Gen), multiprecision carry-save adders (MPCSA), the $qM_j$ table, and registers (shaded boxes). The PE operates on w-bit words, and for this reason the Mult Gen and MPCSA modules are capable of storing and transferring carry bits from one word to the next. Shifting and word alignment are done by a proper combination of signals and registers at the output of the last MPCSA. The design uses a retiming technique explained in [14]. More details about these modules and their operation can be found in [10]. [...] for S needs to be at least 6 bits in order to have the three least-significant bits of S generated as early as possible for the next PE.
A stage consists of a PE and a register. At each clock cycle, one word of Y, M, SS, and SC is applied as input to a stage. The multiplier digits $X_i$ are transferred to the PEs at specific times. The newly computed words of SS and SC, together with the words of Y and M, are propagated by each stage to the next stage. In this way, small PEs work concurrently to perform several iterations of the R8MM algorithm.
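The Booth recoding block of the PE can be illustrated with the standard radix-8 Booth recoding, sketched below in Python. Each digit is derived from three bits of X plus the top bit of the previous group, giving digits in the range -4 to 4, which is consistent with 3 and -3 being the difficult values of Z. The function name `booth_radix8` is hypothetical, and the sketch assumes a non-negative X whose top scanned bit is zero; the recoding circuit actually used is described in [10].

```python
def booth_radix8(x: int, num_digits: int) -> list:
    """Recode non-negative x into num_digits signed radix-8 digits.

    Returns digits z[0..num_digits-1], each in -4..4, such that
    sum(z[j] * 8**j) == x. Requires the top scanned bit of x to be 0.
    """
    if x < 0 or x.bit_length() >= 3 * num_digits:
        raise ValueError("x needs a zero bit at position 3*num_digits - 1")
    digits = []
    prev_top = 0                           # bit just below the current group
    for j in range(num_digits):
        b0 = (x >> (3 * j)) & 1
        b1 = (x >> (3 * j + 1)) & 1
        b2 = (x >> (3 * j + 2)) & 1
        # The top bit of each group counts as -4 here and as +1 in the
        # next group, so its net weight is the usual 4 * 8**j.
        digits.append(-4 * b2 + 2 * b1 + b0 + prev_top)
        prev_top = b2
    return digits
```

Because the digits are signed, the multiples of Y needed by Mult Gen are limited to 0, Y, 2Y, 3Y, and 4Y and their negatives, of which only 3Y is not a plain shift.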

SIMULATION RESULTS
The simulation data were generated using Mentor Graphics CAD tools. The radix-8 design presented in this paper was described in VHDL and simulated in ModelSim for functional correctness. Simulation results for this algorithm are shown in Fig. 4.

ASIC Implementation
The radix-8 design was synthesized using the Leonardo synthesis tool for AMI05-slow (0.5 µm CMOS technology, with hierarchy preserved) provided in the ASIC Design Kit (ADK) from the same company. It should be noted that the ADK was developed for educational purposes. However, it provides a consistent environment for comparison between designs and a reasonable approximation of system performance.

Area Estimation for Radix-8 Kernel
The area of the kernel depends on two design parameters: the number of stages in the pipeline (NS) and the word size (w) of the operands (Y, M) and the result (S). The total area of the kernel is given by Eq. 1 [10].

Time Estimation for Radix-8 Kernel
The total computational time for the kernel is the product of the number of clock cycles it takes and the clock period. Table 3 shows the critical path delay (measured in ns) as a function of the number of stages in the pipeline (NS) and the word size (w) of the operands. As can be seen from the table, the critical path delay in some cases remains constant even when the number of stages is increased. This is attributed to the use of carry-save logic.
A word of Y, M, and S propagates through the pipeline for $2 \cdot NS + 1$ clock cycles. For radix-8, the bits of X are scanned at a rate of three bits per stage. Based on these observations, Eq. 2 gives the total number of clock cycles needed for R8MM [10].
The total computational time is obtained by multiplying Tclks by the corresponding critical path delay (clock period) shown in Table 3, which was obtained from synthesis tools.

FPGA Implementation
The radix-8 design was synthesized using the Leonardo synthesis tool for Xilinx Virtex-II technology.

Area Results
The area in the FPGA is given in terms of Configurable Logic Blocks (CLBs). Each CLB is equivalent to approximately 172 2-input NOR gates. Table 4 shows the area (in number of 2-input NOR gates) as a function of the number of stages in the pipeline (NS) and the word size (w) of the operands.

Time Results
Table 5 shows the critical path delay (measured in ns) as a function of the number of stages in the pipeline (NS) and the word size (w) of the operands. As can be seen from the table, the critical path delay in some cases remains constant even when the number of stages is increased. This is attributed to the use of carry-save logic.
Fig. 1. Notation used in the R8MM algorithm:
* X - Multiplier, Y - Multiplicand, M - Modulus, S - Partial product;
* N - operand precision;
* $X_j$ - a single radix-8 digit of X at position j;
* $qM_j$ - quotient digit that determines a multiple of the modulus M to be added to the partial product S (computed by examining bits 5 to 3 of the partial product S generated in step 5 of the R8MM algorithm);
* w - number of bits in a word of either Y, M or S.

Table 2 is constructed using Eq. 1. The area estimates are given in terms of 2-input NOR gates.

Table 4. Area in number of NOR gates for radix-8 kernel (FPGA)