# Energy-Efficient Low-Latency Signed Multiplier for FPGA-Based Hardware Accelerators

Salim Ullah, Tuan Duy Anh Nguyen, and Akash Kumar, Senior Member, IEEE

*Abstract*: Multiplication is one of the most extensively used arithmetic operations in a wide range of applications, such as multimedia processing and artificial neural networks. For such applications, the multiplier is one of the major contributors to energy consumption, critical path delay, and resource utilization. These effects get more pronounced in field-programmable gate array (FPGA)-based designs. However, most of the state-of-the-art designs are done for ASIC-based systems. Furthermore, the few FPGA-based designs that exist are largely limited to unsigned numbers, which require extra circuits to support signed operations. To overcome these limitations for the FPGA-based implementations of applications utilizing signed numbers, this letter presents an area-optimized, low-latency, and energy-efficient architecture for an accurate signed multiplier. Compared to the Vivado area-optimized multiplier IP, our implementations offer up to 40.0%, 43.0%, and 70.0% reduction in terms of area, latency, and energy, respectively. The RTL implementations of our designs will be released as an open-source library at https://cfaed.tu-dresden.de/pd-downloads.

*Index Terms*: Accelerator architectures, artificial neural networks (ANN), fixed-point arithmetic, field-programmable gate arrays (FPGAs), multiplying circuits.


# I. INTRODUCTION

APPLICATIONS in the domain of digital signal processing and machine learning extensively use multiplication as one of the basic arithmetic operations. The architecture of a selected multiplier and its implementation directly affect the performance and energy consumption of such applications. The FPGA synthesis tools tend to use DSP blocks for high-performance multiplication [1]. However, two points are worth noting concerning the utilization of DSP blocks.

1) For many applications, such as artificial neural networks (ANNs), the 32-b floating-point precision is often not necessary for obtaining acceptable quality results. As discussed in Section III, our 8-b quantized implementation of an ANN reduces the classification accuracy only by 0.42% when compared with full-precision classification accuracy. For implementing multipliers for these low-precision numbers, the synthesis tools opt to use lookup tables (LUTs) instead of DSP blocks.

Manuscript received March 10, 2020; revised April 30, 2020; accepted May 10, 2020. This work was supported by the German Research Foundation (DFG) funded Project ReAp under Grant 380524764. This manuscript was recommended for publication by J. Hu. (*Corresponding author: Akash Kumar.*)

Salim Ullah and Akash Kumar are with the Department of Processor Design, Technische Universität Dresden, 01062 Dresden, Germany (e-mail: salim.ullah@tu-dresden.de; akash.kumar@tu-dresden.de).

Tuan Duy Anh Nguyen is with Xilinx Research Labs, Xilinx Inc., Singapore (e-mail: duyanhtu@xilinx.com).

Digital Object Identifier 10.1109/LES.2020.2995053

2) As noted by Ullah *et al.* [2] and Kuon and Rose [3], due to the nonuniform distribution of these DSP blocks across the FPGA, the critical path delay could be adversely affected when many of them have to be concatenated for large multiplication operations. Moreover, DSP resources are limited. On the other hand, the LUT resources are much larger. They also offer comparable performance with better energy efficiency and flexibility than the DSP blocks for small-sized multipliers. Therefore, it is more advantageous to have the option to use the low-area, high-performance, and energy-efficient LUT-based multiplier alongside the DSP blocks. In this letter, we provide area-optimized, low-latency, and energy-efficient accurate signed multipliers for FPGA-based systems.

FPGA vendors, such as Xilinx and Intel, provide softcore LUT-based multipliers (signed and unsigned) as described in [4]. These multipliers can be either area or speed optimized. Booth's algorithm [5] is also a commonly used technique for multiplication because it reduces the total number of generated partial products by encoding the multiplier bits. The widely known related works are [7]–[9] and [11]. Kumm et al. [7] and Walters [8] have used Booth's algorithm to present area-efficient radix-4 multiplier implementations for Xilinx FPGAs. However, these implementations do not use compressor trees for adding the generated partial products and have large critical path delays. More importantly, Kumm et al. [7] have not discussed the implementations for signed numbers. Parandeh-Afshar et al. [11] have proposed a partial product compressor tree for Altera (now Intel) FPGAs. Nonetheless, their generalized parallel counters underutilize LUTs in two consecutive adaptive logic modules (ALMs). In their follow-up work, Parandeh-Afshar and Ienne [9] have used Booth's and Baugh-Wooley's multiplication [6] algorithms for area-efficient multiplier implementation. However, in order to reduce the effective length of the carry chains, their design limits the length of the ALM to five, resulting in the underutilization of the FPGA resources.

On the other hand, Kakacak [12] and Kumm et al. [13] utilized smaller multiplier blocks for designing higher order multipliers. However, such techniques prove to be only useful for small bit-width multipliers; for higher bit-width multipliers, they consume more FPGA resources. For example, the logic-based implementation (using the "\*" operation) of an accurate $8 \times 8$ multiplier on Virtex-7 FPGA in Xilinx Vivado, with default synthesis options, consumes 71 LUTs, whereas the modular implementation of an accurate $8 \times 8$ multiplier using accurate $4 \times 4$ multipliers consumes 82 LUTs.

## A. Motivation for Signed Multipliers

For some signed numbers-based applications, it may still be possible to implement the required hardware accelerators utilizing unsigned multiplier designs. For example, we have quantized the trained parameters (weights and biases) of a


1943-0663 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.


TABLE I BOOTH ENCODING AND CORRESPONDING SE FOR PARTIAL PRODUCTS. $b_{m+1}$, $b_m$, AND $b_{m-1}$ ARE MULTIPLIER BITS. (a) RADIX-4 BOOTH ENCODING. (b) SE

(a) Inputs $b_{m+1}$, $b_m$, $b_{m-1}$; encoding BE; outputs $s$, $c$, $z$

| $b_{m+1}$ | $b_m$ | $b_{m-1}$ | BE             | s | c | z |
|-----------|-------|-----------|----------------|---|---|---|
| 0         | 0     | 0         | 0              | 0 | 0 | 1 |
| 0         | 0     | 1         | 1              | 0 | 0 | 0 |
| 0         | 1     | 0         | 1              | 0 | 0 | 0 |
| 0         | 1     | 1         | 2              | 1 | 0 | 0 |
| 1         | 0     | 0         | $\overline{2}$ | 1 | 1 | 0 |
| 1         | 0     | 1         | $\overline{1}$ | 0 | 1 | 0 |
| 1         | 1     | 0         | $\overline{1}$ | 0 | 1 | 0 |
| 1         | 1     | 1         | 0              | 0 | 0 | 1 |

(b)

| BE             | MSB of multiplicand | SE |
|----------------|---------------------|----|
| 0              | 0                   | 0  |
| 0              | 1                   | 0  |
| 1              | 0                   | 0  |
| 1              | 1                   | 1  |
| 2              | 0                   | 0  |
| 2              | 1                   | 1  |
| $\overline{2}$ | 0                   | 1  |
| $\overline{2}$ | 1                   | 0  |
| $\overline{1}$ | 0                   | 1  |
| $\overline{1}$ | 1                   | 0  |

lightweight ANN to 8-b fixed-point numbers to implement the ANN on FPGA. These parameters are signed numbers. To implement the ANN hardware using unsigned multipliers, we require additional signed-unsigned converters to extract the sign bit from the operands and compute the final product sign. These converters receive 2's complement numbers and produce corresponding numbers in sign-magnitude format. After multiplication in sign-magnitude format, the result is converted back to the 2's complement scheme using a signed-unsigned converter. These additional modules have increased the critical path delay of each multiplier by 2.061 ns and LUT utilization by 24. Therefore, for the hardware implementations of applications utilizing signed numbers, it is always advantageous to have high-performance signed arithmetic units.
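The converter-based data path described above can be illustrated with a small behavioural model. This is only a sketch in Python, not the RTL evaluated in this letter: the function names are hypothetical, and `unsigned_multiply` merely stands in for any unsigned multiplier core.

```python
def to_sign_magnitude(x: int, bits: int) -> tuple:
    """Input converter: split a two's-complement value into (sign, magnitude)."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= x <= hi:
        raise ValueError(f"{x} does not fit in {bits} bits")
    return (1, -x) if x < 0 else (0, x)


def from_sign_magnitude(sign: int, magnitude: int) -> int:
    """Output converter: turn (sign, magnitude) back into a signed value."""
    return -magnitude if sign else magnitude


def unsigned_multiply(a: int, b: int) -> int:
    """Stand-in for an unsigned multiplier core (assumption for illustration)."""
    return a * b


def signed_multiply_via_unsigned(a: int, b: int, bits: int = 8) -> int:
    """Signed product built from an unsigned core plus two converter stages."""
    sign_a, mag_a = to_sign_magnitude(a, bits)
    sign_b, mag_b = to_sign_magnitude(b, bits)
    product_magnitude = unsigned_multiply(mag_a, mag_b)
    # The product sign is the XOR of the operand signs.
    return from_sign_magnitude(sign_a ^ sign_b, product_magnitude)
```

Every operand passes through a converter before the multiplication and the product passes through another one afterwards, which is where the extra delay and LUTs reported above originate.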

## B. Novel Contributions

Our contributions include the following.

1) *A Novel Architecture for Booth Multiplier:* Using 6-input LUTs and associated fast carry chains of modern FPGAs, we present an architecture for signed multipliers that provides better performance<sup>1</sup> than state-of-the-art designs.
2) *Parallel Generation of Partial Products:* We eliminate the need for sequential computation of the partial products and generate all Booth-encoded partial products in parallel, which significantly reduces the overall critical path delay of the multiplier.
3) *Efficient Partial Products Encoding:* Our partial product encoding technique reduces the length of the carry chain in each partial product to further reduce the critical path of the multiplier.

# II. PROPOSED DESIGNS OF ACCURATE MULTIPLIERS

Using the concepts of the radix-4 Booth's multiplication algorithm, we present our area-optimized, low-latency, and energy-efficient accurate signed multipliers. The correct sign of a partial product, in a Booth's encoding (BE)-based multiplier, is decided by the sign of the multiplicand (the MSB) and the corresponding value of BE. Table I(a) and (b) shows the list of required sign extensions (SEs) for all possible combinations of BE's values and MSB of the multiplicand. We have used Bewick's SE technique [16] to implement the correct sign of a partial product. Unlike state-of-the-art implementations, our proposed architecture computes all partial products in parallel and then adds the generated partial products using multiple 4:2 compressors and a ripple carry adder (RCA). The parallel generation of partial products significantly reduces the critical path delay of the multiplier. Our implementations provide optimized configurations for the 6-input LUTs and the associated



Fig. 1. Configuration of LUTs used in proposed design. (a) Type-A. (b) Type-B. (c) Type-C.

carry chains in a logic slice of modern FPGAs such as Xilinx Virtex-7 series.
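As a software illustration of the radix-4 Booth recoding on which the design builds, the following Python sketch recodes the multiplier into digits from {-2, -1, 0, 1, 2} and sums the shifted partial products. It is a behavioural model under our own naming (`booth_radix4_digits` and `booth_multiply` are hypothetical helpers), not the LUT-level architecture; the point is only that every partial product depends on the operands alone, so all rows can be formed independently before accumulation.

```python
def booth_radix4_digits(b: int, m_bits: int) -> list:
    """Recode a signed m_bits-wide multiplier into radix-4 Booth digits, LSB first.

    Each digit equals -2*b[m+1] + b[m] + b[m-1] with b[-1] = 0, matching the
    BE column of Table I(a). Python's arithmetic right shift provides the
    sign extension needed when m_bits is odd.
    """
    digits = []
    b_prev = 0  # b[m-1] of the first group
    for m in range(0, m_bits, 2):
        b_m = (b >> m) & 1
        b_next = (b >> (m + 1)) & 1
        digits.append(-2 * b_next + b_m + b_prev)
        b_prev = b_next
    return digits


def booth_multiply(a: int, b: int, m_bits: int) -> int:
    """Multiply by summing one partial product per Booth digit."""
    # All rows are computed from a and b only; none depends on another row.
    partial_products = [digit * a for digit in booth_radix4_digits(b, m_bits)]
    return sum(pp << (2 * k) for k, pp in enumerate(partial_products))
```

For an 8-b multiplier the recoding yields four digits, which corresponds to four partial product rows instead of eight.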

## A. Accurate Signed Partial Products Generation

Fig. 1 shows the configurations of the 6-input LUTs used for the implementation of the proposed accurate multiplier. The BE is implemented by LUT Type-A configuration, as shown in Fig. 1(a). It receives five inputs, i.e., $a_n$ and $a_{n-1}$ (from multiplicand) and $b_{m+1}$, $b_m$ and $b_{m-1}$ (from multiplier). The LUT internally implements three MUXes. Based on the value of BE, the first MUX (controlled by *s* signal) decides whether $a_n$ or $a_{n-1}$ should be forwarded for partial product generation. The second MUX, controlled by *c* signal, manages the inversion of the output of the first MUX. Finally, the third MUX can make the partial product zero depending upon the value of the *z* signal. This information is forwarded to the associated carry chain as the carry propagate signal "$p_{out}$." The input $a_n$ is used as the carry generate signal "$g_{out}$" for the carry chain.
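The three-MUX behaviour of LUT Type-A can be checked against Table I(a) with a short bit-level model. The sketch below is an assumption-laden simplification: it models only the select/invert/zero logic and the input carry, ignores the carry-chain signals $p_{out}$ and $g_{out}$, and widens each row by two bits (an arbitrary choice made here so that $\pm 2 \times$ the multiplicand fits). All names are illustrative.

```python
# Table I(a): (b_{m+1}, b_m, b_{m-1}) -> (BE, s, c, z)
BOOTH_TABLE = {
    (0, 0, 0): (0, 0, 0, 1),
    (0, 0, 1): (1, 0, 0, 0),
    (0, 1, 0): (1, 0, 0, 0),
    (0, 1, 1): (2, 1, 0, 0),
    (1, 0, 0): (-2, 1, 1, 0),
    (1, 0, 1): (-1, 0, 1, 0),
    (1, 1, 0): (-1, 0, 1, 0),
    (1, 1, 1): (0, 0, 0, 1),
}


def type_a_bit(a_n: int, a_n_minus_1: int, s: int, c: int, z: int) -> int:
    """One partial product bit from the three MUXes."""
    selected = a_n_minus_1 if s else a_n   # first MUX: 1x or 2x (shifted) bit
    inverted = selected ^ c                # second MUX: invert for negative BE
    return 0 if z else inverted            # third MUX: force zero when BE = 0


def partial_product_row(a: int, n_bits: int, triple: tuple) -> int:
    """Signed value of one row, including the input carry c."""
    _be, s, c, z = BOOTH_TABLE[triple]
    width = n_bits + 2  # assumed headroom for 2*a and the +1 of the negation

    def a_bit(i: int) -> int:
        # Bit i of the sign-extended multiplicand; a_{-1} is taken as 0.
        return 0 if i < 0 else (a >> i) & 1

    bits = 0
    for i in range(width):
        bits |= type_a_bit(a_bit(i), a_bit(i - 1), s, c, z) << i
    value = (bits + c) & ((1 << width) - 1)  # input carry completes 2's complement
    if value >> (width - 1):                 # reinterpret as a signed number
        value -= 1 << width
    return value
```

With these definitions, `partial_product_row(a, 8, triple)` equals BE times `a` for every 8-b multiplicand and every input triple of Table I(a).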

Bewick's SE technique for each partial product row is implemented by LUT Type-*B* and LUT Type-*C* configurations, as shown in Fig. 1(b) and (c), respectively. The LUT Type-*B* receives five inputs, i.e., $b_{m+1}$, $b_m$, and $b_{m-1}$ (from multiplier), $a_n$ (the MSB of the multiplicand), and $p_{in}$. The $p_{in}$ signal is constant "1" for the first row of partial products and for all other rows it is constant "0." The LUT computes the $\overline{SE}$ signal, performs the XOR operation on it and provides the result to the associated carry chain as the carry propagate signal $p_{out}$. The carry generate signal $g_{out}$ is directly provided by the $p_{in}$ signal. LUT Type-*C* is used to transfer the correct sign information of its respective partial product row to the following partial product row.
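The general idea behind avoiding full sign extension can be sketched as follows. The first function is one reading of Table I(b) in terms of the $c$ and $z$ signals of Table I(a). The second relies on the identity $-s \cdot 2^{w-1} = \overline{s} \cdot 2^{w-1} - 2^{w-1}$, so each row can carry its inverted sign bit while the remaining terms collapse into a constant known at design time. This is a sketch of the underlying arithmetic only; the exact bit placement of Bewick's scheme and the Type-B/Type-C LUT contents are not reproduced, and all names are hypothetical.

```python
def se_bit(msb: int, c: int, z: int) -> int:
    """SE as read from Table I(b): 0 for a zero row, else MSB XOR negate flag."""
    return 0 if z else msb ^ c


def sum_sign_extended(rows: list, out_bits: int) -> int:
    """Reference: add (signed value, left shift) rows with full sign extension."""
    return sum(value << shift for value, shift in rows) & ((1 << out_bits) - 1)


def sum_with_inverted_sign(rows: list, width: int, out_bits: int) -> int:
    """Same sum, but each row contributes only `width` bits plus a constant."""
    total = 0
    for value, shift in rows:
        low = value & ((1 << (width - 1)) - 1)   # bits below the sign position
        sign = (value >> (width - 1)) & 1        # sign bit of the row
        # Place the inverted sign bit instead of replicating the sign upward.
        total += (low | ((sign ^ 1) << (width - 1))) << shift
        # Compensating constant; it does not depend on the data.
        total -= 1 << (width - 1 + shift)
    return total & ((1 << out_bits) - 1)
```

Both functions return the same result for any rows whose values fit in `width` bits, which is why the data-dependent sign replication can be removed from the array.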

Utilizing LUTs of types *A*, *B*, and *C*, Fig. 2(a) shows the first row of partial products for an $8 \times 8$ multiplier. The rightmost LUT of Type-*A* in each partial product row is used for computing the required input carry. This input carry is applied for representing a partial product in 2's complement format. For an $8 \times 8$ multiplier, a total of four partial product rows will be generated. The last partial product row does not require an LUT of Type-*C*.

## B. Optimizing Critical Path Delay

For an $N \times M$ multiplier, the length of the carry chain in each partial product row is $N + 4$ bits. To improve the critical path delay of the multiplier, the length of the carry chain can be reduced to $N + 1$ bits. A critical path delay-optimized implementation of our novel multiplier is shown in Fig. 2(b). The partial product terms $pp_{(x,0)}$ and $pp_{(x,1)}$, in each partial product row, require one and two bits of the multiplicand, respectively. These two partial product terms can be implemented by one single 6-input LUT "A1." Similarly, $pp_{(x,2)}$, in each partial product row, can be independently implemented using another 6-input LUT "A2." A separate 6-input LUT, "CG," can be used to compute the correct input carry for each partial product row. Fig. 3 shows the internal configurations of LUT types A1, A2,

<sup>1</sup>Collective performance considering the area, delay, and energy.



Fig. 2. First partial product row for an  $8 \times 8$  multiplier. (a) First version of multiplier. (b) Optimized version of multiplier.



Fig. 3. Configuration of LUTs types A1, A2, and CG. (a) LUT A1. (b) LUT A2/CG.

and CG, respectively. LUT types A2 and CG only differ in the output signals $pp_{(x,2)}$ and $cg_{out}$. LUT type A2 utilizes the $pp_{(x,2)}$ signal solely, whereas LUT type CG uses the $cg_{out}$ signal exclusively. For an $N \times M$ multiplier, the number of LUTs required to generate partial products is $(N + 3) \times \lceil M/2 \rceil - 1$.
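As a worked evaluation of the LUT-count expression above, the following snippet computes $(N + 3) \times \lceil M/2 \rceil - 1$ for a few square sizes. The helper name is our own; the numbers cover partial product generation only and exclude the accumulation stage.

```python
import math


def partial_product_luts(n_bits: int, m_bits: int) -> int:
    """LUTs needed to generate all partial products of an N x M multiplier."""
    return (n_bits + 3) * math.ceil(m_bits / 2) - 1


for size in (4, 8, 16, 32):
    print(f"{size}x{size}: {partial_product_luts(size, size)} LUTs")
```

For example, an $8 \times 8$ multiplier has $\lceil 8/2 \rceil = 4$ rows of $8 + 3 = 11$ LUTs each, minus one, giving 43 LUTs.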

## C. Accumulation of Generated Partial Products

For the reduction of generated partial products to compute the final product, binary adders, ternary adders, and 4:2 compressors [15] can be utilized. A 4:2 compressor is capable of reducing four partial product rows to two output rows. During our experiments, we observed that the deployment of ternary adders [...]. However, they have higher critical path delays than binary adders. Therefore, in this letter, the 4:2 compressors and binary adders are used for the reduction of the generated partial products. We have used the 6-input LUTs and the associated carry chains to implement them.
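The row-reduction step can be modelled at word level. The sketch below builds a 4:2 compressor from two chained carry-save (3:2) stages, which is the textbook construction and not necessarily the LUT and carry-chain mapping of [15] used in this letter. Rows are assumed to be non-negative bit patterns (for signed rows, their two's complement patterns), and the function names are illustrative.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple:
    """3:2 stage: returns (sum, carry) with sum + carry == x + y + z."""
    partial_sum = x ^ y ^ z
    carry = ((x & y) | (x & z) | (y & z)) << 1
    return partial_sum, carry


def compress_4_to_2(r0: int, r1: int, r2: int, r3: int) -> tuple:
    """4:2 compressor: four rows in, two rows out, total preserved."""
    s1, c1 = carry_save_add(r0, r1, r2)
    return carry_save_add(s1, c1, r3)


def reduce_rows(rows: list) -> int:
    """Compress rows down to two, then finish with one binary addition."""
    rows = list(rows)
    while len(rows) > 2:
        if len(rows) >= 4:
            rows = rows[4:] + list(compress_4_to_2(*rows[:4]))
        else:
            rows = rows[3:] + list(carry_save_add(*rows[:3]))
    return sum(rows)  # final carry-propagate (e.g., ripple carry) addition
```

No carry propagates along a row inside the compressor stages; only the final two-row addition needs a carry-propagate adder.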

# III. RESULTS AND DISCUSSION


We have used VHDL for the RTL implementations of all presented multipliers. The proposed designs have been synthesized and implemented using Xilinx Vivado 17.4 for the Virtex-7 xc7v585tffg1157-3 FPGA (unless stated otherwise). Power values are estimated by the simulator and power analyzer tools provided by Vivado.

We have compared the implementation results of our proposed multiplier with Vivado's area/speed-optimized multiplier IPs [4], "R1" [8], "R5" [7], and "R7" [14].<sup>2</sup> Furthermore, the proposed design is also evaluated against the state-of-the-art approximate multipliers "R2" [17], "R3" [2], "R4" [18], and an $8 \times 8$ multiplier "R6" from [14].<sup>3</sup> For the



Fig. 4. Comparison of the proposed signed multipliers with the unsigned multipliers (without the signed–unsigned converters). The results are normalized to our proposed multipliers.

*unsigned* numbers-based architectures in "R1," "R2," "R3," "R4," and "R5," we have implemented signed–unsigned converters. To show a fair comparison, we have reported the performance results of the state-of-the-art multipliers with and without using these signed–unsigned converters.

## A. Implementation Results

Table II presents the resource consumption (LUTs), critical path delay (CPD), and energy-delay product (EDP) requirements of our proposed design and different state-of-the-art accurate and approximate multipliers. In the table, the results for "R2," "R3," "R4," "R5," and "R6" multipliers are inclusive of the signed–unsigned converters. For the "R6" multiplier, there is only one design point with the input bit-width of $8 \times 8$.

As shown in Table II, except for "R1" [8] and "R5" [7], our proposed multiplier always requires fewer LUTs than other state-of-the-art multipliers for different bit-widths. The "R1" and "R5" multipliers utilize sequential computation of partial products to obtain area gains at the cost of high critical path delays. The area savings offered by our designs increase with the size of the multiplier, up to 16% when compared with the Vivado $32 \times 32$ area/speed-optimized IP.

Our proposed multiplier provides higher performance than state-of-the-art accurate and approximate multipliers. For example, compared to the $8 \times 8$ Vivado speed-optimized multiplier IP, our multiplier reduces the critical path delay by 21%. "R1" and "R5" accurate multipliers have higher critical path delays among all presented multipliers.

The energy efficiency of the presented multiplier designs is characterized by the EDP as illustrated in Table II. It can be seen from the table that our proposed multipliers have better energy efficiency than the state-of-the-art across different sizes. For example, our $16 \times 16$ multiplier delivers a 23.6% reduction in EDP when it is compared against "R1."
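The percentage figures quoted in the preceding three paragraphs follow directly from the entries of Table II. The short check below recomputes them as $1 - \text{ours}/\text{baseline}$; the helper name is ours, and the operands are copied from the table.

```python
def reduction_percent(ours: float, baseline: float) -> float:
    """Relative saving of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (1.0 - ours / baseline)


# 32x32 LUTs: ours vs. Vivado area-optimized IP
lut_saving = reduction_percent(928, 1102)
# 8x8 CPD [ns]: ours vs. Vivado speed-optimized IP
cpd_saving = reduction_percent(2.80, 3.54)
# 16x16 EDP: ours vs. R1
edp_saving = reduction_percent(93.61, 122.53)

print(f"LUTs: {lut_saving:.1f}%  CPD: {cpd_saving:.1f}%  EDP: {edp_saving:.1f}%")
```

The LUT and CPD savings round to the 16% and 21% stated in the text, and the EDP saving matches the quoted 23.6%.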

To further elaborate on the efficacy of our proposed implementation, Table II shows the averages of the products of the normalized values of LUT utilization, CPD, and EDP (Average [Norm. LUTs × Norm. CPD × Norm. EDP]) across different sizes of multipliers. All individual performance metrics of each multiplier have been normalized with respect to the corresponding performance metrics of the Vivado area-optimized multiplier IP. Our proposed multiplier outperforms state-of-the-art implementations in the overall score.

We have also compared our proposed *signed* multipliers to the state-of-the-art accurate and approximate *unsigned* multipliers without deploying signed–unsigned converters. Fig. 4 presents the resource utilization, CPD, and EDP of the $8 \times 8$ and $16 \times 16$ proposed implementations and the "R2," "R3," "R4," "R5," and "R6" multipliers. These results have been normalized to the implementation results of our proposed implementations. Compared to the accurate $16 \times 16$ "R5" multiplier, our implementation provides 39.0% and 42.0% reduction in CPD and EDP, respectively.


<sup>2</sup>A generic and open-source implementation for every size of multiplier is not available. Signed multiplier "mul8s\_1KV8.v" from the library is used.

<sup>3</sup>For "R6," the approximate multiplier "mult\_000.v" from the library is used.

TABLE II IMPLEMENTATION RESULTS OF DIFFERENT MULTIPLIERS. "R2," "R3," "R4," "R5," AND "R6" MULTIPLIERS ARE IMPLEMENTED WITH THE SIGNED-UNSIGNED CONVERTERS. THE RESULTS WITH SHADING ARE THE LOWEST IN THEIR RESPECTIVE COLUMN

| Design                   | 4x4 LUTs | 4x4 CPD [ns] | 4x4 EDP [zJs] | 8x8 LUTs | 8x8 CPD [ns] | 8x8 EDP [zJs] | 16x16 LUTs | 16x16 CPD [ns] | 16x16 EDP [zJs] | 32x32 LUTs | 32x32 CPD [ns] | 32x32 EDP [zJs] | Average Performance |
|--------------------------|------|----------|-----------|------|----------|-----------|-------|----------|-----------|-------|----------|-----------|------------------------|
| Ours                     | 18   | 1.65     | 1.86      | 66   | 2.80     | 16.97     | 243   | 4.13     | 93.61     | 928   | 6.21     | 316.43    | 0.347                  |
| R1: Walters [8]          | 10   | 2.00     | 1.82      | 36   | 3.65     | 16.57     | 136   | 6.56     | 122.53    | 528   | 13.09    | 930.54    | 0.427                  |
| Vivado IP (Speed) [4]    | 18   | 2.14     | 2.27      | 74   | 3.54     | 20.29     | 286   | 4.27     | 146.62    | 1103  | 5.81     | 861.45    | 0.532                  |
| R2: Kulkarni [17]        | 20   | 2.12     | 2.17      | 86   | 4.89     | 36.31     | 330   | 6.59     | 134.39    | 1257  | 8.92     | 303.06    | 0.885                  |
| R3: Ullah [2]            | 22   | 3.34     | 4.57      | 81   | 5.19     | 38.48     | 296   | 7.33     | 136.14    | 1121  | 9.65     | 354.33    | 1.02                   |
| R4: Rehman [18]          | 18   | 2.23     | 2.25      | 92   | 4.99     | 35.43     | 404   | 7.03     | 156.87    | 1512  | 9.57     | 322.10    | 1.05                   |
| R5: Kumm [7]             | 24   | 3.84     | 9.55      | 73   | 6.08     | 58.95     | 217   | 9.52     | 313.56    | 700   | 16.25    | 1631.78   | 1.931                  |
| R6: Mrazek [14] Approx.  | -    | -        | -         | 121  | 4.21     | 27.18     | -     | -        | -         | -     | -        | -         | -                      |
| R7: Mrazek [14] Accurate | -    | -        | -         | 79   | 3.28     | 34.25     | -     | -        | -         | -     | -        | -         | -                      |
| Vivado IP (area) [4]     | 30   | 2.91     | 6.56      | 88   | 3.45     | 31.26     | 326   | 5.04     | 207.96    | 1102  | 6.79     | 1050.63   | 1                      |



Fig. 5. Results of the neural network use-case. The LUT resources and EDP obtained for designs with different multipliers are normalized to Vivado-area.

## B. Case Studies

1) *Artificial Neural Network's Inference With Small-Size Multiplier:* We have also applied our multiplier for the inference stage of a neural network [19]. The network is used for the classification of handwritten digits from the MNIST database. The inference accuracy of the ANN for 10 000 images with 8-b fixed-point numbers and our $8 \times 8$ multiplier is 96.67%. The original accuracy with 64-b numbers and multipliers is 97.09%. The loss in classification accuracy of 0.42% is negligible. If the network were implemented on the FPGA with our proposed accurate multiplier instead of Vivado's area-optimized multiplier, the estimated LUT saving is 17.5%.

2) *Artificial Neural Network's Inference Implementation on FPGA:* The target FPGA is the Xilinx Zynq UltraScale+ xczu3eg used in the Ultra96 evaluation platform. The network has one fully connected layer. Inside each neuron, besides the MAC unit, there are also activation and quantization modules. The activation function is ReLU. The quantization module converts the MAC results (which are represented in a wider bit width) back to the original fixed-point format.
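The neuron structure described above (MAC unit, ReLU activation, quantization back to the original format) can be sketched behaviourally. The details below are assumptions made for illustration and are not taken from the implementation in this letter: the number of fractional bits (`frac_bits`), truncation by right shift, and saturation on overflow are all hypothetical choices, as are the function names.

```python
def relu(x: int) -> int:
    """ReLU activation on an integer accumulator value."""
    return max(0, x)


def quantize(acc: int, frac_bits: int, out_bits: int = 8) -> int:
    """Bring a wide MAC result back to an out_bits-wide fixed-point value."""
    shifted = acc >> frac_bits                 # drop the extra fractional bits
    lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    return max(lo, min(hi, shifted))           # saturate (assumed behaviour)


def neuron(inputs: list, weights: list, bias: int, frac_bits: int = 4) -> int:
    """One fully connected neuron with 8-b signed inputs, weights, and bias."""
    acc = bias << frac_bits                    # align bias with the products
    for x, w in zip(inputs, weights):
        acc += x * w                           # signed 8x8 multiply-accumulate
    return quantize(relu(acc), frac_bits)
```

With `frac_bits = 4`, the value 16 represents 1.0, so `neuron([16], [32], 0)` computes 1.0 × 2.0 and returns 32, i.e., 2.0 in the same format.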

First, we implement the network with Vivado's speed-optimized multiplier with as many neurons as possible in three different input sizes, $8 \times 8$, $16 \times 16$, and $32 \times 32$. The timing constraint is 4 ns. After that, the same setups are applied for the Vivado-area multiplier, R6 [9], and ours. The results are presented in Fig. 5. In the combined LUT × EDP average across all input sizes, ours offers the best results. Our multiplier is 5%, 43%, and 37% better than Vivado-speed, Vivado-area, and R6 [9], respectively. While R6 [9] has the lowest LUT counts, its EDP is the worst among all when the input size increases. In comparison with Vivado-speed, ours is comparable in EDP but requires an average of 8% fewer LUTs. These results imply that with the same amount of fixed FPGA resources, more of our multipliers can be instantiated to further exploit the available parallelism of the application with only a slight increase in energy (if any). Our multiplier also fits well with various modern Xilinx FPGA architectures.

# IV. CONCLUSION

This letter presented a novel area-optimized, low-latency, and energy-efficient accurate signed multiplier architecture for FPGA-based systems. We have also evaluated the benefits of our multipliers in neural network applications. The RTL models of our designs will be released as an open-source library at https://cfaed.tu-dresden.de/pd-downloads.

# References

- [1] *7 Series DSP48E1 Slice*, document UG479, Xilinx, San Jose, CA, USA, 2018.
- [2] S. Ullah *et al.*, "Area-optimized low-latency approximate multipliers for FPGA-based hardware accelerators," in *Proc. ACM/ESDA/IEEE Design Autom. Conf. (DAC)*, 2018, pp. 1–6.
- [3] I. Kuon and J. Rose, "Measuring the gap between FPGAs and ASICs," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 26, no. 2, pp. 203–215, Feb. 2007.
- [4] *LogiCORE IP Multiplier v12.0*, document PG108, Xilinx, San Jose, CA, USA, 2015.
- [5] A. D. Booth, "A signed binary multiplication technique," *Quart. J. Mech. Appl. Math.*, vol. 4, no. 2, pp. 236–240, 1951.
- [6] C. R. Baugh and B. A. Wooley, "A two's complement parallel array multiplication algorithm," *IEEE Trans. Comput.*, vol. C-22, no. 12, pp. 1045–1047, Dec. 1973.
- [7] M. Kumm, S. Abbas, and P. Zipf, "An efficient softcore multiplier architecture for Xilinx FPGAs," in *Proc. IEEE Symp. Comput. Arithmetic (ARITH)*, 2015, pp. 18–25.
- [8] E. G. Walters, "Array multipliers for high throughput in Xilinx FPGAs with 6-input LUTs," *Computers*, vol. 5, no. 4, p. 20, 2016.
- [9] H. Parandeh-Afshar and P. Ienne, "Measuring and reducing the performance gap between embedded and soft multipliers on FPGAs," in *Proc. Int. Conf. Field Program. Logic Appl. (FPL)*, 2011, pp. 225–231.
- [10] *7 Series FPGAs Configurable Logic Block*, document UG474, Xilinx, San Jose, CA, USA, 2016.
- [11] H. Parandeh-Afshar, P. Brisk, and P. Ienne, "Exploiting fast carry-chains of FPGAs for designing compressor trees," in *Proc. Int. Conf. Field Program. Logic Appl. (FPL)*, 2009, pp. 242–249.
- [12] A. Kakacak, "Fast multiplier generator for FPGAs with LUT based partial product generation and column/row compression," *Integr. VLSI J.*, vol. 57, pp. 147–157, Mar. 2017.
- [13] M. Kumm, J. Kappauf, M. Istoan, and P. Zipf, "Resource optimal design of large multipliers for FPGAs," in *Proc. IEEE Symp. Comput. Arithmetic (ARITH)*, 2017, pp. 131–138.
- [14] V. Mrazek, R. Hrbacek, Z. Vasicek, and L. Sekanina, "EvoApprox8b: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods," in *Proc. Design Autom. Test Europe Conf. Exhibit. (DATE)*, 2017, pp. 258–261.
- [15] M. Kumm and P. Zipf, "Efficient high speed compression trees on Xilinx FPGAs," in *Proc. Methods Description Lang. Model. Verification Circuits Syst. (MBMV)*, 2014, pp. 171–182.
- [16] G. W. Bewick, "Fast multiplication: Algorithms and implementation," Ph.D. dissertation, Dept. Elect. Eng., Stanford Univ., Stanford, CA, USA, 1994.
- [17] P. Kulkarni, P. Gupta, and M. Ercegovac, "Trading accuracy for power with an underdesigned multiplier architecture," in *Proc. 24th Int. Conf. VLSI Design*, 2011, pp. 346–351.
- [18] S. Rehman, W. El-Harouni, M. Shafique, A. Kumar, and J. Henkel, "Architectural-space exploration of approximate multipliers," in *Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD)*, 2016, p. 80.
- [19] MNIST-cnn. (2016). [Online]. Available: https://github.com/integeruser/MNIST-cnn
