Multiplication is basically a shift add operation. There are, however, many variations on how to do it. Some are more suitable for FPGA use than others. This page is a brief tutorial on multiplication hardware. The hyperlinked items in this list are currently in the text. The remaining items will be added in a future release of this page.
Scaling Accumulator Multipliers
Parallel by serial algorithm | |
Iterative shift add routine | |
N clock cycles to complete | |
Very compact design | |
Serial input can be MSB or LSB first depending on direction of shift in accumulator | |
Parallel output |
1 1011001 0 0000000 1 1011001 1 +1011001 10010000101 |
Serial by Parallel Booth Multipliers
Bit serial adds eliminate need for carry chain | |||
Well suited for FPGAs without fast carry logic | |||
Serial input LSB first | |||
Serial output | |||
Routing is all nearest neighbor except serial input which is broadcast | |||
Latency is one bit time |
The simple serial by parallel booth multiplier is particularly well suited for bit serial processors implemented in FPGAs without carry chains because all of its routing is to nearest neighbors with the exception of the input. The serial input must be sign extended to a length equal to the sum of the lengths of the serial input and parallel input to avoid overflow, which means this multiplier takes more clocks to complete than the scaling accumulator version. This is the structure used in the venerable TTL serial by parallel multiplier.
Ripple Carry Array Multipliers
Row ripple form | |
Unrolled shift-add algorithm | |
Delay is proportional to N |
This basic structure is simple to implement in FPGAs, but does not make efficient use of the logic in many FPGAs, and is therefore larger and slower than other implementations.
Row Adder Tree Multipliers
Optimized Row Ripple Form | |
Fundamentally same gate count as row ripple form | |
Row Adders arranged in tree to reduce delay | |
Routing more difficult, but workable in most FPGAs | |
Delay proportional to log2(N) |
Carry Save Array Multipliers
Column ripple form | |
Fundamentally same delay and gate count as row ripple form | |
Gate level speed ups available for ASICs | |
Ripple adder can be replaced with faster carry tree adder | |
Regular routing pattern |
Look-Up Table (LUT) Multipliers
Complete times table of all possible input combinations | |
One address bit for each bit in each input | |
Table size grows exponentially | |
Very limited use | |
Fast - result is just a memory access away |
The following table is the contents for a 6 input LUT for a 3 bit by 3 bit multiplication table.
000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 | |
---|---|---|---|---|---|---|---|---|
000 | 000000 | 000000 | 000000 | 000000 | 000000 | 000000 | 000000 | 000000 |
001 | 000000 | 000001 | 000010 | 000011 | 000100 | 000101 | 000110 | 000111 |
010 | 000000 | 000010 | 000100 | 000110 | 001000 | 001010 | 001100 | 001110 |
011 | 000000 | 000011 | 000110 | 001001 | 001100 | 001111 | 010010 | 010101 |
100 | 000000 | 000100 | 001000 | 001100 | 010000 | 010100 | 011000 | 011100 |
101 | 000000 | 000101 | 001010 | 001111 | 010100 | 011001 | 011110 | 100011 |
110 | 000000 | 000110 | 001100 | 010010 | 011000 | 011110 | 100100 | 101010 |
111 | 000000 | 000111 | 001110 | 010101 | 011100 | 100011 | 101010 | 110001 |
Partial Product LUT Multipliers
Works like long hand multiplication | |
LUT used to obtain products of digits |
Partial products combined with adder tree |
Partial Products LUT multipliers use partial product techniques similar to those used in longhand multiplication (like you learned in 3rd grade) to extend the usefulness of LUT multiplication. Consider the long hand multiplication:
67 x 54 28 240 350 +3000 3618 | 67 x 54 28 240 350 +3000 3618 | 67 x 54 28 240 350 +3000 3618 | 67 x 54 28 240 350 +3000 3618 |
By performing the multiplication one digit at a time and then shifting and summing the individual partial products, the size of the memorized times table is greatly reduced. While this example is decimal, the technique works for any radix. The order in which the partial products are obtained or summed is not important. The proper weighting by shifting must be maintained however.
The example below shows how this technique is applied in hardware to obtain a 6x6 multiplier using the 3x3 LUT multiplier shown above. The LUT (which performs multiplication of a pair of octal digits) is duplicated so that all of the partial products are obtained simultaneously. The partial products are then shifted as needed and summed together. An adder tree is used to obtain the sum with minimum delay.
The LUT could be replaced by any other multiplier implementation, since LUT is being used as a multiplier. This gives the insight into how to combine multipliers of an arbitrary size to obtain a larger multiplier.
The LUT multipliers shown have matched radices (both inputs are octal). The partial products can also have mixed radices on the inputs provided care is taken to make sure the partial products are shifted properly before summing. Where the partial products are obtained with small LUTs, the most efficient implementation occurs when LUT is square (ie the input radices are the same). For 8 bit LUTs, such as might be found in an Altera 10K FPGA, this means the LUT radix is hexadecimal. For 4 bit LUTs, found in many FPGA logic cells, the ideal radix is 2 bits (This is really the only option for a 4 LUT: a 1 bit input reduces the LUT to an AND gate, and since each LUT cell has 1 output, it can only use one bit on the other input).
A more compact but slower version is possible by computing the partial products sequentially using one LUT and accumulating the results in a scaling accumulator. Note that in this case, the shifter would need a special control to obtain the proper shift on all the partials
Computed Partial Product Multipliers
Partial product optimization for FPGAs having small LUTs | |
Fewer partial products decrease depth of adder tree | |
2 x n bit partial products generated by logic rather than LUT | |
Smaller and faster than 4 LUT partial product multipliers |
2 x n bit partial product generated with adder and multiplexer
The Xilinx Virtex device includes an extra gate in the carry chain logic that allows a 4 input LUT plus the carry chain to perform a 2xN partial product, thereby achieving twice the density otherwise attainable. In this case, the adder (consisting of the XOR gates and muxes in the carry chain) adds a pair of 1xN partial products obtained with AND gates. The extra AND gate in the carry logic allows you to put AND gates on both inputs of the adder while maintaining a 4 input function.
2 x n bit computed partial product implemented in Xilinx Virtex using special MULTAND gate in carry chain logic
Constant Coefficient Multipliers
Multiplies input by a constant | |
LUT contains custom times table | |
Width of constants do not affect depth of adder tree | |
All LUT inputs available for multiplicand | |
More efficient than full multiplier | |
Size is constant regardless of value of constant (assuming equal constant bit widths) |
5 bit input * 67 | ||||
input | 00 | 01 | 10 | 11 |
---|---|---|---|---|
000 | 0 | 536 | 1072 | 1608 |
001 | 67 | 603 | 1139 | 1675 |
010 | 134 | 670 | 1206 | 1742 |
011 | 201 | 737 | 1273 | 1809 |
100 | 268 | 804 | 1340 | 1876 |
101 | 335 | 871 | 1407 | 1943 |
110 | 402 | 938 | 1474 | 2010 |
111 | 469 | 1005 | 1541 | 2077 |
Limited Set LUT Multipliers
Multiplies input by one of a small set of constants | |
Similar to KCM multiplier | |
LUT input bit(s) select which constant to use | |
Useful in modulators, other signal processing applications |
5 bit input * 67 | 5 bit input * 85 | |||||||
000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 | |
---|---|---|---|---|---|---|---|---|
000 | 0 | 536 | 1072 | 1608 | 0 | 680 | 1360 | 2040 |
001 | 67 | 603 | 1139 | 1675 | 85 | 765 | 1445 | 2125 |
010 | 134 | 670 | 1206 | 1742 | 170 | 850 | 1530 | 2210 |
011 | 201 | 737 | 1273 | 1809 | 255 | 935 | 1615 | 2295 |
100 | 268 | 804 | 1340 | 1876 | 340 | 1020 | 1700 | 2380 |
101 | 335 | 871 | 1407 | 1943 | 425 | 1105 | 1785 | 2465 |
110 | 402 | 938 | 1474 | 2010 | 510 | 1190 | 1870 | 2550 |
111 | 469 | 1005 | 1541 | 2077 | 595 | 1275 | 1955 | 2635 |
Constant Multipliers from Adders
Adder for each '1' bit in constant | |
Subtractor replaces strings of '1' bits using Booth recoding | |
Efficiency, size depend on value of constant | |
KCM multipliers are usually more efficient for arbitrary constant values |
0 0000000 1 1011001 0 0000000 1 +1011001 1101111010 |
0 0000000 1 1011001 1 1011001 1 +1011001 10011011110 | 0 0000000 -1 1110100111 0 0000000 0 0000000 1 +1011001 10011011110 |
Clearly, the complexity of a constant multiplier constructed from adders is dependent upon the constant. For an arbitrary constant, the KCM multiplier discussed above is a better choice. For certain 'quick and dirty' scaling applications, this multiplier works nicely.
Wallace Trees
Optimized column adder tree | |
Combines all partial products into 2 vectors (carry and sum) | |
Carry and sum outputs combined using a conventional adder | |
Delay is log(n) | |
Wallace tree multiplier uses Wallace tree to combine 1 x n partial products | |
Irregular routing | |
Not optimum in many FPGAs |
A section of an 8 input wallace tree. The wallace tree combines the 8 partial product inputs to two output vectors corresponding to a sum and a carry. A conventional adder is used to combine these outputs to obtain the complete product..
If you trace the bits in the tree (the tree in the illustration is color coded to help in this regard), you will find that the Wallace tree is a tree of carry-save adders arranged as shown to the left. A carry save adder consists of full adders like the more familiar ripple adders, but the carry output from each bit is brought out to form second result vector rather being than wired to the next most significant bit. The carry vector is 'saved' to be combined with the sum later, hence the carry-save moniker.
A Wallace tree multiplier is one that uses a Wallace tree to combine the partial products from a field of 1x n multipliers (made of AND gates). It turns out that the number of Carry Save Adders in a Wallace tree multiplier is exactly the same as used in the carry save version of the array multiplier. The Wallace tree rearranges the wiring however, so that the partial product bits with the longest delays are wired closer to the root of the tree. This changes the delay characteristic from o(n+n) to o(n+log(n)) at no gate cost. Unfortunately the nice regular routing of the array multiplier is also replaced with a rats-nest.
A Wallace tree by itself offers no performance advantage over a ripple adder tree
To the casual observer, it may appear the propagation delay though a ripple adder tree is the carry propagation multiplied by the number of levels or o(n*log(n)). In fact, the ripple adder tree delay is really only o(n + log(n)) because the delays through the adder's carry chains overlap. This becomes obvious if you consider that the value of a bit can only affect bits of the same or higher significance further down the tree. The worst case delay is then from the LSB input to the MSB output (and disregarding routing delays is the same no matter which path is taken). The depth of the ripple tree is log(n), which is the about same as the depth of the Wallace tree. This means is that the ripple carry adder tree's delay characteristic is similar to that of a Wallace tree followed by a ripple adder! If an adder with a faster carry tree scheme is used to sum the Wallace tree outputs, the result is faster than a ripple adder tree. The fast carry tree schemes use more gates than the equivalent ripple carry structure, so the Wallace tree normally winds up being faster than a ripple adder tree, and less logic than an adder tree constructed of fast carry tree adders. An ASIC implementation may also benefit from some 'go faster' tricks in carry save adders, so a Wallace tree combined with a fast adder can offer a significant advantage there.A Wallace tree is often slower than a ripple adder tree in an FPGA
Many FPGAs have a highly optimized ripple carry chain connection. Regular logic connections are several times slower than the optimized carry chain, making it nearly impossible to improve on the performance of the ripple carry adders for reasonable data widths (at least 16 bits). Even in FPGAs without optimized carry chains, the delays caused by the complex routing can overshadow any gains attributed to the Wallace tree structure. For this reason, Wallace trees do not provide any advantage over ripple adder trees in many FPGAs. In fact due to the irregular routing, they may actually be slower and are certainly more difficult to route.Booth Recoding
Booth recoding is a method of reducing the number of partial products to be summed. Booth observed that when strings of '1' bits occur in the multiplicand the number of partial products can be reduced by using subtraction. For example the multiplication of 89 by 15 shown below has four 1xn partial products that must be summed. This is equivalent to the subtraction shown in the right panel.1 1011001 1 1011001 1 1011001 1 1011001 0 +0000000 10100110111 | 1 -1011001 1 0000000 1 0000000 1 0000000 0 +1011001 10100110111 |
1 comments:
At Free Bitcoin you may get free satoshis. 8 to 22 satoshis per 5 mins.
Post a Comment