Extreme Hardware Speed in Large Block Mixing Ciphers

The goal of this exercise is to get some idea of the chip area required to encipher a large block with Mixing cipher technology.

An interesting consequence of ciphering a large block is the
possibility of extreme hardware speed. Although we expect hardware
to be fast, in some cases, throughput can *increase* with block
size, which may be unexpected. After all, in a block cipher,
it is necessary to in some way "mix" the various parts of the block,
and it seems that a larger block *must* need more mixing.
It does, but certain mixing constructions lend themselves to parallel
hardware which can track block size increases.

Over a period of several years I have developed Balanced Block Mixing (BBM) structures for invertible ciphering which offer arguably "perfect" mixing across wide blocks. Byte-size BBM's can be used in FFT-like patterns to mix huge blocks with complexity n log n.

In a Mixing cipher, each of the BBM operations is independent, and uses and updates two values which are not simultaneously in use by other BBM's. This means that we can realize each BBM in separate hardware, and have them function in parallel. We can push this to unexpected heights.

To get a general idea about how much of a chip the design would consume, we can look at a few chip examples in current production:

- Static-memory devices typically lag dynamic memory sizing by somewhat more than a generation. Current Motorola SRAM chips include a range of 4mb (4 megabit) designs.
- Individual customizable gates are larger than static RAM cells. As data points, Altera has a 250k gate FPGA, and the average IBM ASIC is said to have been about 260k gates in 1996.

One might thus conclude that "a gate" is as large as 16 static RAM cells, on average (4 megabits per 250k gates), which seems a little high. But the actual value does not matter much here, because the design is heavily dominated by RAM anyway.

The mixing cipher described here has five layers: fencing (a row of substitution tables), Balanced Block Mixing, fencing, mixing, and a final fencing. This is a conceptually simple cipher.

In a mixing cipher, the tables constitute primary keying and are set up by an external subsystem prior to ciphering.

The Mixing cipher uses Balanced Block Mixer (BBM) components which realize:

a = 3x + 2y (mod 2, mod p)
b = 2x + 3y (mod 2, mod p)

This can be implemented as:

- q := XOR( x, y );
- a delayless shift of q one bit left;
- a conditional on the bit shifted out: r := q, or r := XOR( q, p ); then
- a := XOR( r, x ) and b := XOR( r, y ).
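As a minimal software sketch, the steps above can be coded directly. The particular field polynomial P (here 0x11B) is an assumption for illustration only; any irreducible degree-8 polynomial will do:

```python
P = 0x11B  # assumed irreducible polynomial x^8+x^4+x^3+x+1 (illustrative)

def bbm(x, y):
    """One byte-wide BBM: (a, b) = (3x + 2y, 2x + 3y) (mod 2, mod P)."""
    q = x ^ y                # q := XOR(x, y)
    r = q << 1               # shift one bit left (times 2)
    if r & 0x100:            # conditional on the bit shifted out
        r ^= P               # r := XOR(q, P)
    return r ^ x, r ^ y      # a := XOR(r, x), b := XOR(r, y)

def bbm_inverse(a, b):
    """Recover (x, y): a XOR b equals x XOR y, so r can be recomputed."""
    r = (a ^ b) << 1
    if r & 0x100:
        r ^= P
    return r ^ a, r ^ b
```

Because a XOR b reproduces x XOR y, the same r can be recomputed from the output pair, which is what makes the mixing invertible; and with either input held fixed, each output is a permutation of the other input, which is the "balance" property.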

These byte-wide mixings are applied across the 64 bytes to be mixed, 32 pairs at a time, in FFT-like patterns. Note that each of the BBM operations is separate and independent hardware, functioning in parallel with other hardware. After 6 mixing sub-layers, every one of the 64 input bytes is equally represented in each of the 64 output bytes.
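The FFT-like pattern can be sketched as follows. The exact butterfly pairing order is an assumption for illustration (the text fixes only the shape: 6 sub-layers of 32 pairs each), as is the polynomial P. The property claimed above is testable: a change to any single input byte reaches all 64 output bytes:

```python
P = 0x11B  # assumed field polynomial (illustrative)

def bbm(x, y):
    q = (x ^ y) << 1
    if q & 0x100:
        q ^= P
    return q ^ x, q ^ y

def mix64(block):
    """One mixing layer: 6 FFT-like sub-layers of 32 byte-wide BBMs each."""
    b = list(block)
    dist = 32
    while dist >= 1:                      # sub-layers at dist 32,16,8,4,2,1
        for base in range(0, 64, 2 * dist):
            for i in range(base, base + dist):
                b[i], b[i + dist] = bbm(b[i], b[i + dist])
        dist //= 2
    return b
```

Each sub-layer doubles the number of bytes affected by a one-byte change: 2, 4, 8, 16, 32, and finally all 64.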

Consider the chip construction of a 64-byte mixing cipher: we will have fencing, mixing, fencing, mixing, and fencing. With a 64-byte-wide block, we will have 64 byte-wide tables across the block, and three substitution layers, for a total of 192 byte-substitution tables. This represents 48K bytes or 384kb (bits) of RAM, which is just under 10 percent of a current generation 4mb static RAM.
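As one way to see the whole construction at once, here is a hedged end-to-end sketch in software. The random permutation tables merely stand in for real keying, and P = 0x11B is again just an illustrative polynomial. Note that because the 3x+2y / 2x+3y matrix is its own inverse, deciphering reuses the same BBM, with the sub-layers applied in reverse order:

```python
import random

P = 0x11B  # assumed field polynomial (illustrative)

def bbm(x, y):
    # a = 3x + 2y, b = 2x + 3y (mod 2, mod P); this matrix is self-inverse.
    q = (x ^ y) << 1
    if q & 0x100:
        q ^= P
    return q ^ x, q ^ y

def mix(b, forward=True):
    # One mixing layer: 6 FFT-like sub-layers of 32 BBMs each.
    b = list(b)
    dists = (32, 16, 8, 4, 2, 1)
    for d in (dists if forward else dists[::-1]):
        for base in range(0, 64, 2 * d):
            for i in range(base, base + d):
                b[i], b[i + d] = bbm(b[i], b[i + d])
    return b

def make_tables(rng):
    # 3 fencing layers x 64 tables x 256 bytes = 48K bytes of keyed state.
    return [[rng.sample(range(256), 256) for _ in range(64)]
            for _ in range(3)]

def encipher(block, tables):
    b = [tables[0][i][v] for i, v in enumerate(block)]   # fencing
    b = mix(b)                                           # mixing
    b = [tables[1][i][v] for i, v in enumerate(b)]       # fencing
    b = mix(b)                                           # mixing
    return [tables[2][i][v] for i, v in enumerate(b)]    # fencing

def decipher(block, tables):
    inv = [[{v: i for i, v in enumerate(t)} for t in layer]
           for layer in tables]
    b = [inv[2][i][v] for i, v in enumerate(block)]
    b = mix(b, forward=False)
    b = [inv[1][i][v] for i, v in enumerate(b)]
    b = mix(b, forward=False)
    return [inv[0][i][v] for i, v in enumerate(b)]
```

In hardware each fencing table and each BBM is a separate parallel component; the sequential loops here are only a software stand-in for that structure.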

Note that each RAM generation is four times the size of the previous generation, and we can probably expect two such generations by year 2000. If so, we will see a 64mb static RAM chip, and the "aggressive" 384kb mixing design described here would cover only about 1/2 of 1 percent of such a chip.

There are two mixing layers in this mixing cipher, and in each we have 6 FFT-like sub-layers of 32 BBM's each. Thus, we need 32 * 6 * 2 = 384 BBM's for a total of about 12k gates. But note that the 384k RAM elements dominate the 12k gates by a factor of 32, so in this "back of an envelope" analysis we can ignore the gates with little error.
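The counts in this paragraph are simple arithmetic, sketched here; the 32-gates-per-BBM figure is the text's own estimate:

```python
# Component counts for the aggressive 64-byte design.
bbms = 32 * 6 * 2            # 32 BBMs x 6 sub-layers x 2 mixing layers
gates = bbms * 32            # assuming roughly 32 gates per byte-wide BBM
ram_bits = 192 * 256 * 8     # 192 tables x 256 entries x 8 bits

print(bbms)                  # 384 BBMs
print(gates)                 # 12288, i.e. about 12k gates
print(ram_bits // gates)     # 32: RAM dominates the gates by a factor of 32
```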

The 64 individual byte substitutions in each fencing layer each function in parallel, as do the 32 BBM's in each mixing sub-layer. In total, we have 15 separate layers of components which function in parallel. We can group and pipeline as many stages as we need (at almost no cost) to make our clock rate, which might be 200MHz.

This would allow us to insert a new 64-byte block on every
clock. Thus, the projected ciphering rate is 64 bytes per
clock, times 8 bits per byte, times 200,000,000 clocks per
second, for a total of 102,400,000,000 bits per second.
This is 102 * 10**9 b/sec; in the US this is called 102
*billion* bits per second (12.8 GB/sec).
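The rate arithmetic above, spelled out; the 200MHz figure is the pipeline clock assumed in the text:

```python
# Projected ciphering rate: one 64-byte block accepted per clock.
block_bytes = 64
clock_hz = 200_000_000             # assumed pipeline clock from the text

bits_per_sec = block_bytes * 8 * clock_hz
print(bits_per_sec)                # 102400000000 b/sec, i.e. 102.4 * 10**9
print(bits_per_sec // 8)           # 12800000000 bytes/sec = 12.8 GB/sec
```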

Getting data on and off a chip is a well known issue now being addressed in other contexts. For example, nine DRAM vendors are establishing a standard for a next-generation DRAM architecture to keep up with continued massive improvements in CPU performance. Both the "SyncLink" and Rambus approaches are expected to deliver 800 MB/sec in 1997-98, and Rambus expects to deliver 3-4 GB/sec by 2001. Staying beyond this is good: we would clearly like our ciphering structures to remain the non-limiting element somewhat beyond this near date.

This design sketch shows the simple construction and massive performance which result from an aggressive realization. Here, a 384kb RAM area investment ciphers 64-byte-wide blocks at 12.8 GB/sec.

But a wide variety of less aggressive design strategies are obviously available as well. In particular, the three separate layers of substitution could be reduced to one 128kb layer used 3 times, at some loss of strength. Similarly, the two mixings could be reduced to one 192-BBM mixing, used twice. The resulting 128kb cipher (about 3 percent of a 4mb static RAM chip) is one-third the size of the larger construction, but is more complex and ciphers at less than one-third the larger rate.

Continuing the theme of smaller designs, within a mixing, the 6 sublayers might be reduced to a single sublayer of just 32 BBM's, used 6 times. Ultimately, we could re-use a single 256-byte substitution across the block, and a single 32-gate BBM as well, although such a design seems pretty minimal.

Since a 64-byte block generally requires no chaining, it also supports independent block ciphering in multiple hardware units. This allows us to easily multiply the throughput of one large-block device (in whatever realization we choose) by whatever amount of hardware we can afford. We might, for example, choose a low component count realization, and then replicate that to make our data rate.

In a very similar way, additional speed is available in even
*wider* blocks, provided only that correspondingly more devices
(and, to a lesser extent, more mixing sub-layers) are used. These
devices do not have to switch faster to produce greater throughput.
Thus, in a mixing construction, *wider* blocks can actually
provide linearly *greater* throughput (with a modest increase
in latency), as well as greater strength.

It is even possible to think about a low-end "smart card" mixing cipher realization.

*Last updated:* 1997-03-27