Bit Counting and Similar Instructions

A Ciphers By Ritter Page

Population count and binary logarithm comments.
1998-12-03 Doug Moore: "I'd like to know which modern architectures support instructions for any of the following (somewhat related) functions:
- Count the number of 1 bits in a word
- Identify the position of most significant 1 bit in a word.
- Identify the position of the least significant 1 bit in a word.
- Report the binary logarithm of an integer."
1998-12-04 Vincent Lefevre: "... [the position of the least significant 1 bit in a word] can be done with a few instructions... but you'll have the position as a power of 2."
1998-12-04 Michael Williams: "David Seal came up with a neat algoritm for converting such a value to the index...."
1998-12-11 John Reiser: "Also remember elementary number theory...."
1998-12-11 Nick Maclaren: "...that make a lot of assumptions about the representation."
1998-12-4 Peter L. Montgomery: "Here is a partial list, old and new architectures...." "The population count, least significant 1 bit, and binary logarithm should be added to Fortran, C and other programming languages."
1998-12-04 Nick Maclaren: "There is absolutely no difficulty in adding these as Fortran intrinsic functions, C library macro/functions etc....."
1998-12-04 Herman Rubin: "What is wanted is to read a bit stream from a position given by a bit pointer, find the distance to the next 1 in the stream...."
1998-12-05 Terje Mathisen: "The partial tools already exists, and should be plenty good enough to do this _very_ quickly."
1998-12-6 Derek Gladding: "The ARC (http://www.risccores.com) has the most-significant 1 operation...."
1998-12-05 Terje Mathisen: "With N=8, each lookup will locate 0 to 8 1 bits...."
1998-12-08 Herman Rubin: "It is not a matter of size, but of speed."
1998-12-09 Terje Mathisen: "...I'll see if I can find a way to optimize it...."
1999-01-15 Suresh Kadiyala: "I think PPC has the above two."
1999-01-18 Marc Daumas: "I was told that this operation was regarded as very usefull to break cryptographic codes and thereafter it was ommitted on most architectures so that there willl be no restriction to export."
1999-01-18 Herman Rubin: "One could make this argument about any bit instruction."
1999-01-19 Christian Bau: "There is no single instruction to count bits in a word on the PowerPC, only an instruction to count the number of leading zero bits."
1999-01-19 Terje Mathisen: "A Pentium would be similar...."
1999-01-19 Nick Maclaren: "Most machines are similar to the above."
1999-01-19 Alex Rosenberg: "The permute operation is actually a 5-bit lookup and the 4-bit table can be specified twice to make the fifth bit indifferent."
1999-01-20 Donald Gillies:
1999-01-21 Terje Mathisen: "To really make this version efficient, you need to do the bitcount across an array...." "Robert Harley posted 64-bit Alpha code to do this which ran in (much?) less than a cycle/byte, I have written an MMX asm version which gets similar performance."


Subject: Bit counting and similar instructions
Date: 3 Dec 1998 22:43:04 GMT
From: Doug Moore <dougm@farkas.caam.rice.edu>
Message-ID: <74745o$qi$1@joe.rice.edu>
Newsgroups: comp.arch.arithmetic
Lines: 13

I'd like to know which modern architectures support instructions for
any of the following (somewhat related) functions:

Count the number of 1 bits in a word
Identify the position of most significant 1 bit in a word.
Identify the position of the least significant 1 bit in a word.
Report the binary logarithm of an integer.

Of course, these are unusual instructions since they would hard to
generate from most programming languages.

Doug Moore
(dougm@caam.rice.edu)


Subject: Re: Bit counting and similar instructions
Date: 4 Dec 1998 01:27:07 GMT
From: Vincent Lefevre <Vincent.Lefevre@ens-lyon.fr>
Message-ID: <747dpb$14b@cri.ens-lyon.fr>
References: <74745o$qi$1@joe.rice.edu>
Newsgroups: comp.arch.arithmetic
Lines: 19

In article <74745o$qi$1@joe.rice.edu>,
  Doug Moore <dougm@farkas.caam.rice.edu> wrote:

> Identify the position of most significant 1 bit in a word.

The ARM10 will have such an instruction (CLZ) [ARMv5T architecture].
See http://www.eet.com/story/OEG19981015S0019

> Identify the position of the least significant 1 bit in a word.

This can be done with a few instructions (bitwise logical operations
on x and x-1), but you'll have the position as a power of 2 (not the
number of 0's after the least significant 1).

-- 
Vincent Lefevre <Vincent.Lefevre@ens-lyon.fr> - PhD stud. in Computer Science
Web: http://www.ens-lyon.fr/~vlefevre/ - 100% validated HTML - Acorn Risc PC,
Yellow Pig 17, Championnat International des Jeux Mathematiques et Logiques,
TETRHEX, Faits divers insolites, etc...


Subject: Re: Bit counting and similar instructions
Date: 4 Dec 1998 08:33:18 GMT
From: michael.williams@arm-sponge.com (Michael Williams)
Message-ID: <7486oe$jua@sis.cambridge.arm.com>
References: <747dpb$14b@cri.ens-lyon.fr>
Newsgroups: comp.arch.arithmetic
Lines: 53

In article <747dpb$14b@cri.ens-lyon.fr>,
Vincent Lefevre  <vlefevre+news@ens-lyon.fr> wrote:
>> Identify the position of the least significant 1 bit in a word.
>
>This can be done with a few instructions (bitwise logical operations
>on x and x-1), but you'll have the position as a power of 2 (not the
>number of 0's after the least significant 1).

David Seal came up with a neat algoritm for converting such a value to
the index, many moons ago. The message ID is <32975@armltd.uucp>, but
this dates from early 1994, so may be hard to find. I replicate it
here:

(Apologies to non-ARM-coders. I hope you get the gist.)

In article <32975@armltd.uucp> dseal@armltd.co.uk (David Seal) writes:
>I produced a better variant of the same theme last night:
>
>  ; Operand in R0, register R1 is free, R2 addresses a byte table 'Table'
>
>    RSB     R1,R0,#0            ;Standard trick to isolate bottom bit in R1,
>    AND     R1,R1,R0            ; or produce zero in R1 if R0 = 0.
>
>    ORR     R1,R1,R1,LSL #4     ;If R1=X with 0 or 1 bits set, R0 = X * &11
>    ORR     R1,R1,R1,LSL #6     ;R0 = X * &451
>    RSB     R1,R1,R1,LSL #16    ;R1 = X * &0450FBAF
>
>    LDRB    R1,[R2,R1,LSR #26]  ;Look up table entry indexed by top 6 bits
>                                ; of R1
>
>  ; Result in R1

If you're interested, Table looks like this:

Table DATA
        DCB      0x20,0x00,0x01,0x0c
        DCB      0x02,0x06,0xff,0x0d
        DCB      0x03,0xff,0x07,0xff
        DCB      0xff,0xff,0xff,0x0e
        DCB      0x0a,0x04,0xff,0xff
        DCB      0x08,0xff,0xff,0x19
        DCB      0xff,0xff,0xff,0xff
        DCB      0xff,0x15,0x1b,0x0f
        DCB      0x1f,0x0b,0x05,0xff
        DCB      0xff,0xff,0xff,0xff
        DCB      0x09,0xff,0xff,0x18
        DCB      0xff,0xff,0x14,0x1a
        DCB      0x1e,0xff,0xff,0xff
        DCB      0x17,0x17,0xff,0x13
        DCB      0x1d,0xff,0x16,0x12
        DCB      0x1c,0x11,0x10,0xff

Mike.


Subject: Re: Bit counting and similar instructions
Date: Fri, 11 Dec 1998 06:38:35 GMT
From: jreiser@teleport.com (John Reiser)
Message-ID: <3670ba1d.5981552@news.teleport.com>
References: <747dpb$14b@cri.ens-lyon.fr>
Newsgroups: comp.arch.arithmetic
Lines: 7

Also remember elementary number theory:  log2modp[ (x & -x) % p] where
p is a prime larger than the word size and having 2 as a generator of
the multiplicative group of units modulo p.  log2modp is a
pre-computed table of logarithms.  For 32 bits, the smallest such
prime is p = 37.

jreiser@teleport.com


Subject: Re: Bit counting and similar instructions
Date: 11 Dec 1998 09:42:49 GMT
From: nmm1@cus.cam.ac.uk (Nick Maclaren)
Message-ID: <74qpep$ele$1@pegasus.csx.cam.ac.uk>
References: <3670ba1d.5981552@news.teleport.com>
Newsgroups: comp.arch.arithmetic
Lines: 21

In article <3670ba1d.5981552@news.teleport.com>,
John Reiser <jreiser@teleport.com> wrote:
>Also remember elementary number theory:  log2modp[ (x & -x) % p] where
>p is a prime larger than the word size and having 2 as a generator of
>the multiplicative group of units modulo p.  log2modp is a
>pre-computed table of logarithms.  For 32 bits, the smallest such
>prime is p = 37.


Hmm.  Elementary?  I shall have to think on't :-)

In any case, that make a lot of assumptions about the representation.
"(x & -x)" is not well-defined in general.


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QG, England.
Email:  nmm1@cam.ac.uk
Tel.:  +44 1223 334761    Fax:  +44 1223 334679


Subject: Re: Bit counting and similar instructions
Date: Fri, 4 Dec 1998 08:09:04 GMT
From: pmontgom@cwi.nl (Peter L. Montgomery)
Message-ID: <F3FLB4.K45@cwi.nl>
References: <74745o$qi$1@joe.rice.edu>
Newsgroups: comp.arch.arithmetic
Lines: 45

In article <74745o$qi$1@joe.rice.edu> 
Doug Moore <dougm@farkas.caam.rice.edu> writes:
>I'd like to know which modern architectures support instructions for
>any of the following (somewhat related) functions:

>Count the number of 1 bits in a word
>Identify the position of most significant 1 bit in a word.
>Identify the position of the least significant 1 bit in a word.
>Report the binary logarithm of an integer.


Here is a partial list, old and new architectures:

         Population count:

               Cray, CDC Cyber series, Alliant FX/80.
               UltraSPARC, Alpha 21264

               Intel x86 has the parity of a byte

         Most significant 1 bit (equivalent to binary logarithm)

               Cray, Alpha 21264, Motorola 68020, Intel x86
               CDC Cyber series had this if 0 < arg < 2^48

         Least significant 1 bit:

               Alpha 21264, Intel x86
               Available as POPULATION_COUNT((x-1) & ~x) if x != 0

>Of course, these are unusual instructions since they would hard to
>generate from most programming languages.
>
     The population count, least significant 1 bit, and binary
logarithm should be added to Fortran, C and other programming languages.
All three should operate on unsigned operands, with the last two
functions being undefined when the argument is zero.
A binary logarithm function is preferred to a leading zero count 
function since (for example) BINARY_LOGARITHM(123) will be 6
on both 32-bit and 64-bit machines, whereas LEADING_ZERO_COUNT(123) 
will be 25 on 32-bit machines but 57 on 64-bit machines.
-- 
        Peter-Lawrence.Montgomery@cwi.nl    San Rafael, California
The bridge to the 21st century is being designed for the private autombile.
The bridge to the 22nd century will forbid private automobiles.


Subject: Re: Bit counting and similar instructions
Date: 4 Dec 1998 10:13:33 GMT
From: nmm1@cus.cam.ac.uk (Nick Maclaren)
Message-ID: <748ckd$pkk$1@pegasus.csx.cam.ac.uk>
References: <F3FLB4.K45@cwi.nl>
Newsgroups: comp.arch.arithmetic
Lines: 37

In article <F3FLB4.K45@cwi.nl>, Peter L. Montgomery <pmontgom@cwi.nl> wrote:
>In article <74745o$qi$1@joe.rice.edu> 
>Doug Moore <dougm@farkas.caam.rice.edu> writes:
>>I'd like to know which modern architectures support instructions for
>>any of the following (somewhat related) functions:
>
>>Count the number of 1 bits in a word
>>Identify the position of most significant 1 bit in a word.
>>Identify the position of the least significant 1 bit in a word.
>>Report the binary logarithm of an integer.
>
>>Of course, these are unusual instructions since they would hard to
>>generate from most programming languages.
>>
>     The population count, least significant 1 bit, and binary
>logarithm should be added to Fortran, C and other programming languages.

There is absolutely no difficulty in adding these as Fortran intrinsic
functions, C library macro/functions etc., and a compiler generating
inline code and using hardware support where appropriate.  It is
precisely how the absolute value is supported.

C9X will not welcome such suggestions, but there might be a chance
the next Fortran update.

My experience is that I want the most significant bit 10 times for
every time I want one of the others, but I suspect that is because of
the sort of algorithms I look at rather than any inherent importance
of the operation.


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QG, England.
Email:  nmm1@cam.ac.uk
Tel.:  +44 1223 334761    Fax:  +44 1223 334679


Subject: Re: Bit counting and similar instructions
Date: 4 Dec 1998 11:11:26 -0500
From: hrubin@b.stat.purdue.edu (Herman Rubin)
Message-ID: <7491je$hmk@b.stat.purdue.edu>
References: <74745o$qi$1@joe.rice.edu>
Newsgroups: comp.arch.arithmetic
Lines: 29

In article <74745o$qi$1@joe.rice.edu>,
Doug Moore  <dougm@farkas.caam.rice.edu> wrote:
>I'd like to know which modern architectures support instructions for
>any of the following (somewhat related) functions:

>Count the number of 1 bits in a word
>Identify the position of most significant 1 bit in a word.
>Identify the position of the least significant 1 bit in a word.
>Report the binary logarithm of an integer.

>Of course, these are unusual instructions since they would hard to
>generate from most programming languages.

Let me add one more to the list; it would be very useful for 
such purposes as generating non-uniform random numbers from
uniform random bit input.

What is wanted is to read a bit stream from a position given
by a bit pointer, find the distance to the next 1 in the
stream, and update the pointer, with appropriate interrupts
if the stream becomes exhausted or there are no 1's remaining
bits in the stream.

Even partial tools would be helpful.
-- 
This address is for information only.  I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
hrubin@stat.purdue.edu         Phone: (765)494-6054   FAX: (765)494-0558


Subject: Re: Bit counting and similar instructions
Date: Sat, 05 Dec 1998 15:18:58 +0100
From: Terje Mathisen <Terje.Mathisen@hda.hydro.com>
Message-ID: <366940D2.54B0DDE1@hda.hydro.com>
References: <7491je$hmk@b.stat.purdue.edu>
Newsgroups: comp.arch.arithmetic
Lines: 92

Herman Rubin wrote:
> 
> In article <74745o$qi$1@joe.rice.edu>,
> Doug Moore  <dougm@farkas.caam.rice.edu> wrote:
> >I'd like to know which modern architectures support instructions for
> >any of the following (somewhat related) functions:
> 
> >Count the number of 1 bits in a word
> >Identify the position of most significant 1 bit in a word.
> >Identify the position of the least significant 1 bit in a word.
> >Report the binary logarithm of an integer.
> 
> >Of course, these are unusual instructions since they would hard to
> >generate from most programming languages.
> 
> Let me add one more to the list; it would be very useful for
> such purposes as generating non-uniform random numbers from
> uniform random bit input.
> 
> What is wanted is to read a bit stream from a position given
> by a bit pointer, find the distance to the next 1 in the
> stream, and update the pointer, with appropriate interrupts
> if the stream becomes exhausted or there are no 1's remaining
> bits in the stream.
> 
> Even partial tools would be helpful.

The partial tools already exists, and should be plenty good enough to do
this _very_ quickly.

If the input bit stream is really random, then the average distance to
the next 1 bit will be two bits, right?

In 255 of 256 cases, the next bit will be found within the next 8 bits,
so I'd use code like this:

unsigned char cache; // Current byte in bit stream
unsigned char *buf;  // points to the next byte in the bit stream
unsigned bitpos;     // bit position to start looking at (0..8)
unsigned char mask[9] = {255,254,252,248,240,224,192,128,0};
unsigned char firstbit[256]; // returns the position of the first 1 bit

  b = cache & mask[bitpos];	// 4 bits left in cache (on average)
  if (b) {
     bit = firstbit[b];
     count = bit - bitpos;
     bitpos = bit + 1;
     return count;
  }

In about 93% of all cases, the code above will be all that's needed. It
uses about 7 operations (worst case, assuming tables in L1 cache), so
most modern cpus can inline this code and run it in 3-5 cycles.

When we've reached the end of the current byte, then we'll need to
locate the next non-zero byte:

  for (count = 8-bitpos; (b = *buf++) == 0; count += 8) ;

The code above will nearly always run just a single (predicted)
iteration, so it will cost a couple of cycles, plus the cost of the
mis-predicted branch in the initial code above.

  cache = b;			// Save remainder for next iteration
  bit = firstbit[b];
  bitpos = bit + 1;
  return bit + count;

To avoid the need for explicit checks for the end of the input buffer, a
simple sentinel value (255?) will do fine.

IMHO, it really seems like this "problem" can be solved with portable
code, averaging less than 5 cycles per invocation. This is NOT a good
candidate for extra hardware support.

Terje

PS. Using the fp hardware and 32-bit input blocks might be even faster,
since there will be no need for the 256-byte lookup table:

  d = cache & mask[bitpos];
  if (d) {
    double t = d;
    int64 ll = *(int64 *) & t;
    return (ll >> 52) - 1023;
  }

-- 
- <Terje.Mathisen@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"



Subject: Re: Bit counting and similar instructions
Date: Sun, 6 Dec 1998 18:13:51 -0800
From: "Derek Gladding" <derek_gladding@altavista.net>
Message-ID: <74fegu$8ib$1@owl.slip.net>
References: <366940D2.54B0DDE1@hda.hydro.com>
Newsgroups: comp.arch.arithmetic
Lines: 22


Terje Mathisen wrote in message <366940D2.54B0DDE1@hda.hydro.com>...
>Herman Rubin wrote:
>>
>> In article <74745o$qi$1@joe.rice.edu>,
>> Doug Moore  <dougm@farkas.caam.rice.edu> wrote:
>> >I'd like to know which modern architectures support instructions for
>> >any of the following (somewhat related) functions:
>>
>> >Count the number of 1 bits in a word
>> >Identify the position of most significant 1 bit in a word.
>> >Identify the position of the least significant 1 bit in a word.
>> >Report the binary logarithm of an integer.
>>

The ARC (http://www.risccores.com) has the most-significant 1 operation
as one of the optional instruction set extensions.

- Derek





Subject: Re: Bit counting and similar instructions
Date: Sat, 05 Dec 1998 22:29:25 +0100
From: Terje Mathisen <Terje.Mathisen@hda.hydro.com>
Message-ID: <3669A5B5.E6CB3F5@hda.hydro.com>
References: <7491je$hmk@b.stat.purdue.edu>
Newsgroups: comp.arch.arithmetic
Lines: 45

Herman Rubin wrote:
> 
> In article <74745o$qi$1@joe.rice.edu>,
> Doug Moore  <dougm@farkas.caam.rice.edu> wrote:
> >I'd like to know which modern architectures support instructions for
> >any of the following (somewhat related) functions:
> 
> >Count the number of 1 bits in a word
> >Identify the position of most significant 1 bit in a word.
> >Identify the position of the least significant 1 bit in a word.
> >Report the binary logarithm of an integer.
> 
> >Of course, these are unusual instructions since they would hard to
> >generate from most programming languages.
> 
> Let me add one more to the list; it would be very useful for
> such purposes as generating non-uniform random numbers from
> uniform random bit input.
> 
> What is wanted is to read a bit stream from a position given
> by a bit pointer, find the distance to the next 1 in the
> stream, and update the pointer, with appropriate interrupts
> if the stream becomes exhausted or there are no 1's remaining
> bits in the stream.
> 
> Even partial tools would be helpful.

Herman, this application does NOT need any fancy opcodes!

It is quite similar to decoding huffman-encoded (or other variable bit
length encoding) data. The fastest way I know to do this is to use one
or a couple of lookup tables, which are indexed by the next N bits of
the bit stream.

With N=8, each lookup will locate 0 to 8 1 bits, in the form of a table
consisting of a count and a set of offsets. Worst case, this will use
9*256 bytes, and it will b faster than any conceivable special opcode
which returns just a single entry.

Terje

-- 
- <Terje.Mathisen@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"


Subject: Re: Bit counting and similar instructions
Date: 8 Dec 1998 16:54:01 -0500
From: hrubin@b.stat.purdue.edu (Herman Rubin)
Message-ID: <74k75p$juk@b.stat.purdue.edu>
References: <3669A5B5.E6CB3F5@hda.hydro.com>
Newsgroups: comp.arch.arithmetic
Lines: 57

In article <3669A5B5.E6CB3F5@hda.hydro.com>,
Terje Mathisen  <Terje.Mathisen@hda.hydro.com> wrote:
>Herman Rubin wrote:

>> In article <74745o$qi$1@joe.rice.edu>,
>> Doug Moore  <dougm@farkas.caam.rice.edu> wrote:
>> >I'd like to know which modern architectures support instructions for
>> >any of the following (somewhat related) functions:

>> >Count the number of 1 bits in a word
>> >Identify the position of most significant 1 bit in a word.
>> >Identify the position of the least significant 1 bit in a word.
>> >Report the binary logarithm of an integer.

>> >Of course, these are unusual instructions since they would hard to
>> >generate from most programming languages.

>> Let me add one more to the list; it would be very useful for
>> such purposes as generating non-uniform random numbers from
>> uniform random bit input.

>> What is wanted is to read a bit stream from a position given
>> by a bit pointer, find the distance to the next 1 in the
>> stream, and update the pointer, with appropriate interrupts
>> if the stream becomes exhausted or there are no 1's remaining
>> bits in the stream.

>> Even partial tools would be helpful.

>Herman, this application does NOT need any fancy opcodes!

>It is quite similar to decoding huffman-encoded (or other variable bit
>length encoding) data. The fastest way I know to do this is to use one
>or a couple of lookup tables, which are indexed by the next N bits of
>the bit stream.

>With N=8, each lookup will locate 0 to 8 1 bits, in the form of a table
>consisting of a count and a set of offsets. Worst case, this will use
>9*256 bytes, and it will b faster than any conceivable special opcode
>which returns just a single entry.

It does not need opcodes, it needs hardware, for the uses envisioned.

It is not a matter of size, but of speed.  The use I envision for this
is to generate non-uniform random variables, by using small numbers of
bits and exact computation to obtain some of the bits of the output
random variable, the rest to be filled in by random bits.

There are always competing procedures, less efficient from a theoretical
point, but doing a much better job of using standard computer hardware
for speed.  This is likely to be done millions of times, so the "end
effect" problems will arise, and need to be considered for the timing.
-- 
This address is for information only.  I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
hrubin@stat.purdue.edu         Phone: (765)494-6054   FAX: (765)494-0558


Subject: Re: Bit counting and similar instructions
Date: Wed, 09 Dec 1998 09:17:12 +0100
From: Terje Mathisen <Terje.Mathisen@hda.hydro.com>
Message-ID: <366E3208.C72F8F4@hda.hydro.com>
References: <74k75p$juk@b.stat.purdue.edu>
Newsgroups: comp.arch.arithmetic
Lines: 36

Herman Rubin wrote:
> 
> In article <3669A5B5.E6CB3F5@hda.hydro.com>,
> Terje Mathisen  <Terje.Mathisen@hda.hydro.com> wrote:
> >Herman, this application does NOT need any fancy opcodes!
> 
> >It is quite similar to decoding huffman-encoded (or other variable bit
> >length encoding) data. The fastest way I know to do this is to use one
> >or a couple of lookup tables, which are indexed by the next N bits of
> >the bit stream.
> 
> >With N=8, each lookup will locate 0 to 8 1 bits, in the form of a table
> >consisting of a count and a set of offsets. Worst case, this will use
> >9*256 bytes, and it will b faster than any conceivable special opcode
> >which returns just a single entry.
> 
> It does not need opcodes, it needs hardware, for the uses envisioned.
> 
> It is not a matter of size, but of speed.  The use I envision for this
> is to generate non-uniform random variables, by using small numbers of
> bits and exact computation to obtain some of the bits of the output
> random variable, the rest to be filled in by random bits.

OK, I'll bite:

Please post your current C (or other language) algorithm to do this
processing, and I'll see if I can find a way to optimize it, preferably
to the point where it will be as fast or faster than what you'd get from
a single-cycle find_next_1_bit opcode.

Terje

-- 
- <Terje.Mathisen@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"


Subject: Re: Bit counting and similar instructions
Date: Fri, 15 Jan 1999 12:48:52 +0000
From: Suresh Kadiyala <suresh@ssofttech.com>
Message-ID: <369F3934.162F18A5@ssofttech.com>
References: <74745o$qi$1@joe.rice.edu>
Newsgroups: comp.arch.arithmetic
Lines: 29



Doug Moore wrote:

> I'd like to know which modern architectures support instructions for
> any of the following (somewhat related) functions:
>
> Count the number of 1 bits in a word

PowerPC has this.

>
> Identify the position of most significant 1 bit in a word.
> Identify the position of the least significant 1 bit in a word.

I think PPC has the above two.

>
> Report the binary logarithm of an integer.
>
> Of course, these are unusual instructions since they would hard to
> generate from most programming languages.
>
> Doug Moore
> (dougm@caam.rice.edu)

Suresh Kadiyala




Subject: Re: Bit counting and similar instructions
Date: Mon, 18 Jan 1999 09:36:24 +0100
From: Marc Daumas <Marc.Daumas@ENS-Lyon.Fr>
Message-ID: <36A2F288.BF50D66A@ENS-Lyon.Fr>
References: <369F3934.162F18A5@ssofttech.com>
Newsgroups: comp.arch.arithmetic
Lines: 22

Suresh Kadiyala wrote:
> > I'd like to know which modern architectures support instructions for
> > any of the following (somewhat related) functions:
> >
> > Count the number of 1 bits in a word
> 
> PowerPC has this.

I was told that this operation was regarded as very usefull to break
cryptographic codes and thereafter it was ommitted on most architectures
so that there willl be no restriction to export.

Truth or legend ?

What instruction is that on the power PC ? I don't know any such
instruction on x86.

-- 
Marc Daumas - Charge de recherches au CNRS (LIP - ENS de Lyon)
mailto:Marc.Daumas@ENS-Lyon.Fr - http://www.ens-lyon.fr/~daumas
ENS de Lyon - 46, allee d'Italie - 69364 Lyon Cedex 07 - FRANCE
Phone: (+33) 4 72 72 83 52 - Fax: (+33) 4 72 72 80 80


Subject: Re: Bit counting and similar instructions
Date: 18 Jan 1999 07:51:41 -0500
From: hrubin@b.stat.purdue.edu (Herman Rubin)
Message-ID: <77vaot$17h6@b.stat.purdue.edu>
References: <36A2F288.BF50D66A@ENS-Lyon.Fr>
Newsgroups: comp.arch.arithmetic
Lines: 26

In article <36A2F288.BF50D66A@ENS-Lyon.Fr>,
Marc Daumas  <Marc.Daumas@ENS-Lyon.Fr> wrote:
>Suresh Kadiyala wrote:
>> > I'd like to know which modern architectures support instructions for
>> > any of the following (somewhat related) functions:

>> > Count the number of 1 bits in a word

>> PowerPC has this.

>I was told that this operation was regarded as very usefull to break
>cryptographic codes and thereafter it was ommitted on most architectures
>so that there willl be no restriction to export.

One could make this argument about any bit instruction.  The cost of
doing this is large enough to make some useful algorithms not pay,
but not large enough to make much difference in cryptanalysis, where
other operations are likely to be done many times.



-- 
This address is for information only.  I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
hrubin@stat.purdue.edu         Phone: (765)494-6054   FAX: (765)494-0558

Cache-Post-Path: proxy0.isltd.insignia.com!unknown@christian-mac.isltd.insignia.com

Subject: Re: Bit counting and similar instructions
Date: Tue, 19 Jan 1999 11:49:11 +0000
From: christian.bau@isltd.insignia.com (Christian Bau)
Message-ID: <christian.bau-1901991149110001@christian-mac.isltd.insignia.com>
References: <36A2F288.BF50D66A@ENS-Lyon.Fr>
Newsgroups: comp.arch.arithmetic
Lines: 37

In article <36A2F288.BF50D66A@ENS-Lyon.Fr>, Marc Daumas
<Marc.Daumas@ENS-Lyon.Fr> wrote:

> Suresh Kadiyala wrote:
> > > I'd like to know which modern architectures support instructions for
> > > any of the following (somewhat related) functions:
> > >
> > > Count the number of 1 bits in a word
> > 
> > PowerPC has this.
> 
> I was told that this operation was regarded as very usefull to break
> cryptographic codes and thereafter it was ommitted on most architectures
> so that there willl be no restriction to export.
> 
> Truth or legend ?
> 
> What instruction is that on the power PC ? I don't know any such
> instruction on x86.

There is no single instruction to count bits in a word on the PowerPC,
only an instruction to count the number of leading zero bits. Written in
C, the following code would be quite efficient on the PowerPC if you do
the operation a lot: 

   static unsigned char lookup_table [2048] = {
      /* Initialised to the number of bits in i for 0 <= i < 2048 */
   }

   #define bit_count(n)    (lookup_table[((n)>>22) & 0x7ff]       \
                           +lookup_table[((n)>>11) & 0x7ff]       \
                           +lookup_table[((n)>> 0) & 0x7ff])

Total latency six cycles if a pointer to the lookup table is in a
register. I think the Altivec processors (PowerPC 750 + Vector
instructions) could be quite fast at it as well. My guess is about 5
cycles for counting 128 bits.


Subject: Re: Bit counting and similar instructions
Date: Tue, 19 Jan 1999 13:34:27 +0100
From: Terje Mathisen <Terje.Mathisen@hda.hydro.com>
Message-ID: <36A47BD3.804DF5B0@hda.hydro.com>
References: <christian.bau-1901991149110001@christian-mac.isltd.insignia.com>
Newsgroups: comp.arch.arithmetic
Lines: 67

Christian Bau wrote:
> There is no single instruction to count bits in a word on the PowerPC,
> only an instruction to count the number of leading zero bits. Written in
> C, the following code would be quite efficient on the PowerPC if you do
> the operation a lot:
> 
>    static unsigned char lookup_table [2048] = {
>       /* Initialised to the number of bits in i for 0 <= i < 2048 */
>    }
> 
>    #define bit_count(n)    (lookup_table[((n)>>22) & 0x7ff]       \
>                            +lookup_table[((n)>>11) & 0x7ff]       \
>                            +lookup_table[((n)>> 0) & 0x7ff])
> 
> Total latency six cycles if a pointer to the lookup table is in a

A Pentium would be similar:

; assume input value in EAX

  mov ebx,eax
  and eax,7ffh

  mov ecx,ebx
  and ebx,7ffh SHL 11

  shr ecx,22
  mov eax,lookup_table[eax]

  shr ebx,11
  mov ecx,lookup_table[ecx]

  add eax,ecx
  mov ebx,lookup_table[ebx]

  add eax,ebx

> register. I think the Altivec processors (PowerPC 750 + Vector
> instructions) could be quite fast at it as well. My guess is about 5
> cycles for counting 128 bits.

The Altivec might be even faster:

The key would be to use the register-internal 16-element table lookup to
convert 16 nibbles into the corresponding bit counts simultaneously.

Doing this operation twice and adding should do the trick.

I don't have an Altivec manual, but the (pseudo-)code should be
something like this:

  r1 = r0 >> 4;
  r2 = 0x04030302030202010302020102010100;  // 16-element nibble table

  r0 &= 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f;
  r1 &= 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f;

  r0 = register_table_lookup(r0,r2);
  r1 = register_table_lookup(r1,r2);

  return r0+r1;

Terje
-- 
- <Terje.Mathisen@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

Originator: nmm1@taurus.cus.cam.ac.uk

Subject: Re: Bit counting and similar instructions
Date: 19 Jan 1999 12:55:43 GMT
From: nmm1@cus.cam.ac.uk (Nick Maclaren)
Message-ID: <781vcf$siu$1@pegasus.csx.cam.ac.uk>
References: <36A47BD3.804DF5B0@hda.hydro.com>
Newsgroups: comp.arch.arithmetic
Lines: 30


In article <36A47BD3.804DF5B0@hda.hydro.com>, Terje Mathisen <Terje.Mathisen@hda.hydro.com> writes:
|> Christian Bau wrote:
|> > There is no single instruction to count bits in a word on the PowerPC,
|> > only an instruction to count the number of leading zero bits. Written in
|> > C, the following code would be quite efficient on the PowerPC if you do
|> > the operation a lot:  ...
|> > 
|> > Total latency six cycles if a pointer to the lookup table is in a
|> 
|> A Pentium would be similar:  ...
|> 
|> > .... I think the Altivec processors (PowerPC 750 + Vector
|> > instructions) could be quite fast at it as well. My guess is about 5
|> > cycles for counting 128 bits.

Most machines are similar to the above.  The benefits of such an
instruction would be marginal, at best, given the small number of
programs that would benefit.

On the other hand, the cost of implementing such an instruction is
usually also marginal ....


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QG, England.
Email:  nmm1@cam.ac.uk
Tel.:  +44 1223 334761    Fax:  +44 1223 334679


Subject: Re: Bit counting and similar instructions
Date: Tue, 19 Jan 1999 18:40:34 -0800
From: alexr@I.HATE.SPAM (Alex Rosenberg)
Message-ID: <alexr-1901991840340001@roseal2.apple.com>
References: <36A47BD3.804DF5B0@hda.hydro.com>
Newsgroups: comp.arch.arithmetic
Lines: 50

In article <36A47BD3.804DF5B0@hda.hydro.com>, Terje Mathisen
<Terje.Mathisen@hda.hydro.com> wrote:

>Christian Bau wrote:
>> register. I think the Altivec processors (PowerPC 750 + Vector
>> instructions) could be quite fast at it as well. My guess is about 5
>> cycles for counting 128 bits.
>
>The Altivec might be even faster:
>
>The key would be to use the register-internal 16-element table lookup to
>convert 16 nibbles into the corresponding bit counts simultaneously.
>
>Doing this operation twice and adding should do the trick.
>
>I don't have an Altivec manual, but the (pseudo-)code should be
>something like this:

Motorola has a manual at:
<http://www.mot.com/SPS/PowerPC/teksupport/teklibrary/manuals/altivec_pem.pdf>

Apple also has a whole slew of AltiVec info at:
<http://developer.apple.com/hardware/altivec>
<http://developer.apple.com/tools/mpw-tools/altivec.html>

>  r1 = r0 >> 4;
>  r2 = 0x04030302030202010302020102010100;  // 16-element nibble table
>
>  r0 &= 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f;
>  r1 &= 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f;
>
>  r0 = register_table_lookup(r0,r2);
>  r1 = register_table_lookup(r1,r2);
>
>  return r0+r1;

The two masking operations are unnecessary. The permute operation is
actually a 5-bit lookup and the 4-bit table can be specified twice to make
the fifth bit indifferent. Also, this would only count within each byte.
You'd have to add two sum across operations to compute all 128 bits. If
you had a large number of bits to count, you could save a few cycles by
accumulating a few iterations as bytes before the sum across operations.

Check artcle <6ja3v6$17p4$1@rtpnews.raleigh.ibm.com> by Brett Olsson
<brett@raleigh.ibm.com> in your favorite news archive for a 5-bit variant
that's substantially similar.

+------------------------------------------------------------+
| Alexander M. Rosenberg           <mailto:alexr@_spies.com> |
| Nobody cares what I say. Remove the underscore to mail me. |


Subject: Re: Bit counting and similar instructions
Date: 20 Jan 1999 18:44:51 -0800
From: gillies@cs.ubc.ca (Donald Gillies)
Message-ID: <7864b3$keh$1@cascade.cs.ubc.ca>
References: <alexr-1901991840340001@roseal2.apple.com>
Newsgroups: comp.arch.arithmetic
Lines: 47

It is not efficient to access memory to count bits on the PowerPC, or
on ANY modern RISC processor.  The cost of going to memory is roughly
5-10 instruction times.  Even going to primary secondary cache is
going to hurt your execution time a whole lot.  Also, you cannot trust
most RISC's to perform efficient byte accesses - many have to extract
the byte slowly in the registers, or mask off the upper 24 bits, or
whatnot.  Moreover, many compilers (green hills) will throw away the
base address of a lookup array after just 1 use, forcing a reload of
this base address and wasting valuable cycles.  

Thus, to count bits on most RISC's, make clever use of arithmetic:



#include "stdio.h"		

// (C) Donald W. Gillies, 1992.  All rights reserved.  You may reuse
// this bitcount() function anywhere you please as long as you retain
// this Copyright Notice.
//
#define bitready()	register unsigned long tmp
#define bitcount(n) 							\
      (tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111),	\
      tmp = ((tmp + (tmp >> 3)) & 030707070707),			\
      tmp =  (tmp + (tmp >> 6)),					\
      tmp = (tmp + (tmp >> 12) + (tmp >> 24)) & 077)

/* 16 instructions on most risc machines for 32-bit bitcount ! */

main()
{
	bitready();

	printf("bitcount(1) = %ld\n", bitcount(1));
	printf("bitcount(2) = %ld\n", bitcount(2));
	printf("bitcount(3) = %ld\n", bitcount(3));
	printf("bitcount(0xff) = %ld\n", bitcount(0xff));
	printf("bitcount(0xffffff) = %ld\n", bitcount(0xffffff));
}

% ~/a.out
bitcount(1) = 1
bitcount(2) = 1
bitcount(3) = 2
bitcount(0xff) = 8
bitcount(0xffffff) = 24
%


Subject: Re: Bit counting and similar instructions
Date: Thu, 21 Jan 1999 09:33:52 +0100
From: Terje Mathisen <Terje.Mathisen@hda.hydro.com>
Message-ID: <36A6E670.ECE65309@hda.hydro.com>
References: <7864b3$keh$1@cascade.cs.ubc.ca>
Newsgroups: comp.arch.arithmetic
Lines: 111

Donald Gillies wrote:
> 
> It is not efficient to access memory to count bits on the PowerPC, or
> on ANY modern RISC processor.  The cost of going to memory is roughly

We were discussing two specific cpus here, both of which will actually
run the table-lookup copy quickly, but in general, I agree.

> 5-10 instruction times.  Even going to primary secondary cache is
> going to hurt your execution time a whole lot.  Also, you cannot trust
> most RISC's to perform efficient byte accesses - many have to extract
> the byte slowly in the registers, or mask off the upper 24 bits, or
> whatnot.  Moreover, many compilers (green hills) will throw away the
> base address of a lookup array after just 1 use, forcing a reload of
> this base address and wasting valuable cycles.

This is a single example of a totally broken compiler, besides if the
bitcount is actually time critical, I would implement it in asm anyway.

(I have actually written at leat 5 or 6 different asm versions of this
code, see below)

> Thus, to count bits on most RISC's, make clever use of arithmetic:
> 
> #include "stdio.h"
> 
> // (C) Donald W. Gillies, 1992.  All rights reserved.  You may reuse
> // this bitcount() function anywhere you please as long as you retain
> // this Copyright Notice.

AFAIK, this algorithm was known before 1992.

> //
> #define bitready()      register unsigned long tmp
> #define bitcount(n)                                                     \
>       (tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111), \
>       tmp = ((tmp + (tmp >> 3)) & 030707070707),                        \
>       tmp =  (tmp + (tmp >> 6)),                                        \
>       tmp = (tmp + (tmp >> 12) + (tmp >> 24)) & 077)
> 
> /* 16 instructions on most risc machines for 32-bit bitcount ! */

To really make this version efficient, you need to do the bitcount
across an array:

In that case you can speed it up a lot more, by the use of bit-parallel
addition:

Three input words can be stored in two vertical words, with a single
full adder to merge them.

Similarly seven input words can be converted to three vertical words,
and finally you can convert 15 input words to 4 vertical words.

After doing the adds, the actual bitcounting in the resulting words
should be handled in parallel, this gets rid of all/most of the stalls.

Robert Harley posted 64-bit Alpha code to do this which ran in (much?)
less than a cycle/byte, I have written an MMX asm version which gets
similar performance.

Here's an example for a seven-wide counter:

#define FULL_ADD(c1, c0, w0, w1, s1, s2) w1 = s1; c0 = w0; w0 &= w1; \
           c0 ^= w1; w1 = s2; c1 = c0; c0 ^= w1; c1 &= w1; c1 |= w0

#define MASK55 0x55555555
#define MASK33 0x33333333
#define MASK0F 0x0f0f0f0f

ulong count_bits7(unsigned *src)
{
    unsigned c0, c1, t0, t1, d0, d1, e0, e1, f1, f2;

    t0 = src[0];
    FULL_ADD(c1, c0, t0, t1, src[1], src[2]); // c1:c0         Up to 4
live vars + src[]
    FULL_ADD(d1, d0, c0, t1, src[3], src[4]); // d1:d0, c1     Up to 5
live vars + src[]
    FULL_ADD(e1, e0, d0, t1, src[5], src[6]); // e1:e0, d1, c1 Up to 6
live vars + src[]
    FULL_ADD(f2, f1, c1, t1,     d1,     e1); // f2:f1:e0

    e0 -= (e0 >> 1) & MASK55;					// 2 bits, 0-2
    f1 -= (f1 >> 1) & MASK55;
    f2 -= (f2 >> 1) & MASK55;

    e0 = (e0 & MASK33) + ((e0 >> 2) & MASK33);	// 4 bits, 0-4
    f1 = (f1 & MASK33) + ((f1 >> 2) & MASK33);
    f2 = (f2 & MASK33) + ((f2 >> 2) & MASK33);

    e0 += e0 >> 4; f1 += f1 >> 4; f2 += f2 >> 4; // 4 bits, 0-8
    e0 &= MASK0F; f1 &= MASK0F; f2 &= MASK0F;	// 8 bits, 0-8

    e0 += (f1 << 1) + (f2 << 2); 		// 8 bits, 0-8+16+32 = 56
    e0 += e0 >> 8;				// 8 bits, 0-112
    e0 += e0 >> 16;				// 8 bits, 0-224

    return e0 & 255;
}

It is worth noting though that on a plain Pentium, which is very good at
byte addressing, and can handle two loads/cycle, the naive byte-wide
code ran neck&neck with this much more complicated algorithm!

Terje

-- 
- <Terje.Mathisen@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"
Terry Ritter, his current address, and his top page.
Last updated: 1999-02-20
Bit Counting and Similar Instructions

A Ciphers By Ritter Page

Contents