From: Bob Felts
Subject: Squeezing blood from a turnip
Date: 
Message-ID: <1hivrdy.untzmo1freut0N%wrf3@stablecross.com>
[... a bit long-winded, will get to a Lisp question ...]

The company I work for is in a hiring phase (not for Lisp,
unfortunately) and we ask all of our candidates to write small C
functions on a whiteboard.  One of the standard problems is to implement
"countBits", which counts all of the bits set in an integer.  [1]

One solution uses the fact that and'ing n with (n - 1) removes the
leastmost bit set in n:

int countBits(int n)
{
   int count;

   for (count = 0; n; n &= (n - 1))
      ++count;

   return (count);
}
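
A couple of quick spot checks of that identity in Lisp (the values are
arbitrary; #b1100 is 12):

   ;; ANDing n with n-1 clears exactly the lowest set bit.
   (assert (= (logand #b1100 (1- #b1100)) #b1000))
   (assert (= (logand #b1000 (1- #b1000)) 0))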


Six different versions of this routine were coded in Lisp and the
following times were collected with OpenMCL and LispWorks on a 1GHz
Apple laptop, using:

   (time (dotimes (i 20000000) (count-bits i)))

Algorithm          OpenMCL        LispWorks
---------------------------------------------
    1               19.232          17.294
    2               27.347           9.030  [2]
    3               18.601           8.949
    3.5             19.367           9.698
    4               14.175           9.273
    5               14.152           9.753
    6               14.400           9.532

Algorithms 1 - 3.5 are recursive; algorithm 1 is not tail recursive, the
others are.
Algorithms 4, 5, and 6 are iterative using either DO or LOOP.

The C version, compiled using XCode 2.3, takes between 5 and 6 seconds.

So I want to try to speed up, say, algorithm 4:

(defun count-bits4 (n)
  (loop
     for i = n then (logand i (1- i))
     until (zerop i)
     summing 1 into count
     finally (return count)))

So I added a DECLARE statement as follows:

(defun count-bits4o (n)
 (declare (optimize (safety 0) (speed 3)) (fixnum n))
  (loop
     for i = n then (logand i (1- i))
     until (zerop i)
     summing 1 into count
     finally (return count)))

The time went up to 126.996 seconds, with 480,250,292 allocations!

When I get a few more minutes, I'll use DISASSEMBLE to look at the code.
In the meantime, any suggestions on how to make this faster are welcome.
It seems that every time I try to give the compiler a hint, it makes
things worse.

-------
[1] interesting (at least to me) anecdote:  four candidates in a row,
from fresh out of school to 20+ years of experience, have initially
treated sizeof(thing) as if sizeof returned the number of _bits_ in
thing.

[2] I'm surprised by the difference between the two Lisps.  The function
is:
(defun count-bits2 (n &optional (result 0))
     (if (zerop n)
         result
         (count-bits2 (logand n (1- n)) (1+ result))))
 

From: Christophe Rhodes
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <sqfygtv0wk.fsf@cam.ac.uk>
····@stablecross.com (Bob Felts) writes:

> One solution uses the fact that and'ing n with (n - 1) removes the
> leastmost bit set in n:
>  [...]
> So I want to try to speed up, say, algorithm 4:
>
> (defun count-bits4 (n)
>   (loop
>      for i = n then (logand i (1- i))
>      until (zerop i)
>      summing 1 into count
>      finally (return count)))
>
> So I added a DECLARE statement as follows:
>
> (defun count-bits4o (n)
>  (declare (optimize (safety 0) (speed 3)) (fixnum n))
>   (loop
>      for i = n then (logand i (1- i))
>      until (zerop i)
>      summing 1 into count
>      finally (return count)))
>
> The time went up to 126.996 seconds, with 480,250,292 allocations!

For what it's worth, with SBCL on the x86, your count-bits4o is 20% or
so faster than count-bits4.  Adding a few more declarations,

(defun count-bits4oo (n)
  (declare (optimize (safety 0) (speed 3)) 
           (type (unsigned-byte 29) n))
  (loop for i of-type (unsigned-byte 29) = n then (logand i (1- i))
        until (zerop i)
        summing 1 into count fixnum 
        finally (return count)))

and I get a further 70% or so speedup; that is, on my system, the
original loop to 20000000 took some 6.8 seconds, and with
COUNT-BITS4OO it takes 0.8 seconds.  (Mind you, with the same
declarations, simply calling LOGCOUNT[*] takes 0.3 seconds for the
loop, but that uses a different algorithm.)

> When I get a few more minutes, I'll use DISASSEMBLE to look at the code.
> In the meantime, any suggestions on how to make this faster are welcome.
> It seems that every time I try to give the compiler a hint, it makes
> things worse.

Did you do this with Lispworks?  What is its most-positive-fixnum?
You might be lying to the compiler with your FIXNUM declaration,
coupled with calling COUNT-BITS4O with 19999999 as an argument, which
could cause all sorts of problems.
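
A quick REPL check in whichever implementation you're timing would tell
you (the results are implementation- and word-size-specific, so I won't
guess at LispWorks' numbers):

  ;; How wide are fixnums here, and is the loop counter really one?
  (values most-positive-fixnum
          (integer-length most-positive-fixnum)
          (typep 19999999 'fixnum))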

I'm afraid I can't advise you with respect to optimization for
Lispworks or OpenMCL, as I don't know the peculiarities of those
compilers; it seems to be true that optimizing for speed is quite an
implementation-specific task.  You might find my COUNT-BITS4OO better
on those compilers, as it explicitly declares the types of everything
that needs declaring, at least judging by SBCL's compiler notes; on
the other hand, it might be that it doesn't help at all.

Christophe

[*] 
(defun count-bits-logcount (n)
  (declare (optimize speed (safety 0))
           (type (unsigned-byte 29) n))
  (logcount n))
From: Raffael Cavallaro
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <2006072217575350073-raffaelcavallaro@pasdespamsilvousplaitmaccom>
On 2006-07-22 16:29:31 -0400, Christophe Rhodes <·····@cam.ac.uk> said:

> (defun count-bits-logcount (n)
>   (declare (optimize speed (safety 0))
>            (type (unsigned-byte 29) n))
>   (logcount n))

Actually, using sbcl on an intel mac I get the same time as you for the 
loop using count-bits-logcount (0.3 seconds) but it only takes 0.015 
seconds if we just call logcount directly:

CL-USER> (time (dotimes (i 20000000) (logcount i)))
Evaluation took:
  0.015 seconds of real time
  0.015037 seconds of user run time
  1.08e-4 seconds of system run time
  0 page faults and
  0 bytes consed.
From: Juho Snellman
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <slrnec5927.ksl.jsnell@sbz-30.cs.Helsinki.FI>
Raffael Cavallaro <················@pas-d'espam-s'il-vous-plait-mac.com> wrote:
> On 2006-07-22 16:29:31 -0400, Christophe Rhodes <·····@cam.ac.uk> said:
> 
>> (defun count-bits-logcount (n)
>>   (declare (optimize speed (safety 0))
>>            (type (unsigned-byte 29) n))
>>   (logcount n))
> 
> Actually, using sbcl on an intel mac I get the same time as you for the 
> loop using count-bits-logcount (0.3 seconds) but it only takes 0.015 
> seconds if we just call logcount directly:
> 
> CL-USER> (time (dotimes (i 20000000) (logcount i)))

You're timing the execution of an empty loop. The call to logcount is
optimized away, since the return value is not used and the function is
known to have no side effects.
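
One easy way to keep the calls from being discarded is to accumulate
and return the result; roughly (a sketch):

  ;; The running total is used as the DOTIMES result form, so the
  ;; LOGCOUNT calls cannot be elided as dead code.
  (defun time-logcount-loop (iters)
    (let ((total 0))
      (time (dotimes (i iters total)
              (incf total (logcount i))))))

  ;; e.g. (time-logcount-loop 20000000)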

-- 
Juho Snellman
From: Raffael Cavallaro
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <2006072222233243658-raffaelcavallaro@pasdespamsilvousplaitmaccom>
On 2006-07-22 18:18:47 -0400, Juho Snellman <······@iki.fi> said:

> The call to logcount is
> optimized away, since the return value is not used and the function is
> known to have no side effects.

so I am - should have disassembled before posting...
From: Wade Humeniuk
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <4Xxwg.124553$I61.92601@clgrps13>
Bob Felts wrote:

> So I added a DECLARE statement as follows:
> 
> (defun count-bits4o (n)
>  (declare (optimize (safety 0) (speed 3)) (fixnum n))
>   (loop
>      for i = n then (logand i (1- i))
>      until (zerop i)
>      summing 1 into count
>      finally (return count)))
> 
> The time went up to 126.996 seconds, with 480,250,292 allocations!
> 
> When I get a few more minutes, I'll use DISASSEMBLE to look at the code.
> In the meantime, any suggestions on how to make this faster are welcome.
> It seems that every time I try to give the compiler a hint, it makes
> things worse.

The declaration you are looking for is...

(Be careful though: fixnums in LWW are only 24 bits.)

(defun count-bits4o (n)
  (declare (optimize (safety 0) (speed 3) (debug 0) (fixnum-safety 0))
           (fixnum n))
   (loop
      for i = n then (logand i (1- i))
      until (zerop i)
      summing 1 into count
      finally (return count)))

CL-USER 2 > (disassemble 'count-bits4o)
2068486A:
        0:      55               push  ebp
        1:      89E5             move  ebp, esp
        3:      33DB             xor   ebx, ebx
        5:      89C7             move  edi, eax
        7:      83FF00           cmp   edi, 0
       10:      7505             jne   L2
L1:   12:      89D8             move  eax, ebx
       14:      FD               std
       15:      C9               leave
       16:      C3               ret
L2:   17:      81C300010000     add   ebx, 100
       23:      89FA             move  edx, edi
       25:      81EF00010000     sub   edi, 100
       31:      23FA             and   edi, edx
       33:      83FF00           cmp   edi, 0
       36:      74E6             je    L1
       38:      EBE9             jmp   L2
       40:      90               nop
       41:      90               nop
NIL

CL-USER 3 >

Wade
From: Rob Warnock
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <sp6dndOAdvezmV7ZnZ2dnUVZ_u6dnZ2d@speakeasy.net>
Bob Felts <····@stablecross.com> wrote:
+---------------
| One of the standard problems is to implement "countBits",
| which counts all of the bits set in an integer.  [1]
| One solution uses the fact that and'ing n with (n - 1)
| removes the leastmost bit set in n:
|    int countBits(int n)
|    { int count;
|      for (count = 0; n; n &= (n - 1))
|         ++count;
|      return (count);
|    }
+---------------

Unless you have some strong reason to expect that a *very* small
number of bits in "n" will be set on average, this method is horribly
slow. Much better is the standard iterated comb method, variations
of which were already well-known in the early days of computing
<http://www.inwap.com/pdp10/hbaker/hakmem/hacks.html#item169>:

    int countBits(int n)	/* Assumes 32-bit ints */
    {
      n = (n & 0x55555555) + ((n >> 1) & 0x55555555);
      n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
      n = (n & 0x0f0f0f0f) + ((n >> 4) & 0x0f0f0f0f);
      n += (n >>  8);
      n += (n >> 16);
      return n & 0xff;
    }

This is branch-free, and runs in a constant number of cycles for all
inputs. [For 64 bits, add one more line and use wider constants.]

Of course, in Common Lisp one would simply use the built-in LOGCOUNT,
though if you allow negative arguments you will have to pick a
presumed "machine word size" for the result to be meaningful,
either this way:

    (defparameter +count-bits-word-width+ 64)	; For example

    (defun count-bits (n)
      (if (minusp n)
        (- +count-bits-word-width+ (logcount n)) ; See def'n of LOGCOUNT
        (logcount n)))

or this way:

    (defun count-bits (n &optional (bits-per-word 64))
      (if (minusp n)
        (- bits-per-word (logcount n)) 
        (logcount n)))

[Of course, some error-checking for the value being in range
of ints with BITS-PER-WORD bits would be nice.]
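
For instance, the check might look something like this (a sketch; the
name and exact error behavior are only illustrative, treating N as a
signed BITS-PER-WORD-bit quantity):

    (defun count-bits-checked (n &optional (bits-per-word 64))
      (assert (typep n `(signed-byte ,bits-per-word)) (n)
              "~D does not fit in a signed ~D-bit word" n bits-per-word)
      (if (minusp n)
        (- bits-per-word (logcount n))
        (logcount n)))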


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Bob Felts
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <1hixyim.17hreuk1n0hsloN%wrf3@stablecross.com>
Rob Warnock <····@rpw3.org> wrote:

> Bob Felts <····@stablecross.com> wrote:
> +---------------
> | One of the standard problems is to implement "countBits",
> | which counts all of the bits set in an integer.  [1]
> | One solution uses the fact that and'ing n with (n - 1)
> | removes the leastmost bit set in n:
> |    int countBits(int n)
> |    { int count;
> |      for (count = 0; n; n &= (n - 1))
> |         ++count;
> |      return (count);
> |    }
> +---------------
> 
> Unless you have some strong reason to expect that a *very* small
> number of bits in "n" will be set on average, this method is horribly
> slow. Much better is the standard iterated comb method, variations
> of which were already well-known in the early days of computing
> <http://www.inwap.com/pdp10/hbaker/hakmem/hacks.html#item169>:
> 
>     int countBits(int n) /* Assumes 32-bit ints */
>     {
>       n = (n & 0x55555555) + ((n >> 1) & 0x55555555);
>       n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
>       n = (n & 0x0f0f0f0f) + ((n >> 4) & 0x0f0f0f0f);
>       n += (n >>  8);
>       n += (n >> 16);
>       return n & 0xff;
>     }
> 
> This is branch-free, and runs in a constant number of cycles for all
> inputs. [For 64 bits, add one more line and use wider constants.]

I (obviously) wasn't aware of this.  Thanks.

> 
> Of course, in Common Lisp one would simply use the built-in LOGCOUNT,

As always, that's yet one more Lisp function that I didn't know about.
;-)

But I'm still interested in exploring how to make an algorithm in Lisp
approach the execution time of the same algorithm in C. 
From: Pascal Bourguignon
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <87fygtw2f4.fsf@thalassa.informatimago.com>
····@stablecross.com (Bob Felts) writes:
> Algorithms 1 - 3.5 are recursive; algorithm 1 is not tail recursive, the
> others are.
> Algorithms 4, 5, and 6 are iterative using either DO or LOOP.

Why don't you try a parallel version?  This is a classical C problem,
and there are several variations of parallel algorithms published on
numerous web sites.

Here is for example a CL implementation of the Nifty Parallel Count:

(defun make-count-mask (width size)
  (assert (zerop (nth-value 1 (truncate (log width 2)))))
  (assert (zerop (nth-value 1 (truncate width size))))
  (loop
     :for p :from 0 :below width :by (* 2 size)
     :for b = (1- (expt 2 size)) :then (dpb (1- (expt 2 size)) (byte size p) b)
     :finally (return b)))


(defun gen-bitcount (width)
  (assert (zerop (nth-value 1 (truncate (log width 2)))))
  `(defun ,(intern (format nil "BITCOUNT-~D" width)) (n)
     (let*
         ,(loop
             :for size = 1 :then (* 2 size)
             :while (< size width)
             :collect `(n  (+ (logand n ,(make-count-mask width size))
                              (logand (ash n ,(- size))
                                      ,(make-count-mask width size)))))
       n)))

(defmacro define-bitcount (width)  (gen-bitcount width))

(define-bitcount 32)
(define-bitcount 64)

;; you can test it with:
(define-bitcount 16)
(loop for i from 0 to 65535 do (assert (= (bitcount-16 i)  (logcount i))))

It doesn't end with the trick used in the C version, because we want
it to work with bigger numbers:

(define-bitcount 1024)
(let ((n (+ (1- (expt 2 512)) (expt 2 1000))))
   (list (bitcount-1024 n) (logcount n)))
--> (201 201)


http://www-db.stanford.edu/~manku/bitcount/bitcount.html

And a bitreverse for the same price:


(defun lowbits (width size)
  (assert (zerop (nth-value 1 (truncate (log width 2)))))
  (assert (zerop (nth-value 1 (truncate width size))))
  (loop
     :for bw = size :then (* 2 bw)
     :for bits = (1- (expt 2 size))
     :then (logior (ash bits bw) bits)
     :while (< (* 2 bw) width)
     :finally (return bits)))

(defun highbits (width size)
  (logand (1- (expt 2 width)) (lognot (lowbits width size))))
  
(defun gen-bitreverse (width)
  (assert (zerop (nth-value 1 (truncate (log width 2)))))
  `(defun ,(intern (format nil "BITREVERSE-~D" width)) (n)
     (let*
         ,(loop
             :for size = 1 :then (* 2 size)
             :collect `(n (logior
                           (logand ,(lowbits  width size) (ash n ,(- size)))
                           (logand ,(highbits width size) (ash n ,size))))
             :while (< size (truncate width 2)))
       n)))

(defmacro define-bitreverse (width) (gen-bitreverse width))

(define-bitreverse 32)
(define-bitreverse 64)
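
A quick sanity check for the bitreverse, along the same lines as the
bitcount test above (a sketch; it relies only on reversal being its own
inverse and on bit 0 mapping to bit 31):

(loop for i from 0 below 65536
      do (assert (= i (bitreverse-32 (bitreverse-32 i)))))
(assert (= (bitreverse-32 1) (expt 2 31)))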


> [1] interesting (at least to me) anecdote:  four candidates in a row,
> from fresh out of school to 20+ years of experience has initially
> treated sizeof(thing) as if sizeof returned the number of _bits_ in
> thing.

Note how the width is a parameter of the above macros...

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

READ THIS BEFORE OPENING PACKAGE: According to certain suggested
versions of the Grand Unified Theory, the primary particles
constituting this product may decay to nothingness within the next
four hundred million years.
From: Bob Felts
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <1hixz0g.xeozqk1e3998gN%wrf3@stablecross.com>
Pascal Bourguignon <···@informatimago.com> wrote:

> ····@stablecross.com (Bob Felts) writes:
> > Algorithms 1 - 3.5 are recursive; algorithm 1 is not tail recursive, the
> > others are.
> > Algorithms 4, 5, and 6 are iterative using either DO or LOOP.
> 
> Why don't you try a parallel version? 

Because I'm more interested at the moment with how to make Lisp code
faster, given a particular algorithm.

[...]

> 
> > [1] interesting (at least to me) anecdote:  four candidates in a row,
> > from fresh out of school to 20+ years of experience has initially
> > treated sizeof(thing) as if sizeof returned the number of _bits_ in
> > thing.
> 
> Note how the width is a parameter of the above macros...

One of the things we ask candidates when they make their algorithm
machine specific is to make it portable across all machine
architectures.
From: Pascal Bourguignon
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <87d5bv50i3.fsf@thalassa.informatimago.com>
····@stablecross.com (Bob Felts) writes:

> Pascal Bourguignon <···@informatimago.com> wrote:
>
>> ····@stablecross.com (Bob Felts) writes:
>> > Algorithms 1 - 3.5 are recursive; algorithm 1 is not tail recursive, the
>> > others are.
>> > Algorithms 4, 5, and 6 are iterative using either DO or LOOP.
>> 
>> Why don't you try a parallel version? 
>
> Because I'm more interested at the moment with how to make Lisp code
> faster, given a particular algorithm.

It's generally understood that "optimizing" code without changing the
algorithm is only second- or third-order fine tuning.

The true way to make code faster is to find a faster algorithm.


But otherwise, one way to make Lisp code faster, when the Lisp
compiler in use supports them, is to add type declarations.
         
-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

ATTENTION: Despite any other listing of product contents found
herein, the consumer is advised that, in actuality, this product
consists of 99.9999999999% empty space.
From: Nathan Baum
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <1153703575.435220.169450@i3g2000cwc.googlegroups.com>
This is the Lisp:

(declaim (optimize (safety 0) (speed 3) (debug 0)))

(defun count-bits4 (n)
  (declare (fixnum n))
  (loop for i of-type fixnum = n then (logand i (1- i))
     until (zerop i)
     summing 1 into count of-type fixnum
     finally (return count)))

(defun test-count-bits (function iters)
  (declare (fixnum iters)
           (function function))
  (time
   (dotimes (i iters)
     (funcall function i))))

(test-count-bits #'count-bits4 (min 20000000 most-positive-fixnum))

This is the C:

int countBits(int n)
{
   int count;
   for (count = 0; n; n &= (n - 1))
      ++count;
   return (count);
}

void testCountBits (int (*f)(int), int i)
{
  int j;
  for (j = 0; j < i; ++j)
    f(j);
}

int main ()
{
  testCountBits(countBits, 20000000);
}

On average the Lisp code completes in ~1.3 seconds, whilst the C code
completes in ~1.1 seconds. To make a fairer comparison, the Lisp code
is limited to the most positive fixnum: allowing bignums would be
comparing apples and oranges. On the test platform, this was higher
than 20000000.

The Lisp system was SBCL 0.9.4. Its compiler warnings were used to
determine where type declarations were needed.

The C compiler was gcc 3.4.4 with -O3.

GNU CLISP takes ~70 seconds.

Using logcount, SBCL takes ~0.8 seconds and CLISP takes ~10 seconds.
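
For reference, a logcount variant that plugs into the same harness might
look like this (a sketch; not necessarily the exact code behind the
figures above):

(defun count-bits-logcount (n)
  (declare (fixnum n))
  (logcount n))

(test-count-bits #'count-bits-logcount (min 20000000 most-positive-fixnum))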
From: Bob Felts
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <1hiy37a.1m6e7fv1t7qarjN%wrf3@stablecross.com>
Pascal Bourguignon <···@informatimago.com> wrote:

> ····@stablecross.com (Bob Felts) writes:
> 
> > Pascal Bourguignon <···@informatimago.com> wrote:
> >
> >> ····@stablecross.com (Bob Felts) writes:
> >> > Algorithms 1 - 3.5 are recursive; algorithm 1 is not tail recursive, the
> >> > others are.
> >> > Algorithms 4, 5, and 6 are iterative using either DO or LOOP.
> >> 
> >> Why don't you try a parallel version? 
> >
> > Because I'm more interested at the moment with how to make Lisp code
> > faster, given a particular algorithm.
> 
> It's generally understood that  "optimizing" code without changing the
> algorithm is only second or third order fine tuning.
> 
> The true way to make code faster, is to find a faster algorithm.

Well, yes, I know that.  It's kinda funny; I've been programming for 30+
years, so I do know things like this.  On the other hand, I had my
(ahem) handed to me since I wasn't familiar with the parallel
algorithms.  Nor did I know about LOGCOUNT, which several posters were
kind enough to mention, but I'm new enough to Lisp that my ego wasn't
too crushed over this.

> 
> 
> But  otherwise,  one way  to  make lisp  code  faster,  when the  lisp
> compiler used support them, is to add type declarations.
>          

Which is what is tripping me up.  Every time I add a type declaration,
the code executes more slowly, sometimes a lot more slowly.  I haven't
yet had time to dig into the disassembly of the routine; but I have
learned that some Lisps, at least, support the "fixnum-safety"
declaration (which isn't in the Hyperspec), that the range of
fixnums can be less than the usual C integer sizes, and that I can say
"loop for <var> of-type <type> ...".

Consider this a drunkard's walk through Lisp optimization.
From: Kaz Kylheku
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <1153768580.794389.179620@m79g2000cwm.googlegroups.com>
Bob Felts wrote:
> [... a bit long-winded, will get to a Lisp question ...]
>
> The company I work for is in a hiring phase (not for Lisp,
> unfortunately) and we ask all of our candidate to write small C
> functions on a whiteboard.  One of the standard problems is to implement
> "countBits", which counts all of the bits set in an integer.  [1]
>
> One solution uses the fact that and'ing n with (n - 1) removes the
> leastmost bit set in n:

That only works if n is positive, or if its negative value uses two's
complement representation, which may be ubiquitous across today's range
of hardware, but isn't required by the C standard!

For instance, let's do a four bit sign-magnitude example using -6. This
would be 1110 and one less than that, -7, is of course 1111.  If you
AND these together, you get 1110. So the least significant nonzero bit
is not erased. Your loop in fact hangs because n is not getting
smaller.

Here is another problem: the subtraction of 1 will overflow if the n
that comes in is the most negative value, INT_MIN.

These problems all go away if you use unsigned numbers. C unsigned
numbers have arithmetic properties that relate to their bit
representation in a stable way. (And in fact, the bit operators are
defined arithmetically. For instance, a right shift is defined by the
standard as a division by a power of two, rather than a relocation of
bits).

> int countBits(int n)
> {
>    int count;
>
>    for (count = 0; n; n &= (n - 1))
>       ++count;
>
>    return (count);
> }

If speed matters at all, you probably want to do something better, like
table lookup.

(And if speed doesn't count, why would you be doing it in C?)

Here is what it might look like, counting six bits at a time using 64
element lookup table:

  int countBits(unsigned long field)
  {
    static int table[64] = {
      0, 1, 1, 2, /* ... */
      /* ... exercise for reader ... */ 5, 6
    };

    int count = 0;

    while (field != 0) {
      count += table[field & 0x3F];
      field >>= 6;
    }

    return count;
  }

This is by no means the final word in bit counting.  See for instance:

http://graphics.stanford.edu/~seander/bithacks.html

> The C version, compiled using XCode 2.3, takes between 5 and 6 seconds.

Which, as you can see, is not a heck of a lot faster than the Lisp
results you are getting.

By the way, you really are not isolating the actual bit counting. You
are counting the time it takes to call the function and return from it.
Have you at least tried inlining it?

Lisp function calls are not quite the same thing as C function calls.

In Lisp, you can redefine a non-inlined function at run time. The check
that the number of arguments is correct will still take place.

By contrast, in C, you can lie to the compiler about the function
signature and still link the program and run it, possibly leading to a
crash or other unpredictable behavior.
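
If you do want to remove the per-call overhead on the Lisp side, the
usual approach is a global INLINE declaration via DECLAIM, roughly like
this (a sketch; whether and when the compiler honors it is
implementation-dependent):

  ;; Callers compiled after this point may get the loop expanded in
  ;; place, which also avoids the argument-count check mentioned above.
  (declaim (inline count-bits))

  (defun count-bits (n)
    (declare (type (unsigned-byte 29) n))
    (loop for i of-type (unsigned-byte 29) = n then (logand i (1- i))
          until (zerop i)
          summing 1 into count
          finally (return count)))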

> [1] interesting (at least to me) anecdote:  four candidates in a row,
> from fresh out of school to 20+ years of experience has initially
> treated sizeof(thing) as if sizeof returned the number of _bits_ in
> thing.

You will see all kinds of bozos when interviewing. It's shocking. For
instance, people who claim to have master's degrees in computer
science, but cannot whiteboard a box-and-arrow diagram showing a linked
list with a few nodes.
From: Bob Felts
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <1hizs2j.1ae29lin61iyN%wrf3@stablecross.com>
Kaz Kylheku <········@gmail.com> wrote:

> Bob Felts wrote:
> > [... a bit long-winded, will get to a Lisp question ...]
> >
> > The company I work for is in a hiring phase (not for Lisp,
> > unfortunately) and we ask all of our candidate to write small C
> > functions on a whiteboard.  One of the standard problems is to implement
> > "countBits", which counts all of the bits set in an integer.  [1]
> >
> > One solution uses the fact that and'ing n with (n - 1) removes the
> > leastmost bit set in n:
> 
> That only works if n is positive, or if its negative value uses two's
> complement representation, which may be ubiquitous across today's range
> of hardware, but isn't required by the C standard!

I know.  This is getting blown all out of proportion.  There are only
two reasons why I'm interested in this function:
1)  to see if candidates can actually write code, and
2)  to learn how to give Lisp the correct information so that the
compiler can generate faster code.  For instance, I managed to learn
that I can do:  (LOOP for i of-type <something> = ...)

It is not the case that just because I'm new to Lisp that I'm new to
computing.

[...]

> 
> If speed matters at all, you probably want to do something better, like
> table lookup.
> 
> (And if speed doesn't count, why would you be doing it in C?)

Because speed isn't always everything.  This is an exercise in coaxing
the most performance out of the Lisp compiler, regardless of the
algorithm.

> 
> > The C version, compiled using XCode 2.3, takes between 5 and 6 seconds.
> 
> Which, as you can see, is not a heck of a lot faster than the Lisp
> results you are getting.
> 

6 seconds -> 9 seconds is significant.  I'd like to see if I can use
Lisp for some skunkworks projects; but the code will have to run on an
underpowered CPU.

> By the way, you really are not isolating the actual bit counting. You
> are counting the time it takes to call the function and return from it.
> Have you at least tried inlining it?

I'll check the HyperSpec.

[...]

> 
> > [1] interesting (at least to me) anecdote:  four candidates in a row,
> > from fresh out of school to 20+ years of experience has initially
> > treated sizeof(thing) as if sizeof returned the number of _bits_ in
> > thing.
> 
> You will see all kinds of bozos when interviewing. 

I have.

> It's shocking. For instance, people who claim to have master's degrees in
> computer science, but cannot whiteboard a box-and-arrow diagram showing a
> linked list with a few nodes.

Or write a count-bits function.  ;-)
From: Sidney Markowitz
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <44c99f37$0$34493$742ec2ed@news.sonic.net>
Bob Felts wrote:
> Because speed isn't always everything.  This is an exercise is coaxing
> the most performance out of the Lisp compiler, regardless of the
> algorithm.
[...]
> 6 seconds -> 9 seconds is significant.  I'd like to see if I can use
> Lisp for some skunworks projects; but the code will have to run on an
> underpowered CPU.


This is a subject I'm interested in because the inner loop of my
simulation performs just this count-bits function some trillions of
times. Right now it is in C, based on some code I inherited. I want to
switch to using CL for as much as possible to make exploratory research
much easier.

The comb method that Rob Warnock suggested is not the fastest. A
slightly better method can be found in the link that Kaz gave, at
 http://graphics.stanford.edu/~seander/bithacks.html

Comparing 'comb' with that one (which I'll call 'multiply') in C using
gcc 4.0 with -O3 option on my 2GHz Intel Core duo MacBook, running a
loop that sums the countbits of 0...199999999 (summing and printing the
total so that the call to the count bits function is not optimized away
by the compiler):

  No-op, time just the loop: Elapsed time was 0.22 seconds

  With Comb: Elapsed time was 1.72 seconds

  With Multiply: Elapsed clock time was 1.57 seconds

The Lisp version was initially much slower of course, but it gets more
interesting when I add some declarations.

First of all, as was pointed out elsewhere in the thread, the maximum
fixnum is typically less than the full range of an integer in the
hardware because of tag bits. I'm using sbcl on an x86 architecture,
which has a 29 bit fixnum.

However, sbcl does have optimizations for arithmetic declared as
  (type (unsigned-byte 32))

When I used that instead of fixnum, I was able to use the same 32 bit
algorithms as I did in C and got the same performance speedup that I
found with fixnum declarations and tweaking the algorithm for 28 bit
numbers.

This brings home the point that getting maximum speedup like this is
always going to be compiler-dependent.

For these tests I declared (optimize (speed 3) (space 0) (safety 0)
(debug 0)) and I declared variables and expressions as unsigned-byte 32.

Instead of declaring functions as inline, I defined the functions in an
flet, which the sbcl documentation says allows its compiler to inline them.

So, for the first results, using No-op to test the loop, and using Comb,
Multiply, and logcount, trying to match the C code test as closely as
possible. I used (time ... ) to time the runs, but I'm only showing the
'real time' portion of the output here:

  No-op, time just the loop:  0.30 seconds

  Using logcount:  1.92 seconds

  Using Comb: 2.66 seconds

  Using Multiply: 2.42 seconds

As other people have noticed, using the logcount function is the
fastest. I looked at source code in sbcl and noticed that for the x86
platform that I am using, there is a compiler optimization defined for
logcount to use when it has a 32 bit integer argument. That optimization
is the Comb algorithm written in assembler and with at least one
unnecessary MOV instruction.

I wrote the equivalent using the Multiply algorithm, and got these
results with the improved logcount:

  Using 'multiply' logcount: 1.353 seconds

Now this is really interesting. By tweaking about 20 lines of assembler
code in the sbcl compiler, the resulting speed test is _faster_ than the
fastest C version, even though the no-op loop framework is 0.08 seconds
slower in Lisp than in C. The tweak is a generally useful speedup for
the compiler, which I'll submit to the sbcl developers. So this isn't
really "cheating" even if I did use some assembler rather than pure Lisp.

Also, while the C version is restricted to 32 bit numbers, the Lisp code
 will work with bignums, though much more slowly, simply by removing the
declarations that assert that the numbers are 32 bit.

Here's the code I used:

The C version of what I've called the Multiply algorithm:

int countBits2(unsigned int n)
{
  unsigned int const w = (n - ((n >> 1) & 0x55555555));
  unsigned int const c = (w & 0x33333333) + ((w >> 2)
                          & 0x33333333);
  unsigned int const x = (((c + (c >> 4)) & 0x0F0F0F0F)
                          * 0x01010101) >> 24;

  return x;
}

The C test loop:

#include <time.h>
#include <stdio.h>

int time2 (unsigned int count)
{
  unsigned int i;
  unsigned int tot=0;

  for (i=0; i<count; ++i)
    tot += countBits2(i);
  return tot;
}

int main(int argc, char**argv)
{
  clock_t clock1, clock2;
  float elapsed;
  int result;
  float units=CLOCKS_PER_SEC;

  clock1 = clock();
  result = time2(200000000);
  clock2 = clock();
  elapsed = (clock2 - clock1)/units;
  printf("Result %d Elapsed clock time was %g seconds\n",
         result, elapsed);

  return 0;
}

And the Lisp code, showing the macro I used for the test harness and the
definitions of tests using logcount, Comb, and Multiply:

(defmacro defcounttest (name type &body form)
  (flet ((declaretype (type &rest vars)
	   (when type
	     `(,(nconc (if (listp type)
                         (list 'type type)
                         (list type)) vars)))))
    `(defun ,name (n)
       (declare (optimize (safety 0)
                          (speed 3)
                          (space 0)
                          (debug 0))
		,@(declaretype type 'n))
       (flet ((countbits (n)
		(declare (optimize (safety 0)
                                   (speed 3)
                                   (space 0)
                                   (debug 0))
			 ,@(declaretype type 'n))
		,@form))
	 (let ((x 0))
	   ,(cons 'declare (declaretype type 'x))
	   (loop (setf n (the ,type (- n 1)))
	      (setf x (the ,type (+ x (countbits n))))
	      (when (= n 0) (return x))))))))


(defcounttest time1 (unsigned-byte 32)
  (the (unsigned-byte 32) (logcount n)))

(defcounttest time2 (unsigned-byte 32)
  (setf n (+ (logand n #x55555555) (logand (ash n -1) #x55555555)))
  (setf n (+ (logand n #x33333333) (logand (ash n -2) #x33333333)))
  (setf n (+ (logand n #x0f0f0f0f) (logand (ash n -4) #x0f0f0f0f)))
  (incf n (ash n -8))
  (incf n (ash n -16))
  (logand n #xff))

(defcounttest time2a (unsigned-byte 32)
  (let* ((w (- n (logand (ash n -1) #x55555555)))
	 (c (+ (logand w #x33333333) (logand (ash w -2) #x33333333)))
	 (x (ash (logand (* (logand (+ c (ash c -4)) #x0f0f0f0f)
                            #x01010101)
                         #xffffffff)
                 -24)))
    (declare (type (unsigned-byte 32) w x c))
    x))
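
With those definitions in place, the test calls look something like this
(a sketch; each generated function counts the bits of every value below
its argument and returns the total, mirroring the C harness):

(time (time1  200000000))   ; logcount
(time (time2  200000000))   ; Comb
(time (time2a 200000000))   ; Multiply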

And finally, the tweaked optimization for logcount in sbcl, from the
file src/compiler/x86/arith.lisp

(define-vop (unsigned-byte-32-count)
  (:translate logcount2)
  (:note "inline (unsigned-byte 32) logcount2")
  (:policy :fast-safe)
  (:args (arg :scs (unsigned-reg) :target w))
  (:arg-types unsigned-num)
  (:results (result :scs (unsigned-reg)))
  (:result-types positive-fixnum)
  (:temporary (:sc unsigned-reg :from (:argument 0)) w)
  (:generator 30
    (move result arg)
    (move w arg)

    (inst shr w 1)
    (inst and w #x55555555)
    (inst sub result w)

    (inst mov w result)
    (inst shr w 2)
    (inst and result #x33333333)
    (inst and w #x33333333)
    (inst add result w)

    (inst mov w result)
    (inst shr result 4)
    (inst add result w)
    (inst and result #x0f0f0f0f)
    (inst imul result #x01010101)
    (inst shr result 24)))


-- 
    Sidney Markowitz
    http://www.sidney.com
From: Lars Brinkhoff
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <8564hiqlfo.fsf@junk.nocrew.org>
Sidney Markowitz <······@sidney.com> writes:
> (defmacro defcounttest (name type &body form)
>   (flet ((declaretype (type &rest vars)
> 	   (when type
> 	     `(,(nconc (if (listp type)
>                          (list 'type type)
>                          (list type)) vars)))))
>     `(defun ,name (n)
>        (declare (optimize (safety 0)
>                           (speed 3)
>                           (space 0)
>                           (debug 0))
> 		,@(declaretype type 'n))

Both e.g. (declare (fixnum x)) and (declare (type fixnum n)) are
legal, so if you like, you can simplify your type declaration to

  (declare (type ,type n))
From: bradb
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <1154126315.474732.211200@m73g2000cwd.googlegroups.com>
Sidney Markowitz wrote:
<SNIP>
> And finally, the tweaked optimization for logcount in sbcl, from the
> file src/compiler/x86/arith.lisp
>
> (define-vop (unsigned-byte-32-count)
>   (:translate logcount2)
>   (:note "inline (unsigned-byte 32) logcount2")
>   (:policy :fast-safe)
>   (:args (arg :scs (unsigned-reg) :target w))
>   (:arg-types unsigned-num)
>   (:results (result :scs (unsigned-reg)))
>   (:result-types positive-fixnum)
>   (:temporary (:sc unsigned-reg :from (:argument 0)) w)
>   (:generator 30
>     (move result arg)
>     (move w arg)
>
>     (inst shr w 1)
>     (inst and w #x55555555)
>     (inst sub result w)
>
>     (inst mov w result)
>     (inst shr w 2)
>     (inst and result #x33333333)
>     (inst and w #x33333333)
>     (inst add result w)
>
>     (inst mov w result)
>     (inst shr result 4)
>     (inst add result w)
>     (inst and result #x0f0f0f0f)
>     (inst imul result #x01010101)
>     (inst shr result 24)))

Very cool!  Would it be possible to write your own functions in this
"SBCL Assembler", and have the compiler do the right thing?  Eg, could
you have written this new LOGCOUNT without modifying the SBCL compiler
itself?

Cheers
Brad
From: Christophe Rhodes
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <sqfygkom4d.fsf@cam.ac.uk>
"bradb" <··············@gmail.com> writes:

> Very cool!  Would it be possible to write your own functions in this
> "SBCL Assembler", and have the compiler do the right thing?  Eg, could
> you have written this new LOGCOUNT without modifing the SBCL compiler
> itself?

Yes.  (I think he did, didn't he?)
<http://sourceforge.net/mailarchive/forum.php?thread_id=177673&forum_id=4133>
is a message describing how to go about it.

Christophe
From: bradb
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <1154189974.889421.108140@h48g2000cwc.googlegroups.com>
Christophe Rhodes wrote:
> "bradb" <··············@gmail.com> writes:
>
> > Very cool!  Would it be possible to write your own functions in this
> > "SBCL Assembler", and have the compiler do the right thing?  Eg, could
> > you have written this new LOGCOUNT without modifing the SBCL compiler
> > itself?
>
> Yes.  (I think he did, didn't he?)
> <http://sourceforge.net/mailarchive/forum.php?thread_id=177673&forum_id=4133>
> is a message describing how to go about it.
>
> Christophe

Thanks very much for the link, quite enlightening.

Cheers
Brad
From: Sidney Markowitz
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <44cb6dcf$0$34566$742ec2ed@news.sonic.net>
bradb wrote:
> Very cool!  Would it be possible to write your own functions in this
> "SBCL Assembler", and have the compiler do the right thing?  Eg, could
> you have written this new LOGCOUNT without modifing the SBCL compiler
> itself?

Yes, as Christophe pointed out, define-vop will do that, as explained in
the message he linked to. Just correct the typo in what I posted by
changing the two places that say 'logcount2' to say 'logcount' (that was
an artifact of my testing that I forgot to remove before copying and
pasting), and evaluate the define-vop form while in package :sb-vm.

That will redefine the compiler assembler transform for logcount and 
you'll have the new logcount.
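
Concretely, something along these lines (a sketch; DEFINE-VOP is an
internal, unexported SBCL interface, so the details can change between
versions):

(in-package :sb-vm)
;; ...evaluate the corrected DEFINE-VOP form from the earlier post
;; here, with both occurrences of LOGCOUNT2 changed to LOGCOUNT...
(in-package :cl-user)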

BTW, an improved logcount is already in the current sbcl cvs sources, 
coincidentally submitted the day before I sent in my patch. It has the 
advantage of including similar speedup code for 64 bit x86 machines and 
also having a remarkable 2.5 times speedup of logcount of bignums.

-- 
     Sidney Markowitz
     http://www.sidney.com
From: bradb
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <1153800644.314450.75080@75g2000cwc.googlegroups.com>
Kaz Kylheku wrote:
<SNIP>
> If speed matters at all, you probably want to do something better, like
> table lookup.
>
> (And if speed doesn't count, why would you be doing it in C?)
>
> Here is what it might look like, counting six bits at a time using 64
> element lookup table:
>
>   int countBits(unsigned long field)
>   {
>     static int table[64] = {
>       0, 1, 1, 2, /* ... */
>       /* ... exercise for reader ... */ 5, 6
>     };
>
>     int count = 0;
>
>     while (field != 0) {
>       count += table[field & 0x3F];
>       field >>= 6;
>     }
>
>     return count;
>   }
>
Is this actually true anymore?  I would guess that on most modern
hardware, the weave method that Rob outlined would possibly be faster
than having to make a table access, and possibly incurring a cache miss.
I would guesstimate that the weave, having no memory accesses (except
for the constants) and no branches, would be better than table lookup
methods.  Any CPU gurus here able to shed light on this?

Cheers
Brad
From: Nathan Baum
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <1153804793.159028.165590@m79g2000cwm.googlegroups.com>
bradb wrote:
> Kaz Kylheku wrote:
> <SNIP>
> > If speed matters at all, you probably want to do something better, like
> > table lookup.
> >
> > (And if speed doesn't count, why would you be doing it in C?)
> >
> > Here is what it might look like, counting six bits at a time using 64
> > element lookup table:
> >
> >   int countBits(unsigned long field)
> >   {
> >     static int table[64] = {
> >       0, 1, 1, 2, /* ... */
> >       /* ... exercise for reader ... */ 5, 6
> >     };
> >
> >     int count = 0;
> >
> >     while (field != 0) {
> >       count += table[field & 0x3F];
> >       field >>= 6;
> >     }
> >
> >     return count;
> >   }
> >
> Is this actually true anymore?  I would guess that on most modern
> hardware, the weave method that Rob outlined would possibly be faster
> than having to make a table access, and possibly incuring a cache miss.
>  I would guestimate that the weave, having no memory accesses (except
> for the constants) and no branches would be better than table lookup
> methods.  Any CPU gurus here able to shed light on this?

I'm no guru, but I do have a compiler.

I used:

int bits[256]; /* initialised elsewhere */

int countBits(int n)        /* Assumes 32-bit ints */
{
  return bits[n & 0xff] + bits[n >> 8 & 0xff] + bits[n >> 16 & 0xff] +
bits[n >> 24 & 0xff];
}

(This is with the same test framework I used in my previous post in
this thread.)

To my deep surprise, the lookup is, on average, marginally faster.

For a few tens of tests on each, the lookup time ranges from 0.218
seconds to 0.311 seconds whilst the "weave" time ranges from 0.294
seconds to 0.368 seconds.

I'd fully expect the weave constants to be stored as immediate
operands, so it's really rather odd.

I tried translating the weave version into Lisp but it wasn't working
for me.
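
For what it's worth, here is one way it might be written for SBCL,
assuming a 32-bit argument (a sketch, essentially the same shape as
Sidney's time2 elsewhere in the thread; I haven't run it against the
harness above):

(defun count-bits-weave (n)
  (declare (optimize (speed 3) (safety 0))
           (type (unsigned-byte 32) n))
  ;; Masking after each step keeps every intermediate value inside 32
  ;; bits, which should let SBCL keep the arithmetic in machine words.
  (setf n (+ (logand n #x55555555) (logand (ash n -1) #x55555555)))
  (setf n (+ (logand n #x33333333) (logand (ash n -2) #x33333333)))
  (setf n (+ (logand n #x0f0f0f0f) (logand (ash n -4) #x0f0f0f0f)))
  (setf n (logand (+ n (ash n -8)) #xffffffff))
  (setf n (logand (+ n (ash n -16)) #xffffffff))
  (logand n #xff))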

Completely unexpectedly, I find the lookup table version in Lisp is
faster than any C variant, taking from 0.1 seconds to 0.184 seconds.

(defvar *table* (make-array '(256) :element-type 'fixnum))
(declaim (type (simple-array fixnum (256)) *table*))

(defun count-bits-table (n)
  (declare (fixnum n))
  (+ (aref *table* (mod n 256))
     (aref *table* (mod (truncate n 256) 256))
     (aref *table* (mod (truncate n (expt 2 16)) 256))
     (aref *table* (mod (truncate n (expt 2 24)) 256))))

(loop :for i :below 256
   :do (setf (aref *table* i) (logcount i)))

> 
> Cheers
> Brad
From: Nathan Baum
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <1153805278.648219.217450@h48g2000cwc.googlegroups.com>
Nathan Baum wrote:
> Completely unexpectedly, I find the lookup table version in Lisp is
> faster than any C variant, taking from 0.1 seconds to 0.184 seconds.

'Twas too good to be true. The loop was a tenth of the proper length.
It actually takes from 1.0 second to 1.84 seconds.
From: bradb
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <1153806110.862634.163100@75g2000cwc.googlegroups.com>
Nathan Baum wrote:
> bradb wrote:
> > Kaz Kylheku wrote:
> > <SNIP>
> > > If speed matters at all, you probably want to do something better, like
> > > table lookup.
> > >
> > > (And if speed doesn't count, why would you be doing it in C?)
> > >
> > > Here is what it might look like, counting six bits at a time using 64
> > > element lookup table:
> > >
> > >   int countBits(unsigned long field)
> > >   {
> > >     static int table[64] = {
> > >       0, 1, 1, 2, /* ... */
> > >       /* ... exercise for reader ... */ 5, 6
> > >     };
> > >
> > >     int count = 0;
> > >
> > >     while (field != 0) {
> > >       count += table[field & 0x3F];
> > >       field >>= 6;
> > >     }
> > >
> > >     return count;
> > >   }
> > >
> > Is this actually true anymore?  I would guess that on most modern
> > hardware, the weave method that Rob outlined would possibly be faster
> > than having to make a table access, and possibly incuring a cache miss.
> >  I would guestimate that the weave, having no memory accesses (except
> > for the constants) and no branches would be better than table lookup
> > methods.  Any CPU gurus here able to shed light on this?
>
> I'm no guru, but I do have a compiler.
>
> I used:
>
> int bits[256]; /* initialised elsewhere */
>
> int countBits(int n)        /* Assumes 32-bit ints */
> {
>   return bits[n & 0xff] + bits[n >> 8 & 0xff] + bits[n >> 16 & 0xff] +
> bits[n >> 24 & 0xff];
> }
>
> (This is with the same test framework I used in my previous post in
> this thread.)
>
> Deeply surprisingly (to me), the lookup is, on average, marginally
> faster.
>
> For a few tens of tests on each, the lookup time ranges from 0.218
> seconds to 0.311 seconds whilst the "weave" time ranges from 0.294
> seconds to 0.368 seconds.
>
> I'd fully expect the weave constants to be stored as immediate
> operands, so it's really rather odd.
>
> I tried translating the weave version into Lisp but it wasn't working
> for me.
>
> Completely unexpectedly, I find the lookup table version in Lisp is
> faster than any C variant, taking from 0.1 seconds to 0.184 seconds.
>
> (defvar *table* (make-array '(256) :element-type 'fixnum))
> (declaim (type (simple-array fixnum (256)) *table*))
>
> (defun count-bits-table (n)
>   (declare (fixnum n))
>   (+ (aref *table* (mod n 256))
>      (aref *table* (mod (truncate n 256) 256))
>      (aref *table* (mod (truncate n (expt 2 16)) 256))
>      (aref *table* (mod (truncate n (expt 2 24)) 256))))
>
> (loop :for i :upto 256
>    :do (setf (aref *table* i) (logcount i)))

In this micro benchmark the table will always be a cache hit.  In a
real application, that table might often be a cache miss.  Of course,
we're down to splitting hairs - I've personally never had to optimise
anything to take advantage of CPU cache hits.

Thanks for doing the legwork and checking it out :)

Brad
From: Nathan Baum
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <1153853173.052291.111570@p79g2000cwp.googlegroups.com>
bradb wrote:
> Nathan Baum wrote:
> > bradb wrote:
> > > Kaz Kylheku wrote:
> > > <SNIP>
> > > > If speed matters at all, you probably want to do something better, like
> > > > table lookup.
> > > >
> > > > (And if speed doesn't count, why would you be doing it in C?)
> > > >
> > > > Here is what it might look like, counting six bits at a time using 64
> > > > element lookup table:
> > > >
> > > >   int countBits(unsigned long field)
> > > >   {
> > > >     static int table[64] = {
> > > >       0, 1, 1, 2, /* ... */
> > > >       /* ... exercise for reader ... */ 5, 6
> > > >     };
> > > >
> > > >     int count = 0;
> > > >
> > > >     while (field != 0) {
> > > >       count += table[field & 0x3F];
> > > >       field >>= 6;
> > > >     }
> > > >
> > > >     return count;
> > > >   }
> > > >
> > > Is this actually true anymore?  I would guess that on most modern
> > > hardware, the weave method that Rob outlined would possibly be faster
> > > than having to make a table access, and possibly incuring a cache miss.
> > >  I would guestimate that the weave, having no memory accesses (except
> > > for the constants) and no branches would be better than table lookup
> > > methods.  Any CPU gurus here able to shed light on this?
> >
> > I'm no guru, but I do have a compiler.
> >
> > I used:
> >
> > int bits[256]; /* initialised elsewhere */
> >
> > int countBits(int n)        /* Assumes 32-bit ints */
> > {
> >   return bits[n & 0xff] + bits[n >> 8 & 0xff] + bits[n >> 16 & 0xff] +
> > bits[n >> 24 & 0xff];
> > }
> >
> > (This is with the same test framework I used in my previous post in
> > this thread.)
> >
> > Deeply surprisingly (to me), the lookup is, on average, marginally
> > faster.
> >
> > For a few tens of tests on each, the lookup time ranges from 0.218
> > seconds to 0.311 seconds whilst the "weave" time ranges from 0.294
> > seconds to 0.368 seconds.
> >
> > I'd fully expect the weave constants to be stored as immediate
> > operands, so it's really rather odd.
> >
> > I tried translating the weave version into Lisp but it wasn't working
> > for me.
> >
> > Completely unexpectedly, I find the lookup table version in Lisp is
> > faster than any C variant, taking from 0.1 seconds to 0.184 seconds.
> >
> > (defvar *table* (make-array '(256) :element-type 'fixnum))
> > (declaim (type (simple-array fixnum (256)) *table*))
> >
> > (defun count-bits-table (n)
> >   (declare (fixnum n))
> >   (+ (aref *table* (mod n 256))
> >      (aref *table* (mod (truncate n 256) 256))
> >      (aref *table* (mod (truncate n (expt 2 16)) 256))
> >      (aref *table* (mod (truncate n (expt 2 24)) 256))))
> >
> > (loop :for i :upto 256
> >    :do (setf (aref *table* i) (logcount i)))
>
> In this micro benchmark the table will always be a cache hit.  In a
> real application, that table might often be a cache miss.  Of course,
> we're down to splitting hairs - I've personally never had to optimise
> anything to take advantage of CPU cache hits.

That looks to be exactly right. If I have the loop randomly access some
memory, the lookup version is slower.

> Thanks for doing the legwork and checking it out :)
> 
> Brad
From: Rob Warnock
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <CfmdnSaajMTsUFjZnZ2dnUVZ_vOdnZ2d@speakeasy.net>
Nathan Baum <···········@btinternet.com> wrote:
+---------------
| Completely unexpectedly, I find the lookup table version in Lisp is
| faster than any C variant, taking from 0.1 seconds to 0.184 seconds.
+---------------

How fast if you just use LOGCOUNT?


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Nathan Baum
Subject: Re: Squeezing blood from a turnip
Date: 
Message-ID: <1153853174.096609.246470@b28g2000cwb.googlegroups.com>
Rob Warnock wrote:
> Nathan Baum <···········@btinternet.com> wrote:
> +---------------
> | Completely unexpectedly, I find the lookup table version in Lisp is
> | faster than any C variant, taking from 0.1 seconds to 0.184 seconds.
> +---------------
>
> How fast if you just use LOGCOUNT?

~0.8 seconds.

> -Rob
>
> -----
> Rob Warnock			<····@rpw3.org>
> 627 26th Avenue			<URL:http://rpw3.org/>
> San Mateo, CA 94403		(650)572-2607