CMU-CL overoptimizes?

From: Kai Grossjohann
Subject: CMU-CL overoptimizes?
Date: Sat, 11 Apr 1998 00:00:00 +0000
Message-ID: <vafyaxcjf4u.fsf@ramses.cs.uni-dortmund.de>

On a whim, I thought let's compare the speed of a few language
implementations with respect to sieve{.c,.java,.cl}.  So, I wrote the
following program (gcc thinks it's C, though it probably isn't):

,-----
| #include <stdlib.h>
| #include <stdio.h>
|  
| int main (int argc, char** argv)
| {
|     int n = atoi(argv[1]);
|     int *prim;
|     int mult, i, j;
|     for ( j = 0; j<10; j++ ) {
|         prim = (int*)malloc(n*sizeof(int));
|         prim[0] = 1;
|         prim[1] = 1;
|         for ( mult=2; mult<n; mult++ ) {
|             for ( i=mult+mult; i<n; i+=mult ) {
|                 prim[i] = 1;
|             }
|         }
|         free(prim);
|     }
|     /*
|     for ( i=0; i<n; i++ ) {
|         if ( prim[i] == 0 ) {
|             printf("%d is prime.\n", i);
|         }
|     }
|     */
|     return(0);
| }
`-----

I compiled it with `gcc -Wall -O3' and ran it and it took 25 seconds.
Good.

I also wrote the following program.  (The compiler thinks it's CL but
I'm pretty sure it really isn't -- please be gentle, it's my second
Lisp program, right after Hello, world.)

,-----
| (in-package "CL-USER")
|           
| (declaim (optimize (speed 2) (debug 0) (safety 0) (space 0)))
|           
| (defun sieve (n)
|   (declare (type (and fixnum integer) n))
|   (let ((v (make-array (list n) 
|                        :element-type 'bit)))
|     (setf (bit v 0) 1)
|     (setf (bit v 1) 1)
|     (loop with mult fixnum = 2
|         while (< mult n)
|         do
|           (loop with i fixnum = (+ mult mult)
|               while (< i n)
|               do
|                 (setf (bit v i) 1)
|                 (incf i mult))
|           (incf mult)
|           (loop until (>= mult n)
|               until (zerop (bit v mult))
|               do (incf mult)))
|     v))   
|           
| (defun print-sieve (v)
|   (let ((i 0))
|     (map 'nil
|       (function (lambda (x)
|                   (declare (type (and fixnum integer) x))
|                   (unless (> x 0) (format t "~A is prime.~%" i))
|                   (incf i)))
|       v)))
`-----

I then wrote a little utility file:

,-----
| (load "sieve")
| (dotimes (i 10) (sieve 1000000))
| (exit)
`-----

I typed `time acl < /tmp/foo'.  It took 50 seconds.  Good.  A factor
of two isn't too bad.

I then replaced (exit) with (quit) for CMU-CL and did `lisp acl <
/tmp/foo'.  IT TOOK EIGHT SECONDS!!!

How can that be?  Is the CMU-CL compiler so smart that it optimizes
pretty much all of the code away?  Or does it really generate code
which is faster than C?  I have heard rumors of CL being almost as
fast or maybe a bit faster than C, but this --- I'm flabbergasted.

kai
-- 
Really cancel?   [OK]  [Cancel]

Re: CMU-CL overoptimizes? Bulent Murtezaoglu
- Re: CMU-CL overoptimizes? Kai Grossjohann
Re: CMU-CL overoptimizes? Joerg Hoehle
- Re: CMU-CL overoptimizes? Kai Grossjohann
  - Re: CMU-CL overoptimizes? Joerg Hoehle
    - Re: CMU-CL overoptimizes? Rainer Joswig
      - Re: CMU-CL overoptimizes? Bill Newman
Re: CMU-CL overoptimizes? Patrick Tufts
- Re: CMU-CL overoptimizes? Kai Grossjohann
Re: CMU-CL overoptimizes? Kai Grossjohann
- Re: CMU-CL overoptimizes? Rainer Joswig

From: Bulent Murtezaoglu
Subject: Re: CMU-CL overoptimizes?
Date: Sun, 12 Apr 1998 00:00:00 +0000
Message-ID: <87pvinldgk.fsf@isttest.bogus>

    KG> How can that be?  Is the CMU-CL compiler so smart that it
    KG> optimizes pretty much all of the code away?  Or does it really
    KG> generate code which is faster than C?  I have heard rumors of
    KG> CL being almost as fast or maybe a bit faster than C, but this
    KG> --- I'm flabbergasted.

I think I know part of the reason.  In your C code you are allocating a 
million ints.  On my machine that is 8Megs that you have to get 
from the OS, and bring in and out of your cache and processor.  The lisp
code is using bit arrays.

I was able to run your original code in 42 s.  After I turned your array 
into a char array the time went down to 28 s.  Then I wrote two utility 
functions that diddle the appropriate bits in appropriate places to get
the equivalent of the bit array.  That got me down to 6.4 sec.  I'm assuming
gcc -03 inlines leaf functions as advertised (a look at the .s file doesn't
seem to contradict this).  I'm not sure shifting a bit to set the bit array
position is the most efficient way to implement bit arrays in C. 
(C code fragment follows)

int bitdiddleset(unsigned int* arr, unsigned int index)
{
    arr[index/32] |= (((unsigned int) 1) << (index % 32));
}

CMUCL still wins however.  From the command line, your code took 2.6 sec.
this includes bringing in the lisp image over NFS!  Maybe CMUCL is really great
or maybe the two pieces of code are doing different things (haven't checked
anything other than the array stuff, I never learned the loop macro anyway).  
If it's the former, it would be interesting to compare the .s file with 
the output from the disassemble.  (will post if I do that)

BM

From: Kai Grossjohann
Subject: Re: CMU-CL overoptimizes?
Date: Tue, 14 Apr 1998 00:00:00 +0000
Message-ID: <vaflnt83wzv.fsf@ramses.cs.uni-dortmund.de>

Bulent Murtezaoglu <·····@servtech.com> writes:

  > I think I know part of the reason.  In your C code you are
  > allocating a million ints.  On my machine that is 8Megs that you
  > have to get from the OS, and bring in and out of your cache and
  > processor.  The lisp code is using bit arrays.

Right.  Changing that array of int to an array of char gets me down to
11 seconds (from 25 seconds) for the C version.  Hadn't thought of
this.

OK, now I know that CMU-CL didn't optimze away nine of the ten
iterations of the outer loop or anything like that.

kai
-- 
Have you been LJBF'd this week?

From: Joerg Hoehle
Subject: Re: CMU-CL overoptimizes?
Date: Sun, 12 Apr 1998 00:00:00 +0000
Message-ID: <6gqaaq$k9b@omega.gmd.de>

Kai Grossjohann (···········@ls6.cs.uni-dortmund.de) wrote:
: On a whim, I thought let's compare the speed of a few language
: implementations with respect to sieve{.c,.java,.cl}.


You should really be much more careful than that when trying to
compare systems and implementations, even with trivial benchmarks.

o First, your algorithms are not equivalent.  In C, you sieve
multiples of 2, 3, 4, 5, 6 etc. while in Lisp, you don't filter
non-primes (4, 6 etc.) again.

o Then in C, you allocate 4MB of storage for a million integers
(probably 32 bits) although you only use 1 bit, while in Lisp you use
bit vectors requiring 32KB of storage -- what's your cache size?

You ought to write similar algorithms before performing benchmarks.

o Then you measure timings by loading a complete development system (I
guess that's what "acl" does) in Lisp but only one tiny program in C.
We already had a thread about such comparisons here not too long ago.


: |         prim = (int*)malloc(n*sizeof(int));
[...]
: |         for ( mult=2; mult<n; mult++ ) {
: |             for ( i=mult+mult; i<n; i+=mult ) {
: |                 prim[i] = 1;
: |             }
: |         }

: I also wrote the following program.  (The compiler thinks it's CL but
: I'm pretty sure it really isn't -- please be gentle, it's my second
: Lisp program, right after Hello, world.)

: |   (let ((v (make-array (list n) 
: |                        :element-type 'bit)))
[...]
: |           (loop until (>= mult n)
: |               until (zerop (bit v mult))
: |               do (incf mult)))


: I typed `time acl < /tmp/foo'.  It took 50 seconds.  Good.  A factor
: of two isn't too bad.

: I then replaced (exit) with (quit) for CMU-CL and did `lisp acl <
: /tmp/foo'.  IT TOOK EIGHT SECONDS!!!

Regards,
	Jo"rg Ho"hle.
············@gmd.de		http://zeus.gmd.de/~hoehle/amiga-clisp.html

From: Kai Grossjohann
Subject: Re: CMU-CL overoptimizes?
Date: Tue, 14 Apr 1998 00:00:00 +0000
Message-ID: <vafogy43x3w.fsf@ramses.cs.uni-dortmund.de>

······@zeus.gmd.de (Joerg Hoehle) writes:

  > o First, your algorithms are not equivalent.  In C, you sieve
  > multiples of 2, 3, 4, 5, 6 etc. while in Lisp, you don't filter
  > non-primes (4, 6 etc.) again.

I'm not?  Hm.  Doesn't the following snippet

,-----
|           (loop with i fixnum = (+ mult mult)
|               while (< i n)   
|               do              
|                 (setf (bit v i) 1)
|                 (incf i mult))
`-----

set (bit v (* 2 mult)), (bit v (* 3 mult)), (bit v (* 4 mult)), ... to
true?

  > o Then in C, you allocate 4MB of storage for a million integers
  > (probably 32 bits) although you only use 1 bit, while in Lisp you use
  > bit vectors requiring 32KB of storage -- what's your cache size?

Hm.  Right.  Changing the C program from array of int to array of char
reduced the run time from 25 to 11 seconds.  Hadn't thought of this.

  > You ought to write similar algorithms before performing benchmarks.

Sorry.  I had completely overlooked the data structure size thing.

  > o Then you measure timings by loading a complete development system (I
  > guess that's what "acl" does) in Lisp but only one tiny program in C.
  > We already had a thread about such comparisons here not too long ago.

I run ACL once.  After that, "time acl -e '(exit)'" tells me 0.4
seconds.  Surely, times of less than 1 second are not crucial when
comparing 8, 25 and 50 seconds.

kai
-- 
Have you been LJBF'd this week?

From: Joerg Hoehle
Subject: Re: CMU-CL overoptimizes?
Date: Sat, 18 Apr 1998 00:00:00 +0000
Message-ID: <6hb5oe$65h@omega.gmd.de>

Kai Grossjohann (···········@ls6.cs.uni-dortmund.de) wrote:
: ······@zeus.gmd.de (Joerg Hoehle) writes:

:   > o First, your algorithms are not equivalent.  In C, you sieve
:   > multiples of 2, 3, 4, 5, 6 etc. while in Lisp, you don't filter
:   > non-primes (4, 6 etc.) again.

: I'm not?  Hm.  Doesn't the following snippet

: |                 (setf (bit v i) 1)

: set (bit v (* 2 mult)), (bit v (* 3 mult)), (bit v (* 4 mult)), ... to
: true?

Sorry, I didn't express myself clearly.  In Lisp you have the better
algorithm, not testing known non-primes like 4 & 6 for primality
again, while in C, you have the dumb routine.


: I run ACL once.  After that, "time acl -e '(exit)'" tells me 0.4
Weee, that's amazingly fast.

	Jo"rg Ho"hle.
············@gmd.de		http://zeus.gmd.de/~hoehle/source.html

From: Rainer Joswig
Subject: Re: CMU-CL overoptimizes?
Date: Sun, 19 Apr 1998 00:00:00 +0000
Message-ID: <joswig-1904981849410001@194.163.195.66>

In article <··········@omega.gmd.de>, ······@zeus.gmd.de (Joerg Hoehle) wrote:

> : I run ACL once.  After that, "time acl -e '(exit)'" tells me 0.4
> Weee, that's amazingly fast.

CMU CL on an Ultra is in the same league.

-- 
http://www.lavielle.com/~joswig/

From: Bill Newman
Subject: Re: CMU-CL overoptimizes?
Date: Mon, 20 Apr 1998 00:00:00 +0000
Message-ID: <wnewmanErpuuC.Er3@netcom.com>

Rainer Joswig (······@lavielle.com) wrote:
: In article <··········@omega.gmd.de>, ······@zeus.gmd.de (Joerg Hoehle) wrote:

: > : I run ACL once.  After that, "time acl -e '(exit)'" tells me 0.4
: > Weee, that's amazingly fast.

: CMU CL on an Ultra is in the same league.

: -- 
: http://www.lavielle.com/~joswig/

On my Micron/P200/64Mb box with Linux and CMUCL18a+PCL:

magic:~ $ time lisp -eval '(quit)'
Validating memory ... done.
Mapping 5021696 bytes at 0x1000000.
Mapping 7991296 bytes at 0x5000000.
; Loading #p"/home/newman/.cmucl-init.lisp".
;; Loading #p"/home/newman/lib/lisp/fasl.x86f".
;; Loading #p"/home/newman/lib/lisp/.x86f/abbr.x86f".
;; Loading #p"/home/newman/lib/lisp/.x86f/cmucl-ansi.x86f".
0.14user 0.06system 0:00.29elapsed 68%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1021major+192minor)pagefaults 0swaps
magic:~ $

  Bill Newman

From: Patrick Tufts
Subject: Re: CMU-CL overoptimizes?
Date: Tue, 14 Apr 1998 00:00:00 +0000
Message-ID: <6gvl93$fib$1@new-news.cc.brandeis.edu>

Kai Grossjohann <···········@ls6.cs.uni-dortmund.de> writes:

>On a whim, I thought let's compare the speed of a few language
>implementations with respect to sieve{.c,.java,.cl}.  So, I wrote the
>following program (gcc thinks it's C, though it probably isn't):
[C code snipped]
>|     /*
>|     for ( i=0; i<n; i++ ) {
>|         if ( prim[i] == 0 ) {
>|             printf("%d is prime.\n", i);
>|         }
>|     }
>|     */
>|     return(0);
>| }
>`-----

>I compiled it with `gcc -Wall -O3' and ran it and it took 25 seconds.
>Good.
[Lisp *file* snipped]

>I then wrote a little utility file:

>,-----
>| (load "sieve")
>| (dotimes (i 10) (sieve 1000000))
>| (exit)
>`-----

>I typed `time acl < /tmp/foo'.  It took 50 seconds.  Good.  A factor
>of two isn't too bad.

>I then replaced (exit) with (quit) for CMU-CL and did `lisp acl <
>/tmp/foo'.  IT TOOK EIGHT SECONDS!!!

It seems that you have a single C file, but two Lisp files.
And it seems that you are timing the file loading as well as execution.

This would make comparison between C and Lisp a bit muddy.

Also, it's not quite clear, but it seems you are running the Lisp code
once in ACL (Linux?), and once through CMUCL (Linux?).

Yes, CMUCL does excellent numeric optimizations.  However, your
benchmark might be measuring a difference in file load times.

--Pat

From: Kai Grossjohann
Subject: Re: CMU-CL overoptimizes?
Date: Tue, 14 Apr 1998 00:00:00 +0000
Message-ID: <vafra303xdl.fsf@ramses.cs.uni-dortmund.de>

·····@cs.brandeis.edu (Patrick Tufts) writes:

  > It seems that you have a single C file, but two Lisp files.  And
  > it seems that you are timing the file loading as well as
  > execution.

Yes.  Therefore, I expect Lisp (two files) to be slower than C (one
file).  The result was that Lisp was faster than C.  I was looking for
an explanation.

I think the explanation that I used an array of int in C whereas I
used an array of boolean in Lisp was the explanation.

Btw, the timings were 50 secs, 25 secs and 8 secs.  Starting the ACL
runtime system takes less than one second (if it has been started
before has has thus been brought in from disk into the buffer cache or
something).

kai
-- 
Have you been LJBF'd this week?

From: Kai Grossjohann
Subject: Re: CMU-CL overoptimizes?
Date: Tue, 14 Apr 1998 00:00:00 +0000
Message-ID: <vafiuoc3whc.fsf@ramses.cs.uni-dortmund.de>

Kai Grossjohann <···········@ls6.cs.uni-dortmund.de> writes:

  > On a whim, I thought let's compare the speed of a few language
  > implementations with respect to sieve{.c,.java,.cl}.

I thought I'd tell you about the things I have found out.

My original concern was that CMU-CL optimized away nine of the ten
iterations of the outer loop (as they didn't affect the result at all)
which would have meant a really smart compiler for dumb programs but
maybe not a smart compiler for smart programs.

As it turned out, the C code allocated an array of int whereas the
Lisp version used an array of boolean, and that made pretty much the
whole difference between CMU-CL and C: CMU-CL was 8 seconds and the C
version with an array of int was 25 seconds, with an array of char I
got 11 seconds.  Therefore, the CMU-CL runtime is legitimate.

The overall result I get from this that I get about the same execution
speed from Lisp programs as from C programs (give or take a factor of
two, say).  But developing C programs is so tedious as to be out of
the question.  Developing Perl or Java is nicer than C, but their
speed is *not* comparable to C at all.

Thus, the result was in favor of Lisp.

kai
-- 
Have you been LJBF'd this week?

From: Rainer Joswig
Subject: Re: CMU-CL overoptimizes?
Date: Wed, 15 Apr 1998 00:00:00 +0000
Message-ID: <joswig-1504981015480001@kraftbuch.lavielle.com>

In article <···············@ramses.cs.uni-dortmund.de>, Kai Grossjohann
<···········@ls6.cs.uni-dortmund.de> wrote:

> The overall result I get from this that I get about the same execution
> speed from Lisp programs as from C programs (give or take a factor of
> two, say).  But developing C programs is so tedious as to be out of
> the question.  Developing Perl or Java is nicer than C, but their
> speed is *not* comparable to C at all.
> 
> Thus, the result was in favor of Lisp.

The question is always what you are comparing?

If you are comparing low level stuff like bit diddling,
integer arithmetic, etc - than Lisp *can* be as fast as C.
Some would say there are examples where a good
compiler can generate much faster code. I guess this
is necessarily not the case with current compilers.
A native Lisp platform can be in the same league
as a C platform.

People don't stop there. They want small amounts of
code run *very* fast (bit blitting, ...) . Or they want large
amounts of code/data to run fast (3d animation systems,
complex simulations of the weather, ...).

So, how does this thing scale? What kind of penalty do you
have to pay?

Then you add abstractions. You are no longer developing with
defuns but with defmethods. List then will be replaced with
objects. Will object-oriented programming in Common Lisp
result in fast code comparable to C++? Would there
be a reason to accept a speed penalty? Would a speed penalty
matter in your application domain? If you for example
are developing a user interface in Common Lisp, this might not
matter at all.

-- 
http://www.lavielle.com/~joswig/