From: Nicolas Neuss
Subject: Numerical performance of CMUCL
Date: 
Message-ID: <wsd7d13zgw.fsf@ortler.iwr.uni-heidelberg.de>
Hello.

Recently, there was a discussion about optimizing some piece of code
(goal was comparison with Python).  In that discussion, CMUCL was said
to be comparably fast to a C/C++ code (Marco Antoniotti).

Now, I tried to verify this for the case of computing a scalar product
between two fixed-size vectors (with a size too large to fit into the
second-level cache).  My C program, which measures an MFLOP rate, looks like this:

/*********************************
  mflop.c -- performance testing
  (C) Nicolas Neuss
  mflop.c is public domain.
*********************************/

#include <time.h>
#include <stdio.h>
#include <stdlib.h>

#define CURRENT_TIME \
    (((double)clock())/ \
    ((double)CLOCKS_PER_SEC))

/* don't fit in secondary cache */
#define Nlong 1000000
double x[Nlong], y[Nlong];

double ddot_rll (void) {
    int i;
    register double s = 0.0;
    for (i=0; i<Nlong; i++)
        s += x[i]*y[i];
    return s;
}

double test (double (*f)(void)) {
  int i, nr;
  double start_time, end_time;
  double s = 0.0;
  nr = 1;
  do {
      nr = 2*nr;
      start_time = CURRENT_TIME;
      for (i=0; i<nr; i++)
          s += (*f)();
      end_time = CURRENT_TIME;
  } while (end_time-start_time<5.0);
  return s+nr/(end_time-start_time);
}

void main (void) {
  int i;
  for (i=0; i<Nlong; i++)
    x[i] = y[i] = 0.0;
  printf("ddot_rll: %f\n",
    1.0e-6*2*Nlong*test(ddot_rll));
}


In Common Lisp (CMUCL) I have written the following program for the
time measurement:


(defconstant N-long 1000000)

(defvar x (make-array N-long :element-type 'double-float
                      :initial-element 0.0d0))
(defvar y (make-array N-long :element-type 'double-float
                      :initial-element 0.0d0))

(defmacro _f (form) `(the double-float ,form))
(defmacro _i (form) `(the fixnum ,form))

(defun ddot-rll (x y)
  (declare (optimize (speed 3) (compilation-speed 0) (safety 0) (debug 0)))
  (declare (values double-float))
  (declare (type (simple-array double-float (*)) x))
  (declare (type (simple-array double-float (*)) y))
  (let ((s 0.d0))
    (dotimes (i N-long s)
      (declare (type (simple-array double-float (*)) x))
      (declare (type (simple-array double-float (*)) y))
      (declare (type fixnum i))
      (declare (type double-float s))
      (setq s (_f (+ s (_f (* (_f (aref x i)) (_f (aref y i))))))))))

(defun measure-time (fn n-times)
  (let ((start-time (get-internal-run-time)))
    (dotimes (k n-times) (funcall fn x y))
    (- (get-internal-run-time) start-time)))

(defun blas-test (fn)
  (do* ((i 1 (* i 2))
    (time (measure-time fn i) (measure-time fn i)))
      ((> time internal-time-units-per-second)
       (/ time internal-time-units-per-second i))))


So, what is my problem?  First, I am not a Common Lisp expert,
therefore I would be grateful for any corrections or suggestions for
improvement.  Second, I am quite astonished that one has to declare
the types over and over again (e.g. the _f macro (which I got from a
message by Lieven Marchand) around the (aref x i)).  Without those
declarations, the compiler says it cannot optimize at some points.
Third, the resulting code is still very slow in comparison with C.  A
single call takes slightly more than 1 second, which corresponds to about
2 MFLOPS, while the C program achieves about 54 MFLOPS.

What am I doing wrong?  Is it possible to do the Lisp program better?
How?

Yours, 

Nicolas.

From: Raymond Toy
Subject: Re: Numerical performance of CMUCL
Date: 
Message-ID: <4nae853sbh.fsf@rtp.ericsson.se>
>>>>> "Nicolas" == Nicolas Neuss <·····@ortler.iwr.uni-heidelberg.de> writes:

[snip]

    Nicolas> In Common Lisp (CMUCL) I have written the following program for the
    Nicolas> time measurement:


    Nicolas> (defconstant N-long 1000000)

    Nicolas> (defvar x (make-array N-long :element-type 'double-float
    Nicolas>                       :initial-element 0.0d0))
    Nicolas> (defvar y (make-array N-long :element-type 'double-float
    Nicolas>                       :initial-element 0.0d0))

    Nicolas> (defmacro _f (form) `(the double-float ,form))
    Nicolas> (defmacro _i (form) `(the fixnum ,form))

    Nicolas> (defun ddot-rll (x y)
    Nicolas>   (declare (optimize (speed 3) (compilation-speed 0) (safety 0) (debug 0)))
    Nicolas>   (declare (values double-float))
    Nicolas>   (declare (type (simple-array double-float (*)) x))
    Nicolas>   (declare (type (simple-array double-float (*)) y))
    Nicolas>   (let ((s 0.d0))
    Nicolas>     (dotimes (i N-long s)
    Nicolas>       (declare (type (simple-array double-float (*)) x))
    Nicolas>       (declare (type (simple-array double-float (*)) y))
    Nicolas>       (declare (type fixnum i))
    Nicolas>       (declare (type double-float s))
    Nicolas>       (setq s (_f (+ s (_f (* (_f (aref x i)) (_f (aref y i))))))))))

[snip]

    Nicolas> So, what is my problem?  First, I am not a Common Lisp expert,
    Nicolas> therefore I would be grateful for any corrections or suggestions for
    Nicolas> improvement.  Second, I am quite astonished that one has to declare
    Nicolas> the types over and over again (e.g. the _f macro (which I got from a
    Nicolas> message by Lieven Marchand) around the (aref x i)).  Without those
    Nicolas> declarations, the compiler says it cannot optimize at some points.

It's because the x,y in ddot-rll actually refer to the defvar x,y.  (I
think.  I always get confused when function arguments have the same
name as defvars.)

Since you didn't declare the type of the defvar x,y, the compiler
doesn't know the types of x,y.

Change defvar x,y to defvar *x*, *y*, redo your code, and you'll get
something better, with or without the _f macro.
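
Something like this (a sketch, untested; essentially the shape the
corrected versions further down the thread take):

(defvar *x* (make-array N-long :element-type 'double-float
                        :initial-element 0.0d0))
(defvar *y* (make-array N-long :element-type 'double-float
                        :initial-element 0.0d0))

;; X and Y are now ordinary lexical parameters, so a single type
;; declaration on them is all the compiler needs.  (Alternatively one
;; could declaim the type of the globals, but renaming is the better fix.)
(defun ddot-rll (x y)
  (declare (optimize (speed 3) (safety 0))
           (type (simple-array double-float (*)) x y))
  (let ((s 0.0d0))
    (declare (type double-float s))
    (dotimes (i N-long s)
      (incf s (* (aref x i) (aref y i))))))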

However, it still won't be quite as good as C.  Using gcc
2.95.2 with a v9 option, the heart of ddot_rll is:

.LL6:
        sll     %g3, 3, %g2             ; %g3 is i.  %g2 is the byte offset
        ldd     [%o2+%g2], %f2          ; load up x[i]
        ldd     [%o1+%g2], %f4          ; load up y[i]
        fmuld   %f2, %f4, %f2           ; x[i]*y[i]
        add     %g3, 1, %g3             ; i = i + 1
        cmp     %g3, %o0                ; %g3 = N_long.  Are we done?
        ble,pt  %icc, .LL6              ; branch back if not done
        faddd   %f0, %f2, %f0           ; s = s + x[i]*y[i]

My version of CMUCL (with modified ddot-rll) produces:

      70: L0:   SLL        %NL1, 1, %NL0                ; NL1 = i (fixnum), NL0 = byte offset
      74:       ADD        1, %NL0                      
      78:       LDDF       [%A0+%NL0], %F2              ; get x[i]
      7C:       SLL        %NL1, 1, %NL0                ; redundant
      80:       ADD        1, %NL0                      ; redundant
      84:       LDDF       [%A1+%NL0], %F4              ; get y[i]
      88:       FMULD      %F2, %F4, %F2                ; x[i]*y[i]
      8C:       FADDD      %F0, %F2, %F0                ; s=s+x[i]*y[i]
      90:       ADD        4, %NL1                      ; i = i + 1
      94: L1:   SETHI      %hi(#x003D0800), %NL0
      98:       ADD        256, %NL0                    ; nl0 = N-long
      9C:       CMP        %NL1, %NL0                   ; are we done?
      A0:       BPLT       %ICC, L0                     ; go back
      A4:       NOP
 
So C has 8 instructions in the loop and CMUCL has 14.  My guess is that
CMUCL will achieve about half the FLOPS of C in this case.  If CMUCL were
smarter and removed the redundant ops and kept N-long in a register, the
loop would be 10 instructions, so it would be at about 80% of C.

Note also that gcc wasn't so smart either.  The sll instruction could
have been removed if it kept i scaled by 8.

Ray
From: Nicolas Neuss
Subject: Re: Numerical performance of CMUCL
Date: 
Message-ID: <wsofwl2ben.fsf@ortler.iwr.uni-heidelberg.de>
Hello again.  Thank you all!  Now it works much better.  I definitely
have to study defvar more closely.

I now get an MFLOPS rate of 40, which is not much worse than C (54).
Wonderful!

The new program also looks much nicer:

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

(defconstant N-long 1000000)

(defvar *x* (make-array N-long :element-type 'double-float
                        :initial-element 0.0d0))
(defvar *y* (make-array N-long :element-type 'double-float
                        :initial-element 0.0d0))

(defun ddot-rll (x y)
  (declare (optimize (speed 3) (compilation-speed 0)
                     (safety 0) (debug 0)))
  (declare (values double-float))
  (declare (type (simple-array double-float (*)) x y))
  (let ((s 0.0d0))
    (dotimes (i N-long s)
      (declare (type double-float s))
      (setq s (+ s (* (aref x i) (aref y i)))))))

(defun measure-time (fn n-times)
  (let ((start-time (get-internal-run-time)))
    (dotimes (k n-times) (funcall fn *x* *y*))
    (- (get-internal-run-time) start-time)))

(defun blas-test (fn)
  (do* ((i 1 (* i 2))
	(time (measure-time fn i) (measure-time fn i)))
      ((> time 5.0) (/ (* 2 i internal-time-units-per-second) time))))

(compile 'ddot-rll)
(blas-test #'ddot-rll)

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;


One minor point: when compiling I still get the message

----------------------------------------------------------------
In: LAMBDA (X Y)
  #'(LAMBDA (X Y)
      (DECLARE (OPTIMIZE # # # #))
      (DECLARE (VALUES DOUBLE-FLOAT))
      (DECLARE (TYPE # X Y))
      ...)
Note: Doing float to pointer coercion (cost 13) from S to "<return value>".

Compiling Top-Level Form: 

Compilation unit finished.
  1 note
----------------------------------------------------------------

Do you see how I can get rid of that?

Thanks again, Nicolas.
From: Pierre R. Mai
Subject: Re: Numerical performance of CMUCL
Date: 
Message-ID: <87k87893d9.fsf@orion.bln.pmsf.de>
Nicolas Neuss <·····@ortler.iwr.uni-heidelberg.de> writes:

Some minor additional notes:

> (defun ddot-rll (x y)
>   (declare (optimize (speed 3) (compilation-speed 0)
>                      (safety 0) (debug 0)))
>   (declare (values double-float))
>   (declare (type (simple-array double-float (*)) x y))
>   (let ((s 0.0d0))
>     (dotimes (i N-long s)
>       (declare (type double-float s))
>       (setq s (+ s (* (aref x i) (aref y i)))))))

Beware that bound and free declarations have different semantics.
Always try to put your type declarations into the construct that binds
the variable (thereby producing a bound declaration), if possible.  Hence
in the above code the declaration for s should be in the let construct,
not in the dotimes construct.
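
Concretely, that would look like this (the same function, with only the
declaration for s moved):

(defun ddot-rll (x y)
  (declare (optimize (speed 3) (compilation-speed 0)
                     (safety 0) (debug 0)))
  (declare (values double-float))
  (declare (type (simple-array double-float (*)) x y))
  (let ((s 0.0d0))
    (declare (type double-float s))   ; bound declaration, in the LET that binds S
    (dotimes (i N-long s)
      (setq s (+ s (* (aref x i) (aref y i)))))))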

> ----------------------------------------------------------------
> In: LAMBDA (X Y)
>   #'(LAMBDA (X Y)
>       (DECLARE (OPTIMIZE # # # #))
>       (DECLARE (VALUES DOUBLE-FLOAT))
>       (DECLARE (TYPE # X Y))
>       ...)
> Note: Doing float to pointer coercion (cost 13) from S to "<return value>".
> 
> Compiling Top-Level Form: 
> 
> Compilation unit finished.
>   1 note
> ----------------------------------------------------------------
> 
> Do you see how I can get rid of that?

The note tells you that ddot-rll will need to box the float s in order
to return it from ddot-rll to other functions.  This can only be
avoided if ddot-rll gets inlined into the calling function.  Since
this only happens on return (i.e. after 2 million flops) though, it is not
something that is costing you in your case, so don't worry.
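
If the boxing ever did matter, the portable way to ask for that inlining
is a sketch along these lines (whether the result then really stays
unboxed depends on the caller it gets inlined into; DDOT-TEST is just an
illustrative caller):

;; The INLINE proclamation has to be in effect both when DDOT-RLL and
;; when its callers are compiled.
(declaim (inline ddot-rll))

;; ... DEFUN DDOT-RLL as above ...

(defun ddot-test ()
  (ddot-rll *x* *y*))   ; call can be open-coded, no boxed return needed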

Regs, Pierre.

-- 
Pierre R. Mai <····@acm.org>                    http://www.pmsf.de/pmai/
 The most likely way for the world to be destroyed, most experts agree,
 is by accident. That's where we come in; we're computer professionals.
 We cause accidents.                           -- Nathaniel Borenstein
From: Bulent Murtezaoglu
Subject: Re: Numerical performance of CMUCL
Date: 
Message-ID: <87ofwklloe.fsf@kapi.internal>
[...] [on boxing return values]
    PRM> The note tells you that ddot-rll will need to box the float s
    PRM> in order to return it from ddot-rll to other functions.  This
    PRM> can only be avoided if ddot-rll gets inlined into the calling
    PRM> function.  Since this only happens on return (i.e. after 2
    PRM> million flops) though, it is not something that is costing you in
    PRM> your case, so don't worry.

Is this not one of the cases where CMUCL's block compile extension should help
without inlining?  Given the proper declarations (including ones that make
ddot-rll invisible outside the block), is the facility smart enough to 
figure out that only one function in this block is calling ddot-rll and
that that caller already expects a float?  This probably cannot be done with
ddot-rll passed as an argument, but would it be possible otherwise?
(I know I can check this, it is more convenient to post at the moment,
sorry!)

cheers,

BM
From: Pierre R. Mai
Subject: Re: Numerical performance of CMUCL
Date: 
Message-ID: <87puh07bes.fsf@orion.bln.pmsf.de>
Bulent Murtezaoglu <··@acm.org> writes:

> [...] [on boxing return values]
>     PRM> The note tells you that ddot-rll will need to box the float s
>     PRM> in order to return it from ddot-rll to other functions.  This
>     PRM> can only be avoided if ddot-rll gets inlined into the calling
>     PRM> function.  Since this only happens on return (i.e. after 2
>     PRM> million flops) though, it is not something that is costing you in
>     PRM> your case, so don't worry.
> 
> Is this not one of the cases where CMUCL's block compile extension should
> help without inlining?  Given the proper declarations (including ones

Of course; I just didn't feel like mentioning CMUCL-specific stuff,
since it didn't matter in his case anyway.  The important thing is
only that the compiler knows all the applicable call sites at compile
time, so that their expectations can be altered to deal with an unboxed
float.  Inlining pretty much guarantees that, because the only
applicable call-site is the current caller, whose expectations we can
control.  But block compilation is of course another way to restrict
the set of applicable call-sites to some bounded set, etc.

> that make ddot-rll invisible outside the block), is the facility smart
> enough to figure out that only one function in this block is calling ddot-rll
> and that that caller already expects a float?  This probably cannot be done
> with ddot-rll passed as an argument, but would it be possible otherwise?
> (I know I can check this, it is more convenient to post at the moment,
> sorry!)

Well, if only one caller exists (since there can't be any
callers outside the block for a non-externally-visible function),
block compilation will just inline the function anyway, since that
makes the most sense.  But this also works if there are more callers,
in which case CMU CL won't inline, yet still avoids boxing.  And it
sometimes works with variables, at least if the compiler can trace the
data flow sufficiently.
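
For reference, block compilation in CMU CL is set up roughly like this
(a sketch from memory of the manual; check EXT:START-BLOCK, EXT:END-BLOCK
and the :BLOCK-COMPILE / :ENTRY-POINTS arguments against your release;
DDOT-BENCH is just an illustrative entry point):

;; One block with DDOT-BENCH as its only entry point.  DDOT-RLL is then
;; local to the block, all of its call sites are visible, and the
;; double-float result can be passed unboxed.
(declaim (ext:start-block ddot-bench))

(defun ddot-rll (x y)
  (declare (optimize (speed 3) (safety 0))
           (type (simple-array double-float (*)) x y))
  (let ((s 0.0d0))
    (declare (type double-float s))
    (dotimes (i (length x) s)
      (incf s (* (aref x i) (aref y i))))))

(defun ddot-bench ()
  (ddot-rll *x* *y*))

(declaim (ext:end-block))

;; Or block-compile a whole file:
;; (compile-file "ddot.lisp" :block-compile t :entry-points '(ddot-bench))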

Regs, Pierre.

-- 
Pierre R. Mai <····@acm.org>                    http://www.pmsf.de/pmai/
 The most likely way for the world to be destroyed, most experts agree,
 is by accident. That's where we come in; we're computer professionals.
 We cause accidents.                           -- Nathaniel Borenstein
From: Knut Arild Erstad
Subject: Re: Numerical performance of CMUCL
Date: 
Message-ID: <slrn97rpu1.pn.knute+news@apal.ii.uib.no>
[Nicolas Neuss]
: 
: One minor point: when compiling I still get the message
<snip>
: Compilation unit finished.
:   1 note
: ----------------------------------------------------------------
: 
: Do you see how I can get rid of that?

To get rid of unavoidable notes, use the EXT:INHIBIT-WARNINGS declaration.
I often end up with declarations like

(declare (optimize (speed 3) (safety 0) #+cmu (ext:inhibit-warnings 3)))

See http://www.mindspring.com/~rtoy/software/cmu-user/node102.html for 
more info.

-- 
Knut Arild Erstad

The face of a child can say it all, especially the mouth part of the face.
	-- Jack Handy, Deep Thoughts
From: Jochen Schmidt
Subject: Re: Numerical performance of CMUCL
Date: 
Message-ID: <95enh8$gm66k$1@ID-22205.news.dfncis.de>
Nicolas Neuss wrote:

> Hello.
> 
> Recently, there was a discussion about optimizing some piece of code
> (goal was comparison with Python).  In that discussion, CMUCL was said
> to be comparably fast to a C/C++ code (Marco Antoniotti).
> 
> Now, I tried to verify this for the case of computing a scalar product
> between two fixed-size vectors (with a size too large to fit into the
> second-level cache).  My C program, which measures an MFLOP rate, looks like this:

<some code snipped>

> What am I doing wrong?  Is it possible to do the Lisp program better?
> How?

It could certainly be done better - I've not looked deeply into this, but
here is a pretty _highlevel_ version of the code that runs twice as fast
on my machine.

(defun ddot-rll (x y)
  (declare (type (simple-array double-float (*)) x y))
  (loop for xi of-type double-float across x
          for yi of-type double-float across y
          sum (the double-float (* xi yi)) into v-sum of-type double-float
          finally (return v-sum)))

By using less high-level constructs, I'm sure it could be made faster still.

Regards,

Jochen 
From: Lieven Marchand
Subject: Re: Numerical performance of CMUCL
Date: 
Message-ID: <m3n1c5ypcl.fsf@localhost.localdomain>
Nicolas Neuss <·····@ortler.iwr.uni-heidelberg.de> writes:

> So, what is my problem?  First, I am not a Common Lisp expert,
> therefore I would be grateful for any corrections or suggestions for
> improvement.  Second, I am quite astonished that one has to declare
> the types over and over again (e.g. the _f macro (which I got from a
> message by Lieven Marchand) around the (aref x i)).  Without those
> declarations, the compiler says it cannot optimize at some points.

I'm sorry I misled you with that post.  A lot of those declarations
are superfluous.  I posted a work-in-progress version in which I
nailed everything down.  Duane
Rettig has posted a follow-up to that article with far more sensible
code.

> Third, the resulting code is still very slow in comparison with C.  A
> single call takes slightly more than 1 second, which corresponds to about
> 2 MFLOPS, while the C program achieves about 54 MFLOPS.
> 
> What am I doing wrong?  Is it possible to do the Lisp program better?
> How?
> 

One thing that could play a role is that you're using special variables
fairly heavily, and they're not the same as global variables in C.

I'm sorry but I don't have CMUCL handy so I can't be more precise.

-- 
Lieven Marchand <···@wyrd.be>
Glaðr ok reifr skyli gumna hverr, unz sinn bíðr bana.
From: Christopher J. Vogt
Subject: Re: Numerical performance of CMUCL
Date: 
Message-ID: <3A7AFAEA.2C552E8A@computer.org>
Here is some of the basic code re-written:
(defconstant +nlong+ 1000000)

(defvar *x* (make-array +nlong+ :element-type 'double-float
                        :initial-element 0.0d0))

(defvar *y* (make-array +nlong+ :element-type 'double-float
                        :initial-element 0.0d0))

(defmacro d+ (one two)
  `(the double-float (+ (the double-float ,one) (the double-float ,two))))

(defmacro d* (one two)
  `(the double-float (* (the double-float ,one) (the double-float ,two))))

(defun ddot-rll ()
  (declare (optimize (safety 0) (space 0) (debug 0) (speed 3)))
  (let ((x *x*)
	(y *y*))
    (declare (type (simple-array double-float (1000000)) x y)) 
    (loop with s double-float = 0.0d0
	  for i fixnum from 0 below +nlong+
	  finally (return s) do
	  (setq s (d+ s (d* (aref x i) (aref y i)))))))

The inner loop works out to 10 instructions vs. the 8 that Raymond posted for C:
     3C0: L0:   FSTPD FR0
     3C2:       FLDD  [EAX+EDX*2+1]
     3C6:       FSTPD FR2
     3C8:       FLDD  [ECX+EDX*2+1]
     3CC:       FXCH  FR2
     3CE:       FMULD FR2
     3D0:       FADD-STI FR1
     3D2:       ADD   EDX, 4
     3D5:       CMP   EDX, 4000000
     3DB:       JL    L0

Here is how I tested:
(defun test ()
  (let ((start 0)
	(end 0)
	(count 1))
    (loop until (> (/ (- end start) internal-time-units-per-second) 5) do
	  (setq count (* 2 count))
	  (setq start (get-internal-run-time))
	  (loop repeat count do
		(funcall 'ddot-rll))
	  (setq end (get-internal-run-time)))
    (format t "~%~$ MFLOPS"
	    (/ (* 2.0 count) (/ (- end start)
                                internal-time-units-per-second)))))

(test)
150.59 MFLOPS

I've got a 450 MHz machine, and according to the above disassembly it takes
4 cycles per FLOP in this example, which should give about 112 MFLOPS.  The
report of 150 MFLOPS might therefore seem high, but the Pentium III I'm using
has multiple execution units, and since we are doing both integer and
floating-point operations, the result is not surprising and is in line with
my expectations.
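
Spelled out, the estimate is just the clock rate over cycles per FLOP:

;; 450 MHz at the assumed 4 cycles per FLOP:
(/ 450 4.0)   ; => 112.5, i.e. ~112 MFLOPS expected vs. the 150.59 measured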
From: Martti Halminen
Subject: Re: Numerical performance of CMUCL
Date: 
Message-ID: <3A7AEA9F.738F3D7B@solibri.com>
Nicolas Neuss wrote:

> Now, I tried to verify this for the case of computing a scalar product
> between two fixed-size vectors (with a size too large to fit into the
> second-level cache).  My C program, which measures an MFLOP rate, looks like this:


Well, I am a lousy C programmer, but still there seem to be some
oddities:


>   } while (end_time-start_time<5.0);    /* C */


>       ((> time internal-time-units-per-second) ...  ;; CL

Seems to me you are using 5 seconds as cutoff time on C and 1 second
with CL.


>   return s+nr/(end_time-start_time);   /* C */

>        (/ time internal-time-units-per-second i))))  ;; CL

Seems to me you are returning inverse values in C vs CL: assuming for
example nr = i = 16 and 5 seconds, you get 3.2 in C and 5/16 in CL?

>  printf("ddot_rll: %f\n",
>   1.0e-6*2*Nlong*test(ddot_rll));

What is the reason for multiplying the C result by 2?


Unless I totally misunderstood, you are measuring rather different
things in your C and CL programs.
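
For comparison, a sketch of one formula both programs could report (the
factor 2 presumably counts one multiply plus one add per vector element):

;; MFLOPS = 1e-6 * 2 * N-long * calls / seconds
(defun mflops (n-long calls seconds)
  (/ (* 2 n-long calls) seconds 1e6))

;; e.g. 16 calls over vectors of length 1000000 in 5 seconds:
;; (mflops 1000000 16 5.0) => 6.4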

--
From: Pierre R. Mai
Subject: Re: Numerical performance of CMUCL
Date: 
Message-ID: <87puh0966v.fsf@orion.bln.pmsf.de>
Nicolas Neuss <·····@ortler.iwr.uni-heidelberg.de> writes:

> So, what is my problem?  First, I am not a Common Lisp expert,
> therefore I would be grateful for any corrections or suggestions for
> improvement.  Second, I am quite astonished that one has to declare
> the types over and over again (e.g. the _f macro (which I got from a
> message by Lieven Marchand) around the (aref x i)).  Without those
> declarations, the compiler says it cannot optimize at some points.
> Third, the resulting code is still very slow in comparison with C.  A
> single call takes slightly more than 1 second, which corresponds to about
> 2 MFLOPS, while the C program achieves about 54 MFLOPS.
> 
> What am I doing wrong?  Is it possible to do the Lisp program better?
> How?

Your major problem is the following part of your code:

> (defvar x (make-array N-long :element-type 'double-float
>                       :initial-element 0.0d0))
> (defvar y (make-array N-long :element-type 'double-float

and

> (defun ddot-rll (x y)

Don't do that.  Never name global special variables without leading
and trailing stars, i.e. the above should be

(defvar *x* ...)
(defvar *y* ...)

Otherwise it is very simple to run into the problems you did (which
are much more far-reaching than speed):  The global declaration of x
and y as special (i.e. dynamically bound) variables will be pervasive,
and hence x and y in ddot-rll will also not be lexical variables (as
you probably expected), but will reference the same dynamic bindings
as your global x and y do.

Therefore the compiler can't make any reliable assumptions on the
values of x and y, since anything outside of ddot-rll could influence
those.
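
A tiny demonstration of that pervasiveness, independent of the benchmark
code (a sketch; the function names are just for illustration):

(defvar x 1)            ; X is now globally proclaimed special

(defun show-x () x)     ; reads the current dynamic value of X

(defun call-it (x)      ; this X is a dynamic re-binding, not a lexical variable
  (show-x))

;; (show-x)     => 1
;; (call-it 42) => 42   ; the parameter dynamically shadowed the global binding,
;;                      ; so the compiler cannot treat X as a known lexical value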

Correcting this, and eliding all superfluous stuff, we get the
following version:

(defconstant vector-size 1000000)

(defvar *x* (make-array vector-size :element-type 'double-float
                      :initial-element 0.0d0))
(defvar *y* (make-array vector-size :element-type 'double-float
                      :initial-element 0.0d0))

(defun ddot-rll (x y)
  (declare (optimize (speed 3) (compilation-speed 0) (safety 0) (debug 0))
	   (values double-float)
	   (type (simple-array double-float (*)) x y))
  (let ((s 0.d0))
    (declare (type double-float s))
    (dotimes (i vector-size s)
      (declare (type fixnum i))
      (incf s (* (aref x i) (aref y i))))))

Neither the ugly _f or _i macros, nor duplicate type declarations are
needed (in fact it might be possible to get away with no declaration
on s and/or i, though I haven't tried), and CMU CL will only note one
inefficiency (that of having to box s at the end to return it), but
that is neither relevant, nor can it be avoided, unless one inlines
ddot-rll.  With that 100 iterations of ddot-rll take 9.35s real-time
on an AMD K6-2 550.  That's 100 x 1000000 float additions and the same
number of float multiplies per 9.35s or around 21 MFLOPS if one is
inclined to believe in such numbers, which is in the same ballpark as
gcc -O2 on the same machine, which gives around 24 MFLOPS.

Another, more concise and less fragile way of writing ddot-rll using
loop would be

(defun ddot-rll2 (x y)
  (declare (optimize (speed 3) (safety 0) (debug 0))
	   (type (simple-array double-float (*)) x y))
  (loop for x1 of-type double-float across x
	for y1 of-type double-float across y
	sum (* x1 y1) of-type double-float))

This gives nearly identical performance, and doesn't depend on a
compile-time constant length.

Regs, Pierre.

-- 
Pierre R. Mai <····@acm.org>                    http://www.pmsf.de/pmai/
 The most likely way for the world to be destroyed, most experts agree,
 is by accident. That's where we come in; we're computer professionals.
 We cause accidents.                           -- Nathaniel Borenstein
From: Tim Bradshaw
Subject: Re: Numerical performance of CMUCL
Date: 
Message-ID: <ey3ae804rmy.fsf@cley.com>
* Erik Naggum wrote:
>   Not so.  You may declaim the type of special variables and the compiler
>   is obliged to believe you under the same conditions it believes your
>   lexical type declarations.

The real problem -- I guess -- is not the declarations but the fact
that a special binding is really hard to get into a register, because
you need total knowledge of the program to do that, I think.

--tim
From: Oliver Bandel
Subject: Re: Numerical performance of CMUCL
Date: 
Message-ID: <95hqrs$ld@first.in-berlin.de>
Hello!

Nicolas Neuss <·····@ortler.iwr.uni-heidelberg.de> wrote:
[...]
> void main (void) {
>   int i;
>   for (i=0; i<Nlong; i++)
>     x[i] = y[i] = 0.0;
>   printf("ddot_rll: %f\n",
>     1.0e-6*2*Nlong*test(ddot_rll));
> }

[...]
> So, what is my problem?  First, I am not a Common Lisp expert,

I'm less of a CL expert than you, I think.

But let me say a little thing about your C program:

When you declare main as void, you may get into big trouble.
Always declare main as returning int, and return 0 on success.
This will help you avoid problems when your programming
environment looks at your return code.

Ciao,
  Oliver
-- 
Remember, Information is not knowledge; Knowledge is not Wisdom;
Wisdom is not truth; Truth is not beauty; Beauty is not love;
Love is not music; Music is the best.
                                      (Frank Zappa)
From: Tim Bradshaw
Subject: Re: Numerical performance of CMUCL
Date: 
Message-ID: <ey3elxc4rua.fsf@cley.com>
* Nicolas Neuss wrote:
> (defvar x (make-array N-long :element-type 'double-float
>                       :initial-element 0.0d0))
> (defvar y (make-array N-long :element-type 'double-float
>                       :initial-element 0.0d0))

> [...]

> (defun ddot-rll (x y)

> [...]


Here, the DEFVAR has made a global special declaration for X and Y,
and hence the bindings established in DDOT-RLL will also be special.
This will likely mean that the compiler makes no significant attempt
to optimize the code at all.  This is why it's a good idea to use the
`*' convention for special variable names!

It would be really cool if CMUCL could be taught to complain about not
being able to optimize something because a binding is special (:-)

--tim