From: Jacek Generowicz
Subject: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <tyf1x9l80xv.fsf@lxplus005.cern.ch>
I have finally got around to trying to learn about optimizing
numerical code in Lisp.

My guinea pig code initially ran roughly 10 times slower in CMUCL,
SBCL and OpenMCL than the essentially equivalent C code compiled with
gcc. (Slower by a factor of 30 if the -O3 flag is given.)

The code consists of evaluating a relatively complicated mathematical
function on a seven-dimensional grid, if certain functions of the
coordinates satisfy certain conditions. The coordinates are restricted
to be integers. Ultimately the aim is to run on a grid of size [-50 ..
50]^7, which at current speeds would take about 4 months in Lisp (or 4
days in C).

My initial attempts to speed up the code with type declarations have
taught me a fair bit about type declarations in Common Lisp already,
but have been less successful from the perspective of getting the code
to run significantly faster :-) I've only managed to knock about 20%
off the run time, so far. I would hope that the Lisp code can be
brought to within a factor of 2 or 3 of the C. I would appreciate some
hints as to how one should go about improving the performance in this
way.

However, I suspect that some interesting speedups could be had by
specializing the functions in the outer loops, at least, via partial
evaluation. (For example (see the code below), by moving P, Q and R to
the outer loops, the expression whose result is bound to ARG can be
replaced by a constant in the four inner loops.) Is this likely to be
true? Is there a CL partial evaluator available somewhere that might
help with this?
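
To make the idea concrete, here is the kind of hand-specialization I
have in mind (an untested sketch, written using the PLOOP macro from
the listing below): with P, Q and R moved to the outer loops, ARG
(and indeed Z and ZI) need only be computed once per (P Q R) triple:

(ploop p
  (ploop q
    (ploop r
      (when (and (not (zerop r)) (not (zerop p)))
        (let ((arg #i(  -6*p/r-9*q^^2/(4*r^^2)  )))
          (when (> arg 1)
            (let ((z (sqrt arg))
                  (zi #i(  - 3*q/(2*r)  )))
              (ploop eps
                (ploop omega
                  (ploop mu
                    (ploop rho
                      (when (= 0 #i(  eps*q - r*omega - mu*p  ))
                        ;; ... rest of the original body, with ARG, Z
                        ;; and ZI already bound ...
                        ))))))))))))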



The guinea pig code I'm working with is this:


;; We need Mark Kantrowitz' infix package
(load "/Users/jacek/lisp/utils/infix.lisp")

(defmacro ploop (name &body body)
  (let ((start -5)
        (stop 5)
        (step 1))
    `(loop for ,name from ,start to ,stop by ,step
           do ,@body)))

(defmacro nested-loop-over (names &body body)
  (setq names (reverse names))
  (do ((so-far `(ploop ,(first names) ,@body)
               (list 'ploop (first remaining) so-far))
       (remaining (rest names) (rest remaining)))
      ((not remaining) so-far)))
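
;; For example, (nested-loop-over (a b c) (f a b c)) expands to
;; (ploop a (ploop b (ploop c (f a b c)))).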

(let (;;(k (* 2 pi))
      ;;(c (/ pi 15))
      (e 0))

  (defun Wr0 (tt ti z zi eps q r omega mu p rho)
    #I(- e*tt + eps*z - p*(tt*zi+ti*z) - mu*z*zi -
       q*(tt*(z^^2-zi^^2)-2*ti*z*zi)/2.0 -
       r*(ti*(z^^3-3*z*zi^^2)+tt*(3*z^^2*zi-zi^^3))/6.0 +
       rho*(z^^3-3*z*zi^^2)/6.0))

  (defun Wi0 (tt ti z zi eps q r omega mu p rho)
    #i(omega - e*ti + eps*zi + mu*(z^^2-zi^^2)/2.0 +
       p*(tt*z-ti*zi) - q*(ti*(z^^2-zi^^2)+2.0*tt*z*zi)/2.0 +
       rho*(3.0*z^^2*zi-zi^^3)/6.0 +
       r*(tt*(z^^3-3.0*z*zi^^2)-ti*(3.0*zi*z^^2-zi^^3))/6.0))

  (defun absw (tt ti z zi eps q r omega mu p rho)
    #i( sqrt( wr0(tt ti z zi eps q r omega mu p rho)^^2 +
              wi0(tt ti z zi eps q r omega mu p rho)^^2 ))))

  

(time
 (with-open-file (data "/tmp/lsele0"
                       :direction :output
                       :if-exists :supersede)
   (nested-loop-over (eps q r omega mu p rho)
     (when (and (= 0 #i(  eps*q - r*omega - mu*p  ))
                (not (zerop r))
                (not (zerop p)))
       (let ((arg #i(  -6*p/r-9*q^^2/(4*r^^2)  )))
         (when (> arg 1)
           (let* ((z (sqrt arg))
                  (tt #i( z*(4*mu*r*p-q*r*eps-3*q*p*rho)/
                            (2*r*p*(8*p*r+3*q^^2)) )))
             (when (> tt 1)
               (let* ((ti #i(  - eps/(2*p) + 3*rho/(2*r)  ))
                      (zi #i(  - 3*q/(2*r)  ))
                      (absw (absw tt ti z zi eps q r omega mu p rho)))
                 (when (< absw 5)
                   (format data "~&~3d ~3d ~3d ~3d ~3d ~3d ~3d ~f ~f ~f"
                           eps q r omega mu p rho tt z absw)))))))))))

From: alex goldman
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <5137702.qY4WfEPC5L@yahoo.com>
Jacek Generowicz wrote:

> My guinea pig code initially ran roughly 10 times slower in CMUCL,
> SBCL and OpenMCL than the essentially equivalent C code compiled with
> gcc. (Slower by a factor of 30 if the -O3 flag is given.)

Can you post the C code as well?
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <tyfd5t5hsya.fsf@lxplus055.cern.ch>
alex goldman <·····@spamm.er> writes:

> Can you post the C code as well?

#include <stdio.h>
#include <math.h>

int e = 0;

float wr0(float tt, float ti, float z, float zi,
          int eps, int q, int r, int omega, int mu, int p, int rho) {
  return 
    - e*tt + eps*z - p*(tt*zi+ti*z) - mu*z*zi - 
    q*(tt*(z*z-zi*zi)-2*ti*z*zi)/2.0 - 
    r*(ti*(z*z*z-3*z*zi*zi)+tt*(3*z*z*zi-zi*zi*zi))/6.0 + 
    rho*(z*z*z-3*z*zi*zi)/6.0;
}

float wi0(float tt, float ti, float z, float zi,
          int eps, int q, int r, int omega, int mu, int p, int rho) {
  return
    omega - e*ti + eps*zi + mu*(z*z-zi*zi)/2.0 + 
    p*(tt*z-ti*zi) - q*(ti*(z*z-zi*zi)+2.0*tt*z*zi)/2.0 + 
    rho*(3.0*z*z*zi-zi*zi*zi)/6.0 + 
    r*(tt*(z*z*z-3.0*z*zi*zi)-ti*(3.0*zi*z*z-zi*zi*zi))/6.0;
}

float abswfn(float tt, float ti, float z, float zi,
             int eps, int q, int r, int omega, int mu, int p, int rho) {
  float re,im;
  re = wr0(tt, ti, z, zi, eps, q, r, omega, mu, p, rho);
  im = wi0(tt, ti, z, zi, eps, q, r, omega, mu, p, rho);
  return sqrt(re*re + im*im);
}

#define START -5
#define STOP 5
#define STEP 1

int main() {

  FILE* fp = fopen("/tmp/csele0", "w"); 

  int eps, q, r, omega, mu, p, rho;
  float arg, z, absw;

  float ti, tz, tt, zi;

  for (eps=START; eps<=STOP; eps+=STEP) {
    for (q=START; q<=STOP; q+=STEP) {
      for (r=START; r<=STOP; r+=STEP) {
        for (omega=START; omega<=STOP; omega+=STEP) {
          for (mu=START; mu<=STOP; mu+=STEP) {
            for (p=START; p<=STOP; p+=STEP) {
              for (rho=START; rho<=STOP; rho+=STEP) {
                if (eps*q - r*omega - mu*p == 0 &&
                    r != 0 &&
                    p != 0) {
                  arg = -6*p/(float)r-9*q*q/(float)(4*r*r);
                  if (arg > 1) {
                    z = sqrt(arg);
                    tt = z*(4*mu*r*p-q*r*eps-3*q*p*rho)/(float)(2*r*p*(8*p*r+3*q*q));
                    if (tt > 1) {
                      ti = - eps/(2*p) + 3*rho/(float)(2*r); 
                      zi = - 3*q/(float)(2*r);
                      absw = abswfn(tt, ti,z, zi, eps, q, r, omega, mu, p, rho);
                      if (absw < 5) {
                        fprintf(fp, "%3d %3d %3d %3d %3d %3d %3d %f %f %f\n",
                                 eps, q, r, omega, mu, p, rho, tt, z, absw);
                       }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
From: Kaz Kylheku
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <1113017235.458968.258500@l41g2000cwc.googlegroups.com>
Jacek Generowicz wrote:
>     q*(tt*(z*z-zi*zi)-2*ti*z*zi)/2.0 -

A small remark. In your Lisp, using the infix package, you wrote these
exponentiations as z^^2, which infix translates to (expt z 2). The
direct C equivalent is pow(z, 2). You have strength-reduced the C by
hand, but in the Lisp case, you are counting on the compiler to do it.
Hmm!
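
Concretely (just to illustrate):

#I( z^^2 )   ; reads as (expt z 2), a function call unless
             ; the compiler strength-reduces it
#I( z*z )    ; reads as (* z z), a single multiply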
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m2wtrc5ut2.fsf@genera.local>
"Kaz Kylheku" <···@ashi.footprints.net> writes:

> Jacek Generowicz wrote:
>>     q*(tt*(z*z-zi*zi)-2*ti*z*zi)/2.0 -
>
> A small remark. In your Lisp, using the infix package, you wrote these
> exponentiations as z^^2, which infix translates to (expt z 2). The
> direct C equivalent is pow(z, 2). You have strength-reduced the C by
> hand, but in the Lisp case, you are counting on the compiler to do it.
> Hmm!

Yes, in the ur-code I posted that is the case. But I did the reduction
by hand almost immediately before proceeding. Though I see that I left
a few in for some other variables. Thanks.
From: Antonio Menezes Leitao
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <87ekdjlupv.fsf@gia.ist.utl.pt>
Jacek Generowicz <················@cern.ch> writes:

> I have finally got around to trying to learn about optimizing
> numerical code in Lisp.
>
> My guinea pig code initially ran roughly 10 times slower in CMUCL,
> SBCL and OpenMCL than the essentially equivalent C code compiled with
> gcc. (Slower by a factor of 30 if the -O3 flag is given.)

You are not being entirely fair to the Lisp side.

While comparing the Lisp and C versions of your functions:

>   (defun Wr0 (tt ti z zi eps q r omega mu p rho)
>     #I(- e*tt + eps*z - p*(tt*zi+ti*z) - mu*z*zi - 
>      q*(tt*(z^^2-zi^^2)-2*ti*z*zi)/2.0 - 
>       r*(ti*(z^^3-3*z*zi^^2)+tt*(3*z^^2*zi-zi^^3))/6.0 + 
>        rho*(z^^3-3*z*zi^^2)/6.0))

> float wr0(float tt, float ti, float z, float zi,
>           int eps, int q, int r, int omega, int mu, int p, int rho) {
>   return 
>     - e*tt + eps*z - p*(tt*zi+ti*z) - mu*z*zi - 
>     q*(tt*(z*z-zi*zi)-2*ti*z*zi)/2.0 - 
>     r*(ti*(z*z*z-3*z*zi*zi)+tt*(3*z*z*zi-zi*zi*zi))/6.0 + 
>     rho*(z*z*z-3*z*zi*zi)/6.0;
> }

I noticed that your Lisp version shows several ^^ that are not present
in the C versions.  This means that you are replacing the simpler
multiplications with calls to the EXPT function.  Depending on the
compiler, one form might be more efficient than the other.

The same thing happens in all your other functions.  

In my benchmark, avoiding the EXPT (i.e., reusing the original C
version) makes the code run twice as fast.

António Leitão.
From: alex goldman
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <1377709.zKFyqeqOE5@yahoo.com>
Untested:

(eval-when (:compile-toplevel :load-toplevel :execute)
  (defun cat-symbols (&rest symbols)
    ;; one possible definition: concatenate the names into a new symbol
    (intern (format nil "~{~a~}" symbols))))

(defmacro def-arithm (op type prefix)
  (let ((name (cat-symbols prefix op)))
    `(progn
       (declaim (inline ,name))
       (defun ,name (x y)
         (declare (optimize (speed 3) (safety 0) (debug 0)))
         (the ,type (,op (the ,type x) (the ,type y)))))))

(def-arithm + single-float f)
(def-arithm * fixnum i)
;; ....
        
and use f+ , i* , etc instead of generic arithmetic. You'll also need
explicit type conversion (int to float, etc)
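
For instance (a sketch, using the thread's variable names):

(i* eps q)                          ; instead of (* eps q)
(f+ (coerce p 'single-float) z)     ; instead of (+ p z)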

Of course, a better solution would be to implement compile-time type
checking and implicit conversion (like in C) and make it work with "infix".
If you are a perfectionist, that would be the only way to do it. But then,
maybe someone has already done it.
From: rif
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <wj0oecpvtzm.fsf@five-percent-nation.mit.edu>
Are you compiling with optimizations turned on?  
I usually put

(declaim (optimize (speed 3) (safety 0) (debug 1)))

at the top of files I need to be fast.  Note that you should debug
first --- safety 0 means (at least on CMUCL) that array bounds and
types are not checked, so you can easily corrupt your CL process if
your code is buggy.

If you're not doing this, you cannot expect your code to be fast.  If
you *are* doing this, this will also give you a hint as to where the
slowness is; look through the compiler output.  These optimization
notes only appear when speed is turned up.  In general, the two things
to look for are "Using generic arithmetic", which can be fixed by type
declarations, or "Coercing return value", which can be fixed by
inlining or block compilation (under CMUCL).  
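
For example (a sketch; HYPOT2 is just a made-up function, not from
your code), declaring the argument and result types lets the compiler
open-code the float arithmetic instead of emitting the "Using generic
arithmetic" note:

(defun hypot2 (x y)
  (declare (single-float x y)
           (optimize (speed 3) (safety 0)))
  ;; THE asserts the result type, so callers needn't box it
  (the single-float (+ (* x x) (* y y))))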

The profiler is your friend.  I usually find that there are only a few
functions (in the inner loop) where type declarations and inlining are
important.

It would probably be considered better style to use optimize
declarations at the function level rather than a declaim at the file
level.  I don't seem to have this habit.

rif
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m2sm217xm2.fsf@genera.local>
rif <···@mit.edu> writes:

> Are you compiling with optimizations turned on?  

Yes.

> (declaim (optimize (speed 3) (safety 0) (debug 1)))

Indeed ... though the equivalent proclaim had no effect, and I
haven't yet got around to working out from the spec why that might be.

> In general, the two things to look for are "Using generic
> arithmetic", which can be fixed by type declarations, or "Coercing
> return value",

Got rid of most of these. Still seems too slow. (But I should get rid
of ALL of them, first.)

> It would probably be considered better style to use optimize
> declarations at the function level rather than a declaim at the file
> level.

Tried both. File level gave me too many notes all at once, so I
tried one function at a time.
From: ···············@yahoo.com
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <1112998896.505844.46610@f14g2000cwb.googlegroups.com>
There's another thread on the group that might be helpful: float to
pointer coercion (cost 13).  Searching the group for "speed 3" may give
you other ideas.
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m2ekdk99e5.fsf@genera.local>
···············@yahoo.com writes:

> There's another thread on the group that might be helpful: float to
> pointer coercion (cost 13).

What, you mean the thread that I started ? :-)
From: rif
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <wj0k6ncx6gj.fsf@five-percent-nation.mit.edu>
Again, it's all about the inner loop functions.  If you're doing
generic arithmetic in one inner loop function, this can easily
translate into a factor of 10 in overall performance, in my
experience.  Don't worry about getting rid of all generic arithmetic.
Profile your code.  Find out where all the time is being spent.
Optimize those functions.

rif
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m2br8o99cp.fsf@genera.local>
rif <···@mit.edu> writes:

> Again, it's all about the inner loop functions.

Got those down by a factor of 10, with no significant impact
overall. This is because calls to the inner functions are usually
avoided by the guards, and those guards themselves are quite
expensive.

> Profile your code.  Find out where all the time is being spent.

I always preach this myself, but it's not that easy in this case.

Firstly, I'm not sure how to use the CMUCL profiler: it seems you have
to tell it what to profile, rather than it telling you where the time
is spent. What am I missing?

Secondly, there isn't much to profile, in this case. It's all in one
place. It's just one hugely nested loop. I guess I could outline
various expressions within the loop into functions, and tell it to
profile those.
From: rif
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <wj0br8ox4q4.fsf@five-percent-nation.mit.edu>
Jacek Generowicz <················@cern.ch> writes:

> rif <···@mit.edu> writes:
> 
> > Again, it's all about the inner loop functions.
> 
> Got those down by a factor of 10, with no significant impact
> overall. This is because calls to the inner functions are usually
> avoided by the guards, and those guards themselves are quite
> expensive.

Sorry if I wasn't clear.  I was using "It's all about the inner loop
functions" as shorthand for "In nearly all programs, almost all the
time is spent in a tiny fraction of the code.  Find this piece and
optimize it."

> > Profile your code.  Find out where all the time is being spent.
> 
> I always preach this myself, but it's not that easy in this case.
> 
> Firstly, I'm not sure how to use the CMUCL profiler: it seems you have
> to tell it what to profile, rather than it telling you where the time
> is spent. What am I missing?

If your code is in package :foo, say

(profile:profile-all :package :foo)

Run your program, then use

(profile:report-time)

to see what's going on.

> Secondly, there isn't much to profile, in this case. It's all in one
> place. It's just one hugely nested loop. I guess I could outline
> various expressions within the loop into functions, and tell it to
> profile those.

You could do this.  I've done this before.  Of course, you might well
find that the function call overhead involved obscured everything
else.  If you're spending all the time in one huge nested loop, then
your problem is almost certainly either generic arithmetic or needless
consing inside that loop.

If you send me a working version of your code (something that compiles
and runs under CMUCL), I will take a look.

Cheers,

rif
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m27jjc98kg.fsf@genera.local>
rif <···@mit.edu> writes:

> "In nearly all programs, almost all the time is spent in a tiny
> fraction of the code.  Find this piece and optimize it."

Yes, I am perfectly aware of this.

> If your code is in package :foo, say
>
> (profile:profile-all :package :foo)

I guess you haven't seen the code I posted at the top of the thread.

>> I guess I could outline various expressions within the loop into
>> functions, and tell it to profile those.
>
> you might well find that the function call overhead involved
> obscured everything else.

That's what I fear too.

> If you send me a working version of your code (something that
> compiles and runs under CMUCL), I will take a look.

It's at the top of the thread.
From: rif
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <wj04qegx37c.fsf@five-percent-nation.mit.edu>
Jacek Generowicz <················@cern.ch> writes:

> rif <···@mit.edu> writes:
> 
> > "In nearly all programs, almost all the time is spent in a tiny
> > fraction of the code.  Find this piece and optimize it."
> 
> Yes, I am perfectly aware of this.

You had said that "optimizing the inner loop did not help, because the
inner loop was usually avoided by guards which were expensive."  In
that context, it should be clear that I meant "optimize the guards".
I made an "obvious" assertion only because you seemed to not
understand what I'd meant.  Sorry if I was unclear.

> > If your code is in package :foo, say
> >
> > (profile:profile-all :package :foo)
> 
> I guess you haven't seen the code I posted at the top of the thread.

Ah, I see.  You've really got just one function, more or less...

> > If you send me a working version of your code (something that
> > compiles and runs under CMUCL), I will take a look.
> 
> It's at the top of the thread.

I don't have "infix.lisp", so this is not a complete working program
for me.  Furthermore, you'd made clear through intervening messages
that you had added many declarations to the code in an attempt to
speed it up; it would make more sense for me to work on this version
rather than the completely unoptimized original.

Actually, looking at your code again, I'm a tiny bit confused.  Your
timing call is inside the file you compile? So basically you're
getting your timing as a top-level effect of loading the file?  And
all of your functions are defined inside the (let ((e 0)) ) ?

Cheers,

rif
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m2hdig7s4m.fsf@genera.local>
rif <···@mit.edu> writes:

> I don't have "infix.lisp"

http://www-cgi.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/lisp/code/syntax/infix/infix.cl

> Actually, looking at your code again, I'm a tiny bit confused.  Your
> timing call is inside the file you compile? So basically you're
> getting your timing as a top-level effect of loading the file?

Yes, that way a single SLIME chord does it all.

> And all of your functions are defined inside the (let ((e 0)) ) ?

Faster than a special variable ... though I haven't timed that.
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m2y8bs7t92.fsf@genera.local>
Jacek Generowicz <················@cern.ch> writes:

>> If you send me a working version of your code (something that
>> compiles and runs under CMUCL), I will take a look.
>
> It's at the top of the thread.

BTW, from the compiler notes, it seems that my biggest problems are
with rational arithmetic. Unfortunately, if I try to turn it into FP,
the guards (incorrectly) filter out absolutely everything (while the C
code gets it right with FP arithmetic).
From: rif
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <wj0zmw8voj7.fsf@five-percent-nation.mit.edu>
Jacek Generowicz <················@cern.ch> writes:

> Jacek Generowicz <················@cern.ch> writes:
> 
> >> If you send me a working version of your code (something that
> >> compiles and runs under CMUCL), I will take a look.
> >
> > It's at the top of the thread.
> 
> BTW, from the compiler notes, it seems that my biggest problems are
> with rational arithmetic. Unfortunately, if I try to turn it into FP,
> the guards (incorrectly) filter out absolutely everything (while the C
> code gets it right with FP arithmetic).

If you are using rational arithmetic against C's floating point, this
is a direct explanation of your slowdown.

CMUCL/SBCL's floating-point behavior is almost entirely IEEE compliant
(there are some issues relating to handling of divide-by-0 traps and so
forth).  Almost certainly, there is a way to express your
floating-point computation correctly in CL.  
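
For instance, with integer operands CL's / is exact:

(/ -6 4)     ; => -3/2, a RATIO (boxed, slow)
(/ -6.0 4)   ; => -1.5, a SINGLE-FLOAT

so the float literals in an expression determine, via CL's contagion
rules, whether the arithmetic runs in rationals or in floats.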

rif
From: rif
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <wj0vf6wvogx.fsf@five-percent-nation.mit.edu>
You probably know this already, but note that 3.0 is a
single-precision float in CL.  3.0d0 is a double-precision float.  I
got bit by that several times when I was starting.

rif
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m2ll7s7smz.fsf@genera.local>
rif <···@mit.edu> writes:

> You probably know this already, but note that 3.0 is a
> single-precision float in CL.  3.0d0 is a double-precision float.  I
> got bit by that several times when I was starting.

Nah, that's fine. What bit me was that 3. is an integer. (Can you tell
I copied the function definitions from a Fortran programmer ? :-)
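
That is:

(integerp 3.)    ; => T, the trailing dot just marks decimal radix
(floatp 3.0)     ; => T, a SINGLE-FLOAT by default
(floatp 3.0d0)   ; => T, a DOUBLE-FLOAT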
From: rif
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <wj0y8bsltsu.fsf@five-percent-nation.mit.edu>
Jacek Generowicz <················@cern.ch> writes:

> rif <···@mit.edu> writes:
> 
> > You probably know this already, but note that 3.0 is a
> > single-precision float in CL.  3.0d0 is a double-precision float.  I
> > got bit by that several times when I was starting.
> 
> Nah, that's fine. What bit me was that 3. is an integer. (Can you tell
> I copied the function definitions from a Fortran programmer ? :-)

So what's the current status?  Is it still either slow or not working
correctly?

rif
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m2d5t47rys.fsf@genera.local>
rif <···@mit.edu> writes:

>> Nah, that's fine. What bit me was that 3. is an integer. (Can you tell
>> I copied the function definitions from a Fortran programmer ? :-)
>
> So what's the current status?  Is it still either slow or not working
> correctly?

If you're referring to the 3. is an integer thing, that was ages ago,
before I started optimizing, and has no bearing on the current
status. So, yes, still slow or wrong.
From: rif
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <wj0u0mgltgd.fsf@five-percent-nation.mit.edu>
Fine.  If you want me to take a look, I am willing to do so, if you
will email me a complete, compilable, currently-best version of your
code.

rif
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m21x9k7q1a.fsf@genera.local>
Jacek Generowicz <················@cern.ch> writes:

> still slow or wrong.

Heh! So all the declarations I have added so far have improved
performance in CMUCL by almost a factor of 2 ... but slowed OpenMCL
down by a factor of almost 4.

I didn't expect the work done in CMUCL to necessarily improve things
in OpenMCL all that much, but I'm a bit surprised that it could be
quite that damaging.
From: Raffael Cavallaro
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <2005040900430060091%raffaelcavallaro@pasdespamsilvousplaitdotmaccom>
On 2005-04-08 20:27:29 -0400, Jacek Generowicz <················@cern.ch> said:

> Heh! So all the declarations I have added so far have improved
> performance in CMUCL by almost a factor of 2 ... but slowed OpenMCL
> down by a factor of almost 4.
> 
> I didn't expect the work done in CMUCL to necessarily improve things
> in OpenMCL all that much, but I'm a bit surprised that it could be
> quite that damaging.


If your code is at all floating point intensive, and you're using 
OpenMCL, you have to use Randall Beer's floating point compiler for 
OpenMCL:

<http://vorlon.cwru.edu/~beer/> near the bottom of the page under fp-ppc.

It can speed up floating point code to about sbcl or cmucl speeds, and 
can completely eliminate consing (in-place unboxed float mutations).

Of course, I haven't profiled your code, so this may not be a relevant 
bottleneck, but if floating point is slowing you down in OpenMCL you 
should definitely conditionalize your code to use fp-ppc under OpenMCL.
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m2sm205u5t.fsf@genera.local>
Raffael Cavallaro <················@pas-d'espam-s'il-vous-plait-dot-mac.com> writes:

> Randall Beer's floating point compiler for OpenMCL:
>
> <http://vorlon.cwru.edu/~beer/> near the bottom of the page under
> fp-ppc.

Now THAT's an interesting piece of code. Thank you very much for
pointing it out.
From: Christophe Rhodes
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <sqoecox4mf.fsf@cam.ac.uk>
Jacek Generowicz <················@cern.ch> writes:

> Firstly, I'm not sure how to use the CMUCL profiler: it seems you have
> to tell it what to profile, rather than it telling you where the time
> is spent. What am I missing?

A sampling profiler.

Christophe
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m23bu098j9.fsf@genera.local>
Christophe Rhodes <·····@cam.ac.uk> writes:

> Jacek Generowicz <················@cern.ch> writes:
>
>> Firstly, I'm not sure how to use the CMUCL profiler: it seems you have
>> to tell it what to profile, rather than it telling you where the time
>> is spent. What am I missing?
>
> A sampling profiler.

:-)
From: Brian Downing
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <vhE5e.302$Bb3.107@attbi_s22>
In article <··············@genera.local>,
Jacek Generowicz  <················@cern.ch> wrote:
> Christophe Rhodes <·····@cam.ac.uk> writes:
> > Jacek Generowicz <················@cern.ch> writes:
> >> Firstly, I'm not sure how to use the CMUCL profiler: it seems you have
> >> to tell it what to profile, rather than it telling you where the time
> >> is spent. What am I missing?
> >
> > A sampling profiler.
> 
> :-)

Seriously, SBCL has SB-SPROF, which is included in the default contribs.
This is a port of Gerd Moellmann's sampling profiler for CMUCL, though I
think quite a bit of enhancement has happened since it was ported to
SBCL.

What this will get you is:

  * Profiling runs are much faster.
  * Profiling overhead is not on function call boundaries, so it
    distorts your results less.
  * With SB-SPROF, disassembling a profiled function has annotations
    on each instruction showing where time was spent.  There's a
    function to print out annotated source code as well showing what
    parts of your function are the bottleneck.
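
Basic use is just a few forms (a sketch; RUN stands for whatever your
entry point is):

(require :sb-sprof)
(sb-sprof:start-profiling)
(run)
(sb-sprof:stop-profiling)
(sb-sprof:report :type :flat)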

How much does your code cons (TIME results should tell you)?  If it's pure
numeric bashing ideally it should cons nothing after you've allocated your
arrays.  Numeric consing usually really slows things down.

-bcd
-- 
*** Brian Downing <bdowning at lavos dot net> 
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m2psx47srn.fsf@genera.local>
Brian Downing <·············@lavos.net> writes:

> In article <··············@genera.local>,
> Jacek Generowicz  <················@cern.ch> wrote:
>> Christophe Rhodes <·····@cam.ac.uk> writes:
>>
>> What am I missing?
>> >
>> > A sampling profiler.
>> 
>> :-)
>
> Seriously, SBCL has SB-SPROF,

Ah, OK, so it seems I missed the point of Christophe's post too :-)

Thanks for the pointer.

> How much does your code cons ?

; Evaluation took:
;   4.65 seconds of real time
;   2.3 seconds of user run time
;   0.46 seconds of system run time
;   0 CPU cycles
;   [Run times include 0.48 seconds GC run time]
;   0 page faults and
;   83,129,480 bytes consed.

Hmmmm. Can't see where all that comes from. I'm just evaluating a
function in a nested loop.

(BTW, 0 CPU cycles ?!)
From: Brian Downing
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <WHF5e.400$Bb3.192@attbi_s22>
In article <··············@genera.local>,
Jacek Generowicz  <················@cern.ch> wrote:
> Ah, OK, so it seems I missed the point of Christophe's post too :-)
> 
> Thanks for the pointer.
> 
> > How much does your code cons ?
> 
> ; Evaluation took:
> ;   4.65 seconds of real time
> ;   2.3 seconds of user run time
> ;   0.46 seconds of system run time
> ;   0 CPU cycles
> ;   [Run times include 0.48 seconds GC run time]
> ;   0 page faults and
> ;   83,129,480 bytes consed.
> 
> Hmmmm. Can't see where all that comes from. I'm just evaluating a
> function in a nested loop.
> 
> (BTW, 0 CPU cycles ?!)

The CPU cycle count doesn't work on CMUCL/PPC.

Without much work I've managed to get the lisp code to run at 30%
the speed of "gcc -O3" on SBCL/x86.  These times are for -10 to 10.

CL-USER> (run)
Evaluation took:
  12.529 seconds of real time
  12.527095 seconds of user run time
  0.0 seconds of system run time
  0 page faults and
  1,199,080 bytes consed.

:; time ./7dim
./7dim  3.79s user 0.00s system 99% cpu 3.790 total

CL-USER> (/ 3.79 12.529)
0.3024982

I used SB-SPROF on it, and it looks like the grand majority of the time
is left in madly spinning around the loop and not doing floating point
math.  It's entirely possible gcc's instruction scheduler and
microoptimizations are a big win here.  Or something else could be
wrong.

The only notes remaining are float to pointer coercion going into the
formatter function (read: meaningless).

(deftype c-float () 'single-float)
(deftype c-int () '(signed-byte 32))
(deftype step-range () '(integer -100 100))

(defmacro ploop (name &body body)
  (let ((start -10)
        (stop 10)
        (step 1))
    `(loop for ,name of-type step-range from ,start to ,stop by ,step
           do ,@body)))

(defmacro nested-loop-over (names &body body)
  (setq names (reverse names))
  (do ((so-far `(ploop ,(first names) ,@body)
               (list 'ploop (first remaining) so-far))
       (remaining (rest names) (rest remaining)))
      ((not remaining) so-far)))

(declaim (inline absw))
(defun absw (tt ti z zi eps q r omega mu p rho)
  (declare (type c-float tt ti z zi)
           (type c-int eps q r omega mu p rho)
           (optimize (speed 3) (safety 0) (debug 0)))
  (let (;;(k (* 2 pi))
        ;;(c (/ pi 15))
        (e 0))
    (declare (type c-int e))
    (flet ((wr0 (tt ti z zi eps q r omega mu p rho)
             (declare (type c-float tt ti z zi)
                      (type c-int eps q r omega mu p rho)
                      (ignore omega))
             #I(- e*tt + eps*z - p*(tt*zi+ti*z) - mu*z*zi -
                q*(tt*(z^^2-zi^^2)-2*ti*z*zi)/2.0 -
                r*(ti*(z^^3-3*z*zi^^2)+tt*(3*z^^2*zi-zi^^3))/6.0 +
                rho*(z^^3-3*z*zi^^2)/6.0))
           (wi0 (tt ti z zi eps q r omega mu p rho)
             (declare (type c-float tt ti z zi)
                      (type c-int eps q r omega mu p rho))
             #i(omega - e*ti + eps*zi + mu*(z^^2-zi^^2)/2.0 +
                p*(tt*z-ti*zi) - q*(ti*(z^^2-zi^^2)+2.0*tt*z*zi)/2.0 +
                rho*(3.0*z^^2*zi-zi^^3)/6.0 +
                r*(tt*(z^^3-3.0*z*zi^^2)-ti*(3.0*zi*z^^2-zi^^3)) /6.0)))
      #i( sqrt( wr0(tt ti z zi eps q r omega mu p rho)^^2 +
                wi0(tt ti z zi eps q r omega mu p rho)^^2 )))))

(defun main (stream)
  (declare (type stream stream)
           (optimize (speed 3) (safety 0) (debug 0)))
  (nested-loop-over (eps q r omega mu p rho)
   (when (and (= 0 #i(  eps*q - r*omega - mu*p  ))
              (not (zerop r))
              (not (zerop p)))
     (let* ((rfloat (coerce r 'c-float)) ; KLUDGE? this makes arg not rational
            (pfloat (coerce p 'c-float)) ; same for ti below
            (arg #i(  -6*p/rfloat-9*q^^2/(4*rfloat^^2)  )))
       (when (> arg 1)
         (let* ((z (sqrt arg))
                (tt #i( z*(4*mu*r*p-q*r*eps-3*q*p*rho)/
                          (2*rfloat*p*(8*p*r+3*q^^2))) ))
           (declare (type c-float tt))
           (when (> tt 1)
             (let* ((ti #i(  - eps/(2*pfloat) + 3*rho/(2*rfloat)  ))
                    (zi #i(  - 3*q/(2*rfloat)  ))
                    (absw (absw tt ti z zi eps q r omega
                                mu p rho)))
               (declare (type c-float ti zi absw))
               (when (< absw 5)
                 (format stream "~&~3d ~3d ~3d ~3d ~3d ~3d ~3d ~f ~f ~f"
                         eps q r omega mu p rho tt z
                         absw))))))))))

(defun run ()
  (with-open-file (data "/tmp/lsele0"
                        :direction :output
                        :if-exists :supersede)
    (time (main data))))

-bcd
From: Barry Margolin
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <barmar-4526AB.21531708042005@comcast.dca.giganews.com>
In article <·················@attbi_s22>,
 Brian Downing <·············@lavos.net> wrote:

> I used SB-SPROF on it, and it looks like the grand majority of the time
> is left in madly spinning around the loop and not doing floating point
> math.  It's entirely possible gcc's instruction scheduler and
> microoptimizations are a big win here.  Or something else could be
> wrong.

Perhaps at this point it's time to try DISASSEMBLE, to see how the 
looping instructions compare in the two languages.
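
On the Lisp side that's just, e.g., (disassemble 'main) on Brian's code.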

-- 
Barry Margolin, ······@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
From: Brian Downing
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <iHG5e.3660$8Z6.2015@attbi_s21>
In article <····························@comcast.dca.giganews.com>,
Barry Margolin  <······@alum.mit.edu> wrote:
> In article <·················@attbi_s22>,
>  Brian Downing <·············@lavos.net> wrote:
> 
> > I used SB-SPROF on it, and it looks like the grand majority of the time
> > is left in madly spinning around the loop and not doing floating point
> > math.  It's entirely possible gcc's instruction scheduler and
> > microoptimizations are a big win here.  Or something else could be
> > wrong.
> 
> Perhaps at this point it's type to try DISASSEMBLE, to see how the 
> looping instructions compare in the two languages.

I was using DISASSEMBLE on the Lisp code - that's where the fine-grained
profiling information comes out.  I didn't go and do the same for the C
code - I'll leave that as an exercise for the reader.  :)

-bcd
-- 
*** Brian Downing <bdowning at lavos dot net> 
From: Brian Downing
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <9NF5e.3523$8Z6.1252@attbi_s21>
In article <·················@attbi_s22>,
Brian Downing  <·············@lavos.net> wrote:
> Without much work I've managed to get the lisp code to run at 30%
> the speed of "gcc -O3" on SBCL/x86.  These times are for -10 to 10.

Oh, and I checked the output - it's identical (modulo float printing
differences) to the C version.

-bcd
-- 
*** Brian Downing <bdowning at lavos dot net> 
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m2mzs85ruj.fsf@genera.local>
Brian Downing <·············@lavos.net> writes:

> Without much work I've managed to get the lisp code to run at 30%
> the speed of "gcc -O3" on SBCL/x86.

I now remember why I'd been ignoring SBCL: it fails to compile with
the error INFIX EXPRESSION ENDS IN NIL, and I can't for the life of me
work out where the problem might be. This is one of those occasions
where the debugger leaves me completely baffled.

On CMUCL, your version also runs at 30% of the gcc -O3 speed.

Thanks very much for this. I guess the best thing for me to
do, is to get it to compile under SBCL, so that I can play with
SB-SPROF.
From: Brian Downing
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <FrT5e.4297$8Z6.438@attbi_s21>
In article <··············@genera.local>,
Jacek Generowicz  <················@cern.ch> wrote:
> I now remember why I'd been ignoring SBCL: it fails to compile with
> the error INFIX EXPRESSION ENDS IN NIL, and I can't for the life of me
> work out where the problem might be. This is one of those occasions
> where the debugger leaves me completely baffled.

I was getting that when running it through C-c C-k in Slime, but not
when running (load (compile-file ...)) from the REPL.  I assumed
it was something Slime was mangling, though that may not be fair.

But you are not the only one who saw it.

-bcd
-- 
*** Brian Downing <bdowning at lavos dot net> 
From: Brian Downing
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <AF16e.1875$Bb3.253@attbi_s22>
In article <··················@attbi_s21>,
Brian Downing  <·············@lavos.net> wrote:
> I was getting that when running it through C-c C-k in Slime, but not
> when running (load (compile-file ...)) from the REPL.  I assumed
> it was something Slime was mangling, though that may not be fair.
> 
> But you are not the only one who saw it.

It turns out that this is a problem with infix.cl.  It calls READ
internally inside its main reader macro.  Slime is occasionally setting
*read-suppress* to T while reading the source (to find the position of
compiler notes to highlight in the editor).  However, infix's reader
function isn't rebinding *read-suppress* to nil inside, so its internals
are getting all mucked up by SBCL's (correct) behavior in this regard.

I see a commit for CMUCL's reader in one of the latest snapshots that
mimics the SBCL behavior, so I imagine it would behave the same way with
unmodified infix.

Here is a patch for infix.cl which makes it work for me.

--- infix.cl	1996-06-28 13:03:32.000000000 -0500
+++ infix.lisp	2005-04-09 22:57:03.786077368 -0500
@@ -294,20 +294,21 @@
 (defun infix-reader (stream subchar arg)
   ;; Read either #I(...) or #I"..."
   (declare (ignore arg subchar))
-  (let ((first-char (peek-char nil stream t nil t)))
-    (cond ((char= first-char #\space)
-	   (read-char stream)		; skip over whitespace
-	   (infix-reader stream nil nil))
-	  ((char= first-char #\")
-	   ;; Read double-quote-delimited infix expressions.
-	   (string->prefix (read stream t nil t)))
-	  ((char= first-char #\()
-	   (read-char stream)		; get rid of opening left parenthesis
-	   (let ((*readtable* *infix-readtable*)
-		 (*normal-readtable* *readtable*))
-	     (read-infix stream)))
-	  (t
-	   (infix-error "Infix expression starts with ~A" first-char)))))
+  (let ((*read-suppress* nil))
+    (let ((first-char (peek-char nil stream t nil t)))
+      (cond ((char= first-char #\space)
+             (read-char stream)         ; skip over whitespace
+             (infix-reader stream nil nil))
+            ((char= first-char #\")
+             ;; Read double-quote-delimited infix expressions.
+             (string->prefix (read stream t nil t)))
+            ((char= first-char #\()
+             (read-char stream)  ; get rid of opening left parenthesis
+             (let ((*readtable* *infix-readtable*)
+                   (*normal-readtable* *readtable*))
+               (read-infix stream)))
+            (t
+             (infix-error "Infix expression starts with ~A" first-char))))))
 
 (set-dispatch-macro-character #\# #\I #'infix-reader *readtable*) ; was #\# #\$
 
-bcd
-- 
*** Brian Downing <bdowning at lavos dot net> 
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m2u0mf4osj.fsf@genera.local>
Jacek Generowicz <················@cern.ch> writes:

> the best thing for me to do, is to get it to compile under SBCL, so
> that I can play with SB-SPROF.

(load (compile-file ...)) [instead of SLIME C-c C-k] gets around
the problem. But I'm not getting much further.

If I try

(require :sb-sprof)
(sb-sprof:start-profiling)
(do-whatever)

... SBCL runs forever and cannot be interrupted other than by killing
the process, almost every time. It tends to work more often for short
runs, and almost never for longer ones.

Anyone else seen this? 

(SBCL 0.8.12 on G4/OS X 10.3.8)
From: Juho Snellman
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <slrnd5jacu.7a5.jsnell@sbz-31.cs.Helsinki.FI>
<················@cern.ch> wrote:
[ sb-sprof ]
> ... SBCL runs forever and cannot be interrupted other than by killing
> the process, almost every time. It tends to work more often for short
> runs, and almost never for longer ones.
> 
> Anyone else seen this? 
> 
> (SBCL 0.8.12 on G4/OS X 10.3.8)

There have been some reports of sb-sprof instability on ppc. Right now
it's probably only reliable on x86 and x86-64.

-- 
Juho Snellman
"Premature profiling is the root of all evil."
From: Denis Bueno
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <v8toecli0ud.fsf@mornasule.stl.gtri.gatech.edu>
Jacek Generowicz <················@cern.ch> writes:
> Anyone else seen this? 
>
> (SBCL 0.8.12 on G4/OS X 10.3.8)

Maybe you should upgrade your SBCL? 0.8.21 is the latest version. Not sure
if this has anything to do with it, but it seems worth a try....

-Denis
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m2sm1x80jp.fsf@genera.local>
Denis Bueno <······@gmail.com> writes:

> Jacek Generowicz <················@cern.ch> writes:
>>
>> (SBCL 0.8.12 on G4/OS X 10.3.8)
>
> Maybe you should upgrade your SBCL? 0.8.21 is the latest version.

Typo on my part. The 1 and 2 should be transposed: I'm using 0.8.21
already.
From: Jacek Generowicz
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <m2is2w5rqy.fsf@genera.local>
Brian Downing <·············@lavos.net> writes:

> (let* ((rfloat (coerce r 'c-float)) ; KLUDGE? this makes arg not rational

... which is also the case in the C code. But I couldn't get Lisp to
give the right answers when I introduced such coercions.
From: rif
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <wj0k6nb96ks.fsf@five-percent-nation.mit.edu>
Brian Downing <·············@lavos.net> writes:
>               (not (zerop p)))
>      (let* ((rfloat (coerce r 'c-float)) ; KLUDGE? this makes arg not rational
>             (pfloat (coerce p 'c-float)) ; same for ti below
>             (arg #i(  -6*p/rfloat-9*q^^2/(4*rfloat^^2)  )))

I just changed the line defining arg to

(let ((arg #i(  -(6.0*p)/r-9.0*q^^2/(4.0*r^^2))))

This single change to the program, changing nothing else, sped it up
by a factor of 5 under CMUCL on my machine (while producing the same
answers).  (The float literals invoke CL's float contagion rules, so
ARG is computed in single-float rather than exact rational
arithmetic.)

rif
From: Raymond Toy
Subject: Re: Optimizing numerics: a) declarations b) partial evaluation
Date: 
Message-ID: <sxd4qedct8j.fsf@rtp.ericsson.se>
>>>>> "Brian" == Brian Downing <·············@lavos.net> writes:

    Brian> In article <··············@genera.local>,
    Brian> Jacek Generowicz  <················@cern.ch> wrote:
    >> Ah, OK, so it seems I missed the point of Christophe's post too :-)
    >> 
    >> Thanks for the pointer.
    >> 
    >> > How much does your code cons ?
    >> 
    >> ; Evaluation took:
    >> ;   4.65 seconds of real time
    >> ;   2.3 seconds of user run time
    >> ;   0.46 seconds of system run time
    >> ;   0 CPU cycles
    >> ;   [Run times include 0.48 seconds GC run time]
    >> ;   0 page faults and
    >> ;   83,129,480 bytes consed.
    >> 
    >> Hmmmm. Can't see where all that comes from. I'm just evaluating a
    >> function in a nested loop.
    >> 
    >> (BTW, 0 CPU cycles ?!)

    Brian> The CPU cycle count doesn't work on CMUCL/PPC.

That is partially fixed now.  It gives the number of clock ticks.
Unfortunately, I can't figure out how to convert clock ticks to
cycles.  On a 400MHz G3 iMac, 1 tick = 1 cycle, but on other the
relationship isn't so clear.

Ray