From: verec
Subject: help with numeric speed optimization
Date: 
Message-ID: <44de2021$0$635$5a6aecb4@news.aaisp.net.uk>
Hi.

I have been challenged to "port" to CL the following (for now
quite inaccurate) approximation of pi/2. Depending on the
speed outcome, CL may get considered for the project that
will be heavy on numeric computations.

C++ code
#include <iostream>
#include <cmath>
#include <ctime>

static double piOver2() {
    double sum = std::exp(-std::pow(-10.0, 2.0)) ;
    sum += std::exp(-std::pow(10.0, 2.0)) ;

    for (double x = -9.99 ; x < 10.0 ; x += 0.01) {
        double v = std::exp(-std::pow(x, 2.0)) ;
        sum += v + v ;
    }

    return sum * 0.005 ;
}

10,000 iterations take about 3s.

I then ported it to Java (replacing "std::" with "Math.")
and that takes about 9s.

I finally coded the Lisp version

(defun pi/2 ()
  (do ((sum
        (+ (exp (- (expt -10.0 2.0)))
           (exp (- (expt 10.0 2.0)))))
       (x -9.99 (+ x 0.01)))
      ((>= x 10.0) (* sum 0.005))
    (incf sum (* 2 (exp (- (expt x 2.0)))))))

But I just can't tell how long 10,000 iterations would
take because I lost patience after 2 minutes.

So I started inserting a few (declare) and (the float)
here and there:

(defun pi/2-opt ()
  (declare (optimize (safety 0) (debug 0) (speed 3)))
  (declare (float sum))
  (declare (float x))
  (do ((sum
        (+ (exp (- (expt -10.0 2.0)))
           (exp (- (expt 10.0 2.0)))))
       (x -9.99 (+ x 0.01)))
      ((>= x 10.0) (* sum 0.005))
    (incf sum (the float (* 2 (exp (- (expt x 2.0))))))))

CL-USER 54 > (time (dotimes (i 10000) (pi/2-opt)))
Timing the evaluation of (DOTIMES (I 10000) (PI/2-OPT))

user time    =     97.631
system time  =      2.084
Elapsed time =   0:02:05
Allocation   = 4877356064 bytes
0 Page faults
Calls to %EVAL    120050
NIL

At nearly 30 times slower, I just don't stand a chance.
I can probably sell Lisp's high-level, DSL and other features
at around a 5x or 10x speed penalty, but not at 30x.

What did I misunderstand about the use of (declare (optimize ...))
and (the float ...), or anything else that might affect the
speed so adversely?

Just in case this isn't clear, the point is not about getting
pi/2 but getting numeric speed for computations of complexity
similar to that used in this approximation of pi/2.

Many thanks
--
JFB

From: ·······@gmail.com
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <1155411231.468641.66500@m73g2000cwd.googlegroups.com>
verec wrote:
> Hi.
>
> I have been challenged to "port" to CL the following (for now
> quite inaccurate) approximation of pi/2. Depending on the
> speed outcome, CL may get considered for the project that
> will be havy on mumeric computations.
>
> C++ code
> #include <iostream>
> #include <cmath>
> #include <ctime>
>
> static double piOver2() {
>     double sum = std::exp(-std::pow(-10.0, 2.0)) ;
>     sum += std::exp(-std::pow(10.0, 2)) ;
>
>     for (double x = -9.99 ; x < 10.0 ; x += 0.01) {
>         double v = std::exp(-std::pow(x, 2.0)) ;
>         sum += v + v ;
>     }
>
>     return sum * 0.005 ;
> }
>
> 10,000 iterations take about 3s.
>
> I then went through to java (replacing "std::" with "Math.")
> and this takes about 9s.
>
> I finally coded the Lisp version
>
> (defun pi/2 ()
>   (do ((sum
>         (+ (exp (- (expt -10.0 2.0)))
>            (exp (- (expt 10.0 2.0)))))
>        (x -9.99 (+ x 0.01)))
>       ((>= x 10.0) (* sum 0.005))
>     (incf sum (* 2 (exp (- (expt x 2.0)))))))
>
> But I just can't tell how long 10,000 iterations would
> take because I lost patience after 2 minutes.
>
> So I started inserting a few (declare) and (the float)
> here and there:
>
> (defun pi/2-opt ()
>   (declare (optimize (safety 0) (debug 0) (speed 3)))
>   (declare (float sum))
>   (declare (float x))
>   (do ((sum
>         (+ (exp (- (expt -10.0 2.0)))
>            (exp (- (expt 10.0 2.0)))))
>        (x -9.99 (+ x 0.01)))
>       ((>= x 10.0) (* sum 0.005))
>     (incf sum (the float (* 2 (exp (- (expt x 2.0))))))))
>
> CL-USER 54 > (time (dotimes (i 10000) (pi/2-opt)))
> Timing the evaluation of (DOTIMES (I 10000) (PI/2-OPT))
>
> user time    =     97.631
> system time  =      2.084
> Elapsed time =   0:02:05
> Allocation   = 4877356064 bytes
> 0 Page faults
> Calls to %EVAL    120050
> NIL
>
> At nearly 30 times slower, I just don't stand a chance.
> I can probably push the high-level, DSL and other Lisp features
> at around 10x or 5x the speed penalty, but not at 30x.
>
> What did I misunderstand about the use of (declare (optimize ...
> and (the float ( ... or anything else that might affect the
> speed so adversely?

1. Note that the call to TIME also tells you that your function was
interpreted, not compiled.

2. I pasted your code in SBCL's REPL. Here's what I got:
  ; caught WARNING:
  ;   These variables are undefined:
  ;     SUM X
The declares must be in the scope of the declared variables, like so:

(read the standard to know where to put your declarations)

(defun pi/2-opt ()
  (declare (optimize (safety 0) (debug 0) (speed 3)))
  (do ((sum
	(+ (exp (- (expt -10.0 2.0)))
	   (exp (- (expt 10.0 2.0)))))
       (x -9.99 (+ x 0.01)))
      ((>= x 10.0) (* sum 0.005))
    (declare (type single-float x sum))
    (incf sum (the float (* 2 (exp (- (expt x 2.0))))))))

10000 iterations take ~2.9 seconds on a P-M 1.6 GHz.

3. I don't know which compiler you intend to use, but Python (the
compiler inside CMUCL/SBCL) is a bit moronic when it comes to
side-effected variables. Here's a
side-effect-free version that gives the same answer (but as doubles,
not single floats). In this case, 10000 iterations also take ~2.9
seconds, but I prefer this version :)

(defun pi/2 ()
  (declare (optimize (speed 3) (safety 0)))
  (labels ((inner (x acc)
	     (declare (type double-float x acc))
	     (if (>= x 10d0)
		 acc
		 (inner (+ 0.01 x)
			(+ acc (* 2 (exp (- (expt x 2)))))))))
    (* 0.005d0 (inner -9.99d0
		      (+ (exp (- (expt -10.0d0 2)))
			 (exp (- (expt  10.0d0 2))))))))

4. It is sometimes useful to compile different versions and then
examine their assembly with DISASSEMBLE. For heavily numeric code, one
doesn't have to understand the innards of the CL implementation, only
normal x86 + x87.
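
For instance, a minimal comparison (hypothetical function names; the
exact assembly will vary by implementation and platform):

(defun square-generic (x)
  ;; No type information: the compiler must emit generic arithmetic
  ;; and box the result.
  (* x x))

(defun square-declared (x)
  ;; With a declaration, the body should compile down to a handful
  ;; of floating-point instructions.
  (declare (type double-float x)
           (optimize (speed 3) (safety 0)))
  (* x x))

(disassemble (compile 'square-generic))
(disassemble (compile 'square-declared))

The difference in listing length between the two is usually a quick
proxy for how much boxing and generic dispatch the declarations
removed.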

Good Luck,
  Paul Khuong
From: verec
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <44de34e8$0$639$5a6aecb4@news.aaisp.net.uk>
On 2006-08-12 20:33:51 +0100, ·······@gmail.com said:

>> (defun pi/2-opt ()
>> (declare (optimize (safety 0) (debug 0) (speed 3)))
>> (declare (float sum))
>> (declare (float x))
>> (do ((sum
>> (+ (exp (- (expt -10.0 2.0)))
>> (exp (- (expt 10.0 2.0)))))
>> (x -9.99 (+ x 0.01)))
>> ((>= x 10.0) (* sum 0.005))
>> (incf sum (the float (* 2 (exp (- (expt x 2.0))))))))
> 1. Note that the call to TIME also tells you that your function was
> interpreted, not compiled.

CL-USER 58 > (compile 'pi/2-opt)
;;;*** Warning in PI/2-OPT: The definition of PI/2-OPT is already compiled.
PI/2-OPT
((PI/2-OPT #<STYLE-WARNING 100B2BDF>))
NIL

It already was compiled :-(

> 2. I pasted your code in SBCL's REPL. Here's what I got:
>   ; caught WARNING:
>   ;   These variables are undefined:
>   ;     SUM X
> The declares must be in the scope of the declared variables, like so:
> 
> (read the standard to know where to put your declarations)
> 
> (defun pi/2-opt ()
>   (declare (optimize (safety 0) (debug 0) (speed 3)))
>   (do ((sum
> 	(+ (exp (- (expt -10.0 2.0)))
> 	   (exp (- (expt 10.0 2.0)))))
>        (x -9.99 (+ x 0.01)))
>       ((>= x 10.0) (* sum 0.005))
>     (declare (type single-float x sum))
>     (incf sum (the float (* 2 (exp (- (expt x 2.0))))))))

Excellent. Thanks. I tried.

CL-USER 59 > (pi/2-opt)
1.3412517E-31

Oops. Definitely incorrect. So I rewrote the declare part
as:

    (declare (type float x sum))

CL-USER 60 > (pi/2-opt)
#C(1.7724538 -7.747651E-8)

OK. Back in the game :-)

CL-USER 61 > (time (dotimes (i 10000) (pi/2-opt)))
Timing the evaluation of (DOTIMES (I 10000) (PI/2-OPT))

user time    =     99.182

"-(

> 
> 10000 iterations take ~2.9 seconds on a P-M 1.6 GHz.
> 
> 3. I don't know which compiler you intend to use, but python is a bit
> moronic when it comes to side-effected variables. Here's a
> side-effect-free version that gives the same answer (but as doubles,
> not single floats). In this case, 10000 iterations also take ~2.9
> seconds, but I prefer this version :)
> 
> (defun pi/2 ()
>   (declare (optimize (speed 3) (safety 0)))
>   (labels ((inner (x acc)
> 	     (declare (type double-float x acc))
> 	     (if (>= x 10d0)
> 		 acc
> 		 (inner (+ 0.01 x)
> 			(+ acc (* 2 (exp (- (expt x 2)))))))))
>     (* 0.005d0 (inner -9.99d0
> 		      (+ (exp (- (expt -10.0d0 2)))
> 			 (exp (- (expt  10.0d0 2))))))))

CL-USER 63 > (pi/2-PK)
1.7724538905229454D0

Great. Correct too.

CL-USER 64 > (time (dotimes (i 10000) (pi/2-PK)))
Timing the evaluation of (DOTIMES (I 10000) (PI/2-PK))

user time    =     43.500

Your version definitely halves the time required!
That's excellent, but nowhere near the speed you
quote for your compiler/OS combo.

> 4. It is sometimes useful to compile different versions and then
> examine their assembly with DISASSEMBLE. For heavily numeric code, one
> doesn't have to understand the innards of the CL implementation, only
> normal x86 + x87.

I had tried to (disassemble 'pi/2-opt) but nearly fainted at the
sight of 104 lines of assembly :-(

> Good Luck,

Thank you very much! The lesson here for me is that I should
first try another compiler, as I just can't see why a copy and
paste of your version would run 15 times slower on my setup.

(LW 4.4.6 on a G4 @1.5GHz ... the fact that the disassembly shows
104 lines is probably a good hint that no optimization is applied
at all in the PE of LW?)

>   Paul Khuong
--
JFB
From: Thomas A. Russ
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <ymibqqnvv37.fsf@sevak.isi.edu>
verec <·····@mac.com> writes:

> On 2006-08-12 20:33:51 +0100, ·······@gmail.com said:
> > 1. Note that the call to TIME also tells you that your function was
> > interpreted, not compiled.
> 
> CL-USER 58 > (compile 'pi/2-opt)
> ;;;*** Warning in PI/2-OPT: The definition of PI/2-OPT is already compiled.
> PI/2-OPT
> ((PI/2-OPT #<STYLE-WARNING 100B2BDF>))
> NIL
> 
> It already was compiled :-(

The function PI/2-OPT may have been compiled, but what you tested with
TIME was the following:

   (time (dotimes (i 10000) (pi/2-opt)))

and the DOTIMES loop is interpreted, not compiled.  What I usually end
up doing to get around that is to write some simple test function like
so:

  (defun test (n) (dotimes (i n) (pi/2-opt)))

and then make sure TEST is compiled.  With dynamic linking, it also
means that subsequent changes to the tested function can be timed
without needing to recompile the TEST function.

    (time (test 10000))

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: Rob Warnock
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <0bSdnfwJbvh633zZnZ2dnUVZ_u-dnZ2d@speakeasy.net>
Thomas A. Russ <···@sevak.isi.edu> wrote:
+---------------
| verec <·····@mac.com> writes:
| > On 2006-08-12 20:33:51 +0100, ·······@gmail.com said:
| > > 1. Note that the call to TIME also tells you that your function was
| > > interpreted, not compiled.
| > 
| > CL-USER 58 > (compile 'pi/2-opt)
| > ;;;*** Warning in PI/2-OPT: The definition of PI/2-OPT is already compiled.
| > PI/2-OPT
| > ((PI/2-OPT #<STYLE-WARNING 100B2BDF>))
| > NIL
| > 
| > It already was compiled :-(
| 
| The function PI/2-OPT may have been compiled, but what you tested with
| TIME was the following:
|    (time (dotimes (i 10000) (pi/2-opt)))
| and the DOTIMES loop is interpreted, not compiled.
+---------------

Interesting. In CMUCL, TIME always compiles the form to be timed:

    cmu> (time (dotimes (i 100000000)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   0.11f0 seconds of real time
    ;   0.11f0 seconds of user run time
    ;   0.0f0 seconds of system run time
    ;   200,529,433 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    cmu> 


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: ·············@specastro.com
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <1155421739.442088.257760@p79g2000cwp.googlegroups.com>
Here are some good links on improving numerical computation speed in
Common Lisp:

"On using Common Lisp in scientific computing", by Nicolas Neuss.  This
has a pretty good tutorial-level introduction to how to do numerical
optimization in Common Lisp.
http://www.iwr.uni-heidelberg.de/organization/sfb359/PP/Preprint2002-40.ps.gz

"Fast Floating-Point Processing in Common Lisp", by Richard Fateman.
This is a more in-depth treatment.
http://www.cs.berkeley.edu/~fateman/papers/lispfloat.ps

Ken Anderson's Common Lisp Performance page:
http://openmap.bbn.com/~kanderso/performance/

Additionally, here's a listing of math oriented libraries available for
Common Lisp:
http://www.cliki.net/mathematics

Glenn
From: verec
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <44de5bd2$0$639$5a6aecb4@news.aaisp.net.uk>
On 2006-08-12 23:28:59 +0100, ·············@specastro.com said:

> Here are some good links on improving numerical computation speed in
> Common Lisp:
> 
> "On using Common Lisp in scientific computing", by Nicolas Neuss.  This
> has a pretty good tuturial-level introduction to how to do numerical
> optimization in Common Lisp.
> http://www.iwr.uni-heidelberg.de/organization/sfb359/PP/Preprint2002-40.ps.gz
> 
> "Fast Floating-Point Processing in Common Lisp", by Richard Fateman.
> This is a more in-depth treatment.
> http://www.cs.berkeley.edu/~fateman/papers/lispfloat.ps
> 
> Ken Anderson's Common Lisp Performance page:
> http://openmap.bbn.com/~kanderso/performance/
> 
> Additionally, here's a listing of math oriented libraries available for
> Common Lisp:
> http://www.cliki.net/mathematics

Much appreciated. Thanks!

> Glenn

--
JFB
From: Wade Humeniuk
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <YQqDg.3088$Nz6.726@edtnps82>
For LW 4.3.7

(defun pi/2-opt ()
   (declare (optimize (speed 3) (safety 0) (debug 0) (float 0)))
   (loop with sum of-type double-float = #.(realpart (+ (exp (- (expt -10.0d0 2.0d0)))
                                                        (exp (- (expt 10.0d0 2.0d0)))))
         for x of-type double-float from -9.99d0 by 0.01d0
         while (< x 10.0d0) do
         (incf sum (* 2d0 (exp (- (* (abs x) (abs x))))))
         finally (return (* sum 0.005d0))))


Removed some expt stuff, I think it is still correct.

CL-USER 47 > (pi/2-opt)
1.7724538509055174

CL-USER 46 > (time (dotimes (i 10000) (pi/2-opt)))
Timing the evaluation of (DOTIMES (I 10000) (PI/2-OPT))

user time    =      1.832
system time  =      0.000
Elapsed time =   0:00:02
Allocation   = 160984 bytes standard / 112783 bytes conses
0 Page faults
Calls to %EVAL    10035
NIL


CL-USER 44 > (disassemble 'pi/2-opt)
2067FC1A:
        0:      55               push  ebp
        1:      89E5             move  ebp, esp
        3:      83EC24           sub   esp, 24
        6:      C7042445240000   move  [esp], 2445
       13:      DD05E02A6920     fldl  [20692AE0]       ; 7.440151952041305E-44
       19:      DD5DF0           fstpl [ebp-10]
       22:      DD05C02B6920     fldl  [20692BC0]       ; -9.99
       28:      DD55F8           fstl  [ebp-8]
       31:      DD05882C6920     fldl  [20692C88]       ; 10.0
       37:      DED9             fcompp
       39:      DFE0             fnstsw ax
       41:      9E               sahf
       42:      0F8ABF000000     jp    L6
       48:      0F86B9000000     jbe   L6
L1:   54:      DD45F0           fldl  [ebp-10]
       57:      DD5DE0           fstpl [ebp-20]
       60:      DD45F8           fldl  [ebp-8]
       63:      DD55F8           fstl  [ebp-8]
       66:      DD05C00B1920     fldl  [20190BC0]       ; 0.0
       72:      DED9             fcompp
       74:      DFE0             fnstsw ax
       76:      9E               sahf
       77:      0F8AF6000000     jp    L7
       83:      0F86F0000000     jbe   L7
       89:      DD45F8           fldl  [ebp-8]
       92:      D9E0             fchs
       94:      DD5DE8           fstpl [ebp-18]
L2:   97:      DD45F8           fldl  [ebp-8]
      100:      DD55F8           fstl  [ebp-8]
      103:      DD05C00B1920     fldl  [20190BC0]       ; 0.0
      109:      DED9             fcompp
      111:      DFE0             fnstsw ax
      113:      9E               sahf
      114:      0F8ADC000000     jp    L8
      120:      0F86D6000000     jbe   L8
      126:      DD45F8           fldl  [ebp-8]
      129:      D9E0             fchs
      131:      DD5DF0           fstpl [ebp-10]
L3:  134:      DD45E8           fldl  [ebp-18]
      137:      DC4DF0           fmull [ebp-10]
      140:      D9E0             fchs
      142:      D9E5             fxam
      144:      DFE0             fnstsw ax
      146:      3500010000       xor   eax, 100
      151:      9E               sahf
      152:      760D             jbe   L4
      154:      7B23             jnp   L5
      156:      F6C402           testb ah, 2
      159:      741E             je    L5
      161:      DDD8             fstp  st(0)
      163:      D9EE             fldz
      165:      EB18             jmp   L5
L4:  167:      D9EA             fld2e
      169:      DEC9             fmulp st(1), st
      171:      D9C0             fld   st(0)
      173:      D9FC             frndint
      175:      D9C0             fld   st(0)
      177:      D9CA             fxch  st(2)
      179:      DEE1             fsubrp st(1), st
      181:      D9F0             f2xm1
      183:      D9E8             fld1
      185:      DEC1             faddp st(1), st
      187:      D9FD             fscale
      189:      DDD9             fstp  st(1)
L5:  191:      DD05102D6920     fldl  [20692D10]       ; 2.0
      197:      DEC9             fmulp st(1), st
      199:      DC45E0           faddl [ebp-20]
      202:      DD5DF0           fstpl [ebp-10]
      205:      DD45F8           fldl  [ebp-8]
      208:      DD05102C6920     fldl  [20692C10]       ; 0.01
      214:      DEC1             faddp st(1), st
      216:      DD55F8           fstl  [ebp-8]
      219:      DD05882C6920     fldl  [20692C88]       ; 10.0
      225:      DED9             fcompp
      227:      DFE0             fnstsw ax
      229:      9E               sahf
      230:      7A07             jp    L6
      232:      7605             jbe   L6
      234:      E947FFFFFF       jmp   L1
L6:  239:      DD45F0           fldl  [ebp-10]
      242:      DD05602E6920     fldl  [20692E60]       ; 0.005
      248:      DEC9             fmulp st(1), st
      250:      FF75E4           push  [ebp-1C]
      253:      FF75E0           push  [ebp-20]
      256:      FF75DC           push  [ebp-24]
      259:      8B75E8           move  esi, [ebp-18]
      262:      8975DC           move  [ebp-24], esi
      265:      8B75EC           move  esi, [ebp-14]
      268:      8975E0           move  [ebp-20], esi
      271:      8B75F0           move  esi, [ebp-10]
      274:      8975E4           move  [ebp-1C], esi
      277:      8B75F4           move  esi, [ebp-C]
      280:      8975E8           move  [ebp-18], esi
      283:      8B75F8           move  esi, [ebp-8]
      286:      8975EC           move  [ebp-14], esi
      289:      8B75FC           move  esi, [ebp-4]
      292:      8975F0           move  [ebp-10], esi
      295:      8B7500           move  esi, [ebp]
      298:      8975F4           move  [ebp-C], esi
      301:      83ED0C           sub   ebp, C
      304:      8B7510           move  esi, [ebp+10]
      307:      897504           move  [ebp+4], esi
      310:      DD5D0C           fstpl [ebp+C]
      313:      C74508450C0000   move  [ebp+8], C45
      320:      B501             moveb ch, 1
      322:      C9               leave
      323:      FF2530671120     jmp   [20116730]       ; SYSTEM::RAW-FAST-BOX-DOUBLE
L7:  329:      DD45F8           fldl  [ebp-8]
      332:      DD5DE8           fstpl [ebp-18]
      335:      E90DFFFFFF       jmp   L2
L8:  340:      DD45F8           fldl  [ebp-8]
      343:      DD5DF0           fstpl [ebp-10]
      346:      E927FFFFFF       jmp   L3
      351:      90               nop
      352:      90               nop
      353:      90               nop
NIL
From: Wade Humeniuk
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <PJrDg.1506$Ch.158@clgrps13>
This version is better than the last

(defun pi/2-opt ()
   (declare (optimize (speed 3) (safety 0) (debug 0) (float 0)))
   (loop with sum of-type double-float = #.(realpart (+ (exp (- (expt -10.0d0 2.0d0)))
                                                        (exp (- (expt 10.0d0 2.0d0)))))
         for x of-type double-float from -9.99d0 below 10.0d0 by 0.01d0
         do (incf sum (* 2d0 (exp (- (* x x)))))
         finally (return (* sum 0.005d0))))

CL-USER 1 > (time (dotimes (i 10000) (pi/2-opt)))
Timing the evaluation of (DOTIMES (I 10000) (PI/2-OPT))
; Loading fasl file C:\Program Files\Xanalys\LispWorks\lib\4-3-0-0\modules\util\callcoun.fsl

user time    =      1.231
system time  =      0.000
Elapsed time =   0:00:01
Allocation   = 164304 bytes standard / 113113 bytes conses
0 Page faults
Calls to %EVAL    10035
NIL

CL-USER 4 > (disassemble 'pi/2-opt)
; Loading fasl file C:\Program Files\Xanalys\LispWorks\lib\4-3-0-0\modules\util\disass.fsl
; Loading fasl file C:\Program 
Files\Xanalys\LispWorks\lib\4-3-0-0\patches\disassembler\0001\0001.fsl
; Loaded public patch DISASSEMBLER 1.1

; Loading fasl file C:\Program Files\Xanalys\LispWorks\lib\4-3-0-0\modules\memory\findptr.fsl
214EBE2A:
        0:      55               push  ebp
        1:      89E5             move  ebp, esp
        3:      83EC24           sub   esp, 24
        6:      C7042445240000   move  [esp], 2445
       13:      DD05500E6920     fldl  [20690E50]       ; 7.440151952041305E-44
       19:      DD5DF0           fstpl [ebp-10]
       22:      DD05300F6920     fldl  [20690F30]       ; -9.99
       28:      DD55F8           fstl  [ebp-8]
       31:      DD05780F6920     fldl  [20690F78]       ; 10.0
       37:      DED9             fcompp
       39:      DFE0             fnstsw ax
       41:      9E               sahf
       42:      7A5C             jp    L2
       44:      775A             ja    L2
L1:   46:      DD45F0           fldl  [ebp-10]
       49:      DD0570116920     fldl  [20691170]       ; 0.005
       55:      DEC9             fmulp st(1), st
       57:      FF75E4           push  [ebp-1C]
       60:      FF75E0           push  [ebp-20]
       63:      FF75DC           push  [ebp-24]
       66:      8B75E8           move  esi, [ebp-18]
       69:      8975DC           move  [ebp-24], esi
       72:      8B75EC           move  esi, [ebp-14]
       75:      8975E0           move  [ebp-20], esi
       78:      8B75F0           move  esi, [ebp-10]
       81:      8975E4           move  [ebp-1C], esi
       84:      8B75F4           move  esi, [ebp-C]
       87:      8975E8           move  [ebp-18], esi
       90:      8B75F8           move  esi, [ebp-8]
       93:      8975EC           move  [ebp-14], esi
       96:      8B75FC           move  esi, [ebp-4]
       99:      8975F0           move  [ebp-10], esi
      102:      8B7500           move  esi, [ebp]
      105:      8975F4           move  [ebp-C], esi
      108:      83ED0C           sub   ebp, C
      111:      8B7510           move  esi, [ebp+10]
      114:      897504           move  [ebp+4], esi
      117:      DD5D0C           fstpl [ebp+C]
      120:      C74508450C0000   move  [ebp+8], C45
      127:      B501             moveb ch, 1
      129:      C9               leave
      130:      FF2530671120     jmp   [20116730]       ; SYSTEM::RAW-FAST-BOX-DOUBLE
L2:  136:      DD45F8           fldl  [ebp-8]
      139:      DC4DF8           fmull [ebp-8]
      142:      D9E0             fchs
      144:      D9E5             fxam
      146:      DFE0             fnstsw ax
      148:      3500010000       xor   eax, 100
      153:      9E               sahf
      154:      760D             jbe   L3
      156:      7B23             jnp   L4
      158:      F6C402           testb ah, 2
      161:      741E             je    L4
      163:      DDD8             fstp  st(0)
      165:      D9EE             fldz
      167:      EB18             jmp   L4
L3:  169:      D9EA             fld2e
      171:      DEC9             fmulp st(1), st
      173:      D9C0             fld   st(0)
      175:      D9FC             frndint
      177:      D9C0             fld   st(0)
      179:      D9CA             fxch  st(2)
      181:      DEE1             fsubrp st(1), st
      183:      D9F0             f2xm1
      185:      D9E8             fld1
      187:      DEC1             faddp st(1), st
      189:      D9FD             fscale
      191:      DDD9             fstp  st(1)
L4:  193:      DD0550106920     fldl  [20691050]       ; 2.0
      199:      DEC9             fmulp st(1), st
      201:      DC45F0           faddl [ebp-10]
      204:      DD5DF0           fstpl [ebp-10]
      207:      DD45F8           fldl  [ebp-8]
      210:      DD05C80F6920     fldl  [20690FC8]       ; 0.01
      216:      DEC1             faddp st(1), st
      218:      DD55F8           fstl  [ebp-8]
      221:      DD05780F6920     fldl  [20690F78]       ; 10.0
      227:      DED9             fcompp
      229:      DFE0             fnstsw ax
      231:      9E               sahf
      232:      7A9E             jp    L2
      234:      779C             ja    L2
      236:      E93DFFFFFF       jmp   L1
      241:      90               nop
NIL
From: verec
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <44de50cf$0$639$5a6aecb4@news.aaisp.net.uk>
On 2006-08-12 22:28:15 +0100, Wade Humeniuk 
<··················@telus.net> said:

> This version is better than the last
> 
> (defun pi/2-opt ()
>    (declare (optimize (speed 3) (safety 0) (debug 0) (float 0)))
>    (loop with sum of-type double-float = #.(realpart (+ (exp (- (expt 
> -10.0d0 2.0d0)))
>                                                         (exp (- (expt 
> 10.0d0 2.0d0)))))
>          for x of-type double-float from -9.99d0 below 10.0d0 by 0.01d0
>          do (incf sum (* 2d0 (exp (- (* x x)))))
>          finally (return (* sum 0.005d0))))
> 
> CL-USER 1 > (time (dotimes (i 10000) (pi/2-opt)))
> Timing the evaluation of (DOTIMES (I 10000) (PI/2-OPT))
> ; Loading fasl file C:\Program 
> Files\Xanalys\LispWorks\lib\4-3-0-0\modules\util\callcoun.fsl
> 
> user time    =      1.231
> system time  =      0.000
> Elapsed time =   0:00:01
> Allocation   = 164304 bytes standard / 113113 bytes conses
> 0 Page faults
> Calls to %EVAL    10035

CL-USER 1 > (pi/2-WH)
1.7724538509055174D0

CL-USER 2 > (time (dotimes (i 10000) (pi/2-WH)))
Timing the evaluation of (DOTIMES (I 10000) (PI/2-WH))

...
Elapsed time =   0:00:12
...
CL-USER 4 > (pi/2-WH2)
1.7724538509055174D0

CL-USER 5 > (time (dotimes (i 10000) (pi/2-WH2)))
Timing the evaluation of (DOTIMES (I 10000) (PI/2-WH2))

...
Elapsed time =   0:00:12
...

Thanks Wade!

At 12s for the Lisp version (still with the same LW 4.4.6 PE compiler)
I have no doubt that I now can successfully make my case!

(Though, out of curiosity I'm going to try with SBCL and OpenMCL)

Many thanks
--
JFB
From: verec
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <44de5862$0$639$5a6aecb4@news.aaisp.net.uk>
On 2006-08-12 21:27:36 +0100, Wade Humeniuk 
<··················@telus.net> said:

> Removed some expt stuff, I think it is still correct.

Hmmm. That's a point I overlooked when copying and pasting
your example (as well as PK's.)

The thing is, both of your versions are about 10 times
faster than what I initially came up with _precisely_
because you have replaced (expt x 2) with (* x x).

When I plug (expt x 2) back into the loop, I again get
the same dismal performance as with my initial version.

with:
         do (incf sum (* 2d0 (exp (- (* x x)))))
12s

with:
         do (incf sum (* 2d0 (exp (- (expt x 2.0d0)))))
115s!

but with:
         do (incf sum (* 2d0 (exp (- (realpart (expt x 2.0d0))))))
88s

Hmmm. Still scratching my head :-(
--
JFB
From: Pascal Bourguignon
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <87ejvl4knq.fsf@thalassa.informatimago.com>
verec <·····@mac.com> writes:

> On 2006-08-12 21:27:36 +0100, Wade Humeniuk
> <··················@telus.net> said:
>
>> Removed some expt stuff, I think it is still correct.
>
> Hmmm. That's a point I overlooked when copying and pasting
> your example (as well as PK's.)
>
> The thing is, both of your version are about 10 times
> faster than what I intially came up with _precisely_
> because you have replaced (expt x 2) with (* x x)
>
> When I plug (expt x 2) back into the loop, I get again
> the same dismall performance as with my initial version.

You can write a compiler-macro. Unfortunately, not on CL:EXPT since
this is forbidden, but you can do it this way:

(shadow 'expt)
(declaim (inline expt))
(defun expt (base-number power-number) (cl:expt base-number power-number))
(define-compiler-macro expt (&whole whole base-number power-number)
  (if (equal 2 power-number)
      (let ((bn (gensym)))
        `(let ((,bn ,base-number)) (* ,bn ,bn)))
      whole))

Then compare:

(disassemble (compile (defun f (x)   (expt x 2))))
(disassemble (compile (defun g (x y) (expt x y))))


So there is no need to modify the source; keep (expt x 2).

(You can also write a more sophisticated compiler macro, handling
other constant exponents.)
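
For example, a sketch that open-codes any small constant integer
exponent on the shadowed EXPT (the cutoff of 8 is an arbitrary
choice):

(define-compiler-macro expt (&whole whole base-number power-number)
  ;; Expand small constant integer exponents into inline repeated
  ;; multiplication; fall back to the full EXPT call otherwise.
  (if (and (integerp power-number) (<= 2 power-number 8))
      (let ((bn (gensym)))
        `(let ((,bn ,base-number))
           (* ,@(make-list power-number :initial-element bn))))
      whole))

A production version would also want to handle exponent 0 and 1, and
perhaps use binary exponentiation for larger constants.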

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

"This statement is false."            In Lisp: (defun Q () (eq nil (Q)))
From: verec
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <44de63e6$0$639$5a6aecb4@news.aaisp.net.uk>
On 2006-08-13 00:08:25 +0100, Pascal Bourguignon <···@informatimago.com> said:

> You can write a compiler-macro. Unfortunately, not on CL:EXPT since
> this is forbidden, but you can do it this way:
> 
> (shadow 'expt)
> (declaim (inline expt))
> (defun expt (base-number power-number) (cl:expt base-number power-number))
> (define-compiler-macro expt (&whole whole base-number power-number)
>   (if (equal 2 power-number)
>       (let ((bn (gensym)))
>         `(let ((,bn ,base-number)) (* ,bn ,bn)))
>       whole))
> 
> Then compare:
> 
> (disassemble (compile (defun f (x)   (expt x 2))))
> (disassemble (compile (defun g (x y) (expt x y))))

Cute :-)

Many thanks
--
JFB
From: John Thingstad
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <op.td6qagwdpqzri1@pandora.upc.no>
On Sat, 12 Aug 2006 20:38:26 +0200, verec <·····@mac.com> wrote:

I get: Error: The value #C(1.1094963563118383E-43 2.7065338742349504E-57)  
does not satisfy the type specifier FLOAT.

Perhaps that gives you a hint.

>
> (defun pi/2-opt ()
>   (declare (optimize (safety 0) (debug 0) (speed 3)))
>   (declare (float sum))
>   (declare (float x))
>   (do ((sum
>         (+ (exp (- (expt -10.0 2.0)))
>            (exp (- (expt 10.0 2.0)))))
>        (x -9.99 (+ x 0.01)))
>       ((>= x 10.0) (* sum 0.005))
>     (incf sum (the float (* 2 (exp (- (expt x 2.0))))))))
>
> CL-USER 54 > (time (dotimes (i 10000) (pi/2-opt)))
> Timing the evaluation of (DOTIMES (I 10000) (PI/2-OPT))
>
> user time    =     97.631
> system time  =      2.084
> Elapsed time =   0:02:05
> Allocation   = 4877356064 bytes
> 0 Page faults
> Calls to %EVAL    120050
> NIL
>
> At nearly 30 times slower, I just don't stand a chance.
> I can probably push the high-level, DSL and other Lisp features
> at around 10x or 5x the speed penalty, but not at 30x.
>
> What did I misunderstand about the use of (declare (optimize ...
> and (the float ( ... or anything else that might affect the
> speed so adversely?
>
> Just in case this isn't clear, the point is not about getting
> pi/2 but getting numeric speed for computations of complexity
> similar to that used in this approximation of pi/2.
>
> Many thanks
> --
> JFB
>


-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
From: verec
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <44de2997$0$639$5a6aecb4@news.aaisp.net.uk>
On 2006-08-12 20:12:06 +0100, "John Thingstad" <··············@chello.no> said:

> On Sat, 12 Aug 2006 20:38:26 +0200, verec <·····@mac.com> wrote:
> 
> I get: Error: The value #C(1.1094963563118383E-43 2.7065338742349504E-57 )
> does not satisfy the type specifier FLOAT.
> 
> Perhaps that gives you a hint.

Hmmm...

CL-USER 55 > (pi/2-opt)
#C(1.7724538 -7.747651E-8)

CL-USER 56 >

Are type-casts supposed to be implementation dependent?

I thought (erroneously?) that (the float XYZ) would
just drop the imaginary part, precisely because that's
the purpose of a cast? Or should I use (coerce ... in
addition?

FYI, I'm using LispWorks 4.4.6 PE on a G4.

Many thanks
--
JFB

>> (defun pi/2-opt ()
>>   (declare (optimize (safety 0) (debug 0) (speed 3)))
>>   (declare (float sum))
>>   (declare (float x))
>>   (do ((sum
>>         (+ (exp (- (expt -10.0 2.0)))
>>            (exp (- (expt 10.0 2.0)))))
>>        (x -9.99 (+ x 0.01)))
>>       ((>= x 10.0) (* sum 0.005))
>>     (incf sum (the float (* 2 (exp (- (expt x 2.0))))))))
>> 
>> CL-USER 54 > (time (dotimes (i 10000) (pi/2-opt)))
>> Timing the evaluation of (DOTIMES (I 10000) (PI/2-OPT))
>> 
>> user time    =     97.631
>> system time  =      2.084
>> Elapsed time =   0:02:05
>> Allocation   = 4877356064 bytes
>> 0 Page faults
>> Calls to %EVAL    120050
>> NIL
>> 
>> At nearly 30 times slower, I just don't stand a chance.
>> I can probably push the high-level, DSL and other Lisp features
>> at around 10x or 5x the speed penalty, but not at 30x.
>> 
>> What did I misunderstand about the use of (declare (optimize ...
>> and (the float ( ... or anything else that might affect the
>> speed so adversely?
>> 
>> Just in case this isn't clear, the point is not about getting
>> pi/2 but getting numeric speed for computations of complexity
>> similar to that used in this approximation of pi/2.
From: Barry Margolin
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <barmar-9A9C19.17540612082006@comcast.dca.giganews.com>
In article <·······················@news.aaisp.net.uk>,
 verec <·····@mac.com> wrote:

> I thought (erroneously?) that (the float XYZ) would
> just drop the imaginary part, precisely because that's
> the purpose of a cast? Or should I use (coerce ... in
> addition?

THE is not a type-cast, it is a declaration that the value of the 
expression is known a priori to be of the specified type.  It allows the 
compiler to perform type dispatching at compile time rather than at run 
time, but it doesn't perform any conversion to ensure that the object is 
of that type.

So you should write (the float (realpart xyz)).
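To make the distinction concrete, a minimal sketch using the complex value from the earlier transcript:

```lisp
(let ((z #C(1.7724538 -7.747651e-8)))
  ;; THE is a declaration only: no conversion, and with (safety 0)
  ;; possibly no check either.  (the float z) is undefined here,
  ;; because z is not a float.
  (realpart z)                           ; => 1.7724538
  ;; COERCE converts between types, but it will not silently drop an
  ;; imaginary part either; combine it with REALPART when needed:
  (coerce (realpart z) 'double-float))
```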

-- 
Barry Margolin, ······@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***
From: verec
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <44de5429$0$639$5a6aecb4@news.aaisp.net.uk>
On 2006-08-12 22:54:06 +0100, Barry Margolin <······@alum.mit.edu> said:

> In article <·······················@news.aaisp.net.uk>,
>  verec <·····@mac.com> wrote:
> 
>> I thought (erroneously?) that (the float XYZ) would
>> just drop the imaginary part, precisely because that's
>> the purpose of a cast? Or should I use (coerce ... in
>> addition?
> 
> THE is not a type-cast, it is a declaration that the value of the 
> expression is known a priori to be of the specified type.  It allows 
> the compiler to perform type dispatching at compile time rather than at 
> run time, but it doesn't perform any conversion to ensure that the 
> object is of that type.
> 
> So you should write (the float (realpart xyz)).

I knew that at the slightest hint of doubt, I should head straight to the CLHS :-)

Many Thanks
--
JFB
From: Raffael Cavallaro
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <200608121807138930-raffaelcavallaro@pasdespamsilvousplaitmaccom>
On 2006-08-12 15:18:47 -0400, verec <·····@mac.com> said:

> Are type-casts supposed to be implementation dependent?
> 
> I thought (erroneously?) that (the float XYZ) would
> just drop the imaginary part, precisely because that's
> the purpose of a cast?

Just to be clear, "the" is *not* a type cast - it is a special form 
that acts as a type declaration.

(the float (foo ...)) is a special form telling the 
compiler/interpreter that you, the programmer, are certain that the 
expression beginning (foo ...) will definitely return a float so the 
compiler can treat the return of (foo ...) as a float. If you use this 
declaration with an expression that returns something other than a 
float you are in undefined territory:

"The consequences are undefined if the values yielded by the form are 
not of the type specified by value-type."
(from <http://www.lisp.org/HyperSpec/Body/speope_the.html#the>)


The function cl:float takes a real number and an optional prototype for 
float format (e.g., 1.0d0) and returns a float of the same format as 
the prototype or a single float if no prototype is supplied:

(float 1.0s0 1.0d0)
=> 1.0d0

(float 33/3 1.0d0)
=> 11.0d0

(float 2)
=> 2.0

If you want only the real part of a complex number use the function 
cl:realpart:

(realpart #C(1.7724538 -7.747651E-8))

=> 1.7724538
From: verec
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <44de53dd$0$639$5a6aecb4@news.aaisp.net.uk>
On 2006-08-12 23:07:13 +0100, Raffael Cavallaro 
<················@pas-d'espam-s'il-vous-plait-mac.com> said:

> "The consequences are undefined if the values yielded by the form are 
> not of the type specified by value-type."
> (from <http://www.lisp.org/HyperSpec/Body/speope_the.html#the>)
> 

[...]

> If you want only the real part of a complex number use the function 
> cl:realpart:
> 
> (realpart #C(1.7724538 -7.747651E-8))
> 
> => 1.7724538

Point taken. Many thanks.
--
JFB
From: Pierre THIERRY
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <pan.2006.08.12.20.49.58.138778@levallois.eu.org>
Le Sat, 12 Aug 2006 19:38:26 +0100, verec a écrit :
> static double piOver2() {
>     double sum = std::exp(-std::pow(-10.0, 2.0)) ;
>     sum += std::exp(-std::pow(10.0, 2)) ;

For the sake of my curiosity, why do you calculate twice the same
number? 10^2 == (-10)^2

> But I just can't tell how long 10,000 iterations would
> take because I lost patience after 2 minutes.

That's very strange indeed. I can confirm it is pretty fast without any
optimization. I'm also working with SBCL. It goes from 5 seconds without
optimization down to 1.1 seconds with speed 3 and everything else at 0.

Maybe there's a flaw in your compiler. Try it on another implementation.

Curiously,
Nowhere man
-- 
···········@levallois.eu.org
OpenPGP 0xD9D50D8A
From: verec
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <44de4d51$0$639$5a6aecb4@news.aaisp.net.uk>
On 2006-08-12 21:49:58 +0100, Pierre THIERRY 
<···········@levallois.eu.org> said:

> Le Sat, 12 Aug 2006 19:38:26 +0100, verec a écrit :
>> static double piOver2() {
>> double sum = std::exp(-std::pow(-10.0, 2.0)) ;
>> sum += std::exp(-std::pow(10.0, 2)) ;
> 
> For the sake of my curiosity, why do you calculate twice the same
> number? 10^2 == (-10)^2

Well spotted! I guess that's just my mindless translation of
the math formula

pi/2 = integral from -10 to 10 of e^(-x^2) dx

:-(

>> But I just can't tell how long 10,000 iterations would
>> take because I lost patience after 2 minutes.
> 
> That's very strange indeed. I can confirm it is pretty fast without any
> optimization. I'm also working with SBCL. It gets from 5 seconds without
> optimization down to 1.1 with speed 3 and anything else to 0.
> 
> Maybe there's a flaw in your compiler. Try it on another implementation.

Still on the case :-)

I'll report in due course.

> Curiously,
> Nowhere man

--
JFB
From: Ivan Boldyrev
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <kb55r3-8rr.ln1@ibhome.cgitftp.uiggm.nsc.ru>
On 9565 day of my life ·····@mac.com wrote:
> On 2006-08-12 21:49:58 +0100, Pierre THIERRY
>>> static double piOver2() {
>>> double sum = std::exp(-std::pow(-10.0, 2.0)) ;
>>> sum += std::exp(-std::pow(10.0, 2)) ;
>> For the sake of my curiosity, why do you calculate twice the same
>> number? 10^2 == (-10)^2
>
> Well spotted!

Well, any modern compiler will calculate both values at compile time.
And keeping the "-" will help other people understand your code.

-- 
Ivan Boldyrev

                                        | recursion, n:
                                        |       See recursion
From: Lars Brinkhoff
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <85ejvkdfd7.fsf@junk.nocrew.org>
verec <·····@mac.com> writes:
> I have been challenged to "port" to CL the following (for now
> quite inaccurate) approximation of pi/2. Depending on the
> speed outcome, CL may get considered for the project that
> will be heavy on numeric computations.

I get consistently *faster* run times with SBCL 0.9.14.30 than with
GCC 4.0.4 using plain C, not C++.  (There's no reason C++ would be
faster, is there?)

100,000 iterations on a 2GHz x86:

SBCL     
  17.729 seconds of real time
  17.713108 seconds of user run time
  0.004 seconds of system run time
  0 page faults and
  0 bytes consed.

GCC
  real    0m19.948s
  user    0m19.861s
  sys     0m0.004s
From: Pierre THIERRY
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <pan.2006.08.13.12.51.43.132585@levallois.eu.org>
Le Sun, 13 Aug 2006 13:50:12 +0200, Lars Brinkhoff a écrit :
> I get consistently *faster* run times with SBCL 0.9.14.30 than with
> GCC 4.0.4 using plain C, not C++.  (There's no reason C++ would be
> faster, is there?)

How exactly did you instrument the C/C++ code to time it? As I didn't
want to take much time on it, I just compiled the given code with a main
doing 10,000 iterations with a for loop, and timed it with my shell, but
this includes spawning and shutting down the process...

Ah, wait, never mind. I can just change the number of iterations and
then easily calculate the impact it really has.

OK, I'm also significantly faster in Lisp... It took 1.7s even with -O3
for the C++ version (ten runs with ten times the iterations each, so
spawning the process is quite lightweight, in fact), and 1.1s in Lisp
for the following:

(defun pi/2-opt ()
  (declare (optimize (debug 0)(space 0)(safety 0)(speed 3)))
  (do ((x -9.98 (+ x 0.01))
       (v 0.0 (exp (- (expt x 2.0))))
       (sum (* 2 (exp (- (expt 10.0 2.0)))) (+ sum (* 2 v))))
      ((>= x 10.0) (* 0.005 sum))
    (declare (type float x v sum))))

I'm impressed...

Quickly,
Nowhere man
-- 
···········@levallois.eu.org
OpenPGP 0xD9D50D8A
From: verec
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <44df31f0$0$638$5a6aecb4@news.aaisp.net.uk>
On 2006-08-13 13:51:43 +0100, Pierre THIERRY 
<···········@levallois.eu.org> said:

> (defun pi/2-opt ()
>   (declare (optimize (debug 0)(space 0)(safety 0)(speed 3)))
>   (do ((x -9.98 (+ x 0.01))
>        (v 0.0 (exp (- (expt x 2.0))))
>        (sum (* 2 (exp (- (expt 10.0 2.0)))) (+ sum (* 2 v))))
>       ((>= x 10.0) (* 0.005 sum))
>     (declare (type float x v sum))))
> 
> I'm impressed...

So am I :-)

I tested this version with SBCL 0.9.15 and indeed:

*  (time (dotimes (i 10000) (pi/2-optPT)))

Evaluation took:
  4.413 seconds of real time
  3.753417 seconds of user run time
  0.036023 seconds of system run time
  0 page faults and
  81,920 bytes consed.
NIL

I'm amazed at the difference a compiler makes...

Many thanks to all, for all your time and efforts.
--
JFB
From: Sidney Markowitz
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <44df957f$0$34514$742ec2ed@news.sonic.net>
verec wrote, On 14/8/06 2:06 AM:
> I'm amazed at the difference a compiler makes...

If you look at the recent thread in this newsgroup with subject "Squeezing
blood from a turnip" you will see a related discussion about getting better
performance in Lisp than C at counting bits in an integer.

What ended up being found is that the Common Lisp built-in function logcount
was the fastest way to do it in SBCL; that the fastest C program to do it was
still faster; that the implementation of logcount in SBCL was not using the
fastest algorithm known in C; and that, coincidentally, someone then
contributed a faster logcount to the SBCL project, which made the Lisp
version faster than the C version.
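For reference, LOGCOUNT counts the 1 bits of a non-negative integer (and, by its two's-complement definition, the 0 bits of a negative one):

```lisp
(logcount #b10110)  ; => 3
(logcount 255)      ; => 8
(logcount -1)       ; => 0, all bits are 1, so there are no 0 bits
```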

An important point you can take from my post to that thread is that you
can write your own compiler implementation of a function like logcount using
the SBCL macro define-vop, which lets you write the assembly-language
implementation of a function. So even if the bottleneck in your program is not
something that is part of the CL language definition such as logcount, you are
able to hand-optimize it if it makes sense to do so. And you can evaluate a
define-vop form right at the REPL; you don't have to touch the compiler source
code to, in effect, patch the compiler.

If profiling shows that the bottleneck is a function that is part of Common
Lisp, you can give your improvements back to the project, if it is an open
source Lisp.

That means that if you have a very large program, once you have the algorithm
tweaked and you profile to find where the bottleneck is, you can go the extra
step of telling the compiler how to write fully optimized machine language for
a critical function in the inner loop, if there is such a function.

The advantage I have found in using Common Lisp instead of C for highly
optimized code is that it is much easier to make major changes to the
algorithm, which usually is the right way to get big improvements, before
dealing with low level tweaking.

-- 
    Sidney Markowitz
    http://www.sidney.com
From: Lars Brinkhoff
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <857j1cd92f.fsf@junk.nocrew.org>
Pierre Thierry wrote:
> Lars Brinkhoff wrote:
> > I get consistently *faster* run times with SBCL 0.9.14.30 than
> > with GCC 4.0.4.
> How exactly did you instrument the C/C++ code to time it?

Here's the whole thing:

    #include <math.h>

    static double pi_over_2 (void)
    {
      double sum = exp (-pow (-10.0, 2.0)) + exp (-pow (10.0, 2.0));
      double x;

      for (x = -9.99; x < 10.0; x += 0.01)
        sum += 2.0 * exp (-pow (x, 2.0)) ;

      return sum * 0.005;
    }

    int main (void)
    {
      int i;

      for (i = 0; i < 100000; i++)
        pi_over_2 ();

      return 0;
    }

> OK, I'm also significantly faster in Lisp... It took 1.7s even with
> -O3 for the C++ version [...], and 1.1s in Lisp

I didn't see any significant differences between -O2, -O3, and -Os, or
between C and C++, so I guess that's not a factor to consider here.

> (defun pi/2-opt ()
>   (declare (optimize (debug 0)(space 0)(safety 0)(speed 3)))
>   (do ((x -9.98 (+ x 0.01))
>        (v 0.0 (exp (- (expt x 2.0))))
>        (sum (* 2 (exp (- (expt 10.0 2.0)))) (+ sum (* 2 v))))
>       ((>= x 10.0) (* 0.005 sum))
>     (declare (type float x v sum))))

You can also avoid boxing the return value, but I find that it doesn't
improve speed noticeably in this toy program.

    (defun pi/2 (box)
      (declare (type (simple-array double-float ()) box))
      (do (...)
          ((>= x 10.0d0)
           (setf (aref box) (* 0.005d0 sum))
           nil)
        ...))

A one-element vector works just as well, but it's always fun to use a
zero-dimensional array. :)  (Perhaps other implementations optimize one
of those better than another.)
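Filling in the elisions with the loop from Pierre's version earlier in the thread, the whole function might look like this (an untested sketch; the name pi/2-unboxed is mine):

```lisp
(defun pi/2-unboxed (box)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (type (simple-array double-float ()) box))
  (do ((x -9.99d0 (+ x 0.01d0))
       ;; start with the two endpoint terms, add 2*exp(-x^2) each step
       (sum (* 2d0 (exp -100d0))
            (+ sum (* 2d0 (exp (- (* x x)))))))
      ((>= x 10.0d0)
       ;; write the result into the caller's zero-dimensional array
       ;; instead of returning a freshly boxed double-float
       (setf (aref box) (* 0.005d0 sum))
       nil)
    (declare (type double-float x sum))))

;; The caller allocates the box once and reuses it:
;; (let ((box (make-array '() :element-type 'double-float)))
;;   (dotimes (i 10000) (pi/2-unboxed box))
;;   (aref box))
```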
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <87lkps5v3a.fsf@qrnik.zagroda>
Pierre THIERRY <···········@levallois.eu.org> writes:

> OK, I'm also significantly faster in Lisp... It took 1.7s even with -O3
> for the C++ version (ten times with ten times iterations, so spawning
> the process is quite lightweight, in fact), and 1.1s in Lisp

Compile the C version with -ffast-math and do something to prevent the
whole computation from being optimized out (e.g. sum the results and
print the sum). It's twice as fast as without -ffast-math on my box.

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Lars Brinkhoff
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <85ejvkbc6c.fsf@junk.nocrew.org>
Marcin 'Qrczak' Kowalczyk wrote:
> Pierre Thierry wrote;
> > OK, I'm also significantly faster in Lisp... It took 1.7s even
> > with -O3 for the C++ version [...], and 1.1s in Lisp
> Compile the C version with -ffast-math and do something to avoid
> optimizing out the whole computation (e.g sum the results and print
> the sum). It's twice faster than without -ffast-math on my box.

Significantly faster, yes, but not by a factor of two on my machine.

  SBCL
    17.772 seconds of real time
    17.745108 seconds of user run time
    0.004 seconds of system run time
    0 page faults and
    0 bytes consed.

  GCC
    real    0m15.863s
    user    0m15.785s
    sys     0m0.012s

Old results, without summing the return values:

  SBCL     
    17.729 seconds of real time
    17.713108 seconds of user run time
    0.004 seconds of system run time
    0 page faults and
    0 bytes consed.

  GCC
    real    0m19.948s
    user    0m19.861s
    sys     0m0.004s
From: Rob Thorpe
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <1155577951.718912.235270@p79g2000cwp.googlegroups.com>
Marcin 'Qrczak' Kowalczyk wrote:
> Pierre THIERRY <···········@levallois.eu.org> writes:
>
> > OK, I'm also significantly faster in Lisp... It took 1.7s even with -O3
> > for the C++ version (ten times with ten times iterations, so spawning
> > the process is quite lightweight, in fact), and 1.1s in Lisp
>
> Compile the C version with -ffast-math and do something to avoid
> optimizing out the whole computation (e.g sum the results and print
> the sum). It's twice faster than without -ffast-math on my box.

On the MS Windows machine I'm currently using:

Cygwin GCC 3.4.4 takes 19.5secs (with -O3 and --fast-math)
SBCL "kitten of death" takes ~23secs
GCL 2.6.1 takes 59secs

I'm impressed by how close SBCL gets to GCC.  I'm also impressed at the
acrobatics needed to stop GCC from removing the calculation altogether
(as you say sum results + print sum).

The 'time' function seems not to work in SBCL on MS Windows; I imagine
fixing that isn't a priority.
From: Juho Snellman
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <slrnee1f0c.hbu.jsnell@sbz-30.cs.Helsinki.FI>
Rob Thorpe <·············@antenova.com> wrote:
> The 'time' function seems not to work on MS windows SBCL, I imagine
> fixing that isn't a priority.

It works in the latest release, but you'll need to build that from
source. Binaries aren't made for all platforms on every release.

-- 
Juho Snellman
From: patro
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <ebo77e$sbi$1@nemesis.news.tpi.pl>
"verec" <·····@mac.com> wrote:

> Depending on the
> speed outcome, CL may get considered for the project that
> will be heavy on numeric computations.

Why use Lisp for numerical computations?


patro
From: verec
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <44dfb612$0$636$5a6aecb4@news.aaisp.net.uk>
On 2006-08-13 23:15:29 +0100, "patro" <·····@ortap.com> said:

>> Depending on the
>> speed outcome, CL may get considered for the project that
>> will be heavy on numeric computations.
> 
> Why use Lisp for numerical computations ?

I said that the project was going to be heavy on numerical
computations, not that it was only about them.

The basic blocks (financial modelling on time series) are
fairly well defined. That's where the "raw speed" is needed
because the data samples are potentially _huge_, and the
computations not trivial. The math guy doesn't really care
which language is chosen, provided it is "fast enough", and
I'm pushing as hard as I can to avoid having to use Java
for the rest, because Java forces me to build too much
infrastructure/protocols that have to be almost entirely discarded
when the higher level requirements change.

The higher levels (what strategies to apply according to the
results of the models) are a lot less well defined, and it is
my belief that changing our mind three times along the way about
what we really want is going to be less expensive and faster
in CL (than the daily nightmare of forcing new business requirements
into a three year old Java code base that was not meant for them.)

But in order to be able to change our minds, we first need
to see some results; hence experiment, hence my plea for using
CL.
--
JFB
From: Paul Jones
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <1155659641_508@news-east.n>
On 2006-08-14 01:30:26 +0200, verec <·····@mac.com> said:

> On 2006-08-13 23:15:29 +0100, "patro" <·····@ortap.com> said:
> 
>>> Depending on the
>>> speed outcome, CL may get considered for the project that
>>> will be heavy on numeric computations.
> 
> I said that the project was going to be heavy on numerical
> computations, not that it was only about them.
> 
> The basic blocks (financial modelling on time series) are
> fairly well defined. That's where the "raw speed" is needed
> because the data samples are potentially _huge_, and the
> computations not trivial. The math guy doesn't really care
> which language is chosen, provided it is "fast enough", and
> I'm pushing as hard as I can to avoid having to use Java
> for the rest, because Java forces me to build too much
> infrastructure/protocols that have to be almost entirely discarded
> when the higher level requirements change.
Speaking from experience, it's really not at all about how fast a 
language can do one sort or another of numerical floating point 
arithmetic. With huge data samples, it's about how the data are 
organized and what types of methods are used to access them. We handle 
very large time series using Lisp, and are eternally thankful that we 
chose it for a language. We do use some routines for numerical 
computation that are not Lisp, but they're sure not Java either. 
They're carefully coded and optimized and then wrapped in Lisp wrappers.

We have years of experience in using Lisp with (very large) time series 
for financial applications. If you'd like more pointers, let me know.
> 
> The higher levels (what strategies to apply according to the
> results of the models) are a lot less well defined, and it is
> my beleif that changing our mind three times along the way about
> what we really want is going to be less expensive and faster
> in CL (than the daily nightmare of forcing new business requirements
> into a three year old Java code base that was not meant for them.)
> 
> But in order to be able to change our minds, we need first
> to see some results, hence experiment, hence my plea for using
> CL



From: Rob Thorpe
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <1155663649.265119.16690@m79g2000cwm.googlegroups.com>
Paul Jones wrote:
> On 2006-08-14 01:30:26 +0200, verec <·····@mac.com> said:
>
> > On 2006-08-13 23:15:29 +0100, "patro" <·····@ortap.com> said:
> >
> >>> Depending on the
> >>> speed outcome, CL may get considered for the project that
> >>> will be heavy on numeric computations.
> >
> > I said that the project was going to be heavy on numerical
> > computations, not that it was only about them.
> >
> > The basic blocks (financial modelling on time series) are
> > fairly well defined. That's where the "raw speed" is needed
> > because the data samples are potentially _huge_, and the
> > computations not trivial. The math guy doesn't really care
> > which language is chosen, provided it is "fast enough", and
> > I'm pushing as hard as I can to avoid having to use Java
> > for the rest, because Java forces me to build too much
> > infrastructure/protocols that have to be almost entirely discarded
> > when the higher level requirements change.
> Speaking from experience, it's really not at all about how fast a
> language can do one sort or another of numerical floating point
> arithmetic. With huge data samples, it's about how the data are
> organized and what types of methods are used to access them.

I'd agree with that.  The example program given isn't a particularly
good one.
Most programs that use floating point iterate over large data
structures to read the data they need and output the results.  The
optimization of these data structures is often the most important part.

In the SPECfp2000 floating-point benchmark set there are now few
benchmarks that make heavy use of the floating-point capabilities of
the processor.  Most test the ability of the memory subsystem to
access regularly structured data.  This is because the benchmarks are
built to be typical of fp applications, and that is what fp
applications do today.
From: Pierre THIERRY
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <pan.2006.08.13.22.44.53.895437@levallois.eu.org>
Le Mon, 14 Aug 2006 00:15:29 +0200, patro a écrit :
> Why use Lisp for numerical computations ?

Why not?

Curiously,
Nowhere man
-- 
···········@levallois.eu.org
OpenPGP 0xD9D50D8A
From: verec
Subject: Re: help with numeric speed optimization
Date: 
Message-ID: <44e72d26$0$638$5a6aecb4@news.aaisp.net.uk>
Thanks to all for your replies. It helped win the case: we're
using Lisp for the project! Yeap! BTW: the math guy apologized:
the reason that the initial pi/2 wasn't very accurate (15% off
or so) is that the formula he gave me was for (sqrt pi), not
(/ pi 2) ...

CL-USER 11 > (defun pi/2-opt4 () ;;; formula for (sqrt pi) not for (/ pi 2) !!
   (declare (optimize (speed 3) (safety 0) (debug 0) (float 0)))
   (loop with sum of-type double-float
           = #.(realpart (+ (exp (- (expt -10.0d0 2.0d0)))
                            (exp (- (expt 10.0d0 2.0d0)))))
         for x of-type double-float from -9.99d0 below 10.0d0 by 0.01d0
         do (incf sum (* 2d0 (exp (- (realpart (expt x 2.0d0))))))
         finally (return (* sum 0.005d0))))

CL-USER 12 > (pi/2-opt4)
1.7724538509055174D0

CL-USER 13 > (sqrt pi)
1.7724538509055159D0

CL-USER 14 > (/ pi 2)
1.5707963267948966D0

Many thanks again.
--
JFB