Folks,
There's a thread in the Lispworks Users Group that I started:
http://thread.gmane.org/gmane.lisp.lispworks.general/6012/focus=6012
This is the type of thread that I'm sure Lispers are gonna love: I'm
pitting ACL 8.0 vs LW 5.0 in a floating point pseudo-benchmark.
I would love to get stats from SBCL, CMUCL and as many other Lisps as
possible. I would also love to improve the benchmark itself and measure
GCC on the same code.
Thanks, Joel
--
http://whylisp.com
Pascal,
Pascal Bourguignon wrote:
> Allegro 8.0 20.820
Thanks for the stats! The ACL timing above is wrong, though, and
happens when log is used instead of double-log. Please modify (logx
(log x)) to read (logx (double-log x)). So long as you include the
optimization hints for ACL you'll have awesome performance.
> (let* ((p (/ 1d0 (* x x)))
> (logx (log x))
> (one 1d0))
It seems that Lisps other than ACL and LW are not using SSE2/SSE3
instructions. LW also seems to use them only partially, i.e. it mixes
SSE and x87 in the code for digamma.
This is my latest version of the code with the change above:
(in-package :cl-user)

#+lispworks
(declaim (optimize (float 0))
         (ftype (function (double-float) double-float) digamma/setq-x))

#+allegro
(eval-when (compile load eval)
  (setf (get 'digamma/setq-x 'sys::immed-args-call) ; now say some magic
        '((double-float) double-float)))

#+allegro
(ff:def-foreign-call (double-log "log") ((x :double))
  :returning :double
  :call-direct t
  :arg-checking nil)

(defun digamma/setq-x (x)
  (declare (optimize (speed 3) (safety 0) (debug 0) (float 0))
           (type (double-float (0d0) *) x))
  (let ((x (+ x 6d0)))
    (declare (type (double-float (6d0) *) x))
    (let* ((p (/ 1d0 (* x x)))
           #+lispworks (logx (log x))
           #+allegro (logx (double-log x))
           (one 1d0))
      (declare (type double-float one p))
      (setq p (* (- (* (+ (* p (- (* p 0.004166666666667d0)
                                  0.003968253986254d0))
                          0.008333333333333d0) p)
                    0.083333333333333d0) p))
      (+ p (- (- (- (- (- (- (- logx
                                (/ 0.5d0 x))
                             (/ one (setq x (- x one))))
                          (/ one (setq x (- x one))))
                       (/ one (setq x (- x one))))
                    (/ one (setq x (- x one))))
                 (/ one (setq x (- x one))))
              (/ one (setq x (- x one))))))))

(defun timeit ()
  (declare (optimize (speed 3) (safety 0) (debug 0) (float 0)))
  (loop for i from 10000 to 10000000
        do (digamma/setq-x (coerce (the fixnum i) 'double-float))))
I guess I don't understand the double-log business.
I ran your "latest version" and then I changed the line
#+allegro (logx (double-log x))
to
#+allegro (logx (log x))
and it ran faster. On my Windows machine the first version was
about 6.5 seconds. The second was about 2.5 seconds. [Timing (timeit)..]
RJF
Richard J. Fateman wrote:
> On my Windows machine the first version was
> about 6.5 seconds. The second was about 2.5 seconds. [Timing (timeit)..]
I will let Duane comment on this. What type of architecture are you on?
What version of ACL?
Thanks, Joel
(I thought "Windows" gave it away.. but my benchmark machine is a 2.6GHz
Pentium 4 running Windows XP. I'm running Allegro CL 8.0 [licensed,
though I think the free trial "express" version would be identical])
I subsequently looked at the gmane discussion and found it had to do
with lisp running on Macintosh. So maybe it's irrelevant.
My own interest in doing floating point in lisp has been trying to make
better use of some of the subtleties in the IEEE floating point
implementations (rounding modes, traps); speed of properly declared code
has become much less of an issue in comparing Lisp to C (etc).
Joel Reymont wrote:
> Richard J. Fateman wrote:
>
>>On my Windows machine the first version was
>>about 6.5 seconds. The second was about 2.5 seconds. [Timing (timeit)..]
>
>
> I will let Duane comment on this. What type of architecture are you on?
> What version of ACL?
>
> Thanks, Joel
>
Richard J. Fateman wrote:
> (I thought "Windows" gave it away.. but my benchmark machine is a 2.6GHz
> Pentium 4 running Windows XP. I'm running Allegro CL 8.0 [licensed,
> though I think the free trial "express" version would be identical])
The P4 bit is what I was interested in. Can you send me the disassembly
from your digamma or post it here? I wonder if SSE instructions are
being used. The Mac Intel bit should not matter as it's still x86. At
least I don't think it should matter.
On Oct 20, 2:47 pm, "Joel Reymont" <······@gmail.com> wrote:
> The P4 bit is what I was interested in. Can you send me the disassembly
> from your digamma or post it here? I wonder if SSE instructions are
> being used. The Mac Intel bit should not matter as it's still x86. At
> least I don't think it should matter.
One thing to remember is that just because you see that SSE
instructions are being used, it doesn't mean that the compiler is
taking advantage of their parallelism. I've seen recent versions of
the Intel C compiler emit SSE2 instructions but only using one-half of
the 128-bit pipe. The reason for this is that on the Pentium 4, the
SSE2 instructions have one cycle less latency than the double-precision
x87 instructions. Intel engineers (and presumably also AMD, though I
haven't checked) don't seem to consider the x87 pipe a priority for
optimization (alas for poor Prof. Kahan, who helped design it and
insisted on the 80-bit internal registers -- the SSE2 pipes are 64-bit
all the way).
To tell if the SSE2 instructions are actually being exploited for their
full parallelism, you'll need to make sure that both halves of the SSE2
register are being filled.
Best,
mfh
Joel Reymont wrote:
> The P4 bit is what I was interested in. Can you send me the disassembly
> from your digamma or post it here? I wonder if SSE instructions are
> being used. The Mac Intel bit should not matter as it's still x86. At
> least I don't think it should matter.
Apple's C compiler generates SSE2 instructions. I looked
at this just yesterday, while we were trying to figure out
why double precision (sqrt 0.9999999999999999) was giving
us the correct answer (according to IEEE-754) under Mac
OS X but was returning 1.0 under Linux.
Under Linux, the C compiler was generating the old-style
instructions, probably because there are so many Linux
machines in the field that don't support SSE and SSE2.
Apple doesn't have that problem.
Will
Hello Joel,
The Scieneer Common Lisp is 10% faster than optimized C code for your
problem. The Scieneer Common Lisp takes advantage of SSE2 and SSE3
instructions but does not yet take advantage of SIMD operations.
Extended precision floating point is also supported using the x87 FPU
instructions. The results below were run on a 2000MHz Athlon 64.
File: digamma.c
-=-
#include <math.h>

static inline double digamma(double x)
{
  double p;
  x = x + 6;
  p = 1/(x*x);
  p = (((0.004166666666667*p - 0.003968253986254)*p +
        0.008333333333333)*p - 0.083333333333333)*p;
  p = p + log(x) - 0.5/x - 1/(x-1) - 1/(x-2) - 1/(x-3) - 1/(x-4) - 1/(x-5) - 1/(x-6);
  return p;
}

double timeit(void)
{
  int i;
  double result;

  for (i = 10000; i < 10000000; i++)
  {
    result = digamma((double) i);
  }

  return result;
}

int main(void)
{
  timeit();
  return 0;
}
-=-
cc -O3 -o digamma digamma.c -lm
time digamma
1.624u 0.004s 0:01.61 100.6% 0+0k 0+0io 0pf+0w
File: digamma1.lisp
-=-
(in-package :cl-user)

(declaim (inline digamma/setq-x))

(defun digamma/setq-x (x)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (type (double-float (0d0) *) x))
  (let ((x (+ x 6d0)))
    (declare (type (double-float (6d0) *) x))
    (let* ((p (/ 1d0 (* x x)))
           (logx (log x))
           (one 1d0))
      (declare (type double-float one p))
      (setq p (* (- (* (+ (* p (- (* p 0.004166666666667d0)
                                  0.003968253986254d0))
                          0.008333333333333d0) p)
                    0.083333333333333d0) p))
      (+ p (- (- (- (- (- (- (- logx
                                (/ 0.5d0 x))
                             (/ one (setq x (- x one))))
                          (/ one (setq x (- x one))))
                       (/ one (setq x (- x one))))
                    (/ one (setq x (- x one))))
                 (/ one (setq x (- x one))))
              (/ one (setq x (- x one))))))))

(defun timeit ()
  (declare (optimize (speed 3) (safety 0) (debug 0)))
  (let ((result 0d0))
    (declare (double-float result))
    (loop for i from 10000 to 10000000
          do (setf result (digamma/setq-x (coerce (the fixnum i) 'double-float))))
    result))
-=-
* (time (timeit))
Compiling lambda nil:
Compiling Top-Level Form:
Evaluation took:
1.4379311 seconds of real time
1.4306431 seconds of thread run time
1.4306521 seconds of process run time
1.432089 seconds of user run time
0.004 seconds of system run time
0 page faults
16 bytes consed by this thread and
16 total bytes consed.
On Oct 21, 11:19 am, Douglas Crosher <····@scieneer.com> wrote:
> Hello Joel,
>
> The Scieneer Common Lisp is 10% faster than optimize C code for your
> problem. The Scieneer Common Lisp takes advantage of SSE2 and SSE3
> instructions but does not yet taking advantage of SIMD operations.
> Extended precision floating point is also supported using the x87 FPU
> instructions. The results below were run on a 2000MHz Athlon 64.
>
> [snip C and Lisp sources and timings]
I get similar results with cmucl 19c (2GHz Pentium M):
CL-USER> (time (timeit))
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:
; Evaluation took:
; 1.55 seconds of real time
; 1.544097 seconds of user run time
; 0.0 seconds of system run time
; 3,081,616,411 CPU cycles
; 0 page faults and
; 16 bytes consed.
C version on the same machine:
$ time digamma
real 0m2.237s
user 0m2.228s
sys 0m0.000s
Just to make sure that the challenge isn't comparing against unoptimized
C, I played around with the C version a bit to see if I could make it any
faster, and came up with this:
#include "math.h"

static inline double digamma(double x)
{
  double p;
  x = x + 6;
  p = 1/(x*x);
  p = (((0.004166666666667*p - 0.003968253986254)*p +
        0.008333333333333)*p - 0.083333333333333)*p;
  p = p + log(x) - 0.5/x - 1/(x-1) - 1/(x-2) - 1/(x-3) - 1/(x-4) - 1/(x-5) - 1/(x-6);
  return p;
}

#define EnableDuffDevice 1
#define DuffCount 2

#if DuffCount > 10
#define EnableDuffDevice 0
#endif

#define DuffCase(n) case n: result = digamma(d += 1.0)

double timeit()
{
  double d;
  double result;

  d = 10000.0;
  while (d < 10000000.0)
  {
#if EnableDuffDevice == 1
    switch ((int) d % DuffCount)
    {
    DuffCase (0);
#if DuffCount > 9
    DuffCase (9);
#endif
#if DuffCount > 8
    DuffCase (8);
#endif
#if DuffCount > 7
    DuffCase (7);
#endif
#if DuffCount > 6
    DuffCase (6);
#endif
#if DuffCount > 5
    DuffCase (5);
#endif
#if DuffCount > 4
    DuffCase (4);
#endif
#if DuffCount > 3
    DuffCase (3);
#endif
#if DuffCount > 2
    DuffCase (2);
#endif
#if DuffCount > 1
    DuffCase (1);
#endif
    }
#else
    result = digamma(d += 1.0);
#endif
  }
  return result;
}

main()
{
  timeit();
}
On my system (PPC G4 1Ghz MacOSX), the original gave the following times
time ./digamma
real 0m6.033s
user 0m5.782s
sys 0m0.039s
My version gives:
time ./digamma2
real 0m0.115s
user 0m0.076s
sys 0m0.007s
Play with the DuffCount to find a setup where the code in the loop fits
in L1 or at worst L2 cache.
I would be interested to know how this version does on a newer faster
machine. (Assuming I haven't made a horribly stupid coding error)
Ts
Douglas Crosher wrote:
> File: digamma.c
> -=-
> #include "math.h"
>
> static inline double digamma(double x)
> {
> double p;
> x=x+6;
> p=1/(x*x);
> p=(((0.004166666666667*p-0.003968253986254)*p+
> 0.008333333333333)*p-0.083333333333333)*p;
> p=p+log(x)-0.5/x-1/(x-1)-1/(x-2)-1/(x-3)-1/(x-4)-1/(x-5)-1/(x-6);
> return p;
> }
>
> double timeit()
> {
> int i;
> double result;
>
> for (i = 10000; i < 10000000; i++)
> {
> result = digamma((double) i);
> }
>
> return result;
> }
>
> main()
> {
> timeit();
> }
> -=-
>
> cc -O3 -o digamma digamma.c -lm
> time digamma
> 1.624u 0.004s 0:01.61 100.6% 0+0k 0+0io 0pf+0w
In article <·····························@KNOLOGY.NET>,
Terry Sullivan <··············@drifter.net> wrote:
> Just to make sure that the challenge isn't using unoptimized C to
> compare with, I played arount with the C version a bit just to see if I
> could make it any faster and came up with this:
[snip]
> On my system (PPC G4 1Ghz MacOSX), the original gave the following times
> time ./digamma
> real 0m6.033s
>
> The my version gives:
> time ./digamma2
> real 0m0.115s
This is a good object lesson in that you must be very careful benchmarking
things with modern optimizing compilers - you've managed to confuse the
compiler into "optimizing" the whole algorithm away!
%% 154 tiamat:~/projects/c/scratch
:; gcc -O3 -Wall -ffast-math -o digamma2-orig digamma2-orig.c -lm
digamma2-orig.c:74: warning: return type defaults to ‘int’
digamma2-orig.c: In function ‘main’:
digamma2-orig.c:76: warning: control reaches end of non-void function
digamma2-orig.c: In function ‘timeit’:
digamma2-orig.c:27: warning: ‘result’ may be used uninitialized in this function
Notice the last warning there.
Let's try something here:
:; valgrind --tool=cachegrind ./digamma2-orig
==28188== Cachegrind, an I1/D1/L2 cache profiler.
[...]
==28188== I refs: 65,061,501
==28188== I1 misses: 541
==28188== L2i misses: 536
Only 65 million instruction fetches? Let's see, with 9990000 runs of
digamma2 that makes for:
CL-USER> (float (/ 65061501 9990000))
6.512663
about 6.5 instruction fetches per loop! I submit that it's impossible
to compile the algorithm down to six amd64 instructions.
Stepping with GDB, the main() loop looks like this:
1: x/i $rip 0x400510 <main+48>: cvttsd2si %xmm0,%eax
1: x/i $rip 0x400514 <main+52>: mov %eax,%edx
1: x/i $rip 0x400516 <main+54>: shr $0x1f,%edx
1: x/i $rip 0x400519 <main+57>: add %edx,%eax
1: x/i $rip 0x40051b <main+59>: and $0x1,%eax
1: x/i $rip 0x40051e <main+62>: sub %edx,%eax
1: x/i $rip 0x400520 <main+64>: jne 0x400500 <main+32>
1: x/i $rip 0x400522 <main+66>: mov %rcx,0xfffffffffffffff8(%rsp)
1: x/i $rip 0x400527 <main+71>: movlpd 0xfffffffffffffff8(%rsp),%xmm1
1: x/i $rip 0x40052d <main+77>: addsd %xmm1,%xmm0
1: x/i $rip 0x400531 <main+81>: addsd %xmm1,%xmm0
1: x/i $rip 0x400535 <main+85>: comisd %xmm2,%xmm0
1: x/i $rip 0x400539 <main+89>: jb 0x400510 <main+48>
1: x/i $rip 0x400510 <main+48>: cvttsd2si %xmm0,%eax
1: x/i $rip 0x400514 <main+52>: mov %eax,%edx
1: x/i $rip 0x400516 <main+54>: shr $0x1f,%edx
1: x/i $rip 0x400519 <main+57>: add %edx,%eax
1: x/i $rip 0x40051b <main+59>: and $0x1,%eax
1: x/i $rip 0x40051e <main+62>: sub %edx,%eax
1: x/i $rip 0x400520 <main+64>: jne 0x400500 <main+32>
1: x/i $rip 0x400522 <main+66>: mov %rcx,0xfffffffffffffff8(%rsp)
1: x/i $rip 0x400527 <main+71>: movlpd 0xfffffffffffffff8(%rsp),%xmm1
1: x/i $rip 0x40052d <main+77>: addsd %xmm1,%xmm0
1: x/i $rip 0x400531 <main+81>: addsd %xmm1,%xmm0
1: x/i $rip 0x400535 <main+85>: comisd %xmm2,%xmm0
1: x/i $rip 0x400539 <main+89>: jb 0x400510 <main+48>
[repeat]
There's nothing left of the algorithm here.
Indeed, with this patch (which sums the results from inside the inline
function, thereby forcing the code to be compiled in), you'll see that
the Duff's Device case is just as slow as the naive loop:
--- digamma2-orig.c 2006-10-21 16:26:46.000000000 -0500
+++ digamma2.c 2006-10-21 16:25:04.000000000 -0500
@@ -1,13 +1,23 @@
+#include "stdio.h"
#include "math.h"
+double tot_x = 0.0;
+double tot_p = 0.0;
+
+#define SumResults 1
+
static inline double digamma(double x)
{
double p;
+ tot_x += x;
x=x+6;
p=1/(x*x);
p=(((0.004166666666667*p-0.003968253986254)*p+
0.008333333333333)*p-0.083333333333333)*p;
p=p+log(x)-0.5/x-1/(x-1)-1/(x-2)-1/(x-3)-1/(x-4)-1/(x-5)-1/(x-6);
+ #if SumResults == 1
+ tot_p += p;
+ #endif
return p;
}
@@ -66,6 +76,9 @@
#endif
}
+ printf("tot_x: %f\n", tot_x);
+ printf("tot_p: %f\n", tot_p);
+
return result;
}
-bcd
Given the speed difference that I was seeing between the new and the
old, I wondered if something like that wasn't taking place. I didn't
make changes in the digamma function itself, because that would have
been changing the algorithm being benchmarked. I just wanted to
eliminate inefficiencies in the supporting code.
Brian Downing wrote:
> This is a good object lesson in that you must be very careful benchmarking
> things with modern optimizing compilers - you've managed to confuse the
> compiler into "optimizing" the whole algorithm away!
>
> [snip warnings, cachegrind and GDB output, and the tot_p patch]
With these changes in both my code and the original, minus the argument
summation, I now get:
Original Code:
time ./digamma
tot_p: 151098846.198064
Result = 16.118096
real 0m6.081s
user 0m5.892s
sys 0m0.036s
My Code with Duffs @ 2:
time ./digamma2
tot_p: 151098853.105869
Result = 16.118096
real 0m5.955s
user 0m5.581s
sys 0m0.041s
My code with out Duffs:
time ./digamma2
tot_p: 151098853.105869
Result = 16.118096
real 0m5.845s
user 0m5.631s
sys 0m0.037s
So I would agree, Duff's device doesn't actually help much in this case
on my CPU.
Oh well.
Had fun anyway.
Ts
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: The epic floating-point battle
Date:
Message-ID: <87bqo5iv4m.fsf@qrnik.zagroda>
Douglas Crosher <···@scieneer.com> writes:
> The Scieneer Common Lisp is 10% faster than optimized C code for your
> problem.
Adding -ffast-math to gcc invocation makes the C code 10% faster on my
machine.
--
__("< Marcin Kowalczyk
\__/ ······@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Douglas Crosher wrote:
> The Scieneer Common Lisp is 10% faster than optimized C code for your
> problem. The Scieneer Common Lisp takes advantage of SSE2 and SSE3
> instructions but does not yet take advantage of SIMD operations.
> Extended precision floating point is also supported using the x87 FPU
> instructions. The results below were run on a 2000MHz Athlon 64.
Say, you wouldn't happen to have a Scieneer Common Lisp manual
on your hands, would you? I'm working on a project that I'm trying
to make as portable as possible, and I'd like to know if and how SCL
supports static arrays (i.e. memory pinned and not moved around
when the system GC's) and/or turning off GC temporarily.
Best,
mfh
············@gmail.com wrote:
> Douglas Crosher wrote:
>> The Scieneer Common Lisp is 10% faster than optimized C code for your
>> problem. The Scieneer Common Lisp takes advantage of SSE2 and SSE3
>> instructions but does not yet take advantage of SIMD operations.
>> Extended precision floating point is also supported using the x87 FPU
>> instructions. The results below were run on a 2000MHz Athlon 64.
>
> Say, you wouldn't happen to have a Scieneer Common Lisp manual
> on your hands, would you? I'm working on a project that I'm trying
> to make as portable as possible, and I'd like to know if and how SCL
> supports static arrays (i.e. memory pinned and not moved around
> when the system GC's) and/or turning off GC temporarily.
The recommended approach is to use the 'ext:with-pinned-object macro which
pins the address of a lisp object within the context of the macro, and
by default also inhibits write protection of pages occupied by the object.
This allows other threads to execute garbage collection while the current
thread is executing the foreign function. For example:
(defun example (array)
(ext:with-pinned-object (array)
(let ((sap (sys:vector-sap array)))
... foreign call that may read or write directly to the lisp array ...
)))
If the foreign function does not write to the lisp object then the following
faster form may be used:
(ext:with-pinned-object (array :inhibit-write-protection nil) ...)
Please follow up in private if you have any more questions.
Regards
Douglas Crosher
For the record, ACL 8.0 turns out to be 40% faster on my floating point
benchmark.
See http://article.gmane.org/gmane.lisp.lispworks.general/6093
; cpu time (non-gc) 1,940 msec user, 0 msec system
; cpu time (gc) 0 msec user, 0 msec system
; cpu time (total) 1,940 msec user, 0 msec system
; real time 1,947 msec
; space allocation:
; 0 cons cells, 0 other bytes, 0 static bytes