From: Shyamal Prasad
Subject: Please help me decipher a message from the SBCL optimizer
Date:
Message-ID: <87mzcrovdy.fsf@turtle.local>
Hi,
There is this trivial bit of recursive code that compiles nicely when
using fixnums (SBCL 0.9.12 on a Pentium processor, Debian etch):
(declaim (optimize (speed 3) (compilation-speed 0) (safety 0) (debug 0)))
(defun fib (n)
(declare (fixnum n))
(the fixnum
(if (< n 2)
1
(+ (fib (- n 1)) (fib (- n 2))))))
and works within 2x of an equivalent gcc program
* (time (fib 38))
Evaluation took:
5.375 seconds of real time
5.374183 seconds of user run time
0.0 seconds of system run time
0 page faults and
0 bytes consed.
63245986
but if I do the same with double-floats
(defun fib-d (n)
(declare (double-float n))
(the double-float
(if (< n 2.0d0)
1.0d0
(+ (fib-d (- n 1.0d0)) (fib-d (- n 2.0d0))))))
I get
; in: LAMBDA NIL
; (SB-INT:NAMED-LAMBDA FIB-D
; (N)
; (DECLARE (DOUBLE-FLOAT N))
; (BLOCK FIB-D
; (THE DOUBLE-FLOAT (IF (< N 2.0d0) 1.0d0 (+ # #)))))
; ==>
; #'(SB-INT:NAMED-LAMBDA FIB-D
; (N)
; (DECLARE (DOUBLE-FLOAT N))
; (BLOCK FIB-D
; (THE DOUBLE-FLOAT (IF (< N 2.0d0) 1.0d0 (+ # #)))))
;
; note: doing float to pointer coercion (cost 13) to "<return value>"
;
; compilation unit finished
; printed 1 note
And it leads to very poor relative performance ([1])
* (time (fib-d 38.0d0))
Evaluation took:
23.816 seconds of real time
23.666403 seconds of user run time
0.104984 seconds of system run time
0 page faults and
2,023,874,552 bytes consed.
6.3245986d7
Can some one please help me understand why the pointer coercion is
happening with double-floats and not with the fixnums? I get the same
thing with CMUCL. Am I doing something truly broken?
Best regards,
Shyamal
[1] Yes, I understand the recursive code is sub-optimal. This is from
http://shootout.alioth.debian.org/gp4/benchmark.php?test=recursive&lang=all
and the benchmark requires it be written that way to make like to like
comparisons.
Shyamal Prasad wrote:
>
> Can some one please help me understand why the pointer coercion is
> happening with double-floats and not with the fixnums?
Fixnums fit inside a single machine register (usually 32 bits), and
carry tag bits and the value of the fixnum. Returning a double-float
requires that tag bits are kept somewhere. I suspect that your
floating point value is being boxed (a new object is created,
containing tag bits and a pointer to the double-float), hence the
; note: doing float to pointer coercion (cost 13) to "<return value>"
output. I think that Kalle's code works because SBCL can determine
exactly where the lexically scoped FIB-D13* is called from and returns
to, and has figured out that it doesn't need to box/unbox the
double-float.
Cheers
Brad
From: Kalle Olavi Niemitalo
Subject: Re: Please help me decipher a message from the SBCL optimizer
Date:
Message-ID: <874pyy22f8.fsf@Astalo.kon.iki.fi>
Shyamal Prasad <·············@verizon.net> writes:
> (defun fib-d (n)
> (declare (double-float n))
> (the double-float
> (if (< n 2.0d0)
> 1.0d0
> (+ (fib-d (- n 1.0d0)) (fib-d (- n 2.0d0))))))
;; SBCL 0.9.5.50
(time (fib-d 38.0d0))
;; Evaluation took:
;; 33.956 seconds of real time
;; 28.02874 seconds of user run time
;; 4.219359 seconds of system run time
;; 0 page faults and
;; 2,023,882,960 bytes consed.
;; 6.3245986d+7
(defun fib-d13 (n)
(labels ((fib-d13* (n)
(declare (double-float n))
(the double-float
(if (< n 2.0d0)
1.0d0
(+ (fib-d13* (- n 1.0d0)) (fib-d13* (- n 2.0d0)))))))
(declare (inline fib-d13*))
(fib-d13* n)))
(time (fib-d13 38.0d0))
;; Evaluation took:
;; 3.426 seconds of real time
;; 3.273503 seconds of user run time
;; 0.004999 seconds of system run time
;; 0 page faults and
;; 0 bytes consed.
;; 6.3245986d+7
Before FIB-D13, I tried the following versions which also cons
very little, but they were slower and definitely uglier.
Besides, FIB-D4 will suffer badly from rounding errors if the
total gets too large.
(defun fib-d4 (n &aux (total (make-array '() :element-type 'double-float
:initial-element 0.0d0)))
(labels ((fib-d4* (n)
(declare (double-float n))
(cond ((< n 2.0d0) (incf (aref total) 1.0d0))
(t (fib-d4* (- n 1.0d0)) (fib-d4* (- n 2.0d0))))))
(fib-d4* n)
(aref total)))
(defun fib-d8 (n &aux (ret (make-array '() :element-type 'double-float)))
(labels ((fib-d8* (n)
(declare (double-float n))
(if (< n 2.0d0)
(setf (aref ret) 1.0d0)
(let ((a (progn (fib-d8* (- n 1.0d0)) (aref ret))))
(fib-d8* (- n 2.0d0))
(incf (aref ret) a)))))
(fib-d8* n)
(aref ret)))
>
> (defun fib-d13 (n)
> (labels ((fib-d13* (n)
> (declare (double-float n))
> (the double-float
> (if (< n 2.0d0)
> 1.0d0
> (+ (fib-d13* (- n 1.0d0)) (fib-d13* (- n 2.0d0)))))))
> (declare (inline fib-d13*))
> (fib-d13* n)))
>
> (time (fib-d13 38.0d0))
> ;; Evaluation took:
> ;; 3.426 seconds of real time
> ;; 3.273503 seconds of user run time
> ;; 0.004999 seconds of system run time
> ;; 0 page faults and
> ;; 0 bytes consed.
> ;; 6.3245986d+7
>
Does this mean that the original version posted by Shyamal was boxing
and unboxing the double-float return value? I guess this makes sense
as you need to return type info with the return. So what is the change
that actually causes improvement? I suspect it is that FIB-D13* is now
lexically scoped inside FIB-D13 and the compiler can do a fuller
analysis of the code. Can SBCL now work out that it doesn't need to
box the return value?
I'm just guessing here, am I on the right track?
Cheers
Brad
From: Kalle Olavi Niemitalo
Subject: Re: Please help me decipher a message from the SBCL optimizer
Date:
Message-ID: <87fyiiuj9h.fsf@Astalo.kon.iki.fi>
"bradb" <··············@gmail.com> writes:
> Does this mean that the original version posted by Shyamal was boxing
> and unboxing the double-float return value?
It sure looks that way. Comments after dashes are mine.
I have little experience with FPU instructions, so there may be errors.
* (disassemble 'fib-d)
-- Test (< n 2.0d0):
; 090D5D5B: L0: 8B05D45C0D09 MOV EAX, [#x90D5CD4] ; 2.0d0
; no-arg-parsing entry point
; D61: DDD8 FSTPD FR0 -- FR0=N
; D63: D9C0 FLDD FR0 -- FR0=N FR1=N
; D65: DC5001 FCOMD [EAX+1]
; D68: DFE0 FNSTSW
; D6A: 80E445 AND AH, 69
; D6D: 80FC01 CMP AH, 1
; D70: 0F849E000000 JEQ L3
-- Apparently (not (< n 2.0d0)). Store the value of N to [EBP-24]:
; D76: D9C9 FXCH FR1 -- FR0=N FR1=N
; D78: DD55E8 FSTD [EBP-24]
; D7B: D9C9 FXCH FR1 -- FR0=N FR1=N
-- Compute #1=(- n 1.0d0):
; D7D: DDD8 FSTPD FR0 -- FR0=N
; D7F: D9E8 FLD1 -- FR0=1 FR1=N
; D81: D8E9 FSUBRD FR1 -- FR0=#1# FR1=N
; D83: 9B WAIT
-- Call #2=(fib-d #1#) and return to 090D5D9B:
; D84: 8BCD MOV ECX, EBP
; D86: 8BC4 MOV EAX, ESP
; D88: 83EC20 SUB ESP, 32
; D8B: DDD1 FSTD FR1 -- FR0=#1# FR1=#1#
; D8D: 8948FC MOV [EAX-4], ECX
; D90: 8BE8 MOV EBP, EAX
; D92: C745F89B5D0D09 MOV DWORD PTR [EBP-8], 151870875
; D99: EBC0 JMP L0
-- Retrieve the value of N, stored above: -- FR0=?0 FR1=?1
; D9B: DDD9 FSTPD FR1 -- FR0=?0
; D9D: DD45E8 FLDD [EBP-24] -- FR0=N FR1=?0
; DA0: D9C9 FXCH FR1 -- FR0=?0 FR1=N
; DA2: DDD8 FSTPD FR0 -- FR0=N
-- Unbox #2# from the return-value register EAX and store to [EBP-16]:
; DA4: DD4001 FLDD [EAX+1] -- FR0=#2# FR1=N
; DA7: DD55F0 FSTD [EBP-16]
-- Compute #3=(- n 2.0d0):
; DAA: 8B05D45C0D09 MOV EAX, [#x90D5CD4] ; 2.0d0
; DB0: DDD8 FSTPD FR0 -- FR0=N
; DB2: DD4001 FLDD [EAX+1] -- FR0=2 FR1=N
; DB5: DCE9 FSUB-STI FR1 -- FR0=2 FR1=#3#
; DB7: 9B WAIT
-- Call #4=(fib-d #3#) and return to 090D5DCD:
; DB8: 8BCD MOV ECX, EBP
; DBA: 8BC4 MOV EAX, ESP
; DBC: 83EC20 SUB ESP, 32
; DBF: 8948FC MOV [EAX-4], ECX
; DC2: 8BE8 MOV EBP, EAX
; DC4: C745F8CD5D0D09 MOV DWORD PTR [EBP-8], 151870925
; DCB: EB8E JMP L0
-- Unbox #4# and compute #5=(+ #2# #4#): -- FR0=?2 FR1=?3
; DCD: DDD8 FSTPD FR0 -- FR0=?3
; DCF: DD4001 FLDD [EAX+1] -- FR0=#4# FR1=?3
; DD2: DC45F0 FADDD [EBP-16] -- FR0=#5# FR1=?3
; DD5: 9B WAIT
-- Prepare a pseudo-atomic operation:
; DD6: L1: 64 BYTE #X64
; DD7: C6054C00000000 MOV BYTE PTR [#x4C], 0
; DDE: 64 BYTE #X64
; DDF: C6054800000004 MOV BYTE PTR [#x48], 4
-- Allocate a box for #5# or 1.0d0 from the heap:
; DE6: E8E172F8FE CALL #x805D0CC ; alloc_16_to_eax
-- Store a tag and either #5# or 1.0d0 in the box:
; DEB: C70016030000 MOV DWORD PTR [EAX], 790
; DF1: 8D4007 LEA EAX, [EAX+7]
; DF4: DD5001 FSTD [EAX+1]
-- Finish the pseudo-atomic operation:
; DF7: 64 BYTE #X64
; DF8: C6054800000000 MOV BYTE PTR [#x48], 0
; DFF: 64 BYTE #X64
; E00: 803D4C00000000 CMP BYTE PTR [#x4C], 0
; E07: 7402 JEQ L2
; E09: CC09 BREAK 9 ; pending interrupt trap
-- Return the address of the box, with the appropriate tag:
; E0B: L2: 8D65F8 LEA ESP, [EBP-8]
; E0E: 8B6DFC MOV EBP, [EBP-4]
; E11: C20400 RET 4
-- Apparently (< n 2.0d0). Go box 1.0d0 instead:
; E14: L3: DDD8 FSTPD FR0 -- FR0=N
; E16: D9E8 FLD1 -- FR0=1 FR1=N
; E18: EBBC JMP L1
; E1A: 90 NOP
; E1B: 90 NOP
; E1C: 90 NOP
; E1D: 90 NOP
; E1E: 90 NOP
; E1F: 90 NOP
;
NIL
Note that the 1.0d0 gets boxed anew every time, even though it
would be more efficient to preallocate a boxed 1.0d0 at compile
time and then keep using it. On SBCL 0.9.5.50, the following
variant does just that and takes about 40% less time than the
original FIB-D. I don't know if later versions of SBCL manifest
this difference.
(defun fib-d15 (n)
(declare (double-float n))
(if (< n 2.0d0)
(load-time-value 1.0d0)
(+ (the double-float (fib-d15 (- n 1.0d0)))
(the double-float (fib-d15 (- n 2.0d0))))))
> So what is the change that actually causes improvement?
The INLINE declaration appears to be crucial.
From: Shyamal Prasad
Subject: Re: Please help me decipher a message from the SBCL optimizer
Date:
Message-ID: <87pshlhgl0.fsf@turtle.local>
>>>>> "Kalle" == Kalle Olavi Niemitalo <···@iki.fi> writes:
Kalle> "bradb" <··············@gmail.com> writes:
>> Does this mean that the original version posted by Shyamal was
>> boxing and unboxing the double-float return value?
Kalle> It sure looks that way. Comments after dashes are mine. I
>> So what is the change that actually causes improvement?
Kalle> The INLINE declaration appears to be crucial.
Thanks folks for the help and explanations. The need to box up the
double seems to make perfect sense now that you've explained it to me!
It does seem the inline is crucial - I'd actually got as far as the
enclosed defun but never thought to inline a recursive functions :-)
It does the trick on 0.9.12 too.
Cheers!
Shyamal
Kalle Olavi Niemitalo <···@iki.fi> wrote:
>> So what is the change that actually causes improvement?
>
> The INLINE declaration appears to be crucial.
Not quite so.
(defun fib-d (n)
(declare (double-float n)
(optimize (speed 3) (safety 0)))
(labels ((rec (n)
(declare (double-float n))
(the double-float
(if (< n 2)
1.0d0
(+ (rec (- n 1)) (rec (- n 2)))))))
(+ (rec n) 0)))
; Evaluation took:
; 2.77 seconds of real time
; 2.73 seconds of user run time
; 0.0 seconds of system run time
; 7,353,025,320 CPU cycles
; 0 page faults and
; 16 bytes consed.
Ideas: If I put the recursive computation in a local function, then the
compiler will do its local-call optimizations and keep all the arguments
and values unboxed. But this doesn't actually work: the compiler
refuses to unbox the return value from REC. I worked out why in the
shower: CMUCL does tail-call optimization, so the value of REC is
/exactly/ the value of FIB-D, so REC must box its output. The pointless
addition of zero is enough to dissuade the tail-call optimizer (which is
done very early on by the compiler), which leaves the local-call
optimizer free to unbox the return value of REC in the knowledge that
the outer FIB-D frame will still exist to do the boxing. The
disassembly suggests that the pointless addition is elided.
Note that I've simplified some of the constants to integers without
ill-effects: I find this makes the code easier to read.
-- [mdw]
From: Shyamal Prasad
Subject: Re: Please help me decipher a message from the SBCL optimizer
Date:
Message-ID: <87d5dkhhnb.fsf@turtle.local>
>>>>> "Mark" == Mark Wooding <···@distorted.org.uk> writes:
Mark> Kalle Olavi Niemitalo <···@iki.fi> wrote:
>>> So what is the change that actually causes improvement?
>> The INLINE declaration appears to be crucial.
Mark> Not quite so.
Mark> unbox the return value from REC. I worked out why in the
Mark> shower: CMUCL does tail-call optimization, so the value of
Mark> REC is /exactly/ the value of FIB-D, so REC must box its
Mark> output. The pointless addition of zero is enough to
Mark> dissuade the tail-call optimizer (which is done very early
Mark> on by the compiler), which leaves the local-call optimizer
Mark> free to unbox the return value of REC in the knowledge that
Mark> the outer FIB-D frame will still exist to do the boxing.
Mark> The disassembly suggests that the pointless addition is
Mark> elided.
Sounds good, except SBCL 0.9.12 does not seem to pick up on the
idea. Without the inline it just wont unbox the return value of the
local function:
(defun fib-d (n)
(declare (double-float n)
(optimize (speed 3) (safety 0)))
(labels ((rec (n)
(declare (double-float n))
(the double-float
(if (< n 2)
1.0d0
(+ (rec (- n 1)) (rec (- n 2)))))))
(+ (rec n) 0)))
* (time (fib-d 38.0d0))
Evaluation took:
15.585 seconds of real time
14.68 seconds of user run time
0.895 seconds of system run time
0 page faults and
2,023,873,280 bytes consed.
6.3245986d7
But if I add the (declare (inline rec))
* (time (fib-d 38.0d0))
Evaluation took:
2.237 seconds of real time
2.216 seconds of user run time
0.022 seconds of system run time
0 page faults and
0 bytes consed.
6.3245986d7
I've not yet tried it on CMUCL (I run a powerpc on my desk so I don't
have CMUCL available just right now), so this sounds like it is very
implementation specific.
Cheers!
Shyamal
Shyamal Prasad wrote:
>>>>>> "Mark" == Mark Wooding <···@distorted.org.uk> writes:
>
> Mark> Kalle Olavi Niemitalo <···@iki.fi> wrote:
> >>> So what is the change that actually causes improvement?
> >> The INLINE declaration appears to be crucial.
>
> Mark> Not quite so.
>
> Mark> unbox the return value from REC. I worked out why in the
> Mark> shower: CMUCL does tail-call optimization, so the value of
> Mark> REC is /exactly/ the value of FIB-D, so REC must box its
> Mark> output. The pointless addition of zero is enough to
> Mark> dissuade the tail-call optimizer (which is done very early
> Mark> on by the compiler), which leaves the local-call optimizer
> Mark> free to unbox the return value of REC in the knowledge that
> Mark> the outer FIB-D frame will still exist to do the boxing.
> Mark> The disassembly suggests that the pointless addition is
> Mark> elided.
>
> Sounds good, except SBCL 0.9.12 does not seem to pick up on the
> idea. Without the inline it just wont unbox the return value of the
> local function:
>
> (defun fib-d (n)
> (declare (double-float n)
> (optimize (speed 3) (safety 0)))
> (labels ((rec (n)
> (declare (double-float n))
> (the double-float
> (if (< n 2)
> 1.0d0
> (+ (rec (- n 1)) (rec (- n 2)))))))
> (+ (rec n) 0)))
>
> * (time (fib-d 38.0d0))
> Evaluation took:
> 15.585 seconds of real time
> 14.68 seconds of user run time
> 0.895 seconds of system run time
> 0 page faults and
> 2,023,873,280 bytes consed.
> 6.3245986d7
>
> But if I add the (declare (inline rec))
>
> * (time (fib-d 38.0d0))
> Evaluation took:
> 2.237 seconds of real time
> 2.216 seconds of user run time
> 0.022 seconds of system run time
> 0 page faults and
> 0 bytes consed.
> 6.3245986d7
>
> I've not yet tried it on CMUCL (I run a powerpc on my desk so I don't
> have CMUCL available just right now), so this sounds like it is very
> implementation specific.
>
> Cheers!
> Shyamal
I think the culprit is that SBCL does boxing and tags the values
on the return from the arithmetic operation and there is no way to
convey that one expects "float overflow" as with modular arithmetic
[(the ... ...) doesn't help].
If I try the same with single-float declarations, there are no spurious
compiler notes and the speed is reasonable as well.
Shyamal Prasad <·············@verizon.net> wrote:
> Sounds good, except SBCL 0.9.12 does not seem to pick up on the
> idea. Without the inline it just wont unbox the return value of the
> local function.
Yeuch. SBCL is too clever. If I add 1 at the end, it unboxes
correctly; adding 0 isn't good enough. Sorry :-(
-- [mdw]
Mark Wooding <···@distorted.org.uk> wrote:
> Shyamal Prasad <·············@verizon.net> wrote:
>
>> Sounds good, except SBCL 0.9.12 does not seem to pick up on the
>> idea. Without the inline it just wont unbox the return value of the
>> local function.
>
> Yeuch. SBCL is too clever. If I add 1 at the end, it unboxes
> correctly; adding 0 isn't good enough. Sorry :-(
Not that I approve of this technique you're discussing, but... Adding
a 0d0 would probably work, since it can't be optimized away, but nor
would it change the results in this case.
--
Juho Snellman