From: Shyamal Prasad
Subject: Please help me decipher a message from the SBCL optimizer
Date: 
Message-ID: <87mzcrovdy.fsf@turtle.local>
Hi,

There is this trivial bit of recursive code that compiles nicely when
using fixnums (SBCL 0.9.12 on a Pentium processor, Debian etch):

(declaim (optimize (speed 3) (compilation-speed 0) (safety 0) (debug 0)))

(defun fib (n)
  (declare (fixnum n))
  (the fixnum
    (if (< n 2)
	1
      (+ (fib (- n 1)) (fib (- n 2))))))

and works within 2x of an equivalent gcc program

* (time (fib 38))
Evaluation took:
  5.375 seconds of real time
  5.374183 seconds of user run time
  0.0 seconds of system run time
  0 page faults and
  0 bytes consed.
63245986


but if I do the same with double-floats

(defun fib-d (n)
  (declare (double-float n))
  (the double-float
    (if (< n 2.0d0)
	1.0d0
        (+ (fib-d (- n 1.0d0)) (fib-d (- n 2.0d0))))))

I get

; in: LAMBDA NIL
;     (SB-INT:NAMED-LAMBDA FIB-D
;                        (N)
;                        (DECLARE (DOUBLE-FLOAT N))
;                        (BLOCK FIB-D
;                          (THE DOUBLE-FLOAT (IF (< N 2.0d0) 1.0d0 (+ # #)))))
; ==>
;   #'(SB-INT:NAMED-LAMBDA FIB-D
;                          (N)
;                          (DECLARE (DOUBLE-FLOAT N))
;                          (BLOCK FIB-D
;                            (THE DOUBLE-FLOAT (IF (< N 2.0d0) 1.0d0 (+ # #)))))
; 
; note: doing float to pointer coercion (cost 13) to "<return value>"
; 
; compilation unit finished
;   printed 1 note

And it leads to very poor relative performance ([1])

* (time (fib-d 38.0d0))
Evaluation took:
  23.816 seconds of real time
  23.666403 seconds of user run time
  0.104984 seconds of system run time
  0 page faults and
  2,023,874,552 bytes consed.
6.3245986d7

Can some one please help me understand why the pointer coercion is
happening with double-floats and not with the fixnums? I get the same
thing with CMUCL. Am I doing something truly broken?

Best regards,
Shyamal

[1] Yes, I understand the recursive code is sub-optimal. This is from
http://shootout.alioth.debian.org/gp4/benchmark.php?test=recursive&lang=all
and the benchmark requires it be written that way to make like to like
comparisons.

From: bradb
Subject: Re: Please help me decipher a message from the SBCL optimizer
Date: 
Message-ID: <1149613126.035432.81310@u72g2000cwu.googlegroups.com>
Shyamal Prasad wrote:
>
> Can some one please help me understand why the pointer coercion is
> happening with double-floats and not with the fixnums?

Fixnums fit inside a single machine register (usually 32 bits), and
carry tag bits and the value of the fixnum.  Returning a double-float
requires that tag bits are kept somewhere.  I suspect that your
floating point value is being boxed (a new object is created,
containing tag bits and a pointer to the double-float), hence the
; note: doing float to pointer coercion (cost 13) to "<return value>"
output.  I think that Kalle's code works because SBCL can determine
exactly where the lexically scoped FIB-D13* is called from and returns
to, and has figured out that it doesn't need to box/unbox the
double-float.

Cheers
Brad
From: Kalle Olavi Niemitalo
Subject: Re: Please help me decipher a message from the SBCL optimizer
Date: 
Message-ID: <874pyy22f8.fsf@Astalo.kon.iki.fi>
Shyamal Prasad <·············@verizon.net> writes:

> (defun fib-d (n)
>   (declare (double-float n))
>   (the double-float
>     (if (< n 2.0d0)
> 	1.0d0
>         (+ (fib-d (- n 1.0d0)) (fib-d (- n 2.0d0))))))

;; SBCL 0.9.5.50

(time (fib-d 38.0d0))
;; Evaluation took:
;;   33.956 seconds of real time
;;   28.02874 seconds of user run time
;;   4.219359 seconds of system run time
;;   0 page faults and
;;   2,023,882,960 bytes consed.
;; 6.3245986d+7

(defun fib-d13 (n)
  (labels ((fib-d13* (n)
             (declare (double-float n))
             (the double-float
               (if (< n 2.0d0)
                 1.0d0
                 (+ (fib-d13* (- n 1.0d0)) (fib-d13* (- n 2.0d0)))))))
    (declare (inline fib-d13*))
    (fib-d13* n)))

(time (fib-d13 38.0d0))
;; Evaluation took:
;;   3.426 seconds of real time
;;   3.273503 seconds of user run time
;;   0.004999 seconds of system run time
;;   0 page faults and
;;   0 bytes consed.
;; 6.3245986d+7

Before FIB-D13, I tried the following versions which also cons
very little, but they were slower and definitely uglier.
Besides, FIB-D4 will suffer badly from rounding errors if the
total gets too large.

(defun fib-d4 (n &aux (total (make-array '() :element-type 'double-float
                                         :initial-element 0.0d0)))
  (labels ((fib-d4* (n)
             (declare (double-float n))
             (cond ((< n 2.0d0) (incf (aref total) 1.0d0))
                   (t (fib-d4* (- n 1.0d0)) (fib-d4* (- n 2.0d0))))))
    (fib-d4* n)
    (aref total)))

(defun fib-d8 (n &aux (ret (make-array '() :element-type 'double-float)))
  (labels ((fib-d8* (n)
             (declare (double-float n))
             (if (< n 2.0d0)
               (setf (aref ret) 1.0d0)
               (let ((a (progn (fib-d8* (- n 1.0d0)) (aref ret))))
                 (fib-d8* (- n 2.0d0))
                 (incf (aref ret) a)))))
    (fib-d8* n)
    (aref ret)))
From: bradb
Subject: Re: Please help me decipher a message from the SBCL optimizer
Date: 
Message-ID: <1149612817.859307.17570@y43g2000cwc.googlegroups.com>
>
> (defun fib-d13 (n)
>   (labels ((fib-d13* (n)
>              (declare (double-float n))
>              (the double-float
>                (if (< n 2.0d0)
>                  1.0d0
>                  (+ (fib-d13* (- n 1.0d0)) (fib-d13* (- n 2.0d0)))))))
>     (declare (inline fib-d13*))
>     (fib-d13* n)))
>
> (time (fib-d13 38.0d0))
> ;; Evaluation took:
> ;;   3.426 seconds of real time
> ;;   3.273503 seconds of user run time
> ;;   0.004999 seconds of system run time
> ;;   0 page faults and
> ;;   0 bytes consed.
> ;; 6.3245986d+7
>

Does this mean that the original version posted by Shyamal was boxing
and unboxing the double-float return value?  I guess this makes sense
as you need to return type info with the return.  So what is the change
that actually causes improvement?  I suspect it is that FIB-D13* is now
lexically scoped inside FIB-D13 and the compiler can do a fuller
analysis of the code.  Can SBCL now work out that it doesn't need to
box the return value?

I'm just guessing here, am I on the right track?

Cheers
Brad
From: Kalle Olavi Niemitalo
Subject: Re: Please help me decipher a message from the SBCL optimizer
Date: 
Message-ID: <87fyiiuj9h.fsf@Astalo.kon.iki.fi>
"bradb" <··············@gmail.com> writes:

> Does this mean that the original version posted by Shyamal was boxing
> and unboxing the double-float return value?

It sure looks that way.  Comments after dashes are mine.
I have little experience with FPU instructions, so there may be errors.

* (disassemble 'fib-d)

-- Test (< n 2.0d0):
; 090D5D5B: L0:   8B05D45C0D09     MOV EAX, [#x90D5CD4]       ; 2.0d0
                                                              ; no-arg-parsing entry point
;      D61:       DDD8             FSTPD FR0        --         FR0=N
;      D63:       D9C0             FLDD FR0         -- FR0=N   FR1=N
;      D65:       DC5001           FCOMD [EAX+1]
;      D68:       DFE0             FNSTSW
;      D6A:       80E445           AND AH, 69
;      D6D:       80FC01           CMP AH, 1
;      D70:       0F849E000000     JEQ L3
-- Apparently (not (< n 2.0d0)).  Store the value of N to [EBP-24]:
;      D76:       D9C9             FXCH FR1         -- FR0=N   FR1=N
;      D78:       DD55E8           FSTD [EBP-24]
;      D7B:       D9C9             FXCH FR1         -- FR0=N   FR1=N
-- Compute #1=(- n 1.0d0):
;      D7D:       DDD8             FSTPD FR0        --         FR0=N
;      D7F:       D9E8             FLD1             -- FR0=1   FR1=N
;      D81:       D8E9             FSUBRD FR1       -- FR0=#1# FR1=N
;      D83:       9B               WAIT
-- Call #2=(fib-d #1#) and return to 090D5D9B:
;      D84:       8BCD             MOV ECX, EBP
;      D86:       8BC4             MOV EAX, ESP
;      D88:       83EC20           SUB ESP, 32
;      D8B:       DDD1             FSTD FR1         -- FR0=#1# FR1=#1#
;      D8D:       8948FC           MOV [EAX-4], ECX
;      D90:       8BE8             MOV EBP, EAX
;      D92:       C745F89B5D0D09   MOV DWORD PTR [EBP-8], 151870875
;      D99:       EBC0             JMP L0
-- Retrieve the value of N, stored above:           -- FR0=?0  FR1=?1
;      D9B:       DDD9             FSTPD FR1        --         FR0=?0
;      D9D:       DD45E8           FLDD [EBP-24]    -- FR0=N   FR1=?0
;      DA0:       D9C9             FXCH FR1         -- FR0=?0  FR1=N
;      DA2:       DDD8             FSTPD FR0        --         FR0=N
-- Unbox #2# from the return-value register EAX and store to [EBP-16]:
;      DA4:       DD4001           FLDD [EAX+1]     -- FR0=#2# FR1=N
;      DA7:       DD55F0           FSTD [EBP-16]
-- Compute #3=(- n 2.0d0):
;      DAA:       8B05D45C0D09     MOV EAX, [#x90D5CD4]       ; 2.0d0
;      DB0:       DDD8             FSTPD FR0        -- FR0=N
;      DB2:       DD4001           FLDD [EAX+1]     -- FR0=2  FR1=N
;      DB5:       DCE9             FSUB-STI FR1     -- FR0=2  FR1=#3#
;      DB7:       9B               WAIT
-- Call #4=(fib-d #3#) and return to 090D5DCD:
;      DB8:       8BCD             MOV ECX, EBP
;      DBA:       8BC4             MOV EAX, ESP
;      DBC:       83EC20           SUB ESP, 32
;      DBF:       8948FC           MOV [EAX-4], ECX
;      DC2:       8BE8             MOV EBP, EAX
;      DC4:       C745F8CD5D0D09   MOV DWORD PTR [EBP-8], 151870925
;      DCB:       EB8E             JMP L0
-- Unbox #4# and compute #5=(+ #2# #4#):            -- FR0=?2  FR1=?3
;      DCD:       DDD8             FSTPD FR0        --         FR0=?3
;      DCF:       DD4001           FLDD [EAX+1]     -- FR0=#4# FR1=?3
;      DD2:       DC45F0           FADDD [EBP-16]   -- FR0=#5# FR1=?3
;      DD5:       9B               WAIT
-- Prepare a pseudo-atomic operation:
;      DD6: L1:   64               BYTE #X64
;      DD7:       C6054C00000000   MOV BYTE PTR [#x4C], 0
;      DDE:       64               BYTE #X64
;      DDF:       C6054800000004   MOV BYTE PTR [#x48], 4
-- Allocate a box for #5# or 1.0d0 from the heap:
;      DE6:       E8E172F8FE       CALL #x805D0CC             ; alloc_16_to_eax
-- Store a tag and either #5# or 1.0d0 in the box:
;      DEB:       C70016030000     MOV DWORD PTR [EAX], 790
;      DF1:       8D4007           LEA EAX, [EAX+7]
;      DF4:       DD5001           FSTD [EAX+1]
-- Finish the pseudo-atomic operation:
;      DF7:       64               BYTE #X64
;      DF8:       C6054800000000   MOV BYTE PTR [#x48], 0
;      DFF:       64               BYTE #X64
;      E00:       803D4C00000000   CMP BYTE PTR [#x4C], 0
;      E07:       7402             JEQ L2
;      E09:       CC09             BREAK 9                    ; pending interrupt trap
-- Return the address of the box, with the appropriate tag:
;      E0B: L2:   8D65F8           LEA ESP, [EBP-8]
;      E0E:       8B6DFC           MOV EBP, [EBP-4]
;      E11:       C20400           RET 4
-- Apparently (< n 2.0d0).  Go box 1.0d0 instead:
;      E14: L3:   DDD8             FSTPD FR0        --         FR0=N
;      E16:       D9E8             FLD1             -- FR0=1   FR1=N
;      E18:       EBBC             JMP L1
;      E1A:       90               NOP
;      E1B:       90               NOP
;      E1C:       90               NOP
;      E1D:       90               NOP
;      E1E:       90               NOP
;      E1F:       90               NOP
; 
NIL

Note that the 1.0d0 gets boxed anew every time, even though it
would be more efficient to preallocate a boxed 1.0d0 at compile
time and then keep using it.  On SBCL 0.9.5.50, the following
variant does just that and takes about 40% less time than the
original FIB-D.  I don't know if later versions of SBCL manifest
this difference.

(defun fib-d15 (n)
  (declare (double-float n))
  (if (< n 2.0d0)
      (load-time-value 1.0d0)
      (+ (the double-float (fib-d15 (- n 1.0d0)))
         (the double-float (fib-d15 (- n 2.0d0))))))

> So what is the change that actually causes improvement?

The INLINE declaration appears to be crucial.
From: Shyamal Prasad
Subject: Re: Please help me decipher a message from the SBCL optimizer
Date: 
Message-ID: <87pshlhgl0.fsf@turtle.local>
>>>>> "Kalle" == Kalle Olavi Niemitalo <···@iki.fi> writes:

    Kalle> "bradb" <··············@gmail.com> writes:
    >> Does this mean that the original version posted by Shyamal was
    >> boxing and unboxing the double-float return value?
  
    Kalle> It sure looks that way.  Comments after dashes are mine.  I

    >> So what is the change that actually causes improvement?

    Kalle> The INLINE declaration appears to be crucial.

Thanks folks for the help and explanations. The need to box up the
double seems to make perfect sense now that you've explained it to me!

It does seem the inline is crucial - I'd actually got as far as the
enclosed defun but never thought to inline a recursive functions :-)
It does the trick on 0.9.12 too.

Cheers!
Shyamal
From: Mark Wooding
Subject: Re: Please help me decipher a message from the SBCL optimizer
Date: 
Message-ID: <slrne8d780.gab.mdw@metalzone.distorted.org.uk>
Kalle Olavi Niemitalo <···@iki.fi> wrote:

>> So what is the change that actually causes improvement?
>
> The INLINE declaration appears to be crucial.

Not quite so.

  (defun fib-d (n)
    (declare (double-float n)
	     (optimize (speed 3) (safety 0)))
    (labels ((rec (n)
	       (declare (double-float n))
	       (the double-float
		 (if (< n 2)
		     1.0d0
		     (+ (rec (- n 1)) (rec (- n 2)))))))
      (+ (rec n) 0)))

; Evaluation took:
;   2.77 seconds of real time
;   2.73 seconds of user run time
;   0.0 seconds of system run time
;   7,353,025,320 CPU cycles
;   0 page faults and
;   16 bytes consed.

Ideas: If I put the recursive computation in a local function, then the
compiler will do its local-call optimizations and keep all the arguments
and values unboxed.  But this doesn't actually work: the compiler
refuses to unbox the return value from REC.  I worked out why in the
shower: CMUCL does tail-call optimization, so the value of REC is
/exactly/ the value of FIB-D, so REC must box its output.  The pointless
addition of zero is enough to dissuade the tail-call optimizer (which is
done very early on by the compiler), which leaves the local-call
optimizer free to unbox the return value of REC in the knowledge that
the outer FIB-D frame will still exist to do the boxing.  The
disassembly suggests that the pointless addition is elided.

Note that I've simplified some of the constants to integers without
ill-effects: I find this makes the code easier to read.

-- [mdw]
From: Shyamal Prasad
Subject: Re: Please help me decipher a message from the SBCL optimizer
Date: 
Message-ID: <87d5dkhhnb.fsf@turtle.local>
>>>>> "Mark" == Mark Wooding <···@distorted.org.uk> writes:

    Mark> Kalle Olavi Niemitalo <···@iki.fi> wrote:
    >>> So what is the change that actually causes improvement?
    >>  The INLINE declaration appears to be crucial.

    Mark> Not quite so.

    Mark> unbox the return value from REC.  I worked out why in the
    Mark> shower: CMUCL does tail-call optimization, so the value of
    Mark> REC is /exactly/ the value of FIB-D, so REC must box its
    Mark> output.  The pointless addition of zero is enough to
    Mark> dissuade the tail-call optimizer (which is done very early
    Mark> on by the compiler), which leaves the local-call optimizer
    Mark> free to unbox the return value of REC in the knowledge that
    Mark> the outer FIB-D frame will still exist to do the boxing.
    Mark> The disassembly suggests that the pointless addition is
    Mark> elided.

Sounds good, except SBCL 0.9.12 does not seem to pick up on the
idea. Without the inline it just wont unbox the return value of the
local function:

(defun fib-d (n)
  (declare (double-float n)
	   (optimize (speed 3) (safety 0)))
  (labels ((rec (n)
		(declare (double-float n))
		(the double-float
		  (if (< n 2)
		      1.0d0
		    (+ (rec (- n 1)) (rec (- n 2)))))))
    (+ (rec n) 0)))

* (time (fib-d 38.0d0))
Evaluation took:
  15.585 seconds of real time
  14.68 seconds of user run time
  0.895 seconds of system run time
  0 page faults and
  2,023,873,280 bytes consed.
6.3245986d7

But if I add the (declare (inline rec))

* (time (fib-d 38.0d0))
Evaluation took:
  2.237 seconds of real time
  2.216 seconds of user run time
  0.022 seconds of system run time
  0 page faults and
  0 bytes consed.
6.3245986d7

I've not yet tried it on CMUCL (I run a powerpc on my desk so I don't
have CMUCL available just right now), so this sounds like it is very
implementation specific.

Cheers!
Shyamal
From: Leonid Slobodov
Subject: Re: Please help me decipher a message from the SBCL optimizer
Date: 
Message-ID: <4487393c$0$11078$9b4e6d93@newsread4.arcor-online.net>
Shyamal Prasad wrote:

>>>>>> "Mark" == Mark Wooding <···@distorted.org.uk> writes:
> 
>     Mark> Kalle Olavi Niemitalo <···@iki.fi> wrote:
>     >>> So what is the change that actually causes improvement?
>     >>  The INLINE declaration appears to be crucial.
> 
>     Mark> Not quite so.
> 
>     Mark> unbox the return value from REC.  I worked out why in the
>     Mark> shower: CMUCL does tail-call optimization, so the value of
>     Mark> REC is /exactly/ the value of FIB-D, so REC must box its
>     Mark> output.  The pointless addition of zero is enough to
>     Mark> dissuade the tail-call optimizer (which is done very early
>     Mark> on by the compiler), which leaves the local-call optimizer
>     Mark> free to unbox the return value of REC in the knowledge that
>     Mark> the outer FIB-D frame will still exist to do the boxing.
>     Mark> The disassembly suggests that the pointless addition is
>     Mark> elided.
> 
> Sounds good, except SBCL 0.9.12 does not seem to pick up on the
> idea. Without the inline it just wont unbox the return value of the
> local function:
> 
> (defun fib-d (n)
>   (declare (double-float n)
> (optimize (speed 3) (safety 0)))
>   (labels ((rec (n)
> (declare (double-float n))
> (the double-float
> (if (< n 2)
> 1.0d0
> (+ (rec (- n 1)) (rec (- n 2)))))))
>     (+ (rec n) 0)))
> 
> * (time (fib-d 38.0d0))
> Evaluation took:
>   15.585 seconds of real time
>   14.68 seconds of user run time
>   0.895 seconds of system run time
>   0 page faults and
>   2,023,873,280 bytes consed.
> 6.3245986d7
> 
> But if I add the (declare (inline rec))
> 
> * (time (fib-d 38.0d0))
> Evaluation took:
>   2.237 seconds of real time
>   2.216 seconds of user run time
>   0.022 seconds of system run time
>   0 page faults and
>   0 bytes consed.
> 6.3245986d7
> 
> I've not yet tried it on CMUCL (I run a powerpc on my desk so I don't
> have CMUCL available just right now), so this sounds like it is very
> implementation specific.
> 
> Cheers!
> Shyamal

I think the culprit is that SBCL does boxing and tags the values 
on the return from the arithmetic operation and there is no way to
convey that one expects "float overflow" as with modular arithmetic
[(the ... ...) doesn't help].
If I try the same with single-float declarations, there are no spurious
compiler notes and the speed is reasonable as well.
From: Mark Wooding
Subject: Re: Please help me decipher a message from the SBCL optimizer
Date: 
Message-ID: <slrne8ep60.gab.mdw@metalzone.distorted.org.uk>
Shyamal Prasad <·············@verizon.net> wrote:

> Sounds good, except SBCL 0.9.12 does not seem to pick up on the
> idea. Without the inline it just wont unbox the return value of the
> local function.

Yeuch.  SBCL is too clever.  If I add 1 at the end, it unboxes
correctly; adding 0 isn't good enough.  Sorry :-(

-- [mdw]
From: Juho Snellman
Subject: Re: Please help me decipher a message from the SBCL optimizer
Date: 
Message-ID: <slrne8etp9.tqm.jsnell@sbz-30.cs.Helsinki.FI>
Mark Wooding <···@distorted.org.uk> wrote:
> Shyamal Prasad <·············@verizon.net> wrote:
> 
>> Sounds good, except SBCL 0.9.12 does not seem to pick up on the
>> idea. Without the inline it just wont unbox the return value of the
>> local function.
> 
> Yeuch.  SBCL is too clever.  If I add 1 at the end, it unboxes
> correctly; adding 0 isn't good enough.  Sorry :-(

Not that I approve of this technique you're discussing, but... Adding
a 0d0 would probably work, since it can't be optimized away, but nor
would it change the results in this case.

-- 
Juho Snellman