optimization question

From: Tamas Papp
Subject: optimization question
Date: Tue, 17 Jul 2007 13:55:31 +0000
Message-ID: <878x9fuobw.fsf@pu100877.student.princeton.edu>

Hi,

I want to learn how to write faster numerical code in Lisp.  The
function below is not a bottleneck in my current application, but
since it is simple, I thought I could learn some techniques.

The function looks like this:

(defun golden-section-combination (a b)
  "Return the convex combination (1-G)*a+G*b, where G is the
inverse of the golden ratio."
  (declare (double-float a b))
  (let ((Gright #.(/ (- 3d0 (sqrt 5d0)) 2d0)) ; equals to G above
	(Gleft #.(- 1d0 (/ (- 3d0 (sqrt 5d0)) 2d0)))) ; 1-G
    (the double-float (+ (the double-float (* Gleft a))
			 (the double-float (* Gright b))))))

My final goal is to achieve the equivalent of the C function

double golden_section_combination(double a, double b) {
    return (a*0.3819660112501051+b*0.6180339887498949);
}

ie with the constants compiled into the function, but disassemble
shows that there are quite a few other things in there (I am using
SBCL):

CL-NUMLIB> (disassemble #'golden-section-combination)
; 0B18A95D:       8B05F0A8180B     MOV EAX, [#xB18A8F0]       ; 0.6180339887498949d0
                                                              ; no-arg-parsing entry point
;       63:       DDD8             FSTPD FR0
;       65:       DD4001           FLDD [EAX+1]
;       68:       D8C9             FMULD FR1
;       6A:       DDD3             FSTD FR3
;       6C:       8B05F4A8180B     MOV EAX, [#xB18A8F4]       ; 0.3819660112501051d0
;       72:       DDD8             FSTPD FR0
;       74:       DD4001           FLDD [EAX+1]
;       77:       D8CA             FMULD FR2
;       79:       9B               WAIT
;       7A:       D8C3             FADDD FR3
;       7C:       9B               WAIT
;       7D:       64               FS-SEGMENT-PREFIX
;       7E:       800D4800000004   OR BYTE PTR [#x48], 4
;       85:       BA10000000       MOV EDX, 16
;       8A:       64               FS-SEGMENT-PREFIX
;       8B:       031520000000     ADD EDX, [#x20]
;       91:       64               FS-SEGMENT-PREFIX
;       92:       3B1524000000     CMP EDX, [#x24]
;       98:       7607             JBE L0
;       9A:       E85D8DEDFC       CALL #x80636FC             ; alloc_overflow_edx
;       9F:       EB0A             JMP L1
;       A1: L0:   64               FS-SEGMENT-PREFIX
;       A2:       891520000000     MOV [#x20], EDX
;       A8:       83EA10           SUB EDX, 16
;       AB: L1:   C70216030000     MOV DWORD PTR [EDX], 790
;       B1:       8D5207           LEA EDX, [EDX+7]
;       B4:       DD5201           FSTD [EDX+1]
;       B7:       64               FS-SEGMENT-PREFIX
;       B8:       80354800000004   XOR BYTE PTR [#x48], 4
;       BF:       7402             JEQ L2
;       C1:       CC09             BREAK 9                    ; pending interrupt trap
;       C3: L2:   8D65F8           LEA ESP, [EBP-8]
;       C6:       F8               CLC
;       C7:       8B6DFC           MOV EBP, [EBP-4]
;       CA:       C20400           RET 4
;       CD:       90               NOP
;       CE:       90               NOP
;       CF:       90               NOP
;       D0:       CC0A             BREAK 10                   ; error trap
;       D2:       02               BYTE #X02
;       D3:       18               BYTE #X18                  ; INVALID-ARG-COUNT-ERROR
;       D4:       4D               BYTE #X4D                  ; ECX
;       D5:       CC0A             BREAK 10                   ; error trap
;       D7:       02               BYTE #X02
;       D8:       06               BYTE #X06                  ; OBJECT-NOT-DOUBLE-FLOAT-ERROR
;       D9:       8E               BYTE #X8E                  ; EDX
;       DA:       CC0A             BREAK 10                   ; error trap
;       DC:       04               BYTE #X04
;       DD:       06               BYTE #X06                  ; OBJECT-NOT-DOUBLE-FLOAT-ERROR
;       DE:       FECE01           BYTE #XFE, #XCE, #X01      ; EDI
; 
NIL

Any advice would be appreciated.

Thanks,

Tamas

Re: optimization question Dimiter "malkia" Stanev
- Re: optimization question Dimiter "malkia" Stanev
  - Re: optimization question Duane Rettig
    - Re: optimization question Dimiter "malkia" Stanev
      - Re: optimization question Duane Rettig
Re: optimization question Raymond Toy
Re: optimization question Thomas A. Russ
Re: optimization question David Golden

From: Dimiter "malkia" Stanev
Subject: Re: optimization question
Date: Tue, 17 Jul 2007 14:52:30 +0000
Message-ID: <469CD7AE.5050300@mac.com>

I'm not sure about SBCL, but on my old iBook G4 under Lispworks 5.02 
(trial), I think I was able to optimize it:

(declaim (ftype (function (double-float double-float)
                           double-float)
                 golden-section-combination)
          (inline golden-section-combination))

(defun golden-section-combination (a b)
   "Return the convex combination (1-G)*a+G*b, where G is the
inverse of the golden ratio."
   (declare (optimize (speed 3) (safety 0)
                      (compilation-speed 0)
                      (space 0) (debug 0)
                      #+lispworks (float 0)))
   (declare (double-float a b))
   (let ((Gright #.(/ (- 3d0 (sqrt 5d0)) 2d0)) ; equals to G above
	(Gleft #.(- 1d0 (/ (- 3d0 (sqrt 5d0)) 2d0)))) ; 1-G
     (the double-float (+ (the double-float (* Gleft a))
			 (the double-float (* Gright b))))))

(defun test (a b)
   (declare (optimize (speed 3) (safety 0)
                      (compilation-speed 0)
                      (space 0) (debug 0)
                      #+lispworks (float 0)))
   (declare (inline golden-section-combination))
   (declare (type double-float a b))
   (+ (golden-section-combination 12.34d0 56.78d0)
      (golden-section-combination 98.10d0 23.45d0)
      (golden-section-combination a b)))

The code in test gave me this in assembly (one downside is that it calls 
SYSTEM::RAW-FAST-BOX-DOUBLE, but unless you make the function to return 
the value in one of the parameters, it can't be avoided):

CL-USER 12 > (disassemble 'test)
       0: #x7FE802A6     mflr link
       4: #x3021FFF4     addic sp,sp,#x-C
       8: #xBFA10000     stmw fp,#x0(sp)
      12: #x603D0000     ori fp,sp,#x0
      16: #x83D70006     lwz const,#x6(func)
      20: #xC8830005     lfd f4,#x5(res/arg0)
      24: #xC8640005     lfd f3,#x5(arg1)
      28: #x82DE002D     lwz r22,#x2D(const)       ; 0.6180339887498949D0
      32: #xC8560005     lfd f2,#x5(r22)
      36: #xFC820132     fmul f4,f2,f0-Ftmp1
      40: #x82DE0031     lwz r22,#x31(const)       ; 0.3819660112501051D0
      44: #xC8560005     lfd f2,#x5(r22)
      48: #xFC6200F2     fmul f3,f2,f0-Ftmp1
      52: #xFC84182A     fadd f4,f4,f3
      56: #x82DE0035     lwz r22,#x35(const)       ; 98.90080577654365D0
      60: #xC8360005     lfd f1,#x5(r22)
      64: #xFC21202A     fadd f1,f1,f4
      68: #x82FE0039     lwz func,#x39(const)      ; 
SYSTEM::RAW-FAST-BOX-DOUBLE
      72: #x38000001     li nargs,#x1
      76: #xBBA10000     lmw fp,#x0(sp)
      80: #x3021000C     addic sp,sp,#xC
      84: #x7FE803A6     mtlr link
      88: #x80F70002     lwz tmp1,#x2(func)
      92: #x30E70005     addic tmp1,tmp1,#x5
      96: #x7CE903A6     mtctr tmp1
     100: #x4E800420     bctr

So I guess, making your function inline-able could help you:

(declaim (ftype (function (double-float double-float)
                           double-float)
                 golden-section-combination)
          (inline golden-section-combination))

Also you have to notify the compiler that you want to use it as inline 
one, in test, for example:

   (declare (inline golden-section-combination))

At least those are the rules I've found out for Lispworks, not sure for 
SBCL.

From: Dimiter "malkia" Stanev
Subject: Re: optimization question
Date: Tue, 17 Jul 2007 15:03:25 +0000
Message-ID: <469CDA3D.9070400@mac.com>

And if you want to avoid any consing, if SBCL does, as Lispworks, then 
instead of making the function return a value, make it return in an 
array (ugly, but that's the way I'm planning to do my vector3/4 and 
matrix33/44 library):

(defun test (a b c)
   (declare (optimize (speed 3) (safety 0)
                      (compilation-speed 0)
                      (space 0) (debug 0)
                      #+lispworks (float 0)))
   (declare (inline golden-section-combination))
   (declare (type double-float a b)
            (type (simple-array double-float (1)) c))
   (setf (aref c 0) (+ (golden-section-combination 12.34d0 56.78d0)
                       (golden-section-combination 98.10d0 23.45d0)
                       (golden-section-combination a b)))
   c)

Here c is array of one double-float

CL-USER 21 > (disassemble 'test)
       0: #x7FE802A6     mflr link
       4: #x3021FFF4     addic sp,sp,#x-C
       8: #xBFA10000     stmw fp,#x0(sp)
      12: #x603D0000     ori fp,sp,#x0
      16: #x83D70006     lwz const,#x6(func)
      20: #xC8830005     lfd f4,#x5(res/arg0)
      24: #xC8640005     lfd f3,#x5(arg1)
      28: #x82DE002D     lwz r22,#x2D(const)       ; 0.6180339887498949D0
      32: #xC8560005     lfd f2,#x5(r22)
      36: #xFC820132     fmul f4,f2,f0-Ftmp1
      40: #x82DE0031     lwz r22,#x31(const)       ; 0.3819660112501051D0
      44: #xC8560005     lfd f2,#x5(r22)
      48: #xFC6200F2     fmul f3,f2,f0-Ftmp1
      52: #xFC84182A     fadd f4,f4,f3
      56: #x82DE0035     lwz r22,#x35(const)       ; 98.90080680013434D0
      60: #xC8760005     lfd f3,#x5(r22)
      64: #xFC83202A     fadd f4,f3,f4
      68: #xD8850005     stfd f4,#x5(arg2)
      72: #x60A30000     ori res/arg0,arg2,#x0
      76: #x38000001     li nargs,#x1
      80: #xBBA10000     lmw fp,#x0(sp)
      84: #x3021000C     addic sp,sp,#xC
      88: #x7FE803A6     mtlr link
      92: #x4E800020     blr

No consing, but ugly.

This one, is more flexible:

(defun test2 (a b &optional (c (make-array 1 :element-type 'double-float)))
   (declare (optimize (speed 3) (safety 0)
                      (compilation-speed 0)
                      (space 0) (debug 0)
                      #+lispworks (float 0)))
   (declare (inline golden-section-combination))
   (declare (type double-float a b)
            (type (simple-array double-float (1)) c))
   (setf (aref c 0) (+ (golden-section-combination 12.34d0 56.78d0)
                       (golden-section-combination 98.10d0 23.45d0)
                       (golden-section-combination a b)))
   c)


As it allows you caller-site to either choose to reuse memory, or cons 
memory, but it costs one additional branch at the begining of the function:

CL-USER 22 > (disassemble 'test2)
       0: #x7FE802A6     mflr link
       4: #x3021FFEC     addic sp,sp,#x-14
       8: #xBF610000     stmw r27-p,#x0(sp)
      12: #x33A10008     addic fp,sp,#x8
      16: #x83D70006     lwz const,#x6(func)
      20: #x60160000     ori r22,nargs,#x0
      24: #x607B0000     ori r27-p,res/arg0,#x0
      28: #x609C0000     ori r28-p,arg1,#x0
      32: #x2C160002     cmpwi cr0,r22,#x2
      36: #x40810050     ble 116
      40: #x82DE0031     lwz r22,#x31(const)       ; 0.6180339887498949D0
      44: #xC8960005     lfd f4,#x5(r22)
      48: #xC87B0005     lfd f3,#x5(r27-p)
      52: #xFC8400F2     fmul f4,f4,f0-Ftmp1
      56: #x82DE0035     lwz r22,#x35(const)       ; 0.3819660112501051D0
      60: #xC8760005     lfd f3,#x5(r22)
      64: #xC85C0005     lfd f2,#x5(r28-p)
      68: #xFC6300B2     fmul f3,f3,f0-Ftmp1
      72: #xFC84182A     fadd f4,f4,f3
      76: #x82DE0039     lwz r22,#x39(const)       ; 98.90080680013434D0
      80: #xC8760005     lfd f3,#x5(r22)
      84: #xFC83202A     fadd f4,f3,f4
      88: #xD8850005     stfd f4,#x5(arg2)
      92: #x60A30000     ori res/arg0,arg2,#x0
      96: #x38000001     li nargs,#x1
     100: #xBB610000     lmw r27-p,#x0(sp)
     104: #x30210014     addic sp,sp,#x14
     108: #x7FE803A6     mtlr link
     112: #x4E800020     blr
     116: #x82FE002D     lwz func,#x2D(const)      ; SYSTEM::ALLOC-I-VECTOR
     120: #x38000003     li nargs,#x3
     124: #x38600004     li res/arg0,#x4
     128: #x38800100     li arg1,#x100
     132: #x63050000     ori arg2,nil,#x0
     136: #x80F70002     lwz tmp1,#x2(func)
     140: #x30E70005     addic tmp1,tmp1,#x5
     144: #x7CE903A6     mtctr tmp1
     148: #x4E800421     bctrl
     152: #x82C3FFFD     lwz r22,#x-3(res/arg0)
     156: #x62D62000     ori r22,r22,#x2000
     160: #x92C3FFFD     stw r22,#x-3(res/arg0)
     164: #x60650000     ori arg2,res/arg0,#x0
     168: #x4BFFFF80     b 40

Again, this is under Lispworks in 32bit mode Mac OS X PowerPC G4, under 
other platforms/bits, or other lisps, like SBCL it could be different 
(when comes to consing).

From: Duane Rettig
Subject: Re: optimization question
Date: Tue, 17 Jul 2007 18:45:41 +0000
Message-ID: <o07ioy3m3u.fsf@gemini.franz.com>

"Dimiter \"malkia\" Stanev" <······@mac.com> writes:

> And if you want to avoid any consing, if SBCL does, as Lispworks, then
> instead of making the function return a value, make it return in an
> array (ugly, but that's the way I'm planning to do my vector3/4 and
> matrix33/44 library):

Of course, such a solution is not thread-safe - it is not re-entrant.

In Allegro CL, you can do it in a slightly different way (note that
I'm on mac/intel here; I much prefer its floating point to that of the
x87 style unit - note also that I removed all of the extraneous THE
forms - they just clutter up the source): 

CL-USER(2): (shell "cat try.cl")
#+allegro
(eval-when (compile)
  (setf (get 'golden-section-combination 'sys::immed-args-call)
    '((double-float double-float) double-float))
  )

(defun golden-section-combination (a b)
  "Return the convex combination (1-G)*a+G*b, where G is the
inverse of the golden ratio."
  (declare (optimize speed (safety 0))
	   (double-float a b))
  (let ((Gright #.(/ (- 3d0 (sqrt 5d0)) 2d0)) ; equals to G above
	(Gleft #.(- 1d0 (/ (- 3d0 (sqrt 5d0)) 2d0)))) ; 1-G
    (+ (* Gleft a) (* Gright b))))



0
CL-USER(3): :cf try
;;; Compiling file try.cl
;;; Writing fasl file try.fasl
;;; Fasl write complete
CL-USER(4): :ld try
; Fast loading /tmp_mnt/net/gemini/home/duane/try.fasl
CL-USER(5): (disassemble 'golden-section-combination)
;; disassembly of #<Function GOLDEN-SECTION-COMBINATION>
;; formals: A EXCL::DF_PLACE-HOLDER B EXCL::DF_PLACE-HOLDER
;; constant vector:
0: 0.3819660112501051d0
1: 0.6180339887498949d0

;; code start: #x10a51fb4:
   0: ff a7 47 03 00 jmp	*[edi+839]      ; SYS::IMMED-ARG-HOOK
      00 
   6: 8b 5e 12       movl	ebx,[esi+18]    ; 0.3819660112501051d0
   9: f2 0f 10 6b f6 movsd	xmm5,[ebx-10]
  14: 8b 5e 16       movl	ebx,[esi+22]    ; 0.6180339887498949d0
  17: f2 0f 10 63 f6 movsd	xmm4,[ebx-10]
  22: f2 0f 10 5c 24 movsd	xmm3,[esp+4]
      04 
  28: f2 0f 59 e3    mulsd	xmm4,xmm3
  32: f2 0f 10 5c 24 movsd	xmm3,[esp+12]
      0c 
  38: f2 0f 59 eb    mulsd	xmm5,xmm3
  42: f2 0f 58 ec    addsd	xmm5,xmm4
  46: f2 0f 10 c5    movsd	xmm0,xmm5
  50: 8b 75 fc       movl	esi,[ebp-4]
  53: c3             ret
CL-USER(6): 

If you call this function in straight lisp code, the immed-arg-hook
unboxes the arguments, calls this functiin back again, and boxes up
the result.  But if you compile a call to this function with the 
immed-args-call property still in effect, then that call sets up the
arguments directly in immediate form, calls the function in this
immediate way, and the return is used (still unboxed, and with no
consing at this level).  It is up to the caller, now, as to what it
wants to do with the unboxed value it just got back.

We don't officially document this approach because it is not a very
Lispy approach - a lispy approach would allow redefinitions without
consequences.  But since it is extremely useful we provide it and
always give out unofficial documentation to customers who ask for it.

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182

From: Dimiter "malkia" Stanev
Subject: Re: optimization question
Date: Wed, 08 Aug 2007 21:14:06 +0000
Message-ID: <46BA321E.80402@gmail.com>

Thanks, Duane!

I'm still waiting for my Allegro CL 8.1 Non-commercial version, but 
until it arrives, I guess I can try it in the prompt.franz.com.

I have a question: If the immed-args-call property is still in effect, 
would that mean that it would call directly the function + 6 bytes (e.g. 
skip the first jmp which unboxes then boxes again the data)?

Btw, why can't I disassemble sys::immed-arg-hook?

Cheers,
Dimiter "malkia" Stanev.

Duane Rettig wrote:
> "Dimiter \"malkia\" Stanev" <······@mac.com> writes:
> 
>> And if you want to avoid any consing, if SBCL does, as Lispworks, then
>> instead of making the function return a value, make it return in an
>> array (ugly, but that's the way I'm planning to do my vector3/4 and
>> matrix33/44 library):
> 
> Of course, such a solution is not thread-safe - it is not re-entrant.
> 
> In Allegro CL, you can do it in a slightly different way (note that
> I'm on mac/intel here; I much prefer its floating point to that of the
> x87 style unit - note also that I removed all of the extraneous THE
> forms - they just clutter up the source): 
> 
> CL-USER(2): (shell "cat try.cl")
> #+allegro
> (eval-when (compile)
>   (setf (get 'golden-section-combination 'sys::immed-args-call)
>     '((double-float double-float) double-float))
>   )
> 
> (defun golden-section-combination (a b)
>   "Return the convex combination (1-G)*a+G*b, where G is the
> inverse of the golden ratio."
>   (declare (optimize speed (safety 0))
> 	   (double-float a b))
>   (let ((Gright #.(/ (- 3d0 (sqrt 5d0)) 2d0)) ; equals to G above
> 	(Gleft #.(- 1d0 (/ (- 3d0 (sqrt 5d0)) 2d0)))) ; 1-G
>     (+ (* Gleft a) (* Gright b))))
> 
> 
> 
> 0
> CL-USER(3): :cf try
> ;;; Compiling file try.cl
> ;;; Writing fasl file try.fasl
> ;;; Fasl write complete
> CL-USER(4): :ld try
> ; Fast loading /tmp_mnt/net/gemini/home/duane/try.fasl
> CL-USER(5): (disassemble 'golden-section-combination)
> ;; disassembly of #<Function GOLDEN-SECTION-COMBINATION>
> ;; formals: A EXCL::DF_PLACE-HOLDER B EXCL::DF_PLACE-HOLDER
> ;; constant vector:
> 0: 0.3819660112501051d0
> 1: 0.6180339887498949d0
> 
> ;; code start: #x10a51fb4:
>    0: ff a7 47 03 00 jmp	*[edi+839]      ; SYS::IMMED-ARG-HOOK
>       00 
>    6: 8b 5e 12       movl	ebx,[esi+18]    ; 0.3819660112501051d0
>    9: f2 0f 10 6b f6 movsd	xmm5,[ebx-10]
>   14: 8b 5e 16       movl	ebx,[esi+22]    ; 0.6180339887498949d0
>   17: f2 0f 10 63 f6 movsd	xmm4,[ebx-10]
>   22: f2 0f 10 5c 24 movsd	xmm3,[esp+4]
>       04 
>   28: f2 0f 59 e3    mulsd	xmm4,xmm3
>   32: f2 0f 10 5c 24 movsd	xmm3,[esp+12]
>       0c 
>   38: f2 0f 59 eb    mulsd	xmm5,xmm3
>   42: f2 0f 58 ec    addsd	xmm5,xmm4
>   46: f2 0f 10 c5    movsd	xmm0,xmm5
>   50: 8b 75 fc       movl	esi,[ebp-4]
>   53: c3             ret
> CL-USER(6): 
> 
> If you call this function in straight lisp code, the immed-arg-hook
> unboxes the arguments, calls this functiin back again, and boxes up
> the result.  But if you compile a call to this function with the 
> immed-args-call property still in effect, then that call sets up the
> arguments directly in immediate form, calls the function in this
> immediate way, and the return is used (still unboxed, and with no
> consing at this level).  It is up to the caller, now, as to what it
> wants to do with the unboxed value it just got back.
> 
> We don't officially document this approach because it is not a very
> Lispy approach - a lispy approach would allow redefinitions without
> consequences.  But since it is extremely useful we provide it and
> always give out unofficial documentation to customers who ask for it.
>

From: Duane Rettig
Subject: Re: optimization question
Date: Thu, 09 Aug 2007 00:08:21 +0000
Message-ID: <o04pj9wowa.fsf@gemini.franz.com>

"Dimiter \"malkia\" Stanev" <······@gmail.com> writes:

> Thanks, Duane!
>
> I'm still waiting for my Allegro CL 8.1 Non-commercial version, but
> until it arrives, I guess I can try it in the prompt.franz.com.

Yes.

> I have a question: If the immed-args-call property is still in effect,
> would that mean that it would call directly the function + 6 bytes
> (e.g. skip the first jmp which unboxes then boxes again the data)?

Yes, compiled calls to the function go through this
"second-entry-point", wherever it is on the particular architecture.
And therein lies the rub, and the reason we dobn't export this
feature: it is fine to call one of these functions with an uncompiled
call (in which the jump to the hook takes care of uboxing the args and
boxing up the result) and it is also fine to have an uncompiled call
to the function when it is itself uncompiled, because then the
interpreter is just doing its normal thing.  But if you compile the
call and not the function itself, then troubles arise, because the
call is skipping essential code needed for the function entry.  So the
rule of thumb is: when you redefine an immediate-args function, be
sure to compile it immediately as well.

> Btw, why can't I disassemble sys::immed-arg-hook?

the disassembler is describing 'sys::immed-arg-hook as if it is a
symbolic lisp call, but in reality it is a jump to a "runtime system
function" (lisp code, but compiled to assembler and assembled and
linked into the .dll/.so/.dylib file).  The disassembler calls it by
its symbolic name because that is how it identifies the function (try
adding :references-only t to the disassemble call).  On x86
architecture, it is usually a call or jump through a positive offset
from edi (or r15, on x86-64).

You can still usually disassemble it by translating the name to a
C-like name (turning dashes to underscores; in this case,
"immed_arg_hook") and disassembling that.

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182

From: Raymond Toy
Subject: Re: optimization question
Date: Tue, 17 Jul 2007 15:24:11 +0000
Message-ID: <sxdlkdfvysk.fsf@rtp.ericsson.se>

>>>>> "Tamas" == Tamas Papp <······@gmail.com> writes:

    Tamas> Hi,
    Tamas> I want to learn how to write faster numerical code in Lisp.  The
    Tamas> function below is not a bottleneck in my current application, but
    Tamas> since it is simple, I thought I could learn some techniques.

    Tamas> The function looks like this:

    Tamas> (defun golden-section-combination (a b)
    Tamas>   "Return the convex combination (1-G)*a+G*b, where G is the
    Tamas> inverse of the golden ratio."
    Tamas>   (declare (double-float a b))
    Tamas>   (let ((Gright #.(/ (- 3d0 (sqrt 5d0)) 2d0)) ; equals to G above
    Tamas> 	(Gleft #.(- 1d0 (/ (- 3d0 (sqrt 5d0)) 2d0)))) ; 1-G
    Tamas>     (the double-float (+ (the double-float (* Gleft a))
    Tamas> 			 (the double-float (* Gright b))))))

With sbcl I'm pretty sure the "the double-float" stuff isn't needed
because the compiler can figure out the types of the operations.

    Tamas> My final goal is to achieve the equivalent of the C function

    Tamas> double golden_section_combination(double a, double b) {
    Tamas>     return (a*0.3819660112501051+b*0.6180339887498949);
    Tamas> }

    Tamas> ie with the constants compiled into the function, but disassemble
    Tamas> shows that there are quite a few other things in there (I am using
    Tamas> SBCL):

    CL-NUMLIB> (disassemble #'golden-section-combination)
    Tamas> ; 0B18A95D:       8B05F0A8180B     MOV EAX, [#xB18A8F0]       ; 0.6180339887498949d0
    Tamas>                                                               ; no-arg-parsing entry point
    Tamas> ;       63:       DDD8             FSTPD FR0
    Tamas> ;       65:       DD4001           FLDD [EAX+1]
    Tamas> ;       68:       D8C9             FMULD FR1
    Tamas> ;       6A:       DDD3             FSTD FR3
    Tamas> ;       6C:       8B05F4A8180B     MOV EAX, [#xB18A8F4]       ; 0.3819660112501051d0
    Tamas> ;       72:       DDD8             FSTPD FR0
    Tamas> ;       74:       DD4001           FLDD [EAX+1]
    Tamas> ;       77:       D8CA             FMULD FR2
    Tamas> ;       79:       9B               WAIT
    Tamas> ;       7A:       D8C3             FADDD FR3
    Tamas> ;       7C:       9B               WAIT
    Tamas> ;       7D:       64               FS-SEGMENT-PREFIX
    Tamas> ;       7E:       800D4800000004   OR BYTE PTR [#x48], 4
    Tamas> ;       85:       BA10000000       MOV EDX, 16
    Tamas> ;       8A:       64               FS-SEGMENT-PREFIX
    Tamas> ;       8B:       031520000000     ADD EDX, [#x20]
    Tamas> ;       91:       64               FS-SEGMENT-PREFIX
    Tamas> ;       92:       3B1524000000     CMP EDX, [#x24]
    Tamas> ;       98:       7607             JBE L0

I hate reading x86 asm.  The stuff at address 0b18a95d-65 is loading
the constant 0.618... into the FPU.  Inst at 68 that with the first
arg, already in FR1.  Inst at 6c and 74 loads the second
constant 0.3819... into the FPU.  Inst at 77 multiplies that with the
second arg in FR2.  Inst at 7a adds the two products together.  The
rest of the stuff is for boxing up the result.

Without looking at the C code, I guess the only difference would be
that the C code might multiply directly from memory.  Can't remember
if that's possible or not.

So the compiled lisp code is probably quite similar to the C code
already.

Ray

From: Thomas A. Russ
Subject: Re: optimization question
Date: Wed, 18 Jul 2007 18:32:31 +0000
Message-ID: <ymify3lmukg.fsf@sevak.isi.edu>

Tamas Papp <······@gmail.com> writes:

> The function looks like this:
> 
> (defun golden-section-combination (a b)
>   "Return the convex combination (1-G)*a+G*b, where G is the
> inverse of the golden ratio."
>   (declare (double-float a b))
>   (let ((Gright #.(/ (- 3d0 (sqrt 5d0)) 2d0)) ; equals to G above
> 	(Gleft #.(- 1d0 (/ (- 3d0 (sqrt 5d0)) 2d0)))) ; 1-G
>     (the double-float (+ (the double-float (* Gleft a))
> 			 (the double-float (* Gright b))))))

Completely off the topic of your thread, but why not name the local
variables Gright and Gleft simply G and 1-G ?  Then they match your
comment directly.  You also probably want to declare the types as well.

(defun golden-section-combination (a b)
   "Return the convex combination (1-G)*a+G*b, where G is the
 inverse of the golden ratio."
   (declare (double-float a b))
   (let ((G #.(/ (- 3d0 (sqrt 5d0)) 2d0))
         (1-G #.(- 1d0 (/ (- 3d0 (sqrt 5d0)) 2d0)))) ; 1-G
     (declare (double-float G 1-G))
     (the double-float (+ (the double-float (* 1-G a))
                          (the double-float (* G b))))))


-- 
Thomas A. Russ,  USC/Information Sciences Institute

From: David Golden
Subject: Re: optimization question
Date: Wed, 18 Jul 2007 20:02:12 +0000
Message-ID: <wluni.20924$j7.378864@news.indigo.ie>

Tamas Papp wrote:

> In the code below, I marked the places where I get the notes.  I can't
> think of any other declarations to make, I even tried putting m1 in a
> temporary variable tmp when exchanging values, it doesn't help.
> 

Well, 
(i) note: doing float to pointer coercion (cost 13) to "<return value>"
(ii) note: doing float to pointer coercion (cost 13) to M1/M2
etc.

(i) *was* arising when calling golden-section-combination, but 
inlining, making it local, or a macro stops that.  HOWEVER
(ii) actually seems to be arising from the need to box for funcalls and
values returns that in your code involve m1 and m2. The compiler seems
to be "pushing the boxing into m1 and m2", if that makes any sense 
(*** Someone who isn't now handwaving may want to chime in...).  

You can shuffle the boxing about - try using more temp vars to replace
the use of m1 and m2 in values and funcall forms like the below. The
below seems to tip the balance of the compiler into deciding to box at
the funcall and values forms, at least on my system. (Read the warnings
carefully, and note the "from" vs. "to" and so on).

But I'm not sure there's any way around the boxing totally, at least
not without delving into deep implementation-specifics, and especially
since f isn't known at compile time and golden-section-minimize is
global.

And this boxing/unboxing is an implementation detail really, you
couldn't rely on a different compiler behaving identically.

(defun golden-section-minimize (f a b tol)
  "Find a local minimum of f in the [a,b] interval.  The
algorithm terminates when the minimum is bracketed in an interval
smaller than tol.  Since the algorithm is slow, tol should not be
chosen smaller then necessary.  The algorithm will also find the
local minimum at the endpoints, and if f is unimodal, it will
find the global minimum."
  (declare (double-float a b tol)
           (type (function (double-float) double-float) f)
           (optimize (speed 3) (safety 0)
                     (compilation-speed 0)
                     (space 0) (debug 0)))
  ;; reorder a and b if necessary
  (when (> a b)
    (rotatef a b))
  ;; start iteration with golden ratio inner points
  (let* ((m1 (golden-section-combination a b)) ;; get note here
         (m2 (golden-section-combination b a)) ;; get note here 
         (spoon m1)
         (fork m2)
         (f1 (funcall f spoon))
         (f2 (funcall f fork))
         (tmp 0d0))
    (declare (double-float m1 m2 f1 f2 tmp spoon fork))
    (do ()
        ((< (abs (- b a)) tol)
         (setf spoon m1)
         (setf fork m2)
         (if (< f1 f2)
             (values spoon f1)
             (values fork f2)))
      (if (< f1 f2)
          (progn 
            ;; new bracket is (a,m1,m2)
            (setf b m2)
            (setf m2 m1)
            (setf tmp (golden-section-combination m1 a))
            (setf m1 tmp) ;; get note here
;;          (shiftf b m2 m1 (golden-section-combination m1 a))
            (setf spoon m1)
            (shiftf f2 f1 (funcall f spoon)))
          (progn
            ;; new bracket is (m1,m2,b)
            (setf a m1)
            (setf m1 m2)
            (setf m2 (golden-section-combination m2 b)) ;; get note here
;;          (shiftf a m1 m2 (golden-section-combination m2 b))
            (setf fork m2)
            (shiftf f1 f2 (funcall f fork)))))))