From: Joel Reymont
Subject: The epic floating-point battle
Date: 
Message-ID: <1161372948.472355.277830@k70g2000cwa.googlegroups.com>
Folks,

There's a thread in the Lispworks Users Group that I started:

http://thread.gmane.org/gmane.lisp.lispworks.general/6012/focus=6012

This is the type of thread that I'm sure Lispers are gonna love: I'm
pitting ACL 8.0 vs LW 5.0 in a floating point pseudo-benchmark.

I would love to get stats from SBCL, CMUCL and as many other Lisps as
possible. I would also love to improve the benchmark itself and measure
GCC on the same code.

    Thanks, Joel

--
http://whylisp.com

From: Pascal Bourguignon
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <87r6x2yd96.fsf@thalassa.informatimago.com>
"Joel Reymont" <······@gmail.com> writes:

> Folks,
>
> There's a thread in the Lispworks Users Group that I started:
>
> http://thread.gmane.org/gmane.lisp.lispworks.general/6012/focus=6012
>
> This is the type of thread that I'm sure Lispers are gonna love: I'm
> pitting ACL 8.0 vs LW 5.0 in a floating point pseudo-benchmark.
>
> I would love to get stats from SBCL, CMUCL and as many other Lisps as
> possible. I would also love to improve the benchmark itself and measure
> GCC on the same code.

times (see listing below):

clisp 2.39    110.53891 seconds
sbcl 0.9.17     0.476
cmucl 19c       0.77
gcl 2.6.7      13.250
ecl 0.9g       18.460
Allegro 8.0    20.820



disassembles:

[···@thalassa lisp]$ clall '(disassemble (defun digamma/setq-x (x)
   (declare (:explain :types :inlining :variables)
            (optimize (speed 3) (safety 0) (debug 0))
            ((double-float (0.0d0) *) x))
   (incf x 6d0)
   (let* ((p (/ 1d0 (* x x)))
          (logx (log x))
          (one 1d0))
     (declare (double-float one p))
     (setq p (* (- (* (+ (* p (- (* p 0.004166666666667d0)
                                 0.003968253986254d0))
                         0.008333333333333d0) p)
                   0.083333333333333d0) p))
     (+ p (- logx
             (/ 0.5d0 x)
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))))
     )))'
> > > > > > > > > > > > > > > > > > > > > 
------------------------------------------------------------------------
CLISP 2.39 (2006-07-16) (built 3364813332) (memory 3364813914) 

WARNING in DIGAMMA/SETQ-X :
Unknown declaration (DOUBLE-FLOAT (0.0d0) *).
The whole declaration will be ignored.
WARNING in DIGAMMA/SETQ-X :
Unknown declaration :EXPLAIN.
The whole declaration will be ignored.

Disassembly of function DIGAMMA/SETQ-X
(CONST 0) = 6.0d0
(CONST 1) = 1.0d0
(CONST 2) = -0.083333333333333d0
(CONST 3) = 0.008333333333333d0
(CONST 4) = -0.003968253986254d0
(CONST 5) = 0.004166666666667d0
(CONST 6) = 0.5d0
1 required argument
0 optional arguments
No rest parameter
No keyword parameters
70 byte-code instructions:
0     (CONST&PUSH 0)                      ; 6.0d0
1     (LOAD&PUSH 2)
2     (CALLSR&STORE 2 53 1)               ; +
6     (CONST&PUSH 1)                      ; 1.0d0
7     (LOAD&PUSH 2)
8     (LOAD&PUSH 3)
9     (CALLSR&PUSH 2 55)                  ; *
12    (CALLSR&PUSH 1 56)                  ; /
15    (LOAD&PUSH 2)
16    (PUSH-UNBOUND 1)
18    (CALLS2&PUSH 156)                   ; LOG
20    (CONST&PUSH 2)                      ; -0.083333333333333d0
21    (CONST&PUSH 3)                      ; 0.008333333333333d0
22    (LOAD&PUSH 3)
23    (CONST&PUSH 4)                      ; -0.003968253986254d0
24    (CONST&PUSH 5)                      ; 0.004166666666667d0
25    (LOAD&PUSH 6)
26    (CALLSR&PUSH 2 55)                  ; *
29    (CALLSR&PUSH 2 53)                  ; +
32    (CALLSR&PUSH 2 55)                  ; *
35    (CALLSR&PUSH 2 53)                  ; +
38    (LOAD&PUSH 3)
39    (CALLSR&PUSH 2 55)                  ; *
42    (CALLSR&PUSH 2 53)                  ; +
45    (LOAD&PUSH 2)
46    (CALLSR&STORE 2 55 1)               ; *
50    (PUSH)
51    (LOAD&PUSH 1)
52    (CONST&PUSH 6)                      ; 0.5d0
53    (LOAD&PUSH 6)
54    (CALLSR&PUSH 1 56)                  ; /
57    (CONST&PUSH 1)                      ; 1.0d0
58    (LOAD&PUSH 7)
59    (CONST&PUSH 1)                      ; 1.0d0
60    (CALLSR&STORE 1 54 7)               ; -
64    (PUSH)
65    (CALLSR&PUSH 1 56)                  ; /
68    (CONST&PUSH 1)                      ; 1.0d0
69    (LOAD&PUSH 8)
70    (CONST&PUSH 1)                      ; 1.0d0
71    (CALLSR&STORE 1 54 8)               ; -
75    (PUSH)
76    (CALLSR&PUSH 1 56)                  ; /
79    (CONST&PUSH 1)                      ; 1.0d0
80    (LOAD&PUSH 9)
81    (CONST&PUSH 1)                      ; 1.0d0
82    (CALLSR&STORE 1 54 9)               ; -
86    (PUSH)
87    (CALLSR&PUSH 1 56)                  ; /
90    (CONST&PUSH 1)                      ; 1.0d0
91    (LOAD&PUSH 10)
92    (CONST&PUSH 1)                      ; 1.0d0
93    (CALLSR&STORE 1 54 10)              ; -
97    (PUSH)
98    (CALLSR&PUSH 1 56)                  ; /
101   (CONST&PUSH 1)                      ; 1.0d0
102   (LOAD&PUSH 11)
103   (CONST&PUSH 1)                      ; 1.0d0
104   (CALLSR&STORE 1 54 11)              ; -
108   (PUSH)
109   (CALLSR&PUSH 1 56)                  ; /
112   (CONST&PUSH 1)                      ; 1.0d0
113   (LOAD&PUSH 12)
114   (CONST&PUSH 1)                      ; 1.0d0
115   (CALLSR&STORE 1 54 12)              ; -
119   (PUSH)
120   (CALLSR&PUSH 1 56)                  ; /
123   (CALLSR&PUSH 7 54)                  ; -
126   (CALLSR 2 53)                       ; +
129   (SKIP&RET 4)
(DISASSEMBLE
 (DEFUN DIGAMMA/SETQ-X (X)
  (DECLARE (:EXPLAIN :TYPES :INLINING :VARIABLES)
   (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0)) ((DOUBLE-FLOAT (0.0d0) *) X))
  (INCF X 6.0d0)
  (LET* ((P (/ 1.0d0 (* X X))) (LOGX (LOG X)) (ONE 1.0d0))
   (DECLARE (DOUBLE-FLOAT ONE P))
   (SETQ P
    (*
     (-
      (*
       (+ (* P (- (* P 0.004166666666667d0) 0.003968253986254d0))
        0.008333333333333d0)
       P)
      0.083333333333333d0)
     P))
   (+ P
    (- LOGX (/ 0.5d0 X) (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE)))
     (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE)))
     (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE))))))))
--> NIL

------------------------------------------------------------------------
SBCL 0.9.17 

; in: DEFUN DIGAMMA/SETQ-X
;     (DEFUN DIGAMMA/SETQ-X (X)
;     (DECLARE (:EXPLAIN :TYPES :INLINING :VARIABLES)
;      (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0)) ((DOUBLE-FLOAT (0.0d0) *) X))
;     (INCF X 6.0d0)
;     (LET* ((P (/ 1.0d0 #)) (LOGX (LOG X)) (ONE 1.0d0))
;       (DECLARE (DOUBLE-FLOAT ONE P))
;       (SETQ P (* (- # 0.083333333333333d0) P))
;       (+ P
;          (- LOGX (/ 0.5d0 X) (/ ONE #) (/ ONE #) (/ ONE #) (/ ONE #) (/ ONE #)
;             (/ ONE #)))))
; --> PROGN EVAL-WHEN SB-IMPL::%DEFUN SB-INT:NAMED-LAMBDA 
; ==>
;   #'(SB-INT:NAMED-LAMBDA DIGAMMA/SETQ-X (X)
;                          (DECLARE (:EXPLAIN :TYPES :INLINING :VARIABLES)
;                           (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0))
;                           ((DOUBLE-FLOAT (0.0d0) *) X))
;                          (BLOCK DIGAMMA/SETQ-X
;                            (INCF X 6.0d0)
;                            (LET* ((P #) (LOGX #) (ONE 1.0d0))
;                              (DECLARE (DOUBLE-FLOAT ONE P))
;                              (SETQ P (* # P))
;                              (+ P (- LOGX # # # # # # #)))))
; 
; caught WARNING:
;   unrecognized declaration (:EXPLAIN :TYPES :INLINING :VARIABLES)
; 
; note: doing float to pointer coercion (cost 13) to "<return value>"
; 
; compilation unit finished
;   caught 1 WARNING condition
;   printed 1 note
; 0A6AF68D:       8B05A4F56A0A     MOV EAX, [#xA6AF5A4]       ; 6.0d0
                                                              ; no-arg-parsing entry point
;      693:       DDD8             FSTPD FR0
;      695:       DD4001           FLDD [EAX+1]
;      698:       DCC2             FADD-STI FR2
;      69A:       9B               WAIT
;      69B:       DDD8             FSTPD FR0
;      69D:       D9C1             FLDD FR1
;      69F:       D8CA             FMULD FR2
;      6A1:       9B               WAIT
;      6A2:       DDD9             FSTPD FR1
;      6A4:       D9E8             FLD1
;      6A6:       D9C9             FXCH FR1
;      6A8:       D8F9             FDIVRD FR1
;      6AA:       DDD3             FSTD FR3
;      6AC:       DDD8             FSTPD FR0
;      6AE:       DDD8             FSTPD FR0
;      6B0:       D9ED             FLDLN2
;      6B2:       D9C1             FLDD FR1
;      6B4:       D9F1             FYL2X
;      6B6:       D9C0             FLDD FR0
;      6B8:       8B05A8F56A0A     MOV EAX, [#xA6AF5A8]       ; 0.004166666666667d0
;      6BE:       DDD8             FSTPD FR0
;      6C0:       DD4001           FLDD [EAX+1]
;      6C3:       D8CB             FMULD FR3
;      6C5:       9B               WAIT
;      6C6:       8B05ACF56A0A     MOV EAX, [#xA6AF5AC]       ; 0.003968253986254d0
;      6CC:       DDDC             FSTPD FR4
;      6CE:       DD4001           FLDD [EAX+1]
;      6D1:       D9CC             FXCH FR4
;      6D3:       D8E4             FSUBD FR4
;      6D5:       9B               WAIT
;      6D6:       D8CB             FMULD FR3
;      6D8:       9B               WAIT
;      6D9:       8B05B0F56A0A     MOV EAX, [#xA6AF5B0]       ; 0.008333333333333d0
;      6DF:       DDDC             FSTPD FR4
;      6E1:       DD4001           FLDD [EAX+1]
;      6E4:       D9CC             FXCH FR4
;      6E6:       D8C4             FADDD FR4
;      6E8:       9B               WAIT
;      6E9:       D8CB             FMULD FR3
;      6EB:       9B               WAIT
;      6EC:       8B05B4F56A0A     MOV EAX, [#xA6AF5B4]       ; 0.083333333333333d0
;      6F2:       DDDC             FSTPD FR4
;      6F4:       DD4001           FLDD [EAX+1]
;      6F7:       D9CC             FXCH FR4
;      6F9:       D8E4             FSUBD FR4
;      6FB:       9B               WAIT
;      6FC:       DCCB             FMUL-STI FR3
;      6FE:       9B               WAIT
;      6FF:       8B05B8F56A0A     MOV EAX, [#xA6AF5B8]       ; 0.5d0
;      705:       DDD8             FSTPD FR0
;      707:       DD4001           FLDD [EAX+1]
;      70A:       D8F2             FDIVD FR2
;      70C:       9B               WAIT
;      70D:       DCE9             FSUB-STI FR1
;      70F:       9B               WAIT
;      710:       DDD8             FSTPD FR0
;      712:       D9E8             FLD1
;      714:       D8EA             FSUBRD FR2
;      716:       9B               WAIT
;      717:       DDD2             FSTD FR2
;      719:       DDDC             FSTPD FR4
;      71B:       D9E8             FLD1
;      71D:       D9CC             FXCH FR4
;      71F:       D8FC             FDIVRD FR4
;      721:       9B               WAIT
;      722:       DCE9             FSUB-STI FR1
;      724:       9B               WAIT
;      725:       DDD8             FSTPD FR0
;      727:       D9E8             FLD1
;      729:       D8EA             FSUBRD FR2
;      72B:       9B               WAIT
;      72C:       DDD2             FSTD FR2
;      72E:       DDDC             FSTPD FR4
;      730:       D9E8             FLD1
;      732:       D9CC             FXCH FR4
;      734:       D8FC             FDIVRD FR4
;      736:       9B               WAIT
;      737:       DCE9             FSUB-STI FR1
;      739:       9B               WAIT
;      73A:       DDD8             FSTPD FR0
;      73C:       D9E8             FLD1
;      73E:       D8EA             FSUBRD FR2
;      740:       9B               WAIT
;      741:       DDD2             FSTD FR2
;      743:       DDDC             FSTPD FR4
;      745:       D9E8             FLD1
;      747:       D9CC             FXCH FR4
;      749:       D8FC             FDIVRD FR4
;      74B:       9B               WAIT
;      74C:       DCE9             FSUB-STI FR1
;      74E:       9B               WAIT
;      74F:       DDD8             FSTPD FR0
;      751:       D9E8             FLD1
;      753:       D8EA             FSUBRD FR2
;      755:       9B               WAIT
;      756:       DDD2             FSTD FR2
;      758:       DDDC             FSTPD FR4
;      75A:       D9E8             FLD1
;      75C:       D9CC             FXCH FR4
;      75E:       D8FC             FDIVRD FR4
;      760:       9B               WAIT
;      761:       DCE9             FSUB-STI FR1
;      763:       9B               WAIT
;      764:       DDD8             FSTPD FR0
;      766:       D9E8             FLD1
;      768:       D8EA             FSUBRD FR2
;      76A:       9B               WAIT
;      76B:       DDD2             FSTD FR2
;      76D:       DDDC             FSTPD FR4
;      76F:       D9E8             FLD1
;      771:       D9CC             FXCH FR4
;      773:       D8FC             FDIVRD FR4
;      775:       9B               WAIT
;      776:       DCE9             FSUB-STI FR1
;      778:       9B               WAIT
;      779:       DDD8             FSTPD FR0
;      77B:       D9E8             FLD1
;      77D:       D8EA             FSUBRD FR2
;      77F:       9B               WAIT
;      780:       DDD2             FSTD FR2
;      782:       DDDA             FSTPD FR2
;      784:       D9E8             FLD1
;      786:       D9CA             FXCH FR2
;      788:       D8FA             FDIVRD FR2
;      78A:       9B               WAIT
;      78B:       D8E9             FSUBRD FR1
;      78D:       9B               WAIT
;      78E:       D8C3             FADDD FR3
;      790:       9B               WAIT
;      791:       800D2403000504   OR BYTE PTR [#x5000324], 4
;      798:       BA10000000       MOV EDX, 16
;      79D:       0315A05F1708     ADD EDX, [#x8175FA0]       ; boxed_region
;      7A3:       3B15A45F1708     CMP EDX, [#x8175FA4]       ; boxed_region
;      7A9:       7607             JBE L0
;      7AB:       E85CD79AFD       CALL #x805CF0C             ; alloc_overflow_edx
;      7B0:       EB09             JMP L1
;      7B2: L0:   8915A05F1708     MOV [#x8175FA0], EDX       ; boxed_region
;      7B8:       83EA10           SUB EDX, 16
;      7BB: L1:   C70216030000     MOV DWORD PTR [EDX], 790
;      7C1:       8D5207           LEA EDX, [EDX+7]
;      7C4:       DD5201           FSTD [EDX+1]
;      7C7:       80352403000504   XOR BYTE PTR [#x5000324], 4
;      7CE:       7402             JEQ L2
;      7D0:       CC09             BREAK 9                    ; pending interrupt trap
;      7D2: L2:   8D65F8           LEA ESP, [EBP-8]
;      7D5:       F8               CLC
;      7D6:       8B6DFC           MOV EBP, [EBP-4]
;      7D9:       C20400           RET 4
;      7DC:       90               NOP
;      7DD:       90               NOP
;      7DE:       90               NOP
;      7DF:       90               NOP
;      7E0:       CC0A             BREAK 10                   ; error trap
;      7E2:       02               BYTE #X02
;      7E3:       18               BYTE #X18                  ; INVALID-ARG-COUNT-ERROR
;      7E4:       0D               BYTE #X0D                  ; EAX
; 
(DISASSEMBLE
 (DEFUN DIGAMMA/SETQ-X (X)
   (DECLARE (:EXPLAIN :TYPES :INLINING :VARIABLES)
    (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0)) ((DOUBLE-FLOAT (0.0d0) *) X))
   (INCF X 6.0d0)
   (LET* ((P (/ 1.0d0 (* X X))) (LOGX (LOG X)) (ONE 1.0d0))
     (DECLARE (DOUBLE-FLOAT ONE P))
     (SETQ P
             (*
              (-
               (*
                (+ (* P (- (* P 0.004166666666667d0) 0.003968253986254d0))
                   0.008333333333333d0)
                P)
               0.083333333333333d0)
              P))
     (+ P
        (- LOGX (/ 0.5d0 X) (/ ONE (SETQ X (- X ONE)))
           (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE)))
           (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE)))
           (/ ONE (SETQ X (- X ONE))))))))
--> NIL

------------------------------------------------------------------------
CMU Common Lisp 19c (19C) 


; In: DEFUN DIGAMMA/SETQ-X

;   (DEFUN DIGAMMA/SETQ-X (X) (DECLARE # # #) (INCF X 6.0d0) ...)
; Warning: Unrecognized declaration: (:EXPLAIN :TYPES :INLINING :VARIABLES).
; 
; Note: Abbreviated type declaration: ((DOUBLE-FLOAT # *) X).
; 
; Converted DIGAMMA/SETQ-X.

; In: LAMBDA (X)

;   #'(LAMBDA (X) (DECLARE # # #) (BLOCK DIGAMMA/SETQ-X # #))
; Warning: Unrecognized declaration: (:EXPLAIN :TYPES :INLINING :VARIABLES).
; 
; Note: Abbreviated type declaration: ((DOUBLE-FLOAT # *) X).
; 
; Compiling LAMBDA (X): 

; In: LAMBDA (X)

;   #'(LAMBDA (X) (DECLARE # # #) (BLOCK DIGAMMA/SETQ-X # #))
; Note: Doing float to pointer coercion (cost 13) to "<return value>".
; 
; Compiling Top-Level Form: 

; Compilation unit finished.
;   1 warning
;   2 notes

5808A4D0:       .ENTRY "LAMBDA (X)"()        ; FUNCTION
     4E8:       POP     DWORD PTR [EBP-8]
     4EB:       LEA     ESP, [EBP-32]
     4EE:       FSTPD   ST(0)
     4F0:       FLDD    [EDX+1]
     4F3:       FSTD    ST(2)
     4F5:       MOV     EAX, [#x5808A4B8]    ; 6.0d0
                                             ; No-arg-parsing entry point
     4FB:       FSTPD   ST(0)
     4FD:       FLDD    [EAX+1]
     500:       FADD    ST(2), ST(0)
     502:       FSTPD   ST(0)
     504:       FLDD    ST(1)
     506:       FMULD   ST(2)
     508:       FSTPD   ST(1)
     50A:       FLD1
     50C:       FXCH    ST(1)
     50E:       FDIVRD  ST(1)
     510:       FSTD    ST(3)
     512:       FSTPD   ST(0)
     514:       FSTPD   ST(0)
     516:       FLDLN2
     518:       FLDD    ST(1)
     51A:       FYL2X
     51C:       FLDD    ST(0)
     51E:       MOV     EAX, [#x5808A4BC]    ; 0.004166666666667d0
     524:       FSTPD   ST(0)
     526:       FLDD    [EAX+1]
     529:       FMULD   ST(3)
     52B:       MOV     EAX, [#x5808A4C0]    ; 0.003968253986254d0
     531:       FSTPD   ST(4)
     533:       FLDD    [EAX+1]
     536:       FXCH    ST(4)
     538:       FSUBD   ST(4)
     53A:       FMULD   ST(3)
     53C:       MOV     EAX, [#x5808A4C4]    ; 0.008333333333333d0
     542:       FSTPD   ST(4)
     544:       FLDD    [EAX+1]
     547:       FXCH    ST(4)
     549:       FADDD   ST(4)
     54B:       FMULD   ST(3)
     54D:       MOV     EAX, [#x5808A4C8]    ; 0.083333333333333d0
     553:       FSTPD   ST(4)
     555:       FLDD    [EAX+1]
     558:       FXCH    ST(4)
     55A:       FSUBD   ST(4)
     55C:       FMUL    ST(3), ST(0)
     55E:       MOV     EAX, [#x5808A4CC]    ; 0.5d0
     564:       FSTPD   ST(0)
     566:       FLDD    [EAX+1]
     569:       FDIVD   ST(2)
     56B:       FSUB    ST(1), ST(0)
     56D:       FSTPD   ST(0)
     56F:       FLD1
     571:       FSUBRD  ST(2)
     573:       FSTD    ST(2)
     575:       FSTPD   ST(4)
     577:       FLD1
     579:       FXCH    ST(4)
     57B:       FDIVRD  ST(4)
     57D:       FSUB    ST(1), ST(0)
     57F:       FSTPD   ST(0)
     581:       FLD1
     583:       FSUBRD  ST(2)
     585:       FSTD    ST(2)
     587:       FSTPD   ST(4)
     589:       FLD1
     58B:       FXCH    ST(4)
     58D:       FDIVRD  ST(4)
     58F:       FSUB    ST(1), ST(0)
     591:       FSTPD   ST(0)
     593:       FLD1
     595:       FSUBRD  ST(2)
     597:       FSTD    ST(2)
     599:       FSTPD   ST(4)
     59B:       FLD1
     59D:       FXCH    ST(4)
     59F:       FDIVRD  ST(4)
     5A1:       FSUB    ST(1), ST(0)
     5A3:       FSTPD   ST(0)
     5A5:       FLD1
     5A7:       FSUBRD  ST(2)
     5A9:       FSTD    ST(2)
     5AB:       FSTPD   ST(4)
     5AD:       FLD1
     5AF:       FXCH    ST(4)
     5B1:       FDIVRD  ST(4)
     5B3:       FSUB    ST(1), ST(0)
     5B5:       FSTPD   ST(0)
     5B7:       FLD1
     5B9:       FSUBRD  ST(2)
     5BB:       FSTD    ST(2)
     5BD:       FSTPD   ST(4)
     5BF:       FLD1
     5C1:       FXCH    ST(4)
     5C3:       FDIVRD  ST(4)
     5C5:       FSUB    ST(1), ST(0)
     5C7:       FSTPD   ST(0)
     5C9:       FLD1
     5CB:       FSUBRD  ST(2)
     5CD:       FSTD    ST(2)
     5CF:       FSTPD   ST(2)
     5D1:       FLD1
     5D3:       FXCH    ST(2)
     5D5:       FDIVRD  ST(2)
     5D7:       FSUBRD  ST(1)
     5D9:       FADDD   ST(3)
     5DB:       MOV     BYTE PTR [#x28000234], 0 ; LISP::*PSEUDO-ATOMIC-INTERRUPTED*
     5E2:       MOV     BYTE PTR [#x2800021C], 4 ; LISP::*PSEUDO-ATOMIC-ATOMIC*
     5E9:       MOV     EDX, 16
     5EE:       ADD     EDX, [#x2800057C]    ; X86::*CURRENT-REGION-FREE-POINTER*
     5F4:       CMP     EDX, [#x28000594]    ; X86::*CURRENT-REGION-END-ADDR*
     5FA:       JBE     L0
     5FC:       CALL    #xBE000010           ; #xBE000010: alloc_overflow_edx
     601: L0:   XCHG    EDX, [#x2800057C]    ; X86::*CURRENT-REGION-FREE-POINTER*
     607:       MOV     DWORD PTR [EDX], 790
     60D:       LEA     EDX, [EDX+7]
     610:       FSTD    [EDX+1]
     613:       MOV     BYTE PTR [#x2800021C], 0 ; LISP::*PSEUDO-ATOMIC-ATOMIC*
     61A:       CMP     BYTE PTR [#x28000234], 0 ; LISP::*PSEUDO-ATOMIC-INTERRUPTED*
     621:       JEQ     L1
     623:       BREAK   9                    ; Pending interrupt trap
     625: L1:   MOV     ECX, [EBP-8]
     628:       MOV     EAX, [EBP-4]
     62B:       ADD     ECX, 2
     62E:       MOV     ESP, EBP
     630:       MOV     EBP, EAX
     632:       JMP     ECX
     634:       NOP
     635:       NOP
     636:       NOP
     637:       NOP

(DISASSEMBLE
 (DEFUN DIGAMMA/SETQ-X (X)
   (DECLARE (:EXPLAIN :TYPES :INLINING :VARIABLES)
            (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0))
            ((DOUBLE-FLOAT (0.0d0) *) X))
   (INCF X 6.0d0)
   (LET* ((P (/ 1.0d0 (* X X))) (LOGX (LOG X)) (ONE 1.0d0))
     (DECLARE (DOUBLE-FLOAT ONE P))
     (SETQ P
             (*
              (-
               (*
                (+ (* P (- (* P 0.004166666666667d0) 0.003968253986254d0))
                   0.008333333333333d0)
                P)
               0.083333333333333d0)
              P))
     (+ P
        (- LOGX
           (/ 0.5d0 X)
           (/ ONE (SETQ X (- X ONE)))
           (/ ONE (SETQ X (- X ONE)))
           (/ ONE (SETQ X (- X ONE)))
           (/ ONE (SETQ X (- X ONE)))
           (/ ONE (SETQ X (- X ONE)))
           (/ ONE (SETQ X (- X ONE))))))))
--> 

------------------------------------------------------------------------
GNU Common Lisp (GCL) GCL 2.6.7 

Compiling gazonk0.lsp.
Warning:
The OPTIMIZE quality DEBUG is unknown.
Warning:
The declaration specifier :EXPLAIN is unknown.
End of Pass 1.  
End of Pass 2.  
OPTIMIZE levels: Safety=0 (No runtime error checking), Space=0, Speed=3
Finished compiling gazonk0.lsp.

#include "gazonk0.h"
void init_code(){do_init(VV);}
/*	function definition for DIGAMMASETQ-X	*/

static void L1()
{register object *base=vs_base;
	register object *sup=base+VM1; VC1
	vs_check;
	{register object V1;
	V1=(base[0]);
	vs_top=sup;
	goto TTL;
TTL:;
	V1= number_plus((V1),VV[0]);
	{register double V2;
	object V3;
	register double V4;
	base[2]= VV[1];
	base[3]= number_times((V1),(V1));
	vs_top=(vs_base=base+2)+2;
	Ldivide();
	vs_top=sup;
	V2= lf(vs_base[0]);
	base[2]= (V1);
	vs_top=(vs_base=base+2)+1;
	Llog();
	vs_top=sup;
	V3= vs_base[0];
	V4=     1.    ;
	V2= (double)((double)((double)((double)((double)(V2)*(double)((double)((double)(V2)*(double)(4.16666666666700140e-3))-(double)(3.96825398625400010e-3)))+(double)(8.33333333333300020e-3))*(double)(V2))-(double)(8.33333333333329960e-2))*(double)(V2);
	V5 = make_longfloat(V2);
	base[2]= (V3);
	base[4]= VV[6];
	base[5]= (V1);
	vs_top=(vs_base=base+4)+2;
	Ldivide();
	vs_top=sup;
	base[3]= vs_base[0];
	base[5]= make_longfloat(V4);
	V7 = make_longfloat(V4);
	V1= number_minus((V1),V7);
	base[6]= (V1);
	vs_top=(vs_base=base+5)+2;
	Ldivide();
	vs_top=sup;
	base[4]= vs_base[0];
	base[6]= make_longfloat(V4);
	V8 = make_longfloat(V4);
	V1= number_minus((V1),V8);
	base[7]= (V1);
	vs_top=(vs_base=base+6)+2;
	Ldivide();
	vs_top=sup;
	base[5]= vs_base[0];
	base[7]= make_longfloat(V4);
	V9 = make_longfloat(V4);
	V1= number_minus((V1),V9);
	base[8]= (V1);
	vs_top=(vs_base=base+7)+2;
	Ldivide();
	vs_top=sup;
	base[6]= vs_base[0];
	base[8]= make_longfloat(V4);
	V10 = make_longfloat(V4);
	V1= number_minus((V1),V10);
	base[9]= (V1);
	vs_top=(vs_base=base+8)+2;
	Ldivide();
	vs_top=sup;
	base[7]= vs_base[0];
	base[9]= make_longfloat(V4);
	V11 = make_longfloat(V4);
	V1= number_minus((V1),V11);
	base[10]= (V1);
	vs_top=(vs_base=base+9)+2;
	Ldivide();
	vs_top=sup;
	base[8]= vs_base[0];
	base[10]= make_longfloat(V4);
	V12 = make_longfloat(V4);
	V1= number_minus((V1),V12);
	base[11]= (V1);
	vs_top=(vs_base=base+10)+2;
	Ldivide();
	vs_top=sup;
	base[9]= vs_base[0];
	vs_top=(vs_base=base+2)+8;
	Lminus();
	vs_top=sup;
	V6= vs_base[0];
	base[2]= number_plus(V5,V6);
	vs_top=(vs_base=base+2)+1;
	return;}
	}
}
       
#(
#(6.0F0 1.0F0 0.0041666666666670014F0 0.0039682539862540001F0 0.0083333333333330002F0 0.083333333333332996F0 0.5F0 (system::%init . #((system::mf (lisp::quote common-lisp-user::digamma/setq-x) 0) (system::debug (lisp::quote common-lisp-user::digamma/setq-x) (lisp::quote (common-lisp-user::x common-lisp-user::logx))))))
)

static void L1();
#define VC1 object  V12 ,V11 ,V10 ,V9 ,V8 ,V7 ,V6 ,V5;
#define VM1 12
static char * VVi[8]={
#define Cdata VV[7]
(char *)(L1)
};
#define VV ((object *)VVi)

gazonk0.o:     file format elf32-i386

Disassembly of section .text:

00000000 <init_code>:
init_code():
   0:	83 ec 18             	sub    $0x18,%esp
   3:	68 00 00 00 00       	push   $0x0
   8:	e8 fc ff ff ff       	call   9 <init_code+0x9>
   d:	83 c4 1c             	add    $0x1c,%esp
  10:	c3                   	ret    
  11:	eb 0d                	jmp    20 <L1>
  13:	90                   	nop    
  14:	90                   	nop    
  15:	90                   	nop    
  16:	90                   	nop    
  17:	90                   	nop    
  18:	90                   	nop    
  19:	90                   	nop    
  1a:	90                   	nop    
  1b:	90                   	nop    
  1c:	90                   	nop    
  1d:	90                   	nop    
  1e:	90                   	nop    
  1f:	90                   	nop    

00000020 <L1>:
L1():
  20:	55                   	push   %ebp
  21:	57                   	push   %edi
  22:	56                   	push   %esi
  23:	53                   	push   %ebx
  24:	83 ec 5c             	sub    $0x5c,%esp
  27:	a1 00 00 00 00       	mov    0x0,%eax
  2c:	8b 2d 00 00 00 00    	mov    0x0,%ebp
  32:	39 05 00 00 00 00    	cmp    %eax,0x0
  38:	8d 75 30             	lea    0x30(%ebp),%esi
  3b:	0f 83 6f 03 00 00    	jae    3b0 <L1+0x390>
  41:	83 ec 08             	sub    $0x8,%esp
  44:	8b 1d 00 00 00 00    	mov    0x0,%ebx
  4a:	8b 45 00             	mov    0x0(%ebp),%eax
  4d:	53                   	push   %ebx
  4e:	50                   	push   %eax
  4f:	89 35 00 00 00 00    	mov    %esi,0x0
  55:	e8 fc ff ff ff       	call   56 <L1+0x36>
  5a:	89 44 24 38          	mov    %eax,0x38(%esp,1)
  5e:	a1 04 00 00 00       	mov    0x4,%eax
  63:	89 45 08             	mov    %eax,0x8(%ebp)
  66:	58                   	pop    %eax
  67:	5a                   	pop    %edx
  68:	8b 44 24 30          	mov    0x30(%esp,1),%eax
  6c:	50                   	push   %eax
  6d:	8b 44 24 34          	mov    0x34(%esp,1),%eax
  71:	50                   	push   %eax
  72:	e8 fc ff ff ff       	call   73 <L1+0x53>
  77:	8d 7d 08             	lea    0x8(%ebp),%edi
  7a:	8d 4d 10             	lea    0x10(%ebp),%ecx
  7d:	89 45 0c             	mov    %eax,0xc(%ebp)
  80:	89 4c 24 54          	mov    %ecx,0x54(%esp,1)
  84:	89 0d 00 00 00 00    	mov    %ecx,0x0
  8a:	89 7c 24 58          	mov    %edi,0x58(%esp,1)
  8e:	89 3d 00 00 00 00    	mov    %edi,0x0
  94:	e8 fc ff ff ff       	call   95 <L1+0x75>
  99:	a1 00 00 00 00       	mov    0x0,%eax
  9e:	8b 00                	mov    (%eax),%eax
  a0:	dd 40 04             	fldl   0x4(%eax)
  a3:	8b 44 24 38          	mov    0x38(%esp,1),%eax
  a7:	8d 55 0c             	lea    0xc(%ebp),%edx
  aa:	89 45 08             	mov    %eax,0x8(%ebp)
  ad:	8b 44 24 58          	mov    0x58(%esp,1),%eax
  b1:	89 54 24 50          	mov    %edx,0x50(%esp,1)
  b5:	89 35 00 00 00 00    	mov    %esi,0x0
  bb:	dd 5c 24 10          	fstpl  0x10(%esp,1)
  bf:	89 15 00 00 00 00    	mov    %edx,0x0
  c5:	a3 00 00 00 00       	mov    %eax,0x0
  ca:	e8 fc ff ff ff       	call   cb <L1+0xab>
  cf:	dd 44 24 10          	fldl   0x10(%esp,1)
  d3:	d9 c0                	fld    %st(0)
  d5:	dc 0d 00 00 00 00    	fmull  0x0
  db:	dc 25 08 00 00 00    	fsubl  0x8
  e1:	d8 c9                	fmul   %st(1),%st
  e3:	dc 05 10 00 00 00    	faddl  0x10
  e9:	d8 c9                	fmul   %st(1),%st
  eb:	dc 25 18 00 00 00    	fsubl  0x18
  f1:	a1 00 00 00 00       	mov    0x0,%eax
  f6:	de c9                	fmulp  %st,%st(1)
  f8:	8b 18                	mov    (%eax),%ebx
  fa:	89 35 00 00 00 00    	mov    %esi,0x0
 100:	dd 1c 24             	fstpl  (%esp,1)
 103:	e8 fc ff ff ff       	call   104 <L1+0xe4>
 108:	89 5d 08             	mov    %ebx,0x8(%ebp)
 10b:	89 44 24 5c          	mov    %eax,0x5c(%esp,1)
 10f:	a1 18 00 00 00       	mov    0x18,%eax
 114:	89 45 10             	mov    %eax,0x10(%ebp)
 117:	8b 44 24 38          	mov    0x38(%esp,1),%eax
 11b:	8d 7d 18             	lea    0x18(%ebp),%edi
 11e:	89 45 14             	mov    %eax,0x14(%ebp)
 121:	8b 44 24 54          	mov    0x54(%esp,1),%eax
 125:	89 7c 24 4c          	mov    %edi,0x4c(%esp,1)
 129:	a3 00 00 00 00       	mov    %eax,0x0
 12e:	89 3d 00 00 00 00    	mov    %edi,0x0
 134:	e8 fc ff ff ff       	call   135 <L1+0x115>
 139:	a1 00 00 00 00       	mov    0x0,%eax
 13e:	8b 00                	mov    (%eax),%eax
 140:	89 45 0c             	mov    %eax,0xc(%ebp)
 143:	59                   	pop    %ecx
 144:	5b                   	pop    %ebx
 145:	68 00 00 f0 3f       	push   $0x3ff00000
 14a:	6a 00                	push   $0x0
 14c:	89 35 00 00 00 00    	mov    %esi,0x0
 152:	e8 fc ff ff ff       	call   153 <L1+0x133>
 157:	89 45 14             	mov    %eax,0x14(%ebp)
 15a:	58                   	pop    %eax
 15b:	5a                   	pop    %edx
 15c:	68 00 00 f0 3f       	push   $0x3ff00000
 161:	6a 00                	push   $0x0
 163:	e8 fc ff ff ff       	call   164 <L1+0x144>
 168:	5b                   	pop    %ebx
 169:	5f                   	pop    %edi
 16a:	50                   	push   %eax
 16b:	8b 44 24 34          	mov    0x34(%esp,1),%eax
 16f:	50                   	push   %eax
 170:	e8 fc ff ff ff       	call   171 <L1+0x151>
 175:	8d 5d 1c             	lea    0x1c(%ebp),%ebx
 178:	8d 4d 14             	lea    0x14(%ebp),%ecx
 17b:	89 45 18             	mov    %eax,0x18(%ebp)
 17e:	89 0d 00 00 00 00    	mov    %ecx,0x0
 184:	89 44 24 34          	mov    %eax,0x34(%esp,1)
 188:	89 5c 24 48          	mov    %ebx,0x48(%esp,1)
 18c:	89 1d 00 00 00 00    	mov    %ebx,0x0
 192:	e8 fc ff ff ff       	call   193 <L1+0x173>
 197:	a1 00 00 00 00       	mov    0x0,%eax
 19c:	8b 00                	mov    (%eax),%eax
 19e:	89 45 10             	mov    %eax,0x10(%ebp)
 1a1:	58                   	pop    %eax
 1a2:	5a                   	pop    %edx
 1a3:	68 00 00 f0 3f       	push   $0x3ff00000
 1a8:	6a 00                	push   $0x0
 1aa:	89 35 00 00 00 00    	mov    %esi,0x0
 1b0:	e8 fc ff ff ff       	call   1b1 <L1+0x191>
 1b5:	89 45 18             	mov    %eax,0x18(%ebp)
 1b8:	5b                   	pop    %ebx
 1b9:	5f                   	pop    %edi
 1ba:	68 00 00 f0 3f       	push   $0x3ff00000
 1bf:	6a 00                	push   $0x0
 1c1:	e8 fc ff ff ff       	call   1c2 <L1+0x1a2>
 1c6:	5a                   	pop    %edx
 1c7:	59                   	pop    %ecx
 1c8:	50                   	push   %eax
 1c9:	8b 44 24 30          	mov    0x30(%esp,1),%eax
 1cd:	50                   	push   %eax
 1ce:	e8 fc ff ff ff       	call   1cf <L1+0x1af>
 1d3:	89 c3                	mov    %eax,%ebx
 1d5:	8d 55 20             	lea    0x20(%ebp),%edx
 1d8:	89 45 1c             	mov    %eax,0x1c(%ebp)
 1db:	8b 44 24 4c          	mov    0x4c(%esp,1),%eax
 1df:	89 54 24 44          	mov    %edx,0x44(%esp,1)
 1e3:	89 15 00 00 00 00    	mov    %edx,0x0
 1e9:	a3 00 00 00 00       	mov    %eax,0x0
 1ee:	e8 fc ff ff ff       	call   1ef <L1+0x1cf>
 1f3:	a1 00 00 00 00       	mov    0x0,%eax
 1f8:	8b 00                	mov    (%eax),%eax
 1fa:	89 45 14             	mov    %eax,0x14(%ebp)
 1fd:	59                   	pop    %ecx
 1fe:	5f                   	pop    %edi
 1ff:	68 00 00 f0 3f       	push   $0x3ff00000
 204:	6a 00                	push   $0x0
 206:	89 35 00 00 00 00    	mov    %esi,0x0
 20c:	e8 fc ff ff ff       	call   20d <L1+0x1ed>
 211:	89 45 1c             	mov    %eax,0x1c(%ebp)
 214:	58                   	pop    %eax
 215:	5a                   	pop    %edx
 216:	68 00 00 f0 3f       	push   $0x3ff00000
 21b:	6a 00                	push   $0x0
 21d:	e8 fc ff ff ff       	call   21e <L1+0x1fe>
 222:	59                   	pop    %ecx
 223:	5f                   	pop    %edi
 224:	50                   	push   %eax
 225:	53                   	push   %ebx
 226:	e8 fc ff ff ff       	call   227 <L1+0x207>
 22b:	8d 7d 24             	lea    0x24(%ebp),%edi
 22e:	89 44 24 30          	mov    %eax,0x30(%esp,1)
 232:	89 45 20             	mov    %eax,0x20(%ebp)
 235:	8b 44 24 48          	mov    0x48(%esp,1),%eax
 239:	89 7c 24 40          	mov    %edi,0x40(%esp,1)
 23d:	a3 00 00 00 00       	mov    %eax,0x0
 242:	89 3d 00 00 00 00    	mov    %edi,0x0
 248:	e8 fc ff ff ff       	call   249 <L1+0x229>
 24d:	a1 00 00 00 00       	mov    0x0,%eax
 252:	8b 00                	mov    (%eax),%eax
 254:	89 45 18             	mov    %eax,0x18(%ebp)
 257:	58                   	pop    %eax
 258:	5a                   	pop    %edx
 259:	68 00 00 f0 3f       	push   $0x3ff00000
 25e:	6a 00                	push   $0x0
 260:	89 35 00 00 00 00    	mov    %esi,0x0
 266:	e8 fc ff ff ff       	call   267 <L1+0x247>
 26b:	89 45 20             	mov    %eax,0x20(%ebp)
 26e:	5b                   	pop    %ebx
 26f:	5f                   	pop    %edi
 270:	68 00 00 f0 3f       	push   $0x3ff00000
 275:	6a 00                	push   $0x0
 277:	e8 fc ff ff ff       	call   278 <L1+0x258>
 27c:	5a                   	pop    %edx
 27d:	59                   	pop    %ecx
 27e:	50                   	push   %eax
 27f:	8b 44 24 2c          	mov    0x2c(%esp,1),%eax
 283:	50                   	push   %eax
 284:	e8 fc ff ff ff       	call   285 <L1+0x265>
 289:	89 44 24 2c          	mov    %eax,0x2c(%esp,1)
 28d:	8d 4d 28             	lea    0x28(%ebp),%ecx
 290:	89 45 24             	mov    %eax,0x24(%ebp)
 293:	8b 44 24 44          	mov    0x44(%esp,1),%eax
 297:	a3 00 00 00 00       	mov    %eax,0x0
 29c:	89 4c 24 3c          	mov    %ecx,0x3c(%esp,1)
 2a0:	89 0d 00 00 00 00    	mov    %ecx,0x0
 2a6:	e8 fc ff ff ff       	call   2a7 <L1+0x287>
 2ab:	a1 00 00 00 00       	mov    0x0,%eax
 2b0:	8b 00                	mov    (%eax),%eax
 2b2:	89 45 1c             	mov    %eax,0x1c(%ebp)
 2b5:	59                   	pop    %ecx
 2b6:	5b                   	pop    %ebx
 2b7:	68 00 00 f0 3f       	push   $0x3ff00000
 2bc:	6a 00                	push   $0x0
 2be:	89 35 00 00 00 00    	mov    %esi,0x0
 2c4:	e8 fc ff ff ff       	call   2c5 <L1+0x2a5>
 2c9:	89 45 24             	mov    %eax,0x24(%ebp)
 2cc:	58                   	pop    %eax
 2cd:	5a                   	pop    %edx
 2ce:	68 00 00 f0 3f       	push   $0x3ff00000
 2d3:	6a 00                	push   $0x0
 2d5:	e8 fc ff ff ff       	call   2d6 <L1+0x2b6>
 2da:	5b                   	pop    %ebx
 2db:	5f                   	pop    %edi
 2dc:	50                   	push   %eax
 2dd:	8b 5c 24 28          	mov    0x28(%esp,1),%ebx
 2e1:	53                   	push   %ebx
 2e2:	e8 fc ff ff ff       	call   2e3 <L1+0x2c3>
 2e7:	89 44 24 28          	mov    %eax,0x28(%esp,1)
 2eb:	8d 55 2c             	lea    0x2c(%ebp),%edx
 2ee:	89 45 28             	mov    %eax,0x28(%ebp)
 2f1:	8b 44 24 40          	mov    0x40(%esp,1),%eax
 2f5:	a3 00 00 00 00       	mov    %eax,0x0
 2fa:	89 15 00 00 00 00    	mov    %edx,0x0
 300:	e8 fc ff ff ff       	call   301 <L1+0x2e1>
 305:	a1 00 00 00 00       	mov    0x0,%eax
 30a:	8b 00                	mov    (%eax),%eax
 30c:	89 45 20             	mov    %eax,0x20(%ebp)
 30f:	58                   	pop    %eax
 310:	5a                   	pop    %edx
 311:	68 00 00 f0 3f       	push   $0x3ff00000
 316:	6a 00                	push   $0x0
 318:	89 35 00 00 00 00    	mov    %esi,0x0
 31e:	e8 fc ff ff ff       	call   31f <L1+0x2ff>
 323:	89 45 28             	mov    %eax,0x28(%ebp)
 326:	5b                   	pop    %ebx
 327:	5f                   	pop    %edi
 328:	68 00 00 f0 3f       	push   $0x3ff00000
 32d:	6a 00                	push   $0x0
 32f:	e8 fc ff ff ff       	call   330 <L1+0x310>
 334:	5a                   	pop    %edx
 335:	59                   	pop    %ecx
 336:	50                   	push   %eax
 337:	8b 44 24 24          	mov    0x24(%esp,1),%eax
 33b:	50                   	push   %eax
 33c:	e8 fc ff ff ff       	call   33d <L1+0x31d>
 341:	89 45 2c             	mov    %eax,0x2c(%ebp)
 344:	8b 44 24 3c          	mov    0x3c(%esp,1),%eax
 348:	a3 00 00 00 00       	mov    %eax,0x0
 34d:	89 35 00 00 00 00    	mov    %esi,0x0
 353:	e8 fc ff ff ff       	call   354 <L1+0x334>
 358:	a1 00 00 00 00       	mov    0x0,%eax
 35d:	8b 00                	mov    (%eax),%eax
 35f:	89 45 24             	mov    %eax,0x24(%ebp)
 362:	8b 44 24 58          	mov    0x58(%esp,1),%eax
 366:	a3 00 00 00 00       	mov    %eax,0x0
 36b:	8b 44 24 3c          	mov    0x3c(%esp,1),%eax
 36f:	a3 00 00 00 00       	mov    %eax,0x0
 374:	e8 fc ff ff ff       	call   375 <L1+0x355>
 379:	59                   	pop    %ecx
 37a:	a1 00 00 00 00       	mov    0x0,%eax
 37f:	5b                   	pop    %ebx
 380:	8b 08                	mov    (%eax),%ecx
 382:	51                   	push   %ecx
 383:	8b 44 24 58          	mov    0x58(%esp,1),%eax
 387:	50                   	push   %eax
 388:	89 35 00 00 00 00    	mov    %esi,0x0
 38e:	e8 fc ff ff ff       	call   38f <L1+0x36f>
 393:	89 45 08             	mov    %eax,0x8(%ebp)
 396:	8b 44 24 58          	mov    0x58(%esp,1),%eax
 39a:	a3 00 00 00 00       	mov    %eax,0x0
 39f:	8b 44 24 50          	mov    0x50(%esp,1),%eax
 3a3:	a3 00 00 00 00       	mov    %eax,0x0
 3a8:	83 c4 6c             	add    $0x6c,%esp
 3ab:	5b                   	pop    %ebx
 3ac:	5e                   	pop    %esi
 3ad:	5f                   	pop    %edi
 3ae:	5d                   	pop    %ebp
 3af:	c3                   	ret    
 3b0:	e8 fc ff ff ff       	call   3b1 <L1+0x391>
 3b5:	e9 87 fc ff ff       	jmp    41 <L1+0x21>

(DISASSEMBLE
    (DEFUN DIGAMMA/SETQ-X (X)
      (DECLARE (:EXPLAIN :TYPES :INLINING :VARIABLES)
               (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0))
               ((DOUBLE-FLOAT (0.0) *) X))
      (INCF X 6.0)
      (LET* ((P (/ 1.0 (* X X))) (LOGX (LOG X)) (ONE 1.0))
        (DECLARE (DOUBLE-FLOAT ONE P))
        (SETQ P
              (* (- (* (+ (* P
                             (- (* P 0.0041666666666670005)
                                0.0039682539862540001))
                          0.0083333333333330002)
                       P)
                    0.083333333333332996)
                 P))
        (+ P
           (- LOGX (/ 0.5 X) (/ ONE (SETQ X (- X ONE)))
              (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE)))
              (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE)))
              (/ ONE (SETQ X (- X ONE))))))))
--> T

------------------------------------------------------------------------
ECL 0.9g 

;;; Loading #P"/usr/local/languages/ecl-cvs/lib/ecl/cmp.fas"
;;; Loading #P"/usr/local/languages/ecl-cvs/lib/ecl/sysfun.lsp"

Name:		DIGAMMA/SETQ-X
Required:	X
Documentation:	NIL
Declarations:	((EXPLAIN TYPES INLINING VARIABLES) (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0)) ((DOUBLE-FLOAT (0.0d0) *) X))
   0	QUOTE	6.0d0
   2	BIND	#:G42
   4	PUSHV	2
   6	PUSHV	0
   8	CALLG	2,+
  11	SETQ	2
  13	UNBIND	1
  15	PUSH	'1.0d0
  17	PUSHV	1
  19	PUSHV	1
  21	PCALLG	2,*
  24	CALLG	2,/
  27	BIND	P
  29	PUSHV	2
  31	CALLG	1,LOG
  34	BIND	LOGX
  36	QUOTE	1.0d0
  38	BIND	ONE
  40	PUSHV	2
  42	PUSHV	2
  44	PUSH	'0.004166666666667d0
  46	PCALLG	2,*
  49	PUSH	'0.003968253986254d0
  51	PCALLG	2,-
  54	PCALLG	2,*
  57	PUSH	'0.008333333333333d0
  59	PCALLG	2,+
  62	PUSHV	2
  64	PCALLG	2,*
  67	PUSH	'0.083333333333333d0
  69	PCALLG	2,-
  72	PUSHV	2
  74	CALLG	2,*
  77	SETQ	2
  79	PUSHV	2
  81	PUSHV	1
  83	PUSH	'0.5d0
  85	PUSHV	4
  87	PCALLG	2,/
  90	PUSHV	0
  92	PUSHV	4
  94	PUSHV	0
  96	CALLG	2,-
  99	SETQ	4
 101	PUSH	VALUES(0)
 102	PCALLG	2,/
 105	PUSHV	0
 107	PUSHV	4
 109	PUSHV	0
 111	CALLG	2,-
 114	SETQ	4
 116	PUSH	VALUES(0)
 117	PCALLG	2,/
 120	PUSHV	0
 122	PUSHV	4
 124	PUSHV	0
 126	CALLG	2,-
 129	SETQ	4
 131	PUSH	VALUES(0)
 132	PCALLG	2,/
 135	PUSHV	0
 137	PUSHV	4
 139	PUSHV	0
 141	CALLG	2,-
 144	SETQ	4
 146	PUSH	VALUES(0)
 147	PCALLG	2,/
 150	PUSHV	0
 152	PUSHV	4
 154	PUSHV	0
 156	CALLG	2,-
 159	SETQ	4
 161	PUSH	VALUES(0)
 162	PCALLG	2,/
 165	PUSHV	0
 167	PUSHV	4
 169	PUSHV	0
 171	CALLG	2,-
 174	SETQ	4
 176	PUSH	VALUES(0)
 177	PCALLG	2,/
 180	PCALLG	8,-
 183	CALLG	2,+
 186	UNBIND	3
 188	EXIT
(DISASSEMBLE
 (DEFUN DIGAMMA/SETQ-X (X)
   (DECLARE (:EXPLAIN :TYPES :INLINING :VARIABLES)
    (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0)) ((DOUBLE-FLOAT (0.0d0) *) X))
   (INCF X 6.0d0)
   (LET* ((P (/ 1.0d0 (* X X))) (LOGX (LOG X)) (ONE 1.0d0))
     (DECLARE (DOUBLE-FLOAT ONE P))
     (SETQ P
             (*
              (-
               (*
                (+ (* P (- (* P 0.004166666666667d0) 0.003968253986254d0))
                   0.008333333333333d0)
                P)
               0.083333333333333d0)
              P))
     (+ P
        (- LOGX
           (/ 0.5d0 X)
           (/ ONE (SETQ X (- X ONE)))
           (/ ONE (SETQ X (- X ONE)))
           (/ ONE (SETQ X (- X ONE)))
           (/ ONE (SETQ X (- X ONE)))
           (/ ONE (SETQ X (- X ONE)))
           (/ ONE (SETQ X (- X ONE))))))))
--> NIL

------------------------------------------------------------------------
International Allegro CL Free Express Edition 8.0 [Linux (x86)] (Jun 6, 2006 16:01) 

;Tgen1:Examined a (possibly unboxed) call to +_2OP with arguments:
;Targ2:  symeval X type (DOUBLE-FLOAT (0.0d0) *)
;Tinf1:     VARIABLE-information: LEXICAL: ((TYPE (DOUBLE-FLOAT (0.0d0) *)))
;Targ3:  constant 6.0d0 type (DOUBLE-FLOAT 6.0d0 6.0d0)

[...a lot of output....]

;Igen2:  Arg1: type is unboxable as DOUBLE-FLOAT
;Igen4: Inline attempt succeeded.
;Vflt1: Variables stored in floating point registers: (X P LOGX ONE X X X X X X).
;Vflt2: Variables stored in floating point memory: (X).
;; disassembly of An unhandled error occurred during initialization:
                  An error occurred
                  (Unable to print
                   #<Function (:ANONYMOUS-LAMBDA 2) @ #x71587c12>
                   readably and *PRINT-READABLY* is true.)
                  during the load "/tmp/clall-7736.lisp"




[···@thalassa lisp]$ clall '(progn (compile (defun digamma/setq-x (x)
   (declare (:explain :types :inlining :variables)
            (optimize (speed 3) (safety 0) (debug 0))
            (type (double-float (0.0d0) *) x))
   (incf x 6d0)
   (let* ((p (/ 1d0 (* x x)))
          (logx (log x))
          (one 1d0))
     (declare (double-float one p))
     (setq p (* (- (* (+ (* p (- (* p 0.004166666666667d0)
                                 0.003968253986254d0))
                         0.008333333333333d0) p)
                   0.083333333333333d0) p))
     (+ p (- logx
             (/ 0.5d0 x)
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))))
     ))) (time (loop for i from 10000 to 1000000 
                     do (digamma/setq-x (coerce i (quote double-float))))))'
> > > > > > > > > > > > > > > > > > > > > > 
------------------------------------------------------------------------
CLISP 2.39 (2006-07-16) (built 3364813332) (memory 3364813914) 

WARNING in DIGAMMA/SETQ-X :
Unknown declaration :EXPLAIN.
The whole declaration will be ignored.
Real time: 123.51723 sec.
Run time: 110.53891 sec.
Space: 2714225608 Bytes
GC: 5177, GC time: 42.89465 sec.
(PROGN
 (COMPILE
  (DEFUN DIGAMMA/SETQ-X (X)
   (DECLARE (:EXPLAIN :TYPES :INLINING :VARIABLES)
    (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0))
    (TYPE (DOUBLE-FLOAT (0.0d0) *) X))
   (INCF X 6.0d0)
   (LET* ((P (/ 1.0d0 (* X X))) (LOGX (LOG X)) (ONE 1.0d0))
    (DECLARE (DOUBLE-FLOAT ONE P))
    (SETQ P
     (*
      (-
       (*
        (+ (* P (- (* P 0.004166666666667d0) 0.003968253986254d0))
         0.008333333333333d0)
        P)
       0.083333333333333d0)
      P))
    (+ P
     (- LOGX (/ 0.5d0 X) (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE)))
      (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE)))
      (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE))))))))
 (TIME
  (LOOP FOR I FROM 10000 TO 1000000 DO
   (DIGAMMA/SETQ-X (COERCE I 'DOUBLE-FLOAT)))))
--> NIL

------------------------------------------------------------------------
SBCL 0.9.17 

; in: DEFUN DIGAMMA/SETQ-X
;     (DEFUN DIGAMMA/SETQ-X (X)
;     (DECLARE (:EXPLAIN :TYPES :INLINING :VARIABLES)
;      (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0))
;      (TYPE (DOUBLE-FLOAT (0.0d0) *) X))
;     (INCF X 6.0d0)
;     (LET* ((P (/ 1.0d0 #)) (LOGX (LOG X)) (ONE 1.0d0))
;       (DECLARE (DOUBLE-FLOAT ONE P))
;       (SETQ P (* (- # 0.083333333333333d0) P))
;       (+ P
;          (- LOGX (/ 0.5d0 X) (/ ONE #) (/ ONE #) (/ ONE #) (/ ONE #) (/ ONE #)
;             (/ ONE #)))))
; --> PROGN EVAL-WHEN SB-IMPL::%DEFUN SB-INT:NAMED-LAMBDA 
; ==>
;   #'(SB-INT:NAMED-LAMBDA DIGAMMA/SETQ-X (X)
;                          (DECLARE (:EXPLAIN :TYPES :INLINING :VARIABLES)
;                           (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0))
;                           (TYPE (DOUBLE-FLOAT (0.0d0) *) X))
;                          (BLOCK DIGAMMA/SETQ-X
;                            (INCF X 6.0d0)
;                            (LET* ((P #) (LOGX #) (ONE 1.0d0))
;                              (DECLARE (DOUBLE-FLOAT ONE P))
;                              (SETQ P (* # P))
;                              (+ P (- LOGX # # # # # # #)))))
; 
; caught WARNING:
;   unrecognized declaration (:EXPLAIN :TYPES :INLINING :VARIABLES)
; 
; note: doing float to pointer coercion (cost 13) to "<return value>"
; 
; compilation unit finished
;   caught 1 WARNING condition
;   printed 1 note
Evaluation took:
  0.476 seconds of real time
  0.340021 seconds of user run time
  0.032002 seconds of system run time
  [Run times include 0.012 seconds GC run time.]
  0 calls to %EVAL
  0 page faults and
  15,843,328 bytes consed.

(PROGN
 (COMPILE
  (DEFUN DIGAMMA/SETQ-X (X)
    (DECLARE (:EXPLAIN :TYPES :INLINING :VARIABLES)
     (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0))
     (TYPE (DOUBLE-FLOAT (0.0d0) *) X))
    (INCF X 6.0d0)
    (LET* ((P (/ 1.0d0 (* X X))) (LOGX (LOG X)) (ONE 1.0d0))
      (DECLARE (DOUBLE-FLOAT ONE P))
      (SETQ P
              (*
               (-
                (*
                 (+ (* P (- (* P 0.004166666666667d0) 0.003968253986254d0))
                    0.008333333333333d0)
                 P)
                0.083333333333333d0)
               P))
      (+ P
         (- LOGX (/ 0.5d0 X) (/ ONE (SETQ X (- X ONE)))
            (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE)))
            (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE)))
            (/ ONE (SETQ X (- X ONE))))))))
 (TIME
  (LOOP FOR I FROM 10000 TO 1000000 DO
        (DIGAMMA/SETQ-X (COERCE I 'DOUBLE-FLOAT)))))
--> NIL

------------------------------------------------------------------------
CMU Common Lisp 19c (19C) 


; In: DEFUN DIGAMMA/SETQ-X

;   (DEFUN DIGAMMA/SETQ-X (X) (DECLARE # # #) (INCF X 6.0d0) ...)
; Warning: Unrecognized declaration: (:EXPLAIN :TYPES :INLINING :VARIABLES).
; 
; Converted DIGAMMA/SETQ-X.

; In: LAMBDA (X)

;   #'(LAMBDA (X) (DECLARE # # #) (BLOCK DIGAMMA/SETQ-X # #))
; Warning: Unrecognized declaration: (:EXPLAIN :TYPES :INLINING :VARIABLES).
; 
; Compiling LAMBDA (X): 

; In: LAMBDA (X)

;   #'(LAMBDA (X) (DECLARE # # #) (BLOCK DIGAMMA/SETQ-X # #))
; Note: Doing float to pointer coercion (cost 13) to "<return value>".
; 
; Compiling Top-Level Form: 

; Compilation unit finished.
;   1 warning
;   1 note

; Compiling LAMBDA NIL: 
; Compiling Top-Level Form: 
; [GC threshold exceeded with 12,012,864 bytes in use.  Commencing GC.]
; [GC completed with 323,128 bytes retained and 11,689,736 bytes freed.]
; [GC will next occur when at least 12,323,128 bytes are in use.]
; [GC threshold exceeded with 12,331,640 bytes in use.  Commencing GC.]
; [GC completed with 327,216 bytes retained and 12,004,424 bytes freed.]
; [GC will next occur when at least 12,327,216 bytes are in use.]

; Evaluation took:
;   0.77 seconds of real time
;   0.520032 seconds of user run time
;   0.080005 seconds of system run time
;   925,499,781 CPU cycles
;   [Run times include 0.1 seconds GC run time]
;   3 page faults and
;   31,685,280 bytes consed.
; 

(PROGN
  (COMPILE
   (DEFUN DIGAMMA/SETQ-X (X)
     (DECLARE (:EXPLAIN :TYPES :INLINING :VARIABLES)
              (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0))
              (TYPE (DOUBLE-FLOAT (0.0d0) *) X))
     (INCF X 6.0d0)
     (LET* ((P (/ 1.0d0 (* X X))) (LOGX (LOG X)) (ONE 1.0d0))
       (DECLARE (DOUBLE-FLOAT ONE P))
       (SETQ P
               (*
                (-
                 (*
                  (+ (* P (- (* P 0.004166666666667d0) 0.003968253986254d0))
                     0.008333333333333d0)
                  P)
                 0.083333333333333d0)
                P))
       (+ P
          (- LOGX
             (/ 0.5d0 X)
             (/ ONE (SETQ X (- X ONE)))
             (/ ONE (SETQ X (- X ONE)))
             (/ ONE (SETQ X (- X ONE)))
             (/ ONE (SETQ X (- X ONE)))
             (/ ONE (SETQ X (- X ONE)))
             (/ ONE (SETQ X (- X ONE))))))))
  (TIME
    (LOOP FOR I FROM 10000 TO 1000000
          DO (DIGAMMA/SETQ-X (COERCE I 'DOUBLE-FLOAT)))))
--> NIL

------------------------------------------------------------------------
GNU Common Lisp (GCL) GCL 2.6.7 

Compiling gazonk0.lsp.
Warning:
The OPTIMIZE quality DEBUG is unknown.
Warning:
The declaration specifier :EXPLAIN is unknown.
End of Pass 1.  
End of Pass 2.  
OPTIMIZE levels: Safety=0 (No runtime error checking), Space=0, Speed=3
Finished compiling gazonk0.lsp.
real time       :     13.250 secs
run-gbc time    :      9.590 secs
child run time  :      0.000 secs
gbc time        :      2.460 secs

(PROGN
  (COMPILE (DEFUN DIGAMMA/SETQ-X (X)
             (DECLARE (:EXPLAIN :TYPES :INLINING :VARIABLES)
                      (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0))
                      (TYPE (DOUBLE-FLOAT (0.0) *) X))
             (INCF X 6.0)
             (LET* ((P (/ 1.0 (* X X))) (LOGX (LOG X)) (ONE 1.0))
               (DECLARE (DOUBLE-FLOAT ONE P))
               (SETQ P
                     (* (- (* (+ (* P
                                    (- (* P 0.0041666666666670005)
                                     0.0039682539862540001))
                                 0.0083333333333330002)
                              P)
                           0.083333333333332996)
                        P))
               (+ P
                  (- LOGX (/ 0.5 X) (/ ONE (SETQ X (- X ONE)))
                     (/ ONE (SETQ X (- X ONE)))
                     (/ ONE (SETQ X (- X ONE)))
                     (/ ONE (SETQ X (- X ONE)))
                     (/ ONE (SETQ X (- X ONE)))
                     (/ ONE (SETQ X (- X ONE))))))))
  (TIME (LOOP
          FOR
          I
          FROM
          10000
          TO
          1000000
          DO
          (DIGAMMA/SETQ-X (COERCE I 'DOUBLE-FLOAT)))))
--> NIL

------------------------------------------------------------------------
ECL 0.9g 

;;; Loading #P"/usr/local/languages/ecl-cvs/lib/ecl/cmp.fas"
;;; Loading #P"/usr/local/languages/ecl-cvs/lib/ecl/sysfun.lsp"
;;; Compiling (LET* (# #) ...).
;;; Warning: The declaration specifier :EXPLAIN is unknown.
;;; Note: Replacing variable G58 by its value #<form LOCATION 82EF870>
;;; End of Pass 1.  
;;; Note: Replacing variable G53 by its value
;;; Calling the C compiler... 
;;; Note: 
;;; Invoking external command: gcc  -I../include -g -O2 -fPIC -fstrict-aliasing -Dlinux -O "-I/usr/local/languages/ecl-cvs/lib/ecl//h" -w -c "/a6/users/pjb/src/public/lisp/ECL001przwXx.c" -o "/a6/users/pjb/src/public/lisp/ECL001przwXx.o"

;;; Note: 
;;; Invoking external command: gcc -o "/a6/users/pjb/src/public/lisp/ECL001przwXx.fas" -L"/usr/local/languages/ecl-cvs/lib/ecl/" "/a6/users/pjb/src/public/lisp/ECL001przwXx.o"  -Wl,--rpath,/usr/local/languages/ecl-cvs/lib/ecl/ -shared   -lecl -ldl  -lm  

;;; OPTIMIZE levels: Safety=2, Space=0, Speed=3
real time : 19.000 secs
run time  : 18.460 secs

(PROGN
 (COMPILE
  (DEFUN DIGAMMA/SETQ-X (X)
    (DECLARE (:EXPLAIN :TYPES :INLINING :VARIABLES)
     (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0))
     (TYPE (DOUBLE-FLOAT (0.0d0) *) X))
    (INCF X 6.0d0)
    (LET* ((P (/ 1.0d0 (* X X))) (LOGX (LOG X)) (ONE 1.0d0))
      (DECLARE (DOUBLE-FLOAT ONE P))
      (SETQ P
              (*
               (-
                (*
                 (+ (* P (- (* P 0.004166666666667d0) 0.003968253986254d0))
                    0.008333333333333d0)
                 P)
                0.083333333333333d0)
               P))
      (+ P
         (- LOGX
            (/ 0.5d0 X)
            (/ ONE (SETQ X (- X ONE)))
            (/ ONE (SETQ X (- X ONE)))
            (/ ONE (SETQ X (- X ONE)))
            (/ ONE (SETQ X (- X ONE)))
            (/ ONE (SETQ X (- X ONE)))
            (/ ONE (SETQ X (- X ONE))))))))
 (TIME
  (LOOP FOR
        I
        FROM
        10000
        TO
        1000000
        DO
        (DIGAMMA/SETQ-X (COERCE I 'DOUBLE-FLOAT)))))
--> NIL

------------------------------------------------------------------------
International Allegro CL Free Express Edition 8.0 [Linux (x86)] (Jun 6, 2006 16:01) 

;Tgen1:Examined a (possibly unboxed) call to +_2OP with arguments:
;Targ2:  symeval X type (DOUBLE-FLOAT (0.0d0) *)
;Tinf1:     VARIABLE-information: LEXICAL: ((TYPE (DOUBLE-FLOAT (0.0d0) *)))
;Targ3:  constant 6.0d0 type (DOUBLE-FLOAT 6.0d0 6.0d0)
;Tres1: which returns a value of type (DOUBLE-FLOAT (6.0d0) *)

[... a lot of output ...]

;Igen1:  Arg1: checking for implied unboxable type (floats, or machine-integer):
;Igen2:  Arg1: type is unboxable as DOUBLE-FLOAT
;Igen4: Inline attempt succeeded.
;Vflt1: Variables stored in floating point registers: (X P LOGX ONE X X X X X X).
;Vflt2: Variables stored in floating point memory: (X).
; cpu time (non-gc) 19,130 msec user, 20 msec system
; cpu time (gc)     1,690 msec user, 0 msec system
; cpu time (total)  20,820 msec user, 20 msec system
; real time  22,630 msec
; space allocation:
;  15,840,172 cons cells, 150,483,824 other bytes, 0 static bytes
(PROGN (COMPILE (DEFUN DIGAMMA/SETQ-X (X) (DECLARE (:EXPLAIN :TYPES :INLINING :VARIABLES) (OPTIMIZE (SPEED 3) (SAFETY 0) (DEBUG 0)) (TYPE (DOUBLE-FLOAT (0.0d0) *) X)) (INCF X 6.0d0) (LET* ((P (/ 1.0d0 (* X X))) (LOGX (LOG X)) (ONE 1.0d0)) (DECLARE (DOUBLE-FLOAT ONE P)) (SETQ P (* (- (* (+ (* P (- (* P 0.004166666666667d0) 0.003968253986254d0)) 0.008333333333333d0) P) 0.083333333333333d0) P)) (+ P (- LOGX (/ 0.5d0 X) (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE))) (/ ONE (SETQ X (- X ONE)))))))) (TIME (LOOP FOR I FROM 10000 TO 1000000 DO (DIGAMMA/SETQ-X (COERCE I (QUOTE DOUBLE-FLOAT))))))
--> NIL

------------------------------------------------------------------------

[···@thalassa lisp]$ 

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

"A TRUE Klingon warrior does not comment his code!"
From: Joel Reymont
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <1161374666.838107.126500@e3g2000cwe.googlegroups.com>
Pascal,

Pascal Bourguignon wrote:

> Allegro 8.0    20.820

Thanks for the stats! The ACL timing above is wrong, though, and
happens when log is used instead of double-log. Please modify (logx
(log x) to read (logx (double-log x)). So long as you include the
optimizatio hints for ACL you'll have awesome performance.

>    (let* ((p (/ 1d0 (* x x)))
>           (logx (log x))
>           (one 1d0))

It seems that Lisps other than ACL and LW are not using SSE2/SSE3
instructions. LW also seems to use them partially, i.e. mix SSE and x87
in the code for digamma.

This is my latest version of the code with the change above:

(in-package :cl-user)

#+lispworks
(declaim (optimize (float 0))
	 (ftype (function (double-float) double-float) digamma/setq-x)
	 )

#+allegro
(eval-when (compile load eval)
  (setf (get 'digamma/setq-x 'sys::immed-args-call) ; now say some
magic
             '((double-float) double-float)))

#+allegro
(ff:def-foreign-call (double-log "log") ((x :double))
  :returning :double
  :call-direct t
  :arg-checking nil)

(defun digamma/setq-x (x)
  (declare (optimize (speed 3) (safety 0) (debug 0) (float 0))
	   (type (double-float (0d0) *) x))
  (let ((x (+ x 6d0)))
    (declare (type (double-float (6d0) *) x))
    (let* ((p (/ 1d0 (* x x)))
	   #+lispworks (logx (log x))
	   #+allegro (logx (double-log x))
	   (one 1d0))
      (declare (type double-float one p))
      (setq p (* (- (* (+ (* p (- (* p 0.004166666666667d0)
				  0.003968253986254d0))
			  0.008333333333333d0) p)
		    0.083333333333333d0) p))
      (+ p (- (- (- (- (- (- (- logx
				(/ 0.5d0 x))
			     (/ one (setq x (- x one))))
			  (/ one (setq x (- x one))))
		       (/ one (setq x (- x one))))
		    (/ one (setq x (- x one))))
		 (/ one (setq x (- x one))))
	      (/ one (setq x (- x one)))))
      )))

(defun timeit ()
  (declare (optimize (speed 3) (safety 0) (debug 0) (float 0)))
  (loop for i from 10000 to 10000000
        do (digamma/setq-x (coerce (the fixnum i) 'double-float))))
From: Richard J. Fateman
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <45392EE9.6030900@eecs.berkeley.edu>
I guess I don't understand the double-log business.
I ran your "latest version" and then I changed the line
  #+allegro (logx (double-log x))

to
  #+allegro (logx (log x))

and it ran faster.  On my Windows machine the first version was
about 6.5 seconds.  The second was about 2.5 seconds.  [Timing   (timeit)..]

RJF
From: Joel Reymont
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <1161376303.343962.125930@i42g2000cwa.googlegroups.com>
Richard J. Fateman wrote:
> On my Windows machine the first version was
> about 6.5 seconds.  The second was about 2.5 seconds.  [Timing   (timeit)..]

I will let Duane comment on this. What type of architecture are you on?
What version of ACL?

    Thanks, Joel
From: Richard J. Fateman
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <ehbdfi$252t$1@agate.berkeley.edu>
(I thought "Windows" gave it away.. but my benchmark machine is a 2.6GHz 
Pentium 4 running Windows XP. I'm running Allegro CL 8.0 [licensed, 
though I think the free trial "express" version would be identical])

I subsequently looked at the gmane discussion and found it had to do 
with lisp running on Macintosh. So maybe it's irrelevant.

My own interest in doing floating point in lisp has been trying to make 
better use of some of the subtleties in the IEEE floating point 
implementations (rounding modes, traps); speed of properly declared code 
has become much less of an issue in comparing Lisp to C (etc).


Joel Reymont wrote:

> Richard J. Fateman wrote:
> 
>>On my Windows machine the first version was
>>about 6.5 seconds.  The second was about 2.5 seconds.  [Timing   (timeit)..]
> 
> 
> I will let Duane comment on this. What type of architecture are you on?
> What version of ACL?
> 
>     Thanks, Joel
> 
From: Joel Reymont
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <1161380824.096426.73150@h48g2000cwc.googlegroups.com>
Richard J. Fateman wrote:
> (I thought "Windows" gave it away.. but my benchmark machine is a 2.6GHz
> Pentium 4 running Windows XP. I'm running Allegro CL 8.0 [licensed,
> though I think the free trial "express" version would be identical])

The P4 bit is what I was interested in. Can you send me the disassembly
from your digamma or post it here? I wonder if SSE instructions are
being used. The Mac Intel bit should not matter as it's still x86. At
least I don't think it should matter.
From: ············@gmail.com
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <1161410346.545041.307830@i3g2000cwc.googlegroups.com>
On Oct 20, 2:47 pm, "Joel Reymont" <······@gmail.com> wrote:
> The P4 bit is what I was interested in. Can you send me the disassembly
> from your digamma or post it here? I wonder if SSE instructions are
> being used. The Mac Intel bit should not matter as it's still x86. At
> least I don't think it should matter.

One thing to remember is that just because you see that SSE
instructions are being used, it doesn't mean that the compiler is
taking advantage of their parallelism.  I've seen recent versions of
the Intel C compiler emit SSE2 instructions but only using one-half of
the 128-bit pipe.  The reason for this is that on the Pentium 4, the
SSE2 instructions have one cycle less latency than the double-precision
x87 instructions.  Intel engineers (and presumably also AMD, though I
haven't checked) don't seem to consider the x87 pipe a priority for
optimization (alas for poor Prof. Kahan, who helped design it and
insisted on the 80-bit internal registers -- the SSE2 pipes are 64-bit
all the way).

To tell if the SSE2 instructions are actually being exploited for their
full parallelism, you'll need to make sure that both halves of the SSE2
register are being filled.

Best,
mfh
From: William D Clinger
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <1161437479.311433.4010@m7g2000cwm.googlegroups.com>
Joel Reymont wrote:
> The P4 bit is what I was interested in. Can you send me the disassembly
> from your digamma or post it here? I wonder if SSE instructions are
> being used. The Mac Intel bit should not matter as it's still x86. At
> least I don't think it should matter.

Apple's C compiler generates SSE2 instructions.  I looked
at this just yesterday, while we were trying to figure out
why double precision (sqrt 0.9999999999999999) was giving
us the correct answer (according to IEEE-754) under Mac
OS X but was returning 1.0 under Linux.

Under Linux, the C compiler was generating the old-style
instructions, probably because there are so many Linux
machines in the field that don't support SSE and SSE2.
Apple doesn't have that problem.

Will
From: Douglas Crosher
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <4539eeb1$0$24706$61c65585@uv-55king-reader-01.melbourne.pipenetworks.com.au>
Hello Joel,

The Scieneer Common Lisp is 10% faster than optimize C code for your
problem.  The Scieneer Common Lisp takes advantage of SSE2 and SSE3
instructions but does not yet taking advantage of SIMD operations.
Extended precision floating point is also supported using the x87 FPU
instructions.  The results below were run on a 2000MHz Athlon 64.


File: digamma.c
-=-
#include "math.h"

static inline double digamma(double x)
{
      double p;
      x=x+6;
      p=1/(x*x);
      p=(((0.004166666666667*p-0.003968253986254)*p+
	0.008333333333333)*p-0.083333333333333)*p;
      p=p+log(x)-0.5/x-1/(x-1)-1/(x-2)-1/(x-3)-1/(x-4)-1/(x-5)-1/(x-6);
      return p;
}

double timeit()
{
   int  i;
   double  result;

   for (i = 10000; i < 10000000; i++)
     {
       result = digamma((double) i);
     }

   return result;
}

main()
{
   timeit();
}
-=-

cc -O3 -o digamma digamma.c -lm
time digamma
1.624u 0.004s 0:01.61 100.6%    0+0k 0+0io 0pf+0w


File: digamma1.lisp
-=-
(in-package :cl-user)

(declaim (inline digamma/setq-x))
(defun digamma/setq-x (x)
   (declare (optimize (speed 3) (safety 0) (debug 0))
	   (type (double-float (0d0) *) x))
   (let ((x (+ x 6d0)))
     (declare (type (double-float (6d0) *) x))
     (let* ((p (/ 1d0 (* x x)))
	   (logx (log x))
	   (one 1d0))
       (declare (type double-float one p))
       (setq p (* (- (* (+ (* p (- (* p 0.004166666666667d0)
				  0.003968253986254d0))
			  0.008333333333333d0) p)
		    0.083333333333333d0) p))
       (+ p (- (- (- (- (- (- (- logx
				(/ 0.5d0 x))
			     (/ one (setq x (- x one))))
			  (/ one (setq x (- x one))))
		       (/ one (setq x (- x one))))
		    (/ one (setq x (- x one))))
		 (/ one (setq x (- x one))))
	      (/ one (setq x (- x one)))))
       )))

(defun timeit ()
   (declare (optimize (speed 3) (safety 0) (debug 0)))
   (let ((result 0d0))
     (declare (double-float result))
     (loop for i from 10000 to 10000000
	  do (setf result (digamma/setq-x (coerce (the fixnum i) 'double-float))))
     result))
-=-

* (time (timeit))
Compiling lambda nil:
Compiling Top-Level Form:

Evaluation took:
   1.4379311 seconds of real time
   1.4306431 seconds of thread run time
   1.4306521 seconds of process run time
   1.432089 seconds of user run time
   0.004 seconds of system run time
   0 page faults
   16 bytes consed by this thread and
   16 total bytes consed.
From: Thibault Langlois
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <1161431727.821452.9910@i3g2000cwc.googlegroups.com>
On Oct 21, 11:19 am, Douglas Crosher <····@scieneer.com> wrote:
> Hello Joel,
>
> The Scieneer Common Lisp is 10% faster than optimize C code for your
> problem.  The Scieneer Common Lisp takes advantage of SSE2 and SSE3
> instructions but does not yet taking advantage of SIMD operations.
> Extended precision floating point is also supported using the x87 FPU
> instructions.  The results below were run on a 2000MHz Athlon 64.
>
> File: digamma.c
> -=-
> #include "math.h"
>
> static inline double digamma(double x)
> {
>       double p;
>       x=x+6;
>       p=1/(x*x);
>       p=(((0.004166666666667*p-0.003968253986254)*p+
>         0.008333333333333)*p-0.083333333333333)*p;
>       p=p+log(x)-0.5/x-1/(x-1)-1/(x-2)-1/(x-3)-1/(x-4)-1/(x-5)-1/(x-6);
>       return p;
>
> }double timeit()
> {
>    int  i;
>    double  result;
>
>    for (i = 10000; i < 10000000; i++)
>      {
>        result = digamma((double) i);
>      }
>
>    return result;
>
> }main()
> {
>    timeit();}-=-
>
> cc -O3 -o digamma digamma.c -lm
> time digamma
> 1.624u 0.004s 0:01.61 100.6%    0+0k 0+0io 0pf+0w
>
> File: digamma1.lisp
> -=-
> (in-package :cl-user)
>
> (declaim (inline digamma/setq-x))
> (defun digamma/setq-x (x)
>    (declare (optimize (speed 3) (safety 0) (debug 0))
>            (type (double-float (0d0) *) x))
>    (let ((x (+ x 6d0)))
>      (declare (type (double-float (6d0) *) x))
>      (let* ((p (/ 1d0 (* x x)))
>            (logx (log x))
>            (one 1d0))
>        (declare (type double-float one p))
>        (setq p (* (- (* (+ (* p (- (* p 0.004166666666667d0)
>                                   0.003968253986254d0))
>                           0.008333333333333d0) p)
>                     0.083333333333333d0) p))
>        (+ p (- (- (- (- (- (- (- logx
>                                 (/ 0.5d0 x))
>                              (/ one (setq x (- x one))))
>                           (/ one (setq x (- x one))))
>                        (/ one (setq x (- x one))))
>                     (/ one (setq x (- x one))))
>                  (/ one (setq x (- x one))))
>               (/ one (setq x (- x one)))))
>        )))
>
> (defun timeit ()
>    (declare (optimize (speed 3) (safety 0) (debug 0)))
>    (let ((result 0d0))
>      (declare (double-float result))
>      (loop for i from 10000 to 10000000
>           do (setf result (digamma/setq-x (coerce (the fixnum i) 'double-float))))
>      result))
> -=-
>
> * (time (timeit))
> Compiling lambda nil:
> Compiling Top-Level Form:
>
> Evaluation took:
>    1.4379311 seconds of real time
>    1.4306431 seconds of thread run time
>    1.4306521 seconds of process run time
>    1.432089 seconds of user run time
>    0.004 seconds of system run time
>    0 page faults
>    16 bytes consed by this thread and
>    16 total bytes consed.

I get similar results with cmucl 19c (2GHz Pentium M):

CL-USER> (time (timeit))
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:

; Evaluation took:
;   1.55 seconds of real time
;   1.544097 seconds of user run time
;   0.0 seconds of system run time
;   3,081,616,411 CPU cycles
;   0 page faults and
;   16 bytes consed.

C version on the same machine:
$  time digamma

real	0m2.237s
user	0m2.228s
sys	0m0.000s
From: Terry Sullivan
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <a0f84$453a8193$186091bf$14075@KNOLOGY.NET>
Just to make sure that the challenge isn't using unoptimized C to 
compare with, I played arount with the C version a bit just to see if I 
could make it any faster and came up with this:

#include "math.h"

static inline double digamma(double x)
{
       double p;
       x=x+6;
       p=1/(x*x);
       p=(((0.004166666666667*p-0.003968253986254)*p+
         0.008333333333333)*p-0.083333333333333)*p;
       p=p+log(x)-0.5/x-1/(x-1)-1/(x-2)-1/(x-3)-1/(x-4)-1/(x-5)-1/(x-6);
       return p;

}

#define EnableDuffDevice 1
#define DuffCount 2

#if DuffCount > 10
#define EnableDuffDevice 0
#endif

#define DuffCase(n) case n: result = digamma(d += 1.0)

double timeit()
{
    double d;
    double  result;

    d = 10000.0;
    while (d < 10000000.0)
      {
        #if EnableDuffDevice == 1
        switch ((int) d % DuffCount)
	 {
	   DuffCase (0);
	   #if DuffCount > 9
	   DuffCase (9);
	   #endif
	   #if DuffCount > 8
	   DuffCase (8);
	   #endif
	   #if DuffCount > 7
	   DuffCase (7);
	   #endif
	   #if DuffCount > 6
	   DuffCase (6);
	   #endif
	   #if DuffCount > 5
	   DuffCase (5);
	   #endif
	   #if DuffCount > 4
	   DuffCase (4);
	   #endif
	   #if DuffCount > 3
	   DuffCase (3);
	   #endif
	   #if DuffCount > 2
	   DuffCase (2);
	   #endif
	   #if DuffCount > 1
	   DuffCase (1);
	   #endif
	 }
        #else
        result = digamma(d += 1.0);
        #endif
      }

    return result;

}

main()
{
    timeit();
}

On my system (PPC G4 1Ghz MacOSX), the original gave the following times

time ./digamma
real    0m6.033s
user    0m5.782s
sys     0m0.039s

The my version gives:

time ./digamma2
real    0m0.115s
user    0m0.076s
sys     0m0.007s

Play with the DuffCount to find a setup where the code in the loop fits 
in L1 or at worst L2 cache.

I would be interested to know how this version does on a newer faster 
machine. (Assuming I haven't made a horribly stupid coding error)

Ts


Douglas Crosher wrote:

> File: digamma.c
> -=-
> #include "math.h"
> 
> static inline double digamma(double x)
> {
>      double p;
>      x=x+6;
>      p=1/(x*x);
>      p=(((0.004166666666667*p-0.003968253986254)*p+
>     0.008333333333333)*p-0.083333333333333)*p;
>      p=p+log(x)-0.5/x-1/(x-1)-1/(x-2)-1/(x-3)-1/(x-4)-1/(x-5)-1/(x-6);
>      return p;
> }
> 
> double timeit()
> {
>   int  i;
>   double  result;
> 
>   for (i = 10000; i < 10000000; i++)
>     {
>       result = digamma((double) i);
>     }
> 
>   return result;
> }
> 
> main()
> {
>   timeit();
> }
> -=-
> 
> cc -O3 -o digamma digamma.c -lm
> time digamma
> 1.624u 0.004s 0:01.61 100.6%    0+0k 0+0io 0pf+0w
From: Terry Sullivan
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <f2c92$453a8362$186091bf$14972@KNOLOGY.NET>
The compiler options I used were:

cc -O3 -ffast-math -o digamma digamma.c -lm

Ts


Terry Sullivan wrote:

> Just to make sure that the challenge isn't using unoptimized C to 
> compare with, I played arount with the C version a bit just to see if I 
> could make it any faster and came up with this:
> 
> #include "math.h"
> 
> static inline double digamma(double x)
> {
>       double p;
>       x=x+6;
>       p=1/(x*x);
>       p=(((0.004166666666667*p-0.003968253986254)*p+
>         0.008333333333333)*p-0.083333333333333)*p;
>       p=p+log(x)-0.5/x-1/(x-1)-1/(x-2)-1/(x-3)-1/(x-4)-1/(x-5)-1/(x-6);
>       return p;
> 
> }
> 
> #define EnableDuffDevice 1
> #define DuffCount 2
> 
> #if DuffCount > 10
> #define EnableDuffDevice 0
> #endif
> 
> #define DuffCase(n) case n: result = digamma(d += 1.0)
> 
> double timeit()
> {
>    double d;
>    double  result;
> 
>    d = 10000.0;
>    while (d < 10000000.0)
>      {
>        #if EnableDuffDevice == 1
>        switch ((int) d % DuffCount)
>      {
>        DuffCase (0);
>        #if DuffCount > 9
>        DuffCase (9);
>        #endif
>        #if DuffCount > 8
>        DuffCase (8);
>        #endif
>        #if DuffCount > 7
>        DuffCase (7);
>        #endif
>        #if DuffCount > 6
>        DuffCase (6);
>        #endif
>        #if DuffCount > 5
>        DuffCase (5);
>        #endif
>        #if DuffCount > 4
>        DuffCase (4);
>        #endif
>        #if DuffCount > 3
>        DuffCase (3);
>        #endif
>        #if DuffCount > 2
>        DuffCase (2);
>        #endif
>        #if DuffCount > 1
>        DuffCase (1);
>        #endif
>      }
>        #else
>        result = digamma(d += 1.0);
>        #endif
>      }
> 
>    return result;
> 
> }
> 
> main()
> {
>    timeit();
> }
> 
> On my system (PPC G4 1Ghz MacOSX), the original gave the following times
> 
> time ./digamma
> real    0m6.033s
> user    0m5.782s
> sys     0m0.039s
> 
> The my version gives:
> 
> time ./digamma2
> real    0m0.115s
> user    0m0.076s
> sys     0m0.007s
> 
> Play with the DuffCount to find a setup where the code in the loop fits 
> in L1 or at worst L2 cache.
> 
> I would be interested to know how this version does on a newer faster 
> machine. (Assuming I haven't made a horribly stupid coding error)
> 
> Ts
> 
> 
> Douglas Crosher wrote:
> 
>> File: digamma.c
>> -=-
>> #include "math.h"
>>
>> static inline double digamma(double x)
>> {
>>      double p;
>>      x=x+6;
>>      p=1/(x*x);
>>      p=(((0.004166666666667*p-0.003968253986254)*p+
>>     0.008333333333333)*p-0.083333333333333)*p;
>>      p=p+log(x)-0.5/x-1/(x-1)-1/(x-2)-1/(x-3)-1/(x-4)-1/(x-5)-1/(x-6);
>>      return p;
>> }
>>
>> double timeit()
>> {
>>   int  i;
>>   double  result;
>>
>>   for (i = 10000; i < 10000000; i++)
>>     {
>>       result = digamma((double) i);
>>     }
>>
>>   return result;
>> }
>>
>> main()
>> {
>>   timeit();
>> }
>> -=-
>>
>> cc -O3 -o digamma digamma.c -lm
>> time digamma
>> 1.624u 0.004s 0:01.61 100.6%    0+0k 0+0io 0pf+0w
From: Brian Downing
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <YM2dnSq8-J3bC6fYnZ2dnUVZ_sednZ2d@insightbb.com>
In article <·····························@KNOLOGY.NET>,
Terry Sullivan  <··············@drifter.net> wrote:
> Just to make sure that the challenge isn't using unoptimized C to 
> compare with, I played arount with the C version a bit just to see if I 
> could make it any faster and came up with this:

[snip]

> On my system (PPC G4 1Ghz MacOSX), the original gave the following times
> time ./digamma
> real    0m6.033s
> 
> The my version gives:
> time ./digamma2
> real    0m0.115s

This is a good object lesson in that you must be very careful benchmarking
things with modern optimizing compilers - you've managed to confuse the
compiler into "optimizing" the whole algorithm away!

%% 154 tiamat:~/projects/c/scratch
:; gcc -O3 -Wall -ffast-math -o digamma2-orig digamma2-orig.c -lm
digamma2-orig.c:74: warning: return type defaults to ‘int’
digamma2-orig.c: In function ‘main’:
digamma2-orig.c:76: warning: control reaches end of non-void function
digamma2-orig.c: In function ‘timeit’:
digamma2-orig.c:27: warning: ‘result’ may be used uninitialized in this function

Notice the last warning there.

Let's try something here:

:; valgrind --tool=cachegrind ./digamma2-orig
==28188== Cachegrind, an I1/D1/L2 cache profiler.
[...]
==28188== I   refs:      65,061,501
==28188== I1  misses:           541
==28188== L2i misses:           536

Only 65 million instruction fetches?  Let's see, with 9990000 runs of
digamma2 that makes for:

CL-USER> (float (/ 65061501 9990000))
6.512663

about 6.5 instruction fetches per loop!  I purport that it's impossible
to compile the algorithm down to six amd64 instructions.

Stepping with GDB, the main() loop looks like this:

1: x/i $rip  0x400510 <main+48>:        cvttsd2si %xmm0,%eax
1: x/i $rip  0x400514 <main+52>:        mov    %eax,%edx
1: x/i $rip  0x400516 <main+54>:        shr    $0x1f,%edx
1: x/i $rip  0x400519 <main+57>:        add    %edx,%eax
1: x/i $rip  0x40051b <main+59>:        and    $0x1,%eax
1: x/i $rip  0x40051e <main+62>:        sub    %edx,%eax
1: x/i $rip  0x400520 <main+64>:        jne    0x400500 <main+32>
1: x/i $rip  0x400522 <main+66>:        mov    %rcx,0xfffffffffffffff8(%rsp)
1: x/i $rip  0x400527 <main+71>:        movlpd 0xfffffffffffffff8(%rsp),%xmm1
1: x/i $rip  0x40052d <main+77>:        addsd  %xmm1,%xmm0
1: x/i $rip  0x400531 <main+81>:        addsd  %xmm1,%xmm0
1: x/i $rip  0x400535 <main+85>:        comisd %xmm2,%xmm0
1: x/i $rip  0x400539 <main+89>:        jb     0x400510 <main+48>
1: x/i $rip  0x400510 <main+48>:        cvttsd2si %xmm0,%eax
1: x/i $rip  0x400514 <main+52>:        mov    %eax,%edx
1: x/i $rip  0x400516 <main+54>:        shr    $0x1f,%edx
1: x/i $rip  0x400519 <main+57>:        add    %edx,%eax
1: x/i $rip  0x40051b <main+59>:        and    $0x1,%eax
1: x/i $rip  0x40051e <main+62>:        sub    %edx,%eax
1: x/i $rip  0x400520 <main+64>:        jne    0x400500 <main+32>
1: x/i $rip  0x400522 <main+66>:        mov    %rcx,0xfffffffffffffff8(%rsp)
1: x/i $rip  0x400527 <main+71>:        movlpd 0xfffffffffffffff8(%rsp),%xmm1
1: x/i $rip  0x40052d <main+77>:        addsd  %xmm1,%xmm0
1: x/i $rip  0x400531 <main+81>:        addsd  %xmm1,%xmm0
1: x/i $rip  0x400535 <main+85>:        comisd %xmm2,%xmm0
1: x/i $rip  0x400539 <main+89>:        jb     0x400510 <main+48>
[repeat]

There's nothing left of the algorithm here.

Indeed, with this patch (which sums the results from inside the inline
function, thereby forcing the code to be compiled in), you'll see that
the Duff's Device case is just as slow as the naive loop:

--- digamma2-orig.c	2006-10-21 16:26:46.000000000 -0500
+++ digamma2.c	2006-10-21 16:25:04.000000000 -0500
@@ -1,13 +1,23 @@
+#include "stdio.h"
 #include "math.h"
 
+double tot_x = 0.0;
+double tot_p = 0.0;
+
+#define SumResults 1
+
 static inline double digamma(double x)
 {
        double p;
+       tot_x += x;
        x=x+6;
        p=1/(x*x);
        p=(((0.004166666666667*p-0.003968253986254)*p+
          0.008333333333333)*p-0.083333333333333)*p;
        p=p+log(x)-0.5/x-1/(x-1)-1/(x-2)-1/(x-3)-1/(x-4)-1/(x-5)-1/(x-6);
+       #if SumResults == 1
+       tot_p += p;
+       #endif
        return p;
 
 }
@@ -66,6 +76,9 @@
         #endif
       }
 
+    printf("tot_x: %f\n", tot_x);
+    printf("tot_p: %f\n", tot_p);
+
     return result;
 
 }

-bcd
From: Terry Sullivan
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <e4bc$453b02e7$186091bf$31336@KNOLOGY.NET>
Given the speed difference that I was seeing between the new and the 
old, I wondered if something like that wasn't taking place. I didn't 
make changes int the digamma function it's self, because that would have 
been changing the algorithm that was being benchmarked. I just wanted to 
eliminate inefficiencies in the supporting code.

Brian Downing wrote:
> In article <·····························@KNOLOGY.NET>,
> Terry Sullivan  <··············@drifter.net> wrote:
> 
>>Just to make sure that the challenge isn't using unoptimized C to 
>>compare with, I played arount with the C version a bit just to see if I 
>>could make it any faster and came up with this:
> 
> 
> [snip]
> 
> 
>>On my system (PPC G4 1Ghz MacOSX), the original gave the following times
>>time ./digamma
>>real    0m6.033s
>>
>>The my version gives:
>>time ./digamma2
>>real    0m0.115s
> 
> 
> This is a good object lesson in that you must be very careful benchmarking
> things with modern optimizing compilers - you've managed to confuse the
> compiler into "optimizing" the whole algorithm away!
> 
> %% 154 tiamat:~/projects/c/scratch
> :; gcc -O3 -Wall -ffast-math -o digamma2-orig digamma2-orig.c -lm
> digamma2-orig.c:74: warning: return type defaults to ‘int’
> digamma2-orig.c: In function ‘main’:
> digamma2-orig.c:76: warning: control reaches end of non-void function
> digamma2-orig.c: In function ‘timeit’:
> digamma2-orig.c:27: warning: ‘result’ may be used uninitialized in this function
> 
> Notice the last warning there.
> 
> Let's try something here:
> 
> :; valgrind --tool=cachegrind ./digamma2-orig
> ==28188== Cachegrind, an I1/D1/L2 cache profiler.
> [...]
> ==28188== I   refs:      65,061,501
> ==28188== I1  misses:           541
> ==28188== L2i misses:           536
> 
> Only 65 million instruction fetches?  Let's see, with 9990000 runs of
> digamma2 that makes for:
> 
> CL-USER> (float (/ 65061501 9990000))
> 6.512663
> 
> about 6.5 instruction fetches per loop!  I purport that it's impossible
> to compile the algorithm down to six amd64 instructions.
> 
> Stepping with GDB, the main() loop looks like this:
> 
> 1: x/i $rip  0x400510 <main+48>:        cvttsd2si %xmm0,%eax
> 1: x/i $rip  0x400514 <main+52>:        mov    %eax,%edx
> 1: x/i $rip  0x400516 <main+54>:        shr    $0x1f,%edx
> 1: x/i $rip  0x400519 <main+57>:        add    %edx,%eax
> 1: x/i $rip  0x40051b <main+59>:        and    $0x1,%eax
> 1: x/i $rip  0x40051e <main+62>:        sub    %edx,%eax
> 1: x/i $rip  0x400520 <main+64>:        jne    0x400500 <main+32>
> 1: x/i $rip  0x400522 <main+66>:        mov    %rcx,0xfffffffffffffff8(%rsp)
> 1: x/i $rip  0x400527 <main+71>:        movlpd 0xfffffffffffffff8(%rsp),%xmm1
> 1: x/i $rip  0x40052d <main+77>:        addsd  %xmm1,%xmm0
> 1: x/i $rip  0x400531 <main+81>:        addsd  %xmm1,%xmm0
> 1: x/i $rip  0x400535 <main+85>:        comisd %xmm2,%xmm0
> 1: x/i $rip  0x400539 <main+89>:        jb     0x400510 <main+48>
> 1: x/i $rip  0x400510 <main+48>:        cvttsd2si %xmm0,%eax
> 1: x/i $rip  0x400514 <main+52>:        mov    %eax,%edx
> 1: x/i $rip  0x400516 <main+54>:        shr    $0x1f,%edx
> 1: x/i $rip  0x400519 <main+57>:        add    %edx,%eax
> 1: x/i $rip  0x40051b <main+59>:        and    $0x1,%eax
> 1: x/i $rip  0x40051e <main+62>:        sub    %edx,%eax
> 1: x/i $rip  0x400520 <main+64>:        jne    0x400500 <main+32>
> 1: x/i $rip  0x400522 <main+66>:        mov    %rcx,0xfffffffffffffff8(%rsp)
> 1: x/i $rip  0x400527 <main+71>:        movlpd 0xfffffffffffffff8(%rsp),%xmm1
> 1: x/i $rip  0x40052d <main+77>:        addsd  %xmm1,%xmm0
> 1: x/i $rip  0x400531 <main+81>:        addsd  %xmm1,%xmm0
> 1: x/i $rip  0x400535 <main+85>:        comisd %xmm2,%xmm0
> 1: x/i $rip  0x400539 <main+89>:        jb     0x400510 <main+48>
> [repeat]
> 
> There's nothing left of the algorithm here.
> 
> Indeed, with this patch (which sums the results from inside the inline
> function, thereby forcing the code to be compiled in), you'll see that
> the Duff's Device case is just as slow as the naive loop:
> 
> --- digamma2-orig.c	2006-10-21 16:26:46.000000000 -0500
> +++ digamma2.c	2006-10-21 16:25:04.000000000 -0500
> @@ -1,13 +1,23 @@
> +#include "stdio.h"
>  #include "math.h"
>  
> +double tot_x = 0.0;
> +double tot_p = 0.0;
> +
> +#define SumResults 1
> +
>  static inline double digamma(double x)
>  {
>         double p;
> +       tot_x += x;
>         x=x+6;
>         p=1/(x*x);
>         p=(((0.004166666666667*p-0.003968253986254)*p+
>           0.008333333333333)*p-0.083333333333333)*p;
>         p=p+log(x)-0.5/x-1/(x-1)-1/(x-2)-1/(x-3)-1/(x-4)-1/(x-5)-1/(x-6);
> +       #if SumResults == 1
> +       tot_p += p;
> +       #endif
>         return p;
>  
>  }
> @@ -66,6 +76,9 @@
>          #endif
>        }
>  
> +    printf("tot_x: %f\n", tot_x);
> +    printf("tot_p: %f\n", tot_p);
> +
>      return result;
>  
>  }
> 
> -bcd
From: Terry Sullivan
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <e39ab$453b0ad6$186091bf$3019@KNOLOGY.NET>
With these changes in both my code and the original, minus the argument 
summation, I now get:

Original Code:
time ./digamma
tot_p: 151098846.198064
Result = 16.118096

real    0m6.081s
user    0m5.892s
sys     0m0.036s


My Code with Duffs @ 2:
time ./digamma2
tot_p: 151098853.105869
Result = 16.118096

real    0m5.955s
user    0m5.581s
sys     0m0.041s


My code with out Duffs:
time ./digamma2
tot_p: 151098853.105869
Result = 16.118096

real    0m5.845s
user    0m5.631s
sys     0m0.037s

So I would agree, Duff's device doesn't actually help much in this case 
on my CPU.

Oh well.
Had fun anyway.

Ts
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <87bqo5iv4m.fsf@qrnik.zagroda>
Douglas Crosher <···@scieneer.com> writes:

> The Scieneer Common Lisp is 10% faster than optimize C code for your
> problem.

Adding -ffast-math to gcc invocation makes the C code 10% faster on my
machine.

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: ············@gmail.com
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <1161537949.870156.98310@m7g2000cwm.googlegroups.com>
Douglas Crosher wrote:
> The Scieneer Common Lisp is 10% faster than optimize C code for your
> problem.  The Scieneer Common Lisp takes advantage of SSE2 and SSE3
> instructions but does not yet taking advantage of SIMD operations.
> Extended precision floating point is also supported using the x87 FPU
> instructions.  The results below were run on a 2000MHz Athlon 64.

Say, you wouldn't happen to have a Scieneer Common Lisp manual
on your hands, would you?  I'm working on a project that I'm trying
to make as portable as possible, and I'd like to know if and how SCL
supports static arrays (i.e. memory pinned and not moved around
when the system GC's) and/or turning off GC temporarily.

Best,
mfh
From: Douglas Crosher
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <453BF428.5060907@scieneer.com>
············@gmail.com wrote:
> Douglas Crosher wrote:
>> The Scieneer Common Lisp is 10% faster than optimize C code for your
>> problem.  The Scieneer Common Lisp takes advantage of SSE2 and SSE3
>> instructions but does not yet taking advantage of SIMD operations.
>> Extended precision floating point is also supported using the x87 FPU
>> instructions.  The results below were run on a 2000MHz Athlon 64.
> 
> Say, you wouldn't happen to have a Scieneer Common Lisp manual
> on your hands, would you?  I'm working on a project that I'm trying
> to make as portable as possible, and I'd like to know if and how SCL
> supports static arrays (i.e. memory pinned and not moved around
> when the system GC's) and/or turning off GC temporarily.

The recommended approach is to use the 'ext:with-pinned-object macro which
pins the address of a lisp object within the context of the macro, and
by default also inhibits write protection of pages occupied by the object.
This allows other threads to execute garbage collection while the current
thread is executing the foreign function.  For example:

(defun example (array)
   (ext:with-pinned-object (array)
     (let ((sap (sys:vector-sap array)))
        ... foreign call that may read or write directly to the lisp array ...
        )))

If the foreign function does not write to the lisp object then the following
faster form may be used:

   (ext:with-pinned-object (array :inhibit-write-protection nil) ...)

Please follow up in private if you have any more questions.

Regards
Douglas Crosher
From: Joel Reymont
Subject: Re: The epic floating-point battle
Date: 
Message-ID: <1161595859.894726.21020@f16g2000cwb.googlegroups.com>
For the record, ACL 8.0 turns out to be 40% faster on my floating point
benchmark.

See http://article.gmane.org/gmane.lisp.lispworks.general/6093

; cpu time (non-gc) 1,940 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  1,940 msec user, 0 msec system
; real time  1,947 msec
; space allocation:
;  0 cons cells, 0 other bytes, 0 static bytes