"Rainer Joswig" <······@lavielle.com> writes:
> What is the current speed of CL implementations
> when it comes to FP code?
>
The example isn't particularly good about *using* FP, but it could be
made so.
What kind of changes am I allowed to make?
I've made some mods to this code so that everything is a single-float,
as much as possible, and added some appropriate declarations for
CMUCL. (CMUCL derives the rest for itself.)
With these mods, on a 300-MHz Ultra, running CMUCL "18b", I get
(time (ray-test 5))
Evaluation took:
63.17 seconds of real time
44.21 seconds of user run time
11.37 seconds of system run time
[Run times include 13.84 seconds GC run time]
0 page faults and
1263144016 bytes consed.
Pretty bad compared to what someone else posted for SGI ACL.
The GC for CMUCL is not great, so it seems to spend lots of time doing
GC.
Someone more knowledgeable about CMUCL could probably speed this up
some more.
Ray
Modified code:
;;; From Paul Graham's book ANSI Common Lisp, Example 9.8
(declaim (optimize (speed 3)))
(eval-when (compile load eval)
(defstruct (point (:conc-name nil))
(x 0.0 :type single-float)
(y 0.0 :type single-float)
(z 0.0 :type single-float))
(defstruct surface
(color 0.0 :type single-float))
(defstruct (sphere (:include surface))
(radius 0.0 :type single-float)
(center (required-argument) :type point))
)
(defun sq (x)
(declare (type single-float x))
(* x x))
(defun mag (x y z)
(declare (type single-float x y z))
(sqrt (+ (sq x) (sq y) (sq z))))
(defun unit-vector (x y z)
(declare (type single-float x y z)
(values (single-float 0.0 1.0)
(single-float 0.0 1.0)
(single-float 0.0 1.0)))
(let ((d (mag x y z)))
(values (/ x d) (/ y d) (/ z d))))
(defun distance (p1 p2)
(declare (type point p1 p2))
(mag (- (x p1) (x p2))
(- (y p1) (y p2))
(- (z p1) (z p2))))
(declaim (inline minroot))
(defun minroot (a b c)
(declare (type single-float a b c))
(if (zerop a)
(/ (- c) b)
(let ((disc (- (sq b) (* 4 a c))))
(unless (minusp disc)
(let ((discrt (sqrt disc)))
(min (/ (+ (- b) discrt) (* 2 a))
(/ (- (- b) discrt) (* 2 a))))))))
(defparameter *world* nil)
(defconstant eye (make-point :x 0.0 :y 0.0 :z 200.0))
(defun tracer (pathname &optional (res 1))
(declare (type (integer 1 100) res))
(with-open-file (p pathname :direction :output)
(format p "P2 ~A ~A 255" (* res 100) (* res 100))
(let ((inc (/ (float res 1.0))))
(do ((y -50.0 (+ y inc)))
((< (- 50.0 y) inc))
(declare (single-float y))
(do ((x -50.0 (+ x inc)))
((< (- 50.0 x) inc))
(declare (single-float x))
(print (color-at x y) p))))))
(defun defsphere (x y z r c)
(declare (type single-float x y z r))
(let ((s (make-sphere
:radius r
:center (make-point :x x :y y :z z)
:color c)))
(push s *world*)
s))
#+cmu
(declaim (ext:start-block color-at sendray first-hit lambert intersect))
#+nil
(defun color-at (x y)
(declare (type single-float x y))
(multiple-value-bind (xr yr zr)
(unit-vector (- x (x eye))
(- y (y eye))
(- 0 (z eye)))
(round (* (sendray eye xr yr zr) 255.0))))
(defun sendray (pt xr yr zr)
(declare (type point pt)
(type single-float xr yr zr))
(multiple-value-bind (s int)
(first-hit pt xr yr zr)
(if s
(* (lambert s int xr yr zr)
(surface-color s))
0.0)))
(defun first-hit (pt xr yr zr)
(declare (type point pt)
(type single-float xr yr zr))
(let (surface hit dist)
(dolist (s *world*)
(declare (type sphere s))
(let ((h (intersect s pt xr yr zr)))
(when h
(let ((d (distance h pt)))
(when (or (null dist) (< d dist))
(setf surface s hit h dist d))))))
(values surface hit)))
(defun lambert (s int xr yr zr)
(declare (type single-float xr yr zr)
(type point int))
(multiple-value-bind (xn yn zn)
(normal s int)
(max 0.0 (+ (* xr xn) (* yr yn) (* zr zn)))))
#+nil
(defun intersect (s pt xr yr zr)
(declare (type point pt)
(single-float xr yr zr))
(funcall (typecase s (sphere #'sphere-intersect))
s pt xr yr zr))
(defun intersect (s pt xr yr zr)
(declare (type sphere s)
(type point pt)
(single-float xr yr zr))
(sphere-intersect s pt xr yr zr))
(defun sphere-intersect (s pt xr yr zr)
(declare (type sphere s)
(type point pt)
(type single-float xr yr zr))
(let* ((c (sphere-center s))
(n (minroot (+ (sq xr) (sq yr) (sq zr))
(* 2 (+ (* (- (x pt) (x c)) xr)
(* (- (y pt) (y c)) yr)
(* (- (z pt) (z c)) zr)))
(+ (sq (- (x pt) (x c)))
(sq (- (y pt) (y c)))
(sq (- (z pt) (z c)))
(- (sq (sphere-radius s)))))))
(if n
(make-point :x (+ (x pt) (* n xr))
:y (+ (y pt) (* n yr))
:z (+ (z pt) (* n zr))))))
(defun color-at (x y)
(declare (type single-float x y))
(multiple-value-bind (xr yr zr)
(unit-vector (- x (x eye))
(- y (y eye))
(- 0 (z eye)))
(let ((color (* (sendray eye xr yr zr) 255.0)))
(declare (type (single-float 0.0 (255.0)) color))
(floor (+ color 0.5)))))
#+cmu
(declaim (ext:end-block))
#+nil
(defun normal (s pt)
(declare (type point pt)
(values single-float single-float single-float))
(funcall (typecase s (sphere #'sphere-normal))
s pt))
(defun normal (s pt)
(declare (type sphere s)
(type point pt)
(values single-float single-float single-float))
(sphere-normal s pt))
(defun sphere-normal (s pt)
(let ((c (sphere-center s)))
(unit-vector (- (x c) (x pt))
(- (y c) (y pt))
(- (z c) (z pt)))))
(defun ray-test (&optional (res 1))
(setf *world* nil)
(defsphere 0.0 -300.0 -1200.0 200.0 .8)
(defsphere -80.0 -150.0 -1200.0 200.0 .7)
(defsphere 70.0 -100.0 -1200.0 200.0 .9)
(do ((x -2.0 (1+ x)))
((> x 2.0))
(declare (type single-float x))
(do ((z 2.0 (1+ z)))
((> z 7.0))
(declare (type single-float z))
(defsphere (* x 200) 300.0 (* z -400) 40.0 .75)))
(tracer (make-pathname :name "spheres.pgm") res))
From: Bulent Murtezaoglu
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <87btvc8m8q.fsf@isttest.bogus>
Ok here are my results using both the Graham's original, Raymond Toy's
code and my modifications to RT's code for both ACL and CMUCL.
This took a lot longer than I expected (I figure 4 hours at least,
semi-distracted). Even barely usable benchmarking takes a lot of effort.
Short summary: the fastest version renders the image in under 10s
(no file access). Tables and code are at the end.
Notes:
RT's code did not produce 500x500 pictures as advertised in the header
(you guys have been looking at the output, right?). My guess is that
rational arithmetic was being used originally, so there were no
errors. In the single-float version, accumulated errors were causing
the do loops to terminate one iteration early. Maybe the point could
be made that optimizing with declares is not as straightforward as
advertised. It took me more than 15 minutes of head-scratching to
understand and fix the problem.
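The termination bug is easy to reproduce in isolation. A minimal sketch
(mine, not from the thread), assuming res = 5 so inc = 0.2:
;; 0.2 has no exact binary representation, so repeatedly adding it to Y
;; drifts away from the exact grid, and a termination test like
;; (< (- 50.0 y) inc) can fire one iteration early.
(let ((inc (/ 5.0))   ; 0.2, inexact in binary
      (count 0))
  (do ((y -50.0 (+ y inc)))
      ((< (- 50.0 y) inc) count)
    (declare (single-float y))
    (incf count)))
;; the exact grid would give 500 iterations (a 500-pixel row);
;; accumulated error can make this 499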
I also spent about an hour and a half on top of what RT spent optimizing
the code using CMUCL. Usually silencing CMUCL's efficiency notes gets
the code into reasonably good shape for other Lisps as well. The
exceptions in this case are
(1) ACL's inlining policy. Franz's compiler (at least as far as Linux
and v. 4.3 for SunOS goes) does not ever inline user functions. So I
turned the square function into a macro for ACL and timed it again.
(2) CMUCL block compile feature (non-ANSI)
(3) CMUCL values declaration (non-ANSI, but there's an ANSI equivalent; I
didn't get around to converting the code, so the results are BIASED in
favor of CMUCL).
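For reference, the ANSI equivalent alluded to in (3) is a THE form with a
VALUES type around the returned values. A sketch (my rewrite, not from
the thread) of what unit-vector would look like portably:
;; THE with a VALUES type is the portable spelling of CMUCL's
;; (declare (values ...)) extension.
(defun unit-vector (x y z)
  (declare (type single-float x y z))
  (let ((d (mag x y z)))
    (the (values single-float single-float single-float)
         (values (/ x d) (/ y d) (/ z d)))))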
I think it is a good idea to split the timings for the generation of the
image and the dumping of the text file. Output tends to be slow and
sometimes conses with print; it is also dependent on filesystem/NFS
speed. Since we are after raw number-crunching performance here, I
modified the code to stuff an array with the pixel values.
Hardware: Homegrown system, PentiumPro 180 overclocked to 231MHz with
256K L2 cache and 128M EDO RAM sitting on a 77MHz bus.
OS: Linux 2.0.32. Fileserver on NFS over 10base2 (slow server, lightly
loaded network).
Lisp:
ACL is Allegro Common Lisp Linux version fully patched up to November 97
CMUCL is the version dated 30/01/98 off ftp.cons.org (Peter VanEynde's
binary kit)
Timing: fresh lisp on a very lightly loaded system from the command
line. I compiled everything once at the start and ran each test three
times with a full GC in between (#+acl (gc t), #+cmu (ext:gc)), and I'm
reporting the third time (too lazy to average, and it doesn't matter
given the large differences we're looking for). There was negligible
variation between the runs (2% max). I don't think there were any page
faults during the timed runs (ACL doesn't report them [by default?], and
CMUCL didn't report page faults after the first run).
All times are in ms. There are differences in what TIME reports between
the systems; I took a reasonable intersection. One notable difference is
that ACL reports both cons-cell counts and "other bytes" while CMUCL
reports everything as bytes consed. I report CMUCL's byte count under
ACL's "other" column.
Original code as posted:
Real Time User System GC time cons other
ACL 188,472 185,360 500 2,190 507,541 2,238,426,232
CMUCL 263,970 244,650 15,870 50,380 1,422,303,632
Original code + global (declaim (optimize (speed 3) (safety 0) (debug 0)))
ACL 185,948 182,750 600 2,370 507,649 2,238,455,232
CMUCL 285,890 266,300 16,050 50,430 1,422,346,560
Please note the following are somewhat unfair to ACL in varying degrees
RT's code as posted (w/ the bug but I did add one #+ to get allegro to load it)
Real Time User System GC time cons other
ACL 158,726 154,930 650 4,550 512,941 -19,566,024 (!)
CMUCL 117,530 99,230 15,340 45,910 1,288,595,072
RT's code + my additions and fix (generate and write to file as usual)
Real Time User System GC time cons other
CMUCL 25,280 21,290 2,000 5,690 161,279,400
ACL 156,760 153,080 740 4,460 514,974 4,259,431,824
with (setf *bytes-consed-between-gcs* 8000000) for CMUCL
CMUCL 22,750 18,000 1,770 2,200 161,590,736
After turning sq into a macro for ACL's benefit:
Real Time User System GC time cons other
ACL 59,061 56,450 370 850 503,473 1,321,391,360
(Here I should turn the CMUCL values extension into THE declarations, but
it's late already...)
Timing for no output file (time (ray-test-no 5)) see attached file
Real Time User System GC time cons other
CMUCL 9,650 9,060 240 910 19,902,152
ACL 40,334 40,310 20 920 39 1,320,743,280
----Code:
;; from Graham's ANSi CL book, modified by Raymond Toy and Bulent
;; Murtezaoglu.
;; More speed can be gained by (1) fixing cmucl so
;; it works better, (2) going overboard with inline and such.
;; As it is the code generates about 15 efficiency notes some of which
;; I (BM) couldn't fix, (the others I didn't bother with).
(declaim (optimize (speed 3) (safety 0) (debug 0))) ;; does compilation speed do anything?
#+cmu (declaim (start-block ray-test tracer tracer-no ray-test-no))
;;would crash with just ray-test
(eval-when (compile load eval)
(defstruct (point (:conc-name nil))
(x 0.0 :type single-float)
(y 0.0 :type single-float)
(z 0.0 :type single-float))
(defstruct surface
(color 0.0 :type single-float))
(defstruct (sphere (:include surface))
(radius 0.0 :type single-float)
(center (required-argument) :type point))
)
(declaim (inline sq))
(declaim (inline minroot))
(declaim (inline sqrt))
(declaim (inline mag))
#-allegro
(defun sq (x)
(declare (type single-float x))
(* x x))
#+allegro
(defmacro sq (x)
(let ((foo (gensym)))
`(let ((,foo ,x)) (declare (type single-float ,foo)) (* ,foo ,foo))))
(defun mag (x y z)
(declare (type single-float x y z))
(sqrt (+ (sq x) (sq y) (sq z))))
(defun unit-vector (x y z)
(declare (type single-float x y z)
(values (single-float 0.0 1.0)
(single-float 0.0 1.0)
(single-float 0.0 1.0)))
(let ((d (mag x y z)))
(values (/ x d) (/ y d) (/ z d))))
(defun distance (p1 p2)
(declare (type point p1 p2)
(values single-float))
(mag (- (x p1) (x p2))
(- (y p1) (y p2))
(- (z p1) (z p2))))
(defun minroot (a b c)
(declare (type single-float a b c))
(if (zerop a) ;;zerop doesn't get optimized for some reason, neither does =
(/ (- c) b)
(let ((disc (- (sq b) (* 4 a c))))
(unless (minusp disc) ; cmucl should not need the cruft below but it does
(let ((discrt (the (single-float 0.0 *) (sqrt (the (single-float 0.0 *) disc)))))
(min (/ (+ (- b) discrt) (* 2 a))
(/ (- (- b) discrt) (* 2 a))))))))
(defparameter *world* nil)
(defvar *image* nil)
(defconstant eye (make-point :x 0.0 :y 0.0 :z 200.0))
(defun tracer (pathname &optional (res 1))
(declare (type (integer 1 100) res))
(with-open-file (p pathname :direction :output)
(format p "P2 ~A ~A 255" (* res 100) (* res 100))
(let* ((inc (/ (float res)))
(xymax (- (* res 100 inc) 50)))
(declare (single-float xymax))
(do ((y -50.0 (+ y inc)))
((< xymax y))
(declare (single-float y))
(do ((x -50.0 (+ x inc)))
((< xymax x))
(declare (single-float x))
(print (color-at x y) p))))))
(defun defsphere (x y z r c)
(declare (type single-float x y z r))
(let ((s (make-sphere
:radius r
:center (make-point :x x :y y :z z)
:color c)))
(push s *world*)
s))
;#+cmu
;(declaim (ext:start-block color-at sendray first-hit lambert intersect))
#+nil
(defun color-at (x y)
(declare (type single-float x y))
(multiple-value-bind (xr yr zr)
(unit-vector (- x (x eye))
(- y (y eye))
(- 0 (z eye)))
(round (* (sendray eye xr yr zr) 255.0))))
(defun sendray (pt xr yr zr)
(declare (type point pt)
(type single-float xr yr zr))
(multiple-value-bind (s int)
(first-hit pt xr yr zr)
(if s
(* (lambert s int xr yr zr)
(surface-color s))
0.0)))
(defun first-hit (pt xr yr zr)
(declare (type point pt)
(type single-float xr yr zr))
(let (surface hit (dist most-positive-single-float))
(declare (type single-float dist))
(dolist (s *world*)
(declare (type sphere s))
(let ((h (intersect s pt xr yr zr)))
(when h
(let ((d (distance h pt)))
(when (< d dist) ;; was (or (null dist) (< d dist))
(setf surface s hit h dist d))))))
(values surface hit)))
(defun lambert (s int xr yr zr)
(declare (type single-float xr yr zr)
(type point int))
(multiple-value-bind (xn yn zn)
(normal s int)
(max 0.0 (+ (* xr xn) (* yr yn) (* zr zn)))))
#+nil
(defun intersect (s pt xr yr zr) ;; note the dispatch on type (slow)
(declare (type point pt)
(single-float xr yr zr))
(funcall (typecase s (sphere #'sphere-intersect))
s pt xr yr zr))
(defun intersect (s pt xr yr zr)
(declare (type sphere s)
(type point pt)
(single-float xr yr zr))
(sphere-intersect s pt xr yr zr))
(defun sphere-intersect (s pt xr yr zr)
(declare (type sphere s)
(type point pt)
(type single-float xr yr zr))
(let* ((c (sphere-center s))
(n (minroot (+ (sq xr) (sq yr) (sq zr))
(* 2 (+ (* (- (x pt) (x c)) xr)
(* (- (y pt) (y c)) yr)
(* (- (z pt) (z c)) zr)))
(+ (sq (- (x pt) (x c)))
(sq (- (y pt) (y c)))
(sq (- (z pt) (z c)))
(- (sq (sphere-radius s)))))))
(if n
(make-point :x (+ (x pt) (* n xr))
:y (+ (y pt) (* n yr))
:z (+ (z pt) (* n zr))))))
(defun color-at (x y)
(declare (type single-float x y))
(multiple-value-bind (xr yr zr)
(unit-vector (- x (x eye))
(- y (y eye))
(- 0 (z eye)))
(let ((color (* (sendray eye xr yr zr) 255.0)))
(declare (type (single-float 0.0 (255.0)) color))
(floor (+ color 0.5)))))
#+nil
(defun normal (s pt);; note the dispatch on type (slow)
(declare (type point pt)
(values single-float single-float single-float))
(funcall (typecase s (sphere #'sphere-normal))
s pt))
(defun normal (s pt)
(declare (type sphere s)
(type point pt)
(values single-float single-float single-float))
(sphere-normal s pt))
(defun sphere-normal (s pt)
(let ((c (sphere-center s)))
(unit-vector (- (x c) (x pt))
(- (y c) (y pt))
(- (z c) (z pt)))))
(defun ray-test (&optional (res 1))
(setf *world* nil)
(defsphere 0.0 -300.0 -1200.0 200.0 .8)
(defsphere -80.0 -150.0 -1200.0 200.0 .7)
(defsphere 70.0 -100.0 -1200.0 200.0 .9)
(do ((x -2.0 (1+ x)))
((> x 2.0))
(declare (type single-float x))
(do ((z 2.0 (1+ z)))
((> z 7.0))
(declare (type single-float z))
(defsphere (* x 200) 300.0 (* z -400) 40.0 .75)))
(tracer (make-pathname :name "spheres.pgm") res))
;;; these are the no-output versions
(defvar *imgarray*)
(defun tracer-no (&optional (res 1))
(declare (type (integer 1 100) res))
(setf *imgarray*
(make-array (* 100 100 res res) :element-type '(unsigned-byte 8)
:fill-pointer 0))
(let* ((inc (/ (float res)))
(xymax (- (* res 100 inc) 50)))
(declare (single-float xymax))
(do ((y -50.0 (+ y inc)))
((< xymax y))
(declare (single-float y))
(do ((x -50.0 (+ x inc)))
((< xymax x))
(declare (single-float x))
(vector-push (color-at x y) *imgarray*)))))
(defun ray-test-no (&optional (res 1))
(setf *world* nil)
(defsphere 0.0 -300.0 -1200.0 200.0 .8)
(defsphere -80.0 -150.0 -1200.0 200.0 .7)
(defsphere 70.0 -100.0 -1200.0 200.0 .9)
(do ((x -2.0 (1+ x)))
((> x 2.0))
(declare (type single-float x))
(do ((z 2.0 (1+ z)))
((> z 7.0))
(declare (type single-float z))
(defsphere (* x 200) 300.0 (* z -400) 40.0 .75)))
(tracer-no res))
Bulent Murtezaoglu <·····@servtech.com> writes:
> (declaim (inline sq))
> (declaim (inline minroot))
> (declaim (inline sqrt))
> (declaim (inline mag))
>
> #-allegro
> (defun sq (x)
> (declare (type single-float x))
> (* x x))
I found, for CMUCL at least, that once you've declaimed the
function inline, there's no need to declare the types in the
function. The compiler will produce efficiency notes about
compiling sq, minroot, mag, but when you actually _call_ minroot
elsewhere with the types of the arguments declared properly, it
will very happily inline just the correct version of the code and
perform an inline single-float multiplication.
- David
Bulent Murtezaoglu <·····@servtech.com> writes:
>
> I also spend about an hour and half on top of what RT spent optimizing
> the code using CMUCL. Usually silencing CMUCL's efficiency notes gets
About an hour or two, part time.
>
> (defun minroot (a b c)
> (declare (type single-float a b c))
> (if (zerop a) ;;zerop doesn't get optimized for some reason, neither does =
This seems to be optimized for me.
> (/ (- c) b)
> (let ((disc (- (sq b) (* 4 a c))))
> (unless (minusp disc) ; cmucl should not need the cruft below but it does
> (let ((discrt (the (single-float 0.0 *) (sqrt (the (single-float 0.0 *) disc)))))
This might be because, for CMUCL, (sqrt -0.0) is #c(0.0
0.0) and -0.0 is an element of (single-float 0.0 *). You may want to
say (the (or (eql 0.0) (single-float (0.0))) disc) to get better
code. Alternatively, if you have the :negative-zero-is-not-zero
feature, then you probably don't need any extra declarations. I
didn't need them.
> (defun color-at (x y)
> (declare (type single-float x y))
> (multiple-value-bind (xr yr zr)
> (unit-vector (- x (x eye))
> (- y (y eye))
> (- 0 (z eye)))
> (let ((color (* (sendray eye xr yr zr) 255.0)))
> (declare (type (single-float 0.0 (255.0)) color))
> (floor (+ color 0.5)))))
Since color > 0, it helps to replace floor with truncate in this case
because truncate can use the FPU directly. (Hmm, need to add this
optimization to CMUCL.)
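The suggested change is a one-liner in the tail of color-at; a sketch:
;; COLOR is declared non-negative, so TRUNCATE and FLOOR agree on it,
;; and TRUNCATE of a non-negative single-float can use the FPU
;; directly.  (+ color 0.5) keeps the rounding behavior of the original.
(let ((color (* (sendray eye xr yr zr) 255.0)))
  (declare (type (single-float 0.0 (255.0)) color))
  (truncate (+ color 0.5)))   ; was (floor (+ color 0.5))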
With these changes, I now get only 3 compiler efficiency notes from
minroot. The run time is now
Evaluation took:
18.07 seconds of real time
15.23 seconds of user run time
1.41 seconds of system run time
[Run times include 1.96 seconds GC run time]
0 page faults and
151782136 bytes consed.
About 3-4 times faster than my previous post.
Ray
From: Bulent Murtezaoglu
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <87n2ev265j.fsf@isttest.bogus>
>>>>> "RT" == Raymond Toy <···@rtp.ericsson.se> writes:
RT> About an hour or two, part time.
So, it's probably safe to say that it took about 2 solid hours total
to get to fast code (adjusting for my inexperience with cmucl).
I say this is a bit too long even when we consider we weren't
the original authors. It would have been much easier of course if
we [I?] used tools like graphical profilers etc. On the plus side,
the resulting code is not too cluttered IMHO (and some of the clutter
is probably unnecessary). I'd be very interested to see what kind of
performance can be gotten from LW and Lucid/Liquid at what kind of price.
It would also be nice to get some figures from C code. And maybe compare
.s files with the output from disassemble.
>> (defun minroot (a b c) (declare (type single-float a b c)) (if
>> (zerop a) ;;zerop doesn't get optimized for some reason,
>> neither does =
RT> This seems to be optimized for me.
On mine I get an efficiency note warning me about some compiler-generated
temporaries not being guaranteed to be the same type, so zerop doesn't
get open-coded. I don't remember the exact note. Maybe you are running a
different version?
>> (/ (- c) b) (let ((disc (- (sq b) (* 4 a c)))) (unless (minusp
>> disc) ; cmucl should not need the cruft below but it does (let
>> ((discrt (the (single-float 0.0 *) (sqrt (the (single-float 0.0
>> *) disc)))))
RT> This might be because of the fact that for CMUCL (sqrt -0.0)
RT> is #c(0.0 0.0) and -0.0 is an element of (single-float 0.0 *).
RT> You may want to say (the (or (eql 0.0) (single-float (0.0)))
RT> disc) to get better code. Alternatively, if you have the
RT> :negative-zero-is-not-zero feature, then you probably don't
RT> need any extra declarations. I didn't need them.
Hmmm. My comment concerns:
(unless (minusp disc) ;up till here we know disc is a single-float
(let ((disc (sqrt disc))) ; here we know disc is non-negative
...
I was hoping the compiler would notice that additional constraint on
disc and not need any declarations to open code sqrt. Does your
explanation above apply to minusp? What does [should?] minusp do to -0.0?
[...]
RT> Evaluation took: 18.07 seconds of real time 15.23 seconds of
RT> user run time 1.41 seconds of system run time [Run times
RT> include 1.96 seconds GC run time] 0 page faults and 151782136
RT> bytes consed.
Did you also try the no-output version? I'll try to run your new code
tonight and see what happens.
BM
Bulent Murtezaoglu <·····@servtech.com> writes:
> >>>>> "RT" == Raymond Toy <···@rtp.ericsson.se> writes:
> RT> About an hour or two, part time.
>
> So, it's probably safe to say that it took about 2 solid hours total
> to get to fast code (adjusting for my inexperience with cmucl).
Perhaps. I think if I had concentrated on it, the conversion should
have only taken 30 minutes or so. The compiler notes were very good.
> I say this is a bit too long even when we consider we weren't
> the original authors. It would have been much easier of course if
> we [I?] used tools like graphical profilers etc. On the plus side,
Actually, I did use CMUCL's profile to figure out some things. I
think sphere-intersect takes up most of the time and generates by far
the most garbage.
> open coded. I don't remember the exact note. Maybe you are running a
> different version?
I was running the latest development version.
>
> (unless (minusp disc) ;up till here we know disc is a single-float
> (let ((disc (sqrt disc))) ; here we know disc is non-negative
> ...
>
> I was hoping the compiler would notice that additional constraint on
> disc and not need any declarations to open code sqrt. Does your
> explanation above apply to minusp? What does [should?] minusp do to -0.0?
It does notice the constraint (well, it's supposed to). However, the
problem is that CMUCL currently has (sqrt -0.0) returning #c(0.0 0.0),
so the generic sqrt needs to be called. Also (minusp -0.0) = (minusp
0.0) = NIL, as required by IEEE arithmetic. To get CMUCL to produce
the desired code, disc has to be declared as (or (eql 0.0)
(single-float (0.0) *)) so that the compiler knows -0.0 isn't a valid
value. My version has :negative-zero-is-not-zero feature, so I can
say (single-float 0.0 *) to get the desired result. (This feature
seems to contradict ANSI CL, but ANSI CL doesn't really quite cover
this issue either.)
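Putting RT's suggestion into minroot gives something like this sketch
(my rewrite; the (or (eql 0.0) (single-float (0.0))) type is the
portable way to tell the compiler disc cannot be -0.0):
(declaim (inline minroot))
(defun minroot (a b c)
  (declare (type single-float a b c))
  (if (zerop a)
      (/ (- c) b)
      (let ((disc (- (sq b) (* 4.0 a c))))
        (unless (minusp disc)
          ;; DISC is either exactly 0.0 or strictly positive; excluding
          ;; -0.0 lets the compiler prove SQRT returns a real result
          ;; and open-code it.
          (let ((discrt (sqrt (the (or (eql 0.0) (single-float (0.0)))
                                   disc))))
            (min (/ (+ (- b) discrt) (* 2.0 a))
                 (/ (- (- b) discrt) (* 2.0 a))))))))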
> Did you also try the no-output version? I'll try to run your new code
> tonight and see what happens.
>
Here is the no-output result:
Evaluation took:
6.38 seconds of real time
6.1 seconds of user run time
0.1 seconds of system run time
[Run times include 0.11 seconds GC run time]
0 page faults and
9944752 bytes consed.
3 times better than before. The I/O generates a lot of garbage and is
pretty slow!
Ray
From: Bulent Murtezaoglu
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <87lnuf1yd6.fsf@isttest.bogus>
>>>>> "RT" == Raymond Toy <···@rtp.ericsson.se> writes:
RT> Perhaps. I think if I had concentrated on it, the conversion
RT> should have only taken 30 minutes or so. The compiler notes
RT> were very good.
I'll haggle with you on that! _I_ couldn't have gotten as far as I got
even starting from your already-optimized code in 30 min. I had to
fire up allegro, get it to compile the file and run it to convince
myself that the bug I was seeing wasn't a cmucl artifact. For the slower
versions of the code you have to wait an appreciable amount to test after
a few steps. Given what we are trying to get a sense of, I think it is not
unreasonable to assume that programmers will introduce bugs during
optimization, and incremental improvement/testing will take place (out of
habit if nothing else). So I'd say pretty close to an hour for me if I
had everything (manuals and such) handy. What do other people think?
RT> I was running the latest development version.
That probably explains a number of things. I haven't figured out how to
build cmucl from sources yet...
[...]
RT> Here is the no-output result:
RT> Evaluation took: 6.38 seconds of real time 6.1 seconds of user
RT> run time 0.1 seconds of system run time [Run times include
RT> 0.11 seconds GC run time] 0 page faults and 9944752 bytes
RT> consed.
RT> 3 times better than before. The I/O generates a lot of
RT> garbage and is pretty slow!
Indeed. Maybe not too surprising given that we're using print. I
wonder what's still producing the 9 megs of garbage? Unless I screwed
up its declaration, the image array should take about 250K; say another
few tens of K for miscellaneous unavoidable consing, and we still have a
huge amount. I'm not familiar with CMUCL's profiler; maybe you could
post something about what's generating all this garbage?
BM
Bulent Murtezaoglu <·····@servtech.com> writes:
> >>>>> "RT" == Raymond Toy <···@rtp.ericsson.se> writes:
>
> RT> Perhaps. I think if I had concentrated on it, the conversion
> RT> should have only taken 30 minutes or so. The compiler notes
> RT> were very good.
>
> I'll haggle with you on that! _I_ couldn't have gotten as far as I got
> even starting from your already-optimized code in 30 min. I had to
> fire up allegro, get it to compile the file and run it to convince
> myself that the bug I was seeing wasn't a cmucl artifact. For the slower
Perhaps I'm being optimistic? :-) I don't remember. But I didn't try
running anything else except CMUCL, so that saves some time.
>
> RT> I was running the latest development version.
>
> That probably explains a number of things. I haven't figured out how to
> build cmucl from sources yet...
It's not easy to do, unless you're just compiling the current version
with the current version, but you would need to do that anyway. :-)
> I'm not familiar with cmu's profiler, maybe you could post sth. about
> what's generating all this garbage?
>
First do (profile:profile-all) to turn on profiling everywhere. Run
the code, then (profile:report-time) to see the results, and
(profile:reset-time) to reset the timings. Here are some results that I
got from your version. I had to turn off the block compile because
otherwise the functions wouldn't exist.
* (time (ray-test 2))
[snip]
Evaluation took:
27.96 seconds of real time
20.89 seconds of user run time
5.48 seconds of system run time
[Run times include 0.48 seconds GC run time]
0 page faults and
30963304 bytes consed.
NIL
* (profile:report-time)
Seconds | Consed | Calls | Sec/Call | Name:
------------------------------------------------------
0.145 | 1,470,600 | 61,275 | 0.00000 | UNIT-VECTOR
0.043 | 500,976 | 20,874 | 0.00000 | SPHERE-NORMAL
0.010 | 264 | 33 | 0.00029 | DEFSPHERE
0.001 | 258,800 | 32,350 | 0.00000 | DISTANCE
0.000 | 1,295,320 | 32,383 | 0.00000 | MAKE-POINT
0.000 | 333,984 | 40,401 | 0.00000 | SENDRAY
0.000 | 0 | 1,333,233 | 0.00000 | INTERSECT
0.000 | 0 | 40,401 | 0.00000 | FIRST-HIT
0.000 | 1,035,200 | 1,333,233 | 0.00000 | SPHERE-INTERSECT
0.000 | 1,056 | 33 | 0.00000 | MAKE-SPHERE
0.000 | 23,615,544 | 1 | 0.00000 | TRACER
0.000 | 0 | 20,874 | 0.00000 | NORMAL
0.000 | 834,960 | 20,874 | 0.00000 | LAMBERT
0.000 | 1,616,040 | 40,401 | 0.00000 | COLOR-AT
0.000 | 512 | 1 | 0.00000 | RAY-TEST
------------------------------------------------------
0.199 | 30,963,256 | 2,976,367 | | Total
Estimated total profiling overhead: 23.81 seconds
So, actually unit-vector takes the most time now and tracer generates
the most garbage.
Ray
From: Bulent Murtezaoglu
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <87iupj12zo.fsf@isttest.bogus>
>>>>> "RT" == Raymond Toy <···@rtp.ericsson.se> writes:
[...]
RT> Perhaps I'm being optimistic? :-) I don't remember. But I
RT> didn't try running anything else except CMUCL, so that saves
RT> some time.
Maybe you are, but more importantly, one of the points of this exercise
is to get some sense of whether authors like Norvig and Graham are being
too optimistic about how much time it takes to go from elegant working
code to something within 2-3x of C.
I think for simple things that could also be easily implemented in C right
from the start, it isn't obvious that Common Lisp is a good environment.
Incremental development etc. is all very nice, but when the smallest problem
you can try takes as long to run as several edit-make-crash cycles in C,
the benefits of Lisp are less clear. The raytracing example does not expose
this problem adequately because you can go to tiny pictures and do your
exploratory programming there if you need to. Nonetheless I think the point
is valid, as there are problems where Lisp could really shine but doesn't
because of inadequate performance. (Machine learning comes to mind:
for some algorithms there's a knee in the problem-size graph you have to hit
to observe interesting stuff. If the smallest such interesting problem
takes hours with unoptimized code, some of the benefits of Lisp are lost.
This isn't horrible, and one can't have everything, but maybe we ought
to recognize that costs for all this dynamic luxury might be prohibitive
more often than one would first think.)
We should also note that we got the relatively decent speeds using
non-ANSI extensions on an experimental implementation. Lisp vendors
probably don't get many requests for good float performance.
>> That probably explains a number of things. I haven't figured
>> out how to build cmucl from sources yet...
RT> It's not easy to do, unless you're just compiling the current
RT> version with the current version, but you would need to do
RT> that anyway. :-)
Yeah, I found out the other day that it isn't easy. It appears that having
functions/variables/macros available by name at run time in the working
Lisp complicates things when it comes to compiling the compiler.
Am I understanding this correctly? I naively thought that some clever
package manipulation and uninterning would fix things in an elegant manner,
but didn't get very far even in my head. Kept thinking about two levels of
packages with relative paths, but there's no such thing in Common Lisp.
Anyhow, how do you guys do it? Is this documented anywhere? What am I
missing?
BM
From: Erik Naggum
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <3098770475424062@naggum.no>
* Bulent Murtezaoglu
| I think for simple things that could also be easily implemented in C
| right from the start, it isn't obvious that Common Lisp is a good
| environment. Incremental development etc. is all very nice, but when the
| smallest problem you can try takes as long to run as several
| edit-make-crash cycles in C, the benefits of Lisp are less clear.
one of the curious properties of programming in Common Lisp is that you
adopt different styles according as the needs you foresee for your code
differ. usually, you know beforehand that some code is speed-critical
and you care about consing, declarations, and such early, and take care
not to violate the principles that you know will result in fast code.
adopting a number of different styles and becoming willing to write code
that is not "beautiful Lisp" but only _exports_ a "beautiful Lisp"
interface takes a lot of experience. I remember I spent a lot of time
getting to grips with the need to write really _ugly_ code, but then it
dawned on me that while I liked C because I could bury ugliness in code
and I like Common Lisp because it is elegant and beautiful, these are
_not_ mutually exclusive qualities. I can bury ugliness in Common Lisp,
too, but if I take care to write an elegant interface to it, it's gone
forever, quite unlike C, where the interface still sucks.
there's one thing I miss, though, but I'll make that a separate thread.
#:Erik
--
God grant me serenity to accept the code I cannot change,
courage to change the code I can, and wisdom to know the difference.
Bulent Murtezaoglu <·····@servtech.com> writes:
>
> We should also note that we got the relatively decent speeds using
> non-ANSI extensions on an experimental implementation. Lisp vendors
No, I disagree. The only non-ANSI extension was CMUCL's start-block,
which essentially wraps everything up in a labels form. You could
have done that yourself and made a portable version. Even without it,
I don't think the speed was that much slower.
> probably don't get many requests for good float performance.
Perhaps, but when I sent a small comparison of FP performance between
CMUCL and Franz, an engineer from Franz volunteered a fair amount of
time helping me get good FP performance. He mentioned that he was
actively looking at improving the compiler to handle FP better. And
this was the free Linux version! If I ever need a commercial one, I
will surely look at Franz.
>
> >> That probably explains a number of things. I haven't figured
> >> out how to build cmucl from sources yet...
>
> RT> It's not easy to do, unless you're just compiling the current
> RT> version with the current version, but you would need to do
> RT> that anyway. :-)
>
> Yeah, I found out the other day that it isn't easy. It appears that having
> functions/variables/macros available by name at run time in the working
> Lisp complicates things when it comes to compiling the compiler.
> Am I understanding this correctly? I naively thought that some clever
> package manipulation and uninterning would fix things in an elegant manner,
> but didn't get very far even in my head. Kept thinking about two levels of
> packages with relative paths, but there's no such thing in Common Lisp.
> Anyhow, how do you guys do it? Is this documented anywhere? What am I
> missing?
Compiling a new version from an old is sometimes quite hard. What you
say can be done and has been, I hear, but for the really hard builds,
I think a cross compiler (either from a different architecture, or the
same architecture) is used. I've never used the cross compiler.
If you want more info about building, try www.cons.org. For
complicated builds, join the mailing list. Or, do like me, and just
wait for someone to build it for you. :-)
Ray
Bulent Murtezaoglu wrote:
>
> >>>>> "RT" == Raymond Toy <···@rtp.ericsson.se> writes:
> [...]
> RT> Perhaps I'm being optimistic? :-) I don't remember. But I
> RT> didn't try running anything else except CMUCL, so that saves
> RT> some time.
>
> Maybe you are, but more importantly one of the points of this exercise
> is to get some sense of whether authors like Norvig and Graham are being
> too optimistic about how much time it would take to go from elegant working
> code to something within 2-3x of C.
I forget whether it was Steele or Gabriel who observed that one of the things
that makes programming in Common Lisp uniquely different from other languages
is that to get good performance, you have to be intimately familiar with your
implementation. This sort of issue doesn't exist in C, for example. One
implementation of C is generally as fast (within a factor of 2) as another.
However, CL implementations can vary quite a bit, in different areas. One
might have (relatively) good FP performance but horrible Bignum performance
etc.
In my experience with CL, much of it involving graphics and FP, the two
areas that CL implementations fall far behind C in terms of performance are
FP and I/O. So if there is something in particular about CL that you want
to benchmark, e.g. array performance, make sure that you aren't doing any
I/O and are not using FP numbers. I find it interesting that the original
message that started this thread was titled "Floating Point speed in CL",
but the code wasn't doing very much FP; it was mostly ratios!
>
> I think for simple things that could also be easily implemented in C right
> from the start, it isn't obvious that Common Lisp is a good environment.
> Incremental development etc. is all very nice, but when the smallest problem
> you can try takes as long to run as several edit-make-crash cycles in C,
> the benefits of Lisp are less clear. The raytracing example does not expose
> this problem adequately because you can go to tiny pictures and do your
> exploratory programming there if you need to. Nonetheless I think the point
> is valid, as there are problems where Lisp could really shine but doesn't
> because of inadequate performance. (machine learning comes to my mind where
> for some algorithms there's a knee in the problem size graph you have to hit
> to observe interesting stuff. If the smallest such interesting problem
> takes hours with unoptimized code, some of the benefits of Lisp are lost.
> This isn't horrible, and one can't have everything, but maybe we ought
> to recognize that costs for all this dynamic luxury might be prohibitive
> more often than one would first think.)
I think that it is *very* important that everyone understand the difference
between a language i.e. Common Lisp, and an implementation, i.e. CMUCL. There
is very little in Common Lisp the language that prohibits somebody from
developing an implementation that is competitive with C in terms of performance.
It may not be easy, and it may not exist yet, and it will require the proverbial
SSC (Sufficiently Smart Compiler), but it can be done! Common Lisp FP is not
inherently slower than C; FP in every CL implementation I have used has been
slower than C, but it doesn't have to be so.
>
> We should also note that we got the relatively decent speeds using
> non-ANSI extensions on an experimental implementation. Lisp vendors
> probably don't get many requests for good float performance.
I don't know how many requests they get for good FP performance, but I can
guarantee you that they do get some, speaking from personal experience :-)
>
>[...]
--
Christopher J. Vogt - Computer Consultant - Lisp, AI, Graphics, etc.
http://members.home.com/vogt/
From: Erik Naggum
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <3098801908586746@naggum.no>
* Christopher J. Vogt
| I forget if it was Steele or Gabriel who observed that one of the things
| that makes programming in common lisp uniquely different from other
| languages, is that to get good performance, you have to be intimately
| familiar with your implementation. This sort of issue doesn't exist in
| C, for example. One implementation of C is generally as fast (within a
| factor of 2) as another. However, CL implementations can vary quite a
| bit, in different areas. One might have (relatively) good FP performance
| but horrible Bignum performance etc.
I beg to differ. C systems (compilers, libraries, etc) generally differ
by just as large factors as Common Lisp, often more, when non-trivial
functions are used, and they tend to be for non-trivial programs, and
optimizing C is generally very hard beyond the simple cases.
I'm not saying that CL is doing a better job than C in the general case,
but I'm getting somewhat peeved with the repetitive claims that C is so
fast. it isn't. it never was. it's just that the language is fairly
transparent down to the machine level so when you change something in the
C code, you can predict the effect on the compiled machine code. that
doesn't make it _fast_, it only gratifies the control-freak aspect of
most programmers.
#:Erik
--
God grant me serenity to accept the code I cannot change,
courage to change the code I can, and wisdom to know the difference.
* Erik Naggum wrote:
> I beg to differ. C systems (compilers, libraries, etc) generally differ
> by just as large factors as Common Lisp, often more, when non-trivial
> functions are used, and they tend to be for non-trivial programs, and
> optimizing C is generally very hard beyond the simple cases.
I think that in general this is quite right (and in general most
programs are a long way away from the case where even language
performance matters), but in the *specific* case of heavily numerical
code it is significantly difficult to get good performance out of CL
implementations, and even harder to get good performance with code
which is nearly portable. (In fact, it's probably really hard to get
good performance in all cases where most of the data should be
represented as immediate objects rather than pointers, but numerical
code is the best and commonest example of this).
And in these cases, when you get it wrong (or the compiler isn't smart
enough), you often end up with stuff that is 10s or 100s of times
slower than it should be. Languages like C or Fortran are unlikely to
cause you to take this kind of hit in cases where everything is
basically a float or int of some flavour. Of course they will
silently add two large positive integers & get a large negative
result...
--tim
Tim Bradshaw <···@aiai.ed.ac.uk> writes:
> And in these cases, when you get it wrong (or the compiler isn't smart
> enough), you often end up with stuff that is 10s or 100s of times
> slower than it should be. Languages like C or Fortran are unlikely to
> cause you to take this kind of hit in cases where everything is
> basically a float or int of some flavour. Of course they will
> silently add two large positive integers & get a large negative
> result...
>
> --tim
That is quite right, but then C is also almost always at least a factor
of 2 slower than what a good Fortran compiler produces. I have seen this
with lots of programs in C and Fortran on Convex and Cray computers.
But then, this was 5 years ago, things may have changed.
Andreas Eder NetWorker (ReliantUnix)
SNI AG Phone: +49-89-636-42549
OEC HES XP 2 Fax: +49-89-636-40601
D-81739 Munich, Germany
EMail: ···················@mch.sni.de
for public access: http://www.sni.de/public/mr/nsr/nsr_en.htm
for SNI personnel only: http://athen.mch.sni.de/networker/welcome.htm
---
Is "Windows for Dummies" another case of "this sentence no verb?"
From: Matthew McDonald
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <mafm.890029188@cs.uwa.edu.au>
Erik Naggum <······@naggum.no> writes:
> I beg to differ. C systems (compilers, libraries, etc) generally differ
> by just as large factors as Common Lisp, often more, when non-trivial
> functions are used,
Do you have any examples to substantiate this claim?
-- Matthew McDonald <····@cs.uwa.edu.au>
"The Statistician is the natural predator of the Scientist" -- Ross Ihaka
····@cs.uwa.edu.au (Matthew McDonald) writes:
> Erik Naggum <······@naggum.no> writes:
>
> > I beg to differ. C systems (compilers, libraries, etc) generally differ
> > by just as large factors as Common Lisp, often more, when non-trivial
> > functions are used,
>
> Do you have any examples to substantiate this claim?
>
Compare gcc on a Linux Alpha system against the DEC C compiler for
Alpha. (Or was that f77 vs DEC Fortran?) I understand that
the difference in performance can be a factor of 2 or more for the
same code on the same machine.
See comp.os.linux.alpha.
Ray
····@cs.uwa.edu.au (Matthew McDonald) writes:
> Erik Naggum <······@naggum.no> writes:
>
> > I beg to differ. C systems (compilers, libraries, etc) generally differ
> > by just as large factors as Common Lisp, often more, when non-trivial
> > functions are used,
>
> Do you have any examples to substantiate this claim?
>
Here is a simple example:

#include <string.h>

void C_entry(void);

int
main(void)
{
    C_entry();
    return 0;
}

void
C_entry(void)
{
    auto signed char buf[50];
    strcpy((char *) buf, "abcd");  /* cast needed: strcpy takes char * */
}
The program is redundant, but it was written in this way to prove
exactly the point of the "unpredictability" of compiled C code. (The
compiler was for the ARM.) The compiler is so aggressive that it
recognizes the 'strcpy' and "inlines" it as a 'memcpy' when the
constant string passed to it fits in a given memory pattern.
Substituting "abcd" with "abc" makes the compiler change its strategy,
hence the optimization level. Substituting 'strcpy' with the obvious
counterpart 'mystrcpy' (taken straight out of K&R) seems to yield more
predictable performance.
Hence the proposition "there exist C compilers with various degrees
of resulting optimizations" is constructively proved :)
(Of course this may or may not be an answer to the question, but IMHO,
it comes pretty close).
The example is taken from lectures of the EECS 290a course held by
Professor Sangiovanni-Vincentelli at UCB last
fall. http://www-cad.eecs.berkeley.edu/~polis/class/index.html.
Check the second Lecture by Baker (actually the author is Grant) on
"Software Performance Estimation".
Cheers
--
Marco Antoniotti ===========================================
PARADES, Via San Pantaleo 66, I-00186 Rome, ITALY
tel. +39 - (0)6 - 68 80 79 23, fax. +39 - (0)6 - 68 80 79 26
http://www.parades.rm.cnr.it
From: Erik Naggum
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <3099027379447143@naggum.no>
* Matthew McDonald
| Do you have any examples to substantiate this claim?
not any good set of particular examples, but 15 years of experience with
C on lots of platforms and lots of compilers has alerted me to the fact
that C compilers and environments differ _very_ much in quality, more
than my experience with (native-compiled) Common Lisp implementations and
environments do. Common Lisp implementations are generally quite good,
but the same is not necessarily true for C compilers -- naive compilation
of C is often a viable option for small systems.
if you need a specific example, compare the bundled C compiler of
SunOS 4.X with the commercial compiler for Solaris 2.X. GCC 2.8.1 beats
the former by a factor of 1.7, but the Solaris 2.X commercial C compiler
appears to be slightly better than GCC.
I find it interesting that you jump up to require substantiation of the
claim that C implementations differ, but do not challenge that Common
Lisp implementations do. this is the general sentiment out there, as
sort of a cultural requirement on programmers.
C is not fast. C is primitive, which means it has a potential of being
fast if the programmer is a genius and the compiler is on his side. if
the programmer is an idiot, and the compiler expects smarter programmers,
you will get abysmal performance. e.g., a system currently written in C
but which I am replacing with one rewritten in Common Lisp shows a
promise of reducing system requirements by a factor of between 5 and 6,
after months of work to decipher the incredibly bad C code to figure out
what it actually does.
#:Erik
--
religious cult update in light of new scientific discoveries:
"when we cannot go to the comet, the comet must come to us."
The following exchange got trimmed, and I have added it back in for direct
reference.
> * Christopher J. Vogt
> | I forget if it was Steele or Gabriel who observed that one of the things
> | that makes programming in common lisp uniquely different from other
> | languages, is that to get good performance, you have to be intimately
> | familiar with your implementation. This sort of issue doesn't exist in
> | C, for example. One implementation of C is generally as fast (within a
> | factor of 2) as another. However, CL implementations can vary quite a
> | bit, in different areas. One might have (relatively) good FP performance
> | but horrible Bignum performance etc.
>
Erik Naggum
> I beg to differ. C systems (compilers, libraries, etc) generally differ
> by just as large factors as Common Lisp, often more, when non-trivial
> functions are used, and they tend to be for non-trivial programs, and
> optimizing C is generally very hard beyond the simple cases.
Erik Naggum wrote:
>
> * Matthew McDonald
> | Do you have any examples to substantiate this claim?
>
> not any good set of particular examples, but 15 years of experience with
> C on lots of platforms and lots of compilers has alerted me to the fact
> that C compilers and environments differ _very_ much in quality, more
> than my experience with (native-compiled) Common Lisp implementations and
> environments do. Common Lisp implementations are generally quite good,
> but the same is not necessarily true for C compilers -- naive compilation
> of C is often a viable option for small systems.
>
> if you need a specific example, consider the bundled C compiler with
> SunOS 4.X with the commercial compiler for Solaris 2.X. GCC 2.8.1 beats
> the former by a factor of 1.7, but the Solaris 2.X commercial C compiler
> appears to be slightly better than GCC.
Well, I did say "within a factor of 2", so you seem to be in violent agreement
with me? Yes, commercial C compilers vary in speed, generally within a factor
of 2. Commercial CL systems, however, seem to vary by much larger factors, in
many different areas.
From: Matthew McDonald
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <mafm.890111766@cs.uwa.edu.au>
I didn't really want to write a long article about this since (a) I'm
not a lisp expert (though I've had some experience and success
optimising lisp code under CMUCL) and (b) most of my experience comparing
lisp and C implementations is several years old and therefore probably
somewhat outdated.
That said ......
Erik Naggum <······@naggum.no> writes:
>| Do you have any examples to substantiate this claim?
> not any good set of particular examples,
[...]
> SunOS 4.X with the commercial compiler for Solaris 2.X. GCC 2.8.1 beats
> the former by a factor of 1.7, but the Solaris 2.X commercial C compiler
> appears to be slightly better than GCC.
I was originally planning to post a longer message in which I was
going to mention that I could remember seeing C code differ in
performance by a factor of 2 occasionally and around 3 only rarely.
Unless I misunderstood something, one of the versions of the ray-tracing
code ran *four* times faster on CMUCL than it did on ACL. Admittedly, it
had been optimised for CMUCL, but I'd be surprised to see a non-contrived
piece of C that revealed similar differences in C implementations.
> I find it interesting that you jump up to require substantiation of the
> claim that C implementations differ, but do not challenge that Common
> Lisp implementations do. this is the general sentiment out there, as
> sort of a cultural requirement on programmers.
I asked for examples because:
(1) I thought that by "large differences" we meant something on the
order of a factor of four, not factors of two or less.
(2) Your claim doesn't match my personal experience.
The thread has already given an example of a piece of code that
differed by a factor of 4 under two different Lisp implementations. I
can't think of code I've seen that revealed similar differences
between serious C implementations.
> C is not fast. C is primitive, which means it has a potential of being
> fast if the programmer is a genius and the compiler is on his side.
With simple languages like C, the difference between naive compilation
and full optimisation is usually less than a factor of two (in my
experience). Look at the gcc vs Oberon comparisons (the standard
Oberon compilers do practically no optimisation) or even gcc with and
without optimisation.
Even a stupid C compiler can typically produce code that's within
spitting distance of what's produced by an optimising compiler,
because the language is basically a portable assembler. The same can't
be said of common lisp. When you're optimising floating point code in
C you don't usually need to go to much trouble to ensure that the
major operations are on unboxed data!
I keep seeing lisp people say that code written in lisp can easily be
made to perform as well as code written in C, and it just doesn't
match my (limited) experience. With CMUCL, floating point code can be
made to run about as fast as in C, but it seems like hard work. I'm
not even sure it's feasible in ACL.
It was only three or four years ago that a well-known lisp implementor
posted something here suggesting that when you want lots of speed
from a certain lisp implementation, you should estimate the number of
registers required by the sub-expressions of important expressions so
you can then rearrange the expression to minimise the number of loads
and stores during the execution of the expression, since the compiled
code would execute the expression strictly left-to-right.
In the C/pascal/FORTRAN world, even straightforward compilers weigh
subexpressions before deciding what order to evaluate them in.
Perhaps I'm missing something, or I misunderstood the "well-known lisp
implementor", and the advice he posted was thirty years out of date and
for historical interest only.
The only substantial example of lisp code I can remember that was
competitive is the Koza genetic algorithm code, which was easily optimised
to the point where it beat the C equivalent. I'd be delighted to see
large numbers of significant examples of straightforward lisp programs
that performed within a factor of two of C equivalents.
-- Matthew McDonald <····@cs.uwa.edu.au>
Appendix: to substantiate the claim that C implementations often don't
differ dramatically, here's a benchmark given at the end of the lcc
OVERVIEW.TEX file (from some time in 1993).
benchmark
compiler    1. gcc1.35   8. espresso   22. li   23. eqntott
VAX: MicroVAX II w/16MB running Ultrix 3.1
lcc 1734 2708 7015 3532
cc 1824 2782 7765 3569
cc -O 1661 2653 7086 3757
gcc 1439 2757 7353 3263
gcc -O 1274 2291 6397 1131
68020: Sun 3/60 w/24MB running SunOS 4.0.3
lcc 544 1070 2591 567
cc 514 1005 3308 619
cc -O 428 882 2237 571
gcc 426 1048 2498 591
gcc -O 337 834 1951 326
MIPS: IRIS 4D/220GTX w/32MB running IRIX 3.3.1
lcc 116 150 352 111
cc 107 153 338 100
cc -O 92 130 299 70
gcc 188 502 132
gcc -O 145 411 112
SPARC: Sun 4/260 w/32MB running SunOS 4.0.3
lcc 196 370 790 209
cc 203 381 1094 275
cc -O 150 296 707 183
gcc 186 411 1139 256
gcc -O 127 309 788 179
Matthew McDonald wrote:
>
> With simple languages like C, the difference between naive compilation
> and full optimisation is usually less than a factor of two (in my
> experience). Look at the gcc vs oberon comparisions (the standard
> oberon compilers do practically no optimisation) or even gcc with and
> without optimisation.
>
Well, this is just what I have done some time ago, and the result may be
quite surprising:
without optimization (-O0) 2000 ms
optimizing (-O2) 330 ms
So you have a factor of 6. The program was very simple, something like
bubble-sort. (I looked carefully to make sure there was no huge job running
simultaneously, and I tried it several times, so it is no mistake.)
Martin
(martin.volf at st dot ms dot mff dot cuni dot cz)
From: John Watton
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <u90q979l0.fsf@alcoa.com>
>I keep seeing lisp people say that code written in lisp can easily be
>made to perform as well as code written in C, and it just doesn't
>match my (limited) experience. With CMUCL, floating point code can be
>made to run about as fast as in C, but it seems like hard work. I'm
>not even sure it's feasible in ACL.
Of course it's feasible in ACL (the 4.3 variety on Unix, NT and Linux). I do
it all the time. I posted the benchmark below on comp.lang.lisp
sometime ago and didn't generate a fraction of the interest directed
to the ray tracing code. (Unlike the ray tracing code, I guess the
benchmark was too easy to get your hands around.) As you can see it is
comparable to unoptimized C code. With ACL I find I use the non-ANSI
declare keyword :explain to tell me where the boxing is (declare
(:explain :boxing :calls :types)). Also in ACL there is a non-ANSI and
undocumented feature called immediate-args for passing and returning
unboxed arguments. I don't use it in this benchmark but it is
available at least on the Unix variety. I suspect that this is one
area where the ray tracing code suffers in the ACL version. In general
I prefer to pass floats inside arrays to avoid boxing - it's easy and
ANSI. But I don't do that here, to be fair to C.
John Watton
ALCOA
Machine: Pentium 90 Mhz, 40 Meg RAM
OS: Windows NT
Double floating point benchmark
Pnpoly area sec rel
----------- --- ---
C | VC++ 4.0 (cl -Ox) 16.5 1.0
Java | JDK 1.1.4 (JIT) 20.2 1.2
C | VC++ 4.0 (cl) 23.5 1.4
Lisp | Franz ACL 4.3.2 25.6 1.6
Java | JDK 1.1.4 (-O) 106.9 6.5
Lisp | Franz ACL 3.0.1 521.3 31.6
Perl | Perl5.003 1007.0 61.0
PNPOLY_C_PNPOLY_C_PNPOLY_C_PNPOLY_C_PNPOLY_C_PNPOLY_C_PNPOLY_C_PNPOLY_C
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int pnpoly(int npol, double *xp, double *yp, double x, double y)
{
    int i, j, c = 0;
    for (i = 0, j = npol-1; i < npol; j = i++) {
        if ((((yp[i]<=y) && (y<yp[j])) ||
             ((yp[j]<=y) && (y<yp[i]))) &&
            (x < (xp[j] - xp[i]) * (y - yp[i]) / (yp[j] - yp[i]) + xp[i]))
            c = !c;
    }
    return c;
}

int main(int argc, char *argv[]) {
    int npol = atoi(argv[1]); /* 200 */
    int div = atoi(argv[2]); /* 400 */
    int count = 0;
    double pi = acos(-1.0);
    double* xp;
    double* yp;
    double theta;
    double a = 10.0;
    double fdiv = 2*a/div;
    int i=0, j=0;
    xp = malloc(npol*sizeof(double));
    yp = malloc(npol*sizeof(double));
    for(i=0;i<npol;i++) {
        theta = i*2*pi/npol;
        xp[i] = a+a*pow(cos(theta), 3);
        yp[i] = a+a*pow(sin(theta), 3);
    }
    for(i=0;i<=div;i++) {
        for(j=0;j<=div;j++) {
            if (pnpoly(npol,xp,yp,i*fdiv,j*fdiv)) count++;
        }
    }
    printf("\nArea: %f \n", 4*a*a*count/((div+1)*(div+1)));
    return(0);
}
PNPOLY_LISP_PNPOLY_LISP_PNPOLY_LISP_PNPOLY_LISP_PNPOLY_LISP_PNPOLY_LISP
(defun pnpoly (npol xp yp x y)
  (declare (optimize (speed 3) (safety 0))
           (fixnum npol)
           (double-float x y)
           (type (simple-array double-float (*)) xp yp))
  (let* ((c nil)
         (j (1- npol)))
    (declare (fixnum j))
    (dotimes (i npol c)
      (declare (fixnum i))
      (if (and (or (and (<= (aref yp i) y) (< y (aref yp j)))
                   (and (<= (aref yp j) y) (< y (aref yp i))))
               (< x (+ (aref xp i) (/ (* (- (aref xp j) (aref xp i))
                                         (- y (aref yp i)))
                                      (- (aref yp j) (aref yp i))))))
          (setq c (not c)))
      (setq j i))))

(defun pnpolymain (npol div) ; call with 200 400
  (declare (optimize (speed 3) (safety 0))
           (fixnum npol div))
  (let* ((xp (make-array npol :element-type 'double-float))
         (yp (make-array npol :element-type 'double-float))
         (theta 0.0d0)
         (a 10.0d0)
         (fdiv (/ (* 2 a) div))
         (count 0))
    (declare (double-float fdiv a theta)
             (fixnum count)
             (type (simple-array double-float (*)) xp yp))
    (dotimes (i npol)
      (declare (fixnum i))
      (setq theta (/ (* 2 i pi) npol))
      (setf (aref xp i) (+ a (* a (expt (cos theta) 3)))
            (aref yp i) (+ a (* a (expt (sin theta) 3)))))
    (dotimes (u (1+ div))
      (declare (fixnum u))
      (dotimes (v (1+ div))
        (declare (fixnum v))
        (if (pnpoly npol xp yp (* u fdiv) (* v fdiv)) (incf count))))
    (format t "~%Area: ~a" (/ (* count 4 a a) (* (1+ div) (1+ div))))))
John Watton <···········@alcoa.com> writes:
> it all the time. I posted the benchmark below on comp.lang.lisp
> sometime ago and didn't generate a fraction of the interest directed
> to the ray tracing code. (Unlike the ray tracing code, I guess the
Perhaps because you had already optimized the code? CMUCL only gives
a few notes about (* (1+ DIV) (1+ DIV)) because the result is not a
fixnum. This should not impact speed too much.
> (:explain :boxing :calls :types)). Also with ACL is a nonANSI and
> undocumented feature called immediate-args for passing and returning
> unboxed arguments. I don't use it in this benchmark but it is
> available at least on the Unix variety. I suspect that this is one
> area that the ray tracing code suffers in the ACL version. In general
> I prefer to pass floats inside arrays to avoid boxing - it's easy and
> ANSI. But, I don't do that here to be fair to C.
I don't remember precisely, but aren't parameters allowed to be passed
in registers in C? If so, Lisp should be allowed the same freedom.
Here are my times on an Ultra-30 300 MHz sparc, using Sun C compiler
(-O), and CMUCL 18a+:
sec
cc 5.7
cc -O 2.0
CMUCL 3.1
I note that the test conses up 5 MB or so, as reported by time. If I
add the CMUCL extension (declaim (ext:start-block pnpolymain)), the
run time doesn't change, but it now only conses 8000 bytes.
Ray
Raymond Toy <···@rtp.ericsson.se> writes:
> John Watton <···········@alcoa.com> writes:
>
> > it all the time. I posted the benchmark below on comp.lang.lisp
> > sometime ago and didn't generate a fraction of the interest directed
> > to the ray tracing code. (Unlike the ray tracing code, I guess the
>
> Perhaps because you had already optimized the code? CMUCL only gives
> a few notes about (* (1+ DIV) (1+ DIV)) because the result is not a
> fixnum. This should not impact speed too much.
>
Here is one small optimization for Lisp which a C compiler probably
would have already done. This does the redundant array references
just once at the beginning of the loop. Are there any Lisp compilers
that do subexpression elimination (is that what it's called?)?
This change reduces the Lisp time from 3.2 sec to 2.5 sec. We are now
within .4 sec (20%) of the optimized C code.
Ray
(defun pnpoly (npol xp yp x y)
  (declare (optimize (speed 3) (safety 0))
           (fixnum npol)
           (double-float x y)
           (type (simple-array double-float (*)) xp yp))
  (let* ((c nil)
         (j (1- npol)))
    (declare (fixnum j))
    (dotimes (i npol c)
      (declare (fixnum i))
      (let ((ypi (aref yp i))
            (ypj (aref yp j))
            (xpi (aref xp i))
            (xpj (aref xp j)))
        (if (and (or (and (<= ypi y) (< y ypj))
                     (and (<= ypj y) (< y ypi)))
                 (< x (+ xpi (/ (* (- xpj xpi)
                                   (- y ypi))
                                (- ypj ypi)))))
            (setq c (not c))))
      (setq j i))))
From: Bulent Murtezaoglu
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <87pvjl89wh.fsf@isttest.bogus>
>>>>> "RT" == Raymond Toy <···@rtp.ericsson.se> writes:
[...]
RT> I note that the test cons up 5 MB or so, as reported by time.
RT> If I add the CMUCL extension (declaim (ext:start-block
RT> pnpolymain)), the run time doesn't change, but it now only
RT> conses 8000 bytes.
Yes I've been noticing that too. That's probably due to return values
not getting consed when the file is block compiled. Here's what I've been
wondering lately:
(defstruct foo ; bla bla
  )

(defun f1 (x)
  ;; some computation
  (make-foo ; some results
   ))

(defun f2 (y)
  ;; some computation
  (let ((local1 (f1 y)))
    ;; some more computation
    ))
you get the idea (excuse the bad parenthesis placement):
the make-foo call generates garbage. Is Common Lisp allowed to
stack-allocate what f1 returns under certain circumstances?
Maybe if f1 is inlined and local1 is declared dynamic-extent?
(make-foo-ing in f2 and passing that to f1 is too ugly)
Can some enlightened soul shed light on this?
cheers,
BM
John Watton wrote:
>
> >I keep seeing lisp people say that code written in lisp can easily be
> >made to perform as well as code written in C, and it just doesn't
> >match my (limited) experience. With CMUCL, floating point code can be
> >made to run about as fast as in C, but it seems like hard work. I'm
> >not even sure it's feasible in ACL.
>
> Of course its feasible in ACL (4.3 varity on unix, NT and linux). I do
> it all the time. I posted the benchmark below on comp.lang.lisp
> sometime ago and didn't generate a fraction of the interest directed
> to the ray tracing code. (Unlike the ray tracing code, I guess the
> benchmark was too easy to get your hands around.) As you can see it is
> comparible to unoptimized C code. With ACL I find I use the nonANSI
> declare keyword :explain to tell me where the boxing is (declare
> (:explain :boxing :calls :types)). Also with ACL is a nonANSI and
> undocumented feature called immediate-args for passing and returning
> unboxed arguments. I don't use it in this benchmark but it is
> available at least on the Unix variety. I suspect that this is one
> area that the ray tracing code suffers in the ACL version. In general
> I prefer to pass floats inside arrays to avoid boxing - it's easy and
> ANSI. But, I don't do that here to be fair to C.
>
> John Watton
> ALCOA
>
> Machine: Pentium 90 Mhz, 40 Meg RAM
> OS: Windows NT
>
> Double floating point benchmark
> Pnpoly area sec rel
> ----------- --- ---
> C | VC++ 4.0 (cl -Ox) 16.5 1.0
> Java | JDK 1.1.4 (JIT) 20.2 1.2
> C | VC++ 4.0 (cl) 23.5 1.4
> Lisp | Franz ACL 4.3.2 25.6 1.6
> Java | JDK 1.1.4 (-O) 106.9 6.5
> Lisp | Franz ACL 3.0.1 521.3 31.6
> Perl | Perl5.003 1007.0 61.0
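The array trick John mentions can be sketched like this (hypothetical names):
because the vector is specialized to double-float, AREF and SETF move raw
floats in and out, and only the array itself is a heap object, so values cross
the function boundary without being boxed:

```lisp
(defun fill-squares (buf)
  ;; BUF is a specialized double-float vector, so no per-element
  ;; boxing happens in this loop.
  (declare (type (simple-array double-float (*)) buf)
           (optimize (speed 3) (safety 0)))
  (dotimes (i (length buf) buf)
    (let ((x (float i 1d0)))
      (setf (aref buf i) (* x x)))))

(fill-squares (make-array 8 :element-type 'double-float
                            :initial-element 0d0))
```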
One needs to be very careful with benchmarks. It is easy to think you are
benchmarking something in particular, in this case DFP performance, when
in actuality you are measuring something else, in this case: conditionals,
incremental loops, function calling and integer to DFP conversion. The code
benchmarked only executes the following DFP code 1% of the time:
(xp[j] - xp[i]) * (y - yp[i]) / (yp[j] - yp[i]) + xp[i]))
If you change the code to what I have below, I think you'll *really* be
testing DFP performance, and not something else. I have omitted the
CL code for brevity, but it should be simple to make the similar changes.
I realize that what started out as a meaningful program is now a
synthetic meaningless program, but it will truly test DFP performance, and
not something else. OTOH, you could also use the FFT benchmark in Gabriel.
I'll be surprised if you don't find a factor of 10 performance difference
between CL and C.
int pnpoly(int npol, double *xp, double *yp, double x, double y)
{
int i, j, c = 0;
for (i = 0, j = npol-1; i < npol; j = i++) {
xp[i] = (xp[j] - xp[i]) * (y - yp[i]) / (yp[j] - yp[i]) + xp[i];
yp[j] = (xp[j] - xp[i]) * (y - yp[i]) / (yp[j] - yp[i]) + xp[i];
}
return c;
}
int poly(int npol, int div)
{
int count = 0;
double pi = acos(-1.0);
double* xp;
double* yp;
double theta;
double a = 10.0;
double fdiv = 2*a/div;
int i=0, j=0;
xp = (double *)malloc(npol*sizeof(double));
yp = (double *)malloc(npol*sizeof(double));
for(i=0;i<npol;i++) {
theta = i*2*pi/npol;
xp[i] = a+a*pow(cos(theta), 3);
yp[i] = a+a*pow(sin(theta), 3);
}
for(i=0;i<=div;i++) {
for(j=0;j<=div;j++) {
pnpoly(npol,xp,yp,i*fdiv,j*fdiv);
}
}
printf("\nArea: %f \n", 4*a*a*count/((div+1)*(div+1)));
return(0);
}
--
Christopher J. Vogt - Computer Consultant - Lisp, AI, Graphics, etc.
http://members.home.com/vogt/
"Christopher J. Vogt" <····@computer.org> writes:
>
> If you change the code to what I have below, I think you'll *really* be
> testing DFP performance, and not something else. I have omitted the
> CL code for brevity, but it should be simple to make the similar changes.
CL code appended. Hope it's right.
> I'll be surprised if you don't find a factor of 10 performance difference
> between CL and C.
I don't see why you would say that. For a smart Lisp compiler like
CMUCL no boxing of any float is needed in pnpoly, so CL doesn't have
to be any slower than C code.
Here is what I get, using Sun C and CMUCL 18a+
cc -O 7.25 sec
cc -xO5 6.06 sec
CMUCL 6.67 sec
CMUCL 5.86 sec (don't do redundant array refs)
So not only is CL not 10 times slower than C, but CL is actually
competitive with C. (Perhaps I made a mistake in translating the C
code to Lisp?)
Also, with CMUCL, I get overflow errors, which I turned off to run
this benchmark. C quietly runs to completion.
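For the record, the knob for this is presumably CMUCL's (non-ANSI)
floating-point-modes extension; a sketch, assuming CMUCL's EXT package:

```lisp
;; Ask CMUCL not to trap on FP overflow, so the benchmark quietly
;; produces infinities the way the C code does.
(ext:set-floating-point-modes
 :traps '(:invalid :divide-by-zero))   ; i.e. leave :overflow out

(ext:get-floating-point-modes)         ; inspect the current settings
```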
Ray
;; Translation of original
(defun pnpoly (npol xp yp x y)
(declare (optimize (speed 3) (safety 0))
(fixnum npol)
(double-float x y)
(type (simple-array double-float (*)) xp yp))
(let* ((c 0)
(j (1- npol)))
(declare (fixnum j))
(dotimes (i npol c)
(declare (fixnum i))
(setf (aref xp i) (+ (/ (* (- (aref xp j) (aref xp i))
(- y (aref yp i)))
(- (aref yp j) (aref yp i)))
(aref xp i)))
(setf (aref yp j) (+ (/ (* (- (aref xp j) (aref xp i))
(- y (aref yp i)))
(- (aref yp j) (aref yp i)))
(aref xp i)))
(setq j i))))
;; Move some redundant array refs. Not all removed because I didn't
;; want to figure out if they were really redundant.
(defun pnpoly (npol xp yp x y)
(declare (optimize (speed 3) (safety 0))
(fixnum npol)
(double-float x y)
(type (simple-array double-float (*)) xp yp))
(let* ((c 0)
(j (1- npol)))
(declare (fixnum j))
(dotimes (i npol c)
(declare (fixnum i))
(let ((ypi (aref yp i))
(ypj (aref yp j))
(xpi (aref xp i))
(xpj (aref xp j)))
(setf (aref xp i) (+ (/ (* (- xpj xpi)
(- y ypi))
(- ypj ypi))
xpi))
(setf (aref yp j) (+ (/ (* (- (aref xp j) (aref xp i))
(- y ypi))
(- ypj ypi))
(aref xp i))))
(setq j i))))
From: Christopher J. Vogt
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <350EFD44.A271833@computer.org>
Raymond Toy wrote:
>
> "Christopher J. Vogt" <····@computer.org> writes:
>
> >
> > If you change the code to what I have below, I think you'll *really* be
> > testing DFP performance, and not something else. I have omitted the
> > CL code for brevity, but it should be simple to make the similar changes.
>
> CL code appended. Hope it's right.
>
> > I'll be surprised if you don't find a factor of 10 performance difference
> > between CL and C.
>
> I don't see why you would say that. For a smart Lisp compiler like
> CMUCL no boxing of any float is needed in pnpoly, so CL doesn't have
> to be any slower than C code.
>
> Here is what I get, using Sun C and CMUCL 18a+
>
> cc -O 7.25 sec
> cc -xO5 6.06 sec
> CMUCL 6.67 sec
> CMUCL 5.86 sec (don't do redundant array refs)
>
> So not only is CL not 10 times slower than C, but CL is actually
> competitive with C. (Perhaps I made a mistake in translating the C
> code to Lisp?)
Well, obviously I have been using the *wrong* Lisp implementation. Sadly,
CMUCL doesn't run under Windows NT or MacOS, which are my two current
setups. However, this confirms what I have said for many years (and
most recently earlier in this thread), which is that there is nothing
in the CL spec that prohibits an implementation from attaining performance
similar to C (within 10%). Apparently CMUCL is an SSC (Sufficiently Smart
Compiler).
Now, how do I go about getting CMUCL ported to NT or MacOS?
>
> Also, with CMUCL, I get overflow errors, which I turned off to run
> this benchmark. C quietly runs to completion.
Indeed, one of the many advantages of Lisp over C.
--
Christopher J. Vogt - Computer Consultant - Lisp, AI, Graphics, etc.
http://members.home.com/vogt/
"Christopher J. Vogt" <····@computer.org> writes:
>
> Now, how do I go about getting CMUCL ported to NT or MacOS?
>
I doubt if anyone would want to do the MacOS port, especially if it's the
68k version. Porting to a new architecture is *extremely* hard. I
have a hard enough time just compiling up a new version! The PowerPC
Mac might be possible since there was an RS-6000 port.
The first step is to massage the C code to run on NT. Then you'll
have to figure out the Lisp part so that it could talk to the OS
correctly and all of the other OS specific things.
There was some discussion on the CMUCL mailing list about this, but it
seemed that the cygwin stuff for NT wasn't quite ready and nobody was
really motivated for lack of time, energy, desire, etc. to do a native
NT port. But I think many people would love to have one.
If you're interested, join the CMUCL mailing list for hints and
suggestions. See http://www.cons.org.
Ray
From: Pierre Mai
Subject: 'Fast' CL Implementation for Alpha/AXP? (Was Re: Floating Point speed in Common Lisp)
Date:
Message-ID: <m3iupckrym.fsf_-_@torus.cs.tu-berlin.de>
>>>>> "RT" == Raymond Toy <···@rtp.ericsson.se> writes:
RT> If you're interested, join the CMUCL mailing list for hints
RT> and suggestions. See http://www.cons.org.
Does a port to Linux/Alpha and/or Digital Unix/Alpha exist (I don't
think so), or is it being planned? Or which other Common Lisp
implementation -- preferably free or low-cost -- which is
"sufficiently fast"[1] is available for the Alpha?
I'm finding it somewhat difficult to contact *.cons.org (except
ftp2.cons.org, which is Walnut Creek's FTP Server) recently, getting
messages that these hosts don't exist (DNS problem?)... I'll try
again later today...
Regs, Pierre.
[1] FP performance is *not* my foremost concern here, but CLOS
performance and general funcalling and list manipulation is.
--
Pierre Mai <····@cs.tu-berlin.de> http://home.pages.de/~trillian/
"Such is life." -- Fiona in "Four Weddings and a Funeral" (UK/1994)
From: Patrick Tufts
Subject: Re: 'Fast' CL Implementation for Alpha/AXP? (Was Re: Floating Point speed in Common Lisp)
Date:
Message-ID: <6eoubi$14g$1@new-news.cc.brandeis.edu>
Pierre Mai <····@cs.tu-berlin.de> writes:
>think so), or is it being planned? Or which other Common Lisp
>implementation -- preferably free or low-cost -- which is
>"sufficiently fast"[1] is available for the Alpha?
I doubt it's low-cost, but I remember reading that Symbolics ported
Genera to the DEC Alpha. I think the price was several thousand US
dollars a few years ago.
--Pat
From: Istvan Marko
Subject: Re: 'Fast' CL Implementation for Alpha/AXP? (Was Re: Floating Point speed in Common Lisp)
Date:
Message-ID: <m3ogz3a1we.fsf@imarko.pacificnet.net>
Pierre Mai <····@cs.tu-berlin.de> writes:
> >>>>> "RT" == Raymond Toy <···@rtp.ericsson.se> writes:
>
> RT> If you're interested, join the CMUCL mailing list for hints
> RT> and suggestions. See http://www.cons.org.
>
> Does a port to Linux/Alpha and/or Digital Unix/Alpha exist (I don't
> think so), or is it being planned?
ftp://ftp2.cons.org/pub/lisp/cmucl/release/cmucl-18a.alpha_DU4*
--
Istvan
From: Raymond Toy
Subject: Re: 'Fast' CL Implementation for Alpha/AXP? (Was Re: Floating Point speed in Common Lisp)
Date:
Message-ID: <4n1zvyrbxq.fsf@rtp.ericsson.se>
Pierre Mai <····@cs.tu-berlin.de> writes:
>
> Does a port to Linux/Alpha and/or Digital Unix/Alpha exist (I don't
> think so), or is it being planned? Or which other Common Lisp
It exists for Digital Unix (no Linux/Alpha) and you can find it at
www.cons.org.
Well, except that apparently some of the machines there have died in
the past few days. They're being worked on and should be up in a few
days.
>
> [1] FP performance is *not* my foremost concern here, but CLOS
> performance and general funcalling and list manipulation is.
>
The Alpha version supposedly has quite good FP performance.
Ray
From: Pierre Mai
Subject: Re: 'Fast' CL Implementation for Alpha/AXP? (Was Re: Floating Point speed in Common Lisp)
Date:
Message-ID: <m390q5ogqd.fsf@torus.cs.tu-berlin.de>
>>>>> "RT" == Raymond Toy <···@rtp.ericsson.se> writes:
RT> Pierre Mai <····@cs.tu-berlin.de> writes:
>> Does a port to Linux/Alpha and/or Digital Unix/Alpha exist (I
>> don't think so), or is it being planned? Or which other Common
>> Lisp
RT> It exists for Digital Unix (no Linux/Alpha) and you can find
RT> it www.cons.org.
Oooops, my fault for relying on my poor remembrance. Of course the
Alpha is supported, and having gotten the 18a-Release yesterday,
working with CMUCL on Alpha really is *fun* ;)
Thanks a lot for all those who corrected me! Isn't it funny what you
can miss out on, by relying on that thing between your ears... ;)
>> [1] FP performance is *not* my foremost concern here, but CLOS
>> performance and general funcalling and list manipulation is.
>>
RT> The Alpha version supposedly has quite good FP performance.
I think I should write more number-crunching programs ;)
Regs, Pierre.
--
Pierre Mai <····@cs.tu-berlin.de> http://home.pages.de/~trillian/
"Such is life." -- Fiona in "Four Weddings and a Funeral" (UK/1994)
From: Mike Mcdonald
Subject: Re: 'Fast' CL Implementation for Alpha/AXP? (Was Re: Floating Point speed in Common Lisp)
Date:
Message-ID: <6esc85$hci@hpmos.wv.mentorg.com>
[Posted and mailed]
In article <·················@torus.cs.tu-berlin.de>,
Pierre Mai <····@cs.tu-berlin.de> writes:
>>>>>> "RT" == Raymond Toy <···@rtp.ericsson.se> writes:
>
> RT> If you're interested, join the CMUCL mailing list for hints
> RT> and suggestions. See http://www.cons.org.
>
> Does a port to Linux/Alpha and/or Digital Unix/Alpha exist (I don't
> think so), or is it being planned? Or which other Common Lisp
> implementation -- preferably free or low-cost -- which is
> "sufficiently fast"[1] is available for the Alpha?
bigdec=>cmucl
CMU Common Lisp 18a-RELEASE, running on bigdec
Send bug reports and questions to your local CMU CL maintainer, or to
·········@cons.org.
Loaded subsystems:
Python 1.0, target Alpha
CLOS based on PCL version: September 16 92 PCL (f)
* (quit)
bigdec=>uname -a
OSF1 bigdec V4.0 564 alpha
bigdec=>
> I'm finding it somewhat difficult to contact *.cons.org (except
> ftp2.cons.org, which is Walnut Creek's FTP Server) recently, getting
> messages that these hosts don't exist (DNS problem?)... I'll try
> again later today...
Some machines died and they're slowly getting
fixed. Please be patient.
Mike McDonald
·······@mikemac.com
In article <················@computer.org>,
····@computer.org wrote:
> > Also, with CMUCL, I get overflow errors, which I turned off to run
> > this benchmark. C quietly runs to completion.
>
> Indeed, one of the many advantages of Lisp over C.
In the usual case, you can make your C code send SIGFPE on exceptions
like overflow, in FreeBSD see fpsetmask(3). In CMUCL and C, the native
features of the CPU are used.
Martin
-----== Posted via Deja News, The Leader in Internet Discussion ==-----
http://www.dejanews.com/ Now offering spam-free web-based newsreading
What's the answer that (pnpoly 200 400) is supposed to give? In C I
get 177.287828. In CMU lisp I get 0.0d0.
Seems like it's a little useless to compare two programs that give
different answers to the same problem.
--
Fred Gilham gilham @ csl . sri . com
King Christ, this world is all aleak, / And life preservers there are none,
And waves that only He may walk / Who dared to call Himself a man.
-- e. e. cummings, from Jehovah Buried, Satan Dead
Fred Gilham <······@snapdragon.csl.sri.com> writes:
> What's the answer that (pnpoly 200 400) is supposed to give? In C I
> get 177.287828. In CMU lisp I get 0.0d0.
Bummer. I get the same answer as C using CMUCL 18a+ Sparc. What
version are you running? Time to get a newer version?
Ray
Raymond Toy <···@rtp.ericsson.se> writes:
> Fred Gilham <······@snapdragon.csl.sri.com> writes:
>
> > What's the answer that (pnpoly 200 400) is supposed to give? In C I
> > get 177.287828. In CMU lisp I get 0.0d0.
>
> Bummer. I get the same answer as C using CMUCL 18a+ Sparc. What
> version are you running? Time to get a newer version?
>
> Ray
After I thought about it a little, it became pretty clear that 0.0d0
isn't the right answer.... :-)
I try to run the newer versions of CMUCL. Oddly enough, the older,
release version for x86 (CMU Common Lisp 97.02.22) gives the same
answer as the C version. This is under FreeBSD on a PII-300.
snapdragon:tests > lisp
CMU Common Lisp 97.02.22, running on snapdragon.csl.sri.com
Send bug reports and questions to your local CMU CL maintainer, or to
··········@cs.cmu.edu.
Loaded subsystems:
Python 1.0, target Intel x86
CLOS based on PCL version: September 16 92 PCL (f)
* (load "pnpoly.x86f")
; Loading #p"/a/japonica/carnelian/homes/gilham/cmucl/tests/pnpoly.x86f".
T
* (time (pnpolymain 200 400))
Compiling LAMBDA NIL:
Compiling Top-Level Form:
[GC threshold exceeded with 2,000,384 bytes in use. Commencing GC.]
[GC completed with 41,984 bytes retained and 1,958,400 bytes freed.]
[GC will next occur when at least 2,041,984 bytes are in use.]
[GC threshold exceeded with 2,042,368 bytes in use. Commencing GC.]
[GC completed with 41,984 bytes retained and 2,000,384 bytes freed.]
[GC will next occur when at least 2,041,984 bytes are in use.]
Area: 117.28782781201609d0
Evaluation took:
3.15 seconds of real time
2.962338 seconds of user run time
0.018394 seconds of system run time
[Run times include 0.21 seconds GC run time]
0 page faults and
5326336 bytes consed.
NIL
*
Now for the punch line. The following is under Linux on a P-Pro 200.
boron:tests > lisp
Allegro CL 4.3 [Linux/X86; R1] (12/11/96 1:33)
Copyright (C) 1985-1996, Franz Inc., Berkeley, CA, USA. All Rights Reserved.
;; Optimization settings: safety 1, space 1, speed 1, debug 2.
;; For a complete description of all compiler switches given the
;; current optimization settings evaluate (EXPLAIN-COMPILER-SETTINGS).
USER(1): (compile-file "pnpoly.lisp")
;;; Compiling file pnpoly.lisp
; Compiling PNPOLY
; Compiling PNPOLYMAIN
;;; Writing fasl file pnpoly.fasl
Warning: No IN-PACKAGE form seen in pnpoly.lisp. (Allegro Presto will
be ineffective when loading a file having no IN-PACKAGE form.)
;;; Fasl write complete
#p"pnpoly.fasl"
T
NIL
USER(2): (load "pnpoly.fasl")
; Fast loading ./pnpoly.fasl
T
USER(3): (time (pnpolymain 200 400))
Area: 180.0784820989919d0
; cpu time (non-gc) 3,900 msec user, 10 msec system
; cpu time (gc) 140 msec user, 0 msec system
; cpu time (total) 4,040 msec user, 10 msec system
; real time 4,264 msec
; space allocation:
; 19 cons cells, 0 symbols, 7,804,928 other bytes
NIL
USER(4):
--
Fred Gilham gilham @ csl . sri . com
King Christ, this world is all aleak, / And life preservers there are none,
And waves that only He may walk / Who dared to call Himself a man.
-- e. e. cummings, from Jehovah Buried, Satan Dead
"Christopher J. Vogt" <····@computer.org> writes:
> I'll be surprised if you don't find a factor of 10 performance difference
> between CL and C.
I think I already tried to explain this to you in a thread in
comp.lang.lisp.mcl: The bad FP performance you achieve is caused
by the bad FP implementation of MCL. ACL and CMUCL behave much,
much better. (But then, my execution times for FP code with
ACL4.3.1/Solaris were so low that you didn't believe them, since
the specific code you posted compiled to pure FP-crunching machine
code).
It's very sad that Digitool hasn't done anything about this yet,
because (most of) the rest of MCL is really, really great. But
the real world need for good FP performance is probably greater
than the real world need for the outstanding bignum performance
of MCL. (But take any piece of integer/symbol-crunching software
and port it to MCL, and you'll see very good performance. And
try and port it to C, and you'll find yourself reinventing the
concept of symbols. I've seen _that_ done several times :-()
--
regards,
Espen Vestre
Espen Vestre wrote:
>
> "Christopher J. Vogt" <····@computer.org> writes:
>
> > I'll be surprised if you don't find a factor of 10 performance difference
> > between CL and C.
>
> I think I already tried to explain you this in a thread in
> comp.lang.lisp.mcl: The bad FP performance you achieve is caused
> by the bad FP implementation of MCL.
Sloppy writing on my part, I guess. Earlier in the thread I stated:
In my experience with CL, much of it involving graphics and FP, the two
areas that CL implementations fall far behind C in terms of performance are
FP and I/O. So if there is something in particular about CL that you want
to benchmark, e.g. array performance, make sure that you aren't doing any
I/O and are not using FP numbers. I find it interesting that the original
message that started this thread was titled "Floating Point speed in CL",
but the code wasn't doing very much FP, it was mostly ratios!
> ACL and CMUCL behave much,
> much better.
I have not used CMUCL, which apparently has comparable performance. This
is great, and is an existence proof to what I have been saying, and should
be a target for all the commercial lisp vendors: Digitool, Franz and Harlequin.
I have not used the most recent versions of ACL, which appear to be
within a factor of 2 of C. Older versions of ACL, however, were as I described,
as are current versions of LWW and MCL. Unfortunately for me, I need something
that runs under MacOS or WinNT, and CMUCL doesn't run in either place. I need
to look at the current version of ACL though, as it sounds promising.
> (But then, my execution times for FP code with
> ACL4.3.1/Solaris were so low that you didn't believe them, since
> the specific code you posted compiled to pure FP-crunching machine
> code).
What I found difficult to believe is that you showed the time for floating
point CL code 7 times faster than fixnum code. I find this counterintuitive
in the extreme. It suggested to me that some investigation was in order. It isn't
the "absolute" time of the FP code, but the time relative to the fixnum
version of the same code. I remain skeptical. Did you look at the disassembly
to be sure the compiler didn't, for example, optimize out the floating point
multiply because the result was never used? Or maybe, and I'm not familiar
with your hardware, you have 10 floating point units, and only one integer
unit in your CPU; that might explain why the FP code was faster than the
fixnum code.
> [...]
--
Christopher J. Vogt - Computer Consultant - Lisp, AI, Graphics, etc.
http://members.home.com/vogt/
From: Erik Naggum
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <3099238886556386@naggum.no>
* Christopher J. Vogt
| Older versions of ACL, however where as I described, as are current
| versions of LWW and MCL.
FWIW, "Allegro Common Lisp" is used both for the Windows and the Unix
version of Franz Inc's product line. the Windows version is a cheap and
slow CLtL1 implementation with CLOS added -- I was very disappointed with
it. however, its main fortes are its small size and the GUI builder.
the Unix version purports to be conforming to ANSI Common Lisp (and is
very close), is fast and holds very high quality -- I have been very
satisfied with it. if all you have is ACL for Windows, the Unix version
is like it's from a different world. (actually, it's from a different
vendor, some years back.)
the latest releases of these products are ACL 3.0.2 for Windows, ACL 4.3
for Linux, ACL 4.3.1 for Unix, and ACL 4.3.2 for NT, which is based on
the Unix version plus the GUI builder for Windows NT, and I understand it
is a preliminary release towards ACL 5.0.
#:Erik
--
religious cult update in light of new scientific discoveries:
"when we cannot go to the comet, the comet must come to us."
"Christopher J. Vogt" <····@computer.org> writes:
> What I found difficult to believe is that you showed the time for floating
> point CL code 7 times faster than fixnum code. I find this counter
> intuitive to the extreme. It suggested to me some investigation. It isn't
> the "absolute" time of the FP code, but the time relative to the fixnum
> version of the same code. I remain skeptical. Did you look at the disassembly
> to be sure the compiler didn't, for example, optimize out the floating point
> multiply because the result was never used?
I glanced at the disassembly of both the ACL and MCL compiled functions,
and as far as I could tell (I'm not very fluent in reading assembly code -
if I have some time, I could try it again), the FP code was as optimized
as you could wish, but the fixnum code wasn't quite optimized. For MCL,
the fixnum code was _very_ good, but the FP code looked like a disaster
(much less optimized than the ACL _fixnum_ code).
> Or maybe, and I'm not familiar
> with your hardware, you have 10 floating point units, and only one integer
> unit in your CPU; that might explain why the FP code was faster than the
> fixnum code.
Compared to the overall speed of the machine, the UltraSPARC 140Mhz
processor of my workstation, which I use for ACL, has very good FP
specs (take a look at http://www.spec.org).
--
regards,
Espen Vestre
Telenor Nextel
Norway
····@cs.uwa.edu.au (Matthew McDonald) writes:
>
> Unless I misunderstood something, one of the versions of the ray-tracing
> code ran *four* times faster on CMUCL than it did on ACL. Admittedly, it
> had been optimised for CMUCL, but I'd be surprised to see a non-contrived
> piece of C that revealed similar differences in C implementations.
The code was not really optimized for CMUCL. The changes were
portable, and the only CMUCL specific item was the start-block stuff.
This could easily have been made portable by wrapping up everything in
a labels block.
I think that if *I* wrote a C compiler for a pentium, I could produce
one that produces code that's probably 2-10 times slower than any
other C compiler. It would take me forever to do it, though.
I guess the bottom line is that writing a compiler is hard, and
writing a good compiler is even harder.
Ray
····@cs.uwa.edu.au (Matthew McDonald) writes:
> Erik Naggum <······@naggum.no> writes:
>
> > C is not fast. C is primitive, which means it has a potential of being
> > fast if the programmer is a genius and the compiler is on his side.
>
> With simple languages like C, the difference between naive compilation
> and full optimisation is usually less than a factor of two (in my
> experience). Look at the gcc vs oberon comparisions (the standard
> oberon compilers do practically no optimisation) or even gcc with and
> without optimisation.
>
> Even a stupid C compiler can typically produce code that's within
> spitting distance of what's produced by an optimising compiler,
> because the language is basically a portable assembler. The same can't
> be said of common lisp. When you're optimising floating point code in
> C you don't usually need to go to much trouble to ensure that the
> major operations are on unboxed data!
Very untrue. Although -O2 with gcc on intel typically reduces run
time by a factor of 1.2 to 1.5, I've seen egcs give a factor
significantly larger than 2 on a big project. I've also seen such
large differences more frequently on the alpha & sparc CPUs, most
notably with DEC's C compiler on alpha CPUs, where factors of 2 are
common, and larger factors aren't too unusual.
To substantiate some of this, here're the application runtimes when
compiled with egcs & with gcc on linux intel:
pII 300, gcc -O2 -ffloat-store:
24.85user 0.83system 0:45.53elapsed 56%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2092major+2575minor)pagefaults 217swaps
pII 300, egcs -O3 -ffloat-store -march=pentiumpro:
17.59user 0.33system 0:26.53elapsed 67%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (748major+2545minor)pagefaults 0swaps
(on this version of gcc, -O2 was giving us the best performance).
I don't have unoptimized run times handy, but the above shows the
large difference one can see using optimization from compilers even as
similar as gcc 2.7.2.1 & egcs 1.0.1. Also, if you assume that gcc -O2
gives a 30% reduction in run time over unoptimized code then egcs -O3
is giving slightly better than a factor of 2.
The differences between C & optimized C are better register allocation
and instruction reordering to take advantage of pipelining & multiple
execution units, not to mention common subexpression elimination,
hiking invariants out of loops, etc.
This is necessary for good performance on RISC chips. Instruction
reordering isn't as important on intel, but register allocation is
extremely important due to the lack of registers. And, with Merced,
optimization & instruction ordering is going to be even more
important. It will be almost impossible to write fast assembler by
hand for Merced, let alone do it with unoptimized C code.
All this actually makes C quite far from a "portable assembler".
As for optimizing floating point code in Lisp, one's largely just
adding in all the declarations that one had to start with in C, so I
don't see your point about it being so much more trouble. Especially
when most of the time is spent in a small section of the code, where
one just has to add the appropriate declarations in a small amount of
code, whereas in C the declarations are everywhere.
I think what generally makes optimization more difficult in Lisp is
that people do it so rarely. Mostly because it's not needed so
often. There's fast and there's fast enough, and Lisp is usually fast
enough.
--
Harvey J. Stein
Berger Financial Research
·······@bfr.co.il
Harvey J. Stein wrote:
>
>[...]
>
>
> As for optimizing floating point code in Lisp, one's largely just
> adding in all the declarations that one had to start with in C, so I
> don't see your point about it being so much more trouble. Especially
> when most of the time is spent in a small section of the code, where
> one just has to add the appropriate declarations in a small amount of
> code, whereas in C the declarations are everywhere.
Your programming environment has to support this. The CL standard allows
implementations to ignore type declarations. There are some implementations
that ignore floating point type declarations, and so no performance
improvement is available via declarations.
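One way to find out what a given implementation actually does with such
declarations is to disassemble a small declared function and look for boxing or
generic-arithmetic calls; a minimal sketch (hypothetical function name):

```lisp
(defun axpy (a x y)
  (declare (optimize (speed 3) (safety 0))
           (double-float a x y))
  (+ (* a x) y))

;; An implementation that honors the declarations compiles this down to
;; a couple of FP instructions; one that ignores them calls generic
;; + and * and boxes the intermediate result.
(disassemble #'axpy)
```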
>
> I think what generally makes optimization more difficult in Lisp is
> that people do it so rarely. Mostly because it's not needed so
> often. There's fast and there's fast enough, and Lisp is usually fast
> enough.
I would agree with you that CL is generally "fast enough". However, in my
experience, I have found that floating point and I/O tend not to be
"fast enough" for the applications I have worked on.
--
Christopher J. Vogt - Computer Consultant - Lisp, AI, Graphics, etc.
http://members.home.com/vogt/
"Christopher J. Vogt" <····@computer.org> writes:
> Harvey J. Stein wrote:
> >
> > As for optimizing floating point code in Lisp, one's largely just
> > adding in all the declarations that one had to start with in C, so I
> > don't see your point about it being so much more trouble. Especially
> > when most of the time is spent in a small section of the code, where
> > one just has to add the appropriate declarations in a small amount of
> > code, whereas in C the declarations are everywhere.
>
> Your programming environment has to support this. The CL standard allows
> implementations to ignore type declarations. There are some implementations
> that ignore the floating point type declaration, and so no performance
> impmrovement is available via. declarations.
Then get a better Lisp compiler. The Lisp compiler can ignore
declarations and the C compiler can do lousy instruction scheduling &
register allocation. In fact, the C compiler can sometimes produce
*slower* code when certain optimizations are *enabled*. BFD.
There are also Lisp compilers which do type inference and sometimes
produce code which runs faster than comparable C code. Bigloo &
Stalin come to mind for scheme, and I think that CMUCL does this
sometimes for Common Lisp.
--
Harvey J. Stein
Berger Financial Research
·······@bfr.co.il
From: Andi Kleen
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <m3vhsys1ib.fsf@fred.muc.de>
·······@bfr.co.il (Harvey J. Stein) writes:
> There are also Lisp compilers which do type inference and sometimes
> produce code which runs faster than comparable C code. Bigloo &
> Stalin come to mind for scheme, and I think that CMUCL does this
> sometimes for Common Lisp.
How can bigloo or Stalin produce faster code than C when they both compile
into C?
-Andi
From: Erik Naggum
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <3100173493999671@naggum.no>
* Andi Kleen
| How can bigloo or Stalin produce faster code than C when they both
| compile into C?
the same way some C compilers produce faster and better assembly code
than humans could do even though they produce assembly code: by going way
beyond the mental limitations of the human mind in the code produced,
such as the number of "open issues" that can be dealt with. for the most
part, this is a purely mechanical limitation, so if a programmer can make
another program (such as a compiler) juggle the open issues for him, he
can deal with more of them. since C is an amazingly stupid language, the
number of open issues for even an excellent C programmer are strictly
limited to a small constant number. (the human consciousness is reported
to be able to deal with 7 simultaneously open issues, which vary from the
concrete, like something to watch out for, to the very abstract, like a free
variable. most programming languages do not optimize for the use of
these "slots" in the human psyche, and so can vary greatly in how many
are required at any time. compilers, however, do not have to worry about
such slots, and can have any number of issues open at a time, and can
thus make what would appear to a human being to be highly intelligent
conclusions; recall the references to the "sufficiently smart compiler".)
#:Erik
--
religious cult update in light of new scientific discoveries:
"when we cannot go to the comet, the comet must come to us."
> | How can bigloo or Stalin produce faster code than C when they both
> | compile into C?
>
> the same way some C compilers produce faster and better assembly code
> than humans could do even though they produce assembly code: by going way
> beyond the mental limitations of the human mind in the code produced,
Let me give a very concrete example. Suppose that one wanted to write a
numerical integration procedure. The natural way to do this in Scheme (and
Common Lisp) is as a higher-order procedure:
(define (integrate-1d l u f)
(let ((d (/ (- u l) 8.0)))
(* (+ (* (f l) 0.5)
(f (+ l d))
(f (+ l (* 2.0 d)))
(f (+ l (* 3.0 d)))
(f (+ l (* 4.0 d)))
(f (- u (* 3.0 d)))
(f (- u (* 2.0 d)))
(f (- u d))
(* (f u) 0.5))
d)))
Now suppose that one wanted to compute the value of a double integral of some
function. The natural way to do this in Scheme (and Common Lisp) is by making
nested calls to the one-dimensional numerical-integration procedure defined
above:
(define (integrate-2d l1 u1 l2 u2 f)
(integrate-1d
l2 u2 (lambda (y) (integrate-1d l1 u1 (lambda (x) (f x y))))))
(define (zark u v) (integrate-2d 0.0 u 0.0 v (lambda (x y) (* x y))))
Likewise, the natural way to do this by hand in C is by using function
pointers:
double integrate_1d(double l, double u, double (*f)(double))
{ double d = (u-l)/8.0;
return ((*f)(l)*0.5+
(*f)(l+d)+
(*f)(l+2.0*d)+
(*f)(l+3.0*d)+
(*f)(l+4.0*d)+
(*f)(u-3.0*d)+
(*f)(u-2.0*d)+
(*f)(u-d)+
(*f)(u)*0.5)*d;}
double integrate_2d(double l1, double u1, double l2, double u2,
double (*f)(double, double))
{ double outer(double y)
{ double inner(double x) {return (*f)(x, y);}
return integrate_1d(l1, u1, inner);}
return integrate_1d(l2, u2, outer);}
double zark(double u, double v)
{ double zarkon(double x, double y) {return x*y;}
return integrate_2d(0.0, u, 0.0, v, zarkon);}
(Note that the above C code uses GNU C extensions, i.e. nested functions,
to simulate downward funargs, i.e. lambdas.)
Now, when one runs a benchmark written around the above Scheme code, vs. the
same benchmark written around the above C code, on my machine (a Goliath
motherboard with two 200MHz P6s each with 512K L2 running Red Hat 4.1, kernel
2.0.33, gcc 2.7.2.1), the Scheme code runs more than 21 times faster than the
C code (1.08 CPU sec for Scheme vs. 23.27 CPU sec for C). This is the case
even though the Scheme code was compiled into C with Stalin and both the
Stalin-generated C code and the hand-written C code were compiled with the
same C compiler with the same options (-O2).
The reason why is obvious. No C compiler that I am aware of will in-line
across an indirect function call. Yet Stalin will. So Stalin in-lines
INTEGRATE-1D, INTEGRATE-2D, and the lambda-expression analogues of the
nested C functions inner, outer, and zarkon into ZARK. Stalin generates one C
function for ZARK. Not only does this eliminate all of the function-call
overhead, it also allows the C compiler to do register allocation and
instruction scheduling across a much larger fragment of code. As this example
shows, these effects combined can make a huge difference in performance.
Thanks to Richard O'Keefe for giving me this example.
Jeff (http://www.neci.nj.nec.com/homepages/qobi)
Jeffrey Mark Siskind <····@research.nj.nec.com> writes:
> (Note that the above C code uses GNU C extensions, i.e. nested functions,
> to simulate downward funargs, i.e. lambdas.)
I don't think it's quite fair to use GNU-C nested functions in a
benchmark on C... They are convenient to use, but the trampoline code
that is generated probably slows down function calls through the
trampoline _significantly_. What figures do you get if you pass
pointers to local data explicitly instead of relying on GNU-C? Without
trampolines, the C compiler still won't eliminate the function calls,
but it will at least use standard C calling conventions which are
reasonably fast.
Details, if anybody is interested... The GNU-C trampoline code is
implemented differently on different machines, I guess, but on the
single machine where I have looked at it, taking a pointer to a nested
function (if that nested function references variables in the
containing scope) is quite complicated:
First, an executable trampoline function is generated on the
stack(!), and the address of this trampoline function is used. When
the trampoline function is called, the trampoline will copy the
referenced stack frame and call the real function. This is an
interesting hack, but the overhead for calling through the function
pointer is much higher than for ordinary C functions.
/Niels
From: Matthew McDonald
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <mafm.890923965@cs.uwa.edu.au>
·······@bfr.co.il (Harvey J. Stein) writes:
>> Even a stupid C compiler can typically produce code that's within
>> spitting distance of what's produced by an optimising compiler,
>> because the language is basically a portable assembler. The same can't
[...]
>Very untrue. Although -O2 with gcc on intel typically reduces run
>time by a factor of 1.2 to 1.5, I've seen egcs give a factor
>significantly larger than 2 on a big project.
I'm sorry, but I think you're quibbling. The post you're replying to
(and the ones it referenced) said that for the purposes of
argument, "close" meant performance that was within a factor of
approximately two, and "not close" meant a factor of >= four.
>To substantiate some of this, here're the application runtimes when
>compiled with egcs & with gcc on linux intel:
[...]
>Also, if you assume that gcc -O2
>gives a 30% reduction in run time over unoptimized code then egcs -O3
>is giving slightly better than a factor of 2.
As I said in the post you're replying to, the difference between optimised
and unoptimised C is usually not much more than a factor of two (IME).
It may be significant that three articles in this thread have
included vague claims that "significantly" larger differences are sometimes
observed, but all the actual timings reported show factors of < 2.
[...]
>important. It will be almost impossible to write fast assembler by
>hand for Merced, let alone do it with unoptimized C code.
Obviously, C compilers have to work harder to get good performance, as
processors deviate further from the PDP-11. Lisp compilers will too.
Some languages *do* optimise better than C on modern or parallel
machines, but they seem to be mainly languages like FORTRAN and
SISAL, that are simpler than C in several obvious and important
respects. As far as I know, Common Lisp isn't one of these languages
and there's no recent machine on which a common lisp compiler reliably
produces faster code than the standard C compiler.
>As for optimizing floating point code in Lisp, one's largely just
>adding in all the declarations that one had to start with in C, so I
>don't see your point about it being so much more trouble. Especially
>when most of the time is spent in a small section of the code, where
I agree that with an appropriate compiler, you can usually get reasonable
performance with a reasonable amount of effort. However, (IME) you need to
have read the manual for your implementation fairly carefully, and the
declarations you add for one implementation are unlikely to be *equally*
*useful* for some randomly chosen other implementation.
-- Matthew McDonald <····@cs.uwa.edu.au>
Jeffrey Mark Siskind <····@research.nj.nec.com> writes:
>
> Now, when one runs a benchmark written around the above Scheme code, vs. the
> same benchmark written around the above C code, on my machine (a Goliath
> motherboard with two 200MHz P6s each with 512K L2 running Red Hat 4.1, kernel
> 2.0.33, gcc 2.7.2.1), the Scheme code runs more than 21 times faster than the
> C code (1.08 CPU sec for Scheme vs. 23.27 CPU sec for C). This is the case
> even though the Scheme code was compiled into C with Stalin and both the
> Stalin-generated C code and the hand-written C code were compiled with the
> same C compiler with the same options (-O2).
>
Could you please post your benchmarks? I tried calling this code a million
times and got only a factor-of-2 difference (Scheme is faster).
Jeffrey Mark Siskind <····@research.nj.nec.com> writes:
> Enclosed is the handwritten C code and the Stalin-generated C code. This was
> generated by the development version of Stalin. Since polyvariance is
> currently broken in the development compiler, I simulated polyvariance
> manually by giving it the enclosed Scheme program. Note that even though that
> Scheme program is horrendous, the code generated is very good. Stalin flattens
> it into a single C function. And even though there are many copies of the
> parameters F1 ... F9, they are all eliminated because they are fictitious.
Thank you, I was using Stalin-0.7
From: Bulent Murtezaoglu
Subject: Re: Floating Point speed in Common Lisp
Date:
Message-ID: <87k99z14x8.fsf@isttest.bogus>
>>>>> "RT" == Raymond Toy <···@rtp.ericsson.se> writes:
RT> Here is the no-output result:
RT> Evaluation took:
RT>   6.38 seconds of real time
RT>   6.1 seconds of user run time
RT>   0.1 seconds of system run time
RT>   [Run times include 0.11 seconds GC run time]
RT>   0 page faults and
RT>   9944752 bytes consed.
After incorporating your suggestions and a few other minor things
the best I can do over here is
Evaluation took:
5.41 seconds of real time
5.14 seconds of user run time
0.14 seconds of system run time
[Run times include 0.36 seconds GC run time]
0 page faults and
8300816 bytes consed.
(array stuffing version with CMUCL)
Here's what I couldn't get to work (ie gave up after a cursory look):
inlining unit-vector seems to break type inference somehow (either that or
there's some subtlety I don't understand right now).
I tried explicitly defining the structs as vectors hoping to save some
more space but apparently unnamed vector structs cannot be used as types
with the name given to defstruct in declarations. It seemed too ugly
to declare points and spheres explicitly as vectors, so I didn't go for
that workaround.
BM
Bulent Murtezaoglu <·····@servtech.com> writes:
>
> inlining unit-vector seems to break type inference somehow (either that or
> there's some subtlety I don't understand right now).
This should work. I'll take a peek at it.
>
> I tried explicitly defining the structs as vectors hoping to save some
> more space but apparently unnamed vector structs cannot be used as types
> with the name given to defstruct in declarations. It seemed too ugly
> to declare points and spheres explicitly as vectors, so I didn't go for
> that workaround.
Converting the structs to vectors might help, but I think structure
access is already quite fast, and it probably doesn't reduce consing.
Ray
Bulent Murtezaoglu wrote in message <··············@isttest.bogus>...
>
>Ok here are my results using both the Graham's original, Raymond Toy's
>code and my modifications to RT's code for both ACL and CMUCL.
>Hardware: Homegrown system, PentiumPro 180 overclocked to 231 MHz with
>256K L2 cache and 128M EDO RAM sitting on a 77 MHz bus.
>
>OS: Linux 2.0.32. Fileserver on NFS over 10-base-2 (slow server, lightly
>loaded network).
>
>Lisp:
>ACL is Allegro Common Lisp Linux version fully patched up to November 97
>CMUCL is the version dated 30/01/98 off ftp.cons.org (Peter VanEynde's
>binary kit)
>Original code + global (declaim (optimize (speed 3) (safety 0) (debug 0)))
>
>       Real Time  User     System  GC time  cons     other
>ACL    185,948    182,750  600     2,370    507,649  2,238,455,232
>CMUCL  285,890    266,300  16,050  50,430   1,422,346,560
>
>RT's code as posted (w/ the bug but I did add one #+ to get allegro to load it)
>
>       Real Time  User     System  GC time  cons     other
>ACL    158,726    154,930  650     4,550    512,941  -19,566,024 (!)
>CMUCL  117,530    99,230   15,340  45,910   1,288,595,072
Here are the results for Harlequin LispWorks 4.0.1 with latest patches on a
Pentium Pro
200 MHz with 80 Mo RAM
Compiled with (proclaim '(optimize (debug 0) (safety 0) (speed 3)(float 3)))
CL-USER 15 > (extended-time (ray-test 5))
757.5 seconds used.
Standard Allocation 3397066608 bytes.
Fixlen Allocation 9912 bytes.
total gc activity = 172.106
main promote ( 4 calls) = 0.110
mark and sweep (13235 calls) = 171.996
internal promote ( 4 calls) = 0.000
promote ( 0 calls) = 0.000
fixup ( 4 calls) = 0.16
compact ( 0 calls) = 0.000
NIL
>RT's code as posted (w/ the bug but I did add one #+ to get allegro to load it)
CL-USER 16 > (extended-time (ray-test-no 5))
613.9 seconds used.
Standard Allocation 4535355248 bytes.
Fixlen Allocation 1056 bytes.
total gc activity = 187.337
main promote ( 4 calls) = 1.346
mark and sweep (17683 calls) = 185.991
internal promote ( 5 calls) = 0.94
promote ( 0 calls) = 0.000
fixup ( 6 calls) = 0.438
compact ( 0 calls) = 0.000
NIL
Marc Battyani
On Thu, 12 Mar 1998 20:29:14 +0100, "Marc Battyani"
<·············@email.msn.com> wrote:
> Here are the results for Harlequin LispWorks 4.0.1 with latest patches on a
> Pentium Pro
> 200 MHz with 80 Mo RAM
> Compiled with (proclaim '(optimize (debug 0) (safety 0) (speed 3)(float 3)))
^^^^^^^
NB "Float" is a misnomer for "float-safety" in LispWorks, so you really want
"(float 0)". Also this won't make much difference in the current PC release
anyway. A future PC release will include floating point optimization
comparable with LispWorks on Unix.
__Jason
Jason Trenouth wrote in message <···················@newshost>...
>On Thu, 12 Mar 1998 20:29:14 +0100, "Marc Battyani"
><·············@email.msn.com> wrote:
>
>> Here are the results for Harlequin LispWorks 4.0.1 with latest patches on a
>> Pentium Pro 200 MHz with 80 Mo RAM
>> Compiled with (proclaim '(optimize (debug 0) (safety 0) (speed 3)(float 3)))
> ^^^^^^^
>
>NB "Float" is a misnomer for "float-safety" in LispWorks, so you really
want
>"(float 0)". Also this won't make much difference in the current PC release
Oops, sorry.
I tried with (float 0) and you were right: it does not seem to change much.
BTW I didn't find "float" in the LWW doc (section 7.4 Compiler Control).
>anyway. A future PC release will include floating point optimization
>comparable with LispWorks on Unix.
>
>__Jason
That's good news! We do OpenGL and signal processing apps in LWW so this
would be quite welcome.
Marc Battyani