First, I must say I'm rather new to Common Lisp, although I've had
some experience with Scheme and functional languages. I have most
experience with C and C++, but I've been generally frustrated with
these languages for some time now, and searching for alternatives.
Anyway, to the point: I've written a small program to make an image of
the Mandelbrot set in Common Lisp. It's mostly numerical calculation
with floating-point numbers. I started measuring the performance of
this little program across different implementations. I always used
the same parameters: a 640x480 image, 255 max iterations, and a fixed
region of the complex plane. All implementations took something around
100 to 200 seconds to calculate.
Then I made a C version to get a reference for performance. It was
written similarly (as far as possible) to the CL version, and with no
optimizations (as with the CL version). It ran at about 1.5 seconds.
Having seen evidence of CL code doing numerical computations almost as
fast as (or sometimes faster than) C code, using CMUCL, I decided I
could improve the performance of this little program. I added type
declarations and (declare (optimize speed)). The program compiled in
CMUCL dropped to 2.4 seconds execution time. I was really impressed.
But then I tried to compile and run the same program with other
implementations, and none of them showed similar performance gains.
Actually, most didn't show any difference. Here are some approximate
results:
C version (gcc 3.2)                 1.5 s
CMUCL (Linux)                       2.4 s
CLISP (Linux)                        70 s
CLISP (Windows)                     100 s
Corman Lisp 2.0 (Win) Trial         150 s
Allegro CL 6.2 (Win) Trial          170 s
LispWorks 4.2 Personal Ed. (Win)    120 s
So, finally, what I want to ask is: is this brutal difference in
performance (between CMUCL and the others) to be expected, or am I
doing something wrong with the other implementations? I tried to use
(proclaim '(optimize (speed 3) (safety 0) (debug 0))) at the top
level, but it didn't do much. Is there something I can do to improve
performance in the other implementations?
I want to do mostly graphics-related programs in CL, and they will use
floating-point computations heavily. It is especially bad for me if no
Windows implementation can attain reasonable performance for this
kind of task. Performance two orders of magnitude worse than C is
terribly impractical and discouraging.
Thanks for any help.
[...]
AAF> So, finally, what I want to ask is: is this brutal difference
AAF> in performance (between CMUCL and others) to be expected
AAF> anyway, or am I doing something wrong with other
AAF> implementations ?
We did a collective optimization session right here in cll some years ago
with Allegro and CMUCL on some ray tracing code and while CMUCL seemed
faster, there wasn't a 'brutal' difference. You are probably overlooking
something simple; if you could post the code, we could probably help.
AAF> I tried to use (proclaim '(optimize (speed
AAF> 3) (safety 0) (debug 0))) at the top level, but it didn't do
AAF> much.
Did you type it at the top-level prompt, or did you add it to the files
to be compiled? PROCLAIM is a function, and the compiler does nothing
special to make the proclamation apply to the file it is compiling
when it appears at top level. Use DECLAIM instead (or you could
wrap your PROCLAIM in an EVAL-WHEN).
AAF> Is there something I can do to improve performance in
AAF> the other implementations ? [...]
Probably, tough to tell w/o seeing the code.
cheers,
BM
>>>>> "Andrei" == Andrei de A, Formiga <·····@gdnmail.net> writes:
Andrei> Bulent Murtezaoglu <··@acm.org> wrote in message news:<··············@acm.org>...
>>
>> Probably, tough to tell w/o seeing the code.
>>
>> cheers,
>>
>> BM
Andrei> Ok, here is the code (after optimizations). CMUCL also wants me to
Andrei> declare complex numbers as (complex single-float), but some
Andrei> implementations give an error when I do that.
Since CMUCL has unboxed (complex single-float) and (complex
double-float), it will speed things up quite a bit if you do. I think
implementations that give an error for such declarations are broken.
Ray
From: Nils Goesche
Subject: Re: Performance and optimizations
Date:
Message-ID: <lysmvlcdsr.fsf@cartan.de>
·····@gdnmail.net (Andrei de A, Formiga) writes:
> Bulent Murtezaoglu <··@acm.org> wrote in message news:<··············@acm.org>...
> > Probably, tough to tell w/o seeing the code.
> Ok, here is the code (after optimizations).
The C code would also be nice (for reference).
> CMUCL also wants me to declare complex numbers as (complex
> single-float), but some implementations give an error when I do
> that.
Why don't you use double-floats?
> (declaim (optimize (speed 3) (debug 0) (safety 0)))
Normally, doing this globally is not such a good idea, but I guess in
microbenchmarks like this it doesn't matter. You could also declaim
(optimize (space 0) (compilation-speed 0)) in addition. And when
dealing with Lispworks and floats, you should also
(optimize #+lispworks (float 0))
which will give you quite a performance boost. There is also
FIXNUM-SAFETY, but that wouldn't be such a good idea here...
Otherwise I haven't looked at the code yet. Maybe I will later. Just
one thing: You can put /several/ declarations into one DECLARE
expression.
Regards,
--
Nils Gösche
"Don't ask for whom the <CTRL-G> tolls."
PGP key ID 0x0655CFA0
On Wed, Jan 22, 2003 at 07:24:03PM +0100, Matthias Heiler wrote:
> Hello, does replacing the tail-recursion in iterate-point by (do ...)
> or (loop ...) make any difference?
First I fixed up a few of the types, then I changed the squaring to use
expt. This gave pretty good performance in CMUCL considering the
full-call to iterate-point (I think). I moved everything into one large
labels and this dropped a few ms off the time (already under a second).
When I replaced the tail-recursion with loop, I gained another 10%
performance boost. So yes, it does make a difference in CMUCL =)
Allegro, as expected, is still not performing well. I have found in the
past that optimizations that suited CMUCL did not always work out with
Allegro. In my experience, spending a bit of time with the Allegro
profiler usually yielded great performance boosts. Perhaps I will get
around to that later.
--
; Matthew Danish <·······@andrew.cmu.edu>
; OpenPGP public key: C24B6010 on keyring.debian.org
; Signed or encrypted mail welcome.
; "There is no dark side of the moon really; matter of fact, it's all dark."
Matthew Danish <·······@andrew.cmu.edu> wrote in message news:<·····················@lain.cheme.cmu.edu>...
> [...]
Please, can I see your version? Moving everything into a large
labels and replacing the tail recursion with loop is easy, but I'd
like to see what you have done with the types ("I fixed up a few of
the types")...
I'll try the Allegro profiler, if it works in the trial edition. I
had thought about trying the CMUCL profiler, but didn't have the time.
Also, as some people probably noticed, the bottleneck is the
iterate-point function and the functions it calls. My first version
of this function was iterative (using dotimes), but then I changed it
to a tail-recursive version in the hope that implementations would
take advantage of tail-call optimization. Execution times remained
more or less the same, so I haven't changed it back. The optimizations
I've tried in the mandelbrot function didn't make much of an impact on
the overall performance, as expected.
By the way, thanks to everyone who responded. I'm using another
computer right now, but when I get home I'll post the C code too for
anyone interested.
---
Andrei de A. Formiga
On Wed, Jan 22, 2003 at 05:51:06PM -0800, Andrei de A, Formiga wrote:
> Matthew Danish <·······@andrew.cmu.edu> wrote in message news:<·····················@lain.cheme.cmu.edu>...
> > [...]
>
> Please, can I see your version ? Moving everything to a large
> labels and replacing the tail-recursion with loop is easy, but I'd
> want to see what you have done with the types ("I fixed up a few of
> the types")...
I don't think I did anything but change complex to (complex
single-float) and explicitly type the floating-point constants
(0.0 -> 0.0f0).
> Also, as some people probably noticed, the bottleneck is the
> iterate-point function, and the functions it calls. My first version
> of this function was iterative (using dotimes) but then I changed to a
> tail recursive version in the hopes of implementations taking
> advantage of tail recursion optimization. Execution times remained the
> same, more or less, so I haven't changed it back.
Tail-recursion optimization does not make a tail-recursive algorithm
faster than the iterative version; it makes them equivalent. The
optimization refers to the fact that the compiler detects that it does
not need a new stack frame when making the call and can use a simple
branch instead. The advantage is that you can express certain
algorithms more clearly.
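To make the equivalence concrete, here is a minimal sketch in C (the thread's reference language), with hypothetical function names: the same escape-time count written tail-recursively and as a plain loop. A compiler that eliminates tail calls turns the recursive version's call into a branch, producing essentially the loop.

```c
#include <assert.h>

/* Hypothetical sketch: the escape-time count written tail-recursively
   and iteratively.  With tail-call elimination the recursive version
   needs no new stack frame per step -- the call becomes a jump -- so
   the two compile to essentially the same code. */
static int iterate_rec(double re, double im, double re0, double im0,
                       int left, int max)
{
    double r2 = re * re, i2 = im * im;
    if (r2 + i2 >= 4.0 || left == 0)
        return max - left;
    /* tail position: nothing remains to do after this call */
    return iterate_rec(re0 + r2 - i2, im0 + 2.0 * re * im,
                       re0, im0, left - 1, max);
}

static int iterate_loop(double re0, double im0, int max)
{
    double re = re0, im = im0;
    for (int iter = 0; iter < max; iter++) {
        double r2 = re * re, i2 = im * im;
        if (r2 + i2 >= 4.0)
            return iter;
        double t = re0 + r2 - i2;      /* z(n+1) = z(n)^2 + z0 */
        im = im0 + 2.0 * re * im;
        re = t;
    }
    return max;
}
```

Both return the same count for every point; the only difference is whether the compiler or the programmer writes the branch.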
And while CMUCL, and most CL compilers, can detect tail-recursion
reasonably well, I suspect that using iteration is likely to lead to
better optimizations in many cases simply because, unlike Schemers,
CLers use iteration constructs a great deal. Tail-recursion
``optimization'' isn't mandated by the standard, and traditionally CLers
did not use it.
(declaim (optimize (speed 3) (debug 0) (safety 0)))

(defconstant +squared-limit+ 4.0f0)

(defun mandelbrot (width height max-iterations
                   &key (real-limits '(-2.25f0 0.75f0))
                        (imag-limits '(-1.25f0 1.25f0)))
  (declare (fixnum width height max-iterations))
  (labels ((norm-squared (z)
             (declare (optimize speed))
             (declare (type (complex single-float) z))
             (let ((real-part (realpart z))
                   (imag-part (imagpart z)))
               (declare (single-float real-part))
               (declare (single-float imag-part))
               (the single-float
                    (+ (expt real-part 2)
                       (expt imag-part 2)))))
           (iterate-point (z max-iter)
             (declare (type (complex single-float) z))
             (declare (fixnum max-iter))
             (labels (#+nil
                      (iterate-recursive (zn iter)
                        (declare (optimize speed))
                        (declare (type (complex single-float) zn))
                        (declare (fixnum iter))
                        (cond ((or (>= (norm-squared zn)
                                       +squared-limit+)
                                   (zerop iter))
                               (- max-iter iter))
                              (t (iterate-recursive (+ (expt zn 2) z)
                                                    (1- iter))))))
               #+nil
               (iterate-recursive z max-iter)
               (let ((zn z))
                 (declare (type (complex single-float) zn))
                 (loop for iter of-type fixnum from max-iter downto 0
                       until (>= (norm-squared zn)
                                 +squared-limit+)
                       do (setf zn (+ (expt zn 2) z))
                       finally (return (- max-iter iter)))))))
    (let* ((real-min (first real-limits))
           (real-max (second real-limits))
           (imag-min (first imag-limits))
           (imag-max (second imag-limits))
           (real-step (/ (- real-max real-min) width))
           (imag-step (/ (- imag-max imag-min) height))
           (z (complex real-min imag-min))
           (result (make-array (list width height) :element-type 'fixnum)))
      (declare (optimize speed))
      (declare (single-float real-min real-max imag-min imag-max))
      (declare (single-float real-step imag-step))
      (declare (type (complex single-float) z))
      (dotimes (x width result)
        (declare (fixnum x))
        (dotimes (y height)
          (declare (fixnum y))
          (let ((count (iterate-point z max-iterations)))
            (if (= count max-iterations)
                (setf (aref result x y) 0)
                (setf (aref result x y) count)))
          (incf z (complex 0f0 imag-step)))
        (setf z (complex (+ (realpart z) real-step) imag-min))
        (format t "~&Column ~D finished~%" x)))))
In article <··············@bird.agharta.de>, Edi Weitz <···@agharta.de> wrote:
>Hmm, last time I checked C didn't have complex numbers, so you
>probably implemented complex arithmetic yourself, didn't you?
Check again. I think C99 added complex numbers.
--
Barry Margolin, ······@genuity.net
Genuity, Woburn, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.
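For reference, the inner loop with C99's <complex.h> looks roughly like the following. This is a hypothetical sketch, not the OP's code; whether gcc 3.2's still-"broken" complex support compiles it efficiently is exactly the question discussed below.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical escape-time iteration using C99 complex arithmetic.
   creal/cimag extract the parts; z*z + z0 is the Mandelbrot step. */
static int iterate_c99(double complex z0, int max)
{
    double complex z = z0;
    for (int iter = 0; iter < max; iter++) {
        /* |z|^2 >= 4 means the point has escaped */
        if (creal(z) * creal(z) + cimag(z) * cimag(z) >= 4.0)
            return iter;
        z = z * z + z0;
    }
    return max;
}
```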
Barry Margolin <······@genuity.net> writes:
> In article <··············@bird.agharta.de>, Edi Weitz <···@agharta.de> wrote:
> >Hmm, last time I checked C didn't have complex numbers, so you
> >probably implemented complex arithmetic yourself, didn't you?
>
> Check again. I think C99 added complex numbers.
Whoops, that shows my ignorance.
However, as the OP said he used gcc, my assumption might be correct. At
least support for complex numbers in gcc 3.1/3.2 is officially still
"broken":
<http://gcc.gnu.org/gcc-3.1/c99status.html>
Edi.
Edi Weitz <···@agharta.de> writes:
> However, as the OP said he used gcc my assumption might be
> correct. At least support for complex numbers in gcc 3.1/3.2 is
> officially still "broken":
>
> <http://gcc.gnu.org/gcc-3.1/c99status.html>
If you read the "Further notes" section, you'll see what exactly they
mean by 'broken'. He may very well have used it.
Marius
--
<URL: http://home.ifi.uio.no/~mariube/ >
Life sucks. Death swallows.
Edi Weitz <···@agharta.de> writes:
> (declaim (inline my-iterate-point))
> (defun my-iterate-point (re-z im-z max-iter)
>   (declare (optimize speed
>                      (safety 0)
>                      (space 0)
>                      (debug 0)
>                      (compilation-speed 0)
>                      #+:lispworks (float 0)))
>   (declare (single-float re-z im-z)
>            (fixnum max-iter))
>   (loop for iter of-type fixnum from max-iter downto 0
>         for re-zn of-type single-float = re-z then (+ re-z (- (* re-zn^2) (* im-zn^2)))
>         and im-zn of-type single-float = im-z then (+ im-z (* 2 re-zn im-zn))
>         for re-zn^2 of-type single-float = (* re-zn re-zn)
>         for im-zn^2 of-type single-float = (* im-zn im-zn)
>         if (>= (+ re-zn^2 im-zn^2) +squared-limit+)
>           do (return (- max-iter iter))
>         finally (return 0)))
The function above has, as a remnant from my editing the code, two
unnecessary calls to * with only one argument. If you replace
(- (* RE-ZN^2) (* IM-ZN^2))
with
(- RE-ZN^2 IM-ZN^2)
LispWorks is down from about 30 seconds to around 4.5 seconds and the
consing is reduced by a factor of ten:
CL-USER 10 > (time (defparameter *m* (my-mandelbrot 640 480 255)))
Timing the evaluation of (DEFPARAMETER *M* (MY-MANDELBROT 640 480 255))
user time = 4.469
system time = 0.000
Elapsed time = 0:00:05
Allocation = 284121600 bytes standard / 22 bytes fixlen
0 Page faults
*M*
My bad, obviously, but I'm not impressed. I would have expected the
compiler to be smart enough to optimize the calls to * away itself.
Cheers,
Edi.
Edi Weitz <···@agharta.de> writes:
> [...]
Back again, talking to myself. It turns out that LW is not very clever
as far as FP arithmetic is concerned. You have to explicitly help the
compiler avoid boxing the results of calls to, say, - or *:
(defun my-iterate-point (re-z im-z max-iter)
  (declare (optimize speed
                     (safety 0)
                     (space 0)
                     (debug 0)
                     (compilation-speed 0)
                     #+:lispworks (float 0)))
  (declare (double-float re-z im-z)
           (fixnum max-iter))
  (loop for iter of-type fixnum from 0 below max-iter
        for re-zn of-type double-float = re-z then (+ re-z (the double-float (- re-zn^2 im-zn^2)))
        and im-zn of-type double-float = im-z then (+ im-z (the double-float (* 2.0s0 re-zn im-zn)))
        for re-zn^2 of-type double-float = (* re-zn re-zn)
        for im-zn^2 of-type double-float = (* im-zn im-zn)
        if (>= (the double-float (+ re-zn^2 im-zn^2)) +squared-limit+)
          do (return iter)
        finally (return 0)))

(defun my-mandelbrot (width height max-iter
                      &key
                      (re-min -2.25)
                      (re-max 0.75)
                      (im-min -1.25)
                      (im-max 1.25))
  (declare (optimize speed
                     (safety 0)
                     (space 0)
                     (debug 0)
                     (compilation-speed 0)
                     #+:lispworks (float 0)))
  (declare (fixnum width height max-iter)
           (double-float re-min re-max im-min im-max))
  (let ((re-step (/ (- re-max re-min) width))
        (im-step (/ (- im-max im-min) height))
        (result (make-array (list width height) :element-type 'fixnum :initial-element 0)))
    (loop for x of-type fixnum below width
          for re-z of-type double-float from re-min by re-step
          do (loop for y of-type fixnum below height
                   for im-z of-type double-float from im-min by im-step
                   do (setf (aref result x y) (my-iterate-point re-z im-z max-iter))))
    result))
Now we're down to 2.6 seconds:
CL-USER 39 > (time (defparameter *m* (my-mandelbrot 640 480 255)))
Timing the evaluation of (DEFPARAMETER *M* (MY-MANDELBROT 640 480 255))
user time = 2.598
system time = 0.002
Elapsed time = 0:00:02
Allocation = 1229184 bytes standard / 22 bytes fixlen
0 Page faults
*M*
Think I'll go to bed now...
Edi.
In article <····························@posting.google.com>,
·····@gdnmail.net (Andrei de A, Formiga) wrote:
>C version (gcc 3.2) 1.5 s
>CMUCL (linux) 2.4 s
Okay, these are both compiled.
>CLISP (linux) 70 s
>CLISP (Windows) 100 s
I think CLISP uses a byte-code interpreter.
>Corman Lisp 2.0 (Win) Trial 150 s
>Allegro CL 6.2 (Win) Trial 170 s
>LispWorks 4.2 Personal Ed.(Win) 120 s
Trial editions are often performance-crippled.
Maybe the conclusion is to compare apples to apples.
The (compiled) C program and the (compiled) CMUCL
program are in the same ballpark.
> Trial editions are often performance-crippled.
I don't think this is true of any of the commercial CL trial editions,
though. They have heap limits and sometimes runtime limits, but I think
the compilers are the same. I know this is (or was recently) true for
some of the above systems from having looked at the compiled results...
--tim
··········@tfeb.org (Tim Bradshaw) writes:
> > Trial editions are often performance-crippled.
>
> I don't think this is true of any of the commercial Cl trial
> editions, though. They have heap limits and sometimes runtime
> limits, but I think the compilers are the same. I know this is (or
> was recently) true for some of the above systems from having looked
> at the compiled results...
Almost surely true. We used to talk about giving away a
performance-crippled Lisp, but it makes a terrible sales vehicle. People
have the presupposition that Lisp is slow, so if you give them a sample
that confirms this, they tend to be unimpressed. Most crippling is, as
indicated, around heap and runtime limits and/or the ability to deliver
standalone images, since these are usually needed for business but not
for legitimate user trials.
In article <···············@shell01.TheWorld.com>, Kent M Pitman
<······@world.std.com> wrote:
>[...]
Thanks guys. I've never used a trial version and was
just speculating. I stand corrected.
On Thu, Jan 23, 2003 at 07:09:05PM -0800, Andrei de A, Formiga wrote:
> if (cnt == iter)
> res[y * xn + x] = 0;
> else
> res[y * xn + x] = cnt;
Looking at this code, and its output on the screen, made me wonder
what the point of this if statement is. If it is replaced with just
res[y * xn + x] = cnt; then it draws just fine. In the CL, the
original iterate-point counted down to zero and then subtracted from
the max; somehow this made everything work out. When I switched to
counting from 0, I got the same poor graphical result as the C version,
so I removed the if statement, as above, and it works again.
Matthew Danish <·······@andrew.cmu.edu> wrote in message news:<·····················@lain.cheme.cmu.edu>...
> On Thu, Jan 23, 2003 at 07:09:05PM -0800, Andrei de A, Formiga wrote:
> > if (cnt == iter)
> > res[y * xn + x] = 0;
> > else
> > res[y * xn + x] = cnt;
>
> Looking at this code, and the output of this onto the screen, made me
> wonder what the point of this if statement is. If replaced with
> just res[y * xn + x] = cnt; then it draws just fine. In the CL, the
> original iterate-point counted down to zero, and then subtracted from
> the max. Somehow this caused everything to work out. When I switched
> to counting from 0, I obtained the same poor graphical result as the C.
> So I removed the IF statement, similar to the above, and it works again.
Hm... I tried it both ways in the CL version, and it worked the same.
The CL version I actually use is a little different: the results of
the Mandelbrot set computations are transformed to RGB colors (using a
color-coding function passed as an argument), stored in an rgb-image
structure, and then encoded to BMP format and written to disk.
The point of the iterative algorithm is to find whether the point
"diverges" or not under the iteration equations; applied to all the
points in a region of the complex plane, this traces the fractal
border. When the point diverges, the function iterate_point (iterate
in the C version) should return the number of iterations done before
reaching the limit (+squared-limit+). However, if the point reaches
the maximum number of iterations, the corresponding entry is set to 0,
a value unlikely to be found near the border, which helps distinguish
the inner and outer regions. So, in the end, we have an iteration
count for all points, or 0 for the points inside the border (0 also
appears when the norm of the complex number is already greater than
the limit, but this happens only far from the border). The next step
to get an image is to transform these counts to colors, as mentioned.
But anyway, there should be few points that end up diverging at
exactly the last iteration before reaching the maximum, and all this
is a matter of convention, so taking the if out should work too. My
recursive version of iterate-point counted backwards from the C
iterative version, so the returned value is max - count instead of
simply count.
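In C, the convention described above boils down to something like this hypothetical sketch (not my actual code; names are illustrative):

```c
#include <assert.h>

/* Hypothetical sketch of the convention described above: iterate
   returns the escape count (max if the point never escapes), and
   interior points are then mapped to 0 so they contrast with the
   high counts near the border. */
static int iterate(double re0, double im0, int max)
{
    double re = re0, im = im0;
    for (int cnt = 0; cnt < max; cnt++) {
        double r2 = re * re, i2 = im * im;
        if (r2 + i2 >= 4.0)
            return cnt;                /* diverged: outside the set */
        double t = re0 + r2 - i2;      /* z(n+1) = z(n)^2 + z0 */
        im = im0 + 2.0 * re * im;
        re = t;
    }
    return max;                        /* never diverged: inside */
}

static int pixel_value(double re0, double im0, int max)
{
    int cnt = iterate(re0, im0, max);
    return cnt == max ? 0 : cnt;       /* the `if` under discussion */
}
```

Note that 0 is produced both for interior points and for points that escape on the very first check, which is the far-from-the-border case mentioned above.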
---
Andrei de A. Formiga