First, I must say I'm rather new to Common Lisp, although I've had
some experience with Scheme and functional languages. I have most
experience with C and C++, but I've been generally frustrated with
these languages for some time now, and searching for alternatives.
Anyway, to the point: I've written a small program to make an image of
the Mandelbrot set in Common Lisp. It's mostly numerical calculation
with floating-point numbers. I started measuring the performance of
this little program across different implementations. I always used
the same parameters: a 640x480 image, 255 max iterations, and a fixed
region of the complex plane. All implementations took something around
100 to 200 seconds to calculate.
Then I made a C version to get a reference for performance. It was
written similarly (as far as possible) to the CL version, and with no
optimizations (as with the CL version). It ran at about 1.5 seconds.
Having seen evidence of CL code doing numerical computations almost as
fast as (or sometimes faster than) C code, using CMUCL, I decided I
could improve the performance of this little program. I added type
declarations and (declare (optimize speed)). The program compiled in
CMUCL dropped to 2.4 seconds execution time. I was really impressed.
But then I tried to compile and run the same program with other
implementations, and none of them showed similar performance gains.
Actually, most didn't show any difference. Here are some approximate
results:
C version (gcc 3.2)                 1.5 s
CMUCL (Linux)                       2.4 s
CLISP (Linux)                        70 s
CLISP (Windows)                     100 s
Corman Lisp 2.0 (Win) Trial         150 s
Allegro CL 6.2 (Win) Trial          170 s
LispWorks 4.2 Personal Ed. (Win)    120 s
So, finally, what I want to ask is: is this brutal difference in
performance (between CMUCL and the others) to be expected, or am I
doing something wrong with the other implementations? I tried to use
(proclaim '(optimize (speed 3) (safety 0) (debug 0))) at the top
level, but it didn't do much. Is there something I can do to improve
performance in the other implementations?
I want to do mostly graphics-related programs in CL, and they will use
floating-point computations heavily. It is especially bad for me if no
Windows implementation can attain reasonable performance for this
kind of task. Performance two orders of magnitude worse than C is
terribly impractical and discouraging.
Thanks for any help.
[...]
AAF> So, finally, what I want to ask is: is this brutal difference
AAF> in performance (between CMUCL and others) to be expected
AAF> anyway, or am I doing something wrong with other
AAF> implementations ?
We did a collective optimization session right here in cll some years ago
with Allegro and CMUCL on some ray tracing code and while CMUCL seemed
faster, there wasn't a 'brutal' difference. You are probably overlooking
something simple; if you could post the code, we could probably help.
AAF> I tried to use (proclaim '(optimize (speed
AAF> 3) (safety 0) (debug 0))) at the top level, but it didn't do
AAF> much.
Did you type it at the top-level prompt, or did you add it to the files
to be compiled? PROCLAIM is a function, and the compiler does nothing
special to make the proclamation apply to the file it is compiling
when it appears at top level. Use DECLAIM instead (or you could
wrap your PROCLAIM in an EVAL-WHEN).
AAF> Is there something I can do to improve performance in
AAF> the other implementations ? [...]
Probably, tough to tell w/o seeing the code.
cheers,
BM
>>>>> "Andrei" == Andrei de A, Formiga <·····@gdnmail.net> writes:
Andrei> Bulent Murtezaoglu <··@acm.org> wrote in message news:<··············@acm.org>...
>>
>> Probably, tough to tell w/o seeing the code.
>>
>> cheers,
>>
>> BM
Andrei> Ok, here is the code (after optimizations). CMUCL also wants me to
Andrei> declare complex numbers as (complex single-float), but some
Andrei> implementations give an error when I do that.
Since CMUCL has unboxed (complex single-float) and (complex
double-float), it will speed things up quite a bit if you do. I think
implementations that give an error for such declarations are broken.
Ray
From: Nils Goesche
Subject: Re: Performance and optimizations
Date:
Message-ID: <lysmvlcdsr.fsf@cartan.de>
·····@gdnmail.net (Andrei de A, Formiga) writes:
> Bulent Murtezaoglu <··@acm.org> wrote in message news:<··············@acm.org>...
> > Probably, tough to tell w/o seeing the code.
> Ok, here is the code (after optimizations).
The C code would also be nice (for reference).
> CMUCL also wants me to declare complex numbers as (complex
> single-float), but some implementations give an error when I do
> that.
Why don't you use double-floats?
> (declaim (optimize (speed 3) (debug 0) (safety 0)))
Normally, doing this globally is not such a good idea, but I guess in
microbenchmarks like this it doesn't matter. You could also declaim
(optimize (space 0) (compilation-speed 0)) in addition. And when
dealing with Lispworks and floats, you should also
(optimize #+lispworks (float 0))
which will give you quite a performance boost. There is also
FIXNUM-SAFETY, but that wouldn't be such a good idea here...
Otherwise I haven't looked at the code yet. Maybe I will later. Just
one thing: You can put /several/ declarations into one DECLARE
expression.
Regards,
--
Nils Gösche
"Don't ask for whom the <CTRL-G> tolls."
PGP key ID 0x0655CFA0
On Wed, Jan 22, 2003 at 07:24:03PM +0100, Matthias Heiler wrote:
> Hello, does replacing the tail-recursion in iterate-point by (do ...)
> or (loop ...) make any difference?
First I fixed up a few of the types, then I changed the squaring to use
expt. This gave pretty good performance in CMUCL considering the
full-call to iterate-point (I think). I moved everything into one large
labels and this dropped a few ms off the time (already under a second).
When I replaced the tail-recursion with loop, I gained another 10%
performance boost. So yes, it does make a difference in CMUCL =)
Allegro, as expected, is still not performing well. I have found in the
past that optimizations that suited CMUCL did not always work out with
Allegro. In my experience, spending a bit of time with the Allegro
profiler usually yielded great performance boosts. Perhaps I will get
around to that later.
--
; Matthew Danish <·······@andrew.cmu.edu>
; OpenPGP public key: C24B6010 on keyring.debian.org
; Signed or encrypted mail welcome.
; "There is no dark side of the moon really; matter of fact, it's all dark."
Matthew Danish <·······@andrew.cmu.edu> wrote in message news:<·····················@lain.cheme.cmu.edu>...
> [...]
Please, can I see your version? Moving everything into a large
labels and replacing the tail recursion with loop is easy, but I'd
like to see what you have done with the types ("I fixed up a few of
the types")...
I'll try the Allegro profiler, if it works in the trial edition. I
had thought about trying the CMUCL profiler, but didn't have the time.
Also, as some people probably noticed, the bottleneck is the
iterate-point function and the functions it calls. My first version
of this function was iterative (using dotimes), but then I changed it
to a tail-recursive version in the hope that implementations would
take advantage of tail-call optimization. Execution times remained
more or less the same, so I haven't changed it back. The optimizations
I've tried in the mandelbrot function didn't make much of an impact on
the overall performance, as expected.
By the way, thanks to everyone who responded. I'm using another
computer right now, but when I get home I'll post the C code too for
anyone interested.
---
Andrei de A. Formiga
On Wed, Jan 22, 2003 at 05:51:06PM -0800, Andrei de A, Formiga wrote:
> Matthew Danish <·······@andrew.cmu.edu> wrote in message news:<·····················@lain.cheme.cmu.edu>...
> > [...]
>
> Please, can I see your version ? Moving everything to a large
> labels and replacing the tail-recursion with loop is easy, but I'd
> want to see what you have done with the types ("I fixed up a few of
> the types")...
I don't think I did anything but change complex to (complex
single-float) and explicitly type the floating-point constants
(0.0 -> 0.0f0).
> Also, as some people probably noticed, the bottleneck is the
> iterate-point function, and the functions it calls. My first version
> of this function was iterative (using dotimes) but then I changed to a
> tail recursive version in the hopes of implementations taking
> advantage of tail recursion optimization. Execution times remained the
> same, more or less, so I haven't changed it back.
Tail-recursion optimization does not make a tail-recursive algorithm
faster than the iterative version; it makes them equivalent. The
optimization refers to the fact that the compiler detects that it does
not need a new stack frame when making the call and can use a simple
branch instead. The advantage is that you can express certain
algorithms more clearly.
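To make the equivalence concrete, here is a minimal sketch in C (the thread's reference language), with hypothetical function names: the same escape-time count written tail-recursively and as a plain loop. A compiler that eliminates tail calls turns the recursive version's call into a branch, producing essentially the loop.

```c
#include <assert.h>

/* Hypothetical sketch: the escape-time count written tail-recursively
   and iteratively.  With tail-call elimination the recursive version
   needs no new stack frame per step -- the call becomes a jump -- so
   the two compile to essentially the same code. */
static int iterate_rec(double re, double im, double re0, double im0,
                       int left, int max)
{
    double r2 = re * re, i2 = im * im;
    if (r2 + i2 >= 4.0 || left == 0)
        return max - left;
    /* tail position: nothing remains to do after this call */
    return iterate_rec(re0 + r2 - i2, im0 + 2.0 * re * im,
                       re0, im0, left - 1, max);
}

static int iterate_loop(double re0, double im0, int max)
{
    double re = re0, im = im0;
    for (int iter = 0; iter < max; iter++) {
        double r2 = re * re, i2 = im * im;
        if (r2 + i2 >= 4.0)
            return iter;
        double t = re0 + r2 - i2;      /* z(n+1) = z(n)^2 + z0 */
        im = im0 + 2.0 * re * im;
        re = t;
    }
    return max;
}
```

Both return the same count for every point; the only difference is whether the compiler or the programmer writes the branch.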
And while CMUCL, and most CL compilers, can detect tail-recursion
reasonably well, I suspect that using iteration is likely to lead to
better optimizations in many cases simply because, unlike Schemers,
CLers use iteration constructs a great deal. Tail-recursion
``optimization'' isn't mandated by the standard, and traditionally CLers
did not use it.
(declaim (optimize (speed 3) (debug 0) (safety 0)))

(defconstant +squared-limit+ 4.0f0)

(defun mandelbrot (width height max-iterations
                   &key (real-limits '(-2.25f0 0.75f0))
                        (imag-limits '(-1.25f0 1.25f0)))
  (declare (fixnum width height max-iterations))
  (labels ((norm-squared (z)
             (declare (optimize speed))
             (declare (type (complex single-float) z))
             (let ((real-part (realpart z))
                   (imag-part (imagpart z)))
               (declare (single-float real-part))
               (declare (single-float imag-part))
               (the single-float
                    (+ (expt real-part 2)
                       (expt imag-part 2)))))
           (iterate-point (z max-iter)
             (declare (type (complex single-float) z))
             (declare (fixnum max-iter))
             (labels (#+nil
                      (iterate-recursive (zn iter)
                        (declare (optimize speed))
                        (declare (type (complex single-float) zn))
                        (declare (fixnum iter))
                        (cond ((or (>= (norm-squared zn)
                                       +squared-limit+)
                                   (zerop iter))
                               (- max-iter iter))
                              (t (iterate-recursive (+ (expt zn 2) z)
                                                    (1- iter))))))
               #+nil
               (iterate-recursive z max-iter)
               (let ((zn z))
                 (declare (type (complex single-float) zn))
                 (loop for iter of-type fixnum from max-iter downto 0
                       until (>= (norm-squared zn)
                                 +squared-limit+)
                       do (setf zn (+ (expt zn 2) z))
                       finally (return (- max-iter iter)))))))
    (let* ((real-min (first real-limits))
           (real-max (second real-limits))
           (imag-min (first imag-limits))
           (imag-max (second imag-limits))
           (real-step (/ (- real-max real-min) width))
           (imag-step (/ (- imag-max imag-min) height))
           (z (complex real-min imag-min))
           (result (make-array (list width height) :element-type 'fixnum)))
      (declare (optimize speed))
      (declare (single-float real-min real-max imag-min imag-max))
      (declare (single-float real-step imag-step))
      (declare (type (complex single-float) z))
      (dotimes (x width result)
        (declare (fixnum x))
        (dotimes (y height)
          (declare (fixnum y))
          (let ((count (iterate-point z max-iterations)))
            (if (= count max-iterations)
                (setf (aref result x y) 0)
                (setf (aref result x y) count)))
          (incf z (complex 0f0 imag-step)))
        (setf z (complex (+ (realpart z) real-step) imag-min))
        (format t "~&Column ~D finished~%" x)))))
In article <··············@bird.agharta.de>, Edi Weitz <···@agharta.de> wrote:
>Hmm, last time I checked C didn't have complex numbers, so you
>probably implemented complex arithmetic yourself, didn't you?
Check again. I think C99 added complex numbers.
--
Barry Margolin, ······@genuity.net
Genuity, Woburn, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.
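For reference, the inner loop with C99's <complex.h> looks roughly like the following. This is a hypothetical sketch, not the OP's code; whether gcc 3.2's still-"broken" complex support compiles it efficiently is exactly the question discussed below.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical escape-time iteration using C99 complex arithmetic.
   creal/cimag extract the parts; z*z + z0 is the Mandelbrot step. */
static int iterate_c99(double complex z0, int max)
{
    double complex z = z0;
    for (int iter = 0; iter < max; iter++) {
        /* |z|^2 >= 4 means the point has escaped */
        if (creal(z) * creal(z) + cimag(z) * cimag(z) >= 4.0)
            return iter;
        z = z * z + z0;
    }
    return max;
}
```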
Barry Margolin <······@genuity.net> writes:
> In article <··············@bird.agharta.de>, Edi Weitz <···@agharta.de> wrote:
> >Hmm, last time I checked C didn't have complex numbers, so you
> >probably implemented complex arithmetic yourself, didn't you?
>
> Check again. I think C99 added complex numbers.
Whoops, that shows my ignorance.
However, as the OP said he used gcc, my assumption might be correct. At
least support for complex numbers in gcc 3.1/3.2 is officially still
"broken":
<http://gcc.gnu.org/gcc-3.1/c99status.html>
Edi.
Edi Weitz <···@agharta.de> writes:
> However, as the OP said he used gcc my assumption might be
> correct. At least support for complex numbers in gcc 3.1/3.2 is
> officially still "broken":
>
> <http://gcc.gnu.org/gcc-3.1/c99status.html>
If you read the "Further notes" section, you'll see what exactly they
mean by 'broken'. He may very well have used it.
Marius
--
<URL: http://home.ifi.uio.no/~mariube/ >
Life sucks. Death swallows.
Edi Weitz <···@agharta.de> writes:
> (declaim (inline my-iterate-point))
> (defun my-iterate-point (re-z im-z max-iter)
>   (declare (optimize speed
>                      (safety 0)
>                      (space 0)
>                      (debug 0)
>                      (compilation-speed 0)
>                      #+:lispworks (float 0)))
>   (declare (single-float re-z im-z)
>            (fixnum max-iter))
>   (loop for iter of-type fixnum from max-iter downto 0
>         for re-zn of-type single-float = re-z then (+ re-z (- (* re-zn^2) (* im-zn^2)))
>         and im-zn of-type single-float = im-z then (+ im-z (* 2 re-zn im-zn))
>         for re-zn^2 of-type single-float = (* re-zn re-zn)
>         for im-zn^2 of-type single-float = (* im-zn im-zn)
>         if (>= (+ re-zn^2 im-zn^2) +squared-limit+)
>           do (return (- max-iter iter))
>         finally (return 0)))
The function above has, as a remnant from my editing the code, two
unnecessary calls to * with only one argument. If you replace
(- (* RE-ZN^2) (* IM-ZN^2))
with
(- RE-ZN^2 IM-ZN^2)
LispWorks is down from about 30 seconds to around 4.5 seconds and the
consing is reduced by a factor of ten:
CL-USER 10 > (time (defparameter *m* (my-mandelbrot 640 480 255)))
Timing the evaluation of (DEFPARAMETER *M* (MY-MANDELBROT 640 480 255))
user time = 4.469
system time = 0.000
Elapsed time = 0:00:05
Allocation = 284121600 bytes standard / 22 bytes fixlen
0 Page faults
*M*
My bad, obviously, but I'm not impressed. I would have expected the
compiler to be smart enough to optimize the calls to * away itself.
Cheers,
Edi.
Edi Weitz <···@agharta.de> writes:
> [...]
Back again, talking to myself. It turns out that LW is not very clever
as far as FP arithmetic is concerned. You have to explicitly help the
compiler avoid boxing the results of calls to, say, - or *:
(defun my-iterate-point (re-z im-z max-iter)
  (declare (optimize speed
                     (safety 0)
                     (space 0)
                     (debug 0)
                     (compilation-speed 0)
                     #+:lispworks (float 0)))
  (declare (double-float re-z im-z)
           (fixnum max-iter))
  (loop for iter of-type fixnum from 0 below max-iter
        for re-zn of-type double-float = re-z then (+ re-z (the double-float (- re-zn^2 im-zn^2)))
        and im-zn of-type double-float = im-z then (+ im-z (the double-float (* 2.0s0 re-zn im-zn)))
        for re-zn^2 of-type double-float = (* re-zn re-zn)
        for im-zn^2 of-type double-float = (* im-zn im-zn)
        if (>= (the double-float (+ re-zn^2 im-zn^2)) +squared-limit+)
          do (return iter)
        finally (return 0)))

(defun my-mandelbrot (width height max-iter
                      &key
                      (re-min -2.25)
                      (re-max 0.75)
                      (im-min -1.25)
                      (im-max 1.25))
  (declare (optimize speed
                     (safety 0)
                     (space 0)
                     (debug 0)
                     (compilation-speed 0)
                     #+:lispworks (float 0)))
  (declare (fixnum width height max-iter)
           (double-float re-min re-max im-min im-max))
  (let ((re-step (/ (- re-max re-min) width))
        (im-step (/ (- im-max im-min) height))
        (result (make-array (list width height) :element-type 'fixnum :initial-element 0)))
    (loop for x of-type fixnum below width
          for re-z of-type double-float from re-min by re-step
          do (loop for y of-type fixnum below height
                   for im-z of-type double-float from im-min by im-step
                   do (setf (aref result x y) (my-iterate-point re-z im-z max-iter))))
    result))
Now we're down to 2.6 seconds:
CL-USER 39 > (time (defparameter *m* (my-mandelbrot 640 480 255)))
Timing the evaluation of (DEFPARAMETER *M* (MY-MANDELBROT 640 480 255))
user time = 2.598
system time = 0.002
Elapsed time = 0:00:02
Allocation = 1229184 bytes standard / 22 bytes fixlen
0 Page faults
*M*
Think I'll go to bed now...
Edi.
In article <····························@posting.google.com>,
·····@gdnmail.net (Andrei de A, Formiga) wrote:
>C version (gcc 3.2) 1.5 s
>CMUCL (linux) 2.4 s
Okay, these are both compiled.
>CLISP (linux) 70 s
>CLISP (Windows) 100 s
I think CLISP uses a byte-code interpreter.
>Corman Lisp 2.0 (Win) Trial 150 s
>Allegro CL 6.2 (Win) Trial 170 s
>LispWorks 4.2 Personal Ed.(Win) 120 s
Trial editions are often performance-crippled.
Maybe the conclusion is to compare apples to apples.
The (compiled) C program and the (compiled) CMUCL
program are in the same ballpark.
> Trial editions are often performance-crippled.
I don't think this is true of any of the commercial CL trial editions,
though. They have heap limits and sometimes runtime limits, but I think
the compilers are the same. I know this is (or was recently) true for
some of the above systems from having looked at the compiled results...
--tim
··········@tfeb.org (Tim Bradshaw) writes:
> > Trial editions are often performance-crippled.
>
> I don't think this is true of any of the commercial Cl trial
> editions, though. They have heap limits and sometimes runtime
> limits, but I think the compilers are the same. I know this is (or
> was recently) true for some of the above systems from having looked
> at the compiled results...
Almost surely true. We used to talk about giving away a
performance-crippled Lisp, but it makes a terrible sales vehicle. People
have the presupposition that Lisp is slow, so if you give them a sample
that confirms this, they tend to be unimpressed. Most crippling is, as
indicated, around heap and runtime limits and/or the ability to deliver
standalone images, since these are usually needed for business but not
for legitimate user trials.
In article <···············@shell01.TheWorld.com>, Kent M Pitman
<······@world.std.com> wrote:
>[...]
Thanks guys. I've never used a trial version and was
just speculating. I stand corrected.
On Thu, Jan 23, 2003 at 07:09:05PM -0800, Andrei de A, Formiga wrote:
> if (cnt == iter)
> res[y * xn + x] = 0;
> else
> res[y * xn + x] = cnt;
Looking at this code, and its output on the screen, made me wonder
what the point of this if statement is. If it is replaced with just
res[y * xn + x] = cnt; then it draws just fine. In the CL, the
original iterate-point counted down to zero and then subtracted from
the max; somehow this made everything work out. When I switched to
counting from 0, I got the same poor graphical result as the C version,
so I removed the if statement, as above, and it works again.
Matthew Danish <·······@andrew.cmu.edu> wrote in message news:<·····················@lain.cheme.cmu.edu>...
> On Thu, Jan 23, 2003 at 07:09:05PM -0800, Andrei de A, Formiga wrote:
> > if (cnt == iter)
> > res[y * xn + x] = 0;
> > else
> > res[y * xn + x] = cnt;
>
> Looking at this code, and the output of this onto the screen, made me
> wonder what the point of this if statement is. If replaced with
> just res[y * xn + x] = cnt; then it draws just fine. In the CL, the
> original iterate-point counted down to zero, and then subtracted from
> the max. Somehow this caused everything to work out. When I switched
> to counting from 0, I obtained the same poor graphical result as the C.
> So I removed the IF statement, similar to the above, and it works again.
Hm... I tried it both ways in the CL version, and it worked the same.
The CL version I actually use is a little different: the results of
the Mandelbrot set computations are transformed to RGB colors (using a
color-coding function passed as an argument), stored in an rgb-image
structure, and then encoded to BMP format and written to disk.
The point of the iterative algorithm is to find whether the point
"diverges" or not under the iteration equations; applied to all the
points in a region of the complex plane, this traces the fractal
border. When the point diverges, the function iterate_point (iterate
in the C version) should return the number of iterations done before
reaching the limit (+squared-limit+). However, if the point reaches
the maximum number of iterations, the corresponding entry is set to 0,
a value unlikely to be found near the border, which helps distinguish
the inner and outer regions. So, in the end, we have an iteration
count for all points, or 0 for the points inside the border (0 also
appears when the norm of the complex number is already greater than
the limit, but this happens only far from the border). The next step
to get an image is to transform these counts to colors, as mentioned.
But anyway, there should be few points that end up diverging at
exactly the last iteration before reaching the maximum, and all this
is a matter of convention, so taking the if out should work too. My
recursive version of iterate-point counted backwards from the C
iterative version, so the returned value is max - count instead of
simply count.
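In C, the convention described above boils down to something like this hypothetical sketch (not my actual code; names are illustrative):

```c
#include <assert.h>

/* Hypothetical sketch of the convention described above: iterate
   returns the escape count (max if the point never escapes), and
   interior points are then mapped to 0 so they contrast with the
   high counts near the border. */
static int iterate(double re0, double im0, int max)
{
    double re = re0, im = im0;
    for (int cnt = 0; cnt < max; cnt++) {
        double r2 = re * re, i2 = im * im;
        if (r2 + i2 >= 4.0)
            return cnt;                /* diverged: outside the set */
        double t = re0 + r2 - i2;      /* z(n+1) = z(n)^2 + z0 */
        im = im0 + 2.0 * re * im;
        re = t;
    }
    return max;                        /* never diverged: inside */
}

static int pixel_value(double re0, double im0, int max)
{
    int cnt = iterate(re0, im0, max);
    return cnt == max ? 0 : cnt;       /* the `if` under discussion */
}
```

Note that 0 is produced both for interior points and for points that escape on the very first check, which is the far-from-the-border case mentioned above.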
---
Andrei de A. Formiga