From: Alan Walker
Subject: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <v9k8orn73qf497@corp.supernews.com>
G'day All,
    Judging from the title, this task ought to be a push-over.  We have some
people at work comparing Java & C++ who "discovered" that Java is a lot
slower than C++; one of their tests was a matrix multiply.  So, I thought to
myself, if you want speed, elegance and garbage collection, why not write it
in Lisp?  It's part of my ongoing campaign.....
    The problem is, try as I might, I just can't figure out the type
declarations to make it run fast.  I'm using ACL 5.0 Personal Edition on
Win2K.  I've fiddled with things from Paul Graham's "ANSI Common Lisp", Ch
13, here's what I've coded so far:

(proclaim '(optimize (speed 3) (space 0) (compilation-speed 0) (debug 0)
(safety 0)))

(defun matmul (mat1 mat2 n)
   (declare (type (simple-array fixnum) mat1 mat2))
   (declare (fixnum n))
   (let ((mat3 (make-array (list n n) :element-type 'fixnum
                           :initial-element 0))
         (sum 0))
      (declare (type (simple-array fixnum) mat3))
      (declare (fixnum sum))
      (dotimes (i n)
         (declare (fixnum i))
         (dotimes (j n)
            (declare (fixnum j))
            (setf sum 0)
            (dotimes (k n)
               (declare (fixnum k))
               (incf sum (the fixnum (* (aref mat1 i k) (aref mat2 k j)))))
            (setf (aref mat3 i j) sum)))
      mat3))

(defun test-it (size)
   (declare (fixnum size))
   (let ((a1 (make-array (list size size) :element-type 'fixnum
                         :initial-element 3))
         (a2 (make-array (list size size) :element-type 'fixnum
                         :initial-element 4)))
      (matmul a1 a2 size)))
(time (test-it 10))

This uses >30K cons cells and as the matrix gets bigger, say 100x100, it's
dog slow.  Even Java blows it away.  This is embarrassing (well, at least to
me).  I'd like to think it's something silly that I've overlooked.

Alan.

From: Paul F. Dietz
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <AO-dnVcuiK-CtwejXTWcow@dls.net>
Alan Walker wrote:

> This uses >30K cons cells and as the matrix gets bigger, say 100x100, it's
> dog slow.

Did you remember to compile it?
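
A minimal sketch of one way to do that (assuming the definitions live in a
file called matmul.lisp; COMPILE and COMPILE-FILE are standard CL):

(compile 'matmul)
(compile 'test-it)
;; or, compiling the whole file and loading the result:
(load (compile-file "matmul.lisp"))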

	Paul
From: Alan Walker
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <v9kc4uqqtgaab1@corp.supernews.com>
In the words of the great Homer Simpson.....

Doh!

But that doesn't help much.

Compiling it reduced the cons cells from >30K to about 10.  OTOH, it's still
painfully slow.  A 200x200 matrix takes 3.89 seconds, a 500x500 takes about
a minute, on a 1GHz Athlon.  My C++ code does 500x500 in less than a second,
and Java is about a second, too.

I disassembled the code and it appears to be making function calls for AREF,
both for getting and setting elements.  I do see integer multiply and integer
add instructions in the code, so it's part way there.

Alan.


"Paul F. Dietz" <·····@dls.net> wrote in message
···························@dls.net...
> Alan Walker wrote:
>
> > This uses >30K cons cells and as the matrix gets bigger, say 100x100,
it's
> > dog slow.
>
> Did you remember to compile it?
>
> Paul
>
>
From: Mario S. Mommer
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <fzd6jpczc5.fsf@cupid.igpm.rwth-aachen.de>
"Alan Walker" <·························@charter.net> writes:
> Compiling it reduced the cons cells from >30K to about 10.  OTOH, it's still
> painfully slow.  A 200x200 matrix takes 3.89 seconds, a 500x500 takes about
> a minute, on a 1GHz Athlon.  My C++ code does 500x500 in less than a second,
> and Java is about a second, too.
> 
> I disassembled the code and it appears to be making function calls for AREF,
> both for getting and setting elements.  I do see integer multiply and integer
> add instructions in the code, so it's part way there.

Before you go on to conclude that CL is slow because of this, please
remember that this might well be a weakness of your implementation of
Common Lisp.

I think CMUCL and SBCL (at least) do inline this sort of stuff if
enough type information is provided and the optimize options ask for
it.
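
One quick way to check is to look at the compiled code yourself;
DISASSEMBLE is standard CL, though its output format differs per
implementation (a sketch):

(compile 'matmul)
(disassemble 'matmul)
;; look for out-of-line calls to AREF / (SETF AREF) in the inner loop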

Regards,
        Mario.
From: Jeff Caldwell
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <EF2na.28733$D31.4104252@news1.news.adelphia.net>
Alan Walker wrote:
...

> a 500x500 takes about
> a minute, on a 1GHz Athlon.  My C++ code does 500x500 in less than a second,
> and Java is about a second, too.

I've got two runs here. The summary:
1) A 500x500 fixnum multiply takes 7 seconds (1.1GHz PIII Mobile).
2) A 300x300 double-float multiply takes 29 seconds. The example includes
     lots of comments where I included, then excluded, declarations and
     GC stuff. I added or removed exactly one feature at a time, pretty
     much in the order they appear in the source.  Each such feature is
     annotated, including the time cost or saving from the inclusion
     of the declaration or GC-controlling call.

Conclusions about #2:  It is semi-useless for me to intuit what will
                        or will not improve performance.  For example,
                        declaring fixnum after dotimes improved
                        performance in one place and made performance
                        worse in the next!?  Removing some of the
                        declarations in the code as posted on Usenet
                        improved performance!?  Also, the default GC
                        is smarter than I am.  (For example, I tried
                        creating the arrays in generation 1 to leave
                        more room in gen 0 for the 1.7GB of allocation.)


Question re fixnum vs. double-float:

A 300x300 fixnum yielded:
;user time    =      0.751
;Allocation   = 1082256 bytes standard / 3465 bytes fixlen

A 300x300 double-float yielded:
;user time    =     28.240
;Allocation   = 1730166560 bytes standard / 3707 bytes fixlen

That's 1598 times the memory and 37 times the time.

What's going on in the double-float that it has to use 1598 times more 
memory than the fixnum?  I suspect much of the time increase is devoted 
to the allocation -- yes, no?


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

; Lispworks 4.2.7 Professional Edition

(proclaim '(optimize (speed 3) (space 0) (compilation-speed 0) (debug 0)
                      (safety 0)))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Example 1
;

(defun matmul (mat1 mat2 n)
   (declare (type (simple-array fixnum (* *)) mat1 mat2)
            (type fixnum n))
   (let ((mat3 (make-array (list n n)
                 :element-type 'fixnum :initial-element 0))
         (sum 0))
     (declare (type (simple-array fixnum (* *)) mat3)
              (type fixnum sum))
     (dotimes (i n)
       (dotimes (j n)
         (setf (aref mat3 i j)
               (progn
                 (setf sum 0)
                 (dotimes (k n)
                   (incf sum (the fixnum (* (aref mat1 i k)
                                            (aref mat2 k j)))))
                 sum))))
     mat3))

(defun test-it (size)
   (declare (fixnum size))
   (let ((a1 (make-array (list size size) :element-type 'fixnum
                         :initial-element 3))
         (a2 (make-array (list size size) :element-type 'fixnum
                         :initial-element 4)))
     (aref (matmul a1 a2 size) 0 0)))

(compile 'matmul)
(compile 'test-it)

CL-USER 251 > (time (test-it 500))
Timing the evaluation of (TEST-IT 500)

user time    =      6.990
system time  =      0.010
Elapsed time =   0:00:07
Allocation   = 3002192 bytes standard / 3696 bytes fixlen
0 Page faults
6000

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
; Example 2
;

(defun matmul-fp (mat1 mat2 n)
   (declare (type (simple-array double-float (* *)) mat1 mat2)) ;; saves 5.498 seconds
   ;(declare (fixnum n)) ;; costs 2.834 seconds
   (let ((mat3 (make-array (list n n)
                           :element-type 'double-float
                           :initial-element 0.0d0))
         (sum 0.0d0))
     (declare (type (simple-array double-float (* *)) mat3)) ;; saves .281 seconds
     ;(declare (double-float sum)) ;; costs 1.032 seconds
     ;(allocation-in-gen-num 0 ;; see allocation-in-gen-num in test-it-fp
       (dotimes (i n)
         (declare (fixnum i)) ;; saves 0.120 seconds
         ;(mark-and-sweep 0) ;; costs 0.311 seconds
         (dotimes (j n)
           ;(declare (fixnum j)) ;; costs 0.070 seconds
           ;(mark-and-sweep 0) ;; costs 104.210 seconds
           (setf sum 0.0d0)
           (dotimes (k n)
             (declare (fixnum k)) ; saves .030 seconds
             ;(incf sum (the double-float (* (aref mat1 i k) (aref mat2 k j)))) ;; 'the double-float' costs .06 seconds
             (incf sum (* (aref mat1 i k) (aref mat2 k j))))
           (setf (aref mat3 i j) sum)))
     mat3))

(defun test-it-fp (size)
   (declare (fixnum size))
   ;(with-heavy-allocation.. ;; costs 20.340 seconds
   ;(allocation-in-gen-num 1 ;; costs 2.183 seconds (see allocation-in-gen-num in matmul-fp)
   ;(set-default-generation 1) ;; costs 5.689 seconds
   (let ((a1 (make-array (list size size) :element-type 'double-float
                         :initial-element 3.0d0))
         (a2 (make-array (list size size) :element-type 'double-float
                         :initial-element 4.0d0)))
     ;(set-default-generation 0) ;; see (set-default-generation 1)
     (matmul-fp a1 a2 size)
     (values)))
(compile 'matmul-fp)
(compile 'test-it-fp)

;CL-USER 201 > (time (test-it-fp 300))
;Timing the evaluation of (TEST-IT-FP 300)
;
;user time    =     28.240
;system time  =      0.030
;Elapsed time =   0:00:28
;Allocation   = 1730166560 bytes standard / 3707 bytes fixlen
;0 Page faults
From: Jochen Schmidt
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <b7iibc$4hi$04$1@news.t-online.com>
Jeff Caldwell wrote:

[ ... ]
> 2) A 300x300 double-float takes 29 seconds.
[ ... ]
> What's going on in the double-float that it has to use 1598 times more
> memory than the fixnum?  I suspect much of the time increase is devoted
> to the allocation -- yes, no?

I've tested your code here.
Try adding (optimize (float 0))
and uncomment the double-float declaration of "sum" in matmul-fp.

You might be surprised...

ciao,
Jochen
From: Jeff Caldwell
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <5pbna.430$392.79725@news1.news.adelphia.net>
Jochen Schmidt wrote:
> Jeff Caldwell wrote:
>> 2) A 300x300 double-float takes 29 seconds.

> Try adding (optimize (float 0))
> and uncomment the double-float declaration of "sum" in matmul-fp.
> 
> You might be surprised...
> 
> ciao,
> Jochen
> 

CL-USER 255 > (time (test-it-fp 300))
Timing the evaluation of (TEST-IT-FP 300)

user time    =      2.032
system time  =      0.010
Elapsed time =   0:00:02
Allocation   = 2161968 bytes standard / 3069 bytes fixlen
0 Page faults

CL-USER 262 > (time (test-it-fp 500))
Timing the evaluation of (TEST-IT-FP 500)

user time    =      9.794
system time  =      0.010
Elapsed time =   0:00:10
Allocation   = 6002112 bytes standard / 3872 bytes fixlen
0 Page faults


Heh - Thank you, Jochen. That's a dramatic result and I've learned 
something. This is a very good day.

Adding (optimize (float 0)) by itself cut it to 16.663 seconds.  It's 
interesting to me that the double-float declaration of "sum" was actually 
a cost, not a saving, without (optimize (float 0)), and yet with the 
optimize the time was cut by a factor of 8.

Jeff
From: JP Massar
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <3e9dda21.17166562@netnews.attbi.com>
On Wed, 16 Apr 2003 05:22:37 +0200, Jochen Schmidt <···@dataheaven.de>
wrote:
 
>
>I've tested your code here.
>Try adding (optimize (float 0))
>and uncomment the double-float declaration of "sum" in matmul-fp.
>
 
What is 

(optimize (float 0))

supposed to do/mean?

(Not generically; I mean with respect to the LispWorks compiler.)
From: Jochen Schmidt
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <b7ks5s$plu$07$1@news.t-online.com>
JP Massar wrote:

> On Wed, 16 Apr 2003 05:22:37 +0200, Jochen Schmidt <···@dataheaven.de>
> wrote:
>  
>>
>>I've tested your code here.
>>Try adding (optimize (float 0))
>>and uncomment the double-float declaration of "sum" in matmul-fp.
>>
>  
> What is
> 
> (optimize (float 0))
> 
> supposed to do/mean ?
> 
> (Not generically, I mean wrt Lispworks compiler)

In the LispWorks User Guide:

"The float declaration allows generation of more efficient code using float
numbers. It reduces allocation during float calculations. It is best used
with safety 0."

Another very interesting LispWorks declaration is "fixnum-safety",
which lets you switch gradually into a "fixnum-only arithmetic" mode:
the arithmetic operators then work only on fixnums.  The default value,
3, gives normal Common Lisp semantics; lower values enable this fixnum
mode.  The higher the fixnum-safety level (2 or 1), the more checking is
done (checking that arguments are fixnums and testing for overflow);
declaring fixnum-safety 0 removes all checking.
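
A sketch of how these LispWorks-specific qualities might be combined on a
small fixnum loop (FIXSUM is a made-up example, not from this thread; note
that with fixnum-safety 0 a sum that overflows a fixnum is simply wrong):

(defun fixsum (v n)
  (declare (optimize (speed 3) (safety 0) (fixnum-safety 0))
           (type (simple-array fixnum (*)) v)
           (fixnum n))
  (let ((s 0))
    (declare (fixnum s))
    ;; fixnum-safety 0: + is assumed to take and return fixnums,
    ;; with no argument checks and no overflow tests
    (dotimes (i n s)
      (incf s (aref v i)))))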

ciao,
Jochen
From: Jochen Schmidt
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <b7kt3k$oge$03$1@news.t-online.com>
Jochen Schmidt wrote:

> In the LispWorks User Guide:
> 
> "The float declaration allows generation of more efficient code using
> float numbers. It reduces allocation during float calculations. It is best
> used with safety 0."

Just to follow this up, in case it is not clear what this implies...

Setting float to 0 reduces allocation because arithmetic is done on
unboxed floats.  If one uses unboxed values, the debugger is no longer
able to show them properly; some may know this from when the LispWorks
debugger shows :dont-know as the value of some binding.  Maybe this
answers questions like why such an effective optimization is not on by
default.
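
For a concrete picture, a sketch of the kind of loop where (optimize
(float 0)) pays off (DOT is an illustrative example, not from this
thread):

(defun dot (x y n)
  ;; float 0 is LispWorks-specific; together with the full double-float
  ;; declarations it lets the products and the running sum stay unboxed.
  (declare (optimize (speed 3) (safety 0) (float 0))
           (type (simple-array double-float (*)) x y)
           (fixnum n))
  (let ((s 0.0d0))
    (declare (double-float s))
    (dotimes (i n s)
      (incf s (* (aref x i) (aref y i))))))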

ciao,
Jochen
From: Duane Rettig
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <4vfxhtry2.fsf@beta.franz.com>
"Alan Walker" <·························@charter.net> writes:

>           It's part of my ongoing campaign.....
>     The problem is, try as I might, I just can't figure out the type
> declarations to make it run fast.  I'm using ACL 5.0 Personal Edition on
> Win2K.  I've fiddled with things from Paul Graham's "ANSI Common Lisp", Ch
> 13, here's what I've coded so far:

5.0?  Wow.  That's old.  Perhaps you should consider upgrading to 6.2; the 
Trial costs you nothing.  If you're supported, then your fastest route to
speed is via ····@franz.com, which isn't just for bug reports.

  [ ... ]

>  I'd like to think it's something silly that I've overlooked.

Others have pointed out the obvious ("compile it"), but one thing not quite
so obvious is these declarations:

>    (declare (type (simple-array fixnum) mat1 mat2))

and

>       (declare (type (simple-array fixnum) mat3))

They aren't very specific, and they force the compiler to generate more
general code.  I'd guess that your C++ and Java compilers both know the
dimensions of the arrays you are giving them.

Try an experiment.  Pick a value for n (let's say 10).  Now perform the
make-array for your result array as a setq:

(setq mat3 (make-array (list 10 10) :element-type 'fixnum :initial-element 0))

Now, what does (type-of mat3) return?

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Kenny Tilton
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <3E9A2AD2.7010904@nyc.rr.com>
Alan Walker wrote:
[ ... ]
> This uses >30K cons cells and as the matrix gets bigger, say 100x100, it's
> dog slow.  Even Java blows it away.  This is embarrassing (well, at least to
> me).  I'd like to think it's something silly that I've overlooked.

I do not know anything about optimizing Lisp, but I did get a lot less 
consing (in ACL6).  With my defaults for development (speed=3, debug=3, 
safety and space = 1) I got:

CELLO(4): (time (test-it 10))
; cpu time (non-gc) 0 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  0 msec user, 0 msec system
; real time  0 msec
; space allocation:
;  10 cons cells, 1,344 other bytes, 0 static bytes
#2A((120 120 120 120 120 120 120 120 120 120)
     (120 120 120 120 120 120 120 120 120 120)
     (120 120 120 120 120 120 120 120 120 120)
     (120 120 120 120 120 120 120 120 120 120)
     (120 120 120 120 120 120 120 120 120 120)
     (120 120 120 120 120 120 120 120 120 120)
     (120 120 120 120 120 120 120 120 120 120)
     (120 120 120 120 120 120 120 120 120 120)
     (120 120 120 120 120 120 120 120 120 120)
     (120 120 120 120 120 120 120 120 120 120))
CELLO(5):

and:

CELLO(6): (time (test-it 100))
; cpu time (non-gc) 321 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  321 msec user, 0 msec system
; real time  331 msec
; space allocation:
;  37 cons cells, 120,200 other bytes, 0 static bytes
#2A((1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     ...)

With your settings (speed 3, rest 0):

cello(9): (time (test-it 100))
; cpu time (non-gc) 310 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  310 msec user, 0 msec system
; real time  310 msec
; space allocation:
;  23 cons cells, 120,456 other bytes, 48 static bytes
#2A((1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     (1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...)
     ...)

You did not mention the Java/C++ timings.

-- 

  kenny tilton
  clinisys, inc
  http://www.tilton-technology.com/
  ---------------------------------------------------------------
"Everything is a cell." -- Alan Kay
From: Joe Marshall
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <adetcdbu.fsf@ccs.neu.edu>
"Alan Walker" <·························@charter.net> writes:

> (proclaim '(optimize (speed 3) (space 0) (compilation-speed 0) (debug 0)
> (safety 0)))

Since you are using ACL, try putting the
 (:explain :calls) and (:explain :types) declarations in.
They will help you find the places where the compiler can't figure out
what's going on.
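
Roughly, a sketch of where they go (MATMUL-EXPLAINED is an illustrative
made-up function; :explain is Allegro-specific, and the report appears in
the compiler's output):

(defun matmul-explained (mat1 mat2 n)
  (declare (:explain :calls :types)      ; Allegro CL only
           (type (simple-array fixnum (* *)) mat1 mat2)
           (fixnum n))
  (let ((sum 0))
    (declare (fixnum sum))
    (dotimes (k n sum)
      (incf sum (the fixnum (* (aref mat1 0 k) (aref mat2 k 0)))))))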
From: Rolf Wester
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <b7eb8b$st2$1@nets3.rz.RWTH-Aachen.DE>
Alan Walker wrote:
[ ... ]
> This uses >30K cons cells and as the matrix gets bigger, say 100x100, it's
> dog slow.  Even Java blows it away.  This is embarrassing (well, at least to
> me).  I'd like to think it's something silly that I've overlooked.
Hi Alan,

I did the same with CMUCL and double-float.  I don't know whether this 
helps, but below are the code and results.  (incf sum ..) is a little 
bit faster than (setf sum (+ sum ..)); dotimes and loop are equally fast. 
Some time ago I tried Java (with a JIT); it was about 2 times slower 
than the C version, which means about as fast as the fastest CMUCL 
version.  I also tried ACL (trial version) on Windows some time ago, but 
it was 5 times slower than the corresponding C version.  Perhaps I didn't 
optimize it well enough.

Rolf

-----------------------------------------------------------------------

(defun mat-mul1 (m1 m2 &optional (m30 nil))
   (declare (optimize (speed 3) (debug 3) (safety 0)))
   (let* ((n1 (array-dimension m1 0))
          (n2 (array-dimension m1 1))
          (m3 (if m30 m30
                  (make-array (list n1 n2) :element-type 'double-float))))
     (declare (type fixnum n1 n2)
              (type (simple-array double-float (* *)) m1 m2 m3))
     (loop for i fixnum from 0 below n1 do
       (loop for j fixnum from 0 below n2 do
         (let ((sum 0.0d0))
           (declare (type double-float sum))
           (loop for k fixnum from 0 below n1 do
             (setf sum (+ sum (* (aref m1 i k) (aref m2 k j)))))
           (setf (aref m3 i j) sum))))
     m3))

(defun mat-mul2 (m1 m2 &optional (m30 nil))
   (declare (optimize (speed 3) (debug 3) (safety 0)))
   (let* ((n1 (array-dimension m1 0))
          (n2 (array-dimension m1 1))
          (m3 (if m30 m30
                  (make-array (list n1 n2) :element-type 'double-float))))
     (declare (type fixnum n1 n2)
              (type (simple-array double-float (* *)) m1 m2 m3))
     (loop for i of-type (unsigned-byte 32) from 0 below n1 do
       (loop for j of-type (unsigned-byte 32) from 0 below n2 do
         (let ((sum 0.0d0))
           (declare (type double-float sum))
           (loop for k of-type (unsigned-byte 32) from 0 below n1 do
             (setf sum (+ sum (* (aref m1 i k) (aref m2 k j)))))
           (setf (aref m3 i j) sum))))
     m3))

(defun mat-mul3 (m1 m2 &optional (m30 nil))
   (declare (optimize (speed 3) (debug 3) (safety 0)))
   (let* ((n1 (array-dimension m1 0))
          (n2 (array-dimension m1 1))
          (m3 (if m30 m30
                  (make-array (list n1 n2) :element-type 'double-float))))
     (declare (type fixnum n1 n2)
              (type (simple-array double-float (* *)) m1 m2 m3))
     (loop for i of-type (unsigned-byte 32) from 0 below n1 do
       (loop for j of-type (unsigned-byte 32) from 0 below n2 do
         (setf (aref m3 i j) 0.0d0)
         (loop for k of-type (unsigned-byte 32) from 0 below n1 do
           (setf (aref m3 i j) (+ (aref m3 i j) (* (aref m1 i k) (aref m2 k j)))))))
     m3))

(setf ext:*bytes-consed-between-gcs* (* 10 ext:*bytes-consed-between-gcs*))

(time
  (let* ((n1 300)
         (n2 300)
         (m1 (make-array (list n1 n2) :element-type 'double-float
                         :initial-element 1.0d0))
         (m2 (make-array (list n1 n2) :element-type 'double-float
                         :initial-element 2.0d0)))
    (let ((m3 (mat-mul3 m1 m2)))
      (print (aref m3 0 0)))
    nil))


;C program without array range check
;mat-mul-c  0.66 sec

;debug 0, safety 0
;mat-mul1 1.26 sec
;mat-mul2 1.26 sec
;mat-mul3 2.3 sec

;debug 0, safety 1
;mat-mul1 2.05 sec
;mat-mul2 2.27 sec
;mat-mul3 3.2 sec

;debug 0, safety 3
;mat-mul1 2.60 sec
;mat-mul2 2.60 sec
;mat-mul3 4.4 sec

;debug 3, safety 3
;mat-mul1 2.60 sec
;mat-mul2 2.60 sec
;mat-mul3 4.4 sec

;debug 3, safety 0
;mat-mul1 1.73 sec
;mat-mul2 1.69 sec
;mat-mul3 2.3 sec
From: Alan Walker
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <v9leaupgc2d5c0@corp.supernews.com>
Rolf,
    I tried your mat-mul1 function with ACL 5.0 for a 300x300 case; it took
considerably longer.....

>
600.0d0
; cpu time (non-gc) 1,077,294 msec (00:17:57.294) user, 0 msec system
; cpu time (gc)     200,874 msec (00:03:20.874) user, 0 msec system
; cpu time (total)  1,278,168 msec (00:21:18.168) user, 0 msec system
; real time  1,278,168 msec (00:21:18.168)
; space allocation:
;  1,262,034,331 cons cells, 27,450,301 symbols, 1,342,041,920 other bytes, 0 static bytes
nil
>

    Ouch!  I'm going to try a trial copy of ACL 6.2 on another machine, and
do a bit more fiddling with it tonight.

Alan.


"Rolf Wester" <······@ilt.fhg.de> wrote in message
·················@nets3.rz.RWTH-Aachen.DE...
> Alan Walker wrote:
> > G'day All,
> >     Judging from the title, this task ought to be a push-over.  We have
some
> > people at work comparing Java & C++, they "discovered" that Java is a
lot
> > slower than C++, one of their tests was a matrix multiply.  So, I
thought to
> > myself, if you want speed, elegance and garbage collection, why not
write it
> > in Lisp?  It's part of my ongoing campaign.....
> >     The problem is, try as I might, I just can't figure out the type
> > declarations to make it run fast.  I'm using ACL 5.0 Personal Edition on
> > Win2K.  I've fiddled with things from Paul Graham's "ANSI Common Lisp",
Ch
> > 13, here's what I've coded so far:
> >
> > (proclaim '(optimize (speed 3) (space 0) (compilation-speed 0) (debug 0)
> > (safety 0)))
> >
> > (defun matmul (mat1 mat2 n)
> >    (declare (type (simple-array fixnum) mat1 mat2))
> >    (declare (fixnum n))
> >    (let ((mat3 (make-array (list n n) :element-type 'fixnum
:initial-element
> > 0))
> >          (sum 0))
> >       (declare (type (simple-array fixnum) mat3))
> >       (declare (fixnum sum))
> >       (dotimes (i n)
> >          (declare (fixnum i))
> >          (dotimes (j n)
> >             (declare (fixnum j))
> >             (setf sum 0)
> >             (dotimes (k n)
> >                (declare (fixnum k))
> >                (incf sum (the fixnum (* (aref mat1 i k) (aref mat2 k
j)))))
> >             (setf (aref mat3 i j) sum)))
> >       mat3))
> >
> > ((defun test-it (size)
> >    (declare (fixnum size))
> >    (let ((a1 (make-array (list size size) :element-type 'fixnum
> > :initial-element 3))
> >          (a2 (make-array (list size size) :element-type 'fixnum
> > :initial-element 4)))
> >       (matmul a1 a2 size)))
> > (time (test-it 10))
> >
> > This uses >30K cons cells and as the matrix gets bigger, say 100x100,
it's
> > dog slow.  Even Java blows it away.  This is embarassing (well, at least
to
> > me).  I'd like to think it's something silly that I've overlooked.
> >
> > Alan.
> >
> >
> >
> Hi Alan,
>
> I made the same with CMUCL and double-float. I don't know whether this
> helps, but below are  the code and results. (incf sum ..) is a little
> bit faster than (setf sum (+ sum ..)), dotimes and loop are equaly fast.
> Some time ago I tried Java (with a JIT). It was about 2 times slower
> compared to the C-version that means about as fast as the fastest CMUCL
> version. I also tried ACL (trial version) on Windows some time ago, but
> it was 5 times slower than the corresponding C-version. Perhaps I didn't
> otptimize it well enough.
>
> Rolf
>
> -----------------------------------------------------------------------
>
> (defun mat-mul1 (m1 m2 &optional (m30 nil))
>    (declare (optimize (speed 3) (debug 3) (safety 0)))
>    (let* ((n1 (array-dimension m1 0))
>           (n2 (array-dimension m1 1))
>           (m3 (if m30 m30
>          (make-array (list n1 n2) :element-type 'double-float))))
> (declare (type fixnum n1 n2) (type (simple-array double-float (* *)) m1
> m2 m3))
> (loop for i fixnum from 0 below n1 do
>   (loop for j fixnum from 0 below n2 do
> (let ((sum 0.0d0))
>   (declare (type double-float sum))
>   (loop for k fixnum from 0 below n1 do
> (setf sum (+ sum (* (aref m1 i k) (aref m2 k j)))))
>   (setf (aref m3 i j) sum))))
> m3))
>
> (defun mat-mul2 (m1 m2 &optional (m30 nil))
>    (declare (optimize (speed 3) (debug 3) (safety 0)))
>    (let* ((n1 (array-dimension m1 0))
> (n2 (array-dimension m1 1))
> (m3 (if m30 m30
>      (make-array (list n1 n2) :element-type 'double-float))))
> (declare (type fixnum n1 n2) (type (simple-array double-float (* *)) m1
> m2 m3))
> (loop for i of-type (unsigned-byte 32) from 0 below n1 do
>   (loop for j of-type (unsigned-byte 32) from 0 below n2 do
> (let ((sum 0.0d0))
>   (declare (type double-float sum))
>   (loop for k of-type (unsigned-byte 32) from 0 below n1 do
> (setf sum (+ sum (* (aref m1 i k) (aref m2 k j)))))
>   (setf (aref m3 i j) sum))))
> m3))
>
> (defun mat-mul3 (m1 m2 &optional (m30 nil))
>    (declare (optimize (speed 3) (debug 3) (safety 0)))
>    (let* ((n1 (array-dimension m1 0))
> (n2 (array-dimension m1 1))
> (m3 (if m30 m30
>      (make-array (list n1 n2) :element-type 'double-float))))
> (declare (type fixnum n1 n2) (type (simple-array double-float (* *)) m1
> m2 m3))
> (loop for i of-type (unsigned-byte 32) from 0 below n1 do
>   (loop for j of-type (unsigned-byte 32) from 0 below n2 do
> (setf (aref m3 i j) 0.0d0)
> (loop for k of-type (unsigned-byte 32) from 0 below n1 do
>   (setf (aref m3 i j) (+ (aref m3 i j) (* (aref m1 i k) (aref m2 k
> j)))))))
> m3))
>
> (setf ext:*bytes-consed-between-gcs* (* 10
ext:*bytes-consed-between-gcs*))
>
> (time
>   (let* ((n1 300)
> (n2 300)
> (m1 (make-array (list n1 n2) :element-type 'double-float
> :initial-element 1.0d0))
> (m2 (make-array (list n1 n2) :element-type 'double-float
> :initial-element 2.0d0)))
>     (let ((m3 (mat-mul3 m1 m2)))
> (print (aref m3 0 0)))
>     nil))
>
>
> ;C-program without array range check
> ;mat-mul-c  0.66 sec
>
> ;debug 0, safety 0
> ;mat-mul-m1 1.26 sec
> ;mat-mul-m2 1.26 sec
> ;mat-mul-m3 2.3 sec
>
> ;debug 0, safety 1
> ;mat-mul-m1 2.05 sec
> ;mat-mul-m2 2.27 sec
> ;mat-mul-m3 3.2 sec
>
> ;debug 0, safety 3
> ;mat-mul-m1 2.60 sec
> ;mat-mul-m2 2.60 sec
> ;mat-mul-m3 4.4 sec
>
> ;debug 3, safety 3
> ;mat-mul-m1 2.60 sec
> ;mat-mul-m2 2.60 sec
> ;mat-mul-m3 4.4 sec
>
> ;debug 3, safety 0
> ;mat-mul-m1 1.73 sec
> ;mat-mul-m2 1.69 sec
> ;mat-mul-m3 2.3 sec
>
>
From: rif
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <wj0ptnphwxm.fsf@five-percent-nation.mit.edu>
I wanted to add that so far I've had the same experience with ACL
compared to CMUCL as others have, on my own problem (not matrix
multiplication).  I have code that I've optimized for speed in CMUCL,
and I tried running it in ACL 6.2 (not the trial edition), and it runs
maybe 50x slower.  Yes, I'm compiling, with full speed and
optimization settings on.

At first I'd thought that the problem was that ACL doesn't inline
user-defined functions, but I redefined the key "must-inline"
functions in my code as macros, and that didn't help much (maybe 10%).
As others reported, I basically found that tons of time was being
spent in GC and in AREF, and that the code was consing maybe 50x as
much as under CMUCL (all numbers totally approximate, from memory).  I
profiled, and looked at the functions that were taking up most of the
time with (:explain :types) and (:explain :boxing), but even for these
functions (which rely heavily on macros I've written), it was
difficult for me to tell what was happening given the time I had to
spend.

I'm busy with a deadlined project for now, and I'll stick to CMUCL til
that's finished, but I'm quite curious about this issue.  For now, I
don't know whether ACL is simply unable to generate code close to as
fast as CMUCL's code for certain kinds of things, or whether I've
simply not yet figured out how to do so (I suspect the latter).

rif
From: Paul F. Dietz
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <e5CcnZ27R6-g1gajXTWcog@dls.net>
rif wrote:

> As others reported, I basically found that tons of time was being
> spent in GC and in aref, and that the code was consing maybe 50x as
> much as CMUCL (all numbers totally approximate from memory).

Try declaring the array to be of type (simple-array fixnum (* *))
rather than (simple-array fixnum).

	Paul
From: rif
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <wj0aderyko7.fsf@five-percent-nation.mit.edu>
"Paul F. Dietz" <·····@dls.net> writes:

> rif wrote:
> 
> > As others reported, I basically found that tons of time was being
> > spent in GC and in aref, and that the code was consing maybe 50x as
> > much as CMUCL (all numbers totally approximate from memory).
> 
> Try declaring the array to be of type (simple-array fixnum (* *))
> rather than (simple-array fixnum).
> 
> 	Paul

Could you follow up on this a little?  I'm assuming that the choice of
fixnum vs., say, double-float, is arbitrary --- I'm using arrays of
double-floats.  But wouldn't

(simple-array double-float (* *))

indicate a two dimensional array?  My code pretty much entirely works
with 1d arrays.

rif
From: Nils Goesche
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <lyd6jn28xh.fsf@cartan.de>
rif <···@mit.edu> writes:

> "Paul F. Dietz" <·····@dls.net> writes:
> 
> > rif wrote:

> > > As others reported, I basically found that tons of time was
> > > being spent in GC and in aref, and that the code was consing
> > > maybe 50x as much as CMUCL (all numbers totally approximate from
> > > memory).

> > Try declaring the array to be of type (simple-array fixnum (* *))
> > rather than (simple-array fixnum).

> Could you follow up on this a little?  I'm assuming that the choice
> of fixnum vs., say, double-float, is arbitrary --- I'm using arrays
> of double-floats.  But wouldn't
> 
> (simple-array double-float (* *))
> 
> indicate a two dimensional array?

Yes.

> My code pretty much entirely works with 1d arrays.

Use

  (simple-array double-float (*))

then (iff you called MAKE-ARRAY with :element-type 'double-float and
your arrays are indeed simple).
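
A sketch of keeping the MAKE-ARRAY call and the declaration in sync (the
names are illustrative):

(let ((v (make-array 1000 :element-type 'double-float
                          :initial-element 0.0d0)))
  (declare (type (simple-array double-float (*)) v))
  (setf (aref v 0) 1.0d0)
  (aref v 0))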

Regards,
-- 
Nils Gösche
"Don't ask for whom the <CTRL-G> tolls."

PGP key ID 0x0655CFA0
From: rif
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <wj0brz7vmjw.fsf@five-percent-nation.mit.edu>
OK, I get it.  So you're suggesting that under ACL, you'd expect a
type declaration of

(simple-array double-float (*))

to be much more effective than

(simple-array double-float)

Can you say why?

rif
From: Nils Goesche
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <ly8yub21iw.fsf@cartan.de>
rif <···@mit.edu> writes:

> OK, I get it.  So you're suggesting that under ACL, you'd expect a
> type declaration of
> 
> (simple-array double-float (*))
> 
> to be much more effective than
> 
> (simple-array double-float)
> 
> Can you say why?

Simply because you are giving more information to the compiler.  In
the latter case the compiler can't be sure as to how many dimensions
the array has.  Don't know about ACL, but in LW this makes a great
difference.

Regards,
-- 
Nils Gösche
"Don't ask for whom the <CTRL-G> tolls."

PGP key ID 0x0655CFA0
From: Steven E. Harris
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <87el431yno.fsf@harris.sdo.us.ray.com>
Nils Goesche <······@cartan.de> writes:

> In the latter case the compiler can't be sure as to how many
> dimensions the array has.

Does substituting simple-vector for simple-array help? Looking at the
HyperSpec, it seems that the OP could have a simple-vector in hand,
but a simple-vector is "able to hold elements of any /type/."¹

Does trying to lock the element type down to double-float preclude use
of simple-vector? clisp doesn't care about whether the element type is
specified when deciding whether a vector is simple:

,----
| [1]> (describe (make-array 10))
| 
| #(NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL) is a simple 1 dimensional
| array (vector), of size 10.
| 
| [2]> (describe (make-array 10 :element-type 'double-float))
| 
| #(NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL) is a simple 1 dimensional
| array (vector), of size 10.
|
| [3]> (simple-vector-p (make-array 10 :element-type 'double-float))
| T
`----

Is that behavior correct?


Footnotes: 
¹ http://www.lispworks.com/reference/HyperSpec/Body/t_smp_ve.htm#simple-vector

-- 
Steven E. Harris        :: ········@raytheon.com
Raytheon                :: http://www.raytheon.com
From: james anderson
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <3E9C44F9.E72C5C50@setf.de>
"Steven E. Harris" wrote:
> 
> Nils Goesche <······@cartan.de> writes:
> 
> > In the latter case the compiler can't be sure as to how many
> > dimensions the array has.
> 
> Does substituting simple-vector for simple-array help? Looking at the
> HyperSpec, it seems that the OP could have a simple-vector in hand,
> but a simple-vector is "able to hold elements of any /type/."¹
> 
> Does trying to lock the element type down to double-float preclude use
> of simple-vector? clisp doesn't care about whether the element type is
> specified when deciding whether a vector is simple:
> 

it can.

? (vector 1 2 3 4)
#(1 2 3 4)
? (type-of *)
(SIMPLE-VECTOR 4)
? (make-array 4 :element-type 'fixnum :initial-element 0)
#(0 0 0 0)
? (type-of *)
(SIMPLE-ARRAY (SIGNED-BYTE 32) (4))
? (svref ** 0)
> Error: value #<VECTOR 4 type (SIGNED-BYTE 32), simple> is not of the expected type SIMPLE-VECTOR.
> While executing: SVREF
> Type Command-. to abort.
See the Restarts… menu item for further choices.
1 > 

which can wreak havoc with optimized code. 

...
From: Nils Goesche
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <ly4r4z1vz7.fsf@cartan.de>
Steven E. Harris <········@raytheon.com> writes:

> Nils Goesche <······@cartan.de> writes:
> 
> > In the latter case the compiler can't be sure as to how many
> > dimensions the array has.
> 
> Does substituting simple-vector for simple-array help? Looking at
> the HyperSpec, it seems that the OP could have a simple-vector in
> hand, but a simple-vector is "able to hold elements of any /type/."¹

That's right: A SIMPLE-VECTOR is /not/ specialized to any element
type.

> Does trying to lock the element type down to double-float preclude
> use of simple-vector?

Yes, but see below.

> clisp doesn't care about whether the element type is specified when
> deciding whether a vector is simple:

It does, actually, as James has shown you.

> ,----
> | [1]> (describe (make-array 10))
> | 
> | #(NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL) is a simple 1 dimensional
> | array (vector), of size 10.
> | 
> | [2]> (describe (make-array 10 :element-type 'double-float))
> | 
> | #(NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL) is a simple 1 dimensional
> | array (vector), of size 10.
> |
> | [3]> (simple-vector-p (make-array 10 :element-type 'double-float))
> | T
> `----
> 
> Is that behavior correct?

Yes, because

[1]> (upgraded-array-element-type 'double-float)
T

in CLISP: CLISP doesn't have any specialized double-float arrays (at
least not in our versions).  In many other Lisps, you'll get NIL in
[3].

Regards,
-- 
Nils Gösche
"Don't ask for whom the <CTRL-G> tolls."

PGP key ID 0x0655CFA0
From: Paul F. Dietz
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <06ednceHW9ryPQGjXTWcqA@dls.net>
rif wrote:

> Could you follow up on this a little?  I'm assuming that the choice of
> fixnum vs., say, double-float, is arbitrary --- I'm using arrays of
> double-floats.  But wouldn't
> 
> (simple-array double-float (* *))
> 
> indicate a two dimensional array?

Oops.  Sorry, I was referring to the person who originated this
thread.  In Allegro CL 6.2, specifying the (* *) is apparently necessary
for AREF to be compiled inline (why, I don't know, since you can
infer the dimension of the array from the number of arguments
to AREF.)

	Paul
From: Duane Rettig
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <4of37mbr2.fsf@beta.franz.com>
"Paul F. Dietz" <·····@dls.net> writes:

> rif wrote:
> 
> > Could you follow up on this a little?  I'm assuming that the choice of
> > fixnum vs., say, double-float, is arbitrary --- I'm using arrays of
> > double-floats.  But wouldn't
> > (simple-array double-float (* *))
> 
> > indicate a two dimensional array?
> 
> 
> Oops.  Sorry, I was refering to the person who originated this
> thread.  In Allegro CL 6.2, specifying the (* *) is apparently necessary
> for AREF to be compiled inline (why, I don't know, since you can
> infer the dimension of the array from the number of arguments
> to AREF.)

Actually, it is the rank that could be inferred, not the dimensions.
Assuming that's what you meant: There are at least two reasons why
we don't infer the rank:

 1. In unsafe code, as is the case with the OP's code, the user already
has one strike against him in that incorrect code is not required to
signal an error.  Thus, if he tries to call aref on a one-dimensional
array with two index arguments (either due to a change in the algorithm,
a typo, or a misplaced paren) and the compiler infers a 2d array,
then the user is likely to get a SEGV when that code is executed
(or worse yet, he might even get an answer ...)

 2. The rank can be inferred, but not the dimensions.  It is interesting
that in this thread there has been talk of the rule that the more specific
the declaration, the better the compiler can do with the code, but as
of yet nobody has suggested holding the dimensions steady and actually
declaring them (or at least all but the last dimension - in this case
declaring the first dimension is enough).  Now of course, the problem
may not be conducive to holding dimensions steady, although that can
usually be worked around anyway.  But the point is that in inferring only
the rank of the array, the compiler is leaving off another potentially
important optimization, and thus cannot produce the fastest code possible.
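
For concreteness, a sketch of what declaring the dimensions themselves
might look like when the problem allows a fixed size (500 is illustrative,
not from the thread):

(defun matmul-500 (mat1 mat2 mat3)
  ;; With every dimension declared constant, the compiler can reduce the
  ;; row-major index arithmetic to shifts and adds on known constants.
  (declare (optimize (speed 3) (safety 0))
           (type (simple-array fixnum (500 500)) mat1 mat2 mat3))
  (dotimes (i 500 mat3)
    (dotimes (j 500)
      (let ((sum 0))
        (declare (fixnum sum))
        (dotimes (k 500)
          (incf sum (the fixnum (* (aref mat1 i k) (aref mat2 k j)))))
        (setf (aref mat3 i j) sum)))))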

Our experience has been that most of our users tend to follow the "Get
it right, then get it fast" model, to varying degrees.  Per reason #1,
not inferring the rank tends to contribute to the "get it right" part
of the model.  I submit that not inferring rank tends to contribute
to the "get it fast" portion as well.  Actually, the whole model includes
not only the simple exhortation, but also the advice to profile the app and
only speed up those portions (likely to be 10% or less) that aren't fast
enough.  So let's look at two scenarios:

 - The compiler infers array rank, but not dimensions.  Since it is
quite a bit faster than inadequately-declared code, it might cross
the threshold of being fast enough, and the user never looks at it
again.

 - The compiler does not infer rank; the code is compiled out-of-line.
If the code is in the 90% which is not performance critical, then all
is well.  But if the code shows up in the profiler, then the user might
look at it, and with a little experience with Allegro CL (and especially
if as a supported customer he contacts us) the whole optimization can
be explored.

Note that I'm all for inferring types, and we are always looking for
better ways to do this, but when it comes to the case where the
inference masks a more specific declaration, and thus doesn't go "all
the way", we'll tend to let the user decide on what is desired.

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: JP Massar
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <3e9c5ab9.140955828@netnews.attbi.com>
On Mon, 14 Apr 2003 19:00:31 -0500, "Paul F. Dietz" <·····@dls.net>
wrote:

>rif wrote:
>
>> As others reported, I basically found that tons of time was being
>> spent in GC and in aref, and that the code was consing maybe 50x as
>> much as CMUCL (all numbers totally approximate from memory).
>
>Try declaring the array to be of type (simple-array fixnum (* *))
>rather than (simple-array fixnum).
>
 
If you do this to Alan's code and disassemble it, then using 6.2 you
will see that there are no longer any calls (to AREF or SETF-AREF) in
the inner loops.

Also, the code as written is timing the allocation of the two
arrays to be multiplied, not just the allocation of the
result array and the multiplications.
From: M H
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <b7gk6q$r9r$07$1@news.t-online.com>
Alan Walker wrote:
> G'day All,
>     Judging from the title, this task ought to be a push-over.  We have some
> people at work comparing Java & C++, they "discovered" that Java is a lot
> slower than C++, one of their tests was a matrix multiply.  So, I thought to
> myself, if you want speed, elegance and garbage collection, why not write it
> in Lisp?  It's part of my ongoing campaign.....

There is a paper "Fast Floating-Point Processing in Common Lisp" by 
Fateman, Broughan, Willock and Rettig on the net which offers some ideas 
(google for it!).

My personal, limited experience with fast numerical processing under CL 
is that the performance model of CL is _really_ hard to understand, and, 
worse, after all the type declarations have been introduced the code 
does not look very readable any more.

A workable way around this is to use CL bindings for 
well-established, fast, reliable FORTRAN libraries.  Matlisp is an 
example of such a binding (http://matlisp.sourceforge.net/; get the 
latest CVS version).  Of course, this does not help if you just want to 
make a point about Java vs. CL.

Matthias
From: Nils Goesche
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <lyhe900yny.fsf@cartan.de>
M H <··@nospam.org> writes:

> Alan Walker wrote:

> > So, I thought to myself, if you want speed, elegance and garbage
> > collection, why not write it in Lisp?

> There is a paper "Fast Floating-Point Processing in Common Lisp" by
> Fateman, Broughan, Willock and Rettig on the net which offers some
> ideas (google for it!).
> 
> My personal, limited experience with fast numerical processing under
> CL is that the performance model of CL is _really_ hard to
> understand, and, worse, after all the type declarations have been
> introduced the code does not look very readable any more.

Only if you introduce too many of them.  You rarely if ever have to
declare so many types.  Sure, as a beginner, when you don't know any
better, putting declarations everywhere is the only way to make code a
little faster.  But once you have understood /when/ declarations make
a difference, and /why/, you realize you really don't have to put them
everywhere to write fast code.  You know which data structures to
choose and how to operate on them and your code will be much faster
even without any declarations, or only a few well-placed ones.  And,
once again, when your program gets bigger, its relative performance
will look better and better.  True, if you write only ten lines of C
code to operate on an array, it will be hard to write Lisp code that
is equally fast; but what is the /point/ in that, anyway?  If you can
write an acceptable solution in ten lines of C code, there is no
/reason/ to use Lisp!

The only fair comparison is to take /large/ programs that do something
useful; because that's where a high level language can help you: When
you can use lots of highly optimized abstract constructs that are
already there, in order to implement clever ways of doing things that
would be too hard (or impossible) in a low level language like C.  The
garbage collector alone can even make your programs faster, if it
prevents you from copying objects all the time, or writing slow and
clumsy code for maintaining reference counters.

Fair and useful comparisons like that are rarely done, however, for
the simple reason that it takes way too much time to write a
non-trivial program in both languages (and you even need at least
/two/ programmers because otherwise the experience your programmer
obtained while writing the first version will influence the second
version he writes).

Anybody who still thinks programs written in C are automatically small
and fast should be forced to use Netscape until he knows better.

Regards,
-- 
Nils Gösche
"Don't ask for whom the <CTRL-G> tolls."

PGP key ID 0x0655CFA0
From: Alan Walker
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <v9pf69fc1mvc24@corp.supernews.com>
Nils,
    Your comment on *large* programs is 110% correct.  It's just that some
of our Java guys recently "discovered" that Java was 5-20 times slower than
C++, using simple benchmarks.  So I thought I'd take a crack at it in Lisp.
    Some of the things I've thrown together in Lisp would be *extremely*
painful in C++, even though I'm far less proficient in Lisp.  I just haven't
(yet) convinced my coworkers of the advantages of Lisp.

Alan.


"Nils Goesche" <······@cartan.de> wrote in message
···················@cartan.de...
> M H <··@nospam.org> writes:
>
> > Alan Walker wrote:
>
> > > So, I thought to myself, if you want speed, elegance and garbage
> > > collection, why not write it in Lisp?
>
> > There is a paper "Fast Floating-Point Processing in Common Lisp" by
> > Fateman, Broughan, Willock and Rettig on the net which offers some
> > ideas (google for it!).
> >
> > My personal, limited experience with fast numerical processing under
> > CL is that the performance model of CL is _really_ hard to
> > understand, and, worse, after all the type declarations have been
> > introduced the code does not look very readable any more.
>
> Only if you introduce too many of them.  You rarely if ever have to
> declare so many types.  Sure, as a beginner, when you don't know any
> better, putting declarations everywhere is the only way to make code a
> little faster.  But once you have understood /when/ declarations make
> a difference, and /why/, you realize you really don't have to put them
> everywhere to write fast code.  You know which data structures to
> choose and how to operate on them and your code will be much faster
> even without any declarations, or only a few well-placed ones.  And,
> once again, when your program gets bigger, its relative performance
> will look better and better.  True, if you write only ten lines of C
> code to operate on an array, it will be hard to write Lisp code that
> is equally fast; but what is the /point/ in that, anyway?  If you can
> write an acceptable solution in ten lines of C code, there is no
> /reason/ to use Lisp!
>
> The only fair comparison is to take /large/ programs that do something
> useful; because that's where a high level language can help you: When
> you can use lots of highly optimized abstract constructs that are
> already there, in order to implement clever ways of doing things that
> would be too hard (or impossible) in a low level language like C.  The
> garbage collector alone can even make your programs faster, if it
> prevents you from copying objects all the time, or writing slow and
> clumsy code for maintaining reference counters.
>
> Fair and useful comparisons like that are rarely done, however, for
> the simple reason that it takes way too much time to write a
> non-trivial program in both languages (and you even need at least
> /two/ programmers because otherwise the experience your programmer
> obtained while writing the first version will influence the second
> version he writes).
>
> Anybody who still thinks programs written in C are automatically small
> and fast should be forced to use Netscape until he knows better.
>
> Regards,
> --
> Nils G�sche
> "Don't ask for whom the <CTRL-G> tolls."
>
> PGP key ID 0x0655CFA0
From: Kenny Tilton
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <3E9CDE25.80501@nyc.rr.com>
Alan Walker wrote:
>     Some of the things I've thrown together in Lisp would be *extremely*
> painful in C++, even though I'm far less proficient in Lisp.  I just haven't
> (yet) convinced my coworkers to see the advantages of Lisp.

The funny thing is that people even need convincing. I think I heard 
twenty years ago some saw about more CPU cycles going into the 
development of most systems than they will burn in production over their 
entire project cycle lifetimes.

Worse, in my experience most systems undertaken never even reach 
production. (No wisecracks.) That is especially so with larger systems, 
and I think in direct proportion to their size.

I have been involved with a third and a fourth attempt by organizations 
to execute large applications. Both failed; unfortunately, one of them 
failed by successfully making it into production and then being such a 
disaster that it contributed to the failure of the business.

I guess the good news is that somehow the craft advances though its 
practitioners do their best not to learn anything, like a moonwalk in 
reverse.

-- 

  kenny tilton
  clinisys, inc
  http://www.tilton-technology.com/
  ---------------------------------------------------------------
"Everything is a cell." -- Alan Kay
From: Tim Bradshaw
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <ey3smsjlj56.fsf@cley.com>
* M H wrote:

> My personal, limited experience with fast numerical processing under
> CL is that the performance model of CL is _really_ hard to understand,
> and, worse, after all the type declarations have been introduced the
> code does not look very readable any more.

I think this isn't really fair.  For numerical stuff, the performance
model almost comes down to `you need to get everything that counts
unboxed'[1].  Once you've done that then it's just the same as any other
language pretty much.  The problem is just that you do need to get
things unboxed, and this can be fairly fiddly, and compiler-dependent.
It's certainly a lot more fiddly than languages like FORTRAN where you
don't have the option of *not* unboxing things.
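
To make that concrete: here is a minimal sketch (the name SAXPY! and
the choice of double-floats are just for illustration) of the kind of
declarations that let a compiler keep everything that counts unboxed
and open-code the array accesses:

(defun saxpy! (alpha x y)
  ;; ALPHA is a raw double-float and X, Y are specialized vectors,
  ;; so AREF can compile to an indexed load with no boxing and no
  ;; full call, and the arithmetic stays in float registers.
  (declare (type double-float alpha)
           (type (simple-array double-float (*)) x y)
           (optimize (speed 3) (safety 0)))
  (dotimes (i (length x) y)
    (setf (aref y i)
          (+ (* alpha (aref x i)) (aref y i)))))

Whether a given compiler actually does all of that is, as I said,
compiler-dependent; but this is pretty much the whole performance
model in one function.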

In particular, I think the `performance model' issues that people
traditionally stress about - some operations not being constant time -
aren't really an issue for numerical code.

Personally, while I think it's interesting that you can get Lisp
systems to produce good numerical code, I think that it's probably
never going to be easy, because getting things unboxed is just never
going to be easy.

One could make it easier by specifying the kind of type inference that
a CL system had to do.  This would (a) be a lot of work (b) probably
leave you with something which wasn't really CL any more, but would be
much closer to a statically-typed language, and (c) not actually cause
anyone to use CL for numerical code, since all those people use
FORTRAN/C anyway.  So it would be a huge amount of work for
approximately zero gain, and thus it's not going to happen any time
soon.  And, in fact, even doing this would not give you reliable good
performance - it would let you predictably know whether the compiler
could infer that something was of some given type, but it wouldn't
mandate that the compiler would actually do anything with that type
information.

So, I think, my take on this is that the performance model is pretty
simple, but that getting a given CL system to perform well on
numerical code is likely to be fairly fiddly.

Against this we have to ask: how much code does this hurt?  I think
the answer is `almost none'.  Typical large systems (say OSs,
databases, blah) contain essentially no numerical code.  A Lisp system
which could produce blindingly fast FP code would probably benefit
these systems almost as little as, well, a C compiler which could do
the same does.  So good numerical code is really not very interesting
to most large systems.

Of course, the static-type people will argue that having standardised
type inference is something which causes your programs to have no
bugs, and even if it doesn't actually do that it will absolve you of
all your sins and cause a small glowing ring to appear over your head
in photographs, so you should do it anyway.  But I hate those people
who wear sandals.

--tim



Footnotes: 
[1]  There are a few other things like `you need enough dimension
     information that array accesses can be compiled really well'.
From: Joseph Oswald
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <e86bd0ec.0304151516.7de9db1@posting.google.com>
Most likely, the answer to multiplying square matrices the fastest
possible way is to use the LAPACK/BLAS routines optimized for your
platform.

For a few platforms, Matlisp (http://matlisp.sourceforge.net) offers a
relatively painless interface to those libraries from Common Lisp.
Otherwise, you could use the foreign function interface.

Seriously, in this area, hacking of Lisp declarations is never going
to win over a platform-tuned BLAS routine. And the Lisp won't be any
more portable than BLAS is.
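
To give the flavour, here is a sketch from memory; I may have the
exported names wrong, so treat it as pseudocode and check the Matlisp
documentation:

;; Assumes Matlisp exports MAKE-REAL-MATRIX and M*, which dispatch
;; to the platform's tuned DGEMM underneath.
(let ((a (matlisp:make-real-matrix 500 500))
      (b (matlisp:make-real-matrix 500 500)))
  (time (matlisp:m* a b)))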
From: rif
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <wj065pevatm.fsf@five-percent-nation.mit.edu>
··············@hotmail.com (Joseph Oswald) writes:

> Most likely, the answer to multiplying square matrices the fastest
> possible way is to use the LAPACK/BLAS routines optimized for your
> platform.
> 
> For a few platforms, Matlisp (http://matlisp.sourceforge.net) offers a
> relatively painless interface to those libraries from Common Lisp.
> Otherwise, you could use the foreign function interface.
> 
> Seriously, in this area, hacking of Lisp declarations is never going
> to win over a platform-tuned BLAS routine. And the Lisp won't be any
> more portable than BLAS is.

I agree that if your actual goal is the fastest possible matrix
multiplications, then just use BLAS or Matlisp and be done with it.
However, matrix multiplication, by virtue of the fact that it is
simple and everyone can understand it, is a great benchmark for
"testing" CL implementations numerically.  If I can make a CL program
that does this about as fast (say, within 30-50% extra work) as the
corresponding naive C program (simple for loops, not BLAS), that tells
me something useful about whether I'm better off writing a much more
complicated, but still mostly numerical, program in CL (for which
there is no BLAS-like solution available).  If the best CL program I
can make in a given implementation is more like 10x or 50x as slow as
a naive C version, that also tells me something useful.

rif
From: Paul F. Dietz
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <-tGdnYeHxaj7xgCjXTWcqg@dls.net>
rif wrote:

>  If the best CL program I can make in a
> given implementation is more like 10x or 50x as slow as a naive C
> version, this also tells me something useful.

If it's 10x or 50x slower, you must be experiencing lots
of boxing and unboxing.  Much of that can be eliminated with
adequate declarations.

	Paul
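
Concretely, "adequate declarations" for the matrix multiply under
discussion might look like this (a sketch, not tested against any
particular implementation; note I've switched the element type to
double-float and declared the arrays' full rank so that AREF can be
open-coded):

(defun matmul (mat1 mat2 n)
  ;; With element type and rank fully declared, a good compiler can
  ;; open-code AREF and keep SUM in an unboxed float register.
  (declare (type (simple-array double-float (* *)) mat1 mat2)
           (type fixnum n)
           (optimize (speed 3) (safety 0)))
  (let ((mat3 (make-array (list n n) :element-type 'double-float
                          :initial-element 0d0)))
    (declare (type (simple-array double-float (* *)) mat3))
    (dotimes (i n mat3)
      (dotimes (j n)
        (let ((sum 0d0))
          (declare (type double-float sum))
          (dotimes (k n)
            (incf sum (* (aref mat1 i k) (aref mat2 k j))))
          (setf (aref mat3 i j) sum))))))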
From: Jeff Caldwell
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <kcmna.883$392.238055@news1.news.adelphia.net>
Sourceforge gives an error when trying to get the source from the link 
provided below.

> ··············@hotmail.com (Joseph Oswald) writes:
> 
> 
>>Most likely, the answer to multiplying square matrices the fastest
>>possible way is to use the LAPACK/BLAS routines optimized for your
>>platform.
>>
>>For a few platforms, Matlisp (http://matlisp.sourceforge.net) offers a
>>relatively painless interface to those libraries from Common Lisp.
>>Otherwise, you could use the foreign function interface.
From: M H
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <b7lo3d$8is$06$1@news.t-online.com>
Jeff Caldwell wrote:
> Sourceforge gives an error when trying to get the source from the link 
> provided below.

For Matlisp you should get the CVS sources anyway.  The tgz-packages on 
sourceforge are outdated, as far as I know.

Matthias

From: Joseph Oswald
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <e86bd0ec.0304171452.64199533@posting.google.com>
rif <···@mit.edu> wrote in message news:<···············@five-percent-nation.mit.edu>...
> ··············@hotmail.com (Joseph Oswald) writes:
> 
> > Most likely, the answer to multiplying square matrices the fastest
> > possible way is to use the LAPACK/BLAS routines optimized for your
> > platform.
> > 

> I agree that if your actual goal is the fastest possible matrix
> multiplications, then just use BLAS or Matlisp and be done with it.
> However, matrix multiplication, by virtue of the fact that it is
> simple and everyone can understand it, is a great benchmark for
> "testing" CL implementations numerically. 

What you say is correct, but I still don't think there is much point
in this sort of testing. It suffers from the "Hello, World!" problem,
which is that Lisp can't really be stripped down to compete with
something that is very close to the hardware/OS. On modern machines,
you can gain or lose great gobs of performance depending on how well
your code & data fit into cache. No "high"-level language (C, Fortran,
or Common Lisp) really lets you get that last bit of performance
without tuning your code to the platform. However, C & Fortran tend to
be luckier, in that you specify your data layout in machine-like
terms.

All of these languages benefit from using BLAS/LAPACK, and Lisp more
than the lower-level ones. If the BLAS/LAPACK is easy to interface to,
you ought to count that in Lisp's favor.

Where Lisp wins is in the higher-level engineering of your system. It
ought to be almost trivial to come up with a Lisp program that does
some higher-level computation (using macros at compile-time to
rephrase the problem using BLAS/LAPACK primitives, for instance) to
totally blow away a C/Fortran implementation in flexibility, and match
it in performance, even if the final algorithm does low-level matrix
stuff as well.

What that would demonstrate is that you can take a very
high-performance numerical kernel as the basis for a kick-ass
solution. That's worth demonstrating. Demonstrating that you can spend
twice as much time in Lisp to get half the performance of Fortran on a
simple problem is not really news. Demonstrating that one can spend a
short amount of time in Lisp to do what would take a year or more in
C/Fortran, without all that extra effort buying significantly more
performance: that would be news to most people.
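
To make the macro idea concrete, here is a toy sketch
(LINEAR-COMBINATION and AXPY! are hypothetical names, standing in for
wrappers around the real BLAS routines via Matlisp or an FFI):

;; Rewrite (linear-combination y (a x1) (b x2) ...) at compile time
;; into a sequence of BLAS-style AXPY operations on Y.
(defmacro linear-combination (y &rest terms)
  (let ((result (gensym "Y")))
    `(let ((,result ,y))
       ,@(loop for (coeff vec) in terms
               collect `(axpy! ,coeff ,vec ,result))
       ,result)))

;; (linear-combination y (2d0 x1) (0.5d0 x2)) expands into two AXPY!
;; calls: the high-level notation costs nothing at run time.

The point is that the notation you program in and the primitives that
actually run can be decoupled entirely at compile time, which is
exactly the trick C/Fortran cannot play.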
From: aasman
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <3e9c8a1f$0$49107$e4fe514c@news.xs4all.nl>
Would it be possible to show us the Java code that you used for this test?
I sure would love to test it on my machine and compare it with my ACL 6.2.

Jans

"Alan Walker" <·························@charter.net> wrote in message
···················@corp.supernews.com...
> G'day All,
>     Judging from the title, this task ought to be a push-over.  We have
some
> people at work comparing Java & C++, they "discovered" that Java is a lot
> slower than C++, one of their tests was a matrix multiply.  So, I thought
to
> myself, if you want speed, elegance and garbage collection, why not write
it
> in Lisp?  It's part of my ongoing campaign.....
>     The problem is, try as I might, I just can't figure out the type
> declarations to make it run fast.  I'm using ACL 5.0 Personal Edition on
> Win2K.  I've fiddled with things from Paul Graham's "ANSI Common Lisp", Ch
> 13, here's what I've coded so far:
>
> (proclaim '(optimize (speed 3) (space 0) (compilation-speed 0) (debug 0)
> (safety 0)))
>
> (defun matmul (mat1 mat2 n)
>    (declare (type (simple-array fixnum) mat1 mat2))
>    (declare (fixnum n))
>    (let ((mat3 (make-array (list n n) :element-type 'fixnum
:initial-element
> 0))
>          (sum 0))
>       (declare (type (simple-array fixnum) mat3))
>       (declare (fixnum sum))
>       (dotimes (i n)
>          (declare (fixnum i))
>          (dotimes (j n)
>             (declare (fixnum j))
>             (setf sum 0)
>             (dotimes (k n)
>                (declare (fixnum k))
>                (incf sum (the fixnum (* (aref mat1 i k) (aref mat2 k
j)))))
>             (setf (aref mat3 i j) sum)))
>       mat3))
>
> ((defun test-it (size)
>    (declare (fixnum size))
>    (let ((a1 (make-array (list size size) :element-type 'fixnum
> :initial-element 3))
>          (a2 (make-array (list size size) :element-type 'fixnum
> :initial-element 4)))
>       (matmul a1 a2 size)))
> (time (test-it 10))
>
> This uses >30K cons cells and as the matrix gets bigger, say 100x100, it's
> dog slow.  Even Java blows it away.  This is embarassing (well, at least
to
> me).  I'd like to think it's something silly that I've overlooked.
>
> Alan.
>
>
>
>
>
From: Alan Walker
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <v9peuv3nq52243@corp.supernews.com>
You'll note that I got the code from someone else; I'm not a Java person.
I've written 3 or 4 programs in the language and don't like it.  Lisp is
enticing.  Maybe trying it for a trivial task like matrix multiplication was
a dumb idea; in the real world, I'd probably do that in C/C++, then use FFI
and Lisp to tie it all together at a high level.

Anyway, here's the code.

=========================================
// $Id: matrix.java,v 1.3 2001/05/27 14:52:57 doug Exp $
// http://www.bagley.org/~doug/shootout/
// modified to use a little less memory by Thomas Holenstein

import java.io.*;
import java.util.*;

public class matrix {
    static int SIZE = 30;

    public static void main(String args[]) {
        int n = Integer.parseInt(args[0]);
        int m1[][] = mkmatrix(SIZE, SIZE);
        int m2[][] = mkmatrix(SIZE, SIZE);
        int mm[][] = new int[SIZE][SIZE];
        for (int i = 0; i < n; i++) {
            mmult(SIZE, SIZE, m1, m2, mm);
        }
        System.out.print(mm[0][0]);
        System.out.print(" ");
        System.out.print(mm[2][3]);
        System.out.print(" ");
        System.out.print(mm[3][2]);
        System.out.print(" ");
        System.out.println(mm[4][4]);
    }

    public static int[][] mkmatrix(int rows, int cols) {
        int count = 1;
        int m[][] = new int[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                m[i][j] = count++;
            }
        }
        return (m);
    }

    public static void mmult(int rows, int cols,
                             int[][] m1, int[][] m2, int[][] m3) {
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                int val = 0;
                for (int k = 0; k < cols; k++) {
                    val += m1[i][k] * m2[k][j];
                }
                m3[i][j] = val;
            }
        }
    }
}
=========================================

"aasman" <······@xs4all.nl> wrote in message
······························@news.xs4all.nl...
> Would it be possible to show us the Java code that you used for this test?
> I sure would love to test it on my machine and compare it with my ACL 6.2.
>
> Jans
From: Tibor Simko
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <8765p791zp.fsf@pcdh91.cern.ch>
On Tue, 15 Apr 2003, Alan Walker wrote:
> Anyway, here's the code.
> // $Id: matrix.java,v 1.3 2001/05/27 14:52:57 doug Exp $
> // http://www.bagley.org/~doug/shootout/

Since the Java code is taken from ``The Great Computer Language
Shootout'', why not look at the CL code and the CMUCL performance
directly there, too?
<http://www.bagley.org/~doug/shootout/bench/matrix/>
From: Siegfried Gonzi
Subject: Re: Multiplying square matrices (faster than Java)
Date: 
Message-ID: <3E9CEBFC.8070108@kfunigraz.ac.at>
Alan Walker wrote:
> G'day All,
>     Judging from the title, this task ought to be a push-over.  We have some
> people at work comparing Java & C++, they "discovered" that Java is a lot
> slower than C++, one of their tests was a matrix multiply.  So, I thought to
> myself, if you want speed, elegance and garbage collection, why not write it
> in Lisp?  It's part of my ongoing campaign.....
>     The problem is, try as I might, I just can't figure out the type
> declarations to make it run fast.  I'm using ACL 5.0 Personal Edition on
> Win2K.  I've fiddled with things from Paul Graham's "ANSI Common Lisp", Ch
> 13, here's what I've coded so far:


Hi:

I am not sure what your threshold actually is when you speak about
performance comparisons, but if you want Lisp-like syntax and speed
alike, you could give Bigloo a try. Bigloo is a very mature Scheme
compiler which can easily be used to produce stand-alone (./a.out)
programs on Linux/Unix; meanwhile a Cygwin port exists. By the way,
you can write your programs in Scheme and Bigloo will compile them for
the JVM (Java back-end).

For example, there were discussions on the OCaml mailing list about
how well Bigloo stacks up against OCaml and C when it comes to
matrix-matrix multiplication. It turned out that OCaml executes the
matrix-matrix multiplication about 1.5 times slower than a C version,
whereas Bigloo executes it about 2.5 times slower. I think those
timings are very good when you take into account that giving "type
declarations" in Bigloo/Scheme is far easier and more elegant than
under Common Lisp. Different strokes for different folks, and this is
only my humble opinion.

Enclosed the matrix-matrix multiplication snippet for Bigloo:

==
(define (mmult rows cols m1 m2)
   (let ((m3 (make-vector rows (make-vector 0 0.0))))
     (do ((i 0 (+fx 1 i)))
	((=fx i rows))
       (let ((m1i (vector-ref m1 i))
	    (row (make-vector cols 0.0)))
	(do ((j 0 (+fx 1 j)))
	    ((=fx j cols))
	  (let ((val 0.0))
	    (do ((k 0 (+fx 1 k)))
		((=fx k cols))
	      (set! val
		    (+fl val (*fl (vector-ref m1i k)
				  (vector-ref (vector-ref m2 k) j)))))
	    (vector-set! row j val)))
	(vector-set! m3 i row)))
     m3))
==

As you can see, giving type annotations in Bigloo/Scheme means
writing, for the operators "+,-,/,*", something like "+fl,-fl,/fl,*fl".
The latter assures Bigloo that your operands (here, the array
elements) are double-precision floats. The nice part is that one can
write a small function or macro which makes the above code valid for
other Scheme compilers as well; the macro is about 5 lines.

But this is not the whole story. Maybe you noticed the thread on
Slashdot about the Coyote Gulch benchmark. This benchmark tests
floating point performance and is based on problems of celestial
mechanics (it calculates the declination and right ascension of the
planets). It turned out that the Java version was about as fast as the
C version, while Fortran came first and was about 2 times as fast as
the C/Java versions. Newer versions of the Java compiler did quite a
good job of improving floating point performance; what remains is the
fact that Java is slow if you have to deal with big arrays and hence
memory issues. This benchmark does not use arrays heavily, and hence
it is not impeded by memory or caches, I would guess.

I was curious how well Bigloo would perform and converted the C code
to Bigloo/Scheme (see comp.lang.scheme for the code). To my surprise
(after the code was tweaked a bit by Manuel Serrano, the head of the
Bigloo project), the Bigloo code is as fast as the C code, and what
surprised me most is that in this case the OCaml version (peer
reviewed by an OCaml chief developer; see the OCaml mailing list for
Oleg's code) is 2 times as slow as the C version.
By the way: the Java class files produced by Bigloo were as fast as
the native Java version (please see comp.lang.scheme).

This means one should not draw conclusions too hastily from the
matrix-matrix multiplication alone.

My experience is that Bigloo is about 2 to 3 times slower than C
(OCaml should do a tad better) when you deal with very large arrays,
but it is as fast as C++ when your problems are sequential. For
example, reading a 50MB file, extracting floating point numbers from
strings and storing them into a big array takes no longer than doing
the same with a C++ version which relies on templates (see
comp.lang.scheme for the thread and my code).

You should also differentiate between cache and memory. Go to the
language shootout page and download the matrix-matrix multiplication
code for C and Python. As it stands, the Python version is about 100
times slower than the C version on small dimensions of, say, 30x30.
But now increase the dimension to 1024x1024 and you will notice that
the performance difference between Python and C drops to 20 times or
so, because at that size both spend most of their time waiting on
memory.

In the meantime I do all my numerical simulations with Bigloo. I even
wrote a binding to the high quality plotting library DISLIN, so now
one can use DISLIN from within Bigloo as comfortably as one would use
it from within Python (okay, as we know, under Linux everything is in
beta; I am guilty). Including C functions under Bigloo is even easier
than under Python; Bigloo and C are interchangeable. I have converted
all my Python functions to Scheme/Bigloo (the translation was nearly
literal) for performance reasons, and never looked back.

Okay, there is the caveat that Bigloo/Scheme relies on extensions for
some things that are standardised in Common Lisp. For example, one of
the additions which I sometimes use is hash tables. This makes my code
not very portable to other Scheme compilers, I have to admit. But on
the other hand, if I used Python instead I would always be tied to
Python and to one specific implementation of it.

However, there is no doubt that Common Lisp, especially the commercial
products, has a lot more to offer than Bigloo, I would think. But from
my small perspective/subset, Bigloo, especially in the performance of
numerical computations, surpasses every Common Lisp compiler. No
problem if you call me a jerk, but I find it way easier to cope with
type declarations under Bigloo than under Common Lisp; programming is
a secondary task for me as a scientist.

I would suggest you try to implement an FFT in Common Lisp and compare
it to C or Java. The matrix-matrix multiplication surely has its right
to exist, but I would bet you are more often confronted (even if you
do not know it) with something like the FFT. I think an FFT comparison
between Common Lisp and C would hold some surprises.
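
For anyone who wants to take up the suggestion, here is a textbook
radix-2 starting point in CL (an untuned sketch that conses freely;
adding the declarations that make it competitive is exactly the
interesting part of the exercise):

(defun fft (x)
  "Radix-2 Cooley-Tukey FFT.  X is a vector of complex numbers whose
length is a power of two; returns a fresh vector."
  (let ((n (length x)))
    (if (<= n 1)
        (copy-seq x)
        (let ((even (make-array (/ n 2)))
              (odd  (make-array (/ n 2))))
          ;; Split into even- and odd-indexed subsequences.
          (dotimes (i (/ n 2))
            (setf (aref even i) (aref x (* 2 i))
                  (aref odd  i) (aref x (1+ (* 2 i)))))
          (let ((fe (fft even))
                (fo (fft odd))
                (out (make-array n)))
            ;; Butterfly step: combine the two half-size transforms.
            (dotimes (k (/ n 2) out)
              (let ((tw (* (exp (complex 0d0 (/ (* -2d0 pi k) n)))
                           (aref fo k))))
                (setf (aref out k) (+ (aref fe k) tw)
                      (aref out (+ k (/ n 2))) (- (aref fe k) tw)))))))))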

S. Gonzi
PS: I think Stalin would surpass Bigloo even without declarations, but
in my humble opinion Stalin is not a project which gets good support;
it was more or less an academic adventure. Bigloo is an academic
adventure too, since M. Serrano is a computer scientist for a living,
but the overall presentation of the project is very good (see the
homepage) and it makes for a professional look. Bigloo is the only
Scheme compiler which I would use in a professional
setting/environment/company.